What fits you as a data scientist?

Category: IT Technology · Published: 4 years ago

Discover your place and get the right direction

Photo by Monty Allen on Unsplash

Data science strives to understand the natural world, which is, by nature, very complicated. But how? By analyzing data, often in very large amounts (so-called big data), to extract knowledge and experience that help us make decisions and solve problems. For a better understanding of what "experience" means in data science and machine learning, please check my introductory article about machine learning and artificial intelligence (link below).

As a data scientist, the first thing to know is the data lifecycle, which consists of the following steps.

Data Collection

Nowadays, collecting data is a relatively easy task. Data collection means gathering data from various sources: web pages, news, social media, reports, graphs, tables, and so on are all sources of raw digital data, ready for anyone interested to consume.

Flows of data (from Giphy)

In this field, a good data scientist develops an inherent curiosity about the world. Being data-driven, they spend a great deal of time collecting data to answer the questions of interest. The required skills are:

  • thinking about what data are needed to solve the problem at hand
  • knowing how to collect data from various sources and how to combine them in a structured way
  • knowing some tools or applications for data collection and ETL (Extract, Transform, and Load)
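
To make the ETL idea concrete, here is a minimal sketch using only the Python standard library. The two inline sources, the field names (`temp_c`, `temperature`), and the `weather` table are hypothetical and chosen purely for illustration; real pipelines usually rely on dedicated tools.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two different sources.
CSV_SOURCE = "city,temp_c\nRome,21.5\nOslo,9.0\n"
JSON_SOURCE = '[{"city": "Lima", "temperature": 18.2}]'


def extract():
    """Read raw records from both sources without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both layouts onto one (city, temp_c) schema with numeric values."""
    records = [(r["city"], float(r["temp_c"])) for r in csv_rows]
    records += [(r["city"], float(r["temperature"])) for r in json_rows]
    return records


def load(records):
    """Store the unified records in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", records)
    conn.commit()
    return conn


conn = load(transform(*extract()))
rows = conn.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
print(rows)
```

Keeping extract, transform, and load as separate functions means a new source only requires a new extraction step and a mapping onto the shared schema.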

Data Cleaning

Once collected, most of the time, raw data are “messy”.

A very messy office with a bunch of documents and raw data to organize (photo by Wonderlane on Unsplash)

Data cleaning is a complex task, which involves:

  • detecting and correcting corrupt or inaccurate data resulting from partial or missing data gathering
  • validating data and estimating missing values, based on what is known about the relevant phenomena and the problem at hand
  • enhancing data via harmonization and normalization
  • transforming data to obtain uniform and comparable values across the dataset
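
As an illustration of two of these steps, the sketch below estimates missing values and normalizes the result. The readings are made up, and median imputation and min-max scaling are just one possible choice among many; the right strategy depends on the problem.

```python
import statistics

# Hypothetical sensor readings; None marks a value that was never recorded.
raw = [12.0, None, 15.0, 11.0, None, 18.0]


def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


filled = impute_median(raw)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

The median is used here instead of the mean because it is less sensitive to outliers, which are common in raw data.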

Exploratory Data Analysis

Exploratory data analysis (EDA) is a collection of techniques for seeing what the data can tell us. In EDA, we use both mathematical models and common sense to make sense of our data and judge how significant they are.

Graphical representation of data (photo by Stephen Dawson on Unsplash)

As data scientists, we must know what to expect from the data we collect, formulate hypotheses, and “fill the gaps” in the information we have.

There are many tools to help us:

  • Descriptive statistics: to represent the data through tables, graphs, summary values, etc.
  • Inferential statistics: to use our collection of data, which is an incomplete representation of reality, to infer and make assumptions about the fundamental characteristics of the phenomena.
  • A deep understanding of the environment, that is to say, the context of the problem we are trying to solve with data science techniques.
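
The difference between the first two tools can be sketched in a few lines. The sample below is invented, and the confidence interval uses a normal approximation for simplicity; for a sample this small, a t-distribution would usually be more appropriate.

```python
import math
import statistics

# Hypothetical sample: daily page views observed over ten days.
sample = [120, 135, 128, 150, 142, 138, 125, 160, 147, 133]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential statistics: approximate 95% confidence interval for the
# mean of the underlying population, using a normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * stdev / math.sqrt(len(sample))
low, high = mean - margin, mean + margin

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"approx. 95% CI for the mean: [{low:.1f}, {high:.1f}]")
```

The descriptive values describe only the ten observed days, while the interval is a statement about the process that generated them.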

It’s worth recalling that, in classical machine learning (ML), this phase of the data lifecycle is up to us, the data scientists. In deep learning (DL), and even more so in reinforcement learning (RL), much of it is left to the model, the machine. In DL, during the training phase, the algorithm learns the characteristics of the data provided and adapts to them. In RL, the environment is, even more, an active part of the learning process.

Model Building

Model building is a fundamental part of the ML process. Once we have created a model, we can train a machine to learn patterns from our data (the training set) in order to predict unknown or future data.

Be creative … it’s time for model building! (photo by Jo Szczepanska on Unsplash)

In model building, we try to predict outcomes based on the preceding analysis.

Again, some skills are essential here:

  • applying the right learning scheme to our data to solve a specific problem (regression, classification, association, clustering, etc.)
  • testing and evaluating the results of model training with well-defined performance metrics
  • combining several techniques and models to get a better result in terms of prediction, robustness of the model, etc. (ensemble modeling)
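
The first two skills can be sketched with a deliberately tiny regression example. The data are synthetic (roughly y = 2x + 1 with noise), and the hand-written least-squares fit stands in for what a library such as scikit-learn would normally do; it is an illustration, not a recommended implementation.

```python
import random

# Hypothetical data: y is roughly 2x + 1 plus a little noise.
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.uniform(-0.5, 0.5) for x in xs]

# Hold out every fourth point as a test set and train on the rest.
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
test = [p for i, p in enumerate(pairs) if i % 4 == 0]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(points, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)
test_mse = mse(test, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test MSE={test_mse:.4f}")
```

Evaluating on points the model never saw during fitting is what makes the metric a fair estimate of how it will behave on new data.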

Model Deployment

When our model is ready and we get good results on the training and test sets (the training and evaluation stage), it’s time to put it into production. This is the final stage, where we get results from data for business, study, research, or maybe just for fun!

Time to get results (photo by NeONBRAND on Unsplash)

We need to know:

  • how to deploy our model on various ready-to-use, state-of-the-art frameworks. In Python, for example, think of tools and libraries such as NumPy, pandas, scikit-learn (ML), TensorFlow, Keras, PyTorch (DL), OpenAI tools for RL, etc.
  • how to bring results into the production environment, for optimization, anomaly detection, automation, prediction, etc.
  • how to summarize results for stakeholders
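
At its simplest, putting a model into production means saving what was learned during training and loading it again wherever predictions are needed. The sketch below assumes a hypothetical linear model whose parameters are stored as JSON; the file name `model.json` and the helper functions are illustrative only, and real frameworks provide their own formats for this.

```python
import json
import os
import tempfile

# Hypothetical parameters of an already trained linear model.
trained = {"slope": 2.0, "intercept": 1.0}


def save_model(params, path):
    """Persist the learned parameters at the end of training."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f)


def load_model(path):
    """Load the parameters in the process that serves predictions."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict(params, x):
    """Apply the linear model to a new input."""
    return params["slope"] * x + params["intercept"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(trained, path)   # training side
    served = load_model(path)   # production side

prediction = predict(served, 3.0)
print(prediction)
```

Separating training from serving in this way lets the production system stay small: it needs the saved parameters and the prediction logic, not the training data.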

So, what’s next?

So far, we have talked about the process. But who is needed at each step? Let’s clarify the concepts and the roles involved.

First of all, let’s summarize the whole process with a picture.

Roles and data pipeline (by the author)

Data scientist

As you can see, a data scientist is expected to do everything from data collection to model deployment. They must be aware of the real problems and know many techniques for every stage of the process. So, the required skills are:

  • a grasp of SQL and other methods of querying datasets
  • a deep understanding of algebra, statistics, and set theory, which underpin useful data modeling techniques
  • knowledge of Python, R, Java, C++, or other languages for data cleaning, data manipulation, EDA, and visualization
  • the ability to select or combine modeling techniques suited to the problem, based on the data and the expected results
  • knowing how to integrate the data pipeline into a production environment, with methods for visualizing and presenting the results in the form of web applications, reports, commands to machines, etc.

Data engineer

A data engineer is more focused on data collection and data cleaning. In this position, we must be expert in databases and data query techniques, as well as skilled in ETL (extract, transform, and load) of data from various sources. We must also know how to clean data, deal with null or inconsistent values, and apply many other techniques to build a strong foundation of sources for the ML models that follow.

Data analyst

A data analyst works hard on data cleaning and EDA. They master statistics, both descriptive and inferential, and always try to squeeze every single bit of information from the data. Their role is crucial for data modeling and for building reliable models that can actually capture the behavior of the environment. In my opinion, this can be the first step on the data science path to knowledge, after which we can explore the other phases of the process.

Machine learning engineer

A machine learning engineer knows how to get the most out of the data using various techniques and ML algorithms. They master ML models, hyperparameter optimization, evaluation, and metrics, and stay at the cutting edge of the latest research in the field. Besides that, they also know how to scale and deploy models into production systems.

