The First Two Questions Every Data Scientist Must Answer

Selecting the right model and the composition of the training dataset are constant challenges in every data science project.

Building machine learning applications in the real world is a never-ending process of selecting and refining the right elements of a specific solution. Among those elements, the selection of the correct model and the right structure of the training dataset are, arguably, the two most important decisions that data scientists need to make when architecting deep learning solutions. Those decisions are the common denominator across all stages of the lifecycle of a deep learning application, and they boil down to two questions:

— What model should I use?

— How much training data should I gather?

Even though there is no magic answer to those questions, there are several ideas that can guide the decision-making process. Let's start with the selection of the correct deep learning model.

Selecting a Baseline Model

The first thing to figure out when exploring an artificial intelligence (AI) problem is whether it is a deep learning problem at all. Many AI scenarios are perfectly addressable using basic machine learning algorithms. However, if the problem falls into the category of "AI-complete" scenarios such as vision analysis, speech translation, natural language processing or others of a similar nature, then we need to start thinking about how to select the right deep learning model.

Identifying the correct baseline model for a deep learning problem is a complex task that can be segmented into two main parts:

I) Select the core learning algorithm.

II) Select the optimization algorithm that complements the algorithm selected in step I.

The choice of deep learning algorithm is closely tied to the structure of the training dataset. Again, there is no silver bullet for selecting the right algorithm for a deep learning problem, but the following design guidelines should help in the decision (a minimal sketch after this list illustrates the mapping):

a) If the input dataset is based on images or similar topological structures, then the problem can be tackled using convolutional neural networks (CNNs) (see my previous articles about CNNs).

b) If the input is a fixed-size vector, we should think about using a feed-forward network with full inter-layer connectivity.

c) If the input is sequential in nature, then we have a problem better suited for recurrent or recursive neural networks.
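
As a rough illustration of this mapping, here is a minimal sketch (PyTorch assumed; the layer sizes and output dimensions are illustrative placeholders, not a prescription):

```python
import torch
import torch.nn as nn

def baseline_model(input_kind: str) -> nn.Module:
    """Map the input type to a baseline architecture, per the guidelines above."""
    if input_kind == "image":
        # (a) topological input -> convolutional network
        return nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(10),
        )
    if input_kind == "fixed_vector":
        # (b) fixed-size vector -> feed-forward network
        return nn.Sequential(
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )
    if input_kind == "sequence":
        # (c) sequential input -> recurrent network
        return nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
    raise ValueError(f"unknown input kind: {input_kind}")

model = baseline_model("image")
print(model)
```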


Those principles are mostly applicable to supervised deep learning algorithms. However, there are plenty of deep learning scenarios that can benefit from unsupervised deep learning models. In scenarios such as natural language processing or image analysis, using unsupervised learning models can be a useful technique to determine relevant characteristics of the input dataset and structure it accordingly.

In terms of the optimization algorithm, you can rarely go wrong using stochastic gradient descent (SGD). Variations of SGD such as those using momentum or learning rate decay are very popular in the deep learning space. Adam is, arguably, the most popular alternative to SGD, especially when combined with CNNs.
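
As a rough illustration of those optimizer choices, here is a minimal sketch (PyTorch assumed; the tiny linear model and the hyperparameter values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)  # stand-in for whatever model was selected above

# SGD with momentum, plus a schedule that decays the learning rate 10x every 10 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Adam as the common drop-in alternative.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```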

Now we have an idea of how to select the right deep learning algorithm for a specific scenario. The next step is to validate the correct structure of the training dataset. We will discuss that in the next part of this article.

Building the Right Training Dataset

Structuring a proper training dataset is essential to building effective deep learning models, but it is a particularly hard problem to solve. Part of the challenge comes from the intrinsic relationship between a model and the corresponding training dataset. If the performance of a model is below expectations, it is often hard to determine whether the causes are related to the model itself or to the composition of the training dataset. While there is no magic formula for creating the perfect training dataset, there are some patterns that can help.

When confronted with a deep learning model with poor performance, data scientists should determine whether the optimization efforts should focus on the model itself or on the training data. In most real-world scenarios, optimizing a model is far cheaper than gathering additional clean data and retraining the algorithms. From that perspective, data scientists should make sure that the model has been properly optimized and regularized before considering collecting additional data.

Typically, the first thing to check when a deep learning algorithm is underperforming is whether it is using the entire training dataset. Very often, data scientists are shocked to find that underperforming models are only using a fraction of the training data. At that point, a logical step is to increase the capacity of the model (the number of potential hypotheses it can formulate) by adding extra layers and additional hidden units per layer. Another idea to explore in that scenario is to optimize the model's hyperparameters. If none of those ideas work, then it might be time to consider gathering more training data.
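
As a rough sketch of what "increasing capacity" can look like in practice (PyTorch assumed; all dimensions are illustrative placeholders):

```python
import torch.nn as nn

def mlp(input_dim: int, hidden: int, depth: int, out_dim: int) -> nn.Module:
    """Build a feed-forward network with `depth` hidden layers of `hidden` units."""
    layers, dim = [], input_dim
    for _ in range(depth):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

# Wider and deeper candidates to evaluate before concluding that the dataset is the problem.
candidates = [mlp(64, hidden, depth, 10) for hidden in (128, 256) for depth in (2, 4)]
```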


The process of enriching a training dataset can be cost prohibitive in many scenarios. To mitigate that, data scientists should implement a data wrangling pipeline that constantly labels new records. Semi-supervised learning strategies might also help incorporate unlabeled records into the training dataset.
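
One simple semi-supervised strategy is pseudo-labeling; the sketch below (PyTorch assumed; the confidence threshold is an illustrative choice) keeps only the model's high-confidence predictions on unlabeled records as provisional labels:

```python
import torch

def pseudo_label(model: torch.nn.Module, unlabeled_loader, threshold: float = 0.95):
    """Return (inputs, predicted labels) for high-confidence unlabeled batches."""
    model.eval()
    kept = []
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = torch.softmax(model(x), dim=1)
            confidence, label = probs.max(dim=1)
            mask = confidence > threshold
            if mask.any():
                kept.append((x[mask], label[mask]))
    return kept
```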

The imperative question in scenarios that require extra training data is always: how much data? Assuming that the composition of the training dataset doesn't drastically vary with new records, we can estimate the appropriate size of the new training dataset by monitoring its correlation with the generalization error. A basic principle to follow in that situation is to grow the training dataset on a logarithmic scale by, for example, doubling the number of instances each time. In some cases, we can improve the training dataset by simply creating variations using noise generation models or regularization techniques such as Bagging (read my recent article about Bagging).
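
A minimal sketch of that doubling strategy, where train_and_eval is a placeholder for whatever training loop the project already uses (the sizes are illustrative):

```python
def train_and_eval(num_examples: int) -> float:
    """Placeholder: train on the first num_examples records, return the validation error."""
    ...  # plug in the project's existing training and evaluation loop here
    return 0.0

sizes = [1_000 * 2 ** i for i in range(5)]  # 1k, 2k, 4k, 8k, 16k examples
errors = {}
for n in sizes:
    errors[n] = train_and_eval(num_examples=n)
    # Stop collecting more data once the error stops improving between doublings.
```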

Building machine learning solutions is a constant trial and error exercise. Recent techniques such as neural architecture search are definitely helping to address some of the challenges of model selection and dataset size but they still require a lot of work to be widely adopted. For now, selecting the right model and the right training dataset remains one of the biggest challenges faced by data scientists when building machine learning solutions in the real world.
