The First Two Questions Every Data Scientist Must Answer

Category: IT Technology · Published: 4 years ago


Selecting the right model and the composition of the training dataset are constant challenges in every data science project.


Building machine learning applications in the real world is a never-ending process of selecting and refining the right elements of a specific solution. Among those elements, the choice of model and the structure of the training dataset are, arguably, the two most important decisions that data scientists make when architecting deep learning solutions. How do we decide which deep learning model to use for a specific problem? How do we know whether we are using the right training dataset or whether we should gather more data? Those questions recur across all stages of the lifecycle of a deep learning application. Even though there is no magic answer, there are several ideas that can guide your decision-making process. Let’s start with the selection of the correct deep learning model.

- What model should I use?

- How much training data should I gather?

Selecting a Baseline Model

The first thing to figure out when exploring an artificial intelligence (AI) problem is whether it is a deep learning problem at all. Many AI scenarios are perfectly addressable using basic machine learning algorithms. However, if the problem falls into the category of “AI-complete” scenarios such as vision analysis, speech translation, natural language processing, or others of a similar nature, then we need to start thinking about how to select the right deep learning model.

Identifying the correct baseline model for a deep learning problem is a complex task that can be segmented into two main parts:

I) Select the core learning algorithm.

II) Select the optimization algorithm that complements the algorithm selected in step I.

The choice of deep learning algorithm depends heavily on the structure of the training dataset. Again, there is no silver bullet for selecting the right algorithm for a deep learning problem, but the following design guidelines should help in the decision:

a) If the input dataset is based on images or similar topological structures, then the problem can be tackled using convolutional neural networks (CNNs) (see my previous articles about CNNs).

b) If the input is a fixed-size vector, we should be thinking of using a feed-forward network with fully connected layers.

c) If the input is sequential in nature, then we have a problem better suited for recurrent or recursive neural networks.
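Guidelines a) to c) can be summarized as a small lookup. The sketch below is only illustrative: the `InputKind` categories and the `suggest_baseline` helper are hypothetical names invented for this example, not part of any library, and the returned strings simply restate the rules of thumb.

```python
from enum import Enum


class InputKind(Enum):
    """Hypothetical categories describing the shape of the input data."""
    IMAGE_OR_GRID = "image_or_grid"      # images or similar topological structures
    FIXED_SIZE_VECTOR = "fixed_vector"   # fixed-length feature vectors
    SEQUENCE = "sequence"                # text, audio, time series


# Rule-of-thumb mapping restating guidelines a) to c).
BASELINE_BY_INPUT = {
    InputKind.IMAGE_OR_GRID: "convolutional neural network (CNN)",
    InputKind.FIXED_SIZE_VECTOR: "feed-forward network",
    InputKind.SEQUENCE: "recurrent or recursive neural network",
}


def suggest_baseline(kind: InputKind) -> str:
    """Return the baseline model family suggested for a given input kind."""
    return BASELINE_BY_INPUT[kind]


if __name__ == "__main__":
    for kind in InputKind:
        print(f"{kind.name}: {suggest_baseline(kind)}")
```

A lookup like this is a starting point for a baseline, not a final architecture decision.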


Those principles are mostly applicable to supervised deep learning algorithms. However, there are plenty of deep learning scenarios that can benefit from unsupervised deep learning models. In scenarios such as natural language processing or image analysis, using unsupervised learning models can be a useful technique to determine relevant characteristics of the input dataset and structure it accordingly.

In terms of the optimization algorithm, you can rarely go wrong using stochastic gradient descent (SGD). Variations of SGD, such as those using momentum or learning rate decay, are very popular in the deep learning space. Adam is, arguably, the most popular alternative to plain SGD, especially when combined with CNNs.
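To make the difference between these optimizers concrete, here is a minimal sketch of the update rules in plain Python. It is an illustration under simplifying assumptions: the loss is a toy one-dimensional function f(w) = (w - 3)^2, the gradient is exact rather than estimated from mini-batches, and the hyperparameter values and the 1 / (1 + decay * t) schedule are arbitrary choices for the example.

```python
import math


def grad(w: float) -> float:
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w: float, steps: int = 200, lr: float = 0.1,
                 momentum: float = 0.9, decay: float = 0.01) -> float:
    """Gradient descent with momentum and a 1 / (1 + decay * t) learning rate schedule."""
    velocity = 0.0
    for t in range(steps):
        step_size = lr / (1.0 + decay * t)                     # learning rate decay
        velocity = momentum * velocity - step_size * grad(w)   # accumulate past gradients
        w += velocity
    return w


def adam(w: float, steps: int = 200, lr: float = 0.1, beta1: float = 0.9,
         beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Adam: scale each step by running averages of the gradient and its square."""
    m = 0.0  # first-moment (mean) estimate
    v = 0.0  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


if __name__ == "__main__":
    print("SGD with momentum:", sgd_momentum(0.0))
    print("Adam:", adam(0.0))
```

Both routines start from `w = 0.0` and should end close to the minimum at 3. In practice these updates come from a deep learning framework's optimizer classes rather than hand-written loops.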

Now we have an idea of how to select the right deep learning algorithm for a specific scenario. The next step is to validate the structure of the training dataset, which the next section discusses.

Building the Right Training Dataset

Structuring a proper training dataset is essential to building effective deep learning models, but it is particularly hard to get right. Part of the challenge comes from the intrinsic relationship between a model and the corresponding training dataset. If the performance of a model is below expectations, it is often hard to determine whether the causes are related to the model itself or to the composition of the training dataset. While there is no magic formula for creating the perfect training dataset, there are some patterns that can help.

When confronted with a deep learning model with poor performance, data scientists should determine whether the optimization efforts should focus on the model itself or on the training data. In most real-world scenarios, optimizing a model is far cheaper than gathering additional clean data and retraining the algorithms. From that perspective, data scientists should make sure that the model has been properly optimized and regularized before considering collecting additional data.

Typically, the first thing to check when a deep learning algorithm is underperforming is whether it is using the entire training dataset. Data scientists are often surprised to find that models that are not working correctly are using only a fraction of the training data. At that point, a logical step is to increase the capacity of the model (the number of potential hypotheses it can formulate) by adding extra layers and additional hidden units per layer. Another idea to explore in that scenario is to tune the model’s hyperparameters. If none of those ideas work, then it might be time to consider gathering more training data.
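The order of checks described above can be written down as a small decision helper. This is a sketch under stated assumptions, not a definitive procedure: the function `next_step`, its action strings, and the idea of comparing training and validation error against a target are all hypothetical choices for illustration. In particular, reading "not using the training data" as "training error is still above the target" is an interpretation, not something the text prescribes.

```python
GATHER_DATA = "gather more training data"


def next_step(train_error: float, val_error: float, target_error: float,
              tried: frozenset = frozenset()) -> str:
    """Suggest what to try next for an underperforming model (rule of thumb only)."""
    if val_error <= target_error:
        return "stop: validation error meets the target"

    if train_error > target_error:
        # The model does not even fit the data it already has:
        # work on the model before touching the dataset.
        candidates = ("increase model capacity", "tune hyperparameters")
    else:
        # The model fits the training data but does not generalize.
        candidates = ("add regularization", "tune hyperparameters")

    for action in candidates:
        if action not in tried:
            return action

    # Every cheaper model-side option has been tried.
    return GATHER_DATA


if __name__ == "__main__":
    print(next_step(0.30, 0.35, 0.10))
    print(next_step(0.30, 0.35, 0.10, frozenset({"increase model capacity"})))
```

Gathering data is deliberately the last resort here, which mirrors the point that model-side changes are usually cheaper.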


The process of enriching a training dataset can be cost-prohibitive in many scenarios. To mitigate that, data scientists should implement a data wrangling pipeline that continuously labels new records. Semi-supervised learning strategies might also help to incorporate unlabeled records into the training dataset.
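One common semi-supervised strategy is self-training (pseudo-labeling): train on the labeled records, then adopt the model's most confident predictions on unlabeled records as new labels. The sketch below shows one round of it. Everything in it is an assumption for illustration: the one-dimensional features, the toy nearest-centroid "model", the confidence measure, and the 0.8 threshold stand in for whatever model and threshold a real project would use.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, int]  # (feature, label); a 1-D feature keeps the toy simple


def fit_centroids(labeled: Sequence[Example]) -> Tuple[float, float]:
    """Toy model: mean feature value of class 0 and of class 1 (both must be present)."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)


def predict(model: Tuple[float, float], x: float) -> Tuple[int, float]:
    """Return (label, confidence), where confidence is a distance margin in [0, 1]."""
    d0, d1 = abs(x - model[0]), abs(x - model[1])
    label = 0 if d0 <= d1 else 1
    total = d0 + d1
    confidence = abs(d0 - d1) / total if total > 0 else 0.0
    return label, confidence


def pseudo_label(labeled: List[Example], unlabeled: List[float],
                 threshold: float = 0.8) -> Tuple[List[Example], List[float]]:
    """One round of self-training: keep only predictions at or above the threshold."""
    model = fit_centroids(labeled)
    grown, remaining = list(labeled), []
    for x in unlabeled:
        label, confidence = predict(model, x)
        if confidence >= threshold:
            grown.append((x, label))   # adopt the prediction as a label
        else:
            remaining.append(x)        # too uncertain; leave for manual labeling
    return grown, remaining


if __name__ == "__main__":
    seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
    grown, remaining = pseudo_label(seed, [0.7, 9.2, 5.1])
    print(len(grown), remaining)
```

The records left in `remaining` are natural candidates for the manual labeling pipeline, since they are the ones the current model is least sure about.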

The key question in scenarios that require extra training data is always: how much data? Assuming that the composition of the training dataset doesn’t vary drastically with new records, we can estimate the appropriate size of the new training dataset by monitoring how the generalization error changes as the dataset grows. A basic principle to follow in that situation is to increase the training dataset on a logarithmic scale by, for example, doubling the number of instances each time. In some cases, we can improve the training dataset by simply creating variations using noise generation models or regularization techniques such as bagging (read my recent article about bagging).
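The doubling principle can be sketched as a simple loop: train and evaluate at sizes n, 2n, 4n, and so on, and stop once another doubling no longer reduces the generalization error by a meaningful amount. The helper names, the `min_gain` threshold, and the `evaluate` callback (assumed to train on `n` examples and return a held-out error) are hypothetical; the usage example substitutes a synthetic error curve for real training runs.

```python
import math
from typing import Callable, List, Tuple


def doubling_schedule(start: int, maximum: int) -> List[int]:
    """Dataset sizes start, 2*start, 4*start, ... that do not exceed maximum."""
    if start < 1:
        raise ValueError("start must be at least 1")
    sizes = []
    n = start
    while n <= maximum:
        sizes.append(n)
        n *= 2
    return sizes


def find_sufficient_size(evaluate: Callable[[int], float], start: int, maximum: int,
                         min_gain: float = 0.01) -> Tuple[int, List[Tuple[int, float]]]:
    """Double the dataset until the error improves by less than min_gain.

    Returns the last size whose doubling was not worth it (or the largest size
    tried if the error never plateaus), plus the observed (size, error) curve.
    """
    curve: List[Tuple[int, float]] = []
    for n in doubling_schedule(start, maximum):
        error = evaluate(n)
        if curve and curve[-1][1] - error < min_gain:
            previous_size = curve[-1][0]
            curve.append((n, error))
            return previous_size, curve
        curve.append((n, error))
    return curve[-1][0], curve


if __name__ == "__main__":
    # Synthetic stand-in for "train on n examples, return held-out error".
    synthetic_error = lambda n: 1.0 / math.sqrt(n)
    size, curve = find_sufficient_size(synthetic_error, start=100, maximum=100_000)
    print(size)
    for n, err in curve:
        print(n, round(err, 4))
```

Doubling keeps the number of expensive training runs logarithmic in the final dataset size, which is the practical reason for preferring it over fixed-size increments.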

Building machine learning solutions is a constant trial-and-error exercise. Recent techniques such as neural architecture search are helping to address some of the challenges of model selection and dataset size, but they still require a lot of work before they can be widely adopted. For now, selecting the right model and the right training dataset remain among the biggest challenges faced by data scientists when building machine learning solutions in the real world.

