Top 8 Challenges for Machine Learning Practitioners

Category: IT Technology · Published: 5 years ago

The major challenges one needs to overcome while developing a machine learning application

Jul 13 · 8 min read

Top 8 Challenges for Machine Learning Practitioners — Photo by nappy from Pexels

Many people picture a robot or the Terminator when they hear about Machine Learning (ML) or Artificial Intelligence (AI). However, ML is not something out of the movies, nor is it a futuristic dream. It’s already here. We live in a world full of cutting-edge applications built with machine learning, yet ML practitioners still face certain challenges when taking an application from scratch all the way to production.

What are these challenges? Let’s take a look!

1. Data Collection

Data plays a key role in any use case. Roughly 60% of a data scientist’s work lies in collecting the data. Beginners who want to experiment with machine learning can easily find datasets on Kaggle, the UCI ML Repository, and similar sites.

For real-world scenarios, you need to collect the data yourself, either through web scraping or through APIs (such as Twitter’s). For business problems, you need to obtain the data from clients, which requires ML engineers to coordinate with domain experts.

Once the data is collected, we need to structure it and store it in a database. This requires big data skills, which is where a data engineer plays a major role.
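As a rough illustration of the "structure it and store it" step, here is a minimal sketch using Python's built-in sqlite3 module. The record fields, the table name `posts`, and the helper names `structure` and `store` are all hypothetical; a real project would likely use a larger database and a proper schema.

```python
import sqlite3

# Hypothetical raw records, e.g. as returned by a scraper or an API client.
raw_records = [
    {"author": "alice", "body": "Great product!", "likes": "12"},
    {"author": "bob", "body": "  Not for me  ", "likes": None},
]

def structure(record):
    """Normalize one raw record into an (author, body, likes) row."""
    likes = int(record["likes"]) if record["likes"] is not None else 0
    return (record["author"], record["body"].strip(), likes)

def store(records, db_path=":memory:"):
    """Create the table if needed, insert the rows, and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, body TEXT, likes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)", [structure(r) for r in records]
    )
    conn.commit()
    return conn

conn = store(raw_records)
rows = conn.execute(
    "SELECT author, body, likes FROM posts ORDER BY author"
).fetchall()
print(rows)
```

Here the in-memory database keeps the example self-contained; passing a file path to `store` would persist the data instead.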

2. Insufficient Training Data

Once the data is collected, you need to check whether the quantity is sufficient for the use case (for time-series data, as a rule of thumb, we need a minimum of 3–5 years of data).
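A quick sanity check for the time-series rule of thumb could look like the sketch below. The function name `covers_min_years` and the sample dates are made up for illustration, and a year is approximated as 365 days.

```python
from datetime import date

def covers_min_years(dates, min_years=3):
    """Return True if the observations span at least `min_years` years.

    A year is approximated as 365 days, which is good enough for a rough check.
    """
    span_days = (max(dates) - min(dates)).days
    return span_days >= min_years * 365

# Hypothetical observation dates spanning about three and a half years.
observations = [date(2017, 1, 1), date(2018, 6, 15), date(2020, 7, 13)]

print(covers_min_years(observations))      # True
print(covers_min_years(observations, 5))   # False
```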

The two most important things we do in a machine learning project are selecting a learning algorithm and training the model on some of the acquired data. As humans, we naturally make mistakes, and things may go wrong as a result. Here, the mistake could be choosing the wrong model or choosing bad data. What do I mean by bad data? Let us try to understand.

Imagine your machine learning model is a baby, and you plan on teaching the baby to distinguish between a cat and a dog. We begin by pointing at a cat and saying ‘it’s a CAT’, and do the same thing with a DOG (possibly repeating this procedure a number of times). Now the child will be able to distinguish between a dog and a cat by identifying shapes, colors, or other features. And just like that, the baby becomes a genius (in distinguishing)!

In a similar fashion, we train the model with a lot of data. A child may distinguish the animals after seeing only a few samples, but a machine learning model requires thousands of examples for even simple problems. For complex problems like Image Classification and Speech Recognition, it may require millions of examples.

Therefore, one thing is clear. We need to train a model with sufficient DATA.

3. Non-representative Training Data

For a model to generalize well, the training data should be representative of the new cases it will see; that is, the data we use for training should cover the cases that have occurred and those that are going to occur. A model trained on a non-representative training set is unlikely to make accurate predictions.

From a business point of view, a good machine learning model is one that makes accurate predictions for general cases. Representative training data helps the model perform well even on data it has never seen.

If the number of training samples is low, we get sampling noise, i.e., data that is non-representative by chance. But even a very large sample can be non-representative if the sampling method is flawed; this is called sampling bias.

A famous example of sampling bias occurred during the 1936 US presidential election (Landon against Roosevelt). The Literary Digest conducted a very large poll, sending mail to around ten million people, of whom 2.4 million answered, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes.

The problem lay in the sampling method. To obtain the addresses for the poll, the Literary Digest used magazine subscriber lists, club membership lists, and the like, all of which tended to favor wealthier individuals, who were more likely to vote Republican (hence Landon). Non-response bias also came into the picture, as fewer than 25% of the people polled answered.

To make accurate predictions that do not drift away from reality, the training dataset must be representative.
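The effect of a flawed sampling method can be reproduced with a small simulation. This is only a sketch: the population size, the 30% share of wealthy voters, and the voting probabilities are invented numbers, not historical data.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 30% wealthy voters, 70% others.
# Assumed for illustration: wealthy voters pick candidate A with
# probability 0.70, everyone else with probability 0.35.
population = []
for _ in range(100_000):
    wealthy = random.random() < 0.30
    votes_a = random.random() < (0.70 if wealthy else 0.35)
    population.append((wealthy, votes_a))

def share_a(sample):
    """Fraction of the sample that votes for candidate A."""
    return sum(votes_a for _, votes_a in sample) / len(sample)

true_share = share_a(population)

# Representative sample: every voter is equally likely to be polled.
random_sample = random.sample(population, 2_000)

# Biased sample: only wealthy voters are reachable (think subscriber lists).
wealthy_only = [person for person in population if person[0]]
biased_sample = random.sample(wealthy_only, 2_000)

print(f"true share of A:    {true_share:.3f}")
print(f"random-sample poll: {share_a(random_sample):.3f}")
print(f"biased-sample poll: {share_a(biased_sample):.3f}")
```

Both polls use the same sample size, so the gap between them comes from who was reachable, not from how many people were asked.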

4. Poor Quality of Data

In practice, we don’t start training the model right away; analyzing the data is the most important step. The data we collected might not be ready for training: for instance, some samples may be outliers or have missing values.

In these cases, we can remove the outliers, fill in the missing values using the median or mean (for example, to fill in a missing height), simply remove the attributes or instances with missing values, or train one model with these instances and one without.

We don’t want our system to make false predictions, right? So the quality of the data is very important for getting accurate results. Data preprocessing needs to be done by filtering out missing values and extracting and rearranging what the model needs.
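As a sketch of these cleaning options, the snippet below drops outliers and fills missing heights with the median. The outlier rule (more than 3 median absolute deviations from the median) is one possible choice that I am assuming here, and the data and the name `clean_heights` are made up.

```python
from statistics import median

# Hypothetical height column in cm; None marks a missing value and
# 999.0 stands in for a data-entry error.
heights_cm = [170.0, 165.0, None, 180.0, 175.0, None, 172.0, 999.0]

def clean_heights(values, threshold=3.0):
    """Drop outliers and fill missing values with the median of the inliers.

    A value is treated as an outlier when it lies more than `threshold`
    median absolute deviations (MAD) from the median. This rule assumes
    the MAD is non-zero, i.e. the known values are not all identical.
    """
    known = [v for v in values if v is not None]
    med = median(known)
    mad = median(abs(v - med) for v in known)

    def is_inlier(v):
        return abs(v - med) <= threshold * mad

    fill = median(v for v in known if is_inlier(v))
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(fill)   # impute the missing value
        elif is_inlier(v):
            cleaned.append(v)      # keep the regular value
        # outliers are dropped
    return cleaned

print(clean_heights(heights_cm))
# [170.0, 165.0, 172.0, 180.0, 175.0, 172.0, 172.0]
```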

5. Irrelevant/Unwanted Features

Garbage in, Garbage out

If the training data contains a large number of irrelevant features and not enough relevant ones, the machine learning system will not give the expected results. One of the important requirements for the success of a machine learning project is selecting good features to train the model on, also known as Feature Selection.

Let’s say we are working on a project to predict the number of hours a person needs to exercise, based on the input features that we collected: age, gender, weight, height, and location (i.e., where the person lives).

Among these 5 features, location might not affect the output. It is an irrelevant feature, and we can expect better results without it.
Also, we can combine existing features to produce a more useful one, i.e., Feature Extraction. In our example, we can produce a feature called BMI by combining weight and height, which the new feature then replaces. We can apply transformations to the dataset too.
Creating new features by gathering more data also helps.
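For the exercise example, feature selection and feature extraction might be sketched as follows. The field names (`weight_kg`, `height_cm`, and so on), the sample person, and the helper `to_features` are hypothetical.

```python
def to_features(person):
    """Drop the irrelevant `location` field and replace weight and height
    with a single derived feature, BMI = weight (kg) / height (m) squared."""
    height_m = person["height_cm"] / 100
    bmi = person["weight_kg"] / height_m ** 2
    return {
        "age": person["age"],
        "gender": person["gender"],
        "bmi": round(bmi, 1),
    }

# Hypothetical input record with the five collected features.
person = {
    "age": 30,
    "gender": "F",
    "weight_kg": 60.0,
    "height_cm": 165.0,
    "location": "Berlin",
}

print(to_features(person))  # {'age': 30, 'gender': 'F', 'bmi': 22.0}
```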

6. Overfitting the Training Data

Say you visited a restaurant in a new city. You looked at the menu and found the prices too high. You might be tempted to say that ‘all the restaurants in the city are too costly and not affordable’. Overgeneralizing is something we do very frequently and, shockingly, machines can fall into the same trap. In machine learning, we call it overfitting.

It means the model performs well on the training dataset but does not generalize well to new data.

Let’s say you are implementing an Image Classification model to classify apples, peaches, oranges, and bananas, with 3000, 500, 500, and 500 training samples respectively. If we train the model with these samples, the system is more likely to classify oranges as apples, because apples are heavily over-represented in the training set. This is known as class imbalance; oversampling the smaller classes is one common way to address it.

Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. We can avoid it by:
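One way to make the imbalance in the fruit example concrete is to compute per-class weights that are inversely proportional to class frequency. The sketch below uses the formula total / (n_classes * class_count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`; the function name is my own.

```python
# Training sample counts from the fruit example above.
counts = {"apple": 3000, "peach": 500, "orange": 500, "banana": 500}

def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count), so that
    rare classes count more and frequent classes count less."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(counts)
print(weights)
# {'apple': 0.375, 'peach': 2.25, 'orange': 2.25, 'banana': 2.25}
```

With these weights, a mistake on an orange costs six times as much as a mistake on an apple, which counteracts the pull toward the majority class.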

Gathering more training data.
Selecting a simpler model with fewer parameters or features; for example, a linear model is preferred over a high-degree polynomial model.
Reducing the noise in the training set, for example by fixing data errors and removing outliers.
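The gap between training and validation error is the usual symptom of overfitting. The sketch below shows it on synthetic data, assuming a ground truth of y = 2x + 1 with Gaussian noise. As the overly complex model I use a 1-nearest-neighbor "memorizer" rather than the high-degree polynomial mentioned above, simply because it needs no extra libraries.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples of the assumed ground truth y = 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + random.gauss(0, 2)) for x in xs]

train, valid = make_data(300), make_data(300)

def fit_linear(data):
    """Ordinary least squares with a single feature (the simple model)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memorizer(data):
    """1-nearest-neighbor regression: predicts the label of the closest
    training point, so it reproduces the training noise exactly."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of `model` on `data`."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

linear, memorizer = fit_linear(train), fit_memorizer(train)
for name, model in [("linear", linear), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE = {mse(model, train):6.2f}   "
          f"validation MSE = {mse(model, valid):6.2f}")
```

The memorizer scores a perfect zero on the training set yet does worse than the plain linear model on the validation set, which is exactly the pattern to look for.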

7. Underfitting the Training Data

Underfitting, the opposite of overfitting, generally occurs when the model is too simple to learn the underlying structure of the data. It’s like trying to fit into undersized pants. It generally happens when we have too little information to build an accurate model, or when we try to fit a linear model to non-linear data.

The main options to reduce underfitting are:

Feature Engineering: feeding better features to the learning algorithm.
Reducing the constraints on the model, for example by lowering the regularization strength.
Selecting a more powerful model with more parameters.
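To see how a better feature can fix underfitting, the sketch below fits a straight line to data generated from y = x² (an assumed toy relationship), first on the raw feature x and then on the engineered feature x². The helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Non-linear toy data: y = x^2 for x from -5.0 to 5.0 in steps of 0.5.
xs = [k / 2 for k in range(-10, 11)]
ys = [x ** 2 for x in xs]

# Linear model on the raw feature: too simple for the curve, so it underfits.
raw_mse = mse(xs, ys, *fit_line(xs, ys))

# Same linear model, but fed the engineered feature x^2.
squared = [x ** 2 for x in xs]
engineered_mse = mse(squared, ys, *fit_line(squared, ys))

print(f"raw feature MSE:        {raw_mse:.2f}")
print(f"engineered feature MSE: {engineered_mse:.2f}")
```

The learning algorithm is identical in both runs; only the feature changed, which is why feature engineering is listed first among the remedies.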

8. Offline Learning & Deployment of the model

Machine learning engineering follows these steps while building an application: 1) Data collection 2) Data cleaning 3) Feature engineering 4) Pattern analysis 5) Model training and optimization 6) Deployment.

Oops!! Did I say deployment? Yes, a lot of machine learning practitioners can perform all the other steps but lack deployment skills. Bringing their cool applications into production has become one of the biggest challenges, due to lack of practice, dependency issues, limited understanding of how the underlying models relate to the business, limited understanding of business problems, and unstable models.

Typically, many developers collect data from websites like Kaggle and start training the model. In reality, however, we need to set up a data source that changes dynamically. Offline learning (or batch learning) may not be suitable for this kind of changing data: the system is trained, launched into production, and then runs without learning anymore. Because the data changes dynamically, it may drift away from what the model was trained on.

It is always preferable to build a pipeline to collect, analyze, build/train, test, and validate the data for any machine learning project, and to retrain the model in batches.
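A pipeline like this needs some signal for when to retrain. One very simple option, sketched below, is to compare the mean of a live feature against its training-time mean and flag drift when the difference exceeds a z-score threshold. The threshold of 3, the function name `has_drifted`, and the sample values are all assumptions; production systems typically use more robust statistical tests.

```python
from statistics import mean, stdev

def has_drifted(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live batch mean is far from the training mean.

    The distance is measured in standard errors of the live batch mean,
    estimated from the training standard deviation.
    """
    mu, sigma = mean(train_values), stdev(train_values)
    standard_error = sigma / len(live_values) ** 0.5
    z = abs(mean(live_values) - mu) / standard_error
    return z > z_threshold

# Hypothetical values of one input feature.
train_feature = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
stable_batch = [10, 11, 10, 12, 9]     # looks like the training data
shifted_batch = [15, 16, 14, 15, 17]   # the source has changed

print(has_drifted(train_feature, stable_batch))   # False
print(has_drifted(train_feature, shifted_batch))  # True
```

In a batch-retraining setup, a `True` result would trigger the collect, train, and validate steps again on fresh data.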

References

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (www.oreilly.com)

Basic Statistics | Coursera (www.coursera.org)

Conclusion

A system doesn’t perform well if the training set is too small, or if the data is non-representative, noisy, or polluted with irrelevant features. We went through some of the basic challenges faced by beginners starting out in machine learning.

I’ll be happy to hear suggestions if you have any. I will come back with another intriguing topic very soon. Till then, Stay Home, Stay Safe, and keep exploring!

If you would like to get in touch, connect with me on LinkedIn.
