Top 8 Challenges for Machine Learning Practitioners

Category: IT Technology · Published: 4 years ago

The major challenges one needs to overcome while developing a machine learning application

Photo by nappy from Pexels

Many people picture a robot or the Terminator when they hear about Machine Learning (ML) or Artificial Intelligence (AI). However, these technologies are not something out of the movies, nor are they a futuristic dream. They are already here. We live surrounded by advanced applications built with machine learning. Even so, there are certain challenges an ML practitioner may face while taking an application from scratch to production.

What are these challenges? Let’s take a look!

1. Data Collection

Data plays a key role in any use case. By a rough estimate, about 60% of a data scientist's work lies in collecting data. Beginners who want to experiment with machine learning can easily find data on Kaggle, the UCI ML Repository, and similar sources.

Photo by Pixabay from Pexels

To implement real-world scenarios, you need to collect data through web scraping or through APIs (such as Twitter's). To solve business problems, you need to obtain data from clients (here ML engineers need to coordinate with domain experts to collect the data).

Once the data is collected, we need to structure it and store it in a database. This requires knowledge of big data tooling, which is where a data engineer plays a major role.
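As a rough illustration of the "structure and store" step, here is a minimal sketch using Python's built-in `sqlite3` module. The record fields (`author`, `body`, `likes`), the `posts` table, and the cleaning rules are all assumptions made up for this example; a real project would use whatever schema and database the use case calls for.

```python
import sqlite3

# Hypothetical raw records, e.g. as returned by a scraper or an API client.
raw_records = [
    {"author": "alice", "body": "Great product!", "likes": "12"},
    {"author": "bob", "body": "  Not what I expected ", "likes": None},
]

def clean(record):
    """Normalize one raw record into an (author, body, likes) row."""
    likes = int(record["likes"]) if record["likes"] is not None else 0
    return (record["author"], record["body"].strip(), likes)

def store(records, db_path=":memory:"):
    """Create the table if needed, insert the cleaned rows, return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, body TEXT, likes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO posts (author, body, likes) VALUES (?, ?, ?)",
        [clean(r) for r in records],
    )
    conn.commit()
    return conn

conn = store(raw_records)
rows = conn.execute("SELECT author, body, likes FROM posts ORDER BY author").fetchall()
print(rows)
```

Using an in-memory database keeps the sketch self-contained; passing a file path to `store` would persist the data instead.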

2. Insufficient Training Data

Once the data is collected, you need to check whether the quantity is sufficient for the use case (for time-series data, a rough rule of thumb is a minimum of 3–5 years of data).

The two main things we do in a machine learning project are selecting a learning algorithm and training the model on some of the acquired data. As humans, we naturally make mistakes, so things may go wrong. Here, the mistakes could be choosing the wrong model or selecting bad data. What do I mean by bad data? Let us try to understand.

Photo by Sharon McCutcheon from Pexels

Imagine your machine learning model is a baby, and you plan to teach the baby to distinguish between a cat and a dog. We begin by pointing at a cat and saying 'it's a CAT', and do the same with a DOG (possibly repeating this procedure a number of times). Now the child will be able to distinguish between a dog and a cat by identifying shapes, colors, or other features. And just like that, the baby becomes a genius (at distinguishing)!

In a similar fashion, we train the model with a lot of data. A child may learn to distinguish the animals from only a few samples, but a machine learning model requires thousands of examples even for simple problems. For complex problems like image classification and speech recognition, it may require millions of examples.

Therefore, one thing is clear: we need to train a model with sufficient DATA.

3. Non-representative Training Data

To generalize well, the training data should be representative of the new cases, i.e., the data we use for training should cover the cases that have occurred and those that are going to occur. A model trained on a non-representative training set is unlikely to make accurate predictions.

Systems that make good predictions for the general cases of a business problem are considered good machine learning models. Representative data helps the model perform well even on data it has never seen.

Photo by Rupert Britton from Unsplash

If the number of training samples is too low, we get sampling noise, i.e., data that is non-representative by chance. But even a very large sample can be non-representative if the sampling method is flawed. This is called sampling bias.
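To get a feel for how sampling noise depends on sample size, consider the standard error of a proportion estimated from n independent random samples, which is sqrt(p(1 - p) / n). The sketch below only covers random noise under that independence assumption; it says nothing about sampling bias, which does not shrink as n grows.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent random samples."""
    return math.sqrt(p * (1 - p) / n)

# The noise shrinks with the square root of the sample size.
for n in (100, 1_000, 10_000):
    print(n, round(standard_error(0.5, n), 4))
```

Going from 100 to 10,000 samples cuts the noise by a factor of ten, which is why small datasets are so prone to non-representative results.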

A famous example of sampling bias occurred during the 1936 US presidential election (Landon against Roosevelt). The Literary Digest conducted a very large poll, sending mail to around ten million people, of whom 2.4 million answered, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes.

The problem lay in the sampling method. To obtain addresses for the poll, the Literary Digest used lists of magazine subscribers, club membership lists, and the like, all of which skewed toward wealthier individuals who were more likely to vote Republican (hence Landon). Non-response bias also came into play, as fewer than 25% of the people polled answered.

To make accurate predictions, the training dataset must be representative.
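One common way to keep a sample representative with respect to a known attribute is stratified sampling: draw the same fraction from each subgroup so the sample preserves the population's proportions. The following is a minimal sketch; the `income` attribute, the 20/80 split, and the helper name `stratified_sample` are assumptions chosen purely for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 20% "high" income, 80% "low" income.
population = [
    {"id": i, "income": "high" if i % 5 == 0 else "low"} for i in range(1000)
]
sample = stratified_sample(population, key=lambda r: r["income"], fraction=0.1)
high_share = sum(r["income"] == "high" for r in sample) / len(sample)
print(len(sample), high_share)
```

The sample keeps the 20% share of the "high" group by construction. Stratification only protects against skew in the attributes you stratify on; it cannot fix non-response bias.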

4. Poor Quality of Data

Photo by Jeppe Hove Jensen from Unsplash

In reality, we don't start training the model right away. Analyzing the data is the most important step. The data we collected might not be ready for training: some samples may differ abnormally from the others, for instance by being outliers or by having missing values.

In these cases, we can remove the outliers, fill in the missing feature values using the median or mean (for example, to fill in a missing height), simply remove the attributes or instances with missing values, or train the model both with and without these instances.

We don't want our system to make false predictions, right? So the quality of the data is very important for getting accurate results. Data preprocessing involves handling missing values and extracting and rearranging what the model needs.
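The cleaning steps described in this section can be sketched with Python's `statistics` module. The height values are invented for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention, not the only option.

```python
import statistics

# Hypothetical heights in cm; None marks a missing value, 999 is a data-entry error.
heights = [170, 165, None, 180, 175, 999, 160, None, 172]

def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

filled = impute_median(heights)
cleaned = remove_outliers_iqr(filled)
print(filled)
print(cleaned)
```

The median is used for imputation here because, unlike the mean, it is barely affected by the erroneous 999 entry.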

5. Irrelevant/Unwanted Features

Photo by Gary Chan from Unsplash
Garbage in, Garbage out

If the training data contains a large number of irrelevant features and not enough relevant ones, the machine learning system will not give the expected results. One of the important requirements for the success of a machine learning project is the selection of good features to train the model on, also known as Feature Selection.

Let's say we are working on a project to predict the number of hours a person needs to exercise, based on the input features we collected: age, gender, weight, height, and location (i.e., where he/she lives).

  1. Among these 5 features, the location value might not affect our output. It is an irrelevant feature, and we can expect better results without it.
  2. We can also combine two features to produce a more useful one, i.e., Feature Extraction. In our example, we can produce a feature called BMI by combining weight and height, which can then be dropped. We can apply transformations to the dataset too.
  3. Creating new features by gathering more data also helps.
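The first two steps in the list above can be sketched as follows. The field names and units (`weight_kg`, `height_m`) are assumptions for this example; BMI is computed with the standard formula, weight in kilograms divided by the square of height in meters.

```python
def engineer_features(person):
    """Drop the irrelevant 'location' and replace weight/height with BMI."""
    bmi = person["weight_kg"] / person["height_m"] ** 2
    return {"age": person["age"], "gender": person["gender"], "bmi": round(bmi, 1)}

# Hypothetical input record with the five collected features.
person = {
    "age": 30,
    "gender": "F",
    "weight_kg": 68.0,
    "height_m": 1.70,
    "location": "Pune",
}
features = engineer_features(person)
print(features)
```

Whether BMI is actually more predictive than weight and height separately is something to verify on the real data, for example by comparing validation scores.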

6. Overfitting the Training Data

Photo by Victor Freitas from Pexels

Say you visited a restaurant in a new city. You looked at the menu to order something and found that the prices were too high. You might be tempted to say that 'all the restaurants in this city are too costly and unaffordable'. Overgeneralizing is something we do very frequently, and unfortunately, machines can fall into the same trap. In machine learning, we call it overfitting.

Overfitting

It means the model performs well on the training dataset but does not generalize well to new data.

Let's say you are trying to implement an image classification model to classify apples, peaches, oranges, and bananas, with 3000, 500, 500, and 500 training samples respectively. If we train the model with these samples, the system is more likely to classify oranges as apples, because apples are heavily over-represented in the training set. This is known as class imbalance.
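One common way to counter the skewed class counts in this fruit example is to weight each class inversely to its frequency during training. The sketch below uses the widely used heuristic weight = n_samples / (n_classes × class_count); the helper name is hypothetical, and how the weights are passed to a model depends on the library in use.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the label list."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Label counts taken from the example in the text.
labels = ["apple"] * 3000 + ["peach"] * 500 + ["orange"] * 500 + ["banana"] * 500
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a mistake on a rare class costs six times as much as a mistake on an apple, which discourages the model from defaulting to the majority class.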

Photo by Pixabay from Pexels

Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. We can avoid it by:

  1. Gathering more training data.
  2. Selecting a simpler model with fewer parameters (for example, a linear model rather than a high-degree polynomial model) or with fewer features.
  3. Reducing the noise in the training data, for example by fixing data errors and removing outliers.
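The gap between training and validation error is the usual symptom of overfitting. The sketch below contrasts a simple linear fit with a model that memorizes the training set (a 1-nearest-neighbour regressor). The synthetic data, a line y = 2x with alternating ±1 noise, and the noise-free validation targets are assumptions chosen to keep the example deterministic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(xs, ys):
    """1-nearest-neighbour regressor: returns the target of the closest training point."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of model predictions against the targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic training data: y = 2x plus alternating +1 / -1 noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
# Validation points lie between the training points and follow the noise-free trend.
val_x = [i + 0.4 for i in range(9)]
val_y = [2 * x for x in val_x]

for name, fit in (("linear", fit_line), ("memorizer", fit_memorizer)):
    model = fit(train_x, train_y)
    train_err = mse(model, train_x, train_y)
    val_err = mse(model, val_x, val_y)
    print(name, round(train_err, 3), round(val_err, 3))
```

The memorizer reaches zero training error because it reproduces the noise exactly, yet it does much worse than the plain line on the validation points.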

7. Underfitting the Training data

Underfitting, the opposite of overfitting, generally occurs when the model is too simple to learn the underlying structure of the data. It's like trying to fit into undersized pants. It generally happens when we have too little information to build an accurate model, or when we try to fit a linear model to non-linear data.

Underfitting

The main options to reduce underfitting are:

  1. Feature Engineering: feeding better features to the learning algorithm.
  2. Removing noise from the data.
  3. Selecting a more powerful model with more parameters.
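As a small illustration of the first option, the sketch below fits a straight line to data that follows y = x², which underfits badly, and then feeds the same linear learner an engineered feature x². The data and the single-feature least-squares helper are assumptions made for this example.

```python
def fit_line(features, targets):
    """Ordinary least squares for target = a * feature + b; returns (a, b)."""
    n = len(features)
    mean_f, mean_t = sum(features) / n, sum(targets) / n
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    var = sum((f - mean_f) ** 2 for f in features)
    a = cov / var
    return a, mean_t - a * mean_f

def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Non-linear data: y = x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Underfitting: a straight line in x cannot follow the curve.
a, b = fit_line(xs, ys)
linear_error = mse([a * x + b for x in xs], ys)

# Feature engineering: give the same linear learner the feature x squared.
squared = [x ** 2 for x in xs]
a2, b2 = fit_line(squared, ys)
engineered_error = mse([a2 * f + b2 for f in squared], ys)

print(linear_error, engineered_error)
```

The learning algorithm is identical in both runs; only the feature changed, which is what lets the second fit capture the structure the first one missed.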

8. Offline Learning & Deployment of the model

Photo by Rakicevic Nenad from Pexels

Machine learning engineering follows these steps while building an application: 1) data collection, 2) data cleaning, 3) feature engineering, 4) pattern analysis, 5) model training and optimization, 6) deployment.

Oops! Did I say deployment? Yes. Many machine learning practitioners can perform all the other steps but struggle with deployment. Bringing their applications into production has become one of the biggest challenges, due to lack of practice, dependency issues, a poor understanding of how the underlying models relate to the business problem, and unstable models.

Many developers collect data from websites like Kaggle and start training the model. In reality, however, we need to set up a source of data that changes dynamically. Offline learning, or batch learning, may not suit this kind of changing data: the system is trained, launched into production, and then runs without learning any more. Because the incoming data keeps changing, it may drift away from what the model was trained on.

Online Learning

For any machine learning project, it is generally preferable to build a pipeline to collect, analyze, build/train, test, and validate the data, and to train the model in batches.
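The difference between a model that is trained once and frozen, and one that keeps learning from incoming batches, can be sketched with a one-parameter model y = w · x updated by stochastic gradient descent. The data source `make_batch`, the learning rate, and the drift from slope 3 to slope 5 are all assumptions invented for this example.

```python
def sgd_update(w, batch, lr=0.1):
    """One pass of stochastic gradient descent over a batch for the model y = w * x."""
    for x, y in batch:
        w += lr * (y - w * x) * x
    return w

def make_batch(slope):
    """Hypothetical data source: ten points on the line y = slope * x."""
    return [(i / 10, slope * i / 10) for i in range(1, 11)]

# Offline (batch) learning: train once on historical data, then freeze the model.
offline_w = 0.0
for _ in range(200):
    offline_w = sgd_update(offline_w, make_batch(3.0))

# Online learning: keep updating as new batches arrive after the data has drifted.
online_w = offline_w
for _ in range(200):
    online_w = sgd_update(online_w, make_batch(5.0))

print(round(offline_w, 3), round(online_w, 3))
```

The frozen model still reflects the old relationship, while the incrementally updated one tracks the new slope. In production, the same idea is usually wrapped in a pipeline that also validates each new batch before training on it.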

Conclusion

A system won't perform well if the training set is too small, or if the data is non-representative, noisy, or polluted with irrelevant features. We went through some of the basic challenges faced by beginners getting started with machine learning.

I’ll be happy to hear suggestions if you have any. I will come back with another intriguing topic very soon. Till then, Stay Home, Stay Safe, and keep exploring!

If you would like to get in touch, connect with me on LinkedIn.

