How time can ruin your most precious machine learning model

Category: IT Technology · Published: 4 years ago

You learn a lot in a typical machine learning 101 course: different algorithms, feature engineering, methods like cross-validation for getting reliable performance measures, and how to tune the hyperparameters of your algorithm. However, most of these introductory courses will not tell you about all the things that can go wrong when your data depends on time.

Now, you might think this does not apply to your problem and only matters for the typical forecast of an economic time series. You might be wrong. A large share of the real-world tasks the average data scientist solves today depends on time.

Whenever you come across a machine learning model that underperforms once it is deployed, you should ask whether and how time-dependent effects were taken into account when the model was developed.

Basic example: predicting fuel consumption

Let us start with a simple example. Our task is to predict the fuel consumption of cars during design and development for an R&D department. The customer provided us with data for many different cars, and we discovered that engine capacity, transmission type and fuel type are important features for predicting a car’s consumption.

We validated our XGBoost model using k-fold cross-validation and got a superb mean percentage error (MPE) of about -0.95%. Everyone is excited, and the model is deployed to the users in the R&D department.
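For readers unfamiliar with the metric, here is a minimal sketch of how an MPE could be computed. The function name and the sign convention, (actual - predicted) / actual, are assumptions made for illustration; the article does not say which convention was used. Under this convention a negative MPE means the model over-predicts on average.

```python
def mean_percentage_error(actual, predicted):
    """MPE in percent, using the (actual - predicted) / actual convention.

    Assumption: with this convention a negative value means the model
    over-predicts on average. Some sources flip the sign.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one observation")
    # Signed relative error per observation; fails if an actual value is 0.
    errors = [(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical values: both predictions are 10% above the measured value.
example_mpe = mean_percentage_error([10.0, 20.0], [11.0, 22.0])
```

Because the errors are signed, over- and under-predictions cancel each other out. A small MPE therefore only says that the model is unbiased on the validation data, not that its individual predictions are accurate.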

However, users soon start to complain that the predicted consumption values cannot be true and are unrealistically high.

The fuel consumption data depends on time

It turns out that our data contains training examples spanning more than 10 years. Once we add the time component of the data generating process, it becomes obvious that there has been substantial progress in building fuel-efficient cars over these years. However, our first model could not learn this progress because the corresponding features were missing.

We correct the validation scheme to take time into account and replace the k-fold cross-validation with rolling origin forecast resampling, in which the model is always trained on older data and evaluated on newer data. Using the adjusted validation scheme, we find that the users were right. Our model’s mean percentage error (MPE) is actually 8 times worse than we thought it was. The model is making heavily biased predictions.
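To make the difference from k-fold cross-validation concrete, here is a minimal sketch of rolling origin resampling with an expanding training window. The function name and its parameters are hypothetical, and the sketch assumes the rows are already sorted from oldest to newest.

```python
def rolling_origin_splits(n_samples, initial_train_size, horizon, step=1):
    """Yield (train_indices, test_indices) for time-ordered data.

    Assumes rows are sorted from oldest to newest. The training window
    always starts at index 0 and grows with each split; the test window
    is the `horizon` rows that directly follow the forecast origin.
    """
    if initial_train_size < 1 or horizon < 1 or step < 1:
        raise ValueError("initial_train_size, horizon and step must be >= 1")
    origin = initial_train_size
    while origin + horizon <= n_samples:
        train = list(range(origin))
        test = list(range(origin, origin + horizon))
        yield train, test
        origin += step


# Six time-ordered rows, first origin after three rows, two-row horizon.
splits = list(rolling_origin_splits(6, initial_train_size=3, horizon=2))
```

Unlike k-fold cross-validation, no test row is ever older than a training row in the same split, so the validation mimics how the deployed model is actually used. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window scheme.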

Now that we know that our data depends on time, we can create features that enable the machine learning algorithm to make less biased predictions.
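The following sketch illustrates the idea on synthetic, made-up numbers; they are not the customer data from the story. It compares a time-blind baseline with a model that receives the model year as a feature. A plain linear fit stands in for the real model, which is itself an assumption, since real efficiency trends need not be linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data: consumption (l/100 km) drops by 0.2 per model year.
years = list(range(2010, 2020))
consumption = [9.0 - 0.2 * (year - 2010) for year in years]

# Train on the older cars, evaluate on the two most recent model years.
train_years, train_y = years[:-2], consumption[:-2]
test_years, test_y = years[-2:], consumption[-2:]

# Time-blind baseline: predict the historical average for every car.
baseline = sum(train_y) / len(train_y)
baseline_bias = sum(baseline - y for y in test_y) / len(test_y)

# Time-aware model: the model year is available as a feature.
intercept, slope = fit_line(train_years, train_y)
trend_bias = sum(
    intercept + slope * x - y for x, y in zip(test_years, test_y)
) / len(test_y)
```

On this toy data the baseline over-predicts the newest cars by about one litre per 100 km, while the time-aware fit is essentially unbiased. That is the same pattern the users complained about: predictions that are systematically too high.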

Less obvious examples where time might matter

The fuel consumption example above was quite obvious, and many experienced data scientists would not fall into this trap.

However, some cases are less obvious, and the time dependency is not that easy to spot. Could even image classification depend on time? Sure. Here are two examples:

Automated quality control using a CCD sensor

The training data consists of 45,000 images that were taken during January evenings with the shop floor lit by fluorescent tubes. Once the model is deployed, it will have to operate on images taken on a June morning with the sun shining through a roof window.

Assisting clinicians during radiographic staging

Different typical activities during summer and winter lead to different injuries. How does this influence the data generation and class imbalance? How do the clinician and the model take these imbalances into account, and should they?

Most data generating processes are heavily influenced by humans and their environment. Therefore, they often show trends, cycles and strong daily, weekly and annual patterns. Basically, whenever the data generating process interacts with humans and cannot be executed automatically in an isolated laboratory, chances are high that some of these time-dependent components are present and must be modelled.

How to check if your data depends on time?

Whenever you are working on a machine learning problem and your data does not already include time, ask yourself the following questions:

What process generated the data?
How much time did it take to generate this data?
Was the environment of the data generating process stable during that time?
Was the process (and business) itself stable during that time?
Will the environment and the process (and business) stay stable in the future?
Are the statistical properties of the generated data stable over time?

These abstract questions can be supported by more concrete questions that check the process, environment and data for common time-dependent effects:

What effect could seasons, weather conditions and holidays have? Could this influence an annual pattern?
What effect could light conditions, noise and working hours have? Could this influence a daily or weekly pattern?
What effect could maintenance have on the data? Are there typical maintenance cycles?
What effect could social, scientific and economic progress (or change) have on the data? Have there been any trends or changes in the past or are they expected in the future?
What effect could demographic or organizational change have on your data? Have there been any changes in the past or are they expected in the future?

Once potential time-dependent effects have been identified, you should check whether these effects are present in the data and whether your data actually covers the time spans needed to detect them.
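A first, rough check along these lines could be sketched as follows: group the target (or a feature) by a period label and compare the averages. The helper names are hypothetical and the example values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_period(periods, values):
    """Average `values` within each period label (e.g. year or month)."""
    groups = defaultdict(list)
    for period, value in zip(periods, values):
        groups[period].append(value)
    return {period: mean(vals) for period, vals in sorted(groups.items())}


def drift_range(period_means):
    """Spread between the highest and the lowest period mean."""
    means = list(period_means.values())
    return max(means) - min(means)


# Invented example: two measurements each for two model years.
yearly = mean_by_period([2019, 2019, 2020, 2020], [8.0, 9.0, 7.0, 7.5])
spread = drift_range(yearly)
```

The number of keys in the result also answers the coverage question. If the data only contains a single month, an annual pattern cannot be detected, no matter how strong it is. A large spread is only a hint, not a proof, and should be compared with the variation inside each period.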

What to do if your data depends on time?

If the data depends on time, there are three basic options for handling it.

Exclude time effects by adjusting the environment or process. This is only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example above, this could mean adjusting the shop floor to guarantee stable lighting conditions.
Exclude time effects by pre-processing the data. This is also only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example above, this could mean modifying the images to remove the influence of the lighting conditions.
Include time effects by adjusting the machine learning model and data. Various modelling and feature engineering techniques can be applied to enable the machine learning model to include the time effects. If the output of the model depends on the time effects, a time-aware validation scheme should be used. In the quality control example above, this could mean including various lighting situations in the training data and maybe even generating features that make the model aware of the lighting condition.
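As one illustration of the third option, daily and annual patterns are often exposed to a model through sine and cosine encodings of the timestamp. The article does not name this technique, so treat the sketch below as one common choice among several; the function and feature names are hypothetical.

```python
import math
from datetime import datetime


def cyclical_time_features(timestamp):
    """Encode hour of day and day of year as sine/cosine pairs.

    The pairs place 23:00 next to 01:00 and late December next to
    early January, which a raw hour or day number would not do.
    """
    hour = timestamp.hour + timestamp.minute / 60.0
    day_of_year = timestamp.timetuple().tm_yday
    hour_angle = 2 * math.pi * hour / 24.0
    # 365.25 approximates the average year length including leap years.
    year_angle = 2 * math.pi * day_of_year / 365.25
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "doy_sin": math.sin(year_angle),
        "doy_cos": math.cos(year_angle),
    }


# A June morning, as in the quality control example.
features = cyclical_time_features(datetime(2020, 6, 1, 6, 0))
```

Features like these only help if the training data covers the relevant part of the cycle. A model trained exclusively on January evenings has never seen the feature values of a June morning.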

What experiences with time have you had in your machine learning projects? Would you like to read a story about the various modelling and feature engineering techniques and validation schemes for including these time effects?

Thanks for reading and I’m looking forward to your comments!
