You learn a lot in the typical 101 machine learning course: different algorithms, feature engineering, methods like cross-validation for getting reliable performance measures, and how to tune the hyperparameters of your algorithm. However, most of these introductory courses will not tell you about all the things that can go wrong when your data depends on time.
Now, you might think this does not apply to your problem and only matters for the typical forecast of an economic time series. You might be wrong. A large share of the real-world tasks the average data scientist solves today depend on time.
Whenever you come across machine learning models that underperform once they are deployed, you should ask if and how time-dependent effects were taken into account when the model was developed.
Basic example: predicting fuel consumption
Let us start with a simple example. Our task is to predict the fuel consumption of cars during design & development for the R&D department. The customer provided us with data for a lot of different cars, and we discovered that engine capacity, transmission type and fuel type are important features for predicting a car’s consumption.
We validated our XGBoost model using k-fold cross validation and got a superb mean percentage error (MPE) of about -0.95%. Everyone is excited and the model is deployed to users of the R&D department.
However, soon users start to complain that the predicted consumption values cannot be true and are unrealistically high.
The fuel consumption data depends on time
It turns out that our data contains training examples from more than 10 years. Once we add the time component of this data generating process, it becomes obvious that there has been substantial progress in constructing fuel-efficient cars over these years. However, our first model could not learn this progress because the corresponding features were missing.
We correct the validation scheme to take time into account and replace the k-fold cross-validation with rolling origin forecast resampling, in which the model is always trained on older data and evaluated on newer data. Using the adjusted validation scheme, we find that the users were right. Our model’s mean percentage error (MPE) is actually 8 times worse than we thought it was. The model is making heavily biased predictions.
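To make the idea of rolling origin resampling more concrete, here is a minimal sketch in plain Python. It is not the code used for the model above: the function names, the synthetic consumption values and the naive "predict the training mean" model are all assumptions made for illustration, and the MPE is assumed to be defined as the mean of (actual - predicted) / actual, so that a negative value means the predictions are too high.

```python
def rolling_origin_splits(n_samples, initial_train_size, horizon):
    """Yield (train_indices, test_indices) for data sorted by time.

    Every training set ends where its test set begins, so the model
    never sees observations newer than the ones it is evaluated on.
    """
    origin = initial_train_size
    while origin + horizon <= n_samples:
        train_idx = list(range(origin))
        test_idx = list(range(origin, origin + horizon))
        yield train_idx, test_idx
        origin += horizon  # move the forecast origin forward


def mean_percentage_error(actual, predicted):
    """MPE in percent; negative means predictions are too high (assumed convention)."""
    errors = [(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical, time-ordered consumption values (l/100 km) with a downward trend.
consumption = [10.0 - 0.5 * i for i in range(10)]

fold_mpes = []
for train_idx, test_idx in rolling_origin_splits(len(consumption), 4, 2):
    # Naive stand-in model that ignores time: always predict the training mean.
    train_mean = sum(consumption[i] for i in train_idx) / len(train_idx)
    actual = [consumption[i] for i in test_idx]
    predicted = [train_mean] * len(actual)
    fold_mpes.append(mean_percentage_error(actual, predicted))

print(fold_mpes)
```

Because the stand-in model ignores the downward trend, every fold yields a negative MPE: the predictions are systematically too high, which is the kind of bias a shuffled k-fold split can hide.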
Now that we know that our data depends on time, we can create features that enable the machine learning algorithm to make less biased predictions.
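One simple feature of this kind is a trend feature that tells the model how old each training example is. The sketch below is only an illustration: the column names `registration_date` and `engine_capacity_l`, the reference date and the helper `add_trend_feature` are hypothetical and do not come from the project described above.

```python
from datetime import date


def add_trend_feature(rows, reference_date, date_key="registration_date"):
    """Return copies of the rows with a 'years_since_reference' feature added."""
    enriched = []
    for row in rows:
        delta_days = (row[date_key] - reference_date).days
        enriched.append({**row, "years_since_reference": delta_days / 365.25})
    return enriched


# Hypothetical example rows; only the date differs between the two cars.
cars = [
    {"engine_capacity_l": 2.0, "registration_date": date(2008, 1, 1)},
    {"engine_capacity_l": 2.0, "registration_date": date(2018, 1, 1)},
]
cars_with_trend = add_trend_feature(cars, reference_date=date(2008, 1, 1))
print(cars_with_trend[1]["years_since_reference"])
```

Keep in mind that tree-based models cannot extrapolate a trend beyond the range seen during training, so a feature like this is a starting point rather than a complete fix.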
Less obvious examples where time might matter
The fuel consumption example above was quite obvious and many experienced data scientists would not fall into this trap.
However, sometimes there are cases that are less obvious and it is not that easy to spot the time dependency. Could even image classification depend on time? Sure, two examples:
- Automated quality control using a CCD sensor
The training data consists of 45,000 images that were generated during January evenings with the shop floor lit by fluorescent tubes. Once deployed, the model will have to operate on images taken on a June morning with the sun shining through a roof window.
- Assisting clinicians during radiographic staging
Different typical activities during summer and winter lead to different injuries. How does this influence the data generation and class imbalance? How do the clinician and the model take these imbalances into account, and should they?
Most data generating processes are heavily influenced by humans and their environment. Therefore they often show trends, cycles and strong daily, weekly & annual patterns. Basically, whenever the data generating process interacts with humans and cannot be executed automatically in an isolated laboratory, chances are high that some of these time-dependent components are present and must be modelled.
How to check if data depends on time?
Whenever you are working on a machine learning problem and your data does not already include time, ask yourself the following questions:
- What process generated the data?
- How much time did it take to generate this data?
- Was the environment of the data generating process stable during that time?
- Was the process (and business) itself stable during that time?
- Will the environment and the process (and business) stay stable in the future?
- Are the statistical properties of the generated data stable over time?
These abstract questions can be supported by more concrete questions to check the process, environment and data for common time-dependent effects:
- What effect could seasons, weather conditions and holidays have? Could this influence an annual pattern?
- What effect could light conditions, noise and working hours have? Could this influence a daily or weekly pattern?
- What effect could maintenance have on the data? Are there typical maintenance cycles?
- What effect could social, scientific and economic progress (or change) have on the data? Have there been any trends or changes in the past or are they expected in the future?
- What effect could demographic or organizational change have on your data? Have there been any changes in the past or are they expected in the future?
Once potential time-dependent effects have been identified, you should check whether these effects are present in the data and whether your data actually covers the time spans necessary to detect them.
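A first, rough check is to group the target by time period and compare the group statistics. The following sketch assumes that each record comes with a timestamp; the helper `mean_by_period` and the example values are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_by_period(records, period_of):
    """Average the values of (timestamp, value) pairs per period.

    `period_of` maps a timestamp to a period label, e.g. its year or month.
    """
    groups = defaultdict(list)
    for timestamp, value in records:
        groups[period_of(timestamp)].append(value)
    return {period: mean(values) for period, values in sorted(groups.items())}


# Hypothetical (date, consumption) records.
records = [
    (date(2010, 3, 1), 8.4), (date(2010, 9, 1), 8.6),
    (date(2015, 3, 1), 7.1), (date(2015, 9, 1), 6.9),
    (date(2020, 3, 1), 5.9), (date(2020, 9, 1), 6.1),
]

yearly = mean_by_period(records, lambda d: d.year)    # reveals a trend, if any
monthly = mean_by_period(records, lambda d: d.month)  # reveals an annual pattern, if any
print(yearly)
print(monthly)
```

If the yearly means drift in one direction, a trend is likely. The number of distinct periods in the result also shows how much of the relevant time span the data covers: an annual pattern, for instance, cannot be judged from data that only covers two months of the year.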
What to do if your data depends on time?
If the data depends on time, there are three basic options to handle it.
- Exclude time effects by adjusting the environment or process. This is only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example from above, this could mean adjusting the shop floor situation to guarantee stable lighting conditions.
- Exclude time effects by pre-processing the data. This is also only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example from above, this could mean modifying the images to remove the influence of the lighting conditions.
- Include time effects by adjusting the machine learning model and data. Various modelling & feature engineering techniques can be applied to enable the machine learning model to include the time effects. If the output of the model depends on the time effects, a time-aware validation scheme should be used. In the quality control example from above, this could mean including various lighting situations in the training data and maybe even generating features that make the model aware of the lighting condition.
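For the third option, one common feature engineering technique is to encode the time of day and the time of year as cyclical features. The sketch below is one possible way to do this, under the assumption that each example has a capture timestamp; the function and feature names are hypothetical. In the quality control example, such features could serve as a rough proxy for the lighting situation.

```python
import math
from datetime import datetime


def cyclical_time_features(ts):
    """Encode time of day and time of year as points on a circle.

    Sine/cosine pairs keep 23:59 close to 00:00 and late December
    close to early January, which plain hour or month numbers do not.
    """
    hour_fraction = (ts.hour + ts.minute / 60) / 24
    year_fraction = (ts.timetuple().tm_yday - 1) / 365.25
    return {
        "hour_sin": math.sin(2 * math.pi * hour_fraction),
        "hour_cos": math.cos(2 * math.pi * hour_fraction),
        "year_sin": math.sin(2 * math.pi * year_fraction),
        "year_cos": math.cos(2 * math.pi * year_fraction),
    }


# Hypothetical capture times: a January evening and a June morning.
january_evening = cyclical_time_features(datetime(2020, 1, 15, 19, 0))
june_morning = cyclical_time_features(datetime(2020, 6, 15, 8, 0))
print(january_evening)
print(june_morning)
```

Whether such features help depends on whether the training data covers the relevant hours and seasons at all, which is why they should be combined with a time-aware validation scheme.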
What experiences with time have you had in your machine learning projects? Would you like to read a story about the various modelling & feature engineering techniques and validation schemes for including these time effects?
Thanks for reading and I’m looking forward to your comments!