Predictive Early Stopping — A Meta Learning Approach




Ashley Binford

Apr 22 · 6 min read


Author: Dhruv Nair, Data Scientist, Comet.ml

Introduction

Predictive Early Stopping is a state-of-the-art approach for speeding up model training and hyperparameter optimization. Our benchmarking studies have shown that Predictive Early Stopping can speed up model training by up to 30% independent of the underlying infrastructure.

We build on insights gathered from projects such as Learning Curve Extrapolation, Hyperband, and Median Stopping, in order to create a predictive model that can estimate the convergence value of a loss curve.

Comet is able to leverage model data, such as hyperparameters and loss curves, from over two million models in the public section of its platform to create a model whose predictions generalize across hyperparameters and model architectures.

In some cases, we are able to estimate the convergence value hundreds of epochs before convergence actually occurs. In addition to predicting the convergence value, our Predictive Early Stopping product estimates the probability that the current model will outperform the best result seen so far in the current training sweep.


These predictions allow us to terminate the training of underperforming models, so that the search budget is spent evaluating only the most promising candidates.

Benchmarking:

We tested our Predictive Early Stopping method in three different settings:

A hyperparameter search that optimizes the parameters of a function that acts as a surrogate for a neural network.
A hyperparameter search to optimize a 6-layer CNN on CIFAR10 using the SMAC optimizer, with and without Predictive Early Stopping.
A hyperparameter search to optimize the same 6-layer CNN using Random Search with Hyperband vs. Random Search with Predictive Early Stopping.

Results for the Surrogate Function:

In our first test, we set up an exponentially decaying function as our surrogate for a neural network. We ran this surrogate model for 20 steps and determined the optimal values for the parameters of this function using Comet’s Bayesian Optimizer and Predictive Early Stopping. During the hyperparameter search, we observed that suboptimal models were not allowed to train for the full 20 steps. Figure 1 illustrates our stopping mechanism.
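To make the surrogate setup concrete, here is a minimal sketch under stated assumptions. The post only says the surrogate is "exponentially decaying", so the functional form and the parameter names (`amplitude`, `rate`, `floor`) are hypothetical. The convergence estimate uses Aitken's delta-squared formula as a simple stand-in for the learned predictor, which is not described in this post, and the curve is noise-free.

```python
import math

def surrogate_loss(step, amplitude, rate, floor):
    """Exponentially decaying stand-in for a loss curve (assumed form)."""
    return floor + amplitude * math.exp(-rate * step)

def extrapolate_floor(y0, y1, y2):
    """Aitken's delta-squared estimate of the limit of a geometrically
    converging sequence, from three equally spaced observations."""
    denom = y2 - 2 * y1 + y0
    if abs(denom) < 1e-12:
        return y2  # curve is already flat; the last value is the estimate
    return y0 - (y1 - y0) ** 2 / denom

def run_config(params, best_so_far, max_steps=20):
    """Run one surrogate 'model' for up to max_steps steps.

    Stops early once the extrapolated convergence value is worse (higher)
    than the best loss seen so far. Returns (last_loss, steps_used).
    """
    history = []
    for step in range(max_steps):
        history.append(surrogate_loss(step, **params))
        if len(history) >= 3:
            predicted = extrapolate_floor(*history[-3:])
            if predicted > best_so_far:
                return history[-1], step + 1  # terminated early
    return history[-1], max_steps
```

In this sketch, a configuration whose curve is heading for a loss of 0.5 is cut off after three steps when the best loss so far is 0.2, while one heading for 0.1 runs the full 20 steps.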

Results for the CNN model with SMAC:

Our benchmarking test for the CNN model was set up in the following way:

We used the SMAC optimizer to estimate the following hyperparameters in a 6-Layer CNN model. The model hyperparameters and architecture were based on AlexNet.

{
    "learning_rate":{
        "type":"loguniform",
        "value":[0.0000001, 0.01]
    },
    "learning_rate_decay":{
        "type": "uniform",
        "value":[0.000001, 0.001]
    },
    "weight_decay": {
        "type": "loguniform",
        "value": [0.0000005, 0.005]
    }
}
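As an illustration of how such a search space can be sampled, the sketch below draws one random configuration. It assumes that each `"value"` pair is a `[low, high]` bound and that `"loguniform"` means uniform in log space, which is the usual convention. The function name `sample_config` is hypothetical and is not part of SMAC or Comet.

```python
import math
import random

SEARCH_SPACE = {
    "learning_rate": {"type": "loguniform", "value": [0.0000001, 0.01]},
    "learning_rate_decay": {"type": "uniform", "value": [0.000001, 0.001]},
    "weight_decay": {"type": "loguniform", "value": [0.0000005, 0.005]},
}

def sample_config(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    config = {}
    for name, spec in space.items():
        low, high = spec["value"]
        if spec["type"] == "uniform":
            config[name] = rng.uniform(low, high)
        elif spec["type"] == "loguniform":
            # Uniform in log space: every order of magnitude is equally likely.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            raise ValueError(f"unknown distribution type: {spec['type']}")
    return config

config = sample_config(SEARCH_SPACE, random.Random(0))
```

Sampling the learning rate in log space matters here because its range spans five orders of magnitude; a plain uniform draw would almost never try values below 0.0001.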

We ran 8 trials of the optimizer, with and without Predictive Early Stopping. Each optimizer trial was given a 6-hour budget to evaluate as many configurations as possible. Each hyperparameter configuration was allowed to train for a maximum of 100 epochs, and the validation set was evaluated at the end of every epoch. The validation loss was used by our Predictive Early Stopping model to determine whether or not to terminate a hyperparameter configuration.

At the end of all the trials, we computed the mean test loss across trials as a function of the total number of epochs in the hyperparameter sweep required to reach that loss value.

We can see in Figure 2 that using Predictive Early Stopping allows SMAC to reach a comparable loss value almost 300 epochs sooner. This is a 25% decrease in the amount of time spent on hyperparameter optimization.

We also divided the hyperparameter configurations into quantiles based on their final validation loss. We then calculated the average number of epochs that the optimizer spent in each quantile across all trials.

In Figure 3, we see that both SMAC and Predictive-Early-Stopping-SMAC spend roughly the same amount of time evaluating the top 25% and top 50% of configurations. However, Predictive Early Stopping spends 30 fewer epochs training models in the bottom 50% of results, and 20 fewer epochs in the bottom 25%.
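The quantile accounting described above can be sketched as follows. This is an assumed reading of the procedure rather than the code used for the benchmark: the function names are hypothetical, each run is a `(final_validation_loss, epochs_trained)` pair, and the numbers are made up purely for illustration.

```python
def epochs_per_group(runs):
    """Total epochs spent on each group of configurations in one trial.

    runs: list of (final_validation_loss, epochs_trained) pairs.
    """
    ranked = sorted(runs, key=lambda r: r[0])  # lowest (best) loss first
    n = len(ranked)
    quarter, half = n // 4, n // 2
    groups = {
        "top 25%": ranked[:quarter],
        "top 50%": ranked[:half],
        "bottom 50%": ranked[half:],
        "bottom 25%": ranked[n - quarter:],
    }
    return {name: sum(epochs for _, epochs in members)
            for name, members in groups.items()}

def mean_over_trials(trials):
    """Average the per-group epoch totals across optimizer trials."""
    per_trial = [epochs_per_group(runs) for runs in trials]
    return {name: sum(t[name] for t in per_trial) / len(per_trial)
            for name in per_trial[0]}

# Illustrative data only: two trials of four configurations each.
trials = [
    [(0.1, 100), (0.2, 100), (0.5, 30), (0.9, 10)],
    [(0.3, 100), (0.4, 80), (0.6, 20), (0.8, 20)],
]
summary = mean_over_trials(trials)
```

An early stopping method that works as intended should lower the totals for the two bottom groups while leaving the top groups largely unchanged.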

In Figure 4, we see sample loss curves from a hyperparameter sweep with Predictive Early Stopping. Suboptimal configurations are stopped well before the total number of allowed training steps.

Results for the CNN model with Hyperband:

We set up the test for Hyperband in a similar manner to the SMAC test. Specifically, we used the Asynchronous Successive Halving Pruner implemented in Optuna, which can be thought of as Hyperband with a single bracket. We selected 120 hyperparameter configurations at random. Each configuration was allotted a minimum resource of 10 epochs and allowed to train for a maximum of 100 epochs. This leads to a maximum training budget of 12,000 epochs.

At every evaluation point, the number of configurations is reduced by a factor of N by discarding those with the worst validation losses. In our experiments, we evaluated Hyperband with N values of 2, 4, and 8.
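To give a feel for how the reduction factor N shapes the budget, the sketch below computes the schedule of a simplified, synchronous successive-halving run. It is not the Optuna implementation: the asynchronous pruner promotes configurations as results arrive, and the assumption that evaluation points fall at 10, 10·N, 10·N², ... epochs is ours. The function name is hypothetical.

```python
def successive_halving_budget(n_configs, min_resource, max_resource, reduction_factor):
    """Return (rungs, total_epochs) for a synchronous successive-halving schedule.

    rungs is a list of (epoch, configs_reaching_that_epoch) pairs.
    """
    rungs = []
    total_epochs = 0
    alive = n_configs
    trained_to = 0
    resource = min_resource
    while alive > 0 and resource < max_resource:
        rungs.append((resource, alive))
        total_epochs += alive * (resource - trained_to)
        trained_to = resource
        alive //= reduction_factor      # keep only the best 1/N
        resource *= reduction_factor    # next evaluation point
    if alive > 0:
        # Survivors of the last rung train up to the epoch cap.
        rungs.append((max_resource, alive))
        total_epochs += alive * (max_resource - trained_to)
    return rungs, total_epochs

full_budget = 120 * 100  # every configuration trained to the cap
for n in (2, 4, 8):
    rungs, used = successive_halving_budget(120, 10, 100, n)
    print(f"N={n}: rungs={rungs}, epochs used={used} of {full_budget}")
```

Larger values of N prune more aggressively, so fewer configurations survive to the later, more expensive rungs.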

For Predictive Early Stopping, we tested each configuration with different values of an interval parameter: configurations were evaluated every 10, 15, or 20 epochs. At every evaluation, we estimated the probability that the current configuration would outperform the best configuration seen so far. If this probability was below a threshold (90% in our case), we terminated the current configuration. Threshold and interval are both configurable hyperparameters of Predictive Early Stopping.
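The decision rule itself is simple once a probability is available. The sketch below is one possible reading, not Comet's implementation: the post does not say how the probability is computed, so we assume the predictor outputs a Gaussian estimate (mean and standard deviation) of the final validation loss. All function and parameter names are hypothetical.

```python
import math

def prob_better_than_best(pred_mean, pred_std, best_loss):
    """P(final loss < best_loss), assuming final loss ~ Normal(pred_mean, pred_std)."""
    if pred_std <= 0:
        raise ValueError("pred_std must be positive")
    z = (best_loss - pred_mean) / (pred_std * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))

def should_stop(epoch, pred_mean, pred_std, best_loss, interval=10, threshold=0.9):
    """Return True if the current configuration should be terminated.

    The check only runs every `interval` epochs; the run is stopped when its
    probability of beating the best loss so far falls below `threshold`.
    """
    if epoch == 0 or epoch % interval != 0:
        return False
    return prob_better_than_best(pred_mean, pred_std, best_loss) < threshold
```

With a 90% threshold the rule is strict: a run predicted to merely tie the current best (probability 0.5) is terminated, and only runs that are very likely to improve on it are allowed to continue.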

We then determined the best validation loss achieved in the hyperparameter sweep and the number of epochs remaining in the budget after the sweep.

In Figure 5a, we see that all approaches find the same best validation loss. However, Predictive Early Stopping evaluates the configurations using just 15% of the total budget, compared to 25% for Hyperband. This is a saving of 10 percentage points of the total budget.
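Expressed in epochs, and assuming the 12,000-epoch maximum budget from the setup above (120 configurations at up to 100 epochs each), these percentages work out as follows. Integer arithmetic is used to avoid floating-point rounding.

```python
TOTAL_BUDGET = 120 * 100  # 120 configurations x 100 epochs each

pes_epochs = TOTAL_BUDGET * 15 // 100        # Predictive Early Stopping: 15% of budget
hyperband_epochs = TOTAL_BUDGET * 25 // 100  # Hyperband: 25% of budget

saved_epochs = hyperband_epochs - pes_epochs
relative_saving = saved_epochs / hyperband_epochs

print(f"Predictive Early Stopping: {pes_epochs} epochs")
print(f"Hyperband: {hyperband_epochs} epochs")
print(f"Saved: {saved_epochs} epochs ({relative_saving:.0%} fewer than Hyperband)")
```

The difference is 10 points of the total budget, which is 40% fewer epochs relative to what Hyperband used.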

Conclusion

Predictive Early Stopping has clear time, energy, and cost-saving benefits. Wasted compute cycles are bad for the environment and for a researcher’s budget.

The Allen Institute for AI recently published a report on the rising computational costs of training machine learning models, and how these increasingly large energy requirements adversely affect the environment. The paper states that current advancements in state-of-the-art AI research have largely focused on metrics such as accuracy or error, at the cost of being environmentally unfriendly. They call this paradigm Red AI. To counter the prevalence of Red AI research, they propose a shift towards AI research that places an emphasis on computational efficiency: Green AI.

The paper recommends tracking the efficiency of an AI algorithm based on the total number of floating point operations (FPOs) required to generate a result. The total number of FPOs directly correlates with the number of hyperparameter configurations that are evaluated during tuning, as well as the number of training iterations spent on each configuration.
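As a rough way to write this relationship down, using our own notation rather than the paper's: let H be the number of hyperparameter configurations evaluated, T_i the number of training iterations spent on configuration i, and f the number of FPOs per training iteration. We assume f is the same for every configuration, which is reasonable when the architecture is fixed and only optimizer settings are tuned.

```latex
\mathrm{FPO}_{\text{total}} \;\approx\; f \sum_{i=1}^{H} T_i
```

Early stopping leaves H and f unchanged and reduces the sum by shrinking T_i for configurations that are unlikely to beat the current best.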

We hope our efforts with Predictive Early Stopping contribute towards increasing the computational efficiency of the hyperparameter search process. We see this tool as a way to lower some of the monetary barriers associated with AI research, and as a step towards adopting Green AI practices.

In our next post, we will describe applying predictive early stopping in individual runs that do not belong to a larger parameter search.

*Predictive Early Stopping is available as an add-on with the purchase of Comet Teams or Comet Enterprise. Patent protection is being sought on Predictive Early Stopping via one or more pending patent applications.

Learn more and sign up here.
