GridSearch: the ultimate Machine Learning Tool


Photo by Chris Liverani on Unsplash

Machine Learning in short

The goal of supervised Machine Learning is to build a prediction function based on historical data. This data has independent (explanatory) variables and a target variable (the variable that you want to predict).

Once a predictive model has been built, we measure its error on a separate testing data set. We do this using metrics that quantify the model's error, for example the Mean Squared Error in a regression context (quantitative target variable) or the Accuracy in a classification context (categorical target variable).

The model with the smallest error is generally selected as the best model. Then we use this model to predict the values of the target variable by inputting the explanatory variables.

In this article, I will deep-dive into GridSearch.

Machine Learning’s Two Types of Optimization

GridSearch is a tool that is used for hyperparameter tuning. As stated before, Machine Learning in practice comes down to comparing different models to each other and trying to find the best working model.

Apart from selecting the right data set, there are generally two aspects of optimizing a predictive model:

  1. Optimize the choice of the best model
  2. Optimize a model’s fit using hyperparameter tuning

Let’s now look into both of these to see why GridSearch is needed.

Part 1. Optimize the choice of the best model

In some datasets, there may exist a simple linear relationship that can predict a target variable from the explanatory variables. In other datasets, these relationships may be more complex or highly nonlinear.

At the same time, many models exist, ranging from simple ones like Linear Regression up to very complex ones like Deep Neural Networks.

It is key to use a model that is appropriate for our data.

For example, if we use a Linear Regression on a very complex task, the model will not perform well. But if we use a Deep Neural Network on a very simple task, it will not perform well either!

To find a well-fitting Machine Learning model, the solution is to split data into train and test data, then fit many models on the training data and test each of them on the test data. The model that has the smallest error on the test data will be kept.


A screenshot from Scikit Learn’s list of supervised models shows that there are a lot of models to try out!
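The train/test comparison described above can be sketched in a few lines of scikit-learn. The dataset and the two candidate models here are illustrative choices, not taken from the article:

```python
# Sketch: fit several candidate models on the training data and keep
# the one with the smallest error on the held-out test data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# A synthetic regression dataset stands in for "historical data"
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_squared_error(y_test, model.predict(X_test))

# The model with the smallest test error is kept
best = min(errors, key=errors.get)
print(best, errors)
```

The same loop extends naturally to any number of candidate models.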

Part 2. Optimize a model’s fit using hyperparameter tuning

After choosing one well-performing model (or a few), the second thing to optimize is the hyperparameters of a model. Hyperparameters are like a configuration of the training phase of the model. They influence what a model can or cannot learn.

Tuning hyperparameters can, therefore, lower the error on the test data set even more.

The way of estimating differs from model to model, so each model has its own hyperparameters to optimize.


This extract of the documentation of Scikit Learn’s RandomForestClassifier shows numerous parameters that can all influence the final accuracy of your model.
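To see how much hyperparameters matter, here is a small sketch fitting the same RandomForestClassifier with two different settings. The dataset and the specific values are illustrative, not from the article:

```python
# Sketch: the same model class with two hyperparameter settings
# can reach very different accuracies on the same test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately constrained forest: few, very shallow trees
shallow = RandomForestClassifier(n_estimators=2, max_depth=1, random_state=0)
# The default, much more flexible configuration
deeper = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)

for clf in (shallow, deeper):
    clf.fit(X_train, y_train)
    print(clf.get_params()["max_depth"], clf.score(X_test, y_test))
```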

One way to do a thorough search for the best hyperparameters is to use a tool called GridSearch.

What is GridSearch?

GridSearch is an optimization tool that we use when tuning hyperparameters. We define the grid of parameters that we want to search through, and we select the best combination of parameters for our data.
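In scikit-learn this is the `GridSearchCV` class. A minimal sketch, with an illustrative dataset and parameter values:

```python
# Minimal GridSearchCV usage: define the grid, fit, read off the winner.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# The grid of hyperparameter values to search through
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# The best combination found, and its cross-validated score
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is a model refit on the full data with the winning combination, ready for prediction.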

The “Search” in GridSearch

The hypothesis is that there is a specific combination of values of the different hyperparameters that will minimize the error of our predictive model. Our goal using GridSearch is to find this specific combination of parameters.

The “Grid” in GridSearch

GridSearch’s idea for finding this best parameter combination is simple: just test every possible parameter combination and select the best one!

Not every possible combination, though, since on a continuous scale there would be infinitely many combinations to test. The solution is to define a Grid, which specifies, for each hyperparameter, the values to be tested.


A schematic overview of GridSearch on two hyperparameters Alpha and Beta (graphics by author)

In an example case where two hyperparameters, Alpha and Beta, are tuned, we could give both of them the values [0.1, 0.01, 0.001, 0.0001], resulting in the following “Grid” of values. At each crossing point, GridSearch will fit the model to see what the error at this point is.

And after checking all the grid points, we know which parameter combination is best for our prediction.
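The Alpha/Beta grid above can be written out concretely. As an assumption for illustration, the two hyperparameters are mapped here onto ElasticNet's `alpha` and `l1_ratio` (both of which accept the article's values):

```python
# Sketch: a 4 x 4 grid of two hyperparameters, giving 16 crossing points,
# each of which GridSearchCV fits and scores.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

values = [0.1, 0.01, 0.001, 0.0001]
param_grid = {"alpha": values, "l1_ratio": values}  # the article's Alpha/Beta grid

search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
search.fit(X, y)

# 4 values x 4 values = 16 combinations were evaluated
print(len(search.cv_results_["params"]), search.best_params_)
```

Note that the grid grows multiplicatively: adding a third hyperparameter with 4 values would mean 64 fits per cross-validation fold.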

The “Cross-Validation” in GridSearch

At this point, only one thing remains to be added: the Cross-Validation Error.

When testing the performance of each hyperparameter combination on a single data split, there is a risk of overfitting to that split. Just by pure chance, a particular combination may happen to suit this specific split of the data, while its performance on new, real-life data could be much worse!

To get a more reliable estimate of the performance of a hyperparameter combination, we use the Cross-Validation Error.


A schematic overview of Cross-Validation (graphics by author)

In Cross-Validation, the data is split into multiple parts, for example 5. The model is then fit 5 times, each time leaving out one-fifth of the data; this left-out fifth is used to measure performance.

For one combination of hyperparameter values, the average of the 5 errors constitutes the cross-validation error. This makes the selection of the final combination more reliable.
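This averaging is exactly what `cross_val_score` does. A minimal sketch, with an illustrative dataset and model:

```python
# Sketch of 5-fold cross-validation: the model is fit 5 times, each time
# scoring on the left-out fifth; the mean of the 5 scores is the
# cross-validation score.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```

GridSearchCV performs this same 5-fold loop internally for every grid point, which is why its `best_score_` is a cross-validated average rather than a single-split score.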

What makes GridSearch so important?

GridSearch makes it very easy to find the best model for a given data set. By automating the search, it makes the Machine Learning part of the Data Scientist’s role much simpler.

On the Machine Learning side, some decisions still remain: choosing the right way to measure error, deciding which models to try out, and which hyperparameters to test. And the most important part, data preparation, is also left to the data scientist.

Thanks to the GridSearch approach, the Data Scientist can focus on the data wrangling work, while automating repetitive tasks of model comparison. This makes the work more interesting and allows the Data Scientist to add value where they’re most needed: working with data.

A number of alternatives for GridSearch exist, including Random Search, Bayesian Optimization, Genetic Algorithms, and more. I will write an article about those soon, so don’t hesitate to stay tuned. Thanks for reading!

