Machine Learning in short
The goal of supervised Machine Learning is to build a prediction function based on historical data. This data has independent (explanatory) variables and a target variable (the variable that you want to predict).
Once a predictive model has been built, we measure its error on a separate test data set. We do this using metrics that quantify the model's error, for example, the Mean Squared Error in a regression context (quantitative target variable) or the Accuracy in a classification context (categorical target variable).
The model with the smallest error is generally selected as the best model. Then we use this model to predict the values of the target variable by inputting the explanatory variables.
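The two metrics mentioned above can be computed in a few lines with scikit-learn. The toy arrays below are made up purely for illustration:

```python
from sklearn.metrics import mean_squared_error, accuracy_score

# Regression: quantitative target, lower MSE is better.
y_true_reg = [3.0, 5.0, 7.0]
y_pred_reg = [2.5, 5.0, 8.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # mean of squared differences

# Classification: categorical target, higher accuracy is better.
y_true_clf = ["cat", "dog", "dog", "cat"]
y_pred_clf = ["cat", "dog", "cat", "cat"]
acc = accuracy_score(y_true_clf, y_pred_clf)  # fraction of correct labels

print(mse)  # (0.25 + 0.0 + 1.0) / 3
print(acc)  # 3 out of 4 correct -> 0.75
```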
In this article, I will deep-dive into GridSearch.
Machine Learning’s Two Types of Optimization
GridSearch is a tool used for hyperparameter tuning. As stated before, Machine Learning in practice comes down to comparing different models to each other and trying to find the one that works best.
Apart from selecting the right data set, there are generally two aspects of optimizing a predictive model:
- Optimize the choice of the best model
- Optimize a model’s fit using hyperparameter tuning
Let’s now look at both to understand why GridSearch is needed.
Part 1. Optimize the choice of the best model
In some datasets, there may exist a simple linear relationship that can predict a target variable from the explanatory variables. In other datasets, these relationships may be more complex or highly nonlinear.
At the same time, many models exist, ranging from simple ones like Linear Regression up to very complex ones like Deep Neural Networks.
It is key to use a model that is appropriate for our data.
For example, if we use a Linear Regression on a highly nonlinear task, the model will underfit and perform poorly. But if we use a Deep Neural Network on a very simple task, it will not perform well either!
To find a well-fitting Machine Learning model, the solution is to split the data into training and test sets, fit many models on the training data, and evaluate each of them on the test data. The model with the smallest error on the test data is kept.
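The split-fit-compare loop just described can be sketched with scikit-learn. The synthetic dataset and the two candidate models below are made up for the example; any set of candidates would do:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
}

errors = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)        # fit on the training data
    preds = model.predict(X_test)      # predict on the held-out test data
    errors[name] = mean_squared_error(y_test, preds)

best = min(errors, key=errors.get)     # smallest test error wins
print(best, errors)
```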
Part 2. Optimize a model’s fit using hyperparameter tuning
After choosing one well-performing model (or a few), the second thing to optimize is the hyperparameters of a model. Hyperparameters are like a configuration of the training phase of the model. They influence what a model can or cannot learn.
Tuning hyperparameters can, therefore, lower the error on the test data set even more.
Each model is estimated in its own way, and thus each model has its own hyperparameters to optimize.
One way to do a thorough search for the best hyperparameters is to use a tool called GridSearch.
What is GridSearch?
GridSearch is an optimization tool that we use when tuning hyperparameters. We define the grid of parameters that we want to search through, and we select the best combination of parameters for our data.
The “Search” in GridSearch
The hypothesis is that there is a specific combination of values of the different hyperparameters that will minimize the error of our predictive model. Our goal using GridSearch is to find this specific combination of parameters.
The “Grid” in GridSearch
GridSearch’s idea for finding this best parameter combination is simple: just test every possible parameter combination and select the best one!
Not really every possible combination, though, since for a continuous hyperparameter there would be infinitely many values to test. The solution is to define a Grid, which specifies, for each hyperparameter, the values that should be tried.
In an example case where two hyperparameters, Alpha and Beta, are tuned: we could give both of them the values [0.1, 0.01, 0.001, 0.0001], resulting in the following “Grid” of values. At each crossing point, GridSearch fits the model to measure the error at that point.
And after checking all the grid points, we know which parameter combination is best for our prediction.
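The grid above can be expressed directly with scikit-learn's GridSearchCV. Here I map the article's generic "Alpha" and "Beta" onto ElasticNet's `alpha` and `l1_ratio` hyperparameters purely for illustration; any model with two tunable hyperparameters would work the same way:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {
    "alpha": [0.1, 0.01, 0.001, 0.0001],     # the "Alpha" axis of the grid
    "l1_ratio": [0.1, 0.01, 0.001, 0.0001],  # the "Beta" axis of the grid
}

search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
search.fit(X, y)  # fits the model at each of the 4 x 4 = 16 grid points

print(search.best_params_)  # the crossing point with the smallest CV error
```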
The “Cross-Validation” in GridSearch
At this point, only one thing remains to be added: the Cross-Validation Error.
When testing the performance of each hyperparameter combination on a single fixed split, there is a risk of overfitting: a particular combination may score well purely by chance on that specific split, while its performance on new, real-life data could be much worse!
To get a more reliable estimate of the performance of a hyperparameter combination, we use the Cross-Validation Error.
In Cross-Validation, the data is split into multiple parts, for example 5. The model is then fit 5 times, each time leaving out a different fifth of the data; that left-out fifth is used to measure performance.
For one combination of hyperparameter values, the average of the 5 errors constitutes the cross-validation error. This makes the selection of the final combination more reliable.
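The 5-fold scheme just described is what scikit-learn's `cross_val_score` computes. A minimal sketch, again on synthetic data and with one arbitrarily chosen hyperparameter value:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One hyperparameter combination (here just alpha=0.1, chosen arbitrarily).
model = Ridge(alpha=0.1)

# 5 fits, each leaving out one fifth of the data.
# scikit-learn maximizes scores, so the MSE comes back negated.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

cv_error = -scores.mean()  # the cross-validation error: average of the 5 fold errors
print(cv_error)
```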
What makes GridSearch so important?
GridSearch makes it very easy to find the best model for a given data set. By automating the search, it makes the Machine Learning part of the Data Scientist’s role much simpler.
On the Machine Learning side, some things still remain to be done: deciding on the right way to measure error, which models to try out, and which hyperparameter values to test. And the most important part, data preparation, is also left to the data scientist.
Thanks to the GridSearch approach, the Data Scientist can focus on the data wrangling work while automating the repetitive tasks of model comparison. This makes the work more interesting and lets Data Scientists add value where they are most needed: working with data.
A number of alternatives to GridSearch exist, including Random Search, Bayesian Optimization, Genetic Algorithms, and more. I will write an article about those soon, so don’t hesitate to stay tuned. Thanks for reading!