The parameters of a model that cannot be learned from the data but must be set before training are called hyperparameters. They usually control the complexity of the model and need to be chosen carefully to avoid underfitting or overfitting.
There are two common ways to select the best set of hyperparameters. The first is to split the training dataset further into two parts, a training set and a validation set. Each candidate model is trained on the training set and evaluated on the validation set, and the best set of hyperparameters is the one with the best performance on the validation set.
However, when the sample size is small, the result of a single split can be biased by how the data happened to be divided. Cross-validation, which repeats the split several times and averages the results, is therefore the more popular way of selecting hyperparameters, and it is the one I use in this project.
The whole hyperparameter tuning function is listed below; I will then go through it in detail.
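To make the difference between the two approaches concrete, here is a minimal sketch that compares a single hold-out split with 5-fold cross-validation. The synthetic dataset, the plain logistic regression, and the candidate values of `C` are assumptions chosen for illustration; they are not the project's data or models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the training data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

candidates = [0.1, 1, 10]  # candidate values of the hyperparameter C

# Option 1: one training/validation split; each candidate gets a single score.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_scores = {
    C: LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)
    for C in candidates
}

# Option 2: 5-fold cross-validation; each candidate's score is the mean
# accuracy over five different validation folds.
cv_scores = {
    C: cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5).mean()
    for C in candidates
}

best_C_holdout = max(holdout_scores, key=holdout_scores.get)
best_C_cv = max(cv_scores, key=cv_scores.get)
print("hold-out:", holdout_scores, "-> best C:", best_C_holdout)
print("5-fold CV:", cv_scores, "-> best C:", best_C_cv)
```

The cross-validated score uses every sample for validation exactly once, so it depends less on one particular split.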
# imports used by the function (the KerasClassifier import path is an
# assumption: it matches older Keras releases that ship keras.wrappers)
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier

def train_hyper_tune(X, y):
    # create the pre-processing components
    my_scaler = StandardScaler()
    my_imputer = SimpleImputer(strategy="median")

    # define classifiers
    ## Classifier 1: Logistic Regression
    clf_LR = LogisticRegression(random_state=0, penalty='elasticnet', solver='saga')
    ## Classifier 2: Random Forest Classifier
    clf_RF = RandomForestClassifier(random_state=0)
    ## Classifier 3: Deep Learning Binary Classifier
    clf_DL = KerasClassifier(build_fn=my_DL)

    # define pipeline for three classifiers
    ## clf_LR
    pipe1 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('lr_model', clf_LR)])
    ## clf_RF
    pipe2 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('rf_model', clf_RF)])
    ## clf_DL
    pipe3 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('dl_model', clf_DL)])

    # create hyperparameter space of the three models
    ## clf_LR
    param_grid1 = {
        'lr_model__C': [1e-1, 1, 10],
        'lr_model__l1_ratio': [0, 0.5, 1]
    }
    ## clf_RF
    param_grid2 = {
        'rf_model__n_estimators': [50, 100],
        # "sqrt" is what the older "auto" option meant for classifiers
        'rf_model__max_features': [0.8, "sqrt"],
        'rf_model__max_depth': [4, 5]
    }
    ## clf_DL
    param_grid3 = {
        'dl_model__epochs': [6, 12, 18, 24],
        # Keras names this fit argument batch_size, not batchsize
        'dl_model__batch_size': [256, 512]
    }

    # set GridSearch via 5-fold cross-validation
    ## clf_LR
    grid1 = GridSearchCV(pipe1, cv=5, param_grid=param_grid1)
    ## clf_RF
    grid2 = GridSearchCV(pipe2, cv=5, param_grid=param_grid2)
    ## clf_DL
    grid3 = GridSearchCV(pipe3, cv=5, param_grid=param_grid3)

    # run the hyperparameter tuning process
    grid1.fit(X, y)
    grid2.fit(X, y)
    grid3.fit(X, y)

    # return results of the tuning process
    return grid1, grid2, grid3, pipe1, pipe2, pipe3

As the code shows, the function consists of six main steps:
Step 1. Create the pre-processing functions.
# create the pre-processing components
my_scaler = StandardScaler()
my_imputer = SimpleImputer(strategy="median")

I use each feature’s median to impute its missing values and the standard scaler to standardize the data to zero mean and unit variance. This step is the same for all three models.
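As a small illustration of what these two components do, the sketch below runs them on a toy array with missing values. The array `X_toy` is made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two missing entries (hypothetical data).
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 50.0]])

# Median imputation: each NaN is replaced by the median of the observed
# values in its column (2.0 for column 0, 30.0 for column 1).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_toy)

# Standardization: each column is shifted and rescaled to zero mean and
# unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

print(imputer.statistics_)    # per-column medians used for imputation
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```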
Step 2. Define all three classifiers.
# define classifiers
## Classifier 1: Logistic Regression
clf_LR = LogisticRegression(random_state=0, penalty='elasticnet', solver='saga')
## Classifier 2: Random Forest Classifier
clf_RF = RandomForestClassifier(random_state=0)
## Classifier 3: Deep Learning Binary Classifier
clf_DL = KerasClassifier(build_fn=my_DL)

First, the logistic regression classifier is often used as the “Hello world!” model in machine learning books. Here, it is used together with a penalty term to avoid overfitting. This penalty is called ‘Elastic Net’, and it combines the L1 and L2 norms in the regularization term.
For readers interested in why I chose Elastic Net as the penalty term, I explain that choice in a separate post.
Second, the Random Forest Classifier is defined with its default settings, without fixing any hyperparameters other than the random seed. Three of its hyperparameters are tuned in the following steps, which I will go over in detail later.
Third, the deep learning classifier is built from my_DL, the Scikit-Learn style model function defined earlier. Thankfully, Keras provides wrappers for the Scikit-Learn API, so I can pass my_DL directly to KerasClassifier().
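The effect of mixing the two norms can be seen in a small experiment. The sketch below is an illustration on synthetic data, not the project's model: it fits the same classifier with three values of `l1_ratio` (0 is a pure L2 penalty, 1 is a pure L1 penalty) and counts how many coefficients end up exactly zero. The dataset, `C=0.05`, and `max_iter` are assumptions picked to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with only a few informative features (an assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # saga converges faster on scaled data

def count_zero_coefs(l1_ratio):
    """Fit an elastic-net logistic regression and count exact-zero weights."""
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=l1_ratio, C=0.05,
                             max_iter=5000, random_state=0)
    clf.fit(X, y)
    return int(np.sum(clf.coef_ == 0))

# l1_ratio=0 -> pure L2, l1_ratio=1 -> pure L1, 0.5 -> an equal mix.
zeros_by_ratio = {ratio: count_zero_coefs(ratio) for ratio in (0, 0.5, 1)}
print(zeros_by_ratio)
```

The L1 part of the penalty tends to drive uninformative weights to exactly zero, while the L2 part only shrinks them, which is why `l1_ratio` is worth tuning alongside `C`.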
Step 3. Define a pipeline for each model that combines the pre-processing and modeling together.
# define pipeline for three classifiers
## clf_LR
pipe1 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('lr_model', clf_LR)])
## clf_RF
pipe2 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('rf_model', clf_RF)])
## clf_DL
pipe3 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('dl_model', clf_DL)])

For each of the three models, I combine the pre-processing and the classifier into a pipeline with the Pipeline class in sklearn. Each step in the pipeline must be given a name. For example, I name the logistic regression step “lr_model” and assign clf_LR to it.
The aim of combining everything into a pipeline is to make sure that, in each round of cross-validation, the imputer and the scaler are fitted on the training folds only and then applied unchanged to the held-out fold. This is essential to avoid data leakage.
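The sketch below illustrates this point by doing by hand roughly what a cross-validated search does with a pipeline: clone it, fit it on the training folds, and score it on the held-out fold. The synthetic data and the plain `LogisticRegression` are assumptions for illustration. The check confirms that the mean learned by the scaler comes from the training folds only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data without missing values, so the imputer leaves it unchanged.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("lr_model", LogisticRegression(max_iter=1000)),
])

fold_scores = []
scaler_uses_train_only = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X):
    fold_pipe = clone(pipe)                    # fresh, unfitted copy per fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # pre-processing sees training rows only
    learned_mean = fold_pipe.named_steps["scaler"].mean_
    scaler_uses_train_only.append(
        np.allclose(learned_mean, X[train_idx].mean(axis=0))
    )
    # The held-out fold is transformed with the statistics learned above.
    fold_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))

print(scaler_uses_train_only)
print(np.mean(fold_scores))
```

Scaling the full dataset before splitting would instead let statistics from the held-out rows influence training, which is the leakage the pipeline prevents.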
Step 4. Create the hyperparameter space for each of the models.
# create hyperparameter space of the three models
## clf_LR
param_grid1 = {
    'lr_model__C': [1e-1, 1, 10],
    'lr_model__l1_ratio': [0, 0.5, 1]
}
## clf_RF
param_grid2 = {
    'rf_model__n_estimators': [50, 100],
    # "sqrt" is what the older "auto" option meant for classifiers
    'rf_model__max_features': [0.8, "sqrt"],
    'rf_model__max_depth': [4, 5]
}
## clf_DL
param_grid3 = {
    'dl_model__epochs': [6, 12, 18, 24],
    # Keras names this fit argument batch_size, not batchsize
    'dl_model__batch_size': [256, 512]
}

This part is more flexible because these three models have many parameters. It’s important to select the ones that are closely related to the complexity of the model. For example, the maximum depth of the trees in the random forest model is a must-tune hyperparameter. I discuss this in more detail in a separate post.
Note that the name of the pipeline step must be included in each key of the hyperparameter space. For example, the number of epochs in the deep learning model is named “dl_model__epochs”, where “dl_model” is the name of the deep learning step in my pipeline and “epochs” is the name of a parameter that can be passed to my deep learning model. The two parts are joined by a double underscore, “__”.
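The naming convention can be checked directly on a pipeline. This minimal sketch uses a two-step pipeline built only for the demonstration (the step names mirror the ones above); `get_params` lists the valid parameter names, and `set_params` accepts the same `step__parameter` syntax that the grid search uses internally.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small demonstration pipeline with the same step name as in the article.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr_model", LogisticRegression()),
])

# Every tunable parameter of a step is exposed as "<step name>__<parameter>".
tunable = [name for name in pipe.get_params() if name.startswith("lr_model__")]
print(tunable)

# Setting a value through the pipeline reaches the underlying classifier.
pipe.set_params(lr_model__C=10)
print(pipe.named_steps["lr_model"].C)
```

Listing `pipe.get_params()` before writing a grid is a quick way to catch misspelled keys.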
Step 5. Set the grid search function across the hyperparameter space via cross-validation.
# set GridSearch via 5-fold cross-validation
## clf_LR
grid1 = GridSearchCV(pipe1, cv=5, param_grid=param_grid1)
## clf_RF
grid2 = GridSearchCV(pipe2, cv=5, param_grid=param_grid2)
## clf_DL
grid3 = GridSearchCV(pipe3, cv=5, param_grid=param_grid3)

Compared with randomized search, grid search is more computationally costly because it evaluates every combination in the hyperparameter space. In this project, I use grid search because the hyperparameter space is relatively small.
For each grid search, I use 5-fold cross-validation to evaluate the average performance of the combinations of hyperparameters.
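To get a feel for the cost, the number of model fits can be counted before running anything. The sketch below multiplies the number of candidate values per hyperparameter to get the number of combinations, then multiplies by the five folds. The dictionary keys follow the grids described above, with `batch_size` spelled the way Keras expects; the helper dictionary `param_grids` is introduced only for this illustration.

```python
from math import prod

# The three hyperparameter spaces from the article, keyed by model name.
param_grids = {
    "logistic regression": {
        "lr_model__C": [1e-1, 1, 10],
        "lr_model__l1_ratio": [0, 0.5, 1],
    },
    "random forest": {
        "rf_model__n_estimators": [50, 100],
        "rf_model__max_features": [0.8, "sqrt"],
        "rf_model__max_depth": [4, 5],
    },
    "deep learning": {
        "dl_model__epochs": [6, 12, 18, 24],
        "dl_model__batch_size": [256, 512],
    },
}

n_folds = 5
fits_per_grid = {}
for model_name, grid in param_grids.items():
    # Grid search tries every combination: the product of the list lengths.
    n_combinations = prod(len(values) for values in grid.values())
    # Each combination is trained once per fold.
    fits_per_grid[model_name] = n_combinations * n_folds
    print(model_name, n_combinations, "combinations,",
          fits_per_grid[model_name], "fits")
# logistic regression 9 combinations, 45 fits
# random forest 8 combinations, 40 fits
# deep learning 8 combinations, 40 fits
```

With its default `refit=True`, `GridSearchCV` adds one more fit per grid to retrain the best combination on the full training data.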
Step 6. Run the tuning process.
# run the hyperparameter tuning process
grid1.fit(X, y)
grid2.fit(X, y)
grid3.fit(X, y)

This step is straightforward: it executes the grid search on each of the three pipelines.
Lastly, we just need to run the function as below:
my_grid1,my_grid2,my_grid3,my_pipe1,my_pipe2,my_pipe3 = train_hyper_tune(X_train, y_train)
We can check the training performance by pulling out the best score from each grid search result, which is the mean cross-validated score of the best hyperparameter combination.
It seems the random forest has the best performance on the training dataset. But all three models are pretty comparable to each other.
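For reference, the best score and the matching hyperparameters are stored on a fitted `GridSearchCV` object as `best_score_` and `best_params_`. The sketch below shows this on a small synthetic problem; the data and the reduced grid are assumptions for illustration, and the printed values are unrelated to the project's results. With the objects returned by train_hyper_tune, the same attributes would presumably be read as `my_grid1.best_score_`, `my_grid2.best_score_`, and `my_grid3.best_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train, y_train (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    cv=5,
    param_grid={"n_estimators": [10, 20], "max_depth": [4, 5]},
)
grid.fit(X, y)

# Mean accuracy over the 5 folds for the best hyperparameter combination.
print(grid.best_score_)
# The combination that achieved it.
print(grid.best_params_)
```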