The parameters of a model that cannot be learned from the data but must be assigned before training are called hyperparameters. These hyperparameters typically control the complexity of the model and need to be selected properly to avoid underfitting or overfitting.
There are two common ways to select the best set of hyperparameters. First, we can further split the training dataset into two parts: a training set and a validation set. We then evaluate the model trained on the training set against the validation set; the best set of hyperparameters is the one with the best validation performance.
However, when the sample size is small, a single split of the data can be biased. Cross-validation, the second and more popular approach, averages performance over several splits, so I use cross-validation in this project.
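As a minimal sketch of the two approaches on toy data (the dataset and model here are purely illustrative, not the ones used in this project):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X_toy, y_toy = make_classification(n_samples=200, random_state=0)  # illustrative data

# Approach 1: a single hold-out validation split
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
holdout_score = LogisticRegression().fit(X_tr, y_tr).score(X_val, y_val)

# Approach 2: 5-fold cross-validation, averaging the score over five different splits
cv_score = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5).mean()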
The whole hyperparameter-tuning function is listed below; I will then go through it in detail.
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

def train_hyper_tune(X, y):
    # create the pre-processing components
    my_scaler = StandardScaler()
    my_imputer = SimpleImputer(strategy="median")

    # define classifiers
    ## Classifier 1: Logistic Regression
    clf_LR = LogisticRegression(random_state=0, penalty='elasticnet', solver='saga')
    ## Classifier 2: Random Forest Classifier
    clf_RF = RandomForestClassifier(random_state=0)
    ## Classifier 3: Deep Learning Binary Classifier (my_DL is defined in an earlier section)
    clf_DL = KerasClassifier(build_fn=my_DL)

    # define a pipeline for each of the three classifiers
    ## clf_LR
    pipe1 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('lr_model', clf_LR)])
    ## clf_RF
    pipe2 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('rf_model', clf_RF)])
    ## clf_DL
    pipe3 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('dl_model', clf_DL)])

    # create the hyperparameter space of the three models
    ## clf_LR
    param_grid1 = {
        'lr_model__C': [1e-1, 1, 10],
        'lr_model__l1_ratio': [0, 0.5, 1]
    }
    ## clf_RF
    param_grid2 = {
        'rf_model__n_estimators': [50, 100],
        'rf_model__max_features': [0.8, "auto"],
        'rf_model__max_depth': [4, 5]
    }
    ## clf_DL
    param_grid3 = {
        'dl_model__epochs': [6, 12, 18, 24],
        'dl_model__batch_size': [256, 512]  # KerasClassifier expects 'batch_size'
    }

    # set up grid search with 5-fold cross-validation
    ## clf_LR
    grid1 = GridSearchCV(pipe1, cv=5, param_grid=param_grid1)
    ## clf_RF
    grid2 = GridSearchCV(pipe2, cv=5, param_grid=param_grid2)
    ## clf_DL
    grid3 = GridSearchCV(pipe3, cv=5, param_grid=param_grid3)

    # run the hyperparameter tuning process
    grid1.fit(X, y)
    grid2.fit(X, y)
    grid3.fit(X, y)

    # return the results of the tuning process
    return grid1, grid2, grid3, pipe1, pipe2, pipe3
As shown in the code, the function consists of six main steps:
Step 1. Create the pre-processing functions.
# create the pre-processing components
my_scaler = StandardScaler()
my_imputer = SimpleImputer(strategy="median")
I use each feature’s median to impute missing values and StandardScaler to standardize the data. This step is the same for all three models.
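A quick illustration of what these two components do, on a made-up array with one missing value:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
filled = SimpleImputer(strategy="median").fit_transform(toy)  # nan -> 2.0, the median
scaled = StandardScaler().fit_transform(filled)               # zero mean, unit variance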
Step 2. Define all three classifiers.
# define classifiers
## Classifier 1: Logistic Regression
clf_LR = LogisticRegression(random_state=0, penalty='elasticnet', solver='saga')
## Classifier 2: Random Forest Classifier
clf_RF = RandomForestClassifier(random_state=0)
## Classifier 3: Deep Learning Binary Classifier
clf_DL = KerasClassifier(build_fn=my_DL)
First, logistic regression is often the “Hello world!” model in machine learning books. Here, it is used together with a penalty term to avoid overfitting. This penalty is called ‘elastic net’, a combination of the L1 and L2 norms in the regularization term.
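For reference, with $\rho$ denoting l1_ratio and $C$ the inverse regularization strength, the objective given in sklearn’s documentation for elastic-net logistic regression is roughly

$$\min_{w}\;\rho\,\lVert w\rVert_1+\frac{1-\rho}{2}\,\lVert w\rVert_2^2+C\sum_{i=1}^{n}\log\!\left(1+e^{-y_i x_i^{\top} w}\right),$$

so l1_ratio = 1 recovers a pure L1 (lasso) penalty and l1_ratio = 0 a pure L2 (ridge) penalty.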
For those interested in why elastic net was chosen as the penalty term, please refer to my other post on the topic.
Second, the random forest classifier is defined more loosely, without fixing any hyperparameters. Three of its hyperparameters will be tuned in the following steps, which I will cover in detail later.
Third, the deep learning classifier is based on the aforementioned Scikit-Learn-style model, my_DL. Thankfully, Keras provides wrappers for the Scikit-Learn API, so I simply pass the function my_DL to KerasClassifier().
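The actual definition of my_DL appears in an earlier section of this project. As a reminder of its shape only (the layer sizes here are illustrative, not the original architecture), a build function for KerasClassifier typically looks like this:

from keras.models import Sequential
from keras.layers import Dense

def my_DL():
    # build_fn must return a compiled Keras model
    model = Sequential()
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # single sigmoid unit for binary output
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model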
Step 3. Define a pipeline for each model that combines the pre-processing and modeling together.
# define a pipeline for each of the three classifiers
## clf_LR
pipe1 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('lr_model', clf_LR)])
## clf_RF
pipe2 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('rf_model', clf_RF)])
## clf_DL
pipe3 = Pipeline([('imputer', my_imputer), ('scaler', my_scaler), ('dl_model', clf_DL)])
For each of the three models, I combine the pre-processing steps and the classifier into a pipeline using sklearn’s Pipeline class. Each step in a pipeline must be given a name. For example, I name the logistic regression step “lr_model” and attach the classifier clf_LR to it.
The aim of combining everything into a pipeline is to make sure that, during cross-validation, exactly the same processing fitted on the training data is applied to the held-out data. This is essential to avoid data leakage.
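Conceptually, this is what happens inside each cross-validation fold (the fold variable names here are illustrative):

# the imputer and scaler are fit on the training folds only...
pipe1.fit(X_fold_train, y_fold_train)
# ...and the already-fitted transforms are re-applied, not re-fit, on the held-out fold
validation_score = pipe1.score(X_fold_valid, y_fold_valid)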
Step 4. Create the hyperparameter space for each of the models.
# create the hyperparameter space of the three models
## clf_LR
param_grid1 = {
    'lr_model__C': [1e-1, 1, 10],
    'lr_model__l1_ratio': [0, 0.5, 1]
}
## clf_RF
param_grid2 = {
    'rf_model__n_estimators': [50, 100],
    'rf_model__max_features': [0.8, "auto"],
    'rf_model__max_depth': [4, 5]
}
## clf_DL
param_grid3 = {
    'dl_model__epochs': [6, 12, 18, 24],
    'dl_model__batch_size': [256, 512]  # KerasClassifier expects 'batch_size'
}
This part is more flexible because these three models have plenty of tunable parameters. It is important to select the ones closely related to the complexity of the model. For example, the maximum depth of the trees in the random forest is a must-tune hyperparameter. For those who are interested, please refer to my related post.
Note that the name of the pipeline step must be included in each key of the hyperparameter space. For example, the number of epochs of the deep learning model is written as “dl_model__epochs”, where “dl_model” is the name of the deep learning step in my pipeline and “epochs” is a parameter that can be passed to that model; the two parts are joined by a double underscore “__”.
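If you are unsure which names are legal, a pipeline can list every tunable parameter key itself via sklearn’s standard get_params():

# prints keys such as 'rf_model__max_depth' and 'imputer__strategy'
print(sorted(pipe2.get_params().keys()))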
Step 5. Set the grid search function across the hyperparameter space via cross-validation.
# set up grid search with 5-fold cross-validation
## clf_LR
grid1 = GridSearchCV(pipe1, cv=5, param_grid=param_grid1)
## clf_RF
grid2 = GridSearchCV(pipe2, cv=5, param_grid=param_grid2)
## clf_DL
grid3 = GridSearchCV(pipe3, cv=5, param_grid=param_grid3)
Compared to randomized search, grid search is more computationally costly because it exhaustively evaluates every combination in the hyperparameter space. I use grid search in this project because the hyperparameter space is relatively small.
For each grid search, I use 5-fold cross-validation to evaluate the average performance of each combination of hyperparameters.
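After fitting, the mean score of every combination across the five folds is exposed through GridSearchCV’s standard cv_results_ attribute, e.g. on the returned object once train_hyper_tune has been run (see below):

# per-combination mean validation scores
for params, score in zip(my_grid1.cv_results_['params'],
                         my_grid1.cv_results_['mean_test_score']):
    print(params, round(score, 4))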
Step 6. Run the tuning process.
# run the hyperparameter tuning process
grid1.fit(X, y)
grid2.fit(X, y)
grid3.fit(X, y)
This step is straightforward: it executes the grid search on the three defined pipelines.
Lastly, we just need to run the function as below:
my_grid1,my_grid2,my_grid3,my_pipe1,my_pipe2,my_pipe3 = train_hyper_tune(X_train, y_train)
We can check the training performance by pulling the best score out of each grid search result.
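A minimal way to pull these out, using GridSearchCV’s standard best_score_ and best_params_ attributes:

for name, grid in [('LR', my_grid1), ('RF', my_grid2), ('DL', my_grid3)]:
    print(name, grid.best_score_, grid.best_params_)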
It seems the random forest has the best performance on the training dataset, but all three models are fairly comparable to one another.