内容简介:A complete hands on guide towards building, visualizing, and fine tuning a decision tree using cost computation pruning in PythonDecision tree classifiers are supervised learning models that are useful when we care about interpretability. Think of it like,
Decision Tree Classifier and Cost Computation Pruning using Python
A complete hands on guide towards building, visualizing, and fine tuning a decision tree using cost computation pruning in Python
Introduction
Decision tree classifiers are supervised learning models that are useful when we care about interpretability. Think of it like, breaking down the data by making decisions based on multiple questions at each level. This is one of the widely used algorithms for handling classification problems. To understand it better let’s take a look at the example below.
Decision Tree usually consists of:
- Root Node — Represents the sample or the population that gets divided into further homogeneous groups
- Splitting — Process of dividing the nodes into two sub-nodes
- Decision Node — When a sub-node splits into further sub-nodes based on a certain condition, it is called a decision node
- Leaf or Terminal Node — Sub nodes that don’t split further
- Information Gain — To split the nodes using a condition (say most informative feature) we need to define an objective function that can be optimized. In the decision tree algorithm, we tend to maximize the information gain at each split. Three impurity measures are used commonly in measuring the information gain. They are the Gini impurity, Entropy, and the Classification error
Understanding the Mathematics
To understand how decision trees are developed, we need to have a deeper understanding of how information gain is maximized at each step using one of the impurity measures at each step. Let’s take an example where we have training data comprising student information like gender, grade, and a dependent variable or a classifier variable the identifies if a student is a foodie or not. We have with us the following information outlined below.
- Total Students — 20
- Total Students Categorized as Foodie — 10
- Total Student not Categorized as Foodie — 10
- P(Foodie), probability of a student being foodie = (10/20) = 0.5
- Q(Not Foodie, or 1-P), probability of a student not being foodie = (10/20) = 0.5
Let’s divide the students based on their gender into two nodes and recalculate the above metrics.
Male Students (Node A)
- Total Students — 10
- Total Students Categorized as Foodie — 8
- Total Student not Categorized as Foodie — 2
- P(Foodie), probability of a student being foodie = (8/10) = 0.8
- Q(Not Foodie, or 1-P), probability of a student not being foodie = (2/10) = 0.2
Female Students (Node B)
- Total Students — 10
- Total Students Categorized as Foodie — 4
- Total Student not Categorized as Foodie — 6
- P(Foodie), probability of a student being foodie = (4/10) = 0.4
- Q(Not Foodie, or 1-P), probability of a student not being foodie = (6/10) = 0.6
Gini Index (GIn)for Node A or the Male Students = P² + Q², where P and Q are the probability of a student being a foodie and a non-foodie. GIn (Node A) = 0.8² + 0.2² = 0.68
Gini Impurity (GIp)for Node A = 1-Gini Index = 1–0.68 = 0.32
Gini Index (GIn)for Node B or the Female Students = P² + Q², where P and Q are the probability of a student being a foodie and a non-foodie. GIn (Node B) = 0.4² + 0.6²= 0.52
Gini Impurity (GIp)for Node B= 1-Gini Index = 1–0.52 = 0.48
What we observe above is that when we split the students based on their gender (Male and Female) into Nodes A and B respectively, we have a Gini impurity score for two nodes separately. Now to decide if gender is the right variable to split the students into foodie and non-foodie we need a weighted Gini Impurity Score which is calculated using the formula outlined below.
Weighted Gini Impurity = (Total samples in Node A / Total Samples in the data set) * Gini Impurity (Node A) + (Total samples in Node B/ Total Samples in the data set) * Gini Impurity (Node B)
Using this formulation to calculate the Weighted Gini Impurity Score of the above example, Weighted Gini Impurity Score when Students are divided based on Gender = (10/20)*0.32 + (10/20)*0.48 = 0.4
A classification problem works with multiple independent variables. The variables can be categorical or continuous. Decision trees are well adapted to handle variables of different data types. A decision tree algorithm takes into consideration all possible variables while deciding the split of each node. Variables using which maximum Weighted Impurity Gain can be achieved, is used as a decision variable for a particular node .
In the example above the Weighted Impurity Gain using “gender” as the decision variable is 0.4, however, say using “grade” as the decision variable we manage to achieve a Weighted Impurity Gain of 0.56, the algorithm will use “grade” as the decision variable for creating the first split. A similar methodology is followed for all subsequent steps until every node is homogeneous.
Quick facts about the Decision Tree Algorithm
- Decision trees are prone to overfitting as the algorithm continues to split nodes into sub-nodes till each node becomes homogeneous
- The accuracy of training data is much higher when compared to the test set, hence decision trees should be pruned to prevent the model from overfitting. Pruning can be achieved by controlling the depth of the tree, maximum/minimum number of samples in each node, minimum impurity gain for a node to split, and the maximum leaf nodes
- Python allows users to develop a decision tree using the Gini Impurity or Entropy as the Information Gain Criterion
- A decision tree can be fine-tuned using a Grid Search or a Randomized Search CV. CV stands for cross-validation
Visual Comparison of Three Different Impurity Criteria
The code snippet outlined below provides a visual comparison of different impurity criterion and how they change with different probability values. Note the code below is adapted from Python: Deeper Insights into Machine Learning by S.Raschka, D.Julian, and J.Hearty, 2016.
import matplotlib.pyplot as pltimport numpy as np#-----Calculating Gini Index def gini(p): return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))#-----Calculating Entropy def entropy(p): return - p*np.log2(p) - (1 - p)*np.log2((1 - p))#-----Calculating Classification Error def classification_error(p): return 1 - np.max([p, 1 - p])#----Creating a Numpy Array of probability values from 0 to 1, with an increment of 0.01 x = np.arange(0.0, 1.0, 0.01)#---Obtaining Entropy for different values of p ent = [entropy(p) if p != 0 else None for p in x]#---Obtaining scaled entropy sc_ent = [e*0.5 if e else None for e in ent]#--Classification Error err = [classification_error(i) for i in x]#--Plottingfig = plt.figure(); plt.figure(figsize=(10,8)); ax = plt.subplot(111);for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err], ['Entropy', 'Entropy (scaled)','Gini Impurity', 'Misclassification Error'],['-', '-', '--', '-.'], ['black', 'darkgray','blue', 'brown', 'cyan']): line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c) ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=3, fancybox=True, shadow=False) ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--') ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--') plt.ylim([0, 1.1]) plt.xlabel('p(i=1)') plt.ylabel('Impurity Index') plt.show()
Hands On Exercise
The problem statement aims at developing a classification model to predict the quality of red wine. Details about the problem statement can be found here . This is a classic example of a multi-class classification problem. Note that all machine learning models are sensitive to Outliers, hence features/independent variables consisting of outliers should be treated before building the tree.
An important aspect of different features/independent variables is how they interact. Pearson correlation can be used to determine the degree of association between two features from a data set. However, for a decision-based algorithm like a decision tree, we won’t be dropping highly correlated variables.
#---------------------------------------------Importing Required Libraries----------------------------------- %matplotlib inlineimport numpy as np import pandas as pdfrom sklearn.tree import DecisionTreeClassifierimport numpy as np import pandas as pd import seaborn as snssns.set(color_codes=True)from matplotlib import pyplot as plt from sklearn.model_selection import train_test_split #--------------splitting data into test and train from sklearn.tree import DecisionTreeClassifier #-----------Building decision tree modelfrom sklearn import metrics from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score, confusion_matrix #-----model validation scores %matplotlib inlinefrom IPython.display import display #---------------------for displaying multiple data frames in one outputfrom sklearn.feature_extraction.text import CountVectorizer #DT does not take strings as input for the model fit stepimport missingno as msno_plot #--------------plotting missing valueswine_df = pd.read_csv('winequality-red.csv',sep=';')
Quick descriptive statistics of the data
wine_df.describe().transpose().round(2)
Checking Missing Values
#-------------------------------------------Barplot of non-missing values-------------------------------- plt.title('#Non-missing Values by Columns') msno_plot.bar(wine_df);
Outlier check & Treatment
#--Checking Outliers plt.figure(figsize=(15,15)) pos = 1 for i in wine_df.columns: plt.subplot(3, 4, pos) sns.boxplot(wine_df[i]) pos += 1
col_names=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']display(col_names)for i in col_names: q1, q2, q3 = wine_df[i].quantile([0.25,0.5,0.75]) IQR = q3 - q1 lower_cap=q1-1.5*IQR upper_cap=q3+1.5*IQR wine_df[i]=wine_df[i].apply(lambda x: upper_cap if x>(upper_cap) else (lower_cap if x<(lower_cap) else x))
Outliers above are winsorized using Q1–1.5*IQR and Q3+1.5*IQR values. Q1, Q3, and IQR stand for Quartile 1, Quartile 3, and Inter Quartile Range respectively.
sns.pairplot(wine_df);
Understanding the relationship between different variables. Note — In Decision Trees, we need not remove highly correlated variables as nodes are divided into sub-nodes using one independent variable only, hence even if two or more variables are highly correlated, the variable producing the highest information gain will be used for the analysis.
plt.figure(figsize=(10,8)) sns.heatmap(wine_df.corr(), annot=True, linewidths=.5, center=0, cbar=False, cmap="YlGnBu") plt.show()
Classification problems are sensitive to class imbalance. A class imbalance is a scenario when the dependent attribute has a higher proportion of ones than zeros or vice versa. In a multiclass problem, class imbalance occurs when the proportion of one of the class values is much higher. Class balance is induced by combining values of the attribute “quality” which is the dependent variable in this problem statement.
plt.figure(figsize=(10,8)) sns.countplot(wine_df['quality']);
wine_df['quality'] = wine_df['quality'].replace(8,7) wine_df['quality'] = wine_df['quality'].replace(3,5) wine_df['quality'] = wine_df['quality'].replace(4,5) wine_df['quality'].value_counts(normalize=True)
Data is divided into train and test set to check the accuracy of the model and look for overfitting or under fitting if any.
# splitting data into training and test set for independent attributesfrom sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test =train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.3, random_state=22) X_train.shape,X_test.shape
A decision tree model is developed using the Gini criterion. Note that for the sake of simplicity we have pruned the tree to a maximum depth of three. This will help us visualize the tree and tie it back to the concepts we covered in the initial segments.
clf_pruned = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=5) clf_pruned.fit(X_train, y_train)
Note that the following parameters can be tuned to improve the model output (Scikit Learn, 2019).
- criterion — Gini impurity is used to decide the variables based on which root node and following decision nodes should be split
- class_weight — None; All classes are assigned weight 1
- max_depth — 3; Pruning is done. When “None”, it signifies that nodes will be expanded till all leaves are homogeneous
- max_features — None; All features or independent variables are considered while deciding split of a node
- max_leaf_nodes — None;
- min_impurity_decrease — 0.0; A node is split only when the split ensures a decrease in the impurity of greater than or equal to zero
- min_impurity_split — None;
- min_samples_leaf — 1; Minimum number of samples required for a leaf to exists
- min_samples_split — 2; If min_samples_leaf =1, it signifies that the right and the left node should have 1 sample each, i.e. the parent node or the root node should have at least two samples
- splitter — ‘best’; Strategy used to choose the split at each node. Best ensure that all features are considered while deciding the split
from sklearn.tree import export_graphviz from sklearn.externals.six import StringIO from IPython.display import Image import pydotplus import graphvizxvar = wine_df.drop('quality', axis=1) feature_cols = xvar.columnsdot_data = StringIO() export_graphviz(clf_pruned, out_file=dot_data, filled=True, rounded=True, special_characters=True,feature_names = feature_cols,class_names=['0','1','2'])from pydot import graph_from_dot_data (graph, ) = graph_from_dot_data(dot_data.getvalue()) Image(graph.create_png())
preds_pruned = clf_pruned.predict(X_test) preds_pruned_train = clf_pruned.predict(X_train)print(accuracy_score(y_test,preds_pruned)) print(accuracy_score(y_train,preds_pruned_train))
The accuracy score of the model on training and test data turned out to be 0.60 and 0.62 respectively.
Feature importance refers to a class of techniques for assigning scores to input features of a predictive model that indicates the relative importance of each feature when making a prediction.
## Calculating feature importancefeat_importance = clf_pruned.tree_.compute_feature_importances(normalize=False)feat_imp_dict = dict(zip(feature_cols, clf_pruned.feature_importances_)) feat_imp = pd.DataFrame.from_dict(feat_imp_dict, orient='index') feat_imp.rename(columns = {0:'FeatureImportance'}, inplace = True) feat_imp.sort_values(by=['FeatureImportance'], ascending=False).head()
The DecisionTreeClassifier() provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Think of it as a scenario where we explicitly define the depth and the maximum leaves in the tree. However, the biggest challenge is to determine the optimum depth and leaves a tree should contain. In the example above we used max_depth=3, min_samples_leaf=5. These numbers are just example figures to see how the tree behaves. But if in reality we were asked to work on this model and come up with an optimum value for the model parameters, it is challenging but not impossible (decision tree models can be fine-tuned using GridSearchCV algorithm).
The other way of doing it is by using the Cost Complexity Pruning (CCP).
Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned (Scikit Learn, n.d.).
In simpler terms, cost complexity is a threshold value. The model split a node further into its child node only when the overall impurity of the model is improved by a value greater than this threshold else it stops.
Lower the CCP, lower is the impurity. How?
When the value of CCP is lower the model will split a node into its child nodes even if the impurity doesn’t decrease much. This is evident as the depth of the tree increases, i.e. as we go down a decision tree, we will find that split doesn’t contribute much to the change in the overall impurity of the model. However higher split ensures that the classes are categorized correctly, i.e. accuracy is more.
When CCP values are low, a higher number of nodes are created. Higher the nodes, the higher is the depth of the tree as well.
The code below (Scikit Learn, n.d.) illustrates how alpha can be tuned to obtain a model with an improved accuracy score.
path = model_gini.cost_complexity_pruning_path(X_train, y_train) ccp_alphas, impurities = path.ccp_alphas, path.impuritiesfig, ax = plt.subplots(figsize=(16,8)); ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post"); ax.set_xlabel("effective alpha"); ax.set_ylabel("total impurity of leaves"); ax.set_title("Total Impurity vs effective alpha for training set");
Let’s understand the variation of depth and the number of nodes with change in alpha.
clfs = clfs[:-1]ccp_alphas = ccp_alphas[:-1]node_counts = [clf.tree_.node_count for clf in clfs]depth = [clf.tree_.max_depth for clf in clfs]fig, ax = plt.subplots(2, 1,figsize=(16,8))ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post") ax[0].set_xlabel("alpha") ax[0].set_ylabel("number of nodes") ax[0].set_title("Number of nodes vs alpha") ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post") ax[1].set_xlabel("alpha") ax[1].set_ylabel("depth of tree") ax[1].set_title("Depth vs alpha") fig.tight_layout()
Understanding the variation in accuracy when alpha is increased.
fig, ax = plt.subplots(figsize=(16,8)); #-----------------Setting size of the canvas train_scores = [clf.score(X_train, y_train) for clf in clfs] test_scores = [clf.score(X_test, y_test) for clf in clfs]ax.set_xlabel("alpha") ax.set_ylabel("accuracy") ax.set_title("Accuracy vs alpha for training and testing sets") ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post") ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post") ax.legend() plt.show()
i = np.arange(len(ccp_alphas)) ccp = pd.DataFrame({'Depth': pd.Series(depth,index=i),'Node' : pd.Series(node_counts, index=i),\ 'ccp' : pd.Series(ccp_alphas, index = i),'train_scores' : pd.Series(train_scores, index = i), 'test_scores' : pd.Series(test_scores, index = i)})ccp.tail()ccp[ccp['test_scores']==ccp['test_scores'].max()]
The above code provides the cost computation pruning value that produces the highest accuracy in the test data.
Reference
- Raschka, S., Julian, D. and Hearty, J. (2016). Python : deeper insights into machine learning : leverage benefits of machine learning techniques using Python : a course in three modules . Birmingham, Uk: Packt Publishing, pp.83, 88, 89.
- Scikit-learn: Machine Learning in Python , Pedregosa et al. , JMLR 12, pp. 2825–2830, 2011.
- Scikit Learn (2019). sklearn.tree.DecisionTreeClassifier — scikit-learn 0.22.1 documentation . [online] Scikit-learn.org. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
- Scikit Learn (n.d.). Post pruning decision trees with cost complexity pruning . [online] Available at: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py.
以上所述就是小编给大家介绍的《Decision Tree Classifier and Cost Computation Pruning using Python》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
C语言编程:一本全面的C语言入门教程(第三版)
(美)Stephen Kochan / 张小潘 / 电子社博文视点资讯有限公司 / 2006年 / 59.00元
本书是极负盛名的C语言入门经典教材,其第一版发行至今已有20年的历史。本书内容详实全面,由浅入深,示例丰富,并在每个章节后面附有部分习题,非常适合读者自学使用。除此之外,《C语言编程》一书对于C语言标准的最新进展、C语言常见开发工具以及管理C语言大型项目等重要方面,也进行了深入浅出的说明。一起来看看 《C语言编程:一本全面的C语言入门教程(第三版)》 这本书的介绍吧!