Let’s dive in to understand the CHAID Decision tree algorithm first.
CHAID: Chi-Squared Automatic Interaction Detection
This algorithm was originally proposed by Kass in 1980. As the name suggests, it is based on the chi-square statistic. A chi-square test yields a probability value (p-value) lying anywhere between 0 and 1. A p-value closer to 0 indicates that there is a significant difference between the two classes being compared, while a p-value closer to 1 indicates that there is no significant difference between them.
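To make that interpretation concrete, here is a minimal sketch using scipy on two made-up contingency tables (rows are the predictor categories being compared, columns are the classes of the dependent variable; all counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Two predictor categories with sharply different class distributions:
# the p-value comes out close to 0 (significant difference).
different = np.array([[90, 10],
                      [15, 85]])
_, p_diff, _, _ = chi2_contingency(different)
print(f"p-value (different distributions): {p_diff:.4f}")  # ~0.0000

# Two predictor categories with nearly identical class distributions:
# the p-value comes out close to 1 (no significant difference).
similar = np.array([[50, 50],
                    [51, 49]])
_, p_sim, _, _ = chi2_contingency(similar)
print(f"p-value (similar distributions): {p_sim:.4f}")     # ~1.0
```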
Variable types used in the CHAID algorithm:
Variable to be predicted, i.e. the dependent variable: Continuous OR Categorical
Independent variables: Categorical ONLY (can have more than two categories)
Thus, if there are continuous predictor variables, then we need to transform them into categorical variables before they can be supplied to the CHAID algorithm.
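As a hypothetical sketch of that pre-processing step, a continuous predictor can be discretized with equal-frequency binning in pandas (the column name, values, and number of bins are all illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 51, 63, 29, 44, 58, 71, 38]})

# Equal-frequency binning; CHAID software often uses deciles (q=10),
# but 4 bins keep this toy output readable.
df["age_band"] = pd.qcut(df["age"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df.sort_values("age"))
```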
Statistical tests used to determine the next best split:
Continuous dependent variable: F-test (Regression Problems)
Categorical dependent variable: Chi-square test (Classification Problems)
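For the regression case, here is a minimal sketch of the F-test with scipy, comparing a continuous dependent variable across three hypothetical predictor categories (all numbers are made up):

```python
from scipy.stats import f_oneway

# Continuous target values observed within each predictor category:
target_in_cat_a = [10.2, 11.1, 9.8, 10.5]
target_in_cat_b = [10.4, 9.9, 10.8, 10.1]
target_in_cat_c = [15.3, 16.0, 14.8, 15.5]

f_stat, p_value = f_oneway(target_in_cat_a, target_in_cat_b, target_in_cat_c)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
# A small p-value means the category means differ significantly; a pair
# like a and b, taken alone, would yield a large p-value and be merged.
```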
Let’s understand Bonferroni Adjustment/Correction before we progress further.
Bonferroni Adjustment/Correction
In statistics, the Bonferroni correction is one of several methods used to counteract the problem of multiple comparisons.
This adjustment addresses the fact that the more tests you perform, the greater the risk of a Type 1 error (false positive), i.e. it appears as if you have stumbled upon something significant when in reality you haven't.
If we take an alpha value of 0.05 and conduct a single test, we have a 95% probability of avoiding a Type 1 error. As we increase the number of tests to 100, we are left with only a 0.6% probability (0.95^100 ≈ 0.006) of avoiding a Type 1 error across all of them. To counter this effect, we calculate an adjusted alpha value in tandem with the number of tests: under the Bonferroni correction, the alpha for each individual test becomes the overall alpha divided by the number of tests. As long as we use this adjusted value of alpha, we are theoretically in a safe zone.
Observe the adjusted alpha value at 100 tests: 0.05 / 100 = 0.0005. It has become so low that the tree will stop growing, because no variable will be able to achieve that level of super-significance.
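These numbers are plain arithmetic and can be reproduced in a few lines; nothing here depends on any particular decision-tree library:

```python
alpha = 0.05

for n_tests in (1, 10, 100):
    # Probability of avoiding a Type 1 error across ALL tests when each
    # test is run at the unadjusted alpha:
    p_safe = (1 - alpha) ** n_tests
    # Bonferroni-adjusted alpha for each individual test:
    adjusted_alpha = alpha / n_tests
    print(f"{n_tests:>3} tests: P(no Type 1 error) = {p_safe:.4f}, "
          f"adjusted alpha = {adjusted_alpha:.4f}")

# Output:
#   1 tests: P(no Type 1 error) = 0.9500, adjusted alpha = 0.0500
#  10 tests: P(no Type 1 error) = 0.5987, adjusted alpha = 0.0050
# 100 tests: P(no Type 1 error) = 0.0059, adjusted alpha = 0.0005
```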
Generally, decision-tree software gives the modeler an option to turn the Bonferroni adjustment off; the setting should normally be left on. If the tree is not growing and you would like to experiment by turning Bonferroni off, consider lowering the alpha value below the usual 0.05 to guard against the Type 1 error risk discussed above.
Also, always, always validate your tree once the modeling stage has been completed.
Under-the-hood process of the CHAID algorithm
- Cycle through the predictors one by one to determine the pair of (predictor) categories that is least significantly different with respect to the dependent variable. A chi-square statistic is computed for classification problems (where the dependent variable is also categorical), and an F-test for regression problems (where the dependent variable is continuous).
- If the test for a given pair of predictor categories is not statistically significant, as defined by an alpha-to-merge value, then merge the respective predictor categories and repeat this step (i.e., find the next pair of categories, which may now include previously merged categories).
- If the test for the respective pair of predictor categories is statistically significant (p-value less than the alpha-to-merge value), then compute a Bonferroni-adjusted p-value for the resulting set of categories of the respective predictor, provided the adjustment setting is enabled.
- This step selects the split variable: the predictor with the smallest adjusted p-value, i.e., the predictor that yields the most significant split, is chosen for the next split in the tree. If the smallest (Bonferroni-)adjusted p-value of every predictor is greater than the alpha-to-split value, then no further splits are performed and the respective node becomes a terminal node.
- This process will continue iteratively until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).
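To make the merging loop (steps 1 and 2) concrete, below is a simplified, hypothetical sketch for a classification problem. It greedily merges the least significantly different pair of categories until every remaining pair differs significantly. The function name and data are made up, and real CHAID implementations add the Bonferroni adjustment, adjacent-only merging for ordinal predictors, and the split-selection logic, so treat this as an illustration of the idea rather than a faithful implementation:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def merge_categories(x, y, alpha_to_merge=0.05):
    """Greedily merge categories of predictor `x` that are least
    significantly different with respect to categorical target `y`."""
    groups = {cat: [cat] for cat in x.unique()}  # each category on its own
    while len(groups) > 2:
        best_pair, best_p = None, -1.0
        for a, b in combinations(groups, 2):
            mask = x.isin(groups[a] + groups[b])
            in_a = x[mask].isin(groups[a])      # True = group a, False = group b
            table = pd.crosstab(in_a, y[mask])  # 2 x k contingency table
            _, p, _, _ = chi2_contingency(table)
            if p > best_p:                      # track the LEAST significant pair
                best_pair, best_p = (a, b), p
        if best_p <= alpha_to_merge:            # every pair differs significantly
            break
        a, b = best_pair
        groups[a].extend(groups.pop(b))         # merge the least different pair
    return list(groups.values())

# Toy usage with made-up data: "north"/"east" and "south"/"west" share
# identical class distributions, so each pair ends up merged.
def make_rows(region, n_yes, n_no):
    return [(region, "yes")] * n_yes + [(region, "no")] * n_no

rows = (make_rows("north", 20, 10) + make_rows("south", 10, 20) +
        make_rows("east", 20, 10) + make_rows("west", 10, 20))
df = pd.DataFrame(rows, columns=["region", "bought"])
print(merge_categories(df["region"], df["bought"]))
# [['north', 'east'], ['south', 'west']]
```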
How does CHAID handle different types of variables?
Nominal variable: automatically groups the categories as per step 2 above
Ordinal variable: automatically groups the categories as per step 2 above
Continuous variable: converted into segments/deciles before performing step 2
The nature of the CHAID algorithm is to create WIDE trees, since a single node can be split into more than two child nodes (multiway splits).