Let’s dive in to understand the CHAID Decision tree algorithm first.
CHAID- Chi-Squared Automatic Interaction Detection
This algorithm was originally proposed by Kass in 1980. As is evident from its name, it is based on the chi-square statistic. A chi-square test yields a probability value (p-value) lying anywhere between 0 and 1. A p-value closer to 0 indicates that there is a significant difference between the two classes being compared, while a p-value closer to 1 indicates that there is no significant difference between them.
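To make this concrete, here is a minimal sketch of a chi-square test of independence in Python with scipy; the contingency table below is made-up toy data, not taken from the article.

```python
# Chi-square test of independence on a toy 2x2 contingency table.
# Rows: two predictor categories; columns: counts of the two target classes.
from scipy.stats import chi2_contingency

observed = [[90, 10],
            [40, 60]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
# A p-value near 0 suggests the class distributions differ significantly
# across the two categories; a p-value near 1 suggests they do not.
```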
Variable types used in the CHAID algorithm:
Dependent variable (i.e., the variable to be predicted): Continuous OR Categorical
Independent variables: Categorical ONLY (can have more than 2 categories)
Thus, if there are continuous predictor variables, then we need to transform them into categorical variables before they can be supplied to the CHAID algorithm.
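As an illustration of that transformation, here is one common way to bin a continuous predictor with pandas; the column name 'age' and the choice of four quantile bins are assumptions made for the example.

```python
# Convert a continuous predictor into a categorical one via quantile binning.
import pandas as pd

df = pd.DataFrame({"age": [22, 25, 31, 38, 45, 52, 58, 63, 70, 77]})

# qcut splits on quantiles, so each band holds roughly the same number of rows.
df["age_band"] = pd.qcut(df["age"], q=4, labels=["q1", "q2", "q3", "q4"])
print(df)
```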
Statistical Tests used to determine the next best split:
Continuous Dependent Variable: F-Test (Regression Problems)
Categorical Dependent Variable: Chi-Square Test (Classification Problems)
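Both tests are available in scipy; the following is a minimal sketch with made-up sample values, showing which test pairs with which kind of dependent variable.

```python
from scipy.stats import chi2_contingency, f_oneway

# Regression case: continuous target, one-way F-test across two
# predictor categories.
group_a = [10.2, 11.5, 9.8, 10.9]
group_b = [14.1, 13.7, 15.0, 14.4]
f_stat, p_regression = f_oneway(group_a, group_b)

# Classification case: categorical target, chi-square test on the
# category-by-class contingency table.
observed = [[30, 20],
            [10, 40]]
_, p_classification, _, _ = chi2_contingency(observed)

print(f"F-test p = {p_regression:.4f}, chi-square p = {p_classification:.4f}")
```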
Let’s understand Bonferroni Adjustment/Correction before we progress further.
Bonferroni Adjustment/Correction
In statistics, the Bonferroni correction is one of several methods used to counteract the problem of multiple comparisons.
This adjustment tackles the fact that the more tests you perform, the greater the risk of a Type 1 error (false positive), i.e., it appears as if you have stumbled upon something significant when in reality you haven't.
With an alpha value of 0.05 and a single test, we have a 95% confidence level, i.e., a 95% probability of avoiding a Type 1 error. As the number of tests grows to 100, however, that probability shrinks to roughly 0.6% (0.95^100 ≈ 0.006). To counter this effect, we calculate an adjusted alpha value in tandem with the number of tests: alpha divided by the number of tests. As long as we use this adjusted value of alpha, we are theoretically in a safe zone.
Observe the adjusted alpha value at 100 tests (0.05 / 100 = 0.0005): it has become so low that the tree will stop growing, because it will not be able to find any variables that achieve that level of significance.
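The arithmetic behind these numbers can be verified with a few lines of Python:

```python
# Familywise error arithmetic behind the Bonferroni correction.
alpha = 0.05

for n_tests in (1, 10, 100):
    p_no_error = (1 - alpha) ** n_tests   # chance of zero false positives
    adjusted_alpha = alpha / n_tests      # Bonferroni-adjusted threshold
    print(f"{n_tests:>3} tests: P(no Type 1 error) = {p_no_error:.4f}, "
          f"adjusted alpha = {adjusted_alpha:.4f}")

# 100 tests: P(no Type 1 error) = 0.0059 (about 0.6%), adjusted alpha = 0.0005
```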
Generally, all decision-tree software gives the modeler an option to turn the Bonferroni adjustment off; it should be left on. If the tree is not growing and you want to experiment by turning the adjustment off, consider making the alpha value lower than the usual 0.05 to guard against the Type 1 error risk discussed above.
Also, always, always validate your tree once the modeling stage has been completed.
Under-the-hood process of the CHAID algorithm
- Iterate cyclically through all the predictors one by one to determine the pair of (predictor) categories which is least significantly different with respect to the dependent variable. A chi-square statistic will be computed for classification problems (where the dependent variable is categorical as well), and an F-test for regression problems (where the dependent variable is continuous).
- If the respective test for a given pair of predictor categories is not statistically significant, as defined by an alpha-to-merge value, then it will merge the respective predictor categories and repeat the first step (i.e., find the next pair of categories, which now may include previously merged categories).
- If the test for the respective pair of predictor categories is significant (i.e., the p-value is less than the respective alpha-to-merge value), then it will compute a Bonferroni-adjusted p-value for the set of categories for the respective predictor, provided that setting is enabled.
- This step is about selecting the split variable. The predictor variable with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split, will be considered for the next split in the tree. If the smallest (Bonferroni-)adjusted p-value for any predictor is greater than some alpha-to-split value, then no further splits will be performed, and the respective node will become a terminal node.
- This process will continue iteratively until no further splits can be performed (given the alpha-to-merge and alpha-to-split values); a simplified sketch of the merge step appears below.
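To ground the merge step, here is a deliberately simplified sketch in Python. It handles only a nominal predictor with a categorical target, uses raw chi-square p-values, and omits the Bonferroni adjustment and the split-selection step, so it illustrates the idea rather than reproducing full CHAID; all names are illustrative.

```python
# Simplified CHAID-style merge step: repeatedly merge the pair of predictor
# categories whose class distributions are LEAST significantly different,
# until every remaining pair differs significantly.
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def merge_categories(df, predictor, target, alpha_to_merge=0.05):
    # Start with every original category in its own group.
    groups = {c: [c] for c in df[predictor].unique()}
    while len(groups) > 1:
        best_pair, best_p = None, -1.0
        for g1, g2 in combinations(groups, 2):
            # Restrict to rows belonging to the two candidate groups and
            # relabel each row by its group.
            sub = df[df[predictor].isin(groups[g1] + groups[g2])]
            labels = sub[predictor].apply(lambda c: g1 if c in groups[g1] else g2)
            table = pd.crosstab(labels, sub[target])
            _, p, _, _ = chi2_contingency(table)
            if p > best_p:                 # least significant difference so far
                best_pair, best_p = (g1, g2), p
        if best_p <= alpha_to_merge:       # all pairs differ significantly: stop
            break
        keep, drop = best_pair
        groups[keep] = groups[keep] + groups.pop(drop)
    return groups
```

For example, merge_categories(df, "region", "churned") would return a dict mapping each surviving group to the list of original region codes it absorbed.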
How does CHAID handle different types of variables?
Nominal Variable: Automatically merges categories as per step 2 above (any pair of categories may be merged)
Ordinal Variable: Automatically merges categories as per step 2 above (only adjacent categories may be merged, preserving the order)
Continuous Variable: Converted into segments/deciles before step 2 is performed
Because CHAID allows multiway splits (a node may split into more than two children), the nature of the algorithm is to create WIDE trees.