Clearly Explained: Top 2 Types of Decision Trees - CHAID & CART



Let’s dive in to understand the CHAID Decision tree algorithm first.

CHAID: Chi-Squared Automatic Interaction Detection

This algorithm was originally proposed by Kass in 1980. As the name suggests, it is based on the chi-square statistic. A chi-square test yields a probability value (p-value) between 0 and 1. A p-value closer to 0 indicates that there is a significant difference between the two classes being compared, while a p-value closer to 1 indicates that there is no significant difference between them.
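To make the p-value behavior concrete, here is a minimal sketch using SciPy's chi2_contingency on two made-up 2x2 contingency tables (the counts are illustrative assumptions, not data from the article):

```python
# Minimal sketch: chi-square tests on two hypothetical contingency tables.
# Rows = predictor categories, columns = outcome classes (made-up counts).
import numpy as np
from scipy.stats import chi2_contingency

# Categories with very different outcome distributions -> p-value near 0
table_different = np.array([[90, 10],
                            [30, 70]])
_, p_diff, _, _ = chi2_contingency(table_different)

# Categories with nearly identical outcome distributions -> p-value near 1
table_similar = np.array([[50, 50],
                          [51, 49]])
_, p_sim, _, _ = chi2_contingency(table_similar)

print(f"p-value (different categories): {p_diff:.4f}")  # ~0.0000 -> significant
print(f"p-value (similar categories):   {p_sim:.4f}")   # ~1.0    -> not significant
```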

Variable types used in the CHAID algorithm:

Variable to be predicted, i.e., the dependent variable: Continuous OR Categorical

Independent variables: Categorical ONLY (can have more than 2 categories)

Thus, if there are continuous predictor variables, then we need to transform them into categorical variables before they can be supplied to the CHAID algorithm.
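As a concrete illustration of that transformation, here is one common approach using pandas quantile binning; the column name income, the synthetic data, and the choice of five bins are assumptions for the sketch, not requirements of CHAID:

```python
# Minimal sketch: discretize a continuous predictor into quantile bins
# so it can be fed to CHAID as a categorical variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"income": rng.normal(50_000, 15_000, size=1_000)})  # synthetic

# Equal-frequency (quantile) bins turn the continuous column into an
# ordered categorical column.
df["income_band"] = pd.qcut(
    df["income"], q=5,
    labels=["very_low", "low", "mid", "high", "very_high"])

print(df["income_band"].value_counts())  # ~200 rows per bin
```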

Statistical Tests used to determine the next best split:

Continuous Dependent Variable: F-Test (Regression Problems)

Categorical Dependent Variable: Chi-Square (Classification Problems)
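For the regression case, the F-test amounts to a one-way ANOVA comparing the target values across the categories of a predictor. Here is a minimal sketch on synthetic data (group means and sizes are made up):

```python
# Minimal sketch: one-way ANOVA F-test comparing a continuous target
# across three predictor categories (synthetic data).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, 50)  # target values in category A
group_b = rng.normal(10.2, 2.0, 50)  # barely different from A
group_c = rng.normal(14.0, 2.0, 50)  # clearly different from both

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")  # tiny p: the categories differ
```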

Let’s understand Bonferroni Adjustment/Correction before we progress further.

Bonferroni Adjustment/Correction

In statistics, the Bonferroni correction is one of several methods used to counteract the problem of multiple comparisons.

This adjustment addresses the fact that the more tests you perform, the greater the risk of a Type 1 error (false positive), i.e., it appears as if you have stumbled upon something significant when in reality you haven't.

If we take an alpha value of 0.05 and conduct a single test, there is a 95% probability that we avoid a Type 1 error. But as the number of tests grows to 100, the probability of avoiding a Type 1 error across all of them drops to roughly 0.6% (0.95^100). To counter this effect, we calculate an adjusted alpha value in tandem with the number of tests. As long as we use this new adjusted value of alpha, we are theoretically in a safe zone.

Observe the adjusted alpha value at 100 tests (0.05 / 100 = 0.0005): it has become so low that the tree will stop growing, because it will not be able to find any variables that achieve that level of super-significance.
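The arithmetic behind both observations fits in a few lines; this sketch uses the same alpha of 0.05 as the discussion above:

```python
# Minimal sketch: family-wise Type 1 error risk vs. number of tests,
# and the Bonferroni-adjusted alpha that counteracts it.
alpha = 0.05

for n_tests in (1, 10, 100):
    p_no_type1 = (1 - alpha) ** n_tests  # chance of avoiding every false positive
    adjusted_alpha = alpha / n_tests     # Bonferroni correction
    print(f"{n_tests:>3} tests: P(no Type 1 error) = {p_no_type1:.1%}, "
          f"adjusted alpha = {adjusted_alpha:.5f}")

# Output:
#   1 tests: P(no Type 1 error) = 95.0%, adjusted alpha = 0.05000
#  10 tests: P(no Type 1 error) = 59.9%, adjusted alpha = 0.00500
# 100 tests: P(no Type 1 error) = 0.6%, adjusted alpha = 0.00050
```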

Most decision-tree software gives the modeler an option to turn the Bonferroni adjustment off; the setting should generally be left on. If the tree is not growing and you would like to experiment with turning Bonferroni off, consider making the alpha value lower than the usual 0.05 to guard against the Type 1 error risk discussed above.

Also, always, always validate your tree once the modeling stage has been completed.

Under-the-Hood Process of the CHAID Algorithm

  1. Cycle through all the predictors one by one to determine the pair of (predictor) categories that is least significantly different with respect to the dependent variable. A chi-square statistic is computed for classification problems (where the dependent variable is categorical as well), and an F-test for regression problems (where the dependent variable is continuous).
  2. If the test for a given pair of predictor categories is not statistically significant (as defined by an alpha-to-merge value), the algorithm merges those predictor categories and repeats the first step (i.e., finds the next pair of categories, which now may include previously merged categories); see the merge-loop sketch after this list.
  3. If the test for the respective pair of predictor categories is significant (i.e., its p-value is less than the alpha-to-merge value), the algorithm computes a Bonferroni-adjusted p-value for the resulting set of categories of that predictor, if the setting is enabled.
  4. This step selects the split variable. The predictor variable with the smallest adjusted p-value, i.e., the predictor that yields the most significant split, is chosen for the next split in the tree. If the smallest (Bonferroni-) adjusted p-value for every predictor is greater than some alpha-to-split value, no further splits are performed and the respective node becomes a terminal node.
  5. This process will continue iteratively until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).
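To make steps 1 and 2 concrete, here is a deliberately simplified sketch of the merge loop for one nominal predictor and a categorical target. It omits the Bonferroni adjustment, the adjacent-categories constraint for ordinal predictors, and the split selection itself; the column names region and churned and all counts are hypothetical:

```python
# Simplified sketch of CHAID's category-merging loop (steps 1-2 above).
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def merge_categories(df, predictor, target, alpha_merge=0.05):
    """Greedily merge the least significantly different pair of predictor
    categories until every remaining pair differs significantly."""
    work = df[[predictor, target]].copy()
    work[predictor] = work[predictor].astype(str)

    while work[predictor].nunique() > 2:   # a split needs >= 2 categories
        best_pair, best_p = None, -1.0
        for a, b in combinations(work[predictor].unique(), 2):
            sub = work[work[predictor].isin([a, b])]
            table = pd.crosstab(sub[predictor], sub[target])
            _, p, _, _ = chi2_contingency(table)
            if p > best_p:                 # track the most similar pair
                best_pair, best_p = (a, b), p
        if best_p <= alpha_merge:          # all pairs differ significantly: stop
            break
        merged = "+".join(best_pair)       # merge the most similar pair
        work[predictor] = work[predictor].replace(
            {best_pair[0]: merged, best_pair[1]: merged})
    return work[predictor]

# Toy data: "north" and "south" behave alike, "west" does not.
df = pd.DataFrame({
    "region":  ["north"] * 40 + ["south"] * 40 + ["west"] * 40,
    "churned": [1] * 20 + [0] * 20 + [1] * 19 + [0] * 21 + [1] * 35 + [0] * 5,
})
print(merge_categories(df, "region", "churned").unique())
# Expected: ['north+south' 'west']
```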

How does CHAID handle different types of variables?

Nominal Variable: Automatically groups the categories as per step 2 above

Ordinal Variable: Automatically groups the categories as per step 2 above

Continuous Variable: Converted into segments/deciles before performing step 2

The nature of the CHAID algorithm, with its multiway splits, is to create WIDE trees.

