Understanding Feature Extraction Using Correlation Matrix and Scatter Plots

Category: IT Technology · Published: 4 years ago

The data out there in the world is vast, and it needs to be handled carefully if we want a sensible outcome from any data science approach.

This article deals with a fundamental and important concept for working with a large number of features in a given dataset.

A typical machine learning or deep learning model is built to produce an output from large amounts of data, whether structured or unstructured. The input factors (features) contribute to the required result to different degrees and with different coefficients. They need to be filtered according to how significant they are in determining the output, while also taking into account how frequently they occur.

In supervised learning, we know that there is always an output variable and n input variables. To understand this concept clearly, let’s first take the example of a simple linear regression problem and then move on to multiple regression.

In a simple linear regression model, we ultimately generate an equation of the form y = mx + c, where x is the independent variable and y is the dependent variable. Since there is only one input variable, y depends only on the value of x. In real life, however, there may be a few other external factors that we ignore, such as air resistance when calculating the average velocity of a bus travelling from A to B. These factors do affect the output, but their significance is small. In this case, common sense and experience helped us pick the factors: we kept the acceleration the driver gives the bus and ignored the air resistance. But what about complex situations where we have no idea how significant each input variable is for the output? Can mathematics solve this puzzle?
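To make the equation y = mx + c concrete, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function name `fit_line` and the toy data are assumptions made for illustration; they are not part of the original article.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical toy data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(f"y = {m:.2f} * x + {c:.2f}")
```

With noisy real data the recovered m and c would only approximate the underlying relationship, which is exactly where ignored factors such as air resistance show up as residual error.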

Yes! Here comes the concept of correlation.

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. In simple terms, it tells us how consistently one variable changes when another variable changes. It can be positive (the variables move in the same direction), negative (they move in opposite directions), or zero (no linear relationship). A correlation of large magnitude between the dependent variable and an independent variable indicates that the independent variable is highly significant in determining the output. In a multiple regression setup with many candidate factors, it is important to find the correlation between the dependent variable and each of the independent variables in order to build a viable model with higher accuracy. One must always remember that a larger number of features does not imply better accuracy. Adding features can reduce accuracy if some of them are irrelevant and only introduce noise into the model.

The correlation between two variables can be measured with various statistics, such as Pearson’s r, Kendall’s rank correlation, and Spearman’s rank correlation.
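The difference between these three statistics is easiest to see on a relationship that is monotonic but not linear. The sketch below is an illustration under stated assumptions: the helper names (`pearson`, `ranks`, `spearman`, `kendall_tau_a`) and the toy data are invented for this example, and the Kendall variant shown is the simple tau-a, which does not correct for ties.

```python
import math


def pearson(xs, ys):
    """Pearson's r from mean-centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical toy data: y grows with x, but as a cube rather than a line
x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]

r_pearson = pearson(x, y_cubic)
r_spearman = spearman(x, y_cubic)
tau = kendall_tau_a(x, y_cubic)
print(f"Pearson  r   = {r_pearson:.3f}")
print(f"Spearman rho = {r_spearman:.3f}")
print(f"Kendall  tau = {tau:.3f}")
```

The two rank-based measures report a perfect monotonic relationship, while Pearson's r falls below 1 because the points do not lie on a straight line.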

Pearson’s r is the most widely used correlation statistic for measuring the strength of the relationship between linearly related variables. The Pearson correlation between two variables x and y can be found using:
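One standard way to write the Pearson coefficient in terms of sums over the observations is given below. It is a textbook form chosen here because it uses n and i directly; it is algebraically equivalent to the mean-centred form (covariance divided by the product of the standard deviations).

```latex
r = \frac{n\sum_{i=1}^{n} x_i y_i \;-\; \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\sqrt{n\sum_{i=1}^{n} x_i^{2} - \left(\sum_{i=1}^{n} x_i\right)^{2}}\;
          \sqrt{n\sum_{i=1}^{n} y_i^{2} - \left(\sum_{i=1}^{n} y_i\right)^{2}}}
```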

Here n is the number of observations and i denotes the i-th observation.
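As a rough illustration, the Pearson coefficient can be computed directly from running sums over the n observations. The function name `pearson_r` and the three toy series are assumptions for this sketch; they show a positive, a negative, and a zero correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r using the sum-based (computational) form."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("a constant variable has no defined correlation")
    return numerator / denominator


# Hypothetical toy series
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with x
falling = [10, 8, 6, 4, 2]   # moves against x
peaked = [1, 2, 3, 2, 1]     # rises then falls symmetrically

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_peaked = pearson_r(x, peaked)
print(f"rising : {r_rising:+.3f}")
print(f"falling: {r_falling:+.3f}")
print(f"peaked : {r_peaked:+.3f}")
```

The last series is clearly related to x, yet its Pearson correlation is zero, a reminder that this statistic only captures linear relationships.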

Let us consider the 50_Startups dataset on new startups in New York, California, and Florida. The variables used in the dataset are Profit, R&D Spending, Administration Spending, and Marketing Spending. Here Profit is the dependent variable to be predicted.

Let us first apply linear regression to each independent variable separately to visualize its correlation with the dependent variable.
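Alongside the per-variable plots, the same question can be answered numerically by building a correlation matrix and ranking the features by the magnitude of their correlation with Profit. The sketch below uses small made-up numbers in arbitrary units, not the real 50_Startups data; only the column names follow the description above, and the helper `pearson` is defined here for the example.

```python
import math


def pearson(xs, ys):
    """Pearson's r from mean-centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values (NOT the real 50_Startups data)
data = {
    "R&D Spending": [10, 20, 30, 40, 50, 60],
    "Administration Spending": [50, 48, 52, 49, 51, 50],
    "Marketing Spending": [15, 30, 25, 45, 40, 60],
    "Profit": [22, 41, 63, 79, 102, 118],
}
target = "Profit"
names = list(data)

# Full correlation matrix as a nested dict: corr_matrix[a][b]
corr_matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Rank the input features by |r| against the target, strongest first
ranking = sorted(
    (name for name in names if name != target),
    key=lambda name: abs(corr_matrix[name][target]),
    reverse=True,
)

for name in ranking:
    print(f"{name:25s} r = {corr_matrix[name][target]:+.3f}")
```

On this toy data the administration column correlates only weakly with the target, so it would be the first candidate to drop. A cutoff on |r| is a judgment call rather than a fixed rule, and features that correlate strongly with each other deserve a second look as well.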

