Why Normalization?
You might be surprised at the choice of the cover image for this post, but this is how we can understand Normalization! This mighty concept helps us when we have data whose features are measured on different scales, leaving us in the lurch when we try to derive insights from such data or fit a model to it.
Much like we can’t compare the different fruits shown in the above picture on a common scale, we can’t work efficiently with data that has too many scales.
For example: see the image below and observe the scales of Salary vs. Work experience vs. Band level. Due to its much larger scale range, the attribute Salary can take precedence over the other two attributes while training the model, regardless of whether it actually carries more weight in predicting the dependent variable.
Thus, in the data pre-processing stage of data mining and model development (statistical or machine learning), it's good practice to normalize the variables and bring them onto a similar scale, if they span different ranges.
Normalization is not required for every dataset; examine your data and confirm that it needs this step before incorporating it into your procedure. Also, prefer Normalization when you are not sure whether the data distribution is Gaussian/Normal/bell-curved in nature, as it will help reduce the impact of non-Gaussian attributes on your model.
What is Normalization?
We’ll talk about two case scenarios here:
1. Your data doesn't follow a Normal/Gaussian distribution (prefer this when in doubt, too)
Data normalization, in this case, is the process of rescaling one or more attributes to the range of 0 to 1. This means that the largest value for each attribute is 1 and the smallest value is 0.
It is also known as Min-Max scaling.
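In formula terms, min-max scaling computes x_scaled = (x − min) / (max − min) for each attribute. Here is a minimal sketch of what that looks like in Python with scikit-learn's MinMaxScaler (the salary and experience numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Purely illustrative salary and work-experience values on very different scales
X = np.array([[50000.0, 2.0],
              [82000.0, 7.0],
              [120000.0, 15.0],
              [64000.0, 4.0]])

# Min-max scaling: x_scaled = (x - min) / (max - min), applied column-wise
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies in the range [0, 1]

# The same computation by hand with NumPy
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert np.allclose(X_scaled, X_manual)
```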
2. Your data follows Gaussian distribution
In this case, Normalization can be done using the formula below, where mu (μ) is the mean and sigma (σ) is the standard deviation of your sample/population:

z = (x − μ) / σ
When we normalize using this Standard score, it's also commonly known as Standardization or Z-score.
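For a concrete picture, here is a small sketch (the numbers are made up) showing standardization with scikit-learn's StandardScaler and the equivalent manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Purely illustrative scores
x = np.array([[85.0], [100.0], [115.0], [130.0], [70.0]])

# Standardization: z = (x - mu) / sigma
z = StandardScaler().fit_transform(x)

# Equivalent manual computation (StandardScaler uses the population
# standard deviation, which is also NumPy's default)
z_manual = (x - x.mean()) / x.std()
assert np.allclose(z, z_manual)
print(z.ravel())  # mean ~0, standard deviation ~1
```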
More about Z-Score
The Z-score tells us how many standard deviations away from the mean your value is.
For example —
- A Z-score of 1.5 implies the value is 1.5 standard deviations above the mean.
- A Z-score of -0.8 indicates the value is 0.8 standard deviations below the mean.
As explained above, the Z-score tells us where the value lies on a normal distribution curve. A Z-score of zero tells us the value is exactly the mean/average, while a score of +3 tells us the value is much higher than average (probably an outlier).
If you refer to my article on Normal distributions, you'll quickly understand that the Z-score converts our distribution to a Standard Normal Distribution with a mean of 0 and a standard deviation of 1.
Interpretation of Z-Score
Let’s quickly understand how to interpret a value of Z-score in terms of AUC (Area under the curve).
The Empirical rule, discussed in detail in the article on Normal distributions linked above (and restated at the end of this post), says that:
- 68% of the data lies between +1SD and -1SD
- 95% of the data lies between +2SD and -2SD
- 99.7% of the data lies between +3SD and -3SD
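If you want to sanity-check these percentages yourself, a quick sketch with scipy's standard normal CDF reproduces them (this is just a verification aid, not part of the scaling workflow):

```python
from scipy.stats import norm

# Area under the standard normal curve between -k and +k standard deviations
for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} SD: {area:.2%}")
# within ±1 SD: 68.27%
# within ±2 SD: 95.45%
# within ±3 SD: 99.73%
```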
Now, if we want to look at a customized range and calculate how much data that segment covers, Z-scores come to our rescue. Let's see how.
For example, if we want to know what percentage of the data (the probability of occurrence of a data point) lies between the negative extreme on the left and -1SD, we refer to the Z-score table linked below:
Now, we look for the value -1.00, and we can see from the snapshot below that it states 15.87% as the answer to our question.
Similarly, if we were looking for -1.25, we would get the value 10.56% (find -1.2 in the Z column and match it against the 0.05 column to make -1.25).
Common Z-score values and their results from the Z-score table, indicating how much area is covered between the negative extreme end and the chosen Z-score point, i.e. the area to the left of a Z-score:
We can use these values to calculate areas between customized ranges as well. For example, if we want the AUC between the -3 and -2.5 Z-score values, it will be (0.62 - 0.13)% = 0.49%, roughly 0.5%. Thus, this comes in very handy for problems that do not have straightforward Z-score values to interpret.
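The same lookups can be done programmatically. Here is a short sketch using scipy's norm.cdf, which returns exactly the "area to the left of a Z-score" that the table gives:

```python
from scipy.stats import norm

# norm.cdf(z) returns the area to the left of z, i.e. the z-table value
print(f"{norm.cdf(-1.00):.2%}")  # 15.87%
print(f"{norm.cdf(-1.25):.2%}")  # 10.56%

# Area between two z-scores, e.g. between -3 and -2.5
area = norm.cdf(-2.5) - norm.cdf(-3.0)
print(f"{area:.2%}")             # 0.49%, roughly 0.5%
```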
Real Life Interpretation example
Let's say we have IQ score data for a sample that we have normalized using the Z-score. To put things into perspective: if a person's IQ Z-score is 2, we see that +2 corresponds to 97.72% in the Z-score table. This implies that their IQ is higher than that of 97.72% of people, or lower than that of only 2.28% of people, implying the person you picked is really smart!!
This can be applied to almost every use case (weights, heights, salaries, immunity levels, and what not!)
In case of Confusion between Normalization and Standardization
If you have a use case in which you cannot readily decide which will be good for your model, run two iterations: one with Normalization (min-max scaling) and another with Standardization (Z-score). Then compare the two, either by plotting the distributions with a box-plot visualization to see which technique suits your data better, or best yet, by fitting your model to both versions and judging with the model validation metrics.
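As a sketch of that model-based comparison, here is one way to run both iterations with scikit-learn and judge them with cross-validation (the dataset and classifier here are stand-ins for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One iteration with min-max scaling, one with standardization
for scaler in (MinMaxScaler(), StandardScaler()):
    model = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print(type(scaler).__name__, round(scores.mean(), 4))
```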
Should we apply Normalization while using Machine learning algorithms?
Contrary to the popular belief that ML algorithms do not require Normalization, you should first take a good look at the technique your algorithm uses before making a sound decision that favors the model you are developing.
If you are using a Decision Tree, or for that matter any tree-based algorithm, you can proceed WITHOUT Normalization, because the fundamental concept of a tree revolves around making a decision at a node based on a SINGLE feature at a time; differences in scale between features therefore do not impact a tree-based algorithm. Whereas if you are using Linear Regression, Logistic Regression, Neural networks, SVM, K-NN, K-Means, or any other distance-based or gradient-descent-based algorithm, all of these are sensitive to the range of your feature scales, and applying Normalization will generally improve their performance (see the sketch below).
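Here is a minimal sketch of that sensitivity, using scikit-learn's wine dataset (whose features span very different scales) as a stand-in: the distance-based K-NN typically improves noticeably once features are scaled, while the tree barely moves.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # feature values range from ~0.1 to ~1700

models = {
    "KNN, raw features": KNeighborsClassifier(),
    "KNN, scaled features": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Tree, raw features": DecisionTreeClassifier(random_state=0),
    "Tree, scaled features": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```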
That’s all about Feature Scaling:)
Happy Learning, happy growing:)
The article on normal distributions that I referred to above in this post:
Watch this space for more on Machine learning, data analytics, and statistics!