Why Normalization?
You might be surprised at the choice of the cover image for this post, but this is how we can understand Normalization! This mighty concept helps us when we have data whose features are measured on different scales, leaving us in the lurch when we try to derive insights from such data or fit a model to it.
Much like we can’t compare the different fruits shown in the above picture on a common scale, we can’t work efficiently with data that has too many scales.
For example: see the image below and observe the scales of Salary vs. Work experience vs. Band level. Due to its much larger scale range, the attribute Salary can take precedence over the other two attributes while training the model, regardless of whether it actually carries more weight in predicting the dependent variable.
Thus, in the data pre-processing stage of data mining and model development (statistical or machine learning), it's good practice to normalize the variables and bring them onto a similar scale, if they span different ranges.
Normalization is not required for every dataset; examine your data and confirm that it needs this step before incorporating it into your procedure. Also, prefer Normalization when you are not sure whether the data distribution is Gaussian/Normal/bell-curved in nature, as it will help reduce the impact of non-Gaussian attributes on your model.
What is Normalization?
We’ll talk about two case scenarios here:
1. Your data doesn't follow a Normal/Gaussian distribution (prefer this when in doubt, too)
Data normalization, in this case, is the process of rescaling one or more attributes to the range of 0 to 1. This means that the largest value for each attribute is 1 and the smallest value is 0.
It is also known as Min-Max scaling.
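In formula terms, min-max scaling computes x_scaled = (x − min) / (max − min) for each attribute. Here is a minimal sketch of what that looks like in Python with scikit-learn's MinMaxScaler (the salary and experience numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Purely illustrative salary and work-experience values on very different scales
X = np.array([[50000.0, 2.0],
              [82000.0, 7.0],
              [120000.0, 15.0],
              [64000.0, 4.0]])

# Min-max scaling: x_scaled = (x - min) / (max - min), applied column-wise
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies in the range [0, 1]

# The same computation by hand with NumPy
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert np.allclose(X_scaled, X_manual)
```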
2. Your data follows Gaussian distribution
In this case, Normalization can be done using the formula below, where mu (μ) is the mean and sigma (σ) is the standard deviation of your sample/population:

z = (x − μ) / σ
When we normalize using this Standard score, it's also commonly known as Standardization or Z-score.
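For a concrete picture, here is a small sketch (the numbers are made up) showing standardization with scikit-learn's StandardScaler and the equivalent manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Purely illustrative scores
x = np.array([[85.0], [100.0], [115.0], [130.0], [70.0]])

# Standardization: z = (x - mu) / sigma
z = StandardScaler().fit_transform(x)

# Equivalent manual computation (StandardScaler uses the population
# standard deviation, which is also NumPy's default)
z_manual = (x - x.mean()) / x.std()
assert np.allclose(z, z_manual)
print(z.ravel())  # mean ~0, standard deviation ~1
```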
More about Z-Score
The Z-score tells us how many standard deviations away from the mean your value is.
For example —
- A Z-score of 1.5 implies the value is 1.5 standard deviations above the mean.
- A Z-score of -0.8 indicates the value is 0.8 standard deviations below the mean.
As explained above, the Z-score tells us where the value lies on a normal distribution curve. A Z-score of zero tells us the value is exactly the mean/average, while a score of +3 tells us the value is much higher than average (probably an outlier).
If you refer to my article on Normal distributions, you'll quickly understand that the Z-score converts our distribution to a Standard Normal Distribution with a mean of 0 and a standard deviation of 1.
Interpretation of Z-Score
Let’s quickly understand how to interpret a value of Z-score in terms of AUC (Area under the curve).
The Empirical rule, discussed in detail in the article on Normal distributions linked above (and restated at the end of this post), says that:
- 68% of the data lies between +1SD and -1SD
- 95% of the data lies between +2SD and -2SD
- 99.7% of the data lies between +3SD and -3SD
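If you want to sanity-check these percentages yourself, a quick sketch with scipy's standard normal CDF reproduces them (this is just a verification aid, not part of the scaling workflow):

```python
from scipy.stats import norm

# Area under the standard normal curve between -k and +k standard deviations
for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} SD: {area:.2%}")
# within ±1 SD: 68.27%
# within ±2 SD: 95.45%
# within ±3 SD: 99.73%
```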
Now, if we want to look at a customized range and calculate how much data that segment covers, Z-scores come to our rescue. Let's see how.
For example, if we want to know what percentage of the data (the probability of occurrence of a data point) lies between the negative extreme on the left and -1SD, we refer to the Z-score table linked below:
Now, we look for the value -1.00, and we can see from the snapshot below that it states 15.87% as the answer to our question.
Similarly, if we were looking for -1.25, we would get the value 10.56% (find -1.2 in the Z column and match it against the 0.05 column to make -1.25).
Common Z-score values and their results from the Z-score table, indicating how much area is covered between the negative extreme end and the chosen Z-score point, i.e. the area to the left of a Z-score:
We can use these values to calculate areas between customized ranges as well. For example, if we want the AUC between the -3 and -2.5 Z-score values, it will be (0.62 - 0.13)% = 0.49%, roughly 0.5%. Thus, this comes in very handy for problems that do not have straightforward Z-score values to interpret.
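The same lookups can be done programmatically. Here is a short sketch using scipy's norm.cdf, which returns exactly the "area to the left of a Z-score" that the table gives:

```python
from scipy.stats import norm

# norm.cdf(z) returns the area to the left of z, i.e. the z-table value
print(f"{norm.cdf(-1.00):.2%}")  # 15.87%
print(f"{norm.cdf(-1.25):.2%}")  # 10.56%

# Area between two z-scores, e.g. between -3 and -2.5
area = norm.cdf(-2.5) - norm.cdf(-3.0)
print(f"{area:.2%}")             # 0.49%, roughly 0.5%
```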
Real Life Interpretation example
Let's say we have IQ score data for a sample that we have normalized using the Z-score. To put things into perspective: if a person's IQ Z-score is 2, we see that +2 corresponds to 97.72% in the Z-score table. This implies that their IQ is higher than that of 97.72% of people, or lower than that of only 2.28% of people, implying the person you picked is really smart!!
This can be applied to almost every use case (weights, heights, salaries, immunity levels, and what not!)
In case of Confusion between Normalization and Standardization
If you have a use case in which you cannot readily decide which will be good for your model, run two iterations: one with Normalization (min-max scaling) and another with Standardization (Z-score). Then compare the two, either by plotting the distributions with a box-plot visualization to see which technique suits your data better, or best yet, by fitting your model to both versions and judging with the model validation metrics.
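As a sketch of that model-based comparison, here is one way to run both iterations with scikit-learn and judge them with cross-validation (the dataset and classifier here are stand-ins for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One iteration with min-max scaling, one with standardization
for scaler in (MinMaxScaler(), StandardScaler()):
    model = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print(type(scaler).__name__, round(scores.mean(), 4))
```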
Should we apply Normalization while using Machine learning algorithms?
Contrary to the popular belief that ML algorithms do not require Normalization, you should first take a good look at the technique your algorithm uses before making a sound decision that favors the model you are developing.
If you are using a Decision Tree, or for that matter any tree-based algorithm, you can proceed WITHOUT Normalization, because the fundamental concept of a tree revolves around making a decision at a node based on a SINGLE feature at a time; differences in scale between features therefore do not impact a tree-based algorithm. Whereas if you are using Linear Regression, Logistic Regression, Neural networks, SVM, K-NN, K-Means, or any other distance-based or gradient-descent-based algorithm, all of these are sensitive to the range of your feature scales, and applying Normalization will generally improve their performance (see the sketch below).
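Here is a minimal sketch of that sensitivity, using scikit-learn's wine dataset (whose features span very different scales) as a stand-in: the distance-based K-NN typically improves noticeably once features are scaled, while the tree barely moves.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # feature values range from ~0.1 to ~1700

models = {
    "KNN, raw features": KNeighborsClassifier(),
    "KNN, scaled features": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Tree, raw features": DecisionTreeClassifier(random_state=0),
    "Tree, scaled features": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```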
That’s all about Feature Scaling:)
Happy Learning, happy growing:)
The article on normal distributions that I referred to above in this post:
Watch this space for more on Machine learning, data analytics, and statistics!