Another Dive into PCA in a Practical View


Understanding how to use PCA for applications as a machine learning practitioner

There are thousands of articles on Towards Data Science on PCA-related topics. Well, I am contributing my view to the thousands. However, I am trying my best to explain PCA in a code-first, practical manner that may change your view on PCA. Whether you are a beginner or a PCA master, I am sure you will find this blog refreshing and useful. Principal Component Analysis, or PCA, is a powerful tool that's widely used for data science applications such as dimension reduction, feature extraction, and data visualization.

PCA is commonly addressed in interviews; so it’s important to understand the definition of PCA:

The orthogonal projection of the data onto a lower-dimensional linear space (the principal subspace), such that the variance of the projected data is maximized. — Hotelling, 1933

This formal definition explains the essence of PCA: variance maximization in lower dimensions. Let's dive into the details of PCA using code examples.

Get Started

To get started with this post and the following examples, you will need:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from matplotlib import image
%matplotlib inline
plt.style.use('ggplot')

We will dig into three examples of PCA:

  • with a mock dataset
  • with the IRIS dataset
  • with my IMAGE

I am sure after these three examples, you will understand PCA and know how to use it properly in applications.

With a mock dataset

Before jumping into the code, we can simplify PCA to this problem: we have some data x, and we want to find z = Wx such that dim(z) < dim(x). But how do we find W? Well, it's pretty hard to explain this in high dimensions, so let's consider a very simple example in 2D: we have a vector x in 2D space, and we have to find W such that z covers the most variance.

As you may remember from your linear algebra class, the dot product works as follows: z = w·x = |w||x|cos θ, so when w has unit length, z is simply the length of the projection of x onto the direction of w.

We can see that if we want z to cover more variance, we need to rotate W so that it follows the direction along which x varies the most. We will see this in the mock example.
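To make this rotation intuition concrete, here is a small illustrative sketch of my own (not part of the original walkthrough, using the imports from the Get Started section): it sweeps a unit vector w over angles and measures the variance of the projections z = Xw on some synthetic correlated 2-D data; the variance peaks when w points along the direction in which the points are most spread out.

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.c_[x1, 2 * x1 + rng.normal(scale=0.5, size=200)]  # correlated 2-D data
X = X - X.mean(axis=0)                                   # center the data

angles = np.linspace(0, np.pi, 180)
variances = []
for theta in angles:
    w = np.array([np.cos(theta), np.sin(theta)])  # unit-length direction
    variances.append((X @ w).var())               # variance of the projections z = Xw
best = angles[int(np.argmax(variances))]
print(f'projection variance peaks at about {np.degrees(best):.1f} degrees')

With that intuition in hand, let's start with the code.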

# mock data: salary grows roughly linearly with experience, plus some noise
x = np.arange(1, 10)
y = 2 * x + np.random.rand(9) * 2
plt.scatter(x, y)
plt.xlabel('experience (year)')
plt.ylabel('salary (k)')

As we can see, we have some data x, and we want to find W such that z = Wx covers the most variance. If you project each point onto the line, the variance covered in direction 2 is much larger than in direction 1. Direction 1 and direction 2 refer to two possible choices of the direction of W. However, this is not yet the W that will produce the first principal component. Why? WE NEED TO STANDARDIZE FIRST! Remember, before you perform PCA, make sure you scale your data such that each column has a standard deviation of 1 and a mean of 0. Let's standardize the data with the following code:

df = pd.DataFrame({'v1': x, 'v2': y})
df = StandardScaler().fit_transform(df)
df = pd.DataFrame(df, columns=['x', 'y'])
plt.scatter(df.x, df.y)

We standardize the data using StandardScaler from sklearn.preprocessing. Now we can start finding W:

pca = PCA(n_components=1)
pca.fit(df)
pc1 = pca.transform(df)   # project the standardized 2-D points onto the first principal component
pc1
# array([[-2.1107132 ],
#       [-1.66336608],
#       [-1.04335203],
#       [-0.53589344],
#       [-0.0428135 ],
#       [ 0.39641593],
#       [ 1.01178638],
#       [ 1.67023775],
#       [ 2.3176982 ]])
inverse = pca.inverse_transform(pc1)   # map the 1-D scores back into the original 2-D space
inverse = pd.DataFrame(inverse, columns=['x', 'y'])
plt.scatter(df.x, df.y, label='standardized x')
plt.plot(inverse.x, inverse.y, 'b', label='w')
plt.legend()

We got W by inverse transforming principal component one (z). Mathematically, we usually find W using a Lagrange multiplier. I am not going to dig into the math, but the solution to maximizing var(z) given x is: w_i is the eigenvector of the covariance matrix of x corresponding to the i-th largest eigenvalue. By the way, you can also solve it by gradient descent.
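To connect this with the eigenvector view, here is a minimal sketch of my own (assuming the standardized df and the fitted pca from above) that finds W directly from the covariance matrix of the data:

cov = np.cov(df.T)                       # 2x2 covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
w1 = eigvecs[:, np.argmax(eigvals)]      # eigenvector with the largest eigenvalue
print(w1)               # should match pca.components_[0] up to a sign flip
print(df.values @ w1)   # projections onto that direction, matching pc1 up to the same sign flip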

To illustrate the process of finding W, I’d like to include a beautiful gif:

This is a 2D illustration of PCA; the rotating black line is W. I'd like to include a 3D illustration as well:

We can see that the red line is the W that produces PC1, the green line PC2, and the blue line PC3.
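To make this reproducible, here is a minimal sketch of my own on synthetic 3-D data (not the author's figure) that draws a similar picture: the red, green, and blue arrows are the directions that produce PC1, PC2, and PC3, scaled by their explained variance.

rng = np.random.default_rng(1)
X3 = rng.normal(size=(300, 3)) @ np.array([[3, 1, 0.5],
                                           [0, 2, 0.5],
                                           [0, 0, 1.0]])   # synthetic 3-D data with unequal spread
X3 = X3 - X3.mean(axis=0)

pca3 = PCA(n_components=3).fit(X3)

ax = plt.axes(projection='3d')
ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], alpha=0.2)
for length, vector, color in zip(pca3.explained_variance_, pca3.components_,
                                 ['r', 'g', 'b']):
    v = vector * 3 * np.sqrt(length)   # scale each direction for visibility
    ax.quiver(0, 0, 0, v[0], v[1], v[2], color=color, linewidth=2)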

With the IRIS dataset

The Iris dataset is a very famous dataset in the machine learning community; it consists of 4 features and 1 target. However, visualizing points in a 4-D space is quite difficult. Instead, we can use PCA to project all the points from the 4-D space into a 2-D space and visualize them easily. First, load the dataset:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['v1','v2','v3','petal width','target'])

Next, we need to STANDARDIZE as I mentioned above. Notice that we only need to standardize the features here.

x = StandardScaler().fit_transform(df.iloc[:,:-1])

Find pc1 and pc2 :

pca = PCA(n_components=2)
pca.fit(x)
pcs = pca.transform(x)
pcs = pd.DataFrame(pcs, columns=['pc1', 'pc2'])
pcs['target'] = df.target  # attach the target label to each pair of pcs
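As a quick check of my own (not in the original walkthrough), we can ask how much of the total variance these two components retain:

print(pca.explained_variance_ratio_)        # variance share of pc1 and pc2 individually
print(pca.explained_variance_ratio_.sum())  # total variance retained in the 2-D projection

For the standardized Iris features, the first two components retain roughly 95% of the variance, which is why the 2-D picture below still separates the species cleanly.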

Let’s now visualize it in 2-D:

plt.figure(figsize=(14, 9))
sns.scatterplot(x=pcs.pc1, y=pcs.pc2, hue=pcs.target)

We can see that each species can still be separated well when projected into a 2-D space. Another finding is that red (Iris-setosa) differs a lot from the other two species. The takeaway is that PCA can also serve as a tool for data visualization, though this use is less common.

With my Image

Yes, my image. I have a British Shorthair who would love to be the model for this post. “It’s me guys! What’s up? XD. My name is Grey because I am.”

Mr. Grey, Ovuvuẹnvuẹnvuẹn Eyẹntuẹnwẹnvuẹn Ugbẹn’ugbẹn Osas

So we will PCA-transform Mr. Grey today. PCA is pretty popular in image decomposition as it's a variance maximization method. With the power of PCA, we can decompose pictures easily so that they retain the most information with the least memory required. Let's jump straight into it.

image = image.imread('mycat.png')   # note: this rebinds `image` from the matplotlib module to the pixel array
plt.imshow(image)
plt.title('My Cat')
shape = image.shape

Still cute in Python Matplotlib

def image_decomposition(image, n_components):
    pca = PCA(n_components=n_components, svd_solver='randomized')
    data = image.reshape(image.shape[0], -1)   # flatten each pixel row (all columns and channels) into one long vector
    data = pca.fit_transform(data)
    print(f'With {n_components} principal components, the explained variance ratio is: {pca.explained_variance_ratio_.sum()}')
    temp = pca.inverse_transform(data)
    temp = temp.reshape(shape)
    plt.imshow(temp)
    plt.title(f'My cat with only {n_components} principal components')

image_decomposition(image, 64)
# With 64 principal components, the explained variance ratio is: 0.994881272315979

Remains my cuteness with 64 pcs

Mr. Grey still looks like Mr. Grey with 64 principal components.
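For a rough sense of the memory saving mentioned above, here is a back-of-the-envelope sketch of my own (assuming the loaded image is a color array of shape (H, W, C)): the raw picture needs about H × (W·C) values, while the compressed version needs roughly H × k values for the scores plus k × (W·C) values for the components, where k is n_components.

H, W, C = image.shape
k = 64
full_size = H * W * C                # values stored for the raw pixel array
compressed_size = H * k + k * W * C  # PCA scores plus the component matrix (mean vector ignored)
print(f'rough compression ratio: {full_size / compressed_size:.1f}x')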

image_decomposition(image, 32)
# With 32 principal components, the explained variance ratio is: 0.9856431484222412

EHH still rocking my baby face

image_decomposition(image, 16)
# With 16 principal components, the explained variance ratio is: 0.9685052037239075

Losing it…

image_decomposition(image, 4)
# With 4 principal components, the explained variance ratio is: 0.8802358508110046

Oh well…

Summary

PCA is basically linear algebra, and it's a hot topic in interviews. From this blog, you should be able to answer: 1. How does PCA work? 2. How do we find W such that the variance of z is maximized? 3. How do we project high-dimensional data into a lower-dimensional space? 4. How to use PCA on Mr. Grey or your own images :D.

