Another Dive into PCA in a Practical View
Understanding how to use PCA for applications as a machine learning practitioner
There are thousands of articles on Towards Data Science on PCA-related topics. Well, I am contributing my view to the thousands. However, I am trying my best to explain PCA in a code-first, practical manner that may change your view on PCA. No matter if you are a beginner or a PCA master, I am sure you will find this blog refreshing and useful. Principal Component Analysis, or PCA, is a powerful tool that's widely used for data science applications, such as dimension reduction, feature extraction, and data visualization.
PCA is commonly addressed in interviews, so it's important to understand the definition of PCA:
The orthogonal projection of the data onto a lower-dimensional linear space (the principal subspace), such that the variance of the projected data is maximized. — Hotelling, 1933
This formal definition captures the essence of PCA: variance maximization in lower dimensions. Let's dive into the details of PCA using code examples.
Get Started
To get started with this post and the following examples, you will need:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from matplotlib import image

%matplotlib inline
plt.style.use('ggplot')
We will dig into three examples of PCA:
- with a mock dataset
- with the IRIS dataset
- with my IMAGE
I am sure after these three examples, you will understand PCA and know how to use it properly in applications.
With a mock dataset
Before jumping into the code, we can simplify PCA to this problem: we have some data x, and we want to find z = Wx such that dim(z) < dim(x). But how do we find W? Well, it's pretty hard to explain this in high dimensions, so let's consider a very simple example in 2D: we have a vector x in 2D space, and we have to find W such that z covers the most variance.
As you may remember from your linear algebra class, the dot product works as follows:
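Writing θ for the angle between w and x (and assuming w is a unit vector, so that z is simply the length of the projection of x onto w):

$$ z = w^\top x = \lVert w \rVert \, \lVert x \rVert \cos\theta = \lVert x \rVert \cos\theta $$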
We can see that if we want z to cover more variance, we need to rotate W so that it follows the direction of x. We will explain this in the mock example. Let's start with the code.
x = np.arange(1, 10)
y = 2*x + np.random.rand(9) * 2   # roughly linear data with a bit of noise

plt.scatter(x, y)
plt.xlabel('experience(year)')
plt.ylabel('salary(k)')
As we can see, we have some data x, and we want to find W such that z = Wx covers the most variance. If you project each point onto the line, the variance covered in direction 2 is much larger than in direction 1 (direction 1 and direction 2 refer to two possible choices of the direction of W). However, this is not yet the W that will produce the first principal component. Why? WE NEED TO STANDARDIZE FIRST! Remember, before you perform PCA, make sure you scale your data so that each column has a standard deviation of 1 and a mean of 0. Let's standardize the data with the following code:
df = pd.DataFrame({'v1': x, 'v2': y})
df = StandardScaler().fit_transform(df)
df = pd.DataFrame(df, columns=['x', 'y'])
plt.scatter(df.x, df.y)
Standardize the data using StandardScaler from sklearn.preprocessing. Now we can start finding W:
pca = PCA(n_components=1)
pca.fit(df)
pc1 = pca.transform(df)
pc1
# array([[-2.1107132 ],
#        [-1.66336608],
#        [-1.04335203],
#        [-0.53589344],
#        [-0.0428135 ],
#        [ 0.39641593],
#        [ 1.01178638],
#        [ 1.67023775],
#        [ 2.3176982 ]])

inverse = pca.inverse_transform(pc1)
inverse = pd.DataFrame(inverse, columns=['x', 'y'])
plt.scatter(df.x, df.y, label='standardized x')
plt.plot(inverse.x, inverse.y, 'b', label='w')
plt.legend()
We got W by inverse-transforming principal component one (z). Mathematically, we usually find W using a Lagrange multiplier: we maximize var(z) subject to the constraint that w has unit length (otherwise the variance could be made arbitrarily large). I am not going to dig into the math, but the solution is that w_i is the eigenvector of the covariance matrix of x corresponding to the i-th largest eigenvalue. By the way, you can also solve it with gradient descent.
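As a quick sanity check of that statement, here is a minimal sketch (assuming the standardized df and the fitted pca from the code above); the top eigenvector of the covariance matrix should match sklearn's first principal axis up to a sign flip:

cov = np.cov(df.T)                      # 2x2 covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(cov)  # np.linalg.eigh returns eigenvalues in ascending order
w = eigvecs[:, -1]                      # eigenvector corresponding to the largest eigenvalue
print(w)                                # for standardized 2D data this is roughly [0.71, 0.71], up to sign
print(pca.components_[0])               # sklearn's first principal axis, same direction as w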
To illustrate the process of finding W, I’d like to include a beautiful gif:
This is a 2D illustration of PCA; the rotating black line is W. I'd like to include a 3D illustration as well:
We can see that the red line is the W that produces PC1, the green line PC2, and the blue line PC3.
With the IRIS dataset
The Iris dataset is a very famous dataset in the machine learning community; it consists of 4 features and 1 target. However, visualizing points in 4-D is quite difficult. Instead, we can use PCA to project all the points from the 4-D space into a 2-D space and visualize them easily. First, load the dataset:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['v1','v2','v3','petal width','target'])
Next, we need to STANDARDIZE as I mentioned above. Notice that we only need to standardize the features here.
x = StandardScaler().fit_transform(df.iloc[:,:-1])
Find pc1 and pc2:
pca = PCA(n_components=2)
pca.fit(x)
pcs = pca.transform(x)
pcs = pd.DataFrame(pcs, columns=['pc1', 'pc2'])
pcs['target'] = df.target  # attach the species label to each pair of PCs
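It is also worth checking how much of the original variance these two components retain; a quick follow-up to the code above (for the standardized Iris features the total is typically around 0.96):

print(pca.explained_variance_ratio_)        # share of variance explained by pc1 and pc2 individually
print(pca.explained_variance_ratio_.sum())  # total variance retained, roughly 0.96 here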
Let’s now visualize it in 2-D:
plt.figure(figsize=(14, 9))
sns.scatterplot(x='pc1', y='pc2', hue='target', data=pcs)
We can see that each species can still be separated well when projected into a 2-D space. Another finding is that red (Iris-setosa) differs a lot from the other two species. The takeaway is that PCA can also be a tool for data visualization, though that is not its most common use.
With my Image
Yes, my image. I have a British Shorthair who would love to be the model for this post. “It’s me guys! What’s up? XD. My name is Grey because I am.”
So we will PCA-transform Mr. Grey today. PCA is pretty popular for image decomposition and compression, as it is a variance-maximization method. With the power of PCA, we can decompose a picture so that it retains the most information with the least memory required. Let's jump straight into it.
img = image.imread('mycat.png')   # read the image as an array (named img so it does not shadow the imported image module)
plt.imshow(img)
plt.title('My Cat')
shape = img.shape
def image_decomposition(img, n_components):
    pca = PCA(n_components=n_components, svd_solver='randomized')
    data = img.reshape(img.shape[0], -1)    # flatten each pixel row (width x channels) into one feature vector
    data = pca.fit_transform(data)
    print(f'With {n_components} principal components, the explained variance ratio is: {pca.explained_variance_ratio_.sum()}')
    temp = pca.inverse_transform(data)
    temp = temp.reshape(img.shape)          # reshape back to the original image dimensions
    temp = np.clip(temp, 0, 1)              # keep pixel values in the valid range for imshow
    plt.imshow(temp)
    plt.title(f'My cat with only {n_components} principal components')

image_decomposition(img, 64)
# With 64 principal components, the explained variance ratio is: 0.994881272315979
Mr. Grey still looks like Mr. Grey with 64 principal components.
image_decomposition(img, 32)
# With 32 principal components, the explained variance ratio is: 0.9856431484222412

image_decomposition(img, 16)
# With 16 principal components, the explained variance ratio is: 0.9685052037239075

image_decomposition(img, 4)
# With 4 principal components, the explained variance ratio is: 0.8802358508110046
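Rather than guessing the number of components, you can also let scikit-learn pick it for you by passing a float to n_components, which keeps just enough components to reach that explained variance ratio. A minimal sketch, assuming the img array loaded above:

pca = PCA(n_components=0.95)            # keep enough components to explain 95% of the variance
data = img.reshape(img.shape[0], -1)    # same flattening as in image_decomposition
reduced = pca.fit_transform(data)
print(pca.n_components_)                # the number of components scikit-learn actually kept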
Summary
PCA is basically linear algebra, and it's a hot topic in interviews. From this blog, you should be able to answer:
1. How does PCA work?
2. How do we find W such that the variance of z is maximized?
3. How do we project high-dimensional data into a lower-dimensional space?
4. How to use PCA on Mr. Grey or your own images :D