Visualising a high-dimensional dataset using: PCA, TSNE and UMAP
In this story, we are going to go through three dimensionality reduction techniques specifically used for Data Visualization: PCA (Principal Component Analysis), t-SNE and UMAP. We will explore them in detail using the Sign Language MNIST dataset, without going in-depth into the maths behind the algorithms.
What is Dimensionality Reduction?
Many Machine Learning problems involve thousands of features. Having such a large number of features brings along many problems, the most important ones being:
- Makes the training extremely slow
- Makes it difficult to find a good solution
This is known as the curse of dimensionality. In simple terms, Dimensionality Reduction is the process of reducing the number of features to the most relevant ones.
Reducing the dimensionality does lose some information; as with most compression processes, it comes with drawbacks. Even though training becomes faster, the system may perform slightly worse. But this is OK: “sometimes reducing the dimensionality can filter out some of the noise present and some of the unnecessary details”.
Most Dimensionality Reduction applications are used for:
- Data Compression
- Noise Reduction
- Data Classification
- Data Visualization
One of the most important aspects of Dimensionality Reduction is Data Visualization. Dropping the dimensionality down to two or three makes it possible to visualize the data on a 2D or 3D plot, meaning important insights can be gained by analysing the resulting patterns in terms of clusters and much more.
Main Approaches for Dimensionality Reduction
There are two main approaches to reducing dimensionality: Projection and Manifold Learning.
- Projection: This technique projects every high-dimensional data point onto a suitable lower-dimensional subspace in a way which approximately preserves the distances between the points.
- Manifold Learning: Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold hypothesis, or manifold assumption, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. In most cases this assumption is based on observation or experience rather than theory or pure logic.[4]
Now let's briefly explain the three techniques (PCA, t-SNE, UMAP) before jumping into the use case.
PCA
One of the best-known dimensionality reduction techniques is PCA (Principal Component Analysis). It works by identifying the hyperplane that lies closest to the data and then projecting the data onto that hyperplane while retaining most of the variation in the dataset.
Principal Components
The axis that explains the maximum amount of variance in the training set is called the first principal component.
The axis orthogonal to it is called the second principal component. As we go to higher dimensions, PCA finds a third component orthogonal to the other two, and so on. For visualization purposes we stick to 2, or at most 3, principal components.
It is very important to choose the right hyperplane, so that when the data is projected onto it, it retains the maximum amount of information about how the original data is distributed.
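As an illustrative sketch (using scikit-learn on a small synthetic array, not the dataset used later in this story), the fitted PCA object exposes these axes and the fraction of variance each one explains:
import numpy as np
from sklearn.decomposition import PCA
# Synthetic 2D data stretched much more along one direction than the other
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2)) * [3.0, 0.5]
pca_demo = PCA(n_components=2).fit(X)
print(pca_demo.components_)                # the orthogonal principal axes
print(pca_demo.explained_variance_ratio_)  # share of variance captured by each axis
For this stretched toy data, the first row of components_ points along the direction of largest spread and the first entry of explained_variance_ratio_ is close to 1.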
t-SNE (t-distributed Stochastic Neighbour Embedding)
t-SNE, or t-distributed Stochastic Neighbour Embedding, was created in 2008 by Laurens van der Maaten and Geoffrey Hinton. It is a dimensionality reduction technique particularly well suited for the visualization of high-dimensional datasets.
t-SNE takes a high-dimensional dataset and reduces it to a low-dimensional representation that retains a lot of the original information. It does so by giving each data point a location in a two- or three-dimensional map. This technique finds clusters in the data, thereby making sure that the embedding preserves the meaning in the data. t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.[2]
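As a minimal sketch (on random toy data with illustrative parameter values, not the tutorial dataset), the scikit-learn API for this looks roughly like:
import numpy as np
from sklearn.manifold import TSNE
X_toy = np.random.rand(100, 50)  # 100 samples with 50 features each
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_toy)
print(X_embedded.shape)  # (100, 2)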
For a quick visualization of this technique, refer to the animation in the tutorial linked below by Cyrille Rossant, which I highly recommend checking out.
link: https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/
UMAP (Uniform Manifold Approximation and Projection)
Uniform Manifold Approximation and Projection, created in 2018 by Leland McInnes, John Healy and James Melville, is a general-purpose manifold learning and dimension reduction algorithm.
UMAP is a nonlinear dimensionality reduction method and is very effective for visualizing clusters or groups of data points and their relative proximities.
The significant difference with t-SNE is scalability: UMAP can be applied directly to sparse matrices, thereby eliminating the need to apply any dimensionality reduction such as PCA or Truncated SVD (Singular Value Decomposition) as a prior pre-processing step.[1]
Put simply, it is similar to t-SNE but with probably higher processing speed, and therefore faster and possibly better visualization (let's find out in the tutorial below).
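To illustrate the point about sparse input, here is a minimal sketch (assuming the umap-learn package and a random scipy sparse matrix as stand-in data; it is not part of the original tutorial):
import scipy.sparse as sp
import umap
# A random sparse matrix standing in for high-dimensional sparse features
X_sparse = sp.random(500, 1000, density=0.01, format='csr', random_state=42)
embedding_sparse = umap.UMAP(n_components=2, random_state=42).fit_transform(X_sparse)
print(embedding_sparse.shape)  # (500, 2)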
Use Case
Now we are going to go through the above-mentioned use case where all three techniques will be applied: specifically, we will try to visualize a high-dimensional dataset, the Sign-Language-MNIST dataset: https://www.kaggle.com/datamunge/sign-language-mnist
import numpy as np
import pandas as pd
import time
# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap
The Data
train = pd.read_csv('/kaggle/input/sign-language-mnist/sign_mnist_test/sign_mnist_test.csv')
train.head()
# Setting the label and the feature columns
y = train.loc[:,'label'].values
x = train.loc[:,'pixel1':].values
print(np.unique(y))
There are 24 unique labels, each representing a distinct sign-language letter.
#Applying PCA
start = time.time()
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
principal = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3'])
principal.shape
After applying PCA, the new representation of the data has only 3 features, compared to the 784 features of the original x data.
The number of dimensions has been cut down drastically whilst trying to retain as much of the ‘variation’ in the information as possible.
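As a quick check (a small addition, not in the original notebook), the fitted pca object reports how much of the total variance the three retained components actually explain:
# Fraction of the total variance captured by each of the 3 principal components
print(pca.explained_variance_ratio_)
print('Total variance retained: {:.2%}'.format(pca.explained_variance_ratio_.sum()))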
Drawbacks of PCA
The main drawback of PCA is that it is highly influenced by outliers present in the data. Moreover, PCA is a linear projection , which means it can’t capture non-linear dependencies.
PCA in 2D space
# Plotting PCA 2D
plt.style.use('dark_background')
plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through PCA', fontsize=24);
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
From the 2D plot, we can see the two components definitely hold some information, especially for specific signs, but clearly not enough to set all of them apart.
PCA in 3D space
# Plotting PCA 3D
ax = plt.figure(figsize=(12,10)).gca(projection='3d')
ax.scatter(
xs=principalComponents[:, 0],
ys=principalComponents[:, 1],
zs=principalComponents[:, 2],
c=y,
cmap='gist_rainbow'
)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('Visualizing sign-language-mnist through PCA in 3D', fontsize=24);
plt.show()
t-SNE with Scikit-learn
One thing to note is that t-SNE is very computationally expensive; hence, it is mentioned in its documentation that:
“It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.”[2]
start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(x)
tsne = TSNE(random_state=42, n_components=3, verbose=0, perplexity=40, n_iter=400).fit_transform(pca_result_50)
print('Duration: {} seconds'.format(time.time() - start))
Thus, I applied PCA first, choosing to retain 50 principal components from the original data, to cut down on processing power: computing the t-SNE embedding directly on the original data would require considerably more time.
The speed of the three techniques will be analysed and compared in the following sections further down in details.
t-SNE in 2D space
#Visualising t-SNE 2D
fig = plt.figure(figsize=(12,8))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through t-SNE in 2D', fontsize=24);
plt.xlabel('tsne_1')
plt.ylabel('tsne_2')
t-SNE in 3D space
#Visualising t-SNE 3D
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne[:, 0], tsne[:, 1],tsne[:,2], c=y, cmap='gist_rainbow')
ax.set_xlabel('tsne_1')
ax.set_ylabel('tsne_2')
ax.set_zlabel('tsne_3')
plt.title('Visualizing sign-language-mnist through TSNE in 3D', fontsize=24);
plt.show()
Implementing UMAP
UMAP has several hyperparameters that can have an impact on the resulting embeddings (a short sketch of how they are set follows this list):
- n_neighbors : This parameter controls how UMAP balances local versus global structure in the data. Low values of n_neighbors force UMAP to focus on very local structure, while higher values make UMAP consider larger neighbourhoods.
- min_dist : This parameter controls how tightly UMAP is allowed to pack points together. Lower values mean the points will be clustered more closely, and vice versa.
- n_components : This parameter allows the user to determine the dimensionality of the reduced space.
- metric : This parameter controls how distance is computed in the ambient space of the input data.
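Below is a minimal sketch (with illustrative values, not the settings used later in this tutorial) of how these hyperparameters are passed to the UMAP constructor:
import umap
reducer_custom = umap.UMAP(
    n_neighbors=15,     # balance between local and global structure
    min_dist=0.1,       # how tightly points may be packed in the embedding
    n_components=2,     # dimensionality of the reduced space
    metric='euclidean'  # distance measure in the input space
)
# embedding_custom = reducer_custom.fit_transform(x)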
For more detailed information, I suggest checking out the UMAP documentation:
https://umap-learn.readthedocs.io/en/latest/
For this tutorial, I have chosen to keep the default settings apart from n_components, which I set to 3 for the 3D plot. It would be best to experiment with different hyperparameter settings to get the best out of the algorithm.
start = time.time()
reducer = umap.UMAP(random_state=42, n_components=3)
embedding = reducer.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
UMAP in 2D space
# Visualising UMAP in 2d
fig = plt.figure(figsize=(12,8))
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.xlabel('umap_1')
plt.ylabel('umap_2')
plt.title('Visualizing sign-language-mnist with UMAP in 2D', fontsize=24);
We can clearly see that UMAP does a great job of separating the signs compared to t-SNE and PCA, already in 2D space.
UMAP in 3D space
# Visualising UMAP in 3d
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1],reducer.embedding_[:, 2], c=y, cmap='gist_rainbow')
ax.set_xlabel('umap_1')
ax.set_ylabel('umap_2')
ax.set_zlabel('umap_3')
plt.title('Visualizing sign-language-mnist through UMAP in 3D', fontsize=24);
plt.show()
Comparison between the Dimension Reduction Techniques: PCA vs t-SNE vs UMAP
By comparing the visualisations produced by the three models, we can see that PCA was not able to do such a good job in differentiating the signs. This is mainly because PCA is a linear projection, which means it can’t capture non-linear dependencies.
t-SNE does a better job than PCA when it comes to visualising high-dimensional datasets. Similar hand-signs are clustered together, even though there are big agglomerates of data points on top of each other from the 2D perspective.
UMAP outperformed the other two techniques by a reasonable margin: if we look at the 2D and 3D plots, we can clearly see that the sign-language letters are separated very well compared to the first two techniques. If we applied a clustering algorithm to this embedding, we might be able to assign labels to the clusters.
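As a rough sketch of that idea (an addition to the original tutorial, using scikit-learn's KMeans with 24 clusters to match the number of sign classes), one could do something like:
from sklearn.cluster import KMeans
# Cluster the 3D UMAP embedding and compare the cluster assignments with the true labels
kmeans = KMeans(n_clusters=24, random_state=42)
cluster_labels = kmeans.fit_predict(reducer.embedding_)
print(pd.crosstab(cluster_labels, y).head())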
In terms of speed, UMAP is much faster than t-SNE. Another problem faced by the latter is the need for a prior dimensionality reduction step, without which it would take even longer to compute. PCA is the fastest of them all; however, it does not do a very good job.
Note that the above comparison considers the computation time measured on a Kaggle Kernel using their GPU.
UMAP can also be used for preprocessing, while t-SNE does not have much use outside visualisation. UMAP can often provide a better “big picture” view of the data while also preserving local neighbour relations.[3]
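As a hedged sketch of that preprocessing idea (not part of the original article), the 3D UMAP embedding computed earlier could be fed to a simple classifier instead of the 784 raw pixels:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Note: fitting UMAP on all the data before splitting is only acceptable for a quick illustration
X_tr, X_te, y_tr, y_te = train_test_split(reducer.embedding_, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print('Held-out accuracy on UMAP features: {:.3f}'.format(knn.score(X_te, y_te)))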
Summary
We have explored three dimensionality reduction techniques for data visualization (PCA, t-SNE, UMAP) and used them to visualize a high-dimensional dataset in 2D and 3D plots.
Based on this tutorial, for this particular use case we can say that:
- PCA did not work very well in categorizing the different signs (24). In general, instead of arbitrarily fixing the number of dimensions to 3, it is better to choose the number of dimensions that adds up to a sufficiently large proportion of the variance; but since this is a data visualization problem, 3 was the most reasonable choice.
- t-SNE managed to do a better job of separating the clusters; the visualization in 2D and 3D was definitely better than PCA. However, it took a very long time to compute its embeddings.
- UMAP turned out to be the most effective manifold learning technique in terms of displaying the different clusters, some of which were very well defined, and it was significantly faster than the t-SNE implementation.
References
[1] McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.
[2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding
[3] Kaggle.com (2020). Visualizing Kannada MNIST with t-SNE. Available at: https://www.kaggle.com/parulpandey/visualizing-kannada-mnist-with-t-sne
[4] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron