Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP

Category: IT Technology · Published: 4 years ago


Visualising a high-dimensional dataset using: PCA, TSNE and UMAP

May 31 · 10 min read

Photo by Hin Bong Yeung on Unsplash

In this story, we will go through three dimensionality reduction techniques commonly used for data visualization: PCA (Principal Component Analysis), t-SNE and UMAP. We will explore them in detail using the Sign Language MNIST dataset, without going in depth into the maths behind the algorithms.

What is Dimensionality Reduction?

Many machine learning problems involve thousands of features. Having such a large number of features brings along many problems, the most important being:

  • It makes training extremely slow
  • It makes it difficult to find a good solution

This is known as the curse of dimensionality. Put simply, dimensionality reduction is the process of reducing the number of features to the most relevant ones.

Like most compression processes, reducing the dimensionality loses some information: training gets faster, but the system may perform slightly worse. This is often acceptable, because “sometimes reducing the dimensionality can filter out some of the noise present and some of the unnecessary details”.

Most Dimensionality Reduction applications are used for:

  • Data Compression
  • Noise Reduction
  • Data Classification
  • Data Visualization

One of the most important applications of dimensionality reduction is data visualization. Dropping the dimensionality down to two or three makes it possible to show the data on a 2D or 3D plot, so important insights can be gained by analysing the resulting patterns, such as clusters.

Main Approaches for Dimensionality Reduction

There are two main approaches to reducing dimensionality: projection and manifold learning.

  • Projection: This technique projects every high-dimensional data point onto a suitable lower-dimensional subspace in a way that approximately preserves the distances between the points.
  • Manifold learning: Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold hypothesis (or manifold assumption), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. In most cases this assumption is based on observation or experience rather than on theory or pure logic.[4]
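
To make the difference between the two approaches a bit more concrete, here is a minimal sketch on a toy "swiss roll" (a 2D sheet rolled up in 3D). Note the assumptions: the swiss roll data and Isomap are not used anywhere else in this story; Isomap is simply swapped in as a stand-in manifold learner that ships with scikit-learn, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# A 2-D sheet rolled up in 3-D; t is each point's position along the roll.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Projection: a linear map onto the 2-D plane of maximum variance.
X_proj = PCA(n_components=2).fit_transform(X)

# Manifold learning: Isomap estimates distances along the sheet itself.
X_mani = Isomap(n_neighbors=10, n_components=2).fit_transform(X)


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


print("PCA    |corr(axis 1, t)|:", round(abs_corr(X_proj[:, 0], t), 3))
print("Isomap |corr(axis 1, t)|:", round(abs_corr(X_mani[:, 0], t), 3))
```

If the manifold learner manages to unroll the sheet, its first axis should track the position along the roll more closely than the first axis of the linear projection does.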

Now let's briefly explain the three techniques (PCA, t-SNE and UMAP) before jumping into the use case.

PCA

One of the best-known dimensionality reduction techniques is PCA (Principal Component Analysis). It works by identifying the hyperplane that lies closest to the data and then projecting the data onto that hyperplane while retaining most of the variation in the data set.

Principal Components

The axis that explains the maximum amount of variance in the training set is called the first principal component.

The axis orthogonal to it that explains the largest amount of the remaining variance is called the second principal component. In higher dimensions, PCA would find a third component orthogonal to the first two, and so on. For visualization purposes we always stick to two or at most three principal components.


Source: Packt_Pub, via: Hackernoon

It is very important to choose the right hyperplane so that, when the data is projected onto it, it retains the maximum amount of information about how the original data is distributed.
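
The idea of principal components described above can be sketched with plain NumPy. This is only an illustration on made-up toy data (the array X and its scaling are my assumptions), using the singular value decomposition of the centred data rather than scikit-learn's PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 3-D with very different spread per axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# PCA works on centred data.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(X) - 1)

# Project onto the first two principal components.
X_2d = Xc @ Vt[:2].T

print("variance per component:", explained_variance)
print("axes orthonormal:", np.allclose(Vt @ Vt.T, np.eye(3)))
```

The rows of Vt play the role of the first, second and third principal components: they are mutually orthogonal, and each one explains no more variance than the one before it.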

t-SNE (t-distributed stochastic neighbour embedding)

t-SNE, or t-distributed stochastic neighbour embedding, was created in 2008 by Laurens van der Maaten and Geoffrey Hinton as a dimensionality reduction technique that is particularly well suited to the visualization of high-dimensional datasets.

t-SNE takes a high-dimensional data set and reduces it to a low-dimensional map that retains a lot of the original information. It does so by giving each data point a location in a two- or three-dimensional map. The technique reveals clusters in the data, which helps the embedding preserve the meaningful structure of the data: t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.[2]

For a quick visualization of this technique, refer to the animation below. It is taken from an excellent tutorial by Cyrille Rossant, which I highly recommend.

link: https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/


Source: Cyrille Rossant, via O'Reilly
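
As a quick illustration of "similar instances close, dissimilar instances apart", here is a small sketch on synthetic data. The two Gaussian groups and the parameter values are assumptions chosen only to keep the example fast; they are not part of the use case later on.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated groups of 100 points each in 50 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 50))
group_b = rng.normal(loc=8.0, scale=1.0, size=(100, 50))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 100 + [1] * 100)

# Embed into 2-D; perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(embedding.shape)
for k in (0, 1):
    print("group", k, "centre:", embedding[labels == k].mean(axis=0))
```

Plotting embedding coloured by labels should show the two groups as separate islands, which is the behaviour the animation above illustrates on real data.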

UMAP (Uniform Manifold Approximation and Projection)

Uniform Manifold Approximation and Projection, created in 2018 by Leland McInnes, John Healy and James Melville, is a general-purpose manifold learning and dimension reduction algorithm.

UMAP is a nonlinear dimensionality reduction method and is very effective for visualizing clusters or groups of data points and their relative proximities.

The significant difference from t-SNE is scalability: UMAP can be applied directly to sparse matrices, thereby eliminating the need to apply a dimensionality reduction method such as PCA or Truncated SVD (Singular Value Decomposition) as a pre-processing step.[1]

Put simply, it is similar to t-SNE but probably faster, and it probably gives a better visualization. (Let's find out in the tutorial below.)

Use Case

Now we are going to go through the use case mentioned above, applying all three techniques: specifically, we will try to visualize a high-dimensional dataset, the Sign-Language-MNIST dataset: https://www.kaggle.com/datamunge/sign-language-mnist

(Sign-Language-MNIST Dataset), screenshot from kaggle.com
import numpy as np
import pandas as pd
import time
# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap

The Data

train = pd.read_csv('/kaggle/input/sign-language-mnist/sign_mnist_test/sign_mnist_test.csv')
train.head()


Size of the train Data
# Setting the label and the feature columns
y = train.loc[:,'label'].values
x = train.loc[:,'pixel1':].values
print(np.unique(y))
The number of unique labels

There are 24 unique labels, one for each distinct hand sign in the dataset (the letters J and Z are excluded because they require motion).
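
If you do not have the Kaggle file at hand, the label/feature split above can be tried on a stand-in frame. Everything in this sketch is hypothetical: the random pixel values, the 100 rows and the names demo, x_demo and y_demo are made up; only the column layout ('label' followed by 'pixel1' to 'pixel784') mirrors the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_pixels = 100, 784  # 28 x 28 greyscale images, flattened

# Hypothetical stand-in for the Kaggle CSV: a 'label' column followed by pixel1..pixel784.
pixel_cols = [f"pixel{i}" for i in range(1, n_pixels + 1)]
demo = pd.DataFrame(rng.integers(0, 256, size=(n_rows, n_pixels)), columns=pixel_cols)
demo.insert(0, "label", rng.integers(0, 25, size=n_rows))

# Same split as above: labels in y, all pixel columns in x.
y_demo = demo.loc[:, "label"].values
x_demo = demo.loc[:, "pixel1":].values

# Count how many rows fall under each label.
labels, counts = np.unique(y_demo, return_counts=True)
print(x_demo.shape, y_demo.shape)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Passing return_counts=True to np.unique is a cheap way to check how balanced the classes are before plotting anything.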

# Applying PCA
start = time.time()
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
principal = pd.DataFrame(data=principalComponents,
                         columns=['principal component 1', 'principal component 2', 'principal component 3'])
principal.shape

After applying PCA, the data has only 3 features (the three principal components) compared with the 784 features of the x data.

The number of dimensions has been cut down drastically whilst trying to retain as much of the ‘variation’ in the information as possible.
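
To see how much of the variation is actually retained, scikit-learn exposes explained_variance_ratio_ on a fitted PCA. The sketch below uses the small digits dataset bundled with scikit-learn as a stand-in (an assumption, so that it runs without the Kaggle file); on the sign-language data you would inspect the pca object fitted above in the same way.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 1,797 digit images with 64 features each (bundled with scikit-learn).
X_digits, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=3)
X_reduced = pca_demo.fit_transform(X_digits)

# Fraction of the total variance captured by each retained component.
ratios = pca_demo.explained_variance_ratio_
print("shape after PCA:", X_reduced.shape)
print("variance ratio per component:", ratios)
print("total variance retained:", ratios.sum())
```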

Drawbacks of PCA

The main drawback of PCA is that it is highly sensitive to outliers in the data. Moreover, PCA is a linear projection, which means it cannot capture non-linear dependencies.
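
The outlier sensitivity is easy to reproduce on toy data. In this sketch (synthetic points and a single made-up outlier, both assumptions for illustration) one extreme point is enough to swing the first principal component from the x-axis to the y-axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Points spread widely along x and narrowly along y.
X_clean = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])

# The same data plus a single extreme point far away along y.
X_outlier = np.vstack([X_clean, [0.0, 200.0]])

pc_clean = PCA(n_components=1).fit(X_clean).components_[0]
pc_outlier = PCA(n_components=1).fit(X_outlier).components_[0]

print("first component, clean data:  ", pc_clean)
print("first component, with outlier:", pc_outlier)
```

Because PCA maximises variance, the squared distance of that one point dominates the total, which is why the direction changes so drastically.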

PCA in 2D space

# Plotting PCA 2D
plt.style.use('dark_background')
plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through PCA', fontsize=24);
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')


Image by Author

From the 2D plot, we can see that the two components definitely hold some information, especially for specific signs, but clearly not enough to set all of them apart.

PCA in 3D space

# Plotting PCA 3D
# (fig.add_subplot is used because recent Matplotlib releases no longer accept gca(projection='3d'))
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    xs=principalComponents[:, 0],
    ys=principalComponents[:, 1],
    zs=principalComponents[:, 2],
    c=y,
    cmap='gist_rainbow'
)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('Visualizing sign-language-mnist through PCA in 3D', fontsize=24);
plt.show()


Image by Author

t-SNE with Scikit learn

One thing to note is that t-SNE is very computationally expensive; hence its documentation states:

“It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.”[2]

start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(x)
# Note: newer scikit-learn releases rename n_iter to max_iter
tsne = TSNE(random_state=42, n_components=3, verbose=0, perplexity=40, n_iter=400).fit_transform(pca_result_50)
print('Duration: {} seconds'.format(time.time() - start))

Thus, I applied PCA first, retaining 50 principal components of the original data, to cut down the processing power and time that t-SNE would otherwise need on the full 784-feature data.

The speed of the three techniques is analysed and compared in detail further down.

t-SNE in 2D space

#Visualising t-SNE 2D
fig = plt.figure(figsize=(12,8))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through t-SNE in 2D', fontsize=24);
plt.xlabel('tsne_1')
plt.ylabel('tsne_2')


Image by Author

t-SNE in 3D space

#Visualising t-SNE 3D
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne[:, 0], tsne[:, 1],tsne[:,2], c=y, cmap='gist_rainbow')
ax.set_xlabel('tsne_1')
ax.set_ylabel('tsne_2')
ax.set_zlabel('tsne_3')
plt.title('Visualizing sign-language-mnist through TSNE in 3D', fontsize=24);
plt.show()


Image by Author

Implementing UMAP

UMAP has different hyperparameters that can have an impact on the resulting embeddings:

  • n_neighbors

This parameter controls how UMAP balances local versus global structure in the data. Low values of n_neighbors force UMAP to focus on very local structure, while higher values make UMAP focus on larger neighbourhoods.

  • min_dist

This parameter controls how tightly UMAP is allowed to pack points together. Lower values mean the points will be clustered closely and vice versa.

  • n_components

This parameter allows the user to determine the dimensionality of the reduced dimension space.

  • metric

This parameter controls how distance is computed in the ambient space of the input data.

For more detailed information, I suggest checking out the UMAP documentation:

https://umap-learn.readthedocs.io/en/latest/
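
The four hyperparameters listed above can be passed straight to the UMAP constructor. This is a minimal sketch rather than a recommended configuration: the random stand-in data is hypothetical, the values shown are, as far as I know, the library defaults, and the import is guarded only so that the snippet still runs where the umap-learn package is not installed.

```python
import numpy as np

try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 20))  # hypothetical stand-in data

# The four hyperparameters discussed above, set to illustrative values.
umap_params = dict(
    n_neighbors=15,      # size of the local neighbourhood
    min_dist=0.1,        # how tightly points may be packed
    n_components=2,      # dimensionality of the embedding
    metric="euclidean",  # distance in the input space
)

embedding_demo = None
if umap is not None:
    reducer_demo = umap.UMAP(random_state=42, **umap_params)
    embedding_demo = reducer_demo.fit_transform(X_demo)
    print(embedding_demo.shape)
```

Keeping the settings in one dictionary makes it easy to rerun the embedding with, say, a smaller n_neighbors or a larger min_dist and compare the plots side by side.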

UMAP(default setting)

For this tutorial, I have chosen to keep the default settings, apart from n_components, which I set to 3 for the 3D plot. It would be best to experiment with different hyperparameter settings to get the best out of the algorithm.

start = time.time()
reducer = umap.UMAP(random_state=42,n_components=3)
embedding = reducer.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))

UMAP in 2D space

# Visualising UMAP in 2d
fig = plt.figure(figsize=(12,8))
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
# Label the current 2D axes (ax would still point at the earlier 3D figure)
plt.xlabel('umap_1')
plt.ylabel('umap_2')
plt.title('Visualizing sign-language-mnist with UMAP in 2D', fontsize=24);


Image by Author

We can clearly see that UMAP does a great job of separating the signs compared to t-SNE and PCA, already in 2D space.

UMAP in 3D space

# Visualising UMAP in 3d
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1],reducer.embedding_[:, 2], c=y, cmap='gist_rainbow')
ax.set_xlabel('umap_1')
ax.set_ylabel('umap_2')
ax.set_zlabel('umap_3')
plt.title('Visualizing sign-language-mnist through UMAP in 3D', fontsize=24);
plt.show()


Image by Author

Comparison between the Dimension Reduction Techniques: PCA vs t-SNE vs UMAP


PCA (top row) vs t-SNE (middle row) vs UMAP (bottom row), Image by Author

Comparing the visualisations produced by the three models, we can see that PCA did not do a good job of differentiating the signs. This is mainly because PCA is a linear projection, which means it cannot capture non-linear dependencies.

t-SNE does a better job than PCA when it comes to visualising high-dimensional datasets. Similar hand signs are clustered together, even though there are big agglomerates of data points on top of each other from a 2D perspective.

UMAP clearly outperformed the other two techniques: looking at the 2D and 3D plots, we can see that the signs are separated very well compared to the first two techniques. If we applied a clustering algorithm to this embedding, we would be able to assign labels to the clusters.
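
The clustering idea can be sketched as follows. To keep the snippet self-contained, it uses the digits dataset bundled with scikit-learn and a 2D PCA embedding as stand-ins (both assumptions); in this tutorial you would cluster the UMAP embedding instead and set n_clusters to the number of signs. K-means is just one possible choice of clustering algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Stand-in data and embedding: digits reduced to 2-D with PCA.
X_digits, y_digits = load_digits(return_X_y=True)
embedding_2d = PCA(n_components=2, random_state=0).fit_transform(X_digits)

# Cluster the embedded points; one cluster per expected class.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embedding_2d)

# Agreement between clusters and true labels (1.0 = perfect, about 0 = random).
score = adjusted_rand_score(y_digits, cluster_ids)
print("adjusted Rand index:", round(score, 3))
```

The better an embedding separates the classes, the closer the adjusted Rand index should get to 1, so it can serve as a rough numeric companion to the visual comparison.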

In terms of speed, UMAP is much faster than t-SNE. Another problem with t-SNE is that it needs another dimensionality reduction method applied beforehand; otherwise it would take even longer to compute. PCA is the fastest of them all; however, it does not do a very good job of separating the signs.

Comparison of the speed (computation times)

Note that the above table was constructed from the computation times measured on a Kaggle kernel using their GPU.
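
If you want to reproduce this kind of timing comparison on your own machine, a small helper is enough. The sketch below is an assumption-laden stand-in: it times only PCA and t-SNE on a 500-row slice of scikit-learn's bundled digits data, and timed_fit_transform is a hypothetical helper name. Absolute numbers will differ from the table above.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def timed_fit_transform(model, data):
    """Return (embedding, seconds) for model.fit_transform(data)."""
    start = time.perf_counter()
    result = model.fit_transform(data)
    return result, time.perf_counter() - start


X_digits, _ = load_digits(return_X_y=True)
X_small = X_digits[:500]  # keep the demo quick

models = {
    "PCA": PCA(n_components=2),
    "t-SNE": TSNE(n_components=2, random_state=42),
}

durations = {}
for name, model in models.items():
    _, durations[name] = timed_fit_transform(model, X_small)
    print(f"{name}: {durations[name]:.2f} seconds")
```

Any other estimator with a fit_transform method, such as the UMAP reducer used earlier, can be added to the models dictionary and timed the same way.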

UMAP can also be used as a general preprocessing step, whereas t-SNE has little use outside visualisation. UMAP can also often provide a better “big picture” view of the data while still preserving local neighbour relations.[3]

Summary

We have explored three dimensionality reduction techniques for data visualization (PCA, t-SNE and UMAP) and used them to visualize a high-dimensional dataset in 2D and 3D plots.

Based on this tutorial, for this particular use case we can say that:

  • PCA did not work very well at separating the different signs (24). Rather than arbitrarily setting the number of dimensions to 3, it is generally better to choose the number of dimensions that adds up to a sufficiently large proportion of the variance; but since this is a data visualization problem, using 2 or 3 was the most reasonable thing to do.
  • t-SNE did a better job of separating the clusters, and its 2D and 3D visualizations were definitely better than PCA's. However, it took a very long time to compute its embeddings.
  • UMAP turned out to be the most effective manifold learning technique at displaying the different clusters: some of them were very well defined, and it was significantly faster than the t-SNE implementation.
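
The remark about choosing the number of dimensions by proportion of variance maps directly onto scikit-learn: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction. A minimal sketch, again on the bundled digits dataset as a stand-in, with 0.95 as an illustrative threshold rather than a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)

# A float in (0, 1) asks PCA to keep enough components to reach that
# fraction of the total variance; 0.95 is an illustrative threshold.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X_digits)

print("components kept:", pca_95.n_components_)
print("variance retained:", pca_95.explained_variance_ratio_.sum())
```

This is the sensible choice when PCA feeds a model; for plotting, we are still limited to two or three components regardless of how much variance they retain.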

References

[1] McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.

[2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding.

[3] Kaggle.com. 2020. Visualizing Kannada MNIST With T-SNE. Available at: https://www.kaggle.com/parulpandey/visualizing-kannada-mnist-with-t-sne

[4] Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.

