Visualising a high-dimensional dataset using: PCA, TSNE and UMAP
In this story, we are going to go through three dimensionality reduction techniques specifically used for Data Visualization: PCA (Principal Component Analysis), t-SNE and UMAP. We will explore them in detail using the Sign Language MNIST dataset, without going in-depth into the maths behind the algorithms.
What is Dimensionality Reduction?
Many Machine Learning problems involve thousands of features. Having such a large number of features brings along many problems, the most important ones being:
- Makes the training extremely slow
- Makes it difficult to find a good solution
This is known as the curse of dimensionality. In simple terms, Dimensionality Reduction is the process of reducing the number of features to the most relevant ones.
Reducing the dimensionality does lose some information; as with most compression processes, it comes with drawbacks. Even though training becomes faster, the system may perform slightly worse. But this is OK: “sometimes reducing the dimensionality can filter out some of the noise present and some of the unnecessary details”.
Most Dimensionality Reduction applications are used for:
- Data Compression
- Noise Reduction
- Data Classification
- Data Visualization
One of the most important aspects of Dimensionality Reduction is Data Visualization. Dropping the dimensionality down to two or three makes it possible to visualize the data on a 2D or 3D plot, meaning important insights can be gained by analysing the resulting patterns in terms of clusters and much more.
Main Approaches for Dimensionality Reduction
There are two main approaches to reducing dimensionality: Projection and Manifold Learning.
- Projection: This technique projects every high-dimensional data point onto a suitable lower-dimensional subspace in a way which approximately preserves the distances between the points.
- Manifold Learning: Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold hypothesis, or manifold assumption, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. In most cases this assumption is based on observation or experience rather than theory or pure logic.[4]
Now let's briefly explain the three techniques (PCA, t-SNE, UMAP) before jumping into the use case.
PCA
One of the best-known dimensionality reduction techniques is PCA (Principal Component Analysis). It works by identifying the hyperplane that lies closest to the data and then projecting the data onto that hyperplane while retaining most of the variation in the dataset.
Principal Components
The axis that explains the maximum amount of variance in the training set is called the first principal component.
The axis orthogonal to it is called the second principal component. As we go to higher dimensions, PCA finds a third component orthogonal to the other two, and so on. For visualization purposes we stick to 2, or at most 3, principal components.
It is very important to choose the right hyperplane, so that when the data is projected onto it, it retains the maximum amount of information about how the original data is distributed.
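As an illustrative sketch (using scikit-learn on a small synthetic array, not the dataset used later in this story), the fitted PCA object exposes these axes and the fraction of variance each one explains:
import numpy as np
from sklearn.decomposition import PCA
# Synthetic 2D data stretched much more along one direction than the other
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2)) * [3.0, 0.5]
pca_demo = PCA(n_components=2).fit(X)
print(pca_demo.components_)                # the orthogonal principal axes
print(pca_demo.explained_variance_ratio_)  # share of variance captured by each axis
For this stretched toy data, the first row of components_ points along the direction of largest spread and the first entry of explained_variance_ratio_ is close to 1.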
t-SNE (t-distributed Stochastic Neighbour Embedding)
t-SNE, or t-distributed Stochastic Neighbour Embedding, was created in 2008 by Laurens van der Maaten and Geoffrey Hinton. It is a dimensionality reduction technique particularly well suited for the visualization of high-dimensional datasets.
t-SNE takes a high-dimensional dataset and reduces it to a low-dimensional representation that retains a lot of the original information. It does so by giving each data point a location in a two- or three-dimensional map. This technique finds clusters in the data, thereby making sure that the embedding preserves the meaning in the data. t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.[2]
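As a minimal sketch (on random toy data with illustrative parameter values, not the tutorial dataset), the scikit-learn API for this looks roughly like:
import numpy as np
from sklearn.manifold import TSNE
X_toy = np.random.rand(100, 50)  # 100 samples with 50 features each
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_toy)
print(X_embedded.shape)  # (100, 2)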
For a quick visualization of this technique, refer to the animation in the tutorial linked below by Cyrille Rossant, which I highly recommend checking out.
link: https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/
UMAP (Uniform Manifold Approximation and Projection)
Uniform Manifold Approximation and Projection, created in 2018 by Leland McInnes, John Healy and James Melville, is a general-purpose manifold learning and dimension reduction algorithm.
UMAP is a nonlinear dimensionality reduction method and is very effective for visualizing clusters or groups of data points and their relative proximities.
The significant difference with t-SNE is scalability: UMAP can be applied directly to sparse matrices, thereby eliminating the need to apply any dimensionality reduction such as PCA or Truncated SVD (Singular Value Decomposition) as a prior pre-processing step.[1]
Put simply, it is similar to t-SNE but with probably higher processing speed, and therefore faster and possibly better visualization (let's find out in the tutorial below).
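To illustrate the point about sparse input, here is a minimal sketch (assuming the umap-learn package and a random scipy sparse matrix as stand-in data; it is not part of the original tutorial):
import scipy.sparse as sp
import umap
# A random sparse matrix standing in for high-dimensional sparse features
X_sparse = sp.random(500, 1000, density=0.01, format='csr', random_state=42)
embedding_sparse = umap.UMAP(n_components=2, random_state=42).fit_transform(X_sparse)
print(embedding_sparse.shape)  # (500, 2)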
Use Case
Now we are going to go through the above-mentioned use case where all three techniques will be applied: specifically, we will try to visualize a high-dimensional dataset, the Sign-Language-MNIST dataset: https://www.kaggle.com/datamunge/sign-language-mnist
import numpy as np
import pandas as pd
import time
# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap
The Data
train = pd.read_csv('/kaggle/input/sign-language-mnist/sign_mnist_test/sign_mnist_test.csv')
train.head()
# Setting the label and the feature columns
y = train.loc[:,'label'].values
x = train.loc[:,'pixel1':].values
print(np.unique(y))
There are 24 unique labels, each representing a distinct sign-language letter.
#Applying PCA
start = time.time()
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
principal = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2', 'principal component 3'])
principal.shape
After applying PCA, the new representation of the data has only 3 features, compared to the 784 features of the original x data.
The number of dimensions has been cut down drastically whilst trying to retain as much of the ‘variation’ in the information as possible.
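As a quick check (a small addition, not in the original notebook), the fitted pca object reports how much of the total variance the three retained components actually explain:
# Fraction of the total variance captured by each of the 3 principal components
print(pca.explained_variance_ratio_)
print('Total variance retained: {:.2%}'.format(pca.explained_variance_ratio_.sum()))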
Drawbacks of PCA
The main drawback of PCA is that it is highly influenced by outliers present in the data. Moreover, PCA is a linear projection , which means it can’t capture non-linear dependencies.
PCA in 2D space
# Plotting PCA 2D
plt.style.use('dark_background')
plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through PCA', fontsize=24);
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
From the 2D plot, we can see the two components definitely hold some information, especially for specific signs, but clearly not enough to set all of them apart.
PCA in 3D space
# Plotting PCA 3D
ax = plt.figure(figsize=(12,10)).gca(projection='3d')
ax.scatter(
xs=principalComponents[:, 0],
ys=principalComponents[:, 1],
zs=principalComponents[:, 2],
c=y,
cmap='gist_rainbow'
)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('Visualizing sign-language-mnist through PCA in 3D', fontsize=24);
plt.show()
t-SNE with Scikit-learn
One thing to note is that t-SNE is very computationally expensive; hence, it is mentioned in its documentation that:
“It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.”[2]
start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(x)
tsne = TSNE(random_state=42, n_components=3, verbose=0, perplexity=40, n_iter=400).fit_transform(pca_result_50)
print('Duration: {} seconds'.format(time.time() - start))
Thus, I applied PCA first, choosing to retain 50 principal components from the original data, to cut down on processing power: computing the t-SNE embedding directly on the original data would require considerably more time.
The speed of the three techniques will be analysed and compared in the following sections further down in details.
t-SNE in 2D space
#Visualising t-SNE 2D
fig = plt.figure(figsize=(12,8))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through t-SNE in 2D', fontsize=24);
plt.xlabel('tsne_1')
plt.ylabel('tsne_2')
t-SNE in 3D space
#Visualising t-SNE 3D
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne[:, 0], tsne[:, 1],tsne[:,2], c=y, cmap='gist_rainbow')
ax.set_xlabel('tsne_1')
ax.set_ylabel('tsne_2')
ax.set_zlabel('tsne_3')
plt.title('Visualizing sign-language-mnist through TSNE in 3D', fontsize=24);
plt.show()
Implementing UMAP
UMAP has several hyperparameters that can have an impact on the resulting embeddings (a short sketch of how they are set follows this list):
- n_neighbors : This parameter controls how UMAP balances local versus global structure in the data. Low values of n_neighbors force UMAP to focus on very local structure, while higher values make UMAP consider larger neighbourhoods.
- min_dist : This parameter controls how tightly UMAP is allowed to pack points together. Lower values mean the points will be clustered more closely, and vice versa.
- n_components : This parameter allows the user to determine the dimensionality of the reduced space.
- metric : This parameter controls how distance is computed in the ambient space of the input data.
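Below is a minimal sketch (with illustrative values, not the settings used later in this tutorial) of how these hyperparameters are passed to the UMAP constructor:
import umap
reducer_custom = umap.UMAP(
    n_neighbors=15,     # balance between local and global structure
    min_dist=0.1,       # how tightly points may be packed in the embedding
    n_components=2,     # dimensionality of the reduced space
    metric='euclidean'  # distance measure in the input space
)
# embedding_custom = reducer_custom.fit_transform(x)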
For more detailed information, I suggest checking out the UMAP documentation:
https://umap-learn.readthedocs.io/en/latest/
For this tutorial, I have chosen to keep the default settings apart from n_components, which I set to 3 for the 3D plot. It would be best to experiment with different hyperparameter settings to get the best out of the algorithm.
start = time.time()
reducer = umap.UMAP(random_state=42, n_components=3)
embedding = reducer.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
UMAP in 2D space
# Visualising UMAP in 2d
fig = plt.figure(figsize=(12,8))
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.xlabel('umap_1')
plt.ylabel('umap_2')
plt.title('Visualizing sign-language-mnist with UMAP in 2D', fontsize=24);
We can clearly see that UMAP does a great job of separating the signs compared to t-SNE and PCA, already in 2D space.
UMAP in 3D space
# Visualising UMAP in 3d
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1],reducer.embedding_[:, 2], c=y, cmap='gist_rainbow')
ax.set_xlabel('umap_1')
ax.set_ylabel('umap_2')
ax.set_zlabel('umap_3')
plt.title('Visualizing sign-language-mnist through UMAP in 3D', fontsize=24);
plt.show()
Comparison between the Dimension Reduction Techniques: PCA vs t-SNE vs UMAP
By comparing the visualisations produced by the three models, we can see that PCA was not able to do such a good job in differentiating the signs. This is mainly because PCA is a linear projection, which means it can’t capture non-linear dependencies.
t-SNE does a better job than PCA when it comes to visualising high-dimensional datasets. Similar hand-signs are clustered together, even though there are big agglomerates of data points on top of each other from the 2D perspective.
UMAP outperformed the other two techniques by a reasonable margin: if we look at the 2D and 3D plots, we can clearly see that the sign-language letters are separated very well compared to the first two techniques. If we applied a clustering algorithm to this embedding, we might be able to assign labels to the clusters.
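As a rough sketch of that idea (an addition to the original tutorial, using scikit-learn's KMeans with 24 clusters to match the number of sign classes), one could do something like:
from sklearn.cluster import KMeans
# Cluster the 3D UMAP embedding and compare the cluster assignments with the true labels
kmeans = KMeans(n_clusters=24, random_state=42)
cluster_labels = kmeans.fit_predict(reducer.embedding_)
print(pd.crosstab(cluster_labels, y).head())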
In terms of speed, UMAP is much faster than t-SNE. Another problem faced by the latter is the need for a prior dimensionality reduction step, without which it would take even longer to compute. PCA is the fastest of them all; however, it does not do a very good job.
Note that the above comparison considers the computation time measured on a Kaggle Kernel using their GPU.
UMAP can also be used for preprocessing, while t-SNE does not have much use outside visualisation. UMAP can often provide a better “big picture” view of the data while also preserving local neighbour relations.[3]
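As a hedged sketch of that preprocessing idea (not part of the original article), the 3D UMAP embedding computed earlier could be fed to a simple classifier instead of the 784 raw pixels:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Note: fitting UMAP on all the data before splitting is only acceptable for a quick illustration
X_tr, X_te, y_tr, y_te = train_test_split(reducer.embedding_, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print('Held-out accuracy on UMAP features: {:.3f}'.format(knn.score(X_te, y_te)))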
Summary
We have explored three dimensionality reduction techniques for data visualization (PCA, t-SNE, UMAP) and used them to visualize a high-dimensional dataset in 2D and 3D plots.
Based on this tutorial, for this particular use case we can say that:
- PCA did not work very well in categorizing the different signs (24). In general, instead of arbitrarily fixing the number of dimensions to 3, it is better to choose the number of dimensions that adds up to a sufficiently large proportion of the variance; but since this is a data visualization problem, 3 was the most reasonable choice.
- t-SNE managed to do a better job of separating the clusters; the visualization in 2D and 3D was definitely better than PCA. However, it took a very long time to compute its embeddings.
- UMAP turned out to be the most effective manifold learning technique in terms of displaying the different clusters, some of which were very well defined, and it was significantly faster than the t-SNE implementation.
References
[1] McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.
[2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding
[3] Kaggle.com (2020). Visualizing Kannada MNIST with t-SNE. Available at: https://www.kaggle.com/parulpandey/visualizing-kannada-mnist-with-t-sne
[4] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron