Image Classification From Scratch With KNN: A Python Implementation
In recent years we have seen vast advances in the field of computer vision, with applications spanning many industries. Convolutional neural networks (CNNs) have come to be known as the pre-eminent method in image classification, as evidenced by their success in the ImageNet Large Scale Visual Recognition Challenge; in some narrow cases they have even reportedly exceeded human abilities. Although in this post we will explore a similar problem, we will approach it with the much simpler K-Nearest Neighbor (KNN) algorithm and in doing so show that reasonable performance can still be achieved with a far less complex classifier. This will be implemented in Python with the help of the numpy library.
What is KNN?
KNN is a non-parametric learning model that relies on local approximation and requires no training, which is why it is known as a ‘lazy learner’. The algorithm simply memorizes the training set and, at test time, predicts the class of a new instance by a simple majority vote over the labels of its nearest neighbors.
The nearest neighbors are determined by a distance metric. Common distance metrics include the Euclidean, Hamming (for categorical data) and Manhattan distances. The appropriate metric depends on the structure of the data and directly affects the classifier’s accuracy. Research [1] has shown that a simple KNN algorithm with an appropriate choice of distance metric can outperform many more sophisticated approaches on image classification tasks.
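As a quick illustration of how these metrics differ, here is a standalone sketch with made-up toy vectors (not part of the classifier built below):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # sqrt(1 + 4 + 0) ~= 2.24
manhattan = np.abs(a - b).sum()             # 1 + 2 + 0 = 3

# Hamming distance is usually applied to categorical vectors: the fraction of positions that differ
c = np.array(['red', 'small', 'round'])
d = np.array(['red', 'large', 'round'])
hamming = (c != d).mean()                   # 1/3

print(euclidean, manhattan, hamming)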
KNN also requires choosing the parameter k, the number of neighbors used to classify an observation. The interaction between the choice of k and the misclassification rate is well documented and is a classic illustration of the bias vs. variance trade-off. When k = 1 the model will overfit the training data and become prone to generalization error; conversely, when k is large, the prediction is averaged over a larger neighborhood, potentially increasing the misclassification rate [2].
To illustrate KNN in action, consider a toy example in which the 7 closest neighbors of a new data point have been identified (k = 7). Taking a simple count of the neighbors’ labels, the majority class is oranges (5 vs. 2 counts), so the new data point is classified as an orange.
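A minimal sketch of that vote in numpy, using made-up labels for the 7 hypothetical neighbors (1 = orange, 0 = apple):

import numpy as np

# labels of the 7 nearest neighbors in the toy example (1 = orange, 0 = apple)
neighbor_labels = np.array([1, 1, 0, 1, 1, 0, 1])

# count the votes per class and take the majority
votes = np.bincount(neighbor_labels)   # [2, 5]
prediction = np.argmax(votes)          # 1 -> orange
print(votes, prediction)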
Compared to more sophisticated methods such as neural networks, KNN is significantly faster in the training phase, as no computation is required. The opposite is true in the testing phase, however, where KNN can quickly become very expensive as both k and the number of dimensions increase. To combat this we will apply dimensionality reduction with PCA.
The Algorithm
We will demonstrate the algorithm in Python on the Fashion-MNIST data set, which consists of a training set of 60,000 observations (and a test set of 10,000) of 28 by 28 grayscale images spanning ten types of clothing.
1. Importing the data set
from keras.datasets import fashion_mnist
import numpy as np
import matplotlib.pyplot as plt

((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

# flattening image pixel values into vectors
testX = testX.reshape([10000, 784])
trainX = trainX.reshape([60000, 784])

print(trainX.shape)
print(trainY.shape)
print(testX.shape)
print(testY.shape)
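As an optional sanity check that is not part of the original listing, one of the flattened rows can be reshaped back to 28 by 28 and displayed to confirm the data loaded correctly:

# display the first training image and its label (optional sanity check)
plt.imshow(trainX[0].reshape(28, 28), cmap='gray')
plt.title("label: %d" % trainY[0])
plt.show()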
2. Defining and applying a numpy implementation of the PCA algorithm.
def pca_np(x):
    # centering data
    m = np.mean(x, axis=0)
    x_centered = x - m
    # calculating covariance matrix
    x_cov = np.cov(x_centered.T)
    # eigendecomposition
    eigenvals, eigenvecs = np.linalg.eig(x_cov)
    # sorting eigenvectors by descending eigenvalue
    i = np.argsort(eigenvals)[::-1]
    eigenvecs = eigenvecs[:, i]
    eigenvals = eigenvals[i]
    # returning the eigenvalues, eigenvectors and mean of the training set
    return (eigenvals, eigenvecs, m)
Applying PCA to the training set and plotting the % of variance explained by the first 20 principal components:
evals, evecs, train_mean = pca_np(trainX)
exp_var = evals / sum(evals)

# plot % of variance explained by the first 20 principal components
plt.bar(range(1, 21), exp_var[:20])
There are several “elbows” in the above plot. We will arbitrarily retain the first 10 principal components, which explain about 72% of the variance in the data. Now we project the training and test data onto the retained eigenvectors:
# number of PCs to retain
n = 10
X_evecs_n = evecs[:, :n]

# projecting the training and test data onto the retained eigenvectors to get our factors for use in KNN
trainX_factors = np.dot(trainX - train_mean, X_evecs_n)
testX_factors = np.dot(testX - train_mean, X_evecs_n)

# checking dimensions after PCA
print(trainX_factors.shape)
print(testX_factors.shape)
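As a quick sanity check (an extra line, not part of the original walkthrough), the cumulative share of variance captured by the retained components can be printed directly; it should come out at roughly the 72% quoted above:

# cumulative % of variance explained by the 10 retained principal components
print(exp_var[:n].sum())   # roughly 0.72 on this data set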
3. Now that we have pre-processed the data, let us define the KNN function using numpy, with the option of using either the Euclidean or the Manhattan distance metric.
def knn_np(x_train, y_train, k, X_test, dist='euclidean'):
    # calculating the distance between the test data point and each observation in the training set
    if dist != 'euclidean':
        # Manhattan (L1) distance
        distances = np.abs(X_test - x_train).sum(axis=1)
    else:
        # Euclidean (L2) distance
        distances = np.sqrt(((X_test - x_train) ** 2).sum(axis=1))
    # indices of training observations sorted by ascending distance
    arg_ascending = np.argsort(distances)
    # tallying the labels of the k nearest neighbors across the ten classes
    classes = np.zeros(10)
    for i in range(k):
        classes[y_train[arg_ascending[i]]] += 1
    # returning the majority class
    return np.argmax(classes)
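Before scoring the entire test set, the function can be spot-checked on a single observation, for example the first test image (an illustrative call with k = 10, matching the arbitrary choice used below):

# predict the class of the first test observation using its 10 nearest neighbors
print(knn_np(trainX_factors, trainY, 10, testX_factors[0]))
print(testY[0])   # actual label for comparison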
4. As we do not need to train the algorithm, we can apply it directly to the test set. We will again arbitrarily select a value for k of 10. In practice, the value for k should be determined through cross validation.
import time

k = 10
wrong = 0

# iterating through every point in the test set and applying knn
start_time = time.time()
for i in range(testX_factors.shape[0]):
    prediction = knn_np(trainX_factors, trainY, k, testX_factors[i])
    if prediction != testY[i]:
        wrong += 1
end_time = time.time()

print("Accuracy=", 1 - wrong / testX_factors.shape[0])
print("----Runtime:%.0fmin%.0fs----" % ((end_time - start_time) // 60, (end_time - start_time) % 60))
As we can see, we are able to achieve a decent 81.53% accuracy on the test set with the default Euclidean distance metric, at a very reasonable runtime.
Running the same exercise again, but with the Manhattan distance metric, we get a very similar result:
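With the function as written above, any value of the dist argument other than 'euclidean' falls through to the Manhattan branch, so the evaluation loop only needs one small change, for example:

# same evaluation loop, but with the Manhattan distance metric
wrong = 0
for i in range(testX_factors.shape[0]):
    prediction = knn_np(trainX_factors, trainY, k, testX_factors[i], dist='manhattan')
    if prediction != testY[i]:
        wrong += 1
print("Accuracy=", 1 - wrong / testX_factors.shape[0])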
Manhattan distance is generally more robust than Euclidean distance, as the latter is more sensitive to outliers and scaling issues; however, since we applied PCA during the pre-processing step, the data has already been centered, which removes some of these issues.
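As noted earlier, in practice k should be tuned rather than fixed at 10. A minimal sketch of doing so on a small held-out validation split carved from the training factors (the split size and candidate values here are arbitrary choices for illustration):

# crude k selection on a small validation split taken from the training factors
val_size = 1000
X_tr, X_val = trainX_factors[:-val_size], trainX_factors[-val_size:]
y_tr, y_val = trainY[:-val_size], trainY[-val_size:]

for k_candidate in [1, 5, 10, 20]:
    wrong = 0
    for i in range(val_size):
        if knn_np(X_tr, y_tr, k_candidate, X_val[i]) != y_val[i]:
            wrong += 1
    print(k_candidate, 1 - wrong / val_size)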
Conclusion
In this post I hope to have demonstrated, with a practical example, some of the intuition and simplicity behind the K-Nearest Neighbor algorithm. The choice to use a more sophisticated class of algorithms requires careful consideration of the trade-off between computational cost, complexity and accuracy. This decision should be made with an intimate understanding of the data in mind, as in some cases the correct specification of a simple algorithm can perform on par with its more complex counterparts.
References
[1] Zhang H, Berg AC, Maire M, Malik J. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
[2] Murphy KP. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press, 2012.