1 The k-nearest neighbors (kNN) algorithm

Category: Python · Published: 5 years ago


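Given a training dataset and a new input instance, find the k instances in the training set that are nearest to the new instance. By majority rule, the new instance is assigned to whichever class most of those k neighbors belong to.

Model: in the k-nearest neighbors algorithm, the training dataset itself is the model.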

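2 Evaluating the performance of a machine learning algorithm

You cannot train on the entire dataset. Set aside one part as training samples and another part as test samples, then measure performance on the test samples.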

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    # Set a seed to reproduce the same random split in a later run
    # (compare against None so that seed=0 is honored as well)
    if seed is not None:
        np.random.seed(seed)

    # A random permutation of 0..len(X)-1, used to shuffle X and y together
    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

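A quick sanity check of the splitting idea, as a minimal sketch rather than the article's own code. It swaps in NumPy's local generator `np.random.default_rng` for the global `np.random.seed`, the helper name `split_indices` is hypothetical, and the toy data is made up.

```python
import numpy as np


def split_indices(n_samples, test_ratio=0.2, seed=None):
    """Return (train_indexes, test_indexes) taken from one shuffled permutation."""
    rng = np.random.default_rng(seed)  # local generator; global state is untouched
    shuffled = rng.permutation(n_samples)
    test_size = int(n_samples * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Toy data (made up): row i is [2i, 2i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, test_idx = split_indices(len(X), test_ratio=0.2, seed=42)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# The same seed gives the same split, so an experiment can be repeated exactly.
train_again, test_again = split_indices(len(X), test_ratio=0.2, seed=42)
```

Indexing X and y with the same index array is what keeps each sample paired with its own label after shuffling.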

3 Classification accuracy

accuracy = number of correct predictions / total number of predictions

import numpy as np


def accuracy_score(y_true, y_predict):
    """Compute the accuracy of y_predict against y_true."""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    # Count the matching labels and divide by the number of samples
    return np.sum(y_true == y_predict) / len(y_true)

4 kNN classifier code

import numpy as np
from math import sqrt
from collections import Counter
# Assumes accuracy_score from section 3 lives in metrics.py of the same package
from .metrics import accuracy_score


class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Fit the kNN classifier on the training set X_train, y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "each sample must have exactly one label"
        assert self.k <= X_train.shape[0], \
            "k must not exceed the number of training samples"

        # kNN has no real training step: the training data itself is the model
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Return the vector of predicted labels for the samples in X_predict."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "samples to predict must have the same number of features as the training samples"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Return the predicted label for a single sample x."""
        assert x.shape[0] == self._X_train.shape[1], \
            "x is one feature row vector and must have the same number of features as the training samples"

        # Euclidean distance from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        # Indexes that sort the distances in ascending order
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        # If topK_y = [1, 1, 1, 1, 1, 0], then votes = Counter({1: 5, 0: 1}),
        # votes.most_common(1) = [(1, 5)], a list holding one tuple,
        # and votes.most_common(1)[0] = (1, 5)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Return the accuracy of the current model on the test set X_test, y_test."""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k
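The per-sample distance loop in the classifier can also be written as one vectorized NumPy call. The following is a minimal sketch of that variant, not a replacement for the class above; the function name `knn_predict_one` and the toy data are assumptions made for illustration.

```python
import numpy as np
from collections import Counter


def knn_predict_one(X_train, y_train, x, k):
    """Predict the label of one sample x by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training row, computed in one call.
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated clusters.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict_one(X_train, y_train, np.array([1.1, 1.0]), k=3)
label_b = knn_predict_one(X_train, y_train, np.array([5.1, 5.0]), k=3)
```

Broadcasting subtracts x from every row of X_train at once, which avoids the Python-level loop over training samples.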

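5 Hyperparameters and model parameters

Hyperparameters: parameters that must be fixed before the algorithm runs
Model parameters: parameters learned while the algorithm runs

kNN has no model parameters. Its hyperparameters are:

1 The value of k
2 The decision rule: a plain majority vote, or votes weighted by distance
3 The hyperparameter p: which distance formula to use, i.e. the exponent of the Minkowski distance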
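To make the role of p concrete: the Minkowski distance between points a and b is (sum of |a_i - b_i|^p)^(1/p), which gives the Manhattan distance for p = 1 and the Euclidean distance for p = 2. A minimal pure-Python sketch follows; the function name `minkowski_distance` is an illustrative assumption, not a library API.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


manhattan = minkowski_distance((0, 0), (3, 4), p=1)  # |3| + |4|
euclidean = minkowski_distance((0, 0), (3, 4), p=2)  # sqrt(3^2 + 4^2)
```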

5.1 Finding the best k

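from sklearn.neighbors import KNeighborsClassifier

# X_train, X_test, y_train, y_test come from an earlier train/test split
best_score = 0.0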
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
        
print("best_k =", best_k)
print("best_score =", best_score)

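Output: best_k = 4, best_score = 0.991666666667

Note: if the best k lies on the boundary of the search range, a better value may exist beyond it. For example, if the best k turns out to be 10, search again over a range such as 10 to 20.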
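The boundary rule in the note can be automated. The sketch below is one possible way to do it, under stated assumptions: `search_best_k` and its `step` and `max_k` parameters are hypothetical, and `toy_score` is a made-up score curve standing in for fitting a classifier and scoring it on held-out data.

```python
def search_best_k(score_for_k, low, high, step=10, max_k=100):
    """Search k in [low, high]; extend the range while the best k sits on its upper edge."""
    best_k, best_score = -1, float("-inf")
    while True:
        for k in range(low, high + 1):
            score = score_for_k(k)
            if score > best_score:
                best_k, best_score = k, score
        if best_k < high or high >= max_k:
            return best_k, best_score
        # The best k is on the boundary: examine the next window as well.
        low, high = high + 1, min(high + step, max_k)


# Hypothetical score curve that peaks at k = 13.
def toy_score(k):
    return 1.0 - abs(k - 13) / 100.0


best_k, best_score = search_best_k(toy_score, 1, 10)
```

With this toy curve the first window (1 to 10) ends with its best value on the edge, so the search continues into 11 to 20 and stops once the best k is strictly inside the window.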

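5.2 Which classification decision rule?

By count alone, blue wins.
If distance is taken into account and each vote is weighted by the inverse of its distance, red scores 1 and blue scores 1/3 + 1/4 = 7/12, so red wins. (The arithmetic implies one red neighbor at distance 1 and two blue neighbors at distances 3 and 4.)
Distance weighting has a further benefit: it helps resolve tied votes. The search below tries both rules: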

best_score = 0.0
best_k = -1
best_method = ""
# Two decision rules: uniform vote count or distance-weighted vote
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

Result: best_method = uniform, best_k = 4, best_score = 0.991666666667
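The inverse-distance vote behind the red-versus-blue arithmetic in this section can be reproduced exactly with fractions. This is a minimal sketch: the neighbor layout (red at distance 1, blue at distances 3 and 4) is inferred from the numbers in the text, and `weighted_vote` is a hypothetical helper, not a scikit-learn function.

```python
from collections import defaultdict
from fractions import Fraction


def weighted_vote(neighbors):
    """Sum 1/distance per label for a list of (label, distance) pairs.

    Distances are assumed to be positive integers so Fraction stays exact;
    a distance of 0 would need special handling.
    """
    totals = defaultdict(Fraction)
    for label, distance in neighbors:
        totals[label] += Fraction(1, distance)  # closer neighbors weigh more
    return dict(totals)


# Assumed layout behind the numbers in the text.
neighbors = [("red", 1), ("blue", 3), ("blue", 4)]
totals = weighted_vote(neighbors)
winner = max(totals, key=totals.get)
```

Here blue has two votes but a total weight of only 7/12, so red wins with weight 1, which matches the calculation in the text.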

5.3 Searching for the Minkowski distance parameter p

best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
        
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)

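Result: best_k = 3, best_p = 2, best_score = 0.988888888889

The default metric is the Minkowski distance. Other distance formulas are available (see scikit-learn.org/stable/modu…); for example, add the metric parameter to the grid search.

5.4 Grid search and more kNN hyperparameters

Grid search usage: put several dictionaries in a list, one per sub-grid of parameters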

param_grid = [
    {
        'weights': ['uniform'], 
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)], 
        'p': [i for i in range(1, 6)]
    }
]

Usage:

knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)


Returned result:
GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

# Get the best classifier
grid_search.best_estimator_

Result:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=3,
           weights='distance')

# Get the best score (the mean cross-validated accuracy of the best estimator)
grid_search.best_score_

Result:
0.98538622129436326

For other hyperparameters, see the documentation: scikit-learn.org/stable/modu…
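To see what a list-of-dictionaries grid expands to, the combinations can be enumerated by hand. This is a rough pure-Python sketch of the idea, not scikit-learn's implementation; `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per parameter combination, for each sub-grid in turn."""
    for sub_grid in param_grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)),
     "p": list(range(1, 6))},
]

candidates = list(expand_grid(param_grid))
```

The first dictionary contributes 10 candidates and the second 10 × 5 = 50, for 60 in total. Keeping p only in the "distance" sub-grid is how this article's grid avoids also searching p for uniform weights.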

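6 Data normalization

Features measured on different scales distort distance calculations, because the feature with the largest range dominates the distance.
Solution: map all data onto the same scale
Min-max normalization: maps all data into the range 0 to 1. It suits distributions with clear bounds (image pixels from 0 to 255, student exam scores, and so on) and works poorly when there are no bounds (such as income distributions).
Mean-variance normalization (standardization): rescales all data to a distribution with mean 0 and variance 1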
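Both rescalings reduce to one-line formulas: (x - min) / (max - min) for min-max normalization and (x - mean) / std for standardization. The following is a minimal single-feature sketch in pure Python; the function names and the toy values are made up for illustration.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(x - low) / (high - low) for x in values]


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1: (x - mean) / std."""
    mu, sigma = mean(values), pstdev(values)  # population std, like np.std's default
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(x - mu) / sigma for x in values]


scores = [0, 50, 100]              # bounded data (made up)
incomes = [20, 30, 40, 50, 1000]   # one extreme value (made up)

scaled_scores = min_max_scale(scores)
scaled_incomes = min_max_scale(incomes)
standardized_incomes = standardize(incomes)
```

In the income example a single extreme value squeezes the other four min-max results below 0.05, which illustrates why min-max normalization is a poor fit for unbounded data.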

6.1 How to normalize the test dataset

import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Compute the per-column mean and standard deviation of the training set X."""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """Standardize X (zero mean, unit variance) using this StandardScaler's statistics."""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
               "must fit before transform"
        assert X.shape[1] == len(self.mean_), \
               "X must have one column per stored mean"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX


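Usage: fit the scaler on the training set only, then apply the same mean and standard deviation to both the training set and the test set.

from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
standard_scaler.fit(X_train)
# Normalize the training set
X_train_standard = standard_scaler.transform(X_train)
# Normalize the test set with the statistics learned from the training set
X_test_standard = standard_scaler.transform(X_test)

# Run kNN classification on the normalized data
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_standard, y_train)
knn_clf.score(X_test_standard, y_test)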

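7 Summary

Split the samples into a training set and a test set -> normalize the data -> use grid search, judged by classification accuracy, to find the best hyperparameters and build a model

Drawbacks:

Biggest drawback: low efficiency. With m training samples and n features, predicting each new sample costs O(m*n). Optimization: use tree structures such as KD-Tree or Ball-Tree.
Highly data-dependent: with 3-nearest neighbors, for example, if two of the three nearest neighbors are erroneous samples, the prediction will be wrong.
Predictions are not interpretable: you only get the predicted class without knowing why it was chosen, so the result cannot serve as a basis for discovering new theory.
Curse of dimensionality: as the number of dimensions grows, the distance between two points that "seem close" becomes larger and larger. Remedy: dimensionality reduction.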
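The curse of dimensionality can be seen with a small calculation: the point (1, 1, ..., 1) differs from the origin by only 1 in every coordinate, yet its Euclidean distance from the origin is sqrt(n) in n dimensions. A minimal sketch, where `corner_distance` is a name chosen for illustration and `math.dist` requires Python 3.8 or later:

```python
import math


def corner_distance(n_dims):
    """Distance from the origin to the point (1, 1, ..., 1) in n_dims dimensions."""
    return math.dist([0.0] * n_dims, [1.0] * n_dims)


# Each coordinate differs by only 1, yet the distance grows like sqrt(n_dims).
distances = {n: corner_distance(n) for n in (1, 4, 100, 10000)}
```

Going from 1 to 10000 dimensions stretches the same per-coordinate difference from a distance of 1 to a distance of 100, which is why nearest-neighbor distances become less informative in high-dimensional feature spaces.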
