1. The k-Nearest Neighbors (kNN) Algorithm



Given a training dataset, for each new input instance we find the k instances in the training set that are nearest to it; by majority vote, the new instance is assigned to the class that most of those k instances belong to.

1 The k-Nearest Neighbors Algorithm

Model: for kNN, the training dataset itself is the model.

2 Evaluating the Performance of a Machine Learning Algorithm

We cannot train on the entire dataset; part of it must be set aside as training samples and the rest as test samples, and performance is then measured on the test samples.

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio"""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    # Set the seed if the same random split needs to be reproduced next time
    if seed is not None:
        np.random.seed(seed)

    # A random permutation of 0 ~ len(X)-1, used to shuffle (X, y) together
    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

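A quick usage sketch of this function (the digits dataset and the seed value here are illustrative assumptions, not from the original post):

from sklearn import datasets

digits = datasets.load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2, seed=666)
print(X_train.shape, X_test.shape)  # (1438, 64) (359, 64) for the 1797-sample digits data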

3 Classification Accuracy

accuracy = number of correct predictions / total number of predictions

# metrics.py
import numpy as np


def accuracy_score(y_true, y_predict):
    """Compute the accuracy of y_predict against y_true"""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)

4 KNN Classifier Code

import numpy as np
from math import sqrt
from collections import Counter
from .metrics import accuracy_score

class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier"""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Fit the kNN classifier with the training data X_train and y_train"""
        assert X_train.shape[0] == y_train.shape[0], \
            "each sample must correspond to exactly one label"
        assert self.k <= X_train.shape[0], \
            "k must be no larger than the number of samples"

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given a dataset X_predict to predict, return the vector of predicted labels"""
        assert self._X_train is not None and self._y_train is not None
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the instances to predict must have the same number of features as the training samples"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single instance x, return its predicted label"""
        assert x.shape[0] == self._X_train.shape[1], \
            "x is the feature vector of a single instance and must have the same number of features as the training samples"

        # Euclidean distances from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        # Indices that would sort the distances from smallest to largest
        nearest = np.argsort(distances)

        # Labels of the k nearest neighbors, and the vote count per label
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        # E.g. if topK_y = [1, 1, 1, 1, 1, 0], then votes = Counter({1: 5, 0: 1}),
        # votes.most_common(1) = [(1, 5)] is a list holding one tuple, and
        # votes.most_common(1)[0][0] is therefore the most frequent label
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Compute the accuracy of the current model on the test set X_test and y_test"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k
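A minimal usage sketch of this class, reusing the X_train / X_test split from section 2 (k=3 is just an illustrative choice):

knn_clf = KNNClassifier(k=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
print(knn_clf.score(X_test, y_test))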

5 Hyperparameters and Model Parameters

  • Hyperparameters: parameters that must be chosen before the algorithm runs
  • Model parameters: parameters that are learned while the algorithm runs

kNN has no model parameters. Its hyperparameters are:

  • 1. the value of k
  • 2. the decision rule: plain voting or distance-weighted voting
  • 3. the hyperparameter p: when weighting by distance, which distance formula to use, i.e. the Minkowski distance (see the sketch below)
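For reference, the Minkowski distance between two feature vectors a and b is (sum_i |a_i - b_i|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A small numpy sketch (not from the original post):

import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two vectors; p=1 is Manhattan, p=2 is Euclidean"""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(minkowski_distance(np.array([0, 0]), np.array([3, 4]), p=2))  # 5.0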

5.1 Finding the best k

from sklearn.neighbors import KNeighborsClassifier

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
        
print("best_k =", best_k)
print("best_score =", best_score)

Result: best_k = 4, best_score = 0.991666666667

Note: if the best k lies on the boundary of the search range, an even better value may exist beyond it. For example, if the search returns k = 10, the range (10, 20) should be searched as well (see the sketch below).
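A sketch of that boundary check, assuming the search above has just finished (this snippet is illustrative and not part of the original post):

if best_k == 10:
    # best_k hit the upper boundary, so search a wider range
    for k in range(10, 20):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score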

5.2 Which decision rule should we choose?

(Figure: the query point's three nearest neighbors are one red point at distance 1 and two blue points at distances 3 and 4.)
  • Counting votes only: blue wins (two votes to one).
  • Taking distance into account: red = 1/1 = 1, blue = 1/3 + 1/4 = 7/12, so red wins (see the sketch below).
  • A further benefit of distance weighting: it resolves tied votes.
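A tiny sketch of distance-weighted voting for the example above (the labels and distances come from the figure description; the code itself is illustrative):

from collections import Counter

# (label, distance) pairs of the three nearest neighbors in the example
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]

weighted_votes = Counter()
for label, d in neighbors:
    weighted_votes[label] += 1.0 / d

print(weighted_votes.most_common(1)[0][0])  # red: 1.0 beats blue's 1/3 + 1/4 = 7/12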
best_score = 0.0
best_k = -1
best_method = ""
# Two decision rules: plain voting ("uniform") or distance-weighted voting ("distance")
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

Result: best_method = uniform, best_k = 4, best_score = 0.991666666667

5.3 Searching for the Minkowski distance parameter p

best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
        
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)

Result: best_k = 3, best_p = 2, best_score = 0.988888888889

The Minkowski distance is the default, but other distance formulas are also available ( scikit-learn.org/stable/modu… ), for example by adding a metric parameter to the grid search.
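As an illustration, the metric can be added to the search grid like this (the metric names below are standard scikit-learn options; this snippet is not from the original post):

param_grid = [
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'metric': ['euclidean', 'manhattan', 'chebyshev']
    }
]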

5.4 Grid search and more kNN hyperparameters

How to use Grid Search: put several parameter dicts in a list.

param_grid = [
    {
        'weights': ['uniform'], 
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)], 
        'p': [i for i in range(1, 6)]
    }
]

Usage:

knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)


The returned result:
GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
# Get the best classifier found by the grid search
grid_search.best_estimator_

Result:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=3,
           weights='distance')

# Get the best score (the mean cross-validated accuracy)
grid_search.best_score_

Result:
0.98538622129436326

For the other hyperparameters, see the documentation: scikit-learn.org/stable/modu…

6 Data Normalization

  • Solution: map all features onto the same scale, so that no single feature dominates the distance computation (see the sketch below for both approaches).

  • Min-max normalization: map all values into the 0-1 range. This suits distributions with clear boundaries (e.g. image pixel values in 0-255, exam scores), but works poorly when there is no clear boundary (e.g. income).

  • Mean-variance normalization (standardization): scale all values to a distribution with mean 0 and variance 1.

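A small numpy sketch of both scalings (the sample array x is an illustrative assumption):

import numpy as np

x = np.random.randint(0, 100, size=50).astype(float)

# Min-max normalization: map values into [0, 1]
x_minmax = (x - np.min(x)) / (np.max(x) - np.min(x))

# Mean-variance normalization (standardization): mean 0, variance 1
x_standard = (x - np.mean(x)) / np.std(x)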

6.1 How to normalize the test set

The test set must be scaled with the mean and standard deviation computed from the training set, i.e. (x_test - mean_train) / std_train; that is exactly why the scaler below saves mean_ and scale_ in fit and reuses them in transform.

import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """根据训练数据集X获得数据的均值和方差"""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """将X根据这个StandardScaler进行均值方差归一化处理"""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None
        assert X.shape[1] == len(self.mean_), \
               "每一列对应一个均值"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX


Usage (here with scikit-learn's own StandardScaler):

from sklearn.preprocessing import StandardScaler 
standardScalar = StandardScaler() 
standardScalar.fit(X_train)
# Normalize the training set
X_train = standardScalar.transform(X_train)
# Normalize the test set
X_test_standard = standardScalar.transform(X_test) 

# Classify with kNN using the normalized data
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)

7 Summary


Split the samples into a training set and a test set -> normalize the data -> use classification accuracy together with grid search to determine the best hyperparameters and build the model.

Drawbacks:

  • Biggest drawback: it is inefficient. With m training samples and n features, predicting each new instance costs O(m*n). Optimization: use tree structures such as KD-Tree or Ball-Tree (see the sketch below).
  • Highly data-dependent: with, say, 3-nearest neighbors, if two of the nearest samples are wrong (e.g. mislabeled), the prediction will be wrong.
  • The prediction is not interpretable: we only obtain the predicted label without knowing why the instance belongs to that class, so the result cannot be used as a basis for discovering new theory.
  • The curse of dimensionality: as the number of dimensions grows, the distance between two "seemingly close" points becomes larger and larger. Remedy: dimensionality reduction.
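As a sketch of the tree-based speedup mentioned in the first drawback, scikit-learn's KNeighborsClassifier already exposes an algorithm parameter (the parameter values are standard scikit-learn options; the choice here is illustrative):

from sklearn.neighbors import KNeighborsClassifier

# algorithm can be "auto", "brute", "kd_tree" or "ball_tree"
knn_clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)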
