Despite the limitations imposed by its underlying independence assumption, Naive Bayes has its advantages, as noted in "6 Easy Steps to Learn Naive Bayes Algorithm (with codes in Python and R)":
4 Applications of Naive Bayes Algorithms
- Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used to make predictions in real time.
- Multi-class prediction: The algorithm is also well known for its multi-class prediction capability; it can predict the probability of each class of the target variable.
- Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification, where (owing to good results on multi-class problems and the independence assumption) they often have a higher success rate than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
- Recommendation systems: A Naive Bayes classifier combined with collaborative filtering can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
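As a rough illustration of the text classification and multi-class probability points above, here is a minimal pure-Python sketch of a word-count Naive Bayes classifier. The toy corpus, the function names `train` and `predict_proba`, and the choice to ignore unseen words are all assumptions made for this example; they do not come from the cited article.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch on monday", "ham"),
]

def train(docs, alpha=1.0):
    """Count documents and words per class; alpha is the smoothing constant."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab, alpha

def predict_proba(model, text):
    """Return the normalized posterior P(class | text) for every class."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        log_scores[label] = score
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

model = train(TRAIN)
probs = predict_proba(model, "win a prize on monday")
print(max(probs, key=probs.get), probs)
```

The same function returns one probability per class, so it extends to more than two classes without changes.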
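Naive Bayes is commonly used in three model variants:
- Multinomial model: the most common variant, generally used when the feature values are discrete. It applies Laplace smoothing (Bayesian estimation), because maximum likelihood estimation can produce estimated probabilities of 0.
- Gaussian model: generally used when the feature values are continuous, such as height or weight. It assumes that each feature dimension follows a Gaussian distribution within each class, so the mean and variance must be computed.
- Bernoulli model: also used when the feature values are discrete, but unlike the multinomial model the feature values are Boolean (0 or 1), so an extra binarization step is required.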
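To make the Gaussian variant more concrete, the following is a minimal pure-Python sketch that estimates a per-class mean and variance for each feature and classifies by log posterior. The function names, the sample data, and the small `var_floor` added to each variance (to avoid division by zero) are assumptions for illustration, not part of the original notes.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the class prior plus a per-feature mean and variance per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor  # var_floor is an assumption
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the Gaussian density N(v; m, var)
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical (height in cm, weight in kg) samples, invented for illustration.
X = [[180, 80], [175, 76], [185, 88], [160, 52], [165, 58], [158, 50]]
y = ["A", "A", "A", "B", "B", "B"]
gnb = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(gnb, [178, 79]))
print(predict_gaussian_nb(gnb, [162, 55]))
```

For the Bernoulli variant, the binarization step would simply map each feature to 0 or 1 before counting, as the MNIST script in this post does for pixel values.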
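Following Algorithm 4.1 and using the MNIST (digit recognition) dataset for testing, a simple implementation of the Naive Bayes algorithm (multinomial model) is as follows: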
    import numpy as np
    from collections import defaultdict

    class MultinomialNB:
        '''
        fit parameters:
            X      training feature matrix (2-D NumPy array)
            y      training labels
            alpha  the λ >= 0 of Bayesian estimation (0 gives maximum likelihood)
        predict parameters:
            test   a single sample (1-D feature vector)
        '''
        def fit(self, X, y, alpha=0):
            # Group samples by class and count the samples in each class
            feature_data = defaultdict(list)
            label_data = defaultdict(int)
            for feature, lab in zip(X, y):
                feature_data[lab].append(feature)
                label_data[lab] += 1
            # Prior probabilities
            self.label = y
            n_classes = len(np.unique(self.label))
            self.pri_p_label = {k: (v + alpha) / (len(self.label) + n_classes * alpha)
                                for k, v in label_data.items()}
            # Conditional probability of each feature value given the class
            self.cond_p_feature = defaultdict(dict)
            for i, sub in feature_data.items():
                sub = np.array(sub)
                for f_dim in range(sub.shape[1]):
                    values = np.unique(X[:, f_dim])
                    for feature in values:
                        self.cond_p_feature[i][(f_dim, feature)] = (
                            (np.sum(sub[:, f_dim] == feature) + alpha)
                            / (sub.shape[0] + len(values) * alpha))

        def predict(self, test):
            p_data = {}
            for sub_label in np.unique(self.label):
                # Sum log-probabilities instead of multiplying probabilities,
                # to avoid floating-point underflow
                p_data[sub_label] = np.log(self.pri_p_label[sub_label])
                for i in range(len(test)):
                    p = self.cond_p_feature[sub_label].get((i, test[i]))
                    if p is None:  # value never seen in training for this dimension
                        continue
                    p_data[sub_label] += np.log(p) if p > 0 else -np.inf
            opt_label = max(p_data, key=p_data.get)
            # Returns the predicted label and its (unnormalized) log posterior
            return [opt_label, p_data.get(opt_label)]
    import numpy as np
    import pandas as pd
    from collections import defaultdict
    from sklearn.model_selection import train_test_split

    dataset = pd.read_csv("train.csv")
    dataset = np.array(dataset)
    # Binarize the pixel columns: every non-zero pixel becomes 1
    dataset[:, 1:][dataset[:, 1:] != 0] = 1
    label = dataset[:, 0]
    # Exclude the label column so that it does not leak into the features
    features = dataset[:, 1:]

    # Split into training and test sets
    train_dat, test_dat, train_label, test_label = train_test_split(
        features, label, test_size=0.2, random_state=123456)

    # Build the NB model
    model = MultinomialNB()
    model.fit(X=train_dat, y=train_label, alpha=1)

    # Predict with the NB model
    pl = {}
    for i, test in enumerate(test_dat):
        pl[i] = model.predict(test=test)

    # Print the test error rate (%)
    error = 0
    for k, v in pl.items():
        if test_label[k] != v[0]:
            error += 1
    print(error / len(test_label) * 100)
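The code comment in `predict` mentions taking logarithms to prevent floating-point underflow. A minimal sketch of why this matters for MNIST-sized inputs follows; the per-feature probability `p = 0.3` is a made-up value chosen only to illustrate the effect.

```python
import math

n_features = 784  # 28 x 28 MNIST pixels
p = 0.3           # hypothetical conditional probability for every feature

product = 1.0
log_sum = 0.0
for _ in range(n_features):
    product *= p            # direct product of probabilities
    log_sum += math.log(p)  # equivalent sum of log-probabilities

print(product)  # the direct product underflows to 0.0 in double precision
print(log_sum)  # the log-sum stays finite and can still be compared across classes
```

Because the logarithm is monotonic, choosing the class with the largest log-sum gives the same decision as choosing the largest product, whenever the product is representable.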