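We use the naive Bayes implementation in the sklearn package, which provides three models: Gaussian, multinomial, and Bernoulli. For details, see the Naive Bayes page of the scikit-learn 0.18.1 documentation.
This article uses the multinomial naive Bayes class to classify English emails as spam or not spam.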
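As a rough illustration of the three model classes, the sketch below fits each of them on a tiny made-up count matrix. The data and the `models` dictionary are assumptions for demonstration only and are not part of the original notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data (made up for illustration): 4 documents, 3 word-count features
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

models = {
    'gaussian': GaussianNB(),        # suited to continuous features
    'multinomial': MultinomialNB(),  # suited to counts or frequencies
    'bernoulli': BernoulliNB(),      # suited to binary presence/absence features
}

# Fit every model on the same data and collect its predictions
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
for name, pred in predictions.items():
    print(name, pred)
```

Word counts and TF-IDF weights are non-negative frequency-like features, which is why the multinomial variant is a natural fit for this task.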
Importing the packages

    import nltk
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm import tqdm_notebook
    from wordcloud import WordCloud
    from sklearn.metrics import roc_curve, auc
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize, RegexpTokenizer
    %matplotlib inline
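Dataset
The data comes from the Spam Mails Dataset on Kaggle. Normal emails are labelled ham/0 and spam emails are labelled spam/1.

    data = pd.read_csv('spam_ham_dataset.csv')
    data = data.iloc[:, 1:]  # drop the first column, keeping label, text and label_num
    data.head()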
| | label | text | label_num |
|---|---|---|---|
| 0 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |
    data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5171 entries, 0 to 5170
    Data columns (total 3 columns):
    label        5171 non-null object
    text         5171 non-null object
    label_num    5171 non-null int64
    dtypes: int64(1), object(2)
    memory usage: 121.3+ KB
    print('This dataset contains {} emails'.format(data.shape[0]))

Output:

    This dataset contains 5171 emails

    print('Number of ham emails: {}'.format(data['label_num'].value_counts()[0]))
    print('Number of spam emails: {}'.format(data['label_num'].value_counts()[1]))
    plt.style.use('seaborn')  # named 'seaborn-v0_8' in Matplotlib 3.6 and later
    plt.figure(figsize=(6, 4), dpi=100)
    data['label'].value_counts().plot(kind='bar')

Output:

    Number of ham emails: 3672
    Number of spam emails: 1499
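The counts above show that the classes are imbalanced. A minimal sketch of how the class shares could be computed follows; the `labels` series is rebuilt from the reported counts rather than read from the CSV, so it is an illustrative stand-in for `data['label_num']`.

```python
import pandas as pd

# Rebuild the label column from the counts reported above (3672 ham, 1499 spam)
labels = pd.Series([0] * 3672 + [1] * 1499, name='label_num')

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of each class

print(counts)
print(shares.round(3))
```

Roughly 71% of the emails are ham, so a classifier that always predicts ham would already reach about 71% accuracy. That is a useful baseline to keep in mind when reading the accuracy reported later.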
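Creating a new DataFrame
We create a new DataFrame and do all subsequent processing on it.

    # keep only text and label_num
    new_data = data.iloc[:, 1:].copy()  # copy, so later column assignments do not act on a slice of data
    length = len(new_data)
    print('Number of emails, length =', length)
    new_data.head()

Output:

    Number of emails, length = 5171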
| | text | label_num |
|---|---|---|
| 0 | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | Subject: re : indian springs\r\nthis deal is t... | 0 |
A look at some of the actual content

    for i in range(3):
        print(i, '\n', data['text'][i])
Output:

    0
    Subject: enron methanol ; meter # : 988291
    this is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary flow data provided by daren } . please override pop ' s daily volume { presently zero } to reflect daily activity you can obtain from gas control . this change is needed asap for economics purposes .
    1
    Subject: hpl nom for january 9 , 2001
    ( see attached file : hplnol 09 . xls ) - hplnol 09 . xls
    2
    Subject: neon retreat
    ho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time ! i know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute . on the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about . i think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer . the first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past . the second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide . email me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! ) have a great weekend , great golf , great fishing , great shopping , or whatever makes you happy ! bobby
Preprocessing
Case
The emails contain both upper-case and lower-case letters, so we first convert every word to lower case.

    new_data['text'] = new_data['text'].str.lower()
    new_data.head()
| | text | label_num |
|---|---|---|
| 0 | subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | subject: photoshop , windows , office . cheap ... | 1 |
| 4 | subject: re : indian springs\r\nthis deal is t... | 0 |
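Stop words
Words such as you, me, and be carry little information for classification, so we filter them out as stop words. Note also that every email begins with the word subject, so we add it to the stop-word set as well. We use the stopwords corpus from nltk, the natural language processing toolkit.

    # the corpus must be available locally; nltk.download('stopwords') fetches it once
    stop_words = set(stopwords.words('english'))
    stop_words.add('subject')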
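To make the filtering step concrete, here is a minimal sketch that uses a tiny hand-written stop-word set instead of the nltk corpus. Both `demo_stop_words` and `remove_stop_words` are hypothetical names introduced only for this illustration.

```python
# A tiny hand-written stop-word set, used here only for illustration;
# the article uses nltk's much larger English list plus 'subject'
demo_stop_words = {'you', 'me', 'be', 'the', 'to', 'is', 'subject'}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [t for t in tokens if t not in stop_words]


tokens = ['subject', 'please', 'send', 'the', 'report', 'to', 'me']
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Storing the stop words in a set keeps each membership test constant-time on average, which matters when the check runs for every token of several thousand emails.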
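Tokenization
We need to extract each word from a long string and strip out all punctuation, so we use nltk's RegexpTokenizer class, which takes a regular expression as its argument. For example:

    string = 'I have a pen,I have an apple. (Uhh~)Apple-pen!'  # lyrics from "PPAP"
    RegexpTokenizer('[a-zA-Z]+').tokenize(string)  # drops every symbol and returns a list

Output:

    ['I', 'have', 'a', 'pen', 'I', 'have', 'an', 'apple', 'Uhh', 'Apple', 'pen']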
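For a simple pattern like this one, the standard library's `re.findall` should produce the same tokens. The sketch below is an equivalent shown for comparison, not the method used in the notebook.

```python
import re

string = 'I have a pen,I have an apple. (Uhh~)Apple-pen!'

# Collect every maximal run of ASCII letters; everything else acts as a separator
tokens = re.findall(r'[a-zA-Z]+', string)
print(tokens)
```

This makes it clear that the tokenizer is doing pattern matching rather than splitting on whitespace, which is why "pen,I" and "Apple-pen" are broken into separate words.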
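Lemmatization
In English a word can take several inflected forms, such as love and loves, which differ in form but share the same meaning. Lemmatization and stemming are two ways of reducing such forms to a common base. This article uses lemmatization. For details, see the comparison of lemmatization tools (词形还原工具对比) on ZMonster's Blog.
Here we use the WordNetLemmatizer class from nltk. For example:

    word = 'loves'
    print('The base form of {} is {}'.format(word, WordNetLemmatizer().lemmatize(word)))

Output:

    The base form of loves is love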
We now combine all of the steps above into one function and run it with pandas' apply.

    def text_process(text):
        tokenizer = RegexpTokenizer('[a-z]+')  # match words only; the text is already lower case, so [a-z]+ suffices
        lemmatizer = WordNetLemmatizer()
        tokens = tokenizer.tokenize(text)  # tokenize
        lemmas = [lemmatizer.lemmatize(w) for w in tokens]  # lemmatize each word once
        return [w for w in lemmas if w not in stop_words]  # drop stop words

    new_data['text'] = new_data['text'].apply(text_process)

We now have a reasonably clean dataset.

    new_data.head()
| | text | label_num |
|---|---|---|
| 0 | [enron, methanol, meter, follow, note, gave, m... | 0 |
| 1 | [hpl, nom, january, see, attached, file, hplno... | 0 |
| 2 | [neon, retreat, ho, ho, ho, around, wonderful,... | 0 |
| 3 | [photoshop, window, office, cheap, main, trend... | 1 |
| 4 | [indian, spring, deal, book, teco, pvr, revenu... | 0 |
Training and test sets
We split the processed dataset into a training set and a test set in a 3:1 ratio.

    seed = 20190524  # make the experiment reproducible
    X = new_data['text']
    y = new_data['label_num']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)  # default split: 75% training, 25% test

    train = pd.concat([X_train, y_train], axis=1)  # training set
    test = pd.concat([X_test, y_test], axis=1)  # test set
    train.reset_index(drop=True, inplace=True)  # reset the index
    test.reset_index(drop=True, inplace=True)  # likewise

    print('The training set contains {} emails and the test set contains {} emails'.format(train.shape[0], test.shape[0]))

Output:

    The training set contains 3878 emails and the test set contains 1293 emails
Number of spam and ham emails in the training set

    print(train['label_num'].value_counts())
    plt.figure(figsize=(6, 4), dpi=100)
    train['label_num'].value_counts().plot(kind='bar')

Output:

    0    2769
    1    1109
    Name: label_num, dtype: int64

Number of spam and ham emails in the test set

    print(test['label_num'].value_counts())
    plt.figure(figsize=(6, 4), dpi=100)
    test['label_num'].value_counts().plot(kind='bar')

Output:

    0    903
    1    390
    Name: label_num, dtype: int64
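With a purely random split, the spam share differs slightly between the two sets (about 28.6% in training and 30.2% in test, from the counts above). If matching class ratios are wanted, `train_test_split` accepts a `stratify` argument. The sketch below demonstrates this on synthetic labels with the same class counts; the placeholder features `X` are an assumption, and this is an optional variation rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (3672 ham, 1499 spam)
y = np.array([0] * 3672 + [1] * 1499)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one row per email

# stratify=y keeps the ham/spam ratio (almost) identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=20190524, stratify=y)

train_spam_share = y_train.mean()  # fraction of spam in the training part
test_spam_share = y_test.mean()    # fraction of spam in the test part
print(round(train_spam_share, 4), round(test_spam_share, 4))
```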
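Feature engineering
Counting every word would give a fairly large vocabulary and make the model slower to run, so we build the vocabulary only from the words in 10 randomly sampled ham emails and 10 randomly sampled spam emails.

    ham_train = train[train['label_num'] == 0]  # ham emails
    spam_train = train[train['label_num'] == 1]  # spam emails
    ham_train_part = ham_train['text'].sample(10, random_state=seed)  # 10 randomly sampled ham emails
    spam_train_part = spam_train['text'].sample(10, random_state=seed)  # 10 randomly sampled spam emails
    part_words = []  # words from the sampled emails
    for text in pd.concat([ham_train_part, spam_train_part]):
        part_words += text

    part_words_set = set(part_words)
    print('The vocabulary contains {} words'.format(len(part_words_set)))

Output:

    The vocabulary contains 1528 words

This greatly reduces the number of words.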
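Sampling a few emails is one way to shrink the vocabulary. As a point of comparison only, `CountVectorizer` also has a `max_features` parameter that keeps the terms with the highest total count across the corpus. The mini corpus below is made up, and this alternative is not what the notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, for illustration only
docs = [
    'cheap office software cheap price',
    'meeting schedule for the office',
    'cheap price cheap deal',
    'please review the meeting notes',
]

# Keep only the 3 terms with the highest total count across the corpus
cv_small = CountVectorizer(max_features=3)
counts = cv_small.fit_transform(docs)

print(sorted(cv_small.vocabulary_))
print(counts.shape)
```

Unlike random sampling, this choice depends on the whole training corpus, so frequent words are never missed by chance; the trade-off is that the vectorizer has to scan every training email once.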
CountVectorizer
Next we count how many times each word occurs, using sklearn's CountVectorizer class. For example:

    words = ['This is the first sentence', 'And this is the second sentence']
    cv = CountVectorizer()  # lowercase=True by default, which lower-cases the text (our data is already lower case)
    count = cv.fit_transform(words)
    print('cv.vocabulary_:\n', cv.vocabulary_)  # a dict mapping each word to its column index
    print('cv.get_feature_names:\n', cv.get_feature_names())  # a list; removed in scikit-learn 1.2, where get_feature_names_out() replaces it
    print('count.toarray:\n', count.toarray())  # the count matrix as a dense array

Output:

    cv.vocabulary_:
     {'this': 6, 'is': 2, 'the': 5, 'first': 1, 'sentence': 4, 'and': 0, 'second': 3}
    cv.get_feature_names:
     ['and', 'first', 'is', 'second', 'sentence', 'the', 'this']
    count.toarray:
     [[0 1 1 0 1 1 1]
      [1 0 1 1 1 1 1]]

[0 1 1 0 1 1 1] corresponds to ['and', 'first', 'is', 'second', 'sentence', 'the', 'this']: 'first' occurs once, 'is' occurs once, and so on.
TfidfTransformer
Next we compute TF-IDF, which reflects how important a word is within a text. We use sklearn's TfidfTransformer. For example:

    tfidf = TfidfTransformer()
    tfidf_matrix = tfidf.fit_transform(count)
    print('idf:\n', tfidf.idf_)  # inspect the idf values
    print('tfidf:\n', tfidf_matrix.toarray())  # inspect the tf-idf matrix

Output:

    idf:
     [1.40546511 1.40546511 1.         1.40546511 1.         1.         1.        ]
    tfidf:
     [[0.         0.57496187 0.4090901  0.         0.4090901  0.4090901  0.4090901 ]
      [0.49844628 0.         0.35464863 0.49844628 0.35464863 0.35464863 0.35464863]]

As you can see, [0 1 1 0 1 1 1] has become [0. 0.57496187 0.4090901 0. 0.4090901 0.4090901 0.4090901 ]
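To see where these numbers come from, the sketch below recomputes them by hand with NumPy. It assumes the default TfidfTransformer settings (smoothed idf, L2 row normalization, no sublinear tf scaling), under which idf = ln((1 + n) / (1 + df)) + 1 for n documents and document frequency df.

```python
import numpy as np

# Count matrix from the CountVectorizer example above
count = np.array([[0, 1, 1, 0, 1, 1, 1],
                  [1, 0, 1, 1, 1, 1, 1]], dtype=float)

n_docs = count.shape[0]
df = (count > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = count * idf                        # raw term count times idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # scale each row to unit L2 length

print(idf)
print(tfidf)
```

Terms that occur in only one of the two sentences ('and', 'first', 'second') get idf = ln(3/2) + 1 ≈ 1.405, while terms that occur in both get idf = 1, so the rarer terms end up with the larger weights after normalization.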
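Adding a new column
Now the real computation begins. Before that, we join each email's words back into a sentence, which is the format CountVectorizer expects.

    # Join the words of the sampled ham and spam emails into sentences separated by spaces,
    # because CountVectorizer() expects whitespace-separated words
    train_part_texts = [' '.join(text) for text in np.concatenate((spam_train_part.values, ham_train_part.values))]
    # join all training-set words into sentences
    train_all_texts = [' '.join(text) for text in train['text']]
    # join all test-set words into sentences
    test_all_texts = [' '.join(text) for text in test['text']]

    cv = CountVectorizer()
    part_fit = cv.fit(train_part_texts)  # build the vocabulary from the sampled emails only
    train_all_count = cv.transform(train_all_texts)  # count words for every training email
    test_all_count = cv.transform(test_all_texts)  # count words for every test email
    tfidf = TfidfTransformer()
    train_tfidf_matrix = tfidf.fit_transform(train_all_count)
    # transform only, so the test set reuses the idf weights learned on the training set;
    # the original notebook called fit_transform here, so the figures reported below come from that run
    test_tfidf_matrix = tfidf.transform(test_all_count)

    print('Training set', train_tfidf_matrix.shape)
    print('Test set', test_tfidf_matrix.shape)

Output:

    Training set (3878, 1513)
    Test set (1293, 1513)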
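The matrices have 1513 columns even though the sampled vocabulary contained 1528 words. One likely reason is that CountVectorizer's default `token_pattern` only keeps tokens of two or more word characters, so one-letter tokens are dropped. The sketch below shows that behaviour on made-up text; whether it accounts for all 15 missing words here is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up text containing the one-letter tokens 'e' and 'x'
texts = ['e mail x offer', 'mail offer today']

# Default pattern: tokens need at least two word characters
default_cv = CountVectorizer().fit(texts)
# A pattern that also keeps one-character tokens
keep_all_cv = CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit(texts)

default_vocab = sorted(default_cv.vocabulary_)
keep_all_vocab = sorted(keep_all_cv.vocabulary_)
print(default_vocab)
print(keep_all_vocab)
```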
Building the model

    mnb = MultinomialNB()
    mnb.fit(train_tfidf_matrix, y_train)

Output:

    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Accuracy of the model on the test set

    mnb.score(test_tfidf_matrix, y_test)

Output:

    0.9265274555297757
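The counting, TF-IDF, and classification steps can also be chained with sklearn's `Pipeline`, which fits each step on the training data and only transforms the test data. The sketch below uses made-up sentences and fits the vocabulary on all training texts, so it is a variation on the workflow above rather than a reproduction of it.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test sentences, for illustration only
train_texts = ['cheap software offer', 'meeting schedule attached',
               'cheap offer today', 'see attached schedule']
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
test_texts = ['cheap offer', 'meeting attached']

pipeline = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)   # fits all three steps on the training data only
predicted = pipeline.predict(test_texts)  # test data passes through transform, never fit
print(predicted)
```

Keeping the steps in one object makes it harder to refit a transformer on the test set by accident.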
    y_pred = mnb.predict_proba(test_tfidf_matrix)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred[:, 1])
    roc_auc = auc(fpr, tpr)  # named roc_auc so the imported auc() function is not overwritten

    # ROC curve
    plt.figure(figsize=(6, 4), dpi=100)
    plt.plot(fpr, tpr)
    plt.title('auc = {:.4f}'.format(roc_auc))
    plt.xlabel('fpr')
    plt.ylabel('tpr')

Output:

    Text(0, 0.5, 'tpr')
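When only the area under the ROC curve is needed, `roc_auc_score` computes it directly from the labels and scores. The sketch below checks on made-up values that both routes agree; the labels and scores are illustrative and unrelated to the email data.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and predicted spam probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)               # trapezoidal area under the ROC points
area_direct = roc_auc_score(y_true, y_score)  # the same quantity in a single call
print(area_from_curve, area_direct)
```

Here three of the four (ham, spam) pairs are ranked correctly by the scores, which is exactly what an AUC of 0.75 expresses.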
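This completes the full workflow from data cleaning to modelling. Of course, there is still plenty that could be improved.
The ipynb file is available on GitHub.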