Summary: a quick look at the concept of Chinese word segmentation, then building Chinese word segmenters with a standard dataset, using Keras (LSTM-based) and TensorFlow (CNN-based).
Principles
Chinese word segmentation means splitting a sentence into words according to its semantics:
```
我来到北京清华大学 -> 我 来到 北京 清华大学
```
The two major difficulties of Chinese word segmentation:
- Ambiguity and polysemous words
- Out-of-vocabulary (OOV) words, i.e. new-word recognition
The two major families of segmentation methods:
- Dictionary-based: use an existing dictionary plus heuristic rules, e.g. maximum matching, reverse maximum matching, minimum word count, and maximum-probability combination over a directed acyclic graph
- Tagging-based: cast segmentation as character tagging, a form of sequence labeling with the four tags SBME; examples include hidden Markov models (HMM), maximum entropy models (ME), conditional random fields (CRF), and neural networks
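As a sketch of the dictionary-based family, here is a minimal forward maximum matching implementation (the function name and the toy vocabulary are my own, for illustration only):

```python
def forward_max_match(sentence, vocab, max_word_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    result = []
    i = 0
    while i < len(sentence):
        for j in range(min(max_word_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + j]
            if j == 1 or word in vocab:
                result.append(word)
                i += j
                break
    return result

vocab = {'来到', '北京', '清华大学'}
print(forward_max_match('我来到北京清华大学', vocab))
# ['我', '来到', '北京', '清华大学']
```

Reverse maximum matching is the same idea scanned from the end of the sentence; the two often disagree exactly on the ambiguous spans.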
Sequence labeling is one kind of Seq2Seq learning, namely the last case in the figure below.
(Example from Course 5 of Andrew Ng's Deep Learning Specialization)
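Concretely, under the SBME scheme (single / begin / middle / end) each character of a segmented sentence receives exactly one tag. A minimal sketch of the conversion (the helper name `to_sbme` is mine, not from the article):

```python
def to_sbme(words):
    """Map a list of words to per-character s/b/m/e tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('s')                                  # single-character word
        else:
            tags.extend(['b'] + ['m'] * (len(w) - 2) + ['e'])  # begin, middles, end
    return tags

print(to_sbme(['我', '来到', '北京', '清华大学']))
# ['s', 'b', 'e', 'b', 'e', 'b', 'm', 'm', 'e']
```

Given the tag sequence, the segmentation is recovered by cutting after every `s` and `e`, which is exactly what the `cut_words` functions below do.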
The full-stack course introduced the jieba segmenter, whose approach is:
- Build, via an efficient prefix-dictionary scan of the word graph, a directed acyclic graph (DAG) of all possible word formations of the characters in the sentence
- Use dynamic programming to find the maximum-probability path, i.e. the best frequency-based segmentation
- For out-of-vocabulary words, apply an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm
Data
We use the annotated corpora provided by Bakeoff 2005, which come from four sources:
sighan.cs.uchicago.edu/bakeoff2005…
- Academia Sinica: as
- CityU: cityu
- Peking University: pku
- Microsoft Research: msr
Taking msr as an example, there are four files:

```
msr_training.utf8
msr_training_words.utf8
msr_test.utf8
msr_test_gold.utf8
```
BiLSTM
After preparing the data and applying character embeddings, we use Keras to build a bidirectional LSTM for sequence labeling.
Load the libraries
```python
# -*- coding: utf-8 -*-
from keras.layers import Input, Dense, Embedding, LSTM, Dropout, TimeDistributed, Bidirectional
from keras.models import Model, load_model
from keras.utils import np_utils
import numpy as np
import re
```
Prepare the dictionary
```python
# Read the dictionary
vocab = open('data/msr/msr_training_words.utf8').read().rstrip('\n').split('\n')
vocab = list(''.join(vocab))
stat = {}
for v in vocab:
    stat[v] = stat.get(v, 0) + 1
stat = sorted(stat.items(), key=lambda x: x[1], reverse=True)
vocab = [s[0] for s in stat]
# 5167 characters
print(len(vocab))
# Mappings
char2id = {c: i + 1 for i, c in enumerate(vocab)}
id2char = {i + 1: c for i, c in enumerate(vocab)}
tags = {'s': 0, 'b': 1, 'm': 2, 'e': 3, 'x': 4}
```
Define some parameters
```python
embedding_size = 128
maxlen = 32  # truncate beyond 32, pad with 0 below 32
hidden_size = 64
batch_size = 64
epochs = 50
```
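The truncate-or-pad convention that `maxlen` encodes can be sketched in isolation (the helper name is hypothetical; the data-loading code inlines the same logic):

```python
def pad_or_truncate(ids, maxlen=32, pad_id=0):
    """Fix a sequence of character ids to exactly maxlen entries."""
    if len(ids) > maxlen:
        return ids[:maxlen]           # drop everything past maxlen
    return ids + [pad_id] * (maxlen - len(ids))  # pad short sequences with 0

print(len(pad_or_truncate([5, 3, 8])))        # 32
print(len(pad_or_truncate(list(range(40)))))  # 32
```

Id 0 is reserved for padding, which is why `char2id` starts at 1 and why the Embedding layer below uses `mask_zero=True`.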
Define a function that reads and prepares the data
```python
def load_data(path):
    data = open(path).read().rstrip('\n')
    # Split on punctuation and newlines
    data = re.split('[,。!?、\n]', data)
    print('Total sentences: %d' % len(data))
    print('Average length:', np.mean([len(d.replace(' ', '')) for d in data]))
    # Prepare the data
    X_data = []
    y_data = []
    for sentence in data:
        sentence = sentence.split(' ')
        X = []
        y = []
        try:
            for s in sentence:
                s = s.strip()
                # Skip empty strings
                if len(s) == 0:
                    continue
                # s
                elif len(s) == 1:
                    X.append(char2id[s])
                    y.append(tags['s'])
                elif len(s) > 1:
                    # b
                    X.append(char2id[s[0]])
                    y.append(tags['b'])
                    # m
                    for i in range(1, len(s) - 1):
                        X.append(char2id[s[i]])
                        y.append(tags['m'])
                    # e
                    X.append(char2id[s[-1]])
                    y.append(tags['e'])
            # Unify the length
            if len(X) > maxlen:
                X = X[:maxlen]
                y = y[:maxlen]
            else:
                for i in range(maxlen - len(X)):
                    X.append(0)
                    y.append(tags['x'])
        except:
            continue
        else:
            if len(X) > 0:
                X_data.append(X)
                y_data.append(y)

    X_data = np.array(X_data)
    y_data = np_utils.to_categorical(y_data, 5)
    return X_data, y_data

X_train, y_train = load_data('data/msr/msr_training.utf8')
X_test, y_test = load_data('data/msr/msr_test_gold.utf8')
print('X_train size:', X_train.shape)
print('y_train size:', y_train.shape)
print('X_test size:', X_test.shape)
print('y_test size:', y_test.shape)
```
Define the model, train it, and save it
```python
X = Input(shape=(maxlen,), dtype='int32')
embedding = Embedding(input_dim=len(vocab) + 1, output_dim=embedding_size,
                      input_length=maxlen, mask_zero=True)(X)
blstm = Bidirectional(LSTM(hidden_size, return_sequences=True), merge_mode='concat')(embedding)
blstm = Dropout(0.6)(blstm)
blstm = Bidirectional(LSTM(hidden_size, return_sequences=True), merge_mode='concat')(blstm)
blstm = Dropout(0.6)(blstm)
output = TimeDistributed(Dense(5, activation='softmax'))(blstm)

model = Model(X, output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs)
model.save('msr_bilstm.h5')
```
Check the model's tagging accuracy on the training and test sets
```python
print(model.evaluate(X_train, y_train, batch_size=batch_size))
print(model.evaluate(X_test, y_test, batch_size=batch_size))
```
Define a viterbi function that finds the maximum-probability tag path by dynamic programming
```python
def viterbi(nodes):
    trans = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
             'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}
    for l in range(1, len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1] + i in trans.keys():
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            nows = sorted(nows.items(), key=lambda x: x[1], reverse=True)
            paths[nows[0][0]] = nows[0][1]
    paths = sorted(paths.items(), key=lambda x: x[1], reverse=True)
    return paths[0][0]
```
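To see what the path search does, here is the same algorithm restated with comments and run on a made-up two-character input (the node probabilities are invented; note that this version sums raw probabilities with a constant transition score of 0.5 for every legal transition, a simplification rather than true Viterbi over log-probabilities):

```python
def viterbi(nodes):
    # Restated from the article so this sketch runs on its own.
    # Score of a path = previous path score + emission + transition (0.5).
    trans = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
             'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}  # a sentence starts with b or s
    for l in range(1, len(nodes)):
        paths_, paths = paths, {}
        for i in nodes[l]:
            # Extend every surviving path whose last tag can transition to i
            nows = {j + i: paths_[j] + nodes[l][i] + trans[j[-1] + i]
                    for j in paths_ if j[-1] + i in trans}
            best = max(nows, key=nows.get)  # keep only the best path ending in i
            paths[best] = nows[best]
    return max(paths, key=paths.get)

# Toy input: first character likely begins a word, second likely ends one
nodes = [{'b': 0.6, 's': 0.4},
         {'s': 0.02, 'b': 0.05, 'm': 0.03, 'e': 0.9}]
print(viterbi(nodes))  # 'be': the two characters form one two-character word
```

The transition table also encodes the hard constraints: `b` can only be followed by `m` or `e`, and `m` by `m` or `e`, so illegal tag sequences can never be returned.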
Define a segmentation function on top of the trained model
```python
def cut_words(data):
    data = re.split('[,。!?、\n]', data)
    sens = []
    Xs = []
    for sentence in data:
        sen = []
        X = []
        sentence = list(sentence)
        for s in sentence:
            s = s.strip()
            if not s == '' and s in char2id:
                sen.append(s)
                X.append(char2id[s])
        if len(X) > maxlen:
            sen = sen[:maxlen]
            X = X[:maxlen]
        else:
            for i in range(maxlen - len(X)):
                X.append(0)
        if len(sen) > 0:
            Xs.append(X)
            sens.append(sen)

    Xs = np.array(Xs)
    ys = model.predict(Xs)

    results = ''
    for i in range(ys.shape[0]):
        nodes = [dict(zip(['s', 'b', 'm', 'e'], d[:4])) for d in ys[i]]
        ts = viterbi(nodes)
        for x in range(len(sens[i])):
            if ts[x] in ['s', 'e']:
                results += sens[i][x] + '/'
            else:
                results += sens[i][x]

    return results[:-1]
```
Call the segmentation function and test it
```python
print(cut_words('中国共产党第十九次全国代表大会,是在全面建成小康社会决胜阶段、中国特色社会主义进入新时代的关键时期召开的一次十分重要的大会。'))
print(cut_words('把这本书推荐给,具有一定编程基础,希望了解数据分析、人工智能等知识领域,进一步提升个人技术能力的社会各界人士。'))
print(cut_words('结婚的和尚未结婚的。'))
```
On a CPU each epoch takes a bit over 1500 seconds; after 50 epochs, training accuracy is 98.91% and test accuracy 95.47%.
Here is a standalone script that loads the trained model and segments text:
```python
# -*- coding: utf-8 -*-
from keras.models import Model, load_model
import numpy as np
import re

# Read the dictionary
vocab = open('data/msr/msr_training_words.utf8').read().rstrip('\n').split('\n')
vocab = list(''.join(vocab))
stat = {}
for v in vocab:
    stat[v] = stat.get(v, 0) + 1
stat = sorted(stat.items(), key=lambda x: x[1], reverse=True)
vocab = [s[0] for s in stat]
# 5167 characters
print(len(vocab))
# Mappings
char2id = {c: i + 1 for i, c in enumerate(vocab)}
id2char = {i + 1: c for i, c in enumerate(vocab)}
tags = {'s': 0, 'b': 1, 'm': 2, 'e': 3, 'x': 4}
maxlen = 32  # truncate beyond 32, pad with 0 below 32

model = load_model('msr_bilstm.h5')

def viterbi(nodes):
    trans = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
             'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}
    for l in range(1, len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1] + i in trans.keys():
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            nows = sorted(nows.items(), key=lambda x: x[1], reverse=True)
            paths[nows[0][0]] = nows[0][1]
    paths = sorted(paths.items(), key=lambda x: x[1], reverse=True)
    return paths[0][0]

def cut_words(data):
    data = re.split('[,。!?、\n]', data)
    sens = []
    Xs = []
    for sentence in data:
        sen = []
        X = []
        sentence = list(sentence)
        for s in sentence:
            s = s.strip()
            if not s == '' and s in char2id:
                sen.append(s)
                X.append(char2id[s])
        if len(X) > maxlen:
            sen = sen[:maxlen]
            X = X[:maxlen]
        else:
            for i in range(maxlen - len(X)):
                X.append(0)
        if len(sen) > 0:
            Xs.append(X)
            sens.append(sen)

    Xs = np.array(Xs)
    ys = model.predict(Xs)

    results = ''
    for i in range(ys.shape[0]):
        nodes = [dict(zip(['s', 'b', 'm', 'e'], d[:4])) for d in ys[i]]
        ts = viterbi(nodes)
        for x in range(len(sens[i])):
            if ts[x] in ['s', 'e']:
                results += sens[i][x] + '/'
            else:
                results += sens[i][x]

    return results[:-1]

print(cut_words('中国共产党第十九次全国代表大会,是在全面建成小康社会决胜阶段、中国特色社会主义进入新时代的关键时期召开的一次十分重要的大会。'))
print(cut_words('把这本书推荐给,具有一定编程基础,希望了解数据分析、人工智能等知识领域,进一步提升个人技术能力的社会各界人士。'))
print(cut_words('结婚的和尚未结婚的。'))
```
FCN
The advantage of fully convolutional networks (FCN) is that the input shape can vary, which makes them a natural fit for tasks where the input length varies but the output has the same length as the input, such as sequence labeling.
- Images: 4-D tensors, NHWC (batch, height, width, channels). conv2d convolves over the two middle dimensions, height and width
- Text sequences: 3-D tensors, NTE (batch, sequence length, embedding size). conv1d convolves over the single middle dimension, the sequence length, much like an N-gram; the embedding dimension plays the role of channels
We implement the FCN in TensorFlow, using conv1d with kernel size 3 to reduce the channel count step by step, from the embedding dimension down to the number of tag classes, here the four classes SBME.
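As a shape-level sketch of that channel reduction, here is a plain numpy stand-in for a width-3 conv1d with SAME padding (random weights, purely illustrative, not the trained model):

```python
import numpy as np

def conv1d_same(x, w, b):
    """x: (N, T, E_in), w: (k, E_in, E_out), b: (E_out,) -> (N, T, E_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))  # zero-pad the time axis
    out = np.zeros(x.shape[:2] + (w.shape[2],))
    for t in range(x.shape[1]):
        window = xp[:, t:t + k, :]                # (N, k, E_in) slice at step t
        out[:, t, :] = np.tensordot(window, w, axes=([1, 2], [0, 1])) + b
    return out

x = np.random.rand(2, 7, 128)   # batch of 2 sentences, length 7, embedding 128
w = np.random.rand(3, 128, 4)   # width-3 kernel mapping 128 channels to 4 tags
y = conv1d_same(x, w, np.zeros(4))
print(y.shape)                  # (2, 7, 4): length preserved, channels -> tags
```

The sequence length passes through unchanged whatever its value, which is exactly the property that lets the FCN below accept variable-length batches.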
Load the libraries
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import re
import time
```
Prepare the dictionary
```python
# Read the dictionary
vocab = open('data/msr/msr_training_words.utf8').read().rstrip('\n').split('\n')
vocab = list(''.join(vocab))
stat = {}
for v in vocab:
    stat[v] = stat.get(v, 0) + 1
stat = sorted(stat.items(), key=lambda x: x[1], reverse=True)
vocab = [s[0] for s in stat]
# 5167 characters
print(len(vocab))
# Mappings
char2id = {c: i + 1 for i, c in enumerate(vocab)}
id2char = {i + 1: c for i, c in enumerate(vocab)}
tags = {'s': [1, 0, 0, 0], 'b': [0, 1, 0, 0], 'm': [0, 0, 1, 0], 'e': [0, 0, 0, 1]}
```
Define a function that loads the data and yields batches
```python
batch_size = 64

def load_data(path):
    data = open(path).read().rstrip('\n')
    # Split on punctuation and newlines
    data = re.split('[,。!?、\n]', data)
    # Prepare the data
    X_data = []
    Y_data = []
    for sentence in data:
        sentence = sentence.split(' ')
        X = []
        Y = []
        try:
            for s in sentence:
                s = s.strip()
                # Skip empty strings
                if len(s) == 0:
                    continue
                # s
                elif len(s) == 1:
                    X.append(char2id[s])
                    Y.append(tags['s'])
                elif len(s) > 1:
                    # b
                    X.append(char2id[s[0]])
                    Y.append(tags['b'])
                    # m
                    for i in range(1, len(s) - 1):
                        X.append(char2id[s[i]])
                        Y.append(tags['m'])
                    # e
                    X.append(char2id[s[-1]])
                    Y.append(tags['e'])
        except:
            continue
        else:
            if len(X) > 0:
                X_data.append(X)
                Y_data.append(Y)

    # Sort by length so each batch contains sequences of the same length
    order = np.argsort([len(X) for X in X_data])
    X_data = [X_data[i] for i in order]
    Y_data = [Y_data[i] for i in order]

    current_length = len(X_data[0])
    X_batch = []
    Y_batch = []
    for i in range(len(X_data)):
        if len(X_data[i]) != current_length or len(X_batch) == batch_size:
            yield np.array(X_batch), np.array(Y_batch)
            current_length = len(X_data[i])
            X_batch = []
            Y_batch = []
        X_batch.append(X_data[i])
        Y_batch.append(Y_data[i])
    # Note: the final partial batch is never yielded and is silently dropped
```
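The generator above sorts sentences by length and starts a new batch whenever the length changes or the batch is full, so every batch is a rectangular array without any padding. The grouping idea in isolation, on toy data (this sketch also flushes the final partial batch, a detail that is easy to miss):

```python
def batch_by_length(seqs, batch_size=2):
    """Yield lists of equal-length sequences, at most batch_size per list."""
    seqs = sorted(seqs, key=len)   # stable sort: equal lengths keep their order
    batch = []
    for s in seqs:
        # Start a new batch when the length changes or the batch is full
        if batch and (len(s) != len(batch[0]) or len(batch) == batch_size):
            yield batch
            batch = []
        batch.append(s)
    if batch:                      # flush the final partial batch
        yield batch

data = [[1, 2], [3], [4, 5], [6, 7], [8, 9, 10]]
for b in batch_by_length(data):
    print(b)
# [[3]]
# [[1, 2], [4, 5]]
# [[6, 7]]
# [[8, 9, 10]]
```

Since each batch is rectangular on its own, no `x` padding tag is needed, which is why the FCN's tag set has four classes instead of the BiLSTM's five.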
Define the model
```python
embedding_size = 128
embeddings = tf.Variable(tf.random_uniform([len(char2id) + 1, embedding_size], -1.0, 1.0))

X_input = tf.placeholder(dtype=tf.int32, shape=[None, None], name='X_input')
embedded = tf.nn.embedding_lookup(embeddings, X_input)

W_conv1 = tf.Variable(tf.random_uniform([3, embedding_size, embedding_size // 2], -1.0, 1.0))
b_conv1 = tf.Variable(tf.random_uniform([embedding_size // 2], -1.0, 1.0))
Y_conv1 = tf.nn.relu(tf.nn.conv1d(embedded, W_conv1, stride=1, padding='SAME') + b_conv1)

W_conv2 = tf.Variable(tf.random_uniform([3, embedding_size // 2, embedding_size // 4], -1.0, 1.0))
b_conv2 = tf.Variable(tf.random_uniform([embedding_size // 4], -1.0, 1.0))
Y_conv2 = tf.nn.relu(tf.nn.conv1d(Y_conv1, W_conv2, stride=1, padding='SAME') + b_conv2)

W_conv3 = tf.Variable(tf.random_uniform([3, embedding_size // 4, 4], -1.0, 1.0))
b_conv3 = tf.Variable(tf.random_uniform([4], -1.0, 1.0))
Y_pred = tf.nn.softmax(tf.nn.conv1d(Y_conv2, W_conv3, stride=1, padding='SAME') + b_conv3, name='Y_pred')

Y_true = tf.placeholder(dtype=tf.float32, shape=[None, None, 4], name='Y_true')

cross_entropy = tf.reduce_mean(-tf.reduce_sum(Y_true * tf.log(Y_pred + 1e-20), axis=[2]))
optimizer = tf.train.AdamOptimizer().minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(Y_pred, 2), tf.argmax(Y_true, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```
Train the model and save it
```python
saver = tf.train.Saver()
max_test_acc = -np.inf
epochs = 50

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for e in range(epochs):
    train = load_data('data/msr/msr_training.utf8')
    accs = []
    i = 0
    t0 = int(time.time())
    for X_batch, Y_batch in train:
        sess.run(optimizer, feed_dict={X_input: X_batch, Y_true: Y_batch})
        i += 1
        if i % 100 == 0:
            acc = sess.run(accuracy, feed_dict={X_input: X_batch, Y_true: Y_batch})
            accs.append(acc)
    print('Epoch %d time %ds' % (e + 1, int(time.time()) - t0))
    print('- train accuracy: %f' % (np.mean(accs)))

    test = load_data('data/msr/msr_test_gold.utf8')
    accs = []
    for X_batch, Y_batch in test:
        acc = sess.run(accuracy, feed_dict={X_input: X_batch, Y_true: Y_batch})
        accs.append(acc)
    mean_test_acc = np.mean(accs)
    print('- test accuracy: %f' % mean_test_acc)

    if mean_test_acc > max_test_acc:
        max_test_acc = mean_test_acc
        print('Saving Model......')
        saver.save(sess, './msr_fcn/msr_fcn')
```
Define the viterbi function
```python
def viterbi(nodes):
    trans = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
             'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}
    for l in range(1, len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1] + i in trans.keys():
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            nows = sorted(nows.items(), key=lambda x: x[1], reverse=True)
            paths[nows[0][0]] = nows[0][1]
    paths = sorted(paths.items(), key=lambda x: x[1], reverse=True)
    return paths[0][0]
```
Define the segmentation function
```python
def cut_words(data):
    data = re.split('[,。!?、\n]', data)
    sens = []
    Xs = []
    for sentence in data:
        sen = []
        X = []
        sentence = list(sentence)
        for s in sentence:
            s = s.strip()
            if not s == '' and s in char2id:
                sen.append(s)
                X.append(char2id[s])
        if len(X) > 0:
            Xs.append(X)
            sens.append(sen)

    results = ''
    for i in range(len(Xs)):
        X_d = np.array([Xs[i]])
        Y_d = sess.run(Y_pred, feed_dict={X_input: X_d})
        nodes = [dict(zip(['s', 'b', 'm', 'e'], d)) for d in Y_d[0]]
        ts = viterbi(nodes)
        for x in range(len(sens[i])):
            if ts[x] in ['s', 'e']:
                results += sens[i][x] + '/'
            else:
                results += sens[i][x]

    return results[:-1]
```
Call the segmentation function and test it
```python
print(cut_words('中国共产党第十九次全国代表大会,是在全面建成小康社会决胜阶段、中国特色社会主义进入新时代的关键时期召开的一次十分重要的大会。'))
print(cut_words('把这本书推荐给,具有一定编程基础,希望了解数据分析、人工智能等知识领域,进一步提升个人技术能力的社会各界人士。'))
print(cut_words('结婚的和尚未结婚的。'))
```
Since GPUs accelerate CNNs dramatically, each epoch takes only about 20 seconds on a GPU; after 50 epochs, training accuracy is 99.01% and test accuracy 92.26%.
Again, here is a standalone script that loads the trained model and segments text:
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import re
import time

# Read the dictionary
vocab = open('data/msr/msr_training_words.utf8').read().rstrip('\n').split('\n')
vocab = list(''.join(vocab))
stat = {}
for v in vocab:
    stat[v] = stat.get(v, 0) + 1
stat = sorted(stat.items(), key=lambda x: x[1], reverse=True)
vocab = [s[0] for s in stat]
# 5167 characters
print(len(vocab))
# Mappings
char2id = {c: i + 1 for i, c in enumerate(vocab)}
id2char = {i + 1: c for i, c in enumerate(vocab)}
tags = {'s': [1, 0, 0, 0], 'b': [0, 1, 0, 0], 'm': [0, 0, 1, 0], 'e': [0, 0, 0, 1]}

sess = tf.Session()
sess.run(tf.global_variables_initializer())
saver = tf.train.import_meta_graph('./msr_fcn/msr_fcn.meta')
saver.restore(sess, tf.train.latest_checkpoint('./msr_fcn'))
graph = tf.get_default_graph()
X_input = graph.get_tensor_by_name('X_input:0')
Y_pred = graph.get_tensor_by_name('Y_pred:0')

def viterbi(nodes):
    trans = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
             'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}
    for l in range(1, len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1] + i in trans.keys():
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            nows = sorted(nows.items(), key=lambda x: x[1], reverse=True)
            paths[nows[0][0]] = nows[0][1]
    paths = sorted(paths.items(), key=lambda x: x[1], reverse=True)
    return paths[0][0]

def cut_words(data):
    data = re.split('[,。!?、\n]', data)
    sens = []
    Xs = []
    for sentence in data:
        sen = []
        X = []
        sentence = list(sentence)
        for s in sentence:
            s = s.strip()
            if not s == '' and s in char2id:
                sen.append(s)
                X.append(char2id[s])
        if len(X) > 0:
            Xs.append(X)
            sens.append(sen)

    results = ''
    for i in range(len(Xs)):
        X_d = np.array([Xs[i]])
        Y_d = sess.run(Y_pred, feed_dict={X_input: X_d})
        nodes = [dict(zip(['s', 'b', 'm', 'e'], d)) for d in Y_d[0]]
        ts = viterbi(nodes)
        for x in range(len(sens[i])):
            if ts[x] in ['s', 'e']:
                results += sens[i][x] + '/'
            else:
                results += sens[i][x]

    return results[:-1]

print(cut_words('中国共产党第十九次全国代表大会,是在全面建成小康社会决胜阶段、中国特色社会主义进入新时代的关键时期召开的一次十分重要的大会。'))
print(cut_words('把这本书推荐给,具有一定编程基础,希望了解数据分析、人工智能等知识领域,进一步提升个人技术能力的社会各界人士。'))
print(cut_words('结婚的和尚未结婚的。'))
```