How Many Pages Do You Have to Read to Know 90% of a Book's Vocabulary?
Author: Roman Kierzkowski
This article is part of the research we do at Vocapouch for language learners; the results are also covered on our blog.
My English teacher used to say that if I got through the first 20 pages of a book, the rest would become easy, because the words appearing on those 20 pages make up 90% of the whole book's vocabulary. So once past that point, I would no longer have to flip back and forth to a dictionary.
Was he right?
Let's verify this with three books:

- The Secret Adversary, a detective novel by Agatha Christie
- Eve's Diary, a short story by Mark Twain
- Ulysses, a novel by James Joyce
Reading
%pylab inline
from __future__ import unicode_literals
import spacy
import codecs
import seaborn as sns
from collections import Counter
from matplotlib import pyplot
nlp = spacy.load('en')
Populating the interactive namespace from numpy and matplotlib
To analyze the text we use spaCy. We strip out everything that is not a word and keep only each word's base form (lemma). For example, went becomes go, plays becomes play, and so on. We lowercase all letters, and every pronoun becomes '-PRON-'. As a result, all of these variants are counted as a single word.
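As a quick illustration of this normalization, here is a minimal sketch (the sample sentence is made up, and the exact lemmas depend on the spaCy version and model):

doc = nlp(u"She went to the park and plays chess")
print([token.lemma_ for token in doc if token.is_alpha])
# e.g. [u'-PRON-', u'go', u'to', u'the', u'park', u'and', u'play', u'chess']
# with a spaCy 2.x 'en' model; other versions may lemmatize pronouns differently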
def extract_words(path):
    with codecs.open(path, encoding='utf-8', mode="r") as book:
        content = book.read()
    doc = nlp(content)
    return [token.lemma_ for token in doc if token.is_alpha]

ulysses = extract_words('ulysses.txt')
eves_diary = extract_words('eves_diary.txt')
the_secret_adversary = extract_words('the_secret_adversary.txt')
finnegans = extract_words('finnegans.txt')
Counting word coverage
To compute the coverage, we go through the book page by page and check what fraction of all the words in the book the reader has already encountered by that page. We also track the fraction of unique words encountered so far.
WPP = 300

def count_coverage(words, wpp=WPP):  # wpp - words per page
    coverage = []
    uniqueness = []
    occurances = Counter(words)
    counter = Counter()
    total = float(len(words))
    total_uniq = float(len(occurances.keys()))
    for n in xrange(len(words) // wpp):
        page = words[n*wpp:(n+1)*wpp]
        counter.update(page)
        s = sum((occurances[w] for w in counter.keys()))
        coverage.append(s / total)
        uniqueness.append(len(counter.keys()) / total_uniq)
    return occurances, coverage, uniqueness
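To see what count_coverage returns, here is a toy run on made-up data: a six-word "book" read two words per page. After the first page we have seen the and cat, which account for 4 of the 6 running words, hence a coverage of 4/6.

words = ['the', 'cat', 'saw', 'the', 'cat', 'run']
occ, cov, uniq = count_coverage(words, wpp=2)
print(cov)   # [0.666..., 0.833..., 1.0]
print(uniq)  # [0.5, 0.75, 1.0] - 2, then 3, then all 4 unique words seen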
To gauge how hard each book is, our metric is how many pages you have to read before the word coverage reaches 90%.
def calculate_hardness(coverage):
    for i in range(len(coverage)):
        if coverage[i] > 0.9:
            break
    hardness = (i / float(len(coverage))) * 100
    return i, hardness
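Continuing the toy example above (still made-up data), the coverage first exceeds 90% on the third page, so calculate_hardness reports page index 2 and a hardness of about 66.67%:

page, hardness = calculate_hardness(cov)
print("%s pages, %.2f%% of the book" % (page, hardness))  # 2 pages, 66.67% of the book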
Testing the books
def analyze_book(words, title):
    occurances, coverage, uniqueness = count_coverage(words)
    page, hardness = calculate_hardness(coverage)
    file_name = title.lower().replace(' ', '_').replace('\'', '') + '.png'
    print("Number of Pages: %.0f" % (len(words) / WPP))
    print("Number of Total Words: %s" % len(words))
    print("Number of Unique Words: %s" % len(occurances.keys()))
    print("You will know 90%% of words after %s pages which are %.2f%% of the book." % (page, hardness))
    print("At that page, you will know %.2f%% of unique words." % (uniqueness[page] * 100,))
    pyplot.plot(coverage, color='b', label="All words")
    pyplot.plot(uniqueness, color='g', label="Unique words")
    pyplot.legend(loc=4)
    pyplot.title(title)
    pyplot.xlabel('Page')
    pyplot.ylabel('Coverage [%]')
    pyplot.savefig(file_name)
analyze_book(the_secret_adversary, title="The Secret Adversary")
Number of Pages: 250
Number of Total Words: 75208
Number of Unique Words: 5248
You will know 90% of words after 40 pages which are 16.00% of the book.
At that page, you will know 39.21% of unique words.
analyze_book(eves_diary, title="Eve's Diary")
Number of Pages: 22
Number of Total Words: 6858
Number of Unique Words: 1104
You will know 90% of words after 9 pages which are 40.91% of the book.
At that page, you will know 56.70% of unique words.
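For comparison, we also ran the analysis on Finnegans Wake, James Joyce's famously difficult last novel.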
analyze_book(finnegans, title="Finnegans Wake")
Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages which are 53.09% of the book.
At that page, you will know 60.64% of unique words.
Conclusion
New words keep appearing as you read on, but once you get past the first pages, the words you have already seen cover most of the book. The exact number of pages, however, varies with the length of the book and how varied the author's language is. Of the three books we set out to check, the hardest was Ulysses: you need to read 221 pages (25% of the whole book). So it is not 20 pages, but my teacher was right to some extent.