Translated from: https://stackoverflow.com/questions/20928769/python-tfidfvectorizer-throwing-empty-vocabulary-perhaps-the-documents-only-c
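I'm trying to use Python's TfidfVectorizer (from scikit-learn) to transform a corpus of text.
However, when I call fit_transform on it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.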
In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217             vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = dict(vocabulary)
    726             if not vocabulary:
--> 727                 raise ValueError("empty vocabulary; perhaps the documents only"
    728                                  " contain stop words")
    729

ValueError: empty vocabulary; perhaps the documents only contain stop words
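I read through the SO question "Problems using a custom vocabulary for TfidfVectorizer scikit-learn" and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step. That seems to work as expected; snippet below: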
In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]:
[u'due', u'to', u'lack', u'of', u'personal', u'biggest', u'education', u'and', u'husband', u'to',
Is there something else I'm doing wrong? The corpus I'm feeding it is just one giant long string punctuated by newlines.
Thanks!
Answer: the likely cause is that the corpus is a single string rather than a list of documents. Splitting it into a list of strings works, for example:

In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'

In [52]: tf = TfidfVectorizer()

In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]:
<6x28 sparse matrix of type '<type 'numpy.float64'>'
	with 31 stored elements in Compressed Sparse Row format>
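To see why the split matters, here is a minimal standard-library sketch. It is not scikit-learn's implementation: build_vocabulary is a hypothetical stand-in for the vocabulary-building step, and it assumes only the documented default token_pattern, which keeps tokens of two or more word characters. fit_transform iterates over its argument as a collection of documents, so on the version shown in the traceback a bare string is apparently consumed one character at a time, and no single character can form a token. (Newer scikit-learn releases, as far as I know, reject a bare string with an explicit error instead.)

```python
import re

# Same regex as scikit-learn's documented default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(raw_documents):
    """Hypothetical stand-in for the vocabulary step; not scikit-learn's code."""
    vocabulary = set()
    # Iterating over a bare str yields single characters, one "document" each.
    for doc in raw_documents:
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(vocabulary)


smallcorp = (
    "Ah! Now I have done Philosophy,\n"
    "I have finished Law and Medicine,\n"
    "And sadly even Theology:\n"
    "Taken fierce pains, from end to end.\n"
    "Now here I am, a fool for sure!\n"
    "No wiser than I was before:"
)

# One giant string: every "document" is a single character, so nothing matches.
as_string = build_vocabulary(smallcorp)

# A list of lines: each line is a document and yields real tokens.
as_lines = build_vocabulary(smallcorp.split("\n"))

print(len(as_string))  # 0, the "empty vocabulary" case
print(len(as_lines))   # 28, matching the 6x28 matrix above
```

This would also explain why the build_analyzer() check looked fine: the analyzer was called on the whole string as one document, whereas fit_transform treats its argument as an iterable of documents.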