Python TfidfVectorizer throw:空词汇;也许文件只包含停用词“

栏目: Python · 发布时间: 7年前

内容简介:翻译自:https://stackoverflow.com/questions/20928769/python-tfidfvectorizer-throwing-empty-vocabulary-perhaps-the-documents-only-c

我正在尝试使用 Python 的Tfidf来转换文本语料库.

但是,当我尝试fit_transform它时,我得到一个值错误ValueError:空词汇;也许这些文件只包含停用词.

In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217         vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779 
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782 

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = dict(vocabulary)
    726             if not vocabulary:
--> 727                 raise ValueError("empty vocabulary; perhaps the documents only"
    728                                  " contain stop words")
    729 

ValueError: empty vocabulary; perhaps the documents only contain stop words

我在这里阅读了SO问题: Problems using a custom vocabulary for TfidfVectorizer scikit-learn 并尝试了ogrisel建议使用TfidfVectorizer(** params).build_analyzer()(dataset2)来检查文本分析步骤的结果,这似乎按预期工作:下面的代码段:

In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]: 
[u'due',
 u'to',
 u'lack',
 u'of',
 u'personal',
 u'biggest',
 u'education',
 u'and',
 u'husband',
 u'to',

还有别的我做错了吗?我正在喂它的语料库只是一条由换行符打断的巨大长串.

谢谢!

我想这是因为你只有一个字符串.尝试将其拆分为字符串列表,例如:
In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'

In [52]: tf = TfidfVectorizer()

In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]: 
<6x28 sparse matrix of type '<type 'numpy.float64'>'
    with 31 stored elements in Compressed Sparse Row format>

翻译自:https://stackoverflow.com/questions/20928769/python-tfidfvectorizer-throwing-empty-vocabulary-perhaps-the-documents-only-c


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Eloquent JavaScript

Eloquent JavaScript

Marijn Haverbeke / No Starch Press / 2011-2-3 / USD 29.95

Eloquent JavaScript is a guide to JavaScript that focuses on good programming techniques rather than offering a mish-mash of cut-and-paste effects. The author teaches you how to leverage JavaScript's......一起来看看 《Eloquent JavaScript》 这本书的介绍吧!

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

RGB CMYK 转换工具
RGB CMYK 转换工具

RGB CMYK 互转工具