Text Mining for Dummies: Text Classification with Python

Category: IT Technology · Published: 5 years ago

The common steps of any NLP project in 20 lines of code

This short read shows the common steps of any text mining project. If you want to follow along in a notebook, you can get the notebook over here.

The goal is not to give an exhaustive overview of text mining, but to jump-start your thinking and give you ideas for further enhancements.

Step 1: Data

For teaching purposes, we start with a very small data set of 6 reviews.

In practice, such data often comes from scraping review websites. They are good sources because each review pairs raw text with a numeric rating.
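
As a rough sketch of what such a data set might look like, the snippet below pairs raw review text with a star rating and turns the rating into a binary class. The reviews are invented placeholders, not the six reviews from the original notebook, and both `rating_to_label` and the threshold of 3 stars are assumptions made for illustration.

```python
# Invented placeholder reviews: (raw text, star rating from 1 to 5).
# These are NOT the reviews used in the original notebook.
reviews = [
    ("Great product, works perfectly", 5),
    ("Terrible quality, broke after a day", 1),
    ("Pretty good value for the price", 4),
    ("Would not buy again", 2),
    ("Absolutely love it", 5),
    ("Waste of money", 1),
]


def rating_to_label(stars, threshold=3):
    """Map a numeric rating to a binary class: 1 = positive, 0 = negative."""
    return 1 if stars > threshold else 0


texts = [text for text, _ in reviews]
labels = [rating_to_label(stars) for _, stars in reviews]
print(labels)
```

Turning the rating into a class is what makes this a classification problem; keeping the raw rating instead would make it a regression problem.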

Step 2: Data preparation

Real data will often need more cleaning than in this example, e.g. with regular expressions or Python string operations.
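
To make the cleaning step concrete, here is a minimal sketch using only the standard `re` module. The function name `clean_text` and the specific rules (lowercasing, dropping HTML-like tags, keeping only letters) are assumptions; which rules are appropriate depends on your data.

```python
import re


def clean_text(text):
    """Lowercase, strip HTML-like tags, drop non-letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML-like tags
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


print(clean_text("<p>GREAT phone!!! 10/10, would buy again :)</p>"))
# prints: great phone would buy again
```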

The real challenge of text mining is converting text to numerical data. This is often done in two steps:

  • Stemming / Lemmatizing: bringing all words back to their ‘base form’ so that word counts are easier to compute
  • Vectorizing: applying an algorithm that is based on word counts (more advanced)

In this example, I use a LancasterStemmer and a CountVectorizer, which are well-known and easy-to-use methods.

Step 2a: LancasterStemmer to bring words back to their base form
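
The original code for this step is not reproduced here, so the following is only a sketch of how NLTK's `LancasterStemmer` might be applied. It assumes the `nltk` package is installed; the helper `stem_text` and the simple whitespace tokenization are illustrative choices, not necessarily what the notebook does.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()


def stem_text(text):
    """Stem every whitespace-separated token and join the stems back together."""
    return " ".join(stemmer.stem(token) for token in text.lower().split())


print(stem_text("saying crying maximum"))
```

The Lancaster stemmer is fairly aggressive, so the stems are not always real words. That is usually fine here, because the stems only serve as keys for counting.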

Step 2b: CountVectorizer to apply Bag of Words (basically a word count) for vectorizing, that is, converting text data into numerical data
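
As a sketch of what the Bag of Words step produces, the snippet below runs scikit-learn's `CountVectorizer` with default settings on three toy documents. The documents are made up for illustration, and `scikit-learn` is assumed to be installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
docs = ["good movie", "bad movie", "good good plot"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Recover the column order from the learned vocabulary (word -> column index).
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
# prints: ['bad', 'good', 'movie', 'plot']
print(counts.toarray())
# prints:
# [[0 1 1 0]
#  [1 0 1 0]
#  [0 2 0 1]]
```

Each column is one word of the vocabulary and each cell is how often that word occurs in the document, so word order within a document is discarded.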

Step 3: Machine Learning

Since the text has been converted to numeric data, you can now use any method that you could use on regular data!
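
The model used in the original notebook is not shown here, so the sketch below swaps in a Multinomial Naive Bayes classifier from scikit-learn as one possible choice. The training reviews and their labels are invented for illustration. Any other scikit-learn classifier could take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: 1 = positive review, 0 = negative review.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "loved the wonderful story",
    "terrible movie hated it",
    "awful acting boring plot",
    "hated the boring story",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Step 2b: turn the text into a word-count matrix.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 3: fit an ordinary classifier on the numeric matrix.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New text must go through the SAME fitted vectorizer (transform, not fit).
new_texts = ["great wonderful story", "boring awful movie"]
predictions = model.predict(vectorizer.transform(new_texts)).tolist()
print(predictions)
# prints: [1, 0]
```

Note that the vectorizer is fitted on the training text only and then reused for new text, so that both share the same vocabulary and column order.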

I hope this short example helps you on your journey. Don’t hesitate to ask any questions in the comments. Thanks for reading!

Link to the complete notebook: over here.

