Text Mining for Dummies: Text Classification with Python

Category: IT Technology · Published: 4 years ago


The common steps of any NLP project in 20 lines of code

This short read shows the common steps of any text mining project. If you want to follow along in a notebook, you can get the notebook over here.

The goal is not to give an exhaustive overview of text mining, but to kick-start your thinking and give you ideas for further enhancements.

Step 1: Data

For teaching purposes, we start with a very small data set of 6 reviews.

In practice, data often comes from scraping review websites, because they are a good source of data that pairs raw text with a numeric evaluation.
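A tiny data set of this kind could look roughly like the sketch below. The six reviews, their star ratings, and the "4 stars or more is positive" threshold are all made up for illustration; they are not the data used in the original notebook.

```python
# Hypothetical example data: six made-up reviews with a 1-5 star rating.
reviews = [
    "I loved this product, it works great",
    "Terrible quality, broke after two days",
    "Great value and fast delivery",
    "Awful experience, I want my money back",
    "Works perfectly, I am very happy",
    "Bad product, very disappointed",
]
ratings = [5, 1, 4, 1, 5, 2]

# Assumption: treat 4 stars or more as positive (1), anything else as negative (0).
labels = [1 if rating >= 4 else 0 for rating in ratings]

for text, label in zip(reviews, labels):
    print(label, text)
```

Turning the numeric evaluation into a label like this is one possible way to frame the task as classification; predicting the rating itself would work as well.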

Step 2: Data preparation

Real data will often need more cleaning than in this example, e.g. with regular expressions or Python string operations.
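As a rough idea of what such cleaning can look like, here is a minimal sketch using only the standard library. The function name clean_text and the specific rules (lowercasing, dropping HTML tags, keeping only letters) are assumptions chosen for illustration; the right rules depend on your data.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop HTML tags and non-letter characters, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags left over from scraping
    text = re.sub(r"[^a-z\s]", " ", text)  # replace everything except letters and whitespace
    return " ".join(text.split())          # collapse repeated whitespace

print(clean_text("<p>GREAT product!!!  10/10, would buy again :)</p>"))
# great product would buy again
```

Note that this rule set also throws away digits and emoticons, which can carry sentiment, so it is worth checking what you lose.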

The real challenge of text mining is converting text to numerical data. This is often done in two steps:

  • Stemming / Lemmatizing: bringing all words back to their ‘base form’ so that different forms of the same word are counted together
  • Vectorizing: applying an algorithm that is based on word counts (more advanced)

In this example, I use a LancasterStemmer and a CountVectorizer, which are well-known and easy-to-use methods.

Step 2a: LancasterStemmer to bring words back to their base form
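The original code for this step is not reproduced here, so the following is only a minimal sketch of how NLTK's LancasterStemmer can be applied. It assumes the nltk package is installed; the helper stem_text and the simple whitespace split are illustrative choices, not necessarily what the notebook does.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

def stem_text(text: str) -> str:
    """Stem every whitespace-separated word and join the result back together."""
    return " ".join(stemmer.stem(word) for word in text.split())

print(stem_text("crying saying maximum"))
# cry say maxim
```

The Lancaster stemmer is quite aggressive, so the stems are not always real words ("maxim" above). That is usually fine for counting, since the stems only need to be consistent.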

Step 2b: CountVectorizer to apply Bag of Words (basically a word count) for vectorizing (that means converting text data into numerical data)
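To make the Bag of Words idea concrete, here is a small sketch with scikit-learn's CountVectorizer. It assumes scikit-learn is installed, and the three short documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, only to show the shape of the result.
docs = ["good product", "bad product", "good good service"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Column names, ordered by their column index in the matrix.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
# ['bad', 'good', 'product', 'service']
print(counts.toarray())
# [[0 1 1 0]
#  [1 0 1 0]
#  [0 2 0 1]]
```

Each row is one document and each column is one word from the vocabulary, so the third document gets a 2 in the "good" column. Word order is lost, which is why the method is called a "bag" of words.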

Step 3: Machine Learning

Since the text has been converted to numeric data, you can use any method that you could use on regular data, for example a classifier that predicts the evaluation from the word counts.
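As one possible illustration, the sketch below trains a logistic regression on word counts. The choice of LogisticRegression is an assumption (the source does not say which model is used), the reviews and labels are made up, and stemming is skipped to keep the example short. It assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 1 = positive review, 0 = negative review.
train_texts = [
    "I loved this product, it works great",
    "Terrible quality, broke after two days",
    "Great value and fast delivery",
    "Awful experience, I want my money back",
    "Works perfectly, I am very happy",
    "Bad product, very disappointed",
]
train_labels = [1, 0, 1, 0, 1, 0]

# Learn the vocabulary on the training texts and convert them to word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Any model that accepts a numeric matrix would work here.
model = LogisticRegression()
model.fit(X_train, train_labels)

# New texts must go through the SAME fitted vectorizer (transform, not fit_transform).
new_texts = ["great product, very happy", "terrible, very disappointed"]
X_new = vectorizer.transform(new_texts)
predictions = model.predict(X_new)

for text, prediction in zip(new_texts, predictions):
    print(prediction, text)
```

With only six training examples the predictions mean very little; on a real project you would hold out a test set and measure accuracy before trusting the model.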

I hope this short example helps you on your journey. Don’t hesitate to ask any questions in the comments. Thanks for reading!

Link to the complete notebook: over here.
