Text Preprocessing With NLTK




Intro

Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves. The preprocessing methods can differ depending on the nature of the task. This tutorial covers the most common preprocessing approaches, which fit a wide range of NLP tasks, using NLTK (the Natural Language Toolkit).

Why NLTK?

  • Popularity : NLTK is one of the leading platforms for working with language data.
  • Simplicity : It provides easy-to-use APIs for a wide variety of text preprocessing methods.
  • Community : It has a large, active community that supports and improves the library.
  • Open Source : It is free and open source, available for Windows, Mac OS X, and Linux.

Now that you know the benefits of NLTK, let’s get started!

Tutorial Overview

  1. Lowercase
  2. Removing Punctuation
  3. Tokenization
  4. Stopword Filtering
  5. Stemming
  6. Part-of-Speech Tagger

All code displayed in this tutorial can be accessed in my GitHub repo.

Import NLTK

Before preprocessing, we first need to install the NLTK library.

pip install nltk

Then, we can import the library in our Python notebook and download its contents.

Lowercase

As an example, we take the first sentence of the book Pride and Prejudice as our text. We convert the sentence to lowercase via text.lower().
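A quick sketch using the book's opening sentence:

```python
# Opening sentence of Pride and Prejudice
text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")

# str.lower() returns a new, fully lowercased copy of the string
lower_text = text.lower()
print(lower_text)
```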

Removing Punctuation

To remove punctuation, we keep only the characters that are not punctuation, which can be checked against string.punctuation.
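For instance, on the lowercased sentence from above:

```python
import string

text = ("it is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")

# string.punctuation is the constant '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';
# we keep every character that is not in it.
no_punct = "".join(ch for ch in text if ch not in string.punctuation)
print(no_punct)
```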

Tokenization

Strings can be split into tokens via nltk.word_tokenize.

Stopword Filtering

We can use nltk.corpus.stopwords.words('english') to fetch the list of English stopwords. Then, we remove the tokens that appear in that list.

Stemming

We stem the tokens using nltk.stem.porter.PorterStemmer to get the stemmed tokens.
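A short sketch on a few of the filtered tokens (the Porter stemmer strips suffixes, so the results are word stems, not always dictionary words):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

tokens = ["truth", "universally", "acknowledged", "possession", "fortune"]
stemmed = [stemmer.stem(t) for t in tokens]
print(stemmed)
```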

POS Tagger

Lastly, we can use nltk.pos_tag to retrieve the part of speech of each token in a list.

The full notebook can be seen here .

Combining It All Together

We can combine all the preprocessing methods above into a single preprocess function that takes a .txt file and handles all the preprocessing. We print out the tokens, the filtered words (after stopword filtering), the stemmed words, and the POS tags; one of these outputs is usually passed on to the model or to further processing. We use the Pride and Prejudice book (accessible here ) and preprocess it.

This notebook can be accessed here .

Conclusion

Text preprocessing is an important first step for any NLP application. In this tutorial, we discussed several popular preprocessing approaches using NLTK: lowercasing, punctuation removal, tokenization, stopword filtering, stemming, and part-of-speech tagging.

