

NLP Sentiment Analysis for Beginners

A Step-By-Step Approach to Understand TextBlob, NLTK, Scikit-Learn, and LSTM networks


Jun 14 · 12 min read


Photo by Romain Vignes on Unsplash

Introduction

Natural Language Processing (NLP) is the area of machine learning that focuses on the generation and understanding of language. Its main objective is to enable machines to understand, communicate and interact with humans in a natural way.

NLP has many tasks, such as Text Generation, Text Classification, Machine Translation, Speech Recognition, Sentiment Analysis, etc. For a beginner, looking at these tasks and all the techniques involved in handling them can be quite daunting; in fact, it is very difficult for a newbie to know where and how to start.

Out of all the NLP tasks, I personally think that Sentiment Analysis (SA) is probably the easiest, which makes it the most suitable starting point for anyone who wants to get into NLP.

In this article, I will show you how to perform SA using various techniques, ranging from simple ones like TextBlob and NLTK to more advanced ones like Sklearn and Long Short Term Memory (LSTM) networks.

After reading this, you can expect to understand the following:

  1. Toolkits used in SA: TextBlob and NLTK
  2. Algorithms used in SA: Naive Bayes, SVM, Logistic Regression and LSTM
  3. Jargon such as stop-word removal, stemming, bag of words, corpus, tokenisation, etc.
  4. How to create a word cloud

The flow of this article:

  1. Data cleaning and pre-processing
  2. TextBlob
  3. Algorithms: Logistic Regression, Naive Bayes, SVM and LSTM

Let’s get started!


Just a pic of my messy work-from-home corner

Data and Problem Formulation

In this article, I will use the sentiment data set that consists of 3000 sentences coming from reviews on imdb.com, amazon.com, and yelp.com. Each sentence is labelled according to whether it comes from a positive review (labelled as 1) or a negative review (labelled as 0).

The data can be downloaded from the website. Alternatively, it can be downloaded from here (recommended). The folder sentiment_labelled_sentences (containing the data file full_set.txt) should be in the same directory as your notebook.

Load and Pre-process the Data

Set up and import libraries

%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

Now, we load in the data and look at the first 10 comments

with open("sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()

content[0:10]


## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

sentences[0:10]
labels[0:10]


Separate sentences and labels

One could stop here for this section, but I prefer transforming y into (-1, 1) form, where -1 represents negative and 1 represents positive.

## Transform the labels from '0 v.s. 1' to '-1 v.s. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1

Notice that so far we have not done anything to the words themselves! The next section focuses on the words in the sentences.

Pre-processing the text data

To feed data into any model, the input must be in vector form. We will do the following transformations:

  • Remove punctuation and numbers
  • Transform all words to lower-case
  • Remove stop words (e.g. the, a, that, this, it, …)
  • Tokenise the texts
  • Convert the sentences into vectors, using a bag-of-words representation

I will explain some of the new jargon here.

  1. Stop words : common words that are ‘not interesting’ for the task at hand. These usually include articles such as ‘a’ and ‘the’, pronouns such as ‘i’ and ‘they’, and prepositions such as ‘to’ and ‘from’. We can write a simple helper to remove them:

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

stoppers = ['a', 'is', 'of', 'the', 'this', 'uhm', 'uh']

removeStopWords(stoppers, "this is a test of the stop word removal code")

Alternatively, we can use NLTK if we want the complete set of common English stop words:

from nltk.corpus import stopwords

stops = stopwords.words("english")
removeStopWords(stops, "this is a test of the stop word removal code.")

Same result

2. Corpus : simply a collection of texts. The order of words matters: ‘not great’ is different from ‘great’.

3. Document-Term Matrix , or Bag of Words (BOW): simply a vectorial representation of text sentences (or documents).


A common way to represent a set of features like this is called a one-hot vector. For example, let’s say our vocabulary from our set of texts is:

today, here, I, a, fine, sun, moon, bird, saw

The sentence we want to build a BOW for is:

I saw a bird today.

Using a 1 or 0 for each word in the vocabulary, our BOW encoded as a one-hot vector would be:

1 0 1 1 0 0 0 1 1

In order to create a bag of words, we need to break down a long sentence or a document into smaller pieces. This process is called Tokenization . The most common tokenization technique is to break down text into words. We can do this using CountVectorizer in Scikit-Learn, where every row will represent a different document and every column will represent a different word. In addition, with CountVectorizer , we can also remove stop words.
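To make this concrete, here is a tiny, self-contained sketch (my own toy example, not part of the original pipeline) of what CountVectorizer produces:

from sklearn.feature_extraction.text import CountVectorizer

## Toy illustration only: tokenise two tiny "documents" and inspect the
## resulting document-term matrix. The token_pattern keeps single-letter
## words such as "i" and "a", which CountVectorizer drops by default.
toy_docs = ["I saw a bird today", "A fine sun here today"]
toy_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
toy_bow = toy_vectorizer.fit_transform(toy_docs)

print(toy_vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names() on older scikit-learn)
print(toy_bow.toarray())                       # one row per document, one column per word

We will apply CountVectorizer to the real data later on. First, back to cleaning the sentences: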

def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in sentences]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and remove any white space
sents_lower = [x.lower() for x in remove_punc]
sents_lower = [x.strip() for x in sents_lower]

## Remove stop words
from nltk.corpus import stopwords
stops = stopwords.words("english")

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

sents_processed = [removeStopWords(stops, x) for x in sents_lower]

Let’s look at how our sentences look now.


Uhm, wait a minute! Removing that many stop words makes many sentences lose their meaning. For example, ‘way plug us unless go converter’ does not make any sense to me. This is because we removed all the common English stop words using NLTK. To overcome this, let’s create our own, smaller set of stop words instead.

stop_set = ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']

sents_processed = [removeStopWords(stop_set, x) for x in sents_lower]


It is OK to stop here and move on to tokenization. However, one can continue with stemming . The goal of stemming is to strip off prefixes and suffixes in a word and convert it into its base form, e.g. studying->study, beautiful->beauty, cared->care, … In NLTK, there are two popular stemming techniques called Porter and Lancaster.

import nltk

def stem_with_porter(words):
    porter = nltk.PorterStemmer()
    new_words = [porter.stem(w) for w in words]
    return new_words

def stem_with_lancaster(words):
    lancaster = nltk.LancasterStemmer()
    new_words = [lancaster.stem(w) for w in words]
    return new_words

s = "Please don't unbuckle your seat-belt while I am driving, he said"

print("porter:", stem_with_porter(s.split()))
print()
print("lancaster:", stem_with_lancaster(s.split()))

Let’s try it on our sents_processed to see whether the results make sense.

porter = [stem_with_porter(x.split()) for x in sents_processed]
porter = [" ".join(i) for i in porter]

porter[0:10]

Some weird changes occur, e.g. very->veri, quality->qualiti, value->valu, …

I don’t know what you think, but I personally do not like stemming. Maybe it is useful in other cases. For those who are experts in stemming, let me know when it is useful :)

4. Term Frequency-Inverse Document Frequency (TF-IDF). This is a measure of the relative importance of a word within a document, in the context of multiple documents ; in our case here, multiple reviews.

We start with the TF part — this is simply a normalized frequency of the word in the document:

(word count in document) / (total words in document)

The IDF part is a weighting of the uniqueness of the word across all of the documents. Here is the complete formula of TF-IDF:

tf_idf(t,d) = ( wc(t,d) / wc(d) ) / ( dc(t) / dc() )

where:

wc(t,d) = # of occurrences of term t in doc d

wc(d) = # of words in doc d

dc(t) = # of docs that contain at least 1 occurrence of term t

dc() = # of docs in collection
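To make the formula concrete, here is a small sketch (my own illustration with a made-up three-review corpus, not from the original data set) that computes tf_idf directly. Note that scikit-learn’s TfidfTransformer, used below, applies a smoothed logarithmic IDF rather than this raw ratio, so the numbers will differ.

## Illustrative implementation of the tf_idf formula above, on an assumed toy corpus
def tf_idf(term, doc, docs):
    wc_td = doc.split().count(term)               # wc(t,d): occurrences of term t in doc d
    wc_d  = len(doc.split())                      # wc(d): number of words in doc d
    dc_t  = sum(term in d.split() for d in docs)  # dc(t): docs containing term t
    dc    = len(docs)                             # dc(): docs in the collection
    return (wc_td / wc_d) / (dc_t / dc)

reviews = ["great phone great battery", "bad battery", "great screen"]
print(tf_idf("great", reviews[0], reviews))    # 0.75: frequent in the doc, but common in the corpus
print(tf_idf("battery", reviews[0], reviews))  # 0.375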

Now, let’s create a bag of words and normalise the texts

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(analyzer = "word",
                             preprocessor = None,
                             stop_words = 'english',
                             max_features = 6000,
                             ngram_range = (1,5))

data_features = vectorizer.fit_transform(sents_processed)

tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)
data_mat = data_features_tfidf.toarray()

Now data_mat is our document-term matrix, and the input is ready to go into a model. Let’s create the training and test sets. Here, I split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

np.random.seed(0)
test_index = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                       np.random.choice((np.where(y==1))[0], 250, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))

train_data = data_mat[train_index,]
train_labels = y[train_index]

test_data = data_mat[test_index,]
test_labels = y[test_index]

TextBlob

  1. TextBlob : Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels: TextBlob finds all the words and phrases that it can assign polarity and subjectivity to, and averages them all together.
  2. Sentiment Labels : Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we’re going to ignore them for now). A corpus’ sentiment is the average of these.
  • Polarity : How positive or negative a word is. -1 is very negative, +1 is very positive.
  • Subjectivity : How subjective, or opinionated, a word is. 0 is fact, +1 is very much an opinion.
from textblob import TextBlob

# Create polarity function and subjectivity function
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]

This is a rule-based method that determines the sentiment (polarity and subjectivity) of a review.
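As a quick sanity check (the two sentences below are my own, not from the data set), you can query TextBlob directly and inspect the scores:

from textblob import TextBlob

## sentiment returns (polarity, subjectivity): polarity in [-1, 1], subjectivity in [0, 1]
print(TextBlob("The battery life is excellent and I love it").sentiment)
print(TextBlob("The screen broke after two days").sentiment)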

The next section will look at various algorithms.

Logistic Regression

from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
## (in newer scikit-learn versions, use loss="log_loss" and penalty=None)
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))
Training error: 0.0116
Test error: 0.184

Words with large influence

Which words are most important in deciding whether a sentence is positive? As a first approximation to this, we simply take the words whose coefficients in w have the largest positive values.

Likewise, we look at the words whose coefficients in w have the most negative values, and we think of these as influential in negative predictions.

## Convert vocabulary into a list:
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])])

## Get indices that sort w
inds = np.argsort(w)

## Words with large negative values
neg_inds = inds[0:50]
print("Highly negative words: ")
print([str(x) for x in list(vocab[neg_inds])])

## Words with large positive values
pos_inds = inds[-49:-1]
print("Highly positive words: ")
print([str(x) for x in list(vocab[pos_inds])])


Create a Word Cloud

from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_set, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

# plt.rcParams['figure.figsize'] = [16, 6]

wc.generate(" ".join(list(vocab[neg_inds])))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()


Naive Bayes

from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))
Test error: 0.174

Let’s try a few prediction cases; [1] means positive and [-1] means negative (the predicted label is shown as a comment next to each call).

print(nb_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))    # [1]
print(nb_clf.predict(vectorizer.transform(["Waste of my time"])))                  # [-1]
print(nb_clf.predict(vectorizer.transform(["It is not what like"])))               # [-1]
print(nb_clf.predict(vectorizer.transform(["It is not what I m looking for"])))    # [1]

The last test case has a problem: it should be a negative comment, but the model predicts positive.
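One way to see how marginal that wrong call is (my own addition; MultinomialNB exposes predicted class probabilities) is to inspect predict_proba:

## nb_clf.classes_ is [-1, 1], so the columns below are P(negative) and P(positive)
probs = nb_clf.predict_proba(vectorizer.transform(["It is not what I m looking for"]))
print(nb_clf.classes_)
print(probs)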

SVM

from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(loss="hinge", penalty='l2')
svm_clf.fit(train_data, train_labels)
svm_preds_test = svm_clf.predict(test_data)
svm_errs_test = np.sum((svm_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(svm_errs_test)/len(test_labels))

Test error: 0.2

Again, let’s do some predictions:

print(svm_clf.predict(vectorizer.transform(["This is not what I like"])))            # [-1]
print(svm_clf.predict(vectorizer.transform(["It is not what I am looking for"])))    # [-1]
print(svm_clf.predict(vectorizer.transform(["I would not recommend this movie"])))   # [1]

SVM predicts the comment ‘It is not what I am looking for’ correctly. However, it could not handle the comment ‘I would not recommend this movie’.
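To see how borderline those calls are (again my own addition; the hinge-loss SGDClassifier has no predict_proba), you can look at the signed distance from the separating hyperplane:

## decision_function returns the signed margin: > 0 means predicted positive (class 1),
## < 0 means predicted negative (class -1)
scores = svm_clf.decision_function(vectorizer.transform(["I would not recommend this movie",
                                                         "It is not what I am looking for"]))
print(scores)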

LSTM networks

A detailed discussion of LSTM networks can be found here.

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping

max_review_length = 200

tokenizer = Tokenizer(num_words=10000,  # max no. of unique words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                      lower=True  # convert to lower case
                     )
tokenizer.fit_on_texts(sents_processed)

Truncate and pad the input sequences so that they are all the same length:

X = tokenizer.texts_to_sequences(sents_processed)
X = sequence.pad_sequences(X, maxlen= max_review_length)
print('Shape of data tensor:', X.shape)
Shape of data tensor: (3000, 200)

Recall that y is a vector of 1s and -1s. Now I change it to a matrix with two columns that represent -1 and 1.

import pandas as pd
Y=pd.get_dummies(y).values
Y


np.random.seed(0)
test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                      np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = X[train_inds,]
train_labels = Y[train_inds]

test_data = X[test_inds,]
test_labels = Y[test_inds]

Create the network

EMBEDDING_DIM = 200

model = Sequential()
model.add(Embedding(10000, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(250, dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


epochs = 2
batch_size = 40

model.fit(train_data, train_labels,
          epochs=epochs,
          batch_size=batch_size,
          validation_split=0.1)

loss, acc = model.evaluate(test_data, test_labels, verbose=2,
                           batch_size=batch_size)

print(f"loss: {loss}")
print(f"Test accuracy: {acc}")

LSTM performs the best out of all the models trained so far, i.e. Logistic Regression, Naive Bayes and SVM. Now let’s see how it predicts a test case.

outcome_labels = ['Negative', 'Positive']

new = ["I would not recommend this movie"]
seq = tokenizer.texts_to_sequences(new)
padded = sequence.pad_sequences(seq, maxlen=max_review_length)
pred = model.predict(padded)
print("Probability distribution: ", pred)
print("Is this a Positive or Negative review? ")
print(outcome_labels[np.argmax(pred)])
new = ["It is not what i am looking for"]
new = ["This isn't what i am looking for"]

For these sentences, the difference between the probabilities for negative and positive is not large, and the LSTM model classifies them as positive.

new = ["I wouldn't recommend this movie"]

The same happens for this comment. This means that our model cannot distinguish between n’t and not . One possible solution would be, in the pre-processing step, instead of removing all punctuation, to change all the n’t short forms into not . This can simply be done with the re module in Python. You can check for yourself how the models’ predictions improve.
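As a sketch of that idea (my own suggestion for how it might look, applied to the raw sentences before punctuation is stripped):

import re

def expand_nt(text):
    ## Turn "n't" endings into " not" so negations survive pre-processing.
    ## Special cases such as "can't" and "won't" would still need their own rules.
    return re.sub(r"n't\b", " not", text)

print(expand_nt("I wouldn't recommend this movie"))   # I would not recommend this movie
print(expand_nt("This isn't what I am looking for"))  # This is not what I am looking for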

That is it! I hope you enjoyed and picked up something from this article. If you have any questions, feel free to put them in the comment section below. Thank you for reading. Have a great day and take care, everyone!


Photo by Lucas Clara on Unsplash
