NLP Sentiment Analysis for Beginners
A Step-By-Step Approach to Understand TextBlob, NLTK, Scikit-Learn, and LSTM networks
Introduction
Natural Language Processing (NLP) is the area of machine learning that focuses on the generation and understanding of language. Its main objective is to enable machines to understand, communicate and interact with humans in a natural way.
NLP has many tasks, such as Text Generation, Text Classification, Machine Translation, Speech Recognition, Sentiment Analysis, etc. For a beginner to NLP, looking at these tasks and all the techniques involved in handling them can be quite daunting, and it is genuinely difficult for a newbie to know where and how to start.
Out of all the NLP tasks, I personally think that Sentiment Analysis (SA) is probably the easiest, which makes it the most suitable starting point for anyone who wants to get into NLP.
In this article, I will show you how to perform SA using various techniques, ranging from simple ones like TextBlob and NLTK to more advanced ones like Sklearn and Long Short Term Memory (LSTM) networks.
After reading this, you can expect to understand the following:
- Toolkits used in SA: TextBlob and NLTK
- Algorithms used in SA: Naive Bayes, SVM, Logistic Regression and LSTM
- Jargon such as stop-word removal, stemming, bag of words, corpus, tokenisation, etc.
- How to create a word cloud
The flow of this article:
- Data cleaning and pre-processing
- TextBlob
- Algorithms: Logistic Regression, Naive Bayes, SVM and LSTM
Let’s get started!
Data and Problem Formulation
In this article, I will use the sentiment data set that consists of 3000 sentences coming from reviews on imdb.com, amazon.com, and yelp.com. Each sentence is labelled according to whether it comes from a positive review (labelled as 1) or a negative review (labelled as 0).
Data can be downloaded from the website. Alternatively, it can be downloaded from here (more recommended). The folder sentiment_labelled_sentences (containing the data file full_set.txt) should be in the same directory as your notebook.
Load and Pre-process the Data
Set up and import libraries
%matplotlib inline

import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rc('xtick', labelsize=14)
matplotlib.rc('ytick', labelsize=14)
Now, we load in the data and look at the first 10 comments
with open("sentiment_labelled_sentences/full_set.txt") as f: content = f.readlines()content[0:10]
## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

sentences[0:10]
labels[0:10]
One can stop here for this section, but I prefer transforming y into (-1, 1) form, where -1 represents negative and 1 represents positive.
## Transform the labels from '0 v.s. 1' to '-1 v.s. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1
NOTICE THAT SO FAR WE HAVE NOT DONE ANYTHING TO THE WORDS YET! The next section focuses on the words in the sentences.
Pre-processing the text data
To feed the data into any model, the input must be in vector form. We will do the following transformations:
- Remove punctuation and numbers
- Transform all words to lower-case
- Remove stop words (e.g. the, a, that, this, it, …)
- Tokenize the texts
- Convert the sentences into vectors, using a bag-of-words representation
I will explain some of the new jargon here.
1. Stop words: common words that are ‘not interesting’ for the task at hand. These usually include articles such as ‘a’ and ‘the’, pronouns such as ‘i’ and ‘they’, and prepositions such as ‘to’ and ‘from’.
def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

stoppers = ['a', 'is', 'of', 'the', 'this', 'uhm', 'uh']

removeStopWords(stoppers, "this is a test of the stop word removal code")
Or we can use NLTK if we want the complete set of common English stop words:
from nltk.corpus import stopwords

stops = stopwords.words("english")

removeStopWords(stops, "this is a test of the stop word removal code.")
Same result
2. Corpus: simply a collection of texts. The order of words matters: ‘not great’ is different from ‘great’.
3. Document-Term Matrix, or Bag of Words (BOW), is simply a vectorial representation of text sentences (or documents).
A common way to represent a set of features like this is as a one-hot vector. For example, let’s say the vocabulary from our set of texts is:
today, here, I, a, fine, sun, moon, bird, saw
The sentence we want to build a BOW for is:
I saw a bird today.
Using a 1 or 0 for each word in the vocabulary, our BOW encoded as a one-hot vector would be:

1 0 1 1 0 0 0 1 1
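To make this concrete, here is a tiny sketch (using only the toy vocabulary above, not the review data) that builds the one-hot bag-of-words vector by hand:

# Toy vocabulary and sentence from the example above
vocabulary = ["today", "here", "I", "a", "fine", "sun", "moon", "bird", "saw"]
sentence = "I saw a bird today"

# Mark 1 if a vocabulary word appears in the sentence, 0 otherwise
tokens = set(sentence.split())
bow = [1 if word in tokens else 0 for word in vocabulary]
print(bow)  # [1, 0, 1, 1, 0, 0, 0, 1, 1]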
In order to create a bag of words, we need to break down a long sentence or a document into smaller pieces. This process is called tokenization. The most common tokenization technique is to break text down into words. We can do this using CountVectorizer in Scikit-Learn, where every row will represent a different document and every column will represent a different word. In addition, with CountVectorizer, we can also remove stop words.
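Before applying this to the reviews, here is a minimal sketch on two made-up sentences (an illustrative assumption, not our data) of how CountVectorizer tokenizes text, drops English stop words, and builds a document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical example documents
toy_docs = ["I saw a bird today", "The bird saw a fine sun today"]

toy_vectorizer = CountVectorizer(analyzer="word", stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# The learned vocabulary, sorted by column index
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
# The document-term matrix: one row per document, one column per remaining word
print(toy_matrix.toarray())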
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in sentences]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and remove any white space
sents_lower = [x.lower() for x in remove_punc]
sents_lower = [x.strip() for x in sents_lower]

## Remove stop words
from nltk.corpus import stopwords
stops = stopwords.words("english")

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

sents_processed = [removeStopWords(stops, x) for x in sents_lower]
Let’s look at what our sentences look like now.
Uhm, wait a minute! Removing so many stop words makes many sentences lose their meaning. For example, ‘way plug us unless go converter’ does not make any sense to me. This is because we removed all the common English stop words using NLTK. To overcome this problem, let’s create our own, smaller set of stop words instead.
stop_set = ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']

sents_processed = [removeStopWords(stop_set, x) for x in sents_lower]
It is OK to stop here and move on to tokenization. However, one can continue with stemming. The goal of stemming is to strip off prefixes and suffixes and convert a word into its base form, e.g. studying -> study, beautiful -> beauty, cared -> care. In NLTK, there are two popular stemming techniques, called Porter and Lancaster.
import nltk

def stem_with_porter(words):
    porter = nltk.PorterStemmer()
    new_words = [porter.stem(w) for w in words]
    return new_words

def stem_with_lancaster(words):
    lancaster = nltk.LancasterStemmer()
    new_words = [lancaster.stem(w) for w in words]
    return new_words

## Use a plain variable name so we do not shadow the built-in str
sentence = "Please don't unbuckle your seat-belt while I am driving, he said"

print("porter:", stem_with_porter(sentence.split()))
print()
print("lancaster:", stem_with_lancaster(sentence.split()))
Let’s try it on our sents_processed to see whether it makes sense.
porter = [stem_with_porter(x.split()) for x in sents_processed]
porter = [" ".join(i) for i in porter]

porter[0:10]
Some weird changes occur, e.g. very->veri, quality->qualiti, value->valu, …
I don’t know what you think, but I personally do not like stemming here. Maybe it is useful in other cases. For those who are experts in stemming, let me know when it is useful :)
4. Term Frequency-Inverse Document Frequency (TF-IDF). This is a measure of the relative importance of a word within a document, in the context of multiple documents; in our case here, multiple reviews.
We start with the TF part: this is simply the normalized frequency of the word in the document:

(word count in document) / (total words in document)

The IDF is a weighting of the uniqueness of the word across all of the documents. Here is the complete formula of TF-IDF (a small worked sketch follows the definitions below):

tf_idf(t,d) = (wc(t,d) / wc(d)) / (dc(t) / dc())
where:
- wc(t,d) = number of occurrences of term t in document d
- wc(d) = number of words in document d
- dc(t) = number of documents that contain at least one occurrence of term t
- dc() = number of documents in the collection
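To make the formula concrete, here is a small sketch with hypothetical counts (the numbers are made up, not taken from our reviews). Note that scikit-learn's TfidfTransformer, which we use below, applies a smoothed, log-scaled IDF, so its values will differ from this simple ratio.

def tf_idf(wc_t_d, wc_d, dc_t, dc_all):
    """tf_idf(t,d) = (wc(t,d) / wc(d)) / (dc(t) / dc())"""
    tf = wc_t_d / wc_d    # normalized frequency of the term in the document
    idf = dc_t / dc_all   # fraction of documents that contain the term
    return tf / idf

# Hypothetical counts: the term appears 3 times in a 100-word review,
# and in 50 of the 3000 reviews in the collection.
print(tf_idf(wc_t_d=3, wc_d=100, dc_t=50, dc_all=3000))  # 0.03 / (50/3000) = 1.8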
Now, let’s create a bag of words and normalise the texts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(analyzer="word",
                             preprocessor=None,
                             stop_words='english',
                             max_features=6000,
                             ngram_range=(1,5))

data_features = vectorizer.fit_transform(sents_processed)

tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)

data_mat = data_features_tfidf.toarray()
Now data_mat is our document-term matrix, and the input is ready to be fed into a model. Let’s create training and test sets. Here, I split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).
np.random.seed(0)

test_index = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                       np.random.choice((np.where(y==1))[0], 250, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))

train_data = data_mat[train_index,]
train_labels = y[train_index]

test_data = data_mat[test_index,]
test_labels = y[test_index]
TextBlob
- TextBlob: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it sits in a sentence. The TextBlob module allows us to take advantage of these labels: TextBlob finds all the words and phrases that it can assign polarity and subjectivity to, and averages them together.
- Sentiment labels: each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we are going to ignore them for now). A corpus’ sentiment is the average of these.
- Polarity: how positive or negative a word is. -1 is very negative; +1 is very positive.
- Subjectivity: how subjective, or opinionated, a word is. 0 is fact; +1 is very much an opinion.
from textblob import TextBlob

# Create polarity and subjectivity functions
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]
This is a rule-based method that determines the sentiment (polarity and subjectivity) of a review.
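As a quick illustration (a minimal sketch on two made-up sentences, not part of the review data), this is how the polarity and subjectivity scores can be inspected for individual pieces of text:

from textblob import TextBlob

# TextBlob averages the polarity and subjectivity of the words it recognizes
for text in ["This movie was absolutely wonderful", "Waste of my time"]:
    blob = TextBlob(text)
    print(text, "-> polarity:", blob.sentiment.polarity,
          "subjectivity:", blob.sentiment.subjectivity)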
The next section will look at various algorithms.
Logistic Regression
from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

Training error: 0.0116
Test error: 0.184
Words with large influence
Which words are most important in deciding whether a sentence is positive? As a first approximation, we simply take the words whose coefficients in w have the largest positive values. Likewise, we look at the words whose coefficients in w have the most negative values, and we think of these as influential in negative predictions.
## Convert the vocabulary into a list
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])])

## Get indices that sort w
inds = np.argsort(w)

## Words with large negative coefficients
neg_inds = inds[0:50]
print("Highly negative words: ")
print([str(x) for x in list(vocab[neg_inds])])

## Words with large positive coefficients (the last 50 indices)
pos_inds = inds[-50:]
print("Highly positive words: ")
print([str(x) for x in list(vocab[pos_inds])])
Create a Word Cloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_set,
               background_color="white",
               colormap="Dark2",
               max_font_size=150,
               random_state=42)

# plt.rcParams['figure.figsize'] = [16, 6]
wc.generate(" ".join(list(vocab[neg_inds])))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
Naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))

Test error: 0.174
Let’s do some prediction cases. [1] means positive and [-1] means negative
print(nb_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))
[1]

print(nb_clf.predict(vectorizer.transform(["Waste of my time"])))
[-1]

print(nb_clf.predict(vectorizer.transform(["It is not what like"])))
[-1]

print(nb_clf.predict(vectorizer.transform(["It is not what I m looking for"])))
[1]
The last test case has a problem. It should be a negative comment, but the model predicts positive.
SVM
from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(loss="hinge", penalty='l2')
svm_clf.fit(train_data, train_labels)
svm_preds_test = svm_clf.predict(test_data)
svm_errs_test = np.sum((svm_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(svm_errs_test)/len(test_labels))

Test error: 0.2
Again, let’s do some predictions.
print(svm_clf.predict(vectorizer.transform(["This is not what I like"])))
[-1]

print(svm_clf.predict(vectorizer.transform(["It is not what I am looking for"])))
[-1]

print(svm_clf.predict(vectorizer.transform(["I would not recommend this movie"])))
[1]
The SVM predicts the comment ‘It is not what I am looking for’ correctly. However, it could not handle the comment ‘I would not recommend this movie’.
LSTM networks
A detailed discussion about LSTM networks can be found here.
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping

max_review_length = 200

tokenizer = Tokenizer(num_words=10000,  # max no. of unique words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                      lower=True)  # convert to lower case
tokenizer.fit_on_texts(sents_processed)
Truncate and pad the input sequences so that they are all the same length.
X = tokenizer.texts_to_sequences(sents_processed)
X = sequence.pad_sequences(X, maxlen=max_review_length)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (3000, 200)
Recall that y is a vector of 1s and -1s. Now I change it to a matrix with two columns that represent -1 and 1.
import pandas as pd

Y = pd.get_dummies(y).values
Y
np.random.seed(0)

test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                      np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = X[train_inds,]
train_labels = Y[train_inds]

test_data = X[test_inds,]
test_labels = Y[test_inds]
Create the network
EMBEDDING_DIM = 200

model = Sequential()
model.add(Embedding(10000, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(250, dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
epochs = 2
batch_size = 40

model.fit(train_data, train_labels, epochs=epochs, batch_size=batch_size, validation_split=0.1)
loss, acc = model.evaluate(test_data, test_labels, verbose=2, batch_size=batch_size)

print(f"loss: {loss}")
print(f"Test accuracy: {acc}")
The LSTM performs the best out of all the models trained so far, i.e. Logistic Regression, Naive Bayes and SVM. Now let’s see how it predicts on a few test cases.
outcome_labels = ['Negative', 'Positive']

new = ["I would not recommend this movie"]
seq = tokenizer.texts_to_sequences(new)
padded = sequence.pad_sequences(seq, maxlen=max_review_length)
pred = model.predict(padded)
print("Probability distribution: ", pred)
print("Is this a Positive or Negative review? ")
print(outcome_labels[np.argmax(pred)])
new = ["It is not what i am looking for"]
new = ["This isn't what i am looking for"]
For this case, the difference between the probabilities for negative and positive is not large, and the LSTM model classifies it as positive.
new = ["I wouldn't recommend this movie"]
The same happens for this comment. This means that our model cannot distinguish between n't and not. One possible solution would be, in the pre-processing step, to change the n't short form into not instead of removing all punctuation. This can simply be done with the re module in Python. You can check it out yourself to see how the models' predictions improve.
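As a hedged illustration (this helper and its regular expressions are my own sketch, not part of the original notebook), the replacement could look something like this, applied to the raw sentences before punctuation is stripped:

import re

def expand_negations(text):
    # "won't" and "can't" are irregular, so handle them separately
    text = re.sub(r"\bwon't\b", "will not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bcan't\b", "can not", text, flags=re.IGNORECASE)
    # Generic case: "isn't" -> "is not", "wouldn't" -> "would not", ...
    text = re.sub(r"n't\b", " not", text, flags=re.IGNORECASE)
    return text

print(expand_negations("I wouldn't recommend this movie"))
# I would not recommend this movie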
That is it! I hope you enjoyed and picked up something from this article. If you have any questions, feel free to put them down in the comment section below. Thank you for reading. Have a great day and take care, everyone!