NLP Sentiment Analysis for Beginners
A Step-By-Step Approach to Understand TextBlob, NLTK, Scikit-Learn, and LSTM networks
Introduction
Natural Language Processing (NLP) is the area of machine learning that focuses on the generation and understanding of language. Its main objective is to enable machines to understand, communicate and interact with humans in a natural way.
NLP has many tasks, such as Text Generation, Text Classification, Machine Translation, Speech Recognition, Sentiment Analysis, etc. For a beginner to NLP, looking at these tasks and all the techniques involved in handling them can be quite daunting, and it is genuinely difficult for a newbie to know where and how to start.
Out of all the NLP tasks, I personally think that Sentiment Analysis (SA) is probably the easiest, which makes it the most suitable starting point for anyone who wants to get into NLP.
In this article, I will show you how to perform SA using various techniques, ranging from simple ones like TextBlob and NLTK to more advanced ones like Sklearn and Long Short Term Memory (LSTM) networks.
After reading this, you can expect to understand the following:
- Toolkits used in SA: TextBlob and NLTK
- Algorithms used in SA: Naive Bayes, SVM, Logistic Regression and LSTM
- Jargon such as stop-word removal, stemming, bag of words, corpus, tokenisation, etc.
- How to create a word cloud
The flow of this article:
- Data cleaning and pre-processing
- TextBlob
- Algorithms: Logistic Regression, Naive Bayes, SVM and LSTM
Let’s get started!
Data and Problem Formulation
In this article, I will use the sentiment data set that consists of 3000 sentences coming from reviews on imdb.com, amazon.com, and yelp.com. Each sentence is labelled according to whether it comes from a positive review (labelled as 1) or a negative review (labelled as 0).
Data can be downloaded from the website. Alternatively, it can be downloaded from here (more recommended). The folder sentiment_labelled_sentences (containing the data file full_set.txt) should be in the same directory as your notebook.
Load and Pre-process the Data
Set up and import libraries
%matplotlib inline

import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rc('xtick', labelsize=14)
matplotlib.rc('ytick', labelsize=14)
Now, we load in the data and look at the first 10 comments
with open("sentiment_labelled_sentences/full_set.txt") as f: content = f.readlines()content[0:10]
## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

sentences[0:10]
labels[0:10]
One can stop here for this section, but I prefer transforming y into (-1, 1) form, where -1 represents negative and 1 represents positive.
## Transform the labels from '0 v.s. 1' to '-1 v.s. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1
NOTICE THAT SO FAR WE HAVE NOT DONE ANYTHING TO THE WORDS YET! The next section focuses on the words in the sentences.
Pre-processing the text data
To feed the data into any model, the input must be in vector form. We will do the following transformations:
- Remove punctuation and numbers
- Transform all words to lower-case
- Remove stop words (e.g. the, a, that, this, it, …)
- Tokenize the texts
- Convert the sentences into vectors, using a bag-of-words representation
I will explain some of the new jargon here.
1. Stop words: common words that are ‘not interesting’ for the task at hand. These usually include articles such as ‘a’ and ‘the’, pronouns such as ‘i’ and ‘they’, and prepositions such as ‘to’ and ‘from’.
def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

stoppers = ['a', 'is', 'of', 'the', 'this', 'uhm', 'uh']

removeStopWords(stoppers, "this is a test of the stop word removal code")
Or we can use NLTK if we want the complete set of common English stop words:
from nltk.corpus import stopwords

stops = stopwords.words("english")

removeStopWords(stops, "this is a test of the stop word removal code.")
Same result
2. Corpus: simply a collection of texts. The order of words matters: ‘not great’ is different from ‘great’.
3. Document-Term Matrix, or Bag of Words (BOW), is simply a vectorial representation of text sentences (or documents).
A common way to represent a set of features like this is as a one-hot vector. For example, let’s say the vocabulary from our set of texts is:
today, here, I, a, fine, sun, moon, bird, saw
The sentence we want to build a BOW for is:
I saw a bird today.
Using a 1 or 0 for each word in the vocabulary, our BOW encoded as a one-hot vector would be:

1 0 1 1 0 0 0 1 1
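To make this concrete, here is a tiny sketch (using only the toy vocabulary above, not the review data) that builds the one-hot bag-of-words vector by hand:

# Toy vocabulary and sentence from the example above
vocabulary = ["today", "here", "I", "a", "fine", "sun", "moon", "bird", "saw"]
sentence = "I saw a bird today"

# Mark 1 if a vocabulary word appears in the sentence, 0 otherwise
tokens = set(sentence.split())
bow = [1 if word in tokens else 0 for word in vocabulary]
print(bow)  # [1, 0, 1, 1, 0, 0, 0, 1, 1]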
In order to create a bag of words, we need to break down a long sentence or a document into smaller pieces. This process is called tokenization. The most common tokenization technique is to break text down into words. We can do this using CountVectorizer in Scikit-Learn, where every row will represent a different document and every column will represent a different word. In addition, with CountVectorizer, we can also remove stop words.
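Before applying this to the reviews, here is a minimal sketch on two made-up sentences (an illustrative assumption, not our data) of how CountVectorizer tokenizes text, drops English stop words, and builds a document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical example documents
toy_docs = ["I saw a bird today", "The bird saw a fine sun today"]

toy_vectorizer = CountVectorizer(analyzer="word", stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# The learned vocabulary, sorted by column index
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
# The document-term matrix: one row per document, one column per remaining word
print(toy_matrix.toarray())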
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in sentences]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and remove any white space
sents_lower = [x.lower() for x in remove_punc]
sents_lower = [x.strip() for x in sents_lower]

## Remove stop words
from nltk.corpus import stopwords
stops = stopwords.words("english")

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

sents_processed = [removeStopWords(stops, x) for x in sents_lower]
Let’s look at what our sentences look like now.
Uhm, wait a minute! Removing so many stop words makes many sentences lose their meaning. For example, ‘way plug us unless go converter’ does not make any sense to me. This is because we removed all the common English stop words using NLTK. To overcome this problem, let’s create our own, smaller set of stop words instead.
stop_set = ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']

sents_processed = [removeStopWords(stop_set, x) for x in sents_lower]
It is OK to stop here and move on to tokenization. However, one can continue with stemming. The goal of stemming is to strip off prefixes and suffixes and convert a word into its base form, e.g. studying -> study, beautiful -> beauty, cared -> care. In NLTK, there are two popular stemming techniques, called Porter and Lancaster.
import nltk

def stem_with_porter(words):
    porter = nltk.PorterStemmer()
    new_words = [porter.stem(w) for w in words]
    return new_words

def stem_with_lancaster(words):
    lancaster = nltk.LancasterStemmer()
    new_words = [lancaster.stem(w) for w in words]
    return new_words

## Use a plain variable name so we do not shadow the built-in str
sentence = "Please don't unbuckle your seat-belt while I am driving, he said"

print("porter:", stem_with_porter(sentence.split()))
print()
print("lancaster:", stem_with_lancaster(sentence.split()))
Let’s try it on our sents_processed to see whether it makes sense.
porter = [stem_with_porter(x.split()) for x in sents_processed]
porter = [" ".join(i) for i in porter]

porter[0:10]
Some weird changes occur, e.g. very->veri, quality->qualiti, value->valu, …
I don’t know what you think, but I personally do not like stemming here. Maybe it is useful in other cases. For those who are experts in stemming, let me know when it is useful :)
4. Term Frequency-Inverse Document Frequency (TF-IDF). This is a measure of the relative importance of a word within a document, in the context of multiple documents; in our case here, multiple reviews.
We start with the TF part: this is simply the normalized frequency of the word in the document:

(word count in document) / (total words in document)

The IDF is a weighting of the uniqueness of the word across all of the documents. Here is the complete formula of TF-IDF (a small worked sketch follows the definitions below):

tf_idf(t,d) = (wc(t,d) / wc(d)) / (dc(t) / dc())
where:
- wc(t,d) = number of occurrences of term t in document d
- wc(d) = number of words in document d
- dc(t) = number of documents that contain at least one occurrence of term t
- dc() = number of documents in the collection
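To make the formula concrete, here is a small sketch with hypothetical counts (the numbers are made up, not taken from our reviews). Note that scikit-learn's TfidfTransformer, which we use below, applies a smoothed, log-scaled IDF, so its values will differ from this simple ratio.

def tf_idf(wc_t_d, wc_d, dc_t, dc_all):
    """tf_idf(t,d) = (wc(t,d) / wc(d)) / (dc(t) / dc())"""
    tf = wc_t_d / wc_d    # normalized frequency of the term in the document
    idf = dc_t / dc_all   # fraction of documents that contain the term
    return tf / idf

# Hypothetical counts: the term appears 3 times in a 100-word review,
# and in 50 of the 3000 reviews in the collection.
print(tf_idf(wc_t_d=3, wc_d=100, dc_t=50, dc_all=3000))  # 0.03 / (50/3000) = 1.8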
Now, let’s create a bag of words and normalise the texts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(analyzer="word",
                             preprocessor=None,
                             stop_words='english',
                             max_features=6000,
                             ngram_range=(1,5))

data_features = vectorizer.fit_transform(sents_processed)

tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)

data_mat = data_features_tfidf.toarray()
Now data_mat is our document-term matrix, and the input is ready to be fed into a model. Let’s create training and test sets. Here, I split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).
np.random.seed(0)

test_index = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                       np.random.choice((np.where(y==1))[0], 250, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))

train_data = data_mat[train_index,]
train_labels = y[train_index]

test_data = data_mat[test_index,]
test_labels = y[test_index]
TextBlob
- TextBlob: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it sits in a sentence. The TextBlob module allows us to take advantage of these labels: TextBlob finds all the words and phrases that it can assign polarity and subjectivity to, and averages them together.
- Sentiment labels: each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we are going to ignore them for now). A corpus’ sentiment is the average of these.
- Polarity: how positive or negative a word is. -1 is very negative; +1 is very positive.
- Subjectivity: how subjective, or opinionated, a word is. 0 is fact; +1 is very much an opinion.
from textblob import TextBlob

# Create polarity and subjectivity functions
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]
This is a rule-based method that determines the sentiment (polarity and subjectivity) of a review.
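As a quick illustration (a minimal sketch on two made-up sentences, not part of the review data), this is how the polarity and subjectivity scores can be inspected for individual pieces of text:

from textblob import TextBlob

# TextBlob averages the polarity and subjectivity of the words it recognizes
for text in ["This movie was absolutely wonderful", "Waste of my time"]:
    blob = TextBlob(text)
    print(text, "-> polarity:", blob.sentiment.polarity,
          "subjectivity:", blob.sentiment.subjectivity)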
The next section will look at various algorithms.
Logistic Regression
from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

Training error: 0.0116
Test error: 0.184
Words with large influence
Which words are most important in deciding whether a sentence is positive? As a first approximation, we simply take the words whose coefficients in w have the largest positive values. Likewise, we look at the words whose coefficients in w have the most negative values, and we think of these as influential in negative predictions.
## Convert the vocabulary into a list
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])])

## Get indices that sort w
inds = np.argsort(w)

## Words with large negative coefficients
neg_inds = inds[0:50]
print("Highly negative words: ")
print([str(x) for x in list(vocab[neg_inds])])

## Words with large positive coefficients (the last 50 indices)
pos_inds = inds[-50:]
print("Highly positive words: ")
print([str(x) for x in list(vocab[pos_inds])])
Create a Word Cloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_set,
               background_color="white",
               colormap="Dark2",
               max_font_size=150,
               random_state=42)

# plt.rcParams['figure.figsize'] = [16, 6]
wc.generate(" ".join(list(vocab[neg_inds])))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
Naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))

Test error: 0.174
Let’s do some prediction cases. [1] means positive and [-1] means negative
print(nb_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))
[1]

print(nb_clf.predict(vectorizer.transform(["Waste of my time"])))
[-1]

print(nb_clf.predict(vectorizer.transform(["It is not what like"])))
[-1]

print(nb_clf.predict(vectorizer.transform(["It is not what I m looking for"])))
[1]
The last test case has a problem. It should be a negative comment, but the model predicts positive.
SVM
from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(loss="hinge", penalty='l2')
svm_clf.fit(train_data, train_labels)
svm_preds_test = svm_clf.predict(test_data)
svm_errs_test = np.sum((svm_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(svm_errs_test)/len(test_labels))

Test error: 0.2
Again, let’s do some predictions.
print(svm_clf.predict(vectorizer.transform(["This is not what I like"])))
[-1]

print(svm_clf.predict(vectorizer.transform(["It is not what I am looking for"])))
[-1]

print(svm_clf.predict(vectorizer.transform(["I would not recommend this movie"])))
[1]
The SVM predicts the comment ‘It is not what I am looking for’ correctly. However, it could not handle the comment ‘I would not recommend this movie’.
LSTM networks
A detailed discussion about LSTM networks can be found here.
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping

max_review_length = 200

tokenizer = Tokenizer(num_words=10000,  # max no. of unique words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                      lower=True)  # convert to lower case
tokenizer.fit_on_texts(sents_processed)
Truncate and pad the input sequences so that they are all the same length.
X = tokenizer.texts_to_sequences(sents_processed)
X = sequence.pad_sequences(X, maxlen=max_review_length)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (3000, 200)
Recall that y is a vector of 1s and -1s. Now I change it to a matrix with two columns that represent -1 and 1.
import pandas as pd

Y = pd.get_dummies(y).values
Y
np.random.seed(0)

test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False),
                      np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = X[train_inds,]
train_labels = Y[train_inds]

test_data = X[test_inds,]
test_labels = Y[test_inds]
Create the network
EMBEDDING_DIM = 200

model = Sequential()
model.add(Embedding(10000, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(250, dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
epochs = 2
batch_size = 40

model.fit(train_data, train_labels, epochs=epochs, batch_size=batch_size, validation_split=0.1)
loss, acc = model.evaluate(test_data, test_labels, verbose=2, batch_size=batch_size)

print(f"loss: {loss}")
print(f"Test accuracy: {acc}")
The LSTM performs the best out of all the models trained so far, i.e. Logistic Regression, Naive Bayes and SVM. Now let’s see how it predicts on a few test cases.
outcome_labels = ['Negative', 'Positive']

new = ["I would not recommend this movie"]
seq = tokenizer.texts_to_sequences(new)
padded = sequence.pad_sequences(seq, maxlen=max_review_length)
pred = model.predict(padded)
print("Probability distribution: ", pred)
print("Is this a Positive or Negative review? ")
print(outcome_labels[np.argmax(pred)])
new = ["It is not what i am looking for"]
new = ["This isn't what i am looking for"]
For this case, the difference between the probabilities for negative and positive is not large, and the LSTM model classifies it as positive.
new = ["I wouldn't recommend this movie"]
The same happens for this comment. This means that our model cannot distinguish between n't and not. One possible solution would be, in the pre-processing step, to change the n't short form into not instead of removing all punctuation. This can simply be done with the re module in Python. You can check it out yourself to see how the models' predictions improve.
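As a hedged illustration (this helper and its regular expressions are my own sketch, not part of the original notebook), the replacement could look something like this, applied to the raw sentences before punctuation is stripped:

import re

def expand_negations(text):
    # "won't" and "can't" are irregular, so handle them separately
    text = re.sub(r"\bwon't\b", "will not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bcan't\b", "can not", text, flags=re.IGNORECASE)
    # Generic case: "isn't" -> "is not", "wouldn't" -> "would not", ...
    text = re.sub(r"n't\b", " not", text, flags=re.IGNORECASE)
    return text

print(expand_negations("I wouldn't recommend this movie"))
# I would not recommend this movie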
That is it! I hope you enjoyed and picked up something from this article. If you have any questions, feel free to put them down in the comment section below. Thank you for reading. Have a great day and take care, everyone!