Sentiment Analysis in Python with Amazon Product Review Data


Learn how to perform sentiment analysis in Python with the scikit-learn library.


In today’s world, sentiment analysis can play a vital role in any industry. Classifying tweets, Facebook comments, or product reviews with an automated system can save a lot of time and money, and it can also be more consistent than manual labeling. In this article, I will walk through a sentiment analysis task using a product review dataset.

I am going to use Python and a few of its libraries. Even if you haven’t used these libraries before, you should be able to follow along. If this is new to you, copy each step of the code into your own notebook and inspect the output for a better understanding.

Tools Used

  1. Python
  2. Pandas library
  3. scikit-learn library
  4. Jupyter Notebook as an IDE.

Dataset and Task Overview

I am going to use a product review dataset, as mentioned earlier. The dataset contains Amazon baby product reviews. Please download the dataset from this link if you want to practice with it. It has three columns: name, review, and rating. Reviews are text data, and ratings are integers from 1 to 5, where 1 is the worst and 5 is the best. Our job is to classify the reviews as positive or negative. Let’s have a look at the dataset, using the first five entries to examine the data.

import pandas as pd

products = pd.read_csv('amazon_baby.csv')
products.head()

Data Preprocessing

In real life, data scientists rarely get data that is clean and already prepared for machine learning models. For almost every project, you have to spend time cleaning and processing the data. So, let’s clean the dataset we will be working with.

One important data cleaning step is to get rid of NaN values. Let’s check how many null values we have in the dataset. We have to work with all three columns, and each of them is crucial, so a row missing a value in any column is of no use to us.

len(products) - len(products.dropna())

We have null values in 1,147 rows. Now, let’s check how many rows we have in total.

len(products)

We have a total of 183,531 rows. So, even if we delete all the rows with null values, we will still have a sizable amount of data to train an algorithm. So, drop the null values.

products = products.dropna()
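As a side note, if you want to see which columns the null values come from before dropping them, pandas can break the count down per column (a quick sketch):

# Per-column count of missing values (run this before dropna).
products.isnull().sum()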

We need every entry in the review column to be a string. If any entry has a different type, it will cause trouble in later steps. So let’s make sure each review is a string, converting any that are not.

# Note: assigning through chained indexing like products.iloc[i]['review'] = ...
# operates on a copy and does not change the DataFrame, so convert the
# whole column in one step instead.
products['review'] = products['review'].astype(str)

As we are doing sentiment analysis, it is important to tell our model what counts as positive sentiment and what counts as negative sentiment. In our rating column, we have ratings from 1 to 5. We can define 1 and 2 as bad reviews and 4 and 5 as good reviews. What about 3? It sits right in the middle: neither good nor bad, just average. But we want to classify reviews as good or bad, so I decided to get rid of all the 3s. Whether a 3 counts as good or bad depends on your project or your own judgment; if you want to put 3 in the good-review bucket, just do it. But I am getting rid of them.

products = products[products['rating'] != 3]

We will denote positive sentiment as 1 and negative sentiment as 0. Let’s write a function ‘sentiment’ that returns 1 if the rating is 4 or more, and 0 otherwise. Then apply the function to create a new column that represents the sentiment as 1 or 0.

def sentiment(n):
    return 1 if n >= 4 else 0

products['sentiment'] = products['rating'].apply(sentiment)

Next, we need to prepare the training features by combining the ‘name’ and ‘review’ columns into one single column. First, write a function ‘combined_features’ that concatenates the two columns. Then apply it to create a new column ‘all_features’ containing the text from both the name and review columns.

def combined_features(row):
    return row['name'] + ' ' + row['review']

products['all_features'] = products.apply(combined_features, axis=1)
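Incidentally, the same concatenation can be written without apply, which tends to be faster on a DataFrame of this size (a minor alternative; the astype(str) call is just a safeguard in case any name is not a string):

# Vectorized string concatenation; equivalent to the apply above.
products['all_features'] = products['name'].astype(str) + ' ' + products['review']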

Develop the Sentiment Classifier

Here is the process step by step.

We need to define the input variable X and the output variable y. X should be the ‘all_features’ column, and y should be the ‘sentiment’ column.

X = products['all_features']
y = products['sentiment']

Now we are ready to develop our sentiment classifier. We need to split the dataset so that there is a training set and a test set. The ‘train_test_split’ function from the scikit-learn library is helpful here. The model will be trained on the training set, and its performance will be measured on the test set. By default, ‘train_test_split’ splits the data in a 75/25 proportion: 75% for training and 25% for testing. If you want a different proportion, you need to specify it.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
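For example, if you wanted an 80/20 split instead, you could pass the test_size parameter explicitly (the 0.2 below is just for illustration):

# Hold out 20% of the rows for testing instead of the default 25%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)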

I am going to use ‘CountVectorizer’ from the scikit-learn library. CountVectorizer learns the vocabulary of the text and represents each document as a vector of word counts. Import CountVectorizer, fit it on the training data, and then use it to transform both the training and test data.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
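If the idea of a count vector is new to you, a toy example may help. The two sentences below are made up purely for illustration (note: get_feature_names_out requires scikit-learn 1.0 or later; older versions call it get_feature_names):

# Two made-up documents to show what CountVectorizer produces.
toy = ['good product really good', 'bad product']
toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy)
print(toy_cv.get_feature_names_out())  # vocabulary learned from the text
print(toy_matrix.toarray())            # one row of word counts per document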

Let’s dive into the model itself. This is the most fun part. We will use logistic regression, since this is a binary classification problem. Let’s do the necessary imports and fit the model on our training data.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(ctmTr, y_train)

The logistic regression model is trained with the training data. Here is the output of the training.

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=None, solver='warn',
                   tol=0.0001, verbose=0, warm_start=False)

If this output looks obscure to you, please do not worry about it. It simply echoes the settings of the fitted estimator, most of which are scikit-learn defaults; the weights learned from the data are stored inside the model object.

Results

Use the trained model above to predict the sentiments for the test data.

y_pred_class = model.predict(X_test_dtm)

Use the accuracy_score function to measure the accuracy on the test data.

accuracy_score(y_test, y_pred_class)
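Once you are happy with the score, you can classify brand-new text the same way; the review below is invented purely for illustration:

# A new review must go through the same fitted vectorizer
# before the model can score it.
new_review = ['This stroller is sturdy and my baby loves it']
print(model.predict(cv.transform(new_review)))  # 1 = positive, 0 = negative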

The accuracy score I got on the test set is 84%, which is very good. I have another article showing the same project using TensorFlow that gives better accuracy. Please check it out:

Binary Classification of Product Reviews Using Tensorflow and Python

Some more recommended reading materials:

Logistic Regression in Python From Scratch to End With a Real Dataset

Multiclass Classification With Logistic Regression One vs All Method From Scratch Using Python

Polynomial Regression From Scratch in Python

Understanding p-test, Characteristics, and Calculation With Example

