Using Twitter and sentiment analysis to answer the question
May 24 · 6 min read
The Covid-19 pandemic poses an unprecedented challenge to the entire world. With the most confirmed cases and deaths, America is one of the countries hardest hit by the virus. As states begin to partially reopen, the nation has become very polarized on the topic. Some firmly advocate for this measure, citing the importance of the country’s economic health. Yet, others have a strong objection to this, contending that the human cost of reopening can’t be justified. At a time when tensions are high, I sought to gain a better understanding of how exactly Americans feel about the current state of affairs surrounding Covid-19.
In an attempt to answer this question, Srihan Mediboina and I worked together to scrape tweets related to Covid-19 from Twitter and perform sentiment analysis on them. To find out how reactions varied across America, we used tweets from New York, Texas, and California. Let’s get into the project!
Getting Twitter Data
Before we can access our Twitter API credentials, we need to apply for a Twitter developer account. Once our application is approved, we can use Tweepy to access the API and download all the tweets for a hashtag. Calling the search_for_hashtag function allows us to quickly scrape data across hashtags (#coronavirus, #Covid-19, #NewYork, #California, and #Texas were some of the hashtags we used). For a more in-depth look at Tweepy, check out this article.
We performed the sentiment analysis using a Naive Bayes classifier, which requires labeled data because it’s a supervised learning algorithm. Thus, we manually labeled 500 tweets from each of the three states for a total of 1,500 tweets. Each tweet received either a -1 for negative sentiment, a 0 for neutral sentiment, or a 1 for positive sentiment. If you’re interested in performing your own analysis, here’s a link to the data.
Tokenizing
Now we tokenize the tweets by splitting them into individual words (called tokens). Without tokens, we can’t carry out the subsequent steps involved in sentiment analysis. This process becomes simple when we import TweetTokenizer from the Natural Language Toolkit (nltk). The tokenize_tweets function is only two lines of code, and we can apply it to the dataframes to break up the tweets. nltk is a very powerful package for sentiment analysis, so we’ll be using it throughout the article.
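The tokenize_tweets function itself isn't shown in the article, so here is a minimal sketch of what it might look like. The tokenizer options (lowercasing, stripping @handles, shortening repeated letters) are assumptions for illustration, not necessarily the settings we used.

```python
from nltk.tokenize import TweetTokenizer

# Assumed settings: lowercase tokens, drop @handles, shorten runs like "soooo".
_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)


def tokenize_tweets(text):
    """Split one tweet into a list of tokens."""
    return _tokenizer.tokenize(text)


tokens = tokenize_tweets("@CDCgov Stay home and stay safe! #Covid19")
print(tokens)
```

With a pandas dataframe, the same function could be applied to a text column with something like df["tweet text"].apply(tokenize_tweets), assuming that column holds the raw tweet text.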
Stopwords
Stopwords are common words such as “the”, “a”, and “an”. Since these words don’t further our understanding of the sentiment of the text, we filter them out. By importing stopwords from nltk, this step becomes pretty simple: the remove_stopwords function is also two lines of code.
Some of the stopwords that were removed from the first few rows of our California dataset include “some”, “can”, “just”, and “for”.
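As a rough sketch of this step (not the exact code from the project), remove_stopwords can be written as a single list comprehension. The load_english_stopwords helper and the explicit stop_words parameter are assumptions added here so the filter can be tried without downloading the nltk corpus first.

```python
import nltk
from nltk.corpus import stopwords


def load_english_stopwords():
    """Return nltk's English stopword list, downloading the corpus if needed."""
    try:
        return set(stopwords.words("english"))
    except LookupError:
        nltk.download("stopwords", quiet=True)
        return set(stopwords.words("english"))


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stop_words]


# Tiny stand-in list so the example runs offline; in practice, pass
# load_english_stopwords() instead.
demo_stop_words = {"some", "can", "just", "for"}
filtered = remove_stopwords(["some", "people", "can", "just", "stay", "home"], demo_stop_words)
print(filtered)  # ['people', 'stay', 'home']
```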
Cleaning Text
In addition to removing stopwords, we want to make sure that any random characters in our data frames are also removed. For example, several characters such as ‘x97’ and ‘xa3’ appeared in the csv files after we scraped the tweets. After iterating through the data to find these miscellaneous characters, we copied and pasted them into the CleanTxt function. Then, we applied the function to each data frame to remove them.
Hashtags were the most prevalent characters that were removed. By cleaning the text, we can improve our model’s performance.
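The CleanTxt function isn't reproduced here, so the following is only a sketch of the idea under a few assumptions: the junk sequences are treated as literal text such as "\x97" left over in the csv files, and URLs plus the # and @ symbols are stripped as well. The name clean_text and the default junk list are hypothetical.

```python
import re


def clean_text(text, junk=("\\x97", "\\xa3")):
    """Remove leftover escape sequences, URLs, and hashtag/mention symbols."""
    for sequence in junk:
        text = text.replace(sequence, "")
    text = re.sub(r"https?://\S+", "", text)  # drop links
    text = re.sub(r"[#@]", "", text)          # drop the symbols, keep the words
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


print(clean_text("Stay safe\\x97 #Covid19 https://t.co/abc"))  # Stay safe Covid19
```

Passing the junk list as a parameter makes it easy to extend once more stray sequences turn up in a new batch of tweets.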
Lemmatizing
Often, words referring to the same thing appear in different forms (e.g., trouble, troubling, troubled, and troubles all essentially refer to trouble). By lemmatizing the text, we group various inflections of a word together to analyze them as the word’s lemma (how it appears in the dictionary). This process prevents the computer from mistaking different forms of a word for different words. We import WordNetLemmatizer from nltk and call the lemmatize_tweets function for this purpose.
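Here is a minimal sketch of a lemmatize_tweets function, assuming each tweet is already a list of tokens. The optional lemmatizer and pos parameters are additions for illustration. WordNetLemmatizer treats words as nouns by default, so verb forms such as "troubling" are only reduced when pos="v" is passed.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize_tweets(tokens, lemmatizer=None, pos="n"):
    """Replace each token with its lemma for the given part of speech."""
    if lemmatizer is None:
        # Needs the WordNet corpus: run nltk.download("wordnet") once beforehand.
        lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=pos) for token in tokens]
```

A call such as lemmatize_tweets(["troubling", "troubled", "troubles"], pos="v") would then map the different verb forms onto a shared lemma once the WordNet data is installed.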
Master Dataset
Since we’re done with the preprocessing steps, we can move on to creating a master dataset which encompasses all 1,500 tweets. By using df.itertuples, we can iterate over dataframe rows as tuples to append the ‘tweet text’ and ‘values’ attributes to our dataset. Then, we shuffle our dataset using random.shuffle so that the tweets are no longer grouped by state, which keeps the later training/testing split from being skewed by the order in which the tweets were added.
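A sketch of how the master dataset could be assembled is below. The helper name build_master_dataset, the seed parameter, and the toy dataframes are assumptions for illustration. Because the column name ‘tweet text’ contains a space, it can't be read as a named-tuple attribute, so this version asks itertuples for plain tuples instead.

```python
import random

import pandas as pd


def build_master_dataset(dataframes, seed=None):
    """Collect (tokens, label) pairs from every dataframe and shuffle them."""
    dataset = []
    for df in dataframes:
        columns = df[["tweet text", "values"]]
        # name=None yields plain tuples, avoiding attribute access by column name.
        for tokens, label in columns.itertuples(index=False, name=None):
            dataset.append((tokens, label))
    random.Random(seed).shuffle(dataset)  # seeded only to make runs repeatable
    return dataset


# Toy stand-ins for the labeled state dataframes.
ca_df = pd.DataFrame({"tweet text": [["stay", "home"], ["beaches", "open"]], "values": [0, 1]})
tx_df = pd.DataFrame({"tweet text": [["cases", "rising"]], "values": [-1]})
dataset = build_master_dataset([ca_df, tx_df], seed=42)
print(len(dataset))  # 3
```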
Following that step, we iterate through all of the data frames and add every single word to the all_words list. Next, we use nltk.FreqDist to create a distribution of the frequency of each word. Since some words are more common than others, we want to ensure that the most relevant words are used to train our Naive Bayes classifier. Currently, each tweet is a list of words. However, we can represent each tweet as a dictionary instead of a list: the keys are the word features, and the values are either True or False depending on whether the tweet contains that word feature. This dictionary representing a tweet is known as a feature set. We will generate a feature set for each tweet and train our Naive Bayes classifier on the feature sets.
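The feature-set construction described above can be sketched as follows. The helper names build_word_features and find_features, the top_n cutoff of 2000, and the toy dataset are assumptions for illustration rather than the values used in the project.

```python
import nltk


def build_word_features(dataset, top_n=2000):
    """Return the top_n most frequent words across all tweets."""
    all_words = [word for tokens, _ in dataset for word in tokens]
    freq = nltk.FreqDist(all_words)
    return [word for word, _ in freq.most_common(top_n)]


def find_features(tokens, word_features):
    """Map every word feature to True/False depending on whether the tweet has it."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}


# Toy dataset of (tokens, label) pairs.
dataset = [(["stay", "home", "stay"], 0), (["stay", "safe", "home"], 1), (["reopen"], -1)]
word_features = build_word_features(dataset, top_n=2)
feature_sets = [(find_features(tokens, word_features), label) for tokens, label in dataset]
print(word_features)    # ['stay', 'home']
print(feature_sets[2])  # ({'stay': False, 'home': False}, -1)
```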
Training/Testing the Model
The feature_sets will be split 80/20 into the training and testing set, respectively. After training the Naive Bayes classifier on the training set, we can check its performance by comparing its predictions for the sentiment of tweets (results[i]) against the labeled sentiment of the tweets (testing_set[i][0]).
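Below is a minimal sketch of the split, training, and error calculation. The function name train_and_evaluate and the toy data are hypothetical. Note one assumption about layout: nltk's NaiveBayesClassifier expects (features, label) pairs, so in this sketch the label sits at index 1 of each pair, which may differ from how the tuples are ordered in the full code.

```python
import nltk


def train_and_evaluate(feature_sets, train_fraction=0.8):
    """Train on the first 80% of feature_sets and report the error on the rest."""
    split = int(len(feature_sets) * train_fraction)
    training_set, testing_set = feature_sets[:split], feature_sets[split:]
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    results = [classifier.classify(features) for features, _ in testing_set]
    errors = sum(
        1 for predicted, (_, actual) in zip(results, testing_set) if predicted != actual
    )
    error_percent = 100.0 * errors / len(testing_set)
    return classifier, error_percent


# Toy, perfectly separable data: 5 positive and 5 negative tweets.
positive = ({"great": True, "awful": False}, 1)
negative = ({"great": False, "awful": True}, -1)
classifier, error_percent = train_and_evaluate([positive, negative] * 5)
print(error_percent)  # 0.0
```

On real tweets the error will be far from zero; the toy data only demonstrates the mechanics.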
Our output shows the predicted value on the left and the actual value on the right. The error percentage of 40% is very high: our model is accurate only about 3 out of 5 times. Improvements that could make our model more accurate include using a larger training set or using a validation set to compare different models before selecting the best-performing one.
Using the Model
With the model trained and tested, we can now use it to make predictions on a fresh batch of tweets. We scrape more tweets and run through the same preprocessing steps as before for the new dataframes: ca_new_df, ny_new_df, and tx_new_df. The predictions of our classifier are stored in results_new_ca, results_new_ny, and results_new_tx. Our last step is to use the sentiment_percent function to quantify the percentages.
sentiment_percent(results_new_ca)
sentiment_percent(results_new_ny)
sentiment_percent(results_new_tx)
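The body of sentiment_percent isn't shown above, so here is one way it might be written. It assumes the results are a list of -1/0/1 labels; returning a dictionary of rounded percentages is a choice made for this sketch.

```python
from collections import Counter

LABELS = {-1: "negative", 0: "neutral", 1: "positive"}


def sentiment_percent(results):
    """Return the percentage of negative, neutral, and positive predictions."""
    if not results:
        raise ValueError("results must contain at least one prediction")
    counts = Counter(results)
    total = len(results)
    return {name: round(100.0 * counts[label] / total, 1) for label, name in LABELS.items()}


print(sentiment_percent([0, 0, 1, -1]))
# {'negative': 25.0, 'neutral': 50.0, 'positive': 25.0}
```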
In our results, only around 6% of California’s tweets were positive, while around 27% of Texas’s tweets were negative. California and New York both had 73% neutral tweets, with their positive and negative percentages differing by around 4%. Texas had the highest share of negative tweets, but it also had the highest share of positive tweets at around 10%, because a lower percentage of its tweets were neutral. It’s important to keep in mind that our model was only 60% accurate, so these results probably aren’t fully indicative of the real sentiment expressed in these tweets.
Some code was omitted from this article for the sake of brevity. Click here for the full code.