How are Americans reacting to Covid-19?


Using Twitter and sentiment analysis to answer the question


Image credit: Twitter

The Covid-19 pandemic poses an unprecedented challenge to the entire world. With the most confirmed cases and deaths, America is one of the countries hardest hit by the virus. As states begin to partially reopen, the nation has become very polarized on the topic. Some firmly advocate for this measure, citing the importance of the country’s economic health. Yet, others have a strong objection to this, contending that the human cost of reopening can’t be justified. At a time when tensions are high, I sought to gain a better understanding of how exactly Americans feel about the current state of affairs surrounding Covid-19.

In an attempt to answer this question, Srihan Mediboina and I worked together to scrape tweets related to Covid-19 from Twitter and perform sentiment analysis on them. To find out how reactions varied across America, we used tweets from New York, Texas, and California. Let’s get into the project!

Getting Twitter Data


Image credit: Tweepy

Before we can access our Twitter API credentials, we need to apply for a Twitter developer account. Once our application is approved, we can use Tweepy to access the API and download all the tweets for a hashtag. Calling the search_for_hashtag function lets us quickly scrape data across hashtags (#coronavirus, #Covid-19, #NewYork, #California, and #Texas were some of the hashtags we used). For a more in-depth look at Tweepy, check out this article.
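The search_for_hashtag helper is part of the omitted code; here is a minimal sketch of what it might look like, assuming placeholder credentials and Tweepy's standard search endpoint (the original signature may differ):

```python
import tweepy

# Credentials from the Twitter developer portal (placeholders)
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def search_for_hashtag(hashtag, count=500):
    """Scrape up to `count` recent English tweets containing `hashtag`."""
    # api.search_tweets in Tweepy >= 4; it was api.search in older versions
    cursor = tweepy.Cursor(api.search_tweets, q=hashtag, lang="en", tweet_mode="extended")
    return [tweet.full_text for tweet in cursor.items(count)]

tweets = search_for_hashtag("#coronavirus")
```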

We performed the sentiment analysis using a Naive Bayes classifier, which requires labeled data because it’s a supervised learning algorithm. Thus, we manually labeled 500 tweets from each of the three states for a total of 1,500 tweets. Each tweet received either a -1 for negative sentiment, a 0 for neutral sentiment, or a 1 for positive sentiment. If you’re interested in performing your own analysis, here’s a link to the data.
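Assuming the labeled tweets are stored as one CSV per state (the filenames below are hypothetical), loading them into pandas looks like this:

```python
import pandas as pd

# Hypothetical filenames; adjust to wherever the labeled CSVs live
ca_df = pd.read_csv("california_tweets.csv")
ny_df = pd.read_csv("newyork_tweets.csv")
tx_df = pd.read_csv("texas_tweets.csv")

ca_df.head()  # shows the first 5 rows of the California tweet dataset
```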


First 5 rows of the California tweet dataset

Tokenizing

Now we tokenize the tweets by splitting them into individual words (called tokens). Without tokens, we can't carry out the subsequent steps of sentiment analysis. This step becomes simple when we import TweetTokenizer from the Natural Language Toolkit (nltk). The tokenize_tweets function is only two lines of code, and we can apply it to the dataframes to break up the tweets. nltk is a powerful package for sentiment analysis, so we'll be using it throughout the article.
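A sketch of tokenize_tweets; the 'tweet_text' column name is an assumption:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def tokenize_tweets(tweet):
    return tokenizer.tokenize(tweet)

# Apply to each state's dataframe; 'tweet_text' is an assumed column name
ca_df["tweet_text"] = ca_df["tweet_text"].apply(tokenize_tweets)
```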


CA dataset after tokenization

Stopwords

Stopwords are common words such as “the”, “a”, and “an”. Since these words don’t further our understanding of the sentiment of the text, we filter them out. By importing stopwords from nltk, this step becomes pretty simple: the remove_stopwords function is also two lines of code.
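A sketch of remove_stopwords, applied to the token lists produced in the previous step:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")

def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

ca_df["tweet_text"] = ca_df["tweet_text"].apply(remove_stopwords)
```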


Some of the stopwords that were removed from the first few rows of our California dataset include “some”, “can”, “just”, and “for”.

Cleaning Text

In addition to removing stopwords, we want to make sure that any stray characters in our data frames are also removed. For example, several escape sequences such as '\x97' and '\xa3' appeared in the CSV files after we scraped the tweets. After iterating through the data to find these miscellaneous characters, we copy-pasted them into the CleanTxt function. Then, we applied the function to each data frame to remove them.
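Since the exact character list was assembled by hand, this version of CleanTxt is an approximation that strips hashtag symbols and the stray sequences mentioned above:

```python
import re

# Assumed junk patterns: '#' symbols plus literal '\x97'/'\xa3' sequences
# observed in the scraped CSVs (the original list was copy-pasted by hand)
JUNK = re.compile(r"#|\\x97|\\xa3")

def CleanTxt(tokens):
    cleaned = [JUNK.sub("", token) for token in tokens]
    return [token for token in cleaned if token]  # drop tokens that became empty

ca_df["tweet_text"] = ca_df["tweet_text"].apply(CleanTxt)
```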


As we can see, hashtag symbols were the most prevalent of the removed characters. Cleaning the text this way can improve our model’s performance.

Lemmatizing

Often, words referring to the same thing appear in different forms (e.g., trouble, troubling, troubled, and troubles all essentially refer to trouble). By lemmatizing the text, we group the various inflections of a word together and analyze them as the word’s lemma (the form in which it appears in the dictionary). This process prevents the computer from mistaking different forms of a word for different words. We import WordNetLemmatizer from nltk and call the lemmatize_tweets function for this purpose.
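A sketch of lemmatize_tweets; it uses the lemmatizer's default (noun) part of speech, whereas the original may pass part-of-speech tags:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")

def lemmatize_tweets(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

ca_df["tweet_text"] = ca_df["tweet_text"].apply(lemmatize_tweets)
```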

Master Dataset

Since we’re done with the preprocessing steps, we can move on to creating a master dataset that encompasses all 1,500 tweets. Using df.itertuples, we iterate over the dataframe rows as tuples and append each row's ‘tweet text’ and ‘values’ attributes to our dataset. Then we shuffle the dataset with random.shuffle so the upcoming train/test split isn’t biased by the order in which the tweets were collected (for example, all the Texas tweets ending up in the test set).
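A sketch of the master-dataset step, assuming columns named tweet_text and values to match the attributes mentioned above:

```python
import random

dataset = []
for df in (ca_df, ny_df, tx_df):
    for row in df.itertuples(index=False):
        dataset.append((row.tweet_text, row.values))  # (token list, sentiment label)

random.shuffle(dataset)  # mix the three states before splitting
```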

Following that step, we iterate through all of the data frames and add every word to the all_words list. Next, we use nltk.FreqDist to create a frequency distribution of the words. Since some words are more common than others, we want to ensure that the most relevant words are used to train our Naive Bayes classifier. Currently, each tweet is a list of words, but we can represent it as a dictionary instead: the keys are the word features, and the values are True or False depending on whether the tweet contains that word feature. This dictionary representation of a tweet is known as a feature set. We generate a feature set for each tweet and train our Naive Bayes classifier on them.
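A sketch of the feature-set construction; the 2,000-word vocabulary cutoff is an assumption, since the original cutoff isn't stated:

```python
import nltk

all_words = []
for tokens, _ in dataset:
    all_words.extend(word.lower() for word in tokens)

freq_dist = nltk.FreqDist(all_words)
word_features = [word for word, _ in freq_dist.most_common(2000)]  # assumed cutoff

def find_features(tokens):
    """Map a tweet's tokens to a {word_feature: True/False} dictionary."""
    words = set(word.lower() for word in tokens)
    return {feature: (feature in words) for feature in word_features}

feature_sets = [(find_features(tokens), label) for tokens, label in dataset]
```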

Training/Testing the Model

The feature_sets will be split 80/20 into the training and testing set, respectively. After training the Naive Bayes classifier on the training set, we can check its performance by comparing its prediction for each tweet's sentiment ( results[i] ) against the labeled sentiment stored in the corresponding testing_set entry.
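A sketch of the split, training, and evaluation loop, following the (features, label) tuple ordering from the previous step:

```python
split = int(0.8 * len(feature_sets))
training_set, testing_set = feature_sets[:split], feature_sets[split:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

# Compare predicted vs. labeled sentiment for each test tweet
results = [classifier.classify(features) for features, _ in testing_set]
errors = 0
for i in range(len(testing_set)):
    print(results[i], testing_set[i][1])  # predicted (left) vs. actual (right)
    if results[i] != testing_set[i][1]:
        errors += 1

print(f"Error rate: {errors / len(testing_set):.0%}")
```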


Our output shows the predicted value on the left and the actual value on the right. The error rate of 40% is very high: the model is accurate only about 3 times out of 5. Some improvements that could make the model more accurate include using a larger training set or using a validation set to compare different models before selecting the most effective one.

Using the Model

With the model trained and tested, we can now use it to make predictions on a fresh batch of tweets. We scrape more tweets and run them through the same preprocessing steps as before, producing the new dataframes ca_new_df , ny_new_df , and tx_new_df . The classifier's predictions are stored in results_new_ca , results_new_ny , and results_new_tx . Our last step is to use the sentiment_percent function to quantify the percentages.
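sentiment_percent is also part of the omitted code; a minimal version might simply count each label:

```python
from collections import Counter

def sentiment_percent(results):
    """Print the percentage of -1/0/1 predictions in a list of results."""
    counts = Counter(results)
    total = len(results)
    for label, name in [(-1, "negative"), (0, "neutral"), (1, "positive")]:
        print(f"{name}: {100 * counts[label] / total:.1f}%")
```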

sentiment_percent(results_new_ca)
sentiment_percent(results_new_ny)
sentiment_percent(results_new_tx)

In our results, only around 6% of California’s tweets were positive, while around 27% of Texas’s tweets were negative. California and New York each had about 73% neutral tweets, with their positive and negative percentages differing by around 4%. Texas had the highest percentage of negative tweets, but it also had the highest percentage of positive tweets (around 10%) because fewer of its tweets were neutral. Keep in mind that our model was only about 60% accurate, so these results probably aren’t the most reliable indicator of the real sentiment expressed in these tweets.

Some code was omitted from this article for the sake of brevity. Click here for the full code.


