How are Americans reacting to Covid-19?
Using Twitter and sentiment analysis to answer the question
May 24 · 6 min read
The Covid-19 pandemic poses an unprecedented challenge to the entire world. With the most confirmed cases and deaths, America is one of the countries hardest hit by the virus. As states begin to partially reopen, the nation has become very polarized on the topic. Some firmly advocate for this measure, citing the importance of the country’s economic health. Yet, others have a strong objection to this, contending that the human cost of reopening can’t be justified. At a time when tensions are high, I sought to gain a better understanding of how exactly Americans feel about the current state of affairs surrounding Covid-19.
In an attempt to answer this question, Srihan Mediboina and I worked together to scrape tweets related to Covid-19 from Twitter and perform sentiment analysis on them. To find out how reactions varied across America, we used tweets from New York, Texas, and California. Let’s get into the project!
Getting Twitter Data
Before we can access our Twitter API credentials, we need to apply for a Twitter developer account. Once our application is approved, we can use Tweepy to access the API and download tweets for a given hashtag. Calling the search_for_hashtag function allows us to quickly scrape data across hashtags (#coronavirus, #Covid-19, #NewYork, #California, #Texas were some of the hashtags we used). For a more in-depth look at Tweepy, check out this article.
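The original search_for_hashtag function is not reproduced here, so the following is only a rough sketch of what such a helper might look like. It assumes Tweepy v4 naming (the search method is called `api.search` in Tweepy v3); the `make_api` and `write_tweets` helpers, the "timestamp" column, and the retweet filter in the query are illustrative assumptions rather than the authors' code.

```python
import csv


def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated Tweepy client from developer-account credentials."""
    import tweepy  # imported lazily so the CSV helper below works without Tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth, wait_on_rate_limit=True)


def write_tweets(statuses, handle):
    """Write one CSV row per tweet and return the number of rows written."""
    writer = csv.writer(handle)
    writer.writerow(["timestamp", "tweet text"])  # "timestamp" is an assumed column
    count = 0
    for status in statuses:
        writer.writerow([status.created_at, status.full_text])
        count += 1
    return count


def search_for_hashtag(api, hashtag, out_path, limit=500):
    """Save up to `limit` English tweets matching `hashtag` to `out_path`."""
    import tweepy

    # Assumption: Tweepy v4 naming; in Tweepy v3 this method is `api.search`.
    cursor = tweepy.Cursor(
        api.search_tweets,
        q=f"{hashtag} -filter:retweets",  # assumed query: skip retweets
        lang="en",
        tweet_mode="extended",  # exposes the untruncated text as `full_text`
    )
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        return write_tweets(cursor.items(limit), handle)
```

With valid credentials, a call such as `search_for_hashtag(make_api(...), "#California", "california.csv")` would write one file per hashtag; the file name is a placeholder.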
We performed the sentiment analysis using a Naive Bayes classifier, which requires labeled data because it’s a supervised learning algorithm. Thus, we manually labeled 500 tweets from each of the three states for a total of 1,500 tweets. Each tweet received either a -1 for negative sentiment, a 0 for neutral sentiment, or a 1 for positive sentiment. If you’re interested in performing your own analysis, here’s a link to the data.
Tokenizing
Now we tokenize the tweets by splitting them into individual words (called tokens). Without tokens, we can’t carry out the subsequent steps involved in sentiment analysis. This process becomes simple when we import TweetTokenizer from the Natural Language Toolkit (nltk). The tokenize_tweets function is only two lines of code, and we can apply it to the dataframes to break up the tweets. nltk is a very powerful package for sentiment analysis, so we’ll be using it throughout the article.
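A minimal sketch of a tokenize_tweets function along these lines is shown below. The body is an assumption based on the description above (the original code is not shown), and the sample tweet is made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is rule-based, so it needs no extra nltk data downloads.
tokenizer = TweetTokenizer()


def tokenize_tweets(text):
    """Split one tweet into tokens, keeping hashtags as single tokens."""
    return tokenizer.tokenize(text)


tokens = tokenize_tweets("Stay home, stay safe! #Covid19")
print(tokens)
```

On a pandas dataframe, the same function could be applied per row with something like `df["tweet text"].apply(tokenize_tweets)`, assuming the tweets live in a ‘tweet text’ column.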
Stopwords
Stopwords are common words such as “the”, “a”, and “an”. Since these words don’t further our understanding of the sentiment of the text, we filter them out. By importing stopwords from nltk, this step becomes pretty simple: the remove_stopwords function is also two lines of code.
Some of the stopwords that were removed from the first few rows of our California dataset include “some”, “can”, “just”, and “for”.
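As a rough illustration, a remove_stopwords function could look like the sketch below. Passing the stopword set in as an argument and the `load_english_stopwords` helper are assumptions made here for clarity; loading nltk’s list requires a one-time `nltk.download("stopwords")`.

```python
def load_english_stopwords():
    """Return nltk's English stopword list as a set.

    Requires a one-time nltk.download("stopwords").
    """
    from nltk.corpus import stopwords

    return set(stopwords.words("english"))


def remove_stopwords(tokens, stop_words):
    """Drop every token whose lowercase form is a stopword."""
    return [token for token in tokens if token.lower() not in stop_words]
```

In practice this would be called as `remove_stopwords(tokens, load_english_stopwords())`, loading the stopword set once and reusing it for every tweet.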
Cleaning Text
In addition to removing stopwords, we want to make sure that any stray characters in our dataframes are also removed. For example, several characters such as ‘x97’ and ‘xa3’ appeared in the csv files after we scraped the tweets. After iterating through the data to find these miscellaneous characters, we copied and pasted them into the CleanTxt function. Then, we applied the function to each dataframe to remove them.
As we can see, hashtags were the most prevalent characters that were removed. By cleaning the text, we can improve our model’s performance.
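The CleanTxt function itself is not shown, so here is one possible sketch. It assumes the stray characters appear in the csv as literal escape text such as `\x97` (backslash included) and that cleaning runs on the raw tweet string; instead of listing each sequence by hand, it uses a regular expression, which is a substitution for the copy-and-paste approach described above.

```python
import re

# Assumption: the debris shows up as literal text like "\x97" or "\xa3".
ESCAPE_DEBRIS = re.compile(r"\\x[0-9a-fA-F]{2}")


def CleanTxt(text):
    """Remove escape debris and '#' symbols, then tidy the whitespace."""
    text = ESCAPE_DEBRIS.sub("", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()


cleaned = CleanTxt(r"Reopening too soon\x97 #California #Covid19")
print(cleaned)
```

Requiring the backslash in the pattern matters: a looser pattern such as `x[0-9a-f]{2}` would also eat part of ordinary words like "fixed".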
Lemmatizing
Often, words referring to the same thing appear in different forms (e.g., trouble, troubling, troubled, and troubles all essentially refer to trouble). By lemmatizing the text, we group various inflections of a word together to analyze them as the word’s lemma (how it appears in the dictionary). This process prevents the computer from mistaking different forms of a word for different words. We import WordNetLemmatizer from nltk and call the lemmatize_tweets function for this purpose.
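A lemmatize_tweets function might be sketched as follows. The optional `lemmatize` argument is an assumption added so the function can be tried without the WordNet data; by default it falls back to nltk’s WordNetLemmatizer, which needs a one-time `nltk.download("wordnet")`.

```python
def lemmatize_tweets(tokens, lemmatize=None):
    """Replace each token with its lemma.

    `lemmatize` is any function mapping a word to its lemma; when omitted,
    nltk's WordNetLemmatizer is used (requires nltk.download("wordnet")).
    """
    if lemmatize is None:
        from nltk.stem import WordNetLemmatizer

        lemmatize = WordNetLemmatizer().lemmatize
    return [lemmatize(token) for token in tokens]
```

One detail worth knowing: WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "troubling" are only reduced when called with `pos="v"`.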
Master Dataset
Since we’re done with the preprocessing steps, we can move on to creating a master dataset that encompasses all 1,500 tweets. By using df.itertuples, we can iterate over dataframe rows as tuples to append the ‘tweet text’ and ‘values’ attributes to our dataset. Then, we shuffle our dataset using random.shuffle so that the later train/test split mixes tweets from all three states rather than following the order in which they were appended.
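The step above can be sketched roughly as follows. The `build_master_dataset` name, the `seed` parameter, and the toy dataframes are assumptions for illustration. Because ‘tweet text’ contains a space, it cannot be read as a named-tuple attribute, so this sketch asks itertuples for plain tuples instead.

```python
import random

import pandas as pd


def build_master_dataset(dataframes, seed=None):
    """Collect (tokens, label) pairs from every dataframe and shuffle them."""
    dataset = []
    for df in dataframes:
        # name=None yields plain (tweet text, values) tuples for each row.
        rows = df[["tweet text", "values"]].itertuples(index=False, name=None)
        dataset.extend(rows)
    random.Random(seed).shuffle(dataset)  # seed only makes the demo repeatable
    return dataset


# Toy stand-ins for the per-state dataframes.
ca_df = pd.DataFrame({"tweet text": [["stay", "home"], ["reopen", "now"]], "values": [1, -1]})
ny_df = pd.DataFrame({"tweet text": [["cases", "rising"]], "values": [0]})

master = build_master_dataset([ca_df, ny_df], seed=0)
print(master)
```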
Following that step, we iterate through all of the dataframes and add every single word to the all_words list. Next, we use nltk.FreqDist to create a distribution of the frequency of each word. Since some words are more common than others, we want to ensure that the most relevant words are used to train our Naive Bayes classifier. Currently, each tweet is a list of words. However, we can represent each tweet as a dictionary instead of a list: the keys are the word features and the values are either True or False depending on whether the tweet contains that word feature. This dictionary representing a tweet is known as a feature set. We will generate a feature set for each tweet and train our Naive Bayes classifier on the feature sets.
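To make the feature-set idea concrete, here is a small sketch on toy data. The `find_features` name, the cutoff of three word features, and the sample tweets are assumptions; the article does not say how many of the most frequent words were kept.

```python
import nltk


def find_features(tokens, word_features):
    """Map each word feature to True/False depending on whether the tweet has it."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}


# Toy (tokens, label) pairs standing in for the master dataset.
dataset = [
    (["stay", "home", "stay", "safe"], 1),
    (["reopen", "economy", "now"], 0),
    (["cases", "rising", "stay", "home"], -1),
]

all_words = [word.lower() for tokens, _ in dataset for word in tokens]
freq = nltk.FreqDist(all_words)

# Keep only the most frequent words as features (3 is an arbitrary cutoff).
word_features = [word for word, _ in freq.most_common(3)]

feature_sets = [(find_features(tokens, word_features), label) for tokens, label in dataset]
print(feature_sets[0])
```

Each entry is a (feature dictionary, label) pair, which is the shape nltk’s classifiers expect as training input.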
Training/Testing the Model
The feature_sets will be split 80/20 into the training and testing sets, respectively. After training the Naive Bayes classifier on the training set, we can check its performance by comparing its predictions for the sentiment of tweets (results[i]) against the labeled sentiment of the tweets (testing_set[i][0]).
Our output shows the predicted value on the left and the actual value on the right. The error rate of 40% is very high: our model is accurate only around 3 out of 5 times. Some improvements that could make our model more accurate are using a larger training set or using a validation set to compare different models before selecting the best-performing one.
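The split, training, and comparison steps can be sketched as below. The helper names and the toy feature sets are assumptions, and the sketch follows nltk’s (feature dictionary, label) ordering, so the label sits at index 1 of each pair here.

```python
import nltk


def split_and_train(feature_sets, train_fraction=0.8):
    """Train a Naive Bayes classifier on the first 80% and hold out the rest."""
    cutoff = int(len(feature_sets) * train_fraction)
    training_set, testing_set = feature_sets[:cutoff], feature_sets[cutoff:]
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    return classifier, testing_set


def error_rate(classifier, testing_set):
    """Fraction of held-out tweets whose predicted label differs from the true one."""
    results = classifier.classify_many([features for features, _ in testing_set])
    wrong = sum(pred != label for pred, (_, label) in zip(results, testing_set))
    return wrong / len(testing_set)


# Toy, perfectly separable feature sets (already in mixed order).
toy_feature_sets = [
    ({"safe": True, "deaths": False}, 1),
    ({"safe": False, "deaths": True}, -1),
] * 5

classifier, testing_set = split_and_train(toy_feature_sets)
print(f"error rate: {error_rate(classifier, testing_set):.0%}")
```

On real tweets the held-out error will be far from zero; the toy data is separable only to keep the example easy to follow.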
Using the Model
With the model trained and tested, we can now use it to make predictions on a fresh batch of tweets. We scrape more tweets and run the new dataframes (ca_new_df, ny_new_df, and tx_new_df) through the same preprocessing steps as before. The predictions of our classifier are stored in results_new_ca, results_new_ny, and results_new_tx. Our last step is to use the sentiment_percent function to quantify the percentages.
sentiment_percent(results_new_ca)
sentiment_percent(results_new_ny)
sentiment_percent(results_new_tx)
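The definition of sentiment_percent is not included in the article; a minimal sketch of such a helper, assuming the predictions are a list of -1, 0, and 1 labels and that percentages are rounded to one decimal place, might be:

```python
from collections import Counter


def sentiment_percent(results):
    """Return the percentage of negative (-1), neutral (0), and positive (1) labels."""
    total = len(results)
    if total == 0:
        return {-1: 0.0, 0: 0.0, 1: 0.0}
    counts = Counter(results)
    return {label: round(100 * counts[label] / total, 1) for label in (-1, 0, 1)}


# Made-up predictions, for illustration only.
example = sentiment_percent([0, 0, 0, 1, -1, 0, 0, -1, 0, 0])
print(example)
```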
In our results, California had only around 6% of tweets as positive, while Texas had around 27% of tweets as negative. California and New York both had 73% of tweets neutral, with their positive and negative percentages differing by around 4%. Texas had the highest share of negative tweets, but it also had the highest share of positive tweets, at around 10%, because a lower percentage of its tweets were neutral. It’s important to keep in mind that our model was only 60% accurate, so these results probably aren’t the most indicative of the real sentiment expressed in these tweets.
Some code was omitted from this article for the sake of brevity. Click here for the full code.
References
[1] Computer Science channel, Twitter Sentiment Analysis Using Python, YouTube
[2] Vicky Qian, Twitter Crawler, GitHub
[3] Mohamed Afham, Twitter Sentiment Analysis using NLTK, Python, Towards Data Science
[4] Adam Majmudar, Machines’ key to understanding humans, Medium