Sarcasm Classification (Using FastText)

Category: IT Technology · Published: 4 years ago

FastText is a library created by the Facebook Research team for efficient learning of word representations and sentence classification. It has gained a lot of traction in the NLP community, especially as a strong baseline for word representations: unlike word2vec, it takes character n-grams into account when computing word vectors. Here we will use FastText for text classification.

In this article we will build a sarcasm classifier for news headlines using the FastText Python module. The data comes from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection . The sarcastic headlines are from TheOnion and the non-sarcastic ones are from HuffPost. Let’s now jump to the code.

First, let’s inspect the data to decide the approach. You can download the data from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/data?select=Sarcasm_Headlines_Dataset_v2.json .

#load data
import pandas as pd
df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
#shuffle the data inplace
df = df.sample(frac=1).reset_index(drop=True)
# show first few rows
df.head()

After reading the JSON into a pandas DataFrame, the dataset has 3 columns. ‘headline’ contains the headline text and ‘is_sarcastic’ contains 1 or 0, signifying sarcastic and non-sarcastic respectively. The class counts, as returned by df['is_sarcastic'].value_counts(), are:

0 14985
1 13634
Name: is_sarcastic, dtype: int64

Now let’s look at how the text differs between sarcastic and non-sarcastic examples:

#plot_wordcloud is a plotting helper whose definition is not shown here
#word cloud on sarcastic headlines
sarcastic = ' '.join(df[df['is_sarcastic']==1]['headline'].to_list())
plot_wordcloud(sarcastic, 'Reds')

#word cloud on non-sarcastic headlines
non_sarcastic = ' '.join(df[df['is_sarcastic']==0]['headline'].to_list())
plot_wordcloud(non_sarcastic, 'Reds')

Before building the classifier, we need to clean the text to remove noise. Since these are news headlines, they don’t contain much of it. The cleaning steps I chose are lowercasing every string, removing anything that is not alphanumeric, and replacing numbers with a specific label.

df['headline'] = df['headline'].str.lower()
df['headline'] = df['headline'].apply(alpha_num)
df['headline'] = df['headline'].apply(replace_num)
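
The helpers alpha_num and replace_num are not defined in the article. Below is a minimal sketch of what they might look like. It assumes that non-alphanumeric characters are dropped rather than replaced by a space (which would be consistent with tokens such as "dont" and "kelloggs" in the prepared file), and it uses a hypothetical placeholder token NUM for numbers, since the actual label the author used is not stated.

```python
import re

# Hypothetical placeholder for numbers; the article does not say which token was used.
NUM_TOKEN = 'NUM'


def alpha_num(text):
    """Drop every character that is not an ASCII letter, a digit or whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)


def replace_num(text):
    """Replace each run of digits with the placeholder token."""
    return re.sub(r'\d+', NUM_TOKEN, text)


# Apply the helpers in the same order as the cleaning code above.
print(replace_num(alpha_num("kellogg's pulls 3 cereals")))
# prints: kelloggs pulls NUM cereals
```

Note that this version also strips accented letters. If that matters for your data, the character class would need to be widened.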

Now we have the cleaned text and the corresponding labels needed to build a binary sarcasm classifier. As discussed, we will build the model using the FastText Python module.

First, we need to install the FastText Python module in the environment.

#Building fasttext for python
!git clone https://github.com/facebookresearch/fastText.git
#use %cd, not !cd, so the directory change persists across notebook commands
%cd fastText
!pip3 install .

We need to prepare the training and testing files in a format the FastText API understands. By default, each line of the text file should look like __label__<Label> <Text>. A prefix other than __label__ can be used by changing the corresponding parameter at training time, as we will see below.

#data preparation for fasttext
with open('fasttext_input_sarcastic_comments.txt', 'w') as f:
    for each_text, each_label in zip(df['headline'], df['is_sarcastic']):
        f.write(f'__label__{each_label} {each_text}\n')

The data in the file looks like this:

!head -n 10 fasttext_input_sarcastic_comments.txt

__label__1 school friends dont find camp songs funny
__label__0 what cutting americorps would mean for public lands
__label__0 when our tears become medicine
__label__1 craig kilborn weds self in private ceremony
__label__1 white couple admires fall colors
__label__1 mom much more insistent about getting grandkids from one child than other
__label__0 diary of a queer kids mom
__label__1 sephora makeup artist helping woman create the perfect pink eye
__label__1 kelloggs pulls controversial chocobastard from store shelves
__label__0 winston churchills grandson introduces a new nickname for donald trump

Here __label__0 means non-sarcastic and __label__1 means sarcastic. We are now in good shape to start training the classifier. For that, we divide the dataset into training (90%) and testing (10%) sets. The FastText function for this supervised binary classification is train_supervised.
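
The article does not show how the 90/10 split was produced. One possible sketch, assuming the prepared file is simply cut at the 90% mark (the rows were already shuffled when the data was loaded), is given here. The function name split_file is hypothetical, and the demo runs on a throwaway 10-line file rather than the real data.

```python
import os
import tempfile


def split_file(src, train_path, test_path, train_frac=0.9):
    """Write the first train_frac of the lines in src to train_path, the rest to test_path."""
    with open(src, encoding='utf-8') as f:
        lines = f.readlines()
    cut = int(len(lines) * train_frac)
    with open(train_path, 'w', encoding='utf-8') as f:
        f.writelines(lines[:cut])
    with open(test_path, 'w', encoding='utf-8') as f:
        f.writelines(lines[cut:])
    return cut, len(lines) - cut


# Demo on a temporary file with 10 labelled lines.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'all.txt')
    with open(src, 'w', encoding='utf-8') as f:
        f.writelines(f'__label__{i % 2} headline {i}\n' for i in range(10))
    counts = split_file(src, os.path.join(tmp, 'train.txt'), os.path.join(tmp, 'test.txt'))

print(counts)  # (9, 1)
```

With the file names used in this article, the call would be split_file('fasttext_input_sarcastic_comments.txt', 'sarcasm_train.bin', 'sarcasm_test.bin'). Despite the .bin extension, both outputs are plain text files in the __label__ format.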

import fasttext

'''
For classification, the train_supervised call is used.

Its parameters and their defaults:
input # training file path (required)
lr # learning rate [0.1]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [1]
minCountLabel # minimal number of label occurrences [1]
minn # min length of char ngram [0]
maxn # max length of char ngram [0]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [softmax]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
label # label prefix ['__label__']
verbose # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
'''
model = fasttext.train_supervised('sarcasm_train.bin', wordNgrams=2)

To measure the performance in the testing dataset:

#measuring performance on test data
def print_results(sample_size, precision, recall):
    precision = round(precision, 2)
    recall = round(recall, 2)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')

print_results(*model.test('sarcasm_test.bin'))

sample_size=2862
precision=0.87
recall=0.87

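The results, though not perfect, look promising. Because each headline carries exactly one label and the model predicts one label per headline, precision and recall coincide here. We now save the model object for future inference.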

#save the model
model.save_model('fasttext_sarcasm.model')

FastText can also compress the model through quantisation, producing a much smaller model file at the cost of only a small drop in performance.

# with the previously trained `model` object, call
model.quantize(input='sarcasm_train.bin', retrain=True)
# results on test set
print_results(*model.test('sarcasm_test.bin'))
# save the compressed model (.ftz)
model.save_model('fasttext_sarcasm.ftz')

sample_size=2862
precision=0.86
recall=0.86

Precision and recall each drop by about 0.01, but compare the model file sizes:

!du -kh ./fasttext_sarcasm*

98M ./fasttext_sarcasm.ftz
774M ./fasttext_sarcasm.model

The compressed model is roughly one-eighth the size of the base model (98M versus 774M). This is a trade-off between model size and performance that the user has to decide on depending on the use case. Now that the classifier is trained, it is time to prepare the inference script so you can plan the deployment.
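
The inference script that follows calls SarcasmService.get_model(), which the article does not define. A minimal sketch of such a class is given here, assuming it only needs to load the saved model once and hand back the same object on every call. The class body and the model_path attribute are illustrative guesses; fasttext.load_model is the module's function for loading a saved model.

```python
class SarcasmService:
    """Hypothetical wrapper that loads the saved model once and reuses it."""

    # File written by model.save_model earlier in the article.
    model_path = 'fasttext_sarcasm.model'
    _model = None

    @classmethod
    def get_model(cls):
        # Load on first use only, so repeated predictions do not re-read the file.
        if cls._model is None:
            import fasttext  # imported lazily so the class can be defined without it
            cls._model = fasttext.load_model(cls.model_path)
        return cls._model
```

Pointing model_path at the quantised .ftz file instead would serve the smaller model through the same interface.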

def predict_is_sarcastic(text):
    return SarcasmService.get_model().predict(text, k=2)

if __name__ == '__main__':
    ip = 'Make Indian manufacturing competitive to curb Chinese imports: RC Bhargava'
    print(f'Result : {predict_is_sarcastic(ip)}')

Result : (('__label__0', '__label__1'), array([0.5498156 , 0.45020437]))

In the output, the label with the highest probability is __label__0, which means the headline is non-sarcastic according to the trained model. In the model.predict() call, the value of k is the number of labels to return along with their probability scores. Since we used the softmax loss (the FastText default), the probabilities of the two labels sum to 1.

To conclude, FastText can be a strong baseline for any NLP classification task, and it is very easy to use. All the source code for this article can be found at my git repo. The FastText Python module is not officially supported, but that shouldn’t be an issue for tech people who want to experiment :). In a future post I will try to discuss how the trained model can be moved to production.
