Sarcasm Classification (Using FastText)

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification. It has gained a lot of traction in the NLP community, especially as a strong baseline for word representations replacing word2vec, since it takes character n-grams into account when building word vectors. Here we will use FastText for text classification.

In this article we will build a sarcasm classifier for news headlines using the FastText python module. The data is collected from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection . The sarcastic headlines come from TheOnion and the non-sarcastic ones from HuffPost. Let's now jump to the code.

First, let's inspect the data to decide on an approach. You can download the data from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/data?select=Sarcasm_Headlines_Dataset_v2.json .

#load data
import pandas as pd
df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
#shuffle the data inplace
df = df.sample(frac=1).reset_index(drop=True)
# show first few rows
df.head()

Reading the JSON into tabular form with pandas shows that the dataset contains three columns, where 'headline' holds the headline text of the news and 'is_sarcastic' holds 1 or 0, signifying sarcastic and non-sarcastic respectively. The class distribution of sarcastic and non-sarcastic examples:

df['is_sarcastic'].value_counts()

0    14985
1    13634
Name: is_sarcastic, dtype: int64

Now let's look at what the texts of the sarcastic and non-sarcastic examples look like:
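The snippets below rely on a plot_wordcloud helper that the article assumes but never defines; here is a minimal sketch of it using the wordcloud and matplotlib packages (the function name and signature come from the calls below, the implementation itself is my assumption):

#assumed helper: render a word cloud for a blob of text with a given colormap
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_wordcloud(text, colormap='viridis'):
    wc = WordCloud(width=800, height=400, background_color='white',
                   colormap=colormap).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()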

#word cloud on sarcastic headlines
sarcastic = ' '.join(df[df['is_sarcastic']==1]['headline'].to_list())
plot_wordcloud(sarcastic, 'Reds')

#word cloud on non-sarcastic headlines
non_sarcastic = ' '.join(df[df['is_sarcastic']==0]['headline'].to_list())
plot_wordcloud(non_sarcastic, 'Reds')

Before building the classifier model, we need to clean the text to remove noise. Since these are news headlines, they don't contain much junk. The cleaning steps I settled on are lower-casing everything, stripping anything that is not alphanumeric, and replacing numbers with a placeholder label.
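The cleaning code below uses two helpers, alpha_num and replace_num, that the article assumes; a minimal sketch under those assumptions (the exact regexes and the '#n#' placeholder token are my guesses):

import re

#assumed helper: keep only lowercase letters, digits and spaces
def alpha_num(text):
    return re.sub(r'[^a-z0-9 ]', ' ', text)

#assumed helper: replace any standalone number with a placeholder label
def replace_num(text):
    return re.sub(r'\b\d+\b', '#n#', text)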

df['headline'] = df['headline'].str.lower()
df['headline'] = df['headline'].apply(alpha_num)
df['headline'] = df['headline'].apply(replace_num)

Now we have the cleaned text and the corresponding labels, and are ready to build a binary sarcasm classifier. As discussed, we will build the model using the FastText python module.

First, we need to install the FastText python module in the environment.

#Building fasttext for python
!git clone https://github.com/facebookresearch/fastText.git
!cd fastText && pip3 install .
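If building from source is not required, the module is also published on PyPI under the name fasttext, so a plain pip install works as well:

!pip3 install fasttext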

We need to prepare the training and testing files in the format the FastText API understands. By default, each line of the training file should look like __label__<Label> <Text>. A different prefix can be used instead of __label__ by changing the corresponding parameter at training time, which we will see ahead.

#data preparation for fasttext
with open('fasttext_input_sarcastic_comments.txt', 'w') as f:
    for each_text, each_label in zip(df['headline'], df['is_sarcastic']):
        f.write(f'__label__{each_label} {each_text}\n')

The data in the file looks like this:

!head -n 10 fasttext_input_sarcastic_comments.txt

__label__1 school friends dont find camp songs funny
__label__0 what cutting americorps would mean for public lands
__label__0 when our tears become medicine
__label__1 craig kilborn weds self in private ceremony
__label__1 white couple admires fall colors
__label__1 mom much more insistent about getting grandkids from one child than other
__label__0 diary of a queer kids mom
__label__1 sephora makeup artist helping woman create the perfect pink eye
__label__1 kelloggs pulls controversial chocobastard from store shelves
__label__0 winston churchills grandson introduces a new nickname for donald trump

Here __label__0 implies non-sarcastic and __label__1 implies sarcastic. Now we are in good shape to start training the classifier model. For that we divide the dataset into training (90%) and testing (10%) sets, as sketched below. The FastText function used for this supervised binary classification is train_supervised.
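The files sarcasm_train.bin and sarcasm_test.bin referenced below are never created in the article's shown code; a minimal sketch of the 90/10 split (the file names are taken from the later calls; despite the .bin extension these are plain text files in the same __label__ format, and the data was already shuffled earlier):

#split the labelled lines into 90% train / 10% test files
with open('fasttext_input_sarcastic_comments.txt') as f:
    lines = f.readlines()

split = int(0.9 * len(lines))
with open('sarcasm_train.bin', 'w') as f:
    f.writelines(lines[:split])
with open('sarcasm_test.bin', 'w') as f:
    f.writelines(lines[split:])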

'''
For classification the train_supervised call is used. Its default parameters:
input # training file path (required)
lr # learning rate [0.1]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [1]
minCountLabel # minimal number of label occurrences [1]
minn # min length of char ngram [0]
maxn # max length of char ngram [0]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [softmax]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
label # label prefix ['__label__']
verbose # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
'''
import fasttext
model = fasttext.train_supervised('sarcasm_train.bin', wordNgrams=2)

To measure the performance in the testing dataset:

#measuring performance on test data
def print_results(sample_size, precision, recall):
    precision = round(precision, 2)
    recall = round(recall, 2)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')

print_results(*model.test('sarcasm_test.bin'))
sample_size=2862
precision=0.87
recall=0.87

The results, though not perfect, look promising. Note that FastText's test call reports precision and recall at k=1, which coincide when each example carries exactly one label, hence the identical numbers. We save the model object now for future inference.

#save the model
model.save_model('fasttext_sarcasm.model')
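The saved model can be loaded back later with fasttext.load_model; a quick usage example:

#load the saved classifier and run a prediction
import fasttext
model = fasttext.load_model('fasttext_sarcasm.model')
model.predict('white couple admires fall colors', k=2)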

FastText is also capable of compressing the model through quantisation, yielding a much smaller model file while sacrificing only a little performance.

# with the previously trained `model` object, call
model.quantize(input='sarcasm_train.bin', retrain=True)
# save the quantized model (this produces the .ftz file measured below)
model.save_model('fasttext_sarcasm.ftz')
# results on test set
print_results(*model.test('sarcasm_test.bin'))

sample_size=2862
precision=0.86
recall=0.86

Precision and recall each drop by about 0.01, but look at the model file sizes:

!du -kh ./fasttext_sarcasm*

98M ./fasttext_sarcasm.ftz
774M ./fasttext_sarcasm.model

The compressed model is roughly one-eighth the size of the base model (98M versus 774M), so this is a trade-off between model size and performance that the user has to weigh for the use case. Since the classifier model is trained and ready, it is time to write the inference script so you can plan the deployment.
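The script below references a SarcasmService helper that the article does not define; a minimal sketch, assuming it simply lazy-loads the saved model (the class name comes from the call below, the implementation is my assumption):

import fasttext

class SarcasmService:
    _model = None

    @classmethod
    def get_model(cls):
        #load the saved classifier once and cache it (assumed behaviour)
        if cls._model is None:
            cls._model = fasttext.load_model('fasttext_sarcasm.model')
        return cls._model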

def predict_is_sarcastic(text):
    return SarcasmService.get_model().predict(text, k=2)

if __name__ == '__main__':
    ip = 'Make Indian manufacturing competitive to curb Chinese imports: RC Bhargava'
    print(f'Result : {predict_is_sarcastic(ip)}')

Result : (('__label__0', '__label__1'), array([0.5498156 , 0.45020437]))

As can be seen above, the label with the maximum probability is __label__0, which means the headline is non-sarcastic according to the trained model. In the model.predict() call, the value of k signifies the number of classes you want in the output along with their respective probability scores. Since we used softmax activation (the default in FastText), the probabilities of the two labels sum to 1.
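One caveat: the training text was lower-cased and cleaned, so for consistent results the same preprocessing should be applied at inference time, for example by reusing the (assumed) helpers sketched earlier:

#apply the same cleaning used at training time before predicting
ip_clean = replace_num(alpha_num(ip.lower()))
print(predict_is_sarcastic(ip_clean))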

To conclude, FastText can be a strong baseline for any NLP classification task, and its implementation is very easy. All the source code of this article can be found at my git repo. The FastText python module is not officially supported, but that shouldn't be an issue for tech people to experiment with :). In a future post I will try to discuss how the trained model can be moved to production.
