Sarcasm Classification (Using FastText)

Category: IT Technology · Published: 4 years ago

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification. It has gained a lot of traction in the NLP community, especially as a strong baseline for word representations in place of word2vec, because it takes character n-grams into account when computing word vectors. Here we will use FastText for text classification.

In this article we will build a sarcasm classifier for news headlines using the FastText Python module. The data is collected from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection (the sarcastic headlines are from TheOnion and the non-sarcastic ones are from HuffPost). Let’s now jump into the code.

First, let’s inspect the data to decide the approach. You can download the data from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/data?select=Sarcasm_Headlines_Dataset_v2.json .

#load data
import pandas as pd
df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
#shuffle the data inplace
df = df.sample(frac=1).reset_index(drop=True)
# show first few rows
df.head()

Read into a table with pandas, the dataset has 3 columns: ‘headline’ holds the headline text and ‘is_sarcastic’ holds 1 or 0 for sarcastic and non-sarcastic respectively. The two classes are represented as follows:

0    14985
1    13634
Name: is_sarcastic, dtype: int64

Now let’s look at what the text looks like for sarcastic and non-sarcastic examples:

#word cloud on sarcastic headlines
sarcastic = ' '.join(df[df['is_sarcastic']==1]['headline'].to_list())
plot_wordcloud(sarcastic, 'Reds')

#word cloud on non-sarcastic headlines
non_sarcastic = ' '.join(df[df['is_sarcastic']==0]['headline'].to_list())
plot_wordcloud(non_sarcastic, 'Reds')
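The word-cloud images are not reproduced here, and `plot_wordcloud` is a helper the article does not show. As a rough textual stand-in, the sketch below counts the most frequent words in a list of headlines using only the standard library. The function name `top_words` and the sample headlines are made up for illustration.

```python
from collections import Counter

def top_words(headlines, n=5):
    """Return the n most common whitespace-separated words across headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(headline.lower().split())
    return counts.most_common(n)

# Hypothetical sample; with the real data one could pass, for example,
# df[df['is_sarcastic'] == 1]['headline'].to_list()
sample = [
    "area man wins argument",
    "area woman loses keys",
    "local man finds keys",
]
print(top_words(sample, n=3))
```

Running it once per class gives a quick, if less pretty, view of which words dominate each group.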

Before building the classifier, we need to clean the text to remove noise. Since these are news headlines, they don’t contain much noise to begin with. The cleaning steps I settled on are lowercasing every string, removing anything that is not alphanumeric, and replacing numbers with a specific label.

df['headline'] = df['headline'].str.lower()
df['headline'] = df['headline'].apply(alpha_num)
df['headline'] = df['headline'].apply(replace_num)
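The helpers `alpha_num` and `replace_num` are not defined in the article. A minimal sketch of what they might look like follows, assuming regular expressions are used and that numbers are replaced by the token `NUM`; the token is an assumption, chosen in upper case so it cannot collide with a lowercased headline word.

```python
import re

def alpha_num(text):
    """Drop every character that is not a letter, digit, or whitespace."""
    return re.sub(r'[^A-Za-z0-9\s]', '', text)

def replace_num(text, placeholder='NUM'):
    """Replace each run of digits with a placeholder token (assumed: 'NUM')."""
    return re.sub(r'\d+', placeholder, text)

print(alpha_num("kellogg's pulls 'choco-bastard'"))
print(replace_num("top 10 films of 2016"))
```

Removing punctuation outright, rather than replacing it with a space, matches the sample lines shown later in the article, where possessives appear as "kelloggs" and "churchills".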

Now, we are ready with the cleaned text and their corresponding labels to build a binary sarcasm classifier. As discussed, we will build the model using FastText python module.

First, we need to install the FastText Python module in the environment.

# Build fastText for Python
!git clone https://github.com/facebookresearch/fastText.git
# each "!" line runs in its own subshell, so cd and install must share one line
!cd fastText && pip3 install .

We need to prepare the training and testing files in a format the FastText API understands. By default, each line of the text file should have the form __label__<Label> <Text>. A prefix other than __label__ can be used by changing the corresponding parameter at training time, which we will see ahead.

#data preparation for fasttext
with open('fasttext_input_sarcastic_comments.txt', 'w') as f:
    for each_text, each_label in zip(df['headline'], df['is_sarcastic']):
        # one example per line: label prefix, label, a space, then the text
        f.write(f'__label__{each_label} {each_text}\n')
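To make the line format and the prefix option concrete, here is a small sketch of a formatter. The function `to_fasttext_line` and the alternative prefix `__class__` are illustrative names, not part of FastText.

```python
def to_fasttext_line(label, text, prefix='__label__'):
    """Format one example as '<prefix><label> <text>' on a single line."""
    # fastText reads one example per line, so embedded newlines and
    # repeated whitespace are collapsed into single spaces
    clean_text = ' '.join(str(text).split())
    return f'{prefix}{label} {clean_text}\n'

print(to_fasttext_line(1, 'white couple admires fall colors'), end='')
print(to_fasttext_line(0, 'when our tears become medicine', prefix='__class__'), end='')
# A file written with prefix='__class__' would then need label='__class__'
# passed to fasttext.train_supervised so that the labels are recognised.
```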

The data in the file looks like this:

!head -n 10 fasttext_input_sarcastic_comments.txt

__label__1 school friends dont find camp songs funny
__label__0 what cutting americorps would mean for public lands
__label__0 when our tears become medicine
__label__1 craig kilborn weds self in private ceremony
__label__1 white couple admires fall colors
__label__1 mom much more insistent about getting grandkids from one child than other
__label__0 diary of a queer kids mom
__label__1 sephora makeup artist helping woman create the perfect pink eye
__label__1 kelloggs pulls controversial chocobastard from store shelves
__label__0 winston churchills grandson introduces a new nickname for donald trump

Here __label__0 means non-sarcastic and __label__1 means sarcastic. We are now in good shape to train the classifier. For that, we divide the dataset into training (90%) and testing (10%) sets. The FastText function for this supervised binary classification is train_supervised.
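The article does not show how the 90/10 split is made. One possible sketch, assuming the combined file is already shuffled (the DataFrame was shuffled before writing) and simply cutting it at the 90% mark, is given below. The function name `split_train_test` is hypothetical.

```python
def split_train_test(input_path, train_path, test_path, train_fraction=0.9):
    """Write the first train_fraction of lines to train_path, the rest to test_path.

    Assumes the input lines are already in random order.
    Returns (number of training lines, number of test lines).
    """
    with open(input_path, encoding='utf-8') as f:
        lines = f.readlines()
    cut = int(len(lines) * train_fraction)
    with open(train_path, 'w', encoding='utf-8') as f:
        f.writelines(lines[:cut])
    with open(test_path, 'w', encoding='utf-8') as f:
        f.writelines(lines[cut:])
    return cut, len(lines) - cut

# Example call with the file names used in this article:
# split_train_test('fasttext_input_sarcastic_comments.txt',
#                  'sarcasm_train.bin', 'sarcasm_test.bin')
```

Despite the .bin extension used in the article, both output files are plain text in the FastText input format.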

import fasttext

'''
For classification, the train_supervised call is used.

Its default parameters:
    input             # training file path (required)
    lr                # learning rate [0.1]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [1]
    minCountLabel     # minimal number of label occurrences [1]
    minn              # min length of char ngram [0]
    maxn              # max length of char ngram [0]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    loss              # loss function {ns, hs, softmax, ova} [softmax]
    bucket            # number of buckets [2000000]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]
    t                 # sampling threshold [0.0001]
    label             # label prefix ['__label__']
    verbose           # verbose [2]
    pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
'''
model = fasttext.train_supervised('sarcasm_train.bin', wordNgrams=2)

To measure the performance in the testing dataset:

#measuring performance on test data
def print_results(sample_size, precision, recall):
    precision   = round(precision, 2)
    recall      = round(recall, 2)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')

print_results(*model.test('sarcasm_test.bin'))

sample_size=2862
precision=0.87
recall=0.87
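The three values returned by model.test are the number of examples, the precision at k and the recall at k, with k=1 by default. The pure-Python sketch below shows how such figures can be computed. The function `precision_recall_at_k` and the toy data are illustrative and are not FastText's own code.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """true_labels: list of sets; predicted_labels: list of ranked label lists."""
    n_correct = 0    # predicted labels that are in the true set
    n_predicted = 0  # total labels predicted (k per example)
    n_true = 0       # total true labels
    for truth, ranked in zip(true_labels, predicted_labels):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_true += len(truth)
    return n_correct / n_predicted, n_correct / n_true

truth = [{'__label__1'}, {'__label__0'}, {'__label__1'}, {'__label__0'}]
preds = [['__label__1', '__label__0'], ['__label__0', '__label__1'],
         ['__label__0', '__label__1'], ['__label__0', '__label__1']]
print(precision_recall_at_k(truth, preds, k=1))
```

When every example has exactly one true label and k=1, the two denominators are equal, which is why precision and recall come out identical in the results above.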

The results, though not perfect, look promising. We now save the model object for future inference.

#save the model
model.save_model('fasttext_sarcasm.model')

FastText can also compress the model through quantisation, producing a much smaller model file at the cost of only a little performance.

# with the previously trained `model` object, call
model.quantize(input='sarcasm_train.bin', retrain=True)
# save the compressed model (the .ftz file listed further down)
model.save_model('fasttext_sarcasm.ftz')
# results on test set
print_results(*model.test('sarcasm_test.bin'))

sample_size=2862
precision=0.86
recall=0.86

Precision and recall each drop by 0.01, but look at the model file sizes:

!du -kh ./fasttext_sarcasm*

98M ./fasttext_sarcasm.ftz
774M ./fasttext_sarcasm.model

The compressed model is only about one-eighth the size of the base model (98M versus 774M). So this is a trade-off between model size and performance, which the user has to decide on depending on the use case. With the classifier trained and ready, it is time to prepare the inference script so you can plan the deployment.

def predict_is_sarcastic(text):
    return SarcasmService.get_model().predict(text, k=2)

if __name__ == '__main__':
    ip = 'Make Indian manufacturing competitive to curb Chinese imports: RC Bhargava'
    print(f'Result : {predict_is_sarcastic(ip)}')

Result : (('__label__0', '__label__1'), array([0.5498156 , 0.45020437]))
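`SarcasmService` is not defined in the article. A plausible sketch, assuming it is a small class that loads the saved model once and caches it, could look like this. The class body and the `is_sarcastic` helper are hypothetical; `fasttext.load_model` is the module's function for loading a saved model.

```python
class SarcasmService:
    """Hypothetical holder that loads the saved model once and reuses it."""
    model_path = 'fasttext_sarcasm.model'  # file written by save_model earlier
    _model = None

    @classmethod
    def get_model(cls):
        if cls._model is None:
            # imported lazily so the class can be defined without fasttext installed
            import fasttext
            cls._model = fasttext.load_model(cls.model_path)
        return cls._model


def is_sarcastic(text):
    """Return True when the top-ranked label is __label__1."""
    labels, _probabilities = SarcasmService.get_model().predict(text, k=1)
    return labels[0] == '__label__1'
```

Caching the model avoids reloading a file of several hundred megabytes on every request. For consistent results, the input text would presumably need the same cleaning steps that were applied to the training headlines.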

As seen above, the label with the highest probability is __label__0, which means the trained model considers this headline non-sarcastic. In the model.predict() call, k is the number of classes to return along with their probability scores. Since we used the softmax loss (the default in FastText), the probabilities of the two labels sum to 1.
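To see why the two probabilities sum to 1, here is a minimal softmax sketch in plain Python. The input scores are made-up numbers, not values taken from the trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # subtracting the maximum keeps exp() from overflowing on large scores
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.2, 0.0])  # hypothetical scores for __label__0 and __label__1
print(probs, sum(probs))
```

Each exponentiated score is divided by the total, so the outputs always sum to 1 up to floating-point rounding, and the larger score always gets the larger probability.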

To conclude, FastText can be a strong baseline for any NLP classification task, and its implementation is very easy. All the source code for this article can be found at my git repo. The FastText Python module is not officially supported, but that shouldn’t be an issue for tech people to experiment with :). In a future post I will try to discuss how the trained model can be moved to production.
