Paraphrase any question with T5 (Text-To-Text Transfer Transformer) — Pretrained model and…

Category: IT Technology · Published: 4 years ago

Input

The input to our program will be any general question that you can think of –

Which course should I take to get started in data science?

Output

The output will be paraphrased versions of the same question. Paraphrasing a question means creating a new question that expresses the same meaning using a different choice of words.

Paraphrased Questions generated from our T5 Model ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?

As you can see, we generated 10 questions that are paraphrases of the original question: “Which course should I take to get started in data science?”

Today we will see how to train a T5 model from Hugging Face’s transformers library to generate these paraphrased questions. We will also see how to use the pre-trained model that I provide to generate them.

Practical use case

Imagine a middle school teacher preparing a quiz for the class. Instead of giving the same fixed question to every student, the teacher can generate multiple variants of a given question and distribute them across students. The school can also augment its question bank with several variants of each question using this technique.

Let’s get started —

Dataset

I used the Quora Question Pairs dataset, kept only the question pairs marked as duplicates, and prepared training and validation sets from them. Two questions marked as duplicates of each other ask the same thing in different words, so they serve as paraphrase pairs.

We will discuss in detail how you can –

Use my pre-trained model to generate paraphrased questions for any given question.
Use my training code and dataset to replicate the results on your own GPU machine.

Training Algorithm — T5

T5 is a transformer model from Google that is trained end to end with text as input and modified text as output. You can read more about it here.

It achieves state-of-the-art results on multiple NLP tasks such as summarization, question answering, and machine translation, using a text-to-text transformer trained on a large text corpus.

I trained T5 with the original sentence as input and the paraphrased sentence (the duplicate sentence from Quora Question Pairs) as output.

Code

All the code for using the pre-trained model and for training the model on the given data is available at –

Using Pre-trained model

The Jupyter notebook t5-pretrained-question-paraphraser contains the code presented below.

First, install the necessary libraries –

!pip install torch==1.4.0
!pip install transformers==2.9.0
!pip install pytorch_lightning==0.7.5

Download the pre-trained model from S3 and unzip it in the current folder.

Run inference with any question as input and see the paraphrased results.
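The notebook's inference cell is not reproduced in this copy. As a rough sketch of what this step might look like with the pinned transformers 2.9.0, the code below builds the prompt, samples several candidates, and filters out repeats. The folder name t5_paraphrase, the t5-base tokenizer, and the sampling settings are assumptions for illustration, not values confirmed by the notebook.

```python
def build_prompt(question):
    # Same "paraphrase: ... </s>" format that the model saw during training.
    # With transformers 2.9.0 the </s> token is added by hand.
    return "paraphrase: " + question + " </s>"


def unique_paraphrases(original, candidates):
    # Drop repeats and copies of the original question (case-insensitive).
    seen = {original.strip().lower()}
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
    return kept


def paraphrase(question, model_dir="t5_paraphrase", num_candidates=10):
    # Heavy imports live here so the helpers above work without them.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("device", device)

    # model_dir is a placeholder for the folder holding the unzipped model.
    model = T5ForConditionalGeneration.from_pretrained(model_dir).to(device)
    model.eval()
    tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed base checkpoint

    encoding = tokenizer.encode_plus(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            input_ids=encoding["input_ids"].to(device),
            attention_mask=encoding["attention_mask"].to(device),
            do_sample=True,  # sample instead of greedy decoding, for variety
            top_k=120,       # illustrative sampling settings
            top_p=0.98,
            max_length=256,
            num_return_sequences=num_candidates,
        )
    decoded = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
    return unique_paraphrases(question, decoded)
```

Calling paraphrase("Which course should I take to get started in data science?") returns a list of strings that can be printed with their index. Because the candidates are sampled, the exact questions and their number can differ from run to run.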

The output from the above code is –

device cpu

Original Question ::
Which course should I take to get started in data science?


Paraphrased Questions :: 
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?

Training your own model

Again, all the training code and the dataset used for training are available in the GitHub repo mentioned earlier. We will go through the steps that I used to train the model.

1. Data Preparation

First, I downloaded the Quora Question Pairs TSV file (quora_duplicate_questions.tsv) as mentioned in this link.

I extracted only the rows that have is_duplicate = 1, since they are the paraphrased question pairs. Then I split the data into train and validation sets and stored them in separate CSV files.

In the end, each of the CSV files has two columns, “question1” and “question2”, where “question2” is a paraphrased version of “question1”. Since T5 expects text as input, I gave “question1” as the input source and asked it to generate “question2” as the target output.

The code used to generate the train and validation CSV files is shown below. The CSV files are available under the paraphrase_data folder in the GitHub repo.

import pandas as pd
from sklearn.model_selection import train_test_split

filename = "quora_duplicate_questions.tsv"
question_pairs = pd.read_csv(filename, sep='\t')

# Keep only the pairs marked as duplicates, and only the two question columns.
question_pairs_correct_paraphrased = question_pairs[question_pairs['is_duplicate'] == 1]
question_pairs_correct_paraphrased = question_pairs_correct_paraphrased[['question1', 'question2']]

# Hold out 10% of the pairs for validation.
train, test = train_test_split(question_pairs_correct_paraphrased, test_size=0.1)
train.to_csv('Quora_Paraphrasing_train.csv', index=False)
test.to_csv('Quora_Paraphrasing_val.csv', index=False)

2. Training

Thanks to Suraj Patil for the amazing Colab notebook on training T5 for any text-to-text task. I borrowed most of the training code from the Colab notebook, changing only the dataset class and training parameters. I adapted the dataset class to our Quora Question Pair dataset.

The training code is available as train.py in the GitHub repo.

All you need to do is clone the repo on any GPU machine, install the packages listed in requirements.txt, and run train.py to train the T5 model.

Training this model for 2 epochs (the default) took about 20 hours on a p2.xlarge instance (AWS EC2).

The dataset class looks like below —
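The original listing is not reproduced in this copy. The following is only a minimal sketch of what such a dataset class might look like, assuming the CSV files produced in the data preparation step. The class name ParaphraseDataset, the encode callable, and the max_len default are hypothetical names introduced here for illustration. It is written as a plain Python class with __len__ and __getitem__, which is the map-style interface that a PyTorch DataLoader accepts.

```python
import csv


class ParaphraseDataset:
    """Map-style dataset over a CSV with question1 and question2 columns."""

    def __init__(self, csv_path, encode, max_len=256):
        # encode(text, max_len) stands in for the tokenizer call and should
        # return whatever the trainer expects (for example padded token ids).
        self.encode = encode
        self.max_len = max_len
        with open(csv_path, newline="", encoding="utf-8") as handle:
            self.pairs = [
                (row["question1"], row["question2"])
                for row in csv.DictReader(handle)
            ]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        question1, question2 = self.pairs[index]
        # Source gets the task prefix; both sides end with the </s> token.
        source = "paraphrase: " + question1 + " </s>"
        target = question2 + " </s>"
        return {
            "source": self.encode(source, self.max_len),
            "target": self.encode(target, self.max_len),
        }
```

In a real training run, encode would wrap the T5 tokenizer so that each item holds fixed-length tensors. Passing a simple function such as lambda text, max_len: text is a quick way to inspect the formatted strings before training.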

The key is how we give our input and output to the T5 model trainer. For any given question pair from the dataset, I gave input (source) and output (target) to the T5 model as shown below –

Input format to T5 for training

paraphrase: What are the ingredients required to make a perfect cake? </s>

Output format to T5 for training

How do you bake a delicious cake? </s>

That’s it! You have a state-of-the-art question paraphraser in your hand.

Perhaps this is the first work of its kind out there to generate paraphrased questions from any given question!

Happy coding!
