Input
The input to our program will be any general question that you can think of, for example:
Which course should I take to get started in data science?
Output
The output will be paraphrased versions of the same question. Paraphrasing a question means creating a new question that expresses the same meaning using a different choice of words.
Paraphrased Questions generated from our T5 Model ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?
As you can see, we generated 10 questions that paraphrase the original question, “Which course should I take to get started in data science?”
Today we will see how to train a T5 model from Hugging Face’s transformers library to generate these paraphrased questions. We will also see how to use the pre-trained model that I provide to generate them.
Practical use case
Imagine a middle school teacher preparing a quiz for the class. Instead of giving the same fixed question to every student, the teacher can generate multiple variants of a given question and distribute them across students. The school can also use this technique to augment its question bank with several variants of each question.
Let’s get started —
Dataset
I used the Quora Question Pairs dataset, kept only the question pairs marked as duplicates, and split them into training and validation sets. Pairs marked as duplicates are two wordings of the same question, so they serve our purpose as paraphrase pairs.
We will discuss in detail how you can:
- Use my pre-trained model to generate paraphrased questions for any given question.
- Use my training code and dataset to replicate the results on your own GPU machine.
Training Algorithm — T5
T5 is a new transformer model from Google that is trained in an end-to-end manner with text as input and modified text as output. You can read more about it here.
It achieves state-of-the-art results on multiple NLP tasks such as summarization, question answering, and machine translation, using a text-to-text transformer trained on a large text corpus.
I trained T5 with the original sentence as input and the paraphrased sentence (the duplicate question from Quora Question Pairs) as output.
Code
All the code for using pre-trained model and training the model with given data is available at –
Using Pre-trained model
The Jupyter notebook t5-pretrained-question-paraphraser contains the code presented below.
First, install the necessary libraries –
!pip install torch==1.4.0
!pip install transformers==2.9.0
!pip install pytorch_lightning==0.7.5
Download the pre-trained model from S3 and unzip it into the current folder.
Run inference with any question as input and see the paraphrased results.
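The inference step can be sketched roughly as follows. This is a minimal sketch rather than the notebook's exact code: the folder name t5_paraphrase, the helper names, and the sampling settings (top_k, top_p) are assumptions chosen for illustration, and it assumes the library versions installed above.

```python
def build_prompt(question):
    """Add the task prefix and end-of-sequence marker used during training."""
    return "paraphrase: " + question.strip() + " </s>"


def unique_paraphrases(candidates, original):
    """Drop repeated candidates and any candidate identical to the original."""
    seen = {original.strip().lower()}
    kept = []
    for text in candidates:
        key = text.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(text.strip())
    return kept


def generate_paraphrases(question, model_dir="t5_paraphrase", n=10, seed=42):
    # Heavy imports stay local so the helpers above work without them.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    torch.manual_seed(seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("device", device)

    # model_dir is an assumed name for the folder the zip was extracted to.
    model = T5ForConditionalGeneration.from_pretrained(model_dir).to(device)
    tokenizer = T5Tokenizer.from_pretrained("t5-base")

    encoding = tokenizer.encode_plus(build_prompt(question), return_tensors="pt")
    outputs = model.generate(
        input_ids=encoding["input_ids"].to(device),
        attention_mask=encoding["attention_mask"].to(device),
        do_sample=True,          # sample rather than decode greedily, for variety
        max_length=256,
        top_k=120,               # assumed sampling settings, tune as needed
        top_p=0.98,
        num_return_sequences=n,
    )
    decoded = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return unique_paraphrases(decoded, question)
```

Calling generate_paraphrases("Which course should I take to get started in data science?") returns a list of paraphrases. Because the outputs are sampled and then de-duplicated, the list can be shorter than n and will vary with the seed.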
The output from the above code is –
device cpu
Original Question ::
Which course should I take to get started in data science?
Paraphrased Questions ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?
Training your own model
Again, all the training code and the dataset used for training are available in the GitHub repo mentioned earlier. We will go through the steps that I used to train the model.
1. Data Preparation
First, I downloaded the Quora Question Pairs TSV file (quora_duplicate_questions.tsv) as mentioned in this link.
I extracted only the rows that have is_duplicate = 1, since those are the paraphrased question pairs. Then I split the data into train and validation sets and stored them in separate CSV files.
In the end, each of the CSV files has two columns, “question1” and “question2”, where “question2” is a paraphrased version of “question1”. Since T5 takes text as input and produces text as output, I gave “question1” as the input source and asked it to generate “question2” as the target output.
The code used to generate the train and validation CSV files is shown below. The CSV files are available under the paraphrase_data folder in the GitHub repo.
import pandas as pd
from sklearn.model_selection import train_test_split

filename = "quora_duplicate_questions.tsv"
question_pairs = pd.read_csv(filename, sep='\t')
# Keep only the rows marked as duplicates; these are the paraphrase pairs.
duplicates = question_pairs[question_pairs['is_duplicate'] == 1]
# Keep just the two question columns (drops id, qid1, qid2 and is_duplicate).
question_pairs_correct_paraphrased = duplicates[['question1', 'question2']]
# Hold out 10% of the pairs as the validation set.
train, test = train_test_split(question_pairs_correct_paraphrased, test_size=0.1)
train.to_csv('Quora_Paraphrasing_train.csv', index=False)
test.to_csv('Quora_Paraphrasing_val.csv', index=False)
2. Training
Thanks to Suraj Patil for the amazing Colab notebook on training T5 for any text-to-text task. I borrowed most of the training code from the Colab notebook, changing only the dataset class and training parameters. I adapted the dataset class to our Quora Question Pair dataset.
The training code is available as train.py in the GitHub repo.
All you need to do is clone the repo on any GPU machine, install the packages listed in requirements.txt, and run train.py to train the T5 model.
Training this model for 2 epochs (the default) took about 20 hours on a p2.xlarge instance (AWS EC2).
The dataset class looks like below —
The key is how we give our input and output to the T5 model trainer. For any given question pair from the dataset, I gave the input (source) and output (target) to the T5 model in the following format:
Input format to T5 for training
paraphrase: What are the ingredients required to make a perfect cake? </s>
Output format to T5 for training
How do you bake a delicious cake? </s>
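To make the format concrete, here is a minimal sketch of a dataset class that turns each row of the training CSV into a source/target pair in this format. The class name ParaphrasePairs and its parameters are hypothetical, and tokenization is left out; the class in train.py may be organized differently.

```python
import csv


class ParaphrasePairs:
    """Map-style dataset over a CSV with question1 and question2 columns.

    Hypothetical helper: it only builds the text pairs. It implements
    __len__ and __getitem__, which is what a map-style PyTorch dataset needs.
    """

    def __init__(self, csv_path, prefix="paraphrase: ", eos=" </s>"):
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.prefix = prefix
        self.eos = eos

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        # question1 is the input, question2 is the paraphrase to generate.
        source = self.prefix + row["question1"].strip() + self.eos
        target = row["question2"].strip() + self.eos
        return {"source": source, "target": target}
```

For example, ParaphrasePairs("Quora_Paraphrasing_train.csv")[0] returns a dictionary with the prefixed question1 as "source" and question2 as "target". The explicit " </s>" matches the format shown above; I believe later releases of transformers append the end-of-sequence token automatically, so check your tokenizer version before adding it by hand.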
That’s it! You have a state-of-the-art question paraphraser in your hands.
Perhaps this is the first work of its kind to generate paraphrased questions from any given question!
Happy coding!