Going Beyond SQuAD Part 1: Question Answering in Different Languages

Category: IT · Published: 4 years ago


Human Annotated Data

It remains the case in NLP today that the best data is human-generated data. SQuAD is impressive in both its scale and the accuracy of its annotations, and many teams have tried to replicate its procedure. For example, SberQuAD (Russian) and FQuAD (French) are crowd-sourced QA datasets that have proven to be good starting points for building non-English QA systems. KorQuAD (Korean) also replicates the original SQuAD crowd-sourcing procedure and provides some very interesting insight into how trained QA systems fare against humans on different types of questions. The authors of FQuAD find that with CamemBERT (a BERT model pre-trained on French) and a dataset a quarter the size of the original SQuAD, they are still able to reach approximately 95% of human F1 performance. The labour-intensive nature of native crowd-sourced data collection, however, limits the scale such datasets can reach, and this has motivated many teams to investigate ways to translate SQuAD automatically.

Comparison of Exact Match (EM) performance on the KorQuAD dataset by type of question-answer pair (Lim et al., 2019)

Machine Translated Data

Machine-translated SQuAD datasets exist for Korean (K-QuAD), Italian, and Spanish. These are almost always more cost- and time-efficient, especially given the premium on crowd-sourcing non-English native speakers on platforms such as Mechanical Turk. We at deepset have also experimented with machine translation of SQuAD and have faced the same quality-assurance issues that confronted the creators of the aforementioned datasets.

Chief among these is the issue of alignment. Though translating the question and passage is straightforward, it is not always possible to automatically infer the answer span in the translated text, since the character indices will almost certainly have shifted. One remedy is to insert start and end markers around the answer span before translation, in the hope that they survive it. It is also worth noting that the encoder-decoder attention components in modern machine translation models can function as a form of alignment: in cases where the dataset is translated with full access to a trained model, the attention weights can be interpreted as a form of free alignment (cf. this method).
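The marker technique above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `translate` is a placeholder for whatever MT system you use (here it just returns its input so the example is runnable), and the bracket markers are an arbitrary choice that a real system would need to verify survives translation.

```python
def translate(text: str) -> str:
    # Placeholder for a real machine translation call.
    # For illustration we "translate" by returning the text unchanged.
    return text

def translate_example(context: str, answer_start: int, answer_text: str,
                      start_marker: str = "[[", end_marker: str = "]]"):
    """Wrap the answer span in markers, translate, then recover the span."""
    answer_end = answer_start + len(answer_text)
    marked = (context[:answer_start] + start_marker + answer_text
              + end_marker + context[answer_end:])
    translated = translate(marked)
    start = translated.find(start_marker)
    end = translated.find(end_marker)
    if start == -1 or end == -1 or end < start:
        return None  # markers were lost or reordered during translation
    new_answer = translated[start + len(start_marker):end]
    # Strip the markers to get a clean context; the answer now starts
    # where the start marker stood.
    clean_context = translated.replace(start_marker, "").replace(end_marker, "")
    return clean_context, start, new_answer
```

Examples where the markers are dropped or reordered by the MT model return `None` and would typically be discarded, which is one source of the quality-assurance losses mentioned above.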

Finding the Right Mix

Given this trade-off between data quality and scale when choosing between human-created and machine-translated datasets, how can we ensure the best performance in our trained models? In the research literature, a few different teams leverage both kinds of data in different ways.


The creators of FQuAD, for example, have data in both styles and train three models: one on just the machine-translated data, one on just the human-annotated data, and one on both. Even though the machine-translated data adds another 46,000 samples on top of the 25,000 human-annotated ones, they find that the model trained on both performs slightly worse than the one trained on the human-annotated data alone.

K-QuAD is also composed of a mix of machine- and human-created samples, and the researchers behind it experiment with combinations of the data. Ultimately, they find that a mixture of human and de-noised machine-translated data gives the best performance. Finally, the creators of the Arabic Question Answering dataset also experiment with a mixture of human- and machine-created samples, and for them the best performance comes from a full mixture of both.

From these data points, it seems fair to say that a dataset of around 25,000 human-annotated SQuAD-style samples is enough to train a model to at least 90% of human performance. If you only have around 5,000 such samples, augmenting them with machine-translated data may be worthwhile.
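If you do augment, combining the two sources is mechanically simple, since SQuAD-format files share one JSON schema: a top-level `data` list of articles, each holding `paragraphs` with `context` and `qas`. A minimal sketch (the example dicts are hypothetical stand-ins for real dataset files):

```python
def merge_squad(datasets, version="merged"):
    """Concatenate several SQuAD-format dicts into one training set."""
    merged = {"version": version, "data": []}
    for ds in datasets:
        merged["data"].extend(ds["data"])
    return merged

# Hypothetical stand-ins for a human-annotated and a machine-translated set.
human_annotated = {"version": "v1.1", "data": [{"title": "doc_a", "paragraphs": []}]}
machine_translated = {"version": "v1.1", "data": [{"title": "doc_b", "paragraphs": []}]}
combined = merge_squad([human_annotated, machine_translated])
```

In practice you would load each file with `json.load` first, and (per the K-QuAD findings above) de-noise the machine-translated portion before merging.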

Multilingual Datasets

