Going Beyond SQuAD Part 1: Question Answering in Different Languages

Category: IT Tech · Published: 6 years ago


Human Annotated Data

It remains the case in NLP today that the best data is human-generated data. SQuAD is impressive in both its scale and the accuracy of its annotations, and many teams have tried to replicate its procedure. For example, SberQuAD (Russian) and FQuAD (French) are crowd-sourced QA datasets that have proven to be good starting points for building non-English QA systems. KorQuAD (Korean) also replicates the original SQuAD crowd-sourcing procedure and provides some very interesting insight into how trained QA systems fare against humans on different types of questions. The authors of FQuAD find that with CamemBERT (a BERT model pre-trained on French) and a dataset a quarter the size of the original SQuAD, they can still reach approximately 95% of human F1 performance. The labour-intensive nature of native crowd-sourced data collection, however, limits the scale of such datasets, and this has motivated many teams to investigate ways to automatically translate SQuAD.

Comparison of Exact Match (EM) performance on the KorQuAD dataset by type of question-answer pair (Lim et al. 2019)

Machine Translated Data

Machine-translated SQuAD datasets exist for Korean (K-QuAD), Italian, and Spanish. These are almost always more cost- and time-efficient to produce, especially considering the premium on crowd-sourcing non-English native speakers on platforms such as Mechanical Turk. We at deepset have also experimented with machine translation of SQuAD and have faced the same quality assurance issues that confronted the creators of the aforementioned datasets.

Chief amongst these is the issue of alignment. Though the translation of question and passage is straightforward, it is not always possible to automatically infer the answer span from the translated text, since the character indices will almost certainly have shifted. One technique to remedy this is to insert start and end markers that wrap the answer span, in the hope that they are preserved by the translation. It is also worth noting that the encoder-decoder attention components in modern machine translation models can function as a form of alignment: in cases where the dataset translation is done with full access to a trained model, the attention weights can be interpreted as a form of free alignment (c.f. this method).
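As a minimal sketch of the marker technique, the snippet below wraps the answer span before translation and recovers it afterwards. The `translate` callable is a hypothetical stand-in for whatever MT system you use, and the guillemet markers are an assumption about characters that most systems pass through unchanged:

```python
import re

START, END = "«", "»"  # markers assumed to survive translation intact

def translate_example(context, question, answer_start, answer_text, translate):
    """Translate one SQuAD-style example while tracking the answer span.

    `translate` is any callable str -> str wrapping an MT system
    (hypothetical here); examples whose markers are lost are discarded.
    """
    end = answer_start + len(answer_text)
    # Wrap the answer span in markers before translating the context.
    marked = context[:answer_start] + START + answer_text + END + context[end:]
    translated = translate(marked)
    m = re.search(f"{re.escape(START)}(.*?){re.escape(END)}", translated)
    if m is None:
        return None  # markers did not survive translation: drop this example
    # Strip the markers; the answer now starts where the START marker stood.
    clean = translated.replace(START, "").replace(END, "")
    return {
        "context": clean,
        "question": translate(question),
        "answer_text": m.group(1),
        "answer_start": m.start(),
    }
```

In practice a noticeable fraction of examples fail the marker check (the translator drops or moves the symbols), so filtering out the `None` results is itself a de-noising step.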

Finding the Right Mix

Given this trade-off between data quality and scale when choosing between human-created and machine-translated datasets, how can we ensure the best performance in our trained models? In the research literature, a few different teams leverage both kinds of datasets in different ways.


The creators of FQuAD, for example, have data in both styles and train three models: one using just the machine-translated data, one using just the human-annotated data, and one using both. Even though the machine-translated data adds another 46,000 samples on top of the 25,000 human-annotated ones, they find that a model trained on both performs slightly worse than one trained on the human-annotated data alone.

K-QuAD is also composed of a mix of machine- and human-created samples, and the researchers behind it experiment with combinations of the data. Ultimately, they find that a mixture of the human data and de-noised machine-translated data gives the best performance. Finally, the creators of the Arabic Question Answering dataset also experiment with a mixture of human- and machine-created samples, and for them the best performance comes from a full mixture of both.

From these data points, it seems fair to say that a dataset of around 25,000 human-annotated SQuAD-style samples is enough to train a model to at least 90% of human performance. If you only have around 5,000 such samples, augmenting them with machine-translated data may be worthwhile.
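The augmentation step itself is simple if both corpora are stored in the standard SQuAD JSON format (a `data` list of articles under a top-level object); the sketch below concatenates such files into one training set, with hypothetical file names:

```python
import json

def merge_squad(paths, out_path):
    """Concatenate several SQuAD-format JSON files into one training set."""
    merged = {"version": "merged", "data": []}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # Each file holds its articles under the top-level "data" key.
            merged["data"].extend(json.load(f)["data"])
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return merged

# e.g. merge_squad(["human_annotated.json", "machine_translated.json"],
#                  "combined_train.json")
```

Whether the merged set actually helps is an empirical question, as the FQuAD and K-QuAD results above show, so it is worth evaluating the mixed model against one trained on the human data alone.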
