Going Beyond SQuAD Part 1: Question Answering in Different Languages

Category: IT Technology · Published: 4 years ago


Human Annotated Data

It is still the case in NLP that the best data is human-generated data. SQuAD is impressive in both its scale and the accuracy of its annotations, and many teams have tried to replicate its procedure. For example, SberQuAD (Russian) and FQuAD (French) are crowdsourced QA datasets that have proven to be good starting points for building non-English QA systems. KorQuAD (Korean) also replicates the original SQuAD crowdsourcing procedure and provides some very interesting insight into how trained QA systems fare in comparison to humans on different types of questions. The authors of FQuAD find that with CamemBERT (a BERT model pre-trained on French) and a dataset a quarter the size of the original SQuAD dataset, they are still able to reach approximately 95% of human F1 performance. The labour-intensive nature of native crowdsourced data collection, however, limits the creation of large-scale datasets, and this has motivated many teams to investigate ways to automatically translate SQuAD.

Comparison of Exact Match (EM) performance on the KorQuAD dataset by type of question-answer pair (Lim et al., 2019)
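For readers unfamiliar with the two metrics mentioned above, here is a minimal sketch of how Exact Match and token-level F1 are typically computed for SQuAD-style answers. The normalization below (lowercasing, stripping punctuation and English articles) follows the convention commonly used for English SQuAD; it is an assumption here, and the non-English datasets discussed in this post may normalize differently. The function names are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Article stripping is English specific (assumption, see lead-in).
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match the reference after normalization, while F1 gives partial credit for overlapping tokens, which is why the two numbers can diverge noticeably on long answers.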

Machine Translated Data

Machine-translated SQuAD datasets exist for Korean (K-QuAD), Italian, and Spanish. These are almost always more cost- and time-efficient to produce, especially considering the premium on crowdsourcing non-English native speakers on platforms such as Mechanical Turk. We at deepset have also experimented with machine translation of SQuAD and have faced the same quality assurance issues that confronted the creators of the aforementioned datasets.

Chief amongst these is the issue of alignment. Though translating the question and passage is straightforward, it is not always possible to automatically infer the answer span from the translated text, since the character indices will almost certainly have shifted. One remedy is to wrap the answer span in start and end markers before translation, in the hope that they survive it. It is also worth noting that the encoder-decoder attention components in modern machine translation models can function as a form of alignment. In cases where the dataset translation is done with full access to a trained model, attention weights can be interpreted as a form of free alignment (cf. this method).
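The marker approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure used by any of the datasets above: the marker strings `[[` and `]]` and the helper names `wrap_answer` and `recover_answer` are hypothetical, and the translation step itself is left out (any MT system would sit between the two calls).

```python
from typing import Optional, Tuple

# Hypothetical marker strings; pick ones the MT system tends to leave untouched.
START, END = "[[", "]]"


def wrap_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Insert markers around the answer span before sending text to translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + START + answer_text + END + context[end:]


def recover_answer(translated: str) -> Optional[Tuple[str, int, str]]:
    """Return (clean_context, answer_start, answer_text), or None if markers were lost."""
    s = translated.find(START)
    if s == -1:
        return None
    e = translated.find(END, s + len(START))
    if e == -1:
        return None
    answer = translated[s + len(START):e]
    # Remove the markers; the answer now starts at index s in the clean text.
    clean = translated[:s] + answer + translated[e + len(END):]
    return clean, s, answer


marked = wrap_answer("Paris is the capital of France.", 0, "Paris")
# Stand-in for the MT output; a real system would produce this string.
translated = "Die Hauptstadt von Frankreich ist [[Paris]]."
result = recover_answer(translated)
```

Samples for which `recover_answer` returns `None` have lost their markers in translation and would need to be dropped or aligned by some other means, which is one source of the quality assurance issues mentioned above.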

Finding the Right Mix

Given this trade-off between data quality and scale when choosing between human-created and machine-translated datasets, how can we ensure the best performance in our trained models? In the research literature, several teams leverage both kinds of datasets in different ways.


The creators of FQuAD, for example, have data in both styles and train three models: one using just the machine-translated data, one using just the human-annotated data, and one using both. Even though the machine-translated data adds another 46,000 samples on top of the 25,000 human-annotated ones, they find that a model trained on both performs slightly worse than one trained on the human-annotated data alone.

K-QuAD is also composed of a mix of machine- and human-created samples, and the researchers behind it experiment with combinations of the data. Ultimately, they find that a mixture of the human and de-noised machine-translated data gives the best performance. Finally, the creators of the Arabic Question Answering dataset also experiment with a mixture of human- and machine-created samples, and for them the best performance comes from a full mixture of both.

From these data points, it seems fair to say that a dataset of around 25,000 human-annotated SQuAD-style samples is enough to train a model with at least 90% of human performance. If you only have around 5,000 such samples, augmenting this set with machine-translated data may be worthwhile.
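As a rough illustration of this rule of thumb, the sketch below assembles a training set from the two sources. The function name, the sample type, and the hard cut-off are assumptions made for the example; the 25,000 default simply echoes the figure above, and what to do between roughly 5,000 and 25,000 human samples is something you would need to validate on your own dev set.

```python
import random
from typing import Dict, List

Sample = Dict[str, str]  # simplified stand-in for a SQuAD-style QA sample


def build_training_set(
    human: List[Sample],
    machine_translated: List[Sample],
    enough_human: int = 25_000,  # rule-of-thumb threshold, not a hard limit
    seed: int = 0,
) -> List[Sample]:
    """Use human data alone when there is enough of it, otherwise mix in MT data."""
    if len(human) >= enough_human:
        return list(human)
    mixed = list(human) + list(machine_translated)
    # Shuffle so that batches contain both kinds of sample.
    random.Random(seed).shuffle(mixed)
    return mixed
```

A variant worth trying, in line with the K-QuAD result, is to filter noisy machine-translated samples before mixing rather than adding all of them.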

Multilingual Datasets
