A taste of ACL2020: 6 new Datasets & Benchmarks
This year’s conference of the Association for Computational Linguistics comes packed with more than 700 publications. To make things easier to navigate, here’s a selection of papers introducing refreshing new datasets and benchmarks for language tasks.
Jul 4 · 7 min read
Datasets and benchmarks are at the core of progress in Natural Language Understanding (NLU): in leaderboard-driven research, progress is upper-bounded by the quality of our evaluations. Machine Learning datasets used to last (e.g. models didn’t reach human performance on MNIST until more than a decade after it was introduced), but the latest benchmarks for Natural Language Understanding are becoming obsolete faster than expected, highlighting the importance of finding better ones.
The sheer number of papers on the topic is quite astounding, so at Zeta-Alpha we have curated this selection of the most interesting ACL2020 works on datasets and benchmarks, which will influence how progress is measured in the field.
1. Adversarial NLI: A New Benchmark for Natural Language Understanding
In this paper, which is already making waves with more than 20 citations, the authors eloquently make the case that static NLU benchmarks become obsolete very fast, and that models often leverage spurious statistical patterns¹ that went undetected in the data collection phase.
They introduce a dataset for Natural Language Inference (NLI), where, given a premise and a hypothesis, one should determine whether the hypothesis is entailed, contradicted or neutral. The catch is that they also introduce a framework to iterate on the dataset with humans in the loop writing adversarial examples based on feedback from a trained model, with the aim of creating a dataset on which the model fails. A round of annotation, shown in the figure below, consists of:
- Annotate a dataset and train a model on it.
- Have annotators write new adversarial hypotheses for a given context and test them against the trained model.
- If the model predicts correctly, add the new samples to the training set.
- If the model fails and another human agrees with the annotation, add the samples to the dev, test or training sets.
The authors call this process HAMLET (Human-And-Model-in-the-Loop Enabled Training) and in the paper they showcase the creation of a dataset in 3 rounds, where annotators are incentivized to come up with hypotheses on which the models will fail. This results in an increasingly challenging dataset and, as a side result, they reach state of the art on some variations of the MNLI dataset. While they speculate that this benchmark will not saturate soon thanks to how it was collected, they emphasize that even if it does, new rounds could be added to overcome this.
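To make the loop concrete, here is a minimal Python sketch of what one HAMLET-style round could look like; the model trainer, annotator and verifier are passed in as hypothetical callables standing in for the paper’s actual training code, annotation interface and crowdsourcing protocol.

```python
import random
from typing import Callable, Dict, List

def hamlet_round(
    train: List[Dict], dev: List[Dict], test: List[Dict],
    train_model: Callable,        # hypothetical: fits an NLI model on `train`, returns a predictor
    write_adversarial: Callable,  # hypothetical annotator: context -> (hypothesis, gold_label)
    verify: Callable,             # hypothetical second annotator: example -> bool
    n_attempts: int = 1000,
):
    """One simplified HAMLET-style round: train, collect adversarial examples, redistribute."""
    model = train_model(train)
    for _ in range(n_attempts):
        context = random.choice(train)["premise"]
        hypothesis, gold = write_adversarial(context)
        example = {"premise": context, "hypothesis": hypothesis, "label": gold}
        if model(context, hypothesis) == gold:
            # The model was not fooled: the example simply extends the training data.
            train.append(example)
        elif verify(example):
            # Verified model failure: these examples feed the new train/dev/test splits.
            random.choice([train, dev, test]).append(example)
    return train, dev, test
```

In the real setup the “annotators” are of course humans working through an interface, and the split assignment follows the paper’s protocol rather than a random choice.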
The main inconvenience of such dynamic datasets is the difficulty of standardization that enables apples-to-apples comparisons between works. While adversarial human-in-the-loop collection is not a new idea, this clean instance has the potential to serve as inspiration for future iterations and perhaps overcome the standardization obstacles, so that dynamic datasets become the norm in the near future.
2. ERASER: A Benchmark to Evaluate Rationalized NLP Models
This paper presents a full-fledged language benchmark, inspired by the success of the GLUE benchmark, consisting of 7 tasks that include not only labels but also human-annotated ‘rationales’ for them. These tasks are: Evidence Inference, BoolQ (boolean QA), Movie Reviews, FEVER (fact extraction and verification), MultiRC (reading comprehension), Commonsense Explanations (CoS-E) and e-SNLI (language entailment).
The authors propose an Area Under the Precision-Recall Curve (AUPRC) metric for evaluating the agreement between model and human rationales, but they are aware that this evaluation is hard to make objective, which is why they explicitly call for more research in this direction.
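As a rough illustration of the soft-scoring idea, here is a minimal sketch that computes a token-level AUPRC between a model’s importance scores and a binary human rationale, using scikit-learn’s average precision as a stand-in for the benchmark’s official scoring scripts; the token scores below are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy instance: one document with 8 tokens.
# Human rationale: tokens 2-4 were marked as the evidence for the label.
human_rationale = np.array([0, 0, 1, 1, 1, 0, 0, 0])

# Hypothetical soft importance scores per token from a model
# (e.g. attention weights or gradient-based saliency).
model_scores = np.array([0.05, 0.10, 0.80, 0.65, 0.40, 0.20, 0.05, 0.02])

# Area under the precision-recall curve, treating rationale extraction
# as token-level retrieval; the benchmark averages this over instances.
auprc = average_precision_score(human_rationale, model_scores)
print(f"token-level AUPRC: {auprc:.3f}")
```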
The proposed benchmark is an early step towards an ambitious vision of more explainable, comprehensive evaluation of Language Models.
3. GoEmotions: A Dataset of Fine-Grained Emotions
Sentiment Analysis has long been a fundamental task in NLP, but on some of the most widely used datasets, such as SST2 with its binary positive/negative sentiment, models are surpassing human performance, making them obsolete for measuring meaningful progress.
GoEmotions is a dataset of 58k manually annotated comments from popular English subreddits, and it is very fine-grained, with 27 emotion labels (or neutral). The data collection process adheres to high standards, with full manual review, length filtering, sentiment and subreddit balancing, and masking of proper names and religious terms. Early baseline tests with a BERT model indicate that there’s a lot of room for improvement and that current state-of-the-art NLU models fail to understand emotion at this granularity, making it a challenging new sentiment benchmark to focus on.
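For readers who want a starting point, here is a minimal sketch, assuming one treats GoEmotions as a 28-way multi-label problem, of a BERT classifier set up with Hugging Face Transformers; the checkpoint, threshold and example comments are illustrative choices, not the paper’s exact baseline configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 27 emotion labels plus neutral = 28 classes; a comment can carry several emotions,
# so this is a multi-label problem (independent sigmoid per class, not a softmax).
NUM_LABELS = 28

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # switches the loss to binary cross-entropy
)

batch = tokenizer(
    ["This made my whole week, thank you!", "Well, that was a waste of time."],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits     # shape (2, 28); untrained head, illustrative only
probs = torch.sigmoid(logits)          # independent probability per emotion
predicted = (probs > 0.5).int()        # hypothetical threshold to get a label set per comment
```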
These kinds of datasets are extremely valuable in this day and age, helping us make sense of the complex, large-scale social dynamics that play out on the internet.
On a similar note, also at ACL2020, iSarcasm: A Dataset of Intended Sarcasm focuses on the distinction between intended and perceived sarcasm, so that we can overcome the current bias of models towards detecting only its more obvious forms. This dataset, more modest in size at 4.4k samples, also stresses the importance of the topic as a means of understanding social interactions in the context of social media.
4. SCDE: Sentence Cloze Dataset with High Quality Distractors From Examinations
The Sentence Cloze task consists of filling sentence-sized gaps in a text from a set of candidates. Similar sentence-level tasks are often used as self-supervision for language model pre-training (e.g. BERT’s next sentence prediction); however, in that setting the task is often too easy, because the automatically generated candidates are not challenging enough and models can rely on spurious patterns between sentences.
In this work, the authors introduce distractor sentences: human-curated candidates designed by English teachers that require non-local, discourse-level aspects of language to perform the task successfully. Current models only achieve an accuracy of 72% whereas humans reach around 87%, showing considerable room for improvement.
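To make the task format concrete, here is a minimal sketch that ranks candidate fillers for a gap using BERT’s next-sentence-prediction head as a crude scorer; the passage and candidates are invented, and this is an illustrative baseline rather than any of the models evaluated in the paper.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

# A passage with one sentence-sized gap, plus human-written candidates
# including distractors that only discourse-level cues can rule out.
left_context = "She had trained for months, waking before dawn every single day."
candidates = [
    "On race day, all those early starts finally paid off.",   # plausible continuation
    "Bananas are an excellent source of potassium.",           # distractor
    "The museum closes at five on weekends.",                  # distractor
]

scores = []
for cand in candidates:
    inputs = tokenizer(left_context, cand, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits     # index 0 scores "candidate follows the context"
    scores.append(logits[0, 0].item())

best = max(range(len(candidates)), key=lambda i: scores[i])
print("picked:", candidates[best])
```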
5. Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts
Humans don’t learn language in isolation, so why should we expect machines to do so? Multimodal machine learning explores the idea of leveraging different modes of data, such as vision and language, to build better models of the world.
“[…] a language understanding system should be able to classify images depicting recess and remorse, not just cats, dogs and bridges.”
After this provocative depiction of most current multimodal vision-and-language datasets, this work builds the BabelPic dataset, which focuses on non-concrete concepts as a step to widen the coverage of multimodal semantic understanding. The dataset is built by combining the WordNet² and BabelNet³ lexical knowledge bases.
After many filtering tricks, heuristics and manual validation, the resulting ‘gold’ dataset has 2.7k synsets (synonym sets) with 15k matched images, plus an extended ‘silver’ set with 10k synsets, generated automatically by a vision-language model using the natural language definitions in WordNet.
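As a rough sketch of the ‘silver’ idea, matching images to non-concrete synsets via their glosses, here is how one could score a candidate image against WordNet-style definitions; note that CLIP is used here purely as a convenient stand-in vision-language model, not the model used in the paper, and the synsets and image path are made up.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical non-concrete synsets with WordNet-style glosses.
glosses = {
    "remorse.n.01": "a feeling of deep regret, usually for some misdeed",
    "recess.n.04": "a pause from doing something, as for rest or play",
}

image = Image.open("candidate.jpg")  # hypothetical candidate image to assign to a synset
inputs = processor(text=list(glosses.values()), images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # similarity of the image to each gloss
probs = logits.softmax(dim=-1).squeeze(0)

best = int(probs.argmax())
print("best matching synset:", list(glosses)[best], f"({probs[best].item():.2f})")
```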
6. R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
As the Adversarial NLI paper also pointed out, many reading comprehension systems rely on annotation artifacts and other biases in existing datasets, enabling them to complete tasks without any real understanding. To mitigate this, Naoya Inoue et al. present a task that requires models not only to find the correct answer in a Reading Comprehension task, but also to provide the adequate supporting facts.
The resulting annotated dataset totals 7.1k training and 6.6k development samples, sampled from the HotpotQA⁴ dataset, where rationales for each answer are included in the annotations. The evaluation of this task involves both scoring answers and assessing how well the predicted rationales align with the ground truth.
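As a simplified illustration of the second part, here is a sketch of a set-based F1 between predicted and gold supporting facts; the paper’s actual metric scores softer alignments between semi-structured derivations, so treat this only as a stand-in, and the example facts are invented.

```python
def rationale_f1(predicted: set, gold: set) -> float:
    """Exact-match F1 between predicted and gold supporting facts.
    The paper's actual metric scores softer alignments between semi-structured
    derivations; this is only a stand-in to make the evaluation idea concrete."""
    if not predicted or not gold:
        return 0.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical supporting facts, identified as (document title, sentence index).
gold = {("Marie Curie", 0), ("Nobel Prize", 2)}
predicted = {("Marie Curie", 0), ("Marie Curie", 3)}
print(f"rationale F1: {rationale_f1(predicted, gold):.2f}")   # 0.50
```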