Learning from unlabelled data with COVID-19 Open Research Dataset


Objective criteria for text search results and some surprising results

The COVID-19 Open Research Dataset can help researchers and the health community in the fight against a global pandemic. The Vespa team is contributing by releasing a search app based on the dataset. Since the data comes with no reliable labels to distinguish a good search result from a bad one, we propose an objective criterion for evaluating search results that does not rely on human-annotated labels. We use this criterion to run experiments and evaluate the value delivered by term-matching and semantic signals. We then show that the semantic signals deliver poor results, even with a fine-tuned version of a model specifically designed for scientific text.


Released by the Allen Institute for AI, the COVID-19 Open Research Dataset (CORD-19) contains over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. It was released to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. And it did exactly that.

As soon as it was released, there were a Kaggle challenge, a dataset explorer, fine-tuned embedding models, and an effort to collect labelled data.

Given my recent experience with labels containing strong term-matching bias in the MS MARCO dataset, and the fact that we at vespa.ai wanted to move fast to build a search app around the CORD-19 dataset, I decided to spend some time thinking about how I could compare different matching criteria and ranking functions without labelled data.

Objective criteria for text search

The goal was to have an objective criterion and to move away from the “it looks good enough” criterion so commonly used when reliable labels are not available. My proposal is simple: use the title of an article as the query and consider the associated abstract the relevant document for that query.


This criterion is simple, it scales to massive amounts of data because it does not rely on human annotation, and it makes sense. Think of it this way: if we use the title as a query and a given method fails to retrieve the correct abstract within the top 100 results, then we have a very sub-optimal ranking function for a CORD-19 search app.
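
The title-as-query idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy articles, the `word_overlap_search` ranker and the `rank_of_own_abstract` helper are all hypothetical and stand in for the real CORD-19 data and the Vespa search app, which are not shown here.

```python
from typing import Callable, Dict, List, Optional

Article = Dict[str, str]


def word_overlap_search(articles: List[Article]) -> Callable[[str], List[str]]:
    """Build a toy ranker (an assumption, not the Vespa app):
    score each abstract by how many query words it shares."""

    def search(query: str) -> List[str]:
        query_words = set(query.lower().split())
        scored = []
        for article in articles:
            overlap = len(query_words & set(article["abstract"].lower().split()))
            if overlap > 0:  # documents sharing no word are not matched at all
                scored.append((overlap, article["id"]))
        # Highest overlap first; ties broken by id to keep the order deterministic.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored]

    return search


def rank_of_own_abstract(
    article: Article, search: Callable[[str], List[str]], top_k: int = 100
) -> Optional[int]:
    """Use the title as the query and return the 1-based position of the
    article's own abstract within the top_k results, or None if it is absent."""
    results = search(article["title"])[:top_k]
    for position, doc_id in enumerate(results, start=1):
        if doc_id == article["id"]:
            return position
    return None


# Hypothetical toy corpus; no labels are needed beyond the title/abstract pairing.
articles = [
    {"id": "a1", "title": "coronavirus transmission in hospitals",
     "abstract": "we study transmission of coronavirus among hospital staff"},
    {"id": "a2", "title": "vaccine candidates for sars",
     "abstract": "a review of vaccine candidates developed for sars and related viruses"},
    {"id": "a3", "title": "bat reservoirs",
     "abstract": "ecology of wild animals in caves"},
]

search = word_overlap_search(articles)
ranks = {a["id"]: rank_of_own_abstract(a, search) for a in articles}
```

In this toy setup the third article is never retrieved by its own title, which is exactly the kind of failure the criterion is meant to expose.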

Results

Some of the results obtained are summarized in this section. We report three metrics: the percentage of documents matched by the query, the recall at the top 100 positions, and the mean reciprocal rank (MRR) over the top 100 documents returned.
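
As a rough sketch of how these three metrics could be computed, the snippet below aggregates per-query outcomes. The `summarize` function, its input format and the example numbers are hypothetical, and the match percentage is assumed here to be the fraction of the corpus matched, averaged over queries; the original experiments may define it differently.

```python
from typing import Dict, List, Optional, Tuple


def summarize(
    outcomes: List[Tuple[int, Optional[int]]], corpus_size: int, top_k: int = 100
) -> Dict[str, float]:
    """Aggregate per-query outcomes into the three reported metrics.

    Each outcome is (number of documents matched by the query,
    1-based rank of the relevant abstract, or None if it was not retrieved).
    """
    n = len(outcomes)
    # Assumed definition: fraction of the corpus matched, averaged over queries.
    match_ratio = sum(matched for matched, _ in outcomes) / (n * corpus_size)
    # Only ranks inside the top_k cut-off count as hits.
    hits = [rank for _, rank in outcomes if rank is not None and rank <= top_k]
    recall = len(hits) / n
    # A query without a hit contributes a reciprocal rank of 0.
    mrr = sum(1.0 / rank for rank in hits) / n
    return {"match_ratio": match_ratio, "recall": recall, "mrr": mrr}


# Made-up outcomes for four queries against a corpus of 1,000 documents.
outcomes = [(50, 1), (200, 4), (10, None), (400, 150)]
metrics = summarize(outcomes, corpus_size=1000)
```

Note that the last query counts as a miss even though the abstract was retrieved, because rank 150 falls outside the top 100.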

Term-matching

Table 1 shows the results obtained by ranking documents with the term-matching BM25 score. The first row shows the result when we only match documents whose abstracts contain every word in the title (AND operator). This is far too restrictive: it matches only a small fraction of documents (0.01%) and therefore misses many relevant abstracts, leading to poor recall and MRR (20% and 19%, respectively).
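
To see why the AND operator is so restrictive, consider the minimal sketch below. It uses naive whitespace tokenization and hypothetical helper names (`matches_all`, `matches_any`); a real search engine would apply its own linguistic processing, and the any-word alternative is shown only for contrast, since it is not described in the passage above.

```python
def tokenize(text: str) -> set:
    """Naive lowercase whitespace tokenizer (an assumption for illustration)."""
    return set(text.lower().split())


def matches_all(query: str, document: str) -> bool:
    """AND semantics: every query word must appear in the document."""
    return tokenize(query) <= tokenize(document)


def matches_any(query: str, document: str) -> bool:
    """Looser semantics for contrast: at least one query word must appear."""
    return bool(tokenize(query) & tokenize(document))


# Hypothetical title/abstract pair from the same article.
title = "coronavirus transmission in hospitals"
abstract = "we study transmission of coronavirus among hospital staff"

and_match = matches_all(title, abstract)
or_match = matches_any(title, abstract)
```

Here the abstract says "hospital" rather than "hospitals" and lacks the word "in", so the AND match fails even though the abstract belongs to the title; a single missing or inflected word is enough to drop the relevant document before ranking even starts.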

