5 Datasets About COVID-19 you can Use Right Now

Category: IT Technology · Published: 4 years ago

Open datasets you can use to improve forecasting models, predict and analyze the impact of COVID-19 or investigate the information spread on Twitter.

Photo by Martin Sanchez on Unsplash

The coronavirus outbreak and the disease it causes, COVID-19, have taken the world by storm. Newsrooms filter tons of information every day: articles, official briefings, expert interviews and more. Medical personnel struggle to keep up with hundreds of scientific publications each week on drug research, epidemiological reports, intervention policies and many other topics. Moreover, social network platforms need to reduce the noise and promote verified stories to avoid leaving users misinformed and terrified.

In this fight, we are fortunate to live in a world where the value of data is well understood and many efforts are underway to collect and refine datasets. Hence, the question is how to use them to extract value and wisdom that will affect the way policies are made and alarms are triggered.

In this story, I present five well-curated datasets that can prove very useful under a certain analytical light. Their main possible applications range from improving epidemiological forecasting models and predicting the impact of various intervention policies, to natural language processing and the study of information spread on Twitter. For existing applications, I invite you to read the story below.

nCoV-2019

The first dataset we consider was published on March 24th, 2020 under the title “Epidemiological data from the COVID-19 outbreak, real-time case information” [1]. It collects information on individuals from national, provincial and municipal health reports, along with additional knowledge from online reports. All data are geo-coded and include further details such as symptoms, key dates (date of onset, admission, and confirmation) and travel history where available. You can find the associated GitHub repo here.

COVID-19 outbreak visualization using nCoV-2019

The nCoV-2019 dataset makes it possible to build real-time models of disease outbreaks. Such models support public health decision making and help policymakers enforce informed guidelines.
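As an illustration of what the key dates allow, here is a minimal sketch that computes the delay between symptom onset and confirmation for each record. The column names `date_onset_symptoms` and `date_confirmation` and the `dd.mm.yyyy` date format are assumptions on my part, and the sample rows are made up, so check both against the actual file before using it.

```python
import csv
import io
from datetime import datetime
from statistics import median

# Assumed column names and date format; verify them against the real file.
ONSET_COL = "date_onset_symptoms"
CONFIRM_COL = "date_confirmation"
DATE_FMT = "%d.%m.%Y"


def parse_date(value):
    """Return a date, or None when the field is empty or not a single date."""
    try:
        return datetime.strptime(value.strip(), DATE_FMT).date()
    except (ValueError, AttributeError):
        return None


def onset_to_confirmation_delays(csv_text):
    """Days between symptom onset and confirmation for each usable record."""
    delays = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        onset = parse_date(row.get(ONSET_COL))
        confirmed = parse_date(row.get(CONFIRM_COL))
        # Skip records with a missing date or an inconsistent ordering.
        if onset and confirmed and confirmed >= onset:
            delays.append((confirmed - onset).days)
    return delays


# Made-up sample rows, for illustration only.
SAMPLE = """ID,date_onset_symptoms,date_confirmation
1,18.01.2020,22.01.2020
2,,23.01.2020
3,20.01.2020,21.01.2020
4,15.01.2020 - 17.01.2020,24.01.2020
"""

delays = onset_to_confirmation_delays(SAMPLE)
print(delays, median(delays))
```

Records with a missing onset date or a date range instead of a single day are dropped rather than guessed, which keeps the summary honest at the cost of a smaller sample.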

COVID-19

COVID-19 [2] is arguably the most extensive effort to gather information about the coronavirus outbreak. Almost everybody who has read anything about the pandemic has seen the dashboard it feeds.

COVID-19 JHU dashboard

The dataset contains two folders: one records daily case reports and the other provides daily time series summary tables, including confirmed cases, deaths and recoveries. The COVID-19 dataset provides researchers, public health authorities, and the general public with an intuitive and user-friendly tool to track the outbreak as it unfolds. You can find the associated GitHub repo here.
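A typical first step with the time series tables is to aggregate them per country and turn running totals into daily increments. The sketch below assumes a wide layout with four metadata columns (`Province/State`, `Country/Region`, `Lat`, `Long`) followed by one cumulative-count column per date; the sample rows are illustrative, not real figures.

```python
import csv
import io
from collections import defaultdict

# Assumed layout: four metadata columns followed by one column per date.
META_COLS = ("Province/State", "Country/Region", "Lat", "Long")


def cumulative_by_country(csv_text):
    """Sum cumulative counts over provinces, giving one series per country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    date_cols = [c for c in reader.fieldnames if c not in META_COLS]
    totals = defaultdict(lambda: [0] * len(date_cols))
    for row in reader:
        series = totals[row["Country/Region"]]
        for i, col in enumerate(date_cols):
            series[i] += int(row[col] or 0)  # treat empty cells as zero
    return date_cols, dict(totals)


def daily_new(cumulative):
    """Turn a cumulative series into day-over-day increments.

    The first value is kept as-is, since no earlier day is available.
    """
    if not cumulative:
        return []
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


# Illustrative sample in the assumed layout.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Hubei,China,30.97,112.27,444,444,549
Guangdong,China,23.34,113.42,26,32,53
,Japan,36.0,138.0,2,2,2
"""

dates, totals = cumulative_by_country(SAMPLE)
for country, series in totals.items():
    print(country, series, daily_new(series))
```

Working with increments rather than running totals is usually what a forecasting model needs, although corrections in the source data can produce negative values that have to be handled separately.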

CORD-19

The Allen Institute for AI partnered with several research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19) [3]. The dataset brings together 44,000 scholarly articles about COVID-19 and the coronavirus family of viruses for use by the global research community.

The dataset already has an associated Kaggle challenge, where data scientists are called upon to develop text and data mining tools that can help the medical community answer high-priority scientific questions. Furthermore, there is already a CORD-19 Explorer tool, which provides a familiar way to navigate through the CORD-19 corpus.
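To give a feel for the simplest kind of text mining tool, the sketch below ranks papers by how often the query terms appear in their title and abstract. It is a plain term-count baseline rather than anything the challenge prescribes; the `title` and `abstract` keys are assumed field names, and the three papers are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_papers(papers, query, top_k=3):
    """Rank papers by how often the query terms occur in title and abstract."""
    terms = set(tokenize(query))
    scored = []
    for paper in papers:
        counts = Counter(tokenize(paper["title"] + " " + paper["abstract"]))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, paper["title"]))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


# Invented records, for illustration only.
PAPERS = [
    {"title": "Incubation period of a novel coronavirus",
     "abstract": "We estimate the incubation period from travel records."},
    {"title": "Antiviral drug screening",
     "abstract": "Candidate drug compounds are screened against coronavirus proteins."},
    {"title": "Hospital staffing models",
     "abstract": "A queueing approach to staffing."},
]

print(rank_papers(PAPERS, "incubation period"))
print(rank_papers(PAPERS, "coronavirus drug"))
```

A real system would at least weight rare terms more heavily (for example with TF-IDF) and search full text, but a baseline like this is a useful yardstick for anything more elaborate.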

WHO COVID-2019

A similar effort is underway at the World Health Organization (WHO), which updates its dataset every day by manually searching the tables of contents of relevant journals. It also tracks down other related scientific articles to enrich the dataset.

You can download the whole dataset or search it by author, keyword (title, author, journal), journal, or general topic here.

COVID-19 Tweet IDs

The COVID-19 tweet IDs dataset collects millions of tweets associated with the coronavirus outbreak and the COVID-19 disease [4]. The first tweet in this dataset dates back to January 22, 2020.

The authors used Twitter’s API to search and follow relevant accounts and gather tweets with specific keywords in many languages. The table below gives the language breakdown at the time of writing.

| Language    | ISO | No. of tweets | % of total tweets |
|-------------|-----|---------------|-------------------|
| English     | en  | 44,482,496    | 69.92%            |
| Spanish     | es  | 6,087,308     | 9.57%             |
| Indonesian  | in  | 1,844,037     | 2.90%             |
| French      | fr  | 1,800,318     | 2.83%             |
| Thai        | th  | 1,687,309     | 2.65%             |
| Portuguese  | pt  | 1,278,662     | 2.01%             |
| Japanese    | ja  | 1,223,646     | 1.92%             |
| Italian     | it  | 1,113,001     | 1.75%             |
| (undefined) | und | 1,110,165     | 1.75%             |
| Turkish     | tr  | 570,744       | 0.90%             |

You can download the dataset and find more information, including how to hydrate it (i.e., get the complete details of a tweet), on the project’s GitHub repo here.
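Hydration boils down to sending the IDs back to Twitter in batches and collecting whatever comes back. The sketch below shows only the batching logic, under two assumptions: the ID files hold one numeric ID per line, and the lookup endpoint accepts up to 100 IDs per request. The `lookup` argument is a hypothetical callable standing in for a real API client, so no actual Twitter call is made here.

```python
def read_tweet_ids(text):
    """Parse one tweet ID per line, skipping blanks and non-numeric lines."""
    return [line.strip() for line in text.splitlines() if line.strip().isdigit()]


def batches(items, size=100):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def hydrate(ids, lookup, size=100):
    """Call `lookup` once per batch and collect the tweets it returns.

    `lookup` is a hypothetical callable: it takes a list of IDs and returns
    the tweet objects it could retrieve (deleted tweets are simply absent).
    """
    tweets = []
    for chunk in batches(ids, size):
        tweets.extend(lookup(chunk))
    return tweets


def fake_lookup(chunk):
    """Stand-in for a real API client; returns a stub object per ID."""
    return [{"id": tweet_id, "text": "..."} for tweet_id in chunk]


# 250 made-up IDs, one per line, as they would appear in an ID file.
ids = read_tweet_ids("\n".join(str(n) for n in range(1000, 1250)))
print([len(b) for b in batches(ids)], len(hydrate(ids, fake_lookup)))
```

Keeping the API call behind a plain callable makes it easy to swap in whichever client library you use and to add rate-limit handling in one place.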

Conclusion

The data community has responded to the coronavirus outbreak by generating datasets of various kinds that can accelerate research into new treatments, inform policymakers, and support forecasting models that better predict how the disease behaves or trigger warnings for future events.

What remains to be seen is how data scientists will use these datasets and what tools they will produce. In any case, it seems that we have an extra weapon in our arsenal for fighting this virus.

My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium, LinkedIn or @james2pl on Twitter.

References

[1] Xu, B., Gutierrez, B., Mekaru, S. et al. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci Data 7, 106 (2020). https://doi.org/10.1038/s41597-020-0448-0

[2] Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real-time. The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(20)30120-1

[3] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-20. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed YYYY-MM-DD. https://doi.org/10.5281/zenodo.3727291

[4] Chen, E., Lerman, K., & Ferrara, E. (2020). COVID-19: The First Public Coronavirus Twitter Dataset. arXiv preprint arXiv:2003.07372.

