5 Datasets About COVID-19 you can Use Right Now

Category: IT Technology · Published: 4 years ago

Open datasets you can use to improve forecasting models, predict and analyze the impact of COVID-19 or investigate the information spread on Twitter.

Photo by Martin Sanchez on Unsplash

The coronavirus outbreak and the disease it causes, COVID-19, have taken the world by storm. Newsrooms filter tons of information every day: articles, official briefings, expert interviews and more. Medical personnel struggle to follow hundreds of scientific publications each week on drug research, epidemiological reports, intervention policies and other topics. Moreover, social network platforms need to reduce the noise and promote verified stories to avoid nurturing misinformed and terrified users.

In this fight, we are fortunate to live in a world where the value of data is well understood and many efforts are underway to collect and refine such datasets. Hence, the question is how to use them to extract value and insight that will affect the way policies are made and alarms are triggered.

In this story, I present five well-curated datasets that can prove very useful when examined from the right analytical angle. Their main possible applications range from improving epidemiological forecasting models and predicting the impact of various intervention policies, to natural language processing and the study of information spread on Twitter. For existing applications, I invite you to read the story below.


The first dataset we consider was published on March 24th, 2020 under the title “Epidemiological data from the COVID-19 outbreak, real-time case information” [1]. It collects information on individual cases from national, provincial and municipal health reports, along with additional knowledge from online reports. All data are geo-coded and include further details such as symptoms, key dates (date of onset, admission, and confirmation) and travel history where available. You can find the associated GitHub repo here.

COVID-19 outbreak visualization using nCoV-2019

The nCoV-2019 dataset makes it possible to build real-time models of disease outbreaks. Such models support public health decision making and help policymakers issue informed guidelines.
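
As one illustration of what the key dates allow, the sketch below computes the delay between symptom onset and confirmation for each case. The records are made up, and the field names and the DD.MM.YYYY date format are assumptions for this example, so check them against the actual files before reusing it.

```python
from datetime import datetime
from statistics import median

# Hypothetical line-list records; the field names and the DD.MM.YYYY date
# format are assumptions for this sketch, not a description of the real files.
cases = [
    {"id": "a", "date_onset_symptoms": "18.01.2020", "date_confirmation": "22.01.2020"},
    {"id": "b", "date_onset_symptoms": "20.01.2020", "date_confirmation": "23.01.2020"},
    {"id": "c", "date_onset_symptoms": "", "date_confirmation": "25.01.2020"},
    {"id": "d", "date_onset_symptoms": "15.01.2020", "date_confirmation": "24.01.2020"},
]


def parse_date(text, fmt="%d.%m.%Y"):
    """Return a date, or None when the field is empty or malformed."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


def onset_to_confirmation_delays(records):
    """Days between symptom onset and confirmation, skipping incomplete rows."""
    delays = []
    for record in records:
        onset = parse_date(record.get("date_onset_symptoms", ""))
        confirmed = parse_date(record.get("date_confirmation", ""))
        if onset is None or confirmed is None:
            continue
        delay = (confirmed - onset).days
        if delay >= 0:  # drop rows where the two dates are out of order
            delays.append(delay)
    return delays


delays = onset_to_confirmation_delays(cases)
print("usable cases:", len(delays))
print("median delay (days):", median(delays))
```

Rows with a missing or unparseable date are skipped rather than guessed, which matters for a dataset where such fields are filled in only "where available".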


COVID-19 [2] is arguably the most extensive effort to gather information about the coronavirus outbreak. Almost everybody who has read anything about the pandemic has seen the dashboard it feeds.

COVID-19 JHU dashboard

The dataset contains two folders: one records daily case reports and the other provides daily time series summary tables, including confirmed cases, deaths and recoveries. The COVID-19 dataset provides researchers, public health authorities, and the general public with an intuitive and user-friendly tool to track the outbreak as it unfolds. You can find the associated GitHub repo here.
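
A common first step with time series tables of this kind is turning cumulative counts into daily new cases. The following is a minimal sketch on made-up numbers; the wide layout (a few descriptive columns followed by one column per date) and the column names are assumptions for illustration, so verify them against the files in the repo.

```python
import csv
import io

# Made-up numbers in an assumed wide layout: four descriptive columns
# followed by one cumulative-count column per date.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20
,Atlantis,0.0,0.0,1,3,3,7
North,Erewhon,1.0,1.0,0,2,5,4
"""

META_COLUMNS = 4  # number of leading non-date columns (an assumption)


def daily_new_cases(csv_text, meta_columns=META_COLUMNS):
    """Map each (country, province) to a list of (date, new_cases) pairs."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[meta_columns:]
    result = {}
    for row in reader:
        province, country = row[0], row[1]
        cumulative = [int(value) for value in row[meta_columns:]]
        # The first day has no predecessor, so its count is taken as-is.
        # Later days are differences, clamped at zero because cumulative
        # series are sometimes revised downward.
        new = [cumulative[0]] + [
            max(today - yesterday, 0)
            for yesterday, today in zip(cumulative, cumulative[1:])
        ]
        result[(country, province)] = list(zip(dates, new))
    return result


for key, series in daily_new_cases(SAMPLE).items():
    print(key, series)
```

Clamping negative differences to zero is a simplification. For serious analysis you would probably want to flag such revisions instead of hiding them.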


The Allen Institute for AI partnered with several research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19) [3]. The dataset brings together 44,000 scholarly articles about COVID-19 and the coronavirus family of viruses for use by the global research community.

The dataset already has an associated Kaggle challenge, where data scientists are called upon to develop text and data mining tools that can help the medical community answer high-priority scientific questions. Furthermore, there is already a CORD-19 Explorer tool, which provides a familiar way to navigate through the CORD-19 corpus.
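
To give a flavor of the text mining such a challenge invites, here is a minimal sketch that ranks documents against a query with a basic TF-IDF score. The three "papers" are toy stand-ins, not entries from the corpus, and the scoring is a deliberately simple baseline rather than the approach of any particular tool.

```python
import math
import re
from collections import Counter

# Toy stand-ins for paper titles; not taken from the corpus.
papers = {
    "p1": "Incubation period of coronavirus infections in travellers",
    "p2": "Antiviral drug candidates for coronavirus treatment",
    "p3": "School closure as an intervention policy during influenza outbreaks",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Rank document ids by a basic TF-IDF score for the query terms."""
    tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for doc_id, counts in tokens.items():
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in tokens.values() if term in c)
            if doc_freq == 0:
                continue  # term appears nowhere, so it cannot discriminate
            idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
            score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(rank("coronavirus drug treatment", papers))
```

Terms that appear in fewer documents get a higher weight, so a paper matching "drug" and "treatment" outranks one that only matches the more common "coronavirus".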


A similar effort is underway at the World Health Organization (WHO). WHO updates the dataset every day by manually searching the tables of contents of relevant journals. Moreover, they track down other related scientific articles that enrich the dataset.

You can download the whole dataset or search it by author, keyword (title, author, journal), journal, or general topic here.

COVID-19 Tweet IDs

The COVID-19 tweet IDs dataset collects millions of tweets associated with the coronavirus outbreak and the COVID-19 disease [4]. The first tweet in this dataset dates back to January 22, 2020.

The authors used Twitter’s API to search for and follow relevant accounts and to gather tweets containing specific keywords in many languages. The language breakdown at the time of writing is given below.

| Language    | ISO | No. of tweets | % of total tweets |
|-------------|-----|---------------|-------------------|
| English     | en  | 44,482,496    | 69.92%            |
| Spanish     | es  | 6,087,308     | 9.57%             |
| Indonesian  | in  | 1,844,037     | 2.90%             |
| French      | fr  | 1,800,318     | 2.83%             |
| Thai        | th  | 1,687,309     | 2.65%             |
| Portuguese  | pt  | 1,278,662     | 2.01%             |
| Japanese    | ja  | 1,223,646     | 1.92%             |
| Italian     | it  | 1,113,001     | 1.75%             |
| (undefined) | und | 1,110,165     | 1.75%             |
| Turkish     | tr  | 570,744       | 0.90%             |
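
The figures in the table can be cross-checked with a little arithmetic. The sketch below uses only the counts and percentages listed above; the total corpus size is not stated in the table, so the value derived here is an estimate implied by the English row, not an official number.

```python
# Counts and shares copied from the table above: (code, tweets, percent).
rows = [
    ("en", 44_482_496, 69.92),
    ("es", 6_087_308, 9.57),
    ("in", 1_844_037, 2.90),
    ("fr", 1_800_318, 2.83),
    ("th", 1_687_309, 2.65),
    ("pt", 1_278_662, 2.01),
    ("ja", 1_223_646, 1.92),
    ("it", 1_113_001, 1.75),
    ("und", 1_110_165, 1.75),
    ("tr", 570_744, 0.90),
]

listed_tweets = sum(count for _, count, _ in rows)
listed_share = sum(share for _, _, share in rows)

# A row with count c and share s percent implies a total of c / (s / 100).
# The English row is used because its share has the smallest relative
# rounding error.
implied_total = rows[0][1] / (rows[0][2] / 100)

print(f"tweets in listed languages: {listed_tweets:,}")
print(f"share of listed languages: {listed_share:.2f}%")
print(f"implied corpus size: about {implied_total:,.0f} tweets")
```

The listed languages do not add up to 100%, so the remainder belongs to languages that the table leaves out.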

You can download the dataset and find more information, including how to hydrate it (i.e. get the complete details of a tweet), on the project’s GitHub repo here.
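
Hydration usually means sending the IDs to a lookup endpoint in batches. The sketch below covers only the local part of that job: reading IDs and grouping them. The batch size of 100 is an assumption based on the per-request limit of Twitter's tweet lookup endpoint, so check the current API documentation. The sample IDs are generated placeholders, and the network call is deliberately left out.

```python
from itertools import islice


def read_ids(lines):
    """Keep non-empty, purely numeric lines (assumes one tweet ID per line)."""
    return (line.strip() for line in lines if line.strip().isdigit())


def batches(ids, size=100):
    """Yield lists of at most `size` tweet IDs."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Placeholder input: 250 generated IDs plus two lines that should be ignored.
sample_lines = ["\n", "not-an-id\n"] + [f"{number}\n" for number in range(1000, 1250)]

for index, batch in enumerate(batches(read_ids(sample_lines))):
    # A real pipeline would pass `batch` to the lookup request here.
    print(f"batch {index}: {len(batch)} ids, first={batch[0]}")
```

Because both helpers work lazily, the same code can stream a file with millions of IDs without loading it into memory.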


The data community has responded to the coronavirus outbreak by generating datasets of various kinds that can accelerate the research for a new treatment, inform policymakers, or support forecasting models that better predict how the current disease behaves and trigger warnings for future events.

What remains to be seen is how data scientists will use these datasets and what tools they will produce. In any case, it seems that we have an extra weapon in our arsenal for fighting this virus.

My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium, LinkedIn or @james2pl on Twitter.


[1] Xu, B., Gutierrez, B., Mekaru, S. et al. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci Data 7, 106 (2020). https://doi.org/10.1038/s41597-020-0448-0

[2] Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real-time. The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(20)30120-1

[3] COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-20. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed YYYY-MM-DD. https://doi.org/10.5281/zenodo.3727291

[4] Chen, E., Lerman, K., & Ferrara, E. (2020). COVID-19: The First Public Coronavirus Twitter Dataset. arXiv preprint arXiv:2003.07372.
