OpenML

Covid19-global

Status: active. Format: ARFF. License: Creative Commons Attribution 4.0 International. Visibility: public. Uploaded 25-06-2024 by Bruno Belucci Teixeira.

Daily values of confirmed cases, deaths and recoveries for COVID-19 in several countries.

From the original source:
-----
This folder contains daily time series summary tables, including confirmed, deaths and recovered. All data is read in from the daily case report. The time series tables are subject to be updated if inaccuracies are identified in our historical data.

Two time series tables are for the US confirmed cases and deaths, reported at the county level. They are named time_series_covid19_confirmed_US.csv, time_series_covid19_deaths_US.csv, respectively.

Three time series tables are for the global confirmed cases, recovered cases and deaths. Australia, Canada and China are reported at the province/state level. Dependencies of the Netherlands, the UK, France and Denmark are listed under the province/state level. The US and other countries are at the country level. The tables are renamed time_series_covid19_confirmed_global.csv and time_series_covid19_deaths_global.csv, and time_series_covid19_recovered_global.csv, respectively.
-----

We have joined the confirmed, deaths and recovered datasets to create multivariate series. Note that we chose to use these three columns as the values to forecast. Because the series are aligned, we could instead have transformed the dataset into multiple columns (one per Province/State - Country pair).

There are 10 columns:

id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d".
time_step: The time step on the time series.
value_X (X from 0 to 2): The values of the time series, which will be used for the forecasting task.
covariate_X (X from 0 to 3): Covariate values of the time series, tied to the 'id_series'. These are not forecasting targets, but they can help with the forecasting task.

Preprocessing:

1 - For the 'confirmed' and 'deaths' datasets, grouped the values of all 'Province/State' entries for the 'Country/Region' 'Canada'. The 'recovered' dataset lists 'Canada' only at the country level, without the individual 'Province/State' entries, so this grouping was needed in order to merge all the datasets.
2 - Filled NaN values for 'Province/State' with the value 'Country'.
3 - Filled NaN values for 'Lat' and 'Long' with 0.0.
4 - Melted the datasets with identifiers 'Province/State', 'Country/Region', 'Lat', 'Long', obtaining columns 'date' and 'value_X', where X is 0 for confirmed cases, 1 for deaths and 2 for recoveries.
5 - Standardized the date to the format %Y-%m-%d and ensured that the frequency is daily.
6 - Merged all the datasets.
7 - Created column 'id_series' from 'Province/State', 'Country/Region' with index from 0 to 273.
8 - Renamed columns 'Province/State', 'Country/Region', 'Lat', 'Long' to 'covariate_0', 'covariate_1', 'covariate_2', 'covariate_3'.
9 - Created column 'time_step' with increasing values of the time step for each time series.
10 - Cast the 'value_X' columns to int, defined 'id_series', 'covariate_0' and 'covariate_1' as 'category', and cast 'covariate_2' and 'covariate_3' to float.
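
The preprocessing steps above can be sketched with pandas on toy data. This is only an illustration, not the uploader's actual script: the helper names `to_long` and `build_dataset` are hypothetical, and several details are assumptions the description does not settle (the merged Canada row keeps the 'Lat'/'Long' of the first Canadian row, tables are inner-joined on 'Province/State', 'Country/Region' and 'date', and 'Lat'/'Long' are taken from the 'confirmed' table). The daily-frequency check from step 5 is omitted.

```python
import pandas as pd

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]


def to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Steps 1-5 for one wide table (one column per date). Hypothetical helper."""
    wide = wide.copy()
    # Step 1: give all Canadian rows the same identifiers so they collapse below.
    # Assumption: the merged row keeps Lat/Long of the first Canadian row.
    canada = wide["Country/Region"] == "Canada"
    if canada.any():
        first = wide.loc[canada].iloc[0]
        wide.loc[canada, "Province/State"] = "Country"
        wide.loc[canada, "Lat"] = first["Lat"]
        wide.loc[canada, "Long"] = first["Long"]
    # Steps 2-3: fill missing identifiers.
    wide["Province/State"] = wide["Province/State"].fillna("Country")
    wide[["Lat", "Long"]] = wide[["Lat", "Long"]].fillna(0.0)
    # Sum rows that now share identifiers (only Canada in this sketch).
    wide = wide.groupby(ID_COLS, as_index=False, sort=False).sum()
    # Step 4: melt the date columns into rows.
    long = wide.melt(id_vars=ID_COLS, var_name="date", value_name=value_name)
    # Step 5: normalise dates such as "1/22/20" to %Y-%m-%d.
    parsed = pd.to_datetime(long["date"], format="%m/%d/%y")
    long["date"] = parsed.dt.strftime("%Y-%m-%d")
    return long


def build_dataset(confirmed, deaths, recovered) -> pd.DataFrame:
    """Steps 6-10: merge, index the series, rename and cast. Hypothetical helper."""
    keys = ["Province/State", "Country/Region", "date"]
    df = to_long(confirmed, "value_0")
    for i, table in enumerate([deaths, recovered], start=1):
        other = to_long(table, f"value_{i}").drop(columns=["Lat", "Long"])
        df = df.merge(other, on=keys, how="inner")  # step 6 (join type assumed)
    # Step 7: one integer id per (Province/State, Country/Region) pair.
    df["id_series"] = df.groupby(keys[:2], sort=False).ngroup()
    # Step 9: 0, 1, 2, ... within each series, in date order.
    df = df.sort_values(["id_series", "date"]).reset_index(drop=True)
    df["time_step"] = df.groupby("id_series").cumcount()
    # Step 8: rename the identifier columns to covariate_0..covariate_3.
    df = df.rename(columns={c: f"covariate_{i}" for i, c in enumerate(ID_COLS)})
    # Step 10: cast the column types.
    value_cols = ["value_0", "value_1", "value_2"]
    df[value_cols] = df[value_cols].astype(int)
    for col in ["id_series", "covariate_0", "covariate_1"]:
        df[col] = df[col].astype("category")
    float_cols = ["covariate_2", "covariate_3"]
    df[float_cols] = df[float_cols].astype(float)
    return df


def toy(rows, dates=("1/22/20", "1/23/20")) -> pd.DataFrame:
    """Build a small wide table with made-up numbers for illustration."""
    return pd.DataFrame(rows, columns=ID_COLS + list(dates))


confirmed = toy([["Alberta", "Canada", 53.9, -116.6, 1, 2],
                 ["Ontario", "Canada", 51.3, -85.3, 3, 4],
                 [None, "France", 46.2, 2.2, 5, 6]])
deaths = toy([["Alberta", "Canada", 53.9, -116.6, 0, 1],
              ["Ontario", "Canada", 51.3, -85.3, 1, 1],
              [None, "France", 46.2, 2.2, 0, 2]])
recovered = toy([[None, "Canada", 56.1, -106.3, 0, 1],
                 [None, "France", 46.2, 2.2, 1, 2]])

df = build_dataset(confirmed, deaths, recovered)
print(df)
```

The toy tables contain made-up numbers. With the real source files, the same functions would take the three global CSV tables read with `pd.read_csv`.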

10 features

Feature	Type	Unique values	Missing values
covariate_0	nominal	76	0
covariate_1	nominal	201	0
covariate_2	numeric	272	0
covariate_3	numeric	272	0
date	string	1143	0
value_0	numeric	116121	0
value_1	numeric	38827	0
value_2	numeric	44630	0
id_series	nominal	274	0
time_step	numeric	1143	0
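
Because the series are aligned, the long table can be reshaped into a dense array for forecasting models. The sketch below assumes a DataFrame with the columns listed above. The function name `to_tensor` and the toy values are illustrative only. If every one of the 274 series covers all 1143 time steps, the full dataset would give an array of shape (274, 1143, 3).

```python
import numpy as np
import pandas as pd

VALUE_COLS = ["value_0", "value_1", "value_2"]


def to_tensor(df: pd.DataFrame) -> np.ndarray:
    """Return an array of shape (n_series, n_steps, 3) from the long table."""
    ordered = df.sort_values(["id_series", "time_step"])
    n_series = ordered["id_series"].nunique()
    n_steps = ordered["time_step"].nunique()
    # Aligned series means every id_series has exactly n_steps rows.
    if len(ordered) != n_series * n_steps:
        raise ValueError("series are not aligned")
    values = ordered[VALUE_COLS].to_numpy()
    return values.reshape(n_series, n_steps, len(VALUE_COLS))


# Toy long table: 2 series, 3 time steps each, rows deliberately shuffled.
toy = pd.DataFrame({
    "id_series": [1, 0, 0, 1, 0, 1],
    "time_step": [0, 2, 0, 2, 1, 1],
    "value_0": [10, 3, 1, 30, 2, 20],
    "value_1": [0, 1, 0, 2, 0, 1],
    "value_2": [0, 2, 0, 9, 1, 5],
})

tensor = to_tensor(toy)
print(tensor.shape)
```

Indexing `tensor[i, t]` then gives the three values (confirmed, deaths, recoveries) of series `i` at time step `t`.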