Daily values of confirmed cases, deaths and recoveries for COVID-19 in several countries.
From original source:
-----
This folder contains daily time series summary tables, including confirmed, deaths and recovered. All data is read in from the daily case report. The time series tables are subject to be updated if inaccuracies are identified in our historical data.
Two time series tables are for the US confirmed cases and deaths, reported at the county level. They are named time_series_covid19_confirmed_US.csv, time_series_covid19_deaths_US.csv, respectively.
Three time series tables are for the global confirmed cases, recovered cases and deaths. Australia, Canada and China are reported at the province/state level. Dependencies of the Netherlands, the UK, France and Denmark are listed under the province/state level. The US and other countries are at the country level. The tables are renamed time_series_covid19_confirmed_global.csv and time_series_covid19_deaths_global.csv, and time_series_covid19_recovered_global.csv, respectively.
-----
We joined the confirmed, deaths and recovered datasets to create multivariate series. Note that we chose to use these three columns as the values to forecast,
but, because the series are aligned, we could instead have reshaped the dataset into one column per series (one for each Province/State - Country pair).
There are 10 columns:
id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d".
time_step: The position of the observation within its time series, increasing with the date.
value_X (X from 0 to 2): The values of the time series, which will be used for the forecasting task.
covariate_X (X from 0 to 3): Covariate values of the time series, tied to the 'id_series'. They are not forecasting targets, but can help with the forecasting task.
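The alternative layout mentioned above, one column per series, can be sketched with pandas. This is only an illustration on a made-up toy frame that reuses the column names listed above; the values and the variable names (long_df, wide_confirmed) are assumptions, not part of the dataset.

```python
import pandas as pd

# Toy long-format frame with the dataset's column names (values are made up).
long_df = pd.DataFrame({
    "id_series": [0, 0, 0, 1, 1, 1],
    "date": ["2020-01-22", "2020-01-23", "2020-01-24"] * 2,
    "time_step": [0, 1, 2, 0, 1, 2],
    "value_0": [1, 2, 4, 0, 3, 5],   # confirmed cases
    "value_1": [0, 0, 1, 0, 0, 1],   # deaths
    "value_2": [0, 1, 1, 0, 0, 2],   # recoveries
})

# Reshape confirmed cases to wide form: one row per date, one column per series.
# This works because every series shares the same dates (the series are aligned).
wide_confirmed = long_df.pivot(index="date", columns="id_series", values="value_0")
print(wide_confirmed)
```

The same call with values="value_1" or values="value_2" would give the wide tables for deaths and recoveries.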
Preprocessing:
1 - For the 'confirmed' and 'deaths' datasets, we grouped all the 'Province/State' rows of the 'Country/Region' 'Canada' into a single country-level series.
The 'recovered' dataset reports 'Canada' only at the country level, without 'Province/State' rows, so this grouping was needed in order to merge all the datasets.
2 - Filled NaN values for 'Province/State' with the value 'Country'.
3 - Filled NaN values for 'Lat' and 'Long' with 0.0.
4 - Melted the datasets with identifiers 'Province/State', 'Country/Region', 'Lat', 'Long', obtaining columns 'date' and 'value_X', where X is 0 for confirmed cases, 1 for deaths and 2 for recoveries.
5 - Standardized the date to the format %Y-%m-%d and ensured that the frequency is daily.
6 - Merged all the datasets.
7 - Created column 'id_series' from 'Province/State', 'Country/Region' with index from 0 to 273.
8 - Renamed columns 'Province/State', 'Country/Region', 'Lat', 'Long' to 'covariate_0', 'covariate_1', 'covariate_2', 'covariate_3'.
9 - Created column 'time_step' with increasing values of the time step within each time series.
10 - Cast 'value_X' columns to int, defined 'id_series', 'covariate_0' and 'covariate_1' as 'category', and cast 'covariate_2' and 'covariate_3' to float.
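The steps above can be sketched with pandas as follows. This is not the original preprocessing script, only a minimal reconstruction on made-up toy tables. Several details are assumptions that the description does not state: the Canadian provinces are summed, the grouped row keeps the 'Lat'/'Long' of its first province, the filler for 'Province/State' is the literal string 'Country', the raw date headers are in month/day/year form, the datasets are merged with an inner join on 'Province/State', 'Country/Region' and 'date', and 'Lat'/'Long' are taken from the confirmed table. The helper names (make_raw, collapse_canada, to_long) are hypothetical.

```python
import pandas as pd

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]
DATE_COLS = ["1/22/20", "1/23/20"]  # raw date headers, assumed month/day/year


def make_raw(rows):
    """Build a toy wide table shaped like the source CSVs (made-up values)."""
    return pd.DataFrame(rows, columns=ID_COLS + DATE_COLS)


confirmed = make_raw([["Alberta", "Canada", 53.9, -116.5, 1, 2],
                      ["Ontario", "Canada", 51.2, -85.3, 3, 4],
                      [None, "Spain", 40.4, -3.7, 5, 6]])
deaths = make_raw([["Alberta", "Canada", 53.9, -116.5, 0, 0],
                   ["Ontario", "Canada", 51.2, -85.3, 0, 1],
                   [None, "Spain", 40.4, -3.7, 0, 1]])
recovered = make_raw([[None, "Canada", 56.1, -106.3, 0, 1],
                      [None, "Spain", 40.4, -3.7, 0, 2]])


def collapse_canada(df):
    """Step 1: group Canadian provinces into one row (sum is an assumption)."""
    is_ca = df["Country/Region"] == "Canada"
    if not is_ca.any():
        return df.copy()
    canada = df[is_ca]
    total = canada.iloc[[0]].copy()   # keeps Lat/Long of the first province (assumption)
    total["Province/State"] = None
    for col in DATE_COLS:
        total[col] = canada[col].sum()
    return pd.concat([df[~is_ca], total], ignore_index=True)


def to_long(df, value_name):
    """Steps 1 to 5 for a single table."""
    df = collapse_canada(df)
    df["Province/State"] = df["Province/State"].fillna("Country")   # step 2 (literal string assumed)
    df[["Lat", "Long"]] = df[["Lat", "Long"]].fillna(0.0)           # step 3
    long = df.melt(id_vars=ID_COLS, var_name="date", value_name=value_name)  # step 4
    parsed = pd.to_datetime(long["date"], format="%m/%d/%y")        # step 5
    long["date"] = parsed.dt.strftime("%Y-%m-%d")
    return long


# Step 6: merge the three tables (inner join on these keys is an assumption).
keys = ["Province/State", "Country/Region", "date"]
tidy = to_long(confirmed, "value_0")
for raw, name in [(deaths, "value_1"), (recovered, "value_2")]:
    tidy = tidy.merge(to_long(raw, name)[keys + [name]], on=keys, how="inner")

# Step 7: one integer id per (Province/State, Country/Region) pair.
tidy["id_series"] = tidy.groupby(["Province/State", "Country/Region"], sort=False).ngroup()

# Step 8: rename the identifier columns to covariates.
tidy = tidy.rename(columns={"Province/State": "covariate_0", "Country/Region": "covariate_1",
                            "Lat": "covariate_2", "Long": "covariate_3"})

# Step 9: increasing time step within each series, ordered by date.
tidy = tidy.sort_values(["id_series", "date"]).reset_index(drop=True)
tidy["time_step"] = tidy.groupby("id_series").cumcount()

# Step 10: final dtypes.
value_cols = ["value_0", "value_1", "value_2"]
tidy[value_cols] = tidy[value_cols].astype(int)
for col in ["id_series", "covariate_0", "covariate_1"]:
    tidy[col] = tidy[col].astype("category")
tidy[["covariate_2", "covariate_3"]] = tidy[["covariate_2", "covariate_3"]].astype(float)

tidy = tidy[["id_series", "date", "time_step"] + value_cols
            + ["covariate_0", "covariate_1", "covariate_2", "covariate_3"]]
print(tidy)
```

On the toy input this yields two series (Spain and the grouped Canada) with two time steps each, in the 10-column layout described above.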