Outpatient Illness Surveillance weekly data.
From original source:
-----
Outpatient Illness Surveillance - Information on patient visits to health care providers for influenza-like illness is collected through the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). This collaborative effort between CDC, state and local health departments, and health care providers started during the 1997-98 influenza season when approximately 250 providers were enrolled. Enrollment in the system has increased over time and there were >3,000 providers enrolled during the 2010-11 season.
The number and percent of patients presenting with ILI each week will vary by region and season due to many factors, including having different provider type mixes (children present with higher rates of ILI than adults, and therefore regions with a higher percentage of pediatric practices will have higher numbers of cases). Therefore it is not appropriate to compare the magnitude of the percent of visits due to ILI between regions and seasons.
Baseline levels are calculated both nationally and for each region. Percentages at or above the baseline level are considered to be elevated.
For more information on ILI surveillance and baselines please visit: http://www.cdc.gov/flu/weekly/overview.htm#Outpatient
-----
This dataset is an extraction of the "National" data for seasons 1997-98 to 2023-24.
There are 12 columns:
id_series: The id of the time series.
date: The date of the observation, in the format "%Y-%m-%d".
time_step: The index of the observation within the time series.
value_X (X from 0 to 8): The values of the time series, which are used for the forecasting task.
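As a rough illustration of the column layout, the sketch below reads a file with these 12 columns and applies the dtypes listed in the final preprocessing step. It assumes the data is stored as CSV, which this description does not state. The helper name load_series and the two sample rows are hypothetical placeholders and are not real ILINet values.

```python
import io

import pandas as pd

VALUE_COLUMNS = [f"value_{i}" for i in range(9)]

# Hypothetical two-row sample in the 12-column layout (placeholder values only).
sample_csv = io.StringIO(
    "id_series,date,time_step," + ",".join(VALUE_COLUMNS) + "\n"
    "0,2003-01-04,0,1.5,1.4,10,20,30,5,65,100,4000\n"
    "0,2003-01-11,1,1.6,1.5,11,21,31,6,69,101,4100\n"
)


def load_series(source) -> pd.DataFrame:
    """Read the dataset (assumed to be CSV) and apply the documented dtypes."""
    df = pd.read_csv(source, dtype={"date": str})
    df["id_series"] = df["id_series"].astype("category")
    df["time_step"] = df["time_step"].astype(int)
    # value_0 and value_1 are floats; the remaining value columns are integers.
    for col in VALUE_COLUMNS[:2]:
        df[col] = df[col].astype(float)
    for col in VALUE_COLUMNS[2:]:
        df[col] = df[col].astype(int)
    return df


df = load_series(sample_csv)
print(df.dtypes)
```

In practice, a path to the actual file would be passed instead of sample_csv.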
Preprocessing:
1 - Dropped columns 'REGION' and 'REGION TYPE', as they have only the value 'X'.
2 - Dropped rows with 'YEAR' <= 2002 or 'YEAR' >= 2024.
Before the year 2002, there is a seasonal gap every year between weeks 21 and 39. This does not happen after 2002. In effect,
this drops 274 rows, or ~20% of the original amount. A model could arguably account for the gap on its own, but
we preferred to work with a clean dataset, as is already common for this dataset in other works. In addition, the data for 2024
is not yet complete.
3 - Replaced values 'X' by 0, and cast columns 'AGE 25-49', 'AGE 50-64', and 'AGE 25-64' to int.
4 - Summed columns 'AGE 25-49', 'AGE 50-64', and 'AGE 25-64' to replace the column 'AGE 25-64'.
5 - Dropped columns 'AGE 25-49' and 'AGE 50-64'.
The values 'X' in the 'AGE X' columns seem to be due to a change in how patient ages were recorded
before and after year-week 2009-40. With our preprocessing, summing all the 'AGE X' columns correctly gives 'ILITOTAL'.
6 - Created date column 'date' from columns 'YEAR' and 'WEEK', taking Saturday as the end of the week, in the format "%Y-%m-%d".
7 - Dropped columns 'YEAR' and 'WEEK'.
8 - Renamed columns [:-1] to 'value_X' with X from 0 to 8.
9 - Created 'id_series' with value 0. There is only one multivariate time series.
10 - Ensured that there are no missing dates and that the frequency of the time series is weekly.
There were only 3 missing rows, with dates '2008-01-05', '2013-01-05' and '2019-01-05'; they were filled with the last valid values.
11 - Created 'time_step' column from 'date' and 'id_series', with values increasing from 0 along the time series.
12 - Cast 'date' to str, 'time_step' to int, the 'value_X' columns with X in [0, 1] to float and the other 'value_X' columns to int, and defined 'id_series' as 'category'.
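The main steps above can be sketched in pandas as follows. This is a minimal illustration and not the code that was actually used. In particular, the description does not say how 'YEAR' and 'WEEK' were converted to a date, so the sketch assumes the CDC MMWR convention (weeks run Sunday to Saturday, and week 1 is the week containing 4 January). The function names mmwr_week_end and preprocess are hypothetical, the small input frame holds made-up values, and the renaming to 'value_X' is omitted because it depends on the column order of the raw file.

```python
import datetime as dt

import pandas as pd

AGE_SPLIT = ["AGE 25-49", "AGE 50-64"]


def mmwr_week_end(year, week) -> dt.date:
    """Saturday ending the given week, ASSUMING CDC MMWR week numbering."""
    jan4 = dt.date(int(year), 1, 4)  # 4 January always falls in MMWR week 1
    week1_sunday = jan4 - dt.timedelta(days=(jan4.weekday() + 1) % 7)
    return week1_sunday + dt.timedelta(weeks=int(week) - 1, days=6)


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    # Keep only the years 2003 to 2023.
    df = raw[(raw["YEAR"] > 2002) & (raw["YEAR"] < 2024)].copy()
    # Replace 'X' by 0, cast to int, and merge the split age bands.
    age_cols = AGE_SPLIT + ["AGE 25-64"]
    for col in age_cols:
        df[col] = df[col].replace("X", "0").astype(int)
    df["AGE 25-64"] = df[age_cols].sum(axis=1)
    df = df.drop(columns=AGE_SPLIT)
    # Build the week-ending date, then drop the columns it came from.
    df["date"] = [mmwr_week_end(y, w) for y, w in zip(df["YEAR"], df["WEEK"])]
    df = df.drop(columns=["YEAR", "WEEK"])
    # Reindex on a weekly Saturday grid and fill gaps with the last valid values.
    df = df.set_index(pd.to_datetime(df["date"])).drop(columns="date")
    weekly = pd.date_range(df.index.min(), df.index.max(), freq="W-SAT")
    df = df.reindex(weekly).ffill().astype(df.dtypes.to_dict())
    df.insert(0, "date", df.index.strftime("%Y-%m-%d"))
    df = df.reset_index(drop=True)
    # Single series: constant id, and a time step assumed to run from 0 to n - 1.
    df["id_series"] = 0
    df["id_series"] = df["id_series"].astype("category")
    df["time_step"] = range(len(df))
    return df


# Made-up rows: two fall outside the kept years and one week (2009-41) is missing.
raw = pd.DataFrame({
    "YEAR": [2002, 2009, 2009, 2009, 2024],
    "WEEK": [52, 39, 40, 42, 1],
    "AGE 25-49": ["X", "X", "7", "9", "1"],
    "AGE 50-64": ["X", "X", "3", "4", "1"],
    "AGE 25-64": ["5", "8", "X", "X", "X"],
})
clean = preprocess(raw)
print(clean)
```

Reindexing on an explicit weekly grid makes any missing week show up as an empty row before it is forward-filled, which is one way to check that the final series has a strictly weekly frequency.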