Mexico COVID-19 clinical data
This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127).
The official raw dataset is available on the Official Secretary of Epidemiology website: https://www.gob.mx/salud/documentos/datos-abiertos-152127.
You might also want to download the official column descriptors and variable definitions (e.g. SEXO=1 - Female; SEXO=2 - Male; SEXO=99 - Undisclosed) in the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've kept the original levels as described in the official dataset, unless otherwise specified.
IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but consisted of PDF files, often with many different inconsistencies that had to be resolved carefully; these are annotated in the .R script. Later datasets should be more reliable, but the earlier ones required a lot of figuring out (e.g. when the official methodology for assigning the region of a case changed to be based on residence rather than origin). I've added more notes on very early data here: https://github.com/marianarf/covid19_mexico_data.
[More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).
Motivation
I hope that this data serves as a basis for understanding the clinical symptoms that distinguish a COVID-19 positive case from other viral respiratory diseases, and helps expand the knowledge about COVID-19 worldwide.
With more models tested, added features, and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result, in two scenarios:
While lab results are being processed, there is a window during which it's uncertain whether a result will come back positive or negative (this is merely didactic, as new reports will confirm or refute the prediction as soon as the laboratory data for missing cases is reported).
More importantly, it could help predict the result for similar symptoms, e.g. from a survey or an app that collects similar data (ideally containing most of the parameters that can be assessed without variables only available after hospitalization; the person's age, for example, is readily available).
The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded as 1 = POSITIVE and 2 = NEGATIVE.
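A minimal sketch of how this encoding could be decoded in Python, using only the standard library. It assumes the rows have already been read as dictionaries of strings (as `csv.DictReader` produces). The helper name `split_by_result`, the added `RESULT_LABEL` field, and the treatment of any code other than 1 or 2 as a pending result are assumptions made for illustration; check the official data dictionary for the full list of codes.

```python
# Labels for the two RESULTADO codes described above.
RESULT_LABELS = {"1": "POSITIVE", "2": "NEGATIVE"}


def split_by_result(rows):
    """Split rows into (labeled, pending) lists based on RESULTADO.

    Assumption: any code other than 1 or 2 means the lab result
    is not yet available.
    """
    labeled, pending = [], []
    for row in rows:
        code = str(row.get("RESULTADO", "")).strip()
        if code in RESULT_LABELS:
            # Copy the row so the caller's data is left untouched.
            labeled.append({**row, "RESULT_LABEL": RESULT_LABELS[code]})
        else:
            pending.append(row)
    return labeled, pending


# Made-up sample rows, for demonstration only.
sample = [
    {"ID_REGISTRO": "a1", "RESULTADO": "1"},
    {"ID_REGISTRO": "b2", "RESULTADO": "2"},
    {"ID_REGISTRO": "c3", "RESULTADO": "3"},
]
labeled, pending = split_by_result(sample)
print([r["RESULT_LABEL"] for r in labeled])  # ['POSITIVE', 'NEGATIVE']
print(len(pending))  # 1
```

The labeled rows could serve as training data, while the pending rows are the ones a model would be asked to predict.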
Source
The data was gathered using a "sentinel model" that samples 10% of the patients who present with a viral respiratory diagnosis to test for COVID-19, and consists of data reported by 475 viral respiratory disease monitoring units (hospitals) named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral) throughout the country in the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).
Preprocess
Data is first processed with [this .R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data will be updated daily until. Important: Since the data is pushed to GitHub, assume the data uploaded here isn't the latest version; instead, load the data directly from the CSV file [in this GitHub repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).
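One way to load the latest file directly from GitHub is sketched below with the Python standard library. The function names `parse_csv` and `fetch_rows` are hypothetical, and the sketch assumes the file is UTF-8 encoded with a header row; neither is stated by the source. Only the offline demonstration runs as written, since `fetch_rows` needs network access and the full file may be large.

```python
import csv
import io
import urllib.request

# URL of the processed CSV, as given in the text above.
CSV_URL = (
    "https://raw.githubusercontent.com/marianarf/"
    "covid19_mexico_analysis/master/mexico_covid19.csv"
)


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(url=CSV_URL, encoding="utf-8"):
    """Download the CSV and parse it (requires network access).

    Assumption: the file is UTF-8 encoded and has a header row.
    """
    with urllib.request.urlopen(url) as response:
        return parse_csv(response.read().decode(encoding))


# Offline demonstration with a tiny, made-up extract.
demo = parse_csv("ID_REGISTRO,RESULTADO\na1,1\nb2,2\n")
print(demo[0]["RESULTADO"])  # 1
```

Calling `fetch_rows()` with no arguments would download the current version from the repository instead of relying on the copy uploaded here.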
The data aggregates official daily reports of patients admitted in COVID-19 designated units.
New cases are usually appended to the end of the file, but each individual case also contains a unique (official) identifier 'ID_REGISTRO' as well as a (new) unique reference 'id', which can be used to remove duplicates.
I fixed a specific change in reporting methodology, where the patient record used to be assigned by ENTIDAD_UM (the region of the medical unit) but is now assigned by ENTIDAD_RES (the region of residence of the patient).
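As an illustration of identifier-based de-duplication, the sketch below keeps one row per 'ID_REGISTRO'. The helper name `drop_duplicates` is hypothetical, and keeping the last occurrence is an assumption (it treats rows appended later as the more recent report); the actual preprocessing script may resolve duplicates differently.

```python
def drop_duplicates(rows, key="ID_REGISTRO"):
    """Return rows with one entry per key value, keeping the last seen.

    Assumption: when an identifier repeats, the later row is the more
    recent report and should win.
    """
    latest = {}
    for row in rows:
        latest[row[key]] = row  # later rows overwrite earlier ones
    return list(latest.values())


# Made-up sample rows, for demonstration only.
reports = [
    {"ID_REGISTRO": "a1", "RESULTADO": "3"},
    {"ID_REGISTRO": "b2", "RESULTADO": "2"},
    {"ID_REGISTRO": "a1", "RESULTADO": "1"},  # updated report for a1
]
unique = drop_duplicates(reports)
print(len(unique))  # 2
print(unique[0])  # {'ID_REGISTRO': 'a1', 'RESULTADO': '1'}
```

Because Python dictionaries preserve insertion order, the output keeps each case at the position where its identifier first appeared.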
Note: I have preserved the original structure (column names and factors) as closely as possible to the official data, so that code is reproducible in cross-reference to the official sources.
Added features
In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from the previous day, this makes it possible to keep track of the lag).
Additional info
According to the Ministry of Health, preliminary data is subject to validation by the General Directorate of Epidemiology. Also note that the information contained corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the Health Sector. Depending on the clinical diagnosis at admission, the case is considered an outpatient or a hospitalized patient. The database does not include the patient's evolution during the stay in the medical units, with the exception of discharge updates made by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.