OpenML

Mexico-COVID-19-clinical-data

Status: active | Format: ARFF | License: CC0: Public Domain | Visibility: public | Uploaded 23-03-2022 by Onur Yildirim

Mexico COVID-19 clinical data

This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127). The official, raw dataset is available on the Official Secretary of Epidemiology website: https://www.gob.mx/salud/documentos/datos-abiertos-152127. You might also want to download the official column descriptors and the variable definitions (e.g. SEXO=1 - Female; SEXO=2 - Male; SEXO=99 - Undisclosed) in the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've maintained the original levels as described in the official dataset, unless otherwise specified.

IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but consisted of PDF files, often with many different inconsistencies that had to be resolved carefully; these fixes are annotated in the .R script. Later datasets should be more reliable, but for the earlier ones there was a lot to figure out (e.g. when the official methodology for assigning the region of a case was changed to be based on residence rather than origin). I've added more notes on very early data here: https://github.com/marianarf/covid19_mexico_data. [More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).

Motivation

I hope that this data serves as a base for understanding the clinical symptoms that distinguish a COVID-19 positive case from another viral respiratory disease, and that it helps expand the knowledge about COVID-19 worldwide.
With more models tested, added features and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result, in two scenarios:

- As lab results are processed, there is a window during which it is uncertain whether a result will come back positive or negative (this is merely didactic, as new reports will confirm or refute the prediction as soon as the laboratory data for missing cases is reported).
- More importantly, it could help predict the result for similar symptoms reported elsewhere, e.g. from a survey or an app that collects similar data (ideally containing most of the parameters that can be assessed without relying on variables only available after hospitalization; the person's age, for example, is readily available).

The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded 1 = POSITIVE and 2 = NEGATIVE.

Source

The data was gathered using a "sentinel model" that samples 10% of the patients who present a viral respiratory diagnosis to test for COVID-19. It consists of data reported by 475 viral respiratory disease monitoring units (hospitals) named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral) throughout the country in the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).

Preprocess

Data is first processed with [this .R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data will be updated daily until.

Important: Since the data is updated on GitHub, assume the data uploaded here isn't the latest version; instead, load the data directly from the CSV [in this GitHub repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).

The data aggregates official daily reports of patients admitted in COVID-19 designated units.
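As a minimal sketch of loading the CSV from GitHub and decoding RESULTADO, the following uses only the Python standard library. The URL and the 1 = POSITIVE / 2 = NEGATIVE encoding come from the description above; the function names, the added `RESULTADO_LABEL` column, the UTF-8 encoding and the assumption that the codes are written as plain `1` and `2` in the file are illustrative guesses, not part of the dataset.

```python
import csv
import io
import urllib.request

# URL quoted in the description above.
CSV_URL = (
    "https://raw.githubusercontent.com/marianarf/"
    "covid19_mexico_analysis/master/mexico_covid19.csv"
)

# Encoding stated in the description: 1 = POSITIVE, 2 = NEGATIVE.
RESULTADO_LABELS = {"1": "POSITIVE", "2": "NEGATIVE"}


def read_cases(stream):
    """Parse CSV text and add a RESULTADO_LABEL column (hypothetical name)."""
    rows = []
    for row in csv.DictReader(stream):
        # Assumption: codes appear as "1"/"2"; anything else is marked OTHER.
        code = row["RESULTADO"].strip()
        row["RESULTADO_LABEL"] = RESULTADO_LABELS.get(code, "OTHER")
        rows.append(row)
    return rows


def load_latest(url=CSV_URL):
    """Download the current CSV from GitHub (needs network access).

    Assumption: the file is UTF-8 encoded.
    """
    with urllib.request.urlopen(url) as response:
        text = io.TextIOWrapper(response, encoding="utf-8", newline="")
        return read_cases(text)


# Offline demonstration on a made-up two-row sample.
sample = io.StringIO("id,RESULTADO,EDAD\n1,1,54\n2,2,31\n")
cases = read_cases(sample)
```

Calling `load_latest()` would fetch the full file; the in-memory sample only illustrates the row format and is not real data.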
New cases are usually appended to the end of the file, but each individual case also contains a unique (official) identifier 'ID_REGISTRO' as well as a (new) unique reference 'id' that can be used to remove duplicates. I accounted for a specific change in reporting methodology: the patient record used to be assigned by ENTIDAD_UM (the region of the medical unit) but now uses ENTIDAD_RES (the region of residence of the patient).

Note: I have preserved the original structure (column names and factors) as closely as possible to the official data, so that code is reproducible in cross-reference to the official sources.

Added features

In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from the previous day, this makes it possible to keep track of this lag).

Additional info

According to the Ministry of Health, preliminary data is subject to validation through the General Directorate of Epidemiology. Also note that the information contained corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the Health Sector. Depending on the clinical diagnosis at admission, the case is considered an outpatient or a hospitalized patient. The database does not include the evolution during the stay in the medical units, with the exception of updates of discharge by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.
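The description mentions that the 'id' field exists to remove duplicates and that new cases are usually added at the end of the file. A minimal sketch of such a deduplication step is shown below. The helper name is hypothetical, and keeping the last occurrence of each 'id' is an assumption based on newer rows coming later in the file, not something the source states.

```python
def drop_duplicate_cases(rows, key="id"):
    """Keep one row per key value.

    Assumption: when a key repeats, the later row is the more recent
    report, so it replaces the earlier one.
    """
    latest = {}
    for row in rows:
        # A repeated key keeps its original position but takes the new row.
        latest[row[key]] = row
    return list(latest.values())


# Made-up rows: case "a" is reported twice with different results.
reports = [
    {"id": "a", "RESULTADO": "2"},
    {"id": "b", "RESULTADO": "2"},
    {"id": "a", "RESULTADO": "1"},
]
unique_cases = drop_duplicate_cases(reports)
```

If the earliest report should win instead, the assignment could be replaced with `latest.setdefault(row[key], row)`.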

41 features

id	numeric	263007 unique values 0 missing
FECHA_ARCHIVO	string	53 unique values 0 missing
ID_REGISTRO	string	263007 unique values 0 missing
ENTIDAD_UM	numeric	32 unique values 0 missing
ENTIDAD_RES	numeric	32 unique values 0 missing
RESULTADO	numeric	2 unique values 0 missing
DELAY	numeric	1 unique values 0 missing
ENTIDAD_REGISTRO	numeric	32 unique values 0 missing
ENTIDAD	string	32 unique values 0 missing
ABR_ENT	string	32 unique values 0 missing
FECHA_ACTUALIZACION	string	46 unique values 0 missing
ORIGEN	numeric	2 unique values 0 missing
SECTOR	numeric	14 unique values 0 missing
SEXO	numeric	2 unique values 0 missing
ENTIDAD_NAC	numeric	33 unique values 0 missing
MUNICIPIO_RES	numeric	359 unique values 6 missing
TIPO_PACIENTE	numeric	2 unique values 0 missing
FECHA_INGRESO	string	155 unique values 0 missing
FECHA_SINTOMAS	string	154 unique values 0 missing
FECHA_DEF	string	88 unique values 0 missing
INTUBADO	numeric	4 unique values 0 missing
NEUMONIA	numeric	3 unique values 0 missing
EDAD	numeric	117 unique values 0 missing
NACIONALIDAD	numeric	2 unique values 0 missing
EMBARAZO	numeric	4 unique values 0 missing
HABLA_LENGUA_INDIG	numeric	3 unique values 0 missing
DIABETES	numeric	3 unique values 0 missing
EPOC	numeric	3 unique values 0 missing
ASMA	numeric	3 unique values 0 missing
INMUSUPR	numeric	3 unique values 0 missing
HIPERTENSION	numeric	3 unique values 0 missing
OTRA_COM	numeric	3 unique values 0 missing
CARDIOVASCULAR	numeric	3 unique values 0 missing
OBESIDAD	numeric	3 unique values 0 missing
RENAL_CRONICA	numeric	3 unique values 0 missing
TABAQUISMO	numeric	3 unique values 0 missing
OTRO_CASO	numeric	3 unique values 0 missing
MIGRANTE	numeric	3 unique values 0 missing
PAIS_NACIONALIDAD	string	78 unique values 0 missing
PAIS_ORIGEN	string	44 unique values 0 missing
UCI	numeric	4 unique values 0 missing
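
For the prediction task outlined under Motivation, the features listed above could be split into model inputs and a binary target. The sketch below is only an illustration: the column names come from the feature list, and the RESULTADO encoding comes from the description, but the choice of which columns count as available before hospitalization is an assumption, as are the helper and constant names. Category codes are passed through unchanged, so the official data dictionary should be consulted before treating them as numeric quantities.

```python
# Assumption: demographic and comorbidity columns that a survey or app
# could plausibly collect before any hospitalization takes place.
PRE_ADMISSION_FEATURES = [
    "EDAD", "SEXO", "EMBARAZO", "DIABETES", "EPOC", "ASMA", "INMUSUPR",
    "HIPERTENSION", "OTRA_COM", "CARDIOVASCULAR", "OBESIDAD",
    "RENAL_CRONICA", "TABAQUISMO", "OTRO_CASO",
]


def to_features_and_target(rows):
    """Build (X, y) lists from dict rows keyed by column name.

    y is 1 for RESULTADO == 1 (POSITIVE) and 0 otherwise.
    """
    X, y = [], []
    for row in rows:
        X.append([float(row[name]) for name in PRE_ADMISSION_FEATURES])
        y.append(1 if int(float(row["RESULTADO"])) == 1 else 0)
    return X, y
```

The resulting lists can be handed to any classifier that accepts numeric arrays; no particular model is implied by the source.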