Mexico-COVID-19-clinical-data

Status: active | Format: ARFF | License: CC0: Public Domain | Visibility: public | Uploaded 23-03-2022 by Onur Yildirim
  • Computer Systems Machine Learning


Mexico COVID-19 clinical data

This dataset contains the results of real-time PCR testing for COVID-19 in Mexico as reported by the [General Directorate of Epidemiology](https://www.gob.mx/salud/documentos/datos-abiertos-152127).

The official raw dataset is available from the official source: https://www.gob.mx/salud/documentos/datos-abiertos-152127. You might also want to download the official column descriptors and variable definitions (e.g. SEXO=1: Female; SEXO=2: Male; SEXO=99: Undisclosed) in the following [zip file](http://datosabiertos.salud.gob.mx/gobmx/salud/datos_abiertos/diccionario_datos_covid19.zip). I've maintained the original levels as described in the official dataset, unless otherwise specified.

IMPORTANT: This dataset has been maintained since the original data releases, which weren't tabular but consisted of PDF files, often with many different inconsistencies that had to be resolved carefully; the fixes are annotated in the .R script. Later datasets should be more reliable, but the earlier ones required a lot of figuring out (e.g. when the official methodology for assigning the region of a case was changed to be based on residence rather than origin). I've added more notes on very early data here: https://github.com/marianarf/covid19_mexico_data. [More official information here](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165).

Motivation

I hope that this data serves as a basis for understanding the clinical symptoms that distinguish a COVID-19 positive case from another viral respiratory disease, and that it helps expand the knowledge about COVID-19 worldwide.
With more models tested, added features, and fine-tuning, clinical data could be used to predict whether a patient with pending COVID-19 results will get a positive or a negative result, in two scenarios:

  • While lab results are being processed, there is a window during which it is uncertain whether a result will come back positive or negative (this is merely didactic, as new reports will confirm or refute the prediction as soon as the laboratory data for the pending cases is reported).
  • More importantly, it could help predict the result for similar symptoms reported elsewhere, e.g. from a survey or an app that collects similar data (ideally relying on parameters that can be assessed without hospitalization, such as the person's age, which is readily available).

The value of the lab result comes from an RT-PCR test and is stored in RESULTADO, where the original data is encoded as 1 = POSITIVE and 2 = NEGATIVE.

Source

The data was gathered using a "sentinel model" that samples 10% of the patients who present a viral respiratory diagnosis to test for COVID-19. It consists of data reported by 475 viral respiratory disease monitoring units (hospitals), named USMER (Unidades Monitoras de Enfermedad Respiratoria Viral), throughout the country and across the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and others).

Preprocess

Data is first processed with [this .R script](https://github.com/marianarf/covid19_mexico_analysis/blob/master/notebooks/preprocess.R). The file containing the processed data will be updated daily until.

Important: Since the data is updated on GitHub, assume the data uploaded here isn't the latest version; instead, load the data directly from the CSV [in this GitHub repository](https://raw.githubusercontent.com/marianarf/covid19_mexico_analysis/master/mexico_covid19.csv).

The data aggregates official daily reports of patients admitted in COVID-19 designated units.
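As an illustration of loading the CSV directly from GitHub and decoding RESULTADO, here is a minimal Python sketch using only the standard library. The function names (`decode_resultado`, `read_cases`, `fetch_cases`) and the added `RESULTADO_LABEL` key are hypothetical, and the sketch assumes the file is comma-delimited UTF-8 with a header row. It is not the author's preprocessing, which lives in the .R script.

```python
import csv
import io
import urllib.request

# URL quoted in the description above.
CSV_URL = (
    "https://raw.githubusercontent.com/marianarf/"
    "covid19_mexico_analysis/master/mexico_covid19.csv"
)

# Encoding stated in the description: 1 = POSITIVE, 2 = NEGATIVE.
RESULTADO_LABELS = {1: "POSITIVE", 2: "NEGATIVE"}


def decode_resultado(value):
    """Map a raw RESULTADO cell to a label; anything else becomes UNKNOWN."""
    try:
        code = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return "UNKNOWN"
    return RESULTADO_LABELS.get(code, "UNKNOWN")


def read_cases(text_stream):
    """Parse CSV text into a list of dicts and add a decoded result label."""
    rows = []
    for row in csv.DictReader(text_stream):
        # RESULTADO_LABEL is a hypothetical helper key, not an official column.
        row["RESULTADO_LABEL"] = decode_resultado(row.get("RESULTADO"))
        rows.append(row)
    return rows


def fetch_cases(url=CSV_URL):
    """Download and parse the CSV (needs network access; assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        text = io.TextIOWrapper(response, encoding="utf-8", newline="")
        return read_cases(text)


if __name__ == "__main__":
    # Offline demonstration with made-up rows instead of a download.
    sample = io.StringIO("id,ID_REGISTRO,RESULTADO\n1,abc,1\n2,def,2\n")
    for case in read_cases(sample):
        print(case["id"], case["RESULTADO_LABEL"])
```

Calling `fetch_cases()` would pull the current file from the repository; `read_cases` is kept separate so the parsing can be tried on a local copy or an in-memory string.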
New cases are usually appended at the end of the file, but each individual case also carries a unique (official) identifier 'ID_REGISTRO' as well as a (new) unique reference 'id' that can be used to remove duplicates. I corrected for a specific change in reporting methodology: the patient record used to be assigned by ENTIDAD_UM (the region of the medical unit) but is now assigned by ENTIDAD_RES (the region of residence of the patient).

Note: I have preserved the original structure (column names and factors) as closely as possible to the official data, so that code is reproducible when cross-referenced with the official sources.

Added features

In addition to the original features reported, I've included missing regional names and a field 'DELAY', which corresponds to the lag in processing lab results (since new data contains records from the previous day, this makes it possible to keep track of the lag).

Additional info

According to the Ministry of Health, preliminary data is subject to validation through the General Directorate of Epidemiology. Also note that the information corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the Health Sector. Depending on the clinical diagnosis at admission, the case is considered an outpatient or a hospitalized patient. The database does not include the patient's evolution during the stay in the medical units, with the exception of discharge updates by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.
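The de-duplication by 'id' and the residence-based region assignment described under Preprocess could be sketched as follows. This is only an illustration on rows held as dictionaries: the function names are hypothetical, and keeping the most recent copy of a duplicated 'id' is an assumption suggested by the note that new cases are added at the end of the file.

```python
from collections import Counter


def drop_duplicate_cases(rows, key="id"):
    """Keep one row per key value.

    Assumption: when an id repeats, the later row is the fresher one,
    so it replaces the earlier row. Output keeps first-seen order.
    """
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def cases_by_residence(rows):
    """Count cases per ENTIDAD_RES, the region of residence of the patient."""
    return Counter(row["ENTIDAD_RES"] for row in rows)


if __name__ == "__main__":
    # Made-up rows for demonstration only.
    rows = [
        {"id": "1", "ENTIDAD_UM": "9", "ENTIDAD_RES": "15", "RESULTADO": "2"},
        {"id": "2", "ENTIDAD_UM": "9", "ENTIDAD_RES": "9", "RESULTADO": "1"},
        {"id": "1", "ENTIDAD_UM": "9", "ENTIDAD_RES": "15", "RESULTADO": "1"},
    ]
    unique_rows = drop_duplicate_cases(rows)
    print(len(unique_rows))
    print(cases_by_residence(unique_rows))
```

Grouping by ENTIDAD_RES rather than ENTIDAD_UM follows the current reporting methodology described above; swapping the column name in `cases_by_residence` would reproduce the older, unit-based assignment.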

41 features

Feature              Type     Unique values  Missing
id                   numeric  263007         0
FECHA_ARCHIVO        string   53             0
ID_REGISTRO          string   263007         0
ENTIDAD_UM           numeric  32             0
ENTIDAD_RES          numeric  32             0
RESULTADO            numeric  2              0
DELAY                numeric  1              0
ENTIDAD_REGISTRO     numeric  32             0
ENTIDAD              string   32             0
ABR_ENT              string   32             0
FECHA_ACTUALIZACION  string   46             0
ORIGEN               numeric  2              0
SECTOR               numeric  14             0
SEXO                 numeric  2              0
ENTIDAD_NAC          numeric  33             0
MUNICIPIO_RES        numeric  359            6
TIPO_PACIENTE        numeric  2              0
FECHA_INGRESO        string   155            0
FECHA_SINTOMAS       string   154            0
FECHA_DEF            string   88             0
INTUBADO             numeric  4              0
NEUMONIA             numeric  3              0
EDAD                 numeric  117            0
NACIONALIDAD         numeric  2              0
EMBARAZO             numeric  4              0
HABLA_LENGUA_INDIG   numeric  3              0
DIABETES             numeric  3              0
EPOC                 numeric  3              0
ASMA                 numeric  3              0
INMUSUPR             numeric  3              0
HIPERTENSION         numeric  3              0
OTRA_COM             numeric  3              0
CARDIOVASCULAR       numeric  3              0
OBESIDAD             numeric  3              0
RENAL_CRONICA        numeric  3              0
TABAQUISMO           numeric  3              0
OTRO_CASO            numeric  3              0
MIGRANTE             numeric  3              0
PAIS_NACIONALIDAD    string   78             0
PAIS_ORIGEN          string   44             0
UCI                  numeric  4              0
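The per-feature counts listed above (unique values and missing values) can be recomputed from the raw rows. The following Python sketch shows one way to do it; the function name `summarize_columns` and the set of tokens treated as missing are assumptions ("?" is the ARFF convention, while the empty string and "NA" are guesses for a CSV export).

```python
# Assumption: these cell values mark a missing entry.
MISSING_TOKENS = {"", "?", "NA"}


def summarize_columns(rows):
    """Return {column: (unique_count, missing_count)} for a list of dict rows."""
    unique = {}
    missing = {}
    for row in rows:
        for column, value in row.items():
            values = unique.setdefault(column, set())
            missing.setdefault(column, 0)
            text = "" if value is None else str(value).strip()
            if text in MISSING_TOKENS:
                missing[column] += 1
            else:
                values.add(text)
    return {column: (len(unique[column]), missing[column]) for column in unique}


if __name__ == "__main__":
    # Made-up rows for demonstration only.
    rows = [
        {"SEXO": "1", "MUNICIPIO_RES": "7"},
        {"SEXO": "2", "MUNICIPIO_RES": "?"},
        {"SEXO": "1", "MUNICIPIO_RES": "7"},
    ]
    for column, (n_unique, n_missing) in summarize_columns(rows).items():
        print(column, n_unique, "unique values,", n_missing, "missing")
```

Note that many columns typed as numeric here hold a handful of coded levels (for example SEXO), so a low unique count is a hint to consult the official data dictionary before treating a column as a continuous quantity.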

19 properties

Number of instances (rows) of the dataset: 263007
Number of attributes (columns) of the dataset: 41
Number of distinct values of the target attribute (if it is nominal): not reported
Number of missing values in the dataset: 6
Number of instances with at least one value missing: 6
Number of numeric attributes: 31
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 75.61
Percentage of instances belonging to the most frequent class: not reported
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: not reported
Percentage of instances belonging to the least frequent class: not reported
Number of instances belonging to the least frequent class: not reported
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Average class difference between consecutive instances: not reported
Percentage of missing values: 0
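The derived properties follow from the raw counts by simple arithmetic. The Python sketch below recomputes them; rounding to two decimals is an assumption about how the page displays values, chosen because it reproduces the 75.61 shown for numeric attributes and explains why the tiny missing-value ratios appear as 0.

```python
# Counts taken from the properties listed above.
n_instances = 263007
n_attributes = 41
n_numeric = 31
n_missing_values = 6
n_rows_with_missing = 6


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pct_numeric = percent(n_numeric, n_attributes)
pct_missing_values = percent(n_missing_values, n_instances * n_attributes)
pct_rows_with_missing = percent(n_rows_with_missing, n_instances)
attributes_per_instance = n_attributes / n_instances

# Assumption: the page rounds to two decimals.
print(round(pct_numeric, 2))              # 75.61
print(round(pct_missing_values, 2))       # 0.0
print(round(pct_rows_with_missing, 2))    # 0.0
print(round(attributes_per_instance, 2))  # 0.0
```

The total number of cells is 263007 × 41 = 10783287, so 6 missing values amount to roughly 0.00006% of the data.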

0 tasks
