OpenML
Pima-Indians-Diabetes-Dataset

Status: active. Format: ARFF. License: CC0: Public Domain. Visibility: public. Uploaded 23-03-2022 by Onur Yildirim.
  • Computer Systems Machine Learning
Context

The unprocessed dataset was obtained from the UCI Machine Learning Repository and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. This version was preprocessed by me. The objective is to predict accurately, from the features included in the dataset, whether or not a patient has diabetes. I achieved an accuracy score of 92.86 with a Random Forest Classifier on this dataset, and I developed a web service, the Diabetes Prediction System, using that trained model. You can explore the Exploratory Data Analysis notebook to better understand the data.

Attributes Normal Value Range

Glucose: <140 = Normal, 140-200 = Pre-Diabetic, >200 = Diabetic
BloodPressure: <60 = Below Normal, 60-80 = Normal, 80-90 = Stage 1 Hypertension, 90-120 = Stage 2 Hypertension, >120 = Hypertensive Crisis
SkinThickness: <10 = Below Normal, 10-30 = Normal, >30 = Above Normal
Insulin: <200 = Normal, >200 = Above Normal
BMI: <18.5 = Underweight, 18.5-25 = Normal, 25-30 = Overweight, >30 = Obese

Acknowledgements

J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler and R. S. Johannes, "Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus", in Proc. of the Symposium on Computer Applications and Medical Care, pp. 261-265. IEEE Computer Society Press, 1988.

Inspiration

Of the multiple models trained on the original dataset, only the Random Forest Classifier reached an accuracy score of 78.57. With this new preprocessed dataset, an accuracy score of 92.86 was achieved. Can you build a machine learning model that accurately predicts whether or not a patient has diabetes? And can you achieve an accuracy score even higher than 92.86 without overfitting the model?
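The normal value ranges listed in the description can be read as a set of cut points per attribute. The following is a minimal sketch of that reading, not the uploader's preprocessing code. The names `RANGES` and `categorize` are hypothetical, and the handling of values that fall exactly on a cut point is an assumption, since the description does not specify it.

```python
from bisect import bisect_right

# Cut points and labels transcribed from the description's value ranges.
# Assumption: a value exactly on a cut point falls into the higher category
# (e.g. Glucose == 140 is labelled "Pre-Diabetic"); the source does not say.
RANGES = {
    "Glucose": ([140, 200], ["Normal", "Pre-Diabetic", "Diabetic"]),
    "BloodPressure": (
        [60, 80, 90, 120],
        ["Below Normal", "Normal", "Stage 1 Hypertension",
         "Stage 2 Hypertension", "Hypertensive Crisis"],
    ),
    "SkinThickness": ([10, 30], ["Below Normal", "Normal", "Above Normal"]),
    "Insulin": ([200], ["Normal", "Above Normal"]),
    "BMI": ([18.5, 25, 30], ["Underweight", "Normal", "Overweight", "Obese"]),
}


def categorize(attribute, value):
    """Return the range label for `value` of the given attribute."""
    cuts, labels = RANGES[attribute]
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the matching label.
    return labels[bisect_right(cuts, value)]


print(categorize("Glucose", 120))
print(categorize("BMI", 27.3))
```

Keeping the cut points in one table makes it easy to change the boundary convention later, for example by swapping `bisect_right` for `bisect_left`.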

9 features

Feature                     Type      Unique values   Missing
Outcome (target)            numeric   2               0
Pregnancies                 numeric   17              0
Glucose                     numeric   135             0
BloodPressure               numeric   47              0
SkinThickness               numeric   50              0
Insulin                     numeric   187             0
BMI                         numeric   247             0
DiabetesPedigreeFunction    numeric   517             0
Age                         numeric   52              0
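As a rough illustration of how the nine columns above might be consumed, here is a sketch that parses rows and checks them against the listed schema. It assumes the ARFF file has been exported to CSV with a header row, which this page does not state; `parse_rows` and `SAMPLE` are hypothetical names, and the sample row is made up rather than taken from the dataset.

```python
import csv
import io

# Column names as listed on this page; "Outcome" is the target.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"


def parse_rows(text):
    """Parse CSV text into (feature_values, label) pairs, validating each row."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        missing = [c for c in FEATURES + [TARGET] if c not in record]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        label = int(float(record[TARGET]))
        if label not in (0, 1):  # the page lists 2 unique target values
            raise ValueError(f"unexpected Outcome value: {label}")
        rows.append(([float(record[c]) for c in FEATURES], label))
    return rows


# Made-up row for illustration only; not taken from the dataset.
SAMPLE = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,25,100,28.5,0.45,33,0\n"
)
parsed = parse_rows(SAMPLE)
print(parsed)
```

Looking columns up by name rather than by position means the sketch does not depend on where `Outcome` sits in the file.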

19 properties

Number of instances (rows) of the dataset: 768
Number of attributes (columns) of the dataset: 9
Number of distinct values of the target attribute (if it is nominal): 0
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 9
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0.01
Percentage of numeric attributes: 100
Percentage of instances belonging to the most frequent class: (not given)
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: (not given)
Percentage of instances belonging to the least frequent class: (not given)
Number of instances belonging to the least frequent class: (not given)
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Average class difference between consecutive instances: 0.55
Percentage of missing values: 0
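Several of these properties are simple ratios, so they can be checked by hand. The sketch below recomputes the attributes-per-instance ratio from the counts on this page and shows how the class-frequency properties, which carry no values here, could be derived from a target column. The function names and the toy label list are hypothetical; the toy labels are not the dataset's actual class distribution.

```python
from collections import Counter


def dimensionality(n_attributes, n_instances):
    """Number of attributes divided by the number of instances."""
    return n_attributes / n_instances


def class_balance(labels):
    """Size and percentage of the most and least frequent classes."""
    ordered = Counter(labels).most_common()  # sorted by descending count
    majority, minority = ordered[0][1], ordered[-1][1]
    total = len(labels)
    return {
        "majority_size": majority,
        "majority_pct": 100 * majority / total,
        "minority_size": minority,
        "minority_pct": 100 * minority / total,
    }


# Counts taken from this page: 9 attributes, 768 instances.
ratio = dimensionality(9, 768)
print(round(ratio, 2))  # the page reports this property rounded to 0.01

# Toy labels for illustration only, not the dataset's class distribution.
balance = class_balance([0, 0, 0, 1, 1])
print(balance)
```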

0 tasks
