{ "data_id": "45557", "name": "Mammographic-Mass-Data-Set", "exact_name": "Mammographic-Mass-Data-Set", "version": 2, "version_label": null, "description": "Mammography is the most effective method for breast cancer screening\navailable today. However, the low positive predictive value of breast\nbiopsy resulting from mammogram interpretation leads to approximately\n70% unnecessary biopsies with benign outcomes. To reduce the high\nnumber of unnecessary breast biopsies, several computer-aided diagnosis\n(CAD) systems have been proposed in recent years. These systems\nhelp physicians in their decision to perform a breast biopsy on a suspicious\nlesion seen in a mammogram or to perform a short term follow-up\nexamination instead.\n\nThis data set can be used to predict the severity (benign or malignant)\nof a mammographic mass lesion from BI-RADS attributes and the patient's age.\nIt contains a BI-RADS assessment, the patient's age and three BI-RADS attributes\ntogether with the ground truth (the severity field) for 516 benign and\n445 malignant masses that have been identified on full field digital mammograms\ncollected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006.\n\nEach instance has an associated BI-RADS assessment ranging from 1 (definitely benign)\nto 5 (highly suggestive of malignancy) assigned in a double-review process by\nphysicians. Assuming that all cases with BI-RADS assessments greater than or equal to\na given value (varying from 1 to 5) are malignant and the other cases benign,\nsensitivities and associated specificities can be calculated. These can be an\nindication of how well a CAD system performs compared to the radiologists.\n\nClass Distribution: benign: 516; malignant: 445\n\n## Attributes\n\n6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)\n\n1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!) \n2. Age: patient's age in years (integer)\n3. 
Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)\n4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)\n5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)\n6. Severity: benign=0 or malignant=1 (binominal, goal field!)\n\n\nMissing Attribute Values:\n - BI-RADS assessment: 2\n - Age: 5\n - Shape: 31\n - Margin: 48\n - Density: 76\n - Severity: 0\n\n## Notes\n\nCompared to v1 this dataset has the following differences:\n* It contains missing values. It appears that v1 has dropped all entries with missing values.\n* Variable types are coded more correctly. BI-RADS assessment and Density should be ordinal, but were coded as float because ordinal is not available on OpenML. They were not coded as int because liac-arff cannot serialize pd.NA yet.\n* The variable `BI-RADS assessment` is named `BI-RADS` because OpenML does not allow whitespace in attribute names.", "format": "arff", "uploader": "Matthias Feurer", "uploader_id": 86, "visibility": "public", "creator": null, "contributor": null, "date": "2023-06-05 09:08:26", "update_comment": null, "last_update": "2023-06-05 09:08:26", "licence": "Unknown", "status": "active", "error_message": null, "url": "https:\/\/api.openml.org\/data\/download\/22116525\/dataset", "default_target_attribute": "Severity", "row_id_attribute": null, "ignore_attribute": "\"BI-RADS\"", "runs": 0, "suggest": { "input": [ "Mammographic-Mass-Data-Set", "Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% unnecessary biopsies with benign outcomes. 
To reduce the high number of unnecessary breast biopsies, several computer-aided diagnosis (CAD) systems have been proposed in the last years.These systems help physicians in their decision to perform a breast biopsy on a suspicious lesion seen " ], "weight": 5 }, "qualities": { "NumberOfInstances": 961, "NumberOfFeatures": 5, "NumberOfClasses": 2, "NumberOfMissingValues": 160, "NumberOfInstancesWithMissingValues": 130, "NumberOfNumericFeatures": 2, "NumberOfSymbolicFeatures": 3, "PercentageOfBinaryFeatures": 20, "PercentageOfInstancesWithMissingValues": 13.527575442247658, "AutoCorrelation": 0.48854166666666665, "PercentageOfMissingValues": 3.329864724245578, "Dimensionality": 0.005202913631633715, "PercentageOfNumericFeatures": 40, "MajorityClassPercentage": 53.69406867845994, "PercentageOfSymbolicFeatures": 60, "MajorityClassSize": 516, "MinorityClassPercentage": 46.30593132154006, "MinorityClassSize": 445, "NumberOfBinaryFeatures": 1 }, "tags": [ { "uploader": "38960", "tag": "Chemistry" }, { "uploader": "38960", "tag": "Life Science" } ], "features": [ { "name": "Severity", "index": "5", "type": "nominal", "distinct": "2", "missing": "0", "target": "1", "distr": [ [ "0", "1" ], [ [ "516", "0" ], [ "0", "445" ] ] ] }, { "name": "BI-RADS", "index": "0", "type": "numeric", "distinct": "7", "missing": "2", "ignore": "1", "min": "0", "max": "55", "mean": "4", "stdev": "2" }, { "name": "Age", "index": "1", "type": "numeric", "distinct": "73", "missing": "5", "min": "18", "max": "96", "mean": "55", "stdev": "14" }, { "name": "Shape", "index": "2", "type": "nominal", "distinct": "4", "missing": "31", "distr": [ [ "1", "2", "3", "4" ], [ [ "186", "38" ], [ "176", "35" ], [ "50", "45" ], [ "85", "315" ] ] ] }, { "name": "Margin", "index": "3", "type": "nominal", "distinct": "5", "missing": "48", "distr": [ [ "1", "2", "3", "4", "5" ], [ [ "316", "41" ], [ "9", "15" ], [ "43", "73" ], [ "89", "191" ], [ "22", "114" ] ] ] }, { "name": "Density", "index": "4", "type": 
"numeric", "distinct": "4", "missing": "76", "min": "1", "max": "4", "mean": "3", "stdev": "0" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }