OpenML-CC18 Curated Classification benchmark

Created 21-02-2019 by Jan van Rijn Visibility: public
We advocate the use of curated, comprehensive benchmark suites of machine learning datasets, backed by standardized OpenML-based interfaces and complementary software toolkits written in Python, Java and R. We demonstrate how to easily execute comprehensive benchmarking studies using these suites and toolkits. Major distinguishing features of OpenML benchmark suites are (i) ease of use through standardized data formats, APIs, and existing client libraries; (ii) machine-readable meta-information regarding the contents of the suite; and (iii) online sharing of results, enabling large-scale comparisons. As a first such suite, we propose the OpenML-CC18, a machine learning benchmark suite of 72 classification datasets carefully curated from the thousands of datasets on OpenML.

The inclusion criteria are:

* classification tasks on dense data sets with independent observations
* number of classes >= 2, with at least 20 observations in each class and a minority-to-majority class ratio exceeding 5%
* 500 <= number of observations <= 100000
* number of features after one-hot-encoding < 5000
* no artificial data sets
* no subsets of larger data sets nor binarizations of other data sets
* no data sets which are perfectly predictable by using a single feature or by using a simple decision tree
* source or reference available

If you use this benchmarking suite, please cite:

Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn and Joaquin Vanschoren. “OpenML Benchmarking Suites” arXiv:1708.03731v2 [stat.ML] (2019).

```
@article{oml-benchmarking-suites,
  title={OpenML Benchmarking Suites},
  author={Bernd Bischl and Giuseppe Casalicchio and Matthias Feurer and Frank Hutter and Michel Lang and Rafael G. Mantovani and Jan N. van Rijn and Joaquin Vanschoren},
  year={2019},
  journal={arXiv:1708.03731v2 [stat.ML]}
}
```
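The quantitative inclusion criteria listed above (class count, class sizes, class ratio, number of observations, and feature count) can be read as a simple filter over dataset meta-information. The following is a minimal sketch of that reading, not the procedure actually used to build the suite: the `DatasetMeta` class, its field names, and the `meets_quantitative_criteria` function are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical summary of a dataset; field names are illustrative only."""
    n_observations: int
    n_classes: int
    minority_class_size: int   # observations in the smallest class
    majority_class_size: int   # observations in the largest class
    n_features_one_hot: int    # feature count after one-hot encoding


def meets_quantitative_criteria(meta: DatasetMeta) -> bool:
    """Check only the numeric inclusion criteria stated in the description."""
    return (
        meta.n_classes >= 2
        # every class has at least 20 observations, so the smallest one does
        and meta.minority_class_size >= 20
        # minority/majority ratio must exceed 5%, written with integers:
        # minority / majority > 1/20  <=>  20 * minority > majority
        and 20 * meta.minority_class_size > meta.majority_class_size
        and 500 <= meta.n_observations <= 100_000
        and meta.n_features_one_hot < 5000
    )


# Made-up example values, not real OpenML datasets.
balanced = DatasetMeta(1000, 2, 400, 600, 30)
imbalanced = DatasetMeta(10_000, 2, 100, 9_900, 30)
too_small = DatasetMeta(300, 2, 100, 200, 10)

print(meets_quantitative_criteria(balanced))    # True
print(meets_quantitative_criteria(imbalanced))  # False: ratio is about 1%
print(meets_quantitative_criteria(too_small))   # False: fewer than 500 rows
```

The remaining criteria (dense data with independent observations, no artificial data, no subsets or binarizations, not trivially predictable, source available) are qualitative or need the data itself, so this sketch does not attempt to cover them.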