OpenML-CTR23 - A curated tabular regression benchmarking suite

Created 31-05-2023 by Sebastian Fischer Visibility: public
Inclusion Criteria:

* There are between 500 and 100000 observations.
* There are fewer than 5000 features after one-hot encoding all categorical features.
* The dataset is not in a sparse format.
* The observations are i.i.d., which means that we exclude datasets that have time dependencies or require grouped data splits.
* The dataset comes with a source or reference that clearly describes it.
* We did not consider the dataset to be artificial, but allowed simulated datasets.
* The data is not a subset of a larger dataset.
* There is a numeric target variable with at least 5 different values.
* The dataset is not trivially solvable by a linear model, i.e., a linear model fitted to and evaluated on the whole data has an R² of less than 1.
* The dataset does not have ethical concerns.
* The use of the dataset for benchmarking is not forbidden.

In addition to the datasets, the OpenML tasks also contain resampling splits, which were determined according to the following rule: If there are fewer than 1000 observations, we use 10 times repeated 10-fold CV. If there are more than 10000 observations, we use a 33% holdout split. For everything in between, we use 10-fold CV.

Please cite the following paper if you use the suite:

@inproceedings{fischer2023openmlctr,
  title={Open{ML}-{CTR}23 {\textendash} A curated tabular regression benchmarking suite},
  author={Sebastian Felix Fischer and Liana Harutyunyan and Matthias Feurer and Bernd Bischl},
  booktitle={AutoML Conference 2023 (Workshop)},
  year={2023},
  url={https://openreview.net/forum?id=HebAOoMm94}
}

If you notice a problem with one of the datasets, please add a comment here: https://github.com/slds-lmu/paper_2023_regression_suite/issues/1
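
The resampling rule described above can be sketched as a small function. This is only an illustration of the stated thresholds, not the code used to generate the splits stored in the OpenML tasks. The names `Resampling` and `choose_resampling` are hypothetical, and sending datasets with exactly 1000 or exactly 10000 observations to plain 10-fold CV is an assumption based on a literal reading of "less than" and "more than".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resampling:
    """Hypothetical container describing a resampling strategy."""
    strategy: str               # "repeated_cv", "cv", or "holdout"
    folds: int = 0              # number of CV folds (0 for holdout)
    repeats: int = 1            # number of CV repetitions
    test_fraction: float = 0.0  # share of data held out (holdout only)


def choose_resampling(n_observations: int) -> Resampling:
    """Pick a resampling strategy from the number of observations.

    Assumption: the boundary values 1000 and 10000 themselves
    fall into the middle case (plain 10-fold CV).
    """
    if n_observations < 1000:
        # Small datasets: 10 times repeated 10-fold CV
        return Resampling("repeated_cv", folds=10, repeats=10)
    if n_observations > 10000:
        # Large datasets: a single 33% holdout split
        return Resampling("holdout", test_fraction=0.33)
    # Everything in between: 10-fold CV
    return Resampling("cv", folds=10)


if __name__ == "__main__":
    for n in (500, 5000, 50000):
        print(n, choose_resampling(n))
```

When benchmarking on the suite itself, the splits shipped with the OpenML tasks should be preferred over recomputing them, so that results remain comparable across studies.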