Inclusion Criteria:
* There are between 500 and 100000 observations.
* There are fewer than 5000 features after one-hot encoding all categorical features.
* The dataset is not in a sparse format.
* The observations are i.i.d., which means that we exclude datasets that have time dependencies or require grouped data splits.
* The dataset comes with a source or reference that clearly describes it.
* The dataset is not artificial, although simulated datasets are allowed.
* The data is not a subset of a larger dataset.
* There is a numeric target variable with at least 5 different values.
* The dataset is not trivially solvable by a linear model, i.e. a linear model fitted to the whole data achieves a training R2 of less than 1 (see the sketch after this list).
* The dataset does not have ethical concerns.
* The use of the dataset for benchmarking is not forbidden.
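The quantitative criteria above (observation count, feature count after one-hot encoding, target cardinality, and the linear-model triviality check) could be verified with a minimal sketch like the one below; the function name and the use of pandas/scikit-learn are illustrative assumptions and not part of the suite's tooling.

```python
# Minimal sketch (assumed helper, not part of the suite) of the quantitative
# inclusion checks, given a pandas DataFrame X of raw features and a numeric target y.
import pandas as pd
from sklearn.linear_model import LinearRegression

def passes_quantitative_checks(X: pd.DataFrame, y: pd.Series) -> bool:
    if not 500 <= len(X) <= 100_000:   # between 500 and 100000 observations
        return False
    X_ohe = pd.get_dummies(X)          # one-hot encode all categorical features
    if X_ohe.shape[1] >= 5_000:        # fewer than 5000 features after encoding
        return False
    if y.nunique() < 5:                # numeric target with at least 5 different values
        return False
    # Not trivially solvable: a linear model fitted to the whole data
    # must not reach a training R2 of 1.
    r2 = LinearRegression().fit(X_ohe, y).score(X_ohe, y)
    return r2 < 1.0
```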
In addition to the datasets, the OpenML tasks also contain resampling splits, which were determined according to the following rule: if there are fewer than 1000 observations, we use 10 times repeated 10-fold CV; if there are more than 10000 observations, we use a 33% holdout split; for everything in between, we use 10-fold CV.
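As a compact restatement of that rule, a sketch could look as follows; the function name is illustrative, and the authoritative splits are the ones stored in the OpenML tasks themselves.

```python
# Sketch of the resampling rule above; the actual splits are those
# attached to the OpenML tasks.
def resampling_scheme(n_obs: int) -> str:
    if n_obs < 1_000:
        return "10 times repeated 10-fold CV"
    if n_obs > 10_000:
        return "33% holdout"
    return "10-fold CV"
```

For example, a dataset with 800 observations gets 10 times repeated 10-fold CV, while one with 50000 observations gets a 33% holdout split.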
Please cite the following paper if you use the suite:
@inproceedings{
fischer2023openmlctr,
title={Open{ML}-{CTR}23 {\textendash} A curated tabular regression benchmarking suite},
author={Sebastian Felix Fischer and Liana Harutyunyan and Matthias Feurer and Bernd Bischl},
booktitle={AutoML Conference 2023 (Workshop)},
year={2023},
url={https://openreview.net/forum?id=HebAOoMm94}
}
If you notice a problem with one of the datasets, please add a comment here: https://github.com/slds-lmu/paper_2023_regression_suite/issues/1