Dataset used in the tabular data benchmark https://github.com/LeoGrin/tabular-benchmark, transformed in the same way. This dataset belongs to the "classification on numerical features" benchmark. Original description:
Author: Jock A. Blackard, Dr. Denis J. Dean, Dr. Charles W. Anderson
Source: [LibSVM repository](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) - 2013-11-14
Please cite: For the binarization: R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(05):1105-1114, 2002.
This is the famous covertype dataset in its binary version, retrieved 2013-11-13 from the LibSVM site (called covtype.binary there). In addition to the preprocessing done there (see the LibSVM site for details), this dataset was created as follows:
-load the covertype dataset, unscaled.
-normalize each file column-wise according to the following rules:
  -If a column contains only one value (constant feature), it is set to zero and thus removed by sparsity.
  -If a column contains two values (binary feature), the value occurring more often is set to zero, the other to one.
  -If a column contains more than two values (multinary/real feature), the column is divided by its standard deviation.
-finally, duplicate lines were removed.
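
The normalization rules above can be sketched in NumPy as follows. This is an illustrative reconstruction, not the script that produced the dataset. The function names `normalize_columns` and `drop_duplicate_rows` are hypothetical. The use of the population standard deviation (`ddof=0`) and the tie-breaking for binary columns whose two values occur equally often are assumptions, because the description does not specify them.

```python
import numpy as np


def normalize_columns(X):
    """Apply the three column-wise rules to a dense 2-D array (sketch)."""
    X = np.asarray(X, dtype=float)
    out = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        values, counts = np.unique(col, return_counts=True)
        if len(values) == 1:
            # Constant feature: leave the column at zero.
            continue
        if len(values) == 2:
            # Binary feature: the rarer value becomes 1, the more frequent 0.
            # Assumption: on a tie, np.argmin picks the smaller value as "rare".
            rare = values[np.argmin(counts)]
            out[:, j] = (col == rare).astype(float)
        else:
            # Multinary/real feature: divide by the standard deviation.
            # Assumption: population standard deviation (ddof=0).
            out[:, j] = col / col.std()
    return out


def drop_duplicate_rows(X):
    """Remove duplicate rows, keeping the first occurrence in original order."""
    _, first_idx = np.unique(X, axis=0, return_index=True)
    return X[np.sort(first_idx)]


if __name__ == "__main__":
    demo = np.array([[5, 0, 1], [5, 0, 2], [5, 1, 3], [5, 0, 1]])
    print(drop_duplicate_rows(normalize_columns(demo)))
```

Duplicates are removed after normalization here, matching the order of the steps listed above.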
Preprocessing: Transform from multiclass into binary class.
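
A one-vs-rest binarization of multiclass labels might look like the sketch below. The description does not state which original class is treated as positive or how the two classes are encoded, so the helper name `binarize_labels`, the `positive_class` argument, and the 0/1 output encoding are all assumptions made for illustration.

```python
import numpy as np


def binarize_labels(y, positive_class):
    """Map multiclass labels to 1 (positive_class) or 0 (every other class)."""
    y = np.asarray(y)
    return (y == positive_class).astype(int)


if __name__ == "__main__":
    # The class value 2 is an arbitrary illustration, not taken from the description.
    print(binarize_labels([1, 2, 3, 2, 7], positive_class=2))
```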