Dataset used in the tabular data benchmark (https://github.com/LeoGrin/tabular-benchmark) and transformed in the same way as in that benchmark. It belongs to the "regression on numerical features" benchmark. Original description:
The dataset contains information on 13,932 single-family homes sold in Miami in 2016. Besides publicly available information, the dataset creator Steven C. Bourassa has added distance variables, aviation noise as well as latitude and longitude.
The dataset contains the following columns:
- PARCELNO: unique identifier for each property. About 1% appear multiple times.
- SALE_PRC: sale price ($)
- LND_SQFOOT: land area (square feet)
- TOT_LVG_AREA: floor area (square feet)
- SPEC_FEAT_VAL: value of special features (e.g., swimming pools) ($)
- RAIL_DIST: distance to the nearest rail line (an indicator of noise) (feet)
- OCEAN_DIST: distance to the ocean (feet)
- WATER_DIST: distance to the nearest body of water (feet)
- CNTR_DIST: distance to the Miami central business district (feet)
- SUBCNTR_DI: distance to the nearest subcenter (feet)
- HWY_DIST: distance to the nearest highway (an indicator of noise) (feet)
- age: age of the structure
- avno60plus: dummy variable for airplane noise exceeding an acceptable level
- structure_quality: quality of the structure
- month_sold: sale month in 2016 (1 = January)
- LATITUDE
- LONGITUDE
A typical model would try to predict log(SALE_PRC) as a function of all variables except PARCELNO.
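
As a rough illustration of that setup, the sketch below builds the log-price target and drops the identifier column before fitting. It makes several assumptions: the table here is a small synthetic stand-in with made-up values and only a few of the columns, `make_xy` is a hypothetical helper name, and ordinary least squares is used only as a placeholder for whatever model one would actually fit.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real table: random values, subset of columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "PARCELNO": np.arange(n),                    # identifier, not a feature
    "SALE_PRC": rng.uniform(1e5, 1e6, n),        # sale price ($)
    "LND_SQFOOT": rng.uniform(2e3, 2e4, n),      # land area (square feet)
    "TOT_LVG_AREA": rng.uniform(800, 5000, n),   # floor area (square feet)
    "age": rng.integers(0, 90, n),
    "month_sold": rng.integers(1, 13, n),
})


def make_xy(frame):
    """Hypothetical helper: return (features, log-price target)."""
    y = np.log(frame["SALE_PRC"])
    # Exclude the identifier and the raw target from the features.
    X = frame.drop(columns=["PARCELNO", "SALE_PRC"])
    return X, y


X, y = make_xy(df)

# Placeholder model: ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
coef, *_ = np.linalg.lstsq(A, y.to_numpy(), rcond=None)

# Predictions are on the log scale; exponentiate to get back to dollars.
pred_price = np.exp(A @ coef)
```

With the real data, the same `make_xy` step would apply to the full set of columns listed above; only the synthetic frame would be replaced by the loaded table.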