OpenML
nyc-taxi-green-dec-2016

Status: active; Format: ARFF; Publicly available; Visibility: public; Uploaded 18-06-2022 by Leo Grin
  • Computer Systems Machine Learning
Dataset used in the tabular data benchmark https://github.com/LeoGrin/tabular-benchmark, transformed in the same way. This dataset belongs to the "regression on categorical and numerical features" benchmark.

Original description: Trip Record Data provided by the New York City Taxi and Limousine Commission (TLC) [http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml]. The dataset includes TLC trips of the green line in December 2016. The data was downloaded on 03.11.2018. For a description of all variables in the dataset, see the TLC data dictionary [http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf].

The variable 'tip_amount' was chosen as the target variable. The variable 'total_amount' is ignored by default, because otherwise the target could be predicted deterministically. The date variables 'lpep_pickup_datetime' and 'lpep_dropoff_datetime' (ignored by default) could be used to compute additional time features.

In this version, only trips with 'payment_type' == 1 (credit card) were kept, as tips are not included for most other payment types. The variables 'trip_distance' and 'fare_amount' were also removed to increase the importance of the categorical features 'PULocationID' and 'DOLocationID'.
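The description notes that the two date variables could be turned into additional time features. A minimal sketch of that idea follows, assuming access to the raw TLC records and assuming timestamps in the form `YYYY-MM-DD HH:MM:SS`; the function name `time_features`, the chosen features, and the sample timestamps are illustrative and not part of the dataset.

```python
from datetime import datetime

# Assumed timestamp layout of 'lpep_pickup_datetime' / 'lpep_dropoff_datetime'.
FMT = "%Y-%m-%d %H:%M:%S"


def time_features(pickup, dropoff, fmt=FMT):
    """Derive a few simple time features from one trip's two timestamps."""
    start = datetime.strptime(pickup, fmt)
    end = datetime.strptime(dropoff, fmt)
    return {
        "pickup_hour": start.hour,
        "pickup_weekday": start.weekday(),      # Monday == 0, Sunday == 6
        "is_weekend": start.weekday() >= 5,
        "duration_minutes": (end - start).total_seconds() / 60,
    }


# Invented example trip that crosses midnight.
features = time_features("2016-12-03 23:50:00", "2016-12-04 00:05:30")
print(features)
```

Which derived features actually help predict 'tip_amount' is an empirical question; the ones above are only common starting points.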

11 features

tip_amount (target): numeric, 1811 unique values, 0 missing
VendorID: nominal, 2 unique values, 0 missing
store_and_fwd_flag: nominal, 2 unique values, 0 missing
RatecodeID: nominal, 5 unique values, 0 missing
passenger_count: numeric, 10 unique values, 0 missing
extra: nominal, 5 unique values, 0 missing
mta_tax: nominal, 3 unique values, 0 missing
tolls_amount: numeric, 105 unique values, 0 missing
improvement_surcharge: nominal, 3 unique values, 0 missing
total_amount: numeric, 5377 unique values, 0 missing
trip_type: nominal, 2 unique values, 0 missing
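Since 'tip_amount' is the target and 'total_amount' is meant to be ignored, a model's inputs should exclude both. The sketch below shows one way to do that split in plain Python; the helper `split_features_target` and the three sample rows are invented for illustration, and only a subset of the columns is shown.

```python
TARGET = "tip_amount"
IGNORED = {"total_amount"}  # ignored by default: it would leak the target

# Hand-made rows standing in for the real data (values are invented).
rows = [
    {"VendorID": "1", "passenger_count": 1, "tolls_amount": 0.0,
     "total_amount": 12.3, "tip_amount": 2.0},
    {"VendorID": "2", "passenger_count": 2, "tolls_amount": 5.54,
     "total_amount": 30.84, "tip_amount": 5.0},
    {"VendorID": "2", "passenger_count": 1, "tolls_amount": 0.0,
     "total_amount": 8.8, "tip_amount": 1.0},
]


def split_features_target(rows, target=TARGET, ignored=IGNORED):
    """Return (X, y): feature dicts without target/ignored columns, and targets."""
    X = [
        {name: value for name, value in row.items()
         if name != target and name not in ignored}
        for row in rows
    ]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(rows)
print(sorted(X[0]))
print(y)
```

The same split is a one-liner with a dataframe library; plain dictionaries are used here only to keep the sketch free of dependencies.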

19 properties

Number of instances (rows) of the dataset: 581835
Number of attributes (columns) of the dataset: 11
Number of distinct values of the target attribute (if it is nominal): 0
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 4
Number of nominal attributes: 7
Average class difference between consecutive instances: 0.36
Percentage of missing values: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 36.36
Percentage of instances belonging to the most frequent class: (no value)
Percentage of nominal attributes: 63.64
Number of instances belonging to the most frequent class: (no value)
Percentage of instances belonging to the least frequent class: (no value)
Number of instances belonging to the least frequent class: (no value)
Number of binary attributes: 3
Percentage of binary attributes: 27.27
Percentage of instances having missing values: 0
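The percentage properties follow directly from the attribute counts listed on this page. As a quick check, the arithmetic can be reproduced as below; the helper `pct` and the variable names are illustrative, and rounding to two decimals is an assumption about how the page displays values.

```python
# Counts taken from the property list.
n_instances = 581835
n_attributes = 11
n_numeric = 4
n_nominal = 7
n_binary = 3


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


pct_numeric = pct(n_numeric, n_attributes)   # share of numeric attributes
pct_nominal = pct(n_nominal, n_attributes)   # share of nominal attributes
pct_binary = pct(n_binary, n_attributes)     # share of binary attributes

# Attributes per instance is tiny, so it displays as 0 at two decimals.
dimensionality = n_attributes / n_instances

print(pct_numeric, pct_nominal, pct_binary, round(dimensionality, 2))
```

The numeric and nominal counts add up to the 11 attributes, which is why their percentages sum to 100.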

0 tasks
