OpenML

Rideshare

Status: active. Format: ARFF. License: Creative Commons Attribution 4.0 International. Visibility: public. Uploaded 25-06-2024 by Bruno Belucci Teixeira.

Uber, Lyft and weather hourly data. From the original website:

-----

Context

Uber and Lyft's ride prices are not constant like public transport fares. They are greatly affected by the demand and supply of rides at a given time. So what exactly drives this demand? The first guess would be the time of day; times around 9 am and 5 pm should see the highest surges on account of people commuting to work or home. Another guess would be the weather; rain or snow should cause more people to take rides.

NOTE: The data is for simulated rides with real prices, i.e. how much the ride would cost IF someone actually took it. Uber and Lyft DO NOT make real ride data public, and this dataset does not contain it either.

Content

With no public data on rides or prices shared by any entity, we tried to collect real-time data using Uber and Lyft API queries and the corresponding weather conditions. We chose a few hot locations in Boston from this map. We built a custom application in Scala to query data at regular intervals and saved it to DynamoDB. The project can be found here on GitHub. We queried cab ride estimates every 5 minutes and weather data every hour.

The data covers approximately one week of November 2018. (I have also included data collected while I was testing the querying application, so the data may be spread over more than a week. I did not treat this as a time-series problem, so I did not worry about regular intervals. The interval was chosen to query as much data as possible without unnecessary redundancy. The data can therefore run from the last week of November into the first days of December.)

The cab ride data covers various types of cabs for Uber and Lyft and their prices for the given location. You can also find out whether there was any surge in the price at that time. The weather data contains weather attributes such as temperature, rain and cloud cover for all the locations taken into consideration.

Inspiration

Our aim was to analyze the prices of these ride-sharing apps and to figure out which factors drive the demand.
Do Mondays have more demand than Sundays at 9 am? Do people avoid cabs on a sunny day? Was there a Red Sox match at Fenway that caused more people to come in? We have provided a small dataset as well as a mechanism to collect more data. We would love to see more conclusions drawn.

-----

The link to the original dataset is https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices/data

We used the version of the dataset in the Monash Time Series Forecasting Repository (https://zenodo.org/records/5122114), which is hourly (not in 5-minute intervals like the original dataset) and already joined with the weather data. On top of that version, we applied some preprocessing steps.

There are 21 columns:

id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d %H:%M:%S".
time_step: The time step on the time series.
value_X (X from 0 to 14): The values of the time series, which will be used for the forecasting task.
covariate_X (X from 0 to 2): Covariate values of the time series, tied to the 'id_series'. These are not forecasting targets, but they can help with the forecasting task.

Preprocessing:

1 - Dropped the 'series_name' column and exploded the 'series_value' column.
2 - Created a 'time_step' column from columns 'source_location', 'provider_name', 'provider_service', 'type'.
3 - Pivoted the table with index 'source_location', 'provider_name', 'provider_service', 'time_step', using the column 'type' for the new columns and 'series_value' for the values.
4 - Created column 'id_series' from 'source_location', 'provider_name', 'provider_service', obtaining ids from 0 to 155.
5 - Created column 'date' from the 'start_timestamp' ('2018-11-26 06:00:00') by adding 'time_step' * 1 hour. The format is %Y-%m-%d %H:%M:%S.
6 - Renamed columns 'source_location', 'provider_name', 'provider_service' to 'covariate_X' with X from 0 to 2, and the columns obtained from 'type' to 'value_X' with X from 0 to 14.
7 - Defined 'covariate_X' and 'id_series' columns as 'category' and 'value_X' columns as float.
8 - Dropped 12 'id_series' values because they have no 'price' information (only NaNs), leaving 144 series. After this step, no NaN values remain.
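
Preprocessing steps 3 to 8 can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the uploader's actual script: the toy rows, the location and service names, and the 'price'/'temp' type labels are made up, and only two series are built instead of 156.

```python
import pandas as pd

# Toy long-format frame standing in for the exploded data (values are made up).
long_df = pd.DataFrame({
    "source_location": ["Back Bay"] * 4 + ["Fenway"] * 4,
    "provider_name": ["Uber"] * 4 + ["Lyft"] * 4,
    "provider_service": ["UberX"] * 4 + ["Lyft XL"] * 4,
    "type": ["price", "temp", "price", "temp"] * 2,
    "time_step": [0, 0, 1, 1] * 2,
    # The second series has no price at all, so step 8 should drop it.
    "series_value": [9.5, 41.0, 10.0, 40.5, None, 41.0, None, 40.5],
})
keys = ["source_location", "provider_name", "provider_service"]

# Step 3: one row per (series, time_step), one column per 'type'.
wide = long_df.pivot(index=keys + ["time_step"], columns="type",
                     values="series_value").reset_index()
wide.columns.name = None
type_cols = [c for c in wide.columns if c not in keys + ["time_step"]]

# Step 4: integer id per (location, provider, service) combination.
wide["id_series"] = wide.groupby(keys, sort=False).ngroup()

# Step 5: date = start timestamp + time_step hours, stored as a string.
start = pd.Timestamp("2018-11-26 06:00:00")
wide["date"] = (pd.to_timedelta(wide["time_step"], unit="h") + start
                ).dt.strftime("%Y-%m-%d %H:%M:%S")

# Step 6: rename key columns to covariate_X and type columns to value_X.
rename = {k: f"covariate_{i}" for i, k in enumerate(keys)}
rename.update({t: f"value_{i}" for i, t in enumerate(type_cols)})
wide = wide.rename(columns=rename)

# Step 7: categorical covariates and ids, float values.
cat_cols = [rename[k] for k in keys] + ["id_series"]
val_cols = [rename[t] for t in type_cols]
wide[cat_cols] = wide[cat_cols].astype("category")
wide[val_cols] = wide[val_cols].astype(float)

# Step 8: keep only series that have at least one non-NaN price.
price_col = rename["price"]
ids_with_price = wide.loc[wide[price_col].notna(), "id_series"].unique()
wide = wide[wide["id_series"].isin(ids_with_price)].reset_index(drop=True)
wide["id_series"] = wide["id_series"].cat.remove_unused_categories()
```

In this toy run only the first series survives step 8, which mirrors how the real dataset goes from 156 ids to 144.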

21 features

covariate_0	nominal	12 unique values 0 missing
covariate_1	nominal	2 unique values 0 missing
covariate_2	nominal	12 unique values 0 missing
time_step	numeric	541 unique values 0 missing
value_0	numeric	53 unique values 0 missing
value_1	numeric	265 unique values 0 missing
value_2	numeric	486 unique values 0 missing
value_3	numeric	29619 unique values 0 missing
value_4	numeric	308 unique values 0 missing
value_5	numeric	210 unique values 0 missing
value_6	numeric	135 unique values 0 missing
value_7	numeric	9386 unique values 0 missing
value_8	numeric	69 unique values 0 missing
value_9	numeric	316 unique values 0 missing
value_10	numeric	8 unique values 0 missing
value_11	numeric	353 unique values 0 missing
value_12	numeric	5 unique values 0 missing
value_13	numeric	2349 unique values 0 missing
value_14	numeric	1680 unique values 0 missing
id_series	nominal	144 unique values 0 missing
date	string	541 unique values 0 missing
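
For forecasting, a table with this layout is usually split back into one array per 'id_series'. The sketch below shows one way to do that on a toy frame with the same column naming. The numbers, the two-series size and the `horizon` hold-out length are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy frame with the same column naming as the feature table, but only
# two series, three time steps and two value columns (made-up numbers).
df = pd.DataFrame({
    "id_series": ["0", "0", "0", "1", "1", "1"],
    "time_step": [0, 1, 2, 0, 1, 2],
    "value_0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "value_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
value_cols = [c for c in df.columns if c.startswith("value_")]

# One (n_time_steps, n_values) array per series, ordered by time_step.
series = {
    sid: group.sort_values("time_step")[value_cols].to_numpy()
    for sid, group in df.groupby("id_series", sort=False)
}

# Hypothetical hold-out: the last `horizon` steps of each series.
horizon = 1
train = {sid: arr[:-horizon] for sid, arr in series.items()}
test = {sid: arr[-horizon:] for sid, arr in series.items()}
```

On the full dataset the same loop would be expected to yield 144 arrays with 15 value columns each, given the counts in the feature table.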