Rideshare

Status: active. Format: ARFF. License: Creative Commons Attribution 4.0 International. Visibility: public. Uploaded 25-06-2024 by Bruno Belucci Teixeira.
Uber, Lyft and weather hourly data.

From the original website:
-----
Context

Uber and Lyft ride prices are not constant like public transport fares. They are strongly affected by the demand for and supply of rides at a given time. So what exactly drives this demand? The first guess would be the time of day: times around 9 am and 5 pm should see the highest surges, since people are commuting to work or home. Another guess would be the weather: rain or snow should cause more people to take rides.

NOTE: The data is for simulated rides with real prices, i.e. how much the ride would cost IF someone actually took it. Uber and Lyft DO NOT make actual ride data public, and this dataset does not contain it either.

Content

With no public data on rides or prices shared by any entity, we tried to collect real-time data using Uber and Lyft API queries together with the corresponding weather conditions. We chose a few hot locations in Boston from this map. We built a custom application in Scala to query data at regular intervals and save it to DynamoDB. The project can be found here on GitHub. We queried cab ride estimates every 5 minutes and weather data every hour.

The data covers approximately one week of November 2018. (I have included data collected while I was testing the querying application, so the data may be spread out over more than a week. I did not treat this as a time-series problem, so I did not worry about regular intervals. The interval was chosen to query as much data as possible without unnecessary redundancy. The data can therefore run from the last week of November to a few days in December.)

The cab ride data covers various types of cabs for Uber and Lyft and their prices for the given location. You can also see whether there was any surge in the price at that time. The weather data contains attributes such as temperature, rain, cloud, etc. for all the locations considered.

Inspiration

Our aim was to analyze the prices of these ride-sharing apps and figure out which factors drive the demand.
Do Mondays have more demand than Sundays at 9 am? Do people avoid cabs on a sunny day? Was there a Red Sox match at Fenway that brought more people in? We have provided a small dataset as well as a mechanism to collect more data. We would love to see more conclusions drawn.
-----

The link to the original dataset is https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices/data

We have used the version of the dataset in the Monash Time Series Forecasting Repository (https://zenodo.org/records/5122114), which is hourly (not in 5-minute intervals like the original dataset) and already joined with the weather data. However, we have applied some preprocessing steps.

There are 21 columns:

id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d %H:%M:%S".
time_step: The time step within the time series.
value_X (X from 0 to 14): The values of the time series, which are the targets of the forecasting task.
covariate_X (X from 0 to 2): Covariate values of the time series, tied to the 'id_series'. They are not forecasting targets themselves, but they can help with the forecasting task.

Preprocessing:

1 - Dropped the 'series_name' column and exploded the 'series_value' column.
2 - Created a 'time_step' column from the columns 'source_location', 'provider_name', 'provider_service', 'type'.
3 - Pivoted the table with index 'source_location', 'provider_name', 'provider_service', 'time_step', using the column 'type' for the new columns and 'series_value' for the values.
4 - Created the column 'id_series' from 'source_location', 'provider_name', 'provider_service', obtaining ids from 0 to 155.
5 - Created the column 'date' from the 'start_timestamp' ('2018-11-26 06:00:00') by adding 'time_step' * 1 hour. The format is %Y-%m-%d %H:%M:%S.
6 - Renamed the columns 'source_location', 'provider_name', 'provider_service' to 'covariate_X' with X from 0 to 2, and the columns obtained from 'type' to 'value_X' with X from 0 to 14.
7 - Defined the 'covariate_X' and 'id_series' columns as 'category' and the 'value_X' columns as float.
8 - Dropped 12 'id_series' values because they have no 'price' information (only NaNs), leaving 144 series. After this step, no NaN values remain.
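The preprocessing steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a made-up four-row input, not the uploader's actual script. The toy values, the 'price_mean' type name, the `preprocess` helper, and the use of a per-group running count for 'time_step' are all assumptions. Step 8 is applied before step 7 here only to keep the filtering simple.

```python
import pandas as pd

KEYS = ["source_location", "provider_name", "provider_service"]
START = pd.Timestamp("2018-11-26 06:00:00")  # 'start_timestamp' from step 5


def preprocess(raw, price_type="price_mean"):
    """Hypothetical helper mirroring steps 1 to 8 on a small frame."""
    # Step 1: drop 'series_name' and explode the list-valued 'series_value'.
    df = raw.drop(columns="series_name").explode("series_value", ignore_index=True)
    df["series_value"] = df["series_value"].astype(float)
    # Step 2 (assumed reading): position of each value within its series.
    df["time_step"] = df.groupby(KEYS + ["type"]).cumcount()
    # Step 3: one column per 'type'.
    wide = df.pivot(index=KEYS + ["time_step"], columns="type", values="series_value")
    type_cols = list(wide.columns)
    wide = wide.reset_index()
    wide.columns.name = None
    # Step 4: integer id per (location, provider, service) combination.
    wide["id_series"] = wide.groupby(KEYS).ngroup()
    # Step 5: start timestamp plus time_step hours, stored as a string.
    stamps = START + pd.to_timedelta(wide["time_step"], unit="h")
    wide["date"] = stamps.dt.strftime("%Y-%m-%d %H:%M:%S")
    # Step 6: rename to covariate_X / value_X.
    rename = {k: f"covariate_{i}" for i, k in enumerate(KEYS)}
    rename.update({t: f"value_{i}" for i, t in enumerate(type_cols)})
    wide = wide.rename(columns=rename)
    # Step 8: keep only series with at least one non-NaN price.
    price_col = rename[price_type]
    keep_ids = wide.loc[wide[price_col].notna(), "id_series"].unique()
    wide = wide[wide["id_series"].isin(keep_ids)].reset_index(drop=True)
    # Step 7: categorical covariates and ids; value columns are already float.
    cat_cols = [f"covariate_{i}" for i in range(len(KEYS))] + ["id_series"]
    return wide.astype({c: "category" for c in cat_cols})


# Made-up input: two series, one of which has no price information.
raw = pd.DataFrame({
    "series_name": ["T1", "T2", "T3", "T4"],
    "source_location": ["Fenway", "Fenway", "Back Bay", "Back Bay"],
    "provider_name": ["Uber", "Uber", "Lyft", "Lyft"],
    "provider_service": ["UberX", "UberX", "Lyft XL", "Lyft XL"],
    "type": ["price_mean", "temp", "price_mean", "temp"],
    "series_value": [[10.0, 11.5], [40.1, 39.8],
                     [float("nan"), float("nan")], [41.0, 40.5]],
})
out = preprocess(raw)
print(out)
```

On this toy input the Back Bay series is dropped because its price column contains only NaNs, so the result holds two hourly rows for the Fenway series.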

21 features

covariate_0 (nominal): 12 unique values, 0 missing
covariate_1 (nominal): 2 unique values, 0 missing
covariate_2 (nominal): 12 unique values, 0 missing
time_step (numeric): 541 unique values, 0 missing
value_0 (numeric): 53 unique values, 0 missing
value_1 (numeric): 265 unique values, 0 missing
value_2 (numeric): 486 unique values, 0 missing
value_3 (numeric): 29619 unique values, 0 missing
value_4 (numeric): 308 unique values, 0 missing
value_5 (numeric): 210 unique values, 0 missing
value_6 (numeric): 135 unique values, 0 missing
value_7 (numeric): 9386 unique values, 0 missing
value_8 (numeric): 69 unique values, 0 missing
value_9 (numeric): 316 unique values, 0 missing
value_10 (numeric): 8 unique values, 0 missing
value_11 (numeric): 353 unique values, 0 missing
value_12 (numeric): 5 unique values, 0 missing
value_13 (numeric): 2349 unique values, 0 missing
value_14 (numeric): 1680 unique values, 0 missing
id_series (nominal): 144 unique values, 0 missing
date (string): 541 unique values, 0 missing
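A per-column summary like the one listed above (unique and missing counts) could be recomputed from the rows with a small helper. The sketch below uses plain Python on a few made-up rows; the `summarize` function and the sample values are hypothetical and are not part of the dataset tooling.

```python
from collections import defaultdict


def summarize(rows):
    """Return {column: (unique_count, missing_count)} for a list of dict rows."""
    unique = defaultdict(set)
    missing = defaultdict(int)
    for row in rows:
        for col, val in row.items():
            # Treat None and NaN (the only value not equal to itself) as missing.
            if val is None or val != val:
                missing[col] += 1
            else:
                unique[col].add(val)
    cols = list(rows[0].keys()) if rows else []
    return {c: (len(unique[c]), missing[c]) for c in cols}


# Made-up rows for illustration only.
rows = [
    {"covariate_1": "Uber", "time_step": 0, "value_0": 1.0},
    {"covariate_1": "Lyft", "time_step": 0, "value_0": 1.0},
    {"covariate_1": "Uber", "time_step": 1, "value_0": float("nan")},
]
summary = summarize(rows)
for col, (n_unique, n_missing) in summary.items():
    print(f"{col}: {n_unique} unique values, {n_missing} missing")
```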

19 properties

Number of instances (rows) of the dataset: 77904
Number of attributes (columns) of the dataset: 21
Number of distinct values of the target attribute (if it is nominal): not reported
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 16
Number of nominal attributes: 4
Percentage of binary attributes: 4.76
Percentage of instances having missing values: 0
Percentage of missing values: 0
Average class difference between consecutive instances: not reported
Percentage of numeric attributes: 76.19
Number of attributes divided by the number of instances: 0
Percentage of instances belonging to the most frequent class: not reported
Percentage of nominal attributes: 19.05
Number of instances belonging to the most frequent class: not reported
Percentage of instances belonging to the least frequent class: not reported
Number of instances belonging to the least frequent class: not reported
Number of binary attributes: 1
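The reported properties are consistent with the counts given elsewhere on this page, which a few lines of arithmetic can confirm. The sketch below assumes that every one of the 144 series has all 541 time steps; that is an inference from the numbers matching, not something the description states.

```python
# Counts taken from the description and the feature list on this page.
n_series_before, n_dropped = 156, 12   # ids 0 to 155, 12 series dropped
n_series = n_series_before - n_dropped
n_steps = 541                          # unique 'time_step' values
n_attrs = 21
counts = {"numeric": 16, "nominal": 4, "binary": 1}

# Assumption: each remaining series covers every time step.
n_rows = n_series * n_steps

# Share of each attribute kind, as a percentage of all attributes.
pct = {kind: round(100 * n / n_attrs, 2) for kind, n in counts.items()}

# Attributes per instance, which rounds to 0 at two decimals.
attrs_per_row = round(n_attrs / n_rows, 2)

print(n_series, n_rows, pct, attrs_per_row)
```

The row count and the three percentages reproduce the values 77904, 76.19, 19.05 and 4.76 shown in the property list.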

0 tasks
