Uber, Lyft and weather hourly data.
From original website:
-----
Context
Uber and Lyft's ride prices are not constant like public transport. They are greatly affected by the demand and supply of rides at a given time. So what exactly drives this demand? The first guess would be the time of the day; times around 9 am and 5 pm should see the highest surges on account of people commuting to work/home. Another guess would be the weather; rain/snow should cause more people to take rides.
NOTE: The data is for simulated rides with real prices, i.e., how much the ride WOULD have cost if someone actually took it. Uber/Lyft DO NOT make actual ride data public, and this dataset does not contain it either.
Content
With no public data on rides/prices shared by any entity, we tried to collect real-time data using Uber & Lyft API queries and the corresponding weather conditions. We chose a few hot locations in Boston from this map.
We built a custom application in Scala to query data at regular intervals and save it to DynamoDB. The project can be found here on GitHub.
We queried cab ride estimates every 5 minutes and weather data every 1 hour.
The data covers approximately one week of November 2018. (It also includes data collected while I was testing the querying application, so it may be spread out over more than a week. I did not treat this as a time-series problem, so I did not worry about keeping a strictly regular interval; the chosen interval was meant to collect as much data as possible without unnecessary redundancy. As a result, the data can run from the last week of November into early December.)
The cab ride data covers various types of cabs for Uber & Lyft and their price for the given location. You can also see whether there was a surge in the price at that time.
Weather data contains weather attributes like temperature, rain, cloud cover, etc. for all the locations taken into consideration.
Inspiration
Our aim was to analyze the prices of these ride-sharing apps and try to figure out what factors drive the demand. Do Mondays have more demand than Sundays at 9 am? Do people avoid cabs on a sunny day? Was there a Red Sox game at Fenway that caused more people to come in? We have provided a small dataset as well as a mechanism to collect more data. We would love to see more conclusions drawn.
-----
The link to the original dataset is https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices/data
We have used the dataset from the Monash Time Series Forecasting Repository (https://zenodo.org/records/5122114), which is hourly (not in 5-minute intervals like the original dataset) and already joined with the weather data. However, we have applied some preprocessing steps.
There are 21 columns:
id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d %H:%M:%S" (hourly resolution).
time_step: The time step within the time series.
value_X (X from 0 to 14): The values of the time series, which will be used for the forecasting task.
covariate_X (X from 0 to 2): Covariate values of the time series, tied to 'id_series'. They are not themselves forecast, but can help with the forecasting task.
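As a quick sanity check, the 21-column schema above can be enumerated programmatically. A minimal sketch (the CSV file name is hypothetical; adjust it to the actual file shipped with the dataset):

```python
import pandas as pd

# The 21 expected columns, following the description above.
expected_columns = (
    ["id_series", "date", "time_step"]
    + [f"value_{i}" for i in range(15)]      # 15 series values to forecast
    + [f"covariate_{i}" for i in range(3)]   # 3 static covariates per series
)
assert len(expected_columns) == 21

# Hypothetical file name; uncomment and adapt for the real file.
# df = pd.read_csv("uber_lyft_hourly.csv", parse_dates=["date"])
# assert list(df.columns) == expected_columns
```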
Preprocessing:
1 - Dropped the 'series_name' column and exploded the 'series_value' column.
2 - Created a 'time_step' column as the sequential position within each group of 'source_location', 'provider_name', 'provider_service', 'type'.
3 - Pivoted the table with index ('source_location', 'provider_name', 'provider_service', 'time_step'), using 'type' as columns and 'series_value' as values.
4 - Created column 'id_series' from 'source_location', 'provider_name', 'provider_service', obtaining ids from 0 to 155.
5 - Created column 'date' from the 'start_timestamp' ('2018-11-26 06:00:00') by adding 'time_step' * 1 hour. The format is %Y-%m-%d %H:%M:%S.
6 - Renamed columns 'source_location', 'provider_name', 'provider_service' to 'covariate_X' with X from 0 to 2 and the columns obtained from 'type' to
'value_X' with X from 0 to 14.
7 - Cast the 'covariate_X' and 'id_series' columns to 'category' and the 'value_X' columns to float.
8 - Dropped 12 'id_series' because they had no 'price' information (only NaNs). After this step, there are no remaining NaN values.
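The steps above can be sketched in pandas on a toy example. This is an illustrative reconstruction, not the exact pipeline: the two-row input, its values, and the single 'price'/'temperature' types are invented for brevity (the real data has 15 types and 156 series).

```python
import pandas as pd

# Toy stand-in for the raw Monash rows: one row per (series, type), with the
# hourly observations nested in 'series_value'. Values are invented.
raw = pd.DataFrame({
    "source_location": ["Back Bay"] * 2,
    "provider_name": ["Uber"] * 2,
    "provider_service": ["UberX"] * 2,
    "type": ["price", "temperature"],
    "series_value": [[7.5, 9.0], [4.2, 3.9]],
})

# Steps 1-2: explode 'series_value'; 'time_step' is the position within
# each (location, provider, service, type) group.
long = raw.explode("series_value").reset_index(drop=True)
keys = ["source_location", "provider_name", "provider_service"]
long["time_step"] = long.groupby(keys + ["type"]).cumcount()

# Step 3: pivot so that each 'type' becomes its own column.
wide = long.pivot(index=keys + ["time_step"],
                  columns="type", values="series_value").reset_index()

# Step 4: one integer id per (location, provider, service) combination.
wide["id_series"] = wide.groupby(keys).ngroup()

# Step 5: hourly dates starting from the given start timestamp.
start = pd.Timestamp("2018-11-26 06:00:00")
wide["date"] = start + pd.to_timedelta(wide["time_step"], unit="h")

# Step 6: rename the key columns to covariate_0..2 (in the real data the
# 15 'type' columns would likewise become value_0..14).
wide = wide.rename(columns=dict(zip(keys, [f"covariate_{i}" for i in range(3)])))
print(wide[["id_series", "date", "time_step", "covariate_0", "price"]])
```

Steps 7-8 would follow with `astype("category")` / `astype(float)` and a filter dropping series whose price column is entirely NaN.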