LondonSmartMeter forecasting data
From the website:
-----
Energy consumption readings for a sample of 5,567 London Households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014.
Readings were taken at half hourly intervals. The customers in the trial were recruited as a balanced sample representative of the Greater London population.
The dataset contains energy consumption, in kWh (per half hour), unique household identifier, date and time. The CSV file is around 10 GB when unzipped and contains around 167 million rows.
Within the data set are two groups of customers. The first is a sub-group, of approximately 1100 customers, who were subjected to Dynamic Time of Use (dToU) energy prices throughout the 2013 calendar year period. The tariff prices were given a day ahead via the Smart Meter IHD (In Home Display) or text message to mobile phone. Customers were issued High (67.20p/kWh), Low (3.99p/kWh) or normal (11.76p/kWh) price signals and the times of day these applied. The dates/times and the price signal schedule are available as part of this dataset. All non-Time of Use customers were on a flat rate tariff of 14.228 pence/kWh.
The signals given were designed to be representative of the types of signal that may be used in the future to manage both high renewable generation (supply following) operation and also test the potential to use high price signals to reduce stress on local distribution grids during periods of stress.
The energy consumption readings of the remaining sample of approximately 4500 customers were not subject to the dToU tariff.
-----
There are 5 columns:
id_series: The identifier of a time series.
LCLid: The category of the time series (dToU or std).
value: The value of the time series at 'time_step'.
time_step: The time step on the time series.
date: The reconstructed date of the time series in the format %Y-%m-%d %H-%M-%S.
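As a small illustration of the 'date' format listed above, the following sketch parses and re-serializes one timestamp using only the Python standard library. The sample string is made up for the example and is not taken from the dataset.

```python
from datetime import datetime

# Format string as documented for the 'date' column (note the hyphens in the time part).
DATE_FORMAT = "%Y-%m-%d %H-%M-%S"

# Hypothetical sample value, used only to demonstrate the format.
sample = "2013-01-01 00-30-00"

# Parse the string into a datetime object, then write it back in the same format.
parsed = datetime.strptime(sample, DATE_FORMAT)
roundtrip = parsed.strftime(DATE_FORMAT)

print(parsed.isoformat())
print(roundtrip)
```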
Preprocessing:
Training set
1 - Renamed columns 'LCLid', 'DateTime', 'KWH/hh (per half hour)' to 'id_series', 'date', 'value'.
2 - Dropped NaN values.
Some rows have NaN in the 'value' column. These are extra readings that fall between two regular measurements rather than on the usual half-hour interval, so they can be safely ignored. In addition, there are 5 time series, namely 'MAC001150', 'MAC005556', 'MAC005559', 'MAC005560', 'MAC005563', that have only one NaN value.
3 - Defined columns 'id_series' and 'LCLid' as 'category', cast 'time_step' to int.
4 - Created time_step column from 'date'.
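The preprocessing steps above could be sketched with pandas roughly as follows. This is a minimal illustration, not the actual script. The sample rows, the raw 'DateTime' string format, and the `preprocess` helper are assumptions made for the example. In particular, the source does not say how 'time_step' is derived from 'date'; here it is assumed to be the number of half-hour intervals since the first reading of each series. The category column (dToU or std) is left out because its raw name is not given.

```python
import pandas as pd

# Made-up rows in the raw column layout; values and timestamps are illustrative only.
raw = pd.DataFrame({
    "LCLid": ["MAC000002", "MAC000002", "MAC000002", "MAC000003", "MAC000003"],
    "DateTime": [
        "2012-10-12 00:30:00",
        "2012-10-12 00:45:10",  # extra reading between two regular measurements
        "2012-10-12 01:00:00",
        "2012-10-12 00:30:00",
        "2012-10-12 01:30:00",
    ],
    "KWH/hh (per half hour)": [0.0, None, 0.1, 0.2, 0.3],
})


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring steps 1 to 4 of the list above."""
    # Step 1: rename the raw columns.
    out = frame.rename(columns={
        "LCLid": "id_series",
        "DateTime": "date",
        "KWH/hh (per half hour)": "value",
    })
    # Step 2: drop rows whose value is NaN.
    out = out.dropna(subset=["value"]).reset_index(drop=True)
    # Step 3: store the series identifier as a pandas categorical.
    out["id_series"] = out["id_series"].astype("category")
    # Step 4 (assumed definition): half-hour steps since each series' first reading.
    out["date"] = pd.to_datetime(out["date"])
    first = out.groupby("id_series", observed=True)["date"].transform("min")
    out["time_step"] = ((out["date"] - first) // pd.Timedelta(minutes=30)).astype(int)
    return out


df = preprocess(raw)
print(df)
```

With a per-series origin like this, 'time_step' starts at 0 for every household. A shared global origin would be an equally plausible reading of the description.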