Daily total sunspot number from 1818 to 2023.
From original source:
-----
Time range: 1/1/1818 - last elapsed month (provisional values)
Data description:
Daily total sunspot number derived by the formula: R= Ns + 10 * Ng, with Ns the number of spots and Ng the number of groups counted over the entire solar disk.
No daily data are provided before 1818 because daily observations become too sparse in earlier years. Therefore, R. Wolf only compiled monthly means and yearly means for all years before 1818.
In the TXT and CSV files, the missing values are marked by -1 (valid Sunspot Number are always positive).
New scale:
The conventional 0.6 Zurich scale factor is not used anymore and A. Wolfer (Wolf's successor) is now defining the scale of the entire series. This puts the Sunspot Number at the scale of raw modern counts, instead of reducing it to the level of early counts by R. Wolf.
Error values:
Those values correspond to the standard deviation of raw numbers provided by all stations. Before 1981, the errors are estimated with the help of an auto-regressive model based on the Poissonian distribution of actual Sunspot Numbers. From 1981 onwards, the error value is the actual standard deviation of the sample of raw observations used to compute the daily value.
The standard error of the daily Sunspot Number can be computed by:
sigma/sqrt(N) where sigma is the listed standard deviation and N the number of observations for the day.
Before 1981, the number of observations is set to 1, as the Sunspot Number was then essentially the raw Wolf number from the Zurich Observatory.
-----
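As an illustration of the two formulas quoted above, here is a minimal Python sketch. The function names `wolf_number` and `standard_error` are hypothetical helpers introduced for this example and do not come from the original source.

```python
import math


def wolf_number(n_spots: int, n_groups: int) -> int:
    """Daily total sunspot number R = Ns + 10 * Ng."""
    return n_spots + 10 * n_groups


def standard_error(sigma: float, n_obs: int) -> float:
    """Standard error sigma / sqrt(N) of a daily sunspot number.

    sigma is the listed standard deviation and n_obs the number of
    observations for the day. Before 1981 n_obs is 1, so the standard
    error equals sigma.
    """
    if n_obs <= 0:
        # Days without observations have no defined standard error.
        raise ValueError("n_obs must be a positive integer")
    return sigma / math.sqrt(n_obs)


# Made-up numbers, for illustration only.
r = wolf_number(n_spots=12, n_groups=3)
se = standard_error(sigma=8.0, n_obs=16)
```

With these made-up inputs, `r` is 42 and `se` is 2.0.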
There are 6 columns:
id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d".
time_step: The index of the time step within the time series.
value_X (X from 0 to 2): The values of the time series, which will be used for the forecasting task.
Preprocessing:
1 - Kept only the data with year (column 0) <= 2023.
2 - Created the 'date' column from columns 0 (year), 1 (month) and 2 (day) in the format %Y-%m-%d.
3 - Dropped the columns (0, 1, 2, 3, 7).
Column 3 was the date as a fraction of the year, and column 7 indicated whether the data was under revision
(none of our data is under revision).
4 - Replaced values of -1 with NaN to make the missing data explicit.
5 - Dropped the rows with 'date' < 1818-01-08, as there are only NaNs for these dates.
6 - Created the column 'id_series' with value 0, as there is only one long time series.
7 - Ensured that there are no missing dates and that the frequency of the time series is daily.
8 - Created column 'time_step' with increasing values of time step for the time series.
9 - Cast columns 'value_0' and 'value_1' to float ('value_0' is always an integer, but is cast to float to accommodate NaNs)
and column 'value_2' to int. Defined 'id_series' as 'category'.
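The preprocessing steps above could be sketched with pandas roughly as follows. This is not the original preprocessing script. It assumes that the raw file is semicolon-separated with no header row, and that columns 4, 5 and 6 (the sunspot number, its standard deviation and the number of observations, a mapping inferred from the source description) become 'value_0', 'value_1' and 'value_2'. The function name `preprocess` and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample rows in the assumed raw layout:
# year; month; day; fractional year; value; std; n_obs; revision indicator
RAW = """\
1818;01;07;1818.018;-1;-1.0;0;1
1818;01;08;1818.021;65;10.2;1;1
1818;01;09;1818.023;-1;-1.0;0;1
1818;01;10;1818.026;37;7.7;1;1
"""

VALUE_COLS = ["value_0", "value_1", "value_2"]


def preprocess(raw_text: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(raw_text), sep=";", header=None)

    # 1 - Keep only rows with year <= 2023.
    df = df[df[0] <= 2023].copy()
    # 2 - Build 'date' from the year, month and day columns.
    parts = df[[0, 1, 2]].rename(columns={0: "year", 1: "month", 2: "day"})
    df["date"] = pd.to_datetime(parts)
    # 3 - Drop the columns that are no longer needed (assumed naming of the rest).
    df = df.drop(columns=[0, 1, 2, 3, 7])
    df = df.rename(columns={4: "value_0", 5: "value_1", 6: "value_2"})
    # 4 - Mark missing values (-1) as NaN.
    df[VALUE_COLS] = df[VALUE_COLS].mask(df[VALUE_COLS] == -1)
    # 5 - Drop the leading rows that contain only NaNs.
    df = df[df["date"] >= "1818-01-08"].copy()
    # 6 - Single series, so a constant id.
    df["id_series"] = 0
    # 7 - Check that consecutive dates are exactly one day apart.
    assert (df["date"].diff().dropna() == pd.Timedelta(days=1)).all()
    # 8 - Increasing time step, starting at 0 (the start value is an assumption).
    df["time_step"] = range(len(df))
    # 9 - Final dtypes; 'date' is stored as a "%Y-%m-%d" string here.
    df = df.astype({"value_0": float, "value_1": float,
                    "value_2": int, "id_series": "category"})
    df["date"] = df["date"].dt.strftime("%Y-%m-%d")

    columns = ["id_series", "date", "time_step"] + VALUE_COLS
    return df[columns].reset_index(drop=True)


result = preprocess(RAW)
```

On the sample rows, `result` has the 6 columns listed earlier and starts at 1818-01-08, with the missing day of 1818-01-09 kept as a NaN row rather than dropped.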