Tourism-competition-yearly

Status: active
Format: ARFF
License: Creative Commons Attribution 4.0 International
Visibility: public
Uploaded: 25-06-2024 by Bruno Belucci Teixeira


Tourism competition for time series forecasting, yearly data.

From the original source:
-----
The data we use include 366 monthly series, 427 quarterly series and 518 yearly series. They were supplied by both tourism bodies (such as Tourism Australia, the Hong Kong Tourism Board and Tourism New Zealand) and various academics, who had used them in previous tourism forecasting studies (please refer to the acknowledgements and details of the data sources and availability).

A subset of these series was used for evaluating the forecasting performances of the methods that use explanatory variables. There were 93 quarterly series and 129 yearly series for which we had explanatory variables available. With the exception of 34 yearly series (which represented tourism numbers by purpose of travel at a national level), all of the other series represented total tourism numbers at a country level of aggregation.

For each series we split the data into an estimation sample and a hold-out sample which was hidden from all of the co-authors. For each monthly series, the hold-out sample consisted of the 24 most recent observations; for quarterly data, it was the last 8 observations; and for yearly data it consisted of the final 4 observations. Each method was implemented (or trained) on the estimation sample, and forecasts were produced for the whole of the hold-out sample for each series. The forecasts were then compared to the actual withheld observations.
-----

There are 4 columns:
id_series: The id of the time series.
date: The date of the time series in the format "%Y-%m-%d".
time_step: The time step on the time series.
value_0: The values of the time series, which will be used for the forecasting task.

Preprocessing:

Training (in) set:
1 - Renamed the first two columns to 'n' and 'starting_year', and renamed the other columns to reflect the actual time_step of the time series.
2 - Melted the data, obtaining the columns 'time_step' and 'value_0'.
3 - Dropped nan values. The nan values correspond to time series that are shorter than the longest time series; there are no nans in the middle of a time series.
4 - Obtained the 'date' from the 'starting_year' and 'time_step'.
5 - Cast 'date' to str, 'time_step' to int and 'value_0' to float, and defined 'id_series' as 'category'.

Test (oos) set: Same as for the training set.

Finally, the training and test sets were concatenated. To reproduce the train and test sets of the competition, use the last N points of each series as the test set, where N is 24 for the monthly dataset, 8 for the quarterly dataset and 4 for the yearly dataset.
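The hold-out rule described above (last N points of each series as the test set) could be sketched with pandas as follows. This is only an illustration, not the uploader's code: the toy frame, its values, and the helper name `split_holdout` are assumptions, and only the column names come from the description.

```python
import pandas as pd

# Toy stand-in for the dataset: two short series with made-up values.
df = pd.DataFrame({
    "id_series": ["Y1"] * 6 + ["Y2"] * 5,
    "time_step": list(range(6)) + list(range(5)),
    "value_0": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

def split_holdout(frame: pd.DataFrame, horizon: int):
    """Return (train, test); test holds the last `horizon` rows of each series."""
    ordered = frame.sort_values(["id_series", "time_step"])
    # Position of each row counted from the end of its series (0 = most recent).
    rank_from_end = ordered.groupby("id_series").cumcount(ascending=False)
    is_test = rank_from_end < horizon
    return ordered[~is_test], ordered[is_test]

# N = 4 for the yearly dataset (24 for monthly, 8 for quarterly).
train, test = split_holdout(df, horizon=4)
```

Counting rows from the end of each series, rather than filtering on `date`, keeps the split correct when series start in different years or have different lengths.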

4 features

id_series (nominal): 518 unique values, 0 missing
value_0 (numeric): 10488 unique values, 0 missing
date (string): 48 unique values, 0 missing
time_step (numeric): 47 unique values, 0 missing
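For readers who want to see how a table with these four columns might be derived from a wide file, here is a minimal pandas sketch of the preprocessing steps listed in the description. The wide layout, the toy values, and the convention that time_step 0 falls on January 1st of 'starting_year' are all assumptions made for illustration; the original script is not shown on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per series, right-padded with NaN.
wide = pd.DataFrame({
    "id_series": ["Y1", "Y2"],
    "n": [3, 2],                    # series length
    "starting_year": [1990, 2000],
    "0": [1.0, 4.0],                # one column per time step
    "1": [2.0, 5.0],
    "2": [3.0, np.nan],             # Y2 is shorter than Y1
})

# Melt to long format and drop the NaN padding of the shorter series.
tidy = wide.melt(
    id_vars=["id_series", "n", "starting_year"],
    var_name="time_step",
    value_name="value_0",
).dropna(subset=["value_0"])

# Cast columns to the types named in the description.
tidy["time_step"] = tidy["time_step"].astype(int)
tidy["value_0"] = tidy["value_0"].astype(float)
tidy["id_series"] = tidy["id_series"].astype("category")

# Assumed date rule for yearly data: January 1st of starting_year + time_step.
tidy["date"] = (tidy["starting_year"] + tidy["time_step"]).astype(str) + "-01-01"

tidy = (
    tidy[["id_series", "date", "time_step", "value_0"]]
    .sort_values(["id_series", "time_step"])
    .reset_index(drop=True)
)
```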

19 properties

Number of instances (rows) of the dataset: 12678
Number of attributes (columns) of the dataset: 4
Number of distinct values of the target attribute (if it is nominal): (no value listed)
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 2
Number of nominal attributes: 1
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Percentage of missing values: 0
Average class difference between consecutive instances: (no value listed)
Percentage of numeric attributes: 50
Number of attributes divided by the number of instances: 0
Percentage of nominal attributes: 25
Percentage of instances belonging to the most frequent class: (no value listed)
Number of instances belonging to the most frequent class: (no value listed)
Percentage of instances belonging to the least frequent class: (no value listed)
Number of instances belonging to the least frequent class: (no value listed)
Number of binary attributes: 0

0 tasks
