Machine learning research generally relies on good benchmarking archives. The well-known, publicly available machine learning dataset repository from the University of California Irvine (UCI) contains more than 450 datasets from various domains [Dua:2019]. This repository has benefited the development of many state-of-the-art machine learning models. A similar pattern has been observed in time series research, which has attracted considerable interest in the last decade, especially Time Series Classification (TSC) [bagnall2017great] and Time Series Forecasting (TSF) [hyndman2018brief, hyndman2008forecasting]. Research in TSC has greatly benefited from the University of California Riverside and University of East Anglia (UCR/UEA) Time Series Archives [dau2019ucr, bagnall2018uea]. The univariate TSC archive was first released in 2002 with 16 datasets to encourage a more rigorous evaluation of TSC algorithms [keogh2003need]. In 2015, it was expanded to 85 datasets, covering a wider range of problems. It was then criticised for not being representative of real-world problems, where time series often have missing values and varying lengths. Hence, the archive was recently expanded to 128 datasets that now include time series of varying lengths, un-normalised time series and time series with missing values [dau2019ucr]. The first official multivariate TSC archive [bagnall2018uea] was recently released by researchers from UEA. It contains 30 multivariate time series datasets of equal length with no missing values. Previously, there were only 12 small multivariate time series datasets from Baydogan [baydogan2015learning]. These archives have motivated the development of numerous new state-of-the-art TSC algorithms in the last five years [lines2015time, bagnall2015time, lines2016hive, lucas2019proximity, shifaz2019ts, fawaz2019deep, dempster2019rocket], each more accurate than its predecessors.
In contrast, advances in Time Series Forecasting have relied on forecasting competitions [hyndman2018brief]. The most popular are the Makridakis competitions, also known as the M-competitions, which were started by Spyros Makridakis and Michèle Hibon [hyndman2018brief, makridakis1982accuracy, makridakis1993m2, makridakis2000m3, makridakis2018m4, makridakis2020m4, mofc_2020]. They were among the first researchers to put together a collection of time series, 111 in total, to compare different forecasting methods [hyndman2018brief]. In 1982, they held the first M-competition, involving 1001 time series and comparing 15 TSF models [makridakis1982accuracy]. This competition motivated researchers to focus more on models that give good forecasts and to treat TSF as a different problem from time series analysis [hyndman2018brief, makridakis1982accuracy]. The competitions continue to this day, with 3003 time series in the M3 competition [makridakis2000m3] and 100,000 time series in the recent M4 competition [makridakis2018m4, makridakis2020m4]. Both of these competitions involved time series of varying lengths, taken from business, demography, finance and economics. There are also other competitions, such as the NN3 and NN5 Neural Network competitions [crone2011advances] and a few Kaggle competitions [athanasopoulos2011tourism, web_traffic_forecast].
Each year, thousands of papers proposing new algorithms for (a) TSC – to predict a discrete label of the time series and (b) TSF – to predict some continuous values of a time series in the future using recent and seasonal values, have utilised the benchmarking archives mentioned earlier. These algorithms are designed for these specific problems, but may not be useful for other problems. For example, TSC and TSF methods are not suitable if we wish to predict the heart rate of a person using photoplethysmogram (PPG) and accelerometer data [reiss2019deep], which is a continuous value, not a future value and does not depend on the recent PPG data.
We refer to this problem as Time Series Regression (TSR). This should not be confused with the usage in the TSF community, where the term Time Series Regression usually means fitting the historical time series data with a regression model, such as an Autoregressive Integrated Moving Average (ARIMA) model [box1970time] or Exponential Smoothing [gardner1985exponential, hyndman2008forecasting]. These models are fitted to recent and/or seasonal values of the time series and extrapolated to forecast future values. Here, we are interested in the more general task of predicting a single continuous value from a univariate or multivariate time series. The predicted value may come from the same time series or may not be directly related to the predictor time series, and it does not need to be a future value or depend heavily on recent values. Note that if the value of interest is a future value, the task becomes a TSF problem, and if it is a discrete value, the task becomes a TSC problem.
To the best of our knowledge, research into TSR has received much less attention in the time series research community, and there are no models developed for general time series regression problems. Most models are developed for a specific problem [reiss2019deep, zhang2015photoplethysmography, zhang2014troika]. In the machine learning research community, this task is commonly referred to as “regression”, where a single continuous value is predicted from a set of features [sammut2011encyclopedia]. These features are derived from the data and are usually not correlated with each other or related in time. Features that are highly correlated are typically treated as redundant, i.e. only one of them is sufficient to achieve similar performance. In our context, each feature is a time series (a sequence of values) instead of a single value.
We aim to motivate and support the research into TSR by introducing the first TSR benchmarking archive. This archive contains 19 datasets from different domains, with varying number of dimensions, unequal length dimensions and missing values. The rest of this paper is organised as follows. In Section 2, we describe the datasets that are in the archive. Section 3 sets a baseline to the datasets by adapting state-of-the-art TSC and machine learning regression models. Finally, in Section 4, we summarise our contribution and give some direction for future work.
This section outlines the datasets in this TSR archive. The current archive contains 19 time series datasets as shown in Table 1. They are available online at http://timeseriesregression.org/. The archive contains 8 datasets adapted from the UCI machine learning repository [Dua:2019], 3 from Physionet, 1 from a signal processing competition [zhang2014troika], 1 from the World Health Organisation (WHO), 1 from the Australian Bureau of Meteorology (BOM) and the rest are donations.
|Type|Dataset|Train size|Test size|Length|No. of dimensions|Missing values|
This archive currently covers 5 application areas: Energy Monitoring, Environment Monitoring, Health Monitoring, Sentiment Analysis and Forecasting. The datasets are stored in the .ts format (https://alan-turing-institute.github.io/sktime/examples/loading_data.html) used in the tsml (https://github.com/uea-machine-learning/tsml) and sktime (https://github.com/alan-turing-institute/sktime) time series machine learning repositories. An example of loading the data into Python can be found on the sktime website and on our GitHub page (https://github.com/ChangWeiTan/TSRegression). Missing values in the original datasets were not imputed and are represented by the ‘?’ symbol, following the .ts convention used in the UCR/UEA archives [dau2019ucr, bagnall2018uea]. For a fair comparison of regression models, we split each dataset in the archive into predefined train and test sets, as outlined in the following sections.
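To make the missing-value convention concrete, the following minimal sketch parses a single data line. It assumes the usual .ts layout in which dimensions are separated by ':', values within a dimension by ',', and the final field holds the target value. The function name `parse_ts_line` is hypothetical and not part of sktime or tsml; the loaders shipped with those libraries should be preferred for real use.

```python
import math


def parse_ts_line(line):
    """Parse one data line of a .ts file into (dimensions, target).

    Assumed layout: dim_1:dim_2:...:dim_D:target, with comma-separated
    values inside each dimension and '?' marking a missing value.
    """
    *dim_fields, target_field = line.strip().split(":")
    dimensions = [
        [math.nan if v.strip() == "?" else float(v) for v in field.split(",")]
        for field in dim_fields
    ]
    return dimensions, float(target_field)


# A made-up two-dimensional instance with one missing value.
dims, target = parse_ts_line("1.0,?,3.0:4.0,5.0,6.0:0.75")
```

Missing values are kept as NaN here so that a later step can decide how to impute them.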
2.1 Energy monitoring
Energy monitoring monitors the energy usage of a building by collecting various data such as temperature, humidity, rain, voltage and current readings from sensors attached all over a building. These data are collected in the form of time series and are mapped to the power consumption of the building. For example, higher power consumption will be observed when the temperature is low, during winter months as more energy is required to heat up a building. They are then used to optimise the energy usage which can save millions of dollars for a large building. In this section, we explain three datasets for energy monitoring obtained from two sources.
2.1.1 Appliances energy prediction
Candanedo et al. [candanedo2017data] studied models for predicting the energy usage of appliances. The authors monitored the temperature and humidity of different rooms in a house for 4.5 months using a ZigBee wireless sensor network, illustrated in Figure 1(a). They measured the temperature and humidity of the kitchen, living room, laundry, office, bathroom, ironing room, 2 bedrooms and the outside of the house. Figure 1(b) shows an example of this layout. Weather data from a nearby airport station, Chievres Airport, Belgium, were also used to improve the predictions. The ground truth, appliances energy, was recorded with m-bus energy meters at 10-minute intervals. Data filtering and feature ranking techniques were discussed in the paper to remove non-predictive parameters [candanedo2017data]. Then, four models, (1) multiple linear regression, (2) support vector machine with radial kernel, (3) random forest and (4) gradient boosting machine (GBM), were evaluated on the collected data. Although the results showed that GBM performed the best, it was only able to explain 57% of the variance (R²) in the test set. This implies the need for better predictive models.
We created the AppliancesEnergy dataset from this data and reformulated the problem. The goal of the original paper [candanedo2017data] was to predict the instantaneous energy usage given all the sensor measurements at a given time point. Instead, we formulate the problem as follows: given the daily time series of each sensor, measured at 10-minute intervals, predict the total daily appliances energy in kWh. Each daily time series therefore has 144 data points, as shown in Figure 2. Each time series in our AppliancesEnergy dataset consists of 24 variables. The variables are the temperature and humidity measurements of each room in the house and the weather data obtained from the nearby airport. This dataset is split into train and test sets by randomly sampling 70% as train and the remaining 30% as test. This results in 96 train time series and 42 test time series.
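The reformulation can be illustrated with a short sketch that cuts aligned 10-minute readings into whole days and sums the energy of each day. It is not the authors' preprocessing code: the function name `to_daily_instances` is hypothetical, and the sketch assumes that the energy readings are given in Wh per 10-minute interval and that the series starts at midnight.

```python
POINTS_PER_DAY = 24 * 60 // 10  # 144 readings per day at 10-minute intervals


def to_daily_instances(sensor_series, energy_wh):
    """Cut aligned 10-minute readings into whole days.

    sensor_series: list of per-sensor lists, all of the same length.
    energy_wh: appliance energy per 10-minute interval (assumed to be in Wh).
    Returns a list of (daily_multivariate_series, total_daily_kwh) pairs.
    """
    n_days = len(energy_wh) // POINTS_PER_DAY  # a trailing partial day is dropped
    instances = []
    for day in range(n_days):
        start, stop = day * POINTS_PER_DAY, (day + 1) * POINTS_PER_DAY
        x = [sensor[start:stop] for sensor in sensor_series]
        y = sum(energy_wh[start:stop]) / 1000.0  # Wh -> kWh
        instances.append((x, y))
    return instances


# Synthetic example: 2 sensors, 300 readings, i.e. 2 full days plus a remainder.
sensors = [[20.0] * 300, [45.0] * 300]
energy = [50.0] * 300
instances = to_daily_instances(sensors, energy)
```

Each resulting instance holds one list of 144 values per sensor, together with a single daily target.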
2.1.2 Individual household electric power consumption
This dataset was sourced from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption) [Dua:2019]. It contains 2 million measurements gathered over a period of 47 months, between December 2006 and November 2010. The data record, at one-minute intervals, the global active and reactive power, voltage, current and sub-meter energy of a house located in Sceaux, 7 km from Paris. This dataset is converted into a TSR problem for this archive, using the daily voltage, current and 3 sub-metering series to predict the total daily active and reactive power consumption. Figure 3 shows an example of such a time series, with five variables and a length of 1440. The HouseholdPowerConsumption1 and HouseholdPowerConsumption2 datasets are used for active and reactive power prediction, respectively. The datasets are split into train and test sets by taking all measurements before 2009 as train and those from 2009 onwards as test. Each dataset has 746 train and 694 test time series.
2.2 Environment monitoring
Environment monitoring has become more important than ever with climate change getting more serious. It is the task to predict anything related to our environment such as pollution level, rainfall, crop yield and flood water level. This section outlines eight environment monitoring datasets in this archive, obtained from four sources.
2.2.1 Air quality
One of the main applications of environment monitoring is to predict air quality for pollution monitoring. De Vito et al. studied the calibration of chemical sensors for benzene estimation to monitor air pollution in an Italian city [de2008field]. They used five metal oxide chemical sensors embedded in an air quality chemical multisensor device to record the hourly air pollutant concentrations from March 2004 to February 2005 [de2008field]. The ground truth concentrations for 5 atmospheric pollutants, Carbon Monoxide, Non-Methanic Hydrocarbons, Benzene, Nitrogen Oxides and Nitrogen Dioxide, were obtained from a fixed weather station. Besides the pollutants, local temperature, relative humidity and absolute humidity data were also recorded.
We created the BenzeneConcentration dataset using the dataset provided in the study [de2008field]. Apart from the five chemical sensors, this dataset also uses the temperature, relative humidity and absolute humidity data, forming an 8-dimensional time series dataset. As the data was originally used for calibrating chemical sensors, we formulate the regression problem as predicting the benzene concentration for the current hour using the hourly measurements from the last 10 days, forming a time series of length 240. The 10-days window was found to give good calibration results from the paper [de2008field]. Note that the 10-day segment will not be used if the benzene concentration for the current hour is missing. Figure 4 shows an example of the time series measurements used to estimate benzene concentration in the Italian city. The training set consists of the initial 8 months data while the remaining are used as test set. This results in 3433 training instances and 5445 test instances.
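The windowing described above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the archive: the function name `lookback_instances` is hypothetical, missing targets are assumed to be encoded as NaN, and the window is assumed to end at (and include) the target hour.

```python
import math

WINDOW = 10 * 24  # 10 days of hourly readings = 240 points


def lookback_instances(sensors, benzene):
    """Build (window, target) pairs from hourly sensor data.

    sensors: list of per-channel lists of hourly readings.
    benzene: hourly reference concentrations, NaN where missing.
    """
    instances = []
    for t in range(WINDOW - 1, len(benzene)):
        if math.isnan(benzene[t]):
            continue  # a segment is not used when its target is missing
        window = [channel[t - WINDOW + 1 : t + 1] for channel in sensors]
        instances.append((window, benzene[t]))
    return instances


# Synthetic example: one channel, 250 hours, one missing target.
sensors = [[float(i) for i in range(250)]]
benzene = [1.0] * 250
benzene[245] = math.nan
instances = lookback_instances(sensors, benzene)
```

Consecutive instances overlap by 239 hours, which is why a chronological rather than a random train/test split is a natural choice for this kind of data.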
2.2.2 Beijing multi-site air quality
The capital of China, Beijing, is one of the cities in the world with the worst air pollution levels. Numerous studies have been conducted to study and reduce the air pollution level in Beijing [zhang2017cautionary]. In 2017, the Beijing Municipal Environmental Monitoring Center (BMEMC) reported a reduction of 9.9% in fine particulate matter (PM2.5) level from the previous year [zhang2017cautionary]. However, a study conducted by Zhang et al. [zhang2017cautionary], which examined the past 4 years of Beijing’s PM2.5 and PM10 data at 36 monitoring sites, shows that there was uncertainty in the report. Hourly air pollutant data such as SO2, NO2, CO and O3 concentrations, as well as meteorological data from the air quality monitoring sites, were used in the study.
The BeijingPm25Quality and BeijingPm10Quality datasets are created from the dataset provided by [zhang2017cautionary]. The goal is to predict both PM2.5 and PM10 level of Beijing using 9-dimensional time series that measures the four daily air pollutants as well as five meteorological data (temperature, dewpoint temperature, wind speed, pressure and rain amount) from 12 air-quality monitoring sites in Beijing. Figure 5 shows an example of the 9-dimensional time series in these datasets used to predict the PM2.5 and PM10 level in the city of Beijing. The training set consists of all time series taken before the year 2016, with a total of 12432 time series. The test set contains 5100 time series which consists of measurements taken after the year 2016.
2.2.3 Live fuel moisture content
Apart from pollution monitoring, bush fire monitoring is also an important application of environment monitoring. One way to monitor bush fire risk is to monitor the moisture in the vegetation, i.e. the ratio between the weight of water in vegetation and the weight of the dry part of vegetation (information that is obtained by sampling vegetation in the field, weighing it and drying it to weigh it again). This is known as the live fuel moisture content (LFMC) and is an important variable, as the risk of fire increases very rapidly as soon as the LFMC goes below 80% [yebra2018fuel]. We have obtained an LFMC database from researchers at Monash University who are working on developing models to predict LFMC values. They used the Globe-LFMC dataset as the ground truth. Globe-LFMC is an extensive global database of LFMC containing 161,717 instances, measured from 1383 sampling sites in 11 countries. One year of daily reflectance data at 7 spectral bands (459 nm to 2155 nm) before the LFMC sampling date, from the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite sensor, is one of the inputs to their model. The elevation, slope and aspect of the sampling site, extracted from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model Version 3 (GDEM 003), are also considered.
We created the LiveFuelMoistureContent dataset from the database by stratified sampling of 5003 instances across the United States. Figure 6 shows an example of the retrieved LFMC values for the United States. Stratified sampling ensures that the land cover classes in the dataset are well distributed and balanced. Then 70% of the dataset is randomly selected as the training set, giving 3493 training time series and 1510 test time series. Figure 7 illustrates the time series with 7 spectral bands in the dataset.
2.2.4 Flood modeling
Flood modelling consists of solving simplified fluid dynamics equations over a given domain or DEM (Digital Elevation Model) in response to a rainfall time series event. Different processes can happen once the precipitation reaches the ground: infiltration, evaporation, transpiration, interaction with a pipe network or hydraulic structures, and what is left is called surface runoff. Runoff describes the path of water following the slope of the terrain and ending up in a stream. Typically, the input rainfall time series and the 2D DEM topography are passed into a software package (Lisflood-FP) that solves the fluid dynamics equations and outputs the water flow (m/s) and water depth (in m). In flood studies, researchers are mostly interested in knowing if, where and how much the domain will get flooded, and in producing flood maps. A flood map is a distributed view of the maximum water depth reached due to a given rainfall event. Figures 8(a) and 8(b) show an example of a simple synthetic DEM with a water stream in the middle and a rainfall event that leads to a maximum water depth of 0.441 m, respectively.
For the archive, we obtained three synthetic DEMs and rainfall events from flood studies researchers at Monash University. Synthetic data were used because real DEM data paired with accurate rainfall events are rare. These DEMs consist of a square grid with different types of terrain and a water stream in the middle of the DEM, as shown in Figure 8(a). Each rainfall time series event for a DEM, illustrated in Figure 8(b), is paired with the maximum water height it produces near the outlet of the DEM. We created the FloodModeling1, FloodModeling2 and FloodModeling3 datasets from these synthetic DEMs, each of them having a different number of rainfall events. All three datasets are split into train and test sets by randomly sampling 70% as the training set. FloodModeling1 has 471 training and 202 test time series. FloodModeling2 has 389 training and 167 test time series. FloodModeling3 has 429 training and 184 test time series.
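A random 70/30 split of this kind can be sketched as below. The helper `random_split`, its `train_percent` parameter and the seed are hypothetical, and the rounding rule (flooring the training size) is an assumption; it happens to reproduce the train and test sizes reported for the three flood datasets.

```python
import random


def random_split(instances, train_percent=70, seed=0):
    """Shuffle the instances and split them into train and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = len(shuffled) * train_percent // 100  # floor of the training share
    return shuffled[:n_train], shuffled[n_train:]


# Total sizes are the sums of the reported train and test sizes.
sizes = {"FloodModeling1": 673, "FloodModeling2": 556, "FloodModeling3": 613}
for name, n in sizes.items():
    train, test = random_split(range(n))
    print(name, len(train), len(test))
```

Fixing the seed and publishing the resulting split, as the archive does with its predefined train and test files, keeps later comparisons between models on the same footing.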
2.2.5 Rainfall predictions
An important task in environment monitoring is to predict rainfall. The Australian Bureau of Meteorology (BOM) released a dataset (https://data.gov.au/data/dataset/weather-forecasting-verification-data-2015-05-to-2016-04) that contains a year of temperature and rainfall data from May 2015 to April 2016. The dataset was collected from 518 weather stations throughout Australia. This dataset was aggregated into hourly values and was used for the comparison and verification of temperature and rainfall forecasts. The dataset contains the hourly average, maximum and minimum temperature as well as the rainfall amounts. We adapted this dataset to create the AustraliaRainfall dataset to predict the total daily rainfall using 24 hours of temperature measurements. Figure 9 shows an example of the air temperature measured from a Western Australia weather station and the total daily rainfall. The dataset is split into training and test sets by randomly sampling 70% as the training set. There are 112186 and 48081 time series in the train and test sets, respectively.
2.3 Health monitoring
Health monitoring is the task of monitoring the health or vital signs of an individual using some devices. For example, estimating the heart rate, respiratory rate and blood oxygen saturation level. The data typically comes from a wearable device that can be attached to the subject, such as a photoplethysmogram (PPG), electrocardiogram (ECG), electroencephalogram (EEG) or accelerometers but could also come from medical devices. In this section, we describe five health monitoring datasets that come from three sources. These datasets focus on three tasks, estimating heart rate, respiratory rate and blood oxygen saturation level.
PPG sensors are now widely used in many smart wearable devices such as the Fitbit and Apple Watch to measure heart rate [reiss2019deep]. Although ECG is more precise in determining the heart rate, it is cumbersome in daily life settings [reiss2019deep]. PPG-based heart rate estimation is still a challenging task [reiss2019deep]. Previous methods of estimating the heart rate from PPG sensors mostly rely on spectral analysis [zhang2014troika, zhang2015photoplethysmography, salehizadeh2016novel, schack2017computationally] and are not very accurate [reiss2019deep]. The authors of [reiss2019deep] proposed a convolutional neural network based approach that takes the signal in the frequency domain as input. They showed that their approach is significantly more accurate than the existing spectral methods.
We adapted the original PPG-DaLiA dataset from the UCI machine learning repository [Dua:2019] and created the PPGDalia dataset for our TSR archive. PPG-DaLiA contains a single channel PPG and 3D accelerometer motion data recorded from 15 subjects performing a wide range of real-life activities, creating a 4 dimensional time series. The measurements from each subject are then segmented into 8 seconds windows with 6 seconds overlaps [reiss2019deep]. In PPGDalia, subjects 1 to 10 are selected to be in the training set and the remaining are in the test set, resulting in 43215 train instances and 21482 test instances. Figure 10 shows an example of the PPG signal and accelerometer signals in the dataset. Note that the time series in the PPG and accelerometer channels have different lengths due to different sampling rate of 64Hz and 32Hz respectively.
2.3.2 IEEE Signal Processing Cup 2015
In 2015, IEEE organised a signal processing competition (IEEE Signal Processing Cup 2015: Heart Rate Monitoring During Physical Exercise Using Wrist-Type Photoplethysmographic (PPG) Signals) to monitor heart rate using wrist-type PPG signals [zhang2014troika], similar to those recorded by an Apple Watch. They released a dataset that contains 2-channel PPG signals, 3-axis acceleration signals and 1-channel ECG signals, all sampled at 125Hz. The dataset was recorded from 12 subjects aged between 18 and 35 years old, running on a treadmill with changing speeds [zhang2014troika]. The owner of the dataset proposed a spectral analysis method to estimate heart rate from PPG signals [zhang2014troika].
We modified and created the IEEEPPG dataset, a 5 dimensional time series from the original dataset, using the 2 PPG signals and 3-axis accelerometer signals. The original train/test split was used, resulting in 1768 train instances and 1328 test instances. Similar to PPGDalia, the signals are segmented into 8 seconds windows with 6 seconds overlaps. With a sampling rate of 125Hz, all the time series have a length of 1000. Figure 11 shows an example of the measurements obtained from the PPG and acceleration sensors in the dataset.
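The segmentation arithmetic can be checked with a small sketch. The function name `segment` and its parameter names are hypothetical; the defaults follow the values stated above (125Hz sampling, 8-second windows, 6-second overlap), which give windows of 8 × 125 = 1000 samples shifted by 2 × 125 = 250 samples.

```python
def segment(signal, sampling_rate_hz=125, window_s=8, overlap_s=6):
    """Split a 1-D signal into fixed-length, overlapping windows."""
    length = window_s * sampling_rate_hz              # 8 s * 125 Hz = 1000 samples
    step = (window_s - overlap_s) * sampling_rate_hz  # 2 s shift = 250 samples
    return [
        signal[start:start + length]
        for start in range(0, len(signal) - length + 1, step)
    ]


# Synthetic 16-second recording (2000 samples).
windows = segment(list(range(2000)))
```

Windows that would run past the end of the recording are dropped in this sketch; whether the original preprocessing pads or drops the final partial window is not stated in the text.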
2.3.3 The Beth Israel Deaconess Medical Centre (BIDMC) PPG and Respiration
Apart from measuring the heart rate, PPG sensors can also be used to measure other vitals such as the respiratory rate (RR) and blood oxygen saturation level (SpO2) of an individual [pimentel2016toward]. Typically, PPG sensors are not very accurate in estimating the respiratory rate of an individual, as they fail to distinguish between periods of high-quality and low-quality input data [pimentel2016toward]. The study by [pimentel2016toward] claimed that existing systems were not robust enough for clinical practice. Hence, the authors proposed a method based on multiple autoregressive models to improve the robustness of estimating RR from PPG sensors. The proposed method was able to achieve accuracy comparable to existing methods whilst providing estimates for the majority of the data. They extracted a dataset (https://physionet.org/content/bidmc/1.0.0/) from the larger “MIMIC II matched waveform Database” that contains physiological signals such as PPG and ECG data of the patients, sampled at 125Hz. The data were then manually annotated with the heart rate (HR), RR and SpO2 of the patients at 1-second intervals.
For this archive, we adapted the original dataset from [pimentel2016toward] and created the BIDMC32HR, BIDMC32RR and BIDMC32SpO2 datasets to estimate the HR, RR and SpO2 of a patient using PPG and ECG time series data. Following the same procedure as in the paper [pimentel2016toward], the PPG and ECG data were converted into time series using a 32-second sliding window, illustrated in Figure 12. The average HR, RR and SpO2 in the 32-second window are used as the target for each time series. The datasets are split into train and test sets by randomly selecting 30% as the test set. As a result, BIDMC32HR consists of 5550 and 2399 train and test time series respectively; BIDMC32RR consists of 5471 and 2399 train and test time series; and BIDMC32SpO2 consists of 5550 and 2399 train and test time series. The difference in the number of training time series is due to missing values in the annotated HR, RR and SpO2, which are not included in the datasets.
2.4 Sentiment analysis
Sentiment analysis is the interpretation and classification of emotions (positive, negative or neutral) within some text using text analysis techniques. This is typically done by analysing text comments or posts on websites and social media platforms [moniz2018multi]. This section describes two sentiment analysis datasets in this archive.
2.4.1 News popularity in multiple social media platforms
A dataset containing 100,000 news items on four topics (economy, microsoft, obama and palestine) was released by [moniz2018multi] and is available in the UCI Machine Learning repository [Dua:2019]. The dataset also contains the respective social feedback on 3 social media platforms: Facebook, Google+ and LinkedIn. The dataset was collected over a period of 8 months, between November 2015 and July 2016. Sentiment analysis has traditionally been done using natural language processing techniques. Here we attempt a different approach, predicting the sentiment score of the news headline and news title by analysing the number of reactions on the social media platforms over a period of 2 days with time series analysis.
We created two TSR datasets NewsHeadlineSentiment and NewsTitleSentiment from the original news popularity dataset [moniz2018multi]. The datasets contain 3-dimensional time series that measure the number of reactions to the news on the 3 social media platforms. The number of reactions was recorded at 20 minutes intervals, resulting in time series of 144 datapoints in length. Figure 13 shows an example of the time series in both datasets where the target variables are the sentiment scores for news headline and news title, respectively. 70% of the dataset are randomly selected to be in the training set with 30% in the test set, resulting in 58213 training instances and 24951 test instances.
Time series forecasting is the task of predicting future values based on some recent and/or seasonal values. Typically, a model such as ARIMA is fitted to the historical data and extrapolated into the future. TSR can be seen as a general case of forecasting, where the goal is to predict a continuous value that is not necessarily a future value and does not necessarily depend heavily on recent values. Thus, we included in this archive a dataset that could easily be solved with forecasting models, to show that forecasting tasks can also be tackled using TSR models.
In 2020, the world was hit by the Covid-19 pandemic, one of the worst pandemics of the last century. The disease is very contagious and spreads rapidly. Within 6 months, by June 2020, there were more than 7 million cases and 430 thousand deaths worldwide. The pandemic has also caused an economic downturn in many countries. In this archive, we created the Covid3Month dataset, which consists of the total daily confirmed numbers of Covid-19 cases in most countries from January to March 2020. The goal of this dataset is to predict the death rate for each country on 1 April 2020 using the daily confirmed cases for the past 3 months, illustrated in Figure 14. Note that the death rate was as high as 12.8% in some countries, such as Italy. The numbers are obtained from the Covid-19 database of the World Health Organisation (WHO) (https://covid19.who.int/). The dataset is split into train and test sets by randomly sampling 70% as train. We note that the number of confirmed cases alone is not sufficient to provide an accurate prediction, and we will be working on expanding the dimensions of the dataset with more indicators to provide a more realistic dataset.
In this section, we aim to set a baseline to the existing datasets in our archive. As there are no existing TSR techniques to our knowledge, we applied standard machine learning regression models and state-of-the-art TSC models to benchmark the datasets. We evaluate and benchmark the following regression models:
Support Vector Regression (SVR) with RBF kernel [drucker1997support]
Random Forest (RF) [breiman2001random] with 100 trees
Extreme Gradient Boosting (XGBoost)[chen2016xgboost] with 100 trees
Time series k-NN using Euclidean distance with k = 1 and k = 5 (1-NN-ED and 5-NN-ED)
Time series k-NN using DTW distance with k = 1 and k = 5 (1-NN-DTW and 5-NN-DTW)
Fully Convolutional Networks (FCN) [fawaz2019deep]
Residual Network (ResNet) [fawaz2019deep]
Inception Network [fawaz2019inceptiontime]
Random Convolutional Kernel Transform (Rocket) [dempster2019rocket]
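The nearest-neighbour baselines in the list above can be illustrated with a minimal sketch for univariate series. It is not the authors' Java implementation: the function names are hypothetical, the DTW variant is unconstrained (no warping window) with squared point-wise costs, and the prediction is assumed to be the mean target of the k nearest training series.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length univariate series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Unconstrained DTW distance with squared point-wise costs."""
    prev = [0.0] + [math.inf] * len(b)  # row 0 of the cumulative cost matrix
    for x in a:
        curr = [math.inf]               # column 0 is unreachable after row 0
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            curr.append(cost + min(prev[j], prev[j - 1], curr[j - 1]))
        prev = curr
    return math.sqrt(prev[-1])


def knn_predict(train_x, train_y, query, k=1, distance=euclidean):
    """Predict the mean target of the k training series nearest to the query."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: distance(xy[0], query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Two made-up training series and a query that is a shifted copy of the first.
train_x = [[0, 0, 1, 2, 1, 0], [3, 3, 3, 3, 3, 3]]
train_y = [10.0, 20.0]
query = [0, 1, 2, 1, 0, 0]
prediction = knn_predict(train_x, train_y, query, k=1, distance=dtw)
```

DTW can align the shifted peak in the query with the peak in the first training series, whereas the Euclidean distance compares the two series point by point.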
Missing values in the time series are linearly interpolated. When using a traditional (i.e. non-temporal) regression model, each time series is flattened into a single long feature vector of length $D \times L$, where $D$ is the number of dimensions in the series and $L$ is the length of the time series.
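The two preprocessing steps can be sketched in plain Python as follows. The function names are hypothetical, missing values are assumed to be encoded as NaN, and gaps at the start or end of a series are filled with the nearest observed value, which is an assumption because the text does not say how boundary gaps are handled.

```python
import math


def interpolate_missing(series):
    """Linearly interpolate NaN values in a univariate series."""
    known = [i for i, v in enumerate(series) if not math.isnan(v)]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if not math.isnan(value):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:       # leading gap: copy the first observed value
            filled[i] = series[right]
        elif right is None:    # trailing gap: copy the last observed value
            filled[i] = series[left]
        else:                  # interior gap: straight line between neighbours
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def flatten(instance):
    """Concatenate all dimensions of one instance into a single feature vector."""
    return [v for dimension in instance for v in interpolate_missing(dimension)]


# A made-up instance with 2 dimensions of length 3 gives a vector of length 6.
features = flatten([[1.0, math.nan, 3.0], [4.0, 5.0, 6.0]])
```

Flattening discards the ordering information within and across dimensions, which is one reason the non-temporal models serve only as baselines here.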
We used the standard Scikit-Learn Python library [scikit-learn] to implement the SVR and RF models. The default parameters are used for the SVR model (in Scikit-Learn, these include $C = 1$ and $\epsilon = 0.1$). XGBoost was implemented using the Python XGBoost library (https://xgboost.readthedocs.io/en/latest/python/python_intro.html). Apart from the number of trees, we use the default parameters for both RF and XGBoost from the Python libraries. We adapted the code from [fawaz2019deep] (https://github.com/hfawaz/dl-4-tsc) for both ResNet and FCN, and from [fawaz2019inceptiontime] (https://github.com/hfawaz/InceptionTime) for the Inception Network. The code for Rocket was taken from [dempster2019rocket] (https://github.com/angus924/rocket) and modified for multivariate time series with help from the authors. The multivariate version of Rocket applies the transformation to each dimension independently.
The time series nearest neighbour algorithms were all implemented in Java. Our source code is available online at https://github.com/ChangWeiTan/TSRegression.
Since some of the models are non-deterministic, we evaluate all the models over 5 runs and report the average root mean squared error (RMSE), one of the most widely used metrics for regression tasks. Equation 1 gives the formal definition of RMSE, where $n$ is the number of instances, and $y_i$ and $\hat{y}_i$ are the actual and predicted targets of instance $i$, respectively.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (1)$$
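The evaluation metric can be written in a few lines of plain Python. The function names `rmse` and `mean_rmse` are hypothetical; `mean_rmse` averages the RMSE over several runs of a model on the same test set, as described above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted targets."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((y - p) ** 2 for y, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))


def mean_rmse(actual, predictions_per_run):
    """Average the RMSE over several runs of a non-deterministic model."""
    scores = [rmse(actual, predicted) for predicted in predictions_per_run]
    return sum(scores) / len(scores)


# Made-up targets and the predictions from two runs.
score = mean_rmse([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]])
```

Because the errors are squared before averaging, RMSE penalises a few large errors more heavily than many small ones, and it is expressed in the same unit as the target.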
We compare the models statistically over the current datasets following the recommendations from [demvsar2006statistical]. First, we rank each model by RMSE for every dataset. Rank 1 is assigned to the model with the lowest RMSE while rank 9 is assigned to the highest one. Fractional ranking is assigned to the models in case of ties. We then compute the average rank for each model. Then, the Friedman test [friedman1940comparison, demvsar2006statistical] is applied to the average ranks to test the null hypothesis that all models perform equally. If the null hypothesis is rejected, the post-hoc two-tailed Nemenyi test is used to compare the models to each other [demvsar2006statistical]. Using this test, the performance of two models is significantly different if their average ranks differ by at least the critical difference shown in Equation 2, where $q_\alpha$ is the critical value at significance level $\alpha$, $k$ is the number of models and $N$ is the number of datasets. This gives the critical difference used in the comparison of the models.

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \qquad (2)$$
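The ranking step and the critical difference can be sketched as follows. The function names are hypothetical, and the example uses made-up RMSE values for 3 models. The critical value 2.343 is the two-tailed Nemenyi value for 3 models at a significance level of 0.05 from the table in [demvsar2006statistical]; it is used purely for illustration and is not the value used in this paper.

```python
import math


def fractional_ranks(scores):
    """Rank scores in ascending order (rank 1 = lowest), averaging ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def average_ranks(rmse_table):
    """rmse_table[d][m] is the RMSE of model m on dataset d."""
    per_dataset = [fractional_ranks(row) for row in rmse_table]
    n_models = len(rmse_table[0])
    return [sum(r[m] for r in per_dataset) / len(per_dataset) for m in range(n_models)]


def critical_difference(q_alpha, n_models, n_datasets):
    """Nemenyi critical difference for average ranks."""
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6 * n_datasets))


# Made-up RMSE values: 2 datasets (rows) by 3 models (columns).
ranks = average_ranks([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]])
cd = critical_difference(q_alpha=2.343, n_models=3, n_datasets=19)
```

Two models would then be reported as significantly different when the gap between their average ranks is at least `cd`.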
Finally, a critical difference diagram was used to visualise the comparison, where the thick horizontal line connecting a group of models indicates that all the models in the group are not significantly different from one another [demvsar2006statistical]. Figure 15 shows the critical difference diagram of comparing the models used to benchmark the existing archive. The average ranks are indicated next to the models in the figure.
Figure 15 shows that Rocket is the most accurate model with an average rank of 3.2632 and is significantly different to SVR, NN-ED and 1-NN-DTWD. The figure also shows that there is no significant difference between the state-of-the-art time series models and the classical regression models. This suggests that a better model needs to be developed for TSR problems. We refer interested readers to our paper [tan2020time] for a more detailed discussion of the results.
4 Conclusion and Future Work
We have released the first iteration of the TSR archive that contains 19 time series datasets, and set an initial baseline on the archive using typical machine learning regression and state-of-the-art TSC models. Our results show that Rocket, one of the state-of-the-art TSC models performs the best overall. State-of-the-art machine learning models such as XGBoost and Random Forest are very competitive as well. This suggests that better models need to be developed for such TSR problems. Finally, we welcome any donations of data and will continue to expand the archive, providing a wider range of problems.
This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386–18–1–4030. The authors would like to thank the authors of [fawaz2019deep] and [dempster2019rocket] for providing their source code online. The authors also appreciate the data donation from all the donors.