1 Introduction
Analysing and learning from time series is one of the most active topics in scientific research. One of the tasks related to this topic is forecasting, which denotes the process of predicting the value of future observations given some historical data. This task is relevant to organisations across a wide range of domains of application. In many of these organisations, forecasting processes can have a significant financial impact kahn2003measure .
In the machine learning literature, it is widely accepted that the feature set used to represent a data set is a crucial component for building accurate predictive models guyon2006introduction . Hence, feature engineering is regarded as a critical step in machine learning projects. However, feature engineering is often an adhoc process. Practitioners design new features based on their domain knowledge and expertise. The limitations of current approaches to feature engineering are particularly relevant in time series forecasting, where, although evidence exists that it improves forecasting performance Oliveira2014EnsemblesFT , it is often overlooked.
Time series forecasting tasks are typically formalised using an autoregressive approach. Accordingly, observations are modelled using multiple regression; the future observations we want to predict represent the target variable, and the past lags of these observations are used as explanatory variables. Autoregression is at the core of many forecasting models in the literature, such as, for example, ARIMA (AutoRegressive Integrated Moving Average) box2015time . This approach presents an opportunity to approach feature engineering in a systematic way. Statistical features can be extracted from recent observations of time series. These features can help summarise the dynamics of the data and enrich its representation. For example, a simple feature such as the average of the past recent observations can help capture the overall level of the time series in a given point in time. If the process is adequately done, new features will lead to more accurate predictive models and better forecasting performance.
1.1 Our Approach and Paper Organisation
Despite having domain expertise, professionals often lack proper time series analysis skills taylor2018forecasting . Getting the most out of state of the art time series forecasting methods requires significant experience, and it is a complex and timeconsuming task. In this context, developing a framework for automatically extracting an optimal representation from an input time series is beneficial for practitioners with minimal technical expertise.
This paper presents and describes an automatic feature engineering approach called VEST (Vector of Statistics from Time series). VEST is specifically designed to address forecasting problems. To the best of our knowledge, this is the first approach to automatically and systematically extract features from time series.
VEST works according to the following three main steps:

Transformation: We transform the time series into several distinct representations. This process may be beneficial for describing from different perspectives the underlying process causing the time series. For example, a simple moving average transformation can be useful to remove the spurious behaviour of time series;

Summarisation: Each representation is summarised using statistics (e.g. mean, standard deviation);

Selection: The first two steps can lead to a high dimensional problem. We apply a feature selection procedure to cope with this issue. The final set of features is concatenated with the original representation (before any transformation) to learn a regression model for forecasting.
In this paper, we show significant performance gains when applying VEST, based on a case study comprised of 90 univariate time series from several domains of application. These improvements are evidenced for predicting the next point of the time series using two different learning algorithms: a variant of the model tree quinlan1993combining , and lasso tibshirani1996regression .
VEST is available online at https://github.com/vcerqueira/vest. Additionally, the code necessary to reproduce the experimental evaluation presented in the paper is made available to encourage reproducible research.
The organisation of the paper is as follows. In Section 2, we formalise the time series forecasting problems as a predictive regression task. We present a literature review in Section 3
, including topics related to the feature engineering process and its automation, feature selection, and time series representation and feature extraction. Section
4 presents VEST, formalising its main steps: transforming the original representation, summarisation using statistics, and feature selection. We show the usefulness of this framework using empirical evidence presented in Section 5. We present a discussion on the results from such experiments, challenges, and future directions in Section 6. The paper is concluded in Section 7.2 Problem Definition
A univariate time series represents a temporal sequence of values , where is the value of at time and is the length of . Forecasting denotes the process of predicting the value of the upcoming observations of the time series, , given the historical past observations, where denotes the forecasting horizon. In this work, we focus on , which means we attempt to forecast the next value of the time series. We adopt an autoregressive approach to address the problem of time series forecasting. Accordingly, observations of a time series are regressed on their past lags.
We start by reconstructing the time series as a geometric object by applying a time delay embedding using the Takens theorem Takens1981 . Then, the predictive task is framed as a multiple regression problem. We construct a set of observations of the form (, ). In each observation, the value of is modelled based on the past values before it: , where , which represents the observation we want to predict, and represents the
th embedding vector. Effectively, the time series is transformed into the data set
.The learning objective is to build a regression model that provides an approximation to an unknown function . The principle behind this approach is to model the conditional distribution of the th value of the time series given its past values: ().
3 Related Research
This section provides an overview of the literature related to the topic of this work. First, we describe the importance of feature engineering and outline automatic procedures to address this task (Section 3.1). Afterwards, we focus on time series data. We describe approaches for changing the representation of time series (Section 3.2). Finally, we overview approaches for extracting features from time series with the objective of predictive modelling (Section 3.3).
3.1 Feature Engineering
The goal of feature engineering is to enrich the representation of a data set with additional explanatory variables. The expectation is that such new predictors contain useful information and lead to more accurate predictive models guyon2006introduction .
In order to clarify the scope of our work, we split the generated variables through feature engineering into two classes: exogenous and endogenous. Exogenous variables
are those derived from an external source. Consider an example of a time series representing the number of rooms occupied per day in a hotel. Forecasting the values of such a time series is interesting to the business for different reasons (e.g. pricing). In this scenario, a simple binary variable containing the information regarding whether or not the observation to be predicted occurs during the weekend may be useful to the predictive model. Since information is not contained within the original observations (each
) we call it exogenous.Regarding endogenous variables, represents the embedding vector (Section 2) of a time series in a given point in time . We can try to derive more information from by applying some transformations or summary statistics. For example, the average of the values of in a specific period may be a useful indicator for describing the overall level of the time series at that point. As such, a new explanatory variable is generated based on the time series itself, i.e. an endogeneous feature.
In this work, we focus on the automatic discovery of endogenous features in numeric time series. The goal is to augment the representation of the embedding vectors and improve forecasting performance of predictive models.
3.1.1 Automatic Feature Engineering
Typically, feature engineering is an adhoc process guyon2006introduction ; pinto2016towards . Practitioners design features based on their domain knowledge and expertise. However, this process is timeconsuming, and requires both domain expertise and imagination.^{1}^{1}1https://www.kdnuggets.com/2019/03/whyautomlwontreplacedatascientists.html
Recent approaches have been developed to systematise the feature engineering process. This research line is designated in the literature as automatic feature engineering. Examples include the following: Deep Feature Synthesis
kanter2015deep , ExploreKit katz2016explorekit , AutoLearn kaul2017autolearn , Cognito khurana2016cognito , or OneBM lam2017one . These frameworks focus on discovering relevant features from data sets comprised of several attributes, which can be either categorical or numeric. Moreover, most of these focus solely on classification problems and i.i.d. data. In this work, we focus on univariate time series data and the problem of forecasting. As such, many of the operations we develop to discover new relevant features intend to leverage the temporal dependency among the observations in the data set.Deep learning methods can also be regarded as having an internal automatic feature engineering component. These approaches are able to learn higherorder representations based on the raw input data. Still, there are important factors which make standard feature engineering relevant. Deep learning models require a large amount of data, which is often not readily available. The internal representations of neural networks are opaque, while standard feature engineering is typically based on interpretable operations. This transparency may be important in sensitive applications. Besides, the two approaches are not incompatible, as neural networks can potentially leverage standard feature engineering, for example, to improve their learning efficiency.
3.2 Time Series Representation
Sometimes, time series are analysed using a representation that is different from the original one. Changing the representation of a time series can be beneficial for (i) reducing the dimensionality of the data, which leads to more efficient storage and processing; (ii) implicit handling of noise; and (iii) focusing on fundamental characteristics of the data esling2012time . We refer to the work by Esling and Agon esling2012time for a complete read on time series transformations.
Keogh keogh2004towards splits time series representation methods into three main types: nondata adaptive, dataadaptive, and modelbased. In nondata adaptive approaches, the parameters of the transformation are independent of the underlying data. Examples of this approach are the discrete wavelet transformation percival2006wavelet or the simple moving average. Conversely, the parameters of dataadaptive methods depend to some extent on the time series. Symbolic aggregate approximation lin2003symbolic is a wellknown approach of this sort.
Modelbased approaches work on the assumption that some underlying model generates the time series. As such, parameters of the model represent the time series. Autoregressive moving average (ARMA) chatfield2000time models are an example of this type.
3.3 Time Series Feature Extraction
Extracting features from time series has been shown to improve performance in different tasks such as forecasting and classification prudencio2004using ; christ2016distributed . We split the literature on this topic into two dimensions: sequence descriptions and subsequence descriptions. The former denotes approaches which summarise the complete set of observations available in a time series. The latter extract features from subsequences of time series, i.e. the embedding vectors.
3.3.1 Sequence descriptions
There are several approaches which extract features from the complete time series to improve forecasting performance. Examples of this are the works of Prudêncio and Ludermir prudencio2004using , Lemke et al. lemke2010meta , Barandas et al. barandas2020tsfel , or Kang et al. kang2017visualising . Often, the goal of these approaches is to use metalearning for model selection or combination. Essentially, they extract features from each time series in a given database. Then, a predictive model is created which relates the features of a time series with the most appropriate forecasting model in that data. In effect, for a new given time series, such a metalearning model can make predictions regarding which model, or set of models, is more appropriate. Recently, MonteroManso et al. montero2020fforma applied this type of approach and ranked second in the wellknown forecasting M4competition.
In the context of time series classification, Christ et al. christ2016distributed
proposed the method FRESH for feature engineering. This method automatically extracts a large number of features from each time series in the database and selects the most relevant ones for building the classifier. Fulcher et al.
fulcher2017hctsa presented a feature extraction framework called hctsa for time series analysis. This tool extracts over 7000 features. Lubba et al. lubba2019catch22 selected a a subset of 22 hctsa features, leading to a 1000fold reduction in computation time for feature extraction and only a small reduction in time series classification performance.3.3.2 Subsequence descriptions
Compared to feature extraction from complete time series, few works are leveraging these processes for time series subsequences. Paras et al. paras2009feature used a set of statistics, such as simple and exponential moving averages, to improve a neural network model for weather forecasting. Oliveira and Torgo Oliveira2014EnsemblesFT show that the average and standard deviation of recent values improve the performance of a bagging ensemble. Cerqueira et al. cerqueira2017dets later corroborated these results using heterogeneous ensembles. However, the potential of the systematic application of approaches deriving new features from subsequences of time series has never been explored.
We follow this research line and derive new features from subsequences of time series. To accomplish this, we develop VEST, a novel framework for automatic feature engineering using univariate time series. VEST extracts new features from embedding vectors representing a time series and selects the most important ones for combination with the original vector.
4 Vest: Vector of Statistics from Time series
In this section, we propose and formalise VEST (for Vector of Statistics from Time series), an automatic feature engineering process for univariate and numeric time series. VEST is specifically designed for forecasting problems. Given a time series , the goal is to predict the value of the next observation, . Following the formalisation presented in Section 2, we address time series forecasting as an autoregressive task. The th observation of a time series, , is modelled according to the th embedding vector , which represents the previous observations.
VEST is based on the manipulation of the embedding vectors representing each observation of a time series. Particularly, the method contains three main steps, addressing feature generation (steps 1 and 2) and selection (3):
We adopt an expand–reduce approach for feature engineering do2019automated . In the first two steps of the methodology, we expand the representation of the data with a large set of features. In the final step, we reduce this representation and keep only the most relevant variables.
In the next subsections, we formalise these steps in more detail. The workflow for a given instance is exemplified in Figure 1.
4.1 Feature Generation
We base our feature generation process on the manipulation of each embedding vector . In this sense, our approach is entirely endogenous. Exogenous features (e.g. holiday information) can also be essential to improve forecasting performance. However, such analysis is out of the scope of this work.
4.1.1 Transform Operations
The first step of VEST is a transformation procedure. This procedure generates new representations from the original embedding vectors . As we mentioned in Section 3.2, changing the representation of time series embedding vectors is beneficial for handling noise, and to focus on essential characteristics of the data esling2012time . We hypothesize that different representations obtained by distinct transformation operations generate new, complementary information regarding the dynamics of the time series. Therefore, this combination of these new types of information can lead to improvements in forecasting performance which cannot be obtained by using each one of them separately. Formally, a transform operation maps an embedding vector into another dimensional vector :
Essentially, is mapped onto , . Hence, is a vector which denotes the th representation of the th embedding vector.
An example of a possible transform operation is differencing, which means the differences between consecutive observations. This transformation is often applied to time series to remove the trend component.
4.1.2 Summary Operations
By applying distinct transform operations, an embedding vector has several representations (one for each such operation). The next step in the methodology is the application of summary operations. These operations compress each of the representations of () into a set of features through the application of statistical functions. A summary operation can be defined as follows:
where denotes the th summary operation, and denotes the feature obtained when applying the th summary operation to the th representation of the th embedding vector. Each is part of the set , which represents the features describing .
Essentially, each summary operation compresses a numeric vector into a scalar which summarises the current state of the time series in some way. A simple example is the arithmetic mean, which describes the central tendency.
4.2 Feature Selection
The feature generation process described above produces a large number of features. This procedure may lead to a problem of high dimensionality and, consequently, overfitting. We introduce a feature selection procedure to cope with this issue.
Depending on the nature of the time series, the features extracted may have three problems: (i) they may not be applicable, which leads to missing values; (ii) they may not vary enough across the observations and do not provide any information for forecasting; or (iii) they may be highly correlated with each other. Concerning the first problem, we remove any feature with more than a certain percentage of missing values,
. Features with a lower percentage of missing values are imputed using the median function. The second issue is dealt with by removing features with a low number of unique values. Specifically, we remove any feature whose number of unique values relative to the total number of observations is below
. Finally, we apply a filter for removing correlated features. If a pair of features shows a level of correlation above , one of them is discarded.This process leads to the final set of features, . We concatenate this set with the original embedding vector .
5 Experiments
This section presents the empirical experiments carried out to validate the proposed approach. First, we detail the transform and summary operations used in the feature generation process of VEST (Section 5.1). Then, we present the experimental setup (Section 5.2), describing the research question, case study, methods and respective hyperparameters, and evaluation approach. We compare the proposed approach with state of the art approaches in Section 5.3. We perform a feature important analysis in Section 5.4. Finally, we analyse the impact of sample size in the results obtained (Section 5.5).
5.1 Vest Setup
5.1.1 Transform and Summary Operations
The transform operations applied to each embedding vector are described in Table 1 and the summary operations applied to each representation are described in Table 2.
Operation  Description 

I  The Identity transformation, in which each is mapped onto itself 
SMA  We apply a onesided simple moving average which can be beneficial to smooth out spurious fluctuations and highlight the general trend. The number of periods is equal to the square root of the length of , rounded to the nearest unit 
DIFF  First differences are applied to transform the original embedding vector into one without trend. This transformation can help with the modelling of time series with a strong trend component 
DIFF2  Second differences, which is equivalent to applying the DIFF operation twice to . This transformation is useful for describing the curvature of the data 
BC  BoxCox transformation, for stabilising the variance of the time series. The transformation parameter is optimised using all the available observations according to Guerrero guerrero1993time (minimizing the coefficient of variation) 
SIN  Sine terms of order 1 of the Fourier series. This transformation captures the seasonality of the time series. We remark that the frequency of the time series must be available to compute these terms 
COS  Similar and complementary to SIN, COS captures the cosine terms of order 1 of the Fourier series. 
DWT  We apply a 1level discrete wavelet transform using the Daubechies wavelet percival2006wavelet , and retrieve the coefficients of the respective detail signal 
Operation  Description 

MEAN  Arithmetic mean, which is used to estimate the average level of the vector 
MDN  Median: similar to the mean, but more robust to outliers 
SD  Standard deviation, as a measure of the overall dispersion in the vector 
VAR  Variance of the vector, which also measures dispersion 
IQR  Interquartile range, which is another measure of dispersion of the data, but more robust to outliers 
RD  Relative Dispersion, which is estimated according to the ratio between the standard deviation of the vector and the standard deviation of the differenced vector wang2006characteristic 
MIN  Minimum value of the vector 
MAX  Maximum value of the vector 
LP  Last known point of the vector 
SK  Skewness of the distribution of the vector, which is a measure of its asymmetry wang2006characteristic 
KRT  Kurtosis for describing the flatness of the data with respect to a normal distribution wang2006characteristic 
P05, P95  The 5th and 95th percentiles of the vector 
ACC_1, ACC_2  Average (ACC_1) and standard deviation (ACC_2) of the acceleration of the vector, estimated according to the ratio between the simple moving average and the exponential moving average of equal period. In our experiments, the period for computing the moving averages was set to the squared root of the length of the vector, rounded to units 
BP  Level of autocorrelation, which is estimated using a BoxPierce test statistic box1970distribution ; wang2006characteristic 
PACF  Average value of the partial autocorrelation function of the vector up to 10 lags 
ACF  Average value of the autocorrelation function of the vector up to 10 lags 
LRD1 LRD2  Longrange dependence,
estimated using the Hurst exponent approach with wavelet transform with 1 (LRD1 ) and 2 moments ( LRD2) wang2006characteristic 
SLP  Slope of the vector which describes its overall steepness prudencio2004using 
NORM  Euclidean norm of the vector, which captures its total energy 
NO  Number of outliers, estimated according to the number of observations above or below 1.5 times the interquartile range 
AMP  Average amplitude of the fast Fourier transform of the vector 
STEP  Binary random variable which denotes the presence of a step change lemke2010meta . This statistic detects structural breaks in the data 
PEAK_I, PEAK_D  Number of local maxima (PEAK_I) and local minima (PEAK_D) in the vector lemke2010meta . These statistics describe the level of oscillation of the data 
OD  Overall direction of the vector, estimated by the difference between the number of times the vector increases and the number of times the vector decreases 
PV_ST, PV_LT  Shortterm and longterm variability, respectively, estimated using the Poincaré plot brennan2001existing 
MLE  Maximum Lyapunov exponent, which quantifies the chaotic level of a time series wang2006characteristic 
The set of transform operations used try to capture the dynamics of the time series from distinct perspectives. Moreover, the list of summary operations contains several statistics which try to capture different components; from centrality and dispersion to chaos and stochastic randomness.
Overall, we apply eight different transformations and 32 different summary operations, leading to 256 different features before feature selection. Henceforth, we will employ the following notation to refer to a feature generated by VEST: TransformFunction.SummaryFunction. For example DIFF.MEAN represents the average value of the embedding vector representation when transformed with the differencing operation.
Setup.
We set the value to 70. As such, we remove features which have more than 70% of its values missing. Also, we set the threshold to 1. Therefore, we remove features where the percentage of unique values is below 1% of the total number of observations. The feature correlation threshold () was set to 95.
5.2 Experimental Design
The main research question addressed in this paper is the following:
Does VEST, an automatic feature engineering procedure, improve forecasting performance relative to a pure autoregressive approach?
Our experiments to answer this question can be split into the following items:

RQ1: Effect of VEST on the predictive performance of the state of the art pure autoregressive approach. We assess the significance of results according to Bayesian methods;

RQ2: Comparison of the forecasting performance relative to state of the art approaches, such as ARIMA and exponential smoothing;

RQ3: Sensitivity to different learning algorithms;

RQ4: Analysis of the different feature selection approaches;

RQ5: Analysis of the importance scores of each transformation and summary operations (Section 5.1);

RQ6: Sensitivity to different time series sample size.
5.2.1 Data
We used time series from two sources. From the tsdl benchmark library tsdlpackage , we selected all the univariate time series with at least 1000 observations and which have no missing values. This represents 55 time series. These show a varying sampling frequency (e.g. daily) and are from different domains of application. For a complete description of these time series, we refer to the their source tsdlpackage . We also included 35 time series used by Cerqueira et al. cerqueira2019arbitrage . Essentially, from the set of 62 series used by the authors, we selected those with at least 1000 observations and that were not originally from the tsdl database (since these were already retrieved as described above). We refer to the work by Cerqueira et al. cerqueira2019arbitrage for a description of those time series.
5.2.2 Parameter Setting
For each time series, we optimise the embedding dimension using validation data, testing values from 10 to 30. The chosen embedding dimension is the one minimising the error (Section 5.2.4
describes the evaluation metric). In this analysis, we train a model according to pure autoregressive forecasting models (i.e., no feature engineering is involved at this point). We set the minimum value for searching the embedding dimension to 10 to guarantee a reasonable number of observations for computing the transform and summary operations of
VEST.We focus on two learning algorithms. One is the cubist method Cubist2014 , which is a variant of the model tree proposed by Quinlan quinlan1993combining ; This method is competitive in timedependent data ikonomovska2011learning ; cerqueira2019arbitrage . We also use the lasso tibshirani1996regression regression algorithm. Each one of the methods was optimised according to a grid search using validation data.
We will present results that quantify the importance of each feature across the 90 problems. We resort to the RReliefF robnik1997adaptation method for this task. RReliefF (for Regressional ReliefF) extends ReliefF for numerical prediction problems. It estimates the importance of each feature in a data set by measuring the variability of the values of the features in the neighbourhood the observations. This method has been shown to have a connection to impurity measures robnik1997adaptation .
5.2.3 Methods
The learning algorithms indicated above were trained according to the following procedures:

AR: A pure autoregressive process, where the value of the next set of observations is predicted according to the most recent values. This is the typical approach to tackle time series forecasting problems;

AR+VEST: The proposed approach – the combination of AR with the features obtained with VEST;

VEST: A baseline which discards the AR component and models the future behaviour of the time series using only the features obtained with VEST;

AR+BT: Variant of AR+VEST, in which the feature selection approach is different: We use the feature from only a single representation. For each time series, we pick the transformation which maximizes feature importance (according to RReliefF). The importance scores are average across the available summary operations. In other words, this variant contains all the summary operations detailed on Table 2, but are computed only for the best estimated transformation;

AR+BF: Another variant of AR+VEST, in which we select a single transformation for summary operation. This is similar to the variant AR+BT described above. The difference is that, in this case, the single transformation is picked for each summary operation. This selection is also based on feature importance. To be precise, for each time series and for each summary operation, we select the transformation that maximizes feature importance.
Additionally, we also include ARIMA, ETS, and TBATS in the experimental setup. These methods are stateoftheart approaches for time series forecasting. They establish a reference to assess whether the results obtained here are acceptable or not. We resort to the implementations provided by the forecast R package forecast , which automatically tunes these methods to an optimal parameter setting.
5.2.4 Evaluation
We use a holdout repeated in multiple testing periods as the estimation method according to cerqueira2019performanceestimation . We perform 10 repetitions of this procedure. The training size in each repetition was set to 60% of the total number of observations, while the subsequent 10% of observations are used for testing. In each repetition, part of the training data (also 10% of it) was used as a validation set to optimise parameters, such as the embedding dimension or the parameters of the learning algorithms.
Regarding the evaluation metric, we use the mean absolute scaled error (MASE), which is a typical measure of forecasting performance hyndman2006another . We average the loss of each method across the repetitions of the holdout procedure described above. We evaluate the statistical significance of the results according to a Bayesian analysis benavoli2017time . In particular, we applied the Bayes sign test to compare pairs of methods across multiple problems. In the next section, we specify the setup of the test. For a thorough read on Bayesian analysis for comparing predictive models, we refer to the work by Benavoli et al. benavoli2017time .
5.3 Results
In order to have a commensurable metric across data sets, we compute the percentage difference between the MASE of each approach and a benchmark model. We use AR+VEST as the benchmark as it represents the proposed approach that combines autoregression with automatic feature engineering. We formalise the percentage of difference computation as follows:
(1) 
where and represent the loss of the model AR+VEST and the loss of model (the one is under comparison), respectively. We perform a Bayesian analysis of the results using the Bayes sign test benavoli2017time . We define the region of practical equivalence benavoli2017time (ROPE) to be the interval [2.5, 2.5]. Essentially, this means that the performance level of these two methods are nearly indistinguishable if the percentage difference in predictive performance between them falls within this interval.
We start by analysing the average rank, and respective standard deviation, of each method. This is reported in Figure 2 using cubist as learning algorithm. A method with rank 1 in a task means that it was the best performing one in that task. The average rank describes the average position of each method relative to the remaining ones. AR+VEST shows the best average rank score, which shows the usefulness of the proposed approach.
In terms of significance analysis, Figure 3
shows the probability that the respective method wins, draws (result within the ROPE), or loses significantly, against the proposed model (
AR+VEST) also when using the cubist learning algorithm. AR+VEST significantly outperforms the standard autoregressive model (AR) with around 30% probability (RQ1). The opposite scenario occurs with around 17% probability. In the remaining cases, the results are within the ROPE, which means the approaches are statistically equivalent. AR+VEST is also significantly better relative to state of the art forecasting approaches, including ARIMA, TBATS, and ETS (RQ2). This corroborates previous experiments shown by Cerqueira et al. cerqueira2019arbitrage .These results show important evidence that featurebased forecasting is worthwhile, and may be important to improve forecasting performance.
Figures 4 and 5 are similar to Figures Figures 2 and 3, but the analysis is carried out using the lasso learning algorithm. Although not identical, the illustration shows that performance gains are also obtained when using this method (RQ3).
Besides state of the art forecasting approaches, the results also indicate that AR+VEST outperforms three variants: VEST, AR+BT and AR+BS. VEST denotes the approach that discards the autoregressive attributes and uses only the features derived from the proposed framework to forecast the next value of the time series. However, the results show that combining AR with VEST is critical to the performance obtained. By itself, VEST shows a competitive performance, but does not provide a consistent advantage.
Regarding AR+BT and AR+BS, these variants provide a different approach for selecting the features from VEST. We devised AR+BT (the approach which selects a single transformation for each time series) to show the usefulness of multiple representations in a given problem. On the other hand, by outperforming AR+BS (the approach that uses a single transformation per summary operation), we show that multiple transformations are useful for a given summary operation even within the same problem. We dealt with eventual redundancies with a simple correlation filter, as described in the experimental design (RQ4).
5.4 Feature Importance
In the previous section, we presented significant empirical evidence that the use of VEST can significantly improve forecasting performance. In this section, we dive deeper into this matter by analysing the importance of the features used in the development of models. This covers research question RQ5.
5.4.1 Rank Distributions
We start by analysing the distribution of the rank of the feature importance across the 90 time series. We proceed as follows.

We measure the RReliefF of each feature. This score is averaged across the repetitions of the repeated holdout procedure;

We compute the rank of each feature according to its score of importance across the 90 time series. A feature with rank 1 in a given time series has the best score of importance in that problem. We split the computation of ranks into three parts according to the following criteria:

All operations: We compute the rank of all features irrespective of the underlying representation;

Representation: We compute the rank of each transformation. Specifically, we average the score of the importance of the features for each representation. For example, the average importance of all features using the DIFF representation. In this analysis, we include the importance of the past lags of the series (denoted as LAG variables);

Summary function: We also compute the rank of each summary operation. Similarly as above, we average the score of importance across summary operation to obtain the overall importance of the respective function.

Overall Rank.
Figure 6 shows the results of the overall rank as a set of boxplots (one for each feature), which are ordered by median importance rank (lower values are better). The names of the features (xaxis) follows the convention described before. In the interest of conciseness, we only show the top and bottom 30 features in terms of median rank.
The feature with the best median rank is LAG.1, which represents the last known value of the time series in a given point in time. Figure 6 clearly shows the advantage of methods to systematically generate large numbers of new features, when compared to, typically, a few features generated manually, based on domain knowledge. We observe that, overall, there is a great dispersion in the rank importance, showing that different features are more important in different time series. In fact, even those features with high median rank are among the most important in some of the problems.
Rank by Representation.
Figure 7 provides a similar analysis as Figure 6, but combines the results by representation, as explained above. The boxplots provide additional evidence for two observations made previously. Firstly, it shows that, although the obtained features from VEST improve predictive performance, the previous points of the time series (LAG features) provide useful information. In fact, they obtain the best median rank. Secondly, the high dispersion of the rank distributions shows that there is no particular representation which is the most appropriate for all time series. This provides additional evidence of the usefulness and complementarity of the different representations, as observed earlier.
Rank by Summary Operation.
Figure 8 shows a similar analysis but referring to each summary operation. Again, the boxplots show high dispersion suggesting that different statistics are more valuable in different tasks. The statistic with the best median rank is LP, which denotes the last point of the respective transformation.
5.5 Impact of Sample Size
We focused the experimental setup on high frequency time series. This type of data sets is increasingly relevant in many practical applications due to the widespread adoption of sensors. High frequency time series are typically associated with larger sample sizes relative to lower frequency ones. We hypothesise that the sample size is important when performing feature engineering with a method such as VEST. In a small data set additional attribute variables may lead to overfitting issues due to the curse of dimensionality. Therefore, it is important to collect a reasonable amount of observations for feature engineering.
We test the hypothesis above by repeating the experiments with increasing sample size values, similarly to Cerqueira et al. cerqueira2019machine (RQ6). To be more precise, we start by truncating the sample size of the time series to 3000 observations. Only 42 of the 90 time series had enough sample size, and we focus this analysis on that subset of problems. Afterwards, we repeated the experiments (as described in Section 5.2) in 30 different samples sizes from 100 to 3000 in steps of 100 observations (). We remark that, in this particular experiment, we focus on a simple holdout estimation method in which 80% of the initial observations are used for training and the subsequent 20% data points are used for testing. Accordingly, we evaluate the performance of AR+VEST relative to other approaches which do not use VEST. In this experiment, we remove the variants of the proposed method, and keep only AR, ETS, TBATS, and ARIMA. The performance is evaluated as follows: for each problem as for each sample size we compute the MASE error of each approach. Then, each method is ranked according to this error (lower error gives lower rank). We average the rank of each method across the 42 time series in each sample size experiment. This allows us to describe how the average rank of each method evolves as the sample size increases. We remark that we resort to the rank, as opposed to the MASE loss, because it is nonparametric and robust to outliers. Finally, we remark that we focus on the cubist learning algorithm for this analysis in the interest of conciseness. The results are similar when using the lasso algorithm.
The results are presented in Figure 9, which shows the average rank of each method across the 42 time series with an increasing sample size. The results show that, when the sample size is small, AR+VEST shows worse results than all other methods, including AR and the state of the art forecasting methods ARIMA, ETS, and TBATS. However, as the sample size increases, AR+VEST becomes the approach with best average rank. We remark that the average rank scores may be slightly different from the analysis shown previously as there are only 42 time series under analysis in this scenario.
6 Discussion
The experiments carried out in the previous section show the benefits of using VEST for time series forecasting tasks. In this section, we discuss the results obtained and point future directions of research.
6.1 Main Results
VEST is a framework for automatically extracting relevant features from the embedding vectors representing the time series. We showed the usefulness of VEST to tackle time series forecasting tasks based on an extensive set of experiments. When the features generated by VEST are combined with a state of the art autoregressive model (AR), forecasting performance significantly improves relative to only using AR.
We explored these results from different perspectives. Particularly, we presented an analysis which suggests that there is no specific representation or summary statistics which is more appropriate for all time series problems. Even within a single time series, the results suggest that applying summary operations to different representations is important for forecasting performance. This outcome shows the potential benefit of using an automatic approach to extract meaningful features for this type of data. Rather than finding a single feature that improves results across multiple problems, VEST obtains a set of features, each one of which is very important for a small, particular subset of the problems, although not very relevant on the remaining ones.
We believe our work is relevant for automated machine learning frameworks, especially to enable professional with low technical skill to develop accurate forecasting models efficiently taylor2018forecasting .
6.2 Points for Improvement
Despite significant gains in forecasting performance, we believe it is possible to improve the proposed feature engineering process.
VEST is designed as a brute force approach. It works by testing different representations, which are then summarised using different statistics. Those with low feature importance are removed using a feature selection filter. The key factor for the gains in performance is the predictive quality of the features that are tested. In this context, a potentially interesting research line is to develop a method for selecting apriori which transform and summary operations should be computed. For example, Reis do2019automated attempts to use metalearning to predict whether a given feature is going to improve predictive performance. A similar approach could be developed for extending VEST. Other possible interesting solutions are landmarkers Pfahringer2000 or bayesian optimisation rasmussen2003gaussian . Notwithstanding, we remark that, in the proposed framework, the different transform operations are independent of each other, and so are summary ones. Therefore, the processes within each step can run in parallel.
Another point of improvement for VEST is longterm analysis. Except for the dynamic harmonic regression procedure used to estimate the Fourier terms, VEST focuses on extracting information from the past lags. In other words, feature extraction is selfcontained within each embedding vector. In future work, we plan to extend the approach to include a longerterm analysis and extract information across embedding vectors. Such analysis enables a more longterm perspective on the dynamics of the time series. An example of a longterm feature is one attempting to capture a “number of observations since an outlier occurred”.
Although the pool of operations applied cover many properties of time series, the set of transform and summary operations can be increased. Other transformations could be carried out, for example, seasonal adjustment or discrete cosine transform. One could also combine available transform operations, e.g. a transformation which applies the operations BC (BoxCox transformation) and DIFF (first differences transformation) sequentially (c.f. Section 5.1.1).
As we mentioned previously, VEST is designed to extract endogenous features from time series. Notwithstanding, external information may be crucial for building accurate predictive models. With additional time series as explanatory variables, the number of operations to be carried out may be too high for a brute force approach. Thus, the ideas outlined in the second paragraph of this section may be important in these scenarios.
Our experiments are based on 90 time series with a high sampling frequency (daily or higher). Research is still necessary to show the impact of feature engineering in time series with lower sampling frequency. In Section 5.5, we showed that time series sample size is important for the proposed feature engineering solution.
Another point for improvement for VEST is multistep forecasting. During our experiments, we did not find enough evidence that VEST may be better than AR for multistep forecasting. We intend to explore this topic in future research.
7 Summary
Time series forecasting is a relevant predictive task in many domains of application. Datadriven organisations rely on forecasting systems to cope with future uncertainty and support their decisionmaking process.
One of the most important tasks in machine learning is feature engineering. However, this task still requires considerable manual effort and expertise from practitioners kaul2017autolearn . This lead to increasing demand for approaches that automate this part of machine learning projects.
In this work, we present a novel automatic procedure for feature engineering using time series data. The proposed approach, called VEST, is specifically designed to tackle time series forecasting problems. VEST is based on the manipulation of the embedding vectors, which represent the past recent observations used to predict the future ones. It works by transforming time series subsequences into distinct representations. Describing a time series using multiple representations may be useful for capturing its dynamics from different perspectives. Each representation is summarised using statistical functions, such as the mean. After a feature selection process, the final set of features across the available representations is coupled with an autoregressive model.
We validated the proposed approach using an extensive set of experiments, comprised of 90 time series from several domains of application. The results show that the features provided by VEST, along with autoregression, lead to significant gains in forecasting performance.
In future work, we will extend the approach to other forecasting scenarios, for example, multivariate time series or multistep forecasting. VEST is publicly available as an R software package.
Acknowledgements.
This work is financed by National Funds through the Portuguese funding agency, FCT  Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020.References
 (1) Barandas, M., Folgado, D., Fernandes, L., Santos, S., Abreu, M., Bota, P., Liu, H., Schultz, T., Gamboa, H.: Tsfel: Time series feature extraction library. SoftwareX 11, 100,456 (2020)
 (2) Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis. The Journal of Machine Learning Research 18(1), 2653–2688 (2017)
 (3) Box, G.E., Jenkins, G.M., Reinsel, G.C., Ljung, G.M.: Time series analysis: forecasting and control. John Wiley & Sons (2015)
 (4) Box, G.E., Pierce, D.A.: Distribution of residual autocorrelations in autoregressiveintegrated moving average time series models. Journal of the American statistical Association 65(332), 1509–1526 (1970)
 (5) Brennan, M., Palaniswami, M., Kamen, P.: Do existing measures of poincare plot geometry reflect nonlinear features of heart rate variability? IEEE transactions on biomedical engineering 48(11), 1342–1347 (2001)
 (6) Cerqueira, V., Torgo, L., Mozetic, I.: Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn (2020) https://doi.org/10.1007/s10994020059107

(7)
Cerqueira, V., Torgo, L., Oliveira, M., Pfahringer, B.: Dynamic and
heterogeneous ensembles for time series forecasting.
In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 242–251 (2017).
DOI 10.1109/DSAA.2017.26  (8) Cerqueira, V., Torgo, L., Pinto, F., Soares, C.: Arbitrage of forecasting experts. Machine Learning 108(6), 913–944 (2019)
 (9) Cerqueira, V., Torgo, L., Soares, C.: Machine learning vs statistical methods for time series forecasting: Size matters. arXiv preprint arXiv:1909.13316 (2019)
 (10) Chatfield, C.: Timeseries forecasting. CRC Press (2000)
 (11) Christ, M., KempaLiehr, A.W., Feindt, M.: Distributed and parallel time series feature extraction for industrial big data applications. arXiv preprint arXiv:1610.07717 (2016)
 (12) Esling, P., Agon, C.: Timeseries data mining. ACM Computing Surveys (CSUR) 45(1), 12 (2012)
 (13) Fulcher, B.D., Jones, N.S.: hctsa: A computational framework for automated timeseries phenotyping using massive feature extraction. Cell systems 5(5), 527–531 (2017)
 (14) Guerrero, V.M.: Timeseries analysis supported by power transformations. Journal of Forecasting 12(1), 37–48 (1993)
 (15) Guyon, I., Elisseeff, A.: An introduction to feature extraction. In: Feature extraction, pp. 1–25. Springer (2006)
 (16) Hyndman, R., Yang, Y.: tsdl: Time Series Data Library (2019). Https://finyang.github.io/tsdl/, https://github.com/FinYang/tsdl
 (17) Hyndman, R.J., with contributions from George Athanasopoulos, Razbash, S., Schmidt, D., Zhou, Z., Khan, Y., Bergmeir, C., Wang, E.: forecast: Forecasting functions for time series and linear models (2014). R package version 5.6
 (18) Hyndman, R.J., et al.: Another look at forecastaccuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting 4(4), 43–46 (2006)
 (19) Ikonomovska, E., Gama, J., Džeroski, S.: Learning model trees from evolving data streams. Data mining and knowledge discovery 23(1), 128–168 (2011)
 (20) Kahn, K.B.: How to measure the impact of a forecast error on an enterprise? The Journal of Business Forecasting 22(1), 21 (2003)
 (21) Kang, Y., Hyndman, R.J., SmithMiles, K.: Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting 33(2), 345–358 (2017)
 (22) Kanter, J.M., Veeramachaneni, K.: Deep feature synthesis: Towards automating data science endeavors. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10. IEEE (2015)
 (23) Katz, G., Shin, E.C.R., Song, D.: Explorekit: Automatic feature generation and selection. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 979–984. IEEE (2016)
 (24) Kaul, A., Maheshwary, S., Pudi, V.: Autolearn—automated feature generation and selection. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 217–226. IEEE (2017)
 (25) Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameterfree data mining. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 206–215 (2004)

(26)
Khurana, U., Turaga, D., Samulowitz, H., Parthasrathy, S.: Cognito: Automated feature engineering for supervised learning.
In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 1304–1307. IEEE (2016)  (27) Kuhn, M., Weston, S., Keefer, C., code for Cubist by Ross Quinlan, N.C.C.: Cubist: Rule and InstanceBased Regression Modeling (2014). R package version 0.0.18
 (28) Lam, H.T., Thiebaut, J.M., Sinn, M., Chen, B., Mai, T., Alkan, O.: One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327 (2017)
 (29) Lemke, C., Gabrys, B.: Metalearning for time series forecasting and forecast combination. Neurocomputing 73(1012), 2006–2016 (2010)
 (30) Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pp. 2–11 (2003)
 (31) Lubba, C.H., Sethi, S.S., Knaute, P., Schultz, S.R., Fulcher, B.D., Jones, N.S.: catch22: Canonical timeseries characteristics. Data Mining and Knowledge Discovery 33(6), 1821–1852 (2019)
 (32) MonteroManso, P., Athanasopoulos, G., Hyndman, R.J., Talagala, T.S.: Fforma: Featurebased forecast model averaging. International Journal of Forecasting 36(1), 86–92 (2020)
 (33) do Nascimento Reis, G.F.: Automated feature engineering for classification problems (2019)
 (34) Oliveira, M., Torgo, L.: Ensembles for time series forecasting. In: ACML Proceedings of Asian Conference on Machine Learning. JMLR: Workshop and Conference Proceedings (2014)
 (35) Paras, S.M., Kumar, A., Chandra, M.: A feature based neural network model for weather forecasting. International Journal of Computational Intelligence 4(3), 209–216 (2009)
 (36) Percival, D.B., Walden, A.T.: Wavelet methods for time series analysis, vol. 4. Cambridge university press (2006)
 (37) Pfahringer, B., GiraudCarrier, C.: Metalearning by landmarking various learning algorithms. pp. 743–750 (2000)
 (38) Pinto, F., Soares, C., MendesMoreira, J.: Towards automatic generation of metafeatures. In: PacificAsia Conference on Knowledge Discovery and Data Mining, pp. 215–226. Springer (2016)
 (39) Prudêncio, R.B., Ludermir, T.B.: Metalearning approaches to selecting time series models. Neurocomputing 61, 121–137 (2004)
 (40) Quinlan, J.R.: Combining instancebased and modelbased learning. In: Proceedings of the tenth international conference on machine learning, pp. 236–243 (1993)
 (41) Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on Machine Learning, pp. 63–71. Springer (2003)
 (42) RobnikŠikonja, M., Kononenko, I.: An adaptation of relief for attribute estimation in regression. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML’97), vol. 5, pp. 296–304 (1997)
 (43) Takens, F.: Dynamical Systems and Turbulence, Warwick 1980: Proceedings of a Symposium Held at the University of Warwick 1979/80, chap. Detecting strange attractors in turbulence, pp. 366–381. Springer Berlin Heidelberg, Berlin, Heidelberg (1981)
 (44) Taylor, S.J., Letham, B.: Forecasting at scale. The American Statistician 72(1), 37–45 (2018)
 (45) Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
 (46) Wang, X., Smith, K., Hyndman, R.: Characteristicbased clustering for time series data. Data mining and knowledge Discovery 13(3), 335–364 (2006)
Comments
There are no comments yet.