Temporal data are generated in natural and technological systems, and their analysis is ubiquitous (Hamilton (1994), Richards et al. (2011), Roberts et al. (2013)). Much of this analysis amounts to building time-series models via learning, and evaluating their performance is not a trivial task, especially in sparse settings (Süzen and Ajraou (2016)).
Learning curves are used to evaluate the relative performance of machine learning algorithms (Perlich et al. (2003)), while cross-validation (Stone (1974), Efron and Gong (1983)) is used to assess the generalisation ability of a model and to select among models (Kohavi (1995)). However, cross-validation is not directly applicable to time-series and has only been available for uncorrelated errors and stationary series: by shuffling chunks of the time-series (Politis and Romano (1994)) or by naive k-fold CV for time-series with uncorrelated errors (Bergmeir et al. (2018)). Learning curves for time-series models, on the other hand, are uncommon, and no specialised technique has been proposed for them. Learning curves are usually built by reducing the sample size of the dataset, removing points at random; this approach cannot be used for time-series learning curves.
We propose a technique called reconstructive cross-validation (rCV), which combines standard cross-validation (CV) with out-of-sample (OOS) performance evaluation by reconstructing the fold removed in CV, i.e., by imputation or smoothing, allowing k-times OOS evaluation. rCV requires no assumption on the error structure and does not use future data to predict the past in the main cross-validation procedure.
1.1 Single model versus model selection
Earlier literature (Stone (1974)) emphasises building a single model in cross-validation: the learning algorithm builds a single model, with a single parametrisation such as the weights of a neural network, while the data splits, the so-called k-folds, rotate inside the core optimisation. Building a different model, or a different parametrisation of the same model, in each of the k folds (Kohavi (1995)) was introduced later to exercise model selection. The latter practice is now standard in mainstream machine learning libraries, where k-fold cross-validation produces k different parametrisations of the same model. Similarly, in rCV we build $k$ parametrisations of the model in a supervised learning setting. Extending this to the earlier single-model cross-validation should be straightforward, although we may need to change how the underlying solvers interact with the data during optimisation and modify rCV for a single model build.
2 Proposed techniques
Our contribution has two main implications for generalised learning of time-series: on the one hand, a generic procedure for cross-validating time-series models; on the other hand, a technique for building learning curves for time-series without reducing the sample size of the time-series data.
We concentrate on one-dimensional time-series for this basic investigation; extensions to higher dimensions should be self-evident. Consider a series of numbers $x_i$, a vector of length $n$, where $i = 1, \dots, n$ and the ordered dataset reads $x = (x_1, x_2, \dots, x_n)$. This tuple of ordered numbers is considered a time-series, as it is often interpreted as the time evolution of a quantity and is usually expressed as $x(t)$ at ordered time points $t_1 < t_2 < \dots < t_n$.
Out-of-sample (OOS) data usually appears as a continuation of the past time-series. Hence the continuation of $x$ is defined as the out-of-sample set $x'$, a vector of length $m$ with components $x'_j$, where $j = n+1, \dots, n+m$, so that the ordered OOS dataset reads $x' = (x_{n+1}, \dots, x_{n+m})$.
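The past/OOS split above can be sketched in a few lines; a minimal sketch, with names (`split_series`, `x_oos`) chosen for illustration rather than taken from the paper's notebook:

```python
import numpy as np

def split_series(series, m):
    """Split an ordered series into the past part x (length n) and the
    out-of-sample continuation x' of length m."""
    series = np.asarray(series)
    return series[:-m], series[-m:]

x_full = np.arange(10.0)          # stand-in for a series of length n + m
x, x_oos = split_series(x_full, m=3)
# x holds the first 7 points (the past series), x_oos the last 3 (OOS)
```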
Construction of cross-validated performance measures and learning curves for time-series will appear as a meta-algorithm processing time-series and .
2.1 Reconstructive cross-validation for time-series
The first step in rCV is to identify $k$ partitions of the time-series, as in conventional cross-validation (Efron and Gong (1983)). Consider $k$ partitions $P_1, \dots, P_k$ of the index set $\{1, \dots, n\}$, with each index randomly assigned to a partition and the partitions approximately equal in size. A training-fold is defined as the union of all partitions except the removed partition, $F_j = \bigcup_{i \neq j} P_i$. The data missing at random lie on the corresponding removed partition $P_j$.
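The partitioning step can be sketched as follows; a minimal illustration, assuming NumPy, with function names not taken from the paper:

```python
import numpy as np

def make_partitions(n, k, seed=0):
    """Randomly assign the n time indices to k approximately equal
    partitions, as in conventional k-fold CV."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def training_fold(partitions, j):
    """Union of all partitions except the removed partition j."""
    return np.sort(np.concatenate(
        [p for i, p in enumerate(partitions) if i != j]))

parts = make_partitions(n=10, k=5)
fold0 = training_fold(parts, 0)   # 8 indices; parts[0] is missing at random
```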
Due to the ordered nature of the series, the standard CV approach cannot be applied across folds, as it yields the absurd situation of predicting the past using future values. To overcome this, a reconstruction of the full training series, denoted $\tilde{x}^{(j)}$, is introduced. This can be thought of as imputation of data missing at random, or as smoothing in the Bayesian sense, using each training-fold via a secondary model $g$, so that $\tilde{x}^{(j)} = g(F_j)$. The technique could be interpolation or a more advanced filtering approach such as Kalman filtering. The secondary model may retain the given points of the training-fold in this approach, so that only the removed partition $P_j$ is the reconstructed portion.
The total error due to the reconstruction model is expressed as $e_{\mathrm{recon}}$; here, for example, we write it down as a mean absolute percentage error (MAPE), although different metrics can obviously be used:
$$e_{\mathrm{recon}} = \frac{1}{k}\sum_{j=1}^{k} \frac{1}{|P_j|}\sum_{i \in P_j} \left|\frac{x_i - \tilde{x}^{(j)}_i}{x_i}\right|.$$
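A minimal sketch of reconstructing a removed partition and scoring it with MAPE, using linear interpolation as the secondary model $g$; this is an illustrative stand-in, since the paper's experiments use a Gaussian process rather than interpolation:

```python
import numpy as np

def reconstruct(t, x, fold_idx, missing_idx):
    """Impute the removed partition by linearly interpolating the
    training-fold points (a simple choice of secondary model g)."""
    return np.interp(t[missing_idx], t[fold_idx], x[fold_idx])

def mape(actual, predicted):
    return np.mean(np.abs((actual - predicted) / actual))

t = np.arange(8.0)
x = 2.0 * t + 1.0                    # a noiseless line: interpolation is exact
fold = np.array([0, 1, 2, 4, 5, 7])  # training-fold indices
miss = np.array([3, 6])              # removed partition P_j
x_hat = reconstruct(t, x, fold, miss)
err = mape(x[miss], x_hat)           # zero for this exactly linear series
```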
The primary model $f$ is built on each reconstructed series $\tilde{x}^{(j)}$ and its predictions are applied to the out-of-sample set. This results in a set of predictions $\hat{x}'^{(j)}$, and the error is expressed as
$$e_{\mathrm{pred}} = \frac{1}{k}\sum_{j=1}^{k} \frac{1}{m}\sum_{i=n+1}^{n+m} \left|\frac{x'_i - \hat{x}'^{(j)}_i}{x'_i}\right|.$$
The total error in rCV is computed as the product of the two, $e_{\mathrm{rCV}} = e_{\mathrm{recon}} \cdot e_{\mathrm{pred}}$. The lower the number, the better the generalisation; however, $e_{\mathrm{recon}}$ and $e_{\mathrm{pred}}$ should also be judged separately to detect any anomalies. We chose the multiplicative combination of reconstruction and prediction errors so that $e_{\mathrm{rCV}}$ represents a weighted error; more complex schemes to estimate $e_{\mathrm{rCV}}$ can be devised. Note that both $e_{\mathrm{recon}}$ and $e_{\mathrm{pred}}$ are test errors in the conventional sense, since both the reconstruction and prediction computations are performed on a Gaussian process whose parameters are fixed, i.e., corresponding to Ornstein-Uhlenbeck processes.
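The multiplicative combination is simple arithmetic; the numbers below mirror the values reported in the experimental section, so either a poor reconstruction or a poor prediction inflates the combined score:

```python
# Combined rCV score: product of mean reconstruction and prediction errors.
e_recon = 0.029          # reconstruction MAPE (as reported in Section 3.2)
e_pred = 0.468           # OOS prediction MAPE
e_rcv = e_recon * e_pred
# e_rcv ~ 0.0136, consistent with the reported rCV error of 0.013
```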
2.2 Learning Curves for time-series
Time-series learning curves are uncommon, owing to the fact that data sample sizes are limited. With the reconstruction approach provided above, however, one can build a learning curve $L(k)$ based on the number of folds $k$, where $L(k)$ is an error measure evaluated over a range of different $k$ values. The error term can be any of the errors defined above: $e_{\mathrm{recon}}$, $e_{\mathrm{pred}}$, or $e_{\mathrm{rCV}}$. Note that, unlike other learning curves built by reducing the sample size (Perlich et al. (2003)), $L(k)$ is constructed while retaining the sample size of the original time-series at each point on the curve. The reason is that the number of points missing at random in the reconstructed folds, as explained above, decreases as the number of folds increases. The combined reconstruction-prediction error, as a performance measure, is thereby affected by the changing number of folds, yielding a learning curve in the basic sense of supervised learning (Mitchell (1997)).
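A learning-curve loop can be sketched as follows; a minimal illustration in which the error routine is a placeholder interpolation-plus-MAPE, not the paper's Gaussian-process pipeline, and all names are assumptions:

```python
import numpy as np

def fold_error(t, x, k, seed=0):
    """Mean reconstruction MAPE over k randomly assigned folds; the sample
    size n stays fixed, only the fraction missing per fold changes."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for j in range(k):
        miss = np.sort(parts[j])
        fold = np.sort(np.concatenate(
            [p for i, p in enumerate(parts) if i != j]))
        x_hat = np.interp(t[miss], t[fold], x[fold])
        errs.append(np.mean(np.abs((x[miss] - x_hat) / x[miss])))
    return np.mean(errs)

t = np.linspace(0.1, 10.0, 200)
x = np.sin(t) + 2.0                  # kept away from zero so MAPE is defined
curve = {k: fold_error(t, x, k) for k in (2, 4, 8, 16)}
# larger k -> fewer missing points per fold -> lower reconstruction error
```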
3 Experimental Setup
We demonstrate the utility of our technique using a specific kernel in a Gaussian process setting (Williams and Rasmussen (2006)). This corresponds to the Ornstein-Uhlenbeck process, which is used to describe Brownian motion in statistical physics (Gardiner (2009)). The learning task aims at predicting the OOS data $x'$ using the past series $x$.
3.1 Ornstein-Uhlenbeck process
One can generate an Ornstein-Uhlenbeck process by drawing numbers from a multivariate Gaussian with a specific covariance structure. Taking the mean vector as constant, the covariance is built via the kernel $K = \exp(-D/\lambda)$, where $D$ is the distance matrix constructed over the time points, $D_{ij} = |t_i - t_j|$, which is a symmetric matrix.
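The sampling step can be sketched directly from the kernel definition above; the length scale, mean, and grid below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def ou_sample(t, length_scale=1.0, mean=0.0, seed=0):
    """Draw one OU path as a multivariate Gaussian sample whose covariance
    is the exponential kernel K_ij = exp(-|t_i - t_j| / length_scale)."""
    rng = np.random.default_rng(seed)
    D = np.abs(t[:, None] - t[None, :])      # symmetric distance matrix
    K = np.exp(-D / length_scale)            # OU (exponential) kernel
    return rng.multivariate_normal(mean * np.ones(len(t)), K)

t = np.linspace(0.0, 10.0, 101)              # regular grid, spacing 0.1
x = ou_sample(t, length_scale=2.0)
```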
We generated Ornstein-Uhlenbeck (OU) time-series on regularly spaced time points, with different length scales and mean values, together with additional time points for the prediction task, see Figure 1.
3.2 Reconstructive cross-validation
We apply our meta-algorithm, constructing both the primary and the secondary model with Gaussian process predictions under unit regularisation, formulated as follows: given ordered pairs $(t_i, x_i)$ as a time-series, we aim at inferring, i.e., reconstructing, the missing values at the removed time points. The missing values can be identified via the Bayesian interpretation of kernel regularisation,
$$\hat{x}_m = K_{mf} (K_{ff} + I)^{-1} x_f,$$
where the subscripts $m$ and $f$ denote the missing-at-random fold and the remaining folds, respectively. The kernel matrices are built via the kernel $K = \exp(-D/\lambda)$, where $D$ is the distance matrix over the corresponding time points. The secondary model is used to reconstruct each removed fold; the absolute errors in 10-fold rCV are shown graphically in Figure 1.
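The posterior-mean reconstruction can be sketched as kernel regression with unit regularisation; a minimal sketch, with index naming ($f$ for fold points, $m$ for missing points) and toy values chosen for illustration:

```python
import numpy as np

def gp_reconstruct(t_f, x_f, t_m, length_scale=1.0):
    """Reconstruct missing values via x_m = K_mf (K_ff + I)^{-1} x_f
    with the exponential (OU) kernel and unit regularisation."""
    K_ff = np.exp(-np.abs(t_f[:, None] - t_f[None, :]) / length_scale)
    K_mf = np.exp(-np.abs(t_m[:, None] - t_f[None, :]) / length_scale)
    alpha = np.linalg.solve(K_ff + np.eye(len(t_f)), x_f)
    return K_mf @ alpha

t_f = np.array([0.0, 1.0, 2.0, 4.0])     # training-fold time points
x_f = np.array([0.1, -0.3, 0.2, 0.0])    # training-fold values
x_m = gp_reconstruct(t_f, x_f, np.array([3.0]))   # impute the removed point
```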
A similar procedure is followed in predicting the OOS vector $x'$. The reconstruction error, prediction error, and rCV error were computed as 0.029, 0.468, and 0.013, respectively. Note that the rCV error is not a MAPE but a measure of generalisation. The high prediction error is attributed to the long time horizon we chose; in practice, a much shorter time horizon should be used for practical utility.
3.3 Learning curves
We produced time-series learning curves for the generated Ornstein-Uhlenbeck process via rCV with varying numbers of folds, i.e., different partitions of the original time-series, see Figure 2. The learning curves constructed with rCV have effective sample sizes that increase with the number of folds in the conventional sense: a larger number of folds implies fewer time points missing at random to reconstruct, corresponding to a larger sample, i.e., more experience. The reported learning curves correspond to test learning curves, as we use fixed kernel parameters to generate and predict the Ornstein-Uhlenbeck process.
4 Conclusions
We have presented a framework, demonstrated with a canonical process from physics, the Ornstein-Uhlenbeck process, for performing generalised learning on time-series without any restriction on stationarity and while retaining the serial correlation order of the original dataset. The approach consists of applying cross-validation directly, in combination with an OOS estimate of performance, by reconstructing the missing-at-random fold instances via a secondary model. This approach, rCV, also allows one to generate a learning curve for a time-series, as we have demonstrated.
The meta-algorithm developed in this work can be used with any other learning algorithm. We chose Gaussian processes for both the reconstruction and prediction tasks in demonstrating the framework because of their minimal requirements for a naive implementation. A further, generic implementation of the meta-algorithm is possible without embedding the learning algorithm into the rCV procedure.
We have provided a Python notebook with a prototype implementation of rCV for reproducing the results presented (rCV_prototype.ipynb).
References

- Bergmeir, C., Hyndman, R. J., and Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Comput. Stat. Data Anal. 120(C), pp. 70–83.
- Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37(1), pp. 36–48.
- Gardiner, C. (2009). Stochastic methods: a handbook for the natural and social sciences. 4th edition, Springer.
- Hamilton, J. D. (1994). Time series analysis. Vol. 2, Princeton University Press, Princeton.
- Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, IJCAI'95, pp. 1137–1143.
- Mitchell, T. M. (1997). Machine learning. 1st edition, McGraw-Hill, Inc., New York, NY, USA.
- Perlich, C., Provost, F., and Simonoff, J. S. (2003). Tree induction vs. logistic regression: a learning-curve analysis. J. Mach. Learn. Res. 4, pp. 211–255.
- Politis, D. N. and Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association 89(428), pp. 1303–1313.
- Richards, J. W. et al. (2011). On machine-learned classification of variable stars with sparse and noisy time-series data. The Astrophysical Journal 733(1), p. 10.
- Roberts, S. et al. (2013). Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A 371(1984), 20110550.
- Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36(2), pp. 111–133.
- Süzen, M. and Ajraou, A. (2016). Evaluating Gaussian processes for sparse irregular spatio-temporal data. arXiv preprint arXiv:1611.02978.
- Williams, C. K. I. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT Press.