1 Introduction
Temporal data are generated in both natural and technological systems, and their analysis is very common (Hamilton (1994), Richards et al. (2011), Roberts et al. (2013)). Much of this analysis manifests as building time-series models via learning, and evaluating their performance is not a trivial task, especially in sparse settings (Süzen and Ajraou (2016)).
Learning curves are utilised in evaluating the relative performance of machine learning algorithms (Perlich et al. (2003)), while cross-validation (Stone (1974), Efron and Gong (1983)) is used in assessing the generalisation ability of a model and in model selection (Kohavi (1995)). However, cross-validation is not directly practised for time series; techniques have been available only for uncorrelated errors and stationary series, either by shuffling chunks of the series (Politis and Romano (1994)) or via naive k-fold CV for series with uncorrelated errors (Bergmeir et al. (2018)). On the other hand, learning curves for time-series models are not common, and no specialised technique has been addressed for them earlier. Learning curves are usually built by reducing the dataset sample size, removing points at random; this approach cannot be used for time-series learning curves.
We propose a technique called reconstructive cross-validation (rCV), combining standard cross-validation (CV) and out-of-sample (OOS) evaluation of performance by introducing a reconstruction of the fold removed in CV, i.e. imputation or smoothing, allowing k-times OOS evaluation. rCV does not require any assumption on the error structure and does not use future data to predict the past in the main cross-validation procedure.
1.1 Single model versus model selection
Earlier literature (Stone (1974)) signifies building a single model in cross-validation: the learning algorithm builds a single model, a single parametrisation such as the weights of a neural network, with rotations of the data splits, i.e. the so-called k folds, inside the core optimisation. Building a different model, or obtaining a different parametrisation of the same model, in each of the k folds (Kohavi (1995)) was introduced later to exercise model selection. The latter practice is now standard in mainstream machine learning libraries: a k-fold cross-validation produces k different parametrisations of the same model. Similarly, in rCV we build k parametrisations of the model in a supervised learning setting. Extending this to the earlier single-model cross-validation should be straightforward; however, one may need to change how the underlying solvers interact with the data in the optimisation phase and modify rCV for a single model build.
2 Proposed techniques
Our contribution has two main implications for generalised learning of time series: on the one hand, a generic procedure for cross-validating time-series models, and on the other hand a technique for building learning curves for time series without the need to reduce the sample size of the data.
We concentrate on one-dimensional time series for this basic investigation; extensions to higher dimensions should be self-evident. Consider a series of numbers $x_1, x_2, \dots, x_n$, a vector of length $n$, observed at ordered time points $t_1 < t_2 < \dots < t_n$, where the ordered dataset reads $D = \{(t_i, x_i)\}_{i=1}^{n}$. This tuple of ordered numbers is considered a time series, as it is often interpreted as the time evolution of $x$ and is usually expressed as $x(t)$. Out-of-sample (OOS) data usually appears as a continuation of the past time series; hence the continuation of $D$ is defined as the out-of-sample set: values $x'_1, x'_2, \dots, x'_m$, a vector of length $m$, at time points $t'_1 < t'_2 < \dots < t'_m$ with $t'_1 > t_n$, where the ordered OOS dataset reads $D' = \{(t'_j, x'_j)\}_{j=1}^{m}$.
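As a concrete illustration, the split between the past series and its OOS continuation can be sketched in NumPy; the variable names and the sinusoidal stand-in signal are ours, not part of the formal definition:

```python
import numpy as np

# past series D = {(t_i, x_i)} and its out-of-sample continuation
# D' = {(t'_j, x'_j)}, with t'_1 strictly after t_n
t = np.linspace(0.0, 8.0, 80)        # ordered time points t_1 < ... < t_n
x = np.sin(t)                        # observed values x(t)
t_oos = np.linspace(8.1, 10.0, 20)   # OOS time points, all after t_n
x_oos = np.sin(t_oos)
```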
Construction of crossvalidated performance measures and learning curves for timeseries will appear as a metaalgorithm processing timeseries and .
2.1 Reconstructive cross-validation for time series
The first step in rCV is to identify partitions of the time series, here as in conventional cross-validation (Efron and Gong (1983)). Consider $k$ partitions $F_1, \dots, F_k$ of $D$, each having randomly assigned points, with the partitions approximately equal in size,

$D = \bigcup_{i=1}^{k} F_i, \qquad F_i \cap F_j = \emptyset \ \ (i \neq j).$  (1)
A training fold is defined as the union of all partitions except the removed partition $F_i$,

$T_i = D \setminus F_i = \bigcup_{j \neq i} F_j.$  (2)

The missing data then lie on the corresponding removed partition $F_i$.
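The fold assignment above can be sketched in a few lines; the function name `random_folds` and the use of NumPy are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def random_folds(n, k, seed=0):
    """Assign n time indices to k approximately equal, disjoint folds
    at random, as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    return [np.sort(f) for f in np.array_split(rng.permutation(n), k)]

folds = random_folds(20, 4)
# a training fold: every partition except the first, removed one (Eq. (2))
train_fold_0 = np.sort(np.concatenate(folds[1:]))
```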
Due to the ordered nature of the series, the standard CV approach cannot be used across folds, as it yields the absurd situation of predicting the past using future values. To overcome this, a reconstruction of the full training series, denoted $\tilde{D}_i$, is introduced. This can be thought of as imputation of data missing at random, or as smoothing in the Bayesian sense, using each training fold via a secondary model $g_s$. The technique could be interpolation, or a more advanced filtering approach such as Kalman filtering, resulting in

$\tilde{D}_i = g_s(T_i).$  (3)

The secondary model may retain the given points on the training fold in this approach; $\tilde{F}_i$ is the reconstructed portion.
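A minimal sketch of such a secondary model, using plain linear interpolation as the reconstruction technique (the simplest option the text mentions; the function name is hypothetical):

```python
import numpy as np

def reconstruct_fold(t, x, missing_idx):
    """Impute the removed-fold values by linear interpolation over the
    remaining points; given points on the training fold are retained."""
    keep = np.setdiff1d(np.arange(len(t)), missing_idx)
    x_rec = x.astype(float).copy()
    x_rec[missing_idx] = np.interp(t[missing_idx], t[keep], x[keep])
    return x_rec

t = np.arange(6, dtype=float)
x = t ** 2
x_rec = reconstruct_fold(t, x, np.array([2, 3]))  # reconstruct x(2), x(3)
```

A Kalman smoother or Gaussian process, as used later in Section 3, would replace `np.interp` here without changing the surrounding meta-algorithm.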
The total error due to the reconstruction model is expressed as $e_{r}$; here, for example, we write it down as a mean absolute percent error (MAPE), though obviously different metrics can be used,

$e_{r} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MAPE}(F_i, \tilde{F}_i), \qquad \mathrm{MAPE}(F, \tilde{F}) = \frac{100}{|F|} \sum_{t \in F} \left| \frac{x_t - \tilde{x}_t}{x_t} \right|.$  (4)
The primary model, $g_p$, is built on each $\tilde{D}_i$, and its predictions are applied on the out-of-sample set $D'$. This results in a set of predictions $\hat{D}'_i$, and the error is expressed as

$e_{p} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MAPE}(D', \hat{D}'_i).$  (5)
The total error in rCV is computed as follows:

$e_{rCV} = e_{r} \cdot e_{p}.$  (6)

The lower the number, the better the generalisation; however, both $e_{r}$ and $e_{p}$ should be judged separately to detect any anomalies. We have chosen $e_{rCV}$ as the multiplicative combination of the reconstruction and prediction errors, so that it represents a weighted error; more complex schemes to estimate $e_{rCV}$ can be devised. Note that both $e_{r}$ and $e_{p}$ are test errors in the conventional sense when both reconstruction and prediction computations are performed with a Gaussian process whose parameters are fixed, i.e., corresponding to Ornstein-Uhlenbeck processes.
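The error bookkeeping is a one-liner each; as a sanity check, the sketch below plugs in the reconstruction and prediction errors reported later in Section 3.2 (0.029 and 0.468), whose product indeed gives the reported rCV error of about 0.013:

```python
import numpy as np

def mape(true, pred):
    """Mean absolute percent error, the Eq. (4)-style metric."""
    return 100.0 * np.mean(np.abs((true - pred) / true))

# multiplicative combination of the two error terms
e_r = 0.029          # reconstruction error over the k folds
e_p = 0.468          # out-of-sample prediction error
e_rcv = e_r * e_p    # combined rCV error
```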
2.2 Learning curves for time series
Time-series learning curves are not common, due to the fact that data sample sizes are limited. However, with the reconstruction approach provided above, one can build a learning curve $\mathcal{L}(k)$ based on the number of folds $k$,

$\mathcal{L}(k) = e(k),$  (7)

where $e(k)$ is the error measure evaluated over a range of different $k$ values; the error term can be any of the errors defined above, $e_{r}$, $e_{p}$ or $e_{rCV}$. Note that, unlike other learning curves built upon reducing the sample size (Perlich et al. (2003)), $\mathcal{L}(k)$ is constructed while retaining the sample size of the original time series at each point on the curve. The reason is that the number of points missing at random on the reconstructed folds, as explained above, decreases with an increasing number of folds. The combined reconstruction-prediction error, as a performance measure, is affected by the changing number of folds, hence yielding a learning curve as in the basic definition of supervised learning (Mitchell (1997)).
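A self-contained sketch of such a curve, using $e_r$ as the error measure and linear interpolation as the secondary model on a toy signal (both simplifying assumptions; the paper uses a Gaussian process and an OU series):

```python
import numpy as np

def mape(true, pred):
    """Mean absolute percent error."""
    return 100.0 * np.mean(np.abs((true - pred) / true))

def reconstruction_error(t, x, k, seed=0):
    """Average reconstruction MAPE over k random folds; the sample size
    is retained -- only the fraction of missing points changes with k."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), k)
    errs = []
    for fold in folds:
        keep = np.setdiff1d(np.arange(len(t)), fold)
        x_hat = np.interp(t[fold], t[keep], x[keep])  # secondary model
        errs.append(mape(x[fold], x_hat))
    return float(np.mean(errs))

# one learning-curve point per fold count k
t = np.linspace(0.0, 10.0, 200)
x = np.sin(t) + 2.0   # toy series bounded away from zero (safe for MAPE)
curve = {k: reconstruction_error(t, x, k) for k in (2, 5, 10, 20)}
```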
3 Experimental Setup
We demonstrate the utility of our technique using a specific kernel in a Gaussian process setting (Williams and Rasmussen (2006)). This corresponds to the Ornstein-Uhlenbeck process, used in the description of Brownian motion in statistical physics (Gardiner (2009)). The learning task aims at predicting the OOS data $D'$ using the past series $D$.
3.1 Ornstein-Uhlenbeck process
One can generate an Ornstein-Uhlenbeck process by drawing numbers from a multivariate Gaussian with a specific covariance structure,

$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, K).$  (8)

Taking $\boldsymbol{\mu}$ as a constant vector, we build $K$ via the kernel $K_{ij} = \exp(-|t_i - t_j|/l)$, where the entries $|t_i - t_j|$ form the distance matrix constructed over the time points, a symmetric matrix.
We generated Ornstein-Uhlenbeck (OU) time series on regularly spaced time points, for different length scales $l$ and mean values, with additional time points for the prediction task, see Figure 1.
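The generation step can be sketched directly from Eq. (8); function and parameter names here are illustrative:

```python
import numpy as np

def ou_series(t, mu=0.0, length_scale=1.0, seed=0):
    """Draw one Ornstein-Uhlenbeck path: a multivariate Gaussian with
    exponential covariance K_ij = exp(-|t_i - t_j| / length_scale)."""
    rng = np.random.default_rng(seed)
    D = np.abs(t[:, None] - t[None, :])  # symmetric distance matrix
    K = np.exp(-D / length_scale)        # OU (exponential) kernel
    return rng.multivariate_normal(mu * np.ones(len(t)), K)

t = np.linspace(0.0, 10.0, 100)  # regular time points
x = ou_series(t, mu=1.0, length_scale=2.0)
```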
3.2 Reconstructive cross-validation
We apply our meta-algorithm to construct both primary and secondary models using Gaussian process predictions with unit regularisation, formulated as follows: given ordered pairs $(t_i, x_i)$ as the time series, we aim at inferring, i.e., reconstructing, the missing values at time points $t_*$. The missing values can be identified via the Bayesian interpretation of kernel regularisation,

$\hat{x}_* = K_{*f} \, (K_{ff} + \lambda I)^{-1} \, x_f,$  (9)

where $\lambda = 1$. The kernel matrices $K_{ff}$ and $K_{*f}$ are built via the kernel $\exp(-|t - t'|/l)$ over the time points of the missing-at-random folds and the remaining folds. A secondary model of this form is used to reconstruct $\tilde{D}_i$. The absolute errors in 10-fold are shown graphically in Figure 1.
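A minimal sketch of this reconstruction step, assuming the exponential kernel and unit regularisation described in the text (function names are ours):

```python
import numpy as np

def ou_kernel(a, b, length_scale=1.0):
    """Exponential (Ornstein-Uhlenbeck) kernel matrix between time points."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / length_scale)

def gp_reconstruct(t_train, x_train, t_missing,
                   mu=0.0, length_scale=1.0, lam=1.0):
    """Posterior-mean reconstruction of the missing points in the style
    of Eq. (9), with unit regularisation lam = 1."""
    K_ff = ou_kernel(t_train, t_train, length_scale)
    K_sf = ou_kernel(t_missing, t_train, length_scale)
    alpha = np.linalg.solve(K_ff + lam * np.eye(len(t_train)), x_train - mu)
    return mu + K_sf @ alpha

t = np.linspace(0.0, 10.0, 50)
x = np.sin(t)
# reconstruct the odd-indexed points from the even-indexed training fold
x_hat = gp_reconstruct(t[::2], x[::2], t[1::2], length_scale=2.0)
```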
A similar procedure is followed in predicting the OOS vector $D'$. The reconstruction, prediction and rCV errors were computed as 0.029, 0.468 and 0.013 respectively. Note that the rCV error is not a MAPE but a measure of generalisation. The high prediction error is attributed to the long time horizon we chose; in practice, a much shorter time horizon should be used for practical utility.
3.3 Learning curves
We produced time-series learning curves for the generated Ornstein-Uhlenbeck process via rCV with varying fold numbers, i.e., different partitions of the original time series $D$, see Figure 2. In the learning curves constructed with rCV, the effective sample size increases with the number of folds, in the conventional sense: a larger number of folds implies fewer time points missing at random to reconstruct, corresponding to a larger sample, i.e., more experience. The reported learning curves correspond to test learning curves, as we use fixed kernel parameters to generate and predict the Ornstein-Uhlenbeck process.
4 Conclusion
We have presented a framework, demonstrated with a canonical process from physics, the Ornstein-Uhlenbeck process, that enables generalised learning on time series without any restriction of stationarity, while retaining the serial-correlation order of the original dataset. The approach entails applying cross-validation directly, in combination with an OOS estimate of performance, by reconstructing missing-at-random fold instances via a secondary model. This approach, rCV, also allows one to generate a learning curve for time series, as we have demonstrated.
The meta-algorithm we developed in this work can be used with any other learning algorithm. We chose Gaussian processes for both the reconstruction and prediction tasks in demonstrating the framework only because of their minimal requirements for a naive implementation. Further implementation of the meta-algorithm in a generic setting is possible without embedding the learning algorithm into the rCV procedure.
Code Supplement
We have provided a Python notebook with a prototype implementation of rCV, reproducing the results presented (rCV_prototype.ipynb).
References
Bergmeir et al. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis 120(C), pp. 70–83.
Efron and Gong (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37(1), pp. 36–48.
Gardiner (2009). Stochastic Methods: A Handbook for the Natural and Social Sciences. 4th edition, Springer.
Hamilton (1994). Time Series Analysis. Princeton University Press, Princeton.
Kohavi (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), Vol. 2, pp. 1137–1143.
Mitchell (1997). Machine Learning. 1st edition, McGraw-Hill, New York, NY, USA.
Perlich et al. (2003). Tree induction vs. logistic regression: a learning-curve analysis. Journal of Machine Learning Research 4, pp. 211–255.
Politis and Romano (1994). The stationary bootstrap. Journal of the American Statistical Association 89(428), pp. 1303–1313.
Richards et al. (2011). On machine-learned classification of variable stars with sparse and noisy time-series data. The Astrophysical Journal 733(1), p. 10.
Roberts et al. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A 371(1984), 20110550.
Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological) 36(2), pp. 111–133.
Süzen and Ajraou (2016). Evaluating Gaussian processes for sparse irregular spatio-temporal data. arXiv preprint arXiv:1611.02978.
Williams and Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press.