# Foundations of Sequence-to-Sequence Modeling for Time Series

The availability of large amounts of time series data, paired with the performance of deep-learning algorithms on a broad class of problems, has recently led to significant interest in the use of sequence-to-sequence models for time series forecasting. We provide the first theoretical analysis of this time series forecasting framework. We include a comparison of sequence-to-sequence modeling to classical time series models, and as such our theory can serve as a quantitative guide for practitioners choosing between different modeling methodologies.


## 1 Introduction

Time series analysis is a critical component of real-world applications such as climate modeling, web traffic prediction, and neuroscience, as well as economics and finance. We focus on the fundamental question of time series forecasting. Specifically, we study the task of forecasting the next steps of an $m$-dimensional time series $Y$, where $m$ is assumed to be very large. For example, in climate modeling, $m$ may correspond to the number of locations at which we collect historical observations, and more generally to the number of sources which provide us with time series.

Often, the simplest way to tackle this problem is to approach it as $m$ separate tasks, where for each of the $m$ dimensions we build a model to forecast the univariate time series corresponding to that dimension. Auto-regressive and state-space models [8, 4, 6, 5, 12], as well as non-parametric approaches such as RNNs, are often used in this setting. To account for correlations between different time series, these models have also been generalized to the multivariate case [23, 24, 35, 13, 14, 1, 2, 38, 31, 41, 25]. In both univariate and multivariate settings, an observation at time $t$ is treated as a single sample point, and the model tries to capture relations between observations at times $t$ and $t+1$. Therefore, we refer to these models as local.

In contrast, an alternative methodology based on treating $m$ univariate time series as $m$ samples drawn from some unknown distribution has also gained popularity in recent years. In this setting, each of the $m$ dimensions of $Y$ is treated as a separate example and a single model is learned from these $m$ observations. Given $m$ time series of length $T$, this model learns to map vectors of past values to the corresponding vectors of future values. LSTMs and RNNs are a popular choice of model class for this setup [9, 11, 42, 22, 21, 43]. (Sequence-to-sequence models are also among the winning solutions in a recent time series forecasting competition: https://www.kaggle.com/c/web-traffic-time-series-forecasting.) Consequently, we refer to this framework as sequence-to-sequence modeling.

While there has been progress in understanding the generalization ability of local models [40, 27, 28, 29, 17, 18, 19, 20, 44], to the best of our knowledge the generalization properties of sequence-to-sequence modeling have not yet been studied, raising the following natural questions:

• What is the generalization ability of sequence-to-sequence models and how is it affected by the statistical properties of the underlying stochastic processes (e.g. non-stationarity, correlations)?

• When is sequence-to-sequence modeling preferable to local modeling, and vice versa?

We provide the first generalization guarantees for time series forecasting with sequence-to-sequence models. Our results are expressed in terms of simple, intuitive measures of non-stationarity and correlation strength between different time series and hence explicitly depend on the key components of the learning problem.

[Figure, panel (a): The local model trains each $h_{\text{loc},i}$ on time series $Y(i)$ split into multiple (partly overlapping) examples.]

We compare our generalization bounds to guarantees for local models and identify regimes under which one methodology is superior to the other. Therefore, our theory may also serve as a quantitative guide for a practitioner choosing the right modeling approach.

The rest of the paper is organized as follows: in Section 2, we formally define sequence-to-sequence and local modeling. In Section 3, we define the key tools that we require for our analysis. Generalization bounds for sequence-to-sequence models are given in Section 4. We compare sequence-to-sequence and local models in Section 5. Section 6 concludes this paper with a study of a setup that is a hybrid of the local and sequence-to-sequence models.

## 2 Sequence-to-sequence modeling

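We begin by providing a formal definition of sequence-to-sequence modeling. The learner receives a multi-dimensional time series $Y \in \mathbb{R}^{m \times T}$, which we view as $m$ time series of the same length $T$. We denote by $Y_t(i)$ the value of the $i$-th time series at time $t$ and write $Y_a^b(i)$ to denote the sequence $(Y_a(i), Y_{a+1}(i), \ldots, Y_b(i))$. Similarly, we let $Y_t(\cdot) = (Y_t(1), \ldots, Y_t(m))$ and $Y_a^b(\cdot) = (Y_a(\cdot), \ldots, Y_b(\cdot))$. In particular, $Y = Y_1^T(\cdot)$. In addition, the sequence $Y_1^{T-1}(\cdot)$ is of particular importance in our analysis and we denote it by $Y'$.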

The goal of the learner is to predict $Y_{T+1}(\cdot)$. (We are often interested in long-term forecasting, i.e. predicting several steps beyond time $T$. For simplicity, we only consider one-step-ahead prediction; however, all our results extend immediately to longer horizons.) We further assume that our input $Y$ is partitioned into a training set of $m$ examples $Z_1, \ldots, Z_m$, where each $Z_i = (Y_1^{T-1}(i), Y_T(i))$. The learner’s objective is to select a hypothesis $h$ from a given hypothesis set $\mathcal{H}$ that achieves a small generalization error:

$$\mathcal{L}(h \mid Y) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{\mathcal{D}}\big[ L(h(Y_1^T(i)), Y_{T+1}(i)) \mid Y \big],$$

where $L$ is a bounded loss function and $\mathcal{D}$ is the distribution of $Y_{T+1}$ conditioned on the past $Y$. (Most of the results in this paper can be straightforwardly extended to the unbounded case under a sub-Gaussian assumption.)

In other words, the learner seeks a hypothesis that maps sequences of past values $Y_1^T(i)$ to sequences of future values $Y_{T+1}(i)$, justifying our choice of “sequence-to-sequence” terminology. (In practice, each example may start at a different, arbitrary time, and may furthermore include some additional features. Our results can be extended to this case as well using an appropriate choice of the hypothesis set.) Incidentally, the machine translation problem studied by Sutskever et al. under the same name represents a special case of our problem when sequences (sentences) are independent and data is stationary. In fact, LSTM-based approaches used in the aforementioned translation problem are also common for time series forecasting [9, 11, 42, 22, 21, 43]. However, feed-forward NNs have also been successfully applied in this framework and in practice, our definition allows for any set of functions that map input sequences to output sequences. For instance, we can train a feed-forward NN to map $Y_1^{T-1}(i)$ to $Y_T(i)$ and at inference time use $Y_2^T(i)$ as input to obtain a forecast for $Y_{T+1}(i)$. (As another example, a runner-up in the Kaggle forecasting competition, https://www.kaggle.com/c/web-traffic-time-series-forecasting, used a combination of boosted decision trees and feed-forward networks, and as such employs the sequence-to-sequence approach.)
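The train-then-shift recipe described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear least-squares model in place of a neural network, and the helper names `fit_linear_s2s` and `forecast_next` are hypothetical.

```python
import numpy as np


def fit_linear_s2s(Y):
    """Least-squares map from the first T-1 values of each series to its T-th value."""
    Y = np.asarray(Y, dtype=float)
    X, y = Y[:, :-1], Y[:, -1]          # one training example per series
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def forecast_next(Y, w):
    """Reuse the same map on the last T-1 values to forecast time T+1."""
    Y = np.asarray(Y, dtype=float)
    return Y[:, 1:] @ w


# Toy data (assumption): m = 3 geometric series sharing the decay rate r.
a, r, T = np.array([1.0, 2.0, 3.0]), 0.5, 5
Y = a[:, None] * r ** np.arange(T)
w = fit_linear_s2s(Y)
forecast = forecast_next(Y, w)          # one forecast per series
```

A single weight vector is shared by all series, which is what distinguishes this setup from fitting one model per series.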

We contrast sequence-to-sequence modeling with local modeling, which consists of splitting each time series $Y(i)$ into a training set $Z_i = \{Z_{t,i}\}_t$, where $Z_{t,i} = (Y_{t-p}^{t-1}(i), Y_t(i))$ for some $p$, then learning a separate hypothesis $h_i$ for each $Z_i$. Each $h_i$ models relations between observations that are close in time, which is why we refer to this framework as local modeling. As in sequence-to-sequence modeling, the goal of a local learner is to achieve a small generalization error for $h_{\text{loc}} = (h_1, \ldots, h_m)$, given by:

$$\mathcal{L}(h_{\text{loc}} \mid Y) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{\mathcal{D}}\big[ L(h_i(Y_{T-p}^{T}(i)), Y_{T+1}(i)) \mid Y \big].$$

In order to model correlations between different time series, it is also common to split $Y$ into one single set of multivariate examples, each pairing the recent joint history of all $m$ series with their joint next value, and to learn a single hypothesis $h$ on these examples. As mentioned earlier, we consider this approach a variant of local modeling, since in this case $h$ again models relations between observations that are close in time.
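The difference between the two ways of building training examples from the same data can be sketched as follows. The function names are hypothetical, and the local window length `p` is an assumed parameter chosen by the practitioner.

```python
import numpy as np


def seq2seq_examples(Y):
    """One example per series: the first T-1 values as input, the T-th as target."""
    Y = np.asarray(Y, dtype=float)
    return Y[:, :-1], Y[:, -1]


def local_examples(y, p):
    """Sliding windows over a single series: p past values as input, the next as target."""
    y = np.asarray(y, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(y, p + 1)
    return windows[:, :p], windows[:, p]


# Toy data (assumption): m = 3 series of length T = 6.
Y = np.arange(18, dtype=float).reshape(3, 6)
X_s2s, y_s2s = seq2seq_examples(Y)       # m examples in total
X_loc, y_loc = local_examples(Y[0], 2)   # T - p examples for series 0 alone
```

The sequence-to-sequence set grows with the number of series, while each local set grows with the length of its own series.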

Finally, hybrid or local sequence-to-sequence models, which interpolate between local and sequence-to-sequence approaches, have also been considered in the literature. In this setting, each local example is split across the temporal dimension into smaller examples of length $p$, which are then used to train a single sequence-to-sequence model $h$. We discuss bounds for this specific case in Section 6.

Our work focuses on the statistical properties of the sequence-to-sequence model. We provide the first generalization bounds for sequence-to-sequence and hybrid models, and compare these to similar bounds for local models, allowing us to identify regimes in which one methodology is more likely to succeed.

Aside from learning guarantees, there are other important considerations that may lead a practitioner to choose one approach over others. For instance, the local approach is trivially parallelizable; on the other hand, when additional features are available, sequence-to-sequence modeling provides an elegant solution to the cold start problem in which at test time we are required to make predictions on time series for which no historical data is available.

## 3 Correlations and non-stationarity

In the standard supervised learning scenario, it is common to assume that training and test data are drawn i.i.d. from some unknown distribution. However, this assumption does not hold for time series, where observations at different times as well as across different series may be correlated. Furthermore, the data-generating distribution may also evolve over time.

These phenomena present a significant challenge to providing guarantees in time series forecasting. To quantify non-stationarity and correlations, we introduce the notions of mixing coefficients and discrepancy, which are defined below.

The final ingredient we need to analyze sequence-to-sequence learning is the Rademacher complexity $\mathfrak{R}_m(\mathcal{F})$ of a family of functions $\mathcal{F}$ on a sample of size $m$, which has been previously used to characterize learning in the i.i.d. setting [16, 30]. In App. A, we include a brief discussion of its properties.

### 3.1 Expected mixing coefficients

To measure the strength of dependency between time series, we extend the notion of $\beta$-mixing coefficients to expected $\beta$-mixing coefficients, which are a more appropriate measure of correlation in sequence-to-sequence modeling.

###### Definition 1 (Expected $\beta_{\text{s2s}}$ coefficients).

Let $i, j \in [m]$. We define

$$\beta_{\text{s2s}}(i, j) = \mathbb{E}_{Y'}\Big[ \big\| P(Y_T(i) \mid Y')\, P(Y_T(j) \mid Y') - P(Y_T(i), Y_T(j) \mid Y') \big\|_{TV} \Big],$$

where $\|\cdot\|_{TV}$ denotes the total variation norm. For a subset $I \subseteq [m]$, we define

$$\beta_{\text{s2s}}(I) = \sup_{i, j \in I} \beta_{\text{s2s}}(i, j).$$

The coefficient $\beta_{\text{s2s}}(i, j)$ captures how close $Y_T(i)$ and $Y_T(j)$ are to being independent, given $Y'$ (and averaged over all realizations of $Y'$). We further study these coefficients in Section 4, where we derive explicit upper bounds on expected $\beta$-mixing coefficients for various standard classes of stochastic processes, including spatio-temporal and hierarchical time series.

We also define the following related notion of $\bar{\beta}$-coefficients.

###### Definition 2 (Unconditional $\bar{\beta}$ coefficients).

Let $i, j \in [m]$. We define

$$\bar{\beta}(i, j) = \big\| \Pr(Y_1^T(i), Y_1^T(j)) - \Pr(Y_1^T(i)) \Pr(Y_1^T(j)) \big\|_{TV},$$

$$\bar{\beta}'(i, j) = \big\| \Pr(Y_1^{T-1}(i), Y_1^{T-1}(j)) - \Pr(Y_1^{T-1}(i)) \Pr(Y_1^{T-1}(j)) \big\|_{TV},$$

and as before, for a subset $I$ of $[m]$, write $\bar{\beta}(I) = \sup_{i, j \in I} \bar{\beta}(i, j)$ (and similarly for $\bar{\beta}'$).

Note that the $\beta_{\text{s2s}}$ coefficients measure the strength of dependence between time series conditioned on the history observed so far, while the $\bar{\beta}$ coefficients measure the (unconditional) strength of dependence between time series. The following result relates these two notions.

###### Lemma 1.

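For $\bar{\beta}$ (and similarly for $\bar{\beta}'$), we have the following upper bound:

$$\bar{\beta}(i, j) \le \beta_{\text{s2s}}(i, j) + \mathbb{E}_{Y'}\big[ \mathrm{Cov}\big( \Pr(Y_T(i) \mid Y'), \Pr(Y_T(j) \mid Y') \big) \big].$$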

The proof of this result (as well as all other proofs in this paper) is deferred to the supplementary material.

Finally, we require the notion of tangent collections, within which time series are independent.

###### Definition 3 (Tangent collection).

Given a collection of time series $C = \{Y(i)\}_{i \in I}$, we define the tangent collection $\tilde{C} = \{\tilde{Y}(i)\}_{i \in I}$ such that $\tilde{Y}(i)$ is drawn according to the marginal distribution of $Y(i)$ and such that $\tilde{Y}(i)$ and $\tilde{Y}(j)$ are independent for $i \neq j$.

The notion of tangent collections, combined with mixing coefficients, allows us to reduce the analysis of correlated time series in $C$ to the analysis of independent time series in $\tilde{C}$ (see Prop. 6 in the appendix).

### 3.2 Discrepancy

Various notions of discrepancy have been previously used to measure the non-stationarity of the underlying stochastic processes with respect to the hypothesis set and loss function in the analysis of local models [18, 44]. In this work, we introduce a notion of discrepancy specifically tailored to the sequence-to-sequence modeling scenario, taking into account both the hypothesis set and the loss function.

###### Definition 4 (Discrepancy).

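Let $\mathcal{D}$ be the distribution of $Y_{T+1}$ conditioned on $Y_1^T$ and let $\mathcal{D}'$ be the distribution of $Y_T$ conditioned on $Y_1^{T-1}$. We define the discrepancy $\Delta$ as $\Delta = \sup_{h \in \mathcal{H}} |\mathcal{L}(h \mid Y) - \mathcal{L}(h \mid Y')|$, where $\mathcal{L}(h \mid Y') = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{\mathcal{D}'}\big[ L(h(Y_1^{T-1}(i)), Y_T(i)) \mid Y' \big]$.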

The discrepancy forms a pseudo-metric on the space of probability distributions and can be completed to a Wasserstein metric (by extending the function class to all Lipschitz functions). This also immediately implies that the discrepancy can be further upper bounded by the $\ell_1$-distance and by the relative entropy between the conditional distributions of $Y_T$ and $Y_{T+1}$ (via Pinsker’s inequality). However, unlike these other divergences, the discrepancy takes into account both the hypothesis set and the loss function, making it a finer measure of non-stationarity.

However, the most important property of the discrepancy is that it can be upper bounded by the related notion of symmetric discrepancy $\Delta_s$, which can be estimated from data.

###### Definition 5 (Symmetric discrepancy).

We define the symmetric discrepancy as

$$\Delta_s = \frac{1}{m} \sup_{h, h' \in \mathcal{H}} \Big| \sum_{i=1}^{m} L\big(h(Y_1^T(i)), h'(Y_1^T(i))\big) - L\big(h(Y_1^{T-1}(i)), h'(Y_1^{T-1}(i))\big) \Big|.$$

###### Proposition 1.

Let $\mathcal{H}$ be a hypothesis space and let $L$ be a bounded loss function which respects the triangle inequality. Let $h' \in \mathcal{H}$ be any hypothesis. Then,

$$\Delta \le \Delta_s + \mathcal{L}(h' \mid Y) + \mathcal{L}(h' \mid Y').$$

We do not require test labels to evaluate $\Delta_s$. Since $\Delta_s$ only depends on the observed data, it can be computed directly from samples, making it a useful tool to assess the non-stationarity of the learning problem.
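As a rough illustration of how the symmetric discrepancy can be evaluated from observed data alone, the sketch below computes it by brute force over a small finite hypothesis set. The finite set, the "scaled last value" hypotheses, and the absolute loss are all assumptions made for the example; for rich hypothesis classes the supremum would need an optimization procedure instead.

```python
import itertools

import numpy as np


def symmetric_discrepancy(Y, hypotheses, loss):
    """Brute-force symmetric discrepancy over a finite hypothesis set.

    Y has shape (m, T). Each hypothesis maps an (m, t) array of histories
    to m predictions, so it applies to both Y[:, :T] and Y[:, :T-1].
    """
    Y = np.asarray(Y, dtype=float)
    preds_full = [h(Y) for h in hypotheses]           # inputs Y_1^T
    preds_prev = [h(Y[:, :-1]) for h in hypotheses]   # inputs Y_1^{T-1}
    best = 0.0
    for a, b in itertools.product(range(len(hypotheses)), repeat=2):
        gap = np.sum(loss(preds_full[a], preds_full[b])
                     - loss(preds_prev[a], preds_prev[b]))
        best = max(best, abs(gap) / Y.shape[0])
    return best


def last_value(scale):
    """Hypothetical hypothesis: predict a scaled copy of the most recent value."""
    return lambda hist: scale * hist[:, -1]


def abs_loss(u, v):
    """Absolute loss, which respects the triangle inequality."""
    return np.abs(u - v)


hyps = [last_value(0.5), last_value(1.0)]
flat = symmetric_discrepancy(np.ones((4, 5)), hyps, abs_loss)
trend = symmetric_discrepancy(
    np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]), hyps, abs_loss)
```

Constant series give a value of zero, while trending series give a strictly positive value, matching the reading of this quantity as a measure of non-stationarity.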

Another useful property of $\Delta_s$ is that, for certain classes of stochastic processes, we can provide a direct analysis of this quantity.

###### Proposition 2.

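Let $I_1, \ldots, I_k$ be a partition of $[m]$, $C_1, \ldots, C_k$ be the corresponding partition of $Y$ and $C'_1, \ldots, C'_k$ be the corresponding partition of $Y'$. Write $c = \min_j |C_j|$, and define the expected discrepancy

$$\Delta_e = \sup_{h, h' \in \mathcal{H}} \Big[ \mathbb{E}_Y\big[ L(h(Y_1^T), h'(Y_1^T)) \big] - \mathbb{E}_Y\big[ L(h(Y_1^{T-1}), h'(Y_1^{T-1})) \big] \Big].$$

Then, writing $\mathfrak{R}$ for the Rademacher complexity (see Appendix A), we have with probability $1 - \delta$,

$$\Delta_s \le \Delta_e + \max\Big( \max_j \mathfrak{R}_{|C_j|}(\tilde{C}'_j),\ \max_j \mathfrak{R}_{|C_j|}(\tilde{C}_j) \Big) + \sqrt{ \frac{1}{2c} \log \frac{2k}{\delta - \sum_j (|I_j| - 1)\big[ \bar{\beta}(I_j) + \bar{\beta}'(I_j) \big]} }.$$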

The expected discrepancy can be computed analytically for many classes of stochastic processes. For example, for stationary processes, we can show that it is negligible. Similarly, for covariance-stationary processes with linear hypothesis sets and the squared loss function, the discrepancy is once again negligible. (Recall that a process $(X_t)$ is called stationary if for any $t$, $s$ and $k$, the distributions of $(X_t, \ldots, X_{t+k})$ and $(X_{t+s}, \ldots, X_{t+s+k})$ are the same. Covariance stationarity is a weaker condition that requires that $\mathbb{E}[X_t]$ be independent of $t$ and that $\mathbb{E}[X_t X_s] = f(t - s)$ for some function $f$.) These examples justify our use of the discrepancy as a natural measure of non-stationarity. In particular, the covariance-stationary example highlights that the discrepancy takes into account not only the distribution of the stochastic processes but also $\mathcal{H}$ and $L$.

###### Proposition 3.

If $Y(i)$ is stationary for all $i \in [m]$, and $\mathcal{H}$ is a hypothesis space whose hypotheses only consider a fixed number of the most recent values of $Y(i)$, then $\Delta_e = 0$.

###### Proposition 4.

If $Y(i)$ is covariance stationary for all $i \in [m]$, $L$ is the squared loss, and $\mathcal{H}$ is a linear hypothesis space, then $\Delta_e = 0$.

Another insightful example is the case when $\mathcal{H} = \{h\}$ contains a single hypothesis: then $\Delta_e = 0$ even if $Y$ is non-stationary, which illustrates that learning is trivial for trivial hypothesis sets, even in non-stationary settings.

The final example that we consider in this section is the case of non-stationary periodic time series. Remarkably, we show that the discrepancy is still negligible in this case provided that we observe all periods with equal probability.

###### Proposition 5.

If the $Y(i)$ are periodic with a common period and the observed starting time of each $Y(i)$ is distributed uniformly at random over one period, then $\Delta_e = 0$.

## 4 Generalization bounds

We now present our generalization bounds for time series prediction with sequence-to-sequence models. We write $\mathcal{F} = \{ f(h, \cdot) : h \in \mathcal{H} \}$, where $f(h, Z_i)$ is the loss of hypothesis $h$ given by $f(h, Z_i) = L(h(Y_1^{T-1}(i)), Y_T(i))$. To obtain bounds on the generalization error $\mathcal{L}(h \mid Y)$, we study the gap between $\mathcal{L}(h \mid Y)$ and the empirical error $\hat{\mathcal{L}}(h)$ of a hypothesis $h$, where

$$\hat{\mathcal{L}}(h) = \frac{1}{m} \sum_{i=1}^{m} f(h, Z_i).$$

That is, we aim to give a high probability bound on the supremum of the empirical process $\Phi(Y) = \sup_{h \in \mathcal{H}} \big[ \mathcal{L}(h \mid Y) - \hat{\mathcal{L}}(h) \big]$. We take the following high-level approach: we first partition the training set into $k$ collections $C_1, \ldots, C_k$ such that within each collection, correlations between different time series are as weak as possible. We then analyze each collection $C_j$ by comparing the generalization error of sequence-to-sequence learning on $C_j$ to the sequence-to-sequence generalization error on the tangent collection $\tilde{C}_j$.

###### Theorem 4.1.

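Let $C_1, \ldots, C_k$ form a partition of the training input $Y$ and let $I_j$ denote the set of indices of time series that belong to $C_j$. Assume that the loss function $L$ is bounded by 1. Then, we have for any $\delta > \sum_j (|I_j| - 1)\, \beta_{\text{s2s}}(I_j)$, with probability $1 - \delta$,

$$\Phi(Y) \le \max_j \big[ \hat{\mathfrak{R}}_{\tilde{C}_j}(\mathcal{F}) \big] + \Delta + \frac{1}{\sqrt{2 \min_j |I_j|}} \sqrt{ \log \frac{k}{\delta - \sum_j (|I_j| - 1)\, \beta_{\text{s2s}}(I_j)} }.$$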

Theorem 4.1 illustrates the trade-offs that are involved in sequence-to-sequence learning for time series forecasting. As is a function of , we expect it to decrease as grows (i.e. more time series we have), allowing for smaller as increases.

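Assuming that the $C_j$ are of the same size, if $\mathcal{F}$ is a collection of neural networks of bounded depth and width then $\hat{\mathfrak{R}}_{\tilde{C}_j}(\mathcal{F}) = O(\sqrt{kT/m})$ (see Appendix A). Therefore, $\mathcal{L}(h \mid Y) \le \hat{\mathcal{L}}(h) + \Delta + O(\sqrt{kT/m})$ with high probability uniformly over $h \in \mathcal{H}$, provided that the mixing term $\sum_j (|I_j| - 1)\, \beta_{\text{s2s}}(I_j)$ is small relative to $\delta$. This shows that extremely high-dimensional ($m \gg 1$) time series are beneficial for sequence-to-sequence models, whereas series with long histories will generally not benefit from sequence-to-sequence learning. Note also that correlations in data reduce the effective sample size from $m$ to $m/k$.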

Furthermore, Theorem 4.1 indicates that balancing the complexity of the model (e.g. depth and width of a neural net) with the fit it provides to the data is critical for controlling both the discrepancy and Rademacher complexity terms. We further illustrate this bound with several examples below.
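To get a feel for how the terms of the bound in Theorem 4.1 interact, the following sketch evaluates its right-hand side numerically. It assumes the caller already has values for the largest Rademacher complexity term, the discrepancy, and one mixing coefficient per collection; the function name and all numbers in the usage lines are hypothetical.

```python
import math


def s2s_gap_bound(rademacher, discrepancy, sizes, betas, delta):
    """Right-hand side of the bound on the generalization gap.

    rademacher: largest Rademacher complexity term over the collections
    sizes:      number of series |I_j| in each collection
    betas:      expected mixing coefficient of each collection
    """
    k = len(sizes)
    mixing = sum((n - 1) * b for n, b in zip(sizes, betas))
    if delta <= mixing:
        raise ValueError("delta must exceed the total mixing penalty")
    concentration = math.sqrt(math.log(k / (delta - mixing)) / (2 * min(sizes)))
    return rademacher + discrepancy + concentration


# Two collections of 50 series each, first independent, then weakly correlated.
independent = s2s_gap_bound(0.1, 0.05, [50, 50], [0.0, 0.0], delta=0.1)
correlated = s2s_gap_bound(0.1, 0.05, [50, 50], [0.0005, 0.0005], delta=0.1)
```

Even small mixing coefficients inflate the bound, because each one is multiplied by the size of its collection before being subtracted from the confidence parameter.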

### 4.1 Independent time series

We begin by considering the case where all dimensions of $Y$ are independent. Although this may seem a restrictive assumption, it arises in a variety of applications: in neuroscience, different dimensions may represent brain scans of different patients; in reinforcement learning, they may correspond to different trajectories of a robotic arm.

###### Theorem 4.2.

Let $\mathcal{H}$ be a given hypothesis space with associated function family $\mathcal{F}$ corresponding to a loss function $L$ bounded by 1. Suppose that all dimensions of $Y$ are independent and let $I = [m]$; then $\beta_{\text{s2s}}(I) = 0$ and so for any $\delta > 0$, with probability at least $1 - \delta$ and for any $h \in \mathcal{H}$:

$$\mathcal{L}(h \mid Y) \le \hat{\mathcal{L}}(h) + 2 \mathfrak{R}_m(\mathcal{F}) + \Delta + \sqrt{\frac{\log(1/\delta)}{m}}.$$

Theorem 4.2 shows that when time series are independent, learning is not affected by correlations in the samples and can only be obstructed by the non-stationarity of the problem, captured via $\Delta$.

Note that when examples are drawn i.i.d., we have $\Delta = 0$ in Theorem 4.2: we recover the standard generalization results for regression problems.

### 4.2 Correlated time series

We now consider several concrete examples of high-dimensional time series in which different dimensions may be correlated. This setting is common in a variety of applications including stock market indicators, traffic conditions, climate observations at different locations, and energy demand.

Suppose that each $Y(i)$ is generated by an auto-regressive (AR) process with correlated noise

$$y_{t+1}(i) = \Theta_i(y_0^t(i)) + \varepsilon_{t+1}(i) \qquad (4.1)$$

where the $\Theta_i$ are unknown parameters and the noise vectors $\varepsilon_t \in \mathbb{R}^m$ are drawn from a Gaussian distribution $\mathcal{N}(0, \Sigma)$ where, crucially, $\Sigma$ is not diagonal. The following lemma is key to our analysis.

###### Lemma 2.

Two AR processes generated by (4.1) such that verify .
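To make the generative model (4.1) concrete, the sketch below simulates a simplified special case: first-order AR processes whose Gaussian noise is correlated across series. Restricting to order one, starting each series from its first noise draw, and the function name `simulate_correlated_ar1` are all assumptions made for illustration.

```python
import numpy as np


def simulate_correlated_ar1(theta, sigma, T, seed=0):
    """Simulate y_{t+1}(i) = theta[i] * y_t(i) + eps_{t+1}(i), eps_t ~ N(0, sigma).

    theta: length-m array of AR(1) coefficients
    sigma: (m, m) noise covariance, non-diagonal for correlated series
    Returns an array of shape (m, T).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    m = theta.shape[0]
    noise = rng.multivariate_normal(np.zeros(m), sigma, size=T)  # shape (T, m)
    Y = np.zeros((m, T))
    Y[:, 0] = noise[0]
    for t in range(1, T):
        Y[:, t] = theta * Y[:, t - 1] + noise[t]
    return Y


sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Y = simulate_correlated_ar1([0.5, -0.3], sigma, T=200)
```

Off-diagonal entries of `sigma` control how strongly the series co-move, which is the quantity the mixing coefficients are meant to capture.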

#### Hierarchical time series.

As our first example, we consider the case of hierarchical time series that arises in many real-world applications [39, 37]. Consider the problem of energy demand forecasting: frequently, one observes a sequence of energy demands at a variety of levels: single household, local neighborhood, city, region and country. This imposes a natural hierarchical structure on these time series.

Formally, we consider the following hierarchical scenario: a binary tree of total depth , where time series are generated at each of the leaves. At each leaf, is given by the AR process (4.1) where we impose given the length of the shortest path from either leaf to the closest common ancestor between and . Hence, as increases, and grow more independent.

For the bound of Theorem 4.1 to be non-trivial, we require a partition of such that within a given the time series are close to being independent. One such construction is the following: fix a depth and construct such that each contains exactly one time series from each sub-tree of depth ; hence, . Lemma 2 shows that for each , we have . For example, setting , it follows that for any , with probability ,

 L(h|Y)≤

Furthermore, suppose the model is a linear AR process given by . Then, the underlying stochastic process is weakly stationary and by Prop. 3 our bound reduces to: . By Proposition 5, similar results hold when $Y$ is periodic.
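The partition used in the hierarchical example, in which each collection takes exactly one leaf from every sub-tree below a chosen depth, can be sketched as follows. Leaves are assumed to be numbered left to right, and the names `depth` and `cut_depth` are hypothetical labels for the total tree depth and the depth at which the sub-trees are rooted.

```python
def hierarchical_partition(depth, cut_depth):
    """Partition the 2**depth leaves of a complete binary tree so that each
    collection holds exactly one leaf from every sub-tree rooted at cut_depth."""
    if not 0 <= cut_depth <= depth:
        raise ValueError("cut_depth must lie between 0 and depth")
    leaves_per_subtree = 2 ** (depth - cut_depth)
    collections = [[] for _ in range(leaves_per_subtree)]
    for leaf in range(2 ** depth):
        # Leaves sharing a sub-tree are contiguous, so the position of a leaf
        # inside its sub-tree is its index modulo the sub-tree size.
        collections[leaf % leaves_per_subtree].append(leaf)
    return collections


partition = hierarchical_partition(depth=3, cut_depth=1)
```

Two leaves in the same collection never share a sub-tree rooted at `cut_depth`, so their closest common ancestor lies above that depth. A larger `cut_depth` yields fewer but larger collections, at the cost of letting more closely related leaves share a collection.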

#### Spatio-temporal processes.

Another common task is spatio-temporal forecasting, in which historical observations are available at different locations. These observations may represent temperature at different locations, as in the case of climate modeling [26, 10], or car traffic at different locations.

It is natural to expect correlations between time series to decay as the geographical distance between them increases. As a simplified example, consider that the sphere is subdivided according to a geodesic grid and a time series is drawn from the center of each patch according to (4.1), also with but this time with equal to the (geodesic) distance between two cell centers. We choose subsets with the goal of minimizing the strength of dependencies between time series within each subset. Assuming we divide the sphere into collections of size approximately such that the minimal distance between two points in a set is , we obtain

 L(h∣Y)≤

As in the case of hierarchical time series, Proposition 3 or Proposition 5 can be used to remove the dependence on for certain families of stochastic processes.

## 5 Comparison to local models

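This section provides a comparison of learning guarantees for sequence-to-sequence models with those of local models. In particular, we will compare our bounds on the generalization gap $\Phi(Y)$ for sequence-to-sequence models and local models, where the gap for local models is given by

$$\Phi_{\text{loc}}(Y) = \sup_{(h_1, \ldots, h_m) \in \mathcal{H}^m} \big[ \mathcal{L}(h_{\text{loc}} \mid Y) - \hat{\mathcal{L}}(h_{\text{loc}}) \big] \qquad (5.1)$$

where $\hat{\mathcal{L}}(h_{\text{loc}})$ is the average empirical error of $h_i$ on the sample $Z_i$, defined as $\hat{\mathcal{L}}(h_{\text{loc}}) = \frac{1}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} f(h_i, Z_{t,i})$ where $f(h_i, Z_{t,i}) = L(h_i(Y_{t-p}^{t-1}(i)), Y_t(i))$.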

To give a high probability bound for this setting, we take advantage of existing results for the single local model. These results are given in terms of a slightly different notion of discrepancy $\Delta(Z_i)$, defined by

$$\Delta(Z_i) = \sup_{h \in \mathcal{H}} \Big[ \mathbb{E}\big[ L(h(Y_{T-p+1}^{T}), Y_{T+1}) \mid Y_1^T \big] - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[ L(h(Y_{t-p}^{t-1}), Y_t) \mid Y_1^{t-1} \big] \Big].$$

Another required ingredient to state these results is the expected sequential covering number $\mathbb{E}_{v \sim T(Z_i)}[\mathcal{N}_1(\alpha, \mathcal{F}, v)]$. For many hypothesis sets, the log of the sequential covering number admits upper bounds similar to those presented earlier for the Rademacher complexity. We provide some examples below and refer the interested reader to the literature on local models for details.

###### Theorem 5.1.

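For any $\delta > 0$ and $\alpha > 0$, with probability at least $1 - \delta$,

$$\Phi_{\text{loc}}(Y) \le \frac{1}{m} \sum_{i=1}^{m} \Delta(Z_i) + 2\alpha + \sqrt{ \frac{2}{T} \log \frac{m \max_i \mathbb{E}_{v \sim T(Z_i)}[\mathcal{N}_1(\alpha, \mathcal{F}, v)]}{\delta} }.$$

Choosing $\alpha$ appropriately, we can show that, for standard local models such as linear hypothesis spaces, we have

$$\sqrt{ \frac{1}{T} \log \frac{2m\, \mathbb{E}_{v \sim T(Z)}[\mathcal{N}_1(\alpha, \mathcal{F}, v)]}{\delta} } = O\Big( \sqrt{\frac{\log m}{T}} \Big).$$

In this case, it follows that $\Phi_{\text{loc}}(Y) \le \frac{1}{m} \sum_{i=1}^{m} \Delta(Z_i) + O(\sqrt{\log m / T})$, where the last term in this bound should be compared with the corresponding (non-discrepancy) terms in the bound of Theorem 4.1, which, as discussed above, scale as $O(\sqrt{T/m})$ for a variety of different hypothesis sets.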

Hence, when we have access to relatively few time series compared to their length ($m \ll T$), learning to predict each time series as its own independent problem will with high probability lead to a better generalization bound. On the other hand, in extremely high-dimensional settings when we have significantly more time series than time steps ($m \gg T$), sequence-to-sequence learning will (with high probability) provide superior performance. We also expect the performance of sequence-to-sequence models to deteriorate as the correlation between time series increases.

A direct comparison of the bounds in Theorem 4.1 and Theorem 5.1 is complicated by the fact that the discrepancies that appear in these results are different. In fact, it is possible to design examples where $\frac{1}{m} \sum_i \Delta(Z_i)$ is constant and $\Delta$ is negligible, and vice-versa.

Consider a tent function such that for and for . Let be its periodic extension to the real line, and define . Suppose that we sample uniformly and times, and observe time series . Then, as we have shown in Proposition 5, is negligible for sequence-to-sequence models. However, unless the model class is trivial, it can be shown that is bounded away from zero for all .

Conversely, suppose we sample uniformly times and observe time series . Consider a set of local models that learn an offset from the previous point . It can be shown that in this case , whereas is bounded away from zero for any non-trivial class of sequence-to-sequence models.

From a practical perspective, we can simply use $\Delta_s$ and empirical estimates of $\Delta(Z_i)$ to decide whether to choose sequence-to-sequence or local models.

We conclude this section with an observation that similar results to Theorem 4.1 can be proved for multivariate local models with the only difference that the sample complexity of the problem scales as

, and hence these models are even more prone to the curse of dimensionality.

## 6 Hybrid models

In this section, we discuss models that interpolate between local and sequence-to-sequence models. This hybrid approach trains a single model $h$ on the union of the local training sets used to train the $m$ models in the local approach. The bounds that we state here require the following extension of the discrepancy to $\Delta_t$, defined as

$$\Delta_t = \frac{1}{m} \sup_{h \in \mathcal{H}} \Big| \sum_{i=1}^{m} \mathbb{E}_{\mathcal{D}}\big[ L(h(Y_{t-p-1}^{t-1}(i)), Y_t(i)) \mid Y_1^{t-1} \big] - \mathbb{E}_{\mathcal{D}'}\big[ L(h(Y_{T-p}^{T}(i)), Y_{T+1}(i)) \mid Y \big] \Big|.$$

Many of the properties that were discussed for the discrepancy $\Delta$ carry over to $\Delta_t$ as well. The empirical error in this case is the same as for the local models:

$$\hat{\mathcal{L}}(h) = \frac{1}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} f(h, Z_{t,i}).$$

Observe that one straightforward way to obtain a bound for hybrid models is to apply Theorem 5.1 with $(h_1, \ldots, h_m) = (h, \ldots, h)$. Alternatively, we can apply Theorem 4.1 at every time point $t = 1, \ldots, T$.

Combining these results via union bound leads to the following learning guarantee for hybrid models.

###### Theorem 6.1.

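Let $C_1, \ldots, C_k$ form a partition of the training input $Y$ and let $I_j$ denote the set of indices of time series that belong to $C_j$. Assume that the loss function $L$ is bounded by 1. Then, for any $\delta > 2 \sum_j (|I_j| - 1)\, \beta_{\text{s2s}}(I_j)$, with probability $1 - \delta$, for any $h \in \mathcal{H}$ and any $\alpha > 0$,

$$\mathcal{L}(h \mid Y) \le \hat{\mathcal{L}}(h) + \min(B_1, B_2),$$

where

$$B_1 = \frac{1}{T} \sum_{t=1}^{T} \Delta_t + \max_j \hat{\mathfrak{R}}_{\tilde{C}_j}(\mathcal{F}) + \frac{1}{\sqrt{2 \min_j |I_j|}} \sqrt{ \log \frac{2Tk}{\delta - 2 \sum_j (|I_j| - 1)\, \beta_{\text{s2s}}(I_j)} },$$

$$B_2 = \frac{1}{m} \sum_{i=1}^{m} \Delta(Z_i) + 2\alpha + \sqrt{ \frac{2}{T} \log \frac{2m \max_i \mathbb{E}_{v \sim T(Z_i)}[\mathcal{N}_1(\alpha, \mathcal{F}, v)]}{\delta} }.$$

Using the same arguments for the complexity terms as in the case of sequence-to-sequence and local models, this result shows that hybrid models are successful with high probability when $T \gg m$ or correlations between time series are strong, as well as when $m \gg T$.

Potential costs for this model are hidden in the new discrepancy term $\Delta_t$. This term leads to different bounds depending on the particular non-stationarity in the given problem. As before, this trade-off can be assessed empirically using the data-dependent version of the discrepancy.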

Note that the above bound does not imply that hybrid models are superior to local models: using hypotheses can help us achieve a better trade-off between and , and vice versa.

## 7 Conclusion

We formally introduce sequence-to-sequence learning for time series, a framework in which a model learns to map sequences of past values to their next values. We provide the first generalization bounds for sequence-to-sequence modeling. Our results are stated in terms of new notions of discrepancy and expected mixing coefficients. We study these new notions for several different families of stochastic processes including stationary, weakly stationary, periodic, hierarchical and spatio-temporal time series.

Furthermore, we show that our discrepancy can be computed from data, making it a useful tool for practitioners to empirically assess the non-stationarity of their problem. In particular, the discrepancy can be used to determine whether the sequence-to-sequence methodology is likely to succeed based on the inherent non-stationarity of the problem.

Moreover, compared to the local framework for time series forecasting, in which independent models for each one-dimensional time series are learned, our analysis shows that the sample complexity of sequence-to-sequence models scales as $O(\sqrt{T/m})$, providing superior guarantees when the number $m$ of time series is significantly greater than the length $T$ of each series, provided that different series are weakly correlated.

Conversely, we show that the sample complexity of local models scales as $O(\sqrt{\log m / T})$, so local models should be preferred when $m \ll T$ or when time series are strongly correlated. We also study hybrid models for which learning guarantees are favorable both when $m \gg T$ and when $m \ll T$, but which have a more complex trade-off in terms of discrepancy.

As a final note, the analysis we have carried through is easily extended to show similar results for the sequence-to-sequence scenario when the test data includes new series not observed during training, as is often the case in a variety of applications.

## References

• Banbura et al.  Marta Banbura, Domenico Giannone, and Lucrezia Reichlin. Large Bayesian vector auto regressions. Journal of Applied Econometrics, 25(1):71–92, 2010.
• Basu and Michailidis  Sumanta Basu and George Michailidis. Regularized estimation in sparse high-dimensional time series models. Ann. Statist., 43(4):1535–1567, 2015.
• Bianchi et al.  Filippo Maria Bianchi, Enrico Maiorino, Michael C. Kampffmeyer, Antonello Rizzi, and Robert Jenssen. Recurrent Neural Networks for Short-Term Load Forecasting - An Overview and Comparative Analysis. Springer Briefs in Computer Science. Springer, 2017.
• Bollerslev  Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307 – 327, 1986.
• Box and Jenkins  George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.
• Brockwell and Davis  Peter J Brockwell and Richard A Davis. Time Series: Theory and Methods. Springer-Verlag New York, Inc., 1986.
• Doukhan  P. Doukhan. Mixing: Properties and Examples. Lecture notes in statistics. Springer, 1994.
• Engle  Robert Engle.

Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation.

Econometrica, 50(4):987–1007, 1982.
• Flunkert et al.  Valentin Flunkert, David Salinas, and Jan Gasthaus. Deepar: Probabilistic forecasting with autoregressive recurrent networks. Arxiv:1704.04110, 2017.
• Ghafarianzadeh and Monteleoni  Mahsa Ghafarianzadeh and Claire Monteleoni. Climate prediction via matrix completion. In

Late-Breaking Developments in the Field of Artificial Intelligence

, volume WS-13-17 of AAAI Workshops. AAAI, 2013.
• Goel et al.  Hardik Goel, Igor Melnyk, and Arindam Banerjee. R2N2: Residual recurrent neural networks for multivariate time series forecasting. arXiv:1709.03159, 2017.
• Hamilton  James Douglas Hamilton. Time series analysis. Princeton Univ. Press, 1994.
• Han et al. [2015a] Fang Han, Huanran Lu, and Han Liu. A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research, 16:3115–3150, 2015a.
• Han et al. [2015b] Fang Han, Sheng Xu, and Han Liu. Rate optimal estimation of high dimensional time series. Technical report, Johns Hopkins University, 2015b.
• Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997. ISSN 0899-7667.
• Koltchinskii and Panchenko  V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1):1–50, 2002.
• Kuznetsov and Mohri  Vitaly Kuznetsov and Mehryar Mohri. Generalization Bounds for Time Series Prediction with Non-stationary Processes, pages 260–274. Springer International Publishing, 2014.
• Kuznetsov and Mohri  Vitaly Kuznetsov and Mehryar Mohri. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems 28, pages 541–549. Curran Associates, Inc., 2015.
• Kuznetsov and Mohri  Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and online learning. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1190–1213. PMLR, 2016.
• Kuznetsov and Mohri  Vitaly Kuznetsov and Mehryar Mohri. Discriminative state space models. In NIPS, Long Beach, CA, USA, 2017.
• Laptev et al.  Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at Uber. In ICML Workshop, 2017.
• Li et al.  Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv:1707.01926, 2017.
• Lütkepohl  Helmut Lütkepohl. Chapter 6 Forecasting with VARMA models. In Handbook of Economic Forecasting, volume 1, pages 287 – 325. Elsevier, 2006.
• Lütkepohl  Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Publishing Company, Incorporated, 2007.
• Lv et al.  Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16:865–873, 2015.
• McQuade and Monteleoni  Scott McQuade and Claire Monteleoni. Global climate model tracking using geospatial neighborhoods. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada. AAAI Press, 2012.
• Meir and Hellerstein  Ron Meir and Lisa Hellerstein. Nonparametric time series prediction through adaptive model selection. In Machine Learning, pages 5–34, 2000.
• Mohri and Rostamizadeh  Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21, pages 1097–1104. Curran Associates, Inc., 2009.
• Mohri and Rostamizadeh  Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. J. Mach. Learn. Res., 11:789–814, 2010. ISSN 1532-4435.
• Mohri et al.  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
• Negahban and Wainwright  Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069–1097, 2011.
• Neyshabur et al.  Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1376–1401. PMLR, 2015.
• Rakhlin et al.  Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153, 2015.
• Romeu et al.  Pablo Romeu, Francisco Zamora-Martínez, Paloma Botella-Rocamora, and Juan Pardo. Time-Series Forecasting of Indoor Temperature Using Pre-trained Deep Neural Networks, pages 451–458. Springer Berlin Heidelberg, 2013.
• Song and Bickel  Song Song and Peter J. Bickel. Large vector auto regressions. arXiv:1106.3915, 2011.
• Sutskever et al.  Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
• Taieb et al.  Souhaib Ben Taieb, James W. Taylor, and Rob J. Hyndman. Coherent probabilistic forecasts for hierarchical time series. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3348–3357. PMLR, 2017.
• Sun and Malioutov  W. Sun and D. Malioutov. Time series forecasting with shared seasonality patterns using non-negative matrix factorization. In The 29th Annual Conference on Neural Information Processing Systems (NIPS). Time Series Workshop, 2015.
• Wickramasuriya et al.  Shanika L Wickramasuriya, George Athanasopoulos, and Rob J Hyndman. Forecasting hierarchical and grouped time series through trace minimization. Technical Report 15/15, Monash University, Department of Econometrics and Business Statistics, 2015.
• Yu  Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. Ann. Probab., 22(1):94–116, 1994.
• Yu et al.  Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems 29, pages 847–855. Curran Associates, Inc., 2016.
• Yu et al.  Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv:1711.00073, 2017.
• Zhu and Laptev  Lingxue Zhu and Nikolay Laptev. Deep and confident prediction for time series at Uber. arXiv:1709.01907, 2017.
• Zimin and Lampert  Alexander Zimin and Christoph H. Lampert. Learning theory for conditional risk minimization. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, pages 213–222, 2017.

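Given a family of functions $\mathcal{F}$ and a training set $Z = \{z_1, \ldots, z_m\}$, the Rademacher complexity of $\mathcal{F}$ conditioned on $Z$ is given by

$$\hat{\mathfrak{R}}_Z(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(z_i)\Big],$$

where $\sigma_1, \ldots, \sigma_m$ are i.i.d. random variables uniform on $\{-1, +1\}$. The Rademacher complexity of $\mathcal{F}$ for sample size $m$ is given by

$$\mathfrak{R}_m(\mathcal{F}) = \mathbb{E}_{Y'}\big[\hat{\mathfrak{R}}_Z(\mathcal{F})\big].$$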

The Rademacher complexity has been studied for a variety of function classes. For instance, for the linear hypothesis space , can be upper bounded by . As another example, the hypothesis class of ReLU feed-forward neural networks with layers and weight matrices such that verifies  .
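To make the definition concrete, the sketch below estimates the empirical Rademacher complexity of a linear class by Monte Carlo sampling of the signs $\sigma_i$. It assumes, purely for illustration, the class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le \Lambda\}$; the norm constraint is our choice and not necessarily the one used in the text. For this class the supremum over $w$ has the closed form $\frac{\Lambda}{m}\|\sum_i \sigma_i x_i\|_2$, and Jensen's inequality gives the standard bound $\frac{\Lambda}{m}\sqrt{\sum_i \|x_i\|_2^2}$. The function names are hypothetical.

```python
import math
import random


def empirical_rademacher_linear(xs, lam, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= lam} on the sample xs (assumed class)."""
    rng = random.Random(seed)
    m, dim = len(xs), len(xs[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw i.i.d. signs uniform on {-1, +1}.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        # s = sum_i sigma_i * x_i
        s = [sum(sigma[i] * xs[i][k] for i in range(m)) for k in range(dim)]
        # The supremum over the l2 ball is attained at w = lam * s / ||s||_2.
        total += lam * math.sqrt(sum(v * v for v in s)) / m
    return total / n_draws


def linear_class_bound(xs, lam):
    """Upper bound (lam / m) * sqrt(sum_i ||x_i||_2^2) via Jensen's inequality."""
    m = len(xs)
    return lam * math.sqrt(sum(v * v for x in xs for v in x)) / m


if __name__ == "__main__":
    data_rng = random.Random(1)
    sample = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(50)]
    print("estimate:", empirical_rademacher_linear(sample, lam=1.0))
    print("bound:   ", linear_class_bound(sample, lam=1.0))
```

Both quantities scale linearly in $\Lambda$ and shrink roughly as $1/\sqrt{m}$ when the inputs have bounded norm, which is the behavior the bounds in this appendix are meant to capture.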

## Appendix B Discrepancy analysis

###### Proposition 1.

Let $\mathcal{H}$ be a hypothesis space and let $L$ be a bounded loss function which respects the triangle inequality. Let . Then,

$$\Delta \le \Delta_s + L(h \mid Y) + L(h \mid Y')$$

###### Proof.

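Let $h, h' \in \mathcal{H}$. For ease of notation, we write

$$\Delta_s(h, h', Y') = \frac{1}{m}\sum_i L\big(h(Y_1^{T}(i)), h'(Y_1^{T}(i))\big) - \frac{1}{m}\sum_i L\big(h(Y_1^{T-1}(i)), h'(Y_1^{T-1}(i))\big).$$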
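As a small numerical illustration of this quantity, the sketch below computes $\Delta_s(h, h', Y')$ for two toy hypotheses on a handful of series. Everything specific here is an assumption made for the example: the hypotheses (a last-value predictor and a prefix-mean predictor), the absolute loss (which satisfies the triangle inequality and is bounded whenever the data are bounded), and the function names.

```python
from statistics import fmean


def last_value(prefix):
    """Hypothetical hypothesis h: predict the last observed value."""
    return prefix[-1]


def prefix_mean(prefix):
    """Hypothetical hypothesis h': predict the mean of the observed prefix."""
    return fmean(prefix)


def abs_loss(a, b):
    """Absolute loss; it satisfies the triangle inequality."""
    return abs(a - b)


def delta_s(h, h_prime, series, loss=abs_loss):
    """Difference between the average disagreement of h and h' on the full
    prefixes Y_1^T(i) and on the shortened prefixes Y_1^{T-1}(i)."""
    m = len(series)
    on_full = sum(loss(h(y), h_prime(y)) for y in series) / m
    on_short = sum(loss(h(y[:-1]), h_prime(y[:-1])) for y in series) / m
    return on_full - on_short


if __name__ == "__main__":
    toy_series = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]  # m = 2 series, T = 3
    print(delta_s(last_value, prefix_mean, toy_series))
```

Note that this quantity involves only the two hypotheses and the observed inputs, never the targets, so it can be evaluated directly from data.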

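Applying the triangle inequality to $L$,

$$\begin{aligned}
L(h \mid Y) &= \frac{1}{m}\sum_i \mathbb{E}\big[L\big(h(Y_1^{T}(i)), Y_{T+1}(i)\big) \mid Y\big] \\
&\le \frac{1}{m}\sum_i L\big(h(Y_1^{T}(i)), h'(Y_1^{T}(i))\big) + \frac{1}{m}\sum_i \mathbb{E}\big[L\big(h'(Y_1^{T}(i)), Y_{T+1}(i)\big) \mid Y\big] \\
&= \frac{1}{m}\sum_i L\big(h(Y_1^{T}(i)), h'(Y_1^{T}(i))\big) + L(h' \mid Y).
\end{aligned}$$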

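Then, by definition of $\Delta_s$, we have

$$\begin{aligned}
L(h \mid Y) &\le \frac{1}{m}\sum_i L\big(h(Y_1^{T}(i)), h'(Y_1^{T}(i))\big) - \frac{1}{m}\sum_i L\big(h(Y_1^{T-1}(i)), h'(Y_1^{T-1}(i))\big) \\
&\quad + \frac{1}{m}\sum_i L\big(h(Y_1^{T-1}(i)), h'(Y_1^{T-1}(i))\big) + L(h' \mid Y) \\
&\le \Delta_s(h, h', Y') + L(h' \mid Y) + \frac{1}{m}\sum_i L\big(h(Y_1^{T-1}(i)), h'(Y_1^{T-1}(i))\big).
\end{aligned}$$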

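By an application of the triangle inequality to $L$,

$$\begin{aligned}
L(h \mid Y) &\le \Delta_s(h, h', Y') + L(h' \mid Y) + \frac{1}{m}\sum_i \mathbb{E}\big[L\big(h(Y_1^{T-1}(i)), Y_{T}(i)\big) \mid Y'\big] \\
&\quad + \frac{1}{m}\sum_i \mathbb{E}\big[L\big(h'(Y_1^{T-1}(i)), Y_{T}(i)\big) \mid Y'\big] \\
&= \Delta_s(h, h', Y') + L(h' \mid Y) + L(h \mid Y') + L(h' \mid Y').
\end{aligned}$$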

Finally, we obtain

 L(h