
# Rademacher complexity of stationary sequences

We show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values. The results are based on a generalization of standard i.i.d. concentration inequalities to dependent data without the mixing assumptions common in the time series setting. Our proof and the result are simpler than previous analyses with dependent data or stochastic adversaries which use sequential Rademacher complexities rather than the expected Rademacher complexity for i.i.d. processes. We also derive empirical Rademacher results without mixing assumptions resulting in fully calculable upper bounds.

## 1 Introduction

Much of the literature in machine learning focuses on studying the behavior of predictions constructed based on a training set

where one wishes to construct a mapping from to . This training set may consist of IID draws from a common distribution, or it may have some dependence property such as ergodicity or mixing behavior [8, 4, 7]. It may even be generated by an adversary intent on deceiving us about the relationship [2, 10].

Time series data are different. We observe only a single sequence of random variables $Y_1^n := (Y_1, \ldots, Y_n)$ taking values in a measurable space and wish to learn a function which takes the past observations as inputs and predicts the future. Suppose, given data from time 1 to time $n$, we wish to predict time $n+h$ for some horizon $h \geq 1$. Then for some loss function $\ell$ and some predictor $g$, we define the prediction risk, or generalization error, as

$$R(g) := \mathbb{E}\left[\ell\left(Y_{n+h}, g(Y_1^n)\right)\right]. \tag{1}$$

Here we assume that the data series is stationary, a notion to be defined more precisely later. But this allows us to have some hope of controlling the generalization error defined in (1). Absent this sort of behavior, the past and future could be unrelated.

Since the true distribution is unknown, so is $R(g)$, but we can attempt to estimate it based on only our observed data. In situations with predictors $X_i$ and responses $Y_i$, there is the obvious estimator

$$\tilde{R}_n(g) := \frac{1}{n}\sum_{i=1}^n \ell\left(Y_i, g(X_i)\right).$$

However, in this case, we may use some or all of the past to generate predictions, and similarly, it may be that we have not observed for some . To ease notation for the remainder of the paper, assume that we have observed some sequence of data for such that it is possible to evaluate the quantity for each . For time series prediction, we define the training error as

$$\hat{R}_n(g) := \frac{1}{n}\sum_{i=1}^n \ell\left(Y_{i+h}, g(Y_1^i)\right). \tag{2}$$

Here $g$ is some function chosen out of a class of possible functions $\mathcal{G}$.
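The training error for time series prediction can be computed directly from a single observed path. The following is a minimal sketch, assuming a predictor that maps the whole observed past to a forecast; the names `training_error`, `last_value`, and `squared_loss` are illustrative and do not come from the paper.

```python
import numpy as np

def training_error(y, g, loss, h=1):
    """Average loss of predicting Y_{i+h} from the past Y_1, ..., Y_i."""
    y = np.asarray(y, dtype=float)
    n = len(y) - h  # number of (past, future) pairs that can be evaluated
    if n < 1:
        raise ValueError("need more than h observations")
    # y[: i + 1] is the past up to (0-based) index i; y[i + h] is the target
    losses = [loss(y[i + h], g(y[: i + 1])) for i in range(n)]
    return float(np.mean(losses))

def last_value(past):
    """Hypothetical predictor: forecast the most recent observation."""
    return past[-1]

def squared_loss(target, prediction):
    return (target - prediction) ** 2

y = [1.0, 2.0, 4.0, 7.0]
err = training_error(y, last_value, squared_loss, h=1)
```

Here the predictor sees a growing window of the past, so each term of the average is evaluated on a different amount of history.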

Choosing a particular prediction function $\hat{g}$ as the minimizer of $\hat{R}_n$ over $\mathcal{G}$ is “empirical risk minimization” (ERM); this often gives poor results because the choice of $\hat{g}$ adapts to the training data, causing the training error to be an over-optimistic estimate of the true risk. Additionally, the training error shrinks as model complexity grows, so ERM will tend to overfit the data and give poor out-of-sample predictions.

While $\hat{R}_n(g)$ converges to $R(g)$ for many algorithms, one can show that when $\hat{g}$ minimizes (2), the training error $\hat{R}_n(\hat{g})$ tends to underestimate the risk $R(\hat{g})$. There are a number of ways to mitigate this issue. The first is to restrict the class $\mathcal{G}$. The second is to change the optimization problem, penalizing model complexity. Rather than attempting to estimate $R(g)$, we provide bounds on it which hold with high probability across all possible prediction functions $g \in \mathcal{G}$. A typical result in this literature is a confidence bound on the risk which says that with probability at least $1-\delta$,

$$R(\hat{g}) \leq \hat{R}_n(\hat{g}) + \Gamma\left(C(\mathcal{G}), n, \delta\right),$$

where $C(\cdot)$ measures the complexity of the model class $\mathcal{G}$, and $\Gamma(\cdot)$ is a function of the complexity, the confidence level, and the number of observed data points.

In Section 2, we provide some background material necessary to characterize our results, including some concentration inequalities for dependent data. Section 3 derives risk bounds for time series and gives a novel proof that the standard Rademacher complexity characterizes the flexibility of the model class. Section 4 supplies some straightforward examples showing how dependence affects the quality of bounds. Section 5 concludes and provides some ideas about the future of these results.

## 2 Time series, complexity, and concentration of measure

In this section, we introduce some of the mathematics necessary to develop our results: stationarity is a prerequisite for control of generalization error; Rademacher complexity measures the flexibility of the model space $\mathcal{G}$; dependence modifies concentration inequalities.

Throughout what follows, $\{Y_i\}$ will be a sequence of random variables, i.e., each $Y_i$ is a measurable mapping from some probability space into a measurable space. A block of the random sequence will be written $Y_i^j$, where either limit may go to infinity. The $\sigma$-field generated by a particular block will be given by $\mathcal{F}_i^j$.

### 2.1 Time series

The dependent data setting we investigate is based on stationary time series input data. We first remind the reader of the notion of (strict or strong) stationarity.

###### Definition 2.1 (Stationarity).

A sequence of random variables $\{Y_t\}$ is stationary when all its finite-dimensional distributions are invariant over time: for all $t$ and all non-negative integers $i$ and $j$, the random vectors $Y_t^{t+i}$ and $Y_{t+j}^{t+i+j}$ have the same distribution.

Stationarity does not imply that the random variables $Y_t$ are independent across time $t$, only that the distribution of $Y_t$ is constant over time.

### 2.2 Rademacher complexity

Statistical learning theory provides several ways of measuring the complexity of a class of predictive models. The results we use rely on Rademacher complexity (see, e.g., [1]), which measures how well the model can (seem to) fit white noise.

###### Definition 2.2 (Rademacher complexity).

Let $Y_1^n$ be a time series drawn according to a joint distribution $\nu$. The empirical Rademacher complexity is

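$$\widehat{\mathfrak{R}}_n(\mathcal{G}) := 2\,\mathbb{E}_\sigma\left[\sup_{g \in \mathcal{G}} \left|\frac{1}{n}\sum_{i=1}^n \sigma_i\, g(Y_1^i)\right| \;\Big|\; Y_1^n\right],$$

where $\sigma_i$ are a sequence of random variables, independent of each other and everything else, and equal to $+1$ or $-1$ with equal probability. The Rademacher complexity is

$$\mathfrak{R}_n(\mathcal{G}) := \mathbb{E}_\nu\left[\widehat{\mathfrak{R}}_n(\mathcal{G})\right],$$

where the expectation is over sample paths $Y_1^n$ generated by $\nu$.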

The term inside the supremum, $\left|\frac{1}{n}\sum_{i=1}^n \sigma_i\, g(Y_1^i)\right|$, is the sample covariance between the noise $\sigma$ and the predictions of a particular model $g$. The Rademacher complexity takes the largest value of this sample covariance over all models in the class (mimicking empirical risk minimization), then averages over realizations of the noise.

Intuitively, Rademacher complexity measures how well our models could seem to fit outcomes which were really just noise, giving a baseline against which to assess the risk of over-fitting or failing to generalize. As the sample size $n$ grows, for any given $g$ the sample covariance converges to zero, by the ergodic theorem. The overall Rademacher complexity should also shrink, though more slowly, unless the model class is so flexible that it can fit absolutely anything, in which case one can conclude nothing about how well it will predict in the future from the fact that it performed well in the past.
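For a finite model class, the empirical Rademacher complexity can be approximated by averaging over simulated sign sequences. The sketch below is one way to do this, assuming the predictions of every model on the observed path have already been collected into a matrix; the function name `empirical_rademacher` and the toy class of scaled last-value predictors are illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of 2 * E_sigma[ max_g | (1/n) sum_i sigma_i g_i | ].

    preds: array of shape (num_models, n); row k holds the predictions
           of model k at times i = 1, ..., n on the observed series.
    """
    preds = np.asarray(preds, dtype=float)
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    cov = np.abs(sigma @ preds.T) / n  # shape (n_draws, num_models)
    return float(2.0 * cov.max(axis=1).mean())

# Toy class (hypothetical): g_theta(past) = theta * (last observation)
rng = np.random.default_rng(1)
y = rng.uniform(-1.0, 1.0, size=200)
thetas = np.linspace(-1.0, 1.0, 21)
preds = np.outer(thetas, y)

small = empirical_rademacher(preds[:, :20])  # first 20 time points only
large = empirical_rademacher(preds)          # all 200 time points
```

Comparing `small` with `large` illustrates the shrinkage with sample size described above; the estimate conditions on the observed path, so it is computable from data alone.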

### 2.3 Concentration inequalities

For IID data, the main tools for developing risk bounds are the inequalities of Hoeffding [3] and McDiarmid [6]. Instead, we will use dependent versions of each which generalize the IID results. These inequalities are derived in van de Geer [12]. They rely on constructing predictable bounds for random variables based on past behavior, rather than assuming a priori knowledge of the distribution.

###### Theorem 2.3 (van de Geer [12] Theorem 2.5).

Consider a random sequence where

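$$L_i \leq Y_i \leq U_i \quad \text{a.s. for all } i \geq 1,$$

where $L_i < U_i$ are $\mathcal{F}_1^{i-1}$-measurable random variables, $i \geq 1$. Define

$$C_n^2 = \sum_{i=1}^n (U_i - L_i)^2,$$

with the convention $C_0^2 = 0$. Then for all $\epsilon > 0$, $c > 0$,

$$\mathbb{P}\left(\sum_{i=1}^n Y_i \geq \epsilon \text{ and } C_n^2 \leq c^2 \text{ for some } n\right) \leq \exp\left\{-\frac{2\epsilon^2}{c^2}\right\}.$$

Of course, if $L_i$ and $U_i$ are non-random, this returns the usual Hoeffding inequality. Here, however, they must only be forecastable given past values of the random sequence.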
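To make the reduction to Hoeffding's inequality explicit, here is a short worked calculation. It assumes non-random bounds $L_i = a_i$ and $U_i = b_i$, and reads the $Y_i$ as centred variables as in Hoeffding's original statement; the symbols $a_i$, $b_i$, and $t$ are introduced here for illustration.

```latex
C_n^2 = \sum_{i=1}^n (b_i - a_i)^2 \quad \text{is deterministic, so take } c^2 = C_n^2 .
\text{The event } \{C_n^2 \leq c^2\} \text{ then holds with certainty, and}
\mathbb{P}\left(\sum_{i=1}^n Y_i \geq \epsilon\right)
  \leq \exp\left\{-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right\}.
\text{With } b_i - a_i = b - a \text{ and } \epsilon = nt:
\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Y_i \geq t\right)
  \leq \exp\left\{-\frac{2 n t^2}{(b - a)^2}\right\}.
```

The exponent grows linearly in $n$ only because every $(U_i - L_i)^2$ is a fixed constant; with forecastable bounds the growth rate depends on how much each new observation narrows the range.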

###### Theorem 2.4 (van de Geer [12] Theorem 2.6).

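Fix $n \geq 1$. Let $Z_n$ be $\mathcal{F}_1^n$-measurable such that

$$L_i \leq \mathbb{E}\left[Z_n \mid \mathcal{F}_1^i\right] \leq U_i \quad \text{a.s.},$$

where $L_i < U_i$ are $\mathcal{F}_1^{i-1}$-measurable. Define $C_n^2$ as above. Then for all $\epsilon > 0$, $c > 0$,

$$\mathbb{P}\left(Z_n - \mathbb{E}[Z_n] \geq \epsilon \text{ and } C_n^2 \leq c^2\right) \leq \exp\left\{-\frac{2\epsilon^2}{c^2}\right\}.$$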

To see how this generalizes McDiarmid’s inequality, we provide the following corollary.

###### Corollary 2.5.

Let be some real valued function on such that

 (3)

where is -measurable. Then,

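$$\mathbb{P}\left(g(Y_1,\ldots,Y_n) - \mathbb{E}[g(Y_1,\ldots,Y_n)] > \epsilon \text{ and } \sum_i k_i^2 \leq c^2\right) \leq \exp\left\{-\frac{2\epsilon^2}{c^2}\right\}.$$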

In particular, this gives a couple of immediate consequences. Suppose that $g$ is bounded. Then, we have that

$$k_i \leq \sup_{Y_i^n} \sup_{{Y'}_i^n} \left| g(Y_1,\ldots,Y_{i-1},Y_i,\ldots,Y_n) - g(Y_1,\ldots,Y_{i-1},Y'_i,\ldots,Y'_n) \right| = b_i.$$

This contrasts with the bounded differences inequality in the IID case, wherein one only needs to be concerned with one point that is different. For IID data, we have, starting from (3),

$$k_i \leq \sup_{Y_i, Y'_i} \left| g(Y_1,\ldots,Y_{i-1},Y_i,\ldots,Y_n) - g(Y_1,\ldots,Y_{i-1},Y'_i,\ldots,Y_n) \right| = d_i,$$

if $g$ satisfies bounded differences with constants $d_i$. In other words, Theorem 2.4 conflates dependence with nice functional behavior.

## 3 Risk bounds

Generalization error bounds follow from deriving high probability upper bounds on the quantity

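$$Q_n(\mathcal{H}) := \sup_{h \in \mathcal{H}} \left(R(h) - \hat{R}_n(h)\right),$$

which is the worst case difference between the true risk and the empirical risk over all functions in the class of losses $\mathcal{H}$ defined over a particular class of prediction functions $\mathcal{G}$. In the case of time series, $Q_n(\mathcal{H})$ is measurable with respect to the $\sigma$-field generated by the observed data, so we can get risk bounds from Theorem 2.4 if we can find suitable $L_i$ and $U_i$ sequences.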

###### Theorem 3.1.

Suppose that $Q_n(\mathcal{H})$ satisfies the forecastable boundedness condition of Theorem 2.4. Then,

$$\mathbb{P}\left(R(h) < \hat{R}_n(h) + \mathbb{E}[Q_n(\mathcal{H})] + c\sqrt{\frac{\log(1/\delta)}{2}} \quad\text{or}\quad C_n^2 > c^2\right) \geq 1 - \delta.$$

In many cases (as in the examples below), $C_n^2$ will be deterministic, in which case the result above is greatly simplified. Essentially, the theorem says that as long as each new $Y_i$ gives us additional control on the conditional expectation of $Q_n(\mathcal{H})$, we can ensure that, with high probability, our forecasts of the future will have only small losses. The proof is straightforward: set the right hand side of Theorem 2.4 to $\delta$, solve for $\epsilon$, and take complements using De Morgan's law.
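The confidence term in the risk bound comes from inverting the exponential tail in Theorem 2.4. A minimal sketch of that inversion, assuming a deterministic bound $c$; the helper names `confidence_term` and `risk_upper_bound` are hypothetical and only illustrate the arithmetic.

```python
import math

def confidence_term(c, delta):
    """Return epsilon solving exp(-2 * eps**2 / c**2) = delta."""
    if c <= 0 or not 0 < delta < 1:
        raise ValueError("need c > 0 and 0 < delta < 1")
    return c * math.sqrt(math.log(1.0 / delta) / 2.0)

def risk_upper_bound(train_err, expected_gap, c, delta):
    """Training error + bound on E[Q_n] + confidence term."""
    return train_err + expected_gap + confidence_term(c, delta)

eps = confidence_term(c=0.1, delta=0.05)
bound = risk_upper_bound(train_err=0.30, expected_gap=0.05, c=0.1, delta=0.05)
```

The numerical inputs are placeholders. In practice `expected_gap` would be replaced by a Rademacher complexity bound and `c` by the square root of a bound on the sum of squared forecastable ranges.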

Since $\mathbb{E}[Q_n(\mathcal{H})]$ is a complicated and unintuitive object, we upper bound it with the Rademacher complexity. The standard symmetrization argument for the IID case does not work, but, for time series prediction (as opposed to the more general dependent data case or the online learning case), Rademacher bounds are still available. We provide this result now.

###### Theorem 3.2.

For a time series prediction problem based on a sequence $Y_1^n$,

$$\mathbb{E}[Q_n(\mathcal{H})] \leq \mathfrak{R}_n(\mathcal{H}). \tag{4}$$

The standard way of proving this result in the IID case is through introduction of a “ghost sample” $Y'$ which has the same distribution as $Y$. Taking empirical expectations over the ghost sample is then the same as taking expectations with respect to the distribution of $Y$. Randomly exchanging $Y_i$ with $Y'_i$ by using Rademacher variables allows for control of $\mathbb{E}[Q_n(\mathcal{H})]$ and leads to the factor of 2 in Definition 2.2. However, in the dependent data setting, this is not quite so easy.

For dependent data, both the ghost sample and the introduction of Rademacher variables arise differently. A similar situation also occurs in the more complex cases of online learning with a (perhaps constrained) adversary choosing the data sequence. It is covered in depth in Rakhlin et al. [10, 11]. With dependent data we need a different version of the “ghost sample” than that used in the IID case. First, we rewrite the left side of (4):

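$$\begin{aligned}
\mathbb{E}_Y[Q_n(\mathcal{H})] &= \mathbb{E}_Y\left[\sup_{h \in \mathcal{H}} \left(R(h) - \hat{R}_n(h)\right)\right] \\
&= \mathbb{E}_Y\left[\sup_{g \in \mathcal{G}} \left(\mathbb{E}_{Y_{n+h}}\left[\ell\left(Y_{n+h}, g(Y_1^n)\right)\right] - \frac{1}{n}\sum_{i=1}^n \ell\left(Y_{i+h}, g(Y_1^i)\right)\right)\right].
\end{aligned} \tag{5}$$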

Here, we define $z_i := (Y_{i+h}, Y_1^i)$ so that $h(z_i) = \ell\left(Y_{i+h}, g(Y_1^i)\right)$ for some $g \in \mathcal{G}$. At this point, following [10, 11], we introduce a “tangent sequence” rather than the ghost sample. We construct it recursively as follows. Let

$$\mathcal{L}(Y'_1) = \mathcal{L}(Y_1) \quad\text{and}\quad \mathcal{L}(Y'_i \mid Y_1,\ldots,Y_{i-1}) = \mathcal{L}(Y_i \mid Y_1,\ldots,Y_{i-1}),$$

where denotes the probability law. Then, let and .

###### Proof of Theorem 3.2.

Starting from (5) we have

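$$\begin{aligned}
\mathbb{E}[Q_n(\mathcal{H})] &= \mathbb{E}_Z\left[\sup_{h \in \mathcal{H}} \left(\mathbb{E}_Z\left[\frac{1}{n}\sum_{i=1}^n h(z_i)\right] - \frac{1}{n}\sum_{i=1}^n h(z_i)\right)\right] \\
&= \mathbb{E}_Z\left[\sup_{h \in \mathcal{H}} \left(\mathbb{E}_{Z'}\left[\frac{1}{n}\sum_{i=1}^n h(z'_i)\right] - \frac{1}{n}\sum_{i=1}^n h(z_i)\right)\right].
\end{aligned} \tag{6}$$

Here we have constructed $Z'$ as a tangent sequence to $Z$ as discussed above. Then,

$$\begin{aligned}
(6) &\leq \mathbb{E}_{Z,Z'}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \left(h(z'_i) - h(z_i)\right)\right] \qquad \text{(Jensen)} \\
&= \mathbb{E}_{\substack{z_1 \\ z'_1}}\, \mathbb{E}_{\substack{z_2 \mid z_1 \\ z'_2 \mid z'_1}} \cdots\, \mathbb{E}_{\substack{z_n \mid z_{n-1},\ldots,z_1 \\ z'_n \mid z'_{n-1},\ldots,z'_1}} \left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \left(h(z'_i) - h(z_i)\right)\right].
\end{aligned} \tag{7}$$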

Now, due to dependence, Rademacher variables must be introduced
carefully as in the adversarial case. Rademacher variables create two tree structures, one associated to the sequence, and one associated to the sequence (see [10, 11] for a thorough treatment). We write these trees as and , where is a particular sequence of Rademacher variables (e.g. ) which creates a path along each tree. For example, consider . Then, and , the “right” path of both tree structures. For . Then, and , the “left” path of both tree structures. Changing from to exchanges for in both trees and chooses the left child of and rather than the right child. Figure 1 displays both trees. In order to talk about the probability of conditional on the “past” in the tree, we need to know the path taken so far. For this, we define a selector function

$$\chi(\sigma) := \chi(\sigma, \rho, \varrho) = \begin{cases} \rho & \sigma = 1 \\ \varrho & \sigma = -1. \end{cases}$$

Distributions over these trees then become the objects of interest.

In the time series case, as opposed to the online learning scenario, the dependence between future and past means the adversary is not free to change predictors and responses separately. Once a branch of the tree is chosen, the distribution of future data points is fixed, and depends only on the preceding sequence. Because of this, the joint distribution of any path along the tree is the same as any other path, i.e. for any two paths

$$\mathcal{L}(Z(\sigma)) = \mathcal{L}(Z(\sigma')) \quad\text{and}\quad \mathcal{L}(Z'(\sigma)) = \mathcal{L}(Z'(\sigma')).$$

Similarly, due to the construction of the tangent sequence, we have that $\mathcal{L}(Z(\sigma)) = \mathcal{L}(Z'(\sigma))$. This equivalence between paths allows us to introduce Rademacher variables swapping $z_i$ for $z'_i$ as well as the ability to combine terms below:

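$$\begin{aligned}
(7) &= \mathbb{E}_{\substack{z_1 \\ z'_1}}\, \mathbb{E}_{\sigma_1}\, \mathbb{E}_{\substack{z_2 \mid \chi(\sigma_1, z_1, z'_1) \\ z'_2 \mid \chi(\sigma_1, z'_1, z_1)}}\, \mathbb{E}_{\sigma_2} \cdots\, \mathbb{E}_{\substack{z_n \mid \chi(\sigma_{n-1}),\ldots,\chi(\sigma_1) \\ z'_n \mid \chi(\sigma_{n-1}),\ldots,\chi(\sigma_1)}}\, \mathbb{E}_{\sigma_n} \left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i\left(h(z'_i) - h(z_i)\right)\right] \\
&= \mathbb{E}_{Z,Z',\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i\left(h(z'_i) - h(z_i)\right)\right] \\
&\leq \mathbb{E}_{Z,\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(z_i)\right] + \mathbb{E}_{Z',\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(z'_i)\right] \\
&= 2\,\mathbb{E}_{Z,\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(z_i)\right] \\
&\leq \mathfrak{R}_n(\mathcal{H}).
\end{aligned}$$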

Good control of $\mathbb{E}[Q_n(\mathcal{H})]$ through the Rademacher complexity therefore implies good control of the generalization error. Rademacher complexity is easy to handle for wide ranges of learning algorithms using results in [1] and elsewhere. Support vector machines, kernel methods, and neural networks all have known Rademacher complexities. Furthermore, Lipschitz composition arguments in [5] allow us to deal only with the Rademacher complexity of the function class $\mathcal{G}$ rather than the induced loss class $\mathcal{H}$: for loss functions which are Lipschitz in their second argument, the complexity of $\mathcal{H}$ is bounded by a constant multiple of the complexity of $\mathcal{G}$, with the constant determined by the Lipschitz constant.

The main issue then in the application of Theorem 3.1 is the determination of the forecastable bounds $L_i$ and $U_i$ from the data generating process. In the next section, we provide a few simple examples to aid intuition.

## 4 Examples

We consider three examples which should aid the reader in understanding the nature of the forecastable bounds: two extreme cases (independence and complete dependence) and an intermediate case. It is important to note that $C_n^2$ is deterministic in all three cases, though this need not be so in general.

### 4.1 Independence

For IID data, we simply recover IID concentration results. As noted in Corollary 2.5, for IID data, bounded differences yields good control. Similarly, Theorem 2.3 gives the same results as Hoeffding’s inequality for IID data. Dependence is more interesting.

### 4.2 Complete dependence

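Let $Y_1^n$ be generated as follows:

$$Y_1 \sim U(a, b), \quad b > a, \qquad Y_i = Y_{i-1}, \quad i \geq 2.$$

Consider trying to predict the mean $(a+b)/2$. Then, given no observations, the almost sure upper bound is $U_1 = b$ while the lower bound is $L_1 = a$. So $C_1^2 = (b-a)^2$. For $i \geq 2$, conditional on $Y_1$ (and therefore $Y_{i-1}$), $U_i = L_i = Y_1$. Thus $C_n^2 = (b-a)^2$, giving the entirely useless result:

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Y_i - \frac{b+a}{2} \geq \epsilon\right) \leq \exp\left\{-\frac{2\epsilon^2}{(b-a)^2}\right\}.$$

The right side is independent of $n$, implying that we essentially observed one data point regardless of $n$.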
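A quick simulation makes the point concrete: under complete dependence the sample mean equals $Y_1$, so the chance of a large deviation does not shrink as the series gets longer. This is a sketch only; the function name `tail_frequency` and the values of `eps`, `reps`, and the interval endpoints are illustrative choices.

```python
import numpy as np

def tail_frequency(n, eps, a=0.0, b=1.0, reps=20000, seed=0):
    """Fraction of simulated paths whose sample mean exceeds (a+b)/2 by eps."""
    rng = np.random.default_rng(seed)
    y1 = rng.uniform(a, b, size=reps)
    paths = np.repeat(y1[:, None], n, axis=1)  # Y_i = Y_{i-1} for i >= 2
    means = paths.mean(axis=1)
    return float(np.mean(means - (a + b) / 2 >= eps))

freq_short = tail_frequency(n=1, eps=0.25)
freq_long = tail_frequency(n=100, eps=0.25)
```

With the same seed, `freq_short` and `freq_long` are computed from the same draws of $Y_1$, so any difference between them reflects only floating-point rounding in the averages.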

### 4.3 Partial dependence

Let be generated as follows:

$$Y_0 = 0, \qquad Y_i = \theta Y_{i-1} + \eta_i, \quad i \geq 1,$$

where and with . Again, consider trying to predict the mean . We can define and as follows:

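$$L_i = \frac{a}{n}\,\frac{1-\theta^{n-i}}{1-\theta} + \frac{1}{n}\sum_{k=1}^{i-1} Y_k + \theta Y_{i-1}, \qquad U_i = \frac{b}{n}\,\frac{1-\theta^{n-i}}{1-\theta} + \frac{1}{n}\sum_{k=1}^{i-1} Y_k + \theta Y_{i-1}.$$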

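From this, we have that

$$\begin{aligned}
C_n^2 &= \sum_{i=1}^n \frac{(b-a)^2}{n^2(1-\theta)^2}\left(1-\theta^{n-i}\right)^2 \\
&= \frac{(b-a)^2}{n^2(1-\theta)^2(\theta^2-1)}\left(\theta^{2n} - 2\theta^{n+1} - 2\theta^n + n\theta^2 + 2\theta - n + 1\right) \\
&< \frac{(b-a)^2}{n(1-\theta)^2}.
\end{aligned}$$

Therefore, by Theorem 2.4,

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Y_i - \frac{b+a}{2} > \epsilon\right) \leq \exp\left\{-\frac{2n\epsilon^2(1-\theta)^2}{(b-a)^2}\right\}.$$

For comparison, if everything was IID, Hoeffding’s inequality gives

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Y_i - \frac{b+a}{2} > \epsilon\right) \leq \exp\left\{-\frac{2n\epsilon^2}{(b-a)^2}\right\}.$$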

Therefore, the dependence in reduces the effective sample size by . If , then each additional datapoint decreases the probability of a bad event by only a relative to the IID scenario.
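The algebra for $C_n^2$ in this example can be checked numerically. The sketch below evaluates the direct sum, the closed form, and the simpler upper bound, and then compares the resulting tail bound with the IID Hoeffding bound. The function names and the parameter values ($n = 50$, $\theta = 0.5$, $a = 0$, $b = 1$, $\epsilon = 0.1$) are illustrative assumptions, and the code takes $0 < \theta < 1$.

```python
import numpy as np

def c2_exact(n, theta, a, b):
    """Direct sum of the squared forecastable ranges (U_i - L_i)^2."""
    i = np.arange(1, n + 1)
    widths = (b - a) / n * (1 - theta ** (n - i)) / (1 - theta)
    return float(np.sum(widths ** 2))

def c2_closed_form(n, theta, a, b):
    """Closed-form expression for the same sum."""
    poly = (theta ** (2 * n) - 2 * theta ** (n + 1) - 2 * theta ** n
            + n * theta ** 2 + 2 * theta - n + 1)
    return (b - a) ** 2 * poly / (n ** 2 * (1 - theta) ** 2 * (theta ** 2 - 1))

def c2_upper(n, theta, a, b):
    """Simpler upper bound (b - a)^2 / (n (1 - theta)^2)."""
    return (b - a) ** 2 / (n * (1 - theta) ** 2)

def tail_bound(eps, c2):
    """Hoeffding-type bound exp(-2 eps^2 / c^2)."""
    return float(np.exp(-2 * eps ** 2 / c2))

n, theta, a, b, eps = 50, 0.5, 0.0, 1.0, 0.1
dependent = tail_bound(eps, c2_upper(n, theta, a, b))
iid = tail_bound(eps, (b - a) ** 2 / n)
```

Under these assumptions the exponent of the dependent bound is the IID exponent multiplied by $(1-\theta)^2$, which is one way to read the loss in effective sample size.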

## 5 Discussion

In this paper, we have demonstrated how to control the generalization of time series prediction algorithms. These methods use some or all of the observed past to predict future values of the same series. In order to handle the complicated Rademacher complexity bound for the expectation, we have followed the approach used in the online learning case pioneered by Rakhlin et al. [10, 11], but we show that in our particular case, much of the structure needed to deal with the adversary is unnecessary. This results in clean risk bounds which have a form similar to the IID case.

The main issue with risk bounds for dependent data is that they rely on complete knowledge of the dependence for application. This is certainly true in our case in that we need to know how to choose $L_i$ and $U_i$ such that we almost surely control the conditional expectation of $Q_n(\mathcal{H})$. For the standard case of bounded loss, there are trivial bounds, but these will not give the necessary dependence on $n$ which would imply learnability of good predictors. More knowledge of the dependence structure of the process is required, though this is in some sense undesirable. However, previous results in the dependent data setting, such as those presented in [8, 4, 7, 9], also have this requirement. (IID results have an even more onerous requirement: we must be able to rule out any dependence at all.) They rely on precise knowledge of the mixing behavior of the data, which is unavailable. At the same time, mixing characterizations are often unintuitive conditions based on infinite dimensional joint distributions. Our version depends only on the ability to forecastably bound expectations given increasing amounts of data.

## References

1. Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
2. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, Cambridge, UK, 2006.
3. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 0162-1459.
4. R. L. Karandikar and M. Vidyasagar. Probably approximately correct learning with beta-mixing input sequences. Submitted for publication, 2009.
5. M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. A Series of Modern Surveys in Mathematics. Springer Verlag, Berlin, 1991. ISBN 3540520139.
6. C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, 1989.
7. Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.
8. Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1097–1104, 2009.
9. Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, February 2010.
10. Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Technical report, arXiv, 2010. URL http://arxiv.org/abs/1006.1138v1.
11. Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic and constrained adversaries. Technical report, arXiv, 2011. URL http://arxiv.org/abs/1104.5070.
12. Sara van de Geer. On Hoeffding’s inequality for dependent random variables. In Herold Dehling, Thomas Mikosch, and Michael Sørensen, editors, Empirical Process Techniques for Dependent Data, pages 161–169. Birkhäuser, Boston, 2002.