1 Introduction
Much of the literature in machine learning focuses on studying the behavior of predictions constructed from a training set $\{(X_i, Y_i)\}_{i=1}^n$, where one wishes to construct a mapping from $\mathcal{X}$ to $\mathcal{Y}$. This training set may consist of IID draws from a common distribution, or it may have some dependence property such as ergodicity or mixing behavior [8, 4, 7]. It may even be generated by an adversary intent on deceiving us about the relationship [2, 10]. Time series data are different. We observe only a single sequence of random variables $Y_1, Y_2, \ldots$ taking values in a measurable space $\mathcal{Y}$, and wish to learn a function which takes past observations as inputs and predicts the future. Suppose, given data from time $1$ to time $n$, we wish to predict time $n+h$ for some $h > 0$. Then for some loss function $\ell$ and some predictor $f$, we define the prediction risk, or generalization error, as
$$R_n(f) = \mathbb{E}\!\left[\ell\!\left(Y_{n+h}, f(Y_1^n)\right)\right]. \qquad (1)$$
Here we assume that the data series is stationary, a notion defined precisely below; stationarity gives us some hope of controlling the generalization error defined in (1). Absent this sort of regularity, the past and future could be unrelated.
Since the true distribution is unknown, so is $R_n(f)$, but we can attempt to estimate it based only on our observed data. In situations with predictors $X_i$ and responses $Y_i$, there is the obvious estimator
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell\!\left(Y_i, f(X_i)\right).$$
However, in this case, we may use some or all of the past to generate predictions, and similarly, it may be that we have not observed $Y_t$ for some $t \le 0$. To ease notation for the remainder of the paper, assume that we have observed enough of the sequence that it is possible to evaluate the quantity $\ell\!\left(Y_t, f(Y_{-\infty}^{t-1})\right)$ for each $1 \le t \le n$. For time series prediction, we define the training error as
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{t=1}^{n} \ell\!\left(Y_t, f(Y_{-\infty}^{t-1})\right). \qquad (2)$$
Here $f$ is some function chosen from a class of possible functions $\mathcal{F}$.
Choosing a particular prediction function $\widehat{f}$ as the minimizer of (2) over $\mathcal{F}$ is "empirical risk minimization" (ERM); this often gives poor results because the choice of $\widehat{f}$ adapts to the training data, making the training error an over-optimistic estimate of the true risk. Additionally, training error must shrink as model complexity grows, so ERM over a rich class will tend to overfit the data and give poor out-of-sample predictions.
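To make this over-optimism concrete, here is a small simulated illustration (the AR(1) process, the least-squares fitting, and all constants are hypothetical choices, not part of the development above): nested autoregressions of increasing order can only drive the training error down, regardless of the true model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: simulate a short AR(1) series,
# Y_t = 0.5 * Y_{t-1} + N(0, 1) noise.
n, P = 80, 20
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + rng.normal()

resp = y[P:]  # common response window, so models of every order are nested

def train_mse(p):
    """In-sample MSE of a least-squares AR(p) fit on the common window."""
    X = np.column_stack([y[P - j : n - j] for j in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
    return float(np.mean((resp - X @ beta) ** 2))

errs = [train_mse(p) for p in range(1, P + 1)]
# Because the design matrices are nested, training error can only shrink
# as the AR order grows, even though the true process is AR(1): exactly
# the over-optimism of ERM described above.
```

Because all orders share one response window, the column spaces are nested and the in-sample error is monotone non-increasing in the order, while out-of-sample performance of the high-order fits would of course be worse.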
While $\widehat{R}_n(f)$ converges to $R_n(f)$ for many algorithms when $f$ is fixed in advance, one can show that when $\widehat{f}$ minimizes (2), $\mathbb{E}[\widehat{R}_n(\widehat{f})] \le \mathbb{E}[R_n(\widehat{f})]$, typically with strict inequality. There are a number of ways to mitigate this issue. The first is to restrict the class $\mathcal{F}$. The second is to change the optimization problem, penalizing model complexity. Rather than attempting to estimate $R_n(\widehat{f})$, we provide bounds on it which hold with high probability across all possible prediction functions $f \in \mathcal{F}$. A typical result in this literature is a confidence bound on the risk which says that, with probability at least $1-\delta$,
$$\forall f \in \mathcal{F}, \qquad R_n(f) \le \widehat{R}_n(f) + \Delta\!\left(\mathcal{C}(\mathcal{F}), \delta, n\right),$$
where $\mathcal{C}(\mathcal{F})$ measures the complexity of the model class $\mathcal{F}$, and $\Delta$ is a function of the complexity, the confidence level $\delta$, and the number of observed data points $n$.
In §2, we provide some background material necessary to characterize our results, including some concentration inequalities for dependent data. Section 3 derives risk bounds for time series and gives a novel proof that the standard Rademacher complexity characterizes the flexibility of $\mathcal{F}$. Section 4 supplies some straightforward examples showing how dependence affects the quality of the bounds. Section 5 concludes and provides some ideas about the future of these results.
2 Time series, complexity, and concentration of measure
In this section, we introduce some of the mathematics necessary to develop our results: stationarity is a prerequisite for control of generalization error; Rademacher complexity measures the flexibility of the model space $\mathcal{F}$; and dependence modifies the concentration inequalities.
Throughout what follows, $(Y_t)_{t \in \mathbb{Z}}$ will be a sequence of random variables, i.e., each $Y_t$ is a measurable mapping from some probability space $(\Omega, \mathcal{A}, P)$ into a measurable space $(\mathcal{Y}, \mathcal{B})$. A block of the random sequence will be written $Y_i^j = (Y_i, Y_{i+1}, \ldots, Y_j)$, where either limit may go to infinity. The $\sigma$-field generated by a particular block will be written $\sigma(Y_i^j)$.
2.1 Time series
The dependent data setting we investigate is based on stationary time series input data. We first remind the reader of the notion of (strict or strong) stationarity.
Definition 2.1 (Stationarity).
A sequence of random variables $(Y_t)$ is stationary when all its finite-dimensional distributions are invariant over time: for all $t$ and all non-negative integers $k$ and $m$, the random vectors $Y_t^{t+k}$ and $Y_{t+m}^{t+k+m}$ have the same distribution.

Stationarity does not imply that the random variables $Y_t$ are independent across time $t$, only that the joint distribution of any block $Y_t^{t+k}$ is constant over time.
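As a quick numerical illustration of the definition (a sketch with an assumed Gaussian AR(1) process, not an example from the analysis below): starting the recursion from its stationary law makes every marginal moment shift-invariant, even though the observations remain strongly dependent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical illustration: a Gaussian AR(1), Y_t = phi * Y_{t-1} + eps_t,
# initialized from its stationary law N(0, 1 / (1 - phi^2)), is strictly
# stationary: every finite-dimensional distribution is shift-invariant.
phi, paths, T = 0.7, 200_000, 6
y = rng.normal(0.0, np.sqrt(1.0 / (1.0 - phi**2)), size=paths)
moments = []
for _ in range(T):
    moments.append((float(y.mean()), float(y.var())))
    y = phi * y + rng.normal(size=paths)

# The marginal mean and variance agree (up to Monte Carlo error) at every
# t, even though consecutive observations are far from independent.
```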
2.2 Rademacher complexity
Statistical learning theory provides several ways of measuring the complexity of a class of predictive models. The results we use rely on Rademacher complexity (see, e.g., [1]), which measures how well the model class can (seem to) fit white noise.
Definition 2.2 (Rademacher Complexity).
Let $Y_1^n$ be a time series drawn according to a joint distribution $\mathcal{P}$. The empirical Rademacher complexity is
$$\widehat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \left|\frac{2}{n}\sum_{t=1}^n \sigma_t f(Y_t)\right|\right],$$
where $\sigma_1, \ldots, \sigma_n$ are a sequence of random variables, independent of each other and everything else, and equal to $+1$ or $-1$ with equal probability. The Rademacher complexity is
$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\!\left[\widehat{\mathfrak{R}}_n(\mathcal{F})\right],$$
where the expectation is over sample paths generated by $\mathcal{P}$.
The term inside the supremum, $\frac{2}{n}\sum_{t=1}^n \sigma_t f(Y_t)$, is proportional to the sample covariance between the noise $\sigma$ and the predictions of a particular model $f$. The Rademacher complexity takes the largest value of this sample covariance over all models in the class (mimicking empirical risk minimization), then averages over realizations of the noise.
Intuitively, Rademacher complexity measures how well our models could seem to fit outcomes which were really just noise, giving a baseline against which to assess the risk of over-fitting or failing to generalize. As the sample size grows, for any given $f$ the sample covariance $\frac{1}{n}\sum_{t=1}^n \sigma_t f(Y_t) \to 0$ by the ergodic theorem. The overall Rademacher complexity should also shrink, though more slowly, unless the model class is so flexible that it can fit absolutely anything, in which case one can conclude nothing about how well it will predict in the future from the fact that it performed well in the past.
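The empirical Rademacher complexity can be estimated directly by Monte Carlo. The sketch below assumes the convention of Definition 2.2 (a factor of 2 and an absolute value inside the supremum) and uses a hypothetical finite class of threshold predictors; neither the class nor the constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_rademacher(preds, n_draws=2000):
    """Monte Carlo estimate of E_sigma sup_f |(2/n) sum_t sigma_t f(Y_t)|.

    `preds` is a (num_models, n) array whose row f holds f(Y_1), ..., f(Y_n).
    """
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, preds.shape[1]))
    corr = np.abs(sigma @ preds.T) * (2.0 / preds.shape[1])
    return float(corr.max(axis=1).mean())   # sup over models, mean over noise

y = rng.normal(size=512)
# A small fixed class of threshold predictors f_c(y) = sign(y - c).
cls = np.sign(y[None, :] - np.linspace(-1.0, 1.0, 9)[:, None])
r_small = empirical_rademacher(cls[:, :64])   # n = 64 observations
r_large = empirical_rademacher(cls)           # n = 512 observations
# For a fixed class, the complexity shrinks as the sample grows, roughly
# like 1/sqrt(n); a class that fit noise equally well at every n would
# tell us nothing about generalization.
```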
2.3 Concentration inequalities
For IID data, the main tools for developing risk bounds are the inequalities of Hoeffding [3] and McDiarmid [6]. Here, instead, we use dependent versions of each which generalize the IID results. These inequalities are derived in van de Geer [12]. They rely on constructing predictable bounds for random variables based on past behavior, rather than assuming a priori knowledge of the distribution.
Theorem 2.3 (van de Geer [12] Theorem 2.5).
Consider a random sequence $(X_t)_{t \ge 1}$ with
$$L_t \le X_t \le U_t \quad \text{almost surely}, \qquad t = 1, \ldots, n,$$
where $L_t$ and $U_t$ are $\sigma(X_1^{t-1})$-measurable random variables. Define
$$V_n^2 = \sum_{t=1}^n (U_t - L_t)^2,$$
with the convention $\sigma(X_1^0) = \{\emptyset, \Omega\}$. Then for all $a > 0$ and $v > 0$,
$$P\!\left(\sum_{t=1}^n \left(X_t - \mathbb{E}\!\left[X_t \mid X_1^{t-1}\right]\right) \ge a,\ V_n^2 \le v^2\right) \le \exp\!\left(-\frac{2a^2}{v^2}\right).$$
Of course, if $L_t$ and $U_t$ are non-random, this recovers the usual Hoeffding inequality. Here, however, they need only be forecastable given past values of the random sequence.
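A small simulation can sanity-check this kind of statement. The process below, with its shrinking uniform increments, is a hypothetical example of forecastable bounds (it does not appear in the paper), and the tail bound is used in its $\exp(-2a^2/v^2)$ form with a deterministic envelope $v^2$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical martingale with forecastable ranges: given the past, the
# next increment is uniform on [L_t, U_t] = [-s_t, s_t], where the
# half-width s_t = 1 / (1 + |S_{t-1}|) is computable from what we have
# already seen (it is measurable with respect to the past), yet random.
n, reps, a = 100, 100_000, 3.0
S = np.zeros(reps)
for _ in range(n):
    s = 1.0 / (1.0 + np.abs(S))    # forecastable bound on the next step
    S += rng.uniform(-s, s)        # conditional mean zero given the past

# Since U_t - L_t = 2 * s_t <= 2 always, V_n^2 <= 4n almost surely, so the
# exp(-2 a^2 / v^2) tail bound applies with v^2 = 4n.
bound = float(np.exp(-2 * a**2 / (4 * n)))
empirical = float(np.mean(S >= a))
```

The empirical tail frequency sits below the bound, as it must; the bound is loose here because the worst-case envelope $4n$ ignores how quickly the realized ranges shrink.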
Theorem 2.4 (van de Geer [12] Theorem 2.6).
Fix $n \ge 1$. Let $Z$ be $\sigma(X_1^n)$-measurable such that, for each $t = 1, \ldots, n$,
$$A_t \le \mathbb{E}\!\left[Z \mid X_1^t\right] - \mathbb{E}\!\left[Z \mid X_1^{t-1}\right] \le B_t \quad \text{almost surely},$$
where $A_t$ and $B_t$ are $\sigma(X_1^{t-1})$-measurable. Define $V_n^2 = \sum_{t=1}^n (B_t - A_t)^2$ as above. Then for all $a > 0$ and $v > 0$,
$$P\!\left(Z - \mathbb{E}[Z] \ge a,\ V_n^2 \le v^2\right) \le \exp\!\left(-\frac{2a^2}{v^2}\right).$$
To see how this generalizes McDiarmid’s inequality, we provide the following corollary.
Corollary 2.5.
Let $g$ be some real-valued function on $\mathcal{Y}^n$ such that
$$\left|\mathbb{E}\!\left[g(Y_1^n) \mid Y_1^t\right] - \mathbb{E}\!\left[g(Y_1^n) \mid Y_1^{t-1}\right]\right| \le c_t, \qquad (3)$$
where $c_t$ is $\sigma(Y_1^{t-1})$-measurable. Then, taking $A_t = -c_t$ and $B_t = c_t$ in Theorem 2.4,
$$P\!\left(g(Y_1^n) - \mathbb{E}\!\left[g(Y_1^n)\right] \ge a,\ 4\sum_{t=1}^n c_t^2 \le v^2\right) \le \exp\!\left(-\frac{2a^2}{v^2}\right).$$
In particular, this gives some immediate consequences. Suppose that $g$ is bounded, $|g| \le M$. Then each increment in (3) is at most $c_t = 2M$, and we have that
$$P\!\left(g(Y_1^n) - \mathbb{E}\!\left[g(Y_1^n)\right] \ge a\right) \le \exp\!\left(-\frac{a^2}{8nM^2}\right).$$
This contrasts with the bounded differences inequality in the IID case, where one only needs to control the effect of changing a single coordinate. For IID data, starting from (3) and using the independence of $Y_t$ from the past,
$$\left|\mathbb{E}\!\left[g(Y_1^n) \mid Y_1^t\right] - \mathbb{E}\!\left[g(Y_1^n) \mid Y_1^{t-1}\right]\right| \le \sup_{y, y'} \left|\mathbb{E}\!\left[g(Y_1^n) \mid Y_1^{t-1}, Y_t = y\right] - \mathbb{E}\!\left[g(Y_1^n) \mid Y_1^{t-1}, Y_t = y'\right]\right| \le c_t$$
if $g$ satisfies bounded differences with constants $c_t$. In other words, Theorem 2.4 conflates dependence with nice functional behavior.
3 Risk bounds
Generalization error bounds follow from deriving high-probability upper bounds on the quantity
$$Z = \sup_{f \in \mathcal{F}} \left(R_n(f) - \widehat{R}_n(f)\right),$$
which is the worst-case difference between the true risk and the empirical risk over all functions in the class of losses defined by a particular class of prediction functions $\mathcal{F}$. In the case of time series, $Z$ is $\sigma(Y_1^n)$-measurable, so we can get risk bounds from Theorem 2.4 if we can find suitable $A_t$ and $B_t$ sequences.
Theorem 3.1.
Suppose that $Z = \sup_{f\in\mathcal{F}}\left(R_n(f) - \widehat{R}_n(f)\right)$ satisfies the forecastable boundedness condition of Theorem 2.4, with $V_n^2 \le v^2$ almost surely. Then, with probability at least $1-\delta$,
$$\forall f \in \mathcal{F}, \qquad R_n(f) \le \widehat{R}_n(f) + \mathbb{E}\!\left[\sup_{f\in\mathcal{F}}\left(R_n(f) - \widehat{R}_n(f)\right)\right] + v\sqrt{\frac{1}{2}\log\frac{1}{\delta}}.$$
In many cases (as in the examples below), $V_n^2$ will be deterministic, in which case the result above simplifies greatly. Essentially, the theorem says that as long as each new observation gives us additional control over the conditional expectation of $Z$, we can ensure that, with high probability, our forecasts of the future will incur only small losses. The proof is straightforward: simply set the right-hand side of Theorem 2.4 equal to $\delta$, solve for $a$, and use De Morgan's law.
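To spell out the inversion (a sketch, assuming the $\exp(-2a^2/v^2)$ tail form of Theorem 2.4 and a deterministic envelope $V_n^2 \le v^2$):

```latex
\exp\!\left(-\frac{2a^2}{v^2}\right) = \delta
\quad\Longleftrightarrow\quad
a = v\sqrt{\frac{1}{2}\log\frac{1}{\delta}},
```

so with probability at least $1-\delta$ we have $Z \le \mathbb{E}[Z] + v\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$ for $Z = \sup_{f\in\mathcal{F}}\left(R_n(f) - \widehat{R}_n(f)\right)$; and since $R_n(f) - \widehat{R}_n(f) \le Z$ for every $f$, the bound holds simultaneously over the whole class.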
Since $\mathbb{E}\!\left[\sup_{f\in\mathcal{F}}\left(R_n(f) - \widehat{R}_n(f)\right)\right]$ is a complicated and unintuitive object, we upper bound it with the Rademacher complexity. The standard symmetrization argument for the IID case does not work, but, for time series prediction (as opposed to the more general dependent data case or the online learning case), Rademacher bounds are still available. We provide this result now.
Theorem 3.2.
For a time series prediction problem based on a sequence $Y_1^n$,
$$\mathbb{E}\!\left[\sup_{f \in \mathcal{F}} \left(R_n(f) - \widehat{R}_n(f)\right)\right] \le \mathfrak{R}_n\!\left(\ell_{\mathcal{F}}\right), \qquad (4)$$
where $\ell_{\mathcal{F}}$ is the loss class induced by $\mathcal{F}$.
The standard way of proving this result in the IID case is through the introduction of a "ghost sample" $Y_1'^n$ which has the same distribution as $Y_1^n$ but is independent of it. Taking empirical expectations over the ghost sample is then the same as taking expectations with respect to the distribution of $Y_1^n$. Randomly exchanging $Y_t$ with $Y_t'$ using Rademacher variables allows for control of the supremum and leads to the factor of 2 in Definition 2.2. However, in the dependent data setting, this is not quite so easy.
For dependent data, both the ghost sample and the introduction of Rademacher variables arise differently. A similar situation also occurs in the more complex cases of online learning with a (perhaps constrained) adversary choosing the data sequence. It is covered in depth in Rakhlin et al. [10, 11]. With dependent data we need a different version of the “ghost sample” than that used in the IID case. First, we rewrite the left side of (4):
$$\mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \left(R_n(f) - \widehat{R}_n(f)\right)\right] = \mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^n \left(\mathbb{E}\!\left[\ell_f(Y_t) \mid Y_1^{t-1}\right] - \ell_f(Y_t)\right)\right]. \qquad (5)$$
Here, we define $\ell_f(Y_t) := \ell\!\left(Y_t, f(Y_1^{t-1})\right)$ for $f \in \mathcal{F}$, so that $\widehat{R}_n(f) = \frac{1}{n}\sum_{t=1}^n \ell_f(Y_t)$. At this point, following [10, 11], we introduce a "tangent sequence" rather than the ghost sample. We construct it recursively as follows. Let
$$Y_1' \sim \mathcal{L}(Y_1) \qquad \text{and} \qquad Y_t' \sim \mathcal{L}\!\left(Y_t \mid Y_1^{t-1}\right), \quad t = 2, \ldots, n,$$
where $\mathcal{L}$ denotes the probability law. Then, each $Y_t'$ is drawn conditionally independently of $Y_t$ given the past $Y_1^{t-1}$, and we write $Y_1'^n = (Y_1', \ldots, Y_n')$.
Proof of Theorem 3.2.
Starting from (5) we have
$$\mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^n \left(\mathbb{E}\!\left[\ell_f(Y_t') \mid Y_1^{t-1}\right] - \ell_f(Y_t)\right)\right]. \qquad (6)$$
Here we have constructed $Y_1'^n$ as a tangent sequence to $Y_1^n$ as discussed above. Then,
$$(6) \le \mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^n \left(\ell_f(Y_t') - \ell_f(Y_t)\right)\right] \qquad \text{(Jensen)}. \qquad (7)$$
Now, due to dependence, Rademacher variables must be introduced carefully, as in the adversarial case. Rademacher variables create two tree structures, one associated to the $Y$ sequence, and one associated to the $Y'$ sequence (see [10, 11] for a thorough treatment). We write these trees as $\mathbf{Y}(\sigma)$ and $\mathbf{Y}'(\sigma)$, where $\sigma$ is a particular sequence of Rademacher variables which creates a path along each tree. For example, consider $\sigma = (1, 1, \ldots, 1)$. Then $\mathbf{Y}(\sigma)$ and $\mathbf{Y}'(\sigma)$ follow the "right" path of both tree structures. For $\sigma = (-1, -1, \ldots, -1)$, they follow the "left" path of both tree structures. Changing $\sigma_t$ from $+1$ to $-1$ exchanges $Y_t$ for $Y_t'$ in both trees and chooses the left child of $Y_t$ and $Y_t'$ rather than the right child. Figure 1 displays both trees. In order to talk about the probability of an observation conditional on the "past" in the tree, we need to know the path taken so far. For this, we define a selector function $\chi_t$ which returns $Y_t$ when $\sigma_t = 1$ and $Y_t'$ when $\sigma_t = -1$.
Distributions over these trees then become the objects of interest.
In the time series case, as opposed to the online learning scenario, the dependence between future and past means the adversary is not free to change predictors and responses separately. Once a branch of the tree is chosen, the distribution of future data points is fixed, and depends only on the preceding sequence. Because of this, the joint distribution of any path along the tree is the same as that of any other path, i.e., for any two paths $\sigma$ and $\sigma'$,
$$\mathcal{L}\!\left(\chi_1(\sigma_1), \ldots, \chi_n(\sigma_n)\right) = \mathcal{L}\!\left(\chi_1(\sigma_1'), \ldots, \chi_n(\sigma_n')\right).$$
Similarly, due to the construction of the tangent sequence, we have that $\mathcal{L}(Y_t' \mid Y_1^{t-1}) = \mathcal{L}(Y_t \mid Y_1^{t-1})$. This equivalence between paths allows us to introduce Rademacher variables swapping $Y_t$ for $Y_t'$, as well as the ability to combine terms below:
$$\mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^n \sigma_t\!\left(\ell_f(Y_t') - \ell_f(Y_t)\right)\right] \le 2\,\mathbb{E}\!\left[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^n \sigma_t\, \ell_f(Y_t)\right] \le \mathfrak{R}_n\!\left(\ell_{\mathcal{F}}\right).$$
∎
Good control of $\mathbb{E}\!\left[\sup_{f\in\mathcal{F}}\left(R_n(f) - \widehat{R}_n(f)\right)\right]$ through the Rademacher complexity therefore implies good control of the generalization error. Rademacher complexity is easy to handle for wide ranges of learning algorithms using results in [1] and elsewhere. Support vector machines, kernel methods, and neural networks all have known Rademacher complexities. Furthermore, Lipschitz composition arguments in [5] allow us to deal only with the Rademacher complexity of the function class $\mathcal{F}$ rather than the induced loss class $\ell_{\mathcal{F}}$. For loss functions which are $L$-Lipschitz in their second argument,
$$\mathfrak{R}_n\!\left(\ell_{\mathcal{F}}\right) \le 2L\,\mathfrak{R}_n(\mathcal{F}).$$

The main issue, then, in the application of Theorem 3.1 is the determination of the forecastable bounds $A_t$ and $B_t$ from the data generating process. In the next section, we provide a few simple examples to aid intuition.
4 Examples
We consider three different examples which should aid the reader in understanding the nature of the forecastable bounds. Here we present two extreme cases (independence and complete dependence) as well as an intermediate case. It is important to note that $V_n^2$ is deterministic in all three cases, though this need not be the case in general.
4.1 Independence
If the $Y_t$ are IID and bounded, the conditional expectations in Theorem 2.4 do not depend on the past, so the forecastable bounds $A_t$ and $B_t$ may be taken to be constants, and we recover the usual Hoeffding bound (the IID benchmark used in §4.3 below).
4.2 Complete dependence
Let $(Y_t)$ be generated as follows: draw $Y_1$ from any distribution on $[0,1]$ and set $Y_t = Y_1$ for all $t > 1$.
Consider trying to predict the mean, so $g(Y_1^n) = \bar{Y}_n = \frac{1}{n}\sum_{t=1}^n Y_t = Y_1$. Then, given no observations, the almost-sure upper bound is $1$ while the lower bound is $0$, so $B_1 - A_1 = 1$. For $t > 1$, conditional on $Y_1^{t-1}$ (and therefore on $Y_1$), $Y_t$ is known exactly, so $B_t - A_t = 0$. Thus $V_n^2 = 1$, giving the entirely useless result:
$$P\!\left(\bar{Y}_n - \mathbb{E}\!\left[\bar{Y}_n\right] \ge a\right) \le \exp\!\left(-2a^2\right).$$
The right side is independent of $n$, implying that we essentially observed one data point regardless of $n$.
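A simulation of this example (choosing, hypothetically, the uniform distribution for the first draw) makes the "one effective data point" phenomenon visible:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical simulation of the completely dependent process: draw Y_1
# (here from Unif(0, 1), one choice of distribution on [0, 1]) and copy
# it forever, so the sample mean of Y_1, ..., Y_n equals Y_1 for every n.
reps = 5_000
y1 = rng.uniform(0.0, 1.0, size=reps)
deviation = {}
for n in (10, 1000):
    path = np.tile(y1, (n, 1))               # Y_t = Y_1 for all t <= n
    mean_n = path.mean(axis=0)
    deviation[n] = float(np.mean(np.abs(mean_n - 0.5) >= 0.25))
# The deviation probability is the same at n = 10 and n = 1000: extra
# observations carry no new information, exactly one effective data point.
```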
4.3 Partial dependence
Let $(Y_t)$ be generated as follows:
$$Y_t = \rho Y_{t-1} + \epsilon_t,$$
where $\epsilon_t \overset{\text{iid}}{\sim} \mathrm{Unif}(0,1)$ and $0 \le \rho < 1$. Again, consider trying to predict the mean with $g(Y_1^n) = \bar{Y}_n$. Since the effect of the innovation at time $t$ on $Y_s$, $s \ge t$, is scaled by $\rho^{s-t}$, we can define $A_t$ and $B_t$ as follows:
$$A_t = -\frac{1 - \rho^{n-t+1}}{2n(1-\rho)}, \qquad B_t = \frac{1 - \rho^{n-t+1}}{2n(1-\rho)}.$$
From this, we have that
$$V_n^2 = \sum_{t=1}^n (B_t - A_t)^2 \le \frac{1}{n(1-\rho)^2}.$$
Therefore, by Theorem 2.4,
$$P\!\left(\bar{Y}_n - \mathbb{E}\!\left[\bar{Y}_n\right] \ge a\right) \le \exp\!\left(-2a^2 n (1-\rho)^2\right).$$
For comparison, if everything were IID, Hoeffding's inequality gives
$$P\!\left(\bar{Y}_n - \mathbb{E}\!\left[\bar{Y}_n\right] \ge a\right) \le \exp\!\left(-2na^2\right).$$
Therefore, the dependence in $(Y_t)$ reduces the effective sample size from $n$ to $n(1-\rho)^2$. If $\rho = 1/2$, then each additional data point decreases the log-probability of a bad event by only a quarter as much as in the IID scenario.
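The effective-sample-size reading of this comparison can be checked with a few lines of arithmetic, assuming tail bounds of the illustrative forms $\exp(-2a^2 n(1-\rho)^2)$ (dependent) and $\exp(-2na^2)$ (IID Hoeffding); the symbol $\rho$ here denotes the dependence parameter of the example.

```python
import math

# Assumed illustrative bound forms for the comparison in this example.
def dependent_bound(n, a, rho):
    """Tail bound exp(-2 a^2 n (1 - rho)^2) for the dependent process."""
    return math.exp(-2.0 * a**2 * n * (1.0 - rho) ** 2)

def iid_bound(n, a):
    """Hoeffding bound exp(-2 n a^2) for a mean of n bounded IID draws."""
    return math.exp(-2.0 * n * a**2)

n, a, rho = 200, 0.1, 0.5
# The dependent bound with n observations matches the IID bound with only
# n * (1 - rho)^2 observations: here 200 dependent draws act like 50.
same = math.isclose(dependent_bound(n, a, rho), iid_bound(n * (1 - rho) ** 2, a))
```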
5 Discussion
In this paper, we have demonstrated how to control the generalization error of time series prediction algorithms. These methods use some or all of the observed past to predict future values of the same series. In order to handle the complicated Rademacher complexity bound for the expectation, we have followed the approach used in the online learning case pioneered by Rakhlin et al. [10, 11], but we show that in our particular case, much of the structure needed to deal with the adversary is unnecessary. This results in clean risk bounds which have a form similar to the IID case.
The main issue with risk bounds for dependent data is that they rely on complete knowledge of the dependence structure for their application. This is certainly true in our case, in that we need to know how to choose $A_t$ and $B_t$ such that we almost surely control $V_n^2$. For the standard case of bounded loss, there are trivial bounds, but these will not give the necessary dependence on $n$ which would imply learnability of good predictors. More knowledge of the dependence structure of the process is required, though this is in some sense undesirable. However, previous results in the dependent data setting, such as those presented in [8, 4, 7, 9], also have this requirement.¹ They rely on precise knowledge of the mixing behavior of the data, which is generally unavailable. At the same time, mixing characterizations are often unintuitive conditions based on infinite-dimensional joint distributions. Our version depends only on the ability to forecastably bound expectations given increasing amounts of data.

¹IID results have an even more onerous requirement: we must be able to rule out any dependence at all.
References
- Bartlett and Mendelson [2002] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
- Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univ Press, Cambridge, UK, 2006.
- Hoeffding [1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 0162-1459.
- Karandikar and Vidyasagar [2009] R. L. Karandikar and M. Vidyasagar. Probably approximately correct learning with beta-mixing input sequences. submitted for publication, 2009.
- Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. A Series of Modern Surveys in Mathematics. Springer Verlag, Berlin, 1991. ISBN 3540520139.
- McDiarmid [1989] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, 1989.
- Meir [2000] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000. URL http://www.ee.technion.ac.il/~rmeir/Publications/MeirTimeSeries00.pdf.
- Mohri and Rostamizadeh [2009] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, volume 21, pages 1097–1104, 2009.
- Mohri and Rostamizadeh [2010] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, February 2010.
- Rakhlin et al. [2010] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Technical report, arXiv, 2010. URL http://arxiv.org/abs/1006.1138v1.
- Rakhlin et al. [2011] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic and constrained adversaries. Technical report, arXiv, 2011. URL http://arxiv.org/abs/1104.5070.
- van de Geer [2002] Sara van de Geer. On hoeffding’s inequality for dependent random variables. In Herold Dehling, Thomas Mikosch, and Michael Sørensen, editors, Empirical Process Techniques for Dependent Data, pages 161–169. Birkhäuser, Boston, 2002. URL http://stat.ethz.ch/~geer/hoeffding2.pdf.