Time series forecasting plays a crucial role in a number of domains, ranging from weather forecasting and earthquake prediction to applications in economics and finance. The classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average (ARMA) models, their integrated versions (ARIMA), and several other extensions (Engle, 1982; Bollerslev, 1986; Brockwell and Davis, 1986; Box and Jenkins, 1990; Hamilton, 1994). Most of these models rely on strong assumptions about the noise terms, often assumed to be i.i.d. random variables sampled from a Gaussian distribution, and the guarantees provided in their support are only asymptotic.
An alternative non-parametric approach to time series analysis consists of extending the standard i.i.d. statistical learning theory framework to that of stochastic processes. In much of this work, the process is assumed to be stationary and suitably mixing (Doukhan, 1994). Early work along this approach consisted of the VC-dimension bounds for binary classification given by Yu (1994) under the assumption of stationarity and β-mixing. Under the same assumptions, Meir (2000) presented bounds in terms of covering numbers for regression losses, and Mohri and Rostamizadeh (2009) proved general data-dependent Rademacher complexity learning bounds. Vidyasagar (1997) showed that PAC learning algorithms in the i.i.d. setting preserve their PAC learning property in the β-mixing stationary scenario. A similar result was proven by Shalizi and Kontorovich (2013) for mixtures of β-mixing processes and by Berti and Rigo (1997) and Pestov (2010) for exchangeable random variables. Alquier and Wintenberger (2010) and Alquier et al. (2014) also established PAC-Bayesian learning guarantees under weak dependence and stationarity.
A number of algorithm-dependent bounds have also been derived for the stationary mixing setting. Lozano et al. (2006) studied the convergence of regularized boosting. Mohri and Rostamizadeh (2010) gave data-dependent generalization bounds for stable algorithms for φ-mixing and β-mixing stationary processes. Steinwart and Christmann (2009) proved fast learning rates for regularized algorithms with α-mixing stationary sequences, and Modha and Masry (1998) gave guarantees for certain classes of models under the same assumptions.
However, stationarity and mixing are often not valid assumptions. For example, even for Markov chains, which are among the most widely used types of stochastic processes in applications, stationarity does not hold unless the Markov chain is started with an equilibrium distribution. Similarly, long-memory models such as ARFIMA may not be mixing, or mixing may be arbitrarily slow (Baillie, 1996). In fact, it is possible to construct first-order autoregressive processes that are not mixing (Andrews, 1983). Additionally, the mixing assumption is defined only in terms of the distribution of the underlying stochastic process and ignores the loss function and the hypothesis set used. This suggests that mixing may not be the right property to characterize learning in the setting of stochastic processes.
A number of attempts have been made to relax the assumptions of stationarity and mixing. Adams and Nobel (2010) proved asymptotic guarantees for stationary ergodic sequences. Agarwal and Duchi (2013) gave generalization bounds for asymptotically stationary (mixing) processes in the case of stable on-line learning algorithms. Kuznetsov and Mohri (2014) established learning guarantees for fully non-stationary β- and φ-mixing processes.
In this paper, we consider the general case of non-stationary non-mixing processes. We are not aware of any prior work providing generalization bounds in this setting. In fact, our bounds appear to be novel even when the process is stationary (but not mixing). The learning guarantees that we present hold for both bounded and unbounded memory models. Deriving generalization bounds for unbounded memory models even in the stationary mixing case was an open question prior to our work (Meir, 2000). Our guarantees cover the majority of approaches used in practice, including various autoregressive and state space models.
The key ingredients of our generalization bounds are a data-dependent measure of sequential complexity (expected sequential covering number or sequential Rademacher complexity (Rakhlin et al., 2010)) and a measure of discrepancy between the sample and target distributions. Kuznetsov and Mohri (2014, 2016a) also give generalization bounds in terms of discrepancy. However, unlike these results, our analysis does not require any mixing assumptions, which are hard to verify in practice. More importantly, under some additional mild assumptions, the discrepancy measure that we propose can be estimated from data, which leads to data-dependent learning guarantees for the non-stationary non-mixing case.
We also show that our methodology can be applied to the analysis of stable time series forecasting algorithms, which extends the previous results of Mohri and Rostamizadeh (2010) to the setting of non-stationary non-mixing stochastic processes. Our proof techniques, which combine decoupled tangent sequences with stability analysis, may be of independent interest.
We devise new algorithms for non-stationary time series forecasting that benefit from our data-dependent guarantees. The parameters of generative models such as ARIMA are typically estimated via the maximum likelihood technique, which often leads to non-convex optimization problems. In contrast, our objective is convex and leads to an optimization problem with a unique global solution that can be found efficiently. Another issue with standard generative models is that they address non-stationarity in the data via a differencing transformation which does not always lead to a stationary process. In contrast, we address the problem of non-stationarity in a principled way using our learning guarantees.
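As a minimal sketch of the kind of convex weighted objective we have in mind (an illustration with a one-lag linear model and hypothetical geometric weights, not the algorithm developed in Section 7), consider weighted ridge regression on lagged values, whose objective admits a unique global minimizer in closed form:

```python
def weighted_ridge_ar1(series, q, lam=0.1):
    """Fit a one-lag autoregressive predictor y_t ~ w * y_{t-1} by minimizing
    the convex weighted objective
        sum_t q_t * (w * y_{t-1} - y_t)^2 + lam * w^2,
    which has the closed-form solution below (unique global minimum)."""
    xs = series[:-1]          # inputs  y_{t-1}
    ys = series[1:]           # targets y_t
    num = sum(qt * x * y for qt, x, y in zip(q, xs, ys))
    den = sum(qt * x * x for qt, x in zip(q, xs)) + lam
    return num / den

# Hypothetical weights: recent observations count more than distant ones.
series = [1.0, 0.9, 0.8, 0.9, 1.1, 1.2]
T = len(series) - 1
q = [2.0 ** (t - T) for t in range(1, T + 1)]   # geometric decay into the past
s = sum(q)
q = [qt / s for qt in q]                        # normalize to a distribution
w = weighted_ridge_ar1(series, q, lam=0.01)
```

Unlike maximum-likelihood estimation of an ARIMA model, this objective is a strictly convex quadratic in the parameter, so any local search finds its global minimum.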
The rest of this paper is organized as follows. The formal definition of the time series forecasting learning scenario, as well as that of several key concepts, is given in Section 2. In Section 3, we introduce and prove our new generalization bounds. Section 4 provides an analysis of the special case of kernel-based hypotheses with regression losses. Section 5 provides an analysis of regularized ERM algorithms based on the notion of algorithmic stability. In Section 6, we give data-dependent learning bounds based on the empirical discrepancy. These results are used to devise new forecasting algorithms in Section 7. In Section 8, we report the results of preliminary experiments using these algorithms.
We consider the following general time series prediction setting where the learner receives a realization $(X_1, Y_1), \ldots, (X_T, Y_T)$ of some stochastic process, with $Z_t = (X_t, Y_t) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The objective of the learner is to select, out of a specified family $H$, a hypothesis $h \colon \mathcal{X} \to \mathcal{Y}$ that achieves a small path-dependent generalization error
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathbb{E}\big[\ell(h(X_{T+1}), Y_{T+1}) \,\big|\, \mathbf{Z}_1^T\big]$$
conditioned on observed data, where $\ell$ is a given loss function. The path-dependent generalization error that we consider in this work is a finer measure of the generalization ability than the averaged generalization error
$$\mathcal{L}_{T+1}(h) = \mathbb{E}\big[\ell(h(X_{T+1}), Y_{T+1})\big],$$
since it only takes into consideration the realized history of the stochastic process and does not average over the set of all possible histories. The results that we present in this paper also apply to the setting where the time parameter $t$ can take non-integer values and the prediction lag is an arbitrary number $l \ge 0$; for notational simplicity we set $l = 1$.
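To make the distinction concrete, the following small sketch (our illustration, assuming a Gaussian AR(1) process with coefficient a and a linear predictor h(y) = c*y under the squared loss; these choices are not from the analysis above) computes both errors in closed form. The path-dependent error depends on the realized last observation, while the averaged error does not:

```python
# For Y_{t+1} = a*Y_t + eps with eps ~ N(0, 1) and predictor h(y) = c*y,
# the path-dependent squared-loss error given the realized history is
#   E[(c*Y_T - Y_{T+1})^2 | Y_1..Y_T] = (c - a)^2 * Y_T^2 + 1,
# while the averaged error integrates Y_T^2 over its (stationary) marginal.
a, c = 0.8, 0.5

def path_dependent_error(y_last):
    # Depends on the realized last observation y_last.
    return (c - a) ** 2 * y_last ** 2 + 1.0

def averaged_error_stationary():
    # Averages over all histories via the stationary variance of the AR(1).
    var_y = 1.0 / (1.0 - a ** 2)
    return (c - a) ** 2 * var_y + 1.0
```

For instance, after an unusually large realized observation the path-dependent error exceeds the averaged one, which illustrates why conditioning on the realized history is the finer measure.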
Our setup covers a large number of scenarios commonly used in practice. The case $X_t = (Y_{t-1}, \ldots, Y_{t-p})$, for some fixed $p \ge 1$, corresponds to a large class of autoregressive models. Taking $X_t = (Y_1, \ldots, Y_{t-1})$, the full history, leads to growing memory models which, in particular, include state space models. More generally, $X_t$ may contain both the history of the process and some additional side information. Note that the output space $\mathcal{Y}$ may also be high-dimensional. This covers both the case in which we forecast high-dimensional time series and that of multi-step forecasting.
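For concreteness (a small sketch of ours, not code from the paper), the bounded-memory autoregressive case can be realized by a simple lag-embedding of the observed series:

```python
def lag_embed(series, p):
    """Build autoregressive examples (X_t, Y_t) with X_t = (Y_{t-1}, ..., Y_{t-p}),
    the bounded-memory case described above."""
    examples = []
    for t in range(p, len(series)):
        x = tuple(series[t - 1 - i] for i in range(p))  # most recent lag first
        examples.append((x, series[t]))
    return examples

data = [1.0, 2.0, 3.0, 4.0, 5.0]
pairs = lag_embed(data, p=2)
# pairs == [((2.0, 1.0), 3.0), ((3.0, 2.0), 4.0), ((4.0, 3.0), 5.0)]
```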
To simplify the notation, we will often write $f(z) = \ell(h(x), y)$ for any $z = (x, y) \in \mathcal{Z}$ and introduce the family $\mathcal{F} = \{(x, y) \mapsto \ell(h(x), y) \colon h \in H\}$ containing such functions $f$. We will assume a bounded loss function, that is, $|f| \le M$ for all $f \in \mathcal{F}$, for some $M > 0$. Finally, we will use the shorthand $\mathbf{Z}_a^b$ to denote a sequence of random variables $Z_a, \ldots, Z_b$.
The key quantity of interest in the analysis of generalization is the following supremum of the empirical process:
$$\Phi(\mathbf{Z}_1^T) = \sup_{h \in H}\Big(\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) - \sum_{t=1}^T q_t\, \ell(h(X_t), Y_t)\Big),$$
where $q_1, \ldots, q_T$ are real numbers, which in the standard learning scenarios are chosen to be uniform. In our general setting, different $Z_t$s may follow different distributions, so distinct weights could be assigned to the errors made on different sample points depending on their relevance to forecasting the future $Z_{T+1}$. The generalization bounds that we present below hold for an arbitrary sequence $\mathbf{q} = (q_1, \ldots, q_T)$ which, in particular, covers the case of uniform weights. Remarkably, our bounds do not even require the non-negativity of $\mathbf{q}$.
The two key ingredients of our analysis are sequential complexities (Rakhlin et al., 2010) and a novel discrepancy measure between target and source distributions. In the next two subsections, we provide a detailed overview of these notions.
2.1 Sequential Complexities
Our generalization bounds are expressed in terms of data-dependent measures of sequential complexity such as expected sequential covering number or sequential Rademacher complexity (Rakhlin et al., 2010), which we review in this section.
We adopt the following definition of a complete binary tree: a $\mathcal{Z}$-valued complete binary tree $\mathbf{z}$ is a sequence $(z_1, \ldots, z_T)$ of mappings $z_t \colon \{\pm 1\}^{t-1} \to \mathcal{Z}$. A path in the tree is $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_T) \in \{\pm 1\}^T$. To simplify the notation, we will write $z_t(\boldsymbol{\sigma})$ instead of $z_t(\sigma_1, \ldots, \sigma_{t-1})$, even though $z_t$ depends only on the first $t - 1$ elements of $\boldsymbol{\sigma}$. The following definition generalizes the classical notion of covering numbers to the sequential setting. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is a sequential $\alpha$-cover (with respect to the $\mathbf{q}$-weighted $\ell_p$ norm) of a function class $\mathcal{F}$ on a tree $\mathbf{z}$ of depth $T$ if, for all $f \in \mathcal{F}$ and all $\boldsymbol{\sigma} \in \{\pm 1\}^T$, there is $\mathbf{v} \in V$ such that
$$\Big(\sum_{t=1}^T q_t\, \big| v_t(\boldsymbol{\sigma}) - f(z_t(\boldsymbol{\sigma})) \big|^p\Big)^{1/p} \le \alpha\, \|\mathbf{q}\|_{p^*},$$
where $\|\cdot\|_{p^*}$ is the dual norm. The (sequential) covering number $\mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{z})$ of a function class $\mathcal{F}$ on a given tree $\mathbf{z}$ is defined to be the size of the minimal sequential cover. The maximal covering number is then taken to be $\mathcal{N}_p(\alpha, \mathcal{F}) = \sup_{\mathbf{z}} \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{z})$. One can check that, in the case of uniform weights, this definition coincides with the standard definition of sequential covering numbers. Note that this is a purely combinatorial notion of complexity which ignores the distribution of the process in the given learning problem.
Data-dependent sequential covering numbers can be defined as follows. Given a stochastic process distributed according to the distribution $\mathbf{p}$, with $\mathbf{p}_t(\cdot \mid \mathbf{z}_1^{t-1})$ denoting the conditional distribution at time $t$, we sample a $\mathcal{Z} \times \mathcal{Z}$-valued tree of depth $T$ according to the following procedure. Draw two independent samples $Z_1, Z_1'$ from $\mathbf{p}_1$; in the left child of the root, draw $Z_2, Z_2'$ according to $\mathbf{p}_2(\cdot \mid Z_1)$ and in the right child according to $\mathbf{p}_2(\cdot \mid Z_1')$. More generally, for a node that can be reached by a path $(\sigma_1, \ldots, \sigma_{t-1})$, we draw $Z_t, Z_t'$ according to $\mathbf{p}_t(\cdot \mid S_1(\sigma_1), \ldots, S_{t-1}(\sigma_{t-1}))$, where $S_t(1) = Z_t$ and $S_t(-1) = Z_t'$. Let $\mathbf{z}$ denote the tree formed using the $Z_t'$s and define the expected covering number to be $\mathbb{E}_{\mathbf{z}}\big[\mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{z})\big]$, where the expectation is taken over the distribution of $\mathbf{z}$. For i.i.d. sequences, expected sequential covering numbers exactly coincide with the notion of expected covering numbers from classical statistical learning theory.
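This sampling procedure can be made concrete. The following Python sketch (with an assumed two-state Markov chain `P` and initial distribution `p1` — illustrative choices of ours, not from the paper) builds the (Z, Z') tree node by node, conditioning each draw on the path-selected history; for a Markov chain, only the last selected state matters:

```python
import random

P = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # hypothetical two-state transition probs
p1 = [0.5, 0.5]                      # hypothetical initial distribution

def draw(dist, rng):
    return 0 if rng.random() < dist[0] else 1

def sample_tree(T, rng):
    """Sample the (Z, Z') tree described above for a Markov chain: the node
    reached by a sign path holds an independent pair drawn from the
    conditional distribution given the path-selected history."""
    tree = {}
    def grow(path, prev):   # prev = last selected state on this path (None at root)
        t = len(path) + 1
        if t > T:
            return
        dist = p1 if prev is None else P[prev]
        z, zp = draw(dist, rng), draw(dist, rng)   # two independent draws
        tree[path] = (z, zp)
        grow(path + (+1,), z)    # left subtree conditions on Z_t
        grow(path + (-1,), zp)   # right subtree conditions on Z_t'
    grow((), None)
    return tree

rng = random.Random(0)
tree = sample_tree(4, rng)   # nodes indexed by sign paths of length 0..3
```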
The sequential Rademacher complexity of a function class $\mathcal{F}$ is defined as follows:
$$\mathfrak{R}_T(\mathcal{F}) = \sup_{\mathbf{z}} \mathbb{E}_{\boldsymbol{\sigma}}\Big[\sup_{f \in \mathcal{F}} \sum_{t=1}^T \sigma_t\, f(z_t(\boldsymbol{\sigma}))\Big],$$
where the supremum is taken over all complete binary trees of depth $T$ with values in $\mathcal{Z}$ and where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_T)$ is a sequence of i.i.d. Rademacher random variables.
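For small depth, the inner expectation can be computed exactly by enumerating all sign paths. The sketch below (our illustration of the unweighted definition, using a non-adaptive tree and a toy class of two constant functions) does exactly that; for constant functions the supremum reduces to E|∑ σ_t|:

```python
from itertools import product

def seq_rademacher_on_tree(tree_fn, funcs, T):
    """Exact E_sigma[ sup_f sum_t sigma_t * f(z_t(sigma)) ] by enumerating all
    2^T sign paths; tree_fn(t, sigma_prefix) returns z_t at that node."""
    total = 0.0
    for sigma in product((-1, +1), repeat=T):
        best = max(
            sum(s * f(tree_fn(t, sigma[:t - 1])) for t, s in enumerate(sigma, 1))
            for f in funcs
        )
        total += best
    return total / 2 ** T

# A non-adaptive (constant) tree reduces to an ordinary sequence z_1..z_T.
flat = lambda t, prefix: float(t)
consts = [lambda z: 1.0, lambda z: -1.0]    # toy class of two constant functions
val = seq_rademacher_on_tree(flat, consts, T=4)   # equals E|sum_t sigma_t| = 1.5
```

The enumeration is exponential in T, so this is purely a didactic device, not an estimation method.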
Similarly, one can also define a distribution-dependent sequential Rademacher complexity, as well as other notions of sequential complexity, such as the Littlestone dimension and sequential metric entropy, which have been shown to characterize learning in the on-line framework. For further details, we refer the reader to (Littlestone, 1987; Rakhlin et al., 2010, 2011, 2015a, 2015b).
2.2 Discrepancy Measure
The final ingredient needed for expressing our learning guarantees is the notion of discrepancy between the target distribution and the distribution of the sample:
$$\Delta = \sup_{h \in H}\Big(\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) - \sum_{t=1}^T q_t\, \mathbb{E}\big[\ell(h(X_t), Y_t) \,\big|\, \mathbf{Z}_1^{t-1}\big]\Big).$$
The discrepancy $\Delta$ is a natural measure of the non-stationarity of the stochastic process with respect to both the loss function $\ell$ and the hypothesis set $H$. In particular, note that if the process is i.i.d., then we simply have $\Delta = 0$, provided that the $q_t$s form a probability distribution.
As a more insightful example, consider the case of a time-homogeneous Markov chain that is not started with an equilibrium distribution, so that the process is non-stationary. For certain choices of the hypothesis set $H$ and loss function $\ell$, the conditional expected loss at time $T + 1$ coincides with the weighted conditional expected losses on the sample, and hence $\Delta = 0$, provided $\mathbf{q}$ is a probability distribution. Note that if we choose a larger hypothesis set, then in general it may be the case that $\Delta \neq 0$. This highlights an important property of the discrepancy: it takes into account not only the underlying distribution of the stochastic process but also the other components of the learning problem, such as the loss function and the hypothesis set that is being used.
The weights $\mathbf{q}$ play a crucial role in the learning problem as well. In our earlier Markov chain example, choosing $\mathbf{q}$ to be the uniform distribution in general leads to a non-zero discrepancy, whereas a suitable non-uniform choice of $\mathbf{q}$ makes the discrepancy zero. In fact, this is not the only choice that leads to a zero discrepancy in this example: any distribution supported on the sample points whose conditional distribution matches that of $Z_{T+1}$ leads to the same result. However, among these, the uniform weights over the matching points have the advantage of providing the largest effective sample.
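As an illustration (a toy construction of ours, not an example taken from the paper), the discrepancy can be computed exactly for a two-state Markov chain with the 0-1 loss, contrasting uniform weights with weights supported only on the sample points whose conditional law matches the target:

```python
P = {0: [0.9, 0.1], 1: [0.3, 0.7]}    # hypothetical transition probs; rows differ
history_x = [0, 1, 1, 0, 1]           # realized states X_1..X_T (X_t = Y_{t-1})
target_x = 1                          # X_{T+1} = Y_T, the state we predict from

HYPS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # all maps h: {0,1} -> {0,1}

def cond_loss(h, x):
    """E[1{h(X_t) != Y_t} | X_t = x] for the 0-1 loss under the chain."""
    return 1.0 - P[x][h[x]]

def discrepancy(q):
    # sup over h of (target conditional loss - weighted sample conditional losses)
    return max(
        cond_loss(h, target_x) - sum(qt * cond_loss(h, x) for qt, x in zip(q, history_x))
        for h in HYPS
    )

T = len(history_x)
q_uniform = [1.0 / T] * T
match = [1.0 if x == target_x else 0.0 for x in history_x]
q_matched = [m / sum(match) for m in match]   # uniform over times with X_t = X_{T+1}
```

With uniform weights the discrepancy is strictly positive, while the matched weights drive it to zero, as described above.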
Similar results can be established for weakly stationary stochastic processes as well.
It is also possible to upper bound $\Delta$ in terms of other natural distances between distributions. For instance, if $\mathbf{q}$ is a probability distribution, then Pinsker's inequality yields
$$\Delta \le M\, \Big\| \mathbf{p}_{T+1}(\cdot \mid \mathbf{Z}_1^T) - \sum_{t=1}^T q_t\, \mathbf{p}_t \Big\|_{\mathrm{TV}} \le M \sqrt{\tfrac{1}{2}\, D\Big( \mathbf{p}_{T+1}(\cdot \mid \mathbf{Z}_1^T) \,\Big\|\, \sum_{t=1}^T q_t\, \mathbf{p}_t \Big)},$$
where $\|\cdot\|_{\mathrm{TV}}$ is the total variation distance and $D(\cdot \| \cdot)$ the relative entropy, $\mathbf{p}_{T+1}(\cdot \mid \mathbf{Z}_1^T)$ the conditional distribution of $Z_{T+1}$, and $\sum_{t=1}^T q_t\, \mathbf{p}_t$ the mixture of the sample marginals. Note that these upper bounds are often too loose, since they are agnostic to the loss function and the hypothesis set that is being used. For our earlier Markov chain example, the conditional distribution of $Z_{T+1}$ is supported on a small set of states, while the mixture is likely to be supported on the whole state space, which leads to a large total variation distance. Of course, it is possible to choose the $q_t$s so that the mixture is also supported on this smaller set, but that may reduce the effective sample size, which is not necessary when working with $\Delta$.
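The looseness of such hypothesis-agnostic bounds can be seen numerically. In the toy two-state chain below (our construction, with the 0-1 loss so that M = 1, and a deliberately restricted hypothesis set), the total-variation bound holds but overshoots the actual discrepancy:

```python
# Toy two-state chain: compare the discrepancy for a small hypothesis set
# against the hypothesis-set-agnostic total-variation upper bound (M = 1
# for the 0-1 loss). All quantities here are exact finite sums.
P = {0: [0.9, 0.1], 1: [0.3, 0.7]}    # hypothetical transition probabilities
history_x = [0, 1, 1, 0, 1]           # realized states X_1..X_T
target_x = 1                          # state conditioning the target distribution
q = [0.2] * 5                         # uniform weights

target = P[target_x]                  # conditional distribution of Y_{T+1}
mix = [sum(qt * P[x][y] for qt, x in zip(q, history_x)) for y in (0, 1)]
tv = 0.5 * sum(abs(a - b) for a, b in zip(target, mix))

def cond_loss(h, x):                  # h encoded as (h(0), h(1))
    return 1.0 - P[x][h[x]]

H_small = [(0, 1), (1, 0)]            # a restricted hypothesis set
delta = max(
    cond_loss(h, target_x) - sum(qt * cond_loss(h, x) for qt, x in zip(q, history_x))
    for h in H_small
)
# delta <= 1 * tv, with strict slack for this restricted H
```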
However, the most important property of the discrepancy is that, as shown later in Section 6, it can be estimated from data under some additional mild assumptions. Kuznetsov and Mohri (2014) also give generalization bounds for non-stationary mixing processes, based on the averaged generalization error, in terms of a related notion of discrepancy. It is not known whether the discrepancy measure used by Kuznetsov and Mohri (2014) can be estimated from data.
3 Generalization Bounds
In this section, we prove new generalization bounds for forecasting non-stationary time series. The first step consists of using decoupled tangent sequences to establish concentration results for the supremum of the empirical process $\Phi(\mathbf{Z}_1^T)$. Given a sequence of random variables $\mathbf{Z}_1^T$, we say that $\widetilde{\mathbf{Z}}_1^T$ is a decoupled tangent sequence if $\widetilde{Z}_t$ is distributed according to $\mathbb{P}(\cdot \mid \mathbf{Z}_1^{t-1})$ and is independent of $\mathbf{Z}_t^T$. It is always possible to construct such a sequence of random variables (De la Peña and Giné, 1999). The next theorem is the main result of this section.
Let $\mathbf{Z}_1^T$ be a sequence of random variables distributed according to $\mathbf{p}$. Fix $\epsilon > 2\alpha > 0$. Then, the following holds:
$$\mathbb{P}\big(\Phi(\mathbf{Z}_1^T) - \Delta \ge \epsilon\big) \le \mathbb{E}\big[\mathcal{N}_1(\alpha, \mathcal{F}, \mathbf{z})\big] \exp\!\Big(-\frac{(\epsilon - 2\alpha)^2}{2 M^2 \|\mathbf{q}\|_2^2}\Big).$$
The first step is to observe that, since the difference of the suprema is upper bounded by the supremum of the difference, it suffices to bound the probability of the following event
By Markov's inequality, for any $\lambda > 0$, the following inequality holds:
Since $\widetilde{\mathbf{Z}}_1^T$ is a decoupled tangent sequence, the conditional expectations of the losses on $\widetilde{\mathbf{Z}}_1^T$ and on $\mathbf{Z}_1^T$ given the past agree. Using these equalities and Jensen's inequality, we obtain the following:
where for the second inequality we used Young's inequality, and for the last equality we used symmetry. Given $\alpha > 0$, let $C$ denote the minimal $\alpha$-cover with respect to the $\mathbf{q}$-weighted $\ell_1$-norm of $\mathcal{F}$ on $\mathbf{z}$. Then, the following bound holds:
By the monotonicity of the exponential function,
Since each $v_t$ depends only on $\sigma_1, \ldots, \sigma_{t-1}$, by Hoeffding's bound,
and iterating this inequality and using the union bound, we obtain the following:
Optimizing over $\lambda$ completes the proof.
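The decoupled tangent sequence at the heart of this proof can be sketched concretely. The snippet below (an illustration with an assumed two-state Markov chain, not part of the formal development) draws each tangent element from the same conditional law as the actual next observation, given the realized past, but independently of the future:

```python
import random

def sample_with_tangent(T, P, p1, seed=0):
    """Sample a Markov chain Z_1..Z_T together with a decoupled tangent
    sequence: each tangent element is drawn from the same conditional
    distribution p_t(.|Z_1..Z_{t-1}) as Z_t, independently of Z_t..Z_T."""
    rng = random.Random(seed)
    draw = lambda dist: 0 if rng.random() < dist[0] else 1
    z, z_tilde = [], []
    for t in range(T):
        dist = p1 if t == 0 else P[z[-1]]   # conditional given the realized past
        z_tilde.append(draw(dist))          # tangent copy: same conditional law
        z.append(draw(dist))                # actual next state of the chain
    return z, z_tilde

P = {0: [0.9, 0.1], 1: [0.2, 0.8]}          # hypothetical transition probabilities
z, zt = sample_with_tangent(8, P, p1=[0.5, 0.5])
```

Both draws at step t use the same conditional distribution, so the two sequences agree in conditional law while the tangent element is independent of the chain's own step.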
An immediate consequence of Theorem 3 is the following result. For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^T q_t\, \ell(h(X_t), Y_t) + \Delta + 2\alpha + M \|\mathbf{q}\|_2 \sqrt{2 \log \frac{\mathbb{E}\big[\mathcal{N}_1(\alpha, \mathcal{F}, \mathbf{z})\big]}{\delta}}.$$
We are not aware of other finite sample bounds in a non-stationary non-mixing case. In fact, our bounds appear to be novel even in the stationary non-mixing case.
Rakhlin et al. (2015a) give high-probability bounds for a quantity different from the one of interest in time series prediction; their analysis can nevertheless be used in our context to derive high-probability bounds for $\Phi(\mathbf{Z}_1^T) - \Delta$. However, this approach results in bounds expressed in terms of purely combinatorial notions such as maximal sequential covering numbers $\mathcal{N}_1(\alpha, \mathcal{F})$. While at first sight this may seem like a minor technical detail, the distinction is crucial in the setting of time series prediction: one can construct simple Markov processes and hypothesis sets for which the expected sequential covering number is exponentially smaller than the maximal sequential covering number. The data-dependent bounds of Theorem 3 and Corollary 3 highlight the fact that the task of time series prediction lies in between the familiar i.i.d. scenario and the adversarial on-line learning setting.
However, the key component of our learning guarantees is the discrepancy term $\Delta$. Note that in the general non-stationary case, the bounds of Theorem 3 may not converge to zero, due to the discrepancy between the target and sample distributions. This is consistent with the lower bounds of Barve and Long (1996) that we discuss in more detail in Section 6. However, convergence can be established in some special cases. In the i.i.d. case, our bounds reduce to the standard covering-number learning guarantees. In the drifting scenario, with $\mathbf{Z}_1^T$ being a sequence of independent random variables, our discrepancy measure coincides with the one used and studied by Mohri and Muñoz Medina (2012). Convergence can also be established for weakly stationary stochastic processes. However, as we show in Section 6, the most important advantage of our bounds is that the discrepancy measure we use can be estimated from data.
We now show that expected sequential covering numbers can be upper bounded in terms of the sequential Rademacher complexity. While generalization bounds in terms of the sequential Rademacher complexity are not as tight as bounds in terms of expected sequential covering numbers, since the former is a purely combinatorial notion, the analysis of the sequential Rademacher complexity may be simpler for certain hypothesis classes, such as the kernel-based hypotheses that we study in Section 4. We have the following extension of Sudakov's minoration theorem to the setting of sequential complexities.
The following bound holds:
We consider the following Gaussian-Rademacher sequential complexity:
where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_T)$ is an independent sequence of Rademacher random variables, $\mathbf{g} = (g_1, \ldots, g_T)$ is an independent sequence of standard Gaussian random variables, and $\mathbf{z}$ is a complete binary tree of depth $T$ with values in $\mathcal{Z}$.
Observe that if $V$ is any $\alpha$-cover of $\mathcal{F}$ on $\mathbf{z}$ with respect to the $\mathbf{q}$-weighted norm, then the following holds by the independence of $\boldsymbol{\sigma}$ and $\mathbf{g}$:
Observe that $V$ is also an $\alpha$-cover of $\mathcal{F}$, with respect to the $\mathbf{q}$-weighted norm, on the corresponding tree. We can obtain a smaller cover from $V$ by eliminating elements that are close to some other element: since $V$ is finite, we delete, for each retained element, any other element within distance $\alpha$ of it. It is straightforward to verify that the resulting set is a $2\alpha$-cover of $\mathcal{F}$ on $\mathbf{z}$ with respect to the $\mathbf{q}$-weighted norm, and that its elements are $\alpha$-separated. Furthermore, it follows that for a fixed $\boldsymbol{\sigma}$, the following holds:
for any pair of distinct elements. Let $(Z_v)_v$ be a sequence of independent Gaussian random variables whose pairwise distances match the separation established above. Then, by the Sudakov–Fernique inequality, it follows that
where the last inequality is the standard bound on the expected maximum of Gaussian random variables. Therefore, we conclude that the size of the cover is controlled by this Gaussian complexity. On the other hand, using standard properties of the Gaussian complexity (Ledoux and Talagrand, 1991):
where $\boldsymbol{\sigma}'$ is an independent sequence of Rademacher random variables. Re-arranging the summands, it follows that
Therefore, the following inequality holds
and the conclusion of the theorem follows by taking the supremum with respect to $\mathbf{z}$ on both sides of this inequality.
For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$, an analogous learning bound holds with the expected sequential covering number controlled in terms of the sequential Rademacher complexity.
As we have already mentioned in Section 2, the sequential Rademacher complexity can be further upper bounded in terms of sequential metric entropy, the sequential Littlestone dimension, maximal sequential covering numbers, and other combinatorial notions of sequential complexity of $\mathcal{F}$. These notions have been extensively studied in the past (Rakhlin et al., 2015b).
We conclude this section by observing that our results also hold in the case when the weights $q_t$ are chosen as a function of the data, which is a common heuristic used in some algorithms for forecasting non-stationary time series (Lorenz, 1969; Zhao and Giannakis, 2016). We formalize this result in the following theorem. Suppose the weight vector $\mathbf{q}$ is $\mathbf{Z}_1^T$-measurable. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$, the bound of Corollary 3 holds with the discrepancy $\Delta$ replaced by a data-dependent analogue $\widetilde{\Delta}$.
We illustrate this result with an example. Consider a Gaussian Markov process in which the conditional distribution at each step is normal, with unit variance and with mean given by a fixed function of the previous observation. In this case, one can exhibit a data-dependent choice of the weights $\mathbf{q}$ for which the discrepancy is zero. More generally, if the process is a time-homogeneous Markov process, then one can define the weights via the Radon–Nikodym derivative of the target conditional distribution with respect to the sample conditional distributions, which again leads to zero discrepancy. The major obstacle to this approach is that Radon–Nikodym derivatives are typically unknown and need to be learned from data via density estimation, which itself can be a difficult task. In Section 4, we investigate an alternative approach to choosing the weights, based on extending the results of Theorem 3 to hold uniformly over weight vectors.
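The Radon–Nikodym weighting just described can be sketched numerically. The snippet below (an illustration of ours under an assumed Gaussian AR(1)-type conditional model; the function names and the normalization into a distribution are choices made here, not prescribed by the analysis) reweights each sample point by the ratio of the target conditional density to the conditional density it was actually drawn from:

```python
import math

def normal_pdf(y, mean):
    # Standard normal density with unit variance, shifted to the given mean.
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2 * math.pi)

def rn_weights(series, a):
    """Density-ratio weights for a Gaussian Markov process Y_t | Y_{t-1}
    ~ N(a*Y_{t-1}, 1): point Y_t is reweighted by the Radon-Nikodym
    derivative of the target conditional N(a*Y_T, 1) with respect to the
    conditional N(a*Y_{t-1}, 1) that generated it."""
    y_last = series[-1]
    w = []
    for t in range(1, len(series)):
        target_mean = a * y_last
        source_mean = a * series[t - 1]
        w.append(normal_pdf(series[t], target_mean) / normal_pdf(series[t], source_mean))
    s = sum(w)
    return [wt / s for wt in w]   # self-normalize into a weight distribution q

series = [0.0, 0.5, 1.0, 0.8, 1.0]
q = rn_weights(series, a=0.9)
```

Points generated from a conditional law close to the target receive weights close to uniform, while points generated from a very different conditional law are down-weighted, which is the trade-off between discrepancy and effective sample size discussed above.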
4 Kernel-Based Hypotheses with Regression Losses
In this section, we present generalization bounds for kernel-based hypotheses with regression losses. Our results in this section are based on the learning guarantee of Corollary 3, expressed in terms of the sequential Rademacher complexity. Our first result is a bound on the sequential Rademacher complexity of kernel-based hypotheses with regression losses.
Let $H = \{x \mapsto \langle w, \Psi(x) \rangle \colon \|w\|_{\mathbb{H}} \le \Lambda\}$, where $\mathbb{H}$ is a Hilbert space and $\Psi \colon \mathcal{X} \to \mathbb{H}$ a feature map, and assume that the regression loss is bounded on the relevant domain. Then, the following inequalities hold:
where $K$ is the PDS kernel associated to $\mathbb{H}$, with $K(x, x') = \langle \Psi(x), \Psi(x') \rangle$.
We begin the proof by setting , where and . We let