One of the cornerstone assumptions in the analysis of machine learning algorithms is that the training examples are independently and identically distributed (i.i.d.). Interestingly, this assumption is hardly ever fulfilled in practice: dependencies between examples exist even in the famous textbook example of classifying e-mails into ham or spam. For example, when multiple e-mails are exchanged with the same writer, the contents of later e-mails depend on the contents of earlier ones. For this reason, there is growing interest in the development of algorithms that learn from dependent data and still offer generalization guarantees similar to those in the i.i.d. situation.
In this work, we are interested in learning algorithms for stochastic processes, i.e. data sources whose samples arrive in a sequential manner. Traditionally, the generalization performance of learning algorithms for stochastic processes is phrased in terms of the marginal risk: the expected loss for a new data point that is sampled with respect to the underlying marginal distribution, regardless of which samples have been observed before. In this work, we are instead interested in the conditional risk, i.e. the expectation of the loss taken with respect to the conditional distribution of the next sample given the samples observed so far. For i.i.d. data both notions of risk coincide. For dependent data, however, they can differ drastically, and the conditional risk is the more promising quantity for sequential prediction tasks. Imagine, for example, a self-driving car. At any point in time it makes its next decision, e.g. determines whether there is a pedestrian in front of the vehicle based on the image from its camera. A typical machine learning approach (based on the marginal risk) would use a single classifier that works well on average. However, choosing a different classifier at each step, adapted to work well under the current conditions (based on the conditional risk), is clearly beneficial in this case.
There are two main challenges when trying to learn predictors of low conditional risk. First, the conditional distribution typically changes in each step, so we are trying to learn a moving target. Second, we cannot make use of out-of-the-box empirical risk minimization, since that would just lead to predictors of low marginal risk.
Our main contributions in this work are the following:
a non-parametric empirical estimator of the conditional risk with finite history,
a proof of consistency under mild assumptions on the process, and
a finite sample concentration bound that, under certain technical assumptions, guarantees and quantifies the uniform convergence of the above estimator to the true conditional risk.
Our results provide the necessary tools to theoretically justify and practically perform empirical risk minimization with respect to the conditional distribution of a stochastic process. To our knowledge, our work is the first one providing a consistent algorithm for this problem.
2 Risk minimization
We study the problem of risk minimization: having observed a sequence of examples from a stochastic process, our goal is to select a predictor, , of minimal risk from a fixed hypothesis set . The risk
is defined as the expected loss for the next observation with respect to a given loss function. For example, in a classification setting, one would have , where are inputs and are class labels, and solve the task of identifying a predictor, , that minimizes the expected -loss, , for .
Different distributions lead to different definitions of risk. The simplest possibility is the marginal risk,
which has two desirable properties: it is in fact independent of the actual value of (for the type of processes that we consider), and under weak conditions on the process it can be estimated by a simple average of the losses over the training set, i.e. the empirical risk,
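In standard notation (a reconstruction under the usual conventions, with $\ell$ the loss, $\mu$ the marginal distribution, and $z_1,\dots,z_n$ the training sequence; the paper's own symbols may differ), the marginal risk and its empirical counterpart take the form:

```latex
R(h) \;=\; \mathbb{E}_{z \sim \mu}\big[\ell(h, z)\big],
\qquad
\widehat{R}_n(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i).
```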
On the downside, the minimizer of the marginal risk might have low prediction performance on the actual sequence because it tries to generalize across all possible histories, while what we care about in the end is the prediction only for one observed realization. For an i.i.d. process this would not matter, since any future sample would be independent of the past observations anyway. For a dependent process, however, the sequence (where is a shorthand notation for ) might carry valuable information about the distribution of , see Section 5 for an example and numerical simulations.
In this work we study the conditional risk for a finite history of length ,
for any . Our goal is to identify a predictor of minimal conditional risk in the hypothesis set, i.e. to solve the following optimization problem
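In standard notation (again a hedged reconstruction; the paper's symbols may differ), the conditional risk with history length $d$ and the associated minimization problem can be written as:

```latex
R_t(h) \;=\; \mathbb{E}\big[\ell(h, z_t) \,\big|\, z_{t-d}, \dots, z_{t-1}\big],
\qquad
\min_{h \in \mathcal{H}} \; R_t(h).
```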
Note that on a practical level this is a more challenging problem than marginal risk minimization: the conditional risk depends on the history, , so different predictors will be optimal for different histories and time steps. However, this comes with the benefit that the resulting predictor is tuned for the actually observed history.
For better understanding, let us consider the related problem of time series prediction, which can be formulated as conditional risk minimization. One can consider constant predictors, with a loss that measures the distance of the prediction from the next value of the process; e.g. for the squared loss this would mean minimizing over . Notice that there is a big difference from standard time-series approaches to prediction, where one minimizes a fixed (not changing with time) measure of risk over predictors that can take a finite history into account to make their predictions, which can be written as with . There, one chooses a fixed function based on the data and uses it henceforth. In our case, at each step we try to find a (simpler) predictor that minimizes the risk for this particular step, meaning that we can perform optimization over less complex predictors, but we need to recompute the predictor at every step.
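To make the distinction concrete, here is a minimal simulation of ours (not from the paper; all parameter values are illustrative choices): on an AR(1) process, a single constant predictor chosen for the marginal squared loss is compared against per-step constant predictors adapted to the observed history.

```python
import numpy as np

# A minimal simulation contrasting marginal and conditional risk on an
# AR(1) process z_t = phi * z_{t-1} + noise (phi and the noise scale are
# illustrative choices of ours, not taken from the paper).
rng = np.random.default_rng(0)
phi, n = 0.9, 20000
z = np.empty(n)
z[0] = rng.normal()
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()

train, test = z[: n // 2], z[n // 2:]

# Marginal-risk approach: a single constant predictor, fitted once as the
# empirical mean of the training set, and used for every step.
c_marginal = train.mean()
marginal_mse = np.mean((test[1:] - c_marginal) ** 2)

# Conditional-risk approach: at each step, the constant predictor that
# minimizes the conditional squared loss given the last observation; for
# an AR(1) the conditional mean is phi * z_{t-1} (the true model is used
# here only to make the gap between the two notions of risk visible).
conditional_mse = np.mean((test[1:] - phi * test[:-1]) ** 2)

print(marginal_mse, conditional_mse)
# The per-step predictor pays roughly the noise variance (about 1),
# while the fixed constant pays roughly the marginal variance 1/(1-phi^2).
```

The gap between the two errors grows with the strength of the dependence: for phi close to 0 (nearly i.i.d. data) the two approaches perform almost identically.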
3 Related work
While statistical learning theory was first formulated for the i.i.d. setting (vapnik1971uniform), it was soon recognized that extensions to non-i.i.d. situations, in particular to many classes of stochastic processes, are possible and useful. As in the i.i.d. case, the core of such results is typically formed by a combination of a capacity bound on the class of considered predictors and a law-of-large-numbers argument that ensures that the empirical average of function values converges to a desired expected value. Combining both, one obtains, for example, that empirical risk minimization (ERM) is a successful learning strategy.
Most existing results study the situation of stationary stochastic processes, for which the definition of a marginal risk makes sense. The consistency of ERM or similar principles can then be established under certain conditions on the dependence structure, for example for processes that are -, - or -mixing (YuBin01; karandikar2002rates; steinwart2009fast; zou2009generalization), exchangeable, or conditionally i.i.d. (Berti01; Pestov2010).
Asymptotic or distribution-dependent results were furthermore obtained even for processes that are just ergodic (adams2010uniform), and for Markov chains with countably infinite state space (gamarnik2003extension).
All of the above works aim to study the minimizer of a long-term or marginal risk. Actually minimizing the conditional risk has not received much attention in the literature, even though some conditional notions of risk were noticed and discussed. The most popular objective is the conditional risk based on the full history, that is . For example, Pestov2010 and Shalizi13 argue in favor of minimizing this conditional risk, but focus on exchangeable processes, for which the unweighted average over the training samples can be used for this purpose.
The following two papers focus on variants of the conditioning, but consider situations where the objective is close to the marginal risk. Kuznetsov01 look at the minimization of with integer gap for non-stationary processes, but the convergence of their bound requires to grow as the amount of data grows. Mohri02fixed discuss the conditional risk based on the full history in the context of generalization guarantees for stable algorithms. Their proofs require the assumption that one can freely remove points from the conditioning set without changing the distribution (apparently, the need for this assumption was realized only after publication of the JMLR paper of the same title; our discussion is based on the PDF version of the manuscript from the author's homepage, dated 10/10/13). In our notation this assumption means for integer 's. This again allows the conditional risk to be approximated by the marginal one, by separating the point in the loss from the history by an arbitrarily large gap. In both cases this makes the problem much easier for mixing processes, since for large values of , is almost independent of . In contrast to these two works, in our setting the conditional risk is indeed different from the marginal one (see Figure 1 for an example).
Agarwal01 extend online-to-batch conversion to mixing processes. The authors construct an estimator for the marginal risk, and then show that it can also be used for an average over future conditional risks. In our notation this corresponds to . Their results are based on the idea of separating the point in the loss and the history by a large enough gap. Similarly to the above papers, the convergence only holds for , where the average conditional risk converges to the marginal one, while our setting corresponds to the case without any gap, , with conditioning on a finite history.
wintenberger2014optimal introduces a novel online-to-batch conversion technique to show bounds in the regret framework for the cumulative conditional risk, defined as a sum of conditional risks, . Our setting is a harder version of this problem, since we need to minimize each summand separately, not only the whole sum.
The work of kuznetsov2015learning is the most closely related to ours. They aim to minimize the conditional risk based on the full history by using a weighted empirical average, in the spirit of our estimator. They provide a generalization bound for fixed, non-random weights and, based on their results, derive a heuristic procedure for finding the weights, without guarantees of convergence. The main difference of our work is that we provide a data-dependent way to choose the weights, together with a proof of convergence.
We are not aware of any work that provides a convergent algorithm for the conditional risk based on the full or finite history. In our work we focus on the finite history and in Section 6 we discuss the relation between the two notions.
On a technical level, our work is related to the task of one-step-ahead prediction of time series (Modha01; modha1998memory; Meir01; alquier2012model). The goal of these methods is to reason about the next step of a process, though not in order to choose a hypothesis of minimal risk, but to predict the value of the next observation itself. Our work on empirical conditional risk is inspired by this school of thought, in particular by kernel-based nonparametric sequential prediction (Biau01).
In this section we present our results together with the assumptions needed for the proofs.
When we want to perform the optimization (4) in practice, we do not have access to the conditional distribution of . Thus, we take the standard route and aim at minimizing an empirical estimator, , of the conditional risk.
Our first contribution is the definition of a suitable conditional risk estimator when the process takes values in . The estimator is based on the notion of a smoothing kernel. (Here and later in this section, as well as in Section 4, we use "kernel" only in the sense of kernel-based non-parametric density estimation, not in the sense of positive definite kernel functions from kernel methods.)
A function is called a smoothing kernel if it satisfies
is bounded by ,
A typical example is a squared exponential kernel, , but many other choices are possible. We can now define our estimator.
For a smoothing kernel function, , and a bandwidth, , we define the empirical conditional risk estimator
where is the index set of samples used and .
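A standard Nadaraya–Watson-style form consistent with the surrounding description (our reconstruction; normalization details such as powers of the bandwidth may differ from the paper's exact definition) is:

```latex
\widehat{R}_t(h) \;=\;
\frac{\sum_{i \in I} K\!\big( (z_{i-d}^{\,i-1} - z_{t-d}^{\,t-1}) / b \big)\, \ell(h, z_i)}
     {\sum_{j \in I} K\!\big( (z_{j-d}^{\,j-1} - z_{t-d}^{\,t-1}) / b \big)},
\qquad
z_{i-d}^{\,i-1} = (z_{i-d}, \dots, z_{i-1}).
```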
In words, the estimator is a weighted average loss over the training set, where the weight of each sample, , is proportional to how similar its history, , is to the target history, . Similar kernel-based non-parametric estimators have been used successfully in time series prediction (Gyorfi01). Note, however, that risk estimation might be an easier task than prediction, especially for processes of complex objects, since we do not have to predict the actual values of , but only the loss they cause for a hypothesis . In the self-driving car example, compare the pixel-wise prediction of the next image to the prediction of the loss that our classifier incurs.
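As a sketch of how such an estimator can be computed in practice (our illustrative implementation; the function name, the squared-exponential kernel choice, and all parameter values are our own assumptions, not the paper's):

```python
import numpy as np

def conditional_risk_estimate(z, loss, history, d, b):
    """Kernel-weighted empirical conditional risk: a weighted average of
    per-sample losses, with weights given by how similar each sample's
    length-d history is to the target history."""
    def K(u):  # squared-exponential smoothing kernel, one typical choice
        return np.exp(-0.5 * np.sum(u ** 2))
    weights = np.array([K((z[i - d:i] - history) / b) for i in range(d, len(z))])
    losses = np.array([loss(z[i]) for i in range(d, len(z))])
    return np.sum(weights * losses) / np.sum(weights)

# Demo on an AR(1) process z_t = phi * z_{t-1} + noise: for the constant
# predictor c = phi * x given history x, the true conditional squared-loss
# risk equals the noise variance, here 1.
rng = np.random.default_rng(0)
phi, n = 0.9, 20000
z = np.empty(n)
z[0] = rng.normal()
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()

x = 2.0  # target history of length d = 1
est = conditional_risk_estimate(z, lambda v: (phi * x - v) ** 2,
                                np.array([x]), d=1, b=0.3)
print(est)  # close to the true conditional risk of 1
```

The same predictor has a much higher marginal risk (its squared error averaged over all histories), so a good estimate near 1 here illustrates that the estimator is indeed tracking the conditional rather than the marginal quantity.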
Our main result in this work is the proof that minimizing the above empirical conditional risk is a successful learning strategy for finding a minimizer of the conditional risk. The following well-known result (vapnik1998statistical) shows that it suffices to focus on uniform deviations of the estimator from the actual risk.
As our first result we show the convergence of such uniform deviations to zero. But before we can make the formal statement, we need to introduce the technical assumptions and a few definitions.
We assume that we observe data from a stationary -mixing stochastic process taking values in . Stationarity means that for all
the vector has the same distribution as for all . In order to quantify the dependence between the past and the future of the process, we consider mixing coefficients.
Let be a sigma algebra generated by . Then the -th -mixing coefficient is
A process is called -mixing if as . We call a process exponentially -mixing if for some . On a high level, a process is mixing if the head of the process, , and the tail of the process, , become as close to independent of each other as desired when they are separated by a large enough gap.
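For reference, one standard form of the coefficient (there are several equivalent definitions in the literature; the paper's exact variant may differ) is:

```latex
\beta(k) \;=\; \sup_{t}\; \mathbb{E}\Big[\, \sup_{A \in \sigma(z_{t+k}^{\infty})}
\big| \mathbb{P}\big(A \mid \sigma(z_{-\infty}^{t})\big) - \mathbb{P}(A) \big| \,\Big].
```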
Many classical stochastic processes are -mixing; see (Bradley01) for a detailed survey. For example, many finite-state Markov and hidden Markov models as well as autoregressive moving average (ARMA) processes fulfill at an exponential rate (athreya1986mixing), while certain diffusion processes are -mixing at least with polynomial rates (chen2010nonlinearity). Clearly, i.i.d. processes are -mixing with for all .
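As a small numerical illustration of ours (not from the paper), the geometric decay of dependence in an AR(1) process can be seen through its autocorrelations, a crude but easily computable proxy for the mixing coefficients, which are hard to estimate directly:

```python
import numpy as np

# Simulate an AR(1) process, a standard example of an exponentially
# beta-mixing process (phi and the length are illustrative choices).
rng = np.random.default_rng(1)
phi, n = 0.8, 200000
z = np.empty(n)
z[0] = rng.normal()
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()

def autocorr(x, k):
    # empirical autocorrelation at lag k
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

for k in (1, 5, 10, 20):
    print(k, autocorr(z, k))  # roughly phi**k: geometric decay in the lag
```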
To control the complexity of the hypothesis space we use covering numbers.
A set, , of -valued functions is a -cover (with respect to the -norm) of on a sample if
The -covering number of a function class on a given sample is
The maximal -covering number of a function class is
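In the usual notation (a hedged reconstruction of the standard definitions; the paper may use a different norm index or symbols), the empirical and maximal covering numbers read:

```latex
\mathcal{N}_1\big(\varepsilon, \mathcal{F}, z_1^n\big) \;=\;
\min\Big\{ |G| \;:\; \forall f \in \mathcal{F}\; \exists g \in G,\;
\tfrac{1}{n} \textstyle\sum_{i=1}^{n} |f(z_i) - g(z_i)| \le \varepsilon \Big\},
\qquad
\mathcal{N}_1(\varepsilon, \mathcal{F}, n) \;=\; \sup_{z_1^n} \mathcal{N}_1\big(\varepsilon, \mathcal{F}, z_1^n\big).
```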
there exist such that for , where is the Lebesgue measure on
for every hypothesis , the conditional risk is -Lipschitz continuous in
if as slowly enough (depending on the covering number of the hypothesis set and the mixing rate of the process). The same statement holds almost surely for an exponentially -mixing process.
We will refer to the first assumption of Theorem 1 as smoothness and to the second one as robustness. The need for these assumptions stems from the fact that we use nonparametric estimates of the conditional risk: as shown in (Gyorfi06), ergodicity (which is implied by mixing) by itself is not enough to show even -consistency of kernel density estimators. Because of this, additional assumptions are required. Smoothness means that the marginal distribution of the process and the Lebesgue measure are mutually absolutely continuous. This assumption is satisfied, for example, for processes with a density that is bounded away from 0 from below. caires2005non argue that it implies a kind of recurrence of the process: for almost every point in the support, the process visits its neighborhood infinitely often. The use of local averaging estimation implicitly assumes continuity of the underlying function, which is the robustness assumption. The proof would also work with a weaker but more technical assumption; however, we stick to this more natural one.
As a second result we establish the convergence rate of the estimator.
there exist such that for , where is a Lebesgue measure on
the random vector has a density . Also, and are both twice continuously differentiable in with second derivatives bounded by
the loss function is -Lipschitz in its first argument
Then the following holds for , , and any such that :
The first assumption in Theorem 2 is the smoothness condition of Theorem 1. The second one is a stricter version of robustness needed to quantify the convergence rate; we will refer to it as strong robustness. The last assumption is a standard way to relate the covering numbers of the induced space to (such losses are sometimes called admissible).
As an example, let us consider the case of an exponentially -mixing process, e.g. when . For with a finite fat-shattering dimension, , and if we choose , and , then the bound of Theorem 2 is approximately for a fixed .
Before we present the proofs, we introduce some auxiliary results.
For a sequence of random variables and integers , another random sequence is called a (,)-independent block copy of if and the blocks for are independent and have the same marginal distributions as the corresponding blocks in .
Lemma 2 (YuBin01, Corollary 2.7).
For a -mixing sequence of random variables , let be its (,)-independent block copy. Then for any measurable function defined on every second block of of length and bounded by , it holds
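A hedged reconstruction of the standard statement of this corollary (exact constants may differ from the version used in the paper): if $\tilde{Z}$ consists of $\mu$ independent blocks of length $a$ and $f$ is bounded by $M$, then

```latex
\big| \mathbb{E}\, f(Z) \;-\; \mathbb{E}\, f(\tilde{Z}) \big| \;\le\; (\mu - 1)\, M\, \beta_a .
```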
Note that the application of this lemma and the proof of Lemma 3 do not require to hold exactly. It is possible to work with and such that by putting all the remaining points into the last block. However, for notational convenience we write equality.
The proof of Theorems 1 and 2 relies on Lemma 3, a concentration inequality that bounds the uniform deviations of functions on blocks of a -mixing stochastic process. This lemma uses the popular independent block technique. However, it is not a direct application of previous results (e.g. (Mohri01)), as a careful decomposition is required to deal with the fact that the summands are defined on overlapping sets of variables.
Let be a class of functions defined on blocks of variables. For any integers , such that , let be an (,)-independent block copy of . Then
We start by splitting the index set into two sets and , such that and . Then, recalling that and, hence, for (if , we would need and such that , which holds only for , in which case the bound of the lemma is trivial)
Both summands can be bounded in the same way, so we focus on the first one. We are going to use the independent block technique due to YuBin01. For this we choose and such that . We will split the sample into blocks of consecutive points (Note that is a different splitting than the one above, since here we split the variables themselves). Thanks to the above step, there is no function that takes variables from different blocks. Let , where , and , where . In this way, the blocks within and are separated by gaps of points. Note that thanks to the first splitting, we can rewrite
where we defined composite hypotheses with . Then
Let be the random variables having the same marginal distributions as the 's, but drawn independently of each other. Using the fact that the probability of an event is the expectation of its indicator, we can apply Lemma 2 to obtain
The first term can be bounded using the standard techniques for uniform laws of large numbers for i.i.d. random variables. We are going to use the bound in terms of covering numbers. Following the standard proof, e.g. (Gyorfi01, Theorem 9.1), we obtain
where is a class of composite hypotheses. It remains to connect the covering number of to the covering number of . This follows from the fact that for any and any fixed blocks on :
For a fixed let
Assume that loss function is -Lipschitz in the first argument. Then under the conditions of Lemma 3, the following holds for
The corollary follows from the proof of Lemma 3 by a further upper bound on the covering number. First, for any two on fixed
Hence, , where . Second, by the Lipschitz property of the loss function we get: . ∎
Now we are ready to prove Theorem 1.
The proof is based on the argument of collomb1984proprietes with appropriate modifications to achieve the uniform convergence over hypotheses.
We start by reducing the problem to the supremum over the histories. Then we make the following decomposition
First, note that by the smoothness assumption . A minor modification of Lemma 5 from (collomb1984proprietes) coupled with robustness gives us the convergence of . Next, note that, by Lemma 5, for , where
and is a covering of a -dimensional hypercube with an appropriate ball width. Now, by Lemma 3, we get a bound on and and obtain the convergence in probability.
For exponentially -mixing processes, the same Lemma gives an exponential bound on and and hence we get the almost sure convergence of and to 0 (by the Borel-Cantelli lemma). ∎
The proof of the convergence rate requires a slightly different decomposition than in Theorem 1 and a more delicate treatment of the terms using the additional assumptions. We introduce two further lemmas: Lemma 4 shows how to express the concentration of in terms of the concentration of and . Lemma 5 uses covers to eliminate the supremum over .
Assume smoothness and strong robustness. Then, for , , , , and with ,
We start with the following decomposition
Using the fact that smoothness implies and that and are both upper bounded by 1, we can bound the left hand side of (45) by
The statement of the lemma will be proven if we show that and are upper bounded by . We demonstrate this only for . Using stationarity,
Now we apply the Taylor expansion:
and, invoking the assumptions on the kernel,
Let be an -covering number of a -dimensional hypercube (in -norm) with