One of the most beautiful and best developed branches of machine learning theory is classical statistical learning theory (see for a non-technical overview and for more extensive references). However, it deals primarily with independent and identically distributed (iid) sampling of examples. There have been several attempts to deal with both dependence and non-stationarity: we discuss some of these extensions in Section 1.1. However in general the non-iid case is not as well developed as the classical iid case.
Another well developed branch of learning theory that has its own share of elegant mathematical ideas is online learning theory (the book 
is an excellent if somewhat dated introduction). With roots in game theory and the area of information theory known as universal prediction of individual sequences, online learning theory, unlike statistical learning theory, does not use probabilistic foundations. It is therefore quite surprising that there are uncanny parallels between iid learning theory and online learning theory. The reader is invited to compare the statements of the fundamental theorems in these two areas (restated in this paper as Theorem1 and Theorem 2).
Our main goal in this paper is to study learnability of a function class in the statistical setting under extremely general assumptions that do not require independence or stationarity. We first summarize the key theorems of iid and online learning in Section 3 and Section 4
. Although this material is not new, we feel that the broader data science community might not be very familiar with results in online learning since it a younger field compared to statistical learning theory. Also, presenting both iid learning and online learning results in a unified way allows us to draw parallels between the two theories and to motivate the need for novel theories that connect these two.
We propose a definition of learnability under general stochastic processes in Section 5. We then sandwich the difficulty of learning under this general definition between that of iid and online learning (Theorem 5). We give a prequential version of our main definition in Section 6. In the prequential version, as in online learning, the function output by the learning algorithm at any given time cannot peek into the future. We show that learnability under the prequential version of our general learning setting is equivalent to online learnability (Theorem 8). We focus on the problem of binary classification for simplicity. But we also provide extensions of the sandwiching and equivalence results to the problem of real valued prediction (i.e., regression) in Section 7 (see Theorem 11 and Theorem 12).
1.1 Related Work
. However, in this line of investigation, the stationary assumption is kept and the theory resembles the iid theory to a large extent since mixing implies approximate independence of random variables that are sufficiently separated in time. Mixing assumptions can be shown to hold for some interesting classes of processes, including some Markov and hidden Markov processes. Markov sampling has also been considered on its own as a generalization of iid sampling[5, 6, 7].
There has been work on performance guarantees of specific algorithms like boosting  and SVMs [9, 10] under non-iid assumptions. However, our focus here is not on any specific learning methodology. Also, while we focus on learnability of functions in a fixed class, the question of universal consistency has also been studied in the context of general stochastic processes [11, 12].
. These papers focus on process decompositions: expressing a complex stochastic process as a mixture of simpler stochastic processes. For example, de Finetti’s theorem shows that exchangeable distributions are mixtures of iid distributions. The basic idea is to output a function with small expected loss one step beyond the observed sampled where the expectation is also conditioned on the observed sample. While closely related, our performance measures are cumulative in nature and are inspired more by regret analysis in online learning than PAC bounds in computational learning theory.
The use of tools from online learning theory (e.g., sequential Rademacher complexity) for developing learning theory for dependent, non-stationary process was pioneered by Kuznetsov and Mohri [16, 17]. However, their focus is on time series forecasting applications and therefore their performance measures always involve the expected loss of the function chosen by the learning algorithm some steps into the future (i.e., the part not seen by the learning algorithm) of the process. In contrast, our definition uses conditional distributions of the stochastic process on the realized path to define our performance measure. We also point out that there are also earlier papers that apply learning theory tools to understand time series prediction [18, 19, 20]. Some very recent work has also begun to extend some of the work on time series to processes with spatial structure and dependence such as those occurring on a network .
A direct inspiration for this paper is the work of Skouras and Dawid on estimation in semi-parametric statistical models under misspecification. They highlighted that under misspecification, M-estimators including maximum likelihood estimators may not converge to a deterministic limit even asymptotically. Instead, the limit can be stochastic. This is because, under misspecification, the ”best” model can depend on the observed sequence of data. They gave examples showing that this can happen for non-ergodic processes or processes with long range dependencies that do not decay fast enough. Our work can be seen as a direct extension of their ideas to the learning theory setting where the focus is not on parameter estimation but on loss minimization over potentially massive function spaces.
We consider a supervised learning setting where we want to learn a mapping from an input spaceto an output space . Two output spaces of interest to us in this paper are (binary classification) and (regression). Instead of talking about the difficulty of learning individual functions, we will define learnability for a class of functions that we will denote by . Let and let
be a loss function mapping an input-output pairand a function to a non-negative loss. The set will be denoted by and we use denote an indicator function that is if the condition is true and otherwise. Two important loss functions are the - loss (in binary classification) and the absolute loss (in regression).
We often denote an input-output pair by . When the input-output pair is random, we will denote it by perhaps with additional time indices such as . We will use the abbreviation to denote the sequence . A learning rule is a map from to . We will abuse notation a bit and refer to the learning rule and the function output by the learning rule both by . An important learning rule is empirical risk minimization (ERM): given a sequence of input-output pairs, it outputs the function,
Note that, for infinite function classes, the minimum may not be achieved. In that case, one can work with functions achieving empirical risks that are arbitrarily close to the infimum of the empirical risk over the class .
Given a distribution on , the loss function can be extended as follows:
The function minimizing the expectation above is
3 Learnability in the IID Setting
In this section, we review some basic results of statistical learning theory under iid sampling. For more details, the reader can consult standard texts in this area [23, 3, 24]. In the standard formulation of statistical learning theory, we draw a sequence of iid examples from a distribution
. That is, the joint distribution ofis a product distribution . We adopt the minimax framework to define learnability of a class of functions with respect to a loss function . Define the worst case performance of a learning rule by
and the minimax value by
For the sake of conciseness, the notation above hides the fact that depends on the sequence . The expectation above is taken over the randomness in these samples.
We say that is learnable in the iid learning setting if
Furthermore, we say that is learnable via a sequence of learning rules if
One of the major achievements of statistical learning theory was the determination of necessary and sufficient conditions for learnability of a class
. Learnability is known to be equivalent to a probabilistic condition, namely the uniform law of large numbers (ULLN) for the class:
Here are drawn iid from and . Whether or not ULLN holds for a class depends on the finiteness of different combinatorial parameters depending on whether we are in the binary classification or regression setting. We will discuss the binary classification case here leaving the regression case to Section 7.
The VC dimension of , denoted by , is the length of the longest sequence shattered by . We say that a sequence is shattered by if
Finally, we recall the definition of the (expected) Rademacher complexity of a function class with respect to a distribution :
Note that the expectation above is with respect to both and . The former are drawn iid from whereas the latter are iid -valued Rademacher (also called symmetric Bernoulli) random variables. The worst case, over , Rademacher complexity is denoted by
Consider binary classification with - loss in the iid setting. Then, the following are equivalent:
is learnable via ERM.
The ULLN condition (2) holds for .
4 Learnability in the Online Setting
A second learning setting with a well developed theory is the online learning setting where no probabilistic assumptions are placed on the data generating process. Compared to statistical learning theory under iid sampling, online learning theory is a younger field. The main combinatorial parameter in this area, namely the Littlestone dimension, was defined in . It was given the name ”Littlestone dimension” in  where it was also shown that it fully characterizes learnability in the binary classification setting. Scale-sensitive analogues of Littlestone dimension for regression problems and the sequential version of Rademacher complexity were studied in [27, 28].
The online learning setting takes an individual sequence approach, where results are sought that hold for every possible sequence that might be encountered by the learning rule. The object of interest here is the following minimax value,
Note that the infimum above is taken, not over single learning rules, but over sequences of learning rules, where takes in as input the sequence and outputs a function in . The performance measure inside the supremum, called the expected (normalized) regret of on the sequence , obeys the prequential principle : performance of , which is learned using , is judged using loss evaluated on with no overlap between data used for learning and for performance evaluation. The expectation is needed because the learning rules may use internal randomization to achieve robustness to adversarial data. The regret nomenclature comes from the fact that cannot peek into the future to lower its loss but its cumulative performance is compared with lowest possible loss, in hindsight, over the entire sequence . However, the comparator term has its own restriction: it uses the best fixed function in hindsight, as opposed to the best sequence of functions.
We say that is learnable in the online learning setting if
As in statistical learning, we have necessary and sufficient conditions for learnability that almost mirror those in Theorem 1. The ULLN condition gets replaced by the Uniform Martingale Law of Large Numbers (UMLLN). We say that UMLLN holds for if
The crucial difference between the UMLLN condition and the ULLN condition is that here the supremum is taken over all joint distributions of . In particular need not be iid. Also, to obtain a martingale structure, we use an arbitrary filtration such that is -measurable. It is easy to see that UMLLN is a stronger condition than ULLN: simply restrict to be a product distribution and let be the natural filtration of . Then the UMLLN condition reduces to the ULLN condition.
The VC dimension of is replaced by another combinatorial parameter, called the Littlestone dimension of , denoted by . Before, we present the definition of Littestone definition, we need some notation to handle complete binary trees labeled with examples drawn from the input space . We think of a complete binary tree of depth as defining a sequence of maps. The map gives us the examples sitting at level of the tree. For example, is the root, is the left child of the root, is the right child of the root, and so on. In general is the node at level that we reach by following the path given by the sign sequence where means ”go left” and means ”go right”. The Littlestone dimension of is the depth of the largest complete binary tree shattered by . We say that a complete binary tree is shattered by if
Finally, Rademacher complexity gets replaced with its sequential analogue, called the sequential Rademacher complexity. We first define the sequential Rademacher complexity of given a tree of depth as:
Note that the expectation above is only with respect to the Rademacher random variables as is a fixed tree. Taking the worst case over all complete binary trees of depth gives us the sequential Rademacher complexity of :
Consider binary classification with - loss in the online (individual sequence) setting. Then, the following are equivalent:
The UMLLN condition (3) holds for .
As in the iid setting, a similar result holds for online regression with absolute loss, with the Littlestone dimension condition (i.e., condition number 3 above) replaced by a similar one involving its scale-sensitive counterpart, called the sequential fat shattering dimension (see Section 7.2 for details).
It is well known that online learnability is harder than iid learnability. That is, for any and the gap in this inequality can be arbitrarily large. For example, the set of threshold functions on :
has but .
A conspicuous difference between Theorem 1 and Theorem 2 is the absence of the condition involving ERM. Indeed, ERM is not a good learning rule in the online setting: there exist classes learnable in the online setting that are not learnable via ERM. Unfortunately, the learning rules that learn a class in the online setting are quite complex . It is not known if there exists a rule as simple as ERM that will learn a class whenever is online learnable. In any case, ERM does not play as central a role in online learning as it does in learning in the iid setting.
5 Learnability under General Stochastic Processes
In this section, we move beyond the iid setting to cover all distributions, not just product distributions. The key idea behind our generalization is to focus on the -regret of a function ,
Note that the -regret depends on the class but we hide this dependence when the function class is clear from the context. The -regret is the central quantity involved in the definition of . It is similar in flavor to, but distinct from, the individual sequence regret of a sequence of functions on a sequence .
For a general stochastic process , we still have an analogue of at time , namely
This is the conditional distribution of given . Just like , this is unknown to the learning rule. However, unlike in the iid case, is data-dependent. Therefore the -regret of a function is data-dependent. We will often hide the dependence of on past data . We can use the average of the -regrets,
as a performance measure. Note that the minimizer of this performance measure is data-dependent, unlike the iid case. The value of a learning rule is now defined as
where the supremum is now taken over all joint distributions over . This leads to consideration of the following minimax value to define learnability:
We say that is learnable in the general stochastic process setting if
Furthermore, we say that is learnable via a sequence of learning rules if
Note that in the iid case, when is a product distribution with marginal , we have for all and therefore, for any ,
We have the following result as an immediate consequence.
Fix any loss function and function class . For any learning rule , . This also means that .
Given a loss and function class , define the loss class as
We define the sequential Rademacher complexity of a loss class as
Note that the supremum here is over -valued trees that are labeled with input-output pairs. It is easy for us to connect the complexity to the loss class to the complexity of the underlying function class for a simple loss function like the - loss (see Appendix A for details.)
We now provide a performance guarantee for ERM under the general stochastic process setting.
Fix any loss function and function class . Let denote the ERM learning rule defined in (1). Then, we have,
The first inequality is true by definition of . So we just have to prove the second one.
Note, by definition of ,
Therefore, we have,
The justification for the last inequality is as follows. First, we know that . Second, when is achieved, at say, we have,
Taking expectations on both sides of (5) gives us
Note that the last inequality follows from Theorem 2 in . Since the last quantity above does not depend on , we can take supremum over on both sides to finish the proof. ∎
We now have everything in place to be able to sandwich learnability under general stochastic processes between iid learnability and online learnability.
A slightly weaker version of the result can be shown to hold in the regression case (see Section 7.3). Moreover, we conjecture that learnability in the general stochastic process setting is, in fact, equivalent to online learnability. Further, we believe that the conjecture can be proved by first showing that the class of thresholds defined in (4) is not learnable in the general stochastic process setting, and then using the results in  that connect thresholds and finiteness of Littlestone dimension.
We end this section with some examples showing that our definition of learnability under general stochastic processes is natural, interesting and worth studying.
Let us note once again that if is a product measure then is just and therefore not random. In this special but important case, our definition of learnability reduces to the standard definition of learnability under iid sampling.
Asymptotically Stationary Process
Suppose that is not a product measure but the process is asymptotically stationary
in the sense that the random probability measureconverges to some fixed deterministic in total variation as . For a class that is learnable in the general stochastic process setting and for loss function bounded by , we have
By the stationarity assumption, the first term on the right in the last inequality goes to zero. By learnability of via ERM in the general stochastic process setting, the last term goes to zero. Note that is linear in and therefore . So, under stationarity, our learnability condition implies that ERM does well when its performance is measured under the (asymptotic) stationary distribution .
Mixture of IID
Consider a simple mixture of product distributions
where, for simplicity, assume that and have disjoint supports. Then with probability we have and with probability we have . Therefore, the minimizer of
is with probability and with probability (assuming, again for simplicity, that the minimizers are unique). Here, unlike the iid and stationary examples, the ”best” function, even with infinite data, is not deterministic but is random depending on which mixture component was selected. Still, if learnability in our general sense holds, then ERM will do well according to the performance measure (6). Note that this example can be easily generalized to a mixture of more than two iid processes.
Consider a class closed under translations by a constant, i.e., if then for any constant . Let be iid drawn from some distribution on that has a density with respect to Lebesgue measure on . Let for where are iid standard normal. Note that the process is not iid. It is not even mixing in any sense due to long range dependence in caused by . Note that ERM over with squared loss will solve the problem:
where last equality holds because and we have assumed that all empirical minimizers are unique with probability . Thus, we have shown that
If is iid learnable then converges (in sense) to the function which means the ERM on converges to the random function . At the same time, let be the conditional distribution of given and . Then we have
Therefore, the minimizer of over is which is what ERM converges to.
6 A Prequential Definition of Learnability
The previous section generalized the statistical setting to include non-product distributions and extended the definition of learnability to a more general setting. In this section, we will generalize the online learnability definition to obtain a prequential version of learnability while still keeping the level of generality of the previous section. As in the online setting, consider a sequence of learning rules where is a function only of , i.e. it cannot peek ahead to access . Unlike the online learning setting, is a random sequence drawn from some general distribution over . Now, define the minimax value,
Note that the expectation above is both with respect to as well as any internal randomness used by the rules . As before, the definition of the minimax value leads to the definition of learnability.
We say that is learnable in the prequential, general stochastic process setting if
The definition of can be obtained from the definition of by replacing , which depends on the entire sequence , by , which depends only on , in the loss term that involves . It can also be thought as a generalization of because of the following. When the distribution degenerates to being a point mass at a specific sequence then becomes a point mass at and the difference of cumulative losses above reduces to the individual sequence regret of on . This observation immediately gives us the following result.
Fix any loss function and function class . Then we have .
The lemma above says that learning a function class in the prequential, general stochastic process setting is at least as hard as learning it in the online setting. Our next lemma provides a converse result.
Fix any loss function and function class . Then, for any sequence of learning rules we have,
This also means that
Let be an arbitrary distribution. We have the following three term decomposition,
The term involves a martingale difference sequence and hence has expectation zero under . Term is the individual sequence regret of on the sequence and hence is bounded, in expectation, by . Term , in expectation, is at most,
where the inequality again follows from Theorem 2 in .
The lemma now follows by taking expectations on both sides of the three term decomposition above and plugging in the upper bounds for each term’s expected value. ∎
We now have all the ingredients to characterize learnability for binary classification in the prequential, general stochastic process setting.
Consider binary classification with - loss. Then, all of the conditions in Theorem 2 are also equivalent to:
is learnable in the prequential, general stochastic process setting.
A slightly weaker version of the result above for the regression setting can be found in Section 7.4.
7 The Regression Setting
In this section, we provide analogues of most of the binary classification results for the regression setting with absolute loss. Our results are not as sharp as in the binary classification setting and the reason for this is explained at the beginning of Section 7.3. Note that rates of convergence can depend on the loss function but learnability is quite robust to changes in the loss function. For example, we can also use squared loss . But we will keep our focus on the absolute loss in this section.
We remind the reader that our organization in this section is similar to the organization of results for binary classification. Section 7.1 and Section 7.2 review known results in iid and online learning but give them a unified presentation. Section 7.3 and Section 7.4 present new results.
7.1 Statistical Learning
The fat shattering dimension of is a scale-sensitive parameter that takes a scale as an argument. The fat shattering dimension of at scale , denoted by , is the length of the longest sequence that is -shattered by . We say that a sequence is -shattered by if there exists a witness sequence of real numbers such that
Consider regression with absolute loss in the statistical setting. Then, the following are equivalent:
is learnable via ERM.
The ULLN condition (2) holds for .
7.2 Online Setting
The fat shattering dimension of is replaced by its sequential analogue just like VC dimension gets replaced by Littlestone dimension in the case of binary classification. The sequential fat shattering dimension of at scale , denoted by is the depth of the deepest tree that is -shattered by . We say that a complete binary tree is -shattered by if there exists a complete binary real valued witness tree such that
Consider regression with absolute loss in the online (individual sequence) setting. Then, the following are equivalent:
The UMLLN condition (3) holds for .
As in the binary classification setting, online learnability is harder than statistical learnability. That is, for any and any , and the gap in this inequality can be arbitrarily large. For example, the set of bounded variation functions from to with total variation at most , has for all but