# Learning Whenever Learning is Possible: Universal Learning under General Stochastic Processes

This work initiates a general study of learning and generalization without the i.i.d. assumption, starting from first principles. While the standard approach to statistical learning theory is based on assumptions chosen largely for their convenience (e.g., i.i.d. or stationary ergodic), in this work we are interested in developing a theory of learning based only on the most fundamental and natural assumptions implicit in the requirements of the learning problem itself. We specifically study universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process. We are then interested in the question of whether there exist learning rules guaranteed to be universally consistent given only the assumption that universally consistent learning is possible for the given data process. The reasoning that motivates this criterion emanates from a kind of optimist's decision theory, and so we refer to such learning rules as being optimistically universal. We study this question in three natural learning settings: inductive, self-adaptive, and online. Remarkably, as our strongest positive result, we find that optimistically universal learning rules do indeed exist in the self-adaptive learning setting. Establishing this fact requires us to develop new approaches to the design of learning algorithms. Along the way, we also identify concise characterizations of the family of processes under which universally consistent learning is possible in the inductive and self-adaptive settings. We additionally pose a number of enticing open problems, particularly for the online learning setting.


## 1 Introduction

At least since the time of the ancient Pyrrhonists, it has been observed that learning in general is sometimes not possible. Rather than turning to radical skepticism, modern learning theorists have preferred to introduce constraining assumptions, under which learning becomes possible, and have established positive guarantees for various learning strategies under these assumptions. However, one problem is that the assumptions we have focused on in the literature tend to be assumptions of convenience, simplifying the analysis, rather than assumptions rooted in a principled approach. This is typified by the overwhelming reliance on the assumption that training samples are independent and identically distributed, or resembling this (e.g., stationary ergodic). In the present work, we revisit the issue of the assumptions at the foundations of statistical learning theory, starting from first principles, without relying on assumptions of convenience about the data, such as independence or stationarity.

We approach this via a kind of optimist’s decision theory, reasoning that if we are tasked with achieving a given objective in some scenario, then already we have implicitly committed to the assumption that achieving the objective is at least possible in that scenario. We may therefore rely on this assumption in our strategy for achieving the objective. We are then most interested in strategies guaranteed to achieve the objective in all scenarios where it is possible to do so: that is, strategies that rely only on the assumption that the objective is achievable. Such strategies have the satisfying property that, if ever they fail to achieve the objective, we may rest assured that no other strategy could have succeeded, so that nothing was lost.

Thus, in approaching the problem of learning (suitably formalized), we may restrict focus to those scenarios in which learning is possible. This assumption — that learning is possible — essentially represents a most “natural” assumption, since it is necessary for a theory of learning. Concretely, in this work, we initiate this line of exploration by focusing on (arguably) the most basic type of learning problem: universal consistency in learning a function. Following the optimist’s reasoning above, we are interested in determining whether there exist learning strategies that are optimistically universal learners, in the sense that they are guaranteed to be universally consistent given only the assumption that universally consistent learning is possible under the given data process: that is, they are universally consistent under all data processes that admit the existence of universally consistent learners. We find that, in certain learning protocols, such optimistically universal learners do indeed exist, and we provide a construction of such a learning rule. Interestingly, it turns out that not all learning rules consistent under the i.i.d. assumption satisfy this type of universality, so that this criterion can serve as an informative desideratum in the design of learning methods. Along the way, we are also interested in expressing concise necessary and sufficient conditions for universally consistent learning to be possible for a given data process.

We specifically consider three natural learning settings — inductive, self-adaptive, and online — distinguished by the level of access to the data available to the learner. In all three settings, we suppose there is an unknown target function $f^\star : \mathcal{X} \to \mathcal{Y}$ and a sequence of data points $X_1, X_2, \ldots$ with each $X_t \in \mathcal{X}$, of which the learner is permitted to observe the first $n$ samples $(X_1, f^\star(X_1)), \ldots, (X_n, f^\star(X_n))$: the training data. Based on these observations, the learner is tasked with producing a predictor $f_n : \mathcal{X} \to \mathcal{Y}$. The performance of the learner is determined by how well $f_n(X_m)$ approximates the (unobservable) value $f^\star(X_m)$ for data encountered in the future (i.e., $m > n$).[^1] To quantify this, we suppose there is a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty]$, and we are interested in obtaining a small long-run average value of $\ell(f_n(X_m), f^\star(X_m))$. A learning rule is said to be universally consistent under the process $\mathbb{X} = \{X_t\}_{t=1}^{\infty}$ if it achieves this (almost surely, as $n \to \infty$) for all target functions $f^\star$.[^2] The three different settings are then formed as natural variants of this high-level description. The first is the basic inductive learning setting, in which the predictor $f_n$ is fixed after observing the initial $n$ samples, and we are interested in obtaining a small value of $\frac{1}{t} \sum_{m=n+1}^{n+t} \ell(f_n(X_m), f^\star(X_m))$ for all large $t$. This inductive setting is perhaps the most commonly-studied in the prior literature on statistical learning theory (see e.g., Devroye, Györfi, and Lugosi, 1996).
The second setting is a more-advanced variant, which we call self-adaptive learning, in which the predictor may be updated after each subsequent prediction, based on the additional unlabeled observations $X_{n+1}, \ldots, X_m$: that is, it continues to learn from its test data. In this case, denoting by $g_{n,m}$ the predictor chosen after observing $X_{1:m}$ (with target values for only the first $n$ points), we are interested in obtaining a small value of $\frac{1}{t+1} \sum_{m=n}^{n+t} \ell(g_{n,m}(X_{m+1}), f^\star(X_{m+1}))$ for all large $t$. This setting is related to several others studied in the literature, including semi-supervised learning (Chapelle, Schölkopf, and Zien, 2010), transductive learning (Vapnik, 1982, 1998), and (perhaps most-importantly) the problems of domain adaptation and covariate shift (Huang, Smola, Gretton, Borgwardt, and Schölkopf, 2007; Cortes, Mohri, Riley, and Rostamizadeh, 2008; Ben-David, Blitzer, Crammer, Kulesza, Pereira, and Vaughan, 2010). Finally, the strongest setting considered in this work is the online learning setting, in which, after each prediction $h_t(X_{t+1})$, the learner is permitted to observe the target value $f^\star(X_{t+1})$ and update its predictor accordingly. We are then interested in obtaining a small value of $\frac{1}{n} \sum_{t=0}^{n-1} \ell(h_t(X_{t+1}), f^\star(X_{t+1}))$ for all large $n$. This is a particularly strong setting, since it requires that the supervisor providing the responses remains present in perpetuity. Nevertheless, this is sometimes the case to a certain extent (e.g., in forecasting problems), and consequently the online setting has received considerable attention (e.g., Littlestone, 1988; Haussler, Littlestone, and Warmuth, 1994; Cesa-Bianchi and Lugosi, 2006; Ben-David, Pál, and Shalev-Shwartz, 2009; Rakhlin, Sridharan, and Tewari, 2015).

[^1]: Of course, in certain real learning scenarios, these future values might never actually be observable, and therefore should be considered merely as hypothetical values for the purpose of theoretical analysis of performance.

[^2]: Technically, to be consistent with the terminology used in the literature on universal consistency, we should qualify this as “universally consistent for function learning,” to indicate that the target values are given by a fixed function $f^\star$ of the observations. However, since we do not consider noisy values or drifting target functions in this work, we omit this qualification and simply write “universally consistent” for brevity.
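As a concrete illustration of the online protocol described above (not the learning rule constructed in this work), the following sketch runs a simple memorizing learner under the 0-1 loss on a finite instance space; `online_average_loss`, the process, and the target function are all hypothetical stand-ins chosen for the example.

```python
import random

def online_average_loss(process, f_star, T):
    """Run a memorizing online learner for T rounds under the 0-1 loss.

    The learner predicts from a table of previously revealed (x, f_star(x))
    pairs, falling back to a default value on unseen points; after each
    prediction the supervisor reveals f_star(x), matching the online
    protocol in which target values arrive after every prediction.
    """
    memory = {}
    total_loss = 0.0
    for _ in range(T):
        x = process()
        prediction = memory.get(x, 0)    # default guess on unseen points
        total_loss += 0.0 if prediction == f_star(x) else 1.0
        memory[x] = f_star(x)            # target value revealed post-prediction
    return total_loss / T

random.seed(0)
process = lambda: random.randrange(20)   # i.i.d. uniform on {0, ..., 19}
f_star = lambda x: x % 3                 # an arbitrary target function
avg_loss = online_average_loss(process, f_star, 50000)
# The learner errs at most once per distinct point, so avg_loss is tiny.
```

Under this particular process every point recurs, so memorization suffices; the difficulty addressed in this work is precisely that general processes need not behave so favorably.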

### 1.1 Formal Definitions

We begin our formal discussion with a few basic definitions. Let $(\mathcal{X}, \mathcal{B})$ be a measurable space, with $\mathcal{B}$ a Borel $\sigma$-algebra generated by a separable metrizable topology on $\mathcal{X}$, where $\mathcal{X}$ is called the instance space and is assumed to be nonempty. Fix a space $\mathcal{Y}$, called the value space, and a function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty]$, called the loss function. We also denote $\bar{\ell} = \sup_{y_1, y_2 \in \mathcal{Y}} \ell(y_1, y_2)$. Unless otherwise indicated explicitly, we will suppose $\bar{\ell} < \infty$ (i.e., $\ell$ is bounded); the sole exception to this is Section 8, which is devoted to exploring the setting of unbounded $\ell$. Furthermore, to focus on nontrivial scenarios, we will suppose $\mathcal{X}$ and $\mathcal{Y}$ are nonempty and $\bar{\ell} > 0$ throughout.

For simplicity, we suppose that $\ell$ is a metric, and that $(\mathcal{Y}, \ell)$ is a separable metric space. For instance, this is the case for discrete classification under the 0-1 loss, or real-valued regression under the absolute loss. However, we note that most of the theory developed here easily extends (with only superficial modifications) to any loss $\ell$ that is merely dominated by a separable metric $d_\mathcal{Y}$, in the sense that $\ell(y_1, y_2) \le \bar{\phi}(d_\mathcal{Y}(y_1, y_2))$ for some continuous nondecreasing function $\bar{\phi}$ with $\bar{\phi}(0) = 0$, and which satisfies a corresponding non-triviality condition. This then admits regression under the squared loss, discrete classification with asymmetric misclassification costs, and many other interesting cases. We include a brief discussion of this generalization in Section 9.1.

Below, any reference to a measurable set $A \subseteq \mathcal{X}$ should be taken to mean $A \in \mathcal{B}$, unless otherwise specified. Additionally, let $\mathcal{T}_\mathcal{Y}$ be the topology on $\mathcal{Y}$ induced by $\ell$, and let $\mathcal{B}_\mathcal{Y}$ denote the Borel $\sigma$-algebra on $\mathcal{Y}$ generated by $\mathcal{T}_\mathcal{Y}$; references to measurability of subsets of $\mathcal{Y}$ below should be taken to indicate membership in $\mathcal{B}_\mathcal{Y}$. We will be interested in the problem of learning from data described by a discrete-time stochastic process $\mathbb{X} = \{X_t\}_{t=1}^{\infty}$ on $\mathcal{X}$. We do not make any assumptions about the nature of this process. For any $n, m \in \mathbb{N} \cup \{0\}$, and any sequence $x = \{x_t\}$, define $x_{n:m} = (x_n, \ldots, x_m)$, or $x_{n:m} = ()$ if $n > m$, where $()$ denotes the empty sequence (overloading notation, as this may also denote the empty set); for convenience, also define $x_{n:\infty} = (x_n, x_{n+1}, \ldots)$. For any function $g$ and sequence $x$ in the domain of $g$, we denote $g(x_{n:m}) = (g(x_n), \ldots, g(x_m))$ and $g(x) = (g(x_1), g(x_2), \ldots)$. Also, for any set $A \subseteq \mathcal{X}$, we denote by $x \cap A$ the subsequence of all entries of $x$ contained in $A$, and $|x_{1:n} \cap A|$ denotes the number of indices $t \le n$ with $x_t \in A$.

For any function $g : \mathcal{X} \to [-\infty, \infty]$, and any sequence $x = \{x_t\}_{t=1}^{\infty}$ in $\mathcal{X}$, define

$$\hat{\mu}_x(g) = \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} g(x_t).$$
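Since a limit superior cannot be computed from finitely many observations, any numerical illustration of $\hat{\mu}_x$ must truncate. The sketch below (a hypothetical helper of our own, assuming NumPy) reports the largest running average beyond a burn-in index as a finite-horizon surrogate.

```python
import numpy as np

def mu_hat_truncated(g, xs, burn_in):
    """Finite-horizon surrogate for mu_hat_x(g) = limsup_n (1/n) sum_{t<=n} g(x_t).

    From a finite prefix we can only report the largest running average past
    a burn-in index, which approximates the limsup when the running averages
    have settled by that point.
    """
    values = np.fromiter((g(x) for x in xs), dtype=float)
    running_avg = np.cumsum(values) / np.arange(1, len(values) + 1)
    return running_avg[burn_in:].max()

# For the alternating sequence 0, 1, 0, 1, ... the relative frequency of the
# set A = {0} converges to 1/2, so the surrogate should sit near 0.5.
xs = [t % 2 for t in range(10000)]
estimate = mu_hat_truncated(lambda v: 1.0 if v == 0 else 0.0, xs, burn_in=1000)
```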

For any set $A \subseteq \mathcal{X}$ we overload this notation, defining $\hat{\mu}_x(A) = \hat{\mu}_x(\mathbb{1}_A)$, where $\mathbb{1}_A$ is the binary indicator function for the set $A$. We also use the notation $\mathbb{1}[P]$, for any logical proposition $P$, to denote a value that is $1$ if $P$ holds (evaluates to “True”), and $0$ otherwise. We also make use of the standard notation for limits of sequences of sets (see e.g., Ash and Doléans-Dade, 2000): $\limsup_{k \to \infty} A_k = \bigcap_{i \ge 1} \bigcup_{k \ge i} A_k$, $\liminf_{k \to \infty} A_k = \bigcup_{i \ge 1} \bigcap_{k \ge i} A_k$, and $\lim_{k \to \infty} A_k$ exists and equals $A$ if and only if $\limsup_{k \to \infty} A_k = \liminf_{k \to \infty} A_k = A$. Additionally, for a set $S$, a function $f : S \to \mathbb{R}$ bounded from below, and a value $\varepsilon > 0$, define $\operatorname{argmin}^{\varepsilon}_{s \in S} f(s)$ as an arbitrary element $\hat{s} \in S$ with $f(\hat{s}) \le \inf_{s \in S} f(s) + \varepsilon$; we also allow $\varepsilon = 0$ in this definition in the case the infimum is realized by some $s \in S$; to be clear, we suppose $\operatorname{argmin}^{\varepsilon}_{s \in S} f(s)$ evaluates to the same $\hat{s}$ every time it appears (for a given function and set).

As discussed above, we are interested in three learning settings, defined as follows. An inductive learning rule is any sequence of measurable functions $f_n : \mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{X} \to \mathcal{Y}$, $n \in \mathbb{N} \cup \{0\}$. A self-adaptive learning rule is any array of measurable functions $g_{n,m} : \mathcal{X}^m \times \mathcal{Y}^n \times \mathcal{X} \to \mathcal{Y}$, $n \in \mathbb{N} \cup \{0\}$, $m \ge n$. An online learning rule is any sequence of measurable functions $h_t : \mathcal{X}^t \times \mathcal{Y}^t \times \mathcal{X} \to \mathcal{Y}$, $t \in \mathbb{N} \cup \{0\}$. In each case, these functions can potentially be stochastic (that is, we allow the function itself to be a random variable), though independent from the process $\mathbb{X}$. For any measurable $f^\star : \mathcal{X} \to \mathcal{Y}$, any inductive learning rule $\{f_n\}$, any self-adaptive learning rule $\{g_{n,m}\}$, and any online learning rule $\{h_t\}$, we define

$$\hat{L}_{\mathbb{X}}(f_n, f^\star; n) = \limsup_{t \to \infty} \frac{1}{t} \sum_{m=n+1}^{n+t} \ell\big(f_n(X_{1:n}, f^\star(X_{1:n}), X_m),\, f^\star(X_m)\big),$$

$$\hat{L}_{\mathbb{X}}(g_{n,\cdot}, f^\star; n) = \limsup_{t \to \infty} \frac{1}{t+1} \sum_{m=n}^{n+t} \ell\big(g_{n,m}(X_{1:m}, f^\star(X_{1:n}), X_{m+1}),\, f^\star(X_{m+1})\big),$$

$$\hat{L}_{\mathbb{X}}(h_\cdot, f^\star; n) = \frac{1}{n} \sum_{t=0}^{n-1} \ell\big(h_t(X_{1:t}, f^\star(X_{1:t}), X_{t+1}),\, f^\star(X_{t+1})\big).$$

In each case, $\hat{L}_{\mathbb{X}}$ measures a kind of limiting loss of the learning rule, relative to the source of the target values: $f^\star$. In this context, we refer to $f^\star$ as the target function. Note that, in the cases of inductive and self-adaptive learning rules, we are interested in the average future losses after some initial number $n$ of “training” observations, for which target values are provided, and after which no further target values are observable. Thus, a small value of the loss in these settings represents a kind of generalization to future (possibly previously-unseen) data points. In particular, in the special case of $\mathbb{X}$ i.i.d. with marginal distribution $P$, the strong law of large numbers implies that the loss $\hat{L}_{\mathbb{X}}(f_n, f^\star; n)$ of an inductive learning rule is equal (almost surely) to the usual notion of the risk of the learned predictor, namely its expected loss on an independent test point $X \sim P$, commonly studied in the statistical learning theory literature, so that $\hat{L}_{\mathbb{X}}$ represents a generalization of the notion of risk (for deterministic responses). Note that, in the general case, the average loss might not have a well-defined limit as $t \to \infty$, particularly for non-stationary processes $\mathbb{X}$, and it is for this reason that we use the limit superior in the definition (and similarly for the self-adaptive case). We also note that, since the loss function is always finite, we could have included the losses on the training samples in the summation in the inductive definition without affecting its value. This observation yields a convenient simplification of the definition, as it implies the following equality.

$$\hat{L}_{\mathbb{X}}(f_n, f^\star; n) = \hat{\mu}_{X}\big(\ell(f_n(X_{1:n}, f^\star(X_{1:n}), \cdot),\, f^\star(\cdot))\big).$$
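The i.i.d. special case mentioned above can be checked numerically: for a fixed predictor, the empirical average of future losses should concentrate around the risk. The following sketch (0-1 loss on a ten-point instance space; all names are illustrative choices of ours, not notation from the paper) does this by Monte Carlo.

```python
import random

# Under an i.i.d. process with marginal P, the long-run average future loss
# of a fixed predictor coincides (a.s.) with its risk, i.e., its expected
# loss on an independent draw X ~ P.
random.seed(1)
f_star = lambda x: x % 2                  # target function on {0, ..., 9}
f_n = lambda x: 0                         # a fixed (deliberately poor) predictor
loss = lambda y1, y2: 0.0 if y1 == y2 else 1.0

draws = [random.randrange(10) for _ in range(200000)]   # i.i.d. uniform draws
avg_future_loss = sum(loss(f_n(x), f_star(x)) for x in draws) / len(draws)
exact_risk = sum(loss(f_n(x), f_star(x)) for x in range(10)) / 10   # = 0.5
```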

The distinction between the inductive and self-adaptive settings is merely the fact that the self-adaptive learning rule is able to update the function used for prediction after observing each “test” point $X_{m+1}$, $m \ge n$. Note that the target values $f^\star(X_{m+1})$ are not available for these test points: only the “unlabeled” points $X_{m+1}$ themselves. In the special case of an i.i.d. process, the self-adaptive setting is closely related to the semi-supervised learning setting studied in the statistical learning theory literature (Chapelle, Schölkopf, and Zien, 2010). In the case of non-stationary processes, it has relations to problems of domain adaptation and covariate shift (Huang, Smola, Gretton, Borgwardt, and Schölkopf, 2007; Cortes, Mohri, Riley, and Rostamizadeh, 2008; Ben-David, Blitzer, Crammer, Kulesza, Pereira, and Vaughan, 2010).

In the case of online learning, the prediction function is again allowed to update after every test point, but in this case the target value for the test point is accessible (after the prediction is made). This online setting, with precisely this same objective function, has been studied in the learning theory literature, both in the case of i.i.d. processes and relaxations thereof (e.g., Haussler, Littlestone, and Warmuth, 1994; Györfi, Kohler, Krzyżak, and Walk, 2002) and in the very-general setting of an arbitrary process (e.g., Littlestone, 1988; Cesa-Bianchi and Lugosi, 2006; Rakhlin, Sridharan, and Tewari, 2015).

Our interest in the present work is the basic problem of universal consistency, wherein the objective is to design a learning rule with the guarantee that the long-run average loss approaches zero (almost surely) as the training sample size grows large, and that this fact holds true for any target function $f^\star$. Specifically, we have the following definitions. We say an inductive learning rule $\{f_n\}$ is strongly universally consistent under $\mathbb{X}$ if, for every measurable $f^\star : \mathcal{X} \to \mathcal{Y}$, $\lim_{n \to \infty} \hat{L}_{\mathbb{X}}(f_n, f^\star; n) = 0$ (a.s.).
We say a process $\mathbb{X}$ admits strong universal inductive learning if there exists an inductive learning rule that is strongly universally consistent under $\mathbb{X}$.
We denote by $\mathrm{SUIL}$ the set of all processes that admit strong universal inductive learning. We say a self-adaptive learning rule $\{g_{n,m}\}$ is strongly universally consistent under $\mathbb{X}$ if, for every measurable $f^\star : \mathcal{X} \to \mathcal{Y}$, $\lim_{n \to \infty} \hat{L}_{\mathbb{X}}(g_{n,\cdot}, f^\star; n) = 0$ (a.s.).
We say a process $\mathbb{X}$ admits strong universal self-adaptive learning if there exists a self-adaptive learning rule that is strongly universally consistent under $\mathbb{X}$.
We denote by $\mathrm{SUAL}$ the set of all processes that admit strong universal self-adaptive learning. We say an online learning rule $\{h_t\}$ is strongly universally consistent under $\mathbb{X}$ if, for every measurable $f^\star : \mathcal{X} \to \mathcal{Y}$, $\lim_{n \to \infty} \hat{L}_{\mathbb{X}}(h_\cdot, f^\star; n) = 0$ (a.s.).
We say a process $\mathbb{X}$ admits strong universal online learning if there exists an online learning rule that is strongly universally consistent under $\mathbb{X}$.
We denote by $\mathrm{SUOL}$ the set of all processes that admit strong universal online learning.

Technically, the above definitions of universal consistency are defined relative to the loss function $\ell$. However, we will establish below that the families of processes admitting strong universal inductive and self-adaptive learning are in fact invariant to the choice of $\ell$, subject to the basic assumptions stated above ($(\mathcal{Y}, \ell)$ separable, $0 < \bar{\ell} < \infty$). We will also find that this is true of the online family, subject to the additional constraint that $(\mathcal{Y}, \ell)$ is totally bounded. Furthermore, for unbounded losses we find that all three families are invariant to $\ell$, subject to separability and $\bar{\ell} = \infty$.

As noted above, much of the prior literature on universal consistency without the i.i.d. assumption has focused on relaxations of the i.i.d. assumption to more-general families of processes, such as stationary mixing, stationary ergodic, or certain limited forms of non-stationarity (see e.g., Steinwart, Hush, and Scovel, 2009, Chapter 27 of Györfi, Kohler, Krzyżak, and Walk, 2002, and references therein). In each case, these relaxations were chosen largely for their convenience, as they preserve the essential features of the i.i.d. setting used in the traditional approaches to proving consistency of certain learning rules (particularly, features related to concentration of measure). In contrast, our primary interest in the present work is to study the natural assumption intrinsic to the universal consistency problem itself: the assumption that universal consistency is possible. In other words, we are interested in the following abstract question:

Do there exist learning rules that are strongly universally consistent under every process that admits strong universal learning?

Each of the three learning settings yields a concrete instantiation of this question. For the reason discussed in the introductory remarks, we refer to any such learning rule as being optimistically universal. Thus, we have the following definition.

An (inductive/self-adaptive/online) learning rule is optimistically universal if it is strongly universally consistent under every process that admits strong universal (inductive/self-adaptive/online) learning.

### 1.2 Summary of the Main Results

Here we briefly summarize the main results of this work. Their proofs, along with several other results, will be developed throughout the rest of this article.

The main positive result in this work is the following theorem, which establishes that optimistically universal self-adaptive learning is indeed possible. In fact, in proving this result, we develop a specific construction of one such self-adaptive learning rule.

There exists an optimistically universal self-adaptive learning rule.

Interestingly, it turns out that the additional capabilities of self-adaptive learning, compared to inductive learning, are actually necessary for optimistically universal learning. This is reflected in the following result.

There does not exist an optimistically universal inductive learning rule, if $\mathcal{X}$ is an uncountable Polish space.

Taken together, these two results are interesting indeed, as they indicate there can be strong advantages to designing learning methods to be self-adaptive. This seems particularly interesting when we note that very few learning methods in common use are designed to exploit this capability: that is, to adjust their trained predictor based on the (unlabeled) test samples they encounter. In light of these results, it therefore seems worthwhile to revisit the definitions of these methods with a view toward designing self-adaptive variants.

As for the online learning setting, the present work makes only partial progress toward resolving the question of the existence of optimistically universal online learning rules (in Section 6). In particular, the following question remains open at this time.

###### Open Problem 1

Does there exist an optimistically universal online learning rule?

To be clear, as we discuss in Section 6, one can convert the optimistically universal self-adaptive learning rule from Theorem 1.2 into an online learning rule that is strongly universally consistent for any process that admits strong universal self-adaptive learning. However, as we prove below, the set of processes that admit strong universal online learning is a strict superset of these, and so optimistically universal online learning represents a much stronger requirement for the learner.

In the process of studying the above, we also investigate the problem of concisely characterizing the family of processes that admit strong universal learning, of each of the three types. In particular, consider the following simple condition on the tail behavior of a given process $\mathbb{X}$.

###### Condition 1

For every monotone sequence $\{A_k\}_{k=1}^{\infty}$ of sets in $\mathcal{B}$ with $A_k \downarrow \emptyset$,

$$\lim_{k \to \infty} \mathbb{E}\big[\hat{\mu}_{X}(A_k)\big] = 0.$$
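To build intuition for Condition 1 (this is an illustration of ours, not part of the formal development), take $\mathcal{X} = \mathbb{N}$ and the tail sets $A_k = \{k, k+1, \ldots\}$, which decrease to $\emptyset$. A process that escapes to infinity keeps $\hat{\mu}_{X}(A_k)$ at $1$ for every $k$ and so violates the condition, while a process confined to a finite set satisfies it trivially:

```python
def tail_frequency(xs, k):
    """Empirical relative frequency of the tail set A_k = {k, k+1, ...} along xs."""
    return sum(1 for x in xs if x >= k) / len(xs)

n = 100000
escaping = list(range(1, n + 1))          # x_t = t: the process escapes to infinity
confined = [t % 50 for t in range(n)]     # process confined to {0, ..., 49}

# For the escaping process, every tail set A_k eventually captures the whole
# sequence, so mu_hat(A_k) = 1 for all k and Condition 1 fails.
escaping_freq = tail_frequency(escaping, 1000)
# For the confined process, A_k is never visited once k > 49, so the
# expectation in Condition 1 vanishes and the condition holds.
confined_freq = tail_frequency(confined, 1000)
```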

Denote by $\mathcal{C}_1$ the set of all processes satisfying Condition 1. In Section 2 below, we discuss this condition in detail, and also provide several equivalent forms of the condition. One interesting instance of this is Theorem 2.2, which notes that Condition 1 is equivalent to the condition that the set function $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$ is a continuous submeasure (Definition 2.2 below). For our present interest, the most important fact about Condition 1 is that it precisely identifies which processes admit strong universal inductive or self-adaptive learning, as the following theorem states.

The following statements are equivalent for any process $\mathbb{X}$.

• $\mathbb{X}$ satisfies Condition 1.

• $\mathbb{X}$ admits strong universal inductive learning.

• $\mathbb{X}$ admits strong universal self-adaptive learning.

Equivalently, the three statements above all define the same family of processes.

Certainly any i.i.d. process satisfies Condition 1 (by the strong law of large numbers). Indeed, we argue in Section 3.1 that any process satisfying the law of large numbers — or more generally, having pointwise convergent relative frequencies — satisfies Condition 1, and hence by Theorem 1.2 admits strong universal learning (in both settings). For instance, this implies that all stationary processes admit strong universal inductive and self-adaptive learning. However, as we also demonstrate in Section 3.1, there are many other types of processes, which do not have convergent relative frequencies, but which do satisfy Condition 1, and hence admit universal learning, so that Condition 1 represents a strictly more-general condition.

Other than the fact that Condition 1 precisely characterizes the families of processes that admit strong universal inductive or self-adaptive learning, another interesting fact established by Theorem 1.2 is that these two families are actually equivalent: that is, $\mathrm{SUIL} = \mathrm{SUAL}$. Interestingly, as alluded to above, we find that this equivalence does not extend to online learning. Specifically, in Section 6 we find that every process admitting strong universal self-adaptive learning also admits strong universal online learning, with the inclusion being strict if and only if $\mathcal{X}$ is infinite.

As for the problem of concisely characterizing the family of processes that admit strong universal online learning, again the present work only makes partial progress. Specifically, in Section 6, we formulate a concise necessary condition for a process to admit strong universal online learning (Condition 2 below), but we leave open the important question of whether this condition is also sufficient, or more broadly of identifying a concise condition on $\mathbb{X}$ equivalent to the condition that $\mathbb{X}$ admits strong universal online learning.

In addition to the questions of optimistically universal learning and concisely characterizing the family of processes admitting universal learning, another interesting question is whether it is possible to empirically test whether a given process $\mathbb{X}$ admits universal learning (of any of the three types). However, in Section 7 we find that in all three settings this is not the case. Specifically, in Theorem 7 we prove that (when $\mathcal{X}$ is infinite) there does not exist a consistent hypothesis test for whether a given $\mathbb{X}$ admits strong universal (inductive/self-adaptive/online) learning. Hence, the assumption that learning is possible truly is an assumption, rather than a testable hypothesis.

While all of the above results are established for bounded losses, Section 8 is devoted to the study of these same issues in the case of unbounded losses. In that case, the theory becomes significantly simplified, as universal consistency is much more difficult to achieve, and hence the family of processes that admit universal learning is severely restricted. We specifically find that, when the loss is unbounded, there exists an optimistically universal learning rule of all three types. We also identify a concise condition (Condition 3 below) that is necessary and sufficient for a process to admit strong universal learning in any/all of the three settings.

We discuss extensions of this theory in Section 9, discussing more-general loss functions, as well as relaxation of the requirement of strong consistency to mere weak consistency. Finally, we conclude the article in Section 10 by summarizing several interesting open questions that arise from the theory developed below.

## 2 Equivalent Expressions of Condition 1

Before getting into the analysis of learning, we first discuss basic properties of the $\hat{\mu}$ functional. In particular, we find that there are several equivalent ways to state Condition 1, which will be useful in various parts of the proofs below, and which may themselves be of independent interest in some cases.

### 2.1 Basic Lemmas

We begin by stating some basic properties of the $\hat{\mu}$ functional that will be indispensable in the proofs below.

For any sequence $x = \{x_t\}_{t=1}^{\infty}$ in $\mathcal{X}$, and any functions $f, g : \mathcal{X} \to [-\infty, \infty]$, if $\hat{\mu}_x(f)$ and $\hat{\mu}_x(g)$ are not both infinite and of opposite signs, then the following properties hold.

1. (monotonicity) If $f \le g$, then $\hat{\mu}_x(f) \le \hat{\mu}_x(g)$.
2. (homogeneity) $\forall c \in (0, \infty)$, $\hat{\mu}_x(cf) = c\,\hat{\mu}_x(f)$.
3. (subadditivity) $\hat{\mu}_x(f + g) \le \hat{\mu}_x(f) + \hat{\mu}_x(g)$.

Proof  Properties 1 and 2 follow directly from the definition of $\hat{\mu}_x$, and monotonicity and homogeneity (for positive constants) of the limit superior. Property 3 is established by noting

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \big(f(x_t) + g(x_t)\big) \le \lim_{k \to \infty} \left[ \left( \sup_{n \ge k} \frac{1}{n} \sum_{t=1}^{n} f(x_t) \right) + \left( \sup_{n \ge k} \frac{1}{n} \sum_{t=1}^{n} g(x_t) \right) \right] = \left( \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} f(x_t) \right) + \left( \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} g(x_t) \right).$$
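As a finite-horizon sanity check of these properties (illustrative only, with names of our choosing): at any fixed horizon $n$ the truncated averages are exactly additive and homogeneous, so the checks below hold with room to spare; subadditivity becomes a genuine inequality only after passing to the limit superior.

```python
# Truncated averages A_n(f) = (1/n) * sum_{t<=n} f(x_t) at one fixed horizon.
xs = [t % 7 for t in range(1, 5001)]
avg = lambda h: sum(h(x) for x in xs) / len(xs)

f = lambda x: float(x)          # f <= g pointwise, so monotonicity applies
g = lambda x: float(x) + 1.0

assert avg(f) <= avg(g)                                        # monotonicity
assert abs(avg(lambda x: 3.0 * f(x)) - 3.0 * avg(f)) < 1e-9    # homogeneity
assert avg(lambda x: f(x) + g(x)) <= avg(f) + avg(g) + 1e-9    # subadditivity
```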

These properties immediately imply related properties for the set function $A \mapsto \hat{\mu}_x(A)$.

For any sequence $x$ in $\mathcal{X}$, and any sets $A, B \in \mathcal{B}$,

1. (nonnegativity) $0 \le \hat{\mu}_x(A)$.
2. (monotonicity) $\hat{\mu}_x(A \cap B) \le \hat{\mu}_x(A)$.
3. (subadditivity) $\hat{\mu}_x(A \cup B) \le \hat{\mu}_x(A) + \hat{\mu}_x(B)$.

Proof  These follow directly from the properties listed in Lemma 2.1, since $\mathbb{1}_A \ge 0$, $\mathbb{1}_{A \cap B} \le \mathbb{1}_A$, and $\mathbb{1}_{A \cup B} \le \mathbb{1}_A + \mathbb{1}_B$.

### 2.2 An Equivalent Expression in Terms of Continuous Submeasures

Next, we note a connection to a much-studied definition from the measure theory literature: namely, the notion of a continuous submeasure. This notion appears in the measure theory literature, most commonly under the name Maharam submeasure (see e.g., Maharam, 1947; Talagrand, 2008; Bogachev, 2007), but is also referred to as a subadditive Dobrakov submeasure (see e.g., Dobrakov, 1974, 1984), and related notions arise in discussions of Choquet capacities (see e.g., Choquet, 1954; O’Brien and Vervaat, 1994).

A submeasure on $\mathcal{B}$ is a function $\psi : \mathcal{B} \to [0, \infty]$ satisfying the following properties.

• $\psi(\emptyset) = 0$.

• (monotonicity) $\forall A, B \in \mathcal{B}$ with $A \subseteq B$, $\psi(A) \le \psi(B)$.

• (subadditivity) $\forall A, B \in \mathcal{B}$, $\psi(A \cup B) \le \psi(A) + \psi(B)$.

A submeasure $\psi$ is called continuous if it additionally satisfies the condition

• For every monotone sequence $\{A_k\}_{k=1}^{\infty}$ in $\mathcal{B}$ with $A_k \downarrow \emptyset$, $\lim_{k \to \infty} \psi(A_k) = 0$.
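For orientation (a standard measure-theoretic fact, not taken from this article's text): every probability measure is a continuous submeasure, since the submeasure properties are immediate from additivity and monotonicity, and continuity follows from continuity from above.

```latex
% For a probability measure \mu on (\mathcal{X}, \mathcal{B}) and a monotone
% sequence A_1 \supseteq A_2 \supseteq \cdots with \bigcap_k A_k = \emptyset,
% continuity from above yields
\lim_{k \to \infty} \mu(A_k)
  = \mu\left( \bigcap_{k \ge 1} A_k \right)
  = \mu(\emptyset) = 0 .
```

Condition 1 asks for precisely this kind of continuity, but of the generally non-additive set function $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$.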

The relevance of this definition to our present discussion is via the set function $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$, which is always a submeasure, as stated in the following lemma.

For any process $\mathbb{X}$, the set function $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$ is a submeasure.

Proof  Since $\hat{\mu}_{X}(\emptyset) = 0$ follows directly from the definition of $\hat{\mu}_{X}$, we have $\mathbb{E}[\hat{\mu}_{X}(\emptyset)] = 0$ as well (property 1 of Definition 2.2). Furthermore, monotonicity of $\hat{\mu}_x$ (Lemma 2.1) and monotonicity of the expectation imply monotonicity of $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$ (property 2 of Definition 2.2). Likewise, finite subadditivity of $\hat{\mu}_x$ (Lemma 2.1) implies that for $A, B \in \mathcal{B}$, $\hat{\mu}_{X}(A \cup B) \le \hat{\mu}_{X}(A) + \hat{\mu}_{X}(B)$, so that monotonicity and linearity of the expectation imply $\mathbb{E}[\hat{\mu}_{X}(A \cup B)] \le \mathbb{E}[\hat{\mu}_{X}(A)] + \mathbb{E}[\hat{\mu}_{X}(B)]$ (property 3 of Definition 2.2).

Together with the definition of Condition 1, this immediately implies the following theorem, which states that Condition 1 is equivalent to $\mathbb{E}[\hat{\mu}_{X}(\cdot)]$ being a continuous submeasure.

A process $\mathbb{X}$ satisfies Condition 1 if and only if $A \mapsto \mathbb{E}[\hat{\mu}_{X}(A)]$ is a continuous submeasure.

### 2.3 Other Equivalent Expressions of Condition 1

We next state several other results expressing equivalent formulations of Condition 1, and other related properties. These equivalent forms will be useful in later proofs below.

The following conditions are all equivalent to Condition 1.

• For every monotone sequence $\{A_k\}_{k=1}^{\infty}$ of sets in $\mathcal{B}$ with $A_k \downarrow \emptyset$,

  $$\lim_{k \to \infty} \hat{\mu}_{X}(A_k) = 0 \quad \text{(a.s.)}.$$
• For every sequence $\{A_k\}_{k=1}^{\infty}$ of sets in $\mathcal{B}$,

  $$\lim_{i \to \infty} \hat{\mu}_{X}\Bigg(\bigcup_{k \ge i} A_k\Bigg) = \hat{\mu}_{X}\Big(\limsup_{k \to \infty} A_k\Big) \quad \text{(a.s.)}.$$
• For every disjoint sequence $\{A_k\}_{k=1}^{\infty}$ of sets in $\mathcal{B}$,
 limi→∞^μX