# Learning from dependent observations

In most papers establishing consistency for learning algorithms it is assumed that the observations used for training are realizations of an i.i.d. process. In this paper we go far beyond this classical framework by showing that support vector machines (SVMs) essentially only require that the data-generating process satisfies a certain law of large numbers. We then consider the learnability of SVMs for $\alpha$-mixing (not necessarily stationary) processes for both classification and regression, where for the latter we explicitly allow unbounded noise.

## 1 Introduction

In recent years Support Vector Machines (SVMs) have become one of the most widely used algorithms for classification and regression problems. Besides their good performance in practical applications they also enjoy a good theoretical justification in terms of both universal consistency (see [1, 2, 3, 4]) and learning rates (see [5, 6, 7, 8, 9]) if the training samples come from an i.i.d. process. However, often this i.i.d. assumption cannot be strictly justified in real-world problems. For example, many machine learning applications such as market prediction, system diagnosis, and speech recognition are inherently temporal in nature, and consequently their observations are not generated by i.i.d. processes. Moreover, samples are often gathered from different sources and hence it seems unlikely that they are identically distributed. Although SVMs have no theoretical justification in such non-i.i.d. scenarios they are often applied successfully. One of the goals of this work is to explain this success by establishing consistency results for SVMs under somewhat minimal assumptions on the data-generating process. Namely, we show that for any data-generating process that satisfies certain laws of large numbers there exists a sequence of regularization parameters such that the corresponding SVM is consistent. By general negative results (see [10]) on universal consistency for stationary ergodic processes this sequence of regularization parameters must depend on the stochastic properties of the data-generating process and cannot be adaptively chosen. However, we show that if the process satisfies certain mixing properties such as polynomially decaying $\alpha$-mixing coefficients (see the definitions in the following sections) then a suitable regularization sequence can be chosen a priori. In addition, a side effect of our analysis is that it provides consistency for SVMs using Gaussian kernels even if the common compactness assumption on the input space is violated. Consequently, our consistency results for $\alpha$-mixing processes generalize earlier consistency results of [1, 2, 3] with respect to both the compactness assumption on the input space and the i.i.d. assumption on the data-generating process.

Relaxations of the independence assumption have been considered for quite a while in both the machine learning and the statistical literature. For example, PAC-learning for stationary -mixing processes has been investigated in [11], and more recently, consistency of regularized boosting for classification was established for such processes. For a larger class of processes, namely -mixing but not necessarily stationary processes, consistency of kernel density estimators was shown in [12]. For bounded, stationary processes with exponentially decaying -mixing coefficients a consistent method for one-step-ahead prediction (also known as “static autoregressive forecasting”, see [13]) was presented in [14]. Moreover, for this prediction problem [15] establishes consistency for a certain structural risk minimization approach under the assumption that the process is stationary and has polynomially decaying -mixing rates. For further results and references we refer to [16, 17].

Relaxations of the stationarity of the process are less common. In fact, to the best of our knowledge [12] is the only work which deals with such processes. One of the reasons for this lack of literature may be the fact that for non-identically distributed observations there is no obvious way to define a reasonable risk functional which resembles the idea of “average future error”. On the other hand, it seems obvious that learning methods based on a modified empirical risk minimization procedure require at least that the process satisfies certain laws of large numbers. Interestingly, we will show that for processes satisfying such laws of large numbers there is always a “limit” distribution which can be used to define a reasonable risk functional. Moreover, for many interesting classes of processes the existence of such a limit distribution turns out to be equivalent to a law of large numbers.

The rest of this work is organized as follows: In Section 2 we will define the notions “laws of large numbers” and “limit” distributions for stochastic processes. We then discuss the relationship between these concepts and consider specific classes of stochastic processes that satisfy these definitions. We then recall some basic classes of loss functions and define consistency of learning algorithms for stochastic processes satisfying certain laws of large numbers. Finally, we show that SVMs can be made consistent for such processes. In Section 3 we then recall various mixing coefficients for stochastic processes. These coefficients are then used to establish consistency results for SVMs with an a priori chosen regularization sequence. Finally, the proofs of our results can be found in Section 4.

## 2 Consistency for Processes satisfying a Law of Large Numbers

The aim of this section is to show that SVMs can be made consistent whenever the data-generating process satisfies a certain type of law of large numbers (LLNs). To this end we first recall some notions for stochastic processes and introduce these laws of large numbers in Subsection 2.1. Some examples of processes satisfying LLNs are then presented in Subsection 2.2. In Subsection 2.3 we then recall some important notions for loss functions and risks. We also define consistency of learning algorithms for data-generating processes that satisfy a law of large numbers. Finally, we present and discuss our consistency results for SVMs in Subsection 2.4.

### 2.1 Law of Large Numbers for Stochastic Processes

In this subsection we mainly introduce laws of large numbers for general, not necessarily stationary stochastic processes. The concepts we will present seem to be quite natural and elementary, and therefore one would expect that they have already been introduced elsewhere. Surprisingly, however, we were not able to find any exposition that covers major parts of the material of this section, and thus we discuss the following notions in some detail.

Let us begin with some notation. Given a measurable space we write for the set of all measurable functions , and for the set of all bounded measurable functions . Moreover, for a set $B$ we write $\mathbf{1}_B$ for its indicator function, i.e. the $\{0,1\}$-valued function with $\mathbf{1}_B(z) = 1$ if and only if $z \in B$. Let us now assume that we also have a probability space and a measurable map . Then denotes the smallest $\sigma$-algebra on for which is measurable. Moreover, denotes the -image measure of , which is defined by , measurable.

Again, let be a probability space and be a measurable space. Recall that for a stochastic process , i.e. a sequence of measurable maps , , the map defined by is -measurable. Consequently, has an image measure which is given by for all .

Furthermore, recall that is called identically distributed if for all , and stationary in the wide sense if for all . Moreover, is said to be stationary if for all .

As we will see later we are not interested in the data-generating process itself, but only in processes of the form for measurable. In the following we call an image of the process , and itself a hidden process. The following definition introduces laws of large numbers for stochastic processes by considering real-valued image processes:

###### Definition 2.1

Let be a probability space, be a measurable space, and be a -valued stochastic process on . We say that satisfies the weak law of large numbers for events (WLLNE) if for all measurable there exists a constant such that for all we have

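$$\lim_{n\to\infty} \mu\Bigl(\Bigl\{\omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B\circ Z_i(\omega) - c_B\Bigr| > \varepsilon\Bigr\}\Bigr) = 0. \tag{1}$$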

Moreover, we say that satisfies the strong law of large numbers for events (SLLNE) if for all measurable there exists a constant with

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B\circ Z_i(\omega) = c_B \tag{2}$$

for $\mu$-almost all $\omega$.

It is obvious that satisfies the WLLNE if and only if the sequences converge in probability for all measurable . Consequently, the SLLNE implies the WLLNE but in general the converse implication does not hold. Moreover, if satisfies the WLLNE then the constants in (1) must obviously satisfy for all measurable . Finally, if satisfies the WLLNE or SLLNE then it is a trivial exercise to check that every image also satisfies the WLLNE or SLLNE, respectively.

It is well known that i.i.d. processes generated by satisfy the -SLLNE with for all measurable , but these processes are by far not the only ones (see Subsection 2.2 for some other examples). For the following development it is instructive to observe that for i.i.d. processes the map defines a probability measure on . Our next goal is to show that this remains true for general processes satisfying a WLLNE. To this end we first consider the averages of the probabilities of the event :

###### Definition 2.2

Let be a probability space, be a measurable space, and be a -valued stochastic process on . We say that is asymptotically mean stationary (AMS) if

$$P(B) := \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B\circ Z_i \tag{3}$$

exists for all measurable .

The notion “asymptotically mean stationary” was first introduced for dynamical systems by Gray and Kieffer in [18]. We are unaware of any work that introduces this notion for general stochastic processes, though a similar idea already appears as assumption (S1) in [12].
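As a rough numerical illustration of the Cesàro limit in (3), the following Python sketch averages the marginal probabilities $\mu(Z_i \in B)$ of a fixed event. It assumes these probabilities are known in closed form; both probability sequences (`alternating` and `doubling_blocks`) are hypothetical examples chosen for this sketch and do not come from the paper. The first sequence is not identically distributed but its averages settle, so a process with these marginals would be AMS for this event; the averages of the second keep oscillating, so the limit in (3) does not exist.

```python
def cesaro_mean(prob_of_event, n):
    """Return (1/n) * sum_{i=1}^n prob_of_event(i), the n-th average in (3)."""
    return sum(prob_of_event(i) for i in range(1, n + 1)) / n


def alternating(i):
    """Hypothetical marginals: mu(Z_i in B) is 0.2 for odd i and 0.8 for even i."""
    return 0.2 if i % 2 == 1 else 0.8


def doubling_blocks(i):
    """Hypothetical marginals: 1 on [2^k, 2^(k+1)) for even k, and 0 otherwise."""
    k = i.bit_length() - 1  # k = floor(log2(i)) for i >= 1
    return 1.0 if k % 2 == 0 else 0.0


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, cesaro_mean(alternating, n))      # averages settle near 0.5
    for k in (10, 11, 12, 13):
        n = 2 ** (k + 1) - 1
        print(n, cesaro_mean(doubling_blocks, n))  # averages keep oscillating
```

The check is only for a single event; the AMS property in Definition 2.2 asks for the limit to exist for every measurable event.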

Using the simple formula it is obvious that every image of an AMS process is again AMS. Moreover, identically distributed—and hence stationary—processes are obviously AMS. Moreover, for such processes we also have for all measurable , and consequently, (3) defines a probability measure on . The following lemma whose proof can be found in Section 4 shows that the latter observation remains true for general AMS processes.

###### Lemma 2.3

Let be a probability space, be a measurable space, and be a -valued stochastic process on which is AMS. Then defined by (3) is a probability measure on . We call the stationary mean of .

It is well known that not every stationary process satisfies a (weak or strong) law of large numbers for events. Consequently, we see that in general AMS processes do not satisfy a law of large numbers. However, the following theorem, proved in Section 4, shows that the converse implication is true. In addition, it shows that the constants in (1) define the stationary mean distribution:

###### Theorem 2.4

Let be a probability space, be a measurable space, and be a -valued stochastic process on satisfying the WLLNE. Then is AMS and the stationary mean of satisfies

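$$\lim_{n\to\infty} \mu\Bigl(\Bigl\{\omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B\circ Z_i(\omega) - P(B)\Bigr| > \varepsilon\Bigr\}\Bigr) = 0 \tag{4}$$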

for all measurable and all . Moreover, if satisfies the SLLNE then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B\circ Z_i(\omega) = P(B)$$

holds for $\mu$-almost all $\omega$.

Equation (4) shows that the stationary mean $P$ describes with high probability our average observations from the process. Given a loss function $L$ (see Subsection 2.3 for definitions) it seems therefore natural to approximate the empirical $L$-risk of a function $f$ by the corresponding $L$-risk defined by $P$. (Footnote 2: For i.i.d. observations one typically argues the other way around. However, for general stochastic processes the learning goal should be to minimize the future average loss. This loss is an empirical $L$-risk which can be approximated by the $L$-risk defined by $P$. In the training phase of empirical risk minimizers the latter $L$-risk is then approximated by the empirical $L$-risk of the already observed training samples. In this way $P$ and the corresponding convergence rates in (3) and (4) tell us how well we can generalize from the past to the future.) However, in order to make this ansatz rigorous we have to extend (4) to function classes larger than the set of indicator functions. We begin with the following result, which shows that a law of large numbers for events implies a corresponding law of large numbers for bounded functions:

###### Lemma 2.5

Let be a probability space, be a measurable space, and be a -valued stochastic process on satisfying the WLLNE. Furthermore, let be the asymptotic mean of . Then for all we have

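$$\mathbb{E}_P f = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} f\circ Z_i \tag{5}$$

in probability and

$$\mathbb{E}_P f = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu f\circ Z_i. \tag{6}$$

Moreover, if the process actually satisfies the SLLNE then the convergence in (5) holds $\mu$-almost surely.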

For classification problems we usually can restrict our considerations to bounded functions, and hence Lemma 2.5 is all that we need. However, for regression problems with unbounded noise we have to consider integrable functions, instead. The following definition serves this purpose:

###### Definition 2.6

Let be a probability space, be a measurable space, and be a -valued stochastic process on . Assume that is AMS and let be the asymptotic mean of . We say that satisfies the weak law of large numbers (WLLN) if for all and all we have

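$$\lim_{n\to\infty} \mu\Bigl(\Bigl\{\omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} f\circ Z_i(\omega) - \mathbb{E}_P f\Bigr| > \varepsilon\Bigr\}\Bigr) = 0. \tag{7}$$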

Moreover, we say that satisfies the strong law of large numbers (SLLN) if for all we have

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} f\circ Z_i(\omega) = \mathbb{E}_P f \tag{8}$$

for $\mu$-almost all $\omega$.

### 2.2 Examples of Processes Satisfying a Law of Large Numbers

In this subsection we recall several examples of stochastic processes satisfying a law of large numbers. In particular, we consider independent processes, dynamical systems, and Markov chains.

#### 2.2.1 Uncorrelated and independent processes

Recall that two real-valued random variables $\xi$ and $\eta$ are called uncorrelated if they satisfy $\mathbb{E}\xi\eta = \mathbb{E}\xi\,\mathbb{E}\eta$. The following proposition, proved in Section 4, shows that AMS, mutually uncorrelated processes satisfy a WLLNE:

###### Proposition 2.7

Let be a probability space, be a measurable space, and be a -valued stochastic process on . Assume that the random variables and are uncorrelated for all measurable and all with . Then the following statements are equivalent:

1. is AMS.

2. satisfies the WLLNE.

Considering the proof of the above proposition it is immediately clear that the proposition remains true if the process is not uncorrelated but only satisfies

 (9)

for all measurable . Processes satisfying such a weaker assumption are introduced and discussed in Subsection 3.1.

It is obvious that Proposition 2.7 holds for processes for which the image processes are independent. However, by applying [19, Theorem 2.7.1] we have the following stronger result:

###### Proposition 2.8

Let be a probability space, be a measurable space, and be a -valued stochastic process on . Assume that are independent for all fixed measurable . Then the following statements are equivalent:

1. is AMS.

2. satisfies the SLLNE.

Note that the independence assumption in Proposition 2.8 is weaker than assuming that the process is independent.
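The equivalence in Proposition 2.8 can be probed numerically. The sketch below is a minimal simulation under assumptions that are not in the paper: the process is an independent, non-identically distributed Bernoulli sequence with the hypothetical success probabilities `alternating(i)`, and the event is $B = \{1\}$. Since the Cesàro averages of these probabilities converge to 0.5, the empirical frequencies are expected to approach the same value.

```python
import random


def alternating(i):
    """Hypothetical marginals: P(Z_i = 1) is 0.2 for odd i and 0.8 for even i."""
    return 0.2 if i % 2 == 1 else 0.8


def sample_independent_bernoulli(n, success_prob, seed=0):
    """Draw independent Z_1, ..., Z_n with P(Z_i = 1) = success_prob(i)."""
    rng = random.Random(seed)
    return [1 if rng.random() < success_prob(i) else 0 for i in range(1, n + 1)]


def running_frequency(samples):
    """Return the sequence (1/n) * sum_{i<=n} 1_B(Z_i) for the event B = {1}."""
    freqs, total = [], 0
    for n, z in enumerate(samples, start=1):
        total += z
        freqs.append(total / n)
    return freqs


freqs = running_frequency(sample_independent_bernoulli(100_000, alternating))
print(freqs[99], freqs[9_999], freqs[-1])  # drifts towards 0.5
```

A single simulated path cannot prove almost sure convergence; it only shows the behaviour the proposition predicts.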

By Kolmogorov’s well-known strong law of large numbers it is obvious that every process whose -valued images are i.i.d. processes satisfies a SLLN. Moreover, a result by Etemadi [20] shows that the independence assumption can be relaxed to pairwise independence. Finally, the following result whose proof can again be found in Section 4 generalizes Kolmogorov’s law of large numbers to a certain type of martingale:

###### Proposition 2.9

Let be a probability space, be a measurable space, and be a -valued stochastic process on . Assume that for all and , , we have and

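$$\mathbb{E}\Bigl(\frac{1}{n}\sum_{i=1}^{n} f\circ Z_i \,\Big|\, \mathcal{F}_{n+1}\Bigr) = \frac{1}{n+1}\sum_{i=1}^{n+1} f\circ Z_i. \tag{10}$$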

Then satisfies the SLLN and is the asymptotic mean of .

#### 2.2.2 Ergodic processes

In this section we recall the basic notions and results for dynamical systems. To this end let be a measurable space and be the shift operator defined by . A set is called invariant if . Moreover, let be a probability space and be a -valued stochastic process on . Then is called ergodic if we have for all measurable invariant subsets . It is not hard to see that every image of an ergodic process is again an ergodic process.

In the following we are mainly interested in stationary ergodic processes. To this end let us now assume that is a probability space and is a measurable map. Then the stochastic process is called a dynamical system, and it is called an invariant dynamical system if the -image of satisfies . Recall that an invariant dynamical system on a probability space is ergodic if and only if satisfies for all measurable with . Moreover, recall that every stationary process is the image of a hidden invariant dynamical system. Conversely, every invariant dynamical system is stationary and hence AMS. In addition, recall Birkhoff’s theorem (see e.g. [21, p. 82ff]):

###### Theorem 2.10

Let be an invariant dynamical system on a probability space . Then the following statements are equivalent:

1. satisfies the SLLNE.

2. satisfies the SLLN.

3. is ergodic.

With the help of the above theorem one can show (see e.g. [22, p. 26f]) that every stationary ergodic process satisfies the SLLN. Moreover, by a theorem by Gray and Kieffer (see e.g. [22, p. 33]) we know that a dynamical system is AMS if and only if exists -almost surely for all . Note that Birkhoff’s theorem shows that the corresponding limit is a constant function if and only if the dynamical system is ergodic. Finally, it is interesting to note that for stationary, ergodic processes the limit relation (9) holds (see e.g. [23, Thm. 2.19, p. 61]).
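Birkhoff averages are easy to compute for a concrete dynamical system. The sketch below uses the rotation $T(x) = x + \theta \bmod 1$ on $[0,1)$ with an irrational angle, a standard example of an invariant ergodic system with respect to Lebesgue measure; this example and the names `GOLDEN_ANGLE`, `birkhoff_average` and `in_interval` are assumptions of the sketch, not taken from the paper. The time average of the indicator of $[0, 0.3)$ should approach the measure of that interval, 0.3, from any starting point.

```python
import math

# Irrational rotation angle (represented only approximately in floating point).
GOLDEN_ANGLE = (math.sqrt(5.0) - 1.0) / 2.0


def birkhoff_average(indicator, x0, n, angle=GOLDEN_ANGLE):
    """Average of indicator(T^i(x0)) for i = 0..n-1 with T(x) = x + angle mod 1."""
    x, total = x0, 0.0
    for _ in range(n):
        total += indicator(x)
        x = (x + angle) % 1.0
    return total / n


def in_interval(x):
    """Indicator function of the event B = [0, 0.3)."""
    return 1.0 if 0.0 <= x < 0.3 else 0.0


if __name__ == "__main__":
    for start in (0.1, 0.77):
        print(start, birkhoff_average(in_interval, start, 100_000))
```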

Let us now recall a notion related to ergodicity. To this end let be a probability space and be an invariant dynamical system on . Then is said to be weakly mixing if

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \bigl|\mu(T^{-i}(A)\cap B) - \mu(A)\mu(B)\bigr| = 0, \qquad A, B \in \mathcal{B}.$$

It is well known that weak mixing implies ergodicity, and that the converse implication does not hold in general (see e.g. [24, p. 41ff]). Moreover, one can also introduce mixing conditions for general stationary ergodic processes. For example, if is a probability space, is a measurable space, and is a -valued stochastic process on , then is called mixing if

$$\lim_{n\to\infty} \mu_Z(S^{-n}(A)\cap B) = \mu_Z(A)\,\mu_Z(B) \tag{11}$$

holds for all measurable . One can show (see e.g. [23, Prop. 2.8, p. 50]) that for invariant dynamical systems this definition coincides with the above mixing definition. Moreover, recall that i.i.d. processes are invariant and weakly mixing (see [24, p. 58]).

The weak mixing is important because it allows us to establish the ergodicity of products of dynamical systems. This leads to our last example:

###### Proposition 2.11

Let be a probability measure on and be an invariant ergodic dynamical system on . Furthermore, let be a probability space and be an i.i.d. sequence of random variables . Then the process defined on satisfies the SLLN.

#### 2.2.3 Markov chains

In this subsection we briefly discuss a law of large numbers for Markov chains. To this end let us fix a probability space . Furthermore, let be a stochastic transition function, i.e. a Markov kernel. Let us define a probability measure on by

$$P(B_1\times\dots\times B_n) := \int \mathbf{1}_{B_1\times\dots\times B_n}(z_1,\dots,z_n)\, p(dz_n, z_{n-1}) \cdots p(dz_2, z_1)\, \nu(dz_1), \tag{12}$$

where runs over all integers and run over all measurable subsets of . A -valued stochastic process defined on a probability space is called a homogeneous Markov chain with transition function and initial distribution if it satisfies , where is determined by (12). (Footnote 3: Since we only deal with homogeneous Markov chains we often omit the adjective “homogeneous”.) Obviously, the sequence of coordinate projections , is a canonical model of such a Markov chain if is equipped with the distribution . Moreover, if the homogeneous Markov chain is stationary then satisfies for all .

The transition function describes the probability of given the state of the process at time . For larger steps ahead one can iteratively compute the corresponding transition probabilities by

$$\begin{aligned} p^{(1)}(B,z) &= p(B,z), \\ p^{(n+1)}(B,z) &= \int p^{(n)}(B,z')\, p(dz',z). \end{aligned}$$
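For a finite state space the recursion for the $n$-step transition probabilities reduces to repeated matrix-vector products. The following sketch assumes states numbered $0, \dots, m-1$ and a row-stochastic matrix `M` with `M[z][zp]` equal to $p(\{z'\}, z)$; the matrix values and the function name `n_step_probability` are hypothetical and serve only as illustration.

```python
def n_step_probability(M, event, n):
    """Return the list [p^(n)(B, z) for every state z] for the event B.

    M[z][zp] is the one-step probability of moving from state z to state zp.
    """
    size = len(M)
    # n = 1: p^(1)(B, z) = p(B, z) = sum of M[z][zp] over zp in B.
    probs = [sum(M[z][zp] for zp in event) for z in range(size)]
    # p^(n+1)(B, z) = sum over zp of p^(n)(B, zp) * p({zp}, z).
    for _ in range(n - 1):
        probs = [sum(M[z][zp] * probs[zp] for zp in range(size)) for z in range(size)]
    return probs


M = [[0.9, 0.1],
     [0.3, 0.7]]  # hypothetical two-state transition matrix

if __name__ == "__main__":
    for n in (1, 2, 10, 50):
        print(n, n_step_probability(M, {1}, n))
```

For this matrix the $n$-step probabilities of the event $\{1\}$ approach the same value from both starting states, which is the stationary probability of state 1.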

Let us now assume that there exists a finite measure on with , an integer , and a real number such that for all measurable we have

$$Q(B)\le\varepsilon \;\Longrightarrow\; p^{(n)}(B,z) \le 1-\varepsilon \quad \text{for all } z\in Z. \tag{13}$$

This assumption taken from [25, p. 192] is often called the “Doeblin condition” (see e.g. [25, p. 197] or [26, p. 156]). If is a finite set, then (13) is automatically satisfied (see e.g. [25, p. 192]). Moreover, if is a set of finite Lebesgue measure and the distributions , are absolutely continuous with uniformly bounded transition densities then (13) also holds (see e.g. [25, p. 193]). Finally, for some similar conditions we refer to [26] and the references therein.

Now, the following theorem which can be found in [25, p. 219] gives a simple condition ensuring a SLLN for Markov chains:

###### Theorem 2.12

Let be a probability space, be a stochastic transition function and be a stationary homogeneous Markov chain with transition function and initial distribution . If satisfies (13) then satisfies the SLLN.

The above theorem can be generalized to non-homogeneous, not identically distributed Markov chains. Since these generalizations are beyond the scope of this paper we refer to [19, pp. 129-135] for details. Finally, we would also like to mention without explaining the details that if is a countable set then an irreducible, positive recurrent, homogeneous Markov chain satisfies the SLLNE (see e.g. [27, Thm. 1.10.2]).
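The law of large numbers for Markov chains can be observed in a small simulation. The sketch below assumes a two-state chain with hypothetical switching probabilities `p01` and `p10`, started from its stationary distribution so that the chain is stationary and homogeneous; the function names are illustrative only. The fraction of time spent in state 1 should approach the stationary probability $p_{01}/(p_{01}+p_{10})$.

```python
import random


def simulate_stationary_chain(n, p01, p10, seed=0):
    """Simulate n steps of a {0, 1}-valued homogeneous Markov chain.

    p01 = P(next = 1 | current = 0) and p10 = P(next = 0 | current = 1).
    The initial state is drawn from the stationary distribution.
    """
    rng = random.Random(seed)
    stationary_prob_1 = p01 / (p01 + p10)
    state = 1 if rng.random() < stationary_prob_1 else 0
    path = []
    for _ in range(n):
        path.append(state)
        u = rng.random()
        if state == 0:
            state = 1 if u < p01 else 0
        else:
            state = 0 if u < p10 else 1
    return path


def empirical_frequency(path, event):
    """(1/n) * sum_i 1_B(Z_i) for an event B given as a set of states."""
    return sum(1 for z in path if z in event) / len(path)


path = simulate_stationary_chain(200_000, p01=0.1, p10=0.3)
freq_state_1 = empirical_frequency(path, {1})
print(freq_state_1, 0.1 / (0.1 + 0.3))  # empirical frequency vs. stationary value
```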

### 2.3 Loss functions, Risks, and Consistency

In this section we recall some basic notions for loss functions and their associated risks. We then introduce consistency notions for learning algorithms for stochastic processes satisfying a law of large numbers.

In the following is always a measurable space if not mentioned otherwise and is always a closed subset. Moreover, metric spaces are always equipped with the Borel -algebra, and products of measurable spaces are always equipped with the corresponding product -algebra. Finally, stands for the standard space of -integrable functions with respect to the measure on .

###### Definition 2.13

A function is called a loss function if it is measurable. In this case is called:

1. convex if is convex for all , .

2. continuous if is continuous for all , .

Moreover, for a probability measure on and an the -risk of is defined by

$$\mathcal{R}_{L,P}(f) := \int_{X\times Y} L(x,y,f(x))\,dP(x,y) = \int_X\int_Y L(x,y,f(x))\,dP(y|x)\,dP_X(x).$$

Finally, the Bayes -risk is .
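When $P$ has finite support the risk integral becomes a finite sum, which makes the definitions easy to check by hand. The sketch below assumes a hypothetical distribution `P` on four pairs $(x, y)$ with labels in $\{-1, 1\}$ and uses the standard 0-1 classification loss with the convention that $t \ge 0$ predicts the label $+1$; this sign convention, the data, and all function names are assumptions of the sketch.

```python
def risk(P, loss, f):
    """R_{L,P}(f) = sum over (x, y) of P(x, y) * L(x, y, f(x)) for finite P."""
    return sum(prob * loss(x, y, f(x)) for (x, y), prob in P.items())


def classification_loss(x, y, t):
    """0-1 loss: 1 if the sign of t (with t >= 0 read as +1) differs from y."""
    predicted = 1 if t >= 0 else -1
    return 0.0 if predicted == y else 1.0


def bayes_classification_risk(P):
    """Smallest achievable 0-1 risk: sum over x of min(P(x, 1), P(x, -1))."""
    inputs = {x for (x, _) in P}
    return sum(min(P.get((x, 1), 0.0), P.get((x, -1), 0.0)) for x in inputs)


# Hypothetical joint distribution on {"a", "b"} x {-1, 1}.
P = {("a", 1): 0.4, ("a", -1): 0.1, ("b", 1): 0.2, ("b", -1): 0.3}


def always_positive(x):
    return 1.0


def bayes_rule(x):
    return 1.0 if x == "a" else -1.0


print(risk(P, classification_loss, always_positive))  # risk of a constant predictor
print(risk(P, classification_loss, bayes_rule))       # attains the Bayes risk
print(bayes_classification_risk(P))
```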

Note that the integral defining the -risk always exists since is non-negative and measurable. In addition it is obvious that the risk of a convex loss is convex on . However, in general the risk of a continuous loss is not continuous. In order to ensure this continuity and several other, more sophisticated properties we need the following definition:

###### Definition 2.14

We call a loss function a Nemitski loss function if there exist a measurable function and an increasing function with

$$L(x,y,t)\le b(x,y)+h(|t|), \qquad (x,y,t)\in X\times Y\times\mathbb{R}. \tag{14}$$

Furthermore, we say that is a Nemitski loss of order , if there exists a constant with for all . Finally, if is a distribution on with we say that is a -integrable Nemitski loss.

Note that -integrable Nemitski loss functions satisfy for all , and consequently we also have and .

For our further investigations we also need the following additional properties which are satisfied by basically all commonly used loss functions:

###### Definition 2.15

Let be a loss function. We say that is:

1. locally bounded if for all bounded the restriction of is a bounded function.

2. locally Lipschitz continuous if for all we have

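$$|L|_{a,1} := \sup_{\substack{t,t'\in[-a,a]\\ t\neq t'}}\ \sup_{\substack{x\in X\\ y\in Y}} \frac{|L(x,y,t)-L(x,y,t')|}{|t-t'|} < \infty. \tag{15}$$

3. Lipschitz continuous if we have $\sup_{a>0} |L|_{a,1} < \infty$.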

Note that if is a finite subset and is a convex loss function then is a locally Lipschitz continuous loss function. Moreover, a locally Lipschitz continuous loss function is a Nemitski loss since (15) yields

$$L(x,y,t) \le L(x,y,0) + |L|_{|t|,1}\,|t|, \qquad (x,y,t)\in X\times Y\times\mathbb{R}. \tag{16}$$

In particular, a locally Lipschitz continuous loss is a -integrable Nemitski loss if and only if . Moreover, if is Lipschitz continuous then is a Nemitski loss of order .

The following examples recall that (locally) Lipschitz continuous losses are often used in learning algorithms for classification and regression problems:

###### Example 2.16

A loss of the form $L(y,t) = \varphi(yt)$ for a suitable function $\varphi$ and all $y \in \{-1,1\}$ and $t \in \mathbb{R}$ is called margin-based. Recall that margin-based losses such as the (squared) hinge loss, the AdaBoost loss, the logistic loss and the least squares loss are used in many classification algorithms. Obviously, $L$ is convex, continuous, or (locally) Lipschitz continuous if and only if $\varphi$ is. In addition, convexity of $L$ implies local Lipschitz continuity of $L$. Moreover, $L$ is always a $P$-integrable Nemitski loss since we have

$$L(y,t) \le \max\{\varphi(-t), \varphi(t)\} \tag{17}$$

for all $y \in \{-1,1\}$ and all $t \in \mathbb{R}$. In particular, this estimate shows that every convex margin-based loss is locally bounded. Moreover, from (17) we can easily derive a characterization for $L$ being a $P$-integrable Nemitski loss of order $p$.
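The margin-based losses named in this example can be written down directly through their representing functions. The sketch below uses the usual textbook forms of these functions, assuming labels in $\{-1, 1\}$; the dictionary `MARGIN_LOSSES` and the helper names are illustrative and not part of the paper. It also checks the bound (17) on a finite grid, which is a sanity check rather than a proof.

```python
import math

# Representing functions phi with L(y, t) = phi(y * t); usual textbook forms.
MARGIN_LOSSES = {
    "hinge": lambda t: max(0.0, 1.0 - t),
    "squared_hinge": lambda t: max(0.0, 1.0 - t) ** 2,
    "logistic": lambda t: math.log(1.0 + math.exp(-t)),
    "least_squares": lambda t: (1.0 - t) ** 2,
    "adaboost": lambda t: math.exp(-t),
}


def margin_loss(phi, y, t):
    """Evaluate the margin-based loss L(y, t) = phi(y * t) for y in {-1, 1}."""
    return phi(y * t)


def satisfies_bound_17(phi, ts):
    """Check L(y, t) <= max(phi(-t), phi(t)) for y in {-1, 1} and all t in ts."""
    return all(
        margin_loss(phi, y, t) <= max(phi(-t), phi(t))
        for y in (-1, 1)
        for t in ts
    )


grid = [k / 10.0 for k in range(-50, 51)]
for name, phi in MARGIN_LOSSES.items():
    print(name, margin_loss(phi, 1, 0.5), satisfies_bound_17(phi, grid))
```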

###### Example 2.17

A loss of the form $L(y,t) = \psi(y-t)$ for a suitable function $\psi$ and all $y, t \in \mathbb{R}$ is called distance-based. Distance-based losses such as the least squares loss, Huber’s insensitive loss, the logistic loss, or the $\varepsilon$-insensitive loss are usually used for regression. It is easy to see that $L$ is convex, continuous, or Lipschitz continuous if and only if $\psi$ is. Let us say that $L$ is of upper growth type $p$ if there is a $c > 0$ with

$$\psi(r) \le c\,(|r|^p + 1), \qquad r\in\mathbb{R}.$$

Analogously, $L$ is said to be of lower growth type $p$ if there is a $c > 0$ with

$$\psi(r) \ge c\,(|r|^p - 1), \qquad r\in\mathbb{R}.$$

Recall that most of the commonly used distance-based loss functions including the above examples are of the same upper and lower growth type. Then it is obvious that $L$ is of upper growth type 1 if it is Lipschitz continuous, and if $L$ is convex the converse implication also holds. Moreover, non-trivial convex are always of lower growth type . In addition, a distance-based loss function of upper growth type $p$ is a Nemitski loss of order $p$, and if the distribution $P$ satisfies the moment condition

$$|P|_p := \bigl(\mathbb{E}_{(x,y)\sim P}|y|^p\bigr)^{1/p} := \Bigl(\int_{X\times\mathbb{R}} |y|^p\,dP(x,y)\Bigr)^{1/p} < \infty \tag{18}$$

it is also $P$-integrable.
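The growth conditions can be explored numerically for a few common representing functions $\psi$. The sketch below uses the usual textbook forms of the least squares, $\varepsilon$-insensitive and Huber functions; the parameter values `eps=0.1` and `delta=1.0`, the grid, and the constants passed to `has_upper_growth` are assumptions chosen for illustration. A check on a finite grid can refute a growth bound but cannot prove it.

```python
def least_squares(r):
    return r * r


def eps_insensitive(r, eps=0.1):
    return max(0.0, abs(r) - eps)


def huber(r, delta=1.0):
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def has_upper_growth(psi, p, c, grid):
    """Check psi(r) <= c * (|r|^p + 1) for every r in the finite grid."""
    return all(psi(r) <= c * (abs(r) ** p + 1.0) for r in grid)


def empirical_moment(ys, p):
    """Sample analogue of |P|_p in (18): ((1/n) * sum_i |y_i|^p)^(1/p)."""
    return (sum(abs(y) ** p for y in ys) / len(ys)) ** (1.0 / p)


grid = [k / 2.0 for k in range(-100, 101)]  # r in [-50, 50]
print(has_upper_growth(least_squares, 2, 1.0, grid))    # quadratic growth holds
print(has_upper_growth(least_squares, 1, 1.0, grid))    # linear bound fails for c = 1
print(has_upper_growth(eps_insensitive, 1, 1.0, grid))  # linear growth holds
print(has_upper_growth(huber, 1, 1.0, grid))            # linear growth holds
print(empirical_moment([3.0, -4.0], 2))
```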

If our observations are realizations of a sequence of random variables satisfying a law of large numbers then the following lemma proved in Section 4 shows that the risk with respect to the asymptotic mean distribution actually describes the average future loss.

###### Lemma 2.18

Let be a probability space, be a measurable space, be a closed subset, and be a -valued stochastic process on satisfying the WLLNE. Furthermore, let be the asymptotic mean of and be a loss function. If is locally bounded then for all and all we have

$$\mathcal{R}_{L,P}(f) = \lim_{n\to\infty} \frac{1}{n-n_0}\sum_{i=n_0+1}^{n} L(X_i,Y_i,f(X_i)), \tag{19}$$

where the limit is with respect to the convergence in probability . Moreover, if actually satisfies the SLLNE then (19) holds -almost surely. Finally, the same conclusions hold if is a -integrable Nemitski loss and satisfies the WLLN or SLLN.

With the help of the above lemma we can now introduce some reasonable concepts describing the asymptotic learning ability of learning algorithms. To this end recall that a method that provides to every training set a (measurable) function is called a learning method. The following definition introduces an asymptotic way to describe whether a learning method can learn from samples:

###### Definition 2.19

Let be a probability space, be a measurable space, be a closed subset, and be a -valued stochastic process on satisfying the WLLNE. Furthermore, let be the asymptotic mean of and be a loss function. We say that a learning method is -consistent for if

$$\lim_{n\to\infty} \mathcal{R}_{L,P}(f_{T_n}) = \mathcal{R}^*_{L,P} \tag{20}$$

holds in probability , where and is the Bayes risk defined in Definition 2.13. Moreover, we say that is strongly -consistent for if (20) holds -almost surely.

### 2.4 Consistency of SVMs

In this subsection we present some results showing that support vector machines (SVMs) can learn whenever the data-generating process satisfies a law of large numbers.

Let us begin by recalling the definition of SVMs. To this end let be a convex loss function and be a reproducing kernel Hilbert space (RKHS) over (see e.g. [28]). Then for all and all observations there exists exactly one element with

$$f_{T,\lambda} \in \arg\min_{f\in H}\ \lambda\|f\|_H^2 + \frac{1}{n}\sum_{i=1}^{n} L(x_i,y_i,f(x_i)). \tag{21}$$

Given a null-sequence of strictly positive real numbers we call the learning method which provides to every training set the decision function an -SVM based on and . For more information on SVMs we refer to [29, 30].
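For one particular convex loss the minimizer in (21) has a closed form, which gives a compact way to experiment with the definition. The sketch below is a minimal illustration under several assumptions that are not from the paper: scalar inputs, the least squares loss $L(x,y,t) = (y-t)^2$, a Gaussian kernel of the form $\exp(-\sigma^2 (x-x')^2)$, and hypothetical data. By the representer theorem the minimizer can be written as $f = \sum_j \alpha_j k(x_j, \cdot)$, and for this loss setting the gradient of the objective to zero gives $(K + n\lambda I)\alpha = y$ with Gram matrix $K$. Other convex losses such as the hinge loss need an iterative solver instead.

```python
import math


def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-sigma^2 * (x - x')^2) for scalar inputs."""
    return math.exp(-(sigma ** 2) * (x - xp) ** 2)


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def objective(alpha, K, ys, lam):
    """lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2 for f = sum_j alpha_j k(x_j, .)."""
    n = len(ys)
    preds = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    norm_sq = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
    return lam * norm_sq + sum((y - p) ** 2 for y, p in zip(ys, preds)) / n


def fit_least_squares_svm(xs, ys, lam, sigma=1.0):
    """Return the coefficients alpha, the Gram matrix K, and the decision function."""
    n = len(xs)
    K = [[gaussian_kernel(a, b, sigma) for b in xs] for a in xs]
    A = [[K[i][j] + (n * lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve_linear_system(A, ys)

    def decision_function(x):
        return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, xs))

    return alpha, K, decision_function


# Hypothetical training set T = ((x_1, y_1), ..., (x_n, y_n)).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 0.9, 0.4, 0.0]
alpha, K, f = fit_least_squares_svm(xs, ys, lam=0.1)
print(objective(alpha, K, ys, 0.1), f(0.75))
```

Decreasing `lam` along a null-sequence as the sample grows is what turns this single fit into the learning method described in the text; the sketch fixes one value only.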

Moreover, given a distribution on we say that the RKHS is -rich if we have

$$\mathcal{R}^*_{L,P,H} := \inf_{f\in H} \mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P},$$

i.e. if the Bayes risk can be approximated by functions from . Note that the condition is satisfied (see [31]) whenever, the kernel of is universal in the sense of [32], i.e.  is a compact metric space and is dense in the space of continuous functions. Less restrictive assumptions on and have been recently found in [31]. In particular, it was shown in [31] that the RKHSs , , of the Gaussian RBF kernels

$$k_\sigma(x,x') := \exp\bigl(-\sigma^2\|x-x'\|_2^2\bigr), \qquad x,x'\in\mathbb{R}^d$$

are -rich for all distributions on and all continuous, -integrable Nemitski losses of order . Finally, one can also find some necessary and sufficient conditions for -richness on countable spaces in [31].

In order to present our first main result let us recall that a Polish space is a separable topological space whose topology can be described by a complete metric. It is well known that, e.g., closed and open subsets of $\mathbb{R}^d$ and compact metric spaces are Polish. Now our first theorem shows that for every process satisfying a law of large numbers for events there exists an SVM which is consistent for this process:

###### Theorem 2.20

Let be a Polish space, be a closed subset and be a convex, locally Lipschitz continuous, and locally bounded loss function. Moreover, let be a probability space, be an -valued stochastic process on satisfying the WLLNE, and be the asymptotic mean of . Finally, let be an -rich RKHS over with continuous kernel. Then there exists a null-sequence of strictly positive real numbers such that the -SVM based on and is -consistent for .
In addition, if satisfies the SLLNE then can be chosen such that the -SVM is strongly -consistent for .

The next theorem establishes a similar result for distance-based loss functions (see Example 2.17) which, in general, are not locally bounded.

###### Theorem 2.21

Let be a Polish space, be a closed subset and be a convex, distance-based loss function of upper growth-type . Moreover, let be a probability space, be an -valued stochastic process on satisfying the WLLN, and be the asymptotic mean of . We assume . Finally, let be the -rich RKHS of a continuous kernel on . Then there exists a null-sequence of strictly positive real numbers such that the -SVM based on and is -consistent for .
In addition, if satisfies the SLLN then can be chosen such that the -SVM is strongly -consistent for .

The techniques used in the proofs of Theorems 2.20 and 2.21 are based on a (hidden) skeleton argument in the proof of Lemma 4.5. A more general though standard skeleton argument can be used to derive results similar to Theorems 2.20 and 2.21 for other empirical risk minimization methods using hypothesis sets with reasonably controllable complexity. Due to space constraints we omit the details.
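To make the role of the regularization sequence in the two theorems concrete, the following sketch fits an SVM with the least squares loss and a Gaussian RBF kernel to dependent data. It rests on several assumptions that are not taken from the theorems: the SVM is taken to minimize $\lambda\|f\|_H^2$ plus the empirical risk (the usual form, for which the representer theorem gives the closed-form coefficients $(K+n\lambda I)^{-1}y$), the data come from a hypothetical AR(1) process, and $\lambda_n = n^{-1/2}$ is an arbitrary illustrative choice. The theorems only assert that some suitable null-sequence exists for a given process, so this choice carries no guarantee.

```python
import numpy as np

def gaussian_gram(X1, X2, sigma):
    """Gram matrix of k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2)."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-(sigma ** 2) * sq)

def fit_ls_svm(X, y, lam, sigma):
    """Coefficients of the minimizer of lam * ||f||_H^2 + mean((y_i - f(x_i))^2)."""
    n = len(y)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, coef, X_new, sigma):
    """Evaluate f = sum_i coef_i * k(x_i, .) at the rows of X_new."""
    return gaussian_gram(X_new, X_train, sigma) @ coef

def objective(X, y, coef, lam, sigma):
    """Regularized empirical least squares risk of the function given by coef."""
    K = gaussian_gram(X, X, sigma)
    return lam * coef @ K @ coef + np.mean((y - K @ coef) ** 2)

# Hypothetical dependent training data: AR(1) inputs with a noisy sine response.
rng = np.random.default_rng(1)
n = 200
x = np.empty(n)
x[0] = rng.standard_normal()
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + 0.6 * rng.standard_normal()
X = x[:, None]
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n)

lam_n = n ** -0.5  # illustrative regularization value, not prescribed by the theorems
coef = fit_ls_svm(X, y, lam_n, sigma=1.0)
```

Calling `predict(X, coef, X_new, sigma=1.0)` evaluates the fitted function at new inputs. Very large values of `lam` shrink the fitted function towards zero, which is why the regularization has to vanish as the sample size grows.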

Let us now assume for a moment that is a subset of , is a loss function in the sense of either Theorem 2.20 or 2.21, and is the RKHS of a Gaussian RBF kernel. Then the above theorems together with the richness results from [31] show that for all data-generating processes satisfying a law of large numbers there exist suitable regularization sequences that allow us to build a consistent SVM. However, the sequences of Theorem 2.20 or 2.21 depend on , and consequently, it would be desirable to have either a universal sequence , i.e. a sequence that guarantees consistency for all , or a consistent method that finds suitable values for from the observations. Unfortunately, the following theorem due to Nobel [10], together with Birkhoff’s ergodic theorem, shows that neither of these alternatives is possible. Recall here that binary classification is the “easiest” non-parametric learning problem in the sense that negative results for this learning problem can typically be translated into negative results for almost all learning problems defined by loss functions (cf. p. 118f in [33] for some examples in this direction and the proof of the theorem below in [10] for the least squares loss).

###### Theorem 2.22

There is no learning method which is -consistent for all stationary ergodic processes with values in , where denotes the usual least squares loss , . Moreover, there is no learning method which is -consistent for all stationary ergodic processes with values in , where denotes the classification loss , , .

Roughly speaking, the impossibility of finding a universal sequence is related to the fact that there is no uniform convergence speed in the LLNs for general processes. More precisely, if is a stochastic process which satisfies a law of large numbers then for all , , and all suitable functions there exists a with

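$$\mu\Bigl(\Bigl\{\omega\in\Omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} f\circ(X_i,Y_i)(\omega) - \mathbb{E}_P f\Bigr| > \varepsilon\Bigr\}\Bigr) \;\le\; \delta(\varepsilon,f,n) \tag{22}$$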

and . Now, the proofs of Theorem 2.20 and Theorem 2.21 (essentially) show that we can determine a sequence whenever we know such for all , , and a suitably large class of functions . However, since there exists no universal sequence by Theorem 2.22 we consequently see that there exist no values such that (22) holds for all (stationary) processes satisfying a law of large numbers.
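The dependence of the deviation probability on the process can be illustrated by simulation. The sketch below is only a Monte Carlo estimate of the left-hand side of (22), not a bound, and everything in it is an assumption made for illustration: the process is a stationary Gaussian AR(1) sequence with coefficient `a`, there is no $Y$-component, and $f$ is the indicator of the positive half-line, so that $\mathbb{E}_P f = 1/2$.

```python
import numpy as np

def simulate_ar1(n, a, n_paths, rng):
    """n_paths independent stationary AR(1) paths X_{i+1} = a*X_i + sqrt(1-a^2)*xi_i."""
    x = np.empty((n_paths, n))
    x[:, 0] = rng.standard_normal(n_paths)  # start in the stationary N(0, 1) law
    noise = np.sqrt(1.0 - a ** 2) * rng.standard_normal((n_paths, n))
    for i in range(1, n):
        x[:, i] = a * x[:, i - 1] + noise[:, i]
    return x

def empirical_delta(eps, f, mean_f, n, a, n_paths=2000, seed=0):
    """Fraction of simulated paths whose empirical mean of f deviates from mean_f by more than eps."""
    rng = np.random.default_rng(seed)
    paths = simulate_ar1(n, a, n_paths, rng)
    deviations = np.abs(f(paths).mean(axis=1) - mean_f)
    return float(np.mean(deviations > eps))

def indicator_positive(x):
    """f(x) = 1 if x > 0 else 0; its mean under N(0, 1) is 1/2."""
    return (x > 0).astype(float)

for a in (0.0, 0.9):
    for n in (25, 100, 400):
        d = empirical_delta(0.1, indicator_positive, 0.5, n, a)
        print(f"a={a:.1f}  n={n:4d}  estimated deviation probability={d:.3f}")
```

For fixed $\varepsilon$ and $n$ the estimated probability is much larger for the strongly dependent choice `a = 0.9` than for the independent case `a = 0.0`, which is the kind of process-dependent convergence speed the discussion above refers to.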

This discussion shows that in order to build consistent SVMs for interesting classes of processes one has to find quantitative versions of laws of large numbers. For i.i.d. processes such laws have been established in recent years by several authors. In the following section we will present a simple yet powerful method for establishing quantitative versions of laws of large numbers for mixing processes.

## 3 Consistency for Mixing Processes

In this section we derive consistency results for SVMs under the assumption that the data-generating process satisfies certain mixing conditions. These mixing conditions generally quantify how much a process fails to be independent. In the first subsection we recall some commonly used mixing conditions. In the second subsection we then present our consistency results and compare them with known consistency results for other learning algorithms.

### 3.1 A Brief Introduction to Mixing Coefficients for Processes

In this subsection we recall some standard mixing coefficients and their basic properties (see e.g. [23] and [17] for a thorough treatment). To this end let be a set, and be two -algebras on , and be a probability measure on . Furthermore, let be a Hilbert space and be the space of all -measurable -valued functions that are -integrable with respect to . Using the convention we define the following mixing coefficients for the pair :

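$$
\begin{aligned}
\alpha(\mathcal{A},\mathcal{B},\mu) &:= \sup_{A\in\mathcal{A},\,B\in\mathcal{B}} \bigl|\mu(A\cap B)-\mu(A)\mu(B)\bigr| \\
\beta(\mathcal{A},\mathcal{B},\mu) &:= \frac{1}{2}\sup\Bigl\{\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\bigl|\mu(A_i\cap B_j)-\mu(A_i)\mu(B_j)\bigr| : (A_i)\subset\mathcal{A} \text{ and } (B_j)\subset\mathcal{B} \text{ partitions}\Bigr\} \\
\varphi(\mathcal{A},\mathcal{B},\mu) &:= \sup_{A\in\mathcal{A},\,B\in\mathcal{B}} \Bigl|\frac{\mu(A\cap B)-\mu(A)\mu(B)}{\mu(A)}\Bigr| \\
\varphi_{\mathrm{sym}}(\mathcal{A},\mathcal{B},\mu) &:= \sqrt{\varphi(\mathcal{A},\mathcal{B},\mu)\cdot\varphi(\mathcal{B},\mathcal{A},\mu)} \\
R_p^{H}(\mathcal{A},\mathcal{B},\mu) &:= \sup_{\substack{f\in L_p(\mathcal{A},\mu,H)\\ g\in L_p(\mathcal{B},\mu,H)}} \Bigl|\frac{\mathbb{E}_\mu\langle f,g\rangle-\langle\mathbb{E}_\mu f,\mathbb{E}_\mu g\rangle}{\|f\|_p\,\|g\|_p}\Bigr|, \qquad p\in[2,\infty].
\end{aligned}
$$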

It is obvious from the definitions that all mixing coefficients equal 0 if and are independent. Furthermore, besides they are all symmetric in and . Moreover, we have and . In addition, they satisfy the relations (see e.g. [23, Section 1] and the references therein):

Moreover, the coefficients are essentially equivalent to the coefficients for the scalar case since [34, Thm. 4.1] shows that for all there exists a constant such that for all Hilbert spaces we have

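$$R_p^{\mathbb{R}}(\mathcal{A},\mathcal{B},\mu) \;\le\; R_p^{H}(\mathcal{A},\mathcal{B},\mu) \;\le\; c_p\,R_p^{\mathbb{R}}(\mathcal{A},\mathcal{B},\mu). \tag{23}$$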

Note that for we actually have and for we may choose the famous Grothendieck constant (see the proof of Lemma 2.2 in [35]). Moreover, it is obvious from the definition that is decreasing in , i.e.

$$R_p^{H}(\mathcal{A},\mathcal{B},\mu) \;\le\; R_q^{H}(\mathcal{A},\mathcal{B},\mu), \qquad q\le p.$$

In particular this yields for all . Finally, Theorem 4.13 in [36] gives the highly non-trivial relation

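$$R_p^{\mathbb{R}}(\mathcal{A},\mathcal{B},\mu) \;\le\; 2\pi\,\alpha^{1-\frac{2}{p}}(\mathcal{A},\mathcal{B},\mu)\,\varphi_{\mathrm{sym}}^{\frac{2}{p}}(\mathcal{A},\mathcal{B},\mu), \qquad p\in[2,\infty]. \tag{24}$$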

In view of our consistency results we are mainly interested in the coefficients . Note that with the help of the above inequalities these coefficients can be estimated by the typically more accessible coefficients and . The coefficient , which can often (see [36, Prop. 3.22] for an exact statement) be computed by

$$\beta(\mathcal{A},\mathcal{B},\mu) = \mathbb{E}_\mu \sup_{B\in\mathcal{B}} \bigl|\mu(B)-\mathbb{E}_\mu(B\mid\mathcal{A})\bigr|,$$

is mainly mentioned because it was used in earlier works (see e.g. [11, 37]) on learning from dependent observations.
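When both $\sigma$-algebras are generated by finite partitions, the coefficients $\alpha$, $\beta$ and $\varphi$ defined above can be computed by brute force. The following sketch assumes this finite setting, which is an illustration and not the general case treated in the text: the input is the matrix of joint probabilities $\mu(A_i\cap B_j)$ of the atoms, every set in either $\sigma$-algebra is a union of atoms, and the supremum defining $\beta$ is attained at the partitions into atoms (refining a partition cannot decrease the sum, by the triangle inequality). Sets $A$ with $\mu(A)=0$ are skipped in the computation of $\varphi$, which is an assumption about how the quotient is read in that case. The function name `mixing_coefficients` is hypothetical.

```python
import itertools
import numpy as np

def _indicator_vectors(k):
    """All 0/1 vectors of length k, i.e. all unions of k atoms (including the empty set)."""
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=k)]

def mixing_coefficients(joint):
    """Return (alpha, beta, phi) for sigma-algebras generated by finite partitions.

    joint[i, j] = mu(A_i ∩ B_j) for the atoms A_i and B_j.
    The cost is exponential in the number of atoms, so keep the partitions small.
    """
    joint = np.asarray(joint, dtype=float)
    p = joint.sum(axis=1)            # mu(A_i)
    q = joint.sum(axis=0)            # mu(B_j)
    D = joint - np.outer(p, q)       # mu(A_i ∩ B_j) - mu(A_i) mu(B_j)
    beta = 0.5 * np.abs(D).sum()
    alpha, phi = 0.0, 0.0
    for s in _indicator_vectors(len(p)):
        mass_a = float(s @ p)        # mu(A) for the union A of the selected atoms
        for t in _indicator_vectors(len(q)):
            gap = abs(float(s @ D @ t))  # |mu(A ∩ B) - mu(A) mu(B)|
            alpha = max(alpha, gap)
            if mass_a > 0:
                phi = max(phi, gap / mass_a)
    return alpha, beta, phi

# Independent atoms: all coefficients vanish (up to rounding).
independent = np.outer([0.3, 0.7], [0.4, 0.6])
# Fully dependent atoms: A_i and B_i coincide almost surely.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])
alpha_dep, beta_dep, phi_dep = mixing_coefficients(dependent)
```

For the fully dependent example the differences $\mu(A_i\cap B_j)-\mu(A_i)\mu(B_j)$ are all $\pm 0.25$, which gives $\alpha = 0.25$, $\beta = 0.5$ and $\varphi = 0.5$.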

Let us now consider mixing coefficients and the corresponding mixing notions for stochastic processes:

###### Definition 3.1

Let be a probability space, be a measurable space, and