1 Introduction and Motivation
In this paper, we study the supervised learning problem, which aims at inferring a functional relation between explanatory variables and response variables. In the literature of statistical learning theory, one of the main research topics is the generalization ability of different learning schemes, which indicates their learnability on future observations. It is by now well understood that Bernstein-type inequalities play an important role in deriving fast learning rates. For example, the analysis of various algorithms from non-parametric statistics and machine learning crucially depends on these inequalities, see e.g. [12, 13, 18, 36]. Here, stronger results can typically be achieved since the Bernstein-type inequality allows for localization due to its specific dependence on the variance. In particular, most derivations of minimax optimal learning rates are based on it.
The classical Bernstein inequality assumes that the data are generated by an i.i.d. process. However, this assumption is often violated in real-world applications including financial prediction, signal processing, system identification and diagnosis, text and speech recognition, and time series forecasting, among others. For this and other reasons, there has been some effort to establish Bernstein-type inequalities for non-i.i.d. processes. For instance, generalizations of Bernstein-type inequalities to the cases of -mixing  and -mixing  processes have been found in [7, 28, 27, 32] and [21, 19], respectively. These Bernstein-type inequalities have been applied to derive various convergence rates. For example, the Bernstein-type inequality established in  was employed in  to derive convergence rates for sieve estimates from strictly stationary -mixing processes in the special case of neural networks. applied the Bernstein-type inequality in  to derive an oracle inequality (see Page in  for the meaning of the oracle inequality) for generic regularized empirical risk minimization algorithms with stationary -mixing processes. By applying the Bernstein-type inequality in ,  derived almost sure uniform convergence rates for the estimated Lévy density both in mixed-frequency and low-frequency setups and proved their optimality in the minimax sense. In particular, for the least squares loss,  obtained the optimal learning rates for -mixing processes by applying the Bernstein-type inequality established in . By developing a Bernstein-type inequality for -mixing processes that include -mixing processes and many discrete-time dynamical systems,  established an oracle inequality as well as fast learning rates for generic regularized empirical risk minimization algorithms with observations from -mixing processes.
The above-mentioned inequalities are termed Bernstein-type since they rely on the variance of the random variables. However, these inequalities are usually presented in similar but rather complicated forms, which makes them hard to apply directly when analyzing the performance of statistical learning schemes and may also limit their interpretability. Moreover, existing studies on learning from mixing processes differ from one another in their assumptions and notation, which creates barriers to comparing the learnability of these learning algorithms.
In this work, we first introduce a generalized Bernstein-type inequality and show that it can be instantiated to various stationary mixing processes. Based on the generalized Bernstein-type inequality, we establish an oracle inequality for a class of learning algorithms including ERM [36, Chapter 6] and SVMs. On the technical side, the oracle inequality is derived by refining and extending the analysis of . To be more precise, the analysis in  partially ignored localization with respect to the regularization term, which in our study is addressed by a carefully arranged peeling approach inspired by . This leads to a sharper stochastic error bound and consequently a sharper bound for the oracle inequality, compared with that of . In addition, based on the assumed generalized Bernstein-type inequality, we provide an interpretation and comparison of the effective numbers of observations when learning from various mixing processes.
Our second main contribution is a unified treatment of the analysis of learning schemes with various mixing processes. For example, we establish fast learning rates for -mixing and (time-reversed) -mixing processes by tailoring the generalized oracle inequality. For ERM, our results match those in the i.i.d. case, if one replaces the number of observations with the effective number of observations. For LS-SVMs, as far as we know, the best learning rates for the case of geometrically -mixing processes are those derived in [51, 43, 17]. When applied to LS-SVMs, it turns out that our oracle inequality leads to faster learning rates than those reported in  and . For sufficiently smooth kernels, our rates are also faster than those in . For other mixing processes including geometrically -mixing Markov chains, geometrically -mixing processes, and geometrically -mixing processes, our rates for LS-SVMs with Gaussian kernels essentially match the optimal learning rates, while for LS-SVMs with a given generic kernel, we only obtain rates that are close to the optimal rates.
The rest of this work is organized as follows: In Section 2, we introduce some basics of statistical learning theory. Section 3 presents the key assumption of a generalized Bernstein-type inequality for stationary mixing processes, and presents some concrete examples that satisfy this assumption. Based on the generalized Bernstein-type inequality, a sharp oracle inequality is developed in Section 4 while its proof is deferred to the Appendix. Section 5 provides some applications of the newly developed oracle inequality. Section 6 concludes the paper.
2 A Primer in Learning Theory
Let be a measurable space and be a closed subset. The goal of (supervised) statistical learning is to find a function such that for the value is a good prediction of at . The following definition will help us define what we mean by “good”.
Let be a measurable space and be a closed subset.
Then a function is called a loss function, or simply a loss, if it is measurable.
In this study, we are interested in loss functions that in some sense can be restricted to domains of the form as defined below, which is typical in learning theory [36, Definition 2.22] and is in fact motivated by the boundedness of .
We say that a loss can be clipped at , if, for all , we have
where denotes the clipped value of at , that is
Throughout this work, we make the following assumptions on the loss function :
The loss function can be clipped at some . Moreover, it is both bounded in the sense of and locally Lipschitz continuous, that is,
Here both inequalities are supposed to hold for all and .
Note that the above assumption with Lipschitz constant equal to one can typically be enforced by scaling. To illustrate the generality of the above assumptions on , let us first consider the case of binary classification, that is . For this learning problem one often uses a convex surrogate for the original discontinuous classification loss , since the latter may lead to computationally infeasible approaches. Typical surrogates belong to the class of margin-based losses, that is, is of the form , where is a suitable, convex function. Then can be clipped if and only if has a global minimum, see [36, Lemma 2.23]. In particular, the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the logistic loss for classification and the AdaBoost loss cannot be clipped. On the other hand,  established a simple technique, which is similar to inserting a small amount of noise into the labeling process, to construct a clippable modification of an arbitrary convex, margin-based loss. Finally, both the Lipschitz continuity and the boundedness of can be easily verified for these losses, where for the latter it may be necessary to suitably scale the loss.
Bounded regression is another class of learning problems, where the assumptions made on are often satisfied. Indeed, if and is a convex, distance-based loss represented by some , that is , then can be clipped whenever , see again [36, Lemma 2.23]. In particular, the least squares loss
and the -pinball loss
used for quantile regression can be clipped. Again, for both losses, the Lipschitz continuity and the boundedness can be easily enforced by a suitable scaling of the loss.
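The clipping property discussed above can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the usual clipping operation that truncates a prediction to an interval [-M, M], and the function names (`clip`, `least_squares`, `pinball`, `hinge`, `clipping_never_hurts`) are hypothetical. It tests on random predictions that clipping never increases the loss when the label lies in [-M, M].

```python
import random


def clip(t, M):
    """Truncate a real-valued prediction t to the interval [-M, M]."""
    return max(-M, min(M, t))


def least_squares(y, t):
    """Least squares loss (y - t)^2."""
    return (y - t) ** 2


def pinball(y, t, tau):
    """tau-pinball loss: asymmetric absolute loss used for quantile regression."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


def hinge(y, t):
    """Hinge loss for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * t)


def clipping_never_hurts(loss, labels, M, trials=1000, seed=0):
    """Check loss(y, clip(t)) <= loss(y, t) on random predictions t."""
    rng = random.Random(seed)
    for _ in range(trials):
        y = rng.choice(labels)
        t = rng.uniform(-5.0 * M, 5.0 * M)
        if loss(y, clip(t, M)) > loss(y, t) + 1e-12:
            return False
    return True
```

The check is only an empirical sanity test on sampled points; the general statement is the one cited from [36, Lemma 2.23].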
Given a loss function and an , we often use the notation for the function . Our major goal is to have a small average loss for future unseen observations . This leads to the following definition.
Let be a loss function and be a probability measure on . Then, for a measurable function , the -risk is defined by
Moreover, the minimal -risk
is called the Bayes risk with respect to and . In addition, a measurable function satisfying is called a Bayes decision function.
Let be a probability space and be an -valued stochastic process on . We write
for a training set of length that is distributed according to the first components of . Informally, the goal of learning from a training set is to find a decision function such that is close to the minimal risk . Our next goal is to formalize this idea. We begin with the following definition.
Let be a set and be a closed subset. A learning method on maps every set , , to a function .
Now a natural question is whether the functions produced by a specific learning method satisfy
If this convergence takes place for all , then the learning method is called universally consistent. In the i.i.d. case many learning methods are known to be universally consistent, see e.g.  for classification methods,  for regression methods, and  for generic SVMs. For consistent methods, it is natural to ask how fast the convergence rate is. Unfortunately, in most situations uniform convergence rates are impossible, see [12, Theorem 7.2], and hence establishing learning rates requires some assumptions on the underlying distribution . Again, results in this direction can be found in the above-mentioned books. In the non-i.i.d. case,  showed that no uniform consistency is possible if one only assumes that the data generating process is stationary and ergodic. On the other hand, if some further assumptions on the dependence structure of are made, then consistency is possible, see e.g. .
Let us now describe the learning algorithms of particular interest to us. To this end, we assume that we have a hypothesis set consisting of bounded measurable functions , which is pre-compact with respect to the supremum norm . Since the cardinality of can be infinite, we need to recall the following concept, which will enable us to approximate by using finite subsets.
Let be a metric space and . We call an -net of if for all there exists an with . Moreover, the -covering number of is defined by
where and denotes the closed ball with center and radius .
Note that our hypothesis set is assumed to be pre-compact, and hence for all , the covering number is finite.
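To make the notion of a net concrete, here is a minimal sketch that builds one greedily for a finite set of points. It is an illustration under stated assumptions, not the construction used in the paper: the function names are hypothetical, and the size of the greedy net is only an upper bound on the covering number, not its exact value.

```python
def greedy_eps_net(points, eps, dist):
    """Greedily select centers so that every point is within eps of a center.

    A point becomes a new center only if it is farther than eps from all
    centers chosen so far, so the result is an eps-net of `points`.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def is_eps_net(centers, points, eps, dist):
    """Check the defining property of an eps-net."""
    return all(any(dist(p, c) <= eps for c in centers) for p in points)


def sup_distance(f, g):
    """Supremum-norm distance between two functions given by their values on a common grid."""
    return max(abs(a - b) for a, b in zip(f, g))
```

For a hypothesis set of functions evaluated on a fixed grid, `sup_distance` can be passed as `dist`, which mirrors the supremum norm used for the hypothesis set above.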
Denote , where denotes the (random) Dirac measure at . In other words, is the empirical measure associated to the data set . Then, the risk of a function with respect to this measure
is called the empirical -risk.
With these preparations we can now introduce the class of learning methods of interest:
Let be a loss that can be clipped at some , be a hypothesis set, that is, a set of measurable functions , with , and be a regularizer on , that is, with . Then, for , a learning method whose decision functions satisfy
for all and is called -approximate clipped regularized empirical risk minimization (-CR-ERM) with respect to , , and .
In the case , we simply speak of clipped regularized empirical risk minimization (CR-ERM). In this case, can in fact also be defined as follows:
Note that on the right-hand side of (4) the unclipped loss is considered, and hence CR-ERMs do not necessarily minimize the regularized clipped empirical risk . Moreover, in general CR-ERMs do not minimize the regularized risk either, because on the left-hand side of (4) the clipped function is considered. However, if we have a minimizer of the unclipped regularized risk, then it automatically satisfies (4). In particular, ERM decision functions satisfy (4) for the regularizer and , and SVM decision functions satisfy (4) for the regularizer and . In other words, ERM and SVMs are CR-ERMs.
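The remark that a minimizer of the unclipped regularized empirical risk automatically satisfies (4) can be illustrated on a toy problem. The sketch below assumes, following the description in the text, that (4) compares the regularizer plus the empirical risk of the clipped decision function (left-hand side) with the infimum of the regularizer plus the unclipped empirical risk over the hypothesis set, up to a tolerance (right-hand side). The class `Linear`, the helper names, and the toy data are all hypothetical.

```python
def clip(t, M):
    """Truncate a prediction to [-M, M]."""
    return max(-M, min(M, t))


def least_squares(y, t):
    return (y - t) ** 2


class Linear:
    """Hypothetical hypothesis f_a(x) = a * x."""

    def __init__(self, a):
        self.a = a

    def __call__(self, x):
        return self.a * x


def empirical_risk(loss, f, data):
    """Average loss of f over the data set."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)


def regularized_erm(loss, hypotheses, regularizer, data):
    """Minimize regularizer(f) + unclipped empirical risk over a finite set."""
    return min(hypotheses, key=lambda f: regularizer(f) + empirical_risk(loss, f, data))


def satisfies_cr_erm(loss, f_D, hypotheses, regularizer, data, M, delta=0.0):
    """Check the assumed form of (4): clipped on the left, unclipped on the right."""
    lhs = regularizer(f_D) + empirical_risk(loss, lambda x: clip(f_D(x), M), data)
    rhs = min(regularizer(f) + empirical_risk(loss, f, data) for f in hypotheses) + delta
    return lhs <= rhs + 1e-12


# Toy data with labels in [-1, 1], so the least squares loss can be clipped at M = 1.
data = [(x / 10.0, clip(0.8 * x / 10.0, 1.0)) for x in range(-20, 21)]
hypotheses = [Linear(k / 10.0) for k in range(-20, 21)]
f_D = regularized_erm(least_squares, hypotheses, lambda f: 0.01 * f.a ** 2, data)
```

With a regularizer that is identically zero this reduces to plain ERM over the finite set, which is the first case named in the text.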
3 Mixing Processes and A Generalized Bernstein-type Inequality
In this section, we introduce a generalized Bernstein-type inequality. Here the inequality is said to be generalized in that it depends on the effective number of observations instead of the number of observations, which, as we shall see later, makes it applicable to various stationary stochastic processes. To this end, let us first introduce several mixing processes.
3.1 Several Stationary Mixing Processes
We begin by introducing some notation. Recall that is a measurable space and is closed. We further denote as a probability space, as an -valued stochastic process on , and as the $\sigma$-algebras generated by and , respectively. Throughout, we assume that is stationary, that is, the -valued random variables and have the same distribution for all , , . Let be a measurable map. is denoted as the -image measure of , which is defined as , measurable. We denote as the space of (equivalence classes of) measurable functions with finite -norm . Then together with forms a Banach space. Moreover, if is a sub-$\sigma$-algebra, then denotes the space of all -measurable functions . denotes the space of -dimensional sequences with finite Euclidean norm. Finally, for a Banach space , we write for its closed unit ball.
In order to characterize the mixing property of a stationary stochastic process, various notions have been introduced in the literature . Several frequently considered examples are $\alpha$-mixing, $\beta$-mixing and $\phi$-mixing, which are, respectively, defined as follows:
Definition 7 ($\alpha$-Mixing Process).
A stochastic process is called $\alpha$-mixing if there holds
where is the $\alpha$-mixing coefficient defined by
Moreover, a stochastic process is called geometrically $\alpha$-mixing, if
for some constants , , and .
Definition 8 ($\beta$-Mixing Process).
A stochastic process is called $\beta$-mixing if there holds
where is the $\beta$-mixing coefficient defined by
Definition 9 ($\phi$-Mixing Process).
A stochastic process is called $\phi$-mixing if there holds
where is the $\phi$-mixing coefficient defined by
The $\alpha$-mixing concept was introduced by Rosenblatt  while the $\beta$-mixing coefficient was introduced by [49, 50], and was attributed there to Kolmogorov. Moreover, Ibragimov  introduced the $\phi$-coefficient, see also . An extensive and thorough account of mixing concepts including - and -mixing is also provided by . It is well known, see e.g. [20, Section 2], that $\beta$- and $\phi$-mixing sequences are also $\alpha$-mixing, see Figure 1. From the above definition, it is obvious that i.i.d. processes are also geometrically $\alpha$-mixing processes since (5) is satisfied for and all . Moreover, several time series models such as ARMA and GARCH, which are often used to describe, e.g., financial data, satisfy (5) under natural conditions [16, Chapter 2.6.1], and the same is true for many Markov chains including some dynamical systems perturbed by dynamic noise, see e.g. [48, Chapter 3.5].
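As a rough illustration of geometrically decaying dependence in a simple time series model, the following sketch simulates a stationary Gaussian AR(1) process and estimates its autocorrelation at a given lag. This is an assumption-laden toy example: the sample autocorrelation is only a proxy for dependence and is not the mixing coefficient itself, and the function names and parameter values are hypothetical.

```python
import random


def simulate_ar1(n, rho, seed=0):
    """Simulate X_{t+1} = rho * X_t + e_t with standard Gaussian noise e_t.

    The first value is drawn from the stationary distribution N(0, 1 / (1 - rho^2)),
    so the whole path is stationary (requires |rho| < 1).
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0) / (1.0 - rho ** 2) ** 0.5
    path = [x]
    for _ in range(n - 1):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def sample_autocorrelation(path, lag):
    """Empirical autocorrelation of the path at the given lag."""
    n = len(path)
    mean = sum(path) / n
    var = sum((v - mean) ** 2 for v in path) / n
    cov = sum((path[i] - mean) * (path[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var
```

For this model the theoretical autocorrelation at lag k is rho to the power k, so the estimates should shrink geometrically as the lag grows, up to sampling error.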
Another important class of mixing processes called (time-reversed) -mixing processes was originally introduced in  and recently investigated in . As shown below, it is defined in association with a function class that takes into account the smoothness of functions and therefore could be more general in the dynamical system context. As illustrated in  and , the -mixing process encompasses a large family of dynamical systems. Given a semi-norm on a vector space of bounded measurable functions , we define the -norm by
and denote the space of all bounded -functions by .
Definition 10 (-Mixing Process).
Let be a stationary stochastic process. For , the -mixing coefficients are defined by
and similarly, the time-reversed -mixing coefficients are defined by
Let be a strictly positive sequence converging to . Then we say that is (time-reversed) -mixing with rate , if we have for all . Moreover, if is of the form
for some constants , , and , then is called geometrically (time-reversed) -mixing. If is of the form
for some constants , and , then is called polynomial (time-reversed) -mixing.
Figure 2 illustrates the relations among -mixing processes, -mixing processes, and -mixing processes. Clearly, -mixing processes are -mixing . Furthermore, various discrete-time dynamical systems including Lasota-Yorke maps, uni-modal maps, and piecewise expanding maps in higher dimension are -mixing, see . Moreover, smooth expanding maps on manifolds, piecewise expanding maps, uniformly hyperbolic attractors, and non-uniformly hyperbolic uni-modal maps are time-reversed geometrically -mixing, see [47, Proposition 2.7, Proposition 3.8, Corollary 4.11 and Theorem 5.15], respectively.
3.2 A Generalized Bernstein-type Inequality
As discussed in the introduction, the Bernstein-type inequality plays an important role in many areas of probability and statistics. In the statistical learning theory literature, it is also crucial for deriving concentration estimates for learning schemes. As mentioned previously, these inequalities are usually presented in rather complicated forms under different assumptions, which limits their portability to other contexts. However, what these inequalities have in common is their reliance on a boundedness assumption on the variance. In view of this, we introduce in this subsection the following generalized Bernstein-type inequality, with the hope of making it an off-the-shelf tool for various mixing processes.
Let be an -valued, stationary stochastic process on and . Furthermore, let be a bounded measurable function for which there exist constants and such that , , and . Assume that, for all , there exist constants independent of and such that for all , we have
where is the effective number of observations, is a constant independent of , and , are positive constants.
Note that in Assumption 2, the generalized Bernstein-type inequality (7) is assumed with respect to instead of , which is a function of and is termed the effective number of observations. The term "effective number of observations" “provides a heuristic understanding of the fact that the statistical properties of autocorrelated data are similar to a suitably defined number of independent observations”. We will continue our discussion on the effective number of observations in Subsection 3.4 below.
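The role of the effective number of observations can be sketched numerically. The code below assumes that the right-hand side of (7) has the standard Bernstein shape C * exp(-eps^2 * n_eff / (c_sigma * sigma^2 + c_B * eps * B)); this shape, the function names, and the default constants are assumptions made for illustration. The defaults C = 1, c_sigma = 2 and c_B = 2/3 correspond to the classical Bernstein inequality for i.i.d. data with n_eff equal to the sample size.

```python
import math


def bernstein_type_bound(eps, n_eff, sigma2, B, C=1.0, c_sigma=2.0, c_B=2.0 / 3.0):
    """Evaluate the assumed right-hand side of a Bernstein-type inequality.

    eps    : deviation level (must be positive)
    n_eff  : effective number of observations
    sigma2 : variance bound
    B      : supremum-norm bound
    """
    if eps <= 0 or n_eff <= 0:
        raise ValueError("eps and n_eff must be positive")
    return C * math.exp(-(eps ** 2) * n_eff / (c_sigma * sigma2 + c_B * eps * B))


def deviation_at_confidence(tau, n_eff, sigma2, B, C=1.0, c_sigma=2.0, c_B=2.0 / 3.0):
    """Solve bound(eps) = exp(-tau) for eps (positive root of a quadratic in eps).

    Requires tau + log(C) > 0.
    """
    t = tau + math.log(C)
    b = c_B * B * t
    return (b + math.sqrt(b * b + 4.0 * n_eff * c_sigma * sigma2 * t)) / (2.0 * n_eff)
```

Under this assumed shape, a smaller effective number of observations yields a larger deviation level at the same confidence, which is the sense in which dependent data behave like fewer independent observations.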
3.3 Instantiation to Various Mixing Processes
We now show that the generalized Bernstein-type inequality in Assumption 2 can be instantiated to various mixing processes, e.g., i.i.d. processes, geometrically -mixing processes, restricted geometrically -mixing processes, geometrically -mixing Markov chains, -mixing processes, geometrically -mixing processes, and polynomially -mixing processes, among others.
3.3.1 I.I.D. Processes
3.3.2 Geometrically -Mixing Processes
for any , and , where
where is the largest integer less than or equal to and is the smallest integer greater than or equal to for . Observe that for all and for all . From this it is easy to conclude that, for all with
we have . Hence, the right-hand side of (7) takes the form
3.3.3 Restricted Geometrically -Mixing Processes
A restricted geometrically -mixing process is referred to as a geometrically -mixing process (see Definition 10) with . For this kind of -mixing processes, [27, Theorem 2] established a bound for the right-hand side of (7) that takes the following form
for all and , where is some constant depending only on , is some constant depending only on , and is defined by
In fact, for any , by using Davydov’s covariance inequality [11, Corollary to Lemma 2.1] with and , we obtain for ,
Consequently, we have
then the probability bound (9) can be reformulated as
it can be further upper bounded by
Therefore, the Bernstein-type inequality for the restricted -mixing process is also of the generalized form (7) where , , , and .
3.3.4 Geometrically -Mixing Markov Chains
Following similar arguments as in the restricted geometrically -mixing case, we know that for an arbitrary , there holds
That is, when the Bernstein-type inequality for the geometrically -mixing Markov chain can also be formulated in the generalized form (7) with , , , and .
3.3.5 -Mixing Processes
3.3.6 Geometrically -Mixing Processes
the right-hand side of (7) takes the form
3.3.7 Polynomially -Mixing Processes
For the polynomially -mixing processes, a Bernstein-type inequality was established recently in . Under the same restriction on the semi-norm and assumption on as in the geometrically -mixing case, it states that when with
the right-hand side of (7) takes the form
3.4 From Observations to Effective Observations
The generalized Bernstein-type inequality in Assumption 2 is assumed with respect to the effective number of observations . As verified above, the assumed generalized Bernstein-type inequality indeed holds for many mixing processes, whereas may take different values in different circumstances. Supposing that we have observations drawn from a certain mixing process discussed above, Table 1 reports its effective number of observations. As mentioned above, it can be roughly treated as the number of independent observations when inferring the statistical properties of correlated data. In this subsection, we give an intuitive interpretation of the effective number of observations.
|examples||effective number of observations|
|geometrically -mixing processes|
|restricted geometrically -mixing processes|
|geometrically -mixing Markov chains|
|geometrically -mixing processes|
|polynomially -mixing processes||with|
The term "effective observations", which may also be referred to as the effective number of observations depending on the context, probably first appeared in  in the study of autocorrelated time series data. In fact, many similar concepts can be found in the literature on statistical learning from mixing processes, see e.g. [24, 52, 28, 33, 55]. For stochastic processes, mixing indicates asymptotic independence. In some sense, the effective observations can be regarded as the independent observations that contribute when learning from a certain mixing process.
In fact, when inferring statistical properties with data drawn from mixing processes, a frequently employed technique is to split the data of size into blocks, each of size [53, 28, 8, 29, 21]. Each block may be constructed either by choosing consecutive points in the original observation set or by a jump selection [28, 21]. With the constructed blocks, one can then use the coupling technique to introduce a new sequence of blocks that are independent of each other. Due to the mixing assumption, the difference between the two sequences of blocks can be measured with respect to a certain metric. Therefore, one can work with the independent blocks instead of the dependent blocks. On the other hand, for the observations within each originally constructed block, one can again apply the coupling technique [8, 14], e.g., by introducing new i.i.d. observations and bounding the difference between the newly introduced observations and the original observations with respect to a certain metric. During this process, one tries to ensure that the number of blocks is as large as possible, for which turns out to be the choice. An intuitive illustration of this procedure is shown in Fig. 3.
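The two ways of forming blocks mentioned above can be sketched on index sets. This is a minimal illustration with hypothetical function names; in particular, `jump_blocks` implements one plausible reading of jump selection, in which the points of a block are separated by a fixed gap, and it is not claimed to be the exact construction of the cited works.

```python
def consecutive_blocks(n, block_len):
    """Split indices 0..n-1 into blocks of consecutive points.

    Trailing indices that do not fill a complete block are dropped.
    """
    k = n // block_len
    return [list(range(j * block_len, (j + 1) * block_len)) for j in range(k)]


def jump_blocks(n, num_blocks):
    """Form blocks by jump selection.

    Block j collects the indices j, j + num_blocks, j + 2 * num_blocks, ...,
    so neighbouring points inside a block are num_blocks time steps apart.
    """
    return [list(range(j, n, num_blocks)) for j in range(num_blocks)]
```

For example, `jump_blocks(10, 3)` places indices 0, 3, 6 and 9 in the first block, so that points within a block are less dependent than adjacent observations of a mixing process.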
4 A Generalized Sharp Oracle Inequality
In this section we present one of our main results: an oracle inequality for learning from mixing processes satisfying the generalized Bernstein-type inequality (7). We first introduce a few more notations. Let be a hypothesis set in the sense of Definition 6. For
and , we write
Since , , and , we have . Furthermore, we assume that there exists a function that satisfies
Now, we present the oracle inequality as follows:
Let be a stochastic process satisfying Assumption 2 with constants , , , and . Furthermore, let be a loss satisfying Assumption 1. Moreover, assume that there exists a Bayes decision function and constants and such that
where is a hypothesis set with . We define and by (15) and (16), respectively and assume that (17) is satisfied. Finally, let be a regularizer with , be a fixed function, and be a constant such that . Then, for all fixed , , , , and satisfying
with , every learning method defined by (4) satisfies with probability not less than :
The proof of Theorem 1 will be provided in the Appendix. Before we illustrate this oracle inequality in the next section with various examples, let us briefly discuss the variance bound (18). For example, if and is the least squares loss, then it is well-known that (18) is satisfied for and , see e.g. [36, Example 7.3]. Moreover, under some assumptions on the distribution ,  established a variance bound of the form (18) for the so-called pinball loss used for quantile regression. In addition, for the hinge loss, (18) is satisfied for , if Tsybakov’s noise assumption [45, Proposition 1] holds for , see [36, Theorem 8.24]. Finally, based on ,  established a variance bound with for the earlier mentioned clippable modifications of strictly convex, twice continuously differentiable margin-based loss functions.
We remark that in Theorem 1 the constant is necessary since the assumed boundedness of only guarantees , while bounds the function for an unclipped . We do not assume that all satisfy , and therefore in general is necessary. We refer to Examples 2, 3 and 4 for situations where is significantly larger than .
5 Applications to Statistical Learning
To illustrate the oracle inequality developed in Section 4, we now apply it to establish learning rates for some algorithms including ERM over finite sets and SVMs using either a given generic kernel or a Gaussian kernel with varying widths. In the ERM case, our results match those in the i.i.d. case, if one replaces the number of observations with the effective number of observations while, for LS-SVMs with given generic kernels, our rates are slightly worse than the recently obtained optimal rates  for i.i.d. observations. The latter difference is not surprising when considering the fact that  used heavy machinery from empirical process theory such as Talagrand’s inequality and localized Rademacher averages while our results only use a light-weight argument based on the generalized Bernstein-type inequality and the peeling method. However, when using Gaussian kernels, we indeed recover the optimal rates for LS-SVMs and SVMs for quantile regression with i.i.d. observations.
Let us now present the first example, that is, the empirical risk minimization scheme over a finite hypothesis set.