In nonparametric regression, we are given a set of samples of the form $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, where each $x_i$ is an input, $y_i \in \mathbb{R}$ is a real-valued output, and the samples are drawn i.i.d. from an unknown distribution $\rho$ on the input-output space. The goal is to learn a function which can be used to predict future outputs based on the inputs.
Kernel methods [18, 5, 21] are a popular nonparametric technique based on choosing the hypothesis space to be a reproducing kernel Hilbert space (RKHS). Stochastic/online learning algorithms [9, 3] (often called stochastic gradient methods [14, 12] in convex optimization) are among the most efficient learning algorithms. At each iteration, they compute a gradient estimate with respect to a new sample point and then update the current solution by subtracting the scaled gradient estimate. In general, the computational complexities for training are $O(md)$ in space and $O(m^2 d)$ in time (with $m$ the sample size and $d$ the input dimension), due to the nonlinearity of kernel methods. In recent years, different types of online/stochastic learning algorithms, either with or without an explicit penalty term, have been proposed and analyzed; see e.g. [3, 23, 25, 17, 22, 15, 7, 11] and references therein.
In classic stochastic learning algorithms, all sample points need to be stored for testing. Thus, the implementation of the algorithm may be difficult in learning problems with high-dimensional inputs and large datasets. To tackle this challenge, an alternative stochastic method, called the doubly stochastic learning algorithm, was proposed in [6]. The new algorithm is based on the random feature approach of [13, 1]. The latter result is based on Bochner's theorem and shows that most shift-invariant kernel functions can be expressed as an inner product of suitable random features. Thus the kernel function at each iteration in the original stochastic learning algorithm can be estimated (or replaced) by a random feature. As a result, the new algorithm avoids keeping all the sample points, since it only requires generating the random features and recovering past random features by resampling them using specific random seeds. The computational complexities of the algorithm are $O(m)$ in space (independent of the dimension of the data) and $O(m^2 d)$ in time. Numerical experiments given in [6] show that the algorithm is fast and comparable with state-of-the-art algorithms. Convergence results with respect to the solution of regularized expected risk minimization were derived in [6] for doubly stochastic learning algorithms with regularization, considering general Lipschitz and smooth losses.
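The seed trick mentioned above can be sketched as follows. The feature map and seed handling here are illustrative assumptions, not the exact scheme of the cited work: the point is only that a random feature need not be stored, since it can be regenerated deterministically from its seed.

```python
import numpy as np

def feature(x, rng, dim):
    # Hypothetical random Fourier feature: draw (w, b) from the feature
    # distribution and evaluate sqrt(2) * cos(w . x + b).
    w = rng.standard_normal(dim)
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(w @ x + b)

# Instead of storing the sampled features, store only the integer seeds;
# each feature is regenerated on demand from its seed.
seeds = [7, 42, 1234]
x = np.ones(5)

first_pass = [feature(x, np.random.default_rng(s), 5) for s in seeds]
second_pass = [feature(x, np.random.default_rng(s), 5) for s in seeds]
assert np.allclose(first_pass, second_pass)  # identical regeneration
```

Storing a seed costs $O(1)$ per iteration regardless of the input dimension, which is what makes the space complexity independent of $d$.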
In this paper, we study generalization properties of doubly stochastic learning algorithms in the framework of nonparametric regression with the square loss. Our contributions are theoretical. First, for the first time, we prove generalization error bounds for doubly stochastic learning algorithms without regularization, either using a fixed constant step-size or a decaying step-size. Compared with the regularized version studied in [6], doubly stochastic learning algorithms without regularization do not involve model selection of regularization parameters, and thus may have some computational advantages in practice. Secondly, we also prove generalization error bounds for doubly stochastic learning algorithms with regularization. Compared with the results in [6], our convergence rates are faster and do not require the boundedness assumptions on the gradient estimates made in [6]; see the discussion section for details. The key ingredients of our proof are an error decomposition and an induction argument, which enable us to derive total error bounds provided that the initial (or approximation) and sample errors can be bounded. The initial and sample errors are bounded using properties from integral operators and functional analysis. The difficulty in the analysis is the estimation of the sample error, since the sequence generated by the algorithm may not lie in the hypothesis space. The novelty of our proof lies in the estimation of the sample error, which involves upper bounding the trace norm of an operator, and in a refined analysis bounding the initial error.
The rest of the paper is organized as follows. In the next section, we introduce the learning setting we consider and the doubly stochastic learning algorithms. In Section 3, we present the main results on generalization properties for the studied algorithms and briefly discuss them. Sections 4 to 7 are devoted to the proofs of all the main results.
2 Learning Setting and Doubly Stochastic Learning Algorithms
Learning a function from a given finite number of instances through efficient and practical algorithms is the basic goal of learning theory. Let the input space $X$ be a closed subset of the Euclidean space $\mathbb{R}^d$ and the output space $Y = \mathbb{R}$. Let $\rho$ be a fixed Borel probability measure on $X \times Y$, with its induced marginal measure on $X$ and conditional measure on $Y$ given $x \in X$ denoted by $\rho_X$ and $\rho(\cdot|x)$, respectively. In statistical learning theory, the Borel probability measure $\rho$ is unknown; only a set of sample points of size $m$ is given. Here, we assume that the sample points are independently and identically drawn from the distribution $\rho$.
The quality of a function $f: X \to \mathbb{R}$ can be measured in terms of the expected risk with the square loss, defined as
\[ \mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y). \quad (2.1) \]
In this case, the function minimizing the expected risk over all measurable functions is the regression function, given by
\[ f_\rho(x) = \int_Y y \, d\rho(y|x), \qquad x \in X. \quad (2.2) \]
For any $f \in L^2_{\rho_X}$, it is easy to prove that
\[ \mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \quad (2.3) \]
Here, $L^2_{\rho_X}$ is the Hilbert space of square-integrable functions with respect to $\rho_X$, with its induced norm given by $\|f\|_\rho = (\int_X |f(x)|^2 \, d\rho_X(x))^{1/2}$. Throughout this paper we assume that $\int_{X \times Y} y^2 \, d\rho(x, y) < \infty$. Thus, using (2.3) with $f = 0$, $\|f_\rho\|_\rho$ is finite.
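For completeness, an identity of the form (2.3) follows by expanding the square and using the definition of the regression function:

```latex
\mathcal{E}(f)
= \int_{X\times Y} \big( (f(x) - f_\rho(x)) + (f_\rho(x) - y) \big)^2 \, d\rho
= \|f - f_\rho\|_\rho^2 + \mathcal{E}(f_\rho)
  + 2 \int_{X\times Y} (f(x) - f_\rho(x)) (f_\rho(x) - y) \, d\rho,
```

and the cross term vanishes since $\int_Y (f_\rho(x) - y) \, d\rho(y|x) = 0$ for $\rho_X$-almost every $x$, by (2.2).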
Kernel methods are based on choosing the hypothesis space to be a reproducing kernel Hilbert space (RKHS). Recall that a reproducing kernel $K: X \times X \to \mathbb{R}$ is a symmetric function such that the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{x_i\}_{i=1}^{\ell}$ in $X$. The kernel $K$ defines an RKHS $(H, \|\cdot\|_H)$ as the completion of the linear span of the set $\{K_x(\cdot) := K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle K_x, K_{x'} \rangle_H := K(x, x')$. For simplicity, we assume that $K$ is a Mercer kernel, that is, $X$ is a compact set and $K$ is continuous.
Online/stochastic learning is an important class of efficient algorithms for performing learning tasks. Over the past few decades, several variants of online/stochastic learning algorithms have been studied, many of which take the form $f_1 = 0$ and
\[ f_{t+1} = f_t - \eta_t \big( (f_t(x_t) - y_t) K_{x_t} + \lambda f_t \big), \qquad t = 1, \ldots, m, \quad (2.4) \]
and generalization properties have been derived. Here $\{\eta_t\}$ is a step-size sequence, and $\lambda$ can be chosen as a positive constant depending on the sample size [23, 22], or to be zero [25, 17, 11]. In general, the computational complexities of the algorithm are $O(md)$ in space and $O(m^2 d)$ in time.
According to Bochner's theorem, a continuous, shift-invariant kernel $K(x, x') = k(x - x')$ on $\mathbb{R}^d$ is positive definite if and only if $k$ is the Fourier transform of a non-negative measure. Thus, most shift-invariant kernel functions can be expressed as an integral of suitable random features. A basic example for the Gaussian kernel is detailed as follows.
Example 2.1 (Random Fourier Features).
Let $K$ be the Gaussian kernel
\[ K(x, x') = \exp\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right), \qquad x, x' \in \mathbb{R}^d, \]
for some $\sigma > 0$. Then, according to the Fourier inversion theorem and by a simple calculation, one can prove that
\[ K(x, x') = \int_{\mathbb{R}^d \times [0, 2\pi]} 2 \cos(\omega^\top x + b) \cos(\omega^\top x' + b) \, d\mu(\omega, b), \]
where $\mu$ is the product of the Gaussian measure $N(0, \sigma^{-2} I)$ on $\mathbb{R}^d$ and the uniform distribution on $[0, 2\pi]$.
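Numerically, an integral representation of this form can be estimated by Monte Carlo averaging over sampled features. The following sketch (bandwidth, points, and feature count are arbitrary choices) checks such an approximation against the exact Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_feat = 3, 1.5, 20000

x = np.array([0.3, -0.7, 1.1])
xp = np.array([1.0, 0.2, -0.4])

# Sample (omega, b) from mu: omega ~ N(0, sigma^{-2} I), b ~ Uniform[0, 2*pi].
omega = rng.standard_normal((n_feat, d)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, n_feat)

phi = lambda z: np.sqrt(2.0) * np.cos(omega @ z + b)  # random Fourier features
approx = np.mean(phi(x) * phi(xp))                    # Monte Carlo estimate
exact = np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
assert abs(approx - exact) < 0.1
```

Since each product of features is bounded by 2, the Monte Carlo error decays at the usual $O(n_{\mathrm{feat}}^{-1/2})$ rate.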
Replacing $K_{x_t}$ in (2.4) by an unbiased estimate with respect to a random feature, we get the doubly stochastic learning algorithm. (Note that [6] studied the algorithm with a general convex loss function; specializing to the square loss leads to the algorithm (2.6).) Let $\mu$ be another probability measure on a measurable set $\Omega$, and $\phi: X \times \Omega \to \mathbb{R}$ a square-integrable (with respect to $\rho_X \times \mu$) function. Assume that the kernel can be written as [13, 1]
\[ K(x, x') = \int_{\Omega} \phi(x, \omega) \phi(x', \omega) \, d\mu(\omega). \quad (2.5) \]
Let $\omega_1, \omega_2, \ldots$ be elements in $\Omega$, i.i.d. according to the distribution $\mu$. The doubly stochastic learning algorithm associated with the random features $\{\phi(\cdot, \omega_t)\}_t$ is defined by $f_1 = 0$ and
\[ f_{t+1} = f_t - \eta_t \big( (f_t(x_t) - y_t) \phi(x_t, \omega_t) \phi(\cdot, \omega_t) + \lambda f_t \big), \qquad t = 1, \ldots, m. \quad (2.6) \]
The computational complexities of the algorithm are $O(m)$ in space (independent of the dimension of the data) and $O(m^2 d)$ in time.
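In coefficient form, (2.6) scales the past coefficients by $(1 - \eta_t \lambda)$ and appends one new coefficient per iteration. A minimal sketch under assumed random Fourier features (cf. Example 2.1) and an arbitrary constant step-size:

```python
import numpy as np

def features(X, rng, sigma):
    # One random feature (omega, b) evaluated at every row of X.
    omega = rng.standard_normal(X.shape[1]) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(X @ omega + b)

def doubly_sgd(X, y, eta=0.2, lam=0.0, sigma=1.0, seed=0):
    """f_{t+1} = f_t - eta*((f_t(x_t)-y_t)*phi(x_t,w_t)*phi(.,w_t) + lam*f_t).

    f_t = sum_i a_i phi(., w_i); only the coefficients (plus the seed used to
    regenerate w_1, w_2, ...) need to be stored, so the memory cost is O(m).
    """
    m = X.shape[0]
    a = np.zeros(m)
    rng = np.random.default_rng(seed)
    # Features are cached here for clarity; in a streaming implementation
    # each (omega_t, b_t) is regenerated from its stored seed when needed.
    Phi = np.column_stack([features(X, rng, sigma) for _ in range(m)])
    for t in range(m):
        f_xt = Phi[t, :t] @ a[:t]            # evaluate f_t at x_t
        a[:t] *= (1.0 - eta * lam)           # shrinkage from the lam*f_t term
        a[t] = -eta * (f_xt - y[t]) * Phi[t, t]
    return a

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))
y = np.sin(X[:, 0])
coef = doubly_sgd(X, y)
```

With a fixed seed the whole run is reproducible, which is what allows past random features to be resampled rather than stored.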
In this paper, we study the generalization properties of Algorithm (2.6), either with a fixed constant step-size $\eta_t = \eta$ or a decaying step-size $\eta_t = \eta_1 t^{-\theta}$, where $\theta \in (0, 1)$. Under basic assumptions of standard learning theory and with appropriate choices of the parameters, we shall prove upper bounds for the excess expected risks, i.e., $\mathbb{E}[\mathcal{E}(f_{t+1})] - \mathcal{E}(f_\rho)$.
Throughout, $\mathbb{N}$ denotes the set of positive integers. For any $t \in \mathbb{N}$, the set $\{1, \ldots, t\}$ is denoted by $[t]$. We will use the conventional notations $\sum_{k=t+1}^{t} a_k = 0$ and $\prod_{k=t+1}^{t} a_k = 1$ for any sequence $\{a_k\}$ of real numbers. For any operator $L$ on a Hilbert space $H$, $I$ denotes the identity operator on $H$, and $\prod_{k=t}^{T} L_k = L_T L_{T-1} \cdots L_t$ when $t \leq T$, and $I$ when $t > T$. For a given bounded operator $L$, $\|L\|$ denotes the operator norm of $L$, i.e., $\|L\| = \sup_{f \in H: \|f\|_H = 1} \|Lf\|_H$. For two positive sequences $\{a_t\}$ and $\{b_t\}$, $a_t \lesssim b_t$ (or $b_t \gtrsim a_t$) stands for $a_t \leq C b_t$ for some positive constant $C$ (independent of $t$) for all $t \in \mathbb{N}$. The indicator function of a subset $S$ is denoted by $\mathbf{1}_S$.
3 Generalization Properties for Doubly Stochastic Learning Algorithms
In this section, after introducing some basic assumptions, we state our main results, followed by some brief discussion.
We first make the following basic assumption on the RKHS and its associated kernel, as well as the underlying random features.

Assumption 1. $H$ is separable and $\phi$ is measurable. Furthermore, there exists a positive constant $\kappa \geq 1$, such that $K(x, x) \leq \kappa^2$ for all $x \in X$ and $|\phi(x, \omega)| \leq \kappa$ almost surely with respect to $\rho_X \times \mu$.
The boundedness assumptions on the kernel function and the random features are fairly common. For example, when $K$ is a Gaussian kernel with variance $\sigma^2$ and the random features are as in Example 2.1, we can take $\kappa = \sqrt{2}$.
To present our next assumption, we need to introduce the integral operator $L: L^2_{\rho_X} \to L^2_{\rho_X}$, defined as
\[ Lf = \int_X f(x) K(x, \cdot) \, d\rho_X(x), \qquad f \in L^2_{\rho_X}. \]
We make the following assumption on the regularity of the regression function.
Assumption 2. There exist $\zeta > 0$ and $R > 0$, such that $\|L^{-\zeta} f_\rho\|_\rho \leq R$.
The above assumption is very standard [5, 21] in nonparametric regression. It characterizes how large the subspace is that the target function lies in. In particular, the bigger $\zeta$ is, the more stringent the assumption is and the smaller the subspace is, since $L^{\zeta_1}(L^2_{\rho_X}) \subseteq L^{\zeta_2}(L^2_{\rho_X})$ when $\zeta_1 \geq \zeta_2$. Moreover, when $\zeta = 0$ we are making no assumption, as the condition then holds trivially, while for $\zeta = 1/2$ we are requiring $f_\rho \in H$. (The latter should be interpreted as: there exists a $f_H \in H$ such that $f_H = f_\rho$ $\rho_X$-almost surely.)
Finally, the last assumption is related to the capacity of the RKHS.
Assumption 3. For some $\gamma \in (0, 1]$ and $c_\gamma > 0$, $L$ satisfies
\[ \operatorname{tr}\big( L (L + \lambda I)^{-1} \big) \leq c_\gamma \lambda^{-\gamma}, \qquad \text{for all } \lambda > 0. \quad (3.2) \]
The left-hand side of (3.2) is called the effective dimension, or the degrees of freedom. It can be related to covering/entropy number conditions; see [20, 21] for further details. Assumption 3 is always true for $\gamma = 1$ and $c_\gamma = \kappa^2$, since $L$ is a trace class operator, which implies that its eigenvalues, denoted as $\{\sigma_i\}$, satisfy $\sum_i \sigma_i = \operatorname{tr}(L) \leq \kappa^2$. The case $\gamma = 1$ is referred to as the capacity independent setting. Assumption 3 with $\gamma < 1$ allows to derive better error rates. It is satisfied, e.g., if the eigenvalues of $L$ satisfy a polynomial decaying condition $\sigma_i \sim i^{-1/\gamma}$, or for any $\gamma \in (0, 1]$ if $L$ is finite rank. Kernels with polynomially decaying eigenvalues include those that underlie the Sobolev spaces with different orders of smoothness. As a concrete example, the first-order Sobolev kernel generates an RKHS of Lipschitz functions, and one has $\sigma_i \sim i^{-2}$ and thus $\gamma = 1/2$.
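As a numerical illustration (with a truncated spectrum standing in for the operator), eigenvalues $\sigma_i = i^{-2}$ give an effective dimension of order $\lambda^{-1/2}$, consistent with $\gamma = 1/2$; the $\pi/2$ constant below comes from the integral comparison $\sum_{i \geq 1} 1/(1 + \lambda i^2) \leq \int_0^\infty dx/(1 + \lambda x^2)$:

```python
import numpy as np

sigma = 1.0 / np.arange(1, 10**6 + 1, dtype=float) ** 2  # sigma_i = i^{-2}

def eff_dim(lam):
    # Effective dimension tr(L (L + lam I)^{-1}) = sum_i sigma_i / (sigma_i + lam).
    return np.sum(sigma / (sigma + lam))

# Check eff_dim(lam) <= (pi/2) * lam^{-1/2} across several scales of lam.
for lam in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    assert eff_dim(lam) * np.sqrt(lam) <= np.pi / 2
```

The truncation at $10^6$ eigenvalues only underestimates the sum, so the check is conservative.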
3.2 Main Results
We are now ready to present our main results, whose proofs are postponed to Section 7. Our first main result provides generalization error bounds for the studied algorithm with $\lambda = 0$ and a constant (but depending on the sample size) step-size.
According to (3.4), to derive a convergence result from the above theorem, one can choose a step-size decaying polynomially in the sample size, with some appropriate exponent. The error bound (3.5) is composed of two terms, which arise from estimating the initial and sample errors respectively in our proof, and are controlled directly by the step-size $\eta$. A bigger $\eta$ may lead to a smaller initial error but may enlarge the sample error, while a smaller $\eta$ may reduce the sample error but enlarge the initial error. Solving this trade-off leads to the best rate obtainable from the above theorem, which is stated next.
The above corollary asserts that with an appropriate fixed step-size, the doubly stochastic learning algorithm without regularization achieves generalization error bounds of order
As mentioned before, Assumption 3 is always satisfied with $\gamma = 1$ and $c_\gamma = \kappa^2$, which is called the capacity independent case. Setting $\gamma = 1$ and $c_\gamma = \kappa^2$ in Corollary 3.2, we have the following results in the capacity independent case.
The above corollary can be further simplified as follows if we consider the special case $f_\rho \in H$, i.e., Assumption 2 with $\zeta = 1/2$.
Theorem 3.1 and its corollaries provide generalization error bounds for the studied algorithm without regularization in the fixed step-size setting. In the next theorem, we give generalization error bounds for the studied algorithm (2.6) without regularization in a decaying step-size setting.
Similarly, there is a trade-off in the error bounds of the above theorem. Balancing the last two terms of the error bounds, we get the following corollary.
Corollary 3.6 asserts that with an appropriate choice of the decaying exponent for the step-size, the doubly stochastic learning algorithm without regularization has a generalization error bound of order  when , or of order  when . Comparing Corollary 3.2 with Corollary 3.6, the latter has a slower convergence rate. This suggests that the fixed step-size setting may be more favourable.
Theorems 3.1 and 3.5 provide generalization error bounds for doubly stochastic learning algorithms without regularization. In the next theorem, we give generalization error bounds for doubly stochastic learning algorithms with regularization.
Balancing the two terms from the error bounds in the above theorem to optimize the bounds, we can get the following results.
The above corollary asserts that for some appropriate choices on the regularized parameter and the decaying exponent of the step-size, doubly stochastic learning algorithm with regularization achieves generalization error bounds of order where can be arbitrarily close to zero. The convergence rate from Corollary 3.8 is essentially the same as that from Corollary 3.2 for . For the case , the best obtainable rate from Corollary 3.8 for the studied algorithm is of order
. This type of phenomenon is called the saturation effect in learning theory. Note that kernel ridge regression also saturates when $\zeta$ is large.
We now compare our results with those in [6]. A regularized version of doubly stochastic learning algorithms with a convex loss function was studied in [6]. When the loss function is the square loss, the algorithm in [6] is exactly Algorithm (2.6). [6, Theorem 6] asserts that with high probability, the learning sequence generated by (2.6), with suitable choices of $\lambda$ and the step-sizes, satisfies
provided that the gradient estimates are uniformly bounded. Here $f_\lambda$ is the solution of the regularized expected risk minimization
\[ f_\lambda = \arg\min_{f \in H} \, \mathcal{E}(f) + \lambda \|f\|_H^2. \]
The optimal obtainable error bound is achieved by choosing $\lambda$ appropriately, in which case,
Comparing the above result with Corollaries 3.3 and 3.8, the error bounds from Corollaries 3.3 and 3.8 are better in the capacity independent case, while they do not require the boundedness assumption on the gradient estimates.
We discuss some issues that might be considered in the future. First, our generalization error bounds are in expectation, and it would be interesting to derive high-probability error bounds. Second, the rates in our results are not optimal, and they may be further improved using a more involved technique (perhaps with a better estimate of the sample variance). Finally, in this paper we only consider simple stochastic gradient methods (SGM) with last iterates. It would be interesting to extend our analysis to different variants of SGM, such as fully online/stochastic learning [24, 22], SGM with mini-batches, the stochastic average gradient, averaged SGM, multi-pass SGM, and stochastic pairwise learning.
4 Error Decomposition
The rest of this paper is devoted to proving our main results. To this end, we need some preliminary analysis and a key error decomposition.
For notational simplicity, we denote by for any
, and set the residual vector
where we denote
Using the iterated relationship (4.1) multiple times, we can prove the following error decomposition.
For any , we have the following error decomposition
Using (4.1) iteratively, with and , we get
which is exactly
In the rest of the proof, we use the shorthand notation introduced above. Following from (4.6), we get
From (2.6), we know that for any $t$, $f_{t+1}$ depends only on $\{(x_i, y_i)\}_{i \leq t}$ and $\{\omega_i\}_{i \leq t}$. Also, note that the family of sample points and random features is independent. Thus, we can prove the following vanishing property:
The proof is complete. ∎
The error decomposition (4.3) is fairly common in analyzing standard stochastic/online learning algorithms. The first term is related to an initial error, which is deterministic and will be estimated in the next section. The second term is a sample error depending on the samples, which will be estimated in Section 6.
5 Estimating Initial Error
In this section, we will upper bound the initial error, namely, the first term of the right-hand side of (4.3). To this end, we introduce the following two lemmas.
Let , and be such that for all . Then for all
(5.1) holds trivially in the first case. Now, we consider the remaining case. Recall that $L$ is a self-adjoint, compact, and positive operator on $L^2_{\rho_X}$. According to the spectral theorem, $L$ has only non-negative singular values, for which the assumed bound holds. Thus,
Letting for each , we have
Therefore, we get
When we have
From the above analysis, we can get (5.1). The proof is complete. ∎
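Via the spectral theorem, operator bounds of this type reduce to a scalar inequality: for $x \geq 0$ with $\eta_k x \leq 1$, one has $x^a \prod_k (1 - \eta_k x) \leq x^a e^{-x \sum_k \eta_k} \leq (a / (e \sum_k \eta_k))^a$, since $\sup_{x \geq 0} x^a e^{-cx} = (a/(ec))^a$. A quick numerical check of this scalar bound (the exponent and step-sizes below are arbitrary choices):

```python
import numpy as np

a = 0.5
etas = 1.0 / np.sqrt(np.arange(1, 101))  # step-sizes eta_k = k^{-1/2}, all <= 1
S = etas.sum()

xs = np.linspace(0.0, 1.0, 10001)        # eta_k * x <= 1 on this range
prod = np.prod(1.0 - np.outer(etas, xs), axis=0)
lhs = xs ** a * prod                     # x^a * prod_k (1 - eta_k * x)
rhs = (a / (np.e * S)) ** a              # sup bound (a / (e * sum_k eta_k))^a
assert lhs.max() <= rhs
```

The grid evaluation can only underestimate the true supremum, so the assertion is a conservative check of the inequality.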
Under the assumptions of Lemma 5.1, we have for and any non-negative integer
Now, we can upper bound the initial error as follows.
Under Assumption 2, let for all , with such that and Then, for any