Generalization Properties of Doubly Stochastic Learning Algorithms

07/03/2017 ∙ by Junhong Lin, et al. ∙ MIT

Doubly stochastic learning algorithms are scalable kernel methods that perform very well in practice. However, their generalization properties are not well understood and their analysis is challenging since the corresponding learning sequence may not be in the hypothesis space induced by the kernel. In this paper, we provide an in-depth theoretical analysis for different variants of doubly stochastic learning algorithms within the setting of nonparametric regression in a reproducing kernel Hilbert space with the square loss. In particular, we derive convergence results on the generalization error for the studied algorithms either with or without an explicit penalty term. To the best of our knowledge, the derived results for the unregularized variants are the first of this kind, while the results for the regularized variants improve those in the literature. The novelties in our proof are a sample error bound that requires controlling the trace norm of a cumulative operator, and a refined analysis for bounding the initial error.

1 Introduction

In nonparametric regression, we are given a set of samples, each consisting of an input and a real-valued output, drawn i.i.d. from an unknown distribution. The goal is to learn a function that can be used to predict future outputs from the inputs.

Kernel methods [18, 5, 21] are a popular nonparametric technique based on choosing the hypothesis space to be a reproducing kernel Hilbert space (RKHS). Stochastic/online learning algorithms [9, 3] (often called stochastic gradient methods [14, 12] in convex optimization) are among the most efficient and fast learning algorithms. At each iteration, they compute a gradient estimate with respect to a new sample point and then update the current solution by subtracting the scaled gradient estimate. In general, the cost of training grows with the sample size in both space and time, due to the nonlinearity of kernel methods. In recent years, different types of online/stochastic learning algorithms, either with or without an explicit penalty term, have been proposed and analyzed, see e.g. [3, 23, 25, 17, 22, 15, 7, 11] and references therein.

In classic stochastic learning algorithms, all sample points need to be stored for testing. Thus, implementing the algorithm may be difficult in learning problems with high-dimensional inputs and large datasets. To tackle this challenge, an alternative stochastic method, called the doubly stochastic learning algorithm, was proposed in [6]. The new algorithm is based on the random feature approach proposed in [13]. The latter is based on Bochner's theorem and shows that most shift-invariant kernel functions can be expressed as an inner product of suitable random features. Thus the kernel function at each iteration of the original stochastic learning algorithm can be estimated (or replaced) by a random feature. As a result, the new algorithm avoids keeping all the sample points, since it only requires generating the random features and recovering past ones by resampling them with specific random seeds [6]. The space complexity of the algorithm is independent of the dimension of the data. Numerical experiments given in [6] show that the algorithm is fast and comparable with state-of-the-art algorithms. Convergence results with respect to the solution of regularized expected risk minimization were derived in [6] for doubly stochastic learning algorithms with regularization, considering general Lipschitz and smooth losses.

In this paper, we study generalization properties of doubly stochastic learning algorithms in the framework of nonparametric regression with the square loss. Our contributions are theoretical. First, we prove what are, to our knowledge, the first generalization error bounds for doubly stochastic learning algorithms without regularization, using either a fixed constant step-size or a decaying step-size. Compared with the regularized version studied in [6], doubly stochastic learning algorithms without regularization do not involve the selection of a regularization parameter, and thus may have some computational advantages in practice. Second, we also prove generalization error bounds for doubly stochastic learning algorithms with regularization. Compared with the results in [6], our convergence rates are faster and do not require the boundedness assumptions on the gradient estimates made in [6]; see the discussion section for details. The key ingredients of our proof are an error decomposition and an induction argument, which enable us to derive total error bounds provided that the initial (or approximation) and sample errors can be bounded. The initial and sample errors are bounded using properties of integral operators and tools from functional analysis. The difficulty in the analysis is the estimation of the sample error, since the sequence generated by the algorithm may not be in the hypothesis space. The novelty in our proof is an estimate of the sample error that involves upper bounding the trace norm of an operator, and a refined analysis for bounding the initial error.

The rest of the paper is organized as follows. In the next section, we introduce the learning setting we consider and the doubly stochastic learning algorithms. In Section 3, we present the main results on generalization properties for the studied algorithms and discuss them briefly. Sections 4 to 7 are devoted to the proofs of all the main results.

2 Learning Setting and Doubly Stochastic Learning Algorithms

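Learning a function from a finite number of instances through efficient and practical algorithms is the basic goal of learning theory. Let the input space $X$ be a closed subset of the Euclidean space $\mathbb{R}^d$, the output space $Y \subseteq \mathbb{R}$, and $Z = X \times Y$. Let $\rho$ be a fixed Borel probability measure on $Z$, with its induced marginal measure on $X$ and conditional measure on $Y$ given $x \in X$ denoted by $\rho_X(\cdot)$ and $\rho(\cdot \mid x)$, respectively. In statistical learning theory, the Borel probability measure $\rho$ is unknown, and only a set of sample points $\{(x_i, y_i)\}_{i=1}^{T}$ of size $T \in \mathbb{N}$ is given. Here, we assume that the sample points are drawn independently and identically from the distribution $\rho$.

The quality of a function $f: X \to Y$ can be measured in terms of the expected risk with the square loss, defined as

$$\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho(x, y). \tag{2.1}$$

In this case, the function minimizing the expected risk over all measurable functions is the regression function, given by

$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X. \tag{2.2}$$

For any $f \in L^2_{\rho_X}$, it is easy to prove that

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \tag{2.3}$$

Here, $L^2_{\rho_X}$ is the Hilbert space of square-integrable functions with respect to $\rho_X$, with its induced norm given by $\|f\|_\rho = \left( \int_X |f(x)|^2 \, d\rho_X \right)^{1/2}$. Throughout this paper we assume that $\int_Z y^2 \, d\rho < \infty$. Thus, using (2.3) with $f = 0$, $\|f_\rho\|_\rho$ is finite.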
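The identity (2.3) follows from a short expansion of the square. The derivation below is a sketch written in the usual notation of this literature, which is an assumption here: $\rho$ for the data distribution, $\rho_X$ for its marginal on the inputs, $f_\rho$ for the regression function, and $\mathcal{E}$ for the expected risk.

```latex
\begin{align*}
\mathcal{E}(f)
  &= \int \big( f(x) - f_\rho(x) + f_\rho(x) - y \big)^2 \, d\rho(x, y) \\
  &= \int \big( f(x) - f_\rho(x) \big)^2 \, d\rho_X(x)
     + 2 \int \big( f(x) - f_\rho(x) \big)
         \Big( \int \big( f_\rho(x) - y \big) \, d\rho(y \mid x) \Big) \, d\rho_X(x)
     + \int \big( f_\rho(x) - y \big)^2 \, d\rho(x, y) \\
  &= \| f - f_\rho \|_\rho^2 + \mathcal{E}(f_\rho).
\end{align*}
```

The cross term vanishes because $f_\rho(x)$ is the conditional mean of $y$ given $x$, so the inner integral is zero for almost every $x$.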

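Kernel methods are based on choosing the hypothesis space to be a reproducing kernel Hilbert space (RKHS). Recall that a reproducing kernel $K: X \times X \to \mathbb{R}$ on the input space $X$ is a symmetric function such that the matrix $(K(u_i, u_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $X$. The kernel $K$ defines an RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ as the completion of the linear span of the set $\{K_x(\cdot) := K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle K_x, K_u \rangle_K := K(x, u)$. For simplicity, we assume that $K$ is a Mercer kernel, that is, $X$ is a compact set and $K$ is continuous.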

Online/stochastic learning is an important class of efficient algorithms to perform learning tasks. Over the past few decades, several variants of online/stochastic learning algorithms have been studied, many of which take the form of

(2.4)

and their generalization properties have been derived. The recursion involves a step-size sequence and a regularization parameter, which can be chosen as a positive constant depending on the sample size [23, 22] or set to zero [25, 17, 11]. In general, the cost of the algorithm grows with the sample size in both space and time.
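A recursion of the kind referred to as (2.4) can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' code. It assumes the common form f_{t+1} = (1 - eta * lam) * f_t - eta * (f_t(x_t) - y_t) * K(x_t, .), started from the zero function, with a constant step-size and a Gaussian kernel. The names `kernel_sgd`, `predict`, `step_size`, `lam` and `sigma` are hypothetical.

```python
import math
import random


def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel exp(-||x - xp||^2 / (2 sigma^2)); the width convention is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_sgd(samples, step_size, lam=0.0, kernel=gaussian_kernel):
    """One pass of online kernel least squares, storing f_t = sum_i coeffs[i] * K(points[i], .)."""
    points, coeffs = [], []
    for x_t, y_t in samples:
        # Evaluating f_t(x_t) touches every stored point: O(t) kernel evaluations.
        pred = sum(a * kernel(x_i, x_t) for a, x_i in zip(coeffs, points))
        # Shrink the old coefficients (no-op when lam = 0), then add the new term.
        coeffs = [(1.0 - step_size * lam) * a for a in coeffs]
        points.append(x_t)
        coeffs.append(-step_size * (pred - y_t))
    return points, coeffs


def predict(points, coeffs, x, kernel=gaussian_kernel):
    """Evaluate the learned function at x; requires all stored sample points."""
    return sum(a * kernel(x_i, x) for a, x_i in zip(coeffs, points))


# Toy usage on noisy samples of sin(3x); the data model is purely illustrative.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0)
    data.append(((x,), math.sin(3.0 * x) + 0.1 * rng.gauss(0.0, 1.0)))
pts, cfs = kernel_sgd(data, step_size=0.5)
print(len(pts), predict(pts, cfs, (0.3,)))
```

The sketch makes the storage issue visible: every input point has to be kept in `points`, and each new prediction costs one kernel evaluation per stored point.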

According to Bochner's theorem, a continuous kernel $K(x, x') = k(x - x')$ on $\mathbb{R}^d$ is positive definite if and only if $k$ is the Fourier transform of a non-negative measure. Thus, most shift-invariant kernel functions can be expressed as an integral of some random features. A basic example for the Gaussian kernel is detailed as follows.

Example 2.1 (Random Fourier Features [13]).

Let the Gaussian kernel

for some Then according to the Fourier inversion theorem, and by a simple calculation, one can prove that
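The random feature representation of the Gaussian kernel can be checked numerically. The sketch below assumes the standard random Fourier feature construction of [13]: for the kernel exp(-||x - x'||^2 / (2 sigma^2)), draw w from N(0, sigma^{-2} I) and b uniformly from [0, 2 pi], and use the feature sqrt(2) cos(w . x + b). The exact parametrization used in Example 2.1 may differ, and the names `random_feature`, `phi` and `rff_estimate` are hypothetical.

```python
import math
import random


def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel exp(-||x - xp||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))


def random_feature(rng, dim, sigma=1.0):
    """Draw one feature parameter (w, b) with w ~ N(0, sigma^-2 I) and b ~ U[0, 2 pi]."""
    w = [rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
    b = rng.uniform(0.0, 2.0 * math.pi)
    return w, b


def phi(omega, x):
    """Random Fourier feature sqrt(2) cos(w . x + b); bounded by sqrt(2) in absolute value."""
    w, b = omega
    return math.sqrt(2.0) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)


def rff_estimate(x, xp, num_features, sigma=1.0, seed=0):
    """Monte Carlo average of phi(omega, x) * phi(omega, xp) over independent omega."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_features):
        omega = random_feature(rng, len(x), sigma)
        total += phi(omega, x) * phi(omega, xp)
    return total / num_features


x, xp = (0.2, -0.4), (0.5, 0.1)
exact = gaussian_kernel(x, xp)
approx = rff_estimate(x, xp, num_features=20000)
print(exact, approx)
```

Each product of features is an unbiased estimate of the kernel value, which is the property the doubly stochastic algorithm relies on. A single feature is a very noisy estimate; the average only illustrates that the expectation is right.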

Replacing the kernel function in (2.4) by an unbiased estimate with respect to a random feature, we get the doubly stochastic learning algorithms (footnote 1: note that [6] studied the algorithm with a general convex loss function; specializing to the square loss leads to Algorithm (2.6)). Let be another probability measure on a measurable set , and a square-integrable (with respect to ) function. Assume that the kernel can be written as [13, 1]

(2.5)

Let be elements in , i.i.d. according to the distribution . The doubly stochastic learning algorithm associated with random features is defined by and

(2.6)

The space complexity of the algorithm is independent of the dimension of the data.
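The doubly stochastic recursion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that (2.6) has the form g_{t+1} = (1 - eta * lam) * g_t - eta * (g_t(x_t) - y_t) * phi(omega_t, x_t) * phi(omega_t, .), started from zero, with a constant step-size and the random Fourier feature sqrt(2) cos(w . x + b). Following the seed idea described in [6], the feature parameter omega_t is regenerated from the integer seed t instead of being stored. The names `feature`, `predict` and `doubly_sgd` are hypothetical.

```python
import math
import random


def feature(seed, x, sigma=1.0):
    """Regenerate omega = (w, b) from an integer seed and evaluate phi(omega, x)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0 / sigma) for _ in x]
    b = rng.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2.0) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)


def predict(coeffs, x, sigma=1.0):
    """Evaluate g(x) = sum_i coeffs[i] * phi(omega_i, x); no sample point is needed."""
    return sum(a * feature(i, x, sigma) for i, a in enumerate(coeffs))


def doubly_sgd(samples, step_size, lam=0.0, sigma=1.0):
    """One pass over the samples, keeping a single scalar coefficient per iteration."""
    coeffs = []
    for t, (x_t, y_t) in enumerate(samples):
        residual = predict(coeffs, x_t, sigma) - y_t
        # Shrink the old coefficients (no-op when lam = 0), then add the new term.
        coeffs = [(1.0 - step_size * lam) * a for a in coeffs]
        coeffs.append(-step_size * residual * feature(t, x_t, sigma))
    return coeffs


# Toy usage on noisy samples of sin(3x); the data model is purely illustrative.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append(((x,), math.sin(3.0 * x) + 0.1 * rng.gauss(0.0, 1.0)))
coeffs = doubly_sgd(data, step_size=0.1)
print(len(coeffs), predict(coeffs, (0.3,)))
```

Only one scalar per iteration is stored, so the memory footprint does not depend on the input dimension. The price is that every prediction regenerates all past features, and the learned function is a combination of random features rather than of kernel sections, which is why it need not lie in the RKHS.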

In this paper, we study the generalization properties of Algorithm (2.6), either with a fixed constant step-size or a decaying step-size , where . Under basic assumptions in the standard learning theory and with appropriate choices of parameters, we shall prove upper bounds for the excess expected risks, i.e.,

Notation

denotes the set of positive integers. for any For the set is denoted by . We will use the following conventional notations and for any sequence of real numbers For any operator on a Hilbert space , denotes the identity operator on and when and . For a given bounded operator denotes the operator norm of , i.e., . For two positive sequences and (or ) stands for for some positive constant (independent of ) for all . The indicator function of a subset is denoted by

3 Generalization Properties for Doubly Stochastic Learning Algorithms

In this section, after introducing some basic assumptions, we state our main results, followed by a brief discussion.

3.1 Assumptions

We first make the following basic assumption, with respect to the RKHS and its associated kernel as well as the underlying features.

Assumption 1.

is separable and is measurable. Furthermore, there exists a positive constant , such that and almost surely with respect to .

The boundedness assumptions on the kernel function and random features are fairly common. For example, when is a Gaussian kernel with variance , , we have .

To present our next assumption, we need to introduce the integral operator , defined as

(3.1)

Under Assumption 1, the operator is known to be symmetric, positive definite and trace class. Thus, its power is well defined for . Particularly, we know that [5, 21] for and with

(3.2)

We make the following assumption on the regularity of the regression function.

Assumption 2.

There exists and , such that

The above assumption is very standard [5, 21] in nonparametric regression. It characterizes how large the subspace containing the target function is. In particular, the bigger the is, the more stringent the assumption and the smaller the subspace, since when Moreover, when we are making no assumption as holds trivially, while for we are requiring (footnote 2: this should be interpreted as saying that there exists a such that -almost surely).

Finally, the last assumption is related to the capacity of the RKHS.

Assumption 3.

For some and , satisfies

(3.3)

The left-hand side of (3.3) is called the effective dimension [2], or the degrees of freedom. It can be related to covering/entropy number conditions; see [20, 21] for further details. Assumption 3 is always true for and , since is a trace class operator, which implies that the eigenvalues of , denoted as , satisfy The case is referred to as the capacity independent setting. Assumption 3 with allows one to derive better error rates. It is satisfied, e.g., if the eigenvalues of satisfy a polynomial decaying condition , or with if is finite rank. Kernels with polynomially decaying eigenvalues include those that underlie the Sobolev spaces with different orders of smoothness (e.g. [8]). As a concrete example, the first-order Sobolev kernel generates an RKHS of Lipschitz functions, and one has that and thus .
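The link between polynomially decaying eigenvalues and the effective dimension can be illustrated numerically. The sketch below assumes that the effective dimension is the trace of L (L + lam I)^{-1}, that is, the sum of s_i / (s_i + lam) over the eigenvalues s_i, and it uses the hypothetical spectrum s_i = i^{-1/gamma} with gamma = 1/2. Comparing the sum with an integral gives the bound (pi / 2) * lam^{-1/2} for this spectrum, so the rescaled quantity printed below should stay bounded as lam decreases.

```python
import math


def effective_dimension(eigenvalues, lam):
    """Sum of s / (s + lam) over the given eigenvalues (trace of L (L + lam I)^{-1})."""
    return sum(s / (s + lam) for s in eigenvalues)


gamma = 0.5  # hypothetical capacity exponent
# Truncated spectrum s_i = i^(-1/gamma); the truncation level is an arbitrary choice.
eigenvalues = [i ** (-1.0 / gamma) for i in range(1, 200001)]

# Effective dimension rescaled by lam^gamma for a few regularization levels.
scaled = {lam: effective_dimension(eigenvalues, lam) * lam ** gamma
          for lam in (1e-1, 1e-2, 1e-3)}
for lam, value in scaled.items():
    print(lam, value)
```

The effective dimension itself grows as lam shrinks, but only at the rate lam^{-gamma}; a faster eigenvalue decay (smaller gamma) means a smaller hypothesis space in this sense.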

3.2 Main Results

We are now ready to present our main results, whose proofs are postponed to Section 7. Our first main result provides generalization error bounds for the studied algorithms with and a constant (but depending on ) step-size.

Theorem 3.1.

Under Assumptions 1, 2 and 3, let be generated by (2.6) with , for all such that

(3.4)

Then

(3.5)

Here, the constant in the right-hand side depends only on , and will be given explicitly in the proof.

According to (3.4), to derive a convergence result from the above theorem, one can choose with for some appropriate The error bound (3.5) is composed of two terms, which arise from estimating the initial and sample errors respectively in our proof, and are controlled by directly. A bigger may lead to a smaller initial error but may enlarge the sample error, while a smaller may reduce the sample error but may enlarge the initial error. Solving this trade-off leads to the best rate obtainable from the above theorem, which is stated next.

Corollary 3.2.

Under Assumptions 1, 2 and 3, let be generated by (2.6) with and

(3.6)

Then,

(3.7)

The above corollary asserts that with an appropriate fixed step-size, the doubly stochastic learning algorithm without regularization achieves generalization error bounds of order

As mentioned before, Assumption 3 is always satisfied with and , which is called the capacity independent case. Setting and in Corollary 3.2, we have the following result in the capacity independent case.

Corollary 3.3.

Under Assumptions 1 and 2, let be generated by (2.6) with and

Then,

The above corollary can be further simplified as follows if we consider the special case i.e., Assumption 2 with

Corollary 3.4.

Under Assumption 1, let and be generated by (2.6) with and Then,

Theorem 3.1 and its corollaries provide generalization error bounds for the studied algorithm without regularization in the fixed step-size setting. In the next theorem, we give generalization error bounds for the studied algorithm (2.6) without regularization in a decaying step-size setting.

Theorem 3.5.

Under Assumptions 1, 2 and 3, let and for all with and such that

(3.8)

where

(3.9)

Then, for any

(3.10)

Similarly, there is a trade-off problem in the error bounds of the above theorem. Balancing the last two terms of the error bounds, we get the following corollary.

Corollary 3.6.

Under Assumptions 1, 2 and 3, let and for all .
a) If , then by selecting and

(3.11)

b) If , then by selecting and

(3.12)

Corollary 3.6 asserts that with an appropriate choice of the decaying exponent for the step-size, the doubly stochastic learning algorithm without regularization has a generalization error bound of order when , or of order when . Comparing Corollary 3.2 with Corollary 3.6, the latter has a slower convergence rate when This suggests that the fixed step-size setting may be more favourable.

Theorems 3.1 and 3.5 provide generalization error bounds for doubly stochastic learning algorithms without regularization. In the next theorem, we give generalization error bounds for doubly stochastic learning algorithms with regularization.

Theorem 3.7.

Under Assumptions 1, 2 and 3, let , , for all , with , and such that (3.8). Then,

(3.13)

Balancing the two terms from the error bounds in the above theorem to optimize the bounds, we can get the following results.

Corollary 3.8.

Under Assumptions 1, 2 and 3, let , For all let and with . Then

(3.14)

The above corollary asserts that, for appropriate choices of the regularization parameter and the decaying exponent of the step-size, the doubly stochastic learning algorithm with regularization achieves generalization error bounds of order where can be arbitrarily close to zero. The convergence rate from Corollary 3.8 is essentially the same as that from Corollary 3.2 for . For the case , the best obtainable rate from Corollary 3.8 for the studied algorithm is of order . This type of phenomenon is called the saturation effect in learning theory. Note that kernel ridge regression also saturates when .

Discussions

We compare our results with those in [6]. A regularized version of doubly stochastic learning algorithms with a convex loss function was studied in [6]. When the loss function is the square loss, the algorithm in [6] is exactly Algorithm (2.6). [6, Theorem 6] asserts that with high probability, the learning sequence generated by (2.6) with and , satisfies

(3.15)

provided that . Here is the solution of the regularized risk minimization

Combining (3.15) with the fact that [19] under Assumption 2 with ,

one has

The optimal obtainable error bound is achieved by setting , in which case,

Comparing the above result with Corollaries 3.3 and 3.8, the error bounds (of order in the capacity independent case) from Corollaries 3.3 and 3.8 are better, and they do not require the boundedness assumption

We close with some issues that might be considered in future work. First, our generalization error bounds hold in expectation, and it would be interesting to derive high-probability error bounds. Second, the rates in our results are not optimal, and they might be improved by using a more involved technique (perhaps with a better estimate of the sample variance). Finally, in this paper we only consider simple stochastic gradient methods (SGM) with last iterates. It would be interesting to extend our analysis to different variants of SGM, such as fully online/stochastic learning [24, 22], SGM with mini-batches [4], the stochastic average gradient [16], averaging SGM [7], multi-pass SGM [10], and stochastic pairwise learning [26].

4 Error Decomposition

The rest of this paper is devoted to proving our main results. To this end, we need some preliminary analysis and a key error decomposition.

For notational simplicity, we denote by for any

, and set the residual vector

Since is generated by (2.6), subtracting from both sides of (2.6), by direct computations, one can easily prove that

(4.1)

where we denote

(4.2)

Using the iterated relationship (4.1) multiple times, we can prove the following error decomposition.

Proposition 4.1.

For any , we have the following error decomposition

(4.3)

where

(4.4)

and

(4.5)

Proof.

Using (4.1) iteratively, with and , we get

which is exactly

(4.6)

In the rest of the proof, we will write as for short, and use the notation for Following from (4.6), we get

From (2.6), we know that for any , depends only on and Also, note that the family is independent. Thus, we can prove that has the following vanishing property:

(4.7)

Therefore,

The proof is complete. ∎

The error decomposition (4.3) is fairly common in the analysis of standard stochastic/online learning algorithms [25]. The term is related to an initial error, which is deterministic and will be estimated in the next section. The term is a sample error depending on the sample, which will be estimated in Section 6.
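The mechanism behind such a decomposition is the unrolling of a linear recursion. The sketch below is a scalar caricature under stated assumptions, not the operator-valued identity (4.3): it takes a residual recursion of the form r_{t+1} = (1 - eta_t * a) * r_t + eta_t * m_t, where the number `a` stands in for the operator and `m_t` for a zero-mean noise term, and checks that the final residual splits into a deterministic part driven by r_1 and a part that is linear in the noise. All names are hypothetical.

```python
import random


def run_recursion(r1, etas, a, noise):
    """Iterate r_{t+1} = (1 - eta_t * a) * r_t + eta_t * m_t and return the last residual."""
    r = r1
    for eta, m in zip(etas, noise):
        r = (1.0 - eta * a) * r + eta * m
    return r


def tail_product(etas, a, start):
    """Product of (1 - eta_t * a) over indices t >= start (empty product equals 1)."""
    p = 1.0
    for eta in etas[start:]:
        p *= 1.0 - eta * a
    return p


def decomposition(r1, etas, a, noise):
    """Unrolled form: (deterministic initial part, noise-driven sample part)."""
    initial = tail_product(etas, a, 0) * r1
    sample = sum(eta * tail_product(etas, a, k + 1) * m
                 for k, (eta, m) in enumerate(zip(etas, noise)))
    return initial, sample


rng = random.Random(0)
T = 50
etas = [0.5 / (t + 1) ** 0.5 for t in range(T)]   # decaying step-sizes
noise = [rng.gauss(0.0, 1.0) for _ in range(T)]    # stand-in for the zero-mean terms
r1, a = -1.0, 0.8
direct = run_recursion(r1, etas, a, noise)
initial, sample = decomposition(r1, etas, a, noise)
print(direct, initial + sample)
```

Because the initial part is deterministic and the noise terms have zero mean, the cross term disappears when the squared norm is averaged, which is the role played by the vanishing property (4.7) in the proof.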

5 Estimating Initial Error

In this section, we will upper bound the initial error, namely, the first term of the right-hand side of (4.3). To this end, we introduce the following two lemmas.

Lemma 5.1.

Let , and be such that for all . Then for all

(5.1)

Proof.

(5.1) holds trivially for the case Now, we consider the case Recall that is a self-adjoint, compact, and positive operator on . According to the spectral theorem, has only non-negative singular values such that . Thus,

Letting for each , we have

Therefore, we get

When we have

When

From the above analysis, we can get (5.1). The proof is complete. ∎
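Spectral estimates of this kind reduce to a scalar inequality over the eigenvalues. The exact statement of (5.1) is not reproduced here; the sketch below checks the standard bound of this type, stated as an assumption: if 0 <= eta_k * x <= 1 for all k, then prod_k (1 - eta_k * x) * x^zeta <= (zeta / (e * sum_k eta_k))^zeta. It follows from 1 - u <= exp(-u) and from maximizing exp(-c x) x^zeta over x >= 0. The function names are hypothetical.

```python
import math


def filtered_power(x, etas, zeta):
    """Scalar version of the operator product: prod_k (1 - eta_k * x) * x^zeta."""
    p = 1.0
    for eta in etas:
        p *= 1.0 - eta * x
    return p * x ** zeta


def spectral_bound(etas, zeta):
    """Candidate upper bound (zeta / (e * sum of step-sizes))^zeta."""
    return (zeta / (math.e * sum(etas))) ** zeta


etas = [0.5 / math.sqrt(k) for k in range(1, 101)]   # decaying step-sizes, all <= 1/2
grid = [i / 1000.0 for i in range(1001)]              # eigenvalues in [0, 1]

# Smallest slack between the bound and the worst case over the grid, for a few exponents.
worst_gap = min(
    spectral_bound(etas, zeta) - max(filtered_power(x, etas, zeta) for x in grid)
    for zeta in (0.5, 1.0, 2.0)
)
print(worst_gap)
```

Applying the scalar inequality to every eigenvalue of a positive self-adjoint operator gives the corresponding bound in operator norm, which is how the spectral theorem enters the proof.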

Lemma 5.2.

Under the assumptions of Lemma 5.1, we have for and any non-negative integer

(5.2)

The above lemma is essentially proved in [25, 22]. For completeness, we provide a proof in the appendix.

Now, we can upper bound the initial error as follows.

Proposition 5.3.

Under Assumption 2, let for all , with such that and Then, for any

(5.3)

and

(5.4)

Proof.

Note that is given by (4.4). Thus, we have

(5.5)

With Assumption 2, we can write for some . We thus derive

Note that with satisfying and implies that for all Thus, we can use (5.1) and (5.2) to bound the last two terms and get that

(5.6)

Observe that

(5.7)

and that by the mean value theorem,