In supervised learning, given a sample of
pairs of inputs and outputs, the goal is to estimate a function to be used to predict future outputs based on observing only the corresponding inputs. The quality of an estimate is often measured in terms of the mean-squared prediction error, in which case the regression function is optimal.
Since the properties of the function to be estimated are not known a priori, nonparametric techniques, that can adapt their complexity to the problem at hand, are often key to good results. Kernel methods [15, 36]
are probably the most common nonparametric approaches to learning. They are based on choosing a reproducing kernel Hilbert space (RKHS) as the hypothesis space in the design of learning algorithms. A classical learning algorithm using kernel methods to perform learning tasks is kernel ridge regression (KRR), which is based on minimizing the sum of a data-fitting term and an explicit penalty term. The penalty term is used for regularization, and controls the complexity of the solution, preventing overfitting. The statistical properties of KRR have been studied extensively, see e.g.[5, 39], and are known to be optimal in a minmax sense . The drawbacks of KRR are mainly computational. Indeed, a standard implementation of KRR requires the computation of a linear system defined by a kernel matrix, which thus requires costs in time and in memory, where is the number of points. Such scalings are prohibitive when in large scale scenario, where the sample size is large. A possible alternative is considering learning algorithms based on iterative procedure [14, 51, 47]. In this kind of learning algorithms, an empirical objective function is optimized in an iterative way with no explicit constraint or penalization, and the regularization against overfitting is realized by early-stopping the empirical procedure. Early-stopping has certain computational advantage over KRR, as it does not require the computation of the inverse of a kernel matrix. Indeed, if the algorithm stops after iterations, the aggregate time complexity is for gradient descent [47, 30] and conjugate gradient methods , while for stochastic gradient methods (SGM) [32, 21].
Although the statistical aspects of early-stopping procedures are well understood, either the computation or the storage of these algorithms can be challenging for large datasets. Indeed, the storage and/or computational cost of these algorithms, are/is at least quadratic in the number of training examples, due to the storage and/or calculation of a fully empirical kernel matrix. To avoid storing and/or computing a large kernel matrix, a natural approach is to replace the standard kernel matrix with a smaller matrix obtained by subsampling [38, 44]
. Such an approach, referred to as Nyström method in machine learning, provides one of the main approaches towards kernel methods with large scale learning. Particularly, Nyström techniques are successfully used together with KRR[33, 45] while achieving optimal statistical results  in the random design setting. Moreover, it has recently been combined with early-stopping on batch gradient methods, and optimal statistical results in the fixed design setting are provided .
In this paper, we investigate stochastic gradient methods with Nyström subsampling (named as NySGM) in the nonparametric regression setting. At each iteration, NySGM updates its current solution by subtracting a scaled gradient estimate over a mini-batch of points drawn uniformly at random from the sample, and subsequently projecting onto an “empirically subsampling” space. The subsampling level, the number of iterations/passes, the step-size and the mini-batch size are then the free parameters to be determined. Our main results show how can these parameters be chosen so that the corresponding solutions achieve optimal learning errors in a variety of settings. In comparisons with state-of-the-art algorithms such as Nyström KRR and classic SGM, NySGM has the advantage either on the computation or on the storage, while achieving the same optimal error bounds, see Section 4 for details. For example, in the special case that no benign assumptions on the problem  are made, NySGM with suitable choices of parameters can lead to optimal learning rates after one pass over the data, where the costs are in time and in memory, compared to in time and in memory for Nyström KRR. Moreover, as will be seen in Section 3, our results indicate that using mini-batches can reduce the total computational cost, while achieving the same optimal statistical results. Such a result is somewhat surprising, as it is well-known that using mini-batches does not reduce the total computational cost for classical SGM. The proof for our main results is based on tools from concentration inequalities, operator theory and convex analysis, and it borrows idea from, e.g., [47, 37, 2, 48, 33].
The rest of this paper is organized as follows. In the next section, we introduce the nonparametric regression setting and NySGM. In Section 3, we present our theoretical results following with simple discussions. In Section 4, we discuss and compare our results with related work. All proofs for related results and equalities of this paper are given in Section 5 and the appendix.
2 Learning with Nyström Stochastic Gradient Methods
In this section, we first describe the learning setting and then introduce the studied algorithm.
2.1 Learning Problems
We consider a supervised learning problem. Let be a probability measure on a measure space where is the input space and is the output space. Here, is fixed but unknown. Its information can be only known through a sample of points, which we assume to be i.i.d..
Kernel methods are based on choosing a hypothesis space as a reproducing kernel Hilbert space (RKHS) associated with a reproducing kernel. Recall that a reproducing kernel is a symmetric function such that is positive semidefinite for any finite set of points in . The reproducing kernel defines a RKHS as the completion of the linear span of the set with respect to the inner product
Given only the sample , the goal is to solve the following expected risk minimization problem,
2.2 Nyström Stochastic Gradient Method
To solve the expected risk minimization problem, in this paper, we propose the following SGM, using mini-batches and Nyström subsampling. For the set of the first positive integers is denoted by .
Let Given any , let with . Let be the projection operator with its range as the subspace The Nyström stochastic gradient method (abbreviated as NySGM) is defined by and
At each iteration, the above algorithm updates its current solution by subtracting a scaled gradient estimate and then projecting onto . In comparison with the classic SGM from , the studied algorithm has an extra projection step in its iterative relationship. The projection step is a result of the subsampling technique. It ensures that the learning sequence always lies in , a smaller space than . When , the above algorithm is exactly the classic SGM studied in .
Note that there are not any explicit penalty terms in (2.2), in which case one does not need to tune the penalty parameter, and the only free parameters are the subsampling level , the step-size , the mini-batch size and the total number of iterations . Different choices of these parameters can lead to different strategies. In the coming subsection, we are particularly interested in the fixed step-size setting, i.e., for some , with or .
The total number of iterations can be bigger than the sample size , which means that the algorithm can use the data more than once, or in another words, we can run the algorithm with multiple passes over the data. Here and in what follows, the number of ‘passes’ over the data is referred to at -th iteration of the algorithm.
The aim of this paper is to derive generalization error bounds, i.e., the excess risk for the above algorithm. Throughout this paper, we assume that is non-increasing and with . We denote by the set and by the set .
2.3 Numerical Realizations
Algorithm 1 has different equivalent forms, which are easier to be implemented for numerical simulations. For any finite subsets and in , denote the cardinality of the set by and the kernel matrix by . Let be such that . Then as will be shown in Subsection 5.2, Algorithm 1 is equivalent to, with
Here, , and denotes the -th row of the matrix . Assuming that and that the cost of evaluating the kernel on a pair of sample points is , if the computer computes and stores and in the preprocessing and then updates by (2.3) based on and the space and time complexities for training this algorithm are
respectively. Alternatively, if the computer computes and stores in the preprocessing and then updates by (2.3) based on and the space and time complexities for training are
To see the performance of Algorithm 1, we carried out some simple numerical simulations on a simple problem. We constructed i.i.d. training examples of the form . Here, the regression function is the input is uniformly distributed in and
is a Gaussian noise with zero mean and standard deviationfor each For all the simulations, the RKHS is associated with a Gaussian kernel where , the mini-batch size , and the step-size , as suggested by Corollary 3.3 in Section 3. For each subsampling level , we ran NySGM 50 times. The mean and the standard deviation of the computed generalization errors over trials with respect to the number of passes are depicted in Figure 1. Here, the (approximated) generalization errors were computed over an empirical measure with points. As we see from the plots, NySGM performs well when the subsampling level Moreover, the minimal generalization error is achieved after some number of passes, and it is comparable with of KRR using cross-validation.
3 Generalization Error Bounds for NySGM
In this section, we present our main results on generalization errors for NySGM, followed by some simple discussions. Throughout this paper, we make the following basic assumptions.
is separable, is measurable and there exists a constant , such that for all
Furthermore, Problem (2.1) has at least a solution .
The boundedness assumption (3.1) is fairly common in standard learning theory. It can be satisfied, for example when the kernel is a Gaussian kernel. The condition on the existence of at least one minimizer in is for the sake of easy presentation. Such a condition can be relaxed, by using a more involved analysis as that in .
Under these basic assumptions, we can state our first theorem as follows. It provides generalization error bounds for the studied algorithms with different choices of the step-size, the mini-batch size and the total number of iterations.
Let a.s., ,
and . Consider Algorithm 1 with either of the following choices on , and :
I) for all ,
II) for all ,
Then with probability at least
Here, we use the notations to mean for some positive constant which is depending only (a polynomial function) on and and to mean .
We add some comments on the above results. First, the bounded output assumption is trivially satisfied for some learning problems such as binary classification problems where Second, the error bound in (3.2) is optimal up to a logarithmic factor, in the sense that it matches the minimax rate in  and those of kernel ridge regression [37, 5, 40]. Moreover, according to Theorem 3.1, NySGM with two different choices on the step-size and the mini-batch size can achieve optimal learning error bounds after one pass over the data, provided that the subsampling level . Thus, if the computer computes and stores and in the preprocessing and then updates by (2.3) based on and according to (2.4), the cost for NySGM with both (I) and (II) is in memory and in time, lower than in memory and in time required by Nyström KRR . Alternatively, if the computer computes and stores in the preprocessing and then updates by (2.3) based on and the cost is in memory and in time for NySGM (II), while in memory and in time for NySGM (I). Compared to in memory and in time for classic SGM, NySGM using mini-batches has lower computational cost. In this sense, using mini-batches can reduce the computational complexity. Finally, using mini-batches allows using a larger step-size, while achieving the same optimal error bounds.
Theorem 3.1 provides generalization error bounds for the studied algorithm, without considering the possible effect of benign assumptions on the problem. In the next theorem, we will show that when the learning problem satisfies some additional regularity and capacity assumptions, it is possible to achieve faster learning rates than . Also, the boundedness assumption on the output in Theorem 3.1
will be replaced by a less strict condition, the moment hypothesis onas follows.
There exists constants and such that
To present our next assumptions, we introduce the covariance operator , defined by Under Condition (3.1), is known to be positive definite and trace class. Thus, with can be defined by using the spectral theory. We make the following assumption on the regularity of the target function .
For some and ,
The above assumption is very standard [10, 32] in nonparametric regression. It characterizes how big the subspace that the target function lies in. Particularly, the bigger the is, the more stringent the assumption is and the smaller the subspace is, since when Moreover, when we are making no assumption.
The last assumption relates to the capacity of the hypothesis space.
For some and , satisfies
, or the degrees of freedom. It can be related to covering/entropy number conditions, see  for further details. Assumption 4 is always true for and , since
is a trace class operator which implies the eigenvalues of, denoted as , satisfy This is referred to as the capacity independent setting. Assumption 4 with allows to derive better error rates. It is satisfied, e.g., if the eigenvalues of satisfy a polynomial decaying condition , or with if is finite rank.
Now, we are ready to state our next theorem as follows.
The above result is a direct consequence of Theorem 5.22 in the coming Subsection 5.8, from which, the interested readers can also find convergence results for the decaying step-size setting, i.e., with , as well as the omitted constants. There are three terms in the upper bounds of (3.6
). The first two terms are related to the regularity of the target function and the sample size, and they arise from estimations of the projected bias and the sample variance. The last term results from estimating the computational variance due to the random choices of the points. Note that there is a trade-off between the first two terms. Stopping too earlier may lead to a large projected bias, while stopping too late may enlarge the sample variance. The optimal number of iterationsis thus achieved by balancing these two terms. Furthermore, to achieve optimal rates, it is necessary to choose a suitable step-size and a mini-batch size such that the computational variance is smaller than the first term of (3.7). In the next corollary, we provide different choices on step-size and mini-batch size to achieve optimal convergence rates.
We add some comments. First, the convergence rate in (3.8) is optimal up to a logarithmic factor, as it matches the minimax rate in . Thus, with a subsampling level , NySGM with suitable choices of step-size, mini-batch size and number of iterations/passes can generalize optimally. Second, different choices of step-size, mini-batch size and number of iterations/passes correspond to different regularization regimes. Particularly, in the last two regimes, the step-size and the mini-batch size are fixed as some universal constants, while the number of iterations/passes is depending on the unknown distribution parameters and . In this case, the only regularization parameter is the number of iterations/passes, which can be tuned by cross-validation in practice. Besides, the step-size and the number of iterations/passes in the first regime, or the mini-batch size and the number of iterations/passes in the second regime, depend on the unknown distribution parameters, and they can be tuned by cross-validation in practice. Third, according to Corollary 3.3, the number of passes needed for NySGM to generalize optimally is in the last two regimes, while in the first two regimes. In comparison, NySGM with the first two regimes has a smaller number of passes than that of the last two regimes. This indicates that NySGM with the first two regimes may have some certain advantage on computational complexity, although in this case the step-size or the mini-batch size might need to be tuned.
The next corollary is a direct consequence of Corollary 3.3 in the capacity independent case. Indeed, in the capacity independent case, as mentioned before, Assumption 4 is always satisfied with . Thus, following from Corollary 3.3, we have the following results.
From the above corollary, we see that NySGM achieves optimal capacity-independent rate with one pass over the data if the step-size or the mini-batch size is suitably chosen. Theorem 3.1 is a direct consequence of Corollary 3.4 in the special case that and a.s.. We finish this section with some remarks.
I) All results in Theorem 3.2 and its corollaries still hold when the condition on the subsampling level (3.5) is replaced by
Thus, if 222Note that this condition is always satisfied with as by (3.1),
. for all for some , then the condition (3.5)
can be replaced by
II) If we consider a more involved subsampling technique, the approximate leverage scores Nyström methods [13, 8, 1], as we will see in our proof, all results in Theorem 3.2 and its corollaries still hold with a less strict requirement on the subsampling level,
We must compare our results with related works. There is a large amount of work on online learning (OL) algorithms and, more generally, stochastic approximations, see, e.g., [31, 28, 7] and the references therein. Here, we briefly review some recent works on online learning algorithms in the framework of nonparametric regression with the square loss. In what follows, we used the term “online learning algorithm” to mean the “stochastic gradient method” that each sample point can only be used once. First, OL with regularization has been studied in [46, 41], where the recursion appears as
Here, is a regularization parameter. In particular, generalization error bounds of order in confidence were proved in  for OL with suitable choices of regularization parameter and step-size , when , without considering the capacity assumption, i.e.,. Comparing with our results for NySGM from Corollary 3.3.(II), as indicated in Table 1, the computational cost of NySGM is lower, while both algorithms have the same optimal rates in the capacity independent case. Second, [49, 48] studied unregularized OL, i.e., (4.1) with , where the derived convergence rates  are of order and are in expectation, without considering the capacity assumption. If we make no assumption on the capacity condition, the derived rate for NySGM in this paper is the same as that of  for unregularized OL for the case and the former has a lower computational cost. Note that NySGM saturates for , while OL from  does not. We conjecture that by considering a different subsampling technique, we may get optimal error bounds even for . Third, by considering an averaging scheme, optimal capacity dependent rates can be proved for OL with appropriate parameters, either without  or with  regularization. In comparisons of NySGM with averaged OL (AveOL) from , as indicated in Table 1, both algorithms have the same optimal rates. When the storage complexities for both algorithms are of the same orders, while the computational complexity for the former is lower than that of the latter. For the case NySGM seemly has higher storage requirement. But as we will see in Subsection 5.8, by considering the approximate leverage scores (ALS) Nyström methods or making an extra assumption on the input data, the subsampling level for NySGM can be further reduced, which thus potentially leads to a smaller computational complexity. However, the ALS subsampling technique, or the extra condition is less well understood and should be further studied in the future. Finally,  studied (multi-pass) SGM, i.e, Algorithm 1 without projection. With suitable parameter choices, SGM achieves optimal rate after some number of iterations/passes . In comparisons, again, the computational complexity for NySGM is lower than that of SGM from  when . All results mentioned in the above are summarized in Table 1.
Summary of assumptions and results for NySGM and related approaches including online learning (OL), averaged OL (AveOL) and SGM. Noted that all the logarithmic factors are ignored.
Next, we will briefly review some of the recent theoretical results on Nyström subsampling. Theoretical results considering the discrepancy between a given empirical kernel matrix and its subsampled version can be found in, e.g., [17, 13, 20] and references therein. While interesting in their own right, these latter results do not directly yield information on the generalization properties of the obtained algorithm. Results on this direction were first derived in e.g., [9, 19] with the fixed design regression setting and in  with the random design regression setting, both for Nystöm KRR. Particularly, a sharp learning rate of order was derived in  for Nyström KRR provided that the subsampling level . In comparison, both Nyström KRR and NySGM have the same requirement on subsampling level while sharing the same optimal rates. According to Corollary 3.3.(II) and Equation (2.4), the cost for NySGM are in memory and in time, compared to in memory and in time of Nyström KRR from . The most related to our work is , where a regularized OL with Nyström is investigated. Certain convergence results on regret errors with respect to the KRR estimator were shown in  under a bounded assumption on the gradient estimates, but the generalization properties are less clear. Moreover, the derived rates are capacity-independent and both the derived rates and the subsampling level tend to be suboptimal, as the error bounds are depending directly on the discrepancy between a given empirical kernel matrix and its subsampled version.
Note that in this paper, we assume that the parameter choices on step-size, mini-batch size, number of iterations and subsampling level, involved in our theoretical results can be given in advance, and we did not consider model selection of these parameters. In practice, model selection on these parameters can be possibly realized by a cross-validation approach [39, 6], and one can possibly prove that such an approach can lead to the same statistical results. Indeed, we will introduce the cross-validation approach for tuning the step-size in Subsection 5.9, and prove theoretical results for such an approach. For model selection on the other parameters, we left it as an open problem in the future.
We end up this section with some remarks and future issues. First, in this paper, all derived convergence results for NySGM hold for the case that Problem (2.1) has at least one solution . In the case that the condition is not satisfied, one can possibly derive similar results as those in , by a more involved argument. Second, in this paper, we only prove results on convergence in -norm, but the extension to results in -norm, and moreover the ‘middle’ norm between -norm and , are possible. Third, all results in this paper are stated for a real-valued output space case, but they can be easily extended to the case that the output space is a general Hilbert space, as those in . Fourth, we didn’t try to optimize the conditions and error bounds, some of which can be further improved by a more involved argument. In particular, the boundedness assumption (3.1) can be possibly replaced by Also, the logarithmic factor from the derived error bounds can be possibly removed either using a more involved argument or considering a proper non-uniform averaging scheme as that in . Finally, using the techniques developed in this paper, it would be interesting to study stochastic gradient methods with a preconditioned operator as that in , or random features . Also, rather than considering a simple stochastic gradient method, it would be interesting to consider more sophisticated, accelerated iterations , and assess the potential advantages in terms of computational and generalization aspects.
This section is devoted to the proof of all related equations and results stated in the last sections. We begin in the next subsection with the basic notations.
Denote as the induced marginal measure of on , and as the conditional probability measure on with respect to and . The function minimizing the expected risk over all measurable functions is the regression function, which is given by
The Hilbert spaces of square integral functions with respect to , with its induced norm given by , is denoted by Under Assumption 1, we know that the projection of the regression function onto the closer of in , lies in , and is a solution of the normalized embedding equation
For any and , the following well known reproducing property holds:
For any the set of the first positive integers is denoted by . for and for any operator where is a Hilbert space and denotes the identity operator on . denotes the expectation of a random variable For a given bounded operator denotes the operator norm of , i.e., . We will use the conventional notations on summation and production: and denotes the supreme norm with respect to
We introduce the inclusion operator , which is continuous under Assumption (3.1). Furthermore, we consider the adjoint operator , the covariance operator given by , and the operator given by It can be easily proved that and The operators and can be proved to be positive trace class operators (and hence compact). For any function , the -norm can be related to the -norm by 
We define the sampling operator (with respect to ) by , where the norm is the standard Euclidean norm. Its adjoint operator defined by for is thus given by
Moreover, we can define the empirical covariance operator (with respect to ) such that . Obviously,
Finally, we can define the sampling and empirical covariance operators with respect to any given set , in a similar way. For any finite subsets and in , denote the kernel matrix by . Obviously,
For notational simplicity, we let and for any
Let be the SVD of , where , and with and . Then the orthogonal projection operator is given by
For any , we define the random variable with distributed according to and let
5.2 Equivalent Forms of NySGM
which leads to
which is exactly (2.2).
In the next subsections 5.3-5.8, we will give the proof of Theorem 3.2. The proof is quite lengthy. The key is an error decomposition similar as that for classic SGM in , and the basic tools are some concentration inequalities, operator inequalities and estimates which have already been broadly used in the literature, e.g., [47, 37, 5, 2, 48, 41, 33, 32, 21].
5.3 Preliminarily Inequalities
In this subsection, we introduce some concentration inequalities, operator inequalities and basic estimates that are necessary to the proof of Theorem 3.2. Proofs for some of these inequalities can be found in Appendix.
Let and . With probability at least the following holds:
Moreover, if then with probability at least
The above result also holds when replacing with . Particularly, (since implied by (3.1)) if with probability at least