1 Introduction
While nonparametric models offer great flexibility, they can also lead to overfitting, and thus poor generalization performance. For this reason, it is well-understood that procedures for fitting nonparametric models must involve some form of regularization. When models are fit via a form of empirical risk minimization, the most classical form of regularization is based on adding some type of penalty to the objective function. An alternative form of regularization is based on the principle of
early stopping, in which an iterative algorithm is run for a prespecified number of steps and terminated prior to convergence. While the basic idea of early stopping is fairly old (e.g., [32, 1, 36]
), recent years have witnessed renewed interest in its properties, especially in the context of boosting algorithms and neural network training (e.g.,
[26, 13]). Over the past decade, a line of work has yielded some theoretical insight into early stopping, including works on classification error for boosting algorithms [4, 14, 19, 24, 40, 41], boosting algorithms for regression [9, 8], and similar gradient algorithms in reproducing kernel Hilbert spaces (e.g., [12, 11, 35, 40, 27]). A number of these papers establish consistency results for particular forms of early stopping, guaranteeing that the procedure outputs a function with statistical error that converges to zero as the sample size increases. On the other hand, there are relatively few results that actually establish rate optimality of an early stopping procedure, meaning that the achieved error matches known statistical minimax lower bounds. To the best of our knowledge, Bühlmann and Yu [9] were the first to prove optimality for early stopping of boosting as applied to spline classes, albeit with a rule that was not computable from the data. Subsequent work by Raskutti et al. [27] refined this analysis of boosting for kernel classes and first established an important connection to the localized Rademacher complexity; see also the related work [40, 28, 10] with rates for particular kernel classes. More broadly, relative to our rich and detailed understanding of regularization via penalization (e.g., see the books [18, 34, 33, 38] and papers [3, 21]
for details), our understanding of early stopping regularization is not as well developed. Intuitively, early stopping should depend on the same bias-variance trade-offs that control estimators based on penalization. In particular, for penalized estimators, it is now well-understood that complexity measures such as the
localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates [3, 21, 33, 38]. Is such a general and sharp characterization also possible in the context of early stopping? The main contribution of this paper is to answer this question in the affirmative for the early stopping of boosting algorithms for a certain class of regression and classification problems involving functions in reproducing kernel Hilbert spaces (RKHS). A standard way to obtain a good estimator or classifier is by minimizing some penalized form of a loss function, of which the method of kernel ridge regression
[37] is a popular choice. Instead, we consider an iterative update involving the kernel that is derived from a greedy update. Borrowing tools from empirical process theory, we are able to characterize the "size" of the effective function space explored by taking a given number of steps, and then to connect the resulting estimation error naturally to the notion of localized Gaussian width defined with respect to this effective function space. This leads to a principled analysis for a broad class of loss functions used in practice, including the loss functions that underlie the L2Boost, LogitBoost, and AdaBoost algorithms, among other procedures. The remainder of this paper is organized as follows. In Section 2, we provide background on boosting methods and reproducing kernel Hilbert spaces, and then introduce the updates studied in this paper. Section 3 is devoted to statements of our main results, followed by a discussion of their consequences for particular function classes in Section 4. We provide simulations that confirm the practical effectiveness of our stopping rules, and show close agreement with our theoretical predictions. In Section 5, we provide the proofs of our main results, with certain more technical aspects deferred to the appendices.
2 Background and problem formulation
The goal of prediction is to learn a function that maps covariates to responses. In a regression problem, the responses are typically real-valued, whereas in a classification problem, the responses take values in a finite set. In this paper, we study both regression and classification problems (e.g., the binary case). Our primary focus is on the case of fixed design, in which we observe a collection of covariate-response pairs, where each covariate is fixed, and each response is drawn independently from a distribution which depends on the corresponding covariate. Later in the paper, we also discuss the consequences of our results for the case of random design, where the
pairs are drawn in an i.i.d. fashion from the joint distribution
for some distribution on the covariates. In this section, we provide some necessary background on a gradient-type algorithm that is often referred to as a boosting algorithm. We also briefly discuss reproducing kernel Hilbert spaces before turning to a precise formulation of the problem studied in this paper.
2.1 Boosting and early stopping
Consider a cost function φ, where the nonnegative scalar φ(y, θ) denotes the cost associated with predicting θ when the true response is y. Some common examples of loss functions that we consider in later sections include:

the least-squares loss φ(y, θ) = (y − θ)²/2 that underlies L2Boost [9],

the logistic regression loss φ(y, θ) = ln(1 + e^(−yθ)) that underlies the LogitBoost algorithm [15, 16], and

the exponential loss φ(y, θ) = exp(−yθ) that underlies the AdaBoost algorithm [14].
The least-squares loss is typically used for regression problems (e.g., [9, 12, 11, 35, 40, 27]), whereas the latter two losses are frequently used in the setting of binary classification (e.g., [14, 24, 16]).
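To make these losses concrete, here is a minimal Python sketch of the three scalar losses above, together with their derivatives in the prediction argument (these derivatives drive the gradient-type updates discussed below). The function names are ours, and the classification losses assume responses in {−1, +1}:

```python
import numpy as np

def least_squares(y, theta):
    """Least-squares loss underlying L2Boost."""
    return 0.5 * (y - theta) ** 2

def least_squares_grad(y, theta):
    return theta - y

def logistic(y, theta):
    """Logistic loss underlying LogitBoost; assumes y in {-1, +1}."""
    return np.log1p(np.exp(-y * theta))

def logistic_grad(y, theta):
    return -y / (1.0 + np.exp(y * theta))

def exponential(y, theta):
    """Exponential loss underlying AdaBoost; assumes y in {-1, +1}."""
    return np.exp(-y * theta)

def exponential_grad(y, theta):
    return -y * np.exp(-y * theta)
```

All three are convex and differentiable in θ, which is what the functional gradient viewpoint of boosting exploits.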
Given some loss function , we define the population cost functional via
(1) 
Note that with the covariates fixed, the functional is a non-random object. Given some function space, the optimal function (whose uniqueness, as clarified in the sequel, is guaranteed by our assumptions) minimizes the population cost functional; that is,
(2) 
As a standard example, when we adopt the least-squares loss, the population minimizer corresponds to the conditional expectation of the response given the covariate.
Since we do not have access to the population distribution of the responses, however, the computation of the population minimizer is impossible. Given our samples, we consider instead some procedure applied to the empirical loss
(3) 
where the population expectation has been replaced by an empirical expectation. For example, when the loss corresponds to the negative log likelihood of the samples, direct unconstrained minimization of the empirical loss would yield the maximum likelihood estimator.
It is well-known that direct minimization of the empirical loss over a sufficiently rich function class may lead to overfitting. There are various ways to mitigate this phenomenon, among which the most classical method is to minimize the sum of the empirical loss and a penalty regularization term. Adjusting the weight on the regularization term allows for a trade-off between fit to the data and some form of regularity or smoothness in the fit. The behavior of such penalized or regularized estimation methods is now quite well understood (for instance, see the books [18, 34, 33, 38] and papers [3, 21] for more details).
In this paper, we study a form of algorithmic regularization, based on applying a gradient-type algorithm to the empirical loss but then stopping it "early", that is, after some fixed number of steps. Such methods are often referred to as boosting algorithms, since they involve "boosting" or improving the fit of a function via a sequence of additive updates (see e.g. [29, 14, 7, 6, 30]). Many boosting algorithms, among them AdaBoost [14], L2Boost [9] and LogitBoost [15, 16], can be understood as forms of functional gradient methods [24, 16]; see the survey paper [8] for further background on boosting. The way in which the number of steps is chosen is referred to as a stopping rule, and the overall procedure is referred to as early stopping of a boosting algorithm.
[Figure 1: Squared error versus iteration number for (a) LogitBoost and (b) AdaBoost; see Section 4.2 for experimental details.]
In more detail, a broad class of boosting algorithms [24] generate a sequence via updates of the form
(4) 
where the scalars form a sequence of step sizes chosen by the user, and the constraint defines the unit ball in a given function class,
denotes the gradient taken at the vector
, and is the usual inner product between vectors. For non-decaying step sizes and a convex objective, running this procedure for an infinite number of iterations will lead to a minimizer of the empirical loss, thus causing overfitting. In order to illustrate this phenomenon, Figure 1 provides plots of the squared error versus the iteration number, for LogitBoost in panel (a) and AdaBoost in panel (b). See Section 4.2 for more details on exactly how these experiments were conducted. In the plots in Figure 1, the dotted line indicates the minimum mean-squared error over all iterates of that particular run of the algorithm. Both plots are qualitatively similar, illustrating the existence of a "good" number of iterations to take, after which the MSE greatly increases. Hence a natural problem is to decide at what iteration to stop such that the iterate satisfies bounds of the form
(5) 
with high probability. Here
indicates an inequality that holds up to a universal constant. The main results of this paper provide a stopping rule for which bounds of the form (5) do in fact hold with high probability over the randomness in the observed responses. Moreover, as shown by our later results, under suitable regularity conditions, the expectation of the minimum squared error is proportional to the statistical minimax risk, where the infimum defining that risk is taken over all possible estimators. Note that the minimax risk provides a fundamental lower bound on the performance of any estimator uniformly over the function space. Coupled with our stopping time guarantee (5), we are guaranteed that our estimate achieves the minimax risk up to constant factors. As a result, our bounds are unimprovable in general (see Corollary 2).
2.2 Reproducing Kernel Hilbert Spaces
The analysis of this paper focuses on algorithms with the update (4) when the function class is a reproducing kernel Hilbert space (RKHS, see standard sources [37, 17, 31, 5]), consisting of functions mapping a domain to the real line . Any RKHS is defined by a bivariate symmetric kernel function which is required to be positive semidefinite, i.e. for any integer and a collection of points in , the matrix is positive semidefinite.
The associated RKHS is the closure of the linear span of functions in the form , where is some collection of points in , and is a realvalued sequence. For two functions which can be expressed as a finite sum and , the inner product is defined as with induced norm . For each , the function belongs to , and satisfies the reproducing relation
(6) 
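As a quick numerical illustration of the reproducing relation (6), the following sketch builds a finite kernel expansion with a Gaussian kernel (an arbitrary illustrative choice of bounded positive semidefinite kernel) and checks that the RKHS inner product against a kernel section recovers a point evaluation. All names and parameter values are ours:

```python
import numpy as np

def k(x, z):
    """Gaussian kernel, one illustrative bounded PSD kernel."""
    return np.exp(-0.5 * (x - z) ** 2)

# f = sum_i w_i k(x_i, .) is a finite expansion in kernel sections.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=5)   # centers x_i
w = rng.normal(size=5)           # coefficients omega_i

def f(x):
    return sum(wi * k(xi, x) for wi, xi in zip(w, xs))

def inner(w1, c1, w2, c2):
    """RKHS inner product of two finite kernel expansions."""
    return sum(a * b * k(u, v) for a, u in zip(w1, c1)
                               for b, v in zip(w2, c2))

# Reproducing relation (6): <f, k(., x)>_H = f(x).
x0 = 0.37
assert abs(inner(w, xs, [1.0], [x0]) - f(x0)) < 1e-12

# Bonus: |f(x)| <= ||f||_H * sqrt(k(x, x)), the sup-norm bound used later.
hnorm = np.sqrt(inner(w, xs, w, xs))
assert abs(f(x0)) <= hnorm * np.sqrt(k(x0, x0)) + 1e-12
```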
Moreover, when the covariates are drawn i.i.d. from a distribution supported on the domain, we can invoke Mercer's theorem, which states that any function in the RKHS admits an expansion of the form
(7) 
where are the eigenvalues of the kernel function and
are eigenfunctions of
which form an orthonormal basis with respect to the corresponding inner product. We refer the reader to the standard sources [37, 17, 31, 5] for more details on RKHSs and their properties. Throughout this paper, we assume that the kernel function is uniformly bounded, meaning that its diagonal values are bounded by a universal constant. Such a boundedness condition holds for many kernels used in practice, including the Gaussian, Laplacian, and Sobolev kernels, other types of spline kernels, as well as any trace-class kernel with trigonometric eigenfunctions. By rescaling the kernel as necessary, we may assume without loss of generality that this bound equals one. As a consequence, for any function with Hilbert norm at most one, the reproducing relation implies that the function is also uniformly bounded by one.
Given samples , by the representer theorem [20], it is sufficient to restrict ourselves to the linear subspace , for which all can be expressed as
(8) 
for some coefficient vector . Among those functions which achieve the infimum in expression (1), let us define as the one with the minimum Hilbert norm. This definition is equivalent to restricting to be in the linear subspace .
2.3 Boosting in kernel spaces
For a finite collection of covariates, let us define the normalized kernel matrix with entries given by the kernel evaluations scaled by the sample size. Since we can restrict the minimization of the empirical and population losses from the full space to the subspace without loss of generality, using expression (8) we can write the function value vectors as matrix-vector products with this kernel matrix. As there is a one-to-one correspondence between function value vectors and the corresponding functions in the subspace by the representer theorem, minimization of an empirical loss in the subspace essentially becomes a finite-dimensional problem of fitting a response vector over a kernel-determined set. In the sequel, all updates will thus be performed on the function value vectors.
With a change of variable we then have
In this paper, we study the choice in the boosting update (4), so that the function value iterates take the form
(9) 
where a constant step size is used. With this choice of update and initialization, all iterates remain in the range space of the kernel matrix.
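As a hedged illustration of the update (9) and of the overfitting phenomenon depicted in Figure 1, the following numpy sketch runs the least-squares instance of the kernel boosting recursion with a first-order Sobolev kernel on synthetic data. The normalization K_ij = min(x_i, x_j)/n, the step size, and the data model are illustrative assumptions of ours, not the paper's exact experimental settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(0, 1, n)
fstar = np.sin(2 * np.pi * x)            # true regression function values
y = fstar + 0.5 * rng.normal(size=n)     # noisy responses

# First-order Sobolev kernel, normalized as K_ij = min(x_i, x_j)/n
# (our illustrative convention).
K = np.minimum.outer(x, x) / n

# Least-squares kernel boosting: theta^{t+1} = theta^t - alpha K (theta^t - y),
# started from theta^0 = 0; alpha is an assumed small constant step size.
alpha = 0.75
theta = np.zeros(n)
errors = []
for t in range(10000):
    theta = theta - alpha * K @ (theta - y)
    errors.append(np.mean((theta - fstar) ** 2))

best_t = int(np.argmin(errors))
assert 0 < best_t < len(errors) - 1      # the best iterate is strictly interior
assert errors[-1] > errors[best_t]       # running "forever" overfits toward y
```

The recorded error traces out the U-shape of Figure 1: it decreases while bias is being removed, then rises again as the iterates fit the noise in y.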
In this paper, we consider the following three error measures for an estimate:
Excess risk: the gap between the population loss of the estimate and that of the target function;
Empirical error: the average squared difference between the estimate and the target over the design points;
Population error: the squared population norm of the difference,
where the expectation in the norm is taken over random covariates which are independent of the samples used to form the estimate . Our goal is to propose a stopping time such that the averaged function satisfies bounds of the type (5). We begin our analysis by focusing on the empirical error, but as we will see in Corollary 1, bounds on the empirical error are easily transformed to bounds on the population error. Importantly, we exhibit such bounds with a statistical error term that is specified by the localized Gaussian complexity of the kernel class.
3 Main results
We now turn to the statement of our main results, beginning with the introduction of some regularity assumptions.
3.1 Assumptions
Recall from our earlier setup that we differentiate between the empirical loss function in expression (3), and the population loss in expression (1). Apart from assuming differentiability of both functions, all of our remaining conditions are imposed on the population loss. Such conditions at the population level are weaker than their analogues at the empirical level.
For a given radius , let us define the Hilbert ball around the optimal function as
(10) 
Our analysis makes particular use of this ball defined for the radius where the effective noise level is defined in the sequel.
We assume that the population loss is strongly convex and smooth over the Hilbert ball around the optimal function, meaning that a local strong convexity and smoothness condition holds for all functions in this ball and all design points. In addition, we assume that the gradient of the loss is Lipschitz in its second argument over the relevant interval. To be clear, the gradient here is taken with respect to the vector of function values. It can be verified by a straightforward computation that when the loss is induced by the least-squares cost, the condition holds with equal strong convexity and smoothness constants. The logistic and exponential losses satisfy this condition as well (see supplementary material), where it is key that we have imposed the condition only locally on the Hilbert ball.
In addition to the least-squares cost, our theory also applies to losses induced by scalar functions that satisfy a uniform boundedness condition. This condition holds for the logistic loss for all response values, and for the exponential loss in binary classification, using our kernel boundedness condition. Note that whenever this condition holds with some finite bound, we can always rescale the scalar loss so that it holds with bound one, and we do so in order to simplify the statement of our results.
3.2 Upper bound in terms of localized Gaussian width
Our upper bounds involve a complexity measure known as the localized Gaussian width. In general, Gaussian widths are widely used to obtain risk bounds for least-squares and other types of estimators. In our case, we consider Gaussian complexities for "localized" sets of the form
(11) 
with . The Gaussian complexity localized at scale is given by
(12) 
where denotes an i.i.d. sequence of standard Gaussian variables.
An essential quantity in our theory is specified by a certain fixed point equation that is now standard in empirical process theory [33, 3, 21, 27]. Let us define the effective noise level
(13) 
The critical radius is the smallest positive scalar such that
(14) 
We note that past work on localized Rademacher and Gaussian complexity [25, 3] guarantees that there exists a unique that satisfies this condition, so that our definition is sensible.
3.2.1 Upper bounds on excess risk and empirical error
With this setup, we are now equipped to state our main theorem. It provides high-probability bounds on the excess risk and empirical error of the estimator defined by averaging the iterates of the algorithm. It applies to both the least-squares cost function and, more generally, to any loss function satisfying the local strong convexity/smoothness condition and the boundedness condition.
Theorem 1.
Suppose that the sample size is large enough such that , and that we compute the sequence using the update (9) with initialization and any step size . Then for any iteration , the averaged function estimate satisfies the bounds
(15a)  
(15b) 
where both inequalities hold with probability at least .
A few comments about the constants in our statement: in all cases, lowercase constants are universal, whereas capital constants may depend on parameters of the joint distribution and the population loss. In Theorem 1, the relevant constants can be taken explicitly. While inequalities (15a) and (15b) are stated as high-probability results, similar bounds for the expected loss (over the response, with the design fixed) can be obtained by a simple integration argument.
In order to gain intuition for the claims in the theorem, note that apart from factors depending on , the first term dominates the second term whenever . Consequently, up to this point, taking further iterations reduces the upper bound on the error. This reduction continues until we have taken of the order many steps, at which point the upper bound is of the order .
More precisely, suppose that we perform the updates with step size ; then, after a total number of many iterations, the extension of Theorem 1 to expectations guarantees that the mean squared error is bounded as
(16) 
where is another constant depending on . Here we have used the fact that in simplifying the expression. It is worth noting that guarantee (16) matches the best known upper bounds for kernel ridge regression (KRR); indeed, this must be the case, since a sharp analysis of KRR is based on the same notion of localized Gaussian complexity (e.g., [2, 3]). Thus, our results establish a strong parallel between the algorithmic regularization of early stopping, and the penalized regularization of kernel ridge regression. Moreover, as will be clarified in Section 3.3, under suitable regularity conditions on the RKHS, the critical squared radius also acts as a lower bound for the expected risk, meaning that our upper bounds are not improvable in general.
Note that the critical radius depends on our observations only through the solution of inequality (14). In many cases, it is possible to compute and/or upper bound this critical radius, so that a concrete and valid stopping rule can indeed be calculated in advance. In Section 4, we provide a number of settings in which this can be done in terms of the eigenvalues of the normalized kernel matrix.
3.2.2 Consequences for random design regression
Thus far, our analysis has focused purely on the case of fixed design, in which the sequence of covariates is viewed as fixed. If we instead view the covariates as being sampled in an i.i.d. manner from some distribution over , then the empirical error of a given estimate is a random quantity, and it is interesting to relate it to the squared population norm .
In order to state an upper bound on this error, we introduce a population analogue of the critical radius , which we denote by . Consider the set
(17) 
It is analogous to the previously defined set , except that the empirical norm has been replaced by the population version. The population Gaussian complexity localized at scale is given by
(18) 
where the first set of variables is an i.i.d. sequence of standard normal variates, and the covariates form a second i.i.d. sequence, independent of the normal variates, drawn according to the covariate distribution. Finally, the population critical radius is defined by the fixed point inequality (14), in which the empirical localized Gaussian complexity is replaced by its population analogue (18).
Corollary 1.
In addition to the conditions of Theorem 1, suppose that the sequence of covariate-response pairs are drawn i.i.d. from some joint distribution, and we compute the boosting updates with step size and initialization . Then the averaged function estimate at time satisfies the bound
with probability at least over the random samples.
The proof of Corollary 1 follows directly from standard empirical process theory bounds [3, 27] on the difference between empirical risk and population risk. In particular, it can be shown that the empirical and population norms differ only by a factor proportional to the critical radius. Furthermore, one can show that the empirical critical quantity is bounded by its population analogue. By combining both arguments, the corollary follows. We refer the reader to the papers [3, 27] for further details on such equivalences.
It is worth comparing this guarantee with the past work of Raskutti et al. [27], who analyzed the kernel boosting iterates of the form (9), but with attention restricted to the special case of the least-squares loss. Their analysis was based on first decomposing the squared error into bias and variance terms, then carefully relating the combination of these terms to a particular bound on the localized Gaussian complexity (see equation (19) below). In contrast, our theory more directly analyzes the effective function class that is explored by taking steps, so that the localized Gaussian width (18) appears more naturally. In addition, our analysis applies to a broader class of loss functions.
In the case of reproducing kernel Hilbert spaces, it is possible to sandwich the localized Gaussian complexity by a function of the eigenvalues of the kernel matrix. Mendelson [25] provides this argument in the case of the localized Rademacher complexity, but similar arguments apply to the localized Gaussian complexity. Letting denote the ordered eigenvalues of the normalized kernel matrix , define the function
(19) 
Up to a universal constant, this function is an upper bound on the Gaussian width for all , and up to another universal constant, it is also a lower bound for all .
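The eigenvalue bound (19) is what makes the critical radius computable in practice. The following sketch implements the function R(δ) from the kernel eigenvalues and solves a fixed point inequality of the form R(δ) ≤ δ²/σ by bisection; the precise constants in inequality (14) differ, so this should be read as capturing only the scaling, with all names ours:

```python
import numpy as np

def R_hat(delta, mu, n):
    """Eigenvalue upper bound (19) on the localized Gaussian width."""
    return np.sqrt(np.sum(np.minimum(delta ** 2, mu)) / n)

def critical_radius(mu, n, sigma=1.0, lo=1e-8, hi=10.0, iters=100):
    """Smallest delta with R_hat(delta) <= delta^2 / sigma, by bisection.

    R_hat(delta)/delta is nonincreasing in delta, so the crossing point
    is unique and bisection applies; the constants are illustrative.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if R_hat(mid, mu, n) <= mid ** 2 / sigma:
            hi = mid
        else:
            lo = mid
    return hi

# Polynomially decaying eigenvalues mu_j ~ j^{-2} (first-order Sobolev-like):
n = 1000
mu = np.arange(1, n + 1, dtype=float) ** -2.0
delta_n = critical_radius(mu, n)
# For this decay, theory predicts delta_n^2 ~ n^{-2/3} up to constants.
```

Doubling n by a factor of 8 should shrink the squared critical radius by roughly 8^(2/3) = 4, which is easy to check numerically.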
3.3 Achieving minimax lower bounds
In this section, we show that the upper bound (16) matches known minimax lower bounds on the error, so that our results are unimprovable in general. We establish this result for the class of regular kernels, as previously defined by Yang et al. [39], which includes the Gaussian and Sobolev kernels as special cases.
The class of regular kernels is defined as follows. Let denote the ordered eigenvalues of the normalized kernel matrix , and define the quantity . A kernel is called regular whenever there is a universal constant such that the tail sum satisfies . In words, the tail sum of the eigenvalues for regular kernels is roughly on the same or smaller scale as the sum of the eigenvalues bigger than .
For such kernels and under the Gaussian observation model (), Yang et al. [39] prove a minimax lower bound involving . In particular, they show that the minimax risk over the unit ball of the Hilbert space is lower bounded as
(20) 
Comparing the lower bound (20) with upper bound (16) for our estimator stopped after many steps, it follows that the bounds proven in Theorem 1 are unimprovable apart from constant factors.
We now state a generalization of this minimax lower bound, one which applies to a subclass of generalized linear models, or GLM for short. In these models, the conditional distribution of the observed vector given takes the form
(21) 
where is a known scale factor and is the cumulant function of the generalized linear model. As some concrete examples:

The linear Gaussian model is recovered by setting and .

The logistic model for binary responses is recovered by setting and .
Our minimax lower bound applies to the class of GLMs for which the cumulant function is twice differentiable with a uniformly bounded second derivative. This class includes the linear, logistic, and multinomial families, among others, but excludes (for instance) the Poisson family. Under this condition, we have the following:
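The bounded-second-derivative condition on the cumulant function is easy to check numerically. The sketch below encodes the cumulants of the Gaussian, logistic, and Poisson families and verifies by finite differences that the first two have uniformly bounded curvature while the Poisson cumulant does not; all function names are ours:

```python
import numpy as np

# Cumulant functions of three generalized linear models.
def phi_gaussian(t):   # linear Gaussian model
    return 0.5 * t ** 2

def phi_logistic(t):   # binary logistic model
    return np.logaddexp(0.0, t)   # log(1 + e^t), numerically stable

def phi_poisson(t):    # Poisson model (excluded by our assumption)
    return np.exp(t)

def second_derivative(phi, t, h=1e-4):
    """Central finite-difference estimate of phi''(t)."""
    return (phi(t + h) - 2 * phi(t) + phi(t - h)) / h ** 2

ts = np.linspace(-10, 10, 201)
# Gaussian: phi'' = 1; logistic: phi'' = sigmoid(t)(1 - sigmoid(t)) <= 1/4.
assert all(abs(second_derivative(phi_gaussian, t) - 1.0) < 1e-3 for t in ts)
assert all(second_derivative(phi_logistic, t) <= 0.25 + 1e-3 for t in ts)
# Poisson: phi''(t) = e^t is unbounded as t grows.
assert second_derivative(phi_poisson, 10.0) > 1e4
```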
Corollary 2.
Suppose that we are given i.i.d. samples from a GLM (21) for some function in a regular kernel class with . Then running iterations with step size and yields an estimate such that
(22) 
Here the relation in (22) denotes equality up to a universal constant. As always, in the minimax claim (22), the infimum is taken over all measurable functions of the input data, and the expectation is taken over the randomness of the response variables. Given the upper bound of Theorem 1, the way to prove bound (22) is by establishing a matching minimax lower bound. See Section 5.2 for the proof of this result. At a high level, the statement in Corollary 2 shows that early stopping prevents us from overfitting to the data; in particular, using the stopping time yields an estimate that attains the optimal balance between bias and variance.
4 Consequences for various kernel classes
In this section, we apply Theorem 1 to derive some concrete rates for different kernel spaces and then illustrate them with some numerical experiments. It is known that the complexity of an RKHS associated with a distribution over the covariates can be characterized by the decay rate (7) of the eigenvalues of the kernel function. In the finite sample setting, the analogous quantities are the eigenvalues of the normalized kernel matrix. The representation power of a kernel class is directly correlated with the eigendecay: the faster the decay, the smaller the function class. When the covariates are drawn from the underlying distribution, empirical process theory guarantees that the empirical and population eigenvalues are close.
4.1 Theoretical predictions as a function of decay
In this section, let us consider two broad types of eigendecay:

exponential decay: For some , the kernel matrix eigenvalues satisfy a decay condition of the form , where are universal constants. Examples of kernels in this class include the Gaussian kernel, which for the Lebesgue measure satisfies such a bound with (real line) or (compact domain).

polynomial decay: For some , the kernel matrix eigenvalues satisfy a decay condition of the form , where is a universal constant. Examples of kernels in this class include the Sobolev spaces of a fixed order with Lebesgue measure on a bounded domain. We consider Sobolev spaces that consist of functions whose weak derivatives up to the given order are Lebesgue integrable. For such classes, the polynomial decay condition holds.
Given eigendecay conditions of these types, it is possible to compute an upper bound on the critical radius. In particular, using the fact that the function from equation (19) upper bounds the localized Gaussian width, one can derive explicit scalings of the critical radius for both exponentially and polynomially decaying kernels, up to universal constants. Combining with our Theorem 1, we obtain the following result:
Corollary 3 (Bounds based on eigendecay).
Under the conditions of Theorem 1:

For kernels with exponential eigendecay, we have
(23a) 
For kernels with polynomial eigendecay, we have
(23b)
In particular, these bounds hold for LogitBoost and AdaBoost. We note that similar bounds can also be derived with regard to risk in norm as well as the excess risk .
To the best of our knowledge, this result is the first to show non-asymptotic and optimal statistical rates for the error when early stopping LogitBoost or AdaBoost, with an explicit dependence of the stopping rule on the sample size. Our results also yield similar guarantees for L2Boost, as has been established in past work [27]. Note that we can observe a similar trade-off between computational efficiency and statistical accuracy as in the case of kernel least-squares regression [40, 27]: although larger kernel classes (e.g., Sobolev classes) yield higher estimation errors, boosting updates reach the optimum faster than for a smaller kernel class (e.g., Gaussian kernels).
4.2 Numerical experiments
We now describe some numerical experiments that provide illustrative confirmations of our theoretical predictions. While we have applied our methods to various kernel classes, in this section we present numerical results for the first-order Sobolev kernel, a typical example of a kernel class with polynomial eigendecay.
Let us start with the first-order Sobolev space of Lipschitz functions on the unit interval, defined by the kernel K(x, x′) = min{x, x′}, with the design points set equidistantly over the unit interval. Note that the equidistant design yields polynomial decay of the eigenvalues of the normalized kernel matrix, as in the case when the covariates are drawn i.i.d. from the uniform measure on the unit interval. Consequently, the critical radius can be computed explicitly. Accordingly, our theory predicts that the resulting stopping time should lead to an estimate achieving the corresponding rate.
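The polynomial eigendecay claimed for this kernel can be verified directly: for the equidistant design, the eigenvalues of the normalized min-kernel matrix closely track the known continuum values (π(k − 1/2))^(−2) of the associated integral operator on [0, 1]. A short numpy check, with naming conventions of our own:

```python
import numpy as np

n = 400
x = np.arange(1, n + 1) / n                # equidistant design on (0, 1]
K = np.minimum.outer(x, x) / n             # normalized kernel, K_ij = min(x_i, x_j)/n
mu = np.sort(np.linalg.eigvalsh(K))[::-1]  # ordered eigenvalues

# Continuum limit: eigenvalues of min(x, x') on [0,1] are (pi (k - 1/2))^{-2},
# so mu_k should track k^{-2} decay up to a constant.
k = np.arange(1, 51)
pred = 1.0 / (np.pi * (k - 0.5)) ** 2
rel_err = np.abs(mu[:50] - pred) / pred
assert rel_err.max() < 0.05
```

The k^(−2) decay of these eigenvalues is exactly the polynomial eigendecay regime of Corollary 3.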
In our experiments for L2Boost, we sampled responses from a Gaussian model centered at a regression function defined on the unit interval. By construction, this function belongs to the first-order Sobolev space. For LogitBoost, we sampled binary responses according to a corresponding logistic model. In all cases, we fixed the initialization at zero, and ran the updates (9) for L2Boost and LogitBoost with a constant step size. We compared various stopping rules to the oracle gold standard, meaning the procedure that examines all iterates and chooses the stopping time that yields the minimum prediction error. Of course, this procedure is unimplementable in practice, but it serves as a convenient lower bound with which to compare.

[Figure 2: Mean-squared error versus sample size for (a) L2Boost and (b) LogitBoost.]
Figure 2 shows plots of the mean-squared error over the sample size, averaged over multiple trials, for the gold standard and for stopping rules based on different choices of the exponent in the stopping time. Error bars correspond to the standard errors computed from our simulations. Panel (a) shows the behavior for L2Boost, whereas panel (b) shows the behavior for LogitBoost. Note that both plots are qualitatively similar, and that the theoretically derived stopping rule, while slightly worse than the gold standard, tracks its performance closely. We also performed simulations for some "bad" stopping rules, in particular for exponents different from the theoretically optimal one, indicated by the green and black curves. In the log-scale plots in Figure 3 we can clearly see that for such exponents the performance is indeed much worse, with the difference in slope even suggesting a different scaling of the error with the number of observations. Recalling our discussion of Figure 1, this phenomenon likely occurs due to underfitting and overfitting effects. These qualitative shifts are consistent with our theory.
[Figure 3: Log-scale plots of the mean-squared error versus sample size, panels (a) and (b).]
5 Proof of main results
In this section, we present the proofs of our main results. The technical details are deferred to Appendix A.
In the following, recalling the discussion in Section 2.3, we denote by the corresponding vector the function values of a function evaluated at the design points, omitting the subscript when it is clear from context. As mentioned in the main text, updates on the function value vectors correspond uniquely to updates of the functions. In the following, we repeatedly abuse notation by defining the Hilbert norm and the empirical norm on such vectors accordingly.