Optimal learning rates for Kernel Conjugate Gradient regression

by Gilles Blanchard et al.

We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the intrinsic dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates matches state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators.






1 Introduction

The contribution of this paper is the learning theoretical analysis of kernel-based least squares regression in combination with conjugate gradient techniques. The goal is to estimate a regression function $f^*$ based on random noisy observations. We have an i.i.d. sample of $n$ observations $(X_i, Y_i)_{i=1}^n$ from an unknown distribution $P(X, Y)$ that follows the model

$$Y = f^*(X) + \varepsilon,$$

where $\varepsilon$ is a noise variable whose distribution can possibly depend on $X$, but satisfies $\mathbb{E}[\varepsilon \mid X] = 0$. We assume that the true regression function $f^*$ belongs to the space $L^2(P_X)$ of square-integrable functions. Following the kernelization principle, we implicitly map the data into a reproducing kernel Hilbert space $\mathcal{H}_k$ with a kernel $k$. We denote by $K_n = \big(\tfrac{1}{n}\, k(X_i, X_j)\big)_{1 \le i, j \le n}$ the normalized kernel matrix and by $\mathbf{Y} = (Y_1, \dots, Y_n)^\top$ the $n$-vector of response observations. The task is to find coefficients $\alpha \in \mathbb{R}^n$ such that the function $f_\alpha$ defined by the normalized kernel expansion

$$f_\alpha(x) = \frac{1}{n} \sum_{i=1}^{n} \alpha_i\, k(X_i, x)$$

is an adequate estimator of the true regression function $f^*$. The closeness of the estimator to the target is measured via the $L^2(P_X)$ distance,

$$\|f_\alpha - f^*\|_2^2 = \int (f_\alpha - f^*)^2 \, dP_X = \mathbb{E}\big[(f_\alpha(X) - f^*(X))^2\big].$$

The last equality recalls that this criterion is the same as the excess generalization error for the squared error loss $\ell(f(X), Y) = (f(X) - Y)^2$.

In empirical risk minimization, we use the training data empirical distribution as a proxy for the generating distribution, and minimize the training squared error. This gives rise to the linear equation

$$K_n \alpha = \mathbf{Y}. \tag{1}$$

Assuming $K_n$ invertible, the solution of the above equation is given by $\alpha = K_n^{-1} \mathbf{Y}$, which yields a function in $\mathcal{H}_k$ interpolating perfectly the training data but having poor generalization error. It is well-known that to avoid overfitting, some form of regularization is needed. There is a considerable variety of possible approaches (see e.g. [12] for an overview). Perhaps the most well-known one is

$$\alpha_\lambda = (K_n + \lambda I)^{-1} \mathbf{Y}, \tag{2}$$

known alternatively as kernel ridge regression, least squares support vector machine, or Tikhonov's regularization. A powerful generalization of this is to consider

$$\alpha_\lambda = F_\lambda(K_n)\, \mathbf{Y}, \tag{3}$$

where $F_\lambda$ is a fixed function depending on a parameter $\lambda$. The notation $F_\lambda(K_n)$ is to be interpreted as $F_\lambda$ applied to each eigenvalue of $K_n$ in its eigendecomposition. Intuitively, $F_\lambda$ should be a "regularized" version of the inverse function $F(x) = x^{-1}$. This type of regularization, which we refer to as linear regularization methods, is directly inspired from the theory of inverse problems. Popular examples include as particular cases kernel ridge regression, principal components regression and $L_2$-Boosting. Their application in a learning context has been studied extensively [2, 3, 6, 7, 14]. Results obtained in this framework will serve as a comparison yardstick in the sequel.

In this paper, we study conjugate gradient (CG) techniques in combination with early stopping for the regularization of the kernel based learning problem (1). The principle of CG techniques is to restrict the learning problem onto a nested set of data-dependent subspaces, the so-called Krylov subspaces, defined as

$$\mathcal{K}_m(K_n, \mathbf{Y}) = \operatorname{span}\big\{\mathbf{Y}, K_n \mathbf{Y}, \dots, K_n^{m-1} \mathbf{Y}\big\}. \tag{4}$$

Denote by $\langle \alpha, \beta \rangle = \tfrac{1}{n}\, \alpha^\top \beta$ the usual Euclidean scalar product on $\mathbb{R}^n$ rescaled by the factor $n^{-1}$. We define the $K_n$-norm as $\|\alpha\|_{K_n} = \langle \alpha, K_n \alpha \rangle^{1/2}$. The CG solution after $m$ iterations is formally defined as

$$\alpha_m = \operatorname*{arg\,min}_{\alpha \in \mathcal{K}_m(K_n, \mathbf{Y})} \|\mathbf{Y} - K_n \alpha\|_{K_n}, \tag{5}$$

and the number $m$ of CG iterations is the model parameter. To simplify notation we define $f_m = f_{\alpha_m}$. In the learning context considered here, regularization corresponds to early stopping. Conjugate gradients have the appealing property that the optimization criterion (5) can be computed by a simple iterative algorithm that constructs basis vectors of $\mathcal{K}_m(K_n, \mathbf{Y})$ by using only forward multiplication of vectors by the matrix $K_n$. Algorithm 1 displays the computation of the CG kernel coefficients $\alpha_m$ defined by (5).

  Input: kernel matrix $K_n$, response vector $\mathbf{Y}$, maximum number of iterations $m$
  Initialization: $\alpha_0 = 0$; $d_1 = \mathbf{Y}$; $d_0 = 0$
  for $i = 1, \dots, m$ do
      $d_i \leftarrow d_i / \|K_n d_i\|_{K_n}$  (normalization of the basis vector)
      $\gamma_i = \langle \mathbf{Y}, K_n^2 d_i \rangle$  (step size); $\alpha_i = \alpha_{i-1} + \gamma_i d_i$
      $d_{i+1} = K_n d_i - \langle K_n^2 d_i, K_n^2 d_i \rangle\, d_i - \langle K_n^2 d_i, K_n^2 d_{i-1} \rangle\, d_{i-1}$  (new basis vector)
  end for
  Return: CG kernel coefficients $\alpha_m$, CG function $f_m = f_{\alpha_m}$
Algorithm 1 Kernel Conjugate Gradient regression
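As an illustration, criterion (5) can also be evaluated directly by explicit least squares over the Krylov basis. The following Python sketch (our own illustration, assuming a psd kernel matrix `K`; the function name and the jitter stabilization are not from the paper) is mathematically equivalent to, though numerically cruder than, the iterative algorithm:

```python
import numpy as np

def kernel_cg(K, Y, m, jitter=1e-10):
    # Minimize ||Y - K a||_K over the Krylov space
    # span{Y, K Y, ..., K^{m-1} Y}  (criterion (5)); K is a psd kernel matrix.
    n = len(Y)
    # Naive explicit Krylov basis; the iterative algorithm instead builds an
    # orthogonal basis by short recurrences, which is numerically stabler.
    V = np.column_stack([np.linalg.matrix_power(K, j) @ Y for j in range(m)])
    L = np.linalg.cholesky(K + jitter * np.eye(n))   # K ~ L L^T
    # ||Y - K V c||_K equals ||L^T (Y - K V c)|| up to the 1/n rescaling,
    # so the optimal Krylov coefficients c solve a weighted least squares.
    c, *_ = np.linalg.lstsq(L.T @ (K @ V), L.T @ Y, rcond=None)
    return V @ c   # CG kernel coefficients alpha_m
```

As $m$ grows, the nested Krylov spaces make the residual $K_n$-norm non-increasing; early stopping selects the iteration count $m$.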

The CG approach is also inspired by the theory of inverse problems, but it is not covered by the framework of linear operators defined in (3): as we restrict the learning problem onto the Krylov space $\mathcal{K}_m(K_n, \mathbf{Y})$, the CG coefficients are of the form $\alpha_m = q_m(K_n)\, \mathbf{Y}$ with $q_m$ a polynomial of degree at most $m - 1$. However, the polynomial $q_m$ is not fixed but depends on $\mathbf{Y}$ as well, making the CG method nonlinear in the sense that the coefficients $\alpha_m$ depend on $\mathbf{Y}$ in a nonlinear fashion.

We remark that in machine learning, conjugate gradient techniques are often used as fast solvers for operator equations, e.g. to obtain the solution of the regularized equation (2). We stress that in this paper, we study conjugate gradients as a regularization approach for kernel based learning, where the regularity is ensured via early stopping. This approach is not new. As mentioned in the abstract, the algorithm that we study is closely related to Kernel Partial Least Squares [21]. The latter method also restricts the learning problem onto the Krylov subspace $\mathcal{K}_m(K_n, \mathbf{Y})$, but it minimizes the Euclidean distance $\|\mathbf{Y} - K_n \alpha\|$ instead of the $\|\cdot\|_{K_n}$ distance defined above. (This is generalized to a CG-$\ell$ algorithm by replacing the $K_n$-norm in (5) with the norm defined by $\|\alpha\|_{K_n^\ell} = \langle \alpha, K_n^\ell \alpha \rangle^{1/2}$. Corresponding fast iterative algorithms to compute the solution exist for all $\ell$; see e.g. [13].) Kernel Partial Least Squares has shown competitive performance in benchmark experiments (see e.g. [21, 22]). Moreover, a similar conjugate gradient approach for non-definite kernels has been proposed and empirically evaluated by Ong et al. [19]. The focus of the current paper is therefore not to stress the usefulness of CG methods in practical applications (for this we refer to the above-mentioned references) but to examine their theoretical convergence properties. In particular, we establish the existence of early stopping rules that lead to optimal convergence rates. We summarize our main results in the next section.

2 Main results

For the presentation of our convergence results, we require suitable assumptions on the learning problem. We first assume that the kernel space $\mathcal{H}_k$ is separable and that the kernel function is measurable. (This assumption is satisfied for all practical situations that we know of.) Furthermore, for all results, we make the (relatively standard) assumption that the kernel is bounded:

$$k(x, x) \le \kappa \quad \text{for all } x. \tag{6}$$
We consider – depending on the result – one of the following assumptions on the noise:

  • (Bounded): the observations are bounded, $|Y| \le M$ almost surely.

  • (Bernstein): the noise satisfies a Bernstein-type moment condition, $\mathbb{E}[|\varepsilon|^p \mid X] \le \tfrac{1}{2}\, p!\, M^p$ almost surely, for all integers $p \ge 2$.

The second assumption is weaker than the first. In particular, the first assumption implies that not only the noise, but also the target function is bounded in supremum norm, while the second assumption does not put any additional restriction on the target function.

The regularity of the target function $f^*$ is measured in terms of a source condition as follows. The kernel integral operator $K : L^2(P_X) \to L^2(P_X)$ is given by

$$(K g)(x) = \mathbb{E}\big[k(x, X)\, g(X)\big] = \int k(x, x')\, g(x')\, dP_X(x').$$

The source condition for the parameter $r > 0$ is defined by:

$$\text{SC}(r): \qquad f^* = K^r u \quad \text{for some } u \in L^2(P_X) \text{ with } \|u\|_2 \le \rho.$$

It is a known fact (see, e.g., [10]) that if $r \ge 1/2$, then $f^*$ coincides almost surely with a function belonging to $\mathcal{H}_k$. Therefore, we refer to $r \ge 1/2$ as the "inner case" and to $r < 1/2$ as the "outer case".

The regularity of the kernel operator with respect to the marginal distribution $P_X$ is measured in terms of the intrinsic dimensionality parameter $s \in (0, 1]$, defined by the condition

$$\text{ID}(s): \qquad \mu_i \le D\, i^{-1/s} \quad \text{for all } i \ge 1,$$

where $(\mu_i)$ denotes the decreasing sequence of eigenvalues of $K$. It is known that the best attainable rate of convergence, as a function of the number of examples $n$, is determined by the parameters $r$ and $s$. It was shown in [11] that the minimax learning rate (for the squared $L^2(P_X)$ distance) given these two parameters is lower bounded by $O\big(n^{-2r/(2r+s)}\big)$.
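For concreteness, assuming a minimax lower bound of the form $n^{-2r/(2r+s)}$ in the two parameters, two representative settings illustrate how regularity and spectral decay interact:

```latex
% boundary of the inner case, no nontrivial eigenvalue decay:
r = \tfrac12,\ s = 1 \;\Longrightarrow\; n^{-\frac{2r}{2r+s}} = n^{-\frac{1}{2}};
% smoother target and faster spectral decay:
r = 1,\ s = \tfrac12 \;\Longrightarrow\; n^{-\frac{2}{2 + 1/2}} = n^{-\frac{4}{5}}.
```

Larger $r$ (smoother target) and smaller $s$ (faster eigenvalue decay) both yield faster attainable rates.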

We now expose our main results in different situations. In all the cases considered, the early stopping rule takes the form of a so-called discrepancy stopping rule: for some sequence of thresholds $(\Lambda_m)$ to be specified (and possibly depending on the data), define the (data-dependent) stopping iteration $\hat m$ as the first iteration $m$ for which

$$\|\mathbf{Y} - K_n \alpha_m\|_{K_n} < \Lambda_m. \tag{7}$$

(Only in the first result below does the threshold actually depend on the iteration $m$ and on the data.)
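A minimal Python sketch of the discrepancy rule (our own illustration: the helper directly minimizes (5) over an explicit Krylov basis, which is numerically naive but adequate here; the names and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def cg_coefficients(K, Y, m, jitter=1e-10):
    # minimize ||Y - K a||_K over the order-m Krylov space (criterion (5))
    n = len(Y)
    V = np.column_stack([np.linalg.matrix_power(K, j) @ Y for j in range(m)])
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    c, *_ = np.linalg.lstsq(L.T @ (K @ V), L.T @ Y, rcond=None)
    return V @ c

def discrepancy_stop(K, Y, threshold, m_max):
    # return the first iteration whose residual K-norm drops below the
    # threshold, together with the corresponding CG coefficients
    n = len(Y)
    for m in range(1, m_max + 1):
        alpha = cg_coefficients(K, Y, m)
        r = Y - K @ alpha
        if np.sqrt(max(float(r @ K @ r) / n, 0.0)) <= threshold:
            return m, alpha
    return m_max, alpha
```

A loose threshold stops immediately (strong regularization), while a threshold of zero exhausts the iteration budget (no regularization); the theory below specifies thresholds between these extremes.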

2.1 Inner case without knowledge on intrinsic dimension

The inner case corresponds to $r \ge 1/2$, i.e. the target function $f^*$ lies in $\mathcal{H}_k$ almost surely. For some constants $\Lambda > 0$ and $0 < \delta < 1$, we consider the discrepancy stopping rule with the threshold sequence


For technical reasons, we consider a slight variation of the rule in that we stop at step $\hat m - 1$ instead of $\hat m$ if a certain condition involving the iteration polynomial $q_{\hat m}$ holds, where $q_m$ is the iteration polynomial such that $\alpha_m = q_m(K_n)\, \mathbf{Y}$. Denote $\tilde m$ the resulting stopping step. We obtain the following result.

Theorem 2.1.

Suppose that $Y$ is bounded (Bounded), and that the source condition SC($r$) holds for $r \ge 1/2$. With probability at least $1 - \delta$, the estimator $f_{\tilde m}$ obtained by the (modified) discrepancy stopping rule (8) satisfies

$$\|f_{\tilde m} - f^*\|_2^2 = O\big(n^{-\frac{2r}{2r+1}}\big),$$

up to factors logarithmic in $n$ and $\delta^{-1}$.

We present the proof in Section 4.

2.2 Optimal rates in inner case

We now introduce a stopping rule yielding order-optimal convergence rates as a function of the two parameters $r$ and $s$ in the "inner" case ($r \ge 1/2$, which is equivalent to saying that the target function belongs to $\mathcal{H}_k$ almost surely). For some constant $\Lambda > 0$ and $0 < \delta < 1$, we consider the discrepancy stopping rule with the fixed threshold


for which we obtain the following:

Theorem 2.2.

Suppose that the noise fulfills the Bernstein assumption (Bernstein), that the source condition SC($r$) holds for $r \ge 1/2$, and that ID($s$) holds. With probability at least $1 - \delta$, the estimator $f_{\hat m}$ obtained by the discrepancy stopping rule (9) satisfies

$$\|f_{\hat m} - f^*\|_2^2 = O\big(n^{-\frac{2r}{2r+s}}\big),$$

up to factors logarithmic in $n$ and $\delta^{-1}$.

Due to space limitations, the proof is presented in the supplementary material.

2.3 Optimal rates in outer case, given additional unlabeled data

We now turn to the "outer" case $r < 1/2$. In this case, we make the additional assumption that unlabeled data is available. Assume that we have $n + \tilde n$ i.i.d. observations $X_1, \dots, X_{n + \tilde n}$, out of which only the first $n$ are labeled. We define a new response vector $\tilde{\mathbf{Y}} \in \mathbb{R}^{n + \tilde n}$ whose entries are (suitably rescaled) $Y_i$ for the labeled points and $0$ for the unlabeled ones, and run the CG Algorithm 1 on the full sample and $\tilde{\mathbf{Y}}$. We use the same threshold (9) as in the previous section for the stopping rule, except that the sample size factor $n$ is replaced by $n + \tilde n$.

Theorem 2.3.

Suppose assumptions (Bounded), SC($r$) and ID($s$) hold, with $r \le 1/2$. Assume that a sufficiently large amount of unlabeled data is available (the precise requirement on $\tilde n$, polynomial in $n$, is given in the supplementary material). Then with probability at least $1 - \delta$, the estimator $f_{\hat m}$ obtained by the discrepancy stopping rule defined above satisfies

$$\|f_{\hat m} - f^*\|_2^2 = O\big(n^{-\frac{2r}{2r+s}}\big),$$

up to factors logarithmic in $n$ and $\delta^{-1}$.

A sketch of the proof can be found in the supplementary material.

3 Discussion and comparison to other results

For the inner case – i.e. $f^* \in \mathcal{H}_k$ almost surely – we provide two different consistent stopping criteria. The first one (Section 2.1) is oblivious to the intrinsic dimension parameter $s$, and the obtained bound corresponds to the "worst case" with respect to this parameter (that is, $s = 1$). However, an interesting feature of stopping rule (8) is that the rule itself does not depend on the a priori knowledge of the regularity parameter $r$, while the achieved learning rate does (and with the optimal dependence in $r$ when $s = 1$). Hence, Theorem 2.1 implies that the obtained rule is automatically adaptive with respect to the regularity of the target function. This contrasts with the results obtained in [2] for linear regularization schemes of the form (3) (also in the case $s = 1$), for which the choice of the regularization parameter $\lambda$ leading to optimal learning rates required the knowledge of $r$ beforehand.

When taking into account also the intrinsic dimensionality parameter $s$, Theorem 2.2 provides the order-optimal convergence rate in the inner case (up to a log factor). A noticeable difference to Theorem 2.1, however, is that the stopping rule is no longer adaptive, that is, it depends on the a priori knowledge of the parameters $r$ and $s$. We observe that previously obtained results for linear regularization schemes of the form (2) in [7] and of the form (3) in [6] also rely on the a priori knowledge of $r$ and $s$ to determine the appropriate regularization parameter $\lambda$.

The outer case – when the target function does not lie in the reproducing kernel Hilbert space – is more challenging and to some extent less well understood. The fact that additional assumptions are made is not a particular artefact of CG methods, but also appears in the studies of other regularization techniques. Here we follow the semi-supervised approach that is proposed in e.g. [6] (to study linear regularization of the form (3)) and assume that we have sufficient additional unlabeled data in order to ensure learning rates that are optimal as a function of the number of labeled data. We remark that other forms of additional requirements can be found in the recent literature in order to reach optimal rates. For regularized M-estimation schemes studied in [23], availability of unlabeled data is not required, but a condition of the form $\|f\|_\infty \le C\, \|f\|_{\mathcal{H}_k}^{p}\, \|f\|_2^{1-p}$ is imposed for all $f \in \mathcal{H}_k$ and some $p \in (0, 1]$. In [15], assumptions on the supremum norm of the eigenfunctions of the kernel integral operator are made (see [23] for an in-depth discussion of this type of assumptions).

Finally, as explained in the introduction, the term 'conjugate gradients' comprises a class of methods that approximate the solution of linear equations on Krylov subspaces. In the context of learning, our approach is most closely linked to Partial Least Squares (PLS) [24] and its kernel extension [21]. While PLS has proven to be successful in a wide range of applications and is considered one of the standard approaches in chemometrics, there are only few studies of its theoretical properties. In [9, 16], consistency properties are provided for linear PLS under the assumption that the target function depends on a finite, known number of orthogonal latent components. These findings were recently extended to the nonlinear case and without the assumption of a latent components model [4], but all results come without optimal rates of convergence. For the slightly different CG approach studied by Ong et al. [19], bounds on the difference between the empirical risks of the CG approximation and of the target function are derived in [18], but no bounds on the generalization error are given.

4 Proofs

Convergence rates for regularization methods of the type (2) or (3) have been studied by casting kernel learning methods into the framework of inverse problems (see [11]). We use this framework for the present results as well, and recapitulate here some important facts.

We first define the empirical evaluation operator $S_n : \mathcal{H}_k \to \mathbb{R}^n$ as follows:

$$(S_n f)_i = f(X_i), \qquad i = 1, \dots, n,$$

and the empirical integral operator $S_n^* : \mathbb{R}^n \to \mathcal{H}_k$ as:

$$S_n^* u = \frac{1}{n} \sum_{i=1}^{n} u_i\, k(X_i, \cdot).$$

Using the reproducing property of the kernel, it can be readily checked that $S_n$ and $S_n^*$ are adjoint operators, i.e. they satisfy $\langle S_n f, u \rangle = \langle f, S_n^* u \rangle_k$ for all $u \in \mathbb{R}^n$ and $f \in \mathcal{H}_k$ (recall that $\langle \cdot, \cdot \rangle$ is rescaled by the factor $n^{-1}$). Furthermore, $S_n S_n^* = K_n$, and therefore $f_\alpha = S_n^* \alpha$. Based on these facts, equation (5) can be rewritten as

$$f_m = \operatorname*{arg\,min}_{f \in \mathcal{K}_m(T_n,\, S_n^* \mathbf{Y})} \|S_n^* \mathbf{Y} - T_n f\|_k, \tag{10}$$

where $T_n = S_n^* S_n$ is a self-adjoint operator of $\mathcal{H}_k$, called empirical covariance operator. This definition corresponds to that of the "usual" conjugate gradient algorithm formally applied to the so-called normal equation (in $\mathcal{H}_k$)

$$T_n f = S_n^* \mathbf{Y},$$

which is obtained from (1) by left multiplication by $S_n^*$. The advantage of this reformulation is that it can be interpreted as a "perturbation" of a population, noiseless version (of the equation and of the algorithm), wherein $\mathbf{Y}$ is replaced by the target function $f^*$ and the empirical operators are respectively replaced by their population analogues, the kernel integral operator $K$ defined in Section 2,

$$(K g)(x) = \mathbb{E}\big[k(x, X)\, g(X)\big],$$

and the change-of-space operator

$$\iota : \mathcal{H}_k \to L^2(P_X), \qquad f \mapsto f.$$

The latter maps a function to itself but between two Hilbert spaces which differ with respect to their geometry – the inner product of $\mathcal{H}_k$ being defined by the kernel function $k$, while the inner product of $L^2(P_X)$ depends on the data generating distribution (this operator is well defined: since the kernel is bounded, all functions in $\mathcal{H}_k$ are bounded and therefore square integrable under any distribution $P_X$).
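The adjointness relation and the identity $S_n S_n^* = K_n$ can be checked numerically by representing functions of $\mathcal{H}_k$ through their coefficients in the span of the $k(X_i, \cdot)$. The following Python sketch (our own illustration, with a Gaussian kernel and hypothetical data) does so:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=6)
n = len(X)
G = np.exp(-(X[:, None] - X[None, :]) ** 2)   # Gram matrix k(X_i, X_j)
K_n = G / n                                    # normalized kernel matrix

# A function f in the span of the k(X_j, .) is stored by its coefficient
# vector c: f = sum_j c_j k(X_j, .).  Then (S_n f)_i = f(X_i) = (G c)_i,
# S_n^* u has coefficient vector u / n, and <f, g>_k = c^T G d.
c = rng.normal(size=n)
u = rng.normal(size=n)

lhs = float((G @ c) @ u) / n        # <S_n f, u>, with the 1/n-rescaled product
rhs = float(c @ (G @ (u / n)))      # <f, S_n^* u>_k  -- adjointness

# S_n S_n^* = K_n: evaluating the function with coefficients u/n at the
# data points gives G @ (u / n), which equals K_n @ u entrywise.
```

Both quantities reduce to $\tfrac{1}{n}\, c^\top G u$, which is exactly the adjointness identity in coordinates.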

The following results, taken from [2] (Propositions 21 and 22), quantify more precisely that the empirical covariance operator $T_n$ and the empirical integral operator applied to the data, $S_n^* \mathbf{Y}$, are close to the population covariance operator $T = \iota^* \iota$ and to the kernel integral operator applied to the noiseless target function, respectively.

Proposition 4.1.

Provided that condition (6) is true, the following holds with probability at least $1 - \delta$:

$$\|T_n - T\|_{HS} \le \frac{c\, \kappa\, \log(2/\delta)}{\sqrt{n}}, \tag{11}$$

where $\|\cdot\|_{HS}$ denotes the Hilbert–Schmidt norm. If the representation $f^* = \iota h^*$ (with $h^* \in \mathcal{H}_k$) holds, and under assumption (Bernstein), we have with probability at least $1 - \delta$:

$$\|S_n^* \mathbf{Y} - T h^*\|_k \le \frac{c'(\kappa, M)\, \log(2/\delta)}{\sqrt{n}}. \tag{12}$$

We note that $f^* = \iota h^*$ implies that the target function coincides almost surely with a function $h^*$ belonging to $\mathcal{H}_k$ (remember that $\iota$ is just the change-of-space operator). Hence, the second result (12) is valid for the case $r \ge 1/2$ with $h^*$ given by the source condition, but it is not true in general for $r < 1/2$.

4.1 Nemirovskii’s result on conjugate gradient regularization rates

We recall a sharp result due to Nemirovskii [17] establishing convergence rates for conjugate gradient methods in a deterministic context. We present the result in an abstract context, then show how, combined with the previous section, it leads to a proof of Theorem 2.1. Consider the linear equation

$$A z = b,$$

where $A$ is a bounded self-adjoint positive linear operator over a Hilbert space $\mathcal{H}$. Assume that the above equation has a solution and denote $z^*$ its minimal norm solution; assume further that a self-adjoint operator $\bar A$, and an element $\bar b$ are known such that

$$\|A - \bar A\| \le \zeta, \qquad \|b - \bar b\| \le \varepsilon \tag{13}$$

(with $\zeta$ and $\varepsilon$ known positive numbers). Consider the CG algorithm based on the noisy operator $\bar A$ and data $\bar b$, giving the output at step $m$

$$\bar z_m = \operatorname*{arg\,min}_{z \in \mathcal{K}_m(\bar A, \bar b)} \|\bar b - \bar A z\|. \tag{14}$$

The discrepancy principle stopping rule is defined as follows. Consider a fixed constant $\tau > 1$ and define

$$\hat m = \min\big\{ m \ge 0 : \|\bar b - \bar A \bar z_m\| \le \tau\, (\varepsilon + \zeta \|z^*\|) \big\}.$$

We output the solution $\bar z_{\hat m}$ obtained at step $\hat m$. Consider a minor variation of this rule: we stop at step $\hat m - 1$ instead of $\hat m$ whenever a condition involving $q_{\hat m}(0)$ is met, where $q_m$ is the degree $m - 1$ polynomial such that $\bar z_m = q_m(\bar A)\, \bar b$, and $\theta$ is an arbitrary positive constant entering this condition. Nemirovskii established the following theorem:

Theorem 4.2.

Assume that (a) $\max(\|A\|, \|\bar A\|) \le L$ and that (b) $z^* = A^\mu u$ with $\|u\| \le R$ for some $\mu > 0$. Then for any $\nu \in [0, 1]$, provided that $\hat m$ is chosen by the above stopping rule, it holds that

$$\big\|A^{\nu}\big(\bar z_{\hat m} - z^*\big)\big\|^2 \le c(\mu, \nu, \tau, \theta)\, R^{\frac{2(1 - \nu)}{\mu + 1}}\, \big(\varepsilon + \zeta \|z^*\|\big)^{\frac{2(\mu + \nu)}{\mu + 1}}.$$

4.2 Proof of Theorem 2.1

We apply Nemirovskii's result in our setting (assuming $r \ge 1/2$): by identifying the approximate operator and data as $\bar A = T_n$ and $\bar b = S_n^* \mathbf{Y}$, we see that the CG algorithm considered by Nemirovskii (14) is exactly (10), more precisely with the identification $\bar z_m = f_m$.

For the population version, we identify $A = T$, $b = T h^*$ and $z^* = h^*$ (remember that provided $r \ge 1/2$ in the source condition, there exists $h^* \in \mathcal{H}_k$ such that $f^* = \iota h^*$).

Condition (a) of Nemirovskii's Theorem 4.2 is satisfied with $L = \kappa$ by the boundedness of the kernel. Condition (b) is satisfied with $\mu = r - \tfrac12$ and $R = \rho$, as implied by the source condition SC($r$). Finally, the concentration result 4.1 ensures that the approximation conditions (13) are satisfied with probability $1 - \delta$, more precisely with $\zeta = c\, \kappa \log(4/\delta)/\sqrt{n}$ and $\varepsilon = c'(\kappa, M) \log(4/\delta)/\sqrt{n}$. (Here we replaced $\delta$ in (11) and (12) by $\delta/2$, so that the two conditions are satisfied simultaneously, by the union bound.) The operator norm is upper bounded by the Hilbert–Schmidt norm, so that the deviation inequality for the operators is actually stronger than what is needed.

We consider the discrepancy principle stopping rule associated to these parameters and the choices $\mu = r - \tfrac12$ and $\nu = \tfrac12$, thus obtaining the result, since $\|f_{\tilde m} - f^*\|_2 = \|T^{1/2}(\bar z_{\hat m} - h^*)\|_k = \|A^{1/2}(\bar z_{\hat m} - z^*)\|$.
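The exponent bookkeeping can be made explicit — a sketch assuming a Nemirovskii-type bound with deviation sizes $\varepsilon \asymp \zeta \asymp n^{-1/2}$, source exponent $\mu = r - \tfrac12$, and norm index $\nu = \tfrac12$ (corresponding to the $L^2$ norm):

```latex
\big\| A^{1/2} (\bar z_{\hat m} - z^*) \big\|^2
  \;\lesssim\; \big( \varepsilon + \zeta \, \|z^*\| \big)^{\frac{2(\mu + \nu)}{\mu + 1}}
  \;\asymp\; n^{-\frac{\mu + \nu}{\mu + 1}}
  \;=\; n^{-\frac{r}{\,r + 1/2\,}}
  \;=\; n^{-\frac{2r}{2r + 1}},
```

which is the worst-case ($s = 1$) rate of Theorem 2.1.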

4.3 Notes on the proof of Theorems 2.2 and 2.3

The above proof shows that an application of Nemirovskii's fundamental result for CG regularization of inverse problems under deterministic noise (on the data and the operator) allows us to obtain our first result. One key ingredient is the concentration property 4.1, which allows us to bound deviations in a quasi-deterministic manner.

To prove the sharper results of Theorems 2.2 and 2.3, such a direct approach unfortunately does not work, and a complete rework and extension of the proof is necessary. The proof of Theorem 2.2 is presented in the supplementary material to the paper. In a nutshell, the concentration result 4.1 is too coarse to prove the optimal rates of convergence taking into account the intrinsic dimension parameter. Instead of that result, we have to consider the deviations from the mean in a "warped" norm, i.e. of the form $\|(T + \lambda)^{-1/2}(S_n^* \mathbf{Y} - T h^*)\|_k$ for the data, and $\|(T + \lambda)^{-1/2}(T_n - T)\|$ for the operator (with an appropriate choice of $\lambda$), respectively. Deviations of this form were introduced and used in [6, 7] to obtain sharp rates in the framework of Tikhonov's regularization (2) and of the more general linear regularization schemes of the form (3). Bounds on deviations of this form can be obtained via a Bernstein-type concentration inequality for Hilbert-space valued random variables.

On the one hand, the results concerning linear regularization schemes of the form (3) do not apply to the nonlinear CG regularization. On the other hand, Nemirovskii’s result does not apply to deviations controlled in the warped norm. Moreover, the “outer” case introduces additional technical difficulties. Therefore, the proofs for Theorems 2.2 and 2.3, while still following the overall fundamental structure and ideas introduced by Nemirovskii, are significantly different in that context. As mentioned above, we present the complete proof of Theorem 2.2 in the supplementary material and a sketch of the proof of Theorem 2.3.

5 Conclusion

In this work, we derived early stopping rules for kernel Conjugate Gradient regression that provide optimal learning rates towards the true target function. In some of the situations that we study, the rules are moreover adaptive with respect to the regularity of the target function. The proofs of our results rely most importantly on ideas introduced by Nemirovskii [17] and further developed by Hanke [13] for CG methods in the deterministic case, and moreover on ideas inspired by [6, 7].

Certainly, in practice, as for a large majority of learning algorithms, cross-validation remains the standard approach for model selection. The motivation of this work is however mainly theoretical, and our overall goal is to show that from the learning theoretical point of view, CG regularization stands on equal footing with other well-studied regularization methods such as kernel ridge regression or more general linear regularization methods (which include, among many others, boosting). We also note that theoretically well-grounded model selection rules can generally help cross-validation in practice by providing a well-calibrated parametrization of regularizer functions, or, as is the case here, of thresholds used in the stopping rule.

One crucial property used in the proofs is that the proposed CG regularization scheme can be conveniently cast in the reproducing kernel Hilbert space, as displayed in e.g. (10). This reformulation is not possible for Kernel Partial Least Squares: it is also a CG type method, but uses the standard Euclidean norm instead of the $K_n$-norm used here. This point is the main technical justification for why we focus on (5) rather than on kernel PLS. Obtaining optimal convergence rates also valid for kernel PLS is an important future direction and should build on the present work.

Another important direction for future efforts is the derivation of stopping rules that do not depend on the confidence parameter $\delta$. Currently, this dependence prevents us from going from convergence in high probability to convergence in expectation, which would be desirable. Perhaps more importantly, it would be of interest to find a stopping rule that is adaptive to both parameters $r$ (target function regularity) and $s$ (intrinsic dimension) without their a priori knowledge. We recall that our first stopping rule is adaptive to $r$, but at the price of being worst-case in $s$. In the literature on linear regularization methods, the optimal choice of regularization parameter is also non-adaptive, be it when considering optimal rates with respect to $r$ only [2] or to both $r$ and $s$ [6]. An approach to alleviate this problem is to use a hold-out sample for model selection; this was studied theoretically in [8] for linear regularization methods (see also [5] for an account of the properties of hold-out in a general setup). We strongly believe that the hold-out method will yield theoretically founded adaptive model selection for CG as well. However, hold-out is typically regarded as inelegant in that it requires throwing away part of the data for estimation. It would be of more interest to study model selection methods that use the whole data in the estimation phase. The application of Lepskii's method is a possible step in this direction.


  • [1] R. Bhatia. Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer, 1997.
  • [2] F. Bauer, S. Pereverzev, and L. Rosasco. On Regularization Algorithms in Learning Theory. Journal of Complexity, 23:52–72, 2007.
  • [3] N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence Rates of General Regularization Methods for Statistical Inverse Problems and Applications. SIAM Journal on Numerical Analysis, 45(6):2610–2636, 2007.
  • [4] G. Blanchard and N. Krämer. Kernel Partial Least Squares is Universally Consistent. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR Workshop & Conference Proceedings, 9:57–64, 2010.
  • [5] G. Blanchard and P. Massart. Discussion of V. Koltchinskii’s 2004 IMS Medallion Lecture paper, ”Local Rademacher complexities and oracle inequalities in risk minimization”. Annals of Statistics, 34(6):2664–2671, 2006.
  • [6] A. Caponnetto. Optimal Rates for Regularization Operators in Learning Theory. Technical Report CBCL Paper 264/ CSAIL-TR 2006-062, Massachusetts Institute of Technology, 2006.
  • [7] A. Caponnetto and E. De Vito. Optimal Rates for Regularized Least-squares Algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • [8] A. Caponnetto and Y. Yao. Cross-validation based Adaptation for Regularization Operators in Learning Theory. Analysis and Applications, 8(2):161–183, 2010.
  • [9] H. Chun and S. Keles. Sparse Partial Least Squares for Simultaneous Dimension Reduction and Variable Selection. Journal of the Royal Statistical Society B, 72(1):3–25, 2010.
  • [10] F. Cucker and S. Smale. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1–50, 2002.
  • [11] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from Examples as an Inverse Problem. Journal of Machine Learning Research, 6:883–904, 2005.
  • [12] L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
  • [13] M. Hanke. Conjugate Gradient Type Methods for Linear Ill-posed Problems. Pitman Research Notes in Mathematics Series, 327, 1995.
  • [14] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral Algorithms for Supervised Learning. Neural Computation, 20:1873–1897, 2008.
  • [15] S. Mendelson and J. Neeman. Regularization in Kernel Learning. The Annals of Statistics, 38(1):526–565, 2010.
  • [16] P. Naik and C.L. Tsai. Partial Least Squares Estimator for Single-index Models. Journal of the Royal Statistical Society B, 62(4):763–771, 2000.
  • [17] A. S. Nemirovskii. The Regularizing Properties of the Adjoint Gradient Method in Ill-posed Problems. USSR Computational Mathematics and Mathematical Physics, 26(2):7–16, 1986.
  • [18] C. S. Ong. Kernels: Regularization and Optimization. Doctoral dissertation, Australian National University, 2005.
  • [19] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with Non-positive Kernels. In Proceedings of the 21st International Conference on Machine Learning, pages 639 – 646, 2004.
  • [20] I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 1(30):143–148, 1985.
  • [21] R. Rosipal and L.J. Trejo. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 2:97–123, 2001.
  • [22] R. Rosipal, L.J. Trejo, and B. Matthews. Kernel PLS-SVC for Linear and Nonlinear Classification. In Proceedings of the Twentieth International Conference on Machine Learning, pages 640–647, Washington, DC, 2003.
  • [23] I. Steinwart, D. Hush, and C. Scovel. Optimal Rates for Regularized Least Squares Regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.
  • [24] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn III. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing, 5:735–743, 1984.
  • [25] V. Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer, 1995.

Appendix A Supplementary Material

A.1 Notation

We follow the notation used in the main part, in particular the operators defined in Section 4, and we recall that $K_n = S_n S_n^*$; $T_n = S_n^* S_n$; $f_\alpha = S_n^* \alpha$; and $f_m = f_{\alpha_m}$.

We denote by $(\hat\mu_i)$ the possibly finite sequence of nonzero eigenvalues of $T_n$ (equivalently, of $K_n$), and by $(\mu_i)$ the sequence of eigenvalues of the population operators $T$ and $K$ (in each case in decreasing order and with multiplicity). Finally, $E_\lambda$ denotes the spectral family of the operator $T_n$, i.e. $E_\lambda$ is the orthogonal projector on the subspace of $\mathcal{H}_k$ spanned by eigenvectors of $T_n$ corresponding to eigenvalues strictly less than $\lambda$.

It is useful to consider the spectral integral representation: if $(\hat e_i)$ denotes the orthonormal eigensystem of $T_n$ associated to the non-zero eigenvalues $(\hat\mu_i)$, for any integrable function $h$ on $[0, \kappa]$, we set

$$h(T_n) = \sum_i h(\hat\mu_i)\, \langle \hat e_i, \cdot \rangle_k\, \hat e_i.$$

By its definition, the output of the $m$-th iteration of the CG algorithm can be put under the form $f_m = q_m(T_n)\, S_n^* \mathbf{Y}$, where $q_m \in \mathcal{P}_{m-1}$, the set of real polynomials of degree less than $m$. A crucial role is played by the residual polynomial

$$p_m(x) = 1 - x\, q_m(x),$$

which belongs to $\mathcal{P}_m^{(0)}$, the set of real polynomials of degree at most $m$ and having constant term equal to 1. In particular $p_m(0) = 1$. Furthermore, the definition of the CG algorithm implies that the sequence $(p_m)$ are orthogonal polynomials for the scalar product $[\cdot, \cdot]$, where for polynomials $p, q$ we define

$$[p, q] = \big\langle p(T_n)\, S_n^* \mathbf{Y},\; q(T_n)\, S_n^* \mathbf{Y} \big\rangle_k.$$

This can be shown as follows: $p_m$ is the orthogonal projection, with respect to $[\cdot, \cdot]$, of the origin onto the affine space $1 + x\, \mathcal{P}_{m-1}$, where $x\, \mathcal{P}_{m-1}$ denotes (with some abuse of notation) the set of polynomials of degree at most $m$ with constant coefficient equal to zero. Thus $[p_m, p_m - p] = 0$ for any $p \in \mathcal{P}_m^{(0)}$. From the theory of orthogonal polynomials, it results that for any $m$, the polynomial $p_m$ has exactly $m$ distinct roots belonging to $(0, \|T_n\|]$, which we denote by $x_1^{(m)} < \dots < x_m^{(m)}$ (in increasing order). Finally, we use the notation $c(\cdot)$ to denote a function depending on the stated parameters only, and whose exact value can change from line to line.

A.2 Preparation of the proof

We follow the general architecture of Nemirovskii's proof to establish rates. We recall that since we assume $r \ge 1/2$, the representation $f^* = \iota h^*$ holds. The main difference to Nemirovskii's original result is that (similar to the approach of [6, 7]) we use deviation bounds in a "warped" norm rather than in the standard norm. More precisely, we consider the following type of assumptions:

  • B1: $\big\|(T + \lambda)^{-1/2}(T_n - T)\big\| \le \zeta$,

  • B2: $\big\|(T + \lambda)^{-1/2}(S_n^* \mathbf{Y} - T h^*)\big\|_k \le \varepsilon$, with $\lambda > 0$
    (this implies in particular a corresponding unwarped bound via (34) below),

  •  .

In the rest of this section we set $\mu = r - \tfrac12$. Under the source condition assumption SC($r$), for $r \ge 1/2$ the representation $f^* = \iota h^*$ can be rewritten

$$h^* = T^{\mu} v \quad \text{with } \|v\|_k \le \rho;$$

by identification we therefore have the source condition for $h^*$ given by SC($\mu$) with respect to $T$, with $\mu = r - \tfrac12$ and $R = \rho$, since $\iota$ acts as an isometry between the relevant subspaces of $\mathcal{H}_k$ and $L^2(P_X)$.

We define the shortcut notation


We start with preliminary technical lemmas, before turning to the proof of Theorem 2.2.

Lemma A.1.

For any $\lambda > 0$, if assumptions SC($r$), B1, B2 and B3 hold, then for any iteration step $m$:


Recall that $x_1^{(m)} < \dots < x_m^{(m)}$ denote the roots of the residual polynomial $p_m$; define further the following function on the interval $[0, x_1^{(m)})$:

Following the idea introduced by Nemirovskii, it can be shown that

Above, the first inequality (Lemma 3.7 in [13]) is the crucial point, and relies fundamentally on the fact that $(p_m)$ is an orthogonal polynomial sequence.

We start with controlling the second term:

where the last line used the inequality (see (3.10) in [13])


for any (using the convention ), which we applied above for  . For the first term, we use assumption SC(); first consider the case :

where we applied (17) with , and (33). ∎

For the case , using (34) and arguments similar to the previous case:

Lemma A.2.

For any $\lambda > 0$, if assumptions SC($r$), B1, B2 and B3 hold, then for any iteration step $m$ and any $\nu \in [0, 1]$:

If , the above inequality is valid for any .


Set $\tilde f_m = q_m(T_n)\, T_n h^*$. This is the element in $\mathcal{H}_k$ that we obtain by applying the $m$th-iteration CG polynomial to the noiseless data. We have