# Regularized ERM on random subspaces

We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data. This approach naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. Our main results show the existence of different regimes, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance. Theoretical results are complemented with numerical experiments on large scale benchmark data sets.

## Authors

• 1 publication
• 10 publications
• 7 publications
• 63 publications
• ### Risk Bounds for Robust Deep Learning

It has been observed that certain loss functions can render deep-learnin...
09/14/2020 ∙ by Johannes Lederer, et al. ∙ 0

• ### Risk Analysis of Divide-and-Conquer ERM

Theoretical analysis of the divide-and-conquer based distributed learnin...
03/09/2020 ∙ by Yong Liu, et al. ∙ 26

• ### Theoretical Analysis of Divide-and-Conquer ERM: Beyond Square Loss and RKHS

Theoretical analysis of the divide-and-conquer based distributed learnin...
03/09/2020 ∙ by Yong Liu, et al. ∙ 0

• ### Stability of SGD: Tightness Analysis and Improved Bounds

Stochastic Gradient Descent (SGD) based methods have been widely used fo...
02/10/2021 ∙ by Yikai Zhang, et al. ∙ 8

• ### Persistent Reductions in Regularized Loss Minimization for Variable Selection

In the context of regularized loss minimization with polyhedral gauges, ...
11/30/2020 ∙ by Amin Jalali, et al. ∙ 0

• ### Two-stage Best-scored Random Forest for Large-scale Regression

We propose a novel method designed for large-scale regression problems, ...
05/09/2019 ∙ by Hanyuan Hang, et al. ∙ 0

• ### Sharp Asymptotics and Optimal Performance for Inference in Binary Models

We study convex empirical risk minimization for high-dimensional inferen...
02/17/2020 ∙ by Hossein Taheri, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Despite excellent practical performances, state of the art machine learning (ML) methods often require huge computational resources, motivating the search for more efficient solutions. This has led to a number of new results in optimization

[22, 41], as well as the development of approaches mixing linear algebra and randomized algorithms [30, 18, 56, 11]. While these techniques are applied to empirical objectives, in the context of learning it is natural to study how different numerical solutions affect statistical accuracy. Interestingly, it is now clear that there is a whole set of problems and approaches where computational savings do not lead to any degradation in terms of learning performance [39, 4, 7, 50, 28, 40, 12].

Here, we follow this line of research and study an instance of regularized empirical risk minimization where, given a fixed high– possibly infinite– dimensional hypothesis space, the search for a solution is restricted to a smaller– possibly random– subspace. This is equivalent to considering sketching operators [27], or equivalently regularization with random projections [56]. For infinite dimensional hypothesis spaces, it includes Nyström methods used for kernel methods [46] and Gaussian processes [55]. Recent works in supervised statistical learning has focused on smooth loss functions [39, 3, 31], whereas here we consider convex, Lipschitz but possibly non smooth losses. The lack of smoothness requires using different technical tools for the analysis. While for smooth losses linear algebra and matrix concentration can be used [51], here we need machinery from empirical processes [8]. In particular, fast rates require considering localized complexity measures [48, 5, 26].

Our main interest is characterizing the relation between computational efficiency and statistical accuracy. We do so studying the interplay between regularization, subspace size and different parameters describing how are hard or easy is the considered problem. Indeed, our analysis starts from basic assumption, that eventually we first strengthen to get faster rates, and then weaken to consider more general scenarios. Our results show that also for convex, Lipschitz losses there are settings in which the best known statistical bounds can be obtained while substantially reducing computational requirements. Interestingly, these effects are relevant but also less marked than for smooth losses. In particular, some form of adaptive sampling seems needed to ensure no loss of accuracy and achieve sharp learning bounds. In contrast, uniform sampling suffices to achieve similar results for smooth loss functions. It is an open question whether this is a byproduct of our analysis, or a fundamental limitation. Some preliminary numerical results complemented with numerical experiments are given considering benchmark datasets.

The rest of the paper is organized as follow. In Section 2, we introduce the setting. In Section 3, we introduce the ERM approach we consider. In Section 4, we present and discuss the main results and defer the proofs to the appendix. In Section  5, we collect some numerical results.

## 2 Statistical learning with ERM

Let

be random variables in

, with distribution satisfying the following conditions.

###### Assumption 1.

The space is a real separable Hilbert space with scalar product , is a Polish space, and there exists such that almost surely.

Since is bounded, the covariance operator given by can be shown to be self-adjoint, positive and trace class with . We can think of and as input and output spaces, respectively, and some examples are relevant.

###### Example 1.

An example is linear estimation, where

is and . Another example is kernel methods, where is a separable reproducing kernel Hilbert space on a measurable space . The data are then mapped from to through the feature map where is the (measurable) reproducing kernel of [48].

We denote by the loss function. Given a function on with values in , we view as the error made predicting by . We make the following assumption.

###### Assumption 2 (Lipschitz loss).

The loss function is convex and Lipschitz in its second argument, namely there exists such that for all and ,

 |ℓ(y,a)−ℓ(y,a′)|≤G|a−a′|andℓ0=supy∈Yℓ(y,0). (1)
###### Example 2 (Hinge loss & other loss functions).

The main example we have in mind is the hinge loss with , which is convex but not differentiable, and for which and . Another example is the logistic loss , for which and .

Given a loss, the corresponding expected risk is for all

 L(w)=E[ℓ(Y,⟨w,X⟩)]=∫H×Yℓ(y,⟨w,x⟩)dP(x,y),

and can be easily shown to be convex and Lipschitz continuous.

In this setting, we are interested in the problem of solving

 minw∈HL(w), (2)

when the distribution is known only through a training set of independent samples . Since we only have the data , we cannot solve the problem exactly and given an empirical approximate solution , a natural error measure is the the excess risk , which is a random variable through its dependence on , and hence on the data. In the following we are interested in characterizing its distribution for finite sample sizes. Next we discuss how approximate solutions can be obtained from data.

### 2.1 Empirical risk minimization (ERM)

A natural approach to derive approximate solutions is based on replacing the expected risk with the empirical risk defined for all as

We consider regularized empirical risk minimization (ERM) based on the solution of the problem,

 minw∈HˆLλ(w),ˆLλ(w)=ˆL(w)+λ∥w∥2. (3)

Note that is continuous and strongly convex, hence there exists a unique minimizer . If we let denote the data matrix, by the representer theorem [54, 42] there exists such that

 ˆwλ=ˆX⊤c. (4)

The expression of the coefficient depends on the considered loss function. Next, we comment on different approaches to obtain a solution when is the hinge loss. We add one remark first.

###### Remark 1 (Constrained ERM).

A related approach is based on considering the problem

 min∥w∥≤RˆL(w). (5)

Minimizing (3) can be seen as a Lagrange multiplier formulation of the above problem. While these problems are equivalent [10], the exact correspondence is implicit. As a consequence their statistical analysis differ. We primarily discuss Problem (3), but also analyze Problem (5) in Appendix H.

### 2.2 Computations with the hinge loss

Minimizing (3) can be solved in many ways and we provide some basic considerations. If is finite dimensional, iteratively via gradient methods can be used. For example, the subgradient method [10] applied to (3) is given, for some suitable and step-size sequence , by

 wt+1=wt−ηt(1nn∑i=1yixigi(wt)+2λwt), (6)

where is the subgradient of the map evaluated at , see also [37]. The corresponding iteration cost is in time and memory. Clearly, other variants can be considered, for example adding a momentum term [35], stochastic gradients and minibatching or considering other approaches for example based on coordinate descent [45]. When is infinite dimensional a different approach is possible, provided can be computed for all . For example, it is easy to prove by induction that the iteration in (6) satisfies , where

 ct+1=ct−ηt(1nn∑i=1yieigi(ˆX⊤ct)+2λct), (7)

and where is the canonical basis in . The cost of the above iteration is for computing , where is the cost of evaluating one inner product. Also in this case, a number of other approaches can be considered, see e.g. [48, Chap. 11] and references therein. We illustrate the above ideas for the hinge loss.

###### Example 3 (Hinge loss & SVM).

Considering problem (3) with the hinge loss corresponds to support vector machines for classification. With this choice if , if and if . In particular, in (7) we can take

## 3 ERM on random subspaces

In this paper, we consider a variant of ERM based on considering a subspace and the corresponding regularized ERM problem,

 minβ∈BˆLλ(β). (8)

As clear from (4), choosing is not a restriction and yields the same solution as considering (3). From this observation a natural choice is to consider for ,

 Bm=span{˜x1,…,˜xm} (9)

with a subset of the input points. A basic idea we consider is to sample the points uniformly at random. Another more refined choice we consider is sampling exactly or approximately (see Definition 1 in the Appendix) according to the leverages scores [17]

 li(α)=⟨xi,(ˆXˆX⊤x+αIn)−1xi⟩i=1,…,n. (10)

While leverage scores computation is costly, approximate leverage scores (ALS) computation can be done efficiently, see [38] and references therein. Following [39], other choices are possible. Indeed for any and we could consider and derive a formulation as in (11) replacing with the matrix with rows . We leave this discussion for future work. Here, we focus on the computational benefits of considering ERM on random subspaces and analyze the corresponding statistical properties.
The choice of as in (9) allows to improve computations with respect to (4). Indeed, is equivalent to the existence of s.t. , so that we can replace (8) with the problem

 minb∈Rm1nn∑i=1ℓ(yi,⟨˜X⊤b,xi⟩)+λ⟨b,˜X˜X⊤b⟩m.

Further, since is symmetric and positive semi-definite, we can derive a formulation close to that in (3), considering the reparameterization which leads to,

 mina∈Rm1nn∑i=1ℓ(yi,⟨a,xi⟩m)+λ∥a∥2m, (11)

where for all , we defined the embedding . Note that this latter operation only involves the inner product in and hence can be computed in time. The subgradient method for (11) has a cost per iteration. In summary, we obtained that the cost for the ERM on subspaces is and should be compared with the cost of solving (7) which is The corresponding costs to predict new points are and , while the memory requirements are and , respectively. Clearly, memory requirements can be reduced recomputing things on the fly. As clear from the above discussion, computational savings can be drastic, as long as , and the question arises of how this affect the corresponding statistical accuracy. Next section is devoted to this question.

## 4 Statistical analysis of ERM on random subspaces

We divide the presentation of the results in three parts. First, we consider a setting where we make basic assumptions. Then, we discuss improved results considering more benign assumptions. Finally, we describe general results covering also less favorable conditions. In all cases, we provide simplified statements for the results, omitting numerical constants, logarithmic and higher order terms, for ease of presentation. The complete statements and the proofs are provided in the appendices.

### 4.1 Basic setting

In this section, we only assume the best in the model to exist.

###### Assumption 3.

There exists such that .

We first provide some benchmark results for regularized ERM under this assumption.

###### Theorem 1 (Regularized ERM).

Under Assumption  123, the following inequality holds, for all and

, with probability at least

,

 L(ˆwλ)−L(w∗)≲G2κ2log(1/δ)λn+λ∥w∗∥2.

Hence letting leads to a rate of .

The proof of the above result is given in Appendix A. It shows the excess risk bound for regularized ERM arises from a tradeoff between an estimation and an approximation term. While this result can be derived specializing more refined analysis, see e.g. [48] or later sections, we provide a novel self-contained proof which is of interest in its own right. Similar bounds in high-probability for ERM constrained to the ball of radius can be obtained through a uniform convergence argument over such balls, see [6, 33, 24]. In order to apply this to regularized ERM, one could in principle use the fact that by Assumption 2, (see Appendix) [48], but this yields a suboptimal dependence in . [47] address this issue by considering ERM with both constraint over a ball and regularization, and derives a high-probability bound. Finally, a similar rate for , though only in expectation, can be derived through a stability argument [9, 43]. The proof of our high-probability bound builds on uniform convergence, but circumvents using a crude bound on by instead exploiting its definition as a regularized minimizer and bounding . We use this bound as a reference and to derive results for regularized ERM on subspaces. We first consider any fixed subspace and then specialize to suitable random subspaces. Given , let be the minimizer of (8), the orthogonal projection on , and

###### Theorem 2 (Regularized ERM on subspaces).

Fix , and . Under Assumptions 123, with probability at least ,

Compared to Theorem 1, the above result shows that there is an extra approximation error term due to considering a subspace. The coefficient appears in the analysis also for other loss functions, see e.g. [39, 31]. Roughly speaking, it captures how well the subspace is adapted to the problem. We next develop this reasoning, specializing the above result to a random subspace as in (9). Note that, if is random then is a random variable through its dependence on and on . We denote by the unique minimizer of on and by the corresponding projection. Further, it is also useful to introduce the so-called effective dimensions [57, 13, 39]. We denote by the distribution of , with its support111Namely, the smallest closed subset of with -measure , well-defined since is a Polish space [48]. , and define for

 dα,2=Tr((Σ+αI)−1Σ),dα,∞=supx∈supp(PX)⟨x,(Σ+αI)−1x⟩. (12)

Then, is finite since is trace class, and is finite since is bounded. Further, we denote by

the strictly positive eigenvalues of

, with eigenvalues counted with respect to their multiplicity and ordered in a non-increasing way. We borrow the following results from [39].

###### Proposition 1 (Uniform and leverage scores sampling).

Fix and . With probability at least

 μBm2=∥∥Σ1/2(I−Pm)∥∥2≤3α. (13)

provided that or for uniform and ALS sampling, respectively.

Moreover, if the spectrum of has a polynomial decay, i.e. for some

 σj(Σ)≲j−1p (14)

then (13) holds if or for uniform and ALS sampling, respectively.

Combining the above proposition with Theorem 2 we have the following.

###### Theorem 3 (Uniform and leverage scores sampling under eigen-decay).

Under Assumption  123 and condition (14), for all and , with probability ,

 L(ˆβλ,m)−L(w∗)≲G2κ2log(3/δ)λn+λ∥w∗∥2+√αG∥w∗∥.

Hence, letting , and taking points by uniform sampling or by leverage score sampling, leads to a rate of .

The above results show that it is possible to achieve the same rate of standard regularized ERM, but to do so uniform sampling does not seem to provide a computational benefit. As clear from the proof, computational benefits for smaller subspace dimension would lead to worse rates. This behavior is worse than that allowed by smooth loss functions [39, 31]. These results can be recovered with our approach. Indeed, for both least squares and self-concordant losses, the bound in Theorem (2) can be easily improved to have a linear dependence on , leading to straightforward improvements. We will detail this derivation in a longer version of the paper. Due to space constraints, here we focus on non-smooth losses, since these results, and not only their proof, are new. For this class of loss functions, Theorem 3 shows that leverage scores sampling can lead to better results depending on the spectral properties of the covariance operator. Indeed, if there is a fast eigendecay, then using leverage scores and a subspace dimension one can achieve the same rates as exact ERM. For fast eigendecay ( small), the subspace dimension can decrease dramatically. For example, as a reference for then suffices. Note that, other decays, e.g. exponential, could also be considered. These observations are consistent with recent results for random features [4, 28, 50], while they seem new for ERM on subspaces. Compare to random features the proof techniques have similarities but also differences due to the fact that in general random features do not define subspaces. Finding a unifying analysis would be interesting, but it is left for future work. Also, we note that uniform sampling can have the same properties of leverage scores sampling, if

. This happens under the strong assumptions on the eigenvectors of the covariance operator, but can also happen in kernel methods with kernels corresponding to Sobolev spaces

[49]. With these comments in mind, here, we focus on subspace defined through leverage scores noting that the assumption on the eigendecay not only allows for smaller subspace dimensions, but can also lead to faster learning rates. Indeed, we study this next.

### 4.2 Fast rates

To exploit the eigendecay assumption and derive fast rates, we begin considering further conditions on the problem. We relax these assumptions in the next section. First, we let for -almost all

 f∗(x)=argmina∈R∫Yℓ(y,a)dP(y|x) (15)

where is the conditional distribution 222The conditional distribution always exists since is separable and is a Polish space [48], of given and make the following assumption.

###### Assumption 4.

There exists such that, almost surely,

In our context, this is the same as requiring the model to be well specified. Second, following [48], we consider a loss that can be clipped at that is such that for all ,

 ℓ(y′,ycl)⩽ℓ(y′,y), (16)

where denotes the clipped value of at , that is

 ycl=−Mify⩽M,ycl=yify∈[−M,M],ycl=Mify⩾M. (17)

If , denotes the non-linear function . This assumption holds for hinge loss with , and for bounded regression. Finally, we make the following assumption on the loss.

###### Assumption 5 (Simplified Bernstein condition).

There are constants , such that for all ,

 ℓ(Y,⟨w,X⟩))⩽B (18) (19)

This is a standard assumption to derive fast rates for ERM [48, 5]. In classification with the hinge loss, it is implied by standard margin conditions characterizing classification noise, and in particular by hard margin assumptions on the data distribution [2, 52, 32, 48]. As discussed before, we next focus on subspaces defined by leverage scores and derive fast rates under the above assumptions.

###### Theorem 4.

Fix and . Under Assumptions 1, 245, and a polynomial decay of the spectrum of with rate , as in (14), then, with probability at least

provided that and are large enough. Further, for ALS with the choice

 λ≍n−11+p,α≍n−21+p,m≳n2p1+plogn,

with high probability,

 L(ˆβclλ,m)−L(w∗)≲(logn)1/2pn−11+p. (20)

The above result is a special case of the analysis in the next section, but it is easier to interpret. Compared to Theorem 3 the assumption on the spectrum also leads to an improved estimation error bound and hence improved learning rates. In this sense, these are the correct estimates since the decay of eigenvalues is used both for the subspace approximation error and the estimation error. As is clear from (20), for fast eigendecay, the obtained rate goes from to . Taking again, leads to a rate which is better than the one in Theorem 3. In this case, the subspace defined by leverage scores needs to be chosen of dimension at least . Note that again, the subspace dimension is even smaller for faster eigendecay. Next, we extend these results considering weaker, more general assumptions.

### 4.3 General analysis

Last, we give a general analysis relaxing the above assumptions. We replace Assumption 4 by

 infw∈HL(w)=E[ℓ(Y,f∗(X))], (21)

and introduce the approximation error,

 A(λ)=minw∈HL(w)+λ∥w∥2−infw∈HL(w). (22)

Condition (21) may be relaxed at the cost of an additional approximation term, but the analysis is lengthier and is postponed. It has a natural interpretation in the context of kernel methods, see Example 1, where it is satisfied by universal kernels [48]. Regarding the approximation error, note that, if exists then , and we can recover the results in Section 4.1. More generally, the approximation error decreases with and learning rates can be derived assuming a suitable decay. Further, we consider a more general form of the Bernstein condition.

###### Assumption 6 (Bernstein condition).

There exist constants , and , such that for all , the following inequalities hold almost surely:

 ℓ(Y,⟨w,X⟩))⩽B, (23) E[{ℓ(Y,(⟨w,X⟩)cl)−ℓ(Y,f∗(X))}2]⩽VE[ℓ(Y,(⟨w,X⟩)cl)−ℓ(Y,f∗(X)]θ. (24)

Again in classification, the above condition is implied by margin conditions, and the parameter characterizes how easy or hard is the classification problem. The strongest assumption is choosing , with which we recover the result in the previous section. Then, we have the following result.

###### Theorem 5.

Fix and . Under Assumptions 1, 26, and a polynomial decay of the spectrum of , as in (14), then with probability at least

 L(ˆβclλ,m)−infw∈HL(w)≲(G2κ2log(2/δ)λp)12−p−θ+θp+A(λ)+G√αA(λ)λ+Gκlog(2/δ)n√A(λ)λ.

Furthermore, if there exists such that , then with the choice for ALS

 λ≍n−min{2r+1,1r(2−p−θ+θp)+p}α≍n−min{2,r+1r(2−p−θ+θp)+p}m≳nmin{2p,p(r+1)r(2−p−θ+θp)+p}

with high probability

 L(ˆβclλ,m)−L(f∗) ≲(logn)1/2pn−min{2rr+1,rr(2−p−θ+θp)+p}.

The proof of the above bound follows combining Proposition 1 with results to analyze the learning properties of regularized ERM with kernels [48]. While general, the obtained bound is harder to parse. For the bound become vacuous and there are not enough assumptions to derive a bound [16]. Taking gives the best bound, recovering the result in the previous section when . Note that large values of are prevented, indicating a saturation effect (see [53, 34]). As before the bound improves when there is a fast eigendecay. Taking we recover the previous bounds, whereas smaller lead to worse bounds. Since, given any acceptable choice of and , the quantity takes values in , the best rate, that differently from before can also be slower than , can always be achieved choosing (up to logarithmic terms).

## 5 Experiments

We report simple experiments in the context of kernel methods, considering Nyström method. In particular, we choose the hinge loss, hence SVM for classification.
Nyström-Pegasos. Classic SVM implementations with hinge loss are based on considering a dual formulation and a quadratic programming problem [21]. This is the case for example, for the LibSVM library [14] available on Scikit-learn [36]. We use this implementation for comparison, but find it convenient to combine the Nyström method to a primal solver akin to (6) (see [29, 20] for the dual formulation). More precisely, we use Pegasos [44] which is based on a simple and easy to use stochastic subgradient iteration333Python implementation from https://github.com/ejlb/pegasos. We consider a procedure in two steps. First, we compute the embedding discussed in Section 3. With kernels it takes the form where with . Second, we use Pegasos on the embedded data. As discussed in Section 3, the total cost is in time (here , i.e. one epoch equals steps of stochastic subgradient) and in memory (needed to compute the pseudo-inverse and embedding the data in batches of size ).
Datasets & set up (see Appendix I). We consider five datasets444Datasets available from LIBSVM website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and from [23] http://manikvarma.org/code/LDKL/download.html#Jose13 of size , challenging for standard SVM implementations. We use a Gaussian kernel, tuning width and regularization parameter as explained in appendix. We report classification error and for data sets with no fixed test set, we set apart of the data.
Results We compare with linear and kernel SVM see Table 1. For all the data sets, the Nyström-Pegasos approach provides comparable performances with much better times (except for the small-size Usps). Note that K-SVM cannot be run on millions of points, whereas Nyström-Pegasos is still fast and provides much better results than linear SVM. Finally, in Figure 1 we illustrate the interplay between and for the Nyström-Pegasos considering SUSY data set.

## 6 Acknowledgements

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research. Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT).
L. R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.

## References

• [1] A. Alaoui and M. W. Mahoney.

Fast randomized kernel ridge regression with statistical guarantees.

In Advances in Neural Information Processing Systems, pages 775–783, 2015.
• [2] J.-Y. Audibert and A. B. Tsybakov.

Fast learning rates for plug-in classifiers.

The Annals of Statistics, 35(2):608–633, 2007.
• [3] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.
• [4] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
• [5] P. L. Bartlett, O. Bousquet, S. Mendelson, et al. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
• [6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
• [7] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
• [8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013.
• [9] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
• [10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• [11] D. Calandriello, A. Lazaric, and M. Valko. Distributed adaptive sampling for kernel matrix approximation. arXiv preprint arXiv:1803.10172, 2018.
• [12] D. Calandriello and L. Rosasco.

Statistical and computational trade-offs in kernel k-means.

In Advances in Neural Information Processing Systems, pages 9357–9367, 2018.
• [13] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
• [14] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
• [15] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 181–190, 2015.
• [16] L. Devroye, L. Györfi, and G. Lugosi.

A probabilistic theory of pattern recognition

, volume 31.
Springer Science & Business Media, 2013.
• [17] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.
• [18] P. Drineas and M. W. Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning. journal of machine learning research, 6(Dec):2153–2175, 2005.
• [19] E. Giné and J. Zinn. Some limit theorems for empirical processes. The Annals of Probability, pages 929–989, 1984.
• [20] C.-J. Hsieh, S. Si, and I. S. Dhillon. Fast prediction for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 3689–3697, 2014.
• [21] T. Joachims. Making large-scale svm learning practical. Technical report, Technical Report, 1998.
• [22] R. Johnson and T. Zhang.

Accelerating stochastic gradient descent using predictive variance reduction.

In Advances in neural information processing systems, pages 315–323, 2013.
• [23] C. Jose, P. Goyal, P. Aggrwal, and M. Varma. Local deep kernel learning for efficient non-linear svm prediction. In International conference on machine learning, pages 486–494, 2013.
• [24] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793–800, 2009.
• [25] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of École d’Été de Probabilités de Saint-Flour. Springer-Verlag Berlin Heidelberg, 2011.
• [26] V. Koltchinskii et al. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
• [27] S. Kpotufe and B. K. Sriperumbudur. Kernel sketching yields kernel jl. arXiv preprint arXiv:1908.05818, 2019.
• [28] Z. Li, J.-F. Ton, D. Oglic, and D. Sejdinovic. Towards a unified analysis of random fourier features. arXiv preprint arXiv:1806.09178, 2018.
• [29] Z. Li, T. Yang, L. Zhang, and R. Jin. Fast and accurate refined nyström-based kernel svm. In

Thirtieth AAAI Conference on Artificial Intelligence

, 2016.
• [30] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
• [31] U. Marteau-Ferey, D. Ostrovskii, F. Bach, and A. Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. arXiv preprint arXiv:1902.03046, 2019.
• [32] P. Massart, É. Nédélec, et al. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
• [33] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003.
• [34] N. Mücke, G. Neu, and L. Rosasco. Beating sgd saturation with tail-averaging and minibatching. In Advances in Neural Information Processing Systems, pages 12568–12577, 2019.
• [35] Y. Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
• [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
• [37] R. T. Rockafellar. Convex analysis. Number 28. Princeton university press, 1970.
• [38] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems, pages 5672–5682, 2018.
• [39] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
• [40] A. Rudi and L. Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems 30, pages 3215–3225, 2017.
• [41] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
• [42] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In

International conference on computational learning theory

, pages 416–426. Springer, 2001.
• [43] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
• [44] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
• [45] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
• [46] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
• [47] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21, pages 1545–1552, 2009.
• [48] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
• [49] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pages 79–93, 2009.
• [50] Y. Sun, A. Gilbert, and A. Tewari. But how does it work in theory? linear svm with random features. In Advances in Neural Information Processing Systems, pages 3379–3388, 2018.
• [51] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
• [52] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
• [53] E. D. Vito, L. Rosasco, A. Caponnetto, U. D. Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(May):883–904, 2005.
• [54] G. Wahba. Spline models for observational data, volume 59. Siam, 1990.
• [55] C. K. Williams and M. Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
• [56] D. P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.
• [57] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.

## Appendix A Proof of Theorem 1

This section is devoted to the proof of Theorem 1. In the following we restrict to linear functions, i.e for some and, with slight abuse of notation we set

 ℓ(w,z)=ℓ(y,⟨w,x⟩),z=(x,y)∈H×Y, w∈H.

With this notation . The Lipschitz assumption implies that is almost surely Lipschitz in its argument, with Lipschitz constant .

Specifically, we will show the following:

###### Theorem 6.

Under Assumptions 1, 2, for and let

 Cλ,δ=4{1+√log(1+log2(3+ℓ0κ2/λ))+log(2/δ)}=O(√loglog(3+ℓ0κ2/λ)+log(1/δ)).

If Assumption 3 holds, then with probability ,

 L(ˆwλ)

More generally, with probability , letting ,

 L(ˆwλ)−infHL<2A(λ)+C2λ,δG2κ2+8G2κ2log(2/δ)4λn+GCλ,δ√n+ℓ0√2log(2/δ)n (26) ⩽2(inf∥w∥⩽RL(w)−infHL)+2λR2+C2λ,δG2κ2+8G2κ2log(2/δ)4λn+GCλ,δ+ℓ0√2log(2/δ)√n

for every .

The proof starts with the following bound on the generalization gap uniformly over balls. While this result is well-known and follows from standard arguments (see, e.g., [6, 25]), we include a short proof for completeness.

###### Lemma 1.

Under Assumptions 1 and 2 and, for every , one has with probability at least ,

 sup∥w∥⩽R[L(w)−ˆL(w)]
###### Proof of Lemma 1.

The proof starts by a standard symmetrization step [19, 25]. Let us call i.i.d. from , as well as an independent i.i.d. from and i.i.d. with . We denote the error on the sample . Then,

 ED∼Pnsup∥w∥⩽R[L(w)−ˆL(w)] =EDsup∥w∥⩽R[ED′ˆL′(w)−ˆL(w)] ⩽ED,D′sup∥w∥⩽R[ˆL′(w)−ˆL(w)] =2ED,ε[sup∥w∥⩽R1nn∑i=1εiℓ(w,zi)]

where we used that , and that and have the same distribution, as well as and . The last term corresponds to the Rademacher complexity of the class of functions [6, 25]. Now, using that for , where is -Lipschitz by Assumption 2, Ledoux-Talagrand’s contraction inequality for Rademacher averages [33] gives

 ED,ε[sup∥w∥⩽R1nn∑i=1εiℓ(w,zi)] ⩽GED,ε[sup∥w∥⩽R1nn∑i=1εi⟨w,xi⟩] ⩽GRED,ε[∥∥1nn∑i=1εixi∥∥2]1/2 =GRE[∥x∥2]1/2√n ⩽GRκ√n

where we used that for by independence, and that almost surely (Assumption 1). Hence,

 (28)

To write the analogous bound in high probability we apply McDiarmid’s inequality [8]. We know that given , and defining we have

 ∣∣ϕ(D)−ϕ(Di)∣∣ ⩽sup∥w∥⩽R∣∣1nℓ(w,zi)−1nℓ(w,z′i)∣∣ ⩽Gnsup∥w∥⩽R∣∣⟨w,xi−x′i⟩∣∣ ⩽2GRκn (29)

using the Assumption 1 of boundedness of the input. Hence, by McDiarmid inequality:

 P[ϕ(D)−ED[ϕ(D)]⩾t]⩽exp(−t2n2G2R2κ2); (30)

taking so that , we obtain the desired bound (27). ∎

Lemma 1 suffices to control the excess risk of the constrained risk minimizer for . On the other hand, this result cannot be readily applied to , since its norm is itself random. Observe that, by definition and by Assumption 2,

 λ∥ˆwλ∥2⩽