# Effective Dimension of Exp-concave Optimization

We investigate the role of the effective (a.k.a. statistical) dimension in determining both the statistical and the computational costs associated with exp-concave stochastic minimization. We derive sample complexity bounds that scale with d_λ/ϵ, where d_λ is the effective dimension associated with the regularization parameter λ. These are the first fast rates in this setting that do not exhibit any explicit dependence either on the intrinsic dimension or on the ℓ_2-norm of the optimal classifier. We also propose fast preconditioned methods that solve the ERM problem in time Õ(nnz(X) + min_{λ′≥λ} (λ′/λ) d_{λ′}² d), where nnz(X) is the number of nonzero entries in the data. Our analysis emphasizes interesting connections between leverage scores, algorithmic stability and regularization. In particular, our algorithm involves a novel technique for choosing a regularization parameter λ′ that minimizes the complexity bound (λ′/λ) d_{λ′}² d, while avoiding the (approximate) computation of the effective dimension for each candidate λ′. All of our results extend to the kernel setting.


## 1 Introduction

Exp-concave stochastic optimization underlies many important machine learning problems such as linear regression, logistic regression and portfolio selection. While the worst-case complexity of exp-concave stochastic optimization is fairly well understood ([23, 29, 19, 16]), a promising avenue is to investigate these complexities under distributional assumptions. A common distributional condition which can potentially be exploited is fast eigendecay, measured quantitatively by the notion of effective dimension (see Equation (3)) ([13, 5, 26, 1]). Namely, in many machine learning problems, the eigenvalues of the population covariance matrix exhibit a fast decay, where the tail of the eigenvalues is significantly smaller than the desired precision. Naturally, this phenomenon suggests a sketch-and-solve approach, where a sufficiently accurate solution is obtained by projecting the data onto a low-dimensional space and solving the smaller problem. Indeed, many algorithmic ideas in this spirit have been suggested in recent years (e.g. [3, 26]).

A more sophisticated approach, which we name sketch-to-precondition ([2, 9]), is to enhance the performance of first-order optimization methods via preconditioning, where the preconditioner is based on a coarse low-rank approximation to the data matrix. The main message of our paper is as follows:

Main message: The sample complexity of any algorithm minimizing an exp-concave empirical risk scales optimally with the effective dimension, rendering the sketch-and-solve approach useless in the statistical setting. On the other hand, the sketch-to-precondition approach is effective for optimization and can be accelerated via model selection.

To illustrate this message, we next describe our results in the context of both linear and kernelized ℓ_2-regression.

### 1.1 Results for Linear and Kernel ℓ2-regression

We consider the problem of minimizing the expected risk

 F(w) = (1/2) E_{(x,y)∼D}[(w⊤x − y)²],   (1)

over a compact set W ⊆ ℝ^d. Here, D is a distribution over the pairs (x, y) which satisfies standard boundedness assumptions. We denote the minimizer by w⋆. As usual, the input to the learning algorithm consists of an i.i.d. sample S = ((x_1, y_1), …, (x_n, y_n)). Our focus is on algorithms that minimize the empirical risk over W. Although regularization is not needed for generalization purposes (as shown by [16]), for reasons that will become apparent soon, we introduce a ridge parameter

 λ ≜ ϵ/B²,   B ≜ diam(W),

and consider the minimization problem:

 ^w_λ ≜ argmin { ^F_λ(w) ≜ (1/2n) ∑_{i=1}^n (w⊤x_i − y_i)² + (λ/2)∥w∥² : w ∈ W }.   (2)
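For intuition, the unconstrained version of (2) admits a closed-form solution via the normal equations. The following sketch is illustrative only: it drops the constraint w ∈ W, and the data and the value of λ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 0.1

# Unconstrained minimizer of (1/(2n))||Xw - y||^2 + (lam/2)||w||^2,
# obtained from the normal equations (X^T X / n + lam I) w = X^T y / n.
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def objective(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
```

At w_hat the gradient of the regularized empirical risk vanishes, which is a convenient numerical check.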

Tight sample complexity bound in terms of the effective dimension: we define the sample complexity n(ϵ) as the minimal number of samples required for ensuring that E[F(^w_λ)] − F(w⋆) ≤ ϵ. As we mentioned above, sample complexity bounds for this formulation are well-understood. Namely, results from [19, 29, 16] imply that n(ϵ) = O(min{d/ϵ, B²/ϵ²}). We refer to the first term as a dimension-dependent fast rate (i.e., it scales with 1/ϵ rather than with 1/ϵ²), whereas the second term is a dimension-independent slow rate. While the above bound is tight, it can be significantly improved if the spectrum of the covariance of the underlying data decays fast. A common measure used to capture this decay is the effective dimension, defined by

 d_λ ≜ d_λ(E_{x∼D}[xx⊤]) ≜ ∑_{i=1}^d λ_i/(λ_i + λ),   (3)

where λ_1 ≥ ⋯ ≥ λ_d are the eigenvalues of the population covariance matrix C ≜ E[xx⊤]. Clearly, d_λ ≤ d. However, it is very typical that most of the eigenvalues are dominated by λ, and consequently d_λ ≪ d. For example, if the spectrum decays exponentially, the effective dimension is polylogarithmic in 1/λ ([13]). Our sample complexity bound in this setting is as follows.

###### Theorem 1.

The sample complexity of linear regression satisfies

 n(ϵ) = O(d_λ/ϵ),

where λ = ϵ/B².

We also prove a nearly matching lower bound (Theorem 6) in a high accuracy regime and specify our bounds for several regimes of interest corresponding to eigendecay patterns.

Essentially, we enjoy the best of both worlds, as our bound is both fast (in terms of ϵ) and dimension-independent. Also note that the bound is independent of the diameter B. The only dependence on B is implicit, through the definition of λ. Indeed, while B can be trivially used to bound the magnitude of the prediction, such a bound is often loose due to a failure of the ℓ_2-metric to capture the geometry of the problem (e.g., due to sparsity).
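To make the role of eigendecay concrete, the following sketch (with purely hypothetical spectra) evaluates Equation (3) for an exponentially decaying and a flat spectrum:

```python
import numpy as np

def effective_dimension(eigs, lam):
    """d_lambda = sum_i eigs_i / (eigs_i + lam), as in Equation (3)."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(eigs / (eigs + lam)))

d = 1000
lam = 1e-3
exp_eigs = 0.5 ** np.arange(d)   # exponential eigendecay
flat_eigs = np.ones(d)           # no decay

d_exp = effective_dimension(exp_eigs, lam)
d_flat = effective_dimension(flat_eigs, lam)
```

With exponential decay the effective dimension is roughly log_2(1/λ) ≈ 10 here, while a flat spectrum gives d_λ ≈ d.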

Redundancy of Sketch-and-Solve: It is instructive to examine the sketch-and-solve approach ([26]), whereby one uses leverage score sampling to find a small spectral approximation to the empirical covariance (respectively, the kernel) matrix using a subsample whose size scales with the effective dimension, and then solves the corresponding smaller problem (see Section C of [26] for more details).

While there are relatively efficient methods for approximating the leverage scores, their computation is clearly more involved than sampling uniformly at random. In some sense, our sample complexity result shows that sketch-and-solve is redundant.111Note that the boundedness assumption is crucial here. Sketch-and-solve can be very helpful when instances are not bounded (and instead of additive accuracy we aim at multiplicative accuracy). Namely, our bound implies that we can attain the same (additive) accuracy by sampling a training sub-sequence of the same size uniformly at random.

Efficacy of Sketch-to-Precondition in Optimization: A different approach is to use ridge leverage score sampling in order to improve the condition number of the optimization problem. Instead of aiming at a high-accuracy spectral approximation, we draw only Õ(d_λ) samples to compute a constant spectral approximation to the empirical covariance matrix. This approximation is used to reduce the condition number to a constant order (see Section 5). Notably, maintaining this preconditioner (i.e., computing it and multiplying any d-dimensional vector by its inverse) can be done in time Õ(nnz(X) + d_λ² d). By endowing Gradient Descent (GD) with this preconditioner, we can find an ϵ-approximate ERM in time Õ(nnz(X) + d_λ² d).

As we discussed above, the regularization parameter used in practice is often chosen via model selection. Both our sample and computational complexity bounds shed light on the bias-complexity trade-off reflected by the choice of λ. Namely, as we increase λ, the effective dimension (and hence the complexity) becomes smaller, whereas the bias increases. In Section 6 we show that even if we have already chosen a desired regularization parameter λ (e.g., λ = ϵ/B²), as we describe next, we may still achieve a significant gain by performing optimization with a larger parameter λ′ > λ. Namely, the effective dimension associated with λ′ might be much smaller, and we can compensate for using a larger ridge parameter by repeating the optimization process Õ(λ′/λ) times. The main challenge we need to tackle is that the cost of computing the effective dimension associated with each candidate parameter dominates the entire optimization process. The main contribution described in Section 6 is a new algorithm which finds the best ridge candidate by iteratively sharpening its estimates of the corresponding effective dimensions.

###### Theorem 2.

There exists an algorithm that finds an ϵ-approximate minimizer to (2) in time

 Õ( min_{λ′≥λ} { nnz(A) + (λ′/λ) d_{λ′}² d } ).

In Appendix D we explain how the above results extend to the kernel setting.

## 2 Related work

### 2.1 Sample complexity bounds

To the best of our knowledge, the first bounds for empirical risk minimization for kernel ridge regression in terms of the effective dimension were proved by [31]. By analyzing the local Rademacher complexity ([6]), they proved an upper bound on the sample complexity which also depends explicitly on the norm of the optimal predictor. In contrast, our bound has no such explicit dependence. More recently, [13] used compression schemes ([21]) together with results on leverage score sampling from [26] in order to derive a bound in terms of the effective dimension with no explicit norm dependence. However, their rate is slow in terms of ϵ.

Besides improving the above aspects in terms of accuracy, rate and explicit norm dependence, our analysis is arguably simpler and underscores nice connections between algorithmic stability and ridge leverage scores.

#### 2.1.1 Online Newton Sketch

The Online Newton Step (ONS) algorithm due to [17] is a well-established method for minimizing exp-concave loss functions both in the stochastic and the online settings. As hinted by its name, each step of the algorithm involves a conditioning step that resembles a Newton step. Recent papers reanalyzed ONS and proved upper bounds on the regret (and consequently on the sample complexity) in terms of the effective dimension ([22, 8]). We note that using a standard online-to-batch reduction, the regret bound of [22] implies the same (albeit a little weaker in terms of constants) sample complexity bounds as this paper. While ONS is certainly appealing in the context of regret minimization, in the statistical setting our paper establishes the sample complexity bound irrespective of the optimization algorithm used for the intermediate ERM step, thereby establishing that the computational overhead caused by conditioning in ONS is not required.222We also do not advocate ONS for offline optimization, as it does not yield a linear rate (i.e., O(log(1/ϵ)) iterations).

### 2.2 Sketch-and-Solve vs. Sketch-to-Precondition

As we discussed above, the sketch-and-solve approach (e.g. see the nice survey by [30]) has gained considerable attention recently in the context of enhancing both discrete and continuous optimization ([22, 15, 14, 9]). As we briefly mentioned above, a recent paper by [26] suggested combining ridge leverage score sampling with the Nyström method to compute a spectral approximation of the kernel matrix. As an application, they consider the problem of kernel ridge regression and describe how this spectral approximation facilitates the task of finding an ϵ-approximate minimizer. Based on Corollary 4 (with λ′ = λ), our complexity is strictly better.

We would like to stress that our results only obviate the necessity of the sketch-and-solve approach in the statistical setting, where we assume boundedness and aim at additive error bounds. On the other hand, most of sketch-and-solve results (e.g., [26]) are multiplicative and do not require boundedness.

The sketch-to-precondition approach is more appealing in scenarios where machine-precision accuracy is required ([30, Section 2.6]). In Section 5 we review this approach in detail and describe a corresponding preconditioned GD that minimizes the empirical risk efficiently (both in the linear and in the kernel settings). A different application of the sketch-to-precondition approach, due to [2], focuses on polynomial kernels and yields an algorithm whose runtime resembles ours but also scales exponentially with the polynomial degree.

## 3 Preliminaries

### 3.1 Problem Setting

We consider the problem of minimizing the expected risk

 F(w)=E(x,y)∼D[ϕy(w⊤x)], (4)

over a compact and convex set W whose diameter is denoted by B. Following [16], we assume that for all y, the loss ϕ_y is twice continuously differentiable and satisfies the following assumptions:

1. Lipschitzness: for all z, |ϕ′_y(z)| ≤ ρ.

2. Strong convexity: for all z, ϕ″_y(z) ≥ α.

3. Smoothness:333This assumption is only required for our optimization results. for all z, ϕ″_y(z) ≤ β.

As noted in [16], our framework includes all known exp-concave losses of interest. A prominent example, illustrated below, is bounded ℓ_2-regression. Further examples include logistic regression and the log-loss ([17]).

###### Example 1.

Bounded ℓ_2-regression: let X ⊆ ℝ^d and Y ⊆ ℝ be two compact sets such that ∥x∥ ≤ 1 for all x ∈ X and |y| ≤ B for all y ∈ Y. The loss is defined by ϕ_y(z) = (1/2)(z − y)². It is easily verified that α = β = 1, while the Lipschitz parameter ρ is bounded in terms of B.

The input to the learning algorithm consists of an i.i.d. sample S = ((x_1, y_1), …, (x_n, y_n)). A popular practice is regularized loss minimization (RLM) which, given a regularization parameter λ, is defined as

 ^w_λ ≜ argmin_{w∈W} ^F_λ(w) ≜ argmin_{w∈W} ( (1/n) ∑_{i=1}^n ϕ_{y_i}(w⊤x_i) + (λ/2)∥w∥² ),   (5)

where we denote f_i(w) ≜ ϕ_{y_i}(w⊤x_i).

We also define the unregularized empirical loss as

 ^F(w) ≜ (1/n) ∑_{i=1}^n ϕ_{y_i}(w⊤x_i).   (6)

The strong convexity of ϕ_y implies the following property of the empirical loss (e.g. see Lemma 2.8 of [28]).

###### Lemma 1.

Given a sample S, let ^w_λ be as defined in Equation (5). Then for all w ∈ W,

 ^F_λ(w) − ^F_λ(^w_λ) ≥ (α/2)(w − ^w_λ)⊤( (1/n)∑_{i=1}^n x_i x_i⊤ + (λ/α)I )(w − ^w_λ).
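A quick numerical sanity check of Lemma 1 is possible for the squared loss, where α = 1 and the (unconstrained, W = ℝ^d) empirical minimizer has a closed form; the data below is synthetic and illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam, alpha = 0.2, 1.0   # squared loss phi_y(z) = (z - y)^2 / 2 has alpha = 1

def F_lam(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

# Unconstrained empirical minimizer (stands in for w_hat_lambda when W = R^d).
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Lemma 1: F_lam(w) - F_lam(w_hat) >= (alpha/2)(w - w_hat)^T (C_hat + (lam/alpha) I)(w - w_hat).
H = X.T @ X / n + (lam / alpha) * np.eye(d)
w = rng.standard_normal(d)
lhs = F_lam(w) - F_lam(w_hat)
rhs = 0.5 * alpha * (w - w_hat) @ H @ (w - w_hat)
```

For the quadratic loss the inequality in fact holds with equality, which makes the check tight.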

### 3.2 Sketching via leverage-score sampling

In this section we define the notion of ridge leverage scores, relate it to the effective dimension and explain how sampling according to these scores facilitates the task of spectral approximation.

Given a sample S = ((x_1, y_1), …, (x_n, y_n)), we define the data matrix by

 A = [a_1; …; a_n] = n^{−1/2}[x_1; …; x_n] ∈ ℝ^{n×d}.

Given a ridge parameter λ, we define the i-th leverage score by

 τλ,i=a⊤i(A⊤A+λI)−1ai.

It is easily seen that ∑_{i=1}^n τ_{λ,i} = d_λ(A⊤A). The following lemma intuitively says that the (ridge) leverage score captures the importance of the i-th example in composing the column space of the covariance matrix. The proof is detailed in Appendix F.

###### Lemma 2.

For a ridge parameter λ and for any i ∈ [n], τ_{λ,i} is the minimal scalar t such that a_i a_i⊤ ⪯ t(A⊤A + λI).
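Both the identity ∑_i τ_{λ,i} = d_λ(A⊤A) and the characterization in Lemma 2 are easy to check numerically; the following sketch uses synthetic data for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 8
A = rng.standard_normal((n, d)) / np.sqrt(n)   # rows a_i = x_i / sqrt(n)
lam = 0.05

M = np.linalg.inv(A.T @ A + lam * np.eye(d))
tau = np.einsum('ij,jk,ik->i', A, M, A)        # tau_i = a_i^T (A^T A + lam I)^{-1} a_i

eigs = np.linalg.eigvalsh(A.T @ A)
d_lam = float(np.sum(eigs / (eigs + lam)))     # effective dimension of A^T A

# Lemma 2 (checked for row 0): a_0 a_0^T <= tau_0 (A^T A + lam I) in the PSD order.
gap = tau[0] * (A.T @ A + lam * np.eye(d)) - np.outer(A[0], A[0])
```

The trace identity follows since ∑_i a_i⊤ M a_i = tr(A⊤A (A⊤A + λI)^{−1}).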

The notion of leverage scores gives rise to a natural algorithm for spectral approximation: sample rows with probability proportional to the corresponding ridge leverage scores. Before describing the sampling procedure, we define the goal of spectral approximation.

###### Definition 1.

(Spectral approximation) We say that a matrix ~A is an ϵ-spectral approximation to A if

 ((1−ϵ)/(1+ϵ))(A⊤A + λI) ⪯ ~A⊤~A + λI ⪯ A⊤A + λI.
###### Definition 2.

(Ridge Leverage Score Sampling) Let u_1, …, u_n be a sequence of ridge leverage score overestimates, i.e., u_i ≥ τ_{λ,i} for all i. For a fixed positive constant c and an accuracy parameter ϵ, define p_i ≜ min{1, c ϵ^{−2} u_i log n} for each i. Let Sample(u, λ, ϵ) denote a function which returns a diagonal matrix S ∈ ℝ^{n×n}, where S_{ii} = 1/√p_i with probability p_i and S_{ii} = 0 otherwise.

###### Theorem 3.

[24, 26] Let u_1, …, u_n be ridge leverage score overestimates, and let S = Sample(u, λ, ϵ).444We use the symbols c, C to denote global constants.

1. With high probability, SA is an ϵ-spectral approximation to A.

2. With high probability, S has at most O(ϵ^{−2} log n ⋅ ∑_i u_i) nonzero entries. In particular, if u_i ≤ C τ_{λ,i} for some constant C, then S has at most Õ(d_λ/ϵ²) nonzero entries.

3. There exists an algorithm which computes overestimates u with u_i ≤ C τ_{λ,i} for all i in time Õ(nnz(A) + d_λ² d).
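The mechanics of the sampling operator in Definition 2 can be sketched directly; this is an illustrative toy with synthetic data, and the structural checks below are deterministic (with trivial overestimates u_i = 1, every p_i saturates at 1, so S is exactly the identity).

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 5
A = rng.standard_normal((n, d)) / np.sqrt(n)
lam = 0.1

def sample_matrix(u, eps, c=1.0, rng=rng):
    """Diagonal sampling matrix of Definition 2:
    S_ii = 1/sqrt(p_i) with probability p_i = min(1, c * eps^-2 * u_i * log n), else 0."""
    u = np.asarray(u, dtype=float)
    p = np.minimum(1.0, c * eps ** -2 * u * np.log(len(u)))
    keep = rng.random(len(u)) < p
    return np.diag(np.where(keep, 1.0 / np.sqrt(p), 0.0)), p

# Exact ridge leverage scores serve as (tight) overestimates here.
M = np.linalg.inv(A.T @ A + lam * np.eye(d))
tau = np.einsum('ij,jk,ik->i', A, M, A)

S, p = sample_matrix(tau, eps=0.5)

# Deterministic sub-case: trivial overestimates force p_i = 1, so S = I.
S_triv, p_triv = sample_matrix(np.ones(n), eps=0.5)
```

In expectation SᵀS is the identity, which is why SA preserves the spectrum of A up to the stated factors.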

### 3.3 Stability

In this section we define the notion of algorithmic stability, a common tool to bound the generalization error of a given algorithm. Analogously to the definition of ^w_λ in (2), for each i ∈ [n], we define ^w_{λ,i} to be the predictor produced by the algorithm on the sample S^i, obtained from S by replacing the example (x_i, y_i) with a fresh i.i.d. pair (x_i′, y_i′). We can now define the stability terms

 Δ_i ≜ f_i(^w_{λ,i}) − f_i(^w_λ)  and  Δ_i′ ≜ f_i′(^w_λ) − f_i′(^w_{λ,i}),

where f_i′(w) ≜ ϕ_{y_i′}(w⊤x_i′).

The following theorem relates the expected generalization error to the expected average stability.

###### Theorem 4.

We have E[F(^w_λ) − ^F(^w_λ)] = E[(1/n) ∑_{i=1}^n Δ_i].

## 4 Sample Complexity Bounds for Exp-Concave Minimization

In this section we show nearly tight sample complexity bounds for exp-concave minimization based on the effective dimension. Let C ≜ E_{x∼D}[xx⊤].

###### Theorem 5.

For any λ > 0, the excess risk of RLM is bounded as follows:

 E_{S∼D^n}[F(^w_λ) − F(w⋆)] ≤ 8ρ² d_{λ/α}(C)/(αn) + (λ/2)B².

Choosing λ = ϵ/B² gives us the following corollary.

###### Corollary 1.

The sample complexity of RLM with λ = ϵ/B² is bounded as n(ϵ) = O(ρ² d_{λ/α}(C)/(αϵ)).

###### Remark 1.

To obtain high-probability bounds (rather than in expectation) we can employ the validation process suggested in [25].

###### Proof of Theorem 5.

For a given sample S, define ^τ_1, …, ^τ_n to be the associated leverage scores with ridge parameter λ/α. We first use Theorem 4 to relate the excess risk to the average stability:

 E[F(^w_λ) − F(w⋆)] = E[F(^w_λ) − ^F(^w_λ)] + E[^F(^w_λ) − F(w⋆)]
 ≤ E[F(^w_λ) − ^F(^w_λ)] + E[^F_λ(^w_λ) − ^F_λ(w⋆)] + (λ/2)B²
 ≤ E[(1/n) ∑ Δ_i] + (λ/2)B².

It is left to bound the average stability. Towards this end we fix some i ∈ [n]. By the mean value theorem, there exists a scalar ξ between ^w_λ⊤x_i and ^w_{λ,i}⊤x_i such that Δ_i = ϕ′_{y_i}(ξ) ⋅ x_i⊤(^w_{λ,i} − ^w_λ). We now have that

 Δ_i ≤ ρ ⋅ |x_i⊤(^w_{λ,i} − ^w_λ)| = ρ ⋅ √((^w_{λ,i} − ^w_λ)⊤ x_i x_i⊤ (^w_{λ,i} − ^w_λ))
 ≤ ρ ⋅ √(n ^τ_i ⋅ (^w_{λ,i} − ^w_λ)⊤(^C + (λ/α)I)(^w_{λ,i} − ^w_λ))   (Lemma 2)
 ≤ ρ ⋅ √(2n ^τ_i/α) ⋅ √(^F_λ(^w_{λ,i}) − ^F_λ(^w_λ))   (Lemma 1)
 ≤ ρ ⋅ √(2n ^τ_i/α) ⋅ √((Δ_i + Δ_i′)/n),

where the last inequality uses the fact that ^F_λ(^w_{λ,i}) − ^F_λ(^w_λ) ≤ (Δ_i + Δ_i′)/n. Similarly, Δ_i′ ≤ ρ ⋅ √(2n ^τ_i′/α) ⋅ √((Δ_i + Δ_i′)/n), where ^τ_i′ is the i-th ridge leverage score corresponding to the sample S^i.

Combining the above and using the fact that x ≤ a√x implies x ≤ a², we obtain that

 Δ_i + Δ_i′ ≤ 4ρ²(^τ_i + ^τ_i′)/α  ⟹  (1/n)∑_{i=1}^n (Δ_i + Δ_i′) ≤ (4ρ²/(αn)) ∑_{i=1}^n (^τ_i + ^τ_i′).

Since ^τ_i and ^τ_i′ (similarly, ^C and its counterpart formed from S^i) are identically distributed, the result now follows from the following lemma, whose proof is given in Appendix A.

###### Lemma 3.

Let x be a random vector supported in a bounded subset of ℝ^d, and let C ≜ E[xx⊤]. Let ^C ≜ (1/n)∑_{i=1}^n x_i x_i⊤, where x_1, …, x_n are i.i.d. copies of x. Then for any fixed λ > 0,

 E[dλ(^C)]≤2dλ(C)

We now state a nearly matching lower bound on the sample complexity. To exhibit a lower bound we consider the special case of linear regression. Notably, our lower bound holds for any spectrum specification. The proof appears in Appendix C.

###### Theorem 6.

Given numbers λ_1 ≥ ⋯ ≥ λ_d > 0 and B > 0, define Σ ≜ diag(λ_1, …, λ_d).555For any vector v, diag(v) is the diagonal matrix whose i-th diagonal entry is v_i. Then there exists a distribution D over ℝ^d × ℝ, whose covariance matrix is Σ, such that any algorithm that returns a linear predictor ^w, given n independent samples from D, satisfies

 E_{S∼D^n}[E_{(x,y)∼D}[(1/2)(^w⊤x − y)²]] − min_{w:∥w∥≤B} E[(1/2)(w⊤x − y)²] ≥ d_{γ/(nB²)}/n

for any γ satisfying

 d_{γ/(nB²)} − ∑_{i=1}^d (λ_i/(λ_i + γ/(nB²)))² ≤ γ.   (7)

To put the bound achieved by Theorem 6 into perspective, we specialize it to two popular eigenvalue profiles defined in [13]. We say that a given eigenvalue profile satisfies polynomial decay if there exist numbers C, p > 0 such that λ_i ≤ C i^{−p}. Similarly, it satisfies γ-exponential decay if there exists a number C > 0 such that λ_i ≤ C γ^i for some γ ∈ (0, 1). Nearly matching upper and lower bounds for polynomial and exponential decays are specified in Appendix E (see exact statements there).

## 5 Sketch-to-precondition: an overview

In this section we describe in more detail the sketch-to-precondition approach and specify it to exp-concave stochastic optimization. This scheme will serve as a basis for the acceleration technique presented in the next section.

For concreteness, suppose we apply Gradient Descent (GD) to minimize the regularized risk (5). Denote by ^C ≜ (1/n)∑_{i=1}^n x_i x_i⊤ the empirical covariance matrix. As we assume that each ϕ_y is β-smooth and α-strongly convex, it can be easily verified that the entire regularized risk is (α λ_min(^C) + λ)-strongly convex and (β λ_max(^C) + λ)-smooth. Denote ^F_λ's strong convexity and smoothness parameters by μ and L, respectively. The quantity κ ≜ L/μ is referred to as the condition number of the regularized risk. It is well known (e.g., see [27]) that GD converges after O(κ log(1/ϵ)) iterations. Note that if the eigendecay is fast, the condition number may be much larger than the so-called functional condition number

 ~κ≜β+λα+λ .

Preconditioning can be seen as a change of variable: instead of optimizing over w, we optimize over v = P^{1/2}w, where P is called a preconditioner. It can be easily verified (e.g. see [14]) that this operation amounts to replacing each instance x_i with P^{−1/2}x_i (after decomposing the regularization into a suitable form). Straightforward calculations show that the Hessian of the transformed objective at any point becomes

 P−1/2(1nn∑i=1ϕ′′yi(w⊤xi)xix⊤i+λI)P−1/2.

Therefore, if P approximates ^C + λI up to constant factors, the smoothness and strong convexity of ϕ imply that the resulting condition number is O(~κ). Using Theorem 3, we can compute a constant spectral approximation to the data matrix in time Õ(nnz(A) + d_λ² d). Furthermore, multiplying any d-dimensional vector with the inverse of this approximation can be done in time Õ(d_λ d). Note that the gradient at a point w is (1/n)∑_{i=1}^n ϕ′_{y_i}(w⊤x_i)x_i + λw. By maintaining both the preconditioner and its inverse, and assuming that each ϕ′_y can be evaluated in constant time, we are able to perform a single step of preconditioned GD in time Õ(nnz(A) + d_λ d). Overall, we obtain the following result.

###### Theorem 7.

There exists an algorithm that finds an ϵ-approximate minimizer to (5) in time Õ((β/α) ⋅ (nnz(A) + d_λ² d)).

In the kernel setting we need to make some modifications to this scheme. First, we need to form the Gram matrix, at the cost of O(n²) kernel evaluations. Furthermore, as the number of samples n replaces the intrinsic dimension d, maintaining the preconditioner costs Õ(d_λ² n) rather than Õ(d_λ² d).

Finally, we remark that by using more advanced first-order methods such as Accelerated SVRG ([18, 20]), we can obtain a better dependence on the condition number. However, to keep our presentation simple we stick to GD.

## 6 Optimizing the Tradeoff between Oracle Complexity and Effective Dimensionality

As explained in the introduction, given a ridge parameter λ, we may prefer to perform optimization with a different ridge parameter λ′ > λ in order to accelerate the optimization process.

### 6.1 The Proximal Point algorithm: overview

Before quantifying the tradeoff reflected by the choice of λ′, we need to explain how to reduce minimization w.r.t. λ to minimization w.r.t. λ′. The basic idea is to repeat the minimization process for Õ(λ′/λ) epochs. We demonstrate this idea using the Proximal Point algorithm (PPA) due to [12]. For a fixed \bar{w} ∈ W, define ^F_{λ,\bar{w}}(w) ≜ ^F_λ(w) + (λ′/2)∥w − \bar{w}∥². Suppose we start from w_0 ∈ W. At time t, we find a point w_t satisfying

 ^F_{λ,w_{t−1}}(w_t) − min_{w∈W} ^F_{λ,w_{t−1}}(w) ≤ (cλ/λ′)(^F_{λ,w_{t−1}}(w_{t−1}) − min_{w∈W} ^F_{λ,w_{t−1}}(w)).
###### Lemma 4.

[12] Applying PPA with a suitable constant c yields an ϵ-approximate minimizer to ^F_λ after T = Õ((λ′/λ) log(1/ϵ)) epochs, i.e., ^F_λ(w_T) − min_{w∈W} ^F_λ(w) ≤ ϵ.
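The PPA reduction can be sketched for ridge regression, where each proximal subproblem is again a ridge problem and is solved here in closed form; the data and the values of λ, λ′ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam, lam_p = 0.01, 0.5          # target ridge lam, larger optimization ridge lam'

H = X.T @ X / n + lam * np.eye(d)
b = X.T @ y / n
w_star = np.linalg.solve(H, b)  # minimizer of the lam-regularized risk

# Each PPA epoch minimizes F_lam(w) + (lam'/2)||w - w_prev||^2, which for the
# squared loss reduces to solving (H + lam' I) w = b + lam' w_prev.
H_prox = H + lam_p * np.eye(d)
w = np.zeros(d)
for _ in range(100):
    w = np.linalg.solve(H_prox, b + lam_p * w)
```

The iteration contracts the error by a factor of λ′/(λ′ + μ) per epoch, where μ is the strong convexity of the λ-regularized risk, which is the source of the Õ(λ′/λ) epoch count.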

Applying PPA while using sketch-to-preconditioning as described in Section 5 yields the following complexity bound:

 Õ( min { (λ′/λ)(β/α) ⋅ (nnz(A) + d_{λ′}² d) : λ′ ≥ λ } ).

Focusing on the (reasonable) regime where the term d_{λ′}² d dominates nnz(A),666The term involving nnz(A) can be easily optimized w.r.t. the ridge parameter λ′. we note that the ratio λ′/λ may be large. Notably, while the deterioration in runtime scales linearly with λ′/λ, the improvement in terms of d_{λ′} is quadratic. For instance, if d_{λ′} ≤ (λ/λ′) d_λ, the computational gain is a factor of λ′/λ.

Therefore, we wish to minimize the complexity term

 ψ(λ′) ≜ (λ′/λ) d_{λ′}²   (8)

over all possible λ′ ≥ λ. To this end, suppose that we had access to an oracle that computes d_{λ′} for a given parameter λ′. Using the continuity of the effective dimension, we could optimize the above quantity over a discrete set of the form {λ, 2λ, 4λ, …}.777Clearly, the optimal ridge parameter cannot exceed a bounded range, hence this set is of logarithmic size. The main difficulty stems from the fact that the cost of implementing this oracle already dominates the cost of the entire optimization process.
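For concreteness, the grid search over {λ, 2λ, 4λ, …} can be sketched with an exact effective-dimension oracle; note this oracle is precisely the expensive step that Section 6.3 avoids, and the polynomial-decay spectrum below is a hypothetical example in which a larger λ′ wins.

```python
import numpy as np

def effective_dimension(eigs, lam):
    return float(np.sum(eigs / (eigs + lam)))

def best_ridge(eigs, lam, lam_max):
    """Minimize psi(lam') = (lam'/lam) * d_{lam'}^2 over the grid {lam, 2lam, 4lam, ...}."""
    best_lp, best_psi = None, np.inf
    lp = lam
    while lp <= lam_max:
        psi = (lp / lam) * effective_dimension(eigs, lp) ** 2
        if psi < best_psi:
            best_lp, best_psi = lp, psi
        lp *= 2
    return best_lp, best_psi

eigs = np.arange(1, 10001) ** -1.5    # polynomial eigendecay, lambda_i = i^{-3/2}
lam = 1e-3
lam_best, psi_best = best_ridge(eigs, lam, 1.0)
psi_at_lam = effective_dimension(eigs, lam) ** 2   # psi at lam' = lam
```

With decay i^{−3/2} the effective dimension shrinks fast enough that doubling λ′ pays off: the quadratic drop in d_{λ′}² outweighs the linear λ′/λ factor.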

### 6.3 Efficient tuning using undersampling

Our second main contribution is a novel approach for minimizing (8) in a negligible amount of time.

###### Theorem 8.

There exists an algorithm which receives a data matrix A and a regularization parameter λ, and with high probability outputs a regularization parameter ¯λ satisfying

 (¯λ/λ) d_{¯λ}² = O(ψ⋆), where ψ⋆ ≜ min_{λ′≥λ} { (λ′/λ) d_{λ′}² }.

The runtime of the algorithm is negligible relative to the cost of the ensuing optimization.

###### Corollary 2.

There exists an algorithm that finds an -approximate solution to (5) in time

 Õ( min { (λ′/λ)(β/α) ⋅ (nnz(A) + d_{λ′}² d) : λ′ ≥ λ } ).

The main idea behind Theorem 8 is that instead of (approximately) computing the effective dimension for each candidate λ′, we guess the optimal complexity ψ⋆ and employ undersampling to test whether a given candidate λ′ attains the desired complexity. The key ingredient of this approach is described in the following theorem.

###### Theorem 9.

Let λ′ ≥ λ and let ψ > 0 be a guess for the optimal complexity. There exists an algorithm that verifies whether ψ(λ′) = O(ψ) in time Õ(nnz(A) + ψd).

###### Proof.

(of Theorem 8) Starting from a small constant as our guess for ψ⋆, we double our guess until finding a candidate λ′ which satisfies the desired bound. According to Theorem 9, for each guess ψ and candidate λ′, the complexity of verifying whether ψ(λ′) = O(ψ) is at most Õ(nnz(A) + ψd). The number of such tests is logarithmic, hence the theorem follows. ∎

#### 6.3.1 Proof of Theorem 9

Inspired by [10, 11], our strategy is to use undersampling to obtain sharper estimates to the ridge leverage scores. We start by incorporating an undersampling parameter into Definition 2.

###### Definition 3.

(Ridge Leverage Score Undersampling) Let u_1, …, u_n be a sequence of ridge leverage score overestimates, i.e., u_i ≥ τ_{λ,i} for all i. For a fixed positive constant c, an accuracy parameter ϵ and an undersampling parameter m ≤ 1, define p_i ≜ min{1, c ϵ^{−2} u_i log n} for each i. Let Sample(u, λ, ϵ, m) denote a function which returns a diagonal matrix S ∈ ℝ^{n×n}, where S_{ii} = 1/√p_i with probability m ⋅ p_i and S_{ii} = 0 otherwise.

Note that while we reduce each sampling probability by a factor of m, the definition of S_{ii} neglects this modification. Hence, our undersampling is equivalent to sampling according to Definition 2 and then preserving each row with probability m. By employing undersampling we cannot hope to obtain a constant approximation to the true ridge leverage scores. However, as we describe in the following theorem, this strategy still helps us to sharpen our estimates of the ridge leverage scores.

###### Theorem 10.

Let u_i ≥ τ_{λ,i} for all i and let m ≤ 1 be an undersampling parameter. Given S = Sample(u, λ, ϵ, m), we form new estimates by

 u_i^{(new)} ≜ min{ a_i⊤(A⊤S⊤SA + λI)^{−1} a_i,  u_i }.   (9)

Then with high probability, each u_i^{(new)} is an overestimate of τ_{λ,i}, and the sum ∑_i u_i^{(new)} shrinks by a factor depending on m relative to ∑_i u_i.

The proof of the theorem (which is similar to Theorem 3 of [10] and Lemma 13 of [11]) is provided in Appendix B. Equipped with this result, we employ the following strategy in order to verify whether ψ(λ′) = O(ψ): applying the theorem with a suitable undersampling parameter yields sharper overestimates whose sum can be compared against the guess ψ. This gives rise to the following test:

1. If the current sum of overestimates is sufficiently small relative to ψ, accept the hypothesis that ψ(λ′) = O(ψ).

2. If the sum exceeds the guess by a suitable margin, reject the hypothesis that ψ(λ′) = O(ψ).

3. Otherwise, apply Theorem 10 to obtain a new vector of overestimates, u^{(new)}.
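The update (9) itself is easy to sketch numerically. The toy below is illustrative only: it undersamples rows at a fixed rate of 1/2 rather than by the score-proportional probabilities of Definition 3, and checks that the estimates only shrink while staying nonnegative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 60, 6
A = rng.standard_normal((n, d)) / np.sqrt(n)
lam = 0.1

# Exact ridge leverage scores, for reference.
M = np.linalg.inv(A.T @ A + lam * np.eye(d))
tau = np.einsum('ij,jk,ik->i', A, M, A)

u = np.ones(n)                        # trivial initial overestimates
keep = rng.random(n) < 0.5            # undersampled row selection (fixed rate 1/2)
S = np.diag(np.where(keep, np.sqrt(2.0), 0.0))   # rescale kept rows by 1/sqrt(1/2)

# Update (9): u_i_new = min(a_i^T (A^T S^T S A + lam I)^{-1} a_i, u_i).
M_s = np.linalg.inv(A.T @ S.T @ S @ A + lam * np.eye(d))
u_new = np.minimum(np.einsum('ij,jk,ik->i', A, M_s, A), u)
```

One round of the update already replaces the trivial estimates by values close to the true leverage scores, at the cost of a much smaller regression problem.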

###### Proof.

(of Theorem 9) Note that the rank of the matrix SA is suitably bounded with high probability. Hence, each step of the testing procedure costs Õ(nnz(A) + ψd).888Namely, we can compute (A⊤S⊤SA + λI)^{−1} efficiently by exploiting the low rank of SA; thereafter, computing the quadratic forms a_i⊤(A⊤S⊤SA + λI)^{−1}a_i takes time Õ(nnz(A)). Since our range of candidate ridge parameters is of logarithmic size and each test consists of a logarithmic number of steps, the theorem follows using the union bound. ∎

## Acknowledgements

We thank Elad Hazan, Ohad Shamir and Christopher Musco for fruitful discussions and valuable suggestions.

## References

• [1] Ahmed El Alaoui and Michael W Mahoney. Fast Randomized Kernel Methods With Statistical Guarantees.
• [2] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster Kernel Ridge Regression Using Sketching and Preconditioning. SIAM Journal on Matrix Analysis and Applications, 1(1), 2017.
• [3] Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace Embeddings for the Polynomial Kernel. Advances in Neural Information Processing Systems, 2014.
• [4] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random {F}ourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. Proceedings of the 34th International Conference on Machine Learning, 70:253–262, 2017.
• [5] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, PMLR 30, 2013.
• [6] Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
• [7] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
• [8] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.
• [9] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.
• [10] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform Sampling for Matrix Approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.
• [11] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input Sparsity Time Low-Rank Approximation via Ridge Leverage Score Sampling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2017.
• [12] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512, 2015.
• [13] Surbhi Goel and Adam Klivans. Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks. Advances in Neural Information Processing Systems, 2017.
• [14] Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned svrg. In International Conference on Machine Learning, pages 1397–1405, 2016.
• [15] Alon Gonen and Shai Shalev-Shwartz. Faster sgd using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.
• [16] Alon Gonen and Shai Shalev-Shwartz. Average Stability is Invariant to Data Preconditioning. Implications to Exp-concave Empirical Risk Minimization. arXiv preprint arXiv:1601.04011, 2016.
• [17] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
• [18] Rie Johnson and Tong Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Advances in Neural Information Processing Systems, 2013.
• [19] Tomer Koren and Kfir Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, pages 1477–1485, 2015.
• [20] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
• [21] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical report, Technical report, University of California, Santa Cruz, 1986.
• [22] Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, pages 902–910, 2016.
• [23] Mehrdad Mahdavi. Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization. In Proceedings of The 28th Conference on Learning Theory, PMLR 40:1–16, 2015.
• [24] Michael W Mahoney and S Muthukrishnan. Relative-Error CUR Matrix Decompositions. SIAM Journal on Matrix Analysis and Applications, 2008.
• [25] Nishant A. Mehta. From exp-concavity to variance control: O(1/n) rates and online-to-batch conversion with high probability. arXiv preprint arXiv:1605.01288, 2016.
• [26] Cameron Musco and Christopher Musco. Recursive Sampling for the Nyström Method. In Advances In Neural Information Processing Systems, 2017.
• [27] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
• [28] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2011.
• [29] Ohad Shamir. The Sample Complexity of Learning Linear Predictors with the Squared Loss. Journal of Machine Learning Research, 16:3475–3486, 2015.
• [30] David P. Woodruff. Sketching as a Tool for Numerical Linear Algebra. arXiv preprint arXiv:1411.4357, 2014.
• [31] Tong Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, pages 471–478, 2003.

## Appendix A Concentration of The Effective Dimension

###### Proof of Lemma 3.

Let k ≜ |{i : λ_i ≥ λ}| and denote the spectral decomposition of C by C = ∑_{i=1}^d λ_i u_i u_i⊤. Note that

 ∀ i ∈ [k]: λ_i/(λ_i + λ) ≥ λ_i/(2λ_i) = 1/2,   ∀ i > k: λ_i/(λ_i + λ) ≥ λ_i/(2λ).

Therefore,

 d_λ(C) = ∑_{i=1}^k λ_i/(λ_i + λ) + ∑_{i>k} λ_i/(λ_i + λ) ≥ (1/2)(k + ∑_{i>k} λ_i/λ).   (10)

Denote the eigenvalues of ^C by ^λ_1 ≥ ⋯ ≥ ^λ_d. Since x/(x + λ) ≤ min{1, x/λ} for any x ≥ 0, we have that

 d_λ(^C) = ∑_{i=1}^k ^λ_i/(^λ_i + λ) + ∑_{i>k} ^λ_i/(^λ_i + λ) ≤ k + ∑_{i>k} ^λ_i/(^λ_i + λ) ≤ k + ∑_{i>k} ^λ_i/λ.   (11)

We now consider the random variable ∑_{i>k} ^λ_i. To argue about this random variable, consider the following identity, which follows from the Courant–Fischer min-max principle for real symmetric matrices.

 ∑_{i>k} ^λ_i = min{ tr(V⊤^CV) : V ∈ ℝ^{d×(d−k)}, V⊤V = I }.

Let U_{>k} ∈ ℝ^{d×(d−k)} be the matrix whose columns are the eigenvectors u_{k+1}, …, u_d of C. We now have that

 E[∑_{i>k} ^λ_i] = E[min{ tr(V⊤^CV) : V ∈ ℝ^{d×(d−k)}, V⊤V = I }]
 ≤ min{ E[tr(V⊤^CV)] : V ∈ ℝ^{d×(d−k)}, V⊤V = I }
 ≤ E[tr(U_{>k}⊤ ^C U_{>k})] = tr(U_{>k}⊤ C U_{>k}) = ∑_{i>k} λ_i.   (12)

Combining Equation (10), Equation (11) and Equation (12) and taking expectations, we get that

 E[dλ(^C)]≤2dλ(C)

## Appendix B Ridge Leverage Score Undersampling

In this section we prove Theorem 10. The next lemma intuitively says that only a small fraction of the rows might have a high leverage score.

###### Lemma 5.

Let u_i ≥ τ_{λ,i} for all i, and denote by d_λ the effective dimension of A⊤A. For any threshold there exists a diagonal rescaling matrix W such that all the rescaled leverage scores fall below the threshold, while only a small fraction of the rows are rescaled.

###### Proof.

We prove the lemma by considering a hypothetical algorithm which constructs a sequence