1 Introduction
Exp-concave stochastic optimization
underlies many important machine learning problems such as
linear regression, logistic regression and portfolio selection. While the worst-case complexity of exp-concave stochastic optimization is fairly well understood ([23, 29, 19, 16]), a promising avenue is to investigate these complexities under distributional assumptions. A common distributional condition which can potentially be exploited is fast eigendecay (measured quantitatively by the notion of effective dimension; see Equation (3)) ([13, 5, 26, 1]). Namely, in many machine learning problems, the eigenvalues of the population covariance matrix exhibit a fast decay, where the tail eigenvalues are significantly smaller than the desired precision. Naturally, this phenomenon suggests a
sketch-and-solve approach, where a sufficiently accurate solution is obtained by projecting the data onto a low-dimensional space and solving the smaller problem. Indeed, many algorithmic ideas in this spirit have been suggested in recent years (e.g., [3, 26]). A more sophisticated approach, which we name sketch-to-precondition
([2, 9]), is to enhance the performance of first-order optimization methods via preconditioning, where the preconditioner is based on a coarse low-rank approximation to the data matrix. The main message
of our paper is as follows:
Main message: The sample complexity of any algorithm minimizing an exp-concave empirical risk scales optimally with the effective dimension, rendering the sketch-and-solve
approach useless in the statistical setting. On the other hand, the sketch-to-precondition approach is effective for optimization and can be accelerated via model selection.
To illustrate this message, we next describe our results in the context of both linear and kernel regression.
1.1 Results for Linear and Kernel regression
Consider the task of minimizing
(1)   $\min_{w \in \mathcal{W}} F(w), \qquad F(w) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ (\langle w, x \rangle - y)^2 \big]$
over a compact set $\mathcal{W} \subseteq \mathbb{R}^d$. Here, $\mathcal{D}$ is a distribution over examples satisfying suitable boundedness conditions, and we denote the minimizer by $w^\star$. As usual, the input to the learning algorithm consists of an i.i.d. sample of size $n$. Our focus is on algorithms that minimize the empirical risk over $\mathcal{W}$. Although regularization is not needed for generalization purposes (as shown by [16]), for reasons that will become apparent soon, we introduce a ridge parameter $\lambda > 0$
and consider the minimization problem:
(2)   $\min_{w \in \mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} (\langle w, x_i \rangle - y_i)^2 + \frac{\lambda}{2} \|w\|^2$
Tight sample complexity bound in terms of the effective dimension: we define the sample complexity $m(\epsilon)$ as the minimal number of samples required for ensuring that $\mathbb{E}[F(\hat{w})] - F(w^\star) \le \epsilon$. As we mentioned above, sample complexity bounds for this formulation are well understood. Namely, results from [19, 29, 16] imply that $m(\epsilon) = \tilde{O}\left( \min\left\{ \frac{d}{\epsilon}, \frac{1}{\epsilon^2} \right\} \right)$. We refer to the leftmost term as a dimension-dependent fast rate (i.e., it scales with $1/\epsilon$ rather than with $1/\epsilon^2$), whereas the right term is a dimension-independent slow rate. While the above bound is tight, it can be significantly improved if the spectrum of the covariance of the underlying data decays fast. A common measure used to capture this decay is the effective dimension, defined by
(3)   $d_\lambda := \sum_{i=1}^{d} \frac{\lambda_i}{\lambda_i + \lambda}$
where $\lambda_1 \ge \cdots \ge \lambda_d$ are the eigenvalues of the population covariance matrix $\mathbb{E}[xx^\top]$. Clearly, $d_\lambda \le d$. However, it is very typical that most of the eigenvalues are dominated by $\lambda$, and consequently $d_\lambda \ll d$. For example, if the spectrum decays exponentially, the effective dimension is polylogarithmic in $1/\lambda$ ([13]). Our sample complexity bound in this setting is as follows.
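For intuition, the effective dimension in the standard form $d_\lambda = \sum_i \lambda_i / (\lambda_i + \lambda)$ is easy to compute from an eigenvalue profile; the following sketch (with hypothetical eigenvalues, not data from the paper) illustrates how fast decay makes $d_\lambda$ dramatically smaller than $d$:

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """d_lambda = sum_i lambda_i / (lambda_i + lam), the standard form of Eq. (3)."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

d, lam = 1000, 1e-3
exp_decay = 0.5 ** np.arange(d)   # hypothetical eigenvalues lambda_i = 2^{-i}
d_eff = effective_dimension(exp_decay, lam)
# d_eff is roughly log2(1/lam), i.e. about a dozen, far below d = 1000
```

Under exponential decay the sum is dominated by the few eigenvalues above the level of the ridge parameter, matching the polylogarithmic behavior noted above.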
Theorem 1.
The sample complexity of linear regression satisfies
where .
We also prove a nearly matching lower bound (Theorem 6) in a high accuracy regime and specify our bounds for several regimes of interest corresponding to eigendecay patterns.
Essentially, we enjoy the best of both worlds, as our bound is both fast (in terms of $\epsilon$) and dimension-independent.
Also note that the bound is independent of the diameter of $\mathcal{W}$; any dependence on it is only implicit. Indeed, while the diameter can trivially be used to bound the magnitude of the predictions, such a bound is often loose due to a failure of the metric to capture the geometry of the problem (e.g., due to sparsity).
Redundancy of Sketch-and-Solve: It is instructive to examine the sketch-and-solve approach ([26]), whereby one uses leverage score sampling to find a small spectral approximation to the empirical covariance (respectively, the kernel) matrix using a subsample, and then solves the corresponding smaller problem (see Section C of [26] for more details).
While there are relatively efficient methods for approximating the leverage scores, their computation is clearly more involved than sampling uniformly at random. In some sense, our sample complexity result shows that sketch-and-solve is redundant.¹ Namely, our bound implies that we can attain the same (additive) accuracy by sampling a training subsequence of the same size uniformly at random.
¹Note that the boundedness assumption is crucial here. Sketch-and-solve can be very helpful when instances are not bounded (and instead of additive accuracy we aim at multiplicative accuracy).
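The following toy experiment (synthetic data and hypothetical parameter choices, purely illustrative) mirrors this point: a ridge solution computed from a uniformly drawn subsample already attains a small additive excess over the full-data regularized empirical risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 20, 0.1
X = rng.normal(size=(n, d)) * (0.7 ** np.arange(d))   # fast eigendecay
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form minimizer of mean squared error + lam * ||w||^2."""
    m = X.shape[0]
    return np.linalg.solve(X.T @ X / m + lam * np.eye(X.shape[1]), X.T @ y / m)

def risk(w):
    """Regularized empirical risk on the full data set."""
    return np.mean((X @ w - y) ** 2) + lam * w @ w

w_full = ridge(X, y, lam)                 # full-data regularized ERM
idx = rng.choice(n, size=400, replace=False)
w_sub = ridge(X[idx], y[idx], lam)        # uniform subsample, no leverage scores
gap = risk(w_sub) - risk(w_full)          # additive excess risk; small and >= 0
```

Since `w_full` exactly minimizes the full regularized objective, `gap` is nonnegative, and under fast eigendecay it is tiny even for a modest subsample.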
Efficacy of Sketch-to-Precondition in Optimization: A different approach is to use ridge leverage score sampling in order to improve the condition number of the optimization problem. Instead of aiming at a high-accuracy spectral approximation, we draw only a few samples to compute a constant spectral approximation to the empirical covariance matrix. This approximation is used to reduce the condition number to a constant order (see Section 5). Notably, maintaining this preconditioner (i.e., computing it and multiplying any $d$-dimensional vector by its inverse) can be done efficiently. By endowing Gradient Descent (GD) with this preconditioner, we can find an approximate ERM quickly.
As we discussed above, the regularization parameter used in practice is often chosen via model selection. Both our sample and computational complexity bounds shed light on the bias-complexity tradeoff reflected by the choice of $\lambda$. Namely, as we increase $\lambda$, the effective dimension (and hence the complexity) becomes smaller, whereas the bias increases. In Section 6 we show that even if we have already chosen a desired regularization parameter $\lambda$, we may still achieve a significant gain by performing optimization with a larger ridge parameter $\tilde{\lambda} \ge \lambda$. Namely, the effective dimension associated with $\tilde{\lambda}$ might be much smaller, and we can compensate for using a larger ridge parameter by repeating the optimization process several times. The main challenge we need to tackle is that the cost of computing the effective dimension associated with each candidate parameter dominates the entire optimization process. The main contribution described in Section 6
is a new algorithm which finds the best ridge candidate by iteratively sharpening its estimates of the corresponding effective dimensions.
Theorem 2.
There exists an algorithm that finds an approximate minimizer to (2) in time
In Appendix D we explain how the above results extend to the kernel setting.
2 Related work
2.1 Sample complexity bounds
To the best of our knowledge, the first bounds for empirical risk minimization for kernel ridge regression in terms of the effective dimension were proved by
[31]. By analyzing the local Rademacher complexity ([6]), they proved an upper bound on the sample complexity which carries an explicit dependence that our bound avoids. More recently, [13] used compression schemes ([21]) together with results on leverage score sampling from [26] in order to derive a bound in terms of the effective dimension with no explicit dependence on the ambient dimension. However, their rate is slow in terms of $\epsilon$. Besides improving the above aspects in terms of accuracy, rate and explicit dependence, our analysis is arguably simple and underscores nice connections between algorithmic stability and ridge leverage scores.
2.1.1 Online Newton Sketch
The Online Newton Step (ONS) algorithm due to [17]
is a wellestablished method for minimizing expconcave loss functions both in the stochastic and the online settings. As hinted by its name, each step of the algorithm involves a conditioning step that resembles a Newton step. Recent papers reanalyzed ONS and proved upper bounds on the regret (and consequently on the sample complexity) in terms of the effective dimension (
[22, 8]). We note that using a standard online-to-batch reduction, the regret bound of [22] implies the same (albeit slightly weaker in terms of constants) sample complexity bounds as this paper. While ONS is certainly appealing in the context of regret minimization, in the statistical setting our paper establishes the sample complexity bound irrespective of the optimization algorithm used for the intermediate ERM step, thereby establishing that the computational overhead resulting from conditioning in ONS is not required.²
²We also do not advocate ONS for offline optimization, as it does not yield a linear rate.
2.2 Sketch-and-Solve vs. Sketch-to-Precondition
As we discussed above, the sketch-and-solve approach (e.g., see the nice survey by [30]) has gained considerable attention recently in the context of enhancing both discrete and continuous optimization ([22, 15, 14, 9]). As we briefly mentioned above, a recent paper by [26] suggested combining ridge leverage score sampling with the Nyström method to compute a spectral approximation of the kernel matrix. As an application, they consider the problem of kernel ridge regression and describe how this spectral approximation facilitates the task of finding an approximate minimizer. Based on Corollary 4, our complexity is better by a multiplicative factor.
We would like to stress that our results only obviate the necessity of the sketch-and-solve approach in the statistical setting, where we assume boundedness and aim at additive error bounds. On the other hand, most sketch-and-solve results (e.g., [26]) are multiplicative and do not require boundedness.
The sketch-to-precondition approach is more appealing in scenarios where machine-precision accuracy is required ([30, Section 2.6]). In Section 5 we review this approach in detail and describe a corresponding preconditioned GD that minimizes the empirical risk efficiently (both in the linear and in the kernel setting). A different application of the sketch-to-precondition approach, due to [2], focuses on polynomial kernels and yields an algorithm whose runtime resembles ours but also scales exponentially with the polynomial degree.
3 Preliminaries
3.1 Problem Setting
We consider the problem of minimizing the expected risk
(4)   $F(w) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$
over a compact and convex set whose diameter is denoted by $B$. Following [16], we assume that the loss is twice continuously differentiable and satisfies the following assumptions:

Lipschitzness: for all , .

Strong convexity: for all , .

Smoothness:³ for all , .
³This assumption is only required for our optimization results.
As noted in [16], our framework includes all known exp-concave functions. A prominent example, illustrated below, is bounded regression. Further examples include logistic regression and log-loss ([17]).
Example 1.
Bounded regression: let $\mathcal{X}$ and $\mathcal{Y}$ be two compact sets, and consider the squared loss $\ell(w, (x, y)) = (\langle w, x \rangle - y)^2$. It is easily verified that the Lipschitz and strong convexity parameters above are controlled by the diameters of these sets.
The input to the learning algorithm consists of an i.i.d. sample . A popular practice is regularized loss minimization (RLM) which, given a regularization parameter , is defined as
(5)   $\hat{w}_\lambda \in \operatorname*{argmin}_{w \in \mathcal{W}} \left\{ \hat{F}_\lambda(w) := \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i) + \frac{\lambda}{2} \|w\|^2 \right\}$
We also define the unregularized empirical loss as
(6)   $\hat{F}(w) := \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$
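As a concrete instantiation (assuming the squared loss of bounded regression; the paper's setting covers general exp-concave losses), the regularized and unregularized objectives can be written as:

```python
import numpy as np

def emp_loss(w, X, y):
    """Unregularized empirical loss for the squared loss (role of Eq. (6))."""
    return np.mean((X @ w - y) ** 2)

def reg_loss(w, X, y, lam):
    """RLM objective (role of Eq. (5)): empirical loss plus the ridge term."""
    return emp_loss(w, X, y) + 0.5 * lam * float(w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
# the two objectives differ by exactly the ridge term (lam/2)||w||^2
assert np.isclose(reg_loss(w, X, y, 2.0) - emp_loss(w, X, y), w @ w)
```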
The strong convexity of the loss implies the following property of the empirical loss (e.g., see Lemma 2.8 of [28]).
Lemma 1.
Given a sample, let the regularized minimizer be as defined in Equation (5). Then for all $w \in \mathcal{W}$,
3.2 Sketching via leverage-score sampling
In this section we define the notion of ridge leverage scores, relate it to the effective dimension and explain how sampling according to these scores facilitates the task of spectral approximation.
Given a sample , we define the data matrix by
Given a ridge parameter $\lambda$, we define the $i$-th leverage score by
It is easily seen that the leverage scores sum to the effective dimension of the sample. The following lemma intuitively says that the (ridge) leverage score captures the importance of the $i$-th example in composing the column space of the covariance matrix. The proof is detailed in Appendix F.
Lemma 2.
For a ridge parameter $\lambda$ and for any $i$, the leverage score $\tau_i$ is the minimal scalar $c$ such that $x_i x_i^\top \preceq c \, (A^\top A + \lambda I)$.
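A short sketch of the quantities above, under the common convention $\tau_i = a_i^\top (A^\top A + \lambda I)^{-1} a_i$ for the rows $a_i$ of the data matrix (the paper's exact normalization may differ); note that the scores sum exactly to the effective dimension of the sample:

```python
import numpy as np

def ridge_leverage_scores(A, lam):
    """tau_i = a_i^T (A^T A + lam I)^{-1} a_i for each row a_i of A."""
    d = A.shape[1]
    K = A.T @ A + lam * np.eye(d)
    # row-wise diagonal of A (A^T A + lam I)^{-1} A^T
    return np.einsum('ij,ij->i', A, np.linalg.solve(K, A.T).T)

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
tau = ridge_leverage_scores(A, lam=1.0)
eig = np.linalg.eigvalsh(A.T @ A)
# the scores sum to sum_i lambda_i / (lambda_i + lam)
assert np.isclose(tau.sum(), np.sum(eig / (eig + 1.0)))
```

The trace identity $\sum_i \tau_i = \operatorname{tr}\big((A^\top A + \lambda I)^{-1} A^\top A\big)$ is what ties these scores to the effective dimension.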
The notion of leverage scores gives rise to a natural algorithm for spectral approximation: sample rows with probability proportional to the corresponding ridge leverage scores. Before describing the sampling procedure, we define the goal of spectral approximation.
Definition 1.
(Spectral approximation) We say that a matrix is a spectral approximation to if
Definition 2.
(Ridge Leverage Score Sampling) Let be a sequence of ridge leverage score overestimates, i.e., for all . For a fixed positive constant and accuracy parameter , define for each . Let denote a function which returns a diagonal matrix , where with probability and otherwise.
Theorem 3.
[24, 26] Let be ridge leverage score overestimates, and let .⁴
⁴We use these symbols to denote global constants.

With high probability, is a spectral approximation to .

With high probability, has at most nonzero entries. In particular, if for some constant , then has at most nonzero entries.

There exists an algorithm which computes with for all in time
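The sampling scheme of Definition 2 can be sketched as follows (the constant `c` and the $\log d$ oversampling factor are indicative choices, not the theorem's exact constants):

```python
import numpy as np

def leverage_score_sample(A, tau_tilde, eps, c=8.0, rng=None):
    """Keep row i with probability p_i ~ c * log(d) * tau_i / eps^2 (capped at 1)
    and rescale kept rows by 1/sqrt(p_i), so that E[(SA)^T (SA)] = A^T A."""
    n, d = A.shape
    rng = np.random.default_rng() if rng is None else rng
    p = np.clip(c * np.log(d) * np.asarray(tau_tilde, dtype=float) / eps ** 2,
                1e-12, 1.0)
    keep = rng.random(n) < p
    scale = np.where(keep, 1.0 / np.sqrt(p), 0.0)
    return scale[:, None] * A          # the sketch SA

rng = np.random.default_rng(4)
A = rng.normal(size=(40, 5))
# with trivial overestimates tau_tilde = 1, every p_i caps at 1
# and the sketch is A itself
assert np.allclose(leverage_score_sample(A, np.ones(40), eps=0.5), A)
```

The number of nonzero rows concentrates around $\sum_i p_i$, which is why good overestimates of the scores translate into a sketch of size roughly the effective dimension.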
3.3 Stability
In this section we define the notion of algorithmic stability, a common tool for bounding the generalization error of a given algorithm. Analogously to the definition in (2), for each index $i$ we define the predictor produced by the algorithm on the sample obtained by replacing the $i$-th example with a fresh i.i.d. pair. We can now define the stability terms
The following theorem relates the expected generalization error to the expected average stability.
Theorem 4 ([7]).
We have .
4 Sample Complexity Bounds for Exp-Concave Minimization
In this section we show nearly tight sample complexity bounds for exp-concave minimization based on the effective dimension.
Theorem 5.
For any ridge parameter $\lambda$, the excess risk of RLM is bounded as follows:
Choosing the ridge parameter appropriately gives us the following corollary.
Corollary 1.
The sample complexity is bounded as
Remark 1.
To obtain high-probability bounds (rather than bounds in expectation) we can employ the validation process suggested in [25].
Proof of Theorem 5.
For a given sample, define the associated ridge leverage scores with ridge parameter $\lambda$. We first use Theorem 4 to relate the excess risk to the average stability:
It is left to bound the average stability. Towards this end we fix some index $i$. By the mean value theorem there exists an intermediate point at which the following holds. We now have that
where the last inequality uses the boundedness noted above. Similarly for the perturbed sample, where the corresponding quantity is the $i$-th ridge leverage score of the perturbed data.
Combining the above and using the inequality , we obtain that
Since and (similarly, and ) are distributed identically, the result now follows from the following lemma whose proof is given in Appendix A.
Lemma 3.
Let $x$ be a random variable supported in a bounded set of $\mathbb{R}^d$ with covariance $\Sigma$. Let $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$, where $x_1, \ldots, x_n$ are i.i.d. copies of $x$. Then for any fixed $\lambda$, the expected effective dimension of $\hat{\Sigma}$ is bounded by a constant multiple of that of $\Sigma$. ∎
We now state a nearly matching lower bound on the sample complexity. To exhibit a lower bound we consider the special case of linear regression. Notably, our lower bound holds for any spectrum specification. The proof appears in Appendix C.
Theorem 6.
Given numbers $d$ and $\lambda_1 \ge \cdots \ge \lambda_d$, define the covariance as the corresponding diagonal matrix.⁵ Then there exists a distribution over examples such that any algorithm that returns a linear predictor, given independent samples from the distribution, satisfies
⁵For any vector, the corresponding diagonal matrix carries the vector's entries on its diagonal.
for any satisfying
(7) 
To put the bound achieved by Theorem 6 into perspective, we specialize it to two popular eigenvalue profiles defined in [13]. We say that a given eigenvalue profile satisfies Polynomial Decay of degree $p$ if there exist numbers $C, p > 0$ such that $\lambda_i \le C i^{-p}$. Similarly, it satisfies Exponential Decay if there exist numbers $C, \gamma > 0$ such that $\lambda_i \le C e^{-\gamma i}$. The following table specifies nearly matching upper and lower bounds for polynomial and exponential decays (see exact statements in Appendix E).
Decay  Upper Bound  Lower Bound 

Polynomial Decay (degree )  
Exponential Decay 
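The two decay regimes of the table can be checked numerically; with hypothetical profiles $\lambda_i = i^{-p}$ and $\lambda_i = 2^{-i}$, the effective dimension scales as $\lambda^{-1/p}$ and $\log(1/\lambda)$, respectively:

```python
import numpy as np

def d_eff(eigs, lam):
    """Effective dimension sum_i lambda_i / (lambda_i + lam)."""
    return float(np.sum(eigs / (eigs + lam)))

d, lam = 100_000, 1e-4
poly = 1.0 / np.arange(1, d + 1, dtype=float) ** 2   # polynomial decay, p = 2
expo = 2.0 ** -np.arange(d, dtype=float)             # exponential decay
# polynomial decay of degree p: d_eff = O(lam^(-1/p)); here on the order of 100
# exponential decay: d_eff = O(log(1/lam)); here on the order of a dozen
```

This gap between the two profiles is exactly what drives the different rows of the table above.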
5 Sketch-to-precondition: an overview
In this section we describe in more detail the sketchtoprecondition approach and specify it to expconcave stochastic optimization. This scheme will serve as a basis for the acceleration technique presented in the next section.
For concreteness, suppose we apply Gradient Descent (GD) to minimize the regularized risk (5). Denote by $\hat{\Sigma}$ the empirical covariance matrix. As we assume that the loss is smooth and strongly convex, it can be easily verified that the entire regularized risk is strongly convex and smooth; the ratio between the smoothness and strong convexity parameters is referred to as the condition number of the regularized risk. It is well known (e.g., see [27]) that GD converges after a number of iterations proportional to this condition number. Note that if the eigendecay is fast, the condition number may be much larger than the so-called functional condition number.
Preconditioning can be seen as a change of variables, where instead of optimizing over the original variable we optimize over a linearly transformed one; the transformation matrix is called a preconditioner. It can be easily verified (e.g., see [14]) that this operation amounts to replacing each instance with its preconditioned counterpart (after decomposing the regularization into a suitable form). Straightforward calculations show that the Hessian of the transformed objective at any point becomes
Therefore, if the preconditioner is a constant spectral approximation of the regularized covariance, the smoothness and strong convexity of the loss imply that the resulting condition number is of constant order. Using Theorem 3, we can compute such a spectral approximation to the data matrix efficiently. Furthermore, multiplying any $d$-dimensional vector with the inverse of this approximation is cheap. By maintaining the approximation together with its inverse factorization, and assuming that gradients can be computed efficiently, we are able to perform a single step of preconditioned GD at low cost. Overall, we obtain the following result.
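A minimal sketch of the preconditioned GD step for ridge regression (using a crude constant-factor spectral approximation in place of the leverage-score sketch of Theorem 3; all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 500, 30, 0.1
X = rng.normal(size=(n, d)) * (0.8 ** np.arange(d))   # fast eigendecay
y = rng.normal(size=n)

C = X.T @ X / n + lam * np.eye(d)    # Hessian of the regularized risk
b = X.T @ y / n
w_star = np.linalg.solve(C, b)       # exact regularized ERM, for reference

# A crude constant-factor spectral approximation of C (here simply 2C; in the
# paper it would come from a leverage-score sketch, cf. Theorem 3).
M_inv = np.linalg.inv(2.0 * C)

w = np.zeros(d)
for _ in range(60):
    w = w - M_inv @ (C @ w - b)      # preconditioned GD step
# since M is a 2-spectral-approximation of C, the error halves every step
```

With the preconditioner, the effective condition number is constant, so the iteration count depends only logarithmically on the target accuracy rather than on the raw condition number of $C$.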
Theorem 7.
There exists an algorithm that finds an approximate minimizer to (5) in time .
In the kernel setting we need to make some modifications to this scheme. First, we need to form the Gram matrix. Furthermore, as the number of samples replaces the intrinsic dimension, maintaining the preconditioner becomes correspondingly more expensive.
6 Optimizing the Tradeoff between Oracle Complexity and Effective Dimensionality
As explained in the introduction, given a ridge parameter , we may prefer to perform optimization with a different ridge parameter in order to accelerate the optimization process.
6.1 The Proximal Point algorithm: overview
Before quantifying the tradeoff reflected by the choice of the larger ridge parameter, we need to explain how to reduce minimization w.r.t. the original parameter to minimization w.r.t. the larger one. The basic idea is to repeat the minimization process for several epochs. We demonstrate this idea using the Proximal Point algorithm (PPA) due to [12]. For a fixed center point, define the corresponding proximal objective. Suppose we start from an arbitrary initial point. At time $t$, we find a point satisfying
Lemma 4.
[12] Applying PPA yields an approximate minimizer of the original objective after sufficiently many epochs.
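For ridge regression, each PPA epoch has a closed form, which makes the reduction easy to sketch (synthetic data; `lam_big` here denotes the larger proxy ridge parameter, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 20
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam, lam_big = 1e-3, 1.0            # target ridge lam, larger proxy ridge lam_big

A = X.T @ X / n + lam * np.eye(d)
b = X.T @ y / n
w_star = np.linalg.solve(A, b)      # minimizer of the lam-regularized risk

# Each PPA epoch minimizes the lam-regularized risk plus an extra proximal
# term (lam_big/2)||w - w_prev||^2, a much better-conditioned subproblem.
w = np.zeros(d)
for _ in range(200):
    w = np.linalg.solve(A + lam_big * np.eye(d), b + lam_big * w)
# the iterates contract toward w_star geometrically, epoch by epoch
```

Each epoch contracts the error by roughly `lam_big / (lam_big + mu)` for strong convexity `mu`, so a larger proxy ridge means cheaper epochs but more of them, matching the tradeoff quantified next.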
6.2 Quantifying the tradeoff
Applying PPA while using sketchtopreconditioning as described in Section 5 yields the following complexity bound:
Focusing on the (reasonable) regime of interest,⁶ we note that the ratio between the optimization ridge parameter and the target one may be large. Notably, while the deterioration in runtime scales linearly in this ratio, the improvement in terms of the effective dimension is quadratic, so the net computational gain can be substantial.
⁶The corresponding term can be easily optimized w.r.t. the ridge parameter.
Therefore, we wish to minimize the complexity term
(8) 
over all possible choices. To this end, suppose that we had access to an oracle that computes the effective dimension for a given parameter. Using the continuity of the effective dimension, we could optimize the above quantity over a discrete geometric grid of candidates.⁷ The main difficulty stems from the fact that the cost of implementing this oracle already scales with the cost of the full optimization.
⁷Clearly, the optimal ridge parameter cannot be arbitrarily large.
6.3 Efficient tuning using undersampling
Our second main contribution is a novel approach for minimizing (8) in a negligible amount of time.
Theorem 8.
There exists an algorithm which receives a data matrix and a regularization parameter , and with high probability outputs a regularization parameter satisfying
The runtime of the algorithm is .
Corollary 2.
There exists an algorithm that finds an approximate solution to (5) in time
The main idea behind Theorem 8 is that instead of (approximately) computing the effective dimension for each candidate, we guess the optimal complexity and employ undersampling to test whether a given candidate attains the desired complexity. The key ingredient of this approach is described in the following theorem.
Theorem 9.
Let , and . There exists an algorithm that verifies whether in time .
Proof.
(of Theorem 8) Starting from a small constant as our “guess” for the optimal complexity, we double our guess until finding a candidate which satisfies the desired bound. According to Theorem 9, each such verification is cheap, and the number of tests is logarithmic; hence the theorem follows. ∎
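The doubling argument can be sketched as follows, with the Theorem 9 verifier abstracted as an oracle `verify(mu, guess)` (a hypothetical interface, not the paper's implementation):

```python
def find_ridge(verify, lam0, n_candidates=64):
    """Doubling search (sketch of Theorem 8): double the complexity guess and,
    for each guess, scan a geometric grid of ridge candidates mu = lam0 * 2^j,
    returning the first candidate verified to meet the current guess."""
    guess = 1.0
    while True:
        mu = lam0
        for _ in range(n_candidates):
            if verify(mu, guess):   # stands in for the Theorem 9 verifier
                return mu
            mu *= 2.0
        guess *= 2.0

# toy complexity landscape: minimized at mu = 4 with value 3
complexity = lambda mu: (mu - 4.0) ** 2 + 3.0
best = find_ridge(lambda mu, guess: complexity(mu) <= guess, lam0=1.0)
```

Because the guess doubles, the returned candidate's complexity is within a factor of two of the optimum, at the cost of only logarithmically many verification calls.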
6.3.1 Proof of Theorem 9
Inspired by [10, 11], our strategy is to use undersampling to obtain sharper estimates to the ridge leverage scores. We start by incorporating an undersampling parameter into Definition 2.
Definition 3.
(Ridge Leverage Score Undersampling) Let be a sequence of ridge leverage score overestimates, i.e., for all . For a fixed positive constant and accuracy parameter , define for each . Let denote a function which returns a diagonal matrix , where with probability and otherwise.
Note that while we reduce each probability by a factor of the undersampling parameter, the definition of the rescaling neglects this modification. Hence, our undersampling is equivalent to sampling according to Definition 2 and then preserving each row with the reduced probability. By employing undersampling we cannot hope to obtain a constant approximation to the true ridge leverage scores. However, as we describe in the following theorem, this strategy still helps us sharpen our estimates of the ridge leverage scores.
Theorem 10.
Let for all and let be an undersampling parameter. Given , we form new estimates by
(9) 
Then with high probability, each new estimate is an overestimate of the corresponding true ridge leverage score, and the sum of the estimates is appropriately controlled.
The proof of the theorem (which is similar to Theorem 3 of [10] and Lemma 13 of [11]) is provided in Appendix B. Equipped with this result, we employ the following strategy in order to verify whether the effective dimension is below a given guess. Applying the lemma with a suitable undersampling parameter, we obtain bounds that give rise to the following test:

If , accept the hypothesis that .

If , reject the hypothesis that .

Otherwise, apply Theorem 10 to obtain a new vector of overestimates, .
Proof.
(of Theorem 9) Note that the rank of the sketched matrix is small with high probability. Hence, each step of the testing procedure is cheap.⁸ Since our range of candidate ridge parameters is of logarithmic size and each test consists of a logarithmic number of steps, the theorem follows using the union bound. ∎
⁸Namely, the relevant quantities can be computed in time proportional to the sketch size.
Acknowledgements
We thank Elad Hazan, Ohad Shamir and Christopher Musco for fruitful discussions and valuable suggestions.
References
 [1] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Advances in Neural Information Processing Systems, 2015.
 [2] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 2017.
 [3] Haim Avron, Huy Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems, 2014.
 [4] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning, pages 253–262, 2017.
 [5] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.
 [6] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
 [7] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
 [8] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.

 [9] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.
 [10] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.
 [11] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2017.
 [12] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Unregularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512, 2015.

 [13] Surbhi Goel and Adam Klivans. Eigenvalue decay implies polynomial-time learnability for neural networks. In Advances in Neural Information Processing Systems, 2017.
 [14] Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned SVRG. In International Conference on Machine Learning, pages 1397–1405, 2016.
 [15] Alon Gonen and Shai Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.
 [16] Alon Gonen and Shai Shalev-Shwartz. Average stability is invariant to data preconditioning: Implications to exp-concave empirical risk minimization. arXiv preprint arXiv:1601.04011, 2016.
 [17] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

 [18] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
 [19] Tomer Koren and Kfir Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, pages 1477–1485, 2015.
 [20] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
 [21] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.
 [22] Haipeng Luo, Alekh Agarwal, Nicolò Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, pages 902–910, 2016.
 [23] Mehrdad Mahdavi. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Conference on Learning Theory, 2015.
 [24] Michael W. Mahoney and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 2008.
 [25] Nishant A. Mehta. From exp-concavity to variance control: O(1/n) rates and online-to-batch conversion with high probability. arXiv preprint arXiv:1605.01288, 2016.
 [26] Cameron Musco and Christopher Musco. Recursive Sampling for the Nyström Method. In Advances In Neural Information Processing Systems, 2017.
 [27] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
 [28] Shai ShalevShwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2011.
 [29] Ohad Shamir. The Sample Complexity of Learning Linear Predictors with the Squared Loss. Journal of Machine Learning Research, 16:3475–3486, 2015.
 [30] David P. Woodruff. Sketching as a Tool for Numerical Linear Algebra. arXiv preprint arXiv:1411.4357, 2014.
 [31] Tong Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, pages 471–478, 2003.
Appendix A Concentration of The Effective Dimension
Proof of Lemma 3.
Let and denote the spectral decomposition of by . Let Note that
Therefore,
(10) 
Denote the eigenvalues of by . Since for any , , we have that
(11) 
We now consider the random variable . To argue about this random variable, consider the following identity, which follows from the Courant-Fischer min-max principle for real symmetric matrices.
Let be the matrix with the columns . We now have that
(12) 
Appendix B Ridge Leverage Score Undersampling
In this section we prove Theorem 10. The next lemma intuitively says that only a small fraction of the rows might have a high leverage score.
Lemma 5.
Let and denote by the effective dimension of the corresponding matrix. For any there exists a diagonal rescaling matrix such that for all , and
Proof.
We prove the lemma by considering a hypothetical algorithm which constructs a sequence