Many settings in modern machine learning, optimization and scientific computing require us to work with data matrices that are so large that some form of dimensionality reduction is a necessary component of the process. One of the most popular families of methods for dimensionality reduction, coming from the literature on Randomized Numerical Linear Algebra (RandNLA), consists of data-oblivious sketches [Mah11, HMT11, Woo14]. Consider a large matrix . A data-oblivious sketch of size is the matrix , where is a random matrix such that , whose distribution does not depend on . This sketch reduces the first dimension of from to a much smaller (we assume without loss of generality that ), and an analogous procedure can be defined for reducing the second dimension as well. This approximate representation of
is central to many algorithms in areas such as linear regression, low-rank approximation, kernel methods, and iterative second-order optimization. While there is a long line of research aimed at bounding the worst-case approximation error of such representations, these bounds are often too loose to reflect accurately the practical performance of these methods. In this paper, we develop new theory which enables more precise analysis of the accuracy of sketched data representations.
A common way to measure the accuracy of the sketch is by considering the -dimensional subspace spanned by its rows. The goal of the sketch is to choose a subspace that best aligns with the distribution of all of the rows of in
. Intuitively, our goal is to minimize the (norm of the) residual when projecting a vectoronto that subspace, i.e., , where is the orthogonal projection matrix onto the subspace spanned by the rows of (and denotes the Moore-Penrose pseudoinverse). For this reason, the quantity that has appeared ubiquitously in the error analysis of RandNLA sketching is what we call the residual projection matrix:
Since is random, the average performance of the sketch can often be characterized by its expectation, . For example, the low-rank approximation error of the sketch can be expressed as , where denotes the Frobenius norm. A similar formula follows for the trace norm error of a sketched Nyström approximation [WS01, GM16]
. Among others, this approximation error appears in the analysis of sketched kernel ridge regression[FSS20] and Gaussian process regression [BRVDW19]. Furthermore, a variety of iterative algorithms, such as randomized second-order methods for convex optimization [QRTF16, QR16, GKLR19, GRB20] and linear system solvers based on the generalized Kaczmarz method [GR15]
, have convergence guarantees which depend on the extreme eigenvalues of
. Finally, a generalized form of the expected residual projection has been recently used to model the implicit regularization of the interpolating solutions in over-parameterized linear models[DLM19, BLLT19].
1.1 Main result
Despite its prevalence in the literature, the expected residual projection is not well understood, even in such simple cases as when is a Gaussian sketch (i.e., with i.i.d. standard normal entries). We address this by providing a surrogate expression, i.e., a simple analytically tractable approximation, for this matrix quantity:
Here, means that while the surrogate expression is not exact, it approximates the true quantity up to some accuracy. Our main result provides a rigorous approximation guarantee for this surrogate expression with respect to a range of sketching matrices , including the standard Gaussian and Rademacher sketches. We state the result using the positive semi-definite ordering denoted by .
Let be a sketch of size with i.i.d. mean-zero sub-Gaussian entries and let be the stable rank of . If we let be a fixed constant larger than , then
In other words, when the sketch size is smaller than the stable rank of , then the discrepancy between our surrogate expression and is of the order , where the big-O notation hides only the dependence on and on the sub-Gaussian constant (see Theorem 2 for more details). Our proof of Theorem 1 is inspired by the techniques from random matrix theory which have been used to analyze the asymptotic spectral distribution of large random matrices by focusing on the associated matrix resolvents and Stieltjes transforms [HLN07, BS10]. However, our analysis is novel in several respects:
The residual projection matrix can be obtained from the appropriately scaled resolvent matrix by taking . Prior work (e.g., [HMRT19]
) combined this with an exchange-of-limits argument to analyze the asymptotic behavior of the residual projection. This approach, however, does not allow for a precise control in finite-dimensional problems. We are able to provide a more fine-grained, non-asymptotic analysis by working directly with the residual projection itself, instead of the resolvent.
We require no assumptions on the largest and smallest singular value of. Instead, we derive our bounds in terms of the stable rank of (as opposed to its actual rank), which implicitly compensates for ill-conditioned data matrices.
We obtain upper/lower bounds for in terms of the positive semi-definite ordering , which can be directly converted to guarantees for the precise expressions of expected low-rank approximation error derived in the following section.
1.2 Low-rank approximation
We next provide some immediate corollaries of Theorem 1, where we use to denote a multiplicative approximation . Note that our analysis is new even for the classical Gaussian sketch where the entries of are i.i.d. standard normal. However the results apply more broadly, including a standard class of data-base friendly Rademacher sketches where each entry is a
Rademacher random variable[Ach03]. We start by analyzing the Frobenius norm error of sketched low-rank approximations. Note that by the definition of in (1), we have , so the surrogate expression we obtain for the expected error is remarkably simple.
Let be the singular values of . Under the assumptions of Theorem 1, we have:
The parameter increases at least linearly as a function of , which is why the expected error will always decrease with increasing . For example, when the singular values of exhibit exponential decay, i.e., for , then the error also decreases exponentially, at the rate of . We discuss this further in Section 3, giving explicit formulas for the error as a function of under both exponential and polynomial spectral decay profiles.
The above result is important for many RandNLA methods, and it is also relevant in the context of kernel methods, where the data is represented via a positive semi-definite kernel matrix which corresponds to the matrix of dot-products of the data vectors in some reproducible kernel Hilbert space. In this context, sketching can be applied directly to the matrix via an extended variant of the Nyström method [GM16]. A Nyström approximation constructed from a sketching matrix is defined as , where and
, and it is applicable to a variety of settings, including Gaussian Process regression, kernel machines and Independent Component Analysis[BRVDW19, WS01, BJ03]. By setting , it is easy to see [DKM20] that the trace norm error is identical to the squared Frobenius norm error of the low-rank sketch , so Corollary 1 implies that
with any sub-Gaussian sketch, where denote the eigenvalues of . Our error analysis given in Section 3
is particularly relevant here, since commonly used kernels such as the Radial Basis Function (RBF) or the Matérn kernel induce a well-understood eigenvalue decay[SZW97, RW06].
1.3 Randomized iterative optimization
We next turn to a class of iterative methods which take advantage of sketching to reduce the per iteration cost of optimization. These methods have been developed in a variety of settings, from solving linear systems to convex optimization and empirical risk minimization, and in many cases the residual projection matrix appears as a black box quantity whose spectral properties determine the convergence behavior of the algorithms [GR15]. With our new results, we can precisely characterize not only the rate of convergence, but also, in some cases, the complete evolution of the parameter vector, for the following algorithms:
Generalized Kaczmarz method [GR15] for approximately solving a linear system ;
Randomized Subspace Newton [GKLR19], a second order method, where we sketch the Hessian matrix.
Jacobian Sketching [GRB20], a class of first order methods which use additional information via a weight matrix that is sketched at every iteration.
We believe that extensions of our techniques will apply to other algorithms, such as that of [LPP19].
The corollary follows from Theorem 1 combined with Theorem 4.1 in [GR15]. Note that when is positive definite then , so the algorithm will converge from any starting point, and the worst-case convergence rate of the above method can be obtained by evaluating the largest eigenvalue of . However the result itself is much stronger, in that it can be used to describe the (expected) trajectory of the iterates for any starting point . Moreover, when the spectral decay profile of is known, then the explicit expressions for as a function of derived in Section 3 can be used to characterize the convergence properties of generalized Kaczmarz as well as other methods discussed above.
1.4 Implicit regularization
Setting , we can view one step of the iterative method in Corollary 2 as finding a minimum norm interpolating solution of an under-determined linear system . Recent interest in the generalization capacity of over-parameterized machine learning models has motivated extensive research on the statistical properties of such interpolating solutions, e.g., [BLLT19, HMRT19, DLM19]. In this context, Theorem 1 provides new evidence for the implicit regularization conjecture posed by [DLM19] (see their Theorem 2 and associated discussion), with the amount of regularization equal , where is implicitly defined in (1):
While implicit regularization has received attention recently in the context of SGD algorithms for overparameterized machine learning models, it was originally discussed in the context of approximation algorithms more generally [Mah12]. Recent work has made precise this notion in the context of RandNLA [DLM19], and our results here can be viewed in terms of implicit regularization of scalable RandNLA methods.
1.5 Related work
A significant body of research has been dedicated to understanding the guarantees for low-rank approximation via sketching, particularly in the context of RandNLA [DM16, DM18]. This line of work includes i.i.d. row sampling methods [BMD08, AM15] which preserve the structure of the data, and data-oblivious methods such as Gaussian and Rademacher sketches [Mah11, HMT11, Woo14]. However, all of these results focus on worst-case upper bounds on the approximation error. One exception is a recent line of works on non-i.i.d. row sampling with determinantal point processes (DPP). In this case, exact analysis of the low-rank approximation error [DKM20], as well as precise convergence analysis of stochastic second order methods [MDK19], have been obtained. Remarkably, the expressions they obtain are analogous to (1), despite using completely different techniques. However, their analysis is limited only to DPP-based sketches, which are considerably more expensive to construct and thus much less widely used. The connection between DPPs and Gaussian sketches was recently explored by [DLM19]
in the context of analyzing the implicit regularization effect of choosing a minimum norm solution in under-determined linear regression. They conjectured that the expectation formulas obtained for DPPs are a good proxy for the corresponding quantities obtained under a Gaussian distribution. While they only provide empirical evidence and asymptotic results for this particular claim, our Theorem1 can be viewed as the first theoretical non-asymptotic justification of that conjecture.
The effectiveness of sketching has also been extensively studied in the context of second order optimization. These methods differ depending on how the sketch is applied to the Hessian matrix, and whether or not it is applied to the gradient as well. The class of methods discussed in Section 1.3, including Randomized Subspace Newton and the Generalized Kaczmarz method, relies on projecting the Hessian downto a low-dimensional subspace, which makes our results directly applicable. A related family of methods uses the so-called Iterative Hessian Sketch (IHS) approach [PW16, LP19]. The similarities between IHS and the Subspace Newton-type methods (see [QRTF16] for a comparison) suggest that our techniques could be extended to provide precise convergence guarantees also to the IHS. Finally, yet another family of Hessian sketching methods has been studied by [RKM19, WGM17, XRKM17, YXRKM18, RLXM18, WRXM17, DM19]. These methods preserve the rank of the Hessian, and thus they provide somewhat different convergence guarantees that do not rely on the residual projection matrix.
2 Precise analysis of the residual projection
In this section, we give a detailed statement of our main technical result, along with a sketch of the proof. First, recall the definition of sub-Gaussian random variables and vectors.
We say that is a -sub-Gaussian random variable if its sub-Gaussian Orlicz norm , where . Similarly, we say that a random vector is -sub-Gaussian if for all we have .
For convenience, we state the main result in a slightly different form than Theorem 1. Namely, we replace the matrix with a positive semi-definite matrix . Furthermore, instead of a sketch with i.i.d. sub-Gaussian entries, we use a random matrix with i.i.d. sub-Gaussian rows, which is a strictly weaker condition because it allows for the entries of each row to be correlated. Since the rows of are also assumed to have mean zero and identity covariance, each row of has covariance . In Section 2.2 we show how to convert this statement back to the form of Theorem 1.
Let for , where has i.i.d. -sub-Gaussian rows with zero mean and identity covariance, and is an positive semi-definite matrix. Define:
Let be the stable rank of and fix . There exists a constant , depending only on and , such that if , then
We first provide the following informal derivation of the expression for given in Theorem 2. Let us use to denote the matrix . Using a rank-one update formula for the Moore-Penrose pseudoinverse (see Lemma 1 in the appendix) we have
where we use to denote the -th row of , and , where is the matrix without its -th row. Due to the sub-Gaussianity of , the quadratic form in the denominator concentrates around its expectation (with respect to ), i.e., , where we use . Further note that, with for large and from a concentration argument, we conclude that
and thus for and . This leads to the (implicit) expression for and given in Theorem 2.
2.1 Proof sketch of Theorem 2
First note that, to obtain the lower and upper bound for in the sense of symmetric matrix as in Theorem 2, it suffices to bound the spectral norm , so that, with for from the definition of , we have
This means that all eigenvalues of the p.s.d. matrix lie in the interval , so Multiplying by from both sides, we obtain the desired bound.
Then, we carefully design an event
that (i) is provable to occur with high probability and (ii) ensures that the denominators in the following decomposition are bounded away from zero:
where we let and .
It then remains to bound the spectral norms of respectively to reach the conclusion. More precisely, the terms and are proportional to , while the term can be bounded using the rank-one update formula for the pseudoinverse (Lemma 1 in the appendix). The remaining term is more subtle and can be bounded with a careful application of sub-Gaussian concentration inequalities (Lemmas 2 and 3 in the appendix). This allows for a bound on the operator norm and hence the conclusion.
2.2 Proof of Theorem 1
We now discuss how Theorem 1 can be obtained from Theorem 2. The crucial difference between the statements is that in Theorem 1 we let be an arbitrary rectangular matrix, whereas in Theorem 2 we instead use a square, symmetric and positive semi-definite matrix . To convert between the two notations, consider the SVD decomposition of (recall that we assume ), where and have orthonormal columns and is a diagonal matrix. Now, let , and . Using the fact that , it follows that:
Note that since , the rows of are sub-Gaussian with the same constant as the rows of . Moreover, using the fact that implies for any p.s.d. matrices and , Theorem 1 follows as a corollary of Theorem 2.
3 Explicit formulas under known spectral decay
The expression we give for the expected residual projection, , is implicit in that it depends on the parameter which is the solution of the following equation:
In general, it is impossible to solve this equation analytically, i.e., to write as an explicit formula of , and the singular values of . However, we show that when the singular values exhibit a known rate of decay, then it is possible to obtain explicit formulas for . In particular, this allows us to provide precise and easily interpretable rates of decay for the low-rank approximation error of a sub-Gaussian sketch.
Matrices that have known spectral decay, most commonly with either exponential or polynomial rate, arise in many machine learning problems [MDK19]. Such behavior can be naturally occurring in data, or it can be induced by feature expansion using, say, the RBF kernel (for exponential decay) [SZW97] or the Matérn kernel (for polynomial decay) [RW06]. Understanding these two classes of decay plays an important role in distinguishing the properties of light-tailed and heavy-tailed data distributions. Note that in the kernel setting we may often represent our data via the kernel matrix , instead of the data matrix , and study the sketched Nyström method [GM16] for low-rank approximation. To handle the kernel setting in our analysis, it suffices to replace the squared singular values of with the eigenvalues of .
3.1 Exponential spectral decay
Suppose that the squared singular values of exhibit exponential decay, i.e. , where is a constant and . For simplicity of presentation, we will let . Under this spectral decay, we can approximate the sum in (4) by the analytically computable integral , obtaining . Applying this to the formula from Corollary 1, we can express the low-rank approximation error for a sketch of size as follows:
In Figure 1a, we plot the above formula against the numerically obtained implicit expression from Corollary 1, as well as empirical results for a Gaussian sketch. First, we observe that the theoretical predictions closely align with empirical values even after the sketch size crosses the stable rank , suggesting that Theorem 1 can be extended to this regime. Second, while it is not surprising that the error decays at a similar rate as the singular values, our predictions offer a much more precise description, down to lower order effects and even constant factors. For instance, we observe that the error (normalized by , as in the figure) only starts decaying exponentially after crosses the stable rank, and until that point it decreases at a linear rate with slope .
3.2 Polynomial spectral decay
We now turn to polynomial spectral decay, which is a natural model for analyzing heavy-tailed data distributions. Let have squared singular values for some , and let . As in the case of exponential decay, we use the integral to approximate the sum in (4), and solve for , obtaining . Combining this with Corollary 1 we get:
Figure 1b compares our predictions to the empirical results for several values of . In all of these cases, the stable rank is close to 1, and yet the theoretical predictions align very well with the empirical results. Overall, the asymptotic rate of decay of the error is . However it is easy to verify that the lower order effect of appearing instead of in (6) significantly changes the trajectory for small values of . Also, note that as grows large, the constant goes to , but it plays a significant role for or (roughly, scaling the expression by a factor of ). Finally, we remark that for , our integral approximation of (4) becomes less accurate. We expect that a corrected expression is possible, but likely more complicated and less interpretable.
4 Empirical results
In this section, we numerically verify the accuracy of our theoretical predictions for the low-rank approximation error of sketching on benchmark datasets from the libsvm repository [CL11] (further numerical results are in Appendix C
). We repeated every experiment 10 times, and plot both the average and standard deviation of the results. We use the followingsketching matrices :
Gaussian sketch: with i.i.d. standard normal entries;
Rademacher sketch: with i.i.d. entries equal with probability 0.5 and otherwise.
Varying spectral decay.
To demonstrate the role of spectral decay and the stable rank on the approximation error, we performed feature expansion using the radial basis function (RBF) kernel , obtaining an kernel matrix . We used the sketched Nyström method to construct a low-rank approximation , and computed the normalized trace norm error . The theoretical predictions are coming from (2), which in turn uses Theorem 1. Following [GM16], we use the RBF kernel because varying the scale parameter allows us to observe the approximation error under qualitatively different spectral decay profiles of the kernel. In Figure 2, we present the results for the Gaussian sketch on two datasets, with three values of , and in all cases our theory aligns with the empirical results. Furthermore, as smaller leads to slower spectral decay and larger stable rank, it also makes the approximation error decay more linearly for small sketch sizes. This behavior is predicted by our explicit expressions (5) for the error under exponential spectral decay from Section 3. Once the sketch sizes are sufficiently larger than the stable rank of , the error starts decaying at an exponential rate. Note that Theorem 1 only guarantees accuracy of our expressions for sketch sizes below the stable rank, however the predictions are accurate regardless of this constraint.
Varying sketch type.
In the next set of empirical results, we compare the performance of Gaussian and Rademacher sketches, and also verify the theory when sketching the data matrix without kernel expansion, plotting . Since both of the sketching methods have sub-Gaussian entries, Corollary 1 predicts that they should have comparable performance in this task and match our expressions. This is exactly what we observe in Figure 3 for two datasets and a range of sketching sizes, as well as in other empirical results shown in Appendix C.
We derived the first theoretically supported precise expressions for the expected residual projection matrix, which is a central component in the analysis of RandNLA dimensionality reduction via sketching. Our analysis provides a new understanding of low-rank approximation, the Nyström method, and the convergence properties of many randomized iterative algorithms. As a direction for future work, we conjecture that our main result can be extended to sketch sizes larger than the stable rank of the data matrix.
We would like to acknowledge DARPA, NSF, and ONR for providing partial support of this work.
- [Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003.
- [AM15] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 775–783, Montreal, Canada, December 2015.
- [BJ03] Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, March 2003.
- [BLLT19] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Technical Report Preprint: arXiv:1906.11300, 2019.
- [BMD08] Christos Boutsidis, Michael Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 12 2008.
- [BRVDW19] David Burt, Carl Edward Rasmussen, and Mark Van Der Wilk. Rates of convergence for sparse variational Gaussian process regression. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 862–871, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- [BS10] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
- [Bur73] Donald L Burkholder. Distribution function inequalities for martingales. the Annals of Probability, pages 19–42, 1973.
Chih-Chung Chang and Chih-Jen Lin.
LIBSVM: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
- [DKM20] Michał Dereziński, Rajiv Khanna, and Michael W Mahoney. Improved guarantees and a multiple-descent curve for the column subset selection problem and the nyström method. arXiv preprint arXiv:2002.09073, 2020.
- [DLM19] Michał Dereziński, Feynman Liang, and Michael W. Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. arXiv e-prints, page arXiv:1912.04533, Dec 2019.
- [DM16] Petros Drineas and Michael W. Mahoney. RandNLA: Randomized numerical linear algebra. Communications of the ACM, 59:80–90, 2016.
- [DM18] P. Drineas and M. W. Mahoney. Lectures on randomized numerical linear algebra. In M. W. Mahoney, J. C. Duchi, and A. C. Gilbert, editors, The Mathematics of Data, IAS/Park City Mathematics Series, pages 1–48. AMS/IAS/SIAM, 2018.
Michał Dereziński and Michael W Mahoney.
Distributed estimation of the inverse Hessian by determinantal averaging.In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11401–11411. Curran Associates, Inc., 2019.
- [FSS20] Michaël Fanuel, Joachim Schreurs, and Johan AK Suykens. Diversity sampling is an implicit regularization for kernel methods. arXiv:2002.08616, 2020.
- [GKLR19] Robert Gower, Dmitry Koralev, Felix Lieder, and Peter Richtarik. RSN: Randomized subspace Newton. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 614–623. Curran Associates, Inc., 2019.
- [GM16] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. J. Mach. Learn. Res., 17(1):3977–4041, January 2016.
- [GR15] Robert M. Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM. J. Matrix Anal. & Appl., 36(4), 1660–1690, 2015, 2015.
Robert Gower, Peter Richtárik, and Francis Bach.
Stochastic quasi-gradient methods: variance reduction via Jacobian sketching.Mathematical Programming, 05 2020.
- [HLN07] Walid Hachem, Philippe Loubaton, Jamal Najim, et al. Deterministic equivalents for certain functionals of large random matrices. The Annals of Applied Probability, 17(3):875–930, 2007.
- [HMRT19] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Technical Report Preprint: arXiv:1903.08560, 2019.
- [HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- [LP19] Jonathan Lacotte and Mert Pilanci. Faster least squares optimization. arXiv preprint arXiv:1911.02675, 2019.
- [LPP19] Jonathan Lacotte, Mert Pilanci, and Marco Pavone. High-dimensional optimization in adaptive random subspaces. In Advances in Neural Information Processing Systems, pages 10846–10856, 2019.
- [Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011. Also available at: arXiv:1104.5557.
- [Mah12] M. W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st ACM Symposium on Principles of Database Systems, pages 143–154, 2012.
- [MDK19] Mojmír Mutný, Michał Dereziński, and Andreas Krause. Convergence analysis of the randomized Newton method with determinantal sampling. arXiv e-prints, page arXiv:1910.11561, Oct 2019.
- [Mey73] Carl D. Meyer. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.
- [PW16] Mert Pilanci and Martin J Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879, 2016.
- [QR16] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.
- [QRTF16] Zheng Qu, Peter Richtárik, Martin Takác, and Olivier Fercoq. SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization. Proceedings of The 33rd International Conference on Machine Learning, Feb 2016.
- [RKM19] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293–326, 2019.
- [RLXM18] F. Roosta, Y. Liu, P. Xu, and M. W. Mahoney. Newton-MR: Newton’s method without smoothness or convexity. Technical report, 2018. Preprint: arXiv:1810.00303.
- [RW06] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- [SZW97] Huaiyu Zhu Santa, Huaiyu Zhu, Christopher K. I. Williams, Richard Rohwer, and Michal Morciniec. Gaussian regression and optimal finite dimensional linear models. In Neural Networks and Machine Learning, pages 167–184. Springer-Verlag, 1997.
High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [WGM17] Shusen Wang, Alex Gittens, and Michael W. Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3608–3616, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- [Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
- [WRXM17] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: globally improved approximate Newton method for distributed optimization. CoRR, abs/1709.03528, 2017.
- [WS01] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
- [XRKM17] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. Technical report, 2017. Preprint: arXiv:1708.07164.
- [YXRKM18] Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Inexact non-convex Newton-type methods. Technical report, 2018. Preprint: arXiv:1802.06925.
- [Zaj18] Krzysztof Zajkowski. Bounds on tail probabilities for quadratic forms in dependent sub-gaussian random variables. arXiv preprint arXiv:1809.08569, 2018.
Appendix A Proof of Theorem 2
We first introduce the following technical lemmas.
For with , denote and , with the matrix without its i-th row . Then, conditioned on the event :
Proof Since conditioned on we have , from Theorem 1 in [Mey73] we deduce
for so that , where we used the fact that is a projection matrix so that . As a consequence, multiplying by and simplifying we get
By definition of the pseudoinverse, so that
where we used and thus the conclusion.
For a -sub-Gaussian random vector with , and positive semi-definite matrix , we have
with the stable rank of , and
for some independent of .
Proof From Corollary 2.9 in [Zaj18] we have, for -sub-Gaussian with , and symmetric positive semi-definite that
for some universal constant . Taking we have
where we use the fact that .
Integrating this bound yields:
and thus the conclusion.
With the notations of Lemma 1, for and , we have
for some universal constant .
Proof To simplify notations, we work on instead of , the same line of argument applies to by changing the sample size to .
First note that
where we used the fact that , for the conditional expectation with respect to the -field generating the rows of . This forms a martingale difference sequence (it is a difference sequence of the Doob martingale for with respect to filtration ) hence it falls within the scope of the Burkholder inequality [Bur73], recalled as follows.
For a real martingale difference sequence with respect to the increasing field , we have, for , there exists such that
In particular, for , we obtain .
For the second result, since we have almost surely bounded martingale differences (), by the Azuma-Hoeffding inequality
a.1 Complete proof of Theorem 2
Equipped with the lemmas above, we are ready to prove Theorem 2. First note that:
Since for any , we can assume without loss of generality (after rescaling correspondingly) that .
According to the definition of and , the following bounds hold
for and , where we used the fact that
so that .
so that, with from (7), we have
Defining , this means that all eigenvalues of the p.s.d. matrix lie in the interval , and
so that by multiplying on both sides, we obtain the desired bound.
As a consequence of the above observations, we only need to prove (8) under the setting . The proof comes in the following two steps:
For , with the matrix without its -th row, we define, for , the following events
where we recall is the -th row of so that and . With Lemma 2, we can bound the probability of and , and consequently that of for ;
We then bound, conditioned on and respectively, the spectral norm . More precisely, since
where we used Lemma 1 for the third equality and denote as well as . It then remains to bound the spectral norms of to reach the conclusion.
Another important relation that will be constantly used throughout the proof is