Precise expressions for random projections: Low-rank approximation and randomized Newton

06/18/2020
by Michał Dereziński, et al.
University of California, Berkeley

It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace. Matrix sketching has emerged as a powerful technique for performing such dimensionality reduction very efficiently. Even though there is an extensive literature on the worst-case performance of sketching, existing guarantees are typically very different from what is observed in practice. We exploit recent developments in the spectral analysis of random matrices to develop novel techniques that provide provably accurate expressions for the expected value of random projection matrices obtained via sketching. These expressions can be used to characterize the performance of dimensionality reduction in a variety of common machine learning tasks, ranging from low-rank approximation to iterative stochastic optimization. Our results apply to several popular sketching methods, including Gaussian and Rademacher sketches, and they enable precise analysis of these methods in terms of spectral properties of the data. Empirical results show that the expressions we derive reflect the practical performance of these sketching methods, down to lower-order effects and even constant factors.


1 Introduction

Many settings in modern machine learning, optimization and scientific computing require us to work with data matrices that are so large that some form of dimensionality reduction is a necessary component of the process. One of the most popular families of methods for dimensionality reduction, coming from the literature on Randomized Numerical Linear Algebra (RandNLA), consists of data-oblivious sketches [Mah11, HMT11, Woo14]. Consider a large matrix $A\in\mathbb{R}^{m\times n}$. A data-oblivious sketch of size $k$ is the matrix $SA$, where $S$ is a random $k\times m$ matrix, with $k\ll m$, whose distribution does not depend on $A$. This sketch reduces the first dimension of $A$ from $m$ to a much smaller $k$ (we assume without loss of generality that $n\le m$), and an analogous procedure can be defined for reducing the second dimension as well. This approximate representation of $A$ is central to many algorithms in areas such as linear regression, low-rank approximation, kernel methods, and iterative second-order optimization. While there is a long line of research aimed at bounding the worst-case approximation error of such representations, these bounds are often too loose to accurately reflect the practical performance of these methods. In this paper, we develop new theory which enables a more precise analysis of the accuracy of sketched data representations.
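For illustration, a minimal NumPy sketch of this operation (the $1/\sqrt{k}$ scaling convention, the helper name, and the toy dimensions are arbitrary choices for this example and do not affect the subspace spanned by the rows of $SA$):

    import numpy as np

    def gaussian_sketch(A, k, rng=None):
        """Data-oblivious sketch SA of size k: S is k x m with i.i.d. N(0, 1/k) entries."""
        rng = np.random.default_rng(0) if rng is None else rng
        m = A.shape[0]
        S = rng.standard_normal((k, m)) / np.sqrt(k)
        return S @ A

    # Toy usage: reduce the first dimension of a 1000 x 50 matrix to k = 20 rows.
    A = np.random.default_rng(1).standard_normal((1000, 50))
    print(gaussian_sketch(A, k=20).shape)  # (20, 50)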

A common way to measure the accuracy of the sketch is by considering the $k$-dimensional subspace spanned by its rows. The goal of the sketch is to choose a subspace that best aligns with the distribution of all of the rows of $A$ in $\mathbb{R}^n$. Intuitively, our goal is to minimize the (norm of the) residual when projecting a vector $a\in\mathbb{R}^n$ onto that subspace, i.e., $\|a - Pa\|$, where $P = (SA)^\dagger SA$ is the orthogonal projection matrix onto the subspace spanned by the rows of $SA$ (and $\dagger$ denotes the Moore-Penrose pseudoinverse). For this reason, the quantity that has appeared ubiquitously in the error analysis of RandNLA sketching is what we call the residual projection matrix:

$$P^\perp := I - (SA)^\dagger SA.$$
Since $P^\perp$ is random, the average performance of the sketch can often be characterized by its expectation, $\mathbb{E}[P^\perp]$. For example, the low-rank approximation error of the sketch can be expressed as $\mathbb{E}\,\|A - A(SA)^\dagger SA\|_F^2 = \mathrm{tr}\big(A\,\mathbb{E}[P^\perp]\,A^\top\big)$, where $\|\cdot\|_F$ denotes the Frobenius norm. A similar formula holds for the trace norm error of a sketched Nyström approximation [WS01, GM16]. Among others, this approximation error appears in the analysis of sketched kernel ridge regression [FSS20] and Gaussian process regression [BRVDW19]. Furthermore, a variety of iterative algorithms, such as randomized second-order methods for convex optimization [QRTF16, QR16, GKLR19, GRB20] and linear system solvers based on the generalized Kaczmarz method [GR15], have convergence guarantees which depend on the extreme eigenvalues of $\mathbb{E}[P^\perp]$. Finally, a generalized form of the expected residual projection has recently been used to model the implicit regularization of interpolating solutions in over-parameterized linear models [DLM19, BLLT19].
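For concreteness, the residual projection matrix and the associated low-rank approximation error can be computed directly; the toy matrix and its decaying spectrum below are arbitrary illustrative choices:

    import numpy as np

    def residual_projection(A, S):
        """P_perp = I - (SA)^+ SA: projection onto the orthogonal complement of the row span of SA."""
        SA = S @ A
        return np.eye(A.shape[1]) - np.linalg.pinv(SA) @ SA

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 40)) @ np.diag(0.9 ** np.arange(40))  # toy matrix with decaying spectrum
    S = rng.standard_normal((10, 500))                                   # Gaussian sketch, k = 10
    P_perp = residual_projection(A, S)

    # Low-rank approximation error of the sketch: ||A - A (SA)^+ SA||_F^2 = tr(A P_perp A^T)
    err_via_trace = np.trace(A @ P_perp @ A.T)
    err_direct = np.linalg.norm(A - A @ np.linalg.pinv(S @ A) @ (S @ A), "fro") ** 2
    print(err_via_trace, err_direct)  # the two values agree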

1.1 Main result

Despite its prevalence in the literature, the expected residual projection is not well understood, even in such simple cases as when $S$ is a Gaussian sketch (i.e., with i.i.d. standard normal entries). We address this by providing a surrogate expression, i.e., a simple analytically tractable approximation, for this matrix quantity:

$$\mathbb{E}\big[P^\perp\big] \;\simeq\; \bar P := \big(\gamma\, A^\top A + I\big)^{-1}, \qquad \text{where } \gamma>0 \text{ satisfies } \ \mathrm{tr}\big(\gamma A^\top A\,(\gamma A^\top A + I)^{-1}\big) = k. \tag{1}$$

Here, $\simeq$ means that while the surrogate expression is not exact, it approximates the true quantity up to some accuracy. Our main result provides a rigorous approximation guarantee for this surrogate expression with respect to a range of sketching matrices $S$, including the standard Gaussian and Rademacher sketches. We state the result using the positive semi-definite ordering, denoted by $\preceq$.
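In practice, the surrogate expression (1) is straightforward to evaluate numerically: $\gamma$ solves a monotone scalar equation in the squared singular values of $A$, so a bracketing root finder suffices. The helper below is a minimal sketch of this computation; the function name and the bracket $[10^{-12}, 10^{12}]$ are our own illustrative choices, not prescriptions from the paper.

    import numpy as np
    from scipy.optimize import brentq

    def surrogate_residual_projection(A, k):
        """Surrogate expression (1): P_bar = (gamma * A^T A + I)^{-1}, where gamma > 0
        is chosen so that tr(gamma * A^T A (gamma * A^T A + I)^{-1}) = k."""
        sigma2 = np.linalg.svd(A, compute_uv=False) ** 2           # squared singular values of A
        f = lambda g: np.sum(g * sigma2 / (1 + g * sigma2)) - k    # increasing in g; root exists for k < rank(A)
        gamma = brentq(f, 1e-12, 1e12)                             # bracket assumed wide enough
        P_bar = np.linalg.inv(gamma * A.T @ A + np.eye(A.shape[1]))
        return P_bar, gamma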

Theorem 1

Let $S$ be a $k\times m$ sketch with i.i.d. mean-zero unit-variance sub-Gaussian entries, and let $r = \mathrm{tr}(A^\top A)/\|A\|^2$ be the stable rank of $A$. If we let $C$ be a fixed constant larger than $1$, then for any sketch size $k \le r/C$,

$$(1-\epsilon)\,\bar P \;\preceq\; \mathbb{E}\big[P^\perp\big] \;\preceq\; (1+\epsilon)\,\bar P, \qquad \text{where } \epsilon = O\big(1/\sqrt{r}\big).$$

In other words, when the sketch size $k$ is smaller than the stable rank of $A$, the discrepancy between our surrogate expression and $\mathbb{E}[P^\perp]$ is of the order $1/\sqrt{r}$, where the big-O notation hides only the dependence on $C$ and on the sub-Gaussian constant (see Theorem 2 for more details). Our proof of Theorem 1 is inspired by techniques from random matrix theory which have been used to analyze the asymptotic spectral distribution of large random matrices by focusing on the associated matrix resolvents and Stieltjes transforms [HLN07, BS10]. However, our analysis is novel in several respects:

  1. The residual projection matrix can be obtained from the appropriately scaled resolvent matrix, $\lambda\big((SA)^\top SA + \lambda I\big)^{-1}$, by taking $\lambda\to 0$. Prior work (e.g., [HMRT19]) combined this with an exchange-of-limits argument to analyze the asymptotic behavior of the residual projection. This approach, however, does not allow for precise control in finite-dimensional problems. We are able to provide a more fine-grained, non-asymptotic analysis by working directly with the residual projection itself, instead of the resolvent.

  2. We require no assumptions on the largest and smallest singular values of $A$. Instead, we derive our bounds in terms of the stable rank of $A$ (as opposed to its actual rank), which implicitly compensates for ill-conditioned data matrices.

  3. We obtain upper/lower bounds for $\mathbb{E}[P^\perp]$ in terms of the positive semi-definite ordering $\preceq$, which can be directly converted to guarantees for the precise expressions of expected low-rank approximation error derived in the following section.

1.2 Low-rank approximation

We next provide some immediate corollaries of Theorem 1, where we use $a \simeq_\epsilon b$ to denote a multiplicative approximation, i.e., $(1-\epsilon)\,b \le a \le (1+\epsilon)\,b$. Note that our analysis is new even for the classical Gaussian sketch, where the entries of $S$ are i.i.d. standard normal. However, the results apply more broadly, including a standard class of database-friendly Rademacher sketches where each entry is a $\pm 1$ Rademacher random variable [Ach03]. We start by analyzing the Frobenius norm error of sketched low-rank approximations. Note that, by the definition of $\bar P$ in (1), we have $\mathrm{tr}\big(A\,\bar P\,A^\top\big) = \sum_i \sigma_i^2/(1+\gamma\,\sigma_i^2)$, so the surrogate expression we obtain for the expected error is remarkably simple.

Corollary 1

Let $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$ be the singular values of $A$. Under the assumptions of Theorem 1, we have:

$$\mathbb{E}\,\big\|A - A(SA)^\dagger SA\big\|_F^2 \;\simeq_\epsilon\; \sum_{i=1}^{n} \frac{\sigma_i^2}{1+\gamma\,\sigma_i^2}.$$

Remark 1

The parameter $\gamma$ increases at least linearly as a function of $k$, which is why the expected error will always decrease with increasing $k$. For example, when the singular values of $A$ exhibit exponential decay, i.e., $\sigma_i^2 \propto \beta^i$ for some $\beta\in(0,1)$, then the error also decreases exponentially, at the rate of $\beta^k$. We discuss this further in Section 3, giving explicit formulas for the error as a function of $k$ under both exponential and polynomial spectral decay profiles.
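A quick Monte Carlo check of Corollary 1 under an exponentially decaying spectrum; this is a self-contained illustrative sketch, and the decay rate, dimensions, and number of trials are arbitrary choices:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    m, n, k = 2000, 60, 10
    s2 = 0.95 ** np.arange(1, n + 1)                    # sigma_i^2 = 0.95^i, stable rank ~ 19
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag(np.sqrt(s2)) @ V.T

    # Prediction from Corollary 1: sum_i sigma_i^2 / (1 + gamma * sigma_i^2)
    gamma = brentq(lambda g: np.sum(g * s2 / (1 + g * s2)) - k, 1e-12, 1e12)
    prediction = np.sum(s2 / (1 + gamma * s2))

    # Empirical expected error over independent Gaussian sketches
    errs = []
    for _ in range(50):
        SA = rng.standard_normal((k, m)) @ A
        errs.append(np.linalg.norm(A - A @ np.linalg.pinv(SA) @ SA, "fro") ** 2)
    print(np.mean(errs), prediction)                    # the two numbers should be close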

The above result is important for many RandNLA methods, and it is also relevant in the context of kernel methods, where the data is represented via an $m\times m$ positive semi-definite kernel matrix $K$, which corresponds to the matrix of dot-products of the data vectors in some reproducing kernel Hilbert space. In this context, sketching can be applied directly to the matrix $K$ via an extended variant of the Nyström method [GM16]. A Nyström approximation of $K$ constructed from a sketching matrix $S$ is defined as $\tilde K = C^\top B^\dagger C$, where $C = SK$ and $B = SKS^\top$, and it is applicable to a variety of settings, including Gaussian Process regression, kernel machines and Independent Component Analysis [BRVDW19, WS01, BJ03]. By setting $A = K^{1/2}$, it is easy to see [DKM20] that the trace norm error $\mathrm{tr}(K - \tilde K)$ is identical to the squared Frobenius norm error of the low-rank sketch of $A$, so Corollary 1 implies that

$$\mathbb{E}\,\mathrm{tr}\big(K - \tilde K\big) \;\simeq_\epsilon\; \sum_{i=1}^{m} \frac{\lambda_i}{1+\gamma\,\lambda_i} \tag{2}$$

with any sub-Gaussian sketch, where $\lambda_1\ge\dots\ge\lambda_m$ denote the eigenvalues of $K$. Our error analysis given in Section 3 is particularly relevant here, since commonly used kernels such as the Radial Basis Function (RBF) or the Matérn kernel induce a well-understood eigenvalue decay [SZW97, RW06].
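A minimal sketched Nyström example for a kernel matrix; the 1-D toy data and the RBF scale below are arbitrary illustrative choices:

    import numpy as np

    def sketched_nystrom(K, S):
        """Nystrom approximation K_tilde = (SK)^T (S K S^T)^+ (SK) built from a sketch S."""
        C = S @ K
        B = S @ K @ S.T
        return C.T @ np.linalg.pinv(B) @ C

    rng = np.random.default_rng(0)
    x = rng.standard_normal(300)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # toy RBF kernel matrix on 1-D data

    S = rng.standard_normal((20, 300))                  # Gaussian sketch of size k = 20
    K_tilde = sketched_nystrom(K, S)
    print(np.trace(K - K_tilde) / np.trace(K))          # normalized trace-norm error, cf. (2) and Section 4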

1.3 Randomized iterative optimization

We next turn to a class of iterative methods which take advantage of sketching to reduce the per iteration cost of optimization. These methods have been developed in a variety of settings, from solving linear systems to convex optimization and empirical risk minimization, and in many cases the residual projection matrix appears as a black box quantity whose spectral properties determine the convergence behavior of the algorithms [GR15]. With our new results, we can precisely characterize not only the rate of convergence, but also, in some cases, the complete evolution of the parameter vector, for the following algorithms:

  1. Generalized Kaczmarz method [GR15] for approximately solving a linear system $Ax = b$;

  2. Randomized Subspace Newton [GKLR19], a second-order method, where we sketch the Hessian matrix;

  3. Jacobian Sketching [GRB20], a class of first-order methods which use additional information via a weight matrix that is sketched at every iteration.

We believe that extensions of our techniques will apply to other algorithms, such as that of [LPP19].

We next give a result in the context of linear systems for the generalized Kaczmarz method [GR15], but a similar convergence analysis is given for the methods of [GKLR19, GRB20] in Appendix B.

Corollary 2

Let $x^\star$ be the unique solution of $Ax = b$ and consider the iterative algorithm:

$$x_{t+1} \;=\; x_t - (S_t A)^\dagger S_t\,(A x_t - b), \qquad t = 0,1,2,\dots,$$

where each $S_t$ is an independent copy of the sketch $S$. Under the assumptions of Theorem 1, with $\bar P$ defined in (1) and an arbitrary starting point $x_0$, we have:

$$\mathbb{E}[x_t] - x^\star \;=\; \big(\mathbb{E}[P^\perp]\big)^t\,(x_0 - x^\star) \;\approx\; \bar P^{\,t}\,(x_0 - x^\star).$$

The corollary follows from Theorem 1 combined with Theorem 4.1 in [GR15]. Note that when $A^\top A$ is positive definite, then $\|\bar P\| < 1$, so the algorithm will converge from any starting point, and the worst-case convergence rate of the above method can be obtained by evaluating the largest eigenvalue of $\bar P$. However, the result itself is much stronger, in that it can be used to describe the (expected) trajectory of the iterates for any starting point $x_0$. Moreover, when the spectral decay profile of $A$ is known, then the explicit expressions for $\gamma$ as a function of $k$ derived in Section 3 can be used to characterize the convergence properties of generalized Kaczmarz as well as the other methods discussed above.
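A minimal sketch-and-project implementation of the generalized Kaczmarz update from Corollary 2, drawing a fresh Gaussian sketch at every step; this is an illustrative sketch rather than the authors' code, and the function name, problem size and iteration count are arbitrary:

    import numpy as np

    def generalized_kaczmarz(A, b, k, iters=200, rng=None):
        """x_{t+1} = x_t - (S_t A)^+ S_t (A x_t - b), with an independent k x m Gaussian sketch S_t per step."""
        rng = np.random.default_rng(0) if rng is None else rng
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(iters):
            S = rng.standard_normal((k, m))
            x = x - np.linalg.pinv(S @ A) @ (S @ (A @ x - b))
        return x

    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 50))
    x_star = rng.standard_normal(50)
    b = A @ x_star
    print(np.linalg.norm(generalized_kaczmarz(A, b, k=10) - x_star))  # converges to the solution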

1.4 Implicit regularization

Setting $x_0 = 0$, we can view one step of the iterative method in Corollary 2 as finding a minimum norm interpolating solution of an under-determined linear system $SAx = Sb$. Recent interest in the generalization capacity of over-parameterized machine learning models has motivated extensive research on the statistical properties of such interpolating solutions, e.g., [BLLT19, HMRT19, DLM19]. In this context, Theorem 1 provides new evidence for the implicit regularization conjecture posed by [DLM19] (see their Theorem 2 and associated discussion), with the amount of regularization equal to $\lambda = 1/\gamma$, where $\gamma$ is implicitly defined in (1):

$$\mathbb{E}\big[(SA)^\dagger S\,b\big] \;\approx\; \big(A^\top A + \lambda I\big)^{-1} A^\top b, \qquad \lambda = 1/\gamma.$$
While implicit regularization has received attention recently in the context of SGD algorithms for overparameterized machine learning models, it was originally discussed in the context of approximation algorithms more generally [Mah12]. Recent work has made precise this notion in the context of RandNLA [DLM19], and our results here can be viewed in terms of implicit regularization of scalable RandNLA methods.

1.5 Related work

A significant body of research has been dedicated to understanding the guarantees for low-rank approximation via sketching, particularly in the context of RandNLA [DM16, DM18]. This line of work includes i.i.d. row sampling methods [BMD08, AM15] which preserve the structure of the data, and data-oblivious methods such as Gaussian and Rademacher sketches [Mah11, HMT11, Woo14]. However, all of these results focus on worst-case upper bounds on the approximation error. One exception is a recent line of works on non-i.i.d. row sampling with determinantal point processes (DPP). In this case, exact analysis of the low-rank approximation error [DKM20], as well as precise convergence analysis of stochastic second order methods [MDK19], have been obtained. Remarkably, the expressions they obtain are analogous to (1), despite using completely different techniques. However, their analysis is limited only to DPP-based sketches, which are considerably more expensive to construct and thus much less widely used. The connection between DPPs and Gaussian sketches was recently explored by [DLM19] in the context of analyzing the implicit regularization effect of choosing a minimum norm solution in under-determined linear regression. They conjectured that the expectation formulas obtained for DPPs are a good proxy for the corresponding quantities obtained under a Gaussian distribution. While they only provide empirical evidence and asymptotic results for this particular claim, our Theorem 1 can be viewed as the first theoretical non-asymptotic justification of that conjecture.

The effectiveness of sketching has also been extensively studied in the context of second order optimization. These methods differ depending on how the sketch is applied to the Hessian matrix, and whether or not it is applied to the gradient as well. The class of methods discussed in Section 1.3, including Randomized Subspace Newton and the Generalized Kaczmarz method, relies on projecting the Hessian down to a low-dimensional subspace, which makes our results directly applicable. A related family of methods uses the so-called Iterative Hessian Sketch (IHS) approach [PW16, LP19]. The similarities between IHS and the Subspace Newton-type methods (see [QRTF16] for a comparison) suggest that our techniques could be extended to provide precise convergence guarantees for the IHS as well. Finally, yet another family of Hessian sketching methods has been studied by [RKM19, WGM17, XRKM17, YXRKM18, RLXM18, WRXM17, DM19]. These methods preserve the rank of the Hessian, and thus they provide somewhat different convergence guarantees that do not rely on the residual projection matrix.

2 Precise analysis of the residual projection

In this section, we give a detailed statement of our main technical result, along with a sketch of the proof. First, recall the definition of sub-Gaussian random variables and vectors.

Definition 1

We say that $x$ is a $\nu$-sub-Gaussian random variable if its sub-Gaussian Orlicz norm satisfies $\|x\|_{\psi_2} \le \nu$, where $\|x\|_{\psi_2} := \inf\{t>0 : \mathbb{E}\exp(x^2/t^2) \le 2\}$. Similarly, we say that a random vector $\mathbf{x}\in\mathbb{R}^n$ is $\nu$-sub-Gaussian if for all unit vectors $u\in\mathbb{R}^n$ we have $\|u^\top\mathbf{x}\|_{\psi_2} \le \nu$.

For convenience, we state the main result in a slightly different form than Theorem 1. Namely, we replace the matrix $A$ with a positive semi-definite matrix $\Sigma$. Furthermore, instead of a sketch $S$ with i.i.d. sub-Gaussian entries, we use a random matrix $Z$ with i.i.d. sub-Gaussian rows, which is a strictly weaker condition because it allows for the entries of each row to be correlated. Since the rows of $Z$ are also assumed to have mean zero and identity covariance, each row of $X = Z\Sigma^{1/2}$ has covariance $\Sigma$. In Section 2.2 we show how to convert this statement back to the form of Theorem 1.

Theorem 2

Let $X = Z\,\Sigma^{1/2}$ for $Z\in\mathbb{R}^{k\times n}$, where $Z$ has i.i.d. $\nu$-sub-Gaussian rows with zero mean and identity covariance, and $\Sigma$ is an $n\times n$ positive semi-definite matrix. Define:

$$P^\perp := I - X^\dagger X \qquad\text{and}\qquad \bar P := \big(\gamma\,\Sigma + I\big)^{-1},\quad\text{where }\gamma>0\text{ satisfies }\ \mathrm{tr}\big(\gamma\Sigma\,(\gamma\Sigma+I)^{-1}\big)=k.$$

Let $r = \mathrm{tr}(\Sigma)/\|\Sigma\|$ be the stable rank of $\Sigma^{1/2}$ and fix $c\in(0,1)$. There exists a constant $C$, depending only on $c$ and $\nu$, such that if $k \le c\,r$, then

$$(1-\epsilon)\,\bar P \;\preceq\; \mathbb{E}\big[P^\perp\big] \;\preceq\; (1+\epsilon)\,\bar P, \qquad \epsilon = C/\sqrt{r}. \tag{3}$$

We first provide the following informal derivation of the expression for $\bar P$ given in Theorem 2. Let us use $P^\perp$ to denote the matrix $I - X^\dagger X$. Using a rank-one update formula for the Moore-Penrose pseudoinverse (see Lemma 1 in the appendix) we have

$$P^\perp \;=\; P_{-i}^\perp \;-\; \frac{P_{-i}^\perp\, x_i x_i^\top\, P_{-i}^\perp}{x_i^\top P_{-i}^\perp x_i},$$

where we use $x_i^\top$ to denote the $i$-th row of $X$, and $P_{-i}^\perp := I - X_{-i}^\dagger X_{-i}$, where $X_{-i}$ is the matrix $X$ without its $i$-th row. Due to the sub-Gaussianity of $x_i$, the quadratic form in the denominator concentrates around its expectation (with respect to $x_i$), i.e., $x_i^\top P_{-i}^\perp x_i \approx \mathrm{tr}\big(\Sigma\,P_{-i}^\perp\big)$, where we use $\mathbb{E}[x_i x_i^\top] = \Sigma$. Further note that, with $P_{-i}^\perp \approx P^\perp$ for large $n$ and from a concentration argument, we conclude that

$$\mathbb{E}\big[P^\perp\big] \;\approx\; \Big(\tfrac{k}{\mathrm{tr}(\Sigma\,\mathbb{E}[P^\perp])}\,\Sigma + I\Big)^{-1},$$

and thus $\mathbb{E}[P^\perp]\approx(\gamma\Sigma+I)^{-1}$ for $\gamma = k/\mathrm{tr}\big(\Sigma\,\mathbb{E}[P^\perp]\big)$, which satisfies $\mathrm{tr}\big(\gamma\Sigma(\gamma\Sigma+I)^{-1}\big)=k$. This leads to the (implicit) expressions for $\bar P$ and $\gamma$ given in Theorem 2.
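The fixed-point derivation above can be checked numerically; the following self-contained snippet (with arbitrary illustrative parameters) compares a Monte Carlo estimate of $\mathbb{E}[P^\perp]$ for $X = Z\Sigma^{1/2}$ against the surrogate $(\gamma\Sigma + I)^{-1}$:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    n, k, trials = 40, 8, 2000
    lam = 0.9 ** np.arange(n)                            # eigenvalues of Sigma (diagonal, for simplicity)
    Sigma = np.diag(lam)
    Sigma_half = np.diag(np.sqrt(lam))                   # valid matrix square root since Sigma is diagonal

    # Surrogate: P_bar = (gamma * Sigma + I)^{-1} with tr(gamma * Sigma (gamma * Sigma + I)^{-1}) = k
    gamma = brentq(lambda g: np.sum(g * lam / (1 + g * lam)) - k, 1e-12, 1e12)
    P_bar = np.linalg.inv(gamma * Sigma + np.eye(n))

    # Monte Carlo estimate of E[P_perp] for X = Z Sigma^{1/2}, Z with i.i.d. N(0,1) entries
    E_P = np.zeros((n, n))
    for _ in range(trials):
        X = rng.standard_normal((k, n)) @ Sigma_half
        E_P += np.eye(n) - np.linalg.pinv(X) @ X
    E_P /= trials

    print(np.linalg.norm(E_P - P_bar, 2))                # small spectral-norm discrepancy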

2.1 Proof sketch of Theorem 2

To make the above intuition rigorous, we next present a proof sketch for Theorem 2, with the detailed proof deferred to Appendix A. The proof can be divided into the following three steps.

Step 1.

First note that, to obtain the lower and upper bounds for $\mathbb{E}[P^\perp]$ in the positive semi-definite ordering of Theorem 2, it suffices to bound the spectral norm $\big\|\bar P^{-1/2}\big(\mathbb{E}[P^\perp]-\bar P\big)\bar P^{-1/2}\big\|$ by $\epsilon$. This means that all eigenvalues of the p.s.d. matrix $\bar P^{-1/2}\,\mathbb{E}[P^\perp]\,\bar P^{-1/2}$ lie in the interval $[1-\epsilon,\,1+\epsilon]$, so multiplying by $\bar P^{1/2}$ from both sides, we obtain the desired bound.

Step 2.

Then, we carefully design an event that (i) provably occurs with high probability and (ii) ensures that the denominators in the following decomposition are bounded away from zero:

where we let and .

Step 3.

It then remains to bound the spectral norms of the individual terms in this decomposition to reach the conclusion. More precisely, two of the terms are directly of the desired order, while a third can be bounded using the rank-one update formula for the pseudoinverse (Lemma 1 in the appendix). The remaining term is more subtle and can be bounded with a careful application of sub-Gaussian concentration inequalities (Lemmas 2 and 3 in the appendix). This allows for a bound on the operator norm, and hence the conclusion.

2.2 Proof of Theorem 1

We now discuss how Theorem 1 can be obtained from Theorem 2. The crucial difference between the statements is that in Theorem 1 we let $A$ be an arbitrary rectangular matrix, whereas in Theorem 2 we instead use a square, symmetric and positive semi-definite matrix $\Sigma$. To convert between the two notations, consider the SVD decomposition $A = UDV^\top$ (recall that we assume $n\le m$), where $U\in\mathbb{R}^{m\times n}$ and $V\in\mathbb{R}^{n\times n}$ have orthonormal columns and $D$ is an $n\times n$ diagonal matrix. Now, let $\Sigma = D^2$, $Z = SU$ and $X = Z\Sigma^{1/2} = SUD$. Using the fact that $SA = XV^\top$, it follows that:

$$I - (SA)^\dagger SA \;=\; V\big(I - X^\dagger X\big)V^\top.$$

Note that since $U$ has orthonormal columns, the rows of $Z$ are sub-Gaussian with the same constant as the rows of $S$. Moreover, using the fact that $B \preceq C$ implies $VBV^\top \preceq VCV^\top$ for any p.s.d. matrices $B$ and $C$, Theorem 1 follows as a corollary of Theorem 2.

3 Explicit formulas under known spectral decay

The expression we give for the expected residual projection, $\bar P = (\gamma A^\top A + I)^{-1}$, is implicit in that it depends on the parameter $\gamma$, which is the solution of the following equation:

$$\sum_{i=1}^{n} \frac{\gamma\,\sigma_i^2}{1+\gamma\,\sigma_i^2} \;=\; k. \tag{4}$$

In general, it is impossible to solve this equation analytically, i.e., to write $\gamma$ as an explicit formula of $k$ and the singular values of $A$. However, we show that when the singular values exhibit a known rate of decay, then it is possible to obtain explicit formulas for $\gamma$. In particular, this allows us to provide precise and easily interpretable rates of decay for the low-rank approximation error of a sub-Gaussian sketch.

Matrices that have known spectral decay, most commonly with either exponential or polynomial rate, arise in many machine learning problems [MDK19]. Such behavior can be naturally occurring in data, or it can be induced by feature expansion using, say, the RBF kernel (for exponential decay) [SZW97] or the Matérn kernel (for polynomial decay) [RW06]. Understanding these two classes of decay plays an important role in distinguishing the properties of light-tailed and heavy-tailed data distributions. Note that in the kernel setting we may often represent our data via the kernel matrix $K$, instead of the data matrix $A$, and study the sketched Nyström method [GM16] for low-rank approximation. To handle the kernel setting in our analysis, it suffices to replace the squared singular values $\sigma_i^2$ of $A$ with the eigenvalues $\lambda_i$ of $K$.

3.1 Exponential spectral decay

Suppose that the squared singular values of $A$ exhibit exponential decay, i.e. $\sigma_i^2 = c\,\beta^i$, where $c$ is a constant and $\beta\in(0,1)$. For simplicity of presentation, we will let $c = 1$. Under this spectral decay, we can approximate the sum in (4) by the analytically computable integral $\int_0^\infty \frac{\gamma\beta^x}{1+\gamma\beta^x}\,dx$, obtaining $\gamma \approx \beta^{-k}-1$. Applying this to the formula from Corollary 1, we can express the low-rank approximation error for a sketch of size $k$ as follows:

$$\mathbb{E}\,\big\|A - A(SA)^\dagger SA\big\|_F^2 \;\approx\; \frac{k\,\beta^k}{1-\beta^k}. \tag{5}$$

In Figure 1a, we plot the above formula against the numerically obtained implicit expression from Corollary 1, as well as empirical results for a Gaussian sketch. First, we observe that the theoretical predictions closely align with the empirical values even after the sketch size $k$ crosses the stable rank $r$, suggesting that Theorem 1 can be extended to this regime. Second, while it is not surprising that the error decays at a similar rate as the singular values, our predictions offer a much more precise description, down to lower order effects and even constant factors. For instance, we observe that the error (normalized by $\|A\|_F^2$, as in the figure) only starts decaying exponentially after $k$ crosses the stable rank, and until that point it decreases at a roughly linear rate.
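For illustration, the implicit and explicit predictions can be compared numerically; the short script below (with arbitrary parameter choices) solves (4) and evaluates both the Corollary 1 expression and the closed-form approximation (5) under $\sigma_i^2 = \beta^i$:

    import numpy as np
    from scipy.optimize import brentq

    beta, n = 0.95, 2000
    s2 = beta ** np.arange(1, n + 1)                     # sigma_i^2 = beta^i
    for k in [5, 10, 20, 40]:
        gamma = brentq(lambda g: np.sum(g * s2 / (1 + g * s2)) - k, 1e-12, 1e15)
        implicit = np.sum(s2 / (1 + gamma * s2))          # implicit prediction from Corollary 1 and (4)
        explicit = k * beta ** k / (1 - beta ** k)        # explicit approximation, cf. (5)
        print(k, implicit, explicit)                      # nearly (but not exactly) identical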

Figure 1: Theoretical predictions of the low-rank approximation error of a Gaussian sketch under known spectral decay (a: exponential decay, b: polynomial decay), compared to the empirical results. For the theory, we plot the explicit formulas (5) and (6) (dashed lines), as well as the implicit expression from Corollary 1 (thin solid lines) obtained by numerically solving (4). Observe that the explicit and implicit predictions are nearly (but not exactly) identical.

3.2 Polynomial spectral decay

We now turn to polynomial spectral decay, which is a natural model for analyzing heavy-tailed data distributions. Let $A$ have squared singular values $\sigma_i^2 = i^{-p}$ for some $p>1$, and let $C_p := \pi/\big(p\sin(\pi/p)\big)$. As in the case of exponential decay, we use the integral $\int_0^\infty \frac{\gamma x^{-p}}{1+\gamma x^{-p}}\,dx = C_p\,\gamma^{1/p}$ to approximate the sum in (4), and solve for $\gamma$, obtaining $\gamma \approx (k/C_p)^p$. Combining this with Corollary 1 we get:

$$\mathbb{E}\,\big\|A - A(SA)^\dagger SA\big\|_F^2 \;\approx\; C_p\,\big(k/C_p\big)^{1-p}. \tag{6}$$

Figure 1b compares our predictions to the empirical results for several values of $p$. In all of these cases, the stable rank is close to 1, and yet the theoretical predictions align very well with the empirical results. Overall, the asymptotic rate of decay of the error is $k^{1-p}$. However, it is easy to verify that the lower order effect of $k/C_p$ appearing instead of $k$ in (6) significantly changes the trajectory for small values of $k$. Also, note that as $p$ grows large, the constant $C_p$ goes to 1, but it plays a significant role for $p=2$ or $3$ (roughly, scaling the expression by a factor of 2). Finally, we remark that for $p$ close to 1, our integral approximation of (4) becomes less accurate. We expect that a corrected expression is possible, but likely more complicated and less interpretable.

4 Empirical results

In this section, we numerically verify the accuracy of our theoretical predictions for the low-rank approximation error of sketching on benchmark datasets from the libsvm repository [CL11] (further numerical results are in Appendix C). We repeated every experiment 10 times, and plot both the average and the standard deviation of the results. We use the following sketching matrices $S$:

  1. Gaussian sketch: $S$ with i.i.d. standard normal entries;

  2. Rademacher sketch: $S$ with i.i.d. entries equal to $1$ with probability 0.5 and $-1$ otherwise (a minimal generator for both sketch types is sketched below).
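The following illustrative helper generates the two sketch types; it is not the experimental code, dataset loading from libsvm is omitted, and a synthetic matrix is used instead:

    import numpy as np

    def make_sketch(k, m, kind="gaussian", rng=None):
        """Return a k x m sketching matrix of the requested type."""
        rng = np.random.default_rng(0) if rng is None else rng
        if kind == "gaussian":
            return rng.standard_normal((k, m))            # i.i.d. N(0,1) entries
        if kind == "rademacher":
            return rng.choice([-1.0, 1.0], size=(k, m))   # i.i.d. +/-1 entries, each with probability 0.5
        raise ValueError(f"unknown sketch type: {kind}")

    # Both sketch types have sub-Gaussian entries, so Corollary 1 predicts comparable errors.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((1000, 30)) @ np.diag(0.85 ** np.arange(30))
    for kind in ["gaussian", "rademacher"]:
        S = make_sketch(10, 1000, kind, rng)
        err = np.linalg.norm(A - A @ np.linalg.pinv(S @ A) @ (S @ A), "fro") ** 2
        print(kind, err / np.linalg.norm(A, "fro") ** 2)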

Figure 2: Theoretical predictions versus approximation error for the sketched Nyström with the RBF kernel (spectral decay shown at the bottom).
Figure 3: Theoretical predictions versus approximation error for the Gaussian and Rademacher sketches (spectral decay shown at the bottom).

Varying spectral decay.

To demonstrate the role of spectral decay and the stable rank in the approximation error, we performed feature expansion using the radial basis function (RBF) kernel $K(x_i,x_j) = \exp\big(-\|x_i-x_j\|^2/(2\sigma^2)\big)$, obtaining an $m\times m$ kernel matrix $K$. We used the sketched Nyström method to construct a low-rank approximation $\tilde K$, and computed the normalized trace norm error $\mathrm{tr}(K-\tilde K)/\mathrm{tr}(K)$. The theoretical predictions come from (2), which in turn uses Theorem 1. Following [GM16], we use the RBF kernel because varying the scale parameter $\sigma$ allows us to observe the approximation error under qualitatively different spectral decay profiles of the kernel. In Figure 2, we present the results for the Gaussian sketch on two datasets, with three values of $\sigma$, and in all cases our theory aligns with the empirical results. Furthermore, as a smaller $\sigma$ leads to slower spectral decay and a larger stable rank, it also makes the approximation error decay more linearly for small sketch sizes. This behavior is predicted by our explicit expression (5) for the error under exponential spectral decay from Section 3. Once the sketch size is sufficiently larger than the stable rank, the error starts decaying at an exponential rate. Note that Theorem 1 only guarantees the accuracy of our expressions for sketch sizes below the stable rank; however, the predictions are accurate regardless of this constraint.

Varying sketch type.

In the next set of empirical results, we compare the performance of Gaussian and Rademacher sketches, and also verify the theory when sketching the data matrix $A$ without kernel expansion, plotting the normalized error $\|A - A(SA)^\dagger SA\|_F^2\,/\,\|A\|_F^2$. Since both of the sketching methods have sub-Gaussian entries, Corollary 1 predicts that they should have comparable performance in this task and match our expressions. This is exactly what we observe in Figure 3 for two datasets and a range of sketch sizes, as well as in other empirical results shown in Appendix C.

5 Conclusions

We derived the first theoretically supported precise expressions for the expected residual projection matrix, which is a central component in the analysis of RandNLA dimensionality reduction via sketching. Our analysis provides a new understanding of low-rank approximation, the Nyström method, and the convergence properties of many randomized iterative algorithms. As a direction for future work, we conjecture that our main result can be extended to sketch sizes larger than the stable rank of the data matrix.

Acknowledgments.

We would like to acknowledge DARPA, NSF, and ONR for providing partial support of this work.

References

Appendix A Proof of Theorem 2

We first introduce the following technical lemmas.

Lemma 1

For $X\in\mathbb{R}^{k\times n}$ with rows $x_1^\top,\dots,x_k^\top$, denote $P^\perp := I - X^\dagger X$ and $P_{-i}^\perp := I - X_{-i}^\dagger X_{-i}$, with $X_{-i}$ the matrix $X$ without its $i$-th row $x_i^\top$. Then, conditioned on the event $\{x_i^\top P_{-i}^\perp x_i \neq 0\}$:

$$P^\perp \;=\; P_{-i}^\perp - \frac{P_{-i}^\perp\,x_i x_i^\top\,P_{-i}^\perp}{x_i^\top P_{-i}^\perp x_i}.$$

Proof  Since conditioned on we have , from Theorem 1 in [Mey73] we deduce

for so that , where we used the fact that is a projection matrix so that . As a consequence, multiplying by and simplifying we get

By definition of the pseudoinverse, so that

where we used and thus the conclusion.  

Lemma 2

For a -sub-Gaussian random vector with , and positive semi-definite matrix , we have

with the stable rank of , and

for some independent of .

Proof  From Corollary 2.9 in [Zaj18] we have, for -sub-Gaussian with , and symmetric positive semi-definite that

for some universal constant . Taking we have

where we use the fact that .

Integrating this bound yields:

and thus the conclusion.  

Lemma 3

With the notation of Lemma 1, for and , we have

for some universal constant .

Proof  To simplify notation, we work on instead of ; the same line of argument applies to by changing the sample size to .

First note that

where we used the fact that , for the conditional expectation with respect to the $\sigma$-field generated by the rows of . This forms a martingale difference sequence (it is a difference sequence of the Doob martingale for with respect to the filtration ), and hence it falls within the scope of the Burkholder inequality [Bur73], recalled as follows.

Lemma 4

For a real martingale difference sequence with respect to an increasing filtration and for , there exists such that

From Lemma 1, is positive semi-definite, we have so that with Lemma 4 we obtain with that, for

In particular, for , we obtain .

For the second result, since we have almost surely bounded martingale differences (), by the Azuma-Hoeffding inequality

as desired.  

A.1 Complete proof of Theorem 2

Equipped with the lemmas above, we are ready to prove Theorem 2. First note that:

  1. Since for any , we can assume without loss of generality (after rescaling correspondingly) that .

  2. According to the definition of and , the following bounds hold

    (7)

    for and , where we used the fact that

    so that .

  3. As already discussed in Section 2.1, to obtain the lower and upper bounds for $\mathbb{E}[P^\perp]$ in the positive semi-definite ordering of Theorem 2, it suffices to bound the following spectral norm

    (8)

    so that, with from (7), we have

    Defining , this means that all eigenvalues of the p.s.d. matrix lie in the interval , and

    so that by multiplying on both sides, we obtain the desired bound.


As a consequence of the above observations, we only need to prove (8) under the setting . The proof comes in the following two steps:

  1. For , with the matrix without its -th row, we define, for , the following events

    (9)

    where we recall is the -th row of so that and . With Lemma 2, we can bound the probability of and , and consequently that of for ;

  2. We then bound, conditioned on and respectively, the spectral norm . More precisely, since

    where we used Lemma 1 for the third equality and denote as well as . It then remains to bound the spectral norms of to reach the conclusion.

Another important relation that will be constantly used throughout the proof is