Dimensionality reduction is a ubiquitous tool across a wide array of disciplines: machine learning[WDL09], high-dimensional computational geometry [Ind01], privacy [BBDS12], compressed sensing [CT05], spectral graph theory [SS11]
, interior point methods for linear programming[LS13], numerical linear algebra [Sar06]BB05, BBV06], manifold learning [HWB07, Cla08], motif-finding in computational biology [BT02], astronomy [CM12]
, and several others. Across all these disciplines one is typically faced with data that is not only massive, but each data item itself is represented as a very high-dimensional vector. For example, when learning spam classifiers a data point is an email, and it is represented as a high-dimensional vector indexed by dictionary words[WDL09]. In astronomy a data point could be a star, represented as a vector of light intensities measured over various points sampled in time [KZM02, VJ11]. Dimensionality reduction techniques in such applications provide the following benefits:
Smaller storage consumption.
Speedup during data analysis.
Cheaper signal acquisition.
Cheaper transmission of data across computing clusters.
The technical guarantees required from a dimensionality reduction routine are application-specific, but typically such methods must reduce dimension while still preserving point geometry, e.g. inter-point distances and angles. That is, one has some point set with very large, and we would like a dimensionality-reducing map , , such that
for some norm . Note also that for unit vectors , , and thus also preserves angles with additive error if it preserves Euclidean norms of points in .
Theorem 1 (JL lemma).
For any subset of Euclidean space and , there exists with providing Eq.(1.1) for .
This bound on is nearly tight: for any Alon exhibited a point set , , such that any such JL map must have [Alo03]. In fact, all known proofs of the JL lemma provide linear , and the JL lemma is tight up to a constant factor in when must be linear [LN14]. Unfortunately, for actual applications such worst-case understanding is unsatisfying. Rather we could ask: if given a distortion parameter and point set as input (or a succinct description of it if is large or even infinite, as in some applications), what is the best target dimension such that a JL map exists for with this particular ? That is, in practice we are more interested in moving beyond worst case analysis and being as efficient as possible for our particular data .
Unfortunately the previous question seems fairly difficult. For the related question of computing the optimal distortion for embedding into a line (i.e. ), it is computationally hard to approximate the optimal distortion even up to a multiplicative factor polynomial in [BCIS05]. In practice, however, typically cannot be chosen arbitrarily as a function of
anyway. For example, when employing certain learning algorithms such as stochastic gradient descent on dimensionality-reduced data, it is at least required thatis differentiable on (where ) [WDL09]. For several applications it is also crucial that is linear, e.g. in numerical linear algebra [Sar06] and compressed sensing [CT05, Don06]. In one-pass streaming applications [CW09] and data structural problems such as nearest neighbor search [HIM12], it is further required that is chosen randomly without knowing . For any particular , a random
drawn from some distribution must satisfy the JL guarantee with good probability. In streaming applications this is becauseis not fully known up front, but is gradually observed in a stream. In data structure applications this is because must preserve distances to some future query points, which are not known at the time the data structure is constructed.
Due to the considerations discussed, in practice typically is chosen as a random linear map drawn from some distribution with a small number of parameters (in some cases simply the parameter ). For example, popular choices of
include a random matrix with independent Gaussian[HIM12] or Rademacher [Ach03] entries. While worst case bounds inform us how to set parameters to obtain the JL guarantee for worst case , we typically can obtain better parameters by exploiting prior knowledge about . Henceforth we only discuss linear , so we write for . Furthermore by linearity, rather than preserving Euclidean distances in it is equivalent to discuss preserving norms of all vectors in , the set of all normalized difference vectors amongst points in . Thus up to changing by roughly a factor of , Eq.(1.1) is equivalent to
Furthermore, since we consider chosen at random, we more specifically want
Instance-wise understanding for achieving Eq.(1.3) was first provided by Gordon [Gor88], who proved that a random with i.i.d. Gaussian entries satisfies Eq.(1.3) as long as , where we write if for a universal constant . Denoting by a standard -dimensional Gaussian vector, the parameter is defined as the Gaussian mean width
One can think of as describing the -geometric complexity of . It is always true that , and thus Gordon’s theorem implies the JL lemma. In fact for all we know from applications, such as for the restricted isometry property from compressed sensing [CT05] or subspace embeddings from numerical linear algebra [Sar06], the best bound on is a corollary of Gordon’s theorem. Later works extended Gordon’s theorem to other distributions for , such as having independent subgaussian entries (e.g. Rademachers) [KM05, MPTJ07, Dir14].
Although Gordon’s theorem gives a good understanding for in most scenarios, it suffers from the fact that it analyzes a dense random , which means that performing the dimensionality reduction is dense matrix-vector multiplication, and is thus slow. For example, in some numerical linear algebra applications (such as least squares regression [Sar06]), multiplying a dense unstructured with the input turns out to be slower than obtaining the solution of the original, high-dimensional problem! In compressed sensing, certain iterative recovery algorithms such as CoSamp [NT09] and Iterative Hard Thresholding [BD08] involve repeated multiplications by and , the conjugate transpose of , and thus supporting fast matrix-vector multiply are desirable in such applications as well.
The first work to provide with small supporting faster multiplication is the Fast Johnson-Lindenstrauss Transform (FJLT) of [AC09] for finite . The value of was still , with the time to multiply being . [AL09] later gave an improved construction with time for any small constant . Most recently several works gave nearly linear embedding time in , independent of , at the expense of increasing by a factor [AL13, KW11, NPW14]. In all these works
is the product of some number of very sparse matrices and Fourier matrices, with the speed coming from the Fast Fourier Transform (FFT)[CT65]. It is also known that this FFT-based approach can be used to obtain fast RIP matrices for compressed sensing [CT06, RV08, CGV13] and fast oblivious subspace embeddings for numerical linear algebra applications [Sar06] (see also [Tro11, LDFU13] for refined analyses in the latter case).
Another line of work, initiated in [Ach03] and greatly advanced in [DKS10], sought fast embedding time by making sparse. If is drawn from a distribution over matrices having at most non-zeroes per column, then can be computed in time . After some initial improvements [KN10, BOR10], the best known achievable value of to date for the JL lemma while still maintaining is the sparse Johnson-Lindenstrauss Transform (SJLT) of [KN14], achieving . Furthermore, an example of a set exists which requires this bound on up to for any linear JL map [NN13b]. Note however that, again, this is an understanding of the worst-case parameter settings over all .
In summary, while Gordon’s theorem gives us a good understanding of instance-wise bounds on for achieving good dimensionality reduction, it only does so for dense, slow . Meanwhile, our understanding for efficient , such as the SJLT with small , has not moved beyond the worst case. In some very specific examples of we do have good bounds for settings of that suffice, such as the unit norm vectors in a -dimensional subspace [CW13, MM13, NN13a], or all elements of having small norm [Mat08, DKS10, KN10, BOR10]. However, our understanding for general is non-existent. This brings us to the main question addressed in this work, where denotes the -unit sphere .
Let and be the SJLT. What relationship must satisfy, in terms of the geometry of , to ensure (1.3)?
We also note that while the FFT-based and sparse approaches seem orthogonal at first glance, the two are actually connected, as pointed out before [AC09, Mat08, NPW14]. The FJLT sets where is some random preconditioning matrix that makes “nice” with high probability, and is a random sparse matrix. We point out that although the SJLT is not the same as the matrix typically used in the FFT-based literature, one could replace with the SJLT and hope for similar algorithmic outcome if is small: nearly linear embedding time.
The answer to the analog of Question 2 for a standard Gaussian matrix depends only on the -metric structure of . Indeed, since both -distances and Gaussian matrices are invariant under orthogonal transformations, so is (1.3) in this case. This is reflected in Gordon’s theorem, where the embedding dimension is governed by the Gaussian width, which is invariant under orthogonal transformations. We stress that in sharp contrast, a resolution of Question 2 cannot solely depend on the -metric structure of . Indeed, we require that be sparse in a particular basis and is therefore not invariant under orthogonal transformations. Thus an answer to Question 2 must be more nuanced (see our main theorem, Theorem 3).
Our Main Contribution:
We provide a general theorem which answers Question 2. Specifically, for every analyzed in previous work that we apply our general theorem to here, we either (1) qualitatively recover or improve the previous best known result, or (2) prove the first non-trivial result for dimensionality reduction with sparse . We say “qualitatively” since applying our general theorem to these applications loses a factor of in and in .
In particular for (2), our work is the first to imply that non-trivially sparse dimensionality reducing linear maps can be applied for gain in model-based compressed sensing [BCDH10], manifold learning [TdSL00, DG13], and constrained least squares problems such as the popular Lasso [Tib96].
Specifically, we prove the following theorem.
Theorem 3 (Main Theorem).
Let and be an SJLT with column sparsity . Define the complexity parameter
where are i.i.d. standard Gaussian and i.i.d. Bernoulli with mean . If
Then (1.3) holds as long as furthermore satisfy the condition
The complexity parameter may seem daunting at first, but Section 8 shows it can be controlled quite easily for all the we have come across in applications.
Here we describe various and their importance in certain applications. Then we state the consequences that arise from our theorem. In order to highlight the qualitative understanding arising from our work, we introduce the notation if . A summary of our bounds is in Figure 1.
Our theorem implies suffices in general, and in the latter case, qualitatively matching the above.
Here for some -dimensional linear subspace , for which achieving Eq.(1.3) with is possible [AHK06, CW13]. A distribution satisfying Eq.(1.3) for any -dimensional subspace is known as an oblivious subspace embedding (OSE). The paper [Sar06] pioneered the use of OSE’s for speeding up approximate algorithms for numerical linear algebra problems such as low-rank approximation and least-squares regression. More applications have since been found to approximating leverage scores [DMIMW12], -means clustering [BZMD11, CEM14], canonical correlation analysis [ABTZ13]PBMID13], -regression [CDMI13, WZ13]LDFU13]
, streaming approximation of eigenvalues[AN13], and speeding up interior point methods for linear programming [LS13]. In many of these applications there is some input , , and the subspace is for example the column space of
. Often an exact solution requires computing the singular value decomposition (SVD) of, but using OSE’s the running time is reduced to that for computing , plus computing the SVD of the smaller matrix . The work [CW13] showed with small is sufficient, yielding algorithms for least squares regression and low-rank approximation whose running times are linear in the number of non-zero entries in for sufficiently lopsided rectangular matrices.
Our theorem implies that and suffices to satisfy Eq.(1.3), which is qualitatively correct. Furthermore, a subset of our techniques reveals that if the maximum incoherence is at most , then suffices (Theorem 5). This was not known in previous work. A random -dimensional subspace has incoherence w.h.p. for by the JL lemma, and thus is very incoherent if .
Closed convex cones:
For , , and a closed convex set, consider the constrained least squares problem of minimizing subject to . A popular choice is the Lasso [Tib96], in which the constraint set encourages sparsity of . Let be an optimal solution, and let be the tangent cone of at some point (see Appendix B for a definition). Suppose we wish to accelerate approximately solving the constrained least squares problem by instead computing a minimizer of subject to . The work [PW14] showed that to guarantee , it suffices that satisfy two conditions, one of which is Eq.(1.2) for . The paper [PW14] then analyzed dense random matrices for sketching constrained least squares problems. For example, for the Lasso if we are promised that the optimal solution is -sparse, [PW14] shows that it suffices to set
where is the th column of and the smallest -restricted eigenvalue of :
Our work also applies to such (and we further show the SJLT with small satisfies the second condition required for approximate constrained least squares; see Theorem 16). For example for the Lasso, we show that again it suffices that
but where we only require from that
That is, the sparsity of need only depend on the largest entry in as opposed to the largest column norm in . The former can be much smaller.
Unions of subspaces:
Define , where is some index set and each is a -dimensional linear subspace. A case of particular interest is when ranges over all -subsets of , and is the subspace spanned by (so ). Then is simply the set of all -sparse unit vectors of unit Euclidean norm: for denoting support size. satisfying (1.2) is then referred to as having the restricted isometry property (RIP) of order with restricted isometry constant [CT05]. Such are known to exist with , and furthermore it is known that implies that any (approximately) -sparse can be (approximately) recovered from in polynomial time by solving a certain linear program [CT05, Can08]. Unfortunately it is known for that any RIP matrix with such small must have [NN13b]. Related is the case of vectors sparse in some other basis, i.e. for some so-called “dictionary” (i.e. the subspaces are acted on by ), or when only allows for some subset of all sparsity patterns in model-based compressed sensing [BCDH10] (so that ).
Our main theorem also implies RIP matrices with . More generally when a dictionary is involved such that the subspaces are all -incoherent (as defined above), the sparsity can be further improved to . That is, to satisfy the RIP with dictionaries yielding incoherent subspaces, we can keep qualitatively the same while making much smaller. For the general problem of unions of -dimensional subspaces, our theorem implies one can either set or . Previous work required to depend on the product of and instead of the sum [NN13a], including a nice recent improvement by Cohen [Coh14], and is thus unsuitable for the application to the set of -sparse vectors (RIP matrices with rows are already attainable via simpler methods using incoherence; e.g. see [BDF11, Proposition 1]). Iterative recovery algorithms such as CoSamp can also be used in model-based sparse recovery [BCDH10], which again involves multiplications by , and thus sparse is relevant for faster recovery. Our theorem thus shows, for the first time, that the benefit of model-based sparse recovery is not just a smaller , but rather that the measurement matrix can be made much sparser if the model is simple (i.e. is small). For example, in the block-sparse model one wants to (approximately) recover a signal based on linear measurements, where is (approximately) -block-sparse. That is, the coordinates are partitioned into blocks of size each, and each block is either “on” (at least one coordinate in that block non-zero), or “off” (all non-zero). A -block-sparse signal has at most blocks on (thus ). Thus . Then as long as , our results imply non-trivial column-sparsity . Ours is the first result yielding non-trivial sparsity in a model-RIP for any model with a number of measurements qualitatively matching the optimal bound (which is on the order of [ADR14]). We remark that for model-based RIP, where one wants to approximately preserve -norms of -block-sparse vectors, which is useful for recovery, [IR13] have shown a much better sparsity bound of non-zeroes per column in their measurement matrix. However, they have also shown that any model-based RIP matrix for block-sparse signals must satisfy the higher lower bound of (which is tight).
Previous work also considered , where is any bounded orthonormal system, i.e. is orthogonal and (e.g. the Fourier matrix). Work of [CT06, RV08, CGV13] shows can then be a sampling matrix (one non-zero per row) with . Since randomly flipping the signs of every column in an RIP matrix yields JL [KW11], this also gives a good implementation of an FJLT. Our theorem recovers a similar statement, but using the SJLT instead of a sampling matrix, where we show and suffice for orthogonal satisfying the weaker requirement .
Suppose we are given several images of a human face, but with varying lighting and angle of rotation. Or perhaps the input is many sample handwritten images of letters. In these examples, although input data is high-dimensional ( is the number of pixels), we imagine all possible inputs come from a set of low intrinsic dimension. That is, they lie on a -dimensional manifold where . The goal is then, given a large number of manifold examples, to learn the parameters of to allow for nonlinear dimensionality reduction (reducing just to the few parameters of interest). This idea, and the first successful algorithm (ISOMAP) to learn a manifold from sampled points is due to [TdSL00]. Going back to the example of human faces, [TdSL00] shows that different images of a human face can be well-represented by a 3-dimensional manifold, where the parameters are brightness, left-right angle of rotation, and up-down angle of rotation. Since [TdSL00], several more algorithms have been developed to handle more general classes of manifolds than ISOMAP [RS00, DG13].
Baraniuk and Wakin [BW09] proposed using dimensionality-reducing maps to first map to and then learn the parameters of interest in the reduced space (for improved speed). Sharper analyses were later given in [Cla08, EW13, Dir14]. Of interest are both that (1) any curve in should have length approximately preserved in , and (2) should be a manifold embedding, in the sense that all curves should satisfy that the preimage of (in ) is a curve in . Then by (1) and (2), geodesic distances are preserved by .
To be concrete, let be a -dimensional manifold obtained as the image , for smooth ( is the unit ball of ). We assume (where denotes that both and ), and that the map sending to the tangent plane at , , is Lipschitz from to . Here is geodesic distance on , and is the Finsler distance, where is the orthogonal projection onto .
We want satisfying for all curves . Here is curve length. To obtain this, it suffices that satisfy Eq.(1.2) for [Dir14], an infinite union of subspaces. Using this observation, [Dir14] showed this property is satisfied for with a dense matrix of subgaussian entries. For as given above, preservation of geodesic distances is also satisfied for this .
Our main theorem implies that to preserve curve lengths one can set for , where is the largest incoherence of any tangent space for . That is, non-trivial sparsity with is possible for any . Furthermore, we show that this is optimal by constructing a manifold with maximum incoherence of a tangent space approximately such that any linear map preserving curve lengths with must have (see Remark 20). We also show that is a manifold embedding with large probability if the weaker condition is satisfied. Combining these observations, we see that for the SJLT to preserve geodesic distances it suffices to set and .
|set to preserve||our||our||previous||previous||ref|
|tangent cone for Lasso||same as here||[PW14]|
|infinite||see appendix||see appendix||similar to||[Dir14]|
|a smooth manifold||[Dir14]|
As seen above, not only does our answer to Question 2 qualitatively explain all known results, but it gives new results not known before with implications in numerical linear algebra, compressed sensing (model-based, and with incoherent dictionaries), constrained least squares, and manifold learning. We also believe it is possible for future work to sharpen our analyses to give asymptotically correct parameters for all the applications; see the discussion in Section 9.
We now end the introduction with an outline for the remainder. Section 2 defines the notation for the rest of the paper. Section 3 provides an overview of the proof of our main theorem, Theorem 3. Section 4 is a warmup that applies a subset of the proof ideas for Theorem 3 to the special case where , a linear subspace of dimension . In fact the proof of our main theorem reduces the case of general to several linear subspaces of varying dimensions, and thus this case serves as a useful warmup. Section 5 applies a different subset of our proof ideas to the special case where the norm defined by has a small type- constant, which is relevant for analyzing the FFT-based approaches to RIP matrices. Section 6 shows how similar ideas can be applied to constrained least squares problems, such as Lasso. Section 7 states and proves our most general theorem for arbitrary . Section 8 shows how to apply Theorem 3 to obtain good bounds for various , albeit losing a factor in and a factor in as mentioned above. Finally, Section 9 discusses avenues for future work.
In the appendix, in Appendix A for the benefit of the reader we review many probabilistic tools that are used throughout this work . Appendix B reviews some introductory material related to convex analysis, which is helpful for understanding Section 6 on constrained least squares problems. In Appendix C we also give a direct analysis for using the FJLT for sketching constrained least squares programs, providing quantitative benefits over some analyses in [PW14].
We fix some notation that we will be used throughout this paper. For a positive integer , we set . For , denotes for some universal constant , and signifies that both and hold. For any and , we let denote its -norm. To any set we associate a semi-norm defined by . Note that , where is the closed convex hull of , i.e., the closure of the set of all convex combinations of elements in . We use and to denote the standard basis in and , respectively.
is a sequence of random variables, we letdenote the probability space on which it is defined. We use and to denote the associated expected value and -space, respectively. If is another sequence of random variables, then means that we first take the -norm and afterwards the -norm. We reserve the symbol to denote a sequence of i.i.d. standard Gaussian random variables; unless stated otherwise, the covariance matrix is the identity.
If is an matrix, then we use or to denote its operator norm. Moreover we let be the trace operator and use to denote the Frobenius norm.
In the remainder, we reserve the letter to denote (semi-)metrics. If corresponds to a (semi-)norm , then we let denote the associated (semi-)metric. Also, we use to denote the diameter of a set with respect to and write instead of for brevity. So, for example, is the Euclidean metric and the -diameter of . From here on, is always a fixed subset of , and the parameter appearing in (1.3).
We make use of chaining results in the remainder, so we define some relevant quantities. For a bounded set , is the Gaussian mean width of , where is a Gaussian vector with identity covariance matrix. For a (semi-)metric on , Talagrand’s -functional is defined by
where is the distance from to , and the infimum is taken over all collections , , with . If corresponds to a (semi-)norm , then we usually write instead of . It is known that for any bounded , and differ multiplicatively by at most a universal constant [Fer75, Tal05]. Whenever appears without a specified norm, we imply use of or
operator norm. We frequently use the entropy integral estimate (see[Tal05])
Here denotes the minimum number of -balls of radius centered at points in required to cover . If corresponds to a (semi-)norm , then we write instead of .
Let us now introduce the sparse Johnson-Lindenstrauss transform in detail. Let be independent Rademacher random variables, i.e., . We consider random variables with the following properties:
For fixed the are negatively correlated, i.e.
For any fixed there are exactly nonzero , i.e., ;
The vectors are independent across different .
We emphasize that the and are independent, as they are defined on different probability spaces. The sparse Johnson-Lindenstrauss transform with column sparsity , or SJLT for short, is defined by
The work [KN14] gives two implementations of such a satisfying the above conditions. In one example, the columns are independent, and in each column we choose exactly locations uniformly at random, without replacement, to specify the . The other example is essentially the CountSketch of [CCFC04]. In this implementation, the rows of are partitioned arbitrarily into groups of size each. Then each column of is chosen independently, where in each column and for each block we pick a random row in that block to be non-zero for that column (the rest are zero).
We say that is an -restricted isometry on if Eq.(1.2) holds. We define
as the restricted isometry constant of on . In the following we will be interested in estimating . For this purpose, we use the following -bound for the supremum of a second order Rademacher chaos from [KMR14] (see [Dir15, Theorem 6.5] for the refinement stated here).
Let and let be independent Rademacher random variables. For any
For and , let be the operator . Then, for any we can write , where
for . Note that for all and therefore
Associated with we define a random norm on by
With this definition, we have for any ,
Therefore, (4) implies that
Taking -norms on both sides yields
Thus, to find a bound for the expected value of the restricted isometry constant of on , it suffices to estimate and . If we can find good bounds on and for all , then we in addition obtain a high probability bound for the restricted isometry constant.
Unless explicitly stated otherwise, from here on always denotes the SJLT with non-zeroes per column.
3. Overview of proof of main theorem
Here we give an overview of the proof of Theorem 3. To illustrate the ideas going into our proof, it is simplest to first consider the case where is the set of all unit vectors in a -dimensional linear subspace . By Eq.(2.10) for we have to bound for example . Standard estimates give, up to ,
for a Gaussian vector . From this point onward one can arrive at a result using the non-commutative Khintchine inequality [LP86, LPP91] and other standard Gaussian concentration arguments (see appendix).
Unfortunately, Eq.(3.2) is very specific to unit balls of linear subspaces and has no analog for a general set . At this point we use a statement about the duality of entropy numbers [BPSTJ89, Proposition 4]. This is the principle that for two symmetric convex bodies and , and are roughly comparable for some constant ( is the number of translates of needed to cover ). Although it has been an open conjecture for over 40 years as to whether this holds in the general case [Pie72, p. 38], the work [BPSTJ89, Proposition 4] shows that these quantities are comparable up to logarithmic factors as well as a factor depending on the type- constant of the norm defined by (i.e. the norm whose unit vectors are those on the boundary of ). In our case, this lets us relate with , losing small factors. Here is the convex hull of , and is the unit ball of . We next use Maurey’s lemma, which is a tool for bounding covering numbers of the set of convex combinations of vectors in various spaces. This lets us relate to quantities of the form , where has size (see Lemma 18). For a fixed , we bucket according to and define