A vector is said to be -sparse if it has at most nonzero coordinates. Sparsity is a structure of wide applicability, with a broad literature dedicated to its study in various scientific fields (see, e.g., (Foucart and Rauhut, 2017; Eldar and Kutyniok, 2012)). Given an , an matrix (typically with ) is said to satisfy the -Restricted Isometry Property (RIP) (Candes and Tao, 2005) if it approximately preserves the Euclidean norm in the following sense: for every -sparse vector, we have
RIP is a fundamental property of a matrix that enables recovery of a sparse high-dimensional signal from its compressed measurement. Given this, matrices satisfying the restricted isometry property have found many interesting applications in high-dimensional statistics, machine learning, and compressed sensing (Foucart and Rauhut, 2017; Eldar and Kutyniok, 2012; Wainwright, 2019). Restricted isometry is also closely related to other matrix properties such as restricted nullspace, restricted eigenvalue, and pairwise incoherence (Wainwright, 2019).
Various probabilistic models are known to generate random matrices that satisfy the restricted isometry property with a value of which is (almost) linear . For example, generating entries of i.i.d. from a common distribution (like satisfying subgaussianity) and then normalizing the columns of to unit norm, guarantees RIP with high probability provided (Baraniuk et al., 2008). We refer the reader to (Vershynin, 2010) for additional references to the probabilistic RIP literature. The restricted isometry property also holds for a rich class of structured random matrices, where usually the best known bounds for have additional log factors in (Foucart and Rauhut, 2017). The use of randomness still remains pivotal for near-optimal results.
At the same time, verifying whether a given matrix satisfies the restricted isometry property is tricky, as the problem of certifying RIP of a matrix in the worst case is NP-hard (Bandeira et al., 2013a). Furthermore, even determining the RIP value up to a certain approximation factor is hard in the average-case sense, as shown by (Wang et al., 2016) using a reduction from the Planted Clique Problem. While there are constructions of RIP matrices when , most methods however break down when is at least some constant times , see (Bandeira et al., 2013b) for a survey of deterministic methods. The best unconditional explicit construction to date is due to Bourgain et al. (2011) which gives a RIP guarantee for for some unspecified small constant , in the regime and with matrix containing complex valued entries. Gamarnik (2018) recently showed why explicit RIP matrix construction is a “hard” challenge, by connecting it to a question in the field of extremal combinatorics which has been open for decades.
So on the one hand, while it is easy to generate matrices satisfying the restricted isometry property through random designs, designing natural families of deterministic matrices satisfying RIP seems to run into hard barriers. A natural research direction, which we address in this paper, is to bridge the gap between these two worlds. We start with this simple question: can we incorporate a fixed (deterministic) matrix while constructing a family of matrices satisfying the restricted isometry condition?
Our Contributions. We establish the restricted isometry property for a wide class of matrices, which can be factorized through (possibly non-i.i.d.) random matrices. In particular, we will be interested in the class of matrices which have a -factorization, where is a fixed (deterministic) matrix and is a random matrix. The -model (product of a deterministic and random matrix) is a common model in random matrix theory, and in the context of RIP it captures a variety of applications some of which we discuss later. The main challenge in establishing RIP comes from the fact that the entries in the matrix could be highly correlated, even if the entries in are independent.
Our main result is that if we start with any deterministic matrix , satisfying a very mild, easy-to-check condition, the matrix satisfies RIP (with high probability) for a constructed from a variety of popular probabilistic models. All we need is that the stable rank (or numerical rank) of is not “too small”. The stable rank of a matrix (denoted by ), defined as the squared ratio of the Frobenius and spectral norms of , is a commonly used robust surrogate for the usual matrix rank in linear algebra. The stable rank of a matrix is at most its usual rank. Computing the stable rank of a matrix is a polynomial-time operation, meaning that given the factorization one could easily verify whether the required condition on is satisfied. We investigate many common constructions of the random matrix : [Footnote 1: All the results are high probability statements, and for simplicity we omit the dependence on certain parameters such as , -norm of subgaussian vectors, etc.]
Columns of are independent subgaussian random vectors: In this setting, we obtain RIP on the matrix, if the stable rank of satisfies (see Theorem 3.1). Note that this setting includes the (standard random matrix) case where is an i.i.d. subgaussian random matrix. The dependence on in this stable rank condition cannot be improved in general.
Generated by -wise independent distributions: We ask: can we reduce the amount of randomness in ? We answer in the affirmative by showing that one can achieve restricted isometry with the same condition on as above, by only requiring that be generated from a -wise independent distribution (see Theorem 3.3).
Sparse-subgaussian matrix: Sparsity is a desirable property in as it leads to computational speedup when working with sparse signals. We use the standard Bernoulli-Subgaussian process for generating a sparse matrix , where is defined as the Hadamard matrix product of an i.i.d. Bernoulli random matrix and an i.i.d. subgaussian random matrix. In this case, we get RIP if the stable rank of satisfies, , where is the Bernoulli parameter (see Theorem 3.4).
Columns of are independent vectors satisfying convex concentration: Our result holds for distributions that satisfy the so-called convex concentration property. The convex concentration property of a random vector was first observed by Talagrand, who proved it for the uniform measure on the discrete cube and for general product measures with bounded support (Talagrand, 1988, 1995). Vectors satisfying convex concentration are regularly used in statistical analysis, as this class includes random vectors drawn from a centered multivariate Gaussian distribution with arbitrary covariance matrix, random vectors uniformly distributed on the sphere, and random vectors satisfying the logarithmic Sobolev inequality, among many others (Ledoux, 2001). Ignoring the dependence on the convex concentration constant, again we get RIP on matrix if the stable rank of satisfies, (see Theorem 3.5).
-way column Hadamard-product construction: Motivated by an application in understanding the effectiveness of word vector embeddings (also referred to as just word embeddings) in linear classification tasks, we investigate a correlated random matrix, formed by taking all possible (disregarding order) -way entrywise products [Footnote 2: The entrywise product of vectors is the vector with entry equaling .] of the columns of an i.i.d. centered bounded random matrix (see Definition 5). Let denote this constructed matrix starting from (with ). We establish RIP on for if (see Theorem 4.1), and for if (see Theorem 4.3). Notice that the dependence on on the sparsity parameter is worse than in the previous cases (however, the value of typically used in the motivating application is a small constant). We conjecture that these bounds can be improved to have a better dependence on .
Our proofs rely on the Hanson-Wright inequality, which provides a large deviation bound for quadratic forms of i.i.d. subgaussian random variables (Rudelson and Vershynin, 2013), along with its recent extensions (Zhou, 2015; Adamczak et al., 2015). While the proof is simple when the columns of are independent subgaussian random vectors, various challenges arise when dealing with other models of the random matrix. One general idea is to obtain a concentration bound on , where is a fixed matrix, is some random matrix, and is a fixed -sparse vector, and then use a net argument over sparse vectors on the sphere. In the case of column Hadamard-product, getting this concentration bound is tricky as it involves analyzing some high order homogeneous chaos in terms of the random variables, and we use a different idea based on bounding the -norm. Throughout the proofs the challenge comes in dealing with dependences that arise both in the random matrix, and in the product matrix.
Applications. The restricted isometry property is widely utilized in the compressed sensing and statistical learning literature. Here we mention a few interesting applications of our restricted isometry results.
Effectiveness of Word/Sequence Embeddings. Consider a vocabulary set (say, all words in a particular language). Word embeddings, which associate with each word in a vocabulary set a vector representation , are a basic building block in Natural Language Processing pipelines and algorithms. Word embeddings were recently popularized via embedding schemes such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Word embeddings pretrained on large sets of raw text have demonstrated remarkable success when used as features to a supervised learner in various applications ranging from question-answering (Zhou et al., 2015) to sentiment classification of text documents (Maas et al., 2011). The general intuition behind word embeddings is that they are known to capture the “similarities” between words. There has also been recent work on creating representations for word sequences such as phrases or sentences, with methods ranging from simple composition of the word vectors to sophisticated architectures such as recurrent neural networks (see, e.g., (Arora et al., 2017) and references therein).
Understanding the theoretical properties of these word embeddings is an area of active interest. In a recent result, Arora et al. (2018a) introduced the scheme of Distributed Cooccurrence (DisC) embedding (Definition 7) for a word sequence that produces a compressed representation of a Bag-of--cooccurrences vector. [Footnote 3: An -cooccurrence is a set of words. Bag-of--cooccurrences for a word sequence counts the number of times any possible -cooccurrence (from the vocabulary set ) for appears in the word sequence. See Definition 6.] Linear classification models are (empirically) known to perform well over these simple representations (Wang and Manning, 2012; Arora et al., 2018a). Arora et al. (2018a) showed that if one uses i.i.d. -random vectors as word embeddings (instead of pretrained vectors), then a generalized linear model classifier trained on these compressed DisC embeddings performs as well on classification tasks (up to a small error) as the best classifier trained on the original Bag-of--cooccurrences vectors. This was the first result that provided provable quantification of the power of any text embedding. [Footnote 4: Arora et al. (2018a) have additional results on the power of low-memory LSTM (Long Short Term Memory) network embeddings, showing that LSTMs can simulate these DisC embeddings; we do not focus on these results here.] They achieve this result by connecting this problem with the theory of compressed sensing, an idea that we build upon here. Let be a matrix whose columns are the embeddings for all the words in . The result of Arora et al. (2018a) relies on establishing the restricted isometry property on the matrix where is an i.i.d. -random matrix (i.e., random vectors are used as word embeddings).
Linear transformations are regularly used to transfer between different embeddings or to adapt to a new domain (Bollegala et al., 2017; Arora et al., 2018b). The linear transformation can encode contextual information, an idea utilized recently by Khodak et al. (2018), who applied a linear transformation on the DisC embedding scheme to construct a new embedding scheme (referred to as à la carte embedding), and empirically showed that it outperforms many other popular word sequence embedding schemes. Akin to the result of Arora et al. (2018a) on DisC embeddings, our results give some theoretical insight into the performance of linear transformations of DisC embeddings. In particular, our RIP results on , where is an i.i.d. centered bounded random matrix, provide provable performance guarantees on linear transformation (defined by ) of DisC embeddings in a linear classification setting, under a stable rank assumption on (see Corollary 4.5). We expand on this application in Section 4.1.
Linear Transformation and Johnson-Lindenstrauss Embedding. The Johnson-Lindenstrauss (JL) embedding lemma states that any set of points in high dimensional Euclidean space can be embedded into dimensions, without distorting the distance between any two points by more than a factor between and (Johnson and Lindenstrauss, 1984). The JL lemma has become a valuable tool for dimensionality reduction (Woodruff, 2014). Typically, the embedding is constructed through a random linear map referred to as an -JL matrix. Let be an -JL matrix. Consider a set of points , and let be their low-dimensional embedding. A natural question to ask is: Under what fixed linear transformations of does this distance preservation property still hold? More concretely, let be a linear transformation. Does the JL distance preservation property still hold for ? Our results answer this question because of the close connection between JL-embedding and the restricted isometry property. Krahmer and Ward (2011) showed that any -RIP matrix, once its columns are multiplied by independent random signs, becomes an -JL matrix for a fixed set of vectors with probability at least . [Footnote 5: The connection in the other direction, going from JL-embedding to RIP, is also well-known (Baraniuk et al., 2008).] This means that one can now use our RIP results discussed above. For example, if the entries of are drawn i.i.d. from a centered symmetric subgaussian distribution, then for a (scaled to have unit Frobenius norm), if [Footnote 6: The notation hides polylog factors in .], we have that satisfies the JL-embedding property with high probability (as multiplying with random signs does not change the distribution ).
Compressed Sensing under Linear Transformations. Compressed sensing algorithms are designed to recover approximately sparse signals, and a popular sufficient condition for a matrix to succeed for the purposes of compressed sensing is given by the restricted isometry property (see, e.g., Theorem B.3). Another question that can be addressed using our results is whether the recovery guarantee still holds if we apply a linear transformation to the compressed signals. Our RIP results establish conditions on , for the class of random matrices described earlier, under which given where is a noise vector, one can recover an approximately sparse vector .
Source Separation. Separation of underdetermined mixtures is an important problem in signal processing. Sparsity has often been exploited for source separation. The general goal in source separation is, given a signal matrix , a mixing matrix , and a dictionary , to find a matrix of coefficients such that and is as sparse as possible. The dictionary is generally overcomplete, i.e., . The connection between source separation and compressed sensing was first noted by Blumensath and Davies (2007). [Footnote 7: In a variant of this problem, called blind source separation, even the mixing matrix is assumed to be unknown.] It is easy to recast (see Appendix A for details) the source separation problem as a compressed sensing problem of the form:
, where the goal is to estimate the sparse (entries of the matrix ). Here, , , , and , with , , . Hence, our RIP results establish conditions on the mixing matrix , for the classes of random matrices (dictionaries) described earlier, under which we can get recovery guarantees on (entries of the matrix ).
Singular Values of Correlated Random Matrices. The next application is a simple consequence of the restricted isometry property. Understanding the singular values of random matrices is an important problem, with many applications in machine learning and in the non-asymptotic theory of random matrices (Vershynin, 2016). The restricted isometry property of can be interpreted in terms of the extreme singular values of submatrices of . Indeed, the restricted isometry property (assuming for simplicity that has unit Frobenius norm [Footnote 8: Otherwise results can be appropriately scaled.]) equivalently states that the inequality
holds for all submatrices , those formed by the columns of indexed by sets of size . If the columns of are independent random vectors drawn from distribution over , we can think of our restricted isometry results as bounds on the singular values of matrices formed from by linear transformation of random vectors (or equivalently, the eigenvalues of the sample covariance matrix drawn from the distribution ). For example, let be a matrix whose columns are independent random vectors from a subgaussian distribution or satisfying convex concentration property, then our RIP results through (1) bound all the singular values on (with high probability) under the condition that . Previously, only a spectral norm bound on was known, for the cases where has independent entries drawn from subgaussian or heavy-tailed distributions (Rudelson and Vershynin, 2013; Vershynin, 2011).
Notation. We use to denote the set . We denote the Hadamard (elementwise) product by . For a set , denotes its complement set.
Vectors used in the paper are by default column vectors and are denoted by boldface letters. For a vector , denotes its transpose, and denote the and - norm respectively, and its support. denote the standard basis with th coordinate . For a matrix , denotes its spectral norm, denotes its Frobenius norm, and denotes its th entry, denotes the diagonal of , and denotes the off-diagonal of .
represents the identity matrix in dimension. For a vector and set of indices , let be the vector formed by the entries in whose indices are in , and similarly, is the matrix formed by columns of whose indices are in .
The Euclidean sphere in centered at origin is denoted by . We call a vector , -sparse, if it has at most non-zero entries. Denote by the set of all vectors with support size at most : . Stable rank (denoted by ) of a matrix is defined as:
Stable rank cannot exceed the usual rank. The stable rank is a more robust notion than the usual rank because it is largely unaffected by tiny singular values.
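As a quick illustration of why the stable rank condition is cheap to check, the following is a minimal sketch that computes the squared ratio of the Frobenius and spectral norms with NumPy. The function name `stable_rank` and the two example matrices are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def stable_rank(A):
    """Squared ratio of the Frobenius norm to the spectral norm of A."""
    A = np.asarray(A, dtype=float)
    fro_sq = np.linalg.norm(A, "fro") ** 2   # sum of squared singular values
    spec_sq = np.linalg.norm(A, 2) ** 2      # largest squared singular value
    return fro_sq / spec_sq

# Hypothetical inputs: the identity has full stable rank, while a tiny
# perturbation of a rank-one matrix keeps a stable rank close to 1 even
# though its usual rank jumps to 4.
identity = np.eye(4)
near_rank_one = np.outer(np.ones(4), np.ones(4)) + 1e-6 * np.eye(4)
print(stable_rank(identity))
print(stable_rank(near_rank_one), np.linalg.matrix_rank(near_rank_one))
```

The second example is meant to show the robustness mentioned above: tiny singular values change the usual rank but barely move the stable rank.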
Throughout this paper , also with subscripts, denote positive absolute constants, whose value may change from line to line. In Appendix B we discuss additional preliminaries about subgaussian/subexponential random variables, sparse recovery, and Hanson-Wright inequality.
Restricted Isometry. Candes and Tao (2005) introduced the following isometry condition on matrices . It is perhaps one of the most popular properties of a matrix that make it “good” for compressed sensing.
Let be an matrix with real entries. Let and let be an integer. We say that satisfies -RIP if for every -sparse vector (i.e., )
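For reference, the condition in this definition is usually written as follows. The symbols $A$ (the matrix), $d$ (its number of columns), $k$ (the sparsity level), and $\delta$ (the isometry parameter) are names assumed here for illustration; this is the standard form from the compressed sensing literature, not a verbatim restatement.

```latex
(1-\delta)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1+\delta)\,\|x\|_2^2
\qquad \text{for all } x \in \mathbb{R}^d \text{ with } \|x\|_0 \le k .
```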
Thus, acts almost as an isometry when we restrict attention to -sparse vectors. Geometrically, the restricted isometry property guarantees that the geometry of -sparse vectors is well preserved by the measurement matrix . It turns out that in this case, given a (noisy) compressed measurement , for approximately sparse , one can recover using a simple convex program. Theorem B.3 provides a bound on the worst-case recovery performance for uniformly bounded noise. Similar recovery results using RIP on the measurement matrix are also well-known under other interesting settings (we refer the reader to the survey on this topic in (Eldar and Kutyniok, 2012)).
We investigate the restricted isometry property of matrices that can be factored as , where will be some random matrix and is a fixed matrix. Now RIP is not invariant under scaling, i.e., given a RIP matrix , changing to some for some scales both the left and right hand sides of the inequality in Definition 1 by . So while working in the -model we have to adjust for the scaling introduced by , and we use a generalization of Definition 1 appropriate for this setting.
Definition 2 (RIP Condition).
A matrix satisfies -RIP if for every -sparse vector (i.e., )
This scaling by is unavoidable, because even if is a matrix with centered uncorrelated entries of unit variance, then for any with . Note that it is easy to reconstruct standard RIP scenarios by choosing appropriately. For example, if contains i.i.d. subgaussian random variables with zero mean and unit variance, then setting , leads to the standard setting of and .
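The role of the Frobenius-norm scaling can be explored numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes a hypothetical fixed matrix `X`, a Gaussian `R`, and random sparse unit vectors `u`, and reports the largest observed deviation of the ratio of the squared norm of `X R u` to the squared Frobenius norm of `X` from 1. Since it only samples sparse directions, it gives a lower bound on the true distortion and is not a certificate of RIP.

```python
import numpy as np

def sampled_distortion(X, R, k, trials, rng):
    """Largest observed | ||X R u||^2 / ||X||_F^2 - 1 | over random k-sparse unit u."""
    XR = X @ R
    fro_sq = np.linalg.norm(X, "fro") ** 2
    d = R.shape[1]
    worst = 0.0
    for _ in range(trials):
        support = rng.choice(d, size=k, replace=False)  # random support of size k
        u = np.zeros(d)
        u[support] = rng.standard_normal(k)
        u /= np.linalg.norm(u)                          # unit Euclidean norm
        ratio = np.linalg.norm(XR @ u) ** 2 / fro_sq
        worst = max(worst, abs(ratio - 1.0))
    return worst

rng = np.random.default_rng(0)
n, d, k = 1000, 300, 5                      # illustrative sizes (assumed)
X = np.diag(np.linspace(1.0, 2.0, n))       # hypothetical fixed matrix with large stable rank
R = rng.standard_normal((n, d))             # i.i.d. centered, unit-variance entries
print(sampled_distortion(X, R, k, trials=200, rng=rng))
```

Dividing by the squared Frobenius norm of `X` is what keeps the ratio near 1; without it the ratio would scale with `X`.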
Hanson-Wright Inequality. An important concentration tool used in this paper is the Hanson-Wright inequality (Theorem 2.1) that investigates concentration of a quadratic form of independent centered subgaussian random variables, and its recent extensions (Zhou, 2015; Adamczak et al., 2015). A slightly weaker version of this inequality was first proved in (Hanson and Wright, 1971).
Theorem 2.1 (Standard Hanson-Wright Inequality (Rudelson and Vershynin, 2013)).
Let be a random vector with independent components which satisfy and . Let be an matrix. Then for every ,
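For reference, the conclusion of this inequality as stated by Rudelson and Vershynin (2013) has the following form. The symbol names are assumed here for illustration: $X = (X_1, \dots, X_n)$ is the random vector with $\mathbb{E} X_i = 0$ and $\|X_i\|_{\psi_2} \le K$, $A$ is the fixed $n \times n$ matrix, $t \ge 0$, and $c > 0$ is an absolute constant.

```latex
\Pr\Big\{ \big| X^{\top} A X - \mathbb{E}\, X^{\top} A X \big| > t \Big\}
\;\le\; 2 \exp\!\left[ -c \,\min\!\left( \frac{t^{2}}{K^{4}\,\|A\|_{F}^{2}},\;
\frac{t}{K^{2}\,\|A\|} \right) \right].
```

The two terms inside the minimum correspond to a subgaussian regime for small $t$, governed by the Frobenius norm, and a subexponential regime for large $t$, governed by the spectral norm.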
3 Restricted Isometry of with “Random”
In this section, we investigate the restricted isometry property for the class of matrices, for various classes of random matrices . Missing details from this section are collected in Appendix C.
As a warmup, we start with the simplest case where is a centered i.i.d. subgaussian random matrix, and build on this result in the following subsections, where we consider various other general families of such as those constructed using low-randomness, with sparsity structure, or satisfying a convex concentration condition.
Theorem 3.1 presents the result in case is an i.i.d. subgaussian random matrix. The proof idea here is quite simple, but provides a framework that will be helpful later. Under the stable rank condition, a net argument, along with Hanson-Wright inequality implies, , where is the submatrix of with columns from the set . Also, by the same Hanson-Wright, for any , . Using a bound on the net size for sparse vectors, gives all the ingredients for the following theorem.
Let be an matrix. Let be a matrix with independent entries such that , , and . Let , and let be a number satisfying . Then with probability at least , the matrix satisfies -RIP, i.e.,
Note that under the assumption on the stable rank, the probability is at least .
Notice that there is no direct condition on , except that comes through the stable rank assumption on , as . Indeed one should expect this to happen. For example, if we take a matrix and add a bunch of zero rows to it this would increase , but should not change the recovery properties. This suggests that we need a notion of a “true” dimension of the range of , and the stable rank is the one.
This assumption on the stable rank is optimal up to constant factors. For example, when is the identity matrix and is a standard Gaussian matrix, then , and therefore the stable rank condition just becomes . We know that up to constant, this dependence of on and in is optimal (Foucart et al., 2010). This shows that the lower bound on the stable rank in Theorem 3.1 cannot be improved in general.
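The identity example can be checked in one line. The symbols $I_n$ for the $n \times n$ identity matrix and $\mathrm{sr}$ for the stable rank are names assumed here for illustration.

```latex
\|I_n\|_F^2 = n, \qquad \|I_n\| = 1
\quad\Longrightarrow\quad
\mathrm{sr}(I_n) = \frac{\|I_n\|_F^2}{\|I_n\|^2} = n .
```

So for the identity, a lower bound on the stable rank is simply a lower bound on the number of rows.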
In the next few subsections, we investigate other popular families of random matrices, thereby extending the result in Theorem 3.1 in various directions.
Independent Subgaussian Columns: Consider the case where the columns of the matrix are independently drawn isotropic subgaussian random vectors. In this case the proof proceeds as in Theorem 3.1, by applying in this case an extension of Hanson-Wright inequality (Theorem 2.1) to subgaussian random vectors (Vershynin, 2016), and by noting that for any fixed vector , will be an isotropic subgaussian random vector as it is a linear combination of columns of which are all independent.
Low Randomness: Optimal use of randomness is an important consideration when designing the matrix . In the dimensionality reduction literature much attention has been given to obtaining explicit constructions of minimizing the number of random bits used (see, e.g., (Kane et al., 2011) and references therein). For a fixed , we show that as long as satisfies -wise independence, then satisfies RIP (see Theorem 3.3) under the same (up to constant) stable rank condition on as in Theorem 3.1. So in effect, one can reduce the number of random bits from (in Theorem 3.1) to .
Sparse-Subgaussian: Sparsity is a desirable property in the measurement matrix because it leads to faster computation of matrix-vector products. For example, if is drawn from a distribution over matrices having at most non-zeroes per column, then can be computed in time . We use the sparse-subgaussian model of random matrices.
Under Convex Concentration: Convex concentration property is a generalization of standard concentration property (such as Gaussian concentration) by requiring concentration to hold only for 1-Lipschitz convex functions.
Definition 3 (Convex Concentration Property).
Let be a random vector in . We will say that has the convex concentration property (c.c.p) with constant if for every -Lipschitz convex function , we have and for every ,
The class of distributions satisfying c.c.p is extremely broad (Ledoux, 2001). Some examples include: (a) Gaussian random vectors drawn from have c.c.p with , (b) random vectors that are uniformly distributed on the sphere have c.c.p with constant , (c) subclass of logarithmically concave random vectors, (d) random vectors with possibly dependent entries which satisfy a Dobrushin type condition, and (e) random vectors satisfying the logarithmic Sobolev inequality.
We remark that while the classes of matrices generated by the distributions listed above are not completely disjoint, none of them is completely subsumed by another.
3.1 Restricted Isometry of with Low Randomness
In this section, we operate under a weaker randomness assumption on . In particular, we will use the notion of -wise independence to capture low randomness to construct . When truly random bits are costly to generate or supplying them in advance requires too much space, the standard idea is to use -wise independence (see Definition 12) which allows one to maintain a succinct data structure for storing the random bits. For simplicity, we will work with Rademacher random variables (random signs). Constructing -wise independent random signs from truly independent random signs using simple families of hash functions is a well-known idea (Motwani and Raghavan, 1995).
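One standard way to realize such a hash-based construction, sketched here under stated assumptions, is to evaluate a random polynomial of degree less than `t` over a finite field of characteristic two and keep one bit of each value; distinct evaluation points then give `t`-wise independent uniform signs. The field GF(2^3) with modulus x^3 + x + 1, the name `t` for the independence order, and the helper names are all choices made for illustration; the paper's Definition 12 and its choice of hash family may differ.

```python
import random

def gf_mul(a, b, m=3, modulus=0b1011):
    """Multiply a and b in GF(2^m), reducing by an irreducible polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a >> m:               # degree reached m: reduce
            a ^= modulus
    return result

def kwise_signs(coeffs, num_points, m=3, modulus=0b1011):
    """Signs from evaluating the polynomial with the given coefficients at 0..num_points-1.

    With len(coeffs) = t uniform coefficients, any t of the signs are independent.
    """
    if num_points > 2 ** m:
        raise ValueError("need distinct field elements as evaluation points")
    signs = []
    for x in range(num_points):
        acc = 0
        for c in reversed(coeffs):           # Horner's rule
            acc = gf_mul(acc, x, m, modulus) ^ c
        signs.append(1 - 2 * (acc & 1))      # lowest bit 0 -> +1, 1 -> -1
    return signs

seed_rng = random.Random(0)
t = 3                                               # independence order (assumed)
coeffs = [seed_rng.randrange(8) for _ in range(t)]  # the only random input
print(kwise_signs(coeffs, num_points=8))
```

Only the `t` coefficients need to be stored, which is the source of the memory saving compared with storing fully independent signs.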
-wise independence. Our general strategy in this proof will be to rely on higher moments where we can treat certain variables as independent. We start with a matrix with i.i.d. entries, and establish an bound on . Using Markov’s inequality for higher moments and a union bound gives our first result, . We then derive a concentration bound for for a fixed . For this, we investigate certain higher moments of using moment generating functions. The result then follows using a net argument over the set of sparse vectors on the sphere.
Let be an matrix. Let be a matrix whose entries are -wise independent -random variables. Let , and let be a number satisfying . Then with probability at least , the matrix satisfies -RIP, i.e.,
Comparing Theorems 3.3 and 3.1. While the assumption on the stable rank does not change (by more than a constant) between these two theorems, we have a drastic reduction in the number of random bits from (in Theorem 3.1) to (in Theorem 3.3). Storing the matrix when is an i.i.d. random matrix takes words of memory, while if is -wise independent storing only requires words of memory.
3.2 Restricted Isometry of with Sparse Random
In this section, we investigate the restricted isometry property when is a sparse random matrix. We use the following popular probabilistic model for our sparse random matrices (Spielman et al., 2012; Luh and Vu, 2016; Wang and Chi, 2016).
Definition 4 (Bernoulli-Subgaussian Sparse Random Matrix).
We say that satisfies the Bernoulli-Subgaussian model with parameter if , where is an i.i.d. Bernoulli matrix where each entry is independently with probability , and is an random matrix with independent entries that satisfy: , , and , and denotes the Hadamard product.
We note that the sparsity of is manipulated by the Bernoulli distribution, and the non-zero entries of obey the subgaussian distribution, thereby facilitating a very general model of the sparse matrix.
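A minimal sketch of sampling from this model follows. It assumes standard Gaussian entries as one convenient subgaussian choice, and the function name `bernoulli_subgaussian` and parameter name `beta` are hypothetical.

```python
import numpy as np

def bernoulli_subgaussian(n, d, beta, rng):
    """Sample the entrywise product of a Bernoulli(beta) mask and a Gaussian matrix."""
    B = rng.random((n, d)) < beta       # each entry is True with probability beta
    G = rng.standard_normal((n, d))     # assumed subgaussian choice: N(0, 1)
    return B * G                        # Hadamard (entrywise) product

rng = np.random.default_rng(0)
R = bernoulli_subgaussian(200, 200, beta=0.1, rng=rng)
print(R.shape, np.count_nonzero(R) / R.size)   # fraction of non-zeros is close to beta
```

In practice one would likely store the result in a sparse format to benefit from the faster matrix-vector products; the dense array here keeps the sketch short.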
Our results in this section will rely on the sparse Hanson-Wright inequality from Zhou (2015) (Theorem C.5). Similar to Hanson-Wright inequality (Theorem 2.1), sparse Hanson-Wright inequality provides a large deviation bound for a quadratic form. However, in this case, the quadratic form is sparse, and is of the form where is an random vector with independent subgaussian components and contains independent Bernoulli random variables.
and a bound on spectral norm of submatrices of (Lemma C.8)
These results, in conjunction with a net argument, provide the recipe for the following result.
Let be an matrix. Let be a matrix, where is an i.i.d. Bernoulli matrix where each entry is independently with probability , and is an random matrix with independent entries that satisfy: , , and . Let . Let , and let be a number satisfying . Then with probability at least , the matrix satisfies -RIP, i.e.,
3.3 Restricted Isometry of under Convex Concentration Property on
We investigate the case where is composed of independent columns satisfying the convex concentration property. In this case, we utilize the recent result of Adamczak et al. (2015), who proved the Hanson-Wright inequality for isotropic random vectors having the convex concentration property (Theorem C.10). Our first result here is to show that a linear combination of independent vectors having a convex concentration property would have this property as well. To this end, we start with a fixed and construct a martingale with variables where is a 1-Lipschitz convex function and is the th column in , and , and then apply Azuma’s inequality to it. Once we have this established, verifying RIP consists of applying the Hanson-Wright inequality of Adamczak et al. (2015) in a proof framework similar to Theorem 3.1.
Let be an matrix. Let be a matrix with mean zero independent columns satisfying the convex concentration property (Definition 3) with constant . Let , and let be a number satisfying . Then with probability at least , the matrix satisfies -RIP, i.e.,
4 Restricted Isometry under -way Column Hadamard-product
In this section, we investigate the restricted isometry property for a class of correlated random matrices motivated by theoretically understanding the effectiveness of word vector embeddings. Missing details from this section are collected in Appendix D.
To introduce this setting, let us start with the definition of a matrix product operation introduced by (Arora et al., 2018a) to construct their distributed cooccurrence (DisC) word embeddings.
Definition 5 (-way Column Hadamard-product Operation).
Let be an matrix, and let . The -way column Hadamard-product operation constructs a matrix whose columns indexed by a sequence is the elementwise product of the -th columns of , i.e., -th column in is , where for is the th column in .
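The construction in this definition can be sketched in a few lines. The sketch assumes that the index sequences are strictly increasing, so that each unordered set of distinct columns contributes exactly one column (consistent with "disregarding order"); the function name `column_hadamard_product` and the parameter name `ell` are hypothetical.

```python
from itertools import combinations
import numpy as np

def column_hadamard_product(R, ell):
    """Stack, as columns, the entrywise products of every ell-subset of columns of R.

    Assumption: index tuples satisfy i_1 < ... < i_ell.
    """
    R = np.asarray(R)
    d = R.shape[1]
    cols = [np.prod(R[:, list(idx)], axis=1)
            for idx in combinations(range(d), ell)]
    return np.stack(cols, axis=1)

R = np.array([[1, 2, 3],
              [4, 5, 6]])
print(column_hadamard_product(R, 2))
# columns correspond to index pairs (0, 1), (0, 2), (1, 2):
# [[ 2  3  6]
#  [20 24 30]]
```

Under this assumption the output has one column per subset, so its width grows as the binomial coefficient of the number of columns and `ell`, which is why the resulting columns are highly correlated.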
Arora et al. (2018a), based on an application of a result of (Foucart and Rauhut, 2017) on RIP for bounded orthonormal systems, showed that if is an i.i.d. random sign matrix, and if , then with probability at least , satisfies restricted isometry property. In this paper, we investigate RIP on . Qualitatively achieving RIP for is simpler than achieving RIP for . In the case one has only to ensure that is close to a constant with high probability for a fixed sparse unit vector . In the case of , in addition to it, one has to guarantee that the direction of the vector is more or less uniformly distributed over the sphere.
For , Theorem 3.1 holds, and we get RIP under the condition . Below, we first analyze the case of , and then the larger ’s. In the motivating application discussed in Section 4.1, and are probably the most interesting scenarios.
Analysis for . Let be a random matrix with centered bounded entries. [Footnote 9: The centering assumption is necessary because, to hope for RIP with high probability, we must have it on average. The boundedness assumption is an artifact of our proof approach: the product of subgaussians need not be subgaussian, whereas the class of bounded random variables is closed under products.] We prove that the matrix satisfies RIP with high probability provided that is sufficiently large. Note that we do not require the entries in to be identically distributed.
Let be an matrix, and let be a random matrix with independent entries such that . Let , and let be a number satisfying . Then with probability at least , the matrix satisfies the -RIP property, i.e., for any with ,
Let be a vector with . Let be the th element in with . The random variable is a homogeneous chaos of order 4 in terms of the random variables . Establishing concentration for this chaos can be a difficult task, so we will approach the problem from a different angle. Let , and define
Note that the random variables are independent. We will estimate the -norm of and use the Hanson-Wright inequality (Theorem 2.1) to establish the concentration for the norm of (where ). Note that the support of contains the pair with . Directly from the triangle inequality, one can get
since . This estimate is, however, too wasteful. We will prove a more precise one using a special decomposition of the vector . It is based on a novel induction procedure. For simplicity, we ignore the subscript , and denote by and by . We explain the induction idea here, deferring the entire proof to Appendix D.
The support of the vector can be viewed as an matrix with at most non-zero entries. If each row of this matrix contains at most one non-zero entry, we can condition on and get a bound on the -norm which does not depend on using Hoeffding’s inequality. The same is true if each column of the matrix contains at most one non-zero entry. This suggests that, intuitively, the worst-case scenario occurs when the non-zero entries of form a submatrix. This submatrix can be split into the sum of rows, which in combination with the Cauchy-Schwarz inequality allows one to bound the -norm by . It is not clear, however, how to extend this splitting to the case of with an arbitrary support. Instead, the matrix can be split in a slightly different way. Namely, we separate the first row of the support matrix of , so that the remaining part is a matrix. After that, we separate the first column of the remaining matrix, which leaves a matrix. Alternating between rows and columns, we get a matrix after steps. This process ends in steps, which yields the same bound for the -norm. We show below that this method can be extended to a vector with an arbitrary support.
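The alternating row and column splitting described above can be sketched in a few lines of Python for the special case of a full square block. Here `k` is a hypothetical name for the side length of the block, and the function `peel_rows_and_columns` is introduced only for illustration. The sketch covers the block case alone; the extension to an arbitrary support is the subject of the proof in Appendix D and is not reproduced here.

```python
import numpy as np


def peel_rows_and_columns(k):
    """Split a full k x k support into pieces, alternating first row / first column.

    Each returned piece is a boolean k x k mask supported on a single row
    or on a single column of the original block.
    """
    rows, cols = list(range(k)), list(range(k))
    pieces = []
    take_row = True
    while rows and cols:
        piece = np.zeros((k, k), dtype=bool)
        if take_row:
            r = rows.pop(0)        # separate the first remaining row
            piece[r, cols] = True
        else:
            c = cols.pop(0)        # separate the first remaining column
            piece[rows, c] = True
        pieces.append(piece)
        take_row = not take_row
    return pieces


pieces = peel_rows_and_columns(4)
print(len(pieces))                      # number of pieces, linear in k
print(sum(p.astype(int) for p in pieces))  # every entry is covered exactly once
```

Each piece has at most one non-zero entry per column (for a row piece) or per row (for a column piece), which is the situation in which the text obtains a bound via Hoeffding’s inequality.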
Let be a vector with . Let be independent random variables with . Then
Recall that for any -sparse vector , , where is a random vector with independent coordinates such that for all
By the volumetric estimate we can choose an -net in the set of all -sparse vectors in such that
Using Corollary B.1, for any with (and ),
Combining this with the union bound over and using the assumption on , we get
Assume that the complement of the event in the left-hand side of the previous inequality occurs. Then an approximation idea (as in Theorem 3.1) yields
for all -sparse unit vectors . This completes the proof of Theorem 4.1. ∎
Analysis for . A looser version of Theorem 4.1 also extends to larger ’s. While extending the stronger -norm estimate given by Lemma 4.2 to these larger ’s looks tricky, the looser bound obtained through a triangle inequality argument (as in (2)) still holds. This leads to the following theorem.
Let be an matrix, and let be a random matrix with independent entries such that . Let be a constant. Let , and let be a number satisfying . Then with probability at least , the matrix satisfies the -RIP property, i.e., for any with ,
4.1 Application to Understanding Effectiveness of Word Embeddings
Word embeddings, which represent the “meaning” of each word via a low-dimensional vector, have been widely used in many natural language processing and machine learning applications. We refer the reader to (Mikolov et al., 2013; Pennington et al., 2014) for additional background about word vector embeddings. Individual word embeddings can be extended to embed word sequences (such as a phrase or sentence) in multiple ways. There has been a recent effort to better understand the effectiveness of these embeddings, in terms of the information they encode and how this relates to performance on downstream tasks (Arora et al., 2017, 2018a; Khodak et al., 2018). In this paper, we work with a recently introduced word sequence embedding scheme, called distributed cooccurrence (DisC) embedding, that has been shown to be empirically effective for downstream classification tasks and also admits some theoretical justification (Arora et al., 2018a).
We in fact investigate linear transformations of these DisC embedding vectors (Definition 8). Linear transformations are commonly applied to existing embeddings to construct new embeddings in the context of domain adaptation, transfer learning, etc. For example, (Khodak et al., 2018) recently applied a (learnt) linear transformation to the DisC embedding vectors to construct a new embedding scheme, referred to as à la carte embedding, which they empirically show regularly outperforms the DisC embedding. Our results show that, under some conditions, these linearly transformed DisC embeddings have provable performance guarantees, in that they are at least as powerful on classification tasks, up to small error, as a linear classifier on Bag-of--cooccurrences vectors. Before stating our result formally, we need some definitions.
Let denote the vocabulary set (collection of some words) with . We assume each word has a vector representation . Let denote the matrix whose columns are these word embeddings. A -gram is a contiguous sequence of words in a longer word sequence. Here, is commonly referred to as a bigram, is referred to as a trigram, etc. [Footnote 10: For example, if the word sequence equals (“the”, “cow”, “jumps”, “over”, “the”, “moon”), then it has 5 different words. The collection of -grams (bigrams) for this sequence would be “the cow”, “cow jumps”, …, “the moon”. Similarly, the collection of -grams (trigrams) would be “the cow jumps”, “cow jumps over”, …, “over the moon”. In this case, the collections of -cooccurrences and -cooccurrences are the same as the collections of -grams and -grams, respectively.] The Bag-of--grams representation of a word sequence is a vector that counts the number of times any possible -gram (from the vocabulary set ) for some appears in the sequence. In practice, is set to a small value, typically . Tweaked versions of this simple representation are known to perform well for many downstream classification tasks (Wang and Manning, 2012). It is common to ignore the ordering of words in an -gram and to define an -gram as an unordered collection of words (also referred to as an -cooccurrence). Also, as in (Arora et al., 2018a), for simplicity we assume that each -cooccurrence contains a word at most once (as noted by (Arora et al., 2018a), this can be ensured by merging words during a preprocessing step).
Definition 6 (Bag-of--cooccurrences).
Given a -word sequence and , we define its Bag-of--cooccurrences vector as the concatenation of vectors where
Here, is a dimensional vector, and is the number of possible -cooccurrences in a word vocabulary set.
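To make Definition 6 concrete, the following Python sketch counts cooccurrences for the example sequence (“the”, “cow”, “jumps”, “over”, “the”, “moon”). It assumes that an `n`-cooccurrence is the unordered set of words in a window of `n` consecutive words, that windows containing a repeated word are skipped (the text assumes such windows do not occur), and that coordinates are ordered lexicographically by word index. The function names and the integer encoding of words are hypothetical choices made here.

```python
from itertools import combinations

import numpy as np


def cooccurrence_counts(word_ids, vocab_size, n):
    """Count each unordered n-cooccurrence over windows of n consecutive words."""
    subsets = list(combinations(range(vocab_size), n))
    index = {s: i for i, s in enumerate(subsets)}
    counts = np.zeros(len(subsets), dtype=int)
    for t in range(len(word_ids) - n + 1):
        key = tuple(sorted(set(word_ids[t:t + n])))
        if len(key) == n:          # skip windows with a repeated word (assumption)
            counts[index[key]] += 1
    return counts


def bag_of_cooccurrences(word_ids, vocab_size, n_max):
    """Concatenate the count vectors for cooccurrence sizes 1, ..., n_max."""
    return np.concatenate(
        [cooccurrence_counts(word_ids, vocab_size, n) for n in range(1, n_max + 1)]
    )


vocab = ["the", "cow", "jumps", "over", "moon"]
sequence = ["the", "cow", "jumps", "over", "the", "moon"]
word_ids = [vocab.index(w) for w in sequence]
print(bag_of_cooccurrences(word_ids, len(vocab), 2))
```

For this five-word vocabulary, the block for single words has 5 coordinates and the block for pairs has 10, so the concatenated vector has 15 coordinates.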
In practice, one would also consider only small word sequences, and therefore will be a small number. The embedding of an -cooccurrence is defined as the elementwise product of the embeddings of its constituent words.
Definition 7 (DisC Embedding (Arora et al., 2018a)).
Given a -word sequence and , we define its -DisC embedding as the -dimensional vector formed by concatenation of vectors where
i.e., is the sum of the -cooccurrence embeddings of all -cooccurrences in the document.
where is the -way column Hadamard-product constructed out of . The columns of contain the DisC embedding of all possible -cooccurrences in the vocabulary set (and thus ).
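The relation between the DisC embedding and the column Hadamard-product can be checked numerically. The sketch below computes one block of the embedding in two ways: directly, as the sum of elementwise products of word vectors over windows of consecutive words, and as the Hadamard-product matrix applied to the corresponding cooccurrence count vector. It assumes no additional normalization constants, unordered windows without repeated words, and lexicographic ordering of cooccurrences; the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def disc_block_direct(X, word_ids, n):
    """Sum of elementwise products of word vectors over all n-word windows."""
    z = np.zeros(X.shape[0])
    for t in range(len(word_ids) - n + 1):
        window = sorted(set(word_ids[t:t + n]))
        if len(window) == n:       # skip windows with a repeated word (assumption)
            z += np.prod(X[:, window], axis=1)
    return z


def disc_block_via_matrix(X, word_ids, n):
    """Same block, computed as (n-way column Hadamard-product) @ (count vector)."""
    subsets = list(combinations(range(X.shape[1]), n))
    Xn = np.stack([np.prod(X[:, list(s)], axis=1) for s in subsets], axis=1)
    index = {s: i for i, s in enumerate(subsets)}
    counts = np.zeros(len(subsets))
    for t in range(len(word_ids) - n + 1):
        key = tuple(sorted(set(word_ids[t:t + n])))
        if len(key) == n:
            counts[index[key]] += 1
    return Xn @ counts


rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(8, 5))   # 5 word vectors of dimension 8
word_ids = [0, 1, 2, 3, 0, 4]              # "the cow jumps over the moon"
print(np.allclose(disc_block_direct(X, word_ids, 2),
                  disc_block_via_matrix(X, word_ids, 2)))
```

The matrix form is the one that makes restricted isometry relevant: the count vector is sparse for short word sequences, and the embedding is a linear image of it.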
Definition 8 (Linearly Transformed DisC Embedding).
Given a matrix of word vectors and a set of matrices , a linearly transformed -DisC embedding matrix of is a block-diagonal matrix with blocks