The Permuted Striped Block Model and its Factorization – Algorithms with Recovery Guarantees

04/10/2020
by Michael Murray, et al.
University of Oxford

We introduce a novel class of matrices which are defined by the factorization Y := AX, where A is an m × n wide sparse binary matrix with a fixed number d of nonzeros per column and X is an n × N sparse real matrix whose columns have at most k nonzeros and are dissociated. Matrices defined by this factorization can be expressed as a sum of n rank one sparse matrices whose nonzero entries, under the appropriate permutations, form striped blocks; we therefore refer to them as Permuted Striped Block (PSB) matrices. We define the PSB data model as a particular distribution over this class of matrices, motivated by its implications for community detection, provable binary dictionary learning with real valued sparse coding, and blind combinatorial compressed sensing. For data matrices drawn from the PSB data model, we provide computationally efficient factorization algorithms which recover the generating factors with high probability from as few as N = O((n/k) log^2(n)) data vectors, where k, m and n scale proportionally. Notably, these algorithms achieve optimal sample complexity up to logarithmic factors.


1 Introduction

In many data science contexts, data is represented as a matrix and is often factorized into the product of two or more structured matrices so as to reveal important information. Perhaps the most famous of these factorizations is principal component analysis (PCA) [22], in which the unitary factors represent dominant correlations within the data. Dictionary learning [35] is another prominent matrix factorization, in which the data matrix is viewed to lie, at least approximately, on a union of low rank subspaces. These subspaces are represented as the product of an overcomplete matrix, known as a dictionary, and a sparse matrix. More generally, a wide variety of matrix factorizations have been studied to solve a broad range of problems, for example missing data in recommender systems [27], nonnegative matrix factorization [23] and automatic separation of outliers from a low rank model via sparse PCA [14].

In this paper we introduce a new data matrix class which permits a particular factorization of interest. The members of this class are composed of a sum of rank one matrices, each the outer product of a binary column vector with exactly d non-zeros and a real row vector. Unlike PCA, and more analogous to dictionary learning, we typically consider the overcomplete regime in which n > m. An example of a matrix in this class is the sum of rank one matrices whose supports are striped blocks of a fixed size that may or may not overlap. Here ‘striped’ refers to the entries in any given column of an associated block having the same coefficient value. More generally, data matrices in this class can be expressed as the sum of independently permuted striped blocks and as a result we will refer to them as Permuted Striped Block (PSB) matrices. We provide a visualization of such matrices in Figure 1.

Figure 1: Visualization of two examples of PSB matrices. In both cases the data consists of a sum of four rank 1 matrices, each with support size . The left hand plot corresponds to the case where the support of each rank 1 matrix is arranged into a block. The right hand plot takes the same rank 1 matrices as in the left hand plot, but before summing them applies an independent random permutation to each. Note that white squares indicate a zero entry and that entries in the support (squares coloured a blue shade) that are in the same column have the same coefficient value (hence, aside from where there are overlaps, each column of a block is a single stripe of colour).

1.1 Data Model and Problem Definition

A PSB matrix can be defined as Y := AX, where A is an m × n sparse binary matrix with exactly d nonzeros per column and X is an n × N column sparse real matrix. In Definition 1 we present the PSB data model, which defines a particular distribution over the set of PSB matrices. The focus of this paper is to show that, under certain conditions and with high probability, data sampled from the PSB data model has a unique (up to permutation) factorization of the form discussed, which can be computed efficiently and in dimension scalings that are near optimal. The details of the PSB data model are given below in Definition 1. As a quick point to clarify our notation, bold upper case letters will be used to refer to deterministic matrices while nonbolded upper case letters will be used to refer to random matrices, i.e., a matrix drawn from a particular distribution (from the context it should be clear which distribution is being referred to).

Definition 1.

Permuted Striped Block (PSB) data model: given dimensions m, n and N and sparsity parameters d and k, define

  • the set of binary vectors of dimension m with exactly d nonzeros each, and the set of m × n matrices all of whose columns belong to this set.

  • the set of real, dissociated (see Definition 2) and at most k sparse vectors of dimension n, and the set of n × N matrices all of whose columns belong to this set.

We now define the following random matrices; note that the randomness is over their supports only.

  • A is a random binary matrix of size m × n whose columns lie in the first set above. The distribution over the supports of these random columns is defined as follows. The first ⌊m/d⌋ columns are formed by dividing a random permutation of the m row indices into disjoint sets of size d and assigning each disjoint set as the support of a column. This process is repeated with independent permutations until n columns are formed. In this construction there are a fixed number d of nonzeros per column and, since each permutation contributes at most one nonzero to any given row, a maximum of ⌈n/⌊m/d⌋⌉ nonzeros per row.

  • X is a random real matrix of size n × N whose distribution is defined by concatenating N mutually independent and identically distributed random vectors from the second set above; that is, the support of each column is chosen uniformly at random across all possible supports of size at most k.

The PSB data model is the product of the aforementioned factors, generating the random matrix

Y := AX.

The columns comprising X in the PSB data model have nonzeros drawn so that partial sums of the nonzeros give unique nonzero values, a property which is referred to as being dissociated.

Definition 2.

A vector x is said to be dissociated iff for any two distinct subsets T1 and T2 of its support, the sum of the entries of x over T1 differs from the sum of the entries of x over T2.

The concept of dissociation comes from the field of additive combinatorics (Definition 4.32 in [33]). Although at first glance this condition appears restrictive, it is fulfilled almost surely for isotropic vectors and more generally for any random vector whose nonzeros are drawn from a continuous distribution.
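To make Definitions 1 and 2 concrete, the following Python sketch samples a matrix from the PSB data model under assumed parameters (m, n, N, d, k); the function name and the use of standard normal nonzeros are illustrative choices, with any continuous distribution rendering the columns of X dissociated almost surely.

```python
import numpy as np

def sample_psb(m, n, N, d, k, seed=None):
    """Minimal sketch of the PSB data model (Definition 1), returning A, X and Y = AX.

    A: m x n binary with exactly d nonzeros per column; supports are built by
       splitting independent random permutations of {0, ..., m-1} into disjoint
       blocks of size d, one block per column.
    X: n x N real; each column has a support of size k (the model allows at most k)
       chosen uniformly at random, with nonzeros drawn from a continuous
       distribution (standard normal here), so the columns are dissociated
       almost surely (Definition 2).
    """
    assert d <= m and k <= n
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n), dtype=int)
    col = 0
    while col < n:
        perm = rng.permutation(m)                # one independent permutation of the rows
        for start in range(0, m - d + 1, d):     # disjoint blocks of size d
            if col == n:
                break
            A[perm[start:start + d], col] = 1
            col += 1
    X = np.zeros((n, N))
    for j in range(N):
        support = rng.choice(n, size=k, replace=False)
        X[support, j] = rng.standard_normal(k)
    return A, X, A @ X
```

Columns of A built from the same permutation have disjoint supports while columns built from different permutations are independent, which is the dependency structure exploited in the analysis later in the paper.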

1.2 Motivation and Related Work

The PSB data model and the associated factorization task can be interpreted and motivated from a number of perspectives, three of which we now highlight.

  • A generative model for studying community detection and clustering:

    consider the general problem of partitioning the nodes of a graph into clusters so that intra cluster connectivity is high relative to inter cluster connectivity. Given such a graph two basic questions of interest are 1) do these clusters or communities of nodes exist (community detection) and 2) can we actually recover them (clustering)? To study this question researchers study various generative models for random graphs, one of the most popular (particularly in machine learning and the network sciences) being the stochastic block model (for a recent survey see

    [1]). In this setting the observed data matrix is the adjacency matrix of a graph, generated by first sampling the cluster to which each node belongs and then sampling edges based on whether or not the relevant pair of nodes belong to the same cluster. The weighted stochastic block model, Aicher et al [3], generalizes this idea, allowing the weights of this adjacency matrix to be non-binary. The PSB data model can be viewed as an alternative data model for studying community detection and clustering. Indeed, the PSB data model can be interpreted as the adjacency matrix of a weighted bipartite graph. Recovering the factors A and X from Y is valuable as A encodes clusters or groups of the row indices (where each group is of a fixed size d) and each column of X encodes a soft clustering of the corresponding data column into at most k of the n groups. The nonzero coefficients in a given column of X represent the strength of association between that data column and the clusters defined by A.

  • Dictionary learning with a sparse, binary dictionary: for classes of commonly used data it is typically the case that there are known representations in which the object can be represented using few components, e.g. images are well represented using few coefficients in wavelet or discrete cosine representations. That is, there is a known “dictionary” D for which data y from a specific class has the property that the error of approximating y by Dx is small even while the number of nonzeros in x is limited to k, meaning k is much smaller than the ambient dimension. Dictionary learning allows one to tailor a dictionary to a specific data set by minimizing the reconstruction error of the data subject to each column of the sparse coding matrix having at most k nonzeros. Alternatively, dictionary learning applied to a data matrix without prior knowledge of an initial dictionary reveals properties of the data through the learned dictionary. Factorizing a matrix drawn from the PSB data model can be viewed as a specific instance of dictionary learning in which the dictionary is restricted to be in the class of overcomplete sparse, binary dictionaries. While this restricted class of dictionaries limits the type of data which the PSB data model can be used to describe, as we will show in Section 3 it does allow for a rigorous proof that with high probability it is possible to efficiently learn the dictionary and sparse code. This extends the growing literature on provable dictionary learning; see Table 1 for a summary of some recent results.

  • Learned combinatorial compressed sensing: the field of compressed sensing [13, 17] studies how to efficiently reconstruct a sparse vector from only a few linear measurements. These linear measurements are generated by multiplying the sparse vector in question by a matrix, referred to as the sensing matrix. The sensing matrices most commonly studied are random and drawn from one of the following distributions: Gaussian, uniform projections over the unit sphere and partial Fourier. Combinatorial compressed sensing [10] instead studies the compressed sensing problem in the context of a random, sparse, binary sensing matrix. The problem of factorizing PSB matrices can therefore also be interpreted as recovering the sparse coding matrix X from compressed measurements Y without access to the sensing matrix A. Indeed, the factorization problem we study here can be motivated by the conjecture that, as combinatorial compressed sensing algorithms are so effective, the sparse code can be recovered even without knowledge of the sensing matrix.

The PSB data model, Definition 1, and the associated factorization task are most closely related to the literature on subspace clustering, multiple measurement combinatorial compressed sensing, and more generally the literature on provable dictionary learning. Each of these topics studies the model Y = AX, where X is assumed to be a sparse matrix with up to k nonzeros per column. These topics differ in terms of what further assumptions are imposed on the dictionary A and sparse coding matrix X. The most general setup is that of dictionary learning, in which typically no additional structure is imposed on A. In the literature on provable dictionary learning however, structure is often imposed on the factor matrices so as to facilitate the development of theoretical guarantees, even if this comes at the expense of model expressiveness. Two popular structural assumptions are that the dictionary is complete (square and full rank) or that its columns are uncorrelated. Furthermore, as is the case for the PSB data model, in provable dictionary learning it is common to assume that the factor matrices are drawn from a particular distribution, e.g. Bernoulli-Subgaussian, so as to understand how difficult the factorization typically is to perform, rather than in the worst case. Table 1 lists some recent results and summarizes their modelling assumptions. Subspace clustering considers the further structural assumption that the columns of the sparse coding matrix can be partitioned into disjoint sets, where the number of nonzeros per row of each of the resulting submatrices is less than the minimum of the relevant dimensions. In this setting, the data is contained on a union of subspaces which are of lower dimension than the ambient dimension [19].

The PSB data model differs from the above literature most notably in the construction of A, which is sparse, binary and has exactly d nonzeros per column. The case of a sparse, binary dictionary has been studied previously in [5], where, guided by the notion of reversible neural networks (that a neural network can be run in reverse, acting as a generative model for the observed data), the authors consider learning a deep network as a layerwise nonlinear dictionary learning problem. This work differs substantially from [5] in a number of respects. First, the authors of [5] consider a thresholded model in which the elementwise unit step function is applied to the product of the factors; this removes information about the nonzero coefficients of the sparse code and means that the observed data is binary. As a result of this, and because the authors are in the setup where each layer feeds into the next, the sparse code is also assumed to be binary. Second, the factors generating this thresholded model are challenging to recover due to the limited information available and as a consequence this model is also less descriptive for nonbinary data. In particular, in the three motivating examples previously covered: the communities would be unweighted, the dictionary learning data would be binary, and the learned combinatorial compressed sensing would only be for binary signals. Third and finally, in [5] the nonzeros of the dictionary are drawn independently of one another; the distribution defined in the PSB data model is a significant departure from this, with both inter and intra column nonzero dependencies. We also emphasize that in terms of method the approach we take to factorize a matrix drawn from the PSB data model differs markedly from that adopted in [5]. In this prior work the dictionary is reconstructed (up to permutation) using a non-iterative approach, involving first the computation of all row wise correlations of the data, from which pairs of nonzero entries are recovered. Using a clustering technique adapted from the graph square root problem, these pairs of nonzeros, which can be thought of as partial supports of the columns of the dictionary of length 2, are then combined to recover the columns in question. In contrast, our method iteratively generates large partial supports directly from the columns of the residual, which can readily be clustered while simultaneously revealing nonzero entries in X. This process is then repeated on the residual of Y after the current approximations of A and X have been removed.

Authors Run time Recovery
Spielman et al [29] Complete Bernoulli-Subgauss. poly() -accuracy
Agarwal et al [2] Incoherent Bernoulli-Uniform ) poly() -accuracy
Arora et al [6] Incoherent Bernoulli-Uniform poly() -accuracy
Sun et al [31] [32] Complete Bernoulli-Gaussian poly() poly() -accuracy
Barak et al [9] -dictionary -nice distribution poly(n) quasi-poly() -accuracy
Arora et al [4] Indiv. Recoverable Bernoulli-{0,1} () poly() quasi-poly() -accuracy
Arora et al [5] Bernoulli- Bernoulli- * * * Exact
M&T Definition 1 Definition 1 Exact
Table 1: Summary of recent results from the provable dictionary learning literature. *The recovery results for [5] displayed here are presented under the assumption that for some large constant C (as stated in Theorem 1, pg. 5) and (Theorem 1 of Appendix B, pg. 21), meaning . Furthermore, the run time for [5] is based only on step 1 of Algorithm 1 (pg. 8) which requires computing the row wise correlation of row vectors in ; as there are pairs of rows then if each row has nonzeros then this implies .

1.3 Main Contributions

The main contributions in this paper are 1) the introduction of the Permuted Striped Block (PSB) data model, 2) an algorithm, Expander Based Factorization (EBF), for computing the factorization of PSB matrices and 3) recovery guarantees for EBF under the PSB data model, which we summarize in Theorem 1. Central to our method of proof is the observation that the random construction of A as in Definition 1 is with high probability the adjacency matrix of a left d-regular (k, ε, d) bipartite expander graph. We define such graphs in Definition 3 and discuss their properties in Section 2.2. In what follows we refer to an m × n adjacency matrix of a left d-regular (k, ε, d) bipartite expander graph simply as an expander matrix.

Theorem 1.

Let be drawn from the PSB data model (Definition 1) under the assumption that . If EBF, Algorithm 1, terminates at an iteration then the following statements are true.

  1. EBF only identifies correct entries: for all there exists a permutation such that , and all nonzeros in are equal to the corresponding entry in .

  2. Uniqueness of factorization: if then this factorization is unique up to permutation.

  3. Probability that EBF is successful: suppose and where and . If where (see Lemma 8 for a full definition) and is a constant, then the probability that EBF recovers and up to permutation is greater than .

As each column of is composed of columns from , which itself has columns, a minimum number of columns are necessary in order for to have at least one contribution from each column in . To be clear, this is a necessary condition on to identify all columns in . Theorem 1 states that is sufficient to recover both and from with high probability under the PSB data model. We believe this result is likely to be a sharp lower bound as one factor arises from a coupon collector argument inherent to the way is sampled, and the second factor is needed to achieve the stated rate of convergence. As discussed in more detail in Section 4.2, assuming the same asymptotic relationships as in statement 3 of Theorem 1, EBF has a per iteration cost (in terms of the input data dimensions) of . The rest of this paper is structured as follows: in Section 2 we provide the algorithmic details of EBF, in Section 3 we prove Theorem 1 and in Section 4 we provide a first step improvement of EBF along with numerical simulations, demonstrating the efficacy of the algorithms in practice. The proofs of the key supporting lemmas are mostly deferred to the Appendix.

2 Algorithmic Ideas

In this section we present a simple algorithm which leverages the properties of expander graphs to try to compute the factorization of Y = AX, where A is an expander matrix and X is column sparse and dissociated. We call this algorithm Expander Based Factorization (EBF). To describe and define this algorithm we adopt the following notational conventions; note that all variables referenced here will be defined properly in subsequent sections.

  • For any natural number p we define [p] as the set of natural numbers less than or equal to p, for example, [3] = {1, 2, 3}.

  • If is an matrix, a subset of the row indices and a subset of the column indices, then is the submatrix of formed with only the rows in and the columns in .

  • will be used as the iteration index for the algorithms presented.

  • is the estimate of

    , up to column permutation, at iteration .

  • is the estimate of , up to row permutation, at iteration .

  • is the residual of the data matrix at iteration .

  • is the number of partial supports extracted from at iteration .

  • is the matrix of partial supports , extracted from .

  • is the set of singleton values (see Definition 4) extracted from which appear in a column of more than times.

  • is the set of partial supports (see Definition 5) of extracted from used to update the th column of the estimate .

  • A column of A is said to have been recovered iff there is some iteration at which, and at all subsequent iterates, there exists a column of the current estimate of A equal to it.

  • A column with of is said to be complete at iteration iff

  • is the elementwise unit step function; for and for .

  • is the set of column indices of which have non-zeros, i.e., . Furthermore .

  • For consistency and clarity we will typically adopt the following conventions: will be used as an index for the columns of , , and , will be used as an index for the rows of , , and , will be used as an index for the columns of as well as the rows of , will be used as an index for the columns of as well as the rows of , and finally will be used as an index for the columns of .

2.1 The Expander Based Factorization Algorithm (EBF)

The key idea behind EBF is that if the binary columns of are sufficiently sparse and have nearly disjoint supports, then certain entries in , which we will term singleton values (see Definition 4), can be directly identified from entries of . Furthermore, it will become apparent that singleton values not only provide entries of but also entries in via the construction of partial supports. By iteratively combining the information gained about and from the singleton values and partial supports and then removing it from the residual, EBF can iteratively recover parts of and until either no further progress can be made or the factors have been fully recovered (up to permutation).

1:  Init: , , , .
2:  while  or or  do
3:     Set .
4:     Extract set of singleton values and associated partial supports from .
5:     Update by inspecting for singleton values identified in prior iterations.
6:     Cluster partial supports () into sets ().
7:     Update : for all if has partial support set .
8:     Update : for all set .
9:     Update residual: .
10:  end while
11:  Return
Algorithm 1 EBF(, , , , , , )

The remainder of Section 2 is structured as follows: in Section 2.2 we review the definition and properties of expander graphs so as to formalize a notion of the columns of A being sufficiently disjoint in support, in Section 2.3 we define and prove certain key properties of singleton values and partial supports, and in Section 2.4 we prove that EBF never introduces erroneous nonzeros into the estimates of either A or X and review each step of Algorithm 1 in more detail. Finally, in Section 2.5, for completeness we provide a sketch of a proof that in a certain parameter regime the matrix A in the PSB data model is with high probability the adjacency matrix of a (k, ε, d) expander graph.

2.2 Background on Expander Graphs

The PSB data model in Definition 1 is chosen so as to leverage certain properties of expander graphs. Such graphs are highly connected yet sparse, and are of interest both from a practical perspective, playing a key role in applications such as error correcting codes, as well as in areas of pure mathematics (for a survey see Lubotzky [24]). We consider only left d-regular bipartite expander graphs and for ease we will typically refer to these graphs as expanders.

Definition 3 ((k, ε, d) Expander Graph [28]).

Consider a left d-regular bipartite graph G = ([n], [m], E); for any subset S of the left nodes [n] let N(S) denote the set of nodes in [m] connected to a node in S. G is a (k, ε, d) expander iff

|N(S)| > (1 - ε) d |S| for all S ⊆ [n] with |S| ≤ k.    (2.1)

Here the quantification is over the set of subsets of [n] with cardinality at most k. A key property of such graphs is the Unique Neighbour Property.

Theorem 2 (The Unique Neighbour Property [15, Lemma 1.1]).

Suppose that G is an unbalanced, left d-regular bipartite graph G = ([n], [m], E). Let S ⊆ [n] be any subset of nodes with |S| ≤ k and define N1(S) as the set of nodes in [m] connected to exactly one node of S. If G is a (k, ε, d) expander graph then

|N1(S)| > (1 - 2ε) d |S|.    (2.2)

A proof of Theorem 2 in the notation used here is available in [25, Appendix A]. For our purposes it will prove more relevant to describe expander graphs using matrices. The adjacency matrix of a (k, ε, d) expander graph is an m × n binary matrix A in which an entry is 1 if there is an edge between the corresponding left node in [n] and right node in [m], and is 0 otherwise. (We note that adjacency matrices of graphs are often defined to describe the edges present between all nodes in the graph; however, we emphasize that in the definition adopted here the edges between nodes in the same group are not set to zero but rather are omitted entirely.) Applying Definition 3 and Theorem 2 we make the following observations.

Corollary 1 (Adjacency matrix of a (k, ε, d) Expander Graph).

If A is the adjacency matrix of a (k, ε, d) expander graph then any submatrix of A formed from a subset S of at most k of its columns satisfies the following properties.

  1. By Definition 3 there are more than (1 - ε)d|S| rows of this submatrix that have at least one non-zero.

  2. By Theorem 2 there are more than (1 - 2ε)d|S| rows of this submatrix that have only one non-zero.

In the remainder of the paper we refer to an m × n adjacency matrix of a (k, ε, d) expander graph simply as an expander matrix.
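As an illustration of Corollary 1, the short sketch below counts, for a given subset S of columns of an adjacency matrix A, the rows of the corresponding submatrix with at least one nonzero and with exactly one nonzero, and compares these counts against the (1 - ε)d|S| and (1 - 2ε)d|S| thresholds above. The function names are illustrative, and since checking all subsets of size at most k is combinatorial this is intended only as a spot check.

```python
import numpy as np

def neighbour_counts(A, S):
    """Return (n_any, n_unique) for the submatrix of A restricted to columns S:
    the number of rows with at least one nonzero (neighbours of S) and the
    number of rows with exactly one nonzero (unique neighbours of S)."""
    row_sums = A[:, list(S)].sum(axis=1)
    return int(np.sum(row_sums >= 1)), int(np.sum(row_sums == 1))

def satisfies_corollary_1(A, S, d, eps):
    """Spot check of the Corollary 1 bounds for a single subset S with |S| <= k:
    more than (1 - eps)*d*|S| neighbours and more than (1 - 2*eps)*d*|S| unique
    neighbours."""
    n_any, n_unique = neighbour_counts(A, S)
    return n_any > (1 - eps) * d * len(S) and n_unique > (1 - 2 * eps) * d * len(S)
```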

2.3 Singleton values and partial supports

For now we will consider a generic vector y = Ax, where A is an expander matrix and x is dissociated with at most k nonzeros. We will later apply the theory we develop here to the columns of the residual at any iteration. Since the rows of A are binary, any given entry of y is a sum of some subset of the entries of x. We now introduce two concepts which underpin much of what follows.

Definition 4 (Singleton Value).

Consider a vector y = Ax, where A is an expander matrix and x is dissociated with at most k nonzeros; a singleton value of y is an entry which is the sum of exactly one nonzero of x, and hence is equal to some nonzero entry of x.

Definition 5 (Partial Support).

A partial support of a column of A is a binary vector whose support is contained in the support of that column.

Singleton values are of interest in the context of factorizing a matrix drawn from the PSB data model as once identified their contribution can be removed from the residual. Furthermore, using a singleton value one can construct a partial support by creating a binary vector with nonzeros where the singleton value appears. Therefore identifying singleton values allows for the recovery of nonzeros in both A and X. To leverage this fact however we need a criterion or certificate for identifying them. Under the assumption that A is a (k, ε, d) expander and that x is dissociated (see Definition 2), it is possible to derive a certificate based on a lower bound on the mode (or frequency) with which a value appears in y.

Lemma 1 (Identification of singleton values).

Consider a vector where and , then the frequency of any singleton value is at least . To be precise,

A proof of Lemma 1 is provided in Appendix A.1. This lemma provides a sufficient condition for testing whether a value is a singleton or not. However, it does not provide any guarantees that there will be any singleton values in y. In Lemma 2, adapted from [25, Theorem 4.6], we show, under certain assumptions, that there always exists a positive number of singleton values which appear sufficiently many times in y.

Lemma 2 (Existence of singleton values, adapted from [25, Theorem 4.6]).

Consider a vector where and . For , let be the set of row indices of for which , i.e., . Defining as the set of singleton values which appear more than times in , then

Therefore, so long as is nonzero it is always possible to extract at least 1 partial support of size greater than .

A proof of Lemma 2 is given in Appendix A.2.
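The following is a minimal sketch of how singleton values and their partial supports might be extracted from a single column of the residual, in the spirit of Lemmas 1 and 2: values are grouped by exact equality (justified by the dissociated assumption) and any nonzero value repeated more than a threshold `thresh` is accepted. The threshold is left as an input here, with Lemma 1 supplying a suitable bound, and the function name is illustrative.

```python
import numpy as np
from collections import Counter

def extract_singletons(y, thresh):
    """Extract candidate singleton values and partial supports from a column y.

    Any nonzero value of y repeated more than `thresh` times is treated as a
    singleton value; the associated partial support is the binary indicator of
    the rows where that value appears. Grouping by exact equality is justified
    by the dissociated assumption (with floating point data a small tolerance
    would be used instead).
    """
    nonzero_values = y[np.flatnonzero(y)]
    counts = Counter(nonzero_values.tolist())
    singletons = []
    for value, freq in counts.items():
        if freq > thresh:
            partial_support = (y == value).astype(int)
            singletons.append((value, partial_support))
    return singletons
```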

Step 5 of Algorithm 1 relies on us being able to accurately and efficiently group the partial supports extracted according to the column of A from which they originate. Corollary 2 states that if A is a suitable expander matrix and if the partial supports are sufficiently large, then clustering can be achieved without error simply by computing pairwise inner products.

Corollary 2 (Clustering partial supports).

Any pair of partial supports , satisfying and originate from the same column of iff .

A proof of Corollary 2, which follows directly from results derived in the proof of Lemma 2, is provided in Appendix A.3.
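The clustering step suggested by Corollary 2 can be sketched as a greedy pass over pairwise inner products; the threshold `tau` is treated as an input here (the corollary specifies the appropriate value under the expander assumption), and comparing each new partial support against a single representative per cluster suffices when the pairwise test is error free.

```python
import numpy as np

def cluster_partial_supports(W, tau):
    """Greedily cluster the columns of W (each column a binary partial support).

    Two partial supports are placed in the same cluster when their inner product
    exceeds tau, as suggested by Corollary 2. Returns a list of clusters, each a
    list of column indices of W."""
    clusters = []
    for j in range(W.shape[1]):
        for cluster in clusters:
            if W[:, cluster[0]] @ W[:, j] > tau:   # compare with a cluster representative
                cluster.append(j)
                break
        else:
            clusters.append([j])
    return clusters
```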

2.4 Algorithm 1 accuracy guarantees and summary

Based on the results of Section 2.3 it is now possible to provide the following accuracy guarantees for EBF.

Lemma 3 (EBF only identifies correct entries).

Let , where with and . Suppose that EBF terminates at iteration , then the following statements are true.

  1. For all there exists a permutation such that ,
    and all nonzeros in are equal to the corresponding entry in .

  2. EBF fails only if .

For a proof of Lemma 3 see Appendix A.4. Lemma 3 states that EBF never makes a mistake in terms of introducing erroneous nonzeros to either or , and hence fails to factorize only by failing to recover certain nonzeros in or . We are now in a position to be able to summarize and justify how and why each step of Algorithm 1 works.

  • Steps 1, 2 and 3 are self explanatory, 1 being the initialization of certain key variables, 2 defining a while loop which terminates only when neither of the estimates or are updated from the previous iterate and 3 being the update of the iteration index.

  • Steps 4 and 5 are in fact conducted simultaneously with each column of the residual being processed in parallel. The mode of each nonzero in a column is calculated and the value checked against singleton values identified in previous iterates. If a nonzero value appears more than times then its associated partial support is constructed. Note that although it may be possible to identify singleton values which appear at least times, unless they appear more than then it is not possible to guarantee that their associated partial support can be clustered correctly. Therefore, such a singleton value cannot be placed in a row of consistent with the other singleton values extracted, and its associated partial support cannot be used to augment . If any nonzero value of a column of the residual matches a previously identified singleton value from the same column then it can also be used to augment . Indeed, if then by Lemma 3, and by the fact that is dissociated it must be that .

  • Steps 6, 7 and 8 can be conducted on the basis of Corollary 2. A naive method would be to simply take each of the partial supports extracted from in turn and compute its inner product with the nonzero columns of . By Lemma 3 and the fact that each partial support has cardinality larger than , then either a partial support will match with an existing column of or otherwise it can be used as the starting estimate for a new column of . Once a partial support, for which we know the column in from which it was extracted, is assigned to a column of , then the corresponding location in can be updated using its associated singleton value.

  • Step 9 - once all singleton values and partial supports extracted from the residual have been used to augment the estimates of A and X, the residual can be updated and the algorithm loops; a simplified sketch of one such pass is given below.
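Putting the pieces together, the following simplified sketch performs one EBF-style pass over the residual using the two helper sketches above (extract_singletons and cluster_partial_supports, both hypothetical names, with thresholds supplied as inputs rather than the values derived in the paper). It omits the bookkeeping required to merge newly found partial supports and singleton values into estimates carried over from earlier iterations, which Algorithm 1 performs.

```python
import numpy as np

def ebf_pass(R, thresh, tau):
    """One simplified EBF-style pass over a real valued residual R (m x N).

    Candidate singleton values and partial supports are extracted column by
    column, clustered by inner products, and each cluster yields the union of
    its partial supports as an estimate of a column of A together with the
    corresponding nonzeros of X. Returns the estimated columns of A, the
    recovered (A column, data column, value) triples, and the updated residual.
    """
    m, N = R.shape
    supports, values, origins = [], [], []
    for j in range(N):
        for value, partial_support in extract_singletons(R[:, j], thresh):
            supports.append(partial_support)
            values.append(value)
            origins.append(j)                    # data column the extraction came from
    if not supports:
        return [], [], R
    W = np.stack(supports, axis=1)
    clusters = cluster_partial_supports(W, tau)
    A_cols, X_entries = [], []
    for i, cluster in enumerate(clusters):
        a_est = (W[:, cluster].sum(axis=1) > 0).astype(int)   # union of partial supports
        A_cols.append(a_est)
        for idx in cluster:
            X_entries.append((i, origins[idx], values[idx]))
    R_new = R.astype(float).copy()
    for i, j, value in X_entries:
        R_new[:, j] -= value * A_cols[i]          # remove the identified contribution
    return A_cols, X_entries, R_new
```

In EBF itself the estimates of A and X are accumulated across iterations and the residual is recomputed from them at each pass (step 9 of Algorithm 1), so information recovered in later passes also refines the contributions removed earlier.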

2.5 The construction of A in the PSB data model: motivation and connections to expander graphs

The dictionary model in Definition 1 is chosen so as to satisfy two properties that facilitate the identification of its columns from the data . First, the number of nonzeros per row of is bounded so that each entry can be readily guaranteed to be recovered, and second, the supports of the nonzeros per column are drawn so that the probability of substantial overlap is bounded. The remainder of this section is a further discussion of how these properties motivate the construction of and give insight into other choices of for which the methods presented in this paper could be extended. Some explicit examples of such extensions are given in Section 5.

In order that an entire column of can be identified, the union of the associated partial supports extracted by EBF needs to be equal to the entire support of the column. That sufficiently many partial supports of a column will be observed at some iteration of EBF is achieved by ensuring that there are enough nonzeros per row of , this is analyzed in Lemma 9 in Section 3. The inability to identify all entries in a column of only occurs if the random combinations of columns of have a consistent overlap, resulting in an entry from the column in question not being present in any of the partial supports. The probability of such an overlap is analyzed in the proof of Lemma 8, where we show that bounding the maximum number of nonzeros per row of is sufficient to ensure that such a consistent overlap has a probability that goes to zero exponentially fast in the number of partial supports available. This motivates the random construction of in Definition 1, where the maximum number of nonzeros per row of is given by .

The algorithmic approach of EBF, i.e., the extraction and clustering of singleton values and partial supports, follows from A being a (k, ε, d) expander. In [7] it was shown that if A were constructed such that the column supports have fixed cardinality d and were drawn independently and uniformly at random, then A would be a (k, ε, d) expander with high probability. More precisely, [7, 8] considered d and ε to be fixed with k, m and n growing proportionally and showed that the probability that A would not be the adjacency matrix of a (k, ε, d) expander graph is bounded by a given function which decays exponentially in the problem size. Without modification, the aforementioned proof and bounds in [7] prove that the construction of A in the PSB data model, Definition 1, is with high probability a (k, ε, d) expander graph. For brevity we do not review this proof in detail and instead only sketch why the proof is equally valid here. In [7] the large deviation probability bounds are derived by considering the submatrix consisting of a subset of at most k columns of A. This subset is repeatedly divided into two disjoint subsets, each containing approximately half of the columns, and the number of neighbours of the whole subset is bounded in terms of the number of neighbours of its two halves. This process is repeated until the finest level partitions contain a single column, in which case the number of neighbours is exactly d. Bounds on the probability of the number of neighbours are computed using the mutual independence of the columns. These bounds also hold for the construction in Definition 1 since the columns either have independently drawn supports or, if they are dependent, then they are disjoint by construction.

3 Theoretical Guarantees

Analyzing and proving theoretical guarantees for EBF directly is challenging, so instead our approach will be to study a simpler surrogate algorithm which we can use to lower bound the performance of EBF. To this end, we introduce the Naive Expander Based Factorization Algorithm (NEBF) which, despite being suboptimal from a practical perspective, is still sufficiently effective for us to prove Theorem 1.

3.1 The Naive Expander Based Factorization Algorithm (NEBF)

NEBF, given in Algorithm 2, is based on the same principles as EBF developed in Section 2, in short the extraction and clustering of partial supports and singleton values. However, there are a number of significant restrictions included in order to allow us to determine when and with what probability it will succeed. At each iteration of the while loop, lines 2-14, NEBF attempts to recover a column of A; if it fails to do so then the algorithm terminates. Therefore, by construction, at the start of each iteration the columns recovered in previous iterations are complete. On line 3 the subroutine PeelRes (for more details see Algorithm 4 in Appendix B.1) iteratively removes the contributions of complete columns of A from the residual until none of the partial supports extracted match with a complete column. This means that the clusters of partial supports returned by PeelRes all correspond to columns of A that are not yet complete. To define PeelRes we need to introduce the notion of the visible set, which is the set of column indices of A for which there exists at least one partial support extracted from the residual. In step 4 NEBF identifies the cluster of partial supports with the largest cardinality and then in steps 6 and 7 attempts to use this cluster to compute and complete the next column of A. If this step is successful then in lines 8-9 the residual and iteration index are updated and the algorithm repeats. If the construction of the column is unsuccessful, i.e., it is missing at least one entry, then the algorithm terminates.

1:  Init: , , , , , ERROR FALSE
2:  while  and ERROR FALSE do
3:     Iteratively peel residual:
4:     Set to be the cluster of partial supports in with the largest cardinality.
5:     if  then
6:        if  then
7:           Compute the th column and update : .
8:           Compute the residual: .
9:           Update column index:
10:        else
11:           ERROR TRUE
12:        end if
13:     end if
14:  end while
15:  Return
Algorithm 2 NEBF(, , , , , , )

We now present a few properties of NEBF. First, and analogous to Lemma 3, NEBF does not make mistakes.

Lemma 4 (NEBF only identifies correct entries).

Let , where with and . Suppose NEBF terminates at an iteration , then the following statements are true.

  1. For all there exists a permutation matrix such that

    1. where .

    2. From any nonzero , at least singleton values and associated partial supports, each with more than nonzeros, can be extracted and clustered without error.

    3. , and any nonzero in is equal to its corresponding entry in .

  2. NEBF fails only if .

A proof of Lemma 4 can be found in Appendix A.5. One of the key differences between NEBF and EBF is that at every iteration the residual retains the form of an expander matrix multiplied by a column sparse, dissociated matrix: as a result Lemma 2 applies at every iteration. This is covered in more detail in Lemma 4 below.

3.2 Key supporting Lemmas

Before we can proceed to prove Theorem 1 it is necessary for us to introduce a number of key supporting lemmas. First, and taking inspiration from algorithms in the combinatorial compressed sensing literature (in particular Expander Recovery [21]), it holds for both EBF and NEBF that recovery of A is sufficient for the recovery of X. This result will prove useful in what follows as it allows us to study the recovery of A only, rather than of both A and X.

Lemma 5 (Recovery of is sufficient to guarantee success).

Let Y = AX, where A is an expander matrix and X is column sparse and dissociated. If either EBF or NEBF recovers A up to permutation at some iteration, then both are guaranteed to recover X by a subsequent iteration. Therefore, for both algorithms recovery of A is both a necessary and sufficient condition for success and hence under the PSB data model the probability that Y is successfully factorized is equal to the probability that A is recovered.

For a proof of Lemma 5 see Appendix A.6. Second, using Lemmas 3 and 4, we have the following uniqueness result concerning the factorization calculated by EBF and NEBF.

Lemma 6 (Uniqueness of factorization).

Let , where with and . If either EBF or NEBF terminates at an iteration such that then the factorization is not only successful, but is also unique, up to permutation, in terms of factorizations of this form.

For a proof of Lemma 6 see Appendix A.7. The next Lemma states that if NEBF succeeds in computing the factors of then so will EBF. The key implication of this Lemma is that it is sufficient to study NEBF in order to lower bound the probability that EBF is successful.

Lemma 7 (NEBF can be used to lower bound the performance of EBF).

Assuming that Y = AX with A an expander matrix and X column sparse and dissociated, if NEBF successfully computes the factorization of Y up to some permutation of the columns and rows of A and X respectively, then so will EBF. Furthermore, if the probability that NEBF successfully factorizes Y, drawn from the PSB data model, exceeds a given level then the probability that EBF successfully factorizes Y also exceeds that level.

A proof of Lemma 7 can be found in Appendix A.8. Lemma 8 below provides a lower bound on the probability that a column of the random matrix A from the PSB data model can be recovered from a given number of its partial supports.

Lemma 8 ( Column recovery from partial supports).

Under the construction of in Definition 1, consider any . Let be a set of partial supports associated with . Let , where is the unit step function applied elementwise, be the reconstruction of based on these partial supports. With , then

Furthermore, if and , where are constants and , then this upper bound can be simplified as follows,

where is .

For a proof see Appendix A.9. The key takeaway of this lemma is that the probability that NEBF fails to recover a column decreases exponentially in the number of partial supports available to reconstruct it. Finally, Lemma 9 concerns the number of data points required so that each column of A is seen sufficiently many times. To be clear, we require N large enough so that the number of nonzeros per row of X is at least as large as some lower bound with high probability.

Lemma 9 (Nonzeros per row in ).

Under the construction of in Definition 1, with for some , then the probability that has at least non-zeros per row is at least .

For a proof of Lemma 9 see Appendix A.10. With these Lemmas in place we are ready to proceed to the proof of Theorem 1.

3.3 Proof of Theorem 1

Statements 1 and 2 of Theorem 1 are immediate consequences of Lemmas 3 and 6 respectively, so all that is left is to prove statement 3. To quickly recap, our objective is to recover, up to permutation, the random factor matrices A and X, as defined in the PSB data model in Definition 1, from the random matrix Y = AX. Our strategy at a high level is as follows: using Lemmas 5 and 7 we lower bound the probability that EBF factorizes Y by lower bounding the probability that NEBF recovers A. NEBF recovers A up to permutation iff at each iteration of the while loop, lines 2-14 of Algorithm 2, a new column of A is recovered. We lower bound the probability of this using Lemma 8 by first conditioning on there being a certain number of nonzeros per row of X using Lemma 9, and then using a pigeonhole principle argument to ensure that each column of A has sufficiently many partial supports available, with the required number chosen to ensure the desired rate. In what follows we adopt the following notation.

  • The events that EBF and NEBF respectively recover A and X up to permutation, meaning there exists an iteration and a permutation such that the corresponding estimates equal A and X up to that permutation.

  • The event that A is the adjacency matrix of a (k, ε, d) expander graph with the required expansion parameter.

  • For each column index, the event that a column of A is recovered at the corresponding iterate of the while loop on lines 2-14 of Algorithm 2 at some iteration of NEBF. Note that by construction success corresponds to these events holding up to the index of the last column completed before NEBF terminates.

  • The event that each row of X has at least some quantity (yet to be specified) of nonzeros, where this quantity is a function of the problem dimensions.

Proof.

As NEBF recovers A iff at every iteration of the while loop (lines 2-14 of Algorithm 2) a column of A is recovered, then

We now apply Bayes’ Theorem and condition on

,