 # A Clustering Approach to Learn Sparsely-Used Overcomplete Dictionaries

We consider the problem of learning overcomplete dictionaries in the context of sparse coding, where each sample selects a sparse subset of dictionary elements. Our main result is a strategy to approximately recover the unknown dictionary using an efficient algorithm. Our algorithm is a clustering-style procedure, where each cluster is used to estimate a dictionary element. The resulting solution can often be further cleaned up to obtain a high accuracy estimate, and we provide one simple scenario where ℓ_1-regularized regression can be used for such a second stage.


## 1 Introduction

The dictionary learning problem is as follows: given observations $Y$, the task is to factorize it as

$$Y = AX, \qquad Y \in \mathbb{R}^{d \times n},\quad A \in \mathbb{R}^{d \times r},\quad X \in \mathbb{R}^{r \times n}, \tag{1}$$

where $X$ is referred to as the coefficient matrix and the columns of $A$ are referred to as the dictionary elements. There are infinitely many factorizations satisfying (1) unless further constraints are imposed. A natural assumption is that the coefficient matrix $X$ is sparse, and in fact, that each sample selects a sparse subset of dictionary elements from $A$. This instance of dictionary learning is popularly known as the sparse coding problem [30, 25]. It has been argued that sparse coding can provide a succinct representation of the observed data, given only unlabeled samples. Through this lens of unsupervised learning, dictionary learning has received increased attention from the machine learning community in the last few years; see Section 1.2 for a brief survey.

Although the above problem has been extensively studied, most of the methods are heuristic and lack guarantees. Spielman et al. provide exact recovery results for this problem, when the coefficient matrix has Bernoulli-Gaussian entries and the dictionary matrix has full column rank. This condition entails that the dictionary is undercomplete, i.e., the observed dimensionality $d$ needs to be greater than the number of dictionary elements $r$. However, for most practical settings, it has been argued that overcomplete representations, where $r > d$, are far more relevant, and can provide greater flexibility in modeling as well as greater robustness to noise [26, 12]. Moreover, in the context of blind source separation (BSS) of audio, image or video signals, the dictionary learning problem is typically overcomplete, since there are more sources than observations. In this work, we provide guaranteed methods for learning overcomplete dictionaries.

### 1.1 Summary of Results

In this paper we present a novel algorithm for the estimation of overcomplete dictionaries. The algorithm can be seen as a clustering-style method followed by a singular value decomposition (SVD) within each cluster, resulting in an estimate for each dictionary element. The clusters are formed based on the magnitudes of the correlation between pairs of samples. Under our probabilistic model of generating data as well as assumptions on the coefficients and dictionaries, it can be guaranteed that such a procedure approximately recovers the unknown overcomplete dictionary. Under further conditions, it is often possible to start with this approximate solution and perform additional post-processing on it to obtain arbitrarily good estimates of the dictionary. We present one such set of conditions under which sparse regression can be used for this post-processing. More advanced post-processing methods have been developed in subsequent works [1, 8].

We consider a random coefficient matrix, where each column of $X$ has $s$ non-zero entries which are randomly chosen, i.e., each sample selects $s$ dictionary elements uniformly at random. We additionally assume that the dictionary elements are pairwise incoherent and that the dictionary matrix satisfies a certain bound on the spectral norm. Under these conditions, we establish that our algorithm estimates the dictionary elements with bounded (constant) error when the number of samples scales as , and when the sparsity . To the best of our knowledge, this is the first result of its kind which analyzes the global recovery properties of a computationally efficient procedure in the setup of overcomplete dictionary learning.

In the special case when the coefficients are -valued with zero mean, the resulting solution from the first step can be further plugged into any sparse regression algorithm for estimating the coefficients given this dictionary estimate. Under a more stringent sparsity constraint: , it can be shown that this second step will recover the coefficients exactly even from this approximate dictionary, which then also leads to an exact recovery of the dictionary by solving the linear system. Hence, we provide a simple method for exactly recovering the unknown dictionary in this special case. A natural generalization of this procedure to general weights is analyzed using an alternating minimization procedure in a subsequent work.

We outline our method as well as our analysis techniques in Section 1.3. This is the first work to provide a tractable method for guaranteed recovery of overcomplete dictionaries, and we discuss the previous results below. Finally, concurrently with our work, an approximate recovery result with a similar procedure was recently announced by Arora et al. . A detailed discussion comparing our and their results is presented in Section 1.2.

### 1.2 Related Works

This work overlaps with and relates to prior works in many different communities and we discuss them below in turn.

##### Dictionary Learning:

Hillar and Sommer consider conditions for identifiability of sparse coding and establish that when the dictionary succeeds in reconstructing a certain set of sparse vectors, there exists a unique sparse coding, up to permutation and scaling. However, the number of samples required to establish identifiability is exponential in  for the general case. In contrast, we show that efficient recovery is possible using  samples, albeit under additional conditions such as incoherence among the dictionary elements.

Spielman et al. provide exact recovery results for an $\ell_1$-based method in the undercomplete setting, where $r \le d$. In contrast, we allow for the overcomplete setting where $r > d$. There exist a plethora of heuristics for dictionary learning, which work well in practice in many contexts, but lack theoretical guarantees. For instance, Lee et al. propose iterative $\ell_1$ and $\ell_2$ optimization procedures. This is similar to the method of optimal directions (MOD). Another popular method is the so-called K-SVD, which iterates between estimation of $X$ and, given an estimate of $X$, an update of the dictionary estimate using a spectral procedure on the residual. Other works consider more sophisticated methods from an optimization viewpoint while still alternating between dictionary and coefficient updates [24, 18]. Geng et al. and Jenatton et al. study the local optimality properties of an alternating minimization procedure. In contrast, our work focuses on global properties of a procedure that is more combinatorial than several of the above works, which have more of an optimization flavor. The upshot is that our procedure, while still being computationally quite efficient, is able to guarantee global bounds on the quality of the solution obtained.

Recent works [34, 29, 27, 32] provide generalization bounds and algorithmic stability for predictive sparse coding, where the goal of the learned sparse representation is to obtain good performance on some predictive task. This differs from our framework since we do not consider predictive tasks here, but the accuracy in recovering the underlying dictionary elements.

Finally, our results are closely related to the very recent work of Arora et al., carried out independently and concurrently with our work. There are however some important distinctions: we require only  samples in our analysis, while Arora et al. require  samples in their result. At the same time, their analysis yields milder conditions on the sparsity level in terms of its dependence on  and . Following this work, Arora et al. and Agarwal et al. also developed post-processing techniques which can be thought of as more advanced variants of the simpler sparse-regression step that we analyze. These subsequent works view the methods developed here as initialization procedures for alternating optimization schemes.

##### Blind Source Separation/ICA/Topic Models:

The problem of dictionary learning is applicable to blind source separation (BSS), where the rows of $X$ are signals from the sources and $A$ represents the linear mixing matrix. The term blind implies that the dictionary matrix $A$ is unknown and needs to be jointly estimated with the coefficient matrix $X$, given samples $Y$. This problem has been extensively studied and the most popular setting is independent component analysis (ICA), where the sources are assumed to be independent. In contrast, for the sparse component analysis problem, no assumptions are made on the statistics of the sources. Many works provide guarantees for ICA in the undercomplete setting, where there are fewer sources than observations [21, 9, 4], and some works provide guarantees in the overcomplete setting [14, 19]. However, the techniques are very different since they rely on the independence among the sources. The problem of learning topic models can be cast as a similar factorization problem, where now $A$ corresponds to the topic-word matrix and $X$ corresponds to the proportions of topics in various documents. There are various recent works providing guaranteed methods for learning topic models, e.g., [2, 7, 6, 5]. However, these works make different assumptions on either $A$ or $X$ or both to guarantee recovery. For instance, the work  assumes that the topic-word matrix has rows such that for each column, only the entry corresponding to that column is non-zero. The work  assumes expansion conditions on $A$ and provides recovery through $\ell_1$-based optimization. We note that the techniques of  are related to those employed by Spielman et al. for dictionary learning, but make different assumptions. All these works only deal with the undercomplete setting. The recent work  considers topic models in the overcomplete setting, and provides guarantees when $A$ satisfies certain higher order expansion conditions. The techniques are very different from the ones employed here since they involve higher order moments and tensor forms.

##### Connection to Learning Overlapping Communities:

Our initial step for estimating the dictionary elements involves finding large cliques in the sample correlation graph, where the nodes are the samples and the edges represent sufficiently large correlations among the endpoints. The clique finding problem is a special instance of the overlapping community detection problem, which has been studied in various contexts, e.g. [3, 11, 10, 22, 28]. However, the correlation graph here has different kinds of constraints than the ones studied before as follows. In our setting involving noise-free dictionary learning, each community corresponds to a clique and there are no edges across two different communities. In contrast, many works on community detection are concerned about handling noise efficiently, where each community is not a full clique, and there are edges across different communities. Here, we need to learn overlapping communities, while many community detection methods limit to learning non-overlapping ones. In our setting, we argue that the overlap across different communities is small under a random coefficient matrix, and thus, we can find the communities efficiently through simple random sampling and neighborhood testing procedures.

### 1.3 Overview of Techniques

As stated earlier, our main algorithm consists of a clustering procedure which yields an approximate estimate of the dictionary. This estimate can be subsequently post-processed for exact recovery of the dictionary under certain further conditions. Below we give the outline and the main intuition underlying these procedures and their analysis.

##### Dictionary estimation via clustering:

This step first involves construction of the sample correlation graph $G_{\mathrm{corr}}(\rho)$, where the nodes are the samples $\{y_1, \dots, y_n\}$ and an edge $(y_i, y_j) \in G_{\mathrm{corr}}(\rho)$ implies that $|\langle y_i, y_j\rangle| > \rho$, for some $\rho > 0$. We then employ a clustering procedure on the graph to obtain a subset of samples, which are then employed to estimate each dictionary element. Roughly, we search for large cliques in the correlation graph and obtain a spectral estimate of each dictionary element using samples from such sets.
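The graph construction can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `correlation_graph` and the toy data are hypothetical, and the edge rule used is that two samples are joined when the absolute value of their inner product exceeds the threshold `rho`.

```python
import numpy as np

def correlation_graph(Y, rho):
    """Boolean adjacency matrix with an edge (k, l) iff |<y_k, y_l>| > rho.

    Y holds one sample per column; self-loops are removed.
    (Hypothetical helper, not taken from the paper.)
    """
    G = np.abs(Y.T @ Y) > rho
    np.fill_diagonal(G, False)
    return G

# Toy example with an orthonormal dictionary (assumed here for clarity):
# samples 0 and 1 share dictionary element 0, sample 2 shares nothing.
A_toy = np.eye(5)
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, -1, 0],
                  [0, 0, 1],
                  [0, 0, 1]], dtype=float)
G_toy = correlation_graph(A_toy @ X_toy, rho=0.5)
```

In the toy example only the pair of samples that share a dictionary element ends up connected, which is the behaviour the clustering step relies on.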

##### Key intuitions for the clustering procedure:

The core intuitions can be described in terms of the relationships between two graphs, viz., the coefficient bipartite graph $B$ and the sample correlation graph $G_{\mathrm{corr}}(\rho)$, shown in Figures (a) and (b). As described earlier, the correlation graph consists of edges between well correlated samples. The coefficient bipartite graph $B$ consists of dictionary elements on one side and the samples on the other, and it encodes the sparsity pattern of the coefficient matrix $X$. In other words, it maps the dictionary elements to the samples on which they are supported, and $N_B(y_i)$ denotes the neighborhood of $y_i$ in the graph $B$.

Now given this bipartite graph $B$, for each dictionary element $a_i$, consider a set of samples (such a set need not be unique) which pairwise have only one dictionary element in common, and denote such a set by $C_i$, i.e.

$$C_i := \{y_k,\ k \in S : N_B(y_k) \cap N_B(y_l) = \{a_i\},\ \forall\, k, l \in S\}. \tag{2}$$

For a random coefficient matrix (resulting in a random bipartite graph), we argue that there exist (large) sets $C_i$, for each $i \in [r]$, which consist of a large fraction of $N_B(a_i)$, and no two elements $a_i$ and $a_j$ have a large fraction of samples in common. In other words, for random coefficient matrices, we see a diversity in the dictionary elements among the samples, and this can be viewed as an expansion property from the dictionary elements to the set of samples. We exploit this property to establish success for our method.

Our subsequent analysis is broadly divided into two parts, viz., establishing that (large) sets $C_i$ can be found efficiently, and that the dictionary elements can be estimated accurately once such sets are found. We establish that the sets $C_i$ are cliques in the correlation graph when the dictionary elements are incoherent, as shown in Figure (b). Combined with the previous argument that the different sets $C_i$ have only a small amount of overlap for random coefficient matrices, we argue that these sets can be found efficiently through simple random sampling and neighborhood testing on the correlation graph. Once a large enough set $C_i$ is found, we argue that under incoherence, the dictionary element $a_i$ can be estimated accurately through SVD over the samples in $C_i$.

##### Sparse regression for post-processing:

This is a relatively straightforward procedure. Once an initial estimate of the dictionary matrix is obtained, we estimate the coefficient matrix through any sparse regression procedure (such as Lasso) and then perform thresholding on the recovered coefficients. Now, we re-estimate the dictionary, given this coefficient matrix, by solving another linear system. This provides us with a final estimate of both the dictionary as well as the coefficient matrix.
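The three steps above (sparse regression, thresholding, re-solving for the dictionary) can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: it swaps in a penalized Lasso solved by iterative soft-thresholding (ISTA), it assumes non-zero coefficients equal to $\pm 1$, and the names `lasso_ista` and `post_process` as well as the values of `lam` and `thresh` are hypothetical choices.

```python
import numpy as np

def lasso_ista(A, Y, lam, n_iter=500):
    """Approximately minimise 0.5*||Y - A X||_F^2 + lam*||X||_1 by ISTA.

    All columns of Y are handled at once. (Illustrative solver choice.)
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    X = np.zeros((A.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        Z = X - step * (A.T @ (A @ X - Y))    # gradient step on the quadratic term
        X = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)  # soft-threshold
    return X

def post_process(Y, A_bar, lam=0.05, thresh=0.5):
    """Sparse regression with the approximate dictionary A_bar, then thresholding,
    then a least-squares re-estimate of the dictionary."""
    X_hat = lasso_ista(A_bar, Y, lam)
    # Assumption: non-zero coefficients are +/-1, so snap large entries to their sign.
    X_hat = np.where(np.abs(X_hat) > thresh, np.sign(X_hat), 0.0)
    # Solve Y = A X_hat for A in the least-squares sense.
    A_hat = Y @ np.linalg.pinv(X_hat)
    return A_hat, X_hat
```

When the thresholded coefficients coincide with the true ones and the coefficient matrix has full row rank, the last step returns the true dictionary, which is the mechanism the exact recovery argument relies on.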

Since we only have a noisy estimate of the dictionary, our analysis here is slightly different from the usual analysis for a sparse linear system. The noise in our system is dependent on the approximate dictionary employed, which differs from the typical statistical setting, where noise is assumed to be independent. We exploit the known guarantees available for Lasso under deterministic noise  for our setting. Combining Lasso with a simple thresholding procedure, we guarantee exact recovery of the coefficient matrix, albeit under a more stringent condition on the sparsity and the coefficient values (namely zero mean and -valued ). The dictionary is then re-estimated by solving another linear system, which is of course correct owing to the exact estimation of the coefficient matrix.

## 2 Method and Guarantees

##### Notation:

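Let $[n] := \{1, 2, \dots, n\}$, and for a vector $v$, let $\mathrm{Supp}(v)$ denote the support of $v$, i.e. the set of indices where $v$ is non-zero. Let $\|v\|$ denote the $\ell_2$ norm of vector $v$, and similarly for a matrix $W$, $\|W\|$ denotes its spectral norm. Let $A = [a_1 | a_2 | \dots | a_r]$, where $a_i$ denotes the $i$-th column, and similarly for $Y$ and $X$. For a graph $G$, let $N_G(i)$ denote the set of neighbors for node $i$ in $G$.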

### 2.1 Clustering procedure and its analysis

We start with presenting the main algorithm of our work and bound the recovery error under certain assumptions.

#### 2.1.1 Algorithm

Our main algorithm is presented in Algorithm 1. Given samples $Y$, we first construct the correlation graph $G_{\mathrm{corr}}(\rho)$, where the nodes are the samples $\{y_1, \dots, y_n\}$ and an edge $(y_i, y_j) \in G_{\mathrm{corr}}(\rho)$ implies that $|\langle y_i, y_j\rangle| > \rho$, for some threshold $\rho > 0$. We then determine a good subset of samples via a clustering procedure on the graph as follows: we first randomly sample an edge $(y_{i^*}, y_{j^*})$ and then consider the intersection of the neighborhoods of its endpoints, denoted by $\hat{S}$. We then employ the UniqueIntersection routine in Procedure 1 to determine if $\hat{S}$ is a "good set" for estimating a dictionary element, and this is done by ensuring that the set $\hat{S}$ has a sufficient number of mutual neighbors in the correlation graph. (For convenience, to avoid dependency issues, in Procedure 1 we partition $\hat{S}$ into sets consisting of node pairs and determine if there is a sufficient number of node pairs which are neighbors.) Once $\hat{S}$ is determined to be a good set, we proceed by estimating a matrix using the samples in $\hat{S}$ and output its top singular vector as the estimate of a dictionary element. The method is repeated over all edges in the correlation graph to ensure that all the dictionary elements get estimated with high probability.
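A simplified sketch of this clustering-plus-SVD loop is given below. It is not the paper's Algorithm 1 or Procedure 1, which are not reproduced in this text: the pair-fraction test `min_pair_fraction`, the de-duplication tolerance `dedup_tol`, the cap `max_edges` on the number of anchor edges tried, and all function names are assumptions made for illustration.

```python
import numpy as np

def unique_intersection(S, G, min_pair_fraction):
    """Split S into disjoint node pairs and check that enough pairs are edges.
    (Illustrative stand-in for the UniqueIntersection routine.)"""
    h = len(S) // 2
    if h == 0:
        return False
    u, v = S[0:2 * h:2], S[1:2 * h:2]
    return G[u, v].mean() >= min_pair_fraction

def cluster_and_svd(Y, rho, min_pair_fraction=0.75, dedup_tol=0.5,
                    max_edges=500, seed=0):
    """Estimate dictionary elements from randomly chosen anchor edges."""
    rng = np.random.default_rng(seed)
    G = np.abs(Y.T @ Y) > rho
    np.fill_diagonal(G, False)
    edges = rng.permutation(np.argwhere(np.triu(G, k=1)))  # random edge order
    atoms = []
    for i_star, j_star in edges[:max_edges]:
        # Common neighbours of the two anchor samples.
        S = np.flatnonzero(G[i_star] & G[j_star])
        if not unique_intersection(S, G, min_pair_fraction):
            continue
        # Top singular vector of the sum of outer products y_k y_k^T over S.
        L = Y[:, S] @ Y[:, S].T
        a_hat = np.linalg.svd(L)[0][:, 0]
        # Keep a_hat only if it is far (up to sign) from every earlier estimate.
        if all(min(np.sum((a_hat - b) ** 2), np.sum((a_hat + b) ** 2)) > dedup_tol
               for b in atoms):
            atoms.append(a_hat)
    if not atoms:
        return np.zeros((Y.shape[0], 0))
    return np.column_stack(atoms)
```

Capping the number of anchor edges keeps the sketch fast; the description above instead repeats the step over all edges of the correlation graph.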

#### 2.1.2 Assumptions and Main Result

##### Assumptions:

We now provide guarantees for the proposed method under the following assumptions on $A$ and $X$.

1. Unit-norm Dictionary Elements: All the elements are normalized: $\|a_i\| = 1$, for $i \in [r]$.

2. Incoherent Dictionary Elements: We assume a pairwise incoherence condition on the dictionary elements, for some constant $\mu_0 > 0$,

 $$|\langle a_i, a_j\rangle| < \frac{\mu_0}{\sqrt{d}}. \tag{3}$$

3. Spectral Condition on Dictionary Elements: The dictionary matrix has bounded spectral norm, for some constant $\mu_1 > 0$,

 $$\|A\| < \mu_1 \sqrt{\frac{r}{d}}. \tag{4}$$

4. Entries in Coefficient Matrix: We assume that the non-zero entries of $X$ are drawn from a zero-mean distribution supported on $[-M, -m] \cup [m, M]$ for some fixed constants $m$ and $M$.

5. Sparse Coefficient Matrix: The columns of the coefficient matrix have a bounded number of non-zero entries which are selected randomly, i.e.

 $$|\mathrm{Supp}(x_i)| = s, \quad \forall\, i \in [n]. \tag{5}$$

We require to be

 s

for some small enough constant .

6. Sample Complexity: Given a parameter $\alpha$ (which is related to the error in recovery of the dictionary, see Theorem 2.1), and a universal constant $c$, choose $\delta$ and the number of samples $n$ such that

 $$n := n(d, r, s, \delta, \alpha) = \frac{c\, r}{\alpha^2 s} \log\frac{d}{\delta}, \qquad n^2 \delta < 1.$$

7. Choice of Threshold for Correlation Graph: The correlation graph is constructed using a threshold $\rho$ such that

 $$\rho = \frac{m^2}{2} - \frac{s^2 M^2 \mu_0}{\sqrt{d}} > 0. \tag{7}$$

8. Choice of Separation Parameter between Estimated Dictionary Elements: This is the desired accuracy of the estimated dictionary elements to the true dictionary elements using just the initialization step. It can be chosen to be:

 $$\frac{32\, s M^2}{m^2}\left(\frac{\mu_1}{\sqrt{ds}} + \frac{\mu_1^2}{d} + \frac{s^3}{r} + \alpha^2 + \frac{\alpha}{\sqrt{s}}\right) < \epsilon_{\mathrm{dict}}^2 < \frac{1}{4}. \tag{8}$$

The assumption (A1) on normalization is without loss of generality since we can always rescale the dictionary elements and the corresponding coefficients and obtain the same observations. The assumption (A2) on incoherence is crucial to our analysis. In particular, incoherence also leads to a bound on the RIP constant; see Lemma A.5 in Appendix A.6. The assumption (A3) provides a bound on the spectral norm of $A$.

The assumption (A4) states that the non-zero entries of $X$ are drawn from a zero-mean distribution with natural upper and lower bounds on the coefficients. Note that a similar assumption is made in the work of Arora et al.

The assumption (A5) on sparsity in the coefficient matrix is crucial for identifiability of the dictionary learning problem. We require the sparsity not to be too large for recovery.

The assumption (A6) provides a bound on sample complexity. We subsequently establish that in order to have decaying error for recovery of dictionary elements, we require  samples for recovery. Thus, we obtain a nearly linear sample complexity for our method.

Assumption (A7) specifies the threshold for the construction of the correlation graph. Intuitively, we require a threshold such that we can distinguish pairs of samples which share a dictionary element from those which do not.
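For experimentation, data roughly matching these assumptions can be generated as below. This is a sketch under stated assumptions: a random Gaussian dictionary with normalized columns is one convenient (hypothetical) way to obtain incoherent unit-norm elements, non-zero magnitudes are drawn uniformly from $[m, M]$ with random signs so that they have zero mean, and the names `sample_sparse_model` and `empirical_mu0` are introduced here for illustration.

```python
import numpy as np

def sample_sparse_model(d, r, s, n, m=1.0, M=2.0, seed=0):
    """Draw (A, X, Y) with Y = A X, unit-norm columns of A, and exactly
    s non-zeros per column of X, placed uniformly at random."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, r))
    A /= np.linalg.norm(A, axis=0, keepdims=True)        # unit-norm elements
    X = np.zeros((r, n))
    for k in range(n):
        support = rng.choice(r, size=s, replace=False)    # random s-subset of [r]
        magnitudes = rng.uniform(m, M, size=s)            # |x| in [m, M]
        signs = rng.choice([-1.0, 1.0], size=s)           # symmetric, hence zero mean
        X[support, k] = signs * magnitudes
    return A, X, A @ X

def empirical_mu0(A):
    """Smallest mu0 with |<a_i, a_j>| <= mu0 / sqrt(d) for all i != j."""
    gram = np.abs(A.T @ A)
    np.fill_diagonal(gram, 0.0)
    return np.sqrt(A.shape[0]) * gram.max()
```

Checking `empirical_mu0(A)` and the spectral norm of `A` on a generated instance gives a quick sense of how restrictive the incoherence and spectral conditions are for a given $d$ and $r$.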

##### Main Result:

We now present our main result which bounds the error in the estimates of Algorithm 1.

###### Theorem 2.1 (Approximate recovery of dictionary).

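Suppose the output of Algorithm 1 is $\bar{A}$. Then with probability greater than , there exists a permutation matrix $P$ such that:

$$\epsilon_A^2 := \min_{i \in [r]} \min_{z \in \{-1, +1\}} \left\| z a_i - (P\bar{A})_i \right\|_2^2 < \frac{32\, s M^2}{m^2}\left(\frac{\mu_1}{\sqrt{ds}} + \frac{\mu_1^2}{d} + \frac{s^3}{r} + \alpha^2 + \frac{\alpha}{\sqrt{s}}\right). \tag{9}$$
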
##### Remark:

Note that we have a sign ambiguity in recovery of the dictionary elements, since we can exchange the signs of the dictionary elements and the coefficients to obtain the same observations. The assumption (A5) on sparsity implies that the first two terms in (9) decay. For the third term in (9) to decay, we require  instead of  as in . Moreover, we require that . Since the sample complexity in (A6) scales as , we require  samples for recovery of the dictionary with decaying error. Thus, we obtain a near linear sample complexity for our method. We observe that the error in our estimation depends inversely on dimension-related quantities such as $d$ and $r$ and not on the number of samples $n$. This is because the errors in our estimates arise from errors in the SVD step, specifically from the discrepancy between the SVD vector and the dictionary element responsible for a cluster. Even the population SVD will suffer from an approximation error here, which is responsible for our error bound, but the probability in the error bound improves with the number of samples as we get closer and closer to the population SVD estimate.

### 2.2 Post-processing for binary coefficients

We now present the post-processing step which will be analyzed under a more stringent condition on the coefficients.

#### 2.2.1 Algorithm

Once we obtain an estimate of the dictionary elements, we proceed to estimate the coefficient matrix. The main observation at this step is that the coefficient vector for each sample is an $s$-sparse vector in $r$ dimensions. Hence, recovering the coefficients would be a standard sparse linear problem if we knew the dictionary exactly. Our analysis will show that even an approximately correct dictionary from Algorithm 1 suffices to provide guarantees for this recovery. Once the coefficients are estimated, the dictionary can be re-estimated by solving another linear system. The procedure is formally described in Algorithm 2. We do not prescribe any particular choice of computational procedure to solve the optimization problem (10), but there are many algorithms available in the standard literature. As a concrete example, the GraDeS algorithm of Garg and Khandekar or the OMP algorithm of Tropp and Gilbert works in our setting.
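As one possible choice of solver, orthogonal matching pursuit (OMP) can be sketched as follows. This is a textbook-style version with the sparsity level `s` assumed known, not code from the paper; the function name `omp` and its signature are hypothetical.

```python
import numpy as np

def omp(A, y, s):
    """Greedy s-sparse approximation of y in the columns of A.

    At each step the column most correlated with the current residual is added
    to the support, and the coefficients on the support are refit by least squares.
    """
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ residual)))   # best-matching column
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef          # orthogonal to chosen columns
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x
```

Running `omp` on each column of $Y$ with the approximate dictionary in place of `A` gives a coefficient estimate that can then be thresholded as described above.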

#### 2.2.2 Exact recovery for Bernoulli coefficients

Our second result is that under stronger conditions than before, it is possible to exactly recover the unknown dictionary with high probability. This result will be obtained by initializing Algorithm 2 with the output of Algorithm 1. We start with the additional assumptions, putting restrictions on the allowed sparsity level as a function of $d$ and $r$.

###### Assumption B1 (Conditions for exact recovery).

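The non-zero coefficients in the coefficient matrix are zero-mean Bernoulli. This corresponds to setting $m = M = 1$ in Assumption (A4).

The sparsity level $s$, the number of dictionary elements $r$ and the observed dimension $d$ satisfy

$$32\, s\left(\frac{\mu_1}{\sqrt{ds}} + \frac{\mu_1^2}{d}\right) \le \frac{1}{1200\, s^2}, \qquad \text{and} \qquad \frac{32\, s^4}{r} \le \frac{1}{1200\, s^2}.$$

The constant $\alpha$ in Theorem 2.1 satisfies

$$32\, s\left(\alpha^2 + \frac{\alpha}{\sqrt{s}}\right) \le \frac{1}{1200\, s^2}.$$

The number of samples $n$, in addition to assumption (A6), satisfies

$$n \ge \frac{4r}{c_0} \log\frac{d}{\delta},$$

where $c_0$ is a universal constant.

The accuracy parameter in Algorithm 2 is chosen as , where $\epsilon_A$ is the error in estimating the dictionary elements in (9).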

###### Theorem 2.2 (Exact recovery for Bernoulli coefficients).

Suppose the conditions of Theorem 2.1 hold and, in addition, Assumption B1 holds. Then the output of Algorithm 2 initialized with Algorithm 1 satisfies  up to permutation of columns, with probability at least .

##### Remark:

Assumption B1 for exact recovery places more stringent conditions on the distribution of the coefficients and the sparsity level , compared to for approximate recovery. While for approximate recovery, we require , in Assumption B1, we require for exact recovery. Note that the additional constraint on sample complexity in Assumption B1 still has the same scaling, and thus, suffices both for approximate and exact recovery.

We also observe that the result of Theorem 2.2 relies on Algorithm 1 as the initialization procedure, but in principle we can also use a different approximate recovery procedure to initialize Algorithm 2. In particular, a different initialization procedure with a better error guarantee would also directly translate to better recovery properties in the second step, in terms of the assumptions relating to and . Understanding these issues appears to be an interesting direction for future research.

## 3 Proofs of main results

In this section we will present the proofs of our main results, Theorems 2.1 and 2.2. We will start by presenting a host of useful lemmas, and sketch out how they fit together to yield the main results before moving on to the proofs.

### 3.1 Correlation graph properties

In this section we will present some useful properties of the correlation graph described in Section 1.3. Recall that the nodes of $G_{\mathrm{corr}}(\rho)$ are the samples $\{y_1, \dots, y_n\}$ and an edge $(y_i, y_j) \in G_{\mathrm{corr}}(\rho)$ implies that $|\langle y_i, y_j\rangle| > \rho$, for some $\rho > 0$. This is employed by Algorithm 1 as a proxy for identifying samples which have common dictionary elements. We now make this connection concrete in the next few lemmas. For this we also recall our notation $N_B(y_i)$, which is the neighborhood of a sample $y_i$ in the coefficient bipartite graph $B$ (see Figure (a)), that is, the set of dictionary elements that combine to yield $y_i$.

###### Lemma 3.1 (Correlation graph).

Under the incoherence assumption (A2) and the threshold in assumption (A7), the following is true for the edges in the correlation graph $G_{\mathrm{corr}}(\rho)$:

$$\begin{aligned}
|N_B(y_k) \cap N_B(y_l)| = 1 \ &\Rightarrow\ (y_k, y_l) \in G_{\mathrm{corr}}(\rho), \quad \forall\, i \in [r], && (11)\\
(y_k, y_l) \in G_{\mathrm{corr}}(\rho) \ &\Rightarrow\ |N_B(y_k) \cap N_B(y_l)| \ge 1, && (12)
\end{aligned}$$

for all pairs of samples $y_k, y_l$.

Lemma 3.1 suggests that nodes which intersect in exactly one dictionary element are special, in that they are guaranteed to have an edge between them in $G_{\mathrm{corr}}(\rho)$. Our next lemma works towards establishing something even stronger. We will next establish that there are large cliques in the correlation graph where any two samples in the clique intersect in the same unique dictionary element. In order to state the lemma, we need some additional notation.
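The separation behind Lemma 3.1 can be sketched in two lines. This is an illustration rather than the paper's proof: it assumes unit-norm dictionary elements, the incoherence bound (3), non-zero coefficient magnitudes between $m$ and $M$, and it writes $x_k(p)$ (a notation introduced here) for the coefficient of $a_p$ in $y_k$. If $y_k$ and $y_l$ share exactly one dictionary element $a_i$, then

$$|\langle y_k, y_l\rangle| = \Big| x_k(i)\, x_l(i) + \sum_{(p,q) \neq (i,i)} x_k(p)\, x_l(q)\, \langle a_p, a_q\rangle \Big| > m^2 - \frac{s^2 M^2 \mu_0}{\sqrt{d}},$$

because the first product has magnitude at least $m^2$ and each of the at most $s^2 - 1$ remaining terms involves two distinct elements and is therefore smaller than $M^2 \mu_0 / \sqrt{d}$ in magnitude. If instead the two samples share no dictionary element, all $s^2$ terms are of the second kind, so

$$|\langle y_k, y_l\rangle| < \frac{s^2 M^2 \mu_0}{\sqrt{d}}.$$

Any threshold $\rho$ lying between these two bounds therefore places an edge between every pair of the first kind and none between pairs of the second kind.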

For each dictionary element $a_i$, consider a set of samples $\{y_k, k \in S\}$ (such a set need not be unique), for some $S \subseteq [n]$, such that they only have $a_i$ in common, and denote such a set by $C_i$, i.e.

$$C_i := \{y_k,\ k \in S : N_B(y_k) \cap N_B(y_l) = \{a_i\},\ \forall\, k, l \in S\}. \tag{13}$$

Lemma 3.1 implies that in the correlation graph, the set of nodes in $C_i$ forms a clique (not necessarily maximal), for each $i \in [r]$, as shown in Figure (b). The above implication can be exploited for recovery of dictionary elements: if we find the set $C_i$, then we can hope to recover the element $a_i$, since that is the only element in common to the samples in $C_i$.

For ease of stating the next lemma, we further define two shorthand notations.

$$\text{Uniq-intersect}(y_i, y_j) := \{(y_i, y_j) \in G_{\mathrm{corr}}(\rho) \ \text{and}\ |N_B(y_i) \cap N_B(y_j)| = 1\}. \tag{14}$$

Intuitively, the samples satisfying $\text{Uniq-intersect}(y_i, y_j)$ are guaranteed to have an edge between them by Lemma 3.1. In order to guarantee large cliques, we will also need to measure the number of triangles in $G_{\mathrm{corr}}(\rho)$.

In order to do this, given anchor samples $y_{i^*}$ and $y_{j^*}$ that have a unique intersection, we now bound the probability that a randomly chosen sample $y_i$, among the neighborhood set of $y_{i^*}$ and $y_{j^*}$ in the correlation graph, also has a unique intersection. Now define the unique intersection event for a new sample $y_i$ with respect to anchor samples $y_{i^*}$ and $y_{j^*}$ as follows

$$\text{Uniq-intersect}(y_i; y_{i^*}, y_{j^*}) := \{N_B(y_i) \cap N_B(y_{i^*}) = N_B(y_i) \cap N_B(y_{j^*}) = \{a_k\}\}, \tag{15}$$

where $a_k$ is the unique intersection of the anchor samples $y_{i^*}$ and $y_{j^*}$. In other words, $\text{Uniq-intersect}(y_i; y_{i^*}, y_{j^*})$ indicates the event that the pairwise intersections of the new sample $y_i$ with each of the anchors $y_{i^*}$ and $y_{j^*}$ are unique and equal to the unique intersection of $y_{i^*}$ and $y_{j^*}$.

###### Lemma 3.2 (Formation of clique under good anchor samples).

$$\mathbb{P}\left[\text{Uniq-intersect}(y_i; y_{i^*}, y_{j^*}) \,\middle|\, \text{Uniq-intersect}(y_{i^*}, y_{j^*}) \text{ and } (y_i, y_{i^*}), (y_i, y_{j^*}) \in G_{\mathrm{corr}}(\rho)\right] \ge 1 - \frac{s^3}{r}.$$

Lemma 3.2 is crucial for our algorithm. It guarantees that given a pair of good anchor elements (a pair satisfying the unique intersection property), a large fraction of their neighbors also contain this common dictionary element. Some further arguments can then be made to establish that a large fraction of the neighbors of $y_{i^*}$ and $y_{j^*}$ also have edges amongst themselves and hence form cliques as defined in Equation (13).

### 3.2 Correctness of Procedure 1

A key component in our analysis is the correctness of Procedure 1. As we saw in the previous lemmas, it is crucial for a chosen pair of anchor elements to have a unique intersection in order to use them for identifying large cliques in $G_{\mathrm{corr}}(\rho)$. Procedure 1 plays a crucial role by providing a verifiable test for whether a pair of anchor elements has a unique intersection or not. Our next two lemmas help us establish that this test is sound with high probability. We first show that two neighbors of a bad anchor pair do not have an edge between them with at least constant probability.

Denote the event

$$\Delta(y_i, y_j, y_k) := \{(y_i, y_j), (y_j, y_k), (y_i, y_k) \in G_{\mathrm{corr}}(\rho)\},$$

i.e., the samples $y_i, y_j, y_k$ form a triangle in the correlation graph.

###### Lemma 3.3 (Detection of bad anchor samples).

For randomly chosen samples $y_i, y_j$,

$$\mathbb{P}\left[(y_i, y_j) \notin G_{\mathrm{corr}}(\rho) \,\middle|\, \Delta(y_i, y_{i^*}, y_{j^*}),\ \Delta(y_j, y_{i^*}, y_{j^*}),\ \neg\,\text{Uniq-intersect}(y_{i^*}, y_{j^*})\right] > \frac{1}{16}.$$

Intuitively, this means that the number of sets  which will be edges in $G_{\mathrm{corr}}(\rho)$ is rather small for an anchor pair with multiple dictionary elements in common. For the procedure to be correct, we will in fact need this number to be substantially smaller than that for a good anchor pair. This is indeed the case as we next establish.

###### Lemma 3.4 (Detection of good anchor samples).

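For randomly chosen samples $y_i, y_j$,

$$\mathbb{P}\left[(y_i, y_j) \notin G_{\mathrm{corr}}(\rho) \,\middle|\, \Delta(y_i, y_{i^*}, y_{j^*}),\ \Delta(y_j, y_{i^*}, y_{j^*}),\ \text{Uniq-intersect}(y_{i^*}, y_{j^*})\right] \le \frac{24\, s^3}{r}.$$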

Combining the above two lemmas, the correctness of Procedure 1 naturally follows.

###### Proposition 3.1 (Correctness of Procedure 1).

Suppose . Suppose that and . Then Algorithm 1 returns the value of correctly with probability greater than .

### 3.3 Proof of Theorem 2.1

In this section we will put all the pieces together and establish Theorem 2.1. We start by establishing that given a pair of good anchor elements, the SVD step in Algorithm 1 approximately recovers the unique dictionary element in the intersection of the two anchors.

###### Proposition 3.2 (Accuracy of SVD).

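Consider anchor samples $y_{i^*}$ and $y_{j^*}$ such that $\text{Uniq-intersect}(y_{i^*}, y_{j^*})$ is satisfied, and wlog, let $N_B(y_{i^*}) \cap N_B(y_{j^*}) = \{a_1\}$. Recall the definition of  (25), and further define  and . If $\hat{a}$ is the top singular vector of , then there exists a universal constant  such that we have:

$$\min_{z \in \{-1, 1\}} \|\hat{a} - z a_1\|_2^2 < \frac{32\, s M^2}{m^2}\left(\frac{\mu_1}{\sqrt{ds}} + \frac{\mu_1^2}{d} + \frac{s^3}{r} + \alpha^2 + \frac{\alpha}{\sqrt{s}}\right),$$

with probability greater than  for .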

Given the above proposition, the proof of Theorem 2.1 is relatively straightforward. Indeed, the key missing piece is the dependence on the random quantity in the error probability in Proposition 3.2. We now present the proof.

Proof of Theorem 2.1:

Consider a particular iteration of Algorithm 1. Procedure 1 returns with probability greater than . If , then Algorithm 1 proceeds to the next iteration. Consider the case of and suppose . Using Proposition 3.2, with probability greater than , we have:

 ∥al−ˆa∥22<32sM2m2(μ1√ds+μ21d+s3r+α2+α√s).

Using Lemma A.4 and Lemma 3.1, we see that with probability greater than . Using a union bound over all the iterations (which are at most ), the above claims hold for all iterations with probability greater than .

Using Lemma A.4 and Lemma 3.1, with probability greater than , for every , there are at least pairs such that and . Lines 9-11 of the algorithm then ensure that there is a unique copy of the approximation to each dictionary element. Using a union bound now gives the result.

### 3.4 Analysis of post-processing step

In this section, we will show how to clean up the approximate recovery of the previous section and obtain exact recovery of the dictionary under Assumption B1. We start by setting up the problem as that of sparse estimation with deterministic noise and describing some guarantees in a general setup. We then specialize these to the assumptions of our problem and present the proof of Theorem 2.2.

#### 3.4.1 Lasso with deterministic noise

Recalling the model (1), we see that each observation is generated according to the linear model

$$y_i = A x_i, \quad \text{for } i = 1, 2, \ldots, n,$$

where is a -sparse vector in dimensions. If we knew the dictionary , then this is the usual sparse linear system. Given the knowledge of an approximate dictionary however, we can rewrite the system as

$$y_i = \bar{A} x_i + \underbrace{(A - \bar{A}) x_i}_{w_i}, \tag{16}$$

where is the error matrix. Note that the errors in are not zero mean, or even independent of unlike typical statistical settings. Under our initialization, however, they are bounded, which we establish subsequently. For the remainder of this section, we assume the following facts about . Note that this is not an assumption about the model, but a condition on the output of Algorithm 1, which will be proved in the next section.

###### Assumption C1 (Approximate initialization).

Assume that is an approximately correct initialization for , meaning the following hold:

RIP: The -RIP constant of the matrix , . That is, for every with , the smallest and largest singular values, and respectively of the matrix satisfy:

$$\frac{6}{7} < \sigma_{\min} < \sigma_{\max} < \frac{8}{7}.$$

Bounded error: for all .

Under these general assumptions, we can provide a guarantee on the error incurred in (10) in step (2) of Algorithm 2. While this result has been obtained in many contexts by various authors, we use the following precise form from Candes .

###### Theorem 3.1 (Theorem 1.2 from Candes ).

Suppose is generated according to the linear model (16), where is -sparse and assume that . Then the solution to Equation (10) obeys the following, for a universal constant ,

$$\|\hat{x}_i - x_i\|_2 \leq C_1 \|w_i\|_2.$$

In particular, suffices for .

#### 3.4.2 Proof of Theorem 2.2

In order to prove Theorem 2.2, we first establish that under our assumptions, the coefficients are exactly recovered in Equation (10). Once this is established, Theorem 2.2 follows in a straightforward manner. We start with a useful proposition.

###### Proposition 3.3.

Under conditions of Theorem 2.1, assume further that for the dictionary returned in Algorithm 1. Then Algorithm 2 guarantees that for all .

Proof:    We would like to use Theorem 3.1 to show that we recover the coefficients correctly in the lasso step (10) of Algorithm 2. In order to do this, we first need to verify Assumption C1 for the dictionary returned by Algorithm 1, and then obtain bounds on the quantity . We start with the former.

Consider any -sparse subset of . We have:

 σmin(¯AS) σmax(¯AS)

where and follow from Lemma A.5 in Appendix A.6. Since is a matrix, it satisfies that . Given the assumption , it immediately follows that the minimum and maximum singular values of are at least and respectively, so that we obtain .

This shows that satisfies Assumption C1. Next we bound the norm of the noise vector . Again bounding the Frobenius norm of the error in the dictionary in the same way as above, we obtain

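$$\|w_i\|_2 \leq \|(A - \bar{A})_{S_i}\|_2 \, \|x_i\|_2 \leq \|(A - \bar{A})_{S_i}\|_F \sqrt{s} \leq s\,\epsilon_A,$$

where is the support of . Consequently, we obtain from Theorem 3.1 that the output of Equation (10) satisfies

$$\|\hat{x}_i - x_i\|_2 \leq C_1 s\,\epsilon_A \leq 9 s\,\epsilon_A \leq \frac{9}{20}. \tag{17}$$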
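The bound $\|w_i\|_2 \leq s\,\epsilon_A$ on the deterministic noise can be checked numerically. The sketch below is illustrative only: it assumes random unit-norm dictionary columns, an approximate dictionary whose columns each differ from the true ones by exactly `eps` in norm, and an $s$-sparse coefficient vector with entries in $\{-1, +1\}$. The function names and parameter values are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def noise_bound_check(d=30, r=50, s=4, eps=0.01, seed=1):
    rng = random.Random(seed)
    # Dictionary columns (stored as rows of the list) are random unit vectors.
    A = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(r)]
    # Each column of A_bar is off from the true column by a vector of norm eps.
    A_bar = []
    for col in A:
        delta = unit([rng.gauss(0, 1) for _ in range(d)])
        A_bar.append([c + eps * e for c, e in zip(col, delta)])
    support = rng.sample(range(r), s)
    x = {j: rng.choice([-1.0, 1.0]) for j in support}
    # w = (A - A_bar) x only involves the columns in the support of x.
    w = [sum((A[j][k] - A_bar[j][k]) * x[j] for j in support) for k in range(d)]
    y = [sum(A[j][k] * x[j] for j in support) for k in range(d)]
    y_split = [sum(A_bar[j][k] * x[j] for j in support) + w[k] for k in range(d)]
    w_norm = math.sqrt(sum(v * v for v in w))
    return y, y_split, w_norm, s * eps
```

The returned `y` and `y_split` agree up to floating-point error, which is the decomposition (16), and `w_norm` never exceeds `s * eps` by the triangle inequality.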

We now observe that an ℓ_2 error guarantee is also an ℓ_∞ error guarantee. Recall that by the model assumption, each non-zero coefficient of has an absolute value of . Since Equation (17) guarantees that the error is no larger than 9/20, all the coefficients will be uniquely recovered and hence .
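A small worked example makes the last step concrete. It assumes, for illustration only, that the non-zero coefficients take values in $\{-1, +1\}$ and that recovery is done by rounding each entry to the nearest value in $\{-1, 0, +1\}$. Neither the rounding rule nor the helper name `round_coefficients` is taken from Algorithm 2.

```python
import math


def round_coefficients(x_hat):
    """Round each entry to the nearest value in {-1, 0, +1}."""
    return [min((-1, 0, 1), key=lambda c: abs(v - c)) for v in x_hat]


x_true = [1, 0, 0, -1, 0, 1]
perturbation = [0.2, -0.2, 0.1, 0.2, -0.2, -0.1]

# The l2 norm of the perturbation is below 9/20, hence so is its l-infinity norm.
l2_error = math.sqrt(sum(p * p for p in perturbation))
linf_error = max(abs(p) for p in perturbation)

x_hat = [t + p for t, p in zip(x_true, perturbation)]
recovered = round_coefficients(x_hat)
```

Because every entry of `x_hat` is within 1/2 of its true value, rounding returns `x_true` exactly.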

Proof of Theorem 2.2:

We are now ready to provide our proof of exact recovery. Based on Proposition 3.3, we only need to verify two things. First is that the initialization satisfies and the second is that the linear system is well-posed when we solve for . In order to verify the former, we observe that our additional conditions in Assumption B1 guarantee that

 32s(μ1√ds+μ21d) ≤11200s2, 32s4r ≤11200s2,and 32s(α2+α√s) ≤11200s2.

Hence we obtain from Theorem 2.1 that with probability at least , . Hence, it only remains to verify that the linear system is well-posed.

According to Lemma A.7 in Appendix A.6, the matrix so that all of its singular values are equal to . We now appeal to Theorem A.1 with , and . Then we obtain for any with probability at least

$$\sigma_{\min}(XX^T) \geq \frac{ns}{r} - n \max\left\{\sqrt{\frac{s}{r}}\,\delta,\ \delta^2\right\},$$

where . Substituting the value of , we obtain the lower bound

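$$\begin{aligned} \sigma_{\min}(XX^T) &\geq \frac{ns}{r} - n \max\left\{\sqrt{\frac{s}{r}}\, t \sqrt{\frac{s}{n}},\ \frac{t^2 s}{n}\right\} \\ &\geq \frac{ns}{r}\left(1 - t\sqrt{\frac{r}{n}} - \frac{t^2 r}{n}\right) \\ &= \frac{ns}{4r}, \end{aligned}$$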

for . This means that the linear system is well-posed with probability at least . Choosing to now be finishes the proof.
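The arithmetic in this lower bound can be verified directly. The sketch below assumes $\delta = t\sqrt{s/n}$, as the substitution in the display suggests, and evaluates both the form with the maximum and the relaxed form in which the maximum is bounded by a sum. The choice $t = \tfrac{1}{2}\sqrt{n/r}$ is one value that makes the relaxed factor equal to $1/4$. It is computed here for illustration and is not claimed to be the value used in the proof. Function names are hypothetical.

```python
import math


def lower_bound_max_form(n, s, r, t):
    """n s / r - n * max{ sqrt(s/r) * delta, delta^2 } with delta = t * sqrt(s/n)."""
    delta = t * math.sqrt(s / n)
    return n * s / r - n * max(math.sqrt(s / r) * delta, delta**2)


def lower_bound_relaxed(n, s, r, t):
    """(n s / r) * (1 - t * sqrt(r/n) - t^2 * r / n), using max{a, b} <= a + b."""
    return (n * s / r) * (1 - t * math.sqrt(r / n) - t**2 * r / n)


n, s, r = 10_000, 5, 100
t = 0.5 * math.sqrt(n / r)  # gives 1 - 1/2 - 1/4 = 1/4 in the relaxed factor
relaxed = lower_bound_relaxed(n, s, r, t)
max_form = lower_bound_max_form(n, s, r, t)
```

With these values the relaxed bound equals $ns/(4r) = 125$, and the form with the maximum is at least as large.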

## 4 Discussion and Conclusion

In this paper, we proposed simple and tractable methods for dictionary learning. We presented a novel clustering-based approach which can approximately recover the unknown overcomplete dictionary from samples. We also analyzed a simple denoising strategy based on sparse recovery algorithms for reconstructing the dictionary exactly under some simplifying assumptions on the model. In particular, the second step is not tied to the first step in any critical way, and more sophisticated post-processing procedures have since been developed. There is, of course, also room for developing better approximate recovery schemes, building on our work.

In the analysis of the clustering step, we provide guarantees when the coefficient matrix is sparse and randomly drawn. In principle, our analysis can be extended to general sparse coefficient matrices and can be cast as a higher-order expansion condition on the coefficient bipartite graph. Similar (and yet not the same) expansion conditions have appeared in other contexts involving learning of overcomplete models. For instance, Anandkumar et al. establish that under an expansion condition on the topic-word matrix, unsupervised learning of the model is possible. Here, the hidden topics correspond to dictionary elements, and the observed words correspond to the samples in the dictionary setting.

Finally, our work suggests some natural and interesting directions for future research. While both steps of our algorithm seem inherently robust to noise, quantifying the recovery properties under noisy observations remains important future work. Another natural question is raised by the fact that we use only one step of lasso and least squares for exact recovery. Indeed, subsequent work analyzes a generalization in which multiple iterations of lasso are each followed by dictionary estimation, and is able to exactly recover the dictionary under a much broader set of conditions. Since our study was motivated by natural applications of dictionary learning in signal processing and machine learning, it would also be interesting to investigate how our provably correct procedures perform compared to the popular heuristic methods.

#### Acknowledgements

A. Agarwal thanks Yonina Eldar for suggesting the problem to him. A. Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234, and ARO YIP Award W911NF-13-1-0084. P. Netrapalli thanks Yash Deshpande for helpful discussions. The authors thank Matus Telgarsky for suggesting Lemma A.6 and thank Sham Kakade and Dean Foster for initial discussions.

## Appendix A Proofs for clustering analysis

In this section we will provide the proofs of many of the lemmas in Sections 3.1 to 3.3, along with some auxiliary results. Some of the more technical results that are required will be deferred to Appendix A.6.

### A.1 Proofs of correlation graph properties

We start by proving Lemmas 3.1 and 3.2 in Section 3.1.

Proof of Lemma 3.1:

We first prove (12) via contradiction. Suppose , we then have

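$$\begin{aligned} |\langle y_k, y_l \rangle| &= \Big|\sum_{i,j} x_{ik} x_{jl} \langle a_i, a_j \rangle\Big| \leq \sum_{i,j} |x_{ik} x_{jl} \langle a_i, a_j \rangle| \\ &\leq |\mathcal{N}_B(y_k)| \cdot |\mathcal{N}_B(y_l)| \cdot \max_{i,j,k,l} |x_{ik} x_{jl}| \cdot \max_{i \neq j} |\langle a_i, a_j \rangle| \leq \frac{s^2 M^2 \mu_0}{\sqrt{d}} \end{aligned}$$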
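This chain of inequalities can be checked on a random instance. The sketch below assumes that the hypothesis of the contradiction argument is that the two samples have disjoint supports, so that only cross terms with $i \neq j$ appear. It measures the coherence $\max_{i \neq j} |\langle a_i, a_j \rangle|$ of a random unit-norm dictionary empirically (playing the role of $\mu_0/\sqrt{d}$) and draws coefficients with magnitude at most `M`. The function names and parameter values are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def disjoint_support_check(d=40, r=60, s=3, M=2.0, seed=3):
    rng = random.Random(seed)
    A = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(r)]
    # Empirical value of max_{i != j} |<a_i, a_j>|, i.e. mu_0 / sqrt(d).
    max_coh = max(abs(dot(A[i], A[j])) for i in range(r) for j in range(i + 1, r))
    # Two samples whose supports are disjoint by construction.
    idx = rng.sample(range(r), 2 * s)
    supp_k, supp_l = idx[:s], idx[s:]
    x_k = {i: rng.uniform(-M, M) for i in supp_k}
    x_l = {j: rng.uniform(-M, M) for j in supp_l}
    y_k = [sum(x_k[i] * A[i][t] for i in supp_k) for t in range(d)]
    y_l = [sum(x_l[j] * A[j][t] for j in supp_l) for t in range(d)]
    return abs(dot(y_k, y_l)), s * s * M * M * max_coh
```

The first returned value is $|\langle y_k, y_l \rangle|$ and the second is the bound $s^2 M^2 \mu_0 / \sqrt{d}$. The first never exceeds the second when the supports are disjoint.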

For (11), let

 |⟨yk,yl⟩| =|∑i,jxikxjl⟨ai,aj⟩|≥|x