When is sparse dictionary learning well-posed?

06/22/2016 ∙ by Charles J. Garfinkle, et al.

Dictionary learning methods for sparse coding have exposed underlying structure in many kinds of natural signals. However, universal theorems guaranteeing the statistical consistency of inference in this model are lacking. Here, we prove that for almost all diverse enough datasets generated from the model, latent dictionaries and sparse codes are uniquely identifiable up to an error commensurate with measurement noise. Applications are given to data analysis, neuroscience, and engineering.


I. Introduction

Blind source separation is a classical problem in signal processing [1]. In one common modern formulation, each of N observed n-dimensional signals is a (noisy) linear combination of at most k elementary waveforms drawn from some unknown dictionary of m such waveforms, typically with k < n < m (see [2] for a comprehensive review of this and related models). Approximate solutions to this sparsity-constrained inverse problem have provided insight into the structure of many signal classes lacking domain-specific formal models (e.g., in vision [3]). In particular, it has been shown that response properties of simple-cell neurons in mammalian visual cortex emerge from optimizing a dictionary to represent small patches of natural images [4, 5, 6, 7], a major advance in computational neuroscience. A curious aspect of this finding is that the latent waveforms (e.g., “Gabor” wavelets) estimated from data appear to be canonical [8]; i.e., they are found in learned dictionaries independent of algorithm or image training set.

Motivated by these discoveries and earlier work in the theory of neural communication [9], we address when dictionaries and the sparse representations they induce are uniquely determined by data. Answers to this question also have other real-world implications. For example, a sparse coding analysis of local painting style can be used for forgery detection [10], but only if all dictionaries consistent with training data do not differ appreciably in their ability to sparsely encode new samples. Fortunately, algorithms with proven recovery of generating dictionaries under certain conditions have recently been proposed (see [11, Sec. I-E] for a summary of the state-of-the-art). Few theorems, however, can be cited to explain this uniqueness more broadly; in particular, a universal guarantee in the context of noise has yet to emerge.

Here, we prove very generally that uniqueness and stability are expected properties of the sparse linear coding model. More specifically, dictionaries that preserve sparse codes (e.g., satisfy a “spark condition”) are identifiable from relatively few noisy sparse linear combinations of their columns, up to an error that is linear in the noise (Thm. 1). In fact, in almost all cases the dictionary learning problem is well-posed (as per Hadamard [12]) given enough data (Cor. 2). Moreover, these explicit, algorithm-independent guarantees hold without assuming the recovered matrix satisfies a spark condition, and even when the number of dictionary elements is unknown.

More formally, let A be an n × m matrix with columns A_j (j = 1, …, m) and let the dataset Z consist of N measurements:

z_i = A x_i + e_i,   i = 1, …, N,   (1)

for k-sparse x_i ∈ ℝ^m having at most k nonzero entries and noise e_i ∈ ℝ^n, with bounded norm ‖e_i‖₂ ≤ ε representing our combined worst-case uncertainty in measuring A x_i. The first mathematical problem we address is the following.
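For concreteness, the following minimal Python sketch simulates the model (1); the sizes, the Gaussian dictionary, and the noise scaling are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k, N = 20, 30, 3, 500   # illustrative sizes (assumptions, not from the paper)
eps = 0.01                    # worst-case noise bound ||e_i|| <= eps

A = rng.standard_normal((n, m))            # unknown dictionary (n x m)
X = np.zeros((m, N))                       # k-sparse codes x_i as columns
for i in range(N):
    support = rng.choice(m, size=k, replace=False)
    X[support, i] = rng.standard_normal(k)

E = rng.standard_normal((n, N))            # noise, rescaled so each ||e_i|| <= eps
E *= eps / np.maximum(np.linalg.norm(E, axis=0), 1e-12)

Z = A @ X + E                              # dataset of N measurements z_i = A x_i + e_i
print(Z.shape)                             # (n, N)
```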

Problem 1 (Sparse linear coding).

Find an n × m̄ matrix B and k-sparse x̄_1, …, x̄_N so that ‖z_i − B x̄_i‖₂ ≤ ε for all i = 1, …, N.

Note that any particular solution (B, {x̄_i}) to this problem gives rise to an orbit of equivalent solutions (BPD, {(PD)⁻¹ x̄_i}), where P is any permutation matrix and D any invertible diagonal matrix. Previous theoretical work addressing the noiseless case (e.g., [13, 14, 15, 16]) has shown that a solution to Prob. 1 (when it exists) is unique up to this ambiguity provided the x_i are sufficiently diverse and the matrix A satisfies the spark condition:

A x₁ = A x₂ implies x₁ = x₂ for all k-sparse x₁, x₂ ∈ ℝ^m,   (2)

which would be, in any case, necessary for uniqueness of the x_i. Here, we study stability in the practical setting of noise ε > 0.
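Condition (2) holds exactly when every 2k columns of A are linearly independent, which can be verified by brute force for small problem sizes. The sketch below is illustrative only and is feasible only for small m and k.

```python
import itertools
import numpy as np

def satisfies_spark_condition(A, k, tol=1e-10):
    """Check (2) by brute force: every 2k columns of A must be linearly independent.
    Feasible only for small m and k since there are C(m, 2k) subsets to test."""
    n, m = A.shape
    if 2 * k > n:
        return False  # 2k columns in R^n cannot be independent if 2k > n
    for cols in itertools.combinations(range(m), 2 * k):
        if np.linalg.matrix_rank(A[:, cols], tol=tol) < 2 * k:
            return False
    return True

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 10))
print(satisfies_spark_condition(A, k=2))   # generic Gaussian matrices pass with probability one
```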

Definition 1.

Fix k. We say Z has a k-sparse representation in ℝ^m if there exists an n × m matrix A and k-sparse x_1, …, x_N ∈ ℝ^m such that z_i = A x_i for all i. This representation is stable if for every δ₁, δ₂ ≥ 0, there exists some ε = ε(δ₁, δ₂) that is strictly positive for positive δ₁, δ₂ and such that if an n × m matrix B and k-sparse x̄_1, …, x̄_N satisfy:

‖A x_i − B x̄_i‖ ≤ ε   for all i = 1, …, N,

then there is some permutation matrix P and invertible diagonal matrix D such that for all i and j:

‖A_j − (BPD)_j‖ ≤ δ₁   and   ‖x_i − (PD)⁻¹ x̄_i‖ ≤ δ₂.   (3)

To see how Def. 1 directly relates to Prob. 1, suppose that Z has a stable k-sparse representation in ℝ^m and fix δ₁, δ₂ to be the desired recovery accuracies in (3). Consider any dataset generated as in (1) with noise bound ε ≤ ε(δ₁, δ₂)/2. Then from the triangle inequality, it follows that any matrix B and k-sparse x̄_i solving Prob. 1 are necessarily within δ₁ and δ₂ of the original dictionary A and codes x_i, respectively.
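The triangle-inequality step can be written out as follows, in the notation above (a short derivation sketch, assuming ε bounds the noise as in (1) and ε(δ₁, δ₂) is the threshold from Def. 1):

```latex
\[
  \|A x_i - B\bar{x}_i\|
    \le \|A x_i - z_i\| + \|z_i - B\bar{x}_i\|
    \le \varepsilon + \varepsilon
    \le \varepsilon(\delta_1,\delta_2),
  \qquad i = 1, \dots, N,
\]
% so the hypothesis of Def. 1 is met and (3) bounds how far (B, \bar{x}_i) can be from (A, x_i).
```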

In the next section, we give precise statements of our main results, which include an explicit form for the bound ε(δ₁, δ₂) in Def. 1. We then prove our main theorem (Thm. 1) in Sec. III after stating some additional definitions and lemmas required for the proof, including a useful result in combinatorial matrix analysis (Lem. 1, proven in the Appendix). We also provide an argument that extends our guarantees to the following more common optimization formulation of the dictionary learning problem (Thm. 2).

Problem 2.

Find an n × m̄ matrix B and vectors x̄_1, …, x̄_N that solve:

minimize ∑_{i=1}^{N} ‖x̄_i‖₀ subject to ‖z_i − B x̄_i‖₂ ≤ ε for all i,   (4)

where ‖·‖₀ counts the number of nonzero entries.

We then sketch proofs of probabilistic extensions of Thm. 1 to random data and dictionaries (Thm. 3 and Cor. 2). Finally, in Sec. IV, we discuss both theoretical and practical applications of our main mathematical findings.

II. Results

Before stating our results precisely, we identify criteria on the support sets of the generating codes x_i that imply stable sparse representations. Letting [m] denote {1, …, m} and 2^{[m]} its power set, we say a hypergraph H ⊆ 2^{[m]} on m vertices is k-uniform when every set in H has size k. The degree deg(i) of a node i ∈ [m] is the number of sets in H that contain i, and we say H is regular when deg(i) = r for all i ∈ [m] and some r (given such an r, we say H is r-regular). We also write ∪H := ∪_{S ∈ H} S.

Definition 2.

Given i ∈ [m], the star H(i) is the collection of sets in H containing i. We say H has the singleton intersection property (SIP) when ∩_{S ∈ H(i)} S = {i} for all i ∈ [m].
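Def. 2 is easy to check directly. The following Python sketch is illustrative (the function names are ours); the example hypergraph is the rows and columns of a 3 × 3 grid, which is 2-regular, 3-uniform, and has the SIP.

```python
from functools import reduce

def star(H, i):
    """Collection of sets in the hypergraph H containing vertex i."""
    return [S for S in H if i in S]

def has_sip(H, m):
    """Singleton intersection property: the sets containing i intersect to {i}, for every i."""
    for i in range(m):
        sets = star(H, i)
        if not sets:
            return False
        if reduce(set.intersection, (set(S) for S in sets)) != {i}:
            return False
    return True

# rows and columns of a 3 x 3 grid on vertices {0, ..., 8}
grid = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {0, 3, 6}, {1, 4, 7}, {2, 5, 8}]
print(has_sip(grid, m=9))   # True
```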

Next, we describe a quantitative spark condition. The lower bound of a matrix M is the largest α with ‖Mx‖₂ ≥ α‖x‖₂ for all x [17]. By compactness of the unit sphere, injective linear maps have a nonzero lower bound; hence, if A satisfies (2), then each submatrix formed from 2k of its columns or fewer has a strictly positive lower bound.

We generalize the lower bound to a domain-restricted union of subspaces model [18] derived from a hypergraph H. Let A_S denote the submatrix formed by the columns of A indexed by S ⊆ [m]. (In the sections that follow, we write span(A_S) to denote the column-span of a submatrix A_S.) Define:

L_H(A) := min{ ‖A_S x‖₂ : S ∈ H, ‖x‖₂ = 1 },   (5)

where we write L_k(A) when H consists of all subsets of [m] of size k.¹ Clearly, L_{2k}(A) > 0 for A satisfying (2), and L_H(A) ≥ L_k(A) whenever H is contained in the k-subsets of [m]. Note also that L_1(A) = 1 if A has unit ℓ₂-norm columns.

¹ We note that L_k is known as the asymmetric lower restricted isometry constant for matrices with unit ℓ₂-norm columns [19].
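Computationally, L_H(A) reduces to smallest singular values of the submatrices A_S. A minimal Python sketch (illustrative sizes and hypergraph):

```python
import numpy as np

def restricted_lower_bound(A, H):
    """L_H(A): the smallest value of ||A_S x|| over unit vectors x and supports S in H.
    For each S this is the smallest singular value of the submatrix A_S."""
    return min(np.linalg.svd(A[:, sorted(S)], compute_uv=False)[-1] for S in H)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 12))
H = [{j % 12, (j + 1) % 12} for j in range(12)]   # consecutive pairs in cyclic order (k = 2)
print(restricted_lower_bound(A, H) > 0)           # positive for a generic Gaussian dictionary
```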

A vector x ∈ ℝ^m is said to be supported in S ⊆ [m] when x ∈ span{e_j : j ∈ S}, where the e_j form the standard column basis in ℝ^m. A set of k-sparse vectors is said to be in general linear position when any k of them are linearly independent. The following is a precise statement of our main result. We leave the constant C₁ appearing in it undefined until Eq. (15). All of our theorems assume matrices consist of real numbers.

Theorem 1.

Fix an n × m matrix A with L_H(A) > 0 for an r-regular k-uniform hypergraph H with the SIP. If Z contains, for each S ∈ H, sufficiently many k-sparse vectors in general linear position supported in S, then there is a constant C₁ > 0 such that the following holds for all sufficiently small ε:²

Every n × m̄ matrix B for which there are k-sparse x̄_1, …, x̄_N satisfying ‖z_i − B x̄_i‖₂ ≤ ε for all i has m̄ ≥ m and, provided m̄ is not too much larger than m,

(6)

for some nonempty J ⊆ [m] and J̄ ⊆ [m̄] of equal size, permutation matrix P, and invertible diagonal matrix D.

Moreover, if B satisfies (2) and m̄ = m, then J = [m] and:

(7)

where the subvectors in (7) are formed by restricting x_i and x̄_i to the entries indexed by J and J̄, respectively.

² The condition is necessary; otherwise there are a matrix B and 1-sparse codes x̄_i that satisfy the required approximation bounds but nonetheless violate (6).

In words, Thm. 1 says that the smaller the difference between the recovered and original latent dictionary sizes, the more columns and coefficients of the original dictionary A and codes x_i are contained (up to noise) in the appropriately rescaled dictionary BPD and codes (PD)⁻¹ x̄_i. In the particular case when m̄ = m, the theorem directly implies that Z has a stable k-sparse representation in ℝ^m, with inequalities (3) guaranteed for ε(δ₁, δ₂) in Def. 1 given by:

(8)

Note that sparse codes with a shared support that are in general linear position are straightforward to produce using a “Vandermonde” matrix construction (i.e., use the columns of the matrix [t_j^ℓ] with rows ℓ = 1, …, k, for distinct nonzero t_1, t_2, …). Thus, the assumptions of Thm. 1 are easily met, leading to the following direct application to uniqueness in sparse linear coding.

Corollary 1.

Given a regular k-uniform hypergraph H with the SIP, there are k-sparse vectors x_1, …, x_N such that every matrix A with L_H(A) > 0 generates a dataset Z = {A x_i}_{i=1}^{N} with a stable k-sparse representation in ℝ^m (with ε(δ₁, δ₂) as in (8)).

One can also easily verify that for every k < m, there are regular k-uniform hypergraphs with the SIP besides the obvious choice of all k-subsets of [m]; for instance, take H to be the m consecutive intervals of length k in some cyclic order on [m]. In this case, a direct consequence of Cor. 1 is rigorous verification of the lower bound for sufficient sample size from the introduction. Often, the SIP is achievable with fewer supports. For example, when m = k², take H to be the 2k rows and columns formed by arranging [m] into a k × k grid.
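Both constructions, and the Vandermonde codes mentioned before Cor. 1, are easy to instantiate. The Python sketch below is illustrative only; in particular, the per-support sample count is an arbitrary placeholder, not the threshold required by Thm. 1.

```python
import itertools
import numpy as np

def cyclic_hypergraph(m, k):
    """The m consecutive intervals of length k in cyclic order on {0, ..., m-1} (k-regular, SIP)."""
    return [frozenset((j + t) % m for t in range(k)) for j in range(m)]

def vandermonde_codes(m, support, num):
    """num k-sparse vectors supported in `support`, in general linear position:
    columns of a Vandermonde-type matrix with distinct nonzero nodes, embedded into R^m."""
    support = sorted(support)
    k = len(support)
    nodes = 1.0 + np.arange(num)                          # distinct nonzero nodes t_1, ..., t_num
    V = np.vander(nodes, N=k, increasing=True).T * nodes  # rows t^1, ..., t^k (nonzero powers)
    X = np.zeros((m, num))
    X[support, :] = V
    return X

m, k, per_support = 8, 2, 5          # placeholder sizes; Thm. 1 prescribes the actual count
H = cyclic_hypergraph(m, k)
X = np.hstack([vandermonde_codes(m, S, per_support) for S in H])
print(X.shape)                        # (m, m * per_support) codes, each k-sparse

# sanity check: within one support, any k of the codes are linearly independent
block = vandermonde_codes(m, next(iter(H)), per_support)
assert all(np.linalg.matrix_rank(block[:, list(c)]) == k
           for c in itertools.combinations(range(per_support), k))
```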

Another practical implication of Thm. 1 is the following: there is an effective procedure sufficient to affirm if a proposed solution to Prob. 1 is indeed unique (up to noise and inherent ambiguities). Simply check that the proposed matrix and codes satisfy the (computable) assumptions of Thm. 1 on A and the x_i.
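A sketch of such a certification check is given below. It is illustrative only: it tests the qualitative, computable assumptions of Thm. 1 (regularity, the SIP, L_H(A) > 0, and general linear position of the codes within each support) but does not reproduce the quantitative sample-count threshold; the function name and interface are ours.

```python
import itertools
import numpy as np

def certify_assumptions(A, X, H, k, tol=1e-9):
    """Check the computable, qualitative assumptions of Thm. 1 for a proposed (A, X, H).
    Codes are the columns of X. The quantitative sample-count threshold is not checked."""
    m = A.shape[1]
    degrees = [sum(i in S for S in H) for i in range(m)]
    if len(set(degrees)) != 1:
        return False                                            # H is not regular
    for i in range(m):
        stars = [set(S) for S in H if i in S]
        if not stars or set.intersection(*stars) != {i}:
            return False                                        # SIP fails at vertex i
    if any(np.linalg.svd(A[:, sorted(S)], compute_uv=False)[-1] <= tol for S in H):
        return False                                            # L_H(A) is not positive
    for S in H:
        cols = [i for i in range(X.shape[1])
                if set(np.flatnonzero(np.abs(X[:, i]) > tol)) <= set(S)]
        XS = X[sorted(S), :][:, cols]
        if any(np.linalg.matrix_rank(XS[:, list(c)], tol=tol) < k
               for c in itertools.combinations(range(XS.shape[1]), k)):
            return False                                        # not in general linear position
    return True
```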

We furthermore note that, unlike in previous works, the matrix A need not satisfy (2) to be recoverable from data. As an example for k = 2, one can exhibit a dictionary A, with H taken to be all consecutive pairs of [m] arranged in cyclic order, that satisfies the assumptions of Thm. 1 underlying (6) without satisfying (2).

There are other less direct consequences of Thm. 1. For instance, we use it to prove uniqueness for the optimization formulation of sparse linear coding, Prob. 2, the main object of interest for those applying dictionary learning to their data.

Theorem 2.

If the assumptions of Thm. 1 hold, only now with a correspondingly larger number of vectors supported in each S ∈ H, then all solutions to Prob. 2 necessarily satisfy recovery inequalities (6) and (7) of Thm. 1.

Another extension of Thm. 1 arises from the following analytic characterization of the spark condition. Let W be an n × m matrix of indeterminates w_{ij}. When real numbers are substituted for the w_{ij}, the resulting matrix satisfies (2) if and only if the following polynomial is nonzero:

∏_S ∑_{S′} (det W_{S′,S})²,

where the product runs over all subsets S of [m] of size 2k, the sum runs over all subsets S′ of [n] of size 2k, and for any such S′ and S the symbol W_{S′,S} denotes the submatrix of entries w_{ij} with i ∈ S′ and j ∈ S. We note that the large number of terms in this product is likely necessary due to the NP-hardness of deciding whether a given matrix satisfies the spark condition [20].

Since this polynomial is analytic, the existence of a single real matrix at which it is nonzero implies that its zero set has (Borel) measure zero. Fortunately, such a matrix is easily constructed by adding n − 2k rows of zeroes to a 2k × m Vandermonde matrix as described above (so that each term in the product is nonzero). Hence, almost every n × m matrix with n ≥ 2k satisfies (2).
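A quick numerical illustration of this construction (a sketch with illustrative sizes): padding a 2k × m Vandermonde matrix with n − 2k zero rows yields a matrix all of whose 2k-column submatrices have full rank.

```python
import itertools
import numpy as np

n, m, k = 7, 9, 2
nodes = 1.0 + np.arange(m)                                  # distinct nonzero Vandermonde nodes
top = np.vander(nodes, N=2 * k, increasing=True).T          # 2k x m Vandermonde block
A = np.vstack([top, np.zeros((n - 2 * k, m))])              # pad with n - 2k rows of zeros

# every 2k-column submatrix has a nonzero 2k x 2k minor (its top block), so (2) holds
assert all(abs(np.linalg.det(A[:2 * k, list(S)])) > 1e-9
           for S in itertools.combinations(range(m), 2 * k))
print("spark condition certified for this substitution")
```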

A similar phenomenon applies to datasets of vectors with a stable sparse representation. As in [16, Sec. IV], consider the “symbolic” dataset generated by an indeterminate n × m matrix A and indeterminate k-sparse vectors x_1, …, x_N.

Theorem 3.

There is a polynomial in the entries of A and the x_i with the following property: if it evaluates to a nonzero number and sufficiently many of the resulting x_i are supported in each S ∈ H for some regular k-uniform H with the SIP, then Z has a stable k-sparse representation in ℝ^m (Def. 1). In particular, all substitutions of real numbers for the indeterminates, except for a Borel set of measure zero, impart this property to Z.

Corollary 2.

Fix n, m, and k, and let the entries of A and of the k-sparse x_1, …, x_N be drawn independently from probability measures absolutely continuous with respect to the standard Borel measure. If sufficiently many of the vectors x_i are supported in each S ∈ H for a regular k-uniform H with the SIP, then Z = {A x_i}_{i=1}^{N} has a stable k-sparse representation in ℝ^m with probability one.

Thus, choosing the dictionary and codes “randomly” almost certainly generates data with a stable sparse representation.

We remark that these results have an important application to theoretical neuroscience by mathematically justifying one of the few hypothesized theories of bottleneck communication between sparsely active neural populations [9].

Proposition 1.

Sparse neural population activity is recoverable from noisy random linear compression by any method that solves Prob. 1 or 2; in particular, via biophysically plausible unsupervised sparse linear coding (e.g., [21, 22, 23]).
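Prop. 1 concerns what any solver of Prob. 1 or 2 recovers, independent of algorithm. As a purely illustrative sketch (not the paper's method, and not one of the biophysically plausible algorithms cited above), one can probe this with an off-the-shelf learner such as scikit-learn's DictionaryLearning, which heuristically targets an objective in the spirit of (4); the sizes and parameters below are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(3)
n, m, k, N = 16, 24, 3, 1000
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)                      # unit-norm columns (a common convention)

X = np.zeros((m, N))
for i in range(N):
    S = rng.choice(m, size=k, replace=False)
    X[S, i] = rng.standard_normal(k)
Z = A @ X + 0.01 * rng.standard_normal((n, N))      # noisy data, samples as columns

learner = DictionaryLearning(n_components=m, transform_algorithm="omp",
                             transform_n_nonzero_coefs=k, alpha=0.1,
                             max_iter=30, random_state=0)
learner.fit(Z.T)                                    # scikit-learn expects samples as rows
B = learner.components_.T                           # learned dictionary, atoms as columns

# match learned atoms to true ones up to permutation and sign (the inherent ambiguities)
corr = np.abs(A.T @ B)
row, col = linear_sum_assignment(-corr)
print("median |cosine| between matched atoms:", np.median(corr[row, col]))
```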

We close this section with comments on optimality. Our linear scaling for ε(δ₁, δ₂) in (8) is essentially optimal (e.g., see [24]), but a basic open problem remains: how many samples N are necessary to determine the sparse linear coding model? If k is held fixed or if the size of the support set of reconstructing codes is known to be polynomial in m and k, then a practical (polynomial) amount of data suffices.³ Reasons to be skeptical that this holds in general, however, can be found in [20, 25].

³ In the latter case, a reexamination of the pigeonholing argument in the proof of Thm. 1 requires only a polynomial number of samples distributed over a polynomial number of supports.

III. Proofs

We now begin our proof of Thm. 1 by showing how dictionary recovery (6) already implies sparse code recovery (7) when m̄ = m (provided B satisfies (2)), temporarily assuming (without loss of generality) that the permutation P and invertible diagonal D are the identity. For k-sparse x_i and x̄_i, the difference x_i − x̄_i is 2k-sparse, so the triangle inequality together with the lower bound L_{2k}(B) > 0 controls ‖x_i − x̄_i‖ in terms of the dictionary error and the noise, and (7) follows. The heart of the matter is therefore (6), which we now establish, first in the important special case k = 1.

Proof of Thm. 1 for k = 1.

Since the only 1-uniform hypergraph on [m] with the SIP is the set of all singletons, we have H = {{1}, …, {m}} and r = 1. In this case, we require that ε be sufficiently small.

Fix satisfying (2) and suppose that for some and -sparse we have for all . Then, there exist and a map such that:

(9)

Note that , since otherwise we have the contradiction .

We now show that is injective (in particular, a permutation if ). Suppose that for some and . Then, and . Scaling and summing these inequalities by and , respectively, and applying the triangle inequality, we obtain:

which contradicts the bound . Hence, the map is injective and therefore . Setting and letting and , we see that (9) becomes, for all :

We require a few additional tools to extend the proof to the general case k > 1. These include a generalized notion of distance (Def. 3) and angle (Def. 4) between subspaces as well as a stability result in combinatorial matrix analysis (Lem. 1).

Definition 3.

For and vector spaces , let and define:

(10)

We note the following facts. If , then and [26, Ch. 4 Cor. 2.6]:

(11)

Also, from [27, Lem. 3.2], we have:

(12)

Our result in combinatorial matrix analysis is the following.

Lemma 1.

Suppose the n × m matrix A has L_H(A) > 0 for some r-regular H with the SIP. There exists a constant C₂ > 0 for which the following holds:

If for some n × m̄ matrix B and some map π from H to the k-subsets of [m̄],

(13)

then m̄ ≥ m, and, provided m̄ is not too much larger than m, there is a permutation matrix P and invertible diagonal D such that:

(14)

for some nonempty J ⊆ [m] and J̄ ⊆ [m̄] of equal size.

The constant C₁ in Thm. 1 is then given by:⁴

(15)

where, given vectors x_1, …, x_N, we denote by X_S the matrix whose columns are those x_i supported in S, and by I_S the set of indices i for which x_i is supported in S.

⁴ Note that the quantity appearing in the denominator of (15) is strictly positive by the general linear position of the x_i.

The constant C₂ of Lem. 1, in turn, will be presented relative to a quantity from [28] used to analyze the convergence of the alternating projections algorithm. Specifically, in Lem. 3 below, the following definition is used to bound the distance between a point and the intersection of subspaces given an upper bound on its distance from each individual subspace.

Definition 4.

For a collection of real subspaces , define when , and otherwise:

(16)

where the maximum is taken over all orderings of the and the angle is defined implicitly as [28, Def. 9.4]:

Note that implies , and that when .555We acknowledge the counter-intuitive property that when . The constant in Lem. 1 is then:

(17)

which we remark yields a constant consistent with what is required for the case considered at the beginning of this section.666For , the denominator in (15) becomes , hence with we have .

The pragmatic reader should note that the explicit constants C₁ and C₂ are effectively computable: each lower-bound quantity may be calculated as the smallest singular value of a certain matrix, while the quantity in Def. 4 involves computing “canonical angles” between subspaces, which reduce again to an efficient singular value decomposition. There is no known fast computation of the lower bound in general, however, since even deciding the spark condition is NP-hard [20]; although fixing k yields polynomial complexity. Moreover, calculating the quantity in Def. 4 requires an exponential number of subspace computations unless the degree r is held fixed, too (e.g., the “cyclic order” hypergraphs described above have r = k). Thus, as presented, C₁ and C₂ are not efficiently computable.
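To make the computability remark concrete, the sketch below (illustrative, with arbitrary sizes) shows the two primitive computations involved: a smallest singular value for the lower bound, and canonical angles between subspaces via scipy.linalg.subspace_angles, which uses an SVD internally.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 6))

# smallest singular value of a column submatrix (the ingredient behind the lower bound)
S = [0, 2, 5]
sigma_min = np.linalg.svd(A[:, S], compute_uv=False)[-1]

# canonical (principal) angles between the spans of two column submatrices
U, V = A[:, [0, 1, 2]], A[:, [2, 3, 4]]
angles = subspace_angles(U, V)          # radians, computed via an SVD of orthonormal bases

print(sigma_min, np.degrees(angles))    # the shared column forces one angle to be ~0
```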

Proof of Thm. 1 for k > 1.

We find a map for which the distance is controlled by . Applying Lem. 1 then completes the proof.

Since there are more than vectors supported in each , the pigeonhole principle gives and a set of indices with all , , supported in . It also follows from and the general linear position of the that ; that is, the columns of the matrix form a basis for .

Fixing , there then exists such that . Setting , we have:

where the last inequality follows directly from (5). From Def. 3:

(18)

where the second inequality is due to and (15). Finally, apply Lem. 1 with . ∎

Proof of Thm. 2.

We bound the number of -sparse and then apply Thm. 1. Let be the number of with . Since the are all -sparse, by (4) we have: Hence,

(19)

demonstrating that the number of vectors that are not k-sparse is controlled by how many are k-sparse.

Next, observe that no more than of the share a support of size less than ; otherwise, by the pigeonhole principle, at least of these indices belong to the same for some and (as argued previously) (18) follows. Since the right-hand side of (18) is less than one, by (11) we have the contradiction

The total number of k-sparse vectors thus cannot exceed the resulting bound. By (19), no more than a corresponding number of vectors are not k-sparse. Since for every S ∈ H there are more than enough vectors supported there, it must be that sufficiently many of them have corresponding x̄_i that are k-sparse. The result now follows from Thm. 1. ∎

Proof (sketch) of Thm. 3.

Let M be the matrix with columns , . Consider the polynomial in the indeterminate entries of and , with notation as in Sec. II. It can be checked that when this polynomial is nonzero for a substitution of real numbers for the indeterminates, all of the genericity requirements on and in our proofs of stability in Thm. 1 are satisfied (in particular, the spark condition (2) on ). ∎

Proof (sketch) of Cor. 2.

First, note that if each measure in a finite collection of measures on ℝ is absolutely continuous with respect to the standard Borel measure on ℝ, then their product measure is absolutely continuous with respect to the standard Borel product measure. By Thm. 3, there is a polynomial whose nonvanishing implies that Z has a stable k-sparse representation in ℝ^m; since its zero set has Borel measure zero, stability holds almost surely. ∎

IV. Discussion

The goal of this work was to explain the emergence of characteristic representations from sparse linear coding models fit to natural data, despite the varied assumptions underlying the many algorithms in current use. To this end, we have taken an important step toward unifying the hundreds (if not thousands) of publications on the topic by demonstrating very general, deterministic conditions under which the identification of parameters in this parsimonious model is not only possible but also robust to the inevitable uncertainty permeating measurement and model choice.

Specifically, we have shown that, given sufficient data, the problem of seeking a dictionary and sparse codes with minimal average support size (Prob. 2) reduces to an instance of Prob. 1, to which our main result (Thm. 1) applies: all dictionaries and sparse codes consistent with the data are equivalent up to inherent relabeling/scaling ambiguities and a discrepancy (error) that scales linearly with the measurement noise or modeling inaccuracy. The constants we provide are explicit and computable; as such, there is an effective procedure sufficient to affirm that a proposed solution to Prob. 1 or 2 is indeed unique up to noise and inherent ambiguities.

An immediate application of our theoretical work is Prop. 1, which certifies the validity of perhaps the only published theory of neurally plausible bottleneck communication in the brain: that sparse linear coding in a compressed space of neural activity can recover sparse codes sent through a randomly constructed (but unknown) noisy wiring bottleneck [9].⁷

⁷ We refer the reader to [29] for more on interactions between dictionary learning and neuroscience.

Beyond an original extension of existing noiseless guarantees [16] to the noisy regime and their application to Prob. 2, a major innovation in our work is a theory of combinatorial designs for support sets key to the identification of the dictionary. We incorporate this idea into a new fundamental lemma in matrix theory (Lem. 1) that draws upon the definition of a new matrix lower bound induced by a hypergraph. Insights enabled by our combinatorial approach include: 1) a subset of dictionary elements is recoverable even if dictionary size is overestimated, 2) data require only a polynomial number of distinct sparse supports, and 3) the spark condition is not a necessary property of recoverable dictionaries.

A technical difficulty in proving Thm. 1 is the absence of any assumption at all on dictionaries in solutions to Prob. 1. We sought such a guarantee because of the practical difficulty of ensuring that an algorithm maintain a dictionary satisfying the spark condition (2) at each iteration, an implicit requirement of all previous works except [16]; indeed, even certifying a dictionary has this property is NP-hard [20].

In fact, uniqueness guarantees with minimal assumptions apply to all areas of data science and engineering that utilize learned sparse structure. For example, several groups have applied compressed sensing to signal processing tasks: MRI analysis [30], image compression [31], and, more recently, the design of an ultrafast camera [32]. Given such effective uses of compressed sensing, it is only a matter of time before these systems incorporate dictionary learning to encode and decode signals (e.g., in a device that learns structure from motion [33]), just as scientists have used it to make sense of their data [34, 35, 36, 37]. Assurances such as those offered by our theorems certify that different devices (with different initialization, etc.) will learn equivalent representations given enough data from statistically identical systems.⁸

Indeed, it seems a main reason for the sustained interest in dictionary learning as an unsupervised method for data analysis is the assumed well-posedness of parameter identification in the sparse linear coding model, confirmation of which forms the core of our theoretical findings.

⁸ To contrast with the current hot topic of “Deep Learning”, there are few such uniqueness guarantees for these models of data; moreover, even small noise can dramatically alter their output [38].

Acknowledgment

We thank Fritz Sommer for turning our attention to the dictionary learning problem and Darren Rhea for sharing early explorations. We also thank Ian Morris for posting a reference to his proof of (12) online at Stack Exchange. Finally, we thank Bizzyskillet (www.soundcloud.com/bizzyskillet) for the “No Exam Jams”, which played on repeat during many long hours designing proofs.

References

  • [1] Y. Sato, “A method of self-recovering equalization for multilevel amplitude-modulation systems,” IEEE Trans. Commun., vol. 23, no. 6, pp. 679–682, 1975.
  • [2] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, “A survey of sparse representation: algorithms and applications,” Access, IEEE, vol. 3, pp. 490–530, 2015.
  • [3] Z. Wang, J. Yang, H. Zhang, Z. Wang, Y. Yang, D. Liu, and T. Huang, Sparse coding and its applications in computer vision. World Scientific, 2015.
  • [4] B. Olshausen and D. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
  • [5] J. Hurri, A. Hyvärinen, J. Karhunen, and E. Oja, “Image feature extraction using independent component analysis,” in Proc. NORSIG ’96 (Nordic Signal Proc. Symposium), 1996, pp. 475–478.
  • [6] A. Bell and T. Sejnowski, “The “independent components” of natural scenes are edge filters,” Vision Res., vol. 37, no. 23, pp. 3327–3338, 1997.
  • [7] J. van Hateren and A. van der Schaaf, “Independent component filters of natural images compared with simple cells in primary visual cortex,” Proc. R Soc. Lond. [Biol.], vol. 265, no. 1394, pp. 359–366, 1998.
  • [8] D. Donoho and A. Flesia, “Can recent innovations in harmonic analysis ’explain’ key findings in natural image statistics?” Network Comp. Neural, vol. 12, no. 3, pp. 371–393, 2001.
  • [9] G. Isely, C. Hillar, and F. Sommer, “Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication,” in Adv. Neural Inf. Process. Syst., 2010, pp. 910–918.
  • [10] J. Hughes, D. Graham, and D. Rockmore, “Quantification of artistic style through sparse coding analysis in the drawings of Pieter Bruegel the Elder,” Proc. Natl. Acad. Sci., vol. 107, no. 4, pp. 1279–1283, 2010.
  • [11] J. Sun, Q. Qu, and J. Wright, “Complete dictionary recovery over the sphere I: Overview and the geometric picture,” IEEE Trans. Inf. Theory, pp. 853 – 884, 2016.
  • [12] J. Hadamard, “Sur les problèmes aux dérivées partielles et leur signification physique,” Princeton University Bulletin, vol. 13, no. 49-52, p. 28, 1902.
  • [13] Y. Li, A. Cichocki, and S.-I. Amari, “Analysis of sparse representation and blind source separation,” Neural Comput., vol. 16, no. 6, pp. 1193–1234, 2004.
  • [14] P. Georgiev, F. Theis, and A. Cichocki, “Sparse component analysis and blind source separation of underdetermined mixtures,” IEEE Trans. Neural Netw., vol. 16, pp. 992–996, 2005.
  • [15]