In the problem of sparse coding/dictionary learning, given i.i.d. samples produced from the generative model
for , the goal is to recover a fixed dictionary and -sparse vectors . (An -sparse vector has no more than non-zero entries.) In many problems of interest, the dictionary is often overcomplete, that is,
. This is believed to add flexibility in modeling and robustness. This model was first proposed in neuroscience as an energy minimization heuristic that reproduces features of the V1 region of the visual cortex(Olshausen and Field, 1997; Lewicki and Sejnowski, 2000)
. It has also been an extremely successful approach to identifying low dimensional structure in high dimensional data; it is used extensively to find features in images, speech and video (see, for example, references in(Elad and Aharon, 2006)).
Most formulations of dictionary learning tend to yield non-convex optimization problems. For example, note that if either or were known, given , this would just be a (matrix/sparse) regression problem. However, estimating both and simultaneously leads to both computational as well as statistical complications. The heuristic of alternating minimization works well empirically for dictionary learning. At each step, first an estimate of the dictionary is held fixed while the sparse coefficients are estimated; next, using these sparse coefficients the dictionary is updated. Note that in each step the sub-problem has a convex formulation, and there is a range of efficient algorithms that can be used. This heuristic has been very successful empirically, and there has also been significant recent theoretical progress in understanding its performance, which we discuss next.
1.1 Related Work
A recent line of work theoretically analyzes local linear convergence rates for alternating minimization procedures applied to dictionary learning (Agarwal et al., 2014; Arora et al., 2015). Arora et al. (2015) present a neurally plausible algorithm that recovers the dictionary exactly for sparsity up to , where is the level of incoherence in the dictionary (which is a measure of the similarity of the columns; see Assumption A1 below). Agarwal et al. (2014) analyze a least squares/ minimization scheme and show that it can tolerate sparsity up to
. Both of these establish local linear convergence guarantees for the maximum column-wise distance. Exact recovery guarantees require a singular-value decomposition (SVD) or clustering based procedure to initialize their dictionary estimates (see also the previous work(Arora et al., 2013; Agarwal et al., 2013)).
For the undercomplete case (when ), Sun et al. (2017) provide a Riemannian trust region method that can tolerate sparsity , while earlier work by Spielman et al. (2012) provides an algorithm that works in this setting for sparsity .
Local and global optima of non-convex formulations for the problem have also been extensively studied in (Wu and Yu, 2015; Gribonval et al., 2015; Gribonval and Nielsen, 2003), among others. Apart from alternating minimization, other approaches (without theoretical convergence guarantees) for dictionary learning include K-SVD (Aharon et al., 2006) and MOD (Engan et al., 1999). There is also a nice formulation by Barak et al. (2015), based on the sum-of-squares hierarchy. Recently, Hazan and Ma (2016) provide guarantees for improper dictionary learning, where instead of learning a dictionary, they learn a comparable encoding via convex relaxations. Our work also adds to the recent literature on analyzing alternating minimization algorithms (Jain et al., 2013; Netrapalli et al., 2013; Netrapalli et al., 2014; Hardt, 2014; Balakrishnan et al., 2017).
Our main contribution is to present new conditions under which alternating minimization for dictionary learning converges at a linear rate to the global optimum. We impose a condition on the matrix infinity norm (largest magnitude entry) of the underlying dictionary. This allows dictionaries with operator norm growing with dimension (). The error rates are measured in the matrix infinity norm, which is sharper than the previous error rates in maximum column-wise error.
We also identify conditions under which a trivial random initialization of the dictionary works, as opposed to the more complex SVD and clustering procedures required in previous work. This is possible as our radius of convergence, again measured in the matrix infinity norm, is larger than that of previous results, which required the initial estimate to be close column-wise. Our results hold for a rather arbitrary level of overcompleteness, . We establish convergence results for sparsity level , which is information theoretically optimal for incoherent dictionaries and improves the previously best known results in the overcomplete setting by a logarithmic factor. Our algorithm is simple, involving an -minimization step followed by a gradient update for the dictionary.
A key step in our proofs is an analysis of a robust sparse estimator—-MU Selector—under fixed (worst case) corruption in the dictionary. We prove that this estimator is minimax optimal in this setting, which might be of independent interest.
In Section 2, we present our alternating minimization algorithm and discuss the sparse regression estimator. In Section 3, we list the assumptions under which our algorithm converges and state the main convergence result. Finally, in Section 4, we prove convergence of our algorithm. We defer technical lemmas, analysis of the sparse regression estimator, and minimax analysis to the appendix.
For a vector , denotes the component of the vector, denotes the norm, denotes the support of a vector , that is, the set of non-zero entries of the vector, denotes the sign of the vector , that is, a vector with if , if and if . For a matrix , denotes the column, is the element in the row and column, denotes the operator norm, and denotes the maximum of the magnitudes of the elements of . For a set , we denote its cardinality by . Throughout the paper, we use multiple times to denote global constants that are independent of the problem parameters and dimension. We denote the indicator function by .
Given an initial estimate of the dictionary we alternate between an minimization procedure (specifically the -MU Selector— in the algorithm—followed by a thresholding step) and a gradient step, under sample loss, to update the dictionary. We analyze this algorithm and demand linear convergence at a rate of 7/8; convergence analysis for other rates follows in the same vein with altered constants. In subsequent sections, we also establish conditions under which the initial estimate for the dictionary can be chosen randomly. Below we state the permitted range for the various parameters in the algorithm above.
Step size: We need to set the step size in the range .
Threshold: At each step set the threshold at .
Tuning parameters: We need to pick and such that the assumption (D5) is satisfied. A choice that is suitable that satisfies this assumption is
We need to set as specified by Theorem 16,
2.1 Sparse Regression Estimator
Our proof of convergence for Algorithm 1 also goes through with a different choices of robust sparse regression estimators, however, we can establish the tightest guarantees when the -MU Selector is used in the sparse regression step. The -MU Selector (Belloni et al., 2014) was established as a modification of the Dantzig selector to handle uncertainty in the dictionary. There is a beautiful line of work that precedes this that includes (Rosenbaum et al., 2010, 2013; Belloni et al., 2016). There are also modified non-convex LASSO programs that have been studied (Loh and Wainwright, 2011) and Orthogonal Matching Pursuit algorithms under in-variable errors (Chen and Caramanis, 2013). However these estimators require the error in the dictionary to be stochastic and zero mean which makes them less suitable in this setting. Also note that standard minimization estimators like the LASSO and Dantzig selector are highly unstable under errors in the dictionary and would lead to much worse guarantees in terms of radius of convergence (as studied in (Agarwal et al., 2014)). We establish the error guarantees for a robust sparse estimator -MU Selector under fixed corruption in the dictionary. We also establish that this estimator is minimax optimal when the error in the sparse estimate is measured in infinity norm and the dictionary is corrupted.
The -MU Selector
Define the estimator such that is the solution to the convex minimization problem
where, and are tuning parameters that are chosen appropriately. is an upper bound on the error in our dictionary measured in matrix infinity norm. Henceforth the first coordinate () of this estimator is called , where the first argument is the sample, the second is the matrix, and the third is the value of the upper bound on the error of the dictionary measured in infinity norm. We will see that under our assumptions we will be able to establish an upper bound on the error on the estimator, , where . We define a threshold at each step . The thresholded estimate is defined as
Our assumptions will ensure that we have the guarantee . This will be crucial in our proof of convergence. The analysis of this estimator is presented in Appendix B.
To identify the signs of the sparse covariates correctly using this class of thresholded estimators, we would like in the first step to use an estimator that is optimal, as this would lead to tighter control over the radius of convergence. This makes the choice of -MU Selector natural, as we will show it is minimax optimal under certain settings.
Theorem 1 (informal).
Define the sets of matrices and with . Then there exists an and with such that
where the is over all measurable estimators with input , and the is over -sparse vectors with -norm .
Note that when and , this lower bound matches the upper bound we have for Theorem 16 (up to logarithmic factors) and hence the -MU Selector is minimax optimal.
The proof of this theorem follows by Fano’s method and is relegated to Appendix C.
2.2 Gradient Update for the dictionary
3 Main Results and Assumptions
In this section we state our convergence result and state the assumptions under which our results are valid.
Incoherence: We assume the the true underlying dictionary is -incoherent
This is a standard assumption in the sparse regression literature when support recovery is of interest. It was introduced in (Fuchs, 2004; Tropp, 2006) in signal processing and independently in (Zhao and Yu, 2006; Meinshausen and Bühlmann, 2006) in statistics. We can also establish guarantees under the strictly weaker -sensitivity condition (cf. (Gautier and Tsybakov, 2011)) used in analyzing sparse estimators under in-variable uncertainty in (Belloni et al., 2016; Rosenbaum et al., 2013). The -MU selector that we use for our sparse recovery step also works with this more general assumption, however for ease of exposition we assume to be -incoherent.
Normalized Columns: We assume that all the columns of are normalized to ,
Note that the samples are invariant when we scale the columns of or under permutations of its columns. Thus we restrict ourselves to dictionaries with normalized columns and label the entire equivalence class of dictionaries with permuted columns and varying signs as .
Bounded max-norm: We assume that is bounded in matrix infinity norm
This is in contrast with previous work that imposes conditions on the operator norm of (Arora et al., 2015; Agarwal et al., 2014; Arora et al., 2013). Our assumptions help provide guarantees under alternate assumptions and it also allows the operator norm to grow with dimension, whereas earlier work requires to be such that
. In general the infinity norm and operator norm balls are hard to compare. However, one situation where a comparison is possible is if we assume the entries of the dictionary to be drawn iid from a Gaussian distribution
. Then by standard concentration theorems, for the operator norm condition to be satisfied we would need the variance () of the distribution to scale as while, for the infinity norm condition to be satisfied we need the variance to be . This means that modulo constants the variance can be much larger for the infinity norm condition to be satisfied than for the operator norm condition.
Assumption on the initial estimate and initialization
We require an initial estimate for the dictionary such that,
with ; where . Assuming
allows a fast random initialization, where we draw each entry of the initial estimate from the uniform distribution (on the interval). This allows us to circumvent the computationally heavy SVD/clustering step required in previous work to get an initial dictionary estimate (Arora et al., 2015; Agarwal et al., 2014; Arora et al., 2013). Note that we start with a random initialization and not with , as this causes our sparse estimator to fail (columns of need to be non-zero).
Next we assume a generative model on the -sparse covariates . Here are the assumptions we make about the (unknown) distribution of .
Conditional Independence: We assume that distribution of non-zero entries of is conditionally independent and identically distributed. That is, .
Sparsity Level:We assume that the level of sparsity is bounded
where is an appropriate global constant such that satisfies assumption (D3), see Remark 15. For incoherent dictionaries, this upper bound is tight up to constant factors for sparse recovery to be feasible (Donoho and Huo, 2001; Gribonval and Nielsen, 2003).
Boundedness: Conditioned on the event that is in the subset of non-zero entries, we have
with and . This is needed for the thresholded sparse estimator to correctly predict the sign of the true covariate (). We can also relax the boundedness assumption: it suffices for the to have sub-Gaussian distributions.
Probability of support: The probability of being in the support of is uniform over all . This translates to
Mean and variance of variables in the support:
We assume that the non-zero random variables in the support ofare centered and are normalized
We note that these assumptions (A1), (A2) and (C1) - (C5) are similar to those made in (Arora et al., 2015; Agarwal et al., 2014). Agarwal et al. (2014) require the matrices to satisfy the restricted isometry property, which is strictly weaker than -incoherence, however they can tolerate a much lower level of sparsity ().
3.2 Main Result
Suppose that true dictionary and the distribution of the -sparse samples satisfy the assumptions stated in Section 3.1 and we are given an estimate such that . If we are given i.i.d. samples in every iteration with then Algorithm 1 with parameters chosen as specified in Section 3.1 after iterations returns a dictionary such that,
4 Proof of Convergence
In this section we will prove the main convergence results stated as Theorem 3. To prove this result we will analyze the gradient update to the dictionary at each step. We will decompose this gradient update (which is a random variable) into a first term which is its expected value and a second term which is its deviation from expectation. We will prove a deterministic convergence result by working with the expected value of the gradient and then appeal to standard concentration arguments to control the deviation of the gradient from its expected value with high probability.
To un-clutter notation let, , . The coordinate of the covariate is written as . Similarly let be the coordinate of the estimate of the covariate at step . Finally let , and be the element of the gradient with samples at step . Unwrapping the expression for we get,
where denotes element of the expected value (infinite samples) of the gradient. The second term is the deviation of the gradient from its expected value. By Theorem 10 we can control the deviation of the sample gradient from its mean via an application of McDiarmid’s inequality. With this notation in place we are now ready to prove Theorem 3.
Proof [Proof of Theorem 3] First we analyze the structure of the expected value of the gradient.
Step 1: Unwrapping the expected value of the gradient we find it decomposes into three terms
The first term points in the correct direction and will be useful in converging to the right answer. The other terms could be in a bad direction and we will control their magnitude with Lemma 5 such that . The proof of Lemma 5 is the main technical challenge in the convergence analysis to control the error in the gradient. Its proof is deferred to the appendix.
Step 2: Given this bound, we analyze the gradient update,
So if we look at the distance to the optimum we have the relation,
Taking absolute values, we get
provided the first term is at non-negative. Here, follows by triangle inequality and is by Lemma 5. Next we give an upper and lower bound on . We would expect that as gets smaller this variance term approaches . By invoking Lemma 6 we can bound this term to be . We want the first term to contract at a rate ; it suffices to have
We also have with probability at least for all ; thus we are guaranteed to remain in our initial ball of radius with high probability, completing the proof.
An interesting question would be to further explore and analyze the range of algorithms for which alternating minimization works and identifying the conditions under which they provably converge. There also seem to be many open questions for improper dictionary learning and developing provably faster algorithms there. Going beyond sparsity still remains challenging, and as noted in previous work alternating minimization also appears to break down experimentally and new algorithms are required in this regime. Also all theoretical work on analyzing alternating minimization for dictionary learning seems to rely on identifying the signs of the samples () correctly at every step. It would be an interesting theoretical question to analyze if this is a necessary condition or if an alternate proof strategy and consequently a bigger radius of convergence are possible. Lastly, it is not known what the optimal sample complexity for this problem is and lower bounds there could be useful in designing more sample efficient algorithms.
We gratefully acknowledge the support of the NSF through grant IIS-1619362, and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. Thanks also to the Simons Institute for the Theory of Computing Spring 2017 Program on Foundations of Machine Learning.
- Agarwal et al. (2014) Agarwal, A., A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon (2014). Learning sparsely used overcomplete dictionaries. In COLT, pp. 123–137.
- Agarwal et al. (2013) Agarwal, A., A. Anandkumar, and P. Netrapalli (2013). A clustering approach to learn sparsely-used overcomplete dictionaries. arXiv preprint arXiv:1309.1952.
- Aharon et al. (2006) Aharon, M., M. Elad, and A. Bruckstein (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54(11), 4311–4322.
- Arora et al. (2015) Arora, S., R. Ge, T. Ma, and A. Moitra (2015). Simple, efficient, and neural algorithms for sparse coding. In COLT, pp. 113–149.
- Arora et al. (2013) Arora, S., R. Ge, and A. Moitra (2013). New algorithms for learning incoherent and overcomplete dictionaries.
- Balakrishnan et al. (2017) Balakrishnan, S., M. J. Wainwright, B. Yu, et al. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics 45(1), 77–120.
et al. (2015)
Barak, B., J. A. Kelner, and D. Steurer (2015).
Dictionary learning and tensor decomposition via the sum-of-squares method.In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 143–151. ACM.
- Belloni et al. (2014) Belloni, A., M. Rosenbaum, and A. B. Tsybakov (2014). An -Regularization Approach to High-Dimensional Errors-in-variables Models. arXiv preprint arXiv:1412.7216.
- Belloni et al. (2016) Belloni, A., M. Rosenbaum, and A. B. Tsybakov (2016). Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
- Boucheron et al. (2013) Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
- Chen and Caramanis (2013) Chen, Y. and C. Caramanis (2013). Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery.
- Donoho and Huo (2001) Donoho, D. L. and X. Huo (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47(7), 2845–2862.
- Elad and Aharon (2006) Elad, M. and M. Aharon (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing 15(12), 3736–3745.
- Engan et al. (1999) Engan, K., S. O. Aase, and J. H. Husoy (1999). Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, Volume 5, pp. 2443–2446. IEEE.
- Fuchs (2004) Fuchs, J.-J. (2004). Recovery of exact sparse representations in the presence of noise. In Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on, Volume 2, pp. ii–533. IEEE.
- Gautier and Tsybakov (2011) Gautier, E. and A. Tsybakov (2011). High-dimensional instrumental variables regression and confidence sets. arXiv preprint arXiv:1105.2454.
et al. (2015)
Gribonval, R., R. Jenatton, and F. Bach (2015).
Sparse and spurious: dictionary learning with noise and outliers.IEEE Transactions on Information Theory 61(11), 6298–6319.
- Gribonval and Nielsen (2003) Gribonval, R. and M. Nielsen (2003). Sparse representations in unions of bases. IEEE transactions on Information theory 49(12), 3320–3325.
- Hardt (2014) Hardt, M. (2014). Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 651–660. IEEE.
Hazan and Ma (2016)
Hazan, E. and T. Ma (2016).
A Non-generative Framework and Convex Relaxations for Unsupervised Learning.In Advances in Neural Information Processing Systems, pp. 3306–3314.
- Jain et al. (2013) Jain, P., P. Netrapalli, and S. Sanghavi (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pp. 665–674. ACM.
- Lewicki and Sejnowski (2000) Lewicki, M. S. and T. J. Sejnowski (2000). Learning overcomplete representations. Neural computation 12(2), 337–365.
- Loh and Wainwright (2011) Loh, P.-L. and M. J. Wainwright (2011). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pp. 2726–2734.
- McDiarmid (1989) McDiarmid, C. (1989). On the method of bounded differences. Surveys in combinatorics 141(1), 148–188.
- Meinshausen and Bühlmann (2006) Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the Lasso. The annals of statistics, 1436–1462.
- Netrapalli et al. (2013) Netrapalli, P., P. Jain, and S. Sanghavi (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pp. 2796–2804.
- Netrapalli et al. (2014) Netrapalli, P., U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain (2014). Non-convex robust PCA. In Advances in Neural Information Processing Systems, pp. 1107–1115.
- Olshausen and Field (1997) Olshausen, B. A. and D. J. Field (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research 37(23), 3311–3325.
- Rigollet and Tsybakov (2011) Rigollet, P. and A. Tsybakov (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 731–771.
- Rosenbaum et al. (2010) Rosenbaum, M., A. B. Tsybakov, et al. (2010). Sparse recovery under matrix uncertainty. The Annals of Statistics 38(5), 2620–2651.
- Rosenbaum et al. (2013) Rosenbaum, M., A. B. Tsybakov, et al. (2013). Improved matrix uncertainty selector. In From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner, pp. 276–290. Institute of Mathematical Statistics.
- Spielman et al. (2012) Spielman, D. A., H. Wang, and J. Wright (2012). Exact Recovery of Sparsely-Used Dictionaries. In COLT, pp. 37–1.
- Sun et al. (2017) Sun, J., Q. Qu, and J. Wright (2017). Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory 63(2), 853–884.
- Tropp (2006) Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE transactions on information theory 52(3), 1030–1051.
- Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats.
- Wu and Yu (2015) Wu, S. and B. Yu (2015). Local identifiability of -minimization dictionary learning: a sufficient and almost necessary condition. arXiv preprint arXiv:1505.04363.
- Yu (1997) Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423–435. Springer.
- Zhao and Yu (2006) Zhao, P. and B. Yu (2006). On model selection consistency of Lasso. Journal of Machine learning research 7(Nov), 2541–2563.
Appendix A Proof of Convergence
For Appendix A.1 and A.2, we borrow the notation from Section 4. In Appendix A.1 we prove Lemma 4 that controls an error term which will be useful in establishing Lemma 5 that bounds the error terms in the gradient, and . Corollary 9 establishes the error bound for the sparse estimate while Lemma 8 establishes that the sparse estimate after the thresholding step has the correct sign. In Appendix A.2, we establish finite sample guarantees.
a.1 Proof of Auxillary Lemmas
Let the assumptions stated in Section 3.1 hold. Then at each iteration step we have the guarantee that
Proof Let us define
and let us define the event . Expanding ,
Given that the non-zero entries of are independent and mean zero we have . Next we see and are bounded above as
these follow as , and . The goal now is to show that , and . Let us unwrap the first term of
where follows by invoking Corollary 9 and follows as and . Our choice ensures that . The second term in the upper bound on can be bounded by the same technique as we used to bound , giving . For the first term in , we have
where these inequalities follow by invoking Corollary 9 and by the upper bounds on and . Again our choice ensures that which leaves us with the upper bound on . Finally to bound we observe that the first term is bounded as follows,
where the last inequality is due to the fact that . As , we have and similar arguments as above can be used to show that the second term in is also bounded above by . Having controlled and at the appropriate levels completes the proof and yields the desired bound on .
Let the assumptions stated in Section 3.1 hold. Then at each iteration step we can bound the error terms in the gradient as
Proof Part 1-We first prove the bound on . We start by unpacking