Computational Intractability of Dictionary Learning for Sparse Representation

by   Meisam Razaviyayn, et al.

In this paper we consider the dictionary learning problem for sparse representation. We first show that this problem is NP-hard by polynomial time reduction of the densest cut problem. Then, using successive convex approximation strategies, we propose efficient dictionary learning schemes to solve several practical formulations of this problem to stationary points. Unlike many existing algorithms in the literature, such as K-SVD, our proposed dictionary learning scheme is theoretically guaranteed to converge to the set of stationary points under certain mild assumptions. For the image denoising application, the performance and the efficiency of the proposed dictionary learning scheme are comparable to that of K-SVD algorithm in simulation.



page 10


Denoising and Completion of 3D Data via Multidimensional Dictionary Learning

In this paper a new dictionary learning algorithm for multidimensional d...

A Split-and-Merge Dictionary Learning Algorithm for Sparse Representation

In big data image/video analytics, we encounter the problem of learning ...

Convolutional Dictionary Learning

Convolutional sparse representations are a form of sparse representation...

Learning Hierarchical Sparse Representations using Iterative Dictionary Learning and Dimension Reduction

This paper introduces an elemental building block which combines Diction...

A Theoretically Novel Trade-off for Sparse Secret-key Generation

We in this paper theoretically go over a rate-distortion based sparse di...

Dictionary Learning with Uniform Sparse Representations for Anomaly Detection

Many applications like audio and image processing show that sparse repre...

ℓ_1-K-SVD: A Robust Dictionary Learning Algorithm With Simultaneous Update

We develop a dictionary learning algorithm by minimizing the ℓ_1 distort...

Code Repositories


Codes for dictionary learning for sparse representation

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The idea of representing a signal with few samples/observations dates back to the classical result of Kotelnikon, Nyquist, Shannon, and Whittaker [1, 2, 3, 4, 5]. This idea has evolved over time, and culminated to the compressive sensing concept in recent years [6, 7]. The compressive sensing or sparse recovery approach relies on the observation that many practical signals can be sparsely approximated in a suitable over-complete basis (i.e., a dictionary). In other words, the signal can be approximately written as a linear combination of only a few components (or atoms) of the dictionary. This observation is a key to many lossy compression methods such as JPEG and MP3.

Theoretically, the exact sparse recovery is possible with high probability under certain conditions. More precisely, it is demonstrated that if the linear measurement matrix satisfies some conditions such as null space property (NSP) or restricted isometry property (RIP), then the exact recovery is possible

[6, 7]. These conditions are satisfied with high probability for different matrices such as Gaussian random matrices, Bernoulli random matrices, and partial random Fourier matrices.

In addition to the theoretical advances, compressive sensing has shown great potential in various applications. For example, in the nuclear magnetic resonance (NMR) imaging application, compressive sensing can help reduce the radiation time [8, 9]. Moreover, the compressive sensing technique has been successfully applied to many other practical scenarios including sub-Nyquist sampling [10, 11], compressive imaging [12, 13], and compressive sensor networks [14, 15], to name just a few.

In some of the aforementioned applications, the sensing matrix and dictionary are pre-defined using application domain knowledge. However, in most applications, the dictionary is not known a-priori and must be learned using a set of training signals. It has been observed that learning a good dictionary can substantially improve the compressive sensing performance, see [16, 17, 18, 19, 20, 21, 22]. In these applications, dictionary learning is the most crucial step affecting the performance of the compressive sensing approach.

To determine a high quality dictionary, various learning algorithms have been proposed; see, e.g., [22, 16, 23, 24]. These algorithms are typically composed of two major steps: 1) finding an approximate sparse representation of the training signals 2) updating the dictionary using the sparse representation.

In this paper, we consider the dictionary learning problem for sparse representation. We first establish the NP-hardness of this problem. Then we consider different formulations of the dictionary learning problem and propose several efficient algorithms to solve this problem. In contrast to the existing dictionary training algorithms [23, 22, 16]

, our methods neither solve Lasso-type subproblems nor find the active support of the sparse representation vector at each step; instead, they require only simple inexact updates in closed form. Furthermore, unlike most of the existing methods in the literature, e.g.,

[16, 22], the iterates generated by the proposed dictionary learning algorithms are theoretically guaranteed to converge to the set of stationary points under certain mild assumptions.

Ii Problem Statement

Given a set of training signals our task is to find a dictionary that can sparsely represent the training signals in the set . Let , denote the coefficients of sparse representation of the signal , i.e., , where is the -th component of signal . By concatenating all the training signals, the dictionary elements, and the coefficients, we can define the matrices , , and . Having these definitions in our hands, the dictionary learning problem for sparse representation can be stated as


where and are two constraint sets. The function measures our model goodness of fit. In the next section, we analyze the computational complexity of one of the most popular forms of problem (1).

Iii Complexity Analysis

Consider a special case of problem (1) by choosing the distance function to be the Frobenius norm and imposing sparsity by considering the constraint set . Then the optimization problem (1) can be re-written as


This formulation is very popular and is considered in different studies; see, e.g., [25, 22]. The following theorem characterizes the computational complexity of (2) by showing its NP-hardness. In particular, we show that even for the simple case of and , problem (2) is NP-hard. To state our result, let us define the following concept: let be a solution of (2). For , we say a point is an -optimal solution of (2) if .

Theorem 1

Assume and . Then finding an -optimal algorithm for solving (2) is NP-hard. More precisely, there is no polynomial time algorithm in that can solve (2) to -optimality, unless .


The proof is based on the polynomial time reduction of the densest cut problem. The densest cut problem can be stated as follows:
Densest Cut Problem: Given a graph , the goal is to maximize the ratio over all the bipartitions of the vertices of the graph . Here denotes the set of edges between the two partitions and the operator returns the cardinality of a set.
Given an undirected graph , we put an arbitrary directions on it and we define to be the incidence transpose matrix of the directed graph. In other words, with

  • if edge leaves vertex

  • if edge enters vertex

  • otherwise

Now let us consider the following optimization problem:


with and .
Claim 1: Problem (3) is equivalent to the densest cut problem over the graph  [26].
Claim 2: Consider two different feasible points and in problem (3). Let (resp. ) be the optimal solution of (3) after fixing the variable to (resp. ). Let us further assume that . Then, .
The proof of claims 1 and 2 are relegated to the appendix section. Clearly, problem (3) is different from (2); however the only difference is in the existence of the extra linear constraint in (3). To relate these two problems, let us define the following problem:


where is of the same dimension as , but the matrices and have one more row than and . Here the matrices and have the same number of columns as and , respectively. By giving a special form to the matrix , we will relate the optimization problem (4) to (3). More specifically, each column of is defined as follows:

with . Clearly, the optimization problem (4) is of the form (2). Let denote the optimizer of (4). Then it is not hard to see that the first row of the matrix should be nonzero and hence by a proper normalization of the matrices and , we can assume that the first row of the matrix is , i.e., . Define . Let denote the minimizer of (3). Similarly, define where is the minimizer of (4), excluding the first row. Furthermore, define , where is obtained by replacing the nonzero entries of with one. Having these definitions in our hands, the following claim will relate the two optimization problems (3) and (4).
Claim 3: .
The proof of this claim can be found in the appendix section.
Now set . If we can solve the optimization problem (4) to the -accuracy, then according to Claim 3, we have the optimal value of problem (3) with accuracy . Noticing that and using Claim 2, we can further conclude that the exact optimal solution of (3) is known; which implies that the optimal value of the original densest cut problem is known (according to Claim 1). The NP-hardness of the densest cut problem will complete the proof. ∎

Remark 1

Note that in the above NP-hardness result, the input size of is considered instead of . This in fact implies a stronger result that there is no quasi-polynomial time algorithm for solving (2); unless P=NP.

It is worth noting that the above NP-hardness result is different from (and is not a consequence of) the compressive sensing NP-hardness result in [27]. In fact, for a fixed sparsity level , the compressive sensing problem is no longer NP-hard, while the dictionary learning problem considered herein remains NP-hard (see Theorem 1).

Iv Algorithms

Iv-a Optimizing the goodness of fit

In this section, we assume that the function is composed of a smooth part and a non-smooth part for promoting sparsity, i.e., , where is smooth and is continuous and possibly non-smooth. Let us further assume that the sets are closed and convex. Our approach to solve (1) is to apply the general block successive upper-bound minimization framework developed in [28]. More specifically, we propose to alternately update the variables and . Let be the point obtained by the algorithm at iteration . Then, we select one of the following methods to update the dictionary variable at iteration :

and we update the variable by

  • .

Here the operator denotes the inner product; the superscript represents the iteration number; the notation is the projection operator to the convex set ; and the constants and are chosen such that



It should be noted that each step of the algorithm requires solving an optimization problem. For the commonly used objective functions and constraint sets, the solution to these optimization problems is often in closed form. In addition, the update rule (b) is the classical gradient projection step which can be viewed as an approximate version of (a). As we will see later, for some special choices of the function and the set , using (b) leads to a closed form update rule, while (a) does not. In the sequel, we specialize this framework to different popular choices of the objective functions and the constraint sets.

Case I: Constraining the total dictionary norm
For any , we consider the following optimization problem


where denotes the regularization parameter. By simple calculations, we can check that all the steps of the proposed algorithm can be done in closed form. More specifically, using the dictionary update rule (a) will lead to Algorithm 1.

  initialize randomly such that
  until some convergence criterion is met
Algorithm 1 The proposed algorithm for solving (6)

In this algorithm,

denotes the maximum singular value;

is the Lagrange multiplier of the constraint which can be found using one dimensional search algorithms such as bisection or Newton. The notation denotes the component-wise soft shrinkage operator, i.e., if

where and denote the -th component of the matrices and , respectively.

Case II: Constraining the norm of each dictionary atom
In many applications, it is of interest to constrain the norm of each dictionary atom, i.e., the dictionary is learned by solving:


In this case, the dictionary update rule (a) cannot be expressed in closed form; as an alternative, we can use the update rule (b), which is in closed form, in place of (a). This gives Algorithm 2.

  For solving (7): initialize randomly s.t.
  For solving (8): initialize and
     For solving (7):
     For solving (8):
  until some convergence criterion is met
Algorithm 2 The proposed algorithm for solving (7) and (8)

In this algorithm, the set is defined as

Case III: Non-negative dictionary learning with the total norm constraint
Consider the non-negative dictionary learning problem for sparse representation:


Utilizing the update rule (b) leads to Algorithm 2. Note that in this case, projections to the sets and are simple. In particular, to project to the set , we just need to first project to the set of nonnegative matrices first and then project to the set .

It is worth noting that Algorithm 2 can also be applied to the case where , since the projection to the constraint set still remains simple.

Case IV: Sparse non-negative matrix factorization
In some applications, it is desirable to have a sparse non-negative dictionary; see, e.g., [29, 30, 31]. In such cases, we can formulate the dictionary learning problem as:


It can be checked that we can again use the essentially same steps of the algorithm in case III to solve (9). The only required modification is in the projection step since the projection should be onto the set . This step can be performed in a column-wise manner by updating each column to , where denotes the projection to the set of nonnegative matrices and is a constant that can be determined via one dimensional bisection. The resulting algorithm is very similar (but not identical) to the one in [29]. However, unlike the algorithm in [29], all of our proposed algorithms are theoretically guaranteed to converge, as shown in Theorem 2.

Theorem 2

The iterates generated by the algorithms in cases I-IV converge to the set of stationary points of the corresponding optimization problems.

Proof: Each of the proposed algorithms in cases I-IV is a special case of the block successive upper-bound minimization approach [28]. Therefore, [28, Theorem 2] guarantees the convergence of the proposed methods.

Iv-B Constraining the goodness of fit

In some practical applications, the goodness of fit level may be known a-priori. In these cases, we may be interested in finding the sparsest representation of the data for a given goodness of fit level. In particular, for a given , we consider


For example, when the noise level is known, the goodness of fit function can be set as . We propose an efficient method (Algorithm 3) to solve (10), where the constant is chosen according to criterion in (5).

  initialize randomly s.t. and find a feasible
  until some convergence criterion is met
Algorithm 3 The proposed algorithm for solving (10)

It is clear that Algorithm 3 is not a special case of block coordinate descent method [32] or even the block successive upper-bound minimization method [28]. Nonetheless, the convergence of Algorithm 3 is guaranteed in light of the following theorem.

Theorem 3

Assume that is a limit point of the iterates generated by Algorithm 3. Furthermore, assume that the subproblem for updating is strictly feasible at , i.e., there exists such that Then is a stationary point of (10).

This theorem is similar to [33, Property 3]. However, the proof here is different due to the lack of smoothness in the objective function. The proof is omitted due to the space limitation.

V Numerical Experiments

In this section, we apply the proposed sparse dictionary learning method, namely algorithm 2, to the image denoising application; and compare its performance with that of the K-SVD algorithm proposed in [18] (and summarized in Algorithm 4

). As a test case, we use the image of Lena corrupted by additive Gaussian noise with various variances (


In Algorithm 4, denotes the image patch centered at coordinate. In step , dictionary is trained to sparsely represent noisy image patches by using either K-SVD algorithm or Algorithm 2. The term denotes the sparse representation coefficient of the patch . In K-SVD, it (approximately) solves -norm regularized problem (11) by using orthogonal matching pursuit (OMP) to update . In our approach, we use Algorithm 2 with to solve the -penalized dictionary learning formulation (12). We set in (12) with , and denotes the total number of image patches. This choice of the parameter intuitively means that we emphasize on sparsity more in the presence of stronger noise. Numerical values are determined experimentally. The final denoised image is obtained by (13) and setting , as suggested in [18].

Fig. 1: Sample denoised images ().
/PSNR DCT K-SVD Algorithm 2
20/22.11 32 32.38 30.88
60/12.57 26.59 26.86 26.37
100/8.132 24.42 24.45 24.46
140/5.208 22.96 22.93 23.11
180/3.025 21.73 21.69 21.96
TABLE I: Image denoising result comparison on “Lena” for different noise levels. Values are averaged over Monte Carlo simulations.
0:  noisy image , noise variance
0:  denoised image
1:  Initialization: ,
2:  Dictionary learning:K-SVD:
Algorithm 2:
3:   update:
Algorithm 4 Image denoising using K-SVD or algorithm 2

The final peak signal-to-noise ratio (PSNR) comparison is summarized in Table I; and sample images are presented in Figure 1. As can be seen in Table I, the resulting PSNR values of the proposed algorithm are comparable with the ones obtained by K-SVD. However, visually, K-SVD produces more noticeable artifacts (see the circled spot in Figure 1) than our proposed algorithm. The artifacts may be due to the use of OMP in K-SVD which is less robust to noise than the -regularizer used in Algorithm 2. As for the CPU time, the two algorithms perform similarly in the numerical experiments.

Acknowledgment: The authors are grateful to the University of Minnesota Graduate School Doctoral Dissertation Fellowship support during this research.

[Part I: NP-hardness Proof]

Proof of Claim 1: This proof is exactly the same as the proof in [26]. Here we restate the proof since some parts of the proof is necessary for the proof of Claim 2. Consider a feasible point of problem (3). Clearly, in any column of the matrix , either the first component is zero, or the second one. This gives us a partition of the columns of the matrix (which is equivalent to a partition over the nodes of the graph). Let (resp. ) be the set of columns of for which the first (resp. the second) component is nonzero at the optimality. Define and . Then the optimal value of the matrix is given by:

  • , if

  • if

where is the -th component of column in matrix . Plugging in the optimal value of the matrix , the objective function of (3) can be rewritten as:


Hence, clearly, solving is equivalent to solving the densest cut problem on graph .

Proof of Claim 2: According to the proof of Claim 1, we can write

Proof of Claim 3: First of all, notice that the point

is feasible and it should have a higher objective value than the optimal one. Therefore,

which in turn implies that


since . Clearly, and moreover notice that for each only one of the elements and is nonzero. Therefore, any nonzero element should be larger than . On the other hand, due to the way that we construct , we have . This implies that , leading to


where and are the first and the second column of matrix . Having these simple bounds in our hands, we are now able to bound :


Furthermore, since is a feasible point for (3) and due to the optimality of , we have


On the other hand,


otherwise, we can add the row on top of and get a lower objective for (4). Combining (17), (18), and (19) will conclude the proof.

[Part II: Successive Convex Approximation] In this part of the appendix, we analyze the performance of the successive convex approximation method which is used in the development of Algorithm 3. To the best of our knowledge, very little is known about the convergence of the successive convex approximation method in the general nonsmooth nonconvex setting. Hence here we state our analysis for the general case. To the best of our knowledge, the previous analysis of this method in [33, Property 3] is for the smooth case only and a special approximation function; where our analysis covers the nonsmooth case and it appears to be much simpler. To state our result, let us first define the successive convex approximation approach. Consider the following optimization problem:


where the function is smooth (possibly nonconvex) and is convex (possibly nonsmooth), for all . A popular practical approach for solving this problem is the successive convex approximation (also known as majorization minimization) approach where at each iteration of the method, a locally tight approximation of the original optimization problem is solved subject to a tight convex restriction of the constraint sets. More precisely, we consider the successive convex approximation method in Algorithm 5.

  Find a feasible point in (20), choose a stepsize , and set
     Set to be a solution of the following optimization problem
  until some convergence criterion is met
Algorithm 5 Successive Convex Approximation Method for Solving (20)

The approximation functions in the algorithm need to satisfy the following assumptions:

Assumption 1

Assume the approximation functions satisfy the following assumptions:

  • is continuous in

  • is convex in

  • Function value consistency:

  • Gradient consistency:

  • Upper-bound:

In other words, we assume that at each iteration, we approximate the original functions with some upper-bounds of them which have the same first order behavior.

In order to state our result, we need to define the following condition:

Slater condition for SCA: Given the constraint approximation functions , we say that the Slater condition is satisfied at a given point if there exists a point in the interior of the restricted constraint sets at the point , i.e.,

for some . Notice that if the approximate constraints are the same as the original constraints, then this condition will be the same as the well-known Slater condition for strong duality.

Theorem 4

Let be a limit point of the iterates generated by Algorithm 5. Assume Assumption 1 is satisfied and Slater condition holds at the point . Then is a KKT point of (20).


First of all since the approximate functions are upper-bounds of the original functions, all the iterates are feasible in the algorithm. Moreover, due to the upper-bound and function value consistency assumptions, it is not hard to see that

where the second inequality is the result of convexity of . Hence, the objective value is nonincreasing and we must have




Let be the subsequence converging to the limit point . Consider any fixed point satisfying


Then for sufficiently large, we must have

i.e., is a strictly feasible point at the iteration . Therefore,

due to the definition of . Letting and using (22), we have

Notice that this inequality holds for any satisfying . Combining this fact with the convexity of and the Slater condition implies that

Since the Slater condition is satisfied, using the gradient consistency assumption, the KKT condition of the above optimization problem implies that there exist such that

Using the upper-bound and the objective value consistency assumptions, we have

which completes the proof. ∎

It is also worth noting that in the presence of linear constraints, the Slater condition should be considered for the relative interior of the constraint set instead of the interior.


  • [1] V. A. Kotel nikov, “On the carrying capacity of the ether and wire in telecommunications,” in Material for the First All-Union Conference on Questions of Communication, Izd. Red. Upr. Svyazi RKKA, Moscow, 1933.
  • [2] H. Nyquist, “Certain topics in telegraph transmission theory,” Transactions of the American Institute of Electrical Engineers, vol. 47, no. 2, pp. 617–644, 1928.
  • [3] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, 1949.
  • [4] E. T. Whittaker,

    On the functions which are represented by the expansions of the interpolation-theory

    Edinburgh University, 1915.
  • [5] R. Prony, “Essai experimental et analytique sur les lois de la dilatabilite des fluides elastiques et sur celles de la force expansive de la vapeur de leau et de la vapeur de lalkool, r differentes temperatures,” Journal Polytechnique ou Bulletin du Travail fait r Lecole Centrale des Travaux Publics, Paris, Premier Cahier, pp. 24–76, 1995.
  • [6] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [7] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
  • [8] M. Lustig, D. L. Donoho, and J. M. Pauly, “Rapid mr imaging with compressed sensing and randomly under-sampled 3dft trajectories,” in Proceedings of 14th Annual Meeting of ISMRM. Citeseer, 2006.
  • [9] M. Lustig, J. H. Lee, D. L. Donoho, and J. M. Pauly, “Faster imaging with randomly perturbed, under-sampled spirals and l1 reconstruction,” in Proceedings of the 13th Annual Meeting of ISMRM, Miami Beach, 2005, p. 685.
  • [10] K. Gedalyahu and Y. C. Eldar,

    “Time-delay estimation from low-rate samples: A union of subspaces approach,”

    IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3017–3031, 2010.
  • [11] M. Mishali and Y. C. Eldar, “Blind multiband signal reconstruction: Compressed sensing for analog signals,” IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 993–1009, 2009.
  • [12] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.
  • [13] R. F. Marcia, Z. T. Harmany, and R. M. Willett, “Compressive coded aperture imaging,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2009, pp. 72460G–72460G.
  • [14] M. F. Duarte, S. Sarvotham, D. Baron, M. B. Wakin, and R. G. Baraniuk, “Distributed compressed sensing of jointly sparse signals,” in Asilomar Conference on Signals, Systems, and Computing, 2005, pp. 1537–1541.
  • [15] M. A. Davenport, C. Hegde, M. F. Duarte, and R. G. Baraniuk, “Joint manifolds for data fusion,” IEEE Transactions on Image Processing, vol. 19, no. 10, pp. 2580–2594, 2010.
  • [16] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Advances in neural information processing systems, 2006, pp. 801–808.
  • [17] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski, “Dictionary learning algorithms for sparse representation,” Neural computation, vol. 15, no. 2, pp. 349–396, 2003.
  • [18] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
  • [19] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural computation, vol. 12, no. 2, pp. 337–365, 2000.
  • [20] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,