Sparse coding, which represents a signal as a sparse linear combination of basis elements from a dictionary, has been successfully applied in many signal processing tasks, such as image restoration [6, 21] and image classification [20, 22], to name a few. The dictionary is crucial and plays an important role in the success of sparse representation. Most of the compressive sensing literature takes off-the-shelf bases such as wavelets as the dictionary [4, 5]. In contrast, dictionary learning assumes that a signal can be sparsely represented in a learned and usually over-complete dictionary. A pre-specified dictionary may be universal, but it will not be effective enough for a specific task such as face recognition [24, 11]. Instead, using a learned dictionary has recently led to state-of-the-art results in many practical applications, such as denoising [6, 17, 27, 1], inpainting [13, 15, 18], and image compression.
In this paper, we propose a novel regularization method called the Grouped Smoothly Clipped Absolute Deviation (GSCAD) to learn a sparse dictionary and select the appropriate dictionary size simultaneously. It should be emphasized that determining a proper size for the to-be-learned dictionary is crucial for both the precision and the efficiency of the process. There is little existing work discussing the selection of the dictionary size; most algorithms fix the number of atoms in the dictionary. In general, a two-stage procedure may be used to infer the dictionary size by first learning a dictionary with a fixed size and then defining a new objective function that penalizes model complexity. Bayesian techniques can also be employed by placing a prior on the dictionary size. In addition, many methods have addressed the group variable selection problem in the statistics literature [23, 25, 10, 28, 9].
This paper makes four main contributions:
Our approach imposes sparsity-enforcing constraints on the learned atoms, which improves interpretability of the results and achieves variable selection in the input space.
Our approach is a one-stage procedure to learn a sparse dictionary and the dictionary size jointly.
Our proposed algorithm is based on the alternating direction method of multipliers (ADMM). The joint non-convex problem with the non-convex penalty is decomposed into two convex optimization problems.
Compared with other state-of-the-art dictionary learning methods, GSCAD has better or competitive performance in various denoising tasks.
2 GSCAD penalty
Review of the Smoothly Clipped Absolute Deviation (SCAD) penalty. The SCAD penalty was first proposed by Fan and Li in the context of high-dimensional linear regression. SCAD has several desirable properties: (i) Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large; (ii) Sparsity: the resulting estimator sets small estimated coefficients to zero to reduce model complexity; (iii) Continuity: the resulting estimator is continuous in the data, which avoids instability in model prediction. Defined as
$$
p_{\lambda,a}(\theta) =
\begin{cases}
\lambda|\theta|, & |\theta| \le \lambda,\\[2pt]
-\dfrac{\theta^2 - 2a\lambda|\theta| + \lambda^2}{2(a-1)}, & \lambda < |\theta| \le a\lambda,\\[2pt]
\dfrac{(a+1)\lambda^2}{2}, & |\theta| > a\lambda,
\end{cases}
\qquad (1)
$$
for some $\lambda > 0$ and $a > 2$, the SCAD penalty contains three segments. When $|\theta|$ is small (less than $\lambda$), it acts exactly like the Lasso penalty; when $|\theta|$ is large (greater than $a\lambda$), it becomes a constant so that no extra penalty is applied to truly significant parameters; these two segments are connected by a quadratic function, which yields a continuously differentiable SCAD penalty function.
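The three segments above can be sketched numerically. The helper below is an illustrative implementation rather than code from the paper; the default $a = 3.7$ follows the common recommendation in the SCAD literature.

```python
import numpy as np

def scad_penalty(theta, lam=1.0, a=3.7):
    """Element-wise SCAD penalty: Lasso-like for |theta| <= lam,
    a quadratic bridge for lam < |theta| <= a*lam, and the constant
    (a + 1) * lam**2 / 2 for |theta| > a*lam."""
    t = np.abs(np.asarray(theta, dtype=float))
    lasso = lam * t
    quad = -(t ** 2 - 2 * a * lam * t + lam ** 2) / (2 * (a - 1))
    const = (a + 1) * lam ** 2 / 2
    return np.where(t <= lam, lasso, np.where(t <= a * lam, quad, const))
```

The two join points are continuous: the penalty equals `lam**2` from both sides at `|theta| = lam`, and it flattens at `(a + 1) * lam**2 / 2` beyond `a*lam`, so truly large parameters receive no extra penalty.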
GSCAD penalty. Even though the SCAD penalty possesses many good properties, it treats parameters individually and does not address any group effect among parameters. Motivated by the structure of the dictionary, we propose a new penalty, GSCAD, where G stands for group. Let
$d = (d_1, \ldots, d_m)^\top$ be a vector in $\mathbb{R}^m$. The GSCAD penalty is defined as
where $p_{\lambda,a}$ is the SCAD penalty defined in (1). It inherits all three merits of SCAD (unbiasedness, sparsity and continuity) and at the same time takes into account both individual parameters and the group effect among parameters. Individually, the GSCAD penalty tends to set small estimated $d_j$ to zero. Group-wise, if all elements of $d$ are small, the penalty will shrink the entire vector to zero. In addition, if some of the $d_j$ are significantly large, the penalty will be more tolerant of smaller elements appearing in $d$.
To better understand GSCAD, let us consider a penalized least squares problem with an orthogonal design,
where $y$ and $\theta$ are vectors in $\mathbb{R}^m$. For GSCAD, SCAD and Lasso, the penalty is, respectively,
Estimators of $\theta$ in the one-dimensional case are shown in Figure 1 (a), where GSCAD behaves very similarly to SCAD. All three penalties exhibit the sparsity property, since they all set $\hat{\theta}$ to zero when $|y|$ is small. While the soft-thresholding of the Lasso has an inherent bias, SCAD and GSCAD give $\hat{\theta} = y$ when $|y|$ is large and thus avoid bias. In a two-dimensional case, we investigate the partition of the space according to the number of non-zero elements in the resulting estimator $\hat{\theta}$; see Figure 1 (b)-(c). While SCAD and Lasso treat each coordinate individually, GSCAD takes the whole group into account: it becomes less likely to set the estimator of one coordinate to zero as the estimator of another coordinate moves away from zero.
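For the one-dimensional orthogonal-design problem, the Lasso and SCAD estimators have well-known closed forms (soft-thresholding and the SCAD thresholding rule of Fan and Li). The sketch below contrasts them, illustrating the bias of soft-thresholding and the unbiasedness of SCAD for large $|y|$; the GSCAD estimator itself is omitted, since its exact form depends on the group structure.

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso estimator under an orthogonal design: shrinks every input
    toward zero by lam, which biases large coefficients."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def scad_threshold(y, lam=1.0, a=3.7):
    """SCAD estimator under an orthogonal design: soft-thresholding for
    |y| <= 2*lam, a linear interpolation for 2*lam < |y| <= a*lam, and
    the identity (no shrinkage, hence no bias) for |y| > a*lam."""
    y = np.asarray(y, dtype=float)
    mid = ((a - 1) * y - np.sign(y) * a * lam) / (a - 2)
    return np.where(np.abs(y) <= 2 * lam,
                    soft_threshold(y, lam),
                    np.where(np.abs(y) <= a * lam, mid, y))
```

Both rules return zero for small inputs, but `scad_threshold(5.0)` returns `5.0` exactly while `soft_threshold(5.0, 1.0)` returns the biased value `4.0`.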
Convexity. Even though GSCAD is built upon the non-convex SCAD penalty function, our development uncovers a surprising fact: the optimization problem of GSCAD under an orthogonal design is a convex problem. This greatly facilitates the implementation of GSCAD.
Theorem 1. Define the estimator as the minimizer of the optimization problem
, and . Denote , and let be the open interval between and 0. Then the problem is equivalent to
Let be the number of non-zero elements in . If
then the optimization problem is convex, and its solution is continuous in the data.
Remarks on Theorem 1. (i) Adding a constant in (2) makes the problem more general, so that the convexity result can be applied directly to the algorithms in Section 3.3, where it plays the role of the penalty parameter in the augmented Lagrangian method. (ii) Condition (4) can be satisfied easily under a wide range of circumstances. For instance, in the previous two-dimensional example with , , and , Condition (4) will be satisfied as long as .
3 Dictionary Learning with GSCAD
3.1 Matrix Factorization Framework
Dictionary learning problems are commonly specified under the framework of matrix factorization. Consider a vectorized clean signal $x \in \mathbb{R}^m$ and a dictionary $D \in \mathbb{R}^{m \times p}$, with its columns referred to as atoms. Sparse representation theory assumes that the signal can be well approximated by a linear combination of a few atoms in $D$, i.e. $x \approx D\alpha$,
where the number of non-zero elements in $\alpha$ is far less than the number of atoms $p$. In most cases, the clean signal $x$ is not available; instead, we only observe a noisy signal $y = x + \varepsilon$, where
$\varepsilon$ represents noise with mean zero and variance $\sigma^2$. Suppose we have $n$ signals $y_1, \ldots, y_n$, and we want to retrieve the corresponding clean signals $x_1, \ldots, x_n$. This can be summarized as the matrix factorization model
where . To make the problem identifiable, we require the dictionary to belong to a convex set
Dictionary learning aims to obtain estimates of the dictionary and the sparse coding, and then reconstruct the clean signal as . This is usually done by minimizing the total squared error:
where is the Frobenius norm. Constraints such as ($\ell_0$ penalty) and (Lasso penalty), for some positive constants and , are widely adopted in the dictionary learning literature. Experiments have shown that the Lasso penalty provides better results when used for learning the dictionary, while the $\ell_0$ norm should always be used for the final reconstruction step.
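Concretely, the penalized reconstruction objective of this section can be written in a few lines. The names `Y`, `D`, `A` and the Lasso form of the sparsity penalty are illustrative choices, not the paper's exact notation:

```python
import numpy as np

def dictionary_objective(Y, D, A, lam):
    """0.5 * ||Y - D A||_F^2 + lam * ||A||_1: the total squared
    reconstruction error plus a Lasso penalty on the sparse codes.
    Y: m x n signals, D: m x p dictionary, A: p x n codes."""
    residual = Y - D @ A
    return 0.5 * np.linalg.norm(residual, "fro") ** 2 + lam * np.abs(A).sum()
```

With an exact factorization `Y = D @ A` the data term vanishes and only the penalty `lam * ||A||_1` remains.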
3.2 Regularization on Dictionary
Compared with sparse coding, regularization of the dictionary size is less studied. Most existing methods, such as K-SVD and Online Learning, estimate the dictionary directly with a fixed dictionary size. They usually require the size of the dictionary to be specified before learning, and they end up with an over-complete dictionary solution, which may not be very helpful if we want to better understand the underlying mechanism. In addition, learning a sparse dictionary can lower the model complexity and improve the interpretability of the results. All these issues can be addressed with the help of the GSCAD penalty, which allows us to reveal the true size of the dictionary and at the same time obtain an estimated sparse dictionary. More specifically, denote the dictionary as $D$ with atoms $d_1, \ldots, d_p$. The GSCAD penalty on the dictionary is defined by
where $p_{\lambda,a}$ is the SCAD penalty defined in (1). The objective function of our problem is formulated as
Firstly, the GSCAD penalty tends to set small estimated entries to zero, which reduces the complexity of the estimated dictionary. If all elements of an atom $d_k$ are small, GSCAD will shrink the whole atom to zero. Therefore, when starting with a relatively large dictionary, GSCAD is able to prune the dictionary by penalizing useless atoms to zero. In this way, the true size of the dictionary can be approximated by the number of non-zero columns in the resulting dictionary. In addition, if GSCAD detects some significantly large entries in an atom $d_k$, it will exert less penalty on the whole atom, to avoid mistakenly truncating any real signal.
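The pruning step described above amounts to dropping all-zero columns and reading off the surviving dictionary size. A minimal sketch (the tolerance `tol` is an illustrative choice):

```python
import numpy as np

def prune_zero_atoms(D, tol=1e-10):
    """Remove atoms (columns) that the penalty has driven to zero; the
    number of remaining columns estimates the dictionary size."""
    keep = np.linalg.norm(D, axis=0) > tol
    return D[:, keep]
```

After each dictionary update, the working size can simply be reset to `prune_zero_atoms(D).shape[1]`.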
We follow the classic two-step approach to solve the optimization problem iteratively. Given the dictionary, we update the sparse codes by solving the Lasso problem,
for all signals . Given the sparse codes, the optimization problem becomes
which is addressed by the ADMM algorithm. Once the dictionary is updated, we remove all zero columns and reset $p$ to the current number of atoms. Algorithm 1 summarizes the whole procedure. It should be noted that (6) is a non-convex problem. Recently, the global convergence of ADMM for non-convex optimization has been discussed in , where it is shown that several ADMM algorithms, including those with the SCAD penalty, are guaranteed to converge.
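The Lasso update for the codes can be sketched with a simple proximal-gradient (ISTA) loop; this is a generic stand-in for whichever Lasso solver is used in practice, with illustrative parameter names:

```python
import numpy as np

def lasso_ista(D, y, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||y - D a||^2 + lam*||a||_1 by ISTA: a gradient
    step on the quadratic term followed by soft-thresholding (the
    proximal operator of the l1 norm)."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        v = a - grad / L
        a = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return a
```

For an orthonormal `D` the loop reduces to one soft-thresholding step, which matches the closed-form orthogonal-design solution.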
ADMM for updating the dictionary. Problem (6) is equivalent to
We form the augmented Lagrangian as
where is the element-wise multiplication operator of two matrices, and . The ADMM algorithm consists of three steps in each iteration:
Problem (7) admits an explicit solution,
while problem (8) can be solved by column-wise optimization of the form
for . In Theorem 1, we have shown that the latter is a convex problem under Condition (4), and it can be solved easily by existing convex optimization algorithms. The ADMM algorithm for updating the dictionary is summarized in Algorithm 2.
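The three-step structure can be sketched end to end. In this sketch, plain soft-thresholding stands in for the GSCAD proximal step (which the paper solves as a small convex problem per column), so the code illustrates the ADMM mechanics rather than the exact GSCAD update; all names and defaults are illustrative.

```python
import numpy as np

def admm_dict_update(Y, A, lam=0.1, rho=1.0, n_iter=100):
    """ADMM sketch for the dictionary update: split the problem as
    min_{D,Z} 0.5*||Y - D A||_F^2 + pen(Z)  subject to  D = Z.
    Step 1 (the quadratic D-subproblem) has an explicit solution;
    step 2 applies a column-wise proximal operator to the penalty
    (soft-thresholding used here as a stand-in for the GSCAD prox);
    step 3 is the dual update on the constraint."""
    p = A.shape[0]
    D = np.zeros((Y.shape[0], p))
    Z = np.zeros_like(D)
    U = np.zeros_like(D)  # scaled dual variable
    G_inv = np.linalg.inv(A @ A.T + rho * np.eye(p))
    for _ in range(n_iter):
        D = (Y @ A.T + rho * (Z - U)) @ G_inv                      # step 1
        V = D + U
        Z = np.sign(V) * np.maximum(np.abs(V) - lam / rho, 0.0)    # step 2
        U = U + D - Z                                              # step 3
    return Z
```

With a mild penalty, the returned `Z` closely reproduces the least-squares dictionary, and zero columns of `Z` can then be pruned as described above.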
Details of implementation. As demonstrated above, GSCAD has the ability to prune the dictionary, and in Section 4.1 we will see that its empirical performance is promising and competitive. However, if we initialize the dictionary with a size smaller than the truth, GSCAD has no way to recover the missing atoms. Therefore, an over-sized dictionary in the initialization step is strongly preferred. Experiments show that little is lost by starting with a large dictionary, as GSCAD can always prune it to the right size.
During the dictionary updating stage, after we obtain a new dictionary from ADMM, if any two atoms are highly correlated (correlation greater than 0.95, for example), we keep only one of them. Experiments have shown that this has little effect on the results, but it speeds up the convergence of the algorithm.
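This de-duplication heuristic can be sketched as follows, using the absolute cosine similarity of unit-normalized atoms as the correlation measure (an assumption; the text does not spell out the exact statistic):

```python
import numpy as np

def drop_correlated_atoms(D, max_corr=0.95):
    """Keep only one of any pair of atoms whose absolute cosine
    similarity exceeds max_corr (0.95 as suggested in the text)."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    gram = np.abs(Dn.T @ Dn)
    p = D.shape[1]
    keep = np.ones(p, dtype=bool)
    for j in range(p):
        if keep[j]:
            for k in range(j + 1, p):
                # discard later atoms nearly collinear with atom j
                if keep[k] and gram[j, k] > max_corr:
                    keep[k] = False
    return D[:, keep]
```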
We define convergence of the algorithm in terms of the differences of the dictionary and of the sparse codes between two consecutive iterations: if both are below a certain threshold, the algorithm stops. In implementation, however, we add an extra rule on the maximum number of iterations, since with a bad initialization GSCAD may get stuck in a region where the iterates keep alternating between two local minima and never converge. Fortunately, the performance of such local minima is mostly decent in terms of denoising.
4 Experimental Results
4.1 Synthetic Experiments
We design a simple example to check the performance of GSCAD from two aspects: (i) whether GSCAD could recover the true size of the dictionary, and (ii) its denoising performance compared with other methods.
Data generation. The generating dictionary contains 10 atoms. Each atom is a vectorized patch, as shown in Figure 2. Then 1500 signals are generated, each created as a linear combination of three different generating atoms picked at random, with independent and identically distributed coefficients. Gaussian noise is added, with the signal-to-noise ratio (SNR) controlled by the Gaussian variance. Four noise levels are adopted.
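The generation scheme can be mimicked as follows; the patch dimension, the coefficient distribution, and the noise level are illustrative stand-ins for the elided values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n, k = 20, 10, 1500, 3       # signal dim (illustrative), 10 atoms, 1500 signals
D_true = rng.standard_normal((m, p))
D_true /= np.linalg.norm(D_true, axis=0)      # unit-norm atoms

# Each signal mixes k = 3 randomly chosen atoms with i.i.d. coefficients
# (standard normal here, as an assumption).
A = np.zeros((p, n))
for i in range(n):
    support = rng.choice(p, size=k, replace=False)
    A[support, i] = rng.standard_normal(k)
X = D_true @ A                                 # clean signals
sigma = 0.1                                    # noise level (illustrative)
Y = X + sigma * rng.standard_normal((m, n))    # noisy observations
```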
Applying GSCAD. To examine GSCAD's ability to prune dictionaries to the right size, dictionaries are initialized with varying numbers of atoms, namely 10 (the true size), 15, 20 and 50. We run GSCAD and obtain a learned dictionary whose resulting size can be, and in most cases is, smaller than the initial size. For validation, another 1000 signals are generated under the same setting, and both the clean and the noisy signals are recorded. Coefficients corresponding to the learned dictionary are obtained using Orthogonal Matching Pursuit (OMP) with the number of non-zero elements fixed to three. We then reconstruct the signals and calculate the PSNR.
Comparison. For each setting, we also run the K-SVD algorithm, using the Matlab toolbox accompanying the original paper of Aharon et al., and the Online Learning algorithm, using the SPAMS package. Since neither K-SVD nor Online Learning prunes the dictionary, the learned dictionary has the same size as its initial value. Validation for both methods is conducted in the same fashion as for GSCAD.
Results. For each setting of the initial size and the noise level, the experiments using GSCAD, Online Learning and K-SVD are repeated 100 times. The median, first quartile and third quartile of the PSNR are shown in Figure 3. GSCAD consistently performs better than the other two methods across initial sizes and SNR levels, with a single exception when the initial size is at the true value 10. As suggested in Section 3.3, to make full use of GSCAD's pruning power, it is better to start with an over-sized initial dictionary. The mean and standard deviation of the size of the resulting GSCAD dictionary are reported in Table 1. The resulting sizes are very close to the truth, with very small standard deviations across all cases.
Table 1: Mean (standard deviation) of the resulting GSCAD dictionary size, by initial size (rows) and noise level (columns).

| Initial size | Noise level 1 | Noise level 2 | Noise level 3 | Noise level 4 |
|---|---|---|---|---|
| 10 | 9.97 (0.171) | 10.00 (0) | 10.00 (0) | 10.00 (0) |
| 15 | 10.02 (0.245) | 10.03 (0.171) | 10.15 (0.359) | 10.14 (0.403) |
| 20 | 10.02 (0.245) | 10.08 (0.273) | 10.15 (0.359) | 10.35 (0.626) |
| 50 | 10.07 (0.293) | 10.07 (0.293) | 10.19 (0.394) | 10.36 (0.644) |
4.2 Image Denoising
Following the denoising scheme proposed by , we train our dictionaries directly on patches from the corrupted images; more details about the scheme can be found in . Five classical images used in image denoising benchmarks (Figure 4) are corrupted with Gaussian noise at several standard deviations, for pixel values in the range [0, 255]. For each corrupted image, overlapping patches are extracted and centered to form the training set. For an image of size 512 x 512, a total of (512 - 8 + 1)^2 = 255025 overlapping 8 x 8 patches are extracted from the original image.
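The patch count follows from the number of top-left corners: an $h \times w$ image has $(h - s + 1)(w - s + 1)$ overlapping $s \times s$ patches. A minimal extraction-and-centering sketch (pure-Python loops, fine for illustration):

```python
import numpy as np

def extract_patches(img, patch=8):
    """All overlapping patch x patch blocks, each centered (mean
    removed), returned as columns of a patch^2 x N matrix."""
    h, w = img.shape
    cols = []
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            block = img[i:i + patch, j:j + patch].ravel()
            cols.append(block - block.mean())
    return np.array(cols).T

# A 512 x 512 image yields (512 - 8 + 1)**2 = 255025 patches.
```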
Dictionaries are trained using (i) the proposed GSCAD algorithm (Algorithm 2), (ii) the K-SVD algorithm from the KSVD Matlab toolbox, and (iii) the Online Learning algorithm from the SPAMS package. A redundant DCT dictionary is used to initialize all three methods. The resulting dictionary size stays the same as the initial size for K-SVD and Online Learning, but may be smaller for GSCAD.
Once a dictionary is obtained, patches are approximated up to the noise level by a sparse linear combination of atoms in the dictionary:
where the error tolerance is proportional to the noise variance. We set it following an effective heuristic based on a quantile of the $\chi^2$ distribution, with degrees of freedom given by the patch dimension. The denoised image is then reconstructed from the sparse approximations, and its mean squared error (MSE) relative to the clean image is calculated. For each setting, the whole procedure is repeated five times with different realizations of the noise. The PSNR is defined as $\mathrm{PSNR} = 10\log_{10}\left(255^2/\mathrm{MSE}\right)$.
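A small helper for this PSNR computation, with the peak value 255 matching the pixel range (the exact display in the paper is elided, so this is the conventional form):

```python
import numpy as np

def psnr(clean, denoised, peak=255.0):
    """PSNR in dB of a denoised image against the clean reference:
    10 * log10(peak^2 / MSE)."""
    clean = np.asarray(clean, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    mse = np.mean((clean - denoised) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```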
The results of all three methods are generally very close to each other. At lower noise levels, GSCAD performs better: Table 3 summarizes the average number of atoms in the resulting GSCAD dictionaries, showing that GSCAD outperforms the other two methods with a learned dictionary less than half the size of that used by the other algorithms. On the other hand, at higher noise levels, GSCAD becomes less competitive. Specifically, at the highest noise level considered, all resulting sizes are close to the initial value of 256, which may indicate that a dictionary of size 256 is not large enough as an initial value for GSCAD; our experience suggests that higher noise levels require larger dictionary sizes. Another interesting finding is that, at the same noise level, the image 'fingerprint' needs a much larger dictionary for denoising than the other images.
The GSCAD method has been presented for learning a sparse dictionary and selecting the dictionary size simultaneously. The experimental analysis has demonstrated very encouraging results relative to state-of-the-art methods. This new framework may also be applied to the general subspace clustering problem for image clustering, which assumes that similar points lie in the same subspace; the proposed formulation can learn the clustering and the number of clusters at the same time. The framework may further be applied to architecture design in deep learning, where the GSCAD penalty can learn a sparse connection between the units of two layers of a deep neural network to improve efficiency.
Proof of Theorem 1.
(1). When , we have , and further
for any . When , we have
Therefore, to minimize (2), has to satisfy . If we denote and denote as the open interval between and 0, i.e.
then optimization problem (2) is equivalent to
(2). Recall that . To simplify the notation, we rewrite as if there were no zero elements in , and correspondingly . Define :
We extend to the whole half plane:
If we can show that is convex in , this will imply that is convex over , as is continuous everywhere.
To show that the optimization problem within is convex, we verify the inequality
for any . This is trivial for , and for , we consider the following two cases.
Case 1: . Then there are only a finite number of points in the set at which does not have a second-order derivative. Let . Define . If we can show that is continuous on and that except at a finite number of points, then is non-decreasing, and furthermore is convex on . By definition, for any ,
Therefore is convex.
Now we show that is continuous and that except at a finite number of points where does not exist. Taking the derivative of , we get
Since is continuous for all and ,
is continuous. Except at a finite number of points , at which does not exist, we have
To show that , we only need to show that . Since , without loss of generality, we are only going to show that , for .
Taking the derivative of ,
Since and , we have for all . Observe that is piecewise continuous on , and . For ,
Therefore for , and furthermore except at a finite number of . This finishes the proof of Case 1.
Case 2: or , where . Without loss of generality, we assume that the last elements of and are the same and the rest are not, i.e. for and for . Let , and . Then only a finite number of points belong to .
Let , and define as
Define . Then, similarly to Case 1, we can show that