1 Introduction
Sparse coding, which represents a signal as a sparse linear combination of basis elements from a dictionary, has been successfully applied in many signal processing tasks, such as image restoration [6, 21] and image classification [20, 22], to name a few. The dictionary is crucial and plays an important role in the success of sparse representation. Most of the compressive sensing literature takes off-the-shelf bases such as wavelets as the dictionary [4, 5]. In contrast, dictionary learning assumes that a signal can be sparsely represented in a learned, and usually overcomplete, dictionary. A prespecified dictionary might be universal, but it will not be effective enough for a specific task such as face recognition [24, 11]. Instead, learned dictionaries have recently led to state-of-the-art results in many practical applications, such as denoising [6, 17, 27, 1], inpainting [13, 15, 18], and image compression [3].

In this paper, we propose a novel regularization method called the Grouped Smoothly Clipped Absolute Deviation (GSCAD) to learn a sparse dictionary and select the appropriate dictionary size simultaneously. It should be emphasized that determining a proper size for the to-be-learned dictionary is crucial for both the precision and the efficiency of the procedure. Few existing works discuss the selection of the dictionary size; most algorithms fix the number of atoms in advance. In general, a two-stage procedure may be used to infer the dictionary size: first learn a dictionary with a fixed size, then define a new objective function that penalizes model complexity [8]. Bayesian techniques can also be employed by placing a prior on the dictionary size [26]. In addition, many methods in the statistics literature have addressed the group variable selection problem [23, 25, 10, 28, 9].
This paper makes four main contributions:

Our approach imposes sparsity-enforcing constraints on the learned atoms, which improves the interpretability of the results and achieves variable selection in the input space.

Our approach is a one-stage procedure that learns a sparse dictionary and the dictionary size jointly.

Our proposed algorithm is based on the alternating direction method of multipliers (ADMM) [2]. The joint nonconvex problem with the nonconvex penalty is decomposed into two convex optimization problems.

Compared with other state-of-the-art dictionary learning methods, GSCAD achieves better or competitive performance in various denoising tasks.
2 GSCAD penalty
Review of the Smoothly Clipped Absolute Deviation (SCAD) penalty. The SCAD penalty was first proposed by [7] in the context of high-dimensional linear regression. SCAD has several desirable properties: (i) unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large; (ii) sparsity: the resulting estimator sets small estimated coefficients to zero to reduce model complexity; (iii) continuity: the resulting estimator is continuous in the data, which avoids instability in model prediction. Defined as
(1)  $p_\lambda(\theta) = \lambda|\theta|$ if $|\theta| \le \lambda$; $\;\bigl(2a\lambda|\theta| - \theta^2 - \lambda^2\bigr)/\bigl(2(a-1)\bigr)$ if $\lambda < |\theta| \le a\lambda$; $\;(a+1)\lambda^2/2$ if $|\theta| > a\lambda$,
for some $\lambda > 0$ and $a > 2$, the SCAD penalty contains three segments. When $|\theta|$ is small (less than $\lambda$), it acts exactly like the Lasso penalty; when $|\theta|$ is big (greater than $a\lambda$), it becomes a constant so that no extra penalty is applied to truly significant parameters; these two segments are connected by a quadratic function, which results in a continuously differentiable penalty function $p_\lambda(\cdot)$.
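To make the piecewise definition concrete, here is a minimal NumPy sketch of the SCAD penalty in (1); the function name and the default $a = 3.7$ (the value recommended in [7]) are our choices, not notation from this paper.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of [7], applied elementwise.

    Three segments: linear (Lasso-like) below lam, a quadratic blend up to
    a*lam, and a constant beyond a*lam, so large parameters incur no
    extra penalty.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    const = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))
```

One can check that the three branches meet continuously at $|\theta| = \lambda$ and $|\theta| = a\lambda$, as required by property (iii).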
GSCAD penalty. Even though the SCAD penalty possesses many good properties, it treats parameters only individually and does not address any group effect among parameters. Motivated by the structure of the dictionary, we propose a new penalty, GSCAD, where G stands for "group". Let $h = (h_1, \ldots, h_m)$ be a vector in $\mathbb{R}^m$. The GSCAD penalty applies the SCAD penalty $p_\lambda$ defined in (1) to the elements of $h$ in a grouped fashion. It inherits all three merits of SCAD (unbiasedness, sparsity, and continuity) and at the same time takes into account both the individual parameters and the group effect among them. Individually, the GSCAD penalty tends to set small estimated elements of $h$ to zero. Groupwise, if all elements in $h$ are small, the penalty will shrink the entire vector to zero. In addition, if some element of $h$ is significantly large, the penalty has more tolerance for smaller elements appearing in $h$.
To better understand GSCAD, let us consider the penalized least squares problem with an orthogonal design,
$\hat\theta = \arg\min_\theta \tfrac{1}{2}\|z - \theta\|_2^2 + P(\theta)$,
where $z$ and $\theta$ are vectors in $\mathbb{R}^m$. For GSCAD, SCAD, and LASSO, the penalty $P$ is taken to be the corresponding penalty applied to $\theta$ (elementwise $\sum_j p_\lambda(|\theta_j|)$ for SCAD, and $\lambda\|\theta\|_1$ for LASSO).
Estimators of $\theta$ in the one-dimensional case ($m = 1$) are shown in Figure 1 (a), where GSCAD behaves very similarly to SCAD. All three penalties exhibit the sparsity property, since they all set $\hat\theta$ to zero when $|z|$ is small. While the soft-thresholding of LASSO has an inherent bias issue, SCAD and GSCAD give $\hat\theta = z$ when $|z|$ is large and thus avoid bias. In the two-dimensional case ($m = 2$), we investigate partitions of the $(z_1, z_2)$ space according to the number of nonzero elements in the resulting estimator $\hat\theta$; see Figure 1 (b)-(c). While SCAD and LASSO treat each coordinate individually, GSCAD takes the whole group into account: it is less likely to set the estimator of one coordinate to zero as the estimator of the other coordinate moves away from zero.
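The contrast between LASSO's soft-thresholding and SCAD's combination of sparsity and unbiasedness under orthogonal design ($m = 1$) can be sketched as follows; the closed-form SCAD thresholding rule below is the standard one from [7], and the function names are ours.

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (soft-thresholding) estimator: shrinks every input by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule of [7]: sets small inputs to zero,
    but returns the input unchanged (unbiased) once |z| > a*lam."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    soft = np.sign(z) * np.maximum(absz - lam, 0.0)
    mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return np.where(absz <= 2 * lam, soft, np.where(absz <= a * lam, mid, z))
```

For a large input such as $z = 10$ with $\lambda = 1$, soft-thresholding returns the biased value 9, while the SCAD rule returns $z$ itself, illustrating the unbiasedness property discussed above.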
Convexity. Even though GSCAD is built upon the nonconvex penalty function SCAD, our development uncovers a surprising fact: the optimization problem of GSCAD under orthogonal design is convex. This greatly facilitates the implementation of GSCAD.
Theorem 1.
Define $\hat\beta$ as the minimizer of the optimization problem
(2) 
Then,

, and . Denote , and let be the open interval between and 0. Then problem (2) is equivalent to
(3) 
Let $s$ be the number of nonzero elements in the minimizer. If
(4)
holds, then the optimization problem (3) is convex, and the minimizer is continuous in the data $z$.
Remarks on Theorem 1. (i) Adding a constant in (2) makes the problem more general, so that the convexity result can be applied directly to the algorithms in Section 3.3, where this constant plays the role of the penalty parameter in the augmented Lagrangian method. (ii) Condition (4) can be satisfied easily under a wide range of circumstances; for instance, it is readily satisfied in the previous two-dimensional example.
3 Dictionary Learning with GSCAD
3.1 Matrix Factorization Framework
Dictionary learning problems are commonly specified under the framework of matrix factorization. Consider a vectorized clean signal $x \in \mathbb{R}^m$ and a dictionary $D = [d_1, \ldots, d_k] \in \mathbb{R}^{m \times k}$, with its columns referred to as atoms. Sparse representation theory assumes that the signal can be well approximated by a linear combination of a few atoms in $D$, i.e., $x \approx D\alpha$, where the number of nonzero elements in $\alpha$ is far less than the number of atoms $k$. In most cases the clean signal is not available; instead, we only observe a noisy signal $y = x + \epsilon$, where $\epsilon$ represents noise with mean zero and variance $\sigma^2$. Suppose we have $n$ signals $y_1, \ldots, y_n$, and we want to retrieve the corresponding clean signals $x_1, \ldots, x_n$. This can be summarized as the matrix factorization model $Y = DA + E$, where $Y = [y_1, \ldots, y_n]$ and $A = [\alpha_1, \ldots, \alpha_n]$. To make the problem identifiable, we require the dictionary to belong to the convex set of matrices whose atoms have Euclidean norm at most one, $\mathcal{C} = \{D : \|d_j\|_2 \le 1,\ j = 1, \ldots, k\}$.
Dictionary learning aims to obtain estimates of the dictionary $D$ and the sparse codes $A$, and then reconstruct the clean signals as $\hat X = \hat D \hat A$. This is usually done by minimizing the total squared error
$\min_{D \in \mathcal{C},\, A} \|Y - DA\|_F^2$,
where $\|\cdot\|_F$ is the Frobenius norm. Sparsity constraints on the codes, such as an $\ell_0$ constraint or an $\ell_1$ (Lasso) constraint with some positive tuning constants, are widely adopted in the dictionary learning literature. Experiments have shown that the Lasso penalty provides better results when used for learning the dictionary, while the $\ell_0$ norm should always be used for the final reconstruction step [12].
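A minimal sketch of the two ingredients just introduced: the identifiability constraint (projecting each atom onto the unit Euclidean ball, the standard choice for the set $\mathcal{C}$, assumed here) and the total squared error objective.

```python
import numpy as np

def project_dictionary(D):
    """Project each atom (column) onto the unit Euclidean ball.

    Atoms with norm <= 1 are left unchanged; longer atoms are rescaled.
    This enforces the identifiability constraint D in C assumed above.
    """
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms

def total_squared_error(Y, D, A):
    """Frobenius reconstruction error ||Y - D A||_F^2."""
    return np.linalg.norm(Y - D @ A, 'fro') ** 2
```

An exact factorization gives zero error, and after projection every atom satisfies the norm constraint.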
3.2 Regularization on Dictionary
Compared with sparse coding, regularization of the dictionary is less studied. Most existing methods, such as K-SVD and Online Learning, estimate the dictionary directly with a fixed dictionary size. They usually require the size of the dictionary to be specified before learning, which typically ends with an overcomplete dictionary with $k > m$; this may not be very helpful if we want to better understand the underlying mechanism. In addition, learning a sparse dictionary lowers model complexity and improves the interpretability of the results. All these issues can be addressed with the GSCAD penalty, which reveals the true size of the dictionary and at the same time yields a sparse estimated dictionary. More specifically, denote the dictionary as $D = [d_1, \ldots, d_k]$ with atoms $d_j$. The GSCAD penalty on the dictionary applies the grouped penalty of Section 2 to each atom $d_j$, with $p_\lambda$ the SCAD penalty defined in (1). The objective function for our problem is formulated as
(5) 
First, the GSCAD penalty tends to set small estimated entries of each atom to zero, reducing the complexity of the estimated dictionary. If all elements in an atom $d_j$ are small, GSCAD will shrink the whole atom to zero. Therefore, when starting with a relatively large $k$, GSCAD is able to prune the dictionary by shrinking useless atoms to zero. In this way, the true size of the dictionary can be approximated by the number of nonzero columns in the resulting dictionary. In addition, if GSCAD detects some significantly large entries in $d_j$, it will exert less penalty on the whole atom to avoid mistakenly truncating any real signal.
3.3 Algorithms
We follow the classic two-step approach and solve the optimization problem iteratively. Given the dictionary $D$, we update each sparse code $\alpha_i$ by solving the Lasso problem
$\min_{\alpha_i} \tfrac{1}{2}\|y_i - D\alpha_i\|_2^2 + \lambda' \|\alpha_i\|_1$
for all signals $i = 1, \ldots, n$ (here $\lambda'$ denotes the sparse-coding penalty parameter). Given $A$, the optimization problem becomes
(6) 
which is addressed by the ADMM algorithm. Once $D$ is updated, we remove all zero columns of $D$ and reset $k$ to the number of remaining atoms. Algorithm 1 summarizes the whole procedure. It should be noted that (6) is a nonconvex problem. Recently, the global convergence of ADMM in nonconvex optimization was discussed in [19], which shows that several ADMM algorithms, including those with the SCAD penalty, are guaranteed to converge.
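The sparse-coding step above can be illustrated with a minimal ISTA (iterative soft-thresholding) solver for the Lasso subproblem; this is a generic textbook method, not necessarily the solver used in the paper's implementation.

```python
import numpy as np

def ista_lasso(y, D, lam, n_iter=200):
    """Sparse-coding step: minimize 0.5*||y - D a||^2 + lam*||a||_1 by ISTA.

    Each iteration takes a gradient step on the quadratic term with step
    size 1/L (L = Lipschitz constant of the gradient), then applies
    soft-thresholding with threshold lam/L.
    """
    L = np.linalg.norm(D, 2) ** 2  # squared spectral norm of D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - (D.T @ (D @ a - y)) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a
```

With an orthonormal dictionary this reduces to soft-thresholding of the coefficients, matching the orthogonal-design discussion of Section 2.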
ADMM for updating the dictionary. Problem (6) is equivalent to
We form the augmented Lagrangian as
where $\circ$ is the elementwise multiplication operator of two matrices. The ADMM algorithm consists of three steps in each iteration:
(7)  
(8)  
Problem (7) has an explicit solution, and the subproblem in (8) can be solved by columnwise optimization, one atom at a time, for $j = 1, \ldots, k$. In Theorem 1, we have shown that each columnwise problem is convex under Condition (4) and can be solved easily by existing convex optimization algorithms. The ADMM algorithm for updating the dictionary is summarized in Algorithm 2.
(9) 
Implementation details. As demonstrated above, GSCAD has the ability to prune the dictionary, and later, in Section 4.1, we will see that its empirical performance is promising and competitive. However, if we initialize the dictionary with a size smaller than the truth, there is nothing left for GSCAD to prune. Therefore, an oversized dictionary in the initialization step is strongly preferred. Experiments have shown that there is nothing to lose by starting with a large dictionary, as GSCAD can always prune it down to the right size.
During the dictionary update stage, after we obtain a new dictionary from ADMM, if any two atoms are highly correlated (correlation greater than 0.95, for example), we keep only one of them. Experiments have shown that this has little effect on the results but speeds up the convergence of the algorithm.
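The de-duplication step can be sketched as follows, measuring correlation between atoms as the absolute normalized inner product (our interpretation of "correlation" here), and assuming nonzero atoms:

```python
import numpy as np

def prune_correlated_atoms(D, thresh=0.95):
    """Greedily keep only one atom from every group whose absolute
    normalized inner product exceeds `thresh` (0.95 as in the text)."""
    Dn = D / np.linalg.norm(D, axis=0)  # unit-normalize atoms for comparison
    keep = []
    for j in range(D.shape[1]):
        # keep atom j only if it is not too correlated with any kept atom
        if all(abs(Dn[:, j] @ Dn[:, i]) <= thresh for i in keep):
            keep.append(j)
    return D[:, keep]
```

A dictionary containing a duplicated atom thus loses one copy, while mutually orthogonal atoms are all retained.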
We define convergence of the algorithm by the change in $D$ and the change in $A$ between two consecutive iterations: if both fall below a certain threshold, the algorithm stops. In implementation, however, we add an extra rule on the maximum number of iterations, since with a bad initialization GSCAD may get stuck in a region where the iterates keep alternating between two local minima and never converge. Fortunately, such local minima mostly still perform decently in terms of denoising.
4 Experimental Results
4.1 Synthetic Experiments
We design a simple example to check the performance of GSCAD in two respects: (i) whether GSCAD can recover the true size of the dictionary, and (ii) its denoising performance compared with other methods.
Data generation. The generating dictionary contains 10 atoms. Each atom is a vectorized patch, shown in Figure 2. Then 1500 signals are generated, each created as a linear combination of three distinct generating atoms picked at random, with independently and identically distributed coefficients. Gaussian noise is added, with the signal-to-noise ratio (SNR) controlled by the Gaussian variance. Four noise levels are adopted: $\sigma = 0.01, 0.025, 0.05, 0.1$.
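A sketch of this data-generating process. The signal dimension m = 64 and the standard Gaussian coefficients are illustrative assumptions (the paper's atoms are the fixed patches of Figure 2, for which random unit-norm atoms stand in here):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n, s = 64, 10, 1500, 3        # m is an assumed signal dimension
D = rng.standard_normal((m, k))
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms stand in for Figure 2
X = np.zeros((m, n))
for i in range(n):
    support = rng.choice(k, size=s, replace=False)    # three distinct atoms
    X[:, i] = D[:, support] @ rng.standard_normal(s)  # i.i.d. coefficients
sigma = 0.05                        # one of the four noise levels
Y = X + sigma * rng.standard_normal((m, n))           # noisy observations
```

Each column of Y is a noisy three-sparse combination of the generating atoms, mirroring the setup described above.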
Applying GSCAD. To examine GSCAD's ability to prune dictionaries to the right size, dictionaries are initialized with varying numbers of atoms: 10 (the true size), 15, 20, and 50. We run GSCAD and obtain a learned dictionary whose size might be, and in most cases is, less than the initial size. For validation, another 1000 signals are generated under the same setting; both the clean and the noisy signals are recorded. Coefficients with respect to the learned dictionary are obtained using Orthogonal Matching Pursuit (OMP) with the number of nonzero elements fixed at three. We then reconstruct the signals and calculate the PSNR.
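The validation step uses OMP with the number of nonzero elements fixed at three; a minimal generic OMP (not the toolbox implementation used in the experiments) looks like this, assuming unit-norm atoms for the selection step:

```python
import numpy as np

def omp(x, D, k=3):
    """Orthogonal Matching Pursuit with a fixed number k of nonzeros.

    Greedily picks the atom most correlated with the residual, then
    refits all selected coefficients by least squares.
    """
    resid, idx = x.copy(), []
    a = np.zeros(D.shape[1])
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ resid))))
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        resid = x - D[:, idx] @ coef
    a[idx] = coef
    return a
```

After each refit the residual is orthogonal to the selected atoms, so no atom is picked twice.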
Comparison. For each setting, we also run the K-SVD algorithm, using the Matlab toolbox associated with its original paper (Aharon et al. [1]), and the Online Learning algorithm [16], using the SPAMS package. Since neither K-SVD nor Online Learning prunes the dictionary, the learned dictionary has the same size as its initial value. Validation for both methods is conducted in the same fashion as for GSCAD.
Results. For each setting of the initial size and the noise level $\sigma$, the experiments using GSCAD, Online Learning, and K-SVD are repeated 100 times. The median, first quartile, and third quartile of the PSNR are shown in Figure 3. GSCAD consistently performs better than the other two methods across initial sizes and SNR levels, except in a single case where the initial size equals the true value 10. As suggested in Section 3.3, to make full use of GSCAD's pruning power, it is better to start with an oversized initial dictionary. The mean and standard deviation of the resulting GSCAD dictionary size are reported in Table 1; the learned sizes are very close to the truth, with very small standard deviations across all cases.

Table 1: mean (standard deviation) of the resulting GSCAD dictionary size, by initial size and noise level σ.

Initial size   σ = 0.01       σ = 0.025      σ = 0.05       σ = 0.1
10             9.97 (0.171)   10.00 (0)      10.00 (0)      10.00 (0)
15             10.02 (0.245)  10.03 (0.171)  10.15 (0.359)  10.14 (0.403)
20             10.02 (0.245)  10.08 (0.273)  10.15 (0.359)  10.35 (0.626)
50             10.07 (0.293)  10.07 (0.293)  10.19 (0.394)  10.36 (0.644)
4.2 Image Denoising
Following the denoising scheme proposed by [6], we train our dictionaries directly on patches from the corrupted images. More details about the scheme can be found in [12]. Five classical images used in image denoising benchmarks (Figure 4) are corrupted with Gaussian noise, with standard deviation set to $\sigma = 5, 10, 20, 50$ for pixel values in the range $[0, 255]$. For each corrupted image, overlapping patches are extracted and centered to form the training set; for a 512 x 512 image, a total of 255025 patches are extracted.
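The patch-extraction step can be sketched as follows. The 8 x 8 patch size is an assumption consistent with the reported count, since a 512 x 512 image yields (512 - 8 + 1)^2 = 255025 overlapping 8 x 8 patches:

```python
import numpy as np

def extract_patches(img, p=8):
    """All overlapping p-by-p patches of a grayscale image, vectorized
    as columns of an (p*p, n_patches) matrix.

    An h-by-w image yields (h-p+1)*(w-p+1) patches, e.g. 255025 for a
    512x512 image with p=8 (the patch size assumed here).
    """
    h, w = img.shape
    cols = [img[i:i + p, j:j + p].ravel()
            for i in range(h - p + 1) for j in range(w - p + 1)]
    return np.stack(cols, axis=1)
```

In the training scheme each column would additionally be centered (its mean subtracted) before dictionary learning.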
Dictionaries are trained using (i) the proposed GSCAD algorithm with Algorithm 2, (ii) the K-SVD algorithm from the K-SVD Matlab toolbox, and (iii) the Online Learning algorithm from the SPAMS package. A redundant DCT dictionary is used to initialize all three methods. The size of the resulting dictionary stays at its initial value (256 atoms) for K-SVD and Online Learning, but might be smaller for GSCAD.
Once a dictionary is obtained, each patch is approximated up to the noise level by a sparse linear combination of atoms in the dictionary, i.e., by minimizing $\|\alpha\|_0$ subject to $\|y - D\alpha\|_2^2 \le \varepsilon$, where $\varepsilon$ is proportional to the noise variance $\sigma^2$. We set $\varepsilon$ following the effective heuristic of [14], using a quantile of the $\chi^2$ distribution whose degrees of freedom equal the patch dimension. The denoised image is then reconstructed from the sparse approximations $\hat x = D\hat\alpha$, and its mean squared error (MSE) with respect to the clean image is calculated; we define $\mathrm{PSNR} = 10\log_{10}(255^2/\mathrm{MSE})$. For each setting, the whole procedure is repeated five times with different realizations of the noise.

The results for all three methods are generally very close to each other. At lower noise levels, GSCAD performs better. Table 3 summarizes the average number of atoms in the resulting GSCAD dictionaries; GSCAD outperforms the other two methods with a learned dictionary less than half the size of that used by the other algorithms. At higher noise levels, GSCAD becomes less competitive. Specifically, when $\sigma = 50$, all resulting dictionary sizes are close to the initial value of 256, which may indicate that a dictionary of size 256 is not large enough as an initial value for GSCAD at this noise level; our experience suggests that higher noise levels require larger dictionary sizes. Another interesting finding is that, at the same noise level, the image 'fingerprint' needs a much larger dictionary for denoising than the other images.
Table 2: PSNR (dB) of the denoised images for each method and noise level σ.

         Barb                    Boat                    Fgrpt
σ    Gscad  Ksvd   Online   Gscad  Ksvd   Online   Gscad  Ksvd   Online
5    38.23  38.05  38.15    37.24  37.24  37.06    36.60  36.65  36.53
10   34.57  34.39  34.73    33.70  33.64  33.81    32.38  32.44  32.52
20   30.79  30.90  31.13    30.30  30.46  30.63    28.31  28.58  28.70
50   25.51  25.75  25.87    25.79  26.09  26.25    23.00  23.63  23.59

         Lena                    Peppers                 Average
σ    Gscad  Ksvd   Online   Gscad  Ksvd   Online   Gscad  Ksvd   Online
5    38.63  38.60  38.56    38.04  37.78  37.81    37.75  37.66  37.62
10   35.58  35.48  35.70    34.51  34.22  34.71    34.15  34.03  34.29
20   32.31  32.41  32.60    30.84  30.82  31.22    30.51  30.63  30.86
50   27.66  27.88  28.02    25.84  26.32  26.37    25.56  25.93  26.02
Table 3: average number of atoms in the resulting GSCAD dictionary, by image and noise level σ.

σ    Barb  Boat  Fgrpt  Lena  Peppers
5    109   125   178    86    100
10   92    94    195    80    86
20   120   151   224    136   153
50   248   248   246    250   247
5 Conclusion
The GSCAD method has been presented for learning a sparse dictionary and selecting the dictionary size simultaneously. The experimental analysis has demonstrated very encouraging results relative to state-of-the-art methods. This new framework may also be applied to the general subspace clustering problem for image clustering, which assumes that similar points lie in the same subspace; the proposed formulation can learn the clustering and the number of clusters at the same time. The framework may further be applied to architecture design in deep learning, where the GSCAD penalty can learn sparse connections between the units of two layers of a deep neural network to improve efficiency.
Appendix
Proof of Theorem 1.
(1). When , we have , and further
for any . When , we have
and further
Therefore, to minimize (2), has to satisfy . If we denote and denote as the open interval between and 0, i.e.
then optimization problem (2) is equivalent to
(2). Recall that . To simplify the notation, we rewrite as if there were no zero elements in , and correspondingly . Define :
We extend to the whole half-plane:
If we can show that is convex in , this will imply that is convex over , as is continuous everywhere.
To show that the optimization problem within is convex, we verify the inequality
for any . This is trivial for , and for , we consider the following two cases.
Case 1: . Then there are only a finite number of points in the set at which does not have a second-order derivative. Let . Define . If we can show that is continuous on , and except at a finite number of points, then is nondecreasing, and furthermore is convex on . By definition, for any ,
Therefore is convex.
Now we show that is continuous and that , except at a finite number of points where does not exist. Taking the derivative of , we get
where
Since is continuous for all and , is continuous. Except at a finite number of points , at which does not exist, we have
Let
To show that , we only need to show that . Since , without loss of generality, we only show that for .
Taking the derivative of ,
Since and , we have for all . Observe that is piecewise continuous on , and . For ,
For
For ,
Therefore , for , and furthermore , except at a finite number of points. This finishes the proof of Case 1.
Case 2: or , where . Without loss of generality, we assume that the last elements of and are the same and the rest are not, i.e., for and for . Let , and . Then there are only a finite number of such that the point belongs to .
Let , and define as
Define . Then, similarly to Case 1, we can show that