Simultaneous Sparse Dictionary Learning and Pruning

05/25/2016 ∙ by Simeng Qu, et al.

Dictionary learning is a cutting-edge area in image processing that has recently led to state-of-the-art results in many signal processing tasks. The idea is to conduct a linear decomposition of a signal using a few atoms of a learned and usually over-complete dictionary instead of a pre-defined basis. Determining a proper size for the to-be-learned dictionary is crucial for both the precision and the efficiency of the process, yet most existing dictionary learning algorithms choose the size quite arbitrarily. In this paper, a novel regularization method called the Grouped Smoothly Clipped Absolute Deviation (GSCAD) is employed for learning the dictionary. The proposed method can simultaneously learn a sparse dictionary and select the appropriate dictionary size. An efficient algorithm is designed based on the alternating direction method of multipliers (ADMM), which decomposes the joint non-convex problem with the non-convex penalty into two convex optimization problems. Several examples are presented for image denoising, and the experimental results are compared with other state-of-the-art approaches.


1 Introduction

Sparse coding, which represents a signal as a sparse linear combination of basis elements from a dictionary, has been successfully applied in many signal processing tasks, such as image restoration [6, 21] and image classification [20, 22], to name a few. The dictionary plays a crucial role in the success of sparse representation. Most of the compressive sensing literature takes off-the-shelf bases such as wavelets as the dictionary [4, 5]. In contrast, dictionary learning assumes that a signal can be sparsely represented in a learned and usually over-complete dictionary. A pre-specified dictionary might be universal but will not be effective enough for specific tasks such as face recognition [24, 11]. Instead, learned dictionaries have recently led to state-of-the-art results in many practical applications, such as denoising [6, 17, 27, 1], inpainting [13, 15, 18], and image compression [3].

In this paper, we propose a novel regularization method called the Grouped Smoothly Clipped Absolute Deviation (GSCAD) to learn a sparse dictionary and select the appropriate dictionary size simultaneously. It should be emphasized that determining a proper size for the to-be-learned dictionary is crucial for both the precision and the efficiency of the process. There is not much existing work discussing the selection of the dictionary size, as most algorithms fix the number of atoms in the dictionary. In general, a two-stage procedure may be used to infer the dictionary size by first learning a dictionary with a fixed size and then defining a new objective function that penalizes model complexity [8]. Bayesian techniques can also be employed by putting a prior on the dictionary size [26]. In addition, many methods have addressed the group variable selection problem in the statistics literature [23, 25, 10, 28, 9].

This paper makes four main contributions:

  • Our approach imposes sparsity-enforcing constraints on the learned atoms, which improves interpretability of the results and achieves variable selection in the input space.

  • Our approach is a one-stage procedure to learn a sparse dictionary and the dictionary size jointly.

  • Our proposed algorithm is based on the alternating direction method of multipliers (ADMM) [2]. The joint non-convex problem with the non-convex penalty is decomposed into two convex optimization problems.

  • Compared with other state-of-the-art dictionary learning methods, GSCAD has better or competitive performance in various denoising tasks.

2 GSCAD Penalty

Review of the Smoothly Clipped Absolute Deviation (SCAD) penalty. The SCAD penalty was first proposed by [7] in the context of high-dimensional linear regression. SCAD has several desirable properties: (i) Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large; (ii) Sparsity: the resulting estimator sets small estimated coefficients to zero to reduce model complexity; (iii) Continuity: the resulting estimator is continuous in the data, which avoids instability in model prediction. The SCAD penalty is defined as

$$
p_\lambda(\theta) = \begin{cases} \lambda|\theta|, & |\theta| \le \lambda, \\ -\dfrac{\theta^2 - 2a\lambda|\theta| + \lambda^2}{2(a-1)}, & \lambda < |\theta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\theta| > a\lambda, \end{cases} \qquad (1)
$$

for some $a > 2$ and $\lambda > 0$, and it contains three segments. When $|\theta|$ is small (less than $\lambda$), it acts exactly like the Lasso penalty; when $|\theta|$ is large (greater than $a\lambda$), it becomes a constant so that no extra penalty is applied to truly significant parameters; these two segments are connected by a quadratic function, which makes the SCAD penalty $p_\lambda$ continuously differentiable.
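To make the three segments concrete, here is a minimal NumPy sketch of the SCAD penalty in (1); it is an illustration rather than code from the paper, with `lam` and `a` standing for $\lambda$ and $a$, and the default $a = 3.7$ being the value commonly recommended for SCAD.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of (1), applied element-wise; requires a > 2 and lam > 0."""
    t = np.abs(np.asarray(theta, dtype=float))
    lasso_part = lam * t                                                  # |theta| <= lam: Lasso-like segment
    quad_part = -(t ** 2 - 2 * a * lam * t + lam ** 2) / (2 * (a - 1))    # quadratic bridge
    const_part = (a + 1) * lam ** 2 / 2                                   # |theta| > a*lam: constant, no extra penalty
    return np.where(t <= lam, lasso_part,
                    np.where(t <= a * lam, quad_part, const_part))
```

For example, with `lam=0.1` the values of `scad_penalty([0.05, 0.2, 5.0], lam=0.1)` fall on the Lasso segment, the quadratic bridge, and the constant plateau, respectively.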

GSCAD penalty. Even though the SCAD penalty possesses many good properties, it treats parameters individually and does not address any group effect among parameters. Motivated by the structure of the dictionary, we propose a new penalty, GSCAD, where G stands for group. Let $d = (d_1, \ldots, d_m)^\top$ be a vector in $\mathbb{R}^m$. The GSCAD penalty is defined as

where $p_\lambda$ is the SCAD penalty defined in (1). It inherits all three merits of SCAD, namely unbiasedness, sparsity and continuity, and at the same time takes into account both individual parameters and the group effect among them. Individually, the GSCAD penalty tends to set small estimated entries $d_j$ to zero. Group-wise, if all elements of $d$ are small, the penalty shrinks the entire vector to zero. In addition, if some $d_j$ is significantly large, the penalty is more tolerant of smaller elements appearing in $d$.

To better understand GSCAD, let us consider a penalized least squares problem with an orthogonal design,

where $y$ and $\theta$ are vectors in $\mathbb{R}^m$. For GSCAD, SCAD and Lasso, the penalty is, respectively,

Estimators of $\theta$ for $m = 1$ are shown in Figure 1(a), where GSCAD performs very similarly to SCAD. All three penalties exhibit the sparsity property, since they all set $\hat\theta$ to zero when $|y| \le \lambda$. While the soft-thresholding from Lasso has an inherent bias issue, SCAD and GSCAD give $\hat\theta = y$ when $|y| > a\lambda$ and thus avoid bias. In the two-dimensional case ($m = 2$), we investigate partitions of the space of $y$ according to the number of non-zero elements in the resulting estimator $\hat\theta$; see Figure 1(b)-(c). While SCAD and Lasso treat each coordinate individually, GSCAD takes the whole group into account: it is less likely to set the estimator of one coordinate to zero as the estimator of another coordinate moves away from zero.

Figure 1: (a) One-dimensional threshold functions. (b)-(c) Partitions of the two-dimensional space according to the number of nonzero elements in the resulting estimator $\hat\theta$.
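For reference, the one-dimensional threshold functions sketched in Figure 1(a) have closed forms for Lasso (soft-thresholding) and SCAD; the SCAD rule below is the standard closed-form solution of the one-dimensional penalized least squares problem, and by the discussion above GSCAD behaves very similarly in one dimension. This is an illustrative sketch, not the paper's code.

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso estimator for the one-dimensional problem: soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def scad_threshold(y, lam, a=3.7):
    """Standard closed-form SCAD estimator for the one-dimensional problem."""
    y = np.asarray(y, dtype=float)
    ay = np.abs(y)
    small = np.sign(y) * np.maximum(ay - lam, 0.0)            # |y| <= 2*lam: soft-thresholding
    middle = ((a - 1) * y - np.sign(y) * a * lam) / (a - 2)   # 2*lam < |y| <= a*lam: reduced shrinkage
    return np.where(ay <= 2 * lam, small,
                    np.where(ay <= a * lam, middle, y))       # |y| > a*lam: no shrinkage, no bias
```

Plotting the two functions reproduces the qualitative picture in Figure 1(a): both are zero for $|y| \le \lambda$, but SCAD returns $y$ unchanged once $|y| > a\lambda$.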

Convexity. Even though GSCAD is built upon the non-convex penalty function SCAD, our development uncovers a surprising fact: the optimization problem of GSCAD under an orthogonal design is a convex problem. This greatly facilitates the implementation of GSCAD.

Theorem 1.

Define $\hat\theta$ as the minimizer of the optimization problem

(2)

Then,

  • , and . Denote , and let be the open interval between and 0. Then problem is equivalent to

    (3)
  • Let $k$ be the number of non-zero elements. If

    (4)

    holds, then the optimization problem is convex, and $\hat\theta$ is continuous in the data $y$.

Remarks on Theorem 1. (i) Adding a constant in (2) makes the problem more general, so that the convexity result can be applied directly to the algorithms in Section 3.3, where it plays the role of the penalty parameter in the augmented Lagrangian method. (ii) Condition (4) can be satisfied easily under a wide range of circumstances. For instance, in the previous two-dimensional example, Condition (4) is satisfied for a wide range of parameter choices.

3 Dictionary Learning with GSCAD

3.1 Matrix Factorization Framework

Dictionary learning problems are commonly specified under the framework of matrix factorization. Consider a vectorized clean signal $x \in \mathbb{R}^m$ and a dictionary $D = [d_1, \ldots, d_p] \in \mathbb{R}^{m \times p}$, with its columns referred to as atoms. Sparse representation theory assumes that the signal $x$ can be well approximated by a linear combination of a few atoms in $D$, i.e.

$$x \approx D\alpha,$$

where the number of non-zero elements in $\alpha \in \mathbb{R}^p$ is far less than the number of atoms $p$. In most cases the clean signal $x$ is not available; instead, we only observe a noisy signal $y = x + \varepsilon$, where $\varepsilon$ represents noise with mean zero and variance $\sigma^2$. Suppose we have $n$ signals $y_1, \ldots, y_n$, and we want to retrieve the corresponding clean signals $x_1, \ldots, x_n$. This can be summarized as the matrix factorization model

$$Y = DA + E,$$

where $Y = [y_1, \ldots, y_n]$, $A = [\alpha_1, \ldots, \alpha_n]$, and $E$ collects the noise. To make the problem identifiable, we require the dictionary to belong to the convex set

$$\mathcal{D} = \{ D \in \mathbb{R}^{m \times p} : \|d_j\|_2 \le 1, \ j = 1, \ldots, p \}.$$

Dictionary learning aims to obtain estimates $\hat D$ of the dictionary and $\hat A$ of the sparse coding, and then reconstruct the clean signals as $\hat X = \hat D \hat A$. This is usually done by minimizing the total squared error

$$\min_{D \in \mathcal{D},\, A} \ \|Y - DA\|_F^2 \quad \text{subject to sparsity constraints on } A,$$

where $\|\cdot\|_F$ is the Frobenius norm. Constraints such as $\|\alpha_i\|_0 \le s$ ($\ell_0$ penalty) and $\|\alpha_i\|_1 \le t$ (Lasso penalty) for some positive constants $s$ and $t$ are widely adopted in the dictionary learning literature. Experiments have shown that the Lasso penalty provides better results when used for learning the dictionary, while the $\ell_0$ norm should always be used for the final reconstruction step [12].
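As a concrete illustration of this framework (not the paper's code), the sketch below sparse-codes each noisy signal against a fixed dictionary with the Lasso penalty and evaluates the total squared error in the Frobenius norm; the random unit-norm dictionary and the penalty weight `alpha=0.1` are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, p, n = 64, 128, 200                      # signal dimension, number of atoms, number of signals
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)              # atoms restricted to unit l2 norm, as in the set above
Y = rng.standard_normal((m, n))             # stand-in for the noisy signals y_1, ..., y_n

# Sparse coding stage: one Lasso problem per signal gives the coefficient matrix A.
lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=5000)
A = np.column_stack([lasso.fit(D, Y[:, i]).coef_ for i in range(n)])

total_sq_error = np.linalg.norm(Y - D @ A, ord="fro") ** 2
avg_nonzeros = np.count_nonzero(A) / n
print(f"total squared error = {total_sq_error:.2f}, average non-zeros per signal = {avg_nonzeros:.1f}")
```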

3.2 Regularization on Dictionary

Compared with sparse coding, regularization of the dictionary size is much less studied. Most existing methods, such as K-SVD and Online Learning, estimate the dictionary directly with a fixed dictionary size. They usually require the size of the dictionary to be specified before learning, and they end up with an over-complete dictionary with $p > m$, which may not be very helpful if we want to better understand the underlying mechanism. In addition, learning a sparse dictionary lowers the model complexity and improves the interpretability of the results. All these issues can be addressed with the help of the GSCAD penalty, which allows us to reveal the real size of the dictionary and at the same time obtain an estimated sparse dictionary. More specifically, denote the dictionary as $D = [d_1, \ldots, d_p]$ with atoms $d_j \in \mathbb{R}^m$. The GSCAD penalty on the dictionary is defined by

where $p_\lambda$ is the SCAD penalty defined in (1). The objective function for our problem is formulated as

(5)

First, the GSCAD penalty tends to set small estimated entries of $D$ to zero, which reduces the complexity of the estimated dictionary. If all elements of an atom $d_j$ are small, GSCAD will shrink the whole atom to zero. Therefore, when starting with a relatively large $p$, GSCAD is able to prune the dictionary by penalizing useless atoms to zero. In this way, the true size of the dictionary can be approximated by the number of non-zero columns in the resulting dictionary. In addition, if GSCAD detects some significant entries in $d_j$, it exerts less penalty on the whole atom to avoid mistakenly truncating real signal.

3.3 Algorithms

We follow the classic two-step approach and solve the optimization problem iteratively. Given the dictionary $D$, we update the sparse code $\alpha_i$ by solving the Lasso problem

for all signals $i = 1, \ldots, n$. Given the coefficient matrix $A$, the optimization problem becomes

(6)

which is addressed by the ADMM algorithm. Once $D$ is updated, we remove all zero columns of $D$ and reset $p$ to the number of remaining atoms. Algorithm 1 summarizes the whole procedure. It should be noted that (6) is a non-convex problem. Recently, the global convergence of ADMM for non-convex optimization was discussed in [19], which shows that several ADMM algorithms, including those with the SCAD penalty, are guaranteed to converge.
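The overall loop can be sketched as follows. This is only a schematic of Algorithm 1: `update_dictionary_gscad` is a hypothetical placeholder standing in for the ADMM update of Algorithm 2, while the Lasso coding and the zero-column pruning follow the description above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(Y, D, alpha=0.1):
    """Sparse coding stage: solve one Lasso problem per signal (columns of Y)."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    return np.column_stack([lasso.fit(D, y).coef_ for y in Y.T])

def prune_zero_atoms(D, A):
    """Drop atoms shrunk to zero by GSCAD, together with the corresponding code rows."""
    keep = np.any(D != 0, axis=0)
    return D[:, keep], A[keep, :]

def dictionary_learning_gscad(Y, D0, update_dictionary_gscad, n_iter=20, alpha=0.1):
    """Schematic of Algorithm 1: alternate sparse coding and the GSCAD dictionary update."""
    D = D0.copy()
    for _ in range(n_iter):
        A = sparse_coding(Y, D, alpha)           # update A given D
        D = update_dictionary_gscad(Y, A, D)     # update D given A (Algorithm 2, placeholder)
        D, A = prune_zero_atoms(D, A)            # reset p to the number of surviving atoms
    return D, A
```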

ADMM for updating the dictionary. Problem (6) is equivalent to

We form the augmented Lagrangian as

where $\odot$ is the element-wise multiplication operator of two matrices. The ADMM algorithm consists of three steps in each iteration:

(7)
(8)

Problem (7) has an explicit solution, and the subproblem in (8) can be solved by column-wise optimization,

for $j = 1, \ldots, p$. In Theorem 1, we showed that this is a convex problem under Condition (4), so it can be solved easily by existing convex optimization algorithms. The ADMM algorithm for updating the dictionary is summarized in Algorithm 2.
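The following is only a schematic of the ADMM loop, written under the common splitting that introduces an auxiliary copy Z of the dictionary with dual variable U and penalty parameter rho; this splitting is an assumption of the sketch, not necessarily the exact formulation above, and `prox_gscad_column` is a hypothetical placeholder for the column-wise subproblem in (8)/(9), which Theorem 1 shows is convex under Condition (4).

```python
import numpy as np

def admm_dictionary_update(Y, A, D0, prox_gscad_column, rho=1.0, n_iter=50):
    """Schematic ADMM dictionary update under the assumed splitting D = Z."""
    m, p = D0.shape
    D, Z, U = D0.copy(), D0.copy(), np.zeros_like(D0)
    AAt, YAt = A @ A.T, Y @ A.T
    for _ in range(n_iter):
        # Step (7): quadratic subproblem in D with an explicit least-squares solution
        # for the assumed objective 0.5*||Y - D A||_F^2 + 0.5*rho*||D - Z + U||_F^2.
        D = np.linalg.solve(AAt + rho * np.eye(p), (YAt + rho * (Z - U)).T).T
        # Step (8): column-wise convex GSCAD subproblem, solved atom by atom.
        V = D + U
        Z = np.column_stack([prox_gscad_column(V[:, j], rho) for j in range(p)])
        # Dual update.
        U = U + D - Z
    return Z
```

In a full implementation, the dictionary update would also carry out the column normalization and the removal of zero columns described in Algorithm 2.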

Input: training samples and tuning parameters
Initialize the dictionary as a random matrix with normalized columns;
while not converged do
      Sparse Coding Stage: for each signal, update its coefficient vector by solving the Lasso problem;
      Dictionary Update Stage: update the dictionary using Algorithm 2;
      Number of atoms: reset the number of atoms to the number of non-zero columns of the updated dictionary;
end while
Output: the learned dictionary and the sparse codes
Algorithm 1 Dictionary Learning with GSCAD
Input: training samples, the current coefficient matrix and dictionary, and tuning parameters
Initialize the ADMM variables;
while not converged do
      Normalize each column of the dictionary;
      Update each atom, for j = 1, ..., p, by solving (9);
      Update the dual variable;
end while
Remove the zero columns of the dictionary;
Output: the updated dictionary
Algorithm 2 Update dictionary using ADMM

Details of the implementation. As demonstrated above, GSCAD has the ability to prune the dictionary, and later, in Section 4.1, we will see that its empirical performance is promising and competitive. However, if the dictionary is initialized with a size smaller than the truth, there is nothing GSCAD can do to help. Therefore, an over-sized dictionary at the initialization step is strongly preferred. Experiments have shown that there is nothing to lose by starting with a large dictionary, as GSCAD can always prune it down to the right size.

During the dictionary updating stage, after we obtain a new dictionary from ADMM, if any two atoms are highly correlated (correlation greater than 0.95, for example), we keep only one of them. Experiments have shown that this has little effect on the results but speeds up the convergence of the algorithm.
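This de-duplication step can be implemented as below; the 0.95 cutoff follows the text, while the use of the normalized inner product between atoms as the similarity measure is an assumption of this sketch.

```python
import numpy as np

def drop_correlated_atoms(D, threshold=0.95):
    """Keep only one atom out of any pair whose normalized inner product exceeds the threshold."""
    Dn = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    gram = np.abs(Dn.T @ Dn)                 # pairwise similarity between atoms
    keep = []
    for j in range(D.shape[1]):
        if all(gram[j, k] <= threshold for k in keep):
            keep.append(j)
    return D[:, keep]
```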

We define convergence of the algorithm through the change in the dictionary and the change in the sparse codes between two consecutive iterations. If both are below a certain threshold, the algorithm stops. In the implementation we also add a rule on the maximum number of iterations, since with a bad initialization GSCAD may get stuck in a region where the iterates keep alternating between two local minima and never converge. Fortunately, such local minima mostly perform decently in terms of denoising.
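The stopping rule can be expressed as a relative-change check on the two blocks plus a cap on the number of iterations; the tolerance below is illustrative.

```python
import numpy as np

def has_converged(D_old, D_new, A_old, A_new, tol=1e-4):
    """Stop when both the dictionary and the sparse codes change little between iterations."""
    if D_new.shape != D_old.shape:
        return False                          # an atom was pruned, so keep iterating
    rel_D = np.linalg.norm(D_new - D_old) / max(np.linalg.norm(D_old), 1e-12)
    rel_A = np.linalg.norm(A_new - A_old) / max(np.linalg.norm(A_old), 1e-12)
    return rel_D < tol and rel_A < tol

# The outer loop additionally enforces a maximum number of iterations, so the algorithm
# stops even if the iterates keep alternating between two local minima.
```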

4 Experimental Results

4.1 Synthetic Experiments

We design a simple example to check the performance of GSCAD from two aspects: (i) whether GSCAD could recover the true size of the dictionary, and (ii) its denoising performance compared with other methods.

Data generation. The generating dictionary contains 10 atoms. Each atom is a vectorized patch, as shown in Figure 2. Then 1500 signals are generated, each created as a linear combination of three different generating atoms picked at random, with independently and identically distributed coefficients. Gaussian noise is added, with the signal-to-noise ratio (SNR) controlled by the Gaussian variance. Four noise levels are adopted: $\sigma = 0.01$, $0.025$, $0.05$ and $0.1$.
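The data-generating process can be mimicked as follows. The 0/1 patch atoms and the standard normal coefficients are illustrative assumptions of this sketch; the three-atom combinations, the 1500 signals, and the four noise levels follow the text.

```python
import numpy as np

rng = np.random.default_rng(42)
m, p_true, n = 64, 10, 1500                   # signal dimension (illustrative), true atoms, signals
D0 = rng.binomial(1, 0.2, size=(m, p_true)).astype(float)   # stand-in for the 0/1 patch atoms of Figure 2
D0 /= np.maximum(np.linalg.norm(D0, axis=0), 1e-12)

sigma = 0.05                                  # one of the four noise levels 0.01, 0.025, 0.05, 0.1
X = np.zeros((m, n))
for i in range(n):
    atoms = rng.choice(p_true, size=3, replace=False)        # three distinct atoms per signal
    coefs = rng.standard_normal(3)                           # coefficient law assumed, not from the paper
    X[:, i] = D0[:, atoms] @ coefs
Y = X + sigma * rng.standard_normal((m, n))                  # noisy training signals
```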

Applying GSCAD. In order to examine GSCAD's ability to prune dictionaries to the right size, dictionaries are initialized with varying numbers of atoms $p_0$, namely 10 (the true size), 15, 20 and 50. We run GSCAD and obtain a learned dictionary whose size might be, and in most cases is, less than the initial size $p_0$. For validation, another 1000 signals are generated under the same setting, and both the clean and the noisy signals are recorded. Coefficients for the learned dictionary are obtained using Orthogonal Matching Pursuit (OMP) with the number of non-zero elements fixed at three. We then reconstruct the signals from these coefficients and calculate the PSNR.
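Validation can be sketched with scikit-learn's OMP as below; the PSNR convention used here (peak taken as the largest clean-signal magnitude) is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def validate_psnr(D_hat, X_clean, Y_noisy):
    """Reconstruct validation signals with 3-sparse OMP codes and report the PSNR."""
    A_hat = orthogonal_mp(D_hat, Y_noisy, n_nonzero_coefs=3)   # exactly three non-zeros per signal
    X_hat = D_hat @ A_hat
    mse = np.mean((X_hat - X_clean) ** 2)
    peak = np.max(np.abs(X_clean))                             # assumed peak value for the PSNR
    return 10 * np.log10(peak ** 2 / mse)
```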

Comparison. For each setting, we also run the K-SVD algorithm, using the Matlab toolbox accompanying its original paper (Aharon et al. [1]), and the Online Learning algorithm [16], using the SPAMS package. Since neither K-SVD nor Online Learning prunes the dictionary, the learned dictionary has the same size as its initial value, i.e. $p = p_0$. Validation for both methods is conducted in the same fashion as for GSCAD.

Result. For each setting of the initial size $p_0$ and the noise level $\sigma$, the experiments using GSCAD, Online Learning and K-SVD are repeated 100 times. The median, first quartile and third quartile of the PSNR are shown in Figure 3. GSCAD performs consistently better than the other two methods across the initial sizes and the SNR levels controlled by $\sigma$, except for a single case in which the initial size equals the true value 10. As suggested in Section 3.3, to make full use of GSCAD's pruning power, it is better to start with an over-sized initial dictionary. The mean and standard deviation of the size of the resulting GSCAD dictionary are reported in Table 1. The resulting sizes are very close to the truth, with very small standard deviations across all cases.

Figure 2: Atoms of the generating dictionary . Each atom corresponds to a patch with white region representing 1 and black region representing 0.
Figure 3: Synthetic results. Median PSNR: circle points connected by lines; first quartile: triangle-shaped points; third quartile: square-shaped points. Different algorithms are indexed by color.
Initial size p0    σ = 0.01        σ = 0.025       σ = 0.05        σ = 0.1
10                 9.97 (0.171)    10.00 (0)       10.00 (0)       10.00 (0)
15                 10.02 (0.245)   10.03 (0.171)   10.15 (0.359)   10.14 (0.403)
20                 10.02 (0.245)   10.08 (0.273)   10.15 (0.359)   10.35 (0.626)
50                 10.07 (0.293)   10.07 (0.293)   10.19 (0.394)   10.36 (0.644)
Table 1: Average number of atoms in the resulting dictionary, by initial size p0 (rows) and noise level (columns). Numbers in parentheses are the corresponding standard deviations.

4.2 Image Denoising

Following the denoising scheme proposed by [6], we train our dictionaries directly on patches from the corrupted images. More details about the scheme can be found in [12]. Five classical images (Figure 4) used in image denoising benchmarks are corrupted with Gaussian noise, whose standard deviation is set to $\sigma = 5$, $10$, $20$ and $50$ for pixel values in the range $[0, 255]$. For each corrupted image, overlapping $8 \times 8$ patches are extracted and centered to form the training set. For an image of size $512 \times 512$, a total of $(512 - 8 + 1)^2 = 255025$ patches are extracted from the original image.
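Patch extraction and centering can be done, for instance, with scikit-learn; the sketch below assumes that centering means removing each patch's mean, the usual convention in this denoising scheme.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

def training_patches(image, patch_size=8):
    """Extract all overlapping patches, vectorize them, and subtract each patch's mean."""
    patches = extract_patches_2d(image, (patch_size, patch_size))   # (n_patches, 8, 8)
    P = patches.reshape(patches.shape[0], -1).astype(float)         # vectorized 64-dimensional patches
    P -= P.mean(axis=1, keepdims=True)                              # center each patch
    return P.T                                                      # columns are training signals

# For a 512 x 512 image this yields (512 - 8 + 1)**2 = 255025 patches.
```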

Figure 4: Images used in denoising: Barb, Boat, Fgrpt, Lena, Peppers.

Dictionaries are trained using (i) the proposed GSCAD algorithm, with Algorithm 2 for the dictionary update, (ii) the K-SVD algorithm from the KSVD Matlab toolbox, and (iii) the Online Learning algorithm from the SPAMS package. A redundant DCT dictionary of size $64 \times 256$ is used to initialize all three methods. The resulting dictionary lies in $\mathbb{R}^{64 \times p}$, where $p$ stays at the initial value 256 for K-SVD and Online Learning but might be smaller for GSCAD.

Once a dictionary is obtained, each patch is approximated up to the noise level by a sparse linear combination of atoms in the dictionary:

where the error tolerance is proportional to the noise variance $\sigma^2$. We set the tolerance to $\lambda\sigma^2$, with $\lambda$ chosen following the effective heuristic of [14]: $\lambda$ is a quantile of the $\chi^2$ distribution whose degrees of freedom equal the patch dimension. The denoised image is then reconstructed from the sparse approximations of all patches, and its mean squared error (MSE) with respect to the clean image is calculated. For each setting, the whole procedure is repeated five times with different realizations of the noise. The PSNR is defined as $10\log_{10}(255^2/\mathrm{MSE})$.
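A sketch of the per-patch sparse approximation step is given below; the 0.9 quantile of the chi-squared distribution is an illustrative choice rather than the exact value of the heuristic in [14].

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import orthogonal_mp

def denoise_patches(D, patches, sigma, quantile=0.9):
    """Approximate each centered patch (columns of `patches`) up to the noise level with OMP."""
    m = D.shape[0]
    eps = sigma ** 2 * chi2.ppf(quantile, df=m)   # error budget per patch, i.e. lambda * sigma^2
    A = orthogonal_mp(D, patches, tol=eps)        # tol bounds the squared norm of the residual
    return D @ A
```

The denoised image is then typically assembled by returning each reconstructed patch (with its mean added back) to its location and averaging the overlaps, as in the scheme of [6].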

The results for all three methods are generally very close to each other. At lower noise levels, GSCAD performs better. Table 3 summarizes the average number of atoms in the resulting GSCAD dictionaries; it shows that GSCAD outperforms the other two methods with a learned dictionary less than half the size of that used by the other algorithms. On the other hand, at higher noise levels, GSCAD becomes less competitive. Specifically, when $\sigma = 50$, all resulting dictionary sizes are close to the initial value of 256, which may indicate that a dictionary of size 256 is not a large enough initial value for GSCAD; our experience suggests that higher noise levels require larger dictionary sizes. Another interesting finding is that, at the same noise level, the fingerprint image needs a much larger dictionary for denoising than the other images.

       Barb                      Boat                      Fgrpt
σ      GSCAD   KSVD    Online    GSCAD   KSVD    Online    GSCAD   KSVD    Online
5      38.23   38.05   38.15     37.24   37.24   37.06     36.60   36.65   36.53
10     34.57   34.39   34.73     33.70   33.64   33.81     32.38   32.44   32.52
20     30.79   30.90   31.13     30.30   30.46   30.63     28.31   28.58   28.70
50     25.51   25.75   25.87     25.79   26.09   26.25     23.00   23.63   23.59

       Lena                      Peppers                   Average
σ      GSCAD   KSVD    Online    GSCAD   KSVD    Online    GSCAD   KSVD    Online
5      38.63   38.60   38.56     38.04   37.78   37.81     37.75   37.66   37.62
10     35.58   35.48   35.70     34.51   34.22   34.71     34.15   34.03   34.29
20     32.31   32.41   32.60     30.84   30.82   31.22     30.51   30.63   30.86
50     27.66   27.88   28.02     25.84   26.32   26.37     25.56   25.93   26.02

Table 2: Average PSNR (dB) over five runs, for each image, method, and noise level σ.
σ      Barb    Boat    Fgrpt   Lena    Peppers
5      109     125     178     86      100
10     92      94      195     80      86
20     120     151     224     136     153
50     248     248     246     250     247
Table 3: Average number of atoms in the resulting GSCAD dictionary, for each image and noise level σ.

5 Conclusion

The GSCAD method has been presented for learning a sparse dictionary and selecting the dictionary size simultaneously. The experimental analysis has demonstrated very encouraging results relative to state-of-the-art methods. This new framework may also be applied to the general subspace clustering problem for image clustering, which assumes that similar points lie in the same subspace; the proposed formulation can learn the clustering and the number of clusters at the same time. The framework may also be applied to architecture design in deep learning, where the GSCAD penalty can learn a sparse connection between the units of two layers in a deep neural network to improve efficiency.

Appendix

Proof of Theorem 1.

(1). When , we have , and further

for any . When , we have

and further

Therefore, to minimize (2), has to satisfy . If we denote and denote as the open interval between and 0, i.e.

then optimization problem (2) is equivalent to

(2). Recall that . To simplify the notation, we rewrite as if there were no zero elements in , and correspondingly . Define :

We extend to the whole half plane:

If we can show that is convex in , this will imply that is convex over , as is continuous all over .

To show that the optimization problem within is convex, we are going to verify the inequality

for any . This is trivial for , and for , we consider the following two cases.

Case 1: . Therefore there are only a finite number of points in the set such that does not have a second-order derivative. Let . Define . If we can show that is continuous on and except at a finite number of points, then is non-decreasing, and furthermore is convex on . By definition, for any ,

Therefore is convex.

Now we are going to show that is continuous and except at a finite number of points, where does not exist. Taking derivative of , we get

where

Since is continuous for all and ,

is continuous. Except at a finite number of points, where does not exist, we have

Let

To show that , we only need to show that . Since , without loss of generality, we are only going to show that , for .

Take derivative of ,

Since and , we have for all . Observe that is piece-wise continuous on , and . For ,

For

For ,

Therefore , for , and furthermore, except at a finite number of points. Thus we finish the proof of Case 1.

Case 2: or , where . Without loss of generality, we assume that the last elements of and are the same and the rest are not, i.e. for and for . Let , and . Therefore there are only a finite number of such that the point belongs to .

Let , and define as

Define . Then, similarly to Case 1, we can show that