I Introduction
Given a set of samples, sparse coding tries to learn an overcomplete dictionary and then represent each sample as a sparse combination (code) of the dictionary atoms. It has been used in various signal processing [1, 2]
and computer vision applications
[3, 4]. Albeit its popularity, sparse coding cannot capture shifted local patterns in the data. Hence, preprocessing (such as the extraction of sample patches) and postprocessing (such as aggregating patch representation back into sample representation) are needed, otherwise redundant representations will be learned.Convolutional sparse coding (CSC) [5] is a recent method which improves sparse coding by learning a shiftinvariant dictionary. This is done by replacing the multiplication between codes and dictionary by convolution operation, which can capture local patterns of shifted locations in the data. Therefore, no preprocessing or postprocessing are needed, and the sample can be optimized as a whole and represented as the sum of a set of filters from the dictionary convolved with the corresponding codes. It has been successfully used on various data types, including trajectories [6], images [7], audios [8], videos [9], multispectral and light field images [10] and biomedical data [11, 12, 13, 14]. It also succeeds on a variety of applications accompanying the data, such as recovering nonrigid structure from motion [6]
, image super resolution
[15], image denoising and inpainting [10], music transcription [8], video deblurring [9], neuronal assembly detection
[16] and so on.All above CSC works use square loss, thus assume that the noise in the data is from Gaussian distribution. However, this can be restrictive and does not suit many realworld problems. For example, although CSC is popularly used for biomedical data sets [11, 13, 14, 16] where shifting patterns abound due to cell division, it cannot handle the various complicated noises in the data. In fact, biomedical data sets usually contain artifacts during recording, e.g., biomedical heterogeneities, large variations in luminance and contrast, and disturbance due to other small living animals [11, 17]. Moreover, as the target biomedical structures are often tiny and delicate, the existing of noises will heavily interfere the quality of the learned filters and representation [14].
Lots of algorithms have been proposed for CSC with square loss. While the objective is not convex, it is convex when codes of the dictionary are fixed. Thus, CSC is mainly solved by alternatively update the codes and dictionary by block coordinate descent (BCD) [18]. The difference of methods mainly lies in how to solve the subproblems (codes update or dictionary update) separately. The pioneering work Deconvolutional network [5] uses gradient descent for both subproblems. ConvCoD [19]
uses stochastic gradient descent for dictionary update and additionally learns an encoder to output the codes. Recently, other works
[7, 9, 20, 21, 22, 23] use alternating direction method of multipliers (ADMM) [24]. ADMM is favored since it can decompose the subproblem into smaller ADMM subproblems which usually have closedform solutions. The decomposition allows solving CSC by separately performing faster convolution in frequency domain while enforcing the translationvariant constraints and regularizers in spatial domain. However, one needs another alternating loop between these ADMM subproblems so as to coordinate on the solution of the original subproblem.Recently, there emerges one work which models the noise in CSC other than Gaussian. Jas et al. [14] proposed the alphastable CSC (CSC) which models the noise in 1D signals by the symmetric alphastable distribution [25]
. This distribution includes a range of heavytailed distributions, and is known to be more robust to noise and outliers. However, the probability density function of the alphastable distribution does not have an analytical form, and its inference needs to be approximated by the Markov chain Monte Carlo (MCMC)
[26] procedure, which is known to be computationally expensive. Moreover, as shown in Figure. 1, the alphastable distribution still restricts the noise to be of one particular type in advance, which is not appropriate due to unknown ground truth noise type.In this paper, we propose a general CSC model (GCSC) which enables CSC to deal with complicated unknown noise. Specifically, we model the noise in CSC by the Gaussian mixture model (GMM), which can approximate any continuous probability density function. The proposed model is then solved by the ExpectationMaximization algorithm (EM). However, the maximization step becomes a weighted variant of the CSC problem which cannot be efficiently solved by existing algorithms, e.g., BCD and ADMM, since they bring extra inner loops in Mstep. Besides, the weight matrix prevents us from solving the whole objective in the frequency domain to speed up the convolution. In our proposed method, we develop a new solver to update the dictionary and codes together by a nonconvex accelerated proximal algorithm without alternating loops. Moreover, we manage to efficiently speed up the convolution in the frequency domain and calculate the part involving the weight matrix in the spatial domain. The resultant algorithm achieves comparable time and space complexity compared with stateoftheart CSC algorithms (for square loss). Extensive experiments are performed on both synthetic data and realworld biomedical data such as local field potential signals and retinal scan images. Results show that the proposed method can model the complex underlying data noise, and obtain highquality filters and representations.
The rest of the paper is organized as follows. Section II briefly reviews CSC and proximal algorithm. Section III describes the proposed method, Experimental results are presented in Section IV, and the last section gives some concluding remarks.
Notations
: For vector
, its th element is denoted , its norm is , its norm is , and reshapes to a diagonal matrix with elements ’s. Given another vector , the convolution produces a vector , with . For matrix , denotes its transpose, denotes its complex conjugate, is its conjugate transpose (i.e., ).denotes pointwise product. The identity matrix is denoted
.is the fast Fourier transform that maps
from the spatial domain to the frequency domain, while is the inverse operator which maps back to .Ii Related Work
Iia Convolutional Sparse Coding
Given samples ’s, where each , CSC learns a dictionary of filters ’s, each of length , such that each can be well represented as
(1) 
Here, is the (spatial) convolution operation, and ’s are the codes for , each of length . The filters and codes are obtained by solving the following optimization problem
(2) 
where ensures that the filters are normalized, and the regularizer encourages the codes to be sparse.
To solve (2), block coordinate descent (BCD) [18] is typically used [5, 7, 19, 20, 21, 22]. The codes and dictionary are updated in an alternating manner as follows.
IiA1 Code Update
Given ’s, the corresponding ’s are obtained as
(3) 
Convolution can be performed much faster in the frequency domain via the convolution theorem^{1}^{1}1, where
is first zeropadded to
dimensional. [27]. Combining this with the use of Parseval’s theorem^{2}^{2}2 For , . [27] and the linearity of FFT, problem (3) is reformulated in [20, 7, 21] as:(4) 
IiA2 Dictionary Update
Given ’s, ’s is updated by solving
Similar to the code update, it is more efficient to perform convolution in the frequency domain, as:
(5)  
s.t. 
where is a matrix with and for which is used to crop the extra dimension to recover the original spatial support, and can pad to be dimensional. The constraint scales all filters to unit norm.
The alternating direction method of multipliers (ADMM) [24] has been commonly used for the code update and dictionary update subproblems (4) and (5)) [7, 20, 21, 22]. Each ADMM subproblem has a closedform solution that is easy to compute. Besides, with the introduction of auxiliary variables, ADMM separates computations in the frequency domain (involving convolutions) and spatial domain (involving the regularizer and unit norm constraint).
IiB Proximal Algorithm
The proximal algorithm [28] is used to solve composite optimization problems of the form
where is smooth, is nonsmooth, and both are convex. To make the proximal algorithm efficient, its underlying proximal step (with stepsize )
has to be inexpensive.
Iii Proposed Method
The square loss in (2
) implicitly assumes that the noise is normally distributed. In this section, we relax this assumption, and assume the noise to be generated from a Gaussian mixture model (GMM). It is wellknown that a GMM can approximate any continuous probability density function
[30].As in other applications of GMM, we will use the EM algorithm for inference. However, as will be shown, the Mstep involves a difficult weighted CSC problem. In Section IIIC, we design an efficient solver based on nonconvex accelerated proximal algorithm, with comparable time and space complexity as stateoftheart CSC algorithms (for the square loss).
Iiia GMM Noise
We assume that the noise associated with follows the GMM distribution:
where is the number of Gaussian components, is the latent variable denoting which Gaussian component belongs to, and ’s are mixing coefficients with . Variable follows the multinomial distribution , and the conditional distribution of given follows the normal distribution with mean and diagonal covariance matrix . The regularizer in (2) corresponds to the prior Laplace distribution () on each th element of : .
IiiB Using Expectation Maximization (EM) Algorithm
Let denote the collection of all parameters ’s, ’s, ’s, ’s and
’s. The log posterior probability for
is:(6) 
This can be maximized by the Expectation Maximization (EM) algorithm [31].
The Estep computes , the posterior probability that belongs to the th Gaussian given . Using Bayes rule, we have
(7) 
where , and in (1).
IiiB1 Updating ’s, ’s, ’s in
IiiB2 Updating ’s and ’s in
Given ’s, ’s, ’s and ’s, we obtain ’s and ’s from (9) as:
This can be rewritten as
(13) 
where
(14) 
with , and
(15) 
Here, is the indicator function on (i.e., if , and otherwise).
Compared with the standard CSC problem in (2), problem (13) can be viewed as a weighted CSC problem (with weight ). A CSC variant [7] puts weights on , while ours are on . In [14], the model also leads to a weighted CSC problem. However, the authors there mentioned that it is not clear how to solve a weighted CSC problem in the frequency domain. Instead, they resorted to solving it in the spatial domain, which is less efficient as discussed in Section IIA.
IiiC Solving the Weighted CSC Subproblem (13)
In this section, we solve the weighted CSC problem in (13) using niAPG [29] (Algorithm 1). Note that the weights ’s in (14) is the cause that prevents us from transforming the whole objective (13) to the frequency domain. Recall that (3) is transformed to (4) by first transforming everything in norm to frequency domain by Parseval’s theorem, separately computing and by linearity of FFT, then replacing the convolution in by pointwise product using convolution theorem. However, with ’s, cannot use convolution theorem to speed up. Therefore, the key in designing an efficient solver is to only transform terms involving convolutions to the frequency domain, while leaving the weight in spatial domain. Hence we replace the term in (14) by .
The core steps in the niAPG algorithm are computing (i) the gradient w.r.t. ’s and ’s , and (ii) the proximal step . We first introduce the following Lemmas.
Lemma 1.
Let for . Then, , which reshapes to a diagonal matrix with elements ’s.
Proof.
, where . Then, . ∎
Lemma 2.
[32] For , and , where , is the th root of unity, and . Moreover, and .
Proposition 3.
Proof.
can be rewritten as , where , , , , , , .
Note that in (15) is separable^{3}^{3}3 is separable if .. This simplifies the associated proximal step, as shown by the following Lemma.
Lemma 4.
[28] If is separable, .
Using Lemma 4, the component proximal steps can be easily computed in closed form as [28]: , and . In the sequel, we avoid tuning by using line search [33], which also speeds up convergence empirically. The procedure for the solving the weighted CSC subproblem (14) is shown in Algorithm 2. The whole algorithm, which will be called general CSC (GCSC), is shown in Algorithm 3.
method  convolution operation  noise modeling  algorithm 

DeconvNet [5]  spatial  Gaussian  BCD, updating codes and dictionary by gradient descent 
ConvCod [19]  spatial  Gaussian  BCD, updating dictionary by gradient descent and codes by encoder 
FCSC [20]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
FFCSC [7]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
CBPDN [21]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
CONSENSUS [22]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
SBCSC [23]  spatial  Gaussian  BCD, updating dictionary by ADMM and codes by LARS[34] 
OCDLDegraux [35]  spatial  Gaussian  BCD, updating codes and dictionary by projected BCD 
OCDLLiu [36]  frequency  Gaussian  BCD, updating codes and dictionary by proximal gradient descent 
OCSC [37]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
CCSC [9]  frequency  Gaussian  BCD, updating codes and dictionary by ADMM 
CSC [14]  spatial  alphastable  BCD, updating codes and dictionary by LBFGS [38] 
GCSC  frequency  Gaussian mixture  niAPG 
IiiD Complexity Analysis
In each EM iteration, the Estep in (7) takes time. The Mstep is dominated by gradient computations in (17) and (19). These take time for the underlying FFT and inverse FFT operations, and time for the pointwise product. Thus, each EM iteration takes a total of time, where is the number of niAPG iterations. Empirically, is around 50. As for space, this is dominated by the dimensional codes for each of the samples, leading to a space complexity of .
In comparison, the stateoftheart batch CSC method (which uses the square loss) [7] takes time per iteration and space. Usually, .
IiiE Discussion with Existing CSC Works
Table I compares GCSC with existing CSC algorithms. The key differences are in noise modeling and algorithm design. First, all methods except GCSC and CSC model the noise by Gaussian distribution, and CSC uses symmetric alphastable distribution. Recall that GMM can approximate any continuous distribution, the noises considered previously all are special case of GMM noise. Second, all algorithms except GCSC use BCD, which alternatively updates codes and dictionary, and a majority of methods then update the codes and dictionary by ADMM separately. As GCSC already has one alternating loop between Estep and Mstep, using BCD and then ADMM will bring in two more alternating loops, resulting a much slower algorithm compared with existing CSC algorithms. Therefore, we use niAPG to directly update codes and dictionary together. Empirical results in next section validate the efficiency of solving the weighted CSC problem (13) in GCSC by niAPG, rather than BCD.
MAE  RMSE  time (seconds)  
no noise  CSC  0.0003590.000027  0.0008470.000109  419.6437.86 
CSC  0.0003640.000171  0.0008710.000183  651.6798.94  
CSC  0.0003620.000109  0.0008680.000122  2217.44345.96  
GCSC  0.0003300.000125  0.0008490.000141  414.3434.48  
Gaussian noise  CSC  0.003680.00036  0.007750.00072  246.3355.39 
CSC  0.01040.0012  0.02490.0008  715.4493.86  
CSC  0.003530.00017  0.007660.00052  1986.08262.47  
GCSC  0.003430.00021  0.007620.00026  238.5219.78  
Laplace noise  CSC  0.008350.00042  0.01140.0010  470.13 27.44 
CSC  0.003470.00026  0.007350.00038  358.64175.40  
CSC  0.006920.00042  0.009730.00013  2309.94724.53  
GCSC  0.003350.00017  0.007320.00020  350.7668.33  
alphastable noise  CSC  0.07020.0060  0.08400.0091  597.8390.70 
CSC  0.01600.0025  0.03370.0047  476.2437.92  
CSC  0.004160.00039  0.008210.00024  2198.32470.57  
GCSC  0.004020.00030  0.008150.00041  412.4371.22  
zeromean mixture noise  CSC  0.03210.0007  0.05450.0010  344.0827.44 
CSC  0.06040.0055  0.08490.0059  588.5788.89  
CSC  0.01140.0002  0.01580.0003  1120.70 463.70  
GCSC  0.005310.00021  0.009710.00082  336.0077.85  
nonzeromean mixture noise  CSC  0.07320.0015  0.01510.0011  642.0384.44 
CSC  0.06700.0037  0.01300.0013  788.6088.89  
CSC  0.06670.0002  0.01270.0014  2882.26907.28  
GCSC  0.005560.00024  0.008180.00037  471.4087.90 
Performance on the synthetic data. The best and comparable results (according to the pairwise ttest with 95% confidence) are highlighted in bold.
Iv Experiments
In this section, we perform experiments on both synthetic and realworld data sets. Experiments are performed on a PC with Intel i7 4GHz CPU with 32GB memory.
Iva Baseline Algorithms
The proposed GCSC is compared with the following CSC stateofthearts:

CSC [7]^{4}^{4}4http://www.cs.ubc.ca/labs/imager/tr/2015/FastFlexibleCSC/, which models noise by the Gaussian distribution.

Alphastable CSC (CSC) [14]^{5}^{5}5https://alphacsc.github.io/, which uses symmetric alphastable distribution to models noise by setting the parameters of alphastable distribution (with stability parameter
, skewness parameter
, scale parameter and position parameter ) as and where is defined as in (1).^{6}^{6}6In [14], is simply set to 1.2. Here, we choose by using a validation set, which is obtained by randomly sampling 20% of the samples.
As a further baseline, we also compare with CSC, a CSC variant which models noise by the Laplace distribution. It is formulated as the following optimization problem:
(20) 
Details are in Appendix A.
We follow the automatic pruning strategy in [39] to select the number of mixture components in GCSC. We start with a relatively large . At each EM iteration, the relative difference among all Gaussians are computed:
For the Gaussian pair with the smallest relative difference, if this value is small (less than 0.1), they are merged as
Stopping Criteria
The optimization problems in CSC and GCSC are solved by the EM algorithm. We stop the EM iterations when the relative change of log posterior in consecutive iterations is smaller than . In the Mstep, we stop the updating of weighted CSC (Algorithm 2 for GCSC, and Algorithm in Appendix B of CSC paper [14]) if the relative change of the respective objective value is smaller than .
The optimization problems in CSC and CSC are solved by BCD. Alternating minimization is stopped when the relative change of objective value ((2) for CSC and (20) for CSC) in consecutive iterations is smaller than . As for the optimization subproblems of ’s (given ’s) and ’s (given ’s), we stop when the relative change of objective value is smaller than .
IvB Synthetic Data
In this experiment, we first demonstrate the performance on synthetic data. Following [14], we use filters ’s (triangle, square, and sine), each of length (Figure 2(a)). Each
is normalized to have zero mean and unit variance. Each
has only one nonzero entry, whose magnitude is uniformly drawn from (Figure 2(b)). clean samples, each of length , are generated as:
Comments
There are no comments yet.