Online Convolutional Sparse Coding with Sample-Dependent Dictionary

04/27/2018 · by Yaqing Wang, et al.

Convolutional sparse coding (CSC) has been popularly used for the learning of shift-invariant dictionaries in image and signal processing. However, existing methods have limited scalability. In this paper, instead of convolving with a dictionary shared by all samples, we propose the use of a sample-dependent dictionary in which filters are obtained as linear combinations of a small set of base filters learned from the data. This added flexibility allows a large number of sample-dependent patterns to be captured, while the resultant model can still be efficiently learned by online learning. Extensive experimental results show that the proposed method outperforms existing CSC algorithms with significantly reduced time and space requirements.




1 Introduction

Convolutional sparse coding (CSC) (Zeiler et al., 2010) has been successfully used in image processing (Gu et al., 2015; Heide et al., 2015), signal processing (Cogliati et al., 2016), and biomedical applications (Pachitariu et al., 2013; Andilla & Hamprecht, 2014; Chang et al., 2017; Jas et al., 2017; Peter et al., 2017). It is closely related to sparse coding (Aharon et al., 2006), but is advantageous in that its shift-invariant dictionary can capture shifted local patterns common in signals and images. Each data sample is then represented by the sum of a set of filters from the dictionary convolved with the corresponding codes.
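As a concrete illustration, the CSC representation above can be sketched in a few lines of NumPy. This is a hypothetical 1-D example with made-up sizes (the paper works with 2-D images); it only shows that a sample is reconstructed as the sum of filter/code convolutions with sparse codes.

```python
import numpy as np

# Sketch of the CSC representation (1-D case, hypothetical sizes):
# a sample x is approximated by the sum of K filters d_k convolved with
# their sparse codes z_k.
rng = np.random.default_rng(0)
P, M, K = 32, 5, 3                      # signal length, filter length, #filters
filters = rng.standard_normal((K, M))
codes = np.zeros((K, P - M + 1))
codes[0, 4] = 1.0                       # sparse codes: few active locations
codes[2, 10] = -0.5

# Reconstruction: sum over filters of (code * filter) convolutions.
x_hat = sum(np.convolve(codes[k], filters[k]) for k in range(K))
print(x_hat.shape)                      # (P,)
```

Because the codes here are isolated spikes, each active code location simply places a (scaled) copy of the corresponding filter into the reconstruction, which is exactly the shift-invariance property the dictionary exploits.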

Traditional CSC algorithms operate in the batch mode (Kavukcuoglu et al., 2010; Zeiler et al., 2010; Bristow et al., 2013; Heide et al., 2015; Šorel & Šroubek, 2016; Wohlberg, 2016; Papyan et al., 2017), whose time and space requirements grow with the number of samples N, the number of filters K, and the sample dimensionality. Recently, a number of online CSC algorithms have been proposed for better scalability (Degraux et al., 2017; Liu et al., 2017; Wang et al., 2018). As data samples arrive, relevant information is compressed into small history statistics, and the model is incrementally updated. In particular, the state-of-the-art OCSC algorithm (Wang et al., 2018) has much smaller time and space complexities, both independent of the number of samples.

However, the complexities of OCSC still depend quadratically on the number of filters K, so it cannot be used with a large number of filters. The number of local patterns that can be captured is thus limited, which may lead to inferior performance, especially on higher-dimensional data sets. Besides, the use of more filters also leads to a larger number of expensive convolution operations. Rigamonti et al. (2013) and Sironi et al. (2015) proposed to post-process the learned filters by approximating them with separable filters, making the convolutions less expensive. However, as learning and post-processing are two independent procedures, the resultant filters may not be optimal. Moreover, these separable filters cannot be updated online as new samples arrive.

Another direction to scale up CSC is via distributed computation (Bertsekas & Tsitsiklis, 1997). By distributing the data and workload onto multiple machines, the recent consensus CSC algorithm (Choudhury et al., 2017) can handle large, higher-dimensional data sets such as videos, multispectral images and light field images. However, the heavy computational demands of the CSC problem are only shared over the computing platform, but not fundamentally reduced.

In this paper, we propose to approximate the possibly large number of filters by sample-dependent combinations of a small set of base filters learned from the data. While the standard CSC dictionary is shared by all samples, we propose that each sample have its own “personal” dictionary to compensate for the reduced flexibility of using these base filters. In this way, the representation power can remain the same but with fewer parameters. Computationally, this structure also allows efficient online learning algorithms to be developed. Specifically, the base filters can be updated by the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), while the codes and combination weights can be learned by accelerated proximal algorithms (Yao et al., 2017). Extensive experimental results on a variety of data sets show that the proposed algorithm is more efficient in both time and space, and outperforms existing batch, online and distributed CSC algorithms.

The rest of the paper is organized as follows. Section 2 briefly reviews convolutional sparse coding. Section 3 describes the proposed algorithm. Experimental results are presented in Section 4, and the last section gives some concluding remarks.


Notation: For a vector a, its ith element is a_i, its ℓ1-norm is ∥a∥₁ = Σ_i |a_i|, and its ℓ2-norm is ∥a∥₂ = (Σ_i a_i²)^{1/2}. The convolution of two vectors a and b is denoted a ∗ b. For a matrix A, Ā is its complex conjugate, and A^H its conjugate transpose. The Hadamard product of two matrices A and B is A ⊙ B. The identity matrix is denoted I. F(·) is the Fourier transform that maps a vector from the spatial domain to the frequency domain, while F⁻¹(·) is the inverse operator which maps the transformed vector back.

2 Review: Convolutional Sparse Coding

Given N samples x_1, …, x_N, CSC learns a shift-invariant dictionary D, whose K columns d_1, …, d_K represent the filters. Each sample x_i is encoded as Z_i, with the kth column z_ik being the code convolved with filter d_k. The dictionary and codes are learned together by solving the optimization problem:

min_{D, {Z_i}} Σ_i ( ½ ∥x_i − Σ_k d_k ∗ z_ik∥₂² + β Σ_k ∥z_ik∥₁ )  s.t.  ∥d_k∥₂² ≤ 1 for all k,   (1)

where the first term measures the signal reconstruction error, the norm constraint ensures that the filters are normalized, and β is a regularization parameter controlling the sparsity of the z_ik's.

Convolution in (1) is performed in the spatial domain. For P-dimensional samples and filters of size M, this takes O(PM) time per convolution and is expensive. In contrast, recent CSC methods perform convolution in the frequency domain, which takes O(P log P) time (Mallat, 1999) and is faster for typical choices of P and M. With the samples, filters and codes replaced by their Fourier-transformed counterparts, the codes and filters are updated in an alternating manner by block coordinate descent, as follows:
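The spatial/frequency-domain equivalence can be checked numerically. The sketch below (hypothetical sizes) uses the standard fact that multiplying zero-padded FFTs gives the linear convolution, which is what makes the frequency-domain updates valid:

```python
import numpy as np

# Linear convolution computed in the frequency domain.
# Zero-padding to length >= len(a) + len(b) - 1 makes circular convolution
# (what FFT multiplication computes) agree with linear convolution.
rng = np.random.default_rng(1)
a, b = rng.standard_normal(64), rng.standard_normal(11)
n = len(a) + len(b) - 1
spatial = np.convolve(a, b)                                    # O(P*M)
freq = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)  # O(P log P)
print(np.allclose(spatial, freq))                              # True
```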

Update Codes: Given the dictionary, the codes of each sample are obtained independently by solving an ℓ1-regularized least-squares subproblem (2), in which an auxiliary variable is introduced for the ADMM splitting.

Update Dictionary: The dictionary is updated by solving a constrained least-squares subproblem (3), in which an auxiliary variable is introduced, and a cropping operator removes the extra dimensions introduced by zero-padding.

Both subproblems can be solved by the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). Subsequently, the filters and codes can be transformed back to the spatial domain by the inverse FFT. Note that while the codes are sparse in the spatial domain, their FFT-transformed counterparts are not.
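For intuition, the ℓ1 proximal step that appears inside such ADMM code updates is elementwise soft-thresholding. This is a standard identity, not specific to this paper:

```python
import numpy as np

# prox_{tau * ||.||_1}(v): shrink each entry of v toward zero by tau.
# This is the closed-form l1 proximal step used inside ADMM code updates.
def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

v = np.array([1.5, -0.2, 0.7, -3.0])
out = soft_threshold(v, 0.5)
print(out)
```

Entries smaller than the threshold are zeroed out, which is precisely how the ℓ1 regularizer in (1) induces sparse codes.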

At inference time, given the learned dictionary, a test sample is reconstructed as the sum of the filters convolved with the codes obtained from the code update subproblem with the dictionary fixed.

2.1 Post-Processing for Separable Filters

Filters obtained by CSC are non-separable, and subsequent convolutions with them can be slow. To speed this up, they can be post-processed and approximated by separable filters (Rigamonti et al., 2013; Sironi et al., 2015). Specifically, the learned dictionary D is approximated by BW, where B contains R rank-1 base filters and W contains the combination weights. However, this often leads to performance deterioration.
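The flavor of this low-rank post-processing can be sketched with a truncated SVD. Note this is a simplified stand-in, not the actual SEP-TD tensor decomposition, and the sizes are hypothetical:

```python
import numpy as np

# Approximate a filter bank D (columns = vectorized filters) by R base
# filters times mixing weights, via truncated SVD. Here D is constructed
# to have rank <= R, so the approximation is exact; for a real learned
# dictionary there would be a residual error.
rng = np.random.default_rng(2)
M, K, R = 121, 100, 10                  # filter dim, #filters, #base filters
D = rng.standard_normal((M, R)) @ rng.standard_normal((R, K))
U, s, Vt = np.linalg.svd(D, full_matrices=False)
B = U[:, :R] * s[:R]                    # M x R base filters
W = Vt[:R]                              # R x K combination weights
print(np.allclose(D, B @ W))            # exact here since rank(D) <= R
```

The key point is the parameter saving: storing B and W takes MR + RK numbers instead of MK, which is the same economy the paper later exploits, but learned jointly rather than by post-processing.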

2.2 Online CSC

An online CSC algorithm (OCSC) was recently proposed in (Wang et al., 2018). Given a newly arrived (Fourier-transformed) sample and the dictionary from the last iteration, the corresponding codes are obtained as in (2). The following Proposition shows how the dictionary can then be updated using only compact history statistics.

Proposition 1 ((Wang et al., 2018)).

The dictionary can be obtained by solving a reformulated optimization problem that depends on the past samples only through two compact history statistics.

This reformulated problem can be solved by ADMM. The total space complexity is quadratic in the number of filters K but independent of the sample size N. Moreover, the history statistics can be updated incrementally as new samples arrive.
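To make the incremental update concrete, here is a hedged sketch of the generic online-dictionary-learning pattern: past samples are folded into fixed-size sufficient statistics. The actual OCSC statistics live per frequency in the Fourier domain and have different shapes; everything below is illustrative:

```python
import numpy as np

# Generic pattern: compress the stream into fixed-size statistics
# (code-code and code-sample products) updated one sample at a time.
K, P = 4, 16
A = np.zeros((K, K))        # accumulates code-code products
b = np.zeros((K, P))        # accumulates code-sample products

def absorb(z, x, A, b):
    """Fold one sample's codes z (K,) and sample x (P,) into the statistics."""
    A += np.outer(z, z)
    b += np.outer(z, x)
    return A, b

rng = np.random.default_rng(3)
for _ in range(100):                    # stream of samples
    z, x = rng.standard_normal(K), rng.standard_normal(P)
    A, b = absorb(z, x, A, b)
print(A.shape, b.shape)                 # fixed size, independent of #samples
```

The dictionary update then needs only A and b, never the raw past samples, which is what makes the space complexity independent of N.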

Two other online CSC reformulations have also been proposed recently. Degraux et al. (2017) perform convolution in the spatial domain, which is slow. Liu et al. (2017) perform convolution in the frequency domain, but require expensive operations on huge sparse matrices.

3 Online CSC with Sample-Dependent Dictionary

Though OCSC scales well with the sample size N, its space complexity still depends quadratically on the number of filters K. This limits the number of filters that can be used, which can hurt performance. Motivated by the idea of separable filters in Section 2.1, we enable learning with more filters by approximating the K filters with R base filters, where R ≪ K. In contrast to separable filters, which are obtained by post-processing and may not be optimal, we learn the base filters directly during signal reconstruction. Moreover, the filters in the dictionary are combined from the base filters in a sample-dependent manner.

3.1 Problem Formulation

Recall that each x_i in (1) is represented as the sum of the filters convolved with its codes. Let B be the matrix whose R columns are the base filters. We propose to represent each filter as a linear combination of the base filters:

d_k(i) = B w_k(i),   (5)

where w_k(i) is the vector of combination weights for the kth filter of sample i. In other words, the shared dictionary in (1) is replaced by the sample-dependent dictionary

D_i = B W_i,   (6)

where W_i collects the weight vectors. As will be seen, this allows the W_i's to be learned independently (Section 3.3). This also leads to more sample-dependent patterns being captured and thus better performance (Section 4.4).
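The construction can be sketched in one line of linear algebra, with hypothetical sizes: all samples share the base filters, but each mixes them with its own weights, so many filters are generated from few parameters.

```python
import numpy as np

# Sample-dependent dictionary: shared base filters B, per-sample weights W_i.
rng = np.random.default_rng(4)
M, R, K = 11, 5, 100
B = rng.standard_normal((M, R))         # shared base filters (M x R)
W_i = rng.standard_normal((R, K))       # this sample's combination weights
D_i = B @ W_i                           # this sample's K filters (M x K)
print(D_i.shape)                        # K filters built from only R bases
```

Per sample, only the R × K weight matrix is new; the M × R base filters are amortized over the whole data set.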

Sample-dependent filters have recently been studied in convolutional neural networks (CNNs) (Jia et al., 2016). Empirically, they outperform standard CNNs in one-shot learning (Bertinetto et al., 2016), video prediction (Jia et al., 2016) and image deblurring (Kang et al., 2017). However, Jia et al. (2016) use a specially designed neural network to learn the filters, and do not consider the CSC model. In contrast, the sample-dependent filters here are integrated into CSC.

The dictionary can also be adapted to individual samples by fine-tuning (Donahue et al., 2014). However, learning the initial shared dictionary is still expensive when the number of filters is large. Besides, as will be shown in Section 4.2, the proposed method empirically outperforms fine-tuning.

3.2 Learning

Plugging (6) into the CSC formulation in (1), we obtain an optimization problem over the base filters B, the weights {W_i} and the codes {Z_i}, subject to norm constraints on the base filters and weights (9). As the base filters and weights are coupled together in (9), the optimization problem is difficult. The following Proposition decouples them. All proofs are in the Appendix.

Proposition 2.

Assume that each base filter b_r satisfies ∥b_r∥₂ ≤ 1. Then, ∥Bw∥₂ ≤ 1 if (i) ∥w∥₁ ≤ 1, or (ii) ∥w∥₂ ≤ 1/√R.

To simplify notation, we use C to denote either the ℓ1-ball or the ℓ2-ball of Proposition 2. By imposing either one of the above structures on the weights, we obtain the following optimization problem:


At inference time, given a sample, the corresponding codes and weights can be obtained by solving (3.2) with the learned base filters fixed.

3.3 Online Learning Algorithm for (3.2)

As in Section 2.2, we propose an online algorithm for better scalability. At the tth iteration, consider the objective defined on the samples received so far.


Let the codes be zero-padded to the sample dimension. Note that the number of convolutions can be reduced from K to R by rewriting the summation above as Σ_r b_r ∗ (Σ_k w_rk z_k), i.e., mixing the codes first and then convolving with the R base filters. The following Proposition rewrites (3.3) with the convolutions performed in the frequency domain.

Proposition 3.

Problem (3.3) can be rewritten in the frequency domain, with the base filters, codes and samples replaced by their Fourier-transformed counterparts. The spatial-domain base filters can then be recovered by applying the inverse Fourier transform and cropping the zero-padded dimensions.

3.3.1 Obtaining the Base Filters

From (3), the base filters can be obtained by solving a subproblem of the same form as the dictionary update in Section 2, with an auxiliary variable introduced for the ADMM splitting. Hence, analogous to Proposition 1, the base filters can be obtained from compact history statistics, which can be updated incrementally as in (14) and (15). Problem (3.3.1) can then be solved using ADMM.

space code update time filter update time
OCSC (Wang et al., 2018)
OCDL-Degraux (Degraux et al., 2017)
OCDL-Liu (Liu et al., 2017)
CCSC (Choudhury et al., 2017)
Table 1: Comparison of the proposed SCSC algorithm with other scalable CSC algorithms in terms of per-iteration cost. For CCSC, the cost is measured per machine in the distributed system. Typically, R ≪ K.

3.3.2 Obtaining the Codes and Weights

With the arrival of a new sample, we fix the base filters to those learned at the last iteration, and obtain the corresponding codes and weights from (3) by solving (16), where an indicator function on C enforces the weight constraint (i.e., it equals 0 if w ∈ C and ∞ otherwise).

As in the CSC literature, ADMM can also be used to solve (16). However, while CSC's code update subproblem in (2) is convex, problem (16) is nonconvex, and existing convergence results for ADMM (Wang et al., 2015) do not apply.

In this paper, we instead use the nonconvex inexact accelerated proximal gradient (niAPG) algorithm (Yao et al., 2017), a recent proximal algorithm for nonconvex problems. As the regularizers on the codes and weights in (16) are independent, the proximal step w.r.t. the two blocks can be performed separately: soft-thresholding for the ℓ1-regularized codes, and projection onto C for the weights (Parikh & Boyd, 2014). As shown in (Parikh & Boyd, 2014), these individual proximal steps can be computed easily for either choice of C.
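The two block-wise proximal steps can be sketched as follows. The projection is shown for a unit ℓ2 ball, which is an illustrative assumption; the paper's constraint set is one of the balls of Proposition 2, and the ℓ1-ball projection would be slightly more involved:

```python
import numpy as np

# Block-separable proximal steps: the codes take an l1 soft-threshold,
# the weights a Euclidean projection onto the constraint set.
def prox_l1(z, tau):
    """Soft-threshold: the proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def project_l2_ball(w, radius=1.0):
    """Projection onto {w : ||w||_2 <= radius} (prox of its indicator)."""
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

z = np.array([2.0, -0.1, 0.6])
w = np.array([3.0, 4.0])
z_new = prox_l1(z, 0.5)                 # applied to the code block
w_new = project_l2_ball(w)              # applied to the weight block
print(z_new, w_new)
```

Because the two regularizers touch disjoint variables, applying the two operators independently is exactly the joint proximal step, which is what lets niAPG handle (16) one gradient-plus-prox iteration at a time.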

3.3.3 Complete Algorithm

The whole procedure, which will be called “Sample-dependent Convolutional Sparse Coding” (SCSC), is shown in Algorithm 1. Its space complexity is dominated by the history statistics and is independent of the sample size. Its per-iteration time complexity includes one term due to gradient computation and another due to the FFT/inverse FFT. Table 1 compares its complexities with those of the other online and distributed CSC algorithms. As can be seen, SCSC has much lower time and space complexities since R ≪ K.

1:  Initialize , , , ;
2:  for  do
3:     draw from ;
4:     ;
5:     obtain using niAPG;
6:     for  do
7:        ;
8:     end for
9:     update using (14);
10:     update using (15);
11:     update by (3.3.1) using ADMM;
12:  end for
13:  for  do
14:     ;
15:  end for
16:  return the spatial-domain base filters.
Algorithm 1 Sample-dependent CSC (SCSC).

4 Experiments

Experiments are performed on a number of data sets (Table 2). Fruit and City are two small image data sets that have been commonly used in the CSC literature (Zeiler et al., 2010; Bristow et al., 2013; Heide et al., 2015; Papyan et al., 2017). We use the default training and testing splits provided in (Bristow et al., 2013). The images are pre-processed as in (Zeiler et al., 2010; Heide et al., 2015; Wang et al., 2018), which includes conversion to grayscale, feature standardization, local contrast normalization and edge tapering. In some experiments, we also use two larger data sets, CIFAR-10 (Krizhevsky & Hinton, 2009) and Flower (Nilsback & Zisserman, 2008). Following (Heide et al., 2015; Choudhury et al., 2017; Papyan et al., 2017; Wang et al., 2018), we set the filter size as in prior work, and the regularization parameter β in (1) to 1.

          size      #training  #testing
Fruit     100×100   10         4
City      100×100   10         4
CIFAR-10  32×32     50,000     10,000
Flower    500×500   2,040      6,149
Table 2: Summary of the image data sets used.

To evaluate the efficacy of the learned dictionary, we mainly consider the task of image reconstruction, as in (Aharon et al., 2006; Heide et al., 2015; Sironi et al., 2015). The reconstructed image quality is evaluated by the testing peak signal-to-noise ratio (PSNR) (Papyan et al., 2017), averaged over the reconstructions of all test samples. The experiment is repeated five times with different dictionary initializations.
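The metric can be sketched in a few lines. The peak value of 1.0 below is an assumption for normalized images (use 255.0 for 8-bit data), and the tiny uniform error is just a worked example:

```python
import numpy as np

# Testing PSNR between an image x and its reconstruction x_hat,
# for images normalized to peak value 1.0 (an assumption here).
def psnr(x, x_hat, peak=1.0):
    mse = np.mean((x - x_hat) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

x = np.zeros((8, 8))
x_hat = x + 0.01                        # uniform error of 0.01
print(round(psnr(x, x_hat), 1))         # mse = 1e-4  ->  40.0 dB
```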

4.1 Choice of the Weight Constraint C

First, we study the choice of the constraint set C in Proposition 2. We compare SCSC-L1, which uses the ℓ1-ball constraint, with SCSC-L2, which uses the ℓ2-ball constraint. Experiments are performed on Fruit and City. As in (Heide et al., 2015; Papyan et al., 2017; Wang et al., 2018), the number of filters K is set to 100. Recalling the space complexity results in Table 1, we define the compression ratio (CR) of SCSC relative to OCSC (using the same K) as the ratio of their space requirements. We vary R over a range of values, with larger R corresponding to smaller CR.

Results are shown in Figure 1. As can be seen, SCSC-L1 is clearly inferior. Figure 2(a) shows the weight matrix obtained by SCSC-L1 on a test sample from City (results on the other data sets are similar). Most of its entries are zero because of the sparsity induced by the ℓ1-norm constraint. The expressive power is severely limited, as typically only one base filter is used to approximate each original filter. On the other hand, the weight matrix learned by SCSC-L2 is dense and has many more nonzero entries (Figure 2(b)). In the sequel, we will focus only on SCSC-L2, which will simply be denoted SCSC.

(a) Fruit.
(b) City.
Figure 1: Testing PSNRs of SCSC-L1 and SCSC-L2 at different R's on the Fruit and City data sets.
(a) SCSC-L1.
(b) SCSC-L2.
Figure 2: Weight matrices obtained on a test sample from City. Each column corresponds to an original filter.

4.2 Sample-Dependent Dictionary

In this experiment, we compare SCSC with the following algorithms that use sample-independent dictionaries: (i) SCSC (shared): a variant of SCSC in which all the weight matrices W_i in (5) are the same, optimized by alternating minimization; (ii) separable filters learned by tensor decomposition (SEP-TD) (Sironi et al., 2015), which post-processes the (shared) dictionary learned by OCSC as reviewed in Section 2.1; and (iii) OCSC (Wang et al., 2018), the state-of-the-art online CSC algorithm.

Results are shown in Figure 3. As can be seen, SCSC always outperforms SCSC (shared) and SEP-TD, and outperforms OCSC once R is sufficiently large. This demonstrates the advantage of using a sample-dependent dictionary.

(a) Fruit.
(b) City.
Figure 3: Testing PSNR vs. R for CSC algorithms using shared and sample-dependent dictionaries.

Next, we compare against OCSC with fine-tuned filters, which are also sample-dependent. Specifically, given a test sample, we first obtain its codes using the learned dictionary, and then fine-tune the dictionary with the newly computed codes. As in (Donahue et al., 2014), this is repeated for a few iterations (we stop after five iterations in the experiments). We set OCSC's number of filters so that the two methods take the same space (Table 1); the K used in SCSC is still 100. Results are shown in Figure 4. As can be seen, though fine-tuning improves the performance of OCSC slightly, this approach of generating sample-dependent filters is still much worse than SCSC.

(a) Fruit.
(b) City.
Figure 4: Comparison of SCSC and OCSC with fine-tuning.

4.3 Learning with More Filters

Recall that SCSC allows the use of more filters (i.e., a larger K) because of its lower time and space complexities. In this section, we demonstrate that this can lead to better performance. We compare SCSC with the two most recent batch and online CSC methods, namely, slice-based CSC (SBCSC) (Papyan et al., 2017) and OCSC. For SCSC, R is set separately for the smaller (Fruit, City) and larger (CIFAR-10, Flower) data sets.

Figure 5 shows the testing PSNRs at different K's. As can be seen, a larger K consistently leads to better performance for all methods. SCSC allows the use of a larger K than the other methods because of its much smaller memory footprint.

(a) Fruit.
(b) City.
(c) CIFAR-10.
(d) Flower.
Figure 5: Effect of K on the testing PSNR. Note that SBCSC cannot be run on the large CIFAR-10 and Flower data sets. OCSC can only be run up to a limited K on Flower.

4.4 Comparison with the State-of-the-Art

First, we perform experiments on the two smaller data sets, Fruit and City, with K = 100 as before. SCSC is compared with the batch CSC algorithms, including (i) the deconvolution network (DeconvNet) (Zeiler et al., 2010), (ii) fast CSC (FCSC) (Bristow et al., 2013), (iii) fast and flexible CSC (FFCSC) (Heide et al., 2015), (iv) convolutional basis pursuit denoising (CBPDN) (Wohlberg, 2016), (v) the CONSENSUS algorithm (Šorel & Šroubek, 2016), and (vi) slice-based CSC (SBCSC) (Papyan et al., 2017). We also compare with the online CSC algorithms, including (vii) OCSC (Wang et al., 2018), (viii) OCDL-Degraux (Degraux et al., 2017), and (ix) OCDL-Liu (Liu et al., 2017).

Figure 6 shows the convergence of the testing PSNR with clock time. As also demonstrated in (Degraux et al., 2017; Liu et al., 2017; Wang et al., 2018), the online CSC methods converge faster and attain better PSNR than the batch CSC methods. Among the online methods, SCSC has PSNR comparable to OCSC's, but is faster and requires much less storage.

(a) Fruit.
(b) City.
Figure 6: Testing PSNR on the small data sets.

Next, we perform experiments on the two large data sets, CIFAR-10 and Flower. All the batch CSC algorithms and two of the online CSC algorithms, OCDL-Degraux and OCDL-Liu, cannot handle such large data sets. Hence, we only compare SCSC with OCSC. On CIFAR-10, the corresponding CR for SCSC is 100. On Flower, K is still 300 for SCSC; however, OCSC can only use a smaller K because of its much larger memory footprint. Figure 7 shows the convergence of the testing PSNR. In both cases, SCSC significantly outperforms OCSC.

(a) CIFAR-10.
(b) Flower.
Figure 7: Testing PSNR on the large data sets.

4.5 Higher-Dimensional Data

In this section, we perform experiments on data sets with dimensionalities larger than two. To alleviate the large memory problem, Choudhury et al. (2017) proposed the use of distributed algorithms. Here, we show that SCSC can handle these data sets effectively on a single machine.

Experiments are performed on three data sets (Table 3) from (Choudhury et al., 2017). The Video data set contains image subsequences recorded in an airport (Li et al., 2004). Each video has length 7, and each image frame is of size 100×100. The Multispectral data set contains 60×60 patches from multispectral images (covering 31 wavelengths) of real-world objects and materials (Yasuma et al., 2010). The Light field data set contains 60×60 patches of light field images of objects and scenes (Kalantari et al., 2016); for each pixel, the light rays come from 8×8 different directions. Following (Choudhury et al., 2017), we set the filter sizes accordingly for Video, Multispectral and Light field.

               size        #training  #testing
Video          100×100×7   573        143
Multispectral  60×60×31    2,200      1,000
Light field    60×60×8×8   7,700      385
Table 3: Summary of the higher-dimensional data sets used.

We compare SCSC with the OCSC and consensus CSC (CCSC) (Choudhury et al., 2017) algorithms, using the same number of filters. For a fair comparison, only one machine is used for all methods. We do not compare with the batch methods or the two online methods OCDL-Degraux and OCDL-Liu, as they are not scalable (as already shown in Section 4.4).

Because of the small memory footprint of SCSC, we run it on a GTX 1080 Ti GPU in this experiment. OCSC is also run on the GPU for Video, but can only run on CPU for Multispectral and Light field. CCSC, which needs to access all the samples and codes during processing, can only be run on CPU. (For Video, the memory used (in GB) by CCSC, OCSC, and the two SCSC settings in Table 4 is 28.73, 7.58, 2.66 and 2.87, respectively. On Multispectral, the figures are 28.26, 11.09, 0.73 and 0.76; on Light field, they are 29.79, 15.94, 7.26 and 8.88.)

Results are shown in Table 4. Note that SCSC is the only method that can handle the whole of Video, Multispectral and Light field data sets on a single machine. In comparison, CCSC can only handle a maximum of 30 Video samples, 40 Multispectral samples, and 35 Light field samples. OCSC can handle the whole of Video and Multispectral, but cannot converge in 2 days when the whole Light field data set is used. Again, SCSC outperforms OCSC and CCSC.

       Video                    Multispectral            Light field
       PSNR        time         PSNR        time         PSNR        time
CCSC   20.43±0.11  11.91±0.07   17.67±0.14  27.88±0.07   13.70±0.09  8.99±0.11
OCSC   33.17±0.01  1.41±0.04*   30.12±0.02  31.19±0.02   -           -
SCSC   35.30±0.02  0.73±0.02*   30.51±0.02  1.21±0.03*   29.30±0.03  11.12±0.07*
       38.02±0.03  0.81±0.01*   31.71±0.01  1.40±0.01*   31.70±0.02  17.97±0.05*
Table 4: Results on the higher-dimensional data sets. PSNR is in dB and clock time is in hours. Timing results obtained on GPU are marked with asterisks.

As for speed, SCSC is the fastest. However, note that this is for reference only as SCSC is run on GPU while the others (except for OCSC on Video) are run on CPU. Nevertheless, this still demonstrates an important advantage of SCSC, namely that its small memory footprint can benefit from the use of GPU, while the others cannot.

             denoising                            inpainting
Wind Mill    14.88±0.03  16.20±0.03  17.27±0.02   29.76±0.13  29.40±0.14  29.76±0.08
Sea Rock     14.80±0.02  16.01±0.02  17.10±0.02   24.92±0.06  25.04±0.04  25.17±0.04
Parthenon    14.97±0.02  16.33±0.01  17.44±0.03   27.06±0.06  26.79±0.04  28.04±0.04
Rolls Royce  15.23±0.01  16.27±0.01  17.63±0.02   24.96±0.13  24.66±0.10  25.06±0.05
Fence        15.21±0.04  16.53±0.02  17.56±0.03   26.81±0.05  26.71±0.08  26.85±0.05
Car          16.90±0.01  18.05±0.03  20.06±0.05   29.60±0.07  29.40±0.09  30.44±0.04
Kid          14.90±0.01  16.21±0.02  17.22±0.03   25.36±0.01  25.42±0.07  25.67±0.07
Tower        14.89±0.02  16.19±0.01  18.36±0.05   26.64±0.04  26.48±0.06  26.96±0.03
Fish         16.40±0.01  17.40±0.01  18.61±0.02   27.49±0.03  26.98±0.08  27.23±0.07
Food         16.38±0.01  17.68±0.02  18.56±0.03   29.96±0.05  29.62±0.08  31.49±0.02
Table 5: Testing PSNR (dB) on image denoising and inpainting.

4.6 Image Denoising and Inpainting

In the previous experiments, the superiority of the learned dictionary is demonstrated by reconstruction of clean images. In this section, we further examine the learned dictionary on two applications: image denoising and inpainting. Ten test images provided by (Choudhury et al., 2017) are used. In denoising, we add Gaussian noise with zero mean and variance 0.01 to the test images (the average input PSNR is 10 dB). In inpainting, we randomly set 50% of the pixels to 0 (the average input PSNR is 9.12 dB). Following (Heide et al., 2015; Choudhury et al., 2017; Papyan et al., 2017), we use a binary weight matrix to mask out the positions of the missing pixels. We use the filters learned from Fruit in Section 4.4. SCSC is compared with (batch) SBCSC and (online) OCSC.

Results are shown in Table 5. As can be seen, the PSNRs obtained by SCSC are consistently higher than those by the other methods. This shows that the dictionary, which yields high PSNR on image reconstruction, also leads to better performance in other image processing applications.

4.7 Solving (16): niAPG vs ADMM

Finally, we compare the performance of ADMM and niAPG on solving subproblem (16). We use a training sample from City. The experiment is repeated five times with different initializations. Figure 8 shows the convergence of the objective in (16) with time. As can be seen, niAPG converges quickly, while ADMM fails to converge. Figure 9 shows the violation of the ADMM constraints against the number of iterations. As can be seen, the violation does not go to zero, which indicates that ADMM does not converge.

Figure 8: Convergence of niAPG and ADMM on solving (16).
Figure 9: Constraint violation in ADMM.

5 Conclusion

In this paper, we proposed a novel CSC extension, in which each sample has its own sample-dependent dictionary constructed from a small set of shared base filters. Using online learning, the model can be efficiently updated with low time and space complexities. Extensive experiments on a variety of data sets including large image data sets and higher-dimensional data sets all demonstrate its efficiency and scalability.


Acknowledgments

The second author especially thanks Weiwei Tu and Yuqiang Chen from 4Paradigm Inc. This research was supported in part by the Research Grants Council, Hong Kong, under Grant 614513, and by the University of Macau Grant SRG2015-00050-FST.


  • Aharon et al. (2006) Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
  • Andilla & Hamprecht (2014) Andilla, F. and Hamprecht, F. Sparse space-time deconvolution for calcium image analysis. In Advances in Neural Information Processing Systems, pp. 64–72, 2014.
  • Bertinetto et al. (2016) Bertinetto, L., Henriques, J. F., Valmadre, J., Torr, P., and Vedaldi, A. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp. 523–531, 2016.
  • Bertsekas & Tsitsiklis (1997) Bertsekas, D.P. and Tsitsiklis, J.N. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
  • Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Bristow et al. (2013) Bristow, H., Eriksson, A., and Lucey, S. Fast convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 391–398, 2013.
  • Chang et al. (2017) Chang, H., Han, J., Zhong, C., Snijders, A., and Mao, J. Unsupervised transfer learning via multi-scale convolutional sparse coding for biomedical applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • Choudhury et al. (2017) Choudhury, B., Swanson, R., Heide, F., Wetzstein, G., and Heidrich, W. Consensus convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4280–4288, 2017.
  • Cogliati et al. (2016) Cogliati, A., Duan, Z., and Wohlberg, B. Context-dependent piano music transcription with convolutional sparse coding. IEEE/ACM Transactions on Audio Speech and Language Processing, 24(12):2218–2230, 2016.
  • Degraux et al. (2017) Degraux, K., Kamilov, U. S., Boufounos, P. T., and Liu, D. Online convolutional dictionary learning for multimodal imaging. In IEEE International Conference on Image Processing, pp. 1617–1621, 2017.
  • Donahue et al. (2014) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655, 2014.
  • Gu et al. (2015) Gu, S., Zuo, W., Xie, Q., Meng, D., Feng, X., and Zhang, L. Convolutional sparse coding for image super-resolution. In International Conference on Computer Vision, pp. 1823–1831, 2015.
  • Heide et al. (2015) Heide, F., Heidrich, W., and Wetzstein, G. Fast and flexible convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5135–5143, 2015.
  • Jas et al. (2017) Jas, M., La Tour, T. D., Simsekli, U., and Gramfort, A. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems, pp. 1099–1108, 2017.
  • Jia et al. (2016) Jia, X., De Brabandere, B., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675, 2016.
  • Kalantari et al. (2016) Kalantari, N. K., Wang, T., and Ramamoorthi, R. Learning-based view synthesis for light field cameras. ACM Transactions on Graphics, 35(6):193, 2016.
  • Kang et al. (2017) Kang, D., Dhar, D., and Chan, A. Incorporating side information by adaptive convolution. In Advances in Neural Information Processing Systems, pp. 3870–3880, 2017.
  • Kavukcuoglu et al. (2010) Kavukcuoglu, K., Sermanet, P., Boureau, Y., Gregor, K., Mathieu, M., and LeCun, Y. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems, pp. 1090–1098, 2010.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Li et al. (2004) Li, L., Huang, W., Gu, I. Y., and Tian, Q. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.
  • Liu et al. (2017) Liu, J., Garcia-Cardona, C., Wohlberg, B., and Yin, W. Online convolutional dictionary learning. In IEEE International Conference on Image Processing, pp. 1707–1711, 2017.
  • Mallat (1999) Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 1999.
  • Nilsback & Zisserman (2008) Nilsback, M. and Zisserman, A. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, 2008.
  • Pachitariu et al. (2013) Pachitariu, M., Packer, A., Pettit, N., Dalgleish, H., Hausser, M., and Sahani, M. Extracting regions of interest from biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems, pp. 1745–1753, 2013.
  • Papyan et al. (2017) Papyan, V., Romano, Y., Sulam, J., and Elad, M. Convolutional dictionary learning via local processing. In International Conference on Computer Vision, pp. 5296–5304, 2017.
  • Parikh & Boyd (2014) Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
  • Peter et al. (2017) Peter, S., Kirschbaum, E., Both, M., Campbell, L., Harvey, B., Heins, C., Durstewitz, D., Diego, F., and Hamprecht, F. A. Sparse convolutional coding for neuronal assembly detection. In Advances in Neural Information Processing Systems, pp. 3678–3688, 2017.
  • Rigamonti et al. (2013) Rigamonti, R., Sironi, A., Lepetit, V., and Fua, P. Learning separable filters. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2754–2761, 2013.
  • Sironi et al. (2015) Sironi, A., Tekin, B., Rigamonti, R., Lepetit, V., and Fua, P. Learning separable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):94–106, 2015.
  • Šorel & Šroubek (2016) Šorel, M. and Šroubek, F. Fast convolutional sparse coding using matrix inversion lemma. Digital Signal Processing, 55:44–51, 2016.
  • Wang et al. (2015) Wang, Y., Yin, W., and Zeng, J. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015.
  • Wang et al. (2018) Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Scalable online convolutional sparse coding. IEEE Transactions on Image Processing, 2018.
  • Wohlberg (2016) Wohlberg, B. Efficient algorithms for convolutional sparse representations. IEEE Transactions on Image Processing, 25(1):301–315, 2016.
  • Yao et al. (2017) Yao, Q., Kwok, J., Gao, F., Chen, W., and Liu, T.-Y. Efficient inexact proximal gradient algorithm for nonconvex problems. In International Joint Conference on Artificial Intelligence, pp. 3308–3314, 2017.
  • Yasuma et al. (2010) Yasuma, F., Mitsunaga, T., Iso, D., and Nayar, S. K. Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing, 19(9):2241–2253, 2010.
  • Zeiler et al. (2010) Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. Deconvolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2528–2535, 2010.

Appendix A Proofs

A.1 Proposition 2

Since each base filter b_r satisfies ∥b_r∥₂ ≤ 1 (17), the triangle inequality gives

∥Bw∥₂ ≤ Σ_r |w_r| ∥b_r∥₂ ≤ ∥w∥₁.   (18)
Therefore, consider the two cases:

  • Case (i): using (18), if ∥w∥₁ ≤ 1, then ∥Bw∥₂ ≤ ∥w∥₁ ≤ 1, which means the filter norm constraint holds.

  • Case (ii): by the Cauchy-Schwarz inequality, ∥w∥₁ ≤ √R ∥w∥₂; combining this with (18) (which follows from (17)) gives ∥Bw∥₂ ≤ √R ∥w∥₂. Therefore, if ∥w∥₂ ≤ 1/√R, the constraint ∥Bw∥₂ ≤ 1 holds.

A.2 Proposition 3

With the zero-padded codes defined in Section 3.3, (3) is equivalent to (3.3) because of the following chain of equalities: the first equality (20) follows from rearranging the summation so that the codes are mixed before convolution; the second (21) then follows from the convolution theorem (Mallat, 1999), applied after zero-padding the vectors to the sample dimension, together with Parseval's theorem (Mallat, 1999).
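The two identities invoked here can be verified numerically with a small sketch (arbitrary sizes; NumPy's unnormalized FFT convention puts the 1/n factor in Parseval's identity):

```python
import numpy as np

# Convolution theorem: the FFT of a zero-padded linear convolution equals
# the product of the zero-padded FFTs.
rng = np.random.default_rng(6)
a, b = rng.standard_normal(8), rng.standard_normal(5)
n = len(a) + len(b) - 1
lhs = np.fft.fft(np.convolve(a, b), n)
rhs = np.fft.fft(a, n) * np.fft.fft(b, n)
print(np.allclose(lhs, rhs))            # True

# Parseval's theorem: energies match, up to 1/n for NumPy's convention.
x = rng.standard_normal(16)
X = np.fft.fft(x)
print(np.allclose(np.sum(x ** 2), np.sum(np.abs(X) ** 2) / len(x)))  # True
```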

As for the constraints, when a base filter is transformed to the frequency domain, it is zero-padded from its original dimension to the sample dimension. Thus, we use a cropping operator to remove the extra dimensions and recover the original support.