LSALSA: efficient sparse coding in single and multiple dictionary settings

02/13/2018 · by Benjamin Cowen, et al.

We propose an efficient sparse coding (SC) framework for obtaining sparse representations of data. The proposed framework is very general and applies both to the single dictionary setting, where each data point is represented as a sparse combination of the columns of one dictionary matrix, and to the multiple dictionary setting as given in morphological component analysis (MCA), where the goal is to separate the data into additive parts such that each part has a distinct sparse representation within an appropriately chosen corresponding dictionary. Both tasks are cast as ℓ_1-regularized optimization problems of minimizing quadratic reconstruction error. In an effort to accelerate traditional acquisition of sparse codes, we propose a deep learning architecture that constitutes a trainable time-unfolded version of the Split Augmented Lagrangian Shrinkage Algorithm (SALSA), a special case of the alternating direction method of multipliers (ADMM). We empirically validate both variants of the algorithm on image analysis tasks and demonstrate that at inference our networks achieve improvements over common baselines in terms of running time and the quality of estimated sparse codes on both classic SC and MCA problems. We finally demonstrate the visual advantage of our technique on the task of source separation.


1 Introduction

In the SC framework, we seek to efficiently represent data by using only a sparse combination of available basis vectors. We therefore assume that an $n$-dimensional data vector $x \in \mathbb{R}^n$ can be approximated as

$$x \approx D z, \qquad (1)$$

where $z \in \mathbb{R}^m$ is sparse and $D \in \mathbb{R}^{n \times m}$ is a dictionary, sometimes referred to as the synthesis matrix, whose columns are the basis vectors.

This paper focuses on the SC problem of decomposing a signal into morphologically distinct components. A typical assumption for this problem is that the data is a linear combination of $C$ source signals:

$$x = \sum_{i=1}^{C} x_i. \qquad (2)$$

The MCA framework (Starck et al., 2004) requires that each component $x_i$ admits a sparse representation $z_i$ within the corresponding dictionary $D_i$. The dictionaries $D_i$'s are distinct, i.e. each source-specific dictionary allows obtaining a sparse representation of the corresponding source signal and is highly inefficient in representing the other content in the mixture. This leads to a signal model that generalizes the one given by Equation 1 as

$$x \approx \sum_{i=1}^{C} D_i z_i. \qquad (3)$$

The bottleneck of SC techniques is that at inference a sparse code has to be computed for each data point or data patch (as in the case of high-resolution images), and this is typically done via iterative optimization. In the single-dictionary setting, ISTA (Daubechies et al., 2004) and FISTA (Beck & Teboulle, 2009) are classical algorithmic choices for this purpose. For the MCA problem, the standard choice is SALSA (Afonso et al., 2011), an instance of ADMM (Boyd et al., 2011). This process is prohibitively slow for high-throughput real-time applications.

The key contribution of this paper is an efficient and accurate deep learning architecture that is general enough to well-approximate optimal codes both for classic SC in a single-dictionary framework and for MCA-based signal separation. We call our deep learning approximator Learned SALSA (LSALSA). The proposed encoder is formulated as a time-unfolded version of the SALSA algorithm with a fixed number of iterations, where the depth of the deep learning model corresponds to the number of SALSA iterations. We train the deep model in a supervised fashion to predict optimal sparse codes for a given input; thus in practice we can use shallow, fixed-depth architectures that correspond to only a few iterations of the original SALSA and still achieve superior performance to that algorithm. Furthermore, SALSA comes with a built-in source separation mechanism and cross-iteration memory-sharing connections. These main algorithmic features of SALSA translate to a specific connectivity pattern in the corresponding deep learning architecture of LSALSA, which in consequence gives LSALSA an advantage over previous deep encoders, such as LISTA (Gregor & LeCun, 2010), in terms of applicability to a broader class of learning problems (LISTA is used only in the single dictionary setting) and performance. To the best of our knowledge, our approach is the first one to utilize an instance of ADMM unrolled into a deep learning architecture to address a source separation problem.

This paper is organized as follows: Section 2 provides a literature review, Section 3 formulates the SC problem in detail, and Section 4 shows how to derive predictive single-dictionary SC and multiple-dictionary MCA from their iterative counterparts and explains our approach (LSALSA). Finally, Section 5 shows experimental results for both the single dictionary setting and MCA, and Section 6 concludes the paper.

2 Related Work

Sparse code inference aims at computing sparse codes for given data and is most widely addressed via iterative schemes such as the aforementioned ISTA and FISTA. Predicting approximations of optimal codes can be done using deep feed-forward learning architectures based on truncated convex solvers. This family of approaches lies at the core of this paper. A notable approach in this family, known as LISTA (Gregor & LeCun, 2010), stems from earlier predictive sparse decomposition methods (Kavukcuoglu et al., 2010; Jarrett et al., 2009), which, however, obtained approximations of the sparse codes of insufficient quality. LISTA improves over these techniques and enhances ISTA by unfolding a fixed number of iterations to define a fixed-depth deep neural network that is trained with examples of input vectors paired with their corresponding optimal sparse codes obtained by conventional methods like ISTA or FISTA. LISTA was shown to provide high-quality approximations of optimal sparse codes at a fixed computational cost. The unrolling methodology has also been applied to algorithms solving SC with $\ell_0$-regularization (Wang et al., 2016) and to message passing schemes (Borgerding & Schniter, 2016). In other prior works, ISTA was recast as a recurrent neural network unit, giving rise to a variant of LSTM (Gers et al., 2003; Zhou et al., 2018). The above-mentioned algorithms are not well suited to the MCA problem as they have no algorithmic mechanism for handling multiple dictionaries. In other words, they would approach the MCA problem by casting it as a SC problem with access to a single dictionary that is a concatenation of source-specific dictionaries, e.g. $D = [D_1, D_2]$.

This paper considers a generalization of the single-dictionary SC problem to the MCA framework. The framework assumes that the data can be explained by multiple distinct dictionaries. MCA has been used successfully in a number of applications that include decomposing images into textures and cartoons for denoising and inpainting (Starck et al., 2005b, a; Elad et al., 2005; Peyré et al., 2007; Shoham & Elad, 2008; Peyré et al., 2010), detecting text in natural scene images (Liu et al., 2017), as well as other source separation problems such as separating non-stationary clutter from weather radar signals (Uysal et al., 2016), transients from sustained rhythmic components in EEG signals (Parekh et al., 2014), and stationary from dynamic components of MRI videos (Otazo et al., 2015). The MCA problem is traditionally solved via the SALSA algorithm, which constitutes a special case of ADMM.

There exist a few approaches in the literature utilizing ADMM unrolled into a deep learning architecture. One such computationally efficient framework (Sprechmann et al., 2013) was applied to learning task-specific (reconstruction or classification) sparse models via sparsity-promoting convolutional operators. Another unrolled version of ADMM (Yang et al., 2016) was demonstrated to improve the reconstruction accuracy and computational speed of the baseline ADMM algorithm for the problem of compressive sensing magnetic resonance imaging. A variety of papers followed up on this work for various image reconstruction tasks, such as the Learned Primal-dual Algorithm (Adler & Öktem, 2017). None of these approaches were applied to MCA or other source separation problems. An unrolled nonnegative matrix factorization (NMF) algorithm (Le Roux et al., 2015) was implemented as a deep network for the task of speech separation. In another work (Wisdom et al., 2017), the NMF-based speech separation task was solved with an ISTA-like unfolded network.

3 Problem Formulation

This paper focuses on the inference problem in SC. It is formulated as finding the optimal sparse code $z^*$ given an input vector $x \in \mathbb{R}^n$ and a dictionary matrix $D \in \mathbb{R}^{n \times m}$, whose columns are the normalized basis vectors. $z^*$ minimizes the $\ell_1$-regularized linear least squares cost function

$$z^* = \arg\min_{z} \; \frac{1}{2}\|x - D z\|_2^2 + \alpha \|z\|_1, \qquad (4)$$

where the scalar constant $\alpha$ balances sparsity with data fidelity. Thus $z^*$ is the optimal code for $x$ with respect to $D$. The dictionary matrix $D$ is usually learned by minimizing the loss function given below (Olshausen & Field, 1996)

$$\mathcal{L}(D) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{2}\left\|x^j - D z^{*,j}\right\|_2^2 \qquad (5)$$

with respect to $D$ using stochastic gradient descent (SGD), where $N$ is the size of the training data set, $x^j$ is the $j$-th training sample, and $z^{*,j}$ is the corresponding optimal sparse code. The optimal sparse codes in each iteration are obtained in this paper with FISTA.
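For concreteness, the following is a minimal NumPy sketch of how such optimal codes can be obtained with FISTA for the objective in Equation 4. The variable names (D, x, alpha, n_iter) and the toy data are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(D, x, alpha, n_iter=100):
    """Minimize 0.5 * ||x - D z||_2^2 + alpha * ||z||_1 over z (Equation 4)."""
    m = D.shape[1]
    L = np.linalg.norm(D, ord=2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(m)                              # current code estimate
    y = z.copy()                                 # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)                 # gradient of the smooth term at y
        z_next = soft_threshold(y - grad / L, alpha / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)   # momentum step
        z, t = z_next, t_next
    return z

# Toy usage: a random dictionary with unit-norm columns and a random input.
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 100))
D /= np.linalg.norm(D, axis=0, keepdims=True)
x = rng.standard_normal(100)
z_star = fista(D, x, alpha=0.5)
```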

In the MCA framework, a generalization of the cost function from Equation 4 is minimized to estimate the codes $z_1, \ldots, z_C$ from the model given in Equation 3. Thus one minimizes

$$\{z_1^*, \ldots, z_C^*\} = \arg\min_{z_1, \ldots, z_C} \; \frac{1}{2}\|x - D z\|_2^2 + \sum_{i=1}^{C} \alpha_i \|z_i\|_1, \qquad (6)$$

where

$$D := [D_1, D_2, \ldots, D_C], \qquad (7)$$
$$z := [z_1^{\top}, z_2^{\top}, \ldots, z_C^{\top}]^{\top}, \qquad (8)$$

where for $i = 1, \ldots, C$, $D_i \in \mathbb{R}^{n \times m_i}$, $z_i \in \mathbb{R}^{m_i}$, and the $\alpha_i$'s are the coefficients controlling the sparsity penalties. We denote the concatenated optimal codes with $z^*$.

In the classic MCA works, the dictionaries $D_i$'s are selected to be well-known filter banks with explicitly designed sparsification properties. Such hand-designed transforms have good generalization abilities and help to prevent overfitting. Also, MCA algorithms often require solving large systems of equations involving $D$ or $D^{\top} D$. An appropriate constraining of $D$ leads to a banded system of equations and in consequence reduces the computational complexity of these algorithms, e.g. (Parekh et al., 2014). More recent MCA works use learned dictionaries for image analysis (Shoham & Elad, 2008; Peyré et al., 2007). Some extensions of MCA consider learning the dictionaries $D_i$'s and the sparse codes jointly (Peyré et al., 2007, 2010). In our paper, we learn the dictionaries independently. In particular, for each $i$ we minimize

$$\mathcal{L}(D_i) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{2}\left\|x_i^j - D_i z_i^{*,j}\right\|_2^2 \qquad (9)$$

with respect to $D_i$ using SGD, where $x_i^j$ is the $i$-th mixture component of the $j$-th training sample and $z_i^{*,j}$ is the corresponding optimal sparse code. The optimal sparse codes in each iteration are obtained with FISTA.
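As a small illustration of the two-source case, the NumPy sketch below evaluates the MCA objective of Equation 6 and builds the concatenated dictionary and code of Equations 7-8. All names and sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def mca_objective(x, D1, D2, z1, z2, alpha1, alpha2):
    """0.5 * ||x - D1 z1 - D2 z2||_2^2 + alpha1 * ||z1||_1 + alpha2 * ||z2||_1."""
    residual = x - D1 @ z1 - D2 @ z2
    return 0.5 * residual @ residual + alpha1 * np.abs(z1).sum() + alpha2 * np.abs(z2).sum()

rng = np.random.default_rng(0)
n, m1, m2 = 64, 64, 64
D1 = rng.standard_normal((n, m1)); D1 /= np.linalg.norm(D1, axis=0)
D2 = rng.standard_normal((n, m2)); D2 /= np.linalg.norm(D2, axis=0)

# Concatenated dictionary D = [D1, D2] and concatenated code z = [z1; z2]
# (Equations 7-8); a single-dictionary solver applied to D treats the mixture
# as one SC problem, which is how FISTA/LISTA are run in Section 5.
D = np.concatenate([D1, D2], axis=1)
z = rng.standard_normal(m1 + m2)
z1, z2 = z[:m1], z[m1:]
x = D @ z
print(mca_objective(x, D1, D2, z1, z2, alpha1=0.1, alpha2=0.1))
```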

4 From iterative to predictive SC and MCA

4.1 Split Augmented Lagrangian Shrinkage Algorithm (SALSA)

The objective functions used in SC (Equation 4) and MCA (Equation 6) are each convex with respect to the codes, allowing a wide variety of optimization algorithms with well-studied convergence results to be applied (Bauschke & Combettes, 2011). Here we describe SALSA, a popular algorithm that is general enough to solve both problems and that constitutes an instance of ADMM.

ADMM addresses an optimization problem of the form

$$\min_{z} \; f_1(z) + f_2(z) \qquad (10)$$

by re-casting it as the equivalent, constrained problem

$$\min_{z, u} \; f_1(z) + f_2(u) \quad \text{subject to} \quad u = z. \qquad (11)$$

ADMM then minimizes the corresponding augmented Lagrangian,

$$\mathcal{L}_{\mu}(z, u, d) = f_1(z) + f_2(u) + \frac{\mu}{2}\|z - u - d\|_2^2, \qquad (12)$$

where $d$ corresponds to the (scaled) Lagrangian multipliers, one variable at a time until convergence.
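Written out, these alternating block updates take the following form (shown in the scaled form used above; the order of the two primal minimizations is a free choice, and we list the ordering matching Algorithms 1 and 2 below):

$$u^{(k+1)} = \arg\min_{u}\; f_2(u) + \tfrac{\mu}{2}\big\|z^{(k)} - u - d^{(k)}\big\|_2^2,$$
$$z^{(k+1)} = \arg\min_{z}\; f_1(z) + \tfrac{\mu}{2}\big\|z - u^{(k+1)} - d^{(k)}\big\|_2^2,$$
$$d^{(k+1)} = d^{(k)} - \big(z^{(k+1)} - u^{(k+1)}\big).$$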

Figure 1: The deep learning architecture of LSALSA for a fixed number of unfolded iterations.

SALSA addresses an instance of the general optimization problem from Equation 11, where $f_1$ is a least-squares term and $f_2$ is such that its proximity operator can be computed exactly (Afonso et al., 2010). The algorithm falls under a sub-category of Douglas-Rachford Splitting methods, for which convergence has been proved (Eckstein & Bertsekas, 1992).

  Input: x, D, α, μ
  Initialize: z^(0) = 0 and d^(0) = 0
  repeat
     u^(k+1) = soft_{α/μ}(z^(k) - d^(k))
     Solve for z^(k+1): (D^T D + μI) z^(k+1) = D^T x + μ(u^(k+1) + d^(k))
     d^(k+1) = d^(k) - (z^(k+1) - u^(k+1))
  until change in z below a threshold
Algorithm 1 SALSA (Single Dictionary)

SALSA is given in Algorithms 1 and 2 for the single-dictionary case and the MCA case involving two dictionaries (in this paper we consider the MCA framework with two dictionaries; extensions to more than two dictionaries are straightforward), respectively, where

$$\mathrm{soft}_{\tau}(v)_j := \mathrm{sign}(v_j)\,\max(|v_j| - \tau,\, 0) \qquad (13)$$

is the soft-thresholding function with threshold $\tau$. Note that in Algorithm 2, the $u$ and $d$ updates can be performed with element-wise operations. The $z$-update, however, is non-separable with respect to the components for general dictionaries $D_1$, $D_2$. We call this the splitting step.

  Input: x, D_1, D_2, α_1, α_2, μ
  Initialize: z^(0) = 0 and d^(0) = 0
  repeat
     u_i^(k+1) = soft_{α_i/μ}(z_i^(k) - d_i^(k)),  i = 1, 2
     Solve for z^(k+1): (D^T D + μI) z^(k+1) = D^T x + μ(u^(k+1) + d^(k)),  where D = [D_1, D_2]
     d^(k+1) = d^(k) - (z^(k+1) - u^(k+1))
  until change in z below a threshold
Algorithm 2 SALSA (Two Dictionaries)
Figure 2: A block diagram of SALSA. The one-time initialization is represented by a gate on the left.

As mentioned in Section 3, the $z$-update is often simplified to element-wise operations by constraining the matrix $D$ to have special properties. For example, requiring $D D^{\top} = I$ reduces the $z$-update step to element-wise division (after applying the matrix inverse lemma). In (Yang et al., 2016), the corresponding matrix is set to be the partial Fourier transform, reducing the system of equations of the $z$-update to a series of convolutions and element-wise operations. In our work, as is typical in the case of SC, $D$ is a learned dictionary without any imposed structure.

Note that solving for $z$ in Algorithms 1 and 2 requires the inversion of the matrix $(D^{\top} D + \mu I)$. This however needs to be done just once, at the very beginning, as this matrix remains fixed during the entire run of SALSA. We abbreviate the inverted matrix as

$$S := (D^{\top} D + \mu I)^{-1}. \qquad (14)$$

We call this matrix a splitting operator. The recursive block diagram of SALSA is depicted in Figure 2.
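The NumPy sketch below spells out the single-dictionary SALSA iterations of Algorithm 1, with the splitting operator of Equation 14 precomputed once. The parameter values, stopping rule, and toy data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def salsa(D, x, alpha, mu, n_iter=100, tol=1e-6):
    """SALSA/ADMM iterations for 0.5 * ||x - D z||_2^2 + alpha * ||z||_1."""
    m = D.shape[1]
    S = np.linalg.inv(D.T @ D + mu * np.eye(m))   # splitting operator, computed once
    Dtx = D.T @ x
    z = np.zeros(m)
    d = np.zeros(m)
    for _ in range(n_iter):
        u = soft_threshold(z - d, alpha / mu)      # proximal (non-linearity) step
        z_new = S @ (Dtx + mu * (u + d))           # splitting step (linear solve)
        d = d - (z_new - u)                        # Lagrange multiplier update
        if np.linalg.norm(z_new - z) < tol:        # stop when z stops changing
            z = z_new
            break
        z = z_new
    return z

# Toy usage with a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 200))
D /= np.linalg.norm(D, axis=0, keepdims=True)
x = rng.standard_normal(100)
z_hat = salsa(D, x, alpha=0.5, mu=1.0)
```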

4.2 Learned SALSA (LSALSA)

We next describe our proposed deep encoder architecture that we refer to as Learned SALSA (LSALSA).

Consider truncating the SALSA algorithm to a fixed number of iterations $T$ and then time-unfolding it into a deep neural network architecture that matches the truncated SALSA's output exactly. The obtained architecture is illustrated in Figure 1.

We initialize the parameters of the deep model with

$$W := D^{\top}, \qquad (15)$$
$$S := (D^{\top} D + \mu I)^{-1}, \qquad (16)$$

where $D = [D_1, D_2]$ in the MCA case, to achieve an exact correspondence with SALSA. All splitting operators share parameters across the network.

LSALSA can be trained with standard backpropagation. Let $\bar{z}(x; \Theta)$ denote the output of the LSALSA architecture, where $\Theta$ denotes its trainable parameters. The cost function used for training the model is defined as

$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{2}\left\|z^{*,j} - \bar{z}(x^j; \Theta)\right\|_2^2. \qquad (17)$$

The pseudocodes for computing a forward pass through LSALSA are given in Algorithm 3 and Algorithm 4 for the single-dictionary and MCA cases, respectively.

  Input: x
  Initialize: z^(0) = 0, d^(0) = 0
  for t = 1 to T do
     u^(t) = soft_{α/μ}(z^(t-1) - d^(t-1))
     z^(t) = S (W x + μ(u^(t) + d^(t-1)))
     d^(t) = d^(t-1) - (z^(t) - u^(t))
  end for
  Output: z^(T)
Algorithm 3 LSALSA (Single Dictionary): Forward Pass
  Input: x
  Initialize: z^(0) = 0, d^(0) = 0
  for t = 1 to T do
     u_i^(t) = soft_{α_i/μ}(z_i^(t-1) - d_i^(t-1)),  i = 1, 2
     z^(t) = S (W x + μ(u^(t) + d^(t-1)))
     d^(t) = d^(t-1) - (z^(t) - u^(t))
  end for
  Output: z^(T) = [z_1^(T); z_2^(T)]
Algorithm 4 LSALSA (Two Dictionaries): Forward Pass
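To make the unfolding concrete, here is a hedged PyTorch sketch of a single-dictionary LSALSA encoder following Algorithm 3: the filter $W$ and the splitting operator $S$ are taken as the trainable parameters, shared across the $T$ layers and initialized from the dictionary (Equations 15-16), so that an untrained network reproduces $T$ iterations of SALSA. The class name, the exact choice of trainable parameters, and the training snippet are illustrative assumptions rather than the implementation used in the experiments.

```python
import torch
import torch.nn as nn

class LSALSA(nn.Module):
    def __init__(self, D, alpha, mu, T):
        super().__init__()
        D = torch.as_tensor(D, dtype=torch.float32)
        m = D.shape[1]
        self.T, self.alpha, self.mu = T, alpha, mu
        # Equation 15: filter initialized to D^T.
        self.W = nn.Parameter(D.t().clone())
        # Equation 16: splitting operator initialized to (D^T D + mu I)^{-1}.
        self.S = nn.Parameter(torch.linalg.inv(D.t() @ D + mu * torch.eye(m)))

    def forward(self, x):                                  # x: (batch, n)
        z = x.new_zeros(x.shape[0], self.W.shape[0])       # codes, (batch, m)
        d = torch.zeros_like(z)                            # Lagrange multipliers
        for _ in range(self.T):
            u = torch.sign(z - d) * torch.clamp((z - d).abs() - self.alpha / self.mu, min=0.0)
            z = (x @ self.W.t() + self.mu * (u + d)) @ self.S.t()
            d = d - (z - u)
        return z

# Training sketch for Equation 17: regress the network output onto precomputed
# optimal codes with a squared-error loss and backpropagation, e.g.
#   model = LSALSA(D, alpha=0.5, mu=1.0, T=5)
#   loss = torch.nn.functional.mse_loss(model(x_batch), z_star_batch)
```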

4.3 LSALSA versus LISTA

Here we explain the conceptual difference between LSALSA and LISTA (see also Section A in the Supplement). This difference is a direct consequence of the different nature of their parent algorithms, SALSA and ISTA respectively. ISTA is a proximal gradient method that solves the optimization problem of Equation 4 by iteratively repeating a gradient descent step followed by soft-thresholding. SALSA, on the other hand, is a second-order method that recasts the problem in terms of constrained optimization and optimizes the corresponding augmented Lagrangian. Consequently, LISTA has a simple structure such that each layer depends only on the previous layer and a re-injection of the filtered data (see (Gregor & LeCun, 2010) for reference). LSALSA has cross-layer connections resulting from the existence of the Lagrangian multiplier update (the $d$-step) in the SALSA algorithm, which allows for learning dependencies between non-adjacent layers.
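The contrast can be summarized in a few lines of NumPy (illustrative names, following the updates above): a LISTA layer maps only the previous code to the next one, whereas an LSALSA layer additionally carries the multiplier state $d$, which accumulates information from all earlier layers.

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lista_layer(x, z, W, S, theta):
    # z^(t+1) = soft_theta(W x + S z^(t))  -- depends on the previous layer only;
    # W, S, theta are LISTA's own learned quantities.
    return soft(W @ x + S @ z, theta)

def lsalsa_layer(x, z, d, W, S, alpha, mu):
    # u^(t) = soft_{alpha/mu}(z^(t-1) - d^(t-1))
    # z^(t) = S (W x + mu (u^(t) + d^(t-1)))
    # d^(t) = d^(t-1) - (z^(t) - u^(t))   -- d carries the history of all layers.
    u = soft(z - d, alpha / mu)
    z_new = S @ (W @ x + mu * (u + d))
    d_new = d - (z_new - u)
    return z_new, d_new
```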

5 Experimental Results

We run different optimization algorithms to predict optimal codes for various data sets and for a varying number of iterations $T$. We provide empirical evaluation for both the one- and two-dictionary (MCA) settings. We focus on the inference problem and thus for each experiment the dictionaries were learned off-line and used for all methods (the visualization of the atoms of the obtained dictionaries can be found in Section B in the Supplement). We compare the following methods: LSALSA, truncated SALSA, truncated FISTA, and LISTA. Both LSALSA and LISTA are implemented as feedforward neural networks. For MCA experiments, we simply run FISTA and LISTA using the concatenated dictionary $D = [D_1, D_2]$.

5.1 Single Dictionary Case

We run experiments with four data sets: Fashion MNIST (Xiao et al., 2017) (10 classes), ASIRRA (Elson et al., 2007) (2 classes), MNIST (LeCun et al., 2009) (10 classes), and CIFAR-10 (Krizhevsky & Hinton, 2009) (10 classes). The ASIRRA data set is a collection of natural images of cats and dogs. We use a subset of the whole data set, split into training and testing images, as commonly done (Golle, 2008). The results for MNIST and CIFAR-10 are reported in Section C in the Supplement.

The Fashion MNIST images were first divided into 10×10 non-overlapping patches (ignoring extra pixels on two edges), resulting in 4 patches per image. Then, optimal codes were computed for each vectorized patch by minimizing the objective from Equation 4 with FISTA. The ASIRRA images come in varying sizes. We resized them all to a common resolution and converted them to grayscale. Then we divided them into 16×16 non-overlapping patches. Optimal codes were computed patch-wise as for Fashion MNIST, but with more FISTA iterations to ensure convergence on this more difficult SC problem. For both data sets, $\alpha$ was chosen to reach a desired sparsity level.

The data sets were then separated into training and testing sets. The training patches were used to produce the dictionaries. Visualizations of the dictionary atoms are provided in Section B in the Supplement. An exhaustive hyper-parameter search was performed for each encoding method and for each number of iterations $T$, to minimize RMSE between obtained and optimal codes. The hyper-parameter search included $\alpha$ for all methods, $\mu$ for SALSA and LSALSA, as well as learning rates and learning rate decays for LSALSA and LISTA.

The obtained encoders were used to compute sparse codes on the test set. These were then compared with the optimal codes via RMSE. The results for Fashion MNIST are shown in terms of the number of iterations (Figure 3) and the wallclock time in seconds (Figure 4) used to make the prediction. It takes FISTA and SALSA many more iterations to reach the error achieved by LSALSA in just one, and only at a much larger number of iterations do FISTA and SALSA finally converge to the optimal codes. LISTA outperforms FISTA at first, but does not show much improvement afterwards. Similar results for ASIRRA are shown in Figures 5 and 6. On this more difficult problem, it again takes FISTA and SALSA many iterations to catch up with LSALSA run for a single iteration. LISTA and LSALSA are comparable for small $T$, after which LSALSA dramatically improves its optimal code prediction and, similarly to the Fashion MNIST case, shows an advantage over the other methods in terms of the number of iterations, wallclock time, and the quality of the recovered sparse codes.

Figure 3: Fashion MNIST code prediction errors for varying numbers of iterations.
Figure 4: Fashion MNIST code prediction error as a function of the wallclock time.

We also investigated which method yields better codes in terms of the classification task. For each data set, we trained a logistic regression classifier to predict the label from the corresponding sparse code. Thus, for Fashion MNIST each image is associated with 4 optimal codes (one for each patch), yielding a total feature length of 400. The Fashion MNIST classifier was trained until it achieved a target classification error on the testing set. For ASIRRA, each concatenated optimal code is high-dimensional; to reduce the dimensionality we applied a random Gaussian projection before inputting the codes into the classifier. The classifier was trained on the optimal projected codes until it achieved a target error. The results for Fashion MNIST and ASIRRA are shown in Tables 1 and 2, respectively. Note: the classifier was trained on the optimal codes for images from a test set. Thus, the resulting classification error is only due to the difference between the optimal and estimated codes.
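A hedged scikit-learn sketch of this evaluation protocol is given below: a logistic regression classifier is fit on (optionally randomly projected) optimal codes and then evaluated on the codes produced by an encoder. The array shapes, stand-in data, and projection size are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
# Stand-ins for concatenated per-image codes and labels.
Z_optimal = rng.standard_normal((1000, 4096))    # optimal codes (training reference)
Z_estimated = Z_optimal + 0.1 * rng.standard_normal(Z_optimal.shape)  # encoder output
y = rng.integers(0, 2, size=1000)

# Optional dimensionality reduction, as used for the ASIRRA codes.
proj = GaussianRandomProjection(n_components=256, random_state=0)
Z_opt_p = proj.fit_transform(Z_optimal)
Z_est_p = proj.transform(Z_estimated)

clf = LogisticRegression(max_iter=1000).fit(Z_opt_p, y)
# The reported error measures how much the estimated codes deviate from the
# optimal ones, as seen by a classifier trained on the optimal codes.
error = 1.0 - clf.score(Z_est_p, y)
```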

Figure 5: ASIRRA code prediction errors for varying numbers of iterations.
Figure 6: ASIRRA code prediction error as a function of the wallclock time.
Figure 7: Sparsity/accuracy trade-off analysis for ASIRRA obtained for the source separation experiment with MNIST + ASIRRA data set. Each method corresponds to a colored point cloud, where each point corresponds to one sample from the ASIRRA test data set. LSALSA achieves the best sparsity/accuracy trade-off and is faster than other methods.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 87.5312 54.6056 56.4777 11.2252
5 78.4570 38.1251 23.6107 3.1798
7 70.2489 37.1643 9.1991 0.6628
10 56.0629 32.9025 1.5871 0.0755
15 31.9948 30.4471 0.0000 0.0000
50 0.1011 14.0294 0.0000 0.0000
100 0.0000 7.9534 0.0000 0.0000
Table 1: Fashion-MNIST classification results. The best performer is in bold. All methods but LISTA were able to match the optimal codes well enough to get zero percent error by 100 iterations.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 48.9000 52.4000 48.8000 40.1000
3 49.2000 52.7000 46.0000 42.8000
5 48.5000 53.5000 44.8000 35.0000
7 47.8000 53.7000 44.5000 35.1000
10 46.5000 38.5000 42.7000 34.4000
15 43.9000 38.1000 40.7000 33.1000
20 42.1000 37.6000 38.7000 31.6000
50 37.8000 38.2000 37.2000 31.9000
100 36.4000 37.1000 36.8000 30.8000
Table 2: ASIRRA classification results. The best performer is in bold.
Figure 8: MCA experiment using MNIST + ASIRRA data set. Code prediction errors for varying numbers of iterations. LISTA saturates very fast as it does not support unmixing signals.

5.2 MCA: Two-Dictionary Case

We first describe the data set that we use for the MCA experiments. Following the notation introduced previously in the paper, we set the $x_1$'s to be whole MNIST images and the $x_2$'s to be non-overlapping patches from ASIRRA. We obtain separate training and testing sets of patches from ASIRRA and of images from MNIST. We randomly mix images from MNIST with patches from ASIRRA to generate mixed training and testing images. Optimal codes were computed using SALSA (Algorithm 2), run long enough to ensure that both components reached the desired sparsity level. Note that we also performed experiments on a mixed data set of CIFAR-10 and MNIST; those can be found in Section D in the Supplement.

An exhaustive hyper-parameter search was performed for each encoding method and for each number of iterations $T$. The hyper-parameter search included the sparsity coefficients for FISTA and LISTA, $\mu$ for SALSA and LSALSA, as well as learning rates for LSALSA and LISTA.

Code prediction error curves are presented in Figures 8 and 9. LSALSA steadily outperforms the other methods, until SALSA eventually catches up. FISTA and LISTA, without a mechanism for distinguishing the two dictionaries, struggle to estimate the optimal codes.

Figure 9: MCA experiment using MNIST + ASIRRA data set. Code prediction error versus wallclock time.

In Figure 7 we illustrate each method's sparsity/accuracy trade-off on the ASIRRA test data set, while varying $T$ (Section E in the Supplement shows the same plot for a wider variety of $T$ as well as a similar plot for MNIST). LSALSA retains both the highest sparsity and accuracy levels among all the methods, even for small $T$.

Similarly to the single-dictionary case, we performed an evaluation on the classification task. A separate classifier was trained for each data set using the separated optimal codes $z_1^*$ and $z_2^*$, respectively. As before, a random Gaussian projection was used to reduce the dimensionality of the ASIRRA codes before inputting them to the classifier. The classification results are depicted in Table 3 for MNIST and Table 4 for ASIRRA.

Finally, in Figure 10 we present exemplary reconstructed images obtained by different methods when performing source separation (more reconstruction results can be found in Section F in the Supplement). No additional learning was performed to achieve reconstruction; the estimated codes were simply multiplied by the corresponding dictionary matrix, i.e. for LSALSA we have $\hat{x}_i = D_i \bar{z}_i$, where $\bar{z}_i$ represents the $i$-th component of an encoder's output. FISTA and LISTA are unable to separate the components without severely corrupting the ASIRRA component. LSALSA produces visually recognizable separations even at small numbers of iterations, and the MNIST component is almost entirely removed as the number of iterations grows.

Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 70.2886 16.8114 26.9935 2.3720
3 34.0881 16.0946 24.4524 3.6885
5 89.9683 14.7785 39.7438 1.1499
10 90.0000 15.0006 3.0251 0.0520
20 90.0000 9.7423 0.8543 0.0495
50 1.2976 6.7252 0.0243 0.0184
Table 3: MNIST classification error obtained after source separation. The best performer is in bold.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 43.7000 39.8000 46.7000 32.6000
3 41.5000 40.2000 42.4000 35.3000
5 49.8000 38.8000 43.8000 30.0000
10 49.9000 38.6000 28.9000 22.3000
20 45.5000 37.9000 23.0000 19.1000
50 28.4000 36.4000 12.4000 12.7000
Table 4: ASIRRA classification error obtained after source separation. The best performer is in bold.
Figure 10: MCA experiment. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).

6 Conclusions

In this paper we propose LSALSA, a deep encoder architecture obtained from time-unfolding the Split Augmented Lagrangian Shrinkage Algorithm (SALSA). We demonstrate that LSALSA outperforms baseline methods such as SALSA, FISTA, and LISTA in terms of the quality of the predicted sparse codes as well as the running time, in both the single- and multiple-dictionary (MCA) cases. In the two-dictionary MCA setting, we furthermore show that LSALSA obtains separations of image components with better visual quality than those obtained by SALSA.

References

  • Adler & Öktem (2017) Adler, J. and Öktem, O. Learned primal-dual reconstruction. CoRR, abs/1707.06474, 2017.
  • Afonso et al. (2010) Afonso, M., Bioucas-Dias, J., and Figueiredo, M. Fast image recovery using variable splitting and constrained optimization. IEEE Trans. Image Processing, 19(9):2345–2356, 2010.
  • Afonso et al. (2011) Afonso, M., Bioucas-Dias, J., and Figueiredo, M. An augmented lagrangian approach to the constrained optimization formulation of imaging inverse problems. Trans. Img. Proc., 20(3):681–695, 2011.
  • Bauschke & Combettes (2011) Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, 1st edition, 2011.
  • Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
  • Borgerding & Schniter (2016) Borgerding, M. and Schniter, P. Onsager-corrected deep learning for sparse linear inverse problems. In GlobalSIP, 2016.
  • Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Daubechies et al. (2004) Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
  • Eckstein & Bertsekas (1992) Eckstein, J. and Bertsekas, D. On the douglas-rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55:293–318, 1992.
  • Elad et al. (2005) Elad, M., Starck, J. L., Querre, P., and Donoho, D. L. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis, 19(3):340–358, 2005.
  • Elson et al. (2007) Elson, J., Douceur, J., Howell, J., and Saul, J. Asirra: a captcha that exploits interest-aligned manual image categorization. In ACM CCS, 2007.
  • Gers et al. (2003) Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. Learning precise timing with lstm recurrent networks. J. Mach. Learn. Res., 3:115–143, 2003.
  • Golle (2008) Golle, P. Machine learning attacks against the asirra captcha. In ACM CCS, 2008.
  • Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
  • Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
  • Kavukcuoglu et al. (2010) Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images, 2009.
  • Le Roux et al. (2015) Le Roux, J., Hershey, J. R., and Weninger, F. Deep nmf for speech separation. In ICASSP, 2015.
  • LeCun et al. (2009) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 2009.
  • Liu et al. (2017) Liu, S., Xian, Y., Li, H., and Yu, Z. Text detection in natural scene images using morphological component analysis and laplacian dictionary. IEEE/CAA Journal of Automatica Sinica, PP(99):1–9, 2017.
  • Olshausen & Field (1996) Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
  • Otazo et al. (2015) Otazo, R., Candès, E., and Sodickson, D. K. Low-rank and sparse matrix decomposition for accelerated dynamic mri with separation of background and dynamic components. Magn Reson Med, 73(3):1125–36, 2015.
  • Parekh et al. (2014) Parekh, A., Selesnick, I., Rapoport, D., and Ayappa, I. Sleep spindle detection using time-frequency sparsity. In IEEE SPMB, 2014.
  • Peyré et al. (2007) Peyré, G., Fadili, J., and Starck, J.-L. Learning adapted dictionaries for geometry and texture separation. In SPIE Wavelets, 2007.
  • Peyré et al. (2010) Peyré, G., Fadili, J., and Starck, J.-L. Learning the morphological diversity. SIAM J. Imaging Sciences, 3(3):646–669, 2010.
  • Shoham & Elad (2008) Shoham, N. and Elad, M. Algorithms for signal separation exploiting sparse representations, with application to texture image separation. In Proceedings of the IEEE 25th Convention of Electrical and Electronics Engineers in Israel, 2008.
  • Sprechmann et al. (2013) Sprechmann, P., Litman, R., Yakar, T., Bronstein, A., and Sapiro, G. Efficient supervised sparse analysis and synthesis operators. In NIPS, 2013.
  • Starck et al. (2004) Starck, J.-L., Elad, M., and Donoho, D. Redundant multiscale transforms and their application for morphological component separation. Advances in Imaging and Electron Physics, 132:287–348, 2004.
  • Starck et al. (2005a) Starck, J.-L., Elad, M., and Donoho, D. Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Processing, 14(10):1570–1582, 2005a.
  • Starck et al. (2005b) Starck, J.-L., Moudden, Y., Bobin, J., Elad, M., and Donoho, D. Morphological component analysis. In Proc. SPIE Wavelets, 2005b.
  • Uysal et al. (2016) Uysal, F., Selesnick, I., and Isom, B. Mitigation of wind turbine clutter for weather radar by signal separation. IEEE Trans. Geoscience and Remote Sensing, 54(5):2925–2934, 2016.
  • Wang et al. (2016) Wang, Z., Ling, Q., and Huang, T. Learning deep l0 encoders. In AAAI, 2016.
  • Wisdom et al. (2017) Wisdom, S., Powers, T., Pitton, J., and Atlas, L. Deep recurrent nmf for speech separation by unfolding iterative thresholding. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 254–258, 2017.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
  • Yang et al. (2016) Yang, Y., Sun, J., Li, H., and Xu, Z. Deep admm-net for compressive sensing mri. In NIPS, 2016.
  • Zhou et al. (2018) Zhou, J., Di, K., Du, J., Peng, X., Yang, H., Pan, S., Tsang, I., Liu, Y., Qin, Z., and Goh, R. Sc2net: Sparse lstms for sparse coding. In AAAI, 2018.

Appendix A Additional discussion on the difference between LSALSA and LISTA

The recursive formula in LISTA is given as

$$z^{(t+1)} = \mathrm{soft}_{\theta}\left(W_e\, x + S\, z^{(t)}\right), \qquad (18)$$

where $W_e$, $S$, and $\theta$ are LISTA's learned quantities. The recursive formula in LSALSA is derived below. We start with the output of the non-linearity from Algorithm 1:

$$u^{(t+1)} = \mathrm{soft}_{\alpha/\mu}\left(z^{(t)} - d^{(t)}\right) \qquad (19)$$
$$u^{(t+1)} = \mathrm{soft}_{\alpha/\mu}\left(S\left(W x + \mu\,(u^{(t)} + d^{(t-1)})\right) - d^{(t)}\right), \qquad (20)$$
$$d^{(t)} = d^{(0)} - \sum_{j=1}^{t}\left(z^{(j)} - u^{(j)}\right), \qquad (21)$$

where $W$ and $S$ are the network parameters from Section 4.2, and $u^{(t)}$ is the output of the network non-linearity. Clearly, in the case of LSALSA, the non-linearity output has a dependence on all of the previous layers' outputs. This comes from the auxiliary variable $d$, i.e. the Lagrangian multipliers term. LISTA's output only depends directly on the previous layer.

Appendix B Dictionary Learning Experiments

We visualize the learned dictionary atoms for both the single-dictionary case (Figures 11 and 12) and the two-dictionary (MCA) case (Figures 13 and 14).

B.1 Dictionaries used in single dictionary experiments

Figure 11: Visualization of dictionary atoms trained on (a) 10×10 MNIST image patches, (b) 10×10 Fashion-MNIST image patches, and (c) 10×10 CIFAR-10 image patches. Each dictionary has 100 atoms (complete) and each atom is a unit norm vector of length 100 reshaped to 10×10.
Figure 12: Visualization of the ASIRRA dictionary trained on 16×16 image patches. The dictionary has 256 atoms (complete) and each atom is a unit norm vector of length 256 reshaped to 16×16.

B.2 Dictionaries used in MCA experiments

In the first set of MCA experiments we performed source separation on MNIST + ASIRRA images. We used two dictionaries trained independently using whole MNIST images and patches of ASIRRA images. In the second set of MCA experiments, we performed source separation on spatially added MNIST and CIFAR-10 images (more results of this experiment are shown in Section D of the Supplement). We used the same MNIST dictionary as in the MNIST + ASIRRA experiments and trained a CIFAR-10 dictionary on the whole grayscale CIFAR-10 data set images. These dictionaries have 1024 atoms (complete), all normalized vectors of length 1024 reshaped to 32×32. A subset of the atoms of the dictionaries used in the MCA experiments is visualized in Figure 13 and Figure 14.

Figure 13: Visualization of dictionary atoms trained on CIFAR-10 images.
Figure 14: Visualization of dictionary atoms trained on (a) MNIST images and (b) ASIRRA image patches.

Appendix C Additional single dictionary experiments

The single dictionary experiments on MNIST and CIFAR-10 data sets are summarized below. The code prediction errors for MNIST are captured in Figure 15 and for CIFAR are captured in Figure 16. The classification results are captured in Table 5 for MNIST and Table 6 for CIFAR.

C.1 MNIST

The MNIST images were first scaled to pixel values in the range [0, 1] and then divided into 10×10 non-overlapping patches (ignoring extra pixels on the edges), resulting in 4 patches per image. Only patches with standard deviation above 0.1 were used in training and the remaining ones were discarded (as they are practically all-black). Optimal codes were computed for each vectorized patch by minimizing the objective from Equation 4 with FISTA, yielding sparse optimal codes.

Figure 15: (Left:) MNIST code prediction errors for varying numbers of iterations. FISTA takes iterations to give error produced by iteration of LSALSA. FISTA estimates optimal codes better than LISTA for higher . (Right:) MNIST code prediction error as a function of the wallclock time.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 40.8682 4.5842 20.7291 1.9130
5 6.1046 4.6527 4.8549 1.7773
7 3.4363 2.0406 0.8097 0.4574
10 2.0326 1.3802 0.0990 0.0996
15 1.0725 0.8778 0.0200 0.0103
50 0.0205 0.6168 0.0000 0.0000
100 0.0000 0.4228 0.0000 0.0000
Table 5: MNIST classification results. The best performer is in bold.

C.2 CIFAR-10

In the CIFAR-10 experiments, natural images were first converted to grayscale, scaled to values in the range [0, 1], and broken down into 10×10 non-overlapping patches. Each image resulted in 9 patches. Then optimal codes were computed on these patches in a similar fashion as described above for the MNIST data set.

Figure 16: (a) CIFAR-10 code prediction errors for varying number of iterations. All methods except LISTA are converging fast after on this data set. LISTA, FISTA, and SALSA took more than iterations to produce error obtained by LSALSA in only iteration. (b) CIFAR-10 code prediction error as a function of the wallclock time.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 86.8600 79.1300 89.0700 64.6900
5 82.3300 76.2600 87.2700 66.3100
7 79.4700 74.1000 82.7100 64.6400
10 75.5200 71.6500 82.8300 54.9800
15 70.1900 72.4500 75.4100 54.9900
50 43.1400 66.3400 43.6100 49.4100
100 67.8600 60.2200 10.4800 18.4400
Table 6: CIFAR-10 classification results. The best performer is in bold.

Appendix D Additional MCA experiments

D.1 MNIST + CIFAR-10

Figure 17: MCA experiments with MNIST + CIFAR data sets. (a) Code prediction errors for varying numbers of iterations. (b) Code prediction error as a function of the wallclock time.

MNIST + CIFAR-10 MCA experimental results are summarized here. We combined whole MNIST digit images with grayscale CIFAR-10 images and performed source separation on them. Code prediction error curves are presented with respect to the number of iterations and the wallclock time used to make the prediction in Figure 17. The classification results are captured in Table 7 for the MNIST codes and Table 8 for the CIFAR-10 codes.

Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 66.8726 37.1971 33.3767 5.8773
3 90.0000 33.2179 60.7343 7.3075
5 90.0000 18.0357 19.2877 4.2980
7 90.0000 15.8905 8.8672 3.2094
10 90.0000 13.5879 5.3613 3.2018
20 8.4367 10.1961 2.8601 4.6468
50 21.2430 6.4679 12.9358 2.9800
Table 7: MNIST classification error after source separation. The best performer is highlighted in bold.
Classification Error (in %)
Iter FISTA LISTA SALSA LSALSA
1 88.1200 87.4300 83.9500 84.2700
3 88.7300 88.1500 84.4200 82.5500
5 88.6300 81.9900 82.5900 74.8700
7 88.4300 82.8500 81.1000 68.2100
10 88.8500 80.0800 79.0200 63.9300
20 81.3000 79.1600 76.2400 57.5300
50 70.4000 81.0900 74.7100 52.6000
Table 8: CIFAR-10 classification error after source separation. The best performer is highlighted in bold.

Appendix E Additional plots: MNIST+ASIRRA

Figure 18 shows an extended version of Figure 7. In Figure 19 we illustrate each method's sparsity/accuracy trade-off on the MNIST test data set, while varying $T$. LSALSA retains both the highest sparsity and accuracy levels among all the methods, even for small $T$.

Figure 18: Sparsity/accuracy trade-off analysis for ASIRRA obtained for the source separation experiment with MNIST + ASIRRA data set. Each method corresponds to a colored point cloud, where each point corresponds to one sample from the ASIRRA test data set. LSALSA achieves the best sparsity/accuracy trade-off and is faster than other methods.
Figure 19: Sparsity/accuracy trade-off analysis for MNIST obtained for the source separation experiment with MNIST + ASIRRA data set. Each method corresponds to a colored point cloud, where each point corresponds to one sample from the MNIST test data set. LSALSA achieves the best sparsity/accuracy trade-off and is faster than other methods.

Appendix F Source separation: image reconstruction results

Figure 20: MCA experiment using MNIST + ASIRRA data set. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).
Figure 21: MCA experiment using MNIST + ASIRRA data set. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).
Figure 22: MCA experiment using MNIST + ASIRRA data set. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).
Figure 23: MCA experiment using MNIST + ASIRRA data set. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).
Figure 24: MCA experiment using MNIST + ASIRRA data set. Image reconstructions obtained by SALSA, LSALSA, FISTA, LISTA for . Top row: original data (components and mixed).