1 Introduction
In the SC framework, we seek to efficiently represent data by using only a sparse combination of available basis vectors. We therefore assume that an $n$-dimensional data vector $\mathbf{x} \in \mathbb{R}^n$ can be approximated as
$$\mathbf{x} \approx \mathbf{D}\mathbf{z},\tag{1}$$
where $\mathbf{z} \in \mathbb{R}^m$ is sparse and $\mathbf{D} \in \mathbb{R}^{n \times m}$ is a dictionary, sometimes referred to as the synthesis matrix, whose columns are the basis vectors.
This paper focuses on the SC problem of decomposing a signal into morphologically distinct components. A typical assumption for this problem is that the data is a linear combination of source signals:
$$\mathbf{x} = \sum_{i=1}^{S} \mathbf{x}_i.\tag{2}$$
The MCA framework (Starck et al., 2004) requires that each component $\mathbf{x}_i$ admits a sparse representation within the corresponding dictionary $\mathbf{D}_i$. The dictionaries $\mathbf{D}_i$ are distinct, i.e., each source-specific dictionary allows a sparse representation of the corresponding source signal while being highly inefficient in representing the other content in the mixture. This leads to a signal model that generalizes the one given by Equation 1:
$$\mathbf{x} \approx \sum_{i=1}^{S} \mathbf{D}_i\mathbf{z}_i.\tag{3}$$
The bottleneck of SC techniques is that at inference a sparse code has to be computed for each data point or data patch (as in the case of high-resolution images), and this is typically done via iterative optimization. In the single-dictionary setting, ISTA (Daubechies et al., 2004) and FISTA (Beck & Teboulle, 2009) are classical algorithmic choices for this purpose. For the MCA problem, the standard choice is SALSA (Afonso et al., 2011), an instance of ADMM (Boyd et al., 2011). This process is prohibitively slow for high-throughput real-time applications.
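For concreteness, here is a minimal NumPy sketch of one such iterative scheme, ISTA, for the objective formalized later in Equation 4; the zero initialization and fixed iteration budget are illustrative choices, and FISTA adds a momentum term on top of the same update:

```python
import numpy as np

def soft(v, tau):
    # Element-wise soft-thresholding (see Equation 13).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(x, D, alpha, n_iter=100):
    """ISTA for min_z 0.5*||x - D z||_2^2 + alpha*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):              # gradient step followed by shrinkage
        z = soft(z - D.T @ (D @ z - x) / L, alpha / L)
    return z
```

Each iteration costs two multiplications by the dictionary, which is what makes inference expensive when it must be repeated for every data point or patch.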
The key contribution of this paper is an efficient and accurate deep learning architecture that is general enough to well-approximate optimal codes both for classic SC in a single-dictionary framework and for MCA-based signal separation. We call our deep learning approximator Learned SALSA (LSALSA). The proposed encoder is formulated as a time-unfolded version of the SALSA algorithm with a fixed number of iterations, where the depth of the deep learning model corresponds to the number of SALSA iterations. We train the deep model in a supervised fashion to predict optimal sparse codes for a given input; thus, in practice, we can use shallow fixed-depth architectures that correspond to only a few iterations of the original SALSA and still achieve performance superior to it. Furthermore, SALSA comes with a built-in source separation mechanism and cross-iteration memory-sharing connections. These main algorithmic features of SALSA translate to a specific connectivity pattern in the corresponding deep learning architecture of LSALSA, which in consequence gives LSALSA an advantage over previous deep encoders, like LISTA (Gregor & LeCun, 2010), both in applicability to a broader class of learning problems (LISTA is used only in the single-dictionary setting) and in performance. To the best of our knowledge, our approach is the first to utilize an instance of ADMM unrolled into a deep learning architecture to address a source separation problem.
This paper is organized as follows: Section 2 provides a literature review, Section 3 formulates the SC problem in detail, and Section 4 shows how to derive predictive single-dictionary SC and multiple-dictionary MCA from their iterative counterparts and explains our approach (LSALSA). Finally, Section 5 shows experimental results for both the single-dictionary setting and MCA, and Section 6 concludes the paper.
2 Related Work
Sparse code inference aims at computing sparse codes for given data and is most widely addressed via iterative schemes such as the aforementioned ISTA and FISTA. Predicting approximations of optimal codes can be done using deep feed-forward learning architectures based on truncated convex solvers. This family of approaches lies at the core of this paper. A notable approach in this family, known as LISTA (Gregor & LeCun, 2010), stems from earlier predictive sparse decomposition methods (Kavukcuoglu et al., 2010; Jarrett et al., 2009), which, however, obtained sparse code approximations of insufficient quality. LISTA improves over these techniques and enhances ISTA by unfolding a fixed number of iterations to define a fixed-depth deep neural network that is trained with examples of input vectors paired with their corresponding optimal sparse codes obtained by conventional methods like ISTA or FISTA. LISTA was shown to provide high-quality approximations of optimal sparse codes at a fixed computational cost. The unrolling methodology has also been applied to algorithms solving SC with $\ell_0$ regularization (Wang et al., 2016) and to message passing schemes (Borgerding & Schniter, 2016). In other prior work, ISTA was recast as a recurrent neural network unit, giving rise to a variant of LSTM (Gers et al., 2003; Zhou et al., 2018). The above-mentioned algorithms do not suit the MCA problem well as they have no algorithmic mechanism for handling multiple dictionaries. In other words, they would approach the MCA problem by casting it as a SC problem with access to a single dictionary that is a concatenation of the source-specific dictionaries, e.g., $\mathbf{D} = [\mathbf{D}_1\ \mathbf{D}_2]$.

This paper considers a generalization of the single-dictionary SC problem to the MCA framework. The framework assumes that the data can be explained by multiple distinct dictionaries. MCA has been used successfully in a number of applications that include decomposing images into textures and cartoons for denoising and inpainting (Starck et al., 2005b, a; Elad et al., 2005; Peyré et al., 2007; Shoham & Elad, 2008; Peyré et al., 2010), detecting text in natural scene images (Liu et al., 2017), as well as other source separation problems such as separating non-stationary clutter from weather radar signals (Uysal et al., 2016), transients from sustained rhythmic components in EEG signals (Parekh et al., 2014), and stationary from dynamic components of MRI videos (Otazo et al., 2015). The MCA problem is traditionally solved via the SALSA algorithm, which constitutes a special case of the ADMM method.
There exist a few approaches in the literature utilizing ADMM unrolled into a deep learning architecture. One such computationally efficient framework (Sprechmann et al., 2013) was applied to learning task-specific (reconstruction or classification) sparse models via sparsity-promoting convolutional operators. Another unrolled version of ADMM (Yang et al., 2016) was demonstrated to improve the reconstruction accuracy and computational speed of the baseline ADMM algorithm for the problem of compressive sensing Magnetic Resonance Imaging. A variety of papers followed up on this work for various image reconstruction tasks, such as the Learned Primal-dual Algorithm (Adler & Öktem, 2017). None of these approaches were applied to MCA or other source separation problems. An unrolled non-negative matrix factorization (NMF) algorithm (Le Roux et al., 2015) was implemented as a deep network for the task of speech separation. In another work (Wisdom et al., 2017), the NMF-based speech separation task was solved with an ISTA-like unfolded network.
3 Problem Formulation
This paper focuses on the inference problem in SC. It is formulated as finding the optimal sparse code $\mathbf{z}^*$ given an input vector $\mathbf{x}$ and a dictionary matrix $\mathbf{D}$, whose columns are the normalized basis vectors. $\mathbf{z}^*$ minimizes the $\ell_1$-regularized linear least squares cost function
$$\mathbf{z}^* = \arg\min_{\mathbf{z}}\ \frac{1}{2}\left\|\mathbf{x} - \mathbf{D}\mathbf{z}\right\|_2^2 + \alpha\left\|\mathbf{z}\right\|_1,\tag{4}$$
where the scalar constant $\alpha$ balances sparsity with data fidelity. Thus $\mathbf{z}^*$ is the optimal code for $\mathbf{x}$ with respect to $\mathbf{D}$. The dictionary matrix $\mathbf{D}$ is usually learned by minimizing the loss function given below (Olshausen & Field, 1996)
$$\mathcal{L}(\mathbf{D}) = \frac{1}{N}\sum_{p=1}^{N}\frac{1}{2}\left\|\mathbf{x}^p - \mathbf{D}\mathbf{z}^{*,p}\right\|_2^2\tag{5}$$
with respect to $\mathbf{D}$ using stochastic gradient descent (SGD), where $N$ is the size of the training data set, $\mathbf{x}^p$ is the $p$-th training sample, and $\mathbf{z}^{*,p}$ is the corresponding optimal sparse code. The optimal sparse codes in each iteration are obtained in this paper with FISTA.

In the MCA framework, a generalization of the cost function from Equation 4 is minimized to estimate the component codes from the model given in Equation 3. Thus one minimizes
$$\mathbf{z}^* = \arg\min_{\mathbf{z}}\ \frac{1}{2}\left\|\mathbf{x} - \mathbf{D}\mathbf{z}\right\|_2^2 + \sum_{i=1}^{S}\alpha_i\left\|\mathbf{z}_i\right\|_1,\tag{6}$$
where
$$\mathbf{D} = \left[\mathbf{D}_1\ \mathbf{D}_2\ \cdots\ \mathbf{D}_S\right]\tag{7}$$
$$\mathbf{z} = \left[\mathbf{z}_1^\top\ \mathbf{z}_2^\top\ \cdots\ \mathbf{z}_S^\top\right]^\top,\tag{8}$$
where $\mathbf{D}_i \in \mathbb{R}^{n \times m_i}$ and $\mathbf{z}_i \in \mathbb{R}^{m_i}$ for $i = 1, \dots, S$, and the $\alpha_i$ are the coefficients controlling the sparsity penalties. We denote the concatenated optimal codes with $\mathbf{z}^*$.
In the classic MCA works, the dictionaries $\mathbf{D}_i$ are selected to be well-known filter banks with explicitly designed sparsification properties. Such hand-designed transforms have good generalization abilities and help to prevent overfitting. Also, MCA algorithms often require solving large systems of equations involving the dictionary matrices. An appropriate constraining of $\mathbf{D}$ leads to a banded system of equations and in consequence reduces the computational complexity of these algorithms, e.g., (Parekh et al., 2014). More recent MCA works use learned dictionaries for image analysis (Shoham & Elad, 2008; Peyré et al., 2007). Some extensions of MCA consider learning the dictionaries $\mathbf{D}_i$ and the sparse codes jointly (Peyré et al., 2007, 2010). In our paper, we learn the dictionaries independently. In particular, for each source $i$ we minimize
$$\mathcal{L}(\mathbf{D}_i) = \frac{1}{N}\sum_{p=1}^{N}\frac{1}{2}\left\|\mathbf{x}_i^p - \mathbf{D}_i\mathbf{z}_i^{*,p}\right\|_2^2\tag{9}$$
with respect to $\mathbf{D}_i$ using SGD, where $\mathbf{x}_i^p$ is the $i$-th mixture component of the $p$-th training sample and $\mathbf{z}_i^{*,p}$ is the corresponding optimal sparse code. The optimal sparse codes in each iteration are obtained with FISTA.
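For illustration, a minimal sketch of this dictionary learning loop, reusing the `ista` routine sketched in Section 1 as a stand-in for FISTA; the learning rate, epoch count, and column re-normalization are illustrative choices:

```python
import numpy as np

def learn_dictionary(X, m, alpha, lr=0.1, n_epochs=10):
    """SGD on Equation 5 (or Equation 9 per source): alternate inference and dictionary steps."""
    n = X.shape[1]
    D = np.random.randn(n, m)
    D /= np.linalg.norm(D, axis=0)           # normalized basis vectors
    for _ in range(n_epochs):
        for x in X:                          # X: (N, n) array of training samples
            z = ista(x, D, alpha)            # optimal code for the current D
            r = x - D @ z                    # reconstruction residual
            D += lr * np.outer(r, z)         # gradient step on 0.5*||x - D z||^2
            D /= np.linalg.norm(D, axis=0)   # keep columns normalized
    return D
```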
4 From iterative to predictive SC and MCA
4.1 Split Augmented Lagrangian Shrinkage Algorithm (SALSA)
The objective functions used in SC (Equation 4) and MCA (Equation 6) are each convex with respect to $\mathbf{z}$, allowing a wide variety of optimization algorithms with well-studied convergence results to be applied (Bauschke & Combettes, 2011). Here we describe a popular algorithm, general enough to solve both problems, called SALSA, which is an instance of ADMM.
ADMM addresses an optimization problem of the form
$$\min_{\mathbf{z}}\ f_1(\mathbf{z}) + f_2(\mathbf{z})\tag{10}$$
by recasting it as the equivalent, constrained problem
$$\min_{\mathbf{z},\,\mathbf{u}}\ f_1(\mathbf{z}) + f_2(\mathbf{u})\quad \text{subject to}\quad \mathbf{z} = \mathbf{u}.\tag{11}$$
ADMM then minimizes the corresponding augmented Lagrangian,
$$\mathcal{L}_{\mu}(\mathbf{z}, \mathbf{u}, \mathbf{d}) = f_1(\mathbf{z}) + f_2(\mathbf{u}) + \frac{\mu}{2}\left\|\mathbf{z} - \mathbf{u} - \mathbf{d}\right\|_2^2,\tag{12}$$
where $\mathbf{d}$ corresponds to the (scaled) Lagrange multipliers, one variable at a time until convergence.
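Minimizing Equation 12 one variable at a time yields the standard scaled-form ADMM iteration (Boyd et al., 2011), which SALSA instantiates below:
$$\mathbf{z}^{k+1} = \arg\min_{\mathbf{z}}\ f_1(\mathbf{z}) + \frac{\mu}{2}\left\|\mathbf{z} - \mathbf{u}^{k} - \mathbf{d}^{k}\right\|_2^2,$$
$$\mathbf{u}^{k+1} = \arg\min_{\mathbf{u}}\ f_2(\mathbf{u}) + \frac{\mu}{2}\left\|\mathbf{z}^{k+1} - \mathbf{u} - \mathbf{d}^{k}\right\|_2^2,$$
$$\mathbf{d}^{k+1} = \mathbf{d}^{k} - \left(\mathbf{z}^{k+1} - \mathbf{u}^{k+1}\right).$$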
SALSA addresses an instance of the general optimization problem from Equation 11, where $f_1$ is a least-squares term and $f_2$ is such that its proximity operator can be computed exactly (Afonso et al., 2010). The algorithm falls under a subcategory of Douglas-Rachford splitting methods, for which convergence has been proved (Eckstein & Bertsekas, 1992).
SALSA is given in Algorithms 1 and 2 for the single-dictionary case and the MCA case involving two dictionaries^1, respectively, where
$$\operatorname{soft}(v, \tau) = \operatorname{sign}(v)\max\left(|v| - \tau,\ 0\right)\tag{13}$$
is the soft-thresholding function with threshold $\tau$, applied element-wise. Note that in Algorithm 2, the $\mathbf{u}$ and $\mathbf{d}$ updates can be performed with element-wise operations. The $\mathbf{z}$ update, however, is non-separable with respect to the components for a general $\mathbf{D}$. We call this the splitting step.

^1 In this paper we consider the MCA framework with two dictionaries. Extensions to more than two dictionaries are straightforward.
As mentioned in Section 3, the $\mathbf{z}$ update is often simplified to element-wise operations by constraining the matrix $\mathbf{D}$ to have special properties. For example, requiring $\mathbf{D}$ to be a tight frame, $\mathbf{D}\mathbf{D}^\top = \mathbf{I}$, reduces the update step to element-wise division (after applying the matrix inverse lemma). In (Yang et al., 2016), $\mathbf{D}$ is set to be the partial Fourier transform, reducing the system of equations of the $\mathbf{z}$ update to a series of convolutions and element-wise operations. In our work, as is typical in the case of SC, $\mathbf{D}$ is a learned dictionary without any imposed structure.

Note that solving for $\mathbf{z}$ in Algorithms 1 and 2 requires the inversion of the matrix $\mathbf{D}^\top\mathbf{D} + \mu\mathbf{I}$. This, however, needs to be done just once, at the very beginning, as this matrix remains fixed during the entire run of SALSA. We abbreviate the inverted matrix as
$$\mathbf{S} = \left(\mathbf{D}^\top\mathbf{D} + \mu\mathbf{I}\right)^{-1}.\tag{14}$$
We call this matrix a splitting operator. The recursive block diagram of SALSA is depicted in Figure 2.
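For illustration, a minimal NumPy sketch of Algorithm 1 in the notation above, reusing the `soft` function from the Section 1 sketch (Algorithm 2 is analogous, with $\mathbf{D} = [\mathbf{D}_1\ \mathbf{D}_2]$ and a separate threshold $\alpha_i/\mu$ applied to each component; the zero initialization is an illustrative choice):

```python
import numpy as np

def salsa(x, D, alpha, mu, n_iter=100):
    """SALSA (Algorithm 1) for min_z 0.5*||x - D z||_2^2 + alpha*||z||_1."""
    m = D.shape[1]
    S = np.linalg.inv(D.T @ D + mu * np.eye(m))  # splitting operator (Equation 14)
    Dtx = D.T @ x                                # filtered input, computed once
    u = np.zeros(m)                              # auxiliary (split) variable
    d = np.zeros(m)                              # scaled Lagrange multipliers
    for _ in range(n_iter):
        z = S @ (Dtx + mu * (u + d))             # splitting step (least-squares update)
        u = soft(z - d, alpha / mu)              # proximal step: soft-thresholding
        d = d - (z - u)                          # multiplier update
    return u
```

Note that the inverse defining $\mathbf{S}$ is computed once and reused across all iterations, exactly as described above.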
4.2 Learned SALSA (LSALSA)
We next describe our proposed deep encoder architecture that we refer to as Learned SALSA (LSALSA).
Consider truncating the SALSA algorithm to a fixed number of iterations $T$ and then time-unfolding it into a deep neural network architecture that matches the truncated SALSA's output exactly. Figure 1 illustrates the obtained architecture for a small number of unfolded iterations.
We initialize the parameters of the deep model with
$$\mathbf{S} = \left(\mathbf{D}^\top\mathbf{D} + \mu\mathbf{I}\right)^{-1},\tag{15}$$
$$\mathbf{W} = \mathbf{D}^\top,\tag{16}$$
where $\mathbf{D} = [\mathbf{D}_1\ \mathbf{D}_2]$ in the MCA case, to achieve an exact correspondence with SALSA. All splitting operators share parameters across the network.
LSALSA can be trained with standard backpropagation. Let $f_{\Theta}(\mathbf{x})$ denote the output of the LSALSA architecture with trainable parameters $\Theta$. The cost function used for training the model is defined as
$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{p=1}^{N}\frac{1}{2}\left\|f_{\Theta}(\mathbf{x}^p) - \mathbf{z}^{*,p}\right\|_2^2.\tag{17}$$
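A minimal PyTorch sketch of the unfolded encoder follows, assuming the initialization of Equations 15 and 16 and full parameter sharing across layers; this is a sketch of the architecture, and the exact parameterization trained in the paper may differ:

```python
import torch
import torch.nn as nn

class LSALSA(nn.Module):
    """Time-unfolded SALSA (Algorithm 1): one layer per iteration, shared S and W."""
    def __init__(self, D, mu, alpha, T):
        # D: (n, m) dictionary tensor; T: number of unfolded iterations.
        super().__init__()
        m = D.shape[1]
        self.W = nn.Parameter(D.t().clone())                                    # Equation 16
        self.S = nn.Parameter(torch.linalg.inv(D.t() @ D + mu * torch.eye(m)))  # Equation 15
        self.mu, self.alpha, self.T = mu, alpha, T

    def forward(self, x):
        # x: (batch, n) input vectors.
        Wx = x @ self.W.t()                             # filtered input, shared by all layers
        u = torch.zeros_like(Wx)
        d = torch.zeros_like(Wx)
        for _ in range(self.T):
            z = (Wx + self.mu * (u + d)) @ self.S.t()   # splitting step
            v = z - d
            u = torch.sign(v) * torch.clamp(v.abs() - self.alpha / self.mu, min=0.0)
            d = d - (z - u)                             # cross-layer memory (multipliers)
        return u

# Training minimizes Equation 17 with SGD/backprop, e.g.:
#   loss = 0.5 * ((model(x_batch) - z_star) ** 2).sum(dim=1).mean()
```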
4.3 LSALSA versus LISTA
Here we explain the conceptual difference between LSALSA and LISTA (see also Section A in the Supplement). This difference is a direct consequence of the different nature of their parent algorithms, SALSA and ISTA, respectively. ISTA is a proximal gradient method that solves the optimization problem of Equation 4 by iteratively repeating a gradient descent step followed by soft-thresholding. SALSA, on the other hand, is a second-order method that recasts the problem in terms of constrained optimization and optimizes the corresponding augmented Lagrangian. Consequently, LISTA has a simple structure in which each layer depends only on the previous layer and a re-injection of the filtered data (see (Gregor & LeCun, 2010) for reference). LSALSA has cross-layer connections resulting from the Lagrange multiplier update (the $\mathbf{d}$ step) in the SALSA algorithm, which allows for learning dependencies between non-adjacent layers.
5 Experimental Results
We run different optimization algorithms to predict optimal codes for various data sets and for a varying number of iterations $T$. We provide an empirical evaluation for both the one-dictionary and the two-dictionary (MCA) settings. We focus on the inference problem, and thus for each experiment the dictionaries were learned offline and used for all methods (visualizations of the atoms of the obtained dictionaries can be found in Section B in the Supplement). We compare the following methods: LSALSA, truncated SALSA, truncated FISTA, and LISTA. Both LSALSA and LISTA are implemented as feed-forward neural networks. For the MCA experiments, we simply run FISTA and LISTA using the concatenated dictionary $\mathbf{D} = [\mathbf{D}_1\ \mathbf{D}_2]$.
5.1 Single Dictionary Case
We run experiments with four data sets: Fashion MNIST (Xiao et al., 2017) (10 classes), ASIRRA (Elson et al., 2007) (2 classes), MNIST (LeCun et al., 2009) (10 classes), and CIFAR-10 (Krizhevsky & Hinton, 2009) (10 classes). The ASIRRA data set is a collection of natural images of cats and dogs. We use a subset of the whole data set, split into training and testing images as commonly done (Golle, 2008). The results for MNIST and CIFAR-10 are reported in Section C in the Supplement.
The Fashion MNIST images were first divided into non-overlapping patches (ignoring extra pixels on two edges). Optimal codes were then computed for each vectorized patch by minimizing the objective from Equation 4 with FISTA run until convergence. The ASIRRA images come in varying sizes; we resized them to a common resolution, converted them to grayscale, and divided them into non-overlapping patches. Optimal codes were computed patch-wise as for Fashion MNIST, but with more FISTA iterations to ensure convergence on this more difficult SC problem. For both data sets, $\alpha$ was chosen to reach the desired sparsity level.
The data sets were then separated into training and testing sets. The training patches were used to produce the dictionaries; visualizations of the dictionary atoms are provided in Section B in the Supplement. An exhaustive hyperparameter search was performed for each encoding method and for each number of iterations $T$, so as to minimize the RMSE between the obtained and optimal codes. The hyperparameter search included $\alpha$ for all methods, $\mu$ for SALSA and LSALSA, as well as learning rates and learning rate decays for LSALSA and LISTA.
The obtained encoders were used to compute sparse codes on the test set, which were then compared with the optimal codes via RMSE. The results for Fashion MNIST are shown in terms of the number of iterations (Figure 3) and the wall-clock time in seconds (Figure 4) used to make the prediction. FISTA and SALSA need many times more iterations to reach the error achieved by LSALSA in just one, and only at large iteration counts do they finally converge to the optimal codes. LISTA outperforms FISTA at first, but does not show much improvement as $T$ grows. Similar results for ASIRRA are shown in Figures 5 and 6. On this more difficult problem, it again takes FISTA and SALSA many iterations to catch up with a single-iteration LSALSA. LISTA and LSALSA are comparable for small $T$, after which LSALSA dramatically improves its optimal code prediction and, as in the case of Fashion MNIST, shows an advantage over the other methods in terms of the number of iterations, wall-clock time, and the quality of the recovered sparse codes.
We also investigated which method yields better codes in terms of the classification task. For each data set, we trained a logistic regression classifier to predict the label from the corresponding sparse code. For Fashion MNIST, each image is associated with one optimal code per patch, and the patch codes are concatenated into a single feature vector. For ASIRRA, the concatenated optimal codes are considerably longer; to reduce the dimensionality, we applied a random Gaussian projection before inputting the codes into the classifier. Each classifier was trained on the optimal (projected) codes until it achieved a low classification error on the testing set. The results for Fashion MNIST and ASIRRA are shown in Tables 1 and 2, respectively. Note: the classifiers were trained on the optimal codes for images from a test set; thus, the resulting classification error is only due to the difference between the optimal and estimated codes.
Table 1: Classification Error (in %) on Fashion MNIST.

Iter   FISTA    LISTA    SALSA    LSALSA
1      87.5312  54.6056  56.4777  11.2252
5      78.4570  38.1251  23.6107   3.1798
7      70.2489  37.1643   9.1991   0.6628
10     56.0629  32.9025   1.5871   0.0755
15     31.9948  30.4471   0.0000   0.0000
50      0.1011  14.0294   0.0000   0.0000
100     0.0000   7.9534   0.0000   0.0000
Table 2: Classification Error (in %) on ASIRRA.

Iter   FISTA    LISTA    SALSA    LSALSA
1      48.9000  52.4000  48.8000  40.1000
3      49.2000  52.7000  46.0000  42.8000
5      48.5000  53.5000  44.8000  35.0000
7      47.8000  53.7000  44.5000  35.1000
10     46.5000  38.5000  42.7000  34.4000
15     43.9000  38.1000  40.7000  33.1000
20     42.1000  37.6000  38.7000  31.6000
50     37.8000  38.2000  37.2000  31.9000
100    36.4000  37.1000  36.8000  30.8000
5.2 MCA: TwoDictionary Case
We first describe the data set that we use for the MCA experiments. Following the notation introduced previously in the paper, we set the $\mathbf{x}_1$ components to be whole MNIST images and the $\mathbf{x}_2$ components to be non-overlapping patches from ASIRRA. We obtain training and testing patches from ASIRRA, and training and testing images from MNIST, and randomly mix MNIST images with ASIRRA patches to generate the mixed training and testing images. Optimal codes were computed using SALSA (Algorithm 2), run until convergence while ensuring that both components reached the desired sparsity level. Note that we also performed experiments on a mixed data set of CIFAR-10 and MNIST; those can be found in Section D in the Supplement.
An exhaustive hyperparameter search was performed for each encoding method and for each number of iterations $T$. The hyperparameter search included $\alpha$ for FISTA and LISTA, $\alpha_1$, $\alpha_2$, and $\mu$ for SALSA and LSALSA, as well as learning rates for LSALSA and LISTA.
Code prediction error curves are presented in Figures 8 and 9. LSALSA steadily outperforms the other methods until SALSA catches up at large $T$. FISTA and LISTA, lacking a mechanism for distinguishing the two dictionaries, struggle to estimate the optimal codes.
In Figure 7 we illustrate each method's sparsity/accuracy trade-off on the ASIRRA test data set while varying $T$ (Section E in the Supplement shows the same plot for a wider variety of $T$, as well as a similar plot for MNIST). LSALSA retains both the highest sparsity and the highest accuracy among all the methods, even for small $T$.
Similarly as before, we performed an evaluation on the classification task. A separate classifier was trained for each data set using the corresponding separated optimal codes. As before, a random Gaussian projection was used to reduce the dimensionality of the ASIRRA codes before inputting them to the classifier. The classification results are depicted in Table 3 for MNIST and Table 4 for ASIRRA.
Finally, in Figure 10 we present exemplary reconstructed images obtained by the different methods when performing source separation (more reconstruction results can be found in Section F in the Supplement). No additional learning was performed to achieve the reconstruction: the estimated codes were simply multiplied by the corresponding dictionary matrix, i.e., for LSALSA we have $\hat{\mathbf{x}}_i = \mathbf{D}_i\hat{\mathbf{z}}_i$, where $\hat{\mathbf{z}}_i$ represents the $i$-th component of the encoder's output. FISTA and LISTA are unable to separate the components without severely corrupting the ASIRRA component. LSALSA produces visually recognizable separations even at small $T$, and the MNIST component is almost entirely removed as $T$ grows.
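The reconstruction step itself is a single matrix multiplication per component; a minimal sketch, assuming the encoder outputs the concatenated code:

```python
import numpy as np

def separate(z_hat, D1, D2):
    """Split the concatenated code and reconstruct each source: x_hat_i = D_i z_hat_i."""
    m1 = D1.shape[1]
    z1, z2 = z_hat[:m1], z_hat[m1:]   # component codes from the concatenated output
    return D1 @ z1, D2 @ z2           # estimated source components
```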
Table 3: Classification Error (in %) on MNIST (MCA experiments).

Iter   FISTA    LISTA    SALSA    LSALSA
1      70.2886  16.8114  26.9935   2.3720
3      34.0881  16.0946  24.4524   3.6885
5      89.9683  14.7785  39.7438   1.1499
10     90.0000  15.0006   3.0251   0.0520
20     90.0000   9.7423   0.8543   0.0495
50      1.2976   6.7252   0.0243   0.0184
Table 4: Classification Error (in %) on ASIRRA (MCA experiments).

Iter   FISTA    LISTA    SALSA    LSALSA
1      43.7000  39.8000  46.7000  32.6000
3      41.5000  40.2000  42.4000  35.3000
5      49.8000  38.8000  43.8000  30.0000
10     49.9000  38.6000  28.9000  22.3000
20     45.5000  37.9000  23.0000  19.1000
50     28.4000  36.4000  12.4000  12.7000
6 Conclusions
In this paper we propose a deep encoder architecture, LSALSA, obtained from time-unfolding the Split Augmented Lagrangian Shrinkage Algorithm (SALSA). We demonstrate that LSALSA outperforms baseline methods such as SALSA, FISTA, and LISTA in terms of the quality of the predicted sparse codes as well as the running time, in both the single-dictionary and multiple-dictionary (MCA) cases. In the two-dictionary MCA setting, we furthermore show that LSALSA obtains a separation of image components with better visual quality than the separation obtained by SALSA.
References
 Adler & Öktem (2017) Adler, J. and Öktem, O. Learned primal-dual reconstruction. CoRR, abs/1707.06474, 2017.
 Afonso et al. (2010) Afonso, M., BioucasDias, J., and Figueiredo, M. Fast image recovery using variable splitting and constrained optimization. IEEE Trans. Image Processing, 19(9):2345–2356, 2010.
 Afonso et al. (2011) Afonso, M., BioucasDias, J., and Figueiredo, M. An augmented lagrangian approach to the constrained optimization formulation of imaging inverse problems. Trans. Img. Proc., 20(3):681–695, 2011.
 Bauschke & Combettes (2011) Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, 1st edition, 2011.
 Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
 Borgerding & Schniter (2016) Borgerding, M. and Schniter, P. Onsager-corrected deep learning for sparse linear inverse problems. In GlobalSIP, 2016.
 Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 Daubechies et al. (2004) Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
 Eckstein & Bertsekas (1992) Eckstein, J. and Bertsekas, D. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55:293–318, 1992.
 Elad et al. (2005) Elad, M., Starck, J.-L., Querre, P., and Donoho, D. L. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis, 19(3):340–358, 2005.
 Elson et al. (2007) Elson, J., Douceur, J., Howell, J., and Saul, J. Asirra: a CAPTCHA that exploits interest-aligned manual image categorization. In ACM CCS, 2007.
 Gers et al. (2003) Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. Learning precise timing with lstm recurrent networks. J. Mach. Learn. Res., 3:115–143, 2003.
 Golle (2008) Golle, P. Machine learning attacks against the asirra captcha. In ACM CCS, 2008.
 Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
 Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
 Kavukcuoglu et al. (2010) Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images, 2009.
 Le Roux et al. (2015) Le Roux, J., Hershey, J. R., and Weninger, F. Deep nmf for speech separation. In ICASSP, 2015.
 LeCun et al. (2009) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, 2009.
 Liu et al. (2017) Liu, S., Xian, Y., Li, H., and Yu, Z. Text detection in natural scene images using morphological component analysis and laplacian dictionary. IEEE/CAA Journal of Automatica Sinica, PP(99):1–9, 2017.
 Olshausen & Field (1996) Olshausen, B. and Field, D. Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
 Otazo et al. (2015) Otazo, R., Candès, E., and Sodickson, D. K. Lowrank and sparse matrix decomposition for accelerated dynamic mri with separation of background and dynamic components. Magn Reson Med, 73(3):1125–36, 2015.
 Parekh et al. (2014) Parekh, A., Selesnick, I., Rapoport, D., and Ayappa, I. Sleep spindle detection using timefrequency sparsity. In IEEE SPMB, 2014.
 Peyré et al. (2007) Peyré, G., Fadili, J., and Starck, J.-L. Learning adapted dictionaries for geometry and texture separation. In SPIE Wavelets, 2007.
 Peyré et al. (2010) Peyré, G., Fadili, J., and Starck, J.-L. Learning the morphological diversity. SIAM J. Imaging Sciences, 3(3):646–669, 2010.
 Shoham & Elad (2008) Shoham, N. and Elad, M. Algorithms for signal separation exploiting sparse representations, with application to texture image separation. In Proceedings of the IEEE 25th Convention of Electrical and Electronics Engineers in Israel, 2008.
 Sprechmann et al. (2013) Sprechmann, P., Litman, R., Yakar, T., Bronstein, A., and Sapiro, G. Efficient supervised sparse analysis and synthesis operators. In NIPS, 2013.
 Starck et al. (2004) Starck, J.-L., Elad, M., and Donoho, D. Redundant multiscale transforms and their application for morphological component separation. Advances in Imaging and Electron Physics, 132:287–348, 2004.
 Starck et al. (2005a) Starck, J.L., Elad, M., and Donoho, D. Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Processing, 14(10):1570–1582, 2005a.
 Starck et al. (2005b) Starck, J.-L., Moudden, Y., Bobin, J., Elad, M., and Donoho, D. Morphological component analysis. In Proc. SPIE Wavelets, 2005b.
 Uysal et al. (2016) Uysal, F., Selesnick, I., and Isom, B. Mitigation of wind turbine clutter for weather radar by signal separation. IEEE Trans. Geoscience and Remote Sensing, 54(5):2925–2934, 2016.
 Wang et al. (2016) Wang, Z., Ling, Q., and Huang, T. Learning deep l0 encoders. In AAAI, 2016.
 Wisdom et al. (2017) Wisdom, S., Powers, T., Pitton, J., and Atlas, L. Deep recurrent nmf for speech separation by unfolding iterative thresholding. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 254–258, 2017.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
 Yang et al. (2016) Yang, Y., Sun, J., Li, H., and Xu, Z. Deep ADMM-Net for compressive sensing MRI. In NIPS, 2016.
 Zhou et al. (2018) Zhou, J., Di, K., Du, J., Peng, X., Yang, H., Pan, S., Tsang, I., Liu, Y., Qin, Z., and Goh, R. SC2Net: Sparse LSTMs for sparse coding. In AAAI, 2018.
Appendix A Additional discussion on the difference between LSALSA and LISTA
The recursive formula in LISTA is given as
$$\mathbf{z}(t+1) = h_{\theta}\left(\mathbf{W}\mathbf{x} + \mathbf{S}\,\mathbf{z}(t)\right),\tag{18}$$
where $h_{\theta}$ denotes the soft-thresholding nonlinearity with threshold $\theta$. The recursive formula in LSALSA is derived below. We start with the output of the nonlinearity from Algorithm 1:
$$\mathbf{u}(t+1) = h_{\alpha/\mu}\left(\mathbf{z}(t+1) - \mathbf{d}(t)\right)\tag{19}$$
$$= h_{\alpha/\mu}\left(\mathbf{S}\left(\mathbf{W}\mathbf{x} + \mu(\mathbf{u}(t) + \mathbf{d}(t))\right) - \mathbf{d}(t)\right)\tag{20}$$
$$= h_{\alpha/\mu}\left(\mathbf{S}\mathbf{W}\mathbf{x} + \mu\mathbf{S}\,\mathbf{u}(t) + (\mu\mathbf{S} - \mathbf{I})\,\mathbf{d}(t)\right),\tag{21}$$
where $\mathbf{d}(t) = \mathbf{d}(0) - \sum_{k \leq t}\left(\mathbf{z}(k) - \mathbf{u}(k)\right)$ and $\mathbf{u}(t)$ is the output of the network nonlinearity. Clearly, in the case of LSALSA, the nonlinearity output depends on all of the previous layers' outputs. This dependence comes from the auxiliary variable $\mathbf{d}$, i.e., the Lagrange multipliers term. LISTA's output depends directly only on the previous layer.
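To make the contrast concrete, here is a schematic sketch of one layer of each encoder, with `soft` as the thresholding nonlinearity and `W`, `S` the learned operators; this illustrates the connectivity only, not a full implementation:

```python
def lista_layer(x, z_prev, W, S, theta):
    # Depends only on the previous layer's output (Equation 18).
    return soft(W @ x + S @ z_prev, theta)

def lsalsa_layer(x, u_prev, d_prev, W, S, mu, alpha):
    # The running multiplier d carries memory of all previous layers (Equations 19-21).
    z = S @ (W @ x + mu * (u_prev + d_prev))   # splitting step
    u = soft(z - d_prev, alpha / mu)           # thresholding step
    d = d_prev - (z - u)                       # multiplier (memory) update
    return u, d
```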
Appendix B Dictionary Learning Experiments
We visualize the learned dictionary atoms for both the single-dictionary case (Figures 11 and 12) and the two-dictionary (MCA) case (Figures 13 and 14).
B.1 Dictionaries used in single-dictionary experiments
B.2 Dictionaries used in MCA experiments
In the first set of MCA experiments, we performed source separation on MNIST + ASIRRA images. We used two dictionaries trained independently on whole MNIST images and on patches of ASIRRA images. In the second set of MCA experiments, we performed source separation on spatially added MNIST and CIFAR-10 images (more results of this experiment are shown in Section D of the Supplement). We used the same MNIST dictionary as in the MNIST + ASIRRA experiments and trained a CIFAR-10 dictionary on the whole grayscale CIFAR-10 data set. These dictionaries have 1024 atoms (complete), all normalized vectors of length 1024 reshaped to $32 \times 32$ for visualization. A subset of the atoms of the dictionaries used in the MCA experiments is visualized in Figures 13 and 14.
Appendix C Additional single dictionary experiments
The single-dictionary experiments on the MNIST and CIFAR-10 data sets are summarized below. The code prediction errors are captured in Figure 15 for MNIST and in Figure 16 for CIFAR-10. The classification results are captured in Table 5 for MNIST and Table 6 for CIFAR-10.
C.1 MNIST
The MNIST images were first scaled to pixel values in a fixed range and then divided into non-overlapping patches (ignoring extra pixels on the edges). Only patches with a standard deviation of at least 0.1 were used in training; the remaining ones were discarded, as they are practically all-black. Optimal codes were computed for each vectorized patch by minimizing the objective from Equation 4 with FISTA run until convergence, giving sparse optimal codes.
Table 5: Classification Error (in %) on MNIST (single dictionary).

Iter   FISTA    LISTA    SALSA    LSALSA
1      40.8682   4.5842  20.7291   1.9130
5       6.1046   4.6527   4.8549   1.7773
7       3.4363   2.0406   0.8097   0.4574
10      2.0326   1.3802   0.0990   0.0996
15      1.0725   0.8778   0.0200   0.0103
50      0.0205   0.6168   0.0000   0.0000
100     0.0000   0.4228   0.0000   0.0000
C.2 CIFAR-10
In the CIFAR-10 experiments, the natural images were first converted to grayscale, scaled to pixel values in a fixed range, and broken down into non-overlapping patches. Optimal codes were then computed on these patches in a similar fashion as described above for the MNIST data set.
Table 6: Classification Error (in %) on CIFAR-10 (single dictionary).

Iter   FISTA    LISTA    SALSA    LSALSA
1      86.8600  79.1300  89.0700  64.6900
5      82.3300  76.2600  87.2700  66.3100
7      79.4700  74.1000  82.7100  64.6400
10     75.5200  71.6500  82.8300  54.9800
15     70.1900  72.4500  75.4100  54.9900
50     43.1400  66.3400  43.6100  49.4100
100    67.8600  60.2200  10.4800  18.4400
Appendix D Additional MCA experiments
D.1 MNIST + CIFAR-10
The MNIST + CIFAR-10 MCA experimental results are summarized here. We combined whole MNIST digit images with grayscale CIFAR-10 images and performed source separation on them. Code prediction error curves are presented with respect to the number of iterations and the wall-clock time used to make the prediction in Figure 17. The classification results are captured in Table 7 for the MNIST codes and Table 8 for the CIFAR-10 codes.
Appendix E Additional plots: MNIST+ASIRRA
Figure 18 shows an extended version of Figure 7. In Figure 19 we illustrate each method's sparsity/accuracy trade-off on the MNIST test data set while varying $T$. LSALSA retains both the highest sparsity and the highest accuracy among all the methods, even for small $T$.