Dictionary Learning (DL)/Sparse Coding has demonstrated high potential for exploring the semantic information embedded in high-dimensional noisy data. It has been successfully applied to a range of inference tasks, such as image denoising, image restoration, image super-resolution [40, 25], audio processing and image classification.
While Synthesis Dictionary Learning (SDL) has been extensively investigated and widely used, Analysis Dictionary Learning (ADL)/Transform Learning, as a dual problem, has been attracting growing attention for, among other properties, its robustness [24, 3, 27]. DL-based methods have primarily focused on learning a one-layer dictionary and its associated sparse representation. Other variations on the classification theme have also appeared with the goal of addressing some recognized limitations, such as task-driven dictionary learning, first introduced to jointly learn the dictionary, its sparse representation, and its classification objective. A label-consistent term has additionally been considered. Class-specific dictionary learning has recently been shown to improve discrimination [23, 37, 35] at the expense of a higher complexity. On the ADL side, increasingly efficient classifiers [11, 32, 33, 28, 29] have resulted from numerous research efforts, and have been reported to outperform SDL in both the training and testing phases.
DL methods, with their associated sparse representations, present significant computational challenges that have been addressed by different techniques, including K-SVD [1, 24], SNS-ADL and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA). Although meant to provide a practically faster solution, the alternating minimization of FISTA still exhibits limitations and a relatively high computational cost.
To address these computational and scaling difficulties, differentiable programming solutions have also been developed to take advantage of the efficiency of neural networks. LISTA was first proposed to unfold the iterative soft-thresholding algorithm (ISTA) into an RNN format, thus speeding up SDL. Unlike conventional solutions for solving optimization problems, LISTA uses the forward and backward passes to simultaneously update the sparse representation and dictionary in an efficient manner. In the same spirit, sparse LSTM (SLSTM) adapts LISTA to a Long Short-Term Memory structure to automatically learn the dimension of the sparse representation.
Although the aforementioned differentiable programming methods are efficient at solving a single-layer DL problem, this formulation still does not yield the best performance in image classification tasks. With the fast development of deep learning, Deep Dictionary Learning (DDL) methods [31, 19] have thus come into play. A deep model consisting of an ADL followed by an SDL has been developed for image super-resolution. SDLs have also been deeply stacked to classify images, achieving promising and robust results. Unsupervised DDL approaches have also been proposed, with promising results [18, 12].
However, to the best of our knowledge, no DDL model which can provide both a fast and reliable solution has been proposed.
The work proposed herein aims at ensuring the discriminative ability of single-layer DL while providing the efficiency of end-to-end models. To this end, we propose a novel differentiable programming method to jointly learn a deep metric together with an associated transform. Cascading these canonical structures will exploit and strengthen the structure learning capacity of a deep network, yielding what we refer to as a Deep Transform and Metric Learning Network (DeTraMe-Net). This newly proposed approach not only increases the discrimination capabilities of DL, but also affords flexibility in constructing different DDL or Deep Neural Network (DNN) architectures. As will be shown later, this approach also resolves the initialization and gradient propagation issues that usually arise in DDL.
As shown in Figure 1, in each layer of DeTraMe-Net, the DL problem is decomposed as a transform learning one, i.e. a linear layer part, cascaded with a nonlinear component using a learned metric. The latter, referred to as Q-Metric Learning, is realized by an RNN. One of the contributions of our work is to show how DDL can theoretically be reformulated as such a combination of linear layers and RNNs. Decoupling the metric and the dual frame operator (pseudo-inverse of the dictionary) into two independent variables is also shown to introduce additional flexibility, and to improve the power of DL. On the practical side, and to achieve a faster and simpler implementation, we impose a block-diagonal structure for Q-Metric Learning, leading to parallel processing of independent channels. Moreover, a convolutional operator is also introduced to decrease the number of parameters, thus leading to a Convolutional-RNN. Additionally, the Q-Metric Learning part may be viewed as a non-separable activation function that can be flexibly included into any architecture. As a result, different new DeTraMe networks may be obtained by integrating Q-Metric Learning into various CNN architectures such as Plain CNN and ResNets. The resulting DeTraMe-Net-based architectures are demonstrated to be more discriminative than generic CNN models.
Although a CNN followed by an RNN has also been used for solving super-resolution and scene recognition tasks, respectively, those works directly used LISTA in their models. In turn, our method actually solves the same problem as LISTA. In addition, a sparse representation was jointly learned in those works, while a more discriminative DDL approach is achieved in ours. We also formally derive the linear and RNN-based layer structure from DDL, thus providing a theoretical justification and a rationale for such approaches. This may also open an avenue to new, more creative and better-performing alternatives.
Our main contributions are summarized below:
- We theoretically transform one-layer dictionary learning into transform learning and Q-Metric learning, and deduce how to convert DDL into DeTraMe-Net.
- Such joint transform learning and Q-Metric learning are successfully and easily implemented as a tandem of a linear layer and an RNN. A convolutional layer can be chosen for the linear part, and the RNN can also be simplified into a Convolutional-RNN. To the best of our knowledge, this is the first work which makes an insightful bridge between DDL methods and the combination of linear layers and RNNs, with the associated performance gains.
- The transform and Q-Metric learning uses two independent variables, one for the dictionary and the other for the dual frame operator of the dictionary. This bridges the current work to conventional SDL while introducing more discriminative power, and allowing the use of faster learning procedures than the original DL.
- The Q-Metric can also be viewed as a parametric non-separable nonlinear activation function, while in current neural network architectures, very few non-separable nonlinear operators are used (softmax, max pooling, average pooling). As a component of a neural network, it can be flexibly inserted into any network architecture to easily construct a DL layer.
- The proposed DeTraMe-Net is demonstrated to not only improve the discrimination power of DDL, but also to achieve a better performance than state-of-the-art CNNs.
The paper is organized as follows: In Section 2, we introduce the required background material. We derive the theoretical basis for our novel approach in Section 3. Its algorithmic solution is investigated in Section 4. Substantiating experimental results and evaluations are presented in Section 5. Finally, we provide some concluding remarks in Section 6.
2.1 Dictionary Learning for Classification
In task-driven dictionary learning , the common method for one-layer dictionary learning classifier is to jointly learn the dictionary matrix , the sparse representation of a given vector , and the classifier parameter . Let be the data and the associated labels. Task-driven DL can be expressed as finding
In SDL, we learn the composition of a dictionary and a sparse reconstruction in order to reconstruct or synthesize the data, hence yielding the standard formulation,
Alternatively, in ADL, we directly operate on the data using a dictionary, leading to,
may correspond to various kinds of loss functions, such as least-squares, cross-entropy, or hinge loss.
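For reference, the synthesis and analysis models are commonly written in the following generic form, without the classification term. The symbols used here ($X$ for the data matrix, $D$ for a synthesis dictionary, $\Omega$ for an analysis operator, $A$ for the sparse codes, $\lambda > 0$ for the sparsity weight) are assumptions chosen for illustration and need not match the notation of the equations in this section; normalization constraints on $D$ or $\Omega$, which are usually added to avoid trivial solutions, are omitted.

```latex
% Synthesis model (SDL): the data are reconstructed from the codes.
\min_{D,\,A}\; \tfrac{1}{2}\,\| X - D A \|_F^2 + \lambda\, \| A \|_1

% Analysis model (ADL): the operator acts directly on the data.
\min_{\Omega,\,A}\; \tfrac{1}{2}\,\| \Omega X - A \|_F^2 + \lambda\, \| A \|_1
```

In a task-driven setting, a loss on the labels evaluated on the codes $A$ would be added to either objective.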
2.2 Deep Dictionary Learning for Classification
An efficient DDL approach  consists of computing
denotes the estimated label, is the classifier matrix, is a nonlinear function, and
where denotes the composition of operators. For every layer is a reshaping operator, which is a tall matrix. Moreover, is a nonlinear operator computing a sparse representation within a synthesis dictionary matrix . More precisely, for a given matrix ,
where , , and is a function in , the class of proper lower semicontinuous convex functions from to . A simple choice consists in setting to zero, while adopting the following specific form for ;
where denotes the indicator function of a set (equal to zero in and otherwise). Note that Eq. (6) corresponds to the minimization of a strongly convex function, which thus admits a unique minimizer, so making the operator properly defined.
3 Joint Deep Metric and Transform Learning
3.1 Proximal interpretation
Our goal here is to establish an equivalent but more insightful solution for in each layer.
Claim 1: can be solved by a proximal operator of a transform learning with a metric :
To simplify notation, we omit the superscript which denotes the layer in Eq. (6) which, in turn, aims at finding the sparse representation . For every , and , Eq. (7) can thus be re-expressed as follows:
This thus establishes a re-expression of the solution of the representation procedure as the proximity operator of within the metric induced by the symmetric definite positive matrix [6, 5]. Furthermore, it shows that the SDL can be equivalently viewed as an ADL formulation involving the dictionary matrix , provided that a proper metric is chosen.
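The argument behind this re-expression amounts to completing a square. The following sketch uses assumed single-sample notation, which may differ from the symbols of the original equations: $x$ is a data vector, $D$ the synthesis dictionary, $\alpha > 0$ a quadratic regularization weight, $c$ a linear shift, $\lambda\psi$ the convex penalty, $Q = D^\top D + \alpha I$ the metric and $F = Q^{-1} D^\top$ the analysis transform.

```latex
\begin{align*}
a^\star
 &= \operatorname*{argmin}_{a}\;
    \tfrac12 \|x - D a\|^2 + \tfrac{\alpha}{2}\|a\|^2 + c^\top a + \lambda\psi(a) \\
 &= \operatorname*{argmin}_{a}\;
    \tfrac12\, a^\top (D^\top D + \alpha I)\, a - a^\top (D^\top x - c) + \lambda\psi(a) \\
 &= \operatorname*{argmin}_{a}\;
    \tfrac12\, (a - u)^\top Q\, (a - u) + \lambda\psi(a),
    \qquad u = Q^{-1}(D^\top x - c) = F x - Q^{-1} c, \\
 &= \operatorname{prox}^{Q}_{\lambda\psi}\bigl(F x - Q^{-1} c\bigr),
    \qquad
    \operatorname{prox}^{Q}_{f}(u)
      = \operatorname*{argmin}_{a}\; \tfrac12 \|a - u\|_Q^2 + f(a),
\end{align*}
```

where $\|v\|_Q^2 = v^\top Q v$ and terms independent of $a$ have been dropped at each step. Since $\alpha > 0$, $Q$ is symmetric positive definite, so the metric is well defined; when $\alpha \to 0$ and $D$ has full column rank, $F$ tends to the pseudo-inverse of $D$.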
3.2 Multilayer representation
where, for , the affine operators mapping to by an analysis transform and a shift term , and explicitly as,
Eq. (15) shows that, for each layer , we obtain a structure similar to a linear layer by treating as the weight operator and as the bias parameter, which are referred to as the Transform Learning part of the DeTraMe method. In standard Forward Neural Networks (FNNs), the activation functions can be interpreted as proximity operators of convex functions. Eq. (14) attests that our model is more general, in the sense that different metrics are introduced for these operators. In the next section, we propose an efficient method to learn these metrics in a supervised manner.
4 Q-Metric Learning
4.1 Prox computation
Reformulation (14) has the great advantage of allowing us to benefit from algorithmic frameworks developed for FNNs, provided that we are able to compute efficiently
where is the -weighted Frobenius norm. Hereabove, is a matrix where the samples associated with the training set have been stacked columnwise. A similar convention is used to construct and from and . An elastic-net like regularization is chosen by setting with . We have, in particular, observed that the last quadratic term has a positive influence in increasing stability and avoiding overfitting. As in Eq. (12), Eq. (17) is actually equivalent to solving the following optimization problem:
Claim 2: We show next that the solution of Eq. (18) is obtained as an iteration of the form:
Various iterative splitting methods could be used to find the unique minimizer of the above optimized convex function [4, 15]. Our purpose is to develop an algorithmic solution for which classical NN learning techniques can be applied in a fast and convenient manner. By subdifferential calculus, the solution to the problem (18) satisfies the following optimality condition:
where . Element-wise rewriting of Eq. (20) yields, for every , and ,
Let us adopt a block-coordinate approach and update the -th row of by fixing all the other ones. As is a positive definite matrix, and Eq. (21) implies that
where . And let
with denoting the Hadamard (element-wise) product. Note that a similar expression can be derived by applying a preconditioned forward-backward algorithm  to Eq. (18), where the preconditioning matrix is , which has been detailed in the supplementary material. The implementation of the method allowing us to compute the proximity operator in (17) is summarized below:
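As an illustration only, and not a reproduction of Alg. 1, the row-wise update described above can be sketched in NumPy for a single sample. The sketch assumes a nonnegative elastic-net penalty and uses plain cyclic coordinate descent rather than the reparameterized recurrence of the paper; the names `prox_q_metric`, `synthesis_code`, `lam`, `beta` and `alpha` are hypothetical. It also checks numerically that the metric prox applied to the analysis transform of the data matches the synthesis sparse code computed by a standard proximal gradient method.

```python
import numpy as np

def prox_q_metric(z, Q, lam, beta=0.0, n_sweeps=500):
    """Cyclic coordinate descent (assumed solver, not the paper's Alg. 1) for
        min_{a >= 0} 0.5*(a - z)^T Q (a - z) + lam*sum(a) + 0.5*beta*||a||^2,
    with Q symmetric positive definite."""
    d = np.diag(Q)
    qz = Q @ z
    a = np.maximum(z, 0.0)
    for _ in range(n_sweeps):
        for i in range(a.size):
            # Contribution of the other coordinates (row i of Q without q_ii).
            others = Q[i] @ a - d[i] * a[i]
            a[i] = max((qz[i] - others - lam) / (d[i] + beta), 0.0)
    return a

def synthesis_code(x, D, lam, alpha, n_iter=3000):
    """Reference solver: proximal gradient on
        min_{a >= 0} 0.5*||x - D a||^2 + 0.5*alpha*||a||^2 + lam*sum(a)."""
    Q = D.T @ D + alpha * np.eye(D.shape[1])
    step = 1.0 / np.linalg.norm(Q, 2)  # inverse of the largest eigenvalue of Q
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = Q @ a - D.T @ x
        a = np.maximum(a - step * (grad + lam), 0.0)
    return a

# Small synthetic check (all sizes and values are arbitrary).
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))
x = rng.standard_normal(8)
alpha, lam = 0.5, 0.1
Q = D.T @ D + alpha * np.eye(5)        # metric
F = np.linalg.solve(Q, D.T)            # analysis transform Q^{-1} D^T
a_metric = prox_q_metric(F @ x, Q, lam)
a_synth = synthesis_code(x, D, lam, alpha)
print("max abs difference:", np.max(np.abs(a_metric - a_synth)))
```

With `Q` equal to the identity, the routine reduces to a shifted and scaled ReLU, which is consistent with the interpretation of standard activations as proximity operators.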
4.2 RNN implementation
Given , , and , Alg. (1) can be viewed as an RNN structure for which is the hidden variable and is a constant input over time. By taking advantage of existing gradient back-propagation techniques for RNNs, can thus be directly computed in order to minimize the global loss . This shows that, thanks to the re-parameterization in Eq. (23), Q-Metric Learning has been recast as the training of a specific RNN.
Note that is a symmetric matrix. In order to reduce the number of parameters and ease of optimizing them, we choose a block-diagonal structure for . In addition, for each of the blocks, either an arbitrary or convolutive structure can be adopted. Since the structure of is reflected by the structure of , this leads in Eq. (19) to fully connected or convolutional layers where the channel outputs are linked to non overlapping blocks of the inputs. In our experiments on images, Convolutional-RNNs have been preferred for practical efficiency.
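The recurrent view of this subsection can be sketched as follows. This is a minimal illustration under assumptions, not the parameterization of Eq. (23): the hypothetical function `q_metric_rnn` applies a parallel (Jacobi-type) update in which a zero-diagonal weight matrix `W` multiplies the hidden state and a term `h` built from the input stays constant over the recurrence. The metric `Q` below is chosen diagonally dominant so that the recurrence is a contraction; convergence is not claimed for arbitrary metrics.

```python
import numpy as np

def q_metric_rnn(Z, Q, lam, beta, n_steps=50):
    """Unrolled recurrence A_{t+1} = ReLU(W A_t + h) for
        min_{A >= 0} 0.5*||A - Z||_{F,Q}^2 + lam*sum(A) + 0.5*beta*||A||_F^2.
    Z stacks the samples columnwise; W and h are illustrative names."""
    d = np.diag(Q) + beta
    W = -(Q - np.diag(np.diag(Q))) / d[:, None]  # recurrent weights, zero diagonal
    h = (Q @ Z - lam) / d[:, None]               # constant input at every step
    A = np.zeros_like(Z)
    for _ in range(n_steps):
        A = np.maximum(W @ A + h, 0.0)           # ReLU keeps the codes nonnegative
    return A, W

# Diagonally dominant metric: ||W||_inf < 1, so the recurrence contracts.
k, n_samples = 4, 6
Q = 2.0 * np.eye(k) + 0.1 * (np.ones((k, k)) - np.eye(k))
rng = np.random.default_rng(1)
Z = rng.standard_normal((k, n_samples))
lam, beta = 0.1, 0.5
A, W = q_metric_rnn(Z, Q, lam, beta)

# Optimality residual: zero on the support, nonnegative elsewhere.
G = Q @ (A - Z) + lam + beta * A
print("max residual on support:", np.max(np.abs(G[A > 0]), initial=0.0))
```

In a learned setting, `W` and the quantities defining `h` would be trainable parameters updated by back-propagation through the unrolled steps, and a block-diagonal or convolutional structure would replace the dense `W`.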
4.3 Training procedure
We have finally transformed our DDL approach into an alternation of linear layers and specific RNNs. This not only simplifies the implementation of the resulting DeTraMe-Net by making use of standard NN tools, but also allows us to employ well-established stochastic gradient-based learning strategies. Let be the learning rate at iteration , the simplified form of a training method for DeTraMe-Nets is provided in Alg. 2.
The constraints on the parameters of the RNNs have been imposed by projections. In Alg. 2, denotes the projection onto a nonempty closed convex set and is the vector space of matrices with diagonal terms equal to 0.
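The projection step can be illustrated by a short sketch. Only the zero-diagonal constraint stated above is covered; the function names and the plain SGD update are assumptions for illustration and do not reproduce Alg. 2. For the Frobenius inner product, the orthogonal projection onto the vector space of matrices with zero diagonal simply sets the diagonal entries to zero.

```python
import numpy as np

def project_zero_diagonal(W):
    """Orthogonal (Frobenius) projection onto matrices with zero diagonal."""
    P = np.array(W, dtype=float, copy=True)
    np.fill_diagonal(P, 0.0)
    return P

def projected_sgd_step(W, grad, lr):
    """One gradient step followed by the projection (illustrative only)."""
    return project_zero_diagonal(W - lr * grad)

# Toy usage with arbitrary values.
W0 = np.array([[0.0, 0.2], [0.3, 0.0]])
g = np.array([[1.0, -1.0], [0.5, 2.0]])
W1 = projected_sgd_step(W0, g, lr=0.1)
print(W1)
```

Because the constraint set is a vector space, the projection is linear and idempotent, so applying it after every update keeps the iterates feasible at negligible cost.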
5 Experiments and Results
In this section, our DeTraMe-Net method is evaluated on three popular datasets, namely CIFAR10, CIFAR100 and Street View House Numbers (SVHN). Since the common NN architectures are plain networks, such as ALL-CNN, and residual ones, such as ResNet and WideResNet, we compare DeTraMe-Net with these three state-of-the-art architectures.
Since we break SDL into two independent parts, a linear layer and an RNN, RNNs can be flexibly inserted into any nonlinear layer of a deep neural network. After choosing convolutional linear layers, we can construct two different architectures by inserting the RNN into plain networks and residual blocks. One is obtained by replacing all the ReLU activation layers in PlainNet with Q-Metric ReLU, leading to DeTraMe-PlainNet. The other is obtained by replacing the ReLU layer inside the block in ResNet with Q-Metric ReLU, giving rise to DeTraMe-ResNet. When all the ReLU layers are replaced, DeTraMe-PlainNet becomes equivalent to DDL as explained in Section 4. When only a single ReLU layer is replaced in the ResNet architecture, a new DeTraMe-ResNet structure is built. The detailed architectures are illustrated in the supplementary materials.
|DeTraMe-PlainNet 3-layer||PlainNet 3-layer||PlainNet 6-layer||PlainNet 9-layer||PlainNet 12-layer|
|Input 32 x 32 RGB Image with dropout(0.2)|
|conv 96||conv 96 RELU||conv 96 RELU||conv 96 RELU||conv 96 RELU|
|+ Q-Metric: conv 96||conv 96 RELU||conv 96 RELU||conv 96 RELU|
|conv 96 RELU||conv 96 RELU|
with stride=2, dropout(0.5)
|with stride=2, dropout(0.5)|
|conv 192 RELU|
|conv 96||conv 96 RELU||conv 96 RELU||conv 192 RELU||conv 192 RELU|
|with stride=2||with stride=2||with stride=2, dropout(0.5)||conv 192 RELU||conv 192 RELU|
|+ Q-Metric: conv 96||conv 192 RELU||conv 192 RELU||with stride=2, dropout(0.5)|
|with stride=2, dropout(0.5)||conv 192 RELU|
|conv 10||conv 10 RELU||conv 192 RELU||conv 192 RELU||conv 192 RELU|
|with stride=2||with stride=2||conv 10 RELU||conv 192 RELU||with stride=2|
|+Q-Metric: conv 10||with stride=2||conv 10 RELU||conv 192 RELU|
|conv 192 RELU|
|conv 10 RELU|
|Global Average Pooling|
For the PlainNet, we use a 9-layer architecture similar to ALL-CNN with dropouts, as listed in Table 1. For the ResNet architecture, we follow the standard setting: the first layer is a convolutional layer with 16 filters. Three residual blocks with output map sizes of 32, 16, and 8 are then used, with 16, 32 and 64 filters for each block, respectively. The network ends with a global average pooling and a fully-connected layer. The parameters listed in Table 2 are chosen separately for the ResNet 8-, 20-, 56-, 110- and 164-layer networks, and for the WideResNet 16-4 and WideResNet 16-8 networks, as suggested in the literature.
|output map size|
For DeTraMe-Net, we use convolutional RNNs having the same filter size (resp. number of channels) as those in the convolutional layer before. The number of parameters of each model as well as the number of iterations performed in RNNs, are indicated in Table 4.
5.2 Datasets and Training Settings
CIFAR10 contains 60,000 color images divided into 10 classes. 50,000 images are used for training and 10,000 images for testing. CIFAR100 also consists of color images. However, it includes 100 classes with 50,000 images for training and 10,000 images for testing. SVHN contains 630,420 color images of size 32 × 32. 604,388 images are used for training and 26,032 images are used for testing.
For the CIFAR datasets, the normalized input image is randomly cropped after padding each side of the image, and randomly flipped, similarly to [13, 38]. No other data augmentation is used. For SVHN, we normalize the range of the images between 0 and 1. All the models are trained on an Nvidia V100 32 GB GPU with a mini-batch size of 128. The models of both PlainNet and ResNet architectures are trained by an SGD optimizer with momentum equal to 0.9 and with weight decay. On the CIFAR datasets, the algorithm starts with a learning rate of 0.1. 300 epochs are used to train the models, and the learning rate is reduced at the 150-th and 225-th epochs. On the SVHN dataset, a learning rate of 0.01 is used at the beginning and is then divided by 10 at the 80-th and 120-th epochs within a total of 160 epochs. These settings follow those of prior work.
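The step schedules described above can be written as a small helper. This is a sketch under assumptions: the function name `step_lr` is hypothetical, epochs are counted from 0, and the reduction factor of 10 is only stated in the text for SVHN, so using the same factor for the CIFAR schedule is an assumption.

```python
def step_lr(epoch, base_lr, milestones, factor=10.0):
    """Learning rate at a given epoch (0-indexed, assumed convention).

    The rate is divided by `factor` once for every milestone already reached.
    """
    n_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** n_drops)

# SVHN: 160 epochs, initial rate 0.01, divided by 10 at epochs 80 and 120.
svhn_schedule = [step_lr(e, 0.01, (80, 120)) for e in range(160)]

# CIFAR: 300 epochs, initial rate 0.1, reduced at epochs 150 and 225
# (the factor of 10 is an assumption here).
cifar_schedule = [step_lr(e, 0.1, (150, 225)) for e in range(300)]
```

In a deep learning framework, the same behaviour is usually obtained with a built-in multi-step scheduler attached to the SGD optimizer.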
5.3.1 DeTraMe-Net vs. DDL
|Model||# Parameters||CIFAR10||CIFAR100|
|DDL 9||1.4M||0.9304||0.6876|
First, we compare our results with those achieved by the DDL approach. As we break the dictionary and its pseudo-inverse into two independent variables, a higher number of parameters is involved in DeTraMe-Net than in DDL. However, DeTraMe-Net presents two main advantages. The first one is a better capability to discriminate: in Table 3, compared to DDL, DeTraMe-Net achieves improvements on both the CIFAR10 and CIFAR100 datasets. The second advantage is that DeTraMe-Net is implemented in a network framework, with no need for extra functions to compute gradients at each layer. Moreover, by taking advantage of the techniques developed for neural networks, DeTraMe-Net does not meet the difficulties of sensitivity to initialization and gradient propagation that the original DDL approach faces.
5.3.2 DeTraMe-Net vs. Generic CNNs
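|Network||# Parameters (original)||# Parameters (DeTraMe-Net)||CIFAR10+ (original)||CIFAR10+ (DeTraMe-Net, RNN iterations)||CIFAR100+ (original)||CIFAR100+ (DeTraMe-Net, RNN iterations)||SVHN (original)||SVHN (DeTraMe-Net, RNN iterations)|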
|PlainNet 3-layer||0.094M||0.261M||0.4248||0.8867 (5)||0.2209||0.6475 (3)||0.4564||0.9721 (8)|
|PlainNet 6-layer||1.016M||1.929M||0.8634||0.9241 (2)||0.6275||0.7014 (2)||0.9755||0.9817 (5)|
|PlainNet 9-layer||1.370M||2.984M||0.9036||0.9340 (2)||0.6591||0.7034 (2)||0.9798||0.9826 (5)|
|PlainNet 12-layer||2.366M||3.980M||0.9108||0.9361 (2)||0.6901||0.7117 (2)||0.9814||0.9827 (3)|
|ResNet 8||0.074M||0.123M||0.8782||0.8941 (3)||0.5997||0.6527 (2)||0.9670||0.9750 (3)|
|ResNet 20||0.268M||0.413M||0.9214||0.9253 (3)||0.6833||0.6890 (2)||0.9770||0.9782 (2)|
|ResNet 56||0.848M||0.994M||0.9365||0.9375 (3)||0.7113||0.7166 (2)||0.9796||0.9804 (2)|
|ResNet 110||1.719M||1.867M||0.9374||0.9377 (2)||0.7273||0.7364 (2)||-||-|
|ResNet 164||2.590M||2.738M||0.9359||0.9439 (2)||0.7357||0.7441 (2)||-||-|
|WideResNet 16-4||3.585M||5.136M||0.9525||0.9531 (2)||0.7679||0.7761 (3)||0.9806||0.9816 (3)|
|WideResNet 16-8||10.783M||16.983M||0.9572||0.9579 (2)||0.7945||0.8048 (3)||0.9817||0.9823 (3)|
We next compare DeTraMe-Net with generic CNNs with respect to three different aspects: Accuracy, Parameter number and Capacity.
Accuracy. As shown in Table 4, with the same architecture, DeTraMe-Net structures achieve an overall better performance than the corresponding generic CNN models. For the PlainNet architecture, DeTraMe-Net increases the accuracy on CIFAR10, CIFAR100 and SVHN at every depth considered. For the ResNet architecture, DeTraMe-Net also consistently increases the accuracy on all datasets.
Parameter number. Although, for a given architecture, DeTraMe-Net improves the accuracy, it involves more parameters. However, as demonstrated in Figure 2, for a given number of parameters, DeTraMe-Net outperforms the original CNNs over all three datasets. Plots corresponding to DeTraMe-Net for both PlainNet and ResNet architectures are indeed above those associated with standard CNNs.
Capacity. In terms of depth, comparing the improvements obtained with PlainNet and ResNet shows that the shallower the network, the larger the accuracy gain. It is remarkable that DeTraMe-Net leads to an absolute accuracy increase of more than 0.42 for PlainNet 3-layer on each of the CIFAR10, CIFAR100 and SVHN datasets (Table 4). When the networks become deeper, they better capture discriminative features of the classes and, albeit with smaller gains, DeTraMe-Net still achieves a better accuracy than a generic deep CNN, e.g. around 0.008 higher than ResNet 164 on CIFAR10 and CIFAR100. In terms of width, we use WideResNet-16-4 and WideResNet-16-8 as two reference models, since both of them include 16 layers but have different widths. Table 4 shows that increasing width is beneficial to DeTraMe-Net. Since the original models have already achieved excellent performance on CIFAR10 and SVHN, DeTraMe-Nets with various widths show similar, slightly improved accuracies. However, for CIFAR100, enlarging the width for DeTraMe-Net leads to an increase in the accuracy gain from 0.0082 to 0.0103.
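The gains discussed in this comparison can be recomputed directly from the PlainNet rows of Table 4. The sketch below assumes that, for each dataset, the first accuracy in a row belongs to the original network and the second to its DeTraMe-Net counterpart; the variable names are illustrative.

```python
from statistics import median

# PlainNet rows of Table 4 as (original, DeTraMe-Net) accuracy pairs,
# assuming this column order.
plainnet = {
    "3-layer":  {"CIFAR10": (0.4248, 0.8867), "CIFAR100": (0.2209, 0.6475), "SVHN": (0.4564, 0.9721)},
    "6-layer":  {"CIFAR10": (0.8634, 0.9241), "CIFAR100": (0.6275, 0.7014), "SVHN": (0.9755, 0.9817)},
    "9-layer":  {"CIFAR10": (0.9036, 0.9340), "CIFAR100": (0.6591, 0.7034), "SVHN": (0.9798, 0.9826)},
    "12-layer": {"CIFAR10": (0.9108, 0.9361), "CIFAR100": (0.6901, 0.7117), "SVHN": (0.9814, 0.9827)},
}

def gains(table, dataset):
    """Absolute accuracy gain of DeTraMe-Net over the original, per depth."""
    return {depth: row[dataset][1] - row[dataset][0] for depth, row in table.items()}

for dataset in ("CIFAR10", "CIFAR100", "SVHN"):
    g = gains(plainnet, dataset)
    print(dataset,
          "median gain: %.4f" % median(g.values()),
          "min gain: %.4f" % min(g.values()))
```

The per-depth gains decrease monotonically with depth on all three datasets, which is the trend described in the Capacity paragraph.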
Starting from a DDL formulation, we have shown that it is possible to reformulate the problem as a standard optimization problem with the introduction of metrics within standard activation operators. This yields a novel Deep Transform and Metric Learning problem. This has allowed us to show that the original DDL can be performed thanks to a network mixing linear layer and RNN algorithmic structures, thus leading to a fast and flexible network framework for building efficient DDL-based classifiers with a higher discriminative ability. Our experiments show that the resulting DeTraMe-Net performs better than the original DDL approach and state-of-the-art generic CNNs. We think that the bridge we established between DDL and DNN will help in further understanding and controlling these powerful tools so as to attain better performance and properties.
-  Aharon, M., Elad, M., Bruckstein, A.: K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11), 4311–4322 (2006)
-  Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
-  Bian, X., Krim, H., Bronstein, A., Dai, L.: Sparsity and nullity: Paradigms for analysis dictionary learning. SIAM Journal on Imaging Sciences 9(3), 1107–1126 (2016)
-  Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge university press (2004)
-  Chouzenoux, E., Pesquet, J.C., Repetti, A.: Variable metric forward-backward algorithm for minimizing the sum of a differentiable function and a convex function. Journal of Optimization Theory and Applications 162(1), 107–132 (Jul 2014)
-  Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer-Verlag, New York (2010)
-  Combettes, P.L., Pesquet, J.C.: Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis (2018), https://arxiv.org/abs/1808.07526
-  Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12), 3736–3745 (Dec 2006). https://doi.org/10.1109/TIP.2006.881969
-  Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. pp. 399–406. Omnipress (2010)
-  Grosse, R., Raina, R., Kwong, H., Ng, A.Y.: Shift-invariant sparse coding for audio classification. In: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence. pp. 149–158. UAI’07, AUAI Press, Arlington, Virginia, USA (2007)
-  Guo, J., Guo, Y., Kong, X., Zhang, M., He, R.: Discriminative analysis dictionary learning. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
-  Gupta, P., Maggu, J., Majumdar, A., Chouzenoux, E., Chierchia, G.: Deconfuse: A deep convolutional transform based unsupervised fusion framework. Tech. rep. (2020), https://hal.archives-ouvertes.fr/hal-02461768
-  Huang, J.J., Dragotti, P.L.: A deep dictionary model for image super-resolution. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, Calgary, Canada (March 2018)
-  Komodakis, N., Pesquet, J.C.: Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine 32(6), 31–5 (Nov 2014)
-  Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
-  Liu, Y., Chen, Q., Chen, W., Wassell, I.: Dictionary learning inspired deep network for scene recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
-  Maggu, J., Majumdar, A.: Unsupervised deep transform learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6782–6786. Calgary, Canada (15-20 April 2018)
-  Mahdizadehaghdam, S., Dai, L., Krim, H., Skau, E., Wang, H.: Image classification: A hierarchical dictionary learning approach. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2597–2601. IEEE (2017)
-  Mahdizadehaghdam, S., Panahi, A., Krim, H., Dai, L.: Deep dictionary learning: A parametric network approach. IEEE Transactions on Image Processing (2019)
-  Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., Bach, F.R.: Supervised dictionary learning. In: Advances in Neural Information Processing Systems. pp. 1033–1040 (2009)
-  Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
-  Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. pp. 3501–3508. IEEE (2010)
-  Rubinstein, R., Peleg, T., Elad, M.: Analysis k-svd: A dictionary-learning algorithm for the analysis sparse model. IEEE Transactions on Signal Processing 61(3), 661–677 (2013)
-  Skau, E., Wohlberg, B., Krim, H., Dai, L.: Pansharpening via coupled triple factorization dictionary learning. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1234–1237. IEEE (2016)
-  Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
-  Tang, W., Otero, I.R., Krim, H., Dai, L.: Analysis dictionary learning for scene classification. In: Statistical Signal Processing Workshop (SSP), 2016 IEEE. pp. 1–5. IEEE (2016)
-  Tang, W., Panahi, A., Krim, H., Dai, L.: Structured analysis dictionary learning for image classification. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2181–2185. IEEE (2018)
-  Tang, W., Panahi, A., Krim, H., Dai, L.: Analysis dictionary learning: an efficient and discriminative solution. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3682–3686. IEEE (2019)
-  Tang, W., Panahi, A., Krim, H., Dai, L.: Analysis dictionary learning based classification: Structure for robustness. IEEE Transactions on Image Processing 28(12), 6035–6046 (2019)
-  Tariyal, S., Majumdar, A., Singh, R., Vatsa, M.: Deep dictionary learning. IEEE Access 4, 10096–10109 (2016)
-  Wang, J., Guo, Y., Guo, J., Luo, X., Kong, X.: Class-aware analysis dictionary learning for pattern classification. IEEE Signal Processing Letters 24(12), 1822–1826 (2017)
-  Wang, Q., Guo, Y., Guo, J., Kong, X.: Synthesis k-svd based analysis dictionary learning for pattern classification. Multimedia Tools and Applications 77(13), 17023–17041 (2018)
-  Wang, Z., Liu, D., Yang, J., Han, W., Huang, T.: Deep networks for image super-resolution with sparse prior. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 370–378 (2015)
-  Wang, Z., Yang, J., Nasrabadi, N., Huang, T.: A max-margin perspective on sparse representation-based classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1217–1224 (2013)
-  Xu, M., Jia, X., Pickering, M., Plaza, A.J.: Cloud removal based on sparse representation via multitemporal dictionary learning. IEEE Transactions on Geoscience and Remote Sensing 54(5), 2998–3006 (2016). https://doi.org/10.1109/tgrs.2015.2509860
-  Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. In: 2011 International Conference on Computer Vision. pp. 543–550. IEEE (2011)
-  Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)
-  Zhang, D., Liu, P., Zhang, K., Zhang, H., Wang, Q., Jing, X.: Class relatedness oriented-discriminative dictionary learning for multiclass image classification. Pattern Recognition 59(C), 168–175 (Nov 2016). https://doi.org/10.1016/j.patcog.2015.12.005
-  Zhong, W.: Robust object tracking via sparsity-based collaborative model. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). p. 1838–1845. CVPR ’12, IEEE Computer Society, USA (2012)
-  Zhou, J.T., Di, K., Du, J., Peng, X., Yang, H., Pan, S.J., Tsang, I.W., Liu, Y., Qin, Z., Goh, R.S.M.: Sc2net: Sparse lstms for sparse coding. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)