is usually applied to optimize the parameters of the linear transforms and the non-linearities within this highly non-linear and non-convex system. With the help of massive labeled training data and powerful Graphics Processing Units (GPU), DNNs have achieved outstanding performance in many signal processing and computer vision tasks. However, the working of DNNs is still not completely clear. The optimized DNNs are usually treated as black-box systems. Some natural questions are what are the functions of the linear transform and the non-linearities and what is the role of the “cascade” in DNNs.
Some recent works have tried to provide insights into the workings of DNNs. Bruna and Mallat [31, 5] proposed a Scattering Convolutional Network (SCN) by replacing the learned filters with wavelet-like transforms. SCN provides feature representations which are translation and rotation invariant. Zeiler and Fergus 
proposed a deconvolution technique to visualize the intermediate feature layers of a Convolutional Neural Network (CNN) trained for image classification. The filters in the first layer are Gabor like, and the deeper layer feature maps tend to be active only for certain objects. In, the authors suggested that an auto-encoder partitions the low-dimensional data manifold into cells and approximates each piece by a hyper-plane. The encoding capability limit of a DNN is described by the upper bound of the number of cells. Montufar et al. 
have shown that the number of cells is exponentially larger than the degrees of freedom. Giryeset al.  theoretically analyzed the fully connected DNN with i.i.d. random Gaussian weights. They prove that a DNN with random Gaussian weights performs a distance-preserving embedding of the data.
Contrary to DNNs, the sparse representation framework  is much more established. Sparse representation over an over-complete dictionary can serve as an effective and robust representation in both classification and regression problems. Depending on the way the signal is modelled, the sparse representation model can be divided into synthesis or analysis model .
A synthesis model  represents a signal as a linear combination of a small number of column atoms from a redundant synthesis dictionary with and . Sparse pursuit is to seek the sparsest representation given the input signal and the dictionary . Sparse pursuit algorithms include greedy algorithms [30, 7, 4] and convex relaxation based algorithms [43, 6, 8, 3]. The greedy algorithms, like Orthogonal Matching Pursuit (OMP) , find at each iteration a sparse coefficient with the aim of reducing the approximation error. Convex relaxation algorithms relax the non-convex norm to a convex norm. The sparse representation problem can then be solved using basis pursuit (BP)  or iterative algorithms, like Iterative Soft-Thresholding Algorithm (ISTA) [8, 3]. The synthesis dictionary learning algorithms [15, 2] mainly take an alternating minimization strategy and iterate between a sparse-coding stage and a dictionary update stage. With well established theories and algorithms, sparse representation over redundant synthesis model has been extensively used in signal and image processing.
Recently, the analysis model [12, 38] has attracted more research interests. A redundant analysis dictionary is a tall matrix with where each row of is an atom of the dictionary. The expectation is that the analysis coefficients are sparse. This means that the analysis dictionary should be able to sparsify the input signal, whilst preserving its essential information. The analysis dictionary usually serves as a regularization term in the optimization formulation and models the co-sparse prior which can be considered as an extension of the Total Variation (TV) prior. Alternating minimization strategy can also be applied for learning analysis dictionaries [38, 46, 11, 36, 34]. Analysis K-SVD  iterates between an analysis pursuit operation and a K-SVD dictionary update stage. Yaghoobi et al.  proposed a uniformly-normalized tight frame constraint for learning analysis operators. Analysis Simultaneous Codeword Optimization (ASimCO) algorithm  enforces a -sparse constraint on the sparse-coding stage and updates multiple dictionary atoms simultaneously in the dictionary update stage. Sparsifying transform learning [36, 34] proposed to constrain the analysis operator to be full rank and well-conditioned. The GeOmetric Analysis Operator Learning (GOAL) algorithm [17, 23] learns the analysis dictionary by employing an alternative optimization strategy. It performs dictionary learning on manifolds by minimizing an objective function which promotes sparse representation and also imposes full rank constraint.
Building a deep model using sparse representation over redundant synthesis dictionaries has facilitated interpretations of DNNs. Rubinstein and Elad  proposed an Analysis-Synthesis Thresholding (AST) model for image deblurring which consists of an analysis dictionary, element-wise hard-thresholding operators and a synthesis dictionary. The AST model can be interpreted as a fully connected DNN which uses hard-thresholding as non-linearity and has one hidden layer. The Double-Sparsity model 
proposes to learn a two-layer synthesis dictionary. The first layer is a dictionary that models a fixed basis, while the second one is a sparse dictionary with sparse atoms. The effective dictionary is therefore the multiplication between the two dictionaries. The Double-Sparsity model provides a more efficient and adaptive modeling and enables learning large dictionary from high-dimensional data. A Multi-Layer Convolutional Sparse Coding (ML-CSC) model is proposed in[33, 41] and gives a new interpretation on the working of the Convolutional Neural Networks (CNNs). The linear models in CNNs are interpreted as synthesis dictionaries with convolutional structure and the function of the non-linearities is interpreted as a simplified sparse pursuit procedure. The ML-CSC model has multiple layers of synthesis dictionaries where the first dictionary is non-sparse while the following dictionaries are sparse. Tariyal et al.  proposed a greedy layer-wise deep dictionary learning method which performs synthesis dictionary learning layer-by-layer. A parametric approach is proposed in  to learn a deep dictionary for image classification tasks. The proposed dictionary learning method contains a forward pass which performs sparse coding with the given synthesis dictionaries and a backward pass which updates the dictionaries by gradient descent.
The contribution of this paper is three-fold:
We propose a Deep Analysis dictionary Model (DeepAM) which is composed of multiple layers of analysis dictionaries with associated soft-thresholding operators and a layer of synthesis dictionary.
We propose to characterize each analysis dictionary as a combination of two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD). The IPAD together with the soft-thresholding operator preserves the key information of the input data, and the thresholds are set essentially to denoise the data. The CAD with the associated soft-thresholding operator generates a discriminative representation, and the thresholds are set facilitate such discrimination.
We propose learning algorithms for DeepAM. To achieve the two different goals of IPAD and CAD, different learning criteria are introduced. We validate our proposed DeepAM on the single image super-resolution task. Simulation results show that the proposed deep dictionary model achieves comparable performance with a DNN which has the same structure but is optimized using back-propagation.
The rest of the paper is organized as follows. Section II gives an overview of the proposed deep analysis dictionary model. Section III analyzes the proposed model and explains the rationale behind splitting each analysis dictionary into an information preserving and a clustering sub-dictionary. Section IV introduces the learning algorithm for the deep analysis dictionary model. Section V presents simulation results on single image super-resolution task and Section VI concludes the paper.
We begin this section by briefly introducing the single image super-resolution problem and some notations. We then outline the structure of our deep dictionary model.
Ii-a Image Super-Resolution
The task of single image super-resolution (SISR) is to estimate a (high-resolution) HR imagefrom an observed (low-resolution) LR image .
We use patch-based single image super-resolution as the sample application to validate our proposed architecture. Instead of estimating the HR image as a whole, the patch-based approaches [47, 49, 44, 45, 21, 22, 20] divide the LR image into overlapping patches and infer a HR patch for each LR patch. The HR image can then be reconstructed using the estimated HR patches by patch overlapping. Learning-based approaches [47, 49, 44, 21, 22, 20, 9, 10, 24] learn the inference model from an external training dataset which contains LR and HR image pairs. The patch-based methods use LR-HR patch pairs extracted from the training dataset. The size of the LR patches and the HR patches is assumed to be and , respectively. The variable represents the up-scaling factor. To gain illumination invariance property, the mean value of each patch is normally removed. For simplicity, we denote and
. By vectorizing the image patches and grouping the training vectors into a matrix, we obtain the input LR training data matrixand the corresponding ground-truth HR training data matrix .
Ii-B Deep Analysis Dictionary Model
To address the SISR problem, we propose a Deep Analysis dictionary Model (DeepAM) which consists of multiple layers of analysis dictionary interleaved with soft-thresholding operations and a single synthesis dictionary. In an -layer DeepAM, there are dictionaries and layers that correspond to the non-linear operations. The first dictionaries are analysis dictionaries and are denoted as with . The row atoms of the analysis dictionary are of unit norm. The non-linear operator used in DeepAM is the element-wise soft-threshold111Soft-thresholding is defined as where denotes the threshold vector at layer . The dictionary in the last layer is a synthesis dictionary and is designed to optimize the regression task at hand. Fig. 1 shows an example of a 3-layer deep analysis dictionary model for the image super-resolution task.
The -layer DeepAM can therefore be expressed mathematically as:
where and is the input signal and the estimated output signal, respectively.
Let us denote with the output of the -th layer. This means that the input signal is multiplied with the analysis dictionary and then passed through the element-wise soft-thresholding operator . The thresholded output signal will be a sparse signal and is expected to be better at predicting the ground-truth signal than . After layers, is transformed to the HR signal domain via the synthesis dictionary. Note that the input LR signal lives in a lower dimensional space when compared to the target HR signal. It is therefore essential for the inference model to be able to non-linearly transform the input data to a higher dimensional space. This is to be achieved by combined use of the analysis dictionaries and the associated soft-thresholding operators.
where , and . The superscripts HC and VC stand for horizontal concatenate and vertical concatenate, respectively.
From Eqn. (2), we realize that a layer of analysis dictionary and soft-thresholding operation with atoms can be represented with a layer of Neural Networks with ReLU and neurons. For data which is symmetrically distributed around the origin, DeepAM with soft-thresholding can be more efficient than DNNs with ReLU. This observation will be validated numerically in Section V.
The learning problem for DeepAM can be formulated as learning the parameters that minimize the mean squared error between the ground-truth data and the estimations:
where denotes all the parameters of the -layer DeepAM.
Optimizing Eqn. (3) directly can be very difficult as it involves non-convex matrix factorization and the learning of the parameters of the non-linear operators. Standard tools for optimizing DNNs can be utilized, for example back-propagation algorithm . However this would lead to effective but difficult to interpret results.
Our objective instead is to build a deep dictionary model with a higher interpretability through understanding the purpose of different components in the model. The analysis dictionary and threshold pairs play a key role in DeepAM as they determine the way effective features are generated. The synthesis dictionary instead can be learned using least squares once all the analysis dictionaries and thresholds have been determined. We propose a layer-wise learning strategy to learn the pair of analysis dictionary and soft-thresholding operators. In this way, we can obtain a system where the purpose of every component is easier to interpret.
Iii Analyzing the Deep Analysis Dictionary Model
To justify our approach, we begin by considering a 1-layer DeepAM system:
where is one of the elements of , with and .
We assume, for the sake of argument, that the degradation model is linear. That is, there exists a degradation matrix such that . Denote as the projection of the LR signal onto the HR signal space with the pseudo-inverse matrix .
As shown in Fig. 2, the signal lies in a low-dimensional subspace of the ground-truth HR data space . A linear operation will not be able to recover the components that are orthogonal to (i.e. the dashed line in Fig. 2). It is therefore imperative to design the analysis dictionary and the non-linear soft-thresholding operation in a way that facilitates the recovery of the information of in .
When we multiply with , the analysis dictionary atoms project the input LR data onto specific directions. After soft-thresholding, the resulting signal is sparse and the end result is a partitioning of as shown in Fig. 3. The input data within each piece is then linearly transformed for prediction, and all the linear transforms are jointly described by the synthesis dictionary .
We note that if we assume that all thresholds are large, there is a convex polyhedron in which all input data will be set to zero by the analysis and the thresholding operations, that is, , (see the central black region in Fig. 3). Therefore, the corresponding estimation will be zero, and the information of the data within the convex polyhedron will then be completely lost. This may lead to a large mean squared error for prediction.
This suggests that not all thresholds should be too large. The problem can be solved if there is a set of analysis dictionary atoms with small thresholds. If we assume that the signal subspace has dimension , then in order not to lose essential information there should be at least pairs of analysis dictionary atoms and soft-thresholds for information preservation. These atoms are associated with small soft-thresholds and are able to fully represent the input data space. Therefore the pairs of analysis atoms and soft-thresholds together with the corresponding synthesis atoms provide a baseline estimation for the input data. The remaining analysis dictionary atoms can instead be associated with large thresholds. The outcome of these analysis and thresholding operations is a clustering of the input data. That is, the data within the same cluster has the same sparsity pattern and shares the same linear model for prediction. The corresponding synthesis atoms then help recover the signal components within the orthogonal subspace .
Based on the above discussion, we propose to divide an analysis dictionary into two sub-dictionaries , and similarly divide each threshold vector into two parts . The Information Preserving Analysis Dictionary (IPAD) with its thresholds aims at passing key information of the input signal to the next layer. The Clustering Analysis Dictionary (CAD) with its threshold is to facilitate the separation of key feature in the signal. The CAD and thresholding operators provide a highly non-linear prediction. Fig. 4 shows an analysis dictionary and the soft-thresholding operation with the partition of the IPAD part and the CAD part.
In a multi-layer DeepAM, we adopt the same information preserving and clustering strategy. As depicted in Fig. 5, the analysis dictionary at each layer is composed of an IPAD part and a CAD part. The IPADs and thresholds create a channel which transmits the information of the input signal to the CAD in each intermediate layer and to the final estimation. The feature representation generated by the last layer of IPAD and its thresholds should be able to well represent signal components of the HR signal which are within the input data subspace. The CADs and thresholds are the main source of non-linearity in DeepAM. The feature representation generated by the last layer of CAD should be able to well represent the signal components of which are orthogonal to the input data subspace. Therefore, the function of CAD and its thresholds can be interpreted as identifying the data with large energy in the subspace orthogonal to the input data subspace. A deep layer of CAD takes the feature representation generated by the IPAD and CAD of the previous layer as input and can generate a non-linear feature representation that cannot be attained by a single layer soft-thresholding with the same number of atoms. Therefore a DeepAM with deeper layers is expected to outperform a shallower one.
Iv Learning a Deep Analysis Dictionary Model
In this section, we introduce the proposed learning algorithm for DeepAM. In view of the distinctive goals of the two pairs of sub-dictionary and thresholds, different learning criteria have been proposed for learning the IPAD and its thresholds and the CAD and its thresholds.
Iv-a Basic Analysis Dictionary Learning Algorithm
The IPAD and the CAD have different functions but also share some common learning objectives. There are four basic objectives for learning an analysis dictionary: (1) its atoms span a subspace of the input data space; (2) it is able to sparsify the input data; (3) there are no pairs of row atoms that are linearly dependent; (4) the row atoms are of unit norm.
Our proposed learning algorithm is an extension of the GeOmetric Analysis Operator Learning (GOAL) algorithm [17, 23] and we denote it as GOAL+. The four learning objectives can be attained by using corresponding constraints. The IPAD and the CAD are learned using modified versions of GOAL+ algorithm.
For simplicity, let us denote the analysis dictionary to be learned as , the -th atom of as and the training data as . We assume that the data span a dimensional subspace . Let us denote with an orthogonal basis of , and with the orthogonal basis of the orthogonal complement .
The first learning objective is that the learned analysis dictionary should span only the subspace . There are two main reasons for this requirement. First, the analysis dictionary which spans can fully preserve the information of the input data. Second, if an atom belongs to , it is an unnecessary atom. This is because the coefficients will be zero since is in the null-space of . We apply a logarithm determinant (log-det) term to impose the information preservation constraint:
Together with the unit norm constraint, the feasible set of the analysis dictionary is therefore defined as with being the unit sphere in and being the product of unit spheres . In other words the feasible set restricts the learned atoms in to be of unit norm and excludes the contributions from . Eqn. (5) is a generalization of the log-det constraint term applied in GOAL [17, 23]. This is because in our case, defines a basis of the input data space whose size could be much smaller than the dimension of the input signal in particular when considering dictionaries in deeper layers. In GOAL [17, 23], defines a much larger subspace and is with or .
We achieve the constraint by performing orthogonal projection onto the tangent space of the manifold at location . For a row atom , the operation of the orthogonal projection onto the tangent space can be represented by the projection matrix :
is the identity matrix, and.
Sparsifying ability is essential for both IPAD and CAD. The sparsifying constraint is imposed by using a log-square function which is a good approximation of the norm:
where is a tunable parameter which controls the sparsifying ability of the learned dictionary.
Linearly dependent row atoms (e.g. ) are undesirable in the learned dictionary. A logarithm barrier term is used to penalize linearly dependent row atoms:
where with and being the regularization parameters.
The objective function defined in Eqn. (9) is optimized using a geometric conjugate gradient descent method [1, 17]. The analysis dictionary learning algorithm GOAL+ is summarized in Algorithm 1. At iteration , the gradient of the objective function is computed and orthogonal projected on the tangent space of the manifold at location . The orthogonal projection of onto the tangent space can be expressed as . Let us denote , the search direction can be set as . In practice, the search direction is a combination of and the previous search direction . The updated analysis dictionary is then obtained by gradient descent with backtracking line search along the search direction . The halting condition is when the analysis dictionary converges or when a pre-defined maximum number of iterations is reached. In summary, our optimization approach is similar to that in GOAL  with the exception of the orthogonal projection step as described in Eqn. (6) which represents the constraint introduced by the feasible set . For a more detailed treatment of the geometric conjugate gradient descent we refer to [1, 17]. Now that the overall objectives of GOAL+ have been introduced, we can focus on how to tailor the optimization in Eqn. (9) to achieve the objectives of IPAD and CAD respectively.
Iv-B Learning IPAD and Threshold Pairs
The function of the IPAD and threshold pair is to pass key information of the input data to deeper layers. The learned IPADs create a channel that enables the information flow from the input signal to the estimated output signal.
Iv-B1 IPAD Learning
The training data for learning is the -th layer input training data (the -th layer training data is obtained as for ). Let us denote the rank of the input training data at the first layer as where outputs the rank of a matrix. The IPAD is assumed to have atoms to ensure that the learned IPAD can well represent the input data subspace.
By setting the training data as , the -th layer analysis dictionary can be learned using GOAL+. The orthonormal basis is set to be an arbitrary basis of that corresponds to the signal subspace of . The orthogonal basis is set to span the orthogonal complement of the subspace spanned by .
Iv-B2 Learning the Thresholds for IPAD
Given a learned IPAD , the analysis coefficients contain sufficient information of . When is a redundant representation or when the input data is noisy, applying a proper thresholding operation to can further enhance the robustness of the representation. We propose to apply soft-thresholding with small thresholds to the IPAD analysis coefficients as and interpret the soft-thresholding operation as a form of denoising.
There are related works in the literature about thresholding for redundant representations [13, 35, 28]. Elad  shows that simple shrinkage is effective for redundant representation and interprets the simple shrinkage as the first iteration for solving Basis Pursuit Denoising (BPDN) problems. Raphan and Simoncelli  proposed a denoising scheme for redundant representations based on Stein’s Unbiased Risk Estimator. Lin and Lee  proposed a Bayesian framework for finding the optimal -norm regularization.
Let us consider a weighed -norm regularized minimization problem:
where is the -th coefficient of the sparse vector , and is the corresponding regularization parameter.
Selecting the soft-threshold is equivalent to finding suitable regularization parameters in Eqn. (10) as the soft-thresholding operation can be interpreted as the first iteration of the Iterative Soft-Thresholding Algorithm (ISTA)  for solving Eqn. (10):
where the initial sparse code .
Lin and Lee  proposed a method to choose the optimal regularization parameters based on a Bayesian analysis. They assume that the data
is with additive i.i.d. zero mean Gaussian noise with variance:
and assume a Laplacian distribution prior for the sparse code with parameters :
Empirically, we have found that the prior distribution can be well characterized by an i.i.d. zero-mean Laplacian distribution. Based on the analysis in , the optimal regularization parameters for Eqn. (10) can be set as inversely proportional to the variance of the Laplacian distribution:
where is the variance of the Laplacian distribution for the -th sparse code .
From Eqn. (14), the soft-threshold associated with IPAD is:
where is a scaling parameter, and the variance of the -th coefficient can be estimated using the obtained IPAD and its input data.
There is only a free parameter to be determined. It can be obtained by solving a 1-dimensional search problem. The optimization problem for is therefore formulated as:
where , with , and is a discrete set of values.
The obtained pair should be able to preserve the important information within the input signal and give no worse performance when compared to a linear model without any non-linearity.
Iv-C Learning CAD and Threshold Pairs
The function of the clustering analysis dictionary and threshold pair is to sparsify its input data and identify the data with large residual energy. The CAD and threshold pairs at shallow layers provide low-level feature representations for the CADs at deeper layers.
Iv-C1 CAD Learning
Different from IPAD, learning CAD requires supervision from both the input training data and the ground-truth HR training data .
Let us denote with the middle resolution data and with the residual data where is the synthesis dictionary of layer which minimizes the mean squared reconstruction error and can be obtained by solving:
It has a closed-form solution given by:
The learning objective for CAD is that its atoms should be able to project onto directions where the data with large residual error has responses with large magnitude. To achieve that, we propose to first learn an analysis dictionary which acts on the data and is able to jointly sparsify the data and the residual data . That is, the atoms in are able to identify the data in with large residual energy. The -th layer CAD is then re-parameterized as:
As a result, the learned CAD will have the same identification ability as since .
We propose the following constraint for learning the analysis dictionary . Each analysis atom is enforced to be able to jointly sparsify and :
where , and is a tunable parameter.
The objective function for learning the analysis dictionary is then formulated as:
The input training data is set to . Let us denote the rank of as . The orthogonal basis is set to be an arbitrary basis of the signal subspace of . The orthonormal basis is set to be a basis spanning the orthogonal complement to the subspace spanned by .
Iv-C2 Learning the Thresholds for CAD
The thresholds for CAD are crucial to the performance of DeepAM as the CAD and threshold pair is the main source of non-linearity in DeepAM. The atoms of the learned CAD project the input data onto directions where the data with large residual prediction error will have responses with large magnitude. After soft-thresholding, the coefficients should be as sparse as possible to achieve a strong discriminative power.
We propose to set the CAD thresholds in the form of:
where is a scaling parameter, and is the variance of the Laplacian distribution for the -th atom.
As discussed in the previous section, the analysis coefficients can be well modelled by Laplacian distributions. By setting the CAD thresholds proportional to the variance of the analysis coefficients, the proportion of data that has been set to zero for each pair of atom and threshold will be the same. When the synthesis dictionary is applied for reconstruction, the synthesis atoms corresponding to the CAD atoms will be activated with a similar frequency.
With this simplification, the CAD thresholds can be learned in an efficient way. The scaling parameter can be obtained using the same strategy as in Eqn. (16) by solving a 1-dimensional search problem. The optimization problem for is formulated as:
where is the estimation residual obtained after using IPAD, , with , and is a discrete set of values.
In the simulation results in Section V, we will show that the learned CAD thresholds lead to an effective system.
Iv-D Synthesis Dictionary Learning
The deep analysis dictionary learning can be considered as a layer-wise representation learning process in which the input data is consistently non-linearly transformed to a high dimensional feature representation with good descriptive and discriminative properties. The synthesis dictionary models the linear transformation from to the desired HR counterpart . Similar to Eqn. (18), the synthesis dictionary is learned using least squares:
Iv-E DeepAM Learning Algorithm
The overall learning algorithm for DeepAM is summarized in Algorithm 2. We adopt a layer-wise learning strategy for DeepAM. At each layer, two sub-dictionaries IPAD and CAD are independently learned and then combined to form the analysis dictionary, and the thresholds for IPAD and CAD are obtained with two different strategies. In this way, each pair of analysis dictionary and soft-thresholding operations fulfil two different functionalities. Finally, the synthesis dictionary is learned using least squares.
V Simulation Results
In this section, we report the implementation details and numerical results of our proposed DeepAM method and compare our proposed method with Deep Neural Networks learned using back-propagation and with other single image super-resolution algorithms.
V-a Implementation Details
as the testing dataset. The Peak Signal-to-Noise Ratio (PSNR)333PSNR=, where is the mean squared error between the ground-truth HR image and the estimated HR image
is used as the evaluation metric. The color images have been converted from RGB color space to YCbCr color space and image super-resolution is only applied to the luminance component. The low-resolution (LR) images are obtained by down-sampling the ground-truth high-resolution (HR) images by a factorusing Matlab function imresize. The size of the low-resolution patches is set to for the purpose of better visualization and easier interpretation. The size of the high-resolution patches is then . Around LR-HR patch pairs have been collected for training. During testing, full patch overlapping is applied to reconstruct the HR images.
Table I shows the parameters setting of GOAL+ algorithm for learning the -th layer Information Preserving Analysis Dictionary (IPAD) and Clustering Analysis Dictionary (CAD). Both the IPAD and the CAD are initialized with i.i.d. Gaussian random entries. We apply batch training for GOAL+ algorithm. The training data has been divided into batches with size . During training, the GOAL+ algorithm is sequentially applied to batches until the learned dictionary converges. For each batch, iterations of conjugate gradient descent are performed to update the dictionary. The discrete set used for searching the scaling parameter of the thresholds is set to be .
V-B Visualization of the Learned DeepAM
In this section, we will show and analyze the learned DeepAM with the implementation details as described in the previous section.
Figure 6 shows an example of a learned 1-layer DeepAM. It contains an analysis dictionary , thresholds and a synthesis dictionary . Each atom is displayed in a 2D patch in which black and white corresponds to the smallest and the largest value, respectively. The number of the information preserving atoms is set to 40 which is larger than the rank of the input data. The thresholds depicted in Fig. 6(b) show a clear bimodal behaviour. The first 40 thresholds are close to zero, while the remaining 60 thresholds are relatively large. After thresholding, almost all coefficients corresponding to IPAD are non-zero, and the percentage of non-zero coefficients of different CAD atoms are similar and are around 8%. This indicates that modelling the distribution of the analysis coefficients as a Laplacian distribution is a good approximation. The atoms within the blue box are the clustering atoms. The atoms in IPAD shown in Fig. 6(a) are similar to the LR versions of their corresponding synthesis atoms in Fig. 6(c). The CAD atoms look like directional filters and are more localized. There is little low-frequency information. The corresponding synthesis atoms are correlated to the CAD atoms, however, they are not the HR counterpart. The inner product between the HR projection of a CAD atom and its corresponding synthesis atom is close to zero. This shows that, in line with our objective, the synthesis atoms which correspond to the CAD part are nearly orthogonal to the LR data subspace.
Back-propagation can be used to further update the parameters in our learned DeepAM. The back-propagation update is implemented using Pytorch with Adam optimizer, batch size , initial learning rate , learning rate decay step , and decay rate . The parameter setting has been tuned to achieve the best performance.
shows the 1-layer DeepAM after updating using back-propagation. With back-propagation, the performance of DeepAM has a rapid improvement with the first 5 epochs and converges within 20 epochs. After back-propagation update, the average PSNR evaluated onSet5 has improved by approximately 0.3 dB. We can find that the different characteristics of the IPAD part and the CAD part are preserved on the updated DeepAM. There are subtle differences on the updated dictionaries. In general, the IPAD atoms have no visible changes, while the CAD atoms have become more localized. The thresholds continue to have a bimodal behaviour. There is only a slight change on the percentage of non-zero coefficients of different atoms.
Figure 8 shows the dictionaries of a learned 2-layer DeepAM including two analysis dictionaries , and a synthesis dictionary . The first analysis dictionary is similar to that in Figure 6(a), while its CAD part mainly contains directional filters due to a smaller number of clustering atoms. The second analysis dictionary is shown in Figure 8(b) and is a sparse dictionary where the sparse atoms can be considered as indicating a weighted combination of the first layer analysis dictionary atoms if the soft-thresholding operation is neglected. The effective dictionary shown in Figure 8(c) can partially show the effective atoms applied to the input LR data whose IPAD part is similar to that in and CAD part contains more localized atoms when compared to those in . Similar observations can be found in a deeper analysis dictionary in DeepAM. The synthesis dictionary has similar characteristics as the one in the 1-layer DeepAM.
Figure 9 shows the dictionaries of the 2-layer DeepAM after updating with back-propagation. The back-propagation slightly updates the dictionaries and converges within 20 epochs. The average PSNR evaluated on Set5 improves by 0.2 dB after the first 5 epochs and achieved a 0.3 dB improvement after convergence. As in the 1-layer DeepAM case, after back-propagation, there is still a clear difference between the IPAD atoms and the CAD atoms. The IPAD atoms did not change significantly, while the CAD atoms in and the effective dictionary have become more localized.
V-C Comparison with Deep Neural Networks
In this section, we compare our proposed DeepAM method with Deep Neural Networks (DNNs). The number of IPAD atoms in each layer is set to be 35 which is the rank of the input LR data since a DeepAM with more CAD atoms provides better performance when the input data is noiseless. For comparison, DNNs are learned with the same training data using gradient descent with back-propagation. Let us denote DNN-R and DNN-S as the DNN with ReLU as non-linearity and soft-thresholding as non-linearity, respectively. The architecture of DNN-S is the same as our DeepAM. The implementation of DNNs is based on Pytorch with Adam optimizer , batch size , initial learning rate , learning rate decay step , and decay rate . The total number of epoches for training is . The parameter setting has been tuned to achieve the best performance. The parameters of the DNNs are initialized using the default method in Pytorch.
Table II reports the average PSNR (dB) of DNN-R, DNN-S and the proposed DeepAM method evaluated on Set5  and Set14 . There are three different model sizes which correspond to DNNs with 1 hidden layer, 2 hidden layers and 3 hidden layers where the number of neurons in each layer is fixed to be 256.
With a deeper model, the performance of our proposed DeepAM method improves even when the size of the final feature representation is the same. This indicates that the depth of DeepAM is important for the final performance. Figure 10 further shows the percentage of non-zero coefficients for each atom in 3 different layers of the 3-layer DeepAM in Table II. We can find that the percentage of non-zero coefficients has a bimodal behaviour in all three layers which is the same to that shown in Figure 6(b). After thresholding, the percentage of non-zero coefficients corresponding to CAD atoms are almost the same in each layer. The percentage of non-zero coefficients for CAD atoms in layer 1, 2 and 3 is around 9%, 14% and 22%, respectively. This means the feature representation becomes less sparse with the increase of layers. A denser signal representation is helpful for modelling more complex signals which requires the use of more synthesis atoms for a good reconstruction quality. This could be the reason for an improved performance with the increase of depth.
Our proposed DeepAM method achieves a similar average PSNR when compared to the DNN-R method over different model sizes. The DNN-S method achieves the highest average PSNR which is around 0.3 dB and 0.2 dB higher than that of our proposed DeepAM method when evaluated on Set5 and Set 14, respectively. The slightly lower performance of DeepAM when compared to DNN-S is not surprising since DeepAM does not utilize joint optimization as DNNs and the thresholds are set according to simple principles. On the other hand, the simulation result validates the effectiveness of our proposed method. The better performance of DNN-S also suggests that DNNs with soft-thresholding as non-linearity can be more effective for image enhancement applications.
As shown in the previous section, back-propagation can be used to further improve the learned DeepAM. Figure 11 shows average PSNR of the 3-layer DeepAM in Table II which is updated using back-propagation and the performance of the baseline DNN-S. The performance of DeepAM improves significantly and outperforms DNN-S within 15 epochs. The converged performance of DeepAM is around 0.1 dB higher than that of DNN-S. The result shows that the parameters of DeepAM are not far from the final parameters. It also suggests that the DeepAM learning algorithm has the potential to be a good initialization method for DNNs.
V-D Comparison with Single Image Super-Resolution Methods
|Images||Bicubic||SC ||ANR ||A+ ||SRCNN ||DeepAM|
In this section, we will compare our proposed DeepAM method with some existing single image super-resolution methods including Bicubic interpolation, SC-based method, Anchored Neighbor Regression (ANR) method , Adjusted Anchored Neighborhood Regression (A+) method , and Super-Resolution Convolutional Neural Network (SRCNN) method .
The SC-based method  is a synthesis dictionary based method with a coupled LR and HR dictionary. The LR dictionary is learned using K-SVD  and has 1024 atoms, and the HR dictionary is learned using least squares. It assumes that a LR patch and its corresponding HR patch share the same sparse code which is retrieved using OMP 
. The input LR feature is the concatenation of the intensity, the first-order derivatives, and the second-order derivatives of the LR data and is further compressed using Principal Component Analysis (PCA). The ANR method and the A+ method  use the same feature representation as . They apply a learned LR synthesis dictionary for LR patch clustering and have a regression model for each dictionary atom. The super-resolution algorithm finds the nearest neighbor atom for each input LR signal and apply the corresponding regression model for HR signal prediction. The dictionary has 1024 atoms and thus there are 1024 regression models. The A+ method  represented the state-of-the-art before the emergence of methods based on deep convolutional neural networks. The aforementioned methods [49, 44, 45] are all patch-based. The Super-Resolution Convolutional Neural Network (SRCNN) method  is the first to use Convolutional Neural Network for single image super-resolution. SRCNN has 3 layers and is with filters with spatial size , filters with spatial size and filters with spatial size for layer 1, 2 and 3, respectively. It takes the Bicubic up-scaled image as input and is able to upscale the input LR image without dividing the input image into patches.
In Table III, and DeepAM represents the 3-layer DeepAM (model size is ) with and without back-propagation, respectively. The input data is the intensity of the LR image patches. Instead of predicting HR for each input LR patch, the output of DeepAM is the central HR patch since the input LR patch does not contain sufficient information to predict the boundary pixels. DeepAM achieves a comparable performance to the existing methods. Its average PSNR is around 0.3 dB higher than that of the SC-based method and the ANR, while it is around 0.2 dB lower than that of A+ and is around 0.1 dB lower than that of SRCNN. achieves the highest average PSNR. Figure 12 shows an example of the input LR image and the reconstructed HR images using DeepAM and .
In this paper, we proposed a Deep Analysis Dictionary Model (DeepAM) which consists of multiple layers of analysis dictionary and soft-thresholding operators and a layer of synthesis dictionary. Each analysis dictionary has been designed to contain two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD). The IPAD and threshold pairs are to pass key information from input to deeper layers. The function of the CAD and threshold pairs is to facilitate discrimination of key features. We proposed an extension of GOAL  to perform dictionary learning for both the IPAD and the CAD. The thresholds have been efficiently set according to simple principles, while leading to effective models. Simulation results show that our proposed DeepAM achieves comparable performance with DNNs and other existing single image super-resolution methods.
-  (2009) Optimization algorithms on matrix manifolds. Princeton University Press. Cited by: §IV-A.
-  (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), pp. 4311. Cited by: §I, §V-D.
-  (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2 (1), pp. 183–202. Cited by: §I.
-  (2009) Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis 27 (3), pp. 265–274. Cited by: §I.
-  (2013) Invariant scattering convolution networks. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1872–1886. Cited by: §I.
-  (2001) Atomic decomposition by basis pursuit. SIAM review 43 (1), pp. 129–159. Cited by: §I.
-  (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory 55 (5), pp. 2230–2249. Cited by: §I.
-  (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 57 (11), pp. 1413–1457. Cited by: §I, §IV-B2.
-  (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pp. 184–199. Cited by: §II-A, §V-D, §V-D, TABLE III.
-  (2016) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: §II-A.
-  (2015) Analysis simco algorithms for sparse analysis model based dictionary learning. IEEE Transactions on Signal Processing 64 (2), pp. 417–431. Cited by: §I.
-  (2007) Analysis versus synthesis in signal priors. Inverse problems 23 (3), pp. 947. Cited by: §I, §I, §I.
-  (2006) Why simple shrinkage is still relevant for redundant representations?. IEEE transactions on information theory 52 (12), pp. 5559–5569. Cited by: §IV-B2.
-  (2010) Sparse and redundant representations: from theory to applications in signal and image processing. Springer. Cited by: §I.
-  (1999) Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, Vol. 5, pp. 2443–2446. Cited by: §I.
-  (2016) Deep neural networks with random Gaussian weights: a universal classification strategy?. IEEE Trans. Signal Processing 64 (13), pp. 3444–3457. Cited by: §I.
-  (2013) Analysis operator learning and its application to image reconstruction. IEEE Transactions on Image Processing 22 (6), pp. 2138–2150. Cited by: §I, §IV-A, §IV-A, §IV-A, §VI.
-  (2018-03) A deep dictionary model for image super-resolution. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18), Cited by: Learning Deep Analysis Dictionaries—Part I: Unstructured Dictionaries.
-  (2019-05) A deep dictionary model to preserve and disentangle key features in a signal. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’19), Cited by: Learning Deep Analysis Dictionaries—Part I: Unstructured Dictionaries.
SRHRF+: self-example enhanced single image super-resolution using hierarchical random forests. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on New Trends in Image Restoration and Enhancement, Cited by: §II-A.
-  (2015) Fast image interpolation via random forests. IEEE Transactions on Image Processing 24 (10), pp. 3232–3245. Cited by: §II-A.
Learning hierarchical decision trees for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. External Links: Cited by: §II-A.
-  (2015) A bimodal co-sparse analysis model for image processing. International Journal of Computer Vision 114 (2-3), pp. 233–247. Cited by: §I, §IV-A, §IV-A, §IV-A.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654. Cited by: §II-A.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B, §V-C.
-  (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §I.
-  (2018) Geometric understanding of deep learning. arXiv preprint arXiv:1805.10451. Cited by: §I.
-  (2006) Bayesian l 1-norm sparse learning. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, Vol. 5, pp. V–V. Cited by: §IV-B2, §IV-B2, §IV-B2.
-  (2018) Deep dictionary learning: a parametric network approach. arXiv preprint arXiv:1803.04022. Cited by: §I.
-  (1993) Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 (12), pp. 3397–3415. Cited by: §I, §V-D.
-  (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398. Cited by: §I.
-  (2014) On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932. Cited by: §I.
Convolutional neural networks analyzed via convolutional sparse coding.
The Journal of Machine Learning Research18 (1), pp. 2887–2938. Cited by: §I.
-  (2018) Learning filter bank sparsifying transforms. IEEE Transactions on Signal Processing 67 (2), pp. 504–519. Cited by: §I.
-  (2008) Optimal denoising in redundant representations. IEEE Transactions on Image Processing 17 (8), pp. 1342–1352. Cited by: §IV-B2.
-  (2012) Learning sparsifying transforms. IEEE Transactions on Signal Processing 61 (5), pp. 1072–1086. Cited by: §I.
-  (2014) Dictionary learning for analysis-synthesis thresholding. IEEE Transactions on Signal Processing 62 (22), pp. 5962–5972. Cited by: §I.
-  (2013) Analysis k-svd: a dictionary-learning algorithm for the analysis sparse model. IEEE Transactions on Signal Processing 61 (3), pp. 661–677. Cited by: §I.
-  (2010) Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing 58 (3), pp. 1553–1564. Cited by: §I.
-  (1985) Learning internal representations by error propagation. Technical report California Univ San Diego La Jolla Inst for Cognitive Science. Cited by: §I, §II-B.
-  (2018) Multilayer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing 66 (15), pp. 4090–4104. Cited by: §I.
-  (2016) Deep dictionary learning. IEEE Access 4, pp. 10096–10109. Cited by: §I.
-  (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288. Cited by: §I.
-  (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1920–1927. Cited by: §II-A, §V-D, §V-D, TABLE III.
-  (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pp. 111–126. Cited by: §II-A, §V-D, §V-D, TABLE III.
-  (2013) Constrained overcomplete analysis operator learning for cosparse signal modelling. IEEE Transactions on Signal Processing 61 (9), pp. 2341–2355. Cited by: §I.
-  (2010) Image super-resolution via sparse representation. IEEE transactions on Image Processing 19 (11), pp. 2861–2873. Cited by: §II-A, Fig. 11, §V-A, §V-C, TABLE II.
-  (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §I.
-  (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, pp. 711–730. Cited by: §II-A, §V-A, §V-C, §V-D, §V-D, TABLE II, TABLE III.