I Introduction
Deep Neural Networks (DNNs) [26] are complex architectures composed of a cascade of multiple linear and nonlinear layers. The backpropagation algorithm [40] is usually applied to optimize the parameters of the linear transforms and the nonlinearities within this highly nonlinear and non-convex system. With the help of massive labeled training data and powerful Graphics Processing Units (GPUs), DNNs have achieved outstanding performance in many signal processing and computer vision tasks. However, the inner workings of DNNs are still not completely understood, and optimized DNNs are usually treated as black-box systems. Some natural questions are: what are the functions of the linear transforms and the nonlinearities, and what is the role of the "cascade" in DNNs?
Some recent works have tried to provide insights into the workings of DNNs. Bruna and Mallat [31, 5] proposed a Scattering Convolutional Network (SCN) by replacing the learned filters with wavelet-like transforms. SCN provides feature representations which are translation and rotation invariant. Zeiler and Fergus [48] proposed a deconvolution technique to visualize the intermediate feature layers of a Convolutional Neural Network (CNN) trained for image classification. The filters in the first layer are Gabor-like, and the deeper layer feature maps tend to be active only for certain objects. In [27], the authors suggested that an autoencoder partitions the low-dimensional data manifold into cells and approximates each piece by a hyperplane. The encoding capability limit of a DNN is described by the upper bound of the number of cells. Montufar et al. [32] have shown that the number of cells can be exponentially larger than the degrees of freedom. Giryes et al. [16] theoretically analyzed fully connected DNNs with i.i.d. random Gaussian weights. They prove that a DNN with random Gaussian weights performs a distance-preserving embedding of the data.

Contrary to DNNs, the sparse representation framework [14] is much more established. Sparse representation over an overcomplete dictionary can serve as an effective and robust representation in both classification and regression problems. Depending on the way the signal is modelled, sparse representation models can be divided into synthesis and analysis models [12].
A synthesis model [12] represents a signal $\mathbf{x} \in \mathbb{R}^{n}$ as a linear combination $\mathbf{x} \approx \mathbf{D}\mathbf{z}$ of a small number of column atoms from a redundant synthesis dictionary $\mathbf{D} \in \mathbb{R}^{n \times m}$ with $m > n$. Sparse pursuit seeks the sparsest representation $\mathbf{z}$ given the input signal and the dictionary. Sparse pursuit algorithms include greedy algorithms [30, 7, 4] and convex relaxation based algorithms [43, 6, 8, 3]. The greedy algorithms, like Orthogonal Matching Pursuit (OMP) [30], select at each iteration a sparse coefficient with the aim of reducing the approximation error. Convex relaxation algorithms relax the non-convex $\ell_0$ pseudo-norm to the convex $\ell_1$ norm. The sparse representation problem can then be solved using basis pursuit (BP) [6] or iterative algorithms, like the Iterative Soft-Thresholding Algorithm (ISTA) [8, 3]. The synthesis dictionary learning algorithms [15, 2] mainly take an alternating minimization strategy and iterate between a sparse-coding stage and a dictionary update stage. With well established theories and algorithms, sparse representation over redundant synthesis dictionaries has been extensively used in signal and image processing.
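As a concrete illustration of the synthesis sparse pursuit just described, the following is a minimal NumPy sketch of ISTA for the $\ell_1$-regularized problem $\min_{\mathbf{z}} \frac{1}{2}\|\mathbf{x} - \mathbf{D}\mathbf{z}\|_2^2 + \lambda\|\mathbf{z}\|_1$. The dictionary, signal and parameter values are random stand-ins for illustration, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    # Element-wise soft-thresholding: sgn(v) * max(|v| - t, 0).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=500):
    # ISTA for min_z 0.5 * ||x - D z||^2 + lam * ||z||_1.
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the data-term gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)         # gradient of the quadratic data term
        z = soft_threshold(z - grad / L, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)           # unit-norm column atoms
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]   # a 3-sparse ground-truth code
x = D @ z_true
z_hat = ista(x, D, lam=0.01)
```

With a small regularization weight and enough iterations, the recovered code reconstructs the signal closely.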
Recently, the analysis model [12, 38] has attracted more research interest. A redundant analysis dictionary $\mathbf{\Omega} \in \mathbb{R}^{m \times n}$ is a tall matrix with $m > n$, where each row of $\mathbf{\Omega}$ is an atom of the dictionary. The expectation is that the analysis coefficients $\mathbf{\Omega}\mathbf{x}$ are sparse. This means that the analysis dictionary should be able to sparsify the input signal, whilst preserving its essential information. The analysis dictionary usually serves as a regularization term in the optimization formulation and models the co-sparse prior, which can be considered as an extension of the Total Variation (TV) prior. An alternating minimization strategy can also be applied for learning analysis dictionaries [38, 46, 11, 36, 34]. Analysis K-SVD [38] iterates between an analysis pursuit operation and a K-SVD dictionary update stage. Yaghoobi et al. [46] proposed a uniformly-normalized tight frame constraint for learning analysis operators. The Analysis Simultaneous Codeword Optimization (ASimCO) algorithm [11] enforces a sparsity constraint in the sparse-coding stage and updates multiple dictionary atoms simultaneously in the dictionary update stage. Sparsifying transform learning [36, 34] constrains the analysis operator to be full rank and well-conditioned. The GeOmetric Analysis Operator Learning (GOAL) algorithm [17, 23] learns the analysis dictionary by employing an alternating optimization strategy. It performs dictionary learning on manifolds by minimizing an objective function which promotes sparse representations and also imposes a full rank constraint.
Building a deep model using sparse representation over redundant synthesis dictionaries has facilitated interpretations of DNNs. Rubinstein and Elad [37] proposed an Analysis-Synthesis Thresholding (AST) model for image deblurring which consists of an analysis dictionary, element-wise hard-thresholding operators and a synthesis dictionary. The AST model can be interpreted as a fully connected DNN with one hidden layer which uses hard-thresholding as its nonlinearity. The Double-Sparsity model [39] learns a two-layer synthesis dictionary. The first layer is a dictionary that models a fixed basis, while the second one is a sparse dictionary with sparse atoms. The effective dictionary is therefore the product of the two dictionaries. The Double-Sparsity model provides more efficient and adaptive modeling and enables learning large dictionaries from high-dimensional data. A Multi-Layer Convolutional Sparse Coding (ML-CSC) model is proposed in [33, 41] and gives a new interpretation of the workings of Convolutional Neural Networks (CNNs). The linear models in CNNs are interpreted as synthesis dictionaries with convolutional structure and the function of the nonlinearities is interpreted as a simplified sparse pursuit procedure. The ML-CSC model has multiple layers of synthesis dictionaries, where the first dictionary is non-sparse while the following dictionaries are sparse. Tariyal et al. [42] proposed a greedy layer-wise deep dictionary learning method which performs synthesis dictionary learning layer-by-layer. A parametric approach is proposed in [29] to learn a deep dictionary for image classification tasks. The proposed dictionary learning method contains a forward pass which performs sparse coding with the given synthesis dictionaries and a backward pass which updates the dictionaries by gradient descent.

The contribution of this paper is threefold:

We propose a Deep Analysis dictionary Model (DeepAM) which is composed of multiple layers of analysis dictionaries with associated soft-thresholding operators and a layer of synthesis dictionary.

We propose to characterize each analysis dictionary as a combination of two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD). The IPAD together with its soft-thresholding operator preserves the key information of the input data, and its thresholds are set essentially to denoise the data. The CAD with the associated soft-thresholding operator generates a discriminative representation, and its thresholds are set to facilitate such discrimination.

We propose learning algorithms for DeepAM. To achieve the two different goals of the IPAD and the CAD, different learning criteria are introduced. We validate our proposed DeepAM on the single image super-resolution task. Simulation results show that the proposed deep dictionary model achieves comparable performance with a DNN which has the same structure but is optimized using backpropagation.
The rest of the paper is organized as follows. Section II gives an overview of the proposed deep analysis dictionary model. Section III analyzes the proposed model and explains the rationale behind splitting each analysis dictionary into an information preserving and a clustering sub-dictionary. Section IV introduces the learning algorithm for the deep analysis dictionary model. Section V presents simulation results on the single image super-resolution task, and Section VI concludes the paper.
II Overview
We begin this section by briefly introducing the single image super-resolution problem and some notations. We then outline the structure of our deep dictionary model.
II-A Image Super-Resolution
The task of single image super-resolution (SISR) is to estimate a high-resolution (HR) image from an observed low-resolution (LR) image. We use patch-based single image super-resolution as the sample application to validate our proposed architecture. Instead of estimating the HR image as a whole, the patch-based approaches [47, 49, 44, 45, 21, 22, 20] divide the LR image into overlapping patches and infer an HR patch for each LR patch. The HR image can then be reconstructed from the estimated HR patches by patch overlapping. Learning-based approaches [47, 49, 44, 21, 22, 20, 9, 10, 24] learn the inference model from an external training dataset which contains LR and HR image pairs. The patch-based methods use LR-HR patch pairs extracted from the training dataset. The LR patches and the HR patches are assumed to be of size $p \times p$ and $sp \times sp$, respectively, where $s$ represents the upscaling factor. To gain an illumination invariance property, the mean value of each patch is normally removed. For simplicity, we denote the vectorized LR and HR patches as $\mathbf{x}$ and $\mathbf{y}$, respectively. By vectorizing the image patches and grouping the training vectors into a matrix, we obtain the input LR training data matrix $\mathbf{X}$
and the corresponding ground-truth HR training data matrix $\mathbf{Y}$.

II-B Deep Analysis Dictionary Model
To address the SISR problem, we propose a Deep Analysis dictionary Model (DeepAM) which consists of multiple layers of analysis dictionaries interleaved with soft-thresholding operations and a single synthesis dictionary. In an $L$-layer DeepAM, there are $L+1$ dictionaries and $L$ layers that correspond to the nonlinear operations. The first $L$ dictionaries are analysis dictionaries, denoted $\mathbf{\Omega}_i$ with $i = 1, \ldots, L$. The row atoms of each analysis dictionary are of unit norm. The nonlinear operator used in DeepAM is the element-wise soft-threshold (soft-thresholding is defined element-wise as $\mathcal{S}_{\lambda}(z) = \mathrm{sgn}(z)\max(|z| - \lambda, 0)$, where $\boldsymbol{\lambda}_i$ denotes the threshold vector at layer $i$). The dictionary in the last layer is a synthesis dictionary $\mathbf{D}$ and is designed to optimize the regression task at hand. Fig. 1 shows an example of a 3-layer deep analysis dictionary model for the image super-resolution task.
The $L$-layer DeepAM can therefore be expressed mathematically as:

$\hat{\mathbf{y}} = \mathbf{D}\,\mathcal{S}_{\boldsymbol{\lambda}_L}\!\left(\mathbf{\Omega}_L \cdots \mathcal{S}_{\boldsymbol{\lambda}_1}\!\left(\mathbf{\Omega}_1 \mathbf{x}\right)\cdots\right), \qquad (1)$

where $\mathbf{x}$ and $\hat{\mathbf{y}}$ are the input signal and the estimated output signal, respectively, $\mathbf{\Omega}_i$ and $\boldsymbol{\lambda}_i$ are the analysis dictionary and threshold vector of layer $i$, and $\mathbf{D}$ is the synthesis dictionary.
Let us denote with $\mathbf{z}_i = \mathcal{S}_{\boldsymbol{\lambda}_i}(\mathbf{\Omega}_i \mathbf{z}_{i-1})$ (with $\mathbf{z}_0 = \mathbf{x}$) the output of the $i$-th layer. This means that the input of layer $i$ is multiplied with the analysis dictionary $\mathbf{\Omega}_i$ and then passed through the element-wise soft-thresholding operator $\mathcal{S}_{\boldsymbol{\lambda}_i}(\cdot)$. The thresholded output signal $\mathbf{z}_i$ will be a sparse signal and is expected to be better at predicting the ground-truth signal than $\mathbf{z}_{i-1}$. After $L$ layers, $\mathbf{z}_L$ is transformed to the HR signal domain via the synthesis dictionary. Note that the input LR signal lives in a lower dimensional space when compared to the target HR signal. It is therefore essential for the inference model to be able to nonlinearly transform the input data to a higher dimensional space. This is achieved by the combined use of the analysis dictionaries and the associated soft-thresholding operators.
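The layer-by-layer forward computation described above can be sketched as follows. The layer sizes and dictionary entries here are random, illustrative stand-ins, not learned values.

```python
import numpy as np

def soft_threshold(v, lam):
    # Element-wise soft-thresholding with a per-atom threshold vector lam.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def deepam_forward(x, analysis_dicts, thresholds, D_syn):
    # z_i = S_{lambda_i}(Omega_i z_{i-1}); the synthesis dictionary maps z_L to HR.
    z = x
    for Omega, lam in zip(analysis_dicts, thresholds):
        z = soft_threshold(Omega @ z, lam)
    return D_syn @ z

rng = np.random.default_rng(1)
dims = [16, 32, 64]                          # illustrative layer sizes
Omegas = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(2)]
lams = [0.5 * np.ones(dims[i + 1]) for i in range(2)]
D_syn = rng.standard_normal((36, dims[-1]))  # maps features to a 6x6 HR patch
y_hat = deepam_forward(rng.standard_normal(16), Omegas, lams, D_syn)
```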
The proposed DeepAM framework is closely related to Deep Neural Networks (DNNs) with Rectified Linear Unit (ReLU) nonlinearity (ReLU is defined as $\mathrm{ReLU}(z) = \max(z, 0)$). As ReLU can be considered as a one-sided version of soft-thresholding, there exists an equivalence between a layer of analysis dictionary with soft-thresholding and a layer of a neural network with ReLU:

$\mathcal{S}_{\boldsymbol{\lambda}}(\mathbf{\Omega}\mathbf{x}) = \mathbf{B}^{\mathrm{HC}}\max\!\left(\mathbf{W}^{\mathrm{VC}}\mathbf{x} - \mathbf{b}^{\mathrm{VC}},\, \mathbf{0}\right), \qquad (2)$

where $\mathbf{W}^{\mathrm{VC}} = \left[\mathbf{\Omega};\, -\mathbf{\Omega}\right]$, $\mathbf{b}^{\mathrm{VC}} = \left[\boldsymbol{\lambda};\, \boldsymbol{\lambda}\right]$, and $\mathbf{B}^{\mathrm{HC}} = \left[\mathbf{I},\, -\mathbf{I}\right]$. The superscripts HC and VC stand for horizontal concatenation and vertical concatenation, respectively.
From Eqn. (2), we realize that a layer of analysis dictionary and soft-thresholding with $m$ atoms can be represented by a neural network layer with ReLU and $2m$ neurons. For data which is symmetrically distributed around the origin, DeepAM with soft-thresholding can therefore be more efficient than DNNs with ReLU. This observation will be validated numerically in Section V.
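The soft-thresholding/ReLU equivalence can be checked numerically: $\mathcal{S}_{\lambda}(t) = \max(t - \lambda, 0) - \max(-t - \lambda, 0)$, which is exactly what the concatenated ReLU layer computes. All matrices below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 6
Omega = rng.standard_normal((m, n))
lam = np.abs(rng.standard_normal(m))
x = rng.standard_normal(n)

# Soft-thresholding layer.
t = Omega @ x
soft = np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

# Equivalent ReLU layer: vertically concatenated weights and biases,
# horizontally concatenated output matrix.
W_vc = np.vstack([Omega, -Omega])            # 2m x n
b_vc = np.concatenate([lam, lam])            # 2m
B_hc = np.hstack([np.eye(m), -np.eye(m)])    # m x 2m
relu = B_hc @ np.maximum(W_vc @ x - b_vc, 0.0)
```

The two outputs coincide exactly, illustrating that $m$ soft-thresholding atoms need $2m$ ReLU neurons.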
The learning problem for DeepAM can be formulated as learning the parameters that minimize the mean squared error between the ground-truth data and the estimations:

$\min_{\theta} \frac{1}{N}\left\|\mathbf{Y} - \mathbf{D}\,\mathcal{S}_{\boldsymbol{\lambda}_L}\!\left(\mathbf{\Omega}_L \cdots \mathcal{S}_{\boldsymbol{\lambda}_1}\!\left(\mathbf{\Omega}_1 \mathbf{X}\right)\cdots\right)\right\|_F^2, \qquad (3)$

where $\theta = \left\{\{\mathbf{\Omega}_i, \boldsymbol{\lambda}_i\}_{i=1}^{L}, \mathbf{D}\right\}$ denotes all the parameters of the $L$-layer DeepAM, and $N$ is the number of training samples.
Optimizing Eqn. (3) directly can be very difficult as it involves non-convex matrix factorization and the learning of the parameters of the nonlinear operators. Standard tools for optimizing DNNs, such as the backpropagation algorithm [40], can be utilized; however, this would lead to effective but difficult-to-interpret results.
Our objective instead is to build a deep dictionary model with higher interpretability through understanding the purpose of the different components in the model. The analysis dictionary and threshold pairs play a key role in DeepAM as they determine the way effective features are generated. The synthesis dictionary instead can be learned using least squares once all the analysis dictionaries and thresholds have been determined. We propose a layer-wise learning strategy to learn each pair of analysis dictionary and soft-thresholding operator. In this way, we can obtain a system where the purpose of every component is easier to interpret.
III Analyzing the Deep Analysis Dictionary Model
To justify our approach, we begin by considering a 1-layer DeepAM system:

$\hat{\mathbf{y}} = \mathbf{D}\,\mathcal{S}_{\boldsymbol{\lambda}}\!\left(\mathbf{\Omega}\mathbf{x}\right), \qquad (4)$

where $\mathbf{x}$ is the input LR signal, $\mathbf{\Omega}$ and $\boldsymbol{\lambda}$ are the analysis dictionary and its threshold vector, and $\mathbf{D}$ is the synthesis dictionary.
We assume, for the sake of argument, that the degradation model is linear. That is, there exists a degradation matrix $\mathbf{H}$ such that $\mathbf{x} = \mathbf{H}\mathbf{y}$. Denote with $\mathbf{y}_0 = \mathbf{H}^{\dagger}\mathbf{x}$ the projection of the LR signal onto the HR signal space, where $\mathbf{H}^{\dagger}$ is the pseudo-inverse matrix of $\mathbf{H}$.
As shown in Fig. 2, the signal $\mathbf{y}_0$ lies in a low-dimensional subspace of the ground-truth HR data space. A linear operation will not be able to recover the components of $\mathbf{y}$ that are orthogonal to this subspace (i.e. the dashed line in Fig. 2). It is therefore imperative to design the analysis dictionary and the nonlinear soft-thresholding operation in a way that facilitates the recovery of the information of $\mathbf{y}$ in the orthogonal complement of the input data subspace.
When we multiply $\mathbf{x}$ with $\mathbf{\Omega}$, the analysis dictionary atoms project the input LR data onto specific directions. After soft-thresholding, the resulting signal is sparse and the end result is a partitioning of the input data space as shown in Fig. 3. The input data within each piece is then linearly transformed for prediction, and all the linear transforms are jointly described by the synthesis dictionary $\mathbf{D}$.
We note that if we assume that all thresholds are large, there is a convex polyhedron in which all input data will be set to zero by the analysis and the thresholding operations, that is, $\mathcal{S}_{\boldsymbol{\lambda}}(\mathbf{\Omega}\mathbf{x}) = \mathbf{0}$ (see the central black region in Fig. 3). Therefore, the corresponding estimation $\hat{\mathbf{y}}$ will be zero, and the information of the data within the convex polyhedron will be completely lost. This may lead to a large mean squared prediction error.
This suggests that not all thresholds should be too large. The problem can be solved if there is a set of analysis dictionary atoms with small thresholds. If we assume that the signal subspace has dimension $K$, then in order not to lose essential information there should be at least $K$ pairs of analysis dictionary atoms and soft-thresholds for information preservation. These atoms are associated with small soft-thresholds and are able to fully represent the input data space. Therefore these pairs of analysis atoms and soft-thresholds, together with the corresponding synthesis atoms, provide a baseline estimation for the input data. The remaining analysis dictionary atoms can instead be associated with large thresholds. The outcome of these analysis and thresholding operations is a clustering of the input data. That is, the data within the same cluster has the same sparsity pattern and shares the same linear model for prediction. The corresponding synthesis atoms then help recover the signal components within the subspace orthogonal to the input data subspace.
Based on the above discussion, we propose to divide an analysis dictionary into two sub-dictionaries, and similarly divide each threshold vector into two parts. The Information Preserving Analysis Dictionary (IPAD) with its thresholds aims at passing key information of the input signal to the next layer. The Clustering Analysis Dictionary (CAD) with its thresholds is to facilitate the separation of key features in the signal. The CAD and its thresholding operators provide a highly nonlinear prediction. Fig. 4 shows an analysis dictionary and the soft-thresholding operation with the partition into the IPAD part and the CAD part.
In a multi-layer DeepAM, we adopt the same information preserving and clustering strategy. As depicted in Fig. 5, the analysis dictionary at each layer is composed of an IPAD part and a CAD part. The IPADs and their thresholds create a channel which transmits the information of the input signal to the CAD in each intermediate layer and to the final estimation. The feature representation generated by the last layer of IPAD and its thresholds should be able to well represent the signal components of the HR signal which are within the input data subspace. The CADs and their thresholds are the main source of nonlinearity in DeepAM. The feature representation generated by the last layer of CAD should be able to well represent the signal components of the HR signal which are orthogonal to the input data subspace. Therefore, the function of the CAD and its thresholds can be interpreted as identifying the data with large energy in the subspace orthogonal to the input data subspace. A deeper CAD layer takes the feature representation generated by the IPAD and CAD of the previous layer as input and can generate a nonlinear feature representation that cannot be attained by a single layer of soft-thresholding with the same number of atoms. Therefore a DeepAM with deeper layers is expected to outperform a shallower one.
IV Learning a Deep Analysis Dictionary Model
In this section, we introduce the proposed learning algorithm for DeepAM. In view of the distinctive goals of the two pairs of sub-dictionaries and thresholds, different learning criteria are introduced for learning the IPAD and its thresholds and the CAD and its thresholds.
IV-A Basic Analysis Dictionary Learning Algorithm
The IPAD and the CAD have different functions but also share some common learning objectives. There are four basic objectives for learning an analysis dictionary: (1) its atoms span a subspace of the input data space; (2) it is able to sparsify the input data; (3) there are no pairs of row atoms that are linearly dependent; (4) the row atoms are of unit norm.
Our proposed learning algorithm is an extension of the GeOmetric Analysis Operator Learning (GOAL) algorithm [17, 23] and we denote it as GOAL+. The four learning objectives are attained by using corresponding constraint terms. The IPAD and the CAD are learned using modified versions of the GOAL+ algorithm.
For simplicity, let us denote the analysis dictionary to be learned as $\mathbf{\Omega} \in \mathbb{R}^{m \times n}$, the $i$-th atom of $\mathbf{\Omega}$ as $\boldsymbol{\omega}_i^{T}$, and the training data as $\mathbf{X}$. We assume that the data spans a $K$-dimensional subspace $\mathcal{V}$. Let us denote with $\mathbf{U}$ an orthogonal basis of $\mathcal{V}$, and with $\mathbf{V}$ an orthogonal basis of the orthogonal complement $\mathcal{V}^{\perp}$.
The first learning objective is that the learned analysis dictionary should span only the subspace $\mathcal{V}$. There are two main reasons for this requirement. First, an analysis dictionary which spans $\mathcal{V}$ can fully preserve the information of the input data. Second, if an atom belongs to $\mathcal{V}^{\perp}$, it is an unnecessary atom: its analysis coefficients will be zero since it lies in the null-space of the data. We apply a logarithm determinant (log-det) term to impose the information preservation constraint:

$h(\mathbf{\Omega}) = -\frac{1}{K\log K}\log\det\!\left(\frac{1}{m}\mathbf{U}^{T}\mathbf{\Omega}^{T}\mathbf{\Omega}\mathbf{U}\right). \qquad (5)$

Together with the unit norm constraint, the feasible set of the analysis dictionary is therefore defined as the set of matrices whose row atoms lie in $\mathcal{V} \cap S^{n-1}$, with $S^{n-1}$ being the unit sphere in $\mathbb{R}^{n}$ and the feasible set being a product of unit spheres. In other words, the feasible set restricts the learned atoms to be of unit norm and to lie in $\mathcal{V}$, excluding contributions from $\mathcal{V}^{\perp}$. Eqn. (5) is a generalization of the log-det constraint term applied in GOAL [17, 23]. This is because in our case $\mathbf{U}$ defines a basis of the input data subspace, whose dimension could be much smaller than the dimension of the input signal, in particular when considering dictionaries in deeper layers. In GOAL [17, 23], the corresponding basis spans a much larger subspace, typically the full input space.
We achieve the constraint by performing an orthogonal projection onto the tangent space of the manifold at location $\mathbf{\Omega}$. For a row atom $\boldsymbol{\omega}_i^{T}$, the orthogonal projection onto the tangent space can be represented by the projection matrix [23]:

$\mathbf{P}_i = \left(\mathbf{I}_n - \boldsymbol{\omega}_i\boldsymbol{\omega}_i^{T}\right)\mathbf{U}\mathbf{U}^{T}, \qquad (6)$

where $\mathbf{I}_n$ is the $n \times n$ identity matrix and $\mathbf{U}$ is the orthogonal basis of the data subspace $\mathcal{V}$.
Sparsifying ability is essential for both IPAD and CAD. The sparsifying constraint is imposed by using a log-square function which is a good approximation of the $\ell_0$ pseudo-norm:

$g(\mathbf{\Omega}) = \frac{1}{Nm}\sum_{j=1}^{N}\sum_{i=1}^{m}\log\!\left(1 + \nu\left(\boldsymbol{\omega}_i^{T}\mathbf{x}_j\right)^2\right), \qquad (7)$

where $\nu$ is a tunable parameter which controls the sparsifying ability of the learned dictionary.
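The effect of the log-square penalty can be illustrated numerically: for two sets of analysis responses with equal energy, the penalty is much smaller for the sparse set. The normalization and the value of $\nu$ below are illustrative assumptions.

```python
import numpy as np

def log_square(R, nu=100.0):
    # Mean of log(1 + nu * r^2) over all analysis responses r (Eqn. (7)-style).
    return np.mean(np.log1p(nu * R ** 2))

rng = np.random.default_rng(3)
dense = rng.standard_normal((12, 500))       # non-sparse responses
sparse = np.zeros_like(dense)
mask = rng.random(dense.shape) < 0.1         # 10% support
sparse[mask] = dense[mask] / np.sqrt(0.1)    # rescale so both have equal expected energy
```

Despite having the same expected energy, the sparse responses incur a far smaller penalty, which is why minimizing Eqn. (7) promotes sparsifying dictionaries.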
Linearly dependent row atoms (e.g. $\boldsymbol{\omega}_i = \pm\boldsymbol{\omega}_j$ for $i \neq j$) are undesirable in the learned dictionary. A logarithm barrier term is used to penalize linearly dependent row atoms:

$r(\mathbf{\Omega}) = -\sum_{1 \le i < j \le m}\log\!\left(1 - \left(\boldsymbol{\omega}_i^{T}\boldsymbol{\omega}_j\right)^2\right). \qquad (8)$
The combination of the information preservation constraint in Eqn. (5), the sparsifying constraint in Eqn. (7), and the linear dependency penalty term in Eqn. (8) leads to the objective function of GOAL+:

$\min_{\mathbf{\Omega}} f(\mathbf{\Omega}) = g(\mathbf{\Omega}) + \kappa\, h(\mathbf{\Omega}) + \upsilon\, r(\mathbf{\Omega}), \qquad (9)$

with $\kappa$ and $\upsilon$ being the regularization parameters.
The objective function defined in Eqn. (9) is optimized using a geometric conjugate gradient descent method [1, 17]. The analysis dictionary learning algorithm GOAL+ is summarized in Algorithm 1. At iteration $k$, the gradient of the objective function is computed and orthogonally projected onto the tangent space of the manifold at location $\mathbf{\Omega}^{(k)}$. Let us denote the projected gradient with $\mathbf{G}^{(k)}$; the search direction can then be set as $\mathbf{H}^{(k)} = -\mathbf{G}^{(k)}$. In practice, the search direction is a combination of $-\mathbf{G}^{(k)}$ and the previous search direction $\mathbf{H}^{(k-1)}$. The updated analysis dictionary is then obtained by gradient descent with backtracking line search along the search direction $\mathbf{H}^{(k)}$. The halting condition is when the analysis dictionary converges or when a predefined maximum number of iterations is reached. In summary, our optimization approach is similar to that in GOAL [17] with the exception of the orthogonal projection step described in Eqn. (6), which represents the constraint introduced by the feasible set. For a more detailed treatment of the geometric conjugate gradient descent we refer to [1, 17]. Now that the overall objectives of GOAL+ have been introduced, we can focus on how to tailor the optimization in Eqn. (9) to achieve the objectives of IPAD and CAD, respectively.
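The projection-and-retraction mechanics described above can be sketched as follows. This is a simplified plain gradient descent (no conjugate directions or line search) on the sparsifying term of Eqn. (7) alone; the subspace dimension, step size and $\nu$ are illustrative assumptions.

```python
import numpy as np

def project_rows(G, Omega, U):
    # Project each gradient row onto the tangent space of the unit sphere at
    # omega_i, restricted to the data subspace span(U): g <- (I - w w^T) U U^T g.
    P_sub = U @ U.T
    out = np.empty_like(G)
    for i in range(Omega.shape[0]):
        w = Omega[i]
        out[i] = (np.eye(len(w)) - np.outer(w, w)) @ (P_sub @ G[i])
    return out

def retract_rows(Omega, U):
    # Keep atoms inside span(U) and renormalize them to unit norm.
    Omega = Omega @ U @ U.T
    return Omega / np.linalg.norm(Omega, axis=1, keepdims=True)

rng = np.random.default_rng(4)
n, K, m = 10, 6, 14
U, _ = np.linalg.qr(rng.standard_normal((n, K)))  # basis of the data subspace
X = U @ rng.standard_normal((K, 300))             # data living in span(U)
Omega = retract_rows(rng.standard_normal((m, n)), U)

nu, step = 10.0, 0.05
for _ in range(20):
    R = Omega @ X
    # Gradient of the log-square sparsifying term w.r.t. Omega.
    G = (2 * nu * R / (1 + nu * R ** 2)) @ X.T / X.shape[1]
    Omega = retract_rows(Omega - step * project_rows(G, Omega, U), U)
```

After every step, the atoms remain unit-norm and inside the data subspace, which is the role of the feasible-set constraint.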
IV-B Learning IPAD and Threshold Pairs
The function of the IPAD and threshold pair is to pass key information of the input data to the deeper layers. The learned IPADs create a channel that enables information to flow from the input signal to the estimated output signal.
IV-B1 IPAD Learning
The training data for learning the $i$-th layer IPAD is the $i$-th layer input training data $\mathbf{X}_i$ (the $i$-th layer training data is obtained as $\mathbf{X}_i = \mathcal{S}_{\boldsymbol{\lambda}_{i-1}}(\mathbf{\Omega}_{i-1}\mathbf{X}_{i-1})$ for $i > 1$, with $\mathbf{X}_1 = \mathbf{X}$). Let us denote the rank of the input training data at the first layer as $K = \mathrm{rank}(\mathbf{X}_1)$, where $\mathrm{rank}(\cdot)$ outputs the rank of a matrix. The IPAD is assumed to have at least $K$ atoms to ensure that the learned IPAD can well represent the input data subspace.
By setting the training data to $\mathbf{X}_i$, the $i$-th layer IPAD can be learned using GOAL+. The orthonormal basis $\mathbf{U}$ is set to be an arbitrary basis of the signal subspace of $\mathbf{X}_i$. The orthogonal basis $\mathbf{V}$ is set to span the orthogonal complement of the subspace spanned by $\mathbf{U}$.
IV-B2 Learning the Thresholds for IPAD
Given a learned IPAD $\mathbf{\Omega}_I$, the analysis coefficients $\mathbf{\Omega}_I\mathbf{x}$ contain sufficient information about $\mathbf{x}$. When $\mathbf{\Omega}_I\mathbf{x}$ is a redundant representation or when the input data is noisy, applying a proper thresholding operation can further enhance the robustness of the representation. We propose to apply soft-thresholding with small thresholds to the IPAD analysis coefficients as $\mathcal{S}_{\boldsymbol{\lambda}_I}(\mathbf{\Omega}_I\mathbf{x})$ and interpret the soft-thresholding operation as a form of denoising.
There are related works in the literature about thresholding for redundant representations [13, 35, 28]. Elad [13] shows that simple shrinkage is effective for redundant representations and interprets the simple shrinkage as the first iteration for solving Basis Pursuit Denoising (BPDN) problems. Raphan and Simoncelli [35] proposed a denoising scheme for redundant representations based on Stein's Unbiased Risk Estimator. Lin and Lee [28] proposed a Bayesian framework for finding the optimal $\ell_1$ norm regularization.
Let us consider a weighted $\ell_1$ norm regularized minimization problem:

$\min_{\mathbf{z}} \frac{1}{2}\left\|\mathbf{x} - \mathbf{\Omega}^{T}\mathbf{z}\right\|_2^2 + \sum_{i=1}^{m}\lambda_i\left|z_i\right|, \qquad (10)$

where $z_i$ is the $i$-th coefficient of the sparse vector $\mathbf{z}$, and $\lambda_i$ is the corresponding regularization parameter.
Selecting the soft-thresholds is equivalent to finding suitable regularization parameters in Eqn. (10), as the soft-thresholding operation can be interpreted as the first iteration of the Iterative Soft-Thresholding Algorithm (ISTA) [8] for solving Eqn. (10):

$\mathbf{z}_1 = \mathcal{S}_{\boldsymbol{\lambda}}\!\left(\mathbf{z}_0 + \mathbf{\Omega}\left(\mathbf{x} - \mathbf{\Omega}^{T}\mathbf{z}_0\right)\right) = \mathcal{S}_{\boldsymbol{\lambda}}\left(\mathbf{\Omega}\mathbf{x}\right), \qquad (11)$

where the initial sparse code $\mathbf{z}_0 = \mathbf{0}$.
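The identity in Eqn. (11) can be verified directly: with a zero initial code and unit step size, the ISTA update collapses to a single soft-thresholding of $\mathbf{\Omega}\mathbf{x}$. The matrices below are random stand-ins.

```python
import numpy as np

def soft_threshold(v, lam):
    # Element-wise soft-thresholding with a per-coefficient threshold vector.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rng = np.random.default_rng(5)
m, n = 12, 8
Omega = rng.standard_normal((m, n))
x = rng.standard_normal(n)
lam = np.abs(rng.standard_normal(m))

z0 = np.zeros(m)
# One ISTA step (unit step size) on 0.5||x - Omega^T z||^2 + sum_i lam_i |z_i|.
z1 = soft_threshold(z0 + Omega @ (x - Omega.T @ z0), lam)
```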
Lin and Lee [28] proposed a method to choose the optimal regularization parameters based on a Bayesian analysis. They assume that the data $\mathbf{x}$ is represented with additive i.i.d. zero-mean Gaussian noise of variance $\sigma_n^2$:

$\mathbf{x} = \mathbf{\Omega}^{T}\mathbf{z} + \mathbf{n}, \quad \mathbf{n} \sim \mathcal{N}\!\left(\mathbf{0}, \sigma_n^2\mathbf{I}\right), \qquad (12)$

and assume a Laplacian distribution prior for the sparse code $\mathbf{z}$:

$p(z_i) = \frac{\beta_i}{2}\exp\left(-\beta_i\left|z_i\right|\right), \qquad (13)$

where $\beta_i$ is the parameter of the Laplacian distribution of the $i$-th coefficient.
Empirically, we have found that the distribution of the analysis coefficients can be well characterized by an i.i.d. zero-mean Laplacian distribution. Based on the analysis in [28], the optimal regularization parameters for Eqn. (10) can be set as inversely proportional to the variance of the Laplacian distribution:

$\lambda_i \propto \frac{\sigma_n^2}{\sigma_i^2}, \qquad (14)$

where $\sigma_i^2$ is the variance of the Laplacian distribution for the $i$-th sparse code $z_i$.
From Eqn. (14), the soft-threshold associated with the IPAD is set as:

$\lambda_i = \frac{\rho}{\sigma_i^2}, \qquad (15)$

where $\rho$ is a scaling parameter, and the variance $\sigma_i^2$ of the $i$-th coefficient can be estimated using the obtained IPAD and its input data.
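The per-atom variance estimation behind Eqn. (15) can be sketched as follows. The coefficients here are synthetic Laplacian samples standing in for real IPAD analysis coefficients, and the value of the scaling parameter is a placeholder (it is found by the 1-D search described next).

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in analysis coefficients: row i is Laplacian with its own scale b_i.
scales = np.array([0.5, 1.0, 2.0, 4.0])
C = rng.laplace(scale=scales[:, None], size=(4, 5000))

var_hat = np.var(C, axis=1)     # Laplacian variance is 2 b^2
rho = 0.1                       # placeholder scaling parameter
thresholds = rho / var_hat      # IPAD thresholds: inversely proportional to variance
```

Atoms with larger coefficient spread receive smaller thresholds, so the strongest information-carrying responses survive thresholding.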
There is then only a single free parameter $\rho$ to be determined. It can be obtained by solving a 1-dimensional search problem. The optimization problem for $\rho$ is therefore formulated as:

$\rho^{*} = \arg\min_{\rho \in \Gamma}\left\|\mathbf{Y} - \mathbf{D}_{\rho}\,\mathcal{S}_{\boldsymbol{\lambda}(\rho)}\!\left(\mathbf{\Omega}_I\mathbf{X}\right)\right\|_F^2, \qquad (16)$

where $\mathbf{\Omega}_I$ denotes the IPAD, $\boldsymbol{\lambda}(\rho)$ is the threshold vector obtained from Eqn. (15) for a given $\rho$, $\mathbf{D}_{\rho}$ is the corresponding least-squares synthesis dictionary, and $\Gamma$ is a discrete set of candidate values.
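The 1-D search can be sketched as below: for each candidate scaling value, compute the thresholds, the thresholded features, the least-squares synthesis dictionary for those features, and the resulting fit. The data and function names are illustrative stand-ins.

```python
import numpy as np

def soft_threshold(V, lam):
    # Column-wise soft-thresholding of a coefficient matrix with per-row thresholds.
    return np.sign(V) * np.maximum(np.abs(V) - lam[:, None], 0.0)

def search_rho(X, Y, Omega, var_hat, candidates):
    # 1-D search: pick the rho minimizing the least-squares fit to Y.
    best = (None, np.inf)
    for rho in candidates:
        Z = soft_threshold(Omega @ X, rho / var_hat)
        D = Y @ np.linalg.pinv(Z)      # least-squares synthesis dictionary: D = Y Z^+
        err = np.linalg.norm(Y - D @ Z) ** 2 / X.shape[1]
        if err < best[1]:
            best = (rho, err)
    return best

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 400))
Y = rng.standard_normal((16, 8)) @ X   # a purely linear ground truth for this sketch
Omega = rng.standard_normal((12, 8))
var_hat = np.var(Omega @ X, axis=1)
rho_star, mse = search_rho(X, Y, Omega, var_hat, [0.0, 0.01, 0.1, 1.0])
```

Since the toy target here is exactly linear in the input, the search correctly picks the smallest threshold scaling; on real data with noise, a non-zero value would typically win.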
The obtained pair of IPAD and thresholds should be able to preserve the important information within the input signal and give no worse performance when compared to a linear model without any nonlinearity.
IV-C Learning CAD and Threshold Pairs
The function of the clustering analysis dictionary and threshold pair is to sparsify its input data and to identify the data with large residual energy. The CAD and threshold pairs at shallow layers provide low-level feature representations for the CADs at deeper layers.
IV-C1 CAD Learning
Different from the IPAD, learning the CAD requires supervision from both the input training data $\mathbf{X}_i$ and the ground-truth HR training data $\mathbf{Y}$.
Let us denote with $\mathbf{Z}_i = \mathbf{D}_i\mathbf{X}_i$ the middle resolution data and with $\mathbf{E}_i = \mathbf{Y} - \mathbf{Z}_i$ the residual data, where $\mathbf{D}_i$ is the synthesis dictionary of layer $i$ which minimizes the mean squared reconstruction error and can be obtained by solving:

$\mathbf{D}_i = \arg\min_{\mathbf{D}}\left\|\mathbf{Y} - \mathbf{D}\mathbf{X}_i\right\|_F^2. \qquad (17)$

It has a closed-form solution given by:

$\mathbf{D}_i = \mathbf{Y}\mathbf{X}_i^{T}\left(\mathbf{X}_i\mathbf{X}_i^{T}\right)^{-1}. \qquad (18)$
The learning objective for the CAD is that its atoms should be able to project onto directions where the data with large residual error has responses with large magnitude. To achieve that, we propose to first learn an analysis dictionary $\mathbf{\Psi}$ which acts on the middle resolution data $\mathbf{Z}_i$ and is able to jointly sparsify $\mathbf{Z}_i$ and the residual data $\mathbf{E}_i$. That is, the atoms in $\mathbf{\Psi}$ are able to identify the data in $\mathbf{Z}_i$ with large residual energy. The $i$-th layer CAD is then re-parameterized as:

$\mathbf{\Omega}_{C,i} = \mathbf{\Psi}\mathbf{D}_i. \qquad (19)$

As a result, the learned CAD will have the same identification ability as $\mathbf{\Psi}$, since $\mathbf{\Omega}_{C,i}\mathbf{X}_i = \mathbf{\Psi}\mathbf{D}_i\mathbf{X}_i = \mathbf{\Psi}\mathbf{Z}_i$.
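The re-parameterization of Eqn. (19) is a simple associativity identity and can be confirmed numerically; the matrices below are random stand-ins with illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((8, 100))    # layer input
D = rng.standard_normal((16, 8))     # least-squares synthesis dictionary of the layer
Psi = rng.standard_normal((20, 16))  # analysis dictionary learned on Z and E

Z = D @ X                            # middle resolution data
Omega_C = Psi @ D                    # re-parameterized CAD, Eqn. (19)
```

Applying the CAD to the layer input is exactly equivalent to applying $\mathbf{\Psi}$ to the middle resolution data, so the identification ability of $\mathbf{\Psi}$ carries over.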
We propose the following constraint for learning the analysis dictionary $\mathbf{\Psi}$. Each analysis atom is enforced to be able to jointly sparsify $\mathbf{Z}_i$ and $\mathbf{E}_i$:

$g_J(\mathbf{\Psi}) = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{m}\log\!\left(1 + \nu\left(\boldsymbol{\psi}_i^{T}\mathbf{z}_j\right)^2\left(\boldsymbol{\psi}_i^{T}\mathbf{e}_j\right)^2\right), \qquad (20)$

where $\boldsymbol{\psi}_i^{T}$ is the $i$-th row of $\mathbf{\Psi}$, $\mathbf{z}_j$ and $\mathbf{e}_j$ are the $j$-th columns of the middle resolution data and the residual data, and $\nu$ is a tunable parameter.
The objective function for learning the analysis dictionary $\mathbf{\Psi}$ is then formulated as:

$\min_{\mathbf{\Psi}} f(\mathbf{\Psi}) = g_J(\mathbf{\Psi}) + \mu\, g(\mathbf{\Psi}) + \kappa\, h(\mathbf{\Psi}) + \upsilon\, r(\mathbf{\Psi}), \qquad (21)$

with $\mu$, $\kappa$ and $\upsilon$ being the regularization parameters. Here, $g(\cdot)$, $h(\cdot)$ and $r(\cdot)$ are those defined in Eqn. (7), Eqn. (5) and Eqn. (8), respectively.
The input training data is set to the middle resolution data $\mathbf{Z}_i$. Let us denote the rank of $\mathbf{Z}_i$ as $K_C$. The orthogonal basis $\mathbf{U}$ is set to be an arbitrary basis of the signal subspace of $\mathbf{Z}_i$. The orthonormal basis $\mathbf{V}$ is set to be a basis spanning the orthogonal complement of the subspace spanned by $\mathbf{U}$.
IV-C2 Learning the Thresholds for CAD
The thresholds for the CAD are crucial to the performance of DeepAM, as the CAD and threshold pair is the main source of nonlinearity in DeepAM. The atoms of the learned CAD project the input data onto directions where the data with large residual prediction error will have responses with large magnitude. After soft-thresholding, the coefficients should be as sparse as possible to achieve a strong discriminative power.
We propose to set the CAD thresholds in the form of:

$\lambda_{C,j} = \rho_C\,\sigma_j^2, \qquad (22)$

where $\rho_C$ is a scaling parameter, and $\sigma_j^2$ is the variance of the Laplacian distribution for the $j$-th atom.
As discussed in the previous section, the analysis coefficients can be well modelled by Laplacian distributions. By setting the CAD thresholds proportional to the variance of the analysis coefficients, the proportion of data that has been set to zero for each pair of atom and threshold will be the same. When the synthesis dictionary is applied for reconstruction, the synthesis atoms corresponding to the CAD atoms will be activated with a similar frequency.
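The equal-activation behaviour can be illustrated with a small simulation. For a Laplacian with scale $b$, $P(|c| > \lambda) = \exp(-\lambda/b)$, so thresholds matched to each atom's spread leave the same surviving fraction per atom. The scales and the scaling factor below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
scales = np.array([0.3, 1.0, 3.0])   # per-atom Laplacian scales (illustrative)
C = rng.laplace(scale=scales[:, None], size=(3, 20000))

rho = 2.0
thr = rho * scales[:, None]          # thresholds matched to each atom's spread
S = np.sign(C) * np.maximum(np.abs(C) - thr, 0.0)
frac = np.mean(S != 0.0, axis=1)     # surviving (non-zero) fraction per atom
```

All three atoms keep roughly $e^{-2} \approx 13.5\%$ of their coefficients, so the corresponding synthesis atoms are activated with similar frequency.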
With this simplification, the CAD thresholds can be learned in an efficient way. The scaling parameter $\rho_C$ can be obtained using the same strategy as in Eqn. (16) by solving a 1-dimensional search problem. The optimization problem for $\rho_C$ is formulated as:

$\rho_C^{*} = \arg\min_{\rho_C \in \Gamma}\left\|\mathbf{E} - \mathbf{D}_{\rho_C}\,\mathcal{S}_{\boldsymbol{\lambda}_C(\rho_C)}\!\left(\mathbf{\Omega}_C\mathbf{X}\right)\right\|_F^2, \qquad (23)$

where $\mathbf{E}$ is the estimation residual obtained after using the IPAD, $\mathbf{\Omega}_C$ denotes the CAD, $\boldsymbol{\lambda}_C(\rho_C)$ is the CAD threshold vector obtained from Eqn. (22) for a given $\rho_C$, $\mathbf{D}_{\rho_C}$ is the corresponding least-squares synthesis dictionary, and $\Gamma$ is a discrete set of candidate values.
In the simulation results in Section V, we will show that the learned CAD thresholds lead to an effective system.
IV-D Synthesis Dictionary Learning
The deep analysis dictionary learning can be considered as a layer-wise representation learning process in which the input data is progressively and nonlinearly transformed into a high dimensional feature representation $\mathbf{Z}_L$ with good descriptive and discriminative properties. The synthesis dictionary $\mathbf{D}$ models the linear transformation from $\mathbf{Z}_L$ to the desired HR counterpart $\mathbf{Y}$. Similar to Eqn. (18), the synthesis dictionary is learned using least squares:

$\mathbf{D} = \mathbf{Y}\mathbf{Z}_L^{T}\left(\mathbf{Z}_L\mathbf{Z}_L^{T}\right)^{-1}. \qquad (24)$
IV-E DeepAM Learning Algorithm
The overall learning algorithm for DeepAM is summarized in Algorithm 2. We adopt a layer-wise learning strategy for DeepAM. At each layer, the two sub-dictionaries IPAD and CAD are independently learned and then combined to form the analysis dictionary, and the thresholds for IPAD and CAD are obtained with two different strategies. In this way, each pair of analysis dictionary and soft-thresholding operator fulfils two different functionalities. Finally, the synthesis dictionary is learned using least squares.
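The layer-wise strategy can be sketched end-to-end as follows. This is a heavily simplified stand-in: the IPAD is replaced by a PCA-style basis and the CAD by random unit-norm atoms (the paper learns both with GOAL+), and the threshold scalings are placeholders rather than values found by the 1-D searches.

```python
import numpy as np

def soft_threshold(V, lam):
    # Column-wise soft-thresholding with per-row thresholds.
    return np.sign(V) * np.maximum(np.abs(V) - lam[:, None], 0.0)

def learn_layer(X, n_cad, rho_i=0.05, rho_c=1.0, rng=None):
    # Stand-in for one DeepAM layer: an orthonormal basis of the data subspace
    # plays the role of the IPAD; random unit-norm rows play the role of the CAD.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    ipad = U.T                                    # spans the input data subspace
    cad = rng.standard_normal((n_cad, X.shape[0]))
    cad /= np.linalg.norm(cad, axis=1, keepdims=True)
    Omega = np.vstack([ipad, cad])
    var = np.var(Omega @ X, axis=1) + 1e-12
    lam = np.concatenate([rho_i / var[:ipad.shape[0]],   # IPAD: small thresholds
                          rho_c * var[ipad.shape[0]:]])  # CAD: large thresholds
    return Omega, lam

rng = np.random.default_rng(11)
X = rng.standard_normal((8, 500))
Y = rng.standard_normal((16, 500))
Z = X
for _ in range(2):                                # two analysis layers
    Omega, lam = learn_layer(Z, n_cad=6, rng=rng)
    Z = soft_threshold(Omega @ Z, lam)
D = Y @ np.linalg.pinv(Z)                         # final least-squares synthesis dictionary
```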
V Simulation Results
In this section, we report the implementation details and numerical results of our proposed DeepAM method and compare our proposed method with Deep Neural Networks learned using backpropagation and with other single image superresolution algorithms.
V-A Implementation Details
We use the standard 91 training images [47] as the training dataset and use Set5 [47] and Set14 [49] as the testing datasets. The Peak Signal-to-Noise Ratio (PSNR), defined as $10\log_{10}\left(255^2/\mathrm{MSE}\right)$ where MSE is the mean squared error between the ground-truth HR image and the estimated HR image, is used as the evaluation metric. The color images have been converted from the RGB color space to the YCbCr color space and image super-resolution is only applied to the luminance component. The low-resolution (LR) images are obtained by downsampling the ground-truth high-resolution (HR) images by the given upscaling factor using the Matlab function imresize. The size of the low-resolution patches is kept small for the purpose of better visualization and easier interpretation. A large set of LR-HR patch pairs has been collected for training. During testing, full patch overlapping is applied to reconstruct the HR images.

Table I shows the parameter settings of the GOAL+ algorithm for learning the $i$-th layer Information Preserving Analysis Dictionary (IPAD) and Clustering Analysis Dictionary (CAD). Both the IPAD and the CAD are initialized with i.i.d. Gaussian random entries. We apply batch training for the GOAL+ algorithm. The training data has been divided into batches. During training, the GOAL+ algorithm is sequentially applied to the batches until the learned dictionary converges. For each batch, a fixed number of conjugate gradient descent iterations is performed to update the dictionary. The scaling parameter of the thresholds is searched over a discrete set of candidate values.
TABLE I: Parameter settings of the GOAL+ algorithm for learning the layer-wise IPAD and CAD.
VB Visualization of the Learned DeepAM
In this section, we show and analyze the learned DeepAM obtained with the implementation details described in the previous section.
Figure 6 shows an example of a learned 1-layer DeepAM. It contains an analysis dictionary, its associated thresholds and a synthesis dictionary. Each atom is displayed as a 2-D patch in which black and white correspond to the smallest and the largest value, respectively. The number of information preserving atoms is set to 40, which is larger than the rank of the input data. The thresholds depicted in Fig. 6(b) show a clear bimodal behaviour: the first 40 thresholds are close to zero, while the remaining 60 thresholds are relatively large. After thresholding, almost all coefficients corresponding to the IPAD are nonzero, and the percentage of nonzero coefficients is similar across the different CAD atoms, at around 8%. This indicates that modelling the distribution of the analysis coefficients as a Laplacian distribution is a good approximation. The atoms within the blue box are the clustering atoms. The atoms of the IPAD shown in Fig. 6(a) are similar to the LR versions of their corresponding synthesis atoms in Fig. 6(c). The CAD atoms look like directional filters, are more localized, and contain little low-frequency information. The corresponding synthesis atoms are correlated with the CAD atoms; however, they are not their HR counterparts. The inner product between the HR projection of a CAD atom and its corresponding synthesis atom is close to zero. This shows that, in line with our objective, the synthesis atoms which correspond to the CAD part are nearly orthogonal to the LR data subspace.
Backpropagation can be used to further update the parameters of our learned DeepAM. The backpropagation update is implemented in PyTorch with the Adam optimizer [25]; the batch size, initial learning rate, learning rate decay step and decay rate have been tuned to achieve the best performance.

Figure 7 shows the 1-layer DeepAM after the backpropagation update. With backpropagation, the performance of DeepAM improves rapidly within the first 5 epochs and converges within 20 epochs. After the update, the average PSNR evaluated on Set5 improves by approximately 0.3 dB. The different characteristics of the IPAD part and the CAD part are preserved in the updated DeepAM, and there are only subtle differences in the updated dictionaries: in general, the IPAD atoms show no visible changes, while the CAD atoms have become more localized. The thresholds continue to exhibit a bimodal behaviour, and there is only a slight change in the percentage of nonzero coefficients of the different atoms.

Figure 8 shows the dictionaries of a learned 2-layer DeepAM, including two analysis dictionaries and a synthesis dictionary. The first analysis dictionary is similar to that in Figure 6(a), while its CAD part mainly contains directional filters due to the smaller number of clustering atoms. The second analysis dictionary, shown in Figure 8(b), is a sparse dictionary: if the soft-thresholding operation is neglected, each of its sparse atoms can be interpreted as a weighted combination of the first-layer analysis dictionary atoms. The effective dictionary shown in Figure 8(c) partially reveals the effective atoms applied to the input LR data: its IPAD part is similar to that of the first layer, while its CAD part contains more localized atoms. Similar observations hold for the deeper analysis dictionaries in DeepAM. The synthesis dictionary has similar characteristics to the one of the 1-layer DeepAM.
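The notion of an effective dictionary can be made concrete: if the soft-thresholding between layers is neglected, the analysis operator effectively applied to the input LR data is simply the product of the per-layer dictionaries. A small NumPy sketch with hypothetical sizes (the paper's actual dimensions are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
Omega1 = rng.standard_normal((100, 16))   # layer-1 analysis dictionary
Omega2 = rng.standard_normal((100, 100))  # layer-2 (sparse) analysis dictionary

# Neglecting the soft-thresholding, the effective layer-2 dictionary
# applied directly to the input LR data is the matrix product:
Omega_eff = Omega2 @ Omega1

# Each effective atom is a weighted combination of layer-1 atoms,
# with weights given by the corresponding row of Omega2.
atom0 = Omega2[0] @ Omega1
```

With soft-thresholding in place the composition is no longer exactly linear, which is why Figure 8(c) only "partially" reveals the effective atoms.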
Figure 9 shows the dictionaries of the 2-layer DeepAM after updating with backpropagation. The backpropagation update changes the dictionaries only slightly and converges within 20 epochs. The average PSNR evaluated on Set5 improves by 0.2 dB after the first 5 epochs and achieves a 0.3 dB improvement after convergence. As in the 1-layer DeepAM case, after backpropagation there is still a clear difference between the IPAD atoms and the CAD atoms. The IPAD atoms do not change significantly, while the CAD atoms of the analysis dictionaries and of the effective dictionary have become more localized.
VC Comparison with Deep Neural Networks
TABLE II: Average PSNR (dB) of DNN-R, DNN-S and DeepAM on Set5 and Set14 for three model sizes (1, 2 and 3 hidden layers).

Method | DNN-R  DNN-S  DeepAM | DNN-R  DNN-S  DeepAM | DNN-R  DNN-S  DeepAM
Set5   | 35.83  36.26  35.90  | 36.14  36.50  36.12  | 36.19  36.54  36.22
Set14  | 31.80  32.06  31.83  | 32.00  32.23  31.96  | 32.01  32.25  32.02
In this section, we compare our proposed DeepAM method with deep neural networks (DNNs). The number of IPAD atoms in each layer is set to 35, which is the rank of the input LR data, since a DeepAM with more CAD atoms provides better performance when the input data is noiseless. For comparison, DNNs are learned with the same training data using gradient descent with backpropagation. Let DNN-R and DNN-S denote the DNN with ReLU and with soft-thresholding as the nonlinearity, respectively. The architecture of DNN-S is the same as that of our DeepAM. The implementation of the DNNs is based on PyTorch with the Adam optimizer [25]; the batch size, initial learning rate, learning rate decay step, decay rate and total number of training epochs have been tuned to achieve the best performance. The parameters of the DNNs are initialized using the default method in PyTorch.
Table II reports the average PSNR (dB) of DNN-R, DNN-S and the proposed DeepAM method evaluated on Set5 [47] and Set14 [49]. There are three different model sizes, which correspond to DNNs with 1, 2 and 3 hidden layers, where the number of neurons in each layer is fixed to 256.
With a deeper model, the performance of our proposed DeepAM method improves even when the size of the final feature representation is the same. This indicates that the depth of DeepAM is important for the final performance. Figure 10 further shows the percentage of nonzero coefficients for each atom in the 3 layers of the 3-layer DeepAM of Table II. The percentage of nonzero coefficients has a bimodal behaviour in all three layers, as in Figure 6(b). After thresholding, the percentage of nonzero coefficients corresponding to the CAD atoms is almost the same within each layer: around 9%, 14% and 22% for layers 1, 2 and 3, respectively. This means that the feature representation becomes less sparse as the number of layers increases. A denser signal representation is helpful for modelling more complex signals, which requires the use of more synthesis atoms for a good reconstruction quality. This could be the reason for the improved performance with increased depth.
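The per-atom sparsity statistic reported in Figure 10 is straightforward to compute from the thresholded coefficients. A minimal sketch (with a small hypothetical coefficient matrix; the real statistic is computed over the training patches):

```python
import numpy as np

def nonzero_percentage(Z):
    """Percentage of nonzero coefficients per atom (rows of Z are atoms,
    columns are samples)."""
    return 100.0 * np.count_nonzero(Z, axis=1) / Z.shape[1]
```

Applied layer by layer to the outputs of the soft-thresholding operators, this yields exactly the bimodal profiles discussed above: near-100% for IPAD atoms and a low, roughly uniform percentage for CAD atoms.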
Our proposed DeepAM method achieves an average PSNR similar to that of the DNN-R method over the different model sizes. The DNN-S method achieves the highest average PSNR, which is around 0.3 dB and 0.2 dB higher than that of our proposed DeepAM method when evaluated on Set5 and Set14, respectively. The slightly lower performance of DeepAM compared to DNN-S is not surprising, since DeepAM does not use joint optimization as DNNs do, and its thresholds are set according to simple principles. On the other hand, the simulation results validate the effectiveness of our proposed method. The better performance of DNN-S also suggests that DNNs with soft-thresholding as the nonlinearity can be more effective for image enhancement applications.
As shown in the previous section, backpropagation can be used to further improve the learned DeepAM. Figure 11 shows the average PSNR of the 3-layer DeepAM of Table II updated using backpropagation, together with the performance of the baseline DNN-S. The performance of DeepAM improves significantly and outperforms DNN-S within 15 epochs. The converged performance of DeepAM is around 0.1 dB higher than that of DNN-S. This result shows that the parameters learned by DeepAM are not far from the final parameters, and suggests that the DeepAM learning algorithm has the potential to be a good initialization method for DNNs.
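The fine-tuning described above can be illustrated with a toy gradient step. The sketch below is a deliberate simplification of the actual PyTorch/Adam setup: it updates only the synthesis dictionary of a 1-layer model with plain gradient descent on the MSE loss, keeping the analysis dictionary and thresholds fixed; the helper names and sizes are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft-thresholding nonlinearity."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def finetune_step(Omega, lam, D, X, Y, lr=1e-3):
    """One gradient-descent step on the synthesis dictionary D, minimising
    the MSE between D @ soft_threshold(Omega @ X, lam) and the targets Y."""
    Z = soft_threshold(Omega @ X, lam)       # sparse codes (analysis part fixed)
    residual = D @ Z - Y                     # prediction error
    grad_D = 2.0 * residual @ Z.T / X.shape[1]
    return D - lr * grad_D
```

In the full setting, the gradients also flow through the soft-thresholding into the analysis dictionaries and thresholds, which is what the PyTorch backpropagation update does for all layers jointly.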
VD Comparison with Single Image Super-Resolution Methods
TABLE III: PSNR (dB) of different single image super-resolution methods on the Set14 images. DeepAM (BP) denotes DeepAM after the backpropagation update.

Images     | Bicubic | SC [49] | ANR [44] | A+ [45] | SRCNN [9] | DeepAM | DeepAM (BP)
baboon     | 24.85   | 25.47   | 25.55    | 25.66   | 25.64     | 25.65  | 25.70
barbara    | 27.87   | 28.50   | 28.43    | 28.49   | 28.53     | 28.49  | 28.43
bridge     | 26.64   | 27.63   | 27.62    | 27.87   | 27.74     | 27.82  | 27.93
coastguard | 29.16   | 30.23   | 30.34    | 30.34   | 30.43     | 30.44  | 30.53
comic      | 25.75   | 27.34   | 27.47    | 27.98   | 28.17     | 27.77  | 28.12
face       | 34.69   | 35.45   | 35.52    | 35.63   | 35.57     | 35.62  | 35.62
flowers    | 30.13   | 32.04   | 32.06    | 32.80   | 32.95     | 32.45  | 32.86
foreman    | 35.55   | 38.41   | 38.31    | 39.45   | 37.43     | 38.89  | 39.34
lenna      | 34.52   | 36.06   | 36.17    | 36.45   | 36.46     | 36.46  | 36.51
man        | 29.11   | 30.31   | 30.33    | 30.74   | 30.78     | 30.57  | 30.78
monarch    | 32.68   | 35.50   | 35.46    | 36.77   | 37.11     | 36.06  | 37.00
pepper     | 35.02   | 36.64   | 36.51    | 37.08   | 36.89     | 36.87  | 37.11
ppt3       | 26.58   | 29.00   | 28.67    | 29.79   | 30.31     | 29.13  | 29.70
zebra      | 30.41   | 33.05   | 32.91    | 33.45   | 33.14     | 33.34  | 33.71
Average    | 30.21   | 31.83   | 31.81    | 32.32   | 32.22     | 32.11  | 32.38
In this section, we compare our proposed DeepAM method with existing single image super-resolution methods, including Bicubic interpolation, the SC-based method [49], the Anchored Neighborhood Regression (ANR) method [44], the Adjusted Anchored Neighborhood Regression (A+) method [45], and the Super-Resolution Convolutional Neural Network (SRCNN) method [9].

The SC-based method [49] is a synthesis dictionary based method with a coupled LR and HR dictionary. The LR dictionary is learned using K-SVD [2] and has 1024 atoms, and the HR dictionary is learned using least squares. It assumes that a LR patch and its corresponding HR patch share the same sparse code, which is retrieved using OMP [30]. The input LR feature is the concatenation of the intensity, the first-order derivatives and the second-order derivatives of the LR data, and is further compressed using Principal Component Analysis (PCA). The ANR method [44] and the A+ method [45] use the same feature representation as [49]. They apply a learned LR synthesis dictionary for LR patch clustering and attach a regression model to each dictionary atom. The super-resolution algorithm finds the nearest-neighbor atom for each input LR signal and applies the corresponding regression model for HR signal prediction. The dictionary has 1024 atoms and thus there are 1024 regression models. The A+ method [45] represented the state of the art before the emergence of methods based on deep convolutional neural networks. The aforementioned methods [49, 44, 45] are all patch-based. The SRCNN method [9] was the first to use a convolutional neural network for single image super-resolution. SRCNN has 3 layers, each consisting of a set of filters with a fixed spatial size. It takes the Bicubic-upscaled image as input and is able to upscale the input LR image without dividing it into patches.

In Table III, DeepAM (BP) and DeepAM denote the 3-layer DeepAM of Table II with and without the backpropagation update, respectively. The input data is the intensity of the LR image patches. Instead of predicting the whole HR patch for each input LR patch, the output of DeepAM is the central HR patch, since the input LR patch does not contain sufficient information to predict the boundary pixels. DeepAM achieves a performance comparable to the existing methods: its average PSNR is around 0.3 dB higher than that of the SC-based method and the ANR method, while it is around 0.2 dB lower than that of A+ and around 0.1 dB lower than that of SRCNN. DeepAM (BP) achieves the highest average PSNR. Figure 12 shows an example of an input LR image and the HR images reconstructed using DeepAM and DeepAM (BP).
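The full patch overlapping used at test time can be sketched as follows: each predicted HR patch is accumulated at its position, and overlapping contributions are averaged (a minimal NumPy illustration; the helper and its interface are hypothetical, assuming square patches given with their top-left positions):

```python
import numpy as np

def overlap_average(patches, positions, out_shape):
    """Average overlapping predicted HR patches into a single image.
    patches: list of (p, p) arrays; positions: top-left (row, col) of each."""
    acc = np.zeros(out_shape)
    weight = np.zeros(out_shape)
    for patch, (i, j) in zip(patches, positions):
        p = patch.shape[0]
        acc[i:i + p, j:j + p] += patch      # accumulate predictions
        weight[i:i + p, j:j + p] += 1.0     # count overlaps per pixel
    # Avoid division by zero in regions not covered by any patch.
    return acc / np.maximum(weight, 1.0)
```

Averaging overlapping predictions suppresses blocking artifacts at patch boundaries, which is why full overlap is used even though it increases the number of patches to process.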
VI Conclusions
In this paper, we proposed a Deep Analysis dictionary Model (DeepAM) which consists of multiple layers of analysis dictionaries with soft-thresholding operators, followed by a synthesis dictionary layer. Each analysis dictionary has been designed to contain two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD). The function of the IPAD and threshold pairs is to pass key information from the input to the deeper layers, while the function of the CAD and threshold pairs is to facilitate the discrimination of key features. We proposed an extension of GOAL [17] to perform dictionary learning for both the IPAD and the CAD. The thresholds are set efficiently according to simple principles, while leading to effective models. Simulation results show that our proposed DeepAM achieves a performance comparable to DNNs and other existing single image super-resolution methods.
References
[1] (2009) Optimization algorithms on matrix manifolds. Princeton University Press.
[2] (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), pp. 4311.
[3] (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (1), pp. 183–202.
[4] (2009) Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis 27 (3), pp. 265–274.
[5] (2013) Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1872–1886.
[6] (2001) Atomic decomposition by basis pursuit. SIAM Review 43 (1), pp. 129–159.
[7] (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory 55 (5), pp. 2230–2249.
[8] (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57 (11), pp. 1413–1457.
[9] (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pp. 184–199.
[10] (2016) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
[11] (2015) Analysis SimCO algorithms for sparse analysis model based dictionary learning. IEEE Transactions on Signal Processing 64 (2), pp. 417–431.
[12] (2007) Analysis versus synthesis in signal priors. Inverse Problems 23 (3), pp. 947.
[13] (2006) Why simple shrinkage is still relevant for redundant representations? IEEE Transactions on Information Theory 52 (12), pp. 5559–5569.
[14] (2010) Sparse and redundant representations: from theory to applications in signal and image processing. Springer.
[15] (1999) Method of optimal directions for frame design. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 2443–2446.
[16] (2016) Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing 64 (13), pp. 3444–3457.
[17] (2013) Analysis operator learning and its application to image reconstruction. IEEE Transactions on Image Processing 22 (6), pp. 2138–2150.
[18] (2018) A deep dictionary model for image super-resolution. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] (2019) A deep dictionary model to preserve and disentangle key features in a signal. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] (2017) SRHRF+: self-example enhanced single image super-resolution using hierarchical random forests. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on New Trends in Image Restoration and Enhancement.
[21] (2015) Fast image interpolation via random forests. IEEE Transactions on Image Processing 24 (10), pp. 3232–3245.
[22] (2017) Learning hierarchical decision trees for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology.
[23] (2015) A bimodal co-sparse analysis model for image processing. International Journal of Computer Vision 114 (2-3), pp. 233–247.
[24] (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654.
[25] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[26] (2015) Deep learning. Nature 521 (7553), pp. 436.
[27] (2018) Geometric understanding of deep learning. arXiv preprint arXiv:1805.10451.
[28] (2006) Bayesian l1-norm sparse learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. V–V.
[29] (2018) Deep dictionary learning: a parametric network approach. arXiv preprint arXiv:1803.04022.
[30] (1993) Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 (12), pp. 3397–3415.
[31] (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
[32] (2014) On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 2924–2932.
[33] (2017) Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research 18 (1), pp. 2887–2938.
[34] (2018) Learning filter bank sparsifying transforms. IEEE Transactions on Signal Processing 67 (2), pp. 504–519.
[35] (2008) Optimal denoising in redundant representations. IEEE Transactions on Image Processing 17 (8), pp. 1342–1352.
[36] (2012) Learning sparsifying transforms. IEEE Transactions on Signal Processing 61 (5), pp. 1072–1086.
[37] (2014) Dictionary learning for analysis-synthesis thresholding. IEEE Transactions on Signal Processing 62 (22), pp. 5962–5972.
[38] (2013) Analysis K-SVD: a dictionary-learning algorithm for the analysis sparse model. IEEE Transactions on Signal Processing 61 (3), pp. 661–677.
[39] (2010) Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing 58 (3), pp. 1553–1564.
[40] (1985) Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
[41] (2018) Multi-layer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing 66 (15), pp. 4090–4104.
[42] (2016) Deep dictionary learning. IEEE Access 4, pp. 10096–10109.
[43] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288.
[44] (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1920–1927.
[45] (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pp. 111–126.
[46] (2013) Constrained overcomplete analysis operator learning for cosparse signal modelling. IEEE Transactions on Signal Processing 61 (9), pp. 2341–2355.
[47] (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873.
[48] (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
[49] (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730.