I Introduction
Convolutional dictionary learning has attracted increasing interest in the signal and image processing communities as it leads to a more elegant framework for high-dimensional signal analysis. An advantage of convolutional dictionaries [4, 41, 16, 32, 7, 9, 6, 31, 42, 29, 28, 38, 1] is that they can take the high-dimensional signal as input for sparse representation and processing, whereas traditional approaches [3, 14, 43, 44, 33, 34, 15] have to divide the high-dimensional signal into overlapping low-dimensional patches and perform sparse representation on each patch independently.
A convolutional dictionary models the convolution between a set of filters and a signal. It is a structured dictionary and can be represented as a concatenation of Toeplitz matrices, where each Toeplitz matrix is constructed from the taps of a filter and the filters are usually assumed to have compact support. A convolutional dictionary is therefore effective for processing high-dimensional signals while keeping the number of free parameters small.
To achieve efficient convolutional dictionary learning, the convolutional dictionary is usually modelled as a concatenation of circulant matrices [4, 41, 16, 32, 7, 9] by assuming a periodic boundary condition on the signals. As all circulant matrices share the same set of eigenvectors, given by the Discrete Fourier Transform (DFT) matrix, a circular convolution can be represented as a multiplication in the Fourier domain and implemented efficiently using the Fast Fourier Transform (FFT). However, using a circulant matrix to approximate a general Toeplitz matrix may lead to boundary artifacts [16, 22, 8], especially when the boundary region is large.
A multilayer convolutional dictionary model is able to represent multiple levels of abstraction of the input signal. Due to the associativity property of convolution, multiplying two convolutional dictionaries results in a convolutional dictionary whose filters are the convolutions of the two sets of filters and have a larger support than the original filters. In the Multi-Layer Convolutional Sparse Coding (ML-CSC) model in [28, 38, 30], there are multiple layers of convolutional synthesis dictionaries. By increasing the number of layers, the global dictionary, which is the product of the convolutional dictionaries, is able to represent more complex structures. The ML-CSC work [28] provides theoretical insights into the conditions under which the layered sparse pursuit succeeds and interprets the forward pass of Deep Neural Networks (DNNs) as such a layered sparse pursuit.
Convolutional Neural Networks (CNNs) [25, 26, 37] have been widely used for processing images and achieve state-of-the-art performance in many applications. A CNN consists of a cascade of convolution operations and element-wise nonlinear operations. The Rectified Linear Unit (ReLU) activation function is one of the most popular nonlinear operators used in DNNs. With the multilayer convolution structure, a deeper layer in a CNN receives information from a correspondingly wider region of the input signal. The ReLU operator provides a nonlinear transformation and leads to a sparse feature representation. The parameters of a CNN are usually optimized using the backpropagation algorithm [35].
In this paper, we propose a Deep Convolutional Analysis Dictionary Model (DeepCAM) which consists of multiple layers of convolutional analysis dictionaries and soft-thresholding operations, followed by a layer of convolutional synthesis dictionary. The motivation is to use DeepCAM as a tool to interpret the workings of CNNs from the sparse representation perspective. Single image super-resolution [43, 44, 39, 40, 20, 21, 19, 18, 11, 12, 13, 36] is used as a sample application to validate the proposed model design. The low-resolution image is fed to DeepCAM as a whole and is not partitioned into patches. At each layer of DeepCAM, the input signal is multiplied with a convolutional analysis dictionary and then passed through soft-thresholding operators. At the last layer, the convolutional synthesis dictionary is used to predict the high-resolution image.
The contribution of this paper is twofold:

We propose a Deep Convolutional Analysis Dictionary Model (DeepCAM) for single image super-resolution, where at each layer, the convolutional analysis dictionary and the soft-thresholding operation are designed to simultaneously achieve information preservation and discriminative representation.

We propose a convolutional analysis dictionary learning method by explicitly modelling the convolutional dictionary with a Toeplitz structure. By exploiting the properties of Toeplitz matrices, the convolutional analysis dictionary can be efficiently learned from a set of training samples. Simulation results on single image super-resolution are used to validate our proposed DeepCAM and convolutional dictionary learning method.
The rest of the paper is organized as follows. In Section II we show how we build a convolutional analysis dictionary from an unstructured dictionary. In Section III, we propose an efficient convolutional analysis dictionary learning algorithm by exploiting the properties of Toeplitz matrices. Section IV presents the proposed Deep Convolutional Analysis Dictionary Model (DeepCAM) and Section V describes the complete learning algorithm. Section VI presents simulation results on the single image super-resolution task and the final section draws conclusions.
II Convolutional Analysis Dictionary
An analysis dictionary contains row atoms and is usually assumed to be overcomplete with . Given a signal of interest , the analysis dictionary should be able to sparsify while preserving its essential information. That is, the analysis coefficients are sparse but still contain sufficient information for further processing.
The focus of this paper is on learning convolutional analysis dictionaries which model the convolution between a signal and a set of filters. The filters' taps depend on the rows of the analysis dictionary. In what follows, we first show how we build our convolutional analysis dictionary from an unstructured analysis dictionary. We then study strategies to learn a convolutional analysis dictionary from a set of training samples.
The convolution can be represented as a multiplication with a convolutional analysis dictionary. Let us assume that the input signal is 1-dimensional for simplicity. The convolution between an atom and an input signal with can be expressed as:
(1) 
where denotes the convolution operator, and is a Toeplitz matrix with columns which is constructed using as follows:
(2) 
where is the th coefficient of , and with is an indicator matrix with 1s on the th upper diagonal and 0s on other locations.
Given an unstructured analysis dictionary and an input signal , the convolution between and each row of can be expressed as a matrix multiplication where is converted to a convolutional analysis dictionary which can be represented as a concatenation of Toeplitz matrices along the column direction:
(3) 
Instead of assuming a circulant structure, we model the convolution with a Toeplitz matrix. The represented convolution is performed only within the input signal, that is, without padding at the boundaries.
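The construction in Eqns. (1) to (3) can be sketched numerically as follows. This is a minimal illustration under assumed notation that does not come from the paper: `Omega` is an m x n unstructured analysis dictionary, `N` is the signal length, and the function names are hypothetical. The sliding inner product is used as the convolution convention (reversing the taps gives the flipped convention).

```python
import numpy as np

def toeplitz_from_atom(omega, N):
    """(N - n + 1) x N banded Toeplitz matrix built from the n taps of omega."""
    omega = np.asarray(omega, dtype=float)
    n = omega.size
    rows = N - n + 1
    if rows < 1:
        raise ValueError("signal length N must be at least the filter length n")
    T = np.zeros((rows, N))
    for j, tap in enumerate(omega):
        # np.eye(rows, N, k=j) is an indicator matrix with ones on the j-th upper diagonal
        T += tap * np.eye(rows, N, k=j)
    return T

def conv_dictionary(Omega, N):
    """Stack one Toeplitz block per row atom: result is m(N - n + 1) x N."""
    return np.vstack([toeplitz_from_atom(w, N) for w in np.atleast_2d(Omega)])

rng = np.random.default_rng(0)
Omega = rng.standard_normal((2, 3))   # illustrative sizes: m = 2 atoms, n = 3 taps
x = rng.standard_normal(6)            # N = 6
H = conv_dictionary(Omega, x.size)
# Reference: unpadded ("valid") sliding inner product of x with each atom
reference = np.concatenate([np.correlate(x, w, mode="valid") for w in Omega])
print(H.shape, np.allclose(H @ x, reference))
```

Because no padding is applied, each block has N - n + 1 rows, so the stacked matrix can have more rows than columns even when `Omega` itself is not overcomplete.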
Fig. 1 shows an example of how we build a convolutional analysis dictionary from the unstructured analysis dictionary with and . Note that the analysis dictionary in Fig. 1(a) is not overcomplete, while the convolutional analysis dictionary has more rows than columns.
The above description applies to 1-dimensional input signals, but it can be extended to multi-dimensional signals such as images. A convolutional analysis dictionary will then take the form of a concatenation of doubly block Toeplitz matrices (i.e. matrices with block Toeplitz structure where each block is itself a Toeplitz matrix). Similar to Eqn. (2), the doubly block Toeplitz structure can be represented with corresponding indicator matrices. Fig. 2(a) shows an example of 2-dimensional convolution between an image region and a filter and Fig. 2(b) shows the corresponding convolutional analysis dictionary.
III Learning Convolutional Analysis Dictionaries
In this section, we propose an efficient convolutional analysis dictionary learning algorithm by exploiting the properties of the Toeplitz structure within the dictionary.
For simplicity, let us assume that the input data is made of 1-dimensional vectors. Therefore, the convolutional analysis dictionary is a concatenation of Toeplitz matrices as we discussed in Section II. The proposed learning algorithm can be easily extended to the multi-dimensional case where the convolutional dictionary is a concatenation of doubly block Toeplitz matrices.
Let us assume that convolution is performed between an analysis dictionary and an input signal with . Therefore, the convolutional analysis dictionary will be of size with . Given that is built using , it has the same number of free parameters as despite being a much bigger matrix. This means that if we were to optimize directly we would end up with a computationally inefficient approach.
We mitigate this issue by first observing that when we assume that the convolutional filters have compact support , the convolutional analysis dictionary has many zero entries. It is therefore inefficient to evaluate by directly multiplying with . As illustrated in Fig. 3, with the commutativity property of convolution (i.e. ), the matrix multiplication between a Toeplitz matrix and an input vector can be efficiently implemented as:
(4) 
where is the right dual matrix of and is defined as:
(5) 
where is the th coefficient of , and is an indicator matrix with 1s on the th skew-diagonal and 0s on other locations. Note that the right dual matrix is a matrix without zero entries and has columns.
[Table: symbols and their sizes]
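The right dual matrix of Eqns. (4) and (5) can be illustrated with a short sketch. The names `right_dual`, `x`, `omega` and `n` are assumptions introduced here for illustration. Entry (i, j) of the dual matrix is taken to be `x[i + j]`, which places each sample on a skew-diagonal and leaves no structural zeros.

```python
import numpy as np

def right_dual(x, n):
    """(N - n + 1) x n matrix whose entry (i, j) is x[i + j] (constant along skew-diagonals)."""
    x = np.asarray(x, dtype=float)
    rows = x.size - n + 1
    if rows < 1:
        raise ValueError("signal must be at least as long as the filter")
    idx = np.arange(rows)[:, None] + np.arange(n)[None, :]
    return x[idx]

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
omega = rng.standard_normal(3)
R = right_dual(x, omega.size)
# Multiplying the small dense matrix R by the atom gives the same result as
# multiplying the large, mostly zero Toeplitz matrix of the atom by x.
print(R.shape, np.allclose(R @ omega, np.correlate(x, omega, mode="valid")))
```

The dense product costs (N - n + 1) n multiplications, whereas the explicit Toeplitz product spends most of its (N - n + 1) N multiplications on zero entries.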
Secondly, whenever possible, we will pose the optimization problem using whilst imposing the constraints associated with the structured matrix as the actual analysis dictionary learning problem.
We want the convolutional analysis dictionary to satisfy four properties: (i) its row atoms span the input data space; (ii) it is able to sparsify the input data; (iii) the row atoms are of unit norm; (iv) there are no pairs of row atoms in that are linearly dependent.
Different from the unstructured analysis dictionary learning case, we propose to use two sets of input training data with different sizes. Let us denote the superpatch training data as and denote the small patch training data as . Recall that is the filters' size and is the dimension of input signals and we assume . Both the superpatch and small patch training datasets are extracted from an external training dataset. The superpatch training data will be used to impose property (i) which is a global property of the convolutional dictionary. The small patch training data will be used to impose property (ii).
The first learning objective is that the convolutional analysis dictionary should be able to span the input data space in order to preserve the information within the input superpatch training data . The superpatch training dataset defines the subspace covered by the input data. Let us denote with the orthogonal basis covering the signal subspace of the input superpatch data where we assume that this subspace has dimension . We also denote with the orthogonal basis of the orthogonal complement of the signal subspace of . These two bases will be used to impose that the row space of the learned convolutional analysis dictionary spans the input data space while being orthogonal to the nullspace of . The information preservation constraint can be interpreted as a rank constraint on the convolutional analysis dictionary, which is usually achieved by imposing a logarithm determinant constraint:
(6) 
The size of the convolutional analysis dictionary can be huge, especially when the input data is multidimensional. Therefore it would be computationally inefficient to evaluate Eqn. (6) and its first order derivative directly. By exploiting the properties of the convolutional analysis dictionary, we propose an efficient reformulation of Eqn. (6) which is based on the analysis dictionary .
With the definition of the right dual matrix, the multiplication between and the th orthogonal basis element of can be expressed as:
(7) 
where denotes the vectorization operation . The vectorization of a 2-dimensional signal can be expressed as:
(8) 
where is the th canonical basis vector of , that is, (with 1 on the th location), and denotes the Kronecker product.
The information preservation constraint in Eqn. (6) can therefore be reformulated and expressed in terms of the analysis dictionary as:
(9) 
where with .
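The reformulation from Eqn. (6) to Eqn. (9) can be checked numerically with the sketch below. All names are illustrative assumptions: `Omega` is the m x n dictionary, `W` holds p orthonormal basis vectors of the signal subspace as columns, and the normalisation of the log-determinant penalty follows a common GOAL-style choice rather than the exact constants of the paper.

```python
import numpy as np

def conv_dictionary(Omega, N):
    """Explicit m(N - n + 1) x N concatenation of Toeplitz blocks."""
    m, n = Omega.shape
    rows = N - n + 1
    blocks = [sum(w[j] * np.eye(rows, N, k=j) for j in range(n)) for w in Omega]
    return np.vstack(blocks)

def right_dual(x, n):
    """(N - n + 1) x n matrix with entry (i, j) = x[i + j]."""
    rows = x.size - n + 1
    return x[np.arange(rows)[:, None] + np.arange(n)[None, :]]

def project_basis(Omega, W):
    """Compute H(Omega) @ W using only Omega and the right dual of each basis vector."""
    n = Omega.shape[1]
    # Column-major vectorisation stacks the responses atom by atom,
    # which matches the block ordering of the explicit dictionary.
    cols = [(right_dual(w, n) @ Omega.T).ravel(order="F") for w in W.T]
    return np.stack(cols, axis=1)

def logdet_penalty(G):
    """Assumed form: -log det(G^T G / rows) / (p log p); small when G has full column rank."""
    rows, p = G.shape
    sign, logabsdet = np.linalg.slogdet(G.T @ G / rows)
    return np.inf if sign <= 0 else -logabsdet / (p * np.log(p))

rng = np.random.default_rng(0)
Omega = rng.standard_normal((3, 3))
W = np.linalg.qr(rng.standard_normal((8, 4)))[0]   # orthonormal basis, p = 4
G_fast = project_basis(Omega, W)
G_full = conv_dictionary(Omega, 8) @ W
print(np.allclose(G_fast, G_full), logdet_penalty(G_fast))
```

The penalty and its value are identical for both routes; the second route never forms the large structured matrix.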
The gradient of can be expressed as:
(10) 
where and .
With the information preservation constraint in Eqn. (9), the learned is constrained to span the signal subspace defined by . However, we still need to exclude the nullspace components of the training data from . Specifically, the Toeplitz matrix should not be within the subspace spanned by to avoid a zero response when multiplying with .
Therefore we define the feasible set of the convolutional analysis dictionary as with being the unit sphere in , being the product of unit spheres and being the orthogonal complement of . The unit sphere constraint ensures that the unit norm condition is satisfied. The feasible set is defined in , while we wish to have a feasible set for which is defined in with and can be more efficiently implemented.
The operation of orthogonal projection onto the complementary subspace of can be represented by the projection matrix given by:
(11) 
where is the identity matrix and is the pseudoinverse of .
The orthogonal projection operation is achieved by multiplying the convolutional analysis dictionary with the projection matrix. The projection is applied on the rows of . With the definition of the right dual matrix, the orthogonal projection operation can be expressed in terms of the analysis dictionary atom as:
(12) 
where denotes the th row of .
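The projection in Eqn. (11) can be sketched as follows. The names are assumptions for illustration: `B` is a matrix whose columns span the subspace to be excluded, and `H` stands for any matrix whose rows are to be projected.

```python
import numpy as np

def complement_projector(B):
    """Orthogonal projector onto the complement of the column space of B (N x q)."""
    return np.eye(B.shape[0]) - B @ np.linalg.pinv(B)

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 2))        # hypothetical basis of the excluded subspace
H = rng.standard_normal((5, 8))        # rows play the role of dictionary atoms
P = complement_projector(B)
H_proj = H @ P                         # project every row (P is symmetric)
print(np.allclose(H_proj @ B, 0.0), np.allclose(P @ P, P))
```

After the projection every row has a zero response on the excluded subspace, but, as the text notes next, any Toeplitz structure in the rows is generally lost.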
We note that, after the projection, the Toeplitz structure within may not be preserved and needs to be imposed again. The Toeplitz matrix closest to is obtained by averaging over the diagonal elements [5]. The orthogonal projection operation and the averaging operation can be jointly represented and applied to the atoms of the analysis dictionary. Let us define a vector whose inner product with equals the average value of the th diagonal elements of . It can be expressed as:
(13) 
where denotes the th row of .
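The diagonal-averaging step can be illustrated in isolation. This sketch assumes a banded (N - n + 1) x N Toeplitz block with n taps; the function names are hypothetical. Averaging each of the n diagonals is the least-squares fit of such a banded Toeplitz matrix in the Frobenius norm.

```python
import numpy as np

def toeplitz_from_atom(omega, N):
    """Banded Toeplitz block with the taps of omega on its first n upper diagonals."""
    rows = N - len(omega) + 1
    return sum(w * np.eye(rows, N, k=j) for j, w in enumerate(omega))

def average_diagonals(M, n):
    """Taps of the closest banded Toeplitz matrix: mean of each of the first n diagonals."""
    return np.array([np.diagonal(M, offset=k).mean() for k in range(n)])

rng = np.random.default_rng(0)
omega = rng.standard_normal(3)
T = toeplitz_from_atom(omega, 7)
M = T + 0.1 * rng.standard_normal(T.shape)      # structure destroyed by a perturbation
omega_hat = average_diagonals(M, omega.size)
T_hat = toeplitz_from_atom(omega_hat, 7)
# The averaged taps are at least as close to M as the original taps
print(np.linalg.norm(M - T_hat) <= np.linalg.norm(M - T))
```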
The matrix therefore represents simultaneously the orthogonal projection operation and the averaging operation. Let us denote the feasible set of the analysis dictionary as where represents the product of unit spheres and represents the orthogonal complementary subspace of the nullspace of . The operation of the orthogonal projection onto the tangent space can then be represented by the projection matrix :
(14) 
where is the identity matrix, and .
The sparsifying property of the convolutional analysis dictionary over the superpatch training data can be achieved by imposing the sparsifying property of the analysis dictionary over the small patch training data. The rationale is that the row atoms of the convolutional analysis dictionary only operate on local regions of the input signal as illustrated in Eqn. (4) and Fig. 3. Similar to [15, 23, 17], the sparsifying constraint is imposed by using a log-square function which promotes analysis dictionary atoms that sparsify the small patch training data:
(15) 
where is a tunable parameter which controls the sparsifying ability of the learned dictionary.
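A log-square sparsity measure of the kind referred to in Eqn. (15) can be sketched as below. The normalisation by the number of coefficients and by log(1 + nu) is an assumption made here for readability and may differ from the constants in the paper; `Omega`, `X` and `nu` are illustrative names.

```python
import numpy as np

def log_square_sparsity(Omega, X, nu):
    """Mean of log(1 + nu * a^2) over all analysis coefficients a, scaled by log(1 + nu)."""
    A = Omega @ X                      # analysis coefficients, one column per training patch
    return np.mean(np.log1p(nu * A ** 2)) / np.log1p(nu)

Omega = np.eye(4)
sparse_patch = np.array([[1.0], [0.0], [0.0], [0.0]])
dense_patch = np.full((4, 1), 0.5)     # same Euclidean norm as sparse_patch
nu = 100.0
print(log_square_sparsity(Omega, sparse_patch, nu),
      log_square_sparsity(Omega, dense_patch, nu))
```

For two coefficient vectors with equal energy, the one with fewer non-zero entries receives the smaller value, and a larger `nu` makes the measure behave more like a count of non-zeros.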
The linearly dependent penalty and the unit norm constraint can also be imposed directly on the analysis dictionary . Linearly dependent row atoms (e.g. ) are penalized by using a logarithm barrier term :
(16) 
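A logarithmic barrier on pairwise atom correlations, in the spirit of Eqn. (16), can be sketched as follows. The unnormalised sum over atom pairs is an assumed form, and the rows of `Omega` are assumed to have unit norm.

```python
import numpy as np

def coherence_barrier(Omega):
    """-sum over pairs k < l of log(1 - <w_k, w_l>^2) for unit-norm row atoms."""
    G = Omega @ Omega.T                          # Gram matrix of the row atoms
    iu = np.triu_indices(Omega.shape[0], k=1)    # indices of distinct pairs
    return -np.sum(np.log(1.0 - G[iu] ** 2))

orthogonal = np.eye(3)
t = 0.01                                          # small angle between two atoms
nearly_parallel = np.array([[1.0, 0.0], [np.cos(t), np.sin(t)]])
print(coherence_barrier(orthogonal), coherence_barrier(nearly_parallel))
```

The term vanishes for mutually orthogonal atoms and grows without bound as two atoms approach the same or opposite directions.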
We observe that, by exploiting the Toeplitz structure, we have been able to impose the desired properties of a convolutional analysis dictionary by imposing constraints on the lower-dimensional analysis dictionary . This reduces computational costs and memory requirements.
Combining the information preservation constraint in Eqn. (9), feasible set constraint in Eqn. (14), sparsifying constraint in Eqn. (15), and linearly dependent penalty term in Eqn. (16), the objective function for the convolutional analysis dictionary problem can be expressed as:
(17) 
where with and being the regularization parameters.
IV Deep Convolutional Analysis Dictionary Model
In this section, we introduce our Deep Convolutional Analysis Dictionary Model (DeepCAM). DeepCAM is a convolutional extension of the Deep Analysis Dictionary Model (DeepAM) [17]. Different from DeepAM, which is patch-based, DeepCAM performs convolution and element-wise soft-thresholding at image level on all layers without dividing the input image into patches.
When it comes to Single Image Super-Resolution (SISR), convolutional neural networks are designed using two main strategies: the early-upsampling approaches [11, 12] and the late-upsampling approaches [13, 36]. The early-upsampling approaches [11, 12] first upsample the low-resolution (LR) image to the resolution of the desired high-resolution (HR) image through bicubic interpolation and then perform convolution on the upsampled image. The drawback is that this leads to a large number of model parameters and a high computational complexity during testing, as the feature maps are of the same size as the HR image. The late-upsampling approaches [13, 36] perform convolution on the input LR image and apply a deconvolution layer [13] or a sub-pixel convolution layer [36] at the last layer to predict the HR image. The late-upsampling approaches have fewer parameters and a lower computational cost than the early-upsampling ones.
SISR is used as a sample application to validate our proposed design. We utilize a strategy similar to the late-upsampling approach. The LR image is used as input to DeepCAM without bicubic interpolation. At each layer, the convolution and soft-thresholding operations are applied to the corresponding input signal. For SISR with upsampling factor , the synthesis dictionary consists of atoms. The convolution between the synthesis dictionary and its input signal yields output channels which correspond to subsampled versions of the HR image. The final predicted HR image can then be obtained by reshaping and combining the output channels.
The parameters of a layer DeepCAM include layers of analysis dictionary and soft-threshold pairs and a single synthesis layer modelled with dictionary . The atoms of the dictionaries represent filters. Let us denote with the number of filters at the th layer and with the spatial support size of the convolutional filters. Since there are filters at the previous layer, there are free parameters at layer with . Therefore the complete set of free parameters is given by the analysis dictionaries , the soft-thresholds and the synthesis dictionary where .
Fig. 4 shows an example of a 2-layer DeepCAM for image super-resolution. The input LR image denoted with passes through multiple layers of convolution with the analysis dictionary and soft-thresholding. There are 4 synthesised HR sub-images which are obtained by convolving the last layer analysis feature maps with the synthesis dictionary and are rearranged to generate the final predicted HR image according to the sampling pattern.
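The rearrangement of the synthesised sub-images into one HR image can be sketched as below. The sampling pattern used here (channel `a * s + b` holds the pixels at row offset `a` and column offset `b`) is an assumption chosen for illustration; the paper's exact ordering is not specified in the text.

```python
import numpy as np

def combine_subimages(Y, s):
    """Interleave s*s sub-images of shape (h, w) into one (s*h, s*w) image."""
    c, h, w = Y.shape
    if c != s * s:
        raise ValueError("expected s*s sub-images")
    # Axes after reshape: (row offset a, column offset b, i, j) -> HR[s*i + a, s*j + b]
    return Y.reshape(s, s, h, w).transpose(2, 0, 3, 1).reshape(h * s, w * s)

s = 2
hr = np.arange(36.0).reshape(6, 6)
# Split a known image into its 4 polyphase components, then recombine them
subimages = np.stack([hr[a::s, b::s] for a in range(s) for b in range(s)])
print(np.array_equal(combine_subimages(subimages, s), hr))
```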
Let us denote with the input signal at the th layer, and denote with the th atom of . The convolution and soft-thresholding operations corresponding to the th atom and threshold pair can be expressed as (this is a 2D convolution performed along the image coordinates):
(18) 
where represents the operation which reshapes a vector of length to a tensor with size , is the convolution result, denotes the element-wise soft-thresholding operation with threshold , and is the sparse representation after thresholding.
Fig. 5 illustrates the convolution and the soft-thresholding operation described in Eqn. (18). The convolution linearly transforms the input signal to a 2D representation . An element-wise soft-thresholding operation is then applied to every element of and generates a sparse 2D representation .
By stacking the sparse 2D representation , the th layer output signal can be represented as . For simplicity, let us denote the th layer convolution and soft-thresholding operation as:
(19) 
When the convolution of and is represented by a convolutional analysis dictionary with , the convolution and soft-thresholding operations can be expressed as follows:
(20) 
where is an all-ones vector of size , and is the Kronecker product.
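One layer of convolution followed by element-wise soft-thresholding, as in Eqns. (18) to (20), can be sketched as follows. The array layout (channels first), the unpadded sliding inner product and all names are assumptions made for this illustration.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding: shrink magnitudes by lam and zero out the rest."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def conv_soft_layer(X, filters, thresholds):
    """X: (C, H, W) input, filters: (m, C, p, p), thresholds: (m,). Returns (m, H-p+1, W-p+1)."""
    C, H, W = X.shape
    m, _, p, _ = filters.shape
    out = np.empty((m, H - p + 1, W - p + 1))
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            patch = X[:, i:i + p, j:j + p]
            # Inner product of every filter with the local patch (no boundary padding)
            out[:, i, j] = np.tensordot(filters, patch, axes=3)
    # One threshold per atom, shared by all spatial positions of its feature map
    return soft_threshold(out, thresholds[:, None, None])

rng = np.random.default_rng(0)
X = rng.standard_normal((1, 8, 8))
filters = rng.standard_normal((4, 1, 3, 3))
Z = conv_soft_layer(X, filters, thresholds=np.full(4, 1.0))
print(Z.shape, np.mean(Z == 0.0))      # output size and fraction of zeroed coefficients
```

Sharing one threshold across a whole feature map corresponds to replicating each threshold with an all-ones vector in the matrix form of the layer.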
The complete model of a layer DeepCAM can then be expressed as:
(21) 
where denotes the estimated HR images.
V Learning A Deep Convolutional Analysis Dictionary Model
In this section, we introduce the proposed algorithm for learning both the convolutional analysis dictionary and the soft-thresholds in DeepCAM.
We adopt a joint Information Preserving and Clustering strategy as proposed in DeepAM [17]. At each layer, the analysis dictionary is divided into two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD) . The IPAD and soft-threshold pair will generate feature maps that can preserve the information from the input image. The CAD and soft-threshold pair will generate feature maps with strong discriminative power that can facilitate the prediction of the HR image. To achieve this goal, there should be a sufficient number of IPAD and CAD atoms. Guidelines on how to determine the size of each dictionary are also provided in this section.
V-A Learning IPAD and Threshold Pair
The Information Preserving Analysis Dictionary (IPAD) will be learned using the proposed convolutional analysis dictionary learning method of Section III. The thresholds will be set according to the method used in DeepAM [17].
A multilayer convolutional analysis dictionary naturally possesses a multiscale property. The product of two convolutional dictionaries leads to a convolutional dictionary whose equivalent filters are given by the convolution of the filters in the two dictionaries due to the associativity property of convolution (i.e. ).
Let us denote with the th layer convolutional analysis dictionary constructed using convolutional filters with patch size . The effective convolutional analysis dictionary has filters with spatial patch size:
(22) 
An example of a twolayer convolutional analysis dictionary is shown in Fig. 6. The effective dictionary has an effective patch size that increases with the number of layers and can be large even when each convolutional analysis dictionary uses filters with small patch size.
When the support size of a convolutional analysis dictionary is small, its row atoms can only receive local information from the whole input signal. With an increased effective patch size, the row atoms of the convolutional analysis dictionary at a deeper layer will receive information from a larger segment of the input signal.
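The growth of the effective filter support can be verified with a short one-dimensional sketch. The function name is hypothetical and unit stride is assumed; the formula is the standard one for cascaded unpadded filters and may be indexed differently from Eqn. (22).

```python
import numpy as np

def effective_support(sizes):
    """Support of the filter equivalent to a cascade of stride-1 filters of the given sizes."""
    return 1 + sum(p - 1 for p in sizes)

rng = np.random.default_rng(0)
x = rng.standard_normal(12)
a = rng.standard_normal(3)
b = rng.standard_normal(3)
# Two layers of unpadded filtering ...
two_layers = np.correlate(np.correlate(x, a, mode="valid"), b, mode="valid")
# ... equal a single layer whose filter is the convolution of the two filters
equivalent = np.convolve(a, b)
one_layer = np.correlate(x, equivalent, mode="valid")
print(equivalent.size == effective_support([3, 3]), np.allclose(two_layers, one_layer))
```

This equivalence holds only for the linear part of the cascade; with soft-thresholding between layers the same support applies, but the cascade is no longer a single linear filter.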
For each HR pixel at the synthesised HR images , there is a corresponding superpatch region on each layer which contributes all the information for predicting that pixel. Let us denote with the superpatch size at the th layer. It can be expressed in terms of the patch size of the convolutional filters from the final layer to layer :
(23) 
Fig. 7 shows the superpatches at different layers for a layer DeepCAM. Note that the superpatch size at a shallower layer is larger than that in a deeper layer.
In the proposed convolutional analysis dictionary learning method ConvGOAL+, there are two sets of training data: the superpatch training data and the small patch training data . The superpatch data is used to impose the rank constraint. The small patch data has the same support as the filters and is used to impose the sparsifying and linear independence constraints.
The patch size of the superpatch training data for convolutional analysis dictionary learning should be no smaller than . Otherwise, we cannot ensure that the learned convolutional analysis dictionary will be able to utilize all information within the superpatch for predicting the corresponding HR pixel values.
At the th layer, let us define the support size of the superpatch training data as . The superpatch training data, the small patch training dataset and the groundtruth training dataset are denoted as , and , respectively.
Let us denote as the number of atoms in . With , we will have . From the degree of freedom perspective, there should be at least rows in to ensure that information from will be preserved. This leads to:
(24) 
Eqn. (24) indicates that there should be more atoms for information preservation in a deeper layer of a DeepCAM. For example, when , , , and in a layer DeepCAM, there should be at least 2 atoms in the 1st layer, and 4 atoms in the 2nd layer for information preservation.
Given atoms, the superpatch training data and the small patch training data , the IPAD is learned using the ConvGOAL+ algorithm. The convolutional analysis dictionary will then be able to preserve essential information from the input LR image.
The soft-thresholds should be set properly. As in [17], the inner product between an analysis atom and the small patch training samples can be well modelled by a Laplacian distribution with variance . Therefore, as in [17], the soft-thresholds associated with IPAD are set to be inversely proportional to the variances:
(25) 
where is a scaling parameter, and the variance of the th coefficient can be estimated using the obtained IPAD and the small patch training data .
The free parameter is determined by solving a 1dimensional search problem. The optimization problem for is therefore formulated as:
(26) 
where , is an all ones vector of size , is the Kronecker product, with , and is a discrete set of values.
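The one-dimensional search over the scaling parameter can be sketched as a grid search. This is an illustrative reading of Eqn. (26), not the paper's exact objective: it assumes thresholds of the form `rho / sigma` per atom, a least-squares synthesis map refitted for each candidate, and a squared reconstruction error. All names and the grid values are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def reconstruction_error(A, Y, lam):
    """Squared error of the best linear map from thresholded coefficients to the targets."""
    Z = soft_threshold(A, lam[:, None])
    G = Y @ np.linalg.pinv(Z)              # least-squares synthesis map for these thresholds
    return np.linalg.norm(Y - G @ Z) ** 2

def search_scale(A, Y, sigma, grid):
    """Return the grid value rho (and its error) minimising the error with thresholds rho / sigma."""
    errors = [reconstruction_error(A, Y, rho / sigma) for rho in grid]
    best = int(np.argmin(errors))
    return grid[best], errors[best]

rng = np.random.default_rng(0)
A = rng.laplace(size=(6, 200))             # stand-in for analysis coefficients (atoms x samples)
Y = rng.standard_normal((3, 200))          # stand-in for the target data
sigma = A.std(axis=1)                      # per-atom scale estimated from the coefficients
rho, err = search_scale(A, Y, sigma, grid=[0.0, 0.25, 0.5, 1.0])
print(rho, err)
```

Because only a single scalar is searched, the cost is one least-squares fit per grid value regardless of the number of atoms.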
V-B Learning CAD and Threshold Pair
The objective of a Clustering Analysis Dictionary (CAD) is to perform a linear transformation of its input signal such that the responses are highly correlated with the most significant components of the residual error. Soft-thresholding, which is used as the nonlinearity, sets to zero the data with relatively small responses. The components with large residual error will then be identified.
The number of atoms in is essential to the performance of DeepCAM. Similar to the discussions in Section V-A, with atoms in , the size of the convolutional analysis dictionary will be . For each superpatch region on the LR image, the number of coefficients for discriminative feature representation should not decrease over layers. That is, we would like to have more atoms in than in . Therefore, the number of CAD atoms should meet the condition:
(27) 
Different from the unstructured deep dictionary model, it is not straightforward to set the dictionary sizes. Eqn. (24) and Eqn. (27) provide a guideline on how to set the number of atoms in order to generate representations that are both information preserving and discriminative.
Let us denote with the corresponding HR patch training data of . A synthesis dictionary can be learned to map to by solving:
(28) 
It has a closedform solution:
(29) 
Given , we define the middle resolution (MR) and the residual data as and , respectively. The MR data is a linear transformation of the input small patch training data. The residual data contains the information about the residual energy.
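The closed-form least-squares step and the resulting middle resolution and residual data can be sketched as follows. The symbols `X` (input patches as columns), `Y` (HR patches as columns) and `G` (synthesis dictionary) are assumed names, and the normal-equation solution is the standard unregularised one.

```python
import numpy as np

def learn_synthesis(X, Y):
    """Least-squares map G minimising ||Y - G X||_F^2, i.e. G = Y X^T (X X^T)^{-1}."""
    # Solve (X X^T) G^T = X Y^T instead of forming the inverse explicitly
    return np.linalg.solve(X @ X.T, X @ Y.T).T

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))          # stand-in for small patch training data
Y = rng.standard_normal((8, 100))          # stand-in for the corresponding HR patches
G = learn_synthesis(X, Y)
M = G @ X                                  # middle resolution data: linear estimate of Y
E = Y - M                                  # residual data left for the nonlinear layers
print(np.allclose(E @ X.T, 0.0))           # the residual is orthogonal to the input data
```

The orthogonality check confirms that no further linear map of the input can reduce the residual, which is why the clustering atoms target it through thresholding.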
We propose to learn an analysis dictionary in the ground-truth data domain. If is able to simultaneously sparsify the middle resolution data and the residual data , the atoms within the learned will be able to identify the data in with large residual energy. The th layer CAD is then reparameterized as:
(30) 
Therefore an additional constraint as proposed in [17] is applied to impose the simultaneous sparsifying property. Each analysis atom is enforced to be able to jointly sparsify and :
(31) 
where , is a tunable parameter, and and are the th column of and , respectively.
The objective function for learning the analysis dictionary can then be formulated as:
(32) 
where with , and being the regularization parameters. The functions , and are those defined in Eqn. (15), Eqn. (9) and Eqn. (16), respectively. To have zero mean responses for each learned CAD atom, the feasible set of the analysis dictionary is set to with .
The objective function in Eqn. (32) is optimized using the ConvGOAL+ algorithm. With the learned analysis dictionary , the th layer CAD is then obtained as in Eqn. (30).
As proposed in DeepAM [17], it is both effective and efficient to set the CAD soft-thresholds proportional to the variance of the analysis coefficients. The CAD soft-thresholds are therefore defined as follows:
(33) 
where is a scaling parameter, and is the variance of the Laplacian distribution for the th atom.
The free parameter can be learned using a similar approach to the one used to solve Eqn. (26). As the analysis coefficients can be well modelled by Laplacian distributions, the proportion of data that is set to zero for each pair of atom and threshold will be the same. The optimization problem for is formulated as:
(34) 
where is the estimation residual using IPAD, , is an all-ones vector of size , is the Kronecker product, with , and is a discrete set of values.
V-C Synthesis Dictionary Learning
At the last layer, the synthesis dictionary will transform the th layer deep convolutional representation to the ground-truth training data . The synthesis dictionary can be learned using least squares:
(35) 
Convolving the learned synthesis dictionary with the th layer feature maps results in estimated HR images which can be reshaped and combined to form the final estimated HR image.
The overall learning algorithm for DeepCAM is summarized in Algorithm 2.
VI Simulation Results
In this section, we report the implementation details and numerical results of our proposed DeepCAM method and compare it with other existing single image super-resolution algorithms.
VI-A Implementation Details
Most of the implementation settings are the same as in [17]. The standard 91 training images [43] are used as the training dataset and the Set5 [43] and the Set14 [44] are used as the testing datasets. The color images have been converted from the RGB color space to the YCbCr color space. Image super-resolution is only performed on the luminance channel.
Table II shows the parameter settings of the ConvGOAL+ algorithm for learning the th layer IPAD and CAD. Both the IPAD and the CAD are initialized with i.i.d. Gaussian random entries. The spatial size of the superpatches used for training is set to the minimum value indicated by Eqn. (23) in order to minimize the computational complexity of training. The number of IPAD atoms is set to the minimum integer satisfying Eqn. (24). The number of CAD atoms is then set to with a predefined total number of channels . We apply batch training for the ConvGOAL+ algorithm. The training data has been equally divided into batches. During training, the ConvGOAL+ algorithm is sequentially applied to each batch until the learned dictionary converges or all batches have been used for training. For each batch, iterations of conjugate gradient descent are performed to update the dictionary. The discrete set used for searching the scaling parameter of the thresholds is set to be .
Parameters  

IPAD  —  
CAD  100 
VI-B Analysis of the Learned DeepCAM
In this section, we analyze the learned DeepCAM in terms of the number of layers, learned soft-thresholds and extracted feature maps.
Layer Number  | 1     | 2       | 3
Filter Number | [64]  | [16,64] | [9,25,64]
baby          | 38.03 | 38.21   | 38.30
bird          | 39.74 | 40.19   | 40.60
butterfly     | 31.70 | 31.99   | 32.17
head          | 35.54 | 35.58   | 35.61
woman         | 34.76 | 34.83   | 34.84
Average       | 35.95 | 36.16   | 36.30
Table III shows the PSNR (dB) of the learned DeepCAM with different numbers of layers evaluated on Set5 [44]. For each configuration, the spatial filter size is set to for all layers and the maximum number of filters at the last layer is set to 64. The effective filter size for the DeepCAM with 1, 2, and 3 layers is therefore , , and , respectively. We can see that DeepCAM with more layers achieves a higher average PSNR. From 1 layer to 2 layers, there is an average improvement of about 0.2 dB. In particular, the PSNR of the testing image “bird” improves by around 0.5 dB. With 3 layers, further improvements can be observed on all testing images. The improved performance of DeepCAM with more layers can be attributed to two reasons. First, a deeper model has more non-linear layers and therefore a stronger expressive power. Second, unlike the unstructured deep dictionary model, a deeper convolutional dictionary model has a larger effective filter size, which brings in more information for prediction and therefore improves prediction performance.
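The growth of the effective filter size with depth can be sketched as follows. This assumes stride-1 convolutions without dilation, which is the standard setting but is not stated explicitly here. The filter size `3` in the example is a placeholder chosen for illustration, not a value taken from the text.

```python
def effective_filter_size(filter_sizes):
    """Effective filter size of a cascade of stride-1 convolutions.

    Convolving a filter of size a with a filter of size b yields a filter
    of size a + b - 1, so each extra layer adds (k - 1) to the support.
    """
    size = 1
    for k in filter_sizes:
        size += k - 1
    return size


# Placeholder filter size (an assumption, not from the paper).
k = 3
sizes_by_depth = [effective_filter_size([k] * depth) for depth in (1, 2, 3)]
```

With the placeholder `k = 3`, the support grows by 2 pixels per added layer, which is the mechanism the paragraph above refers to.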
Fig. 8 shows the soft-thresholds of a 3-layer DeepCAM with 9, 25 and 64 filters at layers 1, 2 and 3, respectively. We can observe that the soft-thresholds have a bimodal behaviour. That is, the thresholds corresponding to IPAD are relatively small, while the thresholds corresponding to CAD are relatively large. Another observation is that the amplitude of the soft-thresholds decreases over layers. This leads to denser representations at deeper layers, which can represent more complex signals.
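The link between threshold amplitude and density can be seen in a small sketch of the standard soft-thresholding operator. The coefficient values and the two thresholds below are made-up illustrative numbers, and the helper names are hypothetical.

```python
def soft_threshold(x, lam):
    """Standard soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def density(values):
    """Fraction of non-zero entries in a list of coefficients."""
    return sum(1 for v in values if v != 0.0) / len(values)


# Illustrative coefficients (not taken from the paper).
coeffs = [-2.0, -0.5, 0.1, 0.8, 1.5]

# A large threshold (CAD-like) zeroes most entries, while a small
# threshold (IPAD-like) keeps most of them.
sparse_code = [soft_threshold(c, 1.0) for c in coeffs]
dense_code = [soft_threshold(c, 0.25) for c in coeffs]
```

Here `density(sparse_code)` is smaller than `density(dense_code)`, mirroring the observation that smaller thresholds give denser representations.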
Due to their different learning objectives, the resulting feature maps of IPAD and CAD contain different information. Fig. 9 shows the feature maps of the first layer in a 3-layer DeepCAM. The first 2 feature maps correspond to IPAD. We can see that these two feature maps, especially the feature map in Fig. 9, represent detailed structural information of the input LR image. The feature maps in Fig. 9 and 9 are due to CAD and have zero responses in most regions because of the relatively large soft-thresholds. These maps contain different directional edges corresponding to regions that require non-linear estimation. The combination of these features from both IPAD and CAD forms an informative and discriminative feature representation for predicting the HR image.
Method  SC [44]  ANR [39]  A+ [40]  SRCNN [10]  DeepAM  DeepCAM 
Parameters  65,536  1,064,896  1,064,896  8,128  156,672  34,740 
Images  Bicubic  SC [44]  ANR [39]  A+ [40]  SRCNN [10]  DeepAM [17]  DeepCAM  

baby  36.93  38.11  38.31  38.39  38.16  38.48  38.31  38.31  38.18  38.20 
bird  36.76  39.87  40.00  41.11  40.58  41.01  41.10  41.06  40.68  40.78 
butterfly  27.58  30.96  30.76  32.37  32.58  31.44  32.23  32.56  32.39  32.50 
head  34.71  35.47  35.54  35.64  35.51  35.64  35.57  35.59  35.50  35.55 
woman  32.31  34.59  34.68  35.44  35.07  35.20  34.84  35.16  34.93  35.25 
Average  33.66  35.80  35.86  36.59  36.38  36.35  36.41  36.54  36.34  36.45 
Images  Bicubic  SC [44]  ANR [39]  A+ [40]  SRCNN [10]  DeepAM [17]  DeepCAM  

baboon  24.85  25.47  25.55  25.66  25.64  25.65  25.54  25.63  25.52  25.55 
barbara  27.87  28.50  28.43  28.49  28.53  28.49  27.88  28.14  28.45  28.40 
bridge  26.64  27.63  27.62  27.87  27.74  27.82  27.82  27.87  27.81  27.84 
c.guard  29.16  30.23  30.34  30.34  30.43  30.44  30.28  30.37  30.40  30.40 
comic  25.75  27.34  27.47  27.98  28.17  27.77  27.29  27.82  27.72  27.98 
face  34.69  35.45  35.52  35.63  35.57  35.62  35.54  35.55  35.48  35.51 
flowers  30.13  32.04  32.06  32.80  32.95  32.45  32.65  32.82  32.57  32.78 
foreman  35.55  38.41  38.31  39.45  37.43  38.89  39.31  39.24  39.12  39.23 
lenna  34.52  36.06  36.17  36.45  36.46  36.46  36.35  36.39  36.16  36.25 
man  29.11  30.31  30.33  30.74  30.78  30.57  30.68  30.71  30.58  30.70 
monarch  32.68  35.50  35.46  36.77  37.11  36.06  36.62  36.84  36.56  36.85 
pepper  35.02  36.64  36.51  37.08  36.89  36.87  36.99  36.99  36.86  36.97 
ppt3  26.58  29.00  28.67  29.79  30.31  29.13  28.02  29.18  29.61  29.67 
zebra  30.41  33.05  32.91  33.45  33.14  33.34  32.45  33.04  33.33  33.28 
Average  30.21  31.83  31.81  32.32  32.22  32.11  31.96  32.18  32.16  32.24 
VI-C Comparison with Single Image Super-Resolution Methods
In this section, we compare our proposed DeepCAM method with DeepAM [17] and several existing single image super-resolution methods, including bicubic interpolation, the sparse coding (SC)-based method [44], Anchored Neighbor Regression (ANR) [39], Adjusted Anchored Neighborhood Regression (A+) [40], and the Super-Resolution Convolutional Neural Network (SRCNN) [10].
The SC-based method [44], ANR [39] and A+ [40] are patch-based. The SC-based method [44] is based on synthesis sparse representation and has an LR synthesis dictionary with 1024 atoms and a corresponding HR synthesis dictionary. The input feature is the compressed 1st and 2nd order derivatives of the image patch, obtained using Principal Component Analysis (PCA). ANR and A+ [39, 40] are based on clustering and assign a linear regression model to each cluster. They use the same feature as in [44] but require a huge number of free parameters. The SRCNN method [10] is based on a convolutional neural network with 2 convolutional layers of 64 and 32 filters. The spatial filter size is , and , respectively.
The DeepCAM used for comparison is a 3-layer DeepCAM. For the convolutional analysis dictionary, the spatial filter size is at all layers and the filter number is 9, 25 and 100 for layers 1, 2 and 3, respectively. The convolutional synthesis dictionary has spatial filter size . Therefore, the effective filter size of DeepCAM is .
Table IV shows the number of free parameters in the different single image super-resolution methods. The SC-based method [44] requires a relatively small number of parameters, which mainly come from its two synthesis dictionaries. The ANR method [39] and the A+ method [40] have around 1 million free parameters because there are 1024 regressors with size . The SRCNN method [10] has the fewest parameters. DeepAM has around 160,000 parameters because its dictionaries are not structured and there are 3 layers of analysis dictionaries. The proposed DeepCAM has only approximately 35,000 free parameters, since each structured convolutional dictionary is defined by a small number of free parameters even though it has a huge size.
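A rough sketch of how the parameter count of a convolutional model like DeepCAM can be tallied is given below. The filter numbers 9, 25 and 100 come from the text. Everything else is an assumption, because the corresponding values are missing from this section: 3×3 analysis filters, 5×5 synthesis filters, 4 output channels, one soft-threshold per analysis filter, and a single-channel input. The function names are hypothetical.

```python
def conv_dictionary_params(filter_size, in_channels, out_channels):
    """Free parameters of a 2-D convolutional dictionary with square filters."""
    return filter_size * filter_size * in_channels * out_channels


def deepcam_param_count(analysis_filters, analysis_size, synthesis_size,
                        out_channels, in_channels=1):
    """Count filter taps plus one soft-threshold per analysis filter.

    Assumption: thresholds are counted as free parameters, one per filter.
    """
    total = 0
    channels = in_channels
    for num_filters in analysis_filters:
        total += conv_dictionary_params(analysis_size, channels, num_filters)
        total += num_filters  # assumed: one threshold per filter
        channels = num_filters
    total += conv_dictionary_params(synthesis_size, channels, out_channels)
    return total


# Filter numbers from the text; sizes and output channels are assumed.
count = deepcam_param_count([9, 25, 100], analysis_size=3,
                            synthesis_size=5, out_channels=4)
```

Under these assumed settings the tally is 34,740, which coincides with the DeepCAM entry in Table IV. This should be read as a consistency check on the assumptions rather than as confirmation of the actual configuration.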
We denote with the deep neural network that has the same structure as DeepCAM and is trained using the backpropagation algorithm [35] with learning rate decay step and total epochs for training. The DNNs are implemented in PyTorch with the Adam optimizer [24], batch size , initial learning rate , and decay rate . The training data has been arranged into and patch pairs.
Table V shows the evaluation results of the different methods on Set5. DeepCAM outperforms the SC-based method and ANR by around 0.5 dB and performs similarly to SRCNN and DeepAM, while its PSNR is around 0.2 dB lower than that of A+. The parameter settings of have been tuned to achieve the best performance on Set5. and have been used for comparison and achieve around 0.1 dB and 0.2 dB higher PSNR than DeepCAM.
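For reference, the PSNR figures reported in the tables follow the usual definition, which can be sketched as below. The peak value of 255 assumes 8-bit luminance images. Any border cropping applied before evaluation is not described here and is not modelled.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as lists of rows of pixel intensities.
    Assumption: 8-bit intensity range, so peak defaults to 255.
    """
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for ref_px, est_px in zip(ref_row, est_row):
            squared_error += (ref_px - est_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Toy example: every pixel of the estimate is off by exactly one level.
reference = [[10, 20], [30, 40]]
estimate = [[11, 21], [31, 41]]
value_db = psnr(reference, estimate)
```

Because PSNR is logarithmic in the mean squared error, a gap of 0.2 dB between two methods corresponds to roughly a 4.5% difference in mean squared error.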