1 Introduction
Signal models based on sparsity, convolutional structure, or tensor and manifold structure have garnered increasing interest in recent years. Such models have been used in many applications, including inverse problems, where they often serve to construct regularizers. In particular, learning signal models from training data, or even from corrupted measurements, has shown promise in various settings.
Among sparsity-based models, the synthesis dictionary model [1] is perhaps the most well-known. Various methods have been proposed to learn synthesis dictionaries from signals or image patches [2, 3, 4, 5, 6, 7, 8] or in a convolutional framework [9, 10]. However, the sparse coding problem (i.e., representing a signal as a sparse combination of appropriate dictionary atoms or filters) in the synthesis model (or during learning) typically lacks a closed-form solution and can be NP-hard in general. While numerous algorithms exist for general synthesis sparse coding [11, 12, 13, 14, 15], they may be computationally expensive in large-scale settings.
On the other hand, several recent works [16, 17, 18, 19, 20, 21] have focused on the learning of sparsifying transform models, where the signal is assumed approximately sparse in a transform domain. A main advantage of the transform model is that the sparse approximation or sparse coding problem has a simple and efficient closed-form solution by thresholding. Adaptive sparsifying transforms have demonstrated promising performance in applications such as image and video denoising and medical image reconstruction [22, 23, 24]. They can be learned relatively cheaply, and the learning algorithms often come with provable convergence properties [18, 20, 25]. When the sparsifying transform is learned from or applied to regularly spaced patches of an image, the process involves convolutional/filtering operations. Recent works have thus interpreted transform learning for images as learning convolutional filters or filter banks [26, 27, 28]. Dictionary or transform learning methods often employ various structures, constraints, or regularizers for the model [29, 18, 19, 17, 30] that may help avoid ambiguities, make the model more efficient to use, or provide added robustness in applications.
In this work, we focus on the sparsifying transform model and investigate a framework for learning multi-layer transforms. In each layer, the transform-domain residuals
generated in the previous layer are further sparsified. Transforms are learned to jointly sparsify the residual maps (or, effectively, the residual volume). We optimize the proposed multi-layer transform learning problem in a greedy fashion, starting with the first or base layer and then estimating subsequent layers. A unitary constraint is used for the filters in each layer for simplicity, which leads to efficient alternating updates of the filters and coefficient maps. The residual maps are also downsampled in each layer (i.e., a subset is zeroed out) during learning, which helps avoid dimensionality issues and prevents overfitting to noise. We also present a simple decoder to estimate the image from the multi-layer coefficients.
Our experiments illustrate the behavior of the proposed learning algorithm and the structure of the learned models, and demonstrate their use in image denoising, where multi-layer learned models outperform their single-layer counterparts as well as conventional methods such as learned K-SVD dictionary-based denoising [31]. Finally, we also present a multi-pass extension of the learning scheme, involving the stacking of multiple encoders and decoders, that provides further improvements in denoising. In the following sections, we discuss the proposed model and its learning, along with experiments on images.
2 Model and Algorithm
Here, we describe the transform model and its multi-layer or deep extension, and then present an optimization problem for its learning, together with a greedy algorithm for minimizing it.
2.1 Multi-Layer Model and Learning Formulation
The transform model suggests that for a given signal x and sparsifying transform operator W, we have Wx = z + e, where z has several zeros and e is a small transform-domain residual. Given the signal and operator, the best sparse approximation z is found cheaply by hard-thresholding Wx with some threshold η, i.e., setting the smallest elements (in magnitude) to zero [16].
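This closed-form sparse approximation step can be illustrated with a minimal numpy sketch; the transform W, signal x, and threshold value below are all hypothetical stand-ins:

```python
import numpy as np

def hard_threshold(v, eta):
    # Keep entries with magnitude >= eta; zero out the rest.
    return np.where(np.abs(v) >= eta, v, 0.0)

rng = np.random.default_rng(0)
# Hypothetical unitary sparsifying transform W and signal x
W = np.linalg.qr(rng.standard_normal((16, 16)))[0]
x = rng.standard_normal(16)

eta = 0.5
z = hard_threshold(W @ x, eta)   # closed-form sparse approximation of W x
residual = W @ x - z             # transform-domain residual; entries below eta
```

By construction, every entry of `residual` has magnitude strictly below the threshold, which is the property the multi-layer model later exploits when it re-sparsifies residuals.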
When the transform W is applied to all overlapping patches of an image (including patches overlapping the image boundaries, which wrap around to the other side of the image), each row or atom of W applies to all image patches via inner products to generate a coefficient map. Clearly, this corresponds to a circular convolution (the corresponding filter is obtained by flipping and zero-padding a reshaped row of W) generating each coefficient map, followed by thresholding. When overlapping patches are used with a patch stride greater than one pixel, applying the transform involves convolutions followed by downsampling and thresholding [26]. Methods for learning such sparsifying transforms or filter banks from images typically enforce additional properties to avoid trivial solutions such as the all-zero operator or operators with repeated rows. Regularizers employed in transform learning include negative log-determinant and Frobenius-norm penalties [16], a unitary constraint [32], or penalties enforcing incoherence [17, 26].

We propose a multi-layer extension of the transform model that involves L layers of sparsification. Fig. 1 illustrates this model for L layers. The filtering residuals (i.e., the differences between the pre- and post-thresholded coefficient maps), or residual maps, generated in the first layer for the various filters are stacked together to form a residual volume. In the second and subsequent layers, the transform model jointly sparsifies the residual maps. For the lth layer, we assume here that the third dimension of the filters equals the residual volume depth, so that the convolution is performed only along the spatial dimensions to produce 2D coefficient maps. (If the filter depths were less than the residual volume depths and 3D filtering were performed in the second and higher layers, the dimensions of the residual maps would keep increasing from one layer to the next, potentially rendering the learning of filters infeasible.) In other words, the 2D filter components are convolved with their corresponding 2D residual maps, and the results are summed to produce a 2D coefficient map. In each layer in Fig. 1, the residual maps generated from the different filters are stacked and further jointly sparsified in the next layer. However, in the final layer, only the sparse coefficient maps are computed, without the residual maps. We refer to this model as a Deep Residual Transform (DeepResT) model. Intuitively, the residual maps in each layer may contain fine features/details, which can be further sparsified or encoded.
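The equivalence between applying one transform row to all wrap-around patches and a circular convolution with the flipped, zero-padded atom can be checked numerically. The sketch below uses a 1D signal for simplicity; all sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 32, 4
s = rng.standard_normal(N)   # 1D "image" with circularly wrapped patches
w = rng.standard_normal(n)   # one row (atom) of the transform

# Inner products of the atom with all N overlapping (wrap-around) patches
coeff_map = np.array([w @ np.roll(s, -j)[:n] for j in range(N)])

# Same map via circular convolution with the flipped, zero-padded atom:
# h[m] = w[(-m) mod N], computed here explicitly
h = np.zeros(N)
h[0] = w[0]
h[-(n - 1):] = w[1:][::-1]
conv_map = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)).real

assert np.allclose(coeff_map, conv_map)
```

The FFT-based product implements circular convolution, matching the wrap-around patch convention; with a stride greater than one, `coeff_map` would simply be subsampled.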
We now present an optimization framework to learn DeepResT models from images. The training images are denoted by the set {x_t}, with each x_t a vectorized image. To simplify the learning, we assume that the filters in each layer form a unitary set [32], i.e., the matrix W_l formed with the vectorized filters of layer l as its rows is unitary. We formulate a patch-based learning problem (which may be equivalently written using convolutions) as follows:

(P1)  min over {W_l}, {Z_l} of  || W_L P_L(R_L) - Z_L ||_F^2 + Σ_{l=1}^{L} η_l^2 || Z_l ||_0
      s.t.  W_l W_l^T = I for all l,   R_{l+1} = W_l P_l(R_l) - Z_l for 1 ≤ l ≤ L - 1.

Here, {W_l}_{l=1}^{L} denotes the set of unitary matrices, one for each layer, with I
denoting the identity matrix of appropriate dimensions. The operator P_l
for each l forms a matrix by extracting appropriately sized patches of its input and stacking them as vectorized matrix columns. The matrices Z_l denote the sparse coefficient maps for the L layers. Each row of Z_l denotes the (row-vectorized) coefficient map (or multiple coefficient maps for a filter, corresponding to multiple training images, vectorized and stacked along the row) for a particular atom or filter of W_l. The ℓ0 "norm" counts the total number of nonzero elements in a matrix or vector, and the nonnegative parameters η_l control the sparsity in each layer during training. While we use ℓ0 sparsity penalties, they could alternatively be replaced with ℓ1 or other sparsity penalties or constraints. The residual maps are recursively defined by the matrices R_{l+1} = W_l P_l(R_l) - Z_l, where R_1 denotes the initial training images. Problem (P1) learns the transforms for the L layers by minimizing the Frobenius norm of the residual in the Lth (output) layer and enforcing the coefficient maps in all layers to be sparse via ℓ0 penalties. The learning in (P1) is quite different from the growing field of deep learning
[33], where the learning is typically supervised and the objectives are task-driven (e.g., classification accuracy) rather than model-based (such as dictionary or transform learning costs). This enables utilizing (P1) to learn deep models even from corrupted data, without requiring large ground-truth datasets for training. One could control the degrees of freedom while learning DeepResT models to achieve the best trade-offs in applications.
Decoder.
The DeepResT model learned using (P1) acts as an encoder for images. In order to estimate the image(s) from the coefficient maps, the residual and sparse coefficient estimates need to be back-propagated through the layers as follows. First, we obtain the following estimate from the Lth layer model and coefficients:

  est. P_L(R_L) = W_L^T Z_L.   (1)

Then, the preceding residuals are computed one by one with decreasing l as follows:

  est. P_l(R_l) = W_l^T (Z_l + est. R_{l+1}),   l = L - 1, L - 2, ..., 1.   (2)

These updates follow quite easily under the unitary assumption for the transform matrices and the constraints in (P1). Each estimated residual volume is obtained from its patch version by averaging together the patches at their respective locations in the volume (or image, in the case of l = 1) [31, 19]. The estimated residual volumes are then reshaped into the appropriate matrix form, yielding the estimate of R_{l+1} used in (2). In general, if the W_l were nonunitary, the operator W_l^T in (1) and (2) could be replaced with the pseudo-inverse W_l^†.
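Under the unitary constraint, the back-propagation in (1) and (2) inverts the encoder exactly whenever the final-layer residual is retained. A simplified numpy sketch below checks this; it treats each layer's residual directly as a matrix, omitting the per-layer patch re-extraction, downsampling, and patch averaging described in the text, and all sizes and data are illustrative:

```python
import numpy as np

def hard_threshold(v, eta):
    return np.where(np.abs(v) >= eta, v, 0.0)

rng = np.random.default_rng(3)
n, L = 8, 3
# Hypothetical unitary transforms, one per layer
Ws = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(L)]
R = rng.standard_normal((n, 20))   # stand-in for the patch matrix of the data
R1 = R.copy()

# Forward pass (encoder): sparsify, keep codes, pass residual to next layer
codes = []
for W in Ws:
    C = W @ R
    Z = hard_threshold(C, 0.8)
    codes.append(Z)
    R = C - Z

# Decoder, cf. (1)-(2): back-propagate codes through the unitary transforms
R_hat = R                          # start from the final-layer residual
for W, Z in zip(reversed(Ws), reversed(codes)):
    R_hat = W.T @ (Z + R_hat)

assert np.allclose(R_hat, R1)      # unitary layers invert exactly in this case
```

Dropping the final residual (or the zeroed residual maps) instead of retaining it yields the approximate reconstruction actually used in denoising.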
Although there are L sparsity parameters η_l in (P1), we have observed in practice (see Section 3) that they can be set quite similarly across layers while still achieving good performance in applications.
Downsampling Residual Volumes. Another issue to consider for (P1) is the size of the transform in each layer. Assuming m_1 filters in the first layer, the residual volume depth after the first layer is m_1. Thus, the transform filters in the second layer (with their third dimension being m_1) will have length proportional to m_1 upon vectorization. This implies that the transform matrix sizes would grow monotonically over the layers. In order to avoid this increasing dimensionality and to achieve robustness to data noise and corruptions during learning, we propose a residual volume "downsampling" strategy for each layer: the residual maps with the smallest energies are set to zero in each layer and not used to train the subsequent layer. During the decoding process, these are simply stacked back as all-zero maps.
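The energy-based selection of residual maps can be sketched as follows; the array shapes, the helper name, and the toy data are illustrative assumptions:

```python
import numpy as np

def downsample_residual_maps(R_vol, n_keep):
    """Zero out the residual maps with the smallest energies, keeping n_keep.
    R_vol: (depth, H, W) stack of residual maps. Returns the zeroed copy and
    the indices of the retained maps (needed to re-stack zeros when decoding)."""
    energies = np.sum(R_vol ** 2, axis=(1, 2))
    keep = np.argsort(energies)[-n_keep:]     # indices of highest-energy maps
    out = np.zeros_like(R_vol)
    out[keep] = R_vol[keep]
    return out, keep

rng = np.random.default_rng(4)
# Toy volume of 6 maps whose energies grow with the map index
vol = rng.standard_normal((6, 16, 16)) * np.arange(1, 7)[:, None, None]
zeroed, kept = downsample_residual_maps(vol, n_keep=4)
```

Only the retained maps feed the next layer's training; the recorded indices let the decoder re-insert all-zero maps at the correct depths.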
2.2 DeepResT Learning Algorithm and Properties
We propose a simple and fast greedy algorithm for (P1), where we learn the transform and coefficients one layer at a time. At first, all the sparse codes Z_l are initialized to zero matrices. Then, assuming that in each layer, patches are extracted such that each pixel in the residual volume (or original training images in the first layer) belongs to the same number d of patches (e.g., with fully overlapping patches and patch wrap-around), we have that

  || P_l(R_l) ||_F^2 = d || R_l ||_F^2  for all l.   (3)

Thus, when the sparse codes in layers l + 1 through L are zero, the unitarity of the transforms together with (3) implies that the residual term in (P1) is proportional to || W_l P_l(R_l) - Z_l ||_F^2. Thus, minimizing sequentially with respect to (W_l, Z_l) for l = 1, ..., L yields the following subproblems (with the proportionality constant absorbed into the threshold):

(P2)  min over W_l, Z_l of  || W_l P_l(R_l) - Z_l ||_F^2 + η_l^2 || Z_l ||_0   s.t.  W_l W_l^T = I,

where for each l, R_l is fixed based on the transforms and sparse coefficients estimated for the previous layers.
Problem (P2) is optimized by alternating between updating Z_l and W_l, with each subproblem solved efficiently [32]. With W_l fixed, the optimal coefficients are given as Z_l = H_{η_l}(W_l P_l(R_l)), where the hard-thresholding operator H_{η_l}(·) acts entry-wise, setting elements with magnitude less than η_l to zero and leaving the other elements unchanged. For fixed Z_l, the optimal operator is obtained as W_l = V U^T, where U Σ V^T
denotes the full singular value decomposition (SVD) of
P_l(R_l) Z_l^T. When Problem (P1) is optimized by the above greedy algorithm with alternating optimization in each layer, the cost in (P1) decreases over the layers as well as within the alternating algorithm iterations in each layer. When we downsample the residual volumes in each layer by keeping only a given number of residual maps or filter residuals, the operation could be included in (P1) by redefining P_l to include downsampling followed by patch extraction. However, (3) does not hold in this case. We still employ the greedy sequential optimization based on (P2) to learn the model, which we found works well in practice.
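The alternating updates for one layer's subproblem (P2) can be sketched as follows; the data, sizes, threshold, iteration count, and identity initialization are illustrative:

```python
import numpy as np

def hard_threshold(v, eta):
    return np.where(np.abs(v) >= eta, v, 0.0)

def learn_unitary_transform(X, eta, n_iters=20):
    """Alternating updates for min ||W X - Z||_F^2 + eta^2 ||Z||_0, W unitary.
    X: (n, N) matrix of vectorized (residual) patches for one layer."""
    n = X.shape[0]
    W = np.eye(n)                              # simple initialization
    for _ in range(n_iters):
        Z = hard_threshold(W @ X, eta)         # closed-form sparse coding step
        U, _, Vt = np.linalg.svd(X @ Z.T)      # full SVD of X Z^T
        W = Vt.T @ U.T                         # optimal unitary update V U^T
    return W, Z

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 50))
W, Z = learn_unitary_transform(X, eta=1.0)
```

Both updates are exact minimizers of their respective subproblems, so the (P2) cost is monotone non-increasing over the alternations.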
3 Experiments
Here, we present experiments illustrating the learned models for images and the behavior of the proposed learning algorithm for denoising. We refer to the proposed scheme as DeepResT. The images used in our experiments are shown in Fig. 2.
3.1 Multi-Layer Transforms for Images
We learned a DeepResT model with three layers for the image Puffins, with a subset of the low-energy residual maps zeroed out for the second and third layers and with the transforms in these layers applied along the residual volume depth. We explored 1D filters in the higher layers in this work and leave the study of 3D filters to future work. The threshold parameters were fixed empirically, and the greedy learning algorithm was executed for a fixed number of iterations in each layer. The initial transform was set to the 2D DCT in the first layer and the identity matrix in subsequent layers.
Fig. 3 shows the transforms learned in each layer. The atoms are displayed as square patches. While the atoms in the first layer show directional and edge-like features that sparsify the image, the 1D atoms in the second and third layers indicate how the different residual maps (arising from distinct filters) input to those layers were combined for better sparsification. The latter atoms clearly look quite different from the former image-level sparsifying features. The benefit of the learned DeepResT model over a single-layer model is demonstrated next.
3.2 Application to Image Denoising
Table 1: Denoising PSNR values (dB) for K-SVD dictionary denoising [31] and for adaptive transform denoising with increasing numbers of layers (left to right in the DeepResT columns).

Image     σ     K-SVD    DeepResT
Barbara   10    34.41    34.14   34.50   34.50
          20    30.82    30.36   30.91   30.91
          30    28.57    28.25   28.78   28.78
          100   21.87    22.56   22.71   22.67
Boat      10    33.62    33.19   33.67   33.72
          20    30.37    29.99   30.49   30.54
          30    28.43    28.16   28.66   28.70
          100   22.81    23.18   23.27   23.26
Man       10    32.73    32.34   32.84   32.91
          20    29.40    29.07   29.66   29.73
          30    27.61    27.38   27.94   28.01
          100   22.75    23.17   23.28   23.25
Couple    10    33.51    33.15   33.59   33.64
          20    30.03    29.72   30.22   30.27
          30    27.87    27.69   28.19   28.23
          100   22.57    22.88   22.99   22.97
Puffins   10    34.76    34.36   34.81   34.85
          20    31.16    30.69   31.21   31.24
          30    29.18    28.71   29.23   29.26
          100   23.60    23.90   23.97   23.92
We evaluate the usefulness of the adaptive DeepResT algorithm for denoising the images in Fig. 2. Simulated i.i.d. zero-mean Gaussian noise with standard deviation σ = 10, 20, 30, and 100 was added to the images. DeepResT learning was simulated with various numbers of layers, with appropriate numbers of low-energy residual maps zeroed out while learning each layer. At the largest σ, slightly smaller transform atom sizes were used in the higher layers to avoid overfitting to noise. The sparsity thresholds in the first and subsequent layers were set based on the noise level, and the greedy training was run for a fixed number of iterations in each layer, with the DeepResT model being learned from the noisy image and then used to denoise the same image.

Table 2: Denoising PSNR values (dB) at σ = 100 for single-pass and two-pass DeepResT denoising.

              Barbara   Boat    Man     Couple   Puffins
Single pass   22.67     23.26   23.25   22.97    23.92
Two passes    23.00     23.60   23.39   23.26    24.29
Table 1 shows the peak signal-to-noise ratio (PSNR) values in decibels (dB) for adaptive transform denoising with various numbers of layers, along with the PSNRs for denoising with the well-known overcomplete K-SVD learned dictionary-based denoising algorithm [31]. The DeepResT methods with the two largest numbers of layers perform quite similarly (one outperforming the other by 0.02 dB on average), and both outperform the K-SVD method and the single-layer transform learning-based denoising scheme. In particular, the best of these cases outperforms K-SVD by 0.26 dB on average, with a peak improvement of 0.8 dB. Note that the K-SVD method sparse codes patches according to a noise-dependent error-bound criterion, which plays a key role in its success. Incorporating such a bound into our scheme (see [19] for its use in single-layer adaptive transform denoising) may improve its performance further over the current sparsity-penalized strategy. Fig. 4 shows zoom-ins of the denoised image Man obtained with the single-layer and multi-layer schemes. The multi-layer result shows sharper edges than the single-layer one. Finally, when DeepResT models with other choices of the number of layers and of the first-layer atom size (with smaller atoms for subsequent layers) were learned to adaptively denoise images, the PSNR values were only marginally worse on average than those for the models in Table 1.
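For reference, the PSNR metric reported in Tables 1 and 2 can be computed as follows, assuming 8-bit images with peak intensity 255 (the helper name is illustrative):

```python
import numpy as np

def psnr(estimate, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB between an estimate and a reference."""
    err = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    mse = np.mean(err ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A per-pixel error of 1 gray level corresponds to 20*log10(255) ≈ 48.13 dB
val = psnr(np.full((4, 4), 1.0), np.zeros((4, 4)))
```

Higher PSNR means lower mean squared error relative to the squared peak value.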
Multi-Pass or Stacked DeepResT Scheme. We also studied a multi-pass version of the DeepResT scheme, where the image denoised by the learned DeepResT model is further denoised by additional passes of DeepResT learning. This corresponds to stacking several DeepResT encoder + decoder modules to perform denoising. Table 2 shows the PSNR values with one and two passes of learned DeepResT denoising at σ = 100. For the two-pass scheme, the σ value that determines the thresholds was set differently in the two passes (a smaller estimate in the first pass could enable further denoising improvement in the next pass). The two-pass scheme achieves a peak improvement of about 0.4 dB over the single-pass scheme. Fig. 5 shows zoom-ins of the denoised image Barbara with K-SVD denoising [31] and with the two-pass DeepResT scheme, showing much better reconstruction of image features and textures for the latter approach.
Recent works [23, 24] have shown that combining transform learning with block matching strategies can outperform popular state-of-the-art image and video denoising methods such as BM3D, VBM3D, and VBM4D. The proposed DeepResT learning could also potentially be combined with block matching strategies; we leave this investigation to future work.
4 Conclusions
This paper investigated the learning of a multi-layer extension of the transform model, where the transform-domain residuals generated in each layer are further sparsified in the subsequent layer. Filters in later layers were learned to jointly sparsify the residual maps of the preceding layer. We presented a greedy algorithm for the learning problem with a unitary constraint on the filters in each layer, which enables efficient filter updates. We also proposed downsampling the residual volumes in each layer to address dimensionality issues and to prevent overfitting to noisy data. Numerical experiments showed the promise of the learning algorithm in extracting rich image features and its utility for denoising by learning directly on noisy images. A simple decoder was used for the multi-layer model under the unitary assumption. The denoising quality typically improved with more layers and with multiple passes of denoising. Future work will further explore the proposed model with applications to inverse problems in imaging.
References
 [1] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
 [2] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
 [3] M. Aharon and M. Elad, “Sparse and redundant modeling of image content using an image-signature-dictionary,” SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 228–247, 2008.
 [4] J. Mairal, G. Sapiro, and M. Elad, “Learning multiscale sparse representations for image and video restoration,” SIAM Multiscale Modeling and Simulation, vol. 7, no. 1, pp. 214–241, 2008.
 [5] M. Yaghoobi, T. Blumensath, and M. Davies, “Dictionary learning for sparse approximations with the majorization method,” IEEE Transactions on Signal Processing, vol. 57, no. 6, pp. 2178–2191, 2009.
 [6] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, 2010.
 [7] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
 [8] S. Ravishankar, R. R. Nadakuditi, and J. A. Fessler, “Efficient sum of outer products dictionary learning (SOUP-DIL) and its application to inverse problems,” IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 694–709, Dec 2017.
 [9] B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 301–315, Jan 2016.
 [10] C. GarciaCardona and B. Wohlberg, “Convolutional dictionary learning: A comparative review and new algorithms,” IEEE Transactions on Computational Imaging, vol. 4, no. 3, pp. 366–381, Sept 2018.
 [11] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” in Asilomar Conf. on Signals, Systems and Comput., 1993, pp. 40–44 vol. 1.
 [12] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
 [13] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, pp. 407–499, 2004.
 [14] D. Needell and J. A. Tropp, “CoSaMP: iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
 [15] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruction,” IEEE Trans. Information Theory, vol. 55, no. 5, pp. 2230–2249, 2009.
 [16] S. Ravishankar and Y. Bresler, “Learning sparsifying transforms,” IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1072–1086, 2013.
 [17] S. Ravishankar and Y. Bresler, “Learning overcomplete sparsifying transforms for signal processing,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 3088–3092.
 [18] S. Ravishankar and Y. Bresler, “ℓ0 sparsifying transform learning with efficient optimal updates and convergence guarantees,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2389–2404, May 2015.
 [19] S. Ravishankar and Y. Bresler, “Learning doubly sparse transforms for images,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4598–4612, 2013.

 [20] B. Wen, S. Ravishankar, and Y. Bresler, “Structured overcomplete sparsifying transform learning with convergence guarantees and applications,” International Journal of Computer Vision, vol. 114, no. 2-3, pp. 137–167, 2015.
 [21] J.-F. Cai, H. Ji, Z. Shen, and G.-B. Ye, “Data-driven tight frame construction and image denoising,” Applied and Computational Harmonic Analysis, vol. 37, no. 1, pp. 89–105, 2014.
 [22] S. Ravishankar and Y. Bresler, “Data-driven learning of a union of sparsifying transforms model for blind compressed sensing,” IEEE Transactions on Computational Imaging, vol. 2, no. 3, pp. 294–309, 2016.
 [23] B. Wen, Y. Li, and Y. Bresler, “When sparsity meets low-rankness: Transform learning with nonlocal low-rank constraint for image restoration,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2297–2301.
 [24] B. Wen, S. Ravishankar, and Y. Bresler, “VIDOSAT: High-dimensional sparsifying transform learning for online video denoising,” IEEE Transactions on Image Processing, 2018, to appear.
 [25] S. Ravishankar, A. Ma, and D. Needell, “Analysis of fast structured dictionary learning,” 2018, Preprint: https://arxiv.org/abs/1805.12529.
 [26] L. Pfister and Y. Bresler, “Learning filter bank sparsifying transforms,” 2018, Preprint: https://arxiv.org/abs/1803.01980.

 [27] S. Ye, S. Ravishankar, Y. Long, and J. A. Fessler, “Adaptive sparse modeling and shifted-Poisson likelihood based approach for low-dose CT image reconstruction,” in IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1–6.
 [28] S. Ravishankar, A. Lahiri, C. Blocker, and J. A. Fessler, “Deep dictionary-transform learning for image reconstruction,” in IEEE International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 1208–1212.
 [29] D. Barchiesi and M. D. Plumbley, “Learning incoherent dictionaries for sparse approximation using iterative projections and rotations,” IEEE Transactions on Signal Processing, vol. 61, no. 8, pp. 2055–2065, 2013.
 [30] S. Ravishankar, B. E. Moore, R. R. Nadakuditi, and J. A. Fessler, “Efficient learning of dictionaries with low-rank atoms,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 222–226.
 [31] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, 2006.
 [32] S. Ravishankar and Y. Bresler, “Closed-form solutions within sparsifying transform learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 5378–5382.
 [33] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.