Learning Multi-Layer Transform Models

10/19/2018 ∙ by Saiprasad Ravishankar, et al.

Learned data models based on sparsity are widely used in signal processing and imaging applications. A variety of methods for learning synthesis dictionaries, sparsifying transforms, etc., have been proposed in recent years, often imposing useful structures or properties on the models. In this work, we focus on sparsifying transform learning, which enjoys a number of advantages. We consider multi-layer or nested extensions of the transform model, and propose efficient learning algorithms. Numerical experiments with image data illustrate the behavior of the multi-layer transform learning algorithm and its usefulness for image denoising. Multi-layer models provide better denoising quality than single layer schemes.




1 Introduction

Signal models based on sparsity, convolutional properties, or tensor or manifold structures, etc., have garnered increasing interest in recent years. Such models have been used in many applications including inverse problems, where they are often used to construct regularizers. In particular, the learning of signal models from training data, or even from corrupted measurements, has shown promise in various settings.

Among sparsity-based models, the synthesis dictionary model [1] is perhaps the most well-known. Various methods have been proposed to learn synthesis dictionaries from signals or image patches [2, 3, 4, 5, 6, 7, 8] or in a convolutional framework [9, 10]. However, the sparse coding problem (i.e., representing a signal as a sparse combination of appropriate dictionary atoms or filters) in the synthesis model (or during learning) typically lacks a closed-form solution and can be NP-hard in general. While numerous algorithms exist for general synthesis sparse coding [11, 12, 13, 14, 15], they may be computationally expensive in large-scale settings.

On the other hand, several recent works [16, 17, 18, 19, 20, 21] have focused on the learning of sparsifying transform models, where the signal is assumed approximately sparse in a transform domain. A main advantage of the transform model is that the sparse approximation or sparse coding problem has a simple and efficient closed-form solution by thresholding. Adaptive sparsifying transforms have demonstrated promising performance in applications such as image and video denoising and medical image reconstruction [22, 23, 24]. They can be learned relatively cheaply and the learning algorithms often come with provable convergence properties [18, 20, 25]. When the sparsifying transform is learned from or applied to regular patches of an image, the process involves convolutional/filtering operations. Recent works have thus interpreted transform learning for images as learning convolutional filters or filter banks [26, 27, 28]. Dictionary or transform learning methods often employ various structures, constraints, or regularizers for the model [29, 18, 19, 17, 30] that may help avoid ambiguities, or make the model more efficient to use, or provide added robustness in applications, etc.

In this work, we focus on the sparsifying transform model and investigate a framework for learning multi-layer transforms. In each layer, the transform domain residuals generated in the previous layer are further sparsified. Transforms are learned to jointly sparsify the residual maps (or effectively the residual volume). We optimize the proposed multi-layer transform learning problem in a greedy fashion by starting with the first or base layer and then estimating subsequent layers. A unitary constraint is used for the filters in each layer for simplicity, which leads to efficient alternating updates of the filters and coefficient maps. The residual maps are also undersampled in each layer (i.e., a subset is zeroed out) during learning, which helps avoid dimensionality issues and helps prevent overfitting to noise. We also present a simple decoder to estimate the image from the multi-layer coefficients.

Our experiments illustrate the behavior of the proposed learning algorithm and the structure of the learned models and also demonstrate their use in image denoising, where multi-layer learned models outperform their single layer counterparts as well as conventional methods such as learned K-SVD dictionary-based denoising [31]. Finally, we also present a multi-pass extension of the learning scheme involving stacking multiple encoders and decoders that provides further improvements in denoising. In the following sections, we discuss the proposed model and its learning along with the experiments on images.

2 Model and Algorithm

Here, we describe the transform model and its multi-layer or deep extension, and then present an optimization problem for its learning, together with a greedy algorithm for minimizing it.

2.1 Multi-Layer Model and Learning Formulation

The transform model suggests that applying a sparsifying transform operator to a given signal yields an approximately sparse vector, i.e., one with several zeros. Given the signal and operator, the best sparse approximation is often found cheaply by thresholding the transformed signal with some threshold, which sets its smallest elements (in magnitude) to zero [16].
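The thresholding step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the names `hard_threshold` and `transform_sparse_code`, the 4-point orthonormal DCT used as the transform, and the threshold value `eta=0.2` are all assumptions chosen for the toy example.

```python
import numpy as np

def hard_threshold(v, eta):
    """Zero every entry of v whose magnitude is below eta; keep the rest."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) >= eta, v, 0.0)

def transform_sparse_code(W, x, eta):
    """Sparse code of x under the transform W: threshold the coefficients W @ x."""
    return hard_threshold(W @ x, eta)

# Toy transform (assumed choice): 4-point orthonormal DCT-II matrix.
n = 4
k = np.arange(n)[:, None]          # frequency index (rows)
i = np.arange(n)[None, :]          # sample index (columns)
W = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
W[0, :] /= np.sqrt(2.0)            # rescale the DC row so that W is orthonormal

x = np.array([1.0, 1.1, 0.9, 1.0])  # nearly constant signal
z = transform_sparse_code(W, x, eta=0.2)
print(z)                            # only the DC coefficient survives
```

For this nearly constant signal, the three non-DC coefficients fall below the threshold, so the sparse code keeps a single non-zero entry.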

Figure 1: The Deep Residual Transform (DeepResT) model with multiple layers. Each layer has its own set of filters, and the filters in the second and higher layers sparsify the stacked residual maps (i.e., residual volumes). Note that residuals are not computed at the end of the last layer. The hard-thresholding function shown here can also be replaced with other non-linearities.

When the transform is applied to all overlapping patches of an image (including patches overlapping the image boundaries, which wrap around to the other side of the image), each row or atom of the transform applies to all image patches via inner products to generate a coefficient map. Clearly, this corresponds to a circular convolution (the corresponding filter is obtained by flipping and zero-padding a reshaped row of the transform) to generate each coefficient map, followed by thresholding. When overlapping patches are used with a patch stride greater than 1 pixel, applying the transform involves convolutions followed by downsampling and thresholding [26]. Methods for learning such sparsifying transforms or filter banks from images typically enforce additional properties to avoid trivial solutions such as the all-zero operator or operators with repeated rows, etc. Some of the regularizers employed in transform learning include a log-determinant penalty combined with a Frobenius-norm penalty [16], a unitary constraint [32], and penalties enforcing incoherence [17, 26].
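The equivalence between patch-wise inner products and circular convolution can be checked numerically. The sketch below uses assumed toy sizes (an 8 x 8 image and a 3 x 3 atom) and random data; it compares the coefficient map computed patch by patch against the one obtained by circularly convolving the image with the flipped, zero-padded atom via the FFT.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 8, 3                        # image side and patch side (toy sizes)
img = rng.standard_normal((N, N))
w = rng.standard_normal((p, p))    # one transform row reshaped to a p x p patch

# Patch view: inner product of w with every wrap-around patch of the image.
coef_patch = np.empty((N, N))
for i in range(N):
    for j in range(N):
        rows = (i + np.arange(p)) % N
        cols = (j + np.arange(p)) % N
        coef_patch[i, j] = np.sum(w * img[np.ix_(rows, cols)])

# Filter view: zero-pad w to N x N, flip it circularly, then convolve circularly.
w_pad = np.zeros((N, N))
w_pad[:p, :p] = w
h = np.roll(w_pad[::-1, ::-1], shift=(1, 1), axis=(0, 1))   # h[u, v] = w_pad[-u, -v] (mod N)
coef_conv = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(h)))

print(np.allclose(coef_patch, coef_conv))
```

Thresholding either map entry-wise then gives the same sparse coefficient map, so the patch-based and convolutional descriptions are interchangeable in this setting.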

We propose a multi-layer extension of the transform model that involves multiple layers of sparsification. Fig. 1 illustrates this model. The filtering residuals (i.e., the differences between the pre- and post-thresholded coefficient maps), or residual maps, generated in the first layer for the various filters are stacked together to form a residual volume. In the second and subsequent layers, the transform model jointly sparsifies the residual maps. For each such layer, we assume here that the depth (third dimension) of its filters equals the residual volume depth, so that the convolution is done only along the spatial dimensions to produce 2D coefficient maps. (If the filter depths were less than the residual volume depths and 3D filtering were performed in the second and higher layers, then the dimensions of the residual maps would keep increasing from one layer to another, potentially rendering the learning of filters infeasible.) In other words, the 2D filters/components are convolved with their corresponding 2D residual maps and the results are summed together to produce a 2D map. In each layer in Fig. 1, the residual maps generated from different filters are stacked and further jointly sparsified in the next layer. However, in the final layer, only the sparse coefficient maps are computed without the residual maps. We refer to this model as a Deep Residual Transform (DeepResT) model. Intuitively, the residual maps in each layer may contain fine features/details, which could be further sparsified or encoded.

We now present an optimization framework to learn DeepResT models from images. The training data are a set of images, each stored as a vector. To simplify the learning, we assume that the filters in each layer form a unitary set [32], i.e., the matrix formed with the vectorized filters as its rows is unitary. We formulate a patch-based learning problem (P1), which may be equivalently written using convolutions.

In (P1), the optimization is over a set of unitary matrices, one for each layer, together with matrices holding the sparse coefficient maps for the layers; unitarity means that the product of each transform's transpose with the transform is the identity matrix of appropriate dimensions. A patch-extraction operator for each layer forms a matrix by extracting appropriately sized patches of its input and stacking them as vectorized matrix columns. Each row of a coefficient matrix is the (row-vectorized) coefficient map for a particular atom or filter of that layer's transform; with multiple training images, the row holds the corresponding coefficient maps of that filter, vectorized and stacked one after another. The $\ell_0$ “norm” counts the total number of non-zero elements in a matrix or vector, and non-negative parameters control the sparsity in each layer during training. While we use $\ell_0$ sparsity penalties, they could alternatively be replaced with other sparsity penalties or constraints. The residual maps are defined recursively by residual matrices, the first of which holds the initial training images.
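One way to write (P1) that is consistent with the definitions above is sketched below. The symbols are assumed notation rather than taken from the text: $W_l$ for the layer-$l$ transform, $Z_l$ for its sparse coefficient maps, $P_l$ for the patch-extraction operator, $R_l$ for the residual matrices (with $R_1$ holding the training images), $\eta_l$ for the sparsity parameters, and $L$ for the number of layers. The squared weighting $\eta_l^2$ is likewise an assumption, chosen so that hard-thresholding at level $\eta_l$ solves the sparse coding step.

```latex
\begin{aligned}
\text{(P1)}\quad \min_{\{W_l, Z_l\}_{l=1}^{L}} \;\;
  & \left\| W_L P_L(R_L) - Z_L \right\|_F^2
    + \sum_{l=1}^{L} \eta_l^2 \left\| Z_l \right\|_0 \\
\text{s.t.}\;\;
  & R_{l+1} = W_l P_l(R_l) - Z_l, \quad 1 \le l \le L-1, \\
  & W_l^T W_l = I, \quad 1 \le l \le L.
\end{aligned}
```

In this form the first term is the residual left after the last layer, and the constraints encode both the layer-to-layer recursion and the unitary assumption.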

Problem (P1) is to learn the transforms for the layers by minimizing the norm of the residual in the last layer (output) and enforcing the coefficient maps in the layers to be sparse via $\ell_0$ penalties. The learning in (P1) is quite different from the growing field of deep learning [33], where the learning is typically supervised and objectives are task-driven (e.g., classification accuracy) rather than model-based (such as dictionary or transform learning costs). This enables utilizing (P1) to learn deep models from even corrupted data, without requiring large ground truth datasets for training. One could control the degrees of freedom while learning DeepResT models to achieve the best trade-offs in applications.


The DeepResT model learned using (P1) acts as an encoder for images. In order to estimate an image (or images) from the coefficient and residual maps, the residual and sparse coefficient estimates need to be backpropagated through the layers. First, in (1), an initial estimate is obtained from the last-layer transform and coefficients. Then, in (2), the preceding residuals are computed one by one with decreasing layer index.

These updates follow quite easily under the unitary assumption for the transform matrices and the constraints in (P1). Each residual volume is obtained from its patch version by averaging together the patches at their respective locations in the volume (or image, in the case of the first layer) [31, 19]. The estimated residual volume(s) are then reshaped into the appropriate matrix form. In general, if the transforms were non-unitary, the transpose operator in (1) and (2) could be replaced with the pseudo-inverse.
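The encoder and decoder recursion can be sketched under a simplifying assumption: the patch-extraction operator is taken to be the identity (each data column is already a patch), so the patch-averaging step is skipped. The functions `encode` and `decode`, the random unitary transforms, and the thresholds are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def hard_threshold(A, eta):
    """Zero entries with magnitude below eta; leave the others unchanged."""
    return np.where(np.abs(A) >= eta, A, 0.0)

def encode(X, transforms, etas):
    """Return the sparse codes of every layer and the residual after the last layer."""
    codes, R = [], X
    for W, eta in zip(transforms, etas):
        C = W @ R                    # transform-domain coefficients
        Z = hard_threshold(C, eta)   # sparse coefficient maps of this layer
        codes.append(Z)
        R = C - Z                    # residual passed on to the next layer
    return codes, R

def decode(codes, transforms):
    """Back-propagate estimates from the last layer down to the first."""
    R_hat = transforms[-1].T @ codes[-1]          # last layer: codes only
    for W, Z in zip(reversed(transforms[:-1]), reversed(codes[:-1])):
        R_hat = W.T @ (Z + R_hat)                 # codes plus estimated residual
    return R_hat

rng = np.random.default_rng(0)
n, N = 6, 50
X = rng.standard_normal((n, N))
# Random unitary (orthogonal) transforms obtained from QR factorizations.
transforms = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(3)]
codes, R_last = encode(X, transforms, etas=[1.0, 0.5, 0.5])
X_hat = decode(codes, transforms)
print(np.linalg.norm(X - X_hat), np.linalg.norm(R_last))
```

In this simplified setting the reconstruction error has the same Frobenius norm as the residual discarded after the last layer, because each unitary transform preserves norms.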

Although (P1) has one sparsity parameter per layer, we have observed in practice (see Section 3) that they can be set quite similarly and yet achieve good performance in applications.

Downsampling Residual Volumes. Another issue to consider for (P1) is the size of the transform in each layer. Because a unitary transform is square, the first layer has as many filters as a vectorized patch has entries, and the residual volume depth after the first layer equals this number. The transform filters in the second layer, whose third dimension spans the residual volume depth, are thus correspondingly larger upon vectorization. This implies that the transform matrix size would be monotonically increasing over the layers. In order to avoid the issue of increasing dimensionality and to achieve robustness to data noise and corruptions during learning, we propose a residual volume “downsampling” strategy for each layer. The residual maps with the smallest energies are set to zero in each layer and not used to train the subsequent layer. During the decoding process, these are simply stacked back as all-zero maps.
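The energy-based selection of residual maps can be sketched as follows. The function name `downsample_residual_volume`, the `(depth, height, width)` array layout, and the `keep` parameter are assumptions made for this illustration.

```python
import numpy as np

def downsample_residual_volume(volume, keep):
    """Zero all but the `keep` highest-energy maps of a (depth, H, W) volume.

    Returns the zero-filled volume and the sorted indices of the retained maps.
    """
    depth = volume.shape[0]
    if not 1 <= keep <= depth:
        raise ValueError("keep must be between 1 and the volume depth")
    energies = np.sum(volume ** 2, axis=(1, 2))     # energy of each residual map
    kept = np.sort(np.argsort(energies)[-keep:])    # indices of the largest energies
    out = np.zeros_like(volume)
    out[kept] = volume[kept]
    return out, kept

# Toy volume: four constant maps with different amplitudes.
scales = np.array([0.1, 2.0, 0.5, 1.0])
volume = scales[:, None, None] * np.ones((4, 4, 4))
reduced, kept = downsample_residual_volume(volume, keep=2)
print(kept)   # indices of the two highest-energy maps
```

Only the retained maps would be passed to the next layer for training; keeping the zeroed maps in place makes it straightforward to stack them back as all-zero maps during decoding.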

2.2 DeepResT Learning Algorithm and Properties

We propose a simple and fast greedy algorithm for (P1), where we learn the transform and coefficients one layer at a time. At first, all the sparse codes are initialized to zero matrices. We assume that in each layer, patches are extracted such that each pixel in the residual volume (or original training images in the first layer) belongs to the same number of patches (e.g., with fully overlapping patches with patch wrap-around), which yields the relation (3). Then, with all-zero initial sparse codes, minimizing (P1) sequentially with respect to the transform and sparse coefficients of each layer, from the first layer to the last, yields one subproblem (P2) per layer, in which the layer's input residual is fixed based on the transforms and sparse coefficients estimated for the previous layers.
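One consequence of the equal-coverage assumption on patches can be checked numerically: when every pixel belongs to the same number of patches, the squared Frobenius norm of the patch matrix is a constant multiple of that of the image. The helper `extract_patches_wrap` and the toy sizes below are hypothetical, and the check is offered as an illustration of this property rather than as a restatement of (3).

```python
import numpy as np

def extract_patches_wrap(img, p):
    """Stack all p x p wrap-around patches of a 2D image as vectorized columns."""
    n_rows, n_cols = img.shape
    columns = []
    for i in range(n_rows):
        for j in range(n_cols):
            rows = (i + np.arange(p)) % n_rows
            cols = (j + np.arange(p)) % n_cols
            columns.append(img[np.ix_(rows, cols)].ravel())
    return np.stack(columns, axis=1)     # shape (p * p, n_rows * n_cols)

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
p = 3
patches = extract_patches_wrap(img, p)

# Every pixel appears in exactly p * p patches, so energies differ by that factor.
ratio = np.sum(patches ** 2) / np.sum(img ** 2)
print(patches.shape, ratio)
```

With a unitary transform applied to the patch matrix, the same constant carries over to the transform-domain coefficients, since unitary matrices preserve the Frobenius norm.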

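Problem (P2) is optimized by alternating between updating the sparse coefficients and the transform, with each subproblem solved efficiently [32]. With the transform fixed, the optimal coefficients are obtained by hard-thresholding the transformed patches: the thresholding operator acts entry-wise, setting elements with magnitude less than the layer's threshold to zero and leaving other elements unchanged. With the coefficients fixed, the optimal unitary transform is obtained in closed form from the full singular value decomposition (SVD) of the product of the patch matrix and the transposed coefficient matrix.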


When Problem (P1) is optimized by the above greedy algorithm with alternating optimization in each layer, the cost in (P1) decreases over the layers as well as within the alternating algorithm iterations in each layer. When we downsample the residual volumes in each layer by keeping only a given number of residual maps or filter residuals, the operation could be included in (P1) by redefining the patch-extraction operator to include downsampling followed by patch extraction. However, (3) does not hold in this case. We still employ greedy sequential optimization based on (P2) to learn the model, which we found worked well in practice.
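The alternating updates for a single layer can be sketched as below, assuming the per-layer cost is the squared Frobenius norm of the sparsification residual plus a squared-threshold-weighted count of non-zeros. The function `learn_unitary_transform`, the identity initialization, the random data, and the parameter values are assumptions for illustration; the transform update uses the standard closed-form solution of the orthogonal Procrustes problem.

```python
import numpy as np

def hard_threshold(A, eta):
    """Zero entries with magnitude below eta; leave the others unchanged."""
    return np.where(np.abs(A) >= eta, A, 0.0)

def learn_unitary_transform(X, eta, n_iter=20, W0=None):
    """Alternate hard-thresholding and SVD-based transform updates on data X."""
    n = X.shape[0]
    W = np.eye(n) if W0 is None else W0.copy()
    costs = []
    for _ in range(n_iter):
        Z = hard_threshold(W @ X, eta)         # sparse coding with W fixed
        U, _, Vt = np.linalg.svd(X @ Z.T)      # full SVD: X Z^T = U S V^T
        W = Vt.T @ U.T                         # transform update W = V U^T
        # Cost evaluated with the updated W and the codes Z from this iteration.
        cost = np.sum((W @ X - Z) ** 2) + eta ** 2 * np.count_nonzero(Z)
        costs.append(cost)
    return W, Z, costs

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 500))              # columns play the role of patches
W, Z, costs = learn_unitary_transform(X, eta=0.5)
print(costs[0], costs[-1])
```

Each step minimizes the cost exactly over one variable with the other fixed, so the recorded cost cannot increase from one iteration to the next, and the learned matrix stays unitary by construction.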

3 Experiments

Figure 2: Images used in experiments. Clockwise from top-left: Barbara, Boat, Man, Puffins, and Couple.

Here, we present experiments illustrating the learned models for images and the behavior of the proposed learning algorithm for denoising. We refer to the proposed scheme as DeepResT. The images used in our experiments are shown in Fig. 2.

3.1 Multi-Layer Transforms for Images

Figure 3: Left to right: Transforms learned in the first, second, and third layers for the image Puffins. The atoms in each layer are shown as square 2D patches for concise display.

We learned a DeepResT model with three layers for the image Puffins. The patch size in the first layer was , and in the second and third layers it was and , respectively (i.e., and residual maps respectively were zeroed out for the second and third layers with the transform in these layers applied along the residual volume depth). We explored 1D filters in the higher layers in this work and leave the study of 3D filters to future work. The threshold parameters were set as and , and the greedy learning algorithm was executed with iterations in each layer. The initial transform was set to the 2D DCT in the first layer and the identity matrix in subsequent layers.

Fig. 3 shows the transforms learned in each layer. The atoms are displayed as square patches. While the atoms in the first layer show directional and edge like features that sparsify the image, the 1D atoms in the second and third layers indicate how the different residual maps (arising from distinct filters) input to those layers were combined for better sparsification. The latter atoms clearly look quite different from the former image-level sparsifying features. The benefit of the learned DeepResT over a single layer model is demonstrated next.

3.2 Application to Image Denoising

Image      σ     K-SVD    DeepResT (one column per layer setting)
Barbara    10    34.41    34.14    34.50*   34.50*
           20    30.82    30.36    30.91*   30.91*
           30    28.57    28.25    28.78*   28.78*
           100   21.87    22.56    22.71*   22.67
Boat       10    33.62    33.19    33.67    33.72*
           20    30.37    29.99    30.49    30.54*
           30    28.43    28.16    28.66    28.70*
           100   22.81    23.18    23.27*   23.26
Man        10    32.73    32.34    32.84    32.91*
           20    29.40    29.07    29.66    29.73*
           30    27.61    27.38    27.94    28.01*
           100   22.75    23.17    23.28*   23.25
Couple     10    33.51    33.15    33.59    33.64*
           20    30.03    29.72    30.22    30.27*
           30    27.87    27.69    28.19    28.23*
           100   22.57    22.88    22.99*   22.97
Puffins    10    34.76    34.36    34.81    34.85*
           20    31.16    30.69    31.21    31.24*
           30    29.18    28.71    29.23    29.26*
           100   23.60    23.90    23.97*   23.92
Table 1: PSNR values (in dB) for denoising with the adaptive transform algorithm using three different numbers of layers (the three DeepResT columns, left to right), at each noise standard deviation σ. The denoising PSNRs obtained using the overcomplete K-SVD denoising scheme [31] are also listed. The best PSNR in each row is marked with an asterisk.

We evaluate the usefulness of the adaptive DeepResT algorithm for denoising the images in Fig. 2. Simulated i.i.d. zero-mean Gaussian noise with standard deviation $\sigma = 10$, $20$, $30$, and $100$ was added to the images. DeepResT learning is simulated with three different numbers of layers. The filter sizes in the layers were set as , , , , and , with appropriate numbers of low-energy residual maps zeroed while learning each layer. At the large $\sigma = 100$, slightly smaller transform atom sizes were used for layers to to avoid noise overfitting, with the atom length along the third dimension being , , , and , respectively. The sparsity thresholds in the first and subsequent layers were set as and , respectively, and the greedy training was run for iterations in each layer with the DeepResT model being learned from the noisy image and then used to denoise the same image.

               Barbara   Boat    Man     Couple   Puffins
Single Pass    22.67     23.26   23.25   22.97    23.92
Two Passes     23.00     23.60   23.39   23.26    24.29
Table 2: Denoised PSNR values (in dB) for the adaptive DeepResT algorithm using a single pass and two passes, with the layer setting of the last DeepResT column of Table 1 and $\sigma = 100$.

Figure 4: A zoomed-in region of the denoised image Man obtained using the single-layer adaptive transform scheme (left) and the multi-layer DeepResT scheme (right).

Table 1 shows the peak signal-to-noise-ratio (PSNR) values in decibels (dB) for adaptive transform denoising with various numbers of layers along with the PSNRs for denoising with the well-known overcomplete K-SVD learned dictionary-based denoising algorithm [31]. The two multi-layer DeepResT settings perform quite similarly (with the setting in the last column of Table 1 outperforming the other by 0.02 dB on average), and both outperform the K-SVD method and the single-layer transform learning-based denoising scheme. In particular, the setting in the last column outperforms K-SVD by 0.26 dB on average, with a peak improvement of 0.8 dB. Note that the K-SVD method sparse codes patches according to an error bound criterion, which plays a key role in its success. Incorporating such a bound into our scheme (see [19] for its use in single-layer adaptive transform denoising) may improve its performance further over the current sparsity penalized strategy. Fig. 4 shows zoom-ins of the denoised image Man obtained with the single-layer and multi-layer schemes. The multi-layer result shows sharper edges than the single-layer one. Finally, when DeepResT models with layers and with or length- atoms in the first layer and smaller atoms for subsequent layers were learned to adaptively denoise images, the PSNR values were only dB worse on average than for the layer models in Table 1.
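The averages quoted above can be recomputed directly from Table 1. The short script below copies the K-SVD column and the last two DeepResT columns (rows ordered Barbara, Boat, Man, Couple, Puffins, each at noise levels 10, 20, 30, 100); the variable names are illustrative.

```python
# PSNR values (dB) copied from Table 1, five images x four noise levels.
ksvd = [34.41, 30.82, 28.57, 21.87, 33.62, 30.37, 28.43, 22.81,
        32.73, 29.40, 27.61, 22.75, 33.51, 30.03, 27.87, 22.57,
        34.76, 31.16, 29.18, 23.60]
deep_mid = [34.50, 30.91, 28.78, 22.71, 33.67, 30.49, 28.66, 23.27,
            32.84, 29.66, 27.94, 23.28, 33.59, 30.22, 28.19, 22.99,
            34.81, 31.21, 29.23, 23.97]   # second DeepResT column
deep_last = [34.50, 30.91, 28.78, 22.67, 33.72, 30.54, 28.70, 23.26,
             32.91, 29.73, 28.01, 23.25, 33.64, 30.27, 28.23, 22.97,
             34.85, 31.24, 29.26, 23.92]  # third (last) DeepResT column

# Gains of the last DeepResT column over K-SVD.
gains_vs_ksvd = [d - k for d, k in zip(deep_last, ksvd)]
mean_gain = sum(gains_vs_ksvd) / len(gains_vs_ksvd)
peak_gain = max(gains_vs_ksvd)

# Average difference between the two multi-layer columns.
mean_gap = sum(a - b for a, b in zip(deep_last, deep_mid)) / len(deep_last)

print(f"mean gain over K-SVD: {mean_gain:.2f} dB, peak: {peak_gain:.2f} dB")
print(f"mean gap between multi-layer columns: {mean_gap:.2f} dB")
```

The peak gain occurs for Barbara at the highest noise level, where K-SVD drops to 21.87 dB.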

Multi-pass or Stacked DeepResT Scheme. We also studied a multi-pass version of the DeepResT scheme, where the image denoised by the learned DeepResT model is further denoised by additional passes of DeepResT learning. This corresponds to stacking several DeepResT encoder + decoder modules to perform denoising. Table 2 shows the PSNR values with one and two passes of learned DeepResT denoising at $\sigma = 100$. For the two-pass scheme, the value that determines the thresholds in each pass was set as (a smaller estimate in the first pass could enable further denoising improvement in the next pass) and in the first and second pass, respectively. The two-pass scheme achieves a peak improvement of about 0.4 dB over the single-pass scheme. Fig. 5 shows zoom-ins of the denoised image Barbara with K-SVD denoising [31] and the two-pass DeepResT scheme, with the latter showing much better reconstruction of image features and textures.

Figure 5: Zoomed-in region of the denoised image Barbara, obtained using K-SVD (left) and the DeepResT method with and two passes (right) for .

Recent works [23, 24] have shown that combining transform learning with block matching strategies can outperform popular state-of-the-art image and video denoising methods such as BM3D and VBM3D or VBM4D. The proposed DeepResT learning could also be potentially combined with block matching strategies. We leave its investigation to future work.

4 Conclusions

This paper investigated the learning of a multi-layer extension of the transform model, where the transform domain residuals generated in each layer were further sparsified in the subsequent layer. Filters in later layers were learned to jointly sparsify the coefficient residual maps of the preceding layers. We presented a greedy algorithm for the learning problem with a unitary constraint for the filters in each layer that enabled efficient filter updates. We also proposed downsampling the residual volumes in each layer to address dimensionality issues and to prevent overfitting to noisy data. Numerical experiments showed the promise of the learning algorithm in extracting rich image features and its utility for denoising by learning directly on noisy images. A simple decoder was used for the multi-layer model under the unitary assumption. The denoising quality typically improved with more layers and with multiple passes of denoising. Future work will further explore the proposed model in detail with applications in inverse problems such as in imaging.


  • [1] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
  • [2] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on signal processing, vol. 54, no. 11, pp. 4311–4322, 2006.
  • [3] M. Aharon and M. Elad, “Sparse and redundant modeling of image content using an image-signature-dictionary,” SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 228–247, 2008.
  • [4] J. Mairal, G. Sapiro, and M. Elad, “Learning multiscale sparse representations for image and video restoration,” SIAM Multiscale Modeling and Simulation, vol. 7, no. 1, pp. 214–241, 2008.
  • [5] M. Yaghoobi, T. Blumensath, and M. Davies, “Dictionary learning for sparse approximations with the majorization method,” IEEE Transaction on Signal Processing, vol. 57, no. 6, pp. 2178–2191, 2009.
  • [6] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, 2010.
  • [7] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
  • [8] S. Ravishankar, R. R. Nadakuditi, and J. A. Fessler, “Efficient sum of outer products dictionary learning (SOUP-DIL) and its application to inverse problems,” IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 694–709, Dec 2017.
  • [9] B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 301–315, Jan 2016.
  • [10] C. Garcia-Cardona and B. Wohlberg, “Convolutional dictionary learning: A comparative review and new algorithms,” IEEE Transactions on Computational Imaging, vol. 4, no. 3, pp. 366–381, Sept 2018.
  • [11] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit : recursive function approximation with applications to wavelet decomposition,” in Asilomar Conf. on Signals, Systems and Comput., 1993, pp. 40–44 vol.1.
  • [12] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
  • [13] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, pp. 407–499, 2004.
  • [14] D. Needell and J.A. Tropp, “CoSaMP: iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
  • [15] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruction,” IEEE Trans. Information Theory, vol. 55, no. 5, pp. 2230–2249, 2009.
  • [16] S. Ravishankar and Y. Bresler, “Learning sparsifying transforms,” IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1072–1086, 2013.
  • [17] S. Ravishankar and Y. Bresler, “Learning overcomplete sparsifying transforms for signal processing,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 3088–3092.
  • [18] S. Ravishankar and Y. Bresler, “$\ell_0$ sparsifying transform learning with efficient optimal updates and convergence guarantees,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2389–2404, May 2015.
  • [19] S. Ravishankar and Y. Bresler, “Learning doubly sparse transforms for images,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4598–4612, 2013.
  • [20] B. Wen, S. Ravishankar, and Y. Bresler, “Structured overcomplete sparsifying transform learning with convergence guarantees and applications,” International Journal of Computer Vision, vol. 114, no. 2-3, pp. 137–167, 2015.
  • [21] Jian-Feng Cai, H. Ji, Z. Shen, and Gui-Bo Ye, “Data-driven tight frame construction and image denoising,” Applied and Computational Harmonic Analysis, vol. 37, no. 1, pp. 89–105, 2014.
  • [22] S. Ravishankar and Y. Bresler, “Data-driven learning of a union of sparsifying transforms model for blind compressed sensing,” IEEE Transactions on Computational Imaging, vol. 2, no. 3, pp. 294–309, 2016.
  • [23] B. Wen, Y. Li, and Y. Bresler, “When sparsity meets low-rankness: Transform learning with non-local low-rank constraint for image restoration,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2297–2301.
  • [24] B. Wen, S. Ravishankar, and Y. Bresler, “VIDOSAT: High-dimensional sparsifying transform learning for online video denoising,” IEEE Transactions on Image Processing, 2018, to appear.
  • [25] S. Ravishankar, A. Ma, and D. Needell, “Analysis of fast structured dictionary learning,” 2018, Preprint: https://arxiv.org/abs/1805.12529.
  • [26] L. Pfister and Y. Bresler, “Learning filter bank sparsifying transforms,” 2018, Preprint: https://arxiv.org/abs/1803.01980.
  • [27] S. Ye, S. Ravishankar, Y. Long, and J. A. Fessler, “Adaptive sparse modeling and shifted-poisson likelihood based approach for low-dose CT image reconstruction,” in IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1–6.
  • [28] S. Ravishankar, A. Lahiri, C. Blocker, and J. A. Fessler, “Deep dictionary-transform learning for image reconstruction,” in IEEE International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 1208–1212.
  • [29] D. Barchiesi and M. D. Plumbley, “Learning incoherent dictionaries for sparse approximation using iterative projections and rotations,” IEEE Transactions on Signal Processing, vol. 61, no. 8, pp. 2055–2065, 2013.
  • [30] S. Ravishankar, B. E. Moore, R. R. Nadakuditi, and J. A. Fessler, “Efficient learning of dictionaries with low-rank atoms,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 222–226.
  • [31] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, 2006.
  • [32] S. Ravishankar and Y. Bresler, “Closed-form solutions within sparsifying transform learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 5378–5382.
  • [33] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.