Deep Coupled-Representation Learning for Sparse Linear Inverse Problems with Side Information

07/04/2019 ∙ by Evaggelia Tsiligianni, et al.

In linear inverse problems, the goal is to recover a target signal from undersampled, incomplete or noisy linear measurements. Typically, the recovery relies on complex numerical optimization methods; recent approaches perform an unfolding of a numerical algorithm into a neural network form, resulting in a substantial reduction of the computational complexity. In this paper, we consider the recovery of a target signal with the aid of a correlated signal, the so-called side information (SI), and propose a deep unfolding model that incorporates SI. The proposed model is used to learn coupled representations of correlated signals from different modalities, enabling the recovery of multimodal data at a low computational cost. As such, our work introduces the first deep unfolding method with SI, which actually comes from a different modality. We apply our model to reconstruct near-infrared images from undersampled measurements given RGB images as SI. Experimental results demonstrate the superior performance of the proposed framework against single-modal deep learning methods that do not use SI, multimodal deep learning designs, and optimization algorithms.




I Introduction

Linear inverse problems arise in various signal processing domains such as computational imaging, remote sensing, seismology and astronomy, to name a few. These problems can be expressed by a linear equation of the form:

$$y = \Phi x + \eta, \tag{1}$$

where $x \in \mathbb{R}^{n}$ is the unknown signal, $\Phi \in \mathbb{R}^{m \times n}$, $m \le n$, is a linear operator, and $y \in \mathbb{R}^{m}$ denotes the observations contaminated with noise $\eta \in \mathbb{R}^{m}$. Sparsity is commonly used for the regularization of ill-posed inverse problems, leading to the so-called sparse approximation problem [1]. Compressed sensing (CS) [2] deals with the sparse recovery of linearly subsampled signals and falls in this category.

In several applications, besides the observations of the target signal, additional information from correlated signals is often available [3, 4, 5, 6, 7, 8, 9, 10]. In multimodal applications, combining information from multiple signals calls for methods that allow coupled signal representations, capturing the similarities between correlated data. To this end, coupled dictionary learning is a popular approach [8, 9, 10]; however, dictionary learning methods employ overcomplete dictionaries, resulting in computationally expensive sparse approximation problems.

Deep learning has gained a lot of momentum in solving inverse problems, often surpassing the performance of analytical approaches [11, 12, 13]. Nevertheless, neural networks have a complex structure and appear as “black boxes”; thus, understanding what the model has learned is an active research topic. Among the efforts to bridge the gap between analytical methods and deep learning is the work presented in [14], which introduced the idea of unfolding a numerical algorithm for sparse approximation into a neural network form. Several unfolding approaches [15, 16, 17] followed that of [14]. Although the primary motivation for deploying deep learning in inverse problems is the reduction of the computational complexity, unfolding offers another significant benefit: the model architecture gives better insight into the inference procedure and enables the theoretical study of the network using results from sparse modelling [18, 19, 20, 15].

In this paper, we propose a deep unfolding model for the recovery of a signal with the aid of a correlated signal, the side information (SI). To the best of our knowledge, this is the first work in deep unfolding that incorporates SI. Our contributions are as follows: (i) Inspired by [14], we design a deep neural network that unfolds a proximal algorithm for sparse approximation with SI; we coin our model Learned Side Information Thresholding Algorithm (LeSITA). (ii) We use LeSITA in an autoencoder fashion to learn coupled representations of correlated signals from different modalities. (iii) We design a LeSITA-based reconstruction operator that utilizes learned SI provided by the autoencoder to enhance signal recovery.

We test our method in an example application, namely, multimodal reconstruction from CS measurements. Other inverse problems of the form (1) such as image super-resolution [21, 8] or image denoising [22] can benefit from the proposed approach. We compare our method with existing single-modal deep learning methods that do not use SI, multimodal deep learning designs, and optimization algorithms, showing its superior performance.

The paper is organized as follows. Section II provides the necessary background and reviews related work. The proposed framework is presented in Section III, followed by experimental results in Section IV. Conclusions are drawn in Section V.

II Background and Related Work

Figure 1: Graphical representation of the proximal operators of (a) ISTA and (b) SITA (for non-negative SI , ). are positive parameters.

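A common approach for solving problems of the form (1) with sparsity constraints is convex optimization [23]. Let us assume that the unknown $x$ has a sparse representation $\alpha \in \mathbb{R}^{k}$ with respect to a dictionary $D_x \in \mathbb{R}^{n \times k}$, $n \le k$, that is, $x = D_x \alpha$. Then, (1) takes the form

$$y = \Phi D_x \alpha + \eta, \tag{2}$$

and a solution can be obtained via the formulation of the minimization problem:

$$\min_{\alpha} \frac{1}{2}\|\Phi D_x \alpha - y\|_2^2 + \lambda \|\alpha\|_1, \tag{3}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm ($\|\alpha\|_1 = \sum_{i=1}^{k} |\alpha_i|$), which promotes sparse solutions, and $\lambda$ is a regularization parameter.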

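Numerical methods [1] proposed to solve (3) include pivoting algorithms, interior-point methods, gradient-based methods and approximate message passing (AMP) algorithms [24]. Among gradient-based methods, proximal methods are tailored to optimize an objective of the form

$$\min_{\alpha} f(\alpha) + \lambda g(\alpha), \tag{4}$$

where $f$ is a convex differentiable function with a Lipschitz-continuous gradient, and $g$ is convex and possibly nonsmooth [25], [26]. Their main step involves the proximal operator, defined for a function $g$ according to

$$\mathrm{prox}_{\mu g}(u) = \arg\min_{v} \left\{ \frac{1}{2}\|v - u\|_2^2 + \mu g(v) \right\}, \tag{5}$$

with $\mu = \lambda / L$ and $L > 0$ an upper bound on the Lipschitz constant of $\nabla f$. A popular proximal algorithm is the Iterative Soft Thresholding Algorithm (ISTA) [27, 28]. Let us set $f(\alpha) = \frac{1}{2}\|\Phi D_x \alpha - y\|_2^2$, $g(\alpha) = \|\alpha\|_1$ in (3). At the $t$-th iteration ISTA computes:

$$\alpha^{t} = \phi_{\mu}\!\left(\alpha^{t-1} - \frac{1}{L} D_x^{T}\Phi^{T}\left(\Phi D_x \alpha^{t-1} - y\right)\right), \tag{6}$$

where $\phi_{\mu}$ denotes the proximal operator [Figure 1(a)] expressed by the component-wise shrinkage function:

$$\phi_{\mu}(u_i) = \mathrm{sign}(u_i)\max\{0, |u_i| - \mu\}, \tag{7}$$

with $i = 1, \dots, k$.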
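The component-wise shrinkage function used by ISTA can be sketched in a few lines. The snippet below is a minimal illustration in plain Python; the function name `soft_threshold` and the sample values are hypothetical choices made for this example, not part of the original work.

```python
def soft_threshold(u, mu):
    """Component-wise shrinkage: sign(u_i) * max(0, |u_i| - mu).

    u  : list of floats (the vector to shrink)
    mu : non-negative threshold
    """
    out = []
    for ui in u:
        magnitude = max(0.0, abs(ui) - mu)      # entries with |u_i| <= mu become zero
        out.append(magnitude if ui >= 0 else -magnitude)
    return out


# Illustrative input: small entries are zeroed, large ones move toward zero by mu.
shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.2, 2.5], mu=1.0)
print(shrunk)
```

Entries whose magnitude does not exceed the threshold are set to zero, which is what makes the iterates sparse.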

To address the high computational cost of numerical algorithms, Gregor and LeCun [14] unfolded ISTA into a neural network referred to as LISTA. Specifically, by setting $S = I - \frac{1}{L} D_x^{T}\Phi^{T}\Phi D_x$, $W = \frac{1}{L} D_x^{T}\Phi^{T}$, (6) results in

$$\alpha^{t} = \phi_{\mu}\!\left(S\alpha^{t-1} + W y\right). \tag{8}$$

Considering a correspondence of every iteration with a neural network layer, a number of iterations of (8) can be implemented by a recurrent or feed-forward neural network; $W$, $S$ and $\mu$ are learnable parameters, and the proximal operator (7) acts as a nonlinear activation function. A fixed-depth network allows the computation of sparse codes in a fixed amount of time. Similar unfolding methods were proposed in [15, 16, 17].
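The rewriting of an ISTA iteration as a layer with two weight matrices can be checked numerically. The sketch below is a toy verification in plain Python, not the implementation of [14]: the matrix `A` stands in for the product of the measurement operator and the dictionary, and all names and values (`A`, `y`, `lam`, the helper functions) are assumptions made for illustration. It builds the two matrices from `A` as in the untrained initialization and confirms that both forms produce the same iterates.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def soft_threshold(u, mu):
    """Component-wise shrinkage used as the activation function."""
    return [max(0.0, abs(ui) - mu) * (1.0 if ui >= 0 else -1.0) for ui in u]


# Hypothetical toy problem: 2 measurements, 3 dictionary atoms.
A = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5]]
y = [1.0, -0.5]
lam = 0.1
# The squared Frobenius norm upper-bounds the Lipschitz constant of the gradient.
L = sum(a * a for row in A for a in row)
mu = lam / L

At = [list(col) for col in zip(*A)]            # transpose of A
k = len(At)
# S = I - (1/L) A^T A and W = (1/L) A^T
S = [[(1.0 if i == j else 0.0)
      - sum(At[i][r] * A[r][j] for r in range(len(A))) / L
      for j in range(k)] for i in range(k)]
W = [[a / L for a in row] for row in At]


def ista_step(alpha):
    """Gradient step on the data-fit term followed by shrinkage."""
    residual = [r - yi for r, yi in zip(matvec(A, alpha), y)]
    grad = matvec(At, residual)
    return soft_threshold([a - g / L for a, g in zip(alpha, grad)], mu)


def lista_step(alpha):
    """Same update written as a layer: activation(S alpha + W y)."""
    pre = [p + q for p, q in zip(matvec(S, alpha), matvec(W, y))]
    return soft_threshold(pre, mu)


alpha_ista = [0.0] * k
alpha_lista = [0.0] * k
for _ in range(5):
    alpha_ista = ista_step(alpha_ista)
    alpha_lista = lista_step(alpha_lista)

max_gap = max(abs(a - b) for a, b in zip(alpha_ista, alpha_lista))
print(max_gap)  # differs from zero only by floating-point rounding
```

In a learned network the two matrices and the threshold would be updated by training rather than kept at these initial values.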

III Proposed Framework

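In this paper, we consider that, besides the observations of the target signal, we also have access to SI, that is, a signal $z \in \mathbb{R}^{d}$ correlated to the unknown $x$. We assume that $x$ and $z$ have similar sparse representations $\alpha \in \mathbb{R}^{k}$, $s \in \mathbb{R}^{k}$, under dictionaries $D_x \in \mathbb{R}^{n \times k}$, $D_z \in \mathbb{R}^{d \times k}$, $n \le k$, $d \le k$, respectively. Specifically, we assume that $\alpha$ and $s$ are similar by means of the $\ell_1$ norm, that is, $\|\alpha - s\|_1$ is small. The condition holds for representations with partially common support and a number of similar nonzero coefficients; we refer to them as coupled sparse representations. Then, $\alpha$ can be obtained from the $\ell_1$-$\ell_1$ minimization problem

$$\min_{\alpha} \frac{1}{2}\|\Phi D_x \alpha - y\|_2^2 + \lambda\left(\|\alpha\|_1 + \|\alpha - s\|_1\right). \tag{9}$$

Problem (9) has been theoretically studied in [29] and has been employed for the recovery of sequential signals in [3, 4, 5].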

We can easily obtain coupled sparse representations of sequential signals that change slowly using the same sparsifying dictionary [3, 4, 5]. However, this is not the case in most multimodal applications, where, typically, finding coupled sparse representations involves dictionary learning and complex optimization methods [8, 9, 10]. In this work, we propose an efficient approach based on a novel multimodal deep unfolding model. The model is employed for learning coupled representations of the target signal and the SI (Section III-B), and for reconstruction with SI (Section III-C). Our approach is inspired by a proximal algorithm for the solution of (9).

III-A Sparse Approximation with SI via Deep Unfolding

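Problem (9) is of the form (4) with $f(\alpha) = \frac{1}{2}\|\Phi D_x \alpha - y\|_2^2$ and $g(\alpha) = \|\alpha\|_1 + \|\alpha - s\|_1$. The proximal operator for $g$ is defined by

$$\xi_{\mu}(u; s) = \arg\min_{v} \left\{ \mu\|v\|_1 + \mu\|v - s\|_1 + \frac{1}{2}\|v - u\|_2^2 \right\}, \tag{10}$$

where $\mu = \lambda / L$, and $L$ is an upper bound on the Lipschitz constant of $\nabla f$. All terms in (10) are separable, thus, we can easily show that (see Appendix):

  1. For $s_i \ge 0$, $i = 1, \dots, k$:

$$\xi_{\mu}(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < -2\mu \\ 0, & -2\mu \le u_i \le 0 \\ u_i, & 0 < u_i < s_i \\ s_i, & s_i \le u_i \le s_i + 2\mu \\ u_i - 2\mu, & u_i > s_i + 2\mu \end{cases} \tag{11}$$

  2. For $s_i < 0$, $i = 1, \dots, k$:

$$\xi_{\mu}(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < s_i - 2\mu \\ s_i, & s_i - 2\mu \le u_i \le s_i \\ u_i, & s_i < u_i < 0 \\ 0, & 0 \le u_i \le 2\mu \\ u_i - 2\mu, & u_i > 2\mu \end{cases} \tag{12}$$

Figure 1(b) depicts the graphical representation of the proximal operator given by (11). With $\nabla f(\alpha) = D_x^{T}\Phi^{T}(\Phi D_x \alpha - y)$, a proximal method for (9) takes the form

$$\alpha^{t} = \xi_{\mu}\!\left(\alpha^{t-1} - \frac{1}{L} D_x^{T}\Phi^{T}\left(\Phi D_x \alpha^{t-1} - y\right); s\right). \tag{13}$$

We coin (13) Side-Information-driven iterative soft Thresholding Algorithm (SITA).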
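The scalar proximal operator with SI referenced as (11) and (12) can be sketched and sanity-checked against a brute-force minimization of the scalar objective $\mu(|v| + |v - s|) + \frac{1}{2}(v - u)^2$. The code below is a minimal illustration in plain Python. Merging the two sign cases into a single function through `min(0, s)` and `max(0, s)` is a convenience of this sketch, and the function names, grid range, and test values are assumptions made for the example.

```python
def sita_prox(u, s, mu):
    """Scalar proximal operator of mu * (|v| + |v - s|).

    The penalty is flat between 0 and the SI value s, so inputs inside that
    interval pass through unchanged; outside it they are shrunk by 2 * mu,
    with two dead zones of width 2 * mu at the interval endpoints.
    """
    lo, hi = min(0.0, s), max(0.0, s)
    if u < lo - 2.0 * mu:
        return u + 2.0 * mu
    if u <= lo:
        return lo
    if u < hi:
        return u
    if u <= hi + 2.0 * mu:
        return hi
    return u - 2.0 * mu


def objective(v, u, s, mu):
    """Scalar objective minimized by the proximal operator."""
    return mu * (abs(v) + abs(v - s)) + 0.5 * (v - u) ** 2


def brute_force_prox(u, s, mu, lo=-10.0, hi=10.0, steps=20001):
    """Grid search over [lo, hi]; accurate to about one grid step (1e-3 here)."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return min(grid, key=lambda v: objective(v, u, s, mu))


# Compare the closed form with the grid search for positive and negative SI.
for s in (1.0, -1.0):
    for u in (-3.0, -0.3, 0.4, 1.2, 3.0):
        closed = sita_prox(u, s, mu=0.5)
        numeric = brute_force_prox(u, s, mu=0.5)
        print(s, u, closed, round(numeric, 3))
```

Because the scalar objective is strongly convex, the grid minimizer lies within one grid step of the true minimizer, so agreement to roughly three decimals is what the comparison should show.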

We unfold SITA to a neural network form by setting $S = I - \frac{1}{L} D_x^{T}\Phi^{T}\Phi D_x$, $W = \frac{1}{L} D_x^{T}\Phi^{T}$. Then (13) results in

$$\alpha^{t} = \xi_{\mu}\!\left(S\alpha^{t-1} + W y; s\right). \tag{14}$$

Equation (14) has a similar expression to LISTA (8); however, the two algorithms involve different proximal operators (Figure 1). A fixed number of iterations of (14) can be implemented by a recurrent or feed-forward neural network, with the proximal operator given by (11), (12) employed as a nonlinear activation function, which integrates the SI; $W$, $S$ and $\mu$ are learnable parameters. The network architecture is depicted in Figure 2.
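A forward pass through such an unfolded network can be sketched as a loop over layers that share the same weights. The code below is a plain-Python illustration under several assumptions: a zero initial code, weights fixed at their untrained values derived from a toy matrix `A` (standing in for the product of the measurement operator and the dictionary), and hypothetical names throughout (`lesita_forward`, `l1l1_objective`, `A`, `y`, `s`, `lam`). Training of the weights is not shown.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sita_prox(u, s, mu):
    """Scalar proximal operator of mu * (|v| + |v - s|), used as activation."""
    lo, hi = min(0.0, s), max(0.0, s)
    if u < lo - 2.0 * mu:
        return u + 2.0 * mu
    if u <= lo:
        return lo
    if u < hi:
        return u
    if u <= hi + 2.0 * mu:
        return hi
    return u - 2.0 * mu


def lesita_forward(y, s, S, W, mu, num_layers):
    """Apply num_layers layers of alpha <- prox(S alpha + W y; s)."""
    alpha = [0.0] * len(s)                 # assumed zero initial code
    Wy = matvec(W, y)                      # input branch, shared by all layers
    for _ in range(num_layers):
        pre = [a + b for a, b in zip(matvec(S, alpha), Wy)]
        alpha = [sita_prox(p, si, mu) for p, si in zip(pre, s)]
    return alpha


def l1l1_objective(alpha, A, y, s, lam):
    """Data fit plus lam * (||alpha||_1 + ||alpha - s||_1)."""
    residual = [r - yi for r, yi in zip(matvec(A, alpha), y)]
    penalty = sum(abs(a) + abs(a - si) for a, si in zip(alpha, s))
    return 0.5 * sum(r * r for r in residual) + lam * penalty


# Hypothetical toy problem with SI close to the code that generated y.
A = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5]]
y = [1.0, -0.5]
s = [0.9, 0.0, -1.1]
lam = 0.1
L = sum(a * a for row in A for a in row)   # upper bound on the Lipschitz constant
k = len(A[0])
At = [list(col) for col in zip(*A)]
S = [[(1.0 if i == j else 0.0)
      - sum(At[i][r] * A[r][j] for r in range(len(A))) / L
      for j in range(k)] for i in range(k)]
W = [[a / L for a in row] for row in At]

alpha_hat = lesita_forward(y, s, S, W, lam / L, num_layers=20)
obj_start = l1l1_objective([0.0] * k, A, y, s, lam)
obj_end = l1l1_objective(alpha_hat, A, y, s, lam)
print(obj_start, obj_end)
```

With the untrained weights each layer is one proximal gradient step, so the objective cannot increase from layer to layer; a trained network is expected to reach a comparable error with far fewer layers.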

We can train the neural network using pairs of sparse codes $(\alpha_{(j)}, s_{(j)})$ corresponding to pairs of correlated signals $(x_{(j)}, z_{(j)})$, and a loss function of the form:

$$\mathcal{L} = \sum_{j} \|\hat{\alpha}_{(j)} - \alpha_{(j)}\|_2^2, \tag{15}$$

where $\hat{\alpha}_{(j)}$ is the output estimation. The learning results in a fast sparse approximation operator that directly maps the input observation vector $y$ to a sparse code $\alpha$ with the aid of the SI $s$. We coin this operator Learned Side Information Thresholding Algorithm (LeSITA).

Figure 2: LeSITA: Unfolding SITA (13) to a neural network form (14).

Being based on an optimization method, LeSITA can be theoretically analyzed (see [7, 18, 19, 20, 15]). We leave this analysis for future work.

III-B LeSITA Autoencoder for Coupled Representations

Instead of training using sparse codes, we can use LeSITA in an autoencoder fashion to learn coupled representations of , . By setting equal to the identity matrix, (9) reduces to a sparse representation problem with SI. Then, (14) can compute a representation of according to . The proposed autoencoder is depicted in Figure 3. The main branch accepts as input the target signal (). The core component is a LeSITA encoder, followed by a linear decoder performing reconstruction, i.e., ; is a trainable dictionary ( is not tied to any other weight). A second branch referred to as SINET acts as an SI encoder, performing a (possibly) nonlinear transformation of the SI. We employ LISTA (8) to incorporate sparse priors in the transformation, obtaining , ; is given by (7), and , and are learnable parameters. The number of layers of LISTA and LeSITA may differ.

We use pairs of correlated signals to train our autoencoder, and an objective function of the form:


where is the reconstruction loss, is a constraint on the latent representations, and , are appropriate weights. We use the norm as reconstruction loss, i.e., , where is the -th sample of the target signal and is the respective output estimation. We set to promote coupled latent representations capturing the correlation between and .
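A weighted objective of this kind, combining a reconstruction term with a coupling term on the latent codes, can be sketched as follows. The exact norms and weights are not recoverable from the text here, so the sketch assumes a squared Euclidean reconstruction error and an $\ell_1$ distance between the target code and the SI code; the function and argument names (`autoencoder_objective`, `w_rec`, `w_lat`) are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def l1_distance(a, b):
    """l1 distance between two vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def autoencoder_objective(x_batch, x_hat_batch, alpha_batch, s_batch,
                          w_rec=1.0, w_lat=0.1):
    """Weighted sum of a reconstruction loss and a latent coupling penalty.

    x_batch, x_hat_batch : target signals and their reconstructions
    alpha_batch, s_batch : latent codes of the target and of the SI
    w_rec, w_lat         : assumed weights of the two terms
    """
    rec = sum(squared_l2(x, xh) for x, xh in zip(x_batch, x_hat_batch))
    lat = sum(l1_distance(a, s) for a, s in zip(alpha_batch, s_batch))
    return w_rec * rec + w_lat * lat


# One-sample illustration: reconstruction error 1.0, latent distance 0.5.
value = autoencoder_objective(
    x_batch=[[1.0, 2.0]], x_hat_batch=[[1.0, 1.0]],
    alpha_batch=[[0.5, 0.0]], s_batch=[[0.0, 0.0]],
    w_rec=1.0, w_lat=0.1,
)
print(value)
```

Raising `w_lat` pulls the two latent codes closer together at the cost of reconstruction accuracy, which is the trade-off the two weights control.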

Figure 3: Use of LeSITA for signal representation or reconstruction with SI. The main branch comprises a LeSITA encoder and a linear decoder; the input is either the signal (Section III-B) or the observations (Section III-C). The SI branch (SINET) performs transformation of the SI. The transformed SI is used to guide LeSITA to produce a representation of the target signal that improves reconstruction.
Table I: Sparse approximation results (NMSE in dB).

Table II: Reconstruction results on NIR images (PSNR in dB). Columns: single-modal methods (LISTA [14], LAMP [17], DL [12]) and multi-modal methods (Multimodal DL, LeSITA (), LeSITA ()). Rows, per CS ratio: country (0070), field (0058), forest (0058), indoor (0056), mountain (0055), oldbuilding (0103), street (0057), urban (0102), water (0083).

III-C LeSITA for Reconstruction with SI

We propose a reconstruction operator that effectively utilizes SI for signal recovery, following the architecture of Figure 3. In the main branch, a LeSITA encoder computes a latent representation of the observation vector obtained from (1), according to (14). A linear decoder performs reconstruction of the unknown signal, i.e., ; is a learnable dictionary. The role of the SINET branch is to enhance the encoding process by providing LeSITA with prior knowledge. In this task, the SINET is realized by a LISTA encoder, the weights of which are initialized with the SINET weights of the trained autoencoder (Sec. III-B). In this way, the LeSITA autoencoder is used to provide coupled sparse representations. The proposed model is trained using the loss function, , with the -th sample of the target signal and the respective model estimation.

IV Experimental Results

A first set of experiments concerns the performance of the proposed LeSITA model (14) in sparse approximation using synthetic data. We generate K pairs of sparse signals of length with nonzero coefficients drawn from a standard normal distribution. The sparsity level is kept fixed but the signals have varying support. The SI is generated such that and share the same support in a number of positions , that is, , , with , denoting the -th coefficient of the respective signals. For , we obtain , where is drawn from a normal distribution; therefore, for , the coefficients of and are of the same sign; the rest are drawn from a standard normal distribution. We vary the values of , i.e., , to obtain different levels of similarity between and . A random Gaussian matrix is used as a sparsifying dictionary and is set equal to the identity matrix. We use of the generated samples for validation and for testing.

We design a LeSITA (14) and a LISTA (8) model to learn sparse codes of the target signal. Different instantiations of both models are realized with different number of layers, i.e., . Average results are presented in Table I in terms of normalized mean square error (NMSE) in dB. When the involved signals are similar, i.e., , LeSITA outperforms LISTA substantially. The SI has a negative effect in reconstruction when the support differs in more than positions. The results also show that deeper models deliver better accuracy. Moreover, Table I includes results for SITA (13) after iterations, for . We also run (13) with the following stopping criteria: maximum number of iterations , minimum error equal to the error delivered by LeSITA () for . The respective average NMSE is  dB corresponding to iterations (on average). The comparison shows the computational efficiency of LeSITA against SITA.
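The NMSE figure of merit can be computed as sketched below. The averaging convention is an assumption of this sketch (the per-sample ratio of squared error to squared signal norm is averaged over samples before conversion to dB), and the function name `nmse_db` is hypothetical.

```python
import math


def nmse_db(estimates, targets):
    """Normalized mean square error in dB.

    estimates, targets : lists of equal-length vectors (one per sample).
    Each sample contributes ||x_hat - x||^2 / ||x||^2; the ratios are
    averaged and converted to decibels.
    """
    ratios = []
    for x_hat, x in zip(estimates, targets):
        err = sum((a - b) ** 2 for a, b in zip(x_hat, x))
        energy = sum(b * b for b in x)
        ratios.append(err / energy)
    return 10.0 * math.log10(sum(ratios) / len(ratios))


# An estimate that is off by 10% in every coefficient has a ratio of 0.01.
value = nmse_db([[0.9, 1.8, 2.7]], [[1.0, 2.0, 3.0]])
print(value)
```

More negative values indicate a more accurate sparse approximation.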

A second set of experiments involves real data from the EPFL dataset. The dataset contains spatially aligned pairs of near-infrared (NIR) and RGB images grouped in nine categories, e.g., “urban” and “forest”. Our goal is to reconstruct linearly subsampled NIR images (acquired as , , ) with the aid of RGB images. We convert the available images to grayscale and extract pairs of image patches (), creating a dataset of K samples. One image from each category is reserved for testing (in Table II, an image is identified by a code following the category name).

We design a LeSITA-based reconstruction operator with the LeSITA and LISTA encoders each comprising layers, initialized with weights learned from a LeSITA autoencoder. The autoencoder model was initialized with a random Gaussian dictionary and trained using (16) with . Besides , we also experiment with . For every testing image, we extract the central part and divide it into patches with an overlapping stride equal to . We apply CS with different ratios () to NIR image patches.

We compare our reconstruction operator with (i) a LISTA-based [14] reconstruction operator with layers, (ii) a LAMP-based [17] reconstruction operator with layers, (iii) a deep learning (DL) model proposed in [12], and (iv) a multimodal DL model inspired by [30, 31]; note that [14], [17] and [12] do not use SI. The multimodal model consists of two encoding branches and a single decoding branch. The target and SI encodings are concatenated to obtain a shared latent representation which is received by the decoder to estimate the target signal. Each encoding branch comprises three ReLU layers of dimension . The decoding branch comprises one ReLU and one linear layer. In all experiments, the projection matrix is jointly learned with the reconstruction operator (the model in [12] learns sparse ternary projections). Results presented in Table II in terms of peak signal-to-noise ratio (PSNR) show that LeSITA trained with manages to capture the correlation between the target and the SI signals and outperforms all the other models.
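For reference, PSNR between a reconstructed image and its ground truth can be computed as sketched below. The peak value of 255 assumes 8-bit grayscale intensities, images are passed as flat lists of pixel values, and the function name `psnr_db` is hypothetical.

```python
import math


def psnr_db(estimate, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB.

    estimate, reference : flat lists of pixel intensities of equal length
    peak                : assumed maximum intensity (255 for 8-bit images)
    """
    mse = sum((a - b) ** 2 for a, b in zip(estimate, reference)) / len(reference)
    if mse == 0.0:
        return math.inf                    # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Every pixel is off by 10 intensity levels, so the mean square error is 100.
value = psnr_db([110.0, 90.0, 110.0, 90.0], [100.0, 100.0, 100.0, 100.0])
print(value)
```

Higher values indicate a reconstruction closer to the reference image.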

V Conclusions and Future Work

We proposed a fast reconstruction operator for the recovery of an undersampled signal with the aid of SI. Our framework utilizes a novel deep learning model that produces coupled representations of correlated data, enabling the efficient use of the SI in the reconstruction of the target signal. Following design principles that rely on existing convex optimization methods allows the theoretical study of the proposed representation and reconstruction models, using sparse modelling and convex optimization theory. We will explore this research direction in our future work.

Appendix

The proximal operator for (9) has been defined in (10) as follows:

$$\xi_{\mu}(u; s) = \arg\min_{v} \left\{ \mu\|v\|_1 + \mu\|v - s\|_1 + \frac{1}{2}\|v - u\|_2^2 \right\}.$$

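Let us set

$$h(v) = \mu\|v\|_1 + \mu\|v - s\|_1 + \frac{1}{2}\|v - u\|_2^2. \tag{17}$$

Considering that the minimization of $h$ is separable, for the $i$-th component of the vectors involved in (17), we obtain

$$h(v_i) = \mu|v_i| + \mu|v_i - s_i| + \frac{1}{2}(v_i - u_i)^2. \tag{18}$$

Hereafter, we abuse the notation by omitting the index $i$ and denoting as $v$, $u$, $s$ the $i$-th component of the corresponding vectors.

Let $s \ge 0$. Then we consider the following five cases:

  1. If $v > s$, then

$$h(v) = \mu v + \mu(v - s) + \frac{1}{2}(v - u)^2.$$

    The partial derivative with respect to $v$ is

$$\frac{\partial h}{\partial v} = 2\mu + v - u.$$

    $h$ is minimized at $\partial h / \partial v = 0$, that is, $v = u - 2\mu$. For $v > s$, we obtain $u > s + 2\mu$. Therefore,

$$\xi_{\mu}(u; s) = u - 2\mu, \quad u > s + 2\mu. \tag{21}$$

  2. If $0 < v < s$, then

$$h(v) = \mu v + \mu(s - v) + \frac{1}{2}(v - u)^2 = \mu s + \frac{1}{2}(v - u)^2, \qquad \frac{\partial h}{\partial v} = v - u = 0 \;\Rightarrow\; v = u.$$

    For $0 < v < s$, we obtain $0 < u < s$, thus,

$$\xi_{\mu}(u; s) = u, \quad 0 < u < s. \tag{25}$$

  3. If $v < 0$, then

$$h(v) = -\mu v + \mu(s - v) + \frac{1}{2}(v - u)^2, \qquad \frac{\partial h}{\partial v} = -2\mu + v - u = 0 \;\Rightarrow\; v = u + 2\mu.$$

    For $v < 0$, we obtain $u + 2\mu < 0$ or $u < -2\mu$, thus,

$$\xi_{\mu}(u; s) = u + 2\mu, \quad u < -2\mu. \tag{29}$$

  4. If $v = 0$, then

$$0 \in \partial h(v)\big|_{v=0} = \mu[-1, 1] - \mu - u,$$

    where $\partial$ denotes the subgradient. Thus,

$$-2\mu \le u \le 0,$$

    and the proximal operator is given by

$$\xi_{\mu}(u; s) = 0, \quad -2\mu \le u \le 0. \tag{32}$$

  5. If $v = s$, then

$$0 \in \partial h(v)\big|_{v=s} = \mu + \mu[-1, 1] + s - u,$$

$$s \le u \le s + 2\mu,$$

    and the proximal operator is given by

$$\xi_{\mu}(u; s) = s, \quad s \le u \le s + 2\mu. \tag{35}$$

Therefore, for $s \ge 0$, (21), (25), (29), (32), and (35) result in:

$$\xi_{\mu}(u; s) = \begin{cases} u + 2\mu, & u < -2\mu \\ 0, & -2\mu \le u \le 0 \\ u, & 0 < u < s \\ s, & s \le u \le s + 2\mu \\ u - 2\mu, & u > s + 2\mu \end{cases}$$

Similarly, we calculate the proximal operator for $s < 0$.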


References

  • [1] J. A. Tropp and S. J. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
  • [2] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [3] Y. Zhang, “On theory of compressive sensing via l1 minimization: simple derivations and extensions,” Rice University, Tech. Rep., 2008.
  • [4] L. Weizman, Y. C. Eldar, and D. Ben Bashat, “Compressed sensing for longitudinal MRI: An adaptive-weighted approach,” Medical Physics, vol. 42, no. 9, pp. 5195–5208, 2015.
  • [5] J. F. C. Mota, N. Deligiannis, A. C. Sankaranarayanan, V. Cevher, and M. R. D. Rodrigues, “Dynamic sparse state estimation using ℓ1-ℓ1 minimization: Adaptive-rate measurement bounds, algorithms and applications,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 3332–3336.
  • [6] N. Vaswani and W. Lu, “Modified-CS: Modifying compressive sensing for problems with partially known support,” IEEE Transactions on Signal Processing, vol. 58, no. 9, pp. 4595–4607, 2010.
  • [7] A. Ma, Y. Zhou, C. Rush, D. Baron, and D. Needell, “An Approximate Message Passing Framework for Side Information,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1875–1888, 2019.
  • [8] P. Song, J. F. Mota, N. Deligiannis, and M. R. Rodrigues, “Coupled dictionary learning for multimodal image super-resolution,” in 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 162–166.
  • [9] N. Deligiannis, J. F. Mota, B. Cornelis, M. R. Rodrigues, and I. Daubechies, “Multi-modal dictionary learning for image separation with application in art investigation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 751–764, 2017.
  • [10] P. Song, X. Deng, J. F. Mota, N. Deligiannis, P. L. Dragotti, and M. R. Rodrigues, “Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries,” IEEE Transactions on Computational Imaging, 2019.
  • [11] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: Beyond analytical methods,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 20–36, 2018.
  • [12] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Deep learning sparse ternary projections for compressed sensing of images,” in 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017, pp. 1125–1129.
  • [13] A. Mousavi and R. G. Baraniuk, “Learning to invert: Signal recovery via deep convolutional networks,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2272–2276.
  • [14] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10. USA: Omnipress, 2010, pp. 399–406.
  • [15] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, “Maximal sparsity with deep networks?” in Advances in Neural Information Processing Systems, 2016, pp. 4340–4348.
  • [16] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
  • [17] M. Borgerding, P. Schniter, and S. Rangan, “AMP-inspired deep networks for sparse linear inverse problems,” IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4293–4308, 2017.
  • [18] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in Advances in Neural Information Processing Systems, 2018, pp. 9061–9071.
  • [19] R. Giryes, Y. C. Eldar, A. M. Bronstein, and G. Sapiro, “Tradeoffs between convergence speed and reconstruction accuracy in inverse problems,” IEEE Transactions on Signal Processing, vol. 66, no. 7, pp. 1676–1690, 2018.
  • [20] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
  • [21] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
  • [22] C. Metzler, A. Maleki, and R. Baraniuk, “From denoising to compressed sensing,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
  • [23] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM review, vol. 43, no. 1, pp. 129–159, 2001.
  • [24] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in 2011 IEEE International Symposium on Information Theory Proceedings.   IEEE, 2011, pp. 2168–2172.
  • [25] P. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-point algorithms for inverse problems in science and engineering.   Springer, 2011, pp. 185–212.
  • [26] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Foundations and Trends in Machine Learning, vol. 4, no. 1, pp. 1–106, Jan. 2012.
  • [27] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 57, no. 11, pp. 1413–1457, 2004.
  • [28] P. L. Combettes and V. Wajs, “Signal recovery by proximal forward-backward splitting,” SIAM Journal on Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal, vol. 4, pp. 1164–1200, 2005.
  • [29] J. F. C. Mota, N. Deligiannis, and M. R. D. Rodrigues, “Compressed Sensing with Prior Information: Strategies, Geometry, and Bounds,” IEEE Transactions on Information Theory, vol. 63, pp. 4472–4496, 2017.
  • [30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
  • [31] W. Ouyang, X. Chu, and X. Wang, “Multi-source deep learning for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2329–2336.