I Introduction
Linear inverse problems arise in various signal processing domains such as computational imaging, remote sensing, seismology and astronomy, to name a few. These problems can be expressed by a linear equation of the form:
(1) $y = \Phi x + e$,
where $x \in \mathbb{R}^{n}$ is the unknown signal, $\Phi \in \mathbb{R}^{m \times n}$, $m \le n$, is a linear operator, and $y \in \mathbb{R}^{m}$ denotes the observations contaminated with noise $e \in \mathbb{R}^{m}$. Sparsity is commonly used for the regularization of ill-posed inverse problems, leading to the so-called sparse approximation problem [1]. Compressed sensing (CS) [2] deals with the sparse recovery of linearly subsampled signals and falls in this category.
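As a minimal sketch of the measurement model (1) in a CS setting, the snippet below draws a sparse signal and observes it through a random Gaussian operator with additive noise. The function name `simulate_measurements`, the sizes, the noise level and the Gaussian choice of the operator are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def simulate_measurements(x, m, noise_std=0.01, seed=0):
    """Return (Phi, y) with y = Phi @ x + e for a random Gaussian Phi of shape (m, n)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # assumed operator: scaled Gaussian
    e = noise_std * rng.standard_normal(m)          # additive measurement noise
    return Phi, Phi @ x + e

# Hypothetical sizes: a length-64 signal with 5 nonzeros, observed through 32 measurements.
n, m, n_nonzero = 64, 32, 5
rng = np.random.default_rng(1)
x = np.zeros(n)
x[rng.choice(n, size=n_nonzero, replace=False)] = rng.standard_normal(n_nonzero)
Phi, y = simulate_measurements(x, m)
```

Since $m < n$ here, the system is underdetermined, which is why a prior such as sparsity is needed to recover the signal.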
In several applications, besides the observations of the target signal, additional information from correlated signals is often available [3, 4, 5, 6, 7, 8, 9, 10]. In multimodal applications, combining information from multiple signals calls for methods that allow coupled signal representations, capturing the similarities between correlated data. To this end, coupled dictionary learning is a popular approach [8, 9, 10]; however, dictionary learning methods employ overcomplete dictionaries, resulting in computationally expensive sparse approximation problems.
Deep learning has gained a lot of momentum in solving inverse problems, often surpassing the performance of analytical approaches [11, 12, 13]. Nevertheless, neural networks have a complex structure and appear as “black boxes”; thus, understanding what the model has learned is an active research topic. Among the efforts trying to bridge the gap between analytical methods and deep learning is the work presented in [14], which introduced the idea of unfolding a numerical algorithm for sparse approximation into a neural network form. Several unfolding approaches [15, 16, 17] followed that of [14]. Although the primary motivation for deploying deep learning in inverse problems concerns the reduction of the computational complexity, unfolding offers another significant benefit: the model architecture gives better insight into the inference procedure and enables the theoretical study of the network using results from sparse modelling [18, 19, 20, 15].
In this paper, we propose a deep unfolding model for the recovery of a signal with the aid of a correlated signal, the side information (SI). To the best of our knowledge, this is the first work in deep unfolding that incorporates SI. Our contribution is as follows: (i) Inspired by [14], we design a deep neural network that unfolds a proximal algorithm for sparse approximation with SI; we coin our model Learned Side Information Thresholding Algorithm (LeSITA). (ii) We use LeSITA in an autoencoder fashion to learn coupled representations of correlated signals from different modalities. (iii) We design a LeSITA-based reconstruction operator that utilizes learned SI provided by the autoencoder to enhance signal recovery.
We test our method in an example application, namely, multimodal reconstruction from CS measurements. Other inverse problems of the form (1) such as image super-resolution [21, 8] or image denoising [22] can benefit from the proposed approach. We compare our method with existing single-modal deep learning methods that do not use SI, multimodal deep learning designs, and optimization algorithms, showing its superior performance.
II Background and Related Work
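A common approach for solving problems of the form (1) with sparsity constraints is convex optimization [23]. Let us assume that the unknown $x \in \mathbb{R}^{n}$ has a sparse representation $\alpha \in \mathbb{R}^{k}$ with respect to a dictionary $D_x \in \mathbb{R}^{n \times k}$, $n \le k$, that is, $x = D_x \alpha$. Then, (1) takes the form
(2) $y = \Phi D_x \alpha + e$,
and a solution can be obtained via the formulation of the $\ell_1$ minimization problem:
(3) $\min_{\alpha} \frac{1}{2} \| \Phi D_x \alpha - y \|_2^2 + \lambda \| \alpha \|_1$,
where $\| \cdot \|_1$ denotes the $\ell_1$ norm ($\| \alpha \|_1 = \sum_i |\alpha_i|$), which promotes sparse solutions, and $\lambda > 0$ is a regularization parameter.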
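Numerical methods [1] proposed to solve (3) include pivoting algorithms, interior-point methods, gradient-based methods and approximate message passing (AMP) algorithms [24]. Among gradient-based methods, proximal methods are tailored to optimize an objective of the form
(4) $\min_{\alpha} f(\alpha) + g(\alpha)$,
where $f$ is a convex differentiable function with a Lipschitz-continuous gradient, and $g$ is convex and possibly nonsmooth [25], [26]. Their main step involves the proximal operator, defined for a function $g$ according to
(5) $\mathrm{prox}_{\mu g}(u) = \arg\min_{v} \left\{ \frac{1}{2} \| u - v \|_2^2 + \mu g(v) \right\}$,
with $\mu = 1/L$ and $L > 0$ an upper bound on the Lipschitz constant of $\nabla f$. A popular proximal algorithm is the Iterative Soft Thresholding Algorithm (ISTA) [27, 28]. Let us set $f(\alpha) = \frac{1}{2} \| \Phi D_x \alpha - y \|_2^2$, $g(\alpha) = \lambda \| \alpha \|_1$ in (3). At the $t$-th iteration ISTA computes:
(6) $\alpha^{t} = \phi_{\gamma}\left( \alpha^{t-1} - \frac{1}{L} D_x^{T} \Phi^{T} ( \Phi D_x \alpha^{t-1} - y ) \right)$,
where $\phi_{\gamma}$ denotes the proximal operator [Figure 1(a)] expressed by the componentwise shrinkage function:
(7) $\phi_{\gamma}(u_i) = \mathrm{sign}(u_i) \max\{ 0, |u_i| - \gamma \}$,
with $\gamma = \lambda / L$.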
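The ISTA iteration, a gradient step on the quadratic term followed by componentwise soft thresholding, can be sketched in NumPy as follows. The matrix `A` is shorthand for the product of the measurement operator and the dictionary; the function names, the toy sizes and the value of the regularization weight are illustrative assumptions.

```python
import numpy as np

def soft_threshold(u, gamma):
    """Componentwise shrinkage: sign(u) * max(|u| - gamma, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

def lasso_objective(A, y, alpha, lam):
    """0.5 * ||A alpha - y||_2^2 + lam * ||alpha||_1."""
    return 0.5 * np.sum((A @ alpha - y) ** 2) + lam * np.sum(np.abs(alpha))

def ista(A, y, lam, n_iter=200):
    """Run n_iter ISTA iterations starting from alpha = 0."""
    L = np.linalg.norm(A, 2) ** 2  # squared spectral norm: Lipschitz constant of the gradient
    alpha = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ alpha - y)                       # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / L, lam / L)  # proximal step, threshold lam / L
    return alpha

# Toy problem with hypothetical sizes: 20 measurements, 50 atoms, 3 nonzeros.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = A @ alpha_true
alpha_hat = ista(A, y, lam=0.1)
```

With the step size set to the reciprocal of the Lipschitz constant, each iteration does not increase the objective, which `lasso_objective` can be used to check.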
To address the high computational cost of numerical algorithms, Gregor and LeCun [14] unfolded ISTA into a neural network referred to as LISTA. Specifically, by setting $S = I - \frac{1}{L} D_x^{T} \Phi^{T} \Phi D_x$, $W = \frac{1}{L} D_x^{T} \Phi^{T}$, (6) results in
(8) $\alpha^{t} = \phi_{\gamma}( S \alpha^{t-1} + W y )$.
Considering a correspondence of every iteration with a neural network layer, a number of iterations of (8) can be implemented by a recurrent or feed-forward neural network; $S$, $W$ and $\gamma$ are learnable parameters, and the proximal operator (7) acts as a nonlinear activation function. A fixed-depth network allows the computation of sparse codes in a fixed amount of time. Similar unfolding methods were proposed in [15, 16, 17].
III Proposed Framework
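In this paper, we consider that, besides the observations of the target signal, we also have access to SI, that is, a signal $w \in \mathbb{R}^{d}$ correlated to the unknown $x$. We assume that $x$ and $w$ have similar sparse representations $\alpha \in \mathbb{R}^{k}$, $s \in \mathbb{R}^{k}$, under dictionaries $D_x \in \mathbb{R}^{n \times k}$, $D_w \in \mathbb{R}^{d \times k}$, $n \le k$, $d \le k$, respectively. Specifically, we assume that $\alpha$ and $s$ are similar by means of the $\ell_1$ norm, that is, $\| \alpha - s \|_1$ is small. The condition holds for representations with partially common support and a number of similar nonzero coefficients; we refer to them as coupled sparse representations. Then, $\alpha$ can be obtained from the $\ell_1$-$\ell_1$ minimization problem
(9) $\min_{\alpha} \frac{1}{2} \| \Phi D_x \alpha - y \|_2^2 + \lambda ( \| \alpha \|_1 + \| \alpha - s \|_1 )$.
Problem (9) has been theoretically studied in [29] and has been employed for the recovery of sequential signals in [3, 4, 5].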
We can easily obtain coupled sparse representations of sequential signals that change slowly using the same sparsifying dictionary [3, 4, 5]. However, this is not the case in most multimodal applications, where, typically, finding coupled sparse representations involves dictionary learning and complex optimization methods [8, 9, 10]. In this work, we propose an efficient approach based on a novel multimodal deep unfolding model. The model is employed for learning coupled representations of the target signal and the SI (Section III-B), and for reconstruction with SI (Section III-C). Our approach is inspired by a proximal algorithm for the solution of (9).
III-A Sparse Approximation with SI via Deep Unfolding
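Problem (9) is of the form (4) with $f(\alpha) = \frac{1}{2} \| \Phi D_x \alpha - y \|_2^2$ and $g(\alpha) = \lambda ( \| \alpha \|_1 + \| \alpha - s \|_1 )$. The proximal operator for $g$ is defined by
(10) $\xi_{\mu}(u; s) = \arg\min_{v} \left\{ \frac{1}{2} \| v - u \|_2^2 + \mu ( \| v \|_1 + \| v - s \|_1 ) \right\}$,
where $\mu = \lambda / L$, and $L$ is an upper bound on the Lipschitz constant of $\nabla f$. All terms in (10) are separable, thus, we can easily show that (see Appendix):

For $s_i \ge 0$, $i = 1, \dots, k$:
(11) $\xi_{\mu}(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < -2\mu \\ 0, & -2\mu \le u_i \le 0 \\ u_i, & 0 < u_i < s_i \\ s_i, & s_i \le u_i \le s_i + 2\mu \\ u_i - 2\mu, & u_i > s_i + 2\mu \end{cases}$
For $s_i < 0$, $i = 1, \dots, k$:
(12) $\xi_{\mu}(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < s_i - 2\mu \\ s_i, & s_i - 2\mu \le u_i \le s_i \\ u_i, & s_i < u_i < 0 \\ 0, & 0 \le u_i \le 2\mu \\ u_i - 2\mu, & u_i > 2\mu \end{cases}$
Figure 1(b) depicts the graphical representation of the proximal operator given by (11). A proximal method for (9) then takes the form
(13) $\alpha^{t} = \xi_{\mu}\left( \alpha^{t-1} - \frac{1}{L} D_x^{T} \Phi^{T} ( \Phi D_x \alpha^{t-1} - y ); s \right)$.
We coin (13) Side-Information-driven iterative soft Thresholding Algorithm (SITA).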
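A minimal NumPy sketch of the SI-driven proximal step and the resulting iteration is given below. It assumes the $\ell_1$-$\ell_1$ penalty studied in [29], so that each component solves $\min_v \frac{1}{2}(v-u)^2 + \mu(|v| + |v-s|)$; the five branches for nonnegative and negative SI coefficients are merged by working with `min(s, 0)` and `max(s, 0)`. The names `prox_si` and `sita`, the shorthand `A` for the product of the measurement operator and the dictionary, and all numeric values are illustrative assumptions.

```python
import numpy as np

def prox_si(u, s, mu):
    """Componentwise minimizer of 0.5*(v - u)**2 + mu*(|v| + |v - s|)."""
    lo = np.minimum(s, 0.0)  # lower kink of the penalty (0 if s >= 0, else s)
    hi = np.maximum(s, 0.0)  # upper kink of the penalty (s if s >= 0, else 0)
    return np.where(u < lo - 2 * mu, u + 2 * mu,      # left of both kinks: shift up
           np.where(u <= lo, lo,                      # stick to the lower kink
           np.where(u < hi, u,                        # between the kinks: identity
           np.where(u <= hi + 2 * mu, hi,             # stick to the upper kink
                    u - 2 * mu))))                    # right of both kinks: shift down

def sita(A, y, s, lam, n_iter=200):
    """Proximal gradient iterations for the l1-l1 problem, starting from alpha = 0."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient of the quadratic term
    alpha = np.zeros(A.shape[1])
    for _ in range(n_iter):
        alpha = prox_si(alpha - A.T @ (A @ alpha - y) / L, s, lam / L)
    return alpha

# Toy usage with hypothetical sizes; the SI shares two of three support positions.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
s = np.zeros(50)
s[[3, 17]] = [1.4, -1.8]
alpha_hat = sita(A, A @ alpha_true, s, lam=0.1)
```

Compared with plain soft thresholding, the operator has a second flat region located at the SI coefficient, so inputs close to it are pulled onto the SI value rather than towards zero.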
We unfold SITA to a neural network form, by setting $S = I - \frac{1}{L} D_x^{T} \Phi^{T} \Phi D_x$, $W = \frac{1}{L} D_x^{T} \Phi^{T}$. Then (13) results in
(14) $\alpha^{t} = \xi_{\mu}( S \alpha^{t-1} + W y; s )$.
The update (14) has a similar expression to LISTA (8); however, the two algorithms involve different proximal operators (Figure 1). A fixed number of iterations of (14) can be implemented by a recurrent or feed-forward neural network, with the proximal operator given by (11), (12) employed as a nonlinear activation function, which integrates the SI; $S$, $W$ and $\mu$ are learnable parameters. The network architecture is depicted in Figure 2.
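The unfolded forward pass can be sketched as a fixed number of layers, each applying an affine map followed by the SI-driven activation. The sketch below is untrained and shares one weight pair across layers (the recurrent form); the names `lesita_forward`, `S`, `W` and `mu` are illustrative assumptions, and the weights are initialised from a hypothetical matrix `A` in the way the unfolding suggests, so that the layers reproduce the first iterations of the proximal algorithm.

```python
import numpy as np

def prox_si(u, s, mu):
    """Componentwise minimizer of 0.5*(v - u)**2 + mu*(|v| + |v - s|)."""
    lo, hi = np.minimum(s, 0.0), np.maximum(s, 0.0)
    return np.where(u < lo - 2 * mu, u + 2 * mu,
           np.where(u <= lo, lo,
           np.where(u < hi, u,
           np.where(u <= hi + 2 * mu, hi, u - 2 * mu))))

def lesita_forward(y, s, S, W, mu, n_layers):
    """Forward pass of n_layers unfolded layers with shared weights S, W, mu."""
    b = W @ y                      # input branch, reused by every layer
    alpha = prox_si(b, s, mu)      # first layer, starting from alpha = 0
    for _ in range(n_layers - 1):
        alpha = prox_si(S @ alpha + b, s, mu)
    return alpha

# Hypothetical initialisation from a known matrix A; in training, S, W and mu would be learned.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
L = np.linalg.norm(A, 2) ** 2
S = np.eye(50) - A.T @ A / L
W = A.T / L
s = np.zeros(50)
s[[3, 17]] = [1.0, -1.5]
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = A @ alpha_true
alpha_hat = lesita_forward(y, s, S, W, mu=0.1 / L, n_layers=5)
```

The cost of inference is fixed by `n_layers`, in contrast to an iterative solver that runs until a stopping criterion is met.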
We can train the neural network using pairs of sparse codes corresponding to pairs of correlated signals, and a loss function of the form:
(15) 
where $\hat{\alpha}$ is the output estimation. The learning results in a fast sparse approximation operator that directly maps the input observation vector $y$ to a sparse code $\alpha$ with the aid of the SI $s$. We coin this operator Learned Side Information Thresholding Algorithm (LeSITA).
III-B LeSITA Autoencoder for Coupled Representations
Instead of training using sparse codes, we can use LeSITA in an autoencoder fashion to learn coupled representations of $x$, $w$. By setting $\Phi$ equal to the identity matrix, (9) reduces to a sparse representation problem with SI. Then, (14) can compute a representation of $x$ according to $\alpha^{t} = \xi_{\mu}( S \alpha^{t-1} + W x; s )$. The proposed autoencoder is depicted in Figure 3. The main branch accepts as input the target signal $x$. The core component is a LeSITA encoder, followed by a linear decoder performing reconstruction, i.e., $\hat{x} = D \alpha$; $D$ is a trainable dictionary ($D$ is not tied to any other weight). A second branch referred to as SINET acts as an SI encoder, performing a (possibly) nonlinear transformation of the SI. We employ LISTA (8) to incorporate sparse priors in the transformation, obtaining $s^{t} = \phi_{\gamma}( S_w s^{t-1} + W_w w )$; $\phi_{\gamma}$ is given by (7), and $S_w$, $W_w$ and $\gamma$ are learnable parameters. The number of layers of LISTA and LeSITA may differ.
We use pairs of correlated signals $(x_i, w_i)$ to train our autoencoder, and an objective function of the form:
(16) $\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$,
where $\mathcal{L}_1$ is the reconstruction loss, $\mathcal{L}_2$ is a constraint on the latent representations, and $\lambda_1$, $\lambda_2$ are appropriate weights. We use the  norm as reconstruction loss, i.e., , where $x_i$ is the $i$-th sample of the target signal and $\hat{x}_i$ is the respective output estimation. $\mathcal{L}_2$ is chosen to promote coupled latent representations capturing the correlation between $x$ and $w$.
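Table I: Sparse approximation results (NMSE in dB) for LISTA [14], LeSITA and SITA at different similarity levels.
Table II: Reconstruction results (PSNR) at different CS ratios for single-modal methods (LISTA [14], LAMP [17], DL [12]) and multimodal methods (Multimodal DL, LeSITA (), LeSITA ()) on the test images country (0070), field (0058), forest (0058), indoor (0056), mountain (0055), oldbuilding (0103), street (0057), urban (0102), water (0083), and on average.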
III-C LeSITA for Reconstruction with SI
We propose a reconstruction operator that effectively utilizes SI for signal recovery, following the architecture of Figure 3. In the main branch, a LeSITA encoder computes a latent representation of the observation vector $y$ obtained from (1), according to (14). A linear decoder performs reconstruction of the unknown signal, i.e., $\hat{x} = D \alpha$; $D$ is a learnable dictionary. The role of the SINET branch is to enhance the encoding process by providing LeSITA with prior knowledge. In this task, the SINET is realized by a LISTA encoder, the weights of which are initialized with the SINET weights of the trained autoencoder (Section III-B). In this way, the LeSITA autoencoder is used to provide coupled sparse representations. The proposed model is trained using the  loss function, , with $x_i$ the $i$-th sample of the target signal and $\hat{x}_i$ the respective model estimation.
IV Experimental Results
A first set of experiments concerns the performance of the proposed LeSITA model (14) in sparse approximation using synthetic data. We generate K pairs of sparse signals of length with nonzero coefficients drawn from a standard normal distribution. The sparsity level is kept fixed but the signals have varying support. The SI is generated such that and share the same support in a number of positions , that is, , , with , denoting the th coefficient of the respective signals. For , we obtain , where is drawn from a normal distribution; therefore, for , the coefficients of and are of the same sign; the rest are drawn from a standard normal distribution. We vary the values of , i.e., , to obtain different levels of similarity between and . A random Gaussian matrix is used as a sparsifying dictionary and is set equal to the identity matrix. We use of the generated samples for validation and for testing.
We design a LeSITA (14) and a LISTA (8) model to learn sparse codes of the target signal. Different instantiations of both models are realized with different numbers of layers, i.e., . Average results are presented in Table I in terms of normalized mean square error (NMSE) in dB. When the involved signals are similar, i.e., , LeSITA outperforms LISTA substantially. The SI has a negative effect on reconstruction when the support differs in more than positions. The results also show that deeper models deliver better accuracy. Moreover, Table I includes results for SITA (13) after iterations, for . We also run (13) with the following stopping criteria: maximum number of iterations , minimum error equal to the error delivered by LeSITA () for . The respective average NMSE is dB corresponding to iterations (on average). The comparison shows the computational efficiency of LeSITA against SITA.
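The synthetic setup and the error metric can be sketched as follows. The NMSE definition used here (per-sample ratio of squared error to squared norm, averaged, then converted to dB) is one common convention and is an assumption, as are the function names, the sizes and the coupling rule: on the shared support positions the SI coefficient is obtained by rescaling the target coefficient with a positive random factor, which is only one hypothetical way to make the signs agree.

```python
import numpy as np

def nmse_db(alpha_hat, alpha):
    """NMSE in dB, averaged over the rows (samples) of two (N, k) arrays."""
    err = np.sum((alpha_hat - alpha) ** 2, axis=1)
    ref = np.sum(alpha ** 2, axis=1)
    return 10.0 * np.log10(np.mean(err / ref))

def make_pair(k, n_nonzero, n_shared, rng):
    """One (alpha, s) pair of k-dimensional sparse codes sharing n_shared support positions."""
    alpha = np.zeros(k)
    s = np.zeros(k)
    support = rng.choice(k, size=n_nonzero, replace=False)
    alpha[support] = rng.standard_normal(n_nonzero)
    shared = support[:n_shared]
    # Hypothetical coupling rule: positive rescaling keeps the sign of the target coefficient.
    s[shared] = alpha[shared] * np.abs(rng.normal(1.0, 0.1, size=n_shared))
    # Remaining SI coefficients sit outside the target support, drawn from N(0, 1).
    free = np.setdiff1d(np.arange(k), support)
    other = rng.choice(free, size=n_nonzero - n_shared, replace=False)
    s[other] = rng.standard_normal(n_nonzero - n_shared)
    return alpha, s

# Hypothetical sizes: 50-dimensional codes, 5 nonzeros, 3 shared support positions.
rng = np.random.default_rng(0)
alpha, s = make_pair(k=50, n_nonzero=5, n_shared=3, rng=rng)
```

Lowering `n_shared` reduces the similarity between the two codes, which is the regime in which the text reports that the SI stops helping.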
A second set of experiments involves real data from the EPFL dataset (https://ivrl.epfl.ch/supplementary_material/cvpr11/). The dataset contains spatially aligned pairs of near-infrared (NIR) and RGB images grouped in nine categories, e.g., “urban” and “forest”. Our goal is to reconstruct linearly subsampled NIR images (acquired as , , ) with the aid of RGB images. We convert the available images to grayscale and extract pairs of image patches (), creating a dataset of K samples. One image from each category is reserved for testing. (In Table II, an image is identified by a code following the category name.)
We design a LeSITA-based reconstruction operator with the LeSITA and LISTA encoders each comprising layers, initialized with weights learned from a LeSITA autoencoder. The autoencoder model was initialized with a random Gaussian dictionary and trained using (16) with . Besides , we also experiment with . For every testing image, we extract the central part and divide it into patches with an overlapping stride equal to . We apply CS with different ratios () to NIR image patches.
We compare our reconstruction operator with (i) a LISTA-based [14] reconstruction operator with layers, (ii) a LAMP-based [17] reconstruction operator with layers, (iii) a deep learning (DL) model proposed in [12], and (iv) a multimodal DL model inspired by [30, 31]; note that [14], [17] and [12] do not use SI. The multimodal model consists of two encoding branches and a single decoding branch. The target and SI encodings are concatenated to obtain a shared latent representation, which is received by the decoder to estimate the target signal. Each encoding branch comprises three ReLU layers of dimension . The decoding branch comprises one ReLU and one linear layer. In all experiments, the projection matrix is jointly learned with the reconstruction operator. (The model in [12] learns sparse ternary projections.) Results presented in Table II in terms of peak signal-to-noise ratio (PSNR) show that LeSITA trained with manages to capture the correlation between the target and the SI signals and outperforms all the other models.
V Conclusions and Future Work
We proposed a fast reconstruction operator for the recovery of an undersampled signal with the aid of SI. Our framework utilizes a novel deep learning model that produces coupled representations of correlated data, enabling the efficient use of the SI in the reconstruction of the target signal. Following design principles that rely on existing convex optimization methods allows the theoretical study of the proposed representation and reconstruction models, using sparse modelling and convex optimization theory. We will explore this research direction in our future work.
Appendix
The proximal operator for (9) has been defined in (10) as follows:
$\xi_{\mu}(u; s) = \arg\min_{v} \left\{ \frac{1}{2} \| v - u \|_2^2 + \mu ( \| v \|_1 + \| v - s \|_1 ) \right\}$.
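Let us set
(17) $h(v) = \frac{1}{2} \| v - u \|_2^2 + \mu ( \| v \|_1 + \| v - s \|_1 )$.
Considering that the minimization of $h$ is separable, for the $i$-th component of the vectors involved in (17), we obtain
(18) $h(v_i) = \frac{1}{2} ( v_i - u_i )^2 + \mu ( |v_i| + |v_i - s_i| )$.
Hereafter, we abuse the notation by omitting the index $i$ and denoting as $v$, $u$, $s$ the $i$-th component of the corresponding vectors.
Let $s \ge 0$. Then we consider the following five cases:

If $0 < v < s$, then
(19) $h(v) = \frac{1}{2} (v - u)^2 + \mu v + \mu (s - v) = \frac{1}{2} (v - u)^2 + \mu s$.
The partial derivative with respect to $v$ is
(20) $\frac{\partial h}{\partial v} = v - u$.
$h$ is minimized at $\frac{\partial h}{\partial v} = 0$, that is, $v = u$. For $0 < v < s$, we obtain $0 < u < s$. Therefore,
(21) $\xi_{\mu}(u; s) = u$, $0 < u < s$.

If $v > s$, then
(22) $h(v) = \frac{1}{2} (v - u)^2 + \mu (2v - s)$,
(23) $\frac{\partial h}{\partial v} = v - u + 2\mu$,
(24) $v = u - 2\mu$.
For $v > s$, we obtain $u > s + 2\mu$, thus,
(25) $\xi_{\mu}(u; s) = u - 2\mu$, $u > s + 2\mu$.

If $v < 0$, then
(26) $h(v) = \frac{1}{2} (v - u)^2 + \mu (s - 2v)$,
(27) $\frac{\partial h}{\partial v} = v - u - 2\mu$,
(28) $v = u + 2\mu$.
For $v < 0$, we obtain $u + 2\mu < 0$ or $u < -2\mu$, thus,
(29) $\xi_{\mu}(u; s) = u + 2\mu$, $u < -2\mu$.

If $v = 0$, then
(30) $\partial h(v) = -u - \mu + \mu [-1, 1]$,
where $\partial$ denotes the subgradient. Thus,
(31) $0 \in \partial h(0) \Leftrightarrow -2\mu \le u \le 0$,
and the proximal operator is given by
(32) $\xi_{\mu}(u; s) = 0$, $-2\mu \le u \le 0$.

If $v = s$, then
(33) $\partial h(v) = s - u + \mu + \mu [-1, 1]$.
Thus,
(34) $0 \in \partial h(s) \Leftrightarrow s \le u \le s + 2\mu$,
and the proximal operator is given by
(35) $\xi_{\mu}(u; s) = s$, $s \le u \le s + 2\mu$.

Therefore, for $s \ge 0$, (21), (25), (29), (32), and (35) result in:
$\xi_{\mu}(u; s) = \begin{cases} u + 2\mu, & u < -2\mu \\ 0, & -2\mu \le u \le 0 \\ u, & 0 < u < s \\ s, & s \le u \le s + 2\mu \\ u - 2\mu, & u > s + 2\mu \end{cases}$
Similarly, we calculate the proximal operator for $s < 0$.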
References
 [1] J. A. Tropp and S. J. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
 [2] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [3] Y. Zhang, “On theory of compressive sensing via l1 minimization: simple derivations and extensions,” Rice University, Tech. Rep., 2008.
 [4] L. Weizman, Y. C. Eldar, and D. Ben Bashat, “Compressed sensing for longitudinal MRI: An adaptive-weighted approach,” Medical Physics, vol. 42, no. 9, pp. 5195–5208, 2015.
 [5] J. F. C. Mota, N. Deligiannis, A. C. Sankaranarayanan, V. Cevher, and M. R. D. Rodrigues, “Dynamic sparse state estimation using $\ell_1$-$\ell_1$ minimization: Adaptive-rate measurement bounds, algorithms and applications,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 3332–3336.
 [6] N. Vaswani and W. Lu, “Modified-CS: Modifying compressive sensing for problems with partially known support,” IEEE Transactions on Signal Processing, vol. 58, no. 9, pp. 4595–4607, 2010.
 [7] A. Ma, Y. Zhou, C. Rush, D. Baron, and D. Needell, “An Approximate Message Passing Framework for Side Information,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1875–1888, 2019.
 [8] P. Song, J. F. Mota, N. Deligiannis, and M. R. Rodrigues, “Coupled dictionary learning for multimodal image super-resolution,” in 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 162–166.
 [9] N. Deligiannis, J. F. Mota, B. Cornelis, M. R. Rodrigues, and I. Daubechies, “Multimodal dictionary learning for image separation with application in art investigation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 751–764, 2017.
 [10] P. Song, X. Deng, J. F. Mota, N. Deligiannis, P. L. Dragotti, and M. R. Rodrigues, “Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries,” IEEE Transactions on Computational Imaging, 2019.
 [11] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: Beyond analytical methods,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 20–36, 2018.
 [12] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Deep learning sparse ternary projections for compressed sensing of images,” in 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017, pp. 1125–1129.
 [13] A. Mousavi and R. G. Baraniuk, “Learning to invert: Signal recovery via deep convolutional networks,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2272–2276.

 [14] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10. USA: Omnipress, 2010, pp. 399–406.
 [15] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, “Maximal sparsity with deep networks?” in Advances in Neural Information Processing Systems, 2016, pp. 4340–4348.
 [16] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
 [17] M. Borgerding, P. Schniter, and S. Rangan, “AMP-inspired deep networks for sparse linear inverse problems,” IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4293–4308, 2017.
 [18] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in Advances in Neural Information Processing Systems, pp. 9061–9071, 2018.
 [19] R. Giryes, Y. C. Eldar, A. M. Bronstein, and G. Sapiro, “Tradeoffs between convergence speed and reconstruction accuracy in inverse problems,” IEEE Transactions on Signal Processing, vol. 66, no. 7, pp. 1676–1690, 2018.
 [20] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
 [21] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
 [22] C. Metzler, A. Maleki, and R. Baraniuk, “From denoising to compressed sensing,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
 [23] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM review, vol. 43, no. 1, pp. 129–159, 2001.
 [24] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in 2011 IEEE International Symposium on Information Theory Proceedings. IEEE, 2011, pp. 2168–2172.
 [25] P. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-point algorithms for inverse problems in science and engineering. Springer, 2011, pp. 185–212.
 [26] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Foundations and Trends in Machine Learning, vol. 4, no. 1, pp. 1–106, Jan. 2012.
 [27] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 57, no. 11, pp. 1413–1457, 2004.
 [28] P. L. Combettes and V. Wajs, “Signal recovery by proximal forward-backward splitting,” SIAM Journal on Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal, vol. 4, pp. 1164–1200, 2005.
 [29] J. F. C. Mota, N. Deligiannis, and M. R. D. Rodrigues, “Compressed Sensing with Prior Information: Strategies, Geometry, and Bounds,” IEEE Transactions on Information Theory, vol. 63, pp. 4472–4496, 2017.
 [30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML11), 2011, pp. 689–696.

 [31] W. Ouyang, X. Chu, and X. Wang, “Multisource deep learning for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2329–2336.