Image super-resolution (SR) refers to the recovery of a high-resolution (HR) image from its low-resolution (LR) version. The problem is severely ill-posed, and a common approach to its solution relies on sparse priors [26, 17, 25]. For example, one such method is based on the assumption that the LR and HR images have joint sparse representations with respect to a pair of dictionaries. Nevertheless, sparsity-based methods result in complex optimization problems, which is a significant drawback in large-scale settings.
To avoid the high computational cost of iterative numerical optimization, deep neural networks have been successfully applied to image SR, achieving state-of-the-art performance [8, 9, 11, 13, 15]. Deep learning methods rely on large datasets to learn a non-linear transformation between the LR and HR image spaces; they are efficient at inference because the computational load is shifted to the training phase. However, most existing deep models do not integrate domain knowledge about the problem and cannot provide theoretical justifications for their effectiveness. A different approach was followed in recent work that relies on a deep unfolding architecture referred to as LISTA. LISTA introduced the idea of translating an iterative numerical algorithm for sparse approximation into a feed-forward neural network. By integrating LISTA into their network architecture, the authors of that work managed to incorporate sparse priors into the deep learning solution.
In many image processing and machine vision applications, a reference HR image from a second modality is available [10, 14]. The recovery of an HR image from its LR variant with the aid of another HR image from a different image modality is referred to as multimodal image SR [14, 20]. Several studies have investigated sparse representation models as well as deep learning methods for multimodal image SR [20, 14, 12, 10, 7, 6].
In this paper, we propose a deep network architecture that incorporates sparse priors and effectively utilizes information from another image modality to perform multimodal image SR. Inspired by this line of work, the proposed deep learning model relies on a deep unfolding method for sparse approximation with side information. Our contributions are threefold:
1) We address multimodal image SR as a problem of sparse approximation with side information and formulate an appropriate ℓ1-ℓ1 minimization problem for its solution.
2) We design a core neural network model for patch-based image SR, employing a recently proposed deep sparse approximation model, referred to as LeSITA, to integrate information from another HR image modality into the solution.
3) We propose two novel deep neural network architectures for multimodal image SR that employ the LeSITA-based SR model, achieving superior performance against state-of-the-art methods.
The proposed models are applied to super-resolve near-infrared (NIR) LR images given HR RGB images, and their performance is demonstrated by experimental results.
II Background and Related Work
II-A Single Image SR with Sparse Priors
Let Y be an LR image obtained from an HR image X. The transformation of an HR image to an LR image is commonly modeled as a blurring and downscaling process expressed by Y = SHX, where H and S denote blurring and downscaling operators, respectively. Under this assumption, an n-dimensional (vectorized) patch y from the bicubic-upscaled LR image exhibits a common sparse representation with the corresponding patch x from the HR image w.r.t. different over-complete dictionaries D_l, D_h ∈ ℝ^(n×k), that is, y = D_l α and x = D_h α, with α ∈ ℝ^k. The dictionaries D_l and D_h can be jointly learned with coupled dictionary learning techniques. Therefore, the problem of computing the HR patch x given the LR patch y can be formulated as

    min_α (1/2)‖y − D_l α‖₂² + λ‖α‖₁,    (1)

with the HR patch recovered as x = D_h α.
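For concreteness, problem (1) can be solved with the iterative shrinkage-thresholding algorithm (ISTA). The following is a minimal numpy sketch; the function name and hyperparameters are illustrative, not a reference implementation:

```python
import numpy as np

def ista(y, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 with ISTA.

    y : (n,) signal (e.g. a vectorized LR patch); D : (n, k) dictionary.
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = a + (D.T @ (y - D @ a)) / L    # gradient step on the data term
        a = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft threshold
    return a
```

Given coupled dictionaries, the HR patch estimate would then be obtained by multiplying the resulting code with the HR dictionary, x = D_h @ ista(y, D_l).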
In order to account for the high computational cost of numerical algorithms, the seminal work on learned sparse coding translated a proximal algorithm, namely ISTA, into a neural network form referred to as LISTA. Each layer of LISTA implements an iteration of ISTA according to

    α^(t+1) = φ_θ(W y + S α^(t)),    (2)

where W ∈ ℝ^(k×n), S ∈ ℝ^(k×k), and θ are learnable parameters, and φ_θ is the soft thresholding operator expressed by the component-wise shrinkage function

    φ_θ(u_i) = sign(u_i) max(|u_i| − θ, 0),

which acts as a nonlinear activation function. W and S are initialized as W = (1/L) D_lᵀ and S = I − (1/L) D_lᵀD_l, with L the Lipschitz constant of the gradient of the data-fidelity term and θ = λ/L. The network is depicted in Fig. 1.
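The unrolled computation can be sketched in a few lines of numpy. The initialization below mirrors the ISTA-based one described above; with these (untrained) parameters, the unrolled network exactly reproduces a fixed number of ISTA iterations, which training then refines:

```python
import numpy as np

def soft(u, theta):
    """Component-wise soft thresholding, the LISTA activation."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lista_forward(y, W, S, theta, n_layers=3):
    """Unrolled LISTA: z <- soft(W y + S z) for a fixed number of layers.

    W (k, n), S (k, k), and theta are the learnable parameters; fixed here."""
    z = soft(W @ y, theta)                 # first layer (starting from z = 0)
    for _ in range(n_layers - 1):
        z = soft(W @ y + S @ z, theta)
    return z

def lista_init(D, lam):
    """ISTA-based initialization: W = D^T / L, S = I - D^T D / L, theta = lam/L."""
    L = np.linalg.norm(D, 2) ** 2
    return D.T / L, np.eye(D.shape[1]) - (D.T @ D) / L, lam / L
```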
II-B Sparse Coding with Side Information via Deep Learning
The deep unfolding idea introduced with LISTA, also explored in [24, 3, 2], established a new methodology for the design of neural networks, enabling the network structure to incorporate domain knowledge about the addressed problem. Following similar principles, we recently proposed a deep learning architecture to solve the sparse approximation problem with the aid of side information. Our approach relies on a proximal algorithm for ℓ1-ℓ1 minimization. Specifically, suppose that we want to find a sparse representation α of a signal y w.r.t. an over-complete dictionary D, given a side information signal w that is correlated with α. Then, we solve the ℓ1-ℓ1 minimization problem

    min_α (1/2)‖y − Dα‖₂² + λ(‖α‖₁ + ‖α − w‖₁)    (3)

with a proximal method that iterates

    α^(t+1) = ξ_μ(α^(t) + (1/L) Dᵀ(y − Dα^(t)); w),    (4)

where L is the Lipschitz constant of the gradient of the data-fidelity term and μ = λ/L. ξ_μ is the proximal operator of μ(‖α‖₁ + ‖α − w‖₁), applied component-wise as follows.

For w_i ≥ 0:

    ξ_μ(u_i; w_i) = u_i + 2μ,  if u_i < −2μ;
                    0,          if −2μ ≤ u_i ≤ 0;
                    u_i,        if 0 < u_i < w_i;
                    w_i,        if w_i ≤ u_i ≤ w_i + 2μ;
                    u_i − 2μ,   if u_i > w_i + 2μ.    (5)

For w_i < 0:

    ξ_μ(u_i; w_i) = u_i + 2μ,  if u_i < w_i − 2μ;
                    w_i,        if w_i − 2μ ≤ u_i ≤ w_i;
                    u_i,        if w_i < u_i < 0;
                    0,          if 0 ≤ u_i ≤ 2μ;
                    u_i − 2μ,   if u_i > 2μ.    (6)

By setting W = (1/L) Dᵀ and S = I − (1/L) DᵀD, (4) takes the form

    α^(t+1) = ξ_μ(W y + S α^(t); w).    (7)
A feed-forward neural network performing operations according to (7) can learn sparse codes with the aid of side information; W, S, and μ are parameters learned from data. Compared to LISTA, the network, which we call Learned Side-Information-driven iterative soft Thresholding Algorithm (LeSITA), incorporates a new activation function that integrates the side information signal into the sparse representation learning process.
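For illustration, the component-wise proximal operator of μ(|z| + |z − w_i|) can be implemented compactly: it shrinks the input by 2μ toward the interval between 0 and the side information value, with flat thresholding zones around both points. A minimal numpy sketch (our notation, not reference code):

```python
import numpy as np

def lesita_prox(u, w, mu):
    """Element-wise proximal operator of z -> mu * (|z| + |z - w|).

    u : input vector, w : side information vector, mu : threshold.
    For w = 0 this reduces to soft thresholding with threshold 2*mu.
    """
    u, w = np.asarray(u, float), np.asarray(w, float)
    lo, hi = np.minimum(0.0, w), np.maximum(0.0, w)
    # Outside [lo - 2mu, hi + 2mu]: shrink by 2mu; near the breakpoints 0
    # and w: clamp; inside [lo, hi]: identity.
    return np.where(u > hi + 2 * mu, u - 2 * mu,
           np.where(u > hi, hi,
           np.where(u >= lo, u,
           np.where(u >= lo - 2 * mu, lo, u + 2 * mu))))
```

One LeSITA layer then applies this operator to W y + S α, with the side information vector produced by the guidance branch.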
III The Proposed Method
III-A Multimodal Image SR with Sparse Priors
The problem of multimodal image super-resolution concerns the reconstruction of an HR image from an LR observation, given an HR reference or guidance image from another modality. In this work, we utilize the reference image as side information and leverage our previous work on LeSITA to build a deep network performing multimodal image SR.
We follow the sparsity-based model of Section II-A and assume that a (vectorized) patch y from the bicubic-upscaled LR image and the corresponding patch x from the HR image share the same sparse representation α under over-complete dictionaries D_l and D_h, respectively. If the images of the target and the guidance modalities are highly correlated, we can also assume that the reference patch has a sparse representation w under a third dictionary, with w similar to α, for example, in terms of the ℓ1 norm of their difference. Then, the multimodal image super-resolution problem can be formulated as an ℓ1-ℓ1 minimization problem of the form (3).
By employing LeSITA to solve (3), we can design an end-to-end multimodal deep learning architecture that performs super-resolution of the input LR image with the aid of a reference HR image. The network architecture incorporates sparse priors and exploits the correlation between the two available modalities. The proposed framework is presented next.
III-B DMSC: Deep Multimodal Sparse Coding Network
LeSITA can learn sparse codes of a target image modality using side information from another correlated image modality. The side information needs to be a sparse signal similar to the target sparse code. To obtain sparse codes of the guidance modality, our architecture also includes a LISTA subnetwork. The proposed core model consists of the following three components: (i) a LeSITA encoder that computes a sparse representation of an LR image patch of the target modality using side information, (ii) a LISTA subnetwork that produces a sparse representation of the available HR patch from the guidance image, and (iii) a linear decoder that reconstructs the HR image patch of the main modality using the sparse representation obtained from LeSITA.
LeSITA computes a sparse representation of the LR patch according to (7). LISTA accepts as input the reference patch and performs a nonlinear transformation to produce a side information signal for LeSITA. Given the sparse representation α produced by LeSITA, the HR patch can be recovered by a linear decoder according to x̂ = Dα, where D is a learnable dictionary. By training the network end-to-end, an LR/HR transformation that relies on the joint representations provided by LeSITA can be learned.
The proposed core model successfully performs multimodal image SR at the patch level. Nevertheless, our goal is to design a network that accepts the entire LR and guidance images as input and outputs the entire HR image. To this end, we add three more layers to the network as follows. A convolutional layer is added before the LeSITA encoder to extract feature vectors from overlapping patches of the LR image. A similar layer is added before the LISTA branch to ensure that the side information patches stay aligned with the LR patches. Finally, a convolutional layer is added after the decoder to aggregate the reconstructed HR patches and form the super-resolved image. We present this model in Fig. 3 and refer to it as the Deep Multimodal Sparse Coding network (DMSC).
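The patch-level computation of the core model can be summarized with the following illustrative sketch, where randomly initialized matrices stand in for the trained weights and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 25, 64          # patch and code dimensions (illustrative sizes)

def soft(u, theta):
    """Soft thresholding, the LISTA activation."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lesita_prox(u, w, mu):
    """Element-wise prox of mu*(|z| + |z - w|): shrinkage guided by w."""
    lo, hi = np.minimum(0.0, w), np.maximum(0.0, w)
    return np.where(u > hi + 2 * mu, u - 2 * mu,
           np.where(u > hi, hi,
           np.where(u >= lo, u,
           np.where(u >= lo - 2 * mu, lo, u + 2 * mu))))

# Random parameters stand in for the trained weights of each branch.
W_g, S_g, theta = rng.normal(size=(k, n)) * 0.1, rng.normal(size=(k, k)) * 0.1, 0.05
W_t, S_t, mu    = rng.normal(size=(k, n)) * 0.1, rng.normal(size=(k, k)) * 0.1, 0.05
D = rng.normal(size=(n, k)) * 0.1          # learnable decoder dictionary

def dmsc_patch(y_lr, z_ref, n_layers=3):
    """Core DMSC model on one vectorized patch pair.

    (i)  a LISTA branch encodes the HR guidance patch into a sparse code w,
    (ii) a LeSITA branch encodes the LR patch using w as side information,
    (iii) a linear decoder maps the LeSITA code to the HR patch estimate.
    """
    w = soft(W_g @ z_ref, theta)                       # (i) guidance encoder
    for _ in range(n_layers - 1):
        w = soft(W_g @ z_ref + S_g @ w, theta)
    a = lesita_prox(W_t @ y_lr, w, mu)                 # (ii) target encoder
    for _ in range(n_layers - 1):
        a = lesita_prox(W_t @ y_lr + S_t @ a, w, mu)
    return D @ a                                       # (iii) linear decoder

x_hat = dmsc_patch(rng.normal(size=n), rng.normal(size=n))
```

In the full network, convolutional patch extractors and an aggregation layer wrap this computation so that entire images are processed.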
III-C Deep Multimodal Image SR Network
The DMSC network presented in Section III-B learns joint representations of three different image modalities, that is, the input LR image, the guidance image, and the HR target image. Learning representations that encode only the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between the LR and HR versions of the target modality; in other words, the encoding performed by the LISTA branch may transfer unrelated information to the LeSITA encoder. As a result, the latent representation of the target modality may fail to capture the underlying mapping from the LR space to the HR space.
As the performance of the model relies on the learned LR/HR transformation of the target modality, we present an architecture equipped with an uncoupling component that focuses on learning the LR/HR transformation without using side information. The proposed framework consists of two subnetworks: (i) a DMSC subnetwork that fuses the information of the different image modalities, and (ii) a subnetwork that enhances the LR/HR transformation. The second is realized by a LISTA encoder followed by a linear decoder, and includes convolutional layers to operate on the entire image. The proposed deep multimodal framework is depicted in Fig. 4. The network is trained using objective (8).
IV Experiments

In this section, we first report the implementation details of the proposed DMSC model and its extension. Then, we present experimental results on multimodal image SR.
The convolutional patch-extraction layers are realized with linear filters that extract feature vectors from corresponding patches of the LR image and the HR side information. Each feature vector is then processed by the LeSITA or LISTA encoder to produce a sparse representation of the LR input or the side information, respectively. The patch extractor of the enhancement subnetwork is realized in a similar way. Each of the aggregation layers contains one filter that builds up the super-resolved NIR image from the reconstructed patches, and the linear decoder layer is realized by a linear filter that recovers the HR patches. We note that all convolutional layers use padding such that the networks preserve the spatial size of the input. We use learnable scalars for the thresholding parameters θ and μ of the LISTA and LeSITA encoders, initialized to a fixed value.
We provide experimental results for the DMSC model and its extension, and compare their performance with existing single-modal and multimodal SR methods. In our experiments, we employ the EPFL RGB-NIR dataset (https://ivrl.epfl.ch/supplementary_material/cvpr11/). The dataset includes spatially aligned RGB and near-infrared (NIR) image pairs capturing the same scenes across different landscape categories. Taking into account the high cost per pixel of NIR cameras, we want to super-resolve LR NIR images using the corresponding HR RGB image of the same scene as side information. A preprocessing step involves upscaling the NIR LR image to the desired resolution using bicubic interpolation, which results in a blurry image given as input to the model. We convert the RGB images to YCbCr and use only the luminance channel as side information. The training set consists of NIR/RGB pairs; due to memory and computational limitations, we train our models on cropped image patches. We reserve a set of image pairs for testing (a test image is identified by a letter, "u" or "o", referring to the folders urban and oldbuilding in the dataset, and a code "00xx").
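For reference, the luminance channel used as side information can be extracted with the standard ITU-R BT.601 luma weights (a minimal sketch; in practice any standard RGB-to-YCbCr conversion can be used):

```python
import numpy as np

def luminance(rgb):
    """Return the Y (luma) channel of an RGB image with values in [0, 1],
    using ITU-R BT.601 weights as in the RGB -> YCbCr conversion.

    rgb has shape (H, W, 3); the result has shape (H, W)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights
```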
We train the proposed models for three SR scales by minimizing objective (8) with the Adam optimizer. We compare the proposed models with (i) a coupled dictionary learning method (CDLSR), (ii) the deep joint image filtering (DJF) method, (iii) an approximate convolutional sparse coding network (ACSC), and (iv) a cascaded sparse coding network (CSCN). CDLSR and DJF perform multimodal image SR using sparse coding and convolutional neural networks, respectively; ACSC and CSCN are unimodal networks and do not use information from another image modality. Results in terms of Peak Signal-to-Noise Ratio (PSNR) for the different scales are presented in Tables I, II, and III. As can be seen, the DMSC model achieves state-of-the-art performance at most scaling factors: for two of the three upscaling factors, the average gains over the best competing method are 0.78 dB and 0.07 dB, respectively, while for the remaining scale the average PSNR is 0.19 dB below the best previous method. The extended model always outperforms existing techniques in terms of average PSNR and exhibits the best values for most of the test images at all scales. A visual example presented in Fig. 5 corroborates our numerical results.
V Conclusion

We developed two novel deep multimodal models for the super-resolution of an LR image of a target modality with the aid of an HR image from another modality. The proposed design relies on the unfolding of an iterative algorithm for sparse approximation with side information. The architecture incorporates sparse priors and effectively integrates the available side information. We applied the proposed models to super-resolve NIR images using RGB images as side information, and compared them with existing single-modal and multimodal designs, demonstrating their superior performance.
References

- Optimization with Sparsity-Inducing Penalties. Foundations and Trends in Machine Learning 4(1), pp. 1–106.
- (2018) Deep unfolding of a proximal interior point method for image restoration. CoRR abs/1812.04276.
- (2017) AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing 65(16), pp. 4293–4308.
- (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57.
- (2010) Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML).
- Learning dynamic guidance for depth image enhancement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Robust guided image filtering using nonconvex potentials. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1), pp. 192–207.
- (2018) Image super-resolution via dual-state recurrent networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Deep back-projection networks for super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(6), pp. 1397–1409.
- (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2007) Joint bilateral upsampling. ACM Transactions on Graphics 26(3).
- (2017) Deep Laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Deep joint image filtering. In European Conference on Computer Vision (ECCV), Vol. 9908.
- (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
- (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25(7), pp. 3194–3207.
- (2010) Super-resolution with sparse mixing estimators. IEEE Transactions on Image Processing 19(11), pp. 2889–2900.
- (2014) Compressed sensing with side information: geometrical interpretation and performance bounds. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 512–516.
- (2017) Compressed sensing with prior information: strategies, geometry, and bounds. IEEE Transactions on Information Theory 63, pp. 4472–4496.
- (2019) Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries. IEEE Transactions on Computational Imaging.
- (2018) Learned convolutional sparse coding. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2191–2195.
- (2010) Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE 98(6), pp. 948–958.
- (2019) Learning fast sparse representations with the aid of side information. Technical report. [Online] Available: https://bit.ly/2WIJB5z
- (2016) Maximal sparsity with deep networks?. In Advances in Neural Information Processing Systems (NIPS), pp. 4340–4348.
- (2012) Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8), pp. 3467–3478.
- (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19, pp. 2861–2873.