I Introduction
Image super-resolution (SR) refers to the recovery of a high-resolution (HR) image from its low-resolution (LR) version. The problem is severely ill-posed and a common approach for its solution considers the use of sparse priors [26, 17, 25]. For example, the method presented in [26] is based on the assumption that the LR and HR images have joint sparse representations with respect to some dictionaries. Nevertheless, sparsity-based methods result in complex optimization problems, which is a significant drawback in large-scale settings.
Accounting for the high computational cost of numerical optimization algorithms, deep neural networks have been successfully applied to image SR, achieving state-of-the-art performance [8, 9, 11, 13, 15]. Deep learning methods rely on large datasets to learn a nonlinear transformation between the LR and HR image spaces. These methods are efficient at inference because they shift the computational load to the training phase. However, most existing deep models do not integrate domain knowledge about the problem and cannot provide theoretical justifications for their effectiveness. A different approach was followed in the recent work of [16], which relies on a deep unfolding architecture referred to as LISTA [5]. LISTA introduced the idea of translating an iterative numerical algorithm for sparse approximation into a feedforward neural network. By integrating LISTA into their network architecture, the authors of [16] managed to incorporate sparse priors into the deep learning solution.

In many image processing and machine vision applications, a reference HR image from a second modality is often available [10, 14]. The recovery of an HR image from its LR variant with the aid of another HR image from a different image modality is referred to as multimodal image SR [14, 20]. Several studies have investigated sparse representation models as well as deep learning methods for multimodal image SR [20, 14, 12, 10, 7, 6].
In this paper, we propose a deep network architecture that incorporates sparse priors and effectively utilizes information from another image modality to perform multimodal image SR. Inspired by [16], the proposed deep learning model relies on a deep unfolding method for sparse approximation with side information. Our contributions are threefold:

We address multimodal image SR as a problem of sparse approximation with side information and formulate an appropriate $\ell_1$-$\ell_1$ minimization problem for its solution.

We design a core neural network model for patch-based image SR, employing a recently proposed deep sparse approximation model [23], referred to as LeSITA, to integrate information from another HR image modality in the solution.

We propose two novel deep neural network architectures for multimodal image SR that employ the LeSITA-based SR model, achieving superior performance against state-of-the-art methods.
The proposed models are used to super-resolve near-infrared (NIR) LR images given RGB HR images. The performance of the proposed models is demonstrated by experimental results.
II Background and Related Work
II-A Single Image SR with Sparse Priors
Let $Y$ denote an LR image obtained from an HR image $X$. According to [26], the transformation of an HR image to an LR image can be modeled as a blurring and downscaling process expressed by $Y = SHX$, where $H$ and $S$ denote blurring and downscaling operators, respectively. Under this assumption, an $n$-dimensional (vectorized) patch $y \in \mathbb{R}^n$ from the bicubic-upscaled LR image exhibits a common sparse representation with the corresponding patch $x \in \mathbb{R}^n$ from the HR image w.r.t. different overcomplete dictionaries $D_l, D_h \in \mathbb{R}^{n \times k}$; that is, $y = D_l \alpha$ and $x = D_h \alpha$, where $\alpha \in \mathbb{R}^k$ is sparse. $D_l$, $D_h$ can be jointly learned with coupled dictionary learning techniques [26]. Therefore, the problem of computing the HR patch given the LR patch can be formulated as

$$\min_{\alpha} \ \frac{1}{2}\|y - D_l \alpha\|_2^2 + \lambda \|\alpha\|_1, \qquad (1)$$

where $\lambda > 0$ is a regularization parameter, and $\|\cdot\|_1$ is the $\ell_1$ norm, which promotes sparsity. Several numerical optimization methods have been proposed for the solution of (1) [22].
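As a concrete reference, the proximal-gradient (iterative soft thresholding) approach discussed below can solve (1) directly. The following is a minimal NumPy sketch, with a small random dictionary standing in for a learned $D_l$ and illustrative sizes; it is not the implementation used in the paper.

```python
import numpy as np

def soft_threshold(u, theta):
    """Componentwise shrinkage: sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def sparse_code(D, y, lam, n_iter=2000):
    """Proximal-gradient (iterative soft thresholding) solver for problem (1)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_threshold(alpha + D.T @ (y - D @ alpha) / L, lam / L)
    return alpha

# Toy example: a "patch" y synthesized from 2 of 20 unit-norm atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
alpha_true = np.zeros(20)
alpha_true[[3, 11]] = [1.0, -0.5]
y = D @ alpha_true
alpha_hat = sparse_code(D, y, lam=1e-3)        # approximately recovers alpha_true
```

The per-iteration cost is dominated by two matrix-vector products with $D$, which motivates the learned fast approximations discussed next.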
In order to account for the high computational cost of numerical algorithms, the seminal work in [5] translated a proximal algorithm, namely ISTA [4], into a neural network form referred to as LISTA. Each layer of LISTA implements an iteration of ISTA according to

$$\alpha^{t+1} = \phi_\theta(W y + S \alpha^t), \qquad (2)$$

where $W \in \mathbb{R}^{k \times n}$, $S \in \mathbb{R}^{k \times k}$, and $\theta$ are learnable parameters, and $\phi_\theta$ is the soft thresholding operator [4] expressed by the componentwise shrinkage function $\phi_\theta(u_i) = \mathrm{sign}(u_i)\max\{|u_i| - \theta, 0\}$, which acts as a nonlinear activation function.
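A few stacked layers of the form (2) can be sketched in NumPy as follows. Here $W$, $S$, and $\theta$ are set by an ISTA-style initialization purely for illustration; in LISTA they are learned from data by backpropagation, and the sizes and seed below are assumptions.

```python
import numpy as np

def soft_threshold(u, theta):
    """Activation phi_theta(u) = sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lista_forward(y, W, S, theta, n_layers):
    """n_layers unfolded iterations: alpha <- phi_theta(W y + S alpha)."""
    b = W @ y                                   # the input term W y is shared by all layers
    alpha = soft_threshold(b, theta)            # first layer (alpha^0 = 0)
    for _ in range(n_layers - 1):
        alpha = soft_threshold(b + S @ alpha, theta)
    return alpha

# ISTA-like initialization of the (in LISTA, learnable) parameters.
rng = np.random.default_rng(1)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
L = np.linalg.norm(D, 2) ** 2
W = D.T / L
S = np.eye(20) - D.T @ D / L
alpha = lista_forward(rng.standard_normal(10), W, S, theta=0.01, n_layers=3)
```

With this initialization, running the layers is exactly equivalent to a few ISTA iterations; training then tunes $W$, $S$, $\theta$ so that far fewer layers reach the same accuracy.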
$W$ and $S$ are initialized as $W = \frac{1}{L} D_l^T$ and $S = I - \frac{1}{L} D_l^T D_l$, where $L$ is the Lipschitz constant used in ISTA. The network is depicted in Fig. 1.

II-B Sparse Coding with Side Information via Deep Learning
The deep unfolding idea presented in [5], also explored in [24, 3, 2], introduced a new methodology for the design of neural networks, enabling the network structure to incorporate domain knowledge about the addressed problem. Following similar principles, we recently proposed a deep learning architecture to solve the sparse approximation problem with the aid of side information [23]. Our approach relies on a proximal algorithm for $\ell_1$-$\ell_1$ minimization. Specifically, suppose that we want to find a sparse approximation $\alpha \in \mathbb{R}^k$ of a signal $y \in \mathbb{R}^n$ w.r.t. an overcomplete dictionary $D \in \mathbb{R}^{n \times k}$, given a side information signal $s \in \mathbb{R}^k$ that is correlated to $\alpha$. Then, by solving the $\ell_1$-$\ell_1$ minimization problem

$$\min_{\alpha} \ \frac{1}{2}\|y - D\alpha\|_2^2 + \lambda \big(\|\alpha\|_1 + \|\alpha - s\|_1\big), \qquad (3)$$

we obtain a solution that is of higher accuracy compared to (1), as long as certain conditions concerning the similarity between $\alpha$ and $s$ hold [19, 18].
A proximal algorithm that solves (3) performs the following iterations [23, 1]:

$$\alpha^{t+1} = \xi_\mu\Big(\alpha^t + \frac{1}{L} D^T (y - D\alpha^t);\, s\Big), \qquad (4)$$

where $L$ is the Lipschitz constant of the gradient of $\frac{1}{2}\|y - D\alpha\|_2^2$ and $\mu = \lambda/L$. $\xi_\mu(\cdot\,; s)$ is the proximal operator of $\mu(\|\alpha\|_1 + \|\alpha - s\|_1)$, applied componentwise as follows:

for $s_i \geq 0$:

$$\xi_\mu(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < -2\mu, \\ 0, & -2\mu \leq u_i \leq 0, \\ u_i, & 0 < u_i < s_i, \\ s_i, & s_i \leq u_i \leq s_i + 2\mu, \\ u_i - 2\mu, & u_i > s_i + 2\mu; \end{cases} \qquad (5)$$

for $s_i < 0$:

$$\xi_\mu(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < s_i - 2\mu, \\ s_i, & s_i - 2\mu \leq u_i \leq s_i, \\ u_i, & s_i < u_i < 0, \\ 0, & 0 \leq u_i \leq 2\mu, \\ u_i - 2\mu, & u_i > 2\mu. \end{cases} \qquad (6)$$
By setting $W = \frac{1}{L} D^T$ and $S = I - \frac{1}{L} D^T D$, (4) takes the form:

$$\alpha^{t+1} = \xi_\mu(W y + S \alpha^t;\, s). \qquad (7)$$

A feedforward neural network performing operations according to (7) can learn sparse codes with the aid of side information; $W$, $S$, and $\mu$ are parameters learned from data. Compared to LISTA [5], the network in [23]—which we call Learned Side-Information-driven iterative soft Thresholding Algorithm (LeSITA)—incorporates a new activation function integrating the side information signal into the sparse representation learning process.
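The operator in (5)-(6) and the unfolded iteration (7) can be sketched together as follows. This is an illustrative NumPy implementation with untrained, ISTA-style parameters and assumed sizes, not the trained network of [23]; the case $s_i < 0$ is handled via the symmetry $\xi_\mu(u; s) = -\xi_\mu(-u; -s)$.

```python
import numpy as np

def lesita_prox(u, s, mu):
    """Componentwise proximal operator xi_mu(u; s) of mu*(||a||_1 + ||a - s||_1)."""
    def branch(u, s):                 # the five cases of (5), assuming s >= 0
        return np.where(u > s + 2 * mu, u - 2 * mu,   # shrink from above
               np.where(u >= s, s,                    # clamp onto the side info
               np.where(u > 0, u,                     # pass through (0, s)
               np.where(u >= -2 * mu, 0.0,            # dead zone around 0
                        u + 2 * mu))))                # shrink from below
    u, s = np.asarray(u, float), np.asarray(s, float)
    # (6) follows from (5) by the symmetry xi(u; s) = -xi(-u; -s).
    return np.where(s >= 0, branch(u, s), -branch(-u, -s))

def lesita_forward(y, s, W, S, mu, n_layers):
    """Unfolded iterations alpha <- xi_mu(W y + S alpha; s), as in (7)."""
    b = W @ y
    alpha = lesita_prox(b, s, mu)
    for _ in range(n_layers - 1):
        alpha = lesita_prox(b + S @ alpha, s, mu)
    return alpha

# Illustrative usage with ISTA-style parameters and ideal side information.
rng = np.random.default_rng(2)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
L = np.linalg.norm(D, 2) ** 2
W, S = D.T / L, np.eye(20) - D.T @ D / L
alpha_true = np.zeros(20)
alpha_true[[3, 11]] = [1.0, -0.5]
alpha = lesita_forward(D @ alpha_true, alpha_true, W, S, mu=0.01, n_layers=5)
```

Note that when $s = 0$ the operator reduces to soft thresholding with threshold $2\mu$, so LeSITA degrades gracefully when the side information is uninformative.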
III The Proposed Method
III-A Multimodal Image SR with Sparse Priors
The problem of multimodal image super-resolution concerns the reconstruction of an HR image $X$ from an LR image $Y$, given an HR reference or guidance image $Z$ from another modality. In this work, we utilize the reference image as side information and leverage our previous work [23] to build a deep network performing multimodal image SR.
We follow the sparsity-based model presented in [26] and assume that a (vectorized) patch $y$ from the bicubic-upscaled LR image and the corresponding patch $x$ from the HR image share the same sparse representation $\alpha$ under overcomplete dictionaries $D_l$ and $D_h$, respectively. If the images of the target and the guidance modalities are highly correlated, we can also assume that the reference patch $z$ has a sparse representation $s$ under a dictionary $D_z$, with $s$ similar to $\alpha$, for example, in terms of the $\ell_1$ norm, i.e., $\|\alpha - s\|_1$ is small. Then, the multimodal image super-resolution problem can be formulated as an $\ell_1$-$\ell_1$ minimization problem of the form (3).
By employing LeSITA to solve (3), we can design an end-to-end multimodal deep learning architecture to perform super-resolution of the input LR image $Y$ with the aid of a reference HR image $Z$. The network architecture incorporates sparse priors and exploits the correlation between the two available modalities. The proposed framework is presented next.
III-B DMSC: Deep Multimodal Sparse Coding Network
LeSITA can learn sparse codes of a target image modality using side information from another correlated image modality. The side information needs to be a sparse signal similar to the target sparse code. To obtain sparse codes of the guidance modality, our architecture also includes a LISTA subnetwork. The proposed core model consists of the following three components: (i) a LeSITA encoder that computes a sparse representation of an LR image patch of the target modality using side information, (ii) a LISTA subnetwork that produces a sparse representation of the available HR patch from the guidance image, and (iii) a linear decoder that reconstructs the HR image patch of the main modality using the sparse representation obtained from LeSITA.
LeSITA computes a sparse representation $\alpha$ of the LR patch $y$ according to (7). LISTA accepts as input the reference patch $z$ and performs a nonlinear transformation of the form (2) to produce a side information signal $s$ for LeSITA. Given the sparse representation produced by LeSITA, the HR patch can be recovered by a linear decoder according to $\hat{x} = D_h \alpha$, where $D_h$ is a learnable dictionary. By training the network end-to-end, an LR/HR transformation that relies on the joint representations provided by LeSITA can be learned.
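Putting the three components together, the core patch-level forward pass can be sketched as below. The parameter shapes, random initialization, and layer counts are illustrative stand-ins for the trained layers, not the authors' configuration.

```python
import numpy as np

n, k = 64, 128          # assumed patch dimension and code size (illustrative)

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lesita_prox(u, s, mu):
    def branch(u, s):   # five-branch form of (5); (6) handled by symmetry
        return np.where(u > s + 2 * mu, u - 2 * mu,
               np.where(u >= s, s,
               np.where(u > 0, u,
               np.where(u >= -2 * mu, 0.0, u + 2 * mu))))
    return np.where(s >= 0, branch(u, s), -branch(-u, -s))

def unfolded_encoder(x, W, S, act, n_layers=3):
    """Shared skeleton of the LISTA and LeSITA encoders."""
    b = W @ x
    a = act(b)
    for _ in range(n_layers - 1):
        a = act(b + S @ a)
    return a

# Randomly initialized parameters stand in for trained ones.
rng = np.random.default_rng(3)
W_le, S_le = 0.1 * rng.standard_normal((k, n)), 0.01 * rng.standard_normal((k, k))
W_li, S_li = 0.1 * rng.standard_normal((k, n)), 0.01 * rng.standard_normal((k, k))
D_h = 0.1 * rng.standard_normal((n, k))       # learnable decoder dictionary

def dmsc_patch(y_lr, z_ref, theta=0.1, mu=0.1):
    # (ii) LISTA branch: sparse code of the guidance patch = side information s.
    s = unfolded_encoder(z_ref, W_li, S_li, lambda u: soft_threshold(u, theta))
    # (i) LeSITA encoder: sparse code of the LR patch, guided by s.
    a = unfolded_encoder(y_lr, W_le, S_le, lambda u: lesita_prox(u, s, mu))
    # (iii) Linear decoder: reconstruct the HR patch.
    return D_h @ a

x_hat = dmsc_patch(rng.standard_normal(n), rng.standard_normal(n))
```

In the actual model all of `W_le`, `S_le`, `W_li`, `S_li`, `D_h` and the thresholds are trained jointly end-to-end.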
The proposed core model successfully performs multimodal image SR at a patch level. Nevertheless, our goal is to design a network that accepts as input the entire images $Y$ and $Z$ and outputs the HR image $\hat{X}$. To this end, we add three more layers to the network as follows. A convolutional layer is added before the LeSITA encoder to extract vectorized patch features from the LR image. A similar layer is added before the LISTA branch to ensure that the side information patches stay aligned with the LR patches. Finally, a convolutional layer is added after the decoder to aggregate the reconstructed HR patches and form the super-resolved image $\hat{X}$. We present this model in Fig. 3 and refer to it as Deep Multimodal Sparse Coding network (DMSC).
The proposed network can be trained end-to-end using the mean square error (MSE) loss function:

$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \big\| X^{(i)} - \hat{X}^{(i)} \big\|_F^2, \qquad (8)$$

where $\Theta$ denotes the set of all network parameters, $X^{(i)}$ is the corresponding ground-truth HR image of the target modality, and $\hat{X}^{(i)}$ is the super-resolved estimation computed by the network.
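For reference, both the training loss (8) and the PSNR metric used for evaluation in Section IV reduce to the mean squared error. A minimal sketch follows; the peak value 255 is an assumption for 8-bit images.

```python
import numpy as np

def mse_loss(x_hat, x):
    """Empirical MSE between super-resolved and ground-truth images, as in (8)."""
    return float(np.mean((np.asarray(x_hat) - np.asarray(x)) ** 2))

def psnr(x_hat, x, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB for a given peak pixel value."""
    return 10.0 * np.log10(peak ** 2 / mse_loss(x_hat, x))

# A constant pixel error of 10 on an 8x8 image gives MSE = 100.
x = np.full((8, 8), 100.0)
x_hat = x + 10.0
```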
III-C Deep Multimodal Image SR Network
The DMSC network presented in Section III-B learns joint representations of three different image modalities, that is, the input LR image $Y$, the guidance modality $Z$, and the HR image $X$. Learning representations that encode only the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between the LR and HR versions of the target modality. In other words, the encoding performed by the LISTA branch may result in transferring unrelated information to the LeSITA encoder. As a result, the latent representation of the target modality may not capture the underlying mapping from the LR space to the HR space.
As the performance of the model relies on the learned LR/HR transformation of the target modality, we present an architecture equipped with an uncoupling component that focuses on learning the LR/HR transformation without using side information. The proposed framework consists of two different subnetworks: (i) a DMSC subnetwork performing fusion of the information of the different image modalities, and (ii) a subnetwork for the enhancement of the LR/HR transformation. The second is realized by a LISTA encoder followed by a linear decoder and includes convolutional layers to operate on the entire image. The proposed deep multimodal framework is depicted in Fig. 4. The network is trained using objective (8).
Table I: PSNR (dB) results for scale ×2.

Image     CSCN    ACSC    DJF     DMSC    Proposed (III-C)
u-0006    39.47   39.78   41.52   41.79   43.21
u-0017    36.76   36.64   38.65   39.39   40.41
o-0018    33.98   34.26   34.78   36.02   37.90
u-0026    32.94   33.11   33.15   33.98   34.96
o-0030    33.34   33.32   35.67   36.32   37.73
u-0050    33.31   33.39   32.60   33.10   33.78
Average   34.97   35.09   36.07   36.85   37.99

Table II: PSNR (dB) results for scale ×4.

Image     CSCN    ACSC    DJF     CDLSR   DMSC    Proposed (III-C)
u-0006    32.60   32.61   36.04   36.79   37.24   37.82
u-0017    31.68   31.66   34.18   35.27   35.04   35.75
o-0018    27.28   27.42   30.72   33.01   32.30   32.91
u-0026    27.91   27.92   29.21   30.35   30.12   30.40
o-0030    27.72   27.66   31.27   32.71   32.30   32.66
u-0050    28.20   27.80   28.58   29.37   29.39   29.64
Average   29.24   29.18   31.67   32.92   32.73   33.19

Table III: PSNR (dB) results for scale ×6.

Image     CSCN    ACSC    DJF     CDLSR   DMSC    Proposed (III-C)
u-0006    29.94   29.97   34.92   34.15   35.43   35.74
u-0017    29.53   29.48   32.80   32.98   33.09   33.55
o-0018    24.57   24.70   29.92   31.03   30.44   31.34
u-0026    25.79   25.97   28.38   28.88   28.77   29.01
o-0030    25.86   25.91   30.00   30.52   30.37   30.61
u-0050    26.71   26.43   27.64   28.37   28.27   28.45
Average   27.07   27.08   30.62   30.99   31.06   31.45
IV Experiments
In this section, we first report the implementation details of the two proposed models. Then, we present experimental results on multimodal image SR.
The convolutional patch-extraction layers are realized with linear filters that extract feature vectors from patches of the LR image and of the HR side information. Each feature vector is then processed by the corresponding LeSITA or LISTA encoder to produce a sparse representation of the LR input and of the side information, respectively. The patch-extraction layer of the enhancement subnetwork is realized in a similar way. Each of the aggregation layers contains one filter that builds up the super-resolved NIR image from the computed patches. The linear decoder layer recovers the HR patches. We note that all convolutional layers use padding such that the networks preserve the spatial size of the input. We use learnable scalars for the thresholding parameters of the proximal operators. We initialize the convolutional and linear filters using random weights drawn from a zero-mean Gaussian distribution.

We provide experimental results for the two proposed models and compare their performance with existing single-modal and multimodal SR methods. In our experiments, we employ the EPFL RGB-NIR dataset^{1} (https://ivrl.epfl.ch/supplementary_material/cvpr11/). The dataset includes spatially aligned RGB and near-infrared (NIR) image pairs capturing scenes of different landscapes. Taking into account the high cost per pixel of NIR cameras, we want to super-resolve LR NIR images using the corresponding HR RGB image of the same scene as side information. A preprocessing step involves upscaling the NIR LR image to the desired resolution using bicubic interpolation, which results in a blurry image given as input to the model. We convert the RGB images to YCbCr and use only the luminance channel as side information. The training set consists of NIR/RGB pairs; due to memory and computational limitations, we use image patches to train our models. We reserve the remaining image pairs for testing^{2} (a test image is identified by a letter "u" or "o", referring to the folders urban and oldbuilding in the dataset, and a code "00xx"). We train the proposed models for three SR scales, ×2, ×4 and ×6, by minimizing the objective (8) with the ADAM optimizer. We compare the proposed models with (i) a coupled dictionary learning (CDLSR) method [20], (ii) the deep joint image filtering (DJF) method [14], (iii) an approximate convolutional sparse coding network (ACSC) [21], and (iv) a cascaded sparse coding network (CSCN) [16].
CDLSR and DJF perform multimodal image SR using sparse coding and convolutional neural networks, respectively. ACSC and CSCN are unimodal neural networks and do not use information from another image modality. Results in terms of Peak Signal-to-Noise Ratio (PSNR) for the different scales are presented in Tables I, II and III. As can be seen, the DMSC model achieves state-of-the-art performance at most scaling factors. For instance, the average gains over the best competing method for the ×2 and ×6 upscaling factors are 0.78 dB and 0.07 dB, respectively. However, for scale ×4 the average PSNR is 0.19 dB lower than that of the best previous method. The extended model of Section III-C always outperforms existing techniques in terms of average PSNR and exhibits the best values for most of the testing images at all scales. A visual example presented in Fig. 5 corroborates our numerical results.

V Conclusions
We developed two novel deep multimodal models for the super-resolution of an LR image of a target modality with the aid of an HR image from another modality. The proposed design relies on the unfolding of an iterative algorithm for sparse approximation with side information. The architecture incorporates sparse priors and effectively integrates the available side information. We applied the proposed models to super-resolve NIR images using RGB images as side information. We compared our models with existing single-modal and multimodal designs, showing their superior performance.
References

[1] (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1), pp. 1–106.
[2] (2018) Deep unfolding of a proximal interior point method for image restoration. CoRR abs/1812.04276.
[3] (2017) AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing 65(16), pp. 4293–4308.
[4] (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57.
[5] (2010) Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML).
[6] (2017) Learning dynamic guidance for depth image enhancement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] (2018) Robust guided image filtering using nonconvex potentials. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1), pp. 192–207.
[8] (2018) Image super-resolution via dual-state recurrent networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] (2018) Deep back-projection networks for super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(6), pp. 1397–1409.
[11] (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] (2007) Joint bilateral upsampling. ACM Transactions on Graphics 26(3).
[13] (2017) Deep Laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2016) Deep joint image filtering. In European Conference on Computer Vision (ECCV).
[15] (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[16] (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25(7), pp. 3194–3207.
[17] (2010) Super-resolution with sparse mixing estimators. IEEE Transactions on Image Processing 19(11), pp. 2889–2900.
[18] (2014) Compressed sensing with side information: geometrical interpretation and performance bounds. In IEEE GlobalSIP, pp. 512–516.
[19] (2017) Compressed sensing with prior information: strategies, geometry, and bounds. IEEE Transactions on Information Theory 63, pp. 4472–4496.
[20] (2019) Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries. IEEE Transactions on Computational Imaging.
[21] (2018) Learned convolutional sparse coding. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2191–2195.
[22] (2010) Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE 98(6), pp. 948–958.
[23] (2019) Learning fast sparse representations with the aid of side information. Technical report. [Online]. Available: https://bit.ly/2WIJB5z.
[24] (2016) Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems (NIPS), pp. 4340–4348.
[25] (2012) Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8), pp. 3467–3478.
[26] (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19, pp. 2861–2873.