Image super-resolution (SR) is a well-known inverse problem in imaging, referring to the reconstruction of a high-resolution (HR) image from a low-resolution (LR) observation [44, 34]. The problem is ill-posed as there is no unique mapping from the LR to the HR image. Practical applications such as medical imaging and remote sensing often involve different image modalities capturing the same scene; therefore, another approach in imaging is the joint use of multiple image modalities. The problem of multimodal or guided image super-resolution refers to the reconstruction of an HR image from an LR observation using a guidance image from another modality, also referred to as side information.
Several image processing methods use prior knowledge about the image such as sparse structure [62, 61, 52, 53, 64, 22, 15] or statistical image priors [26, 49]. Deep learning has been widely used in inverse problems, often outperforming analytical methods [34, 43]. For instance, in single image SR, Convolutional Neural Networks (CNNs) have led to impressive results reported in [10, 11, 24, 25, 36, 3]. Residual learning  enabled the training of very deep neural networks (DNNs), with the models proposed in [30, 68, 51, 50, 69] achieving state-of-the-art performance.
Capturing the correlation among different image modalities has been addressed with sparsity-based analytical models and coupled dictionary learning [56, 71, 33, 23, 4, 1, 6, 7, 47]. The main drawback of these approaches is the high computational cost of iterative algorithms for sparse approximation, which has been addressed by multimodal deep learning methods [42, 28, 59]. A common approach in multimodal neural network design is the fusion of the input modalities at a shared latent layer, obtained as the concatenation of the latent representations of each modality. The CNN model proposed in  for multimodal depth upsampling follows this principle. Nevertheless, current multimodal DNNs are black-box models in the sense that we lack a principled approach to design such models for leveraging the signal structure and properties of the correlation across modalities.
A recent line of research in deep learning for inverse problems considers deep unfolding [13, 21, 60, 2, 65, 48], that is, the unfolding of an iterative algorithm into the form of a DNN. Inspired by numerical algorithms for sparse coding, deep unfolding designs have been utilized in several imaging problems to incorporate sparse priors into the solution. Results for denoising , compressive imaging , and image SR  have shown that incorporating domain knowledge into the network architecture can improve the performance substantially. Still, these methods focus on single-modal data, thereby lacking a principled way to incorporate knowledge from different imaging modalities. To the best of our knowledge, the only deep unfolding designs for guided image SR have been presented in  and , that build upon existing unfolding architectures for learned sparse coding  and learned convolutional sparse coding , respectively.
In this paper, we address the problem of guided image SR with a novel multimodal deep unfolding architecture, which is inspired by a proximal algorithm for convolutional sparse coding with side information. The proximal algorithm is translated into a neural network form coined Learned Multimodal Convolutional Sparse Coding (LMCSC); the network incorporates sparse priors and enables efficient integration of the guidance modality into the solution. While in existing multimodal deep learning methods [42, 28, 59], it is difficult to understand what the model has learned, our deep neural network is interpretable, in the sense that the model performs steps similar to an iterative algorithm.
The proposed approach builds upon our previous research on multimodal deep unfolding [55, 38], and preliminary results of LMCSC can be found in , where the recovery of HR near infrared (NIR) images based on LR NIR observations with the aid of RGB images was addressed. In this paper, we integrate LMCSC into different neural network architectures, and present experiments on various multimodal datasets, showing the superior performance of the proposed approach against several single- and multimodal SR methods. Our contribution is as follows:
We formulate the problem of convolutional sparse coding with side information, and propose a proximal algorithm for its solution.
Inspired by the proposed proximal algorithm, we design a deep unfolding neural network for fast computation of convolutional sparse codes with the aid of side information.
The deep unfolding operator is used as a core component in a novel multimodal framework for guided image SR that fuses information from two image modalities. Furthermore, we exploit residual learning and introduce skip connections in the proposed framework, obtaining an alternative design that can be trained more efficiently.
We test our models on several benchmark multimodal datasets, including NIR/RGB, multi-spectral/RGB, depth/RGB, and compare them against various state-of-the-art single-modal and multimodal models. The numerical results show a PSNR gain of up to dB over the coupled ISTA method in .
The rest of the paper is organized as follows: Section II reviews related work and Section III provides the necessary background. Section IV presents the proposed core deep unfolding architecture for convolutional sparse coding with side information, and our designs for multimodal image SR are presented in Section V, followed by experimental results in Section VI. Section VII concludes the paper.
Throughout the paper, all vectors are denoted by boldface lower case letters while lower case letters are used for scalars. We utilize boldface upper case letters to show matrices and boldface upper case letters in math calligraphy to indicate tensors. Moreover, in this paper, the terms upscaling factor and scale are used interchangeably.
II Related Work
II-1 Single Image Super-Resolution
A first category of single image SR methods includes interpolation-based methods [49, 45, 70, 31]. These methods are simple and fast; however, aliasing and blurring effects prevent them from producing HR images of fine quality. A second category includes reconstruction methods [63, 12, 35], which use several image priors to regularize the ill-posed reconstruction problem and result in images with fine texture details. Nevertheless, modelling the complex context of natural images with image priors is not an easy task. A third popular category consists of learning-based methods [62, 61, 52, 53, 64, 22, 10, 11, 24, 25, 36, 3, 30, 68, 51, 50, 69], which use machine learning techniques to learn the complex mapping between LR and HR images from data.
Among learning-based methods, deep learning models have drawn considerable attention as they achieve excellent restoration quality. SRCNN  was the first deep learning method for image SR. The model has a simple structure and can directly learn an end-to-end mapping between the LR/HR images. An accelerated version of  was presented in . Increasing the depth of CNN architectures causes several training difficulties, which have been mitigated with residual learning. Examples of very deep residual networks for SR include a -layer CNN proposed in [24, 25], a -layer convolutional autoencoder proposed in , and a -layer CNN proposed in . Residual learning has also been employed to learn inter-layer dependencies in [50, 69]. An improved residual design obtained by removing unnecessary residual modules was proposed in . Recurrent neural networks (RNNs) have also been used for image SR in [29, 17]. The network in  implements a feedback mechanism that carries high-level information back to previous layers, refining low-level encoded information. Following , the authors of  presented an alternative structure with two states (RNN layers) that operate at different spatial resolutions, providing information flow from LR to HR encodings.
Deep unfolding has been applied to single image SR in  where the authors designed a neural network that computes latent representations of the LR/HR image using LISTA , a neural network that performs steps similar to the Iterative Soft Thresholding Algorithm (ISTA)  (see also Section III).
II-2 Multimodal Image Super-Resolution
A common approach in multimodal image restoration is the joint or guided filtering approach, that is, the design of a filter that leverages the guidance image as a prior and transfers structural details from the guidance to the target image. Several joint filtering techniques have been proposed in [27, 18, 16]. Nevertheless, when the local structures in the guidance and target images are not consistent, these techniques may transfer incorrect content to the target image. The approach presented in  concerns the design of an explicit mapping that captures the structural discrepancy between images from different modalities.
Model-based techniques and joint filtering methods are limited in characterizing the complex dependency between different modalities. Learning based methods aim to learn this dependency from data. In a depth upsampling method presented in , a weighted analysis representation is used to model the complex relationship between depth and RGB images; the model parameters are learned with a task driven training strategy. Another learning based approach relies on sparse modelling and involves coupled-dictionary learning [56, 71, 33, 23, 4, 1, 6, 7]. Most of these works assume that there is a mapping between the sparse representation of one modality to the sparse representation of another modality. The authors of  consider both similarities and disparities between different modalities under the sparse representation invariance assumption.
Purely data-driven solutions for multimodal image SR are provided by multimodal deep learning approaches. Examples include the model presented in  implementing a CNN based joint image filter, and the work presented in , which is a deep learning reformulation of the widely used guided image filter proposed in .
Deep unfolding has also been applied to multimodal image SR in , where a coupled ISTA network is presented. The network accepts as input an LR image from the target modality and an HR image from the guidance modality. Two LISTA branches are employed to compute latent representations of the input images. The estimation of the target HR image is obtained as a linear combination of these representations. A similar approach is proposed in , which employs three convolutional LISTA networks to separate the common information shared between modalities from the unique information belonging to each modality. The output is then computed as a combination of these common and unique feature maps after applying the corresponding dictionaries to them.
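III Background
Image super-resolution can be addressed as a linear inverse problem formulated as follows :
$$\mathbf{y} = \mathbf{L}\mathbf{x} + \boldsymbol{\eta}, \tag{1}$$
where $\mathbf{x} \in \mathbb{R}^{k}$ is a vectorized form of the unknown HR image, and $\mathbf{y} \in \mathbb{R}^{n}$ denotes the LR observations contaminated with noise $\boldsymbol{\eta} \in \mathbb{R}^{n}$. The linear operator $\mathbf{L} \in \mathbb{R}^{n \times k}$, $n < k$, describes the observation mechanism, which can be expressed as the product of a downsampling operator $\mathbf{E}$ and a blurring filter $\mathbf{H}$ . Problem (1) appears in many imaging applications including image restoration and inpainting [44, 34].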
III-A Image Super-Resolution via Sparse Approximation
Even when the linear observation operator $\mathbf{L}$ is given, problem (1) is ill-posed and requires additional regularization for its solution. Sparsity has been widely used as a regularizer leading to the well-known sparse approximation problem . Instead of directly solving for $\mathbf{x}$, in this paper, we rely on a sparse modelling approach presented in . According to , an $n$-dimensional (vectorized) patch $\mathbf{y}$ from a bicubic-upscaled LR image and the corresponding patch $\mathbf{x}$ from the respective HR image can be expressed by joint sparse representations. By jointly learning two dictionaries $\mathbf{D}_y \in \mathbb{R}^{n \times m}$, $\mathbf{D}_x \in \mathbb{R}^{n \times m}$, $n \leq m$, for the low- and the high-resolution image patches, respectively, we can enforce the similarity of sparse representations of patch pairs such that $\mathbf{y} = \mathbf{D}_y \boldsymbol{\alpha}$ and $\mathbf{x} = \mathbf{D}_x \boldsymbol{\alpha}$, $\boldsymbol{\alpha} \in \mathbb{R}^{m}$. Then, computing the HR patch $\mathbf{x}$ is equivalent to finding the sparse representation $\boldsymbol{\alpha}$ of the LR patch $\mathbf{y}$, by solving
$$\min_{\boldsymbol{\alpha}} \frac{1}{2} \|\mathbf{D}_y \boldsymbol{\alpha} - \mathbf{y}\|_2^2 + \lambda \|\boldsymbol{\alpha}\|_1, \tag{2}$$
where $\lambda$ is a regularization parameter, and $\|\cdot\|_1$ is the $\ell_1$-norm, which promotes sparsity. Several methods have been proposed for solving (2) including pivoting algorithms, interior-point methods and gradient based methods .
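As an illustration of a gradient-based solver for a problem of the form (2), the following is a minimal NumPy sketch of ISTA. The function names, the random dictionary, and the value of `lam` are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, gamma):
    # Elementwise soft thresholding: sign(v) * max(|v| - gamma, 0).
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def objective(D, y, alpha, lam):
    # Value of 0.5 * ||D alpha - y||_2^2 + lam * ||alpha||_1.
    return 0.5 * np.sum((D @ alpha - y) ** 2) + lam * np.sum(np.abs(alpha))

def ista(D, y, lam, n_iter=200):
    # L is the Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - y)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

# Toy problem (hypothetical data): a 3-sparse code under a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
y = D @ alpha_true
alpha_hat = ista(D, y, lam=0.05)
```

With step size 1/L the objective does not increase from one iteration to the next, which is why a fixed iteration count is a reasonable stopping rule in a sketch like this.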
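Sparse modelling techniques that involve computations applied to independent image patches do not take into account the consistency of pixels in overlapping patches . Convolutional Sparse Coding (CSC)  is an alternative approach, which can be directly applied to the entire image. Denoting with $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}$ the image of interest, the convolutional sparse codes are obtained by solving the following problem:
$$\min_{\mathbf{U}_i} \frac{1}{2} \Big\|\mathbf{X} - \sum_{i=1}^{k} \mathbf{D}_i * \mathbf{U}_i\Big\|_F^2 + \lambda \sum_{i=1}^{k} \|\mathbf{U}_i\|_1, \tag{3}$$
where $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{D}_i \in \mathbb{R}^{p_1 \times p_2}$, $i = 1, \dots, k$, are the atoms of a convolutional dictionary $\mathcal{D} \in \mathbb{R}^{p_1 \times p_2 \times k}$, and $\mathbf{U}_i \in \mathbb{R}^{n_1 \times n_2}$, $i = 1, \dots, k$, are the sparse feature maps with respect to $\mathcal{D}$. The $\ell_1$-norm computes the sum of absolute values of the elements in $\mathbf{U}_i$ (as if $\mathbf{U}_i$ is unrolled as a vector). Efficient solutions of (3) are presented in [58, 20].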
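According to recent studies , the accuracy of sparse approximation problems can be improved if a signal $\boldsymbol{\omega}$ correlated with the target signal $\mathbf{x}$ is available; we refer to $\boldsymbol{\omega}$ as side information (SI). Assume that $\mathbf{y}$ and $\boldsymbol{\omega}$ have similar sparse representations $\boldsymbol{\alpha} \in \mathbb{R}^{m}$, $\mathbf{s} \in \mathbb{R}^{m}$, under dictionaries $\mathbf{D}_y$ and $\mathbf{D}_\omega$, respectively. Then the sparse representation $\boldsymbol{\alpha}$ can be obtained as the solution of the $\ell_1$-$\ell_1$ minimization problem :
$$\min_{\boldsymbol{\alpha}} \frac{1}{2} \|\mathbf{D}_y \boldsymbol{\alpha} - \mathbf{y}\|_2^2 + \lambda \big(\|\boldsymbol{\alpha}\|_1 + \|\boldsymbol{\alpha} - \mathbf{s}\|_1\big). \tag{4}$$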
III-B Deep Unfolding
Analytical approaches for sparse approximation are usually equipped with theoretical guarantees; however, their major drawback is their high computational complexity. In some applications, the deployed dictionaries also need to be learned, increasing the computational burden . The authors of  address this problem by a neural network design that performs operations similar to the Iterative Soft Thresholding Algorithm (ISTA)  proposed for the solution of (2). The learning process results in a trained version of ISTA, coined LISTA. The $t$-th layer of LISTA computes:
$$\boldsymbol{\alpha}^{t} = \phi_{\gamma}\big(\mathbf{S}\boldsymbol{\alpha}^{t-1} + \mathbf{W}\mathbf{y}\big), \tag{5}$$
where
$$\phi_{\gamma}(v_i) = \mathrm{sign}(v_i)\max\{0, |v_i| - \gamma\}, \quad i = 1, \dots, m, \tag{6}$$
is the soft thresholding operator; $\mathbf{S}$, $\mathbf{W}$ and $\gamma$ are parameters, which are fixed in ISTA, while LISTA learns them from data. As a result, LISTA achieves high accuracy in only a few iterations. The technique known as deep unfolding was also explored in [21, 60, 2, 65]; a convolutional LISTA design for CSC was presented in .
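The relation between ISTA and a LISTA-style layer can be sketched as follows. This is an illustrative NumPy forward pass only, with no training; the names `S`, `W`, and `gamma` follow the usual LISTA notation, and the ISTA-based initialization shown here is a common choice that is assumed for the example rather than taken from the paper.

```python
import numpy as np

def soft_threshold(v, gamma):
    # Elementwise soft thresholding: sign(v) * max(|v| - gamma, 0).
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def lista_forward(y, S_list, W_list, gamma_list):
    # One (S, W, gamma) triple per layer; in LISTA these would be learned.
    alpha = np.zeros(S_list[0].shape[0])
    for S, W, gamma in zip(S_list, W_list, gamma_list):
        alpha = soft_threshold(S @ alpha + W @ y, gamma)
    return alpha

def ista_initialization(D, lam, n_layers):
    # Parameters for which each layer reproduces exactly one ISTA iteration.
    L = np.linalg.norm(D, 2) ** 2
    S = np.eye(D.shape[1]) - D.T @ D / L
    W = D.T / L
    return [S] * n_layers, [W] * n_layers, [lam / L] * n_layers

# Hypothetical toy data.
rng = np.random.default_rng(1)
D = rng.standard_normal((10, 30))
y = rng.standard_normal(10)
S_list, W_list, gamma_list = ista_initialization(D, lam=0.1, n_layers=3)
alpha_3 = lista_forward(y, S_list, W_list, gamma_list)
```

With these fixed parameters the three-layer network equals three ISTA iterations; learning the parameters from data is what allows a few layers to reach the accuracy of many iterations.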
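The aforementioned deep unfolding studies deal with single-modal data. A deep unfolding design that incorporates side information coming from another modality was first presented in our previous work . The model proposed in  relies on a proximal method for the solution of (4), which iterates over
$$\boldsymbol{\alpha}^{t} = \xi_{\mu}\Big(\boldsymbol{\alpha}^{t-1} - \tfrac{1}{L}\mathbf{D}_y^{T}\big(\mathbf{D}_y\boldsymbol{\alpha}^{t-1} - \mathbf{y}\big); \mathbf{s}\Big), \tag{7}$$
with $\mu$, $L$ appropriate parameters. The proximal operator $\xi_{\mu}$ incorporates the side information $\mathbf{s}$ and is expressed as follows:
For $s_i \geq 0$, $i = 1, \dots, m$:
$$\xi_{\mu}(v_i; s_i) = \begin{cases} v_i + 2\mu, & v_i < -2\mu \\ 0, & -2\mu \leq v_i \leq 0 \\ v_i, & 0 < v_i < s_i \\ s_i, & s_i \leq v_i \leq s_i + 2\mu \\ v_i - 2\mu, & v_i > s_i + 2\mu \end{cases} \tag{8}$$
For $s_i < 0$, $i = 1, \dots, m$:
$$\xi_{\mu}(v_i; s_i) = \begin{cases} v_i + 2\mu, & v_i < s_i - 2\mu \\ s_i, & s_i - 2\mu \leq v_i \leq s_i \\ v_i, & s_i < v_i < 0 \\ 0, & 0 \leq v_i \leq 2\mu \\ v_i - 2\mu, & v_i > 2\mu \end{cases} \tag{9}$$
By writing the proximal algorithm in the form
$$\boldsymbol{\alpha}^{t} = \xi_{\mu}\Big(\big(\mathbf{I} - \tfrac{1}{L}\mathbf{D}_y^{T}\mathbf{D}_y\big)\boldsymbol{\alpha}^{t-1} + \tfrac{1}{L}\mathbf{D}_y^{T}\mathbf{y}; \mathbf{s}\Big), \tag{10}$$
and translating (10) into a deep neural network, we obtain a fast multimodal operator referred to as Learned Side-Information-driven iterative soft Thresholding Algorithm (LeSITA). LeSITA has a similar expression to LISTA (5); however, (10) employs the new activation function $\xi_{\mu}$ that integrates side information into the learning process.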
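To make the side-information-driven activation concrete, the sketch below implements the elementwise proximal operator of $\mu(|u| + |u - s|)$ in NumPy, treating negative side information by symmetry. The function name `prox_l1_l1` and the test values are assumptions for illustration; the piecewise rule is the one that follows from the optimality conditions of the scalar problem.

```python
import numpy as np

def prox_l1_l1(v, s, mu):
    """Elementwise argmin_u 0.5*(u - v)^2 + mu*(|u| + |u - s|)."""
    v = np.asarray(v, dtype=float)
    s = np.asarray(s, dtype=float)
    # Reduce the case s < 0 to s >= 0 via xi(v; s) = -xi(-v; -s).
    sgn = np.where(s >= 0, 1.0, -1.0)
    vv, ss = sgn * v, np.abs(s)
    out = np.where(vv < -2 * mu, vv + 2 * mu,       # shrink towards zero
          np.where(vv <= 0, 0.0,                    # dead zone at 0
          np.where(vv < ss, vv,                     # pass-through between 0 and s
          np.where(vv <= ss + 2 * mu, ss,           # dead zone at s
                   vv - 2 * mu))))                  # shrink towards s
    return sgn * out

# A few hand-checked values with mu = 0.5 and s = 2:
examples = prox_l1_l1([-2.0, -0.5, 1.0, 2.5, 4.0], 2.0, 0.5)
# -> [-1.0, 0.0, 1.0, 2.0, 3.0]
```

Compared with soft thresholding, the operator has a second flat region located at the side information value, so entries of the input that lie close to $s_i$ are snapped onto it. For $s_i = 0$ it reduces to soft thresholding with threshold $2\mu$.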
IV Design Multimodal Convolutional Networks with Deep Unfolding
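In what follows, we consider that, besides the observations $\mathbf{Y}$ of the target signal, another image modality $\boldsymbol{\Omega}$, correlated with $\mathbf{Y}$, is available. We assume that the two image modalities can be represented by convolutional sparse codes that are similar by means of the $\ell_1$-norm. Specifically, let $\mathbf{Y} = \sum_{i=1}^{k} \mathbf{D}_i^{Y} * \mathbf{U}_i$ be a sparse representation of the observed image with respect to a convolutional dictionary $\mathcal{D}^{Y}$; $\mathbf{D}_i^{Y}$, $i = 1, \dots, k$, denote the atoms of $\mathcal{D}^{Y}$. By employing a convolutional dictionary $\mathcal{D}^{\Omega}$ with atoms $\mathbf{D}_i^{\Omega}$, $i = 1, \dots, k$, the guidance image can be expressed as $\boldsymbol{\Omega} = \sum_{i=1}^{k} \mathbf{D}_i^{\Omega} * \mathbf{Z}_i$ with the convolutional sparse codes $\mathbf{Z}_i$, $i = 1, \dots, k$, obtained as the solution of (3). Then, we can compute the unknown sparse codes of the target modality by solving a problem formulated in a way similar to (4), that is,
$$\min_{\mathbf{U}_i} \frac{1}{2} \Big\|\mathbf{Y} - \sum_{i=1}^{k} \mathbf{D}_i^{Y} * \mathbf{U}_i\Big\|_F^2 + \lambda \Big(\sum_{i=1}^{k} \|\mathbf{U}_i\|_1 + \sum_{i=1}^{k} \|\mathbf{U}_i - \mathbf{Z}_i\|_1\Big). \tag{11}$$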
There is a correspondence between convolutional and linear sparse codes. If we replace the convolutional dictionary $\mathcal{D}^{Y}$ with a matrix with Toeplitz structure, and take into account the linear properties of convolution, then (11) reduces to (4). Specifically, we define $\mathbf{D}_y$ as a sparse dictionary obtained by concatenating the Toeplitz matrices that unroll the $\mathbf{D}_i^{Y}$'s; $\boldsymbol{\alpha}$ and $\mathbf{s}$ take the form of vectorized sparse feature maps of the target and the side information images, respectively. Then, by replacing the convolutional operations in (11) with multiplications, we obtain (4), and the proximal algorithm (7) can be employed to compute convolutional sparse codes.
Nevertheless, transforming (11) to (4) and using (7) for its solution is not computationally efficient. Since CSC deals with the entire image, the dimensionality of (4) becomes too high and the proximal method becomes impractical. We use the correspondence between linear and convolutional representations, and formulate an iterative algorithm that performs convolutions as follows: In the proximal algorithm (7), the matrices $\mathbf{D}_y$ and $\mathbf{D}_y^{T}$, which take the form of concatenated Toeplitz matrices in the convolutional case, are replaced by the convolutional dictionaries $\mathcal{B}$ and $\tilde{\mathcal{B}}$, respectively. Then, by replacing multiplications with convolutional operations, we can compute the convolutional codes of the target image, given the convolutional codes of the guidance image, by iterating over:
$$\mathcal{U}^{t} = \xi_{\mu}\big(\mathcal{U}^{t-1} - \tilde{\mathcal{B}} * \mathcal{B} * \mathcal{U}^{t-1} + \tilde{\mathcal{B}} * \mathbf{Y}; \mathcal{Z}\big), \tag{12}$$
where $\mathcal{U}$, $\mathcal{Z}$ are tensors of size $n_1 \times n_2 \times k$.
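Equation (12) can be translated into a deep convolutional neural network (CNN). Each stage of the network computes the sparse feature maps according to
$$\mathcal{U}^{t} = \xi_{\mu}\big(\mathcal{U}^{t-1} - \mathcal{Q} * \mathcal{R} * \mathcal{U}^{t-1} + \mathcal{P} * \mathbf{Y}; \mathcal{Z}\big), \tag{13}$$
with $\mathcal{Q} \in \mathbb{R}^{p_1 \times p_2 \times c \times k}$, $\mathcal{R} \in \mathbb{R}^{p_1 \times p_2 \times k \times c}$, $\mathcal{P} \in \mathbb{R}^{p_1 \times p_2 \times c \times k}$ learnable convolutional layers and $\mu > 0$ a learnable parameter; $c$ is the number of channels of the employed images. The proposed network architecture, depicted in Fig. 1, is referred to as Learned Multimodal Convolutional Sparse Coding (LMCSC). The network can be trained in a supervised manner to map an input image to sparse feature maps. During training, the parameters $\mathcal{Q}$, $\mathcal{R}$, $\mathcal{P}$ and $\mu$ are learned; therefore, the deep LMCSC can achieve high accuracy with only a fraction of computations of the proximal method.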
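LMCSC uses the convolutional sparse codes $\mathcal{Z}$ of the guidance modality $\boldsymbol{\Omega}$ to compute the convolutional sparse codes $\mathcal{U}$ of the target modality $\mathbf{Y}$. An efficient multimodal convolutional operator should integrate a fast operator for the encoding of the guidance modality. In the models presented next, we obtain $\mathcal{Z}$ using the ACSC operator presented in . ACSC has the form of convolutional LISTA, with the $t$-th layer computing:
$$\mathcal{Z}^{t} = \phi_{\gamma}\big(\mathcal{Z}^{t-1} - \mathcal{T} * \mathcal{V} * \mathcal{Z}^{t-1} + \mathcal{G} * \boldsymbol{\Omega}\big), \tag{14}$$
where $\mathcal{T}$, $\mathcal{V}$, $\mathcal{G}$ are learnable convolutional layers and $\gamma > 0$ is a learnable threshold.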
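A single unfolding stage of this kind can be sketched in plain NumPy for a single-channel image. Everything named here (`correlate_same`, `analysis`, `synthesis`, `lmcsc_stage`, the filter banks `Q`, `R`, `P`, and the random data) is hypothetical and chosen for illustration; a practical implementation would use a deep learning framework with trainable, untied filters per stage.

```python
import numpy as np

def prox_l1_l1(v, s, mu):
    # Elementwise prox of mu*(|u| + |u - s|); negative s handled by symmetry.
    sgn = np.where(s >= 0, 1.0, -1.0)
    vv, ss = sgn * v, np.abs(s)
    out = np.where(vv < -2 * mu, vv + 2 * mu,
          np.where(vv <= 0, 0.0,
          np.where(vv < ss, vv,
          np.where(vv <= ss + 2 * mu, ss, vv - 2 * mu))))
    return sgn * out

def correlate_same(x, h):
    # Zero-padded 2-D cross-correlation; h must have odd side lengths.
    # For learned filters, correlation and convolution are interchangeable.
    p1, p2 = h.shape
    xp = np.pad(x, ((p1 // 2, p1 // 2), (p2 // 2, p2 // 2)))
    out = np.zeros(x.shape, dtype=float)
    for a in range(p1):
        for b in range(p2):
            out += h[a, b] * xp[a:a + x.shape[0], b:b + x.shape[1]]
    return out

def analysis(filters, image):
    # Image (n1, n2) -> k feature maps (k, n1, n2).
    return np.stack([correlate_same(image, f) for f in filters])

def synthesis(filters, maps):
    # k feature maps (k, n1, n2) -> image (n1, n2).
    return sum(correlate_same(m, f) for m, f in zip(maps, filters))

def lmcsc_stage(U, Y, Z, Q, R, P, mu):
    # U_new = prox(U - Q * (R * U) + P * Y ; Z)
    pre = U - analysis(Q, synthesis(R, U)) + analysis(P, Y)
    return prox_l1_l1(pre, Z, mu)

# Hypothetical toy data: k = 4 feature maps on a 16 x 16 image.
rng = np.random.default_rng(0)
k = 4
Y = rng.standard_normal((16, 16))
Z = rng.standard_normal((k, 16, 16))      # side-information codes
Q, R, P = (0.1 * rng.standard_normal((k, 3, 3)) for _ in range(3))
U = np.zeros((k, 16, 16))
for _ in range(3):                        # weights are tied here only for brevity
    U = lmcsc_stage(U, Y, Z, Q, R, P, mu=0.05)
```

The side-information codes enter every stage through the activation, so the two modalities are fused repeatedly rather than once at the output.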
V Deep Multimodal Image Super-Resolution
The proposed LMCSC architecture can be employed to perform multimodal image super-resolution based on a sparsity-driven convolutional model. The proposed model follows principles similar to those of . Specifically, the sparse linear modelling of LR/HR image patches presented in  is replaced by a sparse convolutional modelling of the entire LR/HR images, followed by similarity assumptions between the convolutional representations of the LR and HR images. To efficiently integrate information from a second image modality, we also assume that the target and the guidance image modalities are similar by means of the $\ell_1$-norm in the representation domain.
Our first model proposed for multimodal image super-resolution relies on the following assumption: The LR observation $\mathbf{Y}$ and the HR image $\mathbf{X}$ share the same convolutional sparse feature maps $\mathbf{U}_i$, under different convolutional dictionaries $\mathcal{D}^{Y}$ and $\mathcal{D}^{X}$, that is, $\mathbf{Y} = \sum_{i=1}^{k} \mathbf{D}_i^{Y} * \mathbf{U}_i$, $\mathbf{X} = \sum_{i=1}^{k} \mathbf{D}_i^{X} * \mathbf{U}_i$, where $\mathbf{D}_i^{Y}$, $\mathbf{D}_i^{X}$ are the atoms of the respective convolutional dictionaries. Given $\mathcal{D}^{Y}$, $\mathcal{D}^{X}$, finding a mapping from $\mathbf{Y}$ to $\mathbf{X}$ is equivalent to computing the sparse feature maps $\mathbf{U}_i$ of the observed LR image $\mathbf{Y}$. The similarity assumption between the target and the guidance image modalities in the representation domain implies that the convolutional sparse codes $\mathbf{Z}_i$ of the guidance HR image $\boldsymbol{\Omega}$ are similar to $\mathbf{U}_i$ by means of the $\ell_1$-norm. Therefore, $\mathbf{U}_i$ can be obtained as the solution of the $\ell_1$-$\ell_1$ minimization problem (11). Based on these assumptions, we build our first model for multimodal image SR using LMCSC as a core component of a deep architecture.
The proposed model, coined LMCSC-Net, consists of three subnetworks: (i) an LMCSC encoder that produces convolutional latent representations of the input LR image with the aid of side information, (ii) a side information encoder that produces latent representations of the guidance HR image, and (iii) a convolutional decoder that computes the target HR image. The goal of the LMCSC encoder is to learn a convolutional sparse feature map $\mathcal{U}$ of the LR image $\mathbf{Y}$, also shared by the HR image $\mathbf{X}$, using a convolutional sparse feature map $\mathcal{Z}$ as side information, akin to the model presented in Section IV. The LMCSC branch is followed by a convolutional decoder realized by a learnable convolutional dictionary $\mathcal{D}^{X}$. The decoder receives the latent representations $\mathcal{U}$ provided by LMCSC and estimates $\mathbf{X}$ according to $\hat{\mathbf{X}} = \mathcal{D}^{X} * \mathcal{U}$.
The entire network, depicted in Fig. 3, is trained end-to-end using the mean square error (MSE) loss function:
$$\min_{\Theta} \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2, \tag{15}$$
where $\Theta$ denotes the set of all network parameters, $\mathbf{X}$ is the ground-truth image of the target modality, and $\hat{\mathbf{X}}$ is the estimation computed by the network.
Different from our previous multimodal image SR design  which relies on LeSITA , LMCSC-Net has a novel convolutional structure inspired by a different proximal algorithm. The core LMCSC component computes latent representations of the target modality using side information from the guidance modality, performing fusion of information at every layer. Therefore, our approach is different from coupled ISTA  which employs one branch of LISTA  for each modality and fuses the latent representations only in the last layer.
The model presented in Section V-A learns similar sparse representations of three different image modalities, that is, the input LR image $\mathbf{Y}$, the guidance modality $\boldsymbol{\Omega}$ and the HR image $\mathbf{X}$. Learning representations that mainly encode the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between $\mathbf{Y}$ and $\mathbf{X}$. In other words, the encoding performed by the ACSC branch may result in transferring unrelated information to the LMCSC encoder. As a result, the latent representation of the target modality may not capture the underlying mapping between the LR and HR images in the representation domain. Furthermore, assuming identical latent representations for both LR and HR images limits the performance of the network, especially when the degradation level of the LR observations is high.
In order to address the aforementioned problems, we relax the assumption concerning the similarity between the LR and HR images in the representation domain. Specifically, we assume that the convolutional sparse codes $\mathcal{H}$ of the HR image $\mathbf{X}$ can be obtained as a non-linear transformation of the respective codes $\mathcal{U}$ of the LR image $\mathbf{Y}$, that is, $\mathcal{H} = F_{\theta}(\mathcal{U})$, where $F_{\theta}(\cdot)$ is a non-linear function parameterized by $\theta$.
Under this assumption, we build the proposed multimodal SR framework by employing the following components: (i) An LMCSC subnetwork is used to fuse the information from the LR observations and the guidance HR image, providing a first estimation of the target HR image with the aid of side information. (ii) An ACSC subnetwork following the LMCSC subnetwork is used to enhance the transformation between the LR and HR images of the target modality without using side information. The architecture of the proposed model, referred to as LMCSC-Net, is depicted in Fig. 4. The additional ACSC and dictionary layers in LMCSC-Net implement the non-linear transformation $F_{\theta}(\cdot)$. The network is trained using the objective (15).
V-C LMCSC-Net with Skip Connections
Considering the significant improvement achieved by residual learning in the training efficiency and the prediction accuracy [19, 30, 68, 51, 50, 69], we enhance the proposed LMCSC-Net with a skip connection, introducing a new model coined LMCSC-ResNet. For the design of LMCSC-ResNet we rely on the assumption that the HR image contains all the low-frequency information from the LR image plus some high-frequency details that can be captured by a non-linear mapping between LR and HR images. By using an identity mapping of the input, the capacity of the network can be assigned to learning the high frequency details, since the low-frequency information is provided by . LMCSC-ResNet learns the non-linear mapping , where . In LMCSC-ResNet, is obtained from the LMCSC-Net. The model architecture is presented in Fig. 5. This model is also trained end-to-end using the objective (15).
[Table: comparison by scale of CSCN, ACSC, EDSR, SRFBN, DJF, CoISTA, LMCSC-Net, LMCSC-Net, and LMCSC-ResNet; numerical entries not available.]
We apply the proposed models to different upsampling tasks, that is, super-resolution of near-infrared (NIR) images, depth upsampling and super-resolution of multi-spectral data, using RGB images as side information. We compare our models against state-of-the-art single-modal and multimodal methods showing the superior performance of the proposed approach. Before demonstrating our experimental results, we present the employed datasets and report implementation details.
VI-A1 EPFL RGB-NIR dataset
NIR images are acquired at a low resolution due to the high cost per pixel of a NIR sensor compared to an RGB sensor. We employ the EPFL RGB-NIR dataset (https://ivrl.epfl.ch/supplementary_material/cvpr11/) and apply our models to super-resolve an LR NIR image with the aid of an HR RGB image. The dataset contains spatially aligned NIR/RGB image pairs. Our training set contains approximately cropped image pairs extracted from images. Each training image is of size pixels; the size is chosen with respect to memory requirements and computational complexity. We also create a testing set containing image pairs; testing is performed on an entire image.
The NIR images consist of one channel. An LR version of a NIR image is generated by blurring and downscaling the ground truth HR version. We convert the RGB images to YCbCr and only utilize the luminance channel as the side information. Following , we apply bicubic interpolation as a preprocessing step to upscale the LR input such that the input and output images are of the same size.
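The preprocessing described above can be sketched as follows. The luminance weights are the standard BT.601 coefficients used in the YCbCr conversion; the box-filter averaging in `make_lr` is only an assumed stand-in for the blurring and downscaling step, whose exact kernel is not specified here, and the final bicubic upscaling of the LR input is left out because it requires an image library. Both function names are hypothetical.

```python
import numpy as np

def rgb_to_luminance(rgb):
    # rgb: float array of shape (H, W, 3) with values in [0, 1].
    # Y channel of YCbCr using the BT.601 weights.
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights

def make_lr(hr, scale):
    # Assumed degradation: average over scale x scale blocks, which acts as
    # a box blur followed by decimation. Crops to a multiple of `scale`.
    h, w = hr.shape
    h, w = h - h % scale, w - w % scale
    blocks = hr[:h, :w].reshape(h // scale, scale, w // scale, scale)
    return blocks.mean(axis=(1, 3))

# Hypothetical data standing in for an aligned NIR/RGB pair.
rng = np.random.default_rng(0)
rgb = rng.random((64, 48, 3))
nir_hr = rng.random((64, 48))
guide = rgb_to_luminance(rgb)        # side information, shape (64, 48)
nir_lr = make_lr(nir_hr, scale=4)    # LR target, shape (16, 12)
```

After this step the LR image would be interpolated back to the HR grid so that the target input and the guidance image have the same size.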
VI-A2 NYU v2 RGB-D dataset
Depth cameras like Microsoft Kinect and time-of-flight (ToF) cameras only provide low-resolution depth images. Therefore, depth upsampling is a necessary task for many vision applications. We apply our models for depth upsampling with the aid of RGB images, using the NYU v2 RGB-D dataset . The dataset contains RGB images with their depth maps. Similar to , we use the first images for training, and the remaining for testing.
VI-A3 Columbia multi-spectral database
The third dataset that we use to evaluate the proposed models is the Columbia multi-spectral database (http://www.cs.columbia.edu/CAVE/databases/multispectral), which contains spectral reflectance data and RGB images. For testing, we reserve images from the nm band and randomly select images from different bands; the rest are used for training.
VI-B Implementation Details
All networks are designed with three unfolding steps for the target (LMCSC) and the side information (ACSC) encoders. The number of unfolding steps is chosen after taking into account the trade-off between the computational complexity and the reconstruction accuracy; for instance, by increasing the unfolding steps to five, the improvement in the average PSNR is less than dB while the execution time is almost higher. In the LMCSC-ResNet, the ACSC branch employed for the nonlinear mapping of the target signal is designed with one unfolding step.
We empirically set the size of the network parameters , , and to ; the size of and are set to . The size of the convolutional dictionaries for reconstruction is . Note that a convolutional layer of size consists of convolutional filters with kernel size and channels. We use untied weights at every unfolding step, i.e., the -th layer of LMCSC and ACSC subnetworks is realized by the independent variables , and ,. The parameters and of the proximal operators are both initialized to . We train the network using the Adam optimizer.
We notice that the complexity of our networks is dominated by the LeSITA activation layers in the LMCSC block, and an implementation based on (8) and (9) is not efficient. In order to address this issue, we rewrite the proximal operator in (8), (9) as follows:
where% faster implementation and we use this version in all of the experiments.
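One way to avoid branching in the activation is to express the piecewise proximal operator with elementwise ReLUs. The sketch below shows such a formulation and checks it against a direct piecewise implementation; it is an illustrative reformulation under the assumption that the operator is the prox of $\mu(|u| + |u - s|)$, and it may differ from the exact expression used by the authors. The function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prox_piecewise(v, s, mu):
    # Reference implementation with explicit case distinctions.
    sgn = np.where(s >= 0, 1.0, -1.0)
    vv, ss = sgn * v, np.abs(s)
    out = np.where(vv < -2 * mu, vv + 2 * mu,
          np.where(vv <= 0, 0.0,
          np.where(vv < ss, vv,
          np.where(vv <= ss + 2 * mu, ss, vv - 2 * mu))))
    return sgn * out

def prox_relu(v, s, mu):
    # Same operator as a sum of four shifted ReLUs (after the sign flip):
    #   -relu(-v - 2mu) + relu(v) - relu(v - |s|) + relu(v - |s| - 2mu)
    sgn = np.where(s >= 0, 1.0, -1.0)
    vv, ss = sgn * v, np.abs(s)
    out = (-relu(-vv - 2 * mu) + relu(vv)
           - relu(vv - ss) + relu(vv - ss - 2 * mu))
    return sgn * out

# Hypothetical random inputs for a consistency check.
rng = np.random.default_rng(0)
v = 3.0 * rng.standard_normal(10000)
s = 2.0 * rng.standard_normal(10000)
max_gap = np.max(np.abs(prox_relu(v, s, 0.3) - prox_piecewise(v, s, 0.3)))
```

Because each term is a standard activation applied to a shifted input, this form maps directly onto the elementwise primitives that deep learning frameworks already optimize.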
VI-C Performance Comparison
[Table: comparison of GF, JBF, SDF, DJF, Gu et al., LMCSC-Net, LMCSC-Net, and LMCSC-ResNet; numerical entries not available.]
[Table: per-image results for Chart toy, Egyptian, Feathers, Glass tiles, Jelly beans, Oil Paintings, Paints, and Average; numerical entries not available.]