I Introduction
Image superresolution (SR) is a wellknown inverse problem in imaging, referring to the reconstruction of a highresolution (HR) image from a lowresolution (LR) observation [44, 34]. The problem is illposed as there is no unique mapping from the LR to the HR image. Practical applications such as medical imaging and remote sensing often involve different image modalities capturing the same scene, therefore, another approach in imaging is the joint use of multiple image modalities. The problem of multimodal or guided image superresolution refers to the reconstruction of an HR image from an LR observation using a guidance image from another modality, also referred to as side information.
Several image processing methods use prior knowledge about the image such as sparse structure [62, 61, 52, 53, 64, 22, 15] or statistical image priors [26, 49]. Deep learning has been widely used in inverse problems, often outperforming analytical methods [34, 43]. For instance, in single image SR, Convolutional Neural Networks (CNNs) have led to impressive results reported in [10, 11, 24, 25, 36, 3]. Residual learning [19] enabled the training of very deep neural networks (DNNs), with the models proposed in [30, 68, 51, 50, 69] achieving stateoftheart performance.
Capturing the correlation among different image modalities has been addressed with sparsitybased analytical models and coupled dictionary learning [56, 71, 33, 23, 4, 1, 6, 7, 47]. The main drawback of these approaches is the highcomputational cost of iterative algorithms for sparse approximation, which has been addressed by multimodal deep learning methods [42, 28, 59]. A common approach in multimodal neural network design is the fusion of the input modalities at a shared latent layer, obtained as the concatenation of the latent representations of each modality [42]. The CNN model proposed in [28] for multimodal depth upsampling follows this principle. Nevertheless, current multimodal DNNs are blackbox models in the sense that we lack a principled approach to design such models for leveraging the signal structure and properties of the correlation across modalities.
A recent line of research in deep learning for inverse problems considers deep unfolding [13, 21, 60, 2, 65, 48], that is, the unfolding of an iterative algorithm into the form of a DNN. Inspired by numerical algorithms for sparse coding, deep unfolding designs have been utilized in several imaging problems to incorporate sparse priors into the solution. Results for denoising [48], compressive imaging [67], and image SR [32] have shown that incorporating domain knowledge into the network architecture can improve the performance substantially. Still, these methods focus on singlemodal data, thereby lacking a principled way to incorporate knowledge from different imaging modalities. To the best of our knowledge, the only deep unfolding designs for guided image SR have been presented in [9] and [8], that build upon existing unfolding architectures for learned sparse coding [13] and learned convolutional sparse coding [48], respectively.
In this paper, we address the problem of guided image SR with a novel multimodal deep unfolding architecture, which is inspired by a proximal algorithm for convolutional sparse coding with side information. The proximal algorithm is translated into a neural network form coined Learned Multimodal Convolutional Sparse Coding (LMCSC); the network incorporates sparse priors and enables efficient integration of the guidance modality into the solution. While in existing multimodal deep learning methods [42, 28, 59], it is difficult to understand what the model has learned, our deep neural network is interpretable, in the sense that the model performs steps similar to an iterative algorithm.
The proposed approach builds upon our previous research on multimodal deep unfolding [55, 38], and preliminary results of LMCSC can be found in [37], where the recovery of HR near infrared (NIR) images based on LR NIR observations with the aid of RGB images was addressed. In this paper, we integrate LMCSC into different neural network architectures, and present experiments on various multimodal datasets, showing the superior performance of the proposed approach against several single and multimodal SR methods. Our contribution is as follows:
 (i)

We formulate the problem of convolutional sparse coding with side information, and propose a proximal algorithm for its solution.
 (ii)

Inspired by the proposed proximal algorithm, we design a deep unfolding neural network for fast computation of convolutional sparse codes with the aid of side information.
 (iii)

The deep unfolding operator is used as a core component in a novel multimodal framework for guided image SR that fuses information from two image modalities. Furthermore, we exploit residual learning and introduce skip connections in the proposed framework, obtaining an alternative design that can be trained more efficiently.
 (iv)

We test our models on several benchmark multimodal datasets, including NIR/RGB, multispectral/RGB, depth/RGB, and compare them against various stateoftheart singlemodal and multimodal models. The numerical results show a PSNR gain of up to dB over the coupled ISTA method in [9].
The rest of the paper is organized as follows: Section II reviews related work and Section III provides the necessary background. Section IV presents the proposed core deep unfolding architecture for convolutional sparse coding with side information, and our designs for multimodal image SR are presented in Section V, followed by experimental results in Section VI. Section VII concludes the paper.
Throughout the paper, all vectors are denoted by boldface lower case letters while lower case letters are used for scalars. We utilize boldface upper case letters to show matrices and boldface upper case letters in math calligraphy to indicate tensors. Moreover, in this paper, the terms upscaling factor and scale are used interchangeably.
Ii Related Work
Ii1 Single Image SuperResolution
A first category of single image SR methods includes interpolationbased methods
[49, 45, 70, 31]. These methods are simple and fast, however, aliasing and blurring effects make them inefficient in obtaining HR images of fine quality. A second category includes reconstruction methods [63, 12, 35], which use several image priors to regularize the illposed reconstruction problem and result in images with fine texture details. Nevertheless, modelling the complex context of natural images with image priors is not an easy task. A third popular category consists of learningbased methods [62, 61, 52, 53, 64, 22, 10, 11, 24, 25, 36, 3, 30, 68, 51, 50, 69], which use machine learning techniques to learn the complex mapping between LR and HR images from data.
Among learning based methods, deep learning models have drawn considerable attention as they achieve excellent restoration quality. SRCNN [10] was the first deep learning method for image SR. The model has a simple structure and can directly learn an endtoend mapping between the LR/HR images. An accelerated version of [10] was presented in [11]. Increasing the depth of CNN architectures ensues several training difficulties which have been mitigated with residual learning. Examples of very deep residual networks for SR include a layer CNN proposed in [24, 25], a
layer convolutional autoencoder proposed in
[36], and a layer CNN proposed in [51]. Residual learning has also been employed to learn interlayer dependencies in [50, 69]. An improved residual design obtained by removing unnecessary residual modules was proposed in [30]. Recurrent neural networks (RNNs) have been also used for image SR in
[29, 17]. The network in [29] implements a feedback mechanism that carries highlevel information back to previous layers, refining lowlevel encoded information. Following [29], the authors of [17] presented an alternative structure with two states (RNN layers) that operate at different spatial resolutions, providing information flow from LR to HR encodings.Deep unfolding has been applied to single image SR in [32] where the authors designed a neural network that computes latent representations of the LR/HR image using LISTA [13], a neural network that performs steps similar to the Iterative Soft Thresholding Algorithm (ISTA) [5] (see also Section III).
Ii2 Multimodal Image SuperResolution
A common approach in multimodal image restoration is the joint or guided filtering approach, that is, the design of a filter that leverages the guidance image as a prior and transfers structural details from the guidance to the target image. Several joint filtering techniques have been proposed in [27, 18, 16]. Nevertheless, when the local structures in the guidance and target images are not consistent, these techniques may transfer incorrect content to the target image. The approach presented in [46] concerns the design of an explicit mapping that captures the structural discrepancy between images from different modalities.
Modelbased techniques and joint filtering methods are limited in characterizing the complex dependency between different modalities. Learning based methods aim to learn this dependency from data. In a depth upsampling method presented in [14], a weighted analysis representation is used to model the complex relationship between depth and RGB images; the model parameters are learned with a task driven training strategy. Another learning based approach relies on sparse modelling and involves coupleddictionary learning [56, 71, 33, 23, 4, 1, 6, 7]. Most of these works assume that there is a mapping between the sparse representation of one modality to the sparse representation of another modality. The authors of [47] consider both similarities and disparities between different modalities under the sparse representation invariance assumption.
Purely datadriven solutions for multimodal image SR are provided by multimodal deep learning approaches. Examples include the model presented in [28] implementing a CNN based joint image filter, and the work presented in [59], which is a deep learning reformulation of the widely used guided image filter proposed in [18].
The deep unfolding design LISTA [13] has also been deployed for multimodal image SR in [9]
where a coupled ISTA network is presented. The network accepts as input an LR image from the target modality and an HR image from the guidance modality. Two LISTA branches are employed to compute latent representations of the input images. The estimation of the target HR image is obtained as a linear combination of these representations. A similar approach is proposed in
[8] that employs three convolutional LISTA networks to split the common information shared between modalities, from the unique information belonging to each modality. The output is then computed as a combination of these common and unique feature maps after applying the corresponding dictionaries on them.Iii Background
Image superresolution can be addressed as a linear inverse problem formulated as follows [62]:
(1) 
where is a vectorized form of the unknown HR image, and denotes the LR observations contaminated with noise . The linear operator , , describes the observation mechanism, which can be expressed as the product of a downsampling operator and a blurring filter [62]. Problem (1) appears in many imaging applications including image restoration and inpainting [44, 34].
Iiia Image SuperResolution via Sparse Approximation
Even when the linear observation operator is given, problem (1) is illposed and requires additional regularization for its solution. Sparsity has been widely used as a regularizer leading to the wellknown sparse approximation problem [54]. Instead of directly solving for , in this paper, we rely on a sparse modelling approach presented in [62]. According to [62], an dimensional (vectorized) patch from a bicubicupscaled LR image and the corresponding patch from the respective HR image can be expressed by joint sparse representations. By jointly learning two dictionaries , , , for the low and the highresolution image patches, respectively, we can enforce the similarity of sparse representations of patch pairs such that and , . Then, computing the HR patch is equivalent to finding the sparse representation of the LR patch , by solving
(2) 
where is a regularization parameter, and is the norm, which promotes sparsity. Several methods have been proposed for solving (2) including pivoting algorithms, interiorpoint methods and gradient based methods [54].
Sparse modelling techniques that involve computations applied to independent image patches do not take into account the consistency of pixels in overlapping patches [15]. Convolutional Sparse Coding (CSC) [66] is an alternative approach, which can be directly applied to the entire image. Denoting with the image of interest, the convolutional sparse codes are obtained by solving the following problem:
(3) 
where is the Frobenius norm, , , are the atoms of a convolutional dictionary , and , , are the sparse feature maps with respect to . The norm computes the sum of absolute values of the elements in (as if is unrolled as a vector). Efficient solutions of (3) are presented in [58, 20].
According to recent studies [39], the accuracy of sparse approximation problems can be improved if a signal correlated with the target signal is available; we refer to as side information (SI). Assume that and have similar sparse representations , , under dictionaries , , , , respectively. Then the sparse representation can be obtained as the solution of the  minimization problem [39].
(4) 
Problem (4) has been theoretically studied in [39]. Numerical methods for its solution are presented in [57, 40].
IiiB Deep Unfolding
Analytical approaches for sparse approximation are usually equipped with theoretical guarantees; however, their major drawback is their high computational complexity. In some applications, the deployed dictionaries also need to be learned, increasing the computational burden [62]. The authors of [13] address this problem by a neural network design that performs operations similar to the Iterative Soft Thresholding Algorithm (ISTA) [5] proposed for the solution of (2). The learning process results in a trained version of ISTA, coined LISTA. The th layer of LISTA computes:
(5) 
where
(6) 
is the soft thresholding operator; , and are parameters, which are fixed in ISTA, while LISTA learns them from data. As a result, LISTA achieves high accuracy in only a few iterations. The technique known as deep unfolding was also explored in [21, 60, 2, 65]; a convolutional LISTA design for CSC was presented in [48].
The aforementioned deep unfolding studies deal with singlemodal data. A deep unfolding design that incorporates side information coming from another modality was first presented in our previous work [55]. The model proposed in [55] relies on a proximal method for the solution of (4), which iterates over
(7) 
with , appropriate parameters. The proximal operator incorporates the side information and is expressed as follows:

For , :
(8) 
For , :
(9)
By writing the proximal algorithm in the form
(10) 
and translating (10) into a deep neural network, we obtain a fast multimodal operator referred to as Learned SideInformationdriven iterative soft Thresholding Algorithm (LeSITA). LeSITA has a similar expression to LISTA (5), however, (10
) employs the new activation function
that integrates side information into the learning process.Iv Design Multimodal Convolutional Networks with Deep Unfolding
In what follows, we consider that, besides the observations of the target signal, another image modality , correlated with is available. We assume that the two image modalities can be represented by convolutional sparse codes that are similar by means of the norm. Specifically, let be a sparse representation of the observed image with respect to a convolutional dictionary ; , , denote the atoms of . By employing a convolutional dictionary with atoms , , the guidance image can be expressed as with the convolutional sparse codes , , obtained as the solution of (3). Then, we can compute the unknown sparse codes of the target modality by solving a problem formulated in a way similar to (4), that is,
(11) 
There is a correspondence between convolutional and linear sparse codes. If we replace the convolutional dictionary with a matrix with Toeplitz structure, and take into account the linear properties of convolution, then (11) reduces to (4). Specifically, we define as a sparse dictionary obtained by concatenating the Toeplitz matrices that unroll ’s; and take the form of vectorized sparse feature maps of the target and the side information images, respectively. Then, by replacing the convolutional operations in (11) with multiplications, we obtain (4), and the proximal algorithm (7) can be employed to compute convolutional sparse codes.
Nevertheless, transforming (11) to (4) and using (7) for its solution is not computationally efficient. Since CSC deals with the entire image, the dimensionality of (4) becomes too high and the proximal method becomes impractical. We use the correspondence between linear and convolutional representations, and formulate an iterative algorithm that performs convolutions as follows: In the proximal algorithm (7), the matrices and , which take the form of concatenated Toeplitz matrices in the convolutional case, are replaced by the convolutional dictionaries and , respectively. Then, by replacing multiplications with convolutional operations, we can compute the convolutional codes of the target image, given the convolutional codes of the guidance image, by iterating over:
(12) 
where , are tensors of size .
Equation (12) can be translated into a deep convolutional neural network (CNN). Each stage of the network computes the sparse feature maps according to
(13) 
with , , learnable convolutional layers and a learnable parameter; is the number of channels of the employed images. The proposed network architecture, depicted in Fig. 1, is referred to as Learned Multimodal Convolutional Sparse Coding (LMCSC). The network can be trained in a supervised manner to map an input image to sparse feature maps. During training, the parameters , , and are learned; therefore, the deep LMCSC can achieve high accuracy with only a fraction of computations of the proximal method.
LMCSC uses the convolutional sparse codes of the guidance modality to compute the convolutional sparse codes of the target modality . An efficient multimodal convolutional operator should integrate a fast operator for the encoding of the guidance modality. In the models presented next, we obtain using the ACSC operator presented in [48]. ACSC has the form of convolutional LISTA, with the th layer computing:
(14) 
where is the proximal operator given by (6). The parameters of the convolutional layers , and , are learned from data. The architecture of ACSC is depicted in Fig. 2.
V Deep Multimodal Image SuperResolution
The proposed LMCSC architecture can be employed to perform multimodal image superresolution based on a sparsitydriven convolutional model. The proposed model follows similar principles with [62]. Specifically, the sparse linear modelling of LR/HR image patches presented in [62] is replaced by a sparse convolutional modelling of the entire LR/HR images, followed by similarity assumptions between the convolutional representations of the LR and HR images. To efficiently integrate information from a second image modality, we also assume that the target and the guidance image modalities are similar by means of the norm in the representation domain.
Va LMCSCNet
Our first model proposed for multimodal image superresolution relies on the following assumption: The LR observation and the HR image share the same convolutional sparse features maps , under different convolutional dictionaries and , that is, , , where , are the atoms of the respective convolutional dictionaries. Given , , finding a mapping from to is equivalent to computing the sparse features maps of the observed LR image . The similarity assumption between the target and the guidance image modalities in the representation domain implies that the convolutional sparse codes of the guidance HR image are similar to by means of the norm. Therefore, can be obtained as the solution of the  minimization problem (11). Based on these assumptions, we build our first model for multimodal image SR using LMCSC as a core component of a deep architecture.
The proposed model, coined LMCSCNet, consists of three subnetworks: (i
) an LMCSC encoder that produces convolutional latent representations of the imput LR image with the aid of side information, (
ii) a side information encoder that produces latent representations of the guidance HR image, and (iii) a convolutional decoder that computes the target HR image. The goal of the LMCSC encoder is to learn a convolutional sparse feature map of the LR image , also shared by the HR image , using a convolutional sparse feature map as side information, akin to the model presented in Section IV. The LMCSC branch is followed by a convolutional decoder realized by a learnable convolutional dictionary . The decoder receives the latent representations provided by LMCSC and estimates according to .The entire network, depicted in Fig. 3
, is trained endtoend using the mean square error (MSE) loss function:
(15) 
where denotes the set of all network parameters, is the groundtruth image of the target modality, and is the estimation computed by the network.
Different from our previous multimodal image SR design [38] which relies on LeSITA [55], LMCSCNet has a novel convolutional structure inspired by a different proximal algorithm. The core LMCSC component computes latent representations of the target modality using side information from the guidance modality, performing fusion of information at every layer. Therefore, our approach is different from coupled ISTA [9] which employs one branch of LISTA [13] for each modality and fuses the latent representations only in the last layer.
VB LmcscNet
The model presented in Section VA learns similar sparse representations of three different image modalities, that is, the input LR image , the guidance modality and the HR image . Learning representations that mainly encode the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between and . In other words, the encoding performed by the ACSC branch may result in transferring unrelated information to the LMCSC encoder. As a result, the latent representation of the target modality may not capture the underlying mapping between the LR and HR images in the representation domain. Furthermore, assuming identical latent representations for both LR and HR images limits the performance of the network especially when the degradation level of the LR observations is high.
In order to address the aforementioned problems, we relax the assumption concerning the similarity between the LR and HR images in the representation domain. Specifically, we assume that the convolutional sparse codes of the HR image
can be obtained as a nonlinear transformation of the respective codes
of the LR image , that is, where is a nonlinear function parameterized by .Under this assumption, we build the proposed multimodal SR framework by employing the following components: (i) An LMCSC subnetwork is used to fuse the information from the LR observations and the guidance HR image, providing a first estimation of the target HR image with the aid of side information. (ii) An ACSC subnetwork following the LMCSC subnetwork is used to enhance the transformation between the LR and HR images of the target modality without using side information. The architecture of the proposed model, referred to as LMCSCNet, is depicted in Fig. 4. The additional ACSC and dictionary layers in LMCSCNet implement . The network is trained using the objective (15).
VC LMCSCNet with Skip Connections
Considering the significant improvement achieved by residual learning in the training efficiency and the prediction accuracy [19, 30, 68, 51, 50, 69], we enhance the proposed LMCSCNet with a skip connection, introducing a new model coined LMCSCResNet. For the design of LMCSCResNet we rely on the assumption that the HR image contains all the lowfrequency information from the LR image plus some highfrequency details that can be captured by a nonlinear mapping between LR and HR images. By using an identity mapping of the input, the capacity of the network can be assigned to learning the high frequency details, since the lowfrequency information is provided by . LMCSCResNet learns the nonlinear mapping , where . In LMCSCResNet, is obtained from the LMCSCNet. The model architecture is presented in Fig. 5. This model is also trained endtoend using the objective (15).
Vi Experiments
Scale  CSCN [32]  ACSC [48]  EDSR [30]  SRFBN [29]  DJF [28]  CoISTA [9]  LMCSCNet  LMCSCNet  LMCSCResNet 

33.96  34.23  
32.07  31.94 
NIR/RGB  u0004  u0006  u0017  o0018  u0020  u0026  o0030  u0050  Average  


PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM 
Bicubic  
SDF [16]  
CSCN [32]  
ACSC [48]  
EDSR [30]  
SRFBN [29]  37.06  0.9966  
DJF [28]  
CoISTA [9]  
DMSC [38]  
LMCSCNet 
0.9982  0.9980  0.9963  38.29  
LMCSCNet  40.96  0.9988  
LMCSCResNet  37.99  0.9982  43.84  40.05  0.9989  41.05  36.15  0.9965  39.04  0.9971  
PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  
Bicubic  
SDF [16]  
JBF [27]  
JFSM [46]  
GF [18]  
CSCN [32]  
ACSC [48]  
EDSR [30]  
SRFBN [29]  
DJF [28]  
CoISTA [9]  
DMSC [38]  
LMCSCNet  
LMCSCNet  33.89  31.21  0.9785  33.66  
LMCSCResNet  0.9895  38.77  0.9915  36.54  0.9836  34.78  0.9920  37.34  0.9923  0.9784  30.10  0.9773  34.49  0.9853  

PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM 
Bicubic  
SDF [16]  
JBF [27]  
JFSM [46]  
GF [18]  
CSCN [32]  
ACSC [48]  
EDSR [30]  
SRFBN [29]  
DJF [28]  
CoISTA [9]  
DMSC [38]  
LMCSCNet  
LMCSCNet  
LMCSCResNet  32.19  0.9846  36.52  0.9860  34.65  0.9763  33.24  0.9877  35.75  0.9889  29.58  0.9630  31.59  0.9638  28.94  0.9624  32.81  0.9766 
We apply the proposed models to different upsampling tasks, that is, superresolution of nearinfrared (NIR) images, depth upsampling and superresolution of multispectral data, using RGB images as side information. We compare our models against stateoftheart singlemodal and multimodal methods showing the superior performance of the proposed approach. Before demonstrating our experimental results, we present the employed datasets and report implementation details.
Via Datasets
ViA1 EPFL RGBNIR dataset
NIR images are acquired at a low resolution due to the high cost per pixel of a NIR sensor compared to an RGB sensor. We employ the EPFL RGBNIR dataset^{1}^{1}1https://ivrl.epfl.ch/supplementary_material/cvpr11/ and apply our models to superresolve an LR NIR image with the aid of an HR RGB image. The dataset contains spatially aligned NIR/RGB image pairs. Our training set contains approximately cropped image pairs extracted from images. Each training image is of size pixels; the size is chosen with respect to memory requirements and computational complexity. We also create a testing set containing image pairs; testing is performed on an entire image.
The NIR images consist of one channel. An LR version of a NIR image is generated by blurring and downscaling the ground truth HR version. We convert the RGB images to YCbCr and only utilize the luminance channel as the side information. Following [10], we apply bicubic interpolation as a preprocessing step to upscale the LR input such that the input and output images are of the same size.
ViA2 NYU v2 RGBD dataset
Depth cameras like Microsoft Kinect and timeofflight (ToF) cameras only provide lowresolution depth images. Therefore, depth upsampling is a necessary task for many vision applications. We apply our models for depth upsampling with the aid of RGB images, using the NYU v2 RGBD dataset [41]. The dataset contains RGB images with their depth maps. Similar to [28], we use the first images for training, and the remaining for testing.
ViA3 Columbia multispectral database
The third dataset that we use to evaluate the proposed models is the Columbia multispectral database,^{2}^{2}2http://www.cs.columbia.edu/CAVE/databases/multispectral which contains spectral reflectance data and RGB images. For testing, we reserve images from the nm band and randomly select images from different bands; the rest are used for training.
ViB Implementation Details
All networks are designed with three unfolding steps for the target (LMCSC) and the side information (ACSC) encoders. The number of unfolding steps is chosen after taking into account the tradeoff between the computational complexity and the reconstruction accuracy; for instance, by increasing the unfolding steps to five, the improvement in the average PSNR is less than dB while the execution time is almost higher. In the LMCSCResNet, the ACSC branch employed for the nonlinear mapping of the target signal is designed with one unfolding step.
We empirically set the size of the network parameters , , and to ; the size of and are set to . The size of the convolutional dictionaries for reconstruction is . Note that a convolutional layer of size consists of convolutional filters with kernel size and channels. We use untied weights at every unfolding step, i.e., the th layer of LMCSC and ACSC subnetworks is realized by the independent variables , and ,
, respectively. The weights of all layers are initialized randomly using the Gaussian distribution with standard deviation equal to
. The parameters and of the proximal operators are both initialized to . We train the network using the Adam optimizer.We notice that the complexity of our networks is dominated by the LeSITA activation layers in the LMCSC block, and an implementation based on (8) and (9) is not efficient. In order to address this issue, we rewrite the proximal operator in (8), (9) as follows:
(16) 
where
is the Rectified Linear Unit (ReLU) function. This form of the proximal operator results in a
% faster implementation and we use this version in all of the experiments.ViC Performance Comparison
RGBD 
GF [18]  JBF [27]  SDF [16]  DJF[28]  Gu et al. [14]  LMCSCNet  LMCSCNet  LMCSCResNet 

4.04  2.31  3.04  1.97  1.56  1.49  1.38  1.45 

7.34  4.12  5.67  3.39  2.99  2.67  2.58  2.61 

12.23  6.98  9.97  5.63  5.24  5.01  4.93  4.88 

Depth/RGB 
NYU1  NYU2  NYU3  NYU4  NYU5  NYU6  NYU7  NYU8  NYU9  NYU10  Average  


Bicubic  
DJF [28]  
Gu et al. [14]  
LMCSCNet  
LMCSCNet  1.35  1.02  0.88  1.04  1.45  1.16  1.28  1.22  1.08  1.39  1.19  
LMCSCResNet  

Bicubic  
DJF [28]  
Gu et al. [14]  
LMCSCNet  
LMCSCNet  2.86  2.14  2.29  
LMCSCResNet  2.09  1.93  1.42  2.21  2.22  1.75  2.66  2.18  

Bicubic  
DJF [28]  
Gu et al. [14]  
LMCSCNet  4.59  
LMCSCNet  3.80  3.84  5.82  
LMCSCResNet  3.89  3.16  4.73  3.54  4.18  3.37  4.15  


Chart toy  Egyptian  Feathers  Glass tiles  Jelly beans  Oil Paintings  Paints  Average  


PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM  PSNR  SSIM 
Bicubic  
SDF [16]  
JBF [27]  
JFSM [46]  
GF [18]  
EDSR [30]  
SRFBN [29]  
DJF [28]  
CoISTA [9]  
LMCSCNet  38.47  0.9962  38.32  0.9953  
LMCSCNet  33.73  0.9936  36.89  
LMCSCResNet  43.45  0.9958 