Multimodal Deep Unfolding for Guided Image Super-Resolution

by Iman Marivani, et al.

The reconstruction of a high resolution image given a low resolution observation is an ill-posed inverse problem in imaging. Deep learning methods rely on training data to learn an end-to-end mapping from a low-resolution input to a high-resolution output. Unlike existing deep multimodal models that do not incorporate domain knowledge about the problem, we propose a multimodal deep learning design that incorporates sparse priors and allows the effective integration of information from another image modality into the network architecture. Our solution relies on a novel deep unfolding operator, performing steps similar to an iterative algorithm for convolutional sparse coding with side information; therefore, the proposed neural network is interpretable by design. The deep unfolding architecture is used as a core component of a multimodal framework for guided image super-resolution. An alternative multimodal design is investigated by employing residual learning to improve the training efficiency. The presented multimodal approach is applied to super-resolution of near-infrared and multi-spectral images as well as depth upsampling using RGB images as side information. Experimental results show that our model outperforms state-of-the-art methods.



I Introduction

Image super-resolution (SR) is a well-known inverse problem in imaging, referring to the reconstruction of a high-resolution (HR) image from a low-resolution (LR) observation [44, 34]. The problem is ill-posed as there is no unique mapping from the LR to the HR image. Practical applications such as medical imaging and remote sensing often involve different image modalities capturing the same scene; therefore, another approach in imaging is the joint use of multiple image modalities. The problem of multimodal or guided image super-resolution refers to the reconstruction of an HR image from an LR observation using a guidance image from another modality, also referred to as side information.

Several image processing methods use prior knowledge about the image such as sparse structure [62, 61, 52, 53, 64, 22, 15] or statistical image priors [26, 49]. Deep learning has been widely used in inverse problems, often outperforming analytical methods [34, 43]. For instance, in single image SR, Convolutional Neural Networks (CNNs) have led to impressive results reported in [10, 11, 24, 25, 36, 3]. Residual learning [19] enabled the training of very deep neural networks (DNNs), with the models proposed in [30, 68, 51, 50, 69] achieving state-of-the-art performance.

Capturing the correlation among different image modalities has been addressed with sparsity-based analytical models and coupled dictionary learning [56, 71, 33, 23, 4, 1, 6, 7, 47]. The main drawback of these approaches is the high computational cost of iterative algorithms for sparse approximation, which has been addressed by multimodal deep learning methods [42, 28, 59]. A common approach in multimodal neural network design is the fusion of the input modalities at a shared latent layer, obtained as the concatenation of the latent representations of each modality [42]. The CNN model proposed in [28] for multimodal depth upsampling follows this principle. Nevertheless, current multimodal DNNs are black-box models in the sense that we lack a principled approach to design such models for leveraging the signal structure and properties of the correlation across modalities.

A recent line of research in deep learning for inverse problems considers deep unfolding [13, 21, 60, 2, 65, 48], that is, the unfolding of an iterative algorithm into the form of a DNN. Inspired by numerical algorithms for sparse coding, deep unfolding designs have been utilized in several imaging problems to incorporate sparse priors into the solution. Results for denoising [48], compressive imaging [67], and image SR [32] have shown that incorporating domain knowledge into the network architecture can improve the performance substantially. Still, these methods focus on single-modal data, thereby lacking a principled way to incorporate knowledge from different imaging modalities. To the best of our knowledge, the only deep unfolding designs for guided image SR have been presented in [9] and [8], that build upon existing unfolding architectures for learned sparse coding [13] and learned convolutional sparse coding [48], respectively.

In this paper, we address the problem of guided image SR with a novel multimodal deep unfolding architecture, which is inspired by a proximal algorithm for convolutional sparse coding with side information. The proximal algorithm is translated into a neural network form coined Learned Multimodal Convolutional Sparse Coding (LMCSC); the network incorporates sparse priors and enables efficient integration of the guidance modality into the solution. While in existing multimodal deep learning methods [42, 28, 59], it is difficult to understand what the model has learned, our deep neural network is interpretable, in the sense that the model performs steps similar to an iterative algorithm.

The proposed approach builds upon our previous research on multimodal deep unfolding [55, 38], and preliminary results of LMCSC can be found in [37], where the recovery of HR near infrared (NIR) images based on LR NIR observations with the aid of RGB images was addressed. In this paper, we integrate LMCSC into different neural network architectures, and present experiments on various multimodal datasets, showing the superior performance of the proposed approach against several single- and multimodal SR methods. Our contribution is as follows:


We formulate the problem of convolutional sparse coding with side information, and propose a proximal algorithm for its solution.


Inspired by the proposed proximal algorithm, we design a deep unfolding neural network for fast computation of convolutional sparse codes with the aid of side information.


The deep unfolding operator is used as a core component in a novel multimodal framework for guided image SR that fuses information from two image modalities. Furthermore, we exploit residual learning and introduce skip connections in the proposed framework, obtaining an alternative design that can be trained more efficiently.


We test our models on several benchmark multimodal datasets, including NIR/RGB, multi-spectral/RGB, depth/RGB, and compare them against various state-of-the-art single-modal and multimodal models. The numerical results show a PSNR gain of up to dB over the coupled ISTA method in [9].

The rest of the paper is organized as follows: Section II reviews related work and Section III provides the necessary background. Section IV presents the proposed core deep unfolding architecture for convolutional sparse coding with side information, and our designs for multimodal image SR are presented in Section V, followed by experimental results in Section VI. Section VII concludes the paper.

Throughout the paper, all vectors are denoted by boldface lower case letters while lower case letters are used for scalars. We utilize boldface upper case letters to show matrices and boldface upper case letters in math calligraphy to indicate tensors. Moreover, in this paper, the terms upscaling factor and scale are used interchangeably.

II Related Work

II-1 Single Image Super-Resolution

A first category of single image SR methods includes interpolation-based methods [49, 45, 70, 31]. These methods are simple and fast; however, aliasing and blurring effects make them inefficient in obtaining HR images of fine quality. A second category includes reconstruction methods [63, 12, 35], which use several image priors to regularize the ill-posed reconstruction problem and result in images with fine texture details. Nevertheless, modelling the complex context of natural images with image priors is not an easy task. A third popular category consists of learning-based methods [62, 61, 52, 53, 64, 22, 10, 11, 24, 25, 36, 3, 30, 68, 51, 50, 69], which use machine learning techniques to learn the complex mapping between LR and HR images from data.

Among learning-based methods, deep learning models have drawn considerable attention as they achieve excellent restoration quality. SRCNN [10] was the first deep learning method for image SR. The model has a simple structure and can directly learn an end-to-end mapping between the LR/HR images. An accelerated version of [10] was presented in [11]. Increasing the depth of CNN architectures introduces several training difficulties, which have been mitigated with residual learning. Examples of very deep residual networks for SR include a -layer CNN proposed in [24, 25], a -layer convolutional autoencoder proposed in [36], and a -layer CNN proposed in [51]. Residual learning has also been employed to learn inter-layer dependencies in [50, 69]. An improved residual design obtained by removing unnecessary residual modules was proposed in [30]. Recurrent neural networks (RNNs) have also been used for image SR in [29, 17]. The network in [29] implements a feedback mechanism that carries high-level information back to previous layers, refining low-level encoded information. Following [29], the authors of [17] presented an alternative structure with two states (RNN layers) that operate at different spatial resolutions, providing information flow from LR to HR encodings.

Deep unfolding has been applied to single image SR in [32] where the authors designed a neural network that computes latent representations of the LR/HR image using LISTA [13], a neural network that performs steps similar to the Iterative Soft Thresholding Algorithm (ISTA) [5] (see also Section III).

II-2 Multimodal Image Super-Resolution

A common approach in multimodal image restoration is the joint or guided filtering approach, that is, the design of a filter that leverages the guidance image as a prior and transfers structural details from the guidance to the target image. Several joint filtering techniques have been proposed in [27, 18, 16]. Nevertheless, when the local structures in the guidance and target images are not consistent, these techniques may transfer incorrect content to the target image. The approach presented in [46] concerns the design of an explicit mapping that captures the structural discrepancy between images from different modalities.

Model-based techniques and joint filtering methods are limited in characterizing the complex dependency between different modalities. Learning based methods aim to learn this dependency from data. In a depth upsampling method presented in [14], a weighted analysis representation is used to model the complex relationship between depth and RGB images; the model parameters are learned with a task driven training strategy. Another learning based approach relies on sparse modelling and involves coupled-dictionary learning [56, 71, 33, 23, 4, 1, 6, 7]. Most of these works assume that there is a mapping between the sparse representation of one modality to the sparse representation of another modality. The authors of [47] consider both similarities and disparities between different modalities under the sparse representation invariance assumption.

Purely data-driven solutions for multimodal image SR are provided by multimodal deep learning approaches. Examples include the model presented in [28] implementing a CNN based joint image filter, and the work presented in [59], which is a deep learning reformulation of the widely used guided image filter proposed in [18].

The deep unfolding design LISTA [13] has also been deployed for multimodal image SR in [9], where a coupled ISTA network is presented. The network accepts as input an LR image from the target modality and an HR image from the guidance modality. Two LISTA branches are employed to compute latent representations of the input images. The estimation of the target HR image is obtained as a linear combination of these representations. A similar approach is proposed in [8], which employs three convolutional LISTA networks to separate the common information shared between modalities from the unique information belonging to each modality. The output is then computed as a combination of these common and unique feature maps after applying the corresponding dictionaries to them.

III Background

Image super-resolution can be addressed as a linear inverse problem formulated as follows [62]:

$$\mathbf{y} = \mathbf{L}\mathbf{x} + \boldsymbol{\eta}, \tag{1}$$

where $\mathbf{x}$ is a vectorized form of the unknown HR image, and $\mathbf{y}$ denotes the LR observations contaminated with noise $\boldsymbol{\eta}$. The linear operator $\mathbf{L}$ describes the observation mechanism, which can be expressed as the product of a downsampling operator and a blurring filter [62]. Problem (1) appears in many imaging applications including image restoration and inpainting [44, 34].
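As a minimal sketch of the observation mechanism described above (blurring followed by downsampling, plus noise), the following NumPy snippet generates an LR observation from a toy HR image. The box filter, the decimation by striding, and the names `box_blur` and `observe` are illustrative assumptions; the blur kernel and downsampling operator actually used by the authors are not specified in this text.

```python
import numpy as np

def box_blur(img, k=3):
    """Blur with a k x k box filter (k odd), replicating edge pixels at the border."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def observe(hr, scale=2, noise_std=0.0, seed=0):
    """LR observation: blur, keep every `scale`-th pixel, add Gaussian noise."""
    rng = np.random.default_rng(seed)
    lr = box_blur(hr)[::scale, ::scale]
    return lr + noise_std * rng.standard_normal(lr.shape)

hr = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 "HR image"
lr = observe(hr, scale=2)
print(hr.shape, "->", lr.shape)                 # (8, 8) -> (4, 4)
```

Because many different HR images map to the same LR observation under such an operator, the inverse problem has no unique solution without additional regularization.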

III-A Image Super-Resolution via Sparse Approximation

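Even when the linear observation operator $\mathbf{L}$ is given, problem (1) is ill-posed and requires additional regularization for its solution. Sparsity has been widely used as a regularizer leading to the well-known sparse approximation problem [54]. Instead of directly solving for $\mathbf{x}$, in this paper, we rely on a sparse modelling approach presented in [62]. According to [62], an $n$-dimensional (vectorized) patch $\tilde{\mathbf{y}} \in \mathbb{R}^{n}$ from a bicubic-upscaled LR image and the corresponding patch $\tilde{\mathbf{x}} \in \mathbb{R}^{n}$ from the respective HR image can be expressed by joint sparse representations. By jointly learning two dictionaries $\mathbf{D}_{\ell} \in \mathbb{R}^{n \times m}$, $\mathbf{D}_{h} \in \mathbb{R}^{n \times m}$, $n \le m$, for the low- and the high-resolution image patches, respectively, we can enforce the similarity of sparse representations of patch pairs such that $\tilde{\mathbf{y}} = \mathbf{D}_{\ell}\boldsymbol{\alpha}$ and $\tilde{\mathbf{x}} = \mathbf{D}_{h}\boldsymbol{\alpha}$, $\boldsymbol{\alpha} \in \mathbb{R}^{m}$. Then, computing the HR patch $\tilde{\mathbf{x}}$ is equivalent to finding the sparse representation of the LR patch $\tilde{\mathbf{y}}$, by solving

$$\min_{\boldsymbol{\alpha}} \frac{1}{2}\|\mathbf{D}_{\ell}\boldsymbol{\alpha} - \tilde{\mathbf{y}}\|_2^2 + \lambda\|\boldsymbol{\alpha}\|_1, \tag{2}$$

where $\lambda$ is a regularization parameter, and $\|\cdot\|_1$ is the $\ell_1$-norm, which promotes sparsity. Several methods have been proposed for solving (2) including pivoting algorithms, interior-point methods and gradient based methods [54].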

Sparse modelling techniques that involve computations applied to independent image patches do not take into account the consistency of pixels in overlapping patches [15]. Convolutional Sparse Coding (CSC) [66] is an alternative approach, which can be directly applied to the entire image. Denoting with $\mathbf{X}$ the image of interest, the convolutional sparse codes are obtained by solving the following problem:

$$\min_{\{\mathbf{U}_i\}} \frac{1}{2}\Big\|\mathbf{X} - \sum_{i=1}^{k}\mathbf{D}_i * \mathbf{U}_i\Big\|_F^2 + \lambda\sum_{i=1}^{k}\|\mathbf{U}_i\|_1, \tag{3}$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{D}_i$, $i = 1, \dots, k$, are the atoms of a convolutional dictionary $\boldsymbol{\mathcal{D}}$, and $\mathbf{U}_i$, $i = 1, \dots, k$, are the sparse feature maps with respect to $\boldsymbol{\mathcal{D}}$. The $\ell_1$-norm computes the sum of absolute values of the elements in $\mathbf{U}_i$ (as if $\mathbf{U}_i$ is unrolled as a vector). Efficient solutions of (3) are presented in [58, 20].

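According to recent studies [39], the accuracy of sparse approximation problems can be improved if a signal $\mathbf{w}$ correlated with the target signal $\mathbf{x}$ is available; we refer to $\mathbf{w}$ as side information (SI). Assume that $\mathbf{x}$ and $\mathbf{w}$ have similar sparse representations $\boldsymbol{\alpha}$, $\mathbf{s}$, under dictionaries $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$, respectively. Then the sparse representation $\boldsymbol{\alpha}$ can be obtained as the solution of the $\ell_1$-$\ell_1$ minimization problem [39]:

$$\min_{\boldsymbol{\alpha}} \frac{1}{2}\|\boldsymbol{\Phi}\boldsymbol{\alpha} - \mathbf{x}\|_2^2 + \lambda\left(\|\boldsymbol{\alpha}\|_1 + \|\boldsymbol{\alpha} - \mathbf{s}\|_1\right). \tag{4}$$

Problem (4) has been theoretically studied in [39]. Numerical methods for its solution are presented in [57, 40].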

III-B Deep Unfolding

Analytical approaches for sparse approximation are usually equipped with theoretical guarantees; however, their major drawback is their high computational complexity. In some applications, the deployed dictionaries also need to be learned, increasing the computational burden [62]. The authors of [13] address this problem by a neural network design that performs operations similar to the Iterative Soft Thresholding Algorithm (ISTA) [5] proposed for the solution of (2). The learning process results in a trained version of ISTA, coined LISTA. The $t$-th layer of LISTA computes:

$$\boldsymbol{\alpha}^{t} = \phi_{\gamma}\left(\mathbf{S}\boldsymbol{\alpha}^{t-1} + \mathbf{W}\tilde{\mathbf{y}}\right), \tag{5}$$

where $\tilde{\mathbf{y}}$ is the observed signal and

$$\phi_{\gamma}(v_i) = \operatorname{sign}(v_i)\max\{0, |v_i| - \gamma\} \tag{6}$$

is the soft thresholding operator; $\mathbf{S}$, $\mathbf{W}$ and $\gamma$ are parameters, which are fixed in ISTA, while LISTA learns them from data. As a result, LISTA achieves high accuracy in only a few iterations. The technique known as deep unfolding was also explored in [21, 60, 2, 65]; a convolutional LISTA design for CSC was presented in [48].
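To make the relation between ISTA and a LISTA layer concrete, the sketch below writes plain ISTA in the "layer" form, a soft thresholding applied to a linear combination of the previous estimate and the observation. It assumes the standard LASSO objective and the usual step size 1/L; the names `ista`, `S`, `W` and the toy data are illustrative. In LISTA the two matrices and the threshold would be trained rather than computed from the dictionary.

```python
import numpy as np

def soft_threshold(v, gamma):
    """Element-wise soft thresholding: shrink every entry towards zero by gamma."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def ista(y, D, lam, n_iter=100):
    """ISTA for min_a 0.5*||D a - y||_2^2 + lam*||a||_1, written in layer form."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    S = np.eye(D.shape[1]) - D.T @ D / L   # fixed in ISTA, learned in LISTA
    W = D.T / L                            # fixed in ISTA, learned in LISTA
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):                # each pass mirrors one network layer
        a = soft_threshold(S @ a + W @ y, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))          # toy overcomplete dictionary
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.0, -2.0, 1.5]     # sparse ground-truth code
y = D @ a_true
a_hat = ista(y, D, lam=0.1, n_iter=500)
print("non-zeros in estimate:", int(np.sum(np.abs(a_hat) > 1e-6)))
```

ISTA typically needs many such passes to converge, which is the cost that a learned network with a few layers aims to avoid.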

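The aforementioned deep unfolding studies deal with single-modal data. A deep unfolding design that incorporates side information coming from another modality was first presented in our previous work [55]. The model proposed in [55] relies on a proximal method for the solution of (4), which iterates over

$$\boldsymbol{\alpha}^{t} = \xi_{\mu}\!\left(\boldsymbol{\alpha}^{t-1} - \frac{1}{L}\boldsymbol{\Phi}^{T}\left(\boldsymbol{\Phi}\boldsymbol{\alpha}^{t-1} - \mathbf{x}\right); \mathbf{s}\right), \tag{7}$$

with $L$, $\mu$ appropriate parameters. The proximal operator $\xi_{\mu}$ incorporates the side information $\mathbf{s}$ and is expressed as follows:

  1. For $s_i \ge 0$, $i = 1, \dots, m$:

$$\xi_{\mu}(v_i; s_i) = \begin{cases} v_i + 2\mu, & v_i < -2\mu, \\ 0, & -2\mu \le v_i \le 0, \\ v_i, & 0 < v_i < s_i, \\ s_i, & s_i \le v_i \le s_i + 2\mu, \\ v_i - 2\mu, & v_i > s_i + 2\mu. \end{cases} \tag{8}$$

  2. For $s_i < 0$, $i = 1, \dots, m$:

$$\xi_{\mu}(v_i; s_i) = \begin{cases} v_i + 2\mu, & v_i < s_i - 2\mu, \\ s_i, & s_i - 2\mu \le v_i \le s_i, \\ v_i, & s_i < v_i < 0, \\ 0, & 0 \le v_i \le 2\mu, \\ v_i - 2\mu, & v_i > 2\mu. \end{cases} \tag{9}$$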
By writing the proximal algorithm in the form

$$\boldsymbol{\alpha}^{t} = \xi_{\mu}\left(\mathbf{S}\boldsymbol{\alpha}^{t-1} + \mathbf{W}\mathbf{x}; \mathbf{s}\right), \tag{10}$$

and translating (10) into a deep neural network, we obtain a fast multimodal operator referred to as Learned Side-Information-driven iterative soft Thresholding Algorithm (LeSITA). LeSITA has a similar expression to LISTA (5); however, (10) employs the new activation function $\xi_{\mu}$ that integrates side information into the learning process.
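The side-information-driven activation can be sketched as an element-wise function. The snippet below assumes that the regularizer in (4) weights both $\ell_1$ terms equally, so that the scalar proximal problem is the minimization of 0.5·(a − v)² + μ·|a| + μ·|a − s|; the closed form follows from the optimality conditions of that scalar problem and is not copied from the source. The function name `prox_side_info` is hypothetical.

```python
import numpy as np

def prox_side_info(v, s, mu):
    """Element-wise minimizer of 0.5*(a - v)**2 + mu*|a| + mu*|a - s|.

    The output is piecewise linear in v with two flat regions of width 2*mu:
    one at min(0, s) and one at max(0, s).
    """
    v = np.asarray(v, dtype=float)
    s = np.asarray(s, dtype=float)
    lo = np.minimum(0.0, s)   # lower kink (0 if s >= 0, else s)
    hi = np.maximum(0.0, s)   # upper kink (s if s >= 0, else 0)
    return np.where(v < lo - 2 * mu, v + 2 * mu,
           np.where(v <= lo, lo,
           np.where(v < hi, v,
           np.where(v <= hi + 2 * mu, hi, v - 2 * mu))))

v = np.array([-3.0, 0.3, 1.7, 4.0])
print(prox_side_info(v, 1.0, 0.5))   # values: -2.0, 0.3, 1.0, 3.0
```

When the side information is zero, the two flat regions merge and the function reduces to soft thresholding with threshold 2μ; entries that fall between 0 and the side-information value pass through unchanged, which is how agreement with the guidance code is rewarded.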

IV Design Multimodal Convolutional Networks with Deep Unfolding

Fig. 1: The proposed LMCSC model with unfolded recurrent stages. The model computes sparse feature maps of an image given sparse feature maps of the side information. The nonlinear activation function follows the proximal operator given by (8), (9).

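In what follows, we consider that, besides the observations $\mathbf{Y}$ of the target signal, another image modality $\boldsymbol{\Omega}$, correlated with $\mathbf{Y}$, is available. We assume that the two image modalities can be represented by convolutional sparse codes that are similar by means of the $\ell_1$-norm. Specifically, let $\mathbf{Y} = \sum_{i=1}^{k}\mathbf{D}^{Y}_i * \mathbf{U}_i$ be a sparse representation of the observed image with respect to a convolutional dictionary $\boldsymbol{\mathcal{D}}^{Y}$; $\mathbf{D}^{Y}_i$, $i = 1, \dots, k$, denote the atoms of $\boldsymbol{\mathcal{D}}^{Y}$. By employing a convolutional dictionary $\boldsymbol{\mathcal{D}}^{\Omega}$ with atoms $\mathbf{D}^{\Omega}_i$, $i = 1, \dots, k$, the guidance image can be expressed as $\boldsymbol{\Omega} = \sum_{i=1}^{k}\mathbf{D}^{\Omega}_i * \mathbf{Z}_i$ with the convolutional sparse codes $\mathbf{Z}_i$, $i = 1, \dots, k$, obtained as the solution of (3). Then, we can compute the unknown sparse codes of the target modality by solving a problem formulated in a way similar to (4), that is,

$$\min_{\{\mathbf{U}_i\}} \frac{1}{2}\Big\|\mathbf{Y} - \sum_{i=1}^{k}\mathbf{D}^{Y}_i * \mathbf{U}_i\Big\|_F^2 + \lambda\Big(\sum_{i=1}^{k}\|\mathbf{U}_i\|_1 + \sum_{i=1}^{k}\|\mathbf{U}_i - \mathbf{Z}_i\|_1\Big). \tag{11}$$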


There is a correspondence between convolutional and linear sparse codes. If we replace the convolutional dictionary with a matrix with Toeplitz structure, and take into account the linear properties of convolution, then (11) reduces to (4). Specifically, we define $\boldsymbol{\Phi}$ as a sparse dictionary obtained by concatenating the Toeplitz matrices that unroll the atoms of the convolutional dictionary; $\boldsymbol{\alpha}$ and $\mathbf{s}$ take the form of vectorized sparse feature maps of the target and the side information images, respectively. Then, by replacing the convolutional operations in (11) with multiplications, we obtain (4), and the proximal algorithm (7) can be employed to compute convolutional sparse codes.
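The correspondence between convolutional and linear sparse codes can be checked numerically in one dimension. The sketch below builds a Toeplitz matrix for each atom, concatenates them into a single dictionary, and verifies that the matrix-vector product equals the sum of convolutions. The 1D setting, the "full" convolution boundary handling, and the helper name `conv_matrix` are simplifying assumptions made for illustration.

```python
import numpy as np

def conv_matrix(d, n):
    """Toeplitz matrix T with T @ u == np.convolve(d, u) for any u of length n."""
    m = len(d)
    T = np.zeros((n + m - 1, n))
    for j in range(n):
        T[j:j + m, j] = d   # column j holds the atom shifted down by j samples
    return T

rng = np.random.default_rng(0)
n = 6
atoms = [rng.standard_normal(3) for _ in range(2)]   # two 1D atoms
maps = [rng.standard_normal(n) for _ in range(2)]    # their feature maps

Phi = np.hstack([conv_matrix(d, n) for d in atoms])  # concatenated Toeplitz blocks
alpha = np.concatenate(maps)                         # vectorized feature maps

x_conv = sum(np.convolve(d, u) for d, u in zip(atoms, maps))
print(np.allclose(Phi @ alpha, x_conv))              # True
```

The concatenated matrix has as many columns as there are feature-map entries in total, which illustrates why the linear formulation becomes very large when applied to entire images.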

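Nevertheless, transforming (11) to (4) and using (7) for its solution is not computationally efficient. Since CSC deals with the entire image, the dimensionality of (4) becomes too high and the proximal method becomes impractical. We use the correspondence between linear and convolutional representations, and formulate an iterative algorithm that performs convolutions as follows: In the proximal algorithm (7), the matrices $\boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^{T}$, which take the form of concatenated Toeplitz matrices in the convolutional case, are replaced by the convolutional dictionaries $\boldsymbol{\mathcal{D}}^{Y}$ and $\tilde{\boldsymbol{\mathcal{D}}}^{Y}$, respectively. Then, by replacing multiplications with convolutional operations, we can compute the convolutional codes of the target image, given the convolutional codes of the guidance image, by iterating over:

$$\boldsymbol{\mathcal{U}}^{t} = \xi_{\mu}\!\left(\boldsymbol{\mathcal{U}}^{t-1} - \frac{1}{L}\tilde{\boldsymbol{\mathcal{D}}}^{Y} * \left(\boldsymbol{\mathcal{D}}^{Y} * \boldsymbol{\mathcal{U}}^{t-1} - \mathbf{Y}\right); \boldsymbol{\mathcal{Z}}\right), \tag{12}$$

where $\boldsymbol{\mathcal{U}}^{t}$, $\boldsymbol{\mathcal{Z}}$ are tensors stacking the $k$ sparse feature maps of the target and the guidance image, respectively.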

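Equation (12) can be translated into a deep convolutional neural network (CNN). Each stage of the network computes the sparse feature maps according to

$$\boldsymbol{\mathcal{U}}^{t} = \xi_{\mu}\!\left(\boldsymbol{\mathcal{U}}^{t-1} - \boldsymbol{\mathcal{Q}} * \boldsymbol{\mathcal{R}} * \boldsymbol{\mathcal{U}}^{t-1} + \boldsymbol{\mathcal{P}} * \mathbf{Y}; \boldsymbol{\mathcal{Z}}\right), \tag{13}$$

with $\boldsymbol{\mathcal{Q}} \in \mathbb{R}^{r \times r \times c \times k}$, $\boldsymbol{\mathcal{R}} \in \mathbb{R}^{r \times r \times k \times c}$, $\boldsymbol{\mathcal{P}} \in \mathbb{R}^{r \times r \times c \times k}$ learnable convolutional layers with $r \times r$ kernels and $\mu > 0$ a learnable parameter; $c$ is the number of channels of the employed images. The proposed network architecture, depicted in Fig. 1, is referred to as Learned Multimodal Convolutional Sparse Coding (LMCSC). The network can be trained in a supervised manner to map an input image to sparse feature maps. During training, the parameters $\boldsymbol{\mathcal{Q}}$, $\boldsymbol{\mathcal{R}}$, $\boldsymbol{\mathcal{P}}$ and $\mu$ are learned; therefore, the deep LMCSC can achieve high accuracy with only a fraction of computations of the proximal method.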
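A single unfolded stage can be sketched in NumPy under the assumption that it has the convolutional LISTA structure suggested by the text: three convolutional layers (here `P`, `Q`, `R`, hypothetical names), one threshold `mu`, and an activation that takes the guidance feature maps as a second argument. The weights below are random rather than trained, the same weights are reused across stages for brevity (the implementation details later in the paper use untied weights), and the activation is the closed form derived from an equally weighted $\ell_1$-$\ell_1$ regularizer.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded cross-correlation. x: (C_in, H, W); w: (C_out, C_in, r, r), r odd."""
    c_out, c_in, r, _ = w.shape
    p = r // 2
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, H, W))
    for dy in range(r):
        for dx in range(r):
            out += np.einsum("oi,ihw->ohw", w[:, :, dy, dx], xp[:, dy:dy + H, dx:dx + W])
    return out

def prox_side_info(v, s, mu):
    """Element-wise prox of mu*|a| + mu*|a - s| (assumed form of the activation)."""
    lo, hi = np.minimum(0.0, s), np.maximum(0.0, s)
    return np.where(v < lo - 2 * mu, v + 2 * mu,
           np.where(v <= lo, lo,
           np.where(v < hi, v,
           np.where(v <= hi + 2 * mu, hi, v - 2 * mu))))

def lmcsc_stage(U, Y, Z, P, Q, R, mu):
    """One stage: gradient-like update with convolutions, then the SI-driven activation."""
    V = U - conv2d_same(conv2d_same(U, R), Q) + conv2d_same(Y, P)
    return prox_side_info(V, Z, mu)

rng = np.random.default_rng(0)
c, k, r, H, W = 1, 4, 3, 8, 8                 # channels, feature maps, kernel, image size
P = 0.1 * rng.standard_normal((k, c, r, r))   # image -> feature maps
Q = 0.1 * rng.standard_normal((k, c, r, r))   # image -> feature maps
R = 0.1 * rng.standard_normal((c, k, r, r))   # feature maps -> image
Y = rng.standard_normal((c, H, W))            # toy target-modality input
Z = rng.standard_normal((k, H, W))            # toy guidance feature maps

U = np.zeros((k, H, W))
for _ in range(3):                            # three unfolding steps
    U = lmcsc_stage(U, Y, Z, P, Q, R, mu=0.05)
print(U.shape)                                # (4, 8, 8)
```

The guidance maps enter every stage through the activation, so information from the two modalities is fused at each layer rather than only at the output.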

Fig. 2: The Approximate Convolutional Sparse Coding (ACSC) model [48] is used to encode the side information. The nonlinear activation function follows the proximal operator in (6).

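LMCSC uses the convolutional sparse codes $\boldsymbol{\mathcal{Z}}$ of the guidance modality $\boldsymbol{\Omega}$ to compute the convolutional sparse codes $\boldsymbol{\mathcal{U}}$ of the target modality $\mathbf{Y}$. An efficient multimodal convolutional operator should integrate a fast operator for the encoding of the guidance modality. In the models presented next, we obtain $\boldsymbol{\mathcal{Z}}$ using the ACSC operator presented in [48]. ACSC has the form of convolutional LISTA, with the $t$-th layer computing:

$$\boldsymbol{\mathcal{Z}}^{t} = \phi_{\gamma}\!\left(\boldsymbol{\mathcal{Z}}^{t-1} - \boldsymbol{\mathcal{T}} * \boldsymbol{\mathcal{V}} * \boldsymbol{\mathcal{Z}}^{t-1} + \boldsymbol{\mathcal{G}} * \boldsymbol{\Omega}\right), \tag{14}$$

where $\phi_{\gamma}$ is the proximal operator given by (6). The parameters of the convolutional layers $\boldsymbol{\mathcal{T}}$, $\boldsymbol{\mathcal{V}}$ and $\boldsymbol{\mathcal{G}}$, as well as $\gamma$, are learned from data. The architecture of ACSC is depicted in Fig. 2.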

V Deep Multimodal Image Super-Resolution

The proposed LMCSC architecture can be employed to perform multimodal image super-resolution based on a sparsity-driven convolutional model. The proposed model follows principles similar to those of [62]. Specifically, the sparse linear modelling of LR/HR image patches presented in [62] is replaced by a sparse convolutional modelling of the entire LR/HR images, followed by similarity assumptions between the convolutional representations of the LR and HR images. To efficiently integrate information from a second image modality, we also assume that the target and the guidance image modalities are similar by means of the $\ell_1$-norm in the representation domain.

V-A LMCSC-Net


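Our first model proposed for multimodal image super-resolution relies on the following assumption: The LR observation $\mathbf{Y}$ and the HR image $\mathbf{X}$ share the same convolutional sparse feature maps $\mathbf{U}_i$, $i = 1, \dots, k$, under different convolutional dictionaries $\boldsymbol{\mathcal{D}}^{Y}$ and $\boldsymbol{\mathcal{D}}^{X}$, that is, $\mathbf{Y} = \sum_{i=1}^{k}\mathbf{D}^{Y}_i * \mathbf{U}_i$, $\mathbf{X} = \sum_{i=1}^{k}\mathbf{D}^{X}_i * \mathbf{U}_i$, where $\mathbf{D}^{Y}_i$, $\mathbf{D}^{X}_i$ are the atoms of the respective convolutional dictionaries. Given $\boldsymbol{\mathcal{D}}^{Y}$, $\boldsymbol{\mathcal{D}}^{X}$, finding a mapping from $\mathbf{Y}$ to $\mathbf{X}$ is equivalent to computing the sparse feature maps $\mathbf{U}_i$ of the observed LR image $\mathbf{Y}$. The similarity assumption between the target and the guidance image modalities in the representation domain implies that the convolutional sparse codes $\mathbf{Z}_i$ of the guidance HR image $\boldsymbol{\Omega}$ are similar to $\mathbf{U}_i$ by means of the $\ell_1$-norm. Therefore, $\mathbf{U}_i$ can be obtained as the solution of the $\ell_1$-$\ell_1$ minimization problem (11). Based on these assumptions, we build our first model for multimodal image SR using LMCSC as a core component of a deep architecture.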

The proposed model, coined LMCSC-Net, consists of three subnetworks: (i) an LMCSC encoder that produces convolutional latent representations of the input LR image with the aid of side information, (ii) a side information encoder that produces latent representations of the guidance HR image, and (iii) a convolutional decoder that computes the target HR image. The goal of the LMCSC encoder is to learn a convolutional sparse feature map $\boldsymbol{\mathcal{U}}$ of the LR image $\mathbf{Y}$, also shared by the HR image $\mathbf{X}$, using a convolutional sparse feature map $\boldsymbol{\mathcal{Z}}$ as side information, akin to the model presented in Section IV. The LMCSC branch is followed by a convolutional decoder realized by a learnable convolutional dictionary $\boldsymbol{\mathcal{D}}^{X}$. The decoder receives the latent representations $\boldsymbol{\mathcal{U}}$ provided by LMCSC and estimates $\mathbf{X}$ according to $\hat{\mathbf{X}} = \boldsymbol{\mathcal{D}}^{X} * \boldsymbol{\mathcal{U}}$.

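The entire network, depicted in Fig. 3, is trained end-to-end using the mean square error (MSE) loss function:

$$\min_{\boldsymbol{\Theta}} \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2, \tag{15}$$

where $\boldsymbol{\Theta}$ denotes the set of all network parameters, $\mathbf{X}$ is the ground-truth image of the target modality, and $\hat{\mathbf{X}}$ is the estimation computed by the network.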

Fig. 3: The proposed LMCSC-Net, a deep multimodal SR network consisting of an LMCSC encoder, an ACSC side information encoder, and a convolutional decoder.

Different from our previous multimodal image SR design [38] which relies on LeSITA [55], LMCSC-Net has a novel convolutional structure inspired by a different proximal algorithm. The core LMCSC component computes latent representations of the target modality using side information from the guidance modality, performing fusion of information at every layer. Therefore, our approach is different from coupled ISTA [9] which employs one branch of LISTA [13] for each modality and fuses the latent representations only in the last layer.

V-B LMCSC-Net

Fig. 4: The proposed LMCSC-Net, a deep multimodal SR network consisting of an LMCSC component with an additional ACSC branch performing enhancement of the LR/HR mapping.

The model presented in Section V-A learns similar sparse representations of three different image modalities, that is, the input LR image $\mathbf{Y}$, the guidance modality $\boldsymbol{\Omega}$ and the HR image $\mathbf{X}$. Learning representations that mainly encode the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between $\mathbf{Y}$ and $\mathbf{X}$. In other words, the encoding performed by the ACSC branch may result in transferring unrelated information to the LMCSC encoder. As a result, the latent representation of the target modality may not capture the underlying mapping between the LR and HR images in the representation domain. Furthermore, assuming identical latent representations for both LR and HR images limits the performance of the network, especially when the degradation level of the LR observations is high.

In order to address the aforementioned problems, we relax the assumption concerning the similarity between the LR and HR images in the representation domain. Specifically, we assume that the convolutional sparse codes $\boldsymbol{\mathcal{U}}^{X}$ of the HR image $\mathbf{X}$ can be obtained as a non-linear transformation of the respective codes $\boldsymbol{\mathcal{U}}^{Y}$ of the LR image $\mathbf{Y}$, that is, $\boldsymbol{\mathcal{U}}^{X} = \mathcal{F}_{\theta}(\boldsymbol{\mathcal{U}}^{Y})$, where $\mathcal{F}_{\theta}$ is a non-linear function parameterized by $\theta$.

Under this assumption, we build the proposed multimodal SR framework by employing the following components: (i) An LMCSC subnetwork is used to fuse the information from the LR observations and the guidance HR image, providing a first estimation of the target HR image with the aid of side information. (ii) An ACSC subnetwork following the LMCSC subnetwork is used to enhance the transformation between the LR and HR images of the target modality without using side information. The architecture of the proposed model, referred to as LMCSC-Net, is depicted in Fig. 4. The additional ACSC and dictionary layers in LMCSC-Net implement $\mathcal{F}_{\theta}$. The network is trained using the objective (15).

V-C LMCSC-Net with Skip Connections

Considering the significant improvement achieved by residual learning in the training efficiency and the prediction accuracy [19, 30, 68, 51, 50, 69], we enhance the proposed LMCSC-Net with a skip connection, introducing a new model coined LMCSC-ResNet. For the design of LMCSC-ResNet we rely on the assumption that the HR image contains all the low-frequency information from the LR image plus some high-frequency details that can be captured by a non-linear mapping between LR and HR images. By using an identity mapping of the input, the capacity of the network can be assigned to learning the high frequency details, since the low-frequency information is provided by . LMCSC-ResNet learns the non-linear mapping , where . In LMCSC-ResNet, is obtained from the LMCSC-Net. The model architecture is presented in Fig. 5. This model is also trained end-to-end using the objective (15).
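The effect of the skip connection can be illustrated with a toy example. The sketch below assumes that the network output is the (upscaled) input plus a learned correction; the function name `residual_forward`, the synthetic image, and the use of a smooth image as a stand-in for the upscaled LR input are assumptions made only for this illustration.

```python
import numpy as np

def residual_forward(y_up, f):
    """Skip connection: output = identity(input) + correction f(input)."""
    return y_up + f(y_up)

rng = np.random.default_rng(0)
ramp = np.linspace(0.0, 1.0, 16)
low = np.outer(ramp, ramp)                       # smooth, low-frequency content
detail = 0.05 * rng.standard_normal((16, 16))    # high-frequency details
x_hr = low + detail                              # toy HR image
y_up = low                                       # stand-in for the upscaled LR input

# With the identity path in place, the mapping only has to produce the details.
residual_target = x_hr - y_up
print(np.mean(residual_target ** 2) < np.mean(x_hr ** 2))   # True

# A mapping that outputs zeros already returns the upscaled input unchanged.
baseline = residual_forward(y_up, lambda y: np.zeros_like(y))
```

The target of the learned mapping has much lower energy than the full HR image in this example, which is the intuition behind assigning the network capacity to the high-frequency details.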

Fig. 5: The proposed LMCSC-ResNet, a deep multimodal LMCSC-based network with a skip connection.

VI Experiments

Scale CSCN [32] ACSC [48] EDSR [30] SRFBN [29] DJF [28] CoISTA [9] LMCSC-Net LMCSC-Net LMCSC-ResNet
33.96 34.23
32.07 31.94
TABLE I: Super-resolution of NIR images with the aid of RGB images. Performance comparison [in terms of average PSNR (dB)] over all test images for , and upscaling.
NIR/RGB u-0004 u-0006 u-0017 o-0018 u-0020 u-0026 o-0030 u-0050 Average

SDF [16]
CSCN [32]
ACSC [48]
EDSR [30]
SRFBN [29] 37.06 0.9966
DJF [28]
CoISTA [9]
DMSC [38]

0.9982 0.9980 0.9963 38.29
LMCSC-Net 40.96 0.9988
LMCSC-ResNet 37.99 0.9982 43.84 40.05 0.9989 41.05 36.15 0.9965 39.04 0.9971
SDF [16]
JBF [27]
JFSM [46]
GF [18]
CSCN [32]
ACSC [48]
EDSR [30]
SRFBN [29]
DJF [28]
CoISTA [9]
DMSC [38]
LMCSC-Net 33.89 31.21 0.9785 33.66
LMCSC-ResNet 0.9895 38.77 0.9915 36.54 0.9836 34.78 0.9920 37.34 0.9923 0.9784 30.10 0.9773 34.49 0.9853

SDF [16]
JBF [27]
JFSM [46]
GF [18]
CSCN [32]
ACSC [48]
EDSR [30]
SRFBN [29]
DJF [28]
CoISTA [9]
DMSC [38]
LMCSC-ResNet 32.19 0.9846 36.52 0.9860 34.65 0.9763 33.24 0.9877 35.75 0.9889 29.58 0.9630 31.59 0.9638 28.94 0.9624 32.81 0.9766
TABLE II: Super-resolution of NIR images with the aid of RGB images. Performance comparison [in terms of PSNR (dB) and SSIM] for selected test images for , and upscaling.

We apply the proposed models to different upsampling tasks, that is, super-resolution of near-infrared (NIR) images, depth upsampling and super-resolution of multi-spectral data, using RGB images as side information. We compare our models against state-of-the-art single-modal and multimodal methods showing the superior performance of the proposed approach. Before demonstrating our experimental results, we present the employed datasets and report implementation details.

VI-A Datasets

VI-A1 EPFL RGB-NIR dataset

NIR images are acquired at a low resolution due to the high cost per pixel of a NIR sensor compared to an RGB sensor. We employ the EPFL RGB-NIR dataset and apply our models to super-resolve an LR NIR image with the aid of an HR RGB image. The dataset contains spatially aligned NIR/RGB image pairs. Our training set contains approximately cropped image pairs extracted from images. Each training image is of size pixels; the size is chosen with respect to memory requirements and computational complexity. We also create a testing set containing image pairs; testing is performed on an entire image.

The NIR images consist of one channel. An LR version of a NIR image is generated by blurring and downscaling the ground truth HR version. We convert the RGB images to YCbCr and only utilize the luminance channel as the side information. Following [10], we apply bicubic interpolation as a preprocessing step to upscale the LR input such that the input and output images are of the same size.
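The extraction of the luminance channel used as side information can be sketched as follows. The snippet assumes the ITU-R BT.601 luma weights and a full-range [0, 1] representation; the exact YCbCr variant used by the authors is not stated in this text, and `rgb_to_luma` is a hypothetical helper name.

```python
import numpy as np

def rgb_to_luma(rgb):
    """rgb: float array of shape (H, W, 3) in [0, 1]; returns the (H, W) luma channel."""
    weights = np.array([0.299, 0.587, 0.114])   # ITU-R BT.601 luma weights
    return rgb @ weights

rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [1.0, 1.0, 1.0]   # white pixel
rgb[0, 1] = [1.0, 0.0, 0.0]   # pure red pixel
y = rgb_to_luma(rgb)
print(y.shape)                # (2, 2)
```

Using a single-channel guidance image keeps the side information encoder the same size as the encoder of the single-channel target modality.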

VI-A2 NYU v2 RGB-D dataset

Depth cameras like Microsoft Kinect and time-of-flight (ToF) cameras only provide low-resolution depth images. Therefore, depth upsampling is a necessary task for many vision applications. We apply our models for depth upsampling with the aid of RGB images, using the NYU v2 RGB-D dataset [41]. The dataset contains RGB images with their depth maps. Similar to [28], we use the first images for training, and the remaining for testing.

VI-A3 Columbia multi-spectral database

The third dataset that we use to evaluate the proposed models is the Columbia multi-spectral database, which contains spectral reflectance data and RGB images. For testing, we reserve images from the  nm band and randomly select images from different bands; the rest are used for training.

VI-B Implementation Details

All networks are designed with three unfolding steps for the target (LMCSC) and the side information (ACSC) encoders. The number of unfolding steps is chosen after taking into account the trade-off between the computational complexity and the reconstruction accuracy; for instance, by increasing the unfolding steps to five, the improvement in the average PSNR is less than  dB while the execution time is almost higher. In the LMCSC-ResNet, the ACSC branch employed for the nonlinear mapping of the target signal is designed with one unfolding step.

We empirically set the size of the network parameters , , and to ; the size of and are set to . The size of the convolutional dictionaries for reconstruction is . Note that a convolutional layer of size consists of convolutional filters with kernel size and channels. We use untied weights at every unfolding step, i.e., the -th layer of LMCSC and ACSC subnetworks is realized by the independent variables , and , , respectively. The weights of all layers are initialized randomly using the Gaussian distribution with standard deviation equal to . The parameters and of the proximal operators are both initialized to . We train the network using the Adam optimizer.

We notice that the complexity of our networks is dominated by the LeSITA activation layers in the LMCSC block, and an implementation based on (8) and (9) is not efficient. In order to address this issue, we rewrite the proximal operator in (8), (9) as follows:



is the Rectified Linear Unit (ReLU) function. This form of the proximal operator results in a % faster implementation, and we use this version in all of the experiments.
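One possible way to express such a piecewise-linear activation with ReLU units is sketched below. The decomposition is derived here from the piecewise definition of the prox of an equally weighted $\ell_1$-$\ell_1$ regularizer and is not necessarily the expression used by the authors, which is not visible in this text. The snippet compares the ReLU form against the branch-based form on random inputs.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def prox_piecewise(v, s, mu):
    """Branch-based form: prox of mu*|a| + mu*|a - s|, evaluated element-wise."""
    lo, hi = np.minimum(0.0, s), np.maximum(0.0, s)
    return np.where(v < lo - 2 * mu, v + 2 * mu,
           np.where(v <= lo, lo,
           np.where(v < hi, v,
           np.where(v <= hi + 2 * mu, hi, v - 2 * mu))))

def prox_relu(v, s, mu):
    """Same function as a sum of four ReLUs, one per change of slope."""
    lo, hi = np.minimum(0.0, s), np.maximum(0.0, s)
    return (lo
            + relu(v - lo) - relu(v - hi)   # identity between the two kinks
            + relu(v - hi - 2 * mu)         # slope 1 above the upper flat region
            - relu(lo - 2 * mu - v))        # slope 1 below the lower flat region

rng = np.random.default_rng(0)
v = 3.0 * rng.standard_normal(10000)
s = rng.standard_normal(10000)
print(np.allclose(prox_relu(v, s, 0.3), prox_piecewise(v, s, 0.3)))   # True
```

The ReLU form avoids data-dependent branching, so it maps directly onto the element-wise tensor operations that deep learning frameworks execute efficiently.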

VI-C Performance Comparison

GF [18]   JBF [27]   SDF [16]   DJF [28]   Gu et al. [14]   LMCSC-Net   LMCSC-Net   LMCSC-ResNet
4.04      2.31       3.04       1.97       1.56             1.49        1.38        1.45
7.34      4.12       5.67       3.39       2.99             2.67        2.58        2.61
12.23     6.98       9.97       5.63       5.24             5.01        4.93        4.88

TABLE III: Depth upsampling with the aid of RGB images. Performance comparison [in terms of average RMSE] over test images from the NYU v2 RGB-D dataset for , , and upsampling.

NYU-1 NYU-2 NYU-3 NYU-4 NYU-5 NYU-6 NYU-7 NYU-8 NYU-9 NYU-10 Average

DJF [28]
Gu et al. [14]
LMCSC-Net 1.35 1.02 0.88 1.04 1.45 1.16 1.28 1.22 1.08 1.39 1.19

DJF [28]
Gu et al. [14]
LMCSC-Net 2.86 2.14 2.29
LMCSC-ResNet 2.09 1.93 1.42 2.21 2.22 1.75 2.66 2.18

DJF [28]
Gu et al. [14]
LMCSC-Net 4.59
LMCSC-Net 3.80 3.84 5.82
LMCSC-ResNet 3.89 3.16 4.73 3.54 4.18 3.37 4.15

TABLE IV: Depth upsampling with the aid of RGB images. Performance comparison [in terms of RMSE] over selected test images from the NYU v2 dataset for , and upsampling.

Chart toy Egyptian Feathers Glass tiles Jelly beans Oil Paintings Paints Average

SDF [16]
JBF [27]
JFSM [46]
GF [18]
EDSR [30]
SRFBN [29]
DJF [28]
CoISTA [9]
LMCSC-Net 38.47 0.9962 38.32 0.9953
LMCSC-Net 33.73 0.9936 36.89
LMCSC-ResNet 43.45 0.9958