Multimodal Image Super-resolution via Deep Unfolding with Side Information

10/18/2019 · by Iman Marivani, et al.

Deep learning methods have been successfully applied to various computer vision tasks. However, existing neural network architectures do not per se incorporate domain knowledge about the addressed problem, thus, understanding what the model has learned is an open research topic. In this paper, we rely on the unfolding of an iterative algorithm for sparse approximation with side information, and design a deep learning architecture for multimodal image super-resolution that incorporates sparse priors and effectively utilizes information from another image modality. We develop two deep models performing reconstruction of a high-resolution image of a target image modality from its low-resolution variant with the aid of a high-resolution image from a second modality. We apply the proposed models to super-resolve near-infrared images using as side information high-resolution RGB images. Experimental results demonstrate the superior performance of the proposed models against state-of-the-art methods including unimodal and multimodal approaches.


I Introduction

Image super-resolution (SR) refers to the recovery of a high-resolution (HR) image from its low-resolution (LR) version. The problem is severely ill-posed and a common approach for its solution considers the use of sparse priors [26, 17, 25]. For example, the method presented in [26] is based on the assumption that the LR and HR images have joint sparse representations with respect to some dictionaries. Nevertheless, sparsity based methods result in complex optimization problems, which is a significant drawback in large-scale settings.

Accounting for the high computational cost of numerical optimization algorithms, deep neural networks have been successfully applied to image SR, achieving state-of-the-art performance [8, 9, 11, 13, 15]. Deep learning methods rely on large datasets to learn a non-linear transformation between the LR and HR image spaces. These methods are efficient at inference, as they shift the computational load to the training phase. However, most existing deep models do not integrate domain knowledge about the problem and cannot provide theoretical justifications for their effectiveness. A different approach was followed in the recent work of [16], which relies on a deep unfolding architecture referred to as LISTA [5]. LISTA introduced the idea of translating an iterative numerical algorithm for sparse approximation into a feed-forward neural network. By integrating LISTA into their network architecture, the authors of [16] managed to incorporate sparse priors into the deep learning solution.

In many image processing and machine vision applications a reference HR image from a second modality is often available [10, 14]. The recovery of an HR image from its LR variant with the aid of another HR image from a different image modality is referred to as multimodal image SR [14, 20]. Several studies have investigated sparse representation models as well as deep learning methods for multimodal image SR [20, 14, 12, 10, 7, 6].

In this paper, we propose a deep network architecture that incorporates sparse priors and effectively utilizes information from another image modality to perform multimodal image SR. Inspired by [16], the proposed deep learning model relies on a deep unfolding method for sparse approximation with side information. Our contributions are threefold:

  1. We address multimodal image SR as a problem of sparse approximation with side information and formulate an appropriate $\ell_1$-$\ell_1$ minimization problem for its solution.

  2. We design a core neural network model for patch-based image SR, employing a recently proposed deep sparse approximation model [23], referred to as LeSITA, to integrate information from another HR image modality in the solution.

  3. We propose two novel deep neural network architectures for multimodal image SR that employ the LeSITA-based SR model, achieving superior performance against state-of-the-art methods.

The proposed models are used to super-resolve near-infrared (NIR) LR images given RGB HR images. The performance of the proposed models is demonstrated by experimental results.

The rest of the paper is organized as follows: In Section II we present the necessary background and related work. Section III explains the details of the proposed model architectures for multimodal image SR. Section IV presents the experimental results, and Section V concludes the paper.

II Background and Related Work

II-A Single Image SR with Sparse Priors

Let $Y$ be an LR image obtained from an HR image $X$. According to [26], the transformation of an HR image to an LR image can be modeled as a blurring and downscaling process expressed by $Y = SHX$, where $H$ and $S$ denote blurring and downscaling operators, respectively. Under this assumption, an $n$-dimensional (vectorized) patch $x_l$ from the bicubic-upscaled LR image exhibits a common sparse representation $\alpha \in \mathbb{R}^k$ with the corresponding patch $x_h$ from the HR image w.r.t. different over-complete dictionaries $D_l, D_h \in \mathbb{R}^{n \times k}$, that is, $x_l = D_l\alpha$ and $x_h = D_h\alpha$. The dictionaries $D_l$, $D_h$ can be jointly learned with coupled dictionary learning techniques [26]. Therefore, the problem of computing the HR patch $x_h$ given the LR patch $x_l$ can be formulated as

$\min_{\alpha}\ \tfrac{1}{2}\|x_l - D_l\alpha\|_2^2 + \lambda\|\alpha\|_1 \qquad (1)$

where $\lambda$ is a regularization parameter, and $\|\cdot\|_1$ is the $\ell_1$-norm, which promotes sparsity. Several numerical optimization methods have been proposed for the solution of (1) [22].

In order to account for the high computational cost of numerical algorithms, the seminal work in [5] translated a proximal algorithm, namely ISTA [4], into a neural network form referred to as LISTA. Each layer of LISTA implements an iteration of ISTA according to

$\alpha^{k+1} = \phi_\theta\big(Wx + S\alpha^k\big) \qquad (2)$

where $W$, $S$, and $\theta$ are learnable parameters, and $\phi_\theta$ is the soft thresholding operator [4] expressed by the component-wise shrinkage function $\phi_\theta(u_i) = \mathrm{sign}(u_i)\max(|u_i| - \theta, 0)$, which acts as a nonlinear activation function. $W$ and $S$ are initialized as $W = \frac{1}{L}D^T$, $S = I - \frac{1}{L}D^TD$, with $D$ the dictionary and $L$ an appropriate constant. The network is depicted in Fig. 1.
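To make the unfolding concrete, the following is a minimal sketch of a LISTA encoder in PyTorch. The dimensions, the number of layers, and the tying of $W$, $S$, $\theta$ across layers are illustrative assumptions, not the configuration used in [5] or [16].

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unfolded ISTA, eq. (2): each layer computes alpha <- phi_theta(W x + S alpha)."""
    def __init__(self, n=64, k=128, num_layers=3):
        super().__init__()
        self.W = nn.Linear(n, k, bias=False)  # plays the role of (1/L) D^T
        self.S = nn.Linear(k, k, bias=False)  # plays the role of I - (1/L) D^T D
        self.theta = nn.Parameter(torch.full((k,), 0.1))  # learnable threshold
        self.num_layers = num_layers

    def soft(self, u):
        # phi_theta(u_i) = sign(u_i) * max(|u_i| - theta, 0)
        return torch.sign(u) * torch.relu(u.abs() - self.theta)

    def forward(self, x):
        alpha = self.soft(self.W(x))  # first layer, with alpha^0 = 0
        for _ in range(self.num_layers - 1):
            alpha = self.soft(self.W(x) + self.S(alpha))
        return alpha
```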

By employing the model presented in [26] and leveraging the fast computation of sparse codes performed by LISTA, the authors of [16] proposed a deep neural network model for single-modal image super-resolution that incorporates sparse priors.


Fig. 1: Deep learning for sparse coding with LISTA [5].

II-B Sparse Coding with Side Information via Deep Learning

The deep unfolding idea presented in [5], also explored in [24, 3, 2], introduced a new methodology for the design of neural networks, enabling the network structure to incorporate domain knowledge about the addressed problem. Following similar principles, we recently proposed a deep learning architecture to solve the sparse approximation problem with the aid of side information [23]. Our approach relies on a proximal algorithm for $\ell_1$-$\ell_1$ minimization. Specifically, suppose that we want to find a sparse approximation $\alpha \in \mathbb{R}^k$ of a signal $x \in \mathbb{R}^n$ w.r.t. an over-complete dictionary $D \in \mathbb{R}^{n \times k}$, given a side information signal $s \in \mathbb{R}^k$ that is correlated to $\alpha$. Then, by solving the $\ell_1$-$\ell_1$ minimization problem

$\min_{\alpha}\ \tfrac{1}{2}\|x - D\alpha\|_2^2 + \lambda\big(\|\alpha\|_1 + \|\alpha - s\|_1\big) \qquad (3)$

we obtain a solution that is of higher accuracy compared to (1), as long as certain conditions concerning the similarity between $\alpha$ and $s$ hold [19, 18].

A proximal algorithm that solves (3) performs the following iterations [23, 1]:

$\alpha^{k+1} = \xi_\mu\Big(\alpha^k + \tfrac{1}{L}D^T\big(x - D\alpha^k\big);\, s\Big) \qquad (4)$

where $L$ is the Lipschitz constant of the gradient of $f(\alpha) = \tfrac{1}{2}\|x - D\alpha\|_2^2$ and $\mu = \lambda/L$. $\xi_\mu(\cdot\,; s)$ is the proximal operator, applied component-wise to $u = \alpha^k + \tfrac{1}{L}D^T(x - D\alpha^k)$, expressed by:

  1. for $s_i \ge 0$:

     $\xi_\mu(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < -2\mu \\ 0, & -2\mu \le u_i \le 0 \\ u_i, & 0 < u_i < s_i \\ s_i, & s_i \le u_i \le s_i + 2\mu \\ u_i - 2\mu, & u_i > s_i + 2\mu \end{cases} \qquad (5)$

  2. for $s_i < 0$:

     $\xi_\mu(u_i; s_i) = \begin{cases} u_i + 2\mu, & u_i < s_i - 2\mu \\ s_i, & s_i - 2\mu \le u_i \le s_i \\ u_i, & s_i < u_i < 0 \\ 0, & 0 \le u_i \le 2\mu \\ u_i - 2\mu, & u_i > 2\mu \end{cases} \qquad (6)$
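A direct implementation of this operator follows from (5) and (6). The sketch below (PyTorch, assuming a scalar $\mu$) merges the two cases into a single element-wise rule by setting lo = min(0, s_i) and hi = max(0, s_i), which reproduces both equations.

```python
import torch

def prox_l1l1(u, s, mu):
    # Proximal operator xi_mu(u; s) of mu * (|z| + |z - s|), element-wise.
    # With lo = min(0, s_i) and hi = max(0, s_i), the five intervals of
    # eq. (5) (s_i >= 0) and eq. (6) (s_i < 0) collapse into one rule.
    lo = torch.minimum(s, torch.zeros_like(s))
    hi = torch.maximum(s, torch.zeros_like(s))
    return torch.where(u < lo - 2 * mu, u + 2 * mu,
           torch.where(u <= lo, lo,
           torch.where(u < hi, u,
           torch.where(u <= hi + 2 * mu, hi, u - 2 * mu))))
```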

By setting $W = \frac{1}{L}D^T$ and $S = I - \frac{1}{L}D^TD$, (4) takes the form:

$\alpha^{k+1} = \xi_\mu\big(Wx + S\alpha^k;\, s\big) \qquad (7)$

A feed-forward neural network performing operations according to (7) can learn sparse codes with the aid of side information. $W$, $S$, and $\mu$ are parameters learned from data. Compared to LISTA [5], the network in [23], which we call Learned Side-Information-driven iterative soft Thresholding Algorithm (LeSITA), incorporates a new activation function $\xi_\mu$ that integrates the side information signal into the sparse representation learning process.
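Unfolding (7) then gives a LeSITA encoder analogous to the LISTA sketch above. The sketch below reuses prox_l1l1 from the previous snippet; the dimensions and layer count are again illustrative placeholders.

```python
import torch
import torch.nn as nn

class LeSITA(nn.Module):
    """Unfolded l1-l1 proximal iterations, eq. (7): alpha <- xi_mu(W x + S alpha; s)."""
    def __init__(self, n=64, k=128, num_layers=3):
        super().__init__()
        self.W = nn.Linear(n, k, bias=False)  # plays the role of (1/L) D^T
        self.S = nn.Linear(k, k, bias=False)  # plays the role of I - (1/L) D^T D
        self.mu = nn.Parameter(torch.tensor(0.1))  # learnable threshold
        self.num_layers = num_layers

    def forward(self, x, s):
        alpha = prox_l1l1(self.W(x), s, self.mu)  # first layer, alpha^0 = 0
        for _ in range(self.num_layers - 1):
            alpha = prox_l1l1(self.W(x) + self.S(alpha), s, self.mu)
        return alpha
```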


Fig. 2: Deep learning for sparse coding with side information using LeSITA [23].

III The Proposed Method

III-A Multimodal Image SR with Sparse Priors

The problem of multimodal image super-resolution concerns the reconstruction of an HR image $X$ from an LR image $Y$, given an HR reference or guidance image $G$ from another modality. In this work, we utilize the reference image as side information and leverage our previous work [23] to build a deep network performing multimodal image SR.

We follow the sparsity-based model presented in [26] and assume that a (vectorized) patch $x_l$ from the bicubic-upscaled LR image and a patch $x_h$ from the HR image share the same sparse representation $\alpha$ under over-complete dictionaries $D_l$ and $D_h$, respectively. If the images of the target and the guidance modalities are highly correlated, we can also assume that the reference patch $z$ has a sparse representation $s$ under a dictionary $D_g$, with $s$ similar to $\alpha$, for example, in terms of the $\ell_1$ norm. Then, the multimodal image super-resolution problem can be formulated as an $\ell_1$-$\ell_1$ minimization problem of the form (3).

By employing LeSITA to solve (3), we can design an end-to-end multimodal deep learning architecture to perform super-resolution of the input LR image $Y$ with the aid of a reference HR image $G$. The network architecture incorporates sparse priors and exploits the correlation between the two available modalities. The proposed framework is presented next.


Fig. 3: The proposed DMSC model for multimodal image SR consists of (i) a LeSITA encoder computing a latent representation of the LR/HR images of the main modality, using side information from the guidance modality provided by (ii) a LISTA encoder. A linear decoder recovers the HR patch from the latent representation. Convolutional layers at the input and output perform patch extraction and patch aggregation operations, respectively.

III-B DMSC: Deep Multimodal Sparse Coding Network

LeSITA can learn sparse codes of a target image modality using side information from another correlated image modality. The side information needs to be a sparse signal similar to the target sparse code. To obtain sparse codes of the guidance modality, our architecture also includes a LISTA subnetwork. The proposed core model consists of the following three components: (i) a LeSITA encoder that computes a sparse representation of an LR image patch of the target modality using side information, (ii) a LISTA subnetwork that produces a sparse representation of the available HR patch from the guidance image, and (iii) a linear decoder that reconstructs the HR image patch of the main modality using the sparse representation obtained from LeSITA.

LeSITA computes a sparse representation $\alpha$ of the LR patch according to (7). LISTA accepts as input the reference patch $z$ and performs a nonlinear transformation according to (2) to produce a side information signal $s$ for LeSITA. Given the sparse representation $\alpha$ produced by LeSITA, the HR patch can be recovered by a linear decoder according to $\hat{x}_h = D_h\alpha$, where $D_h$ is a learnable dictionary. By training the network end-to-end, an LR/HR transformation that relies on the joint representations provided by LeSITA can be learned.

The proposed core model successfully performs multimodal image SR at a patch level. Nevertheless, our goal is to design a network that accepts as input the entire images $Y$ and $G$ and outputs the HR image $\hat{X}$. To this end, we add three more layers to the network as follows. A convolutional layer is added before the LeSITA encoder to extract feature vectors from overlapping patches of the LR image. A similar layer is added before the LISTA branch to ensure that the side information patches stay aligned with the LR patches. Finally, a convolutional layer is added after the decoder to aggregate the reconstructed HR patches and form the super-resolved image $\hat{X}$. We present this model in Fig. 3 and refer to it as Deep Multimodal Sparse Coding network (DMSC).
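A DMSC forward pass could then be sketched as follows, reusing the LISTA and LeSITA modules from the previous sketches. The filter sizes, feature dimensions, and single-channel inputs are placeholders; the text above fixes the structure, not these hyperparameters.

```python
import torch
import torch.nn as nn

class DMSC(nn.Module):
    """(i) LeSITA encoder for the target modality, (ii) LISTA encoder producing
    side information from the guidance modality, (iii) linear decoder D_h,
    wrapped by patch-extraction and patch-aggregation convolutional layers."""
    def __init__(self, n=64, k=128):
        super().__init__()
        self.extract_y = nn.Conv2d(1, n, 7, padding=3)  # patch extraction, LR input
        self.extract_g = nn.Conv2d(1, n, 7, padding=3)  # patch extraction, guidance
        self.lista = LISTA(n, k)                        # side-information branch
        self.lesita = LeSITA(n, k)                      # target-modality encoder
        self.decode = nn.Linear(k, n, bias=False)       # learnable dictionary D_h
        self.aggregate = nn.Conv2d(n, 1, 7, padding=3)  # patch aggregation

    def forward(self, y, g):
        fy = self.extract_y(y).permute(0, 2, 3, 1)  # (B,H,W,n) feature vectors
        fg = self.extract_g(g).permute(0, 2, 3, 1)
        s = self.lista(fg)                          # sparse side information
        alpha = self.lesita(fy, s)                  # joint sparse code
        xh = self.decode(alpha).permute(0, 3, 1, 2) # reconstructed HR patches
        return self.aggregate(xh)                   # super-resolved image
```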

The proposed network can be trained end-to-end using the mean square error (MSE) loss function:

$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\big\|X_i - \hat{X}_i\big\|_2^2 \qquad (8)$

where $\Theta$ denotes the set of all network parameters, $N$ is the number of training samples, $X_i$ is the corresponding ground-truth HR image of the target modality, and $\hat{X}_i$ is the super-resolved estimation computed by the network.
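Training then reduces to minimizing (8) with a stochastic optimizer. A minimal sketch, assuming a hypothetical data loader that yields (LR input, HR guidance, ground-truth HR target) triplets and an illustrative learning rate:

```python
import torch
import torch.nn as nn

model = DMSC()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

for y, g, x in loader:  # LR NIR input, HR guidance luminance, ground-truth HR NIR
    optimizer.zero_grad()
    loss = mse(model(y, g), x)  # objective (8)
    loss.backward()
    optimizer.step()
```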


Fig. 4: The proposed DMSC+ model for multimodal image SR consists of the base DMSC model and a LISTA subnetwork with a linear decoder. Additional convolutional layers perform patch extraction (as in DMSC) and patch aggregation.

III-C DMSC+: Deep Multimodal Image SR Network

The DMSC network presented in Section III-B learns joint representations of three different images, that is, the input LR image $Y$, the guidance image $G$, and the HR image $X$. Learning representations that encode only the common information among the different modalities is critical for the performance of the model. Nevertheless, some information from the guidance modality may be misleading when learning a mapping between the LR and HR versions of the target modality. In other words, the encoding performed by the LISTA branch may transfer unrelated information to the LeSITA encoder. As a result, the latent representation of the target modality may not capture the underlying mapping from the LR space to the HR space.

As the performance of the model relies on the learned LR/HR transformation of the target modality, we present an architecture equipped with an uncoupling component that focuses on learning the LR/HR transformation without using side information. The proposed framework consists of two different subnetworks: (i) a DMSC subnetwork performing fusion of the information of the different image modalities, and (ii) a subnetwork for the enhancement of the LR/HR transformation. The second is realized by a LISTA encoder followed by a linear decoder and includes convolutional layers for patch extraction and aggregation so as to operate on the entire image. The proposed deep multimodal framework, referred to as DMSC+, is depicted in Fig. 4. The network is trained using the objective (8), as sketched below.
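Under the same assumptions as the earlier sketches, the enhanced model could be written as follows. The additive fusion of the two branch outputs is an illustrative choice on our part, as is everything not fixed by the description above.

```python
import torch
import torch.nn as nn

class DMSCPlus(nn.Module):
    """DMSC subnetwork fused with an unguided LISTA/decoder branch that
    refines the LR-to-HR mapping of the target modality alone."""
    def __init__(self, n=64, k=128):
        super().__init__()
        self.dmsc = DMSC(n, k)                          # multimodal fusion branch
        self.extract = nn.Conv2d(1, n, 7, padding=3)    # patch extraction
        self.lista = LISTA(n, k)                        # unguided encoder
        self.decode = nn.Linear(k, n, bias=False)       # linear decoder
        self.aggregate = nn.Conv2d(n, 1, 7, padding=3)  # patch aggregation

    def forward(self, y, g):
        f = self.extract(y).permute(0, 2, 3, 1)
        xh = self.decode(self.lista(f)).permute(0, 3, 1, 2)
        return self.dmsc(y, g) + self.aggregate(xh)  # fuse the two branches
```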


Image     CSCN    ACSC    DJF     DMSC    DMSC+
u-0006    39.47   39.78   41.52   41.79   43.21
u-0017    36.76   36.64   38.65   39.39   40.41
o-0018    33.98   34.26   34.78   36.02   37.90
u-0026    32.94   33.11   33.15   33.98   34.96
o-0030    33.34   33.32   35.67   36.32   37.73
u-0050    33.31   33.39   32.60   33.10   33.78
Average   34.97   35.09   36.07   36.85   37.99

Table I: Performance comparison [in terms of PSNR (dB)] for ×2 SR upscaling (results for scale ×2 are not reported for CDLSR [20]).

Image     CSCN    ACSC    DJF     CDLSR   DMSC    DMSC+
u-0006    32.60   32.61   36.04   36.79   37.24   37.82
u-0017    31.68   31.66   34.18   35.27   35.04   35.75
o-0018    27.28   27.42   30.72   33.01   32.30   32.91
u-0026    27.91   27.92   29.21   30.35   30.12   30.40
o-0030    27.72   27.66   31.27   32.71   32.30   32.66
u-0050    28.20   27.80   28.58   29.37   29.39   29.64
Average   29.24   29.18   31.67   32.92   32.73   33.19

Table II: Performance comparison [in terms of PSNR (dB)] for ×4 SR upscaling.

Image     CSCN    ACSC    DJF     CDLSR   DMSC    DMSC+
u-0006    29.94   29.97   34.92   34.15   35.43   35.74
u-0017    29.53   29.48   32.80   32.98   33.09   33.55
o-0018    24.57   24.70   29.92   31.03   30.44   31.34
u-0026    25.79   25.97   28.38   28.88   28.77   29.01
o-0030    25.86   25.91   30.00   30.52   30.37   30.61
u-0050    26.71   26.43   27.64   28.37   28.27   28.45
Average   27.07   27.08   30.62   30.99   31.06   31.45

Table III: Performance comparison [in terms of PSNR (dB)] for ×6 SR upscaling.

IV Experiments

In this section, we first report the implementation details of the proposed DMSC and DMSC+ models. Then, we present experimental results on multimodal image SR.

The patch extraction convolutional layers are realized with linear filters that extract feature vectors from overlapping patches of the LR image and the HR side information. Each feature vector is then processed by the corresponding LeSITA or LISTA encoder to produce a sparse representation of the LR input and the side information, respectively. The patch extractor of the enhancement branch of DMSC+ is realized in a similar way. Each patch aggregation layer contains one filter that builds up the super-resolved NIR image from the computed patches, and the linear decoder layers are realized by linear filters that recover the HR patches. We note that all convolutional layers use padding such that the networks preserve the spatial size of the input. We use learnable scalars for the parameters $\theta$ and $\mu$ of the proximal operators, and we initialize the convolutional and linear filters with random weights drawn from a zero-mean Gaussian distribution.

We provide experimental results for the DMSC and DMSC+ models, and compare their performance with existing single-modal and multimodal SR methods. In our experiments, we employ the EPFL RGB-NIR dataset (https://ivrl.epfl.ch/supplementary_material/cvpr11/), which includes spatially aligned RGB and near-infrared (NIR) image pairs capturing the same scenes across different landscapes. Taking into account the high cost per pixel of NIR cameras, we want to super-resolve LR NIR images using the corresponding HR RGB image of the same scene as side information. A preprocessing step involves upscaling the NIR LR image to the desired resolution using bicubic interpolation, which results in a blurry image given as input to the model. We convert the RGB images to YCbCr and use only the luminance channel as side information. The training dataset consists of NIR/RGB image pairs; due to memory and computational limitations, we train our models on cropped image patches. We reserve several image pairs for testing; a test image is identified by a letter "u" or "o", referring to the folders urban and oldbuilding in the dataset, and a code "00xx".
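The preprocessing described above can be sketched with OpenCV as follows. The file names and scale factor are placeholders; note that OpenCV names the conversion YCrCb, with the luminance still in channel 0.

```python
import cv2

scale = 4  # illustrative upscaling factor
nir = cv2.imread("nir_lr.png", cv2.IMREAD_GRAYSCALE)  # LR NIR image
rgb = cv2.imread("rgb_hr.png")                        # HR RGB guidance

# Bicubic-upscale the LR NIR image to the target resolution
# (this blurry image is the network input).
nir_up = cv2.resize(nir, (nir.shape[1] * scale, nir.shape[0] * scale),
                    interpolation=cv2.INTER_CUBIC)

# Convert the HR RGB guidance to YCrCb and keep only the luminance channel.
luma = cv2.cvtColor(rgb, cv2.COLOR_BGR2YCrCb)[:, :, 0]
```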

We train the proposed models for three SR scales, ×2, ×4, and ×6, by minimizing the objective (8) with the ADAM optimizer. We compare the proposed models with (i) a coupled dictionary learning (CDLSR) method [20], (ii) the deep joint image filtering (DJF) method [14], (iii) an approximate convolutional sparse coding network (ACSC) [21], and (iv) a cascaded sparse coding network (CSCN) [16]. CDLSR and DJF perform multimodal image SR using sparse coding and convolutional neural networks, respectively. ACSC and CSCN are unimodal neural networks and do not use information from another image modality. Results in terms of Peak Signal-to-Noise Ratio (PSNR) for the different scales are presented in Tables I, II, and III. As can be seen, the DMSC model achieves state-of-the-art performance at most scaling factors. For instance, the average gains over the best competing method for the ×2 and ×6 upscaling factors are 0.78 dB and 0.07 dB, respectively. However, for scale ×4 the average PSNR is 0.19 dB lower than that of the best previous method. DMSC+ always outperforms existing techniques in terms of average PSNR and exhibits the best values for most of the testing images at all scales. A visual example presented in Fig. 5 corroborates our numerical results.

Fig. 5: ×6 upscaling for (a) the test image "u0017" (ground truth) with (b) bicubic, (c) CSCN [16], (d) ACSC [21], (e) DJF [14], and (f) DMSC+. Results for CDLSR [20] are not presented as the code is not available.

V Conclusions

We developed two novel deep multimodal models, namely DMSC and DMSC+, for the super-resolution of an LR image of a target modality with the aid of an HR image from another modality. The proposed design relies on the unfolding of an iterative algorithm for sparse approximation with side information. The architecture incorporates sparse priors and effectively integrates the available side information. We applied the proposed models to super-resolve NIR images using RGB images as side information. We compared our models with existing single-modal and multimodal designs, showing their superior performance.

References

  • [1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1), pp. 1–106.
  • [2] C. Bertocchi, E. Chouzenoux, M.-C. Corbineau, J.-C. Pesquet, and M. Prato (2018) Deep unfolding of a proximal interior point method for image restoration. CoRR abs/1812.04276.
  • [3] M. Borgerding, P. Schniter, and S. Rangan (2017) AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing 65(16), pp. 4293–4308.
  • [4] I. Daubechies, M. Defrise, and C. De Mol (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57(11), pp. 1413–1457.
  • [5] K. Gregor and Y. LeCun (2010) Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML).
  • [6] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang (2017) Learning dynamic guidance for depth image enhancement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] B. Ham, M. Cho, and J. Ponce (2018) Robust guided image filtering using nonconvex potentials. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1), pp. 192–207.
  • [8] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang (2018) Image super-resolution via dual-state recurrent networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [9] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [10] K. He, J. Sun, and X. Tang (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(6), pp. 1397–1409.
  • [11] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [12] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele (2007) Joint bilateral upsampling. ACM Transactions on Graphics 26(3).
  • [13] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang (2017) Deep Laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang (2016) Deep joint image filtering. In European Conference on Computer Vision (ECCV).
  • [15] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
  • [16] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25(7), pp. 3194–3207.
  • [17] S. Mallat and G. Yu (2010) Super-resolution with sparse mixing estimators. IEEE Transactions on Image Processing 19(11), pp. 2889–2900.
  • [18] J. F. C. Mota, N. Deligiannis, and M. R. D. Rodrigues (2014) Compressed sensing with side information: geometrical interpretation and performance bounds. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 512–516.
  • [19] J. F. C. Mota, N. Deligiannis, and M. R. D. Rodrigues (2017) Compressed sensing with prior information: strategies, geometry, and bounds. IEEE Transactions on Information Theory 63, pp. 4472–4496.
  • [20] P. Song, X. Deng, J. F. C. Mota, N. Deligiannis, P. L. Dragotti, and M. R. D. Rodrigues (2019) Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries. IEEE Transactions on Computational Imaging.
  • [21] H. Sreter and R. Giryes (2018) Learned convolutional sparse coding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2191–2195.
  • [22] J. A. Tropp and S. J. Wright (2010) Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE 98(6), pp. 948–958.
  • [23] E. Tsiligianni and N. Deligiannis (2019) Learning fast sparse representations with the aid of side information. Technical report. [Online]. Available: https://bit.ly/2WIJB5z
  • [24] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang (2016) Maximal sparsity with deep networks?. In Advances in Neural Information Processing Systems (NIPS), pp. 4340–4348.
  • [25] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. S. Huang (2012) Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8), pp. 3467–3478.
  • [26] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19(11), pp. 2861–2873.