Generative Collaborative Networks for Single Image Super-Resolution

Mohamed El Amine Seddik, et al. (CEA) — 02/27/2019

A common issue of deep neural network-based methods for the problem of Single Image Super-Resolution (SISR) is the recovery of finer texture details when super-resolving at large upscaling factors. This issue is closely related to the choice of the objective loss function. In particular, recent works proposed the use of a VGG loss, which consists in minimizing the error between the generated high-resolution images and the ground-truth in the feature space of a Convolutional Neural Network (VGG19) pre-trained on the very "large" ImageNet dataset. When considering the problem of super-resolving images with a distribution "far" from the ImageNet image distribution (e.g., satellite images), this fixed VGG loss is no longer relevant. In this paper, we present a general framework named Generative Collaborative Networks (GCN), whose idea consists in optimizing the generator (the mapping of interest) in the feature space of a features extractor network. The two networks (generator and extractor) are collaborative in the sense that the latter "helps" the former by constructing discriminative and relevant features (not necessarily fixed, and possibly learned mutually with the generator). We evaluate the GCN framework in the context of SISR, and we show that it results in a method that is adapted to super-resolution domains that are "far" from the ImageNet domain.


1 Introduction

The super-resolution problem (SR) consists in estimating a high resolution (HR) image from its corresponding low resolution (LR) counterpart. SR finds a wide range of applications and has attracted much attention within the computer vision community (nasrollahi2014super; yang2007spatial; zou2012very). Generally, the optimization objective considered by supervised methods to solve SR is the minimization of the mean squared error (MSE) between the recovered HR image and the ground-truth. This class of methods is known to be suboptimal for reconstructing texture details at large upscaling factors. In fact, since the MSE is a pixel-wise difference between images, its ability to recover high-frequency texture details is limited (ledig2016photo; gupta2011modified; wang2004image; wang2003multiscale). Furthermore, minimizing the MSE is equivalent to maximizing the Peak Signal-to-Noise Ratio (PSNR) metric, which is commonly used for the evaluation of SR methods (yang2014single).
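For reference, the textbook relation between the MSE and the PSNR (for images with maximum pixel value MAX_I, e.g., 255 for 8-bit images) reads:

    \mathrm{MSE}(I,\hat{I}) = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\big(I_{hw} - \hat{I}_{hw}\big)^2, \qquad \mathrm{PSNR}(I,\hat{I}) = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}(I,\hat{I})}\right),

so that, for a fixed dynamic range, minimizing the MSE is indeed equivalent to maximizing the PSNR.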

In order to correctly recover finer texture details when super-resolving at large upscaling factors, a recent state-of-the-art work (ledig2016photo) defined a perceptual loss as a combination of an adversarial loss and a VGG loss. The former encourages solutions that are perceptually hard to distinguish from the HR ground-truth images, while the latter uses high-level feature maps of the VGG network (simonyan2014very) pre-trained on ImageNet (deng2009imagenet). When considering the problem of super-resolving images from a target domain different from ImageNet (e.g., satellite images), the features produced by the VGG network pre-trained on the source domain (ImageNet) are suboptimal and no longer relevant for the target domain. In fact, transfer-learning methods are known to be efficient only when the source and target domains are close enough (tamaazousti2017mucale; tamaazousti2018universal; karbalayghareh2018optimal). In this work, we present a general framework which we call Generative Collaborative Networks (GCN), whose main idea consists in optimizing the generator (i.e., the mapping of interest) in the feature space of a network which we refer to as a features extractor network. The two networks are said to be collaborative in the sense that the features extractor network "helps" the generator by constructing (here, learning) relevant features. In particular, we apply our framework to the problem of single image super-resolution, and we demonstrate that it yields a method that is better adapted (compared to SRGAN (ledig2016photo)) when super-resolving images from a domain that is "far" from the ImageNet domain.

The rest of the paper is organized as follows. Section 2 presents the state of the art on the problem of single image super-resolution. We describe our Generative Collaborative Networks framework in Section 3. Section 4 presents our proposed method for the super-resolution task together with the related experimental results. Section 5 provides some discussion and concludes the article.

2 Related work

The problem of super-resolution has been tackled with a large range of approaches. In the following, we consider the problem of single image super-resolution (SISR); approaches that recover HR images from multiple images (borman1998super; farsiu2004fast) are therefore out of the scope of this paper. The first approaches to solve SISR were filtering-based methods (e.g., linear, bicubic or Lanczos (duchon1979lanczos) filtering). Even if these methods are generally very fast, they usually yield overly smooth texture solutions (wang2004image). The most promising and powerful approaches are learning-based methods, which consist in establishing a mapping between LR images and their HR counterparts (supposed to be known). Initial work was proposed by Freeman et al. (freeman2002example), and was later improved in (dong2011image; zeyde2010single) using compressed-sensing approaches. Patch-based methods combined with machine learning algorithms were also proposed: in (timofte2013anchored; timofte2014a+), a LR image is upsampled by finding similar LR training patches in a low-dimensional space (using neighborhood embedding approaches), and a combination of their HR patch counterparts is used to reconstruct the HR patches. A more general mapping of example pairs (using kernel ridge regression) was formulated by Kim and Kwon (kim2010single). Similar approaches used Gaussian process regression (he2011single), trees (salvador2015naive) or random forests (schulter2015fast) to solve the regression problem introduced in (kim2010single). An ensemble-based approach was adopted in (dai2015jointly) by learning multiple patch regressors and selecting the most relevant ones during the test phase.

Convolutional neural network (CNN)-based approaches outperformed the other approaches, showing excellent performance. The authors of (wang2015deep) used an encoded sparse representation as a prior in a feed-forward CNN, based on the learned iterative shrinkage and thresholding algorithm of (gregor2010learning). An end-to-end trained, three-layer, fully convolutional network, relying on bicubic interpolation to upscale the input images, was used in (dong2014learning; dong2016image) and achieved good performance. Further works suggested that enabling the network to directly learn the upscaling filters can remarkably increase performance in terms of both time complexity and accuracy (dong2016accelerating; shi2016real). In order to recover visually more convincing HR images, Johnson et al. (johnson2016perceptual) and Bruna et al. (bruna2015super) used a loss function closer to perceptual similarity. More recently, the authors of (ledig2016photo) defined a perceptual loss which is a combination of an adversarial loss and a VGG loss; the latter consists in minimizing the error between the recovered HR image and the ground-truth in the high-level feature space of the VGG network (simonyan2014very) pre-trained on ImageNet (deng2009imagenet). This method notably outperformed CNN-based methods on the SISR problem.

3 Generative Collaborative Networks

3.1 Proposed Framework

Consider the problem of learning a mapping function G_θ, parameterized by θ, that transforms images from a domain X to a domain Y, given a training set of N pairs {(x_i, y_i)}_{i=1}^N. Denote by p_X and p_Y the probability distributions over X and Y, respectively. In addition, we introduce a given features extractor function F_φ, parameterized by φ, that maps an image to a Euclidean feature space of dimensionality d. The mappings G_θ and F_φ are typically feed-forward Convolutional Neural Networks. The Generative Collaborative Networks (GCN) framework consists in learning the mapping function G_θ by minimizing a given loss function ℓ (the ℓ1-loss is considered in the following) in the space of the features F_φ, between the generated images (through G_θ) and the ground-truth. Formally,

    \theta^{\ast} = \arg\min_{\theta} \; \alpha \sum_{i=1}^{N} \ell\!\left( F_{\phi}(G_{\theta}(x_i)),\, F_{\phi}(y_i) \right) + \beta\, \Omega(\theta),        (1)

where Ω(θ) is a regularization term (detailed below) on the generator weights θ, and α and β are summation coefficients. The two networks G_θ and F_φ are collaborative in the sense that the latter learns features specific to the domain Y and "helps" the former, which is learned in the feature space of F_φ. An important question is how to learn the mapping F_φ. In the following, we describe different classes of methods depending on the learning strategy of F_φ; indeed, the features extractor can take different forms and be learned with different strategies. In particular, we distinguish two learning strategies (illustrated in Figure 2), which we call disjoint-learning and joint-learning. The four following cases belong to the disjoint-learning strategy (an illustrative code sketch follows the list):

Figure 2: Overview of the GCN framework with examples of the two learning strategies. The GCN framework consists in optimizing a generator in the feature space of an extractor as illustrated in (a). The extractor can be trained beforehand and used to optimize the generator, which we refer to as disjoint-learning strategy (b). The extractor can also be optimized jointly with the generator, i.e., using a joint-learning strategy (c).
  • When F_φ is the identity operator (F_φ = Id). In that case, the objective in Eq. (1) becomes a simple pixel-wise MSE loss function. We refer to this class of methods by /mse.

  • When F_φ corresponds to a random feature-map neural network, that is to say, its weights φ are set randomly according to a given distribution. We refer to this class of methods by /ran.

  • When F_φ is a part of a model that solves a reconstruction problem (jointly with an auxiliary decoding function H_ψ), by minimizing the pixel-wise loss between the reconstructed images (through H_ψ ∘ F_φ) and the ground-truth:

    (\phi^{\ast}, \psi^{\ast}) = \arg\min_{\phi, \psi} \sum_{i=1}^{N} \ell\!\left( H_{\psi}(F_{\phi}(y_i)),\, y_i \right).        (2)

    Notably, this strategy allows for the learning of reconstruction-based features, which differ from classification-based features. We refer to this class of methods by /rec.

  • When F_φ is trained to solve a multi-label classification problem ledig2016photo , that is to say, when labels are available for the domain Y. More precisely, there exists a dataset of M images {(y_j, z_j)}_{j=1}^M labelled among C classes, and F_φ is learned to minimize the following cross-entropy objective:

    \phi^{\ast} = \arg\min_{\phi} \; -\sum_{j=1}^{M} \sum_{c=1}^{C} z_{j,c} \log p_c(y_j; \phi),        (3)

    where p_c(y_j; φ) denotes the probability of class c predicted by the classifier built on top of F_φ, and z_{j,c} indicates whether image y_j carries label c. We refer to this class of methods by /cla.
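As an illustration of the disjoint-learning cases above, the content term of Eq. (1) can be sketched as follows in tf.keras. This is a minimal sketch, not the authors' released implementation; `F` stands for any (here fixed) features extractor model and `G` for the generator:

    import tensorflow as tf

    def gcn_content_loss(F, G, x_lr, y_hr):
        # l1-loss of Eq. (1), computed in the feature space of the extractor F_phi
        y_gen = G(x_lr, training=True)          # generated (e.g., super-resolved) images
        feat_gen = F(y_gen, training=False)     # features of the generated images
        feat_ref = F(y_hr, training=False)      # features of the ground-truth images
        return tf.reduce_mean(tf.abs(feat_gen - feat_ref))

Setting F to the identity recovers a plain pixel-wise loss (/mse), while plugging in a randomly initialized, a reconstruction-trained or a classification-trained extractor recovers /ran, /rec and /cla respectively.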

The features extractor function F_φ can also be trained jointly with the desired mapping function G_θ. Indeed, as in the GAN paradigm, one can use a discriminator to distinguish the generated images (through G_θ) from the ground-truth, and thus learn features that are more relevant and specific to the problem of interest. In particular, the joint-learning strategy covers two cases:

  • When F_φ is a part of a discriminator D_φ (i.e., the first layers of D_φ define F_φ) that classifies the generated images (through G_θ) against the ground-truth. F_φ is optimized in an alternating manner along with G_θ to solve the adversarial min-max problem sonderby2016amortised :

    \min_{\theta}\max_{\phi} \; \mathbb{E}_{y \sim p_Y}\!\left[\log D_{\phi}(y)\right] + \mathbb{E}_{x \sim p_X}\!\left[\log\!\left(1 - D_{\phi}(G_{\theta}(x))\right)\right].        (4)

    The adversarial loss (the second term of Eq. (4)) can thus be seen as a regularization of the generator parameters θ, obtained by assigning this quantity to Ω in Eq. (1). This regularization "pushes" the solution of the problem in Eq. (1) towards the manifold of the images of the domain Y. We refer to this class of methods by /adv; when β = 0 in Eq. (1), we refer to it by /dis.

  • When F_φ is a part of both a discriminator and an auto-encoder, namely, when its weights are optimized to simultaneously solve an adversarial problem as in Eq. (4) (through the discriminator D_φ) and a reconstruction problem as in Eq. (2) (through a decoder H_ψ). We refer to this class of methods by /adv,rec or /dis,rec, depending on the value of β in Eq. (1). A sketch of one such joint training step is given below.
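The joint-learning cases can be illustrated by the following alternating training step for the /adv,rec variant. This is only a sketch under our own naming assumptions: `F` is the shared features extractor, `D_head` the remaining discriminator layers on top of F, `H_dec` the decoder used for the reconstruction problem of Eq. (2), and the coefficients alpha, beta and gamma are illustrative placeholders:

    import tensorflow as tf

    bce = tf.keras.losses.BinaryCrossentropy()

    def adv_rec_step(G, F, D_head, H_dec, opt_G, opt_FDH, x_lr, y_hr,
                     alpha=1.0, beta=1e-3, gamma=1.0):
        # 1) update the collaborative networks: discrimination (Eq. (4)) + reconstruction (Eq. (2))
        with tf.GradientTape() as tape:
            y_gen = G(x_lr, training=True)
            d_real = D_head(F(y_hr, training=True), training=True)
            d_fake = D_head(F(y_gen, training=True), training=True)
            adv_d = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
            rec = tf.reduce_mean(tf.abs(H_dec(F(y_hr, training=True), training=True) - y_hr))
            loss_fdh = adv_d + gamma * rec
        vars_fdh = (F.trainable_variables + D_head.trainable_variables
                    + H_dec.trainable_variables)
        opt_FDH.apply_gradients(zip(tape.gradient(loss_fdh, vars_fdh), vars_fdh))

        # 2) update the generator: feature-space l1 content term plus adversarial regularization (Eq. (1))
        with tf.GradientTape() as tape:
            y_gen = G(x_lr, training=True)
            content = tf.reduce_mean(tf.abs(F(y_gen, training=False) - F(y_hr, training=False)))
            d_fake = D_head(F(y_gen, training=False), training=False)
            adv_g = bce(tf.ones_like(d_fake), d_fake)
            loss_g = alpha * content + beta * adv_g   # beta = 0 recovers the /dis,rec variant
        grads_g = tape.gradient(loss_g, G.trainable_variables)
        opt_G.apply_gradients(zip(grads_g, G.trainable_variables))
        return loss_g, loss_fdh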

3.2 Existing Loss Functions

The natural way to learn a mapping from one manifold to another is to use /mse methods. It is well known (gupta2011modified; ledig2016photo; wang2003multiscale; wang2004image) that this class of methods leads to overly-smooth solutions of poor perceptual quality. In order to handle this perceptual quality limitation, a variety of methods have been proposed in the literature. First methods used generative adversarial networks (GANs) for generating high perceptual quality images (denton2015deep; mathieu2015deep), style transfer (li2016combining) and inpainting (yeh2016semantic), namely the class of methods /adv. The authors of (yu2016ultra) proposed to use /mse with an adversarial loss to train a network that super-resolves face images at large upscaling factors. The authors of (bruna2015super; johnson2016perceptual) and of (dosovitskiy2016generating) used /cla by considering respectively the VGG19 and AlexNet networks as fixed features extractors (learned disjointly from the mapping of interest), which results in perceptually more convincing results for both super-resolution and artistic style transfer (gatys2015texture; gatys2016image). More recently, the authors of (ledig2016photo) used /cla combined with an adversarial loss (/cla,adv) by considering VGG19 as a fixed features extractor. To the best of our knowledge, as summarized in Table 1, the other learning strategies of F_φ — namely the reconstruction-based disjoint strategy (/rec) and the joint-learning strategies (/dis, /dis,rec, /adv,rec) — have not been explored in the literature. We apply these strategies in the context of Single Image Super-Resolution, which results in methods that are more suitable (compared to the SRGAN method (ledig2016photo)) for super-resolution domains that differ from the ImageNet domain. The proposed methods and the corresponding experiments are presented in the following sections.

Standard (non-adversarial) methods — existence in the literature: e.g., (gupta2011modified) and (dosovitskiy2016generating).
Adversarial methods — existence in the literature: e.g., (yu2016ultra) and (ledig2016photo).
Table 1: Existing loss functions of the proposed GCN framework. Only some of the combinations have been explored in prior work; the remaining ones (in particular /rec and the joint-learning variants /dis, /dis,rec and /adv,rec) are, to the best of our knowledge, unexplored.

4 Application of GCN to Single Image Super-Resolution

4.1 Proposed Methods

In this section, we consider the problem of Single Image Super-Resolution (SISR). In particular, we suppose we are given pairs of low-resolution images and their high-resolution counterparts. Recalling our GCN framework (presented in Section 3), the proposed methods for the SISR problem are: /rec, /dis, /dis,rec, /adv and /adv,rec. We show in the following that the most convincing results are given by /adv,rec. In particular, we show on a dataset of satellite images (different from the ImageNet domain) that our method /adv,rec outperforms the SRGAN method (ledig2016photo) by a large margin on the considered domain. Note that, as our goal is to show the irrelevance of the VGG loss for some visual domains (different from ImageNet), we do not consider the well-known SR benchmarks (e.g., Set5, Set14, B100, Urban100) for the evaluation, as these benchmarks are relatively close to the ImageNet domain. The evaluation of the different methods is based on perceptual metrics (zhang2018unreasonable), which we recall in the following section.

4.2 Evaluation Metrics

The evaluation of super-resolution methods (and more generally of image regression-based methods) requires comparing visual patterns, which remains an open problem in computer vision. In fact, classical metrics such as L2/PSNR, SSIM and FSIM often disagree with human judgments (e.g., blurring causes a large perceptual change but a small L2 change). Thus, the definition of a perceptual metric which agrees with human perception is an important aspect of the evaluation of SR methods. Zhang et al. (zhang2018unreasonable) recently evaluated deep features across different architectures (SqueezeNet (iandola2016squeezenet), AlexNet (krizhevsky2012imagenet) and VGG (simonyan2014very)) and tasks (supervised, self-supervised and unsupervised networks) and compared the resulting metrics with traditional ones. They found that deep features outperform all classical metrics (e.g., L2/PSNR, SSIM and FSIM) by large margins on the dataset they introduced. As a consequence, deep networks seem to provide an embedding of images which agrees surprisingly well with human judgments.

Zhang et al. (zhang2018unreasonable) compute the distance between two images y and y_0 with a network N (the considered networks are SqueezeNet (iandola2016squeezenet), AlexNet (krizhevsky2012imagenet) and VGG (simonyan2014very), together with their "perceptually calibrated" versions, which we refer to respectively as Squeeze-l, AlexNet-l and VGG-l; see (zhang2018unreasonable) and the associated GitHub project for further details) in the following way:

    d_N(y, y_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{\,l}_{hw} - \hat{y}^{\,l}_{0,hw} \right) \right\|_2^2,        (5)

where \hat{y}^{\,l} and \hat{y}^{\,l}_{0} are the features extracted from layer l (unit-normalized in the channel dimension), w_l is a channel-wise re-scaling vector of the activations at layer l, and H_l and W_l are respectively the height and width of the feature map at layer l.

Thus, we compute the perceptual error (PE) of a method (a mapping G_θ) on a given test set of n low-resolution images x_i and their high-resolution counterparts y_i, as the mean distance between the generated images (through G_θ) and the ground-truth:

    \mathrm{PE}_N(G_{\theta}) = \frac{1}{n} \sum_{i=1}^{n} d_N\!\left( G_{\theta}(x_i),\, y_i \right).        (6)

Note that we use the implementation of (zhang2018unreasonable) to compute the perceptual distances with six variants, based on the networks SqueezeNet (iandola2016squeezenet), AlexNet (krizhevsky2012imagenet) and VGG (simonyan2014very) and their "perceptually calibrated" versions. The best method is considered to be the one which minimizes the maximum of the PEs across the different networks N.
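As a complement, a small numpy sketch of Eqs. (5) and (6) and of the selection rule is given below; it assumes that the channel-unit-normalized features and the calibration weights w_l are already available (in practice we rely on the implementation released with (zhang2018unreasonable), which also provides the calibrated variants):

    import numpy as np

    def perceptual_distance(feats_a, feats_b, weights):
        # Eq. (5): feats_* are lists of per-layer arrays of shape (H_l, W_l, C_l),
        # already unit-normalized along the channel axis; weights[l] has shape (C_l,)
        d = 0.0
        for fa, fb, w in zip(feats_a, feats_b, weights):
            sq = (w * (fa - fb)) ** 2          # channel-wise re-scaling by w_l
            d += sq.sum(axis=-1).mean()        # sum over channels, average over H_l x W_l
        return d

    def perceptual_error(distances):
        # Eq. (6): mean of d_N(G(x_i), y_i) over the n test images
        return float(np.mean(distances))

    def select_best(pe_per_method):
        # selection rule: minimize the maximum PE across the six networks N
        # pe_per_method: dict mapping a method name to its list of PEs
        return min(pe_per_method, key=lambda m: max(pe_per_method[m]))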

4.3 Experiments

The overall goal of this section is to validate our claim about the relevance of the VGG loss when super-resolving images from a domain different from ImageNet. To this end, we first present the considered datasets, architectures and training details. Then we select the most appropriate method (among the GCN framework methods) for the SISR problem based on perceptual metrics (zhang2018unreasonable). Finally, we compare our proposed method to some baselines and to the state-of-the-art SRGAN method (ledig2016photo) on three different datasets (detailed in the following section). We show in particular that our method outperforms SRGAN on the satellite images domain.

4.3.1 Datasets

The idea of replacing the pixel-wise MSE content loss on the image by a loss function that is closer to perceptual similarity is not new. Indeed, ledig2016photo defined a VGG loss on the feature maps of a specific layer of the pre-trained VGG19 network and showed that it alleviates the inherent problem of overly smooth results that comes with the pixel-wise loss. Nevertheless, VGG19 being trained on ImageNet, their method does not perform particularly well on images whose distribution is far from that of ImageNet. Therefore, we propose a similar method in which the features extractor is not pre-trained but trained jointly with the generator. This removes the aforementioned limitation, since the features extractor is trained on the same dataset as the generator and thus extracts relevant features.

Figure 3: Examples of images from the considered datasets.

To show that, we trained our different networks (i.e., with different features extractors) on three distinct datasets (examples of images of these datasets are shown in Figure 3):

  • A subset of ImageNet deng2009imagenet , from which we randomly sampled 200,000 images. Since VGG19 was trained on ImageNet for many (more than 300K) iterations, we expect similar or worse results than the state-of-the-art SRGAN method of ledig2016photo on this dataset.

  • The Describable Textures Dataset (DTD) cimpoi14describing , containing 5,640 images of textural patterns. These data are relatively close to ImageNet, and we show that our method gives convincing results, close to those of SRGAN.

  • A dataset of satellite images (available at http://www.terracolor.net/sample_imagery.html), which we generated by randomly cropping patches from a large satellite image. We show in particular that our method significantly outperforms SRGAN on this dataset. We refer to this dataset as Sat.

All experiments are performed with a fixed scale factor between the low- and high-resolution images; the former are obtained during training by down-scaling the original images by this factor.
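For concreteness, LR/HR training pairs can be produced as in the following sketch, where the scale factor and the crop size are illustrative placeholders and bicubic resampling is assumed:

    import numpy as np
    from PIL import Image

    SCALE = 4      # placeholder upscaling factor
    HR_SIZE = 96   # hypothetical HR crop size, divisible by SCALE

    def make_pair(path):
        # random HR crop and its down-scaled LR counterpart
        img = Image.open(path).convert("RGB")
        left = np.random.randint(0, img.width - HR_SIZE + 1)
        top = np.random.randint(0, img.height - HR_SIZE + 1)
        hr = img.crop((left, top, left + HR_SIZE, top + HR_SIZE))
        lr = hr.resize((HR_SIZE // SCALE, HR_SIZE // SCALE), Image.BICUBIC)
        return np.asarray(lr), np.asarray(hr)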

4.3.2 Architectures

Our overall goal is to show that the proposed GCN framework is adapted to training a generative mapping model, and that it surpasses the MSE loss in preserving perceptual similarity in the generated image (whereas the MSE loss tends to smooth things out and lose high-frequency details). As opposed to ledig2016photo 's work, our framework does not require a pre-trained network, like VGG, to extract helpful features for training. In this paper, we focus on the super-resolution problem. Therefore, we chose our mapping function G_θ, or generator, to be that of Ledig et al. ledig2016photo : a feed-forward CNN, parameterized by θ, composed of residual blocks. Each block is made of two convolutional layers with 3×3 kernels and 64 feature maps, each followed by batch normalization and a PReLU activation. The image size is then increased by two trained ×2 upsampling blocks. The architecture of all the used discriminators follows the guidelines of Radford et al. radford2015unsupervised : each block is composed of a convolutional layer, followed by batch normalization and a LeakyReLU activation. This block is repeated eight times; each time the number of kernels increases by a factor of two (ranging from 64 to 512), a strided convolution is used to reduce the image resolution by a factor of two. Two dense layers and a sigmoid activation then return the discrimination probability. In the case of an auto-encoder (i.e., for every reconstruction problem), we use the same architecture for the encoder and a symmetric one for the decoder. Figure 4 depicts an overview of the architectures of both the generator and the discriminator.
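A sketch of such a generator in tf.keras is given below; the 3×3/64 residual blocks follow the description above, while the 9×9 input/output convolutions, the sub-pixel (pixel-shuffle) upsampling, the tanh output and the default of 16 residual blocks are assumptions borrowed from the SRGAN architecture of ledig2016photo rather than details stated in this paper:

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters=64):
        # two 3x3 convolutions with batch normalization and PReLU, plus a skip connection
        y = layers.Conv2D(filters, 3, padding="same")(x)
        y = layers.BatchNormalization()(y)
        y = layers.PReLU(shared_axes=[1, 2])(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        return layers.Add()([x, y])

    def upsample_x2(x, filters=256):
        # trained x2 upsampling via sub-pixel convolution
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)
        return layers.PReLU(shared_axes=[1, 2])(x)

    def build_generator(num_blocks=16):
        inp = layers.Input(shape=(None, None, 3))
        x = layers.Conv2D(64, 9, padding="same")(inp)
        x = skip = layers.PReLU(shared_axes=[1, 2])(x)
        for _ in range(num_blocks):
            x = residual_block(x)
        x = layers.Conv2D(64, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Add()([x, skip])
        x = upsample_x2(x)
        x = upsample_x2(x)   # two trained upsamplings
        out = layers.Conv2D(3, 9, padding="same", activation="tanh")(x)
        return tf.keras.Model(inp, out, name="generator")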

Figure 4: Overview of the used architectures for the generator and the discriminator. We have considered the same architectures as that of Ledig et al. ledig2016photo .

4.3.3 Training details and parameters

All networks were trained on an NVIDIA GeForce GTX 1070 GPU (a Keras implementation is provided at https://github.com/melaseddik/GCN), using the datasets described in Section 4.3.1, which do not contain the test images shown as results. We scaled the range of both the LR input images and the HR images to [−1, 1], which explains the tanh activation of the last layer of the generator. All variants of our networks, which differ only in their features extractor, were trained from scratch (for both the generator and the features extractor) with mini-batches of 10 images. We used the Adam optimizer with learning-rate decay. The generator and the features extractor are updated alternately. As training turned out to be stable and quite fast, we trained for a limited number of update iterations to pinpoint the best method among the different GCN variants. Finally, the regularization coefficients α and β in our global loss are kept at fixed default values. As a reminder, our goal here is, given a generator architecture (i.e., a mapping function G_θ), to find the best strategy to train it following our GCN paradigm; the best method is then further compared to the baselines.
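The data scaling and optimizer setup can be sketched as follows; the learning-rate value and decay schedule are placeholders only, since the exact values are not reproduced here:

    import numpy as np
    import tensorflow as tf

    def to_model_range(img_uint8):
        # scale 8-bit images to the [-1, 1] training range
        return img_uint8.astype(np.float32) / 127.5 - 1.0

    def to_uint8(img):
        # map generator outputs back to displayable 8-bit images
        return np.clip((img + 1.0) * 127.5, 0, 255).astype(np.uint8)

    # hypothetical optimizer settings (one optimizer per network, updated alternately)
    opt_G = tf.keras.optimizers.Adam(learning_rate=1e-4)
    opt_FDH = tf.keras.optimizers.Adam(learning_rate=1e-4)
    BATCH_SIZE = 10   # mini-batches of 10 images, as stated above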

4.3.4 Features Extractor Selection

As stated above, we investigated the ability of different features extractors to construct relevant perceptual feature maps for training and improving the rendering quality of the generator. In order to select the best learning strategy for a given dataset, we train the generator on each dataset (presented in Section 4.3.1) using the different learning strategies: /rec, /dis, /dis,rec, /adv and /adv,rec. Note that the features extractor for all the considered methods corresponds to the first layer of the discriminators (or encoder-decoders). In fact, as the SISR problem consists in recovering low-level perceptual cues, we limited our study to the first layer.
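The first-layer features extractor can be obtained directly from a discriminator (or encoder) with the Keras functional API, as in the minimal sketch below; `build_discriminator` is a hypothetical constructor following the architecture of Figure 4:

    import tensorflow as tf

    def extractor_from(discriminator, layer_index=1):
        # expose the first convolutional block of the discriminator as F_phi
        feat = discriminator.layers[layer_index].output
        return tf.keras.Model(discriminator.input, feat, name="features_extractor")

    # usage sketch:
    # D = build_discriminator()          # hypothetical constructor
    # F = extractor_from(D)              # first-layer features, as used in this section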

Table 2 summarizes the results of the proposed methods in terms of low-level metrics (L2 and SSIM) and of the perceptual metrics (zhang2018unreasonable) given by Eq. (6). We notice from this table that the method /adv,rec performs relatively well on the ImageNet and Sat datasets in terms of perceptual metrics, while /dis,rec gives better results on the DTD dataset. The main difference between these two methods is that the former includes an adversarial loss in the objective function while the latter does not. This explains why /adv,rec does not perform as well on DTD: texture images lie on a complex manifold, and their distribution is relatively hard to fit with a generative model.

Figure 5 shows qualitative results of the different proposed methods on the different datasets. Generally, the methods trained with an additional adversarial loss (/adv and /adv,rec) output images of higher quality (on the ImageNet and Sat datasets), as GANs were introduced to do just that: generate images that follow the distribution of the dataset. Among these two adversarial methods, it seems (as suggested by the quantitative results of Table 2) that /adv,rec (column (c) of Figure 5) is able to detect and render more details, owing to its ability to learn more relevant features: its features extractor is trained to solve a multi-task problem, namely a discrimination and a reconstruction problem, which allows the learning of both classification- and reconstruction-based features. We therefore further investigate the /adv,rec method for the comparison to the baselines and to the state-of-the-art SRGAN method (ledig2016photo) on the satellite images domain.

Methods L2 SSIM Squ Squ-l Alex Alex-l VGG VGG-l (L2 and SSIM: low-level metrics; the remaining columns: perceptual metrics)

ImageNet
/rec 0.018 0.147 1.606 0.279 1.470 0.398 2.088 0.358
/dis 0.020 0.162 1.723 0.301 1.595 0.425 2.243 0.388
/dis,rec 0.017 0.147 1.587 0.279 1.420 0.382 2.052 0.353
/adv 0.028 0.202 1.820 0.222 1.554 0.322 2.598 0.432
/adv,rec 0.016 0.141 1.533 0.263 1.362 0.368 1.994 0.340

DTD
/rec 0.027 0.184 1.873 0.327 1.739 0.440 2.401 0.421
/dis 0.027 0.183 1.851 0.320 1.726 0.438 2.398 0.420
/dis,rec 0.023 0.167 1.703 0.292 1.576 0.404 2.260 0.392
/adv 0.036 0.227 2.077 0.281 1.812 0.375 2.770 0.473
/adv,rec 0.046 0.236 2.089 0.277 1.793 0.344 2.796 0.481

Sat
/rec 0.011 0.129 1.484 0.210 1.508 0.356 2.121 0.355
/dis 0.060 0.168 1.705 0.245 1.762 0.423 2.260 0.395
/dis,rec 0.011 0.138 1.493 0.215 1.435 0.351 2.108 0.372
/adv 0.030 0.214 1.719 0.181 1.627 0.306 2.711 0.419
/adv,rec 0.018 0.183 1.359 0.140 1.310 0.220 2.115 0.344
Table 2: Results of the proposed methods in terms of traditional metrics (L2 and SSIM) and the perceptual error (PE) given by Eq. (6) on the different datasets. Rows correspond to the methods /rec, /dis, /dis,rec, /adv and /adv,rec, in the order introduced in Section 4.1. The method /adv,rec outperforms the other methods on the ImageNet and Sat datasets, while /dis,rec gives the best results on DTD.


Figure 5: Rows refer to the different considered datasets. Columns refer to methods and ground-truth images: LR and HR refer to the low- and high-resolution pairs. The methods shown are: (a) /rec, (b) /dis,rec, (c) /adv and (d) /adv,rec. Best viewed in PDF.


4.3.5 /adv,rec against baseline methods on the satellite images domain

Our main objective is to show that the VGG loss (and hence the SRGAN method (ledig2016photo)) is no longer relevant when super-resolving images from a domain different from ImageNet. In particular, considering the satellite images domain, we show in this section that the method selected in the previous section (/adv,rec) outperforms both the baselines, /mse (pixel-wise MSE loss) and /adv,mse (pixel-wise MSE loss combined with an adversarial loss), and the state-of-the-art super-resolution method SRGAN (ledig2016photo). Note that all the methods use the same architectures (depicted in Figure 4) for the generator and the discriminator, and are trained on the same domain (here, satellite images). Since our purpose is to show the relevance of the proposed method on a domain "far" from the ImageNet domain, we do not consider the standard SR benchmarks, which are relatively "close" to the ImageNet domain.

Table 3 presents quantitative results of the different methods on the Sat dataset, in terms of classical metrics (L2 and SSIM) and of the perceptual metrics given by Eq. (6). As we can notice, our method outperforms the other methods in terms of perceptual metrics. Knowing that the perceptual metrics agree with human judgments (zhang2018unreasonable), these results validate the effectiveness of the /adv,rec method. Note also that even though SRGAN (ledig2016photo) is optimized to minimize a VGG loss, it does not give the lowest perceptual errors in terms of the VGG and VGG-l metrics; this is due to the fact that the VGG features are not relevant for the satellite images domain. In addition, /adv,rec gives the lowest perceptual errors in terms of the Alex and Alex-l metrics, which agree well with human perception; in fact, the AlexNet architecture may more closely match the architecture of the human visual cortex (yamins2016using).

Methods L2 SSIM Squ Squ-l Alex Alex-l VGG VGG-l (L2 and SSIM: low-level metrics; the remaining columns: perceptual metrics)

Sat
/mse 0.011 0.134 1.873 0.245 1.855 0.411 2.536 0.419
/adv,mse 0.082 0.197 1.458 0.205 1.466 0.352 2.125 0.347
SRGAN (ledig2016photo) 0.228 0.188 1.510 0.220 1.361 0.282 2.230 0.412
/adv,rec (ours) 0.018 0.183 1.359 0.140 1.310 0.220 2.115 0.344
Table 3: Comparison of our method /adv,rec with the baselines /mse and /adv,mse and with the SRGAN method (ledig2016photo) on the satellite images domain, in terms of classical metrics (L2 and SSIM) and perceptual metrics (zhang2018unreasonable).
Figure 15: Results of different methods on a patch of an image from the Sat dataset. Panels: (b) HR (reference), (c) /mse, (d) /adv,mse, (e) SRGAN (ledig2016photo), (f) /rec, (g) /dis,rec, (h) /adv, (i) /adv,rec.

Figure 15 shows some qualitative results of the different methods on a patch of an image from the Sat dataset. As we can notice, the /adv,rec method gives the result that is perceptually closest to the ground-truth image, which agrees with the quantitative results of Table 3.

4.3.6 Further results

In this section, we provide further qualitative and quantitative comparisons to the baselines considered in the previous section, this time on all the presented datasets. Qualitative results are provided in Figure 16. SRGAN performs better on ImageNet, which is not that surprising considering that our features extractor was trained much less than the VGG19 used in ledig2016photo and that the VGG features are more relevant for images from the ImageNet domain. Nonetheless, we do obtain sharper images than the MSE-based methods, although we observe some artifacts (especially on the boat), which we attribute to the competition between the content and adversarial losses. On DTD, though, we can see the benefit of our method over a pre-trained VGG loss. Indeed, SRGAN is blurrier on both the house (first row) and the cliff (third row), in spite of having fewer artifacts than our method. On the "cracks" example (second row), SRGAN even totally obliterates the details in the center. Finally, the results on the Sat dataset, which is the most different from ImageNet, are the most compelling: our method generates super-resolved images that are very close to the real high-resolution images, while we can clearly see imperfections in SRGAN's results, because VGG19 was not trained to extract perceptual features from satellite images.

Methods L2 SSIM Squ Squ-l Alex Alex-l VGG VGG-l (L2 and SSIM: low-level metrics; the remaining columns: perceptual metrics)

ImageNet
/mse 0.017 0.146 1.568 0.280 1.435 0.391 2.064 0.349
/adv,mse 0.020 0.156 1.634 0.241 1.397 0.329 2.223 0.384
SRGAN 0.028 0.170 1.303 0.177 1.084 0.225 2.045 0.342
/adv,rec (ours) 0.016 0.141 1.533 0.263 1.362 0.368 1.994 0.340

DTD
/mse 0.029 0.185 1.972 0.342 1.856 0.470 2.479 0.434
/adv,mse 0.025 0.188 1.880 0.268 1.586 0.349 2.512 0.430
SRGAN 0.031 0.191 1.557 0.209 1.298 0.241 2.308 0.393
/dis,rec (ours) 0.023 0.167 1.703 0.292 1.576 0.404 2.260 0.392
Table 4: Comparison of our methods /adv,rec (on ImageNet) and /dis,rec (on DTD) with the baselines and the SRGAN method (ledig2016photo) on the ImageNet (a subset of 200,000 randomly selected images) and DTD datasets, in terms of classical metrics (L2 and SSIM) and perceptual metrics (zhang2018unreasonable).

Quantitative results are summarized in Table 4. As shown in (ledig2016photo; zhang2018unreasonable), standard quantitative measures such as L2 and SSIM fail to reflect image quality as perceived by the human visual system. In fact, while the results of /mse are perceptually overly smooth, it has the lowest L2 and SSIM errors on Sat. The perceptual metrics, however, agree with what we assess qualitatively: SRGAN performs best on ImageNet but not on Sat, whose distribution is the farthest from ImageNet. Actually, SRGAN ranks third of the four compared methods on Sat, just before /adv,mse, while still performing best on DTD, which remains fairly close to ImageNet. This shows that the VGG features become less and less relevant as the dataset's distribution departs from ImageNet. On the other hand, our training framework allows relevant features to be constructed on any (never seen) dataset; thus our method /adv,rec performs best on Sat. The fact that our method also performs better than /adv,mse shows that our framework helps finding detail-preserving features. Figure 16 provides the results of the different baselines and of our method on some examples from the considered datasets; we notice from these images that our method recovers finer details on the different datasets and outperforms the considered baselines on satellite images. Table 5 summarizes the results of the different methods on all the datasets considered throughout the paper. From these results, we draw the following conclusions:

  • When the considered domain is far enough from the ImageNet domain, the VGG loss introduced by (ledig2016photo) is no longer relevant.

  • The VGG network cannot be fine-tuned when considering a domain for which no image labels are available (e.g., satellite images); thus, the SRGAN method cannot be exploited efficiently in this case.

  • Our framework results in a method (/adv,rec) that outperforms the considered baselines and the SRGAN method on the satellite images domain.

  • Even on a domain close to the ImageNet domain (e.g., texture images), our framework provides methods which give results almost similar to those of the SRGAN method, while the latter relies on VGG features and thus needs the VGG network to be trained on the whole ImageNet dataset.

Methods L2 SSIM Squ Squ-l Alex Alex-l VGG VGG-l (L2 and SSIM: low-level metrics; the remaining columns: perceptual metrics)

ImageNet
/mse 0.017 0.146 1.568 0.280 1.435 0.391 2.064 0.349
/adv,mse 0.020 0.156 1.634 0.241 1.397 0.329 2.223 0.384
SRGAN 0.028 0.170 1.303 0.177 1.084 0.225 2.045 0.342
/rec 0.018 0.147 1.606 0.279 1.470 0.398 2.088 0.358
/dis 0.020 0.162 1.723 0.301 1.595 0.425 2.243 0.388
/dis,rec 0.017 0.147 1.587 0.279 1.420 0.382 2.052 0.353
/adv 0.028 0.202 1.820 0.222 1.554 0.322 2.598 0.432
/adv,rec 0.016 0.141 1.533 0.263 1.362 0.368 1.994 0.340

DTD
/mse 0.029 0.185 1.972 0.342 1.856 0.470 2.479 0.434
/adv,mse 0.025 0.188 1.880 0.268 1.586 0.349 2.512 0.430
SRGAN 0.031 0.191 1.557 0.209 1.298 0.241 2.308 0.393
/rec 0.027 0.184 1.873 0.327 1.739 0.440 2.401 0.421
/dis 0.027 0.183 1.851 0.320 1.726 0.438 2.398 0.420
/dis,rec 0.023 0.167 1.703 0.292 1.576 0.404 2.260 0.392
/adv 0.036 0.227 2.077 0.281 1.812 0.375 2.770 0.473
/adv,rec 0.046 0.236 2.089 0.277 1.793 0.344 2.796 0.481

Sat
/mse 0.011 0.134 1.873 0.245 1.855 0.411 2.536 0.419
/adv,mse 0.082 0.197 1.458 0.205 1.466 0.352 2.125 0.347
SRGAN 0.228 0.188 1.510 0.220 1.361 0.282 2.230 0.412
/rec 0.011 0.129 1.484 0.210 1.508 0.356 2.121 0.355
/dis 0.060 0.168 1.705 0.245 1.762 0.423 2.260 0.395
/dis,rec 0.011 0.138 1.493 0.215 1.435 0.351 2.108 0.372
/adv 0.030 0.214 1.719 0.181 1.627 0.306 2.711 0.419
/adv,rec 0.018 0.183 1.359 0.140 1.310 0.220 2.115 0.344
Table 5: Comparison of the proposed methods in terms of traditional metrics (L2 and SSIM) and the perceptual error (PE) given by Eq. (6) on all the considered datasets. In terms of perceptual metrics, the proposed methods rank second after SRGAN (ledig2016photo) on the ImageNet and DTD datasets, while they outperform all the baselines on the satellite images domain, which is far from the ImageNet domain.


Figure 16: Rows refer to the different datasets. Columns refer to methods and ground-truth images: LR and HR refer to the low- and high-resolution pairs. P-mse+ refers to the method /mse with an adversarial loss (i.e., /adv,mse), SRGAN to the method of (ledig2016photo), and /adv,rec to our method. Best viewed in PDF.


5 Conclusion and Perspectives

In this paper, we propose a general framework named Generative Collaborative Networks (GCN), which generalizes existing methods for the problem of learning a mapping between two domains. The GCN framework highlights a learning strategy for such mappings that had not been explored in the literature: the optimization of the mapping in the feature space of a features extractor network that is learned mutually, at the same time as the considered mapping (joint-learning strategy). The GCN framework was evaluated in the context of super-resolution on three datasets (ImageNet deng2009imagenet , DTD cimpoi14describing and satellite images). We have shown that the proposed joint-learning strategy leads to a method that outperforms the state of the art ledig2016photo , which uses a pre-trained features extractor network (VGG19 trained on ImageNet). Specifically, this holds when the domain of interest is "far" from the ImageNet domain, e.g., satellite images or images from the medical domain (the latter is particularly relevant for the proposed framework as it seems very far from the ImageNet domain; unfortunately, to the best of our knowledge, we have not found a large amount of publicly available medical image data, which prevented us from considering this domain in the paper). However, note that even for domains close to the ImageNet domain, the proposed method gives convincing results (almost similar to ledig2016photo ) without using the whole ImageNet dataset to learn the features extractor network (as done in ledig2016photo ).

In this work, we systematically designed the proposed methods using the first layer of the features extractor networks; it would be interesting to evaluate in more detail the impact of this choice with respect to the learning strategy. Moreover, the impact of the selected layer may also depend on the considered dataset. More generally, the GCN framework offers a broad view of the wide variety of loss functions used in the literature on mapping-learning problems (e.g., super-resolution, image completion, artistic style transfer, etc.). In fact, we show that these loss functions can be reformulated, within the proposed framework, as a combination of a particular type of features extractor network (/rec, /dis, /dis,rec, /adv and /adv,rec) and of a particular learning strategy (joint-learning or disjoint-learning). It will therefore be interesting to explore this promising framework in other mapping-learning problems.

References


  • (1) C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, arXiv preprint.
  • (2) K. Nasrollahi, T. B. Moeslund, Super-resolution: a comprehensive survey, Machine vision and applications 25 (6) (2014) 1423–1468.
  • (3) Q. Yang, R. Yang, J. Davis, D. Nistér, Spatial-depth super resolution for range images, in: Computer Vision and Pattern Recognition (CVPR), IEEE, 2007, pp. 1–8.
  • (4) W. W. Zou, P. C. Yuen, Very low resolution face recognition problem, IEEE Transactions on Image Processing 21 (1) (2012) 327–340.

  • (5) P. Gupta, P. Srivastava, S. Bhardwaj, V. Bhateja, A modified psnr metric based on hvs for quality assessment of color images, in: Communication and Industrial Application (ICCIA), 2011 International Conference on, IEEE, 2011, pp. 1–4.
  • (6) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (4) (2004) 600–612.
  • (7) Z. Wang, E. P. Simoncelli, A. C. Bovik, Multiscale structural similarity for image quality assessment, in: Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, Vol. 2, Ieee, 2003, pp. 1398–1402.
  • (8) C.-Y. Yang, C. Ma, M.-H. Yang, Single-image super-resolution: A benchmark, in: European Conference on Computer Vision, Springer, 2014, pp. 372–386.
  • (9) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
  • (10) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, CVPR 2009., IEEE, 2009, pp. 248–255.
  • (11) Y. Tamaazousti, H. Le Borgne, C. Hudelot, Mucale-net: Multi categorical-level networks to generate more discriminating features.
  • (12) Y. Tamaazousti, H. Le Borgne, C. Hudelot, M. E. A. Seddik, M. Tamaazousti, Learning more universal representations for transfer-learning, arXiv:1712.09708.
  • (13) A. Karbalayghareh, X. Qian, E. R. Dougherty, Optimal bayesian transfer learning, IEEE Transactions on Signal Processing.
  • (14) S. Borman, R. L. Stevenson, Super-resolution from image sequences-a review, in: Circuits and Systems, 1998. Proceedings. 1998 Midwest Symposium on, IEEE, 1998, pp. 374–378.
  • (15) S. Farsiu, M. D. Robinson, M. Elad, P. Milanfar, Fast and robust multiframe super resolution, IEEE transactions on image processing 13 (10) (2004) 1327–1344.
  • (16) C. E. Duchon, Lanczos filtering in one and two dimensions, Journal of applied meteorology 18 (8) (1979) 1016–1022.
  • (17) W. T. Freeman, T. R. Jones, E. C. Pasztor, Example-based super-resolution, IEEE Computer graphics and Applications 22 (2) (2002) 56–65.
  • (18) W. Dong, L. Zhang, G. Shi, X. Wu, Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization, IEEE Transactions on Image Processing 20 (7) (2011) 1838–1857.
  • (19) R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: International conference on curves and surfaces, Springer, 2010, pp. 711–730.
  • (20) R. Timofte, V. De, L. Van Gool, Anchored neighborhood regression for fast example-based super-resolution, in: Computer Vision (ICCV)., IEEE, 2013, pp. 1920–1927.
  • (21) R. Timofte, V. De Smet, L. Van Gool, A+: Adjusted anchored neighborhood regression for fast super-resolution, in: Asian Conference on Computer Vision, Springer, 2014, pp. 111–126.
  • (22) K. I. Kim, Y. Kwon, Single-image super-resolution using sparse regression and natural image prior, IEEE transactions on pattern analysis and machine intelligence 32 (6) (2010) 1127–1133.
  • (23) H. He, W.-C. Siu, Single image super-resolution using gaussian process regression, in: Computer Vision and Pattern Recognition (CVPR)., IEEE, 2011, pp. 449–456.
  • (24) J. Salvador, E. Perez-Pellitero, Naive bayes super-resolution forest, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 325–333.

  • (25) S. Schulter, C. Leistner, H. Bischof, Fast and accurate image upscaling with super-resolution forests, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
  • (26) D. Dai, R. Timofte, L. Van Gool, Jointly optimized regressors for image super-resolution, in: Computer Graphics Forum, Vol. 34, Wiley Online Library, 2015, pp. 95–104.
  • (27) Z. Wang, D. Liu, J. Yang, W. Han, T. Huang, Deep networks for image super-resolution with sparse prior, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
  • (28) K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, in: Proceedings of the 27th International Conference on International Conference on Machine Learning, Omnipress, 2010, pp. 399–406.
  • (29) C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: European Conference on Computer Vision, Springer, 2014, pp. 184–199.
  • (30) C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE transactions on pattern analysis and machine intelligence 38 (2) (2016) 295–307.
  • (31) C. Dong, C. C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in: European Conference on Computer Vision, Springer, 2016, pp. 391–407.
  • (32) W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
  • (33) J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
  • (34) J. Bruna, P. Sprechmann, Y. LeCun, Super-resolution with deep convolutional sufficient statistics, arXiv preprint arXiv:1511.05666.
  • (35) C. K. Sønderby, J. Caballero, L. Theis, W. Shi, F. Huszár, Amortised map inference for image super-resolution, arXiv preprint arXiv:1610.04490.
  • (36) E. L. Denton, S. Chintala, R. Fergus, et al., Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in neural information processing systems, 2015, pp. 1486–1494.
  • (37) M. Mathieu, C. Couprie, Y. LeCun, Deep multi-scale video prediction beyond mean square error, arXiv preprint arXiv:1511.05440.
  • (38) C. Li, M. Wand, Combining markov random fields and convolutional neural networks for image synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2479–2486.
  • (39) R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, M. N. Do, Semantic image inpainting with perceptual and contextual losses, arXiv preprint arXiv:1607.07539.
  • (40) X. Yu, F. Porikli, Ultra-resolving face images by discriminative generative networks, in: European Conference on Computer Vision, Springer, 2016, pp. 318–333.
  • (41) A. Dosovitskiy, T. Brox, Generating images with perceptual similarity metrics based on deep networks, in: Advances in Neural Information Processing Systems, 2016, pp. 658–666.
  • (42) L. Gatys, A. S. Ecker, M. Bethge, Texture synthesis using convolutional neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 262–270.
  • (43) L. A. Gatys, A. S. Ecker, M. Bethge, Image style transfer using convolutional neural networks, in: Computer Vision and Pattern Recognition (CVPR)., IEEE, 2016, pp. 2414–2423.
  • (44) R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, arXiv preprint.
  • (45) F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer, Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size, arXiv preprint arXiv:1602.07360.
  • (46) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • (47) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi, Describing textures in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • (48) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434.
  • (49) D. L. Yamins, J. J. DiCarlo, Using goal-driven deep learning models to understand sensory cortex, Nature neuroscience 19 (3) (2016) 356.