DeepSEE: Deep Disentangled Semantic Explorative Extreme Super-Resolution

by   Marcel Christoph Bühler, et al.
ETH Zurich

Super-resolution (SR) is by definition ill-posed. There are infinitely many plausible high-resolution variants for a given low-resolution natural image. This is why example-based SR methods study upscaling factors up to 4x (or up to 8x for face hallucination). Most of the current literature aims at a single deterministic solution of either high reconstruction fidelity or photo-realistic perceptual quality. In this work, we propose a novel framework, DeepSEE, for Deep disentangled Semantic Explorative Extreme super-resolution. To the best of our knowledge, DeepSEE is the first method to leverage semantic maps for explorative super-resolution. In particular, it provides control of the semantic regions, their disentangled appearance and it allows a broad range of image manipulations. We validate DeepSEE for up to 32x magnification and exploration of the space of super-resolution.




1 Introduction

Figure 1: Upscaling and Manipulations with Disentangled Style Injection. We show multiple outputs for the low-resolution image in the first column. In column two, we apply the full style matrix from an image of the same person. For columns three to five, we extract the style from a guiding image and substitute the rows belonging to the regions that should be adapted. It is even possible to extract a style from geometric patterns, such as the grid in column five. In summary, DeepSEE can control the appearance of specific semantic regions
Figure 2: Multiple Potential Solutions for a Single Input. We upscale a low-resolution image to different high-resolution variants. Which one would be the correct solution?

In super-resolution (SR), we learn a mapping $G$ from a low-resolution (LR) image $x_{lr}$ to a higher-resolution (HR) image $\hat{x}_{hr}$:

$$\hat{x}_{hr} = G(x_{lr}) \qquad (1)$$

This mapping can be a standard interpolation method, such as bilinear, bicubic or nearest-neighbour. As such interpolations do not restore high-frequency content or small details, their output is visibly different from an original high-resolution image. Consequently, most modern super-resolution methods rely on neural networks to learn a parametrized mapping $G_\Theta$. Such methods usually target upscaling factors of up to 4× in generic domains [timofte2017ntire, blau20182018], and up to 8× for restricted domains, like faces [yu2018superverylow, kim2019progressive-face-sr, lee2018attribute, chen2018fsrnet, yu2018facemultitask, liu2015faceattributesceleba].

Generic methods lack clear guidance on what characteristics the output images should contain. For example, it is not possible to infer the skin texture from a very low-resolution face image. Solutions for domain-specific methods can include image attributes [yu2018superverylow, lee2018attribute, liu2015faceattributesceleba, li2019deep], reference images [li2018learninggfrnet, dogan2019exemplargwainet] and/or facial landmarks as guidance [chen2018fsrnet, kim2019progressive-face-sr]. Such approaches demonstrate valid results for upscaling factors of up to 8×. However, they learn a deterministic mapping and produce a single high-resolution output for a given input.

Producing a single high-resolution output from a low-resolution image is not ideal, as super-resolution is an ill-posed problem and there are multiple valid solutions for a given input image. Fig. 2 shows a low-resolution input and multiple possible outputs. It is very hard to tell which of these images is closest to an unknown ground truth.

Our proposed method, DeepSEE, is capable of generating an infinite number of potentially valid high-resolution candidates for a low-resolution image. The candidates may vary in both appearance and shape, but stay consistent with the low-resolution input. Our method learns a one-to-many mapping from a low-frequency input to a disentangled manifold of potential solutions with the same low-frequencies, but diverse high-frequencies. For inference, a user can tweak the shape or appearance of individual semantic regions until achieving the desired result.

1.1 Contributions

Our main contributions are as follows:

  • We introduce DeepSEE, a novel framework for Deep disentangled Semantic Explorative Extreme super-resolution.

  • We tackle the ill-posed super-resolution problem in an explorative approach based on semantic maps. DeepSEE is able to sample and manipulate the solution space to produce an infinite number of high-resolution outputs for a single low-resolution input image.

  • DeepSEE gives control over both shape and appearance. A user can tweak the disentangled semantic regions individually.

  • We go beyond other approaches by super-resolving to the extreme, with upscaling factors up to 32×.

2 Related Work

2.1 Classical Super-resolution

Early super-resolution methods were mostly based on edge [fattal2007imageedge, sun2008imagegradientprofile] and image statistics [aly2005image, zhang2010non]. Those pre-deep-learning techniques relied on paired samples of low- and high-resolution images and employed traditional machine learning algorithms, such as support-vector regression [ni2007image], graphical models [wang2005patch], Gaussian process regression [he2011single], sparse coding [yang2010image] or piece-wise linear regression [timofte2014a+]. With the advent of deep learning, a multitude of top-performing methods were proposed [dong2015image, kim2016deeply, timofte2017ntire, timofte2018ntire], defining the current mainstream research in example-based single-image super-resolution. Typically, super-resolution aims to restore the missing high frequencies and details to achieve high (i) fidelity or (ii) perceptual quality. Fidelity quantifies the distortion with respect to a high-resolution ground-truth image, whereas perceptual metrics measure visual quality. It is important to note that high perceptual quality does not necessarily require a close pixel-wise match to the ground truth (high fidelity).

While most of the research still targets fidelity, perceptual super-resolution research is starting to bloom. Its results are less blurry and more photo-realistic [blau20182018].

2.1.1 Generative Adversarial Networks.

Generative Adversarial Networks (GANs) [goodfellow2014generative] have become increasingly popular in image generation [karras2019style, karras2019analyzing, spade, choi2018stargan, romero2019smit], thanks to their capability to sample and produce highly realistic images from a target distribution. The underlying technique is to alternately train two convolutional networks, a generator and a discriminator, with opposing objectives in a minimax game. While the discriminator aims to correctly classify whether the images are real or fake, the generator learns to produce photo-realistic images that fool the discriminator.

2.1.2 Perceptual Super-resolution.

Super-resolution methods with a focus on high fidelity tend to generate blurry images [blau20182018]. In contrast, perceptual super-resolution targets photo-realism while keeping consistency with the low frequencies in the input. Please note that high perceptual quality is not necessarily equal to a pixel-wise match to the ground truth (high fidelity). Training perceptual models includes perceptual losses [johnson2016perceptual], such as those based on the activations of the ImageNet pre-trained VGG-19 network [simonyan2014veryvgg], or GANs [ledig2017photosrgan, blau20182018, wang2018esrgan, bulat2018superfan]. A seminal GAN-based work for perceptual super-resolution was SRGAN [ledig2017photosrgan]. SRGAN employed a residual network [he2016deep] for the generator and relied on a combination of losses for reconstruction/fidelity, texture/content based on VGG-19 [simonyan2014veryvgg] activations, and a GAN discriminator [goodfellow2014generative] for realism. Follow-up works, such as ESRGAN [wang2018esrgan], further improved upon SRGAN by tweaking architecture and loss functions. Recently, [gu2019aim, gu2019div8k] introduced DIV8K, a dataset of ultra-high-definition images for extreme super-resolution.

2.1.3 Guided Super-resolution.

Super-resolving an image without any additional guidance is a hard problem. This is why many methods condition on previously known or predicted information. Concretely, it is possible to enforce characteristics and guide the image generation via semantic maps [wang2018recoveringsftgan], attributes [lee2018attribute, liu2015faceattributesceleba, yu2018superverylow], facial landmarks [chen2018fsrnet, kim2019progressive-face-sr, yu2018facemultitask], or another image [li2018learninggfrnet, dogan2019exemplargwainet].

2.1.4 Deterministic Output.

One important shortcoming of existing approaches is the deterministic nature of the output, i.e. a low-resolution image is mapped to a single output. Such solutions are not ideal, as for a low-resolution image that lacks high-frequency information and details, many potential outputs can resemble the input. In this work, we go in the direction of perceptual super-resolution based on a conditional Generative Adversarial Network that produces highly realistic outputs, and allows for controlling the output in terms of semantic content and appearance.

2.2 Explorative Super-resolution

In super-resolution, many possible solutions for a low-resolution input exist. The possibility to tweak a model's output is an important step towards making super-resolution even more useful and controllable. In a concurrent work, Bahat et al. [bahat2019explorable] suggest an editing tool with which a user can manually manipulate the super-resolution output. Their manipulations include adjusting the variance or periodicity (e.g. for textures), brightness reduction, or face editing. In our work, we do not stop at implementing specific manipulations, but go further and allow a user to freely walk a latent style space and explore even more possible solutions. For example, our model can yield faces with different skin texture, add/remove lipstick, manipulate eyes, eyebrows, glasses, noses, mouth, etc. (Figures 1, 2, 6 and more in the supplementary material). Remarkably, we can impose different appearance styles per semantic region, taken either from a latent space or from a reference image. To the best of our knowledge, Bahat et al. [bahat2019explorable] and our work are the first works targeting explorative super-resolution and, moreover, DeepSEE is the first method that achieves explorative super-resolution using semantically-guided style imposition.

2.3 Domain-specific Super-resolution

Applying super-resolution in a specific domain constrains the problem and improves the quality of the solutions. The focus on a particular domain also makes it easier to leverage prior information. Typical applications include super-resolution of faces [yu2018superverylow, kim2019progressive-face-sr, lee2018attribute, chen2018fsrnet, yu2018facemultitask, liu2015faceattributesceleba], outdoor scenes [wang2018recoveringsftgan] or depth maps [wang2019deepsurvey, riegler2016atgv, hui2016depth, song2016deep]. In this section we focus on super-resolution for faces, namely face hallucination.

Because faces are highly constrained in structure, several forms of supervision can successfully guide the generation process. Facial keypoints, facial attributes and person identity are three common ones. First, influential works [chen2018fsrnet, yu2018facemultitask, kim2019progressive-face-sr] used facial landmarks as heatmaps to align the output with respect to the input, and thus constrain the relative distances of the facial structure in both input and output. Second, other methods [li2019deep, lee2018attribute, yu2018superverylow] employed binary attributes to enforce the presence or absence of facial components, which preserves the structural consistency in the upscaled image. Lastly, Li et al. [li2018learninggfrnet] and Dogan et al. [dogan2019exemplargwainet] leveraged the identity information using the deep features of a facial verification network in order to preserve the perceptual information. Despite the important roles that facial keypoints, attributes and identity preservation play in facial upscaling, they are a high-level supervision that does not allow fine-grained manipulation of the output, which in most cases is a desired property.

In contrast to previous works, we use a discrete semantic prior for each region of the face. This allows disentangled manipulations of the semantic layout and style during inference.

2.4 Super-resolution Evaluation

Many standard metrics, such as Peak Signal-to-Noise Ratio (PSNR) or Structural Similarity index (SSIM) [wang2004imagessim], evaluate fidelity. However, fidelity does not correlate well with the human visual response to the output [ledig2017photosrgan, blau20182018, wang2018esrgan, wang2018recoveringsftgan], i.e. a high PSNR or SSIM does not guarantee that the output is perceptually good looking [blau2018perception]. Alternative metrics, such as the recently proposed LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur], evaluate perceptual quality. In this work, we focus our validation on high perceptual quality as in [ledig2017photosrgan, wang2018esrgan, sajjadi2017enhancenet, wang2018recoveringsftgan], exploration of the solution space [bahat2019explorable] and extreme super-resolution [gu2019aim].

3 DeepSEE

Figure 3: Overview of DeepSEE Components and Data Flow. In addition to the low-resolution image (LR), our generator uses a semantic guidance, as well as a latent representation of style. We train a semantic segmentation network to extract the semantic regions from the low-resolution image. Moreover, a style encoder predicts a region dependent style. During inference, a user can manipulate both semantics and style. For a detailed description, please refer to Section 3 and Fig. 4

3.1 Problem Formulation

The low-resolution input image ($x_{lr}$) acts as a starting point that carries the low-frequency information. The generator ($G$) learns to upscale this image and hallucinates the high frequencies, yielding the high-resolution image $\hat{x}_{hr}$. As guidance, $G$ leverages both a high-resolution semantic map ($M \in \{0,1\}^{N \times H \times W}$ for an output of size $H \times W$, where $N$ is the number of semantic regions) and independent styles per region ($S \in \mathbb{R}^{N \times d}$, where $d$ is the style dimensionality). The upscaled image should thus retain the low-frequency information from the low-resolution image. In addition, it should be consistent in terms of the semantic regions and have specific, yet independent styles per region. We formally define our problem as

$$\hat{x}_{hr} = G(x_{lr} \mid M, S) \qquad (2)$$


Remarkably, thanks to the flexible semantic layout, a user is able to control the appearance and shape of each semantic region through the generation process. This makes it possible to tweak an output until the desired solution has been found.

Following the GAN framework [goodfellow2014generative], our method consists of a generator and a discriminator network. In addition, we employ a segmentation network and an encoder for style. Concretely, the segmentation network predicts the semantic mask from a low-resolution image and the encoder produces a disentangled style. Fig. 3 illustrates our model at a high level and Fig. 4 provides a more detailed view. In the following, we describe each component in more detail.

3.2 Style Encoder

3.2.1 In- and Outputs.

The style encoder $E$ extracts $N$ style vectors of size $d$, one per semantic region, from an input image and combines them into a style matrix $S \in \mathbb{R}^{N \times d}$. The style encoder is designed such that it can extract the style from either a low-resolution image $x_{lr}$ or a high-resolution image $x_{hr}$. The encoder maps both low- and high-resolution inputs to the same latent style space. It disentangles the regional styles via the semantic layout $M$. The resulting style matrix serves as guidance for the generator.

3.2.2 Architecture.

The style encoder consists of a convolutional neural network for the low-resolution input and a similar convolutional neural network for the high-resolution input. Their outputs are mapped to the same latent style space via a shared layer. Fig. 4 illustrates the flow from the inputs to the style matrix. The architecture for the high-resolution input consists of four convolution layers. The input is downsampled twice in the intermediate two layers and upsampled again after a bottleneck. Similarly, the low-resolution encoder consists of four convolution layers. It upsamples the feature map once before the shared layer. The resulting feature map is then passed through the shared convolution layer and mapped to a bounded range. During inference, a user can sample from this latent space to produce diverse outputs.

Inspired by Zhu et al. [zhu2019sean], as a final step, we collapse the output of the shared style encoder for each semantic mask using regional average pooling. This is an important step to disentangle style codes across semantic regions. We describe the regional average pooling in detail in the supplementary material.
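The exact pooling used by DeepSEE is specified in the supplementary material. As a minimal sketch of the general idea, the NumPy code below averages a feature map inside each binary region mask to obtain one style vector per region. The function name `regional_average_pooling`, the `(d, H, W)` feature layout and the choice to return zeros for empty regions are assumptions made for illustration.

```python
import numpy as np


def regional_average_pooling(features, masks):
    """Average a (d, H, W) feature map inside each of N binary (H, W) masks.

    Returns an (N, d) style matrix. Rows of regions without pixels stay zero
    (an assumption of this sketch).
    """
    features = np.asarray(features, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    # Sum the features over the pixels of each region: result has shape (N, d).
    sums = np.einsum("nhw,dhw->nd", masks, features)
    # Number of pixels per region, shape (N, 1).
    counts = masks.sum(axis=(1, 2))[:, None]
    # Divide only where a region is non-empty to avoid division by zero.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)


# Toy example: 2 feature channels, 3 regions (the third one is empty).
features = np.arange(8, dtype=float).reshape(2, 2, 2)
masks = np.array([
    [[1, 1], [0, 0]],  # region 0: top row
    [[0, 0], [1, 1]],  # region 1: bottom row
    [[0, 0], [0, 0]],  # region 2: not present in the image
])
style = regional_average_pooling(features, masks)
```

Because each row only depends on pixels inside its own mask, changing one region's appearance leaves the style vectors of the other regions untouched, which is the property the disentanglement relies on.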

3.3 Generator

Our generator learns a mapping $G(x_{lr} \mid M, S) \mapsto \hat{x}_{hr}$, where the model conditions on both a semantic layout $M$ and a style matrix $S$. This makes it possible to control the appearance, as well as the size and shape, of each region in the semantic layout.

The semantic layout $M$ consists of one binary mask for each semantic region. For the style matrix $S$, we assume a uniform distribution, where each row of $S$ represents a style vector of size $d$ for one semantic region.

At a high level, the generator is a series of residual blocks with upsampling layers in between. Starting from the low-resolution image, it repeatedly doubles the resolution using nearest neighbor interpolation and processes the result in residual blocks. In the residual blocks, we inject semantic and style information through multiple normalization layers. For the semantic layout, we use spatially adaptive normalization (SPADE) [spade]. SPADE learns a spatial modulation of the feature maps from a semantic map. For the style, we utilize semantic region adaptive normalization in a similar fashion as [zhu2019sean]. Semantic region adaptive normalization is an extension to SPADE, which includes style. Like SPADE, it computes spatial modulation parameters, but also takes into consideration a style matrix computed from a reference image. In our case, we extract the style from an input image through our style encoder as described in Section 3.2. For more details, please check the supplementary material.
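To make the two building blocks of this paragraph concrete, the following NumPy sketch shows nearest-neighbor doubling and a heavily simplified spatially adaptive normalization. It is not the DeepSEE or SPADE implementation: the learned convolutions that predict the modulation parameters from the semantic map are replaced by hypothetical per-region lookup tables `gamma` and `beta`, and the `(1 + gamma)` scaling is one common convention assumed here.

```python
import numpy as np


def upsample_nearest_2x(x):
    """Double the spatial size of a (C, H, W) array by repeating pixels."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def spatially_adaptive_norm(features, masks, gamma, beta, eps=1e-5):
    """Normalize each channel, then scale and shift it per pixel.

    features: (C, H, W) feature map
    masks:    (N, H, W) binary region masks (the semantic layout)
    gamma:    (N, C) per-region scale offsets (stand-in for learned layers)
    beta:     (N, C) per-region shifts (stand-in for learned layers)
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Build per-pixel parameter maps by looking up each pixel's region.
    gamma_map = np.einsum("nhw,nc->chw", masks, gamma)
    beta_map = np.einsum("nhw,nc->chw", masks, beta)
    return normalized * (1.0 + gamma_map) + beta_map


# Toy example: upsample a 2x2 feature map and shift only the right region.
x = np.arange(8, dtype=float).reshape(2, 2, 2)
up = upsample_nearest_2x(x)
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1  # region 0: left half
masks[1, :, 2:] = 1  # region 1: right half
no_shift = spatially_adaptive_norm(up, masks, np.zeros((2, 2)), np.zeros((2, 2)))
shifted = spatially_adaptive_norm(
    up, masks, np.zeros((2, 2)), np.array([[0.0, 0.0], [5.0, 5.0]])
)
```

In the toy example only the pixels of the right region change, which illustrates how region-wise parameters give spatial control over the feature maps.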

Figure 4: DeepSEE Architecture. Our Generator upscales a low-resolution image (LR) in a series of residual blocks. A predicted semantic mask guides the geometric layout and a style matrix controls the appearance of semantic regions. The noise added to the style matrix increases the robustness of the model. We describe the style encoding, generator and semantic segmentation in Section 3.2, 3.3, and 3.5, respectively. For more details, please refer to the supplementary material

3.4 Discriminator

We use an ensemble of two similar discriminator networks. One operates on the full image, and the other on the half-scale version of the generated image. Each network takes the concatenation of an image with its corresponding semantic layout and predicts the realism of overlapping image patches. The discriminator architecture follows [spade]. Please refer to the supplementary material for a more detailed description.

3.5 Segmentation Network

Our training scheme assumes high-resolution segmentation maps, which in most cases are not available during inference. Therefore, we predict a segmentation map from the low-resolution input image $x_{lr}$. In particular, we train a segmentation network $\mathit{Seg}$ to learn the mapping $\mathit{Seg}(x_{lr}) = M$, where $M$ is a high-resolution semantic map.

The semantic segmentation network is trained independently from the super-resolution network. For more information, please have a look at the supplementary material.

3.6 DeepSEE Model Variants

We suggest two slightly different variants of our proposed method. The guided model learns to super-resolve an image with the help of a high-resolution (HR) reference image. The other (independent) model does not require any additional guidance and infers a reference style from the low-resolution image.

The guided model is able to apply characteristics from a reference image. When fed a guiding image from the same person, it extracts the original characteristics (if visible). Alternatively, when feeding an image from a different person, it integrates those aspects (as long as it is consistent with the low-resolution input). Figure 1 shows an example, where we first generate an image with the style from the same person and then alter particular regions with styles from other images. The second (independent) model applies to the case where no reference image is available.

The independent and the guided model differ in how the style matrix $S$ is computed. For the independent model, we extract the style from the low-resolution input image: $S = E(x_{lr})$. In contrast, the guided model uses a high-resolution reference image $x_{ref}$ to compute the style $S = E(x_{ref})$. It is worth mentioning that, for training, paired supervision is not necessary, as we only require one high-resolution picture of a person.

3.7 Training

We train the generator, encoder and discriminator end-to-end in an adversarial setting, similar to [spade, wang2018pix2pixhd]. As a difference, we inject noise at multiple stages of the generator. We list hyper-parameters and more training details in the supplementary material. In the following, we describe the loss function and explain the noise injection.

3.7.1 Loss Function.

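Our loss function is identical to [spade]. Concretely, our discriminator computes an adversarial loss $\mathcal{L}_{adv}$ with feature matching $\mathcal{L}_{feat}$. In addition, we employ a perceptual loss $\mathcal{L}_{vgg}$ from a VGG-19 network [simonyan2014veryvgg]. Our full loss function is defined in Equation 3:

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{feat} \, \mathcal{L}_{feat} + \lambda_{vgg} \, \mathcal{L}_{vgg} \qquad (3)$$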


We set the loss weights to .

3.7.2 Injection of Noise.

After encoding the style into a style matrix $S$, we add uniformly distributed noise. We define the noisy style matrix as $S' = S + \Delta$, where $\Delta_{ij} \sim U(-\delta, \delta)$. We empirically choose $\delta$ based on the model variant.
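The noise injection can be sketched in a few lines of NumPy. The function name `add_style_noise`, the matrix size and the noise magnitude `0.2` below are placeholders chosen for illustration only; the values used for the two model variants are not given in this section.

```python
import numpy as np


def add_style_noise(style, delta, rng):
    """Return the style matrix plus noise drawn uniformly from [-delta, delta)."""
    noise = rng.uniform(-delta, delta, size=style.shape)
    return style + noise


# Toy example with a hypothetical 4-region, 8-dimensional style matrix.
rng = np.random.default_rng(0)
clean_style = np.zeros((4, 8))
noisy_style = add_style_noise(clean_style, delta=0.2, rng=rng)
```

Passing the random generator explicitly keeps the perturbation reproducible, which helps when comparing outputs for the same style with and without noise.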

3.7.3 Learning Rate.

DeepSEE needs to handle both low-resolution and high-resolution style inputs. The high-resolution style encoder can extract rich style information. The low-resolution style encoder, however, does not receive such high-frequency information and needs to predict it, which we consider a considerably harder task. Therefore, we set a lower learning rate for the low-resolution style encoder. We train both encoders alternately, feeding either a low-resolution or a high-resolution image in each iteration. Fig. 4 illustrates the data flow for the style architecture in the context of the full DeepSEE model. Please refer to the supplementary material for a detailed description of the learning rate and training schedule.

4 Experimental Framework

4.1 Datasets

We train and evaluate our method on face images from CelebAMask-HQ [karras2017progressive, CelebAMask-HQ] and CelebA [liu2015faceattributesceleba]. We use the official training split for development and training, and test on the provided test split. All low-resolution images (serving as inputs) are computed via bicubic downsampling. Our code and models will be made available.

4.2 Semantic Segmentation

We train a segmentation network [chen2017deeplab, chen2017rethinkingdeeplabv3] on images from CelebAMask-HQ [CelebAMask-HQ, karras2017progressive]. The network learns to predict a high-resolution segmentation map with semantic regions from a low-resolution image. As a model, we choose DeepLab V3+ [chen2017deeplab, chen2018encoderdeeplab, chen2017rethinkingdeeplabv3] with DRN [Yu2016multiscale, Yu2017drn] as the backbone.

4.3 Baseline

We establish a baseline via bicubic interpolation. We downsample an image to a low resolution and then upsample it back to the original resolution.
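The round trip behind such an interpolation baseline can be sketched with NumPy alone. Note the substitution: to stay dependency-free, this sketch uses block averaging for downsampling and nearest-neighbor repetition for upsampling instead of the bicubic kernels used in the paper. All function names are illustrative.

```python
import numpy as np


def downsample_mean(image, factor):
    """Shrink an (H, W) image by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def upsample_nearest(image, factor):
    """Enlarge an (H, W) image by repeating each pixel factor times per axis."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)


def interpolation_baseline(image, factor):
    """Downsample and upsample again; stand-in for the bicubic baseline."""
    return upsample_nearest(downsample_mean(image, factor), factor)


image = np.arange(16, dtype=float).reshape(4, 4)
baseline = interpolation_baseline(image, factor=2)
```

The baseline has the original size and the same low-resolution content as the input, but every 2x2 block is constant: the high-frequency detail is gone, which is exactly what a learned super-resolution model has to hallucinate.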

4.4 Evaluation Metrics

We compute the traditional super-resolution metrics peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [wang2004imagessim], as well as the perceptual metrics Fréchet Inception Distance (FID) [heusel2017ttur] and Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018perceptuallpips]. Our method focuses on generating results of high perceptual quality, measured by LPIPS and FID. PSNR and SSIM are frequently used, but they are known not to correlate very well with perceptual quality [zhang2018perceptuallpips]. We still list SSIM scores for completeness and report PSNR in the supplementary material.
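Of these metrics, PSNR is simple enough to state directly: it is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MSE is the mean squared error between the two images and MAX is the largest possible pixel value. The sketch below assumes 8-bit images (`max_value=255`); SSIM, LPIPS and FID require more machinery and are not reproduced here.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.zeros((4, 4))
off_by_one = np.ones((4, 4))
score = psnr(reference, off_by_one)
```

PSNR depends only on the pixel-wise error, so a sharp but slightly shifted texture can score worse than a blurry average, which is one reason fidelity metrics and perceptual quality can disagree.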

5 Discussion

We validate our method on two different setups. First, we compare to state-of-the-art methods in face hallucination and provide both quantitative and qualitative results. Second, we show results for extreme and explorative super-resolution by applying numerous manipulations for upscaling factors of up to 32×.

5.1 Comparison to Face Hallucination Methods

To the best of our knowledge, our method is the first face hallucination model based on discrete semantic masks. We compare to models (i) that use reference images [li2018learninggfrnet, dogan2019exemplargwainet] and (ii) models guided by facial landmarks [chen2018fsrnet, kim2019progressive-face-sr].

For (i), we use the CelebAMask-HQ dataset [karras2017progressive, CelebAMask-HQ] to compare to GFRNet [li2018learninggfrnet] and GWAInet [dogan2019exemplargwainet], both of which leverage an image from the same person to guide the upscaling. We list quantitative results in Table 1. Our method achieves the best scores for all metrics. For the perceptual metrics, LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur], DeepSEE outperforms the other methods by a considerable margin. As we depict in Fig. 5, our proposed method also produces more convincing results, in particular for difficult regions such as eyes, hair and teeth, where our model benefits from the semantic guidance. We provide more examples in the supplementary material.

For models based on facial landmarks (ii), Table 2 compares to FSRNet [chen2018fsrnet] and Kim et al. [kim2019progressive-face-sr], where we compute the metrics on the CelebA test set. The models from [chen2018fsrnet, kim2019progressive-face-sr] were trained to generate images at a resolution that CelebA supports, so we can evaluate in their setting on CelebA. In contrast, the other two related methods [li2018learninggfrnet, dogan2019exemplargwainet] generate images that are bigger than the images in CelebA [liu2015faceattributesceleba], hence we evaluate them on CelebAMask-HQ [karras2017progressive, CelebAMask-HQ]. DeepSEE outperforms FSRNet in all metrics, as well as visually. Please check the supplementary material for a direct visual comparison.

GFRNet [li2018learninggfrnet]
GWAInet [dogan2019exemplargwainet]

ours (independent)
ours (guided) 0.6887 0.1519 22.02
Table 1: We list quantitative results on CelebAMask-HQ [CelebAMask-HQ, kim2019progressive-face-sr] for methods that use a guiding image from the same person. The images are upscaled 8×. Both our DeepSEE variants outperform the other methods on the perceptual metrics (LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur]). For qualitative results, please refer to the supplementary material.
Bicubic 159.60
FSRNet [chen2018fsrnet] (MSE)
FSRNet [chen2018fsrnet] (GAN)
Kim et al[kim2019progressive-face-sr] 0.6634 11.408
ours (independent) 0.1063 13.841
ours (guided) 11.253
Table 2: We compute the scores on the full CelebA test set after center cropping and resizing to . The upscaling factor is , starting at . Our model beats the state-of-the-art in perceptual quality. For a description of our two model variants, please refer to Section 3.6

It is important to note that all related face hallucination models output a single solution, i.e. for a given input there is always a single output. However, there exist multiple valid solutions for a low-resolution image. In contrast to prior work, our method can generate an infinite number of solutions, and gives the user fine-grained control over the output. Figures 1 and 2 show several solutions for a low-resolution image. All variants are highly consistent with the low frequencies of the input image. The high frequencies, however, are not defined by the input, and our method generates multiple variants. DeepSEE can not only extract the overall appearance from a guiding image of the same person, but it can also inject aspects from other people, and even leverage completely different style images, for instance, geometric patterns. We describe Fig. 1 in more detail in Section 5.2. In addition, our method allows manipulating semantics, i.e. changing the shape, size or content of regions. Figures 1, 2 and 6 illustrate some examples for manipulating eyeglasses, eyebrows, noses, lips, hair, skin, etc. We provide further visualizations in the supplementary material.

Figure 5: Comparison to Related Work on 8× Upscaling. We compare to our default solutions for the independent and guided model settings. The randomly sampled guiding images are on the top right of each image. For ours, the bottom right corner shows the predicted semantic mask. Our results are less blurry than GFRNet [li2018learninggfrnet]. Comparing to GWAInet [dogan2019exemplargwainet], we observe differences in visual quality for difficult regions, like hair, eyes or teeth. With the additional semantic input, our method can produce more realistic textures. Please zoom in for better viewing.

5.2 Manipulations

Figure 6: Manipulating Semantics. We continuously manipulate the semantic mask and change regional shapes. The first column shows the default solution. For the other columns, we highlight the manipulated regions and show the resulting image. We upscale with factor and show the input image in the bottom right. For more examples, please refer to the supplementary material

Our proposed approach is an explorative super-resolution model, which allows a user to tune two main knobs in order to manipulate the model output. Figure 3 shows these knobs in green boxes.

The first way to change the output image is to adapt the disentangled style matrix, for instance by adding random noise, to interpolate between style codes, or to mix multiple styles, as illustrated by Figure 1. Going from one style code to another gradually changes the image output. For example, manipulating the style code for lips can make them either slowly disappear, or on the contrary, become more prominent. We provide additional examples in the supplementary material.
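These style manipulations reduce to simple operations on the rows of the style matrix. The sketch below shows linear interpolation between two style matrices and the substitution of selected rows from a guiding style. The function names, the matrix sizes and the idea that row index 2 stands for "lips" are purely illustrative assumptions; the decoding of a style matrix into an image by the generator is not shown.

```python
import numpy as np


def interpolate_styles(style_a, style_b, t):
    """Linearly blend two (N, d) style matrices; t=0 gives a, t=1 gives b."""
    return (1.0 - t) * style_a + t * style_b


def swap_region_styles(style, guide_style, region_ids):
    """Copy the rows listed in region_ids from guide_style into a copy of style."""
    mixed = style.copy()
    mixed[region_ids] = guide_style[region_ids]
    return mixed


# Toy example with 4 hypothetical regions and 3 style dimensions.
own_style = np.zeros((4, 3))
guide_style = np.ones((4, 3))
halfway = interpolate_styles(own_style, guide_style, t=0.5)
mixed = swap_region_styles(own_style, guide_style, region_ids=[2])
```

Sweeping `t` from 0 to 1 gives the gradual transition described above, while row substitution changes a single region and keeps all other regions at their original style.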

The second tuning knob is the semantic layout. The user can change the size and shape of semantic regions, which causes the generator to adapt the output representation accordingly. Figure 6 shows an example where we close the mouth and make the chin more pointy by manipulating the regions for lips and facial skin. Furthermore, we change the shape of eyebrows, reduce the nose and update the stroke of the eyebrows. It is also possible to create hair on a bald head or add/remove eyeglasses. Alternatively, a user can change or swap semantic labels. For instance, we replace eyes with eyebrows and make the nose disappear. We showcase these examples in the supplementary material.
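Edits of the semantic layout can be expressed as operations on an integer label map that is then converted back into one binary mask per region. The following sketch shows a label replacement of the kind described above. The label indices are hypothetical and the helper names are not taken from the DeepSEE code.

```python
import numpy as np


def relabel_region(label_map, source, target):
    """Return a copy of an (H, W) label map with all `source` pixels set to `target`."""
    edited = label_map.copy()
    edited[label_map == source] = target
    return edited


def to_binary_masks(label_map, num_regions):
    """Convert an (H, W) integer label map into (N, H, W) binary masks."""
    region_ids = np.arange(num_regions)[:, None, None]
    return (region_ids == label_map[None]).astype(np.uint8)


# Toy layout with hypothetical labels: 0 = skin, 1 = nose, 2 = hair.
label_map = np.array([[0, 1], [2, 1]])
without_nose = relabel_region(label_map, source=1, target=0)
masks = to_binary_masks(without_nose, num_regions=3)
```

After the edit the nose mask is empty and its pixels belong to the skin region, so a generator conditioned on `masks` would render skin there.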

5.3 Extreme Super-resolution

Figure 7: Extreme Super-resolution. We show two examples for the upscaling factor 32×. The smaller image to the right of each result is the ground truth image from CelebAMask-HQ [karras2017progressive, CelebAMask-HQ]. To illustrate the upscale magnitude, we highlight the low-resolution input image in the bottom left corner in yellow. Please refer to the supplementary material to view these images in high resolution.

While previous face hallucination models applied upscaling factors of up to 8× [chen2018fsrnet, dogan2019exemplargwainet, li2018learninggfrnet, kim2019progressive-face-sr], DeepSEE is capable of going far beyond, with upscaling factors of up to 32×. In particular, DeepSEE upscales low-resolution inputs to pixels. This is possible thanks to the conditioning on dense semantic maps, as well as the injection of region-dependent style. Such constraints serve as strict guidance to the generator. As a result, the semantic regions in the super-resolved faces are highly consistent with the input masks and the generated images are of high perceptual quality, even for extreme upscaling factors.
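To make the jump from 8× to 32× concrete, the short sketch below computes how many output pixels must be generated for each input pixel. The helper names and the 16×16 input size are hypothetical and serve only as an illustration.

```python
def output_size(height, width, factor):
    """Spatial size of the super-resolved image for an integer upscaling factor."""
    return height * factor, width * factor

def pixels_per_input_pixel(factor):
    """Number of output pixels that correspond to a single input pixel."""
    return factor * factor

# Hypothetical 16x16 input, used only to illustrate the arithmetic.
size_8x = output_size(16, 16, 8)
size_32x = output_size(16, 16, 32)
ratio_8x = pixels_per_input_pixel(8)     # 64 output pixels per input pixel
ratio_32x = pixels_per_input_pixel(32)   # 1024 output pixels per input pixel
```

Going from 8× to 32× therefore multiplies the amount of detail to be hallucinated per input pixel by 16, which is why strong guidance from semantics and style matters at this scale.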

Figure 7 shows results for 32× magnification. For the example on the left, some high-frequency components differ slightly (e.g. the exact trimming of the beard, or the wrinkles on the forehead), yet our model captures the main essence and identity from the low-resolution image. The output is highly consistent with the ground truth image (bottom right).

Given such extreme upscaling factors, it is not surprising that we do not always observe such consistent results out of the box. In the second example (Fig. 7 on the right), the upscaled image shows a young woman with a smooth skin texture. In reality, however, the ground-truth image is a middle-aged woman with wrinkles. Wrinkles are a typical example of a high-frequency component that is not clear in a low-resolution image. Most images in CelebAMask-HQ [karras2017progressive, CelebAMask-HQ] show young people with smooth skin. Given the low-resolution version of a middle-aged woman, the style encoder incorrectly inherits the bias of the dataset and predicts the style code of a young woman. This case highlights the benefits of an explorative super-resolution model. With DeepSEE, a user is now able to manipulate the style for the skin and generate a solution that matches the ground-truth. Please check the supplementary material for more illustrations and higher-resolution versions of Fig. 7.

6 Ablation Study

We investigate the influence of DeepSEE's main components in an ablation study by training models with the injection of semantics, style, or both disabled. The models are trained for 7 epochs, which corresponds to 3 days on a single TITAN Xp GPU, with upscaling factor , batch size 4, and the CelebA [liu2015faceattributesceleba] dataset. More details regarding hyper-parameters can be found in the supplementary material.

6.1 Architectural Differences

For the first model (Prior-only), we disable both semantics and style. Therefore, the generator blocks consist of convolutions, batch normalization and ReLU activations. The model's only conditioning is the low-resolution input. For the hr-guided-only model (listed as Guided-style-only in Table 3), we inject the style matrix computed from another high-resolution image of the same person. Lastly, we train a semantics-only model that does not inject any style but applies a semantic layout via spatially adaptive normalization [spade].
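The idea behind spatially adaptive normalization can be sketched as follows: features are first normalized without learned parameters, then scaled and shifted per pixel according to the semantic label at that location. This is a simplified illustration under stated assumptions, not the layer used in the paper: the scale and shift are taken from per-class lookup tables (`gamma_per_class`, `beta_per_class`, both hypothetical) rather than predicted by convolutions over the mask, and the statistics are computed per channel of a single sample rather than over a batch.

```python
import numpy as np

def spatially_adaptive_norm(features, mask, gamma_per_class, beta_per_class, eps=1e-5):
    """features: (C, H, W); mask: (H, W) integer labels; gamma/beta: (num_classes, C)."""
    # Parameter-free normalization: zero mean and unit variance per channel.
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)
    # Look up a scale and shift for every pixel from its label, reshaped to (C, H, W).
    gamma = np.moveaxis(gamma_per_class[mask], -1, 0)
    beta = np.moveaxis(beta_per_class[mask], -1, 0)
    return normalized * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 6, 6))
mask = np.zeros((6, 6), dtype=np.int64)
mask[:, 3:] = 1                      # right half of the image belongs to class 1
gamma = np.zeros((2, 4))
beta = np.zeros((2, 4))
beta[1] = 5.0                        # shift only the pixels of class 1
out = spatially_adaptive_norm(features, mask, gamma, beta)
```

In this toy setup, only the right half of `out` is shifted, which shows how the semantic layout decides where each modulation is applied.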

6.2 Ablation Discussion

We compare performance scores in Table 3. The performance on all metrics improves when adding either semantics, style or both. Comparing models with either semantics or style (hr-guided-only vs. semantics-only), the perceptual metrics show better scores when including semantics. In particular, FID [heusel2017ttur] is considerably better. Combining both semantics and style yields even better results for both the distortion measures (PSNR and SSIM [wang2004imagessim]) and the perceptual metrics (LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur]). The performance of our two suggested model variants (the independent model and the guided model) is very similar for distortion metrics. In terms of perceptual quality, the guided model clearly beats the independent model in FID. This is not surprising, as the guided model can extract the style from a guiding image of the same person, which makes it easier to produce an image that is close to the ground truth. On the other hand, the independent model is more flexible towards random manipulations of the style matrix.
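As a reminder of what the distortion metric measures, PSNR follows directly from the mean squared error between the ground truth and the super-resolved image. The sketch below uses the standard definition; the value range `max_value` and any preprocessing (color space, cropping) are assumptions, since the exact evaluation protocol is not stated in this section.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)      # constant error of 0.1 gives an MSE of 0.01
score = psnr(reference, estimate)    # 10 * log10(1 / 0.01) = 20 dB
```

PSNR is logarithmic in the error, so small absolute gaps between PSNR scores can still correspond to noticeable changes in mean squared error. The perceptual metrics (LPIPS and FID) rely on pretrained networks and are not reproduced here.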

Name               Semantics  Style  HR-Guidance  PSNR    SSIM    LPIPS  FID
Prior-only         -          -      -            20.883  0.6168  0.123  25.11
Guided-style-only  -          ✓      ✓            21.564  0.6507  0.111  16.66
Semantics-only     ✓          -      -            21.517  0.6543  0.110  12.57
Independent        ✓          ✓      -            21.852  0.6631  0.106  13.84
Guided             ✓          ✓      ✓                                   11.25

Table 3: We conduct an ablation study to compare models with disabled style injection or semantic conditioning. We find that semantics have a strong influence on both fidelity (PSNR and SSIM) and perceptual quality (LPIPS and FID). However, the best scores require both semantics and style. Finally, using a high-resolution guiding image (guided model) from the same person provides an additional point of control to the user compared to an independent model

7 Conclusion

The super-resolution problem is ill-posed because a lot of information is missing and needs to be hallucinated. In this paper, we tackle super-resolution with an explorative approach, DeepSEE, based on semantic regions and disentangled style codes. DeepSEE allows for fine-grained control of the output, disentangled into region-dependent appearance and shape. Our model goes far beyond common upscaling factors (up to 4×, or 8× for face hallucination) and magnifies up to 32×. Our validation on faces demonstrates results of high perceptual quality.

Acknowledgments. This work was partly supported by the ETH Zürich Fund (OK), and by Huawei, Amazon AWS and Nvidia grants.