CVPR 2021 Challenge on Super-Resolution Space
Super-resolution (SR) is by definition ill-posed. There are infinitely many plausible high-resolution variants for a given low-resolution natural image. This is why example-based SR methods study upscaling factors up to 4x (or up to 8x for face hallucination). Most of the current literature aims at a single deterministic solution of either high reconstruction fidelity or photo-realistic perceptual quality. In this work, we propose a novel framework, DeepSEE, for Deep disentangled Semantic Explorative Extreme super-resolution. To the best of our knowledge, DeepSEE is the first method to leverage semantic maps for explorative super-resolution. In particular, it provides control of the semantic regions, their disentangled appearance and it allows a broad range of image manipulations. We validate DeepSEE for up to 32x magnification and exploration of the space of super-resolution.
In super-resolution (SR), we learn a mapping $f$ from a low-resolution (LR) image $I_{LR}$ to a higher-resolution (HR) image $I_{HR}$:

$f: I_{LR} \mapsto I_{HR}.$
This mapping can be a standard interpolation method, such as bilinear, bicubic or nearest-neighbour. As such interpolations do not restore high-frequency content or small details, their output clearly differs visually from an original high-resolution image. Consequently, most modern super-resolution methods rely on neural networks to learn a parametrized mapping. Such methods usually target upscaling factors of up to 4x in generic domains [timofte2017ntire, blau20182018], and up to 8x for restricted domains, like faces [yu2018superverylow, kim2019progressive-face-sr, lee2018attribute, chen2018fsrnet, yu2018facemultitask, liu2015faceattributesceleba].
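The limitation of interpolation can be made concrete with a minimal sketch. The function below (a hypothetical helper, not from the paper) performs nearest-neighbour upscaling; note that the output contains exactly the same pixel values as the input, so no high-frequency detail is restored:

```python
import numpy as np

def upscale_nearest(img: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upscaling: each pixel is repeated factor x factor
    times. The result is piecewise constant -- no new detail is created."""
    return np.kron(img, np.ones((factor, factor), dtype=img.dtype))

lr = np.array([[10, 20],
               [30, 40]], dtype=np.uint8)
hr = upscale_nearest(lr, 2)
# hr is a 4x4 block-replicated image containing only the values 10, 20, 30, 40.
```

Learned super-resolution methods, in contrast, hallucinate plausible high frequencies that such interpolations cannot produce.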
Generic methods lack clear guidance on what characteristics the output images should contain. For example, it is not possible to infer the skin texture from a very low-resolution face image. Solutions for domain-specific methods can include image attributes [yu2018superverylow, lee2018attribute, liu2015faceattributesceleba, li2019deep], reference images [li2018learninggfrnet, dogan2019exemplargwainet] and/or facial landmarks as guidance [chen2018fsrnet, kim2019progressive-face-sr]. Such approaches demonstrate valid results for such upscaling factors. However, they learn a deterministic mapping and produce a single high-resolution output for a given input.
Producing a single high-resolution output from a low-resolution image is not ideal: super-resolution is an ill-posed problem, and there are multiple valid solutions for a given input image. Fig. 2 shows a low-resolution input and multiple possible outputs. Nonetheless, it is very hard to tell which image is closest to the unknown ground truth.
Our proposed method, DeepSEE, is capable of generating an infinite number of potentially valid high-resolution candidates for a low-resolution image. The candidates may vary in both appearance and shape, but stay consistent with the low-resolution input. Our method learns a one-to-many mapping from a low-frequency input to a disentangled manifold of potential solutions with the same low-frequencies, but diverse high-frequencies. For inference, a user can tweak the shape or appearance of individual semantic regions until achieving the desired result.
Our main contributions are as follows:
We introduce DeepSEE, a novel framework for Deep disentangled Semantic Explorative Extreme super-resolution.
We tackle the ill-posed super-resolution problem in an explorative approach based on semantic maps. DeepSEE is able to sample and manipulate the solution space to produce an infinite number of high-resolution outputs for a single low-resolution input image.
DeepSEE gives control over both shape and appearance. A user can tweak the disentangled semantic regions individually.
We go beyond other approaches by super-resolving to the extreme, with upscaling factors of up to 32x.
Early super-resolution methods were mostly based on edges [fattal2007imageedge, sun2008imagegradientprofile], image statistics [aly2005image, zhang2010non, ni2007image], graphical models [wang2005patch], Gaussian process regression [he2011single], sparse coding [yang2010image] or piece-wise linear regression [timofte2014a+]. With the advent of deep learning, a multitude of top-performing methods were proposed [dong2015image, kim2016deeply, timofte2017ntire, timofte2018ntire], defining the current mainstream research in example-based single-image super-resolution. Typically, super-resolution aims to restore the missing high frequencies and details to achieve high (i) fidelity or (ii) perceptual quality. Fidelity quantifies the distortion with respect to a high-resolution ground-truth image, while perceptual metrics measure visual quality. It is important to note that high perceptual quality does not necessarily require a close pixel-wise match to the ground truth (high fidelity).
While most research still targets fidelity, perceptual super-resolution research is starting to bloom; its results are less blurry and more photo-realistic [blau20182018].
Generative Adversarial Networks (GANs) [goodfellow2014generative] have become increasingly popular in image generation [karras2019style, karras2019analyzing, spade, choi2018stargan, romero2019smit], thanks to their capability to sample and produce highly realistic images from a target distribution. The underlying technique is to alternately train two convolutional networks, a generator and a discriminator, with contrary objectives, playing a MiniMax game: the discriminator aims to correctly classify images as real or fake, while the generator learns to produce photo-realistic images that fool the discriminator.
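The two objectives can be illustrated numerically. The sketch below (a toy example, not the paper's implementation) evaluates the discriminator and generator losses for hypothetical discriminator scores, using the commonly used non-saturating form of the generator objective:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on sigmoid outputs, clipped for numerical stability."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Hypothetical discriminator outputs in (0, 1): D(x) on real images,
# D(G(z)) on generated ones.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])

# Discriminator step: push real scores toward 1 and fake scores toward 0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Generator step: fool the discriminator, i.e. push D(G(z)) toward 1.
g_loss = bce(d_fake, np.ones_like(d_fake))
```

With these scores the discriminator is confident, so its loss is small, while the generator loss is large; the alternating updates move both losses toward an equilibrium.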
Super-resolution methods with a focus on high fidelity tend to generate blurry images [blau20182018]. In contrast, perceptual super-resolution targets photo-realism while keeping consistency with the low frequencies of the input. Training perceptual models involves perceptual losses [johnson2016perceptual], such as those based on the activations of the ImageNet-pretrained VGG-19 network [simonyan2014veryvgg], or GANs [ledig2017photosrgan, blau20182018, wang2018esrgan, bulat2018superfan]. A seminal GAN-based work for perceptual super-resolution was SRGAN [ledig2017photosrgan]. SRGAN employed a residual network [he2016deep] as the generator and relied on a combination of losses: reconstruction/fidelity, texture/content based on VGG-19 [simonyan2014veryvgg] activations, and a GAN discriminator [goodfellow2014generative] for realism. Follow-up works, such as ESRGAN [wang2018esrgan], further improved upon SRGAN by tweaking the architecture and loss functions. Recently, [gu2019aim, gu2019div8k] introduced DIV8K, a dataset of ultra-high-definition images for extreme super-resolution.
Super-resolving an image without any additional guidance is a hard problem. This is why many methods condition on previously known or predicted information. Concretely, it is possible to enforce characteristics and guide the image generation via semantic maps [wang2018recoveringsftgan], attributes [lee2018attribute, yu2018superverylow, liu2015faceattributesceleba], facial landmarks [chen2018fsrnet, kim2019progressive-face-sr, yu2018facemultitask], or another image [li2018learninggfrnet, dogan2019exemplargwainet].
One important shortcoming of existing approaches is the deterministic nature of the output, i.e. a low-resolution image is mapped to a single output. Such solutions are not ideal, as for a low-resolution image that lacks high-frequency information and details, many potential outputs can resemble the input. In this work, we go in the direction of perceptual super-resolution based on a conditional Generative Adversarial Network that produces highly realistic outputs, and allows for controlling the output in terms of semantic content and appearance.
In super-resolution, many possible solutions exist for a low-resolution input. The possibility to tweak a model's output is an important step towards making super-resolution more useful and controllable. In a concurrent work, Bahat et al. [bahat2019explorable] suggest an editing tool with which a user can manually manipulate the super-resolution output. Their manipulations include adjusting variance or periodicity (e.g. for textures), reducing brightness, or editing faces. In our work, we do not stop at implementing specific manipulations; we go further and allow the user to freely traverse a latent style space and explore even more possible solutions. For example, our model can yield faces with different skin textures, add/remove lipstick, and manipulate eyes, eyebrows, glasses, noses, mouths, etc. (Figures 1, 2, 6 and more in the supplementary material). Remarkably, we can impose a different appearance style per semantic region, taken either from the latent space or from a reference image. To the best of our knowledge, Bahat et al. [bahat2019explorable] and our work are the first to target explorative super-resolution; moreover, DeepSEE is the first method that achieves explorative super-resolution using semantically-guided style imposition.
Applying super-resolution in a specific domain constrains the problem and improves the quality of the solutions. The focus on a particular domain also makes it easier to leverage prior information. Typical applications include super-resolution of faces [yu2018superverylow, kim2019progressive-face-sr, lee2018attribute, chen2018fsrnet, yu2018facemultitask, liu2015faceattributesceleba], outdoor scenes [wang2018recoveringsftgan] or depth maps [wang2019deepsurvey, riegler2016atgv, hui2016depth, song2016deep]. In this section we focus on super-resolution for faces, namely face hallucination.
As faces are highly constrained, several forms of supervision can successfully guide the generation process; facial keypoints, facial attributes and person identity are three common ones. First, influential works [chen2018fsrnet, yu2018facemultitask, kim2019progressive-face-sr] used facial landmark heatmaps to align the output with the input, and thus constrain the relative distances of the facial structure in both. Second, other methods [li2019deep, lee2018attribute, yu2018superverylow] employed binary attributes to enforce the presence or absence of facial components, which preserves the structural consistency of the upscaled image. Lastly, Li et al. [li2018learninggfrnet] and Dogan et al. [dogan2019exemplargwainet] leveraged identity information via the deep features of a face verification network in order to preserve perceptual information. Despite the important roles that facial keypoints, attributes and identity preservation play in facial upscaling, they provide high-level supervision that does not allow fine-grained manipulation of the output, which in most cases is a desired property.
In contrast to previous works, we use a discrete semantic prior for each region of the face. This allows disentangled manipulations of the semantic layout and style during inference.
Many standard metrics, such as the Peak Signal-to-Noise Ratio (PSNR) or the Structural Similarity index (SSIM) [wang2004imagessim], evaluate fidelity. However, fidelity does not correlate well with human perception of the output [ledig2017photosrgan, blau20182018, wang2018esrgan, wang2018recoveringsftgan], i.e. a high PSNR or SSIM does not guarantee that the output is perceptually pleasing [blau2018perception]. Alternative metrics, such as the recently proposed LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur], evaluate perceptual quality. In this work, we focus our validation on high perceptual quality as in [ledig2017photosrgan, wang2018esrgan, sajjadi2017enhancenet, wang2018recoveringsftgan], exploration of the solution space [bahat2019explorable] and extreme super-resolution [gu2019aim].
The low-resolution input image $I_{LR}$ acts as a starting point that carries the low-frequency information. The generator $G$ learns to upscale this image and hallucinates the high frequencies, yielding the high-resolution image $\hat{I}_{HR}$. As guidance, $G$ leverages both a high-resolution semantic map $M$ with $N$ semantic regions and independent styles per region, collected in a style matrix $S \in \mathbb{R}^{N \times d}$, where $d$ is the style dimensionality. The upscaled image should thus retain the low-frequency information of the low-resolution image. In addition, it should be consistent with the semantic regions and have specific, yet independent, styles per region. We formally define our problem as

$\hat{I}_{HR} = G(I_{LR}, M, S).$
Remarkably, thanks to the flexible semantic layout, a user is able to control the appearance and shape of each semantic region throughout the generation process. This allows tweaking an output until the desired solution has been found.
Following the GAN framework [goodfellow2014generative], our method consists of a generator and a discriminator network. In addition, we employ a segmentation network and an encoder for style. Concretely, the segmentation network predicts the semantic mask from a low-resolution image and the encoder produces a disentangled style. Fig. 3 illustrates our model at a high level and Fig. 4 provides a more detailed view. In the following, we describe each component in more detail.
The style encoder extracts $N$ style vectors of size $d$ from an input image and combines them into a style matrix $S$. The style encoder is designed such that it can extract the style from either a low-resolution or a high-resolution image. The encoder maps both low- and high-resolution inputs to the same latent style space. It disentangles the regional styles via the semantic layout $M$. The resulting style matrix $S$ serves as guidance for the generator.
The style encoder consists of a convolutional neural network for the low-resolution input and a similar convolutional neural network for the high-resolution input. Their outputs are mapped to the same latent style space via a shared layer. Fig. 4 illustrates the flow from the inputs to the style matrix. The architecture for the high-resolution input consists of four convolution layers. The input is downsampled twice in the two intermediate layers and upsampled again after a bottleneck. Similarly, the low-resolution encoder consists of four convolution layers. It upsamples the feature map once before the shared layer. The resulting feature map is then passed through the shared convolution layer and mapped to the range $[-1, 1]$. During inference, a user can sample from this latent space to produce diverse outputs.
Inspired by Zhu et al. [zhu2019sean], as a final step, we collapse the output of the shared style encoder for each semantic mask using regional average pooling. This is an important step to disentangle style codes across semantic regions. We describe the regional average pooling in detail in the supplementary material.
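Regional average pooling can be sketched as follows. This is a minimal numpy illustration under our own naming assumptions (function name, array layout), not the paper's code: given a feature map of shape (d, H, W) and N binary masks of shape (N, H, W), it averages the features inside each region to obtain one d-dimensional style vector per region.

```python
import numpy as np

def regional_average_pooling(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Collapse a feature map (d, H, W) into one style vector per semantic
    region by averaging over each binary mask (N, H, W).
    Returns a style matrix of shape (N, d); empty regions yield zeros."""
    d = features.shape[0]
    N = masks.shape[0]
    S = np.zeros((N, d))
    flat_feat = features.reshape(d, -1)   # (d, H*W)
    flat_mask = masks.reshape(N, -1)      # (N, H*W)
    for k in range(N):
        if flat_mask[k].sum() > 0:
            S[k] = flat_feat[:, flat_mask[k] > 0].mean(axis=1)
    return S

# Toy example: 2 feature channels, 2x2 image, 2 regions (top row / bottom row).
features = np.array([[[1., 2.], [3., 4.]],
                     [[10., 20.], [30., 40.]]])
masks = np.array([[[1, 1], [0, 0]],
                  [[0, 0], [1, 1]]])
S = regional_average_pooling(features, masks)  # S[0] = [1.5, 15.0]
```

Because each region's style is a single averaged vector, styles of different semantic regions cannot leak into one another, which is what makes the per-region manipulation disentangled.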
Our generator learns a mapping $G: (I_{LR}, M, S) \mapsto \hat{I}_{HR}$, where the model conditions on both a semantic layout $M$ and a style matrix $S$. This allows controlling the appearance, as well as the size and shape, of each region in the semantic layout.
The semantic layout $M$ consists of one binary mask per semantic region. For the style, we assume a uniform distribution, where each row of $S$ represents a style vector of size $d$ for one semantic region.
At a high level, the generator is a series of residual blocks with upsampling layers in between. Starting from the low-resolution image, it repeatedly doubles the resolution using nearest neighbor interpolation and processes the result in residual blocks. In the residual blocks, we inject semantic and style information through multiple normalization layers. For the semantic layout, we use spatially adaptive normalization (SPADE) [spade]. SPADE learns a spatial modulation of the feature maps from a semantic map. For the style, we utilize semantic region adaptive normalization in a similar fashion as [zhu2019sean]. Semantic region adaptive normalization is an extension to SPADE, which includes style. Like SPADE, it computes spatial modulation parameters, but also takes into consideration a style matrix computed from a reference image. In our case, we extract the style from an input image through our style encoder as described in Section 3.2. For more details, please check the supplementary material.
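The core idea of spatially adaptive normalization can be sketched in a few lines. The toy version below is a deliberate simplification under stated assumptions: it handles a single feature channel and looks the modulation parameters up directly from the class map, whereas the actual SPADE [spade] predicts spatial gamma and beta maps with small convolutional layers.

```python
import numpy as np

def spade_norm(x, seg, gamma_per_class, beta_per_class, eps=1e-5):
    """Minimal SPADE-style normalization for one feature channel.
    x: (H, W) feature map; seg: (H, W) integer class map.
    Each pixel is rescaled/shifted according to its semantic class."""
    mu, sigma = x.mean(), x.std()
    x_norm = (x - mu) / (sigma + eps)     # parameter-free normalization
    gamma = gamma_per_class[seg]          # spatial (H, W) scale map
    beta = beta_per_class[seg]            # spatial (H, W) shift map
    return gamma * x_norm + beta

# Toy example: top row belongs to class 0, bottom row to class 1.
x = np.array([[0., 1.], [2., 3.]])
seg = np.array([[0, 0], [1, 1]])
out = spade_norm(x, seg, gamma_per_class=np.array([1., 2.]),
                 beta_per_class=np.array([0., 5.]))
```

The key property is that the modulation varies per pixel with the semantic layout, so each region can be denormalized with its own statistics; the style-conditioned variant additionally derives gamma and beta from the style matrix.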
We use an ensemble of two similar discriminator networks. One operates on the full image, the other on a half-scale version of the generated image. Each network takes the concatenation of an image with its corresponding semantic layout and predicts the realism of overlapping image patches. The discriminator architecture follows [spade]. Please refer to the supplementary material for a more detailed description.
Our training scheme assumes high-resolution segmentation maps, which in most cases are not available during inference. Therefore, we predict a segmentation map from the low-resolution input image $I_{LR}$. In particular, we train a segmentation network to learn the mapping $I_{LR} \mapsto M$, where $M$ is a high-resolution semantic map.
The semantic segmentation network is trained independently from the super-resolution network. For more information, please have a look at the supplementary material.
We suggest two slightly different variants of our proposed method. The guided model learns to super-resolve an image with the help of a high-resolution (HR) reference image. The other (independent) model does not require any additional guidance and infers a reference style from the low-resolution image.
The guided model is able to apply characteristics from a reference image. When fed a guiding image of the same person, it extracts the original characteristics (if visible). Alternatively, when fed an image of a different person, it integrates that person's characteristics (as long as they are consistent with the low-resolution input). Figure 1 shows an example where we first generate an image with the style of the same person and then alter particular regions with styles from other images. The second (independent) model applies to the case where no reference image is available.
The independent and the guided models differ in how the style matrix $S$ is computed. The independent model extracts the style from the low-resolution input image, $S = E(I_{LR})$. In contrast, the guided model computes the style from a high-resolution reference image, $S = E(I_{HR}^{ref})$. It is worth mentioning that training does not require paired supervision, as we only need one high-resolution picture of a person.
We train the generator, encoder and discriminator end-to-end in an adversarial setting, similar to [spade, wang2018pix2pixhd]. As a difference, we inject noise at multiple stages of the generator. We list hyper-parameters and more training details in the supplementary material. In the following, we describe the loss function and explain the noise injection.
Our loss function is identical to [spade]. Concretely, our discriminator computes an adversarial loss $\mathcal{L}_{GAN}$ with feature matching $\mathcal{L}_{FM}$. In addition, we employ a perceptual loss $\mathcal{L}_{VGG}$ from a VGG-19 network [simonyan2014veryvgg]. Our full loss function is defined in Equation 3:

$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_{FM} \mathcal{L}_{FM} + \lambda_{VGG} \mathcal{L}_{VGG}$
We set the loss weights to .
After encoding the style into a style matrix $S$, we add uniformly distributed noise. We define the noisy style matrix as $\tilde{S} = S + Z$, where $Z_{ij} \sim \mathcal{U}(-\sigma, \sigma)$. We choose $\sigma$ empirically, depending on the model variant.
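The noise injection itself is a one-liner; the sketch below uses hypothetical shapes (5 regions, style size 8) and a placeholder value for the noise bound, since the paper chooses it per model variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(S: np.ndarray, sigma: float) -> np.ndarray:
    """Perturb the style matrix with uniform noise in [-sigma, sigma].
    sigma is a hyper-parameter chosen empirically per model variant."""
    Z = rng.uniform(-sigma, sigma, size=S.shape)
    return S + Z

S = np.zeros((5, 8))               # e.g. 5 semantic regions, style size 8
S_noisy = inject_noise(S, sigma=0.1)
```

Training with perturbed style codes encourages the generator to map a neighborhood of the style space to plausible images, which is what makes sampling diverse outputs at inference possible.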
DeepSEE needs to handle both low-resolution and high-resolution style inputs. The high-resolution style encoder can extract rich style information. The low-resolution style encoder, however, does not receive such high-frequency information and thus needs to predict it, which we consider a considerably harder task. We therefore set a lower learning rate for the low-resolution style encoder. We train both encoders alternately, feeding either a low-resolution or a high-resolution image in each iteration. Fig. 4 illustrates the data flow for the style architecture in the context of the full DeepSEE model. Please refer to the supplementary material for a detailed description of the learning rates and training schedule.
We train and evaluate our method on face images from CelebAMask-HQ [karras2017progressive, CelebAMask-HQ] and CelebA [liu2015faceattributesceleba]. (Our code and models will be available at: https://mcbuehler.github.io/DeepSEE/) We use the official training split for development and training, and test on the provided test split. All low-resolution input images are computed via bicubic downsampling.
We train a segmentation network [chen2017deeplab, chen2017rethinkingdeeplabv3] on images from CelebAMask-HQ [CelebAMask-HQ, karras2017progressive]. The network learns to predict a high-resolution segmentation map with $N$ semantic regions from a low-resolution image. As the model, we choose DeepLab V3+ [chen2017deeplab, chen2018encoderdeeplab, chen2017rethinkingdeeplabv3] with a DRN [Yu2016multiscale, Yu2017drn] backbone.
We establish a baseline via bicubic interpolation. We downsample an image to a low-resolution and then upsample back to the original resolution.
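This baseline can be sketched without any image library. Note the simplification: the paper uses bicubic resampling (in practice via a library such as PIL or OpenCV), whereas the dependency-free stand-in below uses a box-filter downsample and nearest-neighbour upsample purely for illustration:

```python
import numpy as np

def downsample_box(img: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping factor x factor blocks (a box filter;
    the paper's baseline uses bicubic resampling instead)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample_nearest(img: np.ndarray, factor: int) -> np.ndarray:
    return np.kron(img, np.ones((factor, factor)))

hr = np.arange(16, dtype=float).reshape(4, 4)
# Down-up round trip: same shape as hr, but high frequencies are lost.
baseline = upsample_nearest(downsample_box(hr, 2), 2)
```

The round trip preserves the mean intensity but discards all detail finer than the downsampling factor, which is exactly the information a super-resolution model must hallucinate.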
We compute the traditional super-resolution metrics peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [wang2004imagessim], as well as the perceptual metrics Fréchet Inception Distance (FID) [heusel2017ttur] and Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018perceptuallpips]. Our method focuses on generating results of high perceptual quality, measured by LPIPS and FID. PSNR and SSIM are frequently used; however, they are known not to correlate well with perceptual quality [zhang2018perceptuallpips]. We still list SSIM scores for completeness and report PSNR in the supplementary material.
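Of these metrics, PSNR is the simplest to state explicitly; the short numpy implementation below follows the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$:

```python
import numpy as np

def psnr(ref: np.ndarray, out: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means lower pixel-wise distortion."""
    mse = np.mean((ref.astype(float) - out.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100.0)
out = ref + 10.0   # a constant error of 10 gives MSE = 100
# psnr(ref, out) = 10 * log10(255^2 / 100), roughly 28.13 dB
```

Its purely pixel-wise nature is exactly why it rewards blurry averages over sharp, plausible textures, motivating the perceptual metrics above.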
We validate our method in two different setups. First, we compare to state-of-the-art methods in face hallucination and provide both quantitative and qualitative results. Second, we show results for extreme and explorative super-resolution by applying numerous manipulations at upscaling factors of up to 32x.
To the best of our knowledge, our method is the first face hallucination model based on discrete semantic masks. We compare to models (i) that use reference images [li2018learninggfrnet, dogan2019exemplargwainet] and (ii) models guided by facial landmarks [chen2018fsrnet, kim2019progressive-face-sr].
For (i), we use the CelebAMask-HQ dataset [karras2017progressive, CelebAMask-HQ] to compare to GFRNet [li2018learninggfrnet] and GWAInet [dogan2019exemplargwainet], both of which leverage an image of the same person to guide the upscaling. We list quantitative results in Table 1. Our method achieves the best scores on all metrics. For the perceptual metrics LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur], DeepSEE outperforms the other methods by a considerable margin. As we depict in Fig. 5, our semantic model also produces more convincing results, in particular for difficult regions such as eyes, hair and teeth. We provide more examples in the supplementary material.
For models based on facial landmarks (ii), Table 2 compares to FSRNet [chen2018fsrnet] and Kim et al. [kim2019progressive-face-sr], where we compute the metrics on the CelebA test set. (The models from [chen2018fsrnet, kim2019progressive-face-sr] were trained to generate images of a size compatible with CelebA, so we can evaluate in their setting on CelebA. In contrast, the other two related methods [li2018learninggfrnet, dogan2019exemplargwainet] generate images that are larger than CelebA [liu2015faceattributesceleba] images, hence we evaluate them on CelebAMask-HQ [karras2017progressive, CelebAMask-HQ].) DeepSEE outperforms FSRNet on all metrics, as well as visually. Please check the supplementary material for a direct visual comparison.
|FSRNet [chen2018fsrnet] (MSE)| | |
|FSRNet [chen2018fsrnet] (GAN)| | |
|Kim et al. [kim2019progressive-face-sr]|0.6634|11.408|
It is important to note that all related face hallucination models output a single solution, i.e. for a given input there is always exactly one output. In reality, however, multiple valid solutions exist for a low-resolution image. In contrast to prior work, our method can generate an infinite number of solutions and gives the user fine-grained control over the output. Figures 1 and 2 show several solutions for a low-resolution image. All variants are highly consistent with the low frequencies of the input image; the high frequencies, however, are not defined by the input, and our method generates multiple variants of them. DeepSEE can not only extract the overall appearance from a guiding image of the same person, but can also inject aspects from other people, and even leverage completely different style images, for instance geometric patterns. We describe Fig. 1 in more detail in Section 5.2. In addition, our method allows manipulating semantics, i.e. changing the shape, size or content of regions. Figures 1, 2 and 6 illustrate examples of manipulating eyeglasses, eyebrows, noses, lips, hair, skin, etc. We provide further visualizations in the supplementary material.
Our proposed approach is an explorative super-resolution model, which allows a user to tune two main knobs in order to manipulate the model output. Figure 3 shows these knobs in green boxes.
The first way to change the output image is to adapt the disentangled style matrix, for instance by adding random noise, interpolating between style codes, or mixing multiple styles, as illustrated in Figure 1. Going from one style code to another gradually changes the image output. For example, manipulating the style code for the lips can make them either slowly disappear or, on the contrary, become more prominent. We provide additional examples in the supplementary material.
The second tuning knob is the semantic layout. The user can change the size and shape of semantic regions, which causes the generator to adapt the output accordingly. Figure 6 shows an example where we close the mouth and make the chin more pointy by manipulating the regions for the lips and facial skin. Furthermore, we change the shape of the eyebrows, reduce the nose and update the stroke of the eyebrows. It is also possible to create hair on a bald head or add/remove eyeglasses. Alternatively, a user can change or swap semantic labels; for example, we replace eyes with eyebrows and make the nose disappear. We showcase these examples in the supplementary material.
While previous face hallucination models applied upscaling factors of up to 8x [chen2018fsrnet, dogan2019exemplargwainet, li2018learninggfrnet, kim2019progressive-face-sr], DeepSEE is capable of going far beyond, with upscaling factors of up to 32x. This is possible thanks to the conditioning on dense semantic maps, as well as the injection of region-dependent style. Such constraints serve as strict guidance for the generator. As a result, the semantic regions in the super-resolved faces are highly consistent with the input masks, and the generated images are of high perceptual quality even at extreme upscaling factors.
Figure 7 shows results for 32x magnification. For the example on the left, some high-frequency components slightly differ (e.g. the exact trimming of the beard, or the wrinkles on the forehead), yet our model captures the main essence and identity from the low-resolution image. The output is highly consistent with the ground-truth image (bottom right).
Given such extreme upscaling factors, it is not surprising that we do not always observe such consistent results out of the box. In the second example (Fig. 7 on the right), the upscaled image shows a young woman with a smooth skin texture. In reality, however, the ground-truth image is a middle-aged woman with wrinkles. Wrinkles are a typical example of a high-frequency component that is not clear in a low-resolution image. Most images in CelebAMask-HQ [karras2017progressive, CelebAMask-HQ] show young people with smooth skin. Given the low-resolution version of a middle-aged woman, the style encoder incorrectly inherits the bias of the dataset and predicts the style code of a young woman. This case highlights the benefits of an explorative super-resolution model. With DeepSEE, a user is now able to manipulate the style for the skin and generate a solution that matches the ground-truth. Please check the supplementary material for more illustrations and higher-resolution versions of Fig. 7.
We investigate the influence of DeepSEE's main components in an ablation study by training models with the injection of semantics, style, or both disabled. The models are trained for 7 epochs, which corresponds to 3 days on a single TITAN Xp GPU, with a fixed upscaling factor and batch size 4 on the CelebA [liu2015faceattributesceleba] dataset. More details regarding hyper-parameters can be found in the supplementary material.
For the first model (prior-only), we disable both semantics and style; the generator blocks then consist of convolutions, batch normalization and ReLU activations, and the model's only conditioning is the low-resolution input. For the hr-guided-only model, we inject the style matrix computed from another high-resolution image of the same person. Lastly, we train a semantics-only model that does not inject any style but applies a semantic layout via spatially adaptive normalization [spade].
We compare performance scores in Table 3. The performance on all metrics improves when adding either semantics, style or both. Comparing models with only one of the two (hr-guided-only vs. semantics-only), the perceptual metrics show better scores when including semantics; in particular, FID [heusel2017ttur] is considerably better. Combining semantics and style yields even better results for both the distortion measures (PSNR and SSIM [wang2004imagessim]) and the perceptual metrics (LPIPS [zhang2018perceptuallpips] and FID [heusel2017ttur]). The performance of our two suggested model variants (independent and guided) is very similar on distortion metrics. In terms of perceptual quality, the guided model clearly beats the independent one on FID. This is not surprising, as it can extract the style from a guiding image of the same person, which makes it easier to produce an image that is close to the ground truth. On the other hand, the independent model is more flexible towards random manipulations of the style matrix.
The super-resolution problem is ill-posed because a lot of information is missing and needs to be hallucinated. In this paper, we tackle super-resolution in an explorative approach, DeepSEE, based on semantic regions and disentangled style codes.
DeepSEE allows for fine-grained control of the output, disentangled into region-dependent appearance and shape. Our model goes far beyond common upscaling factors (4x to 8x) and allows magnification of up to 32x. Our validation on faces demonstrates results of high perceptual quality.
Acknowledgments. This work was partly supported by the ETH Zürich Fund (OK), and by Huawei, Amazon AWS and Nvidia grants.