Single Image Super-Resolution (SISR), the task of restoring a visually pleasing high-resolution (HR) image from its low-resolution (LR) version, remains a challenging problem within the computer vision research community (Caballero et al., 2016; Dong et al., 2015; Kappeler et al., 2016; Kim et al., 2015; Ledig et al., 2016; Liu et al., 2017; Sajjadi et al., 2016; Shi et al., 2016; Tao et al., 2017). Since multiple HR solutions exist for any given LR input, SISR is highly ill-posed, and a variety of algorithms, especially the current leading learning-based methods, have been proposed to address this problem.
Understanding what the SISR problem represents is crucial for developing a method capable of solving it. Given only a low resolution image at inference time, there is no ground truth for how its high resolution counterpart should look. Consequently, in order to recover a higher resolution image, assumptions must be made that do not contradict the details visible in the low resolution image. The fine details added to the higher resolution image are inherently subjective, since they only need to be consistent with structures already visible in the low resolution input. The task in SISR is therefore to find a model that learns how to make these assumptions and to generate high resolution images that are as plausible as possible for the specific task at hand, such as face SISR. To this day, all existing solutions to the SISR problem attempt to reconstruct a single high resolution image from a given low resolution input. In other words, the process of generating a high resolution image is deterministic: feeding the same low resolution image as input multiple times will always yield the same high resolution image.
In this paper, we argue that a method for solving the SISR problem should yield multiple high resolution candidates for the same low resolution image, and we propose an approach that does so. Specifically, the proposed SR-NAM method is an unsupervised method for mapping high resolution images to a given low resolution image. Its advantage over other methods is that it is fast and requires optimizing only a single representation. This representation attempts to match pre-trained, fixed knowledge of both the high resolution image space and the degradation process. To the best of our knowledge, all previous works on SISR degraded the high resolution images artificially, using down-sampling methods such as bilinear and bicubic interpolation, in order to create a dataset of high resolution images paired with their low resolution counterparts. Such methods usually perform poorly on real-world low resolution images, as shown in (Shocher et al., 2017; Bulat et al., 2018). In contrast to these approaches, following the work of (Bulat et al., 2018), we propose to use a degradation model that generates, from a high resolution image, a low resolution image that visually appears to have been taken with a low quality camera.
2. Related Work
2.1. Image Super-Resolution
The problem of SISR has been widely studied. Early approaches either rely on natural image statistics (Kim and Kwon, 2010; Zhang et al., 2010) or on predefined models (Irani and Peleg, 1991; Fattal, 2007; Sun et al., 2008). Later, mapping functions between LR and HR images were investigated, such as sparse coding based SR methods (Zeyde et al., 2012; Yang et al., 2010).
Recently, deep convolutional neural networks (CNNs) have been shown to be powerful and capable of improving the quality of SR results (Zhang et al., 2019; Ahn et al., 2018; Park et al., 2018; Xintao Wang and Loy, 2018; Tong et al., 2017). It should be highlighted that all the aforementioned image super-resolution methods can be applied to all types of images and hence do not incorporate face-specific information, as proposed in our work.
Face Super-Resolution. There are many works in the literature focusing specifically on applying SISR techniques to face images. The recent works of (Yu and Porikli, 2016; Yu et al., 2018; Bulat et al., 2018; Bulat and Tzimiropoulos, 2017b) use a GAN-based approach. Other works, such as (Cao et al., 2017a), used reinforcement learning to progressively attend to specific parts of a face image in order to restore them sequentially. Other methods (Chen et al., 2017) introduce facial prior knowledge that can be leveraged to better super-resolve face images. The method of (Zhu et al., 2016) performs super-resolution and dense landmark localization in an alternating manner, which is shown to improve the quality of the super-resolved faces.
2.2. Unsupervised domain alignment
With the rise of generative adversarial networks (GANs), unsupervised translation across different domains began to produce strong results. All state-of-the-art unsupervised translation methods employ the GAN framework. The most popular extension of the traditional GAN approach is cycle-consistency, which enforces that samples mapped from one domain to the other and back remain the same. This approach is widely used by DiscoGAN (Kim et al., 2017), CycleGAN (Zhu et al., 2017) and DualGAN (Yi et al., 2017). Recently, StarGAN (Choi et al., 2017) extended the approach to more than two domains. Our work builds upon the Non-Adversarial Mapping (NAM) method (Hoshen and Wolf, 2018), described in detail in Section 3.4.
3. Super-Resolution using NAM
As mentioned in Section 1, we propose two main models: the degradation model and the Super-Resolution NAM (SR-NAM) model. The degradation model is designed to take an HR image as input and produce an LR version of it that looks like a realistically captured low resolution photograph. The SR-NAM model then uses the pre-trained generator and degradation model to infer, without supervision, the predicted HR image.
3.1. Datasets

This section describes the HR and LR datasets used during training and testing. In order to train the degradation model, a dataset with real-world LR images is needed. Searching the literature, the only available dataset that fulfills this requirement is the one described in (Bulat et al., 2018). We therefore contacted the authors in order to get access to the exact subset of data used in their research.
HR dataset. Following (Bulat et al., 2018), the High Resolution (HR) image dataset is composed of 182,866 face images of size 64×64. The authors of the dataset aimed to make it as balanced as possible in terms of facial poses. The dataset is a combination of subsets of four popular face datasets. Specifically, from Celeb-A (Liu et al., 2014), they randomly selected 60,000 faces. Additionally, they used the whole AFLW (Köstinger et al., 2011) dataset. Finally, subsets of the LS3D-W (Bulat and Tzimiropoulos, 2017a) and VGGFace2 (Cao et al., 2017b) datasets were used. The dataset includes many face images with a variety of poses, illuminations, expressions and occlusions.
LR dataset. The authors of (Bulat et al., 2018) created a real-world Low Resolution (LR) image dataset from the Widerface (Yang et al., 2016) face dataset. This dataset is very large in scale, diverse in terms of faces, and contains real-world pictures with various forms of noise and degradation. The dataset is composed of 50,000 images of size 16×16, of which 3,000 randomly selected images are held out for testing.
3.2. Degradation Model
The degradation model is inspired by (Bulat et al., 2018). The overall architecture is composed of a generator and a discriminator network, both based on the ResNet architecture (He et al., 2015). The overall architecture is shown in Figure 1.
Degradation Generator. An HR image from the HR dataset is used as input to the degradation generator. The architecture is similar to the one used in (Bulat et al., 2018). The network follows an encoder-decoder scheme and is composed of 12 residual blocks equally distributed in 6 groups. The resolution is reduced 4 times using pooling layers; specifically, it drops from the input size of 64×64 down to 4×4. It is then increased twice, up to 16×16, using pixel shuffle layers.
In addition to the HR image, a noise vector is concatenated, after being projected and reshaped by a fully connected layer so that it has the same size as one image channel. The intuition behind this is that degrading an HR image to an LR image is a one-to-many problem: an HR image can have multiple corresponding LR images.
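As an illustration, the projection-and-concatenation step can be sketched as follows. The toy sizes, the function names (`project_noise`, `concat_noise_channel`) and the plain-Python list representation of tensors are illustrative assumptions, not the paper's actual implementation:

```python
import random

def project_noise(z, weights, h, w):
    """Project a noise vector to h*w values with a (hypothetical) fully
    connected layer, then reshape it into a single h x w channel."""
    flat = [sum(zi * wij for zi, wij in zip(z, row)) for row in weights]
    assert len(flat) == h * w
    return [flat[r * w:(r + 1) * w] for r in range(h)]

def concat_noise_channel(image_chw, noise_hw):
    """Append the projected noise as an extra channel, so a C x H x W
    image becomes (C + 1) x H x W before entering the generator."""
    return image_chw + [noise_hw]

# Toy sizes: a 3-channel 4x4 "image" and a 100-dim noise vector.
rng = random.Random(0)
h = w = 4
z = [rng.gauss(0.0, 1.0) for _ in range(100)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(100)] for _ in range(h * w)]
image = [[[0.0] * w for _ in range(h)] for _ in range(3)]

noise_channel = project_noise(z, weights, h, w)
augmented = concat_noise_channel(image, noise_channel)
print(len(augmented))  # 4 channels: 3 image channels + 1 noise channel
```

Sampling a different z for the same HR input then gives the generator a handle for producing a different LR degradation, which is the one-to-many behaviour described above.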
The discriminator is similar to the ResNet architecture and consists of 6 residual blocks without any batch normalization in between, followed by a fully connected layer. To reduce the resolution of the 16×16 image, max-pooling is used in the last two blocks.
Degradation Loss. The degradation generator and discriminator networks are trained with a total loss that combines a GAN loss and a pixel loss, defined as:

\ell = \alpha\,\ell_{GAN} + \beta\,\ell_{pixel}

where \alpha and \beta are the corresponding weights.
Following (Bulat et al., 2018), we used the GAN loss defined as:

\ell_{GAN} = \mathbb{E}_{\hat{I} \sim \mathbb{P}_g}\big[D(\hat{I})\big] - \mathbb{E}_{I^{LR} \sim \mathbb{P}_r}\big[D(I^{LR})\big]

where \mathbb{P}_r is the LR data distribution and \mathbb{P}_g is the generator distribution defined by \hat{I} = G(I^{HR}, z). For the GAN loss, following the authors, an "unpaired" training setting is used, in which the real-world images from the LR dataset push the output of the generator (whose input is images from the HR dataset) to be contaminated with real-world noise artifacts. According to (Arjovsky et al., 2017), using the Wasserstein distance as the GAN loss greatly improves the stability of the GAN model. In (Arjovsky et al., 2017), the authors enforced the Lipschitz constraint using weight clipping. We instead enforce this constraint using the more recent and improved gradient penalty approach described in (Gulrajani et al., 2017).
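To make the critic objective concrete, the following toy sketch evaluates the WGAN loss with a gradient penalty for a linear critic, whose input gradient is simply its weight vector. The function names and the scalar setup are hypothetical simplifications of the actual network training, not the paper's code:

```python
import math

def critic(x, w):
    """Toy linear critic f(x) = <w, x>; its input-gradient is w itself."""
    return sum(wi * xi for wi, xi in zip(w, x))

def wgan_gp_loss(real, fake, w, lam=10.0):
    """Critic loss: E[f(fake)] - E[f(real)] + lam * (||grad|| - 1)^2.
    For a linear critic the gradient at any real/fake interpolate is w,
    so the penalty reduces to (||w||_2 - 1)^2."""
    wass = (sum(critic(x, w) for x in fake) / len(fake)
            - sum(critic(x, w) for x in real) / len(real))
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return wass + lam * (grad_norm - 1.0) ** 2

real = [[1.0, 0.0], [0.0, 1.0]]
fake = [[0.5, 0.5], [0.2, 0.8]]
w = [0.6, 0.8]  # unit-norm weights: the gradient penalty vanishes
print(round(wgan_gp_loss(real, fake, w), 4))  # 0.03
```

With a non-unit-norm critic the penalty term dominates, which is exactly the mechanism by which (Gulrajani et al., 2017) softly enforce the Lipschitz constraint instead of clipping weights.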
Finally, the \ell_{pixel} loss is used to enforce that the output of the generator has similar content (i.e. face identity, pose and expression) to the original HR image, and is defined as:

\ell_{pixel} = \gamma\,\ell_2 + \delta\,\ell_{feature}

where \gamma and \delta are the corresponding weights. The \ell_2 loss is defined as:

\ell_2 = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big( I^{HR}_{x,y} - U\big(G(I^{HR}, z)\big)_{x,y} \Big)^2

where U is an up-scaling function. We also use the perceptual loss (Johnson et al., 2016), which was found to give perceptually pleasing results. This is defined as:

\ell_{feature} = \frac{1}{W_i H_i C_i} \Big\lVert \phi_i(I^{HR}) - \phi_i\big(U(G(I^{HR}, z))\big) \Big\rVert_2^2

where \phi_i denotes the features extracted from a deep network at the end of the i-th block (we use VGG (Liu and Deng, 2015)).
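The mechanics of a perceptual loss, comparing images in a feature space rather than pixel space, can be sketched without the VGG network itself. Here a 2×2 average pooling stands in for the block-i activations; `phi` and `perceptual_loss` are illustrative names, not the paper's implementation:

```python
def phi(image):
    """Stand-in feature extractor: 2x2 average pooling plays the role
    of the VGG block-i activations used by the perceptual loss."""
    h, w = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def perceptual_loss(img_a, img_b):
    """Mean squared distance between the feature maps phi(a) and phi(b)."""
    fa, fb = phi(img_a), phi(img_b)
    diffs = [(a - b) ** 2 for ra, rb in zip(fa, fb) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

a = [[1.0, 1.0, 0.0, 0.0]] * 4
b = [[1.0, 1.0, 1.0, 1.0]] * 4
print(perceptual_loss(a, a))  # identical images: 0.0
print(perceptual_loss(a, b))  # 0.5
```

The key property is that two images with matching feature statistics incur a small loss even if their exact pixels differ, which is why this loss rewards perceptually plausible textures.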
3.3. HR Generative Model
This section describes the HR generator used to generate HR images given a latent representation. Pre-training a good, well-generalized face generator is crucial for the success of the SR-NAM model. For this reason, we experiment with the Progressive GAN architecture described in (Karras et al., 2017). The authors of that paper showed that their model is very effective at generating good quality HR images given enough face images. The overall architecture can be seen in Figure 2.
Progressive GAN. Following (Karras et al., 2017), the idea behind the progressive generator architecture is to start with a low-resolution image and then progressively increase the resolution by adding layers to the network, as visualized in Figure 2. This incremental procedure helps training to first discover the large-scale structure of the image distribution and then shift attention to progressively finer scale details, instead of trying to learn everything simultaneously. The discriminator and generator networks are mirrored and grow synchronously. All existing layers remain trainable throughout the training process. New layers added to the network are faded in smoothly to avoid sudden disruptions to the already well-trained lower resolution layers.
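The smooth fade-in is a linear blend between the upsampled output of the previous resolution and the output of the newly added layers, controlled by a weight that ramps from 0 to 1 during training. A minimal sketch (the function name and toy vectors are assumptions for illustration):

```python
def fade_in(low_res_path, high_res_path, alpha):
    """Blend the upsampled output of the previous resolution with the
    output of the newly added layers; alpha ramps from 0 to 1."""
    return [(1.0 - alpha) * lo + alpha * hi
            for lo, hi in zip(low_res_path, high_res_path)]

prev = [0.0, 0.0, 0.0, 0.0]  # upsampled output of the trained low-res path
new = [1.0, 1.0, 1.0, 1.0]   # output of the freshly added high-res layers
print(fade_in(prev, new, 0.0))  # training starts with the old path only
print(fade_in(prev, new, 0.5))  # halfway through the fade
print(fade_in(prev, new, 1.0))  # fade complete: only the new layers remain
```

Because alpha starts at 0, the new layers initially contribute nothing, so their random initialization cannot disturb what the lower resolution layers have already learned.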
3.4. Super-Resolution NAM

This section describes the Super-Resolution using Non-Adversarial Mapping approach, which retrieves multiple HR images from a single LR image and is the main focus of the paper. Let X be the low resolution space and Y be the high resolution space, consisting of the sets of images {x_i} and {y_j} respectively. The objective is to find, for each image x in the low resolution domain, every analogous image y in the high resolution space. Each y must appear to come from the high resolution space while preserving the unique content of the original image x.
Non-Adversarial Mapping. NAM (Hoshen and Wolf, 2018) is a method for unsupervised mapping across image domains. It requires a pre-trained unconditional model of one domain, which in our case is the high resolution space, as well as a set of training images from the other domain, which in our case is the set {x_i} from the low resolution space.
Given a pre-trained high resolution generative model G, a pre-trained degradation model T, and a set of training images, NAM estimates the latent code z_x for every training image x so that the image generated from this latent code maps to the low resolution image x. Figure 3 illustrates this process. The entire optimization problem is:

\min_{z_x} \sum_{x} \big\lVert T\big(G(z_x)\big) - x \big\rVert_2^2
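The optimization above can be sketched end to end with scalar stand-ins for G and T; both networks stay frozen and only the latent code is updated by gradient descent. The functions below are toy assumptions (G is the identity, T halves its input), not the trained models:

```python
def G(z):
    """Stand-in pre-trained HR generator (identity on a scalar code)."""
    return z

def T(y):
    """Stand-in pre-trained degradation model (here: halve the signal)."""
    return 0.5 * y

def infer_latent(x_lr, z0, lr=0.5, steps=200):
    """Gradient descent on z for the loss (T(G(z)) - x_lr)^2, with both
    G and T frozen; only the latent code is updated, as in SR-NAM."""
    z = z0
    for _ in range(steps):
        residual = T(G(z)) - x_lr
        grad = 2.0 * residual * 0.5  # chain rule through T and G
        z -= lr * grad
    return z

z_hat = infer_latent(x_lr=3.0, z0=0.0)
print(round(G(z_hat), 3))  # HR proposal whose degradation matches x_lr = 3.0
```

For this toy loss the minimizer is z = 6, since T(G(6)) = 3.0 exactly matches the "LR observation"; the real method performs the same loop over image-valued tensors.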
The advantages of NAM include that it does not use adversarial training to learn the mapping between high and low resolution images. In addition, the mapping can be applied in many settings, and multiple solutions can be recovered for a single low resolution input image. NAM is also able to reuse a pre-trained high resolution model as well as a pre-trained degradation model, both of which only need to be estimated once.
In contrast to (Hoshen and Wolf, 2018), we decided not to include the perceptual loss in the optimization objective. Although the perceptual loss successfully yields perceptually pleasing results (i.e. results that follow the content of the low resolution image, such as similar pose, expression and face geometry), it is of little use when the goal is to recover a high resolution counterpart that is as close as possible to the low resolution image, since it can produce images that are perceptually correct but visually very different. We therefore only minimize the \ell_2 loss between x and T(G(z_x)).
Inference. Since all the networks inside the SR-NAM model are pre-trained and fixed, only the latent codes need to be optimized each time. To infer an analogy for a new image x, we recover the latent code z_x that yields the optimal reconstruction. The generated high resolution image G(z_x) is the proposed solution for the low resolution image x.
Multiple Solutions. In order to produce multiple HR images from a single LR image, it is sufficient to initialize the latent code z_x differently. Because the optimization problem is non-convex, starting from a different point in the space can yield a different final analogy.
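The effect of different initializations on a non-convex objective can be demonstrated with a one-dimensional toy loss that has two minima; the function names and the loss itself are illustrative assumptions:

```python
def loss(z):
    """Toy non-convex reconstruction loss with two minima at z = -1, +1."""
    return (z * z - 1.0) ** 2

def descend(z0, lr=0.05, steps=500):
    """Plain gradient descent on the toy loss from a given start point."""
    for _ in range(steps):
        grad = 4.0 * z0 * (z0 * z0 - 1.0)
        z0 -= lr * grad
    return z0

# Different initializations of the latent code reach different minima,
# i.e. different plausible HR reconstructions of the same LR input.
print(round(descend(0.5), 3), round(descend(-0.5), 3))  # 1.0 -1.0
```

Both endpoints achieve (near-)zero loss, mirroring how SR-NAM returns distinct HR candidates that all degrade to the same LR image.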
4. Experiments

In this section, we demonstrate the effectiveness of the SR-NAM approach by reporting qualitative results on both the HR and the LR dataset. Further, we show the performance of both the degradation network and the progressive GAN procedure.
4.1. Implementation Details
In this section, we give a detailed description of the procedure used to produce the experiments presented in this work.
Degradation Model. The purpose of the degradation model is to generate a realistic-looking LR face image from an HR image. This model is used both to create the ground truth LR images for the HR image dataset and as the degradation model inside the SR-NAM approach, where it is responsible for converting the generated HR candidate back to an LR image so that it can be compared with the ground truth LR image. We trained the model for 500,000 iterations using the Adam (Kingma and Ba, 2015) optimizer with default settings. The discriminator was trained for 5 iterations per generator update. The gradient penalty weight was set to 10, the latent size of the input noise to 100, and the batch size to 64. Finally, a pre-trained version of the VGG network with 19 layers was used for the perceptual loss.
HR Generative Model. SR-NAM takes as input a pre-trained generative model of the HR image domain. As mentioned in Section 3.3, we use the ProGAN model. The depth of resolution was set to 5, which corresponds to 64×64 images. The resolutions were trained for 10, 20, 20, 20 and 50 epochs respectively, with batch sizes of 64, 64, 64, 32 and 16. Due to the complexity of the face generation problem, we used a latent size of 512. The Adam optimizer with default settings was used for the optimization. Training the model took approximately two weeks on a single NVIDIA GeForce GTX 1080 Ti GPU.
SR-NAM Model. Since SR-NAM takes both the HR generative model and the degradation model as pre-trained, fixed networks, the only optimization needed is over the latent code z_x of each training/testing example. Again, we use the Adam optimizer with default settings. Since the results are sensitive to how well generalized the HR generative model is, the number of iterations each example needs in order to recover a corresponding HR image from the learned HR space varies. Empirically, we found that between 250 and 500 iterations works well.
4.2. Degradation Model Results
Figure 4 shows the results of the trained degradation model. The model is clearly able to produce a 16×16 low resolution image from a 64×64 high resolution image. It is worth noting that the network can model a variety of image degradation styles at different levels, including blurriness, distortion, colouring, illumination and face geometry. It thus learns the types of noise that are likely to occur in a real-world setting, making the output look as if the image had been taken with a low quality camera.
4.3. Progressive Generator Results
We show examples of a variety of face images generated at 64×64 by ProGAN. Figure 5 shows faces generated at 4×4, 8×8, 16×16, 32×32 and finally 64×64, using a fixed random noise vector as input at each resolution. The progressive generator successfully learns to produce clear 64×64 face images.
4.4. SR-NAM Results
In this section, we evaluate the performance of SR-NAM. The details of our experiments are as follows:
Performance metrics. The scope of this paper is to create an approach capable of generating multiple HR face images that correspond to a single LR input face image. To the best of our knowledge, there is currently no quantitative metric that measures whether an image follows a given ground truth image both perceptually and visually. To overcome this issue, we propose the following new metric, which can be used to measure the performance of each generated HR image. The metric uses a facial landmark localization algorithm such as (Bulat and Tzimiropoulos, 2017b) to find the facial landmarks in the generated HR image and compare them with the landmarks of the original HR image. The metric is defined as:

d = \frac{1}{N} \sum_{n=1}^{N} \sum_{i,j} \big( \widetilde{M}_n(i,j) - M_n(i,j) \big)^2

where \widetilde{M}_n(i,j) is the heatmap corresponding to the n-th landmark at pixel (i,j), produced by the facial landmark localization algorithm applied to the generated HR image \hat{y}, and M_n(i,j) is the heatmap obtained by running the algorithm on the original HR image y. Due to time constraints, we did not perform a quantitative evaluation using this metric, but it is worth noting it as a possible new metric for this problem.
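The proposed metric, a squared difference between landmark heatmaps averaged over landmarks, can be sketched as follows. The heatmaps here are tiny hand-written grids standing in for the output of a landmark localization network, and `heatmap_distance` is a hypothetical name:

```python
def heatmap_distance(M_gen, M_orig):
    """Squared difference between the landmark heatmaps of the generated
    and the original HR image, averaged over the N landmarks."""
    n = len(M_gen)
    total = 0.0
    for Mg, Mo in zip(M_gen, M_orig):
        total += sum((g - o) ** 2
                     for rg, ro in zip(Mg, Mo)
                     for g, o in zip(rg, ro))
    return total / n

# Two landmarks on a 2x2 grid; the second heatmap is slightly shifted.
gen = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
orig = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
print(heatmap_distance(gen, gen))   # identical heatmaps: 0.0
print(heatmap_distance(gen, orig))  # 1.0
```

A score of zero means the landmark detector fires in exactly the same places on both images; larger scores indicate that the generated face has drifted geometrically from the original.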
For evaluating the quality of generated HR images, there are two standard metrics in the literature, namely PSNR and SSIM (Bovik et al., 2004). To date, these metrics have been heavily criticized by the research community (Ledig et al., 2016; Bulat and Tzimiropoulos, 2017c), as they fail to capture real image quality and are considered poor measures. Since the scope of this work is not to produce better quality HR images but to find a way of producing multiple corresponding HR images from an LR input image, we did not compute these metrics.
Qualitative Results. Figure 6 shows qualitative results for several images from the HR dataset. The first set of columns shows the original HR image and the corresponding degraded LR image. The remaining columns show the generated HR image and its degraded LR version for three different reconstructions, each using a different random initialization of z_x. The proposed approach is unsupervised, i.e. at inference time z_x is learned by comparing the input LR image with the degraded generated HR image until they match. It is worth noting that the model successfully matches the input LR image with the generated LR images, as visualized in Figure 6. In addition, it successfully generates a new plausible reconstruction of the input LR image each time.
We also show results in Figure 7 using the LR dataset described in Section 3.1. The method clearly reconstructs an HR image that is much sharper than the LR input. The resulting faces still contain a lot of noise and do not always follow the exact geometry and other details of the input faces, but they still closely resemble the LR input image.
4.5. Failure Cases and Discussion
The success of this approach relies on a very well trained and generalized HR generative model. Without a generator whose learned face space is vast enough to cover practically all possible faces, this method would not work in practice. We by no means claim that our current training is sufficient to solve this problem, but as a proof of concept it is clear that, given appropriately trained and generalized models, this method can yield different reconstructions each time. In Figure 8, we show some failure cases where the generated face either does not resemble the LR face or fails to be a face at all. Figure 8 also depicts cases where reconstructing an HR face is challenging due to distortion and illumination.
5. Conclusion and Future Work
In this paper, we presented a method for face super-resolution which does not assume that there is a single HR image for an LR input, but instead maps the LR image to multiple candidate HR images. In addition, the presented method does not assume an artificially generated LR image as input, but aims to work on real-world LR images. We discussed the advantages and disadvantages of the presented method, including the fact that its power lies mainly in the training of the HR generator, which needs to generalize over all possible faces in order to perform well on new, unseen examples. Finally, we demonstrated qualitative results of the SR-NAM method on both real-world low resolution images and degraded high resolution images.
References

- Fast, accurate, and lightweight super-resolution with cascading residual network. CoRR abs/1803.08664. External Links: Cited by: §2.1.
- Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 214–223. External Links: Cited by: §3.2.
- Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. External Links: Cited by: §4.4.
- How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). CoRR abs/1703.07332. External Links: Cited by: §3.1.
- How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, Cited by: §2.1, §4.4.
- Super-fan: integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. CoRR abs/1712.02765. External Links: Cited by: §4.4.
- To learn image super-resolution, use a GAN to learn how to do image degradation first. CoRR abs/1807.11458. External Links: Cited by: Figure 1, §1, §2.1, §3.1, §3.1, §3.1, §3.2, §3.2, §3.2.
- Real-time video super-resolution with spatio-temporal networks and motion compensation. CoRR abs/1611.05250. External Links: Cited by: §1.
- Attention-aware face hallucination via deep reinforcement learning. CoRR abs/1708.03132. External Links: Cited by: §2.1.
- VGGFace2: A dataset for recognising faces across pose and age. CoRR abs/1710.08092. External Links: Cited by: §3.1.
- FSRNet: end-to-end learning face super-resolution with facial priors. CoRR abs/1711.10703. External Links: Cited by: §2.1.
- StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. CoRR abs/1711.09020. External Links: Cited by: §2.2.
- Image super-resolution using deep convolutional networks. CoRR abs/1501.00092. External Links: Cited by: §1.
- Image upsampling via imposed edge statistics. ACM Trans. Graph. 26 (3). External Links: Cited by: §2.1.
- Improved training of wasserstein gans. CoRR abs/1704.00028. External Links: Cited by: §3.2.
- Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Cited by: §3.2.
- NAM: non-adversarial unsupervised domain mapping. CoRR abs/1806.00804. External Links: Cited by: §2.2, §3.4, §3.4.
- Image super-resolution using gradient profile prior. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. External Links: Cited by: §2.1.
- Improving resolution by image registration. CVGIP: Graph. Models Image Process. 53 (3), pp. 231–239. External Links: Cited by: §2.1.
- Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155. External Links: Cited by: §3.2.
- Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2, pp. 109–122. Cited by: §1.
- Progressive growing of gans for improved quality, stability, and variation. CoRR abs/1710.10196. External Links: Cited by: Figure 2, §3.3, §3.3, §3.3.
- Accurate image super-resolution using very deep convolutional networks. CoRR abs/1511.04587. External Links: Cited by: §1.
- Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1127–1133. External Links: Cited by: §2.1.
- Learning to discover cross-domain relations with generative adversarial networks. CoRR abs/1703.05192. External Links: Cited by: §2.2.
- Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Cited by: §4.1.
- Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Vol. , pp. 2144–2151. External Links: Cited by: §3.1.
- Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802. External Links: Cited by: §1, §4.4.
- Robust video super-resolution with learned temporal dynamics. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2526–2534. External Links: Cited by: §1.
- Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Vol. , pp. 730–734. External Links: Cited by: §3.2.
- Deep learning face attributes in the wild. CoRR abs/1411.7766. External Links: Cited by: §3.1.
- SRFeat: single image super-resolution with feature discrimination. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
- EnhanceNet: single image super-resolution through automated texture synthesis. CoRR abs/1612.07919. External Links: Cited by: §1.
- Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. CoRR abs/1609.05158. External Links: Cited by: §1.
- ”Zero-shot” super-resolution using deep internal learning. CoRR abs/1712.06087. External Links: Cited by: §1.
- Detail-revealing deep video super-resolution. CoRR abs/1704.02738. External Links: Cited by: §1.
- Image super-resolution using dense skip connections. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 4809–4817. External Links: Cited by: §2.1.
- Recovering realistic texture in image super-resolution by deep spatial feature transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873. External Links: Cited by: §2.1.
- WIDER face: a face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
- DualGAN: unsupervised dual learning for image-to-image translation. CoRR abs/1704.02510. External Links: Cited by: §2.2.
- Super-resolving very low-resolution face images with supplementary attributes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- Ultra-resolving face images by discriminative generative networks. In ECCV, Cited by: §2.1.
- On single image scale-up using sparse-representations. In Proceedings of the 7th International Conference on Curves and Surfaces, Berlin, Heidelberg, pp. 711–730. External Links: Cited by: §2.1.
- Non-local kernel regression for image and video restoration. In ECCV, Cited by: §2.1.
- Deep plug-and-play super-resolution for arbitrary blur kernels. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, Cited by: §2.2.
- Deep cascaded bi-network for face hallucination. CoRR abs/1607.05046. External Links: Cited by: §2.1.