Disentangle Perceptual Learning through Online Contrastive Learning

Pursuing realistic results in terms of human visual perception is a central concern in image transformation tasks. Perceptual learning approaches such as the perceptual loss are empirically powerful for these tasks, but they usually rely on a pre-trained classification network to provide features, which are not necessarily optimal for the visual perception of image transformation. In this paper, we argue that, among the feature representations from the pre-trained classification network, only limited dimensions are related to human visual perception, while the others are irrelevant, although both affect the final transformation results. Under this assumption, we disentangle the perception-relevant dimensions from the representation through our proposed online contrastive learning. The resulting network consists of the pre-trained part and a feature selection layer, followed by a contrastive learning module, which utilizes the transformed results, target images, and task-oriented distorted images as the positive, negative, and anchor samples, respectively. The contrastive learning aims to activate the perception-relevant dimensions and suppress the irrelevant ones via the triplet loss, so that the original representation can be disentangled for better perceptual quality. Experiments on various image transformation tasks demonstrate the superiority of our framework, in terms of human visual perception, over existing approaches using pre-trained networks and empirically designed losses.


1 Introduction

Image transformation aims at transforming images from one condition or scenario into another, e.g., low-resolution into high-resolution, low-lighting into normal-lighting, etc. Recent deep learning-based methods Dong et al. (2015); Lim et al. (2017); Zhang et al. (2018b) have achieved significant improvements in transforming image contents, but the visual quality of the transformed images is often imperfect, especially in terms of human perception.

More recent works Ledig et al. (2017); Wang et al. (2018); Jolicoeur-Martineau (2018) introduce perceptual learning to address this issue. They utilize a pre-trained classification network $\Phi$ to extract high-dimensional features as the representations of both the generated image $G(x)$ and the target image $y$, where $G$ is the generator and $x$ is the input image, and then measure the distance between these two representations as the loss function, formulated as:

$$\mathcal{L}_{\mathrm{perceptual}} = \big\| \Phi(G(x)) - \Phi(y) \big\|_2^2 \qquad (1)$$

Compared with distance metrics such as MAE or MSE used in earlier works, the perceptual distance is measured in the feature space instead of the pixel space, which is considered more compact and more relevant to human perception. Furthermore, Mechrez et al. Mechrez et al. (2018b) propose the contextual loss, which measures the distance in a feature contextual space, defined as:

$$\mathcal{L}_{\mathrm{contextual}} = -\log\Big( \frac{1}{N} \sum_{j} \max_{i} \, \mathrm{CX}_{ij} \Big) \qquad (2)$$

where $\mathrm{CX}_{ij}$ is the similarity between the $i$-th feature of $\Phi(G(x))$ and the $j$-th feature of $\Phi(y)$, and is usually calculated using the normalized cosine distance. Compared to $\mathcal{L}_{\mathrm{perceptual}}$, $\mathcal{L}_{\mathrm{contextual}}$ is calculated in the feature contextual space, which is supposed to be more robust when the training image pairs are not spatially aligned. The impact of these methods is analyzed in more detail in Mechrez et al. Mechrez et al. (2018a) and Yang et al. Yang et al. (2019).

In general, the extracted features from the pre-trained network can be regarded as probability distributions over the input images $G(x)$ and $y$, denoted as $q$ and $p$, respectively. Then minimizing the distance between $\Phi(G(x))$ and $\Phi(y)$ using $\mathcal{L}_{\mathrm{perceptual}}$ or $\mathcal{L}_{\mathrm{contextual}}$ is similar to minimizing the Kullback-Leibler (KL) divergence between the two distributions $p$ and $q$, as shown in Eq. (3):

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i} \qquad (3)$$

Since $\Phi(G(x))$ and $\Phi(y)$ are features extracted from the pre-trained network (usually a classification network trained on large-scale classification datasets), we can consider $q$ and $p$ as mappings from the pixel space into the semantic manifold of natural images learned by the pre-trained network $\Phi$. Therefore, the images generated with these perceptual learning approaches can be more realistic.
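To unpack this reading, one can use the standard decomposition of the KL divergence into a cross-entropy term and the entropy of the target distribution; this is a generic identity, stated under the simplifying assumption that the normalized feature activations are treated as discrete distributions:

$$D_{\mathrm{KL}}(p \,\|\, q) = -\sum_i p_i \log q_i - \Big(-\sum_i p_i \log p_i\Big) = H(p, q) - H(p).$$

Since $H(p)$ does not depend on the generator, driving $q$ toward $p$ by minimizing the feature-matching loss also reduces the cross-entropy $H(p, q)$ and hence the KL divergence.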

It should be noted, however, that these two loss functions often play relatively small roles in the final loss function of the entire network training, even though they might be important for human perception. In practice, they are often combined with traditional pixel-wise losses or adversarial losses, and serve only as auxiliaries to avoid artifacts. We argue that these perceptual losses do not work well alone because of the irrelevant features contained in the pre-trained representations and the poor generalization of the feature projections when applied to new datasets. For the first problem, we conduct several experiments combining the contextual loss with the L1 loss under different weights, and observe that directly using the contextual loss leads to unexpected artifacts in the generated images, e.g., color offset, ripple artifacts, and blurring. These results are shown in the left part of Figure 1. Increasing the weight of the L1 loss reduces the artifacts but blurs the generated images. Furthermore, in the middle part of Figure 1, we visualize the dimension-reduced representations of three images from two scenes and two seasons, mapped using the pre-trained network. We notice that the two images from the same scene but different seasons have a smaller distance than the two images from different scenes but the same season, which means the pre-trained features are not suitable for transformation tasks such as season transfer. To achieve season transfer, we need to push apart the images of different seasons and pull together the images of the same season, as shown in the right part of Figure 1.

Figure 1: The left part: the image quality tradeoff between the contextual loss and L1 loss. The right part: changes in feature distance between different seasons before and after learning.

In other words, we believe that features pre-trained for classification contain information that is irrelevant to other image transformation tasks. The perceptual losses built upon these features, though they capture more perceptual information than simple pixel-wise losses, also introduce much irrelevant information that misleads the training of the image transformation networks. Such features, therefore, should not be directly applied in training image transformation networks. To overcome this drawback, in this paper we propose to fine-tune the pre-trained network using task-oriented instance triplet samples, so that the traditional pre-trained representation can be disentangled into a set of task-relevant dimensions and a set of task-irrelevant dimensions. To achieve this goal, we introduce a novel online contrastive learning scheme that activates the task-relevant dimensions and suppresses the task-irrelevant ones.

2 Related Work

2.1 Image Transformation

Pursuing realistic image transformation has been addressed by several recent works. Both adversarial methods, e.g., SRGAN Ledig et al. (2017), and perceptual methods, e.g., the perceptual loss Johnson et al. (2016), improve the realism of generated images by minimizing the distance between high-dimensional features, which represent semantic and other information extracted from pre-trained or jointly trained deep networks. Compared with pixel-based loss functions, e.g., MAE and MSE losses, which usually lead to over-smooth results, these works tend to generate finer texture details. However, they suffer from ambiguous convergence during adversarial training or unpleasant artifacts, mainly because the deep features entangle both realism-relevant and irrelevant dimensions. To reduce the artifacts, some works, e.g., ESRGAN Wang et al. (2018), resort to selecting stronger features from the pre-trained network to represent the images, and enhance the generator and discriminator architectures to ease convergence. Apart from representation enhancement, optimizing the distance measure has also been discussed in recent works: the contextual loss Mechrez et al. (2018b) computes similarities over contextual features, which can be seen as maximizing the similarity to the real image distribution via non-parametric estimation, instead of minimizing the Euclidean distance as discussed above. Although these methods are robust on unaligned data thanks to the optimized distance measure and tend to generate pleasant results, many failure cases remain. Moreover, these methods lack convincing justification for the feature selection and analysis of what makes the results realistic, which makes them hard to improve further and usable only as black boxes.

2.2 Representation Learning

Extracting efficient representations from images is crucial to measuring the distance between transformed images and target images. Although the perceptual loss Johnson et al. (2016) and the contextual loss Mechrez et al. (2018b) claim that features extracted from pre-trained VGG-19 or AlexNet best represent human perception of images, many works have been proposed to refine pre-trained representations. Pre-training networks on common tasks in an unsupervised manner and then fine-tuning the representations on specific sub-tasks has shown its superiority in both vision and language, e.g., MoCo He et al. (2019), SimCLR Chen et al. (2020), GPT Radford et al. (2018), and BERT Devlin et al. (2018). These methods use contrastive learning to pre-train on common tasks so that they can capture the relations between similar and dissimilar objects without labels. For example, the triplet loss Hoffer and Ailon (2015); Schroff et al. (2015) maximizes the distance from anchor samples to negative samples and minimizes the distance from anchor samples to positive samples. Through such a process, the optimized networks can learn useful representations from data. However, these vision-related pre-training methods are usually applied to recognition tasks Chen et al. (2020) rather than generation tasks. For image generation, most works use the representation of images as the latent code between an encoder and a decoder, and a discriminator is then used to regularize the generated results given a specific latent code. Chen et al. Chen et al. (2016) introduce InfoGAN, which uses unsupervised learning to learn disentangled representations by decomposing the input into incompressible noise and a latent code, so that the representation is related to the latent code only. Although these methods can generate diversified results from the encoded representation, the quality of the generated results cannot be guaranteed. Jolicoeur-Martineau Jolicoeur-Martineau (2018) and Wang et al. Wang et al. (2018) use the probabilities of generated images to stabilize the training process and enhance image quality. These probability similarities are measured as feature differences inside the discriminator, which can be seen as a special case of measuring representation differences.

3 Disentangled Perceptual Learning

  input: source images x, target images y, generator network G, pre-trained network Φ, feature selection layer h, accumulation interval K, random crop function C(·), task-oriented distortion function T(·).
  for each sampled mini-batch (x, y) do
     freeze the parameters of Φ and h
     ŷ ← G(x)
     compute the perceptual loss ‖h(Φ(ŷ)) − h(Φ(y))‖² and update G
     unfreeze the parameters of h (Φ stays frozen)
     construct the triplet: positive ← C(ŷ), negative ← C(y), anchor ← C(T(y))
     # accumulate gradient of the triplet loss
     compute the triplet loss on (anchor, positive, negative) with margin m
     accumulate its gradient with respect to h
     every K iterations, update h and reset the accumulated gradient
  end for
  return generator network G, and throw away Φ and h
Algorithm 1: Our proposed Disentangled Perceptual Learning.

In this paper, we adopt the perceptual loss as the main loss function for training networks, using a deep feature-based loss instead of pixel losses, adversarial losses, or other handcrafted loss functions. Deep feature-based losses tend to generate images with more realistic details and empirically provide a more stable training process, but they are usually incorporated only as auxiliary loss functions in previous works because they are hard to interpret and control. To overcome these issues, in this section we describe the details of our proposed Disentangled Perceptual Learning (DPL) with online contrastive learning as a new general framework of perceptual learning. We separate DPL into three components: online contrastive learning, feature selection as fine-tuning, and task-oriented disentanglement. The overall method is summarized in Algorithm 1.

3.1 Online Contrastive Learning

The superiority of perceptual learning mainly comes from the pre-trained classification network it relies on. A classification network pre-trained on a large-scale image classification dataset can map an input image into a high-level feature space, where images with similar contents are projected into similar embeddings. Previous works Johnson et al. (2016); Wang et al. (2018) state that distances calculated in this high-level embedding space are closer to human perception, since the pre-trained network $\Phi$ can omit information that is useless for human recognition. The perceptual loss thus acts like a pixel loss on a more compact image representation without trivial information. Here we formulate the perceptual loss in terms of the widely used MSE loss in the feature manifold as

$$\mathcal{L}_{\mathrm{perceptual}}(\theta) = \big\| \Phi(G_\theta(x)) - \Phi(y) \big\|_2^2 \qquad (4)$$

where $G_\theta$ is the generator with parameters $\theta$ that learns to transform the input image $x$ into the target image $y$, and the weights of $\Phi$ are fixed during the training phase. However, images generated with the original pre-trained network usually contain various artifacts. We assume the representation extracted from $\Phi$ is not powerful enough to represent the images, due to the distribution divergence between the pre-training classification dataset and the transformation dataset. One straightforward modification is to fine-tune the pre-trained $\Phi$ for the image transformation task, but this is usually impractical due to the lack of labeled data.

Here we introduce online contrastive learning to train the pre-trained network $\Phi$ and the generator $G$ simultaneously. It aims to learn distinctiveness: similar objects should be close in the feature space and different objects far apart. Since it learns in a self-supervised manner, no categorical labels are needed during training. Inspired by unsupervised learning methods for recognition tasks Chen et al. (2020), the triplet is constructed using a random crop function on instance images, which produces a different crop at each call. The final loss function can be formulated as:

$$\mathcal{L}_{\mathrm{triplet}}(\phi) = \max\big( \| \Phi_\phi(a) - \Phi_\phi(p) \|_2 - \| \Phi_\phi(a) - \Phi_\phi(n) \|_2 + m,\; 0 \big) \qquad (5)$$

where $a$, $p$, and $n$ denote the anchor, positive, and negative samples, $\phi$ denotes the parameters of the pre-trained network, and $m$ is the margin between the positive pairs and negative pairs. During training, the optimization of $G$ and the optimization of $\Phi$ are conducted simultaneously with two different optimizers. However, it is difficult to find a meaningful triplet from randomly sampled images, so we apply gradient accumulation when updating the parameters of $\Phi$: specifically, we update the parameters of $\Phi$ after every $K$ forward iterations. A more general way to understand this process is that we learn a pre-trained network to distinguish the generated images from the ground truth. The pre-trained network can thus be thought of as the discriminator used in the relativistic GAN Jolicoeur-Martineau (2018), except that the discriminator weights are transferred from pre-training and the training difficulty at the initial stage is greatly reduced.
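A minimal PyTorch sketch of one training step is given below. The module names (`generator`, `phi`) and hyper-parameters (margin, crop size, accumulation interval K) are illustrative assumptions, and the specific assignment of the cropped anchor/positive/negative samples reflects our reading of the text (two crops of the target versus a crop of the generated image, so that the representation learns to tell generated from real), not a verbatim reproduction of the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

triplet_loss = nn.TripletMarginLoss(margin=1.0)   # margin value is an assumption
crop = T.RandomCrop(128)                          # a different random crop at each call
opt_g   = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_phi = torch.optim.Adam(phi.parameters(), lr=1e-4)
K = 16                                            # gradient-accumulation interval (assumed)

def train_step(step, x, y):
    # 1) update the generator with the perceptual loss while phi is frozen (Eq. 4)
    for p in phi.parameters():
        p.requires_grad_(False)
    y_hat = generator(x)
    loss_g = nn.functional.mse_loss(phi(y_hat), phi(y))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # 2) accumulate the triplet gradient for phi (Eq. 5): anchor and negative are
    #    two random crops of the target, the positive is a crop of the generated image
    for p in phi.parameters():
        p.requires_grad_(True)
    anchor, positive, negative = crop(y), crop(y_hat.detach()), crop(y)
    loss_phi = triplet_loss(phi(anchor), phi(positive), phi(negative)) / K
    loss_phi.backward()
    if (step + 1) % K == 0:                       # delayed update of phi every K steps
        opt_phi.step(); opt_phi.zero_grad()
```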

3.2 Feature Selection as Fine-Tune

In the previous section, we introduced online contrastive learning, which updates the parameters of the pre-trained network during training. However, such an operation easily overfits, leaving the representation related only to classification. As a result, both the contrastive learning and the generator optimization become difficult to converge, because the adversarial learning degenerates.

To overcome this, we introduce a non-linear feature selection layer $h$ after the pre-trained network $\Phi$ and freeze the parameters of $\Phi$ during the fine-tuning process. The feature selection layer consists of two convolution layers with one activation layer inserted between them. A similar architecture is used in other representation learning networks, e.g., SimCLR Chen et al. (2020); the difference is that in our work the parameters of the pre-trained network are frozen and only the parameters of the feature selection layer are trainable. With the introduced feature selection layer, we can combine features from different channels and learn to activate features per channel, so that the features related to the image differences are activated and the representation is further disentangled with less disturbance from irrelevant features. In addition, a channel reduction is used to further compress the dimension of the output features.
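A sketch of such a feature selection layer is shown below; the 1x1 kernel size, the ReLU activation, and the channel counts are assumptions made for illustration rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    """Sketch of the feature selection layer h: two convolutions with an
    activation in between, plus channel reduction on the output."""
    def __init__(self, in_channels=512, reduced_channels=128):
        super().__init__()
        self.select = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, reduced_channels, kernel_size=1),  # channel reduction
        )

    def forward(self, feat):
        # feat: (B, C, H, W) features from the frozen pre-trained network
        return self.select(feat)

# During fine-tuning, only h is optimized while the backbone stays frozen, e.g.:
# opt_phi = torch.optim.Adam(h.parameters(), lr=1e-4)
```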

3.3 Task-Oriented Disentanglement

Beyond disentangling irrelevant dimensions from the extracted representation, we further extend the idea to decomposing perception-relevant factors, e.g., color, sharpness, and other perceptual factors that affect images; we call this task-oriented disentanglement. To implement it, we construct instance triplet samples for online contrastive learning, where the anchor samples are generated from the target images using a task-specific distortion. For example, the perceptual factor of color accuracy is hard to measure explicitly, but a human can easily tell which of two distorted images is more accurate in color when given a reference image, even if the reference image is blurred or affected by other factors. That is to say, humans disentangle perceptual factors by comparing contrastive samples. Inspired by this intuition, we introduce task-oriented disentanglement to separate each perceptual factor implicitly within the network.

More specifically, we disentangle the perception-relevant factors by constructing anchor samples from the target images using a randomly applied, task-specific distortion $T(\cdot)$ that is related only to the factor to be separated. Online contrastive learning is then performed with the representation network, composed of the frozen pre-trained network and the feature selection layer introduced above, which is optimized by minimizing

$$\mathcal{L}_{\mathrm{task}}(\phi) = \max\big( \| f(T(y)) - f(G(x)) \|_2 - \| f(T(y)) - f(y) \|_2 + m,\; 0 \big), \quad f(\cdot) = h\big(\Phi(\cdot)\big) \qquad (6)$$

The whole convergence process is also illustrated in Figure 1. It can be summarized as first maximizing the distance between $f(T(y))$ and $f(y)$ to optimize the representation network, whose representations differ because of the difference in the chosen factor, and then minimizing the distance between $f(G(x))$ and $f(y)$ to optimize $G$, which is similar to enlarging the distance between $G(x)$ and $T(y)$ along the specific dimensions related to the perceptual factor.
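To make the construction concrete, the following sketch builds a task-oriented triplet with torchvision distortions; the specific distortion functions and their parameters are assumptions, since the paper does not spell them out here. The roles follow the assignment stated in the abstract (transformed result as positive, target as negative, distorted target as anchor).

```python
import torch
import torchvision.transforms as T

# Illustrative task-specific distortions (parameter values are assumptions).
blur_distortion  = T.GaussianBlur(kernel_size=21, sigma=(2.0, 5.0))   # for a "blur" factor
color_distortion = T.ColorJitter(brightness=0.4, contrast=0.4,
                                 saturation=0.4, hue=0.1)             # for a "color" factor

def task_oriented_triplet(y, y_hat, distortion):
    """Triplet of Section 3.3: the transformed result is the positive sample,
    the target image the negative, and the distorted target the anchor."""
    anchor   = distortion(y)      # target corrupted along the chosen perceptual factor
    positive = y_hat.detach()     # generated image (detached so G is not updated here)
    negative = y                  # clean target
    return anchor, positive, negative
```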

4 Experimental Results

In this section, we conduct three different experiments to validate the performance of the introduced method and analyze its relation to perceptual quality. Furthermore, we explore the relationship between perceptual quality and different settings. The three experiments are performed on season image transfer Huang et al. (2018), RAW low-light image illumination Chen et al. (2019), and RAW image super-resolution Zhang et al. (2019), respectively. Before training, we apply data augmentations including random flipping, random rotation, and random cropping to the above datasets. During training, we use the Adam optimizer to update the networks. The learning rate is 1e-4 throughout training and the mini-batch size is 1. All experiments are conducted with PyTorch 1.4 and CUDA 10.0 on Ubuntu 18.04 with 8x 2080 Ti GPUs (11 GB version). The source code of our implementation is available in the supplementary material.

4.1 Season Image Transfer

Figure 2: Comparison of transformation results for Winter → Summer and Summer → Winter.

Season image transfer is one of the most representative unpaired image transformation tasks. During learning, the transformation network is trained on a set of winter images and a set of summer images without correspondence, and GAN-based networks are usually used to learn the cycle relation, i.e., winter → summer → winter. To enhance the quality of the generated images, perceptual learning is applied to preserve the perceptual similarity between the original images and the transformed images.

Here we utilize MUNIT, proposed by Huang et al. Huang et al. (2018), as the baseline of our method. More specifically, for online contrastive learning we construct instance triplet samples from random crops of the source images and the generated images, where the random crop operation produces a different crop at each call. With such a setting, the online contrastive learning aims to maximize the distance in the feature dimensions related to season and to minimize the distance in the dimensions unrelated to season. In other words, the representation network is optimized to focus on the scene content that most affects human perception of winter and summer. Since there is already a recognizable divergence between winter images and summer images, we only update the perceptual network after every 100 iterations of the generator, and we do not use feature selection or task-oriented augmentation here. The generated results are shown in Figure 2. The left part of Figure 2 shows the Winter → Summer results: applying the perceptual loss alone generates brighter results but has no distinct effect on the scene, whereas applying the perceptual loss with our online contrastive learning yields results that appear to contain more green plants and less snow. Similar differences can be observed in the right part of Figure 2, which shows the Summer → Winter results. Therefore, we conclude that the proposed online contrastive learning is beneficial for preserving perceptual similarity between images even when they belong to different seasons.

4.2 RAW Low-light Image Enhancement

Figure 3: Comparison of transformation results on the low-light image enhancement task.

In the area of low-light enhancement, RAW images have received considerable attention in recent works Chen et al. (2018); Chen et al. (2019). Compared with the previously used RGB images, RAW images provide more information and suffer less information loss since they come directly from the CMOS sensor. However, such images differ notably from the normal images humans find acceptable, e.g., in color space offset, optical noise, and optical distortion. Therefore, enhancing the perceptual quality of the generated normal-light images has crucial practical significance.

Here we adopt SMD, proposed by Chen et al. Chen et al. (2019), as the baseline of our method, and compare with EnlightenGAN Jiang et al. (2019), a state-of-the-art method in low-light image enhancement. SMD utilizes features extracted from different layers of a pre-trained VGG-19 network and calculates the distance between the illuminated images and the target images in those feature spaces; the final training loss is a combination of the VGG-19 loss and the L1 loss. However, as Figure 3 shows, results generated with such a loss function tend to have an uneven illumination appearance. To boost the visual quality of the generated results, we construct task-oriented instance triplet samples to fine-tune the pre-trained VGG-19 network with online contrastive learning. In Table 1 we add each component of our method incrementally to provide a detailed analysis. The evaluation metrics include the pixel-based PSNR and the perception-related MS-SSIM Wang et al. (2003) and LPIPS Zhang et al. (2018a). For convenience, we use FS to denote whether feature selection as fine-tuning is used. Our method not only achieves state-of-the-art perceptual performance, but also generates results with the best visual quality and more realistic details, even compared to the handcrafted texture loss.

| Methods                          | Backbone Loss      | FS | Task | PSNR   | MS-SSIM | LPIPS  |
| Traditional                      | -                  | -  | -    | 17.096 | 0.8039  | 0.4185 |
| EnlightenGAN Jiang et al. (2019) | Adversarial        | -  | -    | 20.556 | 0.9168  | 0.2525 |
| SMD Chen et al. (2019)           | VGG                | -  | -    | 23.541 | 0.9147  | 0.1946 |
| SMD Chen et al. (2019)           | VGG + Texture Loss | -  | -    | 22.147 | 0.8791  | 0.2218 |
| Ours                             | VGG                | ✓  | -    | 23.710 | 0.9210  | 0.1912 |
| Ours                             | VGG                | ✓  | Blur | 24.138 | 0.9081  | 0.1874 |
Table 1: Quantitative comparisons on the low-light enhancement dataset.
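The metrics reported in Tables 1 and 2 can be computed along the lines below. This sketch assumes the third-party `lpips` and `pytorch-msssim` packages, which the paper does not explicitly name, and inputs in the [0, 1] range.

```python
import torch
import lpips                          # pip package "lpips" (Zhang et al., 2018a)
from pytorch_msssim import ms_ssim    # pip package "pytorch-msssim"

lpips_fn = lpips.LPIPS(net='vgg')     # LPIPS expects inputs scaled to [-1, 1]

def evaluate(pred, target):
    """pred, target: (B, 3, H, W) tensors in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    psnr = 10 * torch.log10(1.0 / mse)                   # pixel-based PSNR
    msssim = ms_ssim(pred, target, data_range=1.0)       # perception-related MS-SSIM
    lp = lpips_fn(pred * 2 - 1, target * 2 - 1).mean()   # perception-related LPIPS
    return psnr.item(), msssim.item(), lp.item()
```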

4.3 RAW Image Super-Resolution

Recent works Zhang et al. (2019); Cai et al. (2019) have shown remarkable differences between the real-world super-resolution problem and simulated super-resolution, especially in how the low-resolution images are degraded. However, even though training on their proposed super-resolution datasets improves the results, especially with perceptual learning, two problems arise that cannot be ignored. The first is the misalignment between the low-resolution images and the corresponding high-resolution images introduced during collection. The second is the color space divergence between RAW images and RGB images. Both seriously affect the final results and make pixel-based loss functions and contextual loss functions work poorly. The left part of Figure 1 shows the current trade-off solution, which adjusts the weights between the L1 loss and the contextual loss to balance color and sharpness.

To address this long-standing weight adjustment problem, we apply online contrastive learning to the representation network used by the contextual loss. Note that even though the contextual loss calculates distances in the feature contextual space instead of the Euclidean space, our online contrastive learning remains robust, since the triplet loss depends on the difference between the anchor-positive and anchor-negative distances rather than on the absolute distance between samples. Table 2 shows the quantitative comparison between the baseline methods and ours. Here we apply random color jitter to the high-resolution images to obtain the negative samples, which makes the results more color-faithful than those trained with the contextual loss only. The qualitative visual results are shown in Figure 4. Our method achieves the best trade-off between sharpness and color, even though no elaborately tuned weights are used. Note that, since the images used for validation are obviously misaligned, pixel-based metrics such as PSNR have limited reference value.

| Methods                         | Backbone Loss      | FS | Task  | PSNR   | MS-SSIM | LPIPS  |
| RGB + Bi-cubic                  | -                  | -  | -     | 18.556 | 0.6767  | 0.5718 |
| RGB + ESRGAN Wang et al. (2018) | Adversarial        | -  | -     | 18.498 | 0.6756  | 0.4389 |
| EDSR Lim et al. (2017)          | L1                 | -  | -     | 15.705 | 0.5612  | 0.7146 |
| EDSR Lim et al. (2017)          | Contextual         | -  | -     | 13.517 | 0.5740  | 0.5068 |
| EDSR Lim et al. (2017)          | Contextual + Color | -  | -     | 15.396 | 0.4950  | 0.4687 |
| Ours                            | Contextual         | ✓  | Color | 14.455 | 0.5870  | 0.3913 |
Table 2: Quantitative performance comparison between different super-resolution methods and ours.
Figure 4: Comparison of transformation results on RAW image super-resolution.

5 Conclusion

In this paper, we argue that even though perceptual learning approaches using perceptual losses on pre-trained features capture perceptual information better than approaches using pixel-wise losses, they also bring irrelevant information into the image transformation tasks. We therefore introduce an online contrastive learning scheme to fine-tune the pre-trained representation so that it better captures the relationship between the results and the target images. Specifically, we propose a feature selection layer, trained while the pre-trained network is frozen, to preserve the natural image statistics from pre-training and reduce the irrelevant features. Furthermore, we construct task-oriented triplet samples during fine-tuning, which drive the feature selection layer to be more sensitive to task-related statistics. The resulting disentangled representation achieves more realistic results on many image transformation tasks. Our future work will focus on disentangling the representation of human perception with finer control during image transformation.

Appendix A Additional Network Details

Instance Triplet.

Constructing instance triplets plays an essential role in online contrastive learning. It utilizes the self-similarity property to enhance the learning that distinguishes the positive samples from the anchor samples. To help readers better understand it, we illustrate the process in Figure 5, where the random crop function is applied twice.

Figure 5: Visualization of constructing instance triplet during online contrastive learning.

Task-Oriented Instance Triplet.

Different from online contrastive learning, which depends on self-similarity, task-oriented disentanglement focuses on specific perceptual factors rather than on the overall differences between the target images and the generated images. To achieve finer disentanglement during contrastive learning, we apply different distortion algorithms to the target images to obtain the anchor samples. To help readers better understand it, we illustrate the distorted results in Figure 6 with two different distortion algorithms.

Figure 6: Visualization of constructing task-oriented instance triplet during online contrastive learning.

Appendix B Compared Handcrafted Losses Details

In order to compare our method with handcrafted loss functions that can disentangle specific perceptual factors, we conducted several comparisons in the experiments. In this section we describe the details of the handcrafted loss functions used: the color loss and the texture loss proposed by Ignatov et al. Ignatov et al. (2017).

Color Loss.

We use the color loss to measure the difference between two images in brightness, contrast, and color rather than in the pixels themselves. Denoting the transformed image and the target image as $G(x)$ and $y$, both are first processed by a Gaussian blur $B(\cdot)$ to remove high-frequency details, e.g., edges and textures. After this operation, the remaining $B(G(x))$ and $B(y)$ contain only information related to brightness, contrast, and color, so the distance in color and the related perceptual factors can be calculated via the Euclidean distance as:

$$\mathcal{L}_{\mathrm{color}} = \big\| B(G(x)) - B(y) \big\|_2^2.$$
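A small sketch of this color loss is given below; the blur kernel size and sigma are illustrative assumptions rather than the values used by Ignatov et al. (2017).

```python
import torch.nn.functional as F
import torchvision.transforms as T

gaussian_blur = T.GaussianBlur(kernel_size=21, sigma=3.0)  # assumed blur parameters

def color_loss(generated, target):
    # Blurring removes edges and texture, leaving brightness, contrast, and color,
    # which are then compared with a Euclidean (MSE) distance.
    return F.mse_loss(gaussian_blur(generated), gaussian_blur(target))
```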

Texture loss.

Similar in motivation to the color loss, the texture loss is computed on the grayscale versions of the two images, which eliminates the effect of color. Following Ignatov et al. (2017), it is realized with an adversarial discriminator applied to the grayscale images, which learns to distinguish the grayscale transformed images from the grayscale targets.

Appendix C Additional Result Images

On the basis of the season image transfer experiment, we also conduct experiments on the edges2shoes dataset. On edges2shoes, we still utilize MUNIT, proposed by Huang et al. Huang et al. (2018), as the baseline, and the instance triplet samples are constructed in the same way as in the season image transfer task. The results are shown in Figure 7. The left is the input and the right is the ground-truth output; each remaining column shows three random outputs from one method. As can be seen from Figure 7, the compared methods produce some noticeable artifacts, while the results of our online contrastive learning are more in line with human perception and contain more details.

Figure 7: Comparison of transformation results on the edges2shoes dataset.
Figure 8: Comparison of transformation results on RAW image super-resolution.

Broader Impact

This paper introduces an online contrastive learning scheme to disentangle classification-oriented pre-trained image representations for better perceptual learning in image transformation tasks. The framework enables transformed images to be more realistic with fewer artifacts. Artists, photographers, creative workers, and end-users can all benefit from it. It is possible that this technique could be used to make more realistic "fake" images; however, we believe it will ultimately help people understand the mechanism of image transformation more deeply. Besides, no serious consequences arise from failure of the system, and no biases in the data are leveraged.

References

  • [1] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3086–3095. Cited by: §4.3.
  • [2] C. Chen, Q. Chen, M. N. Do, and V. Koltun (2019) Seeing motion in the dark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3185–3194. Cited by: §4.2, §4.2, Table 1, §4.
  • [3] C. Chen, Q. Chen, J. Xu, and V. Koltun (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300. Cited by: §4.2.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2.2, §3.1, §3.2.
  • [5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.2.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
  • [7] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. TPAMI. Cited by: §1.
  • [8] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §2.2.
  • [9] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §2.2.
  • [10] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: Appendix C, §4.1, §4.
  • [11] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) Dslr-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3277–3285. Cited by: Appendix B.
  • [12] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang (2019) Enlightengan: deep light enhancement without paired supervision. arXiv preprint arXiv:1906.06972. Cited by: §4.2, Table 1.
  • [13] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: Disentangle Perceptual Learning through Online Contrastive Learning, §2.1, §2.2, §3.1.
  • [14] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: §1, §2.2, §3.1.
  • [15] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1, §2.1.
  • [16] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §1, Table 2.
  • [17] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor (2018) Maintaining natural image statistics with the contextual loss. In Asian Conference on Computer Vision, pp. 427–443. Cited by: §1.
  • [18] R. Mechrez, I. Talmi, and L. Zelnik-Manor (2018) The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 768–783. Cited by: §1, §2.1, §2.2.
  • [19] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §2.2.
  • [20] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2.2.
  • [21] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §1, §2.1, §2.2, §3.1, Table 2.
  • [22] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §4.2.
  • [23] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao (2019) Deep learning for single image super-resolution: a brief review. IEEE Transactions on Multimedia 21 (12), pp. 3106–3121. Cited by: §1.
  • [24] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §4.2.
  • [25] X. Zhang, Q. Chen, R. Ng, and V. Koltun (2019) Zoom to learn, learn to zoom. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3770. Cited by: §4.3, §4.
  • [26] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV. Cited by: §1.