Image-to-image translation (i2i) methods seek to learn a mapping between two related domains, thereby “translating” an image from one domain to the other while preserving some information from the original. These methods have been applied to various applications in computer vision, such as colorization, super-resolution [9, 37], medical imaging, and photorealistic image synthesis.
In recent years, i2i methods have achieved promising results in terms of visual quality and diversity. CycleGAN learns a one-to-one mapping between domains by introducing the cycle consistency constraint, which forces the model to preserve the content information while changing the style of the image. This idea inspired many works such as [16, 22] to disentangle the feature space into 1) a domain-specific space for the style; and 2) a shared space for the content, which allowed more diverse multimodal generation. Recent methods such as StarGAN [7, 8] unified the process in a single framework to achieve compelling results both in terms of visual quality and diversity. In the problem of reference-guided image synthesis, such methods are capable of extracting the style of a target image and merging it with the content of a source image to generate an image that shares information from both images. An inherent assumption of these methods is that both style and content images share the same spatial resolution.
In this paper, we consider the scenario where the target image is of much lower resolution than the source image. In this case, the target image only contains low frequency information such as the general shape, pose, and color of the subject. We therefore aim to learn a mapping that can leverage the high frequency information present in the HR source image, and combine it with low frequency information preserved in the LR target. As illustrated in fig. 1, such a mapping can learn to generate an HR image that shares distinctive features from both HR and LR inputs in an unsupervised manner.
This scenario also bears resemblance to the image super-resolution problem, which aims at generating an HR image that corresponds to an LR input. Both scenarios are highly ill-posed since there exists a large number of HR images that all downscale to exactly the same LR image, a number that grows exponentially with the downscaling factor. We also demonstrate that our framework can be applied to super-resolve a very low resolution image given a high resolution exemplar as guidance.
The main contribution of this paper is to define a novel framework that deals with resolution mismatch between target and source for image-to-image translation. We demonstrate that the approach can effectively be used when dealing with very low resolution targets where details are completely blurred to the point of being visually unrecognizable. When evaluating our approach on the CelebA-HQ and AFHQ datasets, we show that our framework results in more realistic samples than state-of-the-art image-to-image translation methods such as StarGAN-v2. We validate our findings by reporting FID and LPIPS scores on both datasets, and provide further evidence by reporting density/coverage metrics. These extensive experiments demonstrate that our method can generate results that are photo-realistic and that convincingly fuse information from both the HR source and LR target images.
2 Related work
Generative adversarial networks (GANs) have demonstrated promising results in various applications in computer vision, including image generation [5, 19, 20], super-resolution [9, 37] and image-to-image translation [16, 17, 22, 27, 45].
Recent work has striven to improve sample quality and diversity, notably through theoretical breakthroughs in defining loss functions that provide more stable training [1, 12, 38, 43] and encourage diversity in the generated images [29, 39]. Architectural innovations also played a crucial role in these advancements [5, 8, 18, 25]. For instance, adding an attention layer allows the network to focus on long-range dependencies present in the image. Spectral normalization stabilizes the network, which also translates into higher quality samples. Since our work sits at the intersection of reference-guided image synthesis and super-resolution, the remainder of this section focuses on these two branches exclusively.
Reference-guided image synthesis
Also dubbed “conditional”, reference-guided i2i translation methods seek to learn a mapping from a source to a target domain while being conditioned on a specific image instance belonging to the target domain. In this case, some methods attempt to preserve the “content” of the source image (identity, pose) and apply the “style” (hair/skin color) of the target. Inspired by the mapping network of StyleGAN, recent methods [7, 8, 16] make use of a style encoder to extract the style of a target image and feed it to the generator, typically via adaptive instance normalization (AdaIN). All these methods have the built-in assumption that the resolution of the source and target images is the same. In this work, we explore the scenario where the target (style) image has much lower resolution than the source.
Our approach is also related to super-resolution methods, which aim to learn a mapping from LR to HR images [9, 10]. Other approaches leverage knowledge about the class of images for super-resolution. For example, when dealing with faces, explicit facial priors can be leveraged. Of particular interest, PULSE leverages a pretrained StyleGAN and, through an iterative optimization procedure, searches the space of latent style vectors for a face that downscales correctly to a given LR image. Our method bears resemblance to PULSE, but allows for a guided and therefore more controlled generation procedure with a single forward pass in the network.
More closely related are the so-called reference-guided super-resolution methods, which, in addition to the LR input image, also accept additional HR images for guidance. Here, the reference images need to contain similar content (e.g. textures) as the LR image. Representative recent methods propose to transfer information using cross-scale warping layers or texture transfer. In contrast, our method frames super-resolution in an i2i context, relying on a specific instance (e.g. a specific person for faces) to guide the super-resolution process.
Given an LR target image, we define the associated subspace of LR images as:
We consider a tolerance close to zero, so that each subspace contains only the LR images that are highly similar to the LR target according to a given norm. We also define a subspace in the HR image manifold that contains all HR images that fall within this LR subspace when downscaled:
Here, a downscaling procedure maps each image in the HR space to its corresponding image in the LR space. Therefore, for each subspace on the HR image manifold, there exists a corresponding subspace in the LR space, related by the downscaling procedure. As illustrated in fig. 2, given two HR-LR image pairs, our goal is to learn a function that translates the HR images from one subspace to another. These two subspaces are identified by their LR counterparts, using LR images as additional information to specify the target HR space:
Following conventional GAN terminology, this parameterized function is the generator. Its goal is to translate an HR image from one HR subspace to another while preserving some of the original image information (i.e., high-frequency content). To train the generator, we use a discriminator that plays two roles: 1) to classify whether the generated images are fake or real, pushing the generator to sample from the natural image manifold; and 2) to judge whether the translated high resolution image is part of the right LR subspace when downscaled, guiding the generator into the correct subspace.
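As an illustration of the two subspaces defined in this section, the sketch below tests whether an HR image falls into the HR subspace attached to a given LR target. It is a minimal sketch under stated assumptions: images are plain nested lists of grayscale values, average pooling stands in for the downscaling procedure, and an L1 distance plays the role of the norm. The names `average_pool`, `l1_distance` and `in_hr_subspace` are hypothetical and not part of the method's code.

```python
def average_pool(image, factor):
    """Downscale a 2D image (list of rows) by averaging factor x factor blocks."""
    height, width = len(image), len(image[0])
    assert height % factor == 0 and width % factor == 0
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def l1_distance(a, b):
    """Sum of absolute pixel differences (assumed norm, for illustration only)."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


def in_hr_subspace(hr_image, lr_target, factor, eps):
    """True if the HR image downscales to within eps of the LR target."""
    return l1_distance(average_pool(hr_image, factor), lr_target) <= eps


# Two HR images with different details that downscale to the same LR image.
hr_a = [[0.0, 2.0, 1.0, 1.0],
        [2.0, 4.0, 1.0, 1.0],
        [5.0, 5.0, 0.0, 8.0],
        [5.0, 5.0, 8.0, 0.0]]
hr_b = [[2.0, 2.0, 1.0, 1.0],
        [2.0, 2.0, 1.0, 1.0],
        [5.0, 5.0, 4.0, 4.0],
        [5.0, 5.0, 4.0, 4.0]]
hr_c = [[3.0] * 4 for _ in range(4)]  # downscales to a different LR image

lr_target = average_pool(hr_a, 2)
same_subspace = in_hr_subspace(hr_b, lr_target, factor=2, eps=1e-6)
other_subspace = in_hr_subspace(hr_c, lr_target, factor=2, eps=1e-6)
```

Here `hr_a` and `hr_b` belong to the same HR subspace even though their fine details differ, which is the ambiguity the generator is meant to exploit.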
3.2 Training Objectives
To fulfil these two discrimination roles, the GAN is trained according to the following minimax optimization:
Here, the adversarial loss is used to ensure that we are generating plausible natural images, the cycle consistency loss keeps the translated images on the correct subspace while carrying the right information, and a scalar hyperparameter balances these two losses.
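Written out under assumed notation, the objective described above has the following general form. The symbols $G$, $D$, $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$ and $\lambda$ are placeholders chosen here for the generator, the discriminator, the two losses and the balancing hyperparameter, and attaching $\lambda$ to the cycle term is an assumption:

```latex
\min_{G} \max_{D} \; \mathcal{L}_{adv}(G, D) + \lambda \, \mathcal{L}_{cyc}(G)
```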
At each training iteration, four forward passes are done with the generator: the first two translate from one subspace to the other in both directions, while the other two enforce the cycle consistency constraint by translating each result back to its original subspace. The discriminator is used to make sure that the generated samples are from the designated subspace.
Following prior work, we provide the discriminator with the absolute difference between the downscaled version of the generated image and the LR target:
Here, the downscaled image is rounded according to the color resolution. As in prior work, we round the downscaled image to its nearest color resolution level to avoid unstable optimization caused by exceedingly large weights to measure small pixel value differences. A straight-through estimator is employed to pass the gradient through the rounding operation in eq. 5. The discriminator therefore takes as inputs:
For real samples, the image difference is all zeros, since the downscaled version of a real HR image is exactly its LR counterpart. For fake samples, however, the absolute difference depends on how close the generator output is to the designated subspace. Both networks are trained via the resulting adversarial loss:
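To make the discriminator's difference input concrete, the following sketch computes the rounded absolute difference between a downscaled image and an LR target. It is only an illustration under assumptions: the step of 0.25 is a toy color resolution picked for readability, the function names `quantize` and `lr_difference` are hypothetical, and only the forward pass is shown. In an autodiff framework the rounding would be wrapped in a straight-through estimator, which returns the rounded value in the forward pass and copies gradients through unchanged in the backward pass.

```python
def quantize(value, step):
    """Round a pixel value to the nearest multiple of `step` (forward pass only)."""
    return round(value / step) * step


def lr_difference(downscaled, lr_target, step):
    """Absolute difference between the quantized downscaled image and the LR target."""
    return [[abs(quantize(p, step) - t) for p, t in zip(row_p, row_t)]
            for row_p, row_t in zip(downscaled, lr_target)]


STEP = 0.25  # toy color resolution, an assumption for this example

lr_target = [[0.5, -0.25]]

# Real sample: its downscaled version is the LR target itself, so the map is all zeros.
real_difference = lr_difference(lr_target, lr_target, STEP)

# Fake sample: the first pixel rounds onto the target, the second one does not.
fake_downscaled = [[0.6, 0.3]]
fake_difference = lr_difference(fake_downscaled, lr_target, STEP)
```

The rounding makes small deviations (0.6 versus 0.5) invisible to the discriminator, while larger ones (0.3 versus -0.25) remain visible.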
Cycle consistency loss
To make sure that the generator preserves the high frequency information available in the source HR image, we employ the cycle consistency constraint [17, 26, 45] in both directions, each time by changing the LR target to specify the designated subspace:
This cycle consistency loss encourages the generator to identify the shared/invariant information between each pair of subspaces and preserve it during translation.
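The four generator passes per iteration and the cycle consistency constraint can be illustrated with a toy example. The sketch below is not the paper's network: `toy_generator` is a hand-built stand-in that keeps the high-frequency residual of the source and swaps in the low-frequency content of the LR target, 1D signals stand in for images, and the L1 form of the cycle loss is an assumption.

```python
def down(signal, factor=2):
    """Average-pool a 1D signal."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]


def up(signal, factor=2):
    """Nearest-neighbour upsampling of a 1D signal."""
    return [value for value in signal for _ in range(factor)]


def toy_generator(hr_source, lr_target):
    """Keep the source's high-frequency residual, replace its low frequencies."""
    low = up(down(hr_source))
    return [s - l + t for s, l, t in zip(hr_source, low, up(lr_target))]


def l1(a, b):
    """Mean absolute difference between two signals."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


hr_a = [1.0, 3.0, 6.0, 2.0]
hr_b = [0.0, 8.0, 5.0, 5.0]
lr_a, lr_b = down(hr_a), down(hr_b)

# Passes 1 and 2: translate each HR image into the other subspace.
fake_b = toy_generator(hr_a, lr_b)
fake_a = toy_generator(hr_b, lr_a)

# Passes 3 and 4: translate back, conditioning on the original LR image.
cycle_a = toy_generator(fake_b, lr_a)
cycle_b = toy_generator(fake_a, lr_b)

cycle_loss = l1(cycle_a, hr_a) + l1(cycle_b, hr_b)
```

For this toy generator the translated signals downscale exactly to their LR targets and the cycle loss is zero; a learned generator would only approach both properties during training.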
[Figure: generator building blocks. (a) Encoding Pono residual blocks, used to extract and compress the features coming from the source image and pass them to the decoding part of the generator through moment shortcuts; and (b) decoding Pono SPAdaIN residual blocks, which take as input the features extracted by the encoding part, together with the upsampled LR target image that is passed to the SPAdaIN layer.]
Most of the recent image-to-image translation models (e.g. [7, 8, 16]) rely on Adaptive Instance Normalization (AdaIN) [15, 19] to transfer the style from a reference image to a source image. In our work, however, the content/style hypothesis is not suitable, since the LR image contains information on both style (e.g., colors) and content (e.g., pose). Thus, our generator adapts the HR source image to the content and style of the LR image through the use of spatially adaptive instance normalization (SPAdaIN).
The generator (fig. 3) is U-shaped, with skip connections (moment shortcuts) between the encoding and decoding parts. The encoder takes the input HR image and passes it through a series of downsampling residual blocks (ResBlks). Each ResBlk is equipped with instance normalization (IN) to remove the style of the input, followed by 2D convolution layers and a positional normalization (Pono). The mean and variance are removed from the features and passed as a skip connection to the corresponding block in the decoder. Pono and moment shortcuts play a crucial role in transferring the needed structural information from the HR image to the decoding part of the network. These blocks, dubbed Pono ResBlks, are illustrated in detail in fig. 4a.
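The positional normalization and moment shortcut mechanism can be sketched as follows. This is a minimal illustration, not the authors' implementation: feature maps are nested lists indexed as `[channel][row][column]`, the moments are computed across channels at each spatial position, and the standard deviation (rather than the variance) is used for rescaling, which is an assumption of this sketch. The names `pono` and `moment_shortcut` are hypothetical.

```python
import math


def pono(features, eps=1e-5):
    """Normalize a [C][H][W] feature map across channels at every spatial position."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    mean = [[sum(features[c][y][x] for c in range(channels)) / channels
             for x in range(width)] for y in range(height)]
    std = [[math.sqrt(sum((features[c][y][x] - mean[y][x]) ** 2
                          for c in range(channels)) / channels + eps)
            for x in range(width)] for y in range(height)]
    normalized = [[[(features[c][y][x] - mean[y][x]) / std[y][x]
                    for x in range(width)] for y in range(height)]
                  for c in range(channels)]
    return normalized, mean, std


def moment_shortcut(decoder_features, mean, std):
    """Re-inject encoder moments: scale by std and shift by mean at each position."""
    return [[[value * std[y][x] + mean[y][x] for x, value in enumerate(row)]
             for y, row in enumerate(channel)]
            for channel in decoder_features]


features = [[[1.0, 4.0]], [[3.0, 8.0]]]  # 2 channels on a 1x2 spatial grid
normalized, mean, std = pono(features)
restored = moment_shortcut(normalized, mean, std)
```

Re-injecting the moments into the untouched normalized features recovers the input, which shows that the structural information removed by Pono travels entirely through the shortcut.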
For the decoder blocks (fig. 4b), we use SPAdaIN conditioned on the LR image, where the LR image is first upsampled to the corresponding resolution of the Pono SPAdaIN ResBlk using bilinear upsampling. It is then followed by 2D convolution layers and a dynamic moment shortcut layer, where, instead of reinjecting the encoder mean and variance as is, we use a convolutional layer that takes them as inputs to generate the scale and shift parameters used as moment shortcuts. Using the dynamic version of moment shortcuts allows the network to adapt and align the shape of the incoming structural information to its LR counterpart.
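The idea behind spatially adaptive instance normalization can be sketched in a few lines. This is a simplified illustration under assumptions: the conditioning map is taken as already upsampled to the feature resolution (the bilinear upsampling step is omitted), and a per-pixel linear map with a hypothetical `(weight, bias)` pair per channel stands in for the convolution layers that predict the spatially varying scale and shift.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one [H][W] channel by its own spatial mean and variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in channel]


def spadain(features, condition, scale_params, shift_params):
    """features: [C][H][W]; condition: [H][W] map at the same resolution.

    scale_params / shift_params hold one (weight, bias) pair per channel and
    stand in for learned convolutions applied to the conditioning map.
    """
    output = []
    for channel, (gw, gb), (bw, bb) in zip(features, scale_params, shift_params):
        normalized = instance_norm(channel)
        output.append([[(gw * c + gb) * n + (bw * c + bb)
                        for n, c in zip(norm_row, cond_row)]
                       for norm_row, cond_row in zip(normalized, condition)])
    return output


features = [[[1.0, 2.0], [3.0, 4.0]]]   # one channel, 2x2
condition = [[0.0, 1.0], [1.0, 0.0]]    # toy conditioning map

# Zero scale and identity shift: the output is the conditioning map itself.
copied = spadain(features, condition, [(0.0, 0.0)], [(1.0, 0.0)])

# Unit scale and zero shift: the output is plain instance normalization.
plain = spadain(features, condition, [(0.0, 1.0)], [(0.0, 0.0)])
```

The two extreme settings show the range the layer can cover: it can ignore the conditioning image entirely or overwrite the features with it, and learned parameters sit between these cases at each spatial location.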
We compare our method with StarGAN-v2, the state of the art for image-to-image generation on CelebA-HQ and AFHQ.
We evaluate our method on the CelebA-HQ and AFHQ datasets. For CelebA-HQ, however, we do not separate the two domains into female and male, since both domains are close to each other. Also, we do not use any extra information (e.g. facial attributes of CelebA-HQ). As for AFHQ, we train our network on each domain separately, since the amount of information shared between these is much lower. Average pooling is used as the downscaling operator to generate the LR images, following prior work.
Baseline results are evaluated according to the metrics of image-to-image translation used in [16, 22, 27]. Specifically, diversity and visual quality of samples produced by different methods are evaluated both with the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS). Since the FID score entwines diversity and fidelity in a single metric [32, 33], we also experiment with the density and coverage metrics.
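For reference, the FID compares Gaussian fits of the real and generated feature distributions. In its standard form, with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the mean and covariance of the real and generated features (notation assumed here, since the text does not spell it out), it reads:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Because both the mean shift and the covariance mismatch feed a single number, a drop in fidelity and a drop in diversity cannot be told apart from the score alone.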
4.1 Training Setup
For our experiments, we fixed the LR image resolution to and experimented with and for the HR image resolution—we ablate the effect of LR image resolution in sec. 4.4. We train our networks with Adam  and TTUR , with a learning rate of for the generator and for the discriminator. We also used regularization  with , with a batch size of 8. Spectral normalization  was used in all the layers of both and . In eq. 1, we use to push the downscaled version of the generated image to be as close as possible to the LR target. We set when trained on , and to for .
4.2 Qualitative Evaluation
Fig. 6 compares images obtained with our framework with those obtained with StarGAN-v2 using reference-guided synthesis on CelebA-HQ. Since our method focuses on generating images that downscale to the given LR image, the generator learns to merge the high frequency information present in the source image with the low frequency information of the LR target, while preserving the identity of the person and other distinctive features. Unlike traditional i2i methods, which only change the style of the source image while preserving its content, our method adapts the source image to the pose of the LR target. More qualitative samples obtained with our technique are shown in fig. 5, where the first row of HR images are used as source images and the first column is the LR target. We also display the real HR target to show that our model is capable of generating diverse images that are different from the target.
Fig. 8 displays generated samples on AFHQ. Visually, we notice that our model is capable of merging most of the high frequency information coming from the HR source image with the low frequency information present in the LR target. The degree of this transfer depends on how much information is shared between images of the same domain.
4.3 Quantitative Evaluation
In table 1, we report FID and LPIPS scores on the results obtained on CelebA-HQ, using two different HR resolutions. Results with the higher resolution show a significantly lower FID for our method compared to [16, 22, 27], while being similar to StarGAN-v2. We notice a better FID score with the lower resolution, where our method is significantly better than StarGAN-v2. This is because the task is harder with a higher scale factor, since more of the detailed textural information missing from the LR target must be hallucinated.
For a deeper insight on the differences between our method and StarGAN-v2, we used the density and coverage metrics. The density measures the overlap between the real data and generated samples, while the coverage measures their diversity, as the ratio of real samples that are covered by the fake samples. Following prior work, we embed both the real and fake images in the feature space of a VGG16 network pretrained on ImageNet. The density metric is then obtained from the k-nearest neighbours on the 4096 features obtained from the VGG network’s second fully connected layer. Diversity results reported in table 2 show higher density values for our method, meaning that its samples are closer to the real data distribution than those of StarGAN-v2. Higher coverage values are also obtained for our method, meaning a better coverage of the data distribution modes. This is noticeable at the lower HR resolution, where the coverage is close to its maximum value (coverage is bounded between 0 and 1), being almost 10% higher than StarGAN-v2, while this gap doubles when the HR resolution is increased. This indicates that our model produces more realistic results by exploiting HR information from the source image, while being more diverse by staying faithful to the LR target image.
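The density and coverage metrics can be sketched on toy data as follows. This is an illustrative implementation under assumptions, not the evaluation code used for the tables: 2D points stand in for the VGG features, each real sample gets a ball whose radius is the distance to its k-th nearest real neighbour, a strict inequality decides ball membership, and the function names are hypothetical.

```python
import math


def kth_nn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour among the other points."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def density_and_coverage(real, fake, k):
    """Density: average number of real balls containing each fake sample, divided by k.
    Coverage: fraction of real samples whose ball contains at least one fake sample."""
    radii = kth_nn_radius(real, k)
    inside = [[math.dist(r, f) < radius for f in fake]
              for r, radius in zip(real, radii)]
    density = sum(sum(row) for row in inside) / (k * len(fake))
    coverage = sum(any(row) for row in inside) / len(real)
    return density, coverage


# Two real modes, but the fake samples only land in the first one.
real = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
fake = [(0.5, 0.0), (0.4, 0.0)]

density, coverage = density_and_coverage(real, fake, k=1)
```

In this toy case the fake samples sit in a dense real region, so density is high, but half of the real samples have no fake neighbour, so coverage is only 0.5. This is the kind of mode dropping that a single FID value would not separate from a loss of fidelity.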
We also report quantitative results on AFHQ in table 3, where we train our model on each domain separately given the larger differences between the domain distributions. Indeed, we found that our method excels on domains where images are structurally similar and share information, such as the “cat” domain. However, the “dog” and “wild” domains show a wider variety of breeds and species, meaning less information shared between images from the same domain. This reduces the amount of shared information that can be transferred from the HR source, forcing the generator to hallucinate more details out of LR targets. This is confirmed by the lower LPIPS results obtained by our approach compared to StarGAN-v2, while keeping similar FID scores.
Table 4 illustrates the impact of LR resolution on CelebA-HQ. As the LR target resolution increases, the model relies more and more on the information in the LR target over the information provided by the HR source image. This is confirmed by the sustained decrease of the LPIPS score as the LR resolution increases, while the FID score remains similar. The effect of color in the LR target images is also ablated in the supp. material.
4.5 Comparison to Super-Resolution
We now compare our method to PULSE, which super-resolves a face image by optimizing over the latent space of a pretrained StyleGAN model. Fig. 7 presents results that illustrate similarities and differences between PULSE and our method. Both methods generate images that are realistic and faithful to the given LR image. From a super-resolution standpoint, our method would be considered reference-guided, but as opposed to image synthesis, which is guided by the LR image, the super-resolution reference is another HR face image. We find this provides a certain amount of control over the diversity of the results, which is not possible with PULSE. Our generated samples are therefore much closer to the HR reference image.
This paper proposes a novel framework for reference-guided image synthesis targeted towards the scenario where the reference (target) has very low resolution. Our method attempts to realistically fuse information from both the high resolution source (such as identity and HR facial features) and the low resolution target (such as overall color distribution and pose). Our experiments show that our method allows for the generation of a wide variety of realistic images given LR targets. We validate our method on two well-known datasets, CelebA-HQ and AFHQ, compare it to the leading i2i methods [8, 16, 22], and demonstrate advantages in terms of diversity and visual quality.
Limitations and Future Work
As with recent work on exemplar-based super-resolution, our method works best when the LR and HR images come from the same domain (human faces, for example). In the case where the target LR image comes from a different domain than the source (e.g. tiger vs lion), the generated image attempts to match the LR target at the expense of “forgetting” more information from the source image. In addition, we also note that results sometimes lack diversity for a given LR target (rows in fig. 5, for example); this is a consequence of having to perfectly match the LR image. A potential solution to mitigate both of the above problems would be to soften that constraint, for instance by increasing the distance in eq. 1, or by modifying the discriminator inputs (eq. 6) to tolerate larger differences. Finally, the framework has so far only been tested on faces (humans and animals). Extending it to handle more generic scenes, where the LR target would capture higher level information such as layout, makes for an exciting future research direction.
[1] (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML).
[2] (2002) Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24 (9), pp. 1167–1183.
[3] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432.
[4] (2020) Creating High Resolution Images with a Latent Adversarial Generator. arXiv e-prints, arXiv:2003.02365.
[5] (2018) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
[6] (2018) FSRNet: end-to-end learning face super-resolution with facial priors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] (2020) StarGAN v2: diverse image synthesis for multiple domains. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV).
[11] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 27.
[12] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
[13] (2016) Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS).
[15] (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE/CVF International Conference on Computer Vision (ICCV).
[16] (2018) Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV).
[17] (2017) Image-to-image translation with conditional adversarial networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).
[19] (2019) A style-based generator architecture for generative adversarial networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2020) Analyzing and improving the image quality of StyleGAN. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[22] (2018) Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision (ECCV).
[23] (2019) Positional normalization. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32.
[24] (2018) Conditional image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
[26] (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
[27] (2019) Mode seeking generative adversarial networks for diverse image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] (2020) PULSE: self-supervised photo upsampling via latent space exploration of generative models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] (2018) Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML).
[30] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR).
[31] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR).
[32] (2020) Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning (ICML).
[33] (2018) Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31.
[34] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
[35] (2020) Instance-aware image colorization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] (2020) Neural pose transfer by spatially adaptive instance normalization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision Workshops (ECCVW).
[38] (2017) Least squares generative adversarial networks. In IEEE/CVF International Conference on Computer Vision (ICCV).
[39] (2019) Diversity-sensitive conditional generative adversarial networks. In International Conference on Learning Representations (ICLR).
[40] (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE/CVF International Conference on Computer Vision (ICCV).
[41] (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] (2019) Image super-resolution by neural texture transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[43] (2016) Energy-based generative adversarial network. CoRR abs/1609.03126.
[44] (2018) CrossNet: an end-to-end reference-based super resolution network using cross-scale warping. In European Conference on Computer Vision (ECCV).
[45] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE/CVF International Conference on Computer Vision (ICCV).