The main challenges of image-to-image (I2I) translation are to make the translated image realistic and to retain as much information from the source domain as possible. To address these challenges, we propose a novel architecture, termed IEGAN, which removes the encoder of each network and introduces an encoder that is independent of the other networks. Compared with previous models, our model has three advantages. Firstly, it grasps image information more directly and comprehensively, since the encoder no longer receives loss from the generator and discriminator. Secondly, the independent encoder allows each network to focus more on its own goal, which makes the translated image more realistic. Thirdly, the reduction in the number of encoders yields a more unified image representation. However, when the independent encoder applies two down-sampling blocks, it is hard to extract semantic information. To tackle this problem, we propose a deep and shallow information space containing characteristic and semantic information, which guides the model to translate high-quality images in tasks with significant shape or texture change. We compare IEGAN with previous models, and also conduct studies on semantic information consistency and component ablation. These experiments show the superiority and effectiveness of our architecture. Our code is published at https://github.com/Elvinky/IEGAN.
Image-to-image (I2I) translation is an essential topic in the field of computer vision. The goal of I2I translation is to learn mutual mapping functions between two different domains with a bijective relationship Zhu et al. (2017a); Yi et al. (2017). In recent years, there have been two main approaches to I2I translation. The first approach is based on supervised learning Mirza and Osindero (2014); Isola et al. (2017); Li et al. (2017); Wang et al. (2018), which learns mapping functions from paired image sets. However, in many real-world applications, collecting paired datasets is extremely laborious, so another approach based on unsupervised learning has been proposed Liu et al. (2017); Huang et al. (2018); Zhu et al. (2017b). In this approach, because there are no mapping relations from paired samples, additional rules are needed to constrain the training of the mapping functions, including weight-coupling Liu et al. (2017); Liu and Tuzel (2016), cycle consistency Zhu et al. (2017a); Kim et al. (2017); Yi et al. (2017), and identity function Taigman et al. (2017).
In these models, each generator and discriminator encodes its input before translating and classifying, respectively. However, the encoder of the generator has a bottleneck in encoding capability in I2I translation Chen et al. (2020); Radford et al. (2016). Whereas the encoder of the discriminator is trained directly through the discrimination loss, the gradient received by the encoder of the generator is back-propagated from the discriminator. This training is indirect for the encoder of the generator, so the hidden vector it learns cannot respond strongly to the input image. NICE-GAN Chen et al. (2020) proposes a solution by removing the encoder of the generator, so that the generator and discriminator share the encoder of the discriminator.
Reviewing the goal of each component, the goal of an encoder is to learn a hidden vector that fully represents the features of the input image Bengio et al. (2013), so that a decoder can use this hidden vector to restore the input image as closely as possible. The ultimate goal of a discriminator is to distinguish between the translated image from the source domain and the real image from the target domain Goodfellow et al. (2014). Because discriminators and encoders have different goals, the hidden vector learned by the encoder of the discriminator serves the classification task well but is not suitable for the generation task: information that is not conducive to classification is not learned into the hidden vector.
For the above reasons, we propose a novel architecture which refines the goal of each network, as illustrated in Figure 1. Specifically, we remove the encoders of the generators and discriminators and introduce an independent encoder, meaning that the encoder is no longer affected by the other networks. In other words, the generator and discriminator do not need to encode before pursuing their goals, and the training of the encoder is independent of the training of the other networks. This independence has three advantages: I. The independent encoder can grasp image information more directly and comprehensively, because this architecture guides the encoder to focus on learning the representation of the input image and to ignore the goals of the other networks. II. The translated image is of higher quality and retains more information from the source domain. III. As the number of encoders decreases (Figure 1), the number of image representations required by the model is also reduced, which yields a more unified representation.
The performance of previous methods depends on the amount of change in shape and texture between domains. When the independent encoder applies two down-sampling blocks Kim et al. (2017); Lee et al. (2018), style transfer can be performed successfully, but it is difficult to complete tasks with significant shape changes (e.g. translating a cat into a dog), because the hidden vector learned by the independent encoder only contains characteristic information (e.g. color and texture) of the input image Wang et al. (2020b). To mitigate this problem, we propose a deep and shallow information space (DSI) composed of hidden vectors from different layers. In an unsupervised setting, the model obtains the DSI of the input image through the independent encoder, and then asks the decoder to use the DSI to restore the input image exactly. At the same time, the DSI merges and superimposes the hidden vectors of different layers and transmits them to the generator and discriminator. In this way, the generator and discriminator can use both characteristic and semantic information to complete tasks with significant shape or texture change.
We perform experiments on several popular benchmark datasets, where our method outperforms various state-of-the-art counterparts. We further evaluate the independent encoder through semantic information consistency, which reflects the ability of each model to retain source domain information. In addition, ablation studies are conducted to verify the effectiveness of each component.
GAN Goodfellow et al. (2014) has been applied to a large number of practical use cases, such as image generation Zhang et al. (2017), artwork generation Tan et al. (2017), music generation Engel et al. (2019), and video generation Tulyakov et al. (2018). It has also been used for image quality improvement Tong et al. (2017), image colorization Zhang et al. (2016), face generation Karras et al. (2019), video encoding Wang et al. (2020a), and other tasks. Several approaches have been used to improve the realism of generated images. The first is to improve training stability (e.g. DCGAN Radford et al. (2016) used strided convolution and transposed convolution). The second is large-scale training (e.g. BigGAN Brock et al. (2019) synthesized realistic images by increasing the batch size and applying truncation techniques). The third is architectural modification (e.g. SAGAN Zhang et al. (2019) added a self-attention mechanism to the network). The GAN models mentioned above are all based on probability models, but there are also GAN models based on energy models, such as EBGAN Zhao et al. (2017) and BEGAN Berthelot et al. (2017). In these models, the discriminator is an encoder-decoder that classifies samples by reconstructing the input image.
I2I translation based on unpaired datasets has been widely studied in computer vision since CycleGAN Zhu et al. (2017a) was proposed. U-GAT-IT Kim et al. (2020) added AdaILN and CAM Zhou et al. (2016) to the generator and the discriminator. DeepI2I Wang et al. (2020b) used a pre-trained discriminator as the encoder of the generator, and NICE-GAN Chen et al. (2020) removed the encoder of the generator and reused the encoder of the discriminator. These works optimize the generator and the discriminator. MUNIT Huang et al. (2018) and DRIT Lee et al. (2018) decoupled the latent space of images into content and style information. UNIT Liu et al. (2017) and ComboGAN Anoosheh et al. (2018) shared a latent space across domains. These works study image representation. UPD Yi et al. (2020) and GANILLA Hicsonmez et al. (2020) focused on image translation for specific tasks, namely generating artistic portrait line drawings and illustrations. StarGAN Choi et al. (2018) achieved multi-domain image translation by adding a mask vector to domain labels.
Research on unsupervised representation learning includes probabilistic models Salakhutdinov and Hinton (2009); Hinton et al. (2006) and deep networks. The goal of representation learning is to find a better way to express data Bengio et al. (2013). A good data representation greatly improves the efficiency of the model. Representation learning can be implemented in three ways: supervised learning, self-supervised learning and unsupervised learning. Unsupervised learning uses an encoder-decoder network to reduce the dimensionality of the input data and compress it, thereby discarding redundant information and keeping the most critical information. In deep networks, the encoder-decoder network constitutes a powerful method of representation learning.
Let $x \in X$ and $y \in Y$ be samples in the source domain and target domain, respectively. In I2I translation, the ultimate goal is to train the mapping function $G_{x \to y}: X \to Y$ and the inverse mapping function $G_{y \to x}: Y \to X$ Zhu et al. (2017a). To train them, some additional constraints are necessary. For example, in supervised learning, given a paired dataset $\{(x_i, y_i)\}$, the mapping functions are restricted by the conditions $G_{x \to y}(x_i) = y_i$ and $G_{y \to x}(y_i) = x_i$ Isola et al. (2017).
In unsupervised learning, however, there are only unpaired datasets $\{x_i\}$ and $\{y_j\}$. Without the restriction of a pairing relation, the mapping function can map to any distribution. To tackle this issue, previous methods put forward additional conditions, such as cycle-consistency, $G_{y \to x}(G_{x \to y}(x)) \approx x$ (resp. $G_{x \to y}(G_{y \to x}(y)) \approx y$) Zhu et al. (2017a); Kim et al. (2017); Yi et al. (2017), and identity-mapping-enforcing, $G_{x \to y}(y) \approx y$ (resp. $G_{y \to x}(x) \approx x$) Taigman et al. (2017); Zhu et al. (2017a).
Even if the above conditions guide the model to learn mapping functions between two domains, images translated by mapping functions trained under these conditions alone are blurred. In other words, the translated image is a mixture of multiple distributions Goodfellow et al. (2014). GAN therefore introduces two discriminators $D_x$ and $D_y$, where $D_x$ (resp. $D_y$) calculates the probability that a sample $x$ (resp. $y$) matches the distribution of $X$ (resp. $Y$).
In the field of I2I translation, common GAN models consist of four core networks Chen et al. (2020); Kim et al. (2020); Yi et al. (2020): two generators and two discriminators, with an encoder embedded in each core network. We have observed two phenomena: I. Because the input of each network is an image and there are only two image domains, we do not need so many encoders; two encoders, one per domain, are enough. II. The generator needs to encode the image first and then translate it, and the discriminator likewise encodes first and then classifies the image. However, the generator and discriminator can actually be used without encoding.
As mentioned above, if the encoder is embedded in another network, its encoding capability is limited by the network it belongs to. This is because the training of the encoder depends on the gradient back-propagated through that network, and the gradient depends on the loss function, which expresses the goal of the network Goodfellow et al. (2014). The purpose of the encoder is to learn the DSI of the input image. The goal of the discriminator is to map its input, which is either the output of the generator or an image from the target domain, into a vector that determines whether the domains are aligned Chen et al. (2020). The goal of the generator is to translate an image into another domain Zhu et al. (2017a). The goal of the encoder therefore differs from the goals of the generator and discriminator. If the encoder is in the discriminator, its target becomes classification, which causes the DSI to lose information irrelevant to classification. If the encoder is in the generator, its loss is generated by another network; this indirect training causes the DSI not to respond strongly to the input image.
We therefore propose to remove the encoder of each network and establish an encoder-decoder as an independent encoder to be trained; Figure 2 illustrates our architecture. The encoder no longer receives the loss generated by other networks but instead receives the feature loss (Eq. 1) between the output image of the decoder and the input image. In this way, the encoder ignores the goal of the generator or discriminator and focuses on learning the DSI of the input image, thereby ensuring its encoding capability. At the same time, the encoder transmits the DSI to the generator and discriminator. Because the encoder is trained with the feature loss, the DSI also contains more information about the input image. Such an encoder improves the generating ability of the model, which makes it easier to translate high-quality images.
The idea of an independent encoder was inspired by previous work Berthelot et al. (2017); Ronneberger et al. (2015); Hinton et al. (2006); He et al. (2016). In these works, the encoder-decoder network is a powerful method for unsupervised representation learning: the encoder progressively downsamples the input to obtain image information, and the decoder uses the learned image information to progressively upsample and restore the input image. When the number of down-sampling blocks is small, the encoder can only obtain characteristic information. When the number of down-sampling blocks increases, the encoder obtains semantic information, but it becomes difficult for the decoder to restore an image similar to the input image Wang et al. (2020b).
Adding skip connections between the encoder and decoder to construct a U-Net Ronneberger et al. (2015) solves the above problems. In this paper, we apply U-Net to build the independent encoder, which means that the independent encoder includes an encoder and a decoder. To improve the encoding capability of the encoder, we add a linear attention transformer Kitaev et al. (2020) to the encoder.
Most I2I models only transmit the characteristic information of the input image to the generator and discriminator. Other networks process the information further, for example by decoupling the latent space into content information and style information Huang et al. (2018), or by transmitting image information to the generator layer by layer through an adapter network as the resolution increases Wang et al. (2020b). Our approach differs from these models. We extract the information of the input image into the DSI, which contains hidden vectors from different layers. During the up-sampling process, the DSI merges and superimposes the hidden vectors of different layers, and the model transmits the result to the other networks.
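The merge-and-superimpose step described above can be illustrated with a small sketch. This is not the authors' implementation: the text does not specify the merge operator, so nearest-neighbour upsampling and element-wise addition are assumptions, feature maps are reduced to 1-D rows for brevity, and the names `upsample2x` and `merge_top_down` are hypothetical.

```python
def upsample2x(row):
    """Nearest-neighbour upsampling of a 1-D feature row by a factor of 2."""
    return [value for value in row for _ in range(2)]


def merge_top_down(features):
    """Merge feature rows ordered shallow -> deep into one row.

    Assumption: each deeper row is half the length of the previous one, and
    layers are combined by element-wise addition after upsampling.
    """
    merged = features[-1]  # start from the deepest (most semantic) layer
    for shallower in reversed(features[:-1]):
        upsampled = upsample2x(merged)
        if len(upsampled) != len(shallower):
            raise ValueError("each deeper layer must be half the size of the previous one")
        merged = [a + b for a, b in zip(upsampled, shallower)]
    return merged


# Toy hidden vectors from three encoder depths (hypothetical values).
shallow = [1.0, 2.0, 3.0, 4.0]
middle = [10.0, 20.0]
deep = [100.0]
dsi = merge_top_down([shallow, middle, deep])
```

In this toy example every output position carries a contribution from all three depths, which is the property the DSI relies on: shallow layers contribute characteristic detail and deep layers contribute semantic context.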
The two generators are U-GAT-IT-light Kim et al. (2020) generators without encoders. Each generator contains six residual blocks with AdaILN and two sub-pixel up-sampling layers. The two discriminators are composed of the attention mechanism CAM Zhou et al. (2016) and a down-sampling network with a multi-scale mechanism Durugkar et al. (2017).
The training process consists of four types of losses: feature loss, adversarial loss, identity reconstruction loss, and cycle-consistency loss. We explain them in detail as follows:
Feature loss. We use the mean absolute error, which provides a stable gradient for any input, to obtain a higher-quality and more accurate encoding. The encoders are updated only by this loss; they remain unchanged under the following three losses.
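As a minimal sketch of the feature loss, the mean absolute error between an input image and its reconstruction can be computed as below. Images are flattened to lists of pixel intensities, and the function name `feature_loss` and the toy values are illustrative assumptions rather than the released code.

```python
def feature_loss(image, reconstruction):
    """Mean absolute error between two equally sized flat pixel lists."""
    if len(image) != len(reconstruction):
        raise ValueError("image and reconstruction must have the same size")
    total = sum(abs(a - b) for a, b in zip(image, reconstruction))
    return total / len(image)


# Toy 2x2 image flattened to four intensities in [0, 1].
x = [0.0, 0.5, 1.0, 0.25]
x_hat = [0.1, 0.5, 0.8, 0.25]  # hypothetical decoder output
loss = feature_loss(x, x_hat)
```

The derivative of the absolute error with respect to each mismatched pixel has magnitude one regardless of how large the error is, which is one way to read the "stable gradient" remark above.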
Adversarial loss. The adversarial loss guides the discriminator to distinguish between images translated from the source domain and real images from the target domain, and pushes the distribution of the generator's output continuously toward that of the target domain.
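The exact form of the adversarial loss is not given in this text. The sketch below assumes a least-squares formulation purely for illustration; the function names and score values are hypothetical, and discriminator outputs are represented as plain lists of scalar scores.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares loss: push scores of real images to 1 and of fakes to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Least-squares loss: the generator wants its outputs scored as real (1)."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


real = [0.9, 1.0]  # discriminator scores on target-domain images
fake = [0.2, 0.0]  # discriminator scores on translated images
d_loss = discriminator_loss(real, fake)
g_loss = generator_loss(fake)
```

Here the discriminator is already separating the two sets well, so its loss is small while the generator's loss is large; training would push the translated images' scores upward.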
Cycle-consistency loss. To prevent mode collapse in I2I translation, we apply a cycle-consistency loss: the input image is translated into the target domain, and the translated image is then translated back to the source domain, where it should be consistent with the input image.
Identity loss. To address the steganography issue Chu et al. (2017), we apply an identity loss so that the two generators no longer behave as an encryption and decryption pair for the hidden vector, which brings the output distribution of each generator closer to the distribution of the target domain.
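The cycle-consistency and identity constraints can be sketched with toy generators. The brightness-shifting functions below are stand-ins for trained networks, the L1 distance is an assumed choice of metric, and all names are hypothetical.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_loss(x, forward, backward):
    """Distance between x and its round trip backward(forward(x))."""
    return l1(x, backward(forward(x)))


def identity_loss(y, forward):
    """Distance between a target-domain image y and forward(y)."""
    return l1(y, forward(y))


# Toy generators: "translation" is a brightness shift (illustration only).
def g_xy(img):
    return [p + 0.2 for p in img]


def g_yx_good(img):
    return [p - 0.2 for p in img]  # exact inverse of g_xy


def g_yx_bad(img):
    return [p - 0.1 for p in img]  # imperfect inverse


x = [0.1, 0.4, 0.7]  # source-domain image
y = [0.5, 0.8]       # target-domain image
cycle_good = cycle_loss(x, g_xy, g_yx_good)
cycle_bad = cycle_loss(x, g_xy, g_yx_bad)
ident = identity_loss(y, g_xy)
```

The exact inverse gives a round-trip error of essentially zero, while the imperfect inverse is penalized. The identity term is non-zero here because the toy generator alters an image that is already in the target domain, which is the behavior this loss discourages.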
Full objective. In summary, the final objective of our model is optimized by jointly training the independent encoders, generators, and discriminators, combining the feature, adversarial, cycle-consistency, and identity losses with their respective weights.
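One way to write the joint objective, offered as a hedged sketch rather than the paper's own equation, is to give each network family its own weighted loss. The weights $\lambda_{\cdot}$ are placeholder symbols whose values are not stated here, and the split follows the earlier remark that the encoders are updated only by the feature loss.

```latex
\begin{aligned}
L_{E} &= \lambda_{\mathrm{feat}}\, L_{\mathrm{feature}}, \\
L_{G} &= \lambda_{\mathrm{adv}}\, L_{\mathrm{adv}}^{G}
        + \lambda_{\mathrm{cyc}}\, L_{\mathrm{cycle}}
        + \lambda_{\mathrm{id}}\, L_{\mathrm{identity}}, \\
L_{D} &= \lambda_{\mathrm{adv}}\, L_{\mathrm{adv}}^{D}.
\end{aligned}
```

Under this reading, each training step minimizes $L_{E}$ over the encoder-decoder, $L_{G}$ over the generators, and $L_{D}$ over the discriminators, with the DSI treated as a fixed input when updating the generators and discriminators.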
All metrics are obtained after 100K training iterations on an NVIDIA Tesla P100 GPU. We train with the Adam optimizer, using weight decay, and apply ReLU and leaky-ReLU as the activation functions of the generator and discriminator, respectively. For data augmentation, we resize the images and then randomly crop them.
We compare our method with six state-of-the-art I2I translation methods: NICE-GAN Chen et al. (2020), U-GAT-IT Kim et al. (2020), UNIT Liu et al. (2017), MUNIT Huang et al. (2018), DRIT Lee et al. (2018) and CycleGAN Zhu et al. (2017a), all of which perform translation between different domains. All models are run from their public code on GitHub.
We consider four unpaired datasets: cat2dog, man2woman, apple2orange, and summer2winter. We crop and resize all input images to 256×256 during training, and the output images are also 256×256. The split sizes are listed in the format (train/test), with one count per domain: cat2dog (771-1264/100-100); man2woman (1200-1200/115-115); apple2orange (995-1019/266-248); summer2winter (1231-962/309-238). The cat2dog dataset is studied in DRIT, and the apple2orange and summer2winter datasets are studied in CycleGAN. We created the man2woman dataset by classifying FFHQ Karras et al. (2019) images by gender and then randomly selecting images.
We use Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) as evaluation metrics. FID calculates the distance between real and translated images in the feature space of a convolutional neural network. KID computes the squared Maximum Mean Discrepancy between the features of real and translated images to measure their visual similarity. KID has an unbiased estimator and is more consistent with human evaluation. Lower FID and KID scores indicate better GAN performance.
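To make the KID definition concrete, the sketch below computes the unbiased squared Maximum Mean Discrepancy between two small sets of feature vectors using the cubic polynomial kernel commonly used for KID. In practice the features come from an Inception network and the estimate is averaged over subsets; the tiny hand-written vectors and the function names here are illustrative assumptions.

```python
def poly_kernel(a, b):
    """Cubic polynomial kernel k(a, b) = (a.b / d + 1)^3 for d-dim features."""
    d = len(a)
    return (sum(p * q for p, q in zip(a, b)) / d + 1.0) ** 3


def mmd2_unbiased(xs, ys):
    """Unbiased estimate of squared MMD between feature sets xs and ys."""
    m, n = len(xs), len(ys)
    # Within-set terms exclude the diagonal (i == j) to remove bias.
    k_xx = sum(poly_kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(poly_kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(poly_kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


# Toy 2-D "features" for real and translated images (hypothetical values).
real = [[0.0, 0.0], [0.0, 0.0]]
close = [[0.0, 0.0], [0.0, 0.0]]
far = [[2.0, 2.0], [2.0, 2.0]]
kid_close = mmd2_unbiased(real, close)
kid_far = mmd2_unbiased(real, far)
```

Identical feature sets give a score of zero and well-separated sets give a large score, matching the reading that lower KID means translated images are closer to real ones.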
Visual analysis. Figure 3 shows examples of image translation by different models on different datasets. In general, the images translated by IEGAN are more difficult to distinguish from images in the target domain, which shows that our model has excellent translation capability at a qualitative level. Both shape and texture are important bases for human perception of images. In terms of shape, the animals, people, fruits, and landscapes translated by IEGAN are closer to realistic images. For example, the facial features of the translated cat in the first row of Figure 3 are more realistic than those produced by other models. In terms of texture, images translated by models with different architectures all show artifacts to some extent, whereas our model reduces the appearance of artifacts.
Metric analysis. Table 1 shows the FID and KID scores of the above models on the four datasets. In brief, except on the summer2winter dataset, IEGAN achieves the lowest scores on all datasets, with a significant reduction of FID and especially KID. This shows that our model has excellent translation capability at a quantitative level. For example, on the popular cat2dog dataset, our best KID score is lower than that of NICE-GAN, and the FID score also drops. Whereas some models only obtain good scores for the mapping function in one direction, we also reduce both KID and FID in the opposite direction. We can also see from Table 1 that CycleGAN copes well with texture change, while UNIT and MUNIT can achieve shape change. The images translated by DRIT are realistic but sometimes have nothing to do with the input images. The images translated by NICE-GAN and U-GAT-IT have fewer artifacts.
Table 2: Ablation study on the cat2dog dataset (columns: DSI, LAT, and FID and KID × 100 for each translation direction).
In I2I translation, not every part of the image needs to be translated. We refer to the parts that need to be translated as the subject and the parts that do not as the background. Even though we usually translate only the subject, the background is still part of the image information, and losing background information between the source domain and the target domain is also a kind of semantic information inconsistency. Semantic information consistency is reflected in the shape, texture, and color of the background. For instance, in the first row of Figure 4, a cat is on a patterned table; after translation, it should be a dog on a patterned table. Similarly, in the fourth row of Figure 4, there is a green label on the apple; after translation, there should be a green label on the orange. Compared with other models, our model clearly preserves semantic information consistency.
In I2I translation, FID and KID are high-quality metrics for evaluating the quality and diversity of translated images, but they cannot fully capture the completeness of the translation. In other words, it is important to translate realistic images, but if semantic information consistency is lost during translation, the relationship between output and input is weakened Kim et al. (2020). Because the goal of the independent encoder in IEGAN is to learn the DSI of the input image, it can pass as much background information of the input image as possible to the generator while still translating high-quality images, thereby achieving semantic information consistency.
In Table 2, we compare the individual impact of each proposed component of the independent encoder on the cat2dog dataset in terms of FID and KID scores. We analyze three key components: IE (independent encoder), DSI, and LAT (linear attention transformer). Table 2 shows that applying IE reduces FID and KID, which means that an independent encoder can obtain a better image representation while also allowing each network to focus on its own goal. This strategy improves the translation capability of the model. Compared with IE alone, adding DSI does not dramatically improve the quality of the translated image but still lowers every metric, which supports the effectiveness of DSI and indicates that DSI makes better use of hierarchical information to provide more information to the other networks. Finally, we notice that using LAT alone does not produce positive effects, but combined with DSI it significantly improves the translation quality of the model. Overall, with all components merged, IEGAN is significantly better than all other variants.
In this paper, we propose a novel unsupervised I2I translation architecture, called IEGAN. This architecture allows the encoder to be independent of the generator and discriminator, so that it can grasp the image information more directly and comprehensively. In addition, we introduce a deep and shallow information space based on deep hierarchical I2I translation. The proposed representation allows the other networks to obtain the characteristic and semantic information of the input image, which makes the translated image more realistic. Through experiments, we have shown that our model is superior to and more effective than previous models.
Our model is mainly used to translate images. The translated images can be applied in businesses such as movies, advertisements, games and even virtual reality. This technology can also be used for the automatic translation of faces or objects. Traditionally, such work has been labor-intensive, so the emergence of this technology lowers the threshold for applications such as deepfakes. Non-professionals can also use this technology to fake information, which may harm the rights of individuals. In serious cases, it may endanger the safety of enterprises and the country.
Berthelot et al. (2017). BEGAN: boundary equilibrium generative adversarial networks. CoRR abs/1703.10717.
Hinton et al. (2006). Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
Li et al. (2017). ALICE: towards understanding adversarial learning for joint distribution matching. In NeurIPS, pp. 5495–5503.
Salakhutdinov and Hinton (2009). Deep Boltzmann machines. In AISTATS, Vol. 5, pp. 448–455.
Tong et al. (2017). Image super-resolution using dense skip connections. In ICCV, pp. 4809–4817.
Zhang et al. (2016). Colorful image colorization. In ECCV, Vol. 9907, pp. 649–666.
Zhou et al. (2016). Learning deep features for discriminative localization. In CVPR, pp. 2921–2929.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? The scope of this paper is unsupervised I2I translation, and the main contribution is a novel translation architecture, which are shown in the abstract and introduction.
Did you describe the limitations of your work? We pointed out in Section 4.2 of the paper that our model does not perform well in the clarity of the translated images on the summer2winter dataset.
Did you discuss any potential negative societal impacts of your work? We discussed the impact of our work on the future; see the Broader Impact section for details.
Have you read the ethics review guidelines and ensured that your paper conforms to them? We carefully read the ethics review guidelines and ensured that our paper meets these guidelines.
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results? The assumption of the work of image translation is that there needs to be a bijective relationship between domains, which can be seen in the introduction. For example, the translation from cat to chair and human face to orange is meaningless.
Did you include complete proofs of all theoretical results? We compared other models and conducted other experiments to prove our theoretical results.
If you ran experiments…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The code, data, and instructions to reproduce our experiments are published on GitHub via the URL in the abstract.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? The experimental error is mainly caused by the order of the shuffled training set, but after 100K training iterations the results do not deviate much, so we did not report error bars.
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? In Section 4.1, we give the experimental environment and resource consumption.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators? We have cited the creators for all data and codes used.
Did you mention the license of the assets? We checked the licenses and cited the datasets. Due to space limitations, the license information is not stated in the paper.
Did you include any new assets either in the supplemental material or as a URL? We publish the code on GitHub via the URL in the abstract.
Did you discuss whether and how consent was obtained from people whose data you’re using/curating? All the data we use are open datasets.
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? Face information appeared in the FFHQ dataset, the privacy of which is discussed in the cited article.
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?