Unpaired cross-domain image-to-image translation is achieving exceptionally convincing results in a variety of domains. High-fidelity image translation requires not only realism, but also strict preservation of the factors that are common to both domains. Consider Figure 1. We wish to translate an image of a face across two domains that mostly differ in texture. It is inappropriate for the translator to additionally change the face’s expression. Unfortunately, this failure mode is surprisingly common in standard unsupervised image-to-image models.
Besides the challenge of fidelity, image-to-image translation is made difficult by the fact that one domain may contain factors of variation that the other does not. For instance, consider the task of translating real photographs of a face (with arbitrary lighting and backgrounds) to uniformly-lit faces with black backgrounds. Following the approach of CycleGAN, we may create a translator for each direction, and train these jointly with a cycle-consistency loss (ensuring that forward and backward translation yields the original input), and an adversarial loss (ensuring that the mappings reach the target domains). But how can we expect this to work? The first translator needs to remove the lighting and background (to map to the second domain), and the second translator needs to add back the same lighting and background (to map to the first domain, and reconstruct the input).
Perhaps unsurprisingly, gradient descent tends to find a way around this issue, resulting in models that hide information (rather than remove it), allowing the information’s recovery. Unfortunately, this leads to models that produce inaccurate translations (as shown in our experiments), since the “hiding” affects the accuracy of the translation. Several works have proposed to add a “residual” (or “style”) path to the cycle, which gives the model a sanctioned means to encode and reconstruct the auxiliary information [28, 9, 1]. We view this modified cycle as performing a disentanglement and subsequent re-entanglement: the first translator disentangles the input into (1) an image in the second domain, and (2) a residual; the second translator entangles these to reconstruct the input. In practice, however, we find that with or without the sanctioned residual path, standard optimization tends to “hide” information during translation, rather than fully disentangle as desired.
An unconstrained “residual” path can actually be detrimental to the final results. Note that while data-driven priors are assumed available for the translation endpoints, the residual—information in one domain but not the other—is generally unknown. While the implementor may hope that the model will use the “residual” path for disentanglement, it may instead exploit this path to encode the entire input, greatly facilitating the reconstruction task of the “entanglement” step (i.e., the cycle-consistency objective). Prior works have proposed a variety of methods to mitigate this problem, but usually at the cost of severely reducing the representational capacity of the residual (e.g., limiting it to 8 dimensions), and making strong assumptions about its distribution (e.g., assuming it is standard normal) [1, 9, 15, 17]. After applying these heavy constraints, some prior works report that the residual path is ignored by the model, unless its usage is facilitated by careful design choices (e.g., rather than simply concatenating the residual as an input, enforcing its usage as layer-wise normalization coefficients, applied throughout the second translator) [1, 17].
Our main insight is that the disentanglement-entanglement cycle is ineffective when the disentangler and entangler are allowed to cooperate. By “cooperate” we mean that they train on each other’s outputs. CycleGAN and its many variants [27, 9, 15, 6, 17, 1] all have a cooperative training setup: in each cycle, the first translator receives a real input, and the second translator receives a fake input (i.e., an attempted translation/disentanglement) which it back-translates, and both networks get penalized according to the reconstruction error. This essentially asks the second network to compensate for the first network’s errors. This is counter-productive, because if the second network succeeds, then the first network need not improve. Given sufficient optimization time, these cooperative setups find extremely effective “cheats”, in which subtle signals are encoded into low-fidelity forward translations and subsequently decoded to achieve near-perfect back-translation, thus defeating the reconstruction error.
Our main contribution is in preventing the networks from compensating for each other’s errors, via a simple optimization technique: simply train each network only when its input data is real. With this technique, neither network learns about the other’s behavior, which renders cooperation impossible. Instead, the back-translator simply preserves any errors made during forward-translation, and the reconstruction penalty is put entirely on the forward translator. This forces the networks to learn more faithful mappings to their target domains. The technique also constrains the residual path to encoding only “auxiliary” information (regardless of architecture), since the model is simply incapable of exploiting it for other purposes. In experiments with real images, we show that our optimization method delivers an obvious qualitative improvement over the current state-of-the-art, both in terms of semantics-preservation and residual-factor disentanglement. In synthetic data (where the residual is known), we demonstrate that our “uncooperative” optimization leads to quantitatively accurate disentanglement, whereas “cooperative” optimization does not.
2 Related Work
Image-to-image translation has recently attracted great attention, partly thanks to the success of generative adversarial networks (GANs) [7, 20, 26, 13]. The goal in image-to-image translation is to translate an image in one domain to a corresponding image in the second domain. Pix2Pix trains models for this task using paired data from the two domains (i.e., input-output pairs exemplifying good translations). CycleGAN removes the need for paired data by forming a translation “cycle”—forward translation followed by backward translation—which creates a natural reconstruction objective between the input and the back-translation. This is an important step, because in many domains, paired examples do not exist (e.g., a face in the exact same pose/expression in two different physical environments). CycleGAN often preserves the structural content of the images, but this may simply be a consequence of the convolutional architecture. CycleGAN is only capable of learning one-to-one mappings, but several works (not all unsupervised) have proposed variants that are capable of one-to-many mappings, such as Augmented CycleGAN, DRIT, MUNIT, BicycleGAN, and cross-domain disentanglers. These methods are able to generate diverse images with similar “content” (i.e., structural pattern) but different “style” (i.e., textural rendering) through disentanglement. These methods use strong assumptions or regularizations to avoid undesirable local optima, including shared latent spaces [9, 15, 6], loss on KL divergence from simple Gaussians [9, 15, 28], or low-dimensional representations [15, 1, 28, 6]. The effectiveness of these methods is therefore highly dependent on parameter selection.
Image factor disentanglement is necessary if we wish to control the latent factors in the generated images. Hadad et al. assume the availability of attribute labels in a particular domain, where the goal is to disentangle images into a target domain plus a residual (i.e., “everything else”). Many disentanglement works also make strong assumptions on domain knowledge of the latent space, which includes having data pre-grouped according to individual factors [22, 14], or having exact knowledge of the structure and function of individual factors (e.g., for faces: identity, pose, shape, texture [24, 23]). In this work, we do not have attribute labels, we do not make assumptions on the latent space, and we perform disentanglement using only the unpaired image data. Similar to our method, InfoGAN and MINE are completely unsupervised, but the approach in these works is quite different: these methods maximize the mutual information between the inferred latent variables and the observations, while we use discriminators and reconstruction to achieve disentanglement.
There are three key ingredients to our method: (1) adversarial priors, which encourage the translated images to be indistinguishable from ones in their target domain, (2) cycle-consistency, which encourages the translations to be invertible, and (3) “uncooperative” optimization, which ensures the networks do not “cheat” toward an undesirable local minimum.
Let V and C be two image domains, such that the images in V have more information than the images in C. That is, V contains variation in some latent factor that is either constant or absent in C. This implies that the mapping V → C is many-to-one, and the mapping C → V is one-to-many. As a mnemonic, note that V is variable in some aspect where C is constant.
Let R be the residual information that is in V but not in C. Accessing this extra information allows us to form bijective (one-to-one) mappings, V → (C, R) and (C, R) → V. Note that R is not necessarily an image domain. In our implementation, each r ∈ R is a collection of deep featuremaps at multiple scales, which allows its actual form to be determined entirely by the data.
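The multi-scale residual can be pictured concretely. This is a minimal numpy sketch with placeholder channel counts and resolutions (none are specified in the text); the point is only that the residual is a collection of featuremaps at several scales, not a low-dimensional vector with an assumed distribution.

```python
import numpy as np

# Placeholder shapes: the text only states that the residual is a collection
# of deep featuremaps at multiple scales, with its form determined by the data.
def make_residual(base_h=64, base_w=64, channels=(32, 64, 128)):
    scales = []
    for i, ch in enumerate(channels):
        h, w = base_h // (2 ** i), base_w // (2 ** i)
        scales.append(np.zeros((ch, h, w)))  # one featuremap per scale
    return scales

r = make_residual()
shapes = [f.shape for f in r]  # coarser spatial size, more channels per level
```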
Our goal is to learn functions that can map between these domains. We call the first mapping a disentanglement, denoted D, since it performs an intricate splitting operation: D(v) = (c', r'). We call the second mapping an entanglement, denoted E, since it performs a merging operation: E(c, r) = v'. Figure 2 shows a diagram of the domains and the mappings between them. Figure 3 relates the notation to the data and architecture. Note that D and E are inverses of one another.
Our input is a set of samples from V, and a set of samples from C. The two datasets are unpaired, and true correspondences might not exist.
3.2 Adversarial priors
Our model has two main networks, D and E. We would like to have D(V) ≈ (C, R), and E(C, R) ≈ V. To achieve this, we introduce adversarial networks A_C and A_V, which learn and impose priors on the distributions of our networks’ outputs.
The adversarial networks attempt to discriminate between real and fake (i.e., generated) samples of the domains C and V. In our notation, we distinguish “fake” samples with a prime symbol. We train our main networks against the adversarial labels with the least-squares loss:

L_GAN = (A_C(c') - 1)^2 + (A_V(v') - 1)^2.
In a separate (but concurrent) optimization, we also train the parameters of the adversaries, with the losses L_{A_C} = (A_C(c) - 1)^2 + A_C(c')^2, and L_{A_V} = (A_V(v) - 1)^2 + A_V(v')^2.
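As a concrete illustration, the least-squares objectives can be computed as follows. This is a minimal numpy sketch assuming the standard LSGAN convention (real label 1, fake label 0); the toy scores are made up.

```python
import numpy as np

def lsgan_generator_loss(fake_scores):
    # Generators are rewarded when the adversary rates their outputs as real (1).
    return np.mean((fake_scores - 1.0) ** 2)

def lsgan_adversary_loss(real_scores, fake_scores):
    # Adversaries push real samples toward 1 and generated samples toward 0.
    return np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2)

# Made-up scores from a well-trained adversary: reals near 1, fakes near 0.
real_scores = np.array([0.9, 0.95, 1.0])
fake_scores = np.array([0.1, 0.0, 0.05])

g_loss = lsgan_generator_loss(fake_scores)               # high: fakes are detected
a_loss = lsgan_adversary_loss(real_scores, fake_scores)  # low: adversary is winning
```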
Note that we have no priors on the r' samples generated by D, because there is no dataset of “true” r samples. Prior works manufactured a prior by assuming that r follows a simple distribution, e.g., a standard normal [9, 15, 6, 17]. Here, we avoid this limiting assumption. We are able to do this because of our unique optimization procedure, detailed in Sec 3.4. However, we do obtain some constraints on r by enforcing cycle consistency, described next.
3.3 Cycle consistency
On each training step, the model runs two “cycles”. Each cycle generates a reconstruction loss, which constrains the model to perform consistent forward-backward translation. Figure 3 illustrates the cycles.
In the first cycle, the disentangler D receives a random v from the dataset, and generates two outputs: (c', r') = D(v). These outputs are passed to the entangler E, which generates v'' = E(c', r'). If the disentanglement and re-entanglement are successful, this output should correspond to the original v. Therefore, we form the reconstruction objective L_v = ||v - v''||_1, where ||·||_1 denotes the L1 norm. In summary, this cycle performs v → (c', r') → v''.
The second cycle is symmetric to the first. The entangler E receives a random c from the dataset, and an r generated from a random v. Note that it is necessary to use generated samples here, since R is completely determined by the network. We omit the prime on this r since it is treated as an input rather than an output. From the input (c, r), the entangler generates v' = E(c, r). We then pass v' to the disentangler, which generates two new outputs (c'', r'') = D(v'). If the entanglement and disentanglement are successful, these outputs should correspond to the original inputs. We therefore form the reconstruction objectives L_c = ||c - c''||_1 and L_r = ||r - r''||_1. In summary, this cycle performs (c, r) → v' → (c'', r'').
Collecting the reconstruction objectives, we have

L_rec = λ_v ||v - v''||_1 + λ_c ||c - c''||_1 + λ_r ||r - r''||_1,

where the λ are weighting coefficients.
Observe that there is no “fidelity” objective on the translated tensor of each cycle (i.e., (c', r') in Cycle 1, and v' in Cycle 2); these tensors only have an adversarial loss. In other words, there is nothing in the design to force c' to correspond to v, or v' to correspond to (c, r), other than the back-translation error. As we will show in experiments, this back-translation requirement is not sufficient, because the networks are able to cooperate on the back-translation: when E is the back-translator, it can compensate for errors made by D, and vice versa.
In practice, many of these “errors” are never corrected. Instead, they are adapted and refined, to minimize the adversarial loss while facilitating reconstruction. We call these “cheats”: undesirable outputs that yield near-zero loss. At convergence, cheats often take the form of a within-domain transformation: this causes the adversary to not impose a loss (since the output is still in the correct domain), yet allows the second network to (jointly) learn how to undo the transformation. These cheats are especially visible in experiments with faces, likely because humans are so sensitive to faces. Figures 1 and 5 show clear examples of this cheating behavior: while the two domains only differ in texture/lighting, the networks learn to additionally (and unpredictably) alter the facial expression.
We observe that this undesirable solution to the reconstruction error requires both D and E to be complicit in the scheme. For example, if D transforms its input while translating it, but E is unaware of the cheat, E will not undo the transformation while back-translating, yielding a loss. This leads us to our optimization procedure, which essentially prevents D and E from cooperating in this way.
3.4 Uncooperative optimization
The total loss we wish to minimize is

L = L_GAN + L_rec.
As long as the forward translations land in the target domains, L_GAN is minimized; as long as the backward translations reconstruct the input, L_rec is minimized.
As explained, there is a local minimum to this loss, in which forward translation includes an undesirable transformation, and back-translation includes an inverse transformation. This ruins the fidelity of the translation.
To reach this bad minimum, each network needs to learn two functions: (1) translating “real” inputs into outputs in a target domain, and (2) decoding “fake” inputs by undoing generated translations. Referring to Figure 3, the disentangler D learns the first task in Cycle 1, and learns the second task in Cycle 2; the entangler E learns these functions in the opposite order. In other words, the networks learn to perform different tasks depending on their input: given real inputs, they translate; given fake inputs, they decode.
To prevent this from happening, we prevent the networks from learning how to decode fake inputs. We do this by freezing the networks when they receive fake inputs. When a network is “frozen”, it is treated as a fixed but differentiable function, so that gradients flow through it, but it does not learn. Referring again to Figure 3, this means training D only in the first cycle, and training E only in the second cycle (where they respectively receive real inputs).
With this optimization technique, the networks are incapable of learning how to compensate for each other’s errors. This means that an erroneous forward translation will always be taken “at face value” by the backward translator, and produce an appropriate loss. This is because the backward translator’s only experience (in terms of gradient steps) comes from real data.
This method is a type of alternating optimization, in the sense that we keep one set of parameters fixed while optimizing the other set, and alternate. In practice, we alternate on every step. Specifically, we do a forward pass through Cycle 1, freeze E while we update D, and then do a forward pass through Cycle 2, and freeze D while we update E.
Including the independent update required for the adversarial networks, this setup requires three optimizers in total.
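The uncooperative schedule can be illustrated end-to-end on a toy problem. The 1-D linear translators, learning rate, and step count below are our own illustrative choices, not the paper's networks; the sketch only demonstrates the mechanics of freezing (gradients flow through the frozen network, but only the network whose input is real gets updated).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D linear translators (illustrative stand-ins, not the paper's CNNs).
# Disentangler D: v -> (c', r') with parameters (a, b).
# Entangler    E: (c, r) -> v'  with parameters (p, q).
a, b = 0.5, 0.5
p, q = 0.5, 0.5
lr = 0.02

def cycle1_step(v):
    """Cycle 1: real v in. Only D is updated; E is frozen but differentiable,
    so gradients flow through it without changing (p, q)."""
    global a, b
    c_, r_ = a * v, b * v          # forward translation by D
    v_rec = p * c_ + q * r_        # back-translation by frozen E
    err = v_rec - v
    a -= lr * 2 * err * p * v      # d(err^2)/da, computed through the frozen E
    b -= lr * 2 * err * q * v
    return err ** 2

def cycle2_step(c, r):
    """Cycle 2: real c (plus generated r) in. Only E is updated; D is frozen."""
    global p, q
    v_ = p * c + q * r             # forward translation by E
    c_rec, r_rec = a * v_, b * v_  # back-translation by frozen D
    ec, er = c_rec - c, r_rec - r
    g = 2 * (ec * a + er * b)      # shared factor of d(ec^2 + er^2)/dv_
    p -= lr * g * c
    q -= lr * g * r
    return ec ** 2 + er ** 2

def probe_loss(v=1.0):
    return (p * (a * v) + q * (b * v) - v) ** 2

loss_before = probe_loss()
for _ in range(500):
    cycle1_step(rng.standard_normal())
    r_gen = b * rng.standard_normal()  # r must be generated, never sampled
    cycle2_step(rng.standard_normal(), r_gen)
loss_after = probe_loss()
```

Alternating the two frozen/updated roles on every step drives the reconstruction loss down without either network ever learning to undo the other's errors.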
3.5 Implementation details
Each translator uses an architecture similar to the CycleGAN generator: two stride-2 convolutions, four residual blocks, and two transposed convolutions.
We implement the disentangler as two separate networks: one for the c stream and one for the r stream; the r stream ends before the transposed convolutions. We found this worked significantly better than using a single network to produce both c and r.
The entangler uses the same architecture, except it receives skip connections from the r stream. There are three such connections: the first uses the featuremap produced after the stride-2 convolutions; the second uses the featuremap after two residual blocks; and the third uses the featuremap after the next (and final) two residual blocks. These featuremaps are simply concatenated with the corresponding featuremaps in E. The intent with multiple skip connections is to allow the network the capacity to transfer residuals at multiple levels of scale and abstraction. Our model has fewer residual blocks than CycleGAN, but the added r stream makes the total parameter count similar.
We set the reconstruction coefficients on v and c to be ten times the GAN loss, so λ_v = λ_c = 10. We use a smaller coefficient on the r reconstruction, since it is a much larger tensor. We update the discriminator using generated images drawn randomly from a history buffer of size 50. We use the Adam solver, with a batch size of 4 and a learning rate of 0.0002. After the reconstruction errors stop descending, we linearly decay the learning rate to zero. In total, training can take up to 300,000 steps, which is approximately 3 days on a single Nvidia GTX 1080 TI. This is slower convergence than a traditional CycleGAN (which takes 100,000 iterations on our data), likely because the objective is harder to optimize when “cheating” is disallowed.
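A minimal sketch of such a history buffer, assuming the 50/50 swap policy popularized by CycleGAN implementations (the exact policy used here is not specified):

```python
import random

class ImageHistoryBuffer:
    """Stores up to `capacity` previously generated images; the discriminator
    is updated on a mix of fresh and historical fakes, which stabilizes training."""
    def __init__(self, capacity=50, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.rng = random.Random(seed)

    def query(self, image):
        # Fill the buffer first, passing images through unchanged.
        if len(self.buffer) < self.capacity:
            self.buffer.append(image)
            return image
        # With probability 0.5, return a stored image and stash the new one.
        if self.rng.random() < 0.5:
            idx = self.rng.randrange(self.capacity)
            old, self.buffer[idx] = self.buffer[idx], image
            return old
        return image

pool = ImageHistoryBuffer(capacity=50)
# Stand-in "images" (integers) generated over 200 training steps.
fakes_for_discriminator = [pool.query(i) for i in range(200)]
```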
Simplified settings for synthetic data
For the experiments with synthetic data, we use a model with fewer parameters. We implement each generator as a fully-connected network with one hidden layer of 32 units and ReLU activation. We implement each adversarial discriminator as a fully-connected network with one hidden layer of 32 units, and leaky ReLU activation. Our experiments suggest that the discriminators have more than sufficient capacity to correctly learn the distributions of V and C and keep equilibrium with the generators. We use the same training setup as in the real-image experiments, except we set the batch size to 128; training to convergence takes approximately 60,000 iterations, which is 1 hour on a single GPU.
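A sketch of these simplified networks as plain numpy forward passes; the weight-initialization scale and the leaky slope of 0.2 are assumptions, not values from the paper:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, leaky=0.0):
    """One-hidden-layer MLP; leaky=0 gives plain ReLU (generators),
    leaky>0 gives leaky ReLU (discriminators)."""
    h = x @ W1 + b1
    h = np.where(h > 0, h, leaky * h)
    return h @ W2 + b2

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    # Small random init; the 0.1 scale is an arbitrary choice for this sketch.
    return (0.1 * rng.standard_normal((in_dim, hidden)), np.zeros(hidden),
            0.1 * rng.standard_normal((hidden, out_dim)), np.zeros(out_dim))

# Disentangler-style generator: 2-D v -> concatenated (c', r'), ReLU hidden.
gen_params = init_mlp(2, 32, 2)
# Discriminator: scores a 1-D sample, leaky-ReLU hidden.
disc_params = init_mlp(1, 32, 1)

v = rng.standard_normal((4, 2))                 # a batch of entangled samples
c_and_r = mlp_forward(v, *gen_params, leaky=0.0)
score = mlp_forward(c_and_r[:, :1], *disc_params, leaky=0.2)
```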
In this section, we demonstrate that our method outperforms prior work on (1) accuracy of disentanglement, (2) fidelity of translation, and (3) coverage of modes (in multi-modal translations). Ground-truth disentanglements do not exist in real image data, so we use a simple synthetic scenario to quantitatively evaluate accuracy, then present real-world qualitative results for fidelity and coverage.
4.1 Disentanglement accuracy
One of our claims is that uncooperative optimization is critical for accurate disentanglement. This is based on the idea that uncooperative models are less able to find “cheats” that bypass the need for accuracy.
In other words, we need to show that “uncooperative” optimization leads to correctly disentangling R from within V, in a setting where “cooperative” optimization fails. It is surprisingly easy to find such a scenario. We present one here, in which the ground-truth factors are 1D, and entanglement/disentanglement is simply concatenation/splitting. We find that cooperative optimization is incapable of learning this simple operation (under the given data availability assumptions), whereas uncooperative optimization succeeds.
In this experiment, we use two identical models (see the “Simplified settings for synthetic data” in Sec. 3.5), and change only the optimization method: one uses the proposed “uncooperative” optimization, and the other uses the baseline “cooperative” optimization.
Since ground-truth latent factors are generally unknown in real data, it is necessary to design synthetic data for this experiment. We define the latent factors C and R to be Gaussian distributions. We generate synthetic entanglements by concatenating a sample c with a sample r. Specifically, we draw the elements of c and r from two 1D Gaussians with different parameters. We find that results are not sensitive to dimensionality (except in convergence time), and so present only the simplest version here, setting the dimensionality of both c and r to 1, making the dimensionality of v equal to 2. Note that the domain R is never encountered at training time, except in its entangled form inside V. The task is to recover R, using only disentanglement/entanglement cycles, and unpaired samples of V and C.
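The synthetic data can be generated in a few lines. The means and standard deviations below are placeholders (the exact Gaussian parameters are not reproduced here); only the structure matters: concatenate c and r to form v, then train on unpaired samples of V and C.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000

# Placeholder parameters for the two 1-D Gaussians (illustrative values).
c = rng.normal(loc=0.0, scale=1.0, size=(N, 1))  # samples of the C domain
r = rng.normal(loc=3.0, scale=0.5, size=(N, 1))  # the residual R (never seen alone)

# The entangled domain V: ground-truth entanglement is simple concatenation.
v = np.concatenate([c, r], axis=1)               # shape (N, 2)

# Training data: unpaired samples of V and C (R appears only inside V).
v_train = v
c_train = rng.normal(loc=0.0, scale=1.0, size=(N, 1))
```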
We measure the relationship between the actual domain R (used to generate samples) and the learned domain R' (disentangled from samples of V) using the Pearson correlation coefficient ρ, which equals 1 in magnitude if the two variables have a totally linear relationship, and is closer to 0 otherwise. This (unlike a distance metric) allows solutions where the learned r' is a scaled version of the true r, which is appropriate since the scaling may be absorbed in the model weights.
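The following sketch shows why the Pearson correlation is a suitable metric here: a residual recovered only up to scale (and shift) still scores 1, while a residual that was hidden rather than disentangled (modeled here as independent noise) scores near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
r_true = rng.normal(size=1000)

# A model that recovers the residual only up to scale and shift still gets
# a perfect score, since the scaling can be absorbed into the model weights.
r_learned_scaled = 2.5 * r_true + 0.3
rho_scaled = np.corrcoef(r_true, r_learned_scaled)[0, 1]

# A model that "hides" the residual (its output carries no information about
# the true residual) scores near zero.
r_learned_noise = rng.normal(size=1000)
rho_noise = np.corrcoef(r_true, r_learned_noise)[0, 1]
```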
Our results are summarized in Figure 4. The two models converge in approximately the same number of iterations. At the end of training, the cooperative version achieves a correlation coefficient of 0.695, while the uncooperative version achieves 0.998. Results vary slightly across iterations (and across initializations), but correlation does not noticeably improve for the cooperative version, even if training is extended to 200k iterations.
Overall, this shows that uncooperative optimization leads the model to disentangle the true latent factors, while cooperative optimization does not.
4.2 High-fidelity translation
One of our claims is that the uncooperative training leads to high-fidelity translations. By this, we mean that the translation retains as much information as possible from the input, without altering it. To evaluate this, we compare our model’s forward translations against those of CycleGAN and MUNIT.
CycleGAN is a popular baseline in unsupervised (but unimodal) image-to-image translation; our architecture is based on it. MUNIT is a state-of-the-art unsupervised multimodal image-to-image translation method.
We note that MUNIT was originally applied to translating between widely different domains, e.g., translating dogs to lions. While this type of translation is impressive, it is also difficult to evaluate, and it is not clear that close pixel-wise correspondence/fidelity is even desirable in that task.
In this paper, we primarily focus on translating a human face across two appearance domains: photos of the face captured by a head-mounted camera, and renders of the face produced by a parametric face model (already adapted to the input face). This has an application in social virtual/augmented reality (VR/AR), where we would like users to interact with each other “face-to-face” (inside the virtual environment) as naturally as possible.
We collected the face data ourselves. The real photos (representing the V domain) were captured by a camera attached to the actor’s headset, with the lens pointed toward the bottom half of the actor’s face; lighting variation was achieved with a set of lights surrounding the actor; background variation was achieved by placing large computer monitors behind the actor and displaying random images. Rendered images of the same face (representing the C domain) were produced by fitting a deep parametric face model to the actor, and generating random expressions from a viewpoint similar to the headset view. There are 7074 real photos, and 1000 rendered images. The task is to translate a photo of a face to (or from) a render-like image of the same face, while maintaining the face’s expression.
In the face image experiments—which are necessarily qualitative—we rely on the fact that humans are extremely adept at reading faces, and attempt to demonstrate that our model achieves obviously better disentanglements than prior methods. The results on aerial and facade data (introduced in prior work) are harder to interpret at a glance, but close inspection can reveal differences in sharpness and spatial consistency with the input. We note that even when ground-truth translations exist, it does not make sense to evaluate against them, since these are many-to-one/one-to-many mappings, and totally unsupervised models (as considered here) cannot be expected to generate labels that match the ground truth (e.g., as assumed in the “FCN score” used in Pix2Pix and CycleGAN).
Figure 5 compares our method against MUNIT and CycleGAN on the face dataset. The results show that while CycleGAN and MUNIT perform the appearance translation, they make small but very noticeable shifts in the facial expression, e.g., turning a closed mouth into a smile, or changing a grimace to a pout. This is due to the drawbacks of cooperative training, described earlier. Our method does not have this problem, and translates the faces across domains without altering expression. Figure 6 shows the same experiment for the facades↔labels task, with similar results: while our method retains, for instance, the exact spatial positions of the features in either domain, the baseline methods tend to make small shifts in position and scale.
4.3 Multi-modal outputs
Our model is designed to produce multi-modal outputs, through a “mix-and-match” method, where we use c from one input and r from another input, and entangle these to form a novel sample of V. We compare against MUNIT, which is the current state-of-the-art method for this task.
More specifically, generating multiple outputs from a single input involves the following steps: (1) given v1 as input, generate (c1', r1'); (2) given an unrelated v2 as input, generate (c2', r2'); (3) entangle (c1', r2') to produce the composite v'. In the face context, since the C domain contains expression but not lighting, this setup means extracting expression from one image, extracting everything else (which is mostly lighting and backgrounds) from another image, and combining these factors into a new image. The experimental setup is similar for MUNIT: a “content code” is generated from v1, and a “style code” is generated from v2, and these are combined into the final output. We do this for multiple v2, to show the effect of transferring a variety of residual factors onto the same face.
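The three steps can be sketched with stand-in translators. Here D and E are placeholder functions on 2-D toy points where entanglement is simple concatenation, not the paper's trained networks:

```python
# Stand-in translators on toy 2-D points v = (c, r):
# D splits v into its factors, E merges factors back into a point.
def D(v):                 # disentangler: v -> (c', r')
    return v[0], v[1]

def E(c, r):              # entangler: (c, r) -> v'
    return (c, r)

v1 = (0.7, -1.2)          # source of content (e.g., expression)
v2 = (0.1, 2.5)           # source of residual (e.g., lighting/background)

c1, _ = D(v1)             # step 1: content from the first input
_, r2 = D(v2)             # step 2: residual from an unrelated input
v_composite = E(c1, r2)   # step 3: entangle the mixed factors
```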
We use the same face data as in the high-fidelity task, and add the aerial photos↔Google maps task, which we find has more evident multi-modality than the facades data.
Figure 8 shows the results of this experiment on the faces dataset, for MUNIT and our model. The table shows expressions from v1 across rows, and residuals from v2 (i.e., lighting/background conditions) across columns. An overview of the results can be obtained by scanning across rows to verify that expression is transferred unchanged from the leftmost column, and scanning across columns to verify that lighting and backgrounds are transferred unchanged from the topmost row. MUNIT appears to have only learned to transfer the global intensity from the source. Our model appears to be transferring backgrounds, and even casting distinct shadows onto the face. However, some shadows appear reduced in intensity (e.g., third column), suggesting that expression-lighting disentanglement is not perfect here.
We also show results of this experiment on the aerial photos↔Google maps dataset, where we treat the Google map as C (assuming it has less information), and the aerial photo as V. Results are summarized in Figure 7, in the same format as the face relighting results. In this domain, it appears MUNIT transfers very little from the residual, while our model incorporates textures and objects (e.g., note the white object transferred from the first residual). Both methods appear to retain the spatial layout of the input map.
In this work, we address the compensation issue in translation cycle-consistency, which typically diminishes the utility of the reconstruction loss. In compensation, the back-translator (undesirably) adapts to the weaknesses and shortcuts of the forward-translator. Hypothetically, there is another way to (partially) defeat the loss, which may be called exploitation. In exploitation, the forward-translator (undesirably) adapts to the weaknesses and shortcuts of the back-translator. The enduring exploitation issue may explain the subtle imperfections in our outputs.
Another limitation of our approach is that we do not address many-to-many mappings. Our approach is only multi-modal in one direction.
In summary, we introduced the problem of high-fidelity image-to-image translation, motivated it for augmented reality applications, and presented an unsupervised method for solving it. We identified a fundamental cause of low-fidelity translations: cooperation between the forward translator and the backward translator, which allows the forward-translation to “hide” information, and the back-translator to “recover” from noticeable errors. This is a critical problem in real applications. We presented an “uncooperative” optimization scheme that prevents the problem. Our results demonstrate that uncooperative optimization leads to high-fidelity image translations, making image-to-image translation not only fun, but useful for augmented reality.
-  A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
-  I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
-  C. Chu, A. Zhmoginov, and M. Sandler. CycleGAN: A master of steganography. arXiv preprint arXiv:1712.02950, 2017.
-  P. Ekman. The face of man: Expressions of universal emotions in a New Guinea village. Garland Publishing, Incorporated, 1980.
-  A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. arXiv preprint arXiv:1805.09730, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  N. Hadad, L. Wolf, and M. Shahar. A two-step disentanglement method. In CVPR, 2018.
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  J. Kossaifi, L. Tran, Y. Panagakis, and M. Pantic. Gagan: Geometry-aware generative adversarial networks. arXiv preprint arXiv:1712.00684, 2017.
-  T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
-  H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1808.00948, 2018.
-  K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, pages 991–999, 2015.
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
-  S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pages 1431–1439, 2014.
-  Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. arXiv preprint arXiv:1806.06503, 2018.
-  Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
-  R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364–374. Springer, 2013.
-  A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. CoRR, abs/1606.05328, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in neural information processing systems, pages 465–476, 2017.
A How “cheating” happens in practice
It is relatively easy to see how the “uncooperative” optimization prevents the networks from developing a “cheating” scheme, since the networks only train when their inputs are real. It is less easy to see how a “cheating” scheme can develop at all, given the losses that already constrain the model. In this section, we first summarize a tempting (but flawed) argument suggesting that “cheating is penalized by the losses”, and then demonstrate how this intuition breaks down in practice.
To see how cheating may intuitively seem impossible, consider the following, with reference to Figure 3 in the main text. Suppose the residual is used as a “shortcut” to cheat Cycle 1, in the sense that D copies its input into the residual, and E then copies the residual into its output, satisfying the cycle-consistency constraint. Meanwhile, to meet the adversarial constraint, D may write an arbitrary target-domain image into its image output. But this leads to errors in Cycle 2: if E simply copies its residual input into its output, and/or D does not produce an output which strictly corresponds to its input, then the image path is essentially ignored, the back-translation cannot match the input, and we incur a loss. Therefore, it seems that cheating should be eliminated at convergence.
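The argument can be made concrete under one assumed notation (the symbols below are illustrative and need not match those of Figure 3): write the first translator as $D(a) = (\hat{b}, \hat{z})$, mapping a source image $a$ to a target-domain image $\hat{b}$ and a residual $\hat{z}$, and the second as $E(b, z) = \hat{a}$.

```latex
% Cycle 1 (starting from a real a): the shortcut sets \hat{z} = a and
% lets E ignore its image input, so cycle-consistency holds trivially:
D(a) = (\hat{b},\, \hat{z} = a), \qquad
E(\hat{b}, \hat{z}) = \hat{z} = a
\;\;\Rightarrow\;\; \big\| E(D(a)) - a \big\| = 0 .

% Cycle 2 (starting from a real pair (b, z)): if E ignores its image
% input, its output carries no information about b, so D cannot
% reproduce b and the cycle loss stays strictly positive:
E(b, z) = z
\;\;\Rightarrow\;\; D(E(b, z)) = D(z) \neq (b, z)
\;\;\Rightarrow\;\; \text{Cycle-2 loss} > 0 .
```

This is exactly the flawed intuition: the positive Cycle-2 loss appears to rule out the shortcut, yet in practice optimization finds subtler channels that evade it.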
In practice, however, the networks achieve a far more subtle type of cheat, which eventually yields zero loss. At training time, the visual manifestation of the cheat is that the translations do not correspond to the inputs, and yet they are back-translated perfectly. Figure ?? (left) shows some examples of this behavior. Our experiments suggest that the networks generate outputs that facilitate reconstruction of the corresponding inputs, and that the networks treat these generated tensors differently from real tensors. In particular, when we generate a fake target-domain image, the input tends to be hidden inside it, to facilitate its reconstruction by E. Similarly, when we generate a fake source-domain image, the input tends to be hidden inside it, to facilitate its reconstruction by D. Figure ?? (middle and right) illustrates how to empirically reveal this behavior, and shows sample non-corresponding outputs from a converged “cooperative” model. For a brief reading of the figure, observe that the real and fake images appear visually identical, yet D decodes the real one into a closed mouth and the fake one into a wide open mouth.
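The fragility of such a hidden channel can be probed directly: a perturbation that is imperceptible on the translated image is disproportionately amplified in the back-translation. The sketch below uses a hypothetical linear “translator” (invented for illustration; the real models are learned CNNs) that hides its input inside a fixed, non-corresponding output at a tiny amplitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "cheating" translator: it emits a fixed target-domain
# image that does not correspond to the input, but hides a scaled-down
# copy of the input inside it.
EPS = 1e-3
target = rng.random(64)            # fixed "target-domain" image

def translate(x):
    # non-corresponding output + hidden low-amplitude copy of x
    return target + EPS * x

def back_translate(y):
    # recover the hidden input by removing the carrier and rescaling
    return (y - target) / EPS

x = rng.random(64)                 # "source-domain" input
y = translate(x)
x_rec = back_translate(y)
print(np.abs(x_rec - x).max())     # ≈ 0: perfect reconstruction

# Probe: noise that is invisible on y is amplified 1/EPS-fold in the
# back-translation, revealing the fragile hidden channel.
noise = 1e-4 * rng.standard_normal(64)
x_noisy = back_translate(y + noise)
print(np.abs(x_noisy - x_rec).max() / np.abs(noise).max())  # ≈ 1/EPS
```

The diagnostic mirrors the figure's experiment: a model that truly disentangled the domains would degrade gracefully under such noise, whereas a hiding model's reconstruction is hypersensitive to it.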
Parallel work by Chu et al. has also observed this phenomenon, under the label of steganography. That work showed that the secret/cheating signal is often hidden in high frequencies, where presumably the discriminators are less effective. With sufficient training, a discriminator should learn to block this strategy (since such high-frequency content is not present in real examples), which would force the signal to shift to lower (and more semantically relevant) frequencies, as observed here.
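The high-frequency hiding strategy, and why a frequency-aware discriminator defeats it, can be illustrated with a toy 1-D example (the signals, amplitudes, and thresholds below are invented for illustration; learned models hide information in far subtler ways). A “secret” is written into the highest-frequency bins of a smooth cover signal; a crude low-pass filter, standing in for a discriminator that rejects unrealistic high-frequency content, destroys the channel.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 128
freqs = np.fft.rfftfreq(N)                 # 0 .. 0.5 cycles/sample

# Smooth, low-frequency "cover" signal (a scaled random walk).
cover = np.cumsum(rng.standard_normal(N)) / N
secret_bits = rng.integers(0, 2, size=8)   # the hidden message

# Hide the bits in the 8 highest-frequency bins at low amplitude.
spec = np.fft.rfft(cover)
hi = np.arange(len(freqs) - 8, len(freqs))
spec[hi] = 0.05 * secret_bits
stego = np.fft.irfft(spec, n=N)

def read_secret(x):
    # Decode by thresholding the magnitude of the hiding bins.
    s = np.fft.rfft(x)
    return (np.abs(s[hi]) > 0.025).astype(int)

print(read_secret(stego))                  # recovers secret_bits

# A discriminator that rejects unrealistic high-frequency content acts
# like a low-pass filter, which wipes out the hidden channel:
spec2 = np.fft.rfft(stego)
spec2[freqs > 0.25] = 0                    # crude low-pass "defense"
filtered = np.fft.irfft(spec2, n=N)
print(read_secret(filtered))               # all zeros: secret is gone
```

Blocking the high-frequency channel does not end the cheating; as noted above, it pushes the hidden signal into lower, more semantically relevant frequencies, where it is harder to separate from legitimate content.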