1 Introduction
Unpaired cross-domain image-to-image translation is achieving exceptionally convincing results in a variety of domains [27]. High-fidelity image translation requires not only realism, but also strict preservation of the factors that are common to both domains. Consider Figure 1. We wish to translate an image of a face across two domains that differ mostly in texture. It is inappropriate for the translator to additionally change the face’s expression. Unfortunately, this failure mode is surprisingly common in standard unsupervised image-to-image models.
Besides the challenge of fidelity, image-to-image translation is made difficult by the fact that one domain may contain factors of variation that the other does not. For instance, consider the task of translating real photographs of a face (with arbitrary lighting and backgrounds) to uniformly-lit faces with black backgrounds. Following the approach of CycleGAN [27], we may create a translator for each direction, and train these jointly with a cycle-consistency loss (ensuring that forward and backward translation yields the original input) and an adversarial loss (ensuring that the mappings reach the target domains). But how can we expect this to work? The first translator needs to remove the lighting and background (to map to the second domain), and the second translator needs to add back the same lighting and background (to map to the first domain, and reconstruct the input).
Perhaps unsurprisingly, gradient descent tends to find a way around this issue, resulting in models that hide information (rather than remove it), allowing the information’s recovery. Unfortunately, this leads to inaccurate translations (as shown in our experiments), since the “hiding” affects the accuracy of the translation. Several works have proposed to add a “residual” (or “style”) path to the cycle, which gives the model a sanctioned means to encode and reconstruct the auxiliary information [28, 9, 1]. We view this modified cycle as performing a disentanglement and subsequent re-entanglement: the first translator disentangles the input into (1) an image in the second domain, and (2) a residual; the second translator entangles these to reconstruct the input. In practice, however, we find that with or without the sanctioned residual path, standard optimization tends to “hide” information during translation, rather than fully disentangle as desired.
An unconstrained “residual” path can actually be detrimental to the final results. Note that while data-driven priors are assumed available for the translation endpoints, the residual—information in one domain but not the other—is generally unknown. While the implementor may hope that the model will use the “residual” path for disentanglement, it may instead exploit this path to encode the entire input, greatly facilitating the reconstruction task of the “entanglement” step (i.e., the cycle-consistency objective). Prior works have proposed a variety of methods to mitigate this problem, but usually at the cost of severely reducing the representational capacity of the residual (e.g., limiting it to 8 dimensions), and making strong assumptions about its distribution (e.g., assuming it is standard normal) [1, 9, 15, 17]. After applying these heavy constraints, some prior works report that the residual path is ignored by the model, unless its usage is facilitated by careful design choices (e.g., rather than simply concatenating the residual as an input, enforcing its usage as layerwise normalization coefficients applied throughout the second translator) [1, 17].
Our main insight is that the disentanglement-entanglement cycle is ineffective when the disentangler and entangler are allowed to cooperate. By “cooperate” we mean that they train on each other’s outputs. CycleGAN and its many variants [27, 9, 15, 6, 17, 1] all have a cooperative training setup: in each cycle, the first translator receives a real input, and the second translator receives a fake input (i.e., an attempted translation/disentanglement) which it back-translates, and both networks are penalized according to the reconstruction error. This essentially asks the second network to compensate for the first network’s errors. This is counterproductive, because if the second network succeeds, then the first network need not improve. Given sufficient optimization time, these cooperative setups find extremely effective “cheats”, in which subtle signals are encoded into low-fidelity forward translations and subsequently decoded to achieve near-perfect back-translation, thus defeating the reconstruction error [4].
Our main contribution is preventing the networks from compensating for each other’s errors, via a simple optimization technique: train each network only when its input data is real. With this technique, neither network learns about the other’s behavior, which renders cooperation impossible. Instead, the back-translator simply preserves any errors made during forward translation, and the reconstruction penalty falls entirely on the forward translator. This forces the networks to learn more faithful mappings to their target domains. The technique also constrains the residual path to encoding only “auxiliary” information (regardless of architecture), since the model is simply incapable of exploiting it for other purposes. In experiments with real images, we show that our optimization method delivers an obvious qualitative improvement over the current state of the art, both in terms of semantics preservation and residual-factor disentanglement. On synthetic data (where the residual is known), we demonstrate that our “uncooperative” optimization leads to quantitatively accurate disentanglement, whereas “cooperative” optimization does not.
2 Related Work
Image-to-image translation has recently attracted great attention, partly thanks to the success of generative adversarial networks (GANs) [7, 20, 26, 13]. The goal in image-to-image translation is to translate an image in one domain to a corresponding image in the second domain. Pix2Pix [10] trains models for this task using paired data from the two domains (i.e., input-output pairs exemplifying good translations). CycleGAN [27] removes the need for paired data by forming a translation “cycle”—forward translation followed by backward translation—which creates a natural reconstruction objective between the input and the back-translation. This is an important step, because in many domains, paired examples do not exist (e.g., a face in the exact same pose/expression in two different physical environments). CycleGAN often preserves the structural content of the images, but this may simply be a consequence of the convolutional architecture [16]. CycleGAN is only capable of learning one-to-one mappings, but several works (not all unsupervised) have proposed variants that are capable of one-to-many mappings, such as Augmented CycleGAN [1], DRIT [15], MUNIT [9], BicycleGAN [28], and cross-domain disentanglers [6]. These methods are able to generate diverse images with similar “content” (i.e., structural pattern) but different “style” (i.e., textural rendering) through disentanglement. These methods use strong assumptions or regularizations to avoid undesirable local optima, including shared latent spaces [9, 15, 6], KL-divergence losses against simple Gaussians [9, 15, 28], or low-dimensional representations [15, 1, 28, 6]. The effectiveness of these methods is therefore highly dependent on parameter selection.
Image factor disentanglement is necessary if we wish to control the latent factors in the generated images. Hadad et al. [8] assume the availability of attribute labels in a particular domain, where the goal is to disentangle images into a target domain plus a residual (i.e., “everything else”). Many disentanglement works also make strong assumptions on domain knowledge of the latent space, which includes having data pre-grouped according to individual factors [22, 14], or having exact knowledge of the structure and function of individual factors (e.g., for faces: identity, pose, shape, texture [24, 23]). In this work, we do not have attribute labels, we do not make assumptions on the latent space, and we perform disentanglement using only the unpaired image data. Similar to our method, InfoGAN [3] and MINE [2] are completely unsupervised, but the approach in these works is quite different: these methods maximize the mutual information between the inferred latent variables and the observations, while we use discriminators and reconstruction to achieve disentanglement.
3 Method
There are three key ingredients to our method: (1) adversarial priors, which encourage the translated images to be indistinguishable from ones in their target domain; (2) cycle-consistency, which encourages the translations to be invertible; and (3) “uncooperative” optimization, which ensures the networks do not “cheat” toward an undesirable local minimum.
3.1 Preliminaries
Let V and C be two image domains, such that the images v ∈ V have more information than the images c ∈ C. That is, V contains variation in some latent factor that is either constant or absent in C. This implies that the mapping V → C is many-to-one, and the mapping C → V is one-to-many. As a mnemonic, note that V is variable in some aspect where C is constant.
Let R be the residual information that is in V but not in C. Accessing this extra information allows us to form bijective (one-to-one) mappings, V → (C, R) and (C, R) → V. Note that R is not necessarily an image domain. In our implementation, each r ∈ R is a collection of deep feature maps at multiple scales, which allows its actual form to be determined entirely by the data.
Our goal is to learn functions that can map between these domains. We call the first mapping a disentanglement, denoted D, since it performs an intricate splitting operation: D(v) = (c, r). We call the second mapping an entanglement, denoted E, since it performs a merging operation: E(c, r) = v. Figure 2 shows a diagram of the domains and the mappings between them. Figure 3 relates the notation to the data and architecture. Note that D and E are inverses of one another.
Our input is a set of samples from V, and a set of samples from C. The two datasets are unpaired, and true correspondences might not exist.
3.2 Adversarial priors
Our model has two main networks, D and E. We would like the outputs of D to land in (C, R), and the outputs of E to land in V. To achieve this, we introduce adversarial networks A_C and A_V, which learn and impose priors on the distributions of our networks’ outputs.
The adversarial networks attempt to discriminate between real and fake (i.e., generated) samples of the domains C and V. In our notation, we distinguish “fake” samples with a prime symbol, writing (c′, r′) = D(v) and v′ = E(c, r). We train our main networks against the adversarial labels with the least-squares loss [19]:
\mathcal{L}_{\mathrm{GAN}} = (A_C(c') - 1)^2 + (A_V(v') - 1)^2 \qquad (1)
In a separate (but concurrent) optimization, we also train the parameters of the adversaries, with the losses (A_C(c) − 1)² + A_C(c′)² and (A_V(v) − 1)² + A_V(v′)².
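As a concrete reference, the least-squares objectives can be sketched in a few lines of numpy; the function names and array-valued discriminator scores below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def ls_generator_loss(d_fake):
    """Least-squares GAN loss for the translators: push the
    discriminator's scores on generated samples toward the 'real'
    label of 1."""
    return float(np.mean((d_fake - 1.0) ** 2))

def ls_discriminator_loss(d_real, d_fake):
    """Least-squares GAN loss for an adversary: score real samples
    as 1 and generated samples as 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

# A discriminator that labels everything correctly incurs zero loss,
# as does a generator whose outputs are all scored as real.
assert ls_discriminator_loss(np.ones(4), np.zeros(4)) == 0.0
assert ls_generator_loss(np.ones(4)) == 0.0
```

Note that, unlike the original saturating GAN loss, the least-squares variant penalizes fakes by their squared distance from the real label, which keeps gradients informative even for confidently rejected samples.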
Note that we have no priors on the r′ samples generated by D, because there is no dataset of “true” residuals. Prior works manufactured a prior by assuming that R is a low-dimensional Gaussian distribution (e.g., 8 dimensions, with zero mean and unit variance) [9, 15, 6, 17]. Here, we avoid this limiting assumption. We are able to do this because of our unique optimization procedure, detailed in Sec. 3.4. However, we do obtain some constraints on R by enforcing cycle consistency, described next.

3.3 Cycle consistency
On each training step, the model runs two “cycles”. Each cycle generates a reconstruction loss, which constrains the model to perform consistent forward-backward translation. Figure 3 illustrates the cycles.
In the first cycle, the disentangler D receives a random v from the dataset, and generates two outputs: (c′, r′) = D(v). These outputs are passed to the entangler E, which generates v″ = E(c′, r′). If the disentanglement and re-entanglement are successful, this output should correspond to the original v. Therefore, we form the reconstruction objective ‖v″ − v‖₁, where ‖·‖₁ denotes the L1 norm. In summary, this cycle performs v → (c′, r′) → v″.
The second cycle is symmetric to the first. The entangler E receives a random c from the dataset, and an r generated from a random v. Note that it is necessary to use generated samples here, since R is completely determined by the network. We omit the prime on this r since it is treated as an input rather than an output. From the input (c, r), the entangler generates v′ = E(c, r). We then pass v′ to the disentangler, which generates two new outputs (c″, r″) = D(v′). If the entanglement and disentanglement are successful, these outputs should correspond to the original inputs. We therefore form the reconstruction objectives ‖c″ − c‖₁ and ‖r″ − r‖₁. In summary, this cycle performs (c, r) → v′ → (c″, r″).
Collecting the reconstruction objectives, we have
\mathcal{L}_{\mathrm{rec}} = \lambda_V \lVert v'' - v \rVert_1 + \lambda_C \lVert c'' - c \rVert_1 + \lambda_R \lVert r'' - r \rVert_1 \qquad (2)
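The two cycles and their weighted reconstruction terms can be sketched end to end. Here D and E are toy stand-ins (splitting and concatenating a vector), and the loss weights are illustrative placeholders rather than the paper's exact values:

```python
import numpy as np

def l1(x, y):
    """L1 distance between two tensors."""
    return np.sum(np.abs(x - y))

def cycle_losses(D, E, v, c, lam_v=10.0, lam_c=10.0, lam_r=1.0):
    """Sum of the reconstruction terms from both cycles.
    D: v -> (c', r');  E: (c, r) -> v'.  The lam_* weights are
    illustrative placeholders."""
    # Cycle 1: v -> (c', r') -> v''
    c1, r1 = D(v)
    v_rec = E(c1, r1)
    # Cycle 2: (c, r) -> v' -> (c'', r''). The r input must be
    # generated by D (here from the same v, for brevity), since no
    # dataset of real residuals exists.
    _, r = D(v)
    v_fake = E(c, r)
    c_rec, r_rec = D(v_fake)
    return lam_v * l1(v_rec, v) + lam_c * l1(c_rec, c) + lam_r * l1(r_rec, r)

# Toy D/E: splitting and concatenating a 2-vector. A perfect
# disentangler/entangler pair incurs zero reconstruction loss.
split = lambda v: (v[:1], v[1:])
concat = lambda c, r: np.concatenate([c, r])
assert cycle_losses(split, concat, np.array([1.0, 2.0]), np.array([3.0])) == 0.0
```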
Observe that there is no “fidelity” objective on the translated tensor of each cycle (c′ in Cycle 1, and v′ in Cycle 2); these tensors only have an adversarial loss. In other words, there is nothing in the design to force c′ to correspond to v, or v′ to correspond to (c, r), other than the back-translation error. As we will show in experiments, this back-translation requirement is not sufficient, because the networks are able to cooperate on the back-translation: when E is the back-translator, it can compensate for errors made by D, and vice versa.
In practice, many of these “errors” are never corrected. Instead, they are adapted and refined, to minimize the adversarial loss while facilitating reconstruction. We call these “cheats”: undesirable outputs that yield near-zero loss. At convergence, cheats often take the form of a within-domain transformation: this causes the adversary to not impose a loss (since the output is still in the correct domain), yet allows the second network to (jointly) learn how to undo the transformation. These cheats are especially visible in experiments with faces, likely because humans are so sensitive to faces [5]. Figures 1 and 5 show clear examples of this cheating behavior: while the two domains only differ in texture/lighting, the networks learn to additionally (and unpredictably) alter the facial expression.
We observe that this undesirable solution to the reconstruction error requires both D and E to be complicit in the scheme. For example, if D transforms its input while translating it, but E is unaware of the cheat, E will not undo the transformation while back-translating, yielding a loss. This leads us to our optimization procedure, which essentially prevents D and E from cooperating in this way.
3.4 Uncooperative optimization
The total loss we wish to minimize is
\mathcal{L} = \mathcal{L}_{\mathrm{GAN}} + \mathcal{L}_{\mathrm{rec}} \qquad (3)
As long as the forward translations land in the target domains, the GAN term is minimized; as long as the backward translations reconstruct the input, the reconstruction term is minimized.
As explained, there is a local minimum of this loss in which forward translation includes an undesirable transformation, and back-translation includes an inverse transformation. This ruins the fidelity of the translation.
To reach this bad minimum, each network needs to learn two functions: (1) translating “real” inputs into outputs in a target domain, and (2) decoding “fake” inputs by undoing generated translations. Referring to Figure 3, the disentangler D learns the first task in Cycle 1, and learns the second task in Cycle 2; the entangler E learns these functions in the opposite order. In other words, the networks learn to perform different tasks depending on their input: given real inputs, they translate; given fake inputs, they decode.
To prevent this from happening, we prevent the networks from learning how to decode fake inputs. We do this by freezing the networks when they receive fake inputs. When a network is “frozen”, it is treated as a fixed but differentiable function, so that gradients flow through it, but it does not learn. Referring again to Figure 3, this means training D only in the first cycle, and training E only in the second cycle (where they respectively receive real inputs).
With this optimization technique, the networks are incapable of learning how to compensate for each other’s errors. This means that an erroneous forward translation will always be taken “at face value” by the backward translator, and produce an appropriate loss. This is because the backward translator’s only experience (in terms of gradient steps) comes from real data.
This method is a type of alternating optimization, in the sense that we keep one set of parameters fixed while optimizing the other set, and alternate. In practice, we alternate on every step. Specifically, we do a forward pass through Cycle 1, freeze E while we update D, and then do a forward pass through Cycle 2, and freeze D while we update E.
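To make the alternation concrete, here is a minimal sketch with scalar translators D(v) = a·v and E(c) = b·c, hand-derived gradients, and the adversarial terms omitted; it shows the key mechanic: gradients flow through the frozen partner, but only the network that received real data updates:

```python
# Minimal sketch of uncooperative alternation with scalar translators
# D(v) = a*v and E(c) = b*c. Only the reconstruction cycles are shown;
# the gradients are written out by hand for clarity.

def train_uncooperative(v_data, c_data, lr=0.01, steps=2000):
    a, b = 0.3, 0.3  # parameters of D and E
    for _ in range(steps):
        for v in v_data:
            # Cycle 1: real v -> D -> frozen E -> reconstruction.
            # Only a updates: d/da of (b*a*v - v)^2 with b held fixed.
            err = b * a * v - v
            a -= lr * 2.0 * err * b * v
        for c in c_data:
            # Cycle 2: real c -> E -> frozen D -> reconstruction.
            # Only b updates: d/db of (a*b*c - c)^2 with a held fixed.
            err = a * b * c - c
            b -= lr * 2.0 * err * a * c
    return a, b

a, b = train_uncooperative([1.0, 2.0], [1.0, -1.0])
# The cycle becomes consistent: the product a*b converges to 1.
assert abs(a * b - 1.0) < 1e-3
```

In a deep-learning framework, the same effect is obtained by stepping only the active network's optimizer on each cycle while still backpropagating through the frozen one.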
Including the independent update required for the adversarial networks, this setup requires three optimizers in total.
3.5 Implementation details
Network architecture
Our implementation is based on CycleGAN [27]. The translators’ architecture originally comes from Johnson et al. [11]: two stride-2 convolutions, four residual blocks, and two transposed convolutions.
We implement the disentangler as two separate networks: one for the c stream and one for the r stream; the r stream ends before the transposed convolutions. We found this worked significantly better than using a single network to produce both c and r.
The entangler uses the same architecture, except that it receives skip connections from the r stream. There are three such connections: the first uses the feature map produced after the stride-2 convolutions; the second uses the feature map after two residual blocks; and the third uses the feature map after the next (and final) two residual blocks. These feature maps are simply concatenated with the corresponding feature maps in E. The intent with multiple skip connections is to give the network the capacity to transfer residuals at multiple levels of scale and abstraction. Our model has fewer residual blocks than CycleGAN, but the added r stream makes the total parameter count similar.
Training
We set the reconstruction coefficients on the c and v terms to ten times the GAN loss, so λ_C = λ_V = 10. We use a smaller coefficient on the r reconstruction, since it is a much larger tensor. We update the discriminators using generated images drawn randomly from a history buffer of size 50. We use the Adam solver [12], with a batch size of 4 and a learning rate of 0.0002. After the reconstruction errors stop descending, we linearly decay the learning rate to zero. In total, training can take up to 300,000 steps, which is approximately 3 days on a single Nvidia GTX 1080 Ti. This is slower convergence than a traditional CycleGAN (which takes 100,000 iterations on our data), likely because the objective is harder to optimize when “cheating” is disallowed.
Simplified settings for synthetic data
For the experiments with synthetic data, we use a model with fewer parameters. We implement each generator as a fully-connected network with one hidden layer of 32 units and ReLU activations. We implement each adversarial discriminator as a fully-connected network with one hidden layer of 32 units and leaky ReLU activations. Our experiments suggest that the discriminators have more than sufficient capacity to correctly learn the distributions of C and V and keep equilibrium with the generators. We use the same training setup as in the real-image experiments, except that we set the batch size to 128; training to convergence takes approximately 60,000 iterations, which is 1 hour on a single GPU.

4 Experiments
In this section, we demonstrate that our method outperforms prior work on (1) accuracy of disentanglement, (2) fidelity of translation, and (3) coverage of modes (in multimodal translations). Ground-truth disentanglements do not exist in real image data, so we use a simple synthetic scenario to quantitatively evaluate accuracy, then present real-world qualitative results for fidelity and coverage.
4.1 Disentanglement accuracy
One of our claims is that uncooperative optimization is critical for accurate disentanglement. This is based on the idea that uncooperative models are less able to find “cheats” that bypass the need for accuracy.
In other words, we need to show that “uncooperative” optimization leads to correctly disentangling c and r from within v, in a setting where “cooperative” optimization fails. It is surprisingly easy to find such a scenario. We present one here, in which the ground-truth factors are 1D, and entanglement/disentanglement is simply concatenation/splitting. We find that cooperative optimization is incapable of learning this simple operation (under the given data availability assumptions), whereas uncooperative optimization succeeds.
Models
In this experiment, we use two identical models (see the “Simplified settings for synthetic data” in Sec. 3.5), and change only the optimization method: one uses the proposed “uncooperative” optimization, and the other uses the baseline “cooperative” optimization.
Data
Since ground-truth latent factors are generally unknown in real data, it is necessary to design synthetic data for this experiment. We define the latent factors C and R to be Gaussian distributions. We generate synthetic entanglements by concatenating a sample c with a sample r, where the elements of c and the elements of r are drawn from 1D Gaussians with different parameters. We find that results are not sensitive to dimensionality (except in convergence time), and so present only the simplest version here, setting the dimensionality of both C and R to 1, making the dimensionality of V equal to 2. Note that the domain R is never encountered at training time, except in its entangled form inside V. The task is to recover R, using only disentanglement/entanglement cycles, and unpaired samples of V and C.
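The data-generation procedure can be sketched as follows; the specific Gaussian parameters and the random seed are placeholders, since the exact values are not restated here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic(n, c_params=(0.0, 1.0), r_params=(2.0, 0.5)):
    """Draw n entangled samples v = concat(c, r), plus an unpaired
    batch of c samples. r is never exposed on its own: training only
    ever sees v and c. The (mean, std) parameters are placeholders."""
    c = rng.normal(*c_params, size=(n, 1))
    r = rng.normal(*r_params, size=(n, 1))
    v = np.concatenate([c, r], axis=1)  # dim(v) = 2
    c_unpaired = rng.normal(*c_params, size=(n, 1))
    return v, c_unpaired

v, c = sample_synthetic(1000)
assert v.shape == (1000, 2) and c.shape == (1000, 1)
```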
Metrics
We measure the relationship between the actual domain R (used to generate samples) and the learned domain R′ (disentangled from V samples) using the Pearson correlation coefficient ρ, whose magnitude equals 1 if the two variables have a perfectly linear relationship, and is closer to 0 otherwise. This (unlike a distance metric) allows solutions where the learned R′ is a scaled version of the true R, which is appropriate since the scaling may be absorbed in the model weights.
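The metric itself is straightforward to compute; the snippet below is a self-contained sketch:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# A learned residual that is merely a scaled copy of the true one still
# scores rho = 1, which is why correlation (rather than a distance) is
# the right metric here: the scale can be absorbed by the model weights.
r_true = np.array([0.5, -1.2, 2.0, 0.3])
assert abs(pearson(r_true, 3.0 * r_true) - 1.0) < 1e-9
```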
Results
Our results are summarized in Figure 4. The two models converge in approximately the same number of iterations. At the end of training, the cooperative version achieves a correlation coefficient of 0.695, while the uncooperative version achieves 0.998. Results vary slightly across iterations (and across initializations), but correlation does not noticeably improve for the cooperative version, even if training is extended to 200k iterations.
Overall, this shows that uncooperative optimization leads the model to disentangle the true latent factors, while cooperative optimization does not.
4.2 High-fidelity translation
One of our claims is that uncooperative training leads to high-fidelity translations. By this, we mean that the translation retains as much information as possible from the input, without altering it. To evaluate this, we compare our model’s forward translations against those of CycleGAN and MUNIT.
Baselines
CycleGAN is a popular baseline in unsupervised (but unimodal) image-to-image translation; our architecture is based on it. MUNIT is a state-of-the-art unsupervised multimodal image-to-image translation method.
Data
We note that MUNIT was originally applied to translating between widely different domains, e.g., translating dogs to lions. While this type of translation is impressive, it is also difficult to evaluate, and it is not clear that close pixelwise correspondence/fidelity is even desirable in that task.
In this paper, we primarily focus on translating a human face across two appearance domains: photos of the face captured by a head-mounted camera, and renders of the face produced by a parametric face model (already adapted to the input face). This has an application in social virtual/augmented reality (VR/AR), where we would like users to interact with each other “face-to-face” (inside the virtual environment) as naturally as possible.
We collected the face data ourselves. The real photos (representing the V domain) were captured by a camera attached to the actor’s headset, with the lens pointed toward the bottom half of the actor’s face; lighting variation was achieved with a set of lights surrounding the actor; background variation was achieved by placing large computer monitors behind the actor and displaying random images. Rendered images of the same face (representing the C domain) were produced by fitting a deep parametric face model to the actor [18], and generating random expressions from a viewpoint similar to the headset view. There are 7074 real photos, and 1000 rendered images. The task is to translate a photo of a face to (or from) a render-like image of the same face, while maintaining the face’s expression.
Metrics
In the face image experiments—which are necessarily qualitative—we rely on the fact that humans are extremely adept at reading faces [5], and attempt to demonstrate that our model achieves obviously better disentanglements than prior methods. The results on aerial and facade data (introduced in prior work) are harder to interpret at a glance, but close inspection can reveal differences in sharpness and spatial consistency with the input. We note that even when ground-truth translations exist, it does not make sense to evaluate against them, since these are many-to-one/one-to-many mappings, and totally unsupervised models (as considered here) cannot be expected to generate labels that match the ground truth (e.g., as assumed in the “FCN score” used in Pix2Pix [10] and CycleGAN [27]).
Results
Figure 5 compares our method against MUNIT and CycleGAN on the face dataset. The results show that while CycleGAN and MUNIT perform the appearance translation, they make small but very noticeable shifts in the facial expression, e.g., turning a closed mouth into a smile, or changing a grimace to a pout. This is due to the drawbacks of cooperative training, described earlier. Our method does not have this problem, and translates the faces across domains without altering expression. Figure 6 shows the same experiment for the facades ↔ labels task, with similar results: while our method retains, for instance, the exact spatial positions of the features in either domain, the baseline methods tend to make small shifts in position and scale.
4.3 Multimodal outputs
Our model is designed to produce multimodal outputs through a “mix-and-match” method, where we use c from one input and r from another input, and entangle these to form a novel sample of V. We compare against MUNIT, which is the current state-of-the-art method for this task.
More specifically, generating multiple outputs from a single input involves the following steps: (1) given v1 as input, generate (c1, r1) = D(v1); (2) given an unrelated v2 as input, generate (c2, r2) = D(v2); (3) entangle (c1, r2) to produce the composite E(c1, r2). In the face context, since the C domain contains expression but not lighting, this setup means extracting expression from one image, extracting everything else (which is mostly lighting and backgrounds) from another image, and combining these factors into a new image. The experimental setup is similar for MUNIT: a “content code” is generated from v1, and a “style code” is generated from v2, and these are decoded into the final output. We do this for multiple v2, to show the effect of transferring a variety of residual factors onto the same face.
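The three steps above reduce to a few lines; D and E below are hypothetical stand-ins for the trained disentangler and entangler:

```python
def mix_and_match(D, E, v1, v2):
    """Compose content from v1 with the residual from v2."""
    c1, _ = D(v1)      # step 1: keep c (e.g., expression) from v1
    _, r2 = D(v2)      # step 2: keep r (e.g., lighting) from v2
    return E(c1, r2)   # step 3: entangle into a composite sample

# Toy illustration with split/concat stand-ins for D and E.
D = lambda v: (v[:1], v[1:])
E = lambda c, r: c + r
assert mix_and_match(D, E, [1, 2], [3, 4]) == [1, 4]
```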
Data
We use the same face data as in the high-fidelity task, and add the aerial photos ↔ Google maps task [27], which we find has more evident multimodality than the facades data.
Results
Figure 8 shows the results of this experiment on the faces dataset, for MUNIT and our model. The table shows expressions from v1 across rows, and residuals from v2 (i.e., lighting/background conditions) across columns. An overview of the results can be obtained by scanning along rows to inspect that expression is transferred unchanged from the leftmost column, and scanning down columns to inspect that lighting and backgrounds are transferred unchanged from the topmost row. MUNIT appears to have only learned to transfer the global intensity from the source. Our model appears to be transferring backgrounds, and even casting distinct shadows onto the face. However, some shadows appear reduced in intensity (e.g., third column), suggesting that expression-lighting disentanglement is not perfect here.
We also show results of this experiment on the aerial photos ↔ Google maps dataset, where we treat the Google map as C (assuming it has less information), and the aerial photos as V. Results are summarized in Figure 7, in the same format as the face relighting results. In this domain, it appears MUNIT transfers very little from the residual, while our model incorporates textures and objects (e.g., note the white object transferred from the first residual). Both methods appear to retain the spatial layout of the input map.
5 Discussion
In this work, we address the compensation issue in translation cycle-consistency, which typically diminishes the utility of the reconstruction loss. In compensation, the back-translator (undesirably) adapts to the weaknesses and shortcuts of the forward-translator. Hypothetically, there is another way to (partially) defeat the loss, which may be called exploitation. In exploitation, the forward-translator (undesirably) adapts to the weaknesses and shortcuts of the back-translator. This enduring exploitation issue may explain the subtle imperfections in our outputs.
Another limitation of our approach is that we do not address many-to-many mappings. Our approach is only multimodal in one direction.
In summary, we introduced the problem of high-fidelity image-to-image translation, motivated it for augmented reality applications, and presented an unsupervised method for solving it. We identified a fundamental cause of low-fidelity translations: cooperation between the forward translator and the backward translator, which allows the forward translator to “hide” information, and the back-translator to “recover” from noticeable errors. This is a critical problem in real applications. We presented an “uncooperative” optimization scheme that prevents the problem. Our results demonstrate that uncooperative optimization leads to high-fidelity image translations, making image-to-image translation not only fun, but useful for augmented reality.
References
 [1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In ICML, 2018.
 [2] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
 [3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [4] C. Chu, A. Zhmoginov, and M. Sandler. CycleGAN: A master of steganography. arXiv preprint arXiv:1712.02950, 2017.
 [5] P. Ekman. The face of man: Expressions of universal emotions in a New Guinea village. Garland Publishing, Incorporated, 1980.
 [6] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. arXiv preprint arXiv:1805.09730, 2018.
 [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

 [8] N. Hadad, L. Wolf, and M. Shahar. A two-step disentanglement method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [9] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.

 [10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 [11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] J. Kossaifi, L. Tran, Y. Panagakis, and M. Pantic. GAGAN: Geometry-aware generative adversarial networks. arXiv preprint arXiv:1712.00684, 2017.
 [14] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
 [15] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1808.00948, 2018.
 [16] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, pages 991–999, 2015.
 [17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
 [18] S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.
 [19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
 [20] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
 [21] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

 [22] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pages 1431–1439, 2014.
 [23] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. arXiv preprint arXiv:1806.06503, 2018.
 [24] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
 [25] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364–374. Springer, 2013.
 [26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016.
 [27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.
 [28] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in neural information processing systems, pages 465–476, 2017.
A How “cheating” happens in practice
It is relatively easy to see how “uncooperative” optimization prevents the networks from developing a “cheating” scheme, since the networks only train when their inputs are real. It is less easy to see how a “cheating” scheme can develop at all, given the losses that already constrain the model. In this section, we first summarize a tempting (but flawed) argument suggesting that “cheating is penalized by the losses”, and then demonstrate how this intuition is proven wrong in practice.
To see how cheating may intuitively seem impossible, consider the following, with reference to Figure 3 in the main text. Suppose the residual is used as a “shortcut” to cheat Cycle 1, in the sense that D copies its input a into the residual output, and then E copies the residual into the reconstruction, meeting the cycle-consistency constraint on a. Meanwhile, to meet the adversarial constraint, D may write any target-domain image into its image output. But this leads to errors in Cycle 2: if E simply copies its residual input into its output, and/or D does not produce an output which strictly corresponds to its input, then the real target-domain image b is essentially ignored, and we will have a mismatched reconstruction of b and a loss. Therefore, it seems that cheating should be eliminated at convergence.
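This argument can be made concrete with a toy model. The sketch below (using made-up vector “images”; the names D, E, and the variables a, b, z merely follow the description above) implements the copy-shortcut cheat and evaluates both cycles: the shortcut satisfies Cycle 1 exactly, but is penalized by Cycle 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": flat vectors. Following the text, D disentangles a source
# image a into (b_hat, z_hat), and E entangles (b, z) back into a_hat.
TARGET = rng.normal(size=8)        # a fixed, "realistic" target-domain output

def D_cheat(a):
    # Cheating disentangler: hides the entire input in the residual slot
    # and emits an arbitrary (but realistic-looking) target-domain image.
    return TARGET.copy(), a.copy()         # (b_hat, z_hat)

def E_cheat(b, z):
    # Cheating entangler: ignores b and simply copies the residual out.
    return z.copy()                        # a_hat

a = rng.normal(size=8)             # real source-domain input
b = rng.normal(size=8)             # real target-domain input
z = rng.normal(size=8)             # real residual sample

# Cycle 1: a -> D -> (b_hat, z_hat) -> E -> a_hat. The shortcut is exact.
b_hat, z_hat = D_cheat(a)
cycle1_loss = np.abs(E_cheat(b_hat, z_hat) - a).mean()

# Cycle 2: (b, z) -> E -> a_hat -> D -> (b_tilde, _). Because E ignored b,
# D cannot recover it, and the reconstruction of b fails.
b_tilde, _ = D_cheat(E_cheat(b, z))
cycle2_loss = np.abs(b_tilde - b).mean()

print(cycle1_loss)    # 0.0 -- the shortcut satisfies Cycle 1 exactly
print(cycle2_loss)    # positive -- the shortcut is penalized by Cycle 2
```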
In practice, however, the networks achieve a far more subtle type of cheat, which eventually yields zero loss. At training time, the visual manifestation of the cheat is that the translations do not correspond to the inputs, and yet they are back-translated perfectly. The appendix figure (left) shows some examples of this behavior. Our experiments suggest that the networks generate outputs that facilitate reconstruction of the corresponding inputs, and that the networks treat these generated tensors differently from real tensors. In particular, when we generate a target-domain image, the source input tends to hide inside it, to facilitate its reconstruction by E. Similarly, when we generate a source-domain image, the target input tends to hide inside it, to facilitate its reconstruction by D. The appendix figure (middle and right) illustrates how to empirically reveal this behavior, and shows sample non-corresponding outputs from a converged “cooperative” model. For a brief reading of the figure, observe that the real and generated target-domain images appear visually identical, yet D decodes the real one into a closed mouth, and the fake one into a wide open mouth.
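One way to see how two nearly identical tensors can be decoded differently is a toy watermark. Everything here (the pattern, the amplitude, and the hypothetical `decode_mouth` helper) is an illustration, not the paper's networks: a faint, structured perturbation carries a hidden bit that a cheating back-translator can read, while the two inputs differ only slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A fixed high-frequency pattern in which a cheating translator could
# embed a hidden attribute (e.g., "mouth open"). Purely illustrative.
pattern = np.tile([1.0, -1.0], n // 2)

b_real = rng.normal(size=n)              # a real target-domain "image"
b_fake = b_real + 0.2 * pattern          # generated: same content + faint watermark

def decode_mouth(b, threshold=0.1):
    """A toy back-translator that secretly reads the watermark."""
    correlation = np.mean(b * pattern)
    return "open" if correlation > threshold else "closed"

# The two inputs are nearly identical, yet they are decoded differently.
print(decode_mouth(b_real))   # closed
print(decode_mouth(b_fake))   # open
```

The correlation with the known pattern averages out to near zero on the real input but stands out clearly on the watermarked one, which is exactly the asymmetry between real and generated tensors described above.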
Parallel work [4] has also observed this phenomenon, under the label of steganography. That work showed that the secret/cheating signal is often hidden in high frequencies, where presumably the discriminators are less effective. With sufficient training, a discriminator should learn to block this strategy (since such high-frequency content is not present in real examples), which would force the signal to shift to lower (and more semantically relevant) frequencies, as observed here.