Image Disentanglement and Uncooperative Re-Entanglement for High-Fidelity Image-to-Image Translation

01/11/2019 ∙ by Adam W. Harley et al. ∙ Oculus VR, LLC ∙ Carnegie Mellon University

Cross-domain image-to-image translation should satisfy two requirements: (1) preserve the information that is common to both domains, and (2) generate convincing images covering variations that appear in the target domain. This is challenging, especially when there are no example translations available as supervision. Adversarial cycle consistency was recently proposed as a solution, with beautiful and creative results, yielding much follow-up work. However, augmented reality applications cannot readily use such techniques to provide users with compelling translations of real scenes, because the translations do not have high-fidelity constraints. In other words, current models are liable to change details that should be preserved: while re-texturing a face, they may alter the face's expression in an unpredictable way. In this paper, we introduce the problem of high-fidelity image-to-image translation, and present a method for solving it. Our main insight is that low-fidelity translations typically escape a cycle-consistency penalty, because the back-translator learns to compensate for the forward-translator's errors. We therefore introduce an optimization technique that prevents the networks from cooperating: simply train each network only when its input data is real. Prior works, in comparison, train each network with a mix of real and generated data. Experimental results show that our method accurately disentangles the factors that separate the domains, and converges to semantics-preserving translations that prior methods miss.




1 Introduction

Unpaired cross-domain image-to-image translation is achieving exceptionally convincing results in a variety of domains [27]. High-fidelity image translation requires not only realism, but also strict preservation of the factors that are common to both domains. Consider Figure 1. We wish to translate an image of a face across two domains that mostly differ in texture. It is inappropriate for the translator to additionally change the face’s expression. Unfortunately, this failure mode is surprisingly common in standard unsupervised image-to-image models.

Besides the challenge of fidelity, image-to-image translation is made difficult by the fact that one domain may contain factors of variation that the other does not. For instance, consider the task of translating real photographs of a face (with arbitrary lighting and backgrounds) to uniformly-lit faces with black backgrounds. Following the approach of CycleGAN [27], we may create a translator for each direction, and train these jointly with a cycle-consistency loss (ensuring that forward and backward translation yields the original input), and an adversarial loss (ensuring that the mappings reach the target domains). But how can we expect this to work? The first translator needs to remove the lighting and background (to map to the second domain), and the second translator needs to add back the same lighting and background (to map to the first domain, and reconstruct the input).

Perhaps unsurprisingly, gradient descent tends to find a way around this issue, resulting in models that hide information (rather than remove it), allowing the information’s recovery. Unfortunately, this leads to models that do inaccurate translations (as shown in our experiments), since the “hiding” affects the accuracy of the translation. Several works have proposed to add a “residual” (or “style”) path to the cycle, which gives the model a sanctioned means to encode and reconstruct the auxiliary information [28, 9, 1]. We view this modified cycle as performing a disentanglement and subsequent re-entanglement: the first translator disentangles the input into (1) an image in the second domain, and (2) a residual; the second translator entangles these to reconstruct the input. In practice, however, we find that with or without the sanctioned residual path, standard optimization tends to “hide” information during translation, rather than fully disentangle as desired.

An unconstrained “residual” path can actually be detrimental to the final results. Note that while data-driven priors are assumed available for the translation endpoints, the residual (information in one domain but not the other) is generally unknown. While the implementor may hope that the model will use the “residual” path for disentanglement, it may instead exploit this path to encode the entire input, greatly facilitating the reconstruction task of the “entanglement” step (i.e., the cycle-consistency objective). Prior works have proposed a variety of methods to mitigate this problem, but usually at the cost of severely reducing the representational capacity of the residual (e.g., limiting it to 8 dimensions), and making strong assumptions about its distribution (e.g., assuming it is standard normal) [1, 9, 15, 17]. After applying these heavy constraints, some prior works report that the residual path is ignored by the model, unless its usage is facilitated by careful design choices (e.g., rather than simply concatenating the residual as an input, enforcing its usage as layer-wise normalization coefficients, applied throughout the second translator) [1, 17].

Our main insight is that the disentanglement-entanglement cycle is ineffective when the disentangler and entangler are allowed to cooperate. By “cooperate” we mean that they train on each other’s outputs. CycleGAN and its many variants [27, 9, 15, 6, 17, 1] all have a cooperative training setup: in each cycle, the first translator receives a real input, and the second translator receives a fake input (i.e., an attempted translation/disentanglement) which it back-translates, and both networks get penalized according to the reconstruction error. This essentially asks the second network to compensate for the first network’s errors. This is counter-productive, because if the second network succeeds, then the first network need not improve. Given sufficient optimization time, these cooperative setups find extremely effective “cheats”, in which subtle signals are encoded into low-fidelity forward translations and subsequently decoded to achieve near-perfect back-translation, thus defeating the reconstruction error [4].

Our main contribution is in preventing the networks from compensating for each other’s errors, via a simple optimization technique: simply train each network only when its input data is real. With this technique, neither network learns about the other’s behavior, which renders cooperation impossible. Instead, the back-translator simply preserves any errors made during forward-translation, and the reconstruction penalty is put entirely on the forward translator. This forces the networks to learn more faithful mappings to their target domains. The technique also constrains the residual path to encoding only “auxiliary” information (regardless of architecture), since the model is simply incapable of exploiting it for other purposes. In experiments with real images, we show that our optimization method delivers an obvious qualitative improvement over the current state-of-the-art, both in terms of semantics-preservation and residual-factor disentanglement. In synthetic data (where the residual is known), we demonstrate that our “uncooperative” optimization leads to quantitatively accurate disentanglement, whereas “cooperative” optimization does not.

2 Related Work

Image-to-image translation has recently attracted great attention, partly thanks to the success of generative adversarial networks (GANs) [7, 20, 26, 13]. The goal in image-to-image translation is to translate an image in one domain to a corresponding image in the second domain. Pix2Pix [10] trains models for this task using paired data from the two domains (i.e., input-output pairs exemplifying good translations). CycleGAN [27] removes the need for paired data by forming a translation “cycle” (forward translation followed by backward translation), which creates a natural reconstruction objective between the input and the back-translation. This is an important step, because in many domains, paired examples do not exist (e.g., a face in the exact same pose/expression in two different physical environments). CycleGAN often preserves the structural content of the images, but this may simply be a consequence of the convolutional architecture [16]. CycleGAN is only capable of learning one-to-one mappings, but several works (not all unsupervised) have proposed variants that are capable of one-to-many mappings, such as Augmented CycleGAN [1], DRIT [15], MUNIT [9], BicycleGAN [28], and cross-domain disentanglers [6]. These methods are able to generate diverse images with similar “content” (i.e., structural pattern) but different “style” (i.e., textural rendering) through disentanglement. These methods use strong assumptions or regularizations to avoid undesirable local optima, including shared latent spaces [9, 15, 6], losses on KL divergence from simple Gaussians [9, 15, 28], or low-dimensional representations [15, 1, 28, 6]. The effectiveness of these methods is therefore highly dependent on parameter selection.

Image factor disentanglement is necessary if we wish to control the latent factors in the generated images. Hadad et al. [8] assume the availability of attribute labels in a particular domain, where the goal is to disentangle images into a target domain plus a residual (i.e., “everything else”). Many disentanglement works also make strong assumptions on domain knowledge of the latent space, which includes having data pre-grouped according to individual factors [22, 14], or having exact knowledge of the structure and function of individual factors (e.g., for faces: identity, pose, shape, texture [24, 23]). In this work, we do not have attribute labels, we do not make assumptions on the latent space, and we perform disentanglement using only the unpaired image data. Similar to our method, InfoGAN [3] and MINE [2] are completely unsupervised, but the approach in these works is quite different: these methods maximize the mutual information between the inferred latent variables and the observations, while we use discriminators and reconstruction to achieve disentanglement.

Figure 2: Our domains and mapping functions. The disentangler D maps a v to a particular pair of c and r. The entangler E does the reverse mapping, from (c, r) tuples to a unique v.

3 Method

There are three key ingredients to our method: (1) adversarial priors, which encourage the translated images to be indistinguishable from ones in their target domain, (2) cycle-consistency, which encourages the translations to be invertible, and (3) “uncooperative” optimization, which ensures the networks do not “cheat” toward an undesirable local minimum.

3.1 Preliminaries

Let V and C be two image domains, such that the images v ∈ V have more information than the images c ∈ C. That is, V contains variation in some latent factor that is either constant or absent in C. This implies that the mapping V → C is many-to-one, and the mapping C → V is one-to-many. As a mnemonic, note that V is Variable in some aspect where C is Constant.

Let r be the residual information that is in V but not in C. Accessing this extra information allows us to form bijective (one-to-one) mappings, V → (C, R) and (C, R) → V. Note that R is not necessarily an image domain. In our implementation, each r is a collection of deep featuremaps at multiple scales, which allows its actual form to be determined entirely by the data.

Our goal is to learn functions that can map between these domains. We call the first mapping a disentanglement, denoted D, since it performs an intricate splitting operation: D : v → (c, r). We call the second mapping an entanglement, denoted E, since it performs a merging operation: E : (c, r) → v. Figure 2 shows a diagram of the domains and the mappings between them. Figure 3 relates the notation to the data and architecture. Note that D and E are inverses of one another.

Our input is a set of samples from V, and a set of samples from C. The two datasets are unpaired, and true correspondences might not exist.

3.2 Adversarial priors

Our model has two main networks, D and E. We would like to have D(V) = (C, R), and E(C, R) = V. To achieve this, we introduce adversarial networks A_C and A_V, which learn and impose priors on the distributions of our networks’ outputs.

The adversarial networks attempt to discriminate between real and fake (i.e., generated) samples of the domains V and C. In our notation, we distinguish “fake” samples with a prime symbol. We train our main networks against the adversarial labels with the least-squares loss [19]:

L_GAN = (A_C(c') - 1)^2 + (A_V(v') - 1)^2.

In a separate (but concurrent) optimization, we also train the parameters of the adversaries, with the losses L_{A_C} = (A_C(c) - 1)^2 + A_C(c')^2 and L_{A_V} = (A_V(v) - 1)^2 + A_V(v')^2.
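These least-squares objectives are simple to write down; below is a minimal numpy sketch, assuming scalar (or patch-averaged) discriminator outputs. The function names are illustrative, not from our code:

```python
import numpy as np

# Least-squares GAN losses (Mao et al. [19]). `d_fake` / `d_real` hold the
# discriminator's scores for generated and real samples; PatchGAN outputs
# would be averaged over patches into these arrays.

def lsgan_generator_loss(d_fake):
    # The generator is rewarded when the discriminator scores fakes as 1.
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_discriminator_loss(d_real, d_fake):
    # The discriminator is rewarded for scoring reals as 1 and fakes as 0.
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
```

Each of A_C and A_V gets its own instance of the discriminator loss, optimized separately from D and E.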

Note that we have no priors on the r samples generated by D, because there is no dataset of “true” r samples. Prior works manufactured a prior by assuming that r follows a low-dimensional Gaussian distribution (e.g., 8 dimensions, with zero mean and unit variance) [9, 15, 6, 17]. Here, we avoid this limiting assumption. We are able to do this because of our unique optimization procedure, detailed in Sec. 3.4. However, we do obtain some constraints on r by enforcing cycle consistency, described next.

Figure 3: Disentanglement-Entanglement architecture. The networks are D (disentangler), E (entangler), A_C (adversary on C), and A_V (adversary on V). Primes indicate tensors that are generated by the model during the forward pass. Each generated tensor is subject to either a reconstruction loss or an adversarial loss. r' is a multi-scale output of D; it is concatenated to the featuremaps of E at the corresponding scales.

3.3 Cycle consistency

On each training step, the model runs two “cycles”. Each cycle generates a reconstruction loss, which constrains the model to perform consistent forward-backward translation. Figure 3 illustrates the cycles.

In the first cycle, the disentangler D receives a random v from the dataset, and generates two outputs: (c', r'). These outputs are passed to the entangler E, which generates v'. If the disentanglement and re-entanglement are successful, this output should correspond to the original input. Therefore, we form the reconstruction objective ||v - v'||_1, where ||·||_1 denotes the L1 norm. In summary, this cycle performs v → (c', r') → v'.

The second cycle is symmetric to the first. The entangler E receives a random c from the dataset, and an r generated (via D) from a random v. Note that it is necessary to use generated r samples here, since R is completely determined by the network. We omit the prime on this r since it is treated as an input rather than an output. From the input (c, r), the entangler generates v'. We then pass v' to the disentangler, which generates two new outputs (c', r'). If the entanglement and disentanglement are successful, these outputs should correspond to the original inputs. We therefore form the reconstruction objectives ||c - c'||_1 and ||r - r'||_1. In summary, this cycle performs (c, r) → v' → (c', r').
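The two cycles can be sketched end-to-end. Here D and E are toy stand-ins (a split and a concatenation, mirroring the synthetic setting used later for evaluation) rather than the trained convolutional networks:

```python
import numpy as np

# Toy disentangler/entangler: D splits a vector into (c, r), E concatenates.
def D(v):
    return v[:1], v[1:]            # v -> (c', r')

def E(c, r):
    return np.concatenate([c, r])  # (c, r) -> v'

def l1(a, b):
    return np.abs(a - b).sum()     # L1 reconstruction error

def cycle1(v):
    # v -> (c', r') -> v'; penalize the reconstruction of v
    c_p, r_p = D(v)
    return l1(v, E(c_p, r_p))

def cycle2(c, r):
    # (c, r) -> v' -> (c', r'); penalize the reconstructions of c and r
    c_p, r_p = D(E(c, r))
    return l1(c, c_p) + l1(r, r_p)
```

Because this toy D and E are exact inverses, both cycle losses are zero; the training question is whether learned networks reach such a solution without “cheating”.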

Collecting the reconstruction objectives, we have

L_rec = λ_v ||v - v'||_1 + λ_c ||c - c'||_1 + λ_r ||r - r'||_1,

where the λ coefficients weight the three terms (see Sec. 3.5).
Observe that there is no “fidelity” objective on the translated tensor of each cycle (c' in Cycle 1, and v' in Cycle 2); these tensors only have an adversarial loss. In other words, there is nothing in the design to force c' to correspond to v, or v' to correspond to (c, r), other than the back-translation error. As we will show in experiments, this back-translation requirement is not sufficient, because the networks are able to cooperate on the back-translation: when E is the back-translator, it can compensate for errors made by D, and vice versa.

In practice, many of these “errors” are never corrected. Instead, they are adapted and refined, to minimize the adversarial loss while facilitating reconstruction. We call these “cheats”: undesirable outputs that yield near-zero loss. At convergence, cheats often take the form of a within-domain transformation: this causes the adversary to not impose a loss (since the output is still in the correct domain), yet allows the second network to (jointly) learn how to undo the transformation. These cheats are especially visible in experiments with faces, likely because humans are so sensitive to faces [5]. Figures 1 and 5 show clear examples of this cheating behavior: while the two domains only differ in texture/lighting, the networks learn to additionally (and unpredictably) alter the facial expression.

We observe that this undesirable solution to the reconstruction error requires both D and E to be complicit in the scheme. For example, if D transforms its input while translating it, but E is unaware of the cheat, E will not undo the transformation while back-translating, yielding a loss. This leads us to our optimization procedure, which essentially prevents D and E from cooperating in this way.

3.4 Uncooperative optimization

The total loss we wish to minimize is

L = L_GAN + L_rec.

As long as the forward translations land in the target domains, L_GAN is minimized; as long as the backward translations reconstruct their inputs, L_rec is minimized.

As explained, there is a local minimum to this loss, in which forward translation includes an undesirable transformation, and back-translation includes an inverse transformation. This ruins the fidelity of the translation.

To reach this bad minimum, each network needs to learn two functions: (1) translating “real” inputs into outputs in a target domain, and (2) decoding “fake” inputs by undoing generated translations. Referring to Figure 3, the disentangler D learns the first task in Cycle 1, and learns the second task in Cycle 2; the entangler E learns these functions in the opposite order. In other words, the networks learn to perform different tasks depending on their input: given real inputs, they translate; given fake inputs, they decode.

To prevent this from happening, we prevent the networks from learning how to decode fake inputs. We do this by freezing the networks when they receive fake inputs. When a network is “frozen”, it is treated as a fixed but differentiable function, so that gradients flow through it, but it does not learn. Referring again to Figure 3, this means training D only in the first cycle, and training E only in the second cycle (where they respectively receive real inputs).

With this optimization technique, the networks are incapable of learning how to compensate for each other’s errors. This means that an erroneous forward translation will always be taken “at face value” by the backward translator, and produce an appropriate loss. This is because the backward translator’s only experience (in terms of gradient steps) comes from real data.

This method is a type of alternating optimization, in the sense that we keep one set of parameters fixed while optimizing the other set, and alternate. In practice, we alternate on every step. Specifically, we do a forward pass through Cycle 1, freeze E while we update D, and then do a forward pass through Cycle 2, and freeze D while we update E.
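To make the alternation concrete, consider a scalar toy model (a hypothetical stand-in, not our networks): D multiplies its input by wd, E multiplies by we, and the cycle-1 reconstruction is we·wd·v. Gradients flow through the frozen partner, but only the active network's parameter is stepped:

```python
# Uncooperative (alternating) optimization on a scalar toy model.
class Toy:
    def __init__(self):
        self.wd, self.we = 0.5, 0.5   # parameters of D and E

    def cycle1_step(self, v, lr=0.1):
        # Cycle 1 loss: (we*wd*v - v)^2. E is frozen: its parameter appears
        # in the gradient (gradients flow through it), but only wd is updated.
        err = self.we * self.wd * v - v
        self.wd -= lr * 2 * err * self.we * v

    def cycle2_step(self, v, lr=0.1):
        # Cycle 2 is symmetric: D is frozen, only we is updated.
        err = self.wd * self.we * v - v
        self.we -= lr * 2 * err * self.wd * v
```

Alternating the two steps drives the product we·wd toward 1 (perfect reconstruction), with neither parameter ever being trained on the other network's fake output.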

Including the independent update required for the adversarial networks, this setup requires three optimizers in total.

3.5 Implementation details

Network architecture

Our implementation is based on CycleGAN [27]. The translators’ architecture originally comes from Johnson et al. [11]: two stride-2 convolutions, four residual blocks, and two transposed convolutions.

We implement the disentangler as two separate networks: one for the c stream and one for the r stream; the r stream ends before the transposed convolutions. We found this worked significantly better than using a single network to produce both c and r.

The entangler uses the same architecture, except it receives skip connections from the r stream. There are three such connections: the first uses the featuremap produced after the stride-2 convolutions; the second uses the featuremap after two residual blocks; and the third uses the featuremap after the next (and final) two residual blocks. These featuremaps are simply concatenated with the corresponding featuremaps in E. The intent with multiple skip connections is to allow the network the capacity to transfer residuals at multiple levels of scale and abstraction. Our model has fewer residual blocks than CycleGAN, but the added r stream makes the total parameter count similar.
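The merge itself is just a per-scale channel concatenation; a minimal sketch (the channel counts below are illustrative, not our exact architecture):

```python
import numpy as np

# Concatenate the r-stream featuremaps with E's featuremaps at the
# matching scales, along the channel axis.
def fuse(e_feats, r_feats):
    # e_feats, r_feats: lists of (C, H, W) arrays at matching resolutions.
    return [np.concatenate([e, r], axis=0) for e, r in zip(e_feats, r_feats)]
```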

For discriminators, we use the PatchGANs [10] that were also used in CycleGAN. We apply spectral normalization to the weights of the discriminators [21], which we found stabilizes the adversarial training.


Training details

We set the reconstruction coefficients on v and c to ten times the GAN loss, so λ_v = λ_c = 10. We use a smaller coefficient on the r reconstruction, since it is a much larger tensor. We update the discriminators using generated images drawn randomly from a history buffer of size 50. We use the Adam solver [12], with a batch size of 4 and a learning rate of 0.0002. After the reconstruction errors stop descending, we linearly decay the learning rate to zero. In total, training can take up to 300,000 steps, which is approximately 3 days on a single Nvidia GTX 1080 Ti. This is slower convergence than a traditional CycleGAN (which takes approximately 100,000 iterations on our data), likely because the objective is harder to optimize when “cheating” is disallowed.
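The size-50 history buffer follows the generated-image pool used in CycleGAN; a minimal sketch (the 50% swap probability follows CycleGAN's implementation and is an assumption here, not stated above):

```python
import random

class ImageHistoryBuffer:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.buffer = []

    def query(self, image):
        # Until the pool is full, store the fresh image and return it.
        if len(self.buffer) < self.capacity:
            self.buffer.append(image)
            return image
        # Afterwards, half the time return a stored image and stash the
        # new one in its place; otherwise return the fresh image.
        if random.random() < 0.5:
            idx = random.randrange(self.capacity)
            old, self.buffer[idx] = self.buffer[idx], image
            return old
        return image
```

The discriminators are then updated on `query(generated_image)` rather than always on the freshest generation, which damps oscillations in adversarial training.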

Simplified settings for synthetic data

For the experiments with synthetic data, we use a model with fewer parameters. We implement each generator as a fully-connected network with one hidden layer of 32 units and ReLU activation. We implement each adversarial discriminator as a fully-connected network with one hidden layer of 32 units, and leaky ReLU activation. Our experiments suggest that the discriminators have more than sufficient capacity to correctly learn the distributions of V and C and keep equilibrium with the generators. We use the same training setup as in the real-image experiments, except we set the batch size to 128; training to convergence takes approximately 60,000 iterations, which is 1 hour on a single GPU.

Figure 4: Uncooperative vs. cooperative optimization results on the objective function (left) and correlation with the ground-truth latent factor (right), over training steps. At convergence, uncooperative optimization achieves near-perfect disentanglement of the latent factor, whereas cooperative optimization does not.
Figure 5: Domain translation results on the face dataset, compared with MUNIT and CycleGAN. During translation in either direction, MUNIT and CycleGAN sometimes alter the expression of the subject, whereas our approach keeps the expression intact.

4 Experiments

In this section, we demonstrate that our method outperforms prior work on (1) accuracy of disentanglement, (2) fidelity of translation, and (3) coverage of modes (in multi-modal translations). Ground-truth disentanglements do not exist in real image data, so we use a simple synthetic scenario to quantitatively evaluate accuracy, then present real-world qualitative results for fidelity and coverage.

4.1 Disentanglement accuracy

One of our claims is that uncooperative optimization is critical for accurate disentanglement. This is based on the idea that uncooperative models are less able to find “cheats” that bypass the need for accuracy.

In other words, we need to show that “uncooperative” optimization leads to correctly disentangling r from within v, in a setting where “cooperative” optimization fails. It is surprisingly easy to find such a scenario. We present one here, in which the ground-truth factors are 1D, and entanglement/disentanglement is simply concatenation/splitting. We find that cooperative optimization is incapable of learning this simple operation (under the given data availability assumptions), whereas uncooperative optimization succeeds.


In this experiment, we use two identical models (see the “Simplified settings for synthetic data” in Sec. 3.5), and change only the optimization method: one uses the proposed “uncooperative” optimization, and the other uses the baseline “cooperative” optimization.


Since ground-truth latent factors are generally unknown in real data, it is necessary to design synthetic data for this experiment. We define the latent factors c and r to be samples from 1D Gaussian distributions (with different parameters for each). We generate synthetic entanglements by concatenating a sample c with a sample r. We find that results are not sensitive to dimensionality (except in convergence time), and so present only the simplest version here, setting the dimensionality of both c and r to 1, making the dimensionality of v equal to 2. Note that the domain R is never encountered at training time, except in its entangled form inside V. The task is to recover R, using only disentanglement/entanglement cycles, and unpaired samples of V and C.
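The data generation can be sketched as follows; the Gaussian parameters here are illustrative placeholders (not the exact values used), and the c samples are shuffled so that training sees only unpaired v and c:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(n, c_params=(0.0, 1.0), r_params=(3.0, 0.5)):
    # Draw 1D c and r from two different Gaussians (illustrative parameters).
    c = rng.normal(c_params[0], c_params[1], size=(n, 1))
    r = rng.normal(r_params[0], r_params[1], size=(n, 1))
    # Ground-truth entanglement: v is the concatenation [c, r].
    v = np.concatenate([c, r], axis=1)
    # Training sees unpaired v and c; r is held out for evaluation only.
    return v, rng.permutation(c), r
```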


We measure the relationship between the actual domain R (used to generate the v samples) and the learned domain (disentangled from the v samples) using the Pearson correlation coefficient ρ, whose magnitude equals 1 if the two variables have a perfectly linear relationship, and is closer to 0 otherwise. This (unlike a distance metric) allows solutions where the learned r is a scaled version of the true r, which is appropriate since the scaling may be absorbed in the model weights.
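A minimal numpy implementation of the metric:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation: covariance normalized by the product of
    # standard deviations; invariant to scaling and shifting.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
```

A learned residual that is any rescaling of the true factor scores |ρ| = 1, which is the desired behavior here.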


Our results are summarized in Figure 4. The two models converge in approximately the same number of iterations. At the end of training, the cooperative version achieves a correlation coefficient of 0.695, while the uncooperative version achieves 0.998. Results vary slightly across iterations (and across initializations), but correlation does not noticeably improve for the cooperative version, even if training is extended to 200k iterations.

Overall, this shows that uncooperative optimization leads the model to disentangle the true latent factors, while cooperative optimization does not.

4.2 High-fidelity translation

One of our claims is that the uncooperative training leads to high-fidelity translations. By this, we mean that the translation retains as much information as possible from the input, without altering it. To evaluate this, we compare our model’s forward translations against those of CycleGAN and MUNIT.


CycleGAN is a popular baseline in unsupervised (but unimodal) image-to-image translation; our architecture is based on it. MUNIT is a state-of-the-art unsupervised multimodal image-to-image translation method.


We note that MUNIT was originally applied to translating between widely different domains, e.g., translating dogs to lions. While this type of translation is impressive, it is also difficult to evaluate, and it is not clear that close pixel-wise correspondence/fidelity is even desirable in that task.

In this paper, we primarily focus on translating a human face across two appearance domains: photos of the face captured by a head-mounted camera, and renders of the face produced by a parametric face model (already adapted to the input face). This has an application in social virtual/augmented reality (VR/AR), where we would like users to interact with each other “face-to-face” (inside the virtual environment) as naturally as possible.

We collected the face data ourselves. The real photos (representing the domain V) were captured by a camera attached to the actor’s headset, with the lens pointed toward the bottom half of the actor’s face; lighting variation was achieved with a set of lights surrounding the actor; background variation was achieved by placing large computer monitors behind the actor and displaying random images. Rendered images of the same face (representing the domain C) were produced by fitting a deep parametric face model to the actor [18], and generating random expressions from a viewpoint similar to the headset view. There are 7074 real photos, and 1000 rendered images. The task is to translate a photo of a face to (or from) a render-like image of the same face, while maintaining the face’s expression.

For completeness, we also show results on translating architectural facades ↔ labels [25], which is a task used in prior work [27]. We have also experimented with the aerial photos ↔ Google maps task [27], but did not find noticeable differences between the methods on that task.


In the face image experiments (which are necessarily qualitative), we rely on the fact that humans are extremely adept at reading faces [5], and attempt to demonstrate that our model achieves obviously better disentanglements than prior methods. The results on aerial and facade data (introduced in prior work) are harder to interpret at a glance, but close inspection can reveal differences in sharpness and spatial consistency with the input. We note that even when ground truth translations exist, it does not make sense to evaluate against them, since these are many-to-one/one-to-many mappings, and totally unsupervised models (as considered here) cannot be expected to generate labels that match the ground truth (e.g., as assumed in the “FCN score” used in Pix2Pix [10] and CycleGAN [27]).


Figure 5 compares our method against MUNIT and CycleGAN on the face dataset. The results show that while CycleGAN and MUNIT perform the appearance translation, they make small but very noticeable shifts in the facial expression, e.g., turning a closed mouth into a smile, or changing a grimace to a pout. This is due to the drawbacks of cooperative training, described earlier. Our method does not have this problem, and translates the faces across domains without altering expression. Figure 6 shows the same experiment but for the facades ↔ labels task, with similar results: while our method retains, for instance, the exact spatial positions of the features in either domain, the baseline methods tend to make small shifts in position and scale.

4.3 Multi-modal outputs

Our model is designed to produce multi-modal outputs, through a “mix-and-match” method, where we use c from one input and r from another input, and entangle these to form a novel sample of V. We compare against MUNIT, which is the current state-of-the-art method for this task.

More specifically, generating multiple outputs from a single input involves the following steps: (1) given v1 as input, generate (c1', r1'); (2) given an unrelated v2 as input, generate (c2', r2'); (3) entangle (c1', r2') to produce the composite v'. In the face context, since the C domain contains expression but not lighting, this setup means extracting expression from one image, extracting everything else (which is mostly lighting and backgrounds) from another image, and combining these factors into a new image. The experimental setup is similar for MUNIT: a “content code” is generated from v1, and a “style code” is generated from v2, and these are combined into the final output. We do this for multiple v2, to show the effect of transferring a variety of residual factors onto the same face.
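The three steps can be sketched with the same toy split/concat stand-ins for D and E used in the synthetic setting (the trained networks replace these in practice):

```python
import numpy as np

def D(v):
    return v[:1], v[1:]            # v -> (c', r')

def E(c, r):
    return np.concatenate([c, r])  # (c, r) -> composite v'

def mix_and_match(v1, v2):
    c1, _ = D(v1)   # keep v1's domain-C content (e.g., expression)
    _, r2 = D(v2)   # take v2's residual (e.g., lighting/background)
    return E(c1, r2)
```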


We use the same face data as in the high-fidelity task, and add the aerial photos ↔ Google maps task [27], which we find has more evident multi-modality than the facades data.


Figure 8 shows the results of this experiment on the faces dataset, for MUNIT and our model. The table shows expressions from v1 across rows, and residuals from v2 (i.e., lighting/background conditions) across columns. An overview of the results can be obtained by scanning across each row to verify that expression is transferred unchanged from the leftmost column, and scanning down each column to verify that lighting and backgrounds are transferred unchanged from the top row. MUNIT appears to have only learned to transfer the global intensity from the residual source. Our model appears to be transferring backgrounds, and even casting distinct shadows onto the face. However, some shadows appear reduced in intensity (e.g., third column), suggesting that expression-lighting disentanglement is not perfect here.

We also show results of this experiment on the aerial photos ↔ Google maps dataset, where we treat the Google map as c (assuming it has less information), and the aerial photo as v. Results are summarized in Figure 7, in the same format as the face relighting results. In this domain, it appears MUNIT transfers very little from the residual, while our model incorporates textures and objects (e.g., note the white object transferred from the first residual). Both methods appear to retain the spatial layout of the input map.

Figure 6: Domain translation results on facade/label images. While MUNIT and CycleGAN introduce artifacts (which make back-translation easier during training), our model performs high-fidelity translation.
Figure 7: Aerial image composition of our method (left) vs MUNIT (right). Our method successfully transfers textures from the residual, while MUNIT does not; both retain the spatial structure of the map in this case.
Figure 8: Face relighting results of MUNIT (top) vs our model (bottom). In each table, the leftmost column shows the input from which expression is drawn; the top row shows the input from which everything else is drawn.

5 Discussion

In this work, we address the compensation issue in translation cycle-consistency, which typically diminishes the utility of the reconstruction loss. In compensation, the back-translator (undesirably) adapts to the weaknesses and shortcuts of the forward-translator. Hypothetically, there is another way to (partially) defeat the loss, which may be called exploitation: the forward-translator (undesirably) adapts to the weaknesses and shortcuts of the back-translator. This enduring exploitation issue may explain the subtle imperfections in our outputs.

Another limitation is that our approach does not address many-to-many mappings: it is only multi-modal in one direction.

In summary, we introduced the problem of high-fidelity image-to-image translation, motivated it with augmented reality applications, and presented an unsupervised method for solving it. We identified a fundamental cause of low-fidelity translations: cooperation between the forward-translator and the back-translator, which allows the forward-translator to “hide” information, and the back-translator to “recover” from noticeable errors. This is a critical problem in real applications. We presented an “uncooperative” optimization scheme that prevents this cooperation. Our results demonstrate that uncooperative optimization leads to high-fidelity image translations, making image-to-image translation not only fun, but useful for augmented reality.


A How “cheating” happens in practice

It is relatively easy to see how “uncooperative” optimization prevents the networks from developing a “cheating” scheme, since each network only trains when its inputs are real. It is less easy to see how a “cheating” scheme can develop at all, considering the losses that already constrain the model. In this section, we first summarize a tempting (but flawed) argument suggesting that “cheating is penalized by the losses”, and then demonstrate how this intuition is proven wrong in practice.
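The uncooperative rule, that each network trains only when its input is real, can be sketched with two toy translators. This is a minimal illustration under assumptions: `F` and `G` are hypothetical linear stand-ins for the forward- and back-translators, and the adversarial terms are omitted so that only the cycle-consistency losses and the gradient gating are shown.

```python
import torch
import torch.nn as nn

F = nn.Linear(8, 8)  # toy forward-translator, X -> Y
G = nn.Linear(8, 8)  # toy back-translator,    Y -> X
opt_F = torch.optim.SGD(F.parameters(), lr=1e-2)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-2)

def set_trainable(net, flag):
    for p in net.parameters():
        p.requires_grad_(flag)

x = torch.randn(4, 8)  # real samples from domain X
y = torch.randn(4, 8)  # real samples from domain Y

# Cycle 1 starts from real x, so only F (whose input is real) may train;
# G receives the generated F(x), so it is frozen. Gradients still flow
# *through* the frozen G back to F's parameters.
set_trainable(F, True); set_trainable(G, False)
loss_x = (G(F(x)) - x).abs().mean()
opt_F.zero_grad(); loss_x.backward(); opt_F.step()

# Cycle 2 starts from real y, so only G may train; F is frozen.
set_trainable(F, False); set_trainable(G, True)
loss_y = (F(G(y)) - y).abs().mean()
opt_G.zero_grad(); loss_y.backward(); opt_G.step()
```

Under cooperative training, by contrast, both networks would receive updates from both cycles, which is what lets the back-translator learn to compensate for the forward-translator's errors.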

To see how cheating may intuitively seem impossible, consider the following, with reference to Figure 3 in the main text. Suppose the residual code is used as a “shortcut” to cheat Cycle 1, in the sense that D copies its residual input into its output image, and E then copies that signal back into its residual estimate, meeting the cycle-consistency constraint on the residual. Meanwhile, to meet the adversarial constraint, D may write any target-domain image into its output. But this leads to errors in Cycle 2: if E simply copies its input image into the residual, and/or D does not produce an output which strictly corresponds to its input, then the shared content is essentially ignored, and we will have a mismatched reconstruction and a non-zero loss. Therefore, it seems that cheating should be eliminated at convergence.

In practice, however, the networks achieve a far more subtle type of cheat, which eventually yields zero loss. At training time, the visual manifestation of the cheat is that the translations do not correspond to the inputs, and yet they are back-translated perfectly. The accompanying figure (left) shows some examples of this behavior. Our experiments suggest that the networks generate outputs that facilitate reconstruction of the corresponding inputs, and that the networks treat these generated tensors differently from real tensors. In particular, when we generate an image, the code needed to reconstruct the input tends to hide inside it, facilitating its reconstruction by E. Similarly, when we generate a code, the input image tends to hide inside it, facilitating its reconstruction by D. The accompanying figure (middle and right) illustrates how to empirically reveal this behavior, and shows sample non-corresponding outputs from a converged “cooperative” model. For a brief reading of the figure, observe that the real and generated codes appear visually identical, yet D decodes the real code into a closed mouth, and decodes the fake code into a wide-open mouth.

Parallel work [4] has also observed this phenomenon, under the label of steganography. That work showed that the secret/cheating signal is often hidden in high frequencies, where presumably the discriminators are less effective. With sufficient training, a discriminator should learn to block this strategy (since such high-frequency content is not present in real examples), which would force the signal to shift to lower (and more semantically-relevant) frequencies, as observed here.
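A hidden high-frequency signal of this kind can be probed empirically, in the spirit of the steganography analysis in [4]: if back-translation relies on a low-amplitude code hidden in the translation, then tiny additive noise on the translation should disproportionately corrupt the reconstruction. The sketch below is illustrative only; `translate` and `back_translate` are hypothetical stand-ins for a trained forward/backward pair, and `sigma` is an assumed noise scale.

```python
import numpy as np

def probe_hidden_signal(translate, back_translate, x, sigma=0.01, seed=0):
    """Return how much a tiny perturbation of the translation inflates
    reconstruction error; a large ratio suggests a fragile hidden code."""
    rng = np.random.default_rng(seed)
    y = translate(x)                      # forward translation
    clean = back_translate(y)             # reconstruction from clean y
    noisy = back_translate(y + sigma * rng.standard_normal(y.shape))
    err_clean = np.abs(clean - x).mean()
    err_noisy = np.abs(noisy - x).mean()
    return err_noisy / max(err_clean, 1e-8)
```

A model that reconstructs from visible semantic content should yield a ratio near one; a model that reconstructs from a hidden high-frequency signal should yield a much larger ratio.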