
Disentangled Unsupervised Image Translation via Restricted Information Flow

11/26/2021
by Ben Usman, et al.
Boston University

Unsupervised image-to-image translation methods aim to map images from one domain into plausible examples from another domain while preserving structures shared across two domains. In the many-to-many setting, an additional guidance example from the target domain is used to determine domain-specific attributes of the generated image. In the absence of attribute annotations, methods have to infer which factors are specific to each domain from data during training. Many state-of-art methods hard-code the desired shared-vs-specific split into their architecture, severely restricting the scope of the problem. In this paper, we propose a new method that does not rely on such inductive architectural biases, and infers which attributes are domain-specific from data by constraining information flow through the network using translation honesty losses and a penalty on the capacity of domain-specific embedding. We show that the proposed method achieves consistently high manipulation accuracy across two synthetic and one natural dataset spanning a wide variety of domain-specific and shared attributes.



1 Introduction

The goal of unsupervised image-to-image translation is to learn a mapping between two sets of images (or two domains) without any pair supervision. For example, the face domains in Figure 1a are defined by gender and share variability in poses and backgrounds, so a correct cross-domain mapping must change the face gender but preserve the pose and the background of the original. When one domain has some unique attributes absent in the other domain, like males with and without beards in CelebA [CelebAMask-HQ], the question of whether the “correct” one-to-one cross-domain mapping should add a beard to a specific female face does not have a well-defined answer. However, if we alter the problem definition by providing a “guide” image specifying male-specific factors, the resulting unsupervised many-to-many translation problem has a well-defined correct solution: the learned mapping must preserve all attributes of the source image that are shared across the two domains and take the values of attributes unique to the target domain from the guide input (e.g. preserve the pose and background of the female source and add a beard from the male guide in the example above).

Figure 2: Flaws in existing methods. On Shapes-3D-A [kim2018disentangling, bashkirova2021evaluation], all prior methods fail to either preserve shared attributes of the source (shape, object color), or apply target-specific attributes of the guide (size, orientation), or both.

Most state-of-the-art methods for unsupervised many-to-many translation implicitly assume that the domain-specific variations can be modeled as “global style” (textures and colors) by hard-coding this assumption into their architectures via adaptive instance normalization (AdaIN) [huang2017arbitrary], originally proposed for style transfer. However, this choice severely restricts the kinds of problems that can be efficiently solved. More specifically, AdaIN-based methods [huang2018multimodal, choi2020stargan] inject domain-specific information from the guide image via a global feature re-normalization that forces colors, textures, and other global statistics to always be treated as domain-specific factors regardless of their actual distribution across the two domains. As a result, AdaIN-based methods change the colors and textures of the input to match the guide image during translation even if colors and textures vary across both domains and should not change. For example, background textures in the female-to-male setting (Figure 1a) vary in both domains and therefore should be preserved; the same holds for hair color in the children-to-adults setting (Figure 1c). Even on a toy, perfectly balanced problem (Figure 2), AdaIN-based methods (e.g. MUNIT [huang2018multimodal]) change the object color of the input to match the object color of the guide, even though object color varies in both domains and thus should be preserved.

Autoencoder-based methods [almahairi2018augmented, DRIT_plus, benaim2019didd], on the other hand, preserve shared information better, but often fail to apply the correct domain-specific factors. For example, DIDD [benaim2019didd] preserved the object color of the source in Fig. 2, but failed to extract and apply the correct orientation and size from the guide. Overall, both our experiments and recent advances in the evaluation of many-to-many image translation [bashkirova2021evaluation] show that existing methods generally either fail to preserve shared attributes or fail to apply domain-specific factors well.

In this paper, we propose Restricted Information Flow for Translation (RIFT) - a novel approach that does not rely on an inductive bias provided by AdaIN and achieves high attribute manipulation accuracy across different kinds of attributes regardless of whether they are shared or domain-specific. As illustrated in Figure 3, during “brunet male-to-female translation” our method preserves shared factors (background and pose) of the input male face, and encodes male-specific attributes (mustache) in a domain-specific embedding to enable accurate reconstruction of the source image. The core observation at the heart of our method is that only values of shared attributes (background and pose) of the source can be encoded naturally in a generated image from the target domain, whereas source-specific attributes (mustache) can be encoded in the generated image only by “hiding” them in the form of structured adversarial noise [bashkirova2019adversarial]. With this in mind, we propose using the translation honesty loss [bashkirova2019adversarial] to penalize the model for “hiding” [chu2017cyclegan] a mustache inside the generated female image, and the embedding capacity loss to penalize the model for encoding shared factors into the domain-specific embedding. As a result, information about the mustache is forced out of the generated female image into the domain-specific embedding, while information about the pose and background is forced out of domain-specific embeddings into the translation result - resulting in proper disentanglement of domain-specific and domain-invariant factors.

We measure how well RIFT models different kinds of attributes as either shared or domain-specific across three splits of Shapes-3D [kim2018disentangling], SynAction [sun2020twostreamvan] and Celeb-A [CelebAMask-HQ] following an evaluation protocol similar to the one proposed by bashkirova2021evaluation. Our experiments confirm that the proposed method achieves high attribute manipulation accuracy without relying on an inductive bias towards treating certain attribute kinds as domain-specific hard-coded into its architecture.

Figure 3: Overview of the proposed method – RIFT. Source male image and male-specific factors (green), female guide input and female-specific attributes (blue), shared attributes (red).

2 Related work

Image-to-image translation.

In contrast to task-specific image translation methods [cao2017unsupervised, guadarrama2017pixcolor, lugmayr2019unsupervised, qu2018unsupervised, gatys2015neural, ulyanov2016texture], early unsupervised image-to-image translation methods, such as CycleGAN [zhu2017unpaired] and UNIT [liu2017unsupervised], infer semantically meaningful cross-domain mappings from arbitrary pairs of semantically related domains without pair supervision. These methods assume a one-to-one correspondence between examples in the source and target domains, which makes the problem ill-posed if at least one of the two domains has some unique domain-specific factors, as we discussed in Section 1.

Many-to-many translation.

To account for such domain-specific factors, and to enable control over them in the translation results, many-to-many image translation methods [huang2018multimodal, almahairi2018augmented, choi2020stargan, liu2019few, DRIT_plus] have been proposed. These methods separate domain-invariant “content” from domain-specific “style” using separate encoders. Following [bashkirova2021evaluation], we avoid the terms “content” and “style” to distinguish general many-to-many translation from its subtask, style transfer [gatys2015neural].

Adaptive instance normalization.

Many state-of-the-art many-to-many translation methods, such as MUNIT [huang2018multimodal], FUNIT [liu2019few] and StarGANv2 [choi2020stargan], use AdaIN [huang2017arbitrary], originally proposed for style transfer [gatys2015neural]. More specifically, these methods modulate activations of the decoder with the domain-specific embedding of the guide. This architectural choice was shown to limit the range of applications of these methods to cases where domain-specific information lies within textures and colors [bashkirova2021evaluation].

Autoencoders.

In contrast, methods like Augmented CycleGAN [almahairi2018augmented], DRIT++ [DRIT_plus] and Domain Intersection and Domain Difference (DIDD) [benaim2019didd] rely on embedding losses and are therefore more general. For example, DIDD forces domain-specific embeddings of the opposite domain to be zero, while DRIT++ uses adversarial training to make the source and target content embeddings indistinguishable.

Cycle losses.

Most methods [huang2018multimodal, almahairi2018augmented] also use cycle-consistency losses on domain-specific embeddings to ensure that information extracted from the guidance image is not ignored during translation, and a cycle loss on images to improve semantic consistency [chu2017cyclegan]. However, cycle-consistency losses on images have been shown [chu2017cyclegan, bashkirova2019adversarial] to force one-to-one unsupervised translation models to “cheat” by hiding domain-specific attributes inside translations.

Overall, prior methods ensure that the guide input modulates the translation result in some non-trivial way, but, to our knowledge, no prior work explicitly addresses adversarial embedding of domain-specific information into the translated result or ensures that domain-invariant factors are preserved during translation; this work fills that gap.

3 Restricted Information Flow for Translation

In this section we first formally introduce the many-to-many image translation problem and then describe how our method solves it. Our model reconstructs input images from translation results and domain-specific embeddings as illustrated in Fig. 3, forcing domain-invariant information out of the domain-specific embedding using capacity losses, and forcing domain-specific information out of the generated translation using honesty losses.

Setup.

Following [huang2018multimodal], we assume that we have access to two unpaired image datasets X_A and X_B that share some semantic structure but differ visually (e.g. male and female faces with poses, backgrounds and skin color varied in both). In addition, each domain has some attributes that vary only within that domain, e.g. only males have variation in the amount of facial hair and only females have variation in hair color (as in Figure 1). Our goal is to find a pair of guided cross-domain mappings T_AB and T_BA such that, for any source input x_A and guide input x_B from the respective domains (and vice versa), the resulting guided cross-domain translations T_AB(x_A, x_B) and T_BA(x_B, x_A) look like plausible examples of the respective output domains, share domain-invariant factors with their “source” arguments (x_A and x_B respectively), and take domain-specific attributes from their “guidance” arguments (x_B and x_A respectively). This general setup covers the absolute majority of real-world image-to-image tasks. For example, the correct guided female-to-male mapping applied to a female source image and a male guide image should generate a new male image with the pose, background, skin color, and other shared factors of the female input, and the facial hair of the male guide, because poses, backgrounds and skin color vary in both domains, while facial hair is male-specific.

Figure 4: Losses used to train RIFT. For illustration purposes, we use the 3D-Shapes-A split described in Section 4 and illustrated in Figure 5. When the model is trained, green arrows carry only B-specific information (floor and wall color), blue arrows carry only A-specific information (orientation and size), and red arrows carry information shared across the two domains (object color and shape).

Method.

While it might be possible to approximate the mappings T_AB and T_BA directly, following prior work we split each one into two learnable parts: encoders E_A and E_B that extract domain-specific information from the corresponding guide images, and generators G_A and G_B that combine that domain-specific information with a corresponding source image, as illustrated in Figure 4. The final many-to-many mappings are simply compositions of encoders and generators:

T_AB(x_A, x_B) = G_B(x_A, E_B(x_B)),    T_BA(x_B, x_A) = G_A(x_B, E_A(x_A)).

Our goal is to ensure that the encoders extract all domain-specific information from their inputs (and nothing more), and that the generators use that information, along with domain-invariant factors from their source inputs, to form plausible images from the corresponding domains.
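To make this decomposition concrete, below is a minimal PyTorch-style sketch of the guided translation composition. The module definitions, embedding size, and image resolution are illustrative assumptions and not the architecture used in this paper (which is described in Section 4); the sketch only shows how T_AB and T_BA are composed from encoders and generators.

# Minimal sketch of the guided translation composition (illustrative toy
# modules, not the exact RIFT architecture).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy domain-specific encoder: guide image -> low-dimensional embedding."""
    def __init__(self, emb_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Toy generator: (source image, domain-specific embedding) -> translation."""
    def __init__(self, emb_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + emb_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x_src, emb):
        # Broadcast the embedding spatially and concatenate it with the source.
        b, _, h, w = x_src.shape
        emb_map = emb.view(b, -1, 1, 1).expand(b, emb.shape[1], h, w)
        return self.net(torch.cat([x_src, emb_map], dim=1))

# Final many-to-many mappings are compositions of encoders and generators:
#   T_AB(x_a, x_b) = G_b(x_a, E_b(x_b)),  T_BA(x_b, x_a) = G_a(x_b, E_a(x_a))
E_a, E_b, G_a, G_b = Encoder(), Encoder(), Generator(), Generator()
x_a = torch.randn(4, 3, 64, 64)   # source batch from domain A
x_b = torch.randn(4, 3, 64, 64)   # guide batch from domain B
t_ab = G_b(x_a, E_b(x_b))         # guided A -> B translation
t_ba = G_a(x_b, E_a(x_a))         # guided B -> A translation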

Noisy cycle consistency loss.

First, to ensure that no attribute of the input images is ignored completely (i.e. that each attribute is treated as either domain-specific, or domain-invariant, or both), we use a guided analog of the cycle consistency loss. This loss ensures that any image translated into the other domain and translated back with its original domain-specific embedding is reconstructed perfectly. Additionally, as the first step towards restricting the amount of information passed through each branch, we add zero-mean Gaussian noise of amplitude sigma_T or sigma_E (and appropriate shape) to translations and domain-specific embeddings respectively, before reconstructing the images back:

L_cyc^A = E || G_A( T_AB(x_A, x_B) + sigma_T * eps_1 , E_A(x_A) + sigma_E * eps_2 ) - x_A ||,   eps_1, eps_2 ~ N(0, I),

and analogously for L_cyc^B.
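As a rough illustration of this loss, the A -> B -> A direction might look as follows; this is a sketch under the notation above, and the L1 reconstruction form and the names sigma_t, sigma_e are assumptions.

# Sketch of the noisy cycle-consistency loss for the A -> B -> A direction
# (illustrative; the L1 form and noise amplitudes are assumptions).
import torch
import torch.nn.functional as F

def noisy_cycle_loss_a(x_a, x_b, E_a, E_b, G_a, G_b, sigma_t=0.1, sigma_e=0.1):
    """|| G_a( T_AB(x_a, x_b) + noise, E_a(x_a) + noise ) - x_a ||_1"""
    t_ab = G_b(x_a, E_b(x_b))                             # guided A -> B translation
    e_a = E_a(x_a)                                        # A-specific embedding of the source
    t_ab_noisy = t_ab + sigma_t * torch.randn_like(t_ab)  # perturb the translation
    e_a_noisy = e_a + sigma_e * torch.randn_like(e_a)     # perturb the embedding
    x_a_rec = G_a(t_ab_noisy, e_a_noisy)                  # reconstruct the original source
    return F.l1_loss(x_a_rec, x_a)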

Translation honesty.

Unfortunately, any form of cycle loss encourages the model to “hide” domain-specific information inside the translated image in the form of structured adversarial noise [chu2017cyclegan]. To actively penalize the model for “hiding” domain-specific information, such as a mustache, inside a generated female image (instead of putting it into the male-specific embedding), we use the guess loss [bashkirova2019adversarial]. This loss detects and prevents this so-called “self-adversarial attack” of the generator by training an additional discriminator to “guess” which of its two inputs is a cycle-reconstruction and which is the original image. For example, if the male-to-female generator consistently adversarially embeds mustaches into all generated female images, then the cycle-reconstructed female image will also have traces of an embedded mustache, and will otherwise be identical to the input. In this case, the guess discriminator, trained specifically to detect differences between input images and their cycle-reconstructions, will detect this hidden signal and penalize the model.
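The sketch below shows one way such a guess discriminator could be trained; the pairing scheme and LS-GAN-style targets are our illustrative reading of the guess loss, not necessarily the exact formulation of [bashkirova2019adversarial].

# Sketch of the translation honesty ("guess") loss. The guess discriminator
# receives a (real, cycle-reconstruction) pair in a random order and tries to
# guess which element is the reconstruction; the generators are trained to
# make the pair indistinguishable (illustrative LS-GAN-style targets).
import torch

def guess_losses_a(x_a, x_a_rec, D_guess_a):
    """Returns (discriminator loss, generator loss) for domain A."""
    if torch.rand(1).item() < 0.5:
        pair, target = torch.cat([x_a, x_a_rec], dim=1), 1.0   # reconstruction is second
    else:
        pair, target = torch.cat([x_a_rec, x_a], dim=1), 0.0   # reconstruction is first
    pred = D_guess_a(pair)                          # one score per image in the batch
    d_loss = ((pred - target) ** 2).mean()          # guesser: identify the reconstruction
    g_loss = ((pred - (1.0 - target)) ** 2).mean()  # generators: fool the guesser
    # In practice the discriminator update would use x_a_rec.detach(), and only
    # g_loss would back-propagate into the generators.
    return d_loss, g_loss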

Domain-specific channel capacity.

Unfortunately, neither of the two losses described above can prevent the model from learning to embed the entire guide image into the domain-specific embedding and reconstructing the guide from that embedding inside the generator, ignoring its source argument completely, i.e. always producing the guide input exactly. In order to prevent this from happening, we penalize the norms of the domain-specific embeddings, effectively constraining the capacity of the resulting channel:

L_cap = E || E_A(x_A) ||^2 + E || E_B(x_B) ||^2.
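A minimal sketch of this penalty, assuming a squared L2 norm as one natural reading of “penalizing norms”:

# Sketch of the embedding capacity penalty (squared L2 norm of the
# domain-specific embeddings; the exact norm is an assumption).
import torch

def capacity_loss(e_a, e_b):
    # e_a = E_a(x_a), e_b = E_b(x_b): batches of domain-specific embeddings.
    return (e_a ** 2).flatten(1).sum(dim=1).mean() + (e_b ** 2).flatten(1).sum(dim=1).mean()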

Intuitively, the mutual information between the input guide image x_B and the predicted translation T_AB(x_A, x_B) corresponds to the maximal amount of information that an observer could learn about translations by observing guides if they had an infinite amount of examples to learn from. Formally, using the derivation for the capacity of the additive white Gaussian noise channel (Sec. 7.1), we can show that

I( x_B ; T_AB(x_A, x_B) ) <= (d/2) * log( 1 + P / (d * sigma_E^2) ),   where E || E_B(x_B) ||^2 <= P,

meaning that minimizing the capacity loss L_cap effectively limits the amount of information from the guide image that the generator G_B can access to generate T_AB(x_A, x_B), i.e. the effective capacity of the d-dimensional domain-specific embedding. Note that disabling either the noise (sigma_E = 0) or the capacity loss (its weight set to zero) results in effectively infinite capacity, so we need both. Intuitively, this bound describes the expected number of “reliably distinguishable” embeddings that we can pack into a ball whose radius is controlled by the capacity penalty, given that each embedding will be perturbed by Gaussian noise with amplitude sigma_E.
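As a worked numeric example of this bound (using the AWGN form reconstructed above with hypothetical values of d, sigma_E and P):

# Worked example of the capacity bound with hypothetical values.
import math

d = 64          # embedding dimensionality (hypothetical)
sigma_e = 0.1   # amplitude of the Gaussian noise added to embeddings
P = 1.0         # bound on the expected squared embedding norm E||E_B(x_B)||^2

bits = 0.5 * d * math.log2(1.0 + P / (d * sigma_e ** 2))
print(f"at most ~{bits:.1f} bits of guide information reach the translation")
# Decreasing P (a stronger capacity penalty) or increasing sigma_e tightens the
# bottleneck, forcing the embedding to carry only truly domain-specific factors.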

Realism losses.

The remaining losses are analogous to the CycleGAN [zhu2017unpaired] adversarial losses and ensure that output images lie within the respective domains, e.g. for domain B:

L_GAN^B = E ( D_B( T_AB(x_A, x_B) ) - 1 )^2,

and analogously for domain A.

Discriminator losses.

We also train the discriminators D_A and D_B and the guess discriminators by minimizing the corresponding adversarial LS-GAN [mao2017least] losses.
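For completeness, a sketch of the standard LS-GAN objectives for domain B; the formulation follows the usual least-squares GAN recipe of [mao2017least], and the network and variable names are illustrative.

# Sketch of the LS-GAN realism losses for domain B.
import torch

def lsgan_losses_b(x_b_real, t_ab_fake, D_b):
    """Returns (discriminator loss, generator loss) for domain B."""
    # Discriminator: push real images towards 1 and generated translations towards 0.
    d_loss = ((D_b(x_b_real) - 1.0) ** 2).mean() + (D_b(t_ab_fake.detach()) ** 2).mean()
    # Generator: make guided translations look real to the discriminator.
    g_loss = ((D_b(t_ab_fake) - 1.0) ** 2).mean()
    return d_loss, g_loss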

4 Experiments

We would like to measure how well each model can generalize across a diverse set of shared and domain-specific attributes. In this section we discuss the datasets we used and generated to achieve this goal, and list the baselines and metrics used to compare our method to prior work.

Figure 5: Shapes-3D-ABC splits with respective shared and domain-specific attributes.

Data.

Following the protocol proposed by bashkirova2021evaluation, we re-purposed existing disentanglement datasets to evaluate the ability of our method to model different attributes as shared or domain-specific. We used 3D-Shapes [kim2018disentangling], SynAction [sun2020twostreamvan] and CelebA [CelebAMask-HQ]. Unfortunately, among the three, only 3D-Shapes [kim2018disentangling] is balanced enough and contains enough labeled attributes to make it possible to generate and evaluate all methods across several attribute splits of comparable sizes. For example, if we attempted to build a split of SynAction with a domain-specific pose attribute, the domain with a fixed pose would contain only 90 unique images, which is not sufficient to train an unsupervised translation network.

3D-Shapes-ABC.

The original 3D-Shapes [kim2018disentangling] dataset contains 40k synthetic images labeled with six attributes: floor, wall and object colors, object shape, object size, and orientation (viewpoint). There are ten possible values for each color attribute, four possible values for the shape (cylinder, capsule, box, sphere), fifteen values for the orientation, and eight values for the size. We used three subsets of 3D-Shapes with different attribute splits visualized in Figure 5. The three resulting domain pairs contained 4.8k/4k, 12k/3.2k, and 12k/6k images respectively.

SynAction.

We used the same split of SynAction [sun2020twostreamvan] as [bashkirova2021evaluation] - with background varied in one domain (nine possible values), identity/clothing varied in the other (ten possible values), and pose varied in both (real-valued vector). The resulting dataset contains 5k images in one domain and 4.6k images in the other. We note that the attribute split of this dataset matches the inductive bias of AdaIN methods, since the layout (pose) is shared and textures (background, clothing) are domain-specific in both domains.

CelebA.

We used the male-vs-female split proposed by [bashkirova2021evaluation] with 25k/25k images, and evaluated disentanglement of the six most visually prominent attributes: pose, skin and background color (shared attributes, real-valued vectors), male-specific presence of facial hair (binary), female-specific hair color (three possible values), and the domain-defining gender.

Baselines.

We compare the proposed method against several state-of-the-art AdaIN methods, namely MUNIT [huang2018multimodal], StarGANv2 [choi2020stargan] and MUNITX [bashkirova2021evaluation], and autoencoder-based methods, namely Domain Intersection and Domain Difference (DIDD) [benaim2019didd], Augmented CycleGAN [almahairi2018augmented] and DRIT++ [DRIT_plus]. In what follows we also provide a random baseline (RAND) that corresponds to selecting and returning a random image from the target domain.

Metrics.

In order to evaluate the performance of our method, we measured how well domain-specific attributes were manipulated and domain-invariant attributes were preserved. Following bashkirova2021evaluation, we trained an attribute classifier C, and for each attribute k we measured its manipulation accuracy acc_AB^k - the probability of correctly modifying the attribute across input-guide pairs for which the value of the attribute must change (i.e. pairs in which the source and the guide disagree on that attribute):

acc_AB^k = P[ C_k( T_AB(x_A, x_B) ) = y_k ],

where the “correct” attribute value y_k equals C_k(x_A) for shared attributes, and C_k(x_B) otherwise. For real-valued multivariate attributes (pose keypoints, background RGB, skin RGB, etc.) we measured the probability of generating an image with a predicted attribute vector closer to the correct attribute vector than to the incorrect one:

acc_AB^k = P[ || C_k( T_AB(x_A, x_B) ) - y_k^+ || < || C_k( T_AB(x_A, x_B) ) - y_k^- || ],

where y_k^+ = C_k(x_A) and y_k^- = C_k(x_B) for shared attributes, and vice versa otherwise. The manipulation accuracy acc_BA^k in the opposite direction was estimated analogously. For Shapes-3D we additionally aggregated results across the three splits by averaging manipulation accuracies across splits in which the given attribute was shared/common (C) or domain-specific (S). If we introduce the set of all splits S = {A, B, C}, the predicates shared(k, s) and specific(k, s), and the manipulation accuracy acc^k(s) at a given split s, the aggregated manipulation accuracies can be defined as follows:

acc_C^k = mean over { s : shared(k, s) } of acc^k(s),     (1)
acc_S^k = mean over { s : specific(k, s) } of acc^k(s).   (2)

For the three splits of 3D-Shapes we also report the relative discrepancy between domain-specific and domain-invariant manipulation accuracies:

RD = sum_k | acc_C^k - acc_S^k |  /  sum_k ( acc_C^k + acc_S^k ).   (3)
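The sketch below illustrates the categorical manipulation-accuracy computation described above; it is a simplified reading of the protocol, and the attribute labels are assumed to come from the pretrained attribute classifier.

# Sketch of the per-attribute manipulation accuracy for a categorical attribute.
import numpy as np

def manipulation_accuracy(pred_attr, src_attr, guide_attr, shared):
    """pred/src/guide_attr: attribute labels (from the attribute classifier) of
    translations, source images, and guide images; `shared` marks whether the
    attribute is shared (value should come from the source) or domain-specific
    (value should come from the guide)."""
    pred, src, guide = map(np.asarray, (pred_attr, src_attr, guide_attr))
    correct = src if shared else guide      # value the translation should show
    counted = src != guide                  # only pairs where the two values differ
    if not counted.any():
        return float("nan")
    return float((pred[counted] == correct[counted]).mean())

# Example: a shared attribute should keep the source value.
print(manipulation_accuracy(pred_attr=[2, 1, 0], src_attr=[2, 1, 1],
                            guide_attr=[0, 1, 0], shared=True))
# Pairs 0 and 2 are counted; the prediction matches the source for pair 0
# but not for pair 2, so the accuracy is 0.5.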

Evaluation.

To compute the metrics above, we generated two guided translations per source image per domain per baseline, and re-ran each method multiple times to account for poor initialization. We used PoseNet [papandreou2018personlab] to obtain ground-truth poses for SynAction, and [ruiz2018headpose] for head orientation and median background and skin colors for CelebA; see suppl. Fig. 10.

Architecture.

We used standard CycleGAN components: pix2pix [isola2017image] generators and patch discriminators with the LS-GAN loss [mao2017least]. We achieved the best results when we represented domain-specific embeddings as single-channel images and made the generator and encoder of the same domain (e.g. G_A and E_A) share all but their last layers.
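As a loose illustration of this weight-sharing choice, the sketch below shows one way an encoder and a generator for the same domain could share all but their last layers, with the embedding represented as a single-channel image; the trunk and heads are assumptions, not the pix2pix-based networks actually used.

# Illustrative weight sharing between the encoder and generator of one domain:
# a common trunk with two output heads (not the exact pix2pix-based networks).
import torch
import torch.nn as nn

class SharedDomainNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(                       # shared by E and G
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.embed_head = nn.Conv2d(32, 1, 3, padding=1)  # E: single-channel embedding
        self.image_head = nn.Sequential(                  # G: trunk features + embedding
            nn.Conv2d(32 + 1, 3, 3, padding=1), nn.Tanh(),
        )

    def encode(self, guide):
        """Domain-specific embedding of the guide as a single-channel image."""
        return self.embed_head(self.trunk(guide))

    def generate(self, source, embedding):
        """Combine trunk features of the source with the guide's embedding."""
        feats = self.trunk(source)
        return self.image_head(torch.cat([feats, embedding], dim=1))

net = SharedDomainNet()
x = torch.randn(2, 3, 64, 64)
emb = net.encode(x)            # (2, 1, 64, 64)
out = net.generate(x, emb)     # (2, 3, 64, 64)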

5 Results

In this section, we first compare our method to prior work both qualitatively and quantitatively. Then we show what happens if we remove the key losses discussed in Section 3. Finally, we discuss implicit assumptions made by our method, and key challenges that future methods will need to overcome to further improve manipulation accuracy across the three datasets used in this paper.

Figure 6: Qualitative results on CelebA. Methods should preserve pose and background of the source, and apply hair color of the female guide (top) and the facial hair of the male guide (bottom).
Method | 3D-Shapes[kim2018disentangling]-ABC: FC(C,S) WC(C,S) OC(C,S) SZ(C,S) SH(C,S) ORI(C,S) AVG(AC,RD) | SynAction [sun2020twostreamvan]: PS(C) IDT(S) BG(S) AVG(AC) | CelebA [CelebAMask-HQ]: HC(S) FH(S) GD(S) ORI(C) BG(C) SC(C) AVG(AC) | ALL(AC±std)
StarGANv2 | 0 99 0 99 0 78 5 56 4 99 0 96 45 97 | 96 52 99 82 | 76 15 97 87 11 22 51 | 59±20
MUNIT | 5 94 0 99 0 97 59 31 96 58 99 61 58 56 | 75 28 7 37 | 45 7 90 89 43 44 53 | 49±11
MUNITX | 1 50 2 55 8 28 12 16 95 21 99 7 33 74 | 93 26 37 52 | 64 17 75 83 50 43 55 | 47±12
DRIT++ | 7 12 9 19 10 10 27 14 7 15 42 51 18 20 | 52 6 13 24 | 23 9 96 89 67 44 55 | 32±20
AugCycleGAN | 10 8 10 9 11 7 17 13 30 13 7 7 12 20 | 90 8 12 37 | 16 30 98 12 42 40 40 | 29±15
DIDD | 38 81 29 22 72 18 41 20 87 43 48 34 44 35 | 89 12 99 67 | 22 50 91 78 89 56 64 | 58±12
RIFT (ours) | 99 45 99 39 92 10 50 23 62 84 98 87 66 33 | 89 47 99 78 | 22 35 99 65 83 57 60 | 68±9
RAND | 10 10 10 10 10 10 12 19 24 19 6 6 12 9 | 50 11 11 24 | 12 31 99 50 50 50 49 | 27±20
Table 1: Manipulation accuracy for six attributes aggregated across three splits of Shapes-3D: floor color (FC), wall color (WC), object color (OC), size (SZ), shape (SH), room orientation (ORI); three attributes in SynAction: pose (PS), identity/clothing (IDT), background (BG); and six attributes in CelebA: hair color (HC), facial hair (FH), gender (GD), face orientation (ORI), background (BG) and skin color (SC). We report per-attribute and average manipulation accuracies for shared/common (C) and domain-specific (S) attributes, the overall aggregated manipulation accuracy (AC) and relative discrepancy (RD) on 3D-Shapes described in Section 4, and, in the last column, the mean and standard deviation of AC across the three datasets. Table 2 with non-aggregated performance across the three splits of 3D-Shapes can be found in the supplementary.

Qualitative results.

Figures 8 and 9 show that, in most cases, the proposed method successfully preserves domain-invariant content and applies domain-specific attributes from the respective domains on 3D-Shapes and SynAction. Figure 6 shows that, on CelebA, our method preserves poses and backgrounds and applies hair color better than the other baselines. On 3D-Shapes-A, our method also preserves object color and applies the correct size and orientation better than all alternatives (Figure 2). A more detailed side-by-side qualitative comparison of generated images across all baselines and all datasets can be found in supp. Figures 26-36.

Quantitative results.

Table 1 shows that our method achieves the highest average aggregated manipulation accuracy across the three splits of 3D-Shapes, and the second lowest relative discrepancy (RD) between the accuracies of modeling the same attribute as shared and as domain-specific. On SynAction, which matches the inductive bias of AdaIN methods, our method performs on par with AdaIN-based methods and outperforms all non-AdaIN methods. On CelebA, our method is much better at preserving background and skin colors than all AdaIN-based methods, and, in terms of overall manipulation accuracy, is second only to DIDD [benaim2019didd]. Despite its high manipulation scores, DIDD generated blurry faces and also struggled with applying the correct hair color, as can be seen in Figure 6 and in the supplementary. To sum up, our method achieves the best or second-best performance on each of the three datasets, and the best performance overall (last column), with among the lowest discrepancies between per-attribute manipulation accuracies (RD) and the lowest variance across datasets (68±9 in the last column of Table 1).

Figure 7: Ablations. Effects of disabling the capacity and honesty losses on guided translations (top) and guided cycle-reconstructions (bottom) on Shapes-3D-A: input images from domains A and B, and A2B and B2A guided translations.

Ablations.

During B2A translation on Shapes-3D-A, the model trained with all losses uses the object color and shape of the source image and the floor and wall color of the guide (Fig. 8). If we remove the penalty on the capacity of domain-specific embeddings (L_cap), the model ignores the source input (Fig. 7a-top): it encodes all attributes into the domain-specific embeddings and cycle-reconstructs the inputs perfectly from these embeddings alone (Fig. 7a-bottom), i.e. the translation simply reproduces the guide. Removing the honesty losses (the guess losses), on the other hand, results in a model that ignores the guide input altogether (Fig. 7b-top). The model “hides” domain-specific information inside the generated translations instead of the domain-specific embeddings, and makes the domain-specific embeddings equal to zero, resulting in zero capacity loss and zero cycle reconstruction loss. For example (Fig. 7b-bottom), the size and orientation of the source are hidden inside its translation in the form of imperceptible adversarial noise and are used to reconstruct the source perfectly. If the mapping actually used the size and orientation of the guide to generate the translation, it would have also applied that same size and orientation when generating the cycle-reconstruction, but it did not - so we conclude that, without the honesty losses, both mappings ignore the domain-specific embeddings and embed information inside the generated translations instead. More illustrations are given in suppl. Figure 11.

Challenges.

We identified three major causes of the remaining errors that existing methods fail to handle at the moment, and that future researchers will need to address to make further progress in this task possible. First, some attributes “affect” very different numbers of pixels in training images and, as a consequence, contribute very differently to reconstruction losses, making the job of balancing different loss components much harder. For example, the floor color in 3D-Shapes “affects” roughly half of all image pixels, whereas the size affects only one tenth of all pixels - resulting in drastically different effective weights across all losses, especially if both are either domain-specific or shared at the same time. This explains the higher performance of our method on 3D-Shapes-A (in comparison to 3D-Shapes-B and C, see Table 2), which has “similarly-sized” domain-specific attributes in both domains. The second challenge is that all datasets contain some attribute combinations that are almost indistinguishable: for example, front-facing boxes and cylinders are hardly distinguishable in 3D-Shapes, and the clothing of people is much less articulated when they are facing backwards because of shading in SynAction (see suppl. Figure 12). This explains why our model fails on these cases: since it cannot reliably reconstruct these attributes from such intermediate translations, it has no incentive to apply the correct attribute values to them in the first place. Finally, unevenly distributed shared attributes in real-world in-the-wild datasets (such as CelebA) pose an even more serious challenge, rendering the whole many-to-many problem setup ill-defined. For example, if both the male and female domains had hair color variation, but males were mostly brunet with only 3% blondes, while 50% of females were blondes - should the model preserve blonde hair when translating females to males and sacrifice the “realism” of the generated male domain, or should it treat hair color as a domain-specific attribute despite the variation present in both?

Ethical considerations.

While more precise attribute manipulation models requiring less supervision might be used for malicious deepfakes [nguyen2019deep, citron2018disinformation], they can also be used to remove biases present in existing datasets [grover2019bias] to promote fairness in down-stream tasks [augenstein2019generative]. We acknowledge that the CelebA dataset contains many biases (e.g. being predominantly white) and that binary gender labels are problematic.

Figure 8: Guided translations generated by our method on 3D-Shapes-A. Our model successfully preserves shared attributes (object color and shape) of the source image and applies domain-specific attributes of the guide domain (rotation and size on the left, floor and wall color on the right) in most cases. It sometimes confuses boxes with cylinders, as discussed in the Challenges paragraph.
Figure 9: Guided translations generated by our method on SynAction. Our model correctly preserves shared attributes (pose) of the source image and applies domain-specific attributes of the guide domain (background texture on the left, clothing/identity colors on the right). It sometimes applies the wrong clothing, especially in extreme poses, as discussed in the Challenges paragraph.

6 Conclusion

In this paper we propose RIFT - a new unsupervised many-to-many image-to-image translation method that does not rely on an inductive bias hard-coded into its architecture to determine which attributes are shared and which are domain-specific. RIFT achieves consistently high attribute manipulation accuracy across a wide range of datasets with different kinds of domain-specific and shared attributes, with low discrepancy between manipulation accuracies across different attributes and datasets. Moreover, on datasets that match the inductive bias of AdaIN-based methods, the proposed method performs on par with AdaIN-based methods. Finally, we identified three core challenges that need to be resolved to enable further development of unsupervised many-to-many image-to-image translation.


7 Supplementary

7.1 Derivation of the capacity

Let X_A and X_B be arbitrary datasets, E_B and G_B be the domain-specific embedding and generator functions, and T_AB(x_A, x_B) = G_B(x_A, E_B(x_B) + eps), with eps ~ N(0, sigma_E^2 I_d), be the translation from the source x_A to domain B, guided by the target example x_B. The following theorem bounds the amount of information about x_B that G_B can access to generate T_AB(x_A, x_B).

Theorem 1.

The effective capacity of the guided embedding, i.e. the capacity of the channel x_B -> E_B(x_B) + eps -> T_AB(x_A, x_B), i.e. the mutual information I(x_B; T_AB(x_A, x_B)), is bounded by

I( x_B ; T_AB(x_A, x_B) ) <= (d/2) * log( 1 + P / (d * sigma_E^2) ),

where d is the dimensionality of the embedding and P >= E || E_B(x_B) ||^2.

Proof.

First, let us define the Markov chain

x_B -> e_B = E_B(x_B) -> e'_B = e_B + eps -> T_AB(x_A, x_B) = G_B(x_A, e'_B);

using the data processing inequality twice we can show that

I( x_B ; T_AB(x_A, x_B) ) <= I( x_B ; e'_B ) <= I( e_B ; e'_B ),

intuitively meaning that the overall pipeline always loses at least as much information as each of its steps. Then expanding the mutual information in terms of the differential entropy gives us

I( e_B ; e'_B ) = h( e'_B ) - h( e'_B | e_B ) = h( e'_B ) - h( eps ) = h( e'_B ) - (d/2) * log( 2*pi*e * sigma_E^2 ).

Since the second raw moment (aka power) of e'_B is bounded by E || e'_B ||^2 <= P + d * sigma_E^2, the entropy h(e'_B) is maximized when e'_B is a d-dimensional spherical multivariate normal with per-dimension variance P/d + sigma_E^2, i.e. h( e'_B ) <= (d/2) * log( 2*pi*e * (P/d + sigma_E^2) ), and therefore

I( x_B ; T_AB(x_A, x_B) ) <= (d/2) * log( 1 + P / (d * sigma_E^2) ).

Figure 10: Predictions of the attribute regression network on CelebA for hair color (black, brown or blonde), facial hair (binary), background color (RGB), skin color (RGB), and face orientation (yaw, pitch, roll).
Figure 11: Additional ablation visualizations on Shapes-3D-A. Without capacity losses (top) the model always embeds the entire guidance image into the domain-specific embedding and ignores the source input. Without honesty losses (bottom) it ignores the guide input and embeds domain-specific information into the translated image to reconstruct it back to minimize the cycle reconstruction losses.
Figure 12: Examples with indistinguishable attributes after translation cause instability in cycle reconstruction. For example, identities of actors with extreme poses are very dark and therefore hard to infer from translation results in SynAction; front-facing boxes and cylinders are almost indistinguishable too.
Method | Split A: FC(A) WC(A) OC(C) SZ(B) SH(C) ORI(B) | Split B: FC(A) WC(C) OC(B) SZ(C) SH(A) ORI(B) | Split C: FC(C) WC(B) OC(A) SZ(A) SH(B) ORI(C)
MUNIT | 99 99 0 50 96 64 | 88 0 95 59 15 58 | 5 99 99 12 99 99
MUNITX | 10 9 8 18 95 8 | 89 2 11 12 8 6 | 1 99 45 13 33 99
DRIT | 13 16 10 14 7 95 | 10 9 7 27 8 6 | 7 21 12 13 22 42
AugCycleGAN | 10 9 11 13 30 7 | 5 10 5 17 0 7 | 10 9 9 13 26 7
StarGANv2 | 99 99 0 56 4 99 | 99 0 66 5 99 92 | 0 99 89 56 99 0
DIDD | 99 27 72 12 87 8 | 62 29 10 41 59 59 | 38 17 25 28 27 48
RIFT (ours) | 81 68 92 35 62 93 | 9 99 10 50 69 81 | 99 10 9 11 98 98
RAND | 10 9 10 12 24 6 | 10 10 9 12 25 6 | 10 10 10 25 12 6
Table 2: Per-split attribute manipulation accuracy on 3D-Shapes-ABC with the attribute role indicated in the header (A, B - domain-specific; C - common/shared), across six attributes: floor color (FC), wall color (WC), object color (OC), size (SZ), shape (SH), and room orientation (ORI).
Figures 13-18: Qualitative comparison to existing methods on Shapes-3D-A across seven methods: Augmented CycleGAN [almahairi2018augmented], DRIT++ [DRIT_plus], DIDD [benaim2019didd], MUNIT [huang2018multimodal], MUNITX [bashkirova2021evaluation], StarGANv2 [choi2020stargan] and the proposed RIFT. Each figure shows the input, the guidance, the outputs of the seven methods, and, in the rightmost column, ground truth predictions for that domain pair.
Figures 19-25: Qualitative comparison to existing methods on SynAction across the same seven methods. The rightmost column shows ground truth predictions for that domain pair.
Figures 26-36: Qualitative comparison to existing methods on Celeb-A across the same seven methods.