The goal of unsupervised image-to-image translation is to learn a mapping between two sets of images (or two domains) without any pair supervision. For example, face domains in Figure 1a are defined by gender and share variability in poses and backgrounds, so a correct cross-domain mapping must change the face gender, but preserve the pose and the background of the original. When one domain has some unique attributes absent in the other domain, like males with and without beards in CelebA [CelebAMask-HQ], the question of whether the “correct” one-to-one cross-domain mapping should add a beard to a specific female face does not have a well-defined answer. However, if we alter the problem definition by providing a “guide” image specifying male-specific factors, the resulting unsupervised many-to-many translation problem has a well-defined correct solution: the learned mapping must preserve all attributes of the source image that are shared across the two domains and use values of attributes unique to the target domain from the guide input (e.g. preserve pose and background of the female source and add a beard from the male guide in the example above).
Most state-of-the-art methods for unsupervised many-to-many translation implicitly assume that the domain-specific variations can be modeled as “global style” (textures and colors) by hard-coding this assumption into their architectures via adaptive instance normalization (AdaIN) [huang2017arbitrary], originally proposed for style transfer. However, this choice severely restricts the kinds of problems that can be efficiently solved. More specifically, AdaIN-based methods [huang2018multimodal, choi2020stargan] inject domain-specific information from the guide image via a global feature re-normalization that forces colors, textures, and other global statistics to always be treated as domain-specific factors regardless of their actual distribution across the two domains. As a result, AdaIN-based methods change colors and textures of the input to match the guide image during translation even if colors/textures vary across both domains and should not change. For example, background textures in the female-to-male setting (Figure 1a) vary in both domains, and therefore should be preserved; the same holds for hair color in the children-to-adults setting (Figure 1c). Even on a toy, perfectly balanced problem (Figure 2), AdaIN-based methods (e.g. MUNIT [huang2018multimodal]) change the object color of the input to match the object color of the guide, even though object color varies in both domains, and thus should be preserved.
Autoencoder-based methods [almahairi2018augmented, DRIT_plus, benaim2019didd], on the other hand, preserve shared information better, but often fail to apply correct domain-specific factors. For example, DIDD [benaim2019didd] preserved the object color of the source in Fig. 2, but failed to extract and apply the correct orientation and size from the guide. Overall, both our experiments and recent advances in the evaluation of many-to-many image translation [bashkirova2021evaluation] show that all existing methods generally either fail to preserve shared (domain-invariant) attributes or fail to apply domain-specific factors well.
In this paper, we propose Restricted Information Flow for Translation (RIFT) - a novel approach that does not rely on an inductive bias provided by AdaIN and achieves high attribute manipulation accuracy across different kinds of attributes regardless of whether they are shared or domain-specific. As illustrated in Figure 3, during “brunet male-to-female translation” our method preserves shared factors (background and pose) of the input male face, and encodes male-specific attributes (mustache) in a domain-specific embedding to enable accurate reconstruction of the source image. The core observation at the heart of our method is that only values of shared attributes (background and pose) of the source can be encoded naturally in a generated image from the target domain, whereas source-specific attributes (mustache) can be encoded in the generated image only by “hiding” them in the form of structured adversarial noise [bashkirova2019adversarial]. With this in mind, we propose using the translation honesty loss [bashkirova2019adversarial] to penalize the model for “hiding” [chu2017cyclegan] a mustache inside the generated female image, and the embedding capacity loss to penalize the model for encoding shared factors into the domain-specific embedding. As a result, information about the mustache is forced out of the generated female image into the domain-specific embedding, while information about the pose and background is forced out of domain-specific embeddings into the translation result - resulting in proper disentanglement of domain-specific and domain-invariant factors.
We measure how well RIFT models different kinds of attributes as either shared or domain-specific across three splits of 3D-Shapes [kim2018disentangling], as well as SynAction [sun2020twostreamvan] and CelebA [CelebAMask-HQ], following an evaluation protocol similar to the one proposed by [bashkirova2021evaluation]. Our experiments confirm that the proposed method achieves high attribute manipulation accuracy without relying on an inductive bias towards treating certain attribute kinds as domain-specific hard-coded into its architecture.
2 Related work
In contrast to task-specific image translation methods [cao2017unsupervised, guadarrama2017pixcolor, lugmayr2019unsupervised, qu2018unsupervised, gatys2015neural, ulyanov2016texture], early unsupervised image-to-image translation methods, such as CycleGAN [zhu2017unpaired], and UNIT [liu2017unsupervised], infer semantically meaningful cross-domain mappings from arbitrary pairs of semantically related domains without pair supervision. These methods assume one-to-one correspondence between examples in source and target domains, which makes the problem ill-posed if at least one of two domains has some unique domain-specific factors, as we discussed in Section 1.
To account for such domain-specific factors, and to enable control over them in the translation results, many-to-many image translation methods [huang2018multimodal, almahairi2018augmented, choi2020stargan, liu2019few, DRIT_plus] have been proposed. These methods separate domain-invariant “content” from domain-specific “style” using separate encoders. Following [bashkirova2021evaluation], we avoid terms “content” and “style” to distinguish general many-to-many translation from its subtask - style transfer [gatys2015neural].
Adaptive instance normalization.
Many state-of-the-art many-to-many translation methods, such as MUNIT [huang2018multimodal], FUNIT [liu2019few] and StarGANv2 [choi2020stargan], use AdaIN [huang2017arbitrary], originally proposed for style transfer [gatys2015neural]. More specifically, these methods modulate activations of the decoder with the domain-specific embedding of the guide. This architectural choice was shown to limit the range of applications of these methods to cases when domain-specific information lies within textures and colors [bashkirova2021evaluation].
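To make the source of this inductive bias concrete, the following is a minimal NumPy sketch of the standard AdaIN operation on a single feature map. It is an illustration, not the code of any of the cited methods: the function name `adain` and the toy tensors are assumptions, and the guide statistics are taken directly from a guide feature map, whereas MUNIT-style models predict them from the guide embedding with a small network.

```python
import numpy as np

def adain(content, guide_mean, guide_std, eps=1e-5):
    """Adaptive instance normalization for one feature map of shape (C, H, W).

    Per-channel statistics of `content` are removed and replaced by the
    supplied guide statistics, each of shape (C,).
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return guide_std[:, None, None] * normalized + guide_mean[:, None, None]

rng = np.random.default_rng(0)
source_features = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))   # toy source activations
guide_features = rng.normal(loc=-1.0, scale=0.5, size=(4, 8, 8))   # toy guide activations

out = adain(
    source_features,
    guide_mean=guide_features.mean(axis=(1, 2)),
    guide_std=guide_features.std(axis=(1, 2)),
)
```

Whatever the per-channel mean and variance of the source were, the output carries those of the guide, which is why global statistics such as color and texture end up being treated as domain-specific regardless of how they are distributed across the two domains.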
In contrast, methods like Augmented CycleGAN [almahairi2018augmented], DRIT++ [DRIT_plus] and Domain Intersection and Domain Difference (DIDD) [benaim2019didd] rely on embedding losses and therefore are more general. For example, DIDD forces domain-specific embeddings of the opposite domain to be zero, while DRIT++ uses adversarial training to make the source and target content embeddings indistinguishable.
Most methods [huang2018multimodal, almahairi2018augmented] also use cycle-consistency losses on domain-specific embeddings to ensure that information extracted from the guidance image is not ignored during translation, and cycle loss on images to improve semantic consistency [chu2017cyclegan]. However, cycle-consistency losses on images have been shown [chu2017cyclegan, bashkirova2019adversarial] to force one-to-one unsupervised translation models to “cheat” by hiding domain-specific attributes into translations.
Overall, prior methods ensure that the guide input modulates the translation result in some non-trivial way, but, to our knowledge, no prior work explicitly addresses adversarial embedding of domain-specific information into the translated result or ensures that domain-invariant factors are preserved during translation. This work fills this gap.
3 Restricted Information Flow for Translation
In this section we first formally introduce the many-to-many image translation problem, and describe how our method solves it. Our model reconstructs input images from translation results and domain-specific embeddings as illustrated in Fig. 3, forcing domain-invariant information out of domain-specific embedding using capacity losses, and forcing domain-specific information out from the generated translation using honesty losses.
Following [huang2018multimodal], we assume that we have access to two unpaired image datasets $A$ and $B$ that share some semantic structure but differ visually (e.g. male and female faces with poses, backgrounds and skin color varied in both). In addition, each domain has some attributes that vary only within that domain, e.g. only males have variation in the amount of facial hair and only females have variation in the hair color (as in Figure 1). Our goal is to find a pair of guided cross-domain mappings $F_{A2B}: A \times B \to B$ and $F_{B2A}: B \times A \to A$ such that for any source inputs $a$, $b$ and guide inputs $a_g$, $b_g$ from the respective domains, the resulting guided cross-domain translations $F_{A2B}(a, b_g)$ and $F_{B2A}(b, a_g)$ look like plausible examples of the respective output domains, share domain-invariant factors with their “source” arguments ($a$ and $b$ respectively) and domain-specific attributes with their “guidance” arguments ($b_g$ and $a_g$ respectively). This general setup covers the vast majority of real-world image-to-image tasks. For example, the correct guided female-to-male mapping applied to a female source image and a male guide image should generate a new male image with pose, background, skin color, and other shared factors from the female source image, and facial hair from the male guide, because poses, backgrounds and skin color vary in both domains, while facial hair is male-specific.
While it might be possible to approximate the guided mappings $F_{A2B}$ and $F_{B2A}$ directly, following prior work, we split each one into two learnable parts: encoders $E_A$ and $E_B$ that extract domain-specific information from corresponding guide images, and generators $G_{A2B}$ and $G_{B2A}$ that combine that domain-specific information with a corresponding source image, as illustrated in Figure 4. Final many-to-many mappings are just compositions of encoders and generators:
$$F_{A2B}(a, b_g) = G_{A2B}\big(a, E_B(b_g)\big), \qquad F_{B2A}(b, a_g) = G_{B2A}\big(b, E_A(a_g)\big).$$
Our goal is to ensure that encoders extract all domain-specific information from their inputs (and nothing more), and that generators use that information, along with domain-invariant factors from their source inputs to form plausible images from corresponding domains.
Noisy cycle consistency loss.
First, to ensure that each attribute of input images is not ignored completely (i.e. that it is treated as either domain-specific, or domain-invariant, or both), we use a guided analog of the cycle consistency loss. This loss ensures that any image translated into a different domain and translated back with its original domain-specific embedding is reconstructed perfectly. Additionally, as the first step towards restricting the amount of information passed through each branch, we add zero-mean Gaussian noise ($\epsilon$) of amplitude $\sigma_{img}$ or $\sigma_{emb}$ and appropriate shape to translations and domain-specific embeddings respectively, before reconstructing images back:
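The structure of such a noisy guided cycle can be sketched as follows. This is a toy illustration under stated assumptions rather than the training code: images are flat vectors, the encoders and generators are hypothetical random linear maps (`encode_a`, `generate_a2b`, and so on), the noise amplitudes are named `sigma_img` and `sigma_emb`, and the reconstruction error is measured with an L1 distance as in CycleGAN.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, EMB_DIM = 16, 4  # toy sizes chosen for illustration

# Hypothetical stand-ins for the learned networks (random linear maps).
W_ENC_A = 0.1 * rng.normal(size=(EMB_DIM, IMG_DIM))
W_ENC_B = 0.1 * rng.normal(size=(EMB_DIM, IMG_DIM))
W_GEN_A2B = 0.1 * rng.normal(size=(IMG_DIM, EMB_DIM))
W_GEN_B2A = 0.1 * rng.normal(size=(IMG_DIM, EMB_DIM))

def encode_a(a):
    return W_ENC_A @ a

def encode_b(b):
    return W_ENC_B @ b

def generate_a2b(source_a, emb_b):
    return source_a + W_GEN_A2B @ emb_b

def generate_b2a(source_b, emb_a):
    return source_b + W_GEN_B2A @ emb_a

def noisy_cycle_loss_a(a, b_guide, sigma_img, sigma_emb, rng):
    """A -> B -> A cycle with noise injected into both branches."""
    # Guided translation: source a, domain-specific embedding of the guide.
    translation = generate_a2b(a, encode_b(b_guide))
    # Perturb the translated image and the source's own embedding.
    noisy_translation = translation + sigma_img * rng.normal(size=translation.shape)
    emb_a = encode_a(a)
    noisy_emb_a = emb_a + sigma_emb * rng.normal(size=emb_a.shape)
    # Translate back using the (noisy) original embedding.
    reconstruction = generate_b2a(noisy_translation, noisy_emb_a)
    return float(np.abs(reconstruction - a).mean())

a = rng.normal(size=IMG_DIM)
b_guide = rng.normal(size=IMG_DIM)
loss = noisy_cycle_loss_a(a, b_guide, sigma_img=0.1, sigma_emb=0.1, rng=rng)
```

The symmetric B -> A -> B term would be built the same way with the roles of the two domains swapped.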
Unfortunately, any form of cycle loss encourages the model to “hide” domain-specific information inside the translated image in the form of structured adversarial noise [chu2017cyclegan]. To actively penalize the model for “hiding” the domain-specific information, such as a mustache, inside a generated female image (instead of putting it into a male-specific embedding), we use the guess loss [bashkirova2019adversarial]. This loss detects and prevents this so-called “self-adversarial attack” in the generator by training an additional discriminator to “guess” which of its two inputs is a cycle-reconstruction and which is the original image. For example, if the male-to-female generator is consistently adversarially embedding mustaches into all generated female images, then the cycle-reconstructed female image will also have traces of an embedded mustache, and will otherwise be identical to the input. In this case, the guess discriminator, trained specifically to detect differences between input images and their cycle-reconstructions, will detect this hidden signal and penalize the model:
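The mechanism can be illustrated with a toy guess discriminator. Everything in this sketch is an assumption made for illustration: the linear discriminator, the least-squares form of both losses, and the choice to push the generator towards a chance-level guess of 0.5 are not taken from [bashkirova2019adversarial], whose exact formulation may differ.

```python
import numpy as np

def guess_discriminator(first, second, w):
    """Hypothetical toy discriminator: returns a score in (0, 1) that is
    close to 1 when it believes `first` is the original image."""
    logit = float(w @ (first - second))
    return 1.0 / (1.0 + np.exp(-logit))

def guess_losses(original, reconstruction, w):
    """Least-squares losses for the discriminator and the generator."""
    p_original_first = guess_discriminator(original, reconstruction, w)
    p_original_second = guess_discriminator(reconstruction, original, w)
    # Discriminator: answer 1 for (original, reconstruction), 0 for the swap.
    d_loss = (p_original_first - 1.0) ** 2 + p_original_second ** 2
    # Generator (assumed target): make the discriminator no better than chance.
    g_loss = (p_original_first - 0.5) ** 2 + (p_original_second - 0.5) ** 2
    return d_loss, g_loss

original = np.zeros(4)
hidden_signal = np.array([0.1, -0.1, 0.1, -0.1])  # trace left by "hiding"
w = -20.0 * hidden_signal                         # discriminator tuned to that trace

d_clean, g_clean = guess_losses(original, original.copy(), w)
d_hidden, g_hidden = guess_losses(original, original + hidden_signal, w)
```

With an honest reconstruction the discriminator can only guess and the generator pays no penalty, whereas a reconstruction carrying a structured residue lets the discriminator do better than chance and raises the generator's loss.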
Domain-specific channel capacity.
Unfortunately, neither of the two losses described above can prevent the model from learning to embed the entire guide image into the domain-specific embedding and reconstructing it from that embedding in the generator, ignoring the generator's first argument (the source image) completely, i.e. just always producing the guide input exactly. In order to prevent this from happening, we penalize norms of domain-specific embeddings, effectively constraining the capacity of the resulting channel:
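A minimal sketch of such a norm penalty is given below. It assumes the penalty is the mean squared L2 norm of the embeddings in a batch, which matches the "power" interpretation used in the capacity derivation; the function name `capacity_loss` is hypothetical.

```python
import numpy as np

def capacity_loss(embeddings):
    """Mean squared L2 norm of a batch of embeddings with shape (batch, d)."""
    embeddings = np.asarray(embeddings, dtype=float)
    return float(np.mean(np.sum(embeddings ** 2, axis=1)))

zero_penalty = capacity_loss(np.zeros((5, 3)))          # embeddings that carry nothing
unit_penalty = capacity_loss([[3.0, 4.0], [0.0, 0.0]])  # one embedding of norm 5, one of norm 0
```

Driving this quantity down shrinks the ball in which embeddings live, which only limits information once noise is added to the embeddings as well.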
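Intuitively, the mutual information between the input guide image and the predicted translation corresponds to the maximal amount of information that an observer could learn about translations by observing guides if they had an infinite number of examples to learn from. Formally, using the derivation for the capacity of the additive white Gaussian noise channel (Sec. 7.1), we can show that, for a guide image $b_g$ and the translation $\hat{b}$ generated from its noisy embedding:
$$I\big(b_g;\, \hat{b}\big) \le \frac{d}{2}\log\left(1 + \frac{\mathcal{L}_{cap}}{d\,\sigma_{emb}^2}\right),$$
where $d$ is the dimension of the domain-specific embedding, $\sigma_{emb}$ is the amplitude of the noise added to it, and $\mathcal{L}_{cap}$ is the capacity loss (the expected squared norm of the embedding), meaning that minimizing the loss $\mathcal{L}_{cap}$ effectively limits the amount of information from the guide image $b_g$ that the generator can access to generate $\hat{b}$, i.e. the effective capacity of the domain-specific embedding. Note that disabling either the noise ($\sigma_{emb} = 0$) or the capacity loss (unbounded $\mathcal{L}_{cap}$) results in effectively infinite capacity, so we need both. Intuitively, this bound describes (the logarithm of) the expected number of “reliably distinguishable” embeddings that we can pack into a ball of radius $\sqrt{\mathcal{L}_{cap}}$ given that each embedding will be perturbed by Gaussian noise with amplitude $\sigma_{emb}$.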
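The interplay between the noise and the norm penalty can be checked numerically. The sketch below assumes the standard capacity formula for a $d$-dimensional additive white Gaussian noise channel with total signal power `power` and per-dimension noise standard deviation `sigma`; the function name and the example values are illustrative.

```python
import math

def awgn_capacity_bound_bits(power, sigma, dim):
    """Upper bound (in bits) on the information passed through a dim-dimensional
    embedding with E||z||^2 <= power and additive N(0, sigma^2 I) noise."""
    if sigma <= 0.0:
        return math.inf  # without noise the channel is not capacity-limited
    return 0.5 * dim * math.log2(1.0 + power / (dim * sigma ** 2))

no_signal = awgn_capacity_bound_bits(power=0.0, sigma=1.0, dim=8)
unit_snr = awgn_capacity_bound_bits(power=8.0, sigma=1.0, dim=8)   # per-dimension SNR of 1
no_noise = awgn_capacity_bound_bits(power=8.0, sigma=0.0, dim=8)
loose_norm = awgn_capacity_bound_bits(power=1e12, sigma=1.0, dim=8)
```

The bound vanishes when the embedding norm is driven to zero, grows without limit as the norm penalty is relaxed, and is infinite when the noise is switched off, which is why both ingredients are required.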
The remaining losses are analogous to CycleGAN [zhu2017unpaired] losses that ensure that output images lie within respective domains:
We also train discriminator networks and guess discriminators by minimizing corresponding adversarial LS-GAN [mao2017least] losses.
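For reference, the least-squares GAN objectives take the following form in the common 0/1 labeling of LS-GAN. The sketch operates on arrays of discriminator scores; the function names are hypothetical, and how scores are produced (for example by a patch discriminator) is outside its scope.

```python
import numpy as np

def lsgan_discriminator_loss(scores_real, scores_fake):
    """Push scores of real images towards 1 and scores of generated images towards 0."""
    scores_real = np.asarray(scores_real, dtype=float)
    scores_fake = np.asarray(scores_fake, dtype=float)
    return float(0.5 * np.mean((scores_real - 1.0) ** 2) + 0.5 * np.mean(scores_fake ** 2))

def lsgan_generator_loss(scores_fake):
    """Push scores of generated images towards 1."""
    scores_fake = np.asarray(scores_fake, dtype=float)
    return float(0.5 * np.mean((scores_fake - 1.0) ** 2))

perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
fooled_d = lsgan_discriminator_loss([1.0, 1.0], [1.0, 1.0])
fooling_g = lsgan_generator_loss([1.0, 1.0])
caught_g = lsgan_generator_loss([0.0, 0.0])
```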
We would like to measure how well each model can generalize across a diverse set of shared and domain-specific attributes. In this section we discuss datasets we used and generated to achieve this goal, as well as list baselines and metrics we used to compare our method to prior work.
Following the protocol proposed by [bashkirova2021evaluation], we re-purposed existing disentanglement datasets to evaluate the ability of our method to model different attributes as shared and domain-specific. We used 3D-Shapes [kim2018disentangling], SynAction [sun2020twostreamvan] and CelebA [CelebAMask-HQ]. Unfortunately, among the three, only 3D-Shapes [kim2018disentangling] is balanced enough and contains enough labeled attributes to make it possible to generate and evaluate all methods across several attribute splits of comparable sizes. For example, if we attempted to build a split of SynAction with a domain-specific pose attribute, the domain with fixed pose would only contain 90 unique images, which is not sufficient to train an unsupervised translation network.
The original 3D-Shapes [kim2018disentangling] dataset contains 480k synthetic images labeled with six attributes: floor, wall and object colors, object shape, object size, and orientation (viewpoint). There are ten possible values for each color attribute, four possible values for the shape (cylinder, capsule, box, sphere), fifteen values for orientation, and eight values for size. We used three subsets of 3D-Shapes with different attribute splits visualized in Figure 5. The three resulting domain pairs contained 4.8k/4k, 12k/3.2k, and 12k/6k images respectively.
We used the same split of SynAction [sun2020twostreamvan] as [bashkirova2021evaluation], with background varied in one domain (nine possible values), identity/clothing varied in the other (ten possible values), and pose varied in both (real-valued vector). The resulting dataset contains 5k images in one domain and 4.6k images in the other. We note that the attribute split of this dataset matches the inductive bias of AdaIN methods, since the layout (pose) is shared and textures (background, clothing) are domain-specific in both domains.
We used the male-vs-female split proposed by [bashkirova2021evaluation] with 25k/25k images, and evaluated disentanglement of the six most visually prominent attributes: pose, skin and background color (shared attributes, real-valued vectors), male-specific presence of facial hair (binary), female-specific hair color (three possible values), and domain-defining gender.
We compare the proposed method against several state-of-the-art AdaIN methods, namely MUNIT [huang2018multimodal], StarGANv2 [choi2020stargan], MUNITX [bashkirova2021evaluation], and autoencoder-based methods, namely Domain Intersection and Domain Difference (DIDD) [benaim2019didd], Augmented CycleGAN [almahairi2018augmented] and DRIT++ [DRIT_plus]. In what follows we also provide a random baseline (RAND) that corresponds to selecting and returning a random image from the target domain.
In order to evaluate the performance of our method, we measured how well the domain-specific attributes were manipulated and domain-invariant attributes were preserved. Following [bashkirova2021evaluation], we trained an attribute classifier, and for each attribute we measured its manipulation accuracy, i.e. the probability of correctly modifying an attribute across input-guide pairs for which the value of the attribute must change:
where the “correct” attribute value equals the attribute value of the source for shared attributes, and the attribute value of the guide otherwise. For real-valued multi-variate attributes (pose keypoints, background RGB, skin RGB, etc.) we measured the probability of generating an image with a predicted attribute vector closer to the correct attribute vector than to the incorrect vector:
where the correct vector is that of the source and the incorrect vector is that of the guide for shared attributes, and vice-versa otherwise. The manipulation accuracy in the opposite direction was estimated analogously. For 3D-Shapes we additionally aggregated results across three splits by averaging manipulation accuracies across splits in which the given attribute was shared/common (C) or domain-specific (S). If we introduce the set of all splits, predicates indicating whether a given attribute is shared or domain-specific in a given split, and the manipulation accuracy at a given split, the aggregated manipulation accuracy can be defined as follows:
For three splits of 3D-Shapes we also report the relative discrepancy between domain-specific and domain-invariant manipulation accuracies:
To compute the metrics above, we generated two guided translations per source image per domain per baseline. We re-ran each method multiple times to account for poor initialization. We used PoseNet [papandreou2018personlab] to get ground truth poses for SynAction, and used [ruiz2018headpose] as well as median background and skin colors for CelebA; see suppl. Fig. 10.
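The manipulation accuracy described in this section can be sketched as follows. The sketch assumes that "pairs for which the value must change" are those in which source and guide disagree on the attribute, that distances between real-valued attribute vectors are Euclidean, and that attribute predictions for the generated images are already available; all function names are hypothetical.

```python
import numpy as np

def manipulation_accuracy(pred, source_vals, guide_vals, shared):
    """Discrete attribute: fraction of informative pairs with the correct value."""
    pred, source_vals, guide_vals = map(np.asarray, (pred, source_vals, guide_vals))
    informative = source_vals != guide_vals      # source and guide disagree
    correct = source_vals if shared else guide_vals
    return float(np.mean(pred[informative] == correct[informative]))

def manipulation_accuracy_real(pred, source_vecs, guide_vecs, shared):
    """Real-valued attribute: fraction of outputs closer to the correct vector."""
    pred, source_vecs, guide_vecs = (np.asarray(x, dtype=float)
                                     for x in (pred, source_vecs, guide_vecs))
    correct, incorrect = (source_vecs, guide_vecs) if shared else (guide_vecs, source_vecs)
    d_correct = np.linalg.norm(pred - correct, axis=1)
    d_incorrect = np.linalg.norm(pred - incorrect, axis=1)
    return float(np.mean(d_correct < d_incorrect))

def aggregate_over_splits(acc_by_split, shared_by_split):
    """Average accuracies separately over splits where the attribute is
    shared (C) and where it is domain-specific (S)."""
    shared = [acc for s, acc in acc_by_split.items() if shared_by_split[s]]
    specific = [acc for s, acc in acc_by_split.items() if not shared_by_split[s]]
    mean_or_none = lambda xs: float(np.mean(xs)) if xs else None
    return mean_or_none(shared), mean_or_none(specific)

pred = [1, 2, 3, 0]
source_vals = [1, 2, 0, 0]
guide_vals = [0, 0, 3, 0]
acc_if_shared = manipulation_accuracy(pred, source_vals, guide_vals, shared=True)
acc_if_specific = manipulation_accuracy(pred, source_vals, guide_vals, shared=False)
acc_c, acc_s = aggregate_over_splits({"A": 0.9, "B": 0.5, "C": 0.7},
                                     {"A": True, "B": False, "C": True})
```

In the toy example the last pair is ignored because source and guide agree, so two of the three informative predictions match the source and one matches the guide.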
We used standard CycleGAN components: pix2pix [isola2017image] generators and patch discriminators with LS-GAN loss [mao2017least]. We achieved the best results when we represented domain-specific embedding vectors as single-channel images, and made generators and encoders for the same domain share all but the last layers.
In this section, we first compare our method to prior work both qualitatively and quantitatively. Then we show what happens if we remove key losses discussed in Section 3. Finally, we discuss implicit assumptions made by our method, and key challenges that future methods might encounter in further improving manipulation accuracy across the three datasets we used in this paper.
|Method||3D-Shapes[kim2018disentangling]-ABC||SynAction [sun2020twostreamvan]||CelebA [CelebAMask-HQ]|
Figures 9 and 9 show that, in most cases, the proposed method successfully preserves domain-invariant content and applies domain-specific attributes from respective domains on 3D-Shapes and SynAction. Figure 6 shows that, on CelebA, our method preserves poses and backgrounds, and applies hair color better than the baselines. On 3D-Shapes-A, our method also preserves object color and applies correct size and orientation better than all alternatives (Figure 2). A more detailed side-by-side qualitative comparison of generated images across all baselines and all datasets can be found in suppl. Figures 26-36.
Table 1 shows that our method achieves the highest average aggregate manipulation accuracy across three splits of 3D-Shapes, and the second lowest relative discrepancy (RD) between accuracies of modeling the same attributes as shared and specific across all attributes. On SynAction, which matches the inductive bias of AdaIN methods, our method performs on par with AdaIN-based methods and outperforms all non-AdaIN methods. On CelebA, our method is much better at preserving background and skin colors than all AdaIN-based methods, and, in terms of the overall manipulation accuracy, is second only to DIDD [benaim2019didd]. Despite high manipulation scores, DIDD generated blurry faces and also struggled with applying the correct hair color, as can be seen in Figure 6 and the supplementary material. To sum up, our method achieves the best or second-best performance on each of the three datasets, and the best performance overall (last column), with among the lowest discrepancy between per-attribute manipulation accuracies (RD), and the lowest variance across datasets.
During B2A translation on 3D-Shapes-A, the model trained with all losses uses object color/shape from the source image and floor/wall color from the guide (Fig. 9). If we remove the penalty on the capacity of domain-specific embeddings, the model ignores the source input (Fig. 7a-top). The model encodes all attributes into domain-specific embeddings and cycle-reconstructs both inputs perfectly from these embeddings (Fig. 7a-bottom), completely ignoring the source input, so that each translation simply reproduces its guide. Removing honesty losses, on the other hand, results in a model that ignores the guide input altogether (Fig. 7b-top). The model “hides” domain-specific information inside generated translations instead of the domain-specific embeddings, and makes domain-specific embeddings equal zero, resulting in zero capacity loss and zero cycle reconstruction loss. For example (Fig. 7b-bottom), the size and orientation of the source is hidden inside its translation in the form of imperceptible adversarial noise and is used to reconstruct the source perfectly. If the mapping actually used the size and orientation of its guide to generate the reconstruction, it would have also applied that same size and orientation when generating the translation, but it did not; so we conclude that both generators ignore domain-specific embeddings and embed information inside generated translations instead. More illustrations can be found in suppl. Figure 11.
We identified three major causes of remaining errors that existing methods fail to handle at the moment, and that future researchers will need to address to make further progress in this task possible. First, some attributes “affect” very different numbers of pixels in training images, and as a consequence contribute very differently to reconstruction losses, making the job of balancing different loss components much harder. For example, the floor color in 3D-Shapes “affects” roughly half of all image pixels, whereas size affects only one tenth of all pixels, resulting in drastically different effective weights across all losses, especially if both are either domain-specific or shared at the same time. This explains the highest performance of our method on 3D-Shapes-A (in comparison to 3D-Shapes-B,C, see Table 2), which has “similarly-sized” domain-specific attributes in both domains. The second challenge is that all datasets contain some attribute combinations that are almost indistinguishable: for example, front-facing boxes and cylinders are hardly distinguishable in 3D-Shapes, and clothing of people is much less articulated when they are facing backwards because of shading in SynAction (see suppl. Figure 12). This explains why our model fails on these cases: since it cannot reliably reconstruct these attributes from such intermediate translations, it has no incentive to apply correct attribute values to them in the first place. Finally, unevenly distributed shared attributes in real-world in-the-wild datasets (such as CelebA) pose an even more serious challenge, rendering the whole many-to-many problem setup not well defined. For example, if both male and female domains had hair color variation, but males were mostly brunet with only 3% of blondes, and 50% of females were blondes, should the model preserve blonde hair when translating females to males and sacrifice the “realism” of the generated male domain, or should it treat hair color as a domain-specific attribute despite variations present in both?
While more precise attribute manipulation models requiring less supervision might be used for malicious deepfakes [nguyen2019deep, citron2018disinformation], they can also be used to remove biases present in existing datasets [grover2019bias] to promote fairness in down-stream tasks [augenstein2019generative]. We acknowledge that the CelebA dataset contains many biases (e.g. being predominantly white) and that binary gender labels are problematic.
In this paper we propose RIFT - a new unsupervised many-to-many image-to-image translation method that does not rely on an inductive bias hard-coded into its architecture to determine which attributes are shared and which are domain-specific, and achieves consistently high attribute manipulation accuracy across a wide range of datasets with different kinds of domain-specific and shared attributes, and low discrepancy between manipulation accuracies across different attributes and datasets. Moreover, on datasets that match the inductive bias of AdaIN-based methods, the proposed method performs on-par with AdaIN-based methods. Finally, in this paper we identified three core challenges that need to be resolved to enable further development of unsupervised many-to-many image-to-image translation.
7.1 Derivation of the capacity
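Let $A$ and $B$ be arbitrary datasets, $E_B$ and $G_{A2B}$ be domain-specific embedding and generator functions, and $\hat{b} = G_{A2B}(a, E_B(b_g) + \epsilon)$ be the translation from source $a$ to domain $B$, guided by the target example $b_g$, where $\epsilon \sim \mathcal{N}(0, \sigma_{emb}^2 I_d)$ is the noise added to the $d$-dimensional embedding. The following theorem bounds the amount of information about $b_g$ that $G_{A2B}$ can access to generate $\hat{b}$.
The effective capacity of the guided embedding, i.e. the capacity of the $b_g \to \hat{b}$ channel, i.e. the mutual information $I(b_g; \hat{b})$, is bounded by:
$$I\big(b_g;\, \hat{b}\big) \le \frac{d}{2}\log\left(1 + \frac{\mathcal{L}_{cap}}{d\,\sigma_{emb}^2}\right), \qquad \mathcal{L}_{cap} = \mathbb{E}\,\|E_B(b_g)\|_2^2.$$
First, let us define a Markov chain
$$b_g \;\to\; z \;\to\; z + \epsilon \;\to\; \hat{b}, \qquad z = E_B(b_g);$$
using the data processing inequality twice we can show that
$$I\big(b_g;\, \hat{b}\big) \le I\big(z;\, z + \epsilon\big),$$
intuitively meaning that the overall pipeline always loses at least as much information as each of its steps. Then expanding the mutual information in terms of the differential entropy gives us
$$I\big(z;\, z + \epsilon\big) = h(z + \epsilon) - h(z + \epsilon \mid z) = h(z + \epsilon) - h(\epsilon).$$
Since the second raw moment (aka power) of $z + \epsilon$ is bounded by $\mathcal{L}_{cap} + d\,\sigma_{emb}^2$, the entropy $h(z + \epsilon)$ will be maximized if $z + \epsilon$ is a $d$-dimensional spherical multivariate normal with variance $s^2$, where $s^2 = \mathcal{L}_{cap}/d + \sigma_{emb}^2$; therefore
$$I\big(z;\, z + \epsilon\big) \le \frac{d}{2}\log\big(2\pi e\, s^2\big) - \frac{d}{2}\log\big(2\pi e\, \sigma_{emb}^2\big) = \frac{d}{2}\log\left(1 + \frac{\mathcal{L}_{cap}}{d\,\sigma_{emb}^2}\right).$$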
|Split A||Split B||Split C|