Machine learning methods for image-to-image translation are widely studied and have applications in several fields. In medical imaging, the CycleGAN has found an important application for translating one modality to another, for instance in MR to CT translation (han2017mr; sjolund2015generating; Wolterink). Classically, these methods are trained in a supervised setting, which limits their applicability due to a lack of good paired data. Similar issues appear in e.g. transferring the style of one artist to another (gatys) or adding snow to sunny California streets (nvidia_im2im). Unpaired image-to-image translation models such as CycleGAN (Zhu2017) promise to solve this issue by only enforcing a relationship on a distribution level, thus removing the need for paired data. However, given their widespread use, it is paramount to gain more understanding of their dynamics, to prevent unexpected behaviour such as that reported in (Cohen2018). As a step in that direction, we explore the solution space of the CycleGAN in the subsequent sections of this paper.
The general task of unpaired domain translation can be informally described as follows: given two probability spaces and which represent our domains, we seek to learn a mapping such that a sample is mapped to a sample where
The mapping is typically approximated by a neural network parametrized by . Without paired data, directly solving this is impossible, but on a distribution level it is easily seen that if solves eq. 1, then the distribution of as is sampled from is equal to that of . Mathematically, if and are probability spaces with probability measures and respectively, this can be written as
Or in words, the probability measure equals the push-forward measure . By Jensen’s inequality we can relate this to the fixed f-divergence :
While adversarial optimization techniques such as GANs can in principle solve problem eq. 3, they remain under-constrained and thus do not give a reasonable solution to the original problem eq. 1.
The idea behind the cycle consistency condition from (Zhu2017) is to enforce additional constraints by introducing another function , which is also approximated by a neural network and tries to solve the inverse task: for each find that would be the best translation of to . Similar to the reasoning above, this condition would imply that
The goal is to enforce that for all and, similarly, that for all , i.e. to minimize the following cycle consistency loss
where typically the L1 norm is chosen, but in principle any norm can be chosen. Zhu et al. (Zhu2017) also suggested that an adversarial loss could in principle have been used here as well, but they did not note any performance improvement.
Combining these losses, we arrive at the CycleGAN loss defined as
where the factor determines the weight of the cycle consistency term. We illustrate the CycleGAN model in fig. 1.
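The cycle consistency term can be made concrete with a small sketch. The function names, the toy maps `double` and `halve`, and the choice to pass the adversarial part in as a precomputed number are all assumptions made for illustration; the L1 norm follows the usual choice from (Zhu2017).

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Map = Callable[[Vector], Vector]


def l1_distance(a: Vector, b: Vector) -> float:
    """L1 distance between two vectors of equal length."""
    return sum(abs(u - v) for u, v in zip(a, b))


def cycle_consistency_loss(g: Map, f: Map, xs, ys) -> float:
    """Empirical cycle loss: mean |f(g(x)) - x|_1 plus mean |g(f(y)) - y|_1."""
    forward = sum(l1_distance(f(g(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(g(f(y)), y) for y in ys) / len(ys)
    return forward + backward


def cyclegan_loss(adversarial: float, g: Map, f: Map, xs, ys, lambda_cyc: float) -> float:
    """Adversarial part (supplied by the caller) plus the weighted cycle term."""
    return adversarial + lambda_cyc * cycle_consistency_loss(g, f, xs, ys)


# Toy maps (hypothetical): doubling and halving are exact inverses of each other.
double = lambda v: [2.0 * t for t in v]
halve = lambda v: [0.5 * t for t in v]

xs = [[1.0, 2.0], [3.0, -1.0]]
ys = [[0.5, 0.5], [4.0, 2.0]]

exact = cycle_consistency_loss(double, halve, xs, ys)        # the cycle closes
mismatched = cycle_consistency_loss(double, double, xs, ys)  # the cycle does not close
```

A pair of exact inverses has zero cycle loss regardless of whether it is the intended translation, which is the ambiguity studied in the rest of the paper.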
Precautions with generative models have been addressed before, for example, unpaired image-to-image translation can hallucinate features in medical images (Cohen2018). Furthermore, it was already noted in (Zhu2017) that the CycleGAN might admit multiple solutions and that the issue of tint shift in image-to-image translation arises due to the fact that for a fixed input image multiple images with different tints might be equally plausible. Adding an identity loss term was suggested in (Zhu2017) to alleviate the tint shift issue, i.e., the extended CycleGAN loss is defined as
where the factor determines the weight of the identity loss term. In general, to properly define the identity loss one needs to represent both and as being supported on the same manifold, which is limiting if the distributions are substantially different.
The goal of this work is to study the kernel, or null space, of the CycleGAN loss, which is the set of solutions which have zero ‘pure’ CycleGAN loss, and to give perturbation bounds for approximate solutions for the case of the extended CycleGAN loss. We do the theoretical analysis in section 2. We show that under certain assumptions on the probability spaces the kernel has symmetries which allow for multiple possible solutions in Proposition 2.1. Furthermore, we show in Proposition 2.2 and the following remarks that the kernel admits a natural structure of a principal homogeneous space with the automorphism group of acting on the set of solutions freely and transitively. Next, we expand our analysis to the case of approximate solutions for the extended CycleGAN loss by proving perturbation bounds in Proposition 2.3 and Corollary 2.1. We discuss the existence problem of automorphisms in Proposition 2.4 and Proposition 2.6. We proceed in section 3 by showing that unexpected symmetries can be learned by a CycleGAN. In particular, when translating the same domain to itself CycleGAN can learn a nontrivial automorphism of the domain. In appendix A, we briefly explain the measure-theoretic language we use heavily in the paper for those readers who are more used to working with distributions, and also remind the reader of some basic notions from differential geometry which we use as well.
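To see how the identity term can separate solutions that the cycle term cannot, consider the following minimal sketch on one-dimensional samples that are symmetric about zero. The sample values, the helper names and the sign-flip map are hypothetical choices for illustration, not the setup used in the experiments.

```python
def mean_abs_diff(h, samples):
    """Mean of |h(s) - s| over the samples."""
    return sum(abs(h(s) - s) for s in samples) / len(samples)


def cycle_loss(g, f, xs, ys):
    """Cycle consistency term with the absolute value as the norm."""
    return mean_abs_diff(lambda x: f(g(x)), xs) + mean_abs_diff(lambda y: g(f(y)), ys)


def identity_loss(g, f, xs, ys):
    """Identity term: g is applied to target samples, f to source samples."""
    return mean_abs_diff(g, ys) + mean_abs_diff(f, xs)


# Both domains carry the same symmetric empirical distribution (an assumption).
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-2.0, -1.0, 1.0, 2.0]

keep = lambda t: t   # the 'natural' translation
flip = lambda t: -t  # a symmetry of the distribution

cycle_keep, cycle_flip = cycle_loss(keep, keep, xs, ys), cycle_loss(flip, flip, xs, ys)
ident_keep, ident_flip = identity_loss(keep, keep, xs, ys), identity_loss(flip, flip, xs, ys)
```

Both pairs of maps send the samples onto the same set and close the cycle exactly, so only the identity term distinguishes them in this toy case.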
2.1 CycleGAN kernel as a principal homogeneous space
The notions of isomorphism of probability spaces and of probability space automorphisms are central to this paper. Intuitively speaking, an isomorphism of probability spaces and is a bijection between and such that the probability of an event equals the probability of event . An isomorphism of a probability space to itself is called a probability space automorphism. For example, if our probability space consists of samples from an n-dimensional spherical Gaussian distribution, then any rotation in n-dimensional Euclidean space is a probability space automorphism. For a precise definition we refer the reader to appendix A.
Firstly, we prove that if at least one of the probability spaces admits a nontrivial probability automorphism, then any exact solution in the kernel of CycleGAN can be altered giving a different solution.
Proposition 2.1 (Invariance of the kernel).
Let be probability spaces and be a probability space automorphism. Let and be measurable maps satisfying
Then are probability space isomorphisms and
If, furthermore, (where inequality should be understood in the ‘modulo null sets’ sense, i.e., we assert that there are positive probability sets on which the maps do differ), then
Since is a probability space automorphism, its inverse is an automorphism as well. In particular, it is measure-preserving since
Therefore both and are isomorphisms. By definition of ,
Since and is measure-preserving, eq. 9 implies that . Similarly, since is measure-preserving as well. This shows that
Using eq. 10 and the fact that almost everywhere, we conclude that
Combining these observations together, we deduce that
since we assume that essentially differs from the identity mapping. If -a.e., then -a.e. as well, which implies that for -almost every , which is a contradiction. In a similar way one can show that essentially differs from . ∎
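The statement of Proposition 2.1 can be checked by hand on a finite probability space. The sketch below is an illustration under assumptions: the three-point spaces, the masses, and the convention of composing the forward map with the automorphism and the backward map with its inverse are choices made here, not taken from the proposition.

```python
def push_forward(measure, mapping):
    """Push a finite measure (dict point -> mass) forward along a map (dict)."""
    out = {}
    for point, mass in measure.items():
        image = mapping[point]
        out[image] = out.get(image, 0.0) + mass
    return out


def compose(outer, inner):
    """Composition 'outer after inner' of maps stored as dicts."""
    return {p: outer[q] for p, q in inner.items()}


def is_identity(mapping):
    return all(p == q for p, q in mapping.items())


def in_kernel(g, f, mu, nu):
    """Exact solution: both push-forwards match and both cycles close."""
    return (push_forward(mu, g) == nu and push_forward(nu, f) == mu
            and is_identity(compose(f, g)) and is_identity(compose(g, f)))


mu = {"a": 0.25, "b": 0.25, "c": 0.5}   # source space (hypothetical)
nu = {0: 0.25, 1: 0.25, 2: 0.5}         # target space (hypothetical)
g = {"a": 0, "b": 1, "c": 2}
f = {0: "a", 1: "b", 2: "c"}

tau = {"a": "b", "b": "a", "c": "c"}    # swaps two atoms of equal mass
tau_inv = {v: k for k, v in tau.items()}
g_new, f_new = compose(g, tau), compose(tau_inv, f)

sigma = {"a": "c", "b": "b", "c": "a"}  # a bijection that is NOT measure-preserving
sigma_inv = {v: k for k, v in sigma.items()}
g_bad, f_bad = compose(g, sigma), compose(sigma_inv, f)
```

Here `(g_new, f_new)` is a second exact solution that differs from `(g, f)`, whereas `(g_bad, f_bad)` still closes both cycles but no longer matches the distributions.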
We provide the following converse to Proposition 2.1.
Proposition 2.2 (Kernel as a principal homogeneous space).
Let be probability spaces. Let and be measurable maps satisfying
Then there exists a unique probability space automorphism such that
For the proof it suffices to take . Combined with Proposition 2.1, this allows us to say that the group of probability space automorphisms of acts freely and transitively on the set of isomorphisms when the latter set is nonempty. This amounts to saying that the space of solutions of CycleGAN is a principal homogeneous space. It can be helpful to view this result from the abstract category theory point of view, that is, if is a category and is any fixed object, then for any object the automorphism group acts on the set of homomorphisms on the right by composition, i.e. we define
This action leaves the space of isomorphisms invariant, and this restricted action is transitive if is nonempty, and, furthermore, free, i.e. for all and all .
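On a finite space where every point has positive mass, the free and transitive action can be verified by brute-force enumeration. The following sketch is only an illustration: the four-point spaces and their masses are hypothetical, and measure-preserving bijections stand in for isomorphisms.

```python
from itertools import permutations


def measure_preserving_bijections(mu, nu):
    """All bijections g: X -> Y with nu(g(x)) == mu(x) for every point x."""
    xs, ys = list(mu), list(nu)
    found = []
    for image in permutations(ys):
        g = dict(zip(xs, image))
        if all(mu[x] == nu[g[x]] for x in xs):
            found.append(g)
    return found


def compose(outer, inner):
    """Composition 'outer after inner' of maps stored as dicts."""
    return {p: outer[q] for p, q in inner.items()}


def invert(g):
    return {v: k for k, v in g.items()}


mu = {"a": 0.2, "b": 0.2, "c": 0.3, "d": 0.3}
nu = {0: 0.2, 1: 0.2, 2: 0.3, 3: 0.3}

isos = measure_preserving_bijections(mu, nu)   # candidate solutions
auts = measure_preserving_bijections(mu, mu)   # automorphisms of the source

g0 = isos[0]
orbit = [compose(g0, tau) for tau in auts]     # right action by composition

# The automorphism relating g0 to another solution g is recovered by composition.
recovered = [compose(invert(g0), g) for g in isos]
```

Transitivity means the orbit of any one solution covers all of `isos`; freeness means distinct automorphisms give distinct solutions, so the two lists have the same length.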
To proceed with our analysis for the case of approximate solutions for the extended CycleGAN loss, we first formulate a useful ‘push-forward property’ for general f-divergences between distributions on . The proof is provided in appendix A. (While very natural to conjecture and easy to prove, we were unable to find references to this property in the existing ML literature, so we dubbed it a ‘push-forward property’ and provide a proof.)
Lemma 2.1 (Push-forward property for f-divergences).
Let be distributions on and be a diffeomorphism. Then for any f-divergence we have
We are now ready to prove the perturbation bounds for approximate solutions.
Proposition 2.3 (Perturbation bound).
Let be probability spaces with probability densities and let be a diffeomorphic probability space automorphism. Assume that is -Lipschitz, where is some positive constant. Let and be measurable maps. Then the following perturbation bound holds for the extended CycleGAN loss:
The proof is an adaptation of the proof of Proposition 2.1. By definition of ,
Firstly, since is measure-preserving, . Using Lemma 2.1 and the fact that is measure-preserving again, we see that
where the equality uses the fact that is measure-preserving. As before, since almost everywhere.
Finally, since is a probability space automorphism and is -Lipschitz, we conclude that
Combining all these estimates together, we deduce that
and the proof is complete. ∎
Corollary 2.1 (Asymptotic perturbation bound).
In the setting of Proposition 2.3, let and for be a sequence of measurable maps such that the ‘pure’ CycleGAN loss converges to zero, i.e.,
Then the following asymptotic perturbation bound holds for the ‘extended’ CycleGAN loss:
Corollary 2.1 has a direct practical implication. When using a CycleGAN model for translating substantially different distributions (such as different medical imaging modalities) one would be forced to pick a small value for in order for the model to produce reasonable results. Furthermore, since the distributions are substantially different, we can expect that for many nontrivial automorphisms . Therefore, the asymptotic perturbation bound automatically implies that the approximate solution space admits a lot of symmetry, potentially leading to undesirable results.
2.2 Existence of automorphisms
By Proposition 2.1 we see that if either space admits a nontrivial probability automorphism, then the CycleGAN problem has multiple solutions. However, for this to be a problem in practice there must actually exist such probability automorphisms, which we shall now show is the case. First of all, we state the following lemma, which says that we can transfer automorphisms from an isomorphic copy of to itself.
Lemma 2.2.
Let be an isomorphism of probability spaces and be an automorphism of . Then is an automorphism of and the diagram
commutes. Furthermore, if , are submanifolds and , are diffeomorphisms, then is a diffeomorphism as well.
The first claim follows from invertibility of and . The second claim follows from the definition of a diffeomorphism between submanifolds, see appendix A. ∎
An important notion in probability theory is that of a Lebesgue probability space. Many probability spaces which emerge in practice, such as the unit interval with the Lebesgue measure or Euclidean space with a Gaussian probability distribution, both defined on the respective σ-algebras of Lebesgue measurable sets, are instances of Lebesgue probability spaces.
A probability space is called a Lebesgue probability space if it is isomorphic as a measure space to a disjoint union , where is the Lebesgue measure on the σ-algebra of Lebesgue measurable subsets of the interval , and at most countably many atoms of total mass .
Informally speaking, this definition says that Lebesgue probability spaces consist of a continuous part and at most countably many Dirac deltas (=atoms). First of all, we provide an abstract result about existence of nontrivial probability space automorphisms in Lebesgue probability spaces which are either ‘not purely atomic’ or have at least two atoms with equal mass. ‘Not purely atomic’ means that the sum of the probabilities of all atoms is strictly less than .
Proposition 2.4.
Let be a Lebesgue probability space such that at least one of the assumptions
not purely atomic;
there exist at least two atoms in with equal mass
holds. Then admits nontrivial automorphisms.
If the space is not purely atomic, we have for some , where is the continuous part and is the atomic part of the probability measure . The interval admits at least one nontrivial automorphism, namely the transformation (leaving the atoms fixed), hence so does by Lemma 2.2. In fact, there are infinitely many other automorphisms, which can be obtained by exchanging nonoverlapping subintervals of the same length. If there exist two atoms in with equal mass, then a transformation which transposes with and keeps the rest of fixed is a nontrivial automorphism. ∎
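The two kinds of interval automorphisms mentioned in the proof can be illustrated numerically. In the sketch below, the reflection of the unit interval and the particular pair of exchanged subintervals are assumed examples, and a uniform grid of midpoints stands in for the Lebesgue measure.

```python
def reflect(x):
    """Reflection of the unit interval about its midpoint."""
    return 1.0 - x


def exchange(x):
    """Swap [0, 0.25) with [0.5, 0.75) and fix every other point."""
    if 0.0 <= x < 0.25:
        return x + 0.5
    if 0.5 <= x < 0.75:
        return x - 0.5
    return x


def histogram(points, bins):
    """Count how many points of [0, 1] fall into each of `bins` equal cells."""
    counts = [0] * bins
    for p in points:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


# Midpoints of 1000 equal cells: a deterministic stand-in for uniform samples.
grid = [(k + 0.5) / 1000 for k in range(1000)]

before = histogram(grid, 10)
after_reflect = histogram([reflect(x) for x in grid], 10)
after_exchange = histogram([exchange(x) for x in grid], 10)
```

Both maps move almost every point yet leave all cell counts unchanged, which is the discrete analogue of being measure-preserving. The exchange map is discontinuous, in line with the remark that abstract automorphisms need not be smooth.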
Probability spaces of images which appear in real life typically have a continuous component which would correspond to continuous variations in object sizes, lighting conditions, etc. Therefore, they admit some probability space automorphisms. However, such abstract automorphisms can be highly discontinuous, which would make it questionable if neural networks can learn them. We would like to show that there are also automorphisms which are smooth, at least locally. For this, we first state the following technical claim. The proof is provided in appendix A.
Proposition 2.5.
Let be a Borel probability measure on and be a continuous injective function. Then is an isomorphism of probability spaces, where denotes the push-forward of measure to .
Finally, we show the existence of smooth automorphisms under the assumption that our data manifold can be generated by embedding with standard Gaussian measure into as a submanifold. We write for the standard Gaussian probability measure on the space .
Proposition 2.6.
Let be an -dimensional standard Gaussian distribution. Let be a manifold embedding. Denote by the probability space . Then the following assertions hold:
is an isomorphism of probability spaces when viewed as a map ;
every rotation is a probability space automorphism and a diffeomorphism of . induces a probability space automorphism of which is, additionally, a diffeomorphism when restricted to .
The connection with generative models is clear if we take to be an invertible generative model such as RealNVP (dinh2017) or Glow (glow2018). The assumption of manifold embedding in the proposition can be seen as too limiting in general, and we explain how to ‘bypass’ it in Lemma A.2 for the interested readers. In conclusion, if we assume that the distributions we are working with could be represented by an invertible generative model, then there exists a rich space of automorphisms. Given the success of e.g. Glow, this assumption seems to be valid for natural images.
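The construction of a smooth automorphism from a latent rotation can be sketched in two dimensions. Everything specific below is an assumption made for illustration: the toy map `phi` is a hand-written additive shear in the spirit of a coupling layer, not RealNVP or Glow, and the rotation angle is arbitrary.

```python
import math


def rotate(z, angle):
    """Rotation of the plane by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def phi(z):
    """Toy invertible 'generator': additive shear with unit Jacobian determinant."""
    return (z[0], z[1] + z[0] ** 2)


def phi_inv(x):
    return (x[0], x[1] - x[0] ** 2)


def latent_density(z):
    """Standard Gaussian density on the plane."""
    return math.exp(-0.5 * (z[0] ** 2 + z[1] ** 2)) / (2.0 * math.pi)


def data_density(x):
    """Density of the push-forward of the Gaussian under phi (Jacobian is 1)."""
    return latent_density(phi_inv(x))


def induced_automorphism(x, angle):
    """Decode, rotate in latent space, encode again."""
    return phi(rotate(phi_inv(x), angle))


x = (0.7, -0.3)
angle = 1.1
moved = induced_automorphism(x, angle)
restored = induced_automorphism(moved, -angle)
```

The induced map moves `x` to a different point with the same data density, and it is undone by the opposite rotation. Both the rotation and the toy shear have unit Jacobian determinant, so preserving the density is enough for the induced map to preserve the measure in this example.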
3 Numerical results
Since we have established that the existence of automorphisms can negatively impact the results of CycleGAN, we now demonstrate how this can happen by considering a toy case with a known solution and demonstrating that CycleGAN can and does learn a nontrivial automorphism. The toy experiment which we perform is translation of the MNIST dataset to itself. That is, at training time we pick two minibatches and from MNIST at random and use these as samples from and respectively. The generator neural network in this case is a convolutional autoencoder with residual blocks, a fully connected layer in the bottleneck and no skip connections from encoder to decoder. We also train a simple CNN for MNIST classification in order to classify CycleGAN outputs. The networks were trained using SGD. The ‘natural’ transformation in this case is, of course, the identity mapping and we expect the classification of the inputs and outputs to stay the same. But we shall see that this is not the case.
We provide the confusion matrices for the A2B and B2A generators. We use these matrices to understand if e.g. the class of the transformed image for A2B translation equals the source class, or if it is a random variable independent of the source class, or if we can spot some deterministic permutation of classes.
We have observed that in practice the identity mapping is not learned. Instead, the network leans towards producing a certain permutation of digits, rather than identity or a random assignment of classes independent of the source label. One explanation would be as follows. Suppose that we can perfectly disentangle class and style in a latent digit representation (aae). Then any permutation in , acting on the class part of the latent code, determines a probability space automorphism on the space of digits, which can be learned by a neural network. Further investigation of confusion matrices reveals that the networks introduce short cycles, e.g., mapping to and vice versa.
We provide additional experiments on the BRATS2015 dataset in appendix B, where we show that in the absence of identity loss the pure CycleGAN loss demonstrates noticeable symmetry, while the PSNR is clearly not invariant. Increasing the weight of the identity loss term reduces the symmetry, but does not necessarily result in a similar PSNR improvement.
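The kind of confusion matrix analysis described here can be sketched as follows. The helper names and the four-class toy labels are hypothetical; in the experiment the predicted labels would come from the separately trained classifier applied to translated images.

```python
def confusion_matrix(source_labels, predicted_labels, num_classes):
    """Rows index the source class, columns the class of the translated image."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for s, p in zip(source_labels, predicted_labels):
        matrix[s][p] += 1
    return matrix


def dominant_class_map(matrix):
    """For each source class, the most frequent class after translation."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def is_permutation(mapping):
    return sorted(mapping) == list(range(len(mapping)))


def cycles(mapping):
    """Cycle decomposition of a permutation given as a list."""
    seen, result = set(), []
    for start in range(len(mapping)):
        if start in seen:
            continue
        cycle, current = [], start
        while current not in seen:
            seen.add(current)
            cycle.append(current)
            current = mapping[current]
        result.append(cycle)
    return result


# Toy labels (made up): classes 0 and 1 are mostly swapped, 2 and 3 mostly kept.
source = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
predicted = [1, 1, 0, 0, 0, 0, 2, 2, 2, 3, 3, 2]

matrix = confusion_matrix(source, predicted, 4)
mapping = dominant_class_map(matrix)
found_cycles = cycles(mapping) if is_permutation(mapping) else None
```

A dominant map that is a permutation with short nontrivial cycles, rather than the identity, is the signature of a learned class-level automorphism.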
4 Discussion and future work
We have shown theoretically that under mild assumptions, the kernel of the CycleGAN admits nontrivial symmetries and has a natural structure of a principal homogeneous space. To show empirically that such symmetries can be learned, we have trained a CycleGAN on the task of translating a domain to itself. In particular, we show that on the MNIST2MNIST task, in contrast to the expected identity, the CycleGAN learns to permute the digits. We have therefore effectively shown that it is not the CycleGAN loss which prevents this from occurring more often, and we hypothesize that the network architecture also has a major influence. We advocate against the usage of CycleGAN when translating between substantially different distributions in critical tasks such as medical imaging, given the theoretical results in Corollary 2.1 which suggest ambiguity of solutions, even in the presence of the identity loss term.
We would like to point out that some work has been done recently extending the CycleGAN. For example, in (Na2019) the authors argue that many image-to-image translation tasks are ‘multimodal’ in the sense that there are multiple equally plausible outputs for a single input image, and that one should therefore explicitly model this uncertainty in the model. To address this issue, the authors design a network which has two ‘style’ encoders , two discriminators for each domain, two conditional encoders for each direction and two generators for each direction . The style encoders serve to extract the ‘style’ of the image, which is present in both domains, e.g., in case of the ‘female-to-male’ task on the CelebA dataset the style would correspond to coarsely represented facial features. The loss term forces the mutual information between the style vector of the translated image and the input style to the conditional encoder to be maximized. This allows the network to roughly preserve the style in the translation. While we leave a full analysis of this approach for future work, we expect that such a loss would reduce ambiguity in the solution space to those isomorphisms which differ by automorphisms from the set
leaving the style fixed, since replacing with and with does not change the loss value for such . Therefore, the reduction in uncertainty of our solution depends on the capacity of the encoder , and, ideally, should be quantified. In particular, one might still need to enforce additional problem-specific features in the encoder to guarantee that important image style content is preserved.
Appendix A Background
Firstly, we very briefly explain the probability theory language we use in this article, and we refer the reader to (otae; bogachev) for more details. Formally, a measurable space is a pair of a set and a σ-algebra of subsets of . Given a topological space with topology , there exists the smallest σ-algebra , which contains all open sets in . This σ-algebra is called the Borel σ-algebra of and its elements are called Borel sets. A probability space is a triple of a set , a σ-algebra of subsets of and a probability measure defined on the σ-algebra . Given a probability space , a measurable set is called an atom if and for all measurable such that we have . Given measurable spaces and , we say that a mapping is measurable if for any we have . If and are probability spaces and is a measurable map, we say that is measure-preserving if for all we have . An approximation argument easily shows that a measurable transformation is measure-preserving if and only if for all nonnegative measurable functions on we have
Given a probability space , a measurable space and a measurable map , we define the push-forward measure on by setting for all .
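For finite spaces, the push-forward measure and the measure-preserving property reduce to simple bookkeeping with preimages. The sketch below uses a fair die and its parity as a made-up example; the function names are not taken from the text.

```python
from fractions import Fraction


def measure_of(measure, event):
    """Mass of an event (a set of points) under a finite measure."""
    return sum(measure[p] for p in event)


def preimage(mapping, event):
    """All points that the map sends into the event."""
    return {p for p in mapping if mapping[p] in event}


def push_forward_of_event(measure, mapping, event):
    """Push-forward measure of an event: the mass of its preimage."""
    return measure_of(measure, preimage(mapping, event))


def is_measure_preserving(mu, nu, mapping):
    """On a finite target it suffices to compare the masses of single points."""
    return all(push_forward_of_event(mu, mapping, {y}) == nu[y] for y in nu)


die = {face: Fraction(1, 6) for face in range(1, 7)}   # fair die
parity = {face: face % 2 for face in die}              # measurable map to {0, 1}

fair_coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
biased_coin = {0: Fraction(1, 3), 1: Fraction(2, 3)}

mass_even = push_forward_of_event(die, parity, {0})
```

The parity map is measure-preserving onto the fair coin but is not injective, so it has no essential inverse and is not an isomorphism in the sense defined next.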
Let and be probability spaces and be a measure-preserving map. A measurable map is called an essential inverse of if for -almost every and for -almost every . One can show that an essential inverse is measure-preserving and uniquely defined up to equality almost everywhere. We say that is an isomorphism if it admits an essential inverse. An isomorphism of a probability space to itself is called an automorphism.
Lemma A.1 (Push-forward property for f-divergences).
Let be distributions on and be a diffeomorphism. Then for any f-divergence we have
First of all, the change of variables formula for the integral implies that
Applying the change of variables formula with , we get
where the equality in uses a general property of Jacobians of smooth invertible maps that . Hence , which completes the proof. ∎
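For readers who want the computation spelled out, the following is a sketch of the change-of-variables argument under assumed notation. The symbols p, q, φ and J, the argument order of the divergence, and the assumption that q is strictly positive are choices made here for illustration and may differ from the conventions used in the statement above.

```latex
% Assumed notation: p, q are densities on R^n with q > 0, \varphi is a
% diffeomorphism of R^n, J_\varphi is its Jacobian matrix, and f is convex
% with f(1) = 0.
\begin{align*}
D_f(p \,\|\, q) &:= \int_{\mathbb{R}^n} f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, dx, \\
(\varphi_* p)(y) &= p\big(\varphi^{-1}(y)\big)\, \big|\det J_{\varphi^{-1}}(y)\big|, \\
D_f(\varphi_* p \,\|\, \varphi_* q)
  &= \int_{\mathbb{R}^n} f\!\left(\frac{p(\varphi^{-1}(y))}{q(\varphi^{-1}(y))}\right)
     q\big(\varphi^{-1}(y)\big)\, \big|\det J_{\varphi^{-1}}(y)\big|\, dy \\
  &= \int_{\mathbb{R}^n} f\!\left(\frac{p(x)}{q(x)}\right) q(x)\,
     \big|\det J_{\varphi^{-1}}(\varphi(x))\big|\, \big|\det J_{\varphi}(x)\big|\, dx \\
  &= D_f(p \,\|\, q).
\end{align*}
```

The Jacobian factors cancel in the density ratio in the third line, the fourth line substitutes y = φ(x), and the last step uses that the product of the two determinants equals one.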
We remind the reader that a Polish space is a separable completely metrizable topological space. A Borel probability space is a Polish space endowed with a probability measure on its Borel σ-algebra, and we will also say that is a Borel probability measure. The basic examples of Borel probability spaces would be e.g. the spaces with its Borel σ-algebra , endowed with Lebesgue measure . The Borel σ-algebra of the space endowed with Lebesgue measure can be extended by adding all -measurable sets, leading to the σ-algebra of Lebesgue-measurable sets.
For the proof of Proposition 2.5 we need the following theorem, see (kechris), Theorem 15.1.
Theorem A.1 (Lusin-Souslin theorem).
Let be Polish spaces and be continuous. If is Borel and is injective, then is Borel.
Proof of Proposition 2.5.
Denote the image by . Then is a Borel subset, since is a countable union of compact sets and is continuous. Furthermore, from the Lusin-Souslin theorem (theorem A.1) it follows that for every Borel subset its image is Borel as well. Pick a point which is not an atom of . We want to define an almost everywhere inverse of . Define a function by
Using the remark above it is easy to see that is Borel measurable and that for every Borel . It follows from the definition that and that
Since , is an almost everywhere inverse to . We conclude that is a probability space isomorphism. ∎
Secondly, we remind the reader of a couple of notions from differential geometry which we use in the text, and we refer the reader to e.g. (warner) for more details. Given a subset of a manifold and a subset of a manifold , a function is said to be smooth if for all there is a neighborhood of and a smooth function such that extends , i.e., the restrictions agree . is said to be a diffeomorphism between and if it is bijective, smooth and its inverse is smooth. Let and be smooth manifolds. A differentiable mapping is said to be an immersion if the tangent map is injective for all . If, in addition, is a homeomorphism onto , where carries the subspace topology induced from , we say that is an embedding. If and the inclusion map is an embedding, we say that is a submanifold of . Thus, the domain of an embedding is diffeomorphic to its image, and the image of an embedding is a submanifold.
We close this section with a small lemma, explaining how one can weaken the embedding assumption for generative models in Proposition 2.6.
Lemma A.2.
Let be an injective manifold immersion. Let be an open ball of radius in and be its closure. Then is a manifold embedding.
Since is compact and is continuous, the image of every closed subset is compact and hence closed. This shows that is continuous and thus is a homeomorphism. Restricting to the open ball , we conclude that is a homeomorphism and thus a manifold embedding. ∎
As a consequence, for our example with a spherical Gaussian latent vector one can take a sufficiently large ball of radius in the latent space, truncating the latent distribution to ‘sufficiently likely’ values. This ball remains invariant under rotations, thus leading to a differentiable automorphism on the submanifold of ‘sufficiently likely’ images.
Appendix B BRATS2015 experiments
We present some additional results on the BRATS2015 dataset. For this experiment Unet-based generators with residual connections were used. The number of downsampling layers was 4 for both generators, and skip connections were preserved. We trained all models for 20 epochs with the Adam optimizer and learning rate. We trained 4 models with . No data augmentation was used so as to avoid creating any additional symmetries. All images were normalized by dividing by the -percentile, as is common in medical imaging when working with MR data.
We hypothesize that flipping images horizontally is a distribution symmetry. We measure the final test loss for both the network output (Loss) and its flipped version (Loss (f)), as well as the PSNR for both translation directions without (PSNR T1-Fl, PSNR Fl-T1) and with horizontal flips (PSNR T1-Fl (f), PSNR Fl-T1 (f)). We summarize these results in table 1.
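The flipped and unflipped PSNR measurements can be sketched as follows. This is an illustration under assumptions: images are nested lists of floats, the peak value `max_value=1.0` is a placeholder rather than the normalization used in the experiment, and the tiny two-by-two images are made up.

```python
import math


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for u, v in zip(row_a, row_b):
            total += (u - v) ** 2
            count += 1
    return total / count


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect match."""
    error = mse(prediction, target)
    if error == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


def flip_horizontal(image):
    """Mirror every row of the image."""
    return [list(reversed(row)) for row in image]


# Toy case: the 'translation' is exactly the mirrored target.
target = [[0.0, 1.0], [0.0, 1.0]]
prediction = [[1.0, 0.0], [1.0, 0.0]]

psnr_plain = psnr(prediction, target)
psnr_flipped = psnr(flip_horizontal(prediction), target)
```

A model that has learned the flip symmetry scores poorly on the plain PSNR and well on the flipped one, even though a loss that is invariant under the flip cannot tell the two outputs apart.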
We observe that in the absence of identity loss the pure CycleGAN loss demonstrates noticeable symmetry, while the PSNR is clearly not invariant. Increasing the weight of the identity loss term reduces the symmetry, but does not always result in a similar PSNR improvement. We present some samples from the model with in fig. 3(a), fig. 3(b).
| Loss | Loss (f) | PSNR T1-Fl | PSNR T1-Fl (f) | PSNR Fl-T1 | PSNR Fl-T1 (f) |