In the task of unsupervised image-to-image translation, we seek a mapping between two distributions of images where we do not have a known pairwise correspondence between points. Implicit in the task is the assumption that the variation in a system with two distributions can be broken down into: 1. variation separating the source and target domains and 2. variation intrinsic to just the source or just the target domain. The goal in image-to-image translation is to remove variation along the first axis while preserving all of the other variation along the axes in the second category.
Since their advent, generative adversarial networks (GANs) based on the cycle-consistency assumption have dominated the field Zhu et al. (2017); Almahairi et al. (2018); Choi et al. (2018); Lee et al. (2018); Yi et al. (2017). Under the cycle-consistency assumption, two networks are trained simultaneously (one in each direction between the two domains) and are constrained to be inverses of each other. With a cycle-consistent GAN, the removal of the variation separating the source and target domains (from 1. above) is accomplished by the discriminator distinguishing between the target domain images and the generated images. To avoid changing other sources of variation (from 2. above), the cycle-consistency loss is employed. While invertibility may be a desirable property in some mappings, it prohibits other mappings from being learned (e.g. many-to-one relationships cannot be inverted losslessly). Moreover, empirical evidence shows that training under cycle-consistency produces unwanted restrictions, such as generating a set of images whose pairwise distances, measured in pixel space, are highly correlated with those of the original set of images Benaim and Wolf (2017); Amodio and Krishnaswamy (2019). These limitations shed light on the seemingly arbitrary difference between domains where cycle-consistent models succeed impressively and those where they utterly fail Amodio and Krishnaswamy (2019).
The problems of cycle-consistent training are especially exacerbated when mapping between many domains, for example using one model to map between domains represented by all attributes in the canonical CelebA dataset Liu et al. (2015). In the context of many domains, one generator network performs all of the mappings, distinguished by conditional labels. With one generator network having to be its own inverse in pixel space between many different domains, with just conditional labels to distinguish them, it comes as no surprise that existing cycle-consistency models struggle.
We offer an alternative approach: rather than a completely unrestricted architecture coupled with a cycle-consistency constraint, we directly restrict the generator architecture.
Our approach, which we call semantic attribute transfer, is built on the previously proposed notion of the consciousness prior Bengio (2017). Based on ideas from cognitive neuroscience, the consciousness prior hypothesizes that high-level abstract concepts are stored sparsely in deep representations Baars (1993); Van Gulick and Zalta (2009); Dehaene et al. (2017). Rather than performing transformations directly in pixel space, this prior supposes that transformations between high-level traits can be done in a latent layer with relatively simple functions performed on sparse coordinates chosen with an attention mechanism.
Semantic attribute transfer bridges the gap between the consciousness prior, previously detailed in the context of natural language processing, and image-to-image translation. In the consciousness prior work, consciousness is defined as a process that restricts the attention of a network to changing one feature at a time. While that work views this as consciousness over time, we recast it as consciousness in an abstract representation feature space.
We restrict the translation to be a consciousness process in the following way: we eschew the cycle-consistency loss term, but instead of using an entirely unconstrained generator architecture, we only allow it to perform a single-feature transformation. We train the generator to autoencode, i.e. when given an image as input, it produces the same image as output. Jointly, we allow an attention mechanism to select a single neuron in the middle layer of the generator, perform a simple parametric change to its value, and finish the feedforward pass to produce a generated image (Figure 1).
This construction places a prior on the representation of high-level concepts in the latent layer of the generator, corresponding to each domain being stored as independent factors at this level. As proposed in the previous work on the consciousness prior, this allows each transformation to constitute a single “conscious thought”. With this, an analogy can be made to how a human would focus consciousness on one main difference between the two distributions, and ignore all other minor differences.
Such a strictly constrained transformation function ensures that only the dominant axis of variation between domains is changed, as it can only change the amount of information that can be stored in a single neuron (while also being able to autoencode from the representation). Furthermore, this obviates the need for heuristic losses frequently used in image-to-image translation, like the cycle-consistency loss or the loss that discourages a network from changing input that is already in the target domain Zhu et al. (2017).
The main experimental findings are as follows:
- The strength of our representation can be improved by adding more domains to the mapping problem due to increasingly disentangled representations. This makes our model more scalable to increasing numbers of domains than previous work.
- Our results indicate that requiring cycle-consistency in training is an overly restrictive heuristic: without it our model learns inverse functions when domains are inherently one-to-one and not when otherwise.
2 Related Work
The consciousness prior hypothesizes a biologically motivated representation constraint on the manipulation of high-level, abstract characteristics in the data Bengio (2017). In previous work, it was explained in terms of consciousness over time in a recurrent natural language processing model. This framed the memory of a model at time in terms of its memory at time , using a consciousness process :
By restricting to be sparse and to be simple, we can simulate the model focusing on only one feature at each step.
The StarGAN extended image-to-image translation work by considering many domains Choi et al. (2018). To avoid the combinatorial growth of having to train distinct models for every pair of domains, they use one generator network and one discriminator network that are conditioned upon the target domain. Then, using the original domain label, a cycle-consistency loss is imposed upon both directions. Otherwise, their architectures are the completely unconstrained ones commonly used in other image-to-image translation models.
Here we consider the multi-domain image-to-image translation problem. Let be a finite sample from domain . We seek mappings that take points outside of domain (in the complement) and map them into domain . This terminology emphasizes that we want a separate mapping for each domain that maps into domain from anywhere else in the space. We note that the domains may not form a partition as they may overlap. We employ the generative adversarial network formulation that trains these mappings by pitting a generator against a discriminator in the standard alternating optimization process:
Each mapping , which maps into domain , shares the same network weights. The target domain to map into is selected via a conditioning label . We choose this rather than using separate networks for each generator and each discriminator to facilitate scaling to large numbers of domains.
In our image-to-image domain translation context, we modify the previous definition of a consciousness process :
Instead of operating on a time axis, it is now a function that operates on a low-level representation of an image in domain , and parameterized by a sparse attention mechanism and a simple parameterized edit , produces a low-level representation of an image in domain .
To create a generator that uses such a consciousness process, we create a novel architecture for the generator function. The conditioning label is provided to the generator by concatenating a one-hot vector to the input channel-wise. We decompose our generator into a downsampling encoder and an upsampling decoder. We run our generator in two distinct modes (optimized jointly), and in the first it is simply the composition of the encoder and the decoder. In this part of the total loss we train it as an autoencoder with the usual reconstruction loss, which will result in an approximate identity function.
Simultaneously, we run the generator in another form designed to perform the transformation. To do so, we place a transform in between the encoder and decoder:
Crucially, by enforcing to be the identity and restricting properly, we ensure that the generating function conforms to the notion of the consciousness prior, by only allowing the “consciousness” of the generator to focus on a single element, selected via a sparse attention mechanism, and change it with a simple edit.
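The two generator modes described above can be sketched as follows. This is a minimal illustration rather than the architecture used in the experiments: the names `encoder`, `decoder`, and `transform` are hypothetical callables, and mean-squared error is assumed as the "usual reconstruction loss", which the text does not specify.

```python
import numpy as np

def generator_identity(x, encoder, decoder):
    """First mode: plain autoencoding, decoder applied directly to the encoding."""
    return decoder(encoder(x))

def generator_translate(x, encoder, decoder, transform):
    """Second mode: a transform is placed between the encoder and the decoder."""
    return decoder(transform(encoder(x)))

def reconstruction_loss(x, encoder, decoder):
    """Mean-squared reconstruction error (assumed form of the reconstruction loss)."""
    x_hat = generator_identity(x, encoder, decoder)
    return float(np.mean((x - x_hat) ** 2))
```

With a perfectly invertible toy encoder and decoder, the reconstruction loss is zero and any change in the output of the second mode is attributable to the transform alone.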
The transform consists of two aspects: an attention mechanism which selects a feature dimension to edit, and a simple quadratic edit. We chose the simplest transformation that had sufficient expressiveness to produce domain adaptation results (an affine transformation was insufficiently expressive). Let be a feature vector representation in the internal layer of the generator. A domain-specific attention vector selects one of the dimensions to edit, and domain-specific parameters are used in the simple quadratic edit:
where denotes element-wise multiplication. To ensure that the attention mechanism is sparse, we add an entropy penalty to the loss:
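As a hedged illustration of an attention-gated quadratic edit with an entropy penalty, consider the sketch below. Several choices are assumptions not confirmed by the text: the attention vector is parameterized as a softmax over logits, the edit is applied residually (the original feature plus the attended quadratic term), and the coefficient names `a`, `b`, `c` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def quadratic_edit(h, attn_logits, a, b, c):
    """Apply an attention-gated quadratic edit to latent features.

    h:           (batch, depth) latent feature vectors.
    attn_logits: (depth,) domain-specific attention logits (assumed parameterization).
    a, b, c:     (depth,) domain-specific quadratic coefficients (hypothetical names).
    Returns the edited features and the attention vector.
    """
    attention = softmax(attn_logits)      # non-negative, sums to 1 over feature depth
    edit = a * h ** 2 + b * h + c         # element-wise quadratic in h
    return h + attention * edit, attention  # residual form is an assumption

def attention_entropy(attention, eps=1e-12):
    """Shannon entropy of the attention vector; lower entropy means sparser attention."""
    return -np.sum(attention * np.log(attention + eps))
```

Minimizing `attention_entropy` pushes the attention toward a one-hot vector, so that the edit effectively touches a single feature dimension.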
When adapting the consciousness prior concept from natural language to images, we are faced with the additional complication of points whose feature depth is distributed across spatial locations. Thus, here we observe that while the transform performs the domain translation in a sparse way with respect to the feature depth, it does not have spatial awareness. To address this, we introduce a skip connection network to act as a mask and select the spatial coordinates for the transform to apply to. This skip connection network has a one-channel sigmoid output mask and is used in the final generator model in the following way:
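One plausible way to apply such a mask is a convex blend between the edited and unedited feature maps at each spatial location. The sketch below assumes this blending form, which is not confirmed by the text; the function name and argument names are hypothetical.

```python
import numpy as np

def apply_spatial_mask(h, h_edited, mask_logits):
    """Blend edited and original feature maps with a one-channel sigmoid mask.

    h, h_edited: (height, width, depth) feature maps before and after the edit.
    mask_logits: (height, width, 1) pre-sigmoid output of the mask network.
    """
    mask = 1.0 / (1.0 + np.exp(-mask_logits))   # values in (0, 1), one per location
    # The single mask channel broadcasts across the feature depth.
    return mask * h_edited + (1.0 - mask) * h
```

Where the mask is near one the edited features pass through, and where it is near zero the original features are kept unchanged.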
The total objectives for the generator and the discriminator, respectively, are thus:
with coefficients controlling how much each term weighs relatively in the total.
There are many challenges to building models in the multi-domain image-to-image context. Most significantly, many different mappings must be learned at once. In a typical image-to-image task, only two domains are considered, and there is one generator entirely dedicated to going in each direction. That means each generator’s weights only need to be directed towards one set of pixel combinations (e.g. learning weights that can only produce red coloring, thus never erroneously producing yellow coloring, because the network just tunnels all output towards the red part of the space). In this context, though, many different sets of pixel combinations all need to be generated at different times. Beyond just the challenges in generation, the adversarial discriminator also faces a more difficult task in having to distinguish between many different domains, as opposed to just one.
Our architecture, based on the U-Net framework, consists of three stride-two downsampling layers with Leaky ReLU activations and three upsampling layers with regular ReLU activations, and skip connections between downsampling layers and upsampling layers of the same dimensionality Ronneberger et al. (2015). We use the Adam optimizer to train with a learning rate of and momentum parameters Kingma and Ba (2014). Resolution is used for inputting the images. The loss function is the standard sigmoid-based GAN loss. The discriminator’s architecture is identical to the downsampling half of the generator except with a flattening, conditional projection, and classification layer added on top of the last layer.
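For reference, a sigmoid-based GAN loss can be sketched as below, operating on raw discriminator logits. The non-saturating generator loss is an assumption, since the text only states that the standard sigmoid-based loss is used; function names are illustrative.

```python
import numpy as np

def log_sigmoid(x):
    """Stable log(sigmoid(x)), computed as -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def discriminator_loss(real_logits, fake_logits):
    """Cross-entropy with real images labeled 1 and generated images labeled 0."""
    # log(1 - sigmoid(x)) equals log_sigmoid(-x)
    return float(-np.mean(log_sigmoid(real_logits)) - np.mean(log_sigmoid(-fake_logits)))

def generator_loss(fake_logits):
    """Non-saturating generator loss (assumed variant): push fakes toward label 1."""
    return float(-np.mean(log_sigmoid(fake_logits)))
```

In the alternating optimization, the discriminator step minimizes `discriminator_loss` and the generator step minimizes `generator_loss`, each with the other network held fixed.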
We use several baseline models as comparisons. The most relevant method we compare to for multi-domain image-to-image translation is the StarGAN Choi et al. (2018). The StarGAN relies on a cycle-consistency loss, but is architecturally equivalent to ours, allowing us to test this form of loss against our approach which follows the consciousness prior. We compare to it with varying choices for their hyperparameters, the coefficient of the cycle-consistency and the coefficient of the identity loss, with StarGAN being 10 and 5, StarGAN being 10 and 0, and StarGAN being 1 and 0, respectively. We then additionally compare to two other multi-modal cycle-consistency models, namely the MUNIT and NoiseAugmented models Huang et al. (2018); Almahairi et al. (2018). We note that these models are designed for producing multiple outputs within one target domain, but do not natively handle multiple domains like we have in our context. We thus use the same approach as in the StarGAN of giving the model the target label to learn a conditional output distribution.
In our first experiment, we perform image-to-image translation using the primary color of birds in the CUB dataset Welinder et al. (2010). We use each primary color with more than 100 images (eight total domains).
Samples from our model with changing domains are shown in Figure 3. We emphasize that each image is produced by changing just one neuron in the middle layer away from the identity function. Through this, we see that the concept of primary color has been stored sparsely and simply, conforming to the hypothesis of the consciousness prior.
To quantify the performance, we utilize the traditional measure of generative distributional accuracy, the FID score. In image-to-image translation, it is insufficient to measure image quality, as a real image is provided to the generator as input. Instead, a distributional score is necessary to measure how well the input images match the images in the target domain. For this reason, FID score, which measures distributional distance to the target distribution, is the most appropriate metric. In Table 1, we report the average FID score across all domains from our model and the baselines. We found our model performed significantly better than all baselines. In Figure 5, we plot the FID score per domain. Our model has a consistently low FID in each domain, with only its worst domain FID being higher than any other models’ best domain FID.
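The distance underlying the FID score is the Fréchet distance between two Gaussians fitted to feature sets. The sketch below computes it with NumPy only, assuming the features (in practice, activations of a pretrained Inception network) have already been extracted; the function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets, d >= 2."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in exact
    # arithmetic; clip small negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Because the score compares whole distributions rather than individual images, it is computed once per target domain, between generated images and real images from that domain.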
In the previous experiment, there were only eight domains and they were characterized by color changes only. From here we move on to a harder test of more domains that are characterized by more complicated transformations in the CelebA dataset. Because the attributes in the dataset are not necessarily mutually exclusive (e.g. an image can both be “black-hair” and “glasses”), we expand the number of attributes into domains, where each attribute has a domain corresponding to “has the attribute” and a domain corresponding to “does not have the attribute”.
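The expansion of binary attributes into "has" and "does not have" domains can be sketched as follows. The column layout (interleaving the two domains of each attribute) is an arbitrary choice made for illustration.

```python
import numpy as np

def attributes_to_domains(attrs):
    """Expand an (n, k) binary attribute matrix into (n, 2k) domain memberships.

    Column 2*j marks "has attribute j" and column 2*j + 1 marks
    "does not have attribute j" (layout chosen for illustration).
    """
    attrs = np.asarray(attrs, dtype=bool)
    n, k = attrs.shape
    domains = np.empty((n, 2 * k), dtype=bool)
    domains[:, 0::2] = attrs
    domains[:, 1::2] = ~attrs
    return domains
```

Each image then belongs to exactly one of the two domains for every attribute, so the domains overlap across attributes but are mutually exclusive within one.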
In Figure 3, we see samples from our model with changing target attributes, again differing from the identity function only by a single simple change to one neuron in the middle layer. For example, the ability to produce two images, one without glasses and one with glasses, that differ in their latent representation by only one neuron provides strong empirical evidence that the abstract concept of “glasses” has indeed been stored sparsely and simply. As before, in Figure 5 we show the FID score per domain for each model, and in Table 2 we report the mean FID across all domains per model. In this case, as in the previous experiment, we see our model significantly outperform the cycle-consistent models across all domains.
We next perform a series of experiments on this dataset to more thoroughly examine our model to better understand how it outperforms the best alternative StarGAN model.
Increasing number of domains effect
We first evaluate the effect of increasing the number of domains that must be mapped between, by designing an experiment that isolates this effect, which we call the “crowding effect”. As an example of the crowding effect, a model might be able to generate high quality “glasses” images if that is the only output domain, because it can simply learn to ignore any part of the image that is far from the eyes. However, if a model must both generate different hair colors given one condition label and glasses given another condition label, the performance on generating glasses may suffer.
To quantify this effect, we train a model on just a single domain and measure its FID. Then, we train a new model on both that domain and a second one, and compare the FID on the first domain from the second model to its FID in the first model. We keep doing this, adding more domains in powers of two. This gives a series of FID scores for each domain corresponding to increasing numbers of total domains in the model. We fit a simple linear regression to these scores and look at the fitted slope of the regression line: an estimate of the change in FID for a given domain from adding another domain to the total mapping model.
In other words, these scores represent the change in FID for a given domain when one more domain is added to the mapping problem. We report the average of these scores for each model in Table 4.2. While the cycle-consistency model has an FID that worsens as the number of domains increases, our model actually improves its FID as the number of domains increases.
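The slope estimate described above amounts to an ordinary least-squares line fit of FID against the total number of domains. A minimal sketch, with a hypothetical function name, is:

```python
import numpy as np

def crowding_slope(num_domains, fid_scores):
    """Least-squares slope of FID versus the total number of domains in the model.

    A positive slope means the FID for this domain worsens as domains are
    added; a negative slope means it improves.
    """
    x = np.asarray(num_domains, dtype=float)
    y = np.asarray(fid_scores, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    return float(slope)
```

For example, `crowding_slope([1, 2, 4, 8], fids)` would take the FID of one fixed domain measured in models trained on 1, 2, 4, and 8 domains; `fids` here is a placeholder for measured values, not data from the experiments.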
Latent space disentanglement
We hypothesize that one reason for our model’s better generative performance is that the representation it learns and operates on better disentangles the attribute it is transforming. To investigate this, we look at the domains corresponding to “having” and “not having” each attribute, and calculate the maximum mean discrepancy (MMD) as a distributional distance between them in the deepest latent layer of each model. The distances are reported in Table 4, with the mean and standard deviation across all domains. We see that our model does indeed produce significantly more separation between points with an attribute and points without that attribute, providing evidence that our model has more effectively disentangled this variation from other underlying variation in the data.
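A squared MMD between two sets of latent vectors can be estimated as sketched below. The Gaussian (RBF) kernel and the fixed bandwidth are assumptions, since the text does not state which kernel was used; the biased estimator is chosen for simplicity.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between the rows of x (n, d) and y (m, d)."""
    sq_dists = (np.sum(x ** 2, axis=1)[:, None]
                + np.sum(y ** 2, axis=1)[None, :]
                - 2.0 * x @ y.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

Here `x` and `y` would hold the latent representations of images with and without a given attribute; a larger value indicates greater separation between the two groups.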
Pairwise distance correlation
Previous work has shown that cycle-consistent mappings are constrained to learning transformations that preserve pairwise distances between points in ambient pixel space Benaim and Wolf (2017); Amodio and Krishnaswamy (2019). Here we perform an experiment to investigate whether this extends to the multi-domain setting. As this restriction may not always be desirable, we evaluate whether it holds in our model without a cycle-consistency loss. We take a fixed set of images, and then transform them to each target domain in turn. For each pair of domains, we calculate the correlation between the pairwise distances of points in the first domain and the pairwise distances of their representations in the other domain. Table 4.2 shows the StarGAN produces high correlations in every domain (the minimum between any pair was 0.968). In our model, some domains had high correlations, while others did not (the minimum was 0.484).
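The correlation measurement can be sketched as follows, assuming Euclidean distance in pixel space and Pearson correlation, neither of which is stated explicitly in the text; function names are illustrative.

```python
import numpy as np

def pairwise_distances(images):
    """Euclidean distances between all unordered pairs of flattened images."""
    flat = images.reshape(len(images), -1)
    sq_norms = np.sum(flat ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * flat @ flat.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    upper = np.triu_indices(len(images), k=1)  # each pair counted once
    return dists[upper]

def pairwise_distance_correlation(images_a, images_b):
    """Pearson correlation between the pairwise distances of two aligned image sets."""
    d_a = pairwise_distances(images_a)
    d_b = pairwise_distances(images_b)
    return float(np.corrcoef(d_a, d_b)[0, 1])
```

The two inputs must be aligned, so that row `i` of `images_b` is the translation of row `i` of `images_a`. A value near one means the mapping preserved the relative geometry of the set.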
Cycle-consistency in our model
While we do not enforce cycle-consistency in our model, we can evaluate whether our fully-trained model has the property. We map to and from each domain and report the average mean-squared error (MSE) between the original image and composite output after the cycle. To compare, we note that the cycle-consistent baseline had an average MSE of , but the standard deviation was just : this value is almost identical for every pair of domains. This makes sense, as the model is explicitly trying to minimize this term. In our model, the average MSE was , and the standard deviation was drastically higher: . For some domains, the reconstruction error was almost zero, while for other domains the reconstruction error was much higher than the baseline’s highest. Not all desirable transformations are one-to-one and invertible: our model produces cycle-consistent mappings in some cases, and not in others.
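The cycle evaluation above can be sketched with a hypothetical `translate(images, domain)` callable standing in for the trained generator conditioned on a target domain label:

```python
import numpy as np

def cycle_mse(images, translate, source_domain, target_domain):
    """Mean-squared error between images and their round trip through another domain.

    translate: hypothetical callable (images, domain) -> images, standing in
               for the conditioned generator.
    """
    there = translate(images, target_domain)
    back = translate(there, source_domain)
    return float(np.mean((images - back) ** 2))
```

Averaging this quantity over domain pairs gives the mean, and its spread across pairs gives the standard deviation, discussed in the text.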
Figure 6 shows an example of a mapping that is cycle-consistent in domains where it is natural (black hair to brown hair and back to black hair) and an example of a mapping that is not cycle-consistent but still reasonable. An image with facial hair is mapped to the “no facial hair” domain and then re-mapped to the “facial hair” domain. The resulting image does not have the same facial hair as the original (it has a beard instead of the original soul patch), but it is a perfectly reasonable mapping. The StarGAN handles the first transformation, but in the second it is prevented from changing the input at all by the restriction of cycling back to exactly the image it started with.
Lastly, we perform an ablation study to measure the effect of the different loss terms in our model: the reconstruction loss which forces the generator to produce the identity without the semantic attribute transform and the entropy loss which forces the attention mechanism in the transform to be sparse. In Table 6, we report the mean across all domains. Each loss term contributes to the ultimate performance, as the full model achieves the best score.
Here we presented a novel framework that conforms with the notions from the recent theoretical advances in cognitive neuroscience, motivating future work in biologically motivated architectures.
6 Broader Impact
Our proposed method has impact on the research community, but we believe it is safe from concerns about negative ethical or societal consequences. As an architectural advance in image-to-image translation, our model does not implicate any negative consequences more so than any other image-to-image translation model like the CycleGAN or the StarGAN. Moreover, by using pre-determined attributes in canonical datasets, we do not isolate any attribute that is not used ubiquitously in machine learning research. Many useful applications of this kind of work exist, like mapping between healthy and diseased medical imaging results.
One potential positive impact of our work is the furthering of progress towards biologically plausible and biologically motivated architectures. We believe the previous work of the consciousness prior, along with ours here, can offer the opportunity for the deep learning community to become more involved with the cognitive neuroscience community. Such a grounding in a biological science would stand to benefit the broader community in our opinion.
- Almahairi et al. (2018) Augmented CycleGAN: learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151.
- Amodio and Krishnaswamy (2019) TraVeLGAN: image-to-image translation by transformation vector learning. In , pp. 8983–8992.
- Baars (1993) A cognitive theory of consciousness. Cambridge University Press.
- Benaim and Wolf (2017) One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, pp. 752–762.
- Bengio (2017) The consciousness prior. arXiv preprint arXiv:1709.08568.
- Choi et al. (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
- Dehaene et al. (2017) What is consciousness, and could machines have it? Science 358 (6362), pp. 486–492.
- Huang et al. (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
- Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lee et al. (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
- Liu et al. (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
- Ronneberger et al. (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- Van Gulick and Zalta (2009) The Stanford Encyclopedia of Philosophy. Retrieved from SEP-Consciousness: http://plato. edu/archives/win2009/entries ….
- Welinder et al. (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
- Yi et al. (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857.
- Zhu et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.