Unsupervised Cross-Domain Image Generation

11/07/2016 ∙ by Yaniv Taigman, et al. ∙ Facebook

We study the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given function f, which accepts inputs in either domain, would remain unchanged. Other than the function f, the training data is unsupervised and consists of a set of samples from each domain. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f-constancy component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains including digits and face images and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity.




1 Introduction

Humans excel in tasks that require making analogies between distinct domains, transferring elements from one domain to another, and using these capabilities in order to blend concepts that originated from multiple source domains. Our experience tells us that these remarkable capabilities are developed with very little, if any, supervision that is given in the form of explicit analogies.

Recent achievements replicate some of these capabilities to some degree: Generative Adversarial Networks (GANs) are able to convincingly generate novel samples that match those of a given training set; style transfer methods are able to alter the visual style of images; domain adaptation methods are able to generalize learned functions to new domains even without labeled samples in the target domain; and transfer learning is now commonly used to import existing knowledge and to make learning much more efficient.

These capabilities, however, do not address the general analogy synthesis problem that we tackle in this work. Namely, given separated but otherwise unlabeled samples from domains S and T and a multivariate function f, learn a mapping $G: S \to T$ such that $f(x) \sim f(G(x))$.

In order to solve this problem, we make use of deep neural networks of a specific structure in which the function G is a composition of the input function f and a learned function g. A compound loss that integrates multiple terms is used. One term is a Generative Adversarial Network (GAN) term that encourages the creation of samples G(x) that are indistinguishable from the training samples of the target domain, regardless of $x \in S$ or $x \in T$. The second loss term enforces that for every x in the source domain training set, $\|f(x) - f(G(x))\|$ is small. The third loss term is a regularizer that encourages G to be the identity mapping for all $x \in T$.

The type of problems we focus on in our experiments are visual, although our methods are not limited to visual or even to perceptual tasks. Typically, f would be a neural network representation that is taken as the activations of a network that was trained, e.g., by using the cross entropy loss, in order to classify or to capture identity.

As a main application challenge, we tackle the problem of emoji generation for a given facial image. Despite a growing interest in emoji and the hurdle of creating such personal emoji manually, no system has been proposed, to our knowledge, that can solve this problem. Our method is able to produce face emoji that are visually appealing and capture much more of the facial characteristics than the emoji created by well-trained human annotators who use the conventional tools.

2 Related work

As far as we know, the domain transfer problem we formulate is novel despite being ecological (i.e., appearing naturally in the real-world), widely applicable, and related to cognitive reasoning (Fauconnier & Turner, 2003). In the discussion below, we survey recent GAN work, compare our work to the recent image synthesis work and make links to unsupervised domain adaptation.

GAN (Goodfellow et al., 2014) methods train a generator network G that synthesizes samples from a target distribution given noise vectors. G is trained jointly with a discriminator network D, which distinguishes between samples generated by G and a training set from the target distribution. The goal of G is to create samples that are classified by D as real samples.

While originally proposed for generating random samples, GANs can be used as a general tool to measure equivalence between distributions. Specifically, the optimization of D corresponds to taking the most discriminative D achievable, which in turn implies that the indistinguishability is true for every D. Formally, Ganin et al. (2016) linked the GAN loss to the H-divergence between two distributions of Ben-David et al. (2006).

The generative architecture that we employ is based on the successful architecture of Radford et al. (2015). There has recently been a growing concern about the uneven distribution of the samples generated by G: they tend to cluster around a set of modes in the target domain (Salimans et al., 2016). In general, we do not observe such an effect in our results, due to the requirement to generate samples that satisfy specific f-constancy criteria.

A few contributions (“Conditional GANs”) have employed GANs in order to generate samples from a specific class (Mirza & Osindero, 2014), or even based on a textual description (Reed et al., 2016). When performing such conditioning, one can distinguish between samples that were correctly generated but fail to match the conditional constraint and samples that were not correctly generated. This is modeled as a ternary discriminative function D (Reed et al., 2016; Brock et al., 2016).

The recent work by Dosovitskiy & Brox (2016) has shown promising results for learning to map embeddings to their pre-images, given input-target pairs. Like us, they employ a GAN as well as additional losses in the feature- and the pixel-space. Their method is able to invert the mid-level activations of AlexNet and reconstruct the input image. In contrast, we solve the problem of unsupervised domain transfer and apply the loss terms in different domains: pixel loss in the target domain, and feature loss in the source domain.

Another class of very promising generative techniques that has recently gained traction is neural style transfer. In these methods, new images are synthesized by minimizing the content loss with respect to one input sample and the style loss with respect to one or more input samples. The content loss is typically computed on the encoding of the image by a network trained for an image categorization task, similar to our work. The style loss compares the statistics of the activations in various layers of the neural network. We do not employ style losses in our method. While style transfer was initially obtained by a slow optimization process (Gatys et al., 2016), the emphasis has recently shifted to feed-forward methods (Ulyanov et al., 2016; Johnson et al., 2016).

There are many links between style transfer and our work: both are unsupervised and generate a sample under f-constancy given an input sample. However, our work is much more general in its scope and does not rely on a predefined family of perceptual losses. Our method can be used in order to perform style transfer, but not the other way around. Another key difference is that the current style transfer methods are aimed at replicating the style of one or several images, while our work considers a distribution in the target space. In many applications, there is an abundance of unlabeled data in the target domain T, which can be modeled accurately in an unsupervised manner.

Given the impressive results of recent style transfer work, in particular for face images, one might get the false impression that emoji are just a different style of drawing faces. By way of analogy, this claim is similar to stating that a Siamese cat is a Labrador in a different style. Emoji differ from facial photographs in both content and style. Style transfer can create visually appealing face images; however, the properties of the target domain are compromised.

In the computer vision literature, work has been done to automatically generate sketches from images; see Kyprianidis et al. (2013) for a survey. These systems are able to emphasize image edges and facial features in a convincing way. However, unlike our method, they require matching pairs of samples, and were not shown to work across two distant domains as in our method. Due to the lack of supervised training data, we did not try to apply such methods to our problems. However, one can assume that if such methods were appropriate for emoji synthesis, automatic face emoji services would be available.

Unsupervised domain adaptation addresses the following problem: given a labeled training set in $S \times Y$, for some target space Y, and an unlabeled set of samples from domain T, learn a function $h: T \to Y$ (Chen et al., 2012; Ganin et al., 2016). One can solve the sample transfer problem (our problem) using domain adaptation and vice versa. In both cases, the solution is indirect. In order to solve domain adaptation using domain transfer, one would learn a function f from S to Y and use it as the input function of the domain transfer algorithm in order to obtain a map from S to T. (The function f trained this way would be more accurate on S than on T. This asymmetry is shared with all experiments done in this work.) The training samples could then be transferred to T and used to learn a classifier there.

In the other direction, given the function f, one can invert f in the domain T by generating training samples (f(x), x) for $x \in T$ and learn from them a function h from $f(T) = \{f(x) \mid x \in T\}$ to T. Domain adaptation can then be used in order to map f(S) to f(T), thus achieving domain transfer. Based on the work by Zhmoginov & Sandler (2016), we expect that h, even in the target domain of emoji, will be hard to learn, making this solution hypothetical at this point.

3 A baseline problem formulation

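Given a set s of unlabeled samples in a source domain S sampled i.i.d. according to some distribution $\mathcal{D}_S$, a set t of samples in the target domain T sampled i.i.d. from distribution $\mathcal{D}_T$, a function f from the domain $S \cup T$, some metric d, and a weight α, we wish to learn a function $G: S \to T$ that minimizes the combined risk $R = R_{GAN} + \alpha R_{CONST}$, which is comprised of

$$R_{GAN} = \max_D \; \mathbb{E}_{x \sim \mathcal{D}_S} \log\left[1 - D(G(x))\right] + \mathbb{E}_{x \sim \mathcal{D}_T} \log\left[D(x)\right],$$

where D is a binary classification function from T, D(x) is the probability of the class 1 it assigns for a sample $x \in T$, and

$$R_{CONST} = \mathbb{E}_{x \sim \mathcal{D}_S} \, d(f(x), f(G(x))).$$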

The first term is the adversarial risk, which requires that for every discriminative function D, the samples from the target domain would be indistinguishable from the samples generated by G for samples in the source domain. An adversarial risk is not the only option. An alternative term that does not employ GANs would directly compare the distribution $\mathcal{D}_T$ to the distribution of G(x) where $x \sim \mathcal{D}_S$, e.g., by using KL-divergence.

The second term is the f-constancy term, which requires that f is invariant under G. In practice, we have experimented with multiple forms of d including Mean Squared Error (MSE) and cosine distance, as well as other variants including metric learning losses (hinge) and triplet losses. The performance is mostly unchanged, and we report results using the simplest MSE solution.

Similarly to other GAN formulations, one can minimize the loss associated with the risk R over G, while maximizing it over D, where G and D are deep neural networks, and the expectations in R are replaced by summations over the corresponding training sets. However, this baseline solution, as we will show experimentally, does not produce desirable results.
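
As a minimal sketch of how the baseline risk could be estimated on finite training sets, the snippet below evaluates the adversarial term for a fixed discriminator and the f-constancy term with MSE as the metric. The function names and the toy inputs are hypothetical, and the discriminator outputs and feature vectors are assumed to be precomputed; this is an illustration under those assumptions, not the training code used in the experiments.

```python
import math

def gan_value(d_on_generated, d_on_target):
    """Empirical adversarial term for one fixed discriminator D.

    d_on_generated: values D(G(x)) for source samples x (probability of "real").
    d_on_target:    values D(x) for target samples x.
    """
    fake = sum(math.log(1.0 - p) for p in d_on_generated) / len(d_on_generated)
    real = sum(math.log(p) for p in d_on_target) / len(d_on_target)
    return fake + real

def mse(u, v):
    """Mean squared error between two feature vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def const_risk(f_source, f_generated):
    """Empirical f-constancy term: average of d(f(x), f(G(x))) with d = MSE."""
    pairs = list(zip(f_source, f_generated))
    return sum(mse(u, v) for u, v in pairs) / len(pairs)

def baseline_risk(d_on_generated, d_on_target, f_source, f_generated, alpha):
    """Combined risk: adversarial term plus alpha times the constancy term."""
    return gan_value(d_on_generated, d_on_target) + alpha * const_risk(f_source, f_generated)
```

In an actual training loop the generator would be updated to decrease this quantity while the discriminator is updated to increase the adversarial term.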

Figure 1: The Domain Transfer Network. Losses are drawn with dashed lines, input/output with solid lines. After training, the forward model G is used for the sample transfer.

4 The domain transfer network

We suggest employing a more elaborate architecture that contains two high level modifications. First, we employ f(x) as the baseline representation to the function G. Second, we consider, during training, the generated samples G(x) for $x \in t$.

The first change is stated as $G = g \circ f$, for some learned function g. By applying this, we focus the learning effort of G on the aspects that are most relevant to $R_{CONST}$. In addition, in most applications, f is not as accurate on T as it is on S. The composed function, which is trained on samples from both S and T, adds layers on top of f, which adapt it.
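
The composition of a fixed representation with a learned decoder can be illustrated with a toy sketch. Everything below is hypothetical and chosen only for illustration: `f` is a stand-in feature extractor (the mean intensity of a flat list of pixels), `g` is a stand-in decoder, and `compose` is a helper name that does not come from the paper.

```python
def compose(g, f):
    """Return the function x -> g(f(x))."""
    return lambda x: g(f(x))

def f(image):
    # Toy stand-in for the given representation: mean pixel intensity.
    return sum(image) / len(image)

def g(feature, size=4):
    # Toy stand-in for the learned decoder: a constant image with that intensity.
    return [feature] * size

G = compose(g, f)

sample = [0.0, 0.5, 1.0]
generated = G(sample)
```

In this toy case `f(generated)` equals `f(sample)`, which is the kind of invariance the constancy term is meant to encourage.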

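The second change alters the form of $L_{GAN}$, making it multiclass instead of binary. It also introduces a new term $L_{TID}$ that requires G to be the identity mapping on samples from T. Taken together and written in terms of training loss, we now have two losses $L_D$ and $L_G = L_{GANG} + \alpha L_{CONST} + \beta L_{TID} + \gamma L_{TV}$, for some weights α, β, γ, where

$$L_D = -\mathbb{E}_{x \in s} \log D_1(g(f(x))) - \mathbb{E}_{x \in t} \log D_2(g(f(x))) - \mathbb{E}_{x \in t} \log D_3(x)$$

$$L_{GANG} = -\mathbb{E}_{x \in s} \log D_3(g(f(x))) - \mathbb{E}_{x \in t} \log D_3(g(f(x)))$$

$$L_{CONST} = \sum_{x \in s} d(f(x), f(g(f(x))))$$

$$L_{TID} = \sum_{x \in t} d_2(x, G(x))$$

and where D is a ternary classification function from the domain T to {1, 2, 3}, $D_i(x)$ is the probability it assigns to class i = 1, 2, 3 for an input sample x, and $d_2$ is a distance function in T. During optimization, $L_G$ is minimized over g and $L_D$ is minimized over D. See Fig. 1 for an illustration of our method.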
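
The multiclass GAN terms can be sketched as follows. The snippet assumes, as a labeled assumption, that the discriminator outputs a three-way probability vector in which class 1 stands for samples generated from the source set, class 2 for samples generated from the target set, and class 3 for real target samples. The function names are hypothetical and the probability vectors are taken as given.

```python
import math

def mean_neg_log(probabilities, class_index):
    """Average of -log p[class_index] over a list of 3-way probability vectors."""
    return -sum(math.log(p[class_index]) for p in probabilities) / len(probabilities)

def discriminator_loss(d_on_g_source, d_on_g_target, d_on_real_target):
    """Ternary discriminator loss.

    d_on_g_source:    D(G(x)) for x in the source set (should be class 1).
    d_on_g_target:    D(G(x)) for x in the target set (should be class 2).
    d_on_real_target: D(x) for x in the target set (should be class 3).
    """
    return (mean_neg_log(d_on_g_source, 0)
            + mean_neg_log(d_on_g_target, 1)
            + mean_neg_log(d_on_real_target, 2))

def generator_gan_loss(d_on_g_source, d_on_g_target):
    """Generator's adversarial loss: both kinds of generated samples should look like class 3."""
    return mean_neg_log(d_on_g_source, 2) + mean_neg_log(d_on_g_target, 2)
```

With a completely undecided discriminator (probability 1/3 for every class) the two losses evaluate to 3 log 3 and 2 log 3 respectively, which gives a convenient sanity check.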

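The last loss, $L_{TV}$, is an anisotropic total variation loss (Rudin et al., 1992; Mahendran & Vedaldi, 2015), which is added in order to slightly smooth the resulting image. The loss is defined on the generated image $z = [z_{ij}] = G(x)$ as

$$L_{TV}(z) = \sum_{i,j} \left( (z_{i,j+1} - z_{i,j})^2 + (z_{i+1,j} - z_{i,j})^2 \right)^{B/2},$$

where B is a fixed exponent.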
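
A minimal sketch of a total variation penalty of the kind described by Mahendran & Vedaldi (2015) is given below for a single-channel image stored as a list of rows. The exponent `b` defaults to 1 purely as an assumption for illustration, and the boundary handling (the last row and column are skipped) is likewise an assumption rather than a detail taken from the paper.

```python
def tv_loss(z, b=1.0):
    """Sum over pixels of (horizontal_diff^2 + vertical_diff^2)^(b/2).

    z: 2-D list of pixel intensities (rows of equal length).
    b: exponent; the default of 1 is an illustrative assumption.
    """
    height, width = len(z), len(z[0])
    total = 0.0
    for i in range(height - 1):
        for j in range(width - 1):
            dx = z[i][j + 1] - z[i][j]   # difference to the right neighbour
            dy = z[i + 1][j] - z[i][j]   # difference to the neighbour below
            total += (dx * dx + dy * dy) ** (b / 2.0)
    return total
```

A constant image yields zero, and the penalty grows with the number and strength of intensity changes, which is why a small weight is enough to smooth the output slightly.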

In our work, MSE is used for both d and $d_2$. We also experimented with replacing $d_2$, which, in visual domains, compares images, with a second GAN. No noticeable improvement was observed. Throughout the experiments, the adaptive learning rate method Adam by Kingma & Ba (2016) is used as the optimization algorithm.

Figure 2: Domain transfer in two visual domains. Input in odd columns; output in even columns. (a) Transfer from SVHN to MNIST. (b) Transfer from face photos (Facescrub dataset) to emoji.

5 Experiments

The Domain Transfer Network (DTN) is evaluated in two application domains: digits and face images. In the first domain, we transfer images from the Street View House Number (SVHN) dataset of Netzer et al. (2011) to the domain of the MNIST dataset by LeCun & Cortes (2010). In the face domain, we transfer a set of random and unlabeled face images to a space of emoji images. In both cases, the source and target domains differ considerably.

5.1 Digits: from SVHN to MNIST

For working with digits, we employ the extra training split of SVHN, which contains 531,131 images, for two purposes: learning the function f and as an unsupervised training set s for the domain transfer method. The evaluation is done on the test split of SVHN, comprised of 26,032 images. The architecture of f consists of four convolutional layers, each followed by max pooling and ReLU non-linearity. The error on the test split is 4.95%. Even though this accuracy is far from the best reported results, it seems to be sufficient for the purpose of domain transfer. Within the DTN, f maps an RGB image to the activations of the last convolutional layer (post a max pooling and before the ReLU). In order to apply f on MNIST images, we replicate the grayscale image three times.
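
Replicating a grayscale image across three channels, so that a network trained on RGB input can consume it, can be sketched as below. The nested-list image layout and the function name `gray_to_rgb` are assumptions made for this illustration.

```python
def gray_to_rgb(image):
    """Copy each grayscale pixel into three identical channels.

    image: 2-D list of intensities (rows of pixels).
    Returns a 2-D list of (r, g, b) tuples with r == g == b.
    """
    return [[(p, p, p) for p in row] for row in image]

# Hypothetical 2x2 grayscale patch.
patch = [[0, 255], [128, 64]]
rgb_patch = gray_to_rgb(patch)
```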

The set t contains the test set of the MNIST dataset. For supporting quantitative evaluation, we have trained a classifier c on the train set of the MNIST dataset, consisting of the same architecture as f. The accuracy of c on the test set approaches perfect performance at 99.4% accuracy, and is, therefore, trustworthy as an evaluation metric. In comparison, the network f achieves 76.08% accuracy on t.

Network g, inspired by Radford et al. (2015), maps SVHN-trained f’s 128D representations to grayscale images. g employs four blocks of deconvolution, batch-normalization, and ReLU, with a hyperbolic tangent terminal. The architecture of D consists of four batch-normalized convolutional layers and employs ReLU. In the digit experiments, the results were obtained with fixed tradeoff hyperparameters α and β. We did not observe a need to add a smoothness term, and the weight γ of $L_{TV}$ was set to zero.

Despite f not being very accurate on both domains (and also considerably worse than the SVHN state of the art), we were able to achieve visually appealing domain transfer, as shown in Fig. 2(a). In order to evaluate the contribution of each of the method’s components, we have employed the MNIST network c on the set of transferred samples G(x), for x in the SVHN test split, using the true SVHN labels of the test set.

We first compare to the baseline method of Sec. 3, where the generative function, which works directly with samples in S, is composed out of a few additional layers at the bottom of G. The results, shown in Tab. 1, demonstrate that DTN has a clear advantage over the baseline method. In addition, the contribution of each one of the terms in the loss function is shown in the table. The regularization term $L_{TID}$ seems less crucial than the constancy term $L_{CONST}$. However, at least one of them is required in order to obtain good performance. The GAN constraints are also important. Finally, the inclusion of f within the generator function G has a dramatic influence on the results.

Method                                        Accuracy
Baseline method (Sec. 3)                      13.71%
DTN                                           90.66%
DTN w/o $L_{TID}$                             88.40%
DTN w/o $L_{CONST}$                           74.55%
DTN, G does not contain f                     36.90%
DTN w/o $L_D$ and $L_{GANG}$                  34.70%
DTN w/o $L_{CONST}$ & $L_{TID}$               5.28%
Original SVHN image                           40.06%
Table 1: Accuracy of the MNIST classifier on the samples transferred by our DTN method from SVHN to MNIST.

Method                                        Accuracy
SA (Fernando et al., 2013)                    59.32%
DANN (Ganin et al., 2016)                     73.85%
DTN on SVHN, transferring the train split     84.44%
DTN on SVHN, transferring the test split      79.72%
Table 2: Domain adaptation from SVHN to MNIST.

As explained in Sec. 2, domain transfer can be used in order to perform unsupervised domain adaptation. For this purpose, we transformed the set s to the MNIST domain (as above), and using the true labels of s employed a simple nearest neighbor classifier there. The choice of classifier was to emphasize the simplicity of the approach; however, the constraints of the unsupervised domain transfer problem would be respected for any classifier trained on G(s). The results of this experiment are reported in Tab. 2, which shows a clear advantage over the state of the art method of Ganin et al. (2016). This is true both when transferring the samples of the set s and when transferring the test set of SVHN, which is much smaller and was not seen during the training of the DTN.
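
The nearest neighbor step can be sketched as follows. The toy vectors stand in for transferred source samples with their known labels, and a query stands in for a target-domain image; the squared Euclidean distance and all names are assumptions made for illustration, since the distance used in the experiment is not stated here.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_neighbor_label(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best]

# Hypothetical toy data: transferred samples and their source-domain labels.
transferred = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
labels = [0, 1, 7]
prediction = nearest_neighbor_label(transferred, labels, [0.8, 0.2])
```

No label from the target domain is used at any point, which is what keeps the procedure within the unsupervised setting.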

5.1.1 Unseen digits

Another set of experiments was performed in order to study the ability of the domain transfer network to overcome the omission of a class of samples. This type of ablation can occur in the source or the target domain, or during the training of f, and can help us understand the importance of each of these inputs. The results are shown visually in Fig. 3, and quantitatively in Tab. 3, based on the accuracy of the MNIST classifier only on the transferred samples from the test set of SVHN that belong to class ‘3’.

It is evident that not including the class in the source domain is much less detrimental than eliminating it from the target domain. This is the desirable behavior: never seeing any ‘3’-like shapes in t, the generator should not generate such samples. Results are better when not observing ‘3’ in both s and t than when not seeing it only in t, since in the latter case, G learns to map source samples of ‘3’ to target images of other classes.

Figure 3: A random subset of the digit ‘3’ from SVHN, transferred to MNIST, shown in panels (a) to (f). (a) The input images. (b) Results of our DTN. In all plots, the cases keep their respective locations, and are sorted by the probability of ‘3’ as inferred by the MNIST classifier on the results of our DTN. (c) The obtained results, in which the digit 3 was not shown as part of the set s of unlabeled samples from SVHN. (d) The obtained results, in which the digit 3 was not shown as part of the set t of unlabeled samples in MNIST. (e) The digit 3 was not shown in both s and t. (f) The digit 3 was not shown in s, t, and during the training of f.

Method                                                        Accuracy of ‘3’
DTN                                                           94.67%
‘3’ was not shown in s                                        93.33%
‘3’ was not shown in t                                        40.13%
‘3’ was not shown in both s or t                              60.02%
‘3’ was not shown in s, t, and during the training of f       4.52%
Table 3: Comparison of recognition accuracy of the digit 3 as generated in MNIST

5.2 Faces: from Photos to Emoji

For face images, we use a set s of one million random images without identity information. The set t consists of assorted facial avatars (emoji) created by an online service (bitmoji.com). The emoji images were processed by a fully automatic process that localizes, based on a set of heuristics, the center of the irides and the tip of the nose. Based on these coordinates, the emoji were centered and scaled into RGB images.

As the function f, we employ the representation layer of the DeepFace network (Taigman et al., 2014). This representation is 256-dimensional and was trained on a labeled set of four million images that does not intersect the set s. Network D takes RGB images (either natural or scaled-up emoji) and consists of 6 blocks, each containing a convolution with stride 2, batch normalization, and a leaky ReLU with a parameter of 0.2. Network g maps f’s 256D representations to RGB images through a network with 5 blocks, each consisting of an upscaling convolution, batch-normalization and ReLU. Adding a convolution to each block resulted in lower training errors, and made g 9-layers deep. We set the tradeoff hyperparameters α, β, and γ within $L_G$ via validation. As expected, higher values of α resulted in better f-constancy, but introduced artifacts such as general noise or distortions.

In order to upscale the output to print quality, we used the method of Dong et al. (2015), which was shown to work well on art. We did not retrain this network for our application, and used the published one. Results without this upscale are shown, for comparison, in Appendix B.

Comparison With Human Annotators

For evaluation purposes only, a team of professional annotators manually created an emoji, using the web service of bitmoji.com, for random images from the CelebA dataset (Yang et al., 2015). Fig. 4 shows side by side samples of the original image, the human generated emoji and the emoji generated by the learned generator function G. As can be seen, the automatically generated emoji tend to be more informative, albeit less restrictive than the ones created manually.

Figure 4: Shown, side by side are sample images from the CelebA dataset, the emoji images created manually using a web interface (for validation only), and the result of the unsupervised DTN. See Tab. 4 for retrieval performance.

In order to evaluate the identifiability of the resulting emoji, we have collected a second example for each identity in the set of CelebA images and a set of 100,000 random face images, which were not included in s. We then employed the VGG face CNN descriptor of Parkhi et al. (2015) in order to perform retrieval as follows. For each image x in our manually annotated set, we create a gallery consisting of x' together with the 100,000 random face images, where x' is the other image of the person in x. We then perform retrieval using the VGG face descriptor, with either the manually created emoji or G(x) as probe.

The VGG network is used in order to avoid a bias that might be caused by using f both for training the DTN and for evaluation. The results are reported in Tab. 4. As can be seen, the emoji generated by G are much more discriminative than the emoji created manually and obtain a median rank of 16 in cross-domain identification out of 100,000 distractors.
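
The rank statistics used in this kind of retrieval evaluation can be computed as in the sketch below. It assumes each probe has exactly one correct gallery item plus a list of distractors, and that a smaller descriptor distance means a better match; the function names, the tie-handling rule, and the toy one-dimensional descriptors in the usage lines are hypothetical.

```python
import statistics

def rank_of_match(probe, match, distractors, dist):
    """1-based rank of the correct item: one plus the number of closer distractors."""
    d_match = dist(probe, match)
    return 1 + sum(1 for item in distractors if dist(probe, item) < d_match)

def summarize(ranks, k=5):
    """Median rank, mean rank, and rank-1 / rank-k accuracy over all probes."""
    n = len(ranks)
    return {
        "median_rank": statistics.median(ranks),
        "mean_rank": statistics.mean(ranks),
        "rank1_acc": sum(1 for r in ranks if r == 1) / n,
        "rankk_acc": sum(1 for r in ranks if r <= k) / n,
    }

# Toy usage with scalar descriptors and absolute difference as the distance.
toy_rank = rank_of_match(0.0, 1.0, [0.5, 2.0, 3.0], lambda a, b: abs(a - b))
toy_summary = summarize([1, 3, 10])
```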

Measure              Manual        Emoji by DTN
Median rank          16,311        16
Mean rank            27,992.34     535.47
Rank-1 accuracy      0%            22.88%
Rank-5 accuracy      0%            34.75%
Table 4: Comparison of retrieval accuracy out of a set of 100,001 face images for either manually created emoji or the ones created by the DTN.

Multiple Images Per Person

We evaluate the visual quality that is obtained per person and not just per image, by testing DTN on the Facescrub dataset (Ng & Winkler, 2014). For each person p, we considered the set of their images $X_p$, and selected the emoji that was most similar to their source image:

$$\arg\min_{x \in X_p} \|f(x) - f(G(x))\|$$

This simple heuristic seems to work well in practice; the general problem of mapping a set $X \subset S$ to a single output in T is left for future work. Fig. 2(b) contains several examples from the Facescrub dataset. For the complete set of identities, see Appendix A.
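
A sketch of this per-person selection is given below, assuming that similarity is measured as the distance between the representation of a photo and the representation of its generated emoji. The stand-in functions passed in the usage lines (`f` as the identity, `G` as rounding, absolute difference as the distance) are toy placeholders, not the networks used in the paper.

```python
def select_emoji(images, f, G, dist):
    """Among one person's images, return the emoji whose representation
    is closest to that of the photo it was generated from."""
    best = min(images, key=lambda x: dist(f(x), f(G(x))))
    return G(best)

# Toy usage: "images" are scalars, f is the identity, G rounds to an integer.
chosen = select_emoji([0.4, 1.1, 2.5],
                      f=lambda x: x,
                      G=lambda x: float(round(x)),
                      dist=lambda a, b: abs(a - b))
```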

Network Visualization

The obtained mapping g can serve as a visualization tool for studying the properties of the face representation. This is studied in Appendix C by computing the emoji generated for the standard basis of $\mathbb{R}^{256}$. The resulting images present a large amount of variability, indicating that g does not present a significant mode effect.

Figure 5: Style transfer as a specific case of Domain Transfer. (a) The input content photo. (b) An emoji taken as the input style image. (c) The result of applying the style transfer method of Gatys et al. (2016). (d) The result of the emoji DTN. (e) Source image for style transfer. (f) The result, on the same input image, of a DTN trained to perform style transfer.

5.3 Style Transfer as a specific Domain Transfer task

Fig. 5(a-c) demonstrates that neural style transfer (Gatys et al., 2016) cannot solve the photo to emoji transfer task in a convincing way. The output image is perhaps visually appealing; however, it does not belong to the space of emoji. Our results are given in Fig. 5(d) for comparison. Note that DTN is able to fix the missing hair in the image.

Domain transfer is more general than style transfer in the sense that we can perform style transfer using a DTN. In order to show this, we have transformed, using the method of Johnson et al. (2016), the training images of CelebA based on the style of a single image (shown in Fig. 5(e)). The original photos were used as the set s, and the transformed images were used as t. Applying DTN, using face representation f, we obtained styled face images such as the one shown in Fig. 5(f).

6 Discussion and Limitations

Asymmetry is central to our work. Not only does our solution handle the two domains S and T differently, the function f is unlikely to be equally effective in both domains since in most practical cases, f would be trained on samples from one domain. While an explicit domain adaptation step can be added in order to make f more effective on the second domain, we found it to be unnecessary. Adaptation of f occurs implicitly due to the application of D downstream.

Using the same function f, we can replace the roles of the two domains, S and T. For example, we can synthesize an SVHN image that resembles a given MNIST image, or synthesize a face that matches an emoji. As expected, this yields less appealing results due to the asymmetric nature of f and the lower information content in these new source domains; see Appendix D.

Domain transfer, as an unsupervised method, could prove useful across a wide variety of computational tasks. Here, we demonstrate the ability to use domain transfer in order to perform unsupervised domain adaptation. While this is currently only shown in a single experiment, the simplicity of performing domain adaptation and the fact that state of the art results were obtained effortlessly with a simple nearest neighbor classifier suggest it to be a promising direction for future research.


Appendix A Facescrub dataset generations

In Fig. 6 we show the full set of identities of the Facescrub dataset, and their corresponding generated emoji.

Figure 6: All 80 identities of the Facescrub dataset. The even columns show the results obtained for the images in the odd column to the left. Best viewed in color and zoom.

Appendix B The effect of super-resolution

As mentioned in Sec. 5, in order to upscale the output to print quality, the method of Dong et al. (2015) is used. Fig. 7 shows the effect of applying this postprocessing step.

Appendix C The basis elements of the face representation

Fig. 8 depicts the face emoji generated by g for the standard basis of the face representation (Taigman et al., 2014), viewed as the vector space $\mathbb{R}^{256}$.
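
The inputs for such a visualization are the one-hot vectors of the representation space, each of which would be passed through the decoder. The sketch below only builds those vectors; the dimension of 256 follows the stated size of the face representation, while the helper name `standard_basis` is hypothetical and the decoder itself is not reproduced.

```python
def standard_basis(dim):
    """Return the dim standard basis vectors of a dim-dimensional space."""
    return [[1.0 if j == i else 0.0 for j in range(dim)] for i in range(dim)]

# One input vector per coordinate of the 256-dimensional face representation.
basis = standard_basis(256)
```

Feeding each vector to the decoder in turn would produce one emoji per coordinate of the representation.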

Appendix D Domain transfer in the reverse direction

For completeness, we present, in Fig. 9, results obtained by performing domain transfer using DTNs in the reverse direction of the one reported in Sec. 5.

Figure 7: The images in Fig. 4 above with (right version) and without (left version) applying super-resolution. Best viewed on screen.
Figure 8: The emoji visualization of the standard basis vectors in the space of the face representation, i.e., $g(e_1), \ldots, g(e_{256})$, where $e_i$ is the i-th standard basis vector in $\mathbb{R}^{256}$.
Figure 9: Domain transfer in the other direction (see limitations in Sec. 6). Input (output) in odd (even) columns. (a) Transfer from MNIST to SVHN. (b) Transfer from emoji to face photos.