NAM: Non-Adversarial Unsupervised Domain Mapping

06/03/2018 ∙ by Yedid Hoshen, et al. ∙ 0

Several methods were recently proposed for the task of translating images between domains without prior knowledge in the form of correspondences. The existing methods apply adversarial learning to ensure that the distribution of the mapped source domain is indistinguishable from the target domain, which suffers from known stability issues. In addition, most methods rely heavily on "cycle" relationships between the domains, which enforce a one-to-one mapping. In this work, we introduce an alternative method: Non-Adversarial Mapping (NAM), which separates the task of target domain generative modeling from the cross-domain mapping task. NAM relies on a pre-trained generative model of the target domain, and aligns each source image with an image synthesized from the target domain, while jointly optimizing the domain mapping function. It has several key advantages: higher quality and resolution image translations, simpler and more stable training and reusable target models. Extensive experiments are presented validating the advantages of our method.


1 Introduction

The human ability to think in spontaneous analogies motivates the field of unsupervised domain alignment, in which image to image translation is achieved without correspondences between samples in the training set. Unsupervised domain alignment methods typically operate by finding a function for mapping images between the domains so that after mapping, the distribution of mapped source images is identical to that of the target images.

Successful recent approaches, e.g. DTN [28], CycleGAN [37] and DiscoGAN [14], utilize Generative Adversarial Networks (GANs) [9] to model the distributions of the two domains, $X$ and $Y$. GANs are very effective tools for generative modeling of images; however, they suffer from instability in training, making their use challenging. The instability typically requires careful choice of hyper-parameters and often multiple initializations due to mode collapse. Current methods also make additional assumptions that can be restrictive, e.g., DTN assumes that a pre-trained high-quality domain specific feature extractor exists which is effective for both domains. This assumption is good for the domain of faces (which is the main application of DTN) but may not be valid for all cases. CycleGAN and DiscoGAN make the assumption that a transformation $T_{XY}$ can be found for every $X$-domain image $x$ to a unique $Y$-domain image $y$, and another transformation $T_{YX}$ exists between the $Y$ domain and the original $X$-domain image $x$, i.e., $y = T_{XY}(x)$ and $x = T_{YX}(y)$. This is problematic if the actual mapping is many-to-one or one-to-many, as in super-resolution or coloring.

We propose a novel approach motivated by cross-domain matching. We separate the problem of modeling the distribution of the target domain from the source to target mapping problem. We assume that the target image domain distribution is parametrized using a generative model. This model can be trained using any state-of-the-art unconditional generation method such as GAN [25], GLO [2], VAE [15] or an existing graphical or simulation engine. Given the generative model, we solve an unsupervised matching problem between the input $Y$ domain images and the $X$ domain. For each source input image $y$, we synthesize an $X$ domain image $G(z_y)$, and jointly learn the mapping function $T$, which maps images from the $X$ domain to the $Y$ domain. The synthetic images and mapping function are trained using a reconstruction loss on the input $Y$ domain images.

Our method is radically different from previous approaches and it presents the following advantages:

  1. A generative model needs to be trained only once per target dataset, and can be used to map to this dataset from all source datasets without adversarial generative training.

  2. Our method is one-way and does not assume a one-to-one relationship between the two domains, e.g., it does not use cycle-constraints.

  3. Our work directly connects between the vast literature of unconditional image generation and the task of cross-domain translation. Any progress in unconditional generation architectures can be simply plugged in with minimal changes. Specifically, we can utilize recent very high-resolution generators to obtain high quality results.

2 Previous Work

Unsupervised domain alignment:

Mapping across similar domains without supervision has been successfully achieved by classical methods such as Congealing [22]. Unsupervised translation across very different domains has only very recently begun to generate strong results, due to the advent of generative adversarial networks (GANs), and all state-of-the-art unsupervised translation methods we are aware of employ GAN technology. As this constraint is insufficient for generating good translations, current methods are differentiated by the additional constraints that they impose.

The most popular constraint is cycle-consistency: enforcing that a sample that is mapped from $X$ to $Y$ and back to $X$ reconstructs the original sample. This is the approach taken by DiscoGAN [14], CycleGAN [37] and DualGAN [30]. Recently, StarGAN [5] created multiple cycles for mapping in any direction between multiple (two or more) domains. The generator receives as input the source image as well as the specification of the target domain.

For the case of linear mappings, orthogonality has a similar effect to circularity. Very recently, it was used outside computer vision by several methods [33, 34, 6, 12] for solving the task of mapping words between two languages without using parallel corpora.

Another type of constraint is provided by employing a shared latent space. Given samples from two domains $X$ and $Y$, CoGAN [21] learns a mapping from a random input vector $z$ to matching samples, one in each domain. The domains $X$ and $Y$ are assumed to be similar and their generators (and GAN discriminators) share many of the layers' weights, similar to [27]. Specifically, the earlier generator layers are shared while the top layers are domain specific. CoGAN can be modified to perform domain translation in the following way: given a sample $x \in X$, a latent vector $z_x$ is fitted to minimize the distance between the image generated by the first generator $G_X(z_x)$ and the input image $x$. Then, the analogous image in $Y$ is given by $G_Y(z_x)$. This method was shown in [37] to be less effective than cycle-consistency based methods.

UNIT [20] employs an encoder-decoder pair for each domain. The latent spaces of the two are assumed to be shared, and similarly to CoGAN, the layers that are distant from the image (the top layers of the encoder and the bottom layers of the decoder) are shared between the two domains. Cycle-consistency is added as well, and structure is added to the latent space using variational autoencoder [16] loss terms.

As mentioned above our method does not use adversarial or cycle-consistency constraints.

Mapping using Domain Specific Features

Using domain specific features has been found by DTN [28] to be important for some tasks. It assumes that a feature extractor can be found for which the source and target would give the same activation values. Specifically, it uses face specific features to map faces to emojis. While for some of the tasks our work does use a "perceptual loss" that employs an ImageNet-pretrained network, this is a generic feature extraction method that is not domain specific. We claim therefore that our method still qualifies as unsupervised. For most of the tasks presented, the VGG loss alone would not be sufficient to recover good mappings between the two domains, as shown in ANGAN [11].

Unconditional Generative Modeling:

Many methods were proposed for generative models of image distributions. Currently the most popular approaches rely on GANs and VAEs [15]. GAN-based methods are plagued by instability during training. Many methods were proposed to address this issue for unconditional generation, e.g., [1, 10, 23]. The modifications are typically not employed in cross-domain mapping works. Our method trains a generative model (typically a GAN) in the $X$ domain separately from any $Y$ domain considerations, and can directly benefit from the latest advancements in the unconditional image generation literature. GLO [3] is an alternative to GAN, which iteratively fits per-image latent vectors (starting from random "noise") and learns a mapping between the noise vectors and the training images. GLO is trained using a reconstruction loss, minimizing the difference between the training images and those generated from the noise vectors. Unlike our approach, it tackles unconditional generation rather than domain mapping.

3 Unsupervised Image Mapping without GANs

In this section, we present our method - NAM - for unsupervised domain mapping. The task we aim to solve is finding analogous images across domains. Let $X$ and $Y$ be two image domains, each with some unique characteristics. For each domain we are given a set of example images. The objective is to find, for every image $y$ in the $Y$ domain, an analogous image $x$ which appears to come from the $X$ domain but preserves the unique content of the original $y$ image.

3.1 Non-Adversarial Exact Matching

To motivate our approach, we first consider the simpler case, where we have two image domains $X$ and $Y$, consisting of sets of images $\{x_j\}$ and $\{y_i\}$ respectively. We assume that the two sets are approximately related by a transformation $T$, and that a matching paired image $x$ exists for every image $y$ in domain $Y$ such that $T(x) = y$. The task of matching becomes a combination of two tasks: i) inferring the transformation between the two domains ii) finding matching pairs across the two domains. Formally this becomes:

$$\arg\min_{T, M} \sum_i \Big\| T\Big(\sum_j M_{ij}\, x_j\Big) - y_i \Big\| \qquad (1)$$

Where $M$ is the matching matrix containing $M_{ij} = 1$ if $x_j$ and $y_i$ are matching and $0$ otherwise. The optimization is over both the transformation $T$ as well as the binary match matrix $M$.

Since the optimization of this problem is hard, a relaxation method - ANGAN - was recently proposed [11]. The binary constraint on the matrix was replaced by the requirement that $M_{ij} \geq 0$ and $\sum_j M_{ij} = 1$. As optimization progresses, a barrier constraint on $M$ pushes the values of $M_{ij}$ to $0$ or $1$.

ANGAN was shown to be successful in cases where exact matches exist and is initialized with a reasonably good solution obtained by CycleGAN.
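
As a rough illustration of the matching objective of Eq. 1 and its relaxation, the sketch below assumes a linear transformation, a squared Euclidean distance and a softmax parametrization of the row-stochastic matrix. These choices and all function names are assumptions made for the example; they are not taken from ANGAN [11].

```python
import numpy as np

def matching_loss(M, T, X, Y):
    """Sum over i of || T(sum_j M[i, j] x_j) - y_i ||^2 (squared norm is an assumption)."""
    mixed = M @ X                # row i is the M-weighted combination of X images
    mapped = mixed @ T.T         # apply the (toy, linear) transformation T
    return float(np.sum((mapped - Y) ** 2))

def row_softmax(logits):
    """Relaxed matching matrix: every row is non-negative and sums to one."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # five toy "images" in X (flattened vectors)
T = rng.standard_normal((2, 3))            # hypothetical linear transformation
perm = np.array([2, 0, 4, 1, 3])           # hidden ground-truth matching
Y = X[perm] @ T.T                          # y_i = T(x_perm[i])

M_true = np.eye(5)[perm]                   # binary matrix with M[i, perm[i]] = 1
M_uniform = row_softmax(np.zeros((5, 5)))  # uninformative start of the relaxation

loss_true = matching_loss(M_true, T, X, Y)
loss_uniform = matching_loss(M_uniform, T, X, Y)
print(loss_true, loss_uniform)
```

The correct binary matching drives the loss to (numerically) zero, while the uniform relaxed matrix does not; an optimizer over the logits and $T$ would have to move from the latter towards the former.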

3.2 Non-Adversarial Inexact Matching

In Sec. 3.1, we described the scenario in which exact matches exist between the images in domains $X$ and $Y$. In most situations, exact matches do not exist between the two domains. In such situations it is not sufficient to merely find an image $x$ in the $X$ domain training set such that for a target domain image $y$, we have $T(x) = y$, as we cannot hope that such a match will exist. Instead, we need to synthesize an image $\tilde{x}$ that comes from the $X$ domain distribution, and satisfies $T(\tilde{x}) = y$. This can be achieved by removing the stochasticity requirement in Eq. 1. Effectively, this models the images in the $X$ domain as:

$$\tilde{x}_i = \sum_j M_{ij}\, x_j \qquad (2)$$

This solution is unsatisfactory on several counts: (i) the simplex model for the $X$ domain cannot hope to achieve high quality image synthesis for general images; (ii) the complexity scales quadratically with the number of training images, making both training and evaluation very slow.

Figure 1: Given a generator $G$ for domain $X$ and training samples $\{y_i\}$ in domain $Y$, NAM jointly learns the transformation $T$ and the latent vectors $\{z_i\}$ that give rise to samples $\{T(G(z_i))\}$ that resemble the training images in $Y$.

3.3 Non-Adversarial Mapping (NAM)

In this section we generalize the ideas presented in the previous sections into an effective method for mapping between domains without supervision or the use of adversarial methods.

In Sec. 3.2, we showed that to find analogies between domains $X$ and $Y$, the method requires two components: (i) a model for the distribution of the $X$ domain, and (ii) a mapping function $T$ between domains $X$ and $Y$.

Instead of the linear simplex model of Sec. 3.2, we propose to model the $X$ domain distribution by a neural generative model $G(z)$, where $z$ is a latent vector. The requirements on the generative model $G$ are such that for every image $x$ in the $X$ domain distribution we can find $z$ such that $G(z) = x$, and that $G$ is compact, that is, for no $z$ will $G(z)$ lie outside the $X$ domain. The task of learning such generative models is the research focus of several communities. In this work we do not aim to contribute to the methodology of unsupervised generative modeling, but rather use the state-of-the-art modeling techniques obtained by previous approaches for our generator $G$. Methods which can be used to obtain the generative model $G$ include: GLO [2], VAE [15], GAN [9] or a hand designed simulator (see for example [29]). In our method, the task of single domain generative modeling is entirely decoupled from the task of cross-domain mapping, which is the one we set out to solve.

Armed with a much better model for the $X$ domain distribution, we can now make progress on finding synthetic analogies between $X$ and $Y$. Our task is to find for every $Y$ domain image $y$ a synthetic $X$ domain image $G(z_y)$ so that, when mapped to the $Y$ domain, $T(G(z_y)) = y$. The task is therefore twofold: (i) for each $y$, we need to find the latent vector $z_y$ which will synthesize the analogous $X$ domain image, and (ii) the mapping function $T$ needs to be learned.

The model can therefore be formulated as an optimization problem, where the objective is to minimize the reconstruction cost of the training images of the $Y$ domain. The optimization is over the latent codes, a unique latent code vector $z_y$ for every input $Y$ domain image $y$, as well as the mapping function $T$. It is formally written as below:

$$\arg\min_{T, \{z_y\}} \sum_{y} \big\| T(G(z_y)) - y \big\| \qquad (3)$$

The model is fully differentiable, as both the generative model $G$ and the mapping function $T$ are parameterized by neural networks. The above objective is jointly optimized for $z_y$ and $T$, but not for $G$, which is kept fixed. The method is illustrated in Fig. 1.
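
The joint optimization of Eq. 3 can be sketched on a toy problem. The example below is only an illustration under strong assumptions: the fixed generator and the learned mapping are both linear, the reconstruction cost is a squared Euclidean distance, and plain gradient descent replaces the optimizer used in the paper. The function name `nam_fit` and all sizes and learning rates are hypothetical.

```python
import numpy as np

def nam_fit(G_matrix, Y, n_latent, steps=3000, lr_T=0.01, lr_z=0.1, seed=0):
    """Jointly fit a linear map T (matrix W) and one latent code per y_i.

    G_matrix: fixed linear generator, x = G_matrix @ z. It is never updated.
    Y:        training samples of the source domain, shape (n, d_y).
    """
    rng = np.random.default_rng(seed)
    n, d_y = Y.shape
    d_x = G_matrix.shape[0]
    W = 0.5 * rng.standard_normal((d_y, d_x))      # parameters of T (learned)
    Z = 0.5 * rng.standard_normal((n, n_latent))   # latent codes z_y (learned)
    losses = []
    for _ in range(steps):
        X_syn = Z @ G_matrix.T                     # G(z_y) for every y, shape (n, d_x)
        R = X_syn @ W.T - Y                        # residuals T(G(z_y)) - y
        losses.append(float(np.mean(np.sum(R ** 2, axis=1))))
        grad_W = (2.0 / n) * R.T @ X_syn           # gradient w.r.t. the mapping
        grad_Z = (2.0 / n) * R @ W @ G_matrix      # gradient w.r.t. the latent codes
        W -= lr_T * grad_W                         # separate learning rates for T and z
        Z -= lr_z * grad_Z
    return W, Z, losses

# Toy data: Y is produced by an unknown map applied to generated X samples.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2)) / 2                # fixed "pre-trained" generator
W_true = rng.standard_normal((3, 4)) / 2
Z_true = rng.standard_normal((16, 2))
Y = Z_true @ A.T @ W_true.T

W, Z, losses = nam_fit(A, Y, n_latent=2)
print(losses[0], losses[-1])
```

Only `W` and `Z` receive gradient updates; the generator matrix is treated as a constant throughout, mirroring the decoupling of generative modeling from mapping described above.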

3.4 Perceptual Loss

Although the optimization described in Sec. 3.3 can achieve good solutions, we found that introducing a perceptual loss can significantly improve the quality of analogies. Let $\phi_i$ be the features extracted from a deep network at the end of the $i$'th block (we use VGG [26]). The perceptual loss is given by:

$$\| a, b \|_{VGG} = \sum_i \big\| \phi_i(a) - \phi_i(b) \big\|_1 + \| a - b \|_1 \qquad (4)$$

The final optimization problem becomes:

$$\arg\min_{T, \{z_y\}} \sum_{y} \big\| T(G(z_y)),\, y \big\|_{VGG} \qquad (5)$$

The VGG perceptual loss was found by several recent papers [4, 35] to give perceptually pleasing results. There have been informal claims in the past that methods using perceptual loss functions should count as supervised. We claim that the perceptual loss does not make our method supervised, as the VGG network does not come from our domains and does not require any new labeling effort. Our view is that taking advantage of modern feature extractors will benefit the field of unsupervised learning in general and unsupervised analogies in particular.
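
The structure of such a perceptual loss can be sketched as follows. The sketch assumes an L1 distance on the pixels plus an L1 distance on the features at the end of each block; the two "blocks" are simple stand-ins and not a pretrained VGG, so every name here is hypothetical.

```python
import numpy as np

def perceptual_loss(a, b, feature_blocks):
    """Pixel L1 distance plus L1 distances between the features of successive blocks."""
    loss = float(np.abs(a - b).sum())
    fa, fb = a, b
    for block in feature_blocks:
        fa, fb = block(fa), block(fb)          # features at the end of this block
        loss += float(np.abs(fa - fb).sum())
    return loss

# Stand-ins for pretrained feature blocks (a real setup would load VGG weights).
def avg_pool_block(img):
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def relu_block(img):
    return np.maximum(img - img.mean(), 0.0)

rng = np.random.default_rng(0)
y = rng.random((8, 8))                          # toy source image
y_rec = y + 0.1 * rng.standard_normal((8, 8))   # toy reconstruction T(G(z_y))
blocks = [avg_pool_block, relu_block]

print(perceptual_loss(y, y, blocks))            # identical images give zero loss
print(perceptual_loss(y_rec, y, blocks))
```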

3.5 Inference and Multiple Solutions

Once training has completed, we are in possession of the mapping function $T$, which is now fixed (the pre-trained $G$ was never modified as a part of training).

To infer the analogy of a new $Y$ domain image $y$, we need to recover the latent code $z_y$ which would yield the optimal reconstruction. The mapping function $T$ is fixed, and is not modified after training. Inference is therefore performed via the following optimization:

$$z_y = \arg\min_{z} \big\| T(G(z)),\, y \big\|_{VGG} \qquad (6)$$

The synthetic $X$ domain image $G(z_y)$ is our proposed solution to the $Y$ domain image $y$.

This inference procedure is a non-convex optimization problem. Different initializations yield different final analogies. Let us denote the initialization by $z_0^t$, where $t$ is the ID of the solution. At the end of the optimization procedure for each initialization, the synthetic images $G(z^t)$ yield multiple proposed analogies for the task. We find that the $G(z^t)$ are very diverse when in fact many analogies are available. For example, when the $X$ domain is Shoes and the $Y$ domain is Edges, there are many shoes that can result in the same edge image.
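
The effect of different initializations can be illustrated with a deliberately simple toy case. The sketch assumes a linear generator and a linear mapping that discards one coordinate (a many-to-one mapping), a squared Euclidean cost and plain gradient descent; `infer_latent` and the toy matrices are hypothetical and stand in for the trained networks.

```python
import numpy as np

def infer_latent(G, T, y, z0, steps=300, lr=0.1):
    """Gradient descent on ||T G z - y||^2 over z only; G and T stay fixed."""
    z = z0.copy()
    TG = T @ G
    for _ in range(steps):
        residual = TG @ z - y
        z -= lr * 2.0 * TG.T @ residual
    return z

G = np.eye(3)                          # toy fixed generator
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])        # toy mapping that drops the third coordinate
y = np.array([0.5, -1.0])              # new source-domain sample

rng = np.random.default_rng(0)
solutions = [infer_latent(G, T, y, rng.standard_normal(3)) for _ in range(3)]
analogies = [G @ z for z in solutions]  # one proposed analogy per initialization
for x in analogies:
    print(x, T @ x)
```

Every restart reproduces $y$ after mapping, yet the synthesized samples differ in the coordinate that the mapping ignores, which is the toy analogue of many shoes sharing one edge map.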

3.6 Implementation Details

In this section we give a detailed description of the procedure used to generate the experiments presented in this paper.

$X$ domain generative model $G$: Our method takes as input a pre-trained generative model for the $X$ domain. In our MNIST, SVHN, cars and Edges2(Shoes, Handbags) experiments, we used DCGAN [25] with (32,32,32,100,100) latent dimensions. The low resolution face image generator was trained on celebA and used the training method of [23]. The high resolution face generator is provided by [13] and the Dog generator by [32]. The hyperparameters of all trained generators were set to their default values. In our experiments GAN unconditional generators provided more compelling results than competing SOTA methods such as GLO and VAE.

Mapping function $T$: The mapping function was designed so that it is powerful enough but not so large as to overfit. Additionally, it needs to preserve image locality, in the case of spatially aligned domains. We elected to use a network with an architecture based on [4]. We found that as we only rely on the networks to find correspondences rather than generate high-fidelity visual outputs, small networks were the preferred choice. We used a similar architecture to [4], with a single layer per scale, and linearly decaying number of filters per layer starting with 4, and decreasing by with every layer. for SVHN and MNIST and for the other experiments.

Optimization: We optimized using Adam [17]. For all datasets we used a learning rate of for the latent codes $z_y$ and for the mapping function $T$ (due to the uneven update rates of each $z_y$ and $T$). On all datasets training was performed on randomly selected examples (a subset) from the $Y$ domain. Larger training sets were not more helpful as each $z_y$ is updated less frequently.

Generating results: The $X$ domain translation of a $Y$ domain image $y$ is given by $G(z_y)$, where $z_y$ is the latent code found in optimization. The $Y$ domain mapping $T(G(z_y))$ typically resulted in weaker results due to the relatively shallow architecture selected for $T$. A strong $T$ can be trained by calculating a set of $G(z_y)$ and $y$ pairs (obtained using NAM), and training a fully-supervised network $T$, e.g. as described by [4]. A similar procedure was carried out in [11].

4 Experiments

To evaluate the merits of our method, we carried out an extensive set of qualitative and quantitative experiments.

Figure 2: Converting digits between SVHN and MNIST (panels: SVHN→MNIST and MNIST→SVHN). (a) CycleGAN results (b) NAM results (c) the input images.

SVHN-MNIST Translation: We evaluated our method on the SVHN-MNIST translation task. Although SVHN [24] and MNIST [18] are simple datasets, the mapping task is not trivial. The MNIST dataset consists of simple handwritten single digits written on a black background. In contrast, SVHN images are taken from house numbers and typically contain not only the digit of interest but also parts of the adjacent digits, which are nuisance information. We translate in both directions, SVHN→MNIST and MNIST→SVHN. The results are presented in Fig. 2. We can observe that in the easier direction of SVHN→MNIST, in which there is information loss, NAM resulted in more accurate translations than CycleGAN. In the reverse direction of MNIST→SVHN, which is harder due to information gain, CycleGAN did much worse, whereas NAM was often successful. Note that perceptual loss was not used in the MNIST→SVHN translation task.

            SVHN→MNIST   MNIST→SVHN
CycleGAN    26.8         17.7
NAM         33.3         31.9
Table 1: Translation quality measured by translated digit classification accuracy (%)

We performed a quantitative evaluation of the quality of translation between SVHN and MNIST. This was achieved by mapping an image from one dataset to appear like the other dataset, and classifying it using a pre-trained classifier trained on the clean target data (the classifier followed a NIN architecture [19], and achieved test accuracies of around 99.5% on MNIST and 95.0% on SVHN). The results are presented in Tab. 1. We can see that the superior translations of NAM are manifested in higher classification accuracies.
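
The evaluation protocol described above amounts to a few lines. In this sketch `translate` and `classify` are placeholders for the trained translation model and the pre-trained target-domain classifier; the toy integers below only exercise the bookkeeping and are not real data.

```python
def translation_accuracy(translate, classify, images, labels):
    """Percentage of translated images whose predicted class matches the source label."""
    correct = sum(
        1 for image, label in zip(images, labels)
        if classify(translate(image)) == label
    )
    return 100.0 * correct / len(labels)

# Toy stand-ins: "images" are integers, the classifier predicts their parity.
toy_images = [1, 2, 3, 4]
toy_labels = [1, 0, 1, 1]
toy_translate = lambda image: image + 100
toy_classify = lambda translated: translated % 2

accuracy = translation_accuracy(toy_translate, toy_classify, toy_images, toy_labels)
print(accuracy)
```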

Edges2Shoes: The task of mapping edges to shoes is commonly used to qualitatively evaluate unsupervised domain mapping methods. The two domains are a set of Shoe images first collected by [31], and their edge maps. The transformation between an edge map and the original photo-realistic shoe image is non-trivial, as much information needs to be hallucinated.

Figure 3: (a) Comparison of NAM and DiscoGAN for Edges→Shoes. Each triplet shows NAM (center row) vs. DiscoGAN (top row) for a given input (bottom row). (b) A similar visualization for Edges→Handbags. (c,d) NAM mapping from a single source edge image (shown first) for different random initializations.

Examples of NAM and DiscoGAN results can be seen in Fig. 3(a). The higher quality of the analogies generated by NAM is apparent. This stems from using a pre-learned generative model rather than learning jointly with mapping, which is hard and results in worse performance. We also see the translations result in more faithful analogies. Another advantage of our method is the ability to map one input into many proposed solutions. Two examples are shown in Fig. 3(c) and (d). It is apparent that the solutions all give correct analogies, however they give different possibilities for the correct analogy. This captures the one-to-many property of the edge to shoes transformation.

As mentioned in the method description, NAM requires high-quality generators, and performs better for better pre-trained generators. In Fig. 4 we show NAM results for generators trained with: VAE [15] with high (VAE-h) and low (VAE-l) regularization, GLO [2], DCGAN [25] and Spectral-Normalization GAN [23]. We can see from the results that NAM works in all cases; however, results are much better for the best generators (DCGAN, Spectral-Norm GAN).

Figure 4: Comparison of NAM results for different generators (columns: Target, VAE-h, VAE-l, GLO, DCGAN, SNGAN).

Edges2Handbags: The Edges2Handbags [36] dataset is constructed similarly to Edges2Shoes. Sample results on this dataset can be seen in Fig. 3(b). The conclusions are similar to Edges2Shoes: NAM generates analogies that are both more appealing and more precise than DiscoGAN.

Shoes2Handbags: One of the major capabilities displayed by DiscoGAN is being able to relate domains that are very different. The example shown in [14], of mapping images of handbags to images of shoes that are semantically related, illustrates the ability of making distant analogies.

Figure 5: Example results for mapping from bags (original images - top) to shoes. NAM mapped images (center) are clearly better than DiscoGAN mapped images (bottom).

In this experiment we show that NAM is able to make analogies between handbags and shoes, resulting in higher quality solutions than those obtained by DiscoGAN. In order to achieve this, we replace the reconstruction VGG loss by a Gram matrix VGG loss, as used in Style Transfer [8]. DiscoGAN also uses a Gram matrix loss (with features extracted from its discriminator). For this task, we also add a skip connection from $G(z)$, as the domains are already similar under a style loss.
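
For readers unfamiliar with Gram matrix losses, the sketch below shows the basic computation on a single feature map. The normalization by the number of spatial positions and the L1 comparison are assumptions for the example (conventions vary), and random arrays stand in for VGG features.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def gram_loss(feat_a, feat_b):
    """L1 distance between the Gram matrices of two feature maps."""
    return float(np.abs(gram_matrix(feat_a) - gram_matrix(feat_b)).sum())

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 6, 6))           # stand-in for one block of features
order = rng.permutation(36)
shuffled = feat.reshape(4, -1)[:, order].reshape(4, 6, 6)   # same values, new layout
other = rng.standard_normal((4, 6, 6))

print(gram_loss(feat, shuffled))   # close to zero: spatial layout is ignored
print(gram_loss(feat, other))
```

Because the Gram matrix discards spatial arrangement, it compares texture-like statistics rather than aligned content, which suits domains such as handbags and shoes that are not spatially aligned.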

Example images can be seen in Fig. 5. The superior quality of the NAM generated mapped images is apparent. The better quality is a result of using an interpretable and well understood non-adversarial loss which is quite straightforward to optimize. Another advantage comes from being able to "plug in" a high-quality generative model.

DiscoGAN   NAM
13.81      1.47
Table 2: Car2Car root median residual deviation from linear alignment (lower is better).

Car2Car: The Car2Car dataset is a standard numerical baseline for cross-domain image mapping. Each domain consists of a set of different cars, presented at angles varying from -75 to 75 degrees. The objective is to align the two domains such that a simple relationship exists between the orientation of the $Y$ car image $y$ and the mapped $X$ image $x$ (typically, either the orientations of $y$ and $x$ should be equal or reversed). A few cars mapped by NAM and DiscoGAN can be seen in Fig. 6. Our method results in a much cleaner mapping. We also quantitatively evaluate the mapping, by training a simple regressor on the car orientation in the $X$ domain, and comparing the ground-truth orientation of $y$ with the predicted orientation of the mapped image $x$. We evaluate using the root median residuals (as the regressor sometimes flips orientations of -75 to 75, resulting in anomalies). For car2car, we used a skip connection from $G(z_y)$ to the output. Results are seen in Tab. 2. Our method significantly outperforms DiscoGAN. Interestingly, on this task it was not necessary to use a perceptual loss; a simple Euclidean pixel loss was sufficient for a very high-quality solution. As a negative result, on the car2head task, i.e. mapping between car images and images of heads of different people at different azimuth angles, NAM did not generate a simple relation between the orientations of the cars and heads but a more complex relationship. Our interpretation from looking at results is that black cars were inversely correlated with the head orientation, whereas white cars were positively correlated.
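
The motivation for a median-based score can be seen on a small example. The sketch assumes that "root median residual" means the square root of the median squared residual, which is an interpretation and not a definition given in the text; the angles are made up for illustration.

```python
import math
import statistics

def root_median_residual(predicted, target):
    """Square root of the median squared residual (assumed definition)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(statistics.median(squared))

def root_mean_residual(predicted, target):
    """Ordinary RMSE, shown for comparison."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up orientations in degrees; the last prediction is flipped from 75 to -75.
target_angles = [-60, -30, 0, 30, 75]
predicted_angles = [-58, -31, 1, 33, -75]

rmr = root_median_residual(predicted_angles, target_angles)
rmse = root_mean_residual(predicted_angles, target_angles)
print(rmr, rmse)
```

A single flipped prediction dominates the mean-based score but leaves the median-based score unaffected, which matches the stated reason for using the median.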

Figure 6: Example results for mapping across two sets of car models at different orientations. Although DiscoGAN (bottom) does indeed preserve orientation of the original images (top) to some extent, NAM (center) preserves both orientation and general car properties very accurately - despite the target domain containing few sports cars.

Avatar2Face: One of the first applications of cross-domain translation was face to avatar generation by DTN [28]. This was achieved by using state-of-the-art face features, and ensuring that the features are preserved between the original face and the output avatar ($f$-constancy). Famously however, DTN does not generate good results on avatar2face generation, which involves adding rather than taking away information. Due to the many-to-one nature of our approach, NAM is better suited for this task. In Fig. 7 we present example images of our avatar2face conversions. These were generated by a small generative model with a DCGAN [25] architecture, trained using Spectral Normalization GAN [23] on celebA face images. The avatar dataset was obtained from the authors of [29].

Figure 7: Example results for mapping Avatars (top) to Faces (bottom) using NAM.

Plugging in State-of-the-Art Generative models: One of the advantages of our method is the independence between mapping and generative modeling. The practical consequence is that any generative model, even very large models that take weeks to train, can be effortlessly plugged into our framework. We can then map any suitable source domain to it, very quickly and efficiently.

Remarkable progress has recently been made in generative modeling. One of the most striking examples is Progressive Growing of GANs (PGGAN) [13], which has yielded generative models of faces at an unprecedented resolution of 1024×1024. The generative model training took 4 days on 8 GPUs, and the architecture selection is highly non-trivial. Including the training of such generative models in unsupervised domain mapping networks is therefore very hard.

For NAM, however, we simply set $G$ as the trained generative model from the authors' code release. A spatial transformer layer, with parameters optimized by SGD per-image, reduced the model outputs to the Avatar scale (which we chose to be ). We present visual results in Fig. 8. Our method is able to find very compelling analogous high-resolution faces. Scaling up to such high resolution would be highly non-trivial with state-of-the-art domain translation methods. We mention that DTN [28], the state-of-the-art approach for unsupervised face-to-emoji mapping, has not been successful at this task, even though it uses domain specific facial features.

Figure 8: One-to-many high-resolution mapping from Avatars to Faces using the pre-trained generator from [13] (panels: Emoji, Mapped faces).

To show the generality of our approach, we also mapped Avatars to Dog images. The generator was trained using StackGAN-v2 [32]. We plugged the trained generators from the publicly released code into NAM. Although emoji to dogs is significantly more distant than emoji to human faces (all the Avatars used were human faces), NAM was still able to find compelling analogies.

Figure 9: High-resolution mapping from Avatars to Dogs, using the pre-trained generator from [32] (three Emoji / Mapped pairs).

5 Discussion

Human knowledge acquisition typically combines existing knowledge with new knowledge obtained from novel domains. This process is called blending [7]. Our work (as most of the existing literature) focuses on the mapping process i.e. being able to relate the information from both domains, but does not deal with the actual blending of knowledge. We believe that blending, i.e., borrowing from both domains to create a unified view that is richer than both sources would be an extremely potent direction for future research.

An attractive property of our model is the separation between the acquisition of the existing knowledge and the fitting of a new domain. The preexisting knowledge is modeled as the generative model of domain $X$, given by $G$; the fitting process includes the optimization of a learned mapper $T$ from domain $X$ to domain $Y$, as well as identifying exemplar analogies $G(z_y)$ and $y$.

A peculiar feature of our architecture is that the function $T$ maps from the target ($X$ domain) to the source ($Y$ domain) and not the other way around. Mapping in the other direction would fail, since it can lead to a form of mode-collapse, in which all samples are mapped to the same generated $G(z)$ for a fixed $z$. While additional loss terms and other techniques can be added in order to avoid this, mode collapse is a challenge in generative systems and it is better to avoid the possibility of it altogether. Mapping as we do avoids this issue.

6 Conclusions

Unsupervised mapping between domains is an exciting technology with many applications. While existing work is currently dominated by adversarial training, and relies on cycle constraints, we present results that support other forms of training.

Since our method is very different from the existing methods in the literature, we have been able to achieve success on tasks that do not fit well into other models. Particularly, we have been able to map low resolution face avatar images into very high resolution images. On lower resolution benchmarks, we have been able to achieve more visually appealing and quantitatively accurate analogies.

Our method relies on having a high quality pre-trained unsupervised generative model for the $X$ domain. We have shown that we can take advantage of very high resolution generative models, e.g., [13, 32]. As the field of unconditional generative modeling progresses, so will the quality and scope of NAM.

References