Unsupervised Multi-Domain Image Translation with Domain-Specific Encoders/Decoders

12/06/2017 ∙ by Le Hui, et al. ∙ Nanjing University

Unsupervised image-to-image translation has seen spectacular advances in recent years. However, recent approaches mainly focus on one model with two domains, which incurs a heavy burden of O(n^2) training time and model parameters when n domains must be freely translated to each other in a general setting. To address this problem, we propose a novel and unified framework named Domain-Bank, which consists of a global shared auto-encoder and n domain-specific encoders/decoders, under the assumption that images from all domains can be projected into a universal shared-latent space. This yields O(n) complexity in model parameters along with a huge reduction in training time. Besides the high efficiency, we show comparable (or even better) image translation results over state-of-the-art methods on various challenging unsupervised image translation tasks, including face image translation, fashion-clothes translation and painting style translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on digit benchmark datasets. Furthermore, thanks to the explicit representation of the domain-specific decoders as well as the universal shared-latent space, the framework also enables incremental learning of a new domain encoder/decoder. A linear combination of different domains' representations can also be obtained by fusing the corresponding decoders.




1 Introduction

Image-to-image translation is a general formulation that encompasses a wide range of computer vision problems. Just as a sentence may be translated into either English or French, an image may be rendered as another image. Many problems in image processing can be defined as “translating” an input image in one domain into a corresponding output image in another domain. For example, denoising, super-resolution and colorization all pertain to image-to-image translation, where the input is a degraded image (noisy, low-resolution, or grayscale) and the output is a high-quality color image.

Recently, a series of attractive works has ignited renewed interest in the image-to-image translation problem by adopting Convolutional Neural Networks (CNNs). Gatys et al. [3] first studied how to use a CNN to reproduce famous painting styles on natural images. Since the seminal work by Goodfellow et al. [4], GANs have been proposed for a wide variety of problems. Unlike past works, [15, 16, 29, 30] utilize GANs to translate an image from a source domain to a target domain in the absence of paired examples. These algorithms often produce impressive results close to the corresponding target domain, since a joint distribution over two different domains can be learnt using images from the marginal distributions of the individual domains.

Notwithstanding their demonstrated success, existing approaches basically focus on the one-model-with-two-domains setting. Specifically, after one full training run, translation is limited to a single pair of domains. After a careful examination of existing image-to-image translation networks, we argue that different marginal distributions can be projected into a common space in their learnt network structures. To the best of our knowledge, translation among n domains with O(n) complexity has not yet been achieved in these previous works.

As a result, each network is only able to capture the translation between two specific domains at a time. For a new domain, the whole network has to be retrained end-to-end, which leads to an unavoidable burden when O(n^2) transformations are required, given n domains. In practice, this makes these methods unable to scale to a large number of domains, especially when the domains need to be incrementally augmented. Additionally, how to further reduce the training time and network model size, and how to enable more flexibility in controlling translation among domains, remain to be addressed.

To overcome these problems, we explore multi-domain image translation, for which we reconsider the joint distributions of multiple domains. From the perspective of probabilistic modeling, the coupling theory [14] states that, in general, there exists an infinite set of joint distributions that yield a given set of marginal distributions. This highly ill-posed problem forces us to make an additional assumption on the structure of the joint distribution. By further considering the interaction of domains, we make a global shared-latent space assumption: every image sampled from any of the n domains can be mapped to a universal shared-latent space. Based on this assumption, we propose a compact and easily extended Domain-Bank framework that learns the joint distribution of every domain pair simultaneously.

In detail, the proposed Domain-Bank framework is composed of multiple domain-specific component banks, each representing one specific domain. Specifically, a component bank consists of an encoder and a decoder for a specific domain. For an input image, the corresponding component bank maps the image to the shared-latent space, and the target domain's bank then decodes it to the target image.
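The component-bank layout can be sketched in a few lines. This is a hypothetical toy in which plain linear maps stand in for the real convolutional encoders/decoders; the dictionary layout and dimension names are our own, chosen only to illustrate how one shared latent space connects n domain-specific components:

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, LATENT_DIM, N_DOMAINS = 16, 8, 3

# One domain-specific encoder/decoder pair ("component bank") per domain.
encoders = {i: rng.standard_normal((LATENT_DIM, IMG_DIM)) for i in range(N_DOMAINS)}
decoders = {i: rng.standard_normal((IMG_DIM, LATENT_DIM)) for i in range(N_DOMAINS)}

def translate(x, src, dst):
    """Translate an image from domain `src` to domain `dst` through the
    shared latent space: encode with the source bank, decode with the target bank."""
    z = encoders[src] @ x      # project into the universal shared-latent space
    return decoders[dst] @ z   # decode into the target domain

x = rng.standard_normal(IMG_DIM)
y = translate(x, src=0, dst=2)
assert y.shape == (IMG_DIM,)
```

Because every domain shares the same latent space, adding a fourth domain only requires one more encoder/decoder entry in each dictionary, not a new model per pair.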

In several challenging unsupervised multi-domain image translation tasks such as face image translation, fashion-clothes translation and painting style translation, we comprehensively demonstrate superior efficiency and results at least comparable to state-of-the-art methods. Furthermore, as more domains' samples are engaged in the Domain-Bank framework, the performance gain on digit-recognition domain adaptation tasks becomes consistently more obvious. More importantly, the framework not only allows us to simultaneously learn translations among various domains, but also enables very efficient incremental learning of a new image domain. This is achieved by learning a new component bank for the domain while holding the other encoders/decoders fixed.

Compared with existing models under the unsupervised image-to-image translation settings, our proposed Domain-Bank is unique in the following aspects:

  • Our model is designed with a compact and clean structure, which obtains a considerably large reduction (from a complexity of O(n^2) to O(n)) in training time and model parameters in the case of n (n ≥ 2) domains with pairwise transformations.

  • The universal shared auto-encoder subnetwork is trained efficiently and effectively with multi-domain training samples, thus leading to better generation, as confirmed by both quantitative and qualitative experimental results.

  • The shared auto-encoder, along with the domain-specific encoders/decoders, provides additional functionality such as linear combination of domains or incremental learning of a new domain.
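To make the first bullet concrete, a quick count shows why pairwise training scales quadratically while the bank scales linearly. The exact component accounting is a sketch of our own (one two-domain model per unordered pair versus one encoder/decoder per domain plus the shared core):

```python
def pairwise_models(n):
    # One dedicated two-domain model per unordered domain pair: n(n-1)/2.
    return n * (n - 1) // 2

def domain_bank_components(n):
    # One domain-specific encoder and decoder per domain, plus the single
    # shared auto-encoder core (counted once).
    return 2 * n + 1

for n in (3, 5, 10):
    print(n, pairwise_models(n), domain_bank_components(n))
```

At n = 10 domains this is already 45 pairwise models against 21 bank components, and the gap widens quadratically.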

The remainder of the paper is organized as follows: Related work is summarized in Section 2. We devote Section 3 to the main technical design of the proposed Domain-Bank. Section 4 gives the experimental results in both quantitative and qualitative aspects. New characteristics of the proposed framework can be found in Section 5. Finally, we conclude our paper in Section 6.

2 Related Work

The image-to-image translation problem has been advanced by deep neural networks, with impressive results especially in the domain/style transfer fields. Neural generative models have recently received an increasing amount of attention. Several algorithms, including generative adversarial networks [4], variational autoencoders (VAEs) [10, 11], stochastic back-propagation [23] and diffusion processes [25], have demonstrated that a deep neural network can learn a domain distribution from examples, so that the learned network can be used to generate novel images. We are interested in the image-to-image translation problem. Analyzing it from a probabilistic modeling perspective, the key challenge is to learn a joint distribution of images in different domains from the marginal distributions of the individual domains.

Image-to-Image Translation. Many image processing and computer vision tasks can be posed as an image-to-image translation problem, mapping an image in one domain to a corresponding image in another domain, e.g., image segmentation, stylization, super-resolution and abstraction. The idea can be traced back at least to Hertzmann et al.'s Image Analogies [6]. For instance, image segmentation can be considered as the problem of mapping a natural image to a corresponding segmented image. More recent approaches use a dataset of input-output samples to learn a parametric function using CNNs. Similar ideas have also been applied to various tasks, including generating photos from sketches, attributes, semantic layouts, etc.

Unsupervised Image-to-Image Translation. In the unsupervised image-to-image setting, we only have two independent sets of images, each containing images of one domain, with no paired samples indicating how an image could be translated to a corresponding image in the other domain. Several other approaches also adopt this unpaired setting, where the goal is to relate two data domains. More recently, [26] proposed the domain transformation network (DTN) and achieved promising results on translating small-resolution face and digit images. Liu et al. proposed CoGAN [16], which uses a weight-sharing strategy to learn a common representation across two domains. Subsequently, Liu et al. made a shared-latent space assumption and proposed an unsupervised image-to-image translation framework [15] that uses VAEs and GANs to learn a mapping from input to output images. Our approach builds on this framework. However, unlike these prior works, we learn translations among multiple domains (more than two) without paired training samples, and at the same time our framework enables very efficient incremental learning of a new domain.

Generative Adversarial Networks (GANs). GANs have achieved great success in a wide variety of computer vision applications, enhancing both supervised and unsupervised tasks. The key to GANs is the adversarial loss, which forces the generated images to be indistinguishable from real images. Learning in a GAN stages a zero-sum game between two players: the discriminator tries to distinguish real samples from fake ones, while the generator attempts to fool it. Since then, various GANs have been proposed for image generation conditioned on class labels [19], attributes [21, 28] and images [13, 15, 30, 16, 29, 22, 7]. A list of training tricks for GANs is given in [24].

Variational Autoencoders (VAEs). A VAE consists of two networks: one encodes a data sample to a latent representation, and the other decodes the latent representation back to data space. The key to VAEs is optimizing a variational bound. By enhancing the variational approximation, superior image generation results were obtained [9, 18]. Larsen et al. [11] proposed a VAE-GAN architecture to improve the image generation quality of VAEs. VAEs were also applied to translating face image attributes in [28]. More recently, Liu et al. [15] extended the VAE-GAN framework to unsupervised image-to-image translation problems.

3 Domain-Bank Networks

Our goal is to learn translations among n domains. This may offer a new understanding of the image domain translation problem, and in turn help design a more elegant architecture for multi-domain translation.

We construct a multi-domain image translation network based on variational autoencoders (VAEs) [10, 11, 23] and generative adversarial networks (GANs) [4, 16, 30], which encodes an input image to the shared-latent space and can also reconstruct/transfer it.

Networks   {E_i, G_i}    {G_i, D_i}    {E_i, G_i, D_i}   {E_i, G_j}                  {E_i, G_j, E_j, G_i}
Functions  VAE for X_i   GAN for X_i   VAE-GAN [11]      Image Translator X_i → X_j  Cycle-consistency [30]
Table 1: Interpretation of the functions of the network architecture.

3.1 Network Architecture

Figure 1 shows our multi-domain translation architecture, which is based on the universal shared-latent space assumption. Suppose we consider an arbitrary pair among the n domains, namely X_i and X_j, which contain training samples {x_i} where x_i ∈ X_i and {x_j} where x_j ∈ X_j, respectively. We denote the corresponding marginal data distributions as P_{X_i} and P_{X_j}. We aim to learn a joint distribution of images in domains X_i and X_j by utilizing images from the marginal distributions of the two individual domains. This extends directly to the case of n domains; that is, we can learn a joint distribution over the n domains.

Every domain (say X_i) has two functional paths, through which any given image sampled from X_i can be projected into the shared-latent code z and recovered back as well. That is, we suppose there exist functions E_i, E_j and G_i, G_j such that, given a pair of corresponding images (x_i, x_j) (where x_i ∈ X_i and x_j ∈ X_j) from the joint distribution, z = E_i(x_i) = E_j(x_j) and, conversely, x_i = G_i(z) and x_j = G_j(z). In our structure, we map domain X_i to domain X_j through the function F_{i→j}, which can be represented by the composition F_{i→j}(x_i) = G_j(E_i(x_i)). Equally, we define two streams between domain X_i and domain X_j: (1) the translation x_{i→j} = F_{i→j}(x_i) and (2) the cycle-reconstruction x̃_i = F_{j→i}(F_{i→j}(x_i)). For an input image in domain X_i, function (1) directly translates it to an image in domain X_j. In function (2), the input image is first translated from domain X_i to domain X_j, and the generated image in domain X_j is then converted back to domain X_i. In addition, a necessary condition for the translation between domains X_i and X_j to exist is the cycle-consistency constraint [8, 15, 30]: x_i = F_{j→i}(F_{i→j}(x_i)). In other words, we can reconstruct the input image by translating back the translated input image. Therefore, the shared-latent space assumption implies the cycle-consistency assumption.

Domain-Specific Encoder and Decoder. Following the architecture used in [15], each image encoder consists of 3 convolutional layers and 3 basic residual blocks [5]; symmetrically, each image decoder consists of 3 basic residual blocks and 3 transposed convolutional layers. In our multi-domain image-to-image translation, different domains have domain-specific encoders E_i and domain-specific decoders G_i. For instance, Monet's paintings need to use Monet's specific encoder, whilst Van Gogh's have to use Van Gogh's specific encoder; the same holds for the domain-specific decoders. Different domains use domain-specific encoders to extract representations of the input images, and domain-specific decoders are then responsible for decoding these representations to reconstruct images in different domains. In other words, each encoder-decoder pair can be seen as a domain-specific component, and we only need to train different components for different domains. In practice, when a new domain arrives, this also enables us to conduct incremental learning to train its encoder and decoder.

Universal Shared Auto-Encoder. Based on the shared-latent space assumption, we enforce a weight-sharing constraint to relate the VAEs. Specifically, we further assume a shared intermediate representation h, such that the process of generating a corresponding image satisfies

z → h → x_i.

Therefore, we assume G_i = G_{i,L} ∘ G_H, where G_H is a common high-level generation function that maps z to h, and the G_{i,L} are low-level generation functions that map h to x_i. From another view, h can be considered as the high-level representation shared across the different domains, and G_{i,L} can be regarded as a special implementation of the mapping from h to each domain. Similarly, the encoders admit the decomposition E_i = E_H ∘ E_{i,L}. In the implementation, we share the weights of the last few layers of the E_i, which are responsible for extracting high-level representations of the input images in the n domains. Equally, the first few layers of the G_i are shared, which are responsible for decoding high-level representations when reconstructing the input images.
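The weight-sharing split can be illustrated schematically. The linear stages and names below (`E_low`, `E_high`, `G_high`, `G_low`) are hypothetical stand-ins for the shared and domain-specific layer groups described above:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 8, 3

# Domain-specific low-level stages, and single shared high-level stages.
E_low = {i: rng.standard_normal((D, D)) for i in range(N)}  # per-domain encoder front
E_high = rng.standard_normal((D, D))   # shared: last few encoder layers
G_high = rng.standard_normal((D, D))   # shared: first few decoder layers
G_low = {i: rng.standard_normal((D, D)) for i in range(N)}  # per-domain decoder back

def encode(x, i):
    # E_i = E_high ∘ E_low_i: the shared layers extract the high-level code.
    return E_high @ (E_low[i] @ x)

def decode(z, i):
    # G_i = G_low_i ∘ G_high: the shared layers decode the high-level code first.
    return G_low[i] @ (G_high @ z)

x = rng.standard_normal(D)
assert decode(encode(x, 0), 1).shape == (D,)
```

The point of the split is that `E_high` and `G_high` are single objects updated by samples from every domain, which is where the cross-domain information sharing happens.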

Domain-Specific Discriminator. Since we have n different domains, our framework has n adversarial networks: D = {D_1, D_2, ..., D_n}. In D_i, for real images sampled from domain X_i, D_i should output true, while for images generated by G_i it should output false. In our framework, G_i can generate two types of images: (1) images from the reconstruction stream x̃_i = G_i(E_i(x_i)) and (2) images from the translation stream x_{j→i} = G_i(E_j(x_j)). The reconstruction stream can be trained with supervision, so we only apply adversarial training to images from the translation stream. Thus, we need to train n domain-specific discriminators for the n different domains.

3.2 Loss Function

To better understand the losses applied in Domain-Bank, we first give a decomposed perspective of the possible combinations of the key components in Figure 1. Basically, our framework is based on variational autoencoders (VAEs) and generative adversarial networks (GANs), including domain image encoders E_i, domain image generators G_i and domain adversarial discriminators D_i, where i ∈ {1, 2, ..., n}. Table 1 further explains the various roles inside our framework and their corresponding functions.

In the image-to-image translation problem of our Domain-Bank, we have three kinds of fundamental information streams, namely the image reconstruction streams, the image translation streams, and the cycle-reconstruction streams. In Table 1, the encoder-decoder pair {E_i, G_i} constitutes a VAE for the image reconstruction stream. For an input domain X_i, an image is translated to another domain X_j by the translation stream {E_i, G_j}. Since the shared-latent space assumption implies the cycle-consistency constraint, we require a cycle-reconstruction stream {E_i, G_j, E_j, G_i} to reconstruct input images. Consequently, we jointly address a VAE problem and a GAN problem to solve the image translation problem.

VAE loss. In our framework, we use variational autoencoders (VAEs) to generate images, where each VAE is supervised by the KL divergence. In the VAE for domain X_i, the encoder outputs a mean vector E_{μ,i}(x_i), where the input image x_i ∈ X_i. The distribution of the latent code z_i is written as q_i(z_i | x_i) ≡ N(z_i | E_{μ,i}(x_i), I), where I is an identity matrix. We treat z_i as a random vector distributed according to q_i(z_i | x_i) and sample from it. Thus, the reconstructed image is x̃_i = G_i(z_i ∼ q_i(z_i | x_i)). In addition, let η be a random vector with a multivariate Gaussian distribution: η ∼ N(η | 0, I). In the VAE, the sampling is implemented via z_i = E_{μ,i}(x_i) + η. The aim of the VAE is to minimize a variational upper bound, so the VAE objective is written as

L_{VAE_i} = λ_1 KL(q_i(z_i | x_i) ‖ p_η(z)) − λ_2 E_{z_i ∼ q_i(z_i | x_i)}[log p_{G_i}(x_i | z_i)],

where the hyper-parameters λ_1 and λ_2 weight the corresponding objective terms, and the KL divergence term penalizes deviation of the latent code distribution from the prior distribution p_η(z) = N(z | 0, I). Notably, the L1 reconstruction loss used in the VAE ensures similarity between the generated image and the original input image.
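Because the latent distribution has identity covariance, the KL term above collapses to a simple quadratic penalty on the encoder's mean vector. A minimal numerical check of that identity (the function name is ours):

```python
import numpy as np

def kl_unit_gaussians(mu):
    """KL( N(mu, I) || N(0, I) ). With both covariances fixed to the identity,
    the general Gaussian KL formula reduces to 0.5 * ||mu||^2: the trace and
    log-determinant terms cancel, leaving only the mean-offset term."""
    mu = np.asarray(mu, dtype=float)
    return 0.5 * float(np.sum(mu ** 2))

assert kl_unit_gaussians([0.0, 0.0]) == 0.0   # mean at the prior: zero penalty
assert kl_unit_gaussians([1.0, 1.0]) == 1.0   # deviation penalized quadratically
```

This is why the KL term acts as a regularizer pulling each domain's latent codes toward the same zero-mean prior, which is what makes the latent space shareable across domains.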

Adversarial loss. The adversarial losses are applied to the translation functions. Recall that we have defined the translation stream x_{i→j} = G_j(z_i ∼ q_i(z_i | x_i)) and its discriminator D_j, where i, j ∈ {1, 2, ..., n}. We express the objective as

L_{GAN_{i,j}} = λ_0 E_{x_j ∼ P_{X_j}}[log D_j(x_j)] + λ_0 E_{z_i ∼ q_i(z_i | x_i)}[log(1 − D_j(G_j(z_i)))],

where the hyper-parameter λ_0 controls the impact of the GAN objective. In the adversarial part, G_j attempts to generate images that look like images from domain X_j, while D_j tries to distinguish between translated samples G_j(z_i) and real samples x_j. Finally, G_j aims to minimize this objective against the adversary D_j that tries to maximize it.
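A minimal numerical sketch of the minimax value above (discriminator outputs are taken as probabilities; λ_0 is dropped for clarity, and the function name is ours):

```python
import numpy as np

def gan_objective(d_real, d_fake, eps=1e-12):
    """E[log D(x_real)] + E[log(1 - D(x_fake))] — the quantity the
    discriminator maximizes and the generator minimizes."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return float(np.mean(np.log(d_real + eps))
                 + np.mean(np.log(1.0 - d_fake + eps)))

# A confident, correct discriminator pushes the objective toward 0;
# one fooled by the generator (D(fake) near 1) drives it strongly negative.
good = gan_objective([0.99], [0.01])
fooled = gan_objective([0.99], [0.99])
assert good > fooled
```

Since only the translation stream is trained adversarially, each D_j sees real samples from X_j against translated samples G_j(E_i(x_i)) arriving from every other domain i.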

Cycle-consistency Loss. We use a VAE-like objective to model the cycle-consistency constraint, which is written as

L_{CC_{i,j}} = λ_3 KL(q_i(z_i | x_i) ‖ p_η(z)) + λ_3 KL(q_j(z_j | x_{i→j}) ‖ p_η(z)) − λ_4 E_{z_j ∼ q_j(z_j | x_{i→j})}[log p_{G_i}(x_i | z_j)].
Full objective. As a result, our ultimate objective is written as

L(E, G, D) = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} ( L_{VAE_i} + L_{GAN_{i,j}} + L_{CC_{i,j}} ),

where E = {E_1, ..., E_n}, G = {G_1, ..., G_n} and D = {D_1, ..., D_n}. We aim to solve the mini-max problem

E*, G* = arg min_{E,G} max_{D} L(E, G, D).
Figure 2: The results of face attribute translation. It contains the results of three sets of hair translation. In each group, the far left is the input image, and the results from left to right are Ours, UNIT and CycleGAN. We obtain results comparable to the others. In particular, our method better preserves the original quality of the human face and the facial identity: only the color of the hair region changes, and the rest remains unchanged even in details. In addition, our results are generated by end-to-end training only once, while the others require training three models for translations between the different pairs of domains.

Figure 3: The results show the painting style translation of famous painters: Cezanne, Monet, and Ukiyo-e. Our generated images are clearer and have a higher contrast ratio than UNIT's, and are close to the target domain in style; all our results are visibly superior to UNIT's. In image quality, our generated images also have high contrast compared to CycleGAN's. Note that all our results are obtained by training only once, avoiding the heavy training burden that the others incur.

3.3 Training Strategy

We employ an alternating training strategy motivated by GANs [4], solving a mini-max problem where the optimization aims to find a saddle point. The zero-sum game in our framework consists of two players: the domain-specific discriminators as the first team, and the domain-specific encoders/decoders as the second. During training with a specific pair of images from domains X_i and X_j, we first train the domain-specific discriminators D_i and D_j with all other components fixed. Afterwards, X_i's encoder/decoder and X_j's encoder/decoder are updated not only to minimize the VAE losses and the cycle-consistency losses but also to defeat the first player.
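The alternating scheme visits domain pairs; one simple way to sweep every unordered pair once per epoch (a scheduling assumption of ours, as the paper does not specify the visiting order) can be sketched as:

```python
from itertools import combinations

def pair_schedule(n_domains, n_epochs):
    """Yield the (i, j) domain pairs visited during training. Within each
    pair, the discriminators D_i, D_j are updated first with everything
    else frozen, then the two encoder/decoder banks are updated."""
    for _ in range(n_epochs):
        yield from combinations(range(n_domains), 2)

assert list(pair_schedule(3, 1)) == [(0, 1), (0, 2), (1, 2)]
```

Since the shared auto-encoder layers appear in every pair's update, each sweep refines the common latent space with samples from all n domains.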

4 Experiments

We first give qualitative results on various tasks along with comprehensive complexity comparisons. Further, we present the quantitative performance gain on digit domain adaptation tasks.

4.1 Qualitative Analysis

Face attribute translation. The CelebA dataset [17] is exploited for attribute-based face image translation. There are many different attributes of face images, including hair, smiling and eyeglasses. In particular, we select the hair domain with different colors: blond, brown, black, etc. Specifically, blond hair constitutes the 1st domain, brown hair the 2nd domain, and black hair the 3rd domain. In Figure 2, we visualize the transitions between hair of different colors. The translated hair images are impressive, and we obtain results comparable to the other algorithms in hair translation.

When trained for multi-domain translation, the shared-latent space of our framework accurately captures the invisible inner-similarity, whereas the encoders/decoders represent the shallow differences. Thanks to the inner-similarity of face images captured by the shared-latent space, besides the change of hair color, our method maintains the original quality of the human face and the facial identity. More importantly, our results are obtained by training the network only once, while UNIT and CycleGAN need to be trained three times for the three kinds of hair.

Painting style translation. We further utilize the landscape photographs downloaded from Flickr and WikiArt, as also used in [30]. The dataset sizes for each artist/style are 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e, respectively. In this experiment, we choose three of them, while Van Gogh is reserved for incremental learning. Figure 3 shows our results in comparison with the other methods. Our results are superior to UNIT's: our generated images are clearer and have a higher contrast ratio, whilst being comparable to CycleGAN's.

Fashion-clothes translation. We show several example results achieved on the recently released Fashion-MNIST dataset [27], which contains 5 domains of clothes and 3 domains of shoes. Precisely, the clothes comprise T-Shirt, Pullover, Coat, Shirt and Dress, whilst the shoes include Sandals, Sneaker and Ankle boots. We treat different categories of clothes as different domains. Figure 5 shows several results of translation between different clothes. In the texture and details of the generated images, our method keeps the original texture of the clothes and produces clearer clothing details. In general, our method obtains translation results superior to the others.

Figure 4: The visualization of digit image translation. The curve on the right represents the classification accuracy on MNIST (at test time, we classify the MNIST dataset using the features extracted by the discriminator trained on the SVHN dataset), and the six images numbered from A to F are visualizations at different iterations. Visibly, our results are superior in terms of image quality and the details describing the digits. Owing to these details, our results obtain higher accuracy for classification by unsupervised learning. Worth mentioning, our approach also speeds up the training process.

Figure 5: The translation between different categories of clothes. Here are ten groups of results, and each group has four columns. The first column is the input image. The second and third columns are generated by UNIT and CycleGAN. The generated images of our method are marked with orange boxes. In the texture and details of the generated images, our method better preserves the texture information and is clearer than the other two methods in details. In addition, our results are obtained through end-to-end training only once.

Summary of Complexity Comparisons. To demonstrate the complexity advantages of our proposed framework, we compare the training time and model parameters with those of the baselines in Table 2. It can be clearly seen that our Domain-Bank has fewer parameters and less training time. This is because the others are able to capture only the translation between one specific pair of domains, which leads to an unavoidable burden when O(n^2) transformations are needed, given n domains. In our framework, however, we merely require end-to-end training once, after which we can efficiently accomplish the translation between any two domains.

Experiment    Type    UNIT [15]   CycleGAN [30]   Ours
Face (3)      Time    6 days      13 days         3 days
              Param   54.06M      68.28M          25.58M
Painting (4)  Time    12 days     27 days         4 days
              Param   19.62M      136.56M         5.94M
Clothes (5)   Time    21 days     46 days         7 days
              Param   32.75M      227.65M         7.28M
Table 2: The cost of parameters and training time. The advantages of our method in saving computation and time are evident in this table.

4.2 Quantitative Performance

In order to better understand the performance gained by sharing more information through more than two domains, we adopt our framework to the domain adaptation task, which adapts a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain where labeled samples in the new domain (target domain) are unavailable during training. In our case, we append additional auxiliary domains by applying our framework with minimal efforts to check whether it can boost the system’s performance.

More specifically, we utilize three datasets for digits: the Street View House Number (SVHN) dataset [20], the MNIST dataset [12] and USPS dataset [2], and perform multi-task learning where our framework is supposed to 1) translate images between any two of three domains and 2) classify the samples in the source domain using the features extracted by the discriminator in it. In the practice, we adopt a small network because the digit images have a small resolution.

Method         CoGAN [16]   UNIT [15]   Ours
SVHN → MNIST   -            90.53%      91.46%
MNIST → USPS   95.65%       95.97%      96.45%
USPS → MNIST   93.15%       93.58%      94.12%
Table 3: Unsupervised domain adaptation performance. The reported numbers are classification accuracies.

In the experiment, we find that the cycle-consistency constraint is not necessary for this problem, which is why we remove the cycle-consistency stream from the framework. In addition, we also tie the weights of the high-level layers of the discriminators in order to adapt a classifier trained in the source domain to the target domain.

As a result, Figure 4 shows the visualization of the digits and Table 3 reports the achieved performance in comparison with the competing algorithms. We achieve better performance on the SVHN → MNIST task than the UNIT approach, which is the current state of the art. We also obtain superior results to UNIT on the MNIST → USPS and USPS → MNIST tasks.

5 Capabilities of Our Framework

5.1 Incremental Training

When a new style needs to be added, retraining a new model is inconvenient and risks failing to recover the performance of the previously trained model. The framework proposed in this paper has the same capability described in [1] of supporting incremental training.

Figure 6: Results of translation between the incrementally trained painting style of Van Gogh and other well-trained domains. The images in the far left column are the inputs of each row, and the second column shows the translated images. The third column shows the reconstructed images for the corresponding input images, whilst the last column shows images from the target domain.

When we need to add a new style, only images sampled from the specific incremental domain participate in the training process. The incremental domain's encoder/decoder and discriminator layers are not fixed, as shown in Figure 7. Considering the incremental domain X_{n+1} and an existing domain X_a, X_a's encoder/decoder is fixed and only samples x_{n+1}, where x_{n+1} ∈ X_{n+1}, assist the incremental training. The loss function for incremental training is defined as follows (n is the number of existing domains):

L_{inc} = Σ_{a=1}^{n} ( L_{VAE_{n+1}} + L_{GAN_{n+1,a}} + L_{CC_{n+1,a}} ).
Furthermore, we show a few samples in Figure 6 to demonstrate the efficiency.
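The freeze-and-extend step can be sketched as bookkeeping over the bank. This is a schematic of our own: the `trainable` flags stand in for freezing real network weights, and the string names are placeholders:

```python
def add_domain(banks, new_id, new_components):
    """Incrementally register a new domain: freeze every previously trained
    component bank, then attach a fresh trainable encoder/decoder."""
    for bank in banks.values():
        bank["trainable"] = False          # existing domains stay fixed
    banks[new_id] = dict(new_components, trainable=True)
    return banks

banks = {0: {"enc": "E_0", "dec": "G_0", "trainable": True}}
add_domain(banks, 1, {"enc": "E_1", "dec": "G_1"})
assert banks[0]["trainable"] is False
assert banks[1]["trainable"] is True
```

Only the new domain's encoder, decoder and discriminator front layers receive gradients, so the cost of adding a style is independent of how many domains are already trained.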

Figure 7: The blue parts are the encoder/decoder and discriminator front layers of the incremental domain, which are trainable in the incremental training process, whereas the gray parts belong to the other, already well-trained domains and are fixed.

Figure 8: Results of the fusion of two styles with varying ratio. The far left column shows input images, while the far right shows target domain images. Every row shows a progressive translation between two styles.

5.2 Domain Fusion

In this section, we demonstrate an experiment on style fusion: the linear fusion of two different styles.

Style Linear Fusion. Translations between different styles are encoded into different pairs (E_i, G_j): E_i is the input port for the source domain and G_j is the output port for the target domain. We linearly combine the decoders G_j and G_k of different target domains in the Domain-Bank layers, and the fused decoder serves as a new output port:

G_fuse = α G_j + (1 − α) G_k.

Figure 8 shows progressive results of two styles with varying ratio α.
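Fusing two target decoders by linear interpolation of their weights can be sketched as follows. Toy linear decoders stand in for the real decoder layers; with α = 1 the fusion reduces to G_j and with α = 0 to G_k:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
G_j = rng.standard_normal((D, D))   # decoder weights of target style j
G_k = rng.standard_normal((D, D))   # decoder weights of target style k

def fuse(G_a, G_b, alpha):
    """Linear combination of two target-domain decoders; sweeping alpha
    over [0, 1] produces a progressive interpolation between the two styles,
    as in Figure 8."""
    return alpha * G_a + (1.0 - alpha) * G_b

assert np.allclose(fuse(G_j, G_k, 1.0), G_j)
assert np.allclose(fuse(G_j, G_k, 0.0), G_k)
```

Because every decoder maps out of the same shared latent space, the interpolated weights still receive valid latent codes, which is what makes this simple fusion meaningful.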

6 Conclusion

In this paper, we have proposed a novel multi-domain image translation framework, namely Domain-Bank. We show that it learns to translate from multiple domains to multiple domains in one training process. In particular, our Domain-Bank explicitly reduces the training complexity from O(n^2) to O(n), given n domains. The universal shared auto-encoder subnetwork leads to better generation, as confirmed by both quantitative and qualitative experimental results. More notably, our framework has fewer parameters and less training time compared to the others. In addition, we also provide functional augmentations such as domain linear combination and incremental learning.