Integrated unpaired appearance-preserving shape translation across domains

by Kaili Wang, et al.

We address the problem of unsupervised geometric image-to-image translation. Rather than transferring the style of an image as a whole, our goal is to translate the geometry of an object as depicted in different domains while preserving its appearance. Towards this goal, we propose a fully unpaired model that performs shape translation within a single model and without the need for additional post-processing stages. Extensive experiments on the VITON, CMU Multi-PIE and our own FashionStyle datasets show the effectiveness of the proposed method. In addition, we show that, despite their low dimensionality, the features learned by our model have potential for the item retrieval task.





1 Introduction

Thanks to the development of Generative Adversarial Networks (GAN) [10, 31, 30, 3], image-to-image translation (I2I) has achieved great successes in recent years. With image translation, we refer to the process of generating a novel image, which is similar to the original input image yet different in some aspects. Typically, the input and output images belong to different domains, where the images in the same domain share a common characteristic, e.g. going from photographs to paintings [19], from greyscale to color images [5], or from virtual (synthetic) to real images [41]. Apart from direct applications [22], I2I has proven valuable, among others, as a tool for data augmentation [7] or to learn a representation for cross-domain image retrieval.


Traditionally, each image domain is characterized by a different appearance or style, and I2I is therefore sometimes referred to as style transfer [19]. While the translation process may drastically change the appearance or style of the input image to accommodate for the differences between the two domains, the image semantics are to be preserved, i.e. both input and output should represent the same objects and scene. Moreover, in most works, also the image geometry, i.e. the shape of the objects and the global image composition, is preserved. We refer to this as the image content.

Most methods for I2I build on top of GANs and are data-driven. They learn a translation model from example images of the two domains. While some methods require paired examples [18, 38, 40], some recent methods can do I2I without [31, 42, 2]. To constrain the complexity of the problem and ensure good results, the training data is often restricted to a specific setting, e.g. close-ups of faces [15, 40], people [28, 29], traffic scenes [25], etc. We refer to these as different domains.


Figure 1: Translating a clothing item from a "catalog" image domain to a domain of individuals wearing the indicated item (try-on task, top), and vice versa (take-off task, bottom). Notice how for both tasks the appearance details of the clothing items are preserved while their shape is effectively translated.

In contrast, in this work, as in [38], we focus on the particularly challenging setting where input and output do not belong to the same domain. Instead, we work with one object-centric domain and one that is more contextualized. For instance, we want to go from a single piece of clothing to a person wearing that same item, or from a frontal face crop to a wider shot with arbitrary viewpoint of that same person (see Figs. 1 and 7). Such an image-to-image translation setting is significantly more challenging, as the image geometry changes, and low-level transformations do not suffice to perform the translation. At the same time, the semantics should be preserved, which means the appearance of the clothing item and the identity of the face should not be altered. In analogy to the term style transfer, this could be referred to as shape transfer. While a couple of recent works have looked into this setting [29, 38, 40], we are, to the best of our knowledge, the first to propose a solution that does not require paired data, across different domains, for model training. This is important, as collecting paired data is cumbersome or, in some cases, even impossible. Either way, it limits the amount of data that can be used for training, while access to large amounts of data is crucial for good results.

The main contributions of this paper are five-fold: i) We propose a general method for appearance-preserving shape translation. Our method requires a weak shape-guidance feature, e.g. in the form of a segmentation mask or keypoints. This is needed to localize the object and to specify the target shape. ii) The proposed method has the ability to perform shape translation across different image domains in both directions with a single model. Moreover, no refinement post-processing is required. iii) Our method achieves satisfactory translation results using unpaired data. iv) We utilize the context and structure information to guide the successful geometry transfer with adversarial training. v) Through an extensive analysis we show the potential of the features learned by our model on the cross-domain item retrieval task.

This paper is organized as follows: Sec. 2 positions our work in the literature. In Sec. 3 we present the details of the proposed method. This is followed by an extensive evaluation in Sec. 4. Finally, we draw conclusions in Sec. 5.

2 Related Work

Isola et al. [18] first formulate the image-to-image translation problem with a conditional GAN model which learns a mapping from the source image distribution to the output image distribution using a U-Net neural network in an adversarial way. Zhu et al. [42] propose the cycle-consistency to solve the I2I problem with unpaired data, which enables a lot of applications since it is usually expensive or even impossible to collect paired data for many tasks. Liu et al. [26] assume that there exists a shared latent space for the two related domains and propose a weight-sharing based framework to enforce this constraint. These methods are used to learn a one-to-one mapping function, i.e. the input image is mapped to a deterministic output image. [1, 27, 17, 23] are unpaired multimodal methods which either sample multiple styles from a Gaussian space or capture the styles from exemplar images to generate diverse outputs.

All the above methods focus on appearance transfer where the content depicted in the input and output images has an aligned geometric structure. [28, 29, 4, 32, 15, 40] address the case where the geometry itself is to be transferred. However, these methods focus on within-domain tasks (e.g. person-to-person and face-to-face), which exhibit reduced variability compared to their cross-domain counterparts (e.g. person-to-clothing). Yoo et al. [38] propose one of the first methods addressing cross-domain pixel-level translation. Their method semantically transfers a natural image depicting a person (source domain) to a clothing-item image corresponding to clothing worn by that person on the upper body (target domain), and vice versa. Recently, [13, 34] propose two-stage warping-based methods aimed at virtual try-on of clothing items. These methods focus on learning a thin-plate spline (TPS) operation to transfer the pixel information directly. They need paired data to learn to transfer the shape in a first stage and then refine it in a second stage. In contrast, we propose a more general method that only requires a weak (and general) shape guidance in order to perform translation across different domains. In addition, different from previous work which divides the translation process into multiple stages, our method is able to handle the full appearance-preserving translation, in both directions, within a single model.

3 Methodology

Figure 2: Proposed architecture for unpaired appearance-preserving shape translation across domains.

In this section, we describe our model using the clothing try-on / take-off as an example. It should be noted though that our method can also be applied to other types of data, such as the face try-on / take-off illustrated in Sec. 4. Our goal is to transfer the shape information while keeping the appearance information, all trained without access to paired data. For this, we propose the asymmetric two-stream model shown in Fig. 2. The asymmetry reflects the fact that one of the two domains (domain B) is object-focused (e.g. catalog images) while the other one (domain A) shows the objects in context (e.g. pictures of clothed persons). In the try-on stream (blue arrows), we transfer from the object-focused to the contextualized setting. This requires synthesizing a new image, where the shape of the object is first altered after which it is merged seamlessly with the provided context (in our setting, a segmented image of a person wearing a different piece of clothing). During this process, the color, texture and anything else specific to the object instance is to be preserved. In the take-off stream (red arrows), our goal is to synthesize the product clothing image in a standard frontal view starting from a natural person image with varying pose.

Here, we use $x_A$ and $x_B$ to refer to images from domain A and domain B, respectively. $x_{AB}$ refers to images transferred from domain A to domain B, and vice versa for $x_{BA}$.

3.1 Assumptions

In previous works [13, 38], the try-on and take-off tasks are each solved separately in a supervised way. Here, we solve both tasks in one model using unpaired data, based on the cycle-consistency, shared-latent space, and context information constraints.

Cycle-consistency. For the unpaired setting, translating an image from domain A to domain B can be formulated as a distribution matching problem, i.e. learning a mapping that makes the translated result look like images sampled from domain B's distribution. Such a mapping function can be learned with adversarial learning. However, the learning procedure is highly under-constrained and ill-posed, since there could be multiple mappings from domain A to B. To reduce the space of possible mapping functions, we utilize the cycle-consistency [42], i.e. an image translated from domain A to domain B and back should match the original, and vice versa. This constrains and stabilizes the adversarial training.

Shared-latent space. Similar to [17, 23, 27], we decompose the latent space into a content space and a style space. Different from previous works, we have two assumptions: 1) content space constraint, i.e. the content space can be shared by the two domains; 2) style space constraint, i.e. images from the two domains share the same style space. We use $C_A$ and $C_B$ to denote the content space of domains A and B, respectively, and we assume $C_A$ and $C_B$ are both embedded in a larger latent space $C$. Symbols $S_A$ and $S_B$ denote the style space of domain A and B, respectively. Note that we assume $S_A$ and $S_B$ are the same space $S$, which is a stronger constraint.

To achieve the content space constraint, we use two encoders $E_A^c$ and $E_B^c$ to encode images from domain A and B, respectively. Then, we use a latent content code reconstruction loss to enforce the latent content reconstruction, similar to [17, 23]. To achieve the style space constraint, we utilize both the weight-sharing technique [26] and the latent style code reconstruction loss.

Context information. Although the above cycle-consistency and the shared-latent space constraints enable the unpaired I2I and work well for appearance transfer tasks [26, 23, 27], they are not enough for geometry transfer when the output is multi-modal (i.e. there are multiple possible outputs). To address this, [38] proposed triplet adversarial learning with paired data. However, for the unpaired setting, the adversarial learning on its own is too weak. Here, we propose to use the context information to constrain the output to be deterministic. In particular, for the try-on stream, we propose a Fit-in module which combines the feature maps with the context information. As to the take-off stream, we assume the output is unimodal and directly use the adversarial learning to learn the deterministic mapping.

3.2 Try-on stream

The product clothing image $x_B$ first passes through the domain B content encoder $E_B^c$, resulting in the content code $c_B$ in the shared content space $C$. In parallel, $x_B$ is also encoded into a style code $s_B$ in the shared style space $S$ by the shared style encoder $E^s$. To combine the content and style information in the decoder, we use adaptive instance normalization (AdaIN) [16] layers for all residual and up-sampling blocks. The AdaIN parameters $\gamma$ and $\beta$ are dynamically generated by a multilayer perceptron (MLP) from the style code $s_B$ to ensure the generated person image $x_{BA}$ has the same style (in our case: the same object-specific characteristics) as $x_B$.

$$\mathrm{AdaIN}(z, \gamma, \beta) = \gamma \, \frac{z - \mu(z)}{\sigma(z)} + \beta$$

where $z$ is the activation of the previous convolution layer, and $\mu$ and $\sigma$ are the mean and standard deviation computed per channel. Parameters $\gamma$ and $\beta$ are the output of the MLP of the shared style encoding module.
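The AdaIN operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `adain`, the `(C, H, W)` layout and the stabilizing constant `eps` are assumptions, and `gamma` and `beta` are hand-picked stand-ins for the MLP outputs.

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    """Adaptive instance normalization of one feature map z with shape (C, H, W).

    gamma and beta have shape (C,) and stand in for the MLP outputs.
    eps is an assumed small constant that avoids division by zero.
    """
    mu = z.mean(axis=(1, 2), keepdims=True)     # per-channel mean
    sigma = z.std(axis=(1, 2), keepdims=True)   # per-channel standard deviation
    normalized = (z - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 8))                  # toy activation
gamma = np.array([1.0, 2.0, 0.5, 3.0])          # stand-in for MLP output
beta = np.array([0.0, 1.0, -1.0, 2.0])          # stand-in for MLP output
out = adain(z, gamma, beta)
```

After the call, each output channel has mean close to `beta` and standard deviation close to `gamma`, which is how the style code controls the decoder statistics.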

During decoding, the content code $c_B$ is concatenated with the shape guidance feature along the channel dimension and then fed into the decoder, where content and style are fused by AdaIN. After up-sampling, the result is fed into the "Fit-in" module. In the "Fit-in" module, we first obtain the bounding box of the mask in the context image and then resize the up-sampled feature maps to the size and location of this bounding box, followed by a concatenation with the context image. This design injects the context information, which helps make the shape transform deterministic. The final try-on image $x_{BA}$ is generated after the last convolution block.

In addition, inspired by [18], we introduce an attention mechanism to the discriminator, i.e. we concatenate the mask with the generated image before feeding it into the discriminator. This simple but effective attention operation encourages the discriminator to focus on the generated clothing instead of the context part. This can improve the results, especially when the objects to be translated have a very variable scale/location within the images.
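The mask concatenation described above amounts to adding one input channel to the discriminator. A minimal sketch, assuming a `(3, H, W)` image and an `(H, W)` binary mask (the function name and tensor layout are illustrative, not taken from the paper):

```python
import numpy as np

def discriminator_input(image, mask):
    """Stack a (3, H, W) image with its (H, W) mask into a (4, H, W) array."""
    mask_channel = mask[None].astype(image.dtype)   # add a channel axis
    return np.concatenate([image, mask_channel], axis=0)

image = np.random.default_rng(0).uniform(size=(3, 6, 6))  # toy generated image
mask = np.zeros((6, 6))
mask[1:4, 2:5] = 1                                         # toy clothing region
disc_in = discriminator_input(image, mask)
```

The discriminator's first convolution then takes four input channels instead of three, so it can see where the translated object is located.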

3.3 Take-off stream

For the take-off stream, the person image ($x_A$) first passes through a convolution block and then gets multiplied with the clothing mask in order to exclude the background and skin information. Similar to the try-on stream, the masked feature maps are then encoded into a content code $c_A$ in the shared content space $C$.

For the decoding part, the only difference with the try-on stream is that there is no "Fit-in" module: the final take-off product clothing image $x_{AB}$ is generated by decoding the content code $c_A$ and the style code $s_A$ through the decoder with AdaIN residual blocks, up-sampling blocks and convolution blocks.

3.4 Learning

In this section, we only describe the A-to-B translation for simplicity and clarity. The B-to-A translation is learned in a similar fashion. We denote the content latent code as $c_A$, the style latent code as $s_A$, the within-domain reconstruction output as $x_{AA}$, and the cross-domain translation output as $x_{AB}$.

Our loss function contains terms for bidirectional reconstruction loss, cycle-consistency loss and adversarial loss similar to [17, 23]. Besides, we also use a composed perceptual loss to preserve the appearance information across domains, and a symmetry loss capturing some extra domain knowledge similar to [15, 40].

Bidirectional reconstruction loss ($L_{pix}$, $L_{lat}$). This loss consists of the pixel-level image self-reconstruction loss $L_{pix}$ and the feature-level latent reconstruction loss $L_{lat}$, where the latter contains both content and style code reconstructions. The bidirectional reconstruction loss encourages the network to learn pairs of encoder and decoder that are inverses of one another and also stabilizes the training.
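The equation for this loss did not survive in this copy of the text. One plausible way to write the two terms for the A-to-B direction is sketched below; the L1 norm and all symbols ($E_A^c$, $E_B^c$ for the content encoders, $E^s$ for the shared style encoder, $G_A$, $G_B$ for the decoders) are assumptions chosen to match the verbal description, not the authors' exact formulation.

```latex
\begin{aligned}
c_A &= E_A^c(x_A), \qquad s_A = E^s(x_A), \qquad x_{AB} = G_B(c_A, s_A), \\
L_{pix} &= \lVert G_A(c_A, s_A) - x_A \rVert_1, \\
L_{lat} &= \lVert E_B^c(x_{AB}) - c_A \rVert_1 + \lVert E^s(x_{AB}) - s_A \rVert_1 .
\end{aligned}
```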


Cycle-consistency loss ($L_{cyc}$). The cycle-consistency loss can be treated as a reconstruction loss, i.e. an image from domain A is first mapped to domain B, and then reconstructed back to domain A (and vice versa). We use this loss to achieve the cycle-consistency constraint.
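The equation is missing from this copy of the text. A hedged sketch of what such a term typically looks like for the A-to-B-to-A cycle is given below, assuming an L1 norm and writing $x_{AB}$ for the translated image, $E_B^c$ and $E^s$ for the domain B content encoder and shared style encoder, and $G_A$ for the domain A decoder (the context input used by the try-on decoder is omitted for brevity).

```latex
L_{cyc} = \big\lVert G_A\big(E_B^c(x_{AB}),\, E^s(x_{AB})\big) - x_A \big\rVert_1
```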


Adversarial loss ($L_{adv}$). To make the translated image look realistic for its target domain, we use an adversarial loss to match the domain distribution. For the A-to-B translation, the domain B discriminator tries to distinguish the generated fake images from the real domain B images, while the generator tries to generate realistic domain B images.
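The equation is missing from this copy of the text. Assuming the standard minimax GAN objective, with $D_B$ denoting the domain B discriminator and $x_{AB}$ the translated image, the term could be written as below; a least-squares variant would be equally consistent with the description.

```latex
L_{adv} = \mathbb{E}_{x_B}\big[\log D_B(x_B)\big]
        + \mathbb{E}_{x_A}\big[\log\big(1 - D_B(x_{AB})\big)\big]
```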


Perceptual loss ($L_P$). To preserve the appearance information, we apply a composed perceptual loss.


where $x_A^{RoI}$ is the Region of Interest (RoI) of $x_A$. For clothing items, it is the segmented clothing region. For the face experiments, it is the facial region (without context information). $\Phi$ is a network trained on external data, whose representation can capture image similarity. Similar to [8, 20], we use the first convolution layer of all five blocks in VGG16 [33] to extract the feature maps from which we calculate the Gram matrix, which contains non-localized style information. $\lambda_1$ and $\lambda_2$ are the corresponding loss weights.
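To make the "non-localized" property of the Gram matrix concrete, the sketch below computes it for a toy feature map and compares two sets of feature maps. The normalization by `C * H * W` and the mean absolute difference are assumptions; in the paper the feature maps come from VGG16, whereas here random arrays stand in for them.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W (assumed)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)       # one row per channel
    return flat @ flat.T / (c * h * w)      # (C, C) channel correlations

def gram_distance(feats_a, feats_b):
    """Sum over layers of the mean absolute Gram difference (assumed form)."""
    return sum(np.abs(gram_matrix(a) - gram_matrix(b)).mean()
               for a, b in zip(feats_a, feats_b))

rng = np.random.default_rng(0)
feat = rng.normal(size=(5, 6, 6))           # stand-in for one VGG16 layer
flipped = feat[:, :, ::-1]                  # same values, mirrored positions
g = gram_matrix(feat)
d_same = gram_distance([feat], [flipped])
d_other = gram_distance([feat], [rng.normal(size=(5, 6, 6))])
```

Because the Gram matrix sums over all spatial positions, mirroring the feature map leaves it unchanged, so `d_same` is numerically zero. This is why it suits a shape translation task, where texture should match even though geometry differs.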

Symmetry loss ($L_{sym}$). To utilize the inherent prior knowledge of clothing and human faces, we apply a symmetry loss to the take-off stream similar to [15, 40].


where $H$ and $W$ denote the height and width of the image, $(i, j)$ are the coordinates of each pixel, and $x_{AB}^{i,j}$ refers to a pixel in the transferred image $x_{AB}$.
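The equation is missing from this copy of the text, but a symmetry loss of this kind is commonly computed as the mean absolute difference between the left half of the image and the mirrored right half. The sketch below follows that assumption; the function name and the use of an L1 difference are illustrative choices.

```python
import numpy as np

def symmetry_loss(image):
    """Mean absolute difference between left half and mirrored right half.

    image has shape (H, W) or (H, W, C); the L1 form is an assumption.
    """
    half = image.shape[1] // 2
    left = image[:, :half]
    right_mirrored = image[:, ::-1][:, :half]   # column j paired with W-1-j
    return np.abs(left - right_mirrored).mean()

symmetric = np.array([[1.0, 2.0, 2.0, 1.0],
                      [3.0, 4.0, 4.0, 3.0]])
asymmetric = np.array([[1.0, 2.0, 3.0, 4.0]])
loss_sym = symmetry_loss(symmetric)     # 0.0
loss_asym = symmetry_loss(asymmetric)   # (|1-4| + |2-3|) / 2 = 2.0
```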

Total loss. Our model, including encoders, decoders and discriminators is optimized jointly. The full objective is as follows,


where the $\lambda$ coefficients are the loss weights for the different loss terms.

3.5 Network architecture

The overall model can be divided into several sub-networks (for more details we refer the reader to the supplementary material). For the content encoders $E_A^c$ and $E_B^c$, we use a convolution block and several down-sampling layers followed by several residual blocks. Each residual block has two convolution blocks. The decoders $G_A$ and $G_B$ are symmetric with the encoding part except for the Fit-In module.

The Fit-In module receives the feature map from the last up-sampling layer, resizes it and inserts it on an empty tensor by considering the size and location of the bounding box. The resized feature map is then concatenated with the context information and sent to the next convolution block.
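The Fit-In steps above (bounding box, resize, insert on an empty tensor, concatenate with context) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: nearest-neighbour resizing, a `(C, H, W)` layout, and the helper names `resize_nearest`, `mask_bounding_box` and `fit_in` are all hypothetical; the paper does not specify the interpolation mode here.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array (assumed interpolation)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]

def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region, end-exclusive."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def fit_in(feat, context, mask):
    """Place feat inside the mask's bounding box, then append the context."""
    top, left, bottom, right = mask_bounding_box(mask)
    canvas = np.zeros((feat.shape[0],) + mask.shape, dtype=feat.dtype)
    canvas[:, top:bottom, left:right] = resize_nearest(
        feat, bottom - top, right - left)
    return np.concatenate([canvas, context], axis=0)

feat = np.ones((2, 4, 4))                       # toy up-sampled feature map
context = np.full((3, 8, 8), 0.5)               # toy context image
mask = np.zeros((8, 8))
mask[2:6, 3:7] = 1                              # toy clothing mask
fused = fit_in(feat, context, mask)             # shape (5, 8, 8)
```

The output keeps the context channels untouched and carries the object features only inside the bounding box, so the following convolution block sees both at aligned spatial positions.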

The shared style encoding module contains a style encoder and a multilayer perceptron (MLP) similar to [17]. The style encoder consists of several down-sampling layers, followed by a global average pooling layer and a fully-connected layer. The MLP consists of two fully-connected layers.

4 Evaluation

We evaluate our method on both clothing try-on / take-off and face try-on / take-off tasks. We perform an ablation study on our own FashionStyle dataset. Then, the full model results on VITON and MultiPIE datasets are reported. Finally, we assess the potential of the learned style/appearance representation for clothing item retrieval across domains.

Datasets We use three datasets: VITON [13], FashionStyle and CMU MultiPIE [11]. VITON and FashionStyle are fashion related datasets, see Figs. 1, 3, 4 for some example images. VITON has around 16,000 images for each domain. However, we find that there are plenty of image duplicates with different file names. After cleaning the dataset, there are 7,240 images in each domain left. The FashionStyle dataset, provided by an industrial partner, has 5,230 training images and 1,320 testing images of clothed people (domain A), and 2,837 training images and 434 testing images of individual clothing items (domain B). For domain A, FashionStyle has multiple views of the same person wearing the same clothing item. We present results on this dataset for one category, namely pullover/sweater. CMU MultiPIE is a large dataset with 750,000+ images for face recognition under pose, illumination and expression changes. Here we focus on images with illumination and pose variations, and divide the dataset in two domains: profile images (domain A) and frontal views (domain B). The first subset has 7,254 profile images and 920 frontal view images while the second subset consists of 145,554 profile images and 18,272 frontal view images.

Metrics We use paired images from different domains depicting the same clothing item to quantitatively evaluate the performance of our method. For the try-on task, we measure the similarity between the original image (from domain A) and a generated version where its corresponding clothing item has been translated into a masked-out version of the image. To create this masked image, we first run a clothing-item segmentation algorithm [24] that we use to remove the clothing item originally worn by the person. For the take-off task, given an image from domain A, we measure the similarity of its corresponding clothing item (from domain B) with the generated item. In both cases, similarity between images is computed using the SSIM [35] and LPIPS [39] metrics. We report the mean similarity across the whole testing set.

For the retrieval task performance is reported in terms of Recall rate. Note that different from standard datasets where there are multiple matching items that could be retrieved in the test dataset, there is only one matching item in the entire database.

Implementation details The perceptual feature extractors in Eq. 6 are LPIPS [39] and Light-CNN [37] networks for clothing translation and face translation, respectively. In all our experiments, we use the Adam [21] optimizer with its two momentum parameters $\beta_1$ and $\beta_2$. The initial learning rate is set to 0.00002. The minibatch size is set separately for the clothing experiments (FashionStyle and VITON) and for the face experiment. We use the segmentation method proposed by [24, 9] to get the clothing mask and its bounding box. For faces, we detect the face landmarks using [6, 36] and then connect the points to get the face mask. The shared content code is a tensor whose dimension is determined by the data. The shared style code is a vector; we use 8, 32 or 128 dimensions in our experiments (the SD values in Table 2).

Figure 3: Ablation study on the FashionStyle dataset. The top part shows try-on results and the bottom part shows take-off results. The first two columns show the input clothing product image and the reference ground truth image. The other columns show the generated results of different model settings. Please zoom in for more details. More results are provided in the supplementary materials.

Figure 4: Try-on and take-off results on the VITON dataset. For try-on (top) each column shows a person (from the top row) virtually trying on different clothing items. For take-off (bottom) each example consists of three images: input image, generated take-off image and the ground-truth (GT) image. Zoom in for more details.

4.1 Ablation Study: Clothing try-on / take-off

We conduct a study in order to analyze the importance of four main components of our model. More precisely, the shared style encoder (Shared S. E.), discriminator attention (Dis. Attention), mask guidance, and perceptual loss, on the FashionStyle dataset. Towards this goal we test different variants of our architecture (Sec. 3) where one of these four components has been removed. In addition, we run an experiment using a supervised model (paired data). The model architecture is a residual block based on U-net similar to PG [28], yet extended to get closer to our model. It is extended by applying our mask multiplication operation after the first convolution block for the supervised take-off experiment. Likewise, we add our Fit-in module for the supervised try-on experiment. Please refer to the supplementary material for an ablation study of our extensions on the supervised model. We present quantitative results on the translation performance of the try-on / take-off tasks in Table 1 for the FashionStyle dataset with related qualitative results presented in Fig. 3(left).

Method                Try-on ROI (SSIM / LPIPS)    Take-off (SSIM / LPIPS)
W/O P. Loss           79.00 / 17.80                58.96 / 36.62
W/O shared S.E.       78.81 / 17.65                60.33 / 34.94
W/O Dis. Attention    78.48 / 18.03                60.97 / 34.71
W/O mask Guide        77.63 / 18.08                61.48 / 34.53
Full model            79.20 / 17.40                61.19 / 34.37
Supervised model      81.87 / 15.12                61.54 / 32.56

Table 1: Mean SSIM and LPIPS-VGG similarity of each setting from our ablation study. Higher SSIM values and lower LPIPS values indicate higher similarity. Both metrics are in the range [0, 100].

Discussion. A quick inspection of Table 1 reveals that, among all the unsupervised models, the full model generates images with the highest similarity to the ground truth for both metrics on the try-on task. Similarity on this task is affected most when the discriminator attention (Dis. Attention) and mask guidance are dropped. It is important to note that these two factors are directly related to the weak shape-guidance feature introduced by our method. This confirms the relevance of this feature when translating shape from images in this direction (try-on).

For the take-off task, results are again mostly dominated by the full model. However, different from the try-on task, the take-off task is mostly affected by the removal of the perceptual loss, i.e. LPIPS. Although this trend is different from the try-on task, it is not surprising given that, for the take-off task, the expected shape of the translated image is more constant than that of the try-on task, which is directly affected by the person's pose. Moreover, the output of the take-off task is mostly dominated by uniformly-coloured regions, which is a setting in which perceptual similarity metrics, such as LPIPS, excel.

A close inspection of Fig. 3(left) further confirms the trends observed in our quantitative evaluation. Note how the full model produces the most visually-pleasing translation result; striking a good balance between shape and level of details on the translated items.

It is remarkable that, quantitatively speaking (Table 1), the performance of our method is comparable to that of the supervised model. Moreover, while the supervised model is very good at translating logos, our method still has an edge when translating patterns (e.g. the squares and stripes examples in Fig. 3), without requiring paired data.

4.2 Clothing try-on / take-off on VITON

We complement the results presented previously with a qualitative experiment (see Fig. 4) for the try-on and take-off tasks on the VITON dataset using the full model.

We see that our method is able to effectively translate the shape of the clothing items across the domains. It is notable that on the try-on task, it is able to preserve the texture information of the items even in the presence of occlusions caused by arms. This is handled by the proposed Fit-in module (Sec. 3) which learns how to combine foreground and contextual information.

Figure 5: Clothing retrieval ablation study. Note that relevant factors for the retrieval are somewhat the opposite of those from the translation task.

4.3 Clothing retrieval

We present the in-shop clothing retrieval results using the extracted style features. We apply the shared style encoder as feature extractor to extract the style codes and then use L2 distance to measure the similarity for retrieval.
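The retrieval step described above (rank database items by L2 distance between style codes, then measure recall) can be sketched as follows. The function names, the toy 2-D codes and the ground-truth indices are illustrative assumptions; in the paper the codes come from the shared style encoder.

```python
import numpy as np

def rank_by_l2(query_codes, database_codes):
    """For each query, return database indices sorted by increasing L2 distance."""
    diffs = query_codes[:, None, :] - database_codes[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)        # (num_queries, num_db)
    return np.argsort(dists, axis=1)

def recall_at_k(ranking, true_index, k):
    """Fraction of queries whose single matching item is within the top k."""
    hits = (ranking[:, :k] == true_index[:, None]).any(axis=1)
    return hits.mean()

database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
queries = np.array([[0.1, 0.0], [0.9, 0.1], [4.0, 4.0]])
true_index = np.array([0, 2, 3])                 # one matching item per query
ranking = rank_by_l2(queries, database)
r1 = recall_at_k(ranking, true_index, 1)         # 2 of 3 queries hit at top-1
r3 = recall_at_k(ranking, true_index, 3)         # all queries hit within top-3
```

With a single matching item per query, recall at k is simply the fraction of queries whose match appears among the first k ranked items.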

Protocol. The shared style encoder is trained and tested on the FashionStyle training and testing sets, respectively. During retrieval, there are 1,302 query images and 434 database images. As shown in Table 2, we provide three baselines: Autoencoder+GAN (AE+GAN), ResNet-50 [14] and ResNet-152 [14]. For AE+GAN, the latent code of the AE is 128-dimensional. We train the model using both domain A and domain B images. ResNet-50 and ResNet-152 are pre-trained on ImageNet.

Discussion. Our method outperforms all the baselines except LPIPS-Alex. Note that LPIPS-Alex extracts the feature maps of different layers as clothing features, resulting in a very high dimensional feature vector (640K dimensions). This is expensive in both compute time and storage, since both scale linearly with the dimensionality. Our extracted style code, on the other hand, has a very low dimension (e.g. 8), which can significantly reduce (over 80K times) the computation. Furthermore, combining our method with LPIPS-Alex in a simple coarse-to-fine way, i.e. first using our method to quickly obtain the coarse top-$k$ results and then using LPIPS-Alex to rerank these results, can achieve the best performance while reducing the aforementioned costs significantly. The value of $k$ can be selected as the point where the performance of our method and LPIPS gets close, e.g. $k = 20$ / $k = 5$ for Ours (SD = 8 / SD = 128), or adapted based on user requirements. A similar gain in performance can be achieved for the case of the VITON dataset (Table 3).
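The coarse-to-fine combination can be sketched as below: a cheap low-dimensional code shortlists the top-k items, and an expensive high-dimensional feature reranks only that shortlist. The function name and toy features are assumptions; keeping the remaining items in their coarse order is also an assumption, chosen because it would explain why the top-20 and top-50 recalls of the combined rows in Table 2 equal those of our method alone.

```python
import numpy as np

def coarse_to_fine(query_small, db_small, query_large, db_large, k):
    """Shortlist with low-dim codes, rerank the shortlist with high-dim features."""
    coarse = np.linalg.norm(db_small - query_small, axis=1)
    order = np.argsort(coarse)
    shortlist, rest = order[:k], order[k:]
    fine = np.linalg.norm(db_large[shortlist] - query_large, axis=1)
    # Reranked shortlist first; remaining items keep their coarse order.
    return np.concatenate([shortlist[np.argsort(fine)], rest])

db_small = np.array([[0.0], [0.2], [0.3], [5.0]])     # toy low-dim codes
db_large = np.array([[0.0, 0.0, 0.0],
                     [1.0, 1.0, 0.9],
                     [1.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0]])                # toy high-dim features
result = coarse_to_fine(np.array([0.05]), db_small,
                        np.array([1.0, 1.0, 1.0]), db_large, k=3)
```

In this toy case item 3 is the best match under the expensive feature but is not shortlisted, which illustrates the trade-off controlled by k: a smaller k saves computation but caps the recall at that of the coarse stage.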

In addition, we also provide a clothing retrieval ablation study on FashionStyle, as shown in Fig. 5. It is interesting to observe that the performance of the retrieval process is affected by different factors than that of the image translation process (Sec. 4.1). We hypothesize that the translation task directly exploits shape related components in order to achieve detailed image generation. On the contrary, the retrieval task considers representative features regardless of whether they grant accurate shape translation.

Complexity analysis.

We also provide a computational complexity analysis for the retrieval. We use the Euclidean distance to measure the difference between the features extracted from two different images. For each query image, the computational complexity is $O(N \cdot D)$, which scales linearly with the feature dimension $D$ and the number of database images $N$. Thus, the computational cost of our method is 80K times (SD = 8) or 5K times (SD = 128) smaller than that of LPIPS-Alex according to the dimensions in Table 2. For LPIPS-Alex + Ours, the computational complexity is $O(N \cdot D_{ours} + k \cdot D_{LPIPS})$, which maintains the performance and significantly reduces the computation compared to $O(N \cdot D_{LPIPS})$ for our naive implementation. While more efficient retrieval algorithms exist, the dependence on the feature dimensionality remains.
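The ratios quoted above can be checked with a small operation count, assuming one multiply-add per feature dimension per compared item and taking "640K" as exactly 640,000 dimensions (an assumption; the text only gives the rounded figure). The database size and k are taken from the FashionStyle protocol and Table 2.

```python
def distance_ops(num_items, dim):
    """Operations for brute-force distance to num_items features of size dim."""
    return num_items * dim

n_db = 434          # FashionStyle database images
d_lpips = 640_000   # assumed exact value of the "640K" LPIPS-Alex dimension
k = 20              # shortlist size used with SD = 8 in Table 2

full = distance_ops(n_db, d_lpips)                 # LPIPS-Alex alone
ratio_sd8 = full // distance_ops(n_db, 8)          # 80,000
ratio_sd128 = full // distance_ops(n_db, 128)      # 5,000
combined = distance_ops(n_db, 8) + distance_ops(k, d_lpips)
speedup_combined = full / combined                 # roughly 21.7
```

Under these assumptions the coarse-to-fine scheme is dominated by the k expensive comparisons, so its cost is close to k/N of the full LPIPS-Alex search.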

Method                                 Dim     top-1   top-5   top-20   top-50
AE+GAN                                 128     9.4     21.9    39.6     57.3
ResNet-50 [14]                         2048    11.9    25.0    40.9     56.4
ResNet-152 [14]                        2048    14.4    29.1    47.6     62.8
LPIPS-Alex [39]                        640K    25.2    42.0    59.5     72.0
Ours (SD = 8)                          8       17.1    37.6    58.1     74.3
Ours (SD = 32)                         32      18.7    39.6    62.5     76.1
Ours (SD = 128)                        128     19.4    41.1    64.1     77.6
LPIPS-Alex + Ours (SD = 8, k = 20)     -       24.4    41.4    58.1     74.3
LPIPS-Alex + Ours (SD = 128, k = 5)    -       24.4    41.1    64.1     77.6

Table 2: Retrieval recall rate (%) on the FashionStyle dataset.

Method                                 Dim     top-1   top-5   top-20   top-50
Ours                                   128     20.2    39.9    64.9     79.3
LPIPS-Alex [39]                        640K    42.3    61.3    77.2     88.7
LPIPS-Alex + Ours (k = 50)             -       41.6    57.7    72.8     79.3

Table 3: Retrieval recall rate (%) on the VITON dataset.

Method             Try-on ROI (SSIM / LPIPS)    Take-off (SSIM / LPIPS)
Face experiment    69.48 / 15.65                43.82 / 39.97

Table 4: Mean SSIM and LPIPS-VGG distance of the face experiment.

Figure 6: Top-5 retrieval results on the FashionStyle dataset sorted in decreasing order from left to right. Correct items are indicated by the green mark.

4.4 Face shape translation

We conduct two experiments related to face translation. In the first experiment, given the input face and the target context (body), we generate a new image where the input face is fitted on the target context (try-on task). In the second experiment, we perform a face take-off task where, given a face image with a side viewpoint, we generate an image where the face from the input is rotated towards the front and zoomed in. We conduct these experiments on the CMU MultiPIE dataset. Qualitative results are presented in Fig. 7. For reference, we present translation similarity measurements in Table 4.

Discussion As can be noted in Fig. 7, images from the different domains, i.e. frontal and side view faces, exhibit many differences regarding scale and the presence of other parts of the body. Yet, the proposed method is able to achieve both translation tasks with a decent level of success. Fig. 7 shows that, for both tasks, besides face orientation, distinctive features such as facial hair, lip color, accessories, and skin color are translated properly to some extent. It is remarkable that this has been achieved without the use of specific regions around discriminative facial parts, e.g. eyes, nose, mouth, ears, as is done by existing work [15, 40]. Quantitative results (Table 4) suggest that the proposed method has comparable performance on both face and clothing-related datasets.

Figure 7: Qualitative results for face translation. On the top, given the input face and the target body, we generate a new image where the input face is fitted on the target body (try-on task), and vice versa (take-off task) at the bottom.

5 Conclusion

We present a method to effectively translate the shape of an object across different domains while preserving its appearance. Extensive empirical evidence suggests that the proposed method has comparable translation performance on both face and clothing-related data. Moreover, our ablation study suggests that the proposed weak shape guidance assists the translation of shape features, thus improving the image generation process. Finally, we have shown that the features learned by the model have the potential to be employed for retrieval tasks, in spite of their low dimensionality.


  • [1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
  • [2] Anonymous. Unsupervised one-to-many image translation. In Submitted to International Conference on Learning Representations, 2019. under review.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [4] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018.
  • [5] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. In ECML/PKDD, volume 10534, pages 151–166, 2017.
  • [6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [7] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan. Synthetic data augmentation using gan for improved liver lesion classification. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 289–293, 2018.
  • [8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • [9] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
  • [10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [11] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. In 2008 8th IEEE International Conference on Automatic Face Gesture Recognition, pages 1–8, 2008.
  • [12] L. Guo, J. Liu, Y. Wang, Z. Luo, W. Wen, and H. Lu. Sketch-based image retrieval using generative adversarial networks. In Proceedings of the 25th ACM International Conference on Multimedia, 2017.
  • [13] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis. Viton: An image-based virtual try-on network. In CVPR, 2018.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • [15] R. Huang, S. Zhang, T. Li, R. He, et al. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.
  • [16] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
  • [17] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
  • [18] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [19] Y. Jing, Y. Yang, Z. Feng, J. Ye, and M. Song. Neural style transfer: A review. CoRR, abs/1705.04058, 2017.
  • [20] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, pages 13–23, San Diego, 2015.
  • [22] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [23] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
  • [24] X. Liang, K. Gong, X. Shen, and L. Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [25] G. Liu, J. Wang, C. Zhang, S. Liao, and Y. Liu. Realistic view synthesis of a structured traffic environment via adversarial training. In 2017 Chinese Automation Congress (CAC), pages 6600–6605, 2017.
  • [26] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
  • [27] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool. Exemplar guided unsupervised image-to-image translation with semantic consistency. arXiv preprint arXiv:1805.11145, 2018.
  • [28] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017.
  • [29] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz. Disentangled person image generation. In CVPR, 2018.
  • [30] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [32] A. Raj, P. Sangkloy, H. Chang, J. Lu, D. Ceylan, and J. Hays. Swapnet: Garment transfer in single view images. In ECCV, 2018.
  • [33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [34] B. Wang, H. Zheng, X. Liang, Y. Chen, L. Lin, and M. Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.
  • [35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, Apr. 2004.
  • [36] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [37] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [38] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. Kweon. Pixel-level domain transfer. In ECCV, 2016.
  • [39] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [40] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, S. Yan, and J. Feng. Towards pose invariant face recognition in the wild. In CVPR, 2018.
  • [41] C. Zheng, T. Cham, and J. Cai. T2Net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In ECCV, pages 798–814, 2018.
  • [42] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.