Improving Shape Deformation in Unsupervised Image-to-Image Translation

08/13/2018 ∙ by Aaron Gokaslan, et al.

Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.






Code Repositories


Source code and information for the ECCV 2018 paper: Gokaslan et al., 'Improving Shape Deformation in Unsupervised Image-to-Image Translation'


1 Introduction

Unsupervised image-to-image translation is the process of learning an arbitrary mapping between image domains without labels or pairings. This can be accomplished via deep learning with generative adversarial networks (GANs), through the use of a discriminator network to provide instance-specific generator training, and the use of a cyclic loss to overcome the lack of supervised pairing. Prior works such as DiscoGAN [20] and CycleGAN [46] are able to transfer sophisticated local texture appearance between image domains, such as translating between paintings and photographs. However, these methods often have difficulty with objects that have both related appearance and shape changes; for instance, when translating between cats and dogs.

Coping with shape deformation in image translation tasks requires the ability to use spatial information from across the image. For instance, we cannot expect to transform a cat into a dog by simply changing the animals’ local texture. From our experiments, networks with fully connected discriminators, such as DiscoGAN, are able to represent larger shape changes given sufficient network capacity, but train much more slowly [17] and have trouble resolving smaller details. Patch-based discriminators, as used in CycleGAN, work well at resolving high frequency information and train relatively quickly [17], but have a limited ‘receptive field’ for each patch that only allows the network to consider spatially local content. These networks reduce the amount of information received by the generator. Further, the functions used to maintain the cyclic loss prior in both networks retain high frequency information in the cyclic reconstruction, which is often detrimental to shape change tasks.

We propose an image-to-image translation system, designated GANimorph, to address shortcomings present in current techniques. To allow patch-based discriminators to use more image context, we use dilated convolutions in our discriminator architecture [42]. This allows us to treat discrimination as a semantic segmentation problem: the discriminator outputs per-pixel real-vs.-fake decisions, each informed by global context. This per-pixel discriminator output facilitates more fine-grained information flow from the discriminator to the generator. We also use a multi-scale structural similarity perceptual reconstruction loss to help represent error over image areas rather than just over pixels. We demonstrate that our approach is more successful on a challenging shape deformation toy dataset than previous approaches. We also demonstrate example translations involving both appearance and shape variation by mapping human faces to dolls and anime characters, and mapping cats to dogs (Figure 1).

The source code to our GANimorph system and all datasets are online:

Figure 1: Our approach translates texture appearance and complex head and body shape changes between the cat and dog domains (left: input; right: translation).

2 Related Work

Image-to-image Translation.

Image analogies provides one of the earliest examples of image-to-image translation [14]. The approach relies on non-parametric texture synthesis and can handle transformations such as seasonal scene shifts [22], color and texture transformation, and painterly style transfer. Despite the ability of the model to learn texture transfer, the model cannot affect the shape of objects. Recent research has extended the model to perform visual attribute transfer using neural networks [25, 13]. However, despite these improvements, deep image analogies are unable to achieve shape deformation.

Neural Style Transfer.

These techniques show transfer of more complex artistic styles than image analogies [10]. They combine the style of one image with the content of another by matching the Gram matrix statistics of early-layer feature maps from neural networks trained on general supervised image recognition tasks. Further, Dumoulin et al. [8] extended Gatys et al.’s technique to allow for interpolation between pre-trained styles, and Huang et al. [15] allowed real-time transfer. Despite this promise, these techniques have difficulty adapting to shape deformation, and empirical results have shown that these networks only capture low-level texture information [2]. Reference images can affect brush strokes, color palette, and local geometry, but larger changes such as anime-style combined appearance and shape transformations do not propagate.

Generative Adversarial Networks.

Generative adversarial networks (GANs) have produced promising results in image editing [24], image translation [17], and image synthesis [11]. These networks learn an adversarial loss function to distinguish between real and generated samples. Isola et al. demonstrated with Pix2Pix that GANs are capable of learning texture mappings between complex domains. However, this technique requires a large number of explicitly-paired samples. Some such datasets are naturally available, e.g., registered map and satellite photos, or image colorization tasks. We show in our supplemental material that our approach is also able to solve these limited-shape-change problems.

For specific domains such as faces, prior work has achieved domain transfer without explicit pairing. For instance, Taigman et al. [37] tackled the problem of generating a personal emoji avatar from a photograph of a human face. Their technique requires a pre-trained facial attribute classifier, plus domain-specific and task-specific supervised labels for the photorealistic domain. Wolf et al. [40] improved the generation by learning an underlying data generating avatar parameterization to create new avatars. Such a technique requires an existing and easily-parameterized model, and therefore cannot cope with more complex art styles, avatars, or avatar scenes, which are difficult to parameterize.

Unsupervised Image Translation GANs.

Pix2Pix-like architectures have been extended to work with unsupervised pairs [20, 46]. Given image domains X and Y, these approaches work by learning a cyclic mapping from X→Y→X and Y→X→Y. This creates a bijective mapping that prevents mode collapse in the unsupervised case. We build upon the DiscoGAN [20] and CycleGAN [46] architectures, which themselves extend Coupled GANs for style transfer [28]. We seek to overcome their shape change limitations through more efficient learning and expanded discriminator context via dilated convolutions, and by using a cyclic loss function that considers multi-scale frequency information (Table 1).

[Image grid; columns: Input, Patch-based, Dense, Dilated]
Table 1: Translating a human to a doll, and a cat to a dog. Dilated convolutions in the discriminator outperform both patch-based and dense convolution methods for image translations that require larger shape changes and small detail preservation.

Other works tackle complementary problems. Yi et al. [41] focus on improving high frequency features over CycleGAN in image translation tasks, such as texture transfer and segmentation. Shuang et al. [30] examine adapting CycleGAN to wider variety in the domains, so-called instance-level translation. Liu et al. [27] use two autoencoders to create a cyclic loss through a shared latent space with additional constraints. Several layers are shared between the two generators and an identity loss ensures that both domains resolve to the same latent vector. This produces some shape transformation in faces; however, the network does not improve the discriminator architecture to provide greater context awareness.

One qualitatively different approach is to introduce object-level segmentation maps into the training set. Liang et al.’s ContrastGAN [24] has demonstrated shape change by learning segmentation maps and combining multiple conditional cyclic generative adversarial networks. However, this additional input is often unavailable and time consuming to declare.

3 Our Approach

Crucial to the success of translation under shape deformation is the ability to maintain consistency over global shapes as well as local texture. Our algorithm adopts the cyclic image translation framework [20, 46] and achieves the required consistency by incorporating a new dilated discriminator, a generator with residual blocks and skip connections, and a multi-scale perceptual cyclic loss.

3.1 Dilated Discriminator

Initial approaches used a global discriminator with a fully connected layer [20]. Such a discriminator collapses an image to a single scalar value for determining image veracity. Later approaches [46, 24] used a patch-based DCGAN [35] discriminator, initially developed for style transfer and texture synthesis [23]. In this type of discriminator, each image patch is evaluated to determine a fake or real score. The patch-based approach allows for fast generator convergence by operating on each local patch independently. This approach has proven effective for texture transfer, segmentation, and similar tasks. However, this patch-based view limits the networks’ awareness of global spatial information, which limits the generator’s ability to perform coherent global shape change.

Reframing Discrimination as Semantic Segmentation.

To solve this issue, we reframe the discrimination problem from determining real/fake images or subimages into the more general problem of finding real or fake regions of the image, i.e., a semantic segmentation task. Since the discriminator outputs a higher-resolution segmentation map, the information flow between the generator and discriminator increases. This allows for faster convergence than using a fully connected discriminator, such as in DiscoGAN.

Current state-of-the-art networks for segmentation use dilated convolutions, and have been shown to require far fewer parameters than conventional convolutional networks to achieve similar levels of accuracy [42]. Dilated convolutions provide advantages over both global and patch-based discriminator architectures. For the same parameter budget, they allow the prediction to incorporate data from a larger surrounding region. This increases the information flow between the generator and discriminator: by knowing which regions of the image contribute to making the image unrealistic, the generator can focus on those regions. An alternative way to think about dilated convolutions is that they allow the discriminator to implicitly learn context. While multi-scale discriminators have been shown to improve results and stability for high resolution image synthesis tasks [38], we will show that incorporating information from farther away in the image is useful in translation tasks, as the discriminator can determine where a region should fit into an image based on surrounding data. For example, this increased spatial context helps localize the face of a dog relative to its body, which is difficult to learn from small patches or patches learned in isolation from their neighbors. Figure 2 (right) illustrates our discriminator architecture.
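To make the "same parameter budget, larger surrounding region" point concrete, the following minimal sketch computes the receptive field of a stack of convolution layers. The two layer stacks are hypothetical illustrations, not the discriminator of Figure 2; both use four 3×3 kernels and therefore the same number of weights per channel pair.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    Each layer is a tuple (kernel_size, stride, dilation).
    """
    rf, jump = 1, 1  # jump = spacing of this layer's inputs, in input pixels
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks for illustration only.
plain = [(3, 1, 1)] * 4
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 31
```

With exponentially increasing dilation, each output pixel of the hypothetical stack sees a 31×31 region instead of 9×9, which is the kind of added context the paragraph above relies on.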

3.2 Generator

Our generator architecture builds on those of DiscoGAN and CycleGAN. DiscoGAN uses a standard encoder-decoder architecture (Figure 2, top left). However, its narrow bottleneck layer can lead to output images that do not preserve all the important visual details from the input image. Furthermore, due to the low capacity of the network, the approach remains limited to low resolution images of size 64×64. The CycleGAN architecture seeks to increase capacity over DiscoGAN by using a residual block to learn the image translation function [12]. Residual blocks have been shown to work in extremely deep networks, and they are able to represent low frequency information [43, 2].

However, using residual blocks at a single scale limits the information that can pass through the bottleneck and thus the functions that the network can learn. Our generator includes residual blocks at multiple layers of both the decoder and encoder, allowing the network to learn multi-scale transformations that work on both higher and lower spatial resolution features (Figure 2, bottom left).

Figure 2: (Left) Generators from different unsupervised image translation models. The skip connections and residual blocks are combined via concatenation as opposed to addition. (Right) Our discriminator network architecture is a fully-convolutional segmentation network. Each colored block represents a convolution layer; block labels indicate filter size. In addition to global context from the dilations, the skip connection bypassing the dilated convolution blocks preserves the network’s view of local context.

3.3 Objective Function

Perceptual Cyclic Loss.

As per prior unsupervised image-to-image translation work [20, 24, 27, 46, 41], we use a cyclic loss to learn a bijective mapping between two image domains. However, not all image translation functions can be perfectly bijective, e.g., when one domain has smaller appearance variation, like human face photos vs. anime drawings. When all information in the input image cannot be preserved in the translation, the cyclic loss term should aim to preserve the most important information. Since the network should focus on image attributes of importance to human viewers, we should choose a perceptual loss that emphasizes shape and appearance similarity between the generated and target images.

Defining an explicit shape loss is difficult, as any explicit term requires known image correspondences between domains. These do not exist for our examples and our unsupervised setting. Further, including a more-complex perceptual neural network into the loss calculation imparts a significant computational and memory overhead. While using pretrained image classification networks as a perceptual loss can speed up style transfer [19], these do not work on shape changes as the pretrained networks tend only to capture low-level texture information [2].

Instead, we use multi-scale structural similarity loss (MS-SSIM) [39]. This loss better preserves features visible to humans instead of noisy high frequency information. MS-SSIM can also better cope with shape change since it can recognize geometric differences through area statistics. However, MS-SSIM alone can ignore smaller details, and does not capture color similarity well. Recent work has shown that mixing MS-SSIM with L1 or L2 losses is effective for super resolution and segmentation tasks [44]. Thus, we also add a lightly-weighted L1 loss term, which helps increase the clarity of generated images.

Feature Matching Loss.

To increase the stability of the model, our objective function uses a feature matching loss [36]:

$$\mathcal{L}_{FM}(G, D) = \frac{1}{n-1} \sum_{i=1}^{n-1} \left\| \mathbb{E}_{x \in X}\, f_i(x) - \mathbb{E}_{y \in Y}\, f_i(G(y)) \right\|_2^2 \qquad (1)$$

where $f_i$ represents the raw activation potentials of the $i$-th layer of the discriminator $D$, and $n$ is the number of discriminator layers. This term encourages fake and real samples to produce similar activations in the discriminator, and so encourages the generator to create images that look more similar to the target domain. We have found this loss term to prevent generator mode collapse, to which GANs are often susceptible [20, 36, 38].
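A minimal NumPy sketch of a feature matching term in this spirit: it compares the batch-mean discriminator activations of real and generated samples, layer by layer. The function name and the convention of passing one activation array per matched layer are assumptions made for illustration; which layers are matched is left to the caller.

```python
import numpy as np


def feature_matching_loss(real_feats, fake_feats):
    """Mean over layers of the squared L2 distance between batch-mean activations.

    real_feats, fake_feats: lists with one array per matched discriminator
    layer, each of shape (batch, ...). Hypothetical interface.
    """
    assert len(real_feats) == len(fake_feats) > 0
    total = 0.0
    for f_real, f_fake in zip(real_feats, fake_feats):
        mean_real = f_real.mean(axis=0)  # expectation over real samples
        mean_fake = f_fake.mean(axis=0)  # expectation over generated samples
        total += np.sum((mean_real - mean_fake) ** 2)
    return total / len(real_feats)
```

Because only batch means are compared, the generator is pushed to match activation statistics rather than to fool the discriminator on any single sample.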

Scheduled Loss Normalization (SLN).

In a multi-part loss function, linear weights are often used to normalize the terms with respect to one another, with previous works often optimizing a single set of weights. However, finding appropriately-balanced weights can prove difficult without ground truth. Further, often a single set of weights is inappropriate because the magnitude of the loss terms changes over the course of training. Instead, we create a procedure to periodically renormalize each loss term and so control their relative values. This lets the user intuitively provide weights that sum to 1 to balance the loss terms in the model, without having knowledge of how their magnitudes will change over training.

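Let $\mathcal{L}$ be a loss function, and let $X = (X_1, X_2, \dots)$ be a sequence of batches of training inputs, each $m$ images large, such that $\mathcal{L}(X_t)$ is the training loss at iteration $t$. We compute an exponentially-weighted moving average of the loss:

$$\mathcal{L}_{moavg}(X_t) = (1 - \beta) \sum_{i=1}^{t} \beta^{\,t-i}\, \mathcal{L}(X_i)$$

where $\beta$ is the decay rate. We can renormalize the loss function by dividing it by this moving average. If we do this on every training iteration, however, the loss stays at its normalized average and no training progress is made. Instead, we schedule the loss normalization:

$$SLN(\mathcal{L}, X_t, s) = \begin{cases} \mathcal{L}(X_t) \,/\, (\mathcal{L}_{moavg}(X_t) + \epsilon) & \text{if } t \bmod s = 0 \\ \mathcal{L}(X_t) & \text{otherwise} \end{cases}$$

Here, $s$ is the scheduling parameter such that we apply normalization every $s$ training iterations, and $\epsilon$ is a small constant. For all experiments, we use the same values of $\beta$, $\epsilon$, and $s$.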
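The scheduling idea can be sketched as a small stateful helper: keep an exponentially-weighted moving average of a loss term and divide by it only on every `period`-th iteration. The class name, the `eps` guard against division by zero, and the numbers in the usage lines are assumptions for illustration, not the settings used in the experiments.

```python
class ScheduledLossNormalizer:
    """Divide a loss by its moving average every `period` iterations."""

    def __init__(self, decay, period, eps=1e-10):
        self.decay = decay      # decay rate of the moving average
        self.period = period    # normalize on every `period`-th call
        self.eps = eps          # assumed small constant to avoid division by zero
        self.moving_avg = 0.0
        self.step = 0

    def __call__(self, loss):
        self.step += 1
        # Recursive form of (1 - decay) * sum_i decay^(t - i) * loss_i.
        self.moving_avg = self.decay * self.moving_avg + (1 - self.decay) * loss
        if self.step % self.period == 0:
            # In an autodiff framework the denominator would be held constant.
            return loss / (self.moving_avg + self.eps)
        return loss


# Hypothetical usage: one normalizer per loss term.
normalize_gan = ScheduledLossNormalizer(decay=0.9, period=100)
scaled = normalize_gan(2.5)
```

Keeping one normalizer per term lets user-facing weights express relative importance without tracking how each term's raw magnitude drifts during training.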

One other difference between prior approaches and ours concerns feature normalization: CycleGAN uses instance normalization [15], whereas DiscoGAN uses batch normalization [16]. We found that batch normalization caused excessive over-fitting to the training data, and so we used instance normalization.

Final Objective.

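Our final objective comprises three loss-normalized terms: a standard GAN loss, a feature matching loss, and two cyclic reconstruction losses. Given image domains $X$ and $Y$, let $G: X \to Y$ map from $X$ to $Y$ and $F: Y \to X$ map from $Y$ to $X$. $D_X$ and $D_Y$ denote discriminators for $X$ and $Y$, respectively.

For GAN loss, we combine normal GAN loss terms from Goodfellow et al. [11]:

$$\mathcal{L}_{GAN} = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X)$$

For feature matching loss, we use Equation 1 for each domain:

$$\mathcal{L}_{FM} = \mathcal{L}_{FM}(G, D_Y) + \mathcal{L}_{FM}(F, D_X)$$

For the two cyclic reconstruction losses, we consider structural similarity [39] and an $L_1$ loss. Let $x' = F(G(x))$ and $y' = G(F(y))$ be the cyclically-reconstructed input images. Then:

$$\mathcal{L}_{SS} = \left(1 - \text{MS-SSIM}(x', x)\right) + \left(1 - \text{MS-SSIM}(y', y)\right)$$

$$\mathcal{L}_{L1} = \|x' - x\|_1 + \|y' - y\|_1$$

where we compute MS-SSIM without discorrelation.

Our total objective function with scheduled loss normalization (SLN) is:

$$\mathcal{L}_{total} = \lambda_{GAN}\, SLN(\mathcal{L}_{GAN}) + \lambda_{FM}\, SLN(\mathcal{L}_{FM}) + \lambda_{CYC}\, SLN(\lambda_{SS} \mathcal{L}_{SS} + \lambda_{L1} \mathcal{L}_{L1})$$

with $\lambda_{GAN} + \lambda_{FM} + \lambda_{CYC} = 1$, $\lambda_{SS} + \lambda_{L1} = 1$, and all coefficients $\geq 0$. The weights $\lambda_{GAN}$, $\lambda_{FM}$, and $\lambda_{CYC}$, and $\lambda_{SS}$ and $\lambda_{L1}$, are held fixed. Empirically, these settings helped to reduce mode collapse and worked across all datasets.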

3.4 Training

The network architecture both consumes and outputs 128×128 images. All models trained within 3.2 days on a single NVIDIA Titan X GPU with a batch size of 16. The number of generator updates per step varied between 1 and 2 for each dataset depending on the dataset difficulty. Each generator update used different data from the corresponding discriminator update.

We train for 50–400 epochs depending on the domain, with 1,000 batches per epoch. Overall, this resulted in 400,000 generator updates over the course of training for difficult datasets (e.g., cat to dog) and 200,000 generator updates for easier datasets (e.g., human to doll). We label a dataset as hard or easy empirically, according to how difficult it is to generate images in the domain.

Data Augmentation.

To help mitigate dataset overfitting, the following image augmentations were applied to each dataset: rescaling to 1.1× the input size, random horizontal flipping of the image, random rotation of up to 30 degrees in either direction, random rescaling, and random cropping of the image.
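A partial NumPy sketch of such a pipeline, covering only the upscale, horizontal flip, and random crop steps; random rotation and random rescaling are omitted to keep the example dependency-free. The function name, nearest-neighbour resampling, and the 50% flip probability are assumptions, not details given in the text.

```python
import numpy as np


def augment(img, rng, scale=1.1):
    """Upscale by `scale`, randomly flip horizontally, crop back to input size.

    img: array of shape (H, W) or (H, W, C). rng: numpy random Generator.
    """
    h, w = img.shape[:2]
    big_h, big_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour upscale (assumed; any resampling filter would do).
    rows = np.arange(big_h) * h // big_h
    cols = np.arange(big_w) * w // big_w
    big = img[rows][:, cols]
    if rng.random() < 0.5:
        big = big[:, ::-1]  # horizontal flip
    top = rng.integers(0, big_h - h + 1)
    left = rng.integers(0, big_w - w + 1)
    return big[top:top + h, left:left + w]


rng = np.random.default_rng(0)
sample = augment(np.zeros((128, 128, 3)), rng)
```

Upscaling before cropping means the random crop acts as a small random translation while the network still receives images at its native input size.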

4 Experiments

Figure 3: Toy Dataset (128×128). Left: $X$ (regular) instance; a regular polygon with deformed dot matrix overlay. Right: $Y$ (deformed) instance; a deformed polygon and dot lattice. The dot lattice provides information from across the image about the true deformation.

4.1 Toy Problem: Learning 2D Dot and Polygon Deformations

We created a challenging toy problem to evaluate the ability of our network design to learn shape- and texture-consistent deformation. We define two domains: the regular polygon domain $X$ and its deformed equivalent $Y$ (Figure 3). Each example $X_{s,h,d} \in X$ contains a centered regular polygon with $s$ sides, plus a deformed matrix of dots overlaid. The dot matrix is computed by taking a unit dot grid and transforming it via $h$, a Gaussian random normal 2×2 matrix, and a displacement vector $d$, a Gaussian normal vector in $\mathbb{R}^2$. The corresponding domain equivalent in $Y$ is $Y_{s,h,d}$, with instead the polygon transformed by $h$ and $d$ and the dot matrix remaining regular. This construction forms a bijection from $X$ to $Y$, and so the translation problem is well-posed.

Learning a mapping from $X$ to $Y$ requires the network to use the large-scale cues present in the dot matrix to successfully deform the polygon, as local patches with a fixed image location cannot overcome the added displacement $d$. Table 2 shows that DiscoGAN is unable to learn to map between either domain, and produces an output that is close to the mean of the dataset (off-white). CycleGAN is able to learn only local deformation, which produces hue shifts towards the blue of the polygon when mapping from regular to deformed spaces, and which in most cases produces an undeformed dot matrix when mapping from deformed to regular spaces. In contrast, our approach is significantly more successful at learning the deformation as the dilated discriminator is able to incorporate information from across the image.
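The dot-matrix construction can be sketched as follows: a unit grid of dot centres is mapped through a random 2×2 matrix and shifted by a random displacement, here called `h` and `d`. The grid size and the use of standard-normal entries are assumptions; the text does not give the mean or variance of the Gaussians.

```python
import numpy as np


def deformed_dot_grid(rng, n=8):
    """Return deformed dot centres of shape (n*n, 2), plus the transform used."""
    ticks = np.linspace(0.0, 1.0, n)
    grid = np.stack(np.meshgrid(ticks, ticks), axis=-1).reshape(-1, 2)
    h = rng.normal(size=(2, 2))  # assumed standard-normal 2x2 matrix
    d = rng.normal(size=2)       # assumed standard-normal displacement
    return grid @ h.T + d, h, d


points, h, d = deformed_dot_grid(np.random.default_rng(0))
```

Since every dot is moved by the same `h` and `d`, any sufficiently large set of dots determines the transform, which is why a discriminator with wide spatial context can check whether the polygon was deformed consistently.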

[Image grids; top: Regular to Deformed, bottom: Deformed to Regular; columns in each: Input, CycleGAN, DiscoGAN, Ours]
Table 2: Toy Dataset. When trying to estimate complex deformation, DiscoGAN collapses to the mean value of the dataset (all white). CycleGAN is able to approximate the deformation of the polygon but not the dot lattice (right-hand side). Our approach is able to learn both under strong deformation.

Quantitative Comparison.

As our output is a highly-deformed image, we estimate the learned transform parameters by sampling. We compute a Hausdorff distance between 500 point samples on the ground truth polygon and on the image of the generated polygon after translation: for finite sets of points $P$ and $Q$, $d_H(P, Q) = \max\{\max_{p \in P} \min_{q \in Q} \|p - q\|,\ \max_{q \in Q} \min_{p \in P} \|p - q\|\}$. We hand annotate 220 generated polygon boundaries for our network, sampled uniformly at random along the boundary. Samples exist in a unit square with bottom left corner at (0, 0).
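For reference, a brute-force NumPy computation of the Hausdorff distance between two finite point sets. This sketch uses the standard symmetric definition; whether the evaluation above used the symmetric or a one-sided variant is an assumption here.

```python
import numpy as np


def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (N, 2) and b (M, 2)."""
    diff = a[:, None, :] - b[None, :, :]        # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # (N, M) Euclidean distances
    a_to_b = dist.min(axis=1).max()             # farthest a-point from b
    b_to_a = dist.min(axis=0).max()             # farthest b-point from a
    return max(a_to_b, b_to_a)


square = np.array([[0.0, 0.0], [1.0, 0.0]])
shifted = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(hausdorff(square, shifted))  # 1.0
```

With 500 samples per polygon the pairwise matrix has 250,000 entries, so the brute-force approach is cheap at this scale.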

First, DiscoGAN fails to generate polygons at all, despite being able to reconstruct the original image. Second, for ‘regular to deformed’, CycleGAN fails to produce a polygon, whereas our approach produces average Hausdorff distance of . Third, for ‘deformed to regular’, CycleGAN produces a polygon with distance of , whereas our approach has distance of . In the true dataset, note that regular polygons are centered, but CycleGAN only constructs polygons at the position of the original distorted polygon. Our network constructs a regular polygon at the center of the image as desired.

4.2 Real-world Datasets

We evaluate our GANimorph system by learning mappings between several image datasets (Figure 4). For human faces, we use the aligned version of the CelebFaces Attribute dataset [29], which contains 202,599 images.

Figure 4: Face dataset examples, left to right: CelebA, Danbooru, Flickr Cat, Columbia Dog, Flickr Dolls, and Pets in the Wild.

Anime Faces.

Previous works have noted that anime images are challenging to use with existing style transfer methods, since translating between a photoreal face and an anime-style face involves both shape and appearance variation. To test on anime faces, we create a large 966,777 image anime dataset crowdsourced from Danbooru [1]. The Danbooru dataset has a wide variety of styles from super-deformed chibi-style faces, to realistically-proportioned faces, to rough sketches. Since traditional face detectors yield poor results on drawn datasets, we ran the Animeface filter [32] on both datasets.

When translating humans to anime, we see an improvement in our approach for head pose and accessories such as glasses (Table 3, third row, right), plus a larger degree of shape deformation such as reduced face vertical height. The final line of each group represents a particularly challenging example.

[Image grids; panels: Photoreal to Anime, Anime to Photoreal, Human to Doll Face, Doll Face to Human; columns in each: Input, CycleGAN, DiscoGAN, Ours]
Table 3: GANimorph can translate shape and style changes while retaining many input attributes such as hair color, pose, glasses, headgear, and background. CycleGAN and DiscoGAN are less successful both with just shape (human to doll) and with shape and style changes (human to anime).

Doll Faces.

To demonstrate that our algorithm can handle shape deformations with similar photographic appearance, we translate between the two domains of doll and human face photographs. Similar to Morsita et al. [31], we extracted 13,336 images from the Flickr100m dataset [33] using specific doll manufacturers as keywords. Then, we extract local binary patterns [34] using OpenCV [4], and use the Animeface filter for facial alignment [32]. Stylizing human faces as dolls provides an informative test case: both domains have similar photorealistic appearance, so the translation task focuses on shape more than texture.

Table 3, bottom, shows that our architecture handles local deformation and global shape change better than CycleGAN and DiscoGAN, while preserving local texture similarity to the target domain. The second to last row on the right hand side shows that, with other networks, either the shape is malformed (DiscoGAN), or the shape shows artifacts from the original image or unnatural skin texture (CycleGAN). Our method matches skintones from the CelebA dataset, while capturing the overall facial structure and hair color of the doll. For a more difficult doll to human example in the bottom right-hand corner, while our transformation is not realistic, our method still creates more shape change than existing networks.

Pet Faces.

To test whether our network could translate between animal domains, we constructed a dataset of 47,906 cat faces from the Flickr100m [33] dataset using OpenCV’s [4] Haar cascade cat face detector. This detector produces false positives, and occasionally detects a human face as a cat face. We also use the 8,223-image Columbia Dog dataset [26], which comes with curated bounding boxes around the dog faces that reduce the number of noisy results. Translation results are shown in Table 4.

[Image grid; columns: Input, CycleGAN, DiscoGAN, Ours]
Table 4: Pet Faces: GANimorph can map poses across large variations in appearance, and does not incorrectly replace local texture without replacing the surrounding context.

Pets in the Wild.

To demonstrate our network on unaligned data, we evaluate on the Kaggle cat and dog dataset from Microsoft Research [9]. This dataset contains 12,500 images of each species. The intended purpose of the dataset is to classify cat images from dog images, and so it contains many animal breeds at varying scales, lighting conditions, poses, backgrounds, and occlusion factors.

When translating between cats and dogs (Table 5), the network is able to change both the local features such as the addition and removal of fur and whiskers, plus the larger shape deformation required to fool the discriminator, such as growing a snout. Most errors in this domain come from the generator failing to identify an animal from the background, such as forgetting the rear or tail of the animal. Sometimes the generator may fail to identify the animal at all.

We also translate between humans and cats. Table 6 demonstrates how our architecture handles large scale translation with these two variable data distributions. Our failure cases are approximately the same as that of the cats to dogs translation, with some promising results. Overall, we translate a surprising degree of shape deformation even when we might not expect this to be possible.

Supplemental Datasets.

We also tested our approach on existing datasets used in the CycleGAN paper (maps to satellite imagery, horses to zebras, and apples to oranges). These mappings focus on appearance transfer and require less shape deformation; please see our supplemental material to verify that our approach can handle this setting as well.


[Image grid; columns: Input, CycleGAN, DiscoGAN, Ours]
Table 5: Pets in the Wild: Between dogs and cats, our approach is able to generate shape transforms across pose and appearance variation.
[Image grid; columns: Input, Output]
Table 6: Human and Pet Faces: As a challenge, we try to map cats to humans and humans to cats. Pose is reliably translated; semantic appearance such as hair color is sometimes translated; but some inputs still fail (bottom left).
| Networks | Cat→Dog: Cat (%) | Cat→Dog: Dog (%) | Cat→Dog: Person (%) | Cat→Dog: Other (%) | Dog→Cat: Cat (%) | Dog→Cat: Dog (%) | Dog→Cat: Person (%) | Dog→Cat: Other (%) |
|---|---|---|---|---|---|---|---|---|
| Initial Domain | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 98.49 | 1.51 | 0.00 |
| CycleGAN | 99.99 | 0.01 | 0.00 | 0.00 | 2.67 | 97.27 | 0.06 | 0.00 |
| DiscoGAN | 24.37 | 75.38 | 0.25 | 0.00 | 96.95 | 0.00 | 2.71 | 0.34 |
| Ours w/ L1 | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 |
| Ours w/o feature match loss | 5.03 | 93.64 | 0.81 | 0.53 | 85.62 | 14.15 | 0.00 | 0.23 |
| Ours w/ fully conn. discrim. | 6.11 | 93.60 | 0.29 | 0.00 | 91.41 | 8.45 | 0.03 | 0.10 |
| Ours w/ patch discrim. | 46.02 | 42.90 | 0.05 | 11.03 | 91.77 | 8.22 | 0.00 | 0.01 |
| Ours (dilated discrim.) | 1.00 | 98.57 | 0.41 | 0.02 | 100.00 | 0.00 | 0.00 | 0.00 |

Table 7: Percentage of pixels classified in translated images via CycleGAN, DiscoGAN, and our algorithm (with design choices). The target class is Dog for Cat→Dog and Cat for Dog→Cat.
[Image grid; columns: Input, DiscoGAN, CycleGAN, Ours]
Table 8: Example segmentation masks from DeepLabV3 for Table 7 for Cat→Dog. Red denotes the cat class, and blue denotes the intended dog class.
[Image grid; columns: Input, No FM Loss, L1 Loss, Patch Discrim, FC Discrim, Ours]
Table 9: In qualitative comparisons, GANimorph outperforms all of its ablated versions. For instance, our approach better resolves fine details (e.g., second row, cat eyes) while also better translating the overall shape (e.g., last row, cat nose and ears).

4.3 Quantitative Study

To quantify GANimorph’s translation ability, we consider classification-based metrics to detect class change, e.g., whether a cat was successfully translated into a dog. Since there is no per pixel ground truth in this task for any real-world datasets, we cannot use Fully Convolution Score. Using Inception Score [36] is uninformative since simply outputting the original image would score highly.

Further, similar to adversarial examples, CycleGAN is able to convince many classification networks that the image is translated even though to a human the image appears untranslated: all CycleGAN results from Table 4 convince both ResNet50 [12] and the traditional segmentation network of Zheng et al. [45], even though the image is unsuccessfully translated.

However, semantic segmentation networks that model multi-scale properties can distinguish CycleGAN’s ‘adversarial examples’ from true translations, such as DeepLabV3 [5] (trained on PascalVOC 2012 and using dilated convolutions itself). As such, we run each test image through the DeepLabV3 network to generate a segmentation mask. Then, we compute the percent of non-background-labeled pixels per class, and average across the test set (Table 7). Our approach is able to more fully translate the image in the eyes of the classification network, with images also appearing translated to a human (Table 8).
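The per-class statistic can be sketched as below, given segmentation masks that already hold PascalVOC class indices (0 for background, 8 for cat, 12 for dog, 15 for person). Running DeepLabV3 itself is out of scope here. Skipping images with no foreground pixels is an assumption; the text does not say how such images were handled.

```python
import numpy as np

VOC_CLASSES = {"cat": 8, "dog": 12, "person": 15}  # PascalVOC label indices
BACKGROUND = 0


def class_percentages(mask):
    """Percent of non-background pixels per class for one label mask."""
    foreground = mask[mask != BACKGROUND]
    if foreground.size == 0:
        return None  # assumed: images without foreground are skipped
    out = {name: 100.0 * float(np.mean(foreground == idx))
           for name, idx in VOC_CLASSES.items()}
    out["other"] = 100.0 - sum(out.values())
    return out


def average_percentages(masks):
    """Average the per-image percentages across a test set."""
    rows = [p for p in map(class_percentages, masks) if p is not None]
    return {key: float(np.mean([r[key] for r in rows])) for key in rows[0]}


example = class_percentages(np.array([[0, 8], [12, 12]]))
```

For a cat-to-dog translation, a high "dog" percentage and a low "cat" percentage indicate that the segmentation network sees the output as a dog rather than a retextured cat.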

4.4 Ablation Study

We use these quantitative settings for an ablation study (Table 7). First, we removed MS-SSIM to leave only L1 (MS-SSIM weight $\lambda_{SS} = 0$, Section 3.3), which causes our network to mode collapse. Next, we removed feature match loss, which decreases both our segmentation consistency and the stability of the network. Then, we replaced our dilated discriminator with a patch discriminator. However, the patch discriminator cannot use global context, and so the network confuses facial layouts. Finally, we replaced our dilated discriminator with a fully connected discriminator. We see that our generator architecture and loss function allow our network to outperform DiscoGAN even with the same type of discriminator (fully connected).

Qualitative ablation study results are shown in Table 9. The patch-based discriminator translates texture well, but fails to create globally coherent images. Decreasing the information flow by using a fully connected discriminator or removing feature match leads to better results. Maximizing the information flow ultimately leads to the best results (last column). Using L1 instead of a perceptual cyclic loss term leads to mode collapse.

5 Discussion

There exists a trade-off in the relative weighting of the cyclic loss. A higher cyclic loss term weight will prevent significant shape change and weaken the generator’s ability to adapt to the discriminator. Setting it too low will cause the collapse of the network and prevent any meaningful mapping from existing between domains. For instance, the network can easily hallucinate objects in the other domain if the reconstruction loss is too low. Likewise, setting it too high will prevent the network from deforming the shape properly. As such, an architecture that could modify the weight of this term at test time would prove valuable for user control over how much deformation to allow.

One counter-intuitive result we discovered is that in domains with little variety, the mappings can lose semantic meaning (see supplemental material). One example of a failed mapping was from celebA to bitmoji faces [37]. Many attributes were lost, including pose, and the mapping fell back to pseudo-steganographic encoding of the faces [7]. For example, background information would be encoded in color gradients of hair styles, and minor variations in the width of the eyes were used similarly. As such, the cyclic loss limits the ability of the network to abstract relevant details. Approaches such as relying on mapping the variance within each dataset, similar to Benaim et al. [3], may prove an effective means of ensuring the variance in either domain is maintained. We found that this term over-constrained the amount of shape change in the target domain; however, this may be worth further investigation.

Finally, trying to learn each domain simultaneously may also prove an effective way to increase the accuracy of image translation. Doing so allows the discriminator(s) and generator to learn how to better determine and transform regions of interest for either network. Better results might be obtained by mapping between multiple domains using parameter-efficient networks (e.g., StarGAN [6]).

6 Conclusion

We have demonstrated that reframing the discriminator’s role as a semantic segmenter allows greater shape change with fewer image artifacts. Further, training with a perceptual cyclic loss and adding explicit multi-scale features both help the network to translate more complex shape deformation. Finally, training techniques such as feature matching loss and scheduled loss normalization can increase the performance of translation networks. In summary, our architecture and training changes allow the network to go beyond simple texture transfer and improve shape deformation. This lets our GANimorph system perform challenging translations such as from human to anime and feline faces, and from cats to dogs. The source code to our GANimorph system and all datasets are online:


Kwang In Kim thanks RCUK EP/M023281/1, and Aaron Gokaslan and James Tompkin thank NVIDIA Corporation.


  • [1] Anonymous, Branwen, G., Gokaslan, A.: Danbooru2017: A large-scale crowdsourced and tagged anime illustration dataset (April 2017),
  • [2] Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Computer Vision and Pattern Recognition (2017)
  • [3] Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. In: Advances in Neural Information Processing Systems (2017)
  • [4] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
  • [5] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
  • [6] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Computer Vision and Pattern Recognition (2018)
  • [7] Chu, C., Zhmoginov, A., Sandler, M.: CycleGAN: a master of steganography. arXiv preprint arXiv:1712.02950 (2017)
  • [8] Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. International Conference on Learning Representations (2017)
  • [9] Elson, J., Douceur, J., Howell, J., Saul, J.: Asirra: A captcha that exploits interest-aligned manual image categorization. In: Proceedings of the 14th ACM Conference on Computer and Communications Security. CCS ’07 (2007)
  • [10] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Computer Vision and Pattern Recognition (2016)
  • [11] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (2016)
  • [13] He, M., Liao, J., Yuan, L., Sander, P.V.: Neural color transfer between images. arXiv preprint arXiv:1710.00756 (2017)
  • [14] Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques. ACM (2001)
  • [15] Huang, X., Belongie, S.J.: Arbitrary style transfer in real-time with adaptive instance normalization. In: International Conference on Computer Vision (2017)
  • [16] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (2015)
  • [17] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Computer Vision and Pattern Recognition (2017)
  • [18] Jin, Y., Zhang, J., Li, M., Tian, Y., Zhu, H., Fang, Z.: Towards the automatic anime characters creation with generative adversarial networks. arXiv preprint arXiv:1708.05509 (2017)
  • [19] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)
  • [20] Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: International Conference on Machine Learning (2017)
  • [21] Kingma, D.P., Ba, J.: ADAM: A Method for Stochastic Optimization. In: International Conference on Learning Representations (2014)
  • [22] Laffont, P.Y., Ren, Z., Tao, X., Qian, C., Hays, J.: Transient attributes for high-level understanding and editing of outdoor scenes. ACM Trans. Graph. (TOG) (2014)
  • [23] Li, C., Wand, M.: Precomputed real-time texture synthesis with markovian generative adversarial networks. In: European Conference on Computer Vision (2016)
  • [24] Liang, X., Zhang, H., Xing, E.P.: Generative semantic manipulation with contrasting gan. arXiv preprint arXiv:1708.00315 (2017)
  • [25] Liao, J., Yao, Y., Yuan, L., Hua, G., Kang, S.B.: Visual attribute transfer through deep image analogy. ACM Trans. Graph. (2017)
  • [26] Liu, J., Kanazawa, A., Jacobs, D., Belhumeur, P.: Dog breed classification using part localization. In: European Conference on Computer Vision (2012)
  • [27] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems (2017)
  • [28] Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: Advances in Neural Information Processing Systems (2016)
  • [29] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: International Conference on Computer Vision (2015)
  • [30] Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In: Conference on Computer Vision and Pattern Recognition (2018)
  • [31] Morishita, M., Ueno, M., Isahara, H.: Classification of doll image dataset based on human experts and computational methods: A comparative analysis. In: International Conference On Advanced Informatics: Concepts, Theory And Application (ICAICTA) (2016)
  • [32] Nagadomi: Local binary pattern cascade—anime face. (2017)
  • [33] Ni, K., Pearce, R., Boakye, K., Van Essen, B., Borth, D., Chen, B., Wang, E.: Large-scale deep learning on the YFCC100M dataset. arXiv:1502.03409 (2015)
  • [34] Ojala, T., Pietikainen, M., Harwood, D.: Performance evaluation of texture measures with classification based on kullback discrimination of distributions. In: International Conference on Pattern Recognition (1994)
  • [35] Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ArXiv e-prints (Nov 2015)
  • [36] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems (2016)
  • [37] Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200 (2016)
  • [38] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In: Computer Vision and Pattern Recognition (2018)
  • [39] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. Trans. Image Processing (2004)
  • [40] Wolf, L., Taigman, Y., Polyak, A.: Unsupervised creation of parameterized avatars. In: International Conference on Computer Vision (2017)
  • [41] Yi, Z., Zhang, H.R., Tan, P., Gong, M.: DualGAN: Unsupervised dual learning for image-to-image translation. In: International Conference on Computer Vision (2017)
  • [42] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (2015)
  • [43] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision (2014)
  • [44] Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging (2017)
  • [45] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: International Conference on Computer Vision (2015)
  • [46] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: International Conference on Computer Vision (2017)

A Appendix

A.1 Optimization and Loss Parameters

For convenience, we list all optimization and loss parameters in Table 10.

Optimization term Value
Learning rate 2e-4
Minibatch size 16
Residual Blocks 3
Residual Merge Op. Concat
Optimizer ADAM [21]
Momentum 0.95
(ADAM) 0.999
() 0.99
Hyperparameter Value
Table 10: Left: Optimization parameter values. Right: Loss hyperparameter values.
Dataset Iterations Discrim. every
Anime (Danbooru) 200,000 2
Anime (Getchu) 200,000 1
Doll 100,000 2
Cat/dog faces 150,000 2
Cat/dog bodies 300,000 2
Toy dataset 150,000 1
CycleGAN datasets 200,000 1
Table 11: Number of iterations per dataset, with how often the discriminator was updated in iterations.

A.2 Network

We use 64 filters for the first layer of the generator and 128 filters for the first layer of the discriminator. Then, for subsequent layers, we double the number of filters for every downsampling stride of two (main paper, Figure 2). The stride is two for all downsampling layers. We do not increase the number of filters for dilated convolutions. Likewise, we decrease the number of filters for each transposed convolution by a factor of two. We also linearly decay the learning rate from 150k steps onwards, to approach 0 at 300k steps. Table 11 lists the number of update steps computed per dataset.
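The filter progression and learning-rate schedule described above can be sketched in a few lines. This is an illustrative reading rather than the released implementation: the function names and the number of downsampling layers passed to `encoder_filters` are hypothetical, while the base learning rate of 2e-4 comes from Table 10 and the 150k and 300k step boundaries come from the text.

```python
def encoder_filters(first_layer_filters, num_downsamples):
    """Filter count of the first layer and of each following stride-2 layer.

    The count doubles at every downsampling layer; dilated convolutions keep
    the count unchanged and so are not listed here.
    """
    return [first_layer_filters * 2 ** i for i in range(num_downsamples + 1)]


def decoder_filters(bottleneck_filters, num_upsamples):
    """Filter count after each transposed convolution, halving each time."""
    return [bottleneck_filters // 2 ** (i + 1) for i in range(num_upsamples)]


def linear_decay_lr(step, base_lr=2e-4, decay_start=150_000, decay_end=300_000):
    """Constant rate up to decay_start, then linear decay to 0 at decay_end."""
    if step <= decay_start:
        return base_lr
    if step >= decay_end:
        return 0.0
    remaining = (decay_end - step) / (decay_end - decay_start)
    return base_lr * remaining
```

For example, `encoder_filters(64, 2)` gives the generator-side progression for an assumed two downsampling layers, and `linear_decay_lr(225_000)` returns half of the base rate, since that step lies midway through the decay window.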

A.3 Existing Dataset Comparison

We compute comparisons with our method on existing unsupervised image-to-image translation datasets. We trained CycleGAN [46] and our architecture on the same datasets for the same number of iterations.

Satellite to Google Maps. In Table 12, we compare results for translating satellite imagery to Google Maps and vice versa. The Google Maps dataset carries less information overall than the satellite dataset, so this task requires the network to ‘encode detail in plain sight’ when translating in the Google Maps to satellite direction [7]. Generally, our network produces comparable results, with some differences in ambiguous cases. This provides evidence that our network is also able to solve tasks with very little shape deformation.

Apples to Oranges. Table 13 shows the results. With less shape change between elements in the domain, our approach produces comparable results to CycleGAN.

Horse to Zebra. Table 14 shows the results in both directions. In general, the results between the two techniques are comparable. One improvement that our method is able to make is to better maintain global orientation of stripes, e.g., in column 1, rows 2 and 3, we see that our method places horizontal stripes on the rear of the zebra, which rotate over the body of the animal to vertical stripes on the neck. Overall, this is still a hard problem, and many examples show artifacts for both techniques.

Getchu to CelebA. First, we collected the dataset from [18], which consists of professionally-drawn visual novel characters from 1995–2017. Table 15 shows the results, including failure cases. When translating from anime to CelebA, CycleGAN often has trouble substantially changing the shape of the character’s face, often simply attempting to replace anime shading with skin tones (see row 1 column 1). Our network is better able to map both pose and facial structure, while also blending skin tone more appropriately. One limitation is that the Getchu dataset has different attribute variance, e.g., more pink and purple hair, and less diversity in skin colors. This restricts the ability of the methods to transfer attributes successfully.

Limitation: Human to Bitmoji. In Table 16, we show a significant failure case of our network in the Human to Bitmoji task. Our network mode collapses and often fails to properly match the pose of the target distribution. This is interesting because our reconstructions for these samples are almost identical to the input image, implying that our network is able to encode the entire database into this single sample using almost imperceptible differences in the image [7]. The failure is likely caused by the simplicity of the bitmoji domain, e.g., uniform skin color. The anime and other domains contain enough information to prevent the network from learning such a steganographic encoding.

Input CycleGAN Ours
Table 12: Satellite to Google Maps. Generally, our network produces comparable results, with some differences in ambiguous cases. This provides evidence that our network is also able to solve tasks with very little shape deformation.
Input CycleGAN Ours
Table 13: On Apples to Oranges, results between the two techniques are approximately comparable, though the shapes of the two fruits are already similar.
Input CycleGAN Ours
Table 14: Our method is able to transfer local zebra texture onto horses comparably to CycleGAN. However, our method is better able to maintain global stripe orientation, e.g., horizontal stripes on the rump, rotating to vertical stripes on the neck. When removing texture to turn a zebra into a horse, our method performs comparably.
Input CycleGAN Ours
Table 15: Anime to Human (and vice versa) on the Getchu dataset. Our method is more successful at making shape changes, e.g., shrinking the head when translating to anime, or growing it when translating to human. Different attribute variances between the two datasets, e.g., hair or skin color, sometimes prevent attribute transfer.
Input CycleGAN Ours Reconstr.
Table 16: Bitmoji to Human: Our network mode collapses down to a single image (right-hand side), even though our reconstructions for these samples are almost identical to the input image. This implies that our network is able to encode the entire database into this single sample using almost imperceptible differences in the image [7]. The failure is likely caused by the simplicity of the bitmoji domain, e.g., uniform skin color.