SPA-GAN: Spatial Attention GAN for Image-to-Image Translation

08/19/2019
by Hajar Emami, et al.
Wayne State University

Image-to-image translation aims to learn a mapping between images from a source domain and images from a target domain. In this paper, we introduce the attention mechanism directly into the generative adversarial network (GAN) architecture and propose a novel spatial attention GAN model (SPA-GAN) for image-to-image translation tasks. SPA-GAN computes the attention in its discriminator and uses it to help the generator focus more on the most discriminative regions between the source and target domains, leading to more realistic output images. We also find it helpful to introduce an additional feature map loss in SPA-GAN training to preserve domain-specific features during translation. Compared with existing attention-guided GAN models, SPA-GAN is a lightweight model that does not need additional attention networks or supervision. Qualitative and quantitative comparisons against state-of-the-art methods on benchmark datasets demonstrate the superior performance of SPA-GAN.



I Introduction

Image-to-image translation aims to learn a mapping between images from a source domain and images from a target domain, and has many applications including image colorization, generating semantic labels from images [10], image super resolution [14, 3] and domain adaptation [22]. Many image-to-image translation approaches require supervised learning settings in which pairs of corresponding source and target images are available. However, acquiring paired training data is expensive or sometimes impossible for diverse applications. This has motivated approaches in unsupervised settings, in which the source and target image sets are completely independent with no paired examples between the two domains. To this end, unsupervised approaches [40, 33] remove the need for paired training samples by introducing a cycle consistency loss that forces the two mappings to be consistent with each other. In general, an image-to-image translation method needs to detect areas of interest in the input image and learn how to translate the detected areas into the target domain. In an unsupervised setting with no paired images between the two domains, one must pay attention to the areas of the image that are subject to transfer. Locating areas of interest is even more important in applications where the translation should be applied only to a particular type of object rather than the whole image. For example, to transfer an input "orange" image to the target domain "apple" (see the example in Fig. 1), one needs to first locate the oranges in the input image and then transfer them to apples.

In [40, 33], a generative network is employed to detect areas of interest and translate between the two domains. More recently, the attention mechanism has been introduced in image-to-image translation to decompose the generative network into two separate networks: an attention network that predicts regions of interest and a transformation network that transforms the image from one domain to another. Specifically, additional attention networks are added to the CycleGAN framework to keep the background of the input image unchanged while translating the foreground [19, 5]. For example, Chen et al. [5] used the segmentation annotations of input images as extra supervision to train an attention network. The attention maps are then applied to the output of the transformation network so that the background of the input image is used as the output background, improving the overall image translation quality.

Fig. 1:

A comparison of CycleGAN (a) and SPA-GAN (b) architectures. In SPA-GAN, in addition to classifying the input images, the discriminator also generates spatial attention maps, which are fed to the generator and help it focus on the most discriminative object parts. In addition, the feature map loss is shown in the dashed blocks (b-1) and (b-2), which is the difference between the feature maps of the (attended) real and generated images computed in the first layer of the decoder. The feature map loss is used to preserve domain specific features in image translation.

In this paper, we introduce the attention mechanism directly into the generative adversarial network (GAN) architecture and propose a novel spatial attention GAN model (SPA-GAN) for image-to-image translation. SPA-GAN computes the attention in its discriminator and uses it to help the generator focus more on the most discriminative regions between the source and target domains. Specifically, the attention from the discriminator is defined as the spatial maps [34] showing the areas that the discriminator focuses on when classifying an input image as real or fake. The extracted spatial attention maps are fed back to the generator so that higher weights are given to the discriminative areas when computing the generator loss. In the unsupervised setting, we also find it helpful to introduce an additional feature map loss to preserve domain-specific features during translation. That is, in SPA-GAN's generative network, we constrain the feature maps obtained in the first layer of the decoder [17] to match the identified regions of interest from both real and generated images so that the generated images are more realistic. The major contributions of our work are summarized as follows:

  • Different from [19, 5], where attention is employed to separate foreground and background, we use attention in SPA-GAN as a mechanism for transferring knowledge from the discriminator back to the generator. The discriminator helps the generator explicitly attend to the discriminative regions between the two domains, leading to more realistic output images. Based on the proposed attention mechanism, we use a modified cycle consistency loss during SPA-GAN training and also introduce a generator feature map loss to preserve domain-specific features.

  • Earlier approaches to attention-guided image-to-image translation [19, 5] require loading generators, discriminators and additional attention networks into GPU memory all at once, which may cause computational and memory limitations. In comparison, SPA-GAN is a lightweight model that does not need additional attention networks or supervision during training.

  • SPA-GAN demonstrates the effectiveness of directly incorporating the attention mechanism into GAN models. Through extensive experiments, we show that, both qualitatively and quantitatively, SPA-GAN significantly outperforms other state-of-the-art image-to-image translation methods.

The remainder of this paper is organized as follows: Section II contains a brief review of the literature surrounding image-to-image translation and attention learning. In Section III, we introduce our SPA-GAN model in detail. In Section IV, we present our image-to-image translation results on the benchmark datasets. Finally, we conclude in Section V.

II Related Work

II-A Image-to-Image Translation

Recently, GAN-based methods have been widely used in image-to-image translation and have produced appealing results. In pix2pix [10], a conditional GAN (cGAN) was used to learn a mapping from an input image to an output image; the cGAN learns a conditional generative model using paired images from the source and target domains. CycleGAN was proposed by Zhu et al. [40] for image-to-image translation tasks in the absence of paired examples. It learns a mapping from a source domain to a target domain (and vice versa) by introducing two cycle consistency losses. Similarly, DiscoGAN [11] and DualGAN [33] use an unsupervised learning approach for image-to-image translation based on unpaired data, but with different loss functions. Liu et al. [16] proposed the unsupervised image-to-image translation network (UNIT) based on Coupled GANs [17] and a shared-latent space assumption, which assumes that a pair of corresponding images from different domains can be mapped to the same latent representation in a shared latent space. Some existing image-to-image translation methods assume that the latent space of images can be decomposed into a content space and a style space, which enables the generation of multi-modal outputs. Huang et al. [9] proposed the multimodal unsupervised image-to-image translation framework (MUNIT), which assumes two latent representations for style and content. To translate an image to another domain, its content code is combined with different style representations sampled from the target domain. Similarly, Lee et al. [15] introduced diverse image-to-image translation (DRIT) based on disentangled representations of unpaired data, decomposing the latent space into two spaces: a domain-invariant content space capturing shared information and a domain-specific attribute space producing diverse outputs given the same content. Zhou et al. [39] proposed BranchGAN to transfer an image of one domain to the corresponding domain by exploiting the shared distribution of the two domains with the same encoder. Recently, HarmonicGAN, proposed by Zhang et al. [37] for unpaired image-to-image translation, introduces spatial smoothing to enforce consistent mappings during translation. InstaGAN [21] utilizes object segmentation masks as extra supervision to perform multi-instance domain-to-domain image translation, and preserves the background by introducing a context preserving loss. This method depends on semantic segmentation labels (i.e., pixel-wise annotation) and has limited applicability to new applications where pixel-level annotation is not available.

II-B Attention Learning in Deep Networks

Inspired by the human attention mechanism [24], attention-based models have gained popularity in a variety of computer vision and machine learning tasks including neural machine translation [1], image classification [20, 27], image segmentation [4], image and video captioning [29, 32] and visual question answering [31]. Attention improves the performance of all these tasks by encouraging the model to focus on the most relevant parts of the input. Zhou et al. [38] produce attention maps for each class by removing the top average-pooling layer, improving object localization accuracy. Zagoruyko et al. [34] improve the performance of a student convolutional neural network (CNN) by transferring the attention from a teacher CNN. Their scheme computes the attention map of a CNN based on the assumption that the absolute value of a hidden neuron's activation reflects the importance of that neuron for classifying a given input. Mnih et al. [20] propose a visual attention model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at a high resolution. Kuen et al. [13] propose a recurrent attentional convolutional-deconvolution network for saliency detection. This supervised model uses an iterative approach to attend to selected image sub-regions for saliency refinement in a progressive way. Wang et al. [27] propose a residual attention network for image classification with a trunk-and-mask attention mechanism.

Recent studies show that incorporating attention learning in GAN models leads to more realistic images in both image generation and image-to-image translation tasks. For example, Zhang et al. [35] propose a self-attention GAN that uses a self-attention mechanism for image generation. In [30], the LR-GAN model learns to generate image backgrounds and foregrounds separately and recursively, and the idea was later adapted to image-to-image translation. Specifically, Chen et al. [5] and Mejjati et al. [19] add an attention network to each generator to locate the object of interest in image-to-image translation tasks. Since the background is excluded from the translation, the quality of the translated images in the background regions is improved in these approaches. However, improving the quality of translated objects and foregrounds is not the focus of these two approaches.

III SPA-GAN

The goal of image-to-image translation is to learn a mapping from a source domain $X: \{x_i\}_{i=1}^{N_X}$ to a target domain $Y: \{y_j\}_{j=1}^{N_Y}$, where $N_X$ and $N_Y$ are the number of samples in domains $X$ and $Y$, respectively. In the unpaired setting, two inverse mappings $G_{XY}: X \to Y$ and $G_{YX}: Y \to X$ are learned simultaneously through the cycle consistency loss [40, 33].

Incorporating the attention mechanism into image-to-image translation can help the generative network attend to the regions of interest and produce more realistic images. The proposed SPA-GAN model achieves this by explicitly transferring knowledge from the discriminator to the generator to force it to focus on the discriminative areas of the source and target domains. Fig. 1 shows the main components of SPA-GAN and compares it to the CycleGAN model, which has no feedback attention. Both frameworks learn two inverse mappings through one generator and one discriminator in each domain. However, in SPA-GAN the discriminator generates attention maps in addition to classifying its input as real or fake. These attention maps are looped back to the input of the generator. While CycleGAN is trained using the adversarial and cycle consistency losses, SPA-GAN integrates the adversarial, modified cycle consistency and feature map losses to generate more realistic outputs.

III-A Spatial Attention Map from Discriminator

In a GAN, the discriminator classifies its input as either fake or real. In SPA-GAN, we additionally deploy the discriminator network to highlight the most discriminative regions between real and fake images. These discriminative regions show the areas where the discriminator focuses in order to correctly classify the input, and are therefore treated as spatial attention maps.

Formally, given an input image $x$, the spatial attention map $A(x)$, whose size is the same as that of the input image $x$, is obtained by feeding $x$ to the discriminator. Following [34], we define $A(x)$ as the sum of the absolute values of the activation maps at each spatial location of a layer, taken across the channel dimension:

$A(x) = \sum_{c=1}^{C} |F_c(x)|$,   (1)

where $F_c(x)$ is the $c$-th feature plane of a discriminator layer for the specific input $x$ and $C$ is the number of channels. $A(x)$ directly indicates the importance of the hidden units at each spatial location in classifying the input image as fake or real.

The attention maps of different layers in a classifier network focus on different features. For instance, when classifying apples or faces, the middle-layer attention maps have higher activations on regions such as the top of an apple or the eyes and lips of a face, while the attention maps of later layers typically focus on full objects. Thus, in SPA-GAN we select the mid-level attention maps from the second-to-last layer of the discriminator, which usually correlate with discriminative object parts [34], and feed them back to the generator.

The detailed architecture of SPA-GAN is shown in panel (b) of Fig. 1. First, an input image $x$ is fed to the discriminator to obtain the spatial attention map $A(x)$, which highlights the most discriminative regions in $x$. Then, the spatial attention map is normalized to the range $[0, 1]$ and upsampled to match the input image size. Next, we apply the spatial attention map to the input image using an element-wise product and feed the result to the generator to help it focus on the most discriminative parts when generating the output:

$x_a = A(x) \odot x$,   (2)

where $x_a$ is the attended input sample and $\odot$ denotes the element-wise product.
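As a concrete illustration, the sketch below implements Eqs. (1) and (2) in PyTorch, assuming access to the activations of the chosen (second-to-last) discriminator layer; the function names and the normalization epsilon are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F


def spatial_attention(feats: torch.Tensor) -> torch.Tensor:
    """Sum of absolute activation values across channels (Eq. 1).

    feats: (B, C, H, W) activations from a chosen discriminator layer.
    Returns a (B, 1, H, W) spatial attention map.
    """
    return feats.abs().sum(dim=1, keepdim=True)


def attend_input(x: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """Normalize the attention map to [0, 1], upsample it to the input
    resolution and apply it to the input image with an element-wise
    product (Eq. 2)."""
    a = spatial_attention(feats)
    a_min = a.amin(dim=(2, 3), keepdim=True)
    a_max = a.amax(dim=(2, 3), keepdim=True)
    a = (a - a_min) / (a_max - a_min + 1e-8)          # scale to [0, 1]
    a = F.interpolate(a, size=x.shape[2:], mode='bilinear',
                      align_corners=False)            # match input size
    return a * x                                      # attended input x_a
```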

III-B Feature Map Loss

Unsupervised image synthesis requires two generator-discriminator pairs, as the mapping is learned in both directions (see panel (b) of Fig. 1). We make use of this architecture and introduce an additional feature map loss term that encourages the generators to preserve domain-specific features. Ideally, in the generator pair, both real and generated objects should share the same high-level abstraction in the first layer of the decoder, which is responsible for decoding high-level semantics [17]. Thus, we penalize the differences between the feature maps obtained in the first layer of the decoders for the real and generated images, respectively.

Specifically, the generator feature map loss between the attended sample $x_a$ in the source domain and the attended generated sample $\hat{x}_a$ from the inverse mapping is computed as follows (see dashed boxes (b-1) and (b-2) in Fig. 1):

$L_{fm}(G_{XY}) = \frac{1}{C} \sum_{c=1}^{C} \| f_c(x_a) - f_c(\hat{x}_a) \|$,   (3)

where $f_c$ is the $c$-th feature map and $C$ is the number of feature maps in the given layer of the generator $G_{XY}$. The feature map loss is added to the overall loss function of the generator to preserve domain-specific features. The feature map loss $L_{fm}(G_{YX})$ associated with the inverse mapping is defined similarly, and the total feature map loss is given as:

$L_{fm} = L_{fm}(G_{XY}) + L_{fm}(G_{YX})$.   (4)

As shown in our experimental results in Section IV, the feature map loss helps generate more realistic objects by explicitly forcing the generators to maintain domain-specific features.
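A minimal sketch of the feature map loss, assuming the first-decoder-layer activations of a generator are available for both the attended real sample and the attended generated sample; the mean L1 distance used here is an assumption, since the exact norm is not stated in the text above.

```python
import torch


def feature_map_loss(real_feats: torch.Tensor, fake_feats: torch.Tensor) -> torch.Tensor:
    """Per-generator feature map loss (Eq. 3): mean distance between the
    first-decoder-layer feature maps of the attended real sample and the
    attended generated sample (an L1 mean is assumed here).

    real_feats, fake_feats: (B, C, H, W) tensors from the same generator layer.
    """
    return torch.mean(torch.abs(real_feats - fake_feats))


# Total feature map loss (Eq. 4) is the sum of the per-generator terms:
#   l_fm = feature_map_loss(f_xy_real, f_xy_fake) + feature_map_loss(f_yx_real, f_yx_fake)
```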

III-C Loss Function

The adversarial loss of the GAN for the mapping $G_{XY}: X \to Y$ and its discriminator $D_Y$ is expressed as:

$L_{GAN}(G_{XY}, D_Y) = \mathbb{E}_{y \sim p(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p(x)}[\log(1 - D_Y(G_{XY}(x_a)))]$,   (5)

and the inverse mapping $G_{YX}: Y \to X$ has a similar adversarial loss:

$L_{GAN}(G_{YX}, D_X) = \mathbb{E}_{x \sim p(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p(y)}[\log(1 - D_X(G_{YX}(y_a)))]$,   (6)

where the mapping functions $G_{XY}$ and $G_{YX}$ aim to minimize the loss against the adversary discriminators $D_Y$ and $D_X$, which try to maximize it.

A network with enough capacity might map a set of input images to any random permutation of images in the target domain, and thus the adversarial losses alone cannot guarantee that the learned mapping produces the desired output for a given input image. To overcome this, the cycle consistency loss was proposed in CycleGAN [40] to measure the discrepancy between the input image and the image generated by the inverse mapping that translates the translated image back to the original domain. Similar to CycleGAN, we take advantage of the cycle consistency loss to achieve a one-to-one correspondence in the mapping. Since we apply the attention map extracted from the discriminator to the generator's input, we modify the cycle consistency loss as:

$L_{cyc}(G_{XY}, G_{YX}) = \mathbb{E}_{x \sim p(x)}[\| G_{YX}(G_{XY}(x_a)) - x \|_1] + \mathbb{E}_{y \sim p(y)}[\| G_{XY}(G_{YX}(y_a)) - y \|_1]$,   (7)

where $x_a$ and $y_a$ are the attended input samples. The modified cycle consistency loss helps the generators focus on the most discriminative regions in image-to-image translation. In [19, 5], the attended regions are the same for both mappings, and the cycle consistency loss enforces the attended regions to conserve the content (e.g., pose) of the object, which prevents the network from making geometric and shape changes. Different from [19, 5], our framework allows different attention maps in the forward and inverse mappings.

Finally, by combining the adversarial loss, the modified cycle consistency loss and the generator feature map loss, the full objective function of SPA-GAN is expressed as:

$L(G_{XY}, G_{YX}, D_X, D_Y) = L_{GAN}(G_{XY}, D_Y) + L_{GAN}(G_{YX}, D_X) + \lambda_{cyc} L_{cyc} + \lambda_{fm} L_{fm}$,   (8)

where $\lambda_{cyc}$ and $\lambda_{fm}$ control the importance of the different terms, and we aim to solve the following min-max problem:

$G_{XY}^{*}, G_{YX}^{*} = \arg \min_{G_{XY}, G_{YX}} \max_{D_X, D_Y} L(G_{XY}, G_{YX}, D_X, D_Y)$.   (9)
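The sketch below puts the pieces together: a version of the modified cycle consistency loss of Eq. (7) started from the attended inputs, and the weighted combination of Eq. (8). The generators G_xy and G_yx are assumed to be callables mapping images to images; whether the intermediate translated image is re-attended before the inverse generator is not shown here.

```python
import torch


def modified_cycle_loss(G_xy, G_yx, x, y, x_a, y_a):
    """Modified cycle consistency loss (Eq. 7): the cycle starts from the
    attended inputs x_a, y_a but must reconstruct the original images x, y."""
    rec_x = G_yx(G_xy(x_a))          # X -> Y -> X
    rec_y = G_xy(G_yx(y_a))          # Y -> X -> Y
    return (torch.mean(torch.abs(rec_x - x)) +
            torch.mean(torch.abs(rec_y - y)))


def full_objective(l_adv_xy, l_adv_yx, l_cyc, l_fm,
                   lambda_cyc=10.0, lambda_fm=1.0):
    """Full SPA-GAN objective (Eq. 8): adversarial terms for both mappings
    plus the weighted modified cycle consistency and feature map losses.
    The lambda defaults follow the values reported in Sec. IV-A; the
    generators minimize this quantity while the discriminators maximize
    the adversarial terms (Eq. 9)."""
    return l_adv_xy + l_adv_yx + lambda_cyc * l_cyc + lambda_fm * l_fm
```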

IV Experiments

In this section, we first perform an ablation study of our model and analyze the effect of each component of SPA-GAN. Then, we compare SPA-GAN with current state-of-the-art methods on benchmark datasets, both qualitatively and quantitatively.

IV-A Datasets and Experimental Setups

We evaluate SPA-GAN on the Horse↔Zebra and Apple↔Orange datasets provided in [40] and the Lion↔Tiger dataset downloaded from ImageNet [7], which consists of images of tigers and lions. These are challenging image-to-image translation datasets that include objects at different scales. The goal is to translate one particular type of object (e.g., orange) into another type of object (e.g., apple). We also evaluate SPA-GAN on image-to-image translation tasks that require translating the whole image, e.g., Winter↔Summer in [40] and gender conversion on the Facescrub [23] dataset.

For all experiments, we use the Adam solver [12] and a batch size of 1. The networks were trained with an initial learning rate of 0.0002. We adopt the same architecture used in [40] for our generative networks and discriminators. We use a least-squares loss [18], which has been shown to lead to more stable training and to help generate higher-quality and sharper images. We empirically set $\lambda_{cyc} = 10$ and $\lambda_{fm} = 1$ in Eq. 8. Different from [19, 5], which add additional attention networks to the CycleGAN framework, SPA-GAN does not include any additional attention network or supervision, and its training time is similar to that of CycleGAN.
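A short sketch of this training setup, assuming standard PyTorch generators and discriminators; the Adam beta values are an assumption borrowed from common CycleGAN practice and are not stated in the text.

```python
import itertools
import torch


def build_optimizers(G_xy, G_yx, D_x, D_y, lr=2e-4, betas=(0.5, 0.999)):
    """Adam optimizers with the initial learning rate of 0.0002 used in the
    experiments (the batch size of 1 is handled by the data loader)."""
    opt_g = torch.optim.Adam(
        itertools.chain(G_xy.parameters(), G_yx.parameters()), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(
        itertools.chain(D_x.parameters(), D_y.parameters()), lr=lr, betas=betas)
    return opt_g, opt_d


def lsgan_loss(pred, is_real):
    """Least-squares adversarial loss [18]: real predictions are pushed
    towards 1 and fake predictions towards 0."""
    target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
    return torch.mean((pred - target) ** 2)
```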

IV-B Evaluation Metrics

The following state-of-the-art image-to-image translation methods are used in our empirical evaluation and comparison.

CycleGAN. CycleGAN adopts a GAN with a cycle consistency loss for the unpaired image-to-image translation task [40].
DualGAN. An unsupervised dual learning framework for image-to-image translation on unlabeled images from two domains that uses the Wasserstein GAN loss rather than the sigmoid cross-entropy loss [33].
UNIT. An unsupervised image-to-image translation framework based on the shared-latent space assumption and cycle loss [16].
MUNIT. A multimodal unsupervised image-to-image translation framework that assumes two latent representations for style and content. To translate an image to another domain, its content code is combined with different style representations sampled from the target domain [9].
DRIT. A diverse image-to-image translation approach based on disentangled representations of unpaired data that decomposes the latent space into two spaces: a domain-invariant content space capturing shared information and a domain-specific attribute space producing diverse outputs given the same content. The number of output styles is set to 1 in our experiments for both MUNIT and DRIT [15].
AGGAN [19] and Attention-GAN [5]. Similar unsupervised image-to-image translation methods with added attention networks. Since the code of Attention-GAN [5] is not released and it was outperformed by AGGAN [19], we compare with AGGAN only.

Two metrics, Kernel Inception Distance (KID) and classification accuracy, are used for quantitative comparison between SPA-GAN and the state-of-the-art methods. KID [2] is defined as the squared Maximum Mean Discrepancy (MMD) between Inception representations of real and generated images. It has recently been used for performance evaluation of image-to-image translation and image generation models [19, 2]. KID has an unbiased estimator and makes no assumption about the form of the activation distribution, which makes it a more reliable metric than the Fréchet Inception Distance (FID) [8], even for a small number of test samples. A smaller KID value indicates higher visual similarity between the generated images and the real images. Classification accuracy on the generated images is also widely used as a quantitative evaluation metric in the image generation literature [10, 28, 36]. In our experiments, we fine-tuned the Inception network [26] pretrained on ImageNet [7] for each translation task and report the top-1 classification performance on the images generated by each method. We have also conducted a human perceptual study on different translation tasks to further evaluate our model.
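For reference, a small NumPy sketch of the KID computation on precomputed Inception features, using the default polynomial kernel and the unbiased MMD² estimator of [2]; feature extraction and the block averaging used to obtain the reported mean ± std are omitted.

```python
import numpy as np


def polynomial_kernel(X, Y, degree=3, coef0=1.0):
    """Default KID kernel: k(x, y) = (x . y / d + coef0) ** degree."""
    d = X.shape[1]
    return (X @ Y.T / d + coef0) ** degree


def kid(real_feats, fake_feats):
    """Unbiased estimate of the squared MMD between Inception features of
    real and generated images (KID [2]); inputs are (n, d) and (m, d) arrays."""
    n, m = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Drop diagonal terms for the unbiased within-set estimates.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()
```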

Fig. 2: Comparison between the attention maps generated by the attention network in AGGAN [19] and the attention maps computed in the discriminator of our SPA-GAN model (third row) on different datasets. SPA-GAN attention maps have higher activation values in the most discriminative regions between the source and target domains. AGGAN generates a disconnected attention map for the zebra, while SPA-GAN attends to the discriminative regions of the zebra (first column). In column 4, AGGAN attends to the whole oranges, while SPA-GAN generates an attention map with higher values around the boundaries and the top of the oranges.
Fig. 3: Translation results generated by different approaches on the Apple↔Orange dataset.

Method | KID | Accuracy (%)
CycleGAN | 11.02 ± 0.60 | 71.80
SPA-GAN w/o L_fm | 4.81 ± 0.23 | 85.71
SPA-GAN (max attention) | 5.66 ± 0.48 | 84.59
SPA-GAN | 3.77 ± 0.32 | 87.21

TABLE I: Kernel Inception Distance ×100 ± std. ×100 (lower is better), computed using only the target domain, and classification accuracy (higher is better) for ablations of our proposed approach on the Apple↔Orange dataset.

IV-C Ablation Study

We first performed a model ablation on the Apple↔Orange dataset to evaluate the impact of each component of SPA-GAN. In Table I, we report both KID and classification accuracy for different configurations of our model. First, we removed the attention transfer from the discriminator to the generator (and, as a consequence, used the regular cycle consistency loss). The generator feature map loss is also removed because it is calculated only on the objects detected by the spatial attention map. In this case, our model is reduced to the CycleGAN architecture (CycleGAN). The KID and classification accuracy we obtained are consistent with those reported in the literature.

Next, we feed the spatial attention from the discriminator to the generator in CycleGAN but without the generator feature map loss (SPA-GAN w/o L_fm). Our results show that this leads to a higher KID and lower classification accuracy when compared with the full version of SPA-GAN. Clearly, by enforcing the similarity between the discriminative regions of the attended real image and the attended generated image, the feature map loss computed at the abstract level helps achieve a more realistic output.

As pointed out in [34], attention can also be computed as the maximum of the absolute values in the activation maps. Thus, we also compared maximum-based attention (SPA-GAN (max attention)) with the sum-based attention adopted in SPA-GAN. The higher KID and lower classification accuracy of the maximum-based variant reported in Table I are consistent with the results in [34]. In the following experiments, we use SPA-GAN with sum-based attention and compare it with existing methods.
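The two attention variants compared in this ablation differ only in how the channel dimension is reduced; a minimal sketch of the maximum-based alternative (the function name is illustrative):

```python
import torch


def attention_max(feats: torch.Tensor) -> torch.Tensor:
    """Maximum-based attention variant from [34]: replace the channel-wise
    sum of Eq. (1) with a channel-wise maximum of absolute activations."""
    return feats.abs().amax(dim=1, keepdim=True)
```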

Fig. 4: Image-to-image translation results generated by different approaches on the Zebra↔Horse and Tiger↔Lion datasets.

IV-D Qualitative Results

In Fig. 2, we show a few examples and compare our generated attention maps with the attention maps generated by the attention network in AGGAN [19] on different datasets. The rows, from top to bottom, show the input images, the attention maps computed in the discriminator of our SPA-GAN model, and the attention maps generated by the attention network in AGGAN [19], respectively. In the orange→apple translation, the SPA-GAN attention map computed in the discriminator focuses on both the shape and texture of the generated and real apple images in order to correctly classify the input image. In this example, the SPA-GAN spatial attention map has higher values around the boundaries and on the top part of the oranges, while AGGAN attends to the whole oranges. The attention maps in SPA-GAN have higher activation levels for the most discriminative regions between the two domains. Transferring this knowledge to the generator improves its performance by focusing on the discriminative areas and makes it more robust to shape changes between the two domains.

Method | apple→orange | orange→apple | zebra→horse | horse→zebra | lion→tiger | tiger→lion
DualGAN [33] | 14.68 ± 1.10 | 8.66 ± 0.94 | 9.82 ± 0.83 | 11.00 ± 0.68 | 11.5 ± 0.43 | 10.04 ± 0.76
UNIT [16] | 15.11 ± 1.41 | 7.26 ± 1.02 | 7.76 ± 0.80 | 6.35 ± 0.70 | 8.14 ± 0.25 | 8.17 ± 0.94
MUNIT [9] | 13.45 ± 1.67 | 6.79 ± 0.78 | 6.32 ± 0.90 | 4.76 ± 0.63 | 2.67 ± 0.63 | 8.10 ± 0.87
DRIT [15] | 9.65 ± 1.61 | 6.50 ± 1.16 | 5.67 ± 0.66 | 4.30 ± 0.57 | 2.39 ± 0.67 | 7.04 ± 0.73
CycleGAN [40] | 11.02 ± 0.60 | 5.94 ± 0.65 | 4.87 ± 0.52 | 3.94 ± 0.41 | 2.56 ± 0.13 | 5.32 ± 0.47
AGGAN [19] | 10.36 ± 0.86 | 4.54 ± 0.50 | 4.46 ± 0.40 | 4.12 ± 0.80 | 2.23 ± 0.21 | 5.83 ± 0.51
SPA-GAN | 3.77 ± 0.32 | 2.38 ± 0.33 | 2.19 ± 0.12 | 2.01 ± 0.13 | 1.17 ± 0.19 | 3.09 ± 0.19

TABLE II: Kernel Inception Distance ×100 ± std. ×100 (lower is better), computed using only the target domain, for various image-to-image translation methods on the Horse↔Zebra, Apple↔Orange and Tiger↔Lion datasets.

Method | apple→orange | orange→apple | zebra→horse | horse→zebra | lion→tiger | tiger→lion
DualGAN [33] | 13.04 ± 0.72 | 12.42 ± 0.88 | 12.86 ± 0.50 | 10.38 ± 0.31 | 10.18 ± 0.15 | 10.44 ± 0.04
UNIT [16] | 11.68 ± 0.43 | 11.76 ± 0.51 | 13.63 ± 0.34 | 11.22 ± 0.24 | 11.00 ± 0.09 | 10.23 ± 0.03
MUNIT [9] | 9.70 ± 1.22 | 10.61 ± 1.16 | 11.51 ± 1.27 | 8.31 ± 0.46 | 10.87 ± 0.91 | 10.61 ± 0.47
DRIT [15] | 6.37 ± 0.75 | 8.34 ± 1.22 | 9.65 ± 0.91 | 8.23 ± 0.08 | 9.56 ± 0.18 | 10.11 ± 0.59
CycleGAN [40] | 8.48 ± 0.53 | 9.82 ± 0.51 | 11.44 ± 0.38 | 10.25 ± 0.25 | 10.15 ± 0.08 | 10.97 ± 0.04
AGGAN [19] | 6.44 ± 0.69 | 5.32 ± 0.48 | 8.87 ± 0.26 | 6.93 ± 0.27 | 8.56 ± 0.16 | 9.17 ± 0.07
SPA-GAN | 5.81 ± 0.51 | 7.95 ± 0.42 | 8.72 ± 0.24 | 7.89 ± 0.29 | 8.47 ± 0.07 | 8.63 ± 0.05

TABLE III: Kernel Inception Distance ×100 ± std. ×100 (lower is better), computed using both the target and the source domains, for various image-to-image translation methods on the Horse↔Zebra, Apple↔Orange and Tiger↔Lion datasets.

Method | apple→orange | orange→apple | zebra→horse | horse→zebra | lion→tiger | tiger→lion
Real | 97.58 | 97.36 | 85.71 | 97.85 | 99.63 | 100
DualGAN [33] | 78.57 | 64.91 | 41.42 | 83.33 | 66.53 | 39.05
UNIT [16] | 80.07 | 94.75 | 70.00 | 82.50 | 82.95 | 67.27
MUNIT [9] | 67.80 | 85.70 | 55.27 | 82.50 | 79.60 | 52.75
DRIT [15] | 75.50 | 76.80 | 72.50 | 80.31 | 84.90 | 60.38
CycleGAN [40] | 71.80 | 72.93 | 75.00 | 83.33 | 73.48 | 48.10
AGGAN [19] | 21.80 | 34.21 | 64.28 | 82.85 | 87.63 | 50.54
SPA-GAN | 87.21 | 95.49 | 84.17 | 87.50 | 92.42 | 87.12

TABLE IV: Top-1 classification performance (higher is better) on images generated by various image-to-image translation methods on the Horse↔Zebra, Apple↔Orange and Tiger↔Lion datasets.

Figs. 3 and 4 show exemplary image-to-image translation results on the benchmark datasets. The first column is the real input image, and the images generated by SPA-GAN and the other approaches are shown in the following columns. In all rows of Fig. 3, DRIT, CycleGAN and AGGAN only change the color of the objects and do not succeed in translating the shape differences between the apple and orange domains. As a comparison, SPA-GAN is more robust to shape changes and succeeds in localizing parts of the object and translating them to the target domain. Our approach generates more realistic images by changing both the shape and texture of the input object on the Apple↔Orange dataset, which validates the effectiveness of incorporating the attention into the generative network instead of applying the attention to the output of the transformation network [19, 5].

As shown in Fig. 4, DualGAN, UNIT, MUNIT and DRIT alter the background of the input image. For example, the images generated by these methods in rows 3 and 4 have zebra patterns in the background. CycleGAN and AGGAN generate visually better results and preserve the input background. However, they miss certain parts of the object in the translation. For example, CycleGAN fails to translate the head of the zebra in rows 1 and 4, while AGGAN misses the body or the head of the animal in rows 1, 3 and 4. The objects generated by CycleGAN and AGGAN are mixed with parts from the target as well as the source domain. It can be seen in rows 3 and 5 that CycleGAN generates images with horizontal zebra patterns instead of the vertical ones produced by SPA-GAN. In the tiger→lion translation, all other methods keep some tiger patterns after translation. Clearly, the SPA-GAN results are more realistic compared to the other methods. SPA-GAN is also more successful in generating tiger patterns in the lion→tiger translation (rows 8 and 9) compared to the other methods. Please see the supplementary material for more visual examples.

IV-E Quantitative Comparison

Mejjati et al. [19] reported the mean KID value computed between generated samples and real samples from both the source and target domains. We argue that calculating KID using both the target and source domains is not a good practice, especially for datasets with no meaningful background such as Apple↔Orange. Therefore, we report mean KID values computed only on the target domain (Table II) and on both source and target domains (Table III) to better evaluate the performance of our proposed approach and the state-of-the-art methods.

Our approach achieved the lowest target-only KID scores in all translation tasks, showing its effectiveness in generating more realistic images. Interestingly, SPA-GAN does not always achieve the smallest values when KID is computed with both the source and the target. For example, in column 2 of Table III (orange→apple), AGGAN has the smallest KID value of 5.32, which is averaged over the KID between the generated apples and real apples (target) and the KID between the generated apples and real oranges (source). As a comparison, SPA-GAN has the smallest KID value in column 2 of Table II, computed only against the real apples. This shows that the apples generated by AGGAN still maintain a higher level of feature similarity to real oranges than those generated by SPA-GAN; that is, the SPA-GAN results are more realistic. Results from Tables II and III clearly demonstrate the effectiveness of SPA-GAN.


Fig. 5: Translation results on the Winter↔Summer dataset.

Fig. 6: Translation results on gender conversion.

We also report the top-1 classification performance on the real images as well as on the images generated by each method in Table IV. If a generated image is realistic enough, the classifier will predict it as a sample from the target domain. The images produced by the SPA-GAN network clearly outperform those of all competing methods in terms of classification accuracy.

IV-F Other Image-to-Image Translation Applications

Finally, we evaluated SPA-GAN on image-to-image translation datasets that require translating the whole image. In Fig. 5, we show the results on the Winter↔Summer dataset [40]. The second column in Fig. 5 shows the attention maps; the Winter↔Summer task requires a holistic translation of the input image with no specific type of object. Clearly, the discriminator focuses on areas such as the ground and trees that have different colors during the winter and summer seasons. In Fig. 6, we show the gender conversion results on the Facescrub [23] dataset. The second column shows the attention maps, which have higher activation levels around areas of the face such as the eyes, nose and lips that the discriminator attends to when classifying the input image. The spatial attention maps obtained from the discriminator on these two datasets clearly demonstrate the effectiveness of SPA-GAN in a variety of image-to-image translation tasks. Please see the supplementary material for more visual examples on the Facescrub and GTA [25] → Cityscapes [6] datasets.

V Conclusion

In this paper, we proposed SPA-GAN for image-to-image translation in unsupervised settings. In SPA-GAN, we compute spatial attention maps in the discriminator and transfer this knowledge to the generator so that it can explicitly attend to the discriminative regions between the two domains, which improves the quality of the generated images. SPA-GAN is a lightweight model and achieves superior performance, both qualitatively and quantitatively, over current state-of-the-art methods.

References

  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §II-B.
  • [2] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. arXiv preprint arXiv:1801.01401. Cited by: §IV-B.
  • [3] L. Chen, L. Wu, Z. Hu, and M. Wang (2019) Quality-aware unpaired image-to-image translation. IEEE Transactions on Multimedia. Cited by: §I.
  • [4] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille (2016) Attention to scale: scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3640–3649. Cited by: §II-B.
  • [5] X. Chen, C. Xu, X. Yang, and D. Tao (2018) Attention-gan for object transfiguration in wild images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 164–180. Cited by: 1st item, 2nd item, §I, §II-B, §III-C, §IV-A, §IV-B, §IV-D.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §IV-F.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. Cited by: §IV-A, §IV-B.
  • [8] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §IV-B.
  • [9] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §II-A, §IV-B, TABLE II, TABLE III, TABLE IV.
  • [10] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §I, §II-A, §IV-B.
  • [11] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1857–1865. Cited by: §II-A.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
  • [13] J. Kuen, Z. Wang, and G. Wang (2016) Recurrent attentional networks for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3668–3677. Cited by: §II-B.
  • [14] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §I.
  • [15] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §II-A, §IV-B, TABLE II, TABLE III, TABLE IV.
  • [16] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708. Cited by: §II-A, §IV-B, TABLE II, TABLE III, TABLE IV.
  • [17] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In Advances in neural information processing systems, pp. 469–477. Cited by: §I, §II-A, §III-B.
  • [18] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §IV-A.
  • [19] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim (2018) Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems, pp. 3697–3707. Cited by: 1st item, 2nd item, §I, §II-B, §III-C, Fig. 2, §IV-A, §IV-B, §IV-B, §IV-D, §IV-D, §IV-E, TABLE II, TABLE III, TABLE IV.
  • [20] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §II-B.
  • [21] S. Mo, M. Cho, and J. Shin (2018) InstaGAN: instance-aware image-to-image translation. arXiv preprint arXiv:1812.10889. Cited by: §II-A.
  • [22] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018) Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4500–4509. Cited by: §I.
  • [23] H. Ng and S. Winkler (2014) A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 343–347. Cited by: §IV-A, §IV-F.
  • [24] R. A. Rensink (2000) The dynamic representation of scenes. Visual cognition 7 (1-3), pp. 17–42. Cited by: §II-B.
  • [25] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: §IV-F.
  • [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §IV-B.
  • [27] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §II-B.
  • [28] X. Wang and A. Gupta (2016) Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pp. 318–335. Cited by: §IV-B.
  • [29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §II-B.
  • [30] J. Yang, A. Kannan, D. Batra, and D. Parikh (2017) Lr-gan: layered recursive generative adversarial networks for image generation. arXiv preprint arXiv:1703.01560. Cited by: §II-B.
  • [31] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §II-B.
  • [32] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville (2015) Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pp. 4507–4515. Cited by: §II-B.
  • [33] Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) Dualgan: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857. Cited by: §I, §I, §II-A, §III, §IV-B, TABLE II, TABLE III, TABLE IV.
  • [34] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §I, §II-B, §III-A, §III-A, §IV-C.
  • [35] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §II-B.
  • [36] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §IV-B.
  • [37] R. Zhang, T. Pfister, and J. Li (2019) Harmonic unpaired image-to-image translation. arXiv preprint arXiv:1902.09727. Cited by: §II-A.
  • [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §II-B.
  • [39] Y. Zhou, R. Jiang, X. Wu, J. He, S. Weng, and Q. Peng (2019) BranchGAN: unsupervised mutual image-to-image transfer with a single encoder and dual decoders. IEEE Transactions on Multimedia. Cited by: §II-A.
  • [40] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §I, §I, §II-A, §III-C, §III, §IV-A, §IV-A, §IV-B, §IV-F, TABLE II, TABLE III, TABLE IV.