Unpaired Image Translation via Adaptive Convolution-based Normalization

11/29/2019 ∙ by Wonwoong Cho, et al. ∙ Korea University

Disentangling content and style information of an image has played an important role in recent success in image translation. In this setting, how to inject given style into an input image containing its own content is an important issue, but existing methods followed relatively simple approaches, leaving room for improvement especially when incorporating significant style changes. In response, we propose an advanced normalization technique based on adaptive convolution (AdaCoN), in order to properly impose style information into the content of an input image. In detail, after locally standardizing the content representation in a channel-wise manner, AdaCoN performs adaptive convolution where the convolution filter weights are dynamically estimated using the encoded style representation. The flexibility of AdaCoN can handle complicated image translation tasks involving significant style changes. Our qualitative and quantitative experiments demonstrate the superiority of our proposed method against various existing approaches that inject the style into the content.




1 Introduction

Recently, unpaired image-to-image translation Zhu_2017; kim2017learning; StarGAN2018 has been actively studied as one of the major research areas. It aims to learn inter-domain mappings without paired images, such that deep neural networks can translate a given image from one domain to another (e.g., real photo → artwork). However, these methods bear a fundamental limitation of generating a uni-modal output given a single image even if multiple diverse outputs may exist. In response, several approaches lin2018conditional; Huang_2018_ECCV; Lee_2018_ECCV; Xiao_2018_ECCV; ma2018exemplar; Chang:2018:PAS have been proposed to achieve multi-modality, i.e., the capability of generating multiple outputs given a single input image by taking an additional input, such as an exemplar image conveying detailed style information to transfer.

Although exemplar-based image translation achieves multi-modality of outputs owing to its flexibility in reflecting the exemplar image that gives fine details of the intended style, there still remains the issue of how to properly impose the style feature extracted from an exemplar image onto a content image. Previous approaches Lee_2018_ECCV; huang2017arbitrary; cho2019image commonly follow two steps of first standardizing features and then applying a particular transformation, where the first step can be regarded as removing the existing style information of an input image and the second step plays the role of imposing the exemplar style on the style-neutralized input feature.

As one of the state-of-the-art methods, adaptive instance normalization (AdaIN) huang2017arbitrary has been successfully utilized to combine content and style in a slew of studies Huang_2018_ECCV; ma2018exemplar; karras2018style. AdaIN incorporates different features by matching each channel's statistics, i.e., the mean and the variance, in the content to those in the style. To this end, AdaIN first standardizes each channel of the content feature and adaptively performs channel-wise scaling and shifting using the parameters regressed from the style feature. Another recently proposed method called group-wise deep whitening-and-coloring transformation (GDWCT) cho2019image has shown a superior capability of imposing drastically different styles by matching higher-order statistics such as the covariance, which we call the coloring transformation, in addition to the per-channel mean and variance.

In the above methods, we claim that the second step of imposing the target statistics can be viewed as a simpler variant or a special case of a convolution operation, as illustrated in Fig. 1. That is, (a) the channel-wise affine transformation used in AdaIN can be viewed as a channel-wise convolution. (b) On the other hand, the coloring transformation of GDWCT, which matches the target covariance, can be considered as a convolution operation that generates each output channel as a linear combination of all input channels (an additional illustration can be found in the Appendix). However, these methods tend to fail when the translation requires a dramatic shape change, because neither transformation takes spatially neighboring activations into account.

From this unified perspective of a convolution operation, these existing methods rely only on its simpler forms, using only $1 \times 1$ convolution filters, and thus the potential of leveraging general convolution operations with larger-than-$1 \times 1$ filters when injecting a target style has not yet been fully explored.
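The convolutional view of the two existing transformations can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper `conv1x1` and the random arrays `gamma`, `beta`, and `coloring` are illustrative stand-ins for the style-derived parameters.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))   # stand-in for a standardized content feature
gamma = rng.standard_normal(4)       # per-channel scale (assumed to come from the style)
beta = rng.standard_normal(4)        # per-channel shift (assumed to come from the style)

# (a) Channel-wise affine transform, as in AdaIN ...
affine = gamma[:, None, None] * x + beta[:, None, None]
# ... equals a 1x1 convolution whose weight matrix is diagonal (depth-wise).
depthwise = conv1x1(x, np.diag(gamma), beta)

# (b) A coloring-style transform mixes all input channels at each position,
# which is a 1x1 convolution with a dense C x C weight matrix.
coloring = rng.standard_normal((4, 4))
mixed = conv1x1(x, coloring, beta)
flat = coloring @ x.reshape(4, -1) + beta[:, None]
```

In both cases the filter covers a single spatial position, so neither transform can move information between neighboring locations.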

Inspired by this, we propose an adaptive convolution-based normalization (AdaCoN) as an advanced method to inject the target style into a given image. AdaCoN is composed of two steps: standardization and adaptive convolution. First, the standardization is locally performed on each sub-region of an input activation map where the convolution filter is applied, similar to previous work jarrett2009best; krizhevsky2012imagenet. Second, AdaCoN performs adaptive convolution where the (larger-than-$1 \times 1$) convolution filter weights are dynamically estimated using the encoded style representation.

By taking into account spatial patterns due to a convolution operation, we hypothesize that AdaCoN is capable of flexibly performing a spatially-adaptive image translation, which can potentially handle complicated image translation tasks involving significant style changes. In this sense, AdaCoN has something in common with the recent success in patch-based style transfer chen2016fast; gu2018arbitrary that dynamically applies different styles to each patch of an input image.

In order to verify the superiority of AdaCoN, we conduct both quantitative and qualitative experiments that compare different normalization methods while maintaining the same model architectures.

Figure 1: Comparisons of different normalization methods for image translation. Each existing method can be viewed as the special case or the variant of a convolution operation.

2 Related work

Unpaired image translation.

Unpaired image translation aims to transform an input image from one domain to another without paired images. Numerous approaches Zhu_2017; kim2017learning; liu2017unsupervised have been proposed for this task. Recently, multimodal image translation methods, capable of yielding multiple different images given a particular image, have also been studied Huang_2018_ECCV; Lee_2018_ECCV; cho2019image. These studies take similar approaches to address the uni-modality problem of previous methods by incorporating an exemplar image as guidance for image translation. In addition, they assume that a latent image space can be disentangled into a content space that contains the underlying structure of images and a style space that maintains domain-specific features. However, they propose different methods for integrating the disentangled content feature from the input image and the style feature from the exemplar image. To be specific, inspired by AdaIN huang2017arbitrary, MUNIT Huang_2018_ECCV adopts the idea of matching the statistics between the content and the style features. Extending this idea, GDWCT cho2019image leverages higher-order statistics compared to the previous method, enhancing the quality of generated images. Meanwhile, DRIT Lee_2018_ECCV simply concatenates the content and the style features to perform image translation. However, these methods have a limited capability to handle drastic changes between the domains. A recently proposed method called instaGAN mo2018instagan tackles this problem by taking the segmentation mask as an additional input, which serves as a strong hint for translation.

Adaptive convolution.

Unlike standard convolution layers where the filter weights are trainable constant values, an adaptive convolution layer uses varying filter weights dynamically determined by input data. Based on this idea, dynamic filter networks jia2016dynamic proposed to take an auxiliary input image to determine convolution filter weights in a video prediction task. Furthermore, Kang et al. kang2017incorporating showed that convolution filter weights derived from side information such as camera perspective or noise level can be utilized to improve the performance of a classification task. Recent studies proposed to apply adaptive convolution to a variety of tasks such as semantic segmentation harley2017segmentation; su2019pixel and motion prediction xue2016visual. In this paper, we propose AdaCoN, which adaptively obtains convolution weights associated with convolution-based normalization for an image translation task.

Figure 2: Overview of our networks.

3 Proposed Methods

In this section, we briefly describe our backbone networks for an image translation task and then describe our proposed method in detail.

3.1 Translation backbone

Networks overview.

Let $x_A$ and $x_B$ denote randomly sampled images from two different domains $\mathcal{X}_A$ and $\mathcal{X}_B$, respectively. Given two images, our networks translate $x_A$ from domain $\mathcal{X}_A$ to domain $\mathcal{X}_B$ as well as $x_B$ from domain $\mathcal{X}_B$ to domain $\mathcal{X}_A$. To this end, we adopt the disentangling strategy Huang_2018_ECCV; Lee_2018_ECCV; cho2019image that decomposes an image into a domain-invariant content feature (e.g., an identity of a person) and a domain-specific style feature (e.g., the hair length in the female domain). This can be formulated as

$$c_A = E^c_A(x_A), \quad s_A = E^s_A(x_A), \quad c_B = E^c_B(x_B), \quad s_B = E^s_B(x_B),$$

where $\{E^c_A, E^c_B\}$ are content encoders and $\{E^s_A, E^s_B\}$ are style encoders. By combining the content and the style features of the different domains $\{(c_A, s_B), (c_B, s_A)\}$ and forwarding them to the decoders $\{G_A, G_B\}$, we obtain the translated results $\{x_{A \to B}, x_{B \to A}\}$, i.e.,

$$x_{A \to B} = G_B(\mathrm{AdaCoN}(c_A, s_B)), \quad x_{B \to A} = G_A(\mathrm{AdaCoN}(c_B, s_A)),$$

where AdaCoN indicates our adaptive convolution-based normalization that incorporates given content and style features. As shown in Fig. 2, for example, given $x_A$ and $x_B$ in the woman and the man domains, respectively, let us assume that our networks translate the woman $x_A$ to the man $x_{A \to B}$. (a) We first extract the content feature $c_A$ from a woman image and the style feature $s_B$ from a man image by forwarding each image into the content encoder $E^c_A$ and the style encoder $E^s_B$. (b) We next inject the style $s_B$ into the content feature $c_A$ through AdaCoN and (c) forward the combined features into the decoder $G_B$. After obtaining a fake man image $x_{A \to B}$, (d) we exploit the fake image as an input of a discriminator $D_B$ that encourages the generated image distribution to be close to the real image distribution. (e) Lastly, we repeat the processes of (a)-(c) in order to obtain a reconstructed woman image $x_{A \to B \to A}$, enabling our networks to maintain an original identity. In this manner, our networks are trained to translate the images between two different domains.

Loss functions.

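Our networks are trained with several losses, and each term plays a crucial role in appropriately training our networks. In order to avoid redundancy, we focus on the translation $A \to B$ from this point on. First, we leverage the pixel-level reconstruction losses, such as the cycle-consistency loss and the identity loss Zhu_2017, in order to encourage high-quality generated images. The image reconstruction losses can be represented as

$$\mathcal{L}^{A \to B}_{cyc} = \mathbb{E}\big[\lVert x_{A \to B \to A} - x_A \rVert_1\big], \quad \mathcal{L}^{A \to B}_{idt} = \mathbb{E}\big[\lVert G_A(\mathrm{AdaCoN}(c_A, s_A)) - x_A \rVert_1\big].$$

We also use latent-level reconstruction losses that encourage the networks to impose style information while maintaining the original content during the forwarding phase. First, the style reconstruction loss is computed between the style features of $x_{A \to B}$ and $x_B$, which makes our networks properly reflect the style because $E^s_B(x_{A \to B})$ is constrained to be equivalent to $s_B$. Second, the content reconstruction loss is computed between $E^c_B(x_{A \to B})$ and $c_A$, and this encourages the networks to maintain the original content after performing a translation. These two losses can be formulated as

$$\mathcal{L}^{A \to B}_{s} = \mathbb{E}\big[\lVert E^s_B(x_{A \to B}) - s_B \rVert_1\big], \quad \mathcal{L}^{A \to B}_{c} = \mathbb{E}\big[\lVert E^c_B(x_{A \to B}) - c_A \rVert_1\big].$$

Lastly, the adversarial loss goodfellow2014generative is used to minimize the distance between the distribution of the real images in a target domain and that of the generated images. For this purpose, we exploit LSGAN mao2017least as our adversarial loss, i.e.,

$$\mathcal{L}^{D_B}_{adv} = \tfrac{1}{2}\,\mathbb{E}\big[(D_B(x_B) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}\big[D_B(x_{A \to B})^2\big], \quad \mathcal{L}^{G_B}_{adv} = \tfrac{1}{2}\,\mathbb{E}\big[(D_B(x_{A \to B}) - 1)^2\big].$$

Note that our translation backbone is trained to translate in both directions of $A \to B$ and $B \to A$. Finally, our full loss is formulated as the weighted sum

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{idt}\,\mathcal{L}_{idt} + \lambda_{s}\,\mathcal{L}_{s} + \lambda_{c}\,\mathcal{L}_{c},$$

where each term without the domain notation is bidirectionally applied within two different domains, and we set the weights $\lambda$ empirically.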
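The least-squares adversarial objective and the reconstruction terms described above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the function names are hypothetical, the least-squares form follows the standard LSGAN formulation, and the use of an L1 distance for the reconstruction terms is an assumption.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: push scores on real images to 1 and on fakes to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator loss: push discriminator scores on fakes to 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def reconstruction_loss(pred, target):
    """Mean absolute error (assumed L1); usable for the cycle-consistency,
    identity, style, and content reconstruction terms alike."""
    return np.mean(np.abs(pred - target))

# Toy discriminator scores for a batch of two real and two generated images.
d_real = np.array([0.9, 1.1])
d_fake = np.array([0.2, -0.1])
d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```

The pixel-level and latent-level terms differ only in what is compared (images versus encoded features), which is why a single helper covers them in this sketch.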

Figure 3: Overview of the style branch. The first step of the style branch is (a) the local standardization step that makes each local patch of the input activation map have a zero mean and a unit variance, i.e., neutralizing the original style. The second step is (b) the style injection into the standardized local patch by applying dynamically determined convolution filters. Detailed descriptions are found in Section 3.2.2.

3.2 Adaptive convolution-based normalization (AdaCoN)

The goal of AdaCoN is to produce an output feature that can reflect the style $s$ while maintaining the identity of the content $c$. The combined feature is used as input to a decoder to generate a translated image. Note that we omit the domain notation in this section for brevity.

3.2.1 Basic components

As illustrated in Fig. 2 (h), AdaCoN is composed of a style branch (h1) that reflects the style and a content branch (h2) that aims to maintain the content identity. Given the content $c$ and the style $s$, the style branch learns to inject the style into the content. On the other hand, the content branch learns to keep the essential information of the given $c$, so that the output of AdaCoN can maintain its original identity. Lastly, in the joining step (h3), the outputs of the branches are concatenated and forwarded into a subsequent convolution layer. Note that an additional analysis of this structure is provided in the Appendix.

3.2.2 Style branch

Standardization function.

$\mathcal{F}$ normalizes the content feature $c \in \mathbb{R}^{C \times H \times W}$ before applying adaptive convolution. Specifically, we compute the statistics of $c$ from the channel-wise local patch of the size $K_h \times K_w$, where $K_h$ and $K_w$ are a kernel height and a kernel width, respectively. We use local statistics because locally computed statistics can be more effective in normalizing a given feature than globally computed ones. Our standardization is formulated as

$$\mathcal{F}(c) = \frac{\mathcal{U}(c) - \mu(\mathcal{U}(c))}{\sigma(\mathcal{U}(c))},$$

where $\mathcal{U}$ denotes an unfolding operation that amasses every patch of $c$ and unites it into one tensor. Fig. 3(a) concretely describes the procedure. (a1) Given $c \in \mathbb{R}^{C \times H \times W}$, (a2) $\mathcal{U}$ extracts each sliding local block in $\mathbb{R}^{C \times K_h \times K_w}$ from the zero-padded $c$, and the extracted blocks are united into one tensor in $\mathbb{R}^{C \times K_h \times K_w \times H \times W}$. (a3) In order to perform the standardization, we compute the mean and the standard deviation along the $K_h$ and $K_w$ dimensions of $\mathcal{U}(c)$. (a4) We then normalize the content feature by exploiting its local channel-wise statistics. That is, $\mathcal{F}$ performs a local normalization by using the statistics of each local patch. Note that the $H$ and $W$ dimensions of $\mathcal{U}(c)$ indicate the spatial coordinate from which each local patch is extracted, such that the number of patches is equivalent to $H \times W$. (a5) Finally, we obtain the patch-wise normalized feature $\mathcal{F}(c)$ in $\mathbb{R}^{C \times K_h \times K_w \times H \times W}$.
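Steps (a1)-(a5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names, the square odd kernel size `k`, the tensor layout `(C, k, k, H, W)`, and the small constant `eps` for numerical stability are all assumptions.

```python
import numpy as np

def unfold(c, k):
    """Collect the k x k neighborhood around every position of c.

    c: (C, H, W) -> (C, k, k, H, W); zero padding keeps H and W unchanged.
    Assumes k is odd.
    """
    C, H, W = c.shape
    p = k // 2
    padded = np.pad(c, ((0, 0), (p, p), (p, p)))  # zero padding on H and W only
    patches = np.empty((C, k, k, H, W), dtype=c.dtype)
    for i in range(k):
        for j in range(k):
            patches[:, i, j] = padded[:, i:i + H, j:j + W]
    return patches

def local_standardize(c, k=3, eps=1e-5):
    """Standardize every local patch per channel (zero mean, unit variance)."""
    patches = unfold(c, k)
    mean = patches.mean(axis=(1, 2), keepdims=True)  # statistics over the patch dims
    std = patches.std(axis=(1, 2), keepdims=True)
    return (patches - mean) / (std + eps)

rng = np.random.default_rng(0)
content = rng.standard_normal((2, 6, 6))
standardized = local_standardize(content, k=3)   # shape (2, 3, 3, 6, 6)
```

Because the statistics are computed per patch, the result keeps one standardized copy of each neighborhood instead of a single normalized map, which is the form the subsequent convolution consumes.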

Adaptive convolution layer.

$\mathcal{G}$ takes $\mathcal{F}(c)$ and $s$ as inputs and generates a stylized feature as output. Specifically, as illustrated in Fig. 3 (b1), $\mathcal{G}$ first takes the style feature $s$ as an input and (b2) encodes $s$ to the convolution weights $\mathcal{T}(s) \in \mathbb{R}^{C_{out} \times C \times K_h \times K_w}$, where $C_{out}$ is the number of output channels. Lastly, after arranging the weights to match the unfolded feature, we apply them in the form of the convolution operation and obtain the stylized feature (b3-b5). The adaptive convolution is formulated as

$$\mathcal{G}(\mathcal{F}(c), s)_{o,i,j} = \sum_{k=1}^{C} \sum_{m=1}^{K_h} \sum_{n=1}^{K_w} \mathcal{T}(s)_{o,k,m,n}\, \mathcal{F}(c)_{k,m,n,i,j}, \quad 1 \le i \le H,\ 1 \le j \le W,$$

where $\mathcal{T}$ represents a function that learns to properly encode a given style $s$ as the convolution weights. $i$ and $j$ indicate the vertical and the horizontal coordinates, respectively, and $H$ and $W$ are the height and the width of $c$, respectively. Lastly, we add the mean of the style to the stylized feature, which can be viewed as a bias in the convolution operation.
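The adaptive convolution can be sketched as a contraction between style-dependent filters and the standardized patches. The code below is an illustration under stated assumptions, not the authors' implementation: `proj` is a hypothetical single linear map standing in for the learned weight encoder, the patch layout is `(C, k, k, H, W)`, and the bias is taken to be the scalar mean of the style code.

```python
import numpy as np

def adaptive_conv(patches, s, proj, c_out):
    """Apply filters predicted from the style code to standardized patches.

    patches: (C, k, k, H, W) locally standardized content patches.
    s:       (D,) style code.
    proj:    (c_out * C * k * k, D) stand-in for the learned weight encoder.
    Returns a stylized feature of shape (c_out, H, W).
    """
    C, k, _, H, W = patches.shape
    weight = (proj @ s).reshape(c_out, C, k, k)        # style-dependent filters
    out = np.einsum("ocij,cijhw->ohw", weight, patches)
    return out + s.mean()                              # style mean used as a bias (assumed scalar)

rng = np.random.default_rng(0)
patches = rng.standard_normal((2, 3, 3, 4, 4))   # stand-in for standardized patches
style = rng.standard_normal(8)
proj = rng.standard_normal((3 * 2 * 3 * 3, 8))
stylized = adaptive_conv(patches, style, proj, c_out=3)   # shape (3, 4, 4)
```

Unlike a regular convolution layer, the filters here are recomputed for every exemplar, so two different style codes yield two different spatial filters for the same content.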

4 Experiments

This section describes the dataset and the baseline models we used for the experiments in Section 4.1. Subsequently, we discuss the comparison results with the baselines in Section 4.2. Lastly, we analyze our proposed method in detail in Section 4.3.

4.1 Experimental settings


We conduct evaluations with diverse datasets. First, we use the CelebA dataset liu2015faceattributes, a widely-used facial dataset involving multiple attributes. In order to construct a dataset with a large domain gap, we combine several attributes and form new domain pairs, such as (Male, Non-Bangs, Non-Smiling) ↔ (Female, Bangs, Smiling). Second, we use the BAM dataset Wilber_2017_ICCV, composed of numerous artworks labeled with their artistic style, such as watercolor and vector-graphic. We use Watercolor ↔ Pen, Vector ↔ Pen, and Oil ↔ Pen, in order to demonstrate that AdaCoN can perform image translation with a substantial domain difference. Finally, the Edges ↔ Handbag Zhu_2017 and Summer ↔ Winter isola2017image datasets are used to confirm the wide applicability of AdaCoN in diverse image translation tasks. We use the same image size in all the experiments.

Baseline methods

We compare our proposed method with AdaIN huang2017arbitrary, as exploited in MUNIT, and GDWCT cho2019image. The main difference among them lies in the specific method of combining the content feature with the style feature. As for our own settings, we explore various configurations by adjusting the hyperparameters, such as the kernel size and the style dimension of AdaCoN. We empirically set the kernel size to 3 and the style dimension to 128. The specific results for those hyperparameters are reported in Section 4.3.

Training details

For training the models, we exploit the Adam optimizer (kingma2014adam) with fixed momentum parameters $\beta_1$ and $\beta_2$. We adopt the initialization method of (he2015delving) for initializing our models. We set the batch size to one and the learning rate to 0.0001. We halve the learning rate every 50,000 iterations, starting from iteration 200,000. Every model exploited in the experiments is trained for 500,000 iterations on an NVIDIA TITAN Xp GPU for 90 hours.
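The learning-rate schedule described above can be written as a small function. This is a sketch under one assumption: the first halving is applied at iteration 200,000 itself (the text could also be read as the first halving happening at 250,000). The function name and parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=1e-4, start=200_000, every=50_000):
    """Step schedule: constant until `start`, then halved every `every` iterations.

    Assumes the first halving takes effect at iteration `start`.
    """
    if iteration < start:
        return base_lr
    n_decays = (iteration - start) // every + 1
    return base_lr * 0.5 ** n_decays

# Learning rate at a few points of a 500,000-iteration run.
schedule = {it: learning_rate(it) for it in (0, 200_000, 250_000, 499_999)}
```

Under this reading, the rate is halved six times by the end of training.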

Evaluation metric

In order to evaluate the methods, we measure the classification accuracy as well as the content distance using a pretrained Inception-v3 model szegedy2016rethinking. To be specific, the content distance is measured by computing the L2 distance between the features of the input images and those of the translated ones, taken from an intermediate layer of Inception-v3. A lower content distance indicates that the gap between them is relatively small. On the other hand, style injection is evaluated by the classification accuracy. This is because a well-trained image translation model transforms the domain of an input image, so that a higher classification accuracy shows that the translation model successfully generates the prominent characteristics of the target domain. For the classification model, we fine-tune the pretrained Inception-v3 on the CelebA dataset liu2015faceattributes. To evaluate the performance on the multi-attribute translation task, we train the classifiers with a multi-label dataset.
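Both metrics reduce to simple array operations once features and predictions are available. The sketch below is illustrative only: the function names are hypothetical, the feature extractor and classifier are assumed to be given, averaging the per-image L2 distance over the batch is an assumption, and counting a sample as correct only when every target attribute matches follows the description in Section 4.2.1.

```python
import numpy as np

def content_distance(feat_in, feat_out):
    """Mean L2 distance between input and translated features, each (N, D)."""
    return float(np.mean(np.linalg.norm(feat_in - feat_out, axis=1)))

def multi_attribute_accuracy(pred, target):
    """pred, target: (N, A) binary arrays.

    A sample counts as correct only if all A attributes are predicted correctly.
    """
    return float(np.mean(np.all(pred == target, axis=1)))

# Toy example: two images with 2-D features and two binary attributes.
dist = content_distance(np.array([[0.0, 0.0], [3.0, 4.0]]), np.zeros((2, 2)))
acc = multi_attribute_accuracy(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [1, 0]]))
```

With the all-attributes rule, accuracy drops quickly as more attributes are translated at once, which is worth keeping in mind when comparing single-attribute and multi-attribute columns.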

4.2 Baseline comparison

This section reports the comparison results of AdaCoN with other baseline methods. Quantitative results using the classification accuracy and the content distance are described in Section 4.2.1, and the qualitative results on the CelebA dataset liu2015faceattributes are reported in Section 4.2.2.

AdaIN 0.173/89.5 0.179/88.9 0.162/53.8 0.166/58.5 0.196/67.5 0.195/51.6 0.192/33.8 0.191/82.9
GDWCT 0.174/90.6 0.190/90.4 0.173/52.3 0.175/64.4 0.202/64.9 0.200/47.9 0.197/30.5 0.199/85.6
AdaCoN 0.186/91.6 0.184/90.0 0.202/62.3 0.202/66.5 0.193/67.7 0.197/57.5 0.199/36.5 0.201/86.7

Table 1: Content distance and overall classification accuracy (%). We bidirectionally calculate the metrics on the CelebA dataset liu2015faceattributes. Each cell reports the content distance and the overall classification accuracy, respectively, separated by a slash. The columns cover both directions of four attribute-set pairs: (Male) and (Female); (Young, Non-Smiling) and (Old, Smiling); (Non-Bald, Young, Eyeglasses) and (Bald, Old, Non-Eyeglasses); (Male, Non-Bangs, Non-Smiling) and (Female, Bangs, Smiling).

Figure 4: Comparisons with baselines on four translation cases (a)-(d).

4.2.1 Quantitative comparison

A translated output counts toward the classification accuracy only when it is correctly classified for every target attribute. As shown in Table 1, our model displays a higher classification accuracy than the baselines in seven of the eight translation cases. Moreover, the gap between AdaCoN and the baselines tends to be larger in the multi-attribute translation tasks than in the single-attribute one. We believe this is because the multi-attribute translation tasks demand more considerable style injection than the single-attribute translation. For example, in the case of (Male, Non-Bangs, Non-Smiling) → (Female, Bangs, Smiling), in order to translate an image to the target attributes, the translation networks must change the regions of the manly characteristics, the hair, and the mouth. On the other hand, the case of (Male) → (Female) requires changing only the regions of the manly characteristics, so the amount of change the task demands is relatively small. As for the content distance, even though AdaCoN obtains the highest content distance in most translation cases, the differences are small, indicating that AdaCoN still maintains the content identity. Considering that our objective is a strong reflection of the style, losing a small amount of content information is tolerable.

4.2.2 Qualitative comparison

Fig. 4 shows the comparison results of AdaCoN with the baselines on various attribute translation cases. The results demonstrate that AdaCoN reflects the style more strongly than the baselines. For example, in the case of (c) in the left macro column, whose target attributes are (Bald, Non-Eyeglasses, Old), AdaCoN applies the style of the exemplar to a considerable degree, such that the result of AdaCoN represents a bald and old man without eyeglasses. However, both AdaIN and GDWCT keep the hair even though the style of the exemplar includes the bald attribute. On the other hand, (a) in the right macro column, whose target attribute is Male, shows the difference in the amount of style reflection among the methods. Specifically, in order to transfer the style of a man, every method removes the make-up. Furthermore, AdaIN generates a beard while keeping the hair long. GDWCT incompletely removes the hair region, while AdaCoN clearly removes it. Since long hair is the dominant characteristic of the woman domain, the short hair in the output of AdaCoN verifies its superior performance in style reflection.

4.3 Additional analysis

Figure 5: Kernel size comparison and justification of standardization function. We perform experiments in order to explore the effects of the kernel size of AdaCoN and justify our standardization function. We exploit (Oil → Pen) and (Watercolor → Pen) of the BAM dataset Wilber_2017_ICCV in (a) and (b), respectively.

Effects of kernel size.

As shown in Fig. 5(a), the kernel size is relevant to spatial awareness. In the first row, the color of the hair on the chest of the woman in the content image differs from the color of the rest of her hair. Because a small receptive field is disadvantageous for recognizing the wide hair region, K3 fails to generate the hair on the chest naturally. On the other hand, K11 shows better results in generating the hair region because it has a larger receptive field. Furthermore, we observe that a larger kernel size leads to a larger amount of style reflection. For instance, the results of K11 reflect the style more strongly, so that they distort the eye and mouth of the content in the first row and show a more conspicuous texture in the second row, compared to the results of K3.

Effects of standardization function.

Fig. 5(b) shows the effects of the standardization function $\mathcal{F}$ of AdaCoN, including the results from a model trained without $\mathcal{F}$. As shown in the results in both rows, $\mathcal{F}$ plays an essential role in injecting a style because the model trained without the standardization function fails to perform a translation. We believe this is attributed to the conflicts of the style features between the content (input) and the style (exemplar) images. Specifically, the input image has both the content and the style features, so that if its style feature is not removed by $\mathcal{F}$, it conflicts with the style feature extracted from the exemplar and degrades the style reflection performance. As a consequence, the results demonstrate that our proposed standardization function based on local normalization is essential in AdaCoN.

Figure 6: Effects of style dimension and results from diverse datasets. (a) shows the translation between Male and Female. (b) is conducted with (b1): Pen → Watercolor, (b3): Winter → Summer, (b4): Oil → Pen, (b5): Summer → Winter, and (b2, b6): (Male, Smile, Straight-Hair, Big-Nose) → (Female, Non-smile, Wavy-Hair, Small-Nose).

Effects of style dimension and results on diverse datasets.

We compare the effects of the style dimension, which indicates the number of channels of the style branch. As discussed in the Appendix, the style dimension determines the extent of the style reflection in the output of AdaCoN. As illustrated in Fig. 6(a), the results verify that the amount of the style reflection is directly affected by the style dimension. For instance, (a1) shows that the hair region is clearly removed with a large style dimension, while it is largely kept with a small one. We further observe that a beard, the other dominant characteristic of a man, is instead transferred with the small style dimension. This shows that a low style dimension tends to translate the domain with the minimum change. That is, this result demonstrates that the style dimension has a positive correlation with the amount of the style reflection, such that it can be usefully exploited when attempting to control the extent of the style reflection. Meanwhile, in order to verify that AdaCoN can be exploited widely as well as robustly across diverse datasets, we conduct the experiment in Fig. 6(b). The results consistently show that AdaCoN can translate a given image with a rich style.

5 Conclusion

In this paper, we proposed a novel normalization method that can dramatically inject the style of a given exemplar in image translation. AdaCoN locally performs the standardization of the content representation in order to properly reflect the given style, and the adaptive convolution layer, whose weights are dynamically extracted from the style encoding, is applied to the standardized feature. We verify the superior performance of AdaCoN in drastic style injection through the experiments. We believe AdaCoN can be usefully exploited in diverse challenging image translation tasks that have a large gap between a source and a target domain, such as multi-attribute translation. Finally, AdaCoN can potentially be used to incorporate additional information in various other tasks such as object detection and semantic segmentation.


6 Appendix

6.1 Analysis on existing methods

In order to better understand the existing methods, this section reviews their principal operations and performs a comparative analysis of them.

6.1.1 Review on baselines

The previous methods are typically composed of two steps, of which the first step is to normalize the content feature, and the second step is to reflect the style feature in the normalized content. We formulate this procedure as $\mathcal{G}(\mathcal{F}(c), s)$, where $\mathcal{F}$ and $\mathcal{G}$ represent the standardization and the style injection function, respectively. From this point of view, AdaIN can be illustrated as

$$\mathcal{F}_{AdaIN}(c) = \frac{c - \mu(c)}{\sigma(c)}, \qquad \mathcal{G}_{AdaIN}(\mathcal{F}(c), s) = \sigma(s)\, \mathcal{F}(c) + \mu(s), \tag{10}$$

where $c, s \in \mathbb{R}^{C \times H \times W}$, and $H$ and $W$ are the height and the width of an input feature. Each channel is normalized and combined independently. $\sigma$ and $\mu$ respectively denote the standard deviation and the mean computed along the $H$ and $W$ dimensions. In Eq. (10), the function $\mathcal{F}_{AdaIN}$ normalizes an input content feature with the channel-wise mean and variance. On the other hand, the function $\mathcal{G}_{AdaIN}$ transfers the mean and the variance of the style $s$ to those of the normalized content $\mathcal{F}(c)$. Meanwhile, GDWCT can be represented as

$$\mathcal{F}_{WCT}(c) = (c - \mu(c))\, U_c \Lambda_c^{-1/2} U_c^{\top}, \qquad \mathcal{G}_{WCT}(\mathcal{F}(c), s) = \mathcal{F}(c)\, U_s \Lambda_s^{1/2} U_s^{\top} + \mu(s), \tag{11}$$

where $c, s \in \mathbb{R}^{HW \times C}$ and the matrices $\{U_c, \Lambda_c\}$ and $\{U_s, \Lambda_s\}$ can be obtained by the eigendecomposition of the channel covariance matrix of the content and the style features, respectively. Each of $U_c$ and $U_s$ indicates a square matrix composed of the eigenvectors, and $\Lambda_c$ and $\Lambda_s$ are diagonal matrices in which each diagonal entry indicates the eigenvalue of the corresponding eigenvector in $U_c$ and $U_s$. In Eq. (11), the function $\mathcal{F}_{WCT}$ plays a similar role to $\mathcal{F}_{AdaIN}$ in Eq. (10), but enforces a stricter rule: it normalizes not only the mean and the variance but also the covariance of an input feature by making its covariance matrix the identity matrix. As for the style injection function $\mathcal{G}_{WCT}$, it matches the first and the second-order statistics of the normalized content feature to those of the style feature.
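The whitening and coloring steps can be sketched with an explicit eigendecomposition. This is a minimal NumPy illustration, not the GDWCT implementation: features are laid out as rows of spatial positions with shape `(N, C)`, the function names are hypothetical, and the small `eps` added to the eigenvalues is an assumption for numerical stability.

```python
import numpy as np

def whiten(c, eps=1e-12):
    """Remove mean and covariance. c: (N, C), rows are spatial positions."""
    centered = c - c.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (c.shape[0] - 1)
    lam, U = np.linalg.eigh(cov)                     # cov = U diag(lam) U^T
    return centered @ U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T

def color(white, s):
    """Impose the mean and covariance of the style feature s: (M, C)."""
    mu = s.mean(axis=0, keepdims=True)
    centered = s - mu
    cov = centered.T @ centered / (s.shape[0] - 1)
    lam, U = np.linalg.eigh(cov)
    lam = np.clip(lam, 0.0, None)                    # guard against tiny negative values
    return white @ U @ np.diag(np.sqrt(lam)) @ U.T + mu

rng = np.random.default_rng(0)
content = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
style = rng.standard_normal((400, 3)) @ rng.standard_normal((3, 3)) + 2.0
whitened = whiten(content)
colored = color(whitened, style)
```

After whitening, the channel covariance is the identity; after coloring, the output shares the channel mean and covariance of the style feature, while each row is still a function of a single spatial position.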

6.1.2 Comparative analysis on baselines

The differences between the existing methods are clear when we regard those methods as special cases of the convolution operation. $\mathcal{G}_{AdaIN}$ in Eq. (10) can be represented as a depth-wise convolution with a bias, since the adaptive parameters $\sigma(s)$ and $\mu(s)$ scale and shift each channel identically over all spatial positions. Meanwhile, $\mathcal{G}_{WCT}$ in Eq. (11) can be viewed as a $1 \times 1$ convolution layer, of which the weights are $U_s \Lambda_s^{1/2} U_s^{\top}$ and the bias is $\mu(s)$. This is because the vector-matrix multiplication of a row vector of $\mathcal{F}(c)$ by the matrix generates a new row vector in $\mathbb{R}^{1 \times C}$. This is identical to a convolution operation whose kernel size is one. From this view, we can compare these style injection functions more closely. $\mathcal{G}_{AdaIN}$ can be expected to transfer the lowest amount of style as it injects the style per channel, such that it yields a relatively high consistency with the content compared to other methods. On the other hand, $\mathcal{G}_{WCT}$ can be thought of as a stronger combining method than $\mathcal{G}_{AdaIN}$ because it generates each output channel as a linear combination of all content feature channels. Even though GDWCT accomplishes more drastic changes of the style compared to AdaIN since it mixes the channel information of the content, we claim that even more dramatic changes can be achieved if the spatial information is simultaneously considered. Hence, we propose adaptive convolution-based normalization, whose weights are extracted from the style. We believe this can increase the transferring capacity for a given style.

6.2 Discussion on branch-separation

Fully exploiting the adaptive convolution-based normalization at the intermediate layers may cause considerable distortions of the content information because the spatial information as well as the channel information of the output features of AdaCoN is entirely different from those of the input features. Considering that one of the task objectives is maintaining an input identity, we posit that a combination of the adaptive convolution-based normalization with a general convolution layer is a reasonable choice for performing the translation. Moreover, by separating the branches, we can control the amount of the style injection by changing the style dimension, i.e., the number of channels of the style branch. That is, a small style dimension gives rise to a low injection of the style.

6.3 Additional results

Figs. 7 and 8 show the additional results of AdaCoN on the various image translation tasks.

Figure 7: Extra results of our model on CelebA dataset.
Figure 8: Extra results of our model on CelebA dataset.