Beyond Textures: Learning from Multi-domain Artistic Images for Arbitrary Style Transfer

05/25/2018 ∙ by Zheng Xu, et al. ∙ University of Maryland

We propose a fast feed-forward network for arbitrary style transfer, which can generate stylized images for previously unseen content and style image pairs. Besides the traditional content and style representation based on deep features and statistics for textures, we use adversarial networks to regularize the generation of stylized images. Our adversarial network learns the intrinsic property of image styles from large-scale multi-domain artistic images. The adversarial training is challenging because both the input and output of our generator are diverse multi-domain images. We use a conditional generator that stylizes content by shifting the statistics of deep features, and a conditional discriminator based on the coarse category of styles. Moreover, we propose a mask module to spatially decide the stylization level and stabilize adversarial training by avoiding mode collapse. As a side effect, our trained discriminator can be applied to rank and select representative stylized images. We qualitatively and quantitatively evaluate the proposed method, and compare it with recent style transfer methods.




1 Introduction

Image style transfer is a task that aims to render the content of one image with the style of another, which is important and interesting for both practical and scientific reasons. The style transfer techniques can be widely used in image processing applications such as mobile camera filters and artistic image generation. Furthermore, the study of style transfer often reveals the intrinsic property of images. Style transfer is challenging as it is difficult to explicitly separate and represent the content and style of an image.

In the seminal work of gatys2016image, the authors represent content with deep features extracted by a pre-trained neural network, and represent style with second-order statistics (i.e., the Gram matrix) of the deep features. They propose an optimization framework with the objective that the generated image has similar deep features to the given content image, and similar second-order statistics to the given style image. The generated results are visually impressive, but the optimization framework is far too slow for real-time applications. Later works johnson2016perceptual; ulyanov2017improved train a feed-forward network to replace the optimization framework for fast stylization, with a loss similar to gatys2016image. However, they need to train a network for each style image and cannot generalize to unseen images. More recent approaches huang2017arbitrary; li2017universal tackle arbitrary style transfer for unseen content and style images, but still represent style with second-order statistics of deep features. The second-order statistics used for style representation were originally designed for textures gatys2015texture, and previous methods treat style transfer as texture transfer.

Another line of research considers style transfer as conditional image generation, and applies adversarial networks to train an image-to-image translation network isola2016image; taigman2016unsupervised; zhu2017unpaired; huang2018multimodal. The trained image translation networks can transfer images from one domain to another, for example, from natural images to sketches. However, they cannot be applied to arbitrary style transfer, where the input images come from multiple domains.

In this paper, we combine the best of both worlds by adversarially training a single feed-forward network for arbitrary style transfer. We introduce several techniques to tackle the challenging problem of adversarial training from multi-domain data. In adversarial training, the generator (stylization network) and the discriminator are alternately updated. Both our generator and discriminator are conditional networks. The generator is trained to fool the discriminator, as well as to satisfy the content and style representation similarity to the inputs. Our generator is built upon a state-of-the-art network for arbitrary style transfer huang2017arbitrary, which is conditioned on both the content image and the style image, and uses adaptive instance normalization (AdaIN) to combine the two inputs. AdaIN shifts the mean and variance of the deep features of the content image to match those of the style image. Our discriminator is conditioned on the coarse domain categories, and is trained to distinguish the generated images from real images of the same style category.

Compared with previous arbitrary style transfer methods, our approach uses the discriminator to learn a data-driven representation for styles. The combined loss for our generator considers both instance-level information from the style loss and category-level information from adversarial training. Compared with previous adversarial training methods, our approach handles multi-domain inputs by using a conditional generator designed for arbitrary style transfer and a conditional discriminator. Moreover, we propose a mask module to automatically control the level of stylization by predicting a mask to blend the stylized features and the content features. Finally, we use the trained discriminator to rank and find the representative generated images in each style category. We release our code and model at

2 Related work

Style transfer. We briefly review the neural style transfer methods, and recommend jing2017neural for a more comprehensive review. gatys2016image proposed the first neural style transfer method based on an optimization framework, which uses deep features to represent content and the Gram matrix to represent style. The optimization framework was replaced by a feed-forward network to achieve real-time performance in johnson2016perceptual; ulyanov2016texture; wang2017multimodal. ulyanov2017improved showed that instance normalization is particularly effective for training a fast style transfer network. Other works focused on controlling spatial, color, and stroke factors for stylization gatys2017controlling; frigo2016split; jing2018stroke, and on exploring other style representations such as mean and variance li2017demystifying, histogram wilmot2017stable, patch-based MRF li2016combining, and patch-based GAN li2016precomputed. Compared with gatys2016image, these fast style transfer methods sometimes compromise on visual quality, and need to train one network for each style.

Various methods have been proposed to train a single feed-forward network for multiple styles. dumoulin2016learned proposed conditional instance normalization, which learns the affine parameters for each style image. chen2017stylebank learned a “style bank”, which contains several layers of filters for each style. zhang2017multi proposed comatch layers for multi-style transfer. These methods work only with a limited number of styles, and cannot be applied to an unseen style image.

More recent approaches are designed for arbitrary style transfer, where both the content and the style inputs can be unseen images. ghiasi2017exploring extended conditional instance normalization (IN) by training a separate network to predict the affine parameters of IN. shen2018style learned a meta network to predict filters in the transformation networks. huang2017arbitrary proposed adaptive instance normalization (AdaIN), which adjusts the mean and variance of the content image to match those of the style image. li2017universal; li2018closed used feature whitening and coloring transforms (WCT) to match the statistics of the content image to the style image. sheng2018avatar proposed feature decoration, which generalizes AdaIN and WCT. Note that the optimization framework gatys2016image and patch-based non-parametric methods (e.g., style swap chen2016fast, deep image analogy liao2017visual, and deep feature reshuffle gu2018arbitrary) can also be applied to arbitrary style transfer, but these methods can be much slower. zhang2017separating proposed to separate style and content and then combine them with a bilinear layer, which requires a set of content and style images as input and has limited applications. Our approach is the first to explore adversarial training for arbitrary style transfer.

Generative adversarial networks (GANs). GANs have been widely studied for image generation and manipulation tasks since goodfellow2014generative. elgammal2017can applied GANs to generate artistic images. isola2016image used conditional adversarial networks to learn the loss for image-to-image translation, which was extended by several concurrent methods zhu2017unpaired; kim2017learning; yi2017dualgan; liu2017unsupervised that explored cycle-consistent losses when training data are unpaired. Later works improved the diversity of generated images by considering the multimodality of data zhu2017toward; almahairi2018augmented; huang2018multimodal. Similar techniques have been applied to specific image-to-image translation tasks such as image dehazing yang2018towards, face to cartoon taigman2016unsupervised; royer2017xgan, and font style transfer azadi2017multi. These methods successfully train a transformation network from one image domain to another. However, they cannot handle multi-domain input and output images, and it is known to be difficult to generate images with large variance chen2016infogan; odena2016conditional; miyato2018cgans. Our approach adopts a conditional generator and a conditional discriminator to tackle the multi-domain input and output for arbitrary style transfer.

Figure 1: Proposed network: (left) encoder-decoder as generator; (right) pre-trained VGG as encoder. The decoder architecture is symmetric to the encoder. We use the conventional texture loss based on pre-trained encoder features, and adversarially train the mask module, decoder, and discriminator.

3 Proposed method

We use an encoder-decoder architecture as our transformation network, and use the convolutional layers of the pre-trained VGG net simonyan2014very; xu2018effectiveness as our encoder to extract the deep features. We add skip connections and concatenate the features from different levels of convolutional layers as the output feature of the encoder. We adopt adaptive instance normalization (AdaIN) huang2017arbitrary to adjust the first and second order statistics of the deep features. Furthermore, we generate spatial masks to automatically adjust the stylization level. Our transformation network is a conditional generator inspired by the state-of-the-art network for arbitrary style transfer. Our network is trained with perceptual loss for content representation, Gram loss for style representation as in gatys2016image; johnson2016perceptual; ulyanov2016texture, as well as the adversarial loss to capture the common style information beyond textures from a style category. We show the proposed network in figure 1, and provide details in the following sections.

3.1 Network architecture

Our encoder uses the convolutional layers of the VGG net simonyan2014very pre-trained on the ImageNet large-scale image classification task. The VGG net contains five blocks of convolutional layers, and we adopt the first three blocks and the first convolutional layer of the fourth block. Each block contains convolutional layers with ReLU activation krizhevsky2012imagenet, and the width (number of channels) and size (height and width) of the convolutional layers are shown in figure 1. There is a max-pooling layer of stride two between blocks, and the width of the convolutional layers is doubled after the downsampling by max-pooling. We concatenate the features from the first convolutional layer of each block as the output of the encoder. These skip connections help to transfer style captured by both high-level and low-level features, and make training easier by smoothing the loss surface of neural networks.


Our decoder is designed to be almost symmetric to the VGG encoder: it has four blocks, with transposed convolutional layers between blocks for upsampling. We add LeakyReLU he2015delving and batch normalization ioffe2015batch to each convolutional layer for effective adversarial training radford2015unsupervised. The decoder is trained from scratch.

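Adaptive instance normalization (AdaIN) has been shown to be effective for image style transfer huang2017arbitrary. AdaIN shifts the mean and variance of deep features of content to match style with no learnable parameters. Let $x, y \in \mathbb{R}^{N \times C \times H \times W}$ represent the features of a convolutional layer from a minibatch of content and style images, where $N$ is the batch size, $C$ is the width of the layer (number of channels), and $H$ and $W$ are height and width, respectively. $x_{nchw}$ denotes the element at height $h$, width $w$ of the $c$th channel from the $n$th sample, and the AdaIN layer can be written as,

$$\mathrm{AdaIN}(x, y)_{nchw} = \sigma_{nc}(y)\,\frac{x_{nchw} - \mu_{nc}(x)}{\sigma_{nc}(x)} + \mu_{nc}(y), \qquad (1)$$

where $\mu_{nc}(x) = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{nchw}$, $\sigma_{nc}^{2}(x) = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} \left(x_{nchw} - \mu_{nc}(x)\right)^{2} + \epsilon$, $\epsilon$ is a very small constant, and $\mu_{nc}(x)$ and $\sigma_{nc}^{2}(x)$ represent the mean and variance for the $c$th channel of the $n$th sample of feature $x$.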
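As a minimal sketch of the AdaIN operation described above, the following NumPy code normalizes each channel of each content sample and rescales it with the style statistics. The function name `adain`, the value of `eps`, and the random test tensors are assumptions made for illustration; this is not the released implementation.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Shift per-sample, per-channel statistics of x to match those of y.

    x, y: arrays of shape (N, C, H, W). The spatial sizes of x and y may
    differ, since only per-channel statistics of y are used.
    """
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    var_x = x.var(axis=(2, 3), keepdims=True)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    var_y = y.var(axis=(2, 3), keepdims=True)
    # eps (assumed value) keeps the division stable for flat channels.
    normalized = (x - mu_x) / np.sqrt(var_x + eps)
    return normalized * np.sqrt(var_y + eps) + mu_y

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(2, 3, 8, 8))  # stand-in content features
style = rng.normal(5.0, 2.0, size=(2, 3, 8, 8))    # stand-in style features
out = adain(content, style)
```

After the call, each channel of `out` has (up to the effect of `eps`) the mean and variance of the corresponding style channel, while the spatial pattern still comes from the content features.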

The mask module in our network contains a few convolutional layers operating on the concatenation of the content feature $x$ and the style feature $y$. The output is a spatial soft mask $m$ that has the same size as feature $x$, with each value between $0$ and $1$. The generated mask is used to control the stylization level by linearly combining the AdaIN feature and the original content feature as the input of the decoder,

$$z = m \odot \mathrm{AdaIN}(x, y) + (1 - m) \odot x, \qquad (2)$$

where the element-wise operations are used for combining these features.
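The blending step can be sketched as follows. We assume here that the mask weights the stylized (AdaIN) feature and its complement weights the content feature; the direction of the blend, the function names, and the use of a sigmoid on random logits as a stand-in for the mask module are all assumptions for illustration.

```python
import numpy as np

def blend_with_mask(content_feat, stylized_feat, mask):
    """Element-wise blend: mask = 1 keeps the stylized feature,
    mask = 0 keeps the original content feature (assumed convention)."""
    return mask * stylized_feat + (1.0 - mask) * content_feat

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
content_feat = rng.normal(size=(1, 4, 6, 6))
stylized_feat = rng.normal(size=(1, 4, 6, 6))
# Stand-in for the mask module output: arbitrary logits squashed into (0, 1).
mask = sigmoid(rng.normal(size=(1, 4, 6, 6)))
decoder_input = blend_with_mask(content_feat, stylized_feat, mask)
```

Because the blend is a convex combination at every position, each value of `decoder_input` lies between the corresponding content and stylized values, which is what lets the mask lower the stylization level locally.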

Our discriminator is a patch-based network inspired by isola2016image. To handle the multi-domain images for arbitrary style transfer, our discriminator is conditioned on the style category labels. Inspired by AC-GAN odena2016conditional, our discriminator predicts the style category and distinguishes real images from fake images at the same time. We also adopt the projection discriminator miyato2018cgans to make sure the style category conditioning is not ignored.
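To make the two kinds of conditioning concrete, here is a small NumPy sketch of discriminator output heads on top of shared patch features: an auxiliary classification head in the spirit of AC-GAN, and a real/fake head with a projection term (an inner product between the features and a learned category embedding). How the actual network combines these heads is not specified in the text, so the structure, parameter names, and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_categories, feat_dim, num_patches = 9, 16, 25  # illustrative sizes

# Hypothetical parameters; in a real model these would be learned.
class_embedding = rng.normal(size=(num_categories, feat_dim))  # projection term
w_real = rng.normal(size=feat_dim)                             # unconditional real/fake head
w_class = rng.normal(size=(feat_dim, num_categories))          # auxiliary classifier head

def discriminator_heads(patch_features, category):
    """patch_features: (num_patches, feat_dim) from a shared patch-based trunk."""
    unconditional = patch_features @ w_real
    projection = patch_features @ class_embedding[category]
    real_fake_logits = unconditional + projection  # one logit per patch
    class_logits = patch_features @ w_class        # (num_patches, num_categories)
    return real_fake_logits, class_logits

features = rng.normal(size=(num_patches, feat_dim))
rf, cls = discriminator_heads(features, category=2)
rf_other, cls_other = discriminator_heads(features, category=3)
```

In this sketch the real/fake logits change when the category changes, because the category enters through the projection term, while the classification logits depend only on the image features.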

3.2 Adversarial training

We alternately update the generator (mask module and decoder) and the discriminator during training, and apply the prediction optimizer yadav2017stabilizing to stabilize the training.
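The effect of a prediction step on alternating updates can be illustrated on a toy saddle-point problem, L(u, v) = u * v, where u descends and v ascends. This is only a sketch of the general idea (extrapolate one player's update before the other player moves); the toy objective, step size, and the choice to predict only u are assumptions, and the code is unrelated to the actual training script.

```python
def run(steps, lr, use_prediction):
    """Alternating gradient descent (on u) and ascent (on v) for L(u, v) = u * v.

    Returns the distance of the final iterate from the saddle point (0, 0).
    """
    u, v = 1.0, 1.0
    for _ in range(steps):
        u_next = u - lr * v  # dL/du = v
        # Prediction step: extrapolate u along its last update before v moves.
        u_seen = u_next + (u_next - u) if use_prediction else u_next
        v = v + lr * u_seen  # dL/dv = u
        u = u_next
    return (u * u + v * v) ** 0.5

dist_plain = run(steps=2000, lr=0.1, use_prediction=False)
dist_pred = run(steps=2000, lr=0.1, use_prediction=True)
print(dist_plain, dist_pred)
```

On this toy problem the plain alternating updates keep circling the saddle point, whereas the updates with prediction spiral in toward it.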

Generator update. Our generator takes a content image and a style image as input, and outputs the stylized image. The generator is updated by minimizing a loss that combines the adversarial loss $\mathcal{L}_{adv}$, the style classification loss $\mathcal{L}_{cls}$, the content loss $\mathcal{L}_{c}$, and the style loss $\mathcal{L}_{s}$,

$$\mathcal{L}_{G} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{c}\mathcal{L}_{c} + \lambda_{s}\mathcal{L}_{s}, \qquad (3)$$

where $\lambda_{adv}, \lambda_{cls}, \lambda_{c}, \lambda_{s}$ are hyperparameters for the weights of different losses. Let us denote the feature map of the $l$th layer in our encoder as $\phi_{l}$, the input content and style images as $x_{c}$ and $x_{s}$, the generator network as $G$, and the discriminator network as $D$. When the discriminator is fixed, the output stylized images $G(x_{c}, x_{s})$ aim to fool the discriminator, and also to be classified into the same style category as the input style image.

$\mathcal{L}_{adv}$ and $\mathcal{L}_{cls}$ are learned losses that capture the category-level style of images from the training data. We also use the traditional content and style losses based on deep features and the Gram matrix.

We use the deep feature from the fourth block of the pre-trained VGG net for content representation, and use the Gram matrix from all the blocks for style representation. We find norm is more stable than when combining with the adversarial loss.
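The feature-based terms and the weighted combination can be sketched in NumPy as below. The Gram normalization, the use of a mean absolute difference as the distance, the feature shapes, and all weight values are assumptions for illustration; the adversarial and classification terms are passed in as plain numbers because they come from the discriminator.

```python
import numpy as np

def gram_matrix(feat):
    """feat: (C, H, W) -> (C, C) channel co-activation matrix (assumed normalization)."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def l1(a, b):
    # Mean absolute difference; the choice of norm here is an assumption.
    return np.abs(a - b).mean()

def content_loss(out_feat, content_feat):
    return l1(out_feat, content_feat)

def style_loss(out_feats, style_feats):
    # Sum of Gram-matrix distances over several feature levels.
    return sum(l1(gram_matrix(o), gram_matrix(s)) for o, s in zip(out_feats, style_feats))

def generator_loss(adv, cls, content, style, weights):
    return (weights["adv"] * adv + weights["cls"] * cls
            + weights["content"] * content + weights["style"] * style)

rng = np.random.default_rng(0)
shapes = [(4, 16, 16), (8, 8, 8), (16, 4, 4)]  # stand-in feature levels
out_feats = [rng.normal(size=s) for s in shapes]
style_feats = [rng.normal(size=s) for s in shapes]
content_feat = rng.normal(size=shapes[-1])

weights = {"adv": 1.0, "cls": 1.0, "content": 1.0, "style": 1.0}  # hypothetical values
total = generator_loss(adv=0.7, cls=0.3,
                       content=content_loss(out_feats[-1], content_feat),
                       style=style_loss(out_feats, style_feats),
                       weights=weights)
```

Note that the Gram matrix is unchanged if the spatial positions of a feature map are shuffled, which is one way to see why it behaves as a texture statistic rather than a description of layout.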

Figure 2: Benefits of adversarial training and mask module. We show the encoder-decoder network with adversarial training only, mask module only, and the combination of adversarial training and mask module. Mask module only does not improve the visual quality of generated images, which have artifacts and undesired textures. GAN only can generate collapsed images with corrupted eyes and noses.

Discriminator update. Our discriminator is conditioned on the style category to handle the multi-domain generated images, inspired by chen2016infogan; odena2016conditional; miyato2018cgans; xu2018training. When the generator is fixed, the discriminator is adversarially trained to distinguish the generated images from the real style images.

Discriminator for ranking. The adversarially trained discriminator characterizes the real style images, and hence can be used to rank the generated images. We rank the stylized images based on the likelihood score predicted by the discriminator.
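A ranking of this kind reduces to sorting by score. The sketch below assumes that a patch-based discriminator yields several scores per image and that they are averaged into one score; the aggregation, the function names, and the numbers are illustrative assumptions.

```python
def image_score(patch_scores):
    """Average per-patch scores into one score per image (assumed aggregation)."""
    return sum(patch_scores) / len(patch_scores)

def rank_images(names, patch_scores_per_image, top_k=None):
    """Return image names sorted from highest to lowest score."""
    scored = [(image_score(p), name) for name, p in zip(names, patch_scores_per_image)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [name for _, name in scored]
    return ranked if top_k is None else ranked[:top_k]

names = ["stylized_a", "stylized_b", "stylized_c", "stylized_d"]
patch_scores = [[0.2, 0.4], [0.9, 0.8], [0.1, 0.1], [0.6, 0.7]]  # made-up values
top_two = rank_images(names, patch_scores, top_k=2)
```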

3.3 Ablation study

The encoder-decoder architecture and adaIN module have been shown to be effective in previous work huang2017arbitrary. We use visual examples to show the importance of mask module and adversarial training in the proposed method in figure 2. We present results from adversarially trained network without mask module, network with mask module but trained without adversarial loss, and the proposed method. When trained without adversarial loss, the network produces visually similar results with or without mask module as the network is over-parameterized.

Figure 3: Qualitative evaluation for style transfer. We show examples of transferring photos to seven different styles. AdaIN and WCT generate artifacts and undesired textures. Gatys’ results are more visually appealing, but the optimization is slow, and it is hard to choose the parameter that controls the stylization level. Our method efficiently generates clean and stylized images.

Method                    vector art  3D graphics  comic   graphite  oil paint  pen ink  water color  all
AdaIN huang2017arbitrary  0.2849      0.2029       0.2314  0.1277    0.3018     0.2151   0.2118       0.2199
WCT li2017universal       0.1134      0.1957       0.2066  0.4754    0.3350     0.2868   0.4409       0.3001
Ours                      0.6017      0.6014       0.5620  0.3969    0.3632     0.4981   0.3473       0.4800
Table 1: Quantitative evaluation for style transfer. Our method is preferred by human annotators and outperforms baselines.

Our adversarial training significantly improves the visual quality of the generated images in general. The block effects and many other artifacts are removed through adversarial training, which makes the generated images look more “natural”. Moreover, the data-driven discriminator learns to distinguish foreground and background well; adversarial training cleans the background and adds more details to the foreground. Our mask module controls the stylization level at different spatial locations of the image, which significantly improves the stylization of salient components like the eyes, nose, and mouth of a face. The salient regions are repeatedly captured by the deep features from high-level layers, which can make them difficult to handle when adjusting the statistics of the features. By controlling the stylization level, the mask module prevents over-stylization of salient regions, and also helps adversarial training by relieving the mode collapse of salient regions.

4 Experiments

We qualitatively and quantitatively evaluate the proposed method with experiments. We extensively use the Behance dataset wilber2017bam for training and testing. Behance wilber2017bam is a large-scale dataset of artistic images, which contains coarse category labels for content and style. We use the seven media labels in Behance as style categories: vector art, 3D graphics, comic, graphite, oil paint, pen ink, and water color. We create four subsets from the Behance images for face, bird, car, and building. Our face dataset is created by running a face detector on a subset of images with people as the content label, and contains roughly 15,000 images for each style. The other three are created by selecting the top 5,000 ranked images of each media for the content, respectively. We add the Describable Textures Dataset (DTD) cimpoi14describing as another style category to improve the robustness of our method. We add natural images as both content images and an extra style for each subset. Specifically, we use Labeled Faces in the Wild (LFW) LFWTech, the first 16,000 images of the CelebA dataset liu2015faceattributes, the Caltech-UCSD Birds dataset WelinderEtal2010, the Cars dataset KrauseStarkDengFei-Fei_3DRR2013, and the Oxford Buildings dataset Philbin07. In total, we have nine style categories in our data. We split both content and style images into training/testing sets, and use unseen testing images for our evaluation. The total numbers of training/testing images are 122,247 / 11,325 for face, 35,000 / 3,505 for bird, 36,940 / 3,700 for car, and 34,374 / 3,444 for building.

We train the network on face images, and then fine-tune it on bird, car, and building. We use Adam optimizer with prediction method yadav2017stabilizing with learning rate and parameter . We train the network with batch size 56 for 150 epochs and linearly decrease the learning rate after 60 epochs. It takes about 8 hours to complete on a workstation with 4 GPUs. We set all weights in our combined loss (3) as except for for the style loss. The weights are chosen so that different components of the loss have similar numerical scales. The training code and pre-trained model in PyTorch are released in
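The learning-rate schedule mentioned above (constant for the first 60 epochs, then a linear decrease until epoch 150) can be sketched as a small function. The base learning rate is left as a parameter, and we assume that the rate decays to zero at the final epoch and that epochs are counted from zero; both are assumptions, as is the function name.

```python
def learning_rate(epoch, base_lr, total_epochs=150, decay_start=60):
    """Constant rate before decay_start, then linear decay to zero at total_epochs.

    The end value of zero and zero-based epoch counting are assumptions.
    """
    if epoch < decay_start:
        return base_lr
    remaining = total_epochs - epoch
    return base_lr * remaining / (total_epochs - decay_start)

# base_lr = 1.0 is a placeholder so the output reads as a fraction of the base rate.
schedule = [learning_rate(e, base_lr=1.0) for e in (0, 59, 60, 105, 150)]
print(schedule)
```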

We compare with arbitrary style transfer methods: the optimization framework of neural style transfer (Gatys) gatys2016image, and two state-of-the-art methods, adaptive instance normalization (AdaIN) huang2017arbitrary and feature transformation (WCT) li2017universal. Note that our approach, AdaIN, and WCT use feed-forward networks for style transfer, which are much faster than the Gatys method.

Figure 4: Qualitative evaluation for general objects. This task is more difficult for our GAN-based method because the training data is noisier, especially for bird images with large diversity. Our method can generate clean backgrounds, detailed foregrounds, and better stylized strokes.

Method                    vector art  3D graphics  comic   graphite  oil paint  pen ink  water color  all
AdaIN huang2017arbitrary  0.2119      0.2703       0.3089  0.3260    0.2778     0.3944   0.3654       0.3203
WCT li2017universal       0.4503      0.4865       0.3740  0.1547    0.4383     0.2310   0.1731       0.3145
Ours                      0.3377      0.2432       0.3171  0.5193    0.2840     0.3746   0.4615       0.3652
Table 2: Quantitative evaluation for style transfer of building. Different methods are competitive for different styles. The overall performance of our method is better.

Figure 5: Qualitative evaluation for style ranking.

4.1 Evaluation of style transfer

We qualitatively compare our approach with previous arbitrary style transfer methods, and present some results in figure 3. We show seven pairs of content and style images from our face dataset; the style images are from the testing sets of vector art, 3D graphics, comic, graphite, oil paint, pen ink, and water color, respectively. For the Gatys method gatys2016image, we tune the weight parameter, and select the best visual results from either Adam or BFGS as the optimizer. For AdaIN huang2017arbitrary and WCT li2017universal, we use their released best models. The content and style images are from a separate testing set that has not been seen by our approach or the baseline methods.

The Gatys method gatys2016image is sensitive to parameter and optimizer settings. We may get results that are not stylized enough even after parameter tuning due to the difficulty of optimization. AdaIN huang2017arbitrary often over-stylizes the content image, creates undesirable artifacts, and sometimes changes the semantics of the content image. WCT li2017universal suffers from severe block effects and artifacts. The previous methods all create texture-like artifacts because of the texture-based style representation. For example, the stylized images of baselines in the first column of figure 3 have stride artifacts. Our approach generates more visually appealing results, with clean backgrounds and vivid foregrounds, that are more consistent with the style of the input.

We conduct a user study on Amazon Mechanical Turk, and present quantitative results in table 1. We compare with the two recent fast style transfer methods in this study. We randomly select 10 content images and 10 style images from each Behance style category to generate 700 testing pairs. For each pair, we show the stylized images by our approach, AdaIN huang2017arbitrary, and WCT li2017universal, and ask 10 users to select the best result. We remove the unreliable results that are labeled too quickly, and show the preference (click) ratio for different style categories. WCT li2017universal performs well on graphite and water color, where the style images themselves are visually not “clean”. Our approach achieves the best results in the other five categories and is overall the most favorable.
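The per-category reading of table 1 can be checked with a few lines of Python. The numbers are copied from table 1; the variable names and the short method labels are ours.

```python
categories = ["vector art", "3D graphics", "comic", "graphite",
              "oil paint", "pen ink", "water color", "all"]
table1 = {
    "AdaIN": [0.2849, 0.2029, 0.2314, 0.1277, 0.3018, 0.2151, 0.2118, 0.2199],
    "WCT":   [0.1134, 0.1957, 0.2066, 0.4754, 0.3350, 0.2868, 0.4409, 0.3001],
    "Ours":  [0.6017, 0.6014, 0.5620, 0.3969, 0.3632, 0.4981, 0.3473, 0.4800],
}

# Method with the highest preference ratio in each column.
winners = {cat: max(table1, key=lambda m: table1[m][i])
           for i, cat in enumerate(categories)}

# Preference ratios of the three methods should sum to one in each column.
column_sums = [sum(table1[m][i] for m in table1) for i in range(len(categories))]

for cat in categories:
    print(f"{cat}: {winners[cat]}")
```

The ratios in every column sum to one, as expected when each vote goes to exactly one of the three methods.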

Figure 6: Ranking stylized images by our discriminator.

4.2 Evaluation of style transfer for general objects

We evaluate the performance of the proposed approach on general objects beyond faces. Specifically, we test on bird, car, and building. In figure 4, we show the stylized images generated by our network trained on face (Ours), as well as fine-tuned for each object (Ours-FT). Our network trained on face generalizes well, and generates images that look comparable to, if not better than, those of the baseline methods. Fine-tuning on bird does not help the performance. The adversarial training may be too difficult for bird because the given training style images are noisy and diverse. Fine-tuning on car and building brings more details to the foreground object of our generated images. The training images of car and building are also noisy and diverse, but these objects are more structured than birds. We show more results on general object tasks in the supplementary material.

We conduct the user study for building images and report results in table 2. Our approach achieves good results for graphite and water color because of the clean background in our generated images. For the other categories, our results are comparable with baselines. Our overall performance is still the best.

4.3 Evaluation for style ranking

Figure 7: Qualitative evaluation for style transfer on texture-centric cases from previous papers. Our method generates stylized images with clean backgrounds, which are visually competitive with the previous methods that target only texture transfer.

We apply the trained discriminator to rank the generated images for a style category. Figure 5 shows the top five generated images obtained by stylizing with all the testing images in comic style. The stylized images are generated by our network, and ranked by our discriminator, a style classifier, and random selection, respectively. The style classifier uses the same network architecture as our discriminator and the same training data as our method. The hyperparameters are tuned to achieve the best style classification accuracy on a separate validation dataset, which makes the style classifier a strong baseline. Our generator network produces good results, and even randomly selected images look acceptable. The top selected results of our discriminator are more diverse, and more consistent with the comic style because of the adversarial training.

Figure 6 shows more ranked images by our discriminator at top, in the middle, and at the bottom for two content images stylized by images from two categories. The top ranked results are more visually appealing, and more consistent with the style category.

Finally, we conduct a user study to compare the ranking performance of our discriminator and the baseline classifier. We generate images by stylizing ten content images with all the testing images for each of the seven Behance styles, and rank the 70 sets of results. We compare the rank of each generated image by the discriminator and by the classifier, and select five images that are ranked higher by our discriminator, and five images that are ranked higher by the baseline classifier. We show the ten images to ten users and ask them to select five images for each set. The preference ratio of our discriminator is comparing to of classifier. We beat a strong baseline in a highly subjective and challenging evaluation.
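One way to pick the images on which the two rankers disagree is to compare rank positions and take the largest gaps in each direction. The sketch below follows that reading; selecting by largest rank gap is our assumption about the procedure, and the function names and scores are made up.

```python
def rank_positions(scores):
    """Map item index -> rank position (0 = best) under descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {item: pos for pos, item in enumerate(order)}

def pick_disagreements(disc_scores, clf_scores, k=5):
    disc_rank = rank_positions(disc_scores)
    clf_rank = rank_positions(clf_scores)
    # Positive gap: the discriminator ranks the image higher (smaller position).
    gap = {i: clf_rank[i] - disc_rank[i] for i in range(len(disc_scores))}
    by_gap = sorted(gap, key=lambda i: gap[i], reverse=True)
    favored_by_disc = by_gap[:k]
    favored_by_clf = by_gap[-k:]
    return favored_by_disc, favored_by_clf

# Made-up scores for 12 stylized images: the two rankers order them in reverse.
disc_scores = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
clf_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
by_disc, by_clf = pick_disagreements(disc_scores, clf_scores, k=5)
```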

5 Supplemental experiments

In this section, we present supplemental experiments to show some side effects of the proposed method. We first demonstrate that our method can be applied to previous style transfer test cases, which focus on transferring the textures of the style image. We then show that the proposed method can be applied to destylization and generates images that look more realistic than those of the baselines.

5.1 Examples for general style transfer

In figure 7, we evaluate on test cases from previous style transfer papers. The style images have rich texture information, and the content images vary from face to building. Our network is trained on our face dataset described in section 4. Our network generalizes well and produces results comparable to, if not better than, those of the baselines. In particular, our approach often generates clean backgrounds without undesired artifacts.

Figure 8: Qualitative evaluation for destylization.

5.2 Destylization

We show that if we also use artistic images as content images during training, the exact same architecture can be used to destylize images (figure 8). Destylization is a difficult task because we use only one network to destylize diverse artistic images. The training also becomes much more difficult, as the number of pairs grows quadratically with the number of samples. Though there is still room to improve, our adversarial training and network architecture look promising given the limited training time. The last row in figure 8 also suggests that our network can transfer the style of photorealistic images, which is difficult for the baselines.

6 Conclusion and discussion

We propose a feed-forward network that uses adversarial training to enhance the performance of arbitrary style transfer. We use both a conditional generator and a conditional discriminator to tackle multi-domain input and output. Our generator is inspired by the recent progress in arbitrary style transfer, and our discriminator is inspired by the recent progress in generative adversarial networks. Our approach combines the best of both worlds. We propose a mask module that helps in both adversarial training and style transfer. Moreover, we show that our trained discriminator can be used to select representative stylized images, which has been a long-standing problem.

Previous style transfer and GAN-based image translation methods target only one domain, such as transferring the style of oil paint, or transforming from natural images to sketches. We systematically study the style transfer problem on a large-scale dataset of diverse artistic images. We can train one network to generate images in different styles, such as comic, graphite, oil paint, water color, and vector art. Our approach generates more visually appealing results than previous style transfer methods, but there is still room to improve. For example, transferring an image into 3D graphics with the arbitrary style transfer network is still challenging.