Universal Style Transfer via Feature Transforms

05/23/2017 ∙ by Yijun Li, et al. ∙ Adobe, University of California, Merced

Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward based methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded in an image reconstruction network. The whitening and coloring transforms reflect a direct matching of the feature covariance of the content image to that of a given style image, which shares a similar spirit with the optimization of the Gram matrix based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures via simple feature coloring.




1 Introduction

Style transfer is an important image editing task which enables the creation of new artistic works. Given a pair of examples, i.e., the content and style image, it aims to synthesize an image that preserves some notion of the content but carries characteristics of the style. The key challenge is how to extract effective representations of the style and then match them in the content image. The seminal work by Gatys et al. GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 shows that the correlation between deep features, i.e., the Gram matrix or covariance matrix (shown to be as effective as the Gram matrix in Me-2017-diversified ), extracted by a trained deep neural network has a remarkable ability to capture visual styles. Since then, significant efforts have been made to synthesize stylized images by minimizing Gram/covariance matrix based loss functions, through either iterative optimization GatysTransfer-CVPR2016 or trained feed-forward networks Texturenet-ICML2016 ; Perceptual-ECCV2016 ; Me-2017-diversified ; MSRA-2017-stylebank ; GoogleMultiTexture-2016 . Despite the recent rapid progress, these existing works often trade off among generalization, quality and efficiency: optimization-based methods can handle arbitrary styles with pleasing visual quality but at the expense of high computational cost, while feed-forward approaches can be executed efficiently but are limited to a fixed number of styles or compromised visual quality.

To date, the problem of universal style transfer remains a daunting task, as it is challenging to develop neural networks that achieve generalization, quality and efficiency at the same time. The main issue is how to properly and effectively apply the extracted style characteristics (feature correlations) to content images in a style-agnostic manner.

In this work, we propose a simple yet effective method for universal style transfer, which enjoys style-agnostic generalization with only marginally compromised visual quality and execution efficiency. The transfer task is formulated as an image reconstruction process in which the content features are transformed at intermediate layers, in the midst of the feed-forward pass, with regard to the statistics of the style features. At each intermediate layer, our main goal is to transform the extracted content features such that they exhibit the same statistical characteristics as the style features of the same layer. We find that the classic signal whitening and coloring transforms (WCTs) on these features achieve this goal in an almost effortless manner.

In this work, we first employ the VGG-19 network VGG-2014 as the feature extractor (encoder), and train a symmetric decoder to invert the VGG-19 features to the original image, which is essentially an image reconstruction task (Figure 1(a)). Once trained, both the encoder and the decoder are fixed throughout all experiments. To perform style transfer, we apply WCT to one layer of content features such that its covariance matrix matches that of the style features, as shown in Figure 1(b). The transformed features are then fed into the downstream decoder layers to obtain the stylized image. In addition to this single-level stylization, we further develop a multi-level stylization pipeline, as depicted in Figure 1(c), where we apply WCT sequentially to multiple feature layers. The multi-level algorithm generates stylized images of higher visual quality, comparable to or better than those of optimization-based methods at much lower computational cost. We also introduce a control parameter that defines the degree of style transfer so that users can choose the balance between stylization and content preservation. The entire procedure of our algorithm only requires learning the image reconstruction decoder, with no style images involved. Given a new style, we simply extract its feature covariance matrices and apply them to the content features via WCT. Note that this learning-free scheme is fundamentally different from existing feed-forward networks that require learning with pre-defined styles and fine-tuning for new styles. Therefore, our approach is able to achieve style transfer universally.

The main contributions of this work are summarized as follows:

  • We propose to use feature transforms, i.e., whitening and coloring, to directly match content feature statistics to those of a style image in the deep feature space.

  • We couple the feature transforms with a pre-trained general encoder-decoder network, such that the transferring process can be implemented by simple feed-forward operations.

  • We demonstrate the effectiveness of our method for universal style transfer with high-quality visual results, and also show its application to universal texture synthesis.

Figure 1: Universal style transfer pipeline. (a) Reconstruction: we first pre-train five decoder networks DecoderX (X=1,2,…,5) through image reconstruction to invert different levels of VGG features. (b) Single-level stylization: with both VGG and DecoderX fixed, and given the content image $I_c$ and style image $I_s$, our method performs style transfer through whitening and coloring transforms. (c) Multi-level stylization: we extend single-level to multi-level stylization in order to match the statistics of the style at all levels. The result obtained by matching the higher-level statistics of the style is treated as the new content to continue matching the lower-level information of the style.

2 Related Work

Existing style transfer methods are mostly example-based Hertz-2001-analogy ; shih2013data ; shih2014style ; Frigo-2016-CVPR ; MSRA-2017-visual . The image analogy method Hertz-2001-analogy aims to determine the relationship between a pair of images and then apply it to stylize other images. As they are based on finding dense correspondence, analogy-based approaches shih2013data ; shih2014style ; Frigo-2016-CVPR ; MSRA-2017-visual often require that a pair of images depict the same type of scene. Therefore these methods do not scale well to the setting of arbitrary style images.

Recently, Gatys et al. GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 proposed an algorithm for arbitrary stylization based on matching the correlations (Gram matrix) between deep features extracted by a trained classification network within an iterative optimization framework. Numerous methods have since been developed to address different aspects, including speed Texturenet-ICML2016 ; MGAN-ECCV2016 ; Perceptual-ECCV2016 , quality InstanceBN-arxiv-2016 ; MrfTransfer-CVPR2016 ; HistoGramLoss-2017 ; Wang-2016-highres , user control Gatys2016-control , diversity Texturenet-2017-V2 ; Me-2017-diversified , semantic understanding Frigo-2016-CVPR ; Doodle-2016-semantic and photorealism Luan-2017-photorealism . It is worth mentioning that one of the major drawbacks of GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 is the inefficiency due to the optimization process. The improved efficiency in Texturenet-ICML2016 ; MGAN-ECCV2016 ; Perceptual-ECCV2016 is realized by formulating the stylization as learning a feed-forward image transformation network. However, these methods are limited by the requirement of training one network per style due to the lack of generalization in the network design.

Most recently, a number of methods have been proposed to empower a single network to transfer multiple styles, including a model conditioned on binary selection units Me-2017-diversified , a network that learns a set of new filters for every new style MSRA-2017-stylebank , and a novel conditional normalization layer that learns normalization parameters for each style GoogleMultiTexture-2016 . To achieve arbitrary style transfer, Chen et al. Chen-2016-swap first propose to locally swap the content feature with the closest style feature. Meanwhile, inspired by GoogleMultiTexture-2016 , two follow-up works Wang-2017-zeroshortArbitrary ; Ghiasi-2017-BMVC instead learn a general mapping from the style image to style parameters. The most closely related work Huang-2017-arbitrary directly adjusts the content feature to match the mean and variance of the style feature. However, the generalization ability of these learned models on unseen styles is still limited.

Different from existing methods, our approach performs style transfer efficiently in a feed-forward manner while achieving generalization and visual quality on arbitrary styles. Our approach is closely related to Huang-2017-arbitrary , where the content feature in a particular (higher) layer is adaptively instance-normalized by the mean and variance of the style feature. This step can be viewed as a sub-optimal approximation of the WCT operation, thereby leading to less effective results on both training and unseen styles. Moreover, our encoder-decoder network is trained solely for image reconstruction, while Huang-2017-arbitrary requires learning such a module specifically for the stylization task. We evaluate the proposed algorithm against existing approaches extensively on both style transfer and texture synthesis tasks and present in-depth analysis.

3 Proposed Algorithm

We formulate style transfer as an image reconstruction process coupled with feature transformation, i.e., whitening and coloring. The reconstruction part is responsible for inverting features back to the RGB space, and the feature transformation matches the statistics of a content image to those of a style image.

3.1 Reconstruction decoder

We construct an auto-encoder network for general image reconstruction. We employ VGG-19 VGG-2014 as the encoder, fix it, and train a decoder network simply to invert the VGG features to the original image, as shown in Figure 1(a). The decoder is designed to be symmetric to the VGG-19 network (up to the Relu_X_1 layer), with nearest neighbor upsampling layers used for enlarging feature maps. To evaluate features extracted at different layers, we select feature maps at five layers of VGG-19, i.e., Relu_X_1 (X=1,2,3,4,5), and train five decoders accordingly. The pixel reconstruction loss Doso-NIPS2016-Generation and feature loss Perceptual-ECCV2016 ; Doso-NIPS2016-Generation are employed for reconstructing an input image,

$$L = \| I_o - I_i \|_2^2 \;+\; \lambda \, \| \Phi(I_o) - \Phi(I_i) \|_2^2 , \qquad (1)$$

where $I_i$, $I_o$ are the input image and reconstruction output, and $\Phi$ is the VGG encoder that extracts the Relu_X_1 features. In addition, $\lambda$ is the weight to balance the two losses. After training, the decoder is fixed (i.e., will not be fine-tuned) and used as a feature inverter.
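As a concrete illustration, below is a minimal PyTorch sketch of a decoder that mirrors VGG-19 up to Relu_3_1, together with the reconstruction objective of (1). The layer sizes follow VGG-19's conv blocks, but the exact architecture, padding scheme and loss reduction are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

# Sketch of a decoder that inverts Relu_3_1 features back to RGB; nearest-neighbor
# upsampling undoes the two pooling layers preceding conv3_1 in VGG-19.
decoder3 = nn.Sequential(
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='nearest'),           # undo pool2
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='nearest'),            # undo pool1
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, 3, padding=1),                         # back to RGB
)

def reconstruction_loss(encoder, decoder, image, lam=1.0):
    """Pixel + feature loss of Eq. (1). `encoder` is assumed to return the
    Relu_X_1 feature map Phi(.); `lam` is the balancing weight lambda."""
    feat = encoder(image)                                   # Phi(I_i)
    recon = decoder(feat)                                   # I_o
    pixel_loss = ((recon - image) ** 2).mean()
    feature_loss = ((encoder(recon) - feat) ** 2).mean()
    return pixel_loss + lam * feature_loss
```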

3.2 Whitening and coloring transforms

Given a pair of content image $I_c$ and style image $I_s$, we first extract their vectorized VGG feature maps $f_c \in \mathbb{R}^{C \times H_c W_c}$ and $f_s \in \mathbb{R}^{C \times H_s W_s}$ at a certain layer (e.g., Relu_4_1), where $H_c$, $W_c$ ($H_s$, $W_s$) are the height and width of the content (style) feature, and $C$ is the number of channels. The decoder will reconstruct the original image $I_c$ if $f_c$ is directly fed into it. We next propose to use a whitening and coloring transform to adjust $f_c$ with respect to the statistics of $f_s$. The goal of WCT is to directly transform $f_c$ to match the covariance matrix of $f_s$. It consists of two steps, i.e., the whitening and coloring transforms.
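For reference, the following sketch shows one way to extract and vectorize the Relu_4_1 feature maps into the $C \times H W$ form used below, using torchvision's VGG-19 (preprocessing details omitted for brevity; the helper name and the exact preprocessing are assumptions, not the authors' released code).

```python
import torch
from PIL import Image
import torchvision.models as models
import torchvision.transforms.functional as TF

# In torchvision's VGG-19, relu4_1 is module index 20 of .features, so keep [:21].
# Note: input normalization is omitted here; it should match the VGG weights used.
vgg_to_relu4_1 = models.vgg19(pretrained=True).features[:21].eval()  # older API;
                                                                     # newer torchvision uses weights=

def vectorized_features(path):
    img = TF.to_tensor(Image.open(path).convert('RGB')).unsqueeze(0)  # 1 x 3 x H x W
    with torch.no_grad():
        feat = vgg_to_relu4_1(img)                 # 1 x C x H' x W'
    c, h, w = feat.shape[1:]
    return feat.squeeze(0).reshape(c, h * w)       # vectorized: C x (H'W')

f_c = vectorized_features('content.jpg')           # content feature f_c
f_s = vectorized_features('style.jpg')             # style feature f_s
```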

Whitening transform.

Before whitening, we first center $f_c$ by subtracting its mean vector $m_c$. Then we transform $f_c$ linearly as in (2) so that we obtain $\hat{f}_c$ whose feature maps are uncorrelated ($\hat{f}_c \hat{f}_c^{\top} = I$),

$$\hat{f}_c = E_c \, D_c^{-\frac{1}{2}} \, E_c^{\top} \, f_c , \qquad (2)$$

where $D_c$ is a diagonal matrix with the eigenvalues of the covariance matrix $f_c f_c^{\top} \in \mathbb{R}^{C \times C}$, and $E_c$ is the corresponding orthogonal matrix of eigenvectors, satisfying $f_c f_c^{\top} = E_c D_c E_c^{\top}$.
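A minimal PyTorch sketch of the whitening step in (2), assuming the vectorized $C \times H W$ features from above; the small epsilon added to the covariance and the clamping of eigenvalues are numerical-stability details, not part of the equation.

```python
import torch

def whiten(f_c, eps=1e-5):
    """Whitening transform of Eq. (2). f_c: C x (H*W) content feature."""
    c = f_c.shape[0]
    m_c = f_c.mean(dim=1, keepdim=True)
    f_c = f_c - m_c                                        # center by the mean vector
    cov = f_c @ f_c.t() + eps * torch.eye(c)               # f_c f_c^T (plus eps*I for stability)
    d_c, e_c = torch.linalg.eigh(cov)                      # eigenvalues D_c, eigenvectors E_c
    inv_sqrt = e_c @ torch.diag(d_c.clamp_min(eps).rsqrt()) @ e_c.t()  # E_c D_c^{-1/2} E_c^T
    return inv_sqrt @ f_c                                  # whitened feature \hat{f}_c
```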

To validate what is encoded in the whitened feature $\hat{f}_c$, we invert it to the RGB space with our previous decoder trained for reconstruction only. Figure 2 shows two visualization examples, which indicate that the whitened features still maintain the global structures of the image contents while much of the style-related information is removed. We note especially that, for the Starry_night example on the right, the detailed stroke patterns across the original image are gone. In other words, the whitening step helps peel off the style from an input image while preserving the global content structure. The outcome of this operation is ready to be transformed with the target style.

Figure 2: Inverting whitened features. We invert the whitened VGG Relu_4_1 feature as an example. Left: original images, Right: inverted results (pixel intensities are rescaled for better visualization). The whitened features still maintain global content structures.

Figure 3: Comparisons between different feature transform strategies (panels: Style, Content, HM, WCT). Results are obtained with our multi-level stylization framework in order to match all levels of information of the style.
Coloring transform.

We first center $f_s$ by subtracting its mean vector $m_s$, and then carry out the coloring transform WCT-2016 , which is essentially the inverse of the whitening step: we transform $\hat{f}_c$ linearly as in (3) such that we obtain $\hat{f}_{cs}$, which has the desired correlations between its feature maps ($\hat{f}_{cs} \hat{f}_{cs}^{\top} = f_s f_s^{\top}$),

$$\hat{f}_{cs} = E_s \, D_s^{\frac{1}{2}} \, E_s^{\top} \, \hat{f}_c , \qquad (3)$$

where $D_s$ is a diagonal matrix with the eigenvalues of the covariance matrix $f_s f_s^{\top} \in \mathbb{R}^{C \times C}$, and $E_s$ is the corresponding orthogonal matrix of eigenvectors. Finally, we re-center $\hat{f}_{cs}$ with the mean vector $m_s$ of the style, i.e., $\hat{f}_{cs} = \hat{f}_{cs} + m_s$.
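The coloring step in (3) mirrors the whitening sketch above; again a hedged sketch operating on the vectorized features, with a stability epsilon added to the style covariance.

```python
import torch

def color(f_c_hat, f_s, eps=1e-5):
    """Coloring transform of Eq. (3). f_c_hat: whitened content feature,
    f_s: C x (H*W) style feature. Returns the re-centered, colored feature."""
    c = f_s.shape[0]
    m_s = f_s.mean(dim=1, keepdim=True)
    f_s = f_s - m_s                                        # center the style feature
    cov_s = f_s @ f_s.t() + eps * torch.eye(c)             # f_s f_s^T
    d_s, e_s = torch.linalg.eigh(cov_s)                    # eigenvalues D_s, eigenvectors E_s
    sqrt = e_s @ torch.diag(d_s.clamp_min(eps).sqrt()) @ e_s.t()  # E_s D_s^{1/2} E_s^T
    return sqrt @ f_c_hat + m_s                            # color, then re-center with m_s
```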

To demonstrate the effectiveness of WCT, we compare it with a commonly used feature adjustment technique, i.e., histogram matching (HM), in Figure 3. The channel-wise histogram matching method Gonzalez-DIP determines a mapping function such that the mapped $f_c$ has the same cumulative histogram as $f_s$. Figure 3 shows that the HM method transfers the global color of the style image well but fails to capture salient visual patterns, e.g., patterns are broken into pieces and local structures are misrepresented. In contrast, our WCT captures patterns that better reflect the style image. This can be explained by the fact that the HM method does not consider the correlations between feature channels, which are exactly what the covariance matrix is designed to capture.

After the WCT, we may blend $\hat{f}_{cs}$ with the content feature $f_c$ as in (4) before feeding it to the decoder in order to provide user control over the strength of the stylization effect:

$$\hat{f}_{cs} = \alpha \, \hat{f}_{cs} + (1 - \alpha) \, f_c , \qquad (4)$$

where $\alpha$ serves as the style weight for users to control the transfer effect.
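Putting the pieces together, single-level stylization with the blending control of (4) reduces to a few lines. The sketch below builds on the `whiten` and `color` helpers above; `decoder4` is a hypothetical decoder trained to invert Relu_4_1 features, and the reshape bookkeeping is assumed.

```python
def wct(f_c, f_s, alpha=1.0):
    """Whitening-coloring transform with the blending of Eq. (4).
    alpha = 1 gives full stylization, alpha = 0 returns the content feature."""
    f_cs = color(whiten(f_c), f_s)
    return alpha * f_cs + (1.0 - alpha) * f_c

# Single-level usage (shapes assumed): transform the Relu_4_1 content feature,
# fold it back to 1 x C x H x W, and invert it with the corresponding decoder.
# stylized = decoder4(wct(f_c, f_s, alpha=0.8).reshape(1, c, h, w))
```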

Figure 4: Single-level stylization using different VGG features (Relu_1_1 through Relu_5_1). The content image is from Figure 2.

Figure 5: (a)-(c) Intermediate results of our coarse-to-fine multi-level stylization framework in Figure 1(c). The style and content images are from Figure 4. The result in (c) is the final output of our multi-level pipeline. (d) Reversed fine-to-coarse multi-level pipeline.

3.3 Multi-level coarse-to-fine stylization

Based on the single-level stylization framework shown in Figure 1(b), we use different layers of VGG features Relu_X_1 (X=1,2,…,5) and show the corresponding stylized results in Figure 4. Higher-layer features capture more complicated local structures, while lower-layer features carry more low-level information (e.g., colors). This can be explained by the increasing receptive field size and feature complexity along the network hierarchy. Therefore, it is advantageous to use features at all five layers to fully capture the characteristics of a style from low to high levels.

Figure 1(c) shows our multi-level stylization pipeline. We start by applying the WCT to Relu_5_1 features to obtain a coarse stylized result and regard it as the new content image to further adjust features in lower layers. Intermediate results are shown in Figure 5(a)-(c) with obvious differences, which indicates that the higher-layer features first capture salient patterns of the style and the lower-layer features further improve details. If we reverse the feature processing order (i.e., fine-to-coarse) by starting with Relu_1_1, low-level information cannot be preserved after manipulating the higher-level features, as shown in Figure 5(d).
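A sketch of the coarse-to-fine loop, assuming per-layer `encoders[k]` / `decoders[k]` callables that extract and invert Relu_k_1 features as 1 x C x H x W tensors, and the `wct` helper from Section 3.2; these names are illustrative, not the released interface.

```python
def multi_level_stylize(content_img, style_img, encoders, decoders, alpha=1.0):
    """Coarse-to-fine multi-level stylization of Figure 1(c) (sketch)."""
    img = content_img
    for k in (5, 4, 3, 2, 1):                        # Relu_5_1 down to Relu_1_1
        feat_c = encoders[k](img)                    # current content feature
        feat_s = encoders[k](style_img)              # style feature at the same layer
        _, c, h, w = feat_c.shape
        f_cs = wct(feat_c.reshape(c, h * w),
                   feat_s.reshape(feat_s.shape[1], -1), alpha)
        img = decoders[k](f_cs.reshape(1, c, h, w))  # output becomes the new content
    return img
```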

4 Experimental Results

4.1 Decoder training

For the multi-level stylization approach, we separately train five reconstruction decoders for features at the VGG-19 Relu_X_1 (X=1,2,…,5) layers. The decoders are trained on the Microsoft COCO dataset COCO-lin2014-microsoft , and the weight $\lambda$ balancing the two losses in (1) is set to 1.

Figure 6: Results from different style transfer methods: (a) Style, (b) Chen-2016-swap , (c) Huang-2017-arbitrary , (d) TNet Texturenet-ICML2016 , (e) GatysTransfer-CVPR2016 , (f) Ours. The content images are from Figures 2-3. We evaluate various styles including paintings, abstract styles, and styles with obvious texton elements. We adjust the style weight of each method to obtain the best stylized effect. For our results, we use a fixed style weight $\alpha$ in (4) for all examples.
                 Chen et al. Chen-2016-swap | Huang et al. Huang-2017-arbitrary | TNet Texturenet-ICML2016 | DeepArt GatysTransfer-CVPR2016 | Ours
Arbitrary:       ✓ | ✓ | ✗ | ✓ | ✓
Efficient:       ✓ | ✓ | ✓ | ✗ | ✓
Learning-free:   ✗ | ✗ | ✗ | ✓ | ✓
Table 1: Differences between our approach and other methods.

4.2 Style transfer

To demonstrate the effectiveness of the proposed algorithm, we list the differences from existing methods in Table 1 and present stylized results in Figure 6. We adjust the style weight of each method to obtain the best stylized effect. The optimization-based method GatysTransfer-CVPR2016 handles arbitrary styles but is likely to encounter unexpected local minima (e.g., 5th and 6th rows of Figure 6(e)). Although the method Texturenet-ICML2016 greatly improves stylization speed, it trades quality and generality for efficiency, generating repetitive patterns that overlay the image contents (Figure 6(d)).

Closest to our work in terms of generalization are the recent methods Chen-2016-swap ; Huang-2017-arbitrary , but the quality of their stylized results is less appealing. The method of Chen-2016-swap replaces the content feature with the most similar style feature based on patch similarity and hence has limited capability: the content is strictly preserved while the style is not well reflected, with only low-level information (e.g., colors) transferred, as shown in Figure 6(b). In Huang-2017-arbitrary , the content feature is simply adjusted to have the same mean and variance as the style feature, which is not effective in capturing high-level representations of the style. Even when learned with a set of training styles, it does not generalize well to unseen styles. The results in Figure 6(c) indicate that the method in Huang-2017-arbitrary is not effective at capturing and synthesizing salient style patterns, especially for complicated styles with rich local structures and non-smooth regions.

Figure 6(f) shows the stylized results of our approach. Without learning any style, our method is able to capture visually salient patterns in style images (e.g., the brick wall on the 6th row). Moreover, key components in the content images (e.g., bridge, eye, mouth) are also well stylized in our results, while other methods only transfer patterns to relatively smooth regions (e.g., sky, face). The models and code are available at https://github.com/Yijunmaverick/UniversalStyleTransfer.

                              Chen et al. Chen-2016-swap | Huang et al. Huang-2017-arbitrary | TNet Texturenet-ICML2016 | Gatys et al. GatysTransfer-CVPR2016 | Ours
log(covariance difference):   7.4 | 7.0 | 6.8 | 6.7 | 6.3
Preference/%:                 15.7 | 24.9 | 12.7 | 16.4 | 30.3
Time/sec:                     2.1 | 0.20 | 0.18 | 21.2 | 0.83
Table 2: Quantitative comparisons between different stylization methods in terms of the covariance matrix difference, user preference and run-time, tested on same-sized images with a 12GB TITAN X.
Figure 7: Controlling the stylization scale and weight.

Figure 8: Spatial control in transferring, which enables users to edit the content with different styles: (a) Content, (b) Different masks and styles, (c) Our result.

In addition, we quantitatively evaluate different methods by computing the covariance matrix difference between the stylized results and the given style image on all five levels of VGG features. We randomly select 10 content images from COCO-lin2014-microsoft and 40 style images from Wikipainting-BMVC2014 , compute the averaged difference over all styles, and show the results in Table 2 (1st row). The quantitative results show that our stylized results have a lower covariance difference, i.e., they are closer to the statistics of the style.

User study.

Evaluating artistic style transfer remains an open question in the community. Since qualitative assessment is highly subjective, we conduct a user study to evaluate the 5 methods shown in Figure 6. We use 5 content images and 30 style images, and generate 150 results per method, one for each content/style pair. We randomly select 15 style images for each subject to evaluate. We display the stylized images of the 5 compared methods side-by-side on a webpage in random order. Each subject is asked to vote for their single favorite result for each style. We collect feedback from 80 subjects, for a total of 1,200 votes, and show the percentage of votes each method received in Table 2 (2nd row). The study shows that our method receives the most votes for better stylized results. Developing evaluation metrics based on human visual perception for general image synthesis problems is an interesting direction for future work.

Efficiency.

In Table 2 (3rd row), we also compare our approach with other methods in terms of efficiency. The method by Gatys et al. GatysTransfer-CVPR2016 is slow due to its optimization loop and usually requires at least 500 iterations to generate good results. The methods Texturenet-ICML2016 and Huang-2017-arbitrary are efficient as they rely on a single feed-forward pass through a trained network. The approach Chen-2016-swap is feed-forward based but relatively slower because the feature swapping operation needs to be carried out for thousands of patches. Our approach is also efficient but slightly slower than Texturenet-ICML2016 ; Huang-2017-arbitrary because of the eigenvalue decomposition step in WCT. Note, however, that the computational cost of this step does not increase with the image size because the dimension of the covariance matrix only depends on the number of channels, which is at most 512 (Relu_5_1). Currently the decomposition step is implemented on the CPU; our future work includes a more efficient GPU implementation of the proposed algorithm.

User Controls.

Given a content/style pair, our approach is not only as simple as one-click transferring, but also flexible enough to accommodate different user requirements by providing several controls on the stylization, including scale, weight and spatial control. The style input at different scales leads to different extracted statistics due to the fixed receptive field of the network; the scale control is therefore easily achieved by adjusting the style image size. In the middle of Figure 7, we show two examples where the brick texture is transferred at either a small or a large scale. The weight control refers to balancing stylization against content preservation. As shown on the right of Figure 7, our method offers this flexibility within simple feed-forward passes by adjusting the style weight $\alpha$ in (4). In contrast, GatysTransfer-CVPR2016 and Texturenet-ICML2016 require a new round of time-consuming optimization or model training to obtain results at different weight settings. Moreover, our blending operates directly in the deep feature space before inversion/reconstruction, which is fundamentally different from GatysTransfer-CVPR2016 ; Texturenet-ICML2016 , where the blending is formulated as a weighted sum of the content and style losses that may not always lead to a good balance point.

Spatial control is also highly desired when users want to edit an image with different styles transferred to different parts of the image. Figure 8 shows an example of spatially controlled stylization. A set of masks (Figure 8(b)) is additionally required as input to indicate the spatial correspondence between content regions and styles. By replacing the content feature $f_c$ in (3) with $M \odot f_c$, where $\odot$ is a simple mask-out operation, we are able to stylize the specified region only.
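A sketch of how such spatial control could be implemented on top of the `wct` helper: each region's mask is resized to the feature resolution and used to select the columns of $f_c$ that are whitened and colored with that region's style. The mask interface (one binary map per style) is an assumption for illustration, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def spatial_wct(f_c, styles_and_masks, feat_hw, alpha=1.0):
    """Spatially controlled WCT (sketch). f_c: C x (H*W) content feature;
    styles_and_masks: list of (f_s, mask) pairs, where mask is a 2-D binary map
    marking the region that should receive that style; feat_hw: (H, W) of the
    content feature map."""
    h, w = feat_hw
    out = f_c.clone()
    for f_s, mask in styles_and_masks:
        m = F.interpolate(mask[None, None].float(), size=(h, w), mode='nearest')
        m = m.reshape(-1).bool()                    # flatten to length H*W
        out[:, m] = wct(f_c[:, m], f_s, alpha)      # stylize only the masked columns
    return out
```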

Figure 9: Texture synthesis. In each panel, Left: original textures, Right: our synthesized results. Texture images are mostly from the Describable Textures Dataset (DTD) DTDtexture-CVPR2016 .

4.3 Texture synthesis

By setting the content image to a random noise image (e.g., Gaussian noise), our stylization framework can be easily applied to texture synthesis. An alternative is to directly initialize $\hat{f}_c$ in (3) to be white noise; both approaches achieve similar results. Figure 9 shows a few examples of the synthesized textures. We empirically find that it is better to run the multi-level pipeline a few times (e.g., 3) to obtain more visually pleasing results.
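In code, texture synthesis is just the stylization pipeline fed with noise. A usage sketch reusing the hypothetical `multi_level_stylize`, `encoders` and `decoders` from above; `texture_img` is the texture example loaded as a 1 x 3 x H x W tensor, and the noise statistics are chosen arbitrarily.

```python
import torch

# Texture synthesis (sketch): treat Gaussian noise as the "content" image and run
# the multi-level pipeline a few times, feeding each output back in as content.
noise = (torch.randn(1, 3, 256, 256) * 0.1 + 0.5).clamp(0, 1)  # arbitrary noise image
synthesized = noise
for _ in range(3):                                 # a few passes, as noted above
    synthesized = multi_level_stylize(synthesized, texture_img,
                                      encoders, decoders, alpha=1.0)
```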

Our method is also able to synthesize interpolations between two textures. Given two texture examples $T_1$ and $T_2$, we first perform the WCT on the input noise and obtain the transformed features $\hat{f}_{cs_1}$ and $\hat{f}_{cs_2}$, respectively. We then blend these two features and feed the combined feature into the decoder to generate mixed effects. Note that our interpolation operates directly in the deep feature space. By contrast, the method in GatysTransfer-CVPR2016 generates the interpolation by matching the weighted sum of the Gram matrices of the two textures at the loss end. Figure 10 shows that the result of GatysTransfer-CVPR2016 is simply an overlay of the two textures, while our method generates new textural effects, e.g., bricks in the stripe shape.
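A sketch of the feature-space interpolation: the same whitened noise feature is colored with each texture's statistics, and the two colored features are linearly blended before decoding. The weight name `beta`, the use of Relu_5_1 only, and the reuse of `noise`, `texture1`, `texture2`, `encoders` and `decoders` from the previous sketches are illustrative assumptions.

```python
# Interpolating two textures in deep feature space (sketch).
feat = encoders[5](noise)                          # Relu_5_1 feature of the noise input
_, c, h, w = feat.shape
f_hat = whiten(feat.reshape(c, h * w))             # whiten the noise feature once
f_cs1 = color(f_hat, encoders[5](texture1).reshape(c, -1))  # color with texture 1 stats
f_cs2 = color(f_hat, encoders[5](texture2).reshape(c, -1))  # color with texture 2 stats
beta = 0.5                                         # interpolation weight
blended = beta * f_cs1 + (1.0 - beta) * f_cs2      # blend before decoding
interpolated = decoders[5](blended.reshape(1, c, h, w))
```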

Figure 10: Interpolation between two texture examples. Left: original textures, Middle: our interpolation results, Right: interpolated results of GatysTransfer-CVPR2016 . The blending weight controls the interpolation between the two textures.

Figure 11: Comparisons of diverse synthesized results between TNet Texturenet-ICML2016 and our model.

One important aspect of texture synthesis is diversity. By sampling different noise images, our method can generate diverse synthesized results for each texture. While Texturenet-ICML2016 can generate different results driven by the input noise, the learned networks are very likely to be trapped in local optima: the noise is marginalized out and thus fails to drive the network to generate large visual variations. In contrast, our approach reflects each input noise better, because the network is unlikely to absorb the variations in the input noise since it is never trained on textures. We compare the diverse outputs of our model with Texturenet-ICML2016 in Figure 11. Note that a common diagonal layout is shared across the different results of Texturenet-ICML2016 , which leads to less satisfying visual results. The comparison shows that our method achieves diversity in a more natural and flexible manner.

5 Concluding Remarks

In this work, we propose a universal style transfer algorithm that does not require learning for each individual style. By unfolding the image generation process via an auto-encoder trained for image reconstruction, we integrate the whitening and coloring transforms into the feed-forward passes to match the statistical distributions and correlations between the intermediate features of content and style. We also present a multi-level stylization pipeline, which takes all levels of information of a style into account, for improved results. In addition, the proposed approach is shown to be equally effective for texture synthesis. Experimental results demonstrate that the proposed algorithm achieves favorable performance against the state-of-the-art methods in generalizing to arbitrary styles.

Acknowledgments

This work is supported in part by NSF CAREER Grant #1149783 and gifts from Adobe and NVIDIA.

References