Style transfer is an important image editing task which enables the creation of new artistic works. Given a pair of examples, i.e., the content and style image, it aims to synthesize an image that preserves some notion of the content but carries characteristics of the style. The key challenge is how to extract effective representations of the style and then match them in the content image. The seminal work by Gatys et al. GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 shows that the correlation between features, i.e., the Gram matrix or covariance matrix (shown to be as effective as the Gram matrix in Me-2017-diversified ), extracted by a trained deep neural network has a remarkable ability to capture visual styles. Since then, significant efforts have been made to synthesize stylized images by minimizing loss functions based on Gram/covariance matrices, through either iterative optimization GatysTransfer-CVPR2016 or trained feed-forward networks Texturenet-ICML2016 ; Perceptual-ECCV2016 ; Me-2017-diversified ; MSRA-2017-stylebank ; GoogleMultiTexture-2016 . Despite the recent rapid progress, these existing works often trade off between generalization, quality and efficiency: optimization-based methods can handle arbitrary styles with pleasing visual quality but at the expense of high computational costs, while feed-forward approaches can be executed efficiently but are limited to a fixed number of styles or compromised visual quality.
To date, the problem of universal style transfer remains a daunting task as it is challenging to develop neural networks that achieve generalization, quality and efficiency at the same time. The main issue is how to properly and effectively apply the extracted style characteristics (feature correlations) to content images in a style-agnostic manner.
In this work, we propose a simple yet effective method for universal style transfer, which enjoys the style-agnostic generalization ability with marginally compromised visual quality and execution efficiency. The transfer task is formulated as image reconstruction processes, with the content features being transformed at intermediate layers with regard to the statistics of the style features, in the midst of feed-forward passes. In each intermediate layer, our main goal is to transform the extracted content features such that they exhibit the same statistical characteristics as the style features of the same layer and we found that the classic signal whitening and coloring transforms (WCTs) on those features are able to achieve this goal in an almost effortless manner.
Specifically, we first employ the VGG-19 network VGG-2014 as the feature extractor (encoder), and train a symmetric decoder to invert the VGG-19 features to the original image, which is essentially the image reconstruction task (Figure 1(a)). Once trained, both the encoder and the decoder are fixed through all the experiments. To perform style transfer, we apply WCT to one layer of content features such that its covariance matrix matches that of style features, as shown in Figure 1(b). The transformed features are then fed forward into the downstream decoder layers to obtain the stylized image. In addition to this single-level stylization, we further develop a multi-level stylization pipeline, as depicted in Figure 1(c), where we apply WCT sequentially to multiple feature layers. The multi-level algorithm generates stylized images with greater visual quality, comparable to or even better than those of existing methods at much lower computational cost. We also introduce a control parameter that defines the degree of style transfer so that users can choose the balance between stylization and content preservation. The entire procedure of our algorithm only requires learning the image reconstruction decoder with no style images involved. So when given a new style, we simply need to extract its feature covariance matrices and apply them to the content features via WCT. Note that this learning-free scheme is fundamentally different from existing feed-forward networks that require learning with pre-defined styles and fine-tuning for new styles. Therefore, our approach is able to achieve style transfer universally.
The main contributions of this work are summarized as follows:
We propose to use feature transforms, i.e., whitening and coloring, to directly match content feature statistics to those of a style image in the deep feature space.
We couple the feature transforms with a pre-trained general encoder-decoder network, such that the transferring process can be implemented by simple feed-forward operations.
We demonstrate the effectiveness of our method for universal style transfer with high-quality visual results, and also show its application to universal texture synthesis.
Figure 1: (a) Reconstruction. (b) Single-level stylization. (c) Multi-level stylization.
2 Related Work
Existing style transfer methods are mostly example-based Hertz-2001-analogy ; shih2013data ; shih2014style ; Frigo-2016-CVPR ; MSRA-2017-visual . The image analogy method Hertz-2001-analogy aims to determine the relationship between a pair of images and then apply it to stylize other images. As they are based on finding dense correspondence, analogy-based approaches shih2013data ; shih2014style ; Frigo-2016-CVPR ; MSRA-2017-visual often require that a pair of images depict the same type of scene. Therefore these methods do not scale well to the setting of arbitrary style images.
Recently, Gatys et al. GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 proposed an algorithm for arbitrary stylization based on matching the correlations (Gram matrix) between deep features extracted by a trained network classifier within an iterative optimization framework. Numerous methods have since been developed to address different aspects including speed Texturenet-ICML2016 ; MGAN-ECCV2016 ; Perceptual-ECCV2016 , quality InstanceBN-arxiv-2016 ; MrfTransfer-CVPR2016 ; HistoGramLoss-2017 ; Wang-2016-highres , user control Gatys2016-control , diversity Texturenet-2017-V2 ; Me-2017-diversified , semantic understanding Frigo-2016-CVPR ; Doodle-2016-semantic and photorealism Luan-2017-photorealism . It is worth mentioning that one of the major drawbacks of GatysTexture-NIPS2015 ; GatysTransfer-CVPR2016 is the inefficiency due to the optimization process. The improvement of efficiency in Texturenet-ICML2016 ; MGAN-ECCV2016 ; Perceptual-ECCV2016 is realized by formulating the stylization as learning a feed-forward image transformation network. However, these methods are limited by the requirement of training one network per style due to the lack of generalization in network design.
Most recently, a number of methods have been proposed to empower a single network to transfer multiple styles, including a model conditioned on binary selection units Me-2017-diversified , a network that learns a set of new filters for every new style MSRA-2017-stylebank , and a novel conditional normalization layer that learns normalization parameters for each style GoogleMultiTexture-2016 . To achieve arbitrary style transfer, Chen et al. Chen-2016-swap first propose to swap the content feature with the closest style feature locally. Meanwhile, inspired by GoogleMultiTexture-2016 , two follow-up works Wang-2017-zeroshortArbitrary ; Ghiasi-2017-BMVC learn a general mapping from the style image to style parameters. The most closely related work Huang-2017-arbitrary directly adjusts the content feature to match the mean and variance of the style feature. However, the generalization ability of the learned models on unseen styles is still limited.
Different from the existing methods, our approach performs style transfer efficiently in a feed-forward manner while achieving generalization and visual quality on arbitrary styles. Our approach is closely related to Huang-2017-arbitrary , where the content feature in a particular (higher) layer is adaptively instance normalized by the mean and variance of the style feature. This step can be viewed as a sub-optimal approximation of the WCT operation, thereby leading to less effective results on both training and unseen styles. Moreover, our encoder-decoder network is trained solely based on image reconstruction, while Huang-2017-arbitrary requires learning such a module specifically for the stylization task. We evaluate the proposed algorithm against existing approaches extensively on both style transfer and texture synthesis tasks and present in-depth analysis.
3 Proposed Algorithm
We formulate style transfer as an image reconstruction process coupled with feature transformation, i.e., whitening and coloring. The reconstruction part is responsible for inverting features back to the RGB space and the feature transformation matches the statistics of a content image to a style image.
3.1 Reconstruction decoder
We construct an auto-encoder network for general image reconstruction. We employ the VGG-19 VGG-2014 as the encoder, fix it, and train a decoder network simply for inverting VGG features to the original image, as shown in Figure 1(a). The decoder is designed to be symmetrical to the VGG-19 network (up to the Relu_X_1 layer), with nearest neighbor upsampling layers used for enlarging feature maps. To evaluate with features extracted at different layers, we select feature maps at five layers of the VGG-19, i.e., Relu_X_1 (X=1,2,3,4,5), and train five decoders accordingly. The pixel reconstruction loss Doso-NIPS2016-Generation and feature loss Perceptual-ECCV2016 ; Doso-NIPS2016-Generation are employed for reconstructing an input image,
$$L = \|I_o - I_i\|_2^2 + \lambda \|\Phi(I_o) - \Phi(I_i)\|_2^2, \tag{1}$$
where $I_i$, $I_o$ are the input image and reconstruction output, and $\Phi$ is the VGG encoder that extracts the Relu_X_1 features. In addition, $\lambda$ is the weight to balance the two losses. After training, the decoder is fixed (i.e., will not be fine-tuned) and used as a feature inverter.
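The two-term reconstruction objective can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_encoder` (2x2 average pooling) is a hypothetical stand-in for the fixed VGG encoder, and the names `reconstruction_loss` and `lam` are chosen here, not taken from the paper.

```python
import numpy as np


def reconstruction_loss(image_in, image_out, encoder, lam=1.0):
    """Squared pixel error plus lam times squared feature error."""
    pixel_term = np.sum((image_out - image_in) ** 2)
    feature_term = np.sum((encoder(image_out) - encoder(image_in)) ** 2)
    return float(pixel_term + lam * feature_term)


def toy_encoder(image):
    """Hypothetical stand-in for the fixed VGG encoder: 2x2 average pooling."""
    return 0.25 * (image[0::2, 0::2] + image[0::2, 1::2]
                   + image[1::2, 0::2] + image[1::2, 1::2])


# Toy input and "reconstruction" with even height and width.
image_in = np.zeros((4, 4))
image_out = np.ones((4, 4))
loss_value = reconstruction_loss(image_in, image_out, toy_encoder, lam=1.0)
```

In an actual training setup the encoder would be the frozen VGG-19 up to Relu_X_1 and only the decoder producing `image_out` would receive gradients.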
3.2 Whitening and coloring transforms
Given a pair of content image $I_c$ and style image $I_s$, we first extract their vectorized VGG feature maps $f_c \in \Re^{C \times H_c W_c}$ and $f_s \in \Re^{C \times H_s W_s}$ at a certain layer (e.g., Relu_4_1), where $H_c$, $W_c$ ($H_s$, $W_s$) are the height and width of the content (style) feature, and $C$ is the number of channels. The decoder will reconstruct the original image $I_c$ if $f_c$ is directly fed into it. We next propose to use a whitening and coloring transform to adjust $f_c$ with respect to the statistics of $f_s$. The goal of WCT is to directly transform $f_c$ to match the covariance matrix of $f_s$. It consists of two steps, i.e., whitening and coloring transform.
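Before whitening, we first center $f_c$ by subtracting its mean vector $m_c$. Then we transform $f_c$ linearly as in (2) so that we obtain $\hat{f}_c$ such that the feature maps are uncorrelated ($\hat{f}_c \hat{f}_c^\top = I$),
$$\hat{f}_c = E_c D_c^{-\frac{1}{2}} E_c^\top f_c, \tag{2}$$
where $D_c$ is a diagonal matrix with the eigenvalues of the covariance matrix $f_c f_c^\top \in \Re^{C \times C}$, and $E_c$ is the corresponding orthogonal matrix of eigenvectors, satisfying $f_c f_c^\top = E_c D_c E_c^\top$.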
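As a minimal sketch of the whitening step, the snippet below centers a C x N feature matrix, eigendecomposes its covariance, and rescales by the inverse square root of the eigenvalues. Two choices here are assumptions rather than part of the description above: the covariance is normalized by N - 1, and eigenvalues are clipped at a small `eps` to avoid division by zero.

```python
import numpy as np


def whiten(f_c, eps=1e-8):
    """Whiten a C x N feature matrix so its covariance is close to identity."""
    m_c = f_c.mean(axis=1, keepdims=True)      # per-channel mean vector
    f = f_c - m_c                              # centered features
    cov = f @ f.T / (f.shape[1] - 1)           # C x C covariance (assumed N-1 scaling)
    d, E = np.linalg.eigh(cov)                 # eigenvalues d, eigenvectors E
    d = np.clip(d, eps, None)                  # guard against tiny eigenvalues
    return E @ np.diag(d ** -0.5) @ E.T @ f


# Toy correlated features standing in for a content feature map.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -0.3, 0.7]])
f_c = mix @ rng.standard_normal((3, 400)) + 5.0
f_hat = whiten(f_c)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric, which gives real eigenvalues and orthonormal eigenvectors.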
To validate what is encoded in the whitened feature $\hat{f}_c$, we invert it to the RGB space with our previous decoder trained for reconstruction only. Figure 2 shows two visualization examples, which indicate that the whitened features still maintain global structures of the image contents, but greatly help remove other information related to styles. We note especially that, for the Starry_night example on the right, the detailed stroke patterns across the original image are gone. In other words, the whitening step helps peel off the style from an input image while preserving the global content structure. The outcome of this operation is ready to be transformed with the target style.
Figure 3: (a) Style. (b) Content. (c) HM. (d) WCT. (e) Style. (f) Content. (g) HM. (h) WCT.
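We first center $f_s$ by subtracting its mean vector $m_s$, and then carry out the coloring transform WCT-2016 , which is essentially the inverse of the whitening step, to transform $\hat{f}_c$ linearly as in (3) such that we obtain $\hat{f}_{cs}$ which has the desired correlations between its feature maps ($\hat{f}_{cs} \hat{f}_{cs}^\top = f_s f_s^\top$),
$$\hat{f}_{cs} = E_s D_s^{\frac{1}{2}} E_s^\top \hat{f}_c, \tag{3}$$
where $D_s$ is a diagonal matrix with the eigenvalues of the covariance matrix $f_s f_s^\top \in \Re^{C \times C}$, and $E_s$ is the corresponding orthogonal matrix of eigenvectors. Finally we re-center $\hat{f}_{cs}$ with the mean vector $m_s$ of the style, i.e., $\hat{f}_{cs} = \hat{f}_{cs} + m_s$.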
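Putting whitening and coloring together, a hedged NumPy sketch of the full transform might look as follows. The helper name `_centered_eig`, the N - 1 covariance normalization, and the `eps` clipping are assumptions made for this illustration; the content and style matrices may have different numbers of spatial positions.

```python
import numpy as np


def _centered_eig(f, eps=1e-8):
    """Return mean, centered features, clipped eigenvalues and eigenvectors."""
    m = f.mean(axis=1, keepdims=True)
    centered = f - m
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d, E = np.linalg.eigh(cov)
    return m, centered, np.clip(d, eps, None), E


def wct(f_c, f_s):
    """Whiten content features, then color them with the style statistics."""
    _, centered_c, d_c, E_c = _centered_eig(f_c)
    m_s, _, d_s, E_s = _centered_eig(f_s)
    whitened = E_c @ np.diag(d_c ** -0.5) @ E_c.T @ centered_c
    colored = E_s @ np.diag(d_s ** 0.5) @ E_s.T @ whitened
    return colored + m_s                        # re-center with the style mean


# Toy content (3 x 400) and style (3 x 250) feature matrices.
rng = np.random.default_rng(1)
f_c = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 400)) + 2.0
f_s = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 250)) - 1.0
f_cs = wct(f_c, f_s)
```

The output keeps the spatial size of the content features while its channel mean and covariance follow the style features.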
To demonstrate the effectiveness of WCT, we compare it with a commonly used feature adjustment technique, i.e., histogram matching (HM), in Figure 3. The channel-wise histogram matching Gonzalez-DIP method determines a mapping function such that the mapped $f_c$ has the same cumulative histogram as $f_s$. In Figure 3, it is clear that the HM method helps transfer the global color of the style image well but fails to capture salient visual patterns, e.g., patterns are broken into pieces and local structures are misrepresented. In contrast, our WCT captures patterns that reflect the style image better. This is because the HM method does not consider the correlations between feature channels, which are exactly what the covariance matrix is designed to capture.
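To make the contrast concrete, here is a rough sketch of channel-wise histogram matching based on rank-to-quantile mapping; it is one common way to implement HM and is not claimed to be the exact procedure used for Figure 3. The toy data are constructed so that the style channels are strongly correlated while the content channels are independent.

```python
import numpy as np


def match_histograms(f_c, f_s):
    """Map each content channel onto the value distribution of the style channel."""
    n_c, n_s = f_c.shape[1], f_s.shape[1]
    q_c = (np.arange(n_c) + 0.5) / n_c          # quantile level of each content rank
    q_s = (np.arange(n_s) + 0.5) / n_s          # quantile level of each sorted style value
    out = np.empty(f_c.shape, dtype=float)
    for ch in range(f_c.shape[0]):
        ranks = np.argsort(np.argsort(f_c[ch]))  # rank of every content value
        out[ch] = np.interp(q_c[ranks], q_s, np.sort(f_s[ch]))
    return out


rng = np.random.default_rng(2)
content = rng.standard_normal((2, 2000))                          # independent channels
base = rng.standard_normal(2000)
style = np.stack([base, base + 0.1 * rng.standard_normal(2000)])  # correlated channels
matched = match_histograms(content, style)
```

Each output channel takes on the style channel's histogram, yet the cross-channel correlation of `matched` stays close to that of the content, which is the limitation described above.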
After the WCT, we may blend $\hat{f}_{cs}$ with the content feature $f_c$ as in (4) before feeding it to the decoder in order to provide user controls on the strength of stylization effects:
$$\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c, \tag{4}$$
where $\alpha$ serves as the style weight for users to control the transfer effect.
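The blending step is a plain convex combination in feature space. A tiny worked example, with the function name `blend_features` and the toy values chosen here for illustration:

```python
import numpy as np


def blend_features(f_cs, f_c, alpha):
    """Mix transformed features with the original content features."""
    return alpha * f_cs + (1.0 - alpha) * f_c


f_c = np.array([[0.0, 2.0], [4.0, 6.0]])    # toy content features
f_cs = np.array([[1.0, 1.0], [1.0, 1.0]])   # toy transformed features
half = blend_features(f_cs, f_c, 0.5)       # halfway between the two
```

A style weight of 0 returns the content features unchanged (plain reconstruction), and a weight of 1 returns the fully transformed features.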
Figure 4: (a) Style. (b) Relu_1_1. (c) Relu_2_1. (d) Relu_3_1. (e) Relu_4_1. (f) Relu_5_1.
3.3 Multi-level coarse-to-fine stylization
Based on the single-level stylization framework shown in Figure 1(b), we use different layers of VGG features Relu_X_1 (X=1,2,…,5) and show the corresponding stylized results in Figure 4. It clearly shows that the higher layer features capture more complicated local structures, while lower layer features carry more low-level information (e.g., colors). This can be explained by the increasing size of receptive field and feature complexity in the network hierarchy. Therefore, it is advantageous to use features at all five layers to fully capture the characteristics of a style from low to high levels.
Figure 1(c) shows our multi-level stylization pipeline. We start by applying the WCT on Relu_5_1 features to obtain a coarse stylized result and regard it as the new content image to further adjust features in lower layers. An example of intermediate results is shown in Figure 5. We show the intermediate results $I_5$, $I_4$, $I_1$ with obvious differences, which indicates that the higher layer features first capture salient patterns of the style and lower layer features further improve details. If we reverse the feature processing order (i.e., fine-to-coarse layers) by starting with Relu_1_1, low-level information cannot be preserved after manipulating higher level features, as shown in Figure 5(d).
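The coarse-to-fine loop can be sketched as below. Everything concrete in this snippet is a placeholder: the encoders and decoders are identity functions that only record which level was visited, and `shift_mean` is a stand-in for the actual feature transform, so the example shows the control flow only.

```python
import numpy as np


def multi_level_stylize(content, style, encoders, decoders, transform, alpha=1.0):
    """Apply the transform from the deepest level down to the shallowest."""
    image = content
    for level in range(len(encoders) - 1, -1, -1):
        f_c = encoders[level](image)             # features of the current result
        f_s = encoders[level](style)             # style features at the same level
        f_cs = transform(f_c, f_s)
        f_cs = alpha * f_cs + (1.0 - alpha) * f_c
        image = decoders[level](f_cs)            # becomes the next content image
    return image


visited = []


def make_encoder(level_number):
    """Hypothetical encoder: identity mapping that logs its level number."""
    def encode(x):
        visited.append(level_number)
        return x
    return encode


encoders = [make_encoder(k + 1) for k in range(5)]
decoders = [lambda f: f for _ in range(5)]


def shift_mean(f_c, f_s):
    """Placeholder transform: move the content mean onto the style mean."""
    return f_c - f_c.mean() + f_s.mean()


result = multi_level_stylize(np.zeros((2, 3)), np.ones((2, 3)),
                             encoders, decoders, shift_mean)
```

Reversing the `range` would give the fine-to-coarse order that the text reports as losing low-level information.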
4 Experimental Results
4.1 Decoder training
For the multi-level stylization approach, we separately train five reconstruction decoders for features at the VGG-19 Relu_X_1 (X=1,2,…,5) layers. They are trained on the Microsoft COCO dataset COCO-lin2014-microsoft and the weight to balance the two losses in (1) is set to 1.
Figure 6: (a) Style. (b) Chen-2016-swap. (c) Huang-2017-arbitrary. (d) Texturenet-ICML2016. (e) GatysTransfer-CVPR2016. (f) Ours.
4.2 Style transfer
To demonstrate the effectiveness of the proposed algorithm, we list the differences with existing methods in Table 1 and present stylized results in Figure 6. We adjust the style weight of other methods to obtain the best stylized effect. The optimization-based work of GatysTransfer-CVPR2016 handles arbitrary styles but is likely to encounter unexpected local minima issues (e.g., 5th and 6th row of Figure 6(e)). Although the method Texturenet-ICML2016 greatly improves the stylization speed, it trades off quality and generality for efficiency, which generates repetitive patterns that overlay with the image contents (Figure 6(d)).
Closest to our work on generalization are the recent methods Chen-2016-swap ; Huang-2017-arbitrary , but the quality of their stylized results is less appealing. The work of Chen-2016-swap replaces the content feature with the most similar style feature based on patch similarity and hence has limited capability, i.e., the content is strictly preserved while style is not well reflected, with only low-level information (e.g., colors) transferred, as shown in Figure 6(b). In Huang-2017-arbitrary , the content feature is simply adjusted to have the same mean and variance as the style feature, which is not effective in capturing high-level representations of the style. Even when learned with a set of training styles, it does not generalize well to unseen styles. Results in Figure 6(c) indicate that the method in Huang-2017-arbitrary is not effective at capturing and synthesizing salient style patterns, especially for complicated styles where there are rich local structures and non-smooth regions.
Figure 6(f) shows the stylized results of our approach. Without learning any style, our method is able to capture visually salient patterns in style images (e.g., the brick wall on the 6th row). Moreover, key components in the content images (e.g., bridge, eye, mouth) are also well stylized in our results, while other methods only transfer patterns to relatively smooth regions (e.g., sky, face). The models and code are available at https://github.com/Yijunmaverick/UniversalStyleTransfer.
Table columns (compared methods): Chen et al. Chen-2016-swap | Huang et al. Huang-2017-arbitrary | TNet Texturenet-ICML2016 | Gatys et al. GatysTransfer-CVPR2016 | Ours
Figure 8: (a) Content. (b) Different masks and styles. (c) Our result.
In addition, we quantitatively evaluate different methods by computing the covariance matrix difference ($L_s$) on all five levels of VGG features between stylized results and the given style image. We randomly select 10 content images from COCO-lin2014-microsoft and 40 style images from Wikipainting-BMVC2014 , compute the averaged difference over all styles, and show the results in Table 2 (1st row). Quantitative results show that we generate stylized results with lower $L_s$, i.e., closer to the statistics of the style.
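A possible way to compute such a covariance difference is sketched below. The text does not specify the norm or how the five levels are combined, so the squared Frobenius norm and the plain average over levels used here are assumptions, as are the function names.

```python
import numpy as np


def covariance_difference(f_out, f_style):
    """Squared Frobenius distance between two C x C feature covariances."""
    diff = np.cov(f_out) - np.cov(f_style)
    return float(np.sum(diff ** 2))


def mean_difference_over_levels(feats_out, feats_style):
    """Average the per-level differences (assumed aggregation)."""
    return float(np.mean([covariance_difference(a, b)
                          for a, b in zip(feats_out, feats_style)]))


rng = np.random.default_rng(3)
f_style = rng.standard_normal((3, 100))      # toy style features, C = 3
same = covariance_difference(f_style, f_style)
scaled = covariance_difference(2.0 * f_style, f_style)
```

Because `np.cov` treats rows as variables, the two inputs only need the same number of channels, not the same number of spatial positions.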
Evaluating artistic style transfer has been an open question in the community. Since the qualitative assessment is highly subjective, we conduct a user study to evaluate the 5 methods shown in Figure 6. We use 5 content images and 30 style images, and generate 150 results, one for each content/style pair, for each method. We randomly select 15 style images for each subject to evaluate. We display stylized images by the 5 compared methods side-by-side on a webpage in random order. Each subject is asked to vote for his/her ONE favorite result for each style. We finally collect a total of 1,200 votes from 80 subjects and show the percentage of the votes each method received in Table 2 (2nd row). The study shows that our method receives the most votes for better stylized results. It can be an interesting direction to develop evaluation metrics based on human visual perception for general image synthesis problems.
In Table 2 (3rd row), we also compare our approach with other methods in terms of efficiency. The method by Gatys et al. GatysTransfer-CVPR2016 is slow due to loops of optimization and usually requires at least 500 iterations to generate good results. The methods Texturenet-ICML2016 and Huang-2017-arbitrary are efficient as the scheme is based on one feed-forward pass with a trained network. The approach Chen-2016-swap is feed-forward based but relatively slower as the feature swapping operation needs to be carried out for thousands of patches. Our approach is also efficient but slightly slower than Texturenet-ICML2016 ; Huang-2017-arbitrary because we have an eigenvalue decomposition step in WCT. Note, however, that the computational cost of this step does not increase with the image size because the dimension of the covariance matrix only depends on the number of filters (or channels), which is at most 512 (Relu_5_1). Currently the decomposition step is implemented on the CPU. Our future work includes more efficient GPU implementations of the proposed algorithm.
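The size argument can be checked with a short shape experiment. The helper below is purely illustrative (random features, a toy channel count of 8 rather than 512): it builds the covariance for two spatial resolutions and reports its shape.

```python
import numpy as np


def covariance_shape(channels, height, width, seed=0):
    """Shape of the channel covariance for a random C x (H*W) feature map."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal((channels, height * width))
    f = f - f.mean(axis=1, keepdims=True)
    cov = f @ f.T / (f.shape[1] - 1)
    return cov.shape


small = covariance_shape(8, 16, 16)   # 256 spatial positions
large = covariance_shape(8, 64, 64)   # 4096 spatial positions
```

Both calls yield an 8 x 8 matrix, so the eigendecomposition works on the same problem size; only the matrix products that form the covariance and apply the transform grow with the number of spatial positions.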
Given a content/style pair, our approach is not only as simple as a one-click transfer, but also flexible enough to accommodate different requirements from users by providing different controls on the stylization, including scale, weight and spatial control. The style input at different scales will lead to different extracted statistics due to the fixed receptive field of the network. Therefore the scale control is easily achieved by adjusting the style image size. In the middle of Figure 7, we show two examples where the brick can be transferred at either a small or a large scale. The weight control refers to controlling the balance between stylization and content preservation. As shown on the right of Figure 7, our method enjoys this flexibility in simple feed-forward passes by simply adjusting the style weight in (4). However, in GatysTransfer-CVPR2016 and Texturenet-ICML2016 , to obtain visual results of different weight settings, a new round of time-consuming optimization or model training is needed. Moreover, our blending directly works on the deep feature space before inversion/reconstruction, which is fundamentally different from GatysTransfer-CVPR2016 ; Texturenet-ICML2016 where the blending is formulated as the weighted sum of the content and style losses, which may not always lead to a good balance point.
The spatial control is also highly desired when users want to edit an image with different styles transferred on different parts of the image. Figure 8 shows an example of spatially controlling the stylization. A set of masks $M$ (Figure 8(b)) is additionally required as input to indicate the spatial correspondence between content regions and styles. By replacing the content feature $f_c$ in (3) with $M \odot f_c$, where $\odot$ is a simple mask-out operation, we are able to stylize the specified region only.
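One plausible reading of the mask-out operation, sketched here as an assumption rather than the exact procedure, is to apply the feature transform only to the spatial positions selected by a boolean mask and leave the remaining positions untouched. The `shift_mean` function is a placeholder for the actual transform.

```python
import numpy as np


def masked_transform(f_c, f_s, mask, transform):
    """Transform only the columns of f_c where mask is True."""
    out = f_c.astype(float)                      # astype returns a copy
    out[:, mask] = transform(f_c[:, mask], f_s)
    return out


def shift_mean(f_c, f_s):
    """Placeholder transform: match the per-channel mean of the style."""
    return (f_c - f_c.mean(axis=1, keepdims=True)
            + f_s.mean(axis=1, keepdims=True))


f_c = np.zeros((2, 6))                           # 2 channels, 6 spatial positions
f_s = np.ones((2, 4))
mask = np.array([True, True, True, False, False, False])
stylized = masked_transform(f_c, f_s, mask, shift_mean)
```

With several masks and styles, the same call could be repeated once per region, each time with its own mask and style features.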
4.3 Texture synthesis
By setting the content image as a random noise image (e.g., Gaussian noise), our stylization framework can be easily applied to texture synthesis. An alternative is to directly initialize the $\hat{f}_c$ in (3) to be white noise. Both approaches achieve similar results. Figure 9 shows a few examples of the synthesized textures. We empirically find that it is better to run the multi-level pipeline a few times (e.g., 3) to get more visually pleasing results.
Our method is also able to synthesize the interpolated result of two textures. Given two texture examples $s_1$ and $s_2$, we first perform the WCT on the input noise and get transformed features $\hat{f}_{cs_1}$ and $\hat{f}_{cs_2}$ respectively. Then we blend these two features, $\hat{f}_{cs} = \beta \hat{f}_{cs_1} + (1 - \beta) \hat{f}_{cs_2}$, and feed the combined feature into the decoder to generate mixed effects. Note that our interpolation directly works on the deep feature space. By contrast, the method in GatysTransfer-CVPR2016 generates the interpolation by matching the weighted sum of Gram matrices of two textures at the loss end. Figure 10 shows that the result by GatysTransfer-CVPR2016 is simply an overlay of the two textures while our method generates new textural effects, e.g., bricks in the stripe shape.
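A hedged sketch of feature-space texture interpolation follows. It uses the white-noise initialization mentioned above in place of whitened content features, colors the same noise with each texture's statistics, and blends the two results with a weight called `beta` here. The function names, the N - 1 covariance normalization, and the `eps` clipping are assumptions of this illustration.

```python
import numpy as np


def color(whitened, f_s, eps=1e-8):
    """Give (approximately) white features the covariance and mean of f_s."""
    m_s = f_s.mean(axis=1, keepdims=True)
    centered = f_s - m_s
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d, E = np.linalg.eigh(cov)
    d = np.clip(d, eps, None)
    return E @ np.diag(d ** 0.5) @ E.T @ whitened + m_s


def interpolate_textures(f_s1, f_s2, beta, n_positions, seed=0):
    """Color one noise sample with two textures and blend the results."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((f_s1.shape[0], n_positions))
    f_1 = color(noise, f_s1)
    f_2 = color(noise, f_s2)
    return beta * f_1 + (1.0 - beta) * f_2


rng = np.random.default_rng(4)
f_s1 = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 50))
f_s2 = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 80)) + 1.0
only_first = interpolate_textures(f_s1, f_s2, 1.0, 64)
only_second = interpolate_textures(f_s1, f_s2, 0.0, 64)
mixed = interpolate_textures(f_s1, f_s2, 0.5, 64)
```

Changing `seed` draws a different noise sample, which is the mechanism the text relies on for producing diverse synthesis results.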
One important aspect in texture synthesis is diversity. By sampling different noise images, our method can generate diverse synthesized results for each texture. While Texturenet-ICML2016 can generate different results driven by the input noise, the learned networks are very likely to be trapped in local optima. In other words, the noise is marginalized out and thus fails to drive the network to generate large visual variations. In contrast, our approach explains each input noise better because the network is unlikely to absorb the variations in input noise since it is never trained for learning textures. We compare the diverse outputs of our model with Texturenet-ICML2016 in Figure 11. Note that the common diagonal layout is shared across different results of Texturenet-ICML2016 , which causes unsatisfying visual experiences. The comparison shows that our method achieves diversity in a more natural and flexible manner.
5 Concluding Remarks
In this work, we propose a universal style transfer algorithm that does not require learning for each individual style. By unfolding the image generation process via training an auto-encoder for image reconstruction, we integrate the whitening and coloring transforms in the feed-forward passes to match the statistical distributions and correlations between the intermediate features of content and style. We also present a multi-level stylization pipeline, which takes all levels of information of a style into account, for improved results. In addition, the proposed approach is shown to be equally effective for texture synthesis. Experimental results demonstrate that the proposed algorithm achieves favorable performance against the state-of-the-art methods in generalizing to arbitrary styles.
This work is supported in part by the NSF CAREER Grant #1149783, and gifts from Adobe and NVIDIA.
-  A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.
-  D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017.
-  T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
-  M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
-  A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
-  V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
-  O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In CVPR, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In CVPR, 2017.
-  G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In BMVC, 2017.
-  R. C. Gonzalez and R. E. Woods. Digital image processing (3rd edition). Prentice Hall, 2008.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, 2001.
-  Whitening and coloring transforms for multivariate gaussian random variables. Project Rhea, 2016.
-  X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
-  S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller. Recognizing image style. In BMVC, 2014.
-  C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In CVPR, 2016.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017.
-  J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
-  F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In CVPR, 2017.
-  Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand. Style transfer for headshot portraits. In SIGGRAPH, 2014.
-  Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. In SIGGRAPH, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.
-  H. Wang, X. Liang, H. Zhang, D.-Y. Yeung, and E. P. Xing. Zm-net: Real-time zero-shot image manipulation network. arXiv preprint arXiv:1703.07255, 2017.
-  X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In CVPR, 2017.
-  P. Wilmot, E. Risser, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.