StyleBank: An Explicit Representation for Neural Image Style Transfer

03/27/2017 ∙ by Dongdong Chen, et al. ∙ Microsoft ∙ USTC

We propose StyleBank, which is composed of multiple convolution filter banks and each filter bank explicitly represents one style, for neural image style transfer. To transfer an image to a specific style, the corresponding filter bank is operated on top of the intermediate feature embedding produced by a single auto-encoder. The StyleBank and the auto-encoder are jointly learnt, where the learning is conducted in such a way that the auto-encoder does not encode any style information thanks to the flexibility introduced by the explicit filter bank representation. It also enables us to conduct incremental learning to add a new image style by learning a new filter bank while holding the auto-encoder fixed. The explicit style representation along with the flexible network design enables us to fuse styles at not only the image level, but also the region level. Our method is the first style transfer network that links back to traditional texton mapping methods, and hence provides new understanding on neural style transfer. Our method is easy to train, runs in real-time, and produces results that are qualitatively better than or at least comparable to existing methods.




1 Introduction

Style transfer migrates the style of one image onto another, and is closely related to texture synthesis. The core problem behind these two tasks is to model the statistics of a reference image (texture, or style image), which enables further sampling from it under certain constraints. For texture synthesis, the constraints are that the boundaries between two neighboring samples must have a smooth transition, while for style transfer, the constraints are that the samples should match the local structure of the content image. In this sense, style transfer can be regarded as a generalization of texture synthesis.

Recent work on style transfer adopting Convolutional Neural Networks (CNN) ignited a renewed interest in this problem. On the machine learning side, it has been shown that a pre-trained image classifier can be used as a feature extractor to drive texture synthesis [11] and style transfer [12]. These CNN algorithms either apply an iterative optimization mechanism [12], or directly learn a feed-forward generator network [19, 37] to seek an image close to both the content image and the style image, all measured in the CNN (i.e., pre-trained VGG-16 [36]) feature domain. These algorithms often produce more impressive results than the texture-synthesis ones, since the rich feature representation that a deep network produces from an image allows more flexible manipulation of the image.

Notwithstanding their demonstrated success, the principles of CNN style transfer are vaguely understood. After a careful examination of existing style transfer networks, we argue that the content and style are still coupled in their learnt network structures and hyper-parameters. To the best of our knowledge, an explicit representation for either style or content has not yet been proposed in these previous neural style transfer methods.

As a result, the network is only able to capture one specific style at a time. For a new style, the whole network has to be retrained end-to-end. In practice, this makes these methods unable to scale to a large number of styles, especially when the style set needs to be incrementally augmented. In addition, how to further reduce run time and network model size, and how to enable more flexible control over transfer (e.g., region-specific transfer), remain open challenges.

To explore an explicit representation for style, we reconsider neural style transfer by linking back to traditional texton (known as the basic element of texture) mapping methods, where mapping a texton to the target location is equivalent to a convolution between a texton and a Delta function (indicating sampling positions) in the image space.
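The equivalence between texton mapping and convolution with a Delta function can be checked on a toy example. The sketch below is an illustration of the classical idea only, not code from the paper; the names `conv2d_full`, `texton`, and `delta` are hypothetical. It places a 2×2 texton at two sampling positions by convolving it with a map that is 1 at those positions and 0 elsewhere.

```python
def conv2d_full(image, kernel):
    """Full 2-D convolution of two nested-list arrays (scatter form)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * (iw + kw - 1) for _ in range(ih + kh - 1)]
    for y in range(ih):
        for x in range(iw):
            if image[y][x] == 0:
                continue  # only sampling positions contribute
            for dy in range(kh):
                for dx in range(kw):
                    out[y + dy][x + dx] += image[y][x] * kernel[dy][dx]
    return out


# A tiny "texton": the basic texture element to be stamped into the image.
texton = [[1, 2],
          [3, 4]]

# Delta map: 1 at each sampling position, 0 elsewhere.
delta = [[0] * 4 for _ in range(3)]
delta[0][0] = 1
delta[1][2] = 1

mapped = conv2d_full(delta, texton)
for row in mapped:
    print(row)
```

Each copy of the texton appears with its top-left corner at a sampling position, which is exactly what mapping the texton to those locations would produce.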

Inspired by this, we propose StyleBank, which is composed of multiple convolution filter banks and each filter bank represents one style. To transfer an image to a specific style, the corresponding filter bank is convolved with the intermediate feature embedding produced by a single auto-encoder, which decomposes the original image into multiple feature response maps. This way, for the first time, we provide a clear understanding of the mechanism underneath neural style transfer.

The StyleBank and the auto-encoder are jointly learnt in our proposed feed-forward network. It not only allows us to simultaneously learn a bundle of various styles, but also enables a very efficient incremental learning for a new image style. This is achieved by learning a new filter bank while holding the auto-encoder fixed.

We believe this is a very useful functionality for recently emerged style transfer mobile applications (e.g., Prisma), since we do not need to train and prepare a complete network for every style. More importantly, it can even allow users to efficiently create their own style models and conveniently share them with others. Since the image encoding part of our network is shared across styles, it may provide a faster and more convenient switch for users between different style models.

Because of the explicit representation, we can more conveniently control style transfer and create new interesting style fusion effects. More specifically, we can either linearly fuse different styles together, or produce region-specific style fusion effects. In other words, we may produce an artistic work with hybrid elements from van Gogh's and Picasso's paintings.

Compared with existing neural style transfer networks [19, 37], our proposed neural style transfer network is unique in the following aspects:

  • In our method, we provide an explicit representation for styles. This enables our network to completely decouple styles from the content after learning.

  • Due to the explicit style representation, our method enables region-based style transfer. This is infeasible in existing neural style transfer networks, although classical texture transfer methods were able to achieve it.

  • Our method not only allows multiple styles to be trained simultaneously while sharing a single auto-encoder, but also allows a new style to be learnt incrementally without changing the auto-encoder.

The remainder of the paper is organized as follows. We summarize related work in Section 2. We devote Section 3 to the main technical design of the proposed StyleBank Network. Section 4 discusses new characteristics of the proposed StyleBank Network compared with previous work. We present experimental results and comparisons in Section 5. Finally, we conclude in Section 6.

2 Related Work

Style transfer is closely related to texture synthesis, which attempts to grow textures using non-parametric sampling of pixels [8, 39] or patches [7, 25] in a given source texture. The task of style transfer can be regarded as a problem of texture transfer [7, 10, 9], which synthesizes a texture from a source image constrained by the content of a target image. Hertzmann et al. [16] further introduce the concept of image analogies, which transfer the texture from an already stylised image onto a target image. However, these methods only use low-level image features of the target image to inform the texture transfer.

Ideally, a style transfer algorithm should be able to extract and represent the semantic image content from the target image and then render the content in the style of the source image. Separating content from style in natural images in general is still an extremely difficult problem, but it has been considerably mitigated by the recent development of Deep Convolutional Neural Networks (CNN) [21].

DeepDream [1] may be the first attempt to generate artistic work using CNN. Inspired by this work, Gatys et al. [12] successfully apply CNN (pre-trained VGG-16 networks) to neural style transfer and produce more impressive stylization results compared to classic texture transfer methods. This idea is further extended to portrait painting style transfer [35] and patch-based style transfer by combining Markov Random Field (MRF) and CNN [22]. Unfortunately, these methods are based on an iterative optimization mechanism and are computationally expensive at run-time, which imposes a big limitation in real applications.

To make the run-time more efficient, more and more works directly learn a feed-forward generator network for a specific style. This way, stylized results can be obtained with just a forward pass, which is hundreds of times faster than iterative optimization [12]. For example, Ulyanov et al. [37] propose a texture network for both texture synthesis and style transfer. Johnson et al. [19] define a perceptual loss function to help learn a transfer network that aims to produce results approaching [12]. Chuan et al. [23] introduce Markovian Generative Adversarial Networks, aiming to speed up their previous work [22].

However, in all of these methods, the learnt feed-forward networks can only represent one specific style. For a new style, the whole network has to be retrained, which may limit the scalability of adding more styles on demand. In contrast, our network allows a single network to simultaneously learn numerous styles. Moreover, our work enables incremental training for new styles.

At the core of our network, the proposed StyleBank represents each style by a convolution filter bank. It is closely analogous to the concept of "texton" [30, 41, 24] and filter bank in [42, 29], but StyleBank is defined in the feature embedding space produced by an auto-encoder [17] rather than in image space. As is well known, an embedding space can provide a compact and descriptive representation of the original data [4, 32, 40]. Therefore, our StyleBank should provide a better representation for style data compared to predefined dictionaries (such as wavelet [31] or pyramid [15]).

Figure 1: Our network architecture consists of three modules: image encoder $\mathcal{E}$, StyleBank layer $\mathcal{K}$ and image decoder $\mathcal{D}$.

3 StyleBank Networks

3.1 StyleBank

At its core, the task of neural style transfer requires a more explicit representation, like texton [30, 24] (known as the basic element of texture) used in classical texture synthesis. It may provide a new understanding for the style transfer task, and then help design a more elegant architecture to resolve the coupling issue in existing transfer networks [19, 37], which have to retrain hyper-parameters of the whole network for each newly added style end-to-end.

We build a feed-forward network based on a simple image auto-encoder (shown in Figure 1), which would first transform the input image (i.e., the content image) into the feature space through the encoder subnetwork. Inspired by the texton concept, we introduce StyleBank as style representation by analogy, which is learnt from input styles.

Indeed, our StyleBank contains multiple convolution filter banks. Every filter bank represents one kind of style, and all channels in a filter bank can be regarded as bases of style elements (e.g., texture pattern, coarsening or softening strokes). By convolving with the intermediate feature maps of the content image, produced by the auto-encoder, StyleBank is mapped to the content image to produce different stylization results. This is analogous to texton mapping in image space, which can also be interpreted as the convolution between a texton and a Delta function (indicating sampling positions).

3.2 Network Architecture

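Figure 1 shows our network architecture, which consists of three modules: image encoder $\mathcal{E}$, StyleBank layer $\mathcal{K}$ and image decoder $\mathcal{D}$. These constitute two learning branches: auto-encoder (i.e., $\mathcal{E} \rightarrow \mathcal{D}$) and stylizing (i.e., $\mathcal{E} \rightarrow \mathcal{K} \rightarrow \mathcal{D}$). Both branches share the same encoder and decoder modules.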

Our network requires the content image $I$ to be the input. Then the image is transformed into multi-layer feature maps $F$ through the encoder $\mathcal{E}$: $F = \mathcal{E}(I)$. For the auto-encoder branch, we train the auto-encoder to produce an image that is as close as possible to the input image, i.e., $O = \mathcal{D}(F) \rightarrow I$. In parallel, for the stylizing branch, we add an intermediate StyleBank layer $\mathcal{K}$ between the encoder $\mathcal{E}$ and the decoder $\mathcal{D}$. In this layer, StyleBank $\{K_i\}$ $(i = 1, 2, \ldots, n)$, for $n$ styles would be respectively convolved with features $F$ to obtain transferred features $\widetilde{F}_i$. Finally, the stylization result $O_i$ for style $i$ is achieved by the decoder $\mathcal{D}$: $O_i = \mathcal{D}(\widetilde{F}_i)$.

In this manner, contents are encoded into the auto-encoder (the encoder $\mathcal{E}$ and the decoder $\mathcal{D}$) as much as possible, while styles are encoded into StyleBank. As a result, content and style are decoupled in our network as much as possible.

Encoder and Decoder.

Following the architecture used in [19], the image encoder $\mathcal{E}$ consists of one stride-1 convolution layer and two stride-2 convolution layers. Symmetrically, the image decoder $\mathcal{D}$ consists of two stride-$\frac{1}{2}$ fractionally strided convolution layers and one stride-1 convolution layer. All convolutional layers are followed by instance normalization [38] and a ReLU nonlinearity, except the last output layer. Instance normalization has been demonstrated to perform better than spatial batch normalization in handling boundary artifacts brought by padding. Other than the first and last layers, which use $9 \times 9$ kernels, all convolutional layers use $3 \times 3$ kernels. Benefiting from the explicit representation, our network can remove all the residual blocks [14] used in the network presented in Johnson et al. [19], further reducing the model size and computation cost without performance degradation.

StyleBank Layer.

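Our architecture allows multiple styles (by default, 50 styles, but there is really no limit on it) to be simultaneously trained in the single network at the beginning. In the StyleBank layer $\mathcal{K}$, we learn $n$ convolution filter banks $\{K_i\}$ $(i = 1, 2, \ldots, n)$ (referred to as StyleBank). During training, we need to specify the $i$-th style, and use the corresponding filter bank $K_i$ for forward and backward propagation of gradients. At this time, the transferred features $\widetilde{F}_i$ are obtained by

$$\widetilde{F}_i = K_i \otimes F, \tag{1}$$

where $F \in \mathbb{R}^{c_{in} \times h \times w}$, $K_i \in \mathbb{R}^{c_{out} \times c_{in} \times k_h \times k_w}$, $\widetilde{F}_i \in \mathbb{R}^{c_{out} \times h \times w}$, $c_{in}$ and $c_{out}$ are numbers of feature channels for $F$ and $\widetilde{F}_i$ respectively, $(h, w)$ is the feature map size, and $(k_h, k_w)$ is the kernel size. To allow efficient training of new styles in our network, we may reuse the encoder $\mathcal{E}$ and the decoder $\mathcal{D}$ in our new training. We fix the trained $\mathcal{E}$ and $\mathcal{D}$, and only retrain the layer $\mathcal{K}$ with new filter banks starting from random initialization.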
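To make the shapes in the StyleBank layer concrete, here is a minimal pure-Python sketch of selecting one filter bank and applying it to content features. It is an illustration under assumptions, not the authors' implementation: the names `conv_same` and `stylebank_layer` are hypothetical, zero padding is assumed so the spatial size is preserved, and "convolution" is implemented as cross-correlation, as is common in deep learning libraries.

```python
def conv_same(features, bank):
    """Cross-correlate features (c_in x h x w) with one filter bank
    (c_out x c_in x kh x kw), zero-padded so the output is c_out x h x w."""
    h, w = len(features[0]), len(features[0][0])
    kh, kw = len(bank[0][0]), len(bank[0][0][0])
    out = []
    for filt in bank:                          # one output channel per filter
        plane = [[0.0] * w for _ in range(h)]
        for c, channel in enumerate(features):
            for y in range(h):
                for x in range(w):
                    for dy in range(kh):
                        for dx in range(kw):
                            yy, xx = y + dy - kh // 2, x + dx - kw // 2
                            if 0 <= yy < h and 0 <= xx < w:
                                plane[y][x] += filt[c][dy][dx] * channel[yy][xx]
        out.append(plane)
    return out


def stylebank_layer(features, style_bank, i):
    """Apply the i-th filter bank of the StyleBank to the content features."""
    return conv_same(features, style_bank[i])


# Two toy "styles" for 2-channel features: keep channels, or swap them.
delta = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
zero = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
identity_bank = [[delta, zero], [zero, delta]]
swap_bank = [[zero, delta], [delta, zero]]
style_bank = [identity_bank, swap_bank]

F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]

kept = stylebank_layer(F, style_bank, 0)
swapped = stylebank_layer(F, style_bank, 1)
print(kept)
print(swapped)
```

The encoder output is computed once; switching styles only changes which bank is indexed, which is what makes sharing one auto-encoder across styles cheap.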

Loss Functions.

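Our network consists of two branches: auto-encoder (i.e., encoder followed by decoder, $\mathcal{E} \rightarrow \mathcal{D}$) and stylizing (i.e., with the StyleBank layer in between, $\mathcal{E} \rightarrow \mathcal{K} \rightarrow \mathcal{D}$), which are trained alternately. Thus, we need to define a separate loss function for each branch.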

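In the auto-encoder branch, we use MSE (Mean Square Error) between input image $I$ and output image $O$ to measure an identity loss $\mathcal{L}_I$:

$$\mathcal{L}_I(I, O) = \|O - I\|^2. \tag{2}$$

At the stylizing branch, we use perceptual loss $\mathcal{L}_K$ proposed in [19], which consists of a content loss $\mathcal{L}_c$, a style loss $\mathcal{L}_s$ and a variation regularization loss $\mathcal{L}_{tv}$:

$$\mathcal{L}_K(I, S_i, O_i) = \alpha \cdot \mathcal{L}_c(O_i, I) + \beta \cdot \mathcal{L}_s(O_i, S_i) + \gamma \cdot \mathcal{L}_{tv}(O_i), \tag{3}$$

where $I$, $S_i$, $O_i$ are the input content image, style image and stylization result (for the $i$-th style) respectively. $\mathcal{L}_{tv}(O_i)$ is a variation regularizer used in [2, 19]. $\mathcal{L}_c$ and $\mathcal{L}_s$ use the same definition in [12]:

$$\mathcal{L}_c(O_i, I) = \sum_{l \in \{l_c\}} \|F^l(O_i) - F^l(I)\|^2, \qquad \mathcal{L}_s(O_i, S_i) = \sum_{l \in \{l_s\}} \|G(F^l(O_i)) - G(F^l(S_i))\|^2, \tag{4}$$

where $F^l$ and $G$ are respectively feature map and Gram matrix computed from layer $l$ of VGG-16 network [36] (pre-trained on the ImageNet dataset [34]). $\{l_c\}$, $\{l_s\}$ are VGG-16 layers used to respectively compute the content loss and the style loss.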
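The difference between the content term and the Gram-matrix style term can be seen on toy features. The following sketch is an assumption-laden illustration: the helpers `gram`, `sq_diff`, `content_loss`, and `style_loss` are hypothetical names, features are given directly as channels of flattened responses rather than taken from VGG-16, and normalisation constants are omitted.

```python
def gram(features):
    """Gram matrix of c channels, each a flat list of spatial responses."""
    c = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(c)] for i in range(c)]


def sq_diff(a, b):
    """Sum of squared differences between two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return (a - b) ** 2
    return sum(sq_diff(x, y) for x, y in zip(a, b))


def content_loss(feat_out, feat_in):
    """Squared distance between feature maps, position by position."""
    return sq_diff(feat_out, feat_in)


def style_loss(feat_out, feat_style):
    """Squared distance between Gram matrices (spatial layout is discarded)."""
    return sq_diff(gram(feat_out), gram(feat_style))


# Two channels, three spatial positions.
feat_a = [[1, 0, 2],
          [0, 1, 1]]
# Same responses with the spatial positions permuted.
feat_b = [[2, 1, 0],
          [1, 0, 1]]

print(gram(feat_a))
print(style_loss(feat_a, feat_b), content_loss(feat_a, feat_b))
```

Permuting spatial positions leaves the Gram matrix unchanged, so the style term is zero while the content term is not. This is why the style term captures texture statistics and the content term captures local structure.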

Training Strategy.

We employ a $(T+1)$-step alternating training strategy motivated by [13] in order to balance the two branches (auto-encoder and stylizing). During training, for every $T+1$ iterations, we first train $T$ iterations on the branch with the StyleBank layer $\mathcal{K}$, then train one iteration on the auto-encoder branch. We show the training process in Algorithm 1.

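for every $T+1$ iterations do
     // Training at branch $\mathcal{E} \rightarrow \mathcal{K} \rightarrow \mathcal{D}$:
     for $t = 1$ to $T$ do
           Sample $m$ images $X = \{x_i\}$ and style indices $Y = \{y_i\}$, $i \in \{1, \ldots, m\}$, as one mini-batch.
           Update $\mathcal{E}$, $\mathcal{D}$ and $\{K_j\}, j \in Y$:
                 $\Delta^{K}_{\theta_{E,D}} \leftarrow \nabla_{\theta_{E,D}} \mathcal{L}_K$
                 $\Delta_{\theta_K} \leftarrow \nabla_{\theta_K} \mathcal{L}_K$
     end for
     // Training at branch $\mathcal{E} \rightarrow \mathcal{D}$:
     Update $\mathcal{E}$, $\mathcal{D}$ only:
           $\Delta^{I}_{\theta_{E,D}} \leftarrow \nabla_{\theta_{E,D}} \mathcal{L}_I$
           $\Delta^{I}_{\theta_{E,D}} \leftarrow \lambda \frac{\|\Delta^{K}_{\theta_{E,D}}\|}{\|\Delta^{I}_{\theta_{E,D}}\|} \Delta^{I}_{\theta_{E,D}}$
end for
Algorithm 1: Two branches training strategy. Here $\lambda$ is the tradeoff between two branches. $\Delta_{\theta_K}$ denote gradients of filter banks in $\mathcal{K}$. $\Delta^{K}_{\theta_{E,D}}$, $\Delta^{I}_{\theta_{E,D}}$ denote gradients of $\mathcal{E}$, $\mathcal{D}$ in stylizing and auto-encoder branches respectively.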
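The alternation between the two branches can be sketched as a simple schedule. The code below is a hedged illustration rather than the training code: `branch_schedule` and `rescale` are hypothetical names, and the rescaling of the auto-encoder gradient to a multiple of the stylizing gradient norm is an assumed reading of the trade-off weight in Algorithm 1.

```python
import math


def branch_schedule(num_iterations, T):
    """Yield the branch trained at each iteration: T stylizing steps
    followed by one auto-encoder step, repeated."""
    for it in range(num_iterations):
        yield "stylizing" if it % (T + 1) < T else "auto-encoder"


def rescale(grad_identity, grad_stylizing, lam):
    """Rescale the auto-encoder gradient so that its norm equals
    lam times the norm of the stylizing gradient (assumed trade-off rule)."""
    norm_i = math.sqrt(sum(g * g for g in grad_identity))
    norm_k = math.sqrt(sum(g * g for g in grad_stylizing))
    return [lam * norm_k / norm_i * g for g in grad_identity]


schedule = list(branch_schedule(6, T=2))
print(schedule)

# Flattened toy gradients of the shared encoder/decoder parameters.
balanced = rescale([3.0, 4.0], [6.0, 8.0], lam=1.0)
print(balanced)
```

With `lam=1.0` the two branches contribute gradients of equal magnitude to the shared encoder and decoder, so neither reconstruction nor stylization dominates the update.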

3.3 Understanding StyleBank and Auto-encoder

For our new representation of styles, there are several questions one might ask:

1) How does StyleBank represent styles?

Figure 2: Reconstruction of the style elements learnt from two kinds of representative patches in an exemplar stylization image.

After training the network, each style is encoded in one convolution filter bank. Each channel of a filter bank can be considered as a dictionary element or basis in the sense of the representation learning literature [3]. Different weighted combinations of these filter channels can constitute various style elements, which are the basic elements extracted from the style image for style synthesis. We may link them to "textons" in texture synthesis by analogy.

For better understanding, we try to reconstruct style elements from a learnt filter bank in an exemplar stylization image shown in Figure 2. We extract two kinds of representative patches from the stylization result (in Figure 2(b)) as objects of study: a stroke patch (indicated by the red box) and a texture patch (indicated by the green box). Then we apply the two operations below to visualize what style elements are learnt in these two kinds of patches.

First, we mask out all other regions and keep only the positions corresponding to the two patches in the feature maps (as shown in Figure 2(c)(d)), which are then convolved with the filter bank (corresponding to a specific style). We further plot feature responses in Figure 2(e) for the two patches along the dimension of feature channels. As we can observe, their responses are sparsely distributed, with peak responses at individual channels. Then, we consider only the non-zero feature channels for convolution; the filter bank channels they are convolved with (marked by green and red colors in Figure 2(f)) indeed contribute to a certain style element. Transferred features are then passed to the decoder. The recovered style elements are shown in Figure 2(g), and are very close in appearance to the original style patches (Figure 2(i)) and stylization patches (Figure 2(j)).

Figure 3: Learnt style elements of different StyleBank kernel sizes. (b) and (c) are stylization results of (3, 3) and (7, 7) kernels respectively. (d), (e) and (f) respectively show learnt style elements, original style patches and stylization patches.

To further explore the effect of kernel size in the StyleBank, we train our network with two different kernel sizes, (3, 3) and (7, 7). Then we use a similar method to visualize the learnt filter banks, as shown in Fig. 3. Here the green and red boxes indicate representative patches from (3, 3) and (7, 7) kernels respectively. The comparison shows that bigger style elements can be learnt with a larger kernel size. For example, in the bottom row, bigger sea spray appears in the stylization result with (7, 7) kernels. This suggests that our network supports control over the style element size by tuning the kernel size to better characterize the example style.

2) What is the content image encoded in?

Figure 4: K-means clustering result of feature maps (left) and corresponding stylization result (right).

Figure 5: Sparsity analysis. Top-left: means and standard deviations of per-channel average response; top-right: distributions of sorted means of per-channel average response for different model sizes (maximum channel numbers 32, 64 and 128); bottom: corresponding stylization results.

In our method, the auto-encoder is learnt to decompose the content image into multi-layer feature maps, which are independent of any styles. When further analyzing these feature maps, we have two observations.

First, these features can be spatially grouped into meaningful clusters in some sense (e.g., colors, edges, textures). To verify this point, we extract the feature vector at every position of the feature maps. Then, an unsupervised clustering (e.g., the K-means algorithm) is applied to all feature vectors (based on L2 normalized distance). Finally, we obtain the clustering results shown on the left of Figure 4, which suggest a certain segmentation of the content image.
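The clustering step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' procedure: the names `l2_normalize`, `kmeans`, and `cluster_feature_map` are hypothetical, the feature map is a tiny hand-made example, and the deterministic farthest-point initialisation is an assumption made only to keep the output reproducible.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means with farthest-point initialisation."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda j: sq_dist(v, centroids[j]))
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_feature_map(features, k):
    """Cluster the per-position feature vectors of a c x h x w map
    and return an h x w map of cluster labels."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    vectors = [l2_normalize([features[ch][y][x] for ch in range(c)])
               for y in range(h) for x in range(w)]
    labels = kmeans(vectors, k)
    return [labels[y * w:(y + 1) * w] for y in range(h)]


# Channel 0 responds on the left half, channel 1 on the right half.
features = [
    [[5, 4, 0, 0], [6, 5, 0, 0]],
    [[0, 0, 3, 2], [0, 0, 4, 3]],
]
label_map = cluster_feature_map(features, k=2)
print(label_map)
```

Because the vectors are normalised first, positions are grouped by which channels respond rather than by how strongly they respond, so the label map acts as a rough segmentation of the image.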

Comparing the right stylization result with left clustering results, we can easily find that different segmented regions are indeed rendered with different kinds of colors or textures. For regions with the same cluster label, the filled color or textures are almost the same. As a result, our auto-encoder may enable region-specific style transfer.

Second, these features are distributed sparsely across channels. To examine this point, we randomly sample content images, and for each image, we compute the average of all non-zero responses at each of the 128 feature channels (in the final layer of the encoder). We then plot the means and standard deviations of those per-channel averages across the images in the top-left of Figure 5. As we can see, valuable responses consistently exist at certain channels. One possible reason is that these channels correspond to specific style elements for region-specific transfer, which is consistent with our observation in Figure 2(e).
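The per-channel statistics described above are straightforward to compute. The sketch below assumes each encoder output is available as a nested list of shape channels × height × width; `per_channel_average` and `channel_statistics` are hypothetical helper names, the toy inputs are made up, and the population standard deviation is used as one reasonable choice.

```python
import statistics


def per_channel_average(features):
    """Average of the non-zero responses in each channel of a c x h x w map."""
    averages = []
    for channel in features:
        nonzero = [v for row in channel for v in row if v != 0]
        averages.append(sum(nonzero) / len(nonzero) if nonzero else 0.0)
    return averages


def channel_statistics(feature_maps):
    """Mean and standard deviation, across images, of the per-channel averages."""
    per_image = [per_channel_average(f) for f in feature_maps]
    by_channel = list(zip(*per_image))
    means = [statistics.mean(vals) for vals in by_channel]
    stds = [statistics.pstdev(vals) for vals in by_channel]
    return means, stds


# Two toy "images", each with 3 channels of size 1 x 2.
image_1 = [[[2.0, 0.0]], [[0.0, 0.0]], [[1.0, 3.0]]]
image_2 = [[[4.0, 4.0]], [[0.0, 0.0]], [[0.0, 2.0]]]

means, stds = channel_statistics([image_1, image_2])
print(means)
print(stds)
```

A channel that never fires, such as the second one here, shows up with zero mean and zero deviation, which is the kind of unused capacity a sorted plot of the means would expose.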

The above sparsity property leads us to consider a smaller model size for the network. We attempt to reduce all channel numbers in our auto-encoder and StyleBank layer by a factor of 2 or 4. The maximum channel number then becomes 64 or 32 respectively, down from the original 128. We also compute and sort the means of per-channel averages, as plotted in the top-right of Fig. 5. We can observe that the final layer of our encoder still maintains sparsity even for smaller models, although sparsity decreases as the model shrinks. On the bottom of Figure 5, we show the corresponding stylization results of the three model sizes. By comparison, the 32-channel model obviously produces worse results than the 128-channel model, since the latter may encourage better region decomposition for transfer. Nevertheless, there may still be potential to design a more compact model for content and style representation. We leave that to our future exploration.

3) How are content and style decoupled from each other?

Figure 6: Illustration of the effects of two branches. The middle and right ones are reconstructed input image (left) with and without auto-encoder branch during training.

To further understand how well content is decoupled from style, we need to examine whether the image is completely encoded in the auto-encoder. We compare two experiments, with and without the auto-encoder branch in our training. When we only consider the stylizing branch, the decoded image (shown in the middle of Figure 6) produced solely by the auto-encoder, without the StyleBank layer, fails to reconstruct the original input image (shown on the left of Figure 6), and instead seems to carry some style information. When we enable the auto-encoder branch in training, we obtain the final image (shown on the right of Figure 6) reconstructed from the auto-encoder, which is very close in appearance to the input image. Consequently, the content is explicitly encoded into the auto-encoder, independent of any styles. This makes it convenient to learn multiple styles in a single network and reduces the interference among different styles.

4) How does the content image control style transfer?

Figure 7: Stylization result of a toy image, which consists of four parts of different color or different texture.

To understand how the content controls style transfer, we consider a toy case shown in Figure 7. On the top, we show the input toy image consisting of five regions with different colors or textures. On the bottom, we show the output stylization result. Below are some interesting observations:

  • For input regions with different colors but without textures, only a pure color transfer is applied (see Figure 7 (b)(f)).

  • For input regions with the same color but different textures, the transfer consists of two parts: the same color transfer and different texture transfer influenced by the appearance of the input textures (see Figure 7 (c)(d)).

  • For input regions with different colors but the same textures, the results have the same transferred textures but different target colors (see Figure 7 (d)(e)).

4 Capabilities of Our Network

Because of an explicit representation, our proposed feed-forward network provides additional capabilities, when compared with previous feedforward networks for style transfer. They may bring new user experiences or generate new stylization effects compared to existing methods.

4.1 Incremental Training

Previous style transfer networks (e.g., [19, 37, 22]) have to be retrained for a new style, which is very inconvenient. In contrast, an iterative optimization mechanism [12] provides online learning for any new style, which takes several minutes for one style on a GPU (e.g., Titan X). Our method has the virtues of both feed-forward networks [19, 37, 22] and the iterative optimization method [12]. We enable incremental training for new styles, which has a learning time comparable to the online-learning method [12], while preserving the efficiency of feed-forward networks [19, 37, 22].

In our configuration, we first jointly train the auto-encoder and multiple filter banks (50 styles used at the beginning) with the strategy described in Algorithm 1. After that, the StyleBank layer can be incrementally augmented and trained for new styles while the auto-encoder is kept fixed. The process converges very fast since only the augmented part of the StyleBank is updated in each iteration instead of the whole network. In our experiments, when training with a Titan X and a training image size of 512, it only takes around 8 minutes to train a new style, a considerable speed-up over previous feed-forward methods.
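The idea of training only the new filter bank while the shared encoder and decoder stay frozen can be illustrated with a deliberately tiny scalar model. Everything in this sketch is a toy assumption: the encoder, decoder and filter bank are each reduced to a single gain, the hypothetical function `train_new_style` uses plain gradient descent, and the target is a made-up scaling factor rather than a perceptual loss.

```python
def train_new_style(encoder_gain, decoder_gain, target_gain, steps=60, lr=0.25):
    """Fit a new scalar 'filter bank' k while the encoder/decoder gains stay fixed.

    Toy model: output = decoder_gain * k * encoder_gain * x, trained so that
    the output matches target_gain * x on the single sample x = 1.
    """
    k = 0.0                                   # new bank, initialised from scratch
    for _ in range(steps):
        error = decoder_gain * k * encoder_gain - target_gain
        grad_k = 2.0 * error * decoder_gain * encoder_gain
        k -= lr * grad_k                      # only k is updated
    return k


encoder_gain, decoder_gain = 2.0, 0.5         # frozen after joint training
existing_banks = [1.0, 4.0]                   # styles learnt earlier, untouched

new_bank = train_new_style(encoder_gain, decoder_gain, target_gain=3.0)
style_bank = existing_banks + [new_bank]      # augment the StyleBank
print(round(new_bank, 6), len(style_bank))
```

Since gradients flow only into the new bank, earlier styles and the shared auto-encoder are guaranteed to be unchanged, and the optimisation problem is much smaller than retraining the whole network.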

Figure 8 shows several stylization results of new styles by incremental training. It obtains very comparable stylization results to those from fresh training, which retrains the whole network with the new styles.

Figure 8: Comparison between incremental training (Left) and fresh training (Right). The target styles are shown on the top-left.

4.2 Style Fusion

We provide two different types of style fusion: linear fusion of multiple styles, and region-specific style fusion.

Figure 9: Results by linear combination of two style filter banks.

Linear Fusion of Styles.

Since different styles are encoded into different filter banks $\{K_i\}$, we can linearly fuse multiple styles by simply linearly fusing filter banks in the StyleBank layer. Next, the fused filter bank is used to convolve with content features $F$:

$$\widetilde{F} = \Big(\sum_{i=1}^{m} w_i \times K_i\Big) \otimes F, \qquad \sum_{i=1}^{m} w_i = 1, \tag{5}$$

where $m$ is the number of styles, $K_i$ is the filter bank of style $i$. $\widetilde{F}$ is then fed to the decoder. Figure 9 shows such linear fusion results of two styles with varying fusion weight $w_i$.
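Because convolution is linear in the kernel, fusing the filter banks first gives the same features as blending the separately stylized features. The sketch below checks this on toy data; `conv_same` and `fuse_banks` are hypothetical helpers, zero padding and cross-correlation are assumed, and the two banks are hand-made rather than learnt.

```python
def conv_same(features, bank):
    """Cross-correlate c_in x h x w features with a c_out x c_in x kh x kw bank
    (zero padding, output c_out x h x w)."""
    h, w = len(features[0]), len(features[0][0])
    kh, kw = len(bank[0][0]), len(bank[0][0][0])
    out = []
    for filt in bank:
        plane = [[0.0] * w for _ in range(h)]
        for c, channel in enumerate(features):
            for y in range(h):
                for x in range(w):
                    for dy in range(kh):
                        for dx in range(kw):
                            yy, xx = y + dy - kh // 2, x + dx - kw // 2
                            if 0 <= yy < h and 0 <= xx < w:
                                plane[y][x] += filt[c][dy][dx] * channel[yy][xx]
        out.append(plane)
    return out


def fuse_banks(banks, weights):
    """Element-wise weighted sum of filter banks: sum_i w_i * K_i."""
    def combine(items):
        if isinstance(items[0], (int, float)):
            return sum(w * v for w, v in zip(weights, items))
        return [combine(group) for group in zip(*items)]
    return combine(banks)


identity = [[[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]]   # copies each pixel
shift = [[[[0, 0, 0], [1, 0, 0], [0, 0, 0]]]]      # copies the left neighbour
F = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [0.5, 0.5]                               # fusion weights, summing to 1

fused_out = conv_same(F, fuse_banks([identity, shift], weights))
out_a, out_b = conv_same(F, identity), conv_same(F, shift)
blended = [[[weights[0] * a + weights[1] * b for a, b in zip(ra, rb)]
            for ra, rb in zip(pa, pb)] for pa, pb in zip(out_a, out_b)]
print(fused_out)
print(blended)
```

In practice this means a fused style costs a single pass through the StyleBank layer, however many styles are mixed.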

Region-specific Style Fusion.

Our method naturally allows a region-specific style transfer, in which different image regions can be rendered by various styles. Suppose that the image is decomposed into $m$ disjoint regions by automatic clustering (e.g., K-means mentioned in Section 3.3 or advanced segmentation algorithms [5, 33]) in our feature space, and $M_i$ denotes every region mask. The feature maps can be described as $F = \sum_{i=1}^{m} (M_i \times F)$. Then region-specific style fusion can be formulated as Equation (6):

$$\widetilde{F} = \sum_{i=1}^{m} K_i \otimes (M_i \times F), \tag{6}$$

where $K_i$ is the $i$-th filter bank.
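Region-specific fusion amounts to masking the content features per region, applying a different filter bank to each masked copy, and summing the results. The following sketch illustrates this on a 2×2 toy feature map; `conv_same`, `region_fusion`, and `scaled_delta` are hypothetical helpers, the masks are hand-written rather than obtained by clustering, and zero padding with cross-correlation is assumed.

```python
def conv_same(features, bank):
    """Cross-correlate c_in x h x w features with a c_out x c_in x kh x kw bank
    (zero padding, output c_out x h x w)."""
    h, w = len(features[0]), len(features[0][0])
    kh, kw = len(bank[0][0]), len(bank[0][0][0])
    out = []
    for filt in bank:
        plane = [[0.0] * w for _ in range(h)]
        for c, channel in enumerate(features):
            for y in range(h):
                for x in range(w):
                    for dy in range(kh):
                        for dx in range(kw):
                            yy, xx = y + dy - kh // 2, x + dx - kw // 2
                            if 0 <= yy < h and 0 <= xx < w:
                                plane[y][x] += filt[c][dy][dx] * channel[yy][xx]
        out.append(plane)
    return out


def region_fusion(features, banks, masks):
    """Sum over regions of bank_i applied to the masked features mask_i * F."""
    h, w = len(features[0]), len(features[0][0])
    total = [[[0.0] * w for _ in range(h)] for _ in banks[0]]
    for bank, mask in zip(banks, masks):
        masked = [[[channel[y][x] * mask[y][x] for x in range(w)]
                   for y in range(h)] for channel in features]
        for o, plane in enumerate(conv_same(masked, bank)):
            for y in range(h):
                for x in range(w):
                    total[o][y][x] += plane[y][x]
    return total


def scaled_delta(gain):
    """Toy one-channel 'style': multiply every response by gain."""
    return [[[[0, 0, 0], [0, gain, 0], [0, 0, 0]]]]


F = [[[1.0, 2.0], [3.0, 4.0]]]
masks = [[[1, 0], [1, 0]],       # region 0: left column
         [[0, 1], [0, 1]]]       # region 1: right column
banks = [scaled_delta(1), scaled_delta(10)]

result = region_fusion(F, banks, masks)
print(result)
```

The left column is rendered with the first toy style and the right column with the second, all within one pass over the shared content features.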

Figure 10 shows such a region-specific style fusion result, which borrows styles from two famous paintings by Picasso and Van Gogh. Unlike existing feed-forward networks, our method naturally obtains an image decomposition for transferring specific styles, and passes through the network only once. In contrast, previous approaches have to pass through the network several times and finally montage different styles via additional segmentation masks.

Figure 10: Region-specific style fusion with two paintings by Picasso and Van Gogh, where the regions are automatically segmented with the K-means method.

5 Experiments

Training Details

Our network is trained on 1000 content images randomly sampled from Microsoft COCO dataset [27] and style images (from existing papers and the Internet). Each content image is randomly cropped to , and each style image is scaled to on the long side. We train the network with a batch size of 4 ( in Algorithm 1) for iterations. And the Adam optimization method [20] is adopted with the initial learning rate of and decayed by at every iterations. In all of our experiments, we compute content loss at layer and style loss at layer , , , and of the pre-trained VGG-16 network. We use (in Algorithm 1) in our two branches training.

5.1 Comparisons

In this section, we compare our method with other CNN-based style transfer approaches [12, 19, 37, 6]. For fair comparison, we directly borrow results from their papers. It is difficult to compare results with different degrees of stylization abstraction, which is controlled by the ratio $\alpha/\beta$ of content weight to style weight in Equation (3), and different works may use their own ratios to present results. For comparable perceptual quality, we choose a different $\alpha/\beta$ in each comparison. More results are available in our supplementary material.

Compared with the Iterative Optimization Method.

Figure 11: Comparison with optimization-based method [12].

We choose the ratio $\alpha/\beta$ (in Equation (3)) to produce comparable perceptual stylization in Fig. 11. Our method, like all other feed-forward methods, creates less abstract stylization results than the optimization method [12]. It is still difficult to judge which one is more appealing in practice. However, our method, like other feed-forward methods, can be hundreds of times faster than optimization-based methods.

Compared with Feed-forward Networks.

Figure 12: Comparison with the feed-forward network in [37].
Figure 13: Comparison with the feed-forward network in [19].

In Figure 12 and Figure 13, we respectively compare our results with two feed-forward network methods [37, 19]. We use the same ratio $\alpha/\beta$ (in Equation (3)) in both comparisons. Ulyanov et al. [37] design a shallow network specialized for the texture synthesis task. When it is applied to the style transfer task, the stylization results are more like texture transfer, sometimes randomly pasting textures onto the content image. Johnson et al. [19] use a much deeper network and often obtain better results. Compared with both methods, our results clearly exhibit more region-based style transfer, for instance, the portrait in Figure 12, and the river/grass/forest in Fig. 13. Moreover, different from their one-network-per-style training, all of our styles are jointly trained in a single model.

Compared with other Synchronal Learning.

Figure 14: Comparison with the synchronal learning [6].

Dumoulin et al., in their very recent work [6], introduce the "conditional instance normalization" mechanism derived from [38] to jointly train multiple styles in one model, where parameters of different styles are defined by different instance normalization factors (scaling and shifting) after each convolution layer. However, their network does not explicitly decouple content and styles as ours does. Compared with theirs, our method seems to allow more region-specific transfer. As shown in Fig. 14, our stylization results better correspond to the natural regions of content images. In this comparison, the ratio $\alpha/\beta$ (in Equation (3)) is likewise chosen for comparable perceptual quality.

6 Discussion and Conclusion

In this paper, we have proposed a novel explicit representation for style and content, which can be well decoupled by our network. The decoupling allows faster training (for multiple styles, and new styles), and enables new interesting style fusion effects, like linear and region-specific style transfer. More importantly, we present a new interpretation of neural style transfer, which may inspire new understanding of image reconstruction and restoration.

There are still some interesting issues for further investigation. For example, the auto-encoder may integrate semantic segmentation [28, 26] as additional supervision in the region decomposition, which would help create more impressive region-specific transfer. Besides, our learnt representation does not fully utilize all channels, which may imply a more compact representation.


This work is partially supported by the National Natural Science Foundation of China (NSFC, No. 61371192).