Arbitrary Style Transfer with Style-Attentional Networks

12/06/2018 ∙ by Dae Young Park, et al.

Arbitrary style transfer is the problem of synthesizing a content image with the style of an image that has never been seen before. Recent arbitrary style transfer algorithms either trade off the content structure against the style patterns, or struggle to maintain the global and local style patterns at the same time because of their patch-based mechanism. In this paper, we introduce a novel style-attentional network (SANet), which efficiently and flexibly decorates the local style patterns according to the semantic spatial distribution of the content image. A new identity loss function and multi-level feature embedding also enable our SANet and decoder to preserve the content structure as much as possible while enriching the style patterns. Experimental results demonstrate that our algorithm synthesizes higher-quality stylized images in real time than state-of-the-art algorithms.




1 Introduction

Artistic style transfer is a technique used to create art by taking a style image and recomposing a content image: global and local style patterns from the style image are synthesized evenly over the content image, while the content image's original structure is maintained. The seminal work of Gatys et al. [5] showed that the correlation between features extracted from a pre-trained deep neural network can capture the style patterns well. The method by Gatys et al. [5] is flexible enough to combine the content and style of arbitrary images, but it is prohibitively slow due to its iterative optimization process.

Significant efforts have been made to reduce the computational cost. Several approaches [1, 8, 12, 22, 3, 14, 19, 26, 29] have been developed based on feed-forward networks. The feed-forward methods can synthesize stylized images efficiently, but they are limited to a fixed number of styles or provide insufficient visual quality.

For arbitrary style transfer, a few methods [13, 7, 20] holistically adjust the content features to match the second-order statistics of the style features. AdaIN [7] simply adjusts the mean and variance of the content image to match those of the style image. Although AdaIN effectively combines the structure of the content image and the style pattern by transferring feature statistics, its output suffers in quality due to the over-simplified nature of this method. WCT [13] transforms the content features into the style feature space through a whitening and coloring process that uses the covariance instead of the variance. By embedding these stylized features within a pre-trained encoder-decoder module, the style-free decoder synthesizes the stylized image. However, when the feature has a large dimension, WCT requires computationally expensive operations. AvatarNet [20] proposed a patch-based style decorator module that maps the content features with the characteristics of the style patterns while maintaining the content structure. AvatarNet considers not only the holistic style distribution, but also the local style patterns. However, despite these valuable efforts, these methods still fail to reflect the detailed texture of the style image, distort content structures, or fail to balance the local and global style patterns.
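The statistic matching attributed to AdaIN above can be illustrated with a minimal NumPy sketch. It assumes feature maps stored as (channels, height, width) arrays; the function name `adain` and the `eps` stabilizer are illustrative choices and are not taken from [7].

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Shift and scale each content channel to the style channel's mean/std.

    Both inputs are (C, H, W) arrays; their spatial sizes may differ.
    `eps` is an assumed stabilizer that avoids division by zero.
    """
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    normalized = (content_feat - c_mean) / c_std
    return normalized * s_std + s_mean

# Toy data standing in for encoder features.
rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))
style = rng.normal(loc=2.0, scale=3.0, size=(4, 6, 6))
out = adain(content, style)
```

Only two numbers per channel are transferred, which is one way to see why such a method captures global color and texture statistics but not spatially local patterns.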

In this work, we propose a novel arbitrary style transfer algorithm, which synthesizes high-quality stylized images in real time while preserving the content structure. This is achieved by a new style-attentional network (SANet) and a novel identity loss function. For arbitrary style transfer, our feed-forward network, composed of SANets and decoders, learns the semantic correlations between the content features and the style features by spatially rearranging the style features according to the content features.

Our SANet is closely related to the style feature decorator of AvatarNet [20]. However, the biggest difference between the two approaches is that the SANet can flexibly decorate the style features by learning through the conventional style reconstruction loss and the identity loss. In addition, our identity loss function helps the SANet maintain as much of the original content structure as possible while keeping the diversity of global and local style patterns. The main contributions of our work are:

We propose a new style-attentional network to flexibly match the semantically nearest style features onto the content features.

We present a learning approach for a feed-forward network composed of style-attentional networks and decoders, which can be optimized using a conventional style reconstruction loss and a new identity loss.

Our experiments show that our method is highly efficient (about 100-150 fps) at synthesizing high-quality stylized images by balancing the global and local style patterns while preserving the content structure.

2 Related Work

Arbitrary Style Transfer. The ultimate goal of arbitrary style transfer is to achieve generalization, quality, and efficiency simultaneously. Despite recent advances, existing works [5, 4, 1, 8, 12, 22, 3, 6, 10, 11, 23, 24, 28, 18] present a trade-off among generalization, quality, and efficiency. Recently, several methods [13, 20, 2, 7] have been proposed to achieve arbitrary style transfer. The AdaIN algorithm simply adjusts the mean and variance of the content image to match those of the style image, transferring global feature statistics. WCT performs a pair of feature transforms, whitening and coloring, for feature embedding within a pre-trained encoder-decoder module. AvatarNet proposed a patch-based feature decorator, which transfers the content features to the semantically nearest style features while simultaneously minimizing the difference between their holistic feature distributions. In many cases, we observe that the results of WCT and AvatarNet fail to sufficiently represent the detailed texture or maintain the content structure. We conjecture that WCT and AvatarNet fail to synthesize the detailed texture style because their pre-trained general encoder-decoder networks are learned from general images, such as the MS-COCO dataset, with large differences in style characteristics. As a result, these methods map the style features onto the content features in the feature space, but they have no way to control the global statistics of the style or the content structure. Although AvatarNet can obtain the local style patterns through the patch-based style decorator, the scale of the style patterns in the style images depends on the patch size. Therefore, the global and local style patterns cannot both be taken into consideration. On the other hand, AdaIN transforms texture and color distribution well, but does not represent local style patterns well.
There exists another trade-off between content and style due to a combinational scale-adapted content and style loss. In this paper, we try to solve these problems by using the SANets and the identity loss. In this way, the proposed style transfer network can represent global and local style patterns, and maintain the content structure without losing the richness of the style.

Figure 2: Overview of the proposed method. (a) A fixed VGG encoder encodes the content and style images. Two style-attentional networks (SANets) map features from the Relu_4_1 and Relu_5_1 layers, respectively. The decoder transforms the combined SANet output features into the stylized image $I_{cs}$ (Equ. 4). The content loss $L_c$ (Equ. 7) and the style loss $L_s$ (Equ. 8) are computed using the fixed VGG encoder. (b) The identity loss $L_{identity}$ (Equ. 9) calculates the difference between $I_c$ and $I_{cc}$ or between $I_s$ and $I_{ss}$. It is computed from a pair of identical images (content or style), and $I_{cc}$ ($I_{ss}$) denotes the output image synthesized from that pair.

Self-Attention Mechanism. Our style-attentional module is related to the recent self-attention methods [25, 30] for image generation and machine translation. These models calculate the response at a position in a sequence or an image by attending to all positions and taking their weighted average in an embedding space. The proposed style-attentional network learns the mapping between the content features and the style features by slightly modifying the self-attention mechanism.

3 Method

In this paper, a novel style transfer network is proposed. The style transfer network is composed of an encoder-decoder module and a style-attentional module, as shown in Fig. 2. The proposed feed-forward network effectively generates high-quality stylized images that appropriately reflect global and local style patterns. Our new identity loss function helps to maintain the detailed structure of the content while reflecting the style sufficiently.

3.1 Network Architecture

Our style transfer network takes a content image and an arbitrary style image as inputs, and synthesizes a stylized image with the semantic structures from the former and the characteristics from the latter. In this work, the pretrained VGG-19 network [21] is employed as the encoder, and a symmetric decoder and two SANets are jointly trained for arbitrary style transfer. Our decoder follows the setting of [7].

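To combine global and local style patterns adequately, we integrate two SANets by taking VGG feature maps encoded from different layers (Relu_4_1 and Relu_5_1) as inputs and combining both output feature maps. From a pair of a content image $I_c$ and a style image $I_s$, we first extract their VGG feature maps $F_c = E(I_c)$ and $F_s = E(I_s)$ at a certain layer (e.g., Relu_4_1) of the encoder $E$.

After encoding the content and style images, we feed both feature maps to a SANet module that maps the correspondences between the content feature maps $F_c$ and the style feature maps $F_s$, producing the output feature maps $F_{cs}$:

$$F_{cs} = \mathrm{SANet}(F_c, F_s) \tag{1}$$

After applying a 1×1 convolution to $F_{cs}$ and taking the element-wise sum of the two matrices, we obtain $F_{csc}$:

$$F_{csc} = F_c + W_{cs} F_{cs} \tag{2}$$

where "$+$" denotes the element-wise sum.

We combine the two output feature maps from the two SANets as

$$F^{m}_{csc} = \mathrm{conv}_{3\times 3}\left(F^{r\_4\_1}_{csc} + \mathrm{upsampling}\left(F^{r\_5\_1}_{csc}\right)\right) \tag{3}$$

where $F^{r\_4\_1}_{csc}$ and $F^{r\_5\_1}_{csc}$ are the output feature maps obtained from the two SANets, $\mathrm{conv}_{3\times 3}$ denotes the 3×3 convolution used to combine the two feature maps, and $F^{r\_5\_1}_{csc}$ is added to $F^{r\_4\_1}_{csc}$ after upsampling.

Then the stylized output image $I_{cs}$ is synthesized by feeding $F^{m}_{csc}$ into the decoder $D$:

$$I_{cs} = D\left(F^{m}_{csc}\right) \tag{4}$$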
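The multi-level combination described in this subsection can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names (`conv1x1`, `conv3x3`, `upsample2x`, `merge_levels`), zero padding, nearest-neighbour upsampling, and the toy channel count are all hypothetical and are not specified in the text; the two SANet outputs are replaced by random arrays.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution mixes channels only.
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H and W (assumed).
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             padded[:, dy:dy + h, dx:dx + wd])
    return out

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of two (assumed interpolation).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge_levels(f_c4, f_cs4, f_c5, f_cs5, w4, w5, w_merge):
    f_csc4 = f_c4 + conv1x1(f_cs4, w4)   # residual sum at the shallower level
    f_csc5 = f_c5 + conv1x1(f_cs5, w5)   # residual sum at the deeper level
    return conv3x3(f_csc4 + upsample2x(f_csc5), w_merge)

rng = np.random.default_rng(0)
C = 4                                             # toy channel count
f_c4, f_cs4 = rng.normal(size=(2, C, 8, 8))       # stand-ins for Relu_4_1 features
f_c5, f_cs5 = rng.normal(size=(2, C, 4, 4))       # stand-ins for Relu_5_1 features
w4, w5 = rng.normal(size=(2, C, C))
w_merge = np.zeros((C, C, 3, 3))
w_merge[:, :, 1, 1] = np.eye(C)                   # centre-tap identity, easy to verify
f_m = merge_levels(f_c4, f_cs4, f_c5, f_cs5, w4, w5, w_merge)
```

The deeper feature map has half the spatial resolution of the shallower one, which is why it is upsampled before the sum.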


3.2 Style-Attentional Network for Style Feature Embedding

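Fig. 3 shows the style feature embedding performed by the SANet module. The content feature maps $F_c$ and the style feature maps $F_s$ from the encoder are normalized and then transformed into two feature spaces $f$ and $g$ to calculate the attention between $\bar{F}_c^{i}$ and $\bar{F}_s^{j}$ as:

$$F_{cs}^{i} = \frac{1}{C(F)} \sum_{\forall j} \exp\left(f(\bar{F}_c^{i})^{T} g(\bar{F}_s^{j})\right) h(F_s^{j}) \tag{5}$$

where $f(\bar{F}_c) = W_f \bar{F}_c$, $g(\bar{F}_s) = W_g \bar{F}_s$, $h(F_s) = W_h F_s$, and $\bar{F}$ denotes a channel-wise normalized version of $F$. The response is normalized by a factor $C(F) = \sum_{\forall j} \exp\left(f(\bar{F}_c^{i})^{T} g(\bar{F}_s^{j})\right)$. Here $i$ is the index of an output position and $j$ is the index that enumerates all possible positions. In the above formulation, $W_f$, $W_g$, and $W_h$ are the learned weight matrices, which are implemented as 1×1 convolutions as in [30].

Figure 3: Style-attentional network (SANet).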

Our SANet has a network structure similar to the existing non-local block structure [27], but the number of inputs is different (the inputs of the SANet are $F_c$ and $F_s$). The SANet module can appropriately embed a local style pattern at each position of the content feature maps by learning the relationship (such as affinity) between the content feature maps and the style feature maps.
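A minimal NumPy sketch of the attention step described in this subsection is given below. It assumes (C, H, W) feature maps and treats the three 1×1 convolutions as plain channel-mixing matrices; the names `style_attention` and `mean_variance_norm`, the `eps` stabilizer, and the max-subtraction for numerical stability are illustrative choices rather than details from the text.

```python
import numpy as np

def mean_variance_norm(feat, eps=1e-5):
    # Channel-wise normalization: zero mean, unit variance per channel.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return (feat - mean) / std

def style_attention(f_c, f_s, w_f, w_g, w_h):
    """Attend from every content position to every style position.

    f_c: (C, Hc, Wc), f_s: (C, Hs, Ws); w_f, w_g, w_h: (C, C) matrices
    standing in for 1x1 convolutions.
    """
    c, hc, wc = f_c.shape
    query = w_f @ mean_variance_norm(f_c).reshape(c, -1)   # (C, Nc)
    key = w_g @ mean_variance_norm(f_s).reshape(c, -1)     # (C, Ns)
    value = w_h @ f_s.reshape(c, -1)                       # (C, Ns), not normalized
    logits = query.T @ key                                 # (Nc, Ns) affinities
    logits -= logits.max(axis=1, keepdims=True)            # stability only
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)                # softmax over style positions
    out = value @ attn.T                                   # weighted sum of style features
    return out.reshape(c, hc, wc), attn

rng = np.random.default_rng(0)
f_c = rng.normal(size=(4, 5, 5))
f_s = rng.normal(size=(4, 3, 6))
eye = np.eye(4)                                            # identity weights for the demo
f_cs, attn = style_attention(f_c, f_s, eye, eye, eye)
```

Each output position is a convex combination of style features, so the content only decides where style features are placed; with learned weights the affinities are no longer raw feature similarities.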

3.3 Full Objective

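As shown in Fig. 2, we use the encoder (pre-trained VGG-19 [21]) to compute the loss function for training the SANet and the decoder:

$$L = \lambda_c L_c + \lambda_s L_s + L_{identity} \tag{6}$$

where $L$ is composed of the content loss $L_c$, the style loss $L_s$, and the identity loss $L_{identity}$, and $\lambda_c$ and $\lambda_s$ are the weights of the different losses.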

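Similar to [7], the content loss is the Euclidean distance between the channel-wise normalized target features, $\overline{F_c^{r\_4\_1}}$ and $\overline{F_c^{r\_5\_1}}$, and the channel-wise normalized VGG features of the output image, $\overline{E(I_{cs})^{r\_4\_1}}$ and $\overline{E(I_{cs})^{r\_5\_1}}$:

$$L_c = \left\| \overline{E(I_{cs})^{r\_4\_1}} - \overline{F_c^{r\_4\_1}} \right\|_2 + \left\| \overline{E(I_{cs})^{r\_5\_1}} - \overline{F_c^{r\_5\_1}} \right\|_2 \tag{7}$$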
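As a rough illustration of this content loss, the NumPy sketch below compares channel-wise normalized feature maps at two levels. The function names and the `eps` value are assumptions, and random arrays stand in for the encoder features at Relu_4_1 and Relu_5_1.

```python
import numpy as np

def mean_variance_norm(feat, eps=1e-5):
    # Channel-wise normalization of a (C, H, W) feature map.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return (feat - mean) / std

def content_loss(output_feats, content_feats):
    # Sum of Euclidean distances between normalized features, one term per level.
    return sum(
        np.linalg.norm(mean_variance_norm(o) - mean_variance_norm(c))
        for o, c in zip(output_feats, content_feats)
    )

rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 8, 8)), rng.normal(size=(4, 4, 4))]
same = content_loss(feats, feats)
rescaled = content_loss([2.0 * f + 3.0 for f in feats], feats)
shifted = content_loss([np.roll(f, 1, axis=2) for f in feats], feats)
```

Because of the normalization, a per-channel change of scale and offset (`rescaled`) leaves the loss near zero, while a spatial rearrangement (`shifted`) is penalized: the loss targets structure rather than feature statistics.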


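The style loss is defined as:

$$L_s = \sum_{i=1}^{L} \left( \left\| \mu(\phi_i(I_{cs})) - \mu(\phi_i(I_s)) \right\|_2 + \left\| \sigma(\phi_i(I_{cs})) - \sigma(\phi_i(I_s)) \right\|_2 \right) \tag{8}$$

where each $\phi_i$ denotes a layer in the encoder used to compute the style loss, and $\mu$ and $\sigma$ denote the channel-wise mean and standard deviation of the features. We use the Relu_1_1, Relu_2_1, Relu_3_1, Relu_4_1, and Relu_5_1 layers with equal weights. We applied both the Gram matrix loss [5] and the AdaIN style loss [7], and found the results of the AdaIN style loss more satisfactory.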
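The mean and standard deviation matching used by an AdaIN-style loss can be sketched as below. The function name is illustrative and a single random array stands in for the encoder layers; in the setting described above there would be one pair of feature maps per layer.

```python
import numpy as np

def style_loss(output_feats, style_feats):
    # Per layer: distance between channel means plus distance between channel stds.
    total = 0.0
    for o, s in zip(output_feats, style_feats):
        total += np.linalg.norm(o.mean(axis=(1, 2)) - s.mean(axis=(1, 2)))
        total += np.linalg.norm(o.std(axis=(1, 2)) - s.std(axis=(1, 2)))
    return total

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))

# Shuffling spatial positions keeps the per-channel statistics unchanged.
perm = rng.permutation(64)
shuffled = feat.reshape(4, -1)[:, perm].reshape(4, 8, 8)
loss_shuffled = style_loss([shuffled], [feat])

# A constant offset of 1 in each of the 4 channels changes only the means.
loss_offset = style_loss([feat + 1.0], [feat])
```

The shuffled case has essentially zero loss, which shows that such a loss sees only global statistics and is blind to where a pattern appears.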

When $W_f$, $W_g$, and $W_h$ are fixed as identity matrices, each position in the content feature maps can be transformed into the semantically nearest feature in the style feature maps. In this case, the network cannot parse sufficient style features. In the SANet, although $W_f$, $W_g$, and $W_h$ are learnable matrices, our style transfer model would be trained by considering only global statistics through the style loss $L_s$.

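In order to consider both the global statistics and the semantically local mapping between the content features and the style features, we define a new identity loss function as:

$$L_{identity} = \lambda_{identity1} \left( \left\| I_{cc} - I_c \right\|_2 + \left\| I_{ss} - I_s \right\|_2 \right) + \lambda_{identity2} \sum_{i=1}^{L} \left( \left\| \phi_i(I_{cc}) - \phi_i(I_c) \right\|_2 + \left\| \phi_i(I_{ss}) - \phi_i(I_s) \right\|_2 \right) \tag{9}$$

where $I_{cc}$ (or $I_{ss}$) denotes the output image synthesized from two identical content (or style) images, each $\phi_i$ denotes a layer in the encoder, and $\lambda_{identity1}$ and $\lambda_{identity2}$ are the identity loss weights. The weighting parameters $\lambda_c$, $\lambda_s$, $\lambda_{identity1}$, and $\lambda_{identity2}$ are simply set to fixed values in our experiments.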
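The structure of the identity loss (a pixel-level term plus a feature-level term, each computed on images passed through the network paired with themselves) can be sketched as follows. Everything named here is hypothetical: `avg_pool` and the two lambda "layers" are crude stand-ins for encoder layers, and the weights of 1.0 are placeholders, not the values used in the experiments.

```python
import numpy as np

def avg_pool(x, k):
    # Stand-in for an encoder layer: k x k average pooling of a (C, H, W) image.
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def identity_loss(i_c, i_cc, i_s, i_ss, layers, lam1, lam2):
    # Pixel-level reconstruction error for the content and the style image.
    pixel = np.linalg.norm(i_cc - i_c) + np.linalg.norm(i_ss - i_s)
    # Feature-level reconstruction error, summed over the chosen layers.
    feature = sum(
        np.linalg.norm(phi(i_cc) - phi(i_c)) + np.linalg.norm(phi(i_ss) - phi(i_s))
        for phi in layers
    )
    return lam1 * pixel + lam2 * feature

layers = [lambda x: avg_pool(x, 2), lambda x: avg_pool(x, 4)]   # hypothetical layers
rng = np.random.default_rng(0)
i_c = rng.random(size=(3, 8, 8))
i_s = rng.random(size=(3, 8, 8))

perfect = identity_loss(i_c, i_c, i_s, i_s, layers, 1.0, 1.0)
imperfect = identity_loss(i_c, i_c + 0.1, i_s, i_s, layers, 1.0, 1.0)
doubled = identity_loss(i_c, i_c + 0.1, i_s, i_s, layers, 2.0, 2.0)
```

The loss vanishes only when the network reproduces an image that is used as both content and style, so there is no style gap to trade off against.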

The content loss and the style loss control the trade-off between the structure of the content image and the style patterns. Unlike these two losses, the identity loss is calculated from identical input images, which have no gap in style characteristics. Therefore, the identity loss concentrates on keeping the structure of the content image rather than changing the style statistics. As a result, the identity loss makes it possible to maintain the structure of the content image and the style characteristics of the reference image at the same time.

Figure 4: Result close-ups. Regions marked by bounding boxes are zoomed in for a better visualization.
Figure 5: User preference result of five style transfer algorithms.
| Method | Time at 256 px (s) | Time at 512 px (s) |
|---|---|---|
| Gatys et al. | 15.863 | 50.804 |
| WCT | 0.668 | 0.943 |
| AvatarNet | 0.248 | 0.356 |
| AdaIN | 0.010 | 0.011 |
| Ours (Relu_4_1) | 0.006 | 0.006 |
| Ours (multi-level) | 0.008 | 0.008 |

Table 1: Execution time comparison (in seconds).

4 Experimental Results

Fig. 2 shows an overview of our style transfer network based on the proposed SANets. Code and pretrained models (in PyTorch [16]) will be made available to the public.

4.1 Experimental Settings

We train the network using MS-COCO [15] as content images and WikiArt [17] as style images. Both datasets contain roughly 80,000 training images. We use the Adam optimizer [9] with a learning rate of 0.0001 and a batch size of 5 content-style image pairs. During training, we first rescale the smaller dimension of both images to 512 while preserving the aspect ratio, then randomly crop a region of size 256×256 pixels. At test time, our network can handle any input size because it is fully convolutional.
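The training-time preprocessing described above reduces to two small computations, sketched here in plain Python. The function names and the 640×480 example size are illustrative; the actual resampling would be done by an image library, which is omitted.

```python
import random

def resize_dims(width, height, target=512):
    # Scale so that the smaller side becomes `target`, keeping the aspect ratio.
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def random_crop_box(width, height, crop=256, rng=random):
    # Pick a crop window (left, top, right, bottom) uniformly inside the image.
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop

w, h = resize_dims(640, 480)                        # a 640x480 image becomes 683x512
box = random_crop_box(w, h, rng=random.Random(0))   # seeded for reproducibility
```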

4.2 Comparison with Prior Works

To evaluate our method, we compared it with three types of arbitrary style transfer methods: the iterative optimization method [5], the feature transformation based methods [13, 7], and the patch-based method [20].

Qualitative examples. In Fig. 11 we show example style transfer results synthesized by the state-of-the-art methods; more results are provided in the supplementary materials. Note that none of the test style images were observed during the training of our model. The optimization-based method [5] allows arbitrary style transfer but is likely to encounter a bad local minimum (e.g., rows 2 and 4), and it is computationally expensive due to its iterative optimization process (see Table 1).

AdaIN [7] simply adjusts the mean and variance of the content features to synthesize the stylized image. However, its results are less appealing and often retain some of the color distribution of the content due to the trade-off between content and style (e.g., rows 1, 2, and 8). Also, both AdaIN [7] and WCT [13] sometimes produce distorted local style patterns because they holistically adjust the content features to match the second-order statistics of the style features, as shown in Fig. 11. Although AvatarNet [20] decorates the style patterns according to the semantic spatial distribution of the content image and applies multi-scale style transfer, it frequently cannot represent the local and global style patterns at the same time because of its dependency on the patch size. It also fails to keep the content structure in most cases (see the 4th column in Fig. 11). In contrast, our method can parse diverse style patterns such as global color distribution, texture, and local style patterns while maintaining the structure of the content in most examples, as shown in Fig. 11.

Unlike other algorithms, our learnable SANets can flexibly parse sufficient style features without maximally aligning the content and style features, regardless of a large domain gap (see rows 1 and 6 in Fig. 11). The proposed SANet semantically distinguishes the content structure and transfers similar style patterns onto the regions with the same semantic meaning. Our method transfers different styles for each semantic content. In Fig. 11 (row 3), our stylized image shows that the sky and buildings are stylized by different style patterns, whereas in the other results the style boundaries between the sky and buildings are ambiguous.

We also provide result close-ups in Fig. 4. Our results clearly show the multi-scale style patterns (e.g., color distribution, brush strokes, and the white and red patterns of rough textures in the style image). AvatarNet and WCT distort the brush strokes, produce blurry hair textures, and do not maintain the face appearances. AdaIN cannot even maintain the color distribution.

User study. We use 14 content images and 70 style images to synthesize 980 images in total. We randomly select 30 content and style combinations for each subject and show the stylized images produced by the 5 compared methods side by side in a random order. We then ask the subject to vote for his/her one favorite result for each style. We collect 2400 votes from 80 users and show the percentage of votes for each method in Fig. 5. The study shows that our method is favored for better stylized results.

Efficiency. Table 1 shows the run time of the proposed method and other methods at two image scales: 256 and 512 pixels. The measured run time includes the time for style encoding. Our algorithm runs at 165 FPS and 125 FPS for the single-scale (only Relu_4_1) and multi-scale (Relu_4_1 and Relu_5_1) models, respectively, and the run times for 256 × 256 and 512 × 512 images are nearly the same. Therefore, our method can perform style transfer in real time. Our model is 40 times faster than the matrix computation based methods (WCT [13] and AvatarNet [20]).
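The frame rates and speed-ups can be cross-checked against the 256-pixel column of Table 1 with a few lines of Python. The dictionary keys are labels chosen here, and because the table entries are rounded to milliseconds the derived figures are approximate.

```python
# Run times in seconds at 256 px, copied from Table 1.
times_256 = {
    "Gatys et al.": 15.863,
    "WCT": 0.668,
    "AvatarNet": 0.248,
    "AdaIN": 0.010,
    "ours (Relu_4_1)": 0.006,
    "ours (multi-level)": 0.008,
}

# Frames per second is the reciprocal of the per-image run time.
fps = {name: 1.0 / t for name, t in times_256.items()}

# Speed-up of the single-level model relative to every method.
speedup = {name: t / times_256["ours (Relu_4_1)"] for name, t in times_256.items()}

for name in times_256:
    print(f"{name:20s} {fps[name]:8.1f} fps   {speedup[name]:8.1f}x")
```

With these rounded entries the multi-level model comes out at 125 fps and the single-level model at about 167 fps, close to the reported 165 FPS. The single-level model is about 41 times faster than AvatarNet and more than 100 times faster than WCT, so a 40× figure appears to correspond to the AvatarNet comparison; how the stated factor was computed is not given in the text.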

4.3 Ablation Studies

Loss analysis. In this section, we show the influence of the content-style loss and the identity loss. Fig. 6 (a) shows the results obtained by fixing $\lambda_{identity1}$, $\lambda_{identity2}$, and $\lambda_s$ at 0, 0, and 5, respectively, and increasing $\lambda_c$ from 1 to 50. Fig. 6 (b) shows the results obtained by fixing $\lambda_c$ and $\lambda_s$ at 0 and 5, respectively, and increasing $\lambda_{identity1}$ and $\lambda_{identity2}$ from 1 to 100 and from 50 to 5000, respectively. Without the identity loss, if we increase the weight of the content loss, the content structure is preserved, but the characteristics of the style patterns disappear because of the trade-off between the content loss and the style loss. On the other hand, increasing the weights of the identity loss without the content loss preserves the content structure as much as possible while maintaining the style patterns. However, distortion of the content structure cannot be avoided. We therefore apply a combination of the content-style loss and the identity loss to maintain the content structure while enriching the style patterns.

Multi-level feature embedding. Fig. 7 shows two stylized outputs obtained from Relu_4_1 and Relu_5_1, respectively. When only Relu_4_1 is used for style transfer, the global statistics of the style features and the content structure are maintained well. However, the local style patterns do not appear well. On the other hand, Relu_5_1 helps add local style patterns such as circle patterns because its receptive field is relatively wider. However, the content structures are distorted and textures such as brush strokes disappear. In our work, to enrich the style patterns, we integrate two SANets by taking VGG feature maps encoded from different layers (Relu_4_1 and Relu_5_1) as inputs and combining both output feature maps.

Figure 6: Content-style loss vs. identity loss. (a) shows the results obtained by fixing $\lambda_{identity1}$, $\lambda_{identity2}$, and $\lambda_s$ at 0, 0, and 5, respectively, and increasing $\lambda_c$ from 1 to 50. (b) shows the results obtained by fixing $\lambda_c$ and $\lambda_s$ at 0 and 5, respectively, and increasing $\lambda_{identity1}$ and $\lambda_{identity2}$ from 1 to 100 and from 50 to 5000, respectively.
Figure 7: Multi-level features embedding. By embedding multi-level features, we can enrich the local and global patterns for the stylized images.

4.4 Runtime Controls

In this section, we present the flexibility of our method through several applications.

Figure 8: Content-style trade-off during runtime. Our algorithm allows a content-style trade-off at test time by interpolating between the feature maps $F^{m}_{ccc}$ and $F^{m}_{csc}$.
Figure 9: Style interpolation with four different styles.

Content-style trade-off. The degree of stylization can be controlled during training by adjusting the style weight $\lambda_s$ in Equ. 6, or at test time by interpolating between the feature maps that are fed to the decoder. For runtime control, we adjust the stylized features as $F^{m}_{csc} \leftarrow \alpha F^{m}_{csc} + (1 - \alpha) F^{m}_{ccc}$, $\forall \alpha \in [0, 1]$. The feature map $F^{m}_{ccc}$ is obtained by taking two content images as input for our model. The network tries to reconstruct the content image when $\alpha = 0$, and to synthesize the most stylized image when $\alpha = 1$ (result shown in Fig. 8).
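The runtime trade-off amounts to a linear blend of two feature maps before decoding. A minimal NumPy sketch, with an assumed function name and random arrays in place of real features:

```python
import numpy as np

def blend_content_style(f_stylized, f_content_only, alpha):
    # alpha = 1 keeps the fully stylized features, alpha = 0 the content-only ones.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * f_stylized + (1.0 - alpha) * f_content_only

rng = np.random.default_rng(0)
f_stylized = rng.normal(size=(4, 8, 8))       # content image paired with a style image
f_content_only = rng.normal(size=(4, 8, 8))   # content image paired with itself
half = blend_content_style(f_stylized, f_content_only, 0.5)
```

The blended map would then be passed to the decoder in place of the fully stylized one.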

Style interpolation. To interpolate between several style images, a convex combination of the feature maps from the different styles is fed into the decoder (result shown in Fig. 9).
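A convex combination of per-style feature maps can be written as below. The function name, the weight values, and the random stand-in features are illustrative assumptions.

```python
import numpy as np

def interpolate_styles(feature_maps, weights):
    # Convex combination: weights must be non-negative and sum to one.
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * f for w, f in zip(weights, feature_maps))

rng = np.random.default_rng(0)
# One stylized feature map per style image (four styles, as in Fig. 9).
maps = [rng.normal(size=(4, 8, 8)) for _ in range(4)]
mixed = interpolate_styles(maps, [0.4, 0.3, 0.2, 0.1])
only_first = interpolate_styles(maps, [1.0, 0.0, 0.0, 0.0])
```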

Spatial control. Fig. 10 shows an example of spatially controlling the stylization. A set of masks $M$ (Fig. 10, column 3) is additionally required as input to map the spatial correspondence between content regions and styles. We can assign different styles to each spatial region by replacing $F_s$ with $M \odot F_s$, where $\odot$ is a simple mask-out operation.
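One possible reading of the mask-out operation is sketched below: each style yields its own stylized feature map, and binary masks defined on the content feature grid select which map is used at each position. This is an assumption made for illustration, since the text does not spell out how the masks are applied; the function name and the toy masks are hypothetical.

```python
import numpy as np

def spatial_blend(stylized_feats, masks):
    # masks: (H, W) arrays that must sum to 1 at every position.
    masks = [np.asarray(m, dtype=float) for m in masks]
    if not np.allclose(sum(masks), 1.0):
        raise ValueError("masks must partition the feature grid")
    # Broadcast each mask over the channel axis and add the masked maps.
    return sum(m[None] * f for m, f in zip(masks, stylized_feats))

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 4, 4))   # features stylized with style A
feat_b = rng.normal(size=(4, 4, 4))   # features stylized with style B
mask_a = np.zeros((4, 4))
mask_a[:, :2] = 1.0                   # left half of the grid uses style A
mask_b = 1.0 - mask_a                 # right half uses style B
combined = spatial_blend([feat_a, feat_b], [mask_a, mask_b])
```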

Figure 10: Example of spatial control. Left: content image. Middle: style images and masks. Right: stylized image from two different style images.

5 Conclusions

In this work, we propose a new arbitrary style transfer algorithm that consists of style-attentional networks and decoders. Our algorithm is effective and efficient. Unlike the patch-based style decorator in [20], our SANet can flexibly decorate the style features by learning with the conventional style reconstruction loss and the identity loss. Furthermore, the proposed identity loss helps the SANet keep the content structure while enriching the local and global style patterns. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art algorithms on arbitrary style transfer.

Figure 11: Example results for comparisons.