Style transfer is a powerful technique for art creation and image editing that enables recomposing images in the style of other images. Recently, inspired by the power of Convolutional Neural Networks (CNNs) in visual perception tasks, Gatys et al. [4, 5] opened up a new field named Neural Style Transfer, which first introduces neural representations to separate and recombine the content and style of arbitrary images. They propose to extract content features and style correlations (Gram matrices) along the processing hierarchy of a pretrained classification network. Based on this work, several algorithms have been proposed to accelerate development on the generalization and efficiency issues, through optimization-based methods and feed-forward networks [10, 12, 21]. The success of style transfer makes it possible to deploy services in mobile applications (e.g., Prisma, Artify), allowing users to create an artwork out of a picture they took with their phones.
In spite of the significant progress already achieved, these methods suffer from a restricted binding between the model and specific styles. Recently, Arbitrary-Style-Per-Model fast style transfer methods (ASPM) [9] have been proposed to conquer this dilemma. One possible solution is to coordinate the high-level statistical distributions of content features and style features. Although visual quality and efficiency can be greatly improved, these methods introduce unexpected or distorted patterns to the stylized result because they treat diverse image regions indiscriminately, as shown for AdaIN [7] and WCT [13] in Figure 1. Another solution is to swap each content feature patch with the closest style feature patch at an intermediate layer of a trained autoencoder. However, this method may generate insufficiently stylized results when a large difference exists between the content and style images, as shown for the Style-Swap [2] method in Figure 1. Compared with Style-Swap, Avatar-Net [19] further dispels the domain gap between content and style features, leading to better stylized results, but it still fails to keep the spatial distribution of visual attention consistent with the content image and thus manifests distortion in terms of semantic perception.
Stroke textons [31], which refer to the fundamental micro-structures in natural images, reflect perceptual style patterns. Methods such as [8, 23] are dedicated to learning stroke control in the transfer process. Jing et al. [8] first propose to achieve continuous stroke size control by incorporating multiple stroke sizes into one single StrokePyramid model. The StrokePyramid result shown in Figure 1 is produced by mixing two different stroke sizes. However, due to the lack of local awareness of the content image, they perform the stroke interpolation in a holistic way regardless of region diversity, leading to a lack of detail. In addition, these methods are not flexible enough to handle arbitrary styles in one feed-forward pass.
To address the aforementioned problems, we propose an attention-aware multi-stroke (AAMS) model for arbitrary style transfer. Our model encourages attention consistency (spatial consistency of the visual attention distribution) between corresponding regions of the content image and the stylized image, and it achieves both scalable multi-stroke fusion control and automatic spatial stroke size control in one shot. Specifically, we introduce a self-attention mechanism as a complement to the autoencoder framework. The self-attention module calculates the response at a position as a weighted sum of the features at all positions, which helps to capture long-range dependencies across image regions. By performing a reconstruction training process with the self-attention assembled autoencoder, the attention map learns to grasp the salient characteristics of any content image. As shown in Figure 1, the attention map of the content image highlights the salient parts while keeping the attention degree consistent for long-range features. Based on the correlation between receptive field and stroke size, a multi-scale style swap module is proposed to blend distinct stroke patterns by swapping the content features with multi-scale style features in the high-level representation. We inject the attention map into a multi-stroke fusion module to synthesize distinct stroke patterns harmoniously, which achieves automatic spatial stroke size control. Comprehensive experiments demonstrate the effectiveness of our method, and the model is capable of generating stylized images with multiple stroke patterns that are comparable to those of state-of-the-art methods. The main contributions of this work are:
We introduce a self-attention mechanism into an autoencoder network, allowing it to capture critical characteristics and long-range region relations of the input image.
We propose a multi-scale style swap to break the limitation of the fixed receptive field in the high-level feature space and produce multiple feature maps reflecting different stroke patterns.
By combining with the attention map, we present a flexible fusion strategy to integrate multiple stroke patterns harmoniously into different spatial regions of the output image, which enables attention consistency between the content image and the stylized image.
2 Related Work
Neural Style Transfer. Arbitrary-Style-Per-Model methods (ASPM) [9] have recently been proposed to transfer arbitrary styles through one single model. The underlying idea is to formulate style transfer as an image reconstruction process, with feature statistics fused between content and style features at intermediate layers. Chen et al. [2] first propose to swap each content feature patch with the best matching style feature patch using a Style-Swap operation. Huang et al. [7] introduce adaptive instance normalization (AdaIN) to adjust the mean and variance of the content feature to match those of the style feature. Li et al. [13] perform style transfer by applying whitening and coloring transforms (WCT) to match the statistical distributions and correlations between the content and style features. Avatar-Net [19] elevates the feature transfer ability by matching normalized counterpart features and applying a patch-based style decorator. However, the above methods either locally exchange the closest feature patches or transfer feature statistics globally, which tends to yield a uniform stroke pattern without any guarantee of attention awareness. In comparison, our method renders visually plausible results with multiple stroke patterns integrated within the same stylized image. Another related work is [8], which proposes a StrokePyramid module to incorporate multiple stroke sizes into one single model; it empowers distinct stroke sizes in different spatial regions within the same transformed image. Our approach automatically manipulates multi-stroke fusion through the guidance of the attention feature in one shot within an ASPM framework, whereas [8] achieves spatial stroke size control by feeding masked content images and needs to be retrained for each new style.
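For concreteness, the AdaIN transform described above can be sketched in a few lines. This is an illustrative toy, not any published implementation: plain Python lists stand in for feature tensors, and `adain`/`mean_std` are hypothetical helper names.

```python
import math

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization sketch (after Huang & Belongie):
    shift and scale each content channel so that it matches the mean and
    standard deviation of the corresponding style channel. Each channel is
    a plain list of floats (flattened spatial activations)."""
    def mean_std(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, math.sqrt(var + eps)

    out = []
    for c_ch, s_ch in zip(content, style):
        mc, sc = mean_std(c_ch)
        ms, ss = mean_std(s_ch)
        # normalize the content channel, then re-scale to style statistics
        out.append([(x - mc) / sc * ss + ms for x in c_ch])
    return out
```

After the transform, each output channel carries the style channel's first- and second-order statistics while preserving the content channel's spatial layout.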
One of the most promising trends in recent research is the incorporation of attention mechanisms into deep learning frameworks [11, 16]. Rather than compressing an entire image or sequence into a static representation, attention allows the model to focus on the most relevant parts of images or features as needed. Such mechanisms have proven very effective in many vision tasks, including image classification [24, 30], image captioning [26, 28] and visual question answering [25, 27]. In particular, self-attention [15, 22] has been proposed to calculate the response at a position in a sequence by attending to all positions within the same sequence. Shaw et al. [18] propose to incorporate relative position information for sequences into the self-attention mechanism of the Transformer model, which improves translation quality on machine translation tasks. Zhang et al. [29] demonstrate that the self-attention model can capture multi-level dependencies across image regions and draw fine details in the context of the GAN framework. Compared with [29], we adapt the self-attention mechanism to introduce a residual feature map that captures the salient characteristics of content images.
3 Proposed Approach
For the task of style transfer, the goal is to generate a stylized image I_cs, given the content image I_c and style image I_s. To satisfy arbitrary style transfer in one feed-forward pass while integrating multiple stroke patterns into the image, we develop three modules (multi-scale style swap, multi-stroke fusion and a self-attention module) in the bottleneck layer for feature manipulation. The three modules cooperate with each other and form a coupled framework. The network architecture of our proposed approach is depicted in Figure 2.
Assume that F_c and F_s denote the feature maps extracted from the encoder corresponding to I_c and I_s respectively. At the core of our network, a self-attention module is proposed to learn saliency properties by being fed with F_c, generating a self-attention feature map F_a for any content image after the reconstruction training process. During the intermediate process of transferring F_c into the domain of F_s via a WCT transformation in the testing phase, a multi-scale style swap module is first designed to synthesize features of multiple stroke sizes, taking both the content feature and K scaled style features as input. The module conducts a style-swap procedure between the content feature and the K style features simultaneously. To perform a flexible integration of the features, a multi-stroke fusion module is presented to handle controllable blending. The attention map A filtered from F_a is incorporated to guide the fusion among the content feature and the K swapped stroke features, where K is the user-provided stroke number. After these two steps, the synthetic feature is fed into the trained decoder to generate the stylized image in one feed-forward pass. We further introduce skip connections to enhance the stylization effects by adapting multiple levels of synthetic features with style features. In this section, we introduce the three components in detail.
3.1 Self-Attention Autoencoder
We employ a self-attention mechanism [29] to capture the relationships between separated spatial regions. In detail, let E and D represent the encoder and decoder respectively. The encoder produces F = E(x) by mapping the input image x into the intermediate activation space of size H × W × C, with H, W, C denoting height, width and channel number respectively.
Let F̄ (with N = H × W positions) denote the result of flattening F along the channel dimension. We calculate the self-attention feature o_j of F̄ at location j as:

o_j = Σ_{i=1}^{N} β_{j,i} h(F̄_i),

where h denotes a 1 × 1 convolution operation and β_{j,i} is the weight coefficient. β_{j,i} indicates the dependency between two regions, not just neighboring positions. It is computed using a softmax function:

β_{j,i} = exp(s_{i,j}) / Σ_{i=1}^{N} exp(s_{i,j}),

where s_{i,j} is obtained by a compatibility function that compares elements by a scaled dot product:

s_{i,j} = f(F̄_i)^T g(F̄_j) / √C.

The self-attention feature map decoded from the self-attention module is F_a = (o_1, …, o_N), which is reshaped to the same dimensions as F. In the above formulation, f, g and h are learned parameter matrices within our self-attention autoencoder framework. We implement them as 1 × 1 convolutions [29].
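The computation above can be sketched directly in plain Python. This is a toy stand-in for the trained module, under the stated simplifications: `Wf`, `Wg`, `Wh` are small matrices playing the role of the learned 1 × 1 convolutions, and positions are a flat list of feature vectors.

```python
import math

def self_attention(feats, Wf, Wg, Wh):
    """Self-attention over N positions: feats is a list of N vectors
    (each of length C); Wf, Wg, Wh are weight matrices standing in for
    the 1x1 convolutions f, g, h."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    f = [matvec(Wf, x) for x in feats]
    g = [matvec(Wg, x) for x in feats]
    h = [matvec(Wh, x) for x in feats]
    c = len(f[0])
    out = []
    for j in range(len(feats)):
        # scaled dot-product scores s_ij = f(x_i) . g(x_j) / sqrt(C)
        scores = [sum(a * b for a, b in zip(f[i], g[j])) / math.sqrt(c)
                  for i in range(len(feats))]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        beta = [e / z for e in exps]  # softmax weights over all positions
        # o_j = sum_i beta_{j,i} * h(x_i)
        out.append([sum(beta[i] * h[i][k] for i in range(len(feats)))
                    for k in range(c)])
    return out
```

Because each output position is a convex combination over all positions, the response at one location can depend on arbitrarily distant regions, which is the long-range property exploited by the attention map.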
Given the self-attention feature map F_a, we propose to obtain a self-attention residual by multiplying the hidden feature map F with F_a. The output F_o is then given by appending the residual to the feature map F:

F_o = F + F ⊙ F_a,

where ⊙ denotes the element-wise multiplication operator. We then feed F_o into the decoder and reconstruct the input image. In this manner, the self-attention residual can exhibit the salient regions in synthesizing the image while characterizing the correlations between distant regions.
We train the self-attention autoencoder with a reconstruction loss combining pixel and perceptual terms:

L_recon = ‖x̂ − x‖_2 + λ_p Σ_X ‖Φ_X(x̂) − Φ_X(x)‖_2,

where x̂ = D(F_o) is the reconstructed image, Φ_X(x) denotes the activations of the chosen layers of the pretrained VGG-19 network when processing image x, and λ_p is the weight to balance the two losses. Both losses are calculated using the normalized Euclidean distance. In addition, we introduce a sparse loss on the self-attention feature map to encourage the self-attention autoencoder to pay more attention to small regions instead of the whole image:

L_sparse = ‖F_a‖_1.

With a total variational regularization loss L_tv [10] added to encourage spatial smoothness in the generated images, we obtain our total loss function as:

L_total = L_recon + λ_s L_sparse + λ_tv L_tv,

where λ_p, λ_s and λ_tv are the balancing factors.
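The two auxiliary terms are simple to state concretely. The sketch below is illustrative only (hypothetical helper names, lists instead of tensors): the sparse penalty is the mean absolute attention value, and the total variation is the sum of absolute differences between neighboring pixels.

```python
def sparse_loss(att):
    """L1 sparsity penalty on the (flattened) self-attention feature map:
    mean absolute value, pushing attention toward small salient regions."""
    return sum(abs(a) for a in att) / len(att)

def tv_loss(img):
    """Total variation of a 2-D image given as a list of rows: sum of
    absolute horizontal and vertical neighbor differences, encouraging
    spatial smoothness in the generated image."""
    h = sum(abs(row[i + 1] - row[i])
            for row in img for i in range(len(row) - 1))
    v = sum(abs(img[j + 1][i] - img[j][i])
            for j in range(len(img) - 1) for i in range(len(img[0])))
    return h + v
```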
3.2 Multi-scale Style Swap
Style swap is the process of substituting content feature patches with the closest style feature patches, patch by patch [2]. Given a specific patch size, the style swap procedure can be implemented efficiently as two convolution operations and a channel-wise argmax operation, where the filters of the convolutional layers are derived from the extracted style patches. Based on the analysis of how the receptive field influences stroke size in [8], we note that a larger patch size leads to a larger receptive field and accordingly a larger style stroke. However, the increase of the patch size is strictly limited by the network structure and easily saturates once the patch size exceeds the fixed receptive field of the network.
To resolve the above issue in an efficient manner, we propose to fix the patch size while changing the scale of the style activation map, such that we obtain multi-scale stroke patterns after swapping with the same content feature. Specifically, we first adopt the whitening transform [13] on the content feature and style feature to peel off their style information while preserving the global structure; the results are denoted as F̄_c and F̄_s. We then obtain a series of multi-scale style features by casting the whitened style feature into multiple scales:

F̄_s^k = U_{s_k}(F̄_s),   k = 1, …, K,

where U denotes the scaling operation and s_k is the scale coefficient controlling different stroke sizes. Finally, the multiple swapped features are produced by performing the style-swap procedure between F̄_c and the multiple F̄_s^k simultaneously:

F_css^k = SS(F̄_c, F̄_s^k),   k = 1, …, K,

where SS denotes the parallelizable style swap process [2].
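The core of the swap is nearest-patch matching under normalized cross-correlation. The sketch below is a deliberately simplified stand-in (1 × 1 patches instead of the paper's larger patches, lists instead of tensors, `style_swap` a hypothetical name): each whitened content vector is replaced by the most correlated style vector.

```python
import math

def style_swap(content, style):
    """Replace each (whitened) content feature vector with the most
    similar style feature vector under normalized cross-correlation.
    Uses 1x1 patches for brevity; the real procedure matches larger
    patches via convolution + channel-wise argmax."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    swapped = []
    for c in content:
        cn = norm(c)
        # argmax over all style patches of the normalized dot product
        best = max(style, key=lambda s: sum(a * b
                                            for a, b in zip(cn, norm(s))))
        swapped.append(best)
    return swapped
```

Running this matcher against style features resized to several scales, with the content features held fixed, yields the K swapped maps carrying different stroke sizes.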
3.3 Multi-stroke Fusion
Equipped with the self-attention mechanism, the residual in eq. (4) is able to capture critical characteristics and long-range region correlations of the feature map F. In detail, during the reconstruction process, the residual learns to fine-tune the intrinsically crucial parts of the content feature by adding variation to them, so the non-trivial (non-zero) parts of the residual deserve special attention. We apply an attention filter that first performs an absolute-value operation to highlight these non-trivial parts in F_a, followed by a Gaussian kernel convolution layer to enhance the regional homogeneity of the features. The variance of the Gaussian kernel can further be used to control the proportion of salient regions in the content image. We obtain our attention map A after normalizing into the range [0, 1]. Figure 3 visualizes the intermediate results. We notice that the attention map enlarges the attending influence of salient regions while maintaining the correlations among distant regions.
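The three filter steps (absolute value, Gaussian smoothing, min-max normalization) can be sketched as follows. This is an illustrative 1-D toy with hypothetical names; the model applies the same steps to the 2-D residual map.

```python
import math

def attention_filter(residual, sigma=1.0, radius=2):
    """Attention filter sketch: |.| exposes non-trivial residual entries,
    a Gaussian blur enforces regional homogeneity, and min-max
    normalization maps the result into [0, 1]."""
    a = [abs(x) for x in residual]
    # normalized Gaussian kernel of width 2*radius + 1
    k = [math.exp(-(i * i) / (2 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    z = sum(k)
    k = [v / z for v in k]
    blurred = []
    for i in range(len(a)):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(i + j - radius, 0), len(a) - 1)  # clamp borders
            acc += w * a[idx]
        blurred.append(acc)
    lo, hi = min(blurred), max(blurred)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in blurred]
```

A larger `sigma` spreads each salient response over more neighbors, which is the knob corresponding to controlling the salient region's proportion.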
In addition to the K swapped features obtained from the multi-scale style swap, we introduce the whitened content feature F̄_c as another entry to characterize the most significant regions, which serves as the fine-grained stroke [8]. Therefore, we have a total of K+1 features for multi-stroke fusion.
To integrate arbitrary strokes in a scalable framework, we propose a flexible fusion strategy that first divides the attention map into multiple clusters according to the provided stroke number. We apply the k-means method to cluster our attention map; the goal is to iteratively find K+1 intensity centers, minimizing the Euclidean distance between each center and the elements of its cluster:

argmin_Ω Σ_{k=0}^{K} Σ_{a ∈ Ω_k} ‖a − μ_k‖²,

where K+1 clusters Ω_0, …, Ω_K are generated, and μ_k denotes the mean intensity value of all the attention points in cluster Ω_k.
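Since the attention map is clustered on scalar intensities, the procedure reduces to 1-D Lloyd's k-means. The sketch below is illustrative (hypothetical `kmeans_1d` name, evenly spaced initialization assumed; the paper does not specify its initialization).

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on scalar attention intensities. Returns the
    sorted cluster centers (the mu_k). Centers are initialized evenly
    over the value range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each point to its nearest center
            j = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[j].append(v)
        # recompute each center as its cluster mean (keep empty ones fixed)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```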
The multi-stroke fusion can then be formulated as integrating the content feature with the multiple swapped features under the guidance of the attention map:

F_cs = Σ_{k=0}^{K} W_k ⊙ F_css^k,

where W_k is the weight map assigned to the k-th stroke size, and k = 0 denotes the fine-grained stroke (F_css^0 = F̄_c). The weight map is computed by a softmax function:

W_k = exp(−τ d_k) / Σ_{j=0}^{K} exp(−τ d_j).

We define d_k = |A − μ_k| to measure the absolute distance of the attention value with regard to the center μ_k; thus W_k indicates to what extent each stroke size contributes to synthesizing the feature. The smoothing factor τ is used to control the smoothing degree of the fusion.
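The weighting and blending steps can be sketched together. This is a toy under stated simplifications (scalar features per position, hypothetical `fusion_weights`/`fuse` names): each position gets a softmax over the K+1 centers, and the stroke features are blended position-wise.

```python
import math

def fusion_weights(attention, centers, tau=50.0):
    """Per-position softmax weights over the K+1 stroke centers:
    w_k(p) = exp(-tau * |A(p) - mu_k|) / sum_j exp(-tau * |A(p) - mu_j|).
    A larger tau makes each region commit harder to its nearest stroke."""
    weights = []
    for a in attention:
        logits = [-tau * abs(a - mu) for mu in centers]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

def fuse(features, weights):
    """Blend the K+1 stroke feature maps position-wise: features[k][p]
    is the k-th stroke's value at position p."""
    return [sum(w * features[k][p] for k, w in enumerate(wp))
            for p, wp in enumerate(weights)]
```

Positions whose attention value sits near a cluster center are rendered almost entirely with that center's stroke, while positions between centers receive a smooth mixture.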
Before feeding into the decoder to generate the stylized result, we derive the final feature by performing a coloring transform on the syncretic feature F_cs to match its statistics to those of the style feature, following [13].
4 Experiments
4.1 Implementation Details
We train our self-attention autoencoder for reconstruction on the MS-COCO dataset [14], which contains roughly 80,000 training examples. We preserve the aspect ratio of each image, rescale the smaller dimension to 512 pixels, and then randomly crop a 256 × 256 region. Our encoder contains the first few layers of the VGG-19 model [20] pre-trained on the ImageNet dataset. The decoder is symmetric to the encoder structure. The Relu_X_1 (X = 1, 2, 3, 4) layers in the encoder are used to compute the perceptual loss, with per-layer weights set to 1, 10, 6, 10 to balance each term. We use three 1 × 1 convolutions for f, g and h in our self-attention module, and set the balancing factors λ_s and λ_tv accordingly. We train our network using the Adam optimizer with a batch size of 8 for 10k iterations. Although our method can handle an arbitrary number of stroke fusions, we select a three-stroke scenario as the default setting in the following experiments, including the fine-grained stroke size and two coarse stroke sizes whose scale coefficients are 0.5 and 1 respectively. We set the smoothing factor τ to 50 to reveal discriminative patterns for different strokes.
To obtain abundant stylized information, it is advantageous to match features across different levels of the VGG-19 encoder to fully capture the characteristics of the style, as suggested in [13, 19]. We hence adopt a similar strategy for our reconstruction process: we use skip connections to perform style enhancement via adaptive instance normalization [7], fed with style features extracted from the Relu_X_1 (X = 1, 2, 3) outputs and the reconstruction features of the corresponding deconvolution layers.
4.2 Qualitative Evaluation
Comparison With Prior Methods. We evaluate four state-of-the-art methods for arbitrary style transfer: Style-Swap [2], AdaIN [7], WCT [13] and Avatar-Net [19]. The stylized results for various content/style pairs are shown in Figure 4. To make a fair comparison, the results of the compared methods are obtained by running their code with default configurations. Style-Swap relies solely on the patch similarity between content and style features and thus strictly preserves the content structure, which its results validate: only low-level style patterns (e.g., colors) are transferred. AdaIN presents an efficient solution by directly adjusting the content feature to match the mean and variance of the style feature, but it tends to produce similar texture patterns across stylized images (e.g., the crack pattern in all its results) due to the style-dependent training. WCT holistically adjusts the feature covariance of the content feature to match that of the style feature, which inevitably introduces unseen information and unconstrained patterns, for example the missing circular patterns and the unexpected textures visible in Figure 4. Avatar-Net shrinks the domain gap between content and style features and enhances propagation for feature alignment. Although concrete style patterns are reflected, it still cannot handle attention-aware feature adaptation and manifests distortion in semantic perception, for example the eyes in the first row and the farmer in the last row.
By contrast, our method produces visually plausible stylized results compared with the previous methods. The attention map enables seamless synthesis among multiple stroke sizes while demonstrating superior spatial consistency of visual attention between the content image and the stylized image. In the last column of Figure 4, the salient regions in the content images, such as the eyes, candles, house and farmer, still maintain the focus in the stylized images. This validates the effectiveness of the attention map, where salient regions are mainly attributed to the fine-grained stroke size. In addition, the attention map exhibits a similarity measurement between distant regions, which allows detailed features in distant portions of the image to remain consistent with each other.
Table 1. The attention consistency is dramatically improved with our method for all the evaluation metrics.
Ablation Studies. Section 3 discusses the contributions of the proposed modules. The self-attention module is responsible for capturing the attention characteristics of the content image, such that the multi-scale style swap module and the multi-stroke fusion module can work together to integrate multiple stroke patterns. Several evaluations are performed to verify the effectiveness of the coupled framework.
We first train an autoencoder with the self-attention module removed. Since there is no guidance from the attention map, we apply an average fusion strategy in the multi-stroke fusion module during the testing phase. We name this method AAMS(-SA). Figure 5 shows the comparison with our full method, AAMS. Our stylized result with the self-attention module demonstrates a dramatic improvement in visual effect compared with AAMS(-SA), mainly in two aspects: 1) our attention-aware method emphasizes the salient regions by painting them with the fine-grained stroke; 2) the multi-stroke fusion allows rich integration among stroke patterns and presents discriminative style patterns locally without sacrificing the holistic perception.
Apart from the cooperation between the self-attention and multi-stroke fusion modules, the multi-scale style swap also forms a compact affiliation with the fusion module. Without the multi-scale style swap module, the style transfer would be restricted to a single stroke pattern. We present the final stylized results with different stroke patterns in Figure 6. The stroke size is controlled by changing the scaling coefficient in eq. (8). From left to right, the larger the stroke size, the coarser the stylized pattern that emerges. When multiple stroke patterns are generated together, the multi-stroke fusion module can integrate them into one stylized image.
4.3 Quantitative Evaluation
User Studies. As artistic style transfer is a highly subjective task, we resort to user studies to better evaluate the performance of the proposed method. Since Style-Swap [2] only transfers low-level information, resulting in insufficient stylization effects, we compare the proposed method with the other three competing methods, i.e., AdaIN [7], WCT [13] and Avatar-Net [19]. We use 10 content images and 15 style images collected from the aforementioned methods and synthesize 150 images for each method, from which we show 20 randomly chosen content and style combinations to each subject.
We conduct two user studies on the results in terms of stylization effects and faithfulness to content characteristics. In the first study, we ask the participants to select the stylized image that better transfers both the color and texture patterns of the style image to the content image. In the second study, we ask the participants to select the stylized image that better preserves the content characteristics with fewer distorted artifacts. In each question, we show the participants the content/style pair and the stylized results of each method in random order. We collect 600 votes in total from 30 subjects and report the preference results in Figure 8. The studies show that our method receives the most votes for both stylization and attention consistency. The better preservation of visual attention enables a perceptual promotion of stylization.
Consistency Evaluation. To further evaluate the effectiveness of our method in preserving attention consistency, we propose to adopt saliency metrics [1] to measure the saliency similarity between content and stylized image pairs. These metrics are widely used to evaluate a saliency model's ability to predict ground-truth human fixations. We generate the saliency maps of stylized images and their corresponding content images using SALGAN [17], a state-of-the-art saliency prediction method. We then employ five metrics to evaluate the degree of visual attention consistency for 400 content/stylized image pairs. Comparison results between the proposed method and AAMS(-SA) are shown in Table 1. With our AAMS framework, the attention consistency is dramatically improved for all the evaluation metrics. The results further verify our method's ability to maintain saliency during style transfer.
Speed Analysis. We report the run-time performance of our method and prior feed-forward methods in Table 2. Results are obtained with a 12 GB Tesla M40 GPU and averaged over 400 transfers. Among the patch-based methods (Style-Swap, Avatar-Net), our method achieves comparable speed even with multi-scale feature processing. Given three strokes, it takes on average 0.80 and 0.94 seconds to transfer images of size 256 × 256 and 512 × 512 respectively. Since the running time increases with the number of strokes in a controllable manner, the tradeoff between efficiency and detail diversity should be considered.
Table 2. Execution time (sec) of each method.
4.4 Runtime Control
Given the learned attention map from the self-attention autoencoder, our method not only accommodates different user requirements by providing different controls over the multi-stroke fusion effect, but also achieves automatic spatial stroke size control.
Multi-stroke Fusion Control. One advantage over prior methods is that our method can effectively integrate multiple stroke patterns with different control strategies, including control of the number of stroke sizes and of the level of detail. As discussed in the previous section, different scales of the high-level style feature lead to different statistics after a style swap with the same content feature. Given an arbitrary number of stroke sizes, our method allows flexible fusion among these strokes. The left part of Figure 7 demonstrates three fusion results with different numbers of stroke sizes. Note that by integrating more stroke sizes, the stylized result presents more varied patterns with different stroke boundaries; the reason is that our clustering algorithm produces different attention distributions for different numbers of clusters, which results in varying weight maps. The right part of Figure 7 shows our ability to regulate the level of detail for four-stroke fusion. According to eq. (12), the smoothing factor determines the weight distribution over all stroke sizes. As shown in Figure 7, the larger it is, the more discriminative each stroke becomes within its corresponding region.
Automatic Spatial Stroke Size Control. Spatial stroke size control refers to properly utilizing strokes of different sizes to render regions with different levels of detail in an image. The gist is that for a visually plausible stylized artwork, we usually hope to use small strokes for salient regions (e.g., objects, humans) and large strokes for less salient ones (e.g., sky, grassland). To this end, previous methods [8] generally employ several hand-crafted masks and then stylize the corresponding regions separately. Our method, however, empowers spatial control in an automatic way thanks to the self-attention mechanism. As shown in Figure 9, three stroke sizes are adopted to render regions of different saliency, i.e., fine stroke patterns on the farmer, middle stroke patterns on the mountain and coarse stroke patterns on the grass. In particular, these regions are partitioned and integrated automatically under the guidance of the content attention map during our multi-stroke fusion procedure, and can be further adjusted by changing the Gaussian variance in the attention filter module. By further combining with multiple styles, our method provides an automatic solution for spatial control (transferring each region of the content image with a different style) instead of the mask-based controlling methods [6, 13].
5 Conclusion
In this paper, we propose an attention-aware multi-stroke style transfer model for arbitrary styles, which preserves attention consistency and achieves multi-stroke fusion control and automatic spatial stroke size control in the output. To accomplish this, we first introduce the self-attention mechanism into the bottleneck layer of an autoencoder framework, then perform a multi-scale style swap to produce multiple swapped features with different stroke sizes. By combining these with the attention map obtained from the content feature, the fusion module integrates distinct stroke patterns into different regions harmoniously. Experimental results demonstrate the effectiveness of our method in generating favorable results in terms of stylization effects and visual consistency with the content image.
-  Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? TPAMI, 2018.
-  T. Q. Chen and M. W. Schmidt. Fast patch-based style transfer of arbitrary style. In CVPR, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In CVPR, 2017.
-  X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
-  Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song. Stroke controllable fast style transfer with adaptive receptive fields. In ECCV, 2018.
-  Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A review. arXiv:1705.04058, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
-  H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, 2010.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In NIPS, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  Z. Lin, M. Feng, C. N. D. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.
-  V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
-  J. Pan, C. Canton, K. McGuinness, N. E. O'Connor, J. Torres, E. Sayrol, and X. Giro-i-Nieto. Salgan: Visual saliency prediction with generative adversarial networks. arXiv:1701.01081, 2017.
-  P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. arXiv:1803.02155, 2018.
-  L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
-  X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In CVPR, 2017.
-  T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, 2015.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
-  Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
-  H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv:1805.08318, 2018.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
-  S.-C. Zhu, C.-E. Guo, Y. Wang, and Z. Xu. What are textons? In ICCV, 2005.
6.1 Implementation Details
We assemble the self-attention module into the bottleneck layer of an encoder-decoder framework to form our self-attention autoencoder. Here we present more details of the network architecture.
6.1.1 Encoder-decoder Architecture
Tables 3 and 4 illustrate the detailed configurations of the encoder and decoder, respectively. The encoder consists of the first few layers of the VGG-19 network. We take an input image of size 512×512×3 as an example and list the feature size for each layer. The max pooling operation is replaced by average pooling. The decoder is symmetric to the encoder, with all pooling layers replaced by nearest up-sampling. All convolutional layers use reflection padding to avoid border artifacts. Notation: N denotes the number of output channels, K the kernel size, and S the stride.
As suggested in [13, 19], it is advantageous to match features across different levels of the VGG-19 encoder to fully capture the characteristics of the style. We use skip connections to perform style enhancement via adaptive instance normalization. The three connections are conv1_1 → inv_conv1_2, conv2_1 → inv_conv2_2, and conv3_1 → inv_conv3_2, with the output features of both the encoder and decoder layers fed into the normalization.
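Since each skip connection applies adaptive instance normalization to an encoder/decoder feature pair, the normalization step itself can be sketched in plain NumPy (the function name and the (C, H, W) array layout are illustrative, not from our codebase):

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: align the per-channel
    mean/std of the content feature with those of the style feature.
    Both features are (C, H, W) arrays."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # whiten the content statistics, then re-color with style statistics
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

In the skip connections, the encoder feature plays the role of `style_feat` and the corresponding decoder feature that of `content_feat`.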
| Layer | Layer Information | Feature Size |
|---|---|---|
| conv1_1 | Conv(N64, K3x3, S1), ReLU | (512, 512, 3) → (512, 512, 64) |
| conv1_2 | Conv(N64, K3x3, S1), ReLU | (512, 512, 64) → (512, 512, 64) |
| pool_1 | AveragePooling(K2x2, S2) | (512, 512, 64) → (256, 256, 64) |
| conv2_1 | Conv(N128, K3x3, S1), ReLU | (256, 256, 64) → (256, 256, 128) |
| conv2_2 | Conv(N128, K3x3, S1), ReLU | (256, 256, 128) → (256, 256, 128) |
| pool_2 | AveragePooling(K2x2, S2) | (256, 256, 128) → (128, 128, 128) |
| conv3_1 | Conv(N256, K3x3, S1), ReLU | (128, 128, 128) → (128, 128, 256) |
| conv3_2 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| conv3_3 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| conv3_4 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| pool_3 | AveragePooling(K2x2, S2) | (128, 128, 256) → (64, 64, 256) |
| conv4_1 | Conv(N512, K3x3, S1), ReLU | (64, 64, 256) → (64, 64, 512) |
| Layer | Layer Information | Feature Size |
|---|---|---|
| inv_conv4_1 | Conv(N256, K3x3, S1), ReLU | (64, 64, 512) → (64, 64, 256) |
| upsample_1 | Nearest Upsampling(x2) | (64, 64, 256) → (128, 128, 256) |
| inv_conv3_4 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| inv_conv3_3 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| inv_conv3_2 | Conv(N256, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 256) |
| inv_conv3_1 | Conv(N128, K3x3, S1), ReLU | (128, 128, 256) → (128, 128, 128) |
| upsample_2 | Nearest Upsampling(x2) | (128, 128, 128) → (256, 256, 128) |
| inv_conv2_2 | Conv(N128, K3x3, S1), ReLU | (256, 256, 128) → (256, 256, 128) |
| inv_conv2_1 | Conv(N64, K3x3, S1), ReLU | (256, 256, 128) → (256, 256, 64) |
| upsample_3 | Nearest Upsampling(x2) | (256, 256, 64) → (512, 512, 64) |
| inv_conv1_2 | Conv(N64, K3x3, S1), ReLU | (512, 512, 64) → (512, 512, 64) |
| inv_conv1_1 | Conv(N3, K3x3, S1), ReLU | (512, 512, 64) → (512, 512, 3) |
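As a sanity check on the feature sizes listed above, the tables can be traced with a minimal sketch: stride-1 convolutions with reflection padding preserve spatial size, pooling halves it, and nearest up-sampling doubles it (the helper function and layer encoding are illustrative):

```python
def trace_sizes(h, w, c_in, layers):
    """Walk (kind, channels) layer specs and return the output size
    after each layer. 'conv' is K3x3/S1 with reflection padding
    (spatial size preserved), 'pool' halves, 'up' doubles."""
    sizes, c = [], c_in
    for kind, channels in layers:
        if kind == "conv":
            c = channels
        elif kind == "pool":
            h, w = h // 2, w // 2
        elif kind == "up":
            h, w = h * 2, w * 2
        sizes.append((h, w, c))
    return sizes

encoder = [("conv", 64), ("conv", 64), ("pool", None),
           ("conv", 128), ("conv", 128), ("pool", None),
           ("conv", 256), ("conv", 256), ("conv", 256), ("conv", 256),
           ("pool", None), ("conv", 512)]

decoder = [("conv", 256), ("up", None),
           ("conv", 256), ("conv", 256), ("conv", 256), ("conv", 128),
           ("up", None), ("conv", 128), ("conv", 64),
           ("up", None), ("conv", 64), ("conv", 3)]
```

Running `trace_sizes(512, 512, 3, encoder)` ends at (64, 64, 512), and feeding that into the decoder spec recovers (512, 512, 3), matching the tables.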
6.1.2 Self-Attention Module
The architecture of the self-attention module is shown in Figure 10. Different from the approach in , where the output self-attention feature map is added back to the input feature map to learn non-local evidence, we propose to obtain a self-attention residual by multiplying the input feature map with the self-attention feature map, and find it effective in capturing saliency characteristics.
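The multiplicative residual can be sketched in a few lines, assuming the attention map has already been normalized to [0, 1] per spatial position (names and shapes are illustrative):

```python
import numpy as np

def self_attention_residual(feat, attn):
    """Multiplicative self-attention residual: instead of adding the
    attended output back to the input (as in additive non-local
    blocks), gate the input feature map with the attention map so
    that salient regions are emphasized and others suppressed.
    feat: (C, H, W); attn: (H, W) with values in [0, 1]."""
    return feat * attn[None, :, :]
```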
6.2 Experiments and Results
6.2.1 Extra ablation study
To demonstrate the capability of skip connections for style enhancement, we present the stylized results without the connections in Figure 11. By matching features across multiple levels, the results capture more low-level characteristics (e.g., colors) of the style images and thus exhibit higher fidelity to the styles in terms of color saturation.
6.2.2 More results of our method
6.3 Multi-stroke Fusion Control
Here we explain in more detail the advantages of our attention-aware multi-stroke method.
6.3.1 Stroke control vs weight control
Weight control refers to controlling the balance between stylization and content preservation, a strategy adopted in previous style transfer methods [7, 13, 19]. As visualized in Figure 12, the weight control strategy directly interpolates in deep feature space, taking a weighted sum of the content and stylized features, and shows only minor variations over the range [0, 1]. In contrast, our multi-scale style swap enables continuous and discriminative stylized patterns by changing the scale coefficient in eq. (8) of the paper, and can further generate integrated results via different combinations efficiently.
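The weight control baseline reduces to a linear interpolation in feature space; a one-line sketch (names are illustrative):

```python
import numpy as np

def weight_control(content_feat, stylized_feat, alpha):
    """Interpolate in deep feature space between content preservation
    (alpha = 0) and full stylization (alpha = 1)."""
    return alpha * stylized_feat + (1.0 - alpha) * content_feat
```

Because both endpoints share most of their structure, sweeping `alpha` over [0, 1] changes the decoded image only mildly, which is the limitation discussed above.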
6.3.2 Fusion control strategy
As mentioned in Section 4.4, our method can effectively integrate multiple stroke patterns with different control strategies. We present the fusion procedure in Figure 13. Given K+1 stroke feature maps and the corresponding attention map, we first generate K+1 clustering centers from the attention values according to eq. (10), and then assign them sequentially to stroke sizes, with higher attention values mapped to finer stroke patterns. The integrated feature map is the weighted sum of the K+1 stroke feature maps based on eqs. (11-12) in the paper.
The level of detail for multi-stroke fusion can be controlled by the smoothing factor. As visualized in Figure 14, we show the influence of different smoothing values in a four-stroke fusion scenario. The larger the factor, the more each single stroke contributes within its corresponding attention area, leading to a more discriminative effect among the stroke patterns.
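The exact weighting is given by eqs. (10-12) in the main paper; the sketch below only illustrates the general idea under an assumed softmax-style soft assignment, where each position is weighted by how close its attention value lies to each clustering center and the smoothing factor acts as an inverse temperature (all names and the specific weighting form are assumptions for illustration, not the paper's equations):

```python
import numpy as np

def fuse_strokes(stroke_feats, attn, centers, smoothing):
    """Schematic multi-stroke fusion. Each spatial position is softly
    assigned to the K+1 stroke feature maps according to the distance
    of its attention value to each clustering center; a larger
    `smoothing` sharpens the assignment.
    stroke_feats: (K+1, C, H, W); attn: (H, W); centers: (K+1,)."""
    # squared distance of each attention value to each center: (K+1, H, W)
    dist = (attn[None] - np.asarray(centers)[:, None, None]) ** 2
    logits = -smoothing * dist
    logits -= logits.max(axis=0, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)
    # weighted sum over the stroke axis: (C, H, W)
    return (w[:, None] * stroke_feats).sum(axis=0)
```

With a large smoothing value the assignment approaches a hard partition, so each attention region is dominated by a single stroke pattern, consistent with the behavior described above.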