Style Mixer: Semantic-aware Multi-Style Transfer Network

10/29/2019 ∙ by Zixuan Huang, et al. ∙ 15

Recent neural style transfer frameworks have obtained astonishing visual quality and flexibility in Single-style Transfer (SST), but little attention has been paid to Multi-style Transfer (MST), which refers to simultaneously transferring multiple styles to the same image. Compared to SST, MST has the potential to create more diverse and visually pleasing stylization results. In this paper, we propose the first MST framework to automatically incorporate multiple styles into one result based on regional semantics. We first improve the existing SST backbone network by introducing a novel multi-level feature fusion module and a patch attention module to achieve better semantic correspondences and preserve richer style details. For MST, we design a conceptually simple yet effective region-based style fusion module to insert into the backbone. It assigns corresponding styles to content regions based on semantic matching, and then seamlessly combines multiple styles together. Comprehensive evaluations demonstrate that our framework outperforms existing SST and MST methods.




1 Introduction

The goal of style transfer is to confer the style of a reference image on another image while preserving the content of the latter. The seminal work of Gatys et al. [7, 6] demonstrated that correlations between deep features are highly effective at capturing visual style, which opened up the era of neural style transfer. Since then, significant effort has been devoted to improving the speed, flexibility, and visual quality of neural style transfer. The most recent works [12, 20, 17, 3, 25] support efficient arbitrary style transfer with a single convolutional neural network model and serve as the state-of-the-art baselines.

However, most studies in neural style transfer focus on SST, i.e., the image is stylized with a single style reference. To generate more diverse and visually pleasing results, two straightforward approaches have been proposed to extend existing techniques to MST, allowing the user to render the content in an aggregation of multiple styles. One is linear blending [6, 9, 20, 25, 23, 3], which interpolates the features of different styles linearly with given weights. However, as shown in the teaser figure, this method tends to generate muddled results, since the colors and textures of different styles are simply mixed, and dull results, since the combination is spatially invariant. The other is to combine multiple styles spatially by asking users to provide a mask and manually assign the styles to different regions [23, 20], which achieves the desired effect but involves tedious work.

In this paper, we propose a semantic-aware MST network: Style Mixer. It automatically incorporates multiple styles into one result according to regional semantics. Style Mixer consists of a backbone SST network and a multi-style fusion module. The backbone network achieves semantic-level SST by learning the semantic correlations between the content and style features. It is inspired by two arbitrary style transfer networks: Avatar-Net [25] and SANet [23]. To build correspondences, Avatar-Net uses a fixed patch-swap module while SANet uses a learnable attention module. We combine the merits of both methods (leveraging patch information while allowing learnable parameters) by proposing a novel patch attention (PA) module for more accurate correspondences. PA improves on the traditional attention module by making the size of the receptive field controllable, which may benefit work in other fields as well. In addition, we improve the richness of style features by introducing multi-level feature fusion (MFF). Compared to state-of-the-art style transfer networks, our backbone network is better at both capturing semantic correspondences and preserving style richness.

For the inference stage, we design an efficient region-based multi-style fusion module that is embedded in the middle of the backbone network. The module first segments the content feature map into regions based on semantic information, and then assigns the most suitable style to each region according to the correspondence confidences generated by the PA module. After decoding this hybrid feature map, our network creates a seamless and coherent MST result. Comprehensive evaluations show that our approach can produce more vivid and diverse results than existing SST and MST methods.

In summary, the contributions of this paper are threefold:

(1) We propose the first MST framework to automatically and spatially incorporate different styles into one result based on the semantic information.

(2) We design a patch attention module for semantic correspondence, which broadens the form of attention module and enables the controllability of the size of the receptive field.

(3) We propose a conceptually simple yet effective region-based multi-style fusion module for MST to assign multiple styles to their semantically related regions and then seamlessly fuse them.

Figure 1: An overview of our proposed network. Multi-level features are first fused in the MFF module by channel-wise attention. The fused style feature is then reassembled under the guidance of the semantic correspondence between the content and style features. In the SST case, the reassembled style feature is merged with the content feature and decoded. In the MST case, as shown by the dashed line, multiple reassembled style features and their correspondence confidences are fed to our multi-style fusion module and integrated based on regional correspondence confidence.

2 Related Work

Neural style transfer. Starting from the seminal work of Gatys et al. [7, 6], Convolutional Neural Networks (CNNs) have demonstrated a remarkable ability to transfer style by matching statistical information between the features of content and style images. The framework of Gatys et al. [6] is based on iteratively updating the image to optimize content and style losses, which is applicable to arbitrary images but computationally expensive. Numerous studies have since improved style transfer in different aspects such as visual quality [15, 30, 24], perceptual control [5], and stroke control [13, 34]. Many researchers have tried to accelerate the transfer [14, 26, 27, 16, 2, 19, 4] by approximating the iterative optimization with a feed-forward network. Although speed is improved dramatically, flexibility is compromised, since each network is restricted to a single style or a finite set of styles. The dilemma between speed, flexibility, and quality [35] impedes the further development of style transfer. Recently, some fast Arbitrary-Style-Per-Model methods have been proposed to resolve the dilemma. The idea is to train a style-agnostic autoencoder and convert the content feature into a given style domain while preserving content structures. The methods in [12, 20, 17] transfer the global style by aligning the statistical distributions of the content and style features, while [3] swaps each content feature patch with the nearest style feature patch in terms of cosine similarity, which achieves local semantic-aware style transfer results. Avatar-Net [25] further extends AdaIN [12] to multi-scale style adaptation and loosens the restrictions of Style Swap [3] by performing projection before matching.
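To make the notion of global statistical alignment concrete, the following is a minimal NumPy sketch of adaptive instance normalization in the spirit of AdaIN [12]: each content channel is shifted and scaled to take on the mean and standard deviation of the corresponding style channel. The function name `adain`, the `(C, H, W)` layout, and the `eps` constant are assumptions made for illustration, not details taken from this paper.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Align per-channel mean/std of `content` (C, H, W) to those of `style` (C, H', W')."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Whiten the content statistics, then re-color them with the style statistics.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean
```

Because the alignment is applied to whole channels, it is spatially invariant, which is why such methods transfer a global style rather than matching local semantics.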

Despite the success in SST, little attention has been paid to MST, which is likely to create more vibrant and distinctive artistic effects. Some works extend their SST framework to MST as a simple add-on by linearly blending the features from different styles [6, 9, 20, 25, 23, 3] or by manually specifying masks [23, 20]. These approaches either generate undesired results or require tedious user effort. The challenge of MST is how to automatically and harmoniously combine the features of different styles without damaging the characteristics of each style. We address this challenge with regional semantic matching and produce state-of-the-art MST results.

Attention Module. Recently, attention mechanisms have become a key ingredient for models that need to incorporate global dependencies [1, 8, 32, 33]. They allow the model to look globally but attend selectively to the data. In particular, self-attention [10, 22] calculates the correlation between every two positions in a sequence. This mechanism has proved exceptionally effective in machine translation [28, 1], image classification [31, 37], visual question answering [32], and image generation [36]. Recently, [23] introduced style-attention to capture the correspondence between the content image and the style image, outperforming prior works in terms of visual quality. Compared to [23], we further improve the capability of semantic matching to boost the performance of our multi-style fusion module.

3 Proposed Method

The architecture of Style Mixer is shown in Fig. 1. The backbone style transfer model comprises an encoder and a decoder, with the multi-level feature fusion (MFF) module and patch attention (PA) module in the middle. In the case of MST, a multi-style fusion module is further embedded to distribute style features from different style references.

3.1 Framework Pipeline

A pretrained VGG-19 network is employed as a feed-forward encoder to extract features of the input pairs. To incorporate the multi-level features produced by the encoder, an MFF module is placed after the encoder and takes features from 3 different layers as input.

Because it classifies objects correctly despite large low-level variations, VGG-19 has proven efficient and robust at extracting semantic information. Therefore, by calculating the patch attention between the high-level features of the content and style images, we can obtain a meaningful semantic attention map and reassemble the style features accordingly. Finally, we merge the reassembled style feature with the content feature and decode them into an artistic image.

Since the problems of multi-level feature fusion and semantic correspondence functions are common in both SST and MST, these two modules can be trained with SST and then applied to MST. In MST, Style Mixer will process multiple styles in a parallel manner, and incorporate them with our region-based style fusion strategy. The correspondence confidence produced by PA module will guide the distribution of different styles based on semantic matching. In this way, every style will be assigned to the most semantically related region with local consistency.

3.2 Multi-level Feature Fusion Module

Figure 2: The process of multi-level feature fusion can be summarized as: concatenate, attend, and squeeze.

Features from different layers of VGG carry information at different scales and levels of abstraction. To incorporate multi-level information, Avatar-Net [25] introduces multi-level AdaIN [12] to conduct style adaptation progressively. However, holistic statistical alignment sometimes creates unpleasant artifacts. Later, SANet [23] integrated two separate style-attention modules to extract style features from two layers, which improves style richness but also introduces an expensive computational cost. To obtain faithful stylization at an affordable computational cost (which is especially critical when adopting PA), we design an MFF module to coalesce the features from 3 different layers adaptively.

Figure 3: The second image, the result of a single-level feature, is rendered in large strokes and lacks style patterns. After adding the feature from a second layer, the result (the 3rd column) is richer in spiral patterns, for instance, on the nose and in the upper-right corner. If we further integrate the feature from a lower level, which has a smaller receptive field, the high-frequency areas, e.g., the hair and eyes of the woman, become finer, while the cheek remains coarse. Features at different scales are combined pleasingly.

The whole process of our MFF module is depicted in Fig. 2. Features from the three layers are first recalibrated by a 1×1 convolution. After that, all features are resized to the same size and concatenated. To eliminate redundant and undesired features, we apply channel-wise attention [11] to reweight the concatenated feature maps according to channel-wise importance. Finally, we apply one more 3×3 convolution layer to smooth the fused feature and obtain the fused feature. The comparison between different choices of input layers is shown in Fig. 3.
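The concatenate, attend, and squeeze steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the trained module: all weight arrays (`proj_ws`, `w1`, `w2`, `w_out`) are hypothetical placeholders, nearest-neighbor resizing stands in for whatever interpolation is actually used, and the channel-wise attention is written as a squeeze-and-excitation style gate (global average pooling, two linear maps, sigmoid), which is one common form of channel attention.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a per-pixel linear map.
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H and W unchanged.
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def resize_nearest(x, size):
    # Nearest-neighbor resize of (C, H, W) to (C, size[0], size[1]).
    _, h, w = x.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, rows][:, :, cols]

def channel_attention(x, w1, w2):
    # Squeeze: global average per channel. Excite: two linear maps and a sigmoid gate.
    squeezed = x.mean(axis=(1, 2))
    hidden = np.maximum(w1 @ squeezed, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    return x * gate[:, None, None]

def mff(feats, proj_ws, w1, w2, w_out, size):
    # 1) recalibrate each level with a 1x1 conv, 2) resize to a common size,
    # 3) concatenate along channels, 4) reweight channels, 5) smooth with a 3x3 conv.
    aligned = [resize_nearest(conv1x1(f, w), size) for f, w in zip(feats, proj_ws)]
    cat = np.concatenate(aligned, axis=0)
    return conv3x3(channel_attention(cat, w1, w2), w_out)
```

In this sketch the three inputs may have different channel counts and resolutions; only the output of the 1×1 projections needs to agree with the shapes of `w1`, `w2`, and `w_out`.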

3.3 Patch Attention Module

Figure 4: Unfolding operation.

Style Swap [3] is a pioneering work that introduced local pattern matching to style transfer. However, due to the fixed cosine similarity metric and the overlap between patches, it produces overly smooth results with mismatches. SANet [23] proposed a novel style-attention mechanism that replaces the fixed cosine similarity with a flexible, learnable similarity kernel. Following the tradition of self-attention [28] and the non-local block [29], it conducts point-wise attention between content and style features. Due to the limited size of the receptive field and the local variation of the input image, point-wise attention performs unstably despite the learnable similarity kernel. To solve this problem, we extend the attention module to a more generic form, patch attention (PA), which makes the size of the receptive field controllable and better captures structural information. The mechanism of our PA module is illustrated in Fig. 5. Together with the abundant semantic information in the high-level features of VGG-19, our PA module achieves robust semantic matching. It is also worth noting that the style-attentional module in SANet is a special case of PA.

The PA module takes the content feature, the style feature, and the fused style feature from the MFF module as its inputs. Note that in SANet [23], attention is computed between the content feature and the same style feature that is subsequently reassembled. In contrast, we calculate patch attention on the original VGG-19 features from layer relu_4_1 to best preserve the semantic information, and use the resulting pair-wise correspondence to guide the rearrangement of the fused style feature.

Figure 5: Patch attention module. To best preserve the semantic information, we calculate the correspondence score between the original features from VGG-19 to guide the reassembling of the fused multi-level feature.

PA starts with channel-wise normalization to put the content and style features into a common domain. This can be regarded as style normalization [12, 18] and encourages the matching to rely only on structural and semantic similarities. We then apply a convolution to the normalized features so that the network can learn a suitable similarity kernel by itself. To improve the matching accuracy, we take neighboring information into consideration by unfolding patches at each position. The unfold operation is demonstrated in Fig. 4. Eq. 1 is defined on the channel-wise normalized feature and yields a vectorized patch feature at the i-th position, which consists of the information of the i-th position and its neighborhood.


Next, the correspondence score and the semantic attention map are calculated with the patch attention mechanism as in Eq. 2. After performing a softmax operation on each row of the correspondence score, we obtain the attention map needed for the reallocation of the fused style feature:


Driven by the contextual loss and the identity loss, similar features obtain a larger correspondence score, resulting in a larger value in the attention map. Thanks to the rich semantic information provided by the encoder, the correspondence score can be interpreted as semantic affinity. Thus, in the reallocation process depicted in Eq. 3, style features that are more semantically related are emphasized. The output of Eq. 3 is the reassembled style feature from the PA module.


To measure the confidence that the reassembled style feature has the same semantic meaning as the content feature, we further conduct element-wise multiplication between the correspondence score and the semantic attention map to derive a correspondence confidence. In essence, the correspondence confidence is the attention-weighted average of the correspondence scores at a content position, representing the semantic correspondence between a given style feature and the content feature. It plays a critical role in the distribution of styles in MST. We define it as:

where each entry is the correspondence confidence of the given style at location i.
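The PA steps described above (normalize, unfold, score, softmax, reassemble, and confidence) can be summarized by the following NumPy sketch. It is an illustration under explicit assumptions rather than the paper's exact formulation: the learnable convolution is omitted (treated as identity), the correspondence score is taken to be the inner product between unfolded patches, borders are zero-padded, the patch size is assumed odd, and all function and variable names are hypothetical.

```python
import numpy as np

def channel_norm(f, eps=1e-5):
    # Normalize every channel to zero mean and unit variance over space.
    mean = f.mean(axis=(1, 2), keepdims=True)
    std = f.std(axis=(1, 2), keepdims=True)
    return (f - mean) / (std + eps)

def unfold(feat, patch=3):
    # (C, H, W) -> (H*W, C*patch*patch): each row holds a position and its neighborhood.
    c, h, w = feat.shape
    r = patch // 2
    padded = np.pad(feat, ((0, 0), (r, r), (r, r)))
    shifts = [padded[:, dy:dy + h, dx:dx + w] for dy in range(patch) for dx in range(patch)]
    return np.stack(shifts, axis=1).reshape(c * patch * patch, h * w).T

def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def patch_attention(f_c, f_s, f_fused, patch=3):
    """f_c: (C, H, W) content; f_s: (C, Hs, Ws) style; f_fused: (D, Hs, Ws) fused style."""
    p_c = unfold(channel_norm(f_c), patch)      # (H*W, C*p*p)
    p_s = unfold(channel_norm(f_s), patch)      # (Hs*Ws, C*p*p)
    score = p_c @ p_s.T                         # correspondence score per content/style pair
    attn = softmax_rows(score)                  # semantic attention map, rows sum to 1
    d = f_fused.shape[0]
    values = f_fused.reshape(d, -1).T           # (Hs*Ws, D)
    reassembled = (attn @ values).T.reshape(d, *f_c.shape[1:])
    # Confidence: attention-weighted average of the scores at each content position.
    confidence = (attn * score).sum(axis=1).reshape(f_c.shape[1:])
    return reassembled, confidence
```

With `patch=1` the unfolded feature reduces to the plain per-position feature vector, so the sketch degenerates to point-wise attention, consistent with the remark that the style-attentional module is a special case of PA.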

Figure 6: Investigation of patch size. Point-wise PA completely fails to differentiate the bird from the people or the flowers. While PA with the largest patch size obtains good overall matching accuracy, it sometimes mismatches objects due to noisy neighboring information, e.g., some flowers in the second example are mistakenly identified as background and disappear.

The size of the receptive field is an intrinsic characteristic of a chosen layer and is always fixed. PA makes the receptive field adjustable and further releases the potential of the attention mechanism. Fig. 6 shows how different patch sizes affect the matching and stylization results. In all 3 cases, traditional point-wise attention fails to capture the semantic correspondence correctly. In the first pair, the bird is wrongly rendered in the style of the portrait, while in the other two pairs, the styles of the bird and the flower respectively dominate the whole image, disregarding the semantic meaning of different objects. In contrast, PA with either of the two larger patch sizes demonstrates an excellent capability for semantic matching. However, a larger patch size tends to compromise detail. For instance, in the third image of the second row, some flowers in the background disappear, probably because the neighboring information dominates the matching so that the flowers are wrongly matched with the background of the styles. Taking this and the computation cost into account, we choose the smaller of these two patch sizes for PA in our model.

3.4 Region-based Multi-style Fusion Module

Figure 7: Comparison between region-based strategy and discrete strategy. The discrete strategy introduces noises in certain regions, eliminates the characteristics of style features and therefore fails to generate high-quality stylized images.

In MST, the most challenging problem lies in how to harmoniously incorporate different styles without hurting the characteristics of each style. This has two underlying implications.

Firstly, styles should not be mixed; otherwise, they will obfuscate each other and compromise style integrity. What is worse, mixing distinctive styles may produce disturbing and nondescript patterns. Thus, the assignment of different styles should be mutually exclusive. Secondly, a metric needs to be defined to decide the distribution of multiple styles. Semantic correspondence is a natural idea since, with semantic consideration, the overall effect will look more reasonable and intuitive. Correspondence confidence is precisely the objective measure of semantic correspondence among different styles.

Given the two considerations above, a straightforward idea is to assign the style with the highest confidence to each position. However, local variation and noise sometimes interfere with the calculation of correspondence, inducing false matches and producing unpleasant discrete patterns. In Fig. 7, we can see that the discrete strategy produces many scattered patterns and degrades local consistency.

To resolve this problem, we use clustering to segment our content feature map and calculate regional correspondence confidence. The regional voting strategy increases the robustness of matching by correcting individual mismatches. As mentioned before, high-level features carry abundant semantic information, so clustering in the high-dimensional feature space is effective at distinguishing objects with different semantic meanings. Specifically, we apply K-means, with Euclidean distance, to cluster all feature vectors together with their spatial locations, which ensures the spatial affinity of the result.
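As a rough illustration of this segmentation step, the sketch below runs a plain K-means on feature vectors augmented with their normalized spatial coordinates. Several choices here are assumptions made for the example and are not specified in the text: the `spatial_weight` factor that balances feature and position distances, the deterministic farthest-point initialization (used so the example is reproducible), and the fixed number of iterations.

```python
import numpy as np

def kmeans(x, k, iters=20):
    # Deterministic farthest-point initialization (an assumption, for reproducibility).
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assign every point to its nearest center in squared Euclidean distance.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def segment_content(f_c, k=5, spatial_weight=1.0):
    """Cluster a (C, H, W) content feature map into k regions; returns (H, W) labels."""
    c, h, w = f_c.shape
    feats = f_c.reshape(c, -1).T
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1) * spatial_weight
    # Appending coordinates makes nearby positions more likely to share a cluster.
    labels = kmeans(np.concatenate([feats, coords], axis=1), k)
    return labels.reshape(h, w)
```

Increasing `spatial_weight` in this sketch favors compact regions, while a value of 0 reduces it to clustering on features alone.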

The pipeline of MST is depicted as the dashed line in Fig. 1. MFF module and PA module will process multiple style references in a parallel way and pass all the reassembled style features and correspondence confidence to multi-style fusion module.

To allocate the semantically nearest style to each region, we calculate the regional sum of correspondence confidence and choose the style with the highest value for each region. The assignment policy is conceptually simple, but comprehensive evaluation shows it to be robust. Let R be a specific region: we calculate the sum of correspondence confidence in R for every style, and the style with the highest sum is assigned to region R. Formally, the strategy is defined as:

where each summed term is the correspondence confidence of a given style at a given position in R.
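The regional voting rule can be written compactly: for every region, sum each style's confidence over the region and keep the style with the largest sum. The NumPy sketch below assumes the reassembled features and confidences of all styles are stacked along a leading axis; the function names and array layouts are hypothetical. A per-position `discrete_choice` is included only for contrast with the discrete strategy.

```python
import numpy as np

def discrete_choice(confidence):
    # Discrete strategy: pick the most confident style independently at every position.
    return confidence.argmax(axis=0)

def fuse_styles(reassembled, confidence, regions):
    """reassembled: (S, D, H, W); confidence: (S, H, W); regions: (H, W) integer labels."""
    choice = np.zeros(regions.shape, dtype=int)
    for r in np.unique(regions):
        mask = regions == r
        # Regional vote: sum each style's confidence over the region, keep the best style.
        choice[mask] = confidence[:, mask].sum(axis=1).argmax()
    # Gather, at every position, the feature vector of the style chosen for its region.
    fused = np.take_along_axis(reassembled, choice[None, None, :, :], axis=0)[0]
    return fused, choice
```

Because every position in a region receives the same style, an isolated position whose own confidence favors another style is overruled by its neighbors, which is the mismatch-fixing behavior described above.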

Compared to the straightforward discrete strategy, our proposed region-based strategy improves visual quality and matching robustness. In Fig. 7, the results of the discrete strategy suffer from mismatches and local inconsistency, such as the blemishes on the horse and grassland in the upper-right pair. With regional voting, those flaws are fixed automatically: both the horse and the grass are faithfully transformed according to the reference image.

Figure 8: By changing different styles of architecture, the corresponding region in content image changes simultaneously.

Fig. 8 gives a better idea of how the styles are distributed.

Figure 9: Visual comparison with existing works on SST.

4 Experiments and Results

4.1 Implementation Details

We train our network using MSCOCO and WikiArt datasets as content images and style images, respectively, both of which contain roughly 80000 images. We use an Adam optimizer to train the backbone model with a batch-size of 6 content-style pairs and a learning rate initially set to . During the training process, we firstly resize the smaller dimension to 512 pixels while preserving the aspect ratio, and then randomly crop regions of pixels for end-to-end training.

Our loss function is defined as below to drive the training process:


Similar to [12], our perceptual loss is defined as the Euclidean distance between channel-wise normalized VGG-19 features extracted from the content image and the synthesized image. Features from multiple encoder layers are used to compute the perceptual loss. For the style loss, we apply the same style loss as AdaIN [12] to drive the global style transfer.
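A minimal sketch of these two terms is given below, operating on lists of pre-extracted feature maps (one `(C, H, W)` array per layer). It makes several assumptions for illustration: mean squared error is used in place of an unsquared Euclidean norm, the style term matches per-channel means and standard deviations in the manner of the AdaIN style loss, and the choice of layers is left to the caller.

```python
import numpy as np

def channel_norm(f, eps=1e-5):
    # Remove per-channel mean and scale so only structure, not style statistics, remains.
    mean = f.mean(axis=(1, 2), keepdims=True)
    std = f.std(axis=(1, 2), keepdims=True)
    return (f - mean) / (std + eps)

def perceptual_loss(content_feats, output_feats):
    # Distance between channel-wise normalized features, summed over layers.
    return sum(np.mean((channel_norm(c) - channel_norm(o)) ** 2)
               for c, o in zip(content_feats, output_feats))

def style_loss(style_feats, output_feats):
    # Match per-channel mean and standard deviation at every layer.
    loss = 0.0
    for s, o in zip(style_feats, output_feats):
        loss += np.mean((s.mean(axis=(1, 2)) - o.mean(axis=(1, 2))) ** 2)
        loss += np.mean((s.std(axis=(1, 2)) - o.std(axis=(1, 2))) ** 2)
    return loss
```

Note the complementary roles in this sketch: rescaling or shifting a feature channel leaves the perceptual term almost unchanged but is penalized by the style term.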

We also apply the contextual loss proposed by [21] to facilitate the semantic matching between the style feature and the content feature. The cosine distances are calculated between each pair of feature vectors in the feature maps of the style image and the synthesized image. After normalization, the affinity between any two feature points in a given layer is represented as:

where the bandwidth parameter is typically set to 0.1. The contextual loss is defined to maximize this affinity between the synthesized image and the semantically nearest style feature:

where the normalization is by the number of feature vectors at layer l, and l ranges over the layers used in our case.
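For readers unfamiliar with the contextual loss, the sketch below follows its commonly used single-layer formulation as we understand it from [21]: cosine distances are normalized by each row's minimum, turned into affinities with an exponential kernel of bandwidth `h`, row-normalized, and the best affinity for each style feature is averaged. The centering by the style mean, the `eps` constants, and the function name are assumptions of this sketch, and the exact variant used in this paper may differ.

```python
import numpy as np

def contextual_loss(x, y, h=0.1, eps=1e-5):
    """x: (N, C) synthesized feature vectors; y: (M, C) style feature vectors."""
    mu = y.mean(axis=0, keepdims=True)
    xn = x - mu
    yn = y - mu
    xn = xn / (np.linalg.norm(xn, axis=1, keepdims=True) + eps)
    yn = yn / (np.linalg.norm(yn, axis=1, keepdims=True) + eps)
    d = 1.0 - xn @ yn.T                              # cosine distance, shape (N, M)
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)  # relative to each row's best match
    w = np.exp((1.0 - d_norm) / h)                   # bandwidth h sharpens the affinity
    cx = w / w.sum(axis=1, keepdims=True)            # row-normalized affinity
    # Average, over style features, of the best affinity from any synthesized feature.
    return -np.log(cx.max(axis=0).mean())
```

The loss approaches zero when every style feature has a synthesized feature that matches it much better than any alternative, and grows as matches become ambiguous.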

Figure 10: Visual Comparison with existing works on MST.

To guide the network toward strong semantic matching and image reconstruction, the identity loss proposed by [23] is employed, as shown in Fig. 1. Two symmetric pairs of content and style images are fed to the network, with the expectation that the network reconstructs the original images; the two reconstructions are denoted separately. Formally, the identity loss is defined as below:


In addition, we change the behavior of merging module during identity loss calculation to:


where k is a learnable scale factor; we name this module the Amplifier. The advantage of the Amplifier is further discussed in Sec. 5.1.

The weights of the five loss terms are set to 3, 3, 3, 1, and 50, respectively, according to our experiments.

4.2 Qualitative Comparison

To evaluate the effectiveness of our backbone model and region-based style fusion strategy, we compare them with existing methods. All inputs are chosen from outside the training set. For a fair comparison, we generate results by running the released code of the aforementioned works with the default configuration, except for SANet, for which we use the official demo page. The visual comparisons of SST and MST methods are shown in Fig. 9 and Fig. 10, respectively. Extra examples of our work can be found in Fig. 17.

Single-style transfer. Single-style comparison results are shown in Fig. 9. The optimization-based method [6] is unstable, since it is likely to get stuck in a local minimum for some pairs, as can be seen in columns 3 and 4 of Gatys et al. in Fig. 9, where the two faces suffer heavily from loss of detail and deviation of style. Both AdaIN [12] and WCT [20] holistically adjust the content features to match the global statistics of the style features, which leads to blurring and texture distortion in some local regions (e.g., in the last column of AdaIN and WCT, the pattern of the trees grows indiscriminately into the sky). Although Avatar-Net [25] shrinks the domain gap between content and style features and utilizes patch-wise semantics, it tends to produce fuzzy effects due to overlapping patches, and repeated patterns because of global statistical alignment (e.g., columns 1, 2, and 7 of Avatar-Net in Fig. 9). LST [17] originates from [20] and generates some good results, but it is vulnerable to wash-out artifacts (e.g., columns 3 and 4 of LST in Fig. 9) and halation around edges (e.g., columns 1 and 6). Besides, this method fails to display the desired stylized effect for some images (e.g., columns 2 and 5 of LST). SANet [23] applies a style-attention mechanism to conduct style transfer flexibly. However, false matching and distortions still occur with this method, such as the pink pattern on the trees in the first column.

Our method achieves the most balanced performance among all the above models. It greatly improves content preservation by incorporating content features from relu3, 4, 5, which can be seen in columns 1, 5, 6, and 7 of Fig. 9. At the same time, it presents rich style patterns that are both appealing and meaningful (e.g., columns 2 and 4 of Ours in Fig. 9). Moreover, the learnable patch attention module takes contextual information into consideration and flexibly reassembles style patterns, which markedly improves semantic feature transfer (e.g., columns 1, 3, and 6).

Multi-style Transfer. To illustrate the effectiveness of our region-based strategy for MST, we compare it with the traditional linear blending strategy as implemented by AdaIN [12], Avatar-Net [25], and our backbone model.

All the results are shown in Fig. 10. Generally speaking, linear blending mixes different styles, so the characteristics of the individual styles are not preserved, and it tends to produce muddled results with fade-out effects. With the linear blending strategy, our model and AdaIN [12] fail to retain the characteristics of the individual styles, as the structural and color information is fused indiscriminately (columns 2, 3, 5, and 7 in Fig. 10). Although Avatar-Net [25] preserves the style patterns for certain images, it suffers seriously from fade-out effects (columns 2, 5, 6, and 7 of Avatar-Net in Fig. 10). In contrast, Style Mixer eliminates the interference between different styles with a spatially exclusive transfer strategy. In the last column of Fig. 10, the three linear-blending-based methods produce results with colors that do not exist in the style references, while our proposed Style Mixer faithfully transfers the field, mountain, and sky in the style references to the result.

4.3 Quantitative Comparison

Figure 11: User preference towards different SST algorithms in terms of different metrics.
Figure 12: User preference towards different MST strategies.

To validate our work, we further conduct two user studies to evaluate the SST performance of our backbone model and the MST performance of Style Mixer. Both studies are conducted among 40 participants, ranging from university students to office workers. For each question, we display the results of all methods in random order and ask the participants to choose the one that best conforms to the given metric. All questions are presented in random order, and the participants are given unlimited time to finish them. Unlike in regular user studies, we do not choose the test images randomly. Instead, we handpick semantically related content and style image pairs to evaluate the performance on semantic matching. Each user study involves 36 pairs of images in total, and each user is presented with six randomly chosen ones.

Figure 13: Exemplar images in MST survey. The results of our region-based fusing strategy demonstrate the best style faithfulness with local consistency.

Single-style Transfer. First, we assess the ability of our backbone model on SST. Five state-of-the-art models [6, 12, 20, 25, 23] are chosen for comparison. We follow [34] to evaluate content preservation and style faithfulness. In addition, we introduce semantic matching ability as a new metric, indicating whether the styles are transferred according to semantic matching, e.g., tree-to-tree or face-to-face. We write explicit instructions with exemplar images to define the criteria for each metric. For a fair comparison, we run the released code of the aforementioned models with the default settings. As shown in Fig. 11, our model obtains the strongest results from a visual perspective, especially in content preservation. Even in terms of style faithfulness, our model is competitive with the iteration-based method [6]. The semantic matching score of our proposed method is also the highest among the six models, which we attribute to the PA module. The visual quality and semantic matching of our backbone model serve as the cornerstone of our MST framework.

Multi-style transfer. In order to evaluate the user preference towards different MST strategies, we eliminate the effect of the backbone model by using the same one (our model) for all strategies. Our region-based strategy is compared with linear blending as well as the discrete strategy in the user study.

The results illustrated in Fig. 12 show that our region-based strategy is preferred over the other two methods. Linear blending is the least favored, probably because of its muddled results and insipid colors, as shown in Fig. 13. The discrete strategy produces more vivid results but with some flaws due to unstable local matching (e.g., the green color on the horse in the first image and the mottled sky in the second image of Fig. 13). In contrast, our proposed method fixes these false matches with the regional voting mechanism and thus obtains better results.

Method             SST time (s)   MST time (s)
Gatys et al. [6]   51.04          -
AdaIN [12]         0.014          0.032 (linear blending)
WCT [20]           0.933          -
Avatar-Net [25]    0.330          0.526 (linear blending)
SANet [23]         0.034          -
Ours               0.045          0.371 (region-based)

Table 1: Execution time comparison (in seconds).
Figure 14: MST with more style references. Style Mixer can potentially handle an arbitrary number of style references.
Figure 15: Comparison between add operation and Amplifier as the merging module in calculation of identity loss.

4.4 Efficiency

A run-time evaluation has also been conducted, and the results are displayed in Tab. 1. All inputs are rescaled to 512 × 512 px. In SST, due to the adoption of PA, our model is slightly slower than SANet [23], but is still very competitive compared to WCT [20] and Avatar-Net [25]. In MST, our region-based feature fusion strategy runs at near real-time speed, faster than WCT [20] but slower than AdaIN [12] due to the expensive cost of clustering.

Figure 16: Investigation on different number of clusters.

4.5 Results with More References

Fig. 14 shows examples of MST with three references. Our region-based strategy is able to assign different styles to appropriate regions according to semantic correspondence and potentially handle an arbitrary number of references.

5 Discussion

5.1 The Motivation of Amplifier

SANet [23] introduces an identity loss to improve the content preservation and matching ability of the style-attention module. When calculating the identity loss, SANet merges the content feature with the swapped style feature by an add operation, which is the same as in the normal inference process. However, the content feature already contains the information necessary to complete the reconstruction. As a result, although the network is capable of rebuilding the image, the weights of the attention module may be wrongly trained towards 0, in which case the module has no effect at all. To remove this vulnerability, we apply Eq. 9 to replace the original add operation. Without the supply of the content feature, the PA module faces a harder task and is forced to learn more accurate correspondences, which is corroborated by experiments. For example, in Fig. 15, with the add operation as the merging module, the wings of the bird are wrongly matched with the background of the flower reference. In contrast, when the Amplifier is used, the wings of the bird are transferred to a green color in accordance with the bird reference.
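The degenerate solution described above can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions: a feature-space error stands in for the image reconstruction loss, `out_weight` is a hypothetical scalar standing in for the attention module's output weights, and `attention_only` is a stand-in merge without a content path, not the Amplifier of Eq. 9.

```python
import numpy as np

rng = np.random.default_rng(0)
content = rng.normal(size=(16, 8))  # toy feature: 16 positions, 8 channels

def attention(query, key, value):
    """Plain softmax attention over positions."""
    logits = query @ key.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ value

def identity_error(merge, out_weight):
    # Identity setting: the content image also serves as the style image.
    swapped = out_weight * attention(content, content, content)
    return float(np.mean((merge(content, swapped) - content) ** 2))

def add_merge(f_c, f_cs):
    return f_c + f_cs   # residual add: the content feature passes through

def attention_only(f_c, f_cs):
    return f_cs         # hypothetical stand-in: no content path

# With the attention output weights collapsed to zero:
err_add_dead = identity_error(add_merge, out_weight=0.0)
err_only_dead = identity_error(attention_only, out_weight=0.0)
print(err_add_dead, err_only_dead)
```

With the add merge, a dead attention branch still yields zero error, so the loss gives no signal to the attention weights. Once the content path is removed, the same dead branch is penalized.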

Figure 17: More results of Style Mixer. The upper rows are SST results, which showcase the semantic matching ability of our backbone model; e.g., the eyes of the lady are transferred accordingly in the 5th column. The two rows below provide more results that demonstrate the superiority of Style Mixer in MST.

5.2 Choice of the Number of Clusters

To investigate how the number of clusters (K) affects the MST results, we carry out experiments with various content-style pairs, two of which are shown in Fig. 16. The experimental results illustrate that the quality of the synthesized result is not sensitive to K when K lies in a restricted range. Typically, K between 5 and 7 tends to produce appealing results. When K is relatively large, the content image is segmented into smaller regions with similar characteristics, which are very likely to be assigned the same style. However, if we increase K further, unpleasant patterns occur, since small segments are easily influenced by local features and noise, which produces false matches. It should also be noted that when K is small, the results are sensitive to the initial seeds of K-means and are not consistent with the semantic information of the content image.
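To make the role of K concrete, the following is a minimal sketch of K-means segmentation over a feature map, written in plain NumPy. The random feature map, its shape, and the helper `kmeans` are assumptions for illustration; the paper's actual clustering setup may differ.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Minimal K-means; returns one cluster label per feature vector."""
    rng = np.random.default_rng(seed)
    # Initial seeds are drawn at random, so small K can vary with the seed.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

rng = np.random.default_rng(1)
h, w, c = 16, 16, 8                      # assumed toy feature-map shape
feature_map = rng.normal(size=(h, w, c))
flat = feature_map.reshape(-1, c)

region_sizes = {}
for k in (2, 6, 24):
    labels = kmeans(flat, k).reshape(h, w)
    region_sizes[k] = np.bincount(labels.ravel(), minlength=k)
    print(k, region_sizes[k].mean())     # average pixels per region
```

As K grows, each region contains fewer positions, so any per-region decision rests on fewer samples and becomes easier to sway by local noise.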

5.3 Limitation

Semantic mismatch. This phenomenon can be attributed to the limitation of the encoder. Since VGG-19 is pretrained on ImageNet, it may not be able to handle objects that are beyond the predefined categories. Also, there is a distinct domain gap between photos and paintings. As a consequence, some style patterns may be too abstract for VGG-19 to extract accurate semantic information. For example, the cloud in the 4th column of Fig. 8 is wrongly transformed into the pattern of the ground rather than the cloud in that style reference. We believe the development of a more suitable encoder for style images will help to alleviate the problem.

Halos near the boundary. The segmentation we apply to the features is coarser than a segmentation of the original image because of the reduced spatial resolution. This deviation is amplified by the upsampling process and leads to halos. A progressive fusion strategy may be a good direction to resolve this problem.
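The resolution gap behind the halos can be sketched numerically. The example below assumes a 64 × 64 image, a downsampling factor of 8, and nearest-neighbour upsampling; these values are illustrative and are not taken from the paper.

```python
import numpy as np

size, stride = 64, 8  # assumed image size and encoder downsampling factor
yy, xx = np.mgrid[:size, :size]

# Ground-truth region at image resolution: a disc of radius 20.
image_mask = ((yy - 32) ** 2 + (xx - 32) ** 2) < 20 ** 2

# Feature-resolution mask: a cell joins the region if most of its pixels do.
cells = image_mask.reshape(size // stride, stride, size // stride, stride)
feature_mask = cells.mean(axis=(1, 3)) > 0.5

# Nearest-neighbour upsampling back to image resolution.
block = np.ones((stride, stride), dtype=int)
upsampled = np.kron(feature_mask.astype(int), block).astype(bool)

# Pixels where the coarse mask disagrees with the true region.
mismatch = image_mask ^ upsampled
mismatch_fraction = mismatch.mean()
print(mismatch_fraction)
```

The disagreeing pixels all lie in cells crossed by the region boundary, which is where a style assigned at feature resolution would spill over as a halo.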

6 Conclusion

In this work, we propose an advanced style transfer network and an efficient region-based multi-style transfer strategy. The proposed patch attention module substantially improves the ability of semantic style transfer and is applicable to any current attention-based model. We also present the first region-based strategy for MST, which is shown to be efficient and capable of improving the consistency of multi-style transfer. Comprehensive experiments demonstrate that our proposed method compares favorably with existing methods.


We thank the anonymous reviewers for helping us to improve this paper. We also acknowledge the authors of our image and style examples. This work was partly supported by CityU start-up grant 7200607 and Hong Kong ECS grant 21209119.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • [2] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua (2017) Stylebank: an explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1897–1906. Cited by: §2.
  • [3] T. Q. Chen and M. Schmidt (2016) Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337. Cited by: §1, §1, §2, §2, §3.3.
  • [4] V. Dumoulin, J. Shlens, and M. Kudlur (2017) A learned representation for artistic style. Proc. of ICLR 2. Cited by: §2.
  • [5] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman (2017) Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3985–3993. Cited by: §2.
  • [6] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423. Cited by: §1, §1, §2, §2, §4.2, §4.3, Table 1.
  • [7] L. Gatys, A. S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pp. 262–270. Cited by: §1, §2.
  • [8] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) Draw: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. Cited by: §2.
  • [9] S. Gu, C. Chen, J. Liao, and L. Yuan (2018) Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8222–8231. Cited by: §1, §2.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §3.2.
  • [12] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §1, §2, §3.2, §3.3, §4.1, §4.2, §4.2, §4.2, §4.3, §4.4, Table 1.
  • [13] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song (2018) Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 238–254. Cited by: §2.
  • [14] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §2.
  • [15] C. Li and M. Wand (2016) Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2479–2486. Cited by: §2.
  • [16] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pp. 702–716. Cited by: §2.
  • [17] X. Li, S. Liu, J. Kautz, and M. Yang (2019) Learning linear transformations for fast arbitrary style transfer. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2, §4.2.
  • [18] Y. Li, N. Wang, J. Liu, and X. Hou (2017) Demystifying neural style transfer. arXiv preprint arXiv:1701.01036. Cited by: §3.3.
  • [19] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017) Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3920–3928. Cited by: §2.
  • [20] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017) Universal style transfer via feature transforms. In Advances in neural information processing systems, pp. 386–396. Cited by: §1, §1, §2, §2, §4.2, §4.3, §4.4, Table 1.
  • [21] R. Mechrez, I. Talmi, and L. Zelnik-Manor (2018) The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 768–783. Cited by: §4.1.
  • [22] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933. Cited by: §2.
  • [23] D. Y. Park and K. H. Lee (2018) Arbitrary style transfer with style-attentional networks. arXiv preprint arXiv:1812.02342. Cited by: §1, §1, §2, §2, §3.2, §3.3, §3.3, §4.1, §4.2, §4.3, §4.4, Table 1, §5.1.
  • [24] E. Risser, P. Wilmot, and C. Barnes (2017) Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893. Cited by: §2.
  • [25] L. Sheng, Z. Lin, J. Shao, and X. Wang (2018) Avatar-net: multi-scale zero-shot style transfer by feature decoration. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pp. 1–9. Cited by: §1, §1, §1, §2, §2, §3.2, §4.2, §4.2, §4.2, §4.3, §4.4, Table 1.
  • [26] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky (2016) Texture networks: feed-forward synthesis of textures and stylized images. In ICML, Vol. 1, pp. 4. Cited by: §2.
  • [27] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6924–6932. Cited by: §2.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.3.
  • [29] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §3.3.
  • [30] X. Wang, G. Oxholm, D. Zhang, and Y. Wang (2017) Multimodal transfer: a hierarchical deep convolutional neural network for fast artistic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5239–5247. Cited by: §2.
  • [31] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 842–850. Cited by: §2.
  • [32] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • [33] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §2.
  • [34] Y. Yao, J. Ren, X. Xie, W. Liu, Y. Liu, and J. Wang (2019) Attention-aware multi-stroke style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §4.3.
  • [35] C. Zhang, Y. Zhu, and S. Zhu (2018) MetaStyle: three-way trade-off among speed, flexibility, and quality in neural style transfer. arXiv preprint arXiv:1812.05233. Cited by: §2.
  • [36] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §2.
  • [37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §2.