Multimodal Style Transfer via Graph Cuts

04/09/2019 ∙ by Yulun Zhang, et al.

An assumption widely used in recent neural style transfer methods is that image styles can be described by global statistics of deep features, such as Gram or covariance matrices. Alternative approaches have represented styles by decomposing them into local pixel or neural patches. Despite the recent progress, most existing methods treat the semantic patterns of a style image uniformly, resulting in unpleasing results on complex styles. In this paper, we introduce a more flexible and general universal style transfer technique: multimodal style transfer (MST). MST explicitly considers the matching of semantic patterns in content and style images. Specifically, the style image features are clustered into sub-style components, which are matched with local content features under a graph cut formulation. A reconstruction network is trained to transfer each sub-style and render the final stylized result. Extensive experiments demonstrate the superior effectiveness, robustness, and flexibility of MST.


1 Introduction

Image style transfer is the process of rendering a content image with the characteristics of a style image. Usually, it would take a diligent artist a long time to create a stylized image with a particular style. The topic has recently drawn a lot of interest [8, 14, 4, 36, 11, 21, 10, 33, 32, 3, 37, 20] since Gatys et al. [8] discovered that the correlations between convolutional features of deep networks can represent image styles, which would have been hard for traditional patch-based methods to capture. These neural style transfer methods use either an iterative optimization scheme [8] or feed-forward networks [14, 4, 36, 11, 21, 33, 20] to synthesize the stylizations. Most of them are applicable to arbitrary style transfer with a pre-determined model. These universal style transfer methods [11, 21, 33, 20] inherently assume that the style can be represented by the global statistics of deep features, such as the Gram matrix [8] and its approximations [11, 21]. Although these neural style transfer methods can preserve the content well and match the overall style of the reference style images, they also distort the local style patterns, resulting in unpleasing visual artifacts.

Let’s start with some examples in Fig. LABEL:fig:fig_first. In the first row, where the style image consists of complex textures and strokes, these methods cannot tell them apart and neglect to match style patterns to content structures adaptively. This introduces some less desired strokes in smooth content areas, e.g., the sky. In the second row, the style image has clear spatial patterns (e.g., a large uniform background and blue/red hands). AdaIN, WCT, and LST fail to maintain the content structures and suffer from wash-out artifacts. This is mainly because the unified style background occupies a large proportion of the style image, resulting in its domination of the global statistics of the style features. These observations indicate that it may not be sufficient to represent style features as a unimodal distribution such as a Gram or covariance matrix. An ideal style representation should respect the spatially distributed style patterns.

Inherited from traditional patch-based methods, neural patch-based algorithms can generate visually pleasing results when content and style images have similar structures. However, the greedy example matching usually employed by these methods introduces less desired style patterns into the outputs. This is illustrated by the bottom two examples in Fig. LABEL:fig:fig_first, where some salient patterns in the style images, e.g., the eyes and lips, are improperly copied onto the buildings and landscape. Moreover, the last row of Fig. LABEL:fig:fig_first also illustrates the shape distortion problem of these methods; e.g., the appearance of the girl has changed. This phenomenon apparently limits the choice of style images for these methods.

To address these issues, we propose multimodal style transfer (MST), a more flexible and general style transfer method that seeks a sweet spot between parametric (Gram-matrix-based) and non-parametric (patch-based) approaches. Specifically, instead of representing the style with a unimodal distribution, we propose a multimodal style representation with a graph-based style matching mechanism to adaptively match style patterns to a content image.

Our main contributions are summarized as follows:

  • We analyze the feature distributions of different style images (see Fig. 1) and propose a multimodal style representation that better models the style feature distribution. This multimodal representation consists of a mixture of clusters, each of which represents a particular style pattern. It also allows users to mix-and-match different styles to render diverse stylized results.

  • We formulate style-content matching as an energy minimization problem with a graph and solve it via graph cuts. Style clusters are adapted to content features with respect to the content spatial configuration.

  • We demonstrate the strength of MST by extensive comparison with several state-of-the-art style transfer methods. The robustness and flexibility of MST are shown with different sub-style numbers and multi-style mixtures. The general idea of MST can also be extended to improve other existing stylization methods.

Figure 1: t-SNE [25] visualization of style features. The original high-dimensional style features are extracted at layer Conv_4_1 of VGG-19 [34] and then reduced to 3 dimensions via t-SNE. We can see that the feature distributions tend to fit multimodal distributions rather than unimodal ones.

2 Related Works

Style Transfer. Originating from non-realistic rendering [18], image style transfer is closely related to texture synthesis [5, 7, 6]. Gatys et al. [8] were the first to formulate style transfer as the matching of multi-level deep features extracted from a pre-trained deep neural network. Many improvements have been proposed based on the work of Gatys et al. [8]. Johnson et al. [14] trained a feed-forward style-specific network, producing one stylization per model. Sanakoyeu et al. [31] further proposed a style-aware content loss for high-resolution style transfer. Jing et al. [12] proposed a StrokePyramid module to enable controllable strokes with adaptive receptive fields. However, these methods are either time-consuming or have to re-train new models for new styles.

The first arbitrary style transfer method was proposed by Chen and Schmidt [4], who matched each content patch to the most similar style patch and swapped them. Luan et al. [24] proposed deep photo style transfer by adding a regularization term to the optimization objective. Based on Markov random fields (MRFs), Li and Wand [19] proposed CNNMRF to enforce local patterns in the deep feature space. Ruder et al. [30] improved video stylization with temporal coherence. Although their visual stylizations for arbitrary styles are appealing, the results are not stable [30].

Recently, Huang et al. [11] proposed real-time style transfer by matching the mean-variance statistics between content and style features. Li et al. [21] further introduced whitening and coloring (WCT) by matching the covariance matrices. Li et al. boosted style transfer with linear style transfer (LST) [20]. Gu et al. [10] proposed deep feature reshuffle (DFR), which connects the local and global style losses used in parametric and non-parametric methods. Sheng et al. [33] proposed AvatarNet to enable multi-scale transfer for arbitrary styles. Shen et al. [32] built meta networks that take style images as inputs and directly generate the corresponding image transformation networks. Mechrez et al. [26] proposed a contextual loss for image transformation. However, these methods fail to treat style patterns distinctively and neglect to adaptively match style patterns with content semantic information. For more neural style transfer works, readers can refer to the survey [13].

Graph Cuts Based Matching. Many problems that arise in early vision can be naturally expressed in terms of energy minimization. For example, a large number of computer vision problems attempt to assign labels to pixels based on noisy measurements. Graph cuts is a powerful method for solving such discrete optimization problems. Greig et al. [9] were the first to solve such problems exactly by using powerful min-cut/max-flow algorithms from combinatorial optimization. Roy and Cox [29] were the first to use these techniques for multi-camera stereo computation. Later, a growing body of research in computer vision used graph-based energy minimization for a wide range of applications, including stereo [16], texture synthesis [17], image segmentation [35], and object recognition [1]. In this paper, we formulate the matching between content and style features as an energy minimization problem and approximate its global minimum via efficient graph cuts algorithms. To the best of our knowledge, we are the first to formulate style matching as an energy minimization problem and solve it via graph cuts.

3 Proposed Method

We first investigate the style representation and propose a more efficient and reasonable multimodal style representation. We then show how to match each content feature with a sub-style. Finally, we transform the features in each sub-modal feature space.

3.1 Multimodal Style Representation

In previous CNN-based image style transfer works, there are two main ways to represent style. One is to use the features from the whole image and assume that they follow a single distribution (e.g., AdaIN [11] and WCT [21]). The other treats the style patterns as individual style patches (e.g., deep feature reshuffle [10]). Treating different style patterns equally lacks flexibility in real cases, where there are several distributions among the style features. Consider the t-SNE [25] visualization of style features in Fig. 1, where the style features are clustered into multiple groups. If a cluster dominates the feature space, e.g., the second example of Fig. LABEL:fig:fig_first, the Gram matrix based methods [21, 20, 11] fail to capture the overall style patterns. On the other hand, patch-based methods, which treat each sub-patch distinctly, suffer from directly copying the same style patterns to the results multiple times. For example, in Fig. LABEL:fig:fig_first, the eyes in the style images are copied multiple times, causing unpleasing stylization results.

Based on the observations and analyses above, we argue that neither global statistics of deep features nor local neural patches are a suitable way to represent complex real-world styles. As a result, we propose the multimodal style representation, a more efficient and flexible way to represent different style patterns.

Figure 2: t-SNE [25] visualization of style features with cluster labels. For each style-visualization pair, we cluster the features and label the style features with their corresponding cluster labels.

For a given style image $I_s$, we can extract its deep features $F_s = \{f_s^i\}_{i=1}^{H_s W_s}$ via a pre-trained encoder $\Phi$, such as VGG-19 [34]. $H_s$ and $W_s$ are the height and width of the style feature map. To achieve a multimodal representation in the high-dimensional feature space, we aim to segment the style patterns into multiple subsets. Technically, we simply apply K-means to cluster all the style feature points into $K$ clusters without considering spatial style information:

$$\min_{\{C_k\}} \sum_{k=1}^{K} \sum_{f_s \in C_k} \| f_s - \mu_k \|_2^2, \quad (1)$$

where $C_k$ is the $k$-th cluster with center $\mu_k$, and we assign this cluster a label $k \in \{1, \dots, K\}$. In the clustered space, features in the same cluster have similar visual properties and are likely drawn from the same distribution (resembling a Gaussian mixture model [28]). This process gives us a multimodal representation of style.

We visualize the multimodal style representation in Fig. 2. For each style image, we extract its VGG feature (at layer Conv_4_1 of VGG-19) and cluster it into $K$ clusters. Then, we conduct t-SNE [25] visualization with the cluster labels. As shown in Fig. 2, the clustering results match our assumption of a multimodal style representation well: nearby feature points tend to be in the same cluster. These observations not only show the multimodal style distribution, but also demonstrate that clustering is a proper way to model such a multimodal distribution.
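To make the clustering step concrete, here is a minimal NumPy sketch of plain Lloyd's K-means applied to flattened style feature vectors. The function name and the (N, C) feature layout are our own illustration, not from the paper; a production system would more likely use an optimized library implementation.

```python
import numpy as np

def cluster_style_features(f_s, k, iters=20, seed=0):
    """Cluster style feature vectors into k sub-styles with plain K-means.

    f_s: (N, C) array -- N spatial positions, C channels, e.g. a flattened
    Conv_4_1 feature map. Returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    centers = f_s[rng.choice(len(f_s), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature point to its nearest cluster center.
        d = ((f_s[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Update each center as the mean of its assigned features.
        for j in range(k):
            if (labels == j).any():
                centers[j] = f_s[labels == j].mean(0)
    # Final assignment against the converged centers.
    d = ((f_s[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), centers
```

Each returned center plays the role of a sub-style $\mu_k$ in the multimodal representation.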

3.2 Graph Based Style Matching

Like style feature extraction, we extract deep content features $F_c = \{f_c^p\}_{p=1}^{H_c W_c}$ from a content image $I_c$. $H_c$ and $W_c$ are the height and width of the content feature map. Distance measurement is the first step before matching. A good distance metric should account for the scale difference between the content and style features. Computational complexity should also be taken into consideration, since all the content features take part in the matching. Based on the above analysis, we calculate the cosine distance between a content feature $f_c$ and a style cluster center $\mu_k$ as follows:

$$D(f_c, \mu_k) = 1 - \frac{f_c^{T} \mu_k}{\| f_c \|_2 \, \| \mu_k \|_2}, \quad (2)$$

where $(\cdot)^T$ is the transpose operation and $\| \cdot \|_2$ is the magnitude of a feature vector.
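The cosine distance of Eq. (2) between all content features and all style cluster centers can be computed in one vectorized step. This is a NumPy sketch; the function name, the (N, C)/(K, C) layout, and the `eps` guard against zero vectors are our own assumptions.

```python
import numpy as np

def cosine_distance(f_c, centers, eps=1e-8):
    """D(f, mu) = 1 - (f . mu) / (|f| |mu|) for every content feature
    against every style cluster center.

    f_c: (N, C) content features, centers: (K, C) -> (N, K) distances.
    """
    fn = f_c / (np.linalg.norm(f_c, axis=1, keepdims=True) + eps)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    return 1.0 - fn @ cn.T
```

Because both operands are normalized first, the distance is invariant to feature magnitude, which is exactly the scale-robustness argued for above.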

Then we aim to find a labeling $L$ that assigns each content feature $f_c^p$ a style cluster label $L_p \in \{1, \dots, K\}$. We formulate the disagreement between $L$ and the content features as follows:

$$E_{\text{data}}(L) = \sum_{p} D(f_c^p, \mu_{L_p}), \quad (3)$$

where we name $E_{\text{data}}$ the data energy. Minimizing $E_{\text{data}}$ encourages $L$ to be consistent with the content features.

Figure 3: Graph based style matching. Example of the graph containing content features and style cluster centers. We match content features with style clusters at the pixel level.

However, the spatial content information is not yet considered, which fails to preserve discontinuities and produces unpleasing structures in the stylized results. Instead, we hope pixels in the same local content region have the same label. Namely, we want $L$ to be piecewise smooth and discontinuity preserving. So we further introduce a smoothness term as follows:

$$E_{\text{smooth}}(L) = \sum_{(p, q) \in \mathcal{N}} V_{p,q}(L_p, L_q), \quad (4)$$

where $\mathcal{N}$ is the set of positions of directly interacting pairs of content features, and $V_{p,q}(L_p, L_q)$ denotes the penalty for the position pair $(p, q)$. Such interaction terms have proven important in various computer vision applications [2], and various forms of energy functions have been investigated before. Here, we take the discontinuity-preserving function given by the Potts model:

$$V_{p,q}(L_p, L_q) = \lambda \cdot \delta(L_p \neq L_q), \quad (5)$$

where $\delta(\cdot)$ is 1 if its argument is true and 0 otherwise, and $\lambda$ is a smoothness constant. This model encourages the labeling to form several regions, where content features in the same region share the same style cluster label.

By taking Eqs. (3) and (4) together, we naturally formulate the style matching problem as the minimization of the following energy function:

$$E(L) = E_{\text{data}}(L) + E_{\text{smooth}}(L). \quad (6)$$

The whole energy measures not only the disagreement between $L$ and the content features, but also the extent to which $L$ is not piecewise smooth. However, the global minimization of such an energy function is NP-hard even in the simplest discontinuity-preserving case [2].

Figure 4: Visualization of style matching. Here, we cluster style features into subsets for better understanding.

To solve the energy minimization problem in Eq. (6), we build a graph by regarding content features as $p$-vertices and style cluster centers as $l$-vertices (shown in Fig. 3). The energy minimization is then equivalent to a min-cut/max-flow problem, which can be efficiently solved via graph cuts [2]. After finding a local minimum, the whole set of content features can be re-organized as follows:

$$F_c = \bigcup_{k=1}^{K} F_c^k, \quad (7)$$

where $F_c^k$ denotes the subset whose content features are matched with the same style label $k$.

We show visualization details of graph based style matching in Fig. 4. We extract style and content features from the Conv_4_1 layer of VGG-19. Due to several downsampling modules in VGG-19, the spatial resolution of the features is much smaller than that of the inputs. We label the spatial style feature pixels with their corresponding cluster labels and obtain the style cluster maps. According to the style cluster maps in Fig. 4, style feature clustering grasps semantic information from the style images.

After style matching at the pixel level, we obtain the content-style matching map, which also reflects the semantic information, matching the content structures adaptively. Such adaptive matching alleviates wash-out artifacts when the style is very simple or has a large area of unified background. We are then able to conduct the feature transform in each content-style pair group.
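The Potts energy of Eq. (6) over a 4-connected grid can be sketched as follows. Note the caveat: the paper minimizes this energy with exact/approximate graph cuts [2], whereas the greedy iterated-conditional-modes (ICM) sweep below is only an illustrative stand-in that monotonically decreases the energy from the data-term labeling; all names here are our own.

```python
import numpy as np

def potts_energy(labels, dist, lam, h, w):
    """Total energy of Eq. (6): data term (chosen-label distances) plus a
    Potts smoothness term over 4-connected neighbours.
    labels: (h*w,) ints; dist: (h*w, K) precomputed D(f_p, mu_k)."""
    data = dist[np.arange(len(labels)), labels].sum()
    grid = labels.reshape(h, w)
    smooth = (grid[:, 1:] != grid[:, :-1]).sum() + (grid[1:] != grid[:-1]).sum()
    return data + lam * smooth

def icm_labeling(dist, lam, h, w, sweeps=5):
    """Greedy ICM stand-in for the graph-cut minimisation of Eq. (6)."""
    labels = dist.argmin(1)  # start from the pure data-term labeling
    k = dist.shape[1]
    for _ in range(sweeps):
        grid = labels.reshape(h, w)
        for p in range(h * w):
            i, j = divmod(p, w)
            nbrs = [grid[x, y] for x, y in ((i - 1, j), (i + 1, j),
                                            (i, j - 1), (i, j + 1))
                    if 0 <= x < h and 0 <= y < w]
            # Local cost of each candidate label: data + Potts disagreement.
            costs = dist[p] + lam * np.array(
                [sum(int(n != c) for n in nbrs) for c in range(k)])
            grid[i, j] = costs.argmin()
        labels = grid.reshape(-1)
    return labels
```

Each single-site update minimizes the change in the global energy, so the sweep never increases Eq. (6); unlike graph cuts, however, it offers no approximation guarantee.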

3.3 Multimodal Style Transfer

For each content-style pair group $(F_c^k, F_s^k)$, we first center the features by subtracting their mean vectors $\mu_c^k$ and $\mu_s^k$ respectively. We then conduct feature whitening and coloring as used in WCT [21]:

$$\hat{F}_c^k = C_s \, W_c \, F_c^k + \mu_s^k, \quad (8)$$

where $W_c = E_c \Lambda_c^{-\frac{1}{2}} E_c^T$ is a whitening matrix and $C_s = E_s \Lambda_s^{\frac{1}{2}} E_s^T$ is a coloring matrix. $\Lambda_c$ and $E_c$ are the diagonal matrix of eigenvalues and the orthogonal matrix of eigenvectors of the content covariance matrix $F_c^k (F_c^k)^T$. For the style covariance matrix $F_s^k (F_s^k)^T$, the corresponding matrices are $\Lambda_s$ and $E_s$. We choose WCT to transfer features because of its robustness and efficiency [21, 20]. More details about whitening and coloring are given in [21].

After feature transformation, we may also want to blend the transferred features with the content features, as done in previous works (e.g., AdaIN [11] and WCT [21]). Most previous works blend the whole transferred feature map with a single content-style trade-off, which treats different content parts equally and is not flexible for real-world cases. Instead, our multimodal style representation and matching make it possible to blend features adaptively. Namely, for each content-style pair group, we blend via

$$\tilde{F}_c^k = \alpha_k \hat{F}_c^k + (1 - \alpha_k) F_c^k, \quad (9)$$

where $\alpha_k$ is the content-style trade-off for the content features with label $k$. After blending all the groups, we obtain the whole transferred feature map

$$\tilde{F}_c = \bigcup_{k=1}^{K} \tilde{F}_c^k. \quad (10)$$

$\tilde{F}_c$ is then fed into the decoder to reconstruct the final output $I_{cs}$.
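The per-sub-style blending and reassembly of Eqs. (9)-(10) reduce to a per-position weighted mix once the label map is known. A small NumPy sketch (layout and names are our own illustration):

```python
import numpy as np

def blend_groups(f_c, f_hat, labels, alphas):
    """Per-sub-style blending: for each label k, mix the transferred
    features f_hat with the original content features using alpha_k, then
    return the reassembled full feature map.

    f_c, f_hat: (N, C) feature maps; labels: (N,) ints; alphas: (K,).
    """
    a = np.asarray(alphas, dtype=float)[labels][:, None]  # per-position trade-off
    return a * f_hat + (1.0 - a) * f_c
```

Setting all alphas equal recovers the usual single global trade-off; distinct alphas give the adaptive, per-region blending described above.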

3.4 Implementation Details

Now, we specify the implementation details of our proposed MST (the MST source code will be available after the paper is published). Similar to some previous works (e.g., AdaIN, WCT, DFR), we incorporate the pre-trained VGG-19 (up to Conv_4_1) [34] as the encoder $\Phi$. We obtain the decoder by mirroring the encoder, with the pooling layers replaced by nearest-neighbor up-scaling layers.

To train the decoder, we use the pre-trained VGG-19 [34] to compute a perceptual loss $\mathcal{L} = \mathcal{L}_c + \lambda_s \mathcal{L}_s$, which combines a content loss $\mathcal{L}_c$ and a style loss $\mathcal{L}_s$ with a fixed weighting constant $\lambda_s$. Inspired by the loss designs in [14, 22, 11], we formulate the content loss as

$$\mathcal{L}_c = \| \phi_4(I_{cs}) - \phi_4(I_c) \|_2, \quad (11)$$

where $\phi_i$ extracts features at layer Conv_i_1 of VGG-19. We then formulate the style loss as

$$\mathcal{L}_s = \sum_{i} \left( \| \mu(\phi_i(I_{cs})) - \mu(\phi_i(I_s)) \|_2 + \| \sigma(\phi_i(I_{cs})) - \sigma(\phi_i(I_s)) \|_2 \right), \quad (12)$$

where $\phi_i$ extracts features at layer Conv_i_1 of VGG-19, and $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and standard deviation of the content and style features.
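Since the style loss matches only per-channel means and standard deviations across a few VGG layers, it is cheap to sketch. The NumPy function below is an illustration of that statistics-matching idea (here with a squared-L2 penalty, an assumption on our part), not the paper's exact training loss; each list element stands in for one layer's (C, N) feature map.

```python
import numpy as np

def style_stats_loss(feats_out, feats_style):
    """Sum over layers of squared differences between per-channel mean and
    std of the stylised output features and the style features.

    feats_out, feats_style: lists of (C, N) arrays, one per VGG layer.
    """
    loss = 0.0
    for fo, fs in zip(feats_out, feats_style):
        loss += ((fo.mean(1) - fs.mean(1)) ** 2).sum()  # mean matching
        loss += ((fo.std(1) - fs.std(1)) ** 2).sum()    # std matching
    return loss
```

The loss is zero exactly when the output reproduces the style statistics at every layer, which is the training signal the decoder receives.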

Figure 5: Distance measurement investigation.

Figure 6: Discontinuity preservation investigation.

We train our network using images from MS-COCO [23] and WikiArt [27] as content and style data, respectively. Each dataset contains about 80,000 images. In each training batch, we randomly crop one pair of content and style images of fixed size as input. We implement our model in TensorFlow and apply the Adam optimizer [15] with a fixed learning rate.

4 Discussions

To better position MST within the body of style transfer work, we further discuss and clarify the relationship between MST and some representative works.

Differences to CNNMRF. CNNMRF [19] extracts a pool of neural patches from the style image and uses patch matching to match the content. MST clusters style features into multiple sub-sets and matches style cluster centers with content feature points via graph cuts. CNNMRF uses a smoothness prior for reconstruction, while MST uses it only for style matching. CNNMRF minimizes an energy function to synthesize the results, whereas MST generates stylization results with a decoder.

Differences to MT-Net. Both color and luminance are treated as a mixture of modalities in MT-Net [36], while MST obtains a multimodal representation from the style features via clustering. It should also be noted that MT-Net has to train new models for new style images, whereas MST is designed for arbitrary style transfer with a single model.

Differences to WCT. In WCT [21], the decoder is trained using only content data and a reconstruction loss; MST introduces additional style images for training. WCT uses multiple layers of VGG features and conducts multi-level coarse-to-fine stylization, which costs much more time and sometimes distorts structures, whereas MST only transfers single-level content and style features. Consequently, even with a larger cluster number $K$, MST achieves more efficient stylizations.

5 Experiments

We provide more results in the supplementary material (http://yulunzhang.com/papers/MST_supp_arXiv.pdf).

5.1 Ablation Study

Distance Measurement. We first investigate the choice of distance measurement, as it is critical for graph building. Here, we mainly compare the Euclidean distance and the cosine distance (shown in Eq. (2)). As shown in Fig. 5, MST with the Euclidean distance is affected by the huge background and may fail to transfer the desired style patterns, leading to wash-out artifacts. This is mainly because there is no normalization of the deep features; as a result, the weight of a style cluster center is proportional to its spatial proportion, weakening its semantic meaning. MST with the cosine distance performs much better.

Discontinuity Preservation. In Fig. 6, we show the effectiveness of the smoothness term in Eq. (6). Specifically, we set $\lambda$ to 0, 0.1, and 1, respectively. In real-world style transfer, people would like to smooth the facial area, as they do in real photos. Here, we select one portrait to investigate how $\lambda$ affects smoothness. When we set $\lambda = 0$, the energy function in Eq. (6) is minimized by considering only the data term. This introduces some unpleasing artifacts in the facial area near edges and demonstrates the necessity of the smoothness term. However, a large smoothness weight (e.g., $\lambda = 1$) over-smooths the stylization results, decreasing the style diversity. A proper value of $\lambda$ not only keeps better smoothness, but also preserves style diversity. We empirically set $\lambda = 0.1$ throughout the experiments.

Figure 7: Feature transformation investigation.

Feature Transform. Here we set $K = 1$ in MST and compare with AdaIN [11] to show the effectiveness of WCT for the feature transfer. As shown in Fig. 7, AdaIN produces some stroke artifacts in the smooth area, which make the cloud look unnatural. This is mainly because AdaIN uses only the mean and variance of the whole content/style features. Instead, by using whitening and coloring, our MST-1 achieves a more natural stylization and a cleaner smooth area. As a result, we adopt whitening and coloring for the feature transform.

Figure 8: Visual comparison. MST ($K = 3$) and all compared methods use default parameters.

5.2 Comparisons with Prior Arts

After investigating the effects of each component of our method, we turn to validate the effectiveness of our proposed MST. We compare with 7 state-of-the-art methods: the method by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We obtain results using their official code and default parameters, except for Gatys et al., for which we use the code from https://github.com/jcjohnson/neural-style with its default parameters (e.g., iterations = 10, learning rate = 1).

Qualitative Comparisons. We show extensive comparisons in Fig. 8. Gatys et al. [8] transfer style with iterative optimization, which is likely to fall into local minima (e.g., 1st and 3rd columns). AdaIN [11] often produces less desired artifacts in smooth areas and some halation around the edges (e.g., 1st, 5th, and 6th columns). CNNMRF [19] may suffer from distortion effects and fail to preserve the content structure well. Due to the usage of higher-level deep features (e.g., Conv_5_1), WCT [21] can generate distorted results, failing to preserve the main content structures (e.g., 1st and 2nd columns). DFR [10] reconstructs the results using style patches, which can also distort the content structure (e.g., 1st and 3rd columns). In some cases (e.g., 5th, 6th, and 7th columns), some tiny style patterns (e.g., the eyes in the flowers and tree) are copied to the results, leading to unpleasing stylizations. AvatarNet [33] can introduce some less desired style patterns in smooth areas (e.g., 1st column) and also copies some style patterns into the results (e.g., 6th and 7th columns). LST [20] generates very good results in some cases (e.g., 6th column). However, it may suffer from wash-out artifacts (e.g., 3rd and 4th columns) and halation around the edges (e.g., 5th column). These compared methods mainly treat the style patterns as a whole, lacking the ability to treat style patterns distinctively.

Instead, we treat style features as multimodal representations in a high-dimensional space. We match each content feature to its most related style cluster and adaptively transfer features according to the content semantic information. These advantages help explain why MST generates clearer results (e.g., 1st, 3rd, 5th, and 7th columns), performs more semantic matching with style patterns (e.g., 2nd column), and alleviates wash-out artifacts (e.g., 4th column). Such superior results demonstrate the effectiveness of our MST.

Method Gatys AdaIN WCT DFR AvatarNet MST
Perc./% 21.41 11.31 12.67 11.55 9.61 33.45
Table 1: Percentage of the votes that each method received.

User Study. To further evaluate the 6 methods shown in Fig. 8, we conduct a user study as in [21]. We use 15 content images and 30 style images. For each method, we use the released code and default parameters to generate 450 results. 20 content-style pairs are randomly selected for each user. For each content-style pair, we display the stylized results of the 6 methods on a web page in random order. Each user is asked to vote for the result he/she likes the most. Finally, we collect 2,000 votes from 100 users and calculate the percentage of votes that each method received. The results are shown in Tab. 1, where our MST ($K = 3$) obtains 33.45% of the total votes. This is much higher than that of Gatys et al. [8], whose stylization results are usually considered high-quality. This user study result is consistent with the visual comparisons (in Fig. 8) and further demonstrates the superior performance of our MST.

Method Gatys AdaIN WCT DFR AvatarNet
Time (s) 116.46 0.09 0.92 54.32 0.33
Method MST-1 MST-2 MST-3 MST-4 MST-5
Time (s) 0.20 1.10 1.40 1.97 2.27
Table 2: Running time (s) comparisons.

Efficiency. We further compare the running time of our method with previous ones [8, 11, 21, 10, 33]. Tab. 2 gives the average time of each method on 100 image pairs of the same size. All methods are tested on a PC with an Intel i7-6850K 3.6 GHz CPU and a Titan Xp GPU. Our MST with different $K$ runs considerably faster than the methods of Gatys et al. [8] and DFR [10]. Even using SVD on the CPU, MST-1 is faster than AvatarNet [33] and WCT [21]. It should be noted that WCT conducts multi-level stylization, which costs much more time than MST-1. MST-$K$ ($K > 1$) becomes much slower with larger $K$, mainly because our clustering operation is executed on the CPU and consumes much more time. On the other hand, although MST with a larger $K$ consumes more time, its stylized results are more robust. So, in general, we do not have to choose a very large $K$; we give more details about its effects later.

Figure 9: Style cluster number investigation. Same content image with complex and simple style images.

Figure 10: Multi-style transfer. MST treats patterns from different style images distinctively and transfers them adaptively.

5.3 Style Cluster Number

We investigate how the style cluster number $K$ affects the stylization in Fig. 9. When $K = 1$, our MST performs style transfer by treating the whole style feature set equally, resulting in either very complex (1st row) or very simple (2nd row) stylizations. These results are not consistent with the content structures and lack flexibility, which is unpleasing to users. Such cases are neglected by previous style transfer methods. Instead, we can produce multiple results with different $K$. When we enlarge $K$ with the multimodal style representation, the stylization results either discard unnecessary style patterns (1st row) or introduce more matched style patterns (2nd row). The stylizations become more consistent with the content structures. This is mainly because the multimodal style representation allows distinctive and adaptive treatment of the style patterns. More importantly, MST reconstructs several stylization results with different $K$, providing multiple choices for users.

5.4 Adaptive Multi-Style Transfer

Most previous style transfer methods enable style interpolation, which blends the content image with a set of weighted stylizations. In contrast, we do not fix the weights for each style image, but adaptively interpolate the style patterns to the content. As shown in Fig. 10, the content image is stylized by two style images simultaneously. We use AdaIN [11] and WCT [21] for reference (since a strictly fair comparison is not possible) by setting equal weights for each style image. In Fig. 10, AdaIN and WCT suffer from wash-out artifacts, while our MST preserves the content structures well. MST transfers more of the portrait hair style to the cat body and more of the cloud style to the cat eyes and green leaves. Our adaptive multi-style transfer is also similar to the spatial control in previous methods [11, 21]. However, they need additional manually designed masks as input, requiring more user effort. Instead, MST automatically finds a good matching between content and style features.

Figure 11: Generalization of MST to AdaIN [11].

5.5 Generalization of MST

We further investigate the generalization of our proposed MST to improve existing style transfer methods. Here, we take the popular AdaIN [11] as an example. We apply style clustering and graph based style matching to AdaIN, which is then denoted as “AdaIN + MST-$K$”. As shown in Fig. 11, AdaIN may distort some content structures (e.g., the mouth) by switching the global mean and standard deviation between style and content features. When we cluster the style features into $K$ sub-sets and match them with content features via graph cuts, this phenomenon is obviously alleviated (see the 3rd and 4th columns in Fig. 11). From these observations and analyses, we can see that our MST can be generalized to and will benefit other existing style transfer methods.

6 Conclusion

We first propose the multimodal style representation to model complex style distributions. We then formulate the style matching problem as an energy minimization problem and solve it using our proposed graph based style matching. On top of this, we propose multimodal style transfer to transform features in a multimodal way. We not only treat the style patterns distinctively, but also consider the semantic content structure and its matching with the style patterns. We also show that MST can be generalized to some existing style transfer methods and improve their stylization results. We conduct extensive experiments to validate the effectiveness, robustness, and flexibility of our method.

References

  • [1] Y. Boykov and D. P. Huttenlocher. A new bayesian framework for object recognition. In CVPR, 1999.
  • [2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 2001.
  • [3] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stereoscopic neural style transfer. In CVPR, 2018.
  • [4] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. In NIPSW, 2016.
  • [5] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
  • [6] M. Elad and P. Milanfar. Style transfer via texture synthesis. TIP, 2017.
  • [7] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
  • [8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • [9] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society. Series B (Methodological), pages 271–279, 1989.
  • [10] S. Gu, C. Chen, J. Liao, and L. Yuan. Arbitrary style transfer with deep feature reshuffle. In CVPR, 2018.
  • [11] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
  • [12] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song. Stroke controllable fast style transfer with adaptive receptive fields. In ECCV, 2018.
  • [13] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A review. arXiv preprint arXiv:1705.04058, 2017.
  • [14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [16] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In ECCV, 2002.
  • [17] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis using graph cuts. TOG, 2003.
  • [18] J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg. State of the “art”: A taxonomy of artistic stylization techniques for images and video. TVCG, 2013.
  • [19] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In CVPR, 2016.
  • [20] X. Li, S. Liu, J. Kautz, and M.-H. Yang. Learning linear transformations for fast arbitrary style transfer. In CVPR, 2019.
  • [21] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In NIPS, 2017.
  • [22] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [24] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In CVPR, 2017.
  • [25] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [27] K. Nichol. Painter by numbers, wikiart. https://www.kaggle.com/c/painter-by-numbers, 2016.
  • [28] D. Reynolds. Gaussian mixture models. Encyclopedia of biometrics, 2015.
  • [29] S. Roy and I. J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In ICCV, 1998.
  • [30] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos. In German Conference on Pattern Recognition, 2016.
  • [31] A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. A style-aware content loss for real-time hd style transfer. In ECCV, 2018.
  • [32] F. Shen, S. Yan, and G. Zeng. Neural style transfer via meta networks. In CVPR, 2018.
  • [33] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018.
  • [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [35] O. Veksler. Image segmentation by nested cuts. In CVPR, 2000.
  • [36] X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In CVPR, 2017.
  • [37] Y. Zhang, Y. Zhang, and W. Cai. Separating style and content for generalized style transfer. In CVPR, 2018.

Appendix A Appendix

Due to the limited space in the main paper, more visual comparisons with prior arts are shown in Figs. 12, 13, 14, 15, and 16.

We investigate how the style cluster number affects the stylization in Fig. 17.

We show adaptive multi-style transfer in Fig. 18.

More experimental details about multimodal style representation, graph based style matching, and results are available at http://yulunzhang.com/papers/MST_supp_arXiv.pdf.

Figure 12: Visual comparison with methods by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We set in MST. In addition to the wash-out artifacts in the 1st comparison, the compared methods can hardly distinguish semantic content structures in the 2nd comparison. In contrast, our MST handles different style images better.

Figure 13: Visual comparison with methods by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We set in MST. CNNMRF, DFR, and AvatarNet copy some style patterns directly into the results (e.g., 2nd comparison), leading to less desirable stylizations. Although LST keeps a clean background in the second comparison, it also generates some halation around the building.

Figure 14: Visual comparison with methods by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We set in MST. CNNMRF, DFR, and AvatarNet copy some style patterns directly into the results (e.g., the tiger eyes in the 1st comparison), leading to unpleasing stylizations. In the 2nd comparison, MST generates results that are more faithful to the content structures.

Figure 15: Visual comparison with methods by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We set in MST. MST achieves better semantic matching between content and style structures.

Figure 16: Visual comparison with methods by Gatys et al. [8], CNNMRF [19], AdaIN [11], WCT [21], deep feature reshuffle (DFR) [10], AvatarNet [33], and LST [20]. We set in MST. WCT suffers heavily from distortions.

Figure 17: Style cluster number investigation. MST-K means the style features are clustered into K clusters. As we can see, even for a simple style, a small K would suffer from wash-out artifacts to some degree. As we enlarge K, our MST can match each content feature pixel with a better style cluster and alleviate the wash-out artifacts.

Figure 18: Multi-style transfer. We set in MST. Our MST treats patterns from different style images distinctively and transfers them adaptively according to the specific content structures.