Pair-wise Exchangeable Feature Extraction for Arbitrary Style Transfer

11/26/2018 · Zhijie Wu et al.

Style transfer has been an important topic in both computer vision and graphics. Gatys et al. first showed that deep features extracted by the pre-trained VGG network represent both the content and the style of an image, and hence style transfer can be achieved through optimization in feature space. Huang et al. then showed that real-time arbitrary style transfer can be done by simply aligning the mean and variance of each feature channel. In this paper, however, we argue that aligning only the global statistics of deep features cannot always guarantee a good style transfer. Instead, we propose to jointly analyze the input image pair and extract common/exchangeable style features between the two. In addition, a new fusion mode is developed for combining content and style information in feature space. Qualitative and quantitative experiments demonstrate the advantages of our approach.




1 Introduction

Figure 1: Stylization of different content images based on the same style image. The existing state of the art (AdaIN) ignores differences among content images and aligns them to the same set of features extracted from the style image. In contrast, our approach jointly analyzes each content-style image pair and extracts exchangeable features. This allows it to better respect semantic information (e.g., blue sky in all our results vs. white sky in the 1st column of AdaIN) and to adapt to texture patterns (AdaIN’s results in the 2nd and 3rd columns contain residual textures from the content images).

A style transfer method takes a pair of images as input and synthesizes an output image that preserves the content of the first image while mimicking the style of the second. This topic has drawn much attention in recent years due to its scientific and artistic value. Recently, the seminal work of [6] found that multi-level feature statistics extracted from a pre-trained CNN model can be used to separate content and style information, making it possible to combine the content and style of arbitrary images. This method, however, depends on a slow iterative optimization, which limits its range of application.

Since then, many attempts have been made to accelerate the above approach by replacing the optimization process with feed-forward neural networks [5, 14, 19, 34, 31]. While these methods can effectively speed up the stylization process, they are generally constrained to a predefined set of styles and cannot adapt to an arbitrary style specified by a single exemplar image.

Notable effort [22, 28, 4] has been devoted to solving this flexibility vs. speed dilemma. A successful direction is to apply a statistical transformation that aligns the feature statistics of the input content image to those of the style image [11, 29, 21]. Such approaches implicitly assume that feature statistics (i.e., channel-wise mean and variance) contain all and only the style information, which can therefore be exchanged between any pair of content and style images. When this assumption does not hold for a given pair of images, the corresponding style transfer result can be poor.
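For reference, the channel-wise statistical alignment used by these approaches can be summarized by the following NumPy sketch; the array shapes and function name are illustrative and not taken from any released implementation.

```python
import numpy as np

def adain_align(content_feat, style_feat, eps=1e-5):
    """Align per-channel mean/std of content features to those of style features.

    content_feat, style_feat: float arrays of shape (H, W, C) extracted by a CNN encoder.
    This mirrors the channel-wise mean/variance alignment described in the text;
    it is a simplified sketch, not the authors' code.
    """
    c_mean = content_feat.mean(axis=(0, 1), keepdims=True)
    c_std = content_feat.std(axis=(0, 1), keepdims=True)
    s_mean = style_feat.mean(axis=(0, 1), keepdims=True)
    s_std = style_feat.std(axis=(0, 1), keepdims=True)
    normalized = (content_feat - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean
```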

Instead of aligning the input image to features independently computed from either a batch of samples (batch normalization) or a single style sample (instance normalization), we jointly consider both content and style images and extract exchangeable style features, which are customized for this particular pair of images. As a result, the stylizations of different content images are guided by different exchangeable features even under the same style image. Our experiments demonstrate that performing style transfer through pairwise exchangeable feature extraction yields more structured results and better visual details than existing approaches; see, e.g., Figures 1 and 5.

To compute exchangeable style features from the feature statistics of two input images, a novel Feature Exchange Block is designed, inspired by work on private-shared component analysis [2, 3]. In addition, we propose a new Content-Style Fusion mode to fuse the content information and the exchangeable style information before a decoder synthesizes the output image. To summarize, the contributions of our work include:

  • The importance of computing pairwise exchangeable features for style transfer between two images is clearly demonstrated.

  • A novel Feature Exchange Block is designed for learning the information shared between features extracted from a pair of input images.

  • A simple yet effective mode is developed to fuse content and style information together through channel compression and expansion.

  • The overall end-to-end style transfer framework can perform arbitrary style transfer in real time and synthesize highly detailed results in the desired styles.

2 Related Work

Figure 2: Illustration of our method. Given a pair of content and style images, we jointly analyze them to compute content codes and style codes for both images. The style codes are obtained based on the common information found between the two images, which facilitates the style swapping. Fusing different combinations of content and style codes yields four output images, which are used to compute the Reconstruction Loss and the Perceptual Loss. In addition, a Feature Exchange Loss is computed on the jointly analyzed features.

2.1 Style Transfer

Intuitively, style transfer aims at changing the style of an image while preserving its content. Earlier non-parametric methods are usually built upon low-level image features [10, 9]. Recently, impressive neural style transfer was realized by Gatys et al. [6]. In this pioneering work, they found that the deep feature maps extracted by a neural network pre-trained on a large dataset (e.g., ImageNet) are a good representation of the content of an image, whereas the correlations between different filter responses at a given layer of the network encode the style information. By matching the two representations between the deep features of the content and style images, the output image can be iteratively updated until a satisfactory stylization is reached.

The iterative optimization process used in the above approach is slow, which limits its practical application. Since then, numerous methods [14, 18, 31] have been proposed to accelerate it by training feed-forward neural networks with the same loss as in [6]. Other studies improve quality [17], photorealism [32], and user controllability [7, 32]. Very recently, Sanakoyeu et al. [27] propose to define a style based on a collection of related artistic images, achieving stylizations judged better by art history experts. Nonetheless, most of the above methods are constrained to a limited set of styles. Dumoulin et al. [5] address this problem and succeed in training a feed-forward network capable of encoding 32 styles. Li et al. [20] then extend the number of styles to 1,000. Still, the set of transferable styles is fixed and these models cannot adapt to arbitrary new styles.

To achieve both efficiency and flexibility, Huang et al. [11] propose to explicitly match the mean and variance of each feature channel of the content image to those of the style image. This simple yet effective approach enables transferring an arbitrary style specified by a single exemplar image. Li et al. [21] further apply whitening and coloring transforms to the extracted deep features.

In this paper, we argue that aligning global statistics alone cannot guarantee good style transfer results, especially when there are significant differences between the content and the style images. Inspired by domain adaptation works [2, 3], we jointly analyze both content and style images to compute the exchangeable feature component. By manipulating the feature channels of the content image based on this extracted exchangeable feature component, the final stylization is significantly improved, as evidenced in the results section.

2.2 Image-to-Image Translation

Image-to-image translation refers to the task of mapping an image from a source domain to a target domain. Isola et al. [13] first propose a supervised framework based on conditional GANs, where paired training data are required. Several unsupervised methods are later proposed to learn the translation between two image collections using only unpaired data [36, 23, 33]. Nevertheless, these methods suffer from a lack of mapping diversity. To tackle this issue, several recent works [16, 12, 8, 24] adopt a disentanglement strategy: the disentangled shared/common part is considered the content representation, while the private/domain-specific part represents the style component.

Since we are not pursuing multi-modal mapping, here we still follow the assumptions made by Gatys et al. [6]. The key difference is that, for a better stylization, we analyze the style features of the two input images jointly. A common style feature is disentangled, which is then used to guide the extraction of exchangeable style representations from the raw style features of content and style images.

3 Developed Framework

As shown in Figure 2, we present a new framework that enables fast style transfer by learning to extract exchangeable style features between the two input images, which are then intertwined with content codes for decoding the final synthesized results. A distinctive feature of our approach is that it is trained over pairs of input images; hence, a dataset of content and style images provides a large number of training pairs. Such a pairwise training approach allows our framework to better leverage the inter-dependency between the two input images and improve the final results. In this architecture, inspired by the work on private-shared component analysis [2, 3], we develop a novel block, named the Feature Exchange block, to learn the common style feature shared by the input content and style images. One common feature and two private features are used to represent the styles of the two input images. A simple yet efficient mode to fuse content and style is then studied. Figure 3 illustrates the architecture of our framework.

Figure 3: Architecture overview. The input image pair (content image I_c and style image I_s) goes through the pre-trained VGG encoder to extract feature maps (F_c and F_s). The maps are compressed into style vectors (s_c and s_s) before being fed into the newly proposed Feature Exchange Block to extract a common feature vector. The refined common feature z' is concatenated with s_c or s_s to learn content purification weights (w_c or w_s) and exchangeable style features. The former are used to suppress style-related information in the original feature maps (F_c or F_s), whereas the latter are fused with the purified feature maps. Finally, decoders are learned for synthesizing the stylized images.

Figure 4: Architecture of the proposed Feature Exchange Block, where ⊕ and ⊗ denote element-wise addition and multiplication. (a) Each block has three input features: one common feature (z) and two unique features for the content (u_c) and style (u_s) images, respectively. The block allows the common feature to interact with the unique features in Residual Message Passing Units and outputs refined versions of all three. (b) Within each Residual Message Passing Unit, the two input features update each other through four fully connected layers.

3.1 Exchangeable Feature for Style Transfer

The overall goal of the presented framework is to learn two exchangeable style features for the content image I_c and the style image I_s, which can be fused with content features to decode either a reconstructed or a stylized image. As illustrated in Figure 3, our framework consists of a shared encoder, several Feature Exchange blocks, and two decoders. Similar to prior work [11, 29], we use the first few layers of the pre-trained VGG-19 model (up to relu4_1) to initialize the encoder module, which is fixed during training. The VGG-based encoder maps the images into a latent space. We denote by F_c the feature map output by the encoder for the content image and by F_s the one for the style image. Both F_c and F_s have 512 channels at each pixel location. These two feature maps encode the basic content information of the corresponding input images.
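A fixed VGG-based encoder of this kind could be set up, for example, with tf.keras; the layer name block4_conv1 corresponds to relu4_1, and the snippet is a sketch of the setup described here rather than the authors' exact code.

```python
import tensorflow as tf

def build_vgg_encoder():
    """Fixed encoder: pre-trained VGG-19 truncated at relu4_1 (block4_conv1 in tf.keras)."""
    vgg = tf.keras.applications.VGG19(weights="imagenet", include_top=False)
    vgg.trainable = False  # the encoder stays frozen during training
    return tf.keras.Model(inputs=vgg.input,
                          outputs=vgg.get_layer("block4_conv1").output)

encoder = build_vgg_encoder()
# Feature maps for a content/style batch (inputs assumed VGG-preprocessed):
# F_c = encoder(content_images)   # shape (B, H/8, W/8, 512)
# F_s = encoder(style_images)
```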

Next, we compute a covariance matrix for each of F_c and F_s by treating the channels as the elements of a random vector. The covariance matrices store the raw style features of the two images and contain richer information than just the mean and variance of each channel. Then, to reduce the number of parameters, each of the two covariance matrices is fed into multiple convolution layers, followed by a fully-connected layer, resulting in a vector. The two vectors are denoted s_c for the content image and s_s for the style image. Inspired by private-shared component analysis, s_c and s_s are further processed into three feature vectors: two unique feature vectors, u_c and u_s, for the two images, and a common feature vector z. More precisely, u_c and u_s are initialized by feeding s_c and s_s through two fully-connected layers, respectively, whereas z is initialized by feeding the concatenation of s_c and s_s into a fully-connected layer. These three initial feature vectors are then refined by several Feature Exchange blocks that are chained together; see Section 3.2.
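The raw style feature described above (a channel covariance that is later compressed into a vector) could be computed along these lines; the compression network below, including its layer sizes and output dimension, is an illustrative assumption rather than the authors' configuration.

```python
import tensorflow as tf

def channel_covariance(feat):
    """feat: (B, H, W, C) feature map. Returns (B, C, C) covariance over spatial positions."""
    B = tf.shape(feat)[0]
    C = feat.shape[-1]
    x = tf.reshape(feat, [B, -1, C])                      # (B, N, C), N = H*W pixels
    x = x - tf.reduce_mean(x, axis=1, keepdims=True)      # center each channel
    n = tf.cast(tf.shape(x)[1], feat.dtype)
    return tf.matmul(x, x, transpose_a=True) / (n - 1.0)  # (B, C, C)

# Hypothetical compression of the 512x512 covariance into a style vector s_c or s_s:
# a few strided convolutions followed by a fully connected layer, as described in the text.
compress = tf.keras.Sequential([
    tf.keras.layers.Reshape((512, 512, 1)),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256),   # assumed style-vector dimension
])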

The refined common feature, denoted z', is employed to guide each style vector (s_c or s_s) in learning content purification weights and exchangeable style features for the respective image. To be specific, take the style image I_s as an example: the refined common feature z' is concatenated with s_s to form a vector, which is used to compute three vectors, each through a dedicated fully connected layer. The first one is a weight vector w_s used for suppressing style-related information in the original feature map F_s. This is achieved by multiplying F_s with w_s in a channel-wise attention manner for content purification. The resulting purified feature map is denoted F'_s (or F'_c for the one computed for the content image). The next two vectors, a column vector a_s and a row vector b_s, encode exchangeable style features, which can be fused with a purified feature map (F'_c or F'_s); see Section 3.3.

Finally, a decoder is learned to invert the fused feature maps back to image space. The resulting I_cs is the stylized image that transfers the style of I_s to I_c, whereas I_ss is the reconstruction of the style image I_s. A similar operation is performed for computing the stylization I_sc and the reconstruction I_cc. Note that in our framework the four outputs are produced by two decoders, each shared by a pair of outputs, which leads to more structured synthesized results.
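The per-image branch described above (concatenating the refined common feature with a compressed style vector and predicting a purification weight plus a column/row pair of style vectors) might look as follows; the layer sizes and the sigmoid gating are assumptions on our part.

```python
import tensorflow as tf

class StyleBranch(tf.keras.layers.Layer):
    """Predicts purification weights w and exchangeable style vectors (a, b) for one image."""
    def __init__(self, channels=512):
        super().__init__()
        self.fc_w = tf.keras.layers.Dense(channels, activation="sigmoid")  # channel attention
        self.fc_a = tf.keras.layers.Dense(channels)  # column vector a
        self.fc_b = tf.keras.layers.Dense(channels)  # row vector b

    def call(self, z_common, s_vec, feat):
        # feat: (B, H, W, C) raw feature map of the same image
        h = tf.concat([z_common, s_vec], axis=-1)
        w = self.fc_w(h)                         # (B, C) purification weights
        purified = feat * w[:, None, None, :]    # channel-wise attention suppresses style info
        a = self.fc_a(h)                         # exchangeable style, column vector
        b = self.fc_b(h)                         # exchangeable style, row vector
        return purified, a, b
```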

3.2 Feature Exchange Block

The architecture of a single Feature Exchange block is illustrated in Figure 4. Generally speaking, the main idea of the Feature Exchange Block is to use residual features to convey messages so that each input feature is updated iteratively, much like a message passing operation. This property allows us to chain any number of Feature Exchange blocks in a model without breaking its initial behavior. Each block includes two Residual Message Passing Units, each of which learns residual features via an attention gate.

The proposed Residual Message Passing Unit takes two features as input, as depicted in Figure 4(b). The unit aims to learn two residual vectors that update the two original input features. It efficiently considers both input features at the same time and determines how much information to output. In particular, this component is built with four learnable weight layers. The original inputs are first weighted by the first two layers, followed by a non-linear operation (ReLU). The two processed features are then added up, and the sum is fed into two different learnable weighting layers for the final attentional gating. The eventual outputs are therefore two residual features, one for each input. Note that all the learnable layers in this unit are fully connected layers of a fixed size in our experiments.

To gradually refine the common feature, each Feature Exchange block takes three inputs. The middle feature vector z encodes the common information, while the other two (u_c and u_s) represent the information unique to the corresponding images. As shown in Figure 4(a), z is simultaneously fed into two Residual Message Passing Units and is updated using the outputs of both. It is hence encouraged to encode information shared by the two images. The residual messages that are unique to the individual images are passed to u_c and u_s.

Employing residual connections facilitates gradient propagation during training and directly modifies the original features. The four learnable weight layers in each unit are expected to learn the relative importance of the intermediate features. It is also worth noting that the Feature Exchange block is easy to extend to learn a common feature across more images or for other tasks.
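A possible reading of the Feature Exchange Block and its Residual Message Passing Unit is sketched below; the sigmoid gates and layer sizes are our interpretation of the "attentional gating" described above, not the authors' released code.

```python
import tensorflow as tf

class ResidualMessagePassingUnit(tf.keras.layers.Layer):
    """Two inputs -> two residual updates, via four learnable (fully connected) weights."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc_in1 = tf.keras.layers.Dense(dim, activation="relu")
        self.fc_in2 = tf.keras.layers.Dense(dim, activation="relu")
        self.gate1 = tf.keras.layers.Dense(dim, activation="sigmoid")  # attentional gate
        self.gate2 = tf.keras.layers.Dense(dim, activation="sigmoid")

    def call(self, x1, x2):
        shared = self.fc_in1(x1) + self.fc_in2(x2)          # joint message from both inputs
        return self.gate1(shared) * shared, self.gate2(shared) * shared  # two residuals

class FeatureExchangeBlock(tf.keras.layers.Layer):
    """Refines (u_c, z, u_s); z interacts with both unique features and gathers shared info."""
    def __init__(self, dim=256):
        super().__init__()
        self.unit_c = ResidualMessagePassingUnit(dim)
        self.unit_s = ResidualMessagePassingUnit(dim)

    def call(self, u_c, z, u_s):
        r_uc, r_zc = self.unit_c(u_c, z)   # residuals for u_c and for z (content side)
        r_us, r_zs = self.unit_s(u_s, z)   # residuals for u_s and for z (style side)
        return u_c + r_uc, z + r_zc + r_zs, u_s + r_us      # residual updates
```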

3.3 Content-style Fusion

In this section, we present a simple yet effective mode that fuses content and style features in a channel compression-then-expansion manner. Without loss of generality, we discuss the fusion between the content information from I_c (represented by the purified feature map F'_c) and the exchangeable style information from I_s (represented by a column vector a_s and a row vector b_s). Our first step, referred to as style-aware content pooling, is designed to remove the information in F'_c that does not match the target style through channel compression. That is:

P_c = F'_c a_s,    (1)

where F'_c ∈ ℝ^(N×512), a_s ∈ ℝ^(512×1), and P_c ∈ ℝ^(N×1); N is the number of pixels. This operation effectively compresses all 512 channels at a given pixel location of F'_c into a single scalar.

Then the different channels are restored based on the style information extracted from I_s, i.e.:

F_cs = P_c b_s,    (2)

where the row vector b_s ∈ ℝ^(1×512) and F_cs ∈ ℝ^(N×512) is the final stylized feature map.

Compared to existing methods [11, 21, 29], the proposed mode employs the target style feature vectors to discard unrelated information and merge useful information. Figure 8 shows that the proposed fusion mode is more successful at removing the rich color information of the content image while still preserving its structure, whereas the two alternative fusion modes (concatenation and AdaIN) fail, supporting the rationale of our new fusion mode. It is also noteworthy that although the expanded channels are all linearly dependent, our learned decoder is capable of inferring a high-resolution stylized image from the fusion results.
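The compression-then-expansion fusion of Eqs. (1) and (2) reduces to two matrix products, sketched here with NumPy and illustrative shapes.

```python
import numpy as np

def fuse_content_style(purified_feat, a, b):
    """Style-aware content pooling followed by channel expansion.

    purified_feat: (N, 512) purified content feature map F'_c (N = number of pixels).
    a:             (512, 1) exchangeable style column vector from the style image.
    b:             (1, 512) exchangeable style row vector from the style image.
    Returns the fused feature map of shape (N, 512) (rank one along the channel axis).
    """
    P = purified_feat @ a   # Eq. (1): compress 512 channels per pixel to a single scalar
    return P @ b            # Eq. (2): restore 512 channels from the style information

# Example with random data:
F_p = np.random.randn(64 * 64, 512)
a = np.random.randn(512, 1)
b = np.random.randn(1, 512)
fused = fuse_content_style(F_p, a, b)   # (4096, 512)
```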

3.4 Loss Function for Training

As illustrated in Figure 2, three different types of losses are computed for each input image pair. The first is the perceptual loss [14], which is used to evaluate the stylized results. Following previous work [11, 29], we employ a VGG model [30] pre-trained on ImageNet to compute the perceptual content loss:

L_content = ‖φ(I_cs) − φ(I_c)‖_2 + ‖φ(I_sc) − φ(I_s)‖_2,    (3)

and style loss

L_style = Σ_{i∈L} ( ‖G(φ_i(I_cs)) − G(φ_i(I_s))‖_2 + ‖G(φ_i(I_sc)) − G(φ_i(I_c))‖_2 ),    (4)

where φ denotes the VGG-based encoder, φ_i the features it extracts at layer i, and G(·) the Gram matrix of those features. The set L contains the conv1_1, conv2_1, conv3_1, and conv4_1 layers.
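A minimal sketch of the perceptual losses of Eqs. (3) and (4) follows, assuming per-layer feature extractors are available; the normalization constants and the choice of layer for the content term are our own.

```python
import tensorflow as tf

def gram_matrix(feat):
    """feat: (B, H, W, C) -> (B, C, C) Gram matrix, normalized by the number of positions."""
    b = tf.shape(feat)[0]
    c = feat.shape[-1]
    x = tf.reshape(feat, [b, -1, c])
    n = tf.cast(tf.shape(x)[1], feat.dtype)
    return tf.matmul(x, x, transpose_a=True) / n

def perceptual_losses(phi_layers, stylized, content, style):
    """phi_layers: list of feature extractors (e.g., VGG sub-models) for the layers in L."""
    # Content term: compare deep features of the stylized output and the content image.
    content_loss = tf.reduce_mean(
        tf.square(phi_layers[-1](stylized) - phi_layers[-1](content)))
    # Style term: compare Gram matrices at every layer in L.
    style_loss = 0.0
    for phi in phi_layers:
        style_loss += tf.reduce_mean(
            tf.square(gram_matrix(phi(stylized)) - gram_matrix(phi(style))))
    return content_loss, style_loss
```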

The second is a reconstruction loss, which helps to improve the fidelity of our model. It penalizes the difference between the reconstructed content image I_cc and the original one I_c, as well as between the reconstructed and original style images (I_ss and I_s). That is:

L_rec = ‖I_cc − I_c‖ + ‖I_ss − I_s‖.    (5)

Finally, a feature exchange loss term is defined to facilitate common feature extraction in the Feature Exchange Block. Following the work on private-shared component analysis [2], the disentangled common feature should be different from the two unique features, while combining them should reconstruct the original style features. In other words, taking the content features as an example, we want u_c and z to be as orthogonal as possible while their combination is able to reconstruct s_c. The reconstruction is performed by feeding the sum of u_c and z into a fully connected layer that is trained to output s_c. Hence, the overall feature exchange loss is computed as:

L_ex = |u_c · z| + |u_s · z| + ‖r_c − s_c‖ + ‖r_s − s_s‖,    (6)

where r_c and r_s are the outputs of the fully-connected reconstruction layer. Note that this layer is only used during training, and the loss is computed only over the outputs of the last Feature Exchange block.
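The feature exchange loss of Eq. (6) could be implemented roughly as follows; the relative weighting of the orthogonality and reconstruction terms, and sharing a single reconstruction layer for both images, are assumptions.

```python
import tensorflow as tf

def feature_exchange_loss(u_c, u_s, z, s_c, s_s, recon_fc):
    """u_c, u_s: unique features; z: common feature (taken from the last block's outputs).
    s_c, s_s: original compressed style vectors; recon_fc: the training-only FC layer."""
    # Encourage the common feature to be orthogonal to each unique feature.
    ortho = tf.reduce_mean(tf.abs(tf.reduce_sum(u_c * z, axis=-1))) \
          + tf.reduce_mean(tf.abs(tf.reduce_sum(u_s * z, axis=-1)))
    # Summing unique + common and passing through the FC layer should recover s_c / s_s.
    rec = tf.reduce_mean(tf.square(recon_fc(u_c + z) - s_c)) \
        + tf.reduce_mean(tf.square(recon_fc(u_s + z) - s_s))
    return ortho + rec
```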

To summarize, the full objective function of our proposed network is:

L = λ_1 L_content + λ_2 L_style + λ_3 L_rec + λ_4 L_ex,    (7)

where the four weight parameters are set to 1, 2, 5, and 7 throughout the experiments.

3.5 Implementation Details

We implement our model in TensorFlow [1]. The Places365 database [35] and the WikiArt dataset [25] are used for content and style images, respectively, following [27]. During training, we resize the smaller dimension of each image to 512 pixels while keeping the original aspect ratio, and then train our model on randomly sampled patches of a fixed size. Note that at testing time, both the content and style images can be of any size.

In general, our framework consists of one encoder, three Feature Exchange blocks, and two decoders. In the decoder of each branch, three residual blocks first process the content codes. After the fusion of style and content features, two extra residual blocks are used, followed by several upsampling operations. A nearest-neighbor upscaling plus convolution strategy is used to reduce artifacts in the upsampling stage [26].

We use the Adam optimizer [15] with a batch size of 4 and a learning rate of 0.0001, keep the default decay rates, and train for 350,000 iterations.
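The optimizer settings reported above translate directly to tf.keras; this is a configuration sketch only, with the training loop left as hedged pseudocode in comments.

```python
import tensorflow as tf

# Adam with the reported learning rate and default decay rates (beta_1=0.9, beta_2=0.999).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

BATCH_SIZE = 4            # image pairs per step
NUM_ITERATIONS = 350_000

# for step in range(NUM_ITERATIONS):
#     content_batch, style_batch = next(data_iterator)    # hypothetical data pipeline
#     with tf.GradientTape() as tape:
#         loss = total_loss(content_batch, style_batch)    # Eq. (7)
#     grads = tape.gradient(loss, trainable_variables)
#     optimizer.apply_gradients(zip(grads, trainable_variables))
```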

4 Experimental Results

Figure 5: Comparison with results from different methods. Note that the proposed model generates images with better visual quality while the results of other baselines have various artifacts; see text for detailed discussions.
Figure 6: Details produced by different arbitrary style transfer models. Top row shows the zoomed-in views for the areas highlighted in the bottom row. The comparison suggests that our results contain more visual details and better preserve semantic information (e.g. vegetation is mostly mapped to greenish color) than those of existing approaches.

Comparison with Existing Methods

We compare our approach with three types of state-of-the-art techniques: 1) the general but slow optimization-based approach [6]; 2) three feed-forward methods for arbitrary style transfer (AdaIN [11], WCT [21], and Avatar-Net [29]); and 3) a recent image-to-image translation algorithm (DRIT [16]) that uses disentangled representations and can be adapted to the style transfer task. We set the maximum number of iterations to 500 for [6]. For AdaIN [11], WCT [21], Avatar-Net [29], and DRIT [16], the publicly available code released by the authors is used with default configurations.

Figure 7: Variations of our model with different loss terms: (a) perceptual loss only; (b) perceptual and reconstruction losses; (c) full model including all loss terms. Compared to (a), adding the reconstruction loss improves fidelity, but some regions are still not well stylized (b). After adding the feature exchange loss, our full model generates the best stylized result, with the color distribution closest to the target style image (c).
Figure 8: Ablation study of different fusion modes. The developed fusion mode effectively removes the rich colors of the content image, while the other two fail. Note that the concatenation mode, which fuses content and style in an expansion-concatenation manner, is adapted from Lee et al. [16], and the AdaIN mode comes from Huang et al. [12], where the style vectors of the style image are regarded as the target means and variances.

Results of the qualitative comparison are shown in Figure 5. As we can see, our method achieves favorable performance against the state-of-the-art approaches. The optimization-based method [6] can transfer arbitrary styles but easily gets stuck in local minima, causing distortions in the results (see rows 3 & 5). Additionally, it takes several minutes to generate the final results, which is inconvenient for parameter tuning. AdaIN [11] significantly speeds up this process; however, it does not respect semantic information and sometimes generates results with a color distribution different from that of the style image (see row 4). WCT [21] uses covariance matrices to improve the performance but depends heavily on hyperparameters; as shown in rows 1 & 4, it sometimes produces messy and less-structured images. Avatar-Net improves on AdaIN and WCT with a feature decoration module, but it distorts semantic structures considerably and introduces blurring and color-bump artifacts. As an image-to-image translation technique, DRIT [16] can generate results with high fidelity; however, they are often insufficiently stylized (see rows 2, 4, & 5). In contrast, our method learns exchangeable style features for each image pair, which allows it to generate better semantically structured images with finer visual details (see row 1) as well as richer color distributions (see row 5).

Figure 9: Balance between content and style. At deployment time, the degree of stylization can be controlled using the parameter α.
Figure 10: Application for spatial control. Left: content image. Middle: style images with masks to indicate target regions. Right: synthesized result.

Figure 6 provides close-up views for a better comparison of the generated details. Compared to the other baselines, our model produces results with better structure and stylization (such as the stroke-like textures and a color distribution similar to the style image). AdaIN fails to transfer the temple into the target style, while the result of WCT is less structured and even a bit messy, losing texture details. DRIT has a poor color distribution and fails to transfer the texture details as well.

Table 1 further compares the different methods quantitatively in terms of perceptual loss. This evaluation metric contains both content and style terms and has been used in previous approaches [11]. It is worth noting that our approach does not minimize the perceptual loss directly, since it is only one of the three types of losses we use. Nevertheless, our model achieves the lowest perceptual loss among all feed-forward models, with the lowest style loss and a content loss slightly higher than some of the baselines. This indicates that our approach favors fully stylized results over results with high content fidelity.

Method Content Style Overall
Gatys et al. [6] 14.0196 68.3269 82.3465
AdaIN [11] 15.2805 289.7572 305.0377
WCT [21] 15.4437 199.1699 214.6136
Avatar-Net [29] 17.4324 94.6269 112.0593
DRIT [16] 11.3788 370.2852 381.664
Ours 15.6505 89.4890 105.1395
Table 1: Quantitative comparison on perceptual (content and style) loss over 100 test images.
Method Image size: small / medium / large
Gatys et al. [6] 16.51 43.25 162.49
AdaIN [11] 0.014 0.037 0.134
WCT [21] 0.360 0.463 0.954
Avatar-Net [29] 0.756 0.834 1.11
DRIT [16] 0.021 0.044 0.164
Ours 0.031 0.066 0.21
Table 2: Running time (in seconds) comparison. All models are tested on a Nvidia Titan Xp GPU and averaged over 100 images.

Table 2 lists the running times of our approach and various state-of-the-art baselines [11, 6, 21, 29, 16] at three image scales. Existing feed-forward approaches [11, 21, 29] are known to be faster than the optimization-based method [6]. Among them, WCT [21] requires several passes and an extra SVD operation, whereas Avatar-Net [29] uses CPU-based operations; this makes them more than an order of magnitude slower than the other neural methods. Our approach is slower than, but still comparable to, the fastest method, AdaIN.

Ablation Study

Here we evaluate the impact of common feature learning and of the proposed style-content fusion mode. Common feature disentanglement during joint analysis plays a key role in our approach. Its importance can be evaluated by disabling the feature exchange loss, which prevents the network from learning exchangeable features. As shown in Figure 7(a-b), without this loss term, the color distribution and texture patterns of the result no longer mimic the target style image. In comparison, our full model yields a much more favorable result; see Figure 7(c).

The proposed fusion mode is evaluated by replacing it with two alternatives while fixing the other parts: fusion as in AdaIN [11, 12] and concatenation as in Lee et al. [16]. The comparison in Figure 8 demonstrates that only our fusion mode can effectively remove the rich colors of the content image, leading to a better stylization with respect to the input style.

Applications

We demonstrate the flexibility of our model using three applications. All these tasks are completed with the same trained model without any further fine-tuning.

Figure 11: Video stylization comparison, where each frame is processed independently in both approaches. The input style image contains strong red and yellow curves. Consequently, our stylization results enhance subtle edges in the content video and map them to red and yellow colors. Our results are clean and coherent across frames, whereas those obtained by WCT are noisier.

Being able to adjust the degree of stylization is a useful feature. In our model, this can be achieved by blending the stylized feature map F_cs with the reconstructed feature map F_cc before feeding the result to the decoder. That is, we have:

F(α) = α F_cs + (1 − α) F_cc.    (8)

By definition, the network outputs the reconstructed image when α = 0, the fully stylized image when α = 1, and a smooth transition between the two as α gradually changes from 0 to 1; see Figure 9.
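In code, the content-style trade-off of Eq. (8) is a single linear interpolation between the two fused feature maps before decoding; a minimal NumPy sketch follows, with the decoder call left as a hypothetical placeholder.

```python
import numpy as np

def blend_features(fused_stylized, fused_reconstructed, alpha):
    """alpha = 0 -> reconstruction, alpha = 1 -> full stylization, in-between -> smooth blend."""
    assert 0.0 <= alpha <= 1.0
    return alpha * fused_stylized + (1.0 - alpha) * fused_reconstructed

# decoded = decoder(blend_features(F_cs, F_cc, alpha=0.5))  # hypothetical decoder call
```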

In Figure 10, we present our model’s ability to apply different styles to different image regions. Masks are used to specify the correspondences between content image regions and the desired styles. Pairwise exchangeable feature extraction considers only the masked regions when applying a given style, helping to achieve the best stylization effect for each region.
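One simple way to realize this kind of spatial control, assuming access to per-region masks and a stylization function, is to restrict each pairwise stylization to its masked region and composite the results; this is a hedged sketch of the idea, not the authors' exact procedure.

```python
import numpy as np

def spatially_controlled_transfer(stylize, content, styles, masks):
    """stylize(content, style) -> stylized image; styles and masks are matching lists.

    Each binary mask of shape (H, W, 1) marks the content region that should follow
    the paired style image.
    """
    output = np.zeros_like(content, dtype=np.float32)
    for style, mask in zip(styles, masks):
        stylized = stylize(content, style)   # pairwise exchangeable-feature transfer
        output += mask * stylized            # keep the result only inside its region
    return output
```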

Our method can also be applied to video stylization through per-frame style transfer; see Figure 11. Compared to WCT [21], the color distributions of our stylized results are closer to the provided style image and the semantic structures of the content frames are better preserved. Moreover, adjacent frames are more coherent thanks to our sample-level common feature analysis.

5 Conclusions and Future Work

In this paper, we have presented a novel framework for transferring an arbitrary style onto a content image. By using the common style feature analyzed from both inputs as a guide, exchangeable style features are extracted. Better stylization is then achieved by fusing the purified content feature of the content image with the exchangeable style feature of the style image. In addition, we propose a simple yet efficient mode that fuses content and style in a channel compression-then-expansion manner. Experiments show that our method significantly improves stylization performance over prior state-of-the-art methods.

Many directions can be explored in the future. Currently, the covariance matrices are computed from the VGG feature map at a fixed layer; whether incorporating covariance matrices from other layers can further enhance performance is worth investigating. The presented Feature Exchange Block has proven powerful for learning the inter-dependency between samples; applying it to other tasks, such as image-to-image translation or domain adaptation, could also be investigated. Finally, the presented channel compression-then-expansion fusion mode may discard too much information, since the resulting channels are linearly dependent. Designing a more advanced strategy could further improve quality.

References

6 Appendix

6.1 Ablation Study

Figure 12: Ablation study for different fusion modes. As we can see, the proposed fusion mode leads to better stylization than other two alternative modes (see the color distribution and texture patterns).
Figure 13: Results under different numbers of Feature Exchange Blocks. Increasing the number of blocks helps to remove artifacts and adjust color distributions. In addition, a model that iterates over a single block 3 times (rightmost) cannot achieve the same effect.

In this section, additional results of the ablation study on the proposed content-style fusion mode and the Feature Exchange blocks are presented.

Content-style Fusion Mode.

The proposed fusion mode combines content and style features in a compression-then-expansion manner. Compared to the two existing modes, AdaIN and concatenation, our proposed mode can discard unrelated information in the content features based on the target style. As visualized in Figure 12, the proposed mode is more successful in adapting the output's color distribution to the style image than the other two modes.

Number of Feature Exchange Blocks.

To further evaluate the impact of the Feature Exchange Blocks on common feature learning, we train a series of models in which the number of blocks varies from 0 to 3. In addition, we compare another model that iterates over one shared block three times. As shown in Figure 13, more blocks reduce unexpected artifacts and boost performance, while the model that iterates over a single block cannot achieve the same effect.

6.2 Comparison with Existing Methods

Figure 14 presents additional comparison results against several state-of-the-art methods. As we can see, our proposed framework generates more structured and better stylized results. Moreover, our model is more successful in removing unrelated information from the content features, and better correspondences between the style and content images can be seen in our results.

Figure 14: More comparison results with several state-of-the-art methods. Our framework generates high-quality stylizations while faithfully preserving the semantic structures.

6.3 More Stylization Results

Figure 15: Stylization matrix of transferring different content images to different styles. The first row consists of style images and the content images are listed in the leftmost column.
Figure 16: Stylization matrix of transferring different content images to different styles. The first row consists of style images and the content images are listed in the leftmost column.

Stylization matrix.

Figure 15 and Figure 16 form two matrices of style transfer results. Our model is good at preserving the input semantic information and adapting the content image to the target texture patterns and color distribution.

Full style-swap of our framework.

Figure 17 lists the full results of our framework. As described in the paper, we obtain four different types of generated images, among which the stylization of the content image (i.e., I_cs) is the goal of our method. Note that the reconstruction of the input images is mainly for stabilizing the training. Although unrelated information in the content features is discarded during fusion, our model is still able to reasonably reconstruct the input images.

Figure 17: Full results of our style swap. As described in the paper, we obtain 4 different results by combining different content and style codes. Though some fine details are lost, the reconstruction results (I_cc and I_ss) reasonably reproduce the input images. On the other hand, the result of transferring the style of the content image onto the style image (I_sc) contains more artifacts. We attribute this to the fact that natural images do not have distinct styles that can be easily transferred.
Figure 18: Our method fails to transfer the illumination effect of the style image and loses structures.
Figure 19: Failure cases: unexpected colors are introduced in the stylization.
Figure 20: Failure cases: unexpected vertical stripe patterns are introduced in the stylization.
Figure 21: Failure cases: unexpected patterns (e.g., vertical stripes in the 2nd column) are introduced and some objects (e.g., the tree on the right side in the 1st column) are hard to recognize.

6.4 High-resolution stylization

Figure 22: Comparison of high-resolution image stylization. Note that our model is trained on fixed-size patches, but it is able to process images of arbitrary size, such as the high-resolution content image in this figure.

In this section, we demonstrate the ability of our proposed model to transfer styles to high-resolution images. Figure 22 shows a comparison between Avatar-Net and our framework on a high-resolution content image. One can see that our synthesized image exhibits many details, such as the color transitions within the mountains, and that the semantic structures of the various objects are preserved very well. In contrast, the result of Avatar-Net is noisier and less structured.

6.5 Video Stylization

A supplementary video containing various contents and styles is attached. At the beginning of the video, we compare results generated by our model with those produced by the baseline method WCT; our framework generates a much more stable stylized video. The remaining part shows several stylization results produced by our method. Please refer to the YouTube link: https://www.youtube.com/watch?v=Vo-S1RiQBUg.

6.6 Failure cases

Limitations of our method are discussed in the paper. Figures 18 to 21 show a number of failure cases: in some, unexpected colors or patterns are introduced in the stylization, while in others the semantic structures are not well preserved.