Style Transfer by Rigid Alignment in Neural Net Feature Space

09/27/2019 · Suryabhan Singh Hada, et al.

Arbitrary style transfer is an important problem in computer vision that aims to transfer style patterns from an arbitrary style image to a given content image. However, current methods either rely on slow iterative optimization or on fast pre-determined feature transformations that compromise the visual quality of the styled image, in particular by distorting the content structure. In this work, we present an effective and efficient approach for arbitrary style transfer that seamlessly transfers style patterns while keeping the content structure intact in the styled image. We achieve this by aligning the style features to the content features using rigid alignment, thus modifying the style features, unlike existing methods that do the opposite. We demonstrate the effectiveness of the proposed approach by generating high-quality stylized images and compare the results with current state-of-the-art techniques for arbitrary style transfer.


1 Introduction

Given a style image and a target (content) image, style transfer is the process of transferring the texture of the style image to the target image while keeping the structure of the target image intact. Recent work by Gatys et al. (2016a), neural style transfer (NST), shows the power of convolutional neural networks (CNNs) for style transfer. The use of multi-level features extracted from different layers of a pre-trained CNN has significantly improved stylization quality.

In just a few years, significant effort has been made to improve NST, either by iterative optimization-based approaches (Li and Wand, 2016a; Li et al., 2017c; Risser et al., 2017) or by feed-forward network approximation (Johnson et al., 2016; Ulyanov et al., 2016b, a; Li and Wand, 2016b; Dumoulin et al., 2017; Chen et al., 2017; Li et al., 2017b; Shen et al., 2018; Zhang and Dana, 2017; Wang et al., 2017). Optimization-based methods (Gatys et al., 2016a; Li and Wand, 2016a; Li et al., 2017c; Risser et al., 2017) achieve visually impressive results, but at the cost of efficiency, as every style transfer requires multiple optimization steps. On the other hand, feed-forward network based style transfer methods (Johnson et al., 2016; Ulyanov et al., 2016b, a; Li and Wand, 2016b; Dumoulin et al., 2017; Chen et al., 2017; Li et al., 2017b; Shen et al., 2018; Zhang and Dana, 2017; Wang et al., 2017) provide efficiency and quality, but at the cost of generalization: these networks are limited to a fixed number of styles.

Arbitrary style transfer aims to achieve generalization, quality, and efficiency at the same time. The goal is to find a transformation that takes style and content features as input and produces a styled feature that does not compromise the quality of the reconstructed stylized image.

However, current work in this direction (Huang and Belongie, 2017; Li et al., 2017a; Chen and Schmidt, 2016; Sheng et al., 2018) falls short in the quality of the generated results. Huang and Belongie (2017) and Chen and Schmidt (2016) use external style signals to supervise the content modification in a feed-forward network. The network is trained using the perceptual loss (Johnson et al., 2016), which is known to be unstable and to produce unsatisfactory style transfer results (Gupta et al., 2017; Risser et al., 2017).

In contrast, Li et al. (2017a), Chen and Schmidt (2016) and Sheng et al. (2018) manipulate the content features under the guidance of the style features in a shared high-level feature space. By decoding the manipulated features back into image space with a style-agnostic image decoder, the reconstructed images are stylized with seamless integration of the style patterns. However, these techniques over-distort the content or fail to balance the low-level and global style patterns.

In this work, we address the aforementioned issues by modifying the style features instead of the content features during style transfer. We achieve this by first matching the channel-wise statistics of the content features to those of the style features, and then aligning the style features to the content features by rigid alignment. The channel-wise statistics matching transfers local texture, and the rigid transformation adjusts the global style patterns with respect to the content features. By doing so, we avoid content over-distortion, since the alignment does not manipulate the content features. Similar to Li et al. (2017a) and Sheng et al. (2018), our method does not require any training and can be applied to any style image in real time. We also provide comprehensive evaluations against prior arbitrary style transfer methods (Gatys et al., 2016a; Huang and Belongie, 2017; Li et al., 2017a; Sheng et al., 2018) to show that our method achieves state-of-the-art performance.

Our contributions in this paper are threefold: 1) We achieve style transfer by using rigid alignment, which is different from traditional style transfer methods that depend on feature statistics matching. Rigid alignment has been well studied in computer vision for many years and has been very successful in image registration and related problems. We show that by rearranging the content and style features in a specific manner (each channel, of size $H \times W$, treated as a point in $\mathbb{R}^{H \cdot W}$, where $H$ is the height and $W$ is the width of the feature), each feature can be considered as a point cloud of $C$ points. 2) We provide a closed-form solution to the style transfer problem. 3) The proposed approach achieves impressive style transfer results in real time without introducing content distortion.

Columns: content, style, Avatar-Net (Sheng et al., 2018), WCT (Li et al., 2017a), AdaIN (Huang and Belongie, 2017), Ours.
Figure 1: Content distortion during style transfer. Regions marked by bounding boxes are zoomed in for a better visualization.

2 Related work

Due to its wide variety of applications, the problem of style transfer has been studied for a long time in computer vision. Before the seminal work by Gatys et al. (2016a), style transfer was mainly approached as non-photorealistic rendering (NPR) (Kyprianidis et al., 2012), closely related to texture synthesis (Efros and Freeman, 2001; Efros and Leung, 1999). Early approaches rely on finding low-level image correspondences and do not capture high-level semantic information well. As mentioned above, the use of CNN features in style transfer has improved the results significantly. We can divide the current neural style transfer literature into four parts.

  • Slow optimization-based methods: Gatys et al. (2016a) introduced the first NST method for style transfer. The authors created artistic style transfer by matching multi-level feature statistics of content and style images, extracted from a pre-trained image classification CNN (VGG (Simonyan and Zisserman, 2015)), using the Gram matrix. Soon after this, other variations were introduced to achieve better style transfer (Li and Wand, 2016a; Li et al., 2017c; Risser et al., 2017), to provide user controls like spatial control and color preservation (Gatys et al., 2016b; Risser et al., 2017), or to include semantic information (Frigo et al., 2016; Champandard, 2016). However, these methods require an iterative optimization over the image, which makes them impossible to apply in real time.

  • Single-style feed-forward networks: Recently, Johnson et al. (2016), Ulyanov et al. (2016b), Ulyanov et al. (2016a) and Li and Wand (2016b) addressed the real-time issue by approximating the iterative back-propagation procedure with feed-forward neural networks, trained either by the perceptual loss (Johnson et al., 2016; Ulyanov et al., 2016b) or by a Markovian generative adversarial loss (Li and Wand, 2016b). Although these approaches achieve style transfer in real time, they require training a new model for every style. This makes them very difficult to use for multiple styles, as every single style requires hours of training.

  • Single network for multiple styles: Dumoulin et al. (2017), Chen et al. (2017), Li et al. (2017b) and Shen et al. (2018) have tried to tackle the problem of multiple styles by training a small number of parameters for every new style while keeping the rest of the network the same. Conditional instance normalization (Dumoulin et al., 2017) achieved this by training channel-wise statistics corresponding to each style. Stylebank (Chen et al., 2017) learned convolution filters for each style, Li et al. (2017b) transferred styles via binary selection units, and Shen et al. (2018) trained a meta-network that generates an image transformation network for each content and style image pair. In addition, Zhang and Dana (2017) trained a weight matrix to combine style and content features. The major drawback is that the model size grows proportionally to the number of style images. Additionally, there is interference among different styles (Jing et al., 2017), which affects stylization quality.

  • Single network for arbitrary styles: Some recent works (Huang and Belongie, 2017; Li et al., 2017a; Chen and Schmidt, 2016; Sheng et al., 2018) have focused on creating a single model for arbitrary styles, i.e., one model for any style. Chen and Schmidt (2016) swap the content feature patches with the closest style feature patches, but this fails if the domain gap between content and style is large. Sheng et al. (2018) address this problem by first normalizing the features and then applying the patch swapping. Although this improves the stylization quality, it still produces content distortion and misses global style patterns, as shown in figure: 1. WCT (Li et al., 2017a) transfers multi-level style patterns by recursively applying a whitening and coloring transformation (WCT) to a set of trained auto-encoders at different levels. However, similar to Sheng et al. (2018), WCT also produces content distortion; moreover, it introduces some unwanted patterns in the styled image (Jing et al., 2017). Adaptive instance normalization (AdaIN) (Huang and Belongie, 2017) matches the channel-wise statistics (mean and variance) of the content features to the style features, but this matching occurs only at one layer, which the authors try to compensate for by training a network on the perceptual loss (Johnson et al., 2016). Although this does not introduce content distortion, it fails to capture style patterns.

Labels: content, style; relu_1, relu_2, relu_3, relu_4, relu 1 to 4.

Figure 2: Top: network pipeline of the proposed style transfer method, which is the same as in Li et al. (2017a). The result obtained by matching higher-level statistics of the style is treated as the new content to continue to match lower-level information of the style. MM represents moment matching, and RA represents rigid alignment. Bottom: comparison between single-level and multi-level stylization with the proposed approach. The first four images show styled images created by applying moment matching and rigid alignment to individual VGG features. The last image shows the stylization result of applying multi-level stylization as shown in the above network pipeline.
Columns: content, style, relu 1 to 4, relu_4.
Figure 3: Comparison of style transfer results when rigid alignment is applied only at the deepest layer (relu_4) versus at every layer. The third image shows the result of applying alignment at every layer ({relu_1, relu_2, relu_3, relu_4}), while the last column shows the result of applying alignment only at the deepest layer (relu_4). Both produce nearly identical results.

Rows: frames, style image, Ours, WCT, Avatar-Net.
Figure 4: Video stylization using the proposed approach. Similar to WCT (Li et al., 2017a) and Avatar-Net (Sheng et al., 2018), the proposed method keeps the style patterns coherent across frames. However, unlike the other two, the proposed method does not suffer from content distortion. In the case of WCT (Li et al., 2017a), the distortion is much worse than with Avatar-Net, especially on the face of the animal. Animations are provided on the authors' webpage.

What the existing arbitrary style transfer methods have in common is that they all modify the content features during the style transfer process, which eventually creates content distortion. Different from existing methods, our approach manipulates the style features during style transfer. We achieve this in two steps. First, we apply channel-wise moment matching (mean and variance) between content and style features, just as AdaIN (Huang and Belongie, 2017) does. Second, we use rigid alignment (Procrustes analysis; see Borg and Groenen, 1997, chap. 21) to align the style features to the content features. This alignment modifies the style features to adapt to the content structure, thus avoiding any content distortion while keeping the style information intact. In the next section, we describe our complete approach.

3 Style transfer using features

Let $F_c \in \mathbb{R}^{H \times W \times C}$ be the feature extracted from a layer of a pre-trained CNN when the content image passes through the network. Here, $H$ is the height, $W$ is the width, and $C$ is the number of channels of the feature $F_c$. Similarly, $F_s$ represents the corresponding feature for the style image.

For any arbitrary style transfer method, we pass $F_c$ and $F_s$ to a transformation function $T$ which outputs the styled feature $F_{cs}$ as described below:

$F_{cs} = T(F_c, F_s)$  (1)

Reconstruction of $F_{cs}$ back to image space gives the styled image. The difficult part is finding a transformation $T$ that is style-agnostic, like Sheng et al. (2018), Chen and Schmidt (2016) and Li et al. (2017a), but that, unlike these, captures local and global style information without distorting the content and does not need iterative optimization.

4 Proposed approach

Although AdaIN (Huang and Belongie, 2017) is not style-agnostic, it involves a transformation which is entirely style-agnostic: channel-wise moment matching. This matches the channel-wise mean and variance of the content features to those of the style features as follows:

$\bar{F}_c = \sigma(F_s)\,\dfrac{F_c - \mu(F_c)}{\sigma(F_c)} + \mu(F_s)$  (2)

Here, $\mu(\cdot)$ and $\sigma(\cdot)$ are the channel-wise mean and standard deviation respectively. Although this channel-wise matching alone produces unsatisfactory styled results, it is able to transfer local patterns of the style image without distorting the content structure, as shown in figure: 1. Moment matching does not provide a perfect alignment among the channels of the style and content features, which leads to missing global style patterns and thus unsatisfactory styled results. Other approaches achieve this either by a WCT transformation (Li et al., 2017a) or by patch replacement (Sheng et al., 2018; Chen and Schmidt, 2016), but this requires modifying the content features, which leads to content distortion. We tackle this by aligning the style features to the content features instead. In that way, the style features acquire the structure of the content while maintaining their global patterns.
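To make the first step concrete, here is a minimal NumPy sketch of channel-wise moment matching as described in eq: (2). The array layout (H, W, C), the function name and the epsilon for numerical stability are our own illustrative choices, not the authors' code.

```python
import numpy as np

def moment_match(f_c, f_s, eps=1e-5):
    """Match channel-wise mean/std of content features to style features.

    f_c, f_s: feature maps of shape (H, W, C) from the same VGG layer.
    Returns the moment-matched content feature of eq. (2).
    """
    mu_c, sigma_c = f_c.mean(axis=(0, 1)), f_c.std(axis=(0, 1))
    mu_s, sigma_s = f_s.mean(axis=(0, 1)), f_s.std(axis=(0, 1))
    # Normalize the content channels, then re-scale/shift with the style statistics.
    return sigma_s * (f_c - mu_c) / (sigma_c + eps) + mu_s
```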

There exist many variations of alignment, or registration, for images and point clouds, the more general of which involve nonrigid alignment (e.g. Myronenko et al., 2007). In this work we use rigid alignment via a Procrustes transformation (Borg and Groenen, 1997) because of its simplicity and the existence of a closed-form solution that can be computed efficiently. The Procrustes transformation involves shifting, scaling and finally rotating the points to be moved (the style features) with respect to the target points (the content features after moment matching). For this we consider both features as point clouds of $C$ points, each point in $\mathbb{R}^{H \cdot W}$, i.e. $\bar{F}_c, F_s \in \mathbb{R}^{C \times HW}$. Now, we apply the rigid transformation in the following steps (a code sketch is given after the list):

  • Step-I: Shifting. First, we need to shift both point clouds $\bar{F}_c$ and $F_s$ to a common point in space. We center these point clouds at the origin as follows:

    $\hat{F}_c = \bar{F}_c - \mu_{\bar{F}_c}, \qquad \hat{F}_s = F_s - \mu_{F_s}$  (3)

    here, $\mu_{\bar{F}_c}$ and $\mu_{F_s}$ are the means of the $\bar{F}_c$ and $F_s$ point clouds respectively.

  • Step-II: Scaling. Both point clouds need to have the same scale before alignment. For this, we normalize each point cloud to have unit Frobenius norm:

    $\tilde{F}_c = \hat{F}_c / \|\hat{F}_c\|_F, \qquad \tilde{F}_s = \hat{F}_s / \|\hat{F}_s\|_F$  (4)

    here, $\|\cdot\|_F$ represents the Frobenius norm.

  • Step-III: Rotation. The next step involves rotating $\tilde{F}_s$ so that it aligns as closely as possible with $\tilde{F}_c$. For this, we multiply $\tilde{F}_s$ by a rotation matrix $R$ obtained as follows:

    $R = \arg\min_{R \colon R^\top R = I} \|\tilde{F}_c - \tilde{F}_s R\|_F^2$  (5)

    Although this is an optimization problem, it has a closed-form solution. Expanding the objective:

    $\|\tilde{F}_c - \tilde{F}_s R\|_F^2 = \|\tilde{F}_c\|_F^2 + \|\tilde{F}_s\|_F^2 - 2\,\mathrm{tr}(R^\top \tilde{F}_s^\top \tilde{F}_c)$  (6)

    Since the term $\|\tilde{F}_c\|_F^2 + \|\tilde{F}_s\|_F^2$ is independent of $R$, eq: (5) becomes:

    $R = \arg\max_{R \colon R^\top R = I} \mathrm{tr}(R^\top \tilde{F}_s^\top \tilde{F}_c)$  (7)

    Using the singular value decomposition $\tilde{F}_s^\top \tilde{F}_c = U \Sigma V^\top$ and the cyclic property of the trace, we have:

    $\mathrm{tr}(R^\top \tilde{F}_s^\top \tilde{F}_c) = \mathrm{tr}(R^\top U \Sigma V^\top) = \mathrm{tr}(\Sigma V^\top R^\top U)$  (8)

    Here, $V^\top R^\top U$ is an orthogonal matrix, as it is a product of orthogonal matrices. Since $\Sigma$ is a diagonal matrix with nonnegative entries, the trace is maximized when the diagonal values of $V^\top R^\top U$ equal 1, i.e. $V^\top R^\top U = I$. Now, we have:

    $R = U V^\top$  (9)
  • Step-IV: Alignment. After obtaining the rotation matrix $R$, we rotate the style point cloud and then scale and shift it with respect to the content features in the following way:

    $F_{cs} = \|\hat{F}_c\|_F\, \tilde{F}_s R + \mu_{\bar{F}_c}$  (10)

    $F_{cs}$ is the final styled feature.

    This alignment makes the style features adapt to the content structure while keeping their local and global patterns intact.
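The four steps above reduce to a few lines of linear algebra. Below is a minimal NumPy sketch of the rigid alignment, under our reading of the equations: features reshaped to $C$ points of dimension $H \cdot W$, and the aligned style cloud rescaled and re-centered with the (moment-matched) content statistics. Function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def rigid_align(f_c_bar, f_s):
    """Align style features to (moment-matched) content features, eqs. (3)-(10).

    f_c_bar: moment-matched content feature, shape (H, W, C).
    f_s:     style feature, shape (H, W, C).
    Returns the styled feature F_cs with the same shape.
    """
    H, W, C = f_c_bar.shape
    # Each channel is a point in R^{H*W}: point clouds of shape (C, H*W).
    Xc = f_c_bar.reshape(H * W, C).T
    Xs = f_s.reshape(H * W, C).T

    # Step I: shift both point clouds to the origin.
    mu_c, mu_s = Xc.mean(axis=0), Xs.mean(axis=0)
    Xc_hat, Xs_hat = Xc - mu_c, Xs - mu_s

    # Step II: scale each cloud to unit Frobenius norm.
    nc, ns = np.linalg.norm(Xc_hat), np.linalg.norm(Xs_hat)
    Xc_tld, Xs_tld = Xc_hat / nc, Xs_hat / ns

    # Step III: optimal rotation R = U V^T from the SVD of Xs_tld^T Xc_tld.
    # Note: for large feature maps this (H*W) x (H*W) SVD is expensive; this is a sketch only.
    U, _, Vt = np.linalg.svd(Xs_tld.T @ Xc_tld)
    R = U @ Vt

    # Step IV: rotate the style cloud, then restore the content scale and mean.
    Xcs = nc * (Xs_tld @ R) + mu_c
    return Xcs.T.reshape(H, W, C)
```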

4.1 Multi-level style transfer

As shown in Gatys et al. (2016a), features from different layers provide different details during style transfer. Lower-layer features (relu_1 and relu_2) provide color and texture information, while features from higher layers (relu_3 and relu_4) provide common pattern details (figure: 2). Similar to WCT (Li et al., 2017a), we also exploit this by cascading the image through different auto-encoders. However, unlike WCT (Li et al., 2017a), we do not need to do the alignment described in section 4 at every level. We only apply the alignment at the deepest layer (relu4_1).

Applying the alignment at each layer or only at the deepest layer (relu4_1) produces nearly identical results, as shown in figure: 3. This also indicates that the rigid alignment of the style features to the content at the deepest layer is already sufficient.

Once the features are aligned, we only need to take care of local textures at the other layers. We do this by applying moment matching (eq: (2)) at the lower layers. The complete pipeline is shown in figure: 2.
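As a structural illustration of this pipeline (not the authors' implementation), the following sketch cascades the image through hypothetical encoder/decoder pairs, applying moment matching plus rigid alignment only at the deepest level and moment matching alone at the shallower ones. `moment_match` and `rigid_align` are the sketches above; the encoders and decoders are assumed to be pre-trained callables.

```python
def multi_level_stylize(content_img, style_img, encoders, decoders):
    """Cascade stylization from the deepest layer (relu4_1) down to relu1_1.

    encoders/decoders: lists of callables, deepest level first; each encoder maps an
    image to VGG features of shape (H, W, C), each decoder inverts those features.
    """
    img = content_img
    for level, (encode, decode) in enumerate(zip(encoders, decoders)):
        f_c, f_s = encode(img), encode(style_img)
        f_cs = moment_match(f_c, f_s)
        if level == 0:                 # deepest level only: rigid alignment (RA)
            f_cs = rigid_align(f_cs, f_s)
        img = decode(f_cs)             # the result becomes the new "content"
    return img
```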

Figure 5: First two columns: content and style images. Third column: style transfer with the features transformed as a point cloud of $C$ points, each in $\mathbb{R}^{H \cdot W}$. Fourth column: style transfer with a point cloud of $H \cdot W$ points, each in $\mathbb{R}^{C}$.

4.2 Need to arrange features in $\mathbb{R}^{H \cdot W}$ space

As mentioned above, for alignment we consider the deep neural network feature $F \in \mathbb{R}^{H \times W \times C}$ as a point cloud of $C$ points, each of dimension $H \cdot W$. We could also choose another configuration where each point is in $\mathbb{R}^{C}$, thus having $H \cdot W$ points in the point cloud. In figure: 5 we show a comparison of style transfer with the two configurations. As shown in figure: 5, the latter configuration results in complete distortion of the content structure in the final styled image. The reason is that deep neural network features (convolution layers) preserve some spatial structure, which is required for style transfer and successful image reconstruction. So, we need to arrange the features in a specific manner such that the alignment does not destroy this spatial structure. That is why, for alignment, we reshape $F$ so that the point cloud has $C$ points, each of dimension $H \cdot W$.
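The two arrangements differ only in how the feature tensor is flattened; a short illustration with our own notation (shapes only, random data standing in for a VGG feature map):

```python
import numpy as np

F = np.random.rand(64, 64, 512)        # a feature map of shape (H, W, C)
cloud_used = F.reshape(-1, 512).T      # (512, 4096): C points, each in R^{H*W}, used for alignment
cloud_alt  = F.reshape(-1, 512)        # (4096, 512): H*W points, each in R^C; rotating this
                                       # configuration mixes spatial positions per channel and
                                       # distorts the content structure (figure: 5)
```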

5 Experiments

5.1 Decoder training

We use a pre-trained auto-encoder network from Li et al. (2017a). This auto-encoder network has been trained for general image reconstruction. The encoder part of the network is the pre-trained VGG-19 (Simonyan and Zisserman, 2015), which is kept fixed, and the decoder network is trained to invert the VGG features back to image space. As mentioned in Li et al. (2017a), the decoder is designed to be symmetrical to the VGG-19 network, with nearest-neighbor up-sampling layers used as the inverse of the max-pooling layers.

The authors of Li et al. (2017a) trained five decoders for reconstructing images from features extracted at different layers of the VGG-19 network: relu5_1, relu4_1, relu3_1, relu2_1 and relu1_1. The loss function for training combines a pixel reconstruction loss and a feature loss (Dosovitskiy and Brox, 2016):

$L(\theta) = \|I_o - I_r\|_2^2 + \lambda\, \|\Phi(I_o) - \Phi(I_r)\|_2^2$  (11)

where $\theta$ are the weights of the decoder, $I_o$ and $I_r$ are the original and reconstructed images respectively, and $\Phi$ is the fixed VGG-19 encoder that extracts the features of the corresponding layer. In addition, $\lambda$ is the weight that balances the two losses. The decoders have been trained on the Microsoft COCO dataset (Lin et al., 2014). However, unlike Li et al. (2017a), we use only four decoders in our experiments for multi-level style transfer. These decoders correspond to the relu4_1, relu3_1, relu2_1 and relu1_1 layers of the VGG-19 network.
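A minimal sketch of the reconstruction objective in eq: (11); the placeholder `vgg_features` stands in for the fixed VGG-19 encoder $\Phi$, and the actual training code (from Li et al. (2017a)) is not reproduced here.

```python
import numpy as np

def reconstruction_loss(img_orig, img_rec, vgg_features, lam=1.0):
    """Pixel loss + feature loss of eq. (11).

    img_orig, img_rec: original and decoder-reconstructed images, shape (H, W, 3).
    vgg_features:      callable returning the fixed VGG-19 features Phi(.).
    lam:               weight balancing the two terms (lambda in eq. (11)).
    """
    pixel_loss = np.sum((img_orig - img_rec) ** 2)
    feat_loss = np.sum((vgg_features(img_orig) - vgg_features(img_rec)) ** 2)
    return pixel_loss + lam * feat_loss
```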

5.2 Comparison with prior style transfer methods

Columns: content, style, Gatys (Gatys et al., 2016a), AdaIN (Huang and Belongie, 2017), WCT (Li et al., 2017a), Avatar-Net (Sheng et al., 2018), Ours.
Figure 6: Comparison of our style transfer approach with existing work.

To show the effectiveness of the proposed method, we compare our results with two types of arbitrary style transfer approaches. The first type is iterative optimization-based (Gatys et al., 2016a) and the second type is fast arbitrary style transfer methods (Li et al., 2017a; Sheng et al., 2018; Huang and Belongie, 2017). We present these stylization results in figure: 6.

Although the optimization-based approach (Gatys et al., 2016a) can perform arbitrary style transfer, it requires a slow optimization to do so. Moreover, it suffers from getting stuck in local minima, which results in visually unsatisfying style transfer results, as shown in the third and fourth rows. AdaIN (Huang and Belongie, 2017) addresses the issue of local minima along with efficiency, but fails to capture the style patterns. For instance, in the third row, the styled image contains colors from the content, such as the red color on the lips. Contrary to this, WCT (Li et al., 2017a) and Avatar-Net (Sheng et al., 2018) perform very well in capturing the style patterns, the former by matching second-order statistics and the latter by normalized patch swapping. However, both methods fail to maintain the content structure in the stylized results. For instance, in the first row, WCT (Li et al., 2017a) completely destroys the content structure: mountains and clouds are indistinguishable. Similarly, in the second and fifth rows, the content image details are too distorted. Although Avatar-Net (Sheng et al., 2018) performs better than WCT (Li et al., 2017a), as in the first and fifth rows, it too fails to maintain content information, as shown in the second and sixth rows. In the second row, the styled image does not retain any content information at all.

On the other hand, the proposed method not only captures style patterns similar to WCT (Li et al., 2017a) and Avatar-Net (Sheng et al., 2018), but also maintains the content structure, as shown in the first, second and fifth rows where the other two fail.

We also provide a close-up in figure: 1. As shown in the figure, WCT (Li et al., 2017a) and Avatar-Net (Sheng et al., 2018) distort the content image structure. The nose in the styled image is heavily distorted, making these methods difficult to use with human faces. Contrary to this, AdaIN (Huang and Belongie, 2017) and the proposed method keep the content information intact, as shown in the last two columns of the second row. However, AdaIN (Huang and Belongie, 2017) does not capture the style patterns very well. The proposed method, on the other hand, captures the style patterns well without any content distortion in the styled image.

In addition to image-based stylization, the proposed method can also perform video stylization. We achieve this by simply doing per-frame style transfer, as shown in figure: 4. The styled video is coherent over adjacent frames since the style features adjust themselves to the content instead of the other way around, so the style transfer is spatially invariant and robust to small content variations. In contrast, Avatar-Net (Sheng et al., 2018) and WCT (Li et al., 2017a) contain severe content distortions, with the distortion being much worse in WCT (Li et al., 2017a).

5.3 Efficiency

We compare the execution time for style transfer of the proposed method with state-of-the-art arbitrary style transfer methods in table 1. We implement all methods in TensorFlow (Abadi et al., 2016) for a fair comparison. The approach of Gatys et al. (2016a) is very slow due to its iterative optimization steps, which involve multiple forward and backward passes through a pre-trained network. In contrast, the other methods have very good execution times, as they are feed-forward network based. Among all, AdaIN (Huang and Belongie, 2017) performs best, since it requires only moment matching between content and style features. WCT (Li et al., 2017a) is relatively slower as it requires an SVD operation at each layer during multi-layer style transfer. Avatar-Net (Sheng et al., 2018) has a better execution time than WCT (Li et al., 2017a) and ours. This is because of its GPU-based style-swap layer and hourglass multi-layer network.

On the other hand, our method is comparatively slower than AdaIN (Huang and Belongie, 2017) and Avatar-Net (Sheng et al., 2018), as it involves an SVD operation at relu_4. Additionally, it requires passing through multiple auto-encoders for multi-level style transfer, similar to WCT (Li et al., 2017a). However, unlike WCT (Li et al., 2017a), the proposed method needs only one SVD operation, as shown in figure: 3, and thus has a better execution time than WCT (Li et al., 2017a).

Method    Execution time (in sec)    Content loss ($\ell_c$)    Style loss ($\ell_s$)
Gatys (Gatys et al., 2016a) 58 4.40 8.28
AdaIN (Huang and Belongie, 2017) 0.13 4.62 8.18
WCT (Li et al., 2017a) 1.12 4.79 7.83
Avatar-Net (Sheng et al., 2018) 0.34 4.75 7.77
Ours 0.46 4.70 7.87
Table 1: Numeric comparison between the proposed method and state-of-the-art methods. Second column: average execution time (in seconds). Last two columns: average content and style loss for the styled images in figure: 6. Lower values are better.

5.4 Numeric comparison

In table 1 we show a numerical comparison between the different style transfer methods. We report the average content loss ($\ell_c$) and style loss ($\ell_s$) from Gatys et al. (2016a) for the images in figure: 6:

$\ell_c = \|F_{cs} - F_c\|_2^2$  (12)
$\ell_s = \|G(F_{cs}) - G(F_s)\|_2^2$  (13)

here, $F_c$ is the content feature, $F_s$ is the style feature, $F_{cs}$ is the styled feature, and $G(\cdot)$ computes the Gram matrix. As shown in table 1, our method not only performs well in terms of content loss, but is also on par with WCT (Li et al., 2017a) and Avatar-Net (Sheng et al., 2018) in terms of style loss. This supports our intuition that by aligning the style features to the content features, we not only preserve the content structure but also effectively transfer the style patterns.
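For reference, a NumPy sketch of how these two metrics can be computed from VGG features of the styled, content and style images; this is our own illustrative implementation of eqs: (12)-(13), and the Gram-matrix normalization is a common convention that may differ from the paper's.

```python
import numpy as np

def gram(f):
    """Gram matrix of a feature map f of shape (H, W, C)."""
    m = f.reshape(-1, f.shape[-1])        # (H*W, C)
    return m.T @ m / m.shape[0]           # normalized by the number of spatial positions

def content_loss(f_cs, f_c):
    """Eq. (12): squared error between styled and content features."""
    return np.sum((f_cs - f_c) ** 2)

def style_loss(f_cs, f_s):
    """Eq. (13): squared error between Gram matrices of styled and style features."""
    return np.sum((gram(f_cs) - gram(f_s)) ** 2)
```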

5.5 User control

Column labels: content image, style image 1, style image 2.
Figure 7: Interpolation between styles. The interpolation weight $w$ is varied in equal increments from left to right.
End labels: style, content.
Figure 8: Trade-off between content and style during style transfer. The trade-off parameter $\alpha$ is varied in equal increments from left to right.
Figure 9: Spatial control in style transfer. The second and third columns in the first row are binary masks; the corresponding styles are shown in the second row.

Like other arbitrary style transfer methods, our approach is flexible enough to accommodate different user controls, such as the trade-off between style and content, style interpolation, and spatial control during style transfer.

Since our method applies the transformation in feature space, independently of the network, we can achieve a trade-off between style and content as follows:

$F = \alpha\, F_{cs} + (1 - \alpha)\, F_c$

Here, $F_{cs}$ is the transformed feature from eq: (10), $F_c$ is the content feature and $\alpha \in [0, 1]$ is the trade-off parameter. Figure: 8 shows one such example of the content-style trade-off.

Figure: 7 shows an instance of linear interpolation between two styles created by the proposed approach. This is done by adjusting a weight parameter ($w$) between the transformation outputs ($F_{cs_1}$ and $F_{cs_2}$) obtained for the two styles:

$F = w\, F_{cs_1} + (1 - w)\, F_{cs_2}$
Spatial control is needed to apply different styles to different parts of the content image. A set of masks is additionally required to control the regions of correspondence between style and content. By replacing the content feature $F_c$ in section 4 with $M \otimes F_c$, where $\otimes$ is a simple mask-out operation with a binary mask $M$, we can stylize the specified region only, as shown in figure: 9.
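A minimal sketch of the three controls in feature space, reusing the `moment_match`/`rigid_align` sketches above. The blending forms follow the equations in this subsection; the function names, and the way the masked and unmasked regions are recombined in `spatial_control`, are our own illustrative choices.

```python
def stylize_feat(f_c, f_s):
    """Moment matching followed by rigid alignment (section 4)."""
    return rigid_align(moment_match(f_c, f_s), f_s)

def content_style_tradeoff(f_c, f_s, alpha):
    """Blend the styled feature with the content feature (alpha in [0, 1])."""
    return alpha * stylize_feat(f_c, f_s) + (1 - alpha) * f_c

def style_interpolation(f_c, f_s1, f_s2, w):
    """Linearly interpolate between two styles with weight w in [0, 1]."""
    return w * stylize_feat(f_c, f_s1) + (1 - w) * stylize_feat(f_c, f_s2)

def spatial_control(f_c, f_s, mask):
    """Stylize only the masked region; mask has shape (H, W), values in {0, 1}."""
    m = mask[..., None]                   # broadcast the mask over the channel axis
    return m * stylize_feat(m * f_c, f_s) + (1 - m) * f_c
```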

6 Conclusion

In this work, we propose an effective and efficient arbitrary style transfer approach that does not require learning for every individual style. By applying rigid alignment to the style features with respect to the content features, we solve the problem of content distortion without sacrificing style patterns in the styled image. Our method seamlessly fits the existing multi-layer stylization pipeline and captures style information from those layers too. It can also perform video stylization, merely by per-frame style transfer. Experimental results demonstrate that the proposed algorithm achieves favorable performance against state-of-the-art methods in arbitrary style transfer. As a further direction, one may replace the multiple auto-encoders used for multi-level style transfer by training an hourglass architecture similar to Avatar-Net for better efficiency.

Appendix A More styled Results

Columns (all three figures): content, style, Gatys (Gatys et al., 2016a), AdaIN (Huang and Belongie, 2017), WCT (Li et al., 2017a), Avatar-Net (Sheng et al., 2018), Ours.

References

  • Abadi et al. (2016) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pages 265–283, Savannah, GA, Oct. 6–8 2016.
  • Borg and Groenen (1997) I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Application. Springer Series in Statistics. Springer-Verlag, Berlin, 1997.
  • Champandard (2016) A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv:1603.01768 [cs.CV], Mar. 16 2016.
  • Chen et al. (2017) D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In Proc. of the 2017 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, July 21–26 2017.
  • Chen and Schmidt (2016) T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv:1612.04337, Dec. 16 2016.
  • Dosovitskiy and Brox (2016) A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 29, pages 658–666. MIT Press, Cambridge, MA, 2016.
  • Dumoulin et al. (2017) V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In Proc. of the 5th Int. Conf. Learning Representations (ICLR 2017), Toulon, France, Apr. 24–26 2017.
  • Efros and Freeman (2001) A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In L. Pocock, editor, Proc. of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001), pages 341–346, Los Angeles, CA, Aug. 12–17 2001.
  • Efros and Leung (1999) A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In J. K. Tsotsos, A. Blake, Y. Ohta, and S. W. Zucker, editors, Proc. 7th Int. Conf. Computer Vision (ICCV’99), pages 1033–1038, Kerkyra, Corfu, Greece, Sept. 20–27 1999.
  • Frigo et al. (2016) O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In Proc. of the 2016 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, June 26 – July 1 2016.
  • Gatys et al. (2016a) L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proc. of the 2016 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’16), pages 2414–2423, Las Vegas, NV, June 26 – July 1 2016a.
  • Gatys et al. (2016b) L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. arXiv:1611.07865, Nov. 16 2016b.
  • Gupta et al. (2017) A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei. Characterizing and improving stability in neural style transfer. In Proc. 17th Int. Conf. Computer Vision (ICCV’17), Dec. 11–18 2017.
  • Huang and Belongie (2017) X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. 17th Int. Conf. Computer Vision (ICCV’17), Dec. 11–18 2017.
  • Jing et al. (2017) Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A Review. arXiv:1705.04058, May 17 2017.
  • Johnson et al. (2016) J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Proc. 14th European Conf. Computer Vision (ECCV’16), pages 694–711, Amsterdam, The Netherlands, Oct. 11–14 2016.
  • Kyprianidis et al. (2012) J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg. State of the “Art”: A taxonomy of artistic stylization techniques for images and video. IEEE Transactions on Visualization and Computer Graphics, 19(5):866–885, July 2012.
  • Li and Wand (2016a) C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proc. of the 2016 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, June 26 – July 1 2016a.
  • Li and Wand (2016b) C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Proc. 14th European Conf. Computer Vision (ECCV’16), Amsterdam, The Netherlands, Oct. 11–14 2016b.
  • Li et al. (2017a) Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 30, pages 386–396. MIT Press, Cambridge, MA, 2017a.
  • Li et al. (2017b) Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proc. of the 2017 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, July 21–26 2017b.
  • Li et al. (2017c) Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. In Proc. of the 26th Int. Joint Conf. Artificial Intelligence (IJCAI’17), pages 2230–2236, Melbourne, Australia, Aug. 19–25 2017c.
  • Lin et al. (2014) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proc. 13th European Conf. Computer Vision (ECCV’14), pages 740–755, Zürich, Switzerland, Sept. 6–12 2014.
  • Myronenko et al. (2007) A. Myronenko, X. Song, and M. Á. Carreira-Perpiñán. Non-rigid point set registration: Coherent point drift. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems (NIPS), volume 19, pages 1009–1016. MIT Press, Cambridge, MA, 2007.
  • Risser et al. (2017) E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv:1701.08893, Jan. 17 2017.
  • Shen et al. (2018) F. Shen, S. Yan, and G. Zeng. Neural style transfer via meta networks. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, June 18–22 2018.
  • Sheng et al. (2018) L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-Net: Multi-scale zero-shot style transfer by feature decoration. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, June 18–22 2018.
  • Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. of the 3rd Int. Conf. Learning Representations (ICLR 2015), San Diego, CA, May 7–9 2015.
  • Ulyanov et al. (2016a) D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In M.-F. Balcan and K. Q. Weinberger, editors, Proc. of the 33rd Int. Conf. Machine Learning (ICML 2016), pages 1349–1357, New York, NY, June 19–24 2016a.
  • Ulyanov et al. (2016b) D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, July 16 2016b.
  • Wang et al. (2017) H. Wang, X. Liang, H. Zhang, D.-Y. Yeung, and E. P. Xing. ZM-net: Real-time zero-shot image manipulation network. arXiv:1703.07255, Mar. 17 2017.
  • Zhang and Dana (2017) H. Zhang and K. Dana. Multi-style generative network for real-time transfer. arXiv:1703.06953, Mar. 20 2017.