Multi-style Generative Network for Real-time Transfer

03/20/2017, by Hang Zhang et al.

Despite the rapid progress in style transfer, existing approaches using feed-forward generative networks for multi-style or arbitrary-style transfer usually compromise image quality and model flexibility. We find it is fundamentally difficult to achieve comprehensive style modeling using a 1-dimensional style embedding. Motivated by this, we introduce the CoMatch Layer, which learns to match the second-order feature statistics of the target styles. With the CoMatch Layer, we build a Multi-style Generative Network (MSG-Net), which achieves real-time performance. We also employ a specific strategy of upsampled convolution which avoids the checkerboard artifacts caused by fractionally-strided convolution. Our method achieves superior image quality compared to state-of-the-art approaches. The proposed MSG-Net, as a general approach for real-time style transfer, is compatible with most existing techniques, including content-style interpolation, color preserving, spatial control and brush stroke size control. MSG-Net is the first to achieve real-time brush-size control in a purely feed-forward manner for style transfer. Our implementations and pre-trained models for the Torch, PyTorch and MXNet frameworks will be publicly available.


1 Introduction

Style transfer can be approached as reconstructing or synthesizing texture based on the semantic content of a target image [28]. Many pioneering works achieved success in classic texture synthesis, starting with methods that resample pixels [11, 10, 43, 27] or match multi-scale feature statistics [7, 20, 35]. These methods employ traditional image pyramids obtained by handcrafted multi-scale linear filter banks [38, 1] and perform texture synthesis by matching the feature statistics to the target style. In recent years, the concepts of texture synthesis and style transfer have been revisited within the context of deep learning. Gatys et al. [13] show that the feature correlations (i.e. the Gram matrix) of convolutional neural networks (CNNs) successfully capture image styles. This framework has brought a surge of interest in texture synthesis and style transfer using iterative optimization [13, 15, 28] or feed-forward networks [41, 25, 29, 42]. Recent work extends style flexibility using feed-forward networks and achieves multi-style or arbitrary-style transfer [9, 21, 3, 5]. These approaches typically encode image styles into a 1-dimensional space, i.e. tuning the feature map mean and variance (bias and scale) for different styles. However, the comprehensive appearance of image style is fundamentally difficult to represent in a 1D embedding space. Figure 15 shows style transfer results using the optimization-based approach [15]; the Gram matrix representation produces more appealing image quality compared to the mean and variance of the CNN feature map.

Figure 10: Examples of transferred images and the corresponding styles using the proposed MSG-Net.
Figure 11: An overview of MSG-Net, the Multi-style Generative Network. The transformation network explicitly matches the feature statistics of the style targets captured by a Siamese network using the proposed CoMatch Layer (introduced in Section 3). A pre-trained loss network provides the supervision of MSG-Net learning by minimizing the content and style differences with the targets, as discussed in Section 4.2.

In addition to image quality, concerns about the flexibility of current feed-forward generative models have been raised by Jing et al. [24], who point out that no generative method can adjust the brush stroke size in real time. Feeding the generative network a high-resolution content image usually results in unsatisfying output, as shown in Figure 24. The generative network, as a fully convolutional network (FCN), can accept arbitrary input image sizes. Resizing the style image changes the relative brush size, so a multi-style generative network that matches the image style at run time should naturally enable brush-size control by changing the size of the input style image. What limits current generative models from being aware of the brush size? The 1D style embedding (feature map mean and variance) fundamentally limits the potential of exploring finer behavior for style representations. Therefore, a 2D method is desired for a finer representation of image styles.

Figure 15: Comparing 1D and 2D style representations using an optimization-based approach [15]. (a) Input image and style. (b) Style transfer result minimizing the difference of CNN feature map mean and variance. (c) Style transfer result minimizing the difference in Gram matrix representation.

As the first contribution of this paper, we introduce a CoMatch Layer, which embeds style with a 2D representation and learns to match the second-order feature statistics (Gram matrix) of the style targets inherently during training. The CoMatch Layer is differentiable and end-to-end learnable with existing generative network architectures without additional supervision. The proposed CoMatch Layer enables multi-style generation from a single feed-forward network.

The second contribution of this paper is building the Multi-style Generative Network (MSG-Net) with the proposed CoMatch Layer and a novel upsampled convolution. MSG-Net, as a feed-forward network, runs in real time after training. Generative networks typically have a decoder part that recovers image details from downsampled representations. Learning a fractionally-strided convolution [31] typically brings checkerboard artifacts. To improve the image quality, we employ a strategy we call upsampled convolution, which avoids the checkerboard artifacts by applying an integer-stride convolution and outputting an upsampled feature map (details in Section 4.1). In addition, we extend the Bottleneck architecture [18] to an Upsampling Residual Block, which reduces computational complexity without losing style versatility by preserving a larger number of channels. Passing identity all the way through the generative network enables the network to extend deeper and converge faster. The experimental results show that MSG-Net achieves superior image fidelity and test speed compared to previous work. We also study the scalability of the model by extending a 100-style MSG-Net to 1K styles using a larger model size and longer training time, and we observe no obvious quality differences. In addition, MSG-Net as a general multi-style strategy is compatible with most existing techniques and progress in style transfer, such as content-style trade-off and interpolation [9], spatial control, color preserving and brush-size control [14, 16].

To our knowledge, MSG-Net is the first to achieve real-time brush-size control in a purely feed-forward manner for multi-style transfer.

1.1 Related Work

Relation to Pyramid Matching.

Early methods for texture synthesis were developed using multi-scale image pyramids [20, 7, 43, 35]. The discovery in these earlier methods was that realistic texture images could be synthesized by manipulating a white-noise image so that its feature statistics matched those of the target at each pyramid level. Our approach is inspired by these classic methods in that it matches feature statistics, but it does so within a feed-forward network, leveraging the advantages of deep learning while placing the computational cost in the training process (feed-forward vs. optimization-based).

Relation to Fusion Layers.

Our proposed CoMatch Layer is a kind of fusion layer that takes two inputs (content and style representations). Current work on fusion layers in CNNs includes feature map concatenation and element-wise summation [23, 45, 12]. However, these approaches are not directly applicable, since they do not separate style from content. For style transfer, the generated images should carry neither the semantic information of the style target nor the style of the content image.

Relation to Generative Adversarial Training.

The Generative Adversarial Network (GAN) [17], which jointly trains an adversarial generator and discriminator, has catalyzed a surge of interest in the study of image generation [45, 23, 2, 36, 22, 40, 44]. Recent work on image-to-image GANs [23] adopts a conditional GAN to provide a general solution for image-to-image generation problems for which it was previously hard to define a loss function. However, the style transfer problem cannot be tackled using the conditional GAN framework, due to the lack of ground-truth image pairs. Instead, we follow prior work [41, 25] and adopt a discriminator/loss network that minimizes the perceptual difference of the synthesized images with the content and style targets and provides the supervision for learning the generative network. The initial idea of employing the Gram matrix to trigger style synthesis is inspired by recent work [2] that suggests using an encoder instead of a random vector in the GAN framework.

Recent Work in Multiple or Arbitrary Style Transfer.

Recent and concurrent work explores multiple or arbitrary style transfer [9, 5, 21]. A style-swap layer is proposed in [5], but it yields lower image quality and slower speed compared to existing feed-forward approaches. An adaptive instance normalization is introduced in [21] to match the mean and variance of the feature maps with the style target. Instead, our CoMatch Layer matches the second-order statistics (Gram matrices) of the feature maps. We also explore the scalability of our approach in the experiments (Section 5).

2 Content and Style Representation

CNNs pre-trained on a very large dataset such as ImageNet can be regarded as descriptive representations of image statistics containing both semantic content and style information. Gatys et al. [15] provide explicit representations that independently model image content and style from CNNs, which we briefly describe in this section for completeness.

The semantic content of the image can be represented as the activations of the descriptive network at the $i$-th scale, $\mathcal{F}^i(x) \in \mathbb{R}^{C_i \times H_i \times W_i}$, for a given input image $x$, where $C_i$, $H_i$ and $W_i$ are the number of feature map channels, the feature map height and the width. The texture or style of the image can be represented as the distribution of the features using the Gram matrix $\mathcal{G}\big(\mathcal{F}^i(x)\big) \in \mathbb{R}^{C_i \times C_i}$ given by

$$\mathcal{G}\big(\mathcal{F}^i(x)\big) = \sum_{h=1}^{H_i} \sum_{w=1}^{W_i} \mathcal{F}^i_{h,w}(x)\, \mathcal{F}^i_{h,w}(x)^{T} \qquad (1)$$

The Gram matrix is orderless and describes the feature distributions. For zero-centered data, the Gram matrix is the same as the covariance matrix scaled by the number of elements $N = H_i \times W_i$. It can be calculated efficiently by first reshaping the feature map, $\Phi\big(\mathcal{F}^i(x)\big) \in \mathbb{R}^{C_i \times (H_i W_i)}$, where $\Phi$ is a reshaping operation. Then the Gram matrix can be written as $\mathcal{G}\big(\mathcal{F}^i(x)\big) = \Phi\big(\mathcal{F}^i(x)\big)\, \Phi\big(\mathcal{F}^i(x)\big)^{T}$.
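To make the reshaping trick concrete, the following is a minimal PyTorch sketch of the Gram matrix computation in Equation 1; batch handling and the optional normalization by the number of elements are implementation choices of the sketch, not prescribed by the paper.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map via the reshaping operation Phi.
    `features` has shape (B, C, H, W); the result has shape (B, C, C)."""
    b, c, h, w = features.size()
    phi = features.view(b, c, h * w)            # Phi(F): shape (B, C, H*W)
    gram = torch.bmm(phi, phi.transpose(1, 2))  # Phi @ Phi^T
    return gram                                 # optionally divide by c*h*w for a normalized version
```

For example, `gram_matrix(torch.randn(1, 64, 32, 32))` returns a tensor of shape (1, 64, 64).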

Figure 16: Left: fractionally-strided convolution. Right: upsampled convolution, which reduces checkerboard artifacts by applying an integer-stride convolution and outputting an upsampled feature map.

3 CoMatch Layer

In this section, we introduce the CoMatch Layer, which explicitly matches second-order feature statistics based on the given styles. For a given content target $x_c$ and a style target $x_s$, the content and style representations at the $i$-th scale of the descriptive network can be written as $\mathcal{F}^i(x_c)$ and $\mathcal{G}\big(\mathcal{F}^i(x_s)\big)$, respectively. A direct solution $\hat{\mathcal{Y}}^i$ is desirable which preserves the semantic content of the input image and matches the target style feature statistics:

$$\hat{\mathcal{Y}}^i = \operatorname*{arg\,min}_{\mathcal{Y}^i} \Big\{ \big\| \mathcal{Y}^i - \mathcal{F}^i(x_c) \big\|_F^2 + \alpha \big\| \mathcal{G}(\mathcal{Y}^i) - \mathcal{G}\big(\mathcal{F}^i(x_s)\big) \big\|_F^2 \Big\} \qquad (2)$$

where $\alpha$ is a trade-off parameter that balances the contributions of the content and style targets.

Figure 17: We extend the original down-sampling residual architecture (left) to an up-sampling version (right). We use a 1×1 fractionally-strided convolution as a shortcut and adopt reflectance padding.

Figure 24: Comparing brush-size control. (a) High-resolution input image and dense styles. (b) Style transfer results using MSG-Net with brush-size control. (c) Standard generative network [25] without brush-size control. See also Figure 55.

The minimization of the above problem is solvable using an iterative approach, but it is infeasible to achieve in real time, and it does not yield a differentiable model. However, we can still approximate the solution and put the computational burden onto the training stage. We introduce an approximation which tunes the feature map based on the target style:

$$\hat{\mathcal{Y}}^i = \Phi^{-1}\Big[ \Phi\big(\mathcal{F}^i(x_c)\big)^{T}\, W\, \mathcal{G}\big(\mathcal{F}^i(x_s)\big) \Big]^{T} \qquad (3)$$

where $W \in \mathbb{R}^{C_i \times C_i}$ is a learnable weight matrix and $\Phi$ is a reshaping operation to match the dimension, so that $\Phi\big(\mathcal{F}^i(x_c)\big) \in \mathbb{R}^{C_i \times (H_i W_i)}$. For intuition on the functionality of $W$, suppose $W = \mathcal{G}\big(\mathcal{F}^i(x_s)\big)^{-1}$; then $\hat{\mathcal{Y}}^i = \mathcal{F}^i(x_c)$ and the first term in Equation 2 (the content term) is minimized. Now let $W = L_c^{-T} L_s^{T}\, \mathcal{G}\big(\mathcal{F}^i(x_s)\big)^{-1}$, where $L_c$ and $L_s$ are obtained by the Cholesky decompositions $\mathcal{G}\big(\mathcal{F}^i(x_c)\big) = L_c L_c^{T}$ and $\mathcal{G}\big(\mathcal{F}^i(x_s)\big) = L_s L_s^{T}$; then $\mathcal{G}(\hat{\mathcal{Y}}^i) = \mathcal{G}\big(\mathcal{F}^i(x_s)\big)$ and the second term of Equation 2 (the style term) is minimized. We let $W$ be learned directly from the loss function to dynamically balance the trade-off. The CoMatch Layer is differentiable, can be inserted into existing generative networks, and is learned directly from the loss function without any additional supervision.
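As an illustration of how Equation 3 can be realized inside a network, here is a hedged PyTorch sketch of a CoMatch-style layer; the class name, the near-identity initialization of $W$ and the `set_target` interface are assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class CoMatch(nn.Module):
    """Sketch of Equation 3: reshape the content feature map, multiply by a
    learnable matrix W and the target Gram matrix, then reshape back."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable C x C weight matrix W (initialization is a design choice).
        self.weight = nn.Parameter(torch.eye(channels).unsqueeze(0))
        self.gram_target = None  # G(F^i(x_s)), provided at run time

    def set_target(self, gram: torch.Tensor) -> None:
        """Store the style target's Gram matrix, shape (B, C, C)."""
        self.gram_target = gram

    def forward(self, content_feat: torch.Tensor) -> torch.Tensor:
        assert self.gram_target is not None, "call set_target first"
        b, c, h, w = content_feat.size()
        phi = content_feat.view(b, c, h * w)              # Phi(F^i(x_c))
        wg = torch.matmul(self.weight, self.gram_target)  # W G(F^i(x_s))
        out = torch.matmul(wg.transpose(1, 2), phi)       # = [Phi^T W G]^T
        return out.view(b, c, h, w)
```

In such a design, a Siamese encoder would compute the target Gram matrices once per style image and call `set_target` before the content image is forwarded.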

4 Multi-style Generative Network

4.1 Network Architecture

Prior feed-forward single-style transfer work learns a generator network that takes only the content image as input and outputs the transferred image, i.e. the generator network can be expressed as $G(x_c)$, which implicitly learns the feature statistics of the style image from the loss function. We introduce a Multi-style Generative Network which takes both the content and the style target as inputs, i.e. $G(x_c, x_s)$. The proposed network explicitly matches the feature statistics of the style targets at run time.

As part of the generator network, we adopt a Siamese network sharing weights with the encoder part of the transformation network; it captures the feature statistics of the style image $x_s$ at different scales and outputs the Gram matrices $\{\mathcal{G}(\mathcal{F}^i(x_s))\}$, $i = 1, \dots, K$, where $K$ is the total number of scales. Then the transformation network takes the content image and matches the feature statistics of the style image at multiple scales with CoMatch Layers.

Upsampled Convolution.

Standard CNNs for image-to-image tasks typically adopt an encoder-decoder framework, because it is efficient to put heavy operations (style switching) on smaller feature maps, and it is also important to keep a large receptive field for preserving semantic coherence. The decoder part learns a fractionally-strided convolution to recover detail from the downsampled feature maps. However, fractionally-strided convolution [31] typically introduces checkerboard artifacts [33]. Prior work suggests replacing the standard fractionally-strided convolution with upsampling followed by convolution [33]. However, this strategy decreases the receptive field, and it is inefficient to apply convolution on an upsampled area. Instead, we use an upsampled convolution, which has an integer stride and outputs upsampled feature maps. For an upsampling factor of 2, the upsampled convolution produces a 2×2 output for each convolutional window, as visualized in Figure 16. Compared to fractionally-strided convolution, this method has the same computational complexity and four times the parameters. This strategy successfully avoids upsampling artifacts in the network decoder.
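One way to obtain the behavior described above, sketched below under the assumption that it matches the description (the authors' exact layer may differ), is a stride-1 convolution that emits r² times the channels followed by a pixel shuffle, so each convolutional window yields an r×r output block.

```python
import torch
import torch.nn as nn

class UpsampledConv2d(nn.Module):
    """Integer-stride convolution producing an upsampled feature map: each
    convolutional window yields a scale x scale block of outputs (realized
    here with a channel-expanding convolution plus PixelShuffle)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, scale: int = 2):
        super().__init__()
        self.pad = nn.ReflectionPad2d(kernel_size // 2)  # reflectance padding
        self.conv = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size, stride=1)
        self.shuffle = nn.PixelShuffle(scale)            # -> (out_ch, scale*H, scale*W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(self.pad(x)))

# e.g. UpsampledConv2d(64, 32)(torch.randn(1, 64, 32, 32)) has shape (1, 32, 64, 64)
```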

Upsample Residual Block.

Deep residual learning has achieved great success in visual recognition [18, 19]. The residual block architecture plays an important role by reducing computational complexity without losing diversity, since it preserves a large number of feature map channels. We extend the original architecture to an upsampling version, shown in Figure 17 (right), which has a fractionally-strided convolution [31] as the shortcut and adopts reflectance padding to avoid artifacts in the generative process. This upsampling residual architecture allows us to pass identity all the way through the network, so that the network converges faster and can extend deeper.
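A hedged sketch of such an upsampling bottleneck block follows; the channel widths, normalization placement and the use of nearest-neighbor upsampling inside the residual branch are assumptions, while the 1×1 fractionally-strided shortcut and reflectance padding follow Figure 17.

```python
import torch
import torch.nn as nn

class UpBottleneck(nn.Module):
    """Upsampling residual (bottleneck) block: the residual branch upsamples
    by `stride`, and a 1x1 fractionally-strided convolution upsamples the
    identity/shortcut path so the two can be summed."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        mid = out_ch // 4                                 # bottleneck width (assumed)
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=stride, mode='nearest'),
            nn.ReflectionPad2d(1),                        # reflectance padding
            nn.Conv2d(mid, mid, kernel_size=3, stride=1),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
            nn.InstanceNorm2d(out_ch),
        )
        # 1x1 fractionally-strided convolution on the shortcut path.
        self.shortcut = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=1,
                                           stride=stride, output_padding=stride - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.residual(x) + self.shortcut(x)
```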

Figure 50: Content and style trade-off and interpolation.

Brush Stroke Size Control.

Feeding the generative model a high-resolution image usually results in unsatisfying style transfer outputs, as shown in Figure 24 (c). Controlling brush stroke size can be achieved using an optimization-based approach [16]. Resizing the style image changes the brush size, so a feed-forward generative model that matches the feature statistics at run time should naturally achieve brush stroke size control. However, prior work is mainly limited by the 1D style embedding, because this finer style behavior cannot be captured using merely the feature map mean and variance. With MSG-Net, the CoMatch Layer matching the second-order statistics elegantly enables brush-size control. During training, we train the network with different style image sizes so that it learns from different brush stroke sizes. After training, the brush stroke size becomes a user option set by changing the size of the style input image. Note that MSG-Net can accept different input sizes for the style and content images. Example results are shown in Figure 55.
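The run-time interface this implies can be sketched as follows; `msg_net` and its `set_target` method are hypothetical names standing in for a trained MSG-Net-style generator, not the released API.

```python
import torch
import torch.nn.functional as F

def stylize(msg_net, content: torch.Tensor, style: torch.Tensor,
            brush_size: int = 512) -> torch.Tensor:
    """Resize the style image before computing its Gram targets; different
    brush_size values give different stroke scales for the same content."""
    style = F.interpolate(style, size=(brush_size, brush_size),
                          mode='bilinear', align_corners=False)
    msg_net.set_target(style)        # Siamese encoder recomputes the Gram matrices
    with torch.no_grad():
        return msg_net(content)      # content resolution is independent of brush_size
```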

Figure 55: Brush-size control using MSG-Net. Top left: high-resolution input image and dense style. Others: style transfer results using MSG-Net with brush-size control.

Figure 86: Qualitative comparison. Columns: (a) input, (b) Dumoulin et al. [9], (c) MSG-Net (ours), (d) Gatys et al. [15], (e) Huang et al. [21], (f) Chen & Schmidt [5]. The trade-off between style flexibility and output-image quality is challenging for generative models. Our approach enables multi-style transfer and has minimal difference in quality compared to the optimization-based approach of Gatys et al. [15].

Other Details.

We only use in-network down-sampling (convolution) and up-sampling (upsampled convolution) in the transformation network. We use reflectance padding to avoid artifacts at the border. Instance normalization [42] and ReLU are used after the weight layers (convolution, fractionally-strided convolution and the CoMatch Layer), which improves the quality of the generated images and is robust to image contrast changes.

4.2 Network Learning

Style transfer is an open problem, since there is no gold-standard ground truth to follow. We follow previous work and minimize a weighted combination of the style and content differences between the generator network outputs and the targets for a given pre-trained loss network $\mathcal{F}$ [41, 25]. Let the generator network be denoted by $G(x_c, x_s)$, parameterized by weights $W_G$. Learning proceeds by sampling content images $x_c \sim X_c$ and style images $x_s \sim X_s$ and then adjusting the parameters $W_G$ of the generator in order to minimize the loss:

$$\hat{W}_G = \operatorname*{arg\,min}_{W_G} E_{x_c, x_s} \Big[ \lambda_c \big\| \mathcal{F}^c\big(G(x_c, x_s)\big) - \mathcal{F}^c(x_c) \big\|_F^2 + \lambda_s \sum_{i=1}^{K} \big\| \mathcal{G}\big(\mathcal{F}^i(G(x_c, x_s))\big) - \mathcal{G}\big(\mathcal{F}^i(x_s)\big) \big\|_F^2 + \lambda_{TV}\, \ell_{TV}\big(G(x_c, x_s)\big) \Big] \qquad (4)$$

where $\lambda_c$ and $\lambda_s$ are the balancing weights for the content and style losses. We consider image content at scale $c$ and image style at scales $i \in \{1, \dots, K\}$. $\ell_{TV}$ is the total variation regularization used in prior work to encourage smoothness of the generated images [25, 32, 46].
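For concreteness, a minimal PyTorch sketch of this objective follows; the `loss_net` interface (returning the K feature maps with the content scale at index 1), the normalized Gram matrices and the use of mean-squared rather than unnormalized Frobenius norms are assumptions of the sketch, and the balancing weights are left as arguments rather than the paper's settings.

```python
import torch

def gram(f: torch.Tensor) -> torch.Tensor:
    """Gram matrix as in Equation 1, normalized here for numerical stability."""
    b, c, h, w = f.size()
    phi = f.view(b, c, h * w)
    return torch.bmm(phi, phi.transpose(1, 2)) / (c * h * w)

def transfer_loss(loss_net, y, x_c, x_s, lambda_c, lambda_s, lambda_tv):
    """Weighted content + style + total-variation loss in the spirit of Equation 4."""
    f_y, f_c, f_s = loss_net(y), loss_net(x_c), loss_net(x_s)
    content = torch.mean((f_y[1] - f_c[1]) ** 2)              # content scale (assumed index 1)
    style = sum(torch.mean((gram(a) - gram(b)) ** 2)          # all K style scales
                for a, b in zip(f_y, f_s))
    tv = (torch.mean(torch.abs(y[:, :, :, 1:] - y[:, :, :, :-1])) +
          torch.mean(torch.abs(y[:, :, 1:, :] - y[:, :, :-1, :])))
    return lambda_c * content + lambda_s * style + lambda_tv * tv
```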

5 Experimental Results

5.1 Style Transfer

Baselines.

We use the implementation of the work of Gatys et al. [15] as a gold-standard baseline for style transfer (technical details are included in the supplementary material). We also compare our approach with state-of-the-art multi-style or arbitrary-style transfer methods, including a patch-based approach [5] and 1D style embeddings [9, 21]. The implementations from the original authors are used in these experiments.

Method Details.

We adopt the 16-layer VGG network [39] pre-trained on ImageNet as the loss network in Equation 4, because the network features learned from a diverse set of images are likely to be generic and informative. We consider the style representation at 4 different scales using the layers ReLU1_2, ReLU2_2, ReLU3_3 and ReLU4_3, and use the content representation at the layer ReLU2_2. The Microsoft COCO dataset [30] is used as the content image set $X_c$, which has around 80,000 natural images. We collect 100 style images, chosen from previous work in style transfer. Additionally, 900 real paintings are selected from the open-source artistic dataset wikiart.org [8] as extra style images for training MSG-Net-1K. We follow prior work [41, 25] and adopt Adam [26] to train the network. We use the loss function described in Equation 4 with balancing weights for the content, style and total-variation terms. We resize the content images and train the network with a batch size of 4 for 80,000 iterations. We iteratively update the style image during training, sampling its size from {256, 512, 768} for runtime brush-size control. After training, MSG-Net as a fully convolutional network [31] can accept arbitrary input image sizes. For comparing the style transfer approaches, we use the same content image size, resizing the image to 512 along the long side. Our implementations are based on Torch [6], PyTorch [34] and MXNet [4]. Training the MSG-Net-100 model takes roughly 8 hours on a Titan Xp GPU.

Model                 | Model-size | Speed (256) | Speed (512)
--------------------- | ---------- | ----------- | -----------
Gatys et al. [15]     | N/A        | 0.07        | 0.02
Johnson et al. [25]   | 6.7MB      | 91.7        | 26.3
Dumoulin et al. [9]   | 6.8MB      | 88.3        | 24.7
Chen et al. [5]       | 574MB      | 5.84        | 0.31
Huang et al. [21]     | 28.1MB     | 37.0        | 10.2
MSG-Net-100 (ours)    | 9.6MB      | 92.7        | 29.2
MSG-Net-1K (ours)     | 40.3MB     | 47.2        | 14.3

Table 1: Comparing model size on disk and inference/test speed in fps (frames/sec) for images of size 256×256 and 512×512 on an NVIDIA Titan Xp GPU, averaged over 50 samples. MSG-Net-100 and MSG-Net-1K have 2.3M and 8.9M parameters, respectively.

Model Size and Speed Analysis.

For mobile applications or cloud services, the model size and test speed are crucial. We compare the model size and inference/test speed of style transfer approaches in Table 1. Our proposed MSG-Net-100 has a model size and speed comparable to single-style networks [25, 41]. MSG-Net is faster than the arbitrary-style transfer work of [21] because it uses a compact learned encoder instead of a pre-trained VGG network.

Qualitative Comparison.

Our proposed MSG-Net achieves superior performance compared to state-of-the-art generative network approaches, as shown in Figure 86. One may argue that the arbitrary-style work has better scalability/capacity [21, 5]. Style flexibility and image quality are always a hard trade-off for generative models, and we particularly focus on image quality in this work. More examples of images transferred using MSG-Net are shown in Figure 154.

Figure 88: Color control using MSG-Net. Left: content and style images; right: color-preserved transfer result.

Model Scalability.

Prior work using 1D style embeddings has achieved success in the scalability of style transfer towards the goal of arbitrary style transfer [21]. To test the scalability of MSG-Net, we augment the style set to 1K images by adding 900 extra images from wikiart.org [8]. We also build a larger model, MSG-Net-1K, with greater capacity by increasing the width/channels of the model at the middle stage (64×64) by a factor of 2, resulting in 8.9M parameters. We also increase the training iterations by 4 times (320K) and follow the same training procedure as MSG-Net-100. We observe no quality degradation when increasing the number of styles (examples shown in the supplementary material).

Figure 93: Spatial control using MSG-Net. Left: input image; middle: foreground and background styles; right: style transfer result. (Input image and segmentation mask from Shen et al. [37].)

5.2 Runtime Manipulation

MSG-Net, as a general approach for real-time style transfer, is compatible with recent progress for both feed-forward and optimization methods, including but not limited to content-style trade-off and interpolation (Figure 50), color-preserving transfer (Figure 88), spatial manipulation (Figure 93) and brush stroke size control (Figures 24 and 55). For style interpolation, we use an affine interpolation of our style embedding, following prior work [9, 21]. For color preserving, we match the color of the style image with the content image following Gatys et al. [14]. Brush-size control has been discussed in Section 4.1. We use the segmentation mask provided by Shen et al. [37] for spatial control. The source code and technical details of runtime manipulation will be included in our PyTorch implementation.
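As an illustration, style interpolation with a 2D style embedding can be sketched as an affine blend of the target Gram matrices before they are passed to the CoMatch Layers; the helper below is hypothetical, not the released API.

```python
import torch

def blend_styles(grams_a, grams_b, alpha: float):
    """Affine interpolation of two style targets, one Gram matrix per scale.
    alpha = 1.0 returns style A, alpha = 0.0 returns style B."""
    return [alpha * ga + (1.0 - alpha) * gb for ga, gb in zip(grams_a, grams_b)]
```

Sweeping `alpha` from 0 to 1 then produces the gradual transitions shown in Figure 50.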

Figure 154: Diverse images that are generated using a single MSG-Net-100 (2.3M parameters). First row shows the input content images and the other rows are generated images with different style targets (first column).

6 Conclusion and Discussion

To improve the quality and flexibility of generative models in style transfer, we introduce a novel CoMatch Layer that learns to match second-order statistics as the image style representation. The Multi-style Generative Network achieves superior image quality compared to state-of-the-art approaches. In addition, the proposed MSG-Net is compatible with most existing techniques and recent progress in style transfer, including style interpolation, color preserving and spatial control. Moreover, MSG-Net is the first to enable real-time brush-size control in a fully feed-forward manner. The compact MSG-Net-100 model has only 2.3M parameters and runs at more than 90 fps (frames/sec) on an NVIDIA Titan Xp for an input image of size 256×256, and at 15 fps on a laptop GPU (GTX 750M-2GB).

Acknowledgment

This work was supported by National Science Foundation award IIS-1421134. A GPU used for this research was donated by the NVIDIA Corporation.

References