Exploring the Neural Algorithm of Artistic Style

02/23/2016 ∙ by Yaroslav Nikulin, et al. ∙ 0

We explore the method of style transfer presented in the article "A Neural Algorithm of Artistic Style" by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge (arXiv:1508.06576). We first demonstrate the power of the suggested style space on a few examples. We then vary different hyper-parameters and program properties that were not discussed in the original paper, among which are the recognition network used, the starting point of the gradient descent and different ways to partition style and content layers. We also give a brief comparison of some of the existing algorithm implementations and deep learning frameworks used. To study the style space further we attempt to generate synthetic images by maximizing a single entry in one of the Gram matrices $G^l$, and some interesting results are observed. Next, we try to mimic the sparsity and intensity distribution of Gram matrices obtained from a real painting and generate more complex textures. Finally, we propose two new style representations built on top of the network's features and discuss how one could be used to achieve local and potentially content-aware style transfer.




1 Introduction

The key idea behind style transfer is to perform a gradient descent from random noise, minimizing the deviation from the content (i.e. feature responses in the upper convolutional layers of a recognition network) of the target image and the deviation from the style representation of the style image. The latter is defined as a set of Gram correlation matrices $G^l$ with the entries $G^l_{ij} = \langle F^l_i, F^l_j \rangle$, where $l$ is the network layer, $i$ and $j$ are two filters of the layer $l$ and $F^l_k$ is the array (indexed by spatial coordinates) of neuron responses of filter $k$ in layer $l$. In other words, the value $G^l_{ij}$ says how often features $i$ and $j$ in the layer $l$ appear together. Please refer to [Gatys15] for details.
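As a minimal sketch of the style representation described above, the Gram matrix of one layer can be computed by flattening each filter response over its spatial coordinates and taking all pairwise inner products. The function name `gram_matrix` and the `(n_filters, height, width)` array layout are assumptions made for illustration; no normalization is applied here.

```python
import numpy as np


def gram_matrix(features):
    """Pairwise inner products of filter responses in one layer.

    features: array of shape (n_filters, height, width) holding the
    responses F_k of every filter k (layout assumed for this sketch).
    Returns an (n_filters, n_filters) matrix with entries <F_i, F_j>.
    """
    n_filters = features.shape[0]
    # Flatten the spatial coordinates so that each row is one filter response.
    flat = features.reshape(n_filters, -1)
    return flat @ flat.T
```

Because the spatial coordinates are summed out, the result does not depend on where in the image two features co-occur, only on how strongly they do.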

In figure 1 we demonstrate the spectacular performance of the algorithm by running iterations of the L-BFGS [lbfgs] optimization algorithm on a photo of a cat.

In section 8 we consider cases where the algorithm doesn’t work and suggest possible solutions in section 9.

2 Frameworks and Implementations

We have tried running implementations on Caffe [Caffe] (using CUDA and OpenCL [CaffeOpenCL] backends) by Frank Liu [fliu] and on Torch [Torch] by Kai Sheng Tai [tai] and by Justin Johnson [jc].

We have observed the OpenCL Caffe backend to be promising but still unstable compared to CUDA. Both Torch implementations turned out to be significantly faster than the Caffe one.

We have thus built our work on top of the Torch implementation by Kai Sheng Tai. Interestingly, the Torch cunn [cunn] backend performed slightly better than cuDNN [cudnn] by NVIDIA.

3 Networks

In [Gatys15] the VGG-19 [VGG] recognition network is used to produce results. We compare the impact of using other networks (AlexNet [AlexNet], GoogLeNet [GoogLeNet], VGG-16 and VGG-19) in figure 2, with AlexNet performing similarly to GoogLeNet and VGG-16 similarly to VGG-19.

VGG networks perform much better at style transfer due to their architecture. For example, AlexNet and GoogLeNet strongly compress the input at the first convolutional layer using large kernels and strides ($11 \times 11$ with stride 4 and $7 \times 7$ with stride 2 respectively) and thus a lot of fine detail is lost. VGG networks use $3 \times 3$ kernels with stride 1 at all convolutional layers and thus capture much more information.
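To make the compression at the first layer concrete, the sketch below applies the standard convolution output-size formula. The kernel sizes and strides are the usual published values for these architectures; the input sizes (227 for AlexNet, 224 otherwise) and paddings are assumptions based on commonly used model definitions, not values taken from this text.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


# First convolutional layer of each network (assumed typical configurations).
first_layers = {
    "AlexNet": dict(input_size=227, kernel=11, stride=4, padding=0),
    "GoogLeNet": dict(input_size=224, kernel=7, stride=2, padding=3),
    "VGG": dict(input_size=224, kernel=3, stride=1, padding=1),
}

for name, cfg in first_layers.items():
    print(name, cfg["input_size"], "->", conv_output_size(**cfg))
```

Under these assumptions AlexNet and GoogLeNet shrink each spatial dimension by a factor of about 4 and 2 respectively in a single step, while VGG keeps the full resolution.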

We have therefore used the VGG-19 network for all our experiments.

4 Initialization

In [Gatys15] gradient descent is always performed from white noise. We try and compare two other initialization points: content and style. We demonstrate the impact of the initialization strategy in figure 3, starting the gradient descent from the content photo, the style artwork or from white noise.

The results highlight the highly non-convex nature of our objective and show that the starting point has a tremendous influence on the basin of attraction that the gradient descent converges to. We naturally observe that starting from the content lets us preserve most of it, while starting from the style image is prone to “leaking” the content of the artwork into the final result. This also serves to reinforce the observation made in [Gatys15] that style and content are not strictly separable.

We find that for most practical applications starting from the content image produces the best result. This corresponds well to the way most artworks are produced – starting from the artist observing the content and drawing a rough sketch (i.e. content reconstruction comes first), and only then applying paint (i.e. style) on top.

Note that noise initialization is still very useful for testing, benchmarking and hyper-parameter tuning. For example, starting from content and observing no change one might wonder whether the content weight is too high or the learning rate is too small. Such questions do not occur when starting from noise.
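The three starting points compared in this section could be selected as in the following sketch. The function name `initial_image`, the `mode` strings, and the choice of uniform noise in [0, 1) are assumptions for illustration; the style image is assumed to have already been resized to the content image's shape.

```python
import numpy as np


def initial_image(content, style, mode, seed=0):
    """Return the starting point of the gradient descent.

    content, style: float arrays of identical shape (assumed pre-resized).
    mode: "content", "style" or "noise" (hypothetical option names).
    """
    if mode == "content":
        return content.copy()
    if mode == "style":
        if style.shape != content.shape:
            raise ValueError("style must be resized to the content shape")
        return style.copy()
    if mode == "noise":
        rng = np.random.default_rng(seed)
        # Uniform white noise; the exact noise distribution is an assumption.
        return rng.uniform(0.0, 1.0, size=content.shape)
    raise ValueError(f"unknown initialization mode: {mode!r}")
```

Returning a copy keeps the optimizer from overwriting the source images, which are still needed to compute the content and style targets.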

5 Partial Style Transfer

In [Gatys15] the impact of shrinking the style layer pool (from using the first 5 convolutional layers down to using only the first convolutional layer) is demonstrated. The authors observe how the scale and complexity of the style features go down as lower and lower convolutional layers are used to represent style. In general, one doesn’t benefit from using only the lower-layer style features (apart from cheaper computation) and is better off simply reducing the style weight in the optimization objective.

In our work we consider keeping the upper convolutional layers and removing the bottom ones, while enforcing them as part of the content pool.

For example, instead of using a single layer as the content layer and the full set of style layers, we could move the bottom style layers into the content pool and keep only the upper ones as style layers (see figure 4). Notice how the partial style transfer managed to reshape the content and make it more rectangular (according to the style image) while preserving the original colors (which are mostly captured by the features in the bottom layers).

We thus conclude that relaxing the bottom layer style constraints and enforcing them instead as content constraints allows us to retain the colors and low-level features of the content photo and only transfer mid- to high-level properties of the style, combining them into a visually appealing result.
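The repartitioning of layers described in this section can be sketched as a small helper that moves the lowest style layers into the content pool. The layer names below are the VGG-19 defaults reported in [Gatys15]; the helper name `partial_style_partition` and the split point used in the example are hypothetical and are not claimed to match the configuration used for figure 4.

```python
# Default VGG-19 layer assignment from [Gatys15].
GATYS_STYLE_LAYERS = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
GATYS_CONTENT_LAYERS = ["conv4_2"]


def partial_style_partition(style_layers, content_layers, n_bottom):
    """Move the n_bottom lowest style layers into the content pool.

    style_layers is assumed to be ordered from bottom to top.
    Returns (content_layers, style_layers) for partial style transfer.
    """
    if not 0 <= n_bottom <= len(style_layers):
        raise ValueError("n_bottom must be between 0 and len(style_layers)")
    moved = list(style_layers[:n_bottom])
    content = moved + [name for name in content_layers if name not in moved]
    style = list(style_layers[n_bottom:])
    return content, style


# Hypothetical split: the two bottom layers become content constraints.
content, style = partial_style_partition(
    GATYS_STYLE_LAYERS, GATYS_CONTENT_LAYERS, n_bottom=2
)
print(content)  # ['conv1_1', 'conv2_1', 'conv4_2']
print(style)    # ['conv3_1', 'conv4_1', 'conv5_1']
```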

6 Generating Styles

In order to better understand the style space constructed in [Gatys15] we draw inspiration from the Deep Visualization technique [DD]. We disregard all the Gram matrices except for one with a single non-zero entry and descend from white noise without content constraints. The motivation is to understand what parts of style a single element can describe and verify if the Gram matrices present a basis in the style space qualitatively similar to the basis of VGG features in the space of natural images. By varying the non-zero element and its magnitude we can generate some curious textures with complexity increasing from the first to the fourth layer (see figure 5).
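A target of the kind used in this experiment, a Gram matrix with a single non-zero entry, could be built as in the sketch below. The names `single_entry_gram` and `gram_loss` are hypothetical, and the unnormalized squared error is an assumption; the actual optimization would still require back-propagating this loss through the network to the image.

```python
import numpy as np


def single_entry_gram(n_filters, i, j, value):
    """Target Gram matrix that is zero except for the entry (i, j)."""
    target = np.zeros((n_filters, n_filters))
    target[i, j] = value
    return target


def gram_loss(gram, target):
    """Unnormalized squared error between a Gram matrix and its target."""
    return float(np.sum((gram - target) ** 2))
```

Varying `i`, `j` and `value` corresponds to varying the non-zero element and its magnitude as described above.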

Next we attempt to generate some more sophisticated styles. For this purpose we consider the Gram matrices of a single painting and visualize their histograms. We observe that the sparsity and amplitude of the Gram matrix elements increase from the first to the fifth layer (of course, this doesn’t necessarily generalize, yet it seems to be in accordance with the CNN paradigm where more complex and more discriminative features are constructed from simpler ones). Although it is impossible to judge the distribution of Gram matrices describing styles from such a limited sample, we can try to mimic the sparsity and amplitude of the Gram matrices representing “Starry Night” by van Gogh [gogh]. We generate a random matrix using absolute values of a Gaussian distribution and apply a random sparse zero-one mask to it. We vary only two parameters: the variance of the Gaussian distribution and the sparsity of the mask. Interestingly enough, with such a simple construction, we were able to generate some intriguing textures (see figures 6 and 7), suggesting that style density estimation and generation might be a promising research direction.
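The random target construction described above (absolute values of Gaussian samples multiplied by a random sparse zero-one mask) might look like the following sketch. The function name and parameter names are hypothetical: `sigma` is the standard deviation (the square root of the variance mentioned in the text) and `density` is the expected fraction of non-zero entries, so a sparser mask means a lower `density`. Symmetry of the matrix is not enforced, since the text does not mention it.

```python
import numpy as np


def random_sparse_gram(n_filters, sigma, density, seed=0):
    """Random non-negative target matrix with a sparse zero-one mask.

    sigma: standard deviation of the Gaussian whose absolute values are used.
    density: probability that an entry is kept (1 - sparsity).
    """
    rng = np.random.default_rng(seed)
    magnitudes = np.abs(rng.normal(0.0, sigma, size=(n_filters, n_filters)))
    mask = rng.random((n_filters, n_filters)) < density
    return magnitudes * mask
```

Calling the function with a fixed `seed` gives a target that stays the same for the whole optimization, while drawing a new seed at every iteration corresponds to regenerating the matrix at each gradient descent step.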

7 Spatial Style Transfer

In [Gatys15] each style layer $l$ with $N_l$ features introduces a Gram matrix $G^l$ with feature correlation entries $G^l_{ij}$. This makes the style completely invariant to the spatial configuration of the style image (which is by design).

As an experiment, we suggest a new style representation designed to capture less of the artistic details and more of the spatial configuration of an image with roughly the same computational and storage complexity.

We construct $N_l$ matrices $S^l_i$ of size $(2h_l - 1) \times (2w_l - 1)$ (where $h_l$ and $w_l$ are the spatial dimensions of the layer $l$) with entries

$S^l_i = F^l_i \star F^l_i$,

where $\star$ stands for full 2D convolution. We thus impose a soft constraint on the feature distributions across the spatial coordinates (note that in principle we could store all pairwise convolutions of $F^l_i$ and $F^l_j$, but such an experiment would be computationally infeasible).

In order for this objective to work we also need to rescale both content and style images to the same dimensions and modify the error derivative in [Gatys15] accordingly:

$\partial E_l / \partial F^l_i \propto (S^l_i - \hat{S}^l_i) \otimes F^l_i$,

where $\hat{S}^l_i$ is the corresponding target computed from the style image and $\otimes$ stands for valid correlation.

We demonstrate this new approach on classic style transfer and on style reconstruction from noise (without content constraints) in figure 8.

Notice how differently the proposed algorithm scales: it is easy to optimize on top layers (high number of filters $N_l$, small spatial dimensions $h_l$ and $w_l$) but is expensive on bottom layers (low $N_l$, huge $h_l$ and $w_l$). This is contrary to the algorithm in [Gatys15], where the computation cost mostly depends on $N_l$ and thus grows from bottom to top layers.
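The following is a minimal sketch of a spatial representation of this kind, under the assumption that it is the full 2D convolution of each filter response with itself, and that the loss is an unnormalized squared error against a target of the same size. All function names are hypothetical. With these assumptions, the gradient with respect to a filter response is four times the valid correlation of the residual with that response.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def full_autoconvolution(f):
    """Full 2D convolution of an (h, w) array with itself: (2h-1, 2w-1)."""
    h, w = f.shape
    shape = (2 * h - 1, 2 * w - 1)
    # Zero-padded FFT of this size gives the linear (not circular) convolution.
    spectrum = np.fft.rfft2(f, s=shape)
    return np.fft.irfft2(spectrum * spectrum, s=shape)


def spatial_style(features):
    """features: (n_filters, h, w) -> (n_filters, 2h-1, 2w-1)."""
    return np.stack([full_autoconvolution(f) for f in features])


def valid_correlation(big, small):
    """Valid 2D cross-correlation of `big` with the smaller array `small`."""
    windows = sliding_window_view(big, small.shape)
    return np.einsum("abij,ij->ab", windows, small)


def spatial_loss_gradient(f, target):
    """Gradient of sum((f * f - target)**2) with respect to f (assumed loss)."""
    residual = full_autoconvolution(f) - target
    return 4.0 * valid_correlation(residual, f)


# Storage per layer: N^2 Gram entries versus N * (2h-1) * (2w-1) spatial ones.
def representation_sizes(n_filters, h, w):
    return n_filters ** 2, n_filters * (2 * h - 1) * (2 * w - 1)
```

For example, `representation_sizes(512, 14, 14)` gives sizes of the same order of magnitude, while `representation_sizes(64, 224, 224)` makes the spatial representation several thousand times larger than the Gram matrix, which illustrates why the bottom layers are the expensive ones here.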

8 Illumination and Season Transfer

Not all artistic styles are transferred equally well by the algorithm in [Gatys15]. If the style is highly repetitive and homogeneous over the whole style image (see abstract art, textures and patterns), it can be transferred with remarkable quality. However, once it becomes more subtle and varied within the image (see Renaissance, Baroque, Realism etc.), style transfer falters. It fails to capture properties like the dramatic use of lighting, exaggerated facial features etc. These properties get averaged out, as the style representation is computed over the whole image.

The same problem (i.e. global style representation) stands in the way of season and illumination transfer, as these properties change different elements of the scene differently.

We present some relatively successful examples of season and illumination transfer on images that are well aligned and are fairly repetitive in figures 9, 10, 11, 12.

9 Towards Content-Aware Style Transfer

The examples presented in figures 9, 10, 11, 12 are far from perfect, but they are not hopeless either. Indeed, one can easily spot many regions where the texture happened to be transferred correctly. This indicates that in principle the style representation developed in [Gatys15] is capable of capturing photorealistic properties.

What remains is developing a content-aware style transfer, i.e. transferring style according to the content matches between the image to be repainted and the style image. Below we discuss some possible leads towards implementing such a task.

A direct approach could be to replace the global style values as defined in [Gatys15] with localized values, where we capture only feature correlations in a small region around a point and make the contributions decay as they get more distant from the point of interest.

We can then replace the global style loss

as defined in [Gatys15] with a global style-content covariation loss

where is the weighted content response of neurons of filter in layer reachable from

and the norm is that of a 3-dimensional tensor indexed with

and .

Of course such an objective dramatically increases the computational cost of our problem and, while it could be efficiently parallelized, it would probably still remain infeasible.

In our implementation we make some very rough assumptions to test it: we consider (content of images is aligned; we thus perform a locality-sensitive style transfer), (pixel-wise style) and (no distance decay). This leads to a simplified expression of

and, consequently,

We were only able to test this approach on small images (see figures 13 and 14). Note that due to very rigid constraints the style picture basically gets painted over the content image.

The next step within this approach would be to expand the style window and see whether a locality-sensitive style transfer is feasible. If it yields good results, the content-aware transfer could be investigated.

We believe that if implemented well, such an algorithm could tackle more exquisite artistic styles, season transfer, illumination transfer, super-resolution and possibly many other applications.

Figure 1: Transferring different styles [s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11] to a photo of a cat [cat] (top-left).
Figure 2: Style transfer using GoogLeNet (left) and VGG-19 (right).
Figure 3: Impact of the initialization point on the final result. Top row: the source photo (left), the style image (right, [opt]). Bottom row: results of 500 iterations starting from style (left), content (center) and noise (right) images.
Figure 4: Using the bottom layers as content and the upper layers as style layers to repaint a photo of a cat. Style image (top-left, [cubism]), full (top-right), low-weight (bottom-left) and partial style transfer (bottom-right). Notice that simply using a low style weight does not reproduce the same result.
Figure 5: Styles generated by targeting one Gram matrix having a single non-zero entry. Top to bottom: layers 1 to 4 of the target Gram matrix $G^l$. The position of the non-zero element is either fixed (left) or generated randomly at each gradient descent iteration (right).
Figure 6: Styles generated by targeting one random sparse Gram matrix $G^l$. Top to bottom: layers 1 to 4 of the target Gram matrix $G^l$. The matrix is either fixed (left) or generated at each gradient descent iteration (right).
Figure 7: Styles generated targeting multiple random (but fixed for the whole optimization procedure) sparse Gram matrices. Target matrices are at convolutional layers {1, 2, 4}, {1, 2, 3} (top row), {1, 4}, {1, 2, 5} (bottom row).
Figure 8: Comparing the conventional and the “spatial” style transfer. Top row: content (left) and style (right); middle row: conventional (left) and “spatial” (right) style transfer; bottom row: style reconstruction from noise with no content constraint using conventional (left) and “spatial” (right) representations. Notice how the spatial style transfer gently imprints the scene structure into the photo. Reconstruction from noise demonstrates how it arrives at a rough but not exact configuration of the style scene.
Figure 9: An attempt at season transfer. Top row: content (left) and style (right). Bottom row: reconstruction from content (left), style (center) and noise (right). Notice how the problem of global style is especially visible when descending from white noise, because random textures are generated everywhere and partially stay on the sky / ground, which is undesirable.
Figure 10: Another attempt at season transfer. Content (left), style (center) and the transfer result (right).
Figure 11: Yet another attempt at season transfer. Content (left), style (center) and the transfer result (right).
Figure 12: An attempt at illumination transfer. Content (left), style (center) and the transfer result (right).
Figure 13: Summer to winter transition using a very rough approximation of a locality-sensitive style transfer. Due to very rigid constraints the style gets basically painted over the content.
Figure 14: Winter to summer transition using a very rough approximation of a locality-sensitive style transfer. As earlier, due to very rigid constraints the style gets basically painted over the content.