A Neural Algorithm of Artistic Style

08/26/2015 ∙ by Leon A. Gatys, et al.

In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.

Code Repositories

neuralart

An implementation of the paper 'A Neural Algorithm of Artistic Style'.

style-transfer

An implementation of "A Neural Algorithm of Artistic Style" by L. Gatys, A. Ecker, and M. Bethge. http://arxiv.org/abs/1508.06576.

neural-art-tf

"A neural algorithm of Artistic style" in tensorflow

Methods

The results presented in the main text were generated on the basis of the VGG-Network [22], a Convolutional Neural Network that rivals human performance on a common visual object recognition benchmark task [23] and was introduced and extensively described in [22]. We used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG-Network; we do not use any of the fully connected layers. The model is publicly available and can be explored in the Caffe framework [24]. For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and yields slightly more appealing results, which is why the images shown were generated with average pooling.
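
As an illustration (not the authors' original Caffe setup), the pooling substitution can be sketched in PyTorch by loading a pretrained 19-layer VGG and swapping each max-pooling layer for average pooling; the use of torchvision here is an assumption of this sketch:

```python
import torch.nn as nn
from torchvision import models

# Load the pretrained 19-layer VGG and keep only its convolutional part
# (16 convolutional and 5 pooling layers); the fully connected layers are not used.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Replace every max-pooling layer by average pooling, as described above.
for i, layer in enumerate(vgg):
    if isinstance(layer, nn.MaxPool2d):
        vgg[i] = nn.AvgPool2d(kernel_size=layer.kernel_size,
                              stride=layer.stride,
                              padding=layer.padding)

# Freeze all weights: only the synthesised image will be optimised later.
for p in vgg.parameters():
    p.requires_grad_(False)
```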

Generally each layer in the network defines a non-linear filter bank whose complexity increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters has $N_l$ feature maps each of size $M_l$, where $M_l$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathcal{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i^{th}$ filter at position $j$ in layer $l$. To visualise the image information that is encoded at different layers of the hierarchy (Fig 1, content reconstructions) we perform gradient descent on a white noise image to find another image that matches the feature responses of the original image. So let $\vec{p}$ and $\vec{x}$ be the original image and the image that is generated, and $P^l$ and $F^l$ their respective feature representation in layer $l$. We then define the squared-error loss between the two feature representations

$$\mathcal{L}_{\text{content}}(\vec{p},\vec{x},l) = \frac{1}{2}\sum_{i,j}\left(F^l_{ij} - P^l_{ij}\right)^2 . \qquad (1)$$

The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial \mathcal{L}_{\text{content}}}{\partial F^l_{ij}} = \begin{cases} \left(F^l - P^l\right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0, \end{cases} \qquad (2)$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions in Fig 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the original VGG-Network.
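
As a minimal sketch of this procedure (not the authors' exact setup), the content reconstruction can be written with automatic differentiation, assuming the average-pooled `vgg` from the snippet above; the helper name `layer_activations`, the Adam optimiser, and the step count are illustrative choices introduced here:

```python
import torch

def layer_activations(model, img, layer_index):
    """Return the activations (the matrix F^l) of the layer at position
    layer_index when img is passed through the truncated network."""
    x = img
    for i, layer in enumerate(model):
        x = layer(x)
        if i == layer_index:
            return x
    raise IndexError("layer_index is beyond the last layer of the model")

def content_reconstruction(model, photo, layer_index, steps=500, lr=0.05):
    """Gradient descent on a white-noise image until its responses in one
    layer match those of the original image (Eqs. 1 and 2)."""
    with torch.no_grad():
        target = layer_activations(model, photo, layer_index)   # P^l
    x = torch.randn_like(photo).requires_grad_(True)            # white-noise start
    optimizer = torch.optim.Adam([x], lr=lr)                    # illustrative optimiser
    for _ in range(steps):
        optimizer.zero_grad()
        current = layer_activations(model, x, layer_index)      # F^l
        loss = 0.5 * ((current - target) ** 2).sum()            # squared-error loss, Eq. (1)
        loss.backward()    # back-propagates the gradient of Eq. (2) down to the image
        optimizer.step()
    return x.detach()
```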

On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^l \in \mathcal{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_{k} F^l_{ik} F^l_{jk} . \qquad (3)$$
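
In code, Eq. (3) is one matrix product of the vectorised feature maps with their own transpose; the sketch below assumes activations shaped (N_l, H, W), i.e. with any batch dimension already removed:

```python
def gram_matrix(activations):
    """Compute the Gram matrix G^l of Eq. (3) for one layer.

    activations: tensor of shape (N_l, H, W), i.e. N_l feature maps with
    M_l = H * W spatial positions each.
    """
    n_l = activations.shape[0]
    f = activations.reshape(n_l, -1)   # vectorised feature maps F^l, shape (N_l, M_l)
    return f @ f.t()                   # G^l_ij = sum_k F^l_ik F^l_jk
```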

To generate a texture that matches the style of a given image (Fig 1, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated. So let $\vec{a}$ and $\vec{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is then

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2 \qquad (4)$$

and the total loss is

$$\mathcal{L}_{\text{style}}(\vec{a},\vec{x}) = \sum_{l=0}^{L} w_l E_l , \qquad (5)$$

where $w_l$ are weighting factors of the contribution of each layer to the total loss (see below for specific values of $w_l$ in our results). The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( \left(F^l\right)^{\mathrm{T}} \left(G^l - A^l\right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0. \end{cases} \qquad (6)$$

The gradients of $E_l$ with respect to the activations in lower layers of the network can be readily computed using standard error back-propagation. The five style reconstructions in Fig 1 were generated by matching the style representations on layer ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), and ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
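
A sketch of the style loss of Eqs. (4) and (5), reusing the illustrative `layer_activations` and `gram_matrix` helpers from above (all names here belong to this write-up, not the paper):

```python
def style_loss(model, art, x, layer_indices, weights):
    """Weighted sum of per-layer Gram-matrix losses (Eqs. 4 and 5)."""
    total = 0.0
    for idx, w_l in zip(layer_indices, weights):
        f_x = layer_activations(model, x, idx)[0]     # drop batch dim -> (N_l, H, W)
        f_a = layer_activations(model, art, idx)[0]
        n_l, h, w = f_x.shape
        m_l = h * w
        g = gram_matrix(f_x)                          # G^l of the generated image
        a = gram_matrix(f_a)                          # A^l of the style image
        e_l = ((g - a) ** 2).sum() / (4 * n_l ** 2 * m_l ** 2)   # E_l, Eq. (4)
        total = total + w_l * e_l                     # L_style, Eq. (5)
    return total
```

Automatic differentiation then reproduces the analytic derivative of Eq. (6), including the zero gradient wherever the rectified activations are zero.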

To generate the images that mix the content of a photograph with the style of a painting (Fig 2) we jointly minimise the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. So let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function we minimise is

$$\mathcal{L}_{\text{total}}(\vec{p},\vec{a},\vec{x}) = \alpha \, \mathcal{L}_{\text{content}}(\vec{p},\vec{x}) + \beta \, \mathcal{L}_{\text{style}}(\vec{a},\vec{x}) , \qquad (7)$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For the images shown in Fig 2 we matched the content representation on layer ‘conv4_2’ and the style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ ($w_l = 1/5$ in those layers, $w_l = 0$ in all other layers). The ratio $\alpha/\beta$ was either $1 \times 10^{-3}$ (Fig 2 B,C,D) or $1 \times 10^{-4}$ (Fig 2 E,F). Fig 3 shows results for different relative weightings of the content and style reconstruction loss (along the columns) and for matching the style representations only on layer ‘conv1_1’ (A), ‘conv1_1’ and ‘conv2_1’ (B), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (C), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D), and ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_l$ was always equal to one divided by the number of active layers with a non-zero loss-weight $w_l$.
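
Putting the pieces together, a minimal style-transfer loop under the same assumptions as the sketches above (the helpers and the Adam optimiser are illustrative, and alpha, beta, steps and lr are placeholder values rather than the paper's settings) combines both terms as in Eq. (7):

```python
import torch

def style_transfer(model, photo, art, content_idx, style_indices,
                   alpha=1.0, beta=1e3, steps=500, lr=0.02):
    """Jointly minimise the content loss (Eq. 1) and the style loss (Eq. 5),
    weighted by alpha and beta as in Eq. (7)."""
    with torch.no_grad():
        content_target = layer_activations(model, photo, content_idx)   # P^l of the photograph
    w_l = [1.0 / len(style_indices)] * len(style_indices)   # one over the number of active layers
    x = torch.randn_like(photo).requires_grad_(True)        # start from white noise
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        current = layer_activations(model, x, content_idx)
        l_content = 0.5 * ((current - content_target) ** 2).sum()
        l_style = style_loss(model, art, x, style_indices, w_l)
        loss = alpha * l_content + beta * l_style            # Eq. (7)
        loss.backward()
        optimizer.step()
    return x.detach()
```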

Acknowledgments

This work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center for Computational Neuroscience (FKZ 01GQ1002) and the German Excellence Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307) (M.B., A.S.E., L.A.G.).

References and Notes