A Neural Algorithm of Artistic Style

by Leon A. Gatys, et al.

In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.





The results presented in the main text were generated on the basis of the VGG-Network [22], a Convolutional Neural Network that rivals human performance on a common visual object recognition benchmark task [23] and was introduced and extensively described in [22]. We used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG-Network; we do not use any of the fully connected layers. The model is publicly available and can be explored in the caffe framework [24]. For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and yields slightly more appealing results, which is why the images shown were generated with average pooling.
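The pooling swap is a one-line change to the network definition. As a minimal NumPy sketch (the function name `pool2x2` and the toy input are hypothetical, not part of the original implementation), the two operations differ only in how each 2×2 block is reduced:

```python
import numpy as np

def pool2x2(x, mode='avg'):
    """2x2, stride-2 pooling over an (H, W) feature map.
    'avg' averages each block (used here for synthesis), 'max' takes
    the block maximum (VGG's default)."""
    H, W = x.shape
    # Group the map into non-overlapping 2x2 blocks, dropping odd remainders.
    blocks = x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    return blocks.mean(axis=(1, 3)) if mode == 'avg' else blocks.max(axis=(1, 3))

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [0.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0]])
```

Average pooling passes gradient to all four inputs of each block, whereas max pooling routes it only through the argmax, which is one intuition for the smoother gradient flow reported above.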

Generally each layer in the network defines a non-linear filter bank whose complexity increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$, where $M_l$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathcal{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i^{th}$ filter at position $j$ in layer $l$. To visualise the image information that is encoded at different layers of the hierarchy (Fig 1, content reconstructions) we perform gradient descent on a white noise image to find another image that matches the feature responses of the original image. So let $\vec{p}$ and $\vec{x}$ be the original image and the image that is generated, and $P^l$ and $F^l$ their respective feature representations in layer $l$. We then define the squared-error loss between the two feature representations

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\left(F^l_{ij} - P^l_{ij}\right)^2.$$

The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left(F^l - P^l\right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0, \end{cases}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions in Fig 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the original VGG-Network.
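The content loss and its gradient can be sketched in a few lines of NumPy, assuming the feature representations are plain $N_l \times M_l$ arrays (the function name and toy values are illustrative, not from the original code):

```python
import numpy as np

def content_loss_and_grad(F, P):
    """Squared-error content loss between generated-image features F and
    original-image features P (both N_l x M_l), plus its gradient w.r.t. F.
    The gradient is zeroed where F < 0, matching ReLU back-propagation."""
    diff = F - P
    loss = 0.5 * np.sum(diff ** 2)
    grad = np.where(F > 0, diff, 0.0)  # (F - P)_ij where F_ij > 0, else 0
    return loss, grad

# Tiny example with a hypothetical 2-filter, 3-position layer.
P = np.array([[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]])
F = np.array([[1.5, 1.0, 0.5], [-1.0, 1.0, 3.0]])
loss, grad = content_loss_and_grad(F, P)
```

In the full algorithm this gradient is fed back through the network to update the synthesised image $\vec{x}$, not the network weights.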

On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^l \in \mathcal{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}.$$

To generate a texture that matches the style of a given image (Fig 1, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated. So let $\vec{a}$ and $\vec{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of layer $l$ to the total loss is then

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2$$

and the total style loss is

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l,$$

where $w_l$ are weighting factors of the contribution of each layer to the total loss (see below for specific values of $w_l$ in our results). The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2}\left(\left(F^l\right)^{\mathrm{T}}\left(G^l - A^l\right)\right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0. \end{cases}$$

The gradients of $E_l$ with respect to the activations in lower layers of the network can be readily computed using standard error back-propagation. The five style reconstructions in Fig 1 were generated by matching the style representations on layers ‘conv1_1’ (a); ‘conv1_1’ and ‘conv2_1’ (b); ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c); ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d); and ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
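The Gram matrix and the per-layer style loss reduce to matrix products. A minimal NumPy sketch, again assuming $N_l \times M_l$ feature arrays (function names and the toy inputs are hypothetical):

```python
import numpy as np

def gram_matrix(F):
    """Gram matrix G_ij = sum_k F_ik F_jk of a layer's feature maps,
    where F has shape (N_l filters, M_l spatial positions)."""
    return F @ F.T

def style_layer_loss_and_grad(F, A):
    """Contribution E_l of one layer to the style loss and its gradient
    w.r.t. F; A is the Gram matrix of the style image's features."""
    N, M = F.shape
    G = gram_matrix(F)
    E = np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)
    # ((F^l)^T (G^l - A^l))_ji, transposed back to shape (N, M).
    grad = (F.T @ (G - A)).T / (N ** 2 * M ** 2)
    grad = np.where(F > 0, grad, 0.0)  # zero where activations are non-positive
    return E, grad

# Toy check: 2 filters, 2 positions, style target A = 0.
F = np.array([[1.0, 0.0], [0.0, 1.0]])
A = np.zeros((2, 2))
E, grad = style_layer_loss_and_grad(F, A)
```

Because the Gram matrix sums over spatial positions, this representation discards the spatial arrangement of the features and keeps only their correlation structure, which is what makes it a *style* rather than a *content* descriptor.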

To generate the images that mix the content of a photograph with the style of a painting (Fig 2) we jointly minimise the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. So let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function we minimise is

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x}),$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For the images shown in Fig 2 we matched the content representation on layer ‘conv4_2’ and the style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ ($w_l = 1/5$ in those layers, $w_l = 0$ in all other layers). The ratio $\alpha/\beta$ was either $1 \times 10^{-3}$ (Fig 2 B, C, D) or $1 \times 10^{-4}$ (Fig 2 E, F). Fig 3 shows results for different relative weightings of the content and style reconstruction loss (along the columns) and for matching the style representations only on layers ‘conv1_1’ (A); ‘conv1_1’ and ‘conv2_1’ (B); ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (C); ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D); and ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_l$ was always equal to one divided by the number of active layers with a non-zero loss-weight $w_l$.
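Putting the pieces together, the combined objective can be sketched as follows; this is a simplified NumPy illustration operating on precomputed feature dictionaries, not the authors' VGG-based implementation (`gram`, `total_loss`, and the random toy features are all hypothetical):

```python
import numpy as np

def gram(F):
    """Gram matrix of a layer's (N_l x M_l) feature maps."""
    return F @ F.T

def total_loss(F_gen, P_content, A_style, content_layer, style_layers,
               alpha=1e-3, beta=1.0):
    """alpha * L_content + beta * L_style, with each active style layer
    weighted by w_l = 1 / (number of active layers). F_gen maps layer
    names to generated-image features; A_style maps them to the style
    image's precomputed Gram matrices."""
    L_content = 0.5 * np.sum((F_gen[content_layer] - P_content) ** 2)
    w = 1.0 / len(style_layers)
    L_style = 0.0
    for l in style_layers:
        N, M = F_gen[l].shape
        G = gram(F_gen[l])
        L_style += w * np.sum((G - A_style[l]) ** 2) / (4.0 * N ** 2 * M ** 2)
    return alpha * L_content + beta * L_style

# Sanity check with tiny random features: identical content and style
# representations drive the loss to zero.
rng = np.random.default_rng(0)
layers = ['conv1_1', 'conv2_1']
F_gen = {l: rng.standard_normal((3, 4)) for l in layers + ['conv4_2']}
loss = total_loss(F_gen, F_gen['conv4_2'],
                  {l: gram(F_gen[l]) for l in layers}, 'conv4_2', layers)
```

In the actual algorithm the features are recomputed through the CNN on every gradient step, and the synthesised image is updated by back-propagating this scalar loss to the pixels.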


This work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center for Computational Neuroscience (FKZ 01GQ1002) and the German Excellence Initiative through the Centre for Integrative Neuroscience Tübingen (EXC 307) (M.B., A.S.E., L.A.G.).

References and Notes