An implementation of the paper 'A Neural Algorithm of Artistic Style'.
In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.
An implementation of "A Neural Algorithm of Artistic Style" by L. Gatys, A. Ecker, and M. Bethge. http://arxiv.org/abs/1508.06576.
"A neural algorithm of Artistic style" in tensorflow
The results presented in the main text were generated on the basis of the VGG-Network, a Convolutional Neural Network that rivals human performance on a common visual object recognition benchmark task and was introduced and extensively described by Simonyan & Zisserman. We used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG-Network; we do not use any of the fully connected layers. The model is publicly available and can be explored in the caffe-framework. For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and yields slightly more appealing results, which is why the images shown were generated with average pooling.
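The pooling swap described above can be illustrated with a minimal numpy sketch. `pool2x2` is a hypothetical helper, not part of any framework: it pools a single 2-D feature map, whereas the real network pools each channel of each layer.

```python
import numpy as np

def pool2x2(x, mode="avg"):
    """2x2 pooling with stride 2 over a (H, W) feature map.

    Illustrative only: in the actual VGG-Network, pooling is applied
    per channel inside the network.
    """
    h, w = x.shape
    blocks = x[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 0.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [0.0, 0.0, 7.0, 8.0]])

out_max = pool2x2(fmap, "max")  # keeps only the strongest activation per block
out_avg = pool2x2(fmap, "avg")  # every input pixel contributes to the output
```

With max pooling, gradients flow back only through the single winning pixel of each 2x2 block; with average pooling they reach all four, which is one intuition for the smoother synthesis results.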
Generally each layer in the network defines a non-linear filter bank whose complexity increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters has $N_l$ feature maps each of size $M_l$, where $M_l$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathcal{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i^{th}$ filter at position $j$ in layer $l$. To visualise the image information that is encoded at different layers of the hierarchy (Fig 1, content reconstructions) we perform gradient descent on a white noise image to find another image that matches the feature responses of the original image. So let $\vec{p}$ and $\vec{x}$ be the original image and the image that is generated, and $P^l$ and $F^l$ their respective feature representations in layer $l$. We then define the squared-error loss between the two feature representations:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\left(F^l_{ij} - P^l_{ij}\right)^2.$$
The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left(F^l - P^l\right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0, \end{cases}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions in Fig 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the original VGG-Network.
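The content loss and its derivative can be sketched in a few lines of numpy. `content_loss_and_grad` is an illustrative helper operating on feature matrices of shape $(N_l, M_l)$; the zeroing of the gradient where the activation is non-positive mirrors the ReLU case split in the derivative above.

```python
import numpy as np

def content_loss_and_grad(F, P):
    """Squared-error content loss and its gradient w.r.t. F.

    F, P: (N_l, M_l) feature matrices of the generated image and the
    original image. The gradient is zeroed where F <= 0, matching the
    case split of the derivative for ReLU activations.
    """
    diff = F - P
    loss = 0.5 * np.sum(diff ** 2)
    grad = np.where(F > 0, diff, 0.0)
    return loss, grad

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 6))  # stand-in for generated-image features
P = rng.standard_normal((4, 6))  # stand-in for original-image features
loss, grad = content_loss_and_grad(F, P)
```

In a full implementation this gradient would be fed into back-propagation through the network to obtain the gradient with respect to the image pixels.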
On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^l \in \mathcal{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}.$$
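In matrix form the Gram matrix is simply $F^l (F^l)^{\mathrm{T}}$. A short numpy sketch also shows why it captures style: because the sum over $k$ runs over all spatial positions, shuffling the positions leaves the Gram matrix unchanged, so all spatial layout (i.e. content arrangement) is discarded.

```python
import numpy as np

def gram_matrix(F):
    """Gram matrix G = F F^T for a feature matrix F of shape (N_l, M_l).

    G[i, j] is the inner product of the vectorised feature maps i and j,
    so G is a symmetric (N_l, N_l) matrix of summary statistics.
    """
    return F @ F.T

# Spatially permuting the feature maps does not change the Gram matrix.
rng = np.random.default_rng(1)
F = rng.standard_normal((3, 8))
perm = rng.permutation(8)
G = gram_matrix(F)
G_shuffled = gram_matrix(F[:, perm])
```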
To generate a texture that matches the style of a given image (Fig 1, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated. So let $\vec{a}$ and $\vec{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is then

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j}\left(G^l_{ij} - A^l_{ij}\right)^2$$

and the total loss is

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l,$$
where $w_l$ are weighting factors of the contribution of each layer to the total loss (see below for specific values of $w_l$ in our results). The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2}\left(\left(F^l\right)^{\mathrm{T}}\left(G^l - A^l\right)\right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0. \end{cases}$$
The gradients of $E_l$ with respect to the activations in lower layers of the network can be readily computed using standard error back-propagation. The five style reconstructions in Fig 1 were generated by matching the style representations on layer ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
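The per-layer term $E_l$ and its weighted sum over layers can be sketched in numpy as follows; `style_loss` is an illustrative helper that takes lists of feature matrices (one per matched layer) for the generated image and the style image.

```python
import numpy as np

def style_loss(feats_x, feats_a, weights):
    """Weighted sum over layers of E_l.

    feats_x, feats_a: lists of (N_l, M_l) feature matrices for the
    generated image and the style image; weights: per-layer factors w_l.
    """
    total = 0.0
    for F, Fa, w in zip(feats_x, feats_a, weights):
        N, M = F.shape
        G = F @ F.T          # Gram matrix of the generated image
        A = Fa @ Fa.T        # Gram matrix of the style image
        E = np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)
        total += w * E
    return total

# Two toy "layers" with different filter counts, purely for illustration.
rng = np.random.default_rng(3)
feats_a = [rng.standard_normal((n, 10)) for n in (2, 4)]
feats_x = [f.copy() for f in feats_a]
weights = [0.5, 0.5]
```

When the generated features equal the style features the loss is exactly zero; any deviation in the correlation statistics makes it positive.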
To generate the images that mix the content of a photograph with the style of a painting (Fig 2) we jointly minimise the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. So let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function we minimise is

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x}),$$
where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For the images shown in Fig 2 we matched the content representation on layer ‘conv4_2’ and the style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ ($w_l = 1/5$ in those layers, $w_l = 0$ in all other layers). The ratio $\alpha/\beta$ was either $1 \times 10^{-3}$ (Fig 2 B,C,D) or $1 \times 10^{-4}$ (Fig 2 E,F). Fig 3 shows results for different relative weightings of the content and style reconstruction loss (along the columns) and for matching the style representations only on layer ‘conv1_1’ (A), ‘conv1_1’ and ‘conv2_1’ (B), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (C), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_l$ was always equal to one divided by the number of active layers with a non-zero loss-weight $w_l$.
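The joint minimisation can be demonstrated end to end on a toy problem where the "network" is simply the identity map on a feature matrix, so the gradients above are exact. The learning rate, $\alpha$, $\beta$ and the problem sizes here are illustrative choices, not the paper's settings; in the real system the feature maps come from VGG layers and the gradient is propagated back to image pixels.

```python
import numpy as np

# Toy sketch of the joint optimisation: start from "white noise" features X
# and descend on alpha * L_content + beta * L_style, using the analytic
# gradients of both terms (identity "network", so no back-propagation needed).
rng = np.random.default_rng(2)
P = rng.standard_normal((4, 16))      # content ("photograph") features
A = rng.standard_normal((4, 16))      # style ("artwork") features
A_gram = A @ A.T
X = rng.standard_normal((4, 16))      # white-noise initialisation
alpha, beta, lr = 1.0, 1e-3, 1e-2     # illustrative weighting and step size
N, M = X.shape

def joint_loss(X):
    content = 0.5 * np.sum((X - P) ** 2)
    G = X @ X.T
    style = np.sum((G - A_gram) ** 2) / (4.0 * N ** 2 * M ** 2)
    return alpha * content + beta * style

losses = [joint_loss(X)]
for _ in range(200):
    G = X @ X.T
    grad_content = X - P                               # d L_content / d X
    grad_style = ((G - A_gram) @ X) / (N ** 2 * M ** 2)  # d E / d X
    X -= lr * (alpha * grad_content + beta * grad_style)
    losses.append(joint_loss(X))
```

Varying the ratio `alpha / beta` trades off how closely the result matches the content target versus the style statistics, mirroring the columns of Fig 3.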
This work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center for Computational Neuroscience (FKZ 01GQ1002) and the German Excellency Initiative through the Centre for Integrative Neuroscience, Tübingen (EXC307) (M.B., A.S.E., L.A.G.).