The key idea behind style transfer is to perform gradient descent from random noise, minimizing both the deviation from the content of the target image (i.e. feature responses in the upper convolutional layers of a recognition network) and the deviation from the style representation of the style image. The latter is defined as a set of Gram correlation matrices $\{G^l\}$ with the entries $G^l_{ij} = \langle F^l_i, F^l_j \rangle$, where $l$ is the network layer, $i$ and $j$ are two filters of the layer $l$ and $F^l_k$ is the array (indexed by spatial coordinates) of neuron responses of filter $k$ in layer $l$. In other words, the value $G^l_{ij}$ says how often features $i$ and $j$ in the layer $l$ appear together. Please refer to [Gatys15] for details.
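As a minimal sketch of the style representation described above, the Gram matrix of a single layer can be computed by flattening each filter response over its spatial coordinates and taking all pairwise inner products. The function names, the `(filters, height, width)` array layout and the unnormalized squared-error loss are assumptions made for illustration; [Gatys15] additionally applies normalization constants that are omitted here.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of one layer.

    `features` is assumed to have shape (n_filters, height, width).
    Entry (i, j) is the inner product of the flattened responses of
    filters i and j.
    """
    n_filters = features.shape[0]
    flat = features.reshape(n_filters, -1)  # one row per filter
    return flat @ flat.T

def style_loss(generated, style):
    """Unnormalized squared distance between two Gram matrices (assumption)."""
    diff = gram_matrix(generated) - gram_matrix(style)
    return float(np.sum(diff ** 2))

# Toy layer with 2 filters on a 2x2 grid.
toy = np.arange(8.0).reshape(2, 2, 2)
toy_gram = gram_matrix(toy)
```

Because the spatial coordinates are summed out, any permutation of the pixels applied identically to every filter leaves the Gram matrix unchanged.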
In figure 1 we demonstrate the spectacular performance of the algorithm by running iterations of the L-BFGS [lbfgs] optimization algorithm on a photo of a cat.
2 Frameworks and Implementations
We have tried running implementations on Caffe [Caffe] (using the CUDA and OpenCL [CaffeOpenCL] backends) by Frank Liu [fliu] and on Torch [Torch] by Kai Sheng Tai [tai] and by Justin Johnson [jc].
We have observed the OpenCL Caffe backend to be promising but still unstable compared to CUDA. Otherwise, both Torch implementations turned out to be significantly faster.
We have thus built our work on top of the Torch implementation by Kai Sheng Tai. Interestingly, the Torch cunn [cunn] backend performed slightly better than cuDNN [cudnn] by NVIDIA.
In [Gatys15] the VGG-19 [VGG] recognition network is used to produce results. We compare the impact of using other networks (AlexNet [AlexNet], GoogLeNet [GoogLeNet], VGG-16 and VGG-19) in figure 2, with AlexNet performing similarly to GoogLeNet and VGG-16 similarly to VGG-19.
VGG networks perform much better at style transfer due to their architecture. For example, AlexNet and GoogLeNet strongly compress the input at the first convolutional layer using large kernels and strides ($11 \times 11$ with stride 4 and $7 \times 7$ with stride 2 respectively) and thus a lot of fine detail is lost. VGG networks use $3 \times 3$ kernels with stride 1 at all convolutional layers and thus capture much more information.
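The compression at the first convolutional layer can be made concrete with the standard output-size formula for a convolution. The sketch below is illustrative only: the input resolutions and paddings (227 with no padding for AlexNet, 224 with padding 3 for GoogLeNet, 224 with padding 1 for VGG) are assumptions taken from the commonly used reference configurations, not values stated in this text.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1

# First convolutional layer of each network (assumed configurations).
first_layers = {
    "AlexNet": dict(input_size=227, kernel=11, stride=4, padding=0),
    "GoogLeNet": dict(input_size=224, kernel=7, stride=2, padding=3),
    "VGG": dict(input_size=224, kernel=3, stride=1, padding=1),
}

first_layer_sizes = {
    name: conv_output_size(**cfg) for name, cfg in first_layers.items()
}

for name, size in first_layer_sizes.items():
    print(f"{name}: {size} x {size} after the first convolution")
```

Under these assumptions only the VGG layer keeps the full input resolution, which is consistent with the loss of fine detail observed for the other two networks.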
We have therefore used the VGG-19 network for all our experiments.
In the original algorithm [Gatys15], gradient descent is always performed from white noise. We try and compare two other initialization points: content and style. We demonstrate the impact of the initialization strategy in figure 3, starting the gradient descent from the content photo, the style artwork or from white noise.
The results highlight the highly non-convex nature of our objective and show that the starting point has a tremendous influence on the basin of attraction that the gradient descent converges to. We naturally observe that starting from the content lets us preserve most of it, while starting from the style image is prone to “leaking” the content of the artwork into the final result. This also serves to reinforce the observation made in [Gatys15] that style and content are not strictly separable.
We find that for most practical applications starting from the content image produces the best result. This corresponds well to the way most artworks are produced – starting from the artist observing the content and drawing a rough sketch (i.e. content reconstruction comes first), and only then applying paint (i.e. style) on top.
Note that noise initialization is still very useful for testing, benchmarking and hyper-parameter tuning. For example, starting from content and observing no change one might wonder whether the content weight is too high or the learning rate is too small. Such questions do not occur when starting from noise.
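The three starting points compared in this section can be sketched as a single helper that returns the initial image for the optimizer. The function name, the uniform noise range and the requirement that the style image has already been resized to the content shape are assumptions made for this illustration.

```python
import numpy as np

def initial_image(content, style, mode, seed=0):
    """Return the starting point of the gradient descent.

    `content` and `style` are assumed to be float arrays of identical
    shape (the style image resized beforehand if necessary).
    """
    if mode == "content":
        return content.copy()
    if mode == "style":
        return style.copy()
    if mode == "noise":
        # White noise drawn uniformly from [0, 1) (assumed value range).
        rng = np.random.default_rng(seed)
        return rng.uniform(0.0, 1.0, size=content.shape)
    raise ValueError(f"unknown initialization mode: {mode!r}")

content = np.zeros((4, 4, 3))
style = np.ones((4, 4, 3))
starts = {m: initial_image(content, style, m) for m in ("content", "style", "noise")}
```

Returning copies keeps the optimizer from overwriting the reference images, which still need to be available to compute the content and style targets.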
5 Partial Style Transfer
In [Gatys15] the impact of shrinking the style layer pool (from using the first 5 convolutional layers down to using only the first convolutional layer) is demonstrated. The authors observe how the scale and complexity of the style features decrease as lower and lower convolutional layers are used to represent style. In general, one doesn’t benefit from using only the lower-layer style features (apart from the reduced computation) and is better off simply reducing the style weight in the optimization objective.
In our work we consider keeping the upper convolutional layers and removing the bottom ones, while enforcing them as part of the content pool.
For example, instead of using as the content layer and as the style layers, we could try to set as content and as style layers (see figure 4). Notice how the partial style transfer managed to reshape the content and make it more rectangular (according to the style image) while preserving the original colors (which are mostly captured by the features in the bottom layers).
We thus conclude that relaxing the bottom layer style constraints and enforcing them instead as content constraints allows us to retain the colors and low-level features of the content photo and only transfer mid- to high-level properties of the style, combining them into a visually appealing result.
6 Generating Styles
In order to better understand the style space constructed in [Gatys15] we draw inspiration from the Deep Visualization technique [DD]. We disregard all the Gram matrices except for one with a single non-zero entry and descend from white noise without content constraints. The motivation is to understand what parts of style a single element can describe and to verify whether the Gram matrices form a basis in the style space qualitatively similar to the basis of VGG features in the space of natural images. By varying the non-zero element and its magnitude we can generate some curious textures with complexity increasing from the first to the fourth layer (see figure 5).
Next we attempt to generate some more sophisticated styles. For this purpose we consider the Gram matrices of a single painting and visualize their histograms. We observe that the sparsity and amplitude of the Gram matrix elements increase from the first to the fifth layer (of course, this does not necessarily generalize, yet it seems to be in accordance with the CNN paradigm where more complex and more discriminative features are constructed from simpler ones). Although it is impossible to judge the distribution of Gram matrices describing styles from such a limited sample, we can try to mimic the sparsity and amplitude of the Gram matrices representing “Starry Night” by van Gogh [gogh]. We generate a random matrix using absolute values of a Gaussian distribution and apply a random sparse zero-one mask to it. We vary only two parameters: the variance of the Gaussian distribution and the sparsity of the mask. Interestingly enough, with such a simple construction, we were able to generate some intriguing textures (see figures 6 and 7), suggesting that style density estimation and generation might be a promising research direction.
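The two target constructions described in this section can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, `sparsity` is interpreted as the fraction of entries set to zero, and the matrices are used exactly as described in the text (no symmetry is enforced).

```python
import numpy as np

def single_entry_gram(n_filters, i, j, magnitude):
    """Target Gram matrix with a single non-zero entry at (i, j)."""
    gram = np.zeros((n_filters, n_filters))
    gram[i, j] = magnitude
    return gram

def random_sparse_gram(n_filters, variance, sparsity, seed=0):
    """Absolute values of Gaussian samples under a random zero-one mask.

    `sparsity` is assumed to be the probability that an entry is zeroed.
    """
    rng = np.random.default_rng(seed)
    shape = (n_filters, n_filters)
    values = np.abs(rng.normal(0.0, np.sqrt(variance), size=shape))
    mask = rng.random(shape) >= sparsity  # True where the entry is kept
    return values * mask

single = single_entry_gram(64, 3, 5, magnitude=10.0)
dense = random_sparse_gram(64, variance=4.0, sparsity=0.0)
empty = random_sparse_gram(64, variance=4.0, sparsity=1.0)
```

Either matrix can then replace the Gram matrix of a real painting as the style target of one layer, with the descent started from white noise and no content term.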
7 Spatial Style Transfer
In [Gatys15] each style layer $l$ with $N_l$ features introduces an $N_l \times N_l$ Gram matrix $G^l$ with feature correlation entries $G^l_{ij}$. This makes the style completely invariant to the spatial configuration of the style image (which is by design).
As an experiment, we suggest a new style representation designed to capture less of the artistic details and more of the spatial configuration of an image with roughly the same computational and storage complexity.
We construct matrices of size (where and are the spatial dimensions of the layer ) with entries
where stands for full 2D convolution. We thus impose a soft constraint on the feature distributions across the spatial coordinates (note that in principle we could store all pairwise convolutions of and , but such an experiment would be computationally infeasible).
In order for this objective to work we need to also rescale both content and style images to the same dimensions and modify the error derivative in [Gatys15] accordingly:
where stands for valid correlation.
We demonstrate this new approach on classic style transfer and on style reconstruction from noise (without content constraints) in figure 8.
Notice how differently the proposed algorithm scales: it is easy to optimize on top layers (large number of features $N_l$, small spatial dimensions $w_l$ and $h_l$) but is expensive on bottom layers (small $N_l$, huge $w_l$ and $h_l$). This is contrary to the algorithm in [Gatys15], where the computation cost mostly depends on the number of features $N_l$ and thus grows from bottom to top layers.
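The scaling behaviour can be illustrated with a short sketch of the size bookkeeping. It assumes, purely for illustration, that each filter response is convolved with itself using a full 2D convolution (the alternative of storing all pairwise convolutions is the one dismissed above as infeasible), so that one $(2h - 1) \times (2w - 1)$ matrix is stored per filter. The function names and the layer shapes (64 filters at $224 \times 224$ for a bottom VGG layer, 512 filters at $14 \times 14$ for a top one) are likewise assumptions.

```python
import numpy as np

def full_self_convolution(response):
    """Full 2D convolution of one filter response with itself, via FFT."""
    h, w = response.shape
    out_shape = (2 * h - 1, 2 * w - 1)
    # Zero-padding to the full output size turns circular into linear convolution.
    spectrum = np.fft.rfft2(response, s=out_shape)
    return np.fft.irfft2(spectrum * spectrum, s=out_shape)

def storage_entries(n_filters, h, w):
    """Number of stored values: (Gram matrix, per-filter spatial matrices)."""
    gram = n_filters * n_filters
    spatial = n_filters * (2 * h - 1) * (2 * w - 1)
    return gram, spatial

small = np.array([[1.0, 2.0], [3.0, 4.0]])
small_conv = full_self_convolution(small)

bottom_layer = storage_entries(64, 224, 224)  # few filters, large maps
top_layer = storage_entries(512, 14, 14)      # many filters, small maps
```

Under these assumptions the spatial representation dwarfs the Gram matrix on the bottom layer and is of comparable size on the top layer.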
8 Illumination and Season Transfer
Not all artistic styles are transferred equally well by the algorithm in [Gatys15]. If the style is highly repetitive and homogeneous over the whole style image (e.g. abstract art, textures and patterns), it can be transferred with remarkable quality. However, once it becomes more subtle and varied within the image (e.g. Renaissance, Baroque, Realism), style transfer falters. It fails to capture properties like the dramatic use of lighting or exaggerated facial features. These properties get averaged out, as the style representation is computed over the whole image.
The same problem (i.e. global style representation) stands in the way of season and illumination transfer, as these properties change different elements of the scene differently.
9 Towards Content-Aware Style Transfer
The results presented in figures 9, 10, 11 and 12 are poor, but not hopeless. Indeed, one can easily spot a lot of regions where the texture happened to be transferred correctly. This indicates that in principle the style representation developed in [Gatys15] is capable of capturing photorealistic properties.
What remains is developing a content-aware style transfer, i.e. transferring style according to the content matches between the image to be repainted and the style image. Below we discuss some possible leads towards implementing such a task.
A direct approach could be to replace the global style values as defined in [Gatys15]
with localized values of
where we capture only feature correlations in a small () region around a point and we make the contributions decay as they get more distant from the point of interest:
We can then replace the global style loss
as defined in [Gatys15] with a global style-content covariation loss
where is the weighted content response of neurons of filter in layer reachable from
and the norm is that of a 3-dimensional tensor indexed with and .
Of course, such an objective dramatically increases the computational cost of our problem and, while it could be efficiently parallelized, it would probably still remain infeasible.
In our implementation we make some very rough assumptions to test it: we consider (content of images is aligned; we thus perform a locality-sensitive style transfer), (pixel-wise style) and (no distance decay). This leads to a simplified expression of
The next step within this approach would be to expand the style window and see whether a locality-sensitive style transfer is feasible. If it yields good results, the content-aware transfer could be investigated.
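To make the notion of a localized style value more tangible, the sketch below computes feature correlations restricted to a square window around a point, with optional distance decay. It is one plausible reading of the description above, not the exact definition used in this work: the function name, the square window of half-width `radius` and the Gaussian decay with parameter `sigma` are all assumptions.

```python
import numpy as np

def local_gram(features, y, x, radius, sigma=None):
    """Feature correlations in a window around (y, x).

    `features` is assumed to have shape (n_filters, height, width).
    With `sigma=None` every position in the window contributes equally;
    otherwise contributions decay with a Gaussian of width `sigma`
    (assumed decay profile).
    """
    n_filters, height, width = features.shape
    y0, y1 = max(0, y - radius), min(height, y + radius + 1)
    x0, x1 = max(0, x - radius), min(width, x + radius + 1)
    patch = features[:, y0:y1, x0:x1].reshape(n_filters, -1)
    yy, xx = np.mgrid[y0:y1, x0:x1]
    if sigma is None:
        weights = np.ones(yy.size)
    else:
        dist2 = (yy - y) ** 2 + (xx - x) ** 2
        weights = np.exp(-dist2 / (2.0 * sigma ** 2)).ravel()
    return (patch * weights) @ patch.T

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 5, 6))
pixelwise = local_gram(feats, 2, 3, radius=0)    # window of a single pixel
whole = local_gram(feats, 2, 3, radius=10)       # window covers the layer
decayed = local_gram(feats, 2, 3, radius=2, sigma=1.0)
```

A single-pixel window without decay reduces to the outer product of the feature vector at that point, while a window covering the whole layer recovers the global Gram matrix, so expanding the style window interpolates between the two.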
We believe that if implemented well, such an algorithm could tackle more exquisite artistic styles, season transfer, illumination transfer, super-resolution and possibly many other applications.