implementation of neural style transfer in Tensorflow
The recent work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating fantastic artistic imagery by separating and recombining image content and style. This process of using a CNN to migrate the semantic content of one image to different styles is referred to as Neural Style Transfer. Since then, Neural Style Transfer has become a trending topic in both academic literature and industrial applications. It is receiving increasing attention from computer vision researchers, and several methods have been proposed to either improve or extend the original neural algorithm of Gatys et al. However, there is no comprehensive survey presenting and summarising recent Neural Style Transfer literature. This review aims to provide an overview of the current progress in Neural Style Transfer, and to discuss its various applications and open problems for future research.
Painting is a popular form of art. For thousands of years, people have been attracted by the art of painting with the advent of many fantastic artworks, e.g., Van Gogh’s “The Starry Night”. In the past, re-drawing an image in a particular style required a well-trained artist and lots of time.
Since the mid-1990s, the art theories behind these artworks have attracted the attention of not only artists but also many computer science researchers. There are plenty of studies exploring how to automatically turn images into synthetic artworks. Among these studies, the advances in Non-photorealistic Rendering (NPR) [1, 2, 3] are inspiring, and NPR is nowadays a firmly established field in the computer graphics community. However, most of these NPR stylisation algorithms are designed for particular artistic styles [3, 4] and cannot be easily extended to other styles. In the computer vision community, by contrast, style transfer is usually studied as a generalised problem of texture synthesis, which is to extract and transfer the texture from a source to a target. However, only low-level features are considered during this process, and the results are usually not that impressive.
Recently, inspired by the power of Convolutional Neural Networks (CNNs), Gatys et al. first studied how to use a CNN to reproduce famous painting styles on natural images. They proposed to model the content of a photo as the feature responses from a pre-trained CNN, and further model the style of an artwork as summary feature statistics. Their experimental results demonstrated that the content and style in a photo were separable, which indicates the possibility of changing a photo’s style while preserving the desired semantic content. Based on this finding, Gatys et al. first proposed to exploit CNN feature activations to separate and recombine the content of a given photo and the style of famous artworks. The key idea behind their algorithm is to iteratively optimise an image with the objective of matching the desired CNN feature distribution, which involves both the photo’s content information and the artwork’s style information. Their proposed algorithm successfully produces fantastic stylised images with the appearance of a given artwork. Figure 1 shows an example of transferring the style of the Chinese painting “Dwelling in the Fuchun Mountains” onto a photo of The Great Wall. Since the algorithm of Gatys et al. does not have any explicit restrictions on the type of style images, it breaks the constraints of previous approaches. The work of Gatys et al. opened up a new field called Neural Style Transfer (NST), which is the process of using a Convolutional Neural Network to render a content image in different styles.
The seminal work of Gatys et al. has attracted wide attention from both academia and industry. In academia, many follow-up studies were conducted to either improve or extend this algorithm, and before long these techniques were applied in many successful industrial applications (e.g., Prisma, Ostagram, Deep Forger). However, there is no comprehensive survey summarising and discussing recent advances as well as challenges within this new field of Neural Style Transfer.
In this paper, we aim to provide an overview of current advances (up to March 2018) in Neural Style Transfer (NST). Our contributions are threefold. First, we investigate, classify and summarise recent advances in the field of NST. Second, we present several evaluation methods and experimentally compare different NST algorithms. Third, we summarise current challenges in this field and propose the corresponding possible solutions.
The organisation of this paper is as follows. We start our discussion with a brief review of pre-neural artistic style transfer methods in Section 2. Then Section 3 explores the derivations and foundations of NST. Based on the discussions in Section 3, we categorise and explain existing NST algorithms in Section 4. Some improvement strategies for these methods and their extensions will be given in Section 5. Section 6 presents several methodologies for evaluating NST algorithms and aims to build a standardised benchmark for follow-up studies. Then we demonstrate the commercial applications of NST in Section 7, including both current successful usages and its potential applications. In Section 8, we summarise current challenges in the field of NST, as well as propose the corresponding possible solutions. Finally, Section 9 concludes the paper, and Section 10 delineates several promising directions for future research.
Artistic style transfer is a long-standing research topic. Due to its wide variety of applications, it has been an important research area for more than two decades. Before the appearance of Neural Style Transfer (NST), the related research in computer graphics had expanded into an area called Non-photorealistic Rendering (NPR). In the field of computer vision, style transfer is often considered as a generalised problem of texture synthesis. In this section, we briefly review some of these pre-neural artistic style transfer algorithms. For a more comprehensive overview, we recommend [3, 9, 10].
Stroke-based rendering (SBR) refers to a process of placing virtual strokes (e.g., brush strokes, tiles, stipples) upon a digital canvas to render a photograph with a particular style. SBR generally starts from a source photo, incrementally composites strokes to match the photo, and finally produces non-photorealistic imagery that looks like the photo but has an artistic style. During this process, an objective function is designed to guide the greedy or iterative placement of strokes. Despite the effectiveness of a large body of SBR algorithms, each is usually designed for one particular style (e.g., oil paintings, watercolours, sketches), which limits flexibility.
Image analogy aims to learn a mapping between a pair of source images and target stylised images in a supervised manner. The training set of image analogy comprises pairs of unstylised source images and the corresponding stylised images with a particular style. The image analogy algorithm then learns the analogous transformation from the example training pairs, and creates an analogous stylised result when given a test input photograph. Image analogy can also be extended in various ways, e.g., to learn stroke placements for portrait painting rendering. In general, image analogy is effective for a variety of artistic styles. However, pairs of training data are usually unavailable in practice.
Creating an artistic image is actually a process that aims for image simplification and abstraction. Therefore, it is natural to consider adopting and combining some related image processing filters to render a given photo. For example, Winnemöller et al. were the first to exploit bilateral and difference-of-Gaussians filters to automatically produce cartoon-like effects. In general, image filtering based rendering algorithms are straightforward to implement and efficient in practice. The price is that they are very limited in style diversity.
Textures are repeated visual patterns in an image. Texture synthesis is a process which grows similar textures from a source texture image. It has long been an active research topic in computer vision [17, 18]. Given a texture instance $I_s$, the process of texture synthesis can be considered as drawing a sample $I$ from a distribution conditioned on that instance:

$$I \sim P(I \mid I_s). \qquad (1)$$

Texture synthesis is closely related to style transfer, since one can consider style as a kind of texture. In that sense, style transfer is actually a process of texture transfer, which constrains the given semantic content of the image $I_c$ while synthesising textures:

$$I \sim P(I \mid I_c, I_s). \qquad (2)$$
This view was already proposed in one of the earliest works on texture synthesis. Since then, many works [19, 20] have followed this route; they are generally built upon patch matching and quilting techniques. Even recently, Frigo et al. propose an effective style transfer algorithm fully based on traditional texture synthesis methods. Their idea is to first divide the input image adaptively into suitable patches, search for the optimal mapping from the candidate regions in the style image, and then apply bilinear blending and colour transfer to obtain the final stylised result. Another recent work proposes a series of steps to stylise images, which include similarity matching of content and style patches, patch aggregation, content fusion based on a segmentation algorithm, etc. Built upon traditional texture synthesis techniques, style transfer can be performed in an unsupervised manner. However, these texture synthesis based algorithms only exploit low-level image features, which limits their performance.
For a better understanding of the development of NST, we start by introducing its derivations. To automatically transfer an artistic style, the first and most important issue is how to model and extract style from an image. Since style is also a form of texture, a straightforward way is to relate Visual Style Modelling back to previously well-studied Visual Texture Modelling methods. After obtaining the style representation, the next issue is how to reconstruct an image with the desired style information while preserving its semantic content. This is where Image Reconstruction techniques come in.
Visual texture modelling has long been studied as the heart of texture synthesis. Historically, there are two distinct approaches to modelling visual textures: Parametric Texture Modelling with Summary Statistics and Non-parametric Texture Modelling with MRFs.
One path towards texture modelling is to capture image statistics from a sample texture and exploit summary statistical properties to model the texture. The idea is first proposed by Julesz, who models textures as pixel-based $N$-th order statistics. Later work exploits filter responses to analyse textures, instead of direct pixel-based measurements. After that, Portilla and Simoncelli further introduce a texture model based on multi-scale orientated filter responses and use gradient descent to improve synthesised results. A more recent parametric texture modelling approach proposed by Gatys et al. is the first to measure summary statistics in the domain of Convolutional Neural Networks (CNNs). They design a Gram-based representation to model textures, which consists of the correlations between filter responses in different layers of a pre-trained classification network (the VGG network). More specifically, the Gram-based representation encodes the second order statistics of the set of CNN filter responses. Next, we explain this representation in detail, since the following sections rely on it.
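Assume that the feature map of a sample texture image $I_s$ at layer $l$ of the VGG network is $\mathcal{F}^l(I_s) \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, and $H$ and $W$ represent the height and width of the feature map $\mathcal{F}^l(I_s)$. Then the Gram-based representation can be obtained by computing the Gram matrix $\mathcal{G}(\mathcal{F}^l(I_s)') \in \mathbb{R}^{C \times C}$ over the feature map $\mathcal{F}^l(I_s)' \in \mathbb{R}^{C \times (HW)}$ (a reshaped version of $\mathcal{F}^l(I_s)$):

$$\mathcal{G}(\mathcal{F}^l(I_s)') = [\mathcal{F}^l(I_s)'] \, [\mathcal{F}^l(I_s)']^T. \qquad (3)$$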
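As a minimal sketch of the Gram-based representation, the following NumPy snippet reshapes a feature map to channels by positions and multiplies it by its own transpose. The random array is only a stand-in for real VGG activations, and the name `gram_matrix` is chosen here for illustration.

```python
import numpy as np

def gram_matrix(feature_map):
    """Gram matrix of a (C, H, W) feature map: F' F'^T with F' of shape (C, H*W)."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    return flat @ flat.T

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 6))   # stand-in for VGG activations
g = gram_matrix(features)
print(g.shape)  # (4, 4)

# Shuffling spatial positions leaves the Gram matrix unchanged.
perm = rng.permutation(5 * 6)
shuffled = features.reshape(4, -1)[:, perm].reshape(4, 5, 6)
print(np.allclose(gram_matrix(shuffled), g))  # True
```

The second check illustrates why this representation captures global statistics only: any rearrangement of spatial positions yields the same matrix.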
This Gram-based texture representation from a CNN is effective at modelling wide varieties of both natural and non-natural textures. However, the Gram-based representation is designed to capture global statistics and discards spatial arrangements, which leads to unsatisfying results when modelling regular textures with long-range symmetric structures. To address this problem, Berger and Memisevic propose to horizontally and vertically translate the feature map by $\delta$ pixels to correlate the feature at position $(i, j)$ with those at positions $(i + \delta, j)$ and $(i, j + \delta)$. In this way, the representation incorporates spatial arrangement information and is therefore more effective at modelling textures with symmetric properties.
Another notable texture modelling methodology is to use non-parametric resampling. A variety of non-parametric methods are based on the Markov Random Field (MRF) model, which assumes that in a texture image, each pixel is entirely characterised by its spatial neighbourhood. Under this assumption, Efros and Leung propose to synthesise pixels one by one by searching for similar neighbourhoods in the source texture image and assigning the corresponding pixel. Their work is one of the earliest non-parametric algorithms with MRFs. Following their work, Wei and Levoy further speed up the neighbourhood matching process by always using a fixed neighbourhood.
In general, an essential step for many vision tasks is to extract an abstract representation from the input image. Image reconstruction is the reverse process: reconstructing the whole input image from the extracted image representation. It has previously been studied as a way to analyse a particular image representation and discover what information the abstract representation contains. Here our major focus is on CNN representation based image reconstruction algorithms, which can be categorised into Slow Image Reconstruction based on Online Image Optimisation and Fast Image Reconstruction based on Offline Model Optimisation.
The first algorithm to reverse CNN representations is proposed by Mahendran and Vedaldi [30, 31]. Given a CNN representation to be reversed, their algorithm iteratively optimises an image (generally starting from random noise) until its CNN representation matches the desired one. The iterative optimisation process is based on gradient descent in image space. Therefore, the process is time-consuming, especially when the desired reconstructed image is large.
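The idea of gradient descent in image space can be sketched on a toy problem. The snippet below is not the method of [30, 31]: it replaces the CNN with a hypothetical linear map `W` so that the gradient can be written by hand, and it assumes a fixed step size derived from the spectral norm of `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))        # hypothetical linear stand-in for a CNN
target_image = rng.standard_normal(32)
target_rep = W @ target_image           # representation to be reversed

def loss_and_grad(x):
    """Squared distance to the target representation and its gradient w.r.t. x."""
    residual = W @ x - target_rep
    return residual @ residual, 2.0 * W.T @ residual

x = rng.standard_normal(32)             # start from random noise
step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
history = []
for _ in range(500):
    loss, grad = loss_and_grad(x)
    history.append(loss)
    x -= step * grad                    # update the image, not any network weights
```

Because the representation has fewer dimensions than the image, the optimised `x` matches the target representation without necessarily equalling `target_image`, which mirrors the fact that many images can share one abstract representation.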
To address the efficiency issue of [30, 31], Dosovitskiy and Brox propose to train a feed-forward network in advance and put the computational burden on the training stage. At the testing stage, the reverse process can be done with a single network forward pass. Their algorithm significantly speeds up the image reconstruction process. In later work, they further incorporate a Generative Adversarial Network (GAN) to improve the results.
Neural Style Transfer (NST) is a subset of the large body of aforementioned artistic style transfer, as shown in Figure 2. It denotes the group Style Transfer via Neural Network. One can also say that NST is a combination of Style Transfer via Texture Synthesis and Convolutional Neural Networks. In this section, we provide a categorisation of NST algorithms. Current NST methods fit into one of two categories: Slow Neural Method Based On Online Image Optimisation and Fast Neural Method Based On Offline Model Optimisation. The first category transfers the style by iteratively optimising an image, i.e., algorithms belonging to this category are built upon Slow Image Reconstruction techniques. The second category optimises a generative model offline and produces the stylised image with a single forward pass, which exploits the idea of Fast Image Reconstruction techniques.
DeepDream is the first attempt to produce artistic images by reversing CNN representations with Slow Image Reconstruction techniques. By further combining Visual Texture Modelling techniques to model style, Slow Neural Methods Based On Online Image Optimisation are subsequently proposed, which build the early foundations for the field of NST. Their basic idea is to first model and extract style and content information from the corresponding style and content images, recombine them as the target representation, and then iteratively reconstruct a stylised result that matches the target representation. In general, different Slow Neural Methods share the same Slow Image Reconstruction technique, but differ in the way they model the visual style, which is built on the aforementioned two categories of Visual Texture Modelling techniques.
The first subset of Slow Neural Methods is based on Parametric Texture Modelling with Summary Statistics. The style is characterised as a set of spatial summary statistics.
We start by introducing the first NST algorithm, proposed by Gatys et al. [5, 4]. By reconstructing representations from intermediate layers in the VGG network, Gatys et al. observe that a deep convolutional neural network is capable of extracting semantic image content from an arbitrary photograph and some appearance information from a well-known artwork. According to this observation, they build the content component of the newly stylised image by penalising the difference of high-level representations derived from content and stylised images, and further build the style component by matching Gram-based summary statistics of style and stylised images, which are derived from their proposed texture modelling technique (Section 3.1). The details of their algorithm are as follows.
Given a content image $I_c$ and a style image $I_s$, the algorithm of Gatys et al. seeks a stylised image $I$ that minimises the following objective:

$$I^* = \arg\min_I \mathcal{L}_{total}(I_c, I_s, I) = \arg\min_I \left[ \alpha \mathcal{L}_c(I_c, I) + \beta \mathcal{L}_s(I_s, I) \right], \qquad (4)$$

where $\mathcal{L}_c$ compares the content representation of a given content image to that of the (yet unknown) stylised image, and $\mathcal{L}_s$ compares the Gram-based style representation derived from a style image to that of the (yet unknown) stylised image. $\alpha$ and $\beta$ are used to balance the content component and style component in the stylised result.

The content loss $\mathcal{L}_c$ is defined by the squared Euclidean distance between the feature representations $\mathcal{F}^l$ of the content image $I_c$ in layer $l$ and that of the (yet unknown) stylised image $I$:

$$\mathcal{L}_c = \sum_{l \in \{l_c\}} \left\| \mathcal{F}^l(I_c) - \mathcal{F}^l(I) \right\|^2, \qquad (5)$$

where $\{l_c\}$ denotes the set of VGG layers for computing the content loss. For the style loss $\mathcal{L}_s$, Gatys et al. exploit the Gram-based visual texture modelling technique to model the style, which has already been explained in Section 3.1. Therefore, the style loss is defined by the squared Euclidean distance between the Gram-based style representations of $I_s$ and $I$:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \left\| \mathcal{G}(\mathcal{F}^l(I_s)') - \mathcal{G}(\mathcal{F}^l(I)') \right\|^2, \qquad (6)$$

where $\mathcal{G}$ is the aforementioned Gram matrix encoding the second order statistics of the set of filter responses, and $\{l_s\}$ represents the set of VGG layers for calculating the style loss. The choice of $\{l_c\}$ and $\{l_s\}$ empirically follows the principle that lower layers tend to retain low-level features (e.g., colours), while higher layers generally preserve more high-level semantic content information. Therefore, $\mathcal{L}_s$ is usually computed with lower layers and $\mathcal{L}_c$ is computed with higher layers. Given the pre-trained VGG-19 as the loss network, Gatys et al.’s choice is $\{l_s\} = \{relu1\_1, relu2\_1, relu3\_1, relu4\_1, relu5\_1\}$ and $\{l_c\} = \{relu4\_2\}$. The VGG loss network is not the only option. Similar performance can be achieved by selecting other pre-trained classification networks, e.g., ResNet.

The total loss in Equation (4) can be minimised by using gradient descent in image space with backpropagation. In addition, a total variation denoising term is usually added in practice to encourage smoothness in the stylised result.
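The content, style, and total variation terms described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: features are passed in as lists of precomputed `(C, H, W)` arrays (one per chosen layer), the function names and default weights are made up for this example, and the total variation term uses squared neighbour differences, which is one common choice among several.

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H, W) feature map."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)
    return flat @ flat.T

def content_loss(content_feats, stylised_feats):
    """Squared Euclidean distance between feature maps, summed over content layers."""
    return sum(np.sum((fc - fx) ** 2) for fc, fx in zip(content_feats, stylised_feats))

def style_loss(style_feats, stylised_feats):
    """Squared Euclidean distance between Gram matrices, summed over style layers."""
    return sum(np.sum((gram(fs) - gram(fx)) ** 2) for fs, fx in zip(style_feats, stylised_feats))

def total_variation(image):
    """Sum of squared differences between neighbouring pixels of a (C, H, W) image."""
    dh = image[:, 1:, :] - image[:, :-1, :]
    dw = image[:, :, 1:] - image[:, :, :-1]
    return np.sum(dh ** 2) + np.sum(dw ** 2)

def total_loss(c_feats, s_feats, x_c_feats, x_s_feats, image,
               alpha=1.0, beta=1e3, tv_weight=0.0):
    """alpha * content + beta * style (+ optional smoothness term)."""
    return (alpha * content_loss(c_feats, x_c_feats)
            + beta * style_loss(s_feats, x_s_feats)
            + tv_weight * total_variation(image))

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 6, 6))
perm = rng.permutation(36)
scrambled = feat.reshape(4, -1)[:, perm].reshape(4, 6, 6)
print(content_loss([feat], [scrambled]) > 0)          # True
print(np.isclose(style_loss([feat], [scrambled]), 0)) # True
```

The two printed checks show the complementary roles of the terms: spatially scrambled features incur a content penalty but no style penalty.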
Gram-based style representation is not the only choice to statistically encode style information. There are also some other effective statistical style representations, which are derived from the Gram-based representation. Li et al. derive some different style representations by considering style transfer in the domain of transfer learning, or more specifically, domain adaptation. Given that training and testing data are drawn from different distributions, the goal of domain adaptation is to adapt a model trained on labelled training data from a source domain to predict labels of unlabelled testing data from a target domain. One way to perform domain adaptation is to match a sample in the source domain to that in the target domain by minimising their distribution discrepancy, for which Maximum Mean Discrepancy (MMD) is a popular measure of the discrepancy between two distributions. Li et al. prove that matching Gram-based style representations between a pair of style and stylised images is intrinsically minimising MMD with a quadratic polynomial kernel. Therefore, it is expected that other kernel functions for MMD can be equally applied in NST, e.g., the linear kernel, polynomial kernel and Gaussian kernel. Another related representation is the BN statistic representation, which uses the mean and variance of the feature maps in VGG layers to model style:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \frac{1}{C^l} \sum_{c=1}^{C^l} \left( \left\| \mu(\mathcal{F}^l_c(I_s)) - \mu(\mathcal{F}^l_c(I)) \right\|^2 + \left\| \sigma(\mathcal{F}^l_c(I_s)) - \sigma(\mathcal{F}^l_c(I)) \right\|^2 \right), \qquad (7)$$

where $\mathcal{F}^l_c$ is the $c$-th feature map channel at layer $l$ of the VGG network, and $C^l$ is the number of channels.
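Both statements above can be checked numerically on random data. The sketch below first verifies that the squared distance between two Gram matrices equals an unnormalised squared MMD with the kernel $k(x, y) = (x^T y)^2$, and then implements a single-layer version of the BN statistic loss. The function names, the omission of normalisation constants, and the random inputs are assumptions made for this illustration.

```python
import numpy as np

def gram_distance(x, s):
    """Squared Frobenius distance between Gram matrices of (C, N) and (C, M) features."""
    return np.sum((x @ x.T - s @ s.T) ** 2)

def mmd2_quadratic(x, s):
    """Unnormalised squared MMD with kernel k(a, b) = (a^T b)^2 over feature columns."""
    k_xx = (x.T @ x) ** 2
    k_ss = (s.T @ s) ** 2
    k_xs = (x.T @ s) ** 2
    return k_xx.sum() + k_ss.sum() - 2.0 * k_xs.sum()

def bn_style_loss(f_s, f_x):
    """Single-layer BN statistic loss: channel-wise mean and std mismatch, averaged over channels."""
    c = f_s.shape[0]
    s, x = f_s.reshape(c, -1), f_x.reshape(c, -1)
    return np.mean((s.mean(axis=1) - x.mean(axis=1)) ** 2
                   + (s.std(axis=1) - x.std(axis=1)) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 20))   # stylised features, 20 positions
s = rng.standard_normal((3, 30))   # style features, 30 positions
print(np.isclose(gram_distance(x, s), mmd2_quadratic(x, s)))  # True

f = rng.standard_normal((3, 4, 4))
print(np.isclose(bn_style_loss(f + 1.0, f), 1.0))  # True
```

Shifting every activation by one changes each channel mean by one and leaves the standard deviations untouched, hence the BN statistic loss of exactly one in the last line.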
However, the Gram-based algorithm suffers from instabilities during optimisation. It also requires manual tuning of the parameters, which is very tedious. Risser et al. find that feature activations with quite different means and variances can still have the same Gram matrix, which is the main reason for the instabilities. Inspired by this observation, Risser et al. introduce an extra histogram loss, which guides the optimisation to match the entire histogram of feature activations. They also present a preliminary solution to automatic parameter tuning, which is to explicitly prevent gradients with extreme values through extreme gradient normalisation.
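A small numeric case makes this observation concrete. The sketch below uses two hand-picked single-channel activations (illustrative values, not taken from the paper) whose Gram matrices coincide although their means and variances differ.

```python
import numpy as np

def gram(flat_features):
    """Gram matrix of features already reshaped to (C, H*W)."""
    return flat_features @ flat_features.T

f_a = np.array([[1.0, 1.0, 1.0, 1.0]])
f_b = np.array([[2.0, 0.0, 0.0, 0.0]])

same_gram = np.allclose(gram(f_a), gram(f_b))
print(same_gram)                # True
print(f_a.mean(), f_b.mean())   # 1.0 0.5
print(f_a.var(), f_b.var())     # 0.0 0.75
```

A loss that only compares Gram matrices cannot distinguish these two activations, which is the ambiguity a histogram-matching term is meant to remove.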
All these aforementioned neural methods only compare content and stylised images in the CNN feature space to make the stylised image semantically similar to the content image. But since CNN features inevitably lose some low-level information contained in the image, there are usually some unappealing distorted structures and irregular artefacts in the stylised results. To preserve the coherence of fine structures during stylisation, Li et al. propose to incorporate additional constraints upon low-level features in pixel space. They introduce an additional Laplacian loss, which is defined as the squared Euclidean distance between the Laplacian filter responses of a content image and the stylised result. The Laplacian filter computes the second order derivatives of the pixels in an image and is widely used for edge detection.
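A Laplacian loss of this kind can be sketched for a single-channel image as below. The snippet assumes the standard 4-neighbour discrete Laplacian evaluated on interior pixels only; any preprocessing used in the original work is omitted, and the function names are chosen for this example.

```python
import numpy as np

def laplacian_response(img):
    """4-neighbour discrete Laplacian of a 2-D image, interior pixels only."""
    return (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
            - 4.0 * img[1:-1, 1:-1])

def laplacian_loss(content, stylised):
    """Squared Euclidean distance between the Laplacian responses of two images."""
    return np.sum((laplacian_response(content) - laplacian_response(stylised)) ** 2)

rng = np.random.default_rng(0)
content = rng.standard_normal((8, 8))
rows, cols = np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij")
ramp = 2.0 * rows + 3.0 * cols                 # smooth brightness gradient
noisy = content + rng.standard_normal((8, 8))  # structure-breaking change

print(np.isclose(laplacian_loss(content, content + ramp), 0.0))  # True
print(laplacian_loss(content, noisy) > 0)                        # True
```

Adding a linear ramp leaves the second derivatives unchanged, so the loss ignores it, whereas a change that disturbs edges and fine detail is penalised.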
The Non-parametric Slow Neural Method is built on the basis of Non-parametric Texture Modelling with MRFs. This category considers NST at a local level, i.e., operating on patches to match the style.
Li and Wand are the first to propose an MRF-based NST algorithm. They find that the parametric NST method with summary statistics only captures the per-pixel feature correlations and does not constrain the spatial layout, which leads to less visually plausible results for photorealistic styles. Their solution is to model the style in a non-parametric way and introduce a new style loss function which includes a patch-based MRF prior:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \sum_{i=1}^{m} \left\| \Psi_i(\mathcal{F}^l(I)) - \Psi_{NN(i)}(\mathcal{F}^l(I_s)) \right\|^2, \qquad (8)$$

where $\Psi(\mathcal{F}^l(I))$ is the set of all local patches from the feature map $\mathcal{F}^l(I)$. $\Psi_i$ denotes the $i$-th local patch and $\Psi_{NN(i)}$ is the style patch most similar to the $i$-th local patch in the stylised image $I$. The best matching $\Psi_{NN(i)}$ is obtained by calculating normalised cross-correlation over all the style patches in the style image $I_s$. $m$ is the total number of local patches. Since their algorithm matches a style at the patch level, the fine structure and arrangement can be preserved much better. Given a photograph as the content, their algorithm achieves remarkable results, especially for photorealistic styles.
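The patch-matching step behind this loss can be sketched for a single layer as follows. The patch size, stride of one, small stabilising constant, and all function names are assumptions of this illustration, and random arrays stand in for VGG feature maps.

```python
import numpy as np

def extract_patches(feat, k=3):
    """All k x k patches (stride 1) of a (C, H, W) feature map, flattened to rows."""
    c, h, w = feat.shape
    return np.stack([feat[:, i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1) for j in range(w - k + 1)])

def nearest_style_patches(patches_x, patches_s):
    """Index of the best style patch for each patch, by normalised cross-correlation."""
    nx = patches_x / (np.linalg.norm(patches_x, axis=1, keepdims=True) + 1e-8)
    ns = patches_s / (np.linalg.norm(patches_s, axis=1, keepdims=True) + 1e-8)
    return np.argmax(nx @ ns.T, axis=1)

def mrf_style_loss(feat_x, feat_s, k=3):
    """Sum of squared distances between each patch and its matched style patch."""
    px, ps = extract_patches(feat_x, k), extract_patches(feat_s, k)
    nn = nearest_style_patches(px, ps)
    return np.sum((px - ps[nn]) ** 2)

rng = np.random.default_rng(0)
f_style = rng.standard_normal((2, 6, 6))
f_other = rng.standard_normal((2, 6, 6))
print(mrf_style_loss(f_style, f_style))      # 0.0
print(mrf_style_loss(f_other, f_style) > 0)  # True
```

When the stylised features equal the style features, every patch matches itself and the loss vanishes; otherwise each patch is pulled towards its most correlated style patch.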
Although the Slow Neural Method Based On Online Image Optimisation is able to yield impressive stylised images, there are still some limitations. The most pressing limitation is efficiency. The second category, Fast Neural Method, addresses the speed and computational cost issue by exploiting Fast Image Reconstruction based on Offline Model Optimisation to reconstruct the stylised result, i.e., a feed-forward network $g$ is optimised over a large set of images $I_c$ for one or more style images $I_s$:

$$\theta^* = \arg\min_\theta \mathcal{L}_{total}(I_c, I_s, g_\theta(I_c)), \qquad I^* = g_{\theta^*}(I_c). \qquad (9)$$

Depending on the number of artistic styles a single $g$ can produce, Fast Neural Methods are further divided into Per-Style-Per-Model Fast Neural Method (PSPM), Multiple-Style-Per-Model Fast Neural Method (MSPM), and Arbitrary-Style-Per-Model Fast Neural Method (ASPM).
The first two Fast Neural Methods are proposed by Johnson et al. and Ulyanov et al. respectively. These two methods share a similar idea, which is to pre-train a feed-forward style-specific network and produce a stylised result with a single forward pass at the testing stage. They only differ in the network architecture: Johnson et al.’s design roughly follows the network proposed by Radford et al. but with residual blocks as well as fractionally strided convolutions, whereas Ulyanov et al. use a multi-scale architecture as the generator network. The objective function is similar to that of the algorithm of Gatys et al., which indicates that they are also Parametric Methods with Summary Statistics.
Shortly after [42, 43], Ulyanov et al. further find that simply applying normalisation to every single image, rather than to a batch of images as in batch normalisation (BN), leads to a significant improvement in stylisation quality. This single image normalisation is called Instance Normalisation (IN), which is equivalent to batch normalisation when the batch size is set to 1. The style transfer network with IN is shown to converge faster than with BN and also achieves visually better results. One interpretation is that IN is actually a form of style normalisation and can directly normalise the style of each content image to the desired style. Therefore, the objective is easier to learn, as the rest of the network only needs to take care of the content loss.
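The relationship between the two normalisations can be sketched in NumPy. The snippet omits the learnable affine parameters and running statistics that practical layers carry, and the function names and epsilon value are assumptions of this illustration.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalise (N, C, H, W) activations with statistics over batch and space."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Normalise each image and channel separately, with statistics over space only."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 3, 5, 5))
single = batch[:1]

print(np.allclose(instance_norm(single), batch_norm(single)))  # True
print(np.allclose(instance_norm(batch), batch_norm(batch)))    # False
```

With a batch of one, both functions compute the same statistics; with larger batches, only IN removes the per-image mean and contrast.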
Another work by Li and Wand is inspired by the MRF-based NST algorithm in Section 4.1.2. They address the efficiency issue by training a Markovian feed-forward network using adversarial training. Similar to that algorithm, their method is a Patch-based Non-parametric Method with MRFs. It is shown to outperform the algorithms of Johnson et al. and Ulyanov et al. in the preservation of coherent textures in complex images, thanks to its patch-based design.
Although the above PSPM approaches can produce stylised images two orders of magnitude faster than previous slow NST methods, a separate generative network has to be trained for each particular style image, which is quite time-consuming and inflexible. But many paintings (e.g., impressionist paintings) share similar paint strokes and only differ in their colour palettes. Intuitively, it is redundant to train a separate network for each of them. MSPM is therefore proposed, which improves the flexibility of PSPM by further incorporating multiple styles into one single model. There are generally two paths towards handling this problem: 1) tying only a small number of parameters in a network to each style ([48, 49]) and 2) still exploiting only a single network like PSPM but combining both style and content as inputs ([50, 51]).
An early work by Dumoulin et al. is built on the basis of the IN layer proposed in the PSPM algorithm (Section 4.2.1). They surprisingly find that using the same convolutional parameters while changing only the scaling and shifting parameters in IN layers is sufficient to model different styles. Therefore, they propose an algorithm to train a conditional multi-style transfer network based on conditional instance normalisation (CIN), which is defined as:

$$CIN(\mathcal{F}(I_c), s) = \gamma^s \left( \frac{\mathcal{F}(I_c) - \mu(\mathcal{F}(I_c))}{\sigma(\mathcal{F}(I_c))} \right) + \beta^s, \qquad (10)$$

where $\mathcal{F}$ is the input feature activation and $s$ is the index of the desired style from a set of style images. As shown in Equation (10), the conditioning for each style $I_s$ is done by the scaling and shifting parameters $\gamma^s$ and $\beta^s$ after normalising the feature activation $\mathcal{F}(I_c)$, i.e., each style $I_s$ can be achieved by tuning the parameters of an affine transformation. The interpretation is similar to that for IN in Section 4.2.1, i.e., the normalisation of feature statistics with different affine parameters can normalise the input content image to different styles. Furthermore, the algorithm of Dumoulin et al. can also be extended to combine multiple styles in a single stylised result by combining the affine parameters of different styles.
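A minimal NumPy sketch of conditional instance normalisation follows, including the combination of several styles by blending their affine parameters. The table layout of `gammas` and `betas` (one row per style), the random values, and the function names are assumptions for illustration; in a trained network these parameters are learned.

```python
import numpy as np

def instance_normalise(feat, eps=1e-5):
    """Zero-mean, unit-variance normalisation of each channel of a (C, H, W) map."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    return (feat - mu) / sigma

def cin(feat, gammas, betas, s):
    """Scale and shift the normalised features with the parameters of style s."""
    return gammas[s][:, None, None] * instance_normalise(feat) + betas[s][:, None, None]

def cin_blend(feat, gammas, betas, weights):
    """Mix styles by taking a weighted combination of their affine parameters."""
    w = np.asarray(weights, dtype=float)
    gamma, beta = w @ gammas, w @ betas
    return gamma[:, None, None] * instance_normalise(feat) + beta[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 8, 8))
gammas = rng.uniform(0.5, 2.0, size=(2, 3))   # hypothetical table: 2 styles, 3 channels
betas = rng.standard_normal((2, 3))

out = cin(feat, gammas, betas, s=1)
print(np.allclose(out.mean(axis=(1, 2)), betas[1]))              # True
print(np.allclose(out.std(axis=(1, 2)), gammas[1], atol=1e-3))   # True
half = cin_blend(feat, gammas, betas, [0.5, 0.5])
```

After the layer, each channel's mean and standard deviation are set by the chosen style's parameters alone, which is why swapping or mixing these few numbers is enough to switch or blend styles.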
Another algorithm which follows the first path of MSPM is proposed by Chen et al. Their idea is to explicitly decouple style and content, i.e., using separate network components to learn the corresponding content and style information. More specifically, they use mid-level convolutional filters (called the “StyleBank” layer) to individually learn different styles. Each style is tied to a set of parameters in the “StyleBank” layer. The remaining components in the network are used to learn semantic content information, which is shared by different styles. Their algorithm also supports flexible incremental training, which is to fix the content components in the network and only train a “StyleBank” layer for each newly added style.
One disadvantage of the first category is that the model size generally becomes larger with the increase of the number of learned styles. The second path of MSPM addresses this limitation by fully exploring the capability of one single network and combining both content and style into the network for style identification. Different MSPM algorithms differ in the way to incorporate style into the network.
Li et al. propose to first sample a set of noise maps from a uniform distribution and establish a one-to-one mapping between each style and noise map. For clarity, we divide the style transfer network into an encoder ($Enc$) and decoder ($Dec$) pair. For each style, the corresponding noise map $\mathcal{N}(I_s)$ is concatenated ($\oplus$) with the encoded feature activations $Enc(I_c)$ and then fed into the decoder $Dec$ to get the stylised result: $I = Dec(\mathcal{N}(I_s) \oplus Enc(I_c))$.
Zhang and Dana first forward each style image in the style set through the pre-trained VGG network and obtain multi-scale feature activations $\mathcal{F}(I_s)$ in different VGG layers. Then the multi-scale $\mathcal{F}(I_s)$ are combined with multi-scale encoded features $Enc(I_c)$ from different layers in the encoder through their proposed inspiration layers. The inspiration layers are designed to reshape $\mathcal{F}(I_s)$ to match the desired dimension, and also have a learnable weight matrix to tune feature maps to help minimise the objective function.
Arbitrary-Style-Per-Model Fast Neural Method (ASPM) aims at one-model-for-all, i.e., one single trainable model to transfer arbitrary artistic styles. There are also two types of ASPM, one built upon Non-parametric Texture Modelling with MRFs and the other built upon Parametric Texture Modelling with Summary Statistics.
The first ASPM algorithm is proposed by Chen and Schmidt. They first extract a set of activation patches from content and style feature activations computed in a pre-trained VGG network. Then they match each content patch to the most similar style patch and swap them (called “Style Swap”). The stylised result can be produced by reconstructing the resulting activation map after “Style Swap”, with either Slow Image Reconstruction based on Online Image Optimisation or Fast Image Reconstruction based on Offline Model Optimisation.
Building on the CIN-based algorithm in Section 4.2.2, the simplest approach for arbitrary style transfer is to train a separate parameter prediction network $P$ to predict $\gamma^s$ and $\beta^s$ in Equation (10) with a number of training styles. Given a test style image $I_s$, the CIN layers in the style transfer network take the affine parameters $\gamma^s$ and $\beta^s$ from $P(I_s)$, and normalise the input content image to the desired style with a forward pass.
Another similar approach is proposed by Huang and Belongie. Instead of training a parameter prediction network, Huang and Belongie propose to modify conditional instance normalisation (CIN) in Equation (10) to adaptive instance normalisation (AdaIN).
AdaIN transfers the channel-wise mean and variance feature statistics between content and style feature activations. Unlike in the preceding approach, the encoder in Huang and Belongie’s style transfer network is fixed and comprises the first few layers of a pre-trained VGG network. Therefore, the input to AdaIN is actually the feature activation from a pre-trained VGG network. The decoder needs to be trained with a large set of style and content images to decode the feature activations produced by AdaIN into the stylised result.
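The statistic transfer performed by AdaIN can be sketched in a few lines of NumPy: each content channel is normalised with its own mean and standard deviation, then rescaled and shifted with the statistics of the corresponding style channel. The function name, the (C, H, W) array layout and the `eps` value are assumptions for illustration; in the actual method the inputs would be VGG encoder activations and the output would be passed to the trained decoder.

```python
import numpy as np


def adain(content_feat, style_feat, eps=1e-5):
    """Align channel-wise mean/std of content features with those of style features.

    Both inputs are (C, H, W) arrays; the spatial sizes may differ.
    """
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    # Normalise the content statistics away, then apply the style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Note that the operation has no learnable parameters: the affine parameters are computed from the style input itself, which is what allows a single model to handle arbitrary styles.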
A more recent work by Li et al. attempts to exploit a series of feature transformations to transfer arbitrary artistic styles in a style-learning-free manner. Similar to Huang and Belongie, Li et al. use the first few layers of a pre-trained VGG as the encoder and train the corresponding decoder. But they replace the AdaIN layer between the encoder and decoder with a pair of whitening and colouring transformations (WCT). Their algorithm is built on the observation that the whitening transformation can remove the style-related information and preserve the structure of the content. Therefore, on receiving content activations from the encoder, the whitening transformation can filter the original style out of the input content image and return a filtered representation with only content information. Then, by applying the colouring transformation, the style patterns contained in the style activations are incorporated into the filtered content representation, and the stylised result can be obtained by decoding the transformed features. They also extend this single-level stylisation to multi-level stylisation to further improve visual quality.
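A minimal NumPy sketch of a whitening and colouring transform is given below, assuming (C, H, W) feature arrays and an eigendecomposition of the channel covariance. The helper names and the small `eps` regulariser are illustrative assumptions, and details of the published method (such as blending the result with the original content features) are omitted.

```python
import numpy as np


def _cov_power(centred, power, eps=1e-5):
    """Return the channel covariance of a centred (C, N) matrix raised to `power`."""
    c, n = centred.shape
    cov = centred @ centred.T / (n - 1) + eps * np.eye(c)  # eps keeps eigenvalues positive
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** power) @ vecs.T


def wct(content_feat, style_feat):
    """Whiten the content features, then colour them with the style covariance."""
    c, h, w = content_feat.shape
    fc = content_feat.reshape(c, -1)
    fs = style_feat.reshape(c, -1)
    fc = fc - fc.mean(axis=1, keepdims=True)
    style_mean = fs.mean(axis=1, keepdims=True)
    fs = fs - style_mean

    whitened = _cov_power(fc, -0.5) @ fc       # covariance becomes (close to) identity
    coloured = _cov_power(fs, 0.5) @ whitened  # covariance now matches the style
    return (coloured + style_mean).reshape(c, h, w)
```

After the transform, the channel covariance and mean of the output match those of the style features, while the spatial arrangement still comes from the content features.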
Since the boom of Neural Style Transfer (NST), some studies have also been devoted to improving current NST algorithms by controlling perceptual factors (e.g., stroke size control, spatial style control, and colour control). Also, all of the aforementioned NST methods are designed for general still images. They may not be appropriate for other types of images (e.g., doodles, head portraits, and video frames). Thus, a variety of follow-up studies aim to extend general NST algorithms to these particular types of images and even extend them beyond artistic image style (e.g., audio style).
Gatys et al. themselves propose several slight modifications to improve their previous algorithm. They demonstrate a spatial style control strategy, which is to define a guidance map for the feature activations that marks the desired region (the one receiving the style) and masks out the rest. As for colour control, the original NST algorithm produces stylised images with the colour distribution of the style image. However, sometimes people prefer a colour-preserving style transfer, i.e., preserving the colour of the content image during style transfer. The corresponding solution is to first transform the style image’s colours to match the content image’s colours before style transfer, or alternatively to perform style transfer only in the luminance channel.
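The luminance-only variant of colour-preserving style transfer can be sketched as follows. The sketch assumes RGB images stored as float arrays in [0, 1], uses the YIQ colour space (one possible choice of luminance/chrominance separation), and takes a hypothetical `stylise(content_y, style_y)` callable standing in for any NST algorithm that works on single-channel images.

```python
import numpy as np

# Approximate NTSC RGB -> YIQ matrix; the inverse is computed numerically so that
# the round trip is self-consistent.
RGB_TO_YIQ = np.array([[0.299, 0.587, 0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523, 0.312]])
YIQ_TO_RGB = np.linalg.inv(RGB_TO_YIQ)


def luminance_only_transfer(content_rgb, style_rgb, stylise):
    """Stylise only the luminance channel and keep the content image's colours.

    `stylise` is a hypothetical callable (content_y, style_y) -> stylised_y
    operating on (H, W, 1) arrays.
    """
    content_yiq = content_rgb @ RGB_TO_YIQ.T
    style_y = (style_rgb @ RGB_TO_YIQ.T)[..., :1]
    new_y = stylise(content_yiq[..., :1], style_y)
    # Recombine the stylised luminance with the untouched content chrominance.
    out_yiq = np.concatenate([new_y, content_yiq[..., 1:]], axis=-1)
    return np.clip(out_yiq @ YIQ_TO_RGB.T, 0.0, 1.0)
```

Because the I and Q channels are copied from the content image, the output keeps the content colours regardless of what the stylisation step does to the luminance.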
For stroke size control, the problem is much more complex. We show sample results of stroke size control in Figure 3. The discussion of stroke size control strategies needs to be split into several cases:
1) Slow Neural Style Transfer with non-high-resolution images: Since current style statistics (e.g., Gram-based and BN-based statistics) are scale-sensitive, different stroke sizes can be achieved simply by resizing a given style image to different scales.
2) Fast Style Transfer with non-high-resolution images: One possible solution is to resize the input image to different scales before the forward pass, which inevitably hurts stylisation quality. Another possible solution is to train multiple models with different scales of a style image, which is space and time consuming. Also, this solution fails to preserve stroke consistency among results with different stroke sizes, i.e., the results vary in stroke orientations, stroke configurations, etc. However, users generally wish to change only the stroke size and nothing else. To address this problem, Jing et al. propose a stroke controllable PSPM algorithm. The core component of their algorithm is a StrokePyramid module, which learns different stroke sizes with adaptive receptive fields. Without trading off quality and speed, their algorithm is the first to exploit one single model to achieve flexible continuous stroke size control while preserving stroke consistency, and further achieves spatial stroke size control to produce new artistic effects. Although one can also use an ASPM algorithm to control stroke size, ASPM trades off quality and speed. As a result, ASPM is less effective at producing fine strokes and details than the algorithm of Jing et al.
3) Slow Neural Style Transfer with high-resolution images: For high-resolution images, a large stroke size cannot be achieved by simply resizing the style image to a large scale. Since a neuron in the loss network can only affect the region of the content image that falls within the receptive field of VGG, there is almost no visual difference between large and larger brush strokes in a small image region of receptive field size. Gatys et al. tackle this problem by proposing a coarse-to-fine Slow Style Transfer procedure with several steps of downsampling, stylising, upsampling and final stylising.
4) Fast Style Transfer with high-resolution images: Similar to 3), the stroke size in the stylised result does not vary with the style image scale for high-resolution images. The solution is also similar to Gatys et al.’s algorithm, namely a coarse-to-fine stylisation procedure. The idea is to exploit a multimodel, which comprises multiple subnetworks. Each subnetwork receives the upsampled stylised result of the previous subnetwork as the input, and stylises it again with finer strokes.
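The coarse-to-fine idea shared by cases 3) and 4) can be sketched with a single downsample, stylise, upsample, stylise round. This is only a schematic under stated assumptions: `stylise(content, style, init)` is a hypothetical callable standing in for either an optimisation-based method or a subnetwork, resizing is done by simple striding and pixel repetition, and only one coarse level is used.

```python
import numpy as np


def coarse_to_fine(content, style, stylise, factor=2):
    """Stylise at low resolution first, then refine at full resolution.

    `stylise` is a hypothetical callable (content, style, init) -> image that
    starts from `init`; all images are (H, W, C) arrays.
    """
    # Coarse pass: downsample by striding (a real system would low-pass filter first).
    small_content = content[::factor, ::factor]
    small_style = style[::factor, ::factor]
    coarse = stylise(small_content, small_style, init=small_content)

    # Upsample the coarse result by pixel repetition and crop to the original size.
    upsampled = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    upsampled = upsampled[:content.shape[0], :content.shape[1]]

    # Fine pass: the large strokes are already in place, so this pass adds detail.
    return stylise(content, style, init=upsampled)
```

The large strokes are established in the coarse pass, where the receptive field covers a larger portion of the image, and are then carried into the high-resolution result through the initialisation.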
Another limitation of current NST algorithms is that they do not consider the depth information contained in the image. To address this limitation, depth preserving NST algorithms [58, 59] are proposed. Their approach is to add a depth loss function to measure the depth difference between the content image and the (yet unknown) stylised image. The image depth is acquired by applying a single-image depth estimation algorithm (e.g., Zoran et al.’s work).
Given a pair of style and content images which are similar in content, the goal of semantic style transfer is to build a semantic correspondence between the style and content, which maps each style region to a corresponding semantically similar content region. Then the style in each style region is transferred to the semantically similar content region.
1) Slow Semantic Style Transfer. Since the patch matching scheme naturally meets the requirements of region-based correspondence, Champandard proposes to build a semantic style transfer algorithm based on the aforementioned patch-based algorithm (Section 4.1.2). In fact, the result produced by the patch-based algorithm is already close to the target of semantic style transfer, but it does not incorporate an accurate segmentation mask, which sometimes leads to a wrong semantic match. Therefore, Champandard adds a semantic channel to the patch-based algorithm, which is a downsampled semantic segmentation map. The segmentation map can be either manually annotated or produced by a semantic segmentation algorithm. Despite the remarkable results produced by Champandard’s algorithm, an MRF-based design is not the only choice. Instead of incorporating an MRF prior, Chen and Hsu provide an alternative way for semantic style transfer, which is to exploit a masking-out process to constrain the spatial correspondence and a higher-order style feature statistic to further improve the result. More recently, Mechrez et al. propose an alternative contextual loss to realise semantic style transfer in a segmentation-free manner.
2) Fast Semantic Style Transfer. As before, efficiency is always a major issue. The Slow NST-based algorithms above leave much room for improvement in this respect. Lu et al. speed up the process by optimising the objective function in feature space instead of in pixel space. More specifically, they propose to do feature reconstruction, instead of image reconstruction as previous algorithms do. This optimisation strategy reduces the computational burden, since the loss does not need to propagate through a deep network. The resulting reconstructed feature is decoded into the final result with a trained decoder. Since the speed of Lu et al.’s algorithm does not reach real-time, there is still much room for further research.
Instance style transfer is built on instance segmentation and aims to stylise only a single user-specified object within an image. The challenge mainly lies in the transition between a stylised object and the non-stylised background. Castillo et al. tackle this problem by adding an extra MRF-based loss to smooth and anti-alias boundary pixels.
An interesting extension is to exploit NST to transform rough sketches into fine artworks. The method simply discards the content loss term and uses doodles as the segmentation map to do semantic style transfer.
Driven by the demand of AR/VR, Chen et al. propose a stereoscopic NST algorithm for stereoscopic images. They propose a disparity loss to penalise the bidirectional disparity. Their algorithm is shown to produce more consistent strokes for different views.
Current style transfer algorithms are usually not appropriate for head portrait images. As they do not impose spatial constraints, directly applying these existing algorithms to head portraits will deform facial structures, which is unacceptable for the human visual system. Selim et al. address this problem and extend NST to head portrait painting transfer. They propose to use the notion of gain maps to constrain spatial configurations, which can preserve the facial structures while transferring the texture of the style image.
NST algorithms for video sequences were proposed shortly after Gatys et al.’s first NST algorithm for still images. Unlike still image style transfer, the design of a video style transfer algorithm needs to consider the smooth transition between adjacent video frames. As before, we divide the related algorithms into Slow and Fast Video Style Transfer.
1) Slow Video Style Transfer based on Online Image Optimisation. The first video style transfer algorithm is proposed by Ruder et al. [68, 69]. They introduce a temporal consistency loss based on optical flow to penalise the deviations along point trajectories. The optical flow is calculated by using novel optical flow estimation algorithms [70, 71]. As a result, their algorithm eliminates temporal artefacts and produces smooth stylised videos. However, their algorithm is built upon online image optimisation and needs several minutes to process a single frame.
2) Fast Video Style Transfer based on Offline Model Optimisation. Several follow-up studies are devoted to stylising a given video in real-time. Huang et al. propose to add Ruder et al.’s temporal consistency loss to a current PSPM algorithm. Given two consecutive frames, the temporal consistency loss is directly computed using the two corresponding outputs of the style transfer network to encourage pixel-wise consistency, and a corresponding two-frame synergic training strategy is introduced for the computation of the temporal consistency loss. Another concurrent work shares a similar idea but additionally explores the style instability problem. Different from [72, 73], Chen et al. propose a flow subnetwork to produce feature flow and incorporate optical flow information in feature space. Their algorithm is built on a pre-trained style transfer network (an encoder-decoder pair) and warps feature activations from the pre-trained stylisation encoder using the obtained feature flow.
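The temporal consistency loss used by both families of video algorithms can be sketched as a masked mean squared error between the current stylised frame and the previous stylised frame warped by optical flow. The sketch below is an illustration under assumptions, not the formulation of any specific paper: the flow is assumed to be a backward flow giving, for each pixel of the current frame, the (dx, dy) offset of its source in the previous frame; warping uses nearest-neighbour lookup; and `mask` is an assumed per-pixel weight that is zero where the flow is unreliable (for example at occlusions).

```python
import numpy as np


def warp(frame, flow):
    """Backward-warp an (H, W, C) frame with an (H, W, 2) flow of (dx, dy) offsets."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]


def temporal_consistency_loss(curr_out, prev_out, flow, mask):
    """Masked mean squared difference between the current output and the warped previous one."""
    warped_prev = warp(prev_out, flow)
    diff = (curr_out - warped_prev) ** 2
    # mask is (H, W): 1 where the flow is trusted, 0 elsewhere.
    return float((mask[..., None] * diff).mean())
```

In the slow setting this term would be added to the per-frame optimisation objective, whereas in the fast setting it would be evaluated on the outputs of the network for two consecutive frames during training.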
Given a style image containing multiple characters, the goal of Character Style Transfer is to apply the idea of NST to generate new fonts and text effects. Atarsaikhan et al. directly apply an existing NST algorithm to font style transfer and achieve remarkable results. Yang et al., in contrast, propose to first characterise style elements and exploit the extracted characteristics to guide the generation of text effects. A more recent work designs a conditional GAN model for glyph shape prediction, and also an ornamentation network for colour and texture prediction. By training these two networks jointly, font style transfer can be realised in an end-to-end manner.
Colour style transfer aims to transfer the style of colour distributions. The general idea is to build upon current semantic style transfer but to eliminate distortions and preserve the original structure of the content image.
1) Slow Colour Style Transfer. The earliest colour style transfer approach is proposed by Luan et al. They propose to add a photorealism regularisation to penalise image distortions. But since Luan et al.’s algorithm is built on an online image optimisation based Slow Semantic Style Transfer method, it is computationally expensive.
2) Fast Colour Style Transfer. Li et al. address the efficiency issue of Luan et al.’s algorithm by handling the problem in two steps, a stylisation step and a smoothing step. The stylisation step applies an existing NST algorithm but replaces upsampling layers with unpooling layers to produce the stylised result with fewer distortions. The smoothing step then further eliminates structural artefacts. These two aforementioned algorithms [78, 79] are mainly designed for natural images. Another work proposes to exploit a GAN to transfer the colour from human-designed anime images to sketches. Their algorithm demonstrates a promising application of Colour Style Transfer, namely automatic image colourisation.
Image attributes generally refer to image colours, textures, etc. Previously, image attribute transfer was accomplished through image analogy in a supervised manner (Section 2). Derived from the idea of patch-based NST, Liao et al. propose a deep image analogy to study image analogy in the domain of CNN features. Their algorithm is based on a patch matching technique and realises a weakly supervised image analogy, i.e., their algorithm only needs a single pair of source and target images instead of a large training set.
Fashion style transfer receives a fashion style image as the target and generates clothing images with the desired fashion styles. The challenge of Fashion Style Transfer lies in preserving a design similar to the basic input clothing while blending in the desired style patterns. This idea is first proposed by Jiang and Fu. They tackle this problem by proposing a pair of fashion style generator and discriminator.
In addition to transferring image styles, [83, 84] extend the domain of image style to audio style, and synthesise new sounds by transferring the desired style from a target audio. The study of audio style transfer also follows the route of image style transfer, i.e., Slow Audio Style Transfer and then Fast Audio Style Transfer. Inspired by image optimisation based image style transfer, Verma and Smith propose a Slow Audio Style Transfer algorithm based on online audio optimisation. They start from a noise signal and optimise it iteratively using backpropagation. Another work improves the efficiency by transferring audio in a feed-forward manner and can produce the result in real-time.
The evaluation of NST algorithms remains an open and important problem in this field. In general, there are two major types of evaluation methodologies that can be employed in the field of NST, i.e., qualitative evaluation and quantitative evaluation. Qualitative evaluation relies on the aesthetic judgements of observers. The evaluation results are related to many factors (e.g., age and occupation of participants). Quantitative evaluation, in contrast, focuses on precise evaluation metrics, which include time complexity, loss variation, etc.
In this section, we experimentally compare different NST algorithms both qualitatively and quantitatively. We hope our study can build a standardised benchmark for this area.
[Figure: example stylised results for eight groups of content and style images (Group I to Group VIII). Rows: Content & Style; Li and Wand; Zhang and Dana; Chen and Schmidt; Huang and Belongie.]
In total, there are ten style images and forty content images. For style images, we select artworks of diversified styles, as shown in Figure 4. For example, there are impressionism, cubism, abstract, contemporary, futurism, surrealist, and expressionism artworks. Regarding the mediums, some of these artworks are painted on canvas, while others are painted on cardboard or wool, cotton, polyester, etc. For content images, we also try to select a wide variety of photos, which include animal photography, still life photography, landscape photography and portrait photography. None of the images are seen during training.
To maximise the fairness of the comparisons, we also obey the following principles during our experiment:
1) In order to cover every detail in each algorithm, we try to use the implementations provided with the published literature. For one algorithm without an official implementation provided by the authors, we use a popular open-source implementation which is also acknowledged by the authors. Except for [48, 29]
2) Since the visual effect is influenced by the content and style weight, it is difficult to compare results with different degrees of stylisation. Simply giving the same content and style weight is not an optimal solution due to the different ways of calculating losses in each algorithm (e.g., different choices of content and style layers, different loss functions). Therefore, in our experiment, we try our best to balance the content and style weight among different algorithms.
We try to use the default parameters (e.g., choice of layers, learning rate, etc.) suggested by the authors except for the aforementioned content and style weight. Although the results for some algorithms may be further improved by more careful hyperparameter tuning, we select the authors’ default parameters since we hold that the sensitivity to hyperparameters is also an important implicit criterion for comparison. For example, we cannot say an algorithm is effective if it needs heavy work to tune its parameters for each style.
There are also some other implementation details to be noted. For two of the algorithms, we use the instance normalisation strategy, which is not covered in their published papers. Also, we do not consider the diversity loss term (proposed in [45, 50]) for any algorithm, i.e., one pair of content and style images corresponds to one stylised result in our experiment. For Chen and Schmidt’s algorithm, we use the feed-forward reconstruction to reconstruct the stylised results.
Table 1: Average time (in seconds) to stylise one image at three resolutions.
|Method|256 × 256|512 × 512|1024 × 1024|Styles per model|
|Li and Wand|0.015|0.055|0.229|1|
|Zhang and Dana|0.019 (0.039)|0.059 (0.133)|0.230 (0.533)| |
|Chen and Schmidt|0.123 (0.130)|1.495 (1.520)| | |
|Huang and Belongie|0.026 (0.037)|0.095 (0.137)|0.382 (0.552)| |
Note: The fifth column shows the number of styles that a single model can produce. Times are shown both excluding (outside parentheses) and including (in parentheses) the style encoding process, since the algorithms of Zhang and Dana, Chen and Schmidt, and Huang and Belongie support storing encoded style statistics in advance to further speed up the stylisation of different content images with the same style. The time of Chen and Schmidt’s algorithm for producing 1024 × 1024 images is not shown due to the memory limitation. The speed of [48, 53] is similar to that of the algorithm they build on, since they share a similar architecture, so we do not list them redundantly in this table.
NST is an art creation process. It is difficult to define the aesthetic criterion for an artwork. Therefore, for the same stylised result, different people may have different or even opposite views. Here, we choose to present stylised results of different algorithms and leave the judgement to readers. Example stylised results are shown in Figure 5. More results can be found in the supplementary material (http://yongchengjing.com/pdf/review_supp.pdf). In Figure 5, we build several blocks to separate results of different categories of NST algorithms.
Following the content & style images, the first block contains the results of Gatys et al.’s Slow NST algorithm based on online image optimisation. The style transfer process is computationally expensive, but in return, the results are appealing in visual quality. Therefore, the algorithm of Gatys et al. is usually regarded as the gold-standard method in the NST community.
The second block shows the results of Per-Style-Per-Model Fast NST algorithms (Section 4.2). Each model only fits one style. It can be noticed that the stylised results of Ulyanov et al. and Johnson et al. are somewhat similar. This is not surprising since they share a similar idea and only differ in their detailed network architectures. The results of Li and Wand are slightly less impressive. Since their algorithm is based on a Generative Adversarial Network (GAN), the training process is, to some extent, not that stable. But we believe that GAN-based style transfer is a very promising direction, and there are already some other GAN-based works [77, 80, 86] (Section 5) in the field of NST.
The third block demonstrates the results of Multiple-Style-Per-Model Fast NST algorithms. Multiple styles are incorporated into a single model. The idea of both Dumoulin et al.’s algorithm and Chen et al.’s algorithm is to tie a small number of parameters to each style. Also, both of them build their algorithms upon the same underlying architecture. Therefore, it is not surprising that their results are visually similar. Although the results of [48, 49] are appealing, their model size becomes larger as the number of learned styles increases. In contrast, Zhang and Dana’s algorithm and Li et al.’s algorithm use a single network with the same trainable network weights for multiple styles. The model size issue is tackled, but there seems to be some interference among different styles (Group II and VII), which slightly influences the stylisation quality.
The fourth block presents the last category of Fast Style Transfer, namely Arbitrary-Style-Per-Model Fast NST algorithms. Their idea is one-model-for-all. Overall, the results of ASPM are slightly less impressive than those of other types of algorithms. This is acceptable in that a three-way trade-off between speed, flexibility and quality is common in research. Chen and Schmidt’s patch-based algorithm does not seem to combine enough style elements into the content image. Their algorithm is based on swapping similar patches. When many content patches are swapped with style patches that do not contain enough style elements, the target style will not be reflected well. Ghiasi et al.’s algorithm is data-driven and its stylisation quality depends heavily on the variety of training styles. Huang and Belongie propose to match global summary feature statistics and successfully improve the visual quality. However, their algorithm does not seem good at handling complex style patterns (Group III and VI), and its stylisation quality is still related to the variety of training styles. The algorithm of Li et al. replaces the training process with a series of transformations, but it is not effective at producing sharp details and fine strokes.
Regarding the quantitative evaluation, we mainly focus on five evaluation metrics, which are: generating time for a single content image of different sizes; training time for a single model; average loss for content images to measure how well the loss function is minimised; loss variation during training to measure how fast the model converges; style scalability to measure how large the learned style set can be.
The issue of efficiency is the focus of Fast NST algorithms. In this subsection, we compare different algorithms quantitatively in terms of stylisation speed. Table 1 shows the average time to stylise one image at three resolutions using different algorithms. In our experiment, the style images have the same size as the content images. The fifth column in Table 1 gives the number of styles that one model of each algorithm can produce: it distinguishes models that fit a single style, models that can produce multiple styles (MSPM algorithms), and models that work for any style (ASPM algorithms). The numbers reported in Table 1 are obtained by averaging the generating time of 100 images. Note that we do not include the speed of [48, 53] in Table 1, as their algorithms only scale and shift parameters on top of the algorithm of Johnson et al. The time required to stylise one image using [48, 29] is very close to that of the base algorithm under the same setting. For Chen et al.’s algorithm, since it is protected by patent and the detailed architecture design is not public, here we just attach the speed information provided by the authors for reference: On a Pascal Titan X GPU, : s; : s; : s. For Chen and Schmidt’s algorithm, the time for producing a 1024 × 1024 image is not reported due to the limit of video memory. Swapping patches for two images of this size needs more than 24 GB of video memory and thus the stylisation process is not practical. We can observe that except for [52, 54], all the other Fast NST algorithms are capable of stylising even high-resolution content images in real-time. ASPM algorithms are generally slower than PSPM and MSPM, which again demonstrates the aforementioned three-way trade-off.
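The measurement protocol described above, averaging the generating time over a set of images, can be sketched with a small timing helper. The function name, the warm-up runs and the `stylise(content, style)` callable are assumptions for illustration and do not reproduce the benchmark code used in the experiment.

```python
import time


def average_generation_time(stylise, content_images, style_image, warmup=1):
    """Return the mean wall-clock time (in seconds) to stylise one content image.

    `stylise` is a hypothetical callable (content, style) -> result. On a GPU it
    would have to block until the result is ready for the timing to be meaningful.
    """
    # Warm-up runs are excluded so that one-off setup costs do not skew the average.
    for image in content_images[:warmup]:
        stylise(image, style_image)

    start = time.perf_counter()
    for image in content_images:
        stylise(image, style_image)
    elapsed = time.perf_counter() - start
    return elapsed / len(content_images)
```

To separate the times with and without style encoding, one could call the helper once with a `stylise` that encodes the style on every call and once with a `stylise` that reuses precomputed style statistics.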
Another concern is the training time for one single model. The training time of different algorithms is hard to compare, as sometimes a model trained for just a few iterations is already capable of producing visually appealing results. So we just outline our training time of different algorithms (under the same setting) as a reference for follow-up studies. On an NVIDIA Quadro M6000, the training time for a single model is about hours for the algorithm of Johnson et al., hours for the algorithm of Ulyanov et al., hours for the algorithm of Li and Wand, hours for Zhang and Dana, and hours for Li et al. Chen and Schmidt’s algorithm and Huang and Belongie’s algorithm take much longer (e.g., a couple of days), which is acceptable since a pre-trained model can work for any style. The training time of  depends on how large the training style set is. For MSPM algorithms, the training time can be further reduced through incremental learning over a pre-trained model. For example, the algorithm of Chen et al. only needs minutes to incrementally learn a new style.
One way to evaluate Fast NST algorithms which share the same loss function is to compare their loss variation during training, i.e., the training curve comparison. It helps researchers to justify the choice of architecture design by measuring how fast the model converges and how well the same loss function can be minimised. Here we compare training curves of two popular Fast NST algorithms [42, 43] in Figure 6, since most of the follow-up works are based on their architecture designs. We remove the total variation term and keep the same objective for both algorithms. Other settings (e.g., loss network, chosen layers) are also kept the same. For the style images, we randomly select four styles from our style set and represent them in different colours in Figure 6. It can be observed that the two algorithms are similar in terms of convergence speed. Also, both algorithms minimise the content loss well during training, and they mainly differ in the speed of learning the style objective. The algorithm in  minimises the style loss better.
Another related criterion is to compare the final loss values of different algorithms over a set of test images. This metric demonstrates how well the same loss function can be minimised by using different algorithms. For a fair comparison, the loss function and other settings are also required to be kept the same. We show the results of one Slow NST algorithm and two Fast NST algorithms [42, 43] in Figure 7. The result is consistent with the aforementioned trade-off between speed and quality. Although Fast NST algorithms are capable of stylising images in real-time, they are not as good as the Slow NST algorithm in terms of minimising the same loss function.
Scalability is a very important criterion for MSPM algorithms. However, it is very hard to measure since the maximum capability of a single model is highly related to the particular set of styles. If most styles have somewhat similar patterns, a single model can produce thousands of styles or even more, since these similar styles share a somewhat similar distribution of style feature statistics. In contrast, if the style patterns vary a lot among different style images, the capability of a single model will be much smaller. But it is hard to measure how much these styles differ from each other in style patterns. Therefore, to provide the reader with a reference, here we just summarise the authors’ attempts at style scalability: the number is for , for both  and , and for .
Due to the amazing stylised results, the research of NST has led to many successful industrial applications and begun to deliver commercial benefits. In this section, we summarise these applications and present some potential usages.
One reason why NST attracts attention in both academia and industry is its popularity on social networking sites, e.g., Facebook and Twitter. A recently emerged mobile application named Prisma is one of the first industrial applications that provide the NST algorithm as a service. Due to its high stylisation quality, Prisma achieved great success and is becoming popular around the world. Before long, other applications providing the same service appeared one after another and began to deliver commercial benefits, e.g., the web application Ostagram requires users to pay for a faster stylisation speed. With the help of these industrial applications [8, 87, 88], people can create their own art paintings and share their artwork with others on Twitter and Facebook, which is a new form of social communication. There are also some related application papers: one introduces an iOS app, Pictory, which combines style transfer techniques with image filtering; another further presents the technical implementation details of Pictory; a third demonstrates the design of another GPU-based mobile app, ProsumerFX.
The application of NST in social communication reinforces the connections between people and also has positive effects on both academia and industry. For academia, when people share their own masterpieces, their comments can help researchers to further improve the algorithm. Moreover, the application of NST in social communication also drives the advances of other new techniques. For instance, inspired by the real-time requirements of NST for videos, Facebook AI Research (FAIR) first developed a new mobile-embedded deep learning system, Caffe2Go, and then Caffe2 (now merged with PyTorch), which can run deep neural networks on mobile phones. For industry, the application brings commercial benefits and promotes economic development.
Another use of NST is as a user-assisted creation tool. Although there are no popular applications yet that apply the NST technique in creation tools, we believe that this will be a promising usage in the future.
As a creation tool for painters and designers, NST can make it more convenient for a painter to create an artwork of a particular style, especially when creating computer-made artworks. Moreover, with NST algorithms, it is trivial to produce stylised fashion elements for fashion designers and stylised CAD drawings for architects in a variety of styles, which will be costly when creating them by hand.
Some entertainment applications such as movies, animations and games are probably the most application forms of NST. For example, creating an animation usually requires to painted frames per second. The production costs will be largely reduced if NST can be applied to automatically stylise a live-action video into an animation style. Similarly, NST can significantly save time and costs when applied to the creation of some movies and computer games.
There are already some application papers that describe how to apply NST to the production of movies, e.g., Joshi et al. explore the use of NST in redrawing some scenes in a movie named Come Swim, which indicates the promising potential of NST in this field.
The advances in the field of NST are impressive, and some algorithms have already found use in industrial applications. Although current algorithms achieve remarkable results, there are still several challenges and open issues. In this section, we summarise the key challenges within the field of NST and discuss possible solutions.
The most pressing challenge is probably the three-way trade-off between speed, flexibility and quality in NST. Although current ASPM algorithms successfully transfer arbitrary styles, they are not yet satisfying in perceptual quality and speed. The quality of data-driven ASPM relies heavily on the diversity of the training styles. However, one can hardly cover every style, given the great diversity of artworks. Image transformation based ASPM transfers arbitrary styles in a learning-free manner, but it lags behind the others in speed.
One of the keys to this problem may be a better understanding of the optimisation procedure in NST. The choice of optimiser (e.g., Adam and L-BFGS) in NST greatly influences the visual quality. We believe that a deeper understanding of the optimisation procedure will help in finding local minima that lead to high quality. A well-studied automatic layer selection strategy would also help improve the quality.
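To make the role of the optimiser concrete, the following is a minimal sketch and not the setup used in any of the surveyed methods. It assumes a Gram-matrix style loss and optimises a small random "feature map" directly, with no CNN involved. It compares a hand-written Adam update with plain gradient descent; L-BFGS is deliberately left out to keep the example short. All function names, sizes and step sizes are hypothetical choices made for illustration.

```python
import numpy as np


def gram(features):
    """Gram matrix of a (channels, positions) array, normalised by positions."""
    return features @ features.T / features.shape[1]


def style_loss_and_grad(features, target_gram):
    """Squared Frobenius distance between Gram matrices and its gradient."""
    diff = gram(features) - target_gram
    loss = float(np.sum(diff ** 2))
    # d/dF ||F F^T / N - T||^2 = (4 / N) (G - T) F, because G - T is symmetric.
    grad = 4.0 / features.shape[1] * diff @ features
    return loss, grad


def run_gradient_descent(start, target_gram, steps=200, lr=0.5):
    """Plain gradient descent on the toy style loss; returns the final loss."""
    x = start.copy()
    for _ in range(steps):
        _, g = style_loss_and_grad(x, target_gram)
        x -= lr * g
    return style_loss_and_grad(x, target_gram)[0]


def run_adam(start, target_gram, steps=200, lr=0.01,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with bias-corrected moment estimates; returns the final loss."""
    x = start.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        _, g = style_loss_and_grad(x, target_gram)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return style_loss_and_grad(x, target_gram)[0]


def make_toy_problem(seed=0, channels=4, positions=64):
    """Random start features and a target Gram matrix (purely synthetic)."""
    rng = np.random.default_rng(seed)
    start = rng.standard_normal((channels, positions))
    target = gram(rng.standard_normal((channels, positions)))
    return start, target


if __name__ == "__main__":
    start, target = make_toy_problem()
    print("initial loss:", style_loss_and_grad(start, target)[0])
    print("gradient descent:", run_gradient_descent(start, target))
    print("adam:", run_adam(start, target))
```

Both runs start from the same point and see the same loss, so any difference in the final value comes from the update rule alone. On real NST objectives the loss surface is far less benign than in this toy problem.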
Another important issue is the interpretability of NST algorithms. Like many other CNN-based vision tasks, NST is a black box, which makes it hard to control. Interpreting NST based on CNN feature statistics can help separate different style attributes and address the problem of finer control during stylisation. For example, current NST algorithms cannot guarantee the detailed orientations and continuities of curves in stylised results. However, brush stroke orientation is an important element in paintings, which can impress the viewer and convey the painter’s ideas. Fortunately, there is already research devoted to interpreting CNNs, which could shed light on interpretable NST.
Several studies have shown that deep classification networks are easily fooled by adversarial examples [95, 96], which are generated by applying perturbations to input images (e.g., Figure 8(c)). The emergence of adversarial examples reveals the difference between deep neural networks and the human vision system. The perturbed version of an originally correctly classified image is still recognisable to humans, but leads a deep neural network to a wrong label. Previous studies on adversarial examples mainly focus on deep classification networks. However, in Figure 8, we demonstrate that adversarial examples also exist for deep generative networks. In Figure 8(d), one can hardly recognise the semantic content that is originally contained in Figure 8(c). Countermeasures against such adversarial examples in NST could benefit from previous research on deep classification networks. Recent surveys on adversarial examples can be found in the literature.
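As a rough illustration of how a small perturbation can change a model's decision, the sketch below applies one step of the fast gradient sign method to a toy logistic classifier. This is an assumption made for illustration: the text does not state which attack produced Figure 8, and the example targets a classifier rather than a stylisation network. The weights, input values and step size are hypothetical.

```python
import math

# Hypothetical toy classifier and input, chosen only for illustration.
WEIGHTS = [2.0, -1.5, 0.5, 1.0]
BIAS = 0.0
X = [0.3, 0.1, 0.2, 0.1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm_perturb(weights, bias, x, label, epsilon):
    """One fast-gradient-sign step: shift every input value by epsilon
    in the direction that increases the cross-entropy loss."""
    p = predict(weights, bias, x)
    # For binary cross-entropy, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, signs)]


if __name__ == "__main__":
    x_adv = fgsm_perturb(WEIGHTS, BIAS, X, label=1, epsilon=0.2)
    print("clean prediction:    ", predict(WEIGHTS, BIAS, X))
    print("perturbed prediction:", predict(WEIGHTS, BIAS, x_adv))
```

Every input value moves by at most `epsilon`, yet the predicted class flips from 1 to 0 in this toy setting, because all the small shifts push the logit in the same direction.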
We believe that the lack of a gold-standard aesthetic criterion is a major reason why NST has not become a mainstream research direction like object detection and recognition. Li et al. propose to design a user study to address the aesthetic evaluation problem. This is not practical, since the results vary a lot between observers. We conduct a user study of our own and show the results in Figure 9. Given the same stylised result, different observers give quite different ratings. We believe that the problem of a standard aesthetic criterion for NST is a generalised problem of Photographic Image Aesthetic Assessment, and one could draw inspiration from related research in this area. We refer readers to existing overviews of Photographic Image Aesthetic Assessment.
Over the past several years, NST has continued to be an inspiring research area, motivated by both scientific challenges and industrial demands. A considerable amount of research has been conducted in the field of NST. Key advances in this field are summarised in Figure 2. NST is a fast-paced area, and we are looking forward to more exciting work devoted to advancing the development of this field.
While preparing this review, we were also delighted to find that related research on NST brings new inspiration to other areas and accelerates the development of the wider vision community:
1) In the area of Image Reconstruction, Ulyanov et al., building on NST, propose a novel deep image prior, which replaces the manually designed total variation regulariser of prior work with a randomly initialised deep neural network. Given a task-dependent loss function, an image and a fixed uniform noise as inputs, their algorithm can be formulated as:
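A sketch of this formulation, following the standard statement of the deep image prior. The symbols are assumptions introduced here for illustration: $\mathcal{L}$ is the task-dependent loss, $I_o$ the given image, $z$ the fixed noise input, and $g_{\theta}$ the randomly initialised network with parameters $\theta$:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}\big(g_{\theta}(z), I_o\big), \qquad I^{*} = g_{\theta^{*}}(z).$$

Only the network parameters are optimised while the noise input stays fixed, and the reconstructed image $I^{*}$ is read off as the network output at the optimum.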
One can easily notice that Equation (12) is very similar to Equation (9). In fact, the process in that work is equivalent to the training process of Fast NST when there is only one image available in the training set, but with the content image replaced by the fixed noise input and the total style transfer loss replaced by the task-dependent loss. In other words, the network in that work is actually trained to overfit one single sample.
2) Inspired by NST, Upchurch 
Promising directions for future research on NST mainly fall into three aspects. The first is to solve the existing challenges in the field of NST; descriptions of the key challenges and possible solutions have been discussed in Section 8. The second is to derive more extensions from general NST, as presented in Section 5. These extensions can benefit both academia and industry, and may even grow into a brand-new field in the future. The third is to exploit NST techniques to benefit other vision communities, as introduced in Section 9.
We would like to thank Hang Zhang, Dongdong Chen and Tian Qi Chen for providing pre-trained models for our study, and thank Xun Huang and Yijun Li for helpful discussions. We would also like to thank the anonymous reviewers for their insightful comments and suggestions.
This work is supported in part by National Key Research and Development Program (2016YFB1200203), National Natural Science Foundation of China (61572428, U1509206), Fundamental Research Funds for the Central Universities (2017FZA5014), Key Research and Development Program of Zhejiang Province (2018C01004), and Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
I. Prisma Labs, “Prisma: Turn memories into art using artificial intelligence,” 2016. [Online]. Available: http://prisma-ai.com
L.-Y. Wei and M. Levoy, “Fast texture synthesis using tree-structured vector quantization,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 479–488.
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016, pp. 694–711.
International Conference on Machine Learning, 2016, pp. 1349–1357.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.