Neural Comic Style Transfer: Case Study

09/05/2018 ∙ by Maciej Pęśko, et al. ∙ Politechnika Warszawska

The work by Gatys et al. [1] recently introduced a neural style algorithm that can produce an image in the style of another image. Further works introduced various improvements regarding generalization, quality and efficiency, but each of them mostly focused on styles such as paintings, abstract images or photo-realistic style. In this paper, we present a comparison of how state-of-the-art style transfer methods cope with transferring various comic styles to different images. We select different combinations of Adaptive Instance Normalization [11] and Universal Style Transfer [16] models and compare them to find their advantages and disadvantages in terms of qualitative and quantitative analysis. Finally, we present the results of a survey conducted among over 100 people that aims at validating the evaluation results in a real-life application of comic style transfer.




1 Introduction

Cartoons and comics have become a popular means of artistic expression worldwide. Unfortunately, only a limited number of talented people with painting or graphics skills can create them while holding on to aesthetic standards. What is more, it takes a significant amount of time to create valuable comic graphics. Providing an automatic tool to transfer the style of images or videos to comics could revolutionize the way publishers and individuals create comixified content.

Gatys et al. [1] showed that spectacular results in transferring style between images can be obtained using convolutional neural networks and their intermediate representations. Further works presented improved versions of the original style transfer algorithm in terms of execution speed, result quality and generalization to various styles. Although many publications have focused on various applications of style transfer, to the best of our knowledge this is the first attempt to evaluate and compare the results obtained by several methods in the context of transferring comic style.

More precisely, in this paper we compare several modern style transfer methods and evaluate their effectiveness in terms of how well they propagate various comic style characteristics between images. In our work, we focus on the most efficient methods, i.e. methods whose execution time per image does not exceed 2 seconds and that enable arbitrary style transfer with an unlimited number of possible comic styles.

The remainder of this paper is organized in the following manner. In Section 2 we describe related works that discuss the topic of style transfer. Section 3 presents a detailed overview of a selected set of state-of-the-art approaches. Section 4 describes our experimental setup, image collection process and implementation sources. Section 5 presents the results of our evaluation. Section 6 shows the results of the conducted survey. In Section 7 we draw conclusions and outline plans for further research.

2 Related Work

Gatys et al. [1] demonstrated that deep convolutional neural networks are able to encode the style information of any given image. Moreover, they showed that the content and style of an image can be separated and treated individually. As a consequence, it is possible to transfer the characteristics of the style of a given image to another one, while preserving the content of the latter. They propose to exploit the correlations between the features of deep neural networks, expressed as Gram matrices, to capture image style. Furthermore, Y. Li et al. showed that the covariance matrix can be as effective as the Gram matrix in representing image style [17] and theoretically proved that matching the Gram matrices of the neural activations can be seen as minimizing a specific Maximum Mean Discrepancy [6] function, which gives more intuition on why the Gram matrix can represent an artistic style.

Since the original work [1], numerous improvements have been made in the field of Style Transfer. Johnson et al. [3] and Ulyanov et al. [8] proposed a fast approach that increases the efficiency of style transfer by three orders of magnitude. Nevertheless, this improvement comes at the price of lower result quality. Multiple authors tried to address this shortcoming [9, 18, 22, 23] or to make the approach more generic so that different styles can be used in one model. The proposed solutions include using a conditional normalization layer that learns normalization parameters for each style [7], swapping a given content feature with the closest style feature in its local neighborhood [10], directly adjusting the content feature to match the mean and variance of the style feature [11, 12, 13], or building a meta network which takes in the style image and directly produces the corresponding image transformation network [14]. Other methods include automatic segmentation of objects and extraction of their soft semantic masks [15], or adjusting feature maps using whitening and coloring transforms [16]. There are also works on photo-realistic style transfer [19, 20], which carry the style of one photo to another while leaving the result as realistic as possible. In addition, many works focus on other areas of Style Transfer. Examples include Coherent Online Video Style Transfer [27], which describes an end-to-end network that generates consistent stylized video sequences in near real time; StyleBank [26], which uses multiple convolution filter banks, where each filter bank explicitly represents one style; and Stereoscopic Neural Style Transfer [28], which addresses 3D and AR/VR content.

It is worth mentioning that existing approaches often suffer from a trade-off between generalization, quality and efficiency. More precisely, the optimization-based approaches handle arbitrary styles with great visual quality, but their computational costs are relatively high. On the other hand, the feed-forward methods run very fast with slightly worse but acceptable quality, but they are limited to a specific, fixed number of styles. Finally, the arbitrary-style methods are fast and support multiple styles, but their quality is often worse than that of the previously proposed ones.

3 Methods Overview

In this section, we give an overview of the methods proposed for transferring the style of images. We start with the initial work of Gatys et al. [1] and continue with the follow-up works.

3.1 Style transfer

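Gatys et al. [1] use 16 convolutional and 5 pooling layers of the 19-layer VGG network [24]. They pass three images through this network: a content image, a style image and a white noise image. The content information is given by the feature maps of one layer, and the content loss is the squared-error loss between the two feature representations:

$$L_{content} = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

where $F^l_{ij}$ is the activation of the i-th filter at position $j$ in layer $l$ for the white noise image, and $P^l_{ij}$ is the corresponding activation for the content image. The style information, on the other hand, is given by the Gram matrix $G^l$ of the vectorized feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

The style loss is a squared-error loss computed between two Gram matrices obtained from specific layers for the white noise image ($G^l$) and the style image ($A^l$) passed through the network. $N_l$ is the number of feature maps, $M_l$ is the feature map size and $w_l$ are the layer weights:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2, \qquad L_{style} = \sum_l w_l E_l$$

Finally, the cost function is defined as a weighted sum of the two losses above, computed between the activations (content) and between the Gram matrices (style), and is then minimized using backpropagation:

$$L_{total} = \alpha L_{content} + \beta L_{style}$$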
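The content and style losses described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the `(N, M)` layout of the vectorized feature maps, the normalization of the style term (taken from the formulation in [1]) and the default weights `alpha` and `beta` are assumptions made for this example, not the authors' code.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix G = F F^T of vectorized feature maps F with shape (N, M)."""
    return features @ features.T

def content_loss(gen_features, content_features):
    """Half the summed squared error between two feature representations."""
    return 0.5 * np.sum((gen_features - content_features) ** 2)

def style_layer_loss(gen_features, style_features):
    """Normalized squared error between the Gram matrices of one layer."""
    n_maps, map_size = gen_features.shape
    diff = gram_matrix(gen_features) - gram_matrix(style_features)
    return np.sum(diff ** 2) / (4.0 * n_maps ** 2 * map_size ** 2)

def total_loss(gen_layers, content_layers, style_layers,
               content_layer, style_weights, alpha=1.0, beta=1e3):
    """Weighted sum of content and style losses (alpha, beta are assumed values)."""
    l_content = content_loss(gen_layers[content_layer],
                             content_layers[content_layer])
    l_style = sum(w * style_layer_loss(g, s)
                  for w, g, s in zip(style_weights, gen_layers, style_layers))
    return alpha * l_content + beta * l_style
```

In the optimization-based method this scalar would be minimized with respect to the pixels of the generated image, which is why every new image pair requires its own optimization run.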


Unfortunately, this method is very slow. Style transfer of a single 512x512 pixel image takes almost a minute on recent GPU architectures, such as NVIDIA Quadro or Titan X [3, 6]. This is because, for each pair of images, the method runs an optimization process using backpropagation to minimize the total loss. To address this shortcoming, several approaches have been proposed, including Adaptive Instance Normalization, described below.

3.2 Adaptive Instance Normalization

X. Huang and S. Belongie [11] present the first fast and arbitrary neural style transfer algorithm that resolves the trade-off between generalization, quality and efficiency. This method consists of two networks: a style transfer network and a loss network.

Figure 1: AdaIN architecture. C and S are Content and Style images and Lc and Ls are content and style losses.

The loss network is similar to the network presented in [1]. It is used to compute a total loss, which is minimized by backpropagation to fit the parameters of the style transfer network. The style transfer network has a simple encoder-decoder architecture. The encoder consists of the first few layers of a pre-trained VGG-19 network. The decoder mirrors the encoder with a few minor differences: all pooling layers are replaced by nearest-neighbor up-sampling layers and there are no normalization layers. The most interesting part is the AdaIN layer, which is placed between the encoder and the decoder. AdaIN produces the target feature maps that are fed to the decoder by aligning the mean and variance of the content feature maps with those of the style feature maps. A randomly initialized decoder is trained using the loss network to map the outputs of AdaIN to the image space in order to get a stylized image. The AdaIN architecture can be seen in Fig. 1. However, this method only aligns the most basic feature statistics, namely mean and variance. It is only an approximation of the Gram matrix approach presented in [1]. Fortunately, subsequent works have introduced some changes in this matter.
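The alignment performed by the AdaIN layer can be written as AdaIN(x, y) = σ(y) (x − μ(x)) / σ(x) + μ(y), where the statistics are computed per channel over the spatial dimensions [11]. The NumPy sketch below illustrates this single operation; the function name, the `(C, H, W)` array layout and the small `eps` added for numerical stability are assumptions of this example.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Align per-channel mean and std of content features to the style features.

    Both inputs have shape (C, H, W); their spatial sizes may differ.
    """
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps  # avoid division by zero
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # Normalize the content statistics away, then impose the style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Because only two statistics per channel are matched, the operation involves no matrix decomposition, which is consistent with its low runtime.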

3.3 Universal Style Transfer

Knowing that the covariance matrix can be as effective as the Gram matrix in representing image style [17], Y. Li et al. [16] proposed a novel approach called Universal Style Transfer (UST). It is closely related to the AdaIN method, but the intuition of UST is to match the covariance matrices of feature maps instead of aligning only the mean and variance, as proposed in [11]. UST uses a very similar encoder-decoder architecture, where the encoder is composed of the first few layers of a pre-trained VGG-19 network and the decoder mostly mirrors the encoder. However, instead of using the AdaIN layer to carry style information, it uses the Whitening and Coloring Transform (WCT). Moreover, in this method the style transfer represented by the WCT layer is not used during training. The network is only trained to reconstruct an input content image. The entire style transfer takes place in the WCT layer, which is added to the already pre-trained image reconstruction network. The training process and a single encoder-decoder network are presented in Fig. 2.

Figure 2: Training process (left hand side) and single level encoder-WCT-decoder network (right hand side). C and S are content and style images, respectively.

The Universal Style Transfer can be described in the following way. Firstly, five different encoder-decoder networks are pre-trained for image reconstruction. Each of them consists of a different number of VGG-19 layers in the encoder and the same number in the decoder. Afterwards, WCT is placed as an intermediate element between the encoder and the decoder. Given a pair of style image $S$ and content image $C$, the encoder extracts their vectorized feature maps $f_c$ and $f_s$, for the content and style images, respectively. Next, the WCT module is used to adjust $f_c$ to match the statistics of $f_s$ or, more precisely, to directly transform $f_c$ to match the covariance matrix of $f_s$. Finally, the decoder reconstructs the image using the transformed feature maps.

Another extension is to use multi-level stylization in order to match the statistics of the style at all abstraction levels. This means that the result obtained from a network that matches higher-level information is treated as the new content input image for the network that matches lower-level statistics. In the original paper [16], the UST flow consists of 5 levels of stylization, where one stylization is a single pass through the encoder-WCT-decoder network. This architecture can be seen in Fig. 3. In our work, in addition to the 5-level architecture, we also use a shallower version with 4 levels in order to try this method with less style abstraction. This means skipping the first step (Relu_5_1 - WCT - Decoder_5 in Fig. 3) of the 5-level architecture, which is responsible for transferring the highest-level statistics of the style. We believe that this modification leads to fewer distortions in the output image. We also use the UST approach with the AdaIN layer instead of WCT.
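The coarse-to-fine flow described above can be sketched as a short loop. The function below is a hypothetical illustration: `levels` is assumed to be a list of `(encoder, decoder)` callables ordered from the deepest network to the shallowest one, and `transform` stands for the feature transform (WCT or AdaIN) applied between them.

```python
def multi_level_stylize(content, style, levels, transform):
    """Apply single-level stylization repeatedly, from coarse to fine.

    levels:    list of (encoder, decoder) callables, deepest level first.
    transform: callable transform(f_c, f_s) such as WCT or AdaIN.
    """
    result = content
    for encoder, decoder in levels:
        f_c = encoder(result)   # the previous output becomes the new content input
        f_s = encoder(style)    # style features are extracted at the same depth
        result = decoder(transform(f_c, f_s))
    return result
```

Under these assumptions, the 4-level variant corresponds to calling the function with `levels[1:]`, i.e. without the deepest encoder-decoder pair, and the AdaIN variant to passing an AdaIN function as `transform`.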

Whitening and Coloring Transform. Firstly, all vectorized content feature maps $f_c$ are centered by subtracting their mean vector $m_c$. Then $f_c$ are linearly transformed using the Whitening Transform so that we obtain $\hat{f}_c$, in such a way that the feature maps are uncorrelated and each of them has a variance equal to 1.0. Afterwards, all vectorized style feature maps $f_s$ are centered by subtracting their mean vector $m_s$. Subsequently, the Coloring Transform is performed. Essentially, it is the inverse of the Whitening step that is used to transform $f_c$ to $\hat{f}_c$: $\hat{f}_c$ are transformed to a vector $\hat{f}_{cs}$ with a specified covariance matrix corresponding to the vectorized style feature maps $f_s$. Finally, $\hat{f}_{cs}$ is re-centered with the mean vector $m_s$, and the output is a vector $\hat{f}_{cs}$, which is the vector $f_c$ transformed to match the covariance matrix of $f_s$.
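The whitening and coloring steps can be illustrated with an eigendecomposition of the two covariance matrices. The sketch below is a minimal NumPy version under stated assumptions: feature maps are stored as `(C, N)` arrays, covariances are normalized by `N - 1`, and a small `eps` is added to the diagonal for numerical stability. These choices and the function name are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def wct(f_c, f_s, eps=1e-5):
    """Whitening and coloring transform on vectorized feature maps.

    f_c: content features of shape (C, N_c); f_s: style features of shape (C, N_s).
    """
    m_c = f_c.mean(axis=1, keepdims=True)
    m_s = f_s.mean(axis=1, keepdims=True)
    fc = f_c - m_c                      # center content features
    fs = f_s - m_s                      # center style features

    eye = np.eye(f_c.shape[0])
    cov_c = fc @ fc.T / (fc.shape[1] - 1) + eps * eye
    cov_s = fs @ fs.T / (fs.shape[1] - 1) + eps * eye

    d_c, e_c = np.linalg.eigh(cov_c)    # symmetric eigendecomposition
    d_s, e_s = np.linalg.eigh(cov_s)

    # Whitening: remove correlations so the covariance becomes (almost) identity.
    whitened = e_c @ np.diag(d_c ** -0.5) @ e_c.T @ fc
    # Coloring: impose the style covariance, then re-center with the style mean.
    colored = e_s @ np.diag(d_s ** 0.5) @ e_s.T @ whitened
    return colored + m_s
```

The two eigendecompositions are the expensive part of this transform compared with the simple per-channel statistics used by AdaIN.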

Figure 3: Universal Style Transfer architecture with the whole multi level pipeline. Each level of stylization consists of single encoder-WCT-decoder network with different decreasing number of VGG layers. C and S are content and style images, respectively.

3.4 Photorealistic Image Stylization

In our paper, we would like to evaluate different types of neural style transfer models in the field of comic style transfer. Therefore, we also evaluate the Photorealistic Image Stylization (PHOTO-R) method proposed by Y. Li et al. [19]. Their method is strongly connected with UST, with some minor differences. However, those slight changes cause their model to give more photorealistic output images. The PHOTO-R algorithm consists of two steps that can be seen in Fig. 4. The first step (F1) is a modified version of UST with four levels of stylization. Fewer stylization levels mean that less abstract information is transferred, so more details of the input image are preserved. Moreover, F1 also differs from UST by using unpooling layers instead of upsampling layers. This change is crucial to the results obtained by this model and causes the model to preserve much more spatial information. The second step (F2) is a smoothing process that removes visible artifacts and returns a photorealistic image. F2 smooths pixels in a local neighborhood and ensures that if those pixels have similar content, they are stylized in a similar way.

Figure 4: PHOTO-R architecture. S and C are style and content images.

4 Experimental Setup

For our comparison we choose five state-of-the-art methods, discussed in the previous section. Each of them performs fast and arbitrary style transfer. The selected methods are:

  • Adaptive Instance Normalization (AdaIN) [11]

  • Universal Style Transfer (UST-WCT) [16]

  • Shortened Universal Style Transfer with four levels of stylization instead of five (UST-WCT4)

  • Universal Style Transfer with Adaptive Instance Normalization layer [11] instead of Whitening and Coloring Transform [16] (UST-AdaIN)

  • Photorealistic Image Stylization (PHOTO-R) [19]

We use pretrained models provided by the authors of the above-mentioned methods. For AdaIN, UST-AdaIN, UST-WCT and PHOTO-R we use the authors' implementations, and for UST-WCT4 we use a TensorFlow implementation with a minor modification that implements the 4-level instead of the 5-level architecture.

We set the parameter controlling the style-content trade-off as indicated in the original works: 1.0 for AdaIN and UST-AdaIN and 0.6 for UST-WCT and UST-WCT4. We prepare 20 different style images and 20 different content images collected from the Internet. We select style images to represent different comic styles. Content pictures are chosen randomly from the results of an image search on Google. All images are rescaled to 600x450 pixels and evaluated with all five models using a 12GB TITAN X GPU.
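In both [11] and [16], the parameter controlling the style-content trade-off is a weight that linearly blends the transformed features with the original content features before decoding. A minimal sketch of this blending step is shown below; the function name is hypothetical and the feature arrays are assumed to have identical shapes.

```python
import numpy as np

def blend_features(f_content, f_stylized, alpha):
    """Interpolate between content features and stylized features.

    alpha = 0.0 keeps the content features, alpha = 1.0 uses full stylization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(f_stylized) + (1.0 - alpha) * np.asarray(f_content)
```

With the settings above, the AdaIN-based models decode fully stylized features (`alpha = 1.0`), while the WCT-based models keep 40% of the content features (`alpha = 0.6`).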

5 Results

5.1 Qualitative Results

In this section, we compare the selected style transfer methods. Fig. 5 shows example results obtained with the different approaches. The AdaIN method returns images that are the closest to cartoons among all examined approaches. Unfortunately, it often leaves the content colors or some mixed hues that do not fit the style (especially UST-AdaIN). UST-WCT and UST-WCT4 give results with more appropriate and stylistically coherent colors. However, the results of those models seem to be stylized too much, which often leads to significant distortions in those pictures. Moreover, those approaches sometimes produce multiple blurs and stains. PHOTO-R, on the other hand, returns very photo-realistic images. It seems that this model transfers almost only color information, which is definitely not what is expected for Comic Style Transfer. Furthermore, one common disadvantage of all methods is that they often carry style and color in an inadequate way. For example, in the last two rows of Fig. 5 there are a lot of red blurs in all result images, while the red color should only correspond to Mickey Mouse's shorts and the Little Mermaid's hair.

In addition, we use the SSIM index [25] to compare the obtained results with the corresponding style and content images. The resulting values are presented in Table 1. In terms of content comparison, we obtain the largest SSIM value for the PHOTO-R method, which is expected due to its photo-realistic nature. Moreover, the greater SSIM values for the AdaIN and UST-WCT4 methods compared to UST-WCT and UST-AdaIN are probably caused by fewer higher-level statistics being carried by these approaches (the result images are less distorted in comparison with UST-WCT and UST-AdaIN). Unfortunately, the SSIM index does not work well for style comparison. All approaches give very similar SSIM values, so we cannot conclude which one performs better. The largest value for PHOTO-R only confirms this observation, as this method transfers almost only color information without higher-level statistics. This, in turn, is not what we expect for Style Transfer comparison.

Method       SSIM (style)   SSIM (content)
AdaIN        0.135419       0.282729
UST-AdaIN    0.157140       0.214915
UST-WCT      0.146336       0.191519
UST-WCT4     0.134137       0.267988
PHOTO-R      0.161805       0.651781
Table 1: Average SSIM index values between result images and content/style images for specific approaches.
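To make the SSIM index more concrete, the sketch below computes a simplified, single-window variant over whole grayscale images. This is an assumption-laden illustration: the standard index [25] averages the same expression over local windows, the text does not state which implementation was used for Table 1, and the constants `0.01` and `0.03` are the values commonly used with SSIM.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two grayscale images of equal shape.

    Simplification: statistics are computed over the whole image rather than
    averaged over local windows as in the standard SSIM index.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2       # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2       # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

The index equals 1 for identical images and decreases as luminance, contrast or structure diverge, which is why a method that mostly recolors the content image scores high against the content image.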
Figure 5: Results from different style transfer methods. Columns: a) style images, b) content images, c) AdaIN, d) PHOTO-R, e) UST-AdaIN, f) UST-WCT, g) UST-WCT4.

5.2 Quantitative Results

In addition to the qualitative assessment, we also tested the efficiency of each method. In this subsection, we compare the selected approaches in terms of stylization speed. Table 2 shows the average time needed to stylize one image with a resolution of 600x450 pixels. The content and style images have the same size. We report the average runtime over 400 executions on non-repeating input images. Our tests are performed on a machine with a 12GB TITAN X GPU. The third column in Table 2 represents the number of styles that a single model of each algorithm can produce. All methods except PHOTO-R give very competitive results. The slight differences are connected with the complexity of each model. For example, the AdaIN layer in the AdaIN and UST-AdaIN methods is computationally cheaper than WCT in the UST-WCT and UST-WCT4 approaches, which is caused by the eigenvalue decomposition step in WCT [16]. Only the PHOTO-R model is noticeably slower than the others, at about 1.7 seconds per image, but it still stays below the 2-second limit adopted in this work.

Method       Time (s) for 600x450 pixel images   Styles/Model
AdaIN        0.201
UST-AdaIN    0.378
UST-WCT      0.831
UST-WCT4     0.510
PHOTO-R      1.697
Table 2: Average speed comparison of selected Fast Neural Style Transfer algorithms for images of size 600x450 pixels (on 12GB TITAN X GPU).
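An average per-image runtime of this kind can be measured with a simple timing loop. The helper below is a hypothetical sketch using only the Python standard library: `stylize` stands for any of the evaluated models wrapped as a function of a content and a style image, and `image_pairs` for the list of input pairs.

```python
import time

def average_runtime(stylize, image_pairs):
    """Return the mean wall-clock time in seconds of stylize(content, style)."""
    if not image_pairs:
        raise ValueError("at least one image pair is required")
    start = time.perf_counter()
    for content, style in image_pairs:
        stylize(content, style)
    return (time.perf_counter() - start) / len(image_pairs)
```

When the model runs on a GPU, the framework usually executes kernels asynchronously, so `stylize` should block until the result is available (for instance by copying the output back to host memory); otherwise the measured time may be too low.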

6 Survey

To evaluate the selected approaches we conducted a survey among over 100 participants living in Poland, aged between 18 and 70 and with different occupations, including students, retirees and others. It took us two weeks to gather all the votes. We prepared different examples of comic style transfer using the five chosen models. In each question we asked which image transfers the comic style from a given image to the presented photo in the best way. The order of the answers was randomized in each question in order to prevent participants from choosing the same position every time. The main goal was to obtain subjective opinions from as many different people as possible. The quality measures were the ability to preserve content information, the style transfer itself and how "comic" the final image is.

Survey results are presented in Fig. 6. As we can see, the best results were obtained by the AdaIN approach. This confirms our assumption that this method gives results that are the closest to cartoons or comics in terms of stylistic similarity. The second and third places went to UST-WCT4 and UST-WCT, respectively. The better results for UST-WCT4 are probably caused by fewer higher-level statistics being carried by this model compared to the UST-WCT approach, so the result images are less distorted.

Figure 6: Survey results.

7 Conclusion

In this work, we compared five different Neural Style Transfer approaches and how they cope with transferring comic style. The experimental results demonstrate the pros and cons of the evaluated approaches. Each of them suffers from similar problems, such as blur effects or inappropriate color transfer, and we believe that there is still room for improvement. As future work we intend to introduce Inter-layer Correlations (as described by M. Yeh et al. [18]) to all evaluated methods in order to check whether they can improve Comic Style Transfer quality.