Conditional Neural Style Transfer with Peer-Regularized Feature Transform

06/07/2019 ∙ by Jan Svoboda, et al. ∙ NNAISENSE 4

This paper introduces a neural style transfer model to conditionally generate a stylized image using only a set of examples describing the desired style. The proposed solution produces high-quality images even in the zero-shot setting and allows for greater freedom in changing the content geometry. This is thanks to the introduction of a novel Peer-Regularization Layer that recomposes style in latent space by means of a custom graph convolutional layer aiming at separating style and content. Contrary to the vast majority of existing solutions, our model does not require any pre-trained network for computing perceptual losses and can be trained fully end-to-end with a new set of cyclic losses that operate directly in latent space. An extensive ablation study confirms the usefulness of the proposed losses and of the Peer-Regularization Layer, with qualitative results that are competitive with respect to the current state-of-the-art even in the challenging zero-shot setting. This opens the door to more abstract and artistic neural image generation scenarios and easier deployment of the model in production.




1 Introduction

Neural style transfer (NST), introduced by the seminal work of Gatys et al. Gatys2015NeuralStyle , is an area of research that focuses on models that transform the visual appearance of an input image (or video) to match that of a desired target image. For example, a given photo can be converted to appear as if Van Gogh himself had painted the same scene.

Owing to its general formulation, NST has seen rapid growth within the deep learning community and spans a wide spectrum of applications, e.g. converting time-of-day Zhu2017CycleGAN ; Huang2018MUNIT , mapping between artwork and photos Anoosheh2017ComboGAN ; Zhu2017CycleGAN ; Huang2018MUNIT , transferring facial expressions Karras2018StyleBasedGen , transforming between animal species Zhu2017CycleGAN ; Huang2018MUNIT , etc.

Despite their popularity and good-quality results, current NST approaches are not free from limitations. Firstly, the original formulation of Gatys et al. requires a new optimization process for each transfer performed, making it impractical for many real-world scenarios. In addition, the method relies heavily on pre-trained networks, usually borrowed from classification tasks, that are known to be sub-optimal and biased toward texture rather than structure Geirhos2018ImageNetTrained . To overcome this first limitation, deep neural networks have been proposed to approximate the lengthy optimization procedure in a single feed-forward step, thereby making the models amenable to real-time processing. Of notable mention is the work of Johnson et al. Johnson2016PerceptLosses and Ulyanov et al. Ulyanov2016 , who also later introduced Instance Normalization Ulyanov17IN , a popular feature-normalization scheme for style transfer.

Secondly, when a neural network is used to overcome the computational burden of Gatys2015NeuralStyle , a model must be trained for every desired style image, due to the limited capacity of conventional models to encode multiple styles into their weights. This greatly narrows the applicability of the method for use cases where the concept of style cannot be defined a priori and needs to be inferred from examples. With respect to this second limitation, recent works attempted to separate style and content in feature space (latent space) to allow generalization to a style characterized by an additional input image, or set of images. The most widespread work in this family is AdaIN Huang2017AdaIN , a particular case of FiLM Perez2017FiLM , introduced later the same year. The current state of the art allows, among other things, controlling the amount of stylization applied, interpolating between different styles, and using masks to convert different regions of an image into different styles Huang2017AdaIN ; Sheng2018Avatar .

Beyond the study of new network architectures for NST, research has also introduced better loss functions to train the models. The perceptual loss Gatys2015NeuralStyle ; Johnson2016PerceptLosses with a pre-trained VGG19 classifier SimonyanZ14 is usually used, as it supposedly captures high-level features of the image. However, this assumption has been recently challenged in Geirhos2018ImageNetTrained . Cycle-GAN Zhu2017CycleGAN proposed a new cycle-consistent loss that requires no one-to-one correspondence between the input and the target images, thus lifting the heavy burden of data annotation.

Image style transfer is an extremely challenging problem; moreover, the style of an image is expressed by both local properties (e.g. typical shapes of objects) and global properties (e.g. textures). We advocate modeling this hierarchy in latent space by local aggregation of pixel-wise features and by the use of metric learning to separate different styles. To the best of our knowledge, this has not been addressed explicitly by previous approaches.

In the presence of a well structured latent space where style and content are fully separated, transfer could be easily performed by exchanging the style information in latent space between the input and the conditioning style images, without the need to store the transformation in the decoder weights. Such an approach is independent with respect to feature normalization and further avoids the need for rather problematic pre-trained models.

This paper addresses the NST setting where style is externally defined by a set of input images, to allow transfer from arbitrary domains and to tackle the challenging zero-shot style-transfer scenario. It does so by introducing a novel feature regularization layer capable of recomposing global and local style content from the input style image. This inductive bias is shown to allow the network to learn how to separate style and content rather than encoding the transformation in its weights. We successfully demonstrate this in a series of zero-shot style transfer experiments, whose generated results would not be possible if the style was not actually inferred from the respective input images.

This work addresses the aforementioned limitations of NST models by making the following contributions:

  • A state-of-the-art approach to NST using a custom graph convolutional layer that recomposes style in latent space;

  • End-to-end training without the need for any pre-trained model (e.g. VGG) to compute the perceptual loss;

  • Constructing a globally- and locally-combined latent space for style information and imposing structure on it by means of metric learning.

2 Background

The key component of any NST system is the modeling and extraction of the "style" from an image (though the term is partially subjective). As style is often related to texture, a natural way to model it is to use visual texture modeling methods PaulyGreiner2009TexSynth . Such methods can either exploit texture image statistics (e.g. Gram matrix) Gatys2015NeuralStyle or model textures using Markov Random Fields (MRFs) EfrosLeung1999 . The following paragraphs provide an overview of the style transfer literature introduced by Jing2017NSTReview .

Image-Optimization-Based Online Neural Methods.

The method from Gatys et al. Gatys2015NeuralStyle may be the most representative of this category. While experimenting with representations from intermediate layers of the VGG-19 network, the authors observed that a deep convolutional network is able to extract image content from an arbitrary photograph, as well as some appearance information from works of art. The content is represented by a low-level layer of VGG-19, whereas the style is expressed by a combination of activations from several higher layers, whose statistics are described using the network features' Gram matrix. Li et al. Li2017DemystifyingNST later pointed out that the Gram matrix representation can be generalized using a formulation based on Maximum Mean Discrepancy (MMD). Using MMD with a quadratic polynomial kernel gives results very similar to the Gram matrix-based approach, while being computationally more efficient. Other non-parametric approaches based on MRFs operating on patches were introduced by Li and Wand LiWand2016MRF .
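The relation between the Gram matrix and MMD can be checked numerically. The following is a minimal NumPy sketch, not taken from any of the cited works: it verifies that the squared Frobenius distance between two Gram matrices equals an unnormalized squared MMD computed with the quadratic kernel k(x, y) = (x·y)². The feature shapes, the random inputs and all function names are assumptions made for illustration.

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, positions) -> (channels, channels) Gram matrix."""
    return features @ features.T

def gram_style_distance(feat_a, feat_b):
    """Squared Frobenius distance between the two Gram matrices."""
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.sum(diff ** 2))

def quadratic_kernel_sum(x, y):
    """Sum of k(x_i, y_j) = (x_i . y_j)^2 over all pairs of spatial positions."""
    return float(np.sum((x.T @ y) ** 2))

def mmd_quadratic(feat_a, feat_b):
    """Unnormalized squared MMD between the column vectors of feat_a and feat_b."""
    return (quadratic_kernel_sum(feat_a, feat_a)
            + quadratic_kernel_sum(feat_b, feat_b)
            - 2.0 * quadratic_kernel_sum(feat_a, feat_b))

# Hypothetical feature maps: 8 channels, 50 spatial positions each.
rng = np.random.default_rng(0)
feat_generated = rng.standard_normal((8, 50))
feat_style = rng.standard_normal((8, 50))

gram_value = gram_style_distance(feat_generated, feat_style)
mmd_value = mmd_quadratic(feat_generated, feat_style)
```

Both quantities agree up to floating-point error, because the trace of a product of Gram matrices can be rewritten as a sum of squared inner products between per-position feature vectors.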

Model-Optimization-Based Offline Neural Methods.

These techniques can generally be divided into several sub-groups Jing2017NSTReview . Per-Style-Per-Model methods need to train a separate model for each new target style Johnson2016PerceptLosses ; Ulyanov2016 ; LiWand2016Adversarial , rendering them rather impractical for dynamic use. A notable member of this family is the work by Ulyanov et al. Ulyanov17IN introducing Instance Normalization (IN), better suited for style-transfer applications than Batch Normalization (BN).

Multiple-Styles-Per-Model methods attempt to assign a small number of parameters to each style. Dumoulin et al. Dumoulin2016CIN proposed an extension of IN called Conditional Instance Normalization (CIN), StyleBank Chen2017StyleBank learns filtering kernels for different styles, and other works instead feed the style and content as two separate inputs Li2017 ; ZhangDana2017 , similarly to our approach.

Arbitrary-Style-Per-Model methods treat the style information either in a non-parametric manner, as in StyleSwap Chen2016StyleSwap , or in a parametric manner using summary statistics, as in Adaptive Instance Normalization (AdaIN) Huang2017AdaIN . AdaIN, instead of learning global normalization parameters during training, uses the channel-wise mean and standard deviation of the style image features as normalization parameters. Later, Li et al. Li2017WCT introduced a variant of AdaIN using Whitening and Coloring Transformations (WCT). Going towards zero-shot style transfer, ZM-Net WangLiang2017ZMNet proposed a transformation network with dynamic instance normalization to tackle the zero-shot transfer problem. More recently, Avatar-Net Sheng2018Avatar proposed the use of a "style decorator" to re-create content features by semantically aligning input style features with those derived from the style image.
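To make the AdaIN operation concrete, here is a minimal NumPy sketch of the commonly described formulation, in which content features are normalized per channel and then rescaled with the channel-wise mean and standard deviation of the style features. The function name, the array layout (channels, height, width) and the `eps` constant are assumptions for illustration, not details taken from this paper.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization on arrays of shape (channels, height, width)."""
    # Per-channel statistics over the spatial dimensions.
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # Remove the content statistics, then impose the style statistics.
    normalized = (content_feat - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean

# Hypothetical feature maps with different spatial sizes.
rng = np.random.default_rng(0)
content = rng.standard_normal((4, 16, 16))
style = 2.0 * rng.standard_normal((4, 12, 12)) + 3.0
stylized = adain(content, style)
```

The output keeps the spatial layout of the content features while its per-channel statistics match those of the style features, which is why no per-style parameters need to be learned.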

Other methods.

Cycle-GAN Zhu2017CycleGAN introduced a cycle-consistency loss on the reconstructed images that delivers very appealing results without the need for aligned input and target pairs. However, it still requires one model per style. The approach was extended in Combo-GAN Anoosheh2017ComboGAN , which lifted this limitation and allowed for practical multi-style transfer; however, this method also requires one decoder per style.

Sanakoyeu et al. Sanakoyeu2018StyleAware observed that applying the cycle consistency loss in image space might over-restrict the stylization process. This strict consistency loss was later relaxed by MUNIT Huang2018MUNIT , a multi-modal extension of UNIT Liu2017UNIT , that imposes it in latent space instead, providing more freedom to the image reconstruction process. Sanakoyeu et al. Sanakoyeu2018StyleAware also show how to use a set of images, rather than a single one, to better express the style of an artwork.

3 Method

The core idea of our work is a region-based mechanism to exchange the style between input and target style images, similarly to StyleSwap Chen2016StyleSwap , while preserving the semantic content. To successfully achieve this, style and content information must be fully separated (disentangled). The inductive bias of any architecture is, however, not enough to achieve the desired level of separation. We advocate the use of metric learning to directly enforce separation among different styles, and experimentally show that this greatly reduces the amount of style-dependent information that gets encoded in the decoder.

3.1 Architecture and losses

The proposed architecture is shown in Figure 1. To prevent the generator from encoding the stylization in its weights, auxiliary decoders DeFauw2019AuxDec are used during training to optimize the parameters of the encoder and decoder independently. The yellow module in Figure 1 is trained as an autoencoder (AE) Masci2011 ; ZhaoMGL15 ; MaoSY16a to reconstruct the input. The green module, instead, is trained as a GAN Goodfellow2014GAN to generate the stylized version of the input using the encoder from the yellow module, with fixed parameters. The optimization of both modules is interleaved together with the discriminator. Additionally, following the analysis from Jolicoeur-Martineau Martineau2018RAGAN , the Relativistic Average GAN (RaGAN) is used as our adversarial loss formulation, which was shown to be more stable and to produce more natural-looking images than traditionally used GAN losses.

Figure 1: The proposed architecture with two decoder modules. The yellow decoder is the auxiliary generator, while the main generator is depicted in green. Dashed lines indicate lack of gradient propagation.

Let us now describe the four main building blocks of our approach in detail (the full architecture details are found in the supplementary material). The model operates on an input image, a target image and a generated (fake) image. It consists of an encoder generating the latent representations, an auxiliary decoder taking a single latent code as input, and a main decoder taking two latent codes as inputs. Each latent code has a content part and a style part. Additionally, to impose a stronger prior on the feature similarity expected while performing reconstruction, a cycle loss is used on the encoder's middle-layer feature maps, which have double the spatial size of their subsequent latent representations.

The distance between two latent representations is defined as the smooth L1 norm Girshick2015FastRCNN in order to stabilize training and to avoid exploding gradients:


where , and are two different feature embeddings with channels and is the spatial dimension of each channel.
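As an illustration of the distance described above, the following NumPy sketch implements the standard smooth L1 function (quadratic for small differences, linear for large ones) averaged over all channels and spatial positions. The threshold `beta=1.0` and the averaging over every element are assumptions; the exact normalization used by the authors cannot be recovered from this text.

```python
import numpy as np

def smooth_l1_distance(z_a, z_b, beta=1.0):
    """Smooth L1 distance between two feature embeddings of identical shape."""
    diff = np.abs(z_a - z_b)
    # Quadratic branch near zero keeps gradients small; linear branch
    # caps the gradient magnitude for large differences.
    per_element = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    # Assumed normalization: mean over channels and spatial positions.
    return float(per_element.mean())

# Hypothetical embeddings with 2 channels of spatial size 3 x 3.
z_zero = np.zeros((2, 3, 3))
small_gap = smooth_l1_distance(z_zero, np.full((2, 3, 3), 0.5))
large_gap = smooth_l1_distance(z_zero, np.full((2, 3, 3), 3.0))
```

Because the linear branch has a bounded slope, a few badly mismatched features cannot dominate the gradient, which is the stabilizing effect the text refers to.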


Encoder.

The encoder used to produce the latent representation of all input images is composed of several strided-convolutional layers for downsampling followed by multiple ResNet blocks. The latent code is composed of two parts: the content part, which holds information about the image content (e.g. objects, position, scale, etc.), and the style part, which encodes the style that the content is presented in (e.g. level of detail, shapes, etc.). The style component of the latent code is further split equally into a local and a global half. The local half encodes local style information per pixel of each feature map, while the global half undergoes further downsampling via a small sub-network to generate a single value per feature map.
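The partitioning of the latent code can be sketched as follows. This is only an illustration: the channel counts are hypothetical, and a global average pool stands in for the small downsampling sub-network, whose actual structure is not described here.

```python
import numpy as np

def split_latent(latent, n_content):
    """Split a latent code of shape (channels, height, width) into its parts."""
    content = latent[:n_content]          # content feature maps
    style = latent[n_content:]            # remaining maps carry the style
    half = style.shape[0] // 2
    style_local = style[:half]            # per-pixel (local) style information
    # Assumption: a global average pool replaces the learned sub-network and
    # reduces each remaining feature map to a single value.
    style_global = style[half:].mean(axis=(1, 2))
    return content, style_local, style_global

# Hypothetical latent code: 8 feature maps of size 4 x 4, half of them content.
latent = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)
content, style_local, style_global = split_latent(latent, n_content=4)
```

The local half keeps its spatial layout, while the global half collapses to one scalar per feature map, matching the description of a single value per feature map.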

Auxiliary generator.

The Auxiliary generator simply reconstructs an image from its latent representation and is used only during training to train the encoder module. It is composed of several ResNet blocks followed by fractionally-strided convolutional layers to reconstruct the original image.

The loss is composed of a metric learning loss, enforcing clustering of the style part of the latent representations, a latent cycle loss, and a classical reconstruction loss. It is defined as follows:


Main generator.

This network replicates the architecture of the auxiliary generator, and uses the output of the Peer Regularized Feature Transform module (see Section 3.2). During training of the main generator the encoder is kept fixed, and the generator is optimized using the following loss:



Discriminator.

The discriminator is a convolutional network receiving two images concatenated over the channel dimension and producing a spatial map of predictions. The first image is the one to discriminate, whereas the second one serves as conditioning for the style class. Ideally, the output prediction indicates whether or not the two inputs come from the same style class. The discriminator loss is defined as:


where is the distribution of the real data and is the distribution of the generated (fake) data.
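For orientation, the sketch below shows the standard relativistic average least-squares formulation of the adversarial losses. It is an assumption that this form matches the authors' loss: the text names RaGAN in Section 3.1 and LS-GAN in the appendix, but the exact equation and the way the style conditioning enters are not recoverable here. Function names and the toy scores are illustrative.

```python
import numpy as np

def ralsgan_discriminator_loss(real_scores, fake_scores):
    """Relativistic average least-squares loss for the discriminator."""
    # Each score is judged relative to the average score of the other set.
    real_rel = real_scores - fake_scores.mean()
    fake_rel = fake_scores - real_scores.mean()
    return float(np.mean((real_rel - 1.0) ** 2) + np.mean((fake_rel + 1.0) ** 2))

def ralsgan_generator_loss(real_scores, fake_scores):
    """Generator counterpart: the targets for real and fake are swapped."""
    real_rel = real_scores - fake_scores.mean()
    fake_rel = fake_scores - real_scores.mean()
    return float(np.mean((fake_rel - 1.0) ** 2) + np.mean((real_rel + 1.0) ** 2))

# Toy prediction maps: the discriminator separates real from fake perfectly.
real_scores = np.ones((2, 4, 4))
fake_scores = np.zeros((2, 4, 4))
d_loss = ralsgan_discriminator_loss(real_scores, fake_scores)
g_loss = ralsgan_generator_loss(real_scores, fake_scores)
```

In this toy case the discriminator loss vanishes while the generator loss is large, reflecting that the generator is penalized whenever real samples look more realistic than fake ones on average.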

3.2 Peer Regularized Feature Transform (PeRFeaT)

The PeRFeaT module draws inspiration from PeerNets Svoboda2019 and the Graph Attention Layer (GAT) Velickovic2018 , and performs style transfer in latent space, taking advantage of the separation of content and style information (enforced by Equation 3). It receives the latent representations of the content and style images as input and computes the k-Nearest-Neighbors (k-NN) between the two using the Euclidean distance to induce a graph of peers.

Attention coefficients over the graph nodes are computed and used to recompose the style portion of the input latent code as a convex combination of its nearest neighbors' representations. The content portion of the latent code instead remains unchanged.

Figure 2: The Peer Regularization Layer takes as input latent representations of content and style images. The content part of the latent representation is used to induce a graph structure on the style latent space, which is then used to recombine the style part of the content image’s latent representation from the style image’s latent representation. This results in a new style latent code.

Given a pixel of feature map , its -NN graph in the space of -dimensional feature maps of all pixels of all peer feature maps is considered. The new value of the style part for the pixel is expressed as:


where denotes a fully connected layer mapping from -dimensional input to scalar output, and are attention scores measuring the importance of the th pixel of feature map to the output th pixel of feature map . The resulting style component of the input feature map is the weighted average, pixel-wise, of its peer pixel features defined over the style input image.
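The recomposition step can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: neighbors are found with the Euclidean distance on content features, a single weight vector `attn_w` plays the role of the fully connected scoring layer, a LeakyReLU and softmax (as in GAT) turn scores into attention weights, and all sizes are hypothetical.

```python
import numpy as np

def peer_regularized_style(content_in, content_peer, style_peer, attn_w, k=3):
    """
    content_in:   (P, C) content features of the input image pixels
    content_peer: (Q, C) content features of the style image pixels
    style_peer:   (Q, S) style features of the style image pixels
    attn_w:       (2C,)  weights of the scalar scoring layer (assumed form)
    Returns the recomposed style (P, S) and the attention weights (P, k).
    """
    # Pairwise squared Euclidean distances between input and peer pixels.
    d2 = ((content_in[:, None, :] - content_peer[None, :, :]) ** 2).sum(axis=-1)
    neighbors = np.argsort(d2, axis=1)[:, :k]                   # (P, k)
    peer_content = content_peer[neighbors]                      # (P, k, C)
    query = np.broadcast_to(content_in[:, None, :], peer_content.shape)
    scores = np.concatenate([query, peer_content], axis=-1) @ attn_w
    scores = np.where(scores > 0, scores, 0.2 * scores)         # LeakyReLU
    scores = scores - scores.max(axis=1, keepdims=True)         # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Convex combination of the neighbors' style features.
    new_style = (weights[..., None] * style_peer[neighbors]).sum(axis=1)
    return new_style, weights

rng = np.random.default_rng(0)
content_in = rng.standard_normal((5, 4))
content_peer = rng.standard_normal((7, 4))
style_peer = rng.standard_normal((7, 3))
attn_w = rng.standard_normal(8)
new_style, weights = peer_regularized_style(content_in, content_peer, style_peer, attn_w)
```

Since the weights are non-negative and sum to one, every recomposed style vector lies inside the convex hull of the style features of its peers, while the content features are only read and never modified.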

4 Experimental setup and Results

The proposed approach is compared against the state-of-the-art on extensive qualitative evaluations and, to support the choice of architecture and loss functions, ablation studies are performed to show the role of the various components and how they influence the final result.

4.1 Training

The dataset of Zhu2017CycleGAN , composed of a collection of photographs and four different painter collections, is used for training the model. In particular, the datasets named monet2photo, cezanne2photo, vangogh2photo and ukiyoe2photo are combined into a single dataset named painter2photo, consisting of 6280 real photos and 2560 paintings in total.

Our network can be trained end-to-end alternating optimization steps for the auxiliary generator, the main generator, and the discriminator. The loss used for training is defined as:


The loss involves the discriminator, the main generator, and the auxiliary generator (see Section 3.1). ADAM Kingma2014Adam is used as the optimizer with the learning rate set to 0.0002 and a batch size of 1; training is performed for a total of 200 epochs. In each epoch, all the real photos are visited, which results in 6280 iterations per epoch. The weighting of the reconstruction identity loss and the margin for the metric learning are kept fixed during all of our experiments. The training images are cropped and resized to a fixed resolution. Note that during testing, our method can operate on images of arbitrary size.
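The training budget implied by these settings can be worked out with a few lines of Python. The photo count, batch size and epoch count come from the text above; treating one iteration as one visit to a batch of real photos is an assumption, as is the helper name.

```python
NUM_REAL_PHOTOS = 6280   # real photos in painter2photo (Section 4.1)
BATCH_SIZE = 1
NUM_EPOCHS = 200

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to visit every sample once (ceiling division)."""
    return -(-num_samples // batch_size)

per_epoch = iterations_per_epoch(NUM_REAL_PHOTOS, BATCH_SIZE)
total_iterations = per_epoch * NUM_EPOCHS
```

With a batch size of 1, each epoch amounts to one iteration per real photo, so the full schedule runs for 6280 × 200 iterations.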

4.2 Style transfer

A set of test images from Sanakoyeu et al. Sanakoyeu2018StyleAware is stylized and compared against competing methods in Figure 3 to demonstrate arbitrary stylization of a content image given several different styles not previously seen at training time. It is important to note that our network was trained on only four different painters (Monet, Cezanne, Van Gogh and Ukiyoe). Figure 4(a) shows that it generalizes well to previously unseen styles, allowing zero-shot style transfer. Additional experiments (more results are shown in the supplementary material) showcasing the capabilities of our method are performed on the test sets of the four painter datasets compiled by Zhu2017CycleGAN . This evaluation is done on color images, and results are displayed in Figure 4(b).

Figure 3: Qualitative comparison with respect to other state-of-the-art methods. It should be noted that most of the compared methods had to train a new model for each style. While providing competitive results, our method performs zero-shot style transfer using a single model.
(a) Zero-shot style transfer.
(b) Styles seen during training.
Figure 4: Qualitative evaluation of our method for previously unseen styles (left) and for styles seen during training (right). It can be observed that the generated images are consistent with the provided target style (inferred from a single sample only), showing the good generalization capabilities of the approach.
Figure 5: Ablation study evaluating different architecture choices for our approach. It shows how a fixed content reacts to different styles (left) and vice-versa (right). Ours refers to the final approach with all losses, NoSep ignores the separation of content and style in the latent space during feature exchange in the PeRFeaT layer, NoML does not use metric learning, and NoAux makes no use of the auxiliary decoder during training.

Ablation study.

There are several key components in our solution which make arbitrary style transfer with a single model and end-to-end training possible: the auxiliary decoder used during training, which prevents degenerate solutions; the separation of the latent code into content and style, which allows the style features to be transferred while preserving the content; and, last but not least, metric learning on the style part of the latent space, which pulls different styles apart. The effect of suppressing each of them during training is examined, and results for the various models are compared in Figure 5, highlighting the importance of each component.

5 Conclusions

We proposed a novel model for neural style transfer that mitigates various limitations of current state-of-the-art methods and can be used in the challenging zero-shot transfer setting. This is done with a Peer-Regularization Layer using graph convolutions to recompose the style component of the latent representation, and with a metric learning loss enforcing separation of different styles combined with cycle consistency in feature space. An auxiliary decoder is also introduced to prevent degenerate solutions and to enrich the variability of the generated samples. The result is a state-of-the-art method that can be trained end-to-end without the need for a pre-trained model to compute the perceptual loss, thereby addressing recent concerns regarding the reliability of such features for NST. More importantly, the proposed method requires only a single encoder and a single decoder to perform transfer among arbitrary styles, contrary to many competing methods that require a decoder (and possibly an encoder) for each input and target pair. This makes our method more applicable to real-world image generation scenarios where style needs to be defined by the user.


Appendix A Network architecture

This section describes our model in detail. We describe the generator and discriminator in separate sections below.

a.1 Generator

A detailed scheme of the architecture is depicted in Figure 6. Each of the convolutional layers (in yellow) is followed by Instance Normalization (IN) [33] and a ReLU nonlinearity [26]. The PeRFeaT module uses the Peer Regularization Layer [31] with the Euclidean distance metric, k-NN neighbourhoods, and dropout on the attention weights.

The generated latent code consists of a stack of feature maps. The first group of feature maps is the content latent representation, while the remaining ones hold the style. The style latent representation is further split into halves: the first half of the feature maps is left unchanged, while the second half is passed through the Global style transform block, producing feature maps that hold the global part of the style latent representation.

The last convolutional block of the decoder is equipped with TanH nonlinearity and produces the reconstructed RGB image.

The auxiliary generator copies the architecture of the main generator, while omitting the Style transfer block (see Figure 6).

Figure 6: Detailed architecture of generator.

a.2 Discriminator

The discriminator architecture is shown in Figure 7. It takes two RGB images concatenated over the channel dimension as input and produces a map of predictions. Our implementation uses LS-GAN, and therefore there is no Sigmoid activation at the output.

To stabilize the discriminator training, we add random noise to each input, drawn from a Gaussian distribution with a given mean and standard deviation.
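A minimal NumPy sketch of this kind of input noise is given below. The default mean of 0.0 and standard deviation of 0.1 are placeholders chosen for illustration only, since the values used by the authors are not recoverable from this text; the function name is likewise hypothetical.

```python
import numpy as np

def add_input_noise(images, rng, mean=0.0, std=0.1):
    """Add element-wise Gaussian noise to a batch of discriminator inputs."""
    # Placeholder mean/std: the values used in the paper are unknown here.
    noise = rng.normal(loc=mean, scale=std, size=images.shape)
    return images + noise

# Hypothetical discriminator input: two RGB images stacked over channels.
rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(6, 32, 32))
noisy = add_input_noise(images, rng)
```

Perturbing the inputs makes the real and generated distributions overlap more, which is a common way to keep the discriminator from becoming overconfident early in training.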


Figure 7: Detailed architecture of discriminator.

Appendix B Style transfer results

This section provides more qualitative results of our style transfer approach that did not fit in the main text. Figure 8 shows generated images that illustrate the generalization of our approach to different styles and its ability to perform zero-shot style transfer. From the set of painters that are shown, only Cezanne and Van Gogh were seen during training. In addition, the images in Figure 9 show results of transfer taking a random painting from the test set of painters that were seen during training (Cezanne, Monet, Van Gogh, Ukiyoe).

Figure 8: Qualitative evaluation of our method generalizing to different, even previously unseen styles.
Figure 9: Qualitative evaluation of randomly coupled content and style images from the test set containing painter styles seen during training.

Appendix C Latent space structure

Our latent representation is split into two parts, content and style. A metric learning loss is used on the style part in order to enforce a separation of different modalities in the style latent space.


The loss compares the style parts of the latent representations of two different input images with the style parts of the latent representations of two different targets from the same target class. The margin parameter controls the separation enforced between the positive and negative scores.
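To illustrate how a margin separates positive and negative scores, the sketch below uses a generic hinge formulation. It is a hypothetical stand-in, not the authors' equation, which is not recoverable from this text: the positive score is taken as the distance between two targets of the same class, the negative score as the distance between an input and a target, and the margin value of 1.0 is a placeholder.

```python
import numpy as np

def margin_style_loss(style_input, style_target_a, style_target_b, margin=1.0):
    """Hinge loss pulling same-class styles together and pushing others apart."""
    # Positive score: two targets from the same style class should be close.
    positive = np.linalg.norm(style_target_a - style_target_b)
    # Negative score: the input style should be far from the target style.
    negative = np.linalg.norm(style_input - style_target_a)
    # The loss is zero once the negative exceeds the positive by the margin.
    return float(max(0.0, margin + positive - negative))

target_a = np.array([0.0, 0.0])
target_b = np.array([0.0, 0.5])
well_separated = margin_style_loss(np.array([3.0, 4.0]), target_a, target_b)
too_close = margin_style_loss(np.array([0.0, 1.0]), target_a, target_b)
```

Only pairs that violate the margin contribute to the loss, so styles that are already well separated do not produce any gradient.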

Figure 10 shows a 2D embedding of the style latent space generated using t-SNE [34], where different colors represent different style classes. One can observe that the photos can be separated from the painters very well. Separating the different painters into good clusters is a rather difficult problem in a very low-dimensional space, and we see that only partial separation is learned for the other styles (different painters).

c.1 Embedding ablation study

The t-SNE embeddings are shown for our ablation study experiments as well. They clearly demonstrate that without the metric learning loss, no clustering of the latent space is learned. Furthermore, without using the auxiliary decoder, the latent space is partially clustered thanks to the metric learning loss; however, the decoder cannot make any use of it. This suggests that all the stylization is hard-coded in the decoder weights and the style information of the latent representation is not taken into account. Lastly, we can observe that for the experiment where we transfer the whole latent code, the latent space is again clustered. The content is, however, preserved only partially. This is due to the fact that we compute the adjacency matrix and attention only on the content part, but then reconstruct the whole latent code as a convex combination of neighbouring nodes, instead of doing so only for the style part and preserving the original content part.

(a) Ours
(b) Without metric learning loss
(c) Without auxiliary decoder
(d) Transferring the whole latent code
Figure 10: t-SNE 2D embedding of the style latent space. The real photos are in red; different painters are represented by all the other colors. The latent space structure is shown for our final version and also for all 3 ablation study experiments we have performed.

c.2 Visualization in image space

Figure 11 visualizes the influence of the content and style parts of the latent representation after decoding back into the RGB image space. The PeRFeaT transformation, which performs the style transfer, is executed first. The resulting latent code is then modified before feeding it to the decoder. Replacing the content part with zeros gives us a rough representation of the style with only approximate shapes. On the other hand, if we replace the style part with zeros and keep the content part, a rather flat representation of the input with very sharp edges is reconstructed. This confirms that the content part focuses on the content, while the style part holds most of the stylization information.

The fact that the PeRFeaT transformation is done first means that the resulting style is mapped to the content of the content image. As a result, the decoded image slightly resembles the structure of the content image even if the content part is set to zeros.

Figure 11: Visualization of information contained in content and style parts of the latent representation. Even if the content part is set to zeros, there is still some vague resemblance to the structure of the Content image, because the PeRFeaT layer transforms partially local style features based on the content features.