Neural style transfer (NST), introduced by the seminal work of Gatys et al. Gatys2015NeuralStyle , is an area of research that focuses on models that transform the visual appearance of an input image (or video) to match that of a desired target image; for example, converting a given photo to appear as if Van Gogh himself had painted the same scene.
NST, due to its general formulation, has seen exponential growth within the deep learning community and spans a wide spectrum of applications, e.g. converting time-of-day Zhu2017CycleGAN ; Huang2018MUNIT , mapping between artwork and photos Anoosheh2017ComboGAN ; Zhu2017CycleGAN ; Huang2018MUNIT , transferring facial expressions Karras2018StyleBasedGen , transforming between animal species Zhu2017CycleGAN ; Huang2018MUNIT , etc.
Despite their popularity and good-quality results, current NST approaches are not free from limitations. Firstly, the original formulation of Gatys et al. requires a new optimization process for each transfer performed, making it impractical for many real-world scenarios. In addition, the method relies heavily on pre-trained networks, usually borrowed from classification tasks, that are known to be sub-optimal and biased toward texture rather than structure Geirhos2018ImageNetTrained . To overcome this first limitation, deep neural networks have been proposed that approximate the lengthy optimization procedure in a single feed-forward step, thereby making the models amenable to real-time processing. Of notable mention is the work of Johnson et al. Johnson2016PerceptLosses and Ulyanov et al. Ulyanov2016 , who later also introduced Instance Normalization Ulyanov17IN , a popular feature-normalization scheme for style transfer.
Secondly, when a neural network is used to overcome the computational burden of Gatys2015NeuralStyle , a model must be trained for every desired style image due to the limited capacity of conventional models in encoding multiple styles into their weights. This greatly narrows the applicability of the method in use cases where the concept of style cannot be defined a priori and needs to be inferred from examples. With respect to this second limitation, recent works attempted to separate style and content in feature space (latent space) to allow generalization to a style characterized by an additional input image, or set of images. The most widespread work in this family is AdaIN Huang2017AdaIN , a particular case of FiLM Perez2017FiLM , introduced later the same year. The current state-of-the-art allows, among other things, controlling the amount of stylization applied, interpolating between different styles, and using masks to convert different regions of an image into different styles Huang2017AdaIN ; Sheng2018Avatar .
Research, beyond the study of new network architectures for NST, has also introduced better loss functions to train the models. The perceptual loss Gatys2015NeuralStyle ; Johnson2016PerceptLosses with a pre-trained VGG19 classifier SimonyanZ14 is usually used as it, supposedly, captures high-level features of the image. However, this assumption has recently been challenged in Geirhos2018ImageNetTrained . Cycle-GAN Zhu2017CycleGAN proposed a new cycle-consistency loss that requires no one-to-one correspondence between the input and the target images, thus lifting the heavy burden of data annotation.
Image style transfer is an extremely challenging problem; moreover, the style of an image is expressed by both local properties (e.g. typical shapes of objects) and global properties (e.g. textures). We advocate modeling this hierarchy in latent space by local aggregation of pixel-wise features and by the use of metric learning to separate different styles. To the best of our knowledge, this has not been explicitly addressed by previous approaches.
Given a well-structured latent space where style and content are fully separated, transfer can be easily performed by exchanging the style information in latent space between the input and the conditioning style images, without the need to store the transformation in the decoder weights. Such an approach is independent of feature normalization and further avoids the need for rather problematic pre-trained models.
This paper addresses the NST setting where style is externally defined by a set of input images, allowing transfer from arbitrary domains and tackling the challenging zero-shot style-transfer scenario by introducing a novel feature regularization layer capable of recomposing global and local style content from the input style image. This inductive bias is shown to allow the network to learn to separate style and content rather than encoding the transformation in its weights. We successfully demonstrate this in a series of zero-shot style transfer experiments, whose generated results would not be possible if the style were not actually inferred from the respective input images.
This work addresses the aforementioned limitations of NST models by making the following contributions:
A state-of-the-art approach to NST using a custom graph convolutional layer that recomposes style in latent space;
End-to-end training without the need for any pre-trained model (e.g. VGG) to compute the perceptual loss;
Constructing a globally- and locally-combined latent space for style information and imposing structure on it by means of metric learning.
The key component of any NST system is the modeling and extraction of the "style" of an image (though the term is partially subjective). As style is often related to texture, a natural way to model it is via visual texture modeling methods PaulyGreiner2009TexSynth . Such methods either exploit texture image statistics (e.g. the Gram matrix) Gatys2015NeuralStyle or model textures using Markov Random Fields (MRFs) EfrosLeung1999 . The following paragraphs provide an overview of the style transfer literature, following the taxonomy introduced by Jing2017NSTReview .
Image-Optimization-Based Online Neural Methods.
The method from Gatys et al. Gatys2015NeuralStyle may be the most representative of this category. While experimenting with representations from intermediate layers of the VGG-19 network, the authors observed that a deep convolutional network is able to extract image content from an arbitrary photograph, as well as some appearance information from works of art. The content is represented by a low-level layer of VGG-19, whereas the style is expressed by a combination of activations from several higher layers, whose statistics are described using the network features' Gram matrix. Li et al. Li2017DemystifyingNST later pointed out that the Gram matrix representation can be generalized using a formulation based on Maximum Mean Discrepancy (MMD). Using MMD with a quadratic polynomial kernel gives results very similar to the Gram matrix-based approach, while being computationally more efficient. Other non-parametric approaches based on MRFs operating on patches were introduced by Li and Wand LiWand2016MRF .
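As a concrete illustration (a hedged sketch, not code from Gatys2015NeuralStyle ), the Gram-matrix style statistic can be computed as follows, assuming a C x H x W activation tensor from one network layer:

```python
import numpy as np

def gram_matrix(features):
    # features: C x H x W activations from one network layer.
    # The Gram matrix captures channel co-occurrence statistics and
    # discards spatial arrangement, hence "texture" rather than "structure".
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.T) / (h * w)  # normalize by number of spatial positions
```

The style loss of Gatys2015NeuralStyle then compares Gram matrices of the generated and style images across several VGG layers.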
Model-Optimization-Based Offline Neural Methods.
These techniques can generally be divided into several sub-groups Jing2017NSTReview . Per-Style-Per-Model methods need to train a separate model for each new target style Johnson2016PerceptLosses ; Ulyanov2016 ; LiWand2016Adversarial , rendering them rather impractical for dynamic use. A notable member of this family is the work by Ulyanov et al. Ulyanov17IN , introducing Instance Normalization (IN), better suited to style-transfer applications than Batch Normalization (BN).
Multiple-Styles-Per-Model methods attempt to assign a small number of parameters to each style. Dumoulin et al. Dumoulin2016CIN proposed an extension of IN called Conditional Instance Normalization (CIN), StyleBank Chen2017StyleBank learns filtering kernels for different styles, and other works instead feed the style and content as two separate inputs Li2017 ; ZhangDana2017 , similarly to our approach.
Arbitrary-Style-Per-Model methods treat the style information either in a non-parametric manner, as in StyleSwap Chen2016StyleSwap , or in a parametric manner using summary statistics, as in Adaptive Instance Normalization (AdaIN) Huang2017AdaIN . AdaIN, instead of learning global normalization parameters during training, uses the first- and second-moment statistics of the style image features as normalization parameters. Later, Li et al. Li2017WCT introduced a variant of AdaIN using Whitening and Coloring Transforms (WCT). Going towards zero-shot style transfer, ZM-Net WangLiang2017ZMNet proposed a transformation network with dynamic instance normalization to tackle the zero-shot transfer problem. More recently, Avatar-Net Sheng2018Avatar proposed the use of a "style decorator" to re-create content features by semantically aligning input style features with those derived from the style image.
Cycle-GAN Zhu2017CycleGAN introduced a cycle-consistency loss on the reconstructed images that delivers very appealing results without the need for aligned input and target pairs. However, it still requires one model per style. The approach was extended in Combo-GAN Anoosheh2017ComboGAN , which lifted this limitation and allowed for practical multi-style transfer; however, this method also requires a decoder per style.
Sanakoyeu et al. Sanakoyeu2018StyleAware observed that applying the cycle-consistency loss in image space might over-restrict the stylization process. This strict consistency loss was later relaxed by MUNIT Huang2018MUNIT , a multi-modal extension of UNIT Liu2017UNIT , which imposes it in latent space instead, providing more freedom to the image reconstruction process. Sanakoyeu et al. Sanakoyeu2018StyleAware also show how to use a set of images, rather than a single one, to better express the style of an artwork.
The core idea of our work is a region-based mechanism to exchange the style between input and target style images, similarly to StyleSwap Chen2016StyleSwap , while preserving the semantic content. To successfully achieve this, style and content information must be fully separated, i.e. disentangled. The inductive bias of an architecture alone is, however, not enough to achieve the desired level of separation. We advocate the use of metric learning to directly enforce separation among different styles, and experimentally show this to greatly reduce the amount of style-dependent information that gets encoded in the decoder.
3.1 Architecture and losses
The proposed architecture is shown in Figure 1. To prevent the generator from encoding the stylization in its weights, auxiliary decoders DeFauw2019AuxDec are used during training to optimize the parameters of the encoder and decoder independently. The yellow module in Figure 1 is trained as an autoencoder (AE) Masci2011 ; ZhaoMGL15 ; MaoSY16a to reconstruct the input. The green module, instead, is trained as a GAN Goodfellow2014GAN to generate the stylized version of the input using the encoder from the yellow module, with fixed parameters. The optimization of both modules is interleaved together with that of the discriminator. Additionally, following the analysis of Jolicoeur-Martineau Martineau2018RAGAN , the Relativistic Average GAN (RaGAN) is used as our adversarial loss formulation, which was shown to be more stable and to produce more natural-looking images than traditionally used GAN losses.
Let us now describe the four main building blocks of our approach in detail (the full architecture details are found in the supplementary material). We denote an input, a target, and a fake image, respectively. Our model consists of an encoder generating the latent representations, an auxiliary decoder taking a single latent code as input, and a main decoder taking two latent codes as inputs. Generated latent codes are denoted . We further denote the content and style part of the latent code, respectively. Additionally, to impose a stronger prior on the feature similarity expected while performing reconstruction, a cycle loss is used on the encoder's middle-layer feature maps, denoted , which have double the spatial size of their subsequent latent representations.
The distance between two latent representations is defined as the smooth L1 norm Girshick2015FastRCNN in order to stabilize training and to avoid exploding gradients:
where , and are two different feature embeddings with channels and is the spatial dimension of each channel.
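Since the equation itself was lost in extraction, the following is a hedged sketch of such a smooth-L1 latent distance, assuming the standard Fast R-CNN form Girshick2015FastRCNN averaged over all channel and spatial entries:

```python
import numpy as np

def smooth_l1_distance(z1, z2):
    # Element-wise smooth L1 (Huber-like) penalty: quadratic near zero
    # for stable gradients, linear for large residuals to avoid explosion.
    d = np.abs(z1 - z2)
    per_elem = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return per_elem.mean()  # average over channels and spatial positions
```

The 0.5 offset in the linear branch makes the penalty continuous and differentiable at the switch point.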
The encoder used to produce the latent representation of all input images is composed of several strided convolutional layers for downsampling followed by multiple ResNet blocks. The latent code is composed of two parts: the content part, , which holds information about the image content (e.g. objects, position, scale), and the style part, , which encodes the style in which the content is presented (e.g. level of detail, shapes). The style component of the latent code is further split equally into . Here, encodes local style information per pixel of each feature map, while undergoes further downsampling via a small sub-network to generate a single value per feature map.
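A minimal sketch of this latent split (the tensor shapes and the pooling stand-in for the downsampling sub-network are our assumptions, not the paper's exact implementation):

```python
import numpy as np

def split_latent(z, n_content):
    # z: C x H x W latent feature maps produced by the encoder.
    z_content = z[:n_content]      # content part: objects, position, scale
    z_style = z[n_content:]        # style part, split equally in two
    half = z_style.shape[0] // 2
    z_local = z_style[:half]       # local style: one value per pixel
    # Global style: the paper uses a small downsampling sub-network;
    # global average pooling stands in for it in this sketch.
    z_global = z_style[half:].mean(axis=(1, 2))  # one value per feature map
    return z_content, z_local, z_global
```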
The auxiliary generator simply reconstructs an image from its latent representation and is used only during training to train the encoder module. It is composed of several ResNet blocks followed by fractionally-strided convolutional layers that reconstruct the original image. The loss is composed of a metric learning loss enforcing clustering of the style part of the latent representations, a latent cycle loss, and a classical reconstruction loss. It is defined as follows:
The main generator replicates the architecture of the auxiliary generator, and uses the output of the Peer Regularized Feature Transform module (see Section 3.2). During training of the main generator the encoder is kept fixed, and the generator is optimized using the following loss:
The discriminator is a convolutional network receiving two images concatenated along the channel dimension and producing a map of predictions. The first image is the one to discriminate, whereas the second serves as conditioning for the style class. The output prediction is ideally if the two inputs come from the same style class and otherwise. The discriminator loss is defined as:
where is the distribution of the real data and is the distribution of the generated (fake) data.
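A hedged sketch of the RaGAN discriminator objective Martineau2018RAGAN on raw logits (logistic variant; the paper's exact weighting may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(real_logits, fake_logits):
    # The relativistic average discriminator scores each real sample
    # against the *average* fake score, and vice versa, instead of
    # judging real/fake samples in absolute terms.
    r_rel = real_logits - fake_logits.mean()
    f_rel = fake_logits - real_logits.mean()
    return -(np.log(sigmoid(r_rel)).mean()
             + np.log(1.0 - sigmoid(f_rel)).mean())
```

The generator loss is obtained symmetrically, swapping the roles of the real and fake batches.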
3.2 Peer Regularized Feature Transform (PeRFeaT)
The PeRFeaT module draws inspiration from PeerNets Svoboda2019 and the Graph Attention Layer (GAT) Velickovic2018 and performs style transfer in latent space, taking advantage of the separation of content and style information (enforced by Equation 3). It receives and as input and computes the -nearest neighbors (k-NN) between and using the Euclidean distance to induce a graph of peers.
Attention coefficients over the graph nodes are computed and used to recompose the style portion of as a convex combination of its nearest neighbors' representations. The content portion of the latent code instead remains unchanged, resulting in: .
Given a pixel of feature map , its -NN graph in the space of -dimensional feature maps of all pixels of all peer feature maps is considered. The new value of the style part for the pixel is expressed as:
where denotes a fully connected layer mapping from a -dimensional input to a scalar output, and are attention scores measuring the importance of the th pixel of feature map to the output th pixel of feature map . The resulting style component of the input feature map is the pixel-wise weighted average of its peer pixel features defined over the style input image.
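To make the mechanism concrete, here is a hedged numpy sketch of this recomposition; a softmax over negative distances stands in for the learned fully-connected attention scoring, which is our simplification:

```python
import numpy as np

def perfeat_style(z_style, peer_style, k=3):
    # z_style:    N x C per-pixel style vectors of the content image.
    # peer_style: M x C per-pixel style vectors of the style image.
    out = np.empty_like(z_style)
    for i, q in enumerate(z_style):
        d = np.linalg.norm(peer_style - q, axis=1)  # Euclidean distances
        nn = np.argsort(d)[:k]                      # k-NN graph of peers
        w = np.exp(-d[nn])
        w /= w.sum()                                # convex attention weights
        out[i] = w @ peer_style[nn]                 # recomposed style vector
    return out
```

The content portion of the latent code would be passed through unchanged, as in the paper.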
4 Experimental setup and Results
The proposed approach is compared against the state-of-the-art in extensive qualitative evaluations, and, to support the choice of architecture and loss functions, ablation studies are performed to show the role of the various components and how they influence the final result.
The dataset of Zhu2017CycleGAN , composed of a collection of photographs and four different painter collections, is used for training the model. In particular, the datasets named monet2photo, cezanne2photo, vangogh2photo and ukiyoe2photo are combined into a single dataset named painter2photo, consisting of 6280 real photos and 2560 paintings in total.
Our network is trained end-to-end, alternating optimization steps for the auxiliary generator, the main generator, and the discriminator. The loss used for training is defined as:
is used as the optimizer with the learning rate set to 0.0002 and a batch size of 1; training is performed for a total of 200 epochs. In each epoch, all the real photos are visited, which fixes the number of iterations per epoch. The weighting of the reconstruction identity loss and the margin for the metric learning loss are kept fixed during all of our experiments. The training images are cropped and resized to pixels. Note that during testing, our method can operate on images of arbitrary size.
4.2 Style transfer
A set of test images from Sanakoyeu et al. Sanakoyeu2018StyleAware is stylized and compared against competing methods in Figure 3 (inputs of size px) to demonstrate arbitrary stylization of a content image given several different styles not previously seen at training time. It is important to note that our network was trained on only four different painters (Monet, Cezanne, Van Gogh and Ukiyoe). Figure 4(a) shows that it generalizes well to previously unseen styles, allowing zero-shot style transfer. Additional experiments showcasing the capabilities of our method (more results are shown in the supplementary material) are performed on the test sets of the four painter datasets compiled by Zhu2017CycleGAN . This evaluation is done on color images of size , and results are displayed in Figure 4(b).
Several key components of our solution make arbitrary style transfer with a single model and end-to-end training possible: the auxiliary decoder used during training, which prevents degenerate solutions; the separation of the latent code into content and style, which allows transferring the style features while preserving the content; and, last but not least, metric learning on the latent space, which enforces style-class separation and allows pulling different styles apart. The effect of suppressing each of these during training is examined, and results for the various models are compared in Figure 5, highlighting the importance of each component.
We proposed a novel model for neural style transfer that mitigates various limitations of current state-of-the-art methods and that can be used in the challenging zero-shot transfer setting. This is done with a Peer-Regularization Layer using graph convolutions to recompose the style component of the latent representation, and with a metric learning loss enforcing separation of different styles, combined with cycle consistency in feature space. An auxiliary decoder is also introduced to prevent degenerate solutions and to enrich the variability of the generated samples. The result is a state-of-the-art method that can be trained end-to-end without the need for a pre-trained model to compute the perceptual loss, therefore lifting recent concerns regarding the reliability of such features for NST. More importantly, the proposed method requires only a single encoder and a single decoder to perform transfer among arbitrary styles, contrary to many competing methods that require a decoder (and possibly an encoder) for each input and target pair. This makes our method more applicable to real-world image generation scenarios where style needs to be defined by the user.
-  Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. Combogan: Unrestrained scalability for image domain translation. CoRR, abs/1712.06909, 2017.
-  Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stylebank: An explicit representation for neural image style transfer. CoRR, abs/1703.09210, 2017.
-  Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style. CoRR, abs/1612.04337, 2016.
-  Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.
-  A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1033–1038, September 1999.
-  Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with auxiliary decoders. CoRR, abs/1903.04933, 2019.
-  Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
-  Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
-  Ross B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
-  Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
-  Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.
-  Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. CoRR, abs/1804.04732, 2018.
-  Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, and Mingli Song. Neural style transfer: A review. CoRR, abs/1705.04058, 2017.
-  Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. CoRR, abs/1603.08155, 2016.
-  Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. CoRR, abs/1807.00734, 2018.
-  Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. CoRR, abs/1812.04948, 2018.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  Chuan Li and Michael Wand. Combining markov random fields and convolutional neural networks for image synthesis. CoRR, abs/1601.04589, 2016.
-  Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. CoRR, abs/1604.04382, 2016.
-  Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. CoRR, abs/1701.01036, 2017.
-  Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Diversified texture synthesis with feed-forward networks. CoRR, abs/1703.01664, 2017.
-  Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. CoRR, abs/1705.08086, 2017.
-  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017.
-  Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using convolutional auto-encoders with symmetric skip connections. CoRR, 2016.
-  Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Pages 52–59, Springer Berlin Heidelberg, 2011.
-  Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807–814. Omnipress, 2010.
-  Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. CoRR, abs/1709.07871, 2017.
-  Artsiom Sanakoyeu, Dmytro Kotovenko, Sabine Lang, and Björn Ommer. A style-aware content loss for real-time HD style transfer. CoRR, abs/1807.10201, 2018.
-  Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  Jan Svoboda, Jonathan Masci, Federico Monti, Michael M. Bronstein, and Leonidas J. Guibas. Peernets: Exploiting peer wisdom against adversarial attacks. CoRR, abs/1806.00088, 2018.
-  Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. CoRR, abs/1603.03417, 2016.
-  Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. CoRR, abs/1701.02096, 2017.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
-  Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Alejandro Romero, Pietro Lió, and Yoshua Bengio. Graph attention networks. CoRR, abs/1710.10903, 2018.
-  Hao Wang, Xiaodan Liang, Hao Zhang, Dit-Yan Yeung, and Eric P. Xing. Zm-net: Real-time zero-shot image manipulation network. CoRR, abs/1703.07255, 2017.
-  Li-Yi Wei, Sylvain Lefebvre, Vivek Kwatra, and Greg Turk. State of the Art in Example-based Texture Synthesis. In M. Pauly and G. Greiner, editors, Eurographics 2009 - State of the Art Reports, 2009.
-  Hang Zhang and Kristin J. Dana. Multi-style generative network for real-time transfer. CoRR, abs/1703.06953, 2017.
-  Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. CoRR, 2015.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.
Appendix A Network architecture
This section describes our model in detail. We describe the generator and discriminator in separate sections below.
A detailed scheme of the architecture is depicted in Figure 6. Each of the convolutional layers (in yellow) is followed by Instance Normalization (IN) and a ReLU nonlinearity. The PeRFeaT module uses a Peer Regularization Layer with the Euclidean distance metric, k-NN with nearest neighbours, and dropout on the attention weights of .
The generated latent code consists of feature maps of size . The first feature maps form the content latent representation, while the remaining are used for the style. The style latent representation is further split into halves: the first feature maps are left unchanged, while the second feature maps are passed through the Global style transform block, producing feature maps of size that hold the global part of the style latent representation.
The last convolutional block of the decoder is equipped with TanH nonlinearity and produces the reconstructed RGB image.
The auxiliary generator copies the architecture of the main generator, while omitting the Style transfer block (see Figure 6).
The discriminator architecture is shown in Figure 7. It takes two RGB images concatenated along the channel dimension as input and produces a map of predictions. Our implementation uses the LS-GAN formulation and therefore there is no Sigmoid activation at the output.
Appendix B Style transfer results
This section provides more qualitative results of our style transfer approach that did not fit in the main text. Figure 8 shows images generated at resolution and demonstrates the generalization of our approach to different styles and its ability to perform zero-shot style transfer. Of the painters shown, only Cezanne and Van Gogh were seen during training. In addition, the images in Figure 9 were generated at resolution and show results of transfer taking a random painting from the test set of painters that were seen during training (Cezanne, Monet, Van Gogh, Ukiyoe).
Appendix C Latent space structure
Our latent representation is split into two parts, and , content and style respectively. A metric learning loss is used on the style part in order to enforce a separation of the different modalities in the style latent space.
where , are the style parts of the latent representations of two different input images and , are the style parts of the latent representations of two different targets from the same target class. The parameter is the margin we enforce on the separation of the positive and negative scores.
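As the equation itself was lost in extraction, the following is a hedged sketch of such a margin-based style loss, assuming a standard triplet formulation (the paper's exact pairing of positives and negatives may differ):

```python
import numpy as np

def style_margin_loss(same_a, same_b, other, margin=1.0):
    # same_a, same_b: style codes from the same style class (pulled together).
    # other: a style code from a different class, pushed at least
    # `margin` further away than the positive pair.
    d_pos = np.linalg.norm(same_a - same_b)
    d_neg = np.linalg.norm(same_a - other)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the classes are separated by more than the margin, so it only shapes the latent space where styles still overlap.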
Figure 10 shows a 2D embedding of the style latent space generated using t-SNE , where different colors represent different style classes. One can observe that the photos are separated from the painters very well. Separating different painters into clean clusters is a rather difficult problem in such a low-dimensional space; we see that only partial separation is learned for the other styles (different painters).
C.1 Embedding ablation study
The t-SNE embeddings are shown for our ablation study experiments as well. They clearly demonstrate that without the metric learning loss, no clustering of the latent space is learned. Furthermore, without the auxiliary decoder, the latent space is partially clustered thanks to the metric learning loss; however, the decoder cannot make any use of it. This suggests that all the stylization is hard-coded in the decoder weights and the style information of the latent representation is not taken into account. Lastly, we observe that in the experiment where the whole latent code is transferred, the latent space is again clustered, but the content is only partially preserved. This is because we compute the adjacency matrix and attention only on the content part, but then reconstruct the whole latent code as a convex combination of neighbouring nodes, instead of doing so only for the style part and preserving the original content part.
C.2 Visualization in image space
Figure 11 visualizes the influence of the and parts of the latent representation after decoding back into RGB image space. The PeRFeaT transformation, which performs the style transfer, is executed first. The resulting latent code is then modified before being fed to the decoder. Replacing the with 's gives a rough representation of the style with only approximate shapes. On the other hand, if we replace with 's and keep , a rather flat representation of the input with very sharp edges is reconstructed. This confirms that focuses on the content, while holds most of the stylization information.
The fact that the PeRFeaT transformation is performed first means that the resulting style is mapped to the content of the content image. As a result, the decoded image slightly resembles the structure of the content image even if is set to 's.