Style transfer is an image editing task that transfers the style of one image onto a content image. Given a style image and a content image, the goal is to extract the stylistic characteristics of the former and merge them with the geometric features of the latter. While this problem has a long history in computer vision and computer graphics (e.g. [Hertzmann_etal_image_analogies_SIGGRAPH2001, Aubry_etal_fast_local_laplacian_filters_TOG2014]), it has seen a remarkable development since the seminal works of Gatys et al. [Gatys_et_al_texture_synthesis_using_CNN_2015, Gatys_et_al_image_style_transfer_cnn_cvpr2016]. These works demonstrate that the Gram matrices of the feature activations of a pre-trained VGG19 network [Simonyan_Zisserman_VGG_ICLR15] faithfully encode the perceptual style and textures of an input image. Style transfer is performed by optimizing a functional that seeks a compromise between fidelity to the VGG19 features of the content image and reproduction of the Gram matrix statistics of the style image. Other global statistics have been proven effective for style transfer and texture synthesis [Lu_Zhu_Wu_Deepframe_AAAI2016, Sendik_deep_correlations_texture_synthesis_SIGGRAPH2017, Luan_etal_deep_photo_style_transfer_cvpr2017, Vacher_etal_texture_interpolation_probing_visual_perception_NEURIPS2020, Risser_etal_stable_and_controllable_neural_texture_synthesis_and_style_transfer_Arxiv2017, Heitz_slices_Wassestein_loss_neural_texture_synthesis_CVPR2021, DeBortoli_et_al_maximum_entropy_methods_texture_synthesis_SIMODS2021, gonthier2022high], and it has been shown that a coarse-to-fine multiscale approach allows one to reproduce different levels of style detail for moderate- to high-resolution (HR) images [Gatys_etal_Controlling_perceptual_factors_in_neural_style_transfer_CVPR2017, snelgrove2017high, gonthier2022high]. The two major drawbacks of such optimization-based style transfer are the computation time and the limited image resolution due to large GPU memory requirements.
Regarding computation time, several methods have been proposed to generate new stylized images by training feed-forward networks [ulyanov2016texturenets, johnson2016Perceptual, li2016precomputed] or VGG encoder-decoder networks [chen2016fast, Huang_arbitrary_style_transfer_real_time_ICCV2017, li2017universal, li2019learning, chiu2020iterative]. These models tend to produce images with a relatively low style transfer loss and can therefore be considered approximate solutions to [Gatys_et_al_image_style_transfer_cnn_cvpr2016]. Despite remarkable progress regarding computation time, these methods suffer from GPU memory limitations due to the large size of the models used for content and style characterization, and are therefore limited in terms of resolution (generally limited to pixels (px)).
This resolution limitation was recently tackled [an2020real, Wang_2020_CVPR, Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022]. Nevertheless, although these methods generate ultra-high-resolution (UHR) images (larger than 4K), their approximate results are not able to correctly render the style at its proper resolution. Indeed, to satisfy GPU memory limitations, some methods perform the transfer locally on small patches of the content image with a zoomed-out style image ( px) [Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022]. In other methods, the multiscale nature of the networks is not fully exploited [Wang_2020_CVPR].
As illustrated in Figure 1, our high-resolution multiscale method manages to transfer the different levels of detail contained in the style image, from the colour palette and compositional style to the fine brushstrokes and canvas texture. The resulting UHR images look like authentic paintings, as can be seen in the UHR example of Figure 2.
Comparative experiments show that the results of competing methods suffer from brushstroke styles that do not match those of the UHR style image, and that very fine textures are not well transferred and are subject to local artifacts. To strengthen this visual comparison, we also introduce a qualitative and quantitative identity test that highlights how well a given texture is emulated.
The main contributions of this work are summarized as follows:
We introduce a two-step algorithm to compute the style transfer loss gradient for UHR images that do not fit in GPU memory using localized neural feature calculation.
We show that this algorithm allows a multi-resolution UHR transfer for images up to px in size.
We experimentally show that the visual quality of this UHR style transfer is richer and more faithful than recent fast but approximate solutions.
This work provides a new reference method for high-quality style transfer with unequaled multi-resolution depth. It might serve as a reference to evaluate fast but approximate models.
2 Related work
Style transfer by optimization.
As recalled in the introduction, the seminal work of Gatys et al. formulated style transfer as an optimization minimizing the distances between Gram matrices of VGG features. Other global statistics have been proven effective for style transfer and texture synthesis such as deep correlations [Sendik_deep_correlations_texture_synthesis_SIGGRAPH2017, gonthier2022high], Bures metric [Vacher_etal_texture_interpolation_probing_visual_perception_NEURIPS2020], spatial mean of features [Lu_Zhu_Wu_Deepframe_AAAI2016, DeBortoli_et_al_maximum_entropy_methods_texture_synthesis_SIMODS2021], feature histograms [Risser_etal_stable_and_controllable_neural_texture_synthesis_and_style_transfer_Arxiv2017], or even the full feature distributions [Heitz_slices_Wassestein_loss_neural_texture_synthesis_CVPR2021]. Specific cost function corrections have also been proposed for photorealistic style transfer [Luan_etal_deep_photo_style_transfer_cvpr2017]. When dealing with HR images, a coarse-to-fine multiscale strategy has been proven efficient to capture the different levels of details present in style images [Gatys_etal_Controlling_perceptual_factors_in_neural_style_transfer_CVPR2017, snelgrove2017high, gonthier2022high].
Style transfer by training feed-forward networks.
Ulyanov et al. [ulyanov2016texturenets, Ulyanov_etal_improved_texture_networks_CVPR2017] and Johnson et al. [johnson2016Perceptual] showed that one could train a feed-forward network to approximately solve style transfer. Although these models produce a very fast style transfer, they require learning a new model for each type of style.
Universal style transfer (UST).
This style limitation has been addressed by training a VGG autoencoder that attempts to invert the VGG feature computations after normalizing them at the autoencoder bottleneck. Chen et al. [chen2016fast] introduce the encoder-decoder framework with a style swap layer replacing content features with the closest style features on overlapping patches. Huang et al. [Huang_arbitrary_style_transfer_real_time_ICCV2017] propose to use an Adaptive Instance Normalization (AdaIN) layer that adjusts the mean and variance of the content image features to match those of the style image. Li et al. [li2017universal] match the covariance matrices of the content image features to those of the style image by applying whitening and coloring transforms. These operations are performed layer by layer and involve specific reconstruction decoders at each step. Sheng et al. [sheng2018avatar] use one encoder-decoder block combining the transformations of [li2017universal] and [chen2016fast]. Park et al. [park2019arbitrary] introduce an attention-based transformation module to integrate the local style patterns according to the spatial distribution of the content image. Li et al. [li2019learning] train a symmetric encoder-decoder image reconstruction module and a transformation learning module. Chiu et al. [chiu2020iterative] extend [li2017universal] by embedding a new transformation that iteratively updates features in a cascade of four autoencoder modules. Despite the numerous improvements of fast UST strategies, let us remark that: (a) they rely on matching VGG statistics as introduced by Gatys et al. [Gatys_et_al_image_style_transfer_cnn_cvpr2016], and (b) they are limited in resolution due to the GPU memory required by the large models involved.
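To make the AdaIN mechanism concrete, here is a minimal NumPy sketch of the per-channel renormalization it performs; the arrays merely stand in for encoder activations, and the actual method applies this transformation inside a VGG encoder-decoder:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization (sketch): renormalize each
    channel of the content features so that its spatial mean and std
    match those of the style features. Features have shape (h, w, m)."""
    c_mean = content_feat.mean(axis=(0, 1))
    c_std = content_feat.std(axis=(0, 1))
    s_mean = style_feat.mean(axis=(0, 1))
    s_std = style_feat.std(axis=(0, 1))
    return (content_feat - c_mean) / (c_std + eps) * s_std + s_mean
```

In the full method, a decoder then maps the renormalized features back to image space.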
UST for high-resolution images.
Some methods attempt to reduce the size of the network in order to perform high-resolution style transfer. An et al. [an2020real] propose ArtNet, a channel-wise pruned version of GoogLeNet [Szegedy_2015_CVPR]. Wang et al. [Wang_2020_CVPR] propose a collaborative distillation approach to compress the model by transferring the knowledge of a large network (VGG19) to a smaller one, hence reducing the number of convolutional filters involved in [li2017universal] and [Huang_arbitrary_style_transfer_real_time_ICCV2017]. Chen et al. [Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022] recently proposed a UHR style transfer framework where the content image is divided into patches and a patch-wise style transfer is performed from a zoomed-out version of the style image of size px.
3 Global optimization for neural style transfer
Single scale style transfer.
Let us recall the algorithm of Gatys et al. [Gatys_et_al_image_style_transfer_cnn_cvpr2016]. It solely relies on optimizing second-order statistics of VGG19 features to change the image style, while maintaining some VGG19 features to preserve the geometric features of the content image. The style is encoded through the Gram matrices of a fixed set of VGG19 style layers, while the content is encoded by the features of a single content layer.
Given a content image $u_C$ and a style image $u_S$, one optimizes the loss function
$$\mathcal{L}(x) = \mathcal{L}_{\mathrm{content}}(x, u_C) + \mathcal{L}_{\mathrm{style}}(x, u_S), \quad (1)$$
where $\mathcal{L}_{\mathrm{content}}(x, u_C) = w_c \Vert V^{l_c}(x) - V^{l_c}(u_C) \Vert_F^2$, with $V^{l}(\cdot)$ the VGG19 feature response at layer $l$ and $l_c$ the content layer, and
$$\mathcal{L}_{\mathrm{style}}(x, u_S) = \sum_{l} w_l\, \mathcal{L}^{l}_{\mathrm{Gram}}(x, u_S), \quad (2)$$
with, for each style layer $l$,
$$\mathcal{L}^{l}_{\mathrm{Gram}}(x, u_S) = \Vert G^{l}(x) - G^{l}(u_S) \Vert_F^2, \quad (3)$$
where $\Vert \cdot \Vert_F$ is the Frobenius norm and, for an image $u$ and a layer index $l$, $G^{l}(u)$ denotes the Gram matrix of the VGG19 features at layer $l$: if $V^{l}(u)$ is the feature response of $u$ at layer $l$, with spatial size $h_l \times w_l$ and $m_l$ channels, one first reshapes $V^{l}(u)$ as a matrix $F$ of size $n_l \times m_l$, with $n_l = h_l w_l$ the number of feature pixels; the associated Gram matrix is
$$G^{l}(u) = \frac{1}{n_l} F^{T} F.$$
Here $F_p$ denotes the column vector corresponding to the $p$-th row of $F$. The loss $\mathcal{L}$ is a fourth-degree polynomial, non-convex with respect to (wrt) the VGG features. Gatys et al. [Gatys_et_al_texture_synthesis_using_CNN_2015] propose to use the L-BFGS algorithm [Nocedal_updating_Quasi-Newton_matrices_with_limited_storage_1980] to minimize this loss, after initializing $x$ with the content image $u_C$. L-BFGS is an iterative quasi-Newton procedure that approximates the inverse of the Hessian using a fixed-size history of the gradient vectors computed during the last iterations. The history size is typically 100, but will be decreased to 10 for HR images (for all scales except the first one) to limit memory requirements.
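For concreteness, the Gram statistics above can be sketched in a few lines of NumPy; the random arrays below merely stand in for VGG19 feature responses, whose extraction is assumed:

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a feature map of shape (h, w, m), normalized by
    the number of feature pixels n = h * w."""
    h, w, m = feat.shape
    F = feat.reshape(h * w, m)   # n x m matrix of feature vectors
    return F.T @ F / (h * w)     # m x m Gram matrix

def gram_loss(feat_x, feat_s):
    """Squared Frobenius distance between the Gram matrices of the
    optimized image's features and the style image's features."""
    d = gram_matrix(feat_x) - gram_matrix(feat_s)
    return np.sum(d ** 2)
```

The style loss of Eq. (2) is then a weighted sum of such terms over the chosen style layers.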
Gram loss correction.
It is known that optimizing for the Gram matrix alone may introduce some loss-of-contrast artefacts, since Gram matrices encompass information regarding both the mean values and the correlations of the features [Sendik_deep_correlations_texture_synthesis_SIGGRAPH2017, Risser_etal_stable_and_controllable_neural_texture_synthesis_and_style_transfer_Arxiv2017, Heitz_slices_Wassestein_loss_neural_texture_synthesis_CVPR2021]. Instead of considering the full histogram of the features [Risser_etal_stable_and_controllable_neural_texture_synthesis_and_style_transfer_Arxiv2017, Heitz_slices_Wassestein_loss_neural_texture_synthesis_CVPR2021], we found that correcting for the mean and standard deviation (std) of each feature channel gives visually satisfying results. Given some (reshaped) features $F \in \mathbb{R}^{n \times m}$, define $\mu(F) \in \mathbb{R}^{m}$ and $\sigma(F) \in \mathbb{R}^{m}$ by
$$\mu(F) = \frac{1}{n} \sum_{p=1}^{n} F_p, \qquad \sigma(F)_k = \Big( \frac{1}{n} \sum_{p=1}^{n} \big( F_{p,k} - \mu(F)_k \big)^2 \Big)^{1/2}.$$
In the whole paper, we replace the Gram loss of Eq. (3) by the following augmented style loss
$$\mathcal{L}^{l}_{\mathrm{style}}(x, u_S) = w^{G}_{l} \Vert G^{l}(x) - G^{l}(u_S) \Vert_F^2 + w^{\mu}_{l} \Vert \mu^{l}(x) - \mu^{l}(u_S) \Vert^2 + w^{\sigma}_{l} \Vert \sigma^{l}(x) - \sigma^{l}(u_S) \Vert^2 \quad (4)$$
for a better reproduction of the feature distribution, where $\mu^{l}(u)$ and $\sigma^{l}(u)$ denote the mean and std of the reshaped features of $u$ at layer $l$. The values of all the weights (the content weight and, for each style layer, the Gram, mean, and std weights) have been fixed once for all images. Note that limiting our style loss to second-order statistics is crucial for a straightforward implementation of our localized algorithm described in Section 4.
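A minimal NumPy sketch of this augmented layer loss follows; the weights are illustrative placeholders (the paper's actual values are not reproduced here), and the random arrays stand in for reshaped VGG19 features:

```python
import numpy as np

def feature_mean_std(F):
    """Channel-wise spatial mean and std of reshaped features F (n, m)."""
    return F.mean(axis=0), F.std(axis=0)

def augmented_style_layer_loss(Fx, Fs, w_gram=1.0, w_mean=1.0, w_std=1.0):
    """One style layer's augmented loss: Gram term plus mean and std
    penalties. The weights are placeholders, not the paper's values."""
    Gx = Fx.T @ Fx / Fx.shape[0]
    Gs = Fs.T @ Fs / Fs.shape[0]
    mx, sx = feature_mean_std(Fx)
    ms, ss = feature_mean_std(Fs)
    return (w_gram * np.sum((Gx - Gs) ** 2)
            + w_mean * np.sum((mx - ms) ** 2)
            + w_std * np.sum((sx - ss) ** 2))
```

All three terms are functions of spatial averages only, which is what makes the localized computation of Section 4 possible.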
Multiscale style transfer.
Since the style transfer solely relies on VGG19, the transfer is spatially limited by the receptive field of the network [Gatys_etal_Controlling_perceptual_factors_in_neural_style_transfer_CVPR2017]. For images having a side larger than 500 px, visually richer results are obtained by adopting a multiscale approach [Gatys_etal_Controlling_perceptual_factors_in_neural_style_transfer_CVPR2017] corresponding to the standard coarse-to-fine approach for texture synthesis [Wei_Levoy_fast_texture_synthesis_2000]. For the sake of simplicity, suppose that the content image $u_C$ and the style image $u_S$ have the same resolution (otherwise one can downscale $u_S$ to match the resolution of $u_C$ as a preprocessing [Gatys_et_al_image_style_transfer_cnn_cvpr2016]). When using $K$ scales, both $u_C$ and $u_S$ are first downscaled by a factor $2^{K-1}$ to obtain the low-resolution couple $(u_C^1, u_S^1)$, and style transfer is first applied at this coarse resolution starting with $u_C^1$. Then, for each subsequent scale $k = 2$ to $K$, the result image is upscaled by a factor 2 to define the initialization image, and style transfer is applied with the content and style images downscaled by a factor $2^{K-k}$. At the last scale, the output image has the same resolution as the HR content image. Thanks to this coarse-to-fine approach, the style is transferred in a coarse-to-fine way. This is especially important when using an HR digital photograph of a painting for the style: ideally, the first scale encompasses color and large strokes, while subsequent scales refine the stroke details up to the painting texture, brush bristles, and canvas texture.
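The coarse-to-fine schedule can be sketched as follows; `transfer` is a stand-in callback for the single-scale style transfer above, and the crude block-averaging / nearest-neighbour resamplers are placeholders for proper image rescaling:

```python
import numpy as np

def multiscale_transfer(content, style, n_scales, transfer):
    """Coarse-to-fine schedule: run `transfer(init, content, style)`
    from the coarsest to the finest scale, upscaling the result by 2
    between scales. Images have shape (h, w, channels)."""
    def downscale(img, f):   # average f x f blocks (crude resampler)
        h, w = img.shape[0] // f * f, img.shape[1] // f * f
        return img[:h, :w].reshape(h // f, f, w // f, f, -1).mean(axis=(1, 3))
    def upscale2(img):       # nearest-neighbour x2 (crude resampler)
        return img.repeat(2, axis=0).repeat(2, axis=1)

    x = downscale(content, 2 ** (n_scales - 1))   # init with coarse content
    for k in range(n_scales):
        f = 2 ** (n_scales - 1 - k)
        x = transfer(x, downscale(content, f), downscale(style, f))
        if k < n_scales - 1:
            x = upscale2(x)
    return x
```

With the identity callback `lambda x, c, s: c`, the schedule returns the content image at full resolution, which is a convenient sanity check of the scale bookkeeping.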
Unfortunately, applying this multiscale algorithm off-the-shelf with UHR images is not possible in practice for images with a side larger than 4000 px, even with a high-end GPU. The main limitation comes from the fact that differentiating the loss wrt $x$ requires fitting into memory both $x$ and all its intermediate VGG19 features. While this requires a moderate 2.61 GB for a px image, the memory footprint grows linearly with the number of pixels, and scaling up to UHR sizes is not feasible even with a 40 GB GPU. In the next section we describe a practical solution to overcome this limitation.
4 Localized neural features and style transfer loss gradient
Our main contribution is to emulate the exact computation of the style transfer loss gradient even for images (larger than px) for which direct evaluation and automatic differentiation of the loss are not feasible due to their large memory requirements.
First suppose one wants to compute the VGG19 feature maps of a UHR image $u$. The natural idea developed here is to compute the feature maps piece by piece, by partitioning the input image into small sub-images that we will call blocks. This approach works up to boundary issues: to compute exactly the feature maps of $u$ at a given pixel, one needs the complete receptive field centered at that pixel. Hence, each block of the partition must be extracted with a margin area, except on the sides that are actual borders of the image $u$. In all our experiments we use a margin of width px in the image domain.
This localized way of computing features allows one to compute global feature statistics such as Gram matrices and mean and std vectors. Indeed, these statistics are all spatial averages that can be aggregated block by block, by sequentially adding the contribution of each block. Hence, this easy-to-implement procedure allows one to compute the value of the loss (1). Note, however, that it is not possible to automatically differentiate this loss, because the computation graph linking the features back to the input image is lost.
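The block-by-block aggregation of spatial averages can be sketched as below. Here the feature map is assumed to be given, whereas in the actual algorithm each block's features would be produced by VGG19 on an image block extracted with a margin:

```python
import numpy as np

def global_stats(F):
    """Gram matrix and mean of reshaped features F of shape (n, m)."""
    n = F.shape[0]
    return F.T @ F / n, F.mean(axis=0)

def blockwise_stats(feat, block=64):
    """Aggregate the Gram matrix and mean of a (h, w, m) feature map
    block by block: accumulate unnormalized sums per block and divide
    by the total pixel count once at the end."""
    h, w, m = feat.shape
    gram_sum = np.zeros((m, m))
    mean_sum = np.zeros(m)
    for i in range(0, h, block):
        for j in range(0, w, block):
            B = feat[i:i+block, j:j+block].reshape(-1, m)
            gram_sum += B.T @ B
            mean_sum += B.sum(axis=0)
    n = h * w
    return gram_sum / n, mean_sum / n
```

Because the sums are exact, the blockwise statistics coincide with the global ones regardless of the block size, which is what guarantees that the loss value (1) is computed exactly.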
Table 1: Style statistics, the corresponding feature losses, and their gradients wrt the reshaped features $F$ at pixel $p$ (the subscript $s$ denotes the statistics of the style layer).

| Global statistic | Feature loss expression | Gradient wrt the feature $F_p$ |
|---|---|---|
| Gram matrix: $G = \frac{1}{n} F^{T} F$ | Gram loss: $\Vert G - G_s \Vert_F^2$ | $\frac{4}{n} (G - G_s) F_p$ |
| Feature mean: $\mu = \frac{1}{n} \sum_{p} F_p$ | Mean loss: $\Vert \mu - \mu_s \Vert^2$ | $\frac{2}{n} (\mu - \mu_s)$ |
| Feature std: $\sigma_k = \big( \frac{1}{n} \sum_{p} (F_{p,k} - \mu_k)^2 \big)^{1/2}$ | Std loss: $\Vert \sigma - \sigma_s \Vert^2$ | $\frac{2}{n} (\sigma_k - \sigma_{s,k}) \frac{F_{p,k} - \mu_k}{\sigma_k}$ (component $k$) |
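The Gram-loss gradient formula of Table 1 can be checked numerically against finite differences on toy features; the dimensions and the target Gram matrix below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 4                          # toy number of pixels / channels
F = rng.standard_normal((n, m))       # reshaped features of the image
Gs = np.eye(m)                        # arbitrary target style Gram matrix

def gram_layer_loss(F):
    """Gram loss of one layer: || F^T F / n - Gs ||_F^2."""
    return np.sum((F.T @ F / n - Gs) ** 2)

def gram_layer_grad(F):
    """Closed-form gradient from Table 1: (4 / n) F (G - Gs).
    Row p of the result only involves F_p and the global statistic G,
    which is the locality property exploited by the algorithm."""
    return 4.0 / n * F @ (F.T @ F / n - Gs)
```

A central finite difference on any single entry of $F$ should match the corresponding entry of `gram_layer_grad(F)` up to numerical precision.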
However, a close inspection of the different style losses wrt the neural features shows that they all have the same form: for each style layer, the gradient of the layer style loss wrt the layer feature at some pixel location only depends on the local feature value and on some difference between the global statistics (Gram matrix, spatial mean, std) of the current image and the corresponding ones of the style layer. This fact is summarized in the formulas of Table 1. Exploiting this locality of the gradient, it is also possible to exactly compute the full gradient vector block by block using a two-pass procedure: the first pass computes the global VGG19 statistics of each style layer, and the second pass locally backpropagates the gradient wrt the local neural features. The whole procedure is described by Algorithm 1 and illustrated by Figure 3.
Note that the memory requirement of Algorithm 1 does not depend on the image size. Indeed, by spatially splitting all the computations involving VGG19 features, dealing with larger images only requires more computation time (since there are more blocks). However, the L-BFGS optimization does require more memory, since it stores a history of gradients, each gradient having the size of the image being optimized. Using a single 40 GB GPU, our algorithm allows for style transfer for images of size up to px.
Ultra-high resolution style transfer.
An example of UHR style transfer is displayed in Figure 2 with several highlighted details. Figure 1 illustrates the intermediate steps of our high-resolution multiscale algorithm. The result for the first scale (third column) corresponds to the one of the original paper [Gatys_et_al_image_style_transfer_cnn_cvpr2016] (except for our slightly modified style loss) and suffers from poor image resolution and grid artefacts. Note how, while progressing to the last scale, the texture of the painting gets refined and stroke details gain a natural aspect. This process is remarkably stable: the successive global style transfer results remain consistent with the one of the first scale.
We compare our method with two fast alternatives for UHR style transfer, namely collaborative distillation (CD) [Wang_2020_CVPR] and URST [Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022] (based on [li2017universal]), using their official implementations ([Wang_2020_CVPR]: https://github.com/MingSun-Tse/Collaborative-Distillation; [Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022]: https://git.io/URST). As already discussed in Section 2, URST decreases the resolution of the style image to px, so the style transfer is not performed at the proper scale and fine details cannot be transferred. As in UST methods, CD does not take into account details at different scales, but simply reduces the number of filters in the auto-encoder network through collaborative distillation to process larger images. Unsurprisingly, one observes in Figure 4 that our method is the only one capable of conveying the aspect of the painting strokes to the content image. CD suffers from halo and high-frequency artefacts, while URST presents visible patch boundaries and a detail frequency mismatch due to improper scaling. Observe also that the style transfer results are in general better when the geometric content of the style image and the content image are close, regardless of the method.
Identity test for style transfer quality assessment.
Style transfer is an ill-posed problem by nature. We introduce here an identity test to evaluate whether a method is able to reproduce a painting when using the same image for both content and style. Two examples of this sanity check are shown in Figure 5. We observe that the result of our multiscale algorithm is slightly less sharp than the original style image, yet high-resolution details from the paint texture are faithfully conveyed. In comparison, the results of [Wang_2020_CVPR] suffer from color deviation and frequency artefacts, while the results of [Chen_Wang_Xie_Lu_Luo_towards_ultra_resolution_neural_style_transfer_thumbnail_instance_normalization_AAAI2022] apply a style transfer that is too homogeneous and present color and scale issues, as already discussed. Some previous works introduce a style distance [Wang_2020_CVPR] that corresponds to the Gram loss for some VGG19 layers, showing again that fast approximate methods try to reproduce the algorithm of Gatys et al., which we extend to UHR images. Since we explicitly minimize this quantity, it is not fair to only consider this criterion for a quantitative evaluation. For this reason, we also compute the PSNR, SSIM [ssim], and LPIPS [Zhang_2018_CVPR] metrics on a set of paint styles (see supplementary material), in addition to the “Gram” metric, that is, the style loss of Equation (2) using the original Gram loss of Equation (3), computed on UHR results using our localized approach. The average scores reported in Table 2 confirm the good qualitative behaviour discussed earlier: our method is by far the best for all the scores.
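As an illustration of the identity-test scoring, the PSNR between the original painting and its identity-transfer result can be computed as follows (a standard definition, not the paper's exact evaluation code):

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    a distorted image, both as float arrays with values in [0, peak]."""
    mse = np.mean((ref - img) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

A perfect identity reproduction gives an infinite PSNR, and larger distortions give lower scores.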
This work presented an extension of the style transfer algorithm of Gatys et al. to UHR images. Regarding visual quality, our algorithm clearly outperforms competing UHR methods by conveying a true painting feel thanks to faithful HR details such as strokes, paint cracks, and canvas texture.
We must confess that our iterative method is obviously slow, even though its complexity scales linearly with image size. Yet, as we have demonstrated, fast methods do not reach a satisfying quality, and fast high-quality style transfer remains an open problem to this date.
Several extensions and applications of our work can be considered. For instance, one can perform HR texture synthesis by removing the content term [Gatys_et_al_texture_synthesis_using_CNN_2015] (see supplementary material). Our two-pass procedure can be extended to any function of the Gram matrices and feature spatial means, such as the Bures metric used in [Vacher_etal_texture_interpolation_probing_visual_perception_NEURIPS2020] for texture mixing. One could also consider extending the method to the sliced Wasserstein style loss [Heitz_slices_Wassestein_loss_neural_texture_synthesis_CVPR2021] using the Run-Sort-ReRun strategy [Lezama_etal_run_sort_rerun_ICML2021]. However, the memory required to store the five VGG feature maps (or their projections) increases linearly with the size of the input image, in contrast to the size-agnostic global statistics used in this paper.
This work opens the way for several future research directions, from allowing local control for UHR style transfer [Gatys_etal_Controlling_perceptual_factors_in_neural_style_transfer_CVPR2017] to training fast CNN-based models to reproduce our results.
Acknowledgements: B. Galerne and L. Raad acknowledge the support of the project MISTIC (ANR-19-CE40-005).
A full version with supplementary material, which presents additional style transfer results, gives an overview of the dataset of pictures used for the identity test comparison, and discusses the adaptation of the algorithm for UHR texture synthesis, is available here: