Improving Style Transfer with Calibrated Metrics

10/21/2019, by Mao-Chuang Yeh, et al.

Style transfer methods produce a transferred image which is a rendering of a content image in the manner of a style image. We seek to understand how to improve style transfer. To do so requires quantitative evaluation procedures, but the current evaluation is qualitative, mostly involving user studies. We describe a novel quantitative evaluation procedure. Our procedure relies on two statistics: the Effectiveness (E) statistic measures the extent to which a given style has been transferred to the target, and the Coherence (C) statistic measures the extent to which the original image's content is preserved. Our statistics are calibrated to human preference: targets with larger values of E (resp. C) will reliably be preferred by human subjects in comparisons of style (resp. content). We use these statistics to investigate the relative performance of a number of Neural Style Transfer (NST) methods, revealing several intriguing properties. Admissible methods lie on a Pareto frontier (i.e. improving E reduces C or vice versa). Three methods are admissible: Universal style transfer produces very good C but weak E; modifying the optimization used for Gatys' loss produces a method with strong E and strong C; and a modified cross-layer method has slightly better E at strong cost in C. While the histogram loss improves the E statistics of Gatys' method, it does not make the method admissible. Surprisingly, style weights have relatively little effect in improving EC scores, and most variability in the transfer is explained by the style itself (meaning experimenters can be misguided by selecting styles).


1 Introduction

In this paper, we seek to identify factors that lead to strong style transfers. To do so, we construct a comprehensive quantitative evaluation procedure for style transfer methods. We evaluate style transfers on two criteria. Effectiveness (E) measures whether transferred images have the desired style, using a divergence between Convolutional Neural Network (CNN) feature-layer distributions of the synthesized image and the style image. Coherence (C) measures whether the synthesized images respect the underlying decomposition of the content image into objects, using established procedures together with the Berkeley segmentation dataset BSDS500 [1]. Both our E and C measures are calibrated by user studies.

Our quantitative metrics focus on Parametric Neural Methods (under the taxonomy of NST techniques [16]). Non-Parametric Methods can generate feature statistics that differ substantially from those of the original style image, because their patterns are fitted to the content image; they are intrinsically different from Parametric methods. Therefore, it does not make sense to require the two types of methods to be evaluated by the same metric at this stage.

Contributions: We present E and C measures of style transferred images (see Fig. 1). Our measures are highly effective at predicting user preferences. We use our measures to compare several style transfer methods quantitatively. Our study suggests that controlling a cross-layer loss is helpful, particularly if one uses the cross-layer covariance matrix (rather than the Gram matrix). Our study suggests that, despite the analysis of Risser et al. [29], the main problem with Gatys' method is optimization rather than symmetry; modifying the optimization leads to an extremely strong method. Gatys' method is unstable with high style weights, and we construct explicit models of the symmetry groups for Gatys' style loss and the cross-layer style loss (improving over Risser et al., who could not construct the groups), which may explain this effect. Our study suggests that, even for the best methods we investigated, the effect of the choice of style image is strong, meaning that it is dangerous for experimenters to select style images when reporting results.

2 Related work

Style transfer: bilinear models [26], non-parametric methods [8], image analogies [13] and adjusting filter statistics [2, 25] are capable of image style transfer and texture synthesis. Gatys et al. demonstrated that producing neural network layers with particular summary statistics (i.e. Gram matrices) yielded effective texture synthesis [11]. Gatys et al. achieved style transfer by searching for an image that satisfies both style texture summary statistics and content constraints [9]. This work has been much elaborated [17, 28, 5, 7, 27, 14, 19, 20, 6, 24, 22, 10, 18, 4, 15]. Novak and Nikulin noticed that cross-layer Gram matrices reliably produce an improvement in style transfer [23]. However, their work was an exploration of variants of style transfer rather than a thorough study to gain insight into style summary statistics; since then, the method has been ignored in the literature.

Style transfer evaluation: style transfer methods are currently evaluated mostly by visual inspection of a small set of style and content image pairs. To our knowledge, there are no quantitative protocols for evaluating the competence of style transfer apart from user studies [20] (who also investigate edge coherence between content and stylized images).

Gram matrix symmetries in a style transfer loss function occur when there is a transformation available that changes the style-transferred image without changing the value of the loss function. Risser et al. note instability in Gatys' method; the symptoms are poor and good style transfers of the same style to the same content with about the same loss value [29]. They supply evidence that this behavior can be controlled by adding a histogram loss, which breaks the symmetry. They do not write out the symmetry group, regarding it as too complicated ([29], pp. 4-6). Gupta et al. [12] link instability in Gatys' method to the size of the trace of the Gram matrix.

2.1 Gatys Method and Notation

We review the original work of Gatys et al. [9] in detail to introduce notation. Gatys finds an image whose early convolutional feature layers match the lower layers of the style image and whose higher layers match the higher layers of a content image. Write $s$ (resp. $c$, $n$) for the style (resp. content, new) image, and $\alpha$, $\beta$ for parameters balancing the style and content losses ($\mathcal{L}_s$ and $\mathcal{L}_c$ respectively). Occasionally, we will write $n(s, c, \mathrm{method})$ for the image resulting from style transfer using the given method applied to the arguments. We obtain $n$ by finding

$$\hat{n} = \arg\min_{n} \; \alpha \, \mathcal{L}_s(n, s) + \beta \, \mathcal{L}_c(n, c).$$

Losses are computed on a network representation with $L$ convolutional layers, where the $l$'th layer produces a feature map of size $H_l \times W_l \times C_l$ (resp. height, width, and channel number). We partition the layers into three groups (style, content and target). Then we reindex the spatial variables (height and width) and write $f^l_{k,p}$ for the response of the $k$'th channel at the $p$'th location in the $l$'th convolutional layer. The content loss is

$$\mathcal{L}_c(n, c) = \frac{1}{2} \sum_{l} \sum_{k, p} \left[ f^l_{k,p}(n) - f^l_{k,p}(c) \right]^2$$

(where $l$ ranges over content layers). The within-layer Gram matrix for the $l$'th layer is

$$G^l_{ij}(x) = \sum_{p} f^l_{i,p}(x) \, f^l_{j,p}(x).$$

Write $w_l$ for the weight applied to the $l$'th layer. Then

$$\mathcal{L}_s(n, s) = \sum_{l} w_l \sum_{i,j} \left[ G^l_{ij}(n) - G^l_{ij}(s) \right]^2,$$

where $l$ ranges over style layers. Gatys et al. use Relu1_1, Relu2_1, Relu3_1, Relu4_1, and Relu5_1 as style layers, and layer Relu4_2 for the content loss, and search for $\hat{n}$ using L-BFGS [21]. From now on, we write R51 for Relu5_1, etc.
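
To make the loss concrete, here is a minimal PyTorch sketch of the within-layer Gram matrix and the resulting style and content losses. It is our illustration rather than the reference implementation; the per-layer normalization and the random tensors standing in for VGG features are assumptions for the example.

```python
import torch

def gram_matrix(feat):
    """feat: (C, H, W) feature map from one layer; returns the (C, C) Gram matrix."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)          # reindex the spatial variables into one axis p
    return f @ f.t() / (h * w)          # G_ij = sum_p f_{i,p} f_{j,p} (normalized here)

def style_loss(feats_n, feats_s, layer_weights):
    """Weighted sum of squared Gram differences over the style layers (R11..R51)."""
    loss = 0.0
    for fn, fs, w in zip(feats_n, feats_s, layer_weights):
        loss = loss + w * ((gram_matrix(fn) - gram_matrix(fs)) ** 2).sum()
    return loss

def content_loss(feat_n, feat_c):
    """Squared feature difference at the content layer (R42 in the paper)."""
    return 0.5 * ((feat_n - feat_c) ** 2).sum()

# Usage with random stand-ins for VGG feature maps of the new, style and content images.
feats_n = [torch.randn(64, 32, 32), torch.randn(128, 16, 16)]
feats_s = [torch.randn(64, 32, 32), torch.randn(128, 16, 16)]
feat_c = torch.randn(64, 32, 32)
total = 1e3 * style_loss(feats_n, feats_s, [1.0, 1.0]) + content_loss(feats_n[0], feat_c)
print(total)
```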

2.2 Cross-layer style loss

We consider a style loss that takes into account between-layer statistics. The cross-layer, additive (XL) loss is obtained as follows. Consider layers $l$ and $m$, both style layers, with decreasing spatial resolution. Write $u^m$ for an upsampling of the $m$'th layer's feature map to the spatial resolution of the $l$'th layer, and consider

$$G^{l,m}_{ij}(x) = \sum_{p} f^l_{i,p}(x) \, u^m_{j,p}(x)$$

as the cross-layer Gram matrix. We can form a style loss

$$\mathcal{L}_s(n, s) = \sum_{(l,m) \in \mathcal{P}} \sum_{i,j} \left[ G^{l,m}_{ij}(n) - G^{l,m}_{ij}(s) \right]^2$$

(where $\mathcal{P}$ is a set of pairs of style layers). We can substitute this loss into the original style loss and minimize as before. All results here used a pairwise descending strategy, where one constrains each layer and its successor (i.e. (R51, R41); (R41, R31); etc.). Alternatives include an all-distinct-pairs strategy, where one constrains all pairs of distinct layers. Carefully controlling weights for each layer's style loss is not necessary in the cross-layer Gram matrix scenario.

3 Base Statistics for Quantitative Evaluation

Ideally, a style transfer method should meet two basic tests: (1) the method produces images in the desired style – E statistics; (2) the resulting images respect the underlying decomposition of the content image into objects – C statistics.

Base E statistics: In general, we want to measure the similarity of two distributions, one derived from the style image, the other from the transferred image. At each layer (e.g. the R41 feature map), we first project both the style image's and the transferred image's summary statistics to a low-dimensional representation. We then treat these representations as the parameters of Gaussian distributions and apply a standard KL divergence to measure the distance. The same procedure is repeated for the other layers (i.e. R11, R21, R31 and R51).

Specifically, the projection matrix at each layer is found as follows: we first collect a set of content images (we use the 200 test images from BSDS500 [1]) and obtain their convolutional feature covariance matrices from a pretrained VGG model. Similar to the Gram matrix, a feature covariance matrix is computed by

$$\Sigma^l_{ij}(x) = \sum_{p} \left[ f^l_{i,p}(x) - \mu^l_i(x) \right] \left[ f^l_{j,p}(x) - \mu^l_j(x) \right],$$

where $\mu^l_i(x)$ and $\mu^l_j(x)$ are the $i$'th and $j$'th elements of the channel-wise feature mean at layer $l$. Then the average covariance matrix $\bar{\Sigma}^l$ is computed by element-wise averaging of all content images' covariance matrices at layer $l$. We decompose $\bar{\Sigma}^l$ via singular value decomposition and keep the $d_l$ eigenvectors corresponding to the largest eigenvalues. These eigenvectors form our projection basis $P^l$, which is fixed.

Given an image $x$, its low-dimensional summary statistic representation at layer $l$ becomes

$$\hat{\mu}^l(x) = P^{l\top} \mu^l(x), \qquad \hat{\Sigma}^l(x) = P^{l\top} \Sigma^l(x) P^l.$$

We treat $\hat{\mu}^l(x)$ and $\hat{\Sigma}^l(x)$ as the parameters $\mu$ and $\Sigma$ of a $d_l$-dimensional Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. The $l$'th base E statistic is the negative KL divergence between the $l$'th-layer Gaussian of the transferred image $n$ and the $l$'th-layer Gaussian of the style image $s$, where the KL divergence is

$$D_{\mathrm{KL}}\!\left(\mathcal{N}_n \,\|\, \mathcal{N}_s\right) = \frac{1}{2} \left[ \operatorname{tr}\!\left(\Sigma_s^{-1} \Sigma_n\right) + (\mu_s - \mu_n)^{\top} \Sigma_s^{-1} (\mu_s - \mu_n) - d_l + \ln \frac{\det \Sigma_s}{\det \Sigma_n} \right].$$

We choose dimensions of 18, 100, 128, 280, and 256 for the low-dimensional representations of layers R11, R21, R31, R41, and R51, respectively. We project the statistics onto a low-dimensional representation for two reasons: (1) there are numerical problems if we use the full-rank covariance matrix in the KL divergence formula, e.g. there is a term involving the ratio of two eigenvalues, both of which could be close to zero; (2) we believe some channels in the feature map cannot effectively capture image style.

An alternative approach might be to use style images to compute the projection matrices; our rationale for using content images is that the resulting basis is "general": it has not been adapted to our style images, for example. The procedure works because, in summary statistics, layer feature vectors tend to have significant redundancies which are shared across all images.
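
A compact sketch of the base E computation at a single layer follows; the SVD-based projection and the Gaussian KL divergence are as described above, while the epsilon regularization and the synthetic features are our own additions for the example.

```python
import numpy as np

def layer_stats(feat):
    """feat: (C, P) features at one layer; returns channel mean and covariance."""
    return feat.mean(axis=1), np.cov(feat)

def projection_basis(content_feats, d):
    """Average the covariance matrices of many content images, keep the top-d eigenvectors."""
    avg = np.mean([layer_stats(f)[1] for f in content_feats], axis=0)
    _, _, vt = np.linalg.svd(avg)
    return vt[:d].T                          # (C, d) fixed projection basis

def base_E(feat_n, feat_s, P, eps=1e-6):
    """Negative KL divergence between projected Gaussians of transferred and style features."""
    mu_n, cov_n = layer_stats(feat_n)
    mu_s, cov_s = layer_stats(feat_s)
    mu_n, mu_s = P.T @ mu_n, P.T @ mu_s
    d = P.shape[1]
    cov_n = P.T @ cov_n @ P + eps * np.eye(d)
    cov_s = P.T @ cov_s @ P + eps * np.eye(d)
    inv_s = np.linalg.inv(cov_s)
    diff = mu_s - mu_n
    _, logdet_s = np.linalg.slogdet(cov_s)
    _, logdet_n = np.linalg.slogdet(cov_n)
    kl = 0.5 * (np.trace(inv_s @ cov_n) + diff @ inv_s @ diff - d + logdet_s - logdet_n)
    return -kl

rng = np.random.default_rng(0)
content_feats = [rng.standard_normal((64, 500)) for _ in range(20)]
P = projection_basis(content_feats, d=18)
print(base_E(rng.standard_normal((64, 500)), rng.standard_normal((64, 500)), P))
```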

Base C statistics measure the extent to which style transfer methods preserve "objectness" in the content image. We treat object boundaries as a vital cue for human perception, because boundaries are contours which represent the exchange of pixel ownership between objects. In other words, if an object's boundary is recognized in the transferred version of the image, then this object's intrinsic coherence is preserved by the style transfer method. Probability of boundary (Pb), a density distribution of contours on the image plane, is the form of output produced by contour detection methods. In this paper we use an off-the-shelf method by Arbelaez et al. [1]. A common metric for contour detection over an image is the F-score, the harmonic mean of precision and recall between the Pb map and a human-drawn contour map. The maximum F-score is taken over the precision-recall curve and is used as the final contour detection score of an image. Our recipe for generating the base C statistic for a transferred image starts by applying a standard contour detection method to it. The Pb map is then used to compute the maximum F-score, which we treat as the C measurement. We think this is fair because standard contour detection methods were not developed with transferred images in scope. For source content images and human-annotated ground-truth contour maps we use the 200 test images from BSDS500 [1].
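
The base C statistic itself is just the maximum F-score over the Pb thresholds; a small sketch is below. Producing the Pb map and matching it to human contours requires a contour detector such as [1], which we do not reproduce, so the precision and recall values here are placeholders.

```python
import numpy as np

def max_f_score(precision, recall):
    """precision, recall: arrays over Pb thresholds; returns the maximum harmonic mean."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    f = np.where(denom > 0, 2 * precision * recall / np.where(denom > 0, denom, 1), 0.0)
    return float(f.max())

# Toy precision/recall values swept over thresholds on a Pb map.
print(max_f_score([0.9, 0.7, 0.5], [0.2, 0.5, 0.8]))   # ~0.615
```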

4 Calibrated Measures from Base Statistics

The proposed EC statistics offer a quantitative measurement of style transfer methods and are meant to provide insight in the search for better style transfer methods. Yet one should calibrate such measurements against actual user preferences over transferred images. We conduct two surveys (an E-test for style and a C-test for content, Fig. 2) to help calibrate the EC statistics.

In both surveys, users are presented with a pair of transferred images which differ only in the style transfer method, or in the same method's optimization parameters (e.g. style weight, optimization iterations), while the content and style images are the same. In the E-test, users are asked to choose the transferred image that better captures the style. The transferred images are randomly selected from the transferred results of the same style-content pair. Similarly, in the content study, users are asked to choose the image that more closely resembles the content image, but the provided image pairs are chosen to have relatively high E statistics (details below). This selection is manual, to ensure that only seemingly plausible transferred images are used for the C-test. Pilot studies provided evidence that human preferences could be accurately predicted using our EC statistics.

Figure 2: On the left, a typical screen from the C-test; a user must select which target has content most like the given content image. On the right, a typical screen from the E-test; a user must select which target has style most like the given style image. In the C-test, transferred images are selected to have reasonably good E statistics.

4.1 Calibration with User Studies

Calibration method: From the produced EC statistics, we construct per-image measurements that directly predict human preferences. We first compare transferred images by comparing scores derived from their EC statistics. The difference of scores between two transferred images (referred to as image 1 and image 2) is used to predict the probability that one is preferred by the user over the other. We obtain such predictions using binary logistic regression. The scores are calibrated if the predictions of preference are accurate; e.g. if image 1 has score $e_1$ and image 2 has score $e_2$, then the probability that image 1 will be preferred by a user is predicted by $1 / (1 + \exp[-(e_1 - e_2)])$. We seek one such score for effectiveness (which should predict the results of the style study) and another for coherence (which should predict the results of the content user study).

Scores and logistic models: For each image, we have a random variable $y$ that indicates whether this image is preferred by a human within a transferred image pair; we also have a vector of features $\mathbf{x}$ chosen from some combination of the base C statistic and the five base E statistics. Given a pair of images ($\mathbf{x}^{(1)}$ for image 1, etc.), we can fit the logistic regression model

$$\Pr(\text{image 1 preferred}) = \frac{1}{1 + \exp\!\left[-\boldsymbol{\beta}^{\top}\!\left(\mathbf{x}^{(1)} - \mathbf{x}^{(2)}\right)\right]},$$

which yields a per-image score $\boldsymbol{\beta}^{\top} \mathbf{x}$. The choice of an admissible logistic model for user calibration is important: (a) the model should predict human preferences accurately; (b) the model should have positive weights for every base E statistic it uses.
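
A minimal sketch of this calibration step, using scikit-learn's logistic regression on per-pair feature differences with no intercept, is shown below; the simulated preferences and the feature dimensionality are placeholders, not the study data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, n_feats = 500, 3                       # e.g. three base E statistics
x1 = rng.standard_normal((n_pairs, n_feats))    # features of image 1 in each pair
x2 = rng.standard_normal((n_pairs, n_feats))    # features of image 2 in each pair
true_w = np.array([1.0, 0.5, 0.25])
# Simulated user clicks: image 1 preferred with probability sigmoid(w . (x1 - x2)).
p = 1.0 / (1.0 + np.exp(-(x1 - x2) @ true_w))
y = rng.random(n_pairs) < p

# Fit on differences with no intercept; predictions are P(image 1 preferred).
model = LogisticRegression(fit_intercept=False).fit(x1 - x2, y)
scores_image1 = x1 @ model.coef_.ravel()        # per-image calibrated score beta^T x
print(model.coef_, model.score(x1 - x2, y))     # weights should be positive (admissible)
```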

Calibrating the E statistic: We investigated five E-models, where the $k$'th uses the first $k$ base E statistics. Table 1 shows the cross-validated accuracy of the models and whether they are admissible or not. We use the admissible model with the highest cross-validated accuracy (model 3); note from the standard error statistics that the accuracy differences are significant.

Calibrating the C statistic: We investigated six C-models, where the first uses only the base C statistic, and the $k$'th of the rest uses the base C statistic together with the first $k$ base E statistics. Table 2 shows the cross-validated accuracy of the models and whether they are admissible or not. There is no significant difference in accuracy between the two admissible models; we choose the larger model (model 1).

Visualizing calibration results: We visualize predictions of user preference as a function of the difference between scores from the selected E-model and C-model in Fig. 3. In both plots, the scattered points are true user observations of style-content pairs. In the C-test each pair has 9 observations; in the E-test each pair has 16 or more observations.

E-Model Admissible Cross-validated accuracy
1 yes .856 (3e-3)
2 yes .867 (2e-3)
3 yes .873 (3e-3)
4 no .871 (3e-3)
5 no .873 (2e-3)
Table 1: Cross-validated accuracy for our E-model predictions of human preference in the style experiment (parentheses give the standard error of cross-validated accuracy). Models 4 and 5 are not admissible because they violate condition (b); see the model description in Sec. 4.1.
C-Model Admissible Cross-validated accuracy
C yes .692 (8e-3)
1 yes .694 (8e-3)
2 no .710 (7e-3)
3 no .756 (7e-3)
4 no .759 (7e-3)
5 no .767 (7e-3)
Table 2: Cross-validated accuracy for our C-model predictions of human preference in the content experiment (parentheses give the standard error of cross-validated accuracy). Models 2, 3, 4 and 5 are not admissible because they violate condition (b); see the model description in Sec. 4.1.
Figure 3: Both the E and C statistics are calibrated to user preferences in a comparison. On the left, the predicted probability of preferring image 1 to the original content as a function of the score from the selected C-model. On the right, the predicted probability of preferring image 1 to the original style as a function of the score from the selected E-model.

4.2 User Study Details

We conducted two rounds of user studies. The first round had 300 image pairs for the E-test and 150 image pairs for the C-test, each of which was generated using Gatys' method [9]. In the second round, to calibrate E across transfer methods, we used a mixture of 939 image pairs generated from the Universal (352), XL (294) and Gatys (294) methods (see the method descriptions in Sec. 5.1).

First round: For the E-test we randomly selected two transferred images with the same style and the same content but different optimization parameters, then paired and displayed them in random order. For the C-test we followed the same process but only used pairs where the E statistic was in the top quartile. For each task, users are presented with a question, an original image (the style image for the E-test and the content image for the C-test) and a transferred pair. Users are asked to choose a preferred image based on the displayed question. Overall, 16 users finished the E-test and 9 finished the C-test. From the first round we obtained 4800 clicks for the E-test and 1350 clicks for the C-test.

Second round: Only the E-test was conducted in the second round, with the same user interface as in the first round. Different style transfer methods were applied to the same set of style-content pairs. Users were provided with two transferred images using the same style-content combination but generated with different style transfer methods. 24 users (a few of whom also participated in the first round) took part in the second round and contributed 2232 clicks.

In total, from the two rounds of user study, we collected 7032 user clicks over style and 1350 user clicks over content. Note that the C-test is difficult because we selected C-test images with high E statistics. Also note that we do not evaluate individual user preferences or specific methods, but rather the correlation between general user preference and the proposed base E and C statistics. The results in Tab. 1 and 2 show low standard errors of mean accuracy, indicating high confidence in these experiments.

5 Comparing Style Transfer Methods with E and C

With calibrated, meaningful measures of effectiveness and coherence, we can evaluate style transfer algorithms. We consider which algorithm is "best" and what effect the choice of style has on performance. For analyzing the effects of weights, choice of style, optimization objectives, etc., we use the following procedure: we regress E (resp. C) for many style transfers produced by the algorithm of interest, then extract information from the coefficient weights.
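
A sketch of this analysis procedure, assuming a simple least-squares regression of E on the style weight plus a one-hot style identity (the synthetic data stand in for real transfer results), might look like the following.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_styles = 300, 50
weights = rng.uniform(50, 2000, size=n)                   # style weights of each transfer
styles = rng.integers(0, n_styles, size=n)                # style identity of each transfer
# Synthetic E scores: small weight effect, strong per-style effect, plus noise.
E = 1e-4 * weights + rng.standard_normal(n_styles)[styles] + 0.1 * rng.standard_normal(n)

X = np.column_stack([weights, np.eye(n_styles)[styles]])  # [weight | one-hot style]
coef, *_ = np.linalg.lstsq(X, E, rcond=None)              # least-squares regression
print("weight effect:", coef[0] * weights.mean())         # cf. Table 3
print("per-style effects (first 5):", coef[1:6])          # cf. Fig. 7 heatmaps
```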

5.1 Details

We list style transfer methods compared in this paper:
Gatys ([9] and described above); we use the implementation by Gatys (https://github.com/leongatys/PytorchNeuralStyleTransfer).
Gatys aggressive ([9] and described above); we use the same Gatys implementation, but with the aggressive weighting set.
Gatys, with histogram loss: as advocated by [29], we attach a histogram loss to Gatys method.
Gatys, with layerwise style weights: the style weight is varied by layer; we multiply each layer's style loss by a layer-specific factor.
Gatys, with mean control: Gatys' loss, with an added L2 loss requiring that the means in each transfer layer match the means in each style layer.
Gatys, with covariance control: replacing Gatys' Gram matrix with the covariance matrix.
Gatys, with mean and covariance control: replacing Gatys’ style loss with losses requiring that means and covariances in each layer match.
Cross-layer: We used a pairwise descending strategy with a pre-trained VGG-16 model. We use R11, R21, R31, R41, and R51 for the style loss, and R42 for the content loss.
Cross-layer, aggressive: as for XL, but with the aggressive weighting set.
Cross-layer, multiplicative (XM): A natural alternative for combining the style and content losses is to multiply them; we form the product $\mathcal{L}_c \cdot \mathcal{L}_s$. This provides a dynamic weighting between the content loss and the style loss during optimization. Although this loss function may seem odd, it performs extremely well in practice.


Cross-layer, with control of covariance (XLC): Cross-layer loss, but replacing cross-layer Gram matrices by cross-layer covariance matrices.
Cross-layer, with control of mean and covariance (XLCM): XLC, but with an added loss requiring that the means in each layer match.
Gatys, augmented Lagrangian method (GAL): We use Gatys' loss, but rather than only using L-BFGS to optimize, we decouple layers to produce a constrained optimization problem and use the augmented Lagrangian method to solve it (after the procedure in [3] for decomposing MRF problems). Like XM, this effectively provides dynamic weighting and performs extremely well. Details in Appendix A.
Universal Style Transfer (Universal): from [19], using its PyTorch implementation (https://github.com/sunshineatnoon/PytorchWCT).
Style control: the style image is resized to content size and reported as transferred image.
Content control: the content image reported as transferred image.

We construct a wide-ranging collection of styles and contents, using 50 style images and the 200 content images from the BSDS500 test set. Styles are chosen by padding out the styles used in figures of previous papers with comparable images until we had 50 styles. There is not yet enough information to select a canonical style set. We built two datasets based on these style and content pairs. The main set is used for most experiments, and was obtained as follows: take 20 evenly spaced weight values in the range 50-2000; then, for each weight value, choose 15 style/content pairs uniformly at random. The aggressive weighting set is used to investigate the effect of extreme weights. It was built by taking 20 weight values sampled uniformly at random between 2000 and 10000; then, for each weight value, choosing 15 style/content pairs uniformly at random. For each method, we then produced 300 style transfer images, one for each weight-style-content triplet. For Universal [19], since the maximum weight is one, we linearly map the main set weights to the zero-one range. Our samples are sufficient to produce clear differences in standard error bars and to evaluate the different methods.

5.2 Results

Figure 4: E and C statistics for admissible methods. The plot shows the mean (filled black circle) and 66% confidence ellipse, showing the covariance of E and C values for each method. Notice: E and C are positively correlated, suggesting some dependence on either style (compare Fig. 7) or optimization difficulties; XLCM and GAL achieve better E, and Universal achieves better C; the controls are where expected (the style control gets excellent E, weak C; the content control weak E, excellent C).

We run the style transfer methods on our dataset (each sample a tuple of style, content, and weight), and then plot these samples using the calibrated E and C statistics for comparison. We show the mean and covariance ellipse of E and C for the various methods in Figs. 4, 5 and 6.

Generally, methods with strong C may have weak E and vice versa, which can be considered a typical trade-off (this is a Pareto frontier). In spite of this trade-off, we can still find some style transfer methods that are superior to others. An admissible method is a method that does not have both mean E and mean C weaker than some other method; e.g. the style control has excellent E and weak C, and the content control has excellent C and weak E. Note that this criterion is weak, because it looks only at mean E and mean C, and the covariance might argue for using a method with inadmissible means. Fig. 4 summarizes the admissible methods based on the comparison with the methods shown in Figs. 4, 5 and 6. Universal style transfer has excellent C but very weak E (i.e. the style is not much transferred, so the original image remains quite coherent). XLCM and GAL obtain only very slightly different E's, but different C's; although each is admissible, GAL should likely be preferred, as it obtains a strong C with little erosion of E. The differences between methods quite obviously achieve statistical significance (n=300; ellipses show covariance rather than standard deviation).

Fig. 5 and 6 summarize the inadmissible methods (of the Gatys type and the cross-layer type respectively). None of these methods beats the methods of Fig. 4 in both mean E and mean C at the same time. Note that XM is very close to being admissible. Notice, in particular, that inadmissible methods tend to have large variance in C; one might get a good C, but one might also get a bad one.

Figure 5: E and C statistics for inadmissible methods of the Gatys type. The plot shows mean (filled black circle) and 66% confidence ellipse. Notice: E and C are positively correlated, suggesting some dependence on either style (compare Fig. 7) or optimization difficulties; the likely instability in Gatys’ method is reflected by very high variance when an aggressive weight schedule is used.
Figure 6: E and C statistics for inadmissible methods of the cross-layer type. The plot shows mean (filled black circle) and 66% confidence ellipse. Notice: E and C are positively correlated, suggesting some dependence on either style (compare Fig 7) or optimization difficulties; the cross-layer method reacts to aggressive style weighting by producing increased E and lower C, as one would expect. XM performs best, and is very close to being admissible.

Style and Weight: Style weights have a surprisingly small effect on the E statistic for admissible methods (Tab. 3). Aggressive style weights lead to unstable transfer results; see Gatys, aggressive in Fig. 5 and Cross-layer, aggressive in Fig. 6. The choice of style is very important. Fig. 7 shows the result of regressing the E statistic against style identity; many styles are strongly advantageous or disadvantageous for many methods. There is no clearly dominant method here. It is obvious from the figure that any given method can be significantly advantaged by choosing the styles for transfer carefully. This is a trap for evaluators.

Admissible Method Style Weight Effect Significance (P-value)
XLCM -0.40 (0.23) 0.05
GAL -0.34 (0.19) 0.09
Universal 1.54 (0.89)
Table 3: We show the effect of style weight on E for admissible methods by multiplying the regression coefficient by the mean style weight (brackets show regression coefficient standard deviation). This gives the range of differences in E caused by style weights. Note P-values are high for XLCM and GAL, so there is little evidence weights actually matter.
Figure 7: The E measure that a method produces depends very strongly on the style; some styles transfer well, others poorly, even for admissible methods. On the top, a heatmap showing the significance of the dependency of the E statistic on style; red boxes indicate significance (i.e. likely not an accident). The vertical coordinate gives the method, the horizontal coordinate gives the style. While more detailed analysis would be required to reliably identify which styles have a strong effect on which method, it is clear that all methods are strongly affected by many styles. On the bottom, a heatmap showing the weight (positive=yellow means improves E; negative=red means weakens E) for each of our 50 styles for each method. All methods find some styles hard and others helpful.

6 Discussion

What causes the difference between Gatys' method and cross-layer losses? A symmetry analysis [29] helps explain some aspects of our results. Appendix C gives a construction of all affine maps that fix the Gram matrix for a layer and its parent. It is necessary to assume that the map from layer to layer is linear. This is not as restrictive as it may seem; the analysis yields a local construction about any generic operating point of the network. In summary: the cross-layer Gram matrix loss has very different symmetries from Gatys' (within-layer) method. In particular, the symmetry of Gatys' method can rescale features while shifting the mean. For the cross-layer loss, the symmetry cannot rescale and cannot shift the mean. This implies that, if one constructs numerous style transfers with the same style using Gatys' method, the variance of the layer features should be much greater than that observed for the cross-layer method. Furthermore, these symmetries impede optimization by making it hard to identify progress, as massive changes in the input image may lead to no change in loss. Increasing style weights in Gatys' method should result in poor style transfers by exaggerating the effects of the symmetry, and we observe this effect; see Gatys, aggressive in Fig. 5.

Our experimental evidence suggests the symmetries manifest themselves in practice. Gatys-like methods display significantly larger variance in C than cross-layer methods, and aggressive weighting makes the situation worse. This suggests that the variance implied by the larger symmetry group actually appears. In particular, Gatys' symmetry group allows rescaling of features and shifting of their mean, which causes the feature distribution of the transferred image to move away from the feature distribution of the style, producing a lower E statistic. Histogram regularization does not appear to help significantly.

Symmetries appear to interact strongly with optimization difficulties. GAL uses a standard optimization trick (insert variables and constraints to decouple terms in an unconstrained problem, in the hope of making better progress with each step) and benefits significantly. In particular, GAL is largely immune to changes in style weight. This suggests that the main difficulty might lie with optimization procedures, rather than with losses.

7 Conclusion

Style transfer methods have proliferated in the absence of a quantitative evaluation method. Our evaluation procedure attempts to provide evidence for strong style transfer methods. We calibrate our measurements to predict human preferences in style (resp. content) experiments, allowing extensive comparison of methods. Small variants on a method (for example, changes to the optimization procedure) seem to have a significant effect on performance. This is a situation where quantitative evaluation is essential. Furthermore, our results suggest that the choice of style strongly affects the performance of all admissible algorithms.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §1, §3, §3.
  • [2] J. D. Bonet (1997) Multiresolution sampling procedure for analysis and synthesis of texture images. SIGGRAPH. Cited by: §2.
  • [3] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122. Cited by: §5.1.
  • [4] A. J. Champandard (2016) Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768. Cited by: §2.
  • [5] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua (2017-07) Stylebank: an explicit representation for neural image style transfer. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [6] T. Q. Chen and M. Schmidt (2016) Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337. Cited by: §2.
  • [7] V. Dumoulin, J. Shlens, and M. Kudlur (2017) A learned representation for artistic style. ICLR. External Links: Link Cited by: §2.
  • [8] A. A. Efros and W. T. Freeman (2001) Image quilting for texture synthesis and transfer. Proceedings of the 28th annual conference on Computer graphics and interactive techniques - SIGGRAPH ’01, pp. 341–346. External Links: Document, ISBN 158113374X, ISSN 00134694, Link Cited by: §2.
  • [9] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423. Cited by: §2.1, §2, §4.2, §5.1.
  • [10] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman (2017-07) Controlling perceptual factors in neural style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [11] L. Gatys, A. S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 262–270. External Links: Link Cited by: §2.
  • [12] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei (2017) Characterizing and improving stability in neural style transfer. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4087–4096. Cited by: §2.
  • [13] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin (2001) Image analogies. Proceedings of the 28th annual conference on Computer graphics and interactive techniques - SIGGRAPH ’01 (August), pp. 327–340. External Links: Document, 1705.01088, ISBN 158113374X, Link Cited by: §2.
  • [14] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868. Cited by: §2.
  • [15] Y. Jing, Y. Yang, Z. Feng, J. Ye, and M. Song (2017) Neural style transfer: a review. arXiv preprint arXiv:1705.04058. Cited by: §2.
  • [16] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song (2019) Neural style transfer: a review. IEEE transactions on visualization and computer graphics. Cited by: §1.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Cited by: §2.
  • [18] Y. Li, N. Wang, J. Liu, and X. Hou (2017) Demystifying neural style transfer. arXiv preprint arXiv:1701.01036. Cited by: §2.
  • [19] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086. Cited by: §2, §5.1, §5.1.
  • [20] Y. Li, M. Liu, X. Li, M. Yang, and J. Kautz (2018) A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474. Cited by: §2, §2.
  • [21] D. C. Liu and J. Nocedal (1989) On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1), pp. 503–528. Cited by: §2.1.
  • [22] F. Luan, S. Paris, E. Shechtman, and K. Bala (2017) Deep Photo Style Transfer. External Links: Document, 1703.07511, Link Cited by: §2.
  • [23] R. Novak and Y. Nikulin (2016) Improving the neural algorithm of artistic style. arXiv preprint arXiv:1605.04603. Cited by: §2.
  • [24] Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand (2014) Style transfer for headshot portraits. ACM Transactions on Graphics 33 (4), pp. 1–14. External Links: Document, ISBN 0730-0301, ISSN 07300301, Link Cited by: §2.
  • [25] E. P. Simoncelli and J. Portilla (1998) Texture characterization via joint statistics of wavelet coefficient magnitudes. In ICIP, Cited by: §2.
  • [26] J. B. Tenenbaum and W. T. Freeman (2000) Separating Style and Content with Bilinear Models. Neural Computation 12 (6), pp. 1247–1283. External Links: Document, ISBN 0899-7667 (Print)$\$r0899-7667 (Linking), ISSN 0899-7667, Link Cited by: §2.
  • [27] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. External Links: Link, 1607.08022 Cited by: §2.
  • [28] X. Wang, G. Oxholm, D. Zhang, and Y. Wang (2016) Multimodal transfer: a hierarchical deep convolutional neural network for fast artistic style transfer. arXiv preprint arXiv:1612.01895. Cited by: §2.
  • [29] P. Wilmot, E. Risser, and C. Barnes (2017) Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893. Cited by: §1, §2, §5.1, §6.

Appendix A Quick Overview

Notice that in Fig. 5 all Gatys-related methods except Gatys with mean and covariance control have quite low E compared to the E of the cross-layer methods in Fig. 6. But Gatys with mean and covariance control has different symmetries from Gatys (because one is controlling both the mean and covariance, rather than just the Gram matrix; the symmetries are like those of the cross-layer method). This suggests that the symmetry is likely at least part of the reason why some methods outperform others.

There are two possible reasons. First, the symmetry results in poor solutions being easy to find. Second, the symmetry causes optimization problems. Both issues appear to be in play. Figures 5 and 6 together suggest that methods have considerable variance in performance, which is consistent with poor solutions being easy to find. But the good performance of GAL (see Fig. 4) suggests that optimization is an issue, too.

Symmetries can create problems for optimization methods, because symmetries must be associated with strong gradient curvature at at least some points. GAL uses a standard optimization trick to simplify the optimization problem; the success of this trick suggests that optimization of Gatys' loss is hard.

A.1 GAL

Gatys' loss is a function of feature values at each layer. One usually assumes that the feature values at layer $l+1$ are a known function of the feature values at layer $l$; here the function is given by the appropriate convolutional layer, etc. However, we could "cut" the network between layers, then introduce a constraint requiring that the variables on either side of the cut be equal. We solve this constrained problem using the augmented Lagrangian method (see [3] for this strategy applied to MRFs).

Write $f^l_{k,p}$ for the response of the $k$'th channel at the $p$'th location in the $l$'th convolutional layer; drop subscripts as required, and write $\phi$ for the function mapping one layer to the next. GAL cuts the network only at R41. We have not tried other cuts. It would be interesting to see what happens with more cuts, but the optimization problem gets big quickly. We introduce dummy variables $v$ for the features at the cut, and the constraint $v = \phi(f(n))$. Write $\lambda$ for the Lagrange multipliers corresponding to the constraint, $n$ for the image, and $\lambda^{(k)}$ for the $k$'th estimate of those Lagrange multipliers, etc.

The augmented Lagrangian is now

$$\mathcal{L}_A(n, v, \lambda) = \sum_{l} w_l \mathcal{L}^{(l)}_s + \mathcal{L}_c + \lambda^{\top}\!\left( v - \phi(f(n)) \right) + \frac{\rho}{2} \left\| v - \phi(f(n)) \right\|_2^2,$$

where $w_l$ is the style weight of each layer, $\mathcal{L}^{(l)}_s$ is the style loss for layer $l$, $\mathcal{L}_c$ is the content loss at R41, and $\rho$ is the penalty weight.

In the primal step, we first optimize the Lagrangian with respect to $n$, with $v$ and $\lambda$ fixed, using L-BFGS. We then fix $n$ and optimize with respect to $v$ (notice this involves solving a relatively straightforward linear system). The dual step then re-estimates the Lagrange multipliers as usual:

$$\lambda^{(k+1)} = \lambda^{(k)} + \rho \left( v - \phi(f(n)) \right).$$

Finally, we update the penalty weight $\rho$.
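
To illustrate the primal-dual structure (and not the authors' implementation), the toy sketch below decouples a two-term objective with a linear "layer map", optimizes the image-side variables with L-BFGS, solves the dummy-variable step in closed form, and then updates the multipliers. All names (`A`, `rho`, the quadratic stand-in losses) are assumptions for the example.

```python
import torch

torch.manual_seed(0)
d, k = 16, 8
A = torch.randn(k, d)                      # stands in for the layer map phi at the cut
content_target = torch.randn(d)            # stand-in for content features below the cut
style_target = torch.randn(k)              # stand-in for style statistics above the cut

def content_loss(x):                       # plays the role of the content term on x
    return 0.5 * ((x - content_target) ** 2).sum()

def style_loss(z):                         # plays the role of the style term on z
    return 0.5 * ((z - style_target) ** 2).sum()

x = torch.zeros(d, requires_grad=True)     # "image-side" variables
z = torch.zeros(k)                         # dummy variables at the cut
lam = torch.zeros(k)                       # Lagrange multipliers
rho = 1.0                                  # penalty weight

for it in range(20):
    # Primal step 1: optimize the augmented Lagrangian in x with L-BFGS (z, lam fixed).
    opt = torch.optim.LBFGS([x], max_iter=25)
    def closure():
        opt.zero_grad()
        r = z - A @ x
        loss = content_loss(x) + lam @ r + 0.5 * rho * (r ** 2).sum()
        loss.backward()
        return loss
    opt.step(closure)

    # Primal step 2: optimize in z; the quadratic style term makes this a linear solve.
    with torch.no_grad():
        z = (style_target - lam + rho * (A @ x)) / (1.0 + rho)

    # Dual step: re-estimate the multipliers.
    with torch.no_grad():
        lam = lam + rho * (z - A @ x)

print("constraint violation:", float((z - A @ x.detach()).norm()))
```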

Figure 8 and Figure 9 display our 50 style images. Except for Universal style transfer, all methods synthesize the image from Gaussian noise using the L-BFGS optimizer. The content images and style images are resized to a common width of 512 pixels as input for style transfer.

Figure 8: The first group of 50 styles.
Figure 9: The second group of 50 styles.

A.2 Cross-layer with control of mean and covariance (XLCM)

We observe that the difference in feature means between the content image $c$ and the style image $s$ is directly related to the optimization performance of style transfer; e.g. when the content image has a feature mean similar to that of the style image, the transferred image has better style quality. Therefore we introduce an L2 loss between each feature channel's mean for $n$ and each feature channel's mean for $s$, to encourage the transferred image to have a feature mean close to that of the style image:

$$\mathcal{L}_m(n, s) = \sum_{l} \sum_{i} \left[ \mu^l_i(n) - \mu^l_i(s) \right]^2.$$

The covariance control, on the other hand, replaces the cross-layer Gram matrix by the corresponding cross-layer Gram matrix in which each feature has its mean subtracted. The new cross-layer loss with covariance control uses

$$C^{l,m}_{ij}(x) = \sum_{p} \left[ f^l_{i,p}(x) - \mu^l_i(x) \right] \left[ u^m_{j,p}(x) - \mu^m_j(x) \right],$$

where $\mu^l_i(x)$ is the mean of $f^l_{i,p}(x)$ over $p$ (i.e. the tensor duplicated along the $p$ dimension with the mean of the features over $p$).
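
For illustration, a sketch of the two XLCM ingredients, the channel-mean matching loss and the mean-subtracted cross-layer Gram matrix, is given below; the upsampling mode and normalization are assumptions carried over from the earlier cross-layer sketch.

```python
import torch
import torch.nn.functional as F

def mean_control_loss(feat_n, feat_s):
    """L2 loss between channel-wise feature means of the transfer n and the style s."""
    return ((feat_n.mean(dim=(1, 2)) - feat_s.mean(dim=(1, 2))) ** 2).sum()

def cross_layer_covariance(feat_l, feat_m):
    """Cross-layer Gram matrix with each channel's mean subtracted first."""
    cl, h, w = feat_l.shape
    up = F.interpolate(feat_m.unsqueeze(0), size=(h, w), mode="bilinear",
                       align_corners=False).squeeze(0)
    fl = feat_l.reshape(cl, -1)
    um = up.reshape(up.shape[0], -1)
    fl = fl - fl.mean(dim=1, keepdim=True)
    um = um - um.mean(dim=1, keepdim=True)
    return fl @ um.t() / (h * w)

# Usage with random stand-ins for two consecutive style layers.
a, b = torch.randn(64, 32, 32), torch.randn(128, 16, 16)
print(mean_control_loss(a, 1.1 * a), cross_layer_covariance(a, b).shape)
```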

(low C-score, high E-score)
Cross-layer,aggressive:24.06%,
XLCM:20.92%,
XLC:11.92%,
XL:11.30%,
GatysCM:9.21%
(middle C-score, high E-score)
XLC:14.56%,
Cross-layer,aggressive:13.60%,
XLCM:13.41%,
XL:13.22%,
GAL:10.15%
(high C-score, high E-score)
GAL:25.56%,
XM:15.04%,
XL:10.53%,
GatysL:8.52%,
GatysCM:6.77%
(low C-score, middle E-score)
GatysCM:15.29%,
GatysC:12.86%,
Cross-layer, aggressive:11.65%,
GatysL:11.65%,
XLCM:8.50%
(middle C-score, middle E-score)
XM:11.69%,
GatysM:11.49%,
GatysL:10.69%,
GatysH:10.08%,
GatysC:8.87%
(high C-score, middle E-score)
XM:15.45%,
GatysH:14.02%,
Gatys:13.41%,
GAL:13.01%,
GatysM:11.18%
(low C-score, low E-score)
Gatys aggressive:23.97%,
GatysC:12.57%,
XLC:10.02%,
GatysCM:8.84%,
GatysM:7.47%
(middle C-score, low E-score)
Universal:12.83%,
GatysH:10.73%,
Gatys aggressive:10.47%,
GatysM:10.21%,
Gatys:9.69%
(high C-score, low E-score)
Universal:45.28%,
Gatys:15.75%,
GatysH:7.87%,
GatysM:6.69%,
GatysL:4.53%
  • GatysH – Gatys, with histogram loss

  • GatysL – Gatys, with layerwise style weights

  • GatysM – Gatys, with mean control

  • GatysC – Gatys, with covariance control

  • GatysCM – Gatys, with mean and covariance control

  • XL – Cross-layer

  • XM – Cross-layer, multiplicative

  • XLC – Cross-layer, with control of covariance

  • XLCM – Cross-layer, with control of mean and covariance

  • GAL – Gatys, augmented Lagrangian method

  • Universal – Universal Style Transfer

Table 4:

Top-5 method rankings for each quantile of the regression-score coordinates generated by the selected E-model and C-model. Each transferred image has five base E statistics and one base C statistic, which are used to regress user preference in the E-test and C-test (Sec. 4.1 in the original text). The selected E and C models assign scores (higher is better) to each transferred image. We divide the scatter into 3-by-3 quantiles, and show the method distribution for each quantile.

Appendix B Quantization of transferred images under user study regression models

Recall that in Section 4 of the original text we regress the base E and C statistics to user preference. We obtain one best E-model from the E-test user preferences, and one best C-model from those of the C-test. These two models assign E and C scores to each transferred image (Sec. 4.1 of the original text). Thus, we gather a scatter plot of all transferred images, and we quantize this scatter plot into a 3-by-3 grid, where each cell has roughly the same number of images. From this grid we generate a visualization of EC space (Fig. 1 in the original text).

This quantization shows trends similar to Figures 4-6 in the original text. Table 4 shows the top-5 method ranking for each quantile. In the (high C-score, high E-score) quantile, GAL is the top method. XM dominates both (middle C, middle E) and (high C, middle E), and Universal dominates both (middle C, low E) and (high C, low E). The other high-E quantiles are dominated by cross-layer-related methods. The worst quantile (low C-score, low E-score) has Gatys aggressive as the most common method.


Appendix C Construction of Affine Maps for Symmetry Groups

This difference in symmetry groups is important. Risser et al. argue that the symmetries of Gram matrices in Gatys' method could lead to unstable reconstructions; they control this effect using feature histograms. What causes the effect is that the symmetry rescales features while shifting the mean. For the cross-layer loss, the symmetry cannot rescale and cannot shift the mean. In turn, the instability identified in that paper does not apply to the cross-layer Gram matrix, and our results could not be improved by adopting a histogram loss.

Write $f_i$ (resp. $g_i$) for the feature vector at the $i$'th location (of $N$ in total) in the first (resp. second) layer. Write $F = [f_1, \ldots, f_N]$, $\mathcal{G} = [g_1, \ldots, g_N]$, etc.

Symmetries of the first layer: Now assume that the first layer has been normalized to zero mean and unit covariance. There is no loss of generality, because the whitening transform can be written into the expression for the group. Write $G(\cdot)$ for the operator that forms the within-layer Gram matrix; we have $G(F) = F F^{\top}$. Now consider an affine action on layer 1, mapping $f_i$ to $M f_i + m$; then for this to be a symmetry, we must have $G(M F + m \mathbf{1}^{\top}) = G(F)$. In turn, the symmetry group can be constructed by: choose $m$ which does not have unit length; factor $\mathcal{I} - m m^{\top}$ to obtain $A$ with $A A^{\top} = \mathcal{I} - m m^{\top}$ (for example, by using a Cholesky transformation); then any element of the group is a pair $(A U, m)$ where $U$ is orthonormal. Note that the factoring will fail for a unit vector, whence the restriction.
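
The construction can be checked numerically. The sketch below (our check, not from the paper) whitens a random feature matrix, builds a group element $(AU, m)$ as above, and verifies that the within-layer Gram matrix is unchanged under the affine action.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 2000
F = rng.standard_normal((d, N))
F = F - F.mean(axis=1, keepdims=True)                    # zero mean
W = np.linalg.inv(np.linalg.cholesky(F @ F.T / N))       # whitening transform
F = W @ F                                                # unit covariance: F F^T = N I

m = rng.standard_normal(d)
m = 0.9 * m / np.linalg.norm(m)                          # ensure |m| < 1 (not unit length)
A = np.linalg.cholesky(np.eye(d) - np.outer(m, m))       # A A^T = I - m m^T
U, _ = np.linalg.qr(rng.standard_normal((d, d)))         # random orthonormal U
M = A @ U                                                # group element (M, m)

F_prime = M @ F + m[:, None]                             # the affine action on layer 1
print(np.abs(F_prime @ F_prime.T - F @ F.T).max())       # ~0: Gram matrix unchanged
```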

The second layer: We will assume that the map between layers of features is linear. This assumption is not true in practice, but the major differences between symmetries observed under this assumption are likely to persist when the map is not linear. We can analyze two cases: first, all units in the map observe only one input feature vector (i.e. 1x1 convolutions; the point sample case); second, spatial homogeneity in the layers.

The point sample case: Assume that every unit in the map observes only one input feature from the previous layer (1x1 convolutions). We have $g_i = K f_i$, because the map between layers is linear. Now consider the effect on the second layer. We have $G(\mathcal{G}) = K F F^{\top} K^{\top}$. Choose some symmetry group element for the first layer, $(M, m)$. The Gram matrix for the second layer becomes $K (M F + m \mathbf{1}^{\top})(M F + m \mathbf{1}^{\top})^{\top} K^{\top}$. Recalling that $F \mathbf{1} = 0$ and $F F^{\top} = N \mathcal{I}$, we have

$$K (M F + m \mathbf{1}^{\top})(M F + m \mathbf{1}^{\top})^{\top} K^{\top} = N K \left( M M^{\top} + m m^{\top} \right) K^{\top},$$

so that the second-layer Gram matrix is unchanged if $K (M M^{\top} + m m^{\top}) K^{\top} = K K^{\top}$. This is relatively easy to achieve with $M M^{\top} + m m^{\top} = \mathcal{I}$.

Spatial homogeneity: Now assume the map between layers has convolutions with maximum support . Write for an index that runs over the whole feature map, and for a stacking operator that scans the convolutional support in fixed order and stacks the resulting features. For example, given a 3x3 convolution and indexing in 2D, we might have

In this case, there is some , so that . We ignore the effects of edges to simplify notation (though this argument may go through if edges are taken into account). Then there is some , so we can write

Now assume further that layer 1 has the following (quite restrictive) spatial homogeneity property: for pairs of feature vectors within the layer , with (ie within a convolution window of one another), we have . This assumption is consistent with image autocorrelation functions (which fall off fairly slowly), but is still strong. Write for an operator that stacks copies of its argument as appropriate, so

Then . If there is some affine action on layer 1, we have , where we have overloaded in the natural way. Now if and , .

The cross-layer gram matrix: Symmetries of the cross-layer gram matrix are very different. Write for the cross layer gram matrix.

Cross-layer, point sample case: Here (recalling )we have . Now choose some symmetry group element for the first layer, . The cross-layer gram matrix becomes

(recalling that and ). But this means that the symmetry requires ; in turn, we must have .

Cross-layer, homogeneous case: We have

Now choose some symmetry group element for the first layer, . The cross-layer gram matrix becomes

(recalling the spatial homogeneity assumption, that and ). But this means that the symmetry requires ; in turn, we must have .