1 Introduction
In this paper, we seek to identify factors that lead to strong style transfers. To do so, we construct a comprehensive quantitative evaluation procedure for style transfer methods. We evaluate style transfers on two criteria. Effectiveness
(E) measures whether transferred images have the desired style, using the divergence between Convolutional Neural Network (CNN) feature layer distributions of the synthesized image and the style image.
Coherence (C) measures whether the synthesized images respect the underlying decomposition of the content image into objects, using established procedures together with the Berkeley segmentation dataset BSDS500 [1]. Both our E and C measures are calibrated by user studies. Our quantitative metric focuses on the analysis of parametric neural methods (under the taxonomy of NST techniques [16]). Non-parametric methods can generate feature statistics that differ greatly from those of the original style image, because patterns are fitted to the content image; they are intrinsically different from parametric methods. It therefore does not make sense to evaluate both types of method with the same metric at this stage.
Contributions: We present E and C measures of style transferred images (see Fig. 1). Our measures are highly effective at predicting user preferences. We use our measures to compare several style transfer methods quantitatively. Our study suggests that controlling cross-layer loss is helpful, particularly if one uses the cross-layer covariance matrix (rather than the Gram matrix). Our study suggests that, despite the analysis of Risser et al. [29], the main problem with Gatys' method is optimization rather than symmetry; modifying the optimization leads to an extremely strong method. Gatys' method is unstable with high style weights, and we construct explicit models of the symmetry groups for Gatys' style loss and the cross-layer style loss (improving over Risser et al., who could not construct the groups), which may explain this effect. Our study suggests that, even for the best methods we investigated, the effect of the choice of style image is strong, meaning that it is dangerous for experimenters to select style images when reporting results.
2 Related work
Style transfer: bilinear models [26], non-parametric methods [8], image analogies [13] and adjusting filter statistics [2, 25] are capable of image style transfer and texture synthesis. Gatys et al. demonstrated that producing neural network layers with particular summary statistics (i.e. Gram matrices) yielded effective texture synthesis [11]. Gatys et al. achieved style transfer by searching for an image that satisfies both style texture summary statistics and content constraints [9]. This work has been much elaborated [17, 28, 5, 7, 27, 14, 19, 20, 6, 24, 22, 10, 18, 4, 15]. Novak and Nikulin noticed that cross-layer Gram matrices reliably improve style transfer [23]. However, their work was an exploration of variants of style transfer rather than a thorough study of style summary statistics; since then, the method has been ignored in the literature.
Style transfer evaluation: style transfer methods are currently evaluated mostly by visual inspection on a small set of different styles and content image pairs. To our knowledge, there are no quantitative protocols to evaluate the competence of style transfer apart from user studies [20] (who also investigate edge coherence between content and stylized images).
Gram matrix symmetries in a style transfer loss function occur when there is a transformation available that changes the style transferred image without changing the value of the loss function. Risser et al. note instability in Gatys' method; the symptoms are poor and good style transfers of the same style to the same content with about the same loss value [29]. They supply evidence that this behavior can be controlled by adding a histogram loss, which breaks the symmetry. They do not write out the symmetry group, regarding it as too complicated ([29], p. 46). Gupta et al. [12] link instability in Gatys' method to the size of the trace of the Gram matrix.

2.1 Gatys' Method and Notation
We review the original work of Gatys et al. [9] in detail to introduce notation. Gatys et al. find an image whose lower convolutional feature layers match those of the style image and whose higher layers match those of a content image. Write $s$ (resp. $c$, $n$) for the style (resp. content, new) image, and $\alpha$, $\beta$ for parameters balancing the style and content losses ($L_s$ and $L_c$ respectively). Occasionally, we will write $T(s, c; \alpha, \beta)$ for the image resulting from style transfer using method $T$ applied to the arguments. We obtain $n$ by finding
$$n = \arg\min_x \; \alpha L_c(x, c) + \beta L_s(x, s).$$
Losses are computed on a network representation with $L$ convolutional layers, where the $l$'th layer produces a feature map of size $H_l \times W_l \times C_l$ (resp. height, width, and channel number). We partition the layers into three groups (style, content and target). Then we reindex the spatial variables (height and width) and write $f^l_{ki}$ for the response of the $k$'th channel at the $i$'th location in the $l$'th convolutional layer. The content loss is
$$L_c(n, c) = \frac{1}{2} \sum_{l \in \mathcal{C}} \sum_{k, i} \left( f^l_{ki}(n) - f^l_{ki}(c) \right)^2$$
(where $l$ ranges over content layers $\mathcal{C}$). The within-layer Gram matrix for the $l$'th layer is
$$G^l_{kk'}(x) = \sum_i f^l_{ki}(x) \, f^l_{k'i}(x).$$
Write $w_l$ for the weight applied to the $l$'th layer. Then
$$L_s(n, s) = \sum_{l \in \mathcal{S}} \frac{w_l}{4 C_l^2 (H_l W_l)^2} \sum_{k, k'} \left( G^l_{kk'}(n) - G^l_{kk'}(s) \right)^2,$$
where $l$ ranges over style layers $\mathcal{S}$. Gatys et al. use Relu1_1, Relu2_1, Relu3_1, Relu4_1, and Relu5_1 as style layers, and layer Relu4_2 for the content loss, and search for $n$ using L-BFGS [21]. From now on, we write R51 for Relu5_1, etc.
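As a concrete sketch, the within-layer Gram matrix and the two losses can be written in a few lines of NumPy. This is a minimal illustration of the definitions above, not the reference implementation; the normalization constant is one common choice.

```python
import numpy as np

def gram_matrix(f):
    """Within-layer Gram matrix G_{kk'} = sum_i f_{ki} f_{k'i}
    for a feature map f of shape (C, H, W)."""
    C = f.shape[0]
    F = f.reshape(C, -1)          # channels x locations
    return F @ F.T

def content_loss(f_n, f_c):
    """Squared-difference content loss over one layer."""
    return 0.5 * np.sum((f_n - f_c) ** 2)

def style_loss_layer(f_n, f_s, w=1.0):
    """Gram-matrix style loss for one layer, normalised by the squared
    number of channels and locations (one common normalization choice)."""
    C, H, W = f_n.shape
    G_n, G_s = gram_matrix(f_n), gram_matrix(f_s)
    return w * np.sum((G_n - G_s) ** 2) / (4.0 * C**2 * (H * W) ** 2)
```

In a full method these per-layer terms are summed over the style and content layer sets and minimized with respect to the input image.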
2.2 Cross-layer style loss
We consider a style loss that takes into account between-layer statistics. The cross-layer, additive (XL) loss is obtained as follows. Consider layers $l$ and $m$, both style layers, with decreasing spatial resolution. Write $\hat{f}^m$ for an upsampling of $f^m$ to the spatial resolution of layer $l$, and consider
$$G^{l,m}_{kk'}(x) = \sum_i f^l_{ki}(x) \, \hat{f}^m_{k'i}(x)$$
as the cross-layer Gram matrix. We can form a style loss
$$L_s(n, s) = \sum_{(l,m) \in \mathcal{P}} \frac{1}{4 C_l C_m (H_l W_l)^2} \sum_{k, k'} \left( G^{l,m}_{kk'}(n) - G^{l,m}_{kk'}(s) \right)^2$$
(where $\mathcal{P}$ is a set of pairs of style layers). We can substitute this loss into the original style loss, and minimize as before. All results here used a pairwise descending strategy, where one constrains each layer and its successor (i.e. (R51, R41); (R41, R31); etc.). Alternatives include an all-distinct-pairs strategy, where one constrains all pairs of distinct layers. Carefully controlling weights for each layer's style loss is not necessary in the cross-layer Gram matrix scenario.
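The cross-layer Gram matrix can be sketched as below, assuming nearest-neighbour upsampling by a factor of two between consecutive style layers (the actual upsampling scheme is an implementation choice):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map,
    so the coarser layer m aligns spatially with the finer layer l."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def cross_layer_gram(f_l, f_m):
    """Cross-layer Gram matrix G_{kk'} = sum_i f^l_{ki} fhat^m_{k'i},
    where fhat^m is layer m upsampled to layer l's resolution."""
    f_m_up = upsample2x(f_m)
    F_l = f_l.reshape(f_l.shape[0], -1)
    F_m = f_m_up.reshape(f_m_up.shape[0], -1)
    return F_l @ F_m.T            # shape (C_l, C_m)
```

Unlike the within-layer Gram matrix, the result is generally rectangular, coupling the channels of the two layers.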
3 Base Statistics for Quantitative Evaluation
Ideally, a style transfer method should meet two basic tests: (1) the method produces images in the desired style – E statistics; (2) the resulting images respect the underlying decomposition of the content image into objects – C statistics.
Base E statistics
: In general, we want to measure the similarity of two distributions, one derived from the style image, the other from the transferred image. At each layer, e.g. the R41 feature map, we first project both the style image's and the transferred image's summary statistics to a low-dimensional representation. We then treat these representations as the parameters of Gaussian distributions and apply a standard KL divergence to measure the distance. The same procedure is repeated for the other layers (i.e. R11, R21, R31 and R51).
Specifically, the projection matrix at each layer is obtained as follows: we first collect a set of content images (we use the 200 test images from BSDS500 [1]), and obtain their convolutional feature covariance matrices from a pretrained VGG model. Similar to the Gram matrix, a feature covariance matrix is computed by
$$\Sigma^l_{kk'}(x) = \sum_i \left( f^l_{ki}(x) - \mu^l_k \right) \left( f^l_{k'i}(x) - \mu^l_{k'} \right),$$
where $\mu^l_k$, $\mu^l_{k'}$ are the $k$'th and $k'$'th elements of the channel-wise feature mean at layer $l$. Then, the average covariance matrix $\bar{\Sigma}^l$ is computed by elementwise averaging of all images' covariance matrices at layer $l$. We decompose $\bar{\Sigma}^l$
via singular value decomposition and keep the $d_l$ eigenvectors corresponding to the largest eigenvalues. These eigenvectors form our projection basis $P_l$, which is fixed. Given an image $x$, its low-dimensional summary statistic representation at layer $l$ becomes
$$\hat{\mu}^l = P_l^T \mu^l(x), \qquad \hat{\Sigma}^l = P_l^T \Sigma^l(x) P_l.$$
We treat $\hat{\mu}^l$ and $\hat{\Sigma}^l$ as the parameters $\mu$ and $\Sigma$ of a $d_l$-dimensional Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. Write $E_l$ for the negative KL divergence between the $l$'th layer of the transferred image $t$ and the $l$'th layer of the style image $s$; the KL divergence is expressed as follows:
$$D_{KL}\left( \mathcal{N}_t \,\|\, \mathcal{N}_s \right) = \frac{1}{2} \left( \operatorname{tr}\left( \Sigma_s^{-1} \Sigma_t \right) + \left( \mu_s - \mu_t \right)^T \Sigma_s^{-1} \left( \mu_s - \mu_t \right) - d_l + \ln \frac{\det \Sigma_s}{\det \Sigma_t} \right).$$
We choose dimensions of 18, 100, 128, 280, and 256 for the low-dimensional representations of layers R11, R21, R31, R41, and R51 respectively. We project the statistics onto a low-dimensional representation for two reasons: (1) using the full-rank covariance matrix in the KL divergence formula leads to numerical problems, e.g. the formula contains a ratio between two eigenvalues where both could be close to zero; (2) we believe some channels in the feature map cannot effectively capture image style.
An alternative approach would be to use style images to compute the projection matrices; our rationale for using content images is that the resulting basis is "general": it has not been adapted to our style images, for example. The procedure works because, in summary statistics, layer feature vectors tend to have significant redundancies which are shared across all images.
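The steps above (averaged covariance, truncated eigenbasis, projected Gaussian, KL divergence) can be sketched as follows. This is a simplified illustration with our own function names; the VGG feature extraction and the per-layer dimensions are omitted.

```python
import numpy as np

def projection_basis(covariances, d):
    """Average the per-image covariance matrices for one layer, then keep
    the d eigenvectors with the largest eigenvalues as the fixed basis."""
    avg = np.mean(covariances, axis=0)
    U, S, _ = np.linalg.svd(avg)       # avg is symmetric PSD
    return U[:, :d]                    # (C, d) projection matrix

def project_stats(mu, cov, P):
    """Project a layer's channel mean and covariance into the basis."""
    return P.T @ mu, P.T @ cov @ P

def gaussian_kl(mu_t, cov_t, mu_s, cov_s):
    """KL( N(mu_t, cov_t) || N(mu_s, cov_s) ) between the projected
    statistics of the transferred and style images."""
    d = mu_t.shape[0]
    inv_s = np.linalg.inv(cov_s)
    diff = mu_s - mu_t
    _, logdet_s = np.linalg.slogdet(cov_s)
    _, logdet_t = np.linalg.slogdet(cov_t)
    return 0.5 * (np.trace(inv_s @ cov_t) + diff @ inv_s @ diff
                  - d + logdet_s - logdet_t)
```

Using `slogdet` rather than `det` avoids overflow for larger projected dimensions; the divergence is zero exactly when the projected statistics match.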
Base C statistics
measure the extent to which style transfer methods preserve "objectness" in the content image. We treat object boundaries as a vital cue for human perception because boundaries are contours that represent the exchange of pixel ownership between objects. In other words, if an object's boundary is recognized in the transferred version of an image, then this object's intrinsic coherence is preserved by the style transfer method. Probability of boundary (Pb), a density distribution of contours on the image plane, is a form of output of contour detection methods. In this paper we use an off-the-shelf method by Arbelaez et al.
[1]. A common metric for contour detection over an image is the F-score, a harmonic mean of the precision and recall between the Pb map and a human-drawn contour map. The maximum F-score is taken from the precision-recall curve and is used as the final contour detection score of an image. Our recipe for generating base C statistics for a transferred image starts with applying a standard contour detection method to it. The Pb map is then used to compute the maximum F-score, which we treat as the C measurement. We think this is fair because standard contour detection methods were not developed with transferred images in scope. For source content images and human-annotated ground-truth contour maps we choose the 200 test images from BSDS500 [1].
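The maximum F-score step can be sketched as below. This is a pixel-wise simplification; the actual BSDS benchmark additionally matches predicted and ground-truth contours with a spatial tolerance.

```python
import numpy as np

def max_f_score(pb, gt, thresholds=None):
    """Sweep thresholds over a Pb map, compute pixel-wise precision and
    recall against a binary ground-truth contour map, and return the
    maximum F-score along the precision-recall curve."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    best = 0.0
    for t in thresholds:
        pred = pb >= t
        tp = np.logical_and(pred, gt > 0).sum()
        if tp == 0 or pred.sum() == 0:
            continue
        precision = tp / pred.sum()
        recall = tp / (gt > 0).sum()
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

A perfect Pb map scores 1.0; a uniform map is penalised through low precision.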
4 Calibrated Measures from Base Statistics
The proposed E and C statistics offer a quantitative measurement of style transfer methods and are meant to provide insight in the search for better style transfer methods. Yet one should calibrate such measurements against actual user preferences over transferred images. We conduct two surveys (an E-test for style and a C-test for content, Fig. 2) to help calibrate the E and C statistics.
In both surveys, users are presented with a pair of transferred images which differ only in the style transfer method, or in the optimization parameters of the same method (e.g. style weights, optimization iterations), while the content and style images are the same. In the E-test, users are asked to choose the transferred image that better captures the style. The transferred images are randomly selected from the transferred results of the same style-content pair. Similarly, in the content study, users are asked to choose the image that more closely resembles the content image, but the provided image pairs are chosen to have relatively high E statistics (details below). This selection is manual, to ensure only seemingly plausible transferred images are used for the C-test. Pilot studies provided evidence that human preferences can be accurately predicted using our E and C statistics.
4.1 Calibration with User Studies
Calibration method:
From the produced E and C statistics, we construct per-image measurements that directly predict human preferences. We first compare transferred images by comparing scores derived from their E and C statistics. The difference of scores between two transferred images (referred to as images 1 and 2) is used to predict the probability that one is preferred by a user over the other. We obtain such predictions using binary logistic regression. The scores are calibrated if the predictions of preference are accurate; e.g. if image 1 has score $e_1$ and image 2 has score $e_2$, then the probability that image 1 will be preferred by a user is predicted by $\sigma(e_1 - e_2)$. We seek one such score for effectiveness (which should predict the results of the style study) and another for coherence (which should predict the results of the content user study).

Scores and logistic models:
For each image pair, we have a random variable $y$ that says whether the first image of the transferred pair is preferred by a human; we also have, for each image, a vector of features $\mathbf{x}$ chosen from some combination of the base C statistic and the 5 base E statistics. Given a pair of images ($\mathbf{x}^{(1)}$ for image 1, etc.), we can fit the logistic regression model
$$P(y = 1) = \sigma\left( \theta^T \left( \mathbf{x}^{(1)} - \mathbf{x}^{(2)} \right) \right),$$
which yields a per-image score $\theta^T \mathbf{x}$. The choice of the admissible logistic model for user calibration is important: (a) the model should predict human preferences accurately; (b) the model should have positive weights for every base E statistic it uses, so that it does not rely on only some of the base statistics.
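The calibration step is ordinary binary logistic regression on feature differences. A minimal gradient-ascent sketch (our own toy implementation, not the fitting code used in the experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_preference_model(x1, x2, y, lr=0.5, iters=5000):
    """Fit P(image 1 preferred) = sigmoid(theta . (x1 - x2)) by
    gradient ascent on the log-likelihood.
    x1, x2: (N, d) base statistics for each image of a pair;
    y: (N,) 1 if image 1 was preferred, else 0."""
    dx = x1 - x2
    theta = np.zeros(x1.shape[1])
    for _ in range(iters):
        p = sigmoid(dx @ theta)
        theta += lr * dx.T @ (y - p) / len(y)
    return theta                      # per-image score is theta . x

def preference_prob(theta, x1, x2):
    """Predicted probability that image 1 is preferred over image 2."""
    return sigmoid((x1 - x2) @ theta)
```

Because only the difference of scores enters the model, any constant offset to the per-image score cancels, which is why a single score per image suffices.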
Calibrating the E statistic: We investigated five E-models, where the $i$'th uses the base E statistics of the first $i$ layers. Table 1 shows the cross-validated accuracy of the models and whether they are admissible or not. We use the admissible model with $i = 3$, which has the highest cross-validated accuracy; note from the standard error statistics that accuracy differences are significant.

Calibrating the C statistic: We investigated six C-models, where the first uses only the base C statistic, and the $i$'th of the rest uses the base C statistic together with the first $i$ base E statistics. Table 2 shows the cross-validated accuracy of the models and whether they are admissible or not. There is no significant difference in accuracy between the two admissible models; we choose the larger model.
Visualizing calibration results: We visualize predictions of user preference as a function of the difference between scores from the selected E-model and C-model in Fig. 3. In both plots, the scattered points are true user observations of style-content pairs. In the C-test each pair has 9 observations; in the E-test each pair has 16 or more observations.
E-Model  Admissible  Cross-validated accuracy
1  yes  .856 (3e-3)
2  yes  .867 (2e-3)
3  yes  .873 (3e-3)
4  no  .871 (3e-3)
5  no  .873 (2e-3)
C-Model  Admissible  Cross-validated accuracy
C  yes  .692 (8e-3)
1  yes  .694 (8e-3)
2  no  .710 (7e-3)
3  no  .756 (7e-3)
4  no  .759 (7e-3)
5  no  .767 (7e-3)
4.2 User Study Details
We conducted two rounds of user studies. The first round had 300 image pairs for the E-test and 150 image pairs for the C-test, each of which was generated using Gatys' method [9]. In the second round, to calibrate E regardless of transfer method, we used a mixture of 939 image pairs generated from the Universal (352), XL (294) and Gatys (294) methods (see the method descriptions in Sec. 5.1).
First round
: For the E-test we randomly selected two transferred images with the same style and the same content but different optimization parameters, then paired and displayed them in random order. For the C-test we followed the same process but only used pairs where the E statistic was in the top quartile. For each task, users are presented with a question, an original image (the style image for the E-test and the content image for the C-test) and a transferred pair. Users are asked to choose a preferred image based on the displayed question. Overall, 16 users finished the E-test, and 9 finished the C-test. From the first round we obtained 4800 clicks for the E-test and 1350 clicks for the C-test.
Second round: Only the E-test was conducted in the second round, with the same user interface as in the first round. Different style transfer methods are applied to the same set of style-content pairs. Users are provided with two transferred images generated from the same style-content combination but with different style transfer methods. 24 users (a few of whom also participated in the first round) participated in the second round and contributed 2232 clicks.
In total, from the two rounds of user study, we collected 7032 user clicks over style and 1350 user clicks over content. Note that the C-test is difficult because we selected C-test images with high E statistics. Also note that we do not evaluate individual user preferences or specific methods, but the correlation between general user preference and the proposed base E and C statistics. The results in Tab. 1 and 2 show low standard error of mean accuracy, indicating high confidence in these experiments.
5 Comparing Style Transfer Methods with E and C
With calibrated, meaningful measures of effectiveness and coherence, we can evaluate style transfer algorithms. We consider which algorithm is "best" and the effect the choice of style has on performance. For analyzing the effects of weights, choice of style, optimization objectives, etc., we use the following procedure: we regress E (resp. C) for many style transfers produced by the algorithm of interest, then extract information from the coefficient weights.
5.1 Details
We list style transfer methods compared in this paper:
Gatys ([9] and described above); we use the implementation by Gatys (https://github.com/leongatys/PytorchNeuralStyleTransfer).
Gatys aggressive ([9] and described above); we use the same Gatys implementation, but with the aggressive weighting set.
Gatys, with histogram loss: as advocated by [29], we attach a histogram loss to Gatys' method.
Gatys, with layerwise style weights: the style weight is varied by layer; we multiply the style losses of the layers by fixed per-layer factors.
Gatys, with mean control: Gatys' loss, with an added L2 loss requiring that the means in each transfer layer match the means in each style layer.
Gatys, with covariance control: replacing Gatys' Gram matrix by the covariance matrix.
Gatys, with mean and covariance control: replacing Gatys' style loss with losses requiring that the means and covariances in each layer match.
Cross-layer: We used a pairwise descending strategy with a pretrained VGG16 model. We use R11, R21, R31, R41, and R51 for the style loss, and R42 for the content loss.
Cross-layer, aggressive: as for XL, but with the aggressive weighting set.
Cross-layer, multiplicative (XM): A natural alternative for combining the style and content losses is to multiply them; we form $L = L_s \times L_c$. This provides a dynamic weighting between the content loss and the style loss during optimization. Although this loss function may seem odd, it performs extremely well in practice.
Cross-layer, with control of covariance (XLC): the cross-layer loss, but replacing cross-layer Gram matrices by cross-layer covariance matrices.
Cross-layer, with control of mean and covariance (XLCM): XLC, but with an added loss requiring that the means in each layer match.
Gatys, augmented Lagrangian method (GAL): We use Gatys' loss, but rather than only using L-BFGS to optimize, we decouple the layers to produce a constrained optimization problem and use the augmented Lagrangian method to solve it (after the procedure in [3] for decomposing MRF problems). Like XM, this works effectively as a dynamic weighting and performs extremely well. Details in Appendix A.
Universal Style Transfer (Universal): from [19], using its PyTorch implementation (https://github.com/sunshineatnoon/PytorchWCT).
Style control: the style image is resized to content size and reported as the transferred image.
Content control: the content image is reported as the transferred image.
We construct a collection covering a wide range of styles and contents, using 50 style images and the 200 content images from the BSDS500 test set. Styles were chosen by padding out the styles used in the figures of previous papers with comparable images until we had 50 styles. There is not yet enough information to select a canonical style set. We built two datasets based on these style and content pairs. The main set is used for most experiments, and was obtained as follows: take 20 evenly spaced weight values in the range 50–2000; then, for each weight value, choose 15 style/content pairs uniformly at random. The aggressive weighting set is used to investigate the effect of extreme weights. It was built by taking 20 weight values sampled uniformly at random between 2000 and 10000; then, for each weight value, choosing 15 style/content pairs uniformly at random. For each method, we then produced 300 style transfer images, one per weight-style-content triplet. For Universal [19], since the maximum weight is one, we linearly map the main-set weights to the zero-one range. Our samples are sufficient to produce clear differences in standard error bars and to evaluate the different methods.

5.2 Results
We run the style transfer methods on our dataset (tuples of style, content, and weight), and then plot these samples with the calibrated E and C statistics for comparison. We show the mean and covariance ellipse of E and C for various methods in Fig. 4, 5 and 6.
Generally, methods with strong C may have weak E and vice versa, which can be considered a typical trade-off (this is a Pareto frontier). In spite of this trade-off, we can still find some style transfer methods superior to others. An admissible method is a method that does not have both mean E and mean C weaker than any other method; e.g. style control has excellent E and weak C, while content control has excellent C and weak E. Note that this criterion is weak, because it looks at mean E and mean C, and the covariance might argue for using a method with inadmissible means. Fig. 4 summarizes the admissible methods, based on the comparison with the methods shown in Fig. 4, 5 and 6. Universal style transfer has excellent C, but very weak E (i.e. the style is not much transferred, so the original image remains quite coherent). XLCM and GAL obtain only very slightly different E's, but different C's; although each is admissible, GAL should likely be preferred, as it obtains a strong C with little erosion of E. The differences between methods clearly achieve statistical significance (n=300; ellipses show covariance rather than standard deviation).
Fig. 5 and 6 summarize the inadmissible methods (for the Gatys type and the cross-layer type respectively). None of these methods beats the methods of Fig. 4 in both mean E and mean C at the same time. Note that XM is very close to being admissible. Notice, in particular, that inadmissible methods tend to have large variance in C; one might get a good C, but one might also get a bad one.
Style and Weight: Style weights have a surprisingly small effect on the E statistic for admissible methods (Tab. 3). Aggressive style weights lead to unstable transfer results; see Gatys, aggressive in Fig. 5 and Cross-layer, aggressive in Fig. 6. The choice of style is very important. Fig. 7 shows the result of regressing the E statistic against style identity; many styles are strongly advantageous or disadvantageous for many methods. There is no clearly dominant method here. It is obvious from the figure that any given method can be significantly advantaged by choosing the styles for transfer carefully. This is a trap for evaluators.
Admissible Method  Style Weight Effect  Significance (P-value)
XLCM  0.40 (0.23)  0.05
GAL  0.34 (0.19)  0.09
Universal  1.54 (0.89)
6 Discussion
What causes the difference between Gatys' method and cross-layer losses? A symmetry analysis [29] helps explain some aspects of our results. Appendix C gives a construction for all affine maps that fix the Gram matrix for a layer and its parent. It is necessary to assume the map from layer to layer is linear. This is not as restrictive as it may seem; the analysis yields a local construction about any generic operating point of the network. In summary: the cross-layer Gram matrix loss has very different symmetries from Gatys' (within-layer) method. In particular, the symmetry of Gatys' method can rescale features while shifting the mean. For the cross-layer loss, the symmetry cannot rescale and cannot shift the mean. This implies that, if one constructs numerous style transfers of the same style using Gatys' method, the variance of the layer features should be much greater than that observed for the cross-layer method. Furthermore, these symmetries impede optimization by making it hard to identify progress, as massive changes in the input image may lead to no change in loss. Increasing style weights in Gatys' method should result in poor style transfers, by exaggerating the effects of the symmetry, and we observe this effect; see Gatys, aggressive in Fig. 5.
Our experimental evidence suggests the symmetries manifest themselves in practice. Gatys-like methods display significantly larger variance in C than cross-layer methods, and aggressive weighting makes the situation worse. This suggests that the variance implied by the larger symmetry group actually appears. In particular, Gatys' symmetry group allows rescaling of features and shifting of their mean, which causes the feature distribution of the transferred image to move away from the feature distribution of the style, lowering the E statistic. Histogram regularization does not appear to help significantly.
Symmetries appear to interact strongly with optimization difficulties. GAL uses a standard optimization trick (insert variables and constraints to decouple terms in an unconstrained problem, in the hope of making better progress with each step) and benefits significantly. In particular, GAL is largely immune to changes in style weight. This suggests that the main difficulty might lie with optimization procedures, rather than with losses.
7 Conclusion
Style transfer methods have proliferated in the absence of a quantitative evaluation method. Our evaluation procedure attempts to provide evidence for strong style transfer methods. We calibrate our measurements to predict human preferences in style (resp. content) experiments, allowing extensive comparison of methods. Small variants on a method (for example, changes to the optimization procedure) seem to have a significant effect on performance. This is a situation where quantitative evaluation is essential. Furthermore, our results suggest that the choice of style strongly affects the performance of all admissible algorithms.
References
 [1] (2011) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §1, §3, §3.
 [2] (1997) Multiresolution sampling procedure for analysis and synthesis of texture images. SIGGRAPH. Cited by: §2.

 [3] (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122. Cited by: §5.1.
 [4] (2016) Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768. Cited by: §2.

 [5] (2017-07) Stylebank: an explicit representation for neural image style transfer. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
 [6] (2016) Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337. Cited by: §2.
 [7] (2017) A learned representation for artistic style. ICLR. External Links: Link Cited by: §2.
 [8] (2001) Image quilting for texture synthesis and transfer. Proceedings of the 28th annual conference on Computer graphics and interactive techniques  SIGGRAPH ’01, pp. 341–346. External Links: Document, ISBN 158113374X, ISSN 00134694, Link Cited by: §2.
 [9] (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423. Cited by: §2.1, §2, §4.2, §5.1.
 [10] (201707) Controlling perceptual factors in neural style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [11] (2015) Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 262–270. External Links: Link Cited by: §2.
 [12] (2017) Characterizing and improving stability in neural style transfer. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4087–4096. Cited by: §2.
 [13] (2001) Image analogies. Proceedings of the 28th annual conference on Computer graphics and interactive techniques  SIGGRAPH ’01 (August), pp. 327–340. External Links: Document, 1705.01088, ISBN 158113374X, Link Cited by: §2.
 [14] (2017) Arbitrary style transfer in realtime with adaptive instance normalization. arXiv preprint arXiv:1703.06868. Cited by: §2.
 [15] (2017) Neural style transfer: a review. arXiv preprint arXiv:1705.04058. Cited by: §2.
 [16] (2019) Neural style transfer: a review. IEEE transactions on visualization and computer graphics. Cited by: §1.

 [17] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Cited by: §2.
 [18] (2017) Demystifying neural style transfer. arXiv preprint arXiv:1701.01036. Cited by: §2.
 [19] Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086. Cited by: §2, §5.1, §5.1.
 [20] (2018) A closedform solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474. Cited by: §2, §2.
 [21] (1989) On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1), pp. 503–528. Cited by: §2.1.
 [22] (2017) Deep Photo Style Transfer. External Links: Document, 1703.07511, Link Cited by: §2.
 [23] (2016) Improving the neural algorithm of artistic style. arXiv preprint arXiv:1605.04603. Cited by: §2.
 [24] (2014) Style transfer for headshot portraits. ACM Transactions on Graphics 33 (4), pp. 1–14. External Links: Document, ISBN 07300301, ISSN 07300301, Link Cited by: §2.
 [25] (1998) Texture characterization via joint statistics of wavelet coefficient magnitudes. In ICIP, Cited by: §2.
 [26] (2000) Separating Style and Content with Bilinear Models. Neural Computation 12 (6), pp. 1247–1283. External Links: Document, ISSN 08997667, Link Cited by: §2.
 [27] (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. External Links: Link, 1607.08022 Cited by: §2.
 [28] (2016) Multimodal transfer: a hierarchical deep convolutional neural network for fast artistic style transfer. arXiv preprint arXiv:1612.01895. Cited by: §2.
 [29] (2017) Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893. Cited by: §1, §2, §5.1, §6.
Appendix A Quick Overview
Notice that in Fig. 5 all Gatys-related methods except Gatys with mean and covariance control have quite low E compared to the E for cross-layer methods in Fig. 6. But Gatys with mean and covariance control has different symmetries from Gatys (because one is controlling both mean and covariance, rather than just the Gram matrix; the symmetries are like those of the cross-layer method). This suggests that the symmetry is at least part of the reason why some methods outperform others.
There are two possible reasons. First, the symmetry results in poor solutions being easy to find. Second, the symmetry causes optimization problems. Both issues appear to be in play. Figures 5 and 6 together suggest that methods have considerable variance in performance, which is consistent with poor solutions being easy to find. But the good performance of GAL (see Fig. 4) suggests that optimization is an issue, too.
Symmetries can create problems for optimization methods, because symmetries must be associated with strong gradient curvature at least at some points. GAL uses a standard optimization trick to simplify the optimization problem; the success of this trick suggests that optimization of Gatys' loss is hard.
A.1 GAL
Gatys' loss is a function of the feature values at each layer. One usually assumes that the feature values at one layer are a known function of the feature values at the previous layer; here the function is given by the appropriate convolutional layer, etc. However, we can "cut" the network between layers, then introduce a constraint requiring that the variables on either side of the cut be equal. We solve this constrained problem using the augmented Lagrangian method (see [3] for this strategy applied to MRFs).
Write $R^{(l)}_{ci}$ for the response of the $c$’th channel at the $i$’th location in the $l$’th convolutional layer; drop subscripts as required, and write $\phi^{(l)}$ for the function mapping layer $l$ to layer $l+1$. GAL cuts the layers only at R41. We have not tried other cuts. It would be interesting to see what happens with more cuts, but the optimization problem gets big quickly. We introduce dummy variables $D$ for the feature values at R41, and the constraint $D = \phi(x)$, where $x$ is the image and $\phi$ is the map from the image to the R41 features. Write $\lambda$ for the Lagrange multipliers corresponding to the constraint, and $\lambda^{(k)}$ for the $k$’th estimate of those Lagrange multipliers, etc.
The augmented Lagrangian is now
$$\mathcal{L}(x, D, \lambda) = \sum_{l} w^{(l)} L^{(l)}_{s} + L_{c} + \lambda^{\mathsf{T}}\left(D - \phi(x)\right) + \frac{\mu}{2}\left\|D - \phi(x)\right\|^{2},$$
where $w^{(l)}$ is the style weight of each layer, $L^{(l)}_{s}$ is the style loss for layer $l$, $L_{c}$ is the content loss at R41, and $\mu$ is a penalty weight.
In the primal step, we first optimize the Lagrangian with respect to $x$, using fixed $D$, using L-BFGS. We then fix $x$, and optimize with respect to $D$ (notice this involves solving a relatively straightforward linear system). The dual step then re-estimates the Lagrange multipliers as usual:
$$\lambda^{(k+1)} = \lambda^{(k)} + \mu\left(D - \phi(x)\right).$$
Finally, we update the penalty weight $\mu$, scaling it up by a fixed factor.
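The alternating scheme above can be sketched on a toy quadratic problem. This is a minimal numpy sketch, not the paper's implementation: the quadratics `h` and `g` stand in for the losses below and above the cut, and the matrix `M` stands in for the network map to R41 (the real map is nonlinear, which is why the paper uses L-BFGS for the first primal step).

```python
import numpy as np

# Toy analogue of GAL's alternating scheme: minimize h(x) + g(d) subject
# to d = M x, using the augmented Lagrangian
#   L(x, d, lam) = h(x) + g(d) + lam . (d - M x) + (mu / 2) ||d - M x||^2,
# with h(x) = ||x - a||^2 and g(d) = ||d - t||^2.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
a, t = rng.standard_normal(n), rng.standard_normal(n)

x, d, lam, mu = np.zeros(n), np.zeros(n), np.zeros(n), 1.0
for _ in range(2000):
    # Primal step 1: minimize over x with d fixed (a linear system here,
    # since h is quadratic; in the paper this step uses L-BFGS).
    x = np.linalg.solve(2 * np.eye(n) + mu * M.T @ M,
                        2 * a + M.T @ lam + mu * M.T @ d)
    # Primal step 2: minimize over d with x fixed (also a linear system).
    d = (2 * t - lam + mu * M @ x) / (2 + mu)
    # Dual step: re-estimate the multipliers as usual.
    lam = lam + mu * (d - M @ x)
    # Scale up the penalty weight, up to a cap.
    mu = min(2 * mu, 10.0)

# The iterates approach the minimizer of ||x - a||^2 + ||M x - t||^2,
# i.e. the solution of the original (uncut) problem.
x_star = np.linalg.solve(np.eye(n) + M.T @ M, a + M.T @ t)
```

The point of the cut is visible even in the toy: each primal subproblem is much easier than the joint problem, and the dual updates enforce consistency across the cut.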
A.2 Crosslayer with control of mean and covariance (XLCM)
We observe that the difference in feature means between the content image and the style image is directly related to the optimization performance of style transfer; for example, when the content image has feature means similar to those of the style image, the transferred image has better style quality. We therefore introduce an L2 loss between each feature channel’s mean in the transferred image and the corresponding channel’s mean in the style image, to force the transferred image’s feature means close to those of the style image. Write $L_m$ for this mean-control loss.
Covariance control, on the other hand, replaces the crosslayer gram matrix by the corresponding crosslayer covariance matrix, in which each feature channel has its mean subtracted. Write $L_{xc}$ for the new crosslayer loss with covariance control.
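A minimal numpy sketch of the two controls, assuming both feature maps have been sampled at common locations; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def crosslayer_gram(F1, F2):
    """Crosslayer gram matrix between two feature maps sampled at
    common locations: F1 is (c1, N), F2 is (c2, N); returns (c2, c1)."""
    return F2 @ F1.T / F1.shape[1]

def crosslayer_covariance(F1, F2):
    """Covariance control: subtract each channel's mean before forming
    the crosslayer gram matrix."""
    F1c = F1 - F1.mean(axis=1, keepdims=True)
    F2c = F2 - F2.mean(axis=1, keepdims=True)
    return crosslayer_gram(F1c, F2c)

def mean_control_loss(F_transfer, F_style):
    """L2 loss between per-channel feature means (the mean-control term)."""
    return np.sum((F_transfer.mean(axis=1) - F_style.mean(axis=1)) ** 2)

rng = np.random.default_rng(1)
F1 = rng.standard_normal((64, 100))    # finer layer: 64 channels
F2 = rng.standard_normal((128, 100))   # coarser layer: 128 channels
shift = rng.standard_normal((64, 1))   # a per-channel mean shift
```

Note that the covariance version is invariant to per-channel mean shifts, while the plain crosslayer gram matrix is not; this is exactly the degree of freedom the mean-control loss is meant to pin down.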
(low C-score, high E-score)
Crosslayer aggressive: 24.06%, XLCM: 20.92%, XLC: 11.92%, XL: 11.30%, GatysCM: 9.21%
(middle C-score, high E-score)
XLC: 14.56%, Crosslayer aggressive: 13.60%, XLCM: 13.41%, XL: 13.22%, GAL: 10.15%
(high C-score, high E-score)
GAL: 25.56%, XM: 15.04%, XL: 10.53%, GatysL: 8.52%, GatysCM: 6.77%

(low C-score, middle E-score)
GatysCM: 15.29%, GatysC: 12.86%, Crosslayer aggressive: 11.65%, GatysL: 11.65%, XLCM: 8.50%
(middle C-score, middle E-score)
XM: 11.69%, GatysM: 11.49%, GatysL: 10.69%, GatysH: 10.08%, GatysC: 8.87%
(high C-score, middle E-score)
XM: 15.45%, GatysH: 14.02%, Gatys: 13.41%, GAL: 13.01%, GatysM: 11.18%

(low C-score, low E-score)
Gatys aggressive: 23.97%, GatysC: 12.57%, XLC: 10.02%, GatysCM: 8.84%, GatysM: 7.47%
(middle C-score, low E-score)
Universal: 12.83%, GatysH: 10.73%, Gatys aggressive: 10.47%, GatysM: 10.21%, Gatys: 9.69%
(high C-score, low E-score)
Universal: 45.28%, Gatys: 15.75%, GatysH: 7.87%, GatysM: 6.69%, GatysL: 4.53%

GatysH – Gatys, with histogram loss
GatysL – Gatys, with layerwise style weights
GatysM – Gatys, with mean control
GatysC – Gatys, with covariance control
GatysCM – Gatys, with mean and covariance control
XL – Crosslayer
XM – Crosslayer, multiplicative
XLC – Crosslayer, with control of covariance
XLCM – Crosslayer, with control of mean and covariance
GAL – Gatys, augmented Lagrangian method
Universal – Universal Style Transfer
Top 5 methods for each quantile under the regression-score coordinates generated by the selected E-model and C-model. Each transferred image has five E-statistics and one C-statistic, which are used to regress user preference in the E-test and C-test (Sec. 4.1 in the original text). The selected E and C models regress scores (higher is better) for each transferred image. We divide the scatter into 3-by-3 quantiles, and show the method distribution for each quantile.
Appendix B Quantization of transferred images under user study regression models
Recall that in Section 4 of the original text we regress base E and C statistics against user preference. We obtain one best E-model from the E-test user preferences, and one best C-model from those of the C-test. These two models assign E and C scores to each transferred image (Sec. 4.1 of the original text). We thus gather a scatter plot of all transferred images, and quantize this scatter plot into a 3-by-3 grid in which each cell has roughly the same number of images. From this grid we generate a visualization of EC space (Fig. 1 in the original text).
This quantization shows trends similar to Figures 4 to 6 in the original text. Table 4 shows the top 5 methods for each quantile. In the (high C-score, high E-score) quantile, GAL is the top method. XM dominates both (middle C, middle E) and (high C, middle E), and Universal dominates both (middle C, low E) and (high C, low E). The other high-E quantiles are dominated by crosslayer-related methods. The worst quantile (low C-score, low E-score) has Gatys aggressive as the most popular method.
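The 3-by-3 quantization can be sketched as follows; `quantile_grid` is an illustrative helper (not the paper's code) that bins each score by its tertiles, so each row and column holds roughly a third of the images (individual cells are only roughly equal when the two scores are weakly dependent).

```python
import numpy as np

def quantile_grid(e_scores, c_scores):
    """Assign each image to a cell of a 3-by-3 grid whose bin edges are
    the tertiles of the E-score and C-score distributions.
    Returns (e_bin, c_bin) with values 0 = low, 1 = middle, 2 = high."""
    e_edges = np.quantile(e_scores, [1 / 3, 2 / 3])
    c_edges = np.quantile(c_scores, [1 / 3, 2 / 3])
    return np.digitize(e_scores, e_edges), np.digitize(c_scores, c_edges)

rng = np.random.default_rng(2)
e_scores = rng.standard_normal(900)   # stand-ins for regressed E scores
c_scores = rng.standard_normal(900)   # stand-ins for regressed C scores
e_bin, c_bin = quantile_grid(e_scores, c_scores)
```

Per-quantile method rankings like Table 4 then follow by grouping images on `(e_bin, c_bin)` and counting methods within each cell.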
Appendix C Construction of Affine Maps for Symmetry Groups
This difference in symmetry groups is important. Risser et al. argue that the symmetries of gram matrices in Gatys’ method can lead to unstable reconstructions; they control this effect using feature histograms. The effect occurs because the symmetry rescales features while shifting the mean. For the crosslayer loss, the symmetry can neither rescale features nor shift the mean. In turn, the instability identified in that paper does not apply to the crosslayer gram matrix, and our results could not be improved by adopting a histogram loss.
Write $f_i$ (resp. $g_i$) for the feature vector at the $i$’th location (of $N$ in total) in the first (resp. second) layer. Write $F = \left[f_1, \ldots, f_N\right]$, $G = \left[g_1, \ldots, g_N\right]$, etc.
Symmetries of the first layer: Now assume that the first layer has been normalized to zero mean and unit covariance. There is no loss of generality, because the whitening transform can be written into the expression for the group. Write $\mathcal{G}(F) = \frac{1}{N} F F^{\mathsf{T}}$ for the operator that forms the within-layer gram matrix, so $\mathcal{G}(F) = I$. Now consider an affine action on layer 1, mapping $f_i$ to $A f_i + b$; then for this to be a symmetry, we must have $\mathcal{G}(A F + b \mathbf{1}^{\mathsf{T}}) = A A^{\mathsf{T}} + b b^{\mathsf{T}} = I$. In turn, the symmetry group can be constructed by: choose $b$ which does not have unit length; factor $I - b b^{\mathsf{T}} = C C^{\mathsf{T}}$ to obtain $C$ (for example, by using a Cholesky factorization); then any element of the group is a pair $(C U, b)$ where $U$ is orthonormal. Note that factoring will fail for a unit vector $b$, whence the restriction.
The second layer: We will assume that the map between layers of features is linear. This assumption is not true in practice, but major differences between the symmetries observed under the linear assumption are likely to persist when the map is not linear. We analyze two cases: first, all units in the map observe only one input feature vector (i.e. 1x1 convolutions; the point sample case); second, spatial homogeneity in the layers.
The point sample case: Assume that every unit in the map observes only one input feature from the previous layer (1x1 convolutions). Then there is a matrix $M$ with $g_i = M f_i$, because the map between layers is linear. Now consider the effect on the second layer. We have $\mathcal{G}(G) = M \mathcal{G}(F) M^{\mathsf{T}} = M M^{\mathsf{T}}$. Choose some symmetry group element for the first layer, $(A, b)$. The gram matrix for the second layer becomes $M \left( A A^{\mathsf{T}} + b b^{\mathsf{T}} \right) M^{\mathsf{T}}$ (recalling that $\mathcal{G}(F) = I$ and that the features have zero mean), so that $\mathcal{G}(G)$ is unchanged if $A A^{\mathsf{T}} + b b^{\mathsf{T}} = I$. This is relatively easy to achieve with $b \neq 0$.
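The first-layer construction and the point sample result can be checked numerically. This is a minimal numpy sketch under this appendix's assumptions (exactly whitened features, a linear 1x1 map): it builds a group element $(A, b) = (CU, b)$ and verifies that the action preserves both gram matrices, even though it shifts the feature mean from zero to $b$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, m = 4, 2000, 6       # channels, locations, second-layer channels

# Whiten random first-layer features to zero mean and unit covariance.
F = rng.standard_normal((n, N))
F = F - F.mean(axis=1, keepdims=True)
F = np.linalg.solve(np.linalg.cholesky(F @ F.T / N), F)

# Build a group element (A, b) = (C U, b): pick b with non-unit length,
# factor I - b b^T = C C^T by Cholesky, and pick an orthonormal U.
b = rng.standard_normal(n)
b *= 0.5 / np.linalg.norm(b)                       # |b| = 0.5, not 1
C = np.linalg.cholesky(np.eye(n) - np.outer(b, b))
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = C @ U

# Act on the first layer: the gram matrix is preserved, while the
# feature mean shifts from 0 to b.
Fp = A @ F + b[:, None]
G1, G1p = F @ F.T / N, Fp @ Fp.T / N

# Point-sample second layer g = M f: its gram matrix is preserved too.
M = rng.standard_normal((m, n))
G2, G2p = (M @ F) @ (M @ F).T / N, (M @ Fp) @ (M @ Fp).T / N
```

This is exactly the rescale-and-shift freedom that Risser et al. identify: a nontrivial $(A, b)$ leaves both within-layer gram matrices unchanged.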
Spatial homogeneity: Now assume the map between layers has convolutions with maximum support $k$. Write $u$ for an index that runs over the whole feature map, and $S$ for a stacking operator that scans the convolutional support in a fixed order and stacks the resulting features. For example, given a 3x3 convolution and indexing in 2D, we might have
$$S(F)_u = \left[ f_{u+(-1,-1)}^{\mathsf{T}}, f_{u+(-1,0)}^{\mathsf{T}}, \ldots, f_{u+(1,1)}^{\mathsf{T}} \right]^{\mathsf{T}}.$$
In this case, there is some matrix $M$ so that $g_u = M S(F)_u$. We ignore the effects of edges to simplify notation (though this argument may go through if edges are taken into account). Then, writing $\mathcal{G}(S(F)) = \frac{1}{N} \sum_u S(F)_u S(F)_u^{\mathsf{T}}$, we can write
$$\mathcal{G}(G) = M \, \mathcal{G}(S(F)) \, M^{\mathsf{T}}.$$
Now assume further that layer 1 has the following (quite restrictive) spatial homogeneity property: for pairs of feature vectors within the layer $f_u$, $f_v$, with $u$ and $v$ within a convolution window of one another, we have $f_u \approx f_v$. This assumption is consistent with image autocorrelation functions (which fall off fairly slowly), but is still strong. Write $D$ for an operator that stacks copies of its argument as appropriate, so
$$S(F)_u \approx D(f_u).$$
Then $\mathcal{G}(S(F)) \approx D \, \mathcal{G}(F) \, D^{\mathsf{T}}$, overloading $D$ to mean the matrix that stacks copies of the identity. If there is some affine action $(A, b)$ on layer 1, we have $S(A F + b \mathbf{1}^{\mathsf{T}})_u \approx D(A f_u + b)$, where we have overloaded $D$ in the natural way. Now if $A A^{\mathsf{T}} + b b^{\mathsf{T}} = I$ and the homogeneity property holds, $\mathcal{G}(G)$ is unchanged.
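The exact part of this argument, that the second-layer gram matrix factors through the gram matrix of the stacked features, can be checked directly. A minimal numpy sketch, using a 1D analogue of the stacking operator $S$ and cropping edge locations rather than handling them:

```python
import numpy as np

def stack(F, k):
    """1D analogue of the stacking operator S: at each valid location u,
    concatenate the k feature vectors in the convolution window (edge
    locations are cropped rather than padded)."""
    n, N = F.shape
    return np.stack([F[:, u:u + k].T.reshape(-1) for u in range(N - k + 1)],
                    axis=1)                    # shape (n * k, N - k + 1)

rng = np.random.default_rng(4)
n, N, k, m = 3, 50, 3, 5   # channels, locations, support, out channels
F = rng.standard_normal((n, N))
S = stack(F, k)
M = rng.standard_normal((m, n * k))
G = M @ S                  # second-layer features g_u = M S(F)_u
Ns = S.shape[1]
G2 = G @ G.T / Ns          # second-layer gram matrix
GS = S @ S.T / Ns          # gram matrix of the stacked features
```

The homogeneity step, replacing `GS` with stacked copies of the first-layer gram matrix, is the approximate part of the argument and is not exercised here.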
The crosslayer gram matrix: Symmetries of the crosslayer gram matrix are very different. Write $\mathcal{G}(G, F) = \frac{1}{N} \sum_i g_i f_i^{\mathsf{T}}$ for the crosslayer gram matrix.
Crosslayer, point sample case: Here (recalling $g_i = M f_i$) we have $\mathcal{G}(G, F) = M \mathcal{G}(F) = M$. Now choose some symmetry group element for the first layer, $(A, b)$. The crosslayer gram matrix of the transformed features must again equal $M$ (recalling that $\mathcal{G}(F) = I$ and that the features have zero mean). But this constrains $A$ and $b$ directly, rather than only through $A A^{\mathsf{T}} + b b^{\mathsf{T}}$; in turn, we must have $A$ orthonormal and $b = 0$: the symmetry can neither rescale features nor shift the mean.
Crosslayer, homogeneous case: We have
$$\mathcal{G}(G, F) = M \, \frac{1}{N} \sum_u S(F)_u f_u^{\mathsf{T}}.$$
Now choose some symmetry group element for the first layer, $(A, b)$. The crosslayer gram matrix of the transformed features must again be unchanged (recalling the spatial homogeneity assumption, that $\mathcal{G}(F) = I$, and that the features have zero mean). As in the point sample case, this constrains $A$ and $b$ directly; in turn, we must have $A$ orthonormal and $b = 0$.