Improved Style Transfer by Respecting Inter-layer Correlations

01/05/2018 ∙ by Mao-Chuang Yeh, et al. ∙ University of Illinois at Urbana-Champaign

A popular series of style transfer methods applies a style to a content image by controlling the mean and covariance of values in early layers of a feature stack. This is insufficient for transferring styles that have strong structure across spatial scales, e.g., textures where dots lie on long curves. This paper demonstrates that controlling inter-layer correlations yields visible improvements in style transfer methods. We achieve this control by computing cross-layer, rather than within-layer, gram matrices. We find that (a) cross-layer gram matrices are sufficient to control within-layer statistics, and inter-layer correlations improve style transfer and texture synthesis; the paper shows numerous examples on "hard" real style transfer problems (e.g., long scale and hierarchical patterns); (b) a fast approximate style transfer method can control cross-layer gram matrices; (c) multiplying, rather than adding, the style and content losses results in very good style transfer. Multiplicative loss produces a visible emphasis on boundaries, and means that one hyper-parameter can be eliminated.




1 Introduction

Style transfer methods apply the “style” from one example image to the “content” of another; for instance, one might render a camera image (the content) as a watercolor painting (the style). Recent work has shown that highly effective style transfer can be achieved by searching for an image such that early layers of a CNN representation match the early layers of the style image and later layers match the later layers of a content image [8]. Content matching compares unit outputs at each location of a feature map. Style matching, in contrast, compares summary statistics (in particular, the gram matrix) of each layer individually. Comparing gram matrices of individual layers ensures that small, medium and large patterns that are common in the style image appear with about the same frequency in the synthesized image, and that spatial co-occurrences between these patterns are about the same in the synthesized and style images.

Novak and Nikulin noticed that cross-layer gram matrices reliably produce improvements in style transfer [18]. However, their work was an exploration of variants of style transfer rather than a thorough study to gain insight into style summary statistics. There are reasons to expect cross-layer terms to produce improvements. In some styles, very long scale patterns are formed out of small components. For instance, in Figure 3, small white spots are organized into long curves. Within-layer gram matrices are not well adapted to represent this phenomenon, as Figure 3 shows. Generally, such hard styles occur where effects at short spatial scales are organized into longer scale structures. Such hard styles are strongly associated with physical materials (for instance, the relief painting in Figure 2). In this paper, we show that comparing cross-layer gram matrices, which encode co-occurrences between (say) small and medium scale patterns, produces improvements in style transfer for such styles. Furthermore, controlling cross-layer gram matrices also effectively controls pattern frequencies.

(a) Style
(b) Within-Layer
(c) Cross-Layer
Figure 3: Left: styles to transfer; center: results using within-layer loss; right: results using cross-layer loss. There are visible advantages to using the cross-layer loss. Note how cross-layer preserves large black areas (top row); creates an improved appearance of relief for the acrylic strokes (second row); preserves the overall structure of the rods (third row); and ensures each string has a dot on each end (fourth row).

Our contributions:

  • We show that controlling cross-layer, rather than within-layer, gram matrices produces visible improvements in style transfer for many styles, even though the cross-layer loss imposes fewer constraints than the within-layer loss. This observation differs from the main claim of Novak and Nikulin, which suggests that more layers (16 layers) are needed for cross-layer gram matrices to improve on within-layer terms [18]. Furthermore, they found reliable small improvements from cross-layer gram matrices; in contrast, we argue that the method produces large, principled improvements, particularly for styles where inter-scale relations are important (Figures 1 and 2).

    (a) Style
    (b) Within-Layer
    (c) Cross-Layer
    Figure 4: Left: styles to transfer; center: results using within-layer loss; right: results using cross-layer loss. There are visible advantages to using the cross-layer loss. Note how cross-layer preserves the shape of the abstract color blocks (top row); avoids smearing large paint strokes (second row); preserves the overall structure of the curves as much as possible (third row); and produces color blocks with thin boundaries (fourth row).
  • We show that the universal style transfer (UST) method can be adapted to cross-layer gram matrices, consequently improving style transfer.

  • We demonstrate that multiplying the losses often produces better looking style transfers. We claim that multiplicative loss is better than additive loss at encouraging the stylized image to preserve prominent boundaries in the content image geometry.

2 Related work

Bilinear models are capable of simple image style transfer [21] by factorizing style and content representations, but non-parametric methods like patch-based texture synthesis can deal with much more complex texture fields [6]. Image analogies use a rendering of one image in two styles to infer a mapping from a content image to a stylized image [10]. Researchers have been looking for versatile parametric methods to control the style patterns transferred at different scales. Adjusting filter statistics is known to yield texture synthesis [1, 20]. Gatys et al. demonstrated that producing neural network layers with particular summary statistics (i.e., gram matrices) yielded effective texture synthesis [7]. In a following paper, Gatys et al. achieved style transfer by searching for an image that satisfies both style texture summary statistics and content constraints [8]. This work has been much elaborated. The search can be replaced with a regression (at one scale [13]; at multiple scales [22]; with cached [3] or learned [5] style representations) or a decoding process that allows efficient adjusting of statistics [14]. Search can be sped up with local matching methods [4]. Methods that produce local maps (rather than pixels) result in photorealistic style transfer [19, 17]. Style transfer can be localized to masked regions [9]. The criterion of matching summary statistics is a Maximum Mean Discrepancy condition [15]. Style transfer can be used to enhance sketches [2].

Novak and Nikulin search a range of variant style transfer methods, including cross-layer gram matrices. However, their primary suggestions are adding more layers for more features, and shifting activations so that the number of zero entries in the gram matrix is reduced. They do not pursue cross-layer gram matrices further, nor explain their results. They experiment with a long chain of cross-layer gram matrices but do not identify what the improvements are or extend the method to fast style transfer [18]. There is a comprehensive review in [12].

3 Within layer gram matrix for style transfer

Gatys et al. [8] find an image where early layers of a CNN representation match the lower layers of the style image and higher layers match the higher layers of a content image. We review the original work of Gatys et al. in detail. Write $I_s$ (resp. $I_c$, $I_n$) for the style (resp. content, new) image, and $\alpha$ for some parameter balancing style and content losses ($L_s$ and $L_c$ respectively). We obtain $I_n$ by optimizing

$$L_c(I_n, I_c) + \alpha L_s(I_n, I_s)$$

Losses are computed on a network representation with $L$ convolutional layers, where the $l$'th layer produces a feature map $f^l$ of size $H^l \times W^l \times K^l$ (resp. height, width, and channel number). We partition the layers into three groups (style, content and irrelevant). Then we reindex the spatial variables (height and width) and write $f^l_{k,p}$ for the response of the $k$'th channel at the $p$'th location in the $l$'th convolutional layer. The content loss $L_c$ is

$$L_c(I_n, I_c) = \sum_{c} \sum_{k,p} \left( f^c_{k,p}(I_n) - f^c_{k,p}(I_c) \right)^2$$

(where $c$ ranges over content layers; constant scale factors are omitted here). The style loss depends on within-layer gram matrices. Write

$$G^l_{ij}(I) = \sum_{p} f^l_{i,p}(I) \, f^l_{j,p}(I)$$

for the within-layer gram matrix of the $l$'th layer, and $w_l$ for the weight applied to the $l$'th layer. Then

$$L_s(I_n, I_s) = \sum_{l} w_l \sum_{i,j} \left( G^l_{ij}(I_n) - G^l_{ij}(I_s) \right)^2$$

where $l$ ranges over style layers. Gatys et al. use Relu1_1, Relu2_1, Relu3_1, Relu4_1, and Relu5_1 as style layers, and layer Relu4_2 for the content loss, and search for $I_n$ using L-BFGS [16]. Notation: From now on, we write R51 for Relu5_1, etc.
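The losses reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the feature maps are random stand-ins for CNN activations with hypothetical sizes, each map is already flattened to (channels, locations), and constant scale factors are dropped.

```python
import numpy as np

def gram(features):
    """Within-layer gram matrix for features of shape (channels, locations)."""
    return features @ features.T

def content_loss(f_new, f_content):
    """Sum of squared differences between unit outputs at every location."""
    return float(np.sum((f_new - f_content) ** 2))

def style_loss(feats_new, feats_style, weights):
    """Weighted sum over style layers of squared gram-matrix differences."""
    total = 0.0
    for f_n, f_s, w in zip(feats_new, feats_style, weights):
        total += w * np.sum((gram(f_n) - gram(f_s)) ** 2)
    return float(total)

# Toy stand-ins for CNN activations (hypothetical sizes, not VGG-19).
rng = np.random.default_rng(0)
style_feats = [rng.standard_normal((4, 64)), rng.standard_normal((8, 16))]
new_feats = [rng.standard_normal((4, 64)), rng.standard_normal((8, 16))]
content_feat = rng.standard_normal((8, 16))

alpha = 1e-3  # balance parameter; an arbitrary value for illustration
total = content_loss(new_feats[1], content_feat) + alpha * style_loss(
    new_feats, style_feats, [1.0, 1.0])

# Shuffling spatial locations leaves the gram matrix unchanged: it records
# which channels fire together, not where they fire.
perm = rng.permutation(64)
shuffled = style_feats[0][:, perm]
```

The last two lines illustrate why a within-layer gram matrix is a summary statistic: `gram(shuffled)` equals `gram(style_feats[0])`, so spatial arrangement within a layer is not constrained directly.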

4 The cross layer gram matrix

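Now consider layers $l$ and $m$, both style layers, with decreasing spatial resolution. Write $\uparrow f^m$ for an upsampling of $f^m$ to $H^l \times W^l \times K^m$, and consider

$$G^{l,m}_{ij}(I) = \sum_{p} f^l_{i,p}(I) \, \uparrow f^m_{j,p}(I)$$

as the cross-layer gram matrix (with $f^l_{i,p}$ the response of channel $i$ of layer $l$ at location $p$, as before). We can form a style loss

$$L_s(I_n, I_s) = \sum_{(l,m) \in \mathcal{L}} w_l \sum_{i,j} \left( G^{l,m}_{ij}(I_n) - G^{l,m}_{ij}(I_s) \right)^2$$

(where $\mathcal{L}$ is a set of pairs of style layers). We can substitute this loss into the original style loss, and minimize as before. This construction has a variety of interesting properties which we will investigate later.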
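A cross-layer gram matrix can be sketched as follows. The text does not say which upsampling is used, so nearest-neighbour upsampling by an integer factor is an assumption here, as are the square toy feature maps and the helper names.

```python
import numpy as np

def upsample_nearest(f, factor):
    """Nearest-neighbour upsampling of a (channels, H, W) map (assumed scheme)."""
    return f.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_gram(f_fine, f_coarse):
    """Cross-layer gram matrix of shape (channels_fine, channels_coarse)."""
    factor = f_fine.shape[1] // f_coarse.shape[1]
    up = upsample_nearest(f_coarse, factor)
    a = f_fine.reshape(f_fine.shape[0], -1)   # (K_fine, H*W)
    b = up.reshape(up.shape[0], -1)           # (K_coarse, H*W)
    return a @ b.T

def cross_style_loss(feats_new, feats_style, pairs, weights):
    """Weighted sum over layer pairs of squared cross-gram differences."""
    total = 0.0
    for (l, m), w in zip(pairs, weights):
        diff = (cross_gram(feats_new[l], feats_new[m])
                - cross_gram(feats_style[l], feats_style[m]))
        total += w * np.sum(diff ** 2)
    return float(total)

# Toy activations keyed by layer name (hypothetical sizes, not VGG-19).
rng = np.random.default_rng(0)

def toy_feats():
    return {"R11": rng.standard_normal((3, 8, 8)),
            "R21": rng.standard_normal((5, 4, 4))}

style, new = toy_feats(), toy_feats()
pairs = [("R11", "R21")]  # (finer layer, coarser layer)
loss = cross_style_loss(new, style, pairs, [1.0])
```

Note that the result is rectangular, with one entry per (fine channel, coarse channel) pair, which is why a cross-layer matrix between layers of different widths has fewer entries than the two within-layer matrices combined.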

Style layer pairs: In principle, any set of pairs can be used. We have investigated a pairwise descending strategy, where one constrains each layer and its successor (i.e., (R51, R41); (R41, R31); etc.), and an all distinct pairs strategy, where one constrains all pairs of distinct layers.
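The two pairing strategies can be enumerated directly. The function names below are illustrative, not taken from the paper.

```python
from itertools import combinations

STYLE_LAYERS = ["R51", "R41", "R31", "R21", "R11"]

def pairwise_descending(layers):
    """Each layer paired with its successor in the list."""
    return list(zip(layers[:-1], layers[1:]))

def all_distinct_pairs(layers):
    """Every unordered pair of distinct layers."""
    return list(combinations(layers, 2))

descending = pairwise_descending(STYLE_LAYERS)
distinct = all_distinct_pairs(STYLE_LAYERS)
```

With five style layers, the pairwise descending strategy yields four pairs and the all distinct pairs strategy yields ten.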

Pattern management across scales: Controlling within-layer gram matrices by proper weighting ensures that the statistics of patterns at a particular scale are “appropriate”. However, we speculate, and our experimental results seem to confirm, that one can get these statistics right without having desirable weighting relations across scales. Cross-layer gram matrices require that phenomena at one scale are appropriately correlated with those at the next scale. In other words, carefully tuning the weight of each layer’s style loss is not necessary when cross-layer gram matrices are used.

Number of constraints: Cross-layer gram matrices control considerably fewer parameters than within-layer gram matrices. For a pairwise descending strategy, we have four cross-layer gram matrices, leading to control of $64 \times 128 + 128 \times 256 + 256 \times 512 + 512 \times 512 = 434176$ parameters; compare within-layer gram matrices, which control $64^2 + 128^2 + 256^2 + 2 \times 512^2 = 610304$ parameters. It may seem that there is less constraint on style. However, experiments suggest that our method produces visibly improved results, which indicates that many of the parameters controlled by within-layer gram matrices have no particular effect on the outcome.
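The parameter counts can be checked with a short calculation. It assumes the standard VGG-19 channel widths for the five style layers and counts every matrix entry, without discounting the symmetry of within-layer gram matrices.

```python
# Channel widths of the VGG-19 style layers (standard architecture values).
CHANNELS = {"R11": 64, "R21": 128, "R31": 256, "R41": 512, "R51": 512}
ORDER = ["R51", "R41", "R31", "R21", "R11"]

# One K x K gram matrix per style layer.
within_params = sum(CHANNELS[name] ** 2 for name in ORDER)

# One K_l x K_m gram matrix per pair of adjacent style layers.
cross_params = sum(CHANNELS[a] * CHANNELS[b]
                   for a, b in zip(ORDER[:-1], ORDER[1:]))

print(within_params, cross_params)  # prints "610304 434176"
```

Under these assumptions the pairwise descending strategy controls about 71% as many entries as the within-layer loss.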

Figure 5: Fast universal cross-layer transfer (FCT). We use a procedure similar to that of Li et al. [14]: a pair of convolutional features (e.g., R11 and R21) are reshaped and concatenated before performing the transformation. The transformed feature is then split up and only one layer is fed into the decoder. We use off-the-shelf decoders from [14].

4.1 Fast Universal Cross-layer Transfers

Li et al. use signal whitening and coloring to implement a fast version of style transfer using a VGG encoder [14]. Their procedure takes the R51 layer from the content image, then applies an affine transformation (by whitening, coloring, and matching means) to match the gram matrix of the corresponding layer computed for the style image. The resulting layer is decoded to an image through one of five pre-trained image reconstruction decoder networks. The R41 layer produced by this image is again affine transformed to match the gram matrix of the corresponding layer computed for the style image. This layer is then again decoded to an image. The process continues until the affine transformed R11 layer is decoded to an image, which is retained.
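The affine step (whitening, coloring, and matching means) can be sketched as below. This is a minimal version under stated assumptions: features are flattened to (channels, locations), covariances are full rank, the eigendecomposition route and the `eps` regularizer are choices made here, and the encoder and decoder networks are omitted entirely.

```python
import numpy as np

def whiten_color(f_content, f_style, eps=1e-8):
    """Map content features so their mean and covariance match the style's.

    Both inputs have shape (channels, locations).
    """
    mean_c = f_content.mean(axis=1, keepdims=True)
    mean_s = f_style.mean(axis=1, keepdims=True)
    x_c = f_content - mean_c
    x_s = f_style - mean_s
    cov_c = x_c @ x_c.T / (x_c.shape[1] - 1)
    cov_s = x_s @ x_s.T / (x_s.shape[1] - 1)

    # Whitening: multiply by cov_c^(-1/2), built from its eigendecomposition.
    e_c, v_c = np.linalg.eigh(cov_c)
    whiten = v_c @ np.diag(1.0 / np.sqrt(e_c + eps)) @ v_c.T

    # Coloring: multiply by cov_s^(1/2).
    e_s, v_s = np.linalg.eigh(cov_s)
    color = v_s @ np.diag(np.sqrt(np.clip(e_s, 0.0, None))) @ v_s.T

    # Add the style mean back so first-order statistics match too.
    return color @ whiten @ x_c + mean_s

# Toy features with different means and covariances (hypothetical sizes).
rng = np.random.default_rng(0)
f_content = rng.standard_normal((4, 500)) + 1.0
mixing = rng.standard_normal((4, 4))
f_style = mixing @ rng.standard_normal((4, 300)) - 2.0

transformed = whiten_color(f_content, f_style)
```

After the transformation, `np.cov(transformed)` matches `np.cov(f_style)` up to the small `eps` regularization, and the channel means match as well.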

This procedure is easily extended to cross-layer gram matrices (Figure 5). We start by choosing a sequence of sets of layer covariances to control. The simple scheme is individual, controlling (R51), (R41), (R31), (R21), (R11). An alternative scheme is pairwise descending, where one controls (R51, R41); (R41, R31); (R31, R21); and (R21, R11). Another scheme is descending, where one controls (R51, R41, R31, R21, R11); (R41, R31, R21, R11); (R31, R21, R11); etc. We then take the first set of relevant layers from the content image (i.e., (R51, R41) for pairwise; (R51, … R11) for descending). Construct a gram matrix from this set of layers, upsampling as required. Apply an affine transformation to match the gram matrix of the corresponding set of layers for the style image, then decode the resulting layers to an image. Pass this image through VGG, recover the next set of control layers from the result, apply an affine transformation to match the gram matrix of the corresponding set of layers for the style image, then decode the resulting layers to an image. Proceed until the R11 layer is decoded to an image, and use that image. Note that this approach controls both within-layer and between-layer statistics, as the relevant gram matrices have within-layer gram matrices as diagonal blocks, and between-layer gram matrices as off-diagonal blocks.
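The block structure noted at the end of the paragraph above can be verified on toy data. The sketch assumes nearest-neighbour upsampling and hypothetical feature sizes; it covers only the stacking and splitting of features, not the affine transformation or the decoder.

```python
import numpy as np

def stack_layers(f_fine, f_coarse):
    """Upsample the coarse (K, H, W) map, flatten both, stack along channels."""
    factor = f_fine.shape[1] // f_coarse.shape[1]
    up = f_coarse.repeat(factor, axis=1).repeat(factor, axis=2)
    return np.concatenate(
        [f_fine.reshape(f_fine.shape[0], -1), up.reshape(up.shape[0], -1)],
        axis=0)

rng = np.random.default_rng(0)
f_fine = rng.standard_normal((3, 8, 8))     # stand-in for the finer layer
f_coarse = rng.standard_normal((5, 4, 4))   # stand-in for the coarser layer

stacked = stack_layers(f_fine, f_coarse)    # shape (3 + 5, 64)
joint = stacked @ stacked.T                 # gram matrix of the stacked features

k = f_fine.shape[0]
within_fine = joint[:k, :k]     # diagonal block: within-layer gram (fine)
within_coarse = joint[k:, k:]   # diagonal block: within-layer gram (coarse)
cross = joint[:k, k:]           # off-diagonal block: cross-layer gram

# After a transformation of the stacked features, the layers can be
# separated again by splitting along the channel axis.
fine_part, coarse_part = np.split(stacked, [k], axis=0)
```

With nearest-neighbour upsampling by a factor of 2, the coarse diagonal block is 4 times the gram matrix of the original coarse layer, since every coarse location is repeated four times.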

Figure 6: Multiplicative loss produces good style transfer results. Top row: style transfers using cross-layer gram matrices and additive loss, with a good choice of $\alpha$. Bottom row: style transfers using cross-layer gram matrices and multiplicative loss, where no choice of $\alpha$ is required. Notice the emphasis on content outlines in the multiplicative loss images.

4.2 Loss multiplication

Style transfer methods require a choice of parameter, $\alpha$, to balance the style and content losses. The value is typically chosen by eye, which is unsatisfying. A natural alternative to adding the losses is to multiply them; in this case, no parameter is needed, and we can form

$$L_m(I_n) = L_c(I_n, I_c) \cdot L_s(I_n, I_s)$$

Multiplicative loss tends to emphasize strong boundaries in the content image (Figure 6). We believe this is because the style loss is always large, so that minimization will force down large differences in the content layer (that is, large differences in values between the stylized image and the content image). Our experimental results suggest that this approach is successful (Section 5.3). The effect is quite prominent (Figure 6, Figure 11, Figure 12). Multiplicative loss also has the significant advantage of reducing the number of parameters that need to be searched over to produce useful results. Figure 2 shows style transfer results using cross-layer gram matrices and multiplicative loss; we observe a distinguishable improvement over Gatys’ method in preserving content boundaries.
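One way to see why no balance parameter is needed is through the product rule: the gradient of the product of the two losses is the content gradient weighted by the style loss plus the style gradient weighted by the content loss. The sketch below uses made-up loss values and gradients (not outputs of any network) to show that rescaling the style loss changes only the length of the multiplicative gradient, whereas it changes the direction of the additive gradient.

```python
import numpy as np

def additive_grad(g_c, g_s, alpha):
    """Gradient of L_c + alpha * L_s."""
    return g_c + alpha * g_s

def multiplicative_grad(l_c, l_s, g_c, g_s):
    """Gradient of L_c * L_s by the product rule."""
    return l_s * g_c + l_c * g_s

def unit(v):
    """Direction of a vector."""
    return v / np.linalg.norm(v)

# Made-up loss values and gradients at some image, for illustration only.
l_c, l_s = 2.0, 3.0
g_c = np.array([1.0, 0.0])
g_s = np.array([0.0, 1.0])

k = 10.0  # rescale the style loss (and hence its gradient) by a constant

mult = multiplicative_grad(l_c, l_s, g_c, g_s)
mult_scaled = multiplicative_grad(l_c, k * l_s, g_c, k * g_s)

add = additive_grad(g_c, g_s, alpha=1.0)
add_scaled = additive_grad(g_c, k * g_s, alpha=1.0)
```

Here `unit(mult)` equals `unit(mult_scaled)`, while `unit(add)` and `unit(add_scaled)` differ. The product rule also shows that a large style loss places a large weight on the content gradient, which is consistent with the explanation offered in the text.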

(a) Styles
(b) Within-layers
(c) CG
(d) more CGs
(e) WCT
(f) FCT
Figure 7: Texture synthesis comparison. The first column shows the style; the remaining columns, from left to right, are generated by within-layer gram matrices, CG (cross-layer gram matrices), more CGs (all cross-layer gram matrices between R51, R41, R31, R21, R11 are considered), WCT, and FCT. In both comparisons (Gatys vs. ours, and WCT vs. FCT), the cross-layer gram matrix improves the texture patterns.

One trick improves transfer results with multiplicative loss: shift the mean of the new image when creating it for optimization. We recommend that this shift be the channel mean of the style image.

5 Results

5.1 Experimental details

We use VGG-19 for both style transfer and texture synthesis. We use R11, R21, R31, R41, and R51 for the style (texture) loss, and R42 for the content loss in style transfer. Unless otherwise specified, all stylized images start from a Gaussian noise image and are optimized with L-BFGS.

(a) Our method shows better color grouping in the stylized image.
(b) Many black spots appear in the original WCT result; they are not observed with our method.
(c) Our method improves the color contrast, because cross-layer gram matrices preserve longer scale color patterns.
(d) Note our method does not have the blue color shift present in WCT.
(e) WCT has many artificial patterns that are not seen in the original style image; our method largely reduces them.
(f) Color blocks are better organized in ours.
Figure 8: In each row, first: the style image; second: transfer using FCT with descending sequences (i.e., (R51, R41, R31, R21, R11); (R41, R31, R21, R11); (R31, R21, R11); etc.); third: transfer using FCT with pairwise descending sequences (i.e., (R51, R41); (R41, R31); (R31, R21); and (R21, R11)); fourth: transfer using WCT [14].

5.2 Texture synthesis

Cross-layer gram matrix control applies to texture synthesis, since the style loss [8] was first introduced as a “texture” loss in [7]. We now omit the content loss, and seek a minimum of the style loss alone. We show texture synthesis results, which highlight the method’s ability to manage long spatial correlations. We controlled R51, R41, R31, R21, R11 for comparison with our style transfer results. Our synthesis starts from an image which has the mean color of the texture image. As Figure 7 shows, synthesized textures have better long scale coherence. For universal texture synthesis, we followed Li et al. in starting from zero-mean Gaussian noise, and ran the multi-level pipeline 3 times for better results.

5.3 Style transfer

Cross-layer vs. within-layer style loss: Figures 3 and 4 compare style transfers using within-layer gram matrices and cross-layer gram matrices with a pairwise descending strategy. Cross-layer gram matrices are particularly good at preserving relations between effects, as the detail in Figure 4 shows.

Multiplicative loss: The multiplicative loss often produces visually pleasing style transfer results: it arranges style patterns better while keeping the outline of the content relatively intact, so that the generated image preserves the perceptual meaning of the content while showing coherent style patterns (Figure 6). More examples are presented in the supplementary materials.

Figure 9: All pairs distinct cross-layer style transfer yields somewhat better results than descending pairs. Top row: cross-layer style transfer using descending pairs (i.e., (R51, R41); (R41, R31); (R31, R21); (R21, R11)). Bottom row: cross-layer style transfer using all pairs distinct (i.e., all distinct pairs from R51…R11). There are fewer bubbles; color localization and value is improved; and line breaks are fewer.

Pairwise descending vs all pairs distinct: All pairs distinct cross-layer style transfer seems to produce improvements over pairwise descending (Figure 9). This is in some contrast to Novak and Nikulin’s findings ([18], p5), which suggest “tying distant … layers produces poor results”.

FCT vs WCT: Fast universal cross-layer transfer (FCT) produces visually better results than the original WCT method of Li et al. [14], as Figure 8 shows. However, FCT has some of the same difficulties that WCT has. Both methods have difficulty reproducing crisp subshapes in styles.

Figure 10: This figure shows what happens when one controls only one (or one pair) of layers with the style loss. Left: controlling a single layer, with a within-layer gram matrix. Center: controlling two layers in sequence, but each with a within-layer gram matrix. Right: controlling two layers in sequence, but using only a cross-layer gram matrix. Notice that, as one would expect, controlling cross-layer gram matrices results in more pronounced effects and a wider range of spatial scales of effect. Furthermore, in comparison to controlling a pair of within-layer gram matrices, one is controlling fewer parameters.

Individual style loss control: When one controls the style loss using a single layer (or a single pair of layers), we can clearly see how each affects the stylized images (Figure 10). We observe that higher level style losses show stronger control over long scale patterns from style images; this agrees with similar observations in [8]. We also found that cross-layer gram matrices are better at preserving prominent boundaries of content images, while displaying equal or better control over long scale style patterns, compared to within-layer gram matrices at the same level.

(a) Style
(b) Style size 768
(c) Style size 512
(d) Style size 256
Figure 11: Each row of stylized images shows a transfer with the same style, but where the style image has been cropped to different sizes (style elements are large (=edge length 768), medium (=edge length 512) and small (=edge length 256), reading left to right). The first row shows cross-layer loss, the second row within-layer loss. Note that, when style elements are large, the cross-layer loss is better at preserving their structure (e.g., the large circles have fewer wiggles, etc.). Loss is multiplicative; notice the emphasis on outlines from multiplicative loss.

Scales: A crop of the style image will effectively result in transferring larger style elements. We expect that, when style elements are large compared to the content, cross-layer methods will have a strong advantage because they will be better able to preserve structural relations that make up style elements. Qualitative evidence supports this view (Figure 11 and Figure 12).

6 Conclusion

The cross-layer gram matrix creates summary statistics that capture the correlation between different layers; through the spatial product that forms the cross-layer gram matrix, higher layers can guide lower layers toward the most likely locations for feature activations. Therefore, we expect cross-layer gram matrices to perform better, especially on long scale patterns. Our experiments support this expectation. The cross-layer gram matrix imposes fewer constraints but gives better style control than the within-layer gram matrix.

(a) Style size 768
(b) Style size 512
(c) Style size 256
Figure 12: Each row shows a style transfer with the same style, but where the style image has been cropped to different sizes (style elements are large (=edge length 768), medium (=edge length 512) and small (=edge length 256), reading left to right). The first row shows cross-layer loss, the second row within-layer loss. Note that, when style elements are large, the cross-layer loss is better at preserving their structure (e.g., the long scale color coherence is preserved, and the large paint strokes have more detail and more relief, etc.). Loss is multiplicative; notice the emphasis on outlines from multiplicative loss.

Fast universal cross-layer style transfer successfully unifies universal style transfer with our inter-layer statistics, and indeed shows some intrinsic difference in both style transfer and texture synthesis.

Multiplicative style loss not only simplifies the search for a style weight by eliminating one hyperparameter, but also emphasizes the boundaries of content objects even when strong boundary information is present in the style summary statistics. It provides better style quality in terms of preserving content shape and keeping long scale style coherence. More examples are presented in the supplementary materials.


We give our deep thanks to Mao-Chuang’s supervisor, Professor David Forsyth, who provided many helpful suggestions for the research and helped us write the paper. We also thank Jason Rock for his help in setting up faster style transfer and for other related recommendations. Anand Bhattad also gave us many useful suggestions about style transfer; we thank him for his kind help.