The Contextual Loss
Feed-forward CNNs trained for image transformation problems rely on loss functions that measure the similarity between the generated image and a target image. Most of the common loss functions assume that these images are spatially aligned and compare pixels at corresponding locations. However, for many tasks, aligned training pairs of images will not be available. We present an alternative loss function that does not require alignment, thus providing an effective and simple solution for a new space of problems. Our loss is based on both context and semantics -- it compares regions with similar semantic meaning, while considering the context of the entire image. Hence, for example, when transferring the style of one face to another, it will translate eyes-to-eyes and mouth-to-mouth.
Many classic problems can be framed as image transformation tasks, where a system receives some source image and generates a corresponding output image. Examples include image-to-image translation [1, 2], super-resolution [3, 4, 5], and style transfer [6, 7, 8]. Samples of our results for some of these applications are presented in Figure 1.
One approach for solving image transformation tasks is to train a feed-forward convolutional neural network. The training is based on comparing the image generated by the network with a target image via a differentiable loss function. The commonly used loss functions for comparing images can be classified into two types: (i) Pixel-to-pixel loss functions that compare pixels at the same spatial coordinates, e.g., $L_2$ [3, 9], $L_1$ [1, 2, 10], and the perceptual loss (often computed at a coarse level). (ii) Global loss functions, such as the Gram loss, which successfully captures style [6, 8] and texture [4, 11] by comparing statistics collected over the entire image. Orthogonal to these are adversarial loss functions (GAN), which push the generated image to be of high likelihood given examples from the target domain. These are complementary and do not compare the generated and the target image directly.
Both types of image comparison loss functions have been shown to be highly effective for many tasks; however, there are some cases they do not address. Specifically, the pixel-to-pixel loss functions explicitly assume that the generated image and target image are spatially aligned. They are not designed for problems where the training data is, by definition, not aligned. This is the case, as illustrated in Figures 1 and 2, in tasks such as semantic style transfer, single-image animation, puppet control, and unpaired domain translation. Non-aligned images can be compared by the Gram loss; however, due to its global nature it translates global characteristics to the entire image. It cannot be used to constrain the content of the generated image, which is required in these applications.
In this paper we propose the Contextual Loss – a loss function targeted at non-aligned data. Our key idea is to treat an image as a collection of features, and measure the similarity between images, based on the similarity between their features, ignoring the spatial positions of the features. We form matches between features by considering all the features in the generated image, thus incorporating global image context into our similarity measure. Similarity between images is then defined based on the similarity between the matched features. This approach allows the generated image to spatially deform with respect to the target, which is the key to our ability to solve all the applications in Figure 2 with a feed-forward architecture. In addition, the Contextual loss is not overly global (which is the main limitation of the Gram loss) since it compares features, and therefore regions, based on semantics. This is why in Figure 1 style-transfer endowed Obama with Hillary’s eyes and mouth, and domain translation changed people’s gender by shaping/thickening their eyebrows and adding/removing makeup.
A nice characteristic of the Contextual loss is its tendency to maintain the appearance of the target image. This enables generation of images that look real even without using GANs, whose goal is specifically to distinguish between ‘real’ and ‘fake’, and are sometimes difficult to fine tune in training.
We show the utility and benefits of the Contextual loss through the applications presented in Figure 2. In all four applications we show state-of-the-art or comparable results without using GANs. In style transfer, we offer an advancement by translating style in a semantic manner, without requiring segmentation. In the tasks of puppet control and single-image animation we show a significant improvement over previous attempts, which were based on pixel-to-pixel loss functions. Finally, we succeed in domain translation without paired data, outperforming CycleGAN, even though we use a single feed-forward network, while CycleGAN trains four networks (two generators and two discriminators).
Our key contribution is a new loss function that could be effective for many image transformation tasks. We review here the most relevant approaches for solving image-to-image translation and style transfer, which are the application domains we experiment with.
Image-to-image translation includes tasks whose goal is to transform images from an input domain to a target domain, for example, day-to-night, horse-to-zebra, label-to-image, BW-to-color, edges-to-photo, summer-to-winter, photo-to-painting and many more. Isola et al. (pix2pix) obtained impressive results with a feed-forward network and adversarial training (GAN). Their solution demanded pairs of aligned input-target images for training the network with a pixel-to-pixel loss function ($L_1$ or $L_2$). Chen and Koltun proposed a Cascaded Refinement Network (CRN) for solving label-to-image, where an image is generated from an input semantic label map. Their solution also used pixel-to-pixel losses (Perceptual and $L_1$), and was later extended with a GAN. These approaches require paired and aligned training images.
Domain transfer has recently been applied also to problems where paired training data is not available [2, 14, 15]. To overcome the lack of training pairs, the simple feed-forward architectures were replaced with more complex ones. The key idea is that translating from one domain to the other, and then going back, should take us to our starting point. This was modeled by complex architectures, e.g., in CycleGAN four different networks are required. The circular process sometimes suffers from the mode collapse problem, a prevalent phenomenon in GANs, where data from multiple modes of a domain map to a single mode of a different domain.
Style transfer aims at transferring the style of a target image to an input image [16, 17, 18, 19]. Most relevant to our study are approaches based on CNNs. These differ mostly in the choice of architecture and loss function [6, 7, 8, 20, 21]. Gatys et al. presented stunning results obtained by optimizing with a gradient-based solver. They used the pixel-to-pixel Perceptual loss to maintain similarity to the input image and proposed the Gram loss to capture the style of the target. Their approach allows for arbitrary style images, but this comes at a high computational cost. Methods with lower computational cost have also been proposed [8, 21, 22, 23]. The speedup was obtained by replacing the optimization with training a feed-forward network. The main drawback of these latter methods is that they need to be re-trained for each new target style.
Another line of work aims at semantic style transfer, where the goal is to transfer style across regions of corresponding semantic meaning, e.g., sky-to-sky and trees-to-trees (in the methods listed above the target style is transferred globally to the entire image). One approach is to replace deep features of the input image with matching features of the target and then invert the features via efficient optimization or through a pre-trained decoder. Li et al. integrate a Markov Random Field into the output synthesis process (CNNMRF). Since the matching in these approaches is between neural features, semantic correspondence is obtained. A different approach to semantic style transfer is based on segmenting the image into regions according to semantic meaning [25, 26]. This leads to semantic transfer, but depends on the success of the segmentation process. A histogram loss has also been suggested in order to synthesize textures that match the target statistically. This improves the color faithfulness but does not contribute to the semantic matching. Finally, there are also approaches tailored to a specific domain and style, such as faces or time-of-day in city-scape images [28, 29].
Our goal is to design a loss function that can measure the similarity between images that are not necessarily aligned. Comparison of non-aligned images is also the core of template matching methods, which look for image windows that are similar to a given template under occlusions and deformations. Recently, Talmi et al. proposed a statistical approach for template matching with impressive results. Their measure of similarity, however, has no meaningful derivative; hence, we cannot adopt it as a loss function for training networks. We do, nonetheless, draw inspiration from their underlying observations.
We start by defining a measure of similarity between a pair of images. Our key idea is to represent each image as a set of high-dimensional points (features), and consider two images as similar if their corresponding sets of points are similar. As illustrated in Figure 3, we consider a pair of images as similar when for most features of one image there exist similar features in the other. Conversely, when the images are different from each other, many features of each image would have no similar feature in the other image. Based on this observation we formulate the contextual similarity measure between images.
Figure 3: (a) Similar. (b) Not similar.
Given an image $x$ and a target image $y$ we represent each as a collection of points (e.g., VGG19 features): $X = \{x_i\}$ and $Y = \{y_j\}$. We assume $|Y| = |X| = N$ (and sample $N$ points from the bigger set when $|Y| \neq |X|$). To calculate the similarity between the images we find for each feature $y_j$ the feature $x_i$ that is most similar to it, and then sum the corresponding feature similarity values over all $y_j$. Formally, the contextual similarity between images is defined as:
$$CX(x, y) = CX(X, Y) = \frac{1}{N} \sum_j \max_i CX_{ij} \tag{1}$$
where $CX_{ij}$, to be defined next, is the similarity between features $x_i$ and $y_j$.
We incorporate global image context via our definition of the similarity $CX_{ij}$ between features. Specifically, we consider feature $x_i$ as contextually similar to feature $y_j$ if it is significantly closer to it than to all other features in $Y$. When this is not the case, i.e., $x_i$ is not closer to any particular $y_j$, then its contextual similarity to all $y_j$ should be low. This approach is robust to the scale of the distances, e.g., if $x_i$ is far from all $y_j$ then $CX_{ij}$ will be low for all $j$, regardless of how far away $x_i$ is. Figure 4 illustrates these ideas via examples.
We next formulate this mathematically. Let $d_{ij}$ be the cosine distance between $x_i$ and $y_j$, that is, $d_{ij} = 1 - \frac{(x_i - \mu_y) \cdot (y_j - \mu_y)}{\|x_i - \mu_y\|_2 \, \|y_j - \mu_y\|_2}$ where $\mu_y = \frac{1}{N} \sum_j y_j$. We consider features $x_i$ and $y_j$ as similar when $d_{ij} \ll d_{ik}, \forall k \neq j$. To capture this we start by normalizing the distances:
$$\tilde{d}_{ij} = \frac{d_{ij}}{\min_k d_{ik} + \epsilon} \tag{2}$$
for a fixed small $\epsilon > 0$. We shift from distances to similarities by exponentiation:
$$w_{ij} = \exp\left(\frac{1 - \tilde{d}_{ij}}{h}\right) \tag{3}$$
where $h > 0$ is a band-width parameter. Finally, we define the contextual similarity between features to be a scale-invariant version of the normalized similarities:
$$CX_{ij} = \frac{w_{ij}}{\sum_k w_{ik}} \tag{4}$$
Since the Contextual Similarity sums over normalized values we get that $CX(X, Y) \in [0, 1]$. Comparing an image to itself yields $CX(X, X) = 1$, since the feature similarity values will be $CX_{ii} = 1$ and $0$ otherwise. At the other extreme, when the sets of features are far from each other then $CX_{ij} \approx \frac{1}{N}, \forall i, j$, and thus $CX(X, Y) \approx \frac{1}{N} \to 0$. We further observe that binarizing the values by setting $CX_{ij} = 1$ if $w_{ij} > w_{ik}, \forall k \neq j$ and $0$ otherwise, is equivalent to finding the Nearest Neighbor in $Y$ for every feature in $X$. In this case we get that $CX(X, Y)$ is equivalent to counting how many features in $Y$ are a Nearest Neighbor of a feature in $X$, which is exactly the template matching measure proposed by Talmi et al.
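The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `contextual_similarity`, the default values of `h` and `eps`, and the small constant guarding the normalization are assumptions made for the example.

```python
import numpy as np

def contextual_similarity(X, Y, h=0.5, eps=1e-5):
    """Contextual similarity CX(X, Y) between two feature sets.

    X, Y: arrays of shape (N, D); row i of X is x_i, row j of Y is y_j.
    h and eps are illustrative defaults (assumptions of this sketch).
    """
    # Cosine distance after centering both sets by the mean of Y.
    mu_y = Y.mean(axis=0, keepdims=True)
    Xc, Yc = X - mu_y, Y - mu_y
    Xn = Xc / (np.linalg.norm(Xc, axis=1, keepdims=True) + 1e-12)
    Yn = Yc / (np.linalg.norm(Yc, axis=1, keepdims=True) + 1e-12)
    d = np.clip(1.0 - Xn @ Yn.T, 0.0, None)             # d_ij, shape (N, N)

    # Normalize each row by its smallest distance (Eq. 2).
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)
    # Turn distances into similarities (Eq. 3).
    w = np.exp((1.0 - d_tilde) / h)
    # Make each row sum to one (Eq. 4).
    cx_ij = w / w.sum(axis=1, keepdims=True)
    # For every y_j take its best matching x_i, then average over j (Eq. 1).
    return cx_ij.max(axis=0).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))
Y = rng.normal(size=(50, 16))
print("CX(X, X):", contextual_similarity(X, X))
print("CX(X, Y):", contextual_similarity(X, Y))
```

Because the measure operates on unordered sets, shuffling the rows of either input leaves the value unchanged, which is what allows it to ignore spatial positions.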
For training a generator network we need to define a loss function, based on the contextual similarity of Eq. (1). Let $x$ and $y$ be two images to be compared. We extract the corresponding set of features from the images by passing them through a perceptual network $\Phi$, where in all of our experiments $\Phi$ is VGG19. Let $\Phi^l(x)$, $\Phi^l(y)$ denote the feature maps extracted from layer $l$ of the perceptual network $\Phi$ of the images $x$ and $y$, respectively. The contextual loss is defined as:
$$L_{CX}(x, y, l) = -\log\left(CX\left(\Phi^l(x), \Phi^l(y)\right)\right) \tag{5}$$
In image transformation tasks we train a network $G$ to map a given source image $s$ into an output image $G(s)$. To demand similarity between the generated image and the target $t$ we use the loss $L_{CX}(G(s), t, l)$. Often we also demand similarity to the source image by the loss $L_{CX}(G(s), s, l)$. In Section 4 we describe in detail how we use such loss functions for the different applications and what values we select for $l$.
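As a quick sanity check on the scale of this loss, the following short derivation assumes the loss is the negative logarithm of the contextual similarity and that the feature similarities $CX_{ij}$ sum to one over $j$ for every $i$, as in the definitions above. The index $i_0$ is an arbitrary fixed row introduced only for the argument.

```latex
\sum_j \max_i CX_{ij} \;\ge\; \sum_j CX_{i_0 j} = 1
\quad\Longrightarrow\quad
\frac{1}{N} \le CX(X, Y) \le 1
\quad\Longrightarrow\quad
0 \le -\log CX(X, Y) \le \log N .
```

So the loss is zero when the two feature sets coincide and is bounded by $\log N$ when they are far apart. With $N = 1024$ features, for example, the loss cannot exceed $\log 1024 \approx 6.93$.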
Other loss functions: In the following we compare the Contextual loss to other popular loss functions. We provide here their definitions for completeness:
The $L_1$ loss $L_1(x, y) = \|x - y\|_1$.
The $L_2$ loss $L_2(x, y) = \|x - y\|_2$.
The first two are pixel-to-pixel loss functions that require alignment between the images $x$ and $y$. The Gram loss is global and robust to pixel locations.
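The contrast between pixel-to-pixel and global losses can be illustrated with a small NumPy sketch. The function names, the normalization of the Gram matrix by the number of positions, and the random features standing in for a network layer are all assumptions made for this example.

```python
import numpy as np

def l1_loss(x, y):
    """Sum of absolute differences at corresponding positions."""
    return np.abs(x - y).sum()

def l2_loss(x, y):
    """Euclidean norm of the difference at corresponding positions."""
    return np.sqrt(((x - y) ** 2).sum())

def gram_matrix(feats):
    """feats: (N, C) array with one C-dimensional feature per spatial position.

    Dividing by N is one common normalization (an assumption here).
    """
    return feats.T @ feats / feats.shape[0]

def gram_loss(fx, fy):
    """Squared Frobenius norm of the difference between Gram matrices."""
    diff = gram_matrix(fx) - gram_matrix(fy)
    return (diff ** 2).sum()

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))        # stand-in for a layer's features
shuffled = feats[rng.permutation(64)]   # same features, positions scrambled

print("L1 after shuffling positions:  ", l1_loss(feats, shuffled))
print("L2 after shuffling positions:  ", l2_loss(feats, shuffled))
print("Gram after shuffling positions:", gram_loss(feats, shuffled))
```

Scrambling the positions changes the pixel-to-pixel losses substantially, while the Gram loss stays at zero up to rounding, since the Gram matrix only records statistics summed over all positions.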
The Contextual loss compares sets of features; thus, implicitly, it can be thought of as a way of comparing distributions. To support this observation we provide empirical statistical analysis, similar to that presented in [30, 32]. Our goal is to show that the expectation of $CX(X, Y)$ is maximal when the points in $X$ and $Y$ are drawn from the same distribution, and drops sharply as the distance between the two distributions increases. This is done via a simplified mathematical model, in which each image is modeled as a set of points drawn from a 1D Gaussian distribution. We compute the similarity between images for varying distances between the underlying Gaussians. Figure 5 presents the resulting approximated expected values. It can be seen that $CX(X, Y)$ is likely to be maximized when the distributions are the same, and falls rapidly as the distributions move apart from each other. Finally, similar to [30, 32], one can show that this holds also for the multi-dimensional case.
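A simulation in the spirit of this analysis can be sketched as follows. It is an assumption-laden illustration, not the experiment behind Figure 5: it uses the squared Euclidean distance between scalars (the cosine distance is degenerate in 1D), and the set size, number of trials, shifts, `h` and `eps` are arbitrary choices.

```python
import numpy as np

def cx_1d(x, y, h=0.5, eps=1e-5):
    """Contextual similarity between two sets of scalars.

    Uses squared Euclidean distance, an assumption made for the 1D case.
    """
    d = (x[:, None] - y[None, :]) ** 2
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)
    w = np.exp((1.0 - d_tilde) / h)
    cx_ij = w / w.sum(axis=1, keepdims=True)
    return cx_ij.max(axis=0).mean()

def expected_cx(shift, n_points=100, n_trials=50, seed=0):
    """Monte Carlo estimate of E[CX] for two unit-variance Gaussians
    whose means are `shift` apart."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        x = rng.normal(0.0, 1.0, n_points)
        y = rng.normal(shift, 1.0, n_points)
        values.append(cx_1d(x, y))
    return float(np.mean(values))

shifts = [0.0, 1.0, 2.0, 4.0, 8.0]
curve = [expected_cx(s) for s in shifts]
for s, v in zip(shifts, curve):
    print(f"shift={s:4.1f}  estimated E[CX]={v:.3f}")
```

Under these assumptions the estimate should be highest when both sets come from the same Gaussian and should decay as the means move apart, mirroring the qualitative behavior described above.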
In order to examine the robustness of the contextual loss to non-aligned data, we designed the following toy experiment. Given a single noisy image $s$ and multiple clean images of the same scene (targets $t^k$), the goal is to reconstruct a clean image. The target images $t^k$ are not aligned with the noisy source image $s$. In our toy experiment the source and target images were obtained by random crops of the same image, with random translations. We added random noise to the crop selected as source $s$. Reconstruction was performed by iterative optimization using gradient descent, where we directly update the image values of $s$. That is, we minimize the objective function $L(s, t^k)$, where $L$ is either the Contextual loss $L_{CX}$ or a pixel-to-pixel loss, and we iterate over the targets $t^k$. In this specific experiment the features we use for the contextual loss are vectorized RGB patches (and not VGG19 features).
The results, presented in Figure 6, show that optimizing with the pixel-to-pixel loss yields a drastically blurred image, because it cannot properly compare non-aligned images. The contextual loss, on the other hand, is designed to be robust to spatial deformations. Therefore, optimizing with $L_{CX}$ leads to complete noise removal, without ruining the image details.
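The effect of misalignment on the two kinds of comparison can be illustrated without running the optimization. The sketch below is not the toy experiment itself: it uses a random-noise "scene", hypothetical crop offsets and patch size, and only compares loss values for a translated crop against an unrelated image. The compact `contextual_similarity` follows the definitions given earlier in the text.

```python
import numpy as np

def extract_patches(img, size=5, stride=1):
    """Return all size x size patches of an (H, W, 3) image as flat vectors."""
    H, W, _ = img.shape
    return np.array([img[r:r + size, c:c + size].ravel()
                     for r in range(0, H - size + 1, stride)
                     for c in range(0, W - size + 1, stride)])

def contextual_similarity(X, Y, h=0.5, eps=1e-5):
    """Contextual similarity between feature sets X and Y, each (N, D)."""
    mu_y = Y.mean(axis=0, keepdims=True)
    Xc, Yc = X - mu_y, Y - mu_y
    Xn = Xc / (np.linalg.norm(Xc, axis=1, keepdims=True) + 1e-12)
    Yn = Yc / (np.linalg.norm(Yc, axis=1, keepdims=True) + 1e-12)
    d = np.clip(1.0 - Xn @ Yn.T, 0.0, None)
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)
    w = np.exp((1.0 - d_tilde) / h)
    cx_ij = w / w.sum(axis=1, keepdims=True)
    return cx_ij.max(axis=0).mean()

rng = np.random.default_rng(0)
scene = rng.random((32, 32, 3))          # synthetic stand-in for a photo
source = scene[4:20, 4:20]               # one crop of the scene
target = scene[6:22, 5:21]               # same scene, translated crop
unrelated = rng.random((16, 16, 3))      # a different image altogether

# Pixel-to-pixel comparison: the translated crop looks as "wrong" as noise.
l2_shifted = np.sqrt(((source - target) ** 2).sum())
l2_unrelated = np.sqrt(((source - unrelated) ** 2).sum())

# Patch-set comparison: the translated crop still shares most patches.
cx_shifted = contextual_similarity(extract_patches(source),
                                   extract_patches(target))
cx_unrelated = contextual_similarity(extract_patches(source),
                                     extract_patches(unrelated))

print("L2:", l2_shifted, l2_unrelated)
print("CX:", cx_shifted, cx_unrelated)
```

In this synthetic setting the pixel-to-pixel distance cannot tell the translated crop from an unrelated image, whereas the patch-based contextual similarity remains high for the translated crop because most of its patches have an exact counterpart.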
Figure 6: (a) Noisy input. (b) Clean targets. (c) Pixel-to-pixel loss. (d) Contextual loss.
We refer the reader to our follow-up work, where additional theoretical and empirical analysis of the contextual loss is presented.
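Table 1: Applications and the architecture used for each.
| Application | Architecture |
| Style transfer | Optim. |
| Single-image animation | CRN |
| Puppet control | CRN |
| Domain transfer | CRN (compared with CycleGAN) |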
We experiment on the tasks presented in Figure 2. To assess the contribution of the proposed loss function we adopt for each task a state-of-the-art architecture and modify only the loss functions. In some tasks we also compare to other recent solutions. For all applications we used TensorFlow and the Adam optimizer with the default parameters. Unless otherwise mentioned, we use a fixed value for the band-width parameter $h$ of Eq. (3).
The tasks and the corresponding setups are summarized in Table 1. We use the shorthand notation $L^t = L(G(s), t)$ to demand similarity between the generated image $G(s)$ and the target $t$, and $L^s = L(G(s), s)$ to demand similarity to the source image $s$. The subscripted notation $L_{type}$ stands for either the proposed $L_{CX}$ or one of the common loss functions defined in Section 3.2.
Figure 7: Columns, left to right: Source, Target, Gatys et al., CNNMRF, Ours.
In style transfer the goal is to translate the style of a target image $t$ onto a source image $s$. A landmark approach, introduced by Gatys et al., is to minimize a combination of two loss functions: the perceptual loss $L_P$ to maintain the content of the source image $s$, and the Gram loss $L_{Gram}$ to enforce style similarity to the target $t$.
We claim that the Contextual loss is a good alternative for both. By construction it makes a good choice for the style term, as it does not require alignment. Moreover, it will allow transferring style features between regions according to their semantic similarity, rather than globally over the entire image, which is what one gets with the Gram loss. The Contextual loss is also a good choice for the content term since it demands similarity to the source, but allows some positional deformations. Such deformations are advantageous, since due to the style change the stylized and source images will not be perfectly aligned.
To support these claims we adopt the optimization-based framework of Gatys et al. (we used the implementation at https://github.com/anishathalye/neural-style), which directly minimizes the loss through an iterative process, and replace their objective with:
$$L(G) = L_{CX}(G(s), t, l_t) + L_{CX}(G(s), s, l_s) \tag{6}$$
where $l_s$ denotes the layer used to capture content and $l_t$ the layers used to capture style. We set $h$ to 0.1 and 0.2 for the content term and style term, respectively. In our implementation we reduced memory consumption by randomly sampling a subset of a layer's features.
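The memory saving from random sampling follows from the size of the pairwise distance matrix, which grows with the square of the number of features. A minimal sketch of such sampling is given below; the helper name `sample_features`, the layer size, and the number of kept features are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_features(feats, n_samples, rng):
    """Randomly keep at most n_samples rows of an (N, D) feature array."""
    n = feats.shape[0]
    if n <= n_samples:
        return feats
    idx = rng.choice(n, size=n_samples, replace=False)
    return feats[idx]

rng = np.random.default_rng(0)
# Hypothetical layer: 128 x 128 spatial positions, 64 channels each.
layer_feats = rng.normal(size=(128 * 128, 64))
reduced = sample_features(layer_feats, 4096, rng)

full_pairs = layer_feats.shape[0] ** 2
reduced_pairs = reduced.shape[0] ** 2
print("pairwise entries before:", full_pairs)
print("pairwise entries after: ", reduced_pairs)
```

Keeping a quarter of the features in this example shrinks the distance matrix by a factor of sixteen, at the cost of comparing only a random subset of the features at each step.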
Figure 8 presents a few example results. It can be seen that the style is transferred across corresponding regions, e.g., eyes-to-eyes, hair-to-hair, etc. In Figure 7 we compare our style transfer results with two other methods: Gatys et al. and CNNMRF. The only difference between our setup and theirs is the loss function, as all three use the same optimization framework. It can be seen that our approach transfers the style semantically across regions, whereas in Gatys' approach the style is spread all over the image, without semantics. CNNMRF, on the other hand, does aim for semantic transfer. It is based on nearest neighbor matching of features, which indeed succeeds in replacing semantically corresponding features; however, it suffers from severe artifacts.
In single-image animation the data consists of many animation images from a source domain (e.g., one person) and only a single image $t$ from a target domain (e.g., another person). The goal is to animate the target image according to the input source images. This implies that, by the problem definition, the generated images $G(s)$ are not aligned with the target $t$.
This problem setup is naturally handled by the Contextual loss. We use it both to maintain the animation (spatial layout) of the source $s$ and to maintain the appearance of the target $t$:
$$L(G) = L_{CX}(G(s), t, l_t) + L_{CX}(G(s), s, l_s) \tag{7}$$
We used the CRN architecture and trained it for 10 epochs on 1000 input frames.
Results are shown in Figure 9. We are not aware of previous work that solves this task with a generator network. We note, however, that our setup is somewhat related to fast style transfer, since effectively the network is trained to generate images with content similar to the input (source) but with style similar to the target. Hence, as a baseline for comparison, we trained the same CRN architecture and replaced only the objective with a combination of the Perceptual and Gram losses, as proposed for fast style transfer. It can be seen that using our Contextual loss is much more successful, leading to significantly fewer artifacts.
Our task here is somewhat similar to single-image animation. We wish to animate a target “puppet” according to provided images of a “driver” person (the source). This time, however, available to us are training pairs of source-target (driver-puppet) images that are semi-aligned. Specifically, we repeated an experiment published online, where Brannon Dorsey (the driver) tried to control Ray Kurzweil (the puppet) (B. Dorsey, https://twitter.com/brannondorsey/status/808461108881268736). For training he filmed a video (K frames) of himself imitating Kurzweil’s motions. Then, given a new video of Brannon, the goal is to generate a corresponding animation of the puppet Kurzweil.
The generated images should look like the target puppet, hence we use the Contextual loss to compare them. In addition, since in this particular case the training data available to us consists of pairs of images that are semi-aligned, they do share a very coarse level similarity in their spatial arrangement. Hence, to further refine the optimization we add a Perceptual loss, computed at a very coarse level, that does not require alignment. Our overall objective is:
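$$L(G) = L_{CX}(G(s), t, l_{CX}) + \lambda_P \cdot L_P(G(s), t, l_P) \tag{8}$$
where the layers $l_{CX}$ and $l_P$ and the weight $\lambda_P$ are chosen to let the contextual loss dominate. As architecture we again selected CRN and trained it for 20 epochs.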
We compare our approach with three alternatives: (i) Using the exact same CRN architecture, but with a pixel-to-pixel loss function instead of $L_{CX}$. (ii) The Pix2pix architecture, which uses $L_1$ and adversarial training (GAN), since this was the original experiment. (iii) CycleGAN, which treats the data as unpaired, compares images with $L_1$, and uses adversarial training (GAN). Results are presented in Figure 10. It can be seen that the puppet animation generated with our approach is much sharper, with significantly fewer artifacts, and captures nicely the poses of the driver, even though we do not use a GAN.
Finally, we use the Contextual loss also in the unpaired scenario of domain transfer. We experimented with gender change, i.e., making male portraits more feminine and vice versa. Since the data is unpaired (i.e., we do not have the female versions of the male images) we sample random pairs of images from the two domains. As the Contextual loss is robust to misalignments this is not a problem. We use the exact same architecture and loss as in single-image-animation.
Our results, presented in Figure 11, are quite successful when compared with CycleGAN. This is a nice outcome since our approach provides a much simpler alternative: while the CycleGAN framework trains four networks (two generators and two discriminators), our approach uses a single feed-forward generator network (without GAN). This is possible because the Contextual loss does not require aligned data, and hence can naturally train on non-aligned random pairs.
We proposed a novel loss function for image generation that naturally handles tasks with non-aligned training data. We have applied it for four different applications and showed state-of-the-art (or comparable) results on all.
In our follow-up work we suggest using the Contextual loss for realistic restoration, specifically for the tasks of super-resolution and surface normal estimation. We draw a theoretical connection between the Contextual loss and KL-divergence, which is supported by empirical evidence. In future work we hope to seek other loss functions that could overcome further drawbacks of the existing ones.
In the supplementary we present limitations of our approach, ablation studies, and explore variations of the proposed loss.
Acknowledgements: This research was supported by the Israel Science Foundation under Grant 1089/16 and by the Ollendorf foundation.
Image-to-image translation with conditional adversarial networks. In: CVPR (2017)