Generating Images with Perceptual Similarity Metrics based on Deep Networks

02/08/2016, by Alexey Dosovitskiy et al.

Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects the perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification of a variational autoencoder, and inversion of deep convolutional networks. In all cases, the generated images look sharp and resemble natural images.







1 Introduction

Recently there has been a surge of interest in training neural networks to generate images. These are used for a wide variety of applications: unsupervised and semi-supervised learning, generative models, analysis of learned representations, analysis by synthesis, learning of 3D representations, and future prediction in videos. Nevertheless, there is little work studying loss functions appropriate for the image generation task. The typically used squared Euclidean distance between images often yields blurry results; see Fig. 1b. This is especially the case when there is inherent uncertainty in the prediction. For example, suppose we aim to reconstruct an image from its feature representation. The precise location of all details may not be preserved in the features. A loss in image space leads to averaging over all likely locations of details, and hence the reconstruction looks blurry.

However, the exact locations of fine details are not important for the perceptual similarity of images, whereas the distribution of these details plays a key role. Our main insight is that invariance to irrelevant transformations and sensitivity to local image statistics can be achieved by measuring distances in a suitable feature space. In fact, convolutional networks provide a feature representation with the desired properties: they are invariant to small smooth deformations, but sensitive to perceptually important image properties, such as sharp edges and textures.

Using a distance in feature space alone, however, does not yet yield a good loss function; see Fig. 1d. Since feature representations are typically contractive, many images, including non-natural ones, get mapped to the same feature vector. Hence, we must introduce a natural image prior. To this end, we build upon adversarial training as proposed by Goodfellow et al. (2014). We train a discriminator network to distinguish the output of the generator from real images. The objective of the generator is to trick the discriminator, i.e., to generate images that the discriminator cannot distinguish from real ones. This yields a natural image prior that selects, from all potential generator outputs, the most realistic one. Combining similarity in an appropriate feature space with adversarial training yields the best results; see Fig. 1e.

Figure 1: Reconstructions from layer fc6 of AlexNet with different losses: a) original, b) image-space loss, c) image-space + adversarial loss, d) image-space + feature loss, e) ours.

We show three example applications: image compression with an autoencoder, a generative model based on a variational autoencoder, and inversion of the AlexNet convolutional network. We demonstrate that an autoencoder with DeePSiM loss can compress images while preserving information about fine structures. On the generative modeling side, we show that a version of a variational autoencoder trained with the new loss produces images with realistic image statistics. Finally, reconstructions obtained with our method from high-level activations of AlexNet are dramatically better than with existing approaches. They demonstrate that even the predicted class probabilities contain rich texture, color, and position information.

2 Related work

There is a long history of neural network based models for image generation. A prominent class of probabilistic models of images are restricted Boltzmann machines (Hinton & Sejnowski, 1986; Smolensky, 1987; Hinton & Salakhutdinov, 2006) and their deep variants (Hinton et al., 2006; Salakhutdinov & Hinton, 2009; Lee et al., 2009). Autoencoders (Hinton & Salakhutdinov, 2006; Vincent et al., 2008) have been widely used for unsupervised learning and generative modeling, too. Recently, stochastic neural networks (Bengio et al., 2014; Kingma et al., 2014; Gregor et al., 2015) have become popular, and deterministic networks are being used for image generation tasks (Dosovitskiy et al., 2015b). In all these models, the loss is measured in the image space. By combining convolutions and un-pooling (upsampling) layers (Lee et al., 2009; Goodfellow et al., 2014; Dosovitskiy et al., 2015b), these models can be applied to large images.

There is a large body of work on assessing the perceptual similarity of images. Some prominent examples are the visible differences predictor (Daly, 1993), the spatio-temporal model for moving picture quality assessment (van den Branden Lambrecht & Verscheure, 1996), and the perceptual distortion metric of Winkler (1998). The most popular perceptual image similarity metric is the structural similarity metric (SSIM) (Wang et al., 2004), which compares the local statistics of image patches. We are not aware of any work using such similarity metrics for training machine learning models, except a recent pre-print of Ridgeway et al. (2015). They train autoencoders by directly maximizing the SSIM similarity of images. This resembles in spirit what we do, but is technically very different. While psychophysical experiments are beyond the scope of this paper, we believe that deep learned feature representations have more potential than the shallow, hand-designed SSIM.

Generative adversarial networks (GANs) were proposed by Goodfellow et al. (2014). In theory, this training procedure can lead to a generator that perfectly models the data distribution. In practice, training GANs is difficult and often leads to oscillatory behavior, divergence, or modeling only part of the data distribution. Recently, several modifications have been proposed that make GAN training more stable. Denton et al. (2015) employ a multi-scale approach, gradually generating higher resolution images. Radford et al. (2015) make use of a convolutional-deconvolutional architecture and batch normalization.

GANs can be trained conditionally by feeding the conditioning variable to both the discriminator and the generator (Mirza & Osindero, 2014). Usually this conditioning variable is a one-hot encoding of the object class in the input image. Such GANs learn to generate images of objects from a given class. Recently, Mathieu et al. (2015) used GANs for predicting future frames in videos by conditioning on previous frames. Our approach looks similar to a conditional GAN. However, in a GAN there is no loss directly comparing the generated image to some ground truth. We found the feature loss introduced in the present paper essential for training on complicated tasks such as feature inversion.

Most related is the concurrent work of Larsen et al. (2015). The general idea is the same — to measure the similarity not in the image space, but rather in a feature space. They also use adversarial training to improve the realism of the generated images. However, Larsen et al. (2015) only apply this approach to a variational autoencoder trained on images of faces, and measure the similarity between features extracted from the discriminator. Our approach is much more general: we apply it to a variety of natural images, and we demonstrate three different applications.

3 Model

Suppose we are given a supervised learning task and a training set of input-target pairs $\{(\mathbf{y}_i, \mathbf{x}_i)\}$, $i = 1, \ldots, N$. Inputs and outputs can be arbitrary vectors. In this work, we focus on targets that are images with an arbitrary number of channels.

The aim is to learn the parameters $\theta$ of a differentiable generator function $G_{\theta}(\cdot)$ that optimally approximates the input-target dependency according to a loss function $\mathcal{L}(G_{\theta}(\mathbf{y}), \mathbf{x})$. Typical choices are the squared Euclidean (SE) loss $\|G_{\theta}(\mathbf{y}) - \mathbf{x}\|_2^2$ or the $\ell_1$ loss $\|G_{\theta}(\mathbf{y}) - \mathbf{x}\|_1$. As we demonstrate in this paper, these losses are suboptimal for some image generation tasks.

We propose a new class of losses, which we call DeePSiM. These go beyond simple distances in image space and can capture complex and perceptually important properties of images. These losses are weighted sums of three terms: a feature loss $\mathcal{L}_{feat}$, an adversarial loss $\mathcal{L}_{adv}$, and a pixel-space loss $\mathcal{L}_{img}$:

$$\mathcal{L} = \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{img}\,\mathcal{L}_{img}.$$
They correspond to a network architecture, an overview of which is shown in Fig. 2. The architecture consists of three convolutional networks: the generator that implements the generator function, the discriminator that discriminates generated images from natural images, and the comparator that computes features from images.
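The weighted combination of the three terms can be sketched as follows. This is a minimal numerical illustration, not the paper's implementation: the comparator and the discriminator output are toy stand-ins, and the weights `w_feat`, `w_adv`, `w_img` are hypothetical knobs corresponding to the $\lambda$ coefficients.

```python
import numpy as np

def deepsim_loss(gen_img, target_img, comparator, disc_prob_real,
                 w_feat=1.0, w_adv=1.0, w_img=1.0):
    """Weighted sum of feature, adversarial, and pixel-space losses.

    comparator: a (toy) feature extractor C; disc_prob_real: the
    discriminator's probability that the generated image is real."""
    l_feat = np.sum((comparator(gen_img) - comparator(target_img)) ** 2)
    l_adv = -np.log(disc_prob_real + 1e-12)  # generator wants D to say "real"
    l_img = np.sum((gen_img - target_img) ** 2)
    return w_feat * l_feat + w_adv * l_adv + w_img * l_img
```

With a perfect reconstruction and a fully fooled discriminator, all three terms vanish.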

Figure 2: Schematic of our model. Black solid lines denote the forward pass. Dashed lines with arrows on both ends are the losses. Thin dashed lines denote the flow of gradients.

Loss in feature space. Given a differentiable comparator $C$ that maps images to feature vectors, we define

$$\mathcal{L}_{feat} = \sum_i \| C(G_{\theta}(\mathbf{y}_i)) - C(\mathbf{x}_i) \|_2^2 .$$

$C$ may be fixed or may be trained; for example, it can be a part of the generator or the discriminator.

$\mathcal{L}_{feat}$ alone does not provide a good loss for training. It is known (Mahendran & Vedaldi, 2015) that optimizing just for similarity in the feature space typically leads to high-frequency artifacts. This is because for each natural image there are many non-natural images mapped to the same feature vector (unless the feature representation is specifically designed to map natural and non-natural images far apart, such as one extracted from the discriminator of a GAN). Therefore, a natural image prior is necessary to constrain the generated images to the manifold of natural images.
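The many-to-one problem can be made concrete with a toy contractive comparator. Here average pooling plays the role of $C$ (an assumption for illustration only): two clearly different patches produce identical features, so the feature loss cannot distinguish them.

```python
import numpy as np

def avg_pool_comparator(img, k=2):
    """Toy contractive comparator: k-by-k average pooling."""
    h, w = img.shape
    return img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

natural = np.array([[1.0, 0.0],
                    [0.0, 1.0]])   # a checkered patch
artifact = np.array([[0.5, 0.5],
                     [0.5, 0.5]])  # a flat gray patch, same mean

# Both patches average to 0.5, so the feature loss between them is exactly 0,
# although the images differ -- hence the need for a natural image prior.
feat_loss = np.sum((avg_pool_comparator(natural)
                    - avg_pool_comparator(artifact)) ** 2)
```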

Adversarial loss. Instead of manually designing a prior, as in Mahendran & Vedaldi (2015), we learn it with an approach similar to the Generative Adversarial Networks (GANs) of Goodfellow et al. (2014). Namely, we introduce a discriminator $D_{\varphi}$ which aims to discriminate the generated images from real ones, and which is trained concurrently with the generator $G_{\theta}$. The generator is trained to “trick” the discriminator network into classifying the generated images as real. Formally, the parameters $\varphi$ of the discriminator are trained by minimizing

$$\mathcal{L}_{discr} = -\sum_i \log\big(D_{\varphi}(\mathbf{x}_i)\big) + \log\big(1 - D_{\varphi}(G_{\theta}(\mathbf{y}_i))\big),$$

and the generator is trained to minimize

$$\mathcal{L}_{adv} = -\sum_i \log\big(D_{\varphi}(G_{\theta}(\mathbf{y}_i))\big).$$
Loss in image space. Adversarial training is known to be unstable and sensitive to hyperparameters. We found that adding a loss in the image space,

$$\mathcal{L}_{img} = \sum_i \| G_{\theta}(\mathbf{y}_i) - \mathbf{x}_i \|_2^2 ,$$

stabilizes training.

3.1 Architectures

Generators. We used several different generators in our experiments. They are task-specific, so we describe them in the corresponding sections below. All tested generators make use of up-convolutional ('deconvolutional') layers, as in Dosovitskiy et al. (2015b). An up-convolutional layer consists of up-sampling and a subsequent convolution. In this paper we always up-sample by a factor of 2 using 'bed of nails' upsampling.

In all networks we use leaky ReLU nonlinearities, that is, $\mathrm{LReLU}(x) = \max(x, \alpha x)$ with a small fixed slope $\alpha$. All generators have linear output layers.
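The leaky ReLU used throughout the generators is a one-liner; the slope value below is an illustrative assumption, not the value from the paper's experiments.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: identity for positive inputs, slope alpha for
    negative inputs (alpha here is a hypothetical choice)."""
    return np.maximum(x, alpha * x)
```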

Comparators. We experimented with four comparators:

1. AlexNet (Krizhevsky et al., 2012) is a network with five convolutional and three fully connected layers trained on image classification.

2. The network of Wang & Gupta (2015) has the same architecture as AlexNet, but is trained using videos with triplet loss, which enforces frames of one video to be close in the feature space and frames from different videos to be far apart. We refer to this network as VideoNet.

3. AlexNet with random weights.

4. Exemplar-CNN (Dosovitskiy et al., 2015a) is a network with several convolutional layers and a fully connected layer trained on the surrogate task of discriminating between different image patches.

The exact layers used for comparison are specified in the experiments sections.

Discriminator. The architecture of the discriminator was nearly the same in all experiments. The version used for the autoencoder experiments is shown in Table 1. The discriminator must ensure that the local statistics of the images are natural. Therefore, after five convolutional layers with occasional stride we perform global average pooling. The result is processed by two fully connected layers, followed by a 2-way softmax. We perform dropout after the global average pooling layer and the first fully connected layer.

There are two modifications to this basic architecture. First, when dealing with large ImageNet (Deng et al., 2009) images we increase the stride in the first layer. Second, when training networks to invert AlexNet, we additionally feed the features to the discriminator. We process them with two fully connected layers and then concatenate the result with the output of global average pooling.

Table 1: Discriminator architecture (five convolutional layers, global average pooling, and two fully connected layers).

3.2 Training details

We modified the caffe (Jia et al., 2014) framework to train the networks. For optimization we used Adam (Kingma & Ba, 2015) with momentum and a small initial learning rate. To prevent the discriminator from overfitting during adversarial training, we temporarily stopped updating it whenever the ratio of $\mathcal{L}_{discr}$ to $\mathcal{L}_{adv}$ fell below a certain threshold. We used the same batch size in all experiments and trained for hundreds of thousands of mini-batch iterations.
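The discriminator-gating heuristic from the training procedure can be sketched in a few lines. The threshold value here is a hypothetical placeholder, not the value from the paper.

```python
def should_update_discriminator(l_discr, l_adv, threshold=0.1):
    """Pause discriminator updates when it becomes too strong relative
    to the generator, i.e. when L_discr / L_adv drops below a threshold
    (threshold value is an assumption for illustration)."""
    return (l_discr / l_adv) >= threshold
```

In a training loop, one would evaluate this each iteration and skip the discriminator's gradient step whenever it returns False.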

4 Experiments

We started with a simple proof-of-concept experiment showing how DeePSiM can be applied to training autoencoders. Then we used the proposed loss function within the variational autoencoder (VAE) framework. Finally, we applied the method to invert the representation learned by AlexNet and analyzed some properties of the method.

In quantitative comparisons we report the normalized Euclidean error $\|\mathbf{x} - \hat{\mathbf{x}}\|_2 \, / \, N$. The normalization coefficient $N$ is the average of the Euclidean distances between all pairs of different samples from the test set. Therefore, an error of 100% means that the algorithm performs on par with randomly drawing a sample from the test set.
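The normalization described above can be sketched directly (a small, brute-force version for illustration; a real evaluation would precompute the pairwise-distance average once):

```python
import numpy as np

def normalized_error(x, x_hat, test_set):
    """Euclidean distance in %, divided by the average pairwise distance
    in the test set, so that ~100% matches drawing a random sample."""
    n = len(test_set)
    pair_dists = [np.linalg.norm(test_set[i] - test_set[j])
                  for i in range(n) for j in range(n) if i != j]
    return 100.0 * np.linalg.norm(x - x_hat) / np.mean(pair_dists)
```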

4.1 Autoencoder

Here the target of the generator coincides with its input (that is, $\mathbf{x} = \mathbf{y}$), and the task of the generator is to encode the input to a compressed hidden representation and then decode the image back. The architecture is shown in Table 2. All layers are convolutional or up-convolutional. The hidden representation is a feature map substantially smaller than the input image. We trained on the STL-10 (Coates et al., 2011) unlabeled dataset, which contains 100,000 images of 96 × 96 pixels. To prevent overfitting we augmented the data by cropping random patches during training.

We experimented with four loss functions: SE and $\ell_1$ in the image space, as well as DeePSiM with AlexNet conv3 or Exemplar-CNN conv3 as comparator.

Qualitative results are shown in Fig. 3, quantitative results in Table 3. While underperforming in terms of Euclidean loss, our approach preserves more texture details, resulting in natural-looking, non-blurry reconstructions. Interestingly, AlexNet as comparator tends to corrupt fine details (petals of the flower, sails of the ship), perhaps because it has a stride of 4 in the first layer. The Exemplar-CNN comparator does not preserve the exact colors because it is explicitly trained to be invariant to color changes. We believe that with carefully selected or specifically trained comparators even better results can be obtained.

We stress that a lower Euclidean error does not imply a better reconstruction. For example, imagine a black-and-white striped "zebra" pattern. A uniform gray image has half the Euclidean error of the same pattern shifted by one stripe width, yet it is perceptually much further from the original.
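The zebra argument checks out numerically on a 1-D toy pattern:

```python
import numpy as np

stripes = np.tile([0.0, 1.0], 50)   # 1-D "zebra" pattern of unit stripes
shifted = np.roll(stripes, 1)       # same pattern shifted by one stripe width
gray = np.full_like(stripes, 0.5)   # uniform gray image

err_shifted = np.linalg.norm(stripes - shifted)  # every pixel off by 1.0
err_gray = np.linalg.norm(stripes - gray)        # every pixel off by 0.5
# err_gray is exactly half of err_shifted, although the shifted pattern is
# perceptually almost identical to the original and the gray image is not.
```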

Table 2: Autoencoder architecture. Top: encoder, bottom: decoder. All layers are convolutional or ’up-convolutional’.
Figure 3: Autoencoder qualitative results. Best viewed on screen.
SE loss    ℓ1 loss    Our-ExCNN    Our-AlexNet
Table 3: Normalized Euclidean reconstruction error (in %) of autoencoders trained with different loss functions.
SE loss    ℓ1 loss    Our-ExCNN    Our-AlexNet
Table 4: Classification accuracy (in %) on STL with autoencoder features learned with different loss functions.

Classification. Reconstruction-based models are commonly used for unsupervised feature learning. We checked whether our loss functions lead to learning more meaningful representations than the usual $\ell_1$ and SE losses. To this end, we trained linear SVMs on the hidden representations extracted by autoencoders trained with different losses. We are only interested in the relative performance and thus do not compare to the state of the art. We trained on 10 folds of the STL-10 training set and tested on the test set.

The results are shown in Table 4. As expected, the features learned with DeePSiM perform significantly better, indicating that they contain more semantically meaningful information. This suggests that losses other than the standard $\ell_1$ and SE may be useful for unsupervised learning. Note that the Exemplar-CNN comparator is trained in an unsupervised way.

Figure 4: Samples from VAE with the SE loss (topmost) and the proposed DeePSiM loss (top to bottom: AlexNet conv5, AlexNet fc6, VideoNet conv5).

4.2 Variational autoencoder

A standard VAE consists of an encoder $E$ and a decoder $D$. The encoder maps an input sample $\mathbf{x}$ to a distribution over latent variables, $\mathbf{z} \sim E(\mathbf{x}) = q(\mathbf{z}\,|\,\mathbf{x})$. The decoder maps from this latent space to a distribution over images, $\tilde{\mathbf{x}} \sim D(\mathbf{z}) = p(\mathbf{x}\,|\,\mathbf{z})$. The loss function is

$$-\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}\,|\,\mathbf{z})\big] + D_{KL}\big(q(\mathbf{z}\,|\,\mathbf{x})\,\|\,p(\mathbf{z})\big), \qquad (6)$$

where $p(\mathbf{z})$ is a prior distribution over latent variables and $D_{KL}$ is the Kullback-Leibler divergence. The first term in Eq. 6 is a reconstruction error. If we assume that the decoder predicts a Gaussian distribution at each pixel, it reduces to the squared Euclidean error in the image space. The second term pulls the distribution of latent variables towards the prior. Both $q(\mathbf{z}\,|\,\mathbf{x})$ and $p(\mathbf{z})$ are commonly assumed to be Gaussian, in which case the KL divergence can be computed analytically. Please refer to Kingma et al. (2014) for details.

We use the proposed loss instead of the first term in Eq. 6. This is similar to Larsen et al. (2015), but the comparator does not have to be a part of the discriminator. Technically, there is little difference from training an autoencoder. First, instead of predicting a single latent vector we predict two vectors $\mu$ and $\sigma$ and sample $\mathbf{z} = \mu + \sigma \odot \varepsilon$, where $\varepsilon$ is standard Gaussian (zero mean, unit variance) and $\odot$ is element-wise multiplication. Second, we add the KL divergence term to the loss:

$$\mathcal{L}_{KL} = \frac{1}{2} \sum_j \big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \big).$$
We manually set the weighting of the KL term relative to the rest of the loss. A proper probabilistic derivation is not straightforward, and we leave it for future research.
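The two VAE-specific ingredients — the reparameterized sampling step and the analytic KL term for a diagonal Gaussian — can be sketched as follows (a standard textbook sketch, not the paper's code; parameterizing by log-sigma is an implementation assumption for numerical convenience):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); sampling stays
    differentiable with respect to mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(2 * log_sigma) - 2 * log_sigma - 1)
```

When the predicted distribution equals the prior (mu = 0, sigma = 1), the KL term vanishes, as expected.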

We trained on crops of ILSVRC-2012 images. The encoder architecture is the same as AlexNet up to layer fc6, and the decoder architecture is shown in Table 5. We initialized the encoder with AlexNet weights; however, this is not necessary, as shown in the appendix. We sampled from the model by drawing the latent variables from a standard Gaussian and generating images from them with the decoder.

Samples generated with the usual SE loss, as well as three different comparators (AlexNet conv5, AlexNet fc6, VideoNet conv5) are shown in Fig. 4. While Euclidean loss leads to very blurry samples, our method yields images with realistic statistics. Interestingly, the samples trained with the VideoNet comparator look qualitatively similar to the ones with AlexNet, showing that supervised training may not be necessary to yield a good comparator. More results are shown in the appendix.

Table 5: Generator architecture for inverting layer fc6 of AlexNet: fully connected layers followed by a reshape and alternating up-convolutional and convolutional layers.
Figure 5: Representative reconstructions from higher layers of AlexNet. General characteristics of images are preserved very well. In some cases (simple objects, landscapes) reconstructions are nearly perfect even from fc8. In the leftmost column the network generates dog images from fc7 and fc8.

4.3 Inverting AlexNet

Analysis of learned representations is an important but largely unsolved problem. One approach is to invert the representation. This may give insights into which information is preserved in the representation and what its invariance properties are. However, inverting a non-trivial feature representation $\Phi$, such as one learned by a large convolutional network, is a difficult ill-posed problem.

Our proposed approach inverts the AlexNet convolutional network very successfully. Surprisingly rich information about the image is preserved in the deep layers of the network, and even in the predicted class probabilities. While this is an interesting result in itself, it also shows that DeePSiM is an excellent loss function for very difficult image restoration tasks.

Suppose we are given a feature representation $\Phi$, which we aim to invert, and an image $\mathbf{x}$. There are two possible inverse mappings: a right inverse $R_R$ such that $\Phi(R_R(\phi)) \approx \phi$, and a left inverse $R_L$ such that $R_L(\Phi(\mathbf{x})) \approx \mathbf{x}$. Recently two approaches to inversion have been proposed, corresponding to these two variants of the inverse.

Mahendran & Vedaldi (2015), as well as Simonyan et al. (2014) and Yosinski et al. (2015), apply gradient-based optimization to find an image $\mathbf{x}^*$ which minimizes the loss

$$\| \Phi(\mathbf{x}) - \phi_0 \|_2^2 + P(\mathbf{x}),$$

where $\phi_0$ is the feature vector to be inverted and $P$ is a simple natural image prior, such as a total variation (TV) regularizer. This method produces images which are roughly natural and have features similar to the input features, corresponding to $R_R$. However, the prior is limited, so reconstructions from the fully connected layers of AlexNet do not look much like natural images.
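The gradient-based inversion idea can be illustrated on a toy problem. Here the "feature extractor" is a linear map (an assumption purely for illustration; the real $\Phi$ is a deep network) and a small quadratic penalty stands in for the natural image prior $P$:

```python
import numpy as np

# Toy setup: Phi(x) = A x; invert features phi0 by gradient descent on
# ||A x - phi0||^2 + lam * ||x||^2, where the ridge term plays the role
# of the image prior.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))       # toy "feature extractor"
phi0 = rng.standard_normal(4)         # features to invert
lam = 0.01                            # prior weight (hypothetical)

x = np.zeros(6)
for _ in range(2000):
    grad = 2 * A.T @ (A @ x - phi0) + 2 * lam * x
    x -= 0.01 * grad                  # plain gradient descent step

residual = np.linalg.norm(A @ x - phi0)
```

Because the toy system is underdetermined, the optimization finds an image-like vector whose features nearly match `phi0`, mirroring the $R_R$ behavior described above.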

Dosovitskiy & Brox (2015) train up-convolutional networks on a large training set of natural images to perform the inversion task. They use SE distance in the image space as the loss function, which leads to approximating $R_L$. The networks learn to reconstruct the color and rough positions of objects well, but produce over-smoothed results because they average over all potential reconstructions.

Our method can be seen as combining the best of both worlds. Loss in the feature space helps preserve perceptually important image features. Adversarial training keeps reconstructions realistic. Note that similar to Dosovitskiy & Brox (2015) and unlike Mahendran & Vedaldi (2015), our method does not require the feature representation being inverted to be differentiable.

Figure 6: Comparison with Dosovitskiy & Brox (2015) and Mahendran & Vedaldi (2015) when reconstructing from conv5, fc6, fc7, and fc8. Our results look significantly better, even in our failure cases (second image).

Technical details. The generator in this setup takes the features extracted by AlexNet and generates an image from them, that is, $\mathbf{y} = \Phi(\mathbf{x})$. In general we followed Dosovitskiy & Brox (2015) in designing the generators. The only modification is that we inserted more convolutional layers, giving the network more capacity. We reconstruct from the outputs of layers conv5–fc8. In each layer we also include the processing steps following the layer, that is, pooling and non-linearities. So, for example, conv5 means pooled features (pool5), and fc6 means rectified values (relu6).

The architecture used for inverting fc6 is the same as the decoder of the VAE shown in Table 5. Architectures for other layers are similar, except that for reconstruction from conv5 the fully connected layers are replaced by convolutional ones. The discriminator is the same as used for the VAE. We trained on the ILSVRC-2012 training set and evaluated on the ILSVRC-2012 validation set.

Ablation study. We tested whether all components of our loss are necessary. Results with some of these components removed are shown in Fig. 7. Clearly the full model performs best. In the following we give some intuition why.

Training with only the loss in the image space leads to averaging over all potential reconstructions, resulting in over-smoothed images. One might imagine that adversarial training could make the images sharp. This indeed happens, but the resulting reconstructions do not correspond to the actual objects originally contained in the image. The reason is that any "natural-looking" image which roughly fits the blurry prediction minimizes this loss. Without the adversarial loss, predictions look very noisy. Without the image space loss the method works well, but one can notice artifacts on the borders of images, and training was less stable in this case.

Sampling pre-images. Given a feature vector, it would be interesting to sample multiple images with these features. A straightforward approach would be to inject noise into the generator along with the features, so that the network could randomize its outputs. This does not yield the desired result, since nothing in the loss function forces the generator to output multiple different reconstructions per feature vector. A major problem is that in the training data we only have one image per feature vector, i.e., a single sample per conditioning vector. We did not attack this problem in our paper, but we believe it is an important research direction.

Figure 7: Reconstructions from fc6 with some components of the loss removed.

Best results. Representative reconstructions from higher layers of AlexNet are shown in Fig. 5. Comparison with existing approaches is shown in Fig. 6. Reconstructions from conv5 are near-perfect, combining the natural colors and sharpness of details. Reconstructions from fully connected layers are still very good, preserving the main features of images, colors, and positions of large objects.

The normalized Euclidean error in image space and in feature space (that is, the distance between the features of the image and of the reconstruction) is shown in Fig. 8. The method of Mahendran & Vedaldi performs well in feature space, but not in image space; the method of Dosovitskiy & Brox, vice versa. The presented approach is fairly good on both metrics.

Figure 8: Normalized inversion error (in %) when reconstructing from layers conv5, fc6, fc7, and fc8 of AlexNet with different methods (Dosovitskiy & Brox; ours with just image loss; ours with AlexNet conv5 comparator; ours with VideoNet conv5 comparator). The first number in each pair is the error in the image space, the second in the feature space.
Figure 9: Iteratively re-encoding images with AlexNet (layers conv5, fc6, fc7, fc8) and reconstructing. Iteration number shown on the left.

Iterative re-encoding. We performed another experiment illustrating how similar the features of the reconstructions are to the original image features. Given an image, we compute its features and generate an image from them, then iteratively compute the features of the result and generate again. Results are shown in Fig. 9. Interestingly, several iterations do not significantly change the reconstruction, indicating that important perceptual features are preserved in the generated images. More results are shown in the appendix.
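The re-encoding loop itself is simple. Below, `phi` and `g` are toy stand-ins for the feature extractor and the learned generator (in this sketch, `g` is an exact inverse of `phi`, an idealization of the result above that repeated cycles leave the image essentially unchanged):

```python
import numpy as np

def phi(x):
    """Hypothetical feature extractor (stand-in for AlexNet features)."""
    return 2.0 * x

def g(features):
    """Hypothetical learned generator, here an exact inverse of phi."""
    return features / 2.0

x = np.array([0.3, 0.7])      # toy "image"
for _ in range(8):            # iterate encode -> decode
    x = g(phi(x))
```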


Interpolation. We can morph images into each other by linearly interpolating between their features and generating the corresponding images. Fig. 11 shows that the objects shown in the images smoothly warp into each other. More examples are shown in the appendix.
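The interpolation step is plain linear blending in feature space; each interpolant would then be passed through the generator to produce one frame of the morphing sequence:

```python
import numpy as np

def interpolate_features(f1, f2, num_steps=5):
    """Linear interpolation between two feature vectors, endpoints
    included; the generator maps each interpolant back to an image."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1 - a) * f1 + a * f2 for a in alphas]
```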

Different comparators. The AlexNet network we used above as a comparator was trained on a huge labeled dataset. Is this supervision really necessary to learn a good comparator? We show here results with several alternatives to the conv5 features of AlexNet: 1) fc6 features of AlexNet, 2) conv5 of AlexNet with random weights, 3) conv5 of the network of Wang & Gupta (2015), which we refer to as VideoNet.

The results are shown in Fig. 10. While the AlexNet conv5 comparator provides the best reconstructions, the other networks preserve key image features as well. We also ran preliminary experiments with conv5 features from the discriminator serving as a comparator, but were not able to get satisfactory results with those.

Image      Alex5     Alex6     Video5     Rand5
Figure 10: Reconstructions from fc6 with different comparators. The number indicates the layer from which features were taken.
Image pair 1         Image pair 2
Figure 11: Interpolation between images by interpolating between their features in fc6 and fc8.

5 Conclusion

We proposed a class of loss functions applicable to image generation that are based on distances in feature spaces. Applying these to three tasks — image auto-encoding, random natural image generation with a VAE, and feature inversion — reveals that our loss is clearly superior to the typical loss in image space. In particular, it allows the reconstruction of perceptually important details even from very low-dimensional image representations. We evaluated several feature spaces for measuring distances. More research is necessary to find the optimal features for a given task. To control the degree of realism in generated images, an alternative to adversarial training is an approach making use of feature statistics, similar to Gatys et al. (2015). We see these as interesting directions for future work.


The authors are grateful to Jost Tobias Springenberg and Philipp Fischer for useful discussions. We acknowledge funding by the ERC Starting Grant VideoLearn (279401).



Appendix

Here we show some additional results obtained with the proposed method.

Figure 12 illustrates how position and color of an object is preserved in deep layers of AlexNet (Krizhevsky et al., 2012).

Figure 13 shows results of generating images from interpolations between the features of natural images.

Figure 14 shows samples from variational autoencoders with different losses. Fully unsupervised VAE with VideoNet (Wang & Gupta, 2015) loss and random initialization of the encoder is in the bottom right. Samples from this model are qualitatively similar to others, showing that initialization with AlexNet is not necessary.

Figures 15 and 16 show results of iteratively encoding images to a feature representation and reconstructing back to the image space. As can be seen from Figure 16, the network trained with loss in the image space does not preserve the features well, resulting in reconstructions quickly diverging from the original image.

Figure 12: Position (first three columns) and color (last three columns) preservation.
Figure 13: Interpolation in feature spaces at different layers of AlexNet. Topmost: input images, Top left: conv5, Top right: fc6, Bottom left: fc7, Bottom right: fc8.
Figure 14: Samples from VAE with our approach, with different comparators. Top left: AlexNet conv5 comparator, Top right: AlexNet fc6 comparator, Bottom left: VideoNet conv5 comparator, Bottom right: VideoNet conv5 comparator with randomly initialized encoder.
Figure 15: Iterative re-encoding and reconstructions for different layers of AlexNet. Each row of each block corresponds to an iteration number: 1, 2, 4, 6, 8, 12, 16, 20. Topmost: input images, Top left: conv5, Top right: fc6, Bottom left: fc7, Bottom right: fc8.
Figure 16: Iterative re-encoding and reconstructions with network trained to reconstruct from AlexNet fc6 layer with squared Euclidean loss in the image space. On top the input images are shown. Then each row corresponds to an iteration number: 1, 2, 4, 6, 8, 12, 16, 20.