E-LPIPS: Robust Perceptual Image Similarity via Random Transformation Ensembles
It has been recently shown that the hidden variables of convolutional neural networks make for an efficient perceptual similarity metric that accurately predicts human judgment on relative image similarity assessment. First, we show that such learned perceptual similarity metrics (LPIPS) are susceptible to adversarial attacks that dramatically contradict human visual similarity judgment. While this is not surprising in light of neural networks' well-known weakness to adversarial perturbations, we proceed to show that self-ensembling with an infinite family of random transformations of the input --- a technique known not to render classification networks robust --- is enough to make the metric robust against attack, while retaining predictive power on human judgments. Finally, we study the geometry imposed by our novel self-ensembled metric (E-LPIPS) on the space of natural images. We find evidence of "perceptual convexity" by showing that convex combinations of similar-looking images retain appearance, and that discrete geodesics yield meaningful frame interpolation and texture morphing, all without explicit correspondences.
Computational assessment of perceptual image similarity — how close one image is to another — is a fundamental question in vision, graphics, and imaging. For an image similarity metric $d$ to be perceptually equivalent to human observations, it must at least hold that

(C1) $d(x, y)$ is large when human observers perceive a large dissimilarity between $x$ and $y$, and
(C2) $d(x, y)$ is small when human observers consider the images $x$ and $y$ similar.
Perceptually motivated image similarity metrics have a long history, with well-known challenges due to dependence on context and high-level image structure ssim ; ms-ssim ; Mantiuk:2011:HCV . Standard vector norms applied to images pixelwise, such as the $L_1$ or $L_2$ distance, are brittle in the sense that many transformations that leave images visually indistinguishable — shifting by one pixel is a classic example — can yield arbitrary jumps in distance, indicating a violation of Condition (C2). Other transformations exhibiting similar issues include intensity variations, small rotations, and slight blur.

In recent years, several authors have observed that the hidden layer features of a convolutional neural network (CNN) trained for image classification Simonyan2014VeryDC ; Krizhevsky2014OneWT ; SqueezeNet yield a perceptually meaningful space in which to measure image distance Berardino2017EigenDistortionsOH , with applications in, e.g., image generation and restoration Johnson2016PerceptualLF ; Chen2017PhotographicIS ; KimLL15 and performance metrics for generative models Heusel2017 . Inspired by this, Zhang et al. Zhang_2018_CVPR recently showed that hidden CNN activations indeed form a space where distance can strongly correlate with human judgment, much more so than per-pixel metrics or other prior similarity scores. Their Learned Perceptual Image Patch Similarity (lpips) metric is calibrated to human judgment using data from two-alternative forced choice (2AFC) tests.
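To make the brittleness concrete, here is a minimal NumPy sketch (values and sizes illustrative): for a detailed toy "image", a one-pixel shift produces a pixelwise $L_2$ distance of the same order as the distance to a completely unrelated image, even though a one-pixel shift is visually negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "image" with high-frequency content, where the effect is pronounced.
img = rng.random((64, 64))

# Shift by a single pixel (wrap-around for simplicity).
shifted = np.roll(img, 1, axis=1)

# Pixelwise L2 distance between the original and its one-pixel shift.
d_shift = np.linalg.norm(img - shifted)

# Compare against the distance to an unrelated random image.
other = rng.random((64, 64))
d_other = np.linalg.norm(img - other)

# d_shift is the same order of magnitude as d_other, even though the
# shifted image is visually near-identical to the original.
print(d_shift, d_other)
```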
Our message is threefold. First, we show that lpips-style image similarity metrics, computed as Euclidean distances between the hidden variables of a classifier CNN, are brittle: they are easy to break by standard adversarial attack techniques. This is not at all surprising given neural networks' well-known weakness to adversarial perturbations. More precisely, we show that these metrics fulfill neither (C1) nor (C2): it is easy to craft image pairs that have a small lpips distance although human observers see the images as completely different, violating (C1), or that have a large lpips distance although a human observer is hard pressed to see any difference, violating (C2).

Second, we show that computing the CNN feature distance as an average under an effectively infinite ensemble of random image transformations makes the metric significantly more robust against adversarial attacks, even when the attack is crafted with full knowledge of the ensemble defense. We call the novel metric the Ensembled Learned Perceptual Image Patch Similarity metric, e-lpips. That self-ensembling through random transformations yields a robust similarity metric might seem surprising at first: similar random transformation ensembles have previously been shown to be insufficient to make classifier networks robust Athalye2018 ; Athalye2018Synthesizing . We attribute the qualitative difference to the fact that attacking lpips-style metrics requires modifying all hidden variables of a CNN, while attacking a classifier network only requires the attacker to change the output layer.
Finally, we observe that the geometry induced on the space of natural images by the e-lpips metric has several intriguing properties we call "perceptual convexity". In particular, we show by several examples that, when computed under our e-lpips metric, averages (barycenters) of similar-looking images retain a similar appearance even when pixel-wise averages look very different, and that discrete geodesics between two images yield reasonable frame interpolation and texture morphing — all without explicit correspondence or optical flow. Furthermore, as a practical consequence of a more perceptually convex metric, we show that e-lpips yields consistently better results than non-ensembled feature losses when used as a loss function for training image restoration models.
We build on lpips Zhang_2018_CVPR , which computes image differences in the space of hidden unit activations obtained by running the input images through the convolutional part of an image classification network such as VGG Simonyan2014VeryDC , SqueezeNet SqueezeNet , or AlexNet Krizhevsky2014OneWT . Other prior uses of "VGG feature distances" Johnson2016PerceptualLF ; Chen2017PhotographicIS ; KimLL15 differ from lpips only in minor details.
lpips first normalizes the feature dimension in all pixels and layers to unit length, scales each feature by a feature-specific weight, and evaluates the squared $L_2$ distance between these weighted activations. These squared distances are then averaged over the image dimensions and summed over the layers, yielding the final image distance metric:
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{p} \left\lVert w_l \odot \left( \hat{y}^l_p - \hat{y}^l_{0,p} \right) \right\rVert_2^2 \qquad (1)$$
Here, $\hat{y}^l_p$ and $\hat{y}^l_{0,p}$ denote the normalized feature vectors at layer $l$ and pixel $p$, $w_l$ contains weights for each of the features in layer $l$, and $\odot$ multiplies the feature vectors at each pixel by the feature weights. The weights are optimized such that the metric best agrees with human judgment derived from two-alternative forced choice (2AFC) test results.
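The computation in Equation (1) can be sketched in NumPy as follows. This is an illustrative sketch, not the reference lpips implementation; the function and variable names are our own.

```python
import numpy as np

def lpips_style_distance(feats_a, feats_b, weights):
    """Sketch of an lpips-style distance (cf. Equation 1).

    feats_a, feats_b: lists of per-layer activations, each of shape (H, W, C).
    weights: list of per-layer feature weights, each of shape (C,).
    """
    total = 0.0
    for ya, yb, w in zip(feats_a, feats_b, weights):
        # Normalize the feature dimension to unit length at every pixel.
        ya_hat = ya / (np.linalg.norm(ya, axis=-1, keepdims=True) + 1e-10)
        yb_hat = yb / (np.linalg.norm(yb, axis=-1, keepdims=True) + 1e-10)
        # Weight each feature, take the squared L2 distance per pixel,
        # average over the image dimensions, and sum over layers.
        diff = w * (ya_hat - yb_hat)
        total += np.mean(np.sum(diff ** 2, axis=-1))
    return total
```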
Below (Section 3.1), we demonstrate that lpips fulfills neither (C1) nor (C2) by describing simple adversarial attacks that violate both conditions. The reader may consult Figures 1 and 2 before proceeding. With hopes of increasing its robustness, we alter lpips in four ways:
We apply a randomized geometric and color transformation to the input images before feeding them to the network, and take the expected lpips distance over realizations. As the model is not limited to a single glimpse of the input, but has access to an essentially infinite number of different versions, it is harder to fool.
While lpips only considers the features in the last layer of each resolution, we compute distances over all convolutional layers. In particular, while lpips is blind to the input layer, considering all layers means Equation (1) also includes the distance between the input color tones, with a learned weight.
We make the network stochastic by applying dropout on all layers. Matching intuition, this makes the metric consistently more robust.

We replace the image classification networks' max pooling layers with average pooling. As observed by several prior authors Gatys2015 ; Henaff2016 , max pooling disrupts gradient flow, making optimization harder, and leaves more blind spots to be exploited, since many of the network activations can be freely altered without the change propagating forward.

For a single evaluation of our method, we first transform both input images with the same random combination of simple transformations. Our transform set includes translations, flipping, mirroring, permutation of color channels, scalar multiplication (brightness change), and downscaling, all with randomly chosen parameters. Furthermore, the simple random transformations are combined randomly, making the effective size of the ensemble very large. This is in stark contrast to an ensemble over a fixed, small number of separate transformations. Algorithms 1 and 2 detail the construction and application of the transformations.
We now define e-lpips as the expectation
$$d_{\text{E-LPIPS}}(x, y) = \mathbb{E}_{T}\left[\, d\!\left(T(x),\, T(y)\right) \right] \qquad (2)$$
where the expectation is taken over the stochastic combinations of transformations $T$ described above, and $d$ is the distance of Equation (1) evaluated with our modified network: dropout, average pooling, and weights applied to all layers, including the input layer.
For numerical optimization using the metric, e.g., when training neural networks, we find that a single random sample per iteration is generally sufficient, but multiple samples may sometimes result in faster convergence. For more precise distance evaluations we recommend averaging multiple evaluations, depending on the required precision. For comparing the distances of two images from a third image, we recommend using the same transformations and dropout variables for all images. A single evaluation of e-lpips without gradients takes on average about 10% longer than lpips-vgg, and 20% longer with gradients.
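The expectation in Equation (2) is straightforward to estimate by Monte Carlo sampling. The sketch below uses a plain $L_2$ distance and a random-flip ensemble as stand-ins for the feature distance and the full transform ensemble; note that the same sampled transform is applied to both images.

```python
import numpy as np

def ensembled_distance(img_a, img_b, base_distance, sample_transform,
                       n_samples=8, seed=0):
    """Monte Carlo estimate of the expectation in Equation (2): the same
    random transform is applied to both inputs, the base distance is
    evaluated, and the results are averaged over samples."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        t = sample_transform(rng)
        total += base_distance(t(img_a), t(img_b))
    return total / n_samples

# Minimal stand-ins for demonstration: random horizontal flips, and a
# squared L2 distance in place of the feature distance d.
def flip_transform(rng):
    flip = rng.random() < 0.5
    return (lambda x: x[:, ::-1]) if flip else (lambda x: x)

l2 = lambda a, b: float(np.sum((a - b) ** 2))
```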
This section studies the properties of e-lpips in comparison to lpips and simpler metrics in various scenarios. We use the VGG-16 version as the basis for e-lpips in all results, but compare to both the VGG- and SqueezeNet-based lpips, the best models reported by Zhang et al. Zhang_2018_CVPR . All models are implemented in TensorFlow, and all training and other optimization is performed on NVIDIA V100 GPUs. We perform all optimizations with Adam kingma2014adam , propagating gradients through the random input transformations. When attacking the metrics, this directly implements the Expectation over Transformations (EoT) attack previously used to break self-ensembling in image classifiers Carlini2017TowardsET ; Athalye2018 .

We first construct image pairs that humans perceive as dissimilar but that lpips considers close. To attack metric $d$, we select a source image $x$, constrain $d(x, x')$ to be small, and optimize $x'$ towards a distant target image $x_t$ in the $L_2$ sense by solving
$$x' = \arg\min_{x'} \; \lVert x' - x_t \rVert_2 \quad \text{subject to} \quad d(x, x') \le \delta \qquad (3)$$
with the anchor distance $\delta$ chosen as the distance between $x$ and a slightly noisy version of $x$. We perform the attack on 30 images from the OpenImages OpenImages dataset. Figure 1 shows representative results. Both lpips metrics succumb badly: the result shows that the $\delta$-ball around the source image in the lpips metric contains images that are wildly different from it. The e-lpips images consistently stay much closer to the input image, as desired. On average, lpips lets the attack pull the images five times farther from the input image in $L_2$ distance than e-lpips. See the supplemental material for more examples.
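The following toy sketch illustrates the mechanics of Attack (A1) with a penalty method on a hand-built, ill-conditioned quadratic metric standing in for lpips; all names and constants are illustrative. Because the toy metric nearly ignores one direction, the attack can move the image far toward the target in $L_2$ while the metric distance stays small.

```python
import numpy as np

def attack_toward_target(x, x_target, metric_grad, metric, delta,
                         steps=200, lr=0.05, lam=10.0):
    """Penalty-method sketch of Attack (A1), Equation (3): pull x' toward
    x_target in L2 while penalizing metric distance from x beyond the
    anchor delta. The soft penalty enforces the constraint only
    approximately."""
    xp = x.copy()
    for _ in range(steps):
        g = 2.0 * (xp - x_target)              # gradient of ||x' - x_t||^2
        if metric(x, xp) > delta:              # penalize leaving the ball
            g = g + lam * metric_grad(x, xp)
        xp -= lr * g
    return xp

# Toy metric: a weighted squared L2 distance with a near-null direction,
# mimicking the ill-conditioning the attack exploits.
w = np.array([1.0, 1e-3])
metric = lambda a, b: float(np.sum(w * (a - b) ** 2))
metric_grad = lambda a, b: 2.0 * w * (b - a)   # gradient w.r.t. b
```

Running this with a target displaced along the near-null direction moves the attack image almost all the way to the target while the toy metric barely registers the change — the failure mode demonstrated for lpips above.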
In the opposite direction, we construct pairs of images that are similar to the human eye (proxied by a small $L_2$ distance), but far apart in the lpips sense, by solving
$$x' = \arg\max_{x'} \; d(x, x') \quad \text{subject to} \quad \lVert x - x' \rVert_2 \le \epsilon \qquad (4)$$
The other details of the attack are the same as in (A1). Figure 2 shows representative results. Both lpips metrics allow a very large increase in distance while staying within a small $L_2$ ball around the input; in contrast, e-lpips allows a much smaller increase.
For lpips, the attack images are visually highly similar to the source despite the large increase in distance, indicating a violation of (C2); in contrast, the image of maximal e-lpips distance within the $L_2$ $\epsilon$-ball is consistently visibly more different from the source image, as is desirable. The supplemental material contains more examples.
Together, the attack results suggest that e-lpips fulfills conditions (C1) and (C2) better than the non-ensembled feature metrics. We now proceed to study the reasons for this result.
We study the necessity of the various random transformations in our ensemble by an ablation. We repeat Attack (A1) on an image pair, adding transformations to the ensemble one by one, with dropout added last. Figure 3 shows representative results: clearly, each individual transformation "plugs more holes", making the attack less and less successful. On a test set of 30 images, the effectiveness of the adversarial attack drops as we add all layers and average pooling, translations, rotations and swaps, color transformations, multiple scale levels, and finally dropout, reaching a five-fold robustness improvement over lpips-vgg in terms of the $L_2$ distance.
The supplemental material contains another ablation study that highlights the necessity of using an effectively infinite ensemble of random transformations, as opposed to a fixed ensemble.
Above, we found large perturbations that increased the lpips distance only a little, and small perturbations that greatly increased the lpips distance. The results are systematic: attacks succeed easily against lpips, and much less so against e-lpips. This suggests severe ill-conditioning of the lpips metric. We now probe the metrics' local behavior in order to see if it is consistent with this hypothesis. Specifically, we study the behavior of $d(x, x + \epsilon)$ around a fixed image $x$ for small perturbations $\epsilon$ through its second-order Taylor expansion. As $\epsilon = 0$ is the minimum of the distance, the first-order term vanishes, and the expansion equals $\frac{1}{2}\epsilon^{T} H \epsilon$, where $H$ is the Hessian. Note that $H$ is a $3n \times 3n$ matrix for an RGB image of $n$ pixels.
As the Hessian is symmetric and positive semidefinite, its distance-scaling properties are characterized by its eigenvalue spectrum. While the full spectrum is intractable, we compute the extremal eigenvalues by power iteration, and approximate their mean, variance, skewness, and kurtosis by sampling. Details can be found in the supplemental material. The adjacent table shows means of these descriptors computed over 10 different anchor images. For scale invariance between the metrics, we normalize all values by the corresponding mean eigenvalues. The smallest eigenvalues of e-lpips are challenging to compute due to the metric's stochastic nature; we thus evaluate some of the results in lower resolution, using a weakened version of e-lpips with no dropout and with a fixed ensemble of 256 input transformations. We notice a clear trend of improved results when the ensemble size is increased, suggesting still better conditioning for full e-lpips.

The maximal eigenvalues of lpips are consistently an order of magnitude larger than those of e-lpips, the minimal eigenvalues are consistently three orders of magnitude smaller, and lpips' condition number is three orders of magnitude larger on average. These numbers indicate that lpips is much more ill-conditioned. Moreover, the distribution of lpips' eigenvalues is more positively skewed, indicating that more eigenvalues are clustered towards zero away from the mean, and consequently more directions in which the metric is blind to changes, as well as more kurtotic, implying more concentration at the extremal ends. These findings are consistent with the large-scale behavior observed earlier.
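The extremal-eigenvalue computation can be sketched with matrix-free Hessian-vector products and power iteration. This is a generic sketch of the technique, not the authors' exact procedure; the finite-difference step and names are assumptions.

```python
import numpy as np

def hessian_vector_product(grad_fn, x, v, eps=1e-4):
    """Finite-difference Hessian-vector product:
    Hv ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps).
    Avoids forming the full 3n x 3n Hessian."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def largest_eigenvalue(grad_fn, x, iters=100, seed=0):
    """Power iteration on Hessian-vector products to estimate the
    largest Hessian eigenvalue of a scalar function around x."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, x, v)
        lam = float(v @ hv)                      # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam
```

The smallest eigenvalue can be estimated the same way by shifting, i.e., running power iteration on $\lambda_{\max} I - H$.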
The supplemental material contains videos demonstrating random perturbations in the linear space spanned by the 16 largest eigenvectors of both lpips and e-lpips around the central image. Though informal, the perturbations sampled from the e-lpips eigenvectors appear more natural and consistent with the image structure.

We now study the power of e-lpips in predicting human assessment of image similarity, the original goal of the lpips metric that ours is based on. We use the data of Zhang et al. Zhang_2018_CVPR and directly follow their evaluation process. We train the feature weights to produce distances that are fed to a small two-hidden-layer MLP that predicts the ratio of human choices of A and B. We enable input transformations only for testing, after the weights have been trained, and keep the weights of the underlying VGG network fixed.
The resulting accuracy in predicting human answers in the two-alternative forced choice (2AFC) test is 69.16% for e-lpips, falling directly between the VGG (68.65%) and SqueezeNet (70.07%) versions of the non-ensembled lpips. This indicates that predictive power does not suffer as a result of ensembling, i.e., the additional robustness and more intuitive geometry over images come at no cost. To put the numbers in perspective, Zhang et al. report a mean human score of 73.9% (the score is not 100%, as humans have difficulty predicting how other humans perceive image similarity), while simple metrics such as pixel-wise $L_2$ distance and SSIM achieve scores of about 63%. The supplemental material contains a table with an accuracy breakdown over different classes of image corruptions.
Using an image similarity metric as a loss function in training an image generation or restoration model is a potentially adversarial situation: if the metric cannot reliably distinguish between perceptual similarity and dissimilarity, the optimizer may drive the model to produce nonsensical results. As a case study, we train a 10M-parameter standard convolutional U-net with skip connections Ronneberger2015 for removing additive Gaussian noise from photographs using different loss functions, as well as for 4-fold single-image super-resolution. (We train super-resolution in a supervised fashion Johnson2016PerceptualLF as a case study, aware that state-of-the-art techniques use more sophisticated models Dahl2017 ; Ledig2017PhotoRealisticSI .)

Although the differences are not large on average, we find that e-lpips consistently produces somewhat better results than plain lpips, both numerically and by visual inspection. In particular, the non-ensembled lpips metric sometimes falls into strange local minima on some network architectures (e.g., transposed convolutions; see the supplemental material). This is never observed with e-lpips. Despite the increased robustness and somewhat better results, there is no difference in wall-clock training speed between lpips and e-lpips. Visual and numerical results can be found in the supplemental material.
From a perceptual point of view, it would seem reasonable that an average of two or more images that look the same should still look the same. To study the properties of averages of multiple images taken under e-lpips, we first compute barycenters over several independent realizations of images transformed by the same random process, by direct numerical optimization over the pixels.
As examples, we study i.i.d. additive Gaussian noise and small random offsets in Figure 4. The $L_2$ barycenter is simply the pixelwise average of the inputs, and appears quite different from the individual input images: averaging zero-mean noise cancels it out, and averaging shifted images results in a blur. Despite having converged to a local optimum, the non-ensembled lpips barycenters deviate visually even further, providing another view into the metric's large null space. In contrast, the appearance of the e-lpips barycenters quite closely matches the individual realizations. This suggests a property we call perceptual convexity: the barycenter of a set of visually similar images, which can be seen as a convex combination, is itself visually similar to the others. The supplemental material contains more examples, and a study of pairwise averages.
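A minimal sketch of barycenter computation by direct gradient descent over pixels, under an arbitrary differentiable metric (names illustrative). With the plain $L_2$ metric the result is exactly the pixelwise average, matching the observation above.

```python
import numpy as np

def barycenter(images, metric_grad, steps=500, lr=0.05, seed=0):
    """Minimize sum_i d(x, x_i) over pixels by gradient descent.
    `metric_grad(x, y)` returns the gradient of d(x, y) w.r.t. x."""
    rng = np.random.default_rng(seed)
    x = rng.random(images[0].shape)          # start from random pixels
    for _ in range(steps):
        g = sum(metric_grad(x, img) for img in images)
        x -= lr * g
    return x

# With the squared L2 metric, the barycenter is the pixelwise average.
l2_grad = lambda x, y: 2.0 * (x - y)
```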
A discrete geodesic between images $x_0$ and $x_N$ is a sequence of in-between images $x_1, \ldots, x_{N-1}$ that minimizes the total pairwise squared distance between adjacent frames Henaff2016 :
$$\min_{x_1, \ldots, x_{N-1}} \; \sum_{i=0}^{N-1} d(x_i, x_{i+1})^2 \qquad (5)$$
The metric directly determines the visual properties of the solution, as the free images affect the optimization objective only through it. The $L_2$ geodesic is a pixelwise linear cross-fade. We numerically compute discrete geodesics of 8 in-between frames for the e-lpips, lpips, and $L_2$ metrics by initializing the free frames to random noise and simultaneously optimizing all of them using Adam. See the supplemental material for details on the optimization procedure.
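As a sanity check of the claim that the $L_2$ geodesic is a pixelwise linear cross-fade, the following sketch minimizes the objective of Equation (5) under the $L_2$ metric, using plain gradient steps instead of Adam (names and step sizes illustrative).

```python
import numpy as np

def discrete_geodesic_l2(x0, xN, n_free=8, steps=2000, lr=0.1, seed=0):
    """Minimize the total squared L2 distance between adjacent frames
    (Equation 5 with the L2 metric) over the free in-between frames,
    initialized to random noise."""
    rng = np.random.default_rng(seed)
    frames = [x0] + [rng.random(x0.shape) for _ in range(n_free)] + [xN]
    for _ in range(steps):
        for i in range(1, n_free + 1):
            # Gradient of ||x_i - x_{i-1}||^2 + ||x_{i+1} - x_i||^2 w.r.t. x_i
            g = 2 * (frames[i] - frames[i - 1]) + 2 * (frames[i] - frames[i + 1])
            frames[i] = frames[i] - lr * g
    return frames
```

At the optimum the free frames settle onto the straight line between the endpoints, i.e., a linear cross-fade; a richer metric such as e-lpips yields qualitatively different interpolations, as discussed next.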
As shown in Figure 5, geodesic interpolation between adjacent video frames appears to create meaningful frame interpolations in the e-lpips metric: instead of fading in and out like the crossfade, many image features translate and rotate as if reprojected along optical flow. Note that no correspondence or explicit reprojection is performed; all effects emerge from the metric. In addition to frame interpolation, the geodesics result in interesting fusions between different textures. Some evidence of similar behavior is seen in lpips geodesics, but the image fusion is generally of poorer quality, and translation and rotation are not modeled as naturally. The reader is encouraged to view the videos in the supplemental material. Due to the difficulty of the optimization problem, we consider these results an existence proof that warrants detailed further study.
The output of deep neural networks can often be altered drastically with small adversarial perturbations of the input found by gradient-based optimization IntriguingProperties ; Carlini2017TowardsET ; Goodfellow2014ExplainingAH ; MoosaviDezfooli2016DeepFoolAS ; Bhagoji2016DimensionalityRA ; Papernot2016DistillationAA . Random self-ensembling has been suggested before as a defense against attacks on classifiers. Liu et al. Liu2017TowardsRN inject noise layers into an image classifier. Xie et al. XieMitigatingAE2017 and Athalye et al. Athalye2018Synthesizing apply random transformations to the input. These defenses succumb to the Expectation over Transformations (EoT) attack Athalye2018 — random transformations are not strong enough to robustify a classifier.
Weaknesses in classical image similarity metrics have been previously demonstrated in the same fashion as our attacks Wang2009MeanSE ; ssim . We are, however, not aware of prior successful attacks against state-of-the-art neural perceptual similarity metrics. Hessian eigenanalysis of deep convolutional features has been previously employed for studying perturbations of minimal and maximal discriminability on different levels of VGG Berardino2017EigenDistortionsOH .
Prior work has hinted at similar interesting properties of the discrete geodesics of the "VGG feature" image metric: Hénaff and Simoncelli Henaff2016 find evidence of linearization of effects such as small translation and rotation, but only when aided with additional projections onto pixel-space geodesics. In contrast, our geodesics are computed by optimizing only the e-lpips and lpips metrics. This, together with the large lpips null space evident in the barycenter results (Figure 4), suggests that e-lpips imposes a more robust geometric structure on the space of images.
By constructing adversarial examples that demonstrate high perceptual non-uniformity in lpips Zhang_2018_CVPR , we showed that image similarity metrics based on convolutional classifier feature distances exhibit susceptibility to white-box gradient-based perturbation attacks, much like image classifiers. While this is not surprising, the observation explains minor practical robustness issues faced when using such metrics as optimization targets without extra regularization.
We further extended the fragile lpips metric into an effectively infinite self-ensemble by applying random, simple image transformations and dropout. The resulting e-lpips image similarity metric ('E' for ensemble) is much more resilient to various direct attacks mounted with full knowledge of the ensemble. This presents an interesting contrast between perceptual feature differences and image classifiers: it is known that similar defenses do not suffice against similar attacks on classifiers Athalye2018 . Our preliminary studies indicate the difference can be explained by the much larger dimensionality of the hidden convolutional features — attacking the similarity metric requires modification of all convolutional features, as opposed to just the output layer. While supporting evidence is found in our local eigenvalue analysis, we stress that we do not assert theoretical robustness guarantees without further study. Still, the observed higher practical resilience to attack is visible as increased robustness when the similarity metric is used as a loss function in optimization. Moreover, the new metric remains a good predictor of human judgment of image similarity, like lpips.
By computing barycenters of image sets and discrete geodesics between image pairs, we found evidence of "perceptual convexity" in the e-lpips metric: averages taken under it retain a much closer appearance to the images being averaged than under lpips or $L_2$, and geodesic sequences interestingly linearize effects such as small translation and rotation, without explicit correspondence or optical flow computation. Both findings merit systematic further study.
The supplemental material and source code are available at github.com/mkettune/elpips.
We thank Pauli Kemppinen, Frédo Durand, Miika Aittala, Sylvain Paris, Alexei Efros, Richard Zhang, Taesung Park, Tero Karras, Samuli Laine, Timo Aila, and Antti Tarvainen for in-depth discussions; and Seyoung Park for helping with the TensorFlow port of LPIPS. We acknowledge the computational resources provided by the Aalto Science-IT project.