Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics is not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics. Code: https://github.com/clovaai/generative-evaluation-prdc
Assessing a generative model is difficult. Unlike the evaluation of discriminative models, which is often easily done by measuring the prediction performance on a few labelled samples, generative models are assessed by measuring the discrepancy between the real and generated (fake) sets of high-dimensional data points. Adding to the complexity, there is more than one way of measuring distances between two distributions, each with its own pros and cons. In fact, even measures based on human judgement, like Mean Opinion Scores (MOS), are not ideal, as practitioners have diverse opinions on what the “ideal” generative model is (Borji, 2019).
Nonetheless, there must be some measurement of the quality of generative models for the progress of science. Several quantitative metrics have been proposed, albeit with their own sets of trade-offs. For example, the Fréchet Inception Distance (FID) score (Heusel et al., 2017), the most popular metric in image generation tasks, has empirically exhibited good agreement with human perceptual scores. However, FID summarises the comparison of two distributions into a single number, failing to separate two important aspects of the quality of generative models: fidelity and diversity (Sajjadi et al., 2018). Fidelity refers to the degree to which the generated samples resemble the real ones. Diversity, on the other hand, measures whether the generated samples cover the full variability of the real samples.
Recent papers (Sajjadi et al., 2018; Simon et al., 2019; Kynkäänniemi et al., 2019) have introduced precision and recall metrics as measures of fidelity and diversity, respectively. Though precision and recall metrics have introduced important perspectives in generative model evaluation, we show in this paper that they are not yet ready for practical use. We argue that the necessary conditions for useful evaluation metrics are: (1) ability to detect identical real and fake distributions, (2) robustness to outlier samples, (3) responsiveness to mode dropping, and (4) ease of hyperparameter selection in the evaluation algorithms. Unfortunately, even the most recent version of the precision and recall metrics (Kynkäänniemi et al., 2019) fails to meet these requirements.
To address the practical concerns, we propose the density and coverage metrics. By introducing a simple yet carefully designed manifold estimation procedure, we not only make the fidelity-diversity metrics empirically reliable but also theoretically analysable. We test our metric on generative adversarial networks, one of the most successful generative models in recent years.
We then study the embedding algorithms for evaluating image generation algorithms. Embedding is an inevitable ingredient due to the high dimensionality of images and the lack of semantics in the RGB space. Despite its importance, the embedding pipeline has been relatively little studied in the existing literature; evaluations of generated images mostly rely on the features from an ImageNet pre-trained model (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Simon et al., 2019; Kynkäänniemi et al., 2019; Deng et al., 2009). This sometimes limits fair evaluation and provides a false sense of improvement, since the pre-trained models inevitably include the dataset bias (Torralba & Efros, 2011; Geirhos et al., 2019). We show that such pre-trained embeddings often exhibit unexpected behaviours as the target distribution moves away from the natural image domain.
To exclude the dataset bias, we consider using randomly initialised CNN feature extractors (Ulyanov et al., 2018). We compare the evaluation metrics on MNIST and sound generation tasks using the random embeddings. We observe that random embeddings provide more macroscopic views on the distributional discrepancies. In particular, random embeddings provide more sensible evaluation results when the target data distribution is significantly different from ImageNet statistics (e.g. MNIST and spectrograms).
Given a real distribution $P(X)$ and a generative model $Q(Y)$, we assume that we can sample $\{X_i\}$ and $\{Y_j\}$, respectively. We need an algorithm to assess how likely it is that the two sets of samples arise from the same distribution. When the involved distribution families are tractable and the full density functions can be easily estimated, statistical testing methods or distributional distance measures (e.g. Kullback-Leibler Divergence or Expected Likelihood) are viable. However, when the data are complex and high-dimensional (e.g. natural images), it becomes difficult to apply such measures naively (Theis et al., 2016). Because of this difficulty, the evaluation of samples from generative models is still an actively researched topic. In this section, we provide an overview of existing approaches. We describe the most widely-used evaluation pipeline for image generative models (§2.1), and then introduce the prior works on fidelity and diversity measures (§2.2). For a more extensive survey, see Borji (2019).
It is difficult to conduct statistical analyses over complex and high-dimensional data in their raw form (e.g. images). Thus, evaluation metrics for image generative models largely follow these stages: (1) embed real and fake data ($\{X_i\}$ and $\{Y_j\}$) into a Euclidean space $\mathbb{R}^D$ through a non-linear mapping $f$ like CNN feature extractors, (2) construct real and fake distributions over $\mathbb{R}^D$ with the embedded samples $\{f(X_i)\}$ and $\{f(Y_j)\}$, and (3) quantify the discrepancy between the two distributions. We describe each stage in the following paragraphs.
Embeddings. It is often difficult to define a sensible metric over the input space. For example, distance over the image pixels is misleading because two perceptually identical images may be far apart (e.g. under a one-pixel translation) (Theis et al., 2016). To overcome this difficulty, researchers have introduced ImageNet pre-trained CNN feature extractors as the embedding function $f$ in many generative model evaluation metrics (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019), based on the reasoning that distances in the feature space provide sensible proxies for the human perceptual metric (Zhang et al., 2018). Since we always use embedded samples for computing metrics, we write $X_i$ and $Y_j$ for $f(X_i)$ and $f(Y_j)$, respectively.
In this work, we adopt ImageNet pre-trained CNNs for the embedding, but we criticise their use when the data distribution is too distinct from the ImageNet distribution (§4.2). We suggest randomly-initialised CNN feature extractors as a strong alternative in such cases (Ulyanov et al., 2018; Zhang et al., 2018). We show that for MNIST digit images or sound spectrograms, which have large domain gaps from ImageNet, random embeddings provide more sensible evaluation measures.
Building and comparing distributions. Given embedded samples $\{X_i\}$ and $\{Y_j\}$, many metrics conduct some form of (non-)parametric statistical estimation. Parzen window estimates (Bengio et al., 2013) approximate the likelihoods of the fake samples $\{Y_j\}$ by estimating the density $P(X)$ with Gaussian kernels around the real samples $\{X_i\}$. On the parametric side, Inception scores (IS) (Salimans et al., 2016) estimate the multinomial distribution over the 1000 ImageNet classes for each sample image and compare it against the estimated marginalised distribution with the KL divergence. Fréchet Inception Distance (FID) (Heusel et al., 2017) estimates the mean and covariance for $P(X)$ and $Q(Y)$, assuming that they are multivariate Gaussians. The distance between the two Gaussians is computed by the Fréchet distance (Dowson & Landau, 1982), also known as the Wasserstein-2 distance (Vaserstein, 1969). FID has been reported to generally match human judgements (Heusel et al., 2017); it has been the most popular metric for image generative models in the last couple of years (Borji, 2019).
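For reference, the Fréchet distance between two Gaussians has a standard closed form, and FID is conventionally reported as its square. The notation below is an assumption introduced for illustration: $(\mu_X, \Sigma_X)$ and $(\mu_Y, \Sigma_Y)$ denote the means and covariances estimated from the embedded real and fake samples.

```latex
\mathrm{FID}
  = \lVert \mu_X - \mu_Y \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_X + \Sigma_Y - 2\,(\Sigma_X \Sigma_Y)^{1/2} \right)
```

The single scalar mixes a mean term and a covariance term, which is one way to see why it cannot separate fidelity from diversity.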
While single-value metrics like IS and FID have led to interesting advances in the field by ranking generative models, they are not ideal for diagnostic purposes. One of the most important aspects of generative models is the trade-off between fidelity (how realistic each generated sample is) and diversity (how well fake samples capture the variations in the real samples). We introduce variants of the two-value metrics (precision and recall) that capture the two characteristics separately.
Precision and recall. Sajjadi et al. (2018) have reported the pathological case where two generative models have similar FID scores, while their qualitative fidelity and diversity results are different. For a better diagnosis of generative models, Sajjadi et al. (2018) have thus proposed the precision and recall metrics based on the estimated supports of the real and fake distributions. Precision is defined roughly as the portion of $Q(Y)$ that can be generated by $P(X)$; recall is symmetrically defined as the portion of $P(X)$ that can be generated by $Q(Y)$. While conceptually useful, these metrics have multiple practical drawbacks. They assume that the embedding space is uniformly dense, rely on the initialisation-sensitive k-means algorithm for support estimation, and produce an infinite number of values as the metric.
Improved precision and recall. Kynkäänniemi et al. (2019) have proposed the improved precision and recall (P&R), which address the above drawbacks. The probability density functions are estimated via k-nearest neighbour distances, overcoming the uniform-density assumption and the reliance on the k-means algorithm. Our proposed metrics are based on P&R; we explain the full details of P&R here.
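P&R first constructs the “manifold” for $P(X)$ and $Q(Y)$ separately. This object is nearly identical to the probability density function, except that it does not sum to 1. Precision then measures the expected likelihood of fake samples against the real manifold and recall measures the expected likelihood of real samples against the fake manifold:

$$\text{precision} := \frac{1}{M} \sum_{j=1}^{M} 1_{Y_j \in \text{manifold}(X_1, \dots, X_N)} \tag{1}$$

$$\text{recall} := \frac{1}{N} \sum_{i=1}^{N} 1_{X_i \in \text{manifold}(Y_1, \dots, Y_M)} \tag{2}$$

where $N$ and $M$ are the numbers of real and fake samples and $1_{(\cdot)}$ is the indicator function. Manifolds are defined as

$$\text{manifold}(X_1, \dots, X_N) := \bigcup_{i=1}^{N} B(X_i, \text{NND}_k(X_i)) \tag{3}$$

where $B(x, r)$ is the sphere in $\mathbb{R}^D$ around $x$ with radius $r$, and $\text{NND}_k(X_i)$ denotes the distance from $X_i$ to the $k$-th nearest neighbour among $\{X_i\}$, excluding itself. Example computation of P&R is shown in Figure 3.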
There are similarities between the precision above and the Parzen window estimate (Bengio et al., 2013). If the manifolds are formed by superposition (summation) instead of the union in Equation 3, and the spheres are replaced with Gaussian kernels of fixed variance, then Equation 1 computes the expected likelihood of fake samples, as in the Parzen window estimate.
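The P&R computation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names `knn_radii`, `in_manifold`, and `precision_recall` are hypothetical, samples are tuples of floats, and the brute-force neighbour search is only suitable for tiny inputs.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def in_manifold(query, points, radii):
    """True if `query` lies inside at least one neighbourhood sphere."""
    return any(math.dist(query, p) < r for p, r in zip(points, radii))


def precision_recall(real, fake, k):
    """Precision and recall from the union-of-spheres manifolds."""
    real_radii = knn_radii(real, k)
    fake_radii = knn_radii(fake, k)
    precision = sum(in_manifold(y, real, real_radii) for y in fake) / len(fake)
    recall = sum(in_manifold(x, fake, fake_radii) for x in real) / len(real)
    return precision, recall


# Four clustered real points plus one real outlier at (10, 0).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 0.0)]
fake = [(0.5, 0.5), (6.0, 0.0), (20.0, 0.0)]
precision, recall = precision_recall(real, fake, k=2)
print(precision, recall)
```

In this toy input the fake point (6, 0) counts towards precision only because the real outlier at (10, 0) has a very large neighbourhood sphere, and every real point is "recalled" because the three sparse fake points have large spheres as well.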
We propose novel performance measures density and coverage (D&C) as practically usable measures that successfully remedy the problems with precision and recall.
The practicality of the improved precision and recall (P&R) is still compromised by their vulnerability to outliers and their computational inefficiency. Building the nearest neighbour manifolds (Equation 3) must be performed carefully because the spheres around each sample are not normalised according to their radii or the relative density of samples in the neighbourhood. Consequently, the nearest neighbour manifolds generally overestimate the true manifold around the outliers, leading to undesired effects in practice. We explain the drawbacks at the conceptual level here; they will be quantified in §4.
Precision. We first show a pathological case for precision in Figure (a). Because of the real outlier sample, the real manifold is overestimated. Generating many fake samples around the real outlier is enough to increase the precision measure.
Recall. The nearest neighbour manifold is built upon the fake samples in Equation 2. Since models often generate many unrealistic yet diverse samples, the fake manifold is often an overestimation of the true fake distribution. The pathological example is shown in Figure (b). While the fake samples are generally far from the modes in real samples, the recall measure is rewarded by the fact that real samples are contained in the overestimated fake manifold. Another problem with relying on fake manifolds for the recall computation is that the manifold must be computed per model. For example, to generate the recall-vs-iteration curve for training diagnosis, the k-nearest neighbours for all fake samples must be recomputed for every data point on the curve.
We remedy the issues with P&R above with simple fixes in each metric.
Density. Density improves upon the precision metric by fixing the overestimation of the manifold around real outliers. Precision counts the binary decision of whether the fake sample $Y_j$ is contained in any neighbourhood sphere $\{B(X_i, \text{NND}_k(X_i))\}_i$ of real samples (Equation 1). Density, instead, counts how many real-sample neighbourhood spheres contain $Y_j$ (Equation 4). The manifold is now formed by the superposition of the neighbourhood spheres $\{B(X_i, \text{NND}_k(X_i))\}_i$, and a form of expected likelihood of fake samples is measured. In this sense, the density measure lies midway between the precision metric and the Parzen window estimate. Density is defined as

$$\text{density} := \frac{1}{kM} \sum_{j=1}^{M} \sum_{i=1}^{N} 1_{Y_j \in B(X_i, \text{NND}_k(X_i))} \tag{4}$$

where $k$ is the number of nearest neighbours used for the radius $\text{NND}_k(X_i)$. Through this modification, density rewards samples in regions where real samples are densely packed, reducing the vulnerability to outliers. For example, the problem of overestimating precision (100%) in Figure (a) is resolved using the density estimate (only 60% now). Note that unlike precision, density is not upper bounded by 1; it may be greater than 1 if the fake samples are concentrated around the dense regions of real samples.
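The density definition can be illustrated with a short brute-force sketch. The helper names `knn_radii` and `density` and the toy points are assumptions made for this example; it is not the released implementation and is only meant for very small inputs.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def density(real, fake, k):
    """Count (fake, real-sphere) containments, normalised by k * len(fake)."""
    radii = knn_radii(real, k)
    hits = sum(
        math.dist(y, x) < r
        for y in fake
        for x, r in zip(real, radii)
    )
    return hits / (k * len(fake))


# Four clustered real points plus one real outlier at (10, 0).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 0.0)]
near_outlier = [(6.0, 0.0), (7.0, 0.0), (8.0, 0.0)]
in_dense_region = [(0.5, 0.5)]

print(density(real, near_outlier, k=2))     # 0.5
print(density(real, in_dense_region, k=2))  # 2.0
```

All three points in `near_outlier` fall inside the outlier's sphere, so precision would score them at 100%, whereas density drops to 0.5 because each is covered by only one sphere. The point placed in the dense cluster is covered by four spheres, which shows how density can exceed 1.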
Coverage. Diversity, intuitively, should be measured by the ratio of real samples that are covered by the fake samples. Coverage improves upon the recall metric to better quantify this criterion by building the nearest neighbour manifolds around the real samples, instead of the fake samples, as they have fewer outliers. Moreover, the manifold needs to be computed only once per dataset, instead of per model, reducing the heavy nearest neighbour computations in recall. Coverage is defined as

$$\text{coverage} := \frac{1}{N} \sum_{i=1}^{N} 1_{\exists\, j \text{ s.t. } Y_j \in B(X_i, \text{NND}_k(X_i))} \tag{5}$$

It measures the fraction of real samples whose neighbourhoods contain at least one fake sample. Coverage is bounded between 0 and 1.
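A matching sketch for coverage follows. As before, `knn_radii`, `coverage`, and the toy points are hypothetical names and data chosen for illustration, and the brute-force search is an assumption made for brevity.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(real, fake, k):
    """Fraction of real samples whose k-NN sphere contains at least one fake sample."""
    radii = knn_radii(real, k)
    covered = sum(
        any(math.dist(x, y) < r for y in fake)
        for x, r in zip(real, radii)
    )
    return covered / len(real)


# Four clustered real points plus one real outlier at (10, 0).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 0.0)]
near_outlier = [(6.0, 0.0), (7.0, 0.0), (8.0, 0.0)]
in_dense_region = [(0.5, 0.5)]

print(coverage(real, near_outlier, k=2))     # 0.2
print(coverage(real, in_dense_region, k=2))  # 0.8
```

Only the real-sample radii are needed, so in a training-diagnosis loop they could be computed once and reused for every model checkpoint.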
The simplest sanity check for an evaluation metric is whether the metric attains the best value when the intended criteria are met. For generative models, we examine whether D&C attain 100% performance when the real and fake distributions are identical ($P = Q$). Here, we show that, unlike P&R, D&C yield analytic expressions for the expected values $\mathbb{E}[\text{density}]$ and $\mathbb{E}[\text{coverage}]$ for identical real and fake distributions. This analysis leads to further insights into the metrics and a systematic algorithm for selecting the hyperparameters ($k$ nearest neighbours, $N$ real samples, and $M$ fake samples).
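Lack of analytic results for P&R. From Equation 1, the expected precision for identical real and fake distributions is

$$\mathbb{E}[\text{precision}] = \frac{1}{M} \sum_{j=1}^{M} \mathbb{P}\big(Y_j \in \text{manifold}(X_1, \dots, X_N)\big) = \mathbb{P}\Big(\bigcup_{i=1}^{N} A_i\Big) \tag{8}$$

where $A_i$ is the event $\{\lVert Y - X_i \rVert < \text{NND}_k(X_i)\}$ for a single fake sample $Y$. Since the events $(A_i)_i$ are not independent and have a complex dependence structure, a simple expression for Equation 8 does not exist. The same observation holds for $\mathbb{E}[\text{recall}]$.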
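Analytic derivations for D&C. We derive expressions for the expected values of D&C under identical real and fake distributions.

The expected density boils down to

$$\mathbb{E}[\text{density}] = \frac{1}{kM} \sum_{j=1}^{M} \sum_{i=1}^{N} \mathbb{P}(B_{i,j}) = \frac{N}{k}\, \mathbb{P}(B_{1,1})$$

where $B_{i,j}$ is the event that $\lVert Y_j - X_i \rVert$ is at most the $k$-th smallest among the $N$ random variables $\{\lVert X_i - X_l \rVert\}_{l \neq i} \cup \{\lVert X_i - Y_j \rVert\}$. Since the random variables are independent and identically distributed, any particular ranking of them is equally likely, with probability $1/N!$. Since the number of rankings with a certain random variable at a particular rank is $(N-1)!$, we have the probability

$$\mathbb{P}(B_{1,1}) = \frac{k\,(N-1)!}{N!} = \frac{k}{N}$$

and therefore $\mathbb{E}[\text{density}] = 1$.

The expected coverage is estimated as

$$\mathbb{E}[\text{coverage}] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{P}\big(\exists\, j \text{ s.t. } Y_j \in B(X_i, \text{NND}_k(X_i))\big) = \mathbb{P}\big(\exists\, j \text{ s.t. } Y_j \in B(X_1, \text{NND}_k(X_1))\big)$$

using the fact that the events $\{\exists\, j \text{ s.t. } Y_j \in B(X_i, \text{NND}_k(X_i))\}$ are symmetrical with respect to $i$. The event for $i = 1$ can be re-written as

$$\min_{j} \lVert X_1 - Y_j \rVert < \text{NND}_k(X_1).$$

We write $E_i$ for $\lVert X_1 - X_i \rVert$ ($i = 2, \dots, N$) and $F_j$ for $\lVert X_1 - Y_j \rVert$ ($j = 1, \dots, M$). Then, the set of random variables $\{E_i\} \cup \{F_j\}$ is independent and identically distributed. The probability of the complementary event, in which no fake sample falls inside the sphere, can be equivalently described as:

Assume there are $N + M - 1$ non-negative real numbers drawn independently from a common distribution. Colour $M$ of them red uniformly at random and colour the remaining $N - 1$ blue. What is the chance that the $k$ smallest among them are all coloured blue?

Since any assignment of red and blue colours is equally likely, we compute the probability by counting the ratio of possible colour assignments where the $k$ smallest elements are coloured blue. The formula is written as

$$\mathbb{E}[\text{coverage}] = 1 - \frac{\binom{N+M-1-k}{M}}{\binom{N+M-1}{M}} = 1 - \frac{(N-1)\cdots(N-k)}{(M+N-1)\cdots(M+N-k)}.$$

Moreover, as $M = N \to \infty$, $\mathbb{E}[\text{coverage}] \to 1 - \frac{1}{2^k}$.

Note that the expected values do not depend upon the distribution type or the dimensionality of the data.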
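The colouring argument can be checked numerically. The sketch below is an illustration under stated assumptions: `expected_coverage` evaluates the product form of one minus the all-blue probability, $1 - \prod_{t=1}^{k} \frac{N-t}{M+N-t}$, obtained by counting colour assignments, and `simulated_coverage` replays the red/blue colouring experiment directly on ranks. Both function names and the sample sizes are hypothetical.

```python
import random


def expected_coverage(n_real, n_fake, k):
    """Closed-form expected coverage when real and fake distributions are identical."""
    all_blue = 1.0
    for t in range(1, k + 1):
        all_blue *= (n_real - t) / (n_real + n_fake - t)
    return 1.0 - all_blue


def simulated_coverage(n_real, n_fake, k, trials, seed=0):
    """Monte Carlo estimate based on the red/blue colouring argument."""
    rng = random.Random(seed)
    total = n_real - 1 + n_fake  # N - 1 real distances plus M fake distances
    covered = 0
    for _ in range(trials):
        # Ranks 0 .. total-1 order the distances; pick which ranks are red (fake).
        red_ranks = rng.sample(range(total), n_fake)
        # The real sample is covered if any of the k smallest distances is red.
        if min(red_ranks) < k:
            covered += 1
    return covered / trials


exact = expected_coverage(20, 20, 3)
estimate = simulated_coverage(20, 20, 3, trials=20000)
print(exact, estimate)
```

Because only ranks are simulated, no distribution or dimensionality ever enters the computation, which mirrors the independence property noted above.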
Given the analytic expression for the expected D&C for identical real and fake distributions, we can systematically choose $k$, $M$, and $N$. The aim of our hyperparameter selection is to ensure that $\mathbb{E}[\text{coverage}]$ is close to 1 when the distributions are identical.
Since $\mathbb{E}[\text{coverage}]$ does not depend upon the exact distribution type or dimensionality of data (Equation 9), the hyperparameters chosen as above will be effective across various data types. We verify the consistency of D&C across data types in Figure 4. It contains plots of the P&R and D&C values for identical real and fake distributions from (1) 64-dimensional multivariate standard Gaussians, (2) 4096-dimensional ImageNet pre-trained VGG embeddings of the FFHQ face images, and (3) analytic estimations in §3.3. While P&R exhibit significantly different estimated values across the Gaussians and the FFHQ embeddings, the D&C metrics agree on all three types of estimates, confirming their independence of data type. This conceptual advantage leads to a confident choice of evaluation hyperparameters that are effective across data types.
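In practice, we choose the hyperparameters so that $\mathbb{E}[\text{coverage}]$ is sufficiently close to 1. For the sake of symmetry, we first set $M = N$. We then set $M = N$ large enough to ensure a good approximation $\mathbb{E}[\text{coverage}] \approx 1 - \frac{1}{2^k}$, while keeping the computational cost tractable. The number of neighbours $k$ is then chosen so that this approximate value meets the target. The exact principle behind the choices of the corresponding values for P&R is unknown.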
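The selection procedure can be sketched as a small search over $k$. The target of 0.95 and the sample size of 10,000 used below are illustrative assumptions, not values taken from the text, and the function names are hypothetical. `expected_coverage` evaluates $1 - \prod_{t=1}^{k} \frac{N-t}{M+N-t}$, the chance that at least one of the $k$ nearest distances to a real sample belongs to a fake sample when both sets come from the same distribution.

```python
def expected_coverage(n_real, n_fake, k):
    """Closed-form expected coverage when real and fake distributions are identical."""
    all_blue = 1.0
    for t in range(1, k + 1):
        all_blue *= (n_real - t) / (n_real + n_fake - t)
    return 1.0 - all_blue


def smallest_k(target, n_real, n_fake, k_max=100):
    """Smallest k whose expected coverage exceeds `target`, or None if none does."""
    for k in range(1, min(k_max, n_real - 1) + 1):
        if expected_coverage(n_real, n_fake, k) > target:
            return k
    return None


n = 10000  # assumed sample size with M = N
for k in range(1, 7):
    print(k, round(expected_coverage(n, n, k), 5), round(1 - 0.5 ** k, 5))

print(smallest_k(0.95, n, n))  # 5
```

With equal and large sample sizes the exact value and the limit $1 - 1/2^k$ agree to several decimal places, so the choice of $k$ is effectively decided by the limit alone.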
We empirically assess the proposed density and coverage (D&C) metrics and compare against the improved precision and recall (P&R) (Kynkäänniemi et al., 2019). Evaluating evaluation metrics is difficult as the ground truth is often not provided. We carefully select sanity checks with toy and real-world data where the desired behaviours of evaluation metrics are clearly defined. At the end of the section, we study the embedding pipeline (§2.1) and advocate the use of randomly initialised embedding networks under certain scenarios.
We build several toy and real-world data scenarios to examine the behaviour of the four evaluation metrics: P&R and D&C. We first show results on toy data where the desired behaviours of the fidelity and diversity metrics are well-defined (§4.1.1). We then move on to diverse real-world data cases to show that the observations extend to complex data distributions (§4.1.2).
We assume Gaussian or mixture-of-Gaussian distributions for the real $X$ and fake $Y$ in $\mathbb{R}^D$. We simulate two main scenarios: (1) $Q(Y)$ moves away from $P(X)$ (Figure 6) and (2) $Q(Y)$ gradually fails to cover the modes in $P(X)$ (Figure 6). We discuss each result in detail below.
Translating $Q(Y)$ away from $P(X)$. We set $X \sim \mathcal{N}(\mathbf{0}, I)$ and $Y \sim \mathcal{N}(\mu \mathbf{1}, I)$ in $\mathbb{R}^D$, where $\mathbf{1}$ is the vector of ones and $I$ is the identity matrix. We study how the metrics change as $\mu$ varies. In Figure 6, without any outlier, this setting leads to decreasing values of P&R and D&C as $Q(Y)$ moves away from $P(X)$, the desired behaviour for all metrics. However, P&R show a pathological behaviour when the distributions match ($\mu = 0$): their values are far below 1 (0.68 precision and 0.67 recall). D&C, on the other hand, achieve values close to 1 (1.06 density and 0.97 coverage). D&C detect the distributional match better than P&R.
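A scaled-down version of this translation experiment can be run in plain Python. Everything numeric here is an assumption made to keep the brute-force computation fast (8 dimensions, 300 samples per set, $k = 5$, three shift values); it does not reproduce the setting or the numbers reported above, and the helper names are hypothetical.

```python
import math
import random


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def density_and_coverage(real, radii, fake, k):
    """Density and coverage of `fake` against precomputed real-sample radii."""
    hits = 0
    covered = 0
    for x, r in zip(real, radii):
        inside = sum(math.dist(x, y) < r for y in fake)
        hits += inside
        covered += inside > 0
    return hits / (k * len(fake)), covered / len(real)


def gaussian_samples(n, dim, mean, rng):
    """Draw n points from a Gaussian with mean `mean` * ones and identity covariance."""
    return [tuple(rng.gauss(mean, 1.0) for _ in range(dim)) for _ in range(n)]


rng = random.Random(0)
DIM, N, K = 8, 300, 5  # illustrative, scaled-down settings
real = gaussian_samples(N, DIM, 0.0, rng)
radii = knn_radii(real, K)

results = {}
for mu in (0.0, 1.0, 3.0):
    fake = gaussian_samples(N, DIM, mu, rng)
    results[mu] = density_and_coverage(real, radii, fake, K)
    print(mu, results[mu])
```

With no shift, density should land near 1 and coverage near its analytic expectation, and both should fall as the fake distribution is translated away.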
Translating with real or fake outliers. We repeat the previous experiment with exactly one outlier in either the real or the fake samples (Figure 6). Robust metrics must not be affected by one outlier sample. However, P&R are vulnerable to this one sample; precision increases as $\mu$ grows above a certain value and recall increases as $\mu$ decreases below a certain value. The behaviour is attributed to the overestimation of the manifold in the region enclosed by the inliers and the outlier (§3.1). This susceptibility is a serious issue in practice because outliers are common in realistic data and generated outputs. On the other hand, the D&C measures for the outlier cases largely coincide with the no-outlier case. D&C are sturdy measures.
Mode dropping. We assume that the real distribution $P(X)$ is a mixture of Gaussians in $\mathbb{R}^D$ with ten modes. We simulate the fake distribution $Q(Y)$ initially as identical to $P(X)$, and gradually drop all but one mode. See Figure 6 for an illustration. We consider two ways in which the modes are dropped: (1) each mode is dropped sequentially and (2) weights on all but one mode are decreased simultaneously. Under the sequential mode dropping, both recall and coverage gradually drop, capturing the decrease in diversity. However, under the simultaneous dropping, recall cannot capture the decrease in diversity until the concentration on the first mode reaches 90%, and it shows a sudden drop when the fake distribution becomes unimodal. Coverage, on the other hand, decreases gradually even under the simultaneous dropping. It reliably captures the decrease in diversity in this case.
Having verified the metrics on toy Gaussians, we assess the metrics on real-world images. As in the toy experiments, we focus on the behaviour of metrics under corner cases including outliers and mode dropping. We further examine the behaviour with respect to the latent truncation threshold (Karras et al., 2019). As the embedding network, we use the fc2 layer features of the ImageNet pre-trained VGG16 (Kynkäänniemi et al., 2019).
Before studying the impact of outliers on the evaluation metrics, we introduce our criterion for outlier detection. Motivated by Kynkäänniemi et al. (2019), we use the distance to the $k$-th nearest neighbour among the fake samples. According to this criterion, we split the fake samples into inliers and outliers. We experiment with fake images generated by StyleGAN on CelebA (Liu et al., 2015) and LSUN-bedroom (Yu et al., 2015). Example images of inliers and outliers are shown in Figure 7. We observe that the outliers have more distortions and atypical semantics.
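The outlier criterion can be sketched as a ranking by $k$-th nearest neighbour distance. The function names, the `outlier_fraction` parameter, and the toy points are hypothetical; the fraction used for the split in the experiments is not recoverable from the text, so the value below is purely illustrative.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def split_by_knn_distance(fake, k, outlier_fraction):
    """Label the samples with the largest k-NN distances as outliers."""
    radii = knn_radii(fake, k)
    order = sorted(range(len(fake)), key=lambda i: radii[i])
    n_outliers = round(outlier_fraction * len(fake))
    cut = len(fake) - n_outliers
    inliers = [fake[i] for i in order[:cut]]
    outliers = [fake[i] for i in order[cut:]]
    return inliers, outliers


fake = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (9.0, 9.0)]
inliers, outliers = split_by_knn_distance(fake, k=1, outlier_fraction=0.2)
print(outliers)
```

Adding the `outliers` list back to the `inliers` in small increments then gives the kind of sweep used to probe how sensitive recall and coverage are to such samples.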
We examine the behaviour of recall and coverage as the outliers are gradually added to the pool of fake samples. In Figure 7, we plot recall and coverage relative to their values when there are only inliers. As outliers are added, recall increases more than 11% and 15% on CelebA and LSUN bedroom, respectively, demonstrating its vulnerability to outliers. Coverage, on the other hand, is stable: less than 2% increase with extra outliers.
Mode dropping. As in the toy experiments, we study mode dropping on the MNIST digit images. We treat the ten classes as modes and simulate the scenario where a generative model gradually favours a particular mode (class “0”) over the others. We use the real data (MNIST images) with decreasing number of classes covered as our fake samples (sequential dropping). The results are shown in Figure 10. While recall is unable to detect the decrease in overall diversity until the ratio of class “0” is over 90%, coverage exhibits a gradual drop. Coverage is superior to recall at detecting mode dropping.
Resolving fidelity and diversity. The main motivation behind two-value metrics like P&R is the diagnosis involving the fidelity and diversity of generated images. We validate whether D&C successfully achieve this. We perform the truncation analysis on StyleGAN (Karras et al., 2019) on FFHQ. The truncation technique is used in generative models to artificially manipulate the learned distributions by thresholding the latent noise with $\psi$ (Brock et al., 2018). In general, for greater $\psi$, data fidelity decreases and diversity increases, and vice versa. This stipulates the desired behaviours of the P&R and D&C metrics.
The results are shown in Figure 11. With increasing $\psi$, precision and density decrease, while recall and coverage increase. Note that density varies more than precision as $\psi$ increases, leading to a finer-grained diagnosis of fidelity.
Image embedding is an important component in generative model evaluation (§2.1). Yet, this ingredient is relatively less studied in existing literature. In this section, we explore the limitations of the widely-used ImageNet embeddings when the target data is significantly distinct from the ImageNet samples. In such cases, we propose random embeddings as viable alternatives. In the following experiments, T4096 denotes the fc2 layer features of VGG16 network pre-trained on ImageNet and R64 denotes the fc2 layer features of a randomly initialised VGG16 network, where fc2 is a 64-dimensional linear layer replacing the 4096-dimensional fc2 layer. We experiment on MNIST and a sound dataset, Speech Commands Zero Through Nine (SC09).
[Table 1 panels: Real MNIST, Real spectrogram]
We study the behaviour of metrics for different truncation thresholds ($\psi$) on StyleGAN generated FFHQ images (Figure 11). We confirm that the metrics on R64 closely follow the trend with T4096. P&R and D&C over R64 can be used on ImageNet-like images to capture the fidelity and diversity aspects of model outputs.
We consider the scenario where the target distribution is significantly different from the ImageNet statistics. We use DCGAN generated images (Radford et al., 2016) on MNIST (LeCun et al., 1998) and WaveGAN generated spectrograms (Donahue et al., 2019) on Speech Commands Zero Through Nine (SC09) dataset. The qualitative examples and corresponding metrics are reported in Table 1.
Compared to the high quality of generated MNIST samples, the metrics on T4096 are generally low (e.g. 0.047 density), while the metrics on R64 report reasonably high scores (e.g. 0.491 density). For sound data, likewise, the metrics on T4096 do not faithfully represent the general fidelity and diversity of sound samples. The real and fake sound samples are provided at http://bit.ly/38DIMAA and http://bit.ly/2HAm8NB. The samples consist of human voice samples of the digit words “zero” to “nine”. The fake samples do still lack fidelity, but they cover diverse number classes. Thus, the recall (0.029) and coverage (0.020) values under T4096 are severe underestimations of the actual diversity. Under R64, the recall and coverage values are in a more sensible range: 0.572 and 0.653, respectively. When the target data domain differs significantly from the embedding training domain, R64 may be a more reasonable choice.
We have systematically studied the existing metrics for evaluating generative models with a particular focus on the fidelity and diversity aspects. While the recent work on the improved precision and recall (Kynkäänniemi et al., 2019) provides good estimates of such aspects, we discover certain failure cases where they are not practical yet: overestimating the manifolds, underestimating the scores when the real and fake distributions are identical, not being robust to outliers, and not detecting certain mode dropping. To remedy the issues, we have proposed the novel density and coverage metrics. Density and coverage have an additional conceptual advantage in that they allow a systematic selection of the involved hyperparameters. We suggest that future researchers use density and coverage for a more stable and reliable diagnosis of their models. On top of this, we analyse the less-studied component of embedding. Prior metrics have mostly relied on ImageNet pre-trained embeddings. We argue through empirical studies that random embeddings are better choices when the target distribution is significantly different from the ImageNet statistics.
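Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
Dowson, D. C. and Landau, B. V. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 1982.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.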