1 Introduction
Assessing a generative model is difficult. Unlike discriminative models, whose evaluation is often easily done by measuring prediction performance on a few labelled samples, generative models are assessed by measuring the discrepancy between the real and generated (fake) sets of high-dimensional data points. Adding to the complexity, there is more than one way of measuring the distance between two distributions, each with its own pros and cons. In fact, even human-judgement-based measures like Mean Opinion Scores (MOS) are not ideal, as practitioners have diverse opinions on what the "ideal" generative model is
(Borji, 2019). Nonetheless, some measurement of the quality of generative models is necessary for the progress of science. Several quantitative metrics have been proposed, albeit with their own sets of trade-offs. For example, the Fréchet Inception Distance (FID) score (Heusel et al., 2017), the most popular metric in image generation tasks, has empirically exhibited good agreement with human perceptual scores. However, FID summarises the comparison of two distributions into a single number, failing to separate two important aspects of the quality of generative models: fidelity and diversity (Sajjadi et al., 2018). Fidelity refers to the degree to which the generated samples resemble the real ones. Diversity, on the other hand, measures whether the generated samples cover the full variability of the real samples.
Recent papers (Sajjadi et al., 2018; Simon et al., 2019; Kynkäänniemi et al., 2019) have introduced precision and recall metrics as measures of fidelity and diversity, respectively. Though the precision and recall metrics have introduced important perspectives in generative model evaluation, we show in this paper that they are not yet ready for practical use. We argue that the necessary conditions for useful evaluation metrics are: (1) the ability to detect identical real and fake distributions, (2) robustness to outlier samples, (3) responsiveness to mode dropping, and (4) the ease of hyperparameter selection in the evaluation algorithms. Unfortunately, even the most recent version of the precision and recall metrics (Kynkäänniemi et al., 2019) fails to meet these requirements.
To address these practical concerns, we propose the density and coverage metrics. By introducing a simple yet carefully designed manifold estimation procedure, we not only make the fidelity-diversity metrics empirically reliable but also theoretically analysable. We test our metrics on generative adversarial networks, one of the most successful families of generative models in recent years.
We then study the embedding algorithms used for evaluating image generation algorithms. Embedding is an inevitable ingredient due to the high dimensionality of images and the lack of semantics in the RGB space. Despite its importance, the embedding pipeline has been relatively little studied in the existing literature; evaluations of generated images mostly rely on features from an ImageNet-pretrained model (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Simon et al., 2019; Kynkäänniemi et al., 2019; Deng et al., 2009). This sometimes limits fair evaluation and provides a false sense of improvement, since pretrained models inevitably carry dataset bias (Torralba & Efros, 2011; Geirhos et al., 2019). We show that such pretrained embeddings often exhibit unexpected behaviours as the target distribution moves away from the natural image domain. To exclude the dataset bias, we consider randomly initialised CNN feature extractors (Ulyanov et al., 2018). We compare the evaluation metrics on MNIST and sound generation tasks using the random embeddings. We observe that random embeddings provide a more macroscopic view of the distributional discrepancies. In particular, random embeddings provide more sensible evaluation results when the target data distribution is significantly different from the ImageNet statistics (e.g. MNIST and spectrograms).
2 Background
Given a real distribution P and a generative model Q, we assume that we can sample X ~ P and Y ~ Q, respectively. We need an algorithm to assess how likely it is that the two sets of samples arise from the same distribution. When the involved distribution families are tractable and the full density functions can be easily estimated, statistical testing methods or distributional distance measures (e.g. the Kullback-Leibler divergence or the expected likelihood) are viable. However, when the data are complex and high-dimensional (e.g. natural images), it becomes difficult to apply such measures naively (Theis et al., 2016). Because of this difficulty, the evaluation of samples from generative models is still an actively researched topic. In this section, we provide an overview of existing approaches. We describe the most widely used evaluation pipeline for image generative models (§2.1), and then introduce the prior works on fidelity and diversity measures (§2.2). For a more extensive survey, see (Borji, 2019).
2.1 Evaluation pipeline
It is difficult to conduct statistical analyses over complex and high-dimensional data in their raw form (e.g. images). Thus, evaluation metrics for image generative models largely follow three stages: (1) embed the real and fake data into a Euclidean space through a non-linear mapping such as a CNN feature extractor, (2) construct real and fake distributions over the embedding space from the embedded samples, and (3) quantify the discrepancy between the two distributions. We describe each stage in the following paragraphs.
Embeddings. It is often difficult to define a sensible metric over the input space. For example, distance over raw image pixels is misleading because two perceptually identical images may be far apart (e.g. under a one-pixel translation) (Theis et al., 2016). To overcome this difficulty, researchers have introduced ImageNet-pretrained CNN feature extractors as the embedding function in many generative model evaluation metrics (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019), based on the reasoning that distances in the feature space provide sensible proxies for the human perceptual metric (Zhang et al., 2018). Since we always use embedded samples for computing metrics, we write X_i and Y_j for the embedded real and fake samples, respectively.
In this work, we adopt ImageNet-pretrained CNNs for the embedding, but we also criticise their use when the data distribution is too distinct from the ImageNet distribution (§4.2). We suggest randomly initialised CNN feature extractors as a strong alternative in such cases (Ulyanov et al., 2018; Zhang et al., 2018). We show that for MNIST digit images and sound spectrograms, which have large domain gaps from ImageNet, random embeddings provide more sensible evaluation measures.
Building and comparing distributions. Given the embedded samples {X_i} and {Y_j}, many metrics conduct some form of (non-)parametric statistical estimation. Parzen window estimates (Bengio et al., 2013) approximate the likelihoods of the fake samples by estimating the density with Gaussian kernels around the real samples. On the parametric side, the Inception score (IS) (Salimans et al., 2016) estimates the multinomial distribution over the 1000 ImageNet classes for each sample image and compares it against the estimated marginal distribution with the KL divergence. The Fréchet Inception Distance (FID) (Heusel et al., 2017) estimates the mean and covariance of {X_i} and {Y_j}, assuming that they follow multivariate Gaussians. The distance between the two Gaussians is computed by the Fréchet distance (Dowson & Landau, 1982), also known as the Wasserstein-2 distance (Vaserstein, 1969). FID has been reported to generally match human judgements (Heusel et al., 2017); it has been the most popular metric for image generative models in the last couple of years (Borji, 2019).
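The Gaussian-fitting step of FID can be sketched in a few lines of NumPy. This is an illustrative sketch only: actual FID implementations first extract InceptionV3 features, which is omitted here, and the names `frechet_distance` and `_sqrtm_psd` are ours. The trace of the matrix square root is computed via the symmetric form S1^{1/2} S2 S1^{1/2}, which has the same trace as (S1 S2)^{1/2}.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues from round-off
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(real_feats, fake_feats):
    """Squared Frechet (Wasserstein-2) distance between Gaussians fitted to
    two feature sets: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(fake_feats, rowvar=False)
    # Tr((S1 S2)^{1/2}) computed through the symmetric product S1^{1/2} S2 S1^{1/2}.
    s1_half = _sqrtm_psd(s1)
    tr_covmean = np.trace(_sqrtm_psd(s1_half @ s2 @ s1_half))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2) - 2.0 * tr_covmean)
```

Identical feature sets give a distance of (numerically) zero, and the distance grows as the fitted Gaussians move apart.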
2.2 Fidelity and diversity
While single-value metrics like IS and FID have led to interesting advances in the field by ranking generative models, they are not ideal for diagnostic purposes. One of the most important aspects of a generative model is the trade-off between fidelity (how realistic each generated sample is) and diversity (how well the fake samples capture the variations in the real samples). We introduce variants of the two-value metrics (precision and recall) that capture these two characteristics separately.
Precision and recall. (Sajjadi et al., 2018) have reported pathological cases where two generative models attain similar FID scores while their qualitative fidelity and diversity results differ. For a better diagnosis of generative models, (Sajjadi et al., 2018) have thus proposed the precision and recall metrics based on the estimated supports of the real and fake distributions. Precision is defined roughly as the portion of the fake distribution that can be generated by the real distribution; recall is symmetrically defined as the portion of the real distribution that can be generated by the fake distribution. While conceptually useful, these metrics have multiple practical drawbacks: they assume that the embedding space is uniformly dense, rely on the initialisation-sensitive k-means algorithm for support estimation, and produce an infinite number of values as the metric.
Improved precision and recall. (Kynkäänniemi et al., 2019) have proposed the improved precision and recall (P&R) metrics that address the above drawbacks. The probability density functions are estimated via k-nearest neighbour distances, overcoming the uniform-density assumption and the reliance on the k-means algorithm. Our proposed metrics are based on P&R; we explain P&R in full detail here.
P&R first constructs a "manifold" for the real and fake samples separately; this object is nearly identical to a probability density function except that it does not integrate to 1. Precision then measures the expected likelihood of fake samples against the real manifold, and recall measures the expected likelihood of real samples against the fake manifold:
precision := (1/M) Σ_{j=1..M} 1(Y_j ∈ manifold(X_1, …, X_N)),  (1)
recall := (1/N) Σ_{i=1..N} 1(X_i ∈ manifold(Y_1, …, Y_M)),  (2)
where N and M are the numbers of real and fake samples and 1(·) is the indicator function. Manifolds are defined as
manifold(X_1, …, X_N) := ∪_{i=1..N} B(X_i, NND_k(X_i)),  (3)
where B(x, r) is the sphere around x with radius r, and NND_k(X_i) denotes the distance from X_i to its k-th nearest neighbour among {X_1, …, X_N}, excluding itself. An example computation of P&R is shown in Figure 3.
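Equations 1-3 translate directly into a brute-force computation. The sketch below is our own illustration, not the reference implementation (which runs on VGG features with efficient nearest-neighbour search), and the function names are ours:

```python
import numpy as np

def knn_radii(samples, k):
    """NND_k in Equation 3: distance from each sample to its k-th nearest
    neighbour within the set, excluding the sample itself."""
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    return np.sort(d, axis=1)[:, k - 1]  # k-th smallest per row

def precision_recall(real, fake, k=3):
    """Improved precision and recall (Equations 1-2), brute force O(N*M)."""
    r_real, r_fake = knn_radii(real, k), knn_radii(fake, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)  # (M, N)
    precision = (d <= r_real[None, :]).any(axis=1).mean()  # fake in real manifold
    recall = (d <= r_fake[:, None]).any(axis=0).mean()     # real in fake manifold
    return float(precision), float(recall)
```

When the real and fake sets coincide, both values are 1; when the sets are far apart, both drop to 0.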
There are similarities between the precision above and the Parzen window estimate (Bengio et al., 2013). If the manifolds are formed by superposition (summation) instead of the union in Equation 3, and the spheres are replaced with Gaussians of fixed variance, then the manifold estimation coincides with kernel density estimation. In this case, Equation 1 computes the expected likelihood of the fake samples.
3 Density and Coverage
We propose novel performance measures density and coverage (D&C) as practically usable measures that successfully remedy the problems with precision and recall.
3.1 Problems with improved precision and recall
The practicality of the improved precision and recall (P&R) is still compromised by their vulnerability to outliers and computational inefficiency. Building the nearest neighbour manifolds (Equation 3) must be done carefully because the spheres around each sample are not normalised according to their radii or the relative density of samples in the neighbourhood. Consequently, the nearest neighbour manifolds generally overestimate the true manifold around outliers, leading to undesired effects in practice. We explain the drawbacks at a conceptual level here; they are quantified in §4.
Precision. We first show a pathological case for precision in Figure 2(a). Because of the real outlier sample, the real manifold is overestimated. Generating many fake samples around the real outlier is then enough to inflate the precision measure.
Recall. The nearest neighbour manifold is built on the fake samples in Equation 2. Since models often generate many unrealistic yet diverse samples, the fake manifold is often an overestimation of the true fake distribution. A pathological example is shown in Figure 2(b). While the fake samples are generally far from the modes of the real samples, the recall measure is rewarded because the real samples are contained in the overestimated fake manifold. Another problem with relying on fake manifolds for the recall computation is that the manifold must be computed per model. For example, to generate a recall-versus-iteration curve for training diagnosis, the k-nearest neighbours must be recomputed over all fake samples for every model snapshot.
3.2 Density and coverage
We remedy the issues with P&R above with simple fixes in each metric.
Density. Density improves upon the precision metric by fixing the overestimation of the manifold around real outliers. Precision counts the binary decision of whether each fake sample is contained in any neighbourhood sphere of the real samples (Equation 1). Density, instead, counts how many real-sample neighbourhood spheres contain each fake sample (Equation 4). The manifold is now formed by the superposition of the neighbourhood spheres B(X_i, NND_k(X_i)), and a form of expected likelihood of the fake samples is measured. In this sense, the density measure sits midway between the precision metric and the Parzen window estimate. Density is defined as
density := (1/(kM)) Σ_{j=1..M} Σ_{i=1..N} 1(Y_j ∈ B(X_i, NND_k(X_i))),  (4)
where the factor 1/k normalises the count: when the real and fake distributions are identical, each fake sample is expected to fall inside k neighbourhood spheres. Through this modification, density rewards samples in regions where real samples are densely packed, relaxing the vulnerability to outliers. For example, the overestimation of precision (100%) in Figure 2(a) is resolved by the density estimate (only 60%). Note that unlike precision, density is not upper bounded by 1; it may exceed 1 if the fake samples are concentrated around the dense regions of the real samples.
Coverage. Diversity, intuitively, should be measured by the fraction of real samples that are covered by the fake samples. Coverage improves upon the recall metric to better quantify this criterion by building the nearest neighbour manifolds around the real samples, instead of the fake samples, as the real samples contain fewer outliers. Moreover, the manifold needs to be computed only once per dataset, instead of per model, reducing the heavy nearest neighbour computations in recall. Coverage is defined as
coverage := (1/N) Σ_{i=1..N} 1(∃ j such that Y_j ∈ B(X_i, NND_k(X_i))).  (5)
It measures the fraction of real samples whose neighbourhoods contain at least one fake sample. Coverage is bounded between 0 and 1.
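Equations 4 and 5 share the same ingredients, the real-sample k-NN radii and the fake-to-real distance matrix, so they can be computed together. A minimal brute-force NumPy sketch (our own naming, not the reference implementation):

```python
import numpy as np

def density_coverage(real, fake, k=5):
    """Density (Equation 4) and coverage (Equation 5), brute force.
    Only the real-sample radii are needed, so they can be cached per dataset."""
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(d_rr, np.inf)
    radii = np.sort(d_rr, axis=1)[:, k - 1]                           # NND_k per real sample
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)  # (M, N)
    inside = d <= radii[None, :]              # fake j inside the sphere around real i
    density = inside.sum() / (k * len(fake))  # may exceed 1 by design
    coverage = inside.any(axis=0).mean()      # real spheres holding >= 1 fake sample
    return float(density), float(coverage)
```

For two independent samples from the same distribution, density should hover near 1 and coverage near its expected value derived in §3.3; for disjoint supports both are 0.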
3.3 Analytic behaviour of density & coverage
The simplest sanity check for an evaluation metric is whether it attains the best value when the intended criteria are met. For generative models, we examine whether D&C attain 100% performance when the real and fake distributions are identical. Here, we show that, unlike P&R, D&C yield analytic expressions for the expected values under identical real and fake distributions. This analysis leads to further insights into the metrics and a systematic algorithm for selecting the hyperparameters (the number of nearest neighbours k, the number of real samples N, and the number of fake samples M).
Lack of analytic results for P&R. From Equation 1, the expected precision for identical real and fake distributions is
E[precision] = E[(1/M) Σ_{j=1..M} 1(Y_j ∈ ∪_{i=1..N} B(X_i, NND_k(X_i)))]  (6)
= P(Y_1 ∈ ∪_{i=1..N} B(X_i, NND_k(X_i)))  (7)
= P(∪_{i=1..N} E_i),  (8)
where E_i is the event {Y_1 ∈ B(X_i, NND_k(X_i))}. Since the events E_i are not independent and have complex dependence structures, a simple expression for Equation 8 does not exist. The same observation holds for the expected recall.
[Figure 4: precision vs. density (top) and recall vs. coverage (bottom) for identical real and fake distributions, estimated on Gaussian data, FFHQ embeddings, and analytically.]
Analytic derivations for D&C. We derive expressions for the expected values of D&C under identical real and fake distributions.
Lemma 1. If the real and fake distributions are identical, then E[density] = 1.
Proof. The expected density boils down to
E[density] = (1/(kM)) Σ_{j=1..M} Σ_{i=1..N} P(Y_j ∈ B(X_i, NND_k(X_i))) = (N/k) P(A),
where A is the event that d(X_1, Y_1) is at most the k-th smallest among the random variables {d(X_1, Y_1), d(X_1, X_2), …, d(X_1, X_N)}. Since these N random variables are identically and independently distributed, any particular ranking of them is equally likely, with probability 1/N!. Since the number of rankings with a certain random variable at a particular rank is (N−1)!, we have P(A) = k/N, and hence E[density] = (N/k)(k/N) = 1. ∎
Lemma 2. If the real and fake distributions are identical, then
E[coverage] = 1 − C(N−1, k) / C(N+M−1, k),  (9)
where C(n, k) denotes the binomial coefficient. Moreover, as N = M → ∞, E[coverage] → 1 − 2^{−k}.
Proof. The expected coverage is
E[coverage] = P(∃ j such that Y_j ∈ B(X_1, NND_k(X_1))),
using the fact that the events {∃ j such that Y_j ∈ B(X_i, NND_k(X_i))} are symmetrical with respect to i. The complementary event is that no fake sample lies closer to X_1 than its k-th nearest real neighbour.
We write d_i for d(X_1, X_{i+1}) with i = 1, …, N−1, and d'_j for d(X_1, Y_j) with j = 1, …, M. Then, conditioned on X_1, the set of N−1+M random variables {d_i} ∪ {d'_j} is independent and identically distributed. The probability of the complementary event can be equivalently described as follows:
Assume there are N−1+M non-negative real numbers distributed according to a common distribution. Colour M of them red uniformly at random and colour the rest blue. What is the chance that the k smallest among them are all coloured blue?
Since any assignment of red and blue colours is equally likely, we compute the probability by counting the ratio of colour assignments in which the k smallest elements are coloured blue:
P(k smallest are all blue) = C(N−1+M−k, M) / C(N−1+M, M) = C(N−1, k) / C(N+M−1, k). ∎
Note that the expected values do not depend upon the distribution type or the dimensionality of the data.
3.4 Hyperparameter selection
Given the analytic expressions for the expected D&C under identical real and fake distributions, we can systematically choose k, N, and M. The aim of our hyperparameter selection is to ensure that the expected density and coverage are close to 1 when the distributions are identical.
Since E[coverage] does not depend on the exact distribution type or the dimensionality of the data (Equation 9), hyperparameters chosen in this way remain effective across data types. We verify the consistency of D&C across data types in Figure 4, which plots the P&R and D&C values for identical real and fake distributions from (1) 64-dimensional multivariate standard Gaussians, (2) 4096-dimensional ImageNet-pretrained VGG embeddings of the FFHQ face images, and (3) the analytic estimates of §3.3. While P&R exhibit significantly different estimated values across the Gaussians and the FFHQ embeddings, the D&C metrics agree across all three types of estimates, confirming the independence of the data type. This conceptual advantage allows a confident choice of evaluation hyperparameters that are effective across data types.
In practice, we choose the hyperparameters so that the expected coverage under identical distributions is close to 1. For the sake of symmetry, we first set N = M. We then set N and M large enough that the approximation E[coverage] ≈ 1 − 2^{−k} holds well, while keeping the computational cost tractable. A moderate k (e.g. k = 5, giving 1 − 2^{−5} ≈ 0.97) is then sufficient. The exact principle behind the choices of these values for P&R is unknown.
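The selection rule above can be made concrete with Equation 9: pick the smallest k whose expected coverage under identical distributions clears a target value (0.95 here, an assumed threshold for illustration; the function names are ours):

```python
from math import comb

def expected_coverage(n, m, k):
    """E[coverage] under identical real/fake distributions (Equation 9)."""
    return 1.0 - comb(n - 1, k) / comb(n + m - 1, k)

def smallest_sufficient_k(n, m, target=0.95):
    """Smallest neighbourhood size k whose expected coverage reaches the target."""
    k = 1
    while expected_coverage(n, m, k) < target:
        k += 1
    return k

# With N = M, Equation 9 approaches 1 - 2^{-k}; k = 5 already clears 0.95.
print(smallest_sufficient_k(10000, 10000))  # -> 5
```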
4 Experiments
We empirically assess the proposed density and coverage (D&C) metrics and compare them against the improved precision and recall (P&R) (Kynkäänniemi et al., 2019). Evaluating evaluation metrics is difficult, as the ground truth is often not provided. We carefully select sanity checks with toy and real-world data where the desired behaviours of the evaluation metrics are clearly defined. At the end of the section, we study the embedding pipeline (§2.1) and advocate the use of randomly initialised embedding networks under certain scenarios.
4.1 Empirical results on density and coverage
We build several toy and real-world data scenarios to examine the behaviour of the four evaluation metrics: P&R and D&C. We first show results on toy data where the desired behaviours of the fidelity and diversity metrics are well-defined (§4.1.1). We then move on to diverse real-world data cases to show that the observations extend to complex data distributions (§4.1.2).
4.1.1 Sanity checks with toy data
We assume Gaussian or mixture-of-Gaussian distributions for the real and fake data. We simulate largely two scenarios: (1) the fake distribution moves away from the real one (Figure 6) and (2) the fake distribution gradually fails to cover the modes of the real one (Figure 6). We discuss each result in detail below.
Translating the fake distribution. We set the real distribution to N(0, I) and the fake distribution to N(μ1, I), where 1 is the vector of ones and I is the identity matrix. We study how the metrics change as μ varies. In Figure 6, without any outlier, this setting leads to decreasing values of P&R and D&C as the fake distribution moves away from the real one, the desired behaviour for all metrics. However, P&R show a pathological behaviour when the distributions match (μ = 0): their values are far below 1 (0.68 precision and 0.67 recall). D&C, on the other hand, achieve values close to 1 (1.06 density and 0.97 coverage). D&C detect the distributional match better than P&R.
Translating with real or fake outliers. We repeat the previous experiment with exactly one outlier sample added to either the real or the fake set (Figure 6). Robust metrics must not be affected by a single outlier sample. However, P&R are vulnerable to it: as μ varies, the manifold overestimated on the region enclosed by the inliers and the outlier (§3.1) inflates precision (with a real outlier) or recall (with a fake outlier). This susceptibility is a serious issue in practice because outliers are common in realistic data and generated outputs. The D&C measures for the outlier cases, on the other hand, largely coincide with the no-outlier case. D&C are sturdy measures.
Mode dropping. We assume that the real distribution is a mixture of Gaussians with ten modes. We simulate the fake distribution as initially identical to the real one, and gradually drop all but one mode. See Figure 6 for an illustration. We consider two ways in which the modes are dropped: (1) each mode is dropped sequentially, and (2) the weights on all but one mode are decreased simultaneously. Under sequential mode dropping, both recall and coverage gradually drop, capturing the decrease in diversity. Under simultaneous dropping, however, recall cannot capture the decrease in diversity until the concentration on the remaining mode reaches 90%, and then shows a sudden drop when the fake distribution becomes unimodal. Coverage, on the other hand, decreases gradually even under simultaneous dropping. It reliably captures the decrease in diversity in this case.
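The simultaneous-dropping experiment can be reproduced in miniature. The sketch below (our own setup: ten well-separated 1-D Gaussian modes, 1000 fake samples, fixed seed) shifts the mixture weight onto a single mode and reports coverage, which should fall gradually as the weight concentrates:

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(real, fake, k=5):
    """Coverage (Equation 5), brute force."""
    d_rr = np.linalg.norm(real[:, None] - real[None, :], axis=-1)
    np.fill_diagonal(d_rr, np.inf)
    radii = np.sort(d_rr, axis=1)[:, k - 1]
    d_fr = np.linalg.norm(fake[:, None] - real[None, :], axis=-1)
    return float((d_fr <= radii[None, :]).any(axis=0).mean())

centers = np.arange(10)[:, None] * 10.0   # ten well-separated 1-D modes (assumed layout)
real = np.concatenate([c + rng.normal(scale=0.5, size=(100, 1)) for c in centers])

results = {}
for w0 in (0.1, 0.5, 0.9):                # weight concentrated on mode 0
    weights = np.full(10, (1.0 - w0) / 9.0)
    weights[0] = w0
    counts = rng.multinomial(1000, weights)
    fake = np.concatenate([c + rng.normal(scale=0.5, size=(n, 1))
                           for c, n in zip(centers, counts)])
    results[w0] = coverage(real, fake)
    print(w0, round(results[w0], 3))
```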
4.1.2 Sanity checks with realworld data
Having verified the metrics on toy Gaussians, we assess the metrics on realworld images. As in the toy experiments, we focus on the behaviour of metrics under corner cases including outliers and mode dropping. We further examine the behaviour with respect to the latent truncation threshold (Karras et al., 2019). As the embedding network, we use the fc2 layer features of the ImageNet pretrained VGG16 (Kynkäänniemi et al., 2019).
[Figure 7: inlier and worst-outlier examples on CelebA and LSUN-Bedroom.]
Outliers. Before studying the impact of outliers on the evaluation metrics, we introduce our criterion for outlier detection. Motivated by (Kynkäänniemi et al., 2019), we use the distance to the nearest neighbour among the fake samples. According to this criterion, we split the fake samples into inliers and outliers. We experiment with fake images generated by StyleGAN on CelebA (Liu et al., 2015) and LSUN-Bedroom (Yu et al., 2015). Example images of inliers and outliers are shown in Figure 7. We observe that the outliers contain more distortions and atypical semantics.
We then examine the behaviour of recall and coverage as the outliers are gradually added to the pool of fake samples. In Figure 7, we plot recall and coverage relative to their values when only the inliers are present. As outliers are added, recall increases by more than 11% and 15% on CelebA and LSUN-Bedroom, respectively, demonstrating its vulnerability to outliers. Coverage, on the other hand, is stable: less than a 2% increase with the extra outliers.
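The inlier/outlier split described above can be sketched as follows; the split fraction and the use of the k-th (rather than first) nearest neighbour distance are our assumptions for illustration:

```python
import numpy as np

def split_by_knn_distance(samples, k=3, outlier_fraction=0.05):
    """Split samples into inliers and outliers by the k-th nearest neighbour
    distance within the set: the most isolated samples are flagged as outliers."""
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-distances
    nnd = np.sort(d, axis=1)[:, k - 1]     # k-th nearest neighbour distance
    order = np.argsort(nnd)                # ascending: densest regions first
    n_out = max(1, int(len(samples) * outlier_fraction))
    return samples[order[:-n_out]], samples[order[-n_out:]]
```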
Mode dropping. As in the toy experiments, we study mode dropping on the MNIST digit images. We treat the ten classes as modes and simulate the scenario where a generative model gradually favours a particular mode (class "0") over the others. We use the real MNIST images, with a decreasing number of classes covered, as our fake samples (sequential dropping). The results are shown in Figure 10. While recall is unable to detect the decrease in overall diversity until the ratio of class "0" exceeds 90%, coverage exhibits a gradual drop. Coverage is superior to recall at detecting mode dropping.
Resolving fidelity and diversity. The main motivation behind two-value metrics like P&R is the diagnosis of the fidelity and diversity of generated images. We validate whether D&C achieve this. We perform the truncation analysis on StyleGAN (Karras et al., 2019) trained on FFHQ. The truncation technique artificially manipulates the learned distribution of a generative model by thresholding the latent noise (Brock et al., 2018). In general, as the truncation threshold increases, data fidelity decreases and diversity increases, and vice versa. This stipulates the desired behaviours of the P&R and D&C metrics.
The results are shown in Figure 11. With an increasing truncation threshold, precision and density decrease, while recall and coverage increase. Note that density varies more than precision as the threshold increases, enabling a finer-grained diagnosis of fidelity.
4.2 Random Embeddings
Image embedding is an important component in generative model evaluation (§2.1). Yet, this ingredient has been relatively little studied in the existing literature. In this section, we explore the limitations of the widely used ImageNet embeddings when the target data are significantly distinct from the ImageNet samples. In such cases, we propose random embeddings as viable alternatives. In the following experiments, T4096 denotes the fc2 layer features of a VGG16 network pretrained on ImageNet, and R64 denotes the fc2 layer features of a randomly initialised VGG16 network, where fc2 is a 64-dimensional linear layer replacing the 4096-dimensional fc2 layer. We experiment on MNIST and a sound dataset, Speech Commands Zero Through Nine (SC09).
[Table 1: generated samples (DCGAN on MNIST, WaveGAN spectrograms on SC09) and the corresponding metrics under the T4096 and R64 embeddings.]

            MNIST              Sound (SC09)
Metric      T4096    R64       T4096    R64
Precision   0.160    0.715     0.644    0.807
Recall      0.152    0.657     0.029    0.572
Density     0.047    0.491     0.292    0.662
Coverage    0.108    0.714     0.020    0.653
4.2.1 Metrics with R64 on natural images
We study the behaviour of the metrics for different truncation thresholds on StyleGAN-generated FFHQ images (Figure 11). We confirm that the metrics on R64 closely follow the trends under T4096. P&R and D&C over R64 can thus be used on ImageNet-like images to capture the fidelity and diversity aspects of model outputs.
4.2.2 Metrics with R64 beyond natural images
We consider the scenario where the target distribution is significantly different from the ImageNet statistics. We use DCGAN-generated images (Radford et al., 2016) on MNIST (LeCun et al., 1998) and WaveGAN-generated spectrograms (Donahue et al., 2019) on the Speech Commands Zero Through Nine (SC09) dataset. Qualitative examples and the corresponding metrics are reported in Table 1.
Compared to the high quality of the generated MNIST samples, the metrics on T4096 are suspiciously low (e.g. 0.047 density), while the metrics on R64 report reasonably high scores (e.g. 0.491 density). For the sound data, likewise, the metrics on T4096 do not faithfully represent the general fidelity and diversity of the sound samples. The real and fake sound samples are provided at http://bit.ly/38DIMAA and http://bit.ly/2HAm8NB. The samples consist of human voices speaking the digit words "zero" to "nine". The fake samples indeed lack fidelity, but they do cover diverse digit classes. Thus, the recall (0.029) and coverage (0.020) values under T4096 are severe underestimations of the actual diversity. Under R64, the recall and coverage values are in a more sensible range: 0.572 and 0.653, respectively. When the target data domain differs significantly from the embedding training domain, R64 may be a more reasonable choice.
5 Conclusion and discussion
We have systematically studied the existing metrics for evaluating generative models, with a particular focus on the fidelity and diversity aspects. While the recent work on improved precision and recall (Kynkäänniemi et al., 2019) provides good estimates of these aspects, we discovered failure cases where the metrics are not yet practical: overestimating the manifolds, underestimating the scores when the real and fake distributions are identical, lacking robustness to outliers, and failing to detect certain types of mode dropping. To remedy these issues, we have proposed the novel density and coverage metrics. Density and coverage have the additional conceptual advantage of allowing a systematic selection of the involved hyperparameters. We suggest that future researchers use density and coverage for a more stable and reliable diagnosis of their models. On top of this, we have analysed the less-studied component of the evaluation pipeline: the embedding. Prior metrics have mostly relied on ImageNet-pretrained embeddings. We argue through empirical studies that random embeddings are a better choice when the target distribution is significantly different from the ImageNet statistics.
Acknowledgements
We thank the Clova AI Research team for their support, in particular Sanghyuk Chun and Jungwoo Ha for great discussions and paper reviews. The Naver Smart Machine Learning (NSML) platform (Kim et al., 2018) has been used in the experiments. Kay Choi advised on the design of the figures.

References
 Bengio et al. (2013) Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 Borji (2019) Borji, A. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 2019.
 Brock et al. (2018) Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.

 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 Donahue et al. (2019) Donahue, C., McAuley, J., and Puckette, M. Adversarial audio synthesis. In International Conference on Learning Representations, 2019.

 Dowson & Landau (1982) Dowson, D. and Landau, B. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 1982.
 Geirhos et al. (2019) Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
 Karras et al. (2019) Karras, T., Laine, S., and Aila, T. A stylebased generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 Kim et al. (2018) Kim, H., Kim, M., Seo, D., Kim, J., Park, H., Park, S., Jo, H., Kim, K., Yang, Y., Kim, Y., et al. NSML: Meet the MLaaS platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.
 Kynkäänniemi et al. (2019) Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pp. 3929–3938, 2019.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
 Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations, 2016.
 Sajjadi et al. (2018) Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237, 2018.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Simon et al. (2019) Simon, L., Webster, R., and Rabin, J. Revisiting precision recall definition for generative modeling. In International Conference on Machine Learning, pp. 5799–5808, 2019.
 Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations, pp. 1–10, 2016.
 Torralba & Efros (2011) Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
 Vaserstein (1969) Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 1969.
 Yu et al. (2015) Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

 Zhang et al. (2018) Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.