Evaluation Metrics for Conditional Image Generation

04/26/2020 · Yaniv Benny et al. · Tel Aviv University

We present two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID). A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts. The link takes the form of a product in the case of IS or an upper bound in the FID case. We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models, thus providing additional insights about their performance, from unlearned classes to mode collapse.


1 Introduction

Unconditional image generation models have seen rapid improvement, both in terms of generation quality and diversity. These generative models are successful if the generated images are indistinguishable from real images sampled from the training distribution. This property can be evaluated in many different ways, the most popular being the Inception Score (IS), which considers the output of a pretrained classifier, and the Fréchet Inception Distance (FID), which measures the distance between the distributions of extracted features of the real and the generated data.

While unconditional generative models take as input a random vector, conditional generation allows one to control the class or other properties of the synthesized image. In this work, we consider class-conditioned models, introduced in mirza2014conditional , where the user specifies the desired class of the generated image. Employing unconditional metrics, such as IS and FID, in order to evaluate conditional image generation fails to take into account whether the generated images satisfy the required condition. On the other hand, classification metrics, such as accuracy and precision, currently used to evaluate conditional generation, have no regard for image quality and diversity.

One may opt to combine the unconditional generation and the classification metrics to produce a valuable measurement for conditional generation. However, this suffers from a few problems. First, the two components are of different scales and the trade-off between them is unclear. Second, they do not capture changes in variance within the distribution of each class. To illustrate this, consider Fig. 1, which depicts two different distributions. Distribution 'A' has a zero mean and a standard deviation of 1.0 for the first class and 3.0 for the second class, independently in each axis. Distribution 'B' has the same mean and a standard deviation of 1.5 for the first class and 2.5 for the second class in each axis. The FID and the classification error (classifier shown in green) are zero, despite the in-class distributions being different.

In order to provide valuable metrics to evaluate and compare conditional models, we present two metrics, called Conditional Inception Score (CIS) and Conditional Fréchet Inception Distance (CFID). The metrics contain two components each: (i) the within-class component (WCIS/WCFID) measures the quality and diversity for each of the conditional classes in the generated data. In other words, it measures the ability to replicate the distribution of each class in the true samples; (ii) the between-class component (BCIS/BCFID) measures how close the representation of classes in the generated distribution is to the representation in the real data distribution. See Fig. 2 for an illustration.

In contrast to the combination of FID and a classifier, our WCFID and BCFID components are both larger than zero for the example in Fig. 1, successfully capturing the differences between the distributions.

Our analysis shows direct links between the novel conditional metrics and their unconditional counterparts. The (unconditional) Inception Score can be decomposed into a product of BCIS and WCIS. We further show that, due to the bounded range of the metrics, this translates to a trade-off between BCIS and WCIS, and that each of them forms a tight lower bound on the IS. In the analysis of the FID score, we show that the sum of WCFID and BCFID forms a tight upper bound on the FID.

After analyzing the metrics, we performed various experiments to ground the theoretical claims and to highlight the role of the new metrics in evaluating conditional generation models. First, a set of simulations was conducted, in which we performed label noising, image noising, and simulated mode collapse. Under all conditions, our methods came out as the most sensitive to the applied augmentations. We then evaluated several pretrained models of popular architectures on various datasets and training schemes using the proposed scores, gaining significant insights revealed by our metrics. Our metrics were found to be a decisive factor in determining the generation performance on each dataset.

Figure 1: A case against measuring success by relying on FID combined with a classification score. Shown are two distributions, each with two classes, in a two-dimensional feature space. The green circle acts as the classifier. For both distributions, the overall mean and variance are equal, therefore, the FID is zero. The classification error is also zero, and the two distributions are, therefore, indistinguishable by this score as well. However, within the classes we see a shift in variance in the second distribution compared to the first. Our proposed WCFID and BCFID metrics both show values above zero and, therefore, detect the difference between the distributions.
Figure 2: The differences between unconditional, within-class and between-class evaluations for two given distributions with labelled samples. (a) Sample distributions. (b) The unconditional evaluation disregards the labels and compares the distance between the distributions. (c) The within-class evaluation compares each class in the first distribution with the corresponding class in the second (shown for one class). (d) The between-class evaluation compares the distribution of class averages.

1.1 Related Work

Generative Models Generative models, and in particular Generative Adversarial Networks GAN , aim to generate realistic-looking images from a target distribution, while capturing the diversity of images. Advances in the loss and architecture have allowed for improved quality and diversity of generation. For instance, pmlr-v70-arjovsky17a ; 10.5555/3295222.3295327 attempt to minimize the Wasserstein distance between the generated and real distributions, which improved variability in generation, in particular by reducing mode collapse. On the architectural side, Progressive GANs karras2018progressive , StyleGAN karras2019style and StyleGANv2 Karras2019stylegan2 introduced advanced architectures and training methods, allowing for further improvements.

Conditional Generation In conditional generation, control over the generation is provided, e.g., by class-conditioning mirza2014conditional ; chen2016infogan ; singh2019finegan , a given text zhang2017stackgan ; xu2018attngan , requiring specific semantic features johnson2018image , or finding analogs to images from another distribution isola2017image ; zhu2017unpaired . The recent state of the art in class-conditional generation, the BigGAN method brock2018large , can learn conditional representations on ImageNet russakovsky2015imagenet with high quality and diversity.

To train these models, several changes have been proposed to the unconditional method. CGAN mirza2014conditional injects the conditional component into the discriminator along with the image, ACGAN odena2017conditional adds an auxiliary classifier tasked with accurately predicting the conditioned label, and SGAN odena2016semi modifies the discriminator output to detect real classes while treating fake images as an additional class. A special unsupervised setting was proposed by InfoGAN chen2016infogan , where the condition is unlabelled in the real data and the model constructs a disentangled representation by maximizing the mutual information between the conditioned variable and the observation.

Evaluation Metrics To evaluate different models in terms of high-quality generation and diversity, evaluation metrics were proposed for the unconditional generation setting. The Inception Score (IS) salimans2016improved uses the predictions of a pretrained classifier, InceptionV3 Szegedy2015RethinkingTI , to assess: 1. Quality: whether the conditional probability over the labels for a generated sample is highly predictable (low entropy), and 2. Diversity: whether the marginal probability of labels over all generated samples is highly diverse (high entropy). The Fréchet Inception Distance (FID) heusel2017gans was proposed as an alternative to the IS, considering the distributions of features of real data and generated data. FID models these distributions as multivariate Gaussian distributions and measures the distance between them. FID was shown to be sensitive to mode collapse and more robust to noise than IS. Additional metrics, such as the Perceptual Path Length (PPL) karras2019style and the Kernel Inception Distance (KID) binkowski2018demystifying , were also introduced. Still, the IS and FID are the most widely accepted metrics for image generation.

Nevertheless, these measures are designed for the unconditional setting, and in the class-conditional setting they do not assess the level at which the categorical condition manifests itself in the generated data. In this work, we extend the IS and FID to the class-conditional setting, show their relation to their unconditional counterparts, and demonstrate the usefulness of these metrics in the conditional setting.

Some recent attempts have been made to assess conditional generation with modified versions of the IS and FID. A Conditional Inception Score for the image-to-image translation task (loosely similar to our BCIS component) was proposed by Huang et al. huang2018multimodal , and Miyato et al. miyato2018cgans were, to our knowledge, the first to measure the intra-FID (equal to our WCFID component). Both of these contributions lacked thorough experiments and comparisons with other methods and presented no theoretical analysis. Nevertheless, these previous attempts to establish conditional metrics confirm the need for conditional evaluation scores and point to the possible ingredients of such solutions. Our work contributes to both approaches by providing justification and explanation, and also by improving upon them to build more unified metrics.

2 Problem Setup

We consider a distribution of real samples:

$$p(x, y) = p(y)\, p(x \mid y) \quad (1)$$

where $p(y)$ is a distribution over a set of classes $\mathcal{Y} = \{1, \dots, K\}$ and $p(x \mid y)$ is the conditional distribution of a sample $x$ taken from the class $y$. The algorithm is provided with a dataset of i.i.d. labelled examples $\{(x_i, y_i)\}_{i=1}^{n}$ that were sampled from the generative process in Eq. 1. In addition, the distribution of classes and its corresponding probability density function $p(y)$ are known or assumed to be uniform. The distribution of real samples marginalized over $y$ is denoted by $p$, with the corresponding probability density function $p(x)$.

In conditional generation, the algorithm learns a generative model $G$ that tries to generate samples that are similar to the real samples in $p$. The generative model takes a random seed $z$ and a class $c$ as inputs and returns a generated sample $\tilde{x} = G(z, c)$. Here, $z \sim p(z)$ is drawn from a pre-defined distribution over a latent space $\mathbb{R}^{n_z}$ of dimension $n_z \ll n_x$, where $n_x$ is the dimensionality of the samples. Typically, $p(z)$ is the standard normal distribution. We denote by $q(x, c)$ the distribution of generated samples.

In conditional generation, we are interested in two aspects of the generation. Images generated for the same conditioned class should all belong to a single real class in $p$, and the samples generated for each conditioned class should cover the full range of that class.

The category discovery setting is a special case of conditional generation, proposed in chen2016infogan , where the algorithm is provided with a set of unlabelled samples $\{x_i\}_{i=1}^{n}$. The algorithm is still aware of the existence of a partition of the data into classes that are distributed according to $p(y)$. The goal of the algorithm is to generate samples that are similar to the real samples and also to have them properly clustered into $K$ clusters.

2.1 Inception Score

The Inception Score (IS) is a method for measuring the realism of a generative model's outputs. For a given generative model $G$, a latent vector $z$ and a random class $c$, we apply a pretrained classifier on the generated image $\tilde{x} = G(z, c)$ to obtain a distribution over the labels, which is denoted by $p(\hat{y} \mid \tilde{x})$. We denote the corresponding random variables by $Z$, $C$, $\tilde{X}$ and $\hat{Y}$, and the marginal probability density function of $\hat{Y}$ by $p(\hat{y})$. Images that contain meaningful objects should have a conditional label distribution $p(\hat{y} \mid \tilde{x})$ of low entropy. Furthermore, we expect the model to generate varied images, so the marginal distribution $p(\hat{y})$ should have a high entropy. The Inception Score is computed as:

$$\mathrm{IS} = \exp\left( \mathbb{E}_{\tilde{x}}\left[ D_{KL}\big( p(\hat{y} \mid \tilde{x}) \,\|\, p(\hat{y}) \big) \right] \right) \quad (2)$$

where $D_{KL}$ is the KL-divergence between two probability density functions. A high score indicates both a high variety in the data and that the images are meaningful.
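To make the computation concrete, the following is a minimal sketch of Eq. 2 in Python; the `probs` matrix stands in for the softmax outputs of a pretrained classifier on generated images, and is randomly generated here purely for illustration.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Eq. 2: probs is an (N, K) matrix of classifier outputs p(y|x)."""
    marginal = probs.mean(axis=0)  # empirical p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))  # exp(E_x[ KL(p(y|x) || p(y)) ])

# Illustrative call on random "predictions" over K = 10 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.full(10, 0.1), size=1000)
print(inception_score(probs))  # lies in [1, 10] for a 10-class domain
```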

The Inception Score can also be formulated using the mutual information between the generated samples and the class labels:

$$\mathrm{IS} = \exp\left( I(\hat{Y}; \tilde{X}) \right) \quad (3)$$

where $I(\hat{Y}; \tilde{X})$ is the mutual information between $\hat{Y}$ and $\tilde{X}$. As can be seen, by maximizing the IS, one maximizes the mutual information between $\hat{Y}$ and $\tilde{X}$. However, this equation indicates that the IS is not sufficient for evaluating generative models in the conditional generation setting, since the score does not take the conditioned class into account.

Due to the properties of the mutual information, it can be seen that for a domain with $K$ classes, the score lies within the range $[1, K]$.

2.2 Fréchet Inception Distance

The Fréchet distance between two distributions $\mu$ and $\nu$ is defined by:

$$d^2(\mu, \nu) = \min_{X, Y} \mathbb{E}\left[ \|X - Y\|^2 \right] \quad (4)$$

where the minimization is taken over all random variables $X$ and $Y$ having marginal distributions $\mu$ and $\nu$, respectively. In general, the Fréchet distance is intractable, due to its minimization over the set of arbitrary random variables. Fortunately, as shown by RePEc:eee:jmvana:v:12:y:1982:i:3:p:450-455 , for the special case of multivariate normal distributions $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$, the distance takes the form:

$$d^2(\mu_1, \mu_2) = \|m_1 - m_2\|^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2} \right) \quad (5)$$

where $m_i$ and $\Sigma_i$ are the mean and covariance matrix of $\mu_i$. The first term measures the distance between the centers of the two distributions. The second term:

$$d^2_{\mathrm{cov}}(\Sigma_1, \Sigma_2) = \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2} \right) \quad (6)$$

defines a metric on the space of all covariance matrices of order $d$.

For two given distributions $p$ of real samples and $q$ of the generated data, the FID score heusel2017gans computes the Fréchet distance between the real data distribution and generated data distribution using a given feature extractor $f$, under the assumption that the extracted features follow a multivariate normal distribution:

$$\mathrm{FID}(p, q) = \|\mu^r - \mu^g\|^2 + d^2_{\mathrm{cov}}\left( \Sigma^r, \Sigma^g \right) \quad (7)$$

where $\mu^r, \mu^g$ and $\Sigma^r, \Sigma^g$ are the centers and covariance matrices of the feature distributions of $p$ and $q$, respectively. For evaluation, the mean vectors and covariance matrices are approximated through sampling from the distribution.
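The following sketch computes Eq. 7 from two feature matrices; it assumes features from a pretrained extractor are already available, and the helper name `fid` is ours (later sketches in this document reuse it).

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Eq. 7: feats_* are (N, d) feature matrices; a Gaussian is fit to each."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```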

3 Method

In this section, we introduce the class-conditioned extensions of the Inception Score and FID.

3.1 Conditional Inception Score

The conditional analysis of the Inception Score addresses both aspects of conditional generation: the need to create realistic and diverse images, and the need to have each generated image match its condition. We define two scores: the between-class (BCIS) and the within-class (WCIS).

BCIS evaluates the IS on the class averages. It is a measurement of the mutual information between the conditioned classes and the real classes. The prediction probabilities for all the samples in each conditioned class are averaged to produce the average prediction probability of the entire class; the IS is then computed on these averages.

The BCIS is defined in the following manner:

$$\mathrm{BCIS} = \exp\left( \mathbb{E}_{c}\left[ D_{KL}\big( \bar{p}_c \,\|\, \mathbb{E}_{c'}[\bar{p}_{c'}] \big) \right] \right) \quad (8)$$

where,

$$\bar{p}_c(\hat{y}) = \mathbb{E}_{z \sim p(z)}\left[ p\big(\hat{y} \mid G(z, c)\big) \right] \quad (9)$$

is the average prediction probability of the conditioned class $c$.

WCIS evaluates the IS within each category. It is a measurement of the mutual information between the predicted classes and the generated samples, conditioned on the conditioned class. The final score is the geometric average of the scores over all the classes, which is equivalent to the exponent of the arithmetic average of the mutual information over all the classes. To define this measure, we define two random variables $\tilde{X}_c$ and $\hat{Y}_c$, which are the random variables $\tilde{X}$ and $\hat{Y}$ conditioned on the class being $c$.

The WCIS is defined as:

$$\mathrm{WCIS} = \exp\left( \mathbb{E}_{c}\left[ I(\hat{Y}_c; \tilde{X}_c) \right] \right) \quad (10)$$

where the mutual information is computed as follows:

$$I(\hat{Y}_c; \tilde{X}_c) = \mathbb{E}_{\tilde{x} \sim q(\cdot \mid c)}\left[ D_{KL}\big( p(\hat{y} \mid \tilde{x}) \,\|\, \bar{p}_c \big) \right] \quad (11)$$

where $\bar{p}_c$ is the distribution of $\hat{Y}_c$.
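A sketch of both components under the notation above; `probs` is the (N, K) matrix of classifier probabilities for the generated images and `cond` holds the (N,) conditioned-class indices. The grouping convention and variable names are our own.

```python
import numpy as np

EPS = 1e-12

def _kl(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=-1)

def bcis(probs: np.ndarray, cond: np.ndarray) -> float:
    # Eq. 9: average prediction per conditioned class, then Eq. 8: IS on
    # the class averages.
    avgs = np.stack([probs[cond == c].mean(axis=0) for c in np.unique(cond)])
    return float(np.exp(_kl(avgs, avgs.mean(axis=0)).mean()))

def wcis(probs: np.ndarray, cond: np.ndarray) -> float:
    # Eqs. 10-11: IS within each conditioned class, combined by the
    # geometric mean over classes.
    terms = [_kl(probs[cond == c], probs[cond == c].mean(axis=0)).mean()
             for c in np.unique(cond)]
    return float(np.exp(np.mean(terms)))
```

With equally sized conditioned classes, `bcis(probs, cond) * wcis(probs, cond)` coincides (up to numerical tolerance) with `inception_score(probs)` from the sketch in Sec. 2.1, which is the product relation formalized in the theorem below.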

In general, we wish the BCIS to be as high as possible and the WCIS to be as low as possible. High BCIS indicates a distinct class representation for each conditioned class and a wide coverage across the conditioned classes, which is a desired property. High WCIS indicates a wide coverage of real classes within the conditioned classes, which is an undesired property, since each conditioned class should represent only a single real class. In this way, one obtains consistent prediction within each class and has high variability between classes.

The following theorem presents the compositional relationship between IS and the proposed conditional measures. Let $Z$ and $C$ be two independent random variables. Let $\tilde{X} = G(Z, C)$ for a continuous generator function $G$ and let $\hat{Y}$ be a discrete random variable distributed by $p(\hat{y} \mid \tilde{X})$. Then,

$$\mathrm{IS} = \mathrm{BCIS} \cdot \mathrm{WCIS} \quad (12)$$

The proof is provided in the appendix.

By definition, as with the IS, both BCIS and WCIS lie within $[1, K]$. Since we wish the IS to be as large as possible and both BCIS and WCIS lie in the same interval, the theorem asserts that there is a tension between the BCIS and WCIS measures, since both of them cannot be large at the same time. In addition, since both components are at least $1$, the theorem shows that each of them provides a lower bound on the IS, and the bound is tight when the other component is equal to $1$. The final realization is that the IS can be very high even when the BCIS component is low, simply by having a high WCIS. Such a gap between IS and BCIS indicates bad conditional representation, which is overlooked by the unconditional evaluation.

On these grounds, we propose the BCIS and WCIS together as the conditional alternative to the IS. Each metric shows a different property of the generated data and, as shown in the theorem, the IS is readily obtained by multiplying the conditional components.

3.2 Conditional Fréchet Inception Distance

For conditional FID, we want to measure the distance between different distributions according to the feature vector $f(x)$, produced by the pre-trained feature extractor $f$ on a sample $x$. Analogous to the conditional IS metrics, we measure the between-class distance between the averages of conditioned-class features and the averages of real-class features, as well as the average within-class distance for each matching pair of real and conditioned classes.

BCFID measures the FID between the distribution of the average feature vector of conditioned classes in the generated data and the distribution of the average feature vector of real classes in the real data. It evaluates the coverage of the conditioned classes over the real classes.

For each distribution specifier $s \in \{r, g\}$ (real and generated), we estimate the per-class mean $\mu_c^s$, the mean of means $\mu^s$, and the covariance $\Sigma_b^s$ of the per-class mean feature vectors:

$$\mu_c^s = \mathbb{E}_{x \sim p_s(\cdot \mid c)}\left[ f(x) \right] \quad (13)$$

$$\mu^s = \mathbb{E}_{c}\left[ \mu_c^s \right] \quad (14)$$

$$\Sigma_b^s = \mathbb{E}_{c}\left[ (\mu_c^s - \mu^s)(\mu_c^s - \mu^s)^\top \right] \quad (15)$$

The BCFID is defined as:

$$\mathrm{BCFID} = \|\mu^r - \mu^g\|^2 + d^2_{\mathrm{cov}}\left( \Sigma_b^r, \Sigma_b^g \right) \quad (16)$$

WCFID measures the FID between the distribution of the generated data and the real data within each one of the classes. It evaluates how similar each conditioned class is to its respective real class. The total score is the mean FID within the classes.

For each distribution specifier $s \in \{r, g\}$, the within-class covariance matrices are defined as:

$$\Sigma_c^s = \mathbb{E}_{x \sim p_s(\cdot \mid c)}\left[ (f(x) - \mu_c^s)(f(x) - \mu_c^s)^\top \right] \quad (17)$$

The WCFID is defined as:

$$\mathrm{WCFID} = \mathbb{E}_{c}\left[ \|\mu_c^r - \mu_c^g\|^2 + d^2_{\mathrm{cov}}\left( \Sigma_c^r, \Sigma_c^g \right) \right] \quad (18)$$
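A sketch of Eqs. 16 and 18, reusing the `fid` helper from the sketch in Sec. 2.2; `feats_*` are (N, d) feature matrices and `cls_*` the corresponding (N,) class labels, with matching class indices between real and generated data.

```python
import numpy as np

def bcfid(feats_r, cls_r, feats_g, cls_g) -> float:
    # Eq. 16: FID between the distributions of per-class mean features.
    means_r = np.stack([feats_r[cls_r == c].mean(axis=0) for c in np.unique(cls_r)])
    means_g = np.stack([feats_g[cls_g == c].mean(axis=0) for c in np.unique(cls_g)])
    return fid(means_r, means_g)

def wcfid(feats_r, cls_r, feats_g, cls_g) -> float:
    # Eq. 18: mean FID between matching real and conditioned classes.
    return float(np.mean([fid(feats_r[cls_r == c], feats_g[cls_g == c])
                          for c in np.unique(cls_r)]))
```

These helpers also allow a numerical check of the upper bound in Eq. 19 below: on sampled data, `fid(feats_r, feats_g)` stays below `bcfid(...) + wcfid(...)` up to estimation noise.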

Note that we compare matching pairs of conditioned and real classes. When a mapping between conditioned and real classes exists, as in conditional GANs, this is straightforward. When there is no such mapping, i.e., in the class-discovery case, such as when employing the InfoGAN method, a mapping needs to be created. For example, one can use a classifier to obtain the prediction probabilities for the generated images, average the probabilities within each conditioned class, and apply the Hungarian algorithm to the resulting average probabilities, as sketched below.
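A sketch of that mapping step for the class-discovery case, using SciPy's Hungarian-algorithm implementation; `probs` and `cond` are the hypothetical classifier probabilities and conditioned-class indices for the generated images.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_classes(probs: np.ndarray, cond: np.ndarray) -> dict:
    """Map each conditioned class to a real class via the Hungarian algorithm."""
    classes = np.unique(cond)
    # Average prediction probability per conditioned class: (num_cond, K).
    avg = np.stack([probs[cond == c].mean(axis=0) for c in classes])
    # Maximizing the total matched probability = minimizing its negation.
    rows, cols = linear_sum_assignment(-avg)
    return {int(classes[r]): int(c) for r, c in zip(rows, cols)}
```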

In general, the desire is to minimize both components, since each captures a different aspect of the distance between the real and the generated distributions.

The following theorem ties the FID to the conditional FID components. Let $p$ and $q$ be the distributions of real and generated samples. Then,

$$\mathrm{FID}(p, q) \le \mathrm{BCFID}(p, q) + \mathrm{WCFID}(p, q) \quad (19)$$

and the bound is tight under certain conditions. The proof is provided in the appendix.

By this theorem, in conditional generation, the FID gives an optimistic evaluation that ignores bad cases. A good unconditional score can be obtained even when there is considerable friction between the real and generated distributions in terms of conditional generation. This friction can arise either from a bad representation of classes (high BCFID) or from mismatched diversity within classes (high WCFID). For this reason, we propose the BCFID and WCFID as the conditional alternative to the FID. In addition to providing two meaningful scores that are similarly scaled, an upper bound on the FID can be computed by adding the two components.

Figure 3: Label noising: labels were randomly replaced with probability $\alpha$ to simulate bad conditional generation. (a) Each row has images conditioned on the same class; images in red indicate bad conditional generation. (b) The effect of label noising on the unconditional and conditional IS metrics as a function of the noise. (c) Same for the conditional FID score.
Figure 4: Image noising: the effect of various types of noise on the unconditional and conditional metrics as the noise level increases.
Figure 5: Mode collapse: the effect of mode collapse on the unconditional and conditional IS and FID metrics as the severity increases. (a) Gradual mode collapse on a single class. (b) Incremental full mode collapse on all classes.

4 Experiments

Our experiments employ three datasets: MNIST lecun , CIFAR10 cifar , and ImageNet russakovsky2015imagenet . We first consider controlled simulations on MNIST to show the behavior of our metrics compared to existing unconditional metrics. Three cases are considered: noisy labels, noisy images, and mode collapse within classes. We then consider our metrics on a variety of well-established generative models and draw visual insights for the reported metric scores. Finally, a user study was held to compare the numeric results to human perception.

Evaluation procedure When evaluating the models, we use an equal number of randomly sampled real and generated samples for each class. For MNIST and CIFAR10, the test set was used as real samples, with an equal number of samples from each class. For ImageNet, the 50 validation samples available for each class were used, for a total of 50,000 validation samples.

To obtain the scores of the 'Real Data' rows in Tab. 1 and Tab. 2 (i.e., the scores obtained not from generating but from the training data itself, which serve as an unofficial upper bound on performance), an equal number of samples per class was taken from the train set. These same samples were also used for the three synthetic noise and mode collapse experiments, where they undergo various augmentations.

For each dataset, we applied a pretrained classifier, both to produce class probabilities for calculating the Inception Scores and as a feature extractor for calculating the FID scores. For ImageNet, we used the InceptionV3 Szegedy2015RethinkingTI architecture, as used in the original formulations of the IS salimans2016improved and FID heusel2017gans . For CIFAR10, we used the VGG-16 simonyan2014very architecture, and for MNIST, a classifier with two convolutional blocks and two fully connected layers. The test accuracy is 99.06% for MNIST, 85.20% for CIFAR10 and 77.45% for ImageNet.

The activations of the last hidden layer (a.k.a. the penultimate layer) were employed as the extracted features $f(x)$; the feature dimension therefore differs between the three classifiers. For ImageNet, since the number of samples used is small compared to the feature dimension, the rank of the estimated covariance matrix in Eq. 17 is much smaller than its dimension, which causes an inaccurate estimation of the WCFID. Instead, we run several independent trials, randomly selecting a subset of features from the feature vector in each trial, compute the FID score using these features, and finally average the FID scores over all trials to obtain a final FID score.
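A sketch of this subsampling procedure; the trial count and subset size below are illustrative placeholders, not the values used in the paper, and `fid` is the helper from the sketch in Sec. 2.2.

```python
import numpy as np

def subsampled_fid(feats_r, feats_g, n_trials=10, n_feats=512, seed=0) -> float:
    """Average FID over random feature subsets to handle rank deficiency."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(feats_r.shape[1], size=n_feats, replace=False)
        scores.append(fid(feats_r[:, idx], feats_g[:, idx]))
    return float(np.mean(scores))
```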

Note that since the classification and feature extraction differ between datasets, model scores should be compared per dataset, and not across datasets.

4.1 Synthetic Experiments

Label noising Label noising is the process of assigning random labels to some of the images, instead of their ground-truth labels. This process simulates different levels of adherence to the conditional input. To maintain an equal number of images per conditioned class, instead of simply re-selecting a random class, we performed a random permutation of the labels of a subset of the images, whose size is proportional to a parameter $\alpha \in [0, 1]$. When $\alpha = 0$, no noising was applied, and when $\alpha = 1$, all image labels were randomly permuted. Fig. 3 shows how label noising simulates a decline in conditional generation performance. In Fig. 3(a), each row of each subfigure represents a conditioned class; the red images highlight when the conditional generation fails. When setting $\alpha = 0$, all images are correctly generated according to their conditional input, and as $\alpha$ increases, more images are incorrectly generated.
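A sketch of the permutation-based noising; the parameter name `alpha` stands in for the garbled symbol in the source, and permuting labels within a random subset preserves the per-class counts, as required.

```python
import numpy as np

def noise_labels(labels: np.ndarray, alpha: float, seed: int = 0) -> np.ndarray:
    """Permute the labels of a random alpha-fraction of the images."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(round(alpha * len(labels))),
                     replace=False)
    noisy[idx] = noisy[rng.permutation(idx)]  # a bijection: class counts unchanged
    return noisy
```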

In Fig. 3(b) and (c), the IS and FID metrics and our proposed conditional variants are presented under the effect of label noising. The plots depict a number of interesting behaviors. First, the unconditional IS and FID remain constant across the experiment; this is because these metrics place no conditional requirements on the generated images, and the unconditional performance remains the same. Second, label noising has a dramatic effect on the conditional IS and FID metrics. The BCIS, which evaluates both the consistency of each condition with the target classes and the coverage of the target classes, falls immediately due to the declining consistency in the conditioned images. The WCIS, on the other hand, which measures inconsistency, shows a rapid increase, compensating for the decline of the BCIS score. All conditional components of the FID increase, since the label noise inflicts a shift in the distribution within each class and on the class averages.

Image noising We applied several types of noise to the images and compared their effect on the scores. The noise was applied with increasing magnitude $\alpha \in [0, 1]$. We applied Gaussian noise with mean $0$ and variance $\alpha$, salt & pepper noise with probability $\alpha$ per pixel, and random pixel permutation with probability $\alpha$. Fig. 4 shows the IS and FID alongside the conditional scores. For IS, the BCIS declines more rapidly than the IS, making it more sensitive to image quality. This is matched by an increase in the WCIS, which accounts for the gap between BCIS and IS. The WCIS exposes cases where the IS gives a false sense of generation quality, best seen during pixel permutation. For FID, the conditional metrics follow the same trend as the unconditional one, with the gap between the FID and the sum of the conditional FID components increasing with the level of noise.
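Sketches of the three corruptions, each controlled by a single magnitude `alpha` in [0, 1]; the exact parameterizations are our assumptions, and the images are assumed to be float arrays in [0, 1].

```python
import numpy as np

def gaussian_noise(imgs, alpha, rng):
    """Additive Gaussian noise with mean 0 and variance alpha."""
    return imgs + rng.normal(0.0, np.sqrt(alpha), size=imgs.shape)

def salt_and_pepper(imgs, alpha, rng):
    """Each pixel is independently set to 0 or 1 with probability alpha."""
    out = imgs.copy()
    mask = rng.random(imgs.shape) < alpha
    out[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return out

def pixel_permutation(imgs, alpha, rng):
    """Permute a random alpha-fraction of the pixels in each image."""
    flat = imgs.reshape(len(imgs), -1).copy()
    n = int(alpha * flat.shape[1])
    for img in flat:
        idx = rng.choice(flat.shape[1], size=n, replace=False)
        img[idx] = img[rng.permutation(idx)]
    return flat.reshape(imgs.shape)
```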

Mode collapse Mode collapse occurs when the model fails to capture the full distribution of the target dataset and collapses to represent only a portion of it. It is a common failure of generative models, which occurs when the model generates similar images for many different latent inputs. In the conditional setting, the collapse can be more specific and occur only within a specific class.

Fig. 5 shows how the unconditional and conditional FID metrics react to the collapse. (a) shows a single-class collapse, where at each step the diversity within that class gradually declines. (b) shows all of the classes fully collapsing, one class per step. Our metric is more sensitive to mode collapse, both when it occurs in a single class and when it occurs in multiple classes. No evaluation of the unconditional and conditional IS was performed in this setting, since neither can detect mode collapse.
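Sketches of the two collapse simulations applied to the real samples; which sample is duplicated and the fraction schedule are our illustrative choices.

```python
import numpy as np

def collapse_single_class(feats, cls, target, frac, seed=0):
    """(a) Replace a fraction of one class with copies of a single sample."""
    rng = np.random.default_rng(seed)
    out = feats.copy()
    idx = np.where(cls == target)[0]
    chosen = rng.choice(idx, size=int(frac * len(idx)), replace=False)
    out[chosen] = feats[idx[0]]
    return out

def collapse_k_classes(feats, cls, k):
    """(b) Fully collapse the first k classes, one sample per class."""
    out = feats.copy()
    for c in np.unique(cls)[:k]:
        idx = np.where(cls == c)[0]
        out[idx] = feats[idx[0]]
    return out
```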

Evaluation metrics User study
FID↓ WCFID↓ BCFID↓ IS↑ WCIS↓ BCIS↑ Accuracy↑ Quality↑ Diversity↑ Class↑

CIFAR10

Real Data 0.02 0.10 0.04 6.77 1.42 4.78 87.15 - - -
CGAN 5.26 9.50 7.49 3.50 2.17 1.61 42.15 5.5 5.6 7.3
InfoGAN 5.85 17.16 11.73 3.35 2.71 1.24 46.40 4.3 4.8 6.8
SGAN 5.96 17.06 11.55 3.20 2.48 1.29 23.94 3.0 3.5 3.8
ACGAN 4.29 6.92 5.25 3.84 1.84 2.09 55.72 - - -

MNIST

Real Data 19.86 35.99 34.17 9.86 1.04 9.52 99.61 - - -
CGAN 36.67 98.48 25.70 9.87 1.06 9.31 98.90 8.3 8.8 7.6
InfoGAN 76.73 321.56 93.51 9.38 1.33 7.03 89.83 5.3 6.0 6.3
SGAN 69.34 609.48 289.42 8.87 2.03 4.37 73.34 6.0 3.5 5.3
ACGAN 30.13 91.21 25.95 9.74 1.09 8.93 98.30 - - -
Table 1: Unconditional and conditional metrics on CIFAR10 and MNIST for different conditional GANs. ↓ indicates that a lower value is better and ↑ otherwise.
Figure 6: Illustrations for CIFAR10. (a) CGAN, (b) InfoGAN, (c) SGAN.
Figure 7: Illustrations for MNIST. (a) CGAN, (b) InfoGAN, (c) SGAN.

4.2 Model Comparison

We next evaluate the performance of various pretrained conditional GAN models on different datasets. In Tab. 1, for CIFAR10 and MNIST, we consider CGAN mirza2014conditional , SGAN odena2016semi , InfoGAN chen2016infogan and ACGAN odena2017conditional . Note that for SGAN, the generator is not class conditioned, and so we modified the generator to accept both noise and class label as input, and the adversarial loss was applied on the conditioned class.

For conditional generation, there are four extreme cases: (i) good unconditional and good conditional generation, (ii) bad unconditional and bad conditional generation, (iii) good unconditional and bad conditional generation, and (iv) bad unconditional and good conditional generation. We argue that the fourth scenario is impossible, since the conditional generation metrics always present a more critical evaluation (i.e., a lower bound in IS and an upper bound in FID) than the unconditional metrics; therefore, bad unconditional generation always leads to bad conditional generation as well. Cases (i) and (ii) are the more trivial cases, where the model is either good or bad at both tasks. Case (iii) describes a scenario where the unconditional generation is good but the conditional requirement is not met. We now inspect each model and identify which scenario it falls under.

The analysis is done by looking at the results in Tab. 1. Additionally, Figs. 6 and 7 show examples of the generation of CGAN, InfoGAN and SGAN. ACGAN falls under case (i): in CIFAR10, it consistently had the best score, and in MNIST, it had either the top score or close to it on each metric. CGAN, while not performing as well as ACGAN, also falls under case (i) compared to the other two models. InfoGAN and SGAN fall under case (iii) for CIFAR10. The difference in unconditional performance between them and CGAN/ACGAN is relatively small; however, there is a very distinct degradation in their conditional generation scores, for both IS and FID.

InfoGAN also highlights the problem of using accuracy as a measure. The accuracy obtained for its generated images is higher than that of CGAN, even though our conditional metrics say otherwise. By inspecting the generated images, we conclude that CGAN performed better, and thus that our metrics depict the conditional performance of the models more faithfully. For MNIST, InfoGAN and SGAN fall either under case (ii) or (iii), depending on how good one considers their unconditional scores. Nevertheless, the performance decrease of InfoGAN and SGAN relative to CGAN becomes much more noticeable when looking at the conditional metrics, where a clear drop on all scores is evident.

To see how the metrics translate to human perception, we performed a user study on CGAN, InfoGAN and SGAN for both MNIST and CIFAR10. The user study involved 20 participants with knowledge of the field. The participants were not aware of the purpose of the study and did not know which model they were evaluating. They were asked to grade the 'quality', 'diversity' and 'class relation' of the generated images between 1 (low) and 10 (high), for each model separately. The results in Tab. 1 show that CGAN received higher scores on both datasets, which is aligned with the results of the conditional metrics in our experiments.

FID↓ WCFID↓ BCFID↓ IS↑ WCIS↓ BCIS↑ Accuracy↑
Real Data 0.110 4.149 0.114 602.613 3.005 201.360 77.45
BigGAN 0.46 6.33 0.43 363.91 5.04 72.15 51.66
Table 2: Unconditional and conditional metrics on ImageNet for BigGAN. ↓ indicates that a lower value is better and ↑ otherwise.

Figure 8: Not all classes perform equally. WCFID (a) and accuracy (b) per class for BigGAN. The average score is shown in red.
Figure 9: BigGAN images of best and worst classes in terms of WCFID. (a),(b) real and fake images for classes with the highest WCFID. (c),(d) real and fake images for classes with the lowest WCFID.
Figure 10: BigGAN images of best and worst classes in terms of Accuracy. (a),(b) real and fake images for classes with the highest accuracy. (c),(d) real and fake images for classes with the lowest accuracy.

4.3 BigGAN In-Depth Analysis

BigGAN is a state-of-the-art image generation model on the ImageNet dataset. In this section, we evaluate BigGAN with our metrics and use them to perform an in-depth analysis of its conditional generation capabilities.

BigGAN's performance on the various metrics can be seen in Tab. 2. Note that the FID score differs from the one in the original paper, since we normalize the score by the size of the feature vector. The results show that BigGAN's performance is very close to the performance on real data, in both the unconditional and conditional metrics.

A closer inspection shows variance in the generation quality of the model across the different classes. Fig. 8(a) shows that not all classes have the same WCFID; some classes are better represented than others. Knowing which classes are better represented can serve as a useful insight for fine-tuning a trained model to concentrate on the worst-represented classes, or for comparing various trained generative models.

Fig. 9 shows the 10 best and worst represented classes in terms of WCFID. The WCFID metric has a strong correlation with the quality of the class: the images from classes with the best scores are of high quality and closely resemble the real images, while the images from classes with the worst scores do not resemble their target class. In addition, as evident from the experiment, several classes (for example, 'digital clock') have a high WCFID that is due to mode collapse.

To validate that the accuracy score cannot deliver these insights, we present in Fig. 8(b) the accuracy for each class, sorted according to its WCFID (same order as in (a)). Similarly to WCFID, not all classes have the same score. However, we can observe that the per-class accuracy scores are only partly (inversely) correlated with the performance in WCFID.

To better understand the difference between the WCFID and accuracy scores, Fig. 10 shows the best and worst classes in terms of accuracy. Some classes were placed in the top 10 by both metrics (WCFID and accuracy), but others were not equally ranked. When looking at the worst-ranked classes, we notice that a low rank in accuracy does not always correlate with low quality or diversity. For example, 'notebook' and 'monitor' were both ranked at the bottom in terms of accuracy, but do not look as bad as the worst classes in WCFID. We observe that these classes were ranked low not because they were poorly generated, but because it is hard to tell them apart.

5 Conclusions

We presented two new evaluation procedures for class-conditional image generation based on well established metrics for unconditional generation. The proposed metrics are supported by theoretical analysis and a number of experiments. Our metrics are beneficial in comparing trained models and gaining significant insights when developing models.

Acknowledgements.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).

References

  • (1) Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 214–223. PMLR, International Convention Centre, Sydney, Australia (2017)
  • (2) Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans. arXiv preprint arXiv:1801.01401 (2018)
  • (3) Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (2019). URL https://openreview.net/forum?id=B1xsqj09Fm
  • (4) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: Advances in neural information processing systems, pp. 2172–2180 (2016)
  • (5) Dowson, D.C., Landau, B.V.: The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis 12(3), 450–455 (1982). URL https://EconPapers.repec.org/RePEc:eee:jmvana:v:12:y:1982:i:3:p:450-455
  • (6) Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, p. 2672–2680. MIT Press, Cambridge, MA, USA (2014)
  • (7) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, p. 5769–5779. Curran Associates Inc., Red Hook, NY, USA (2017)
  • (8) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in neural information processing systems, pp. 6626–6637 (2017)
  • (9) Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189 (2018)

  • (10) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134 (2017)
  • (11) Johnson, J., Gupta, A., Fei-Fei, L.: Image generation from scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1219–1228 (2018)
  • (12) Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018). URL https://openreview.net/forum?id=Hk99zCeAb
  • (13) Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)
  • (14) Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. CoRR abs/1912.04958 (2019)
  • (15) Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10 (canadian institute for advanced research) (2010). URL http://www.cs.toronto.edu/~kriz/cifar.html
  • (16) LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). URL http://yann.lecun.com/exdb/mnist/
  • (17) Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  • (18) Miyato, T., Koyama, M.: cgans with projection discriminator. arXiv preprint arXiv:1802.05637 (2018)
  • (19) Odena, A.: Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583 (2016)
  • (20) Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier gans. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. JMLR. org (2017)
  • (21) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision 115(3), 211–252 (2015)
  • (22) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in neural information processing systems, pp. 2234–2242 (2016)
  • (23) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • (24) Singh, K.K., Ojha, U., Lee, Y.J.: Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6490–6499 (2019)
  • (25) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2818–2826 (2015)
  • (26) Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324 (2018)
  • (27) Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 5907–5915 (2017)
  • (28) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 2223–2232 (2017)

Appendix A Proofs of the Main Results

Lemma A. Let $Z$ and $C$ be two independent random variables. Let $\tilde{X} = G(Z, C)$ for a continuous generator function $G$ and let $\hat{Y}$ be a discrete random variable distributed by $p(\hat{y} \mid \tilde{X})$. Then,

$$\mathrm{IS} = \exp\left( I(\hat{Y}; \tilde{X}) \right) \quad (20)$$

We consider that:

$$\mathbb{E}_{\tilde{x}}\left[ D_{KL}\big( p(\hat{y} \mid \tilde{x}) \,\|\, p(\hat{y}) \big) \right] = \mathbb{E}_{\tilde{x}}\left[ \sum_{\hat{y}} p(\hat{y} \mid \tilde{x}) \log \frac{p(\hat{y} \mid \tilde{x})}{p(\hat{y})} \right] = I(\hat{Y}; \tilde{X}) \quad (21)$$

Let $Z$ and $C$ be two independent random variables. Let $\tilde{X} = G(Z, C)$ for a continuous generator function $G$ and let $\hat{Y}$ be a discrete random variable distributed by $p(\hat{y} \mid \tilde{X})$. Then,

$$\mathrm{IS} = \mathrm{BCIS} \cdot \mathrm{WCIS} \quad (22)$$

By Lem. A, the Inception Score can be represented as $\mathrm{IS} = \exp(I(\hat{Y}; \tilde{X}))$, and by definition, we have: $\mathrm{BCIS} = \exp(I(\hat{Y}; C))$ and $\mathrm{WCIS} = \exp(I(\hat{Y}; \tilde{X} \mid C))$. Next, we would like to represent $I(\hat{Y}; \tilde{X})$ in terms of $I(\hat{Y}; C)$ and $I(\hat{Y}; \tilde{X} \mid C)$. First, by the chain rule of mutual information, marginalizing with respect to $C$, we have:

$$I(\hat{Y}; \tilde{X}, C) = I(\hat{Y}; C) + I(\hat{Y}; \tilde{X} \mid C) \quad (23)$$

Since $\hat{Y}$ is independent of $C$ given $\tilde{X}$, we have: $I(\hat{Y}; C \mid \tilde{X}) = 0$. Hence,

$$I(\hat{Y}; \tilde{X}, C) = I(\hat{Y}; \tilde{X}) + I(\hat{Y}; C \mid \tilde{X}) = I(\hat{Y}; \tilde{X}) \quad (24)$$

We consider that Eq. 23 and Eq. 24 share the same left-hand side. Therefore, we have:

$$I(\hat{Y}; \tilde{X}) = I(\hat{Y}; C) + I(\hat{Y}; \tilde{X} \mid C) \quad (25)$$

Finally, we conclude that:

$$\mathrm{IS} = \exp\left( I(\hat{Y}; C) + I(\hat{Y}; \tilde{X} \mid C) \right) = \mathrm{BCIS} \cdot \mathrm{WCIS} \quad (26)$$

Let $p$ and $q$ be the distributions of real and generated samples. Then,

$$\mathrm{FID}(p, q) \le \mathrm{BCFID}(p, q) + \mathrm{WCFID}(p, q) \quad (27)$$

and the bound is tight.

First, we recall the definitions of the FID and BCFID measures:

$$\mathrm{FID} = \|\mu^r - \mu^g\|^2 + \mathrm{Tr}\left( \Sigma^r + \Sigma^g - 2(\Sigma^r \Sigma^g)^{1/2} \right) \quad (28)$$

and

$$\mathrm{BCFID} = \|\mu^r - \mu^g\|^2 + \mathrm{Tr}\left( \Sigma_b^r + \Sigma_b^g - 2(\Sigma_b^r \Sigma_b^g)^{1/2} \right) \quad (29)$$

We notice that $\mu^r = \mathbb{E}_c[\mu_c^r]$ and $\mu^g = \mathbb{E}_c[\mu_c^g]$, so the first terms of the two quantities coincide. Hence, the only difference between the two quantities arises from the second terms.

Next, we would like to develop the formulation of $\Sigma^s$ for $s \in \{r, g\}$. By the law of total covariance:

$$\Sigma^s = \mathbb{E}_{c}\left[ \Sigma_c^s \right] + \Sigma_b^s \quad (30)$$

Hence,

$$\mathrm{Tr}\left( \Sigma^r + \Sigma^g \right) = \mathbb{E}_c\left[ \mathrm{Tr}\left( \Sigma_c^r + \Sigma_c^g \right) \right] + \mathrm{Tr}\left( \Sigma_b^r + \Sigma_b^g \right) \quad (31)$$

In particular, substituting into Eq. 28:

$$\mathrm{FID} = \|\mu^r - \mu^g\|^2 + \mathbb{E}_c\left[ \mathrm{Tr}\left( \Sigma_c^r + \Sigma_c^g \right) \right] + \mathrm{Tr}\left( \Sigma_b^r + \Sigma_b^g \right) - 2\,\mathrm{Tr}\left( (\Sigma^r \Sigma^g)^{1/2} \right) \quad (32)$$

Therefore, we summarize, expanding the WCFID of Eq. 18:

$$\mathrm{WCFID} = \mathbb{E}_c\left[ \|\mu_c^r - \mu_c^g\|^2 \right] + \mathbb{E}_c\left[ \mathrm{Tr}\left( \Sigma_c^r + \Sigma_c^g \right) \right] - 2\,\mathbb{E}_c\left[ \mathrm{Tr}\left( (\Sigma_c^r \Sigma_c^g)^{1/2} \right) \right] \quad (33)$$

Now we can say the following:

$$\mathrm{BCFID} + \mathrm{WCFID} - \mathrm{FID} = \mathbb{E}_c\left[ \|\mu_c^r - \mu_c^g\|^2 \right] + 2\left( \mathrm{Tr}\left( (\Sigma^r \Sigma^g)^{1/2} \right) - \mathrm{Tr}\left( (\Sigma_b^r \Sigma_b^g)^{1/2} \right) - \mathbb{E}_c\left[ \mathrm{Tr}\left( (\Sigma_c^r \Sigma_c^g)^{1/2} \right) \right] \right) \quad (34)$$

We denote:

$$\Delta := \mathrm{BCFID} + \mathrm{WCFID} - \mathrm{FID} \quad (35)$$

Next, we would like to show that $\Delta \ge 0$. We consider that $\Delta$ sums the non-negative term $\mathbb{E}_c\left[ \|\mu_c^r - \mu_c^g\|^2 \right]$ with the following term:

$$2\left( \mathrm{Tr}\left( (\Sigma^r \Sigma^g)^{1/2} \right) - \mathrm{Tr}\left( (\Sigma_b^r \Sigma_b^g)^{1/2} \right) - \mathbb{E}_c\left[ \mathrm{Tr}\left( (\Sigma_c^r \Sigma_c^g)^{1/2} \right) \right] \right) \quad (36)$$

Since the function $(A, B) \mapsto -\mathrm{Tr}\big( (AB)^{1/2} \big)$ is jointly convex (and homogeneous of degree one, hence subadditive), by Jensen's trace inequality and Eq. 30, the above term is non-negative.

This implies the desired inequality:

$$\mathrm{FID} \le \mathrm{BCFID} + \mathrm{WCFID} \quad (37)$$

Finally, we would like to demonstrate the tightness of the bound, aside from the trivial case in which all or some of the covariance matrices are zero and $\mu_c^r = \mu_c^g$ for all $c$. Consider a case where all of the matrices $\Sigma_c^r, \Sigma_b^r$ and $\Sigma_c^g, \Sigma_b^g$ are simultaneously diagonalizable, i.e., there exists an invertible matrix $P$, such that:

$$\Sigma_c^s = P D_c^s P^{-1}, \qquad \Sigma_b^s = P D_b^s P^{-1}, \qquad s \in \{r, g\} \quad (38)$$

where $D_c^s$ and $D_b^s$ are the diagonal matrices of the eigenvalues of $\Sigma_c^s$ and $\Sigma_b^s$, respectively. Since all of the matrices are diagonal in this basis, the trace terms can be rewritten eigenvalue by eigenvalue.