Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a class of deep generative models that have achieved unprecedented performance in generating high-quality and diverse images (Brock et al., 2019). They have also been successfully applied to a variety of image-generation tasks, e.g. super resolution (Ledig et al., 2017), image-to-image translation (Zhu et al., 2017), and text-to-image synthesis (Reed et al., 2016), to name a few. The GAN framework consists of a generator G and a discriminator D, where G generates images that are expected to resemble real images, while D discriminates between real and generated images. G and D are trained by playing a two-player minimax game in a competing manner. This adversarial training process is a key factor in GANs' success: it implicitly defines a learnable objective that adapts flexibly to complicated image-generation tasks in which it would be difficult or impossible to define such an objective explicitly.
One of the biggest challenges in the field of generative models, including GANs, is the automatic evaluation of the goodness of such models (e.g., whether or not the data generated by such models are similar to the data they were trained on). In supervised learning, the goodness of a model can be assessed by comparing its predictions with the actual labels; in some other deep-learning models, it can be assessed using the likelihood of the validation data under the distribution that the real data comes from. In most state-of-the-art generative models, by contrast, we do not know this distribution explicitly or cannot rely on labels for such evaluations.
Given that the data (or their corresponding features) in such situations can be assumed to lie on a manifold embedded in a high-dimensional space (Goodfellow et al., 2016), tools from topology and geometry are a natural choice for studying differences between two data sets. Hence, we propose topology distance (TD) for the evaluation of GANs; it compares the topological structures of two manifolds and calculates a distance between them to evaluate their (dis)similarities. We compare TD with widely used and relevant metrics, and demonstrate that it is more robust to noise than competing distance measures for GANs and better suited to distinguishing among the various shapes that the data might come in. TD evaluates GANs from a perspective different from that of existing measurements. It can therefore be used either as an alternative to, or in conjunction with, other metrics.
1.1 Related work
There have been multiple metrics proposed to automatically evaluate the performance of GANs. In this paper we focus on the most commonly-used and relevant approaches (as follows); for a more comprehensive review of such measurements, please refer to (Borji, 2018).
Inception Score (IS) The main idea behind IS (Salimans et al., 2016) is that generated images of high quality are expected to meet two requirements: they should contain easily classifiable objects (i.e. the conditional label distribution $p(y \mid x)$ should have low entropy) and should be diverse (i.e. the marginal distribution $p(y)$ should have high entropy). IS measures the average KL divergence between these two distributions:
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[\mathrm{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right),$$
where $p_g$ is the generative distribution. IS relies on a pretrained Inception model (Szegedy et al., 2016) for the classification of the generated images. Therefore, a key limitation of IS is that it is unable to evaluate image types that are distinct from those that the Inception model was trained on.
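As a minimal sketch of this definition, the snippet below computes IS from a list of per-image class probabilities. The function name `inception_score` and the toy probabilities are illustrative assumptions; in practice each row would be the softmax output of the pretrained Inception classifier for one generated image.

```python
import math

def inception_score(probs):
    """probs: list of per-image class-probability lists p(y|x), each summing to 1."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y): the average of the conditionals.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    # Average KL(p(y|x) || p(y)) over images; terms with p(y|x) = 0 contribute 0.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0.0)
    mean_kl /= n
    return math.exp(mean_kl)

# Toy inputs (assumed, not real classifier outputs).
confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]   # sharp conditionals, uniform marginal
uninformative = [[0.5, 0.5], [0.5, 0.5]]           # conditionals equal the marginal

score_confident_diverse = inception_score(confident_and_diverse)
score_uninformative = inception_score(uninformative)
```

With two classes the score ranges from 1 (conditionals identical to the marginal) to 2 (perfectly confident and evenly spread predictions), which the two toy inputs illustrate.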
Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) Proposed by Heusel et al. (2017), FID relies on a pretrained Inception model, which maps each image to a vector representation (or features). Given two groups of data in this vector space (one from the real and the other from the generated images), FID measures their similarity, assuming that the features are distributed as multivariate Gaussians; the distance is the (squared) Fréchet distance (also known as the Wasserstein-2 distance) between the two Gaussians:
$$\mathrm{FID}(P_r, P_g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),$$
where $P_r$ and $P_g$ denote the feature distributions of real and generated data, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the means and covariances of the corresponding feature distributions, respectively. It has been shown that FID is more robust to noise (of certain types) than IS (Heusel et al., 2017), but its assumption that the features follow a multivariate Gaussian distribution might be an oversimplification.
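The Fréchet distance between two fitted Gaussians can be sketched as follows, assuming NumPy is available. The function name `frechet_distance` and the random features are illustrative assumptions, not the reference FID implementation; the trace of the matrix square root is obtained from the eigenvalues of the product of the covariances rather than from an explicit matrix square root.

```python
import numpy as np

def frechet_distance(feats_r, feats_g):
    """feats_*: arrays of shape (num_samples, feature_dim)."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_r, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_g, rowvar=False))
    # Tr((Sigma_r Sigma_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative for
    # positive semi-definite covariances (up to round-off, hence the clip).
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

# Toy features (assumed): the second set is the first one shifted by 2 along one axis,
# so the covariances coincide and only the mean term contributes.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(500, 3))
feats_b = feats_a + np.array([2.0, 0.0, 0.0])

fid_same = frechet_distance(feats_a, feats_a)
fid_shifted = frechet_distance(feats_a, feats_b)
```

Because the shifted set has the same covariance, the distance reduces to the squared mean difference, which is 4 in this toy case.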
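A similar metric to FID is KID (Bińkowski et al., 2018), which computes the squared maximum mean discrepancy (MMD) between the features (learned also from a pretrained Inception model) of real and generated images:
$$\mathrm{KID} = \mathrm{MMD}^2 = \mathbb{E}_{x, x'}\left[k(x, x')\right] + \mathbb{E}_{y, y'}\left[k(y, y')\right] - 2\,\mathbb{E}_{x, y}\left[k(x, y)\right],$$
where $x, x'$ are features of real images, $y, y'$ are features of generated images, and $k(x, y) = \left(\frac{1}{d} x^\top y + 1\right)^3$ denotes a polynomial kernel function with feature dimension $d$. Compared with FID, KID does not assume any parametric form for the feature distribution, and it has an unbiased estimator.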
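A hedged sketch of the unbiased MMD estimate with the cubic polynomial kernel, assuming NumPy, is given below. The names `polynomial_kernel` and `kid_unbiased` and the random features are illustrative assumptions; published implementations typically also average the estimate over several random subsets, which is omitted here.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x^T y / d + 1)^3 for row-wise features."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid_unbiased(feats_r, feats_g):
    """Unbiased estimate of the squared MMD between two feature sets."""
    m, n = len(feats_r), len(feats_g)
    k_rr = polynomial_kernel(feats_r, feats_r)
    k_gg = polynomial_kernel(feats_g, feats_g)
    k_rg = polynomial_kernel(feats_r, feats_g)
    # Within-set terms exclude the diagonal (pairs of a sample with itself).
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

# Toy features (assumed): two samples from the same Gaussian and one shifted sample.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
same = rng.normal(size=(200, 4))
shifted = rng.normal(loc=3.0, size=(200, 4))

kid_same = kid_unbiased(real, same)
kid_shifted = kid_unbiased(real, shifted)
```

Since the estimator is unbiased, `kid_same` fluctuates around zero and may be slightly negative, whereas `kid_shifted` is clearly positive.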
Our proposed TD is closely related to FID and KID in that it also measures the distance between latent features of real and generated data. However, the key distinction of TD is that the target distance is computed by considering the geometric and topological properties of those latent features.
Geometry Score (GS) GS (Khrulkov and Oseledets, 2018) is defined as the distance between the means of the relative living-times (RLT) vectors associated with the two sets of images. The RLT of point cloud data (e.g., a group of images in the feature space) is an infinite vector $(v_1, v_2, \ldots)$ whose $i$-th entry is a measure of the persistent intervals having $k$-persistent homology group rank equal to $i$. That is, $v_i = \frac{1}{d_{n(n-1)/2}} \sum_{j} I_j \,(d_{j+1} - d_j)$, where $I_j$ equals $1$ if the rank of the persistent homology group of dimension $k$ in the interval $[d_j, d_{j+1}]$ is $i$, and zero otherwise. The persistent homology parameters $d_1, \ldots, d_{n(n-1)/2}$ are the sorted distances in the observed point cloud data of $n$ points.
Geometry score exploits a similar idea to topology distance; the differences lie in the underlying point cloud data used, the dimensionality of the homology group, and the distance measure between the persistent diagrams. We claim that our method aligns better with the existing theory in computational algebraic topology and achieves superior experimental results.
2 Main idea
According to the manifold hypothesis (Goodfellow et al., 2016), real-world high-dimensional data (and their features) lie on a low-dimensional manifold embedded in a high-dimensional space. The main idea of this paper is to compare the latent manifold of the real data with that of the generated data, based entirely on the topological properties of the data samples from these two manifolds. Let $\mathcal{M}_r$ and $\mathcal{M}_g$ be the latent manifolds of the real and generated data, respectively. We aim to compare these two manifolds using the finite samples of points $X_r$ from $\mathcal{M}_r$ and $X_g$ from $\mathcal{M}_g$.
Most mainstream methods compare the samples of real and generated data, $X_r$ and $X_g$, using their lower-order moments (e.g. (Heusel et al., 2017)), similar to the way we compare two functions using their Taylor expansions. However, this is only valid if the underlying manifold is a Euclidean space (zero curvature), as all moments of the samples are calculated using the Euclidean distance. For a Riemannian manifold with non-zero curvature, this type of approach would not work, at least in theory, and using the geodesic distance instead of the Euclidean distance would agree better with the hypothesis.
Here we propose the comparison of the two manifolds on the basis of their topology and/or geometry. The ideal way to compare two manifolds would be to infer whether they are geometrically equivalent, i.e. isometric. This, unfortunately, is not attainable. However, we could compare two manifolds by means of the eigenvalues of the Laplace-Beltrami operator on them (which is defined only for Riemannian manifolds).
The Laplace-Beltrami spectrum can be regarded as the set of squared frequencies associated with the eigenmodes of an oscillating membrane defined on the manifold. The spectrum is an infinite sequence of eigenvalues and satisfies some nice stability properties, whereby a small perturbation in the metric of the underlying Riemannian manifold results in a small perturbation of the spectrum (Donnelly, 2010; Birman, 1963). Furthermore, the Laplace-Beltrami spectrum is widely considered a “fingerprint” of a manifold. In the famous 1966 paper “Can one hear the shape of a drum?” (Kac, 1966), M. Kac asked whether the eigenvalues of the Laplace-Beltrami operator alone are sufficient to identify a manifold uniquely (up to an isometry). The answer is unfortunately negative, but isospectral manifolds are rare and, when they exist, they share multiple topological and geometric features.
Furthermore, it is possible to translate this methodology to a discrete setting, such that the spectrum calculated on the discrete set relates closely to the spectrum on the manifold itself.
Theorem 1 (Mantuano, 2005) Given a discretisation of a compact Riemannian manifold which has non-negative sectional curvature , and non negative injectivity radius, and for which , where is dimension of a manifold, is a Riemannian metric, then it is possible to associate the eigenvalues of Laplace operator on a graph , with the ones of the Laplace Beltrami operator on , for all .
The $\varepsilon$-discretisation $X$ of a manifold $M$ is a set of points in $M$ whose pairwise distances are at least $\varepsilon$ and for which the union of the balls of radius $\varepsilon$ centred at the points of $X$ forms an open cover of $M$. A version of this theorem also holds for the eigenvalues of a higher-dimensional version of the Laplace-Beltrami operator, called the Laplace-de Rham operator, which reflects high-dimensional topological and geometric properties of a manifold (Mantuano, 2008). This effectively means that, for a sufficiently good sample $X$ from $M$, calculating the eigenvalues on $X$ is effectively the same as calculating them on $M$ (see (Dey et al., 2010) for more results).
In our case the manifold $M$ is unknown, and all we know is a sample of points from it: $X \subset M$. In order to calculate the Laplace-Beltrami spectrum, we need a graph structure on $X$, which comes through the Čech complex on a cover of $M$ by balls centred at the points of $X$. To obtain the Čech complex, one needs the radius of the balls in the cover, i.e., $\varepsilon$. This in itself poses a problem, because it is difficult to determine the right value of $\varepsilon$; without it, there is little hope of recovering the spectral properties of $M$ from the point sample $X$.
A similar theorem to Theorem 1 applies to homology type of a manifold and its sample.
Theorem 2 Given a Riemannian manifold $M$ and a sufficiently dense sample of points from it, $X \subset M$, a Vietoris-Rips complex of $X$ at a suitable scale $\varepsilon$ is homologically equivalent to $M$.
This theorem is a direct consequence of the famous nerve theorem (Alexandroff, 1928), but can also be seen as a consequence of Theorem 1, due to the fact that the multiplicity of the eigenvalue zero on the discretised space is exactly the rank of the homology group of dimension zero on the same space.
In practice, as before, one does not know how to choose the scale $\varepsilon$ for $X$, but unlike before, in this setting we have available a tool that can, and is specifically designed to, deal with the uncertainty of scale: persistent homology. We chose to utilise persistent homology to extract information about the geometry and topology of $M$, because persistent homology, measured on the sample $X$, is a reliable shape quantifier of $M$.
Intuitively speaking, a topological space is any space on which the notion of neighbourhood can be defined. Hence, all metric spaces (and consequently all examples considered in this work) are topological spaces; the opposite is not true (i.e., not all topological spaces can be endowed with a metric).
It is very difficult to assess directly whether two topological spaces are equivalent (homeomorphic); instead, topologists use proxies to measure their similarity. One of these proxies is the family of homology groups, denoted by $H_k$, which loosely speaking encode the information on different types of loops (of different dimensions) that can be observed in the topological space. Here, the following implication holds: if two topological spaces are equivalent, then their homology groups are isomorphic, but the opposite is not true.
In this work we will only be concerned with a special class of topological spaces called simplicial complexes. A simplicial complex, commonly denoted by $K$, is a topological space consisting of vertices in a set $V$ and a set of faces $F$ chosen from the partitive set (power set) of $V$, $\mathcal{P}(V)$, with the requirement that if $\sigma \in F$, then all the subsets of $\sigma$ are also in $F$. One way to visualise a simplicial complex is to consider vertices as points in $\mathbb{R}^n$ and $k$-dimensional faces as convex hulls of $k+1$ vertices, i.e. edges, triangles, tetrahedra, etc.
Homology groups are algebraic constructions defined by
$$H_k(K) = Z_k(K) / B_k(K),$$
where $k$ is a non-negative integer, $Z_k(K)$ is the vector space of $k$-dimensional cycles and $B_k(K)$ that of $k$-dimensional boundaries, obtained respectively as kernels (pre-images of zero) and images of the boundary mapping on a chain complex; for a more detailed account see (Hatcher, 2009). Typically, the rank of the homology group of dimension zero, $H_0(K)$, is the number of connected components of $K$, the rank of $H_1(K)$ is the number of one-dimensional holes in $K$, the rank of $H_2(K)$ is the number of cavities, and so on.
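To make the ranks of homology groups concrete, here is a small sketch that computes them for a simplicial complex from the ranks of its boundary matrices. It assumes coefficients in the two-element field GF(2), which is a common choice in computational topology but is not stated in the text; the function names `rank_gf2` and `betti_numbers` are illustrative.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(simplices, max_dim):
    """simplices: vertex tuples of a complex closed under taking faces.
    Returns [rank H_0, ..., rank H_max_dim] with GF(2) coefficients."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    boundary_rank = {0: 0}  # the boundary of a vertex is zero
    for k in range(1, max_dim + 2):
        index = {face: i for i, face in enumerate(by_dim.get(k - 1, []))}
        rows = []
        for s in by_dim.get(k, []):
            mask = 0
            for face in combinations(s, k):  # the (k-1)-dimensional faces of s
                mask |= 1 << index[face]
            rows.append(mask)
        boundary_rank[k] = rank_gf2(rows)
    # rank H_k = dim(cycles) - dim(boundaries) = (n_k - rank d_k) - rank d_{k+1}
    return [len(by_dim.get(k, [])) - boundary_rank[k] - boundary_rank[k + 1]
            for k in range(max_dim + 1)]

hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
filled_triangle = hollow_triangle + [(0, 1, 2)]

betti_hollow = betti_numbers(hollow_triangle, max_dim=1)
betti_filled = betti_numbers(filled_triangle, max_dim=1)
```

The hollow triangle has one connected component and one one-dimensional hole; filling it in with a 2-simplex removes the hole while leaving the number of components unchanged.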
Persistent homology, loosely speaking, is the homology of a topological space measured at different resolutions. More precisely, we study a nested sequence of topological spaces (i.e., a filtration $\mathcal{F}$) and calculate homology at every step. As an example, consider the sequence of Vietoris-Rips complexes on a point set $X$: a Vietoris-Rips complex $VR_\varepsilon(X)$ on a vertex set $X$ and a diameter $\varepsilon$ is a simplicial complex in which $\sigma = \{x_0, \ldots, x_k\}$ is a simplex iff $d(x_i, x_j) \le \varepsilon$ for every $0 \le i, j \le k$.
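An example of a filtration would be the sequence of Vietoris-Rips complexes $VR_{\varepsilon_1}(X) \subseteq VR_{\varepsilon_2}(X) \subseteq \cdots \subseteq VR_{\varepsilon_m}(X)$, where $\varepsilon_1 \le \varepsilon_2 \le \cdots \le \varepsilon_m$ (see Figure 1 (top)). In other words, persistent homology quantifies the change of topological invariants of $VR_\varepsilon(X)$ with the change of the parameter $\varepsilon$.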
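Formally,
$$H_k^{i,p} = Z_k^i / \left(B_k^{i+p} \cap Z_k^i\right),$$
where the $p$-th persistent homology group of dimension $k$ of the $i$-th filtration complex is denoted by $H_k^{i,p}$, $Z_k^i$ denotes the $k$-dimensional cycles of the $i$-th complex and $B_k^{i+p}$ the $k$-dimensional boundaries of the $(i+p)$-th complex. Intuitively, the persistent homology group records the “cycles” at the filtration step $i$ which have not become “boundaries” (i.e. which have not effectively disappeared) at filtration step $i+p$. For a detailed account see (Edelsbrunner and Harer, 2008).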
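The Vietoris-Rips construction and the nesting of the resulting filtration can be sketched in a few lines of Python. The function name `vietoris_rips`, the cap on simplex dimension and the three toy points are assumptions made for illustration; the brute-force enumeration is only practical for very small point sets.

```python
import math
from itertools import combinations

def vietoris_rips(points, eps, max_dim=2):
    """All simplices (as index tuples) up to dimension max_dim whose vertices
    are pairwise within distance eps of each other."""
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= eps for j in range(n)]
             for i in range(n)]
    simplices = []
    for k in range(max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            if all(close[i][j] for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Three toy points forming a right triangle with legs of length 1.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
complexes = [vietoris_rips(points, eps) for eps in (0.5, 1.0, 1.5)]
sizes = [len(c) for c in complexes]
```

At scale 0.5 only the three vertices are present, at scale 1.0 the two legs appear, and at scale 1.5 the hypotenuse and the triangle itself are added, so each complex contains the previous one.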
The main insight when it comes to persistent homology is that the evolution of topological invariants as the parameter $\varepsilon$ increases can be encoded compactly in the form of a persistent diagram or a barcode.
The persistent diagram is the multiset of pairs $(b, d)$, where $b$ in the pair $(b, d)$ records the appearance (or birth) of a $k$-dimensional homology group and $d$ records its disappearance (also referred to as “death”). In the event that a homology group persists, i.e. it does not disappear by the end of the filtration, we set $d = \infty$. This set of points is represented in the upper triangle of the first quadrant of the plane $\mathbb{R}^2$ (see Figure 1 (bottom)). Another representation is a barcode, in which each point $(b, d)$ is mapped to a bar with starting point $b$ and ending point $d$.
There is a natural measure of distance defined on persistent diagrams, the $\infty$-Wasserstein distance, also known in the community as the bottleneck distance, which has desirable stability properties with respect to small perturbations (Chazal et al., 2016), but is sensitive to outliers and mostly unsuitable for use in practice. On the other hand, the $p$-Wasserstein distance
$$W_p(D_1, D_2) = \left(\inf_{\gamma} \sum_{x \in D_1} \|x - \gamma(x)\|_\infty^p\right)^{1/p},$$
where $1 \le p < \infty$, and $\gamma$ ranges over all bijections between the sets of persistent intervals in diagrams $D_1$ and $D_2$, shows more potential as presented in (Chazal et al., 2018), but is computationally demanding.
In practice, many applications of persistent homology have used neither of the two distances, but have relied on ad-hoc distances between persistent diagrams, which do not have a strong backing in theory (e.g. (Bendich et al., 2016; Khrulkov and Oseledets, 2018)).
To conclude, endowed with any distance measure described above, the space of persistent diagrams is a metric space.
The method we propose for evaluating the performance of generative models rests on measuring the differences between the set of images generated by GANs and the set of original images. We measure the distance on the point cloud data in feature space. Let $X_r$ be the set of features of the real images and $X_g$ the set of features of the generated images, represented in feature space: $X_r, X_g \subset \mathbb{R}^d$.
Treating $X_r$ and $X_g$ as point cloud data in $\mathbb{R}^d$, one can calculate the distances between the points in each set. It is worth noting again that, even though we calculate all the distances using the Euclidean metric, the algorithm will effectively use only “small” distances, which is in agreement with a potential non-zero curvature of the manifold (refer to Theorem 1 and Theorem 2 for a full statement of this fact).
Assume that there are $n$ data points in each of $X_r$ and $X_g$, and let $D_r$ and $D_g$ be arrays of sorted distances among the vectors in $X_r$ and $X_g$, respectively. Then we observe the following filtration: $VR_{\varepsilon_1}(X_r) \subseteq VR_{\varepsilon_2}(X_r) \subseteq \cdots \subseteq VR_{\varepsilon_m}(X_r)$ with $\varepsilon_1 \le \varepsilon_2 \le \cdots \le \varepsilon_m$ taken from $D_r$, where the distance $\varepsilon_m$ is the minimal distance for which the corresponding Vietoris-Rips complex becomes fully connected. The same is true for $X_g$. The $0$-dimensional persistent homology groups are calculated on the aforementioned filtrations. One consequence of studying only the $0$-th dimensional persistent homology group is that the rank of the persistent homology group at time $0$ will be exactly $n$, and the persistent diagram will consist of $n$ pairs $(b_i, d_i)$, where $b_i$ denotes the point in the filtration where the observed homology group has appeared for the first time (in our case $b_i = 0$ for every $i$, due to the choice of filtration), and $d_i$ denotes the point in the filtration where the observed homology group (connected component) has merged with another one, or is equal to $\infty$ otherwise. This observation holds for both $X_r$ and $X_g$.
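The 0-dimensional persistence pairs described above can be sketched with a union-find structure: processing pairwise distances in increasing order, every edge that joins two different components marks the death of one of them, which is the same computation as Kruskal's algorithm for a minimum spanning tree. The function name `h0_persistence` and the toy points are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """0-dimensional (birth, death) pairs of the Vietoris-Rips filtration
    of a non-empty point set; all components are born at scale 0."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:        # two components merge, so one of them dies
            parent[root_i] = root_j
            pairs.append((0.0, dist))
    pairs.append((0.0, math.inf))   # the last component never dies
    return pairs

# Three toy points on a line at positions 0, 1 and 3.
diagram = h0_persistence([(0.0,), (1.0,), (3.0,)])
```

For the toy points, the components merge at distances 1 and 2, and one component survives to infinity, giving exactly as many pairs as there are points.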
As mentioned in Section 3, a commonly used distance between persistence diagrams in the field of topological data analysis is the bottleneck distance, as it has desirable stability properties. However, we have found that this distance is too sensitive to outliers and that its practical applications are limited. Hence, we have chosen to define the distance between the persistence diagrams differently, using the inherent properties of our filtration method. We assign an $n$-dimensional vector $l_r$, called the longevity vector, to the persistent diagram; it contains the sorted living times of the homology groups for the point set $X_r$. The vector $l_g$ is defined in the same way for $X_g$.
Then, we define the Topology Distance (TD) between two persistent diagrams, and consequently between two corpora of images, to be the $L_2$ distance between their longevity vectors, i.e. $\mathrm{TD}(X_r, X_g) = \|l_r - l_g\|_2$, where $l_r$ and $l_g$ are the longevity vectors of the persistent diagrams of the filtrations of the sets of original and generated image features, respectively.
Furthermore, as some persistent pairs may contain $\infty$, we assume that the difference between two infinite coordinates is $0$, and that the difference between $\infty$ and a non-infinite coordinate in our algorithm is some fixed value larger than the maximum finite longevity.
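Putting the longevity vectors and the treatment of infinite coordinates together, TD can be sketched as below. Several details are assumptions made for illustration rather than the authors' stated choices: the use of the Euclidean ($L_2$) norm, a difference of zero between two infinite coordinates, and a default penalty equal to the maximum finite longevity plus one (the hypothetical parameter `inf_penalty`).

```python
import math

def longevity_vector(diagram):
    """Sorted living times (death - birth) of a persistence diagram."""
    return sorted(death - birth for birth, death in diagram)

def topology_distance(diagram_r, diagram_g, inf_penalty=None):
    l_r, l_g = longevity_vector(diagram_r), longevity_vector(diagram_g)
    if len(l_r) != len(l_g):
        raise ValueError("both diagrams must contain the same number of pairs")
    if inf_penalty is None:
        finite = [v for v in l_r + l_g if math.isfinite(v)]
        # Assumed default: any fixed value above the maximum finite longevity.
        inf_penalty = max(finite, default=0.0) + 1.0
    total = 0.0
    for a, b in zip(l_r, l_g):
        if math.isinf(a) and math.isinf(b):
            diff = 0.0              # two infinite coordinates: no contribution
        elif math.isinf(a) or math.isinf(b):
            diff = inf_penalty      # infinite vs finite coordinate
        else:
            diff = a - b
        total += diff * diff
    return math.sqrt(total)

# Toy diagrams (assumed): all components are born at 0, one never dies.
diagram_real = [(0.0, 1.0), (0.0, 2.0), (0.0, math.inf)]
diagram_generated = [(0.0, 1.0), (0.0, 4.0), (0.0, math.inf)]

td_value = topology_distance(diagram_real, diagram_generated)
td_self = topology_distance(diagram_real, diagram_real)
```

Here the longevity vectors are (1, 2, ∞) and (1, 4, ∞), so only the middle coordinate contributes and the distance is 2; a diagram compared with itself gives 0.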
5.1 Datasets and experimental setup
We compared our proposed TD (lower is better) with IS (higher is better), FID (lower is better), KID (lower is better) and GS (lower is better), as introduced in Section 1.1. In addition to some simulated data, which we introduce in the next section, our experiments were carried out on the following four datasets: Fashion-MNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), corrupted CIFAR100 (CIFAR100-C) (Hendrycks and Dietterich, 2019) and CelebA (Liu et al., 2015). Wherever features were required for computing a metric, we used a ResNet18 model (He et al., 2016) trained from scratch for Fashion-MNIST images, and the Inception model (Szegedy et al., 2016) pretrained on ImageNet (Deng et al., 2009) for all other datasets.
5.2 Comparison with FID and KID
Basing the distance measure entirely on the first two moments (e.g., à la FID) can at times oversimplify the underlying distributions, as describing certain distributions requires the use of higher-order statistics (e.g., third or fourth moments). Furthermore, even if two distributions have identical moments of all orders, it is still possible for them to be different distributions (Romano and Siegel, 1986). This leads to the conclusion that any distance metric based entirely on moments cannot successfully distinguish between all probability distributions.
In order to assess how such theoretical considerations affect the performance of FID, we first compared TD and FID on a synthetic dataset. As shown in Figure 2, we aim to calculate the distance between a single Gaussian distribution and a mixture of two Gaussian distributions (the mixture has the same mean and variance as the single Gaussian). Given the identical first and second moments of the two point clouds in this case, FID cannot, as expected, discriminate between the two, whereas the difference is very obvious when using TD. KID has similar limitations to FID, as demonstrated in Figure 2.
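The moment-matching argument can be checked with a short worked calculation in one dimension. The symbols $\sigma$, $a$ and $s$ and the symmetric, equally weighted mixture are assumptions chosen for illustration; the synthetic data in Figure 2 need not have this exact form.

```latex
\begin{align*}
p(x) &= \mathcal{N}(x;\, 0,\, \sigma^2), \qquad
q(x) = \tfrac{1}{2}\,\mathcal{N}(x;\, -a,\, s^2) + \tfrac{1}{2}\,\mathcal{N}(x;\, a,\, s^2), \\
\mathbb{E}_q[x] &= \tfrac{1}{2}(-a) + \tfrac{1}{2}(a) = 0, \\
\mathrm{Var}_q[x] &= \mathbb{E}_q[x^2]
  = \tfrac{1}{2}\,(s^2 + a^2) + \tfrac{1}{2}\,(s^2 + a^2) = s^2 + a^2 .
\end{align*}
% Choosing s^2 + a^2 = \sigma^2 matches the first two moments of p, so the
% one-dimensional Frechet distance between the fitted Gaussians vanishes:
\[
(\mu_p - \mu_q)^2 + (\sigma_p - \sigma_q)^2 = 0 .
\]
```

For large $a$ and small $s$ the mixture is clearly bimodal, yet any metric that sees only the mean and variance assigns the two distributions a distance of zero.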
Next, we compared TD with FID and KID on real images, randomly sampled from CelebA dataset; the goal was to compare the actual images with their manipulated counterparts. More specifically, we performed three types of manipulations designed by (Liu et al., 2018), which resulted in three new image datasets; we then computed the distance between each one of these manipulated image datasets and the original image dataset, using TD, FID and KID.
The three image manipulations include: 1) pixel noise (i.e., adding a random noise to each pixel, where the noise is uniformly sampled from the following interval: times the maximum intensity of the image), 2) patch mask (7 out of 64 evenly-divided regions of each image were masked by a random pixel from the image), and 3) patch exchange (2 out of 16 evenly-divided regions of each image were randomly exchanged with each other, performed twice). Some example images after manipulation are shown in Figure 3. It is clear that the image quality increasingly worsens as we go from pixel noise, to patch mask and patch exchange; we expect to see this trend in the metrics.
However, as presented in Figure 3, FID and KID show a decreasing trend (indicating increasingly better quality) over pixel noise, patch mask and patch exchange, which is contrary to human judgement. In other words, they fail to capture the worsening of image quality expected from the qualitative assessments presented in (Liu et al., 2018), unlike TD, which captures the change in the quality of the manipulated images very well. Since all three metrics are based on the features extracted by the same Inception model, this experiment demonstrates that the superiority of TD over FID and KID is due to its effective assessment of the topological properties of the point clouds (rather than their lower-order statistics).
5.3 Comparison with GS
So far, we have attempted to demonstrate the effectiveness of topology in assessing the (dis)similarities of two point clouds. As noted earlier, both topology distance and geometry score exploit the idea of using topology to quantify dissimilarities between the latent manifolds of data. There are, however, two major differences between TD and GS. The first is in the core method and the way topology is used to construct the distance (for more details see Section 4). The second is that TD measures distances between point cloud data in the feature space, whereas geometry score is defined on raw pixels.
Figure 2c shows the heatmap of the distance matrix calculated between a single Gaussian distribution and a mixture of two Gaussian distributions using GS. It is clear that TD better discriminates between the samples from the two aforementioned distributions (see Figure 2d).
We then performed perturbation consistency comparison between TD and GS using the CIFAR100-C dataset, in which 16 different types of perturbations (grouped in four, namely noise, blur, weather and digital) are applied to the original CIFAR100 images; for each type of perturbation there are five levels of severity. 5,000 images were randomly sampled from the real dataset, and split into 10 groups (each with 500 images); for each group, scores are calculated comparing the perturbed images and the original. For every perturbation, as severity increases, the average score (across 10 groups) should increase monotonically with it.
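The consistency criterion used in this experiment can be expressed as a small check: average the scores over groups at each severity level and test whether the averages strictly increase. The function name and the toy scores below are made up for illustration and are not results from the paper.

```python
from statistics import mean

def severity_trend(scores_by_severity):
    """scores_by_severity[s] holds the per-group scores at severity level s.
    Returns the per-level averages and whether they strictly increase."""
    averages = [mean(group_scores) for group_scores in scores_by_severity]
    increasing = all(a < b for a, b in zip(averages, averages[1:]))
    return averages, increasing

# Made-up scores for three severity levels and three groups each.
toy_scores = [
    [0.10, 0.12, 0.11],
    [0.20, 0.19, 0.22],
    [0.35, 0.33, 0.36],
]
averages, is_consistent = severity_trend(toy_scores)
_, is_consistent_reversed = severity_trend(toy_scores[::-1])
```

A metric whose averages pass this check for a given perturbation type tracks the severity of that perturbation; reversing the order of the toy levels makes the check fail.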
As can be seen in Figure 4, TD captures the levels of perturbation severity better and more consistently than GS for many types of perturbations (e.g. Gaussian noise, frost, and elastic transform). This demonstrates that the trend of TD is more consistent with the perturbation trend than that of GS, which further demonstrates the advantages of using features over pixels when computing topological properties.
5.4 Comparison with IS
In our next experiment, we compare TD with IS on the CelebA dataset, where there are only face images and thus no distinct classes exist. We trained a GAN model (WGAN-GP (Gulrajani et al., 2017)) on the training set of CelebA; the original images were cropped to a size of 64 × 64, and the model was then trained on them for 200 epochs with a batch size of 64. We recorded TD and IS during the training process: every 4 epochs we fed the randomly sampled noise vector (which remained fixed across epochs) to the final model; we then computed TD and IS on the generated and real images.
Figure 5 shows the comparison of TD and IS on the CelebA dataset. TD corresponds well to the quality of the generated images (i.e., it decreases as the quality of the images improves). By contrast, IS fails to do so: at the early stages of training (before 20 epochs) it decreases as the quality of the generated images increases, contrary to what is expected, and it eventually loses its discriminative power over the remaining epochs. In summary, TD shows superiority over IS for evaluating the quality of images from datasets such as CelebA.
5.5 Pixels vs. features
Finally, we performed an ablation study to compare the use of pixels vs. features when computing TD. We trained two WGAN-GP models, one on Fashion-MNIST (for 100 epochs) and one on CIFAR10 (for 200 epochs). We then computed pixel-based and feature-based TD between real images, randomly sampled from each dataset, and images generated by WGAN-GP after different numbers of training epochs.
As can be seen clearly from Figure 6, for both datasets, feature-based TD demonstrates better performance in terms of discrimination and consistency. This can be attributed to the better generalisation of learned features compared with raw pixels, which is one of the most significant advances of deep neural networks (Bengio et al., 2013).
In this work, we introduced Topology Distance (TD), a novel metric to evaluate GANs by considering the topological structures of latent manifold of real and generated images. In a range of experiments we have compared TD with Inception Score (IS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Geometry Score (GS), and have demonstrated its advantages and superiority over them in terms of consistency with human judgement, as well as other quantitative measures of change in image quality. TD is capable of providing new insights for the evaluation of GANs, and it thus can be used in conjunction with other metrics when evaluating GANs.
- Alexandroff (1928). Über den allgemeinen Dimensionsbegriff und seine Beziehungen zur elementaren geometrischen Anschauung. Mathematische Annalen 98, pp. 617–635.
- Bendich et al. (2016). Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10 (1), pp. 198–218.
- Bengio et al. (2013). Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1798–1828.
- Bińkowski et al. (2018). Demystifying MMD GANs. In ICLR.
- Birman (1963). On existence conditions for wave operators. Izv. Akad. Nauk SSSR 27 (4), pp. 883–906.
- Borji (2018). Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446.
- Brock et al. (2019). Large scale GAN training for high fidelity natural image synthesis. In ICLR.
- Chazal et al. (2016). The structure and stability of persistence modules. Springer Briefs in Mathematics.
- Chazal et al. (2018). Robust topological inference: distance to a measure and kernel distance. J. Mach. Learn. Res. 18 (159), pp. 1–40.
- Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In CVPR.
- Dey et al. (2010). Convergence, stability, and discrete approximation of Laplace spectra. In SODA.
- Donnelly (2010). Spectral theory of complete Riemannian manifolds. Pure Appl. Math. Q. 6 (2), pp. 439–456.
- Edelsbrunner and Harer (2008). Persistent homology - a survey. Discrete Comput. Geom. 453, pp. 257.
- Goodfellow et al. (2016). Deep learning. MIT Press.
- Goodfellow et al. (2014). Generative adversarial nets. In NeurIPS.
- Gulrajani et al. (2017). Improved training of Wasserstein GANs. In NeurIPS.
- Hatcher (2009). Algebraic topology. Cambridge University Press.
- He et al. (2016). Deep residual learning for image recognition. In CVPR.
- Hendrycks and Dietterich (2019). Benchmarking neural network robustness to common corruptions and perturbations. In ICLR.
- Heusel et al. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS.
- Kac (1966). Can one hear the shape of a drum? Amer. Math. Monthly 73 (4), pp. 1–23.
- Khrulkov and Oseledets (2018). Geometry score: a method for comparing generative adversarial networks. In ICML.
- Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
- Ledig et al. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
- Liu et al. (2018). An improved evaluation framework for generative adversarial networks. arXiv preprint arXiv:1803.07474.
- Liu et al. (2015). Deep learning face attributes in the wild. In ICCV.
- Mantuano (2005). Discretization of compact Riemannian manifolds applied to the spectrum of Laplacian. Ann. Glob. Anal. Geom. 27, pp. 33–46.
- Mantuano (2008). Discretization of Riemannian manifolds applied to the Hodge Laplacian. Am. J. Math. 130 (6), pp. 1477–1508.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
- Reed et al. (2016). Generative adversarial text to image synthesis. In ICML.
- Romano and Siegel (1986). Counterexamples in probability and statistics. Chapman and Hall/CRC.
- Salimans et al. (2016). Improved techniques for training GANs. In NeurIPS.
- Szegedy et al. (2016). Rethinking the inception architecture for computer vision. In CVPR.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Zhu et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
A Comparison of Sample Quality Correlation
An essential property of a metric for evaluating GANs is that it correlates well with the quality of the generated samples. We thus performed a comprehensive comparison of image quality correlation among Fréchet Inception Distance (FID), Kernel Inception Distance (KID), Geometry Score (GS), Inception Score (IS) and our proposed Topology Distance (TD) on three datasets (i.e. Fashion-MNIST, CIFAR10 and CelebA). Specifically, for each dataset, we first trained a WGAN-GP model for 200 epochs with a batch size of 64. We then computed the score of each metric between real images and generated images every 4 epochs. Results were averaged over 10 groups, each of which consisted of 500 real images (randomly sampled from each original dataset) and the corresponding generated images. The results are shown in Figure A. Compared with the other metrics, TD demonstrates better (vs. GS and IS) or comparable (vs. FID and KID) correlation with sample quality. It is worth noting that the advantages of TD over the other metrics (as shown in the other experiments) make it stand out as a robust and strong alternative for evaluating GANs.