1 Introduction
Deep generative models, a class of deep learning methods used to synthesize data representative of some observed but typically unknown data distribution, have in recent years seen a tremendous increase in research activity. As a consequence, some of these methods can now achieve truly stunning results, e.g., very realistic images depicting people or objects that do not exist in real life. At the forefront are generative adversarial networks (GANs), a quickly expanding subclass of deep generative models following the seminal work by Goodfellow [11], which are distinguished by their ability not only to learn to synthesize data, but also to learn to assess how representative of the dataset the synthetic data is. These two learning processes are pitted against each other in an adversarial game, the goal of the former being to synthesize data so well that the latter cannot tell it apart from real data.

For many generative models, including GANs, the (so-called) generator can be seen as producing samples on a lower-dimensional manifold immersed in the high-dimensional sample space. Similarly, we herein assume that the set of real data points, of which the dataset is a representative example, forms a lower-dimensional, however unknown, manifold immersed in sample space. The goal of the generator is thereby not only to produce samples coinciding with the observed dataset, but to achieve perfect alignment between the generator manifold and the real data manifold.
Due to factors such as finite-size datasets, suboptimal training, and the choice of network architecture and its capacity, there will in general exist some degree of mismatch between these manifolds. One type of mismatch is commonly called mode collapse: all points in latent space map to a single point, or too few points, in sample space, such that the generator's output is less rich than the real data manifold. In this work, we focus on another type of mismatch: the (well-trained) generator occasionally produces unrealistic outputs, i.e., samples from outside of the real data manifold, as exemplified in Figure 1(a). This mismatch may be less obvious during training, but can be detrimental for interpolations in the latent space, a common post-processing tool for generating semantically meaningful data. E.g., using two latent points known to produce data with certain semantic features, one may use convex combinations (linear interpolations) of these to synthesize data where these features are mixed; see, e.g., Karras [14] and Chen [9]. Similarly, Bengio [4] and Radford [25] interpolate linearly between points and feature-dependent clusters identified post-training.
If the latent space is equipped with a Euclidean metric, a linear interpolation defines a shortest path, i.e., a geodesic. The image of this path under the generator will, however, in general not be a shortest path on its manifold in sample space. One may instead equip the generator's output space with a sensible and interpretable metric and pull this back to the latent space, thereby defining a Riemannian metric (see Arvanitidis [2] and Chen [8]). Searching for shortest paths in the latent space with respect to this newly defined metric will then lead to shortest paths on the manifold of the generator, but not necessarily on the true data manifold, causing the synthesis of unrealistic data, as illustrated in Figure 1(b), where the shortest path becomes blurry. To that end, this paper investigates a general yet simple approach to modify the Riemannian metric so that the mismatch between the generator and the true data manifold is taken into account, and a shortest path will only traverse regions of high data fidelity. For GANs, we show that a well-trained discriminator has the ability to provide gradient information between real and generated samples, and may thus be used as the modifier of the metric. The proposed method creates short and realistic interpolations between points in the latent space; see Figure 1(c).

As a proof of concept, a toy example shows that a mismatch between the generator image and the true data manifold already appears for simple problems: a GAN trained on the two-dimensional Swiss roll illustrates how the generator might learn a manifold larger than the data manifold (see Figure 2). Shortest paths on the image manifold thus fall off the Swiss roll, while the geodesics from our modified metric do not. The proposed method is also implemented for ProGAN (by Karras [14]) trained on the CelebA dataset (Liu [21]), illustrating its ability to outperform both linear interpolations and shortest paths by producing realistic interpolations on the generator manifold.
2 Related work
It has been shown in a number of recent works that a proper way to measure distances in the latent space is to endow it with a Riemannian metric. Arvanitidis [2] calculate shortest paths by solving an ordinary differential equation (ODE), applied to a variational autoencoder (VAE, see [17]) to account for the curvature of the latent space. Subsequently, Yang [32] show that a shortest path may also be found using a discrete analogue of the ODE in finitely many sample points along the path. Both Shao [27] and Chen [8] show similar results using discrete sample points. The latter work shows that linear interpolations are still serviceable, and hypothesizes that the reason is that the Riemannian metric is typically quite flat already, i.e., close to Euclidean. In subsequent work, Kühnel [18] develop a framework for performing statistics (such as calculating the mean value) in a non-Euclidean latent space.

A number of works attempt to flatten or otherwise address the curvature of the latent space during training of the generator, thus improving linear interpolations. In the work by Berthelot [5], the performance issues of autoencoders (see [26]) are alleviated by an additional network measuring image quality, which is used to regularize the training of the autoencoder. Similarly, for VAEs, Chen [7] propose to flatten the generator manifold by replacing the VAE's standard Gaussian prior with a customized type of hierarchical prior. Using a different approach, the same authors, Chen [6], approximate the generator manifold by a finite graph of generated samples and then use classical search algorithms to find shortest paths. In a recent work, Stolberg-Larsen [29] consider interpolations over partially disjoint generator manifolds, which could be used for datasets containing (partially disjoint) subclasses of data, for which Euclidean distances are ill-defined. Using a novel architecture of multiple generators, interpolations are done by matching the generator output using finite sample graphs.
In two recent papers by Karras [15, 16], the authors train a GAN using a generator architecture similar to networks performing style transfer, which smoothly mixes features between images of faces or other objects using an intermediate latent space in place of the conventional latent space. To enforce smooth and realistic interpolations, the curvature of the intermediate latent space is regularized during training, thereby encouraging linear interpolations to be shortest paths. In this work, we will only consider traditional GAN architectures.
Two issues arise with the approaches above: firstly, constraining the curvature might impede the quality of image generation, and secondly, it would be desirable to separate path finding from network training, as one may then reuse pretrained networks. In a paper by Laine [19], shortest paths are found by pulling back the Euclidean metric on intermediary feature spaces of the VGG19 classification network [28], whose representations have been shown to correlate well with human perception [33]. Similarly, a recent work by Arvanitidis [3] studies suitable metrics on the sample space for defining the Riemannian pullback metric. While their method is targeted toward probabilistic generators, they also consider how certain regions (in their example, blond people) can be effectively avoided in interpolating curves by assigning high cost to those regions. While our work was derived independently from their results, it may be considered a continuation of their work, showing that the existing discriminator of a GAN is a suitable yet simple ingredient for avoiding regions of unrealistic samples in interpolations.
3 Background
The mathematical convention around which deep generative modeling is built stems from statistical inference: the dataset serving as inspiration for the data synthesis is assumed to be a set of samples from some data distribution, i.e., $x \sim p_{\mathrm{data}}(x)$, and the generative process attempts to draw new samples from this distribution. Since the data distribution is generally intractable to model directly, the approach used in deep generative modeling is to construct a method for drawing samples from a conditional probability $p(x \mid z)$, where a latent space $\mathcal{Z} = \mathbb{R}^d$ is equipped with a probability distribution $p(z)$ that is simple to sample from. The method for sampling from $p(x \mid z)$ exploits the immense capabilities of deep neural networks to approximate $p_{\mathrm{data}}$ with help of a generator function $g: \mathcal{Z} \to \mathcal{X}$. While the generator may be probabilistic for some generative models, for GANs, among others, the generator is constructed using a deterministic feed-forward network, resulting in a degenerate conditional probability distribution $p(x \mid z) = \delta(x - g(z))$. Using the law of total probability, we have

$$p_g(x) = \int_{\mathcal{Z}} p(x \mid z)\, p(z)\, dz, \qquad (1)$$

and it is apparent that for the desired simple latent distribution $p(z)$, the conditional distribution $p(x \mid z)$ needs to compensate by means of complexity.
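The push-forward in (1) can be illustrated numerically. The following is a minimal sketch under the assumption of a linear toy "generator" $g(z) = Az$, chosen only because its pushed-forward distribution is known in closed form; it is not the paper's model.

```python
import numpy as np

# Sketch of the push-forward in Eq. (1): a deterministic generator g maps the
# simple latent distribution p(z) to a distribution over sample space.
# Toy assumption: g(z) = A z, so z ~ N(0, I) implies x = g(z) ~ N(0, A A^T).
rng = np.random.default_rng(0)
d, D = 2, 3                          # latent and sample space dimensions
A = rng.normal(size=(D, d))          # weights of the toy "generator"

z = rng.normal(size=(d, 100_000))    # samples from the latent prior p(z)
x = A @ z                            # push-forward through the generator

empirical_cov = np.cov(x)            # should approach A @ A.T
print(np.max(np.abs(empirical_cov - A @ A.T)))  # small sampling error
```

Here the "complexity compensation" is visible in reverse: the simple prior is reshaped entirely by the deterministic map $g$.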
Assuming a smooth generator $g: \mathcal{Z} \to \mathcal{X}$, $\mathcal{Z} = \mathbb{R}^d$, $\mathcal{X} = \mathbb{R}^D$, the image $g(\mathcal{Z})$ is a $d$-dimensional immersed manifold in the $D$-dimensional sample space, which means that it is locally homeomorphic to a $d$-dimensional Euclidean space, but might intersect itself globally. To find the most direct path on this manifold between two given samples, it must be clarified how to measure distances in the latent space in a meaningful way. In many applications, a sensible metric can be defined on the sample space instead. For images, for example, one may argue that the Euclidean distance over pixel values makes up a meaningful, yet simple, metric. Alternatively, we can measure the distance between images using the Euclidean metric in layers of a network pretrained on image data (e.g., the VGG19 network [28]). Such a metric on the sample space may then be pulled back to the latent space, so that the distance between two points in the latent space is defined by the distance between their corresponding points in the sample space along the generator manifold (see, e.g., [2, 3, 19]). To that end, let $\gamma: [0,1] \to \mathcal{Z}$ be a smooth curve between $z_1 = \gamma(0)$ and $z_2 = \gamma(1)$ in the latent space, which maps to a corresponding curve $g \circ \gamma$ on the generator manifold. Let $h$ denote a possible map from $\mathcal{X}$ into another space where Euclidean distance aligns better with human perception of distance. The length of $\gamma$ in latent space is measured by

$$L(\gamma) = \int_0^1 \|\dot{\gamma}(t)\|_{\gamma(t)}\, dt, \qquad (2)$$

where $\|\cdot\|_{z}$ denotes the length of an infinitesimal curve element at $z$ with respect to the Riemannian metric defined by the pullback of the Euclidean metric (see, e.g., do Carmo [10] and Arvanitidis [3]). This Riemannian metric is defined by the matrix $M(z) = J(z)^\top J(z)$, where $J(z)$ denotes the first-order derivative (Jacobian) of $h \circ g$ at $z$, and

$$\|\dot{\gamma}(t)\|^2_{\gamma(t)} = \dot{\gamma}(t)^\top M(\gamma(t))\, \dot{\gamma}(t) \qquad (3)$$
$$= \dot{\gamma}(t)^\top J(\gamma(t))^\top J(\gamma(t))\, \dot{\gamma}(t) \qquad (4)$$
$$= \Big\| \tfrac{d}{dt}\, (h \circ g \circ \gamma)(t) \Big\|^2. \qquad (5)$$

For Euclidean distances along the generator manifold, $h$ simply denotes the identity function, while for distances in VGG network layers, $h$ denotes its corresponding feature maps. A shortest path, called a geodesic, may thus be found by minimizing (2) with respect to $\gamma$ for fixed endpoints $z_1, z_2$. To also enforce constant speed for a parametrization of the geodesic, one may instead minimize the corresponding energy functional (see [23, p. 182]), i.e., the integral over the square of the curve element lengths,

$$E(\gamma) = \int_0^1 \|\dot{\gamma}(t)\|^2_{\gamma(t)}\, dt. \qquad (6)$$

Both (2) and (6) may be minimized numerically by solving a set of ordinary differential equations, albeit at a significant computational cost, as this involves calculating second-order derivatives of the generator (see Shao [27]).
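The pullback metric and the (discretized) energy can be sketched in a few lines. The following is an illustrative sketch, not the paper's implementation: the Jacobian is estimated by finite differences, and the toy generator mapping a 1D latent space onto the unit circle is an assumption chosen so the expected results are known analytically.

```python
import numpy as np

# Sketch of the pullback metric M(z) = J(z)^T J(z) from Eqs. (3)-(5), with the
# Jacobian J of h∘g estimated by central finite differences, and a discrete
# analogue of the energy functional (6).

def jacobian(f, z, eps=1e-5):
    """Central finite-difference Jacobian of f at z."""
    z = np.asarray(z, dtype=float)
    f0 = np.asarray(f(z))
    J = np.zeros((f0.size, z.size))
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        J[:, i] = (np.asarray(f(z + e)) - np.asarray(f(z - e))) / (2 * eps)
    return J

def pullback_metric(f, z):
    J = jacobian(f, z)
    return J.T @ J                    # M(z) = J^T J

def discrete_energy(f, zs):
    """Discrete energy: N * sum_i ||f(z_{i+1}) - f(z_i)||^2."""
    xs = np.array([f(z) for z in zs])
    return len(xs - 1) * 0 + (len(xs) - 1) * np.sum(np.diff(xs, axis=0) ** 2)

# Toy generator: 1D latent angle -> point on the unit circle in 2D.
g = lambda z: np.array([np.cos(z[0]), np.sin(z[0])])

M = pullback_metric(g, np.array([0.3]))
print(M)                              # approx. [[1.]]: circle arclength = latent length

zs = np.linspace(0.0, np.pi / 2, 101).reshape(-1, 1)
E = discrete_energy(g, zs)
print(E)                              # approx. (pi/2)^2, the squared quarter-arc length
```

For this isometric toy generator the metric is the identity, so latent straight lines are already geodesics; for a real generator $M(z)$ varies with $z$, which is exactly what makes latent-space linear interpolation suboptimal.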
In this work, we are interested in finding geodesics that follow the true data manifold, but within the image of the generator of a GAN. A GAN consists of a generator function $g$ together with a discriminator function $D$, which is trained to return high values for samples in (and representative of) the dataset, i.e., $D(x) \approx 1$ for $x \sim p_{\mathrm{data}}$, and low values otherwise. Introduced in 2014 by Goodfellow [11], GANs have been one of the most successful methods of generative modeling, with numerous applications. For a comprehensive overview of their architecture and how to train them, we refer the reader to [31] and the references therein. A variant of the original GAN based on the Wasserstein distance between the data distribution $p_{\mathrm{data}}$ and its approximation $p_g$, called WGAN-GP (see Arjovsky [1] and Gulrajani [12]), is of central interest to our work, as it has been used to train highly competitive generative models on image data.
4 An adaptive metric for generative adversarial networks
The goal of generative modeling is to match the generated distribution $p_g$ in (1) with the data distribution $p_{\mathrm{data}}$. Denote by $\mathcal{M} = \operatorname{supp}(p_{\mathrm{data}})$ the data manifold. Assuming that there is no mismatch between the generator image and the data manifold, it must hold for the deterministic GAN approach that

$$g(\mathcal{Z}) = \mathcal{M}, \qquad (7)$$
$$p_g(x) = p_{\mathrm{data}}(x) \quad \text{for all } x \in \mathcal{X}, \qquad (8)$$
$$\int_{g^{-1}(A)} p(z)\, dz = \int_{A} p_{\mathrm{data}}(x)\, dx \quad \text{for all measurable } A \subseteq \mathcal{X}, \qquad (9)$$

for the preimage of $A$ under $g$, i.e., $g^{-1}(A) = \{z \in \mathcal{Z} : g(z) \in A\}$. Since the generator is modeled as a (continuous) feed-forward network with a predetermined architecture and a simplistic latent distribution, where typically $d \ll D$, it cannot be guaranteed to find network parameters such that equations (7)–(9) hold in general. Also, since the data manifold is not known everywhere, but only at the sample points included in the finite-sized dataset, the (continuous) generator needs to extrapolate between the observed points on the data manifold; an estimation procedure that forms another source of manifold mismatch. One may distinguish between two types of mismatch: firstly, when the generator synthesizes unrealistic data, then

$$g(\mathcal{Z}) \setminus \mathcal{M} \neq \emptyset, \qquad (10)$$

and, secondly, when the generator cannot reach all points on the data manifold (mode collapse), then

$$\mathcal{M} \setminus g(\mathcal{Z}) \neq \emptyset. \qquad (11)$$
An illustration of these two mismatches can be seen in Figure 3, where unrealistic generation happens at the hole in the manifold and outside of the green set on the left, while mode collapse occurs outside of the red dashed set on the right. The mode collapse mismatch is a common issue for GANs, and many approaches exist for circumventing it; see, e.g., [22] and the references therein for a good overview. This paper instead focuses on the first type of mismatch, where the generator synthesizes samples which do not lie on the (unknown) data manifold, i.e., are unrealistic, and describes an approach for circumventing it. To that end, consider the adapted curve length
$$\tilde{L}(\gamma) = \int_0^1 \widetilde{\|\dot{\gamma}(t)\|}_{\gamma(t)}\, dt, \qquad (12)$$

where the curve length is defined by a proposed metric which is infinite outside the data manifold. For the Euclidean norm in sample space, this is achieved by the penalty function

$$F(x) = \begin{cases} 0, & x \in \mathcal{M}, \\ \infty, & x \notin \mathcal{M}, \end{cases} \qquad (13)$$

such that the adapted norm is

$$\widetilde{\|\dot{\gamma}(t)\|}_{\gamma(t)} = \big(1 + F(g(\gamma(t)))\big)\, \|\dot{\gamma}(t)\|_{\gamma(t)}, \qquad (14)$$

or, equivalently, the adapted metric is defined by the matrix

$$\tilde{M}(z) = \big(1 + F(g(z))\big)^2 M(z) \qquad (15)$$
$$\tilde{M}(z) = \big(1 + F(g(z))\big)^2 J(z)^\top J(z). \qquad (16)$$

In this formulation, $F$ is a penalty function such that any curve traversing points outside the data manifold has infinite length, ruling these out as shortest paths. However, calculating $F$ requires knowledge of the data manifold beforehand, which is infeasible, and to that end, one may approximate $F$ using some auxiliary function. Depending on the deep generative modeling method, $F$ may be approximated in different ways, however always such that the function has small values for realistic data and large values for unrealistic data. For GANs, a natural choice of auxiliary function is based on the discriminator $D$. In the original work by Goodfellow [11], the discriminator maps each data sample into the interval $[0,1]$, and its value on some $x$ is interpreted as the probability of $x$ being a sample from the real data distribution $p_{\mathrm{data}}$. So, to calculate shortest paths for a GAN, the auxiliary function may be approximated as

$$F(x) \approx -\lambda \log\big(D(x) + \varepsilon\big), \qquad (17)$$

where $\lambda$ is a hyperparameter of the model and $\varepsilon$ a small number added to ensure numerical stability when $D(x) = 0$.

For computational reasons, the shortest path may be approximately calculated using a discretized version of $\gamma$. To that end, consider a sequence of points in the latent space along the curve $\gamma$, i.e., $z_i = \gamma(t_i)$ for $0 = t_0 < t_1 < \dots < t_N = 1$, such that $z_0 = \gamma(0)$ and $z_N = \gamma(1)$. In sample space, this sequence forms a corresponding discrete path $x_i = g(z_i)$, for which the discrete analogue of the energy functional for the proposed metric becomes

$$\tilde{E} = N \sum_{i=0}^{N-1} \big(1 + F(x_i)\big)^2\, \big\| h(x_{i+1}) - h(x_i) \big\|^2, \qquad (18)$$

which we approximate, using (17), by

$$\tilde{E} \approx N \sum_{i=0}^{N-1} \big(1 - \lambda \log(D(x_i) + \varepsilon)\big)^2\, \big\| h(x_{i+1}) - h(x_i) \big\|^2. \qquad (19)$$
In order to find a geodesic between $z_0$ and $z_N$, we follow the approach of Yang [32] and model the curve $\gamma$ using a polynomial parametrization with trainable parameters, for which the geodesic can be found by minimizing the discretized objective (19) over these parameters via gradient descent or its variants.
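The whole pipeline can be sketched on a toy problem. Everything below is an illustrative assumption, not the paper's setup: the "generator" is the identity map, the "discriminator" is a hand-built function that is high on an annulus of radius about 1 (the "data") and low near the origin (a hole), the auxiliary penalty takes the log form $-\lambda \log(D(x)+\varepsilon)$, the curve is a straight line plus a trainable quadratic correction, and gradients are estimated numerically.

```python
import numpy as np

# Toy sketch of the proposed interpolation: a curve between two latent points,
# parametrized as line + trainable quadratic bump, is optimized to be short
# while staying where the "discriminator" is high (cf. the discrete objective).
rng = np.random.default_rng(0)
g = lambda z: z                                              # toy generator: identity
D = lambda x: np.exp(-4.0 * (np.linalg.norm(x) - 1.0) ** 2)  # high on the unit circle

lam, eps, N = 1.0, 1e-8, 32
z_a, z_b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])       # endpoints on the "data"
t = np.linspace(0.0, 1.0, N + 1)

def curve(w):
    # Fixed endpoints: linear base path plus a t(1-t)-weighted trainable offset.
    base = np.outer(1 - t, z_a) + np.outer(t, z_b)
    return base + np.outer(t * (1 - t), w)

def energy(w):
    xs = np.array([g(z) for z in curve(w)])
    total = 0.0
    for i in range(N):
        penalty = 1.0 - lam * np.log(D(xs[i]) + eps)         # discriminator penalty
        total += N * penalty ** 2 * np.sum((xs[i + 1] - xs[i]) ** 2)
    return total

def num_grad(f, w, h=1e-4):
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

# Small initial bend breaks the up/down symmetry (the straight line is a saddle).
w = np.array([0.0, 0.5])
for _ in range(1000):
    w -= 5e-3 * num_grad(energy, w)                          # plain gradient descent

mid = curve(w)[N // 2]
print(np.linalg.norm(mid))   # midpoint distance from the origin: the path bends off the hole
```

The straight latent line passes through the low-discriminator hole at the origin; after optimization the curve trades a slightly longer path for one that stays near the high-discriminator annulus, mirroring the behavior the method aims for on real generators.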
Adapting our method to Wasserstein GANs. In order to reuse pretrained models of high-quality generators, we aim to apply the proposed method to Wasserstein GANs, where the discriminator is usually called the critic; a terminology adopted herein. The change from discriminator to critic requires some adjustments of our method. A distinguishing difference of Wasserstein GANs from regular GANs is that the $[0,1]$-valued discriminator is replaced by one that can assign an arbitrary real number. The task of assigning high values to real and low values to generated samples remains unchanged, and a Lipschitz constraint on the critic prevents the divergence of their mean difference. Since (17) relies on the assumption of a $[0,1]$-valued discriminator, which does not hold for the critic, we normalize critic values to $[0,1]$ using an approximate maximum and minimum critic value determined from generated samples. A more difficult issue is the lack of well-defined bounds and, more generally, that critic values cannot be interpreted as a probability of the sample being real. While the critic of a WGAN provides gradient information between real and generated samples toward improved equality of the two distributions, it is possible that non-realistic samples attain high critic values, for example if they fall outside the convex hull of real and generated samples. Figure 4 shows such an example of a real and generated data distribution and a theoretically optimal Wasserstein GAN critic, where the highest critic values are found off the real manifold. (The optimality of the displayed critic follows from [30, Theorem 5.10], and Observation 4 in [24] provides another illustration, where an optimal critic assigns a higher critic value to a generated sample than to a real one.) Therefore, caution is advised when enforcing high critic values, but since our main motivation is to improve the quality of images close to a shortest path, useful gradient information is all we need.
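The critic normalization described above can be sketched as follows. The toy critic and the clipping of out-of-range values are illustrative assumptions; only the idea, rescaling by an approximate minimum and maximum estimated from generated samples, comes from the text.

```python
import numpy as np

# Sketch of critic normalization: a WGAN critic outputs arbitrary real values,
# so we rescale them to [0, 1] using an approximate min/max estimated from a
# batch of generated samples. The toy critic below is an assumption.
rng = np.random.default_rng(0)
critic = lambda x: 3.0 * x[0] - 7.0          # toy critic with arbitrary range

samples = rng.normal(size=(10_000, 2))       # stand-ins for generated samples
values = np.array([critic(x) for x in samples])
c_min, c_max = values.min(), values.max()    # approximate range of the critic

def normalized_critic(x):
    """Critic squashed to [0, 1]; clipped in case new samples fall outside."""
    return float(np.clip((critic(x) - c_min) / (c_max - c_min), 0.0, 1.0))

print(normalized_critic(samples[0]))         # some value in [0, 1]
```

The clipping is a pragmatic guard: samples encountered during path optimization may exceed the range observed when estimating `c_min` and `c_max`.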
5 Experiments
Our experiments compare the following interpolation methods: (a) Linear – linear interpolation in latent space, which is mapped by the generator onto the image manifold. (b) sqDiff – shortest path on the image manifold measured with the Euclidean metric in sample space, i.e., the minimum of (6) with $h$ being the identity function. This minimizes squared differences of images under $g$ and is a method proposed in [2, 8, 27]. (c) sqDiff+D – a herein proposed method using (14), where the metric combines sqDiff with enforcing high discriminator values. (d) VGG – shortest path on the image manifold as measured by a weighted Euclidean distance between features in different layers of the VGG network (as detailed in [19]). (e) VGG+D – a herein proposed method combining the method by Laine [19] with enforcing high discriminator values, i.e., (14) with $h$ being feature maps of the VGG network. (f) Linear in sample space – as a comparison to sqDiff, we show the shortest path in sample space with respect to its Euclidean metric, which is a straight line. This path is independent of the generator function, illustrating a completely naive interpolation.
For each of the methods sqDiff, sqDiff+D, VGG and VGG+D, we parametrize a path using a multidimensional polynomial with fixed endpoints by specifying the coefficients of the constant and the linear part. The coefficients of higher-order terms are learned by minimizing the respective objectives. For a fair comparison, all methods are trained for the same number of epochs. We find that all methods benefit from "early stopping", i.e., not training until convergence. This is theoretically justified above for methods using a Wasserstein critic, but we also find that the bias toward blurry images of sqDiff and VGG (explained below) becomes more severe with more training iterations. We describe all hyperparameter settings and implementation details in Appendix A. Our code is published under https://github.com/petzkahe/progan_geodesics.

5.1 Swiss roll
As a proof of principle, we consider the two-dimensional toy dataset of the so-called Swiss roll. We trained a GAN with a two-dimensional latent space and visualized both generated samples (in green) and discriminator values (grayscale background) in Figure 2 on page 2. By sampling from a dense grid in latent space, we can visualize how the GAN maps the two-dimensional latent space onto the Swiss roll. At the end of training, generated samples stay on the Swiss roll with high probability. Still, a small fraction of samples falls off the Swiss roll, implying large gradients of the generator at the corresponding latent points, and hence a large metric $M(z)$. Nonetheless, geodesics with respect to the method sqDiff traverse this region of unrealistic samples if the resulting path is shorter in sample space. Figure 5 compares paths obtained from Linear, sqDiff and sqDiff+D. sqDiff results in paths in sample space that are smoother than the paths of Linear but, as predicted, will often fall off the Swiss roll. Taking the discriminator information into consideration, our proposed method sqDiff+D avoids regions of predicted fake samples and is therefore able to find smooth paths following the Swiss roll.
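A two-dimensional Swiss roll of the kind used here can be generated in a few lines. The sampling range, scaling, and noise level below are assumptions for illustration; the experiment's exact dataset parameters are in Appendix A.

```python
import numpy as np

# Minimal sketch of a two-dimensional Swiss roll dataset: points on a spiral
# whose radius grows with the angle, plus Gaussian noise. Range, scale and
# noise level are illustrative assumptions.
rng = np.random.default_rng(0)

def swiss_roll_2d(n, noise=0.05):
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)       # angle = radius
    x = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / (4.5 * np.pi)
    return x + noise * rng.normal(size=(n, 2))

data = swiss_roll_2d(1000)
print(data.shape)    # (1000, 2)
```

The one-dimensional structure (radius determined by angle) is what makes the hole between the spiral's arms a natural test for off-manifold interpolations.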
5.2 MNIST
Next, we apply our method to the MNIST dataset [20]. Due to the discrete classes of digits, it is debatable how a direct path between digits should look. The method sqDiff is expected to smoothly traverse through unrealistic samples, whereas Linear may traverse many digits on a lengthy path. Figure 6 shows that our method sqDiff+D is able to find a short path of high quality with quick transitions between digits.
5.3 ProGAN on CelebA
We apply our proposed method to CelebA images [21] using the TensorFlow implementation of ProGAN by Karras [14]; a progressively grown Wasserstein GAN (see WGAN-GP [1, 12]). We download the pretrained network and reuse it in all of our experiments. As a preprocessing step, we set the feature map used for minibatch discrimination (see [13]) to zero, which is necessary to obtain well-defined critic values, independent of the batch statistics. Then, as the Wasserstein critic is not limited to $[0,1]$, we sample a large number of points in the latent space, evaluate the critic values, and note the range of values. In subsequent evaluations, all critic values are scaled by this range in order to obtain values in $[0,1]$, as used in the proposed loss.

An evaluation of the pretrained critic. Our method assumes a critic that is able to differentiate the quality of generated samples. Figure 7 (a) shows a histogram of critic values on generated (fake) and real samples for the critic trained in [14]. The significant overlap indicates that the WGAN objective is close to an equilibrium. However, random samples (shown in Appendix B.1.1) show that, despite the high quality of samples, it is still often easy for a human to tell real apart from generated samples, which ideally should be detected by the critic network. To find the relevant gradient information, we fine-tune the provided networks. We deviate from an equal learning rate for critic and generator network to a larger learning rate for the critic than for the generator, allowing the critic to gradually improve in comparison to the generator, but without worsening the generator results. (We also experimented with training the discriminator only, but found the problem of observing high critic values for unrealistic samples, as discussed above, to worsen under this training regime. At the same time, maintaining a positive, but small, learning rate for the generator kept the generator close to constant.) We pick a model after fine-tuning on 70k images, a negligible amount of additional training.
Figure 7 (b) shows the histogram of real and generated samples for this critic, indicating that the new critic has some ability to differentiate real from fake samples.
The necessity to fine-tune a critic before the application of our method could in principle be circumvented by training the ProGAN from scratch with an adapted training schedule. Otherwise, as in our case, the extra training is negligible in comparison to the training time of the provided model.
Comparison of the linear and proposed methods.
Randomly sampling points from the generator often gives images that can be easily distinguished from real samples (see B.1.1). As a result, the linear interpolation shows several artifacts from generation. If, however, two generated samples of high quality are observed, then we find that the linear interpolation in latent space often results in curves of high-quality images, a fact that has been previously observed by Chen in [8]. We distinguish three scenarios for comparing our proposed method to the linear interpolation in latent space: (i) Linear interpolation results in images of low quality, as shown in Figure 9. In this case, our method is able to find a much better path of high quality. (ii) Linear interpolation shows samples of high quality, but the path is not a direct path, as shown in Figure 10 (left). In this case, our method finds a more direct path of similarly high quality. (iii) Linear interpolation gives a direct path of high quality, as shown in Figure 10 (right). In this case, our method does not change the path and approximately agrees with the linear method.
Comparison of sqDiff and VGG with the proposed methods. We find that sqDiff and VGG both work well in finding a short and direct path, which can help to avoid bad regions crossed by Linear. However, we find that both methods often result in blurry images, with VGG slightly outperforming sqDiff. Despite a shorter path, this often results in paths of lower image quality than linear interpolation in latent space; see Figures 9 and 10 (left). Our method additionally enforces high discriminator values. It thereby follows the short paths obtained from these methods, but results in higher-quality samples with images that are more crisp. Figure 11 shows critic values along the curve of Figure 9 (left) as well as critic values averaged over several paths. Reflecting the visual analysis, the linear method usually obtains higher critic values than both sqDiff and VGG, but our method improves the poor image quality of the shortest path by realizing higher critic values close by.
We explain the blurriness observed for both sqDiff and VGG by a curve that follows the requested objective too closely. If the generator were capable of producing any image in the sample space, then the shortest path according to sqDiff would be linear interpolation in sample space, which results in a blurry path, as visualized for comparison in Figure 8. The trained generator is expected to generate real-looking faces from anywhere in latent space, so we cannot find this curve exactly, but due to a mismatch between the manifolds of real and generated images, similar paths can be found. Linear interpolation in feature spaces of the VGG network improves upon this, but some blurriness remains.
Limitations. The success of the proposed method relies on discriminator (or critic) values that can distinguish samples of good quality from samples of poor quality. For the Swiss roll, the good performance of the discriminator can be seen from the background color in Figure 6. For the other datasets, we find that the discriminator is not always reliable on single images, revealing a peculiar property of the trained critic. By evaluating the critic on small perturbations in the latent space, the critic values cover a large part of the critic's range without a significant visual change of the generated image. We showcase this phenomenon in Figure 17 in Appendix B.1.4 for the CelebA dataset.
The reasons are potentially twofold. For one, a general property of 1-Lipschitz functions from a high-dimensional space of dimension $D$ to a one-dimensional space is that small changes of size $\varepsilon$ in all input dimensions can change the output by $\varepsilon \sqrt{D}$. Here, to cover a range of $[0,1]$ in critic value, a change of all pixels by $1/\sqrt{D}$ can be sufficient. Secondly, the use of minibatch discrimination may disrupt the correlation between critic value and subjective realism, both locally and globally. To consistently assess the realism of shortest paths, we remove the minibatch standard deviation feature map, but the relative ordering of samples may still not correspond to image quality.
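The dimension argument above can be verified numerically. The function below is an assumption chosen to make the effect exact, and the dimension is only an example image size; it is not the experiment's critic.

```python
import numpy as np

# Numeric illustration: a 1-Lipschitz function f from R^D to R can change its
# output by a full unit when every input coordinate moves by only 1/sqrt(D).
# The linear f below is a toy assumption that attains the Lipschitz bound.
D = 128 * 128 * 3                     # e.g., the number of pixel values
u = np.ones(D) / np.sqrt(D)           # unit vector, so f is exactly 1-Lipschitz
f = lambda x: float(u @ x)

x = np.zeros(D)
delta = np.full(D, 1.0 / np.sqrt(D))  # per-pixel change of 1/sqrt(D)
print(f(x + delta) - f(x))            # approx. 1.0: the whole critic range
```

Since `delta` has Euclidean norm 1, the Lipschitz constraint permits the full unit change despite each pixel moving by a visually negligible amount.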
This observation implies that regions of poor image quality cannot always be avoided on the CelebA dataset by enforcing high critic values. Indeed, we find that the proposed method cannot always recover from poor image quality along the interpolating curves. However, we still find that VGG+D outperforms all competing methods, resulting in paths that are always at least as good as the interpolations found with the other methods. In particular, enforcing high critic values escapes blurry regions traversed by the methods sqDiff and VGG.
Further, we note that some care must be taken in selecting and fine-tuning the critic network to prevent biased interpolations. By only fine-tuning the pretrained ProGAN's critic network, we obtained a critic that favored bright lighting conditions. As a direct consequence, our method became biased toward white people and appeared to suffer from the overshooting problem discussed above, converging to extremely bright samples (see B.1.3).
Summary of results. Taken together, we find that our proposed method is always at least as good as linear interpolation in latent space, but can improve upon it in cases of failure. While the competing methods sqDiff and VGG alone result in direct paths between two endpoints, they produce blurry images. Our method exploits gradient information from the critic network, enabling us to find direct paths of higher quality.
6 Discussion and Conclusion
At a theoretical equilibrium of the GAN objective, there is no gradient information from the discriminator between generated and real images. If the generator optimally matches the true data distribution, then the lack of discriminator information reduces the proposed methods to sqDiff and VGG, respectively. As a result, our method is only interesting for imperfect generators that do not match the true data distribution, which is the practical case for state-of-the-art GANs. For such a generator, our results show that the discriminator information can be used to improve the sample quality of interpolating curves. Further improvements with the proposed method can be expected from tailored training schedules for GANs that produce strong discriminators in the setting of sparse realistic datasets.
To conclude, we have presented a method for incorporating both networks of a GAN into learning realistic interpolations on the generator's image manifold. The proposed method is lightweight, as it is post-hoc; it reuses the discriminator that is learned in parallel to the generator. Incorporating discriminator values together with meaningful distances into a metric can avoid regions where the generator image and the real data distribution do not align, resulting in paths that stay on the real data manifold. In essence, high-quality paths were achieved by discriminating against unrealistic samples.
References
 [1] (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML).
 [2] (2018) Latent space oddity: on the curvature of deep generative models. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
 [3] (2021) Geometrically enriched latent spaces. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS).
 [4] (2013) Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML).
 [5] (2019) Understanding and improving interpolation in autoencoders via an adversarial regularizer. In International Conference on Learning Representations (ICLR).
 [6] (2019) Fast approximate geodesics for deep generative models. In Artificial Neural Networks and Machine Learning (ICANN): Deep Learning.
 [7] (2020) Learning flat latent manifolds with VAEs. Preprint arXiv:2002.04881.
 [8] (2018) Metrics for deep generative models. In International Conference on Artificial Intelligence and Statistics (AISTATS).
 [9] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29 (NeurIPS).
 [10] (1992) Riemannian Geometry. Birkhäuser. ISBN 9783764334901.
 [11] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NeurIPS).
 [12] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (NeurIPS).
 [13] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
 [14] (2018) Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
 [15] (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
 [16] (2020) Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
 [17] (2014) Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
 [18] (2018) Latent space non-linear statistics. Preprint arXiv:1805.07632.
 [19] (2018) Feature-based metrics for exploring the latent space of generative models. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
 [20] (2010) MNIST handwritten digit database. AT&T Labs. Available: http://yann.lecun.com/exdb/mnist
 [21] (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
 [22] (2018) Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning (ICML).
 [23] (2016) Riemannian Geometry. Graduate Texts in Mathematics 171, Springer International Publishing. ISBN 9783319266541.
 [24] (2018) On the regularization of Wasserstein GANs. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
 [25] (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR).
 [26] (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.).
 [27] (2018) The Riemannian geometry of deep generative models. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
 [28] (2015) Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
 [29] (2021) Atlas generative models and geodesic interpolation. Preprint arXiv:2102.00264.
 [30] (2008) Optimal Transport: Old and New. Vol. 338, Springer Science & Business Media.
 [31] (2021) Generative adversarial networks in computer vision: a survey and taxonomy.
 [32] (2020) Geodesic clustering in deep generative models. Preprint arXiv:1809.04747.
 [33] (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Appendix A Training setup
A.1 Polynomial parametrization
To find shortest paths, i.e., geodesics, with respect to the different metrics, we parametrize a multidimensional polynomial
$c(t) = \sum_{k=0}^{K} a_k t^k$, where each $a_k \in \mathbb{R}^{d}$ for $d$ the dimension of the latent space. When learning shortest paths by minimizing curve lengths over the polynomial coefficients, we do not want the start and end point to change. To that end, we consider the coefficients $a_2, \dots, a_K$ as free parameters and calculate the constant part $a_0$ and the linear part $a_1$ to satisfy the constraints at the endpoints.
The standard choice for the parametrization of curves is the time interval $[0,1]$. For higher polynomial degrees, small values of $t$ result in vanishing contributions $t^k$. To mitigate this problem, we instead parametrize all curves over an interval shifted away from zero.
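The endpoint-constrained parametrization above can be sketched as follows; a minimal NumPy illustration in which the function name and the choice of evaluation interval are ours, not the paper's implementation:

```python
import numpy as np

def poly_curve(z0, z1, free_coeffs, t0=0.0, t1=1.0):
    """Polynomial curve c(t) = a0 + a1*t + sum_{k>=2} a_k t^k with
    a_2..a_K free; a0 and a1 are solved for so that c(t0)=z0, c(t1)=z1."""
    z0 = np.asarray(z0, dtype=float)
    z1 = np.asarray(z1, dtype=float)
    A = [np.asarray(a, dtype=float) for a in free_coeffs]  # a_2 .. a_K

    def higher(t):
        # contribution of the free coefficients: sum_{k>=2} a_k t^k
        return sum(a * t ** (k + 2) for k, a in enumerate(A))

    # solve the linear system for a0, a1 enforcing the endpoint constraints
    b0 = z0 - higher(t0)
    b1 = z1 - higher(t1)
    a1 = (b1 - b0) / (t1 - t0)
    a0 = b0 - a1 * t0

    def c(t):
        return a0 + a1 * t + higher(t)

    return c
```

With this construction, gradient descent on the free coefficients can never move the endpoints, since $a_0$ and $a_1$ are recomputed from the constraints.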
A.2 Parameters of implementation
Our implementation allows the selection of several parameters for learning interpolating curves. The following list comprises the most relevant parameters:

D: The discriminator model to use for the methods sqDiff+D and VGG+D.

Start, End: Seeds for the start and end point of the path.

n_interp_pts: The number of interpolation points on the curve.

poly_degree: The polynomial degree of the interpolating curve.

n_train_steps: The number of training steps used to update the curve parameters with the Adam optimizer when minimizing the path-length objectives with respect to the different metrics.

learn_rate: The learning rate used for the Adam optimizer minimizing path length.

hyperparam: The hyperparameter controlling the strength of the regularization of the curves by discriminator values, as in (4), divided by the dimension of the image space.

coefficient_init: The polynomial coefficients to learn are initialized uniformly at random in an interval of this size.

Methods: Choice of method to use. Our implementation features linear interpolation, sqDiff, sqDiff+D, VGG, VGG+D, linear interpolation in sample space, and searching for high discriminator values only.
For each dataset, we manually searched for suitable hyperparameters based on visual exploration. Once we found a good setting, we used that choice of hyperparameters for the generation of all results.
Differences in hyperparameter settings between the Swiss Roll versus the MNIST and CelebA-HQ experiments can be explained by using a regular GAN with a $[0,1]$-valued discriminator for the Swiss Roll, while using a Wasserstein GAN for the other experiments (see Section 4, Adapting our method to Wasserstein GANs).
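The interplay of the parameters above can be sketched as a discretized, discriminator-regularized path objective. This is an illustrative stand-in for the paper's objective (4), which is defined earlier in the paper; the names `gen`, `critic`, and `lam` are placeholders, not the exact implementation:

```python
import numpy as np

def path_objective(points, gen, critic, lam=1.0):
    """Discretized path energy in sample space plus a penalty that
    discourages low critic values (unrealistic samples) along the path."""
    imgs = np.stack([np.asarray(gen(p), dtype=float) for p in points])
    # squared differences between consecutive generated samples
    energy = float(np.sum((imgs[1:] - imgs[:-1]) ** 2))
    # reward high critic values along the path
    penalty = -float(np.mean([critic(im) for im in imgs]))
    return energy + lam * penalty
```

In the implementation, the `points` would be evaluations of the polynomial curve at `n_interp_pts` interpolation points, minimized over `n_train_steps` Adam steps with learning rate `learn_rate`.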
A.3 More implementation details
A.3.1 Swiss Roll
Examining the discriminator values for the GAN trained on the Swiss Roll in Figure 5, one sees a thin black region of low values between the origin and the south-east corner. Following, e.g., linear paths crossing this region, one sees (in sample space) that these paths fall off the Swiss Roll's real data manifold by jumping across the roll instead of following along it. During GAN training, this bad region of latent points is apparently not removed but shrunk, becoming very thin. Nonetheless, interpolating across it is detrimental. During training of sqDiff and the proposed sqDiff+D, we found that, starting from linear interpolation, there was not always enough gradient information available to escape the unrealistic path. Instead, we trained an ensemble of paths with the same endpoints but different (and larger) initializations, and after training chose the path with the smallest objective. In addition, because the bad region of latent points was so thin, and to make the selection of the hyperparameter less important, we replaced the discrete discriminator values along the paths by the geometric average of discriminator values between adjacent points, thus widening and flattening the region somewhat.
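The geometric averaging of adjacent discriminator values can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def geometric_pairwise_average(d_vals):
    """Replace the discrete discriminator values along a path by the
    geometric mean of adjacent values, which widens and flattens thin
    regions of low discriminator values."""
    d = np.asarray(d_vals, dtype=float)
    return np.sqrt(d[:-1] * d[1:])
```

A single very low value between two high neighbors is thereby spread over both adjacent intervals instead of being missed by the discretization.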
Appendix B Additional Experiments
B.1 CelebA-HQ
B.1.1 A grid of samples
In Figure 12, we examine a set of synthetic CelebA-HQ images (from our fine-tuned ProGAN) and corresponding critic values, sampled using latent codes drawn from the prior, i.e., randomly from the generator manifold. Many images are very realistic, meaning that they would fool a human examiner, but a surprising number (perhaps 20% in this set) are still quite unrealistic, exhibiting unnatural facial distortions or a varying amount of artifacts in the foreground or background, which would not fool a human examiner. Indeed, to successfully utilize this generator in real-world applications, some points in the latent space will need to be avoided.
To that end, Figures 13 and 14 showcase the discriminator's ability to separate the most realistic from the least realistic generations by displaying sets of images with low and high discriminator values, respectively. These images form the two tails of the discriminator value distribution, sampled from a collection of 20k generated images. Judging by the number of unrealistic examples in each set, the separation is unfortunately not perfect, but unrealistic images are in the majority among the worst-scored samples and in a clear minority among the best.
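Selecting the two tails of the discriminator-value distribution amounts to a simple sort; a minimal sketch (the function name is ours):

```python
import numpy as np

def score_tails(scores, k):
    """Return indices of the k lowest- and k highest-scoring samples,
    i.e., the two tails of the discriminator-value distribution."""
    order = np.argsort(scores)       # ascending by score
    return order[:k], order[-k:]
```

Applied to the 20k critic values, the first index set yields the images of Figure 13 and the second those of Figure 14.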
B.1.2 Additional geodesics
Below follow some more examples of trained paths for the ProGAN trained on CelebA-HQ, for (a) Linear, (b) sqDiff, (c) sqDiff+D, (d) VGG, and (e) VGG+D.
Squared differences along paths
We show mean squared differences along 100 interpolation points for the twelve geodesics used for Figure 11, confirming that these are minimized by the method sqDiff. Similarly, VGG leads to small squared differences in image space, explaining the perceived blurriness. While Linear leads to large squared differences along the geodesic, our proposed methods sqDiff+D and VGG+D result in shorter paths (as measured by the squared differences in sample space). We explain the short distances close to the endpoints by the fact that, in the implementation of [14], latent vectors are projected onto a sphere before application of the generator, so that larger distances on the sphere in latent space appear when coordinates cross the zero point.
Squared VGG feature differences along paths
We show squared feature differences of the VGG19 network as a measure of distance covered along paths. Following [19], we consider a weighted mean (by the reciprocal of the layer dimensions) of the layers conv_1_2, conv_2_2, conv_3_2, conv_4_2, and conv_5_2. We find that VGG minimizes this measure close to the endpoints, but obtains slightly larger values than sqDiff in the middle of the path. In general, sqDiff and VGG cover similar distances, confirming the visual evaluation that both result in blurry paths. Apparently, minimizing distances in feature spaces defined by a continuous feature map is partially achieved by minimizing distances in image space, which is not entirely surprising. We also note that we do not train until convergence, which may explain the parts where VGG obtains slightly larger values than sqDiff. The comparison to the other curves is similar as above: Linear covers the largest distance, while the proposed methods sqDiff+D and VGG+D settle in the middle.
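A sketch of the weighted feature distance as we read [19]: each layer's squared difference is weighted by the reciprocal of its dimensionality. The feature extraction itself (e.g., the VGG19 activations of the listed conv layers) is assumed to happen elsewhere, and the function name is ours:

```python
import numpy as np

def weighted_feature_distance(feats_a, feats_b):
    """Mean over layers of squared feature differences, with each layer
    weighted by the reciprocal of its dimensionality so that large
    layers do not dominate the distance."""
    dists = [np.sum((fa - fb) ** 2) / fa.size
             for fa, fb in zip(feats_a, feats_b)]
    return sum(dists) / len(dists)
```

Summing such distances between consecutive interpolation points gives the per-path measure plotted for the twelve geodesics.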
B.1.3 Biased discriminators
Starting from the model trained by [14] and fine-tuning the discriminator only (on 2000 images) resulted in a critic favoring high-quality images, but with a bias toward white people and bright illumination. All hyperparameter settings are unchanged from the images above in the following biased interpolations:
B.1.4 Local perturbations in latent space
Figure 17 shows generated samples and their critic values. Each row corresponds to images generated from small, local perturbations of the same latent vector. While the generated samples do not change significantly, the critic values cover a large part of the overall range of critic values. This showcases that critic values on single images may not always provide useful information for improving sample quality.
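The perturbation experiment can be sketched as follows; `generator` and `critic` stand in for the trained networks, and all names and defaults here are illustrative:

```python
import numpy as np

def critic_range_under_perturbation(z, generator, critic,
                                    sigma=0.05, n=16, seed=0):
    """Evaluate the critic on images generated from small Gaussian
    perturbations of a latent vector z, returning the min and max
    critic value observed."""
    rng = np.random.default_rng(seed)
    vals = [critic(generator(z + sigma * rng.standard_normal(z.shape)))
            for _ in range(n)]
    return min(vals), max(vals)
```

A wide gap between the returned extremes, for visually indistinguishable samples, is exactly the effect illustrated in Figure 17.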
B.2 MNIST
In the following figures, we illustrate some additional examples of trained paths for the ProGAN trained on MNIST using the same settings as above. Rows illustrate (a) Linear on top, (b) sqDiff in the middle, and (c) sqDiff+D at the bottom.
B.3 LSUN-Cars
We additionally experimented on the LSUN-Cars dataset. Both methods sqDiff and VGG result in very blurry images on this dataset when trying to minimize the path length in sample space or VGG feature space. This is detrimental to our method, which tries to improve upon the short path by incorporating critic values into the metric. We also found the Linear method to work well when the endpoints were chosen similarly, but otherwise Linear fails to find a good interpolation. Whenever Linear failed, we were not able to find short and realistic paths using the proposed method VGG+D, but the experiments show the possibility of improving the image quality significantly over VGG by using discriminator information. This is demonstrated by the following two figures, where the first choice of hyperparameter balances shortest path and discriminator information, while the second puts emphasis on discriminator information, departing significantly from the shortest path.
More examples comparing (a) Linear, (d) VGG and (e) VGG+D.
B.4 LSUN-Bedrooms
For the LSUN dataset of bedrooms, we exemplify here that our proposed method again recovers sharp images from blurry images along paths using VGG, and that the paths found by VGG+D can be more direct (see in particular the change of color in the third example).