# Evaluating Generative Models Using Divergence Frontiers

Despite the tremendous progress in the estimation of generative models, the development of tools for diagnosing their failures and assessing their performance has advanced at a much slower pace. Recent developments have investigated metrics that quantify which parts of the true distribution are modeled well and, conversely, what the model fails to capture, akin to precision and recall in information retrieval. In this paper, we present a general evaluation framework for generative models that measures the trade-off between precision and recall using Rényi divergences. Our framework provides a novel perspective on existing techniques and extends them to more general domains. As a key advantage, it allows for efficient algorithms that are directly applicable to continuous distributions without discretization. We further showcase the proposed techniques on a set of image synthesis models.

## 1 Introduction

Deep generative models, such as generative adversarial networks Goodfellow et al. (2014) and variational autoencoders Kingma and Welling (2014); Rezende et al. (2014), have recently risen to prominence due to their ability to model high-dimensional complex distributions. While we have witnessed a tremendous growth in the number of proposed models and their applications, a comprehensive set of quantitative evaluation measures is yet to be established. Obtaining sample-based quantities that can reflect common issues occurring in training generative models, such as “mode dropping” (failing to adequately capture all the modes of the target distribution) or “oversmoothing” (inability to produce the high frequency characteristics of points in the true distribution) remains a key research challenge.

Currently used metrics, such as the inception score (IS) Salimans et al. (2016) and the Fréchet inception distance (FID) Heusel et al. (2017), produce single-number summaries quantifying the goodness of fit. Thus, even though they can detect poor performance, they cannot shed light on the underlying cause. Recently, Sajjadi et al. (2018) and later Kynkäänniemi et al. (2019) have offered an alternative view, motivated by the notions of precision and recall in information retrieval. Intuitively, the precision captures the average “quality” of the generated samples, while the recall measures how well the target distribution is covered. It was demonstrated that such metrics can disentangle these two common failure modes.

Unfortunately, these recent approaches rely on data quantization and do not provide a theory that can be directly used with continuous distributions. For example, in Sajjadi et al. (2018) the data is first clustered and then the resulting class-assignment histograms are compared. In Kynkäänniemi et al. (2019) the space is covered with hyperspheres, and the resulting metric is only sensitive to the size of the overlap of the supports of the distributions.

In this work, we present an evaluation framework based on the Pareto frontiers of Rényi divergences that encompasses these previous contributions as special cases. Beyond this novel perspective on existing techniques, we provide a general characterization of these Pareto frontiers, in both the discrete and continuous case. This in turn enables efficient algorithms that are directly applicable to continuous distributions without the need for discretization.

Contributions  (1) We propose a general framework for comparing distributions based on the Pareto frontiers of statistical divergences. (2) We show that the family of Rényi divergences is particularly well suited for this task and derive curves that can be interpreted as precision-recall tradeoffs. (3) We develop tools to compute these curves for several widely used families of distributions. (4) We show that the recently popularized definitions of precision and recall correspond to specific instances of the proposed framework (Sajjadi et al., 2018; Kynkäänniemi et al., 2019). In particular, we give a theoretically sound geometric interpretation of the definitions and algorithms in (Sajjadi et al., 2018; Kynkäänniemi et al., 2019). (5) We showcase the utility of the proposed framework by applying it to state-of-the-art deep generative models.

## 2 Precision-Recall Tradeoffs for Generative Models

The central problem considered in the paper is the development of a framework that formalizes the concepts of precision and recall for arbitrary measures, and enables the development of principled evaluation tools. Namely, we want to understand how a learned model, henceforth denoted by $Q$, compares to the target distribution $P$. Informally, to compute the precision we need to estimate how much probability $Q$ assigns to regions of the space where $P$ has high probability. Alternatively, to compute the recall we need to estimate how much probability $P$ assigns to regions of the space that are likely under $Q$.

We start by developing an intuitive understanding of the problem with simple examples where the relationship between $P$ and $Q$ is easily understandable. Figure 1 illustrates the case where $P$ and $Q$ are uniform distributions with supports $\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$. To help with the exposition of our approach in the next section, we also introduce the distributions $R_\cup$ and $R_\cap$ which are uniform on the union and intersection of the supports of $P$ and $Q$ respectively. The loss in precision can then be understood to be proportional to the measure of $\mathrm{supp}(Q) \setminus \mathrm{supp}(P)$, which corresponds to the "part of $Q$ not covered by $P$". Analogously, the loss in recall of $Q$ w.r.t. $P$ is proportional to the size of $\mathrm{supp}(P) \setminus \mathrm{supp}(Q)$, which represents the "part of $P$ not covered by $Q$". Note that we can also write these sets in terms of the auxiliary distributions, e.g., as $\mathrm{supp}(Q) \setminus \mathrm{supp}(R_\cap)$ and $\mathrm{supp}(P) \setminus \mathrm{supp}(R_\cap)$ respectively. The precision and recall are then naturally maximized when $P = Q$. When the distributions are discrete, we expect tradeoffs similar to those in Figure 1. In particular, the first column corresponds to a $Q$ which fails to model one of the modes of $P$, and the second column to a $Q$ which has an “extra” mode. We would like our framework to mark these two failure modes as losses in recall and precision, respectively. The third column corresponds to $P = Q$, followed by a case where $P$ and $Q$ have disjoint support. Finally, for the last two columns, a possible precision-recall tradeoff is illustrated.

While this intuition is satisfying for uniform and categorical distributions, it is unclear how to extend it to continuous distributions that might be supported on the complete space.

## 3 Divergence Frontiers

To formally extend these ideas to the general case, we will introduce an auxiliary distribution $R$ that is constrained to be supported only on those regions where both $P$ and $Q$ assign high probability. (This is in contrast to Sajjadi et al. (2018), who require $P$ and $Q$ to be mixtures with a shared component.) Informally, this should act as a generalization to the general case of $R_\cap$, which was the measure on the intersection of the supports of $P$ and $Q$. Then, the discrepancy between $P$ and $R$ measures the space that is likely under $P$ but not under $Q$, which can be seen as a loss in recall. Similarly, the discrepancy between $Q$ and $R$ quantifies the size of the space where $Q$ assigns probability mass but $P$ does not, which can be interpreted as a loss in precision.

Hence, we need both a mechanism to measure distances between distributions and a means to constrain $R$ to assign mass only where both $P$ and $Q$ do. For example, if $P$ and $Q$ are both mixtures of several components, $R$ should assign mass only to the components shared by both $P$ and $Q$.

A dual view  Alternatively, building on the observation from the previous section that both $R_\cap$ and $R_\cup$ can be used to define precision and recall, instead of modeling the intersection of $P$ and $Q$, we can use an auxiliary distribution $R$ to approximate the union of the high-probability regions of $P$ and $Q$. Then, using a similar analogy as before, the distance between $P$ and $R$ should measure the loss in precision, while the distance between $Q$ and $R$ should measure the loss in recall. In this case, $R$ should give non-zero probability to any part of the space where either $P$ or $Q$ assigns mass. When $P$ and $Q$ are both mixtures of several components, $R$ has to be supported on the union of all mixture components.

As a result, the choice of the statistical divergence between $P$, $Q$ and $R$ becomes paramount.

### 3.1 Choice of Divergence

To be able to constrain $R$ to assign probability mass only in those regions where $P$ and $Q$ do, we need a measure of discrepancy between distributions that penalizes under- and over-coverage differently. Even though the theory and concepts introduced in this paper extend to any such divergence, we will focus our attention on the family of Rényi divergences. Not only do they exhibit such behavior, but their properties are also well studied in the literature, which we can leverage to develop a deeper understanding of our approach and to design efficient computational tools.

###### Definition 1 (Rényi Divergence Rényi (1961)).

Let $P$ and $Q$ be two measures such that $P$ is absolutely continuous with respect to $Q$, i.e., any measurable set with zero measure under $Q$ also has zero measure under $P$. Then, the Rényi divergence of order $\alpha$ is defined as

$$D_\alpha(P, Q) = \frac{1}{\alpha - 1} \log \int \left(\frac{dP}{dQ}\right)^{\alpha - 1} dP, \tag{1}$$

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative (equal to the ratio of the densities of $P$ and $Q$ when they both exist).

The fact that Rényi divergences are sensitive to how the supports of and relate to one another is already hinted by the constraint in the definition, which requires that . Furthermore, by increasing the divergence becomes “less forgiving” — for example if and are Gaussians with deviations and , we have that increases faster as when drops below , while grows with increasing and , which we illustrate in Figure 2. This is exactly the property that we need to be able to define meaningful concepts of precision and recall. For a detailed analysis we point the reader to Minka et al. (2005).

Rényi divergences have been extensively studied in the literature and many of their properties are well understood. For example, they are non-negative, zero only if the distributions are equal a.s., and increasing in $\alpha$ (Van Erven and Harremos, 2014). Some of their orders are closely related to the Hellinger and $\chi^2$ divergences, and it can be further shown that $\lim_{\alpha \to 1} D_\alpha(P, Q) = D_{\mathrm{KL}}(P \,\|\, Q)$.
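To make these properties concrete, here is a minimal sketch of the Rényi divergence for discrete distributions, where the integral in (1) becomes the sum of $p_i^\alpha q_i^{1-\alpha}$. The function name `renyi_divergence` and the example vectors are illustrative assumptions, not part of the paper; the orders $1$ and $\infty$ are handled through their standard limits.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete distributions given as lists of probabilities."""
    # For alpha >= 1 the divergence is infinite unless p is absolutely continuous w.r.t. q.
    if alpha >= 1 and any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf
    if alpha == 1:
        # Limit alpha -> 1: the KL divergence.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    if math.isinf(alpha):
        # Limit alpha -> infinity: log of the largest probability ratio.
        return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0))
    # Discrete form of Eq. (1): sum_i p_i^alpha * q_i^(1 - alpha).
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    if total == 0.0:
        return math.inf  # disjoint supports
    return math.log(total) / (alpha - 1.0)

# Illustrative example vectors (assumed, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
orders = [0.5, 0.999, 1.0, 2.0, 50.0, math.inf]
values = [renyi_divergence(p, q, a) for a in orders]
for a, v in zip(orders, values):
    print(f"alpha={a}: {v:.4f}")
```

The printed values grow with the order, and the value at an order just below one is close to the KL divergence, in line with the properties listed above.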

### 3.2 Divergence Frontiers

Having defined a suitable discrepancy measure, we are ready to define the central objects which will play the role of precision-recall curves for arbitrary measures. To do so, we will not put hard constraints on $R$, but only softly enforce them. Namely, consider the case when we want $R$ to model the intersection of the high likelihood regions of $P$ and $Q$. Then, if it fails to do so, either $D_\alpha(R, P)$ or $D_\alpha(R, Q)$ will be significantly large. Similarly, if $R$ fails to assign large probabilities to the high likelihood regions of both $P$ and $Q$, at least one of $D_\alpha(P, R)$ and $D_\alpha(Q, R)$ will be large. Thus, we will only consider those $R$ that simultaneously minimize both divergences, which motivates the following definition.

###### Definition 2 (Divergence frontiers).

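For any two measures $P$ and $Q$, any class of measures $\mathcal{M}$ and any order $\alpha$, we define the exclusive realizable region as the set

$$\mathcal{R}^\cap_\alpha(P, Q \mid \mathcal{M}) = \{(D_\alpha(R, Q), D_\alpha(R, P)) \mid R \in \mathcal{M}\}, \tag{2}$$

and the inclusive realizable region $\mathcal{R}^\cup_\alpha(P, Q \mid \mathcal{M})$ by swapping the arguments of $D_\alpha$ in (2). The exclusive and inclusive divergence frontiers are then defined as the maximal points of the corresponding realizable regions, i.e.,

$$\mathcal{F}^\cup_\alpha(P, Q \mid \mathcal{M}) = \{(\pi, \rho) \in \mathcal{R}^\cup_\alpha(P, Q \mid \mathcal{M}) \mid \nexists\, (\pi', \rho') \in \mathcal{R}^\cup_\alpha(P, Q \mid \mathcal{M}) \text{ s.t. } \pi' < \pi \text{ and } \rho' < \rho\},$$

and $\mathcal{F}^\cap_\alpha(P, Q \mid \mathcal{M})$ is defined analogously by replacing $\mathcal{R}^\cup_\alpha$ with $\mathcal{R}^\cap_\alpha$.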

In other words, we want to compute the Pareto frontiers of the multi-objective optimization problem with the divergence minimization objectives $(D_\alpha(R, Q), D_\alpha(R, P))$ (exclusive), and $(D_\alpha(Q, R), D_\alpha(P, R))$ (inclusive), respectively. In machine learning such divergence minimization problems often appear in approximate inference. Interestingly, the exclusive objectives are the central object one minimizes in variational inference (VI) (Wainwright et al., 2008, §5); Li and Turner (2016), while the inclusive objectives are exactly the objectives in expectation propagation (EP) Minka (2001); Minka et al. (2005). Hence, the problem of computing the frontiers can be seen as that of performing VI or EP with two target distributions instead of one.
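The non-domination condition in Definition 2 can be illustrated on a finite set of divergence pairs. The following is a small sketch; the helper name `divergence_frontier` and the toy pairs are assumptions made for illustration only. A pair is kept unless some other pair is strictly smaller in both coordinates.

```python
def divergence_frontier(points):
    """Keep the pairs that no other pair strictly improves in both coordinates."""
    return [
        (pi, rho)
        for (pi, rho) in points
        if not any(pi2 < pi and rho2 < rho for (pi2, rho2) in points)
    ]

# Toy realizable region (made-up numbers): each pair plays the role of the
# two divergences obtained for one candidate distribution R.
region = [(0.1, 0.9), (0.4, 0.4), (0.9, 0.1), (0.5, 0.5), (1.0, 1.0)]
frontier = divergence_frontier(region)
print(frontier)
```

Here `(0.5, 0.5)` and `(1.0, 1.0)` are discarded because `(0.4, 0.4)` is strictly better in both coordinates. The quadratic scan is chosen for readability; it is not meant as an efficient implementation.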

## 4 Computing the Frontiers

Having defined the frontiers, we now turn to the problem of their computation. Remember that to compute the exclusive frontier we have to characterize the set of all pairs $(D_\alpha(R \,\|\, Q), D_\alpha(R \,\|\, P))$ which are not strictly dominated. To solve this multi-objective optimization problem we will linearly scalarize it by computing

$$\gamma(\lambda) = \operatorname*{arg\,min}_{R} \; \lambda D_\alpha(R \,\|\, Q) + (1 - \lambda) D_\alpha(R \,\|\, P), \tag{3}$$

for varying $\lambda \in [0, 1]$, and plug the solution back into the divergences. Even though this approach does not in general guarantee that we will obtain the frontier, we show that it is indeed the case in our setting. Furthermore, problem (3) is known as the barycenter problem and has closed-form solutions for many classes of distributions, which we will make use of. The inclusive case will be treated analogously.

### 4.1 Discrete Measures

Let us first consider the discrete case, when the distributions take on one of $n$ values. To simplify the discussion, we represent the distributions as vectors in the simplex $\Delta$, and use $p$ for $P$ and $q$ for $Q$. The vectors $\gamma(\lambda)$ for the exclusive and inclusive case can be analytically computed (Nielsen and Nock, 2009, III), which results in the following computational scheme.

###### Proposition 1.

For any two discrete measures $p$ and $q$ in the probability simplex $\Delta$ and any $\alpha$, the exclusive and inclusive divergence frontiers can be computed as follows:

1. Exclusive: define the curve by . Then,

   $$\mathcal{F}^\cap_\alpha(p, q \mid \Delta) = \{(D_\alpha(\gamma(\lambda), p), D_\alpha(\gamma(\lambda), q)) \mid \lambda \in [0, 1]\}.$$
2. Inclusive: define the curve by . Then,

   $$\mathcal{F}^\cup_\alpha(p, q \mid \Delta) = \{(D_\alpha(p, \gamma(\lambda)), D_\alpha(q, \gamma(\lambda))) \mid \lambda \in [0, 1]\}.$$

Conceptually, to compute the frontier we walk along the path $\gamma(\lambda)$ from $p$ to $q$, and at each point we compute the distances to $p$ and $q$ as measured by $D_\alpha$. We illustrate this in Figure 3.
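The walk along the path can be sketched in a few lines of Python. The closed forms used for $\gamma(\lambda)$ below are an assumption: they are not recoverable from the text above and were obtained by setting the gradient of the scalarized $\alpha$-divergence objective to zero on the simplex, so they should be checked against (Nielsen and Nock, 2009) before being relied upon. The sketch further assumes strictly positive vectors and $\alpha \notin \{0, 1\}$; all function names are hypothetical.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(p, q) for strictly positive discrete distributions, alpha != 1."""
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def exclusive_barycenter(p, q, alpha, lam):
    # ASSUMED closed form: stationary point of
    # lam * D(r, q) + (1 - lam) * D(r, p) written with alpha-divergences.
    e = 1.0 - alpha
    return normalize([(lam * qi ** e + (1 - lam) * pi ** e) ** (1.0 / e)
                      for pi, qi in zip(p, q)])

def inclusive_barycenter(p, q, alpha, lam):
    # ASSUMED closed form for the objective with swapped arguments.
    return normalize([(lam * qi ** alpha + (1 - lam) * pi ** alpha) ** (1.0 / alpha)
                      for pi, qi in zip(p, q)])

def exclusive_frontier(p, q, alpha, steps=11):
    out = []
    for k in range(steps):
        g = exclusive_barycenter(p, q, alpha, k / (steps - 1))
        out.append((renyi(g, p, alpha), renyi(g, q, alpha)))
    return out

def inclusive_frontier(p, q, alpha, steps=11):
    out = []
    for k in range(steps):
        g = inclusive_barycenter(p, q, alpha, k / (steps - 1))
        out.append((renyi(p, g, alpha), renyi(q, g, alpha)))
    return out

# Illustrative example vectors (assumed, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
exclusive = exclusive_frontier(p, q, alpha=2.0)
inclusive = inclusive_frontier(p, q, alpha=2.0)
```

With this parametrization $\lambda = 0$ yields $\gamma = p$ and $\lambda = 1$ yields $\gamma = q$, so the first coordinate of each pair grows along the path while the second shrinks, which is the tradeoff the frontier is meant to capture.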

### 4.2 Continuous Measures

The computation of the frontier for continuous distributions presents several obstacles that do not exist in the discrete case. Namely, even if we have access to the densities of $P$ and $Q$, characterizing their frontiers can be arduous: it is not obvious how to parameterize the candidate distributions $R \in \mathcal{M}$, and furthermore optimizing and evaluating the divergences can be challenging. Fortunately, for the exponential family, which includes many commonly used distributions such as Gaussian, Lognormal, and Exponential, these frontiers can be efficiently computed for $\alpha = 1$ (i.e., the KL divergence).

###### Definition 3 (Exponential family (Wainwright et al., 2008, §3.2)).

The exponential family over a domain $\mathcal{X}$ for a sufficient statistic $\nu$ is the set of all distributions of the form

$$P(x \mid \theta) = \exp\left(\theta^\top \nu(x) - A(\theta)\right), \tag{4}$$

where $\theta$ is the parameter vector, and $A(\theta)$ is the log-partition function normalizing the distribution.

Importantly, the KL divergence between two distributions in the exponential family with parameters $\theta$ and $\theta'$ can be computed in closed form as the following Bregman divergence (Wainwright et al., 2008, §5.2.2)

$$D_{\mathrm{KL}}(P(\cdot \mid \theta) \,\|\, P(\cdot \mid \theta')) = A(\theta') - A(\theta) - \nabla A(\theta)^\top (\theta' - \theta),$$

which we shall denote as $D_{\mathrm{KL}}(\theta \,\|\, \theta')$. We can now show how to compute the frontier.
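As a quick sanity check of the Bregman form, the sketch below evaluates it for the Exponential distribution and compares it with the textbook closed form of the KL divergence. The choice of this family and the function names are illustrative assumptions: for a rate $r$ the natural parameter is $\theta = -r$, the sufficient statistic is $\nu(x) = x$, and $A(\theta) = -\log(-\theta)$.

```python
import math

def log_partition(theta):
    # Exponential distribution: A(theta) = log of the integral of exp(theta * x)
    # over x >= 0, which equals -log(-theta) for theta < 0.
    return -math.log(-theta)

def grad_log_partition(theta):
    # A'(theta) = -1 / theta, i.e. the mean 1 / rate.
    return -1.0 / theta

def kl_bregman(theta, theta_prime):
    """KL(P(.|theta) || P(.|theta')) via the Bregman form of the log-partition function."""
    return (log_partition(theta_prime) - log_partition(theta)
            - grad_log_partition(theta) * (theta_prime - theta))

def kl_exponential(rate_a, rate_b):
    """Textbook closed form of KL(Exp(rate_a) || Exp(rate_b))."""
    return math.log(rate_a / rate_b) + rate_b / rate_a - 1.0

via_bregman = kl_bregman(-2.0, -0.5)
via_closed_form = kl_exponential(2.0, 0.5)
print(via_bregman, via_closed_form)
```

Both routes give the same number, which is what makes the exponential family convenient: the divergences reduce to evaluations of $A$ and its gradient.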

###### Proposition 2.

Let $\mathcal{M}$ be an exponential family with log-partition function $A$. Let $P$ and $Q$ be elements in $\mathcal{M}$ with parameters $\theta_P$ and $\theta_Q$. Then,

• Inclusive: If we define , then

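  $$\mathcal{F}^\cup_1(P, Q \mid \mathcal{M}) = \{(D_{\mathrm{KL}}(\theta_P \,\|\, \gamma(\lambda)), D_{\mathrm{KL}}(\theta_Q \,\|\, \gamma(\lambda))) \mid \lambda \in [0, 1]\}.$$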
• Exclusive: If we define , then

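  $$\mathcal{F}^\cap_1(P, Q \mid \mathcal{M}) = \{(D_{\mathrm{KL}}(\gamma(\lambda) \,\|\, \theta_P), D_{\mathrm{KL}}(\gamma(\lambda) \,\|\, \theta_Q)) \mid \lambda \in [0, 1]\}.$$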
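For univariate Gaussians, both frontiers can be traced with a short script. The expressions for $\gamma(\lambda)$ used here are an assumption, since they are not recoverable from the text above: the sketch uses the standard barycenter results for exponential families, namely interpolation of the mean parameters (moment matching) for the inclusive case and interpolation of the natural parameters for the exclusive case, with weight $\lambda$ on $P$. Function names and the example parameters are hypothetical.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def inclusive_point(mp, vp, mq, vq, lam):
    # ASSUMPTION: moment matching, i.e. interpolate E[x] and E[x^2].
    m = lam * mp + (1 - lam) * mq
    second = lam * (vp + mp ** 2) + (1 - lam) * (vq + mq ** 2)
    return m, second - m ** 2

def exclusive_point(mp, vp, mq, vq, lam):
    # ASSUMPTION: interpolate the natural parameters (m / v, -1 / (2 v)).
    precision = lam / vp + (1 - lam) / vq
    m = (lam * mp / vp + (1 - lam) * mq / vq) / precision
    return m, 1.0 / precision

def frontiers(mp, vp, mq, vq, steps=11):
    inclusive, exclusive = [], []
    for k in range(steps):
        lam = k / (steps - 1)
        m, v = inclusive_point(mp, vp, mq, vq, lam)
        inclusive.append((kl_gauss(mp, vp, m, v), kl_gauss(mq, vq, m, v)))
        m, v = exclusive_point(mp, vp, mq, vq, lam)
        exclusive.append((kl_gauss(m, v, mp, vp), kl_gauss(m, v, mq, vq)))
    return inclusive, exclusive

# P = N(0, 1) and Q = N(2, 0.5) are made-up example parameters.
inclusive, exclusive = frontiers(0.0, 1.0, 2.0, 0.5)
```

At $\lambda = 1$ the curve sits at $P$ and at $\lambda = 0$ at $Q$, so the end-points of both frontiers are the two KL divergences between $P$ and $Q$ themselves.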

## 5 Connections to Related Work

We will now show that the two most prominent definitions of precision and recall are special cases of the presented framework.

The limiting case of $\alpha \to 0$  Rather than computing tradeoff curves, Kynkäänniemi et al. (2019) focus only on computing $P(\mathrm{supp}(Q))$ and $Q(\mathrm{supp}(P))$, and estimate the supports using a union of $k$-nearest neighbourhood balls. This is indeed a special case of our framework, as (Van Erven and Harremos, 2014, Thm. 4) $\lim_{\alpha \to 0} D_\alpha(P \,\|\, Q) = -\log Q(\mathrm{supp}(P))$. One drawback of this approach is that all regions where $P$ and $Q$ place any mass are considered equal. Hence, for any two measures with the same support this metric assigns perfect precision and recall (see Figure 4).

The limiting case of $\alpha \to \infty$  We will now show that Sajjadi et al. (2018) corresponds to the case where $\alpha \to \infty$. In particular, Sajjadi et al. (2018) write both $P$ and $Q$ as mixtures with a shared component that should capture the space to which both assign high likelihood, and which can be used to formalize the notions of precision and recall for distributions.

###### Definition 4 ((Sajjadi et al., 2018, Def. 1)).

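For $\pi, \rho \in (0, 1]$, $Q$ has precision $\pi$ at recall $\rho$ w.r.t. $P$ if there exist distributions $R$, $P'$, $Q'$ such that

$$P = \rho R + (1 - \rho) P', \quad \text{and} \quad Q = \pi R + (1 - \pi) Q'. \tag{5}$$

The union of $\{(0, 0)\}$ and all realizable pairs $(\pi, \rho)$ will be denoted by $\mathrm{PRD}(P, Q)$.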

Even though the divergence frontiers formalized in this work might seem unrelated to this definition, there is a clear connection between them, which we now establish. As Sajjadi et al. (2018) targets discrete measures, let us treat the distributions as vectors in the probability simplex $\Delta$ and use $p$ for $P$ and $q$ for $Q$. We need to consider three additional distributions to compute $\mathrm{PRD}(p, q)$: $r$, and the per-distribution mixtures $p'$ and $q'$. These distributions are arranged as shown in Figure 4. Because $p$, $r$ and $p'$ are co-linear and $p = \rho r + (1 - \rho) p'$, we have that the recall obtained for this configuration is $\rho = \|p - p'\| / \|r - p'\|$. Similarly, the precision can be easily seen to be equal to $\pi = \|q - q'\| / \|r - q'\|$. Most importantly, we can only increase both $\rho$ and $\pi$ if we move $p'$ and $q'$ along the rays $r \to p$ and $r \to q$, respectively. Specifically, the maximal recall and precision for this fixed $r$ are obtained when $p'$ and $q'$ are as far as possible from $r$, i.e., when they lie on the boundary $\partial\Delta$. To formalize this, let us denote for any $p, q$ in $\Delta$ by $\partial_\Delta(p, q)$ the point along the ray $p \to q$ that intersects the boundary of $\Delta$. Then, the maximal $\rho$ and $\pi$ are achieved for $p' = \partial_\Delta(r, p)$ and $q' = \partial_\Delta(r, q)$. Perhaps surprisingly, these best achievable precision and recall have already been studied in geometry and have remarkable properties, as they give rise to a weak metric.

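The Funk weak metric $F_\Delta$ on $\Delta$ is defined by

$$F_\Delta(p, p) = 0 \quad \text{and} \quad F_\Delta(p, q) = \log\left(\frac{\|p - \partial_\Delta(p, q)\|}{\|q - \partial_\Delta(p, q)\|}\right).$$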

Furthermore, we have that the Funk metric coincides with a limiting Rényi divergence.

###### Proposition 3 ((Papadopoulos and Troyanov, 2014, Ex. 4.1),(Van Erven and Harremos, 2014, Thm. 6)).

For any $p, q$ in the probability simplex $\Delta$, we have that

$$F_\Delta(p, q) = \lim_{\alpha \to \infty} D_\alpha(p \,\|\, q) = \log \max_{i = 1, \ldots, n} \frac{p_i}{q_i}.$$

This immediately implies the following connection between the set of maximal points in $\mathrm{PRD}(P, Q)$, which we shall denote by $\overline{\mathrm{PRD}}(P, Q)$, and $\mathcal{F}^\cap_\infty(P, Q)$. In other words, the maximal points in PRD coincide with one of the exclusive frontiers we have introduced.

###### Proposition 4.

For any distributions $P, Q$ on $\{1, 2, \ldots, n\}$ it holds that

$$\overline{\mathrm{PRD}}(P, Q) = \{(e^{-\pi}, e^{-\rho}) \mid (\rho, \pi) \in \mathcal{F}^\cap_\infty(P, Q)\}.$$
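The identity in Proposition 3 can be checked numerically. The sketch below computes the Funk weak metric directly from its geometric definition (by locating the point where the ray from $p$ through $q$ leaves the simplex) and compares it with the logarithm of the largest ratio $p_i / q_i$ and with a Rényi divergence of large finite order. It assumes that $q$ lies in the interior of the simplex; the helper names and example vectors are illustrative.

```python
import math

def boundary_point(p, q):
    """Point where the ray from p through q leaves the simplex (assumes p != q)."""
    # The ray p + t * (q - p) stays in the simplex until a coordinate hits zero.
    t = min(pi / (pi - qi) for pi, qi in zip(p, q) if qi < pi)
    return [pi + t * (qi - pi) for pi, qi in zip(p, q)]

def funk_metric(p, q):
    if p == q:
        return 0.0
    b = boundary_point(p, q)
    return math.log(math.dist(p, b) / math.dist(q, b))

def renyi(p, q, alpha):
    """D_alpha(p || q) for strictly positive discrete distributions, alpha != 1."""
    total = sum(qi * (pi / qi) ** alpha for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)

# Illustrative example vectors (assumed, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
geometric = funk_metric(p, q)
closed_form = math.log(max(pi / qi for pi, qi in zip(p, q)))
large_order = renyi(p, q, 200.0)
print(geometric, closed_form, large_order)
```

The geometric and closed-form values coincide, while the finite-order divergence approaches them from below as the order grows.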

Furthermore, the fact that $F_\Delta$ is a weak metric implies that, in contrast to the case of finite $\alpha$, the triangle inequality holds (Papadopoulos and Yamada, 2013, Thm. 7.1). As a result, we can make an even stronger claim: the path traced out by the distributions that generate the frontier is the shortest in the corresponding geometry.

###### Proposition 5.

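Define the curve $\gamma$ as $[\gamma(\lambda)]_i \propto \min\{p_i, q_i / \lambda\}$. Then,

$$\mathcal{F}^\cap_\infty(p, q) = \left\{(D_\infty(\gamma(\lambda) \,\|\, p), D_\infty(\gamma(\lambda) \,\|\, q)) \;\middle|\; \lambda \in \left[\min_i \frac{q_i}{p_i}, \max_i \frac{q_i}{p_i}\right]\right\},$$

and, moreover, $\gamma$ is geodesic, i.e., it evaluates at the endpoints to $p$ and $q$, and for any $\lambda$ in the above interval

$$F_\Delta(p, q) = F_\Delta(p, \gamma(\lambda)) + F_\Delta(\gamma(\lambda), q).$$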
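The following sketch traces the curve $[\gamma(\lambda)]_i \propto \min\{p_i, q_i/\lambda\}$ given in the appendix proof, checks the geodesic identity numerically, and maps each frontier point to a precision-recall pair as in Proposition 4. The comparison values $\sum_i \min\{\lambda p_i, q_i\}$ and $\sum_i \min\{p_i, q_i/\lambda\}$ are the precision and recall expressions of Sajjadi et al. (2018) as recalled here, so treat them as an assumption; the variable names and example vectors are illustrative.

```python
import math

def d_inf(a, b):
    """D_infinity(a || b) = log max_i a_i / b_i (b is assumed strictly positive)."""
    return math.log(max(ai / bi for ai, bi in zip(a, b) if ai > 0))

def gamma(p, q, lam):
    w = [min(pi, qi / lam) for pi, qi in zip(p, q)]
    beta = sum(w)  # normalizer beta(lambda)
    return [wi / beta for wi in w]

# Illustrative example vectors (assumed, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
lo = min(qi / pi for pi, qi in zip(p, q))
hi = max(qi / pi for pi, qi in zip(p, q))

rows = []
for k in range(5):
    lam = lo + (hi - lo) * k / 4
    g = gamma(p, q, lam)
    rho, pi_ = d_inf(g, p), d_inf(g, q)  # frontier point (rho, pi)
    precision, recall = math.exp(-pi_), math.exp(-rho)
    # Geodesic check: F(p, q) = F(p, gamma) + F(gamma, q).
    gap = d_inf(p, g) + d_inf(g, q) - d_inf(p, q)
    rows.append((lam, precision, recall, gap))
    print(f"lambda={lam:.3f} precision={precision:.3f} recall={recall:.3f} gap={gap:.1e}")
```

At the two ends of the interval the curve coincides with $p$ and $q$, where one of the two quantities equals one and the other is at its smallest value.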

Finally, we note that the idea of precision and recall for generative models based on distances to the corresponding manifolds first appeared in Lucic et al. (2018), but was only applicable for synthetic data.

## 6 Practical Considerations and Insights

#### Main Design Choices

In practice, when we are tasked with the problem of evaluating a generative model, we typically only have access to samples from the target distribution $P$ and the model $Q$, and optionally also the density of $Q$. There are at least two approaches one can undertake when applying the methods developed in this paper to generate precision-recall curves. We would like to point out that in the case of image synthesis, the comparison is typically not done in the original high-dimensional image space, but (as done in Sajjadi et al. (2018); Kynkäänniemi et al. (2019); Heusel et al. (2017)) in feature spaces where distances are expected to correlate more strongly with perceptual difference.

The first strategy is to discretize the data, as done in Sajjadi et al. (2018), and then apply the methods from Section 4.1. Even though in the limit this will converge to the continuous divergence (Van Erven and Harremos, 2014, Thm. 2), there may be several issues with this approach: if clustering is used for quantization, as done in Sajjadi et al. (2018), the results might strongly depend on the clustering quality, and additional hyperparameters have to be tuned.

The second strategy is to fit a parametric model (from the exponential family) to samples from $P$ and $Q$ separately and apply the methods from Section 4.2. While this might seem simplistic, exactly this approach has been shown to work well for evaluating generative models using the FID score Heusel et al. (2017).

Implications of the Design Choices  While it might seem that these design choices do not have a major impact on the qualitative interpretation of the results, their impact may be critical. Kynkäänniemi et al. (2019) have already uncovered a counter-intuitive behavior of PRD resulting from quantization issues. Namely, they consider a class-conditional BigGAN model Brock et al. (2019) trained on images from ImageNet Russakovsky et al. (2015). The generator was trained to map a standard multivariate Gaussian variable to a distribution over high-dimensional natural images. To control the distribution at the output of the generator, Brock et al. (2019) explored various truncation strategies. In particular, instead of sampling from a Gaussian as done during training, they re-sample all entries of the sampled vector that exceed (in absolute value) some fixed truncation threshold. The idea is that low values of the threshold should yield more visually pleasing samples at a cost of sample diversity, while the behavior should reverse for larger thresholds. As a result, as one increases the truncation parameter, the precision should decrease and the recall should increase. In our experiments, we varied ranging from to in increments of . Following Kynkäänniemi et al. (2019); Brock et al. (2019), we compare the images in the feature space of the second fully connected layer of a VGG network trained on ImageNet Simonyan and Zisserman (2014). For each class we used 50000 generated samples and 1300 real ones (as many as there are in the dataset). As shown in Figure 5 and observed by Kynkäänniemi et al. (2019), when measuring precision and recall using the PRD metric Sajjadi et al. (2018), one seems to be able to improve both by truncation. On the other hand, the estimator from Kynkäänniemi et al. (2019), shown in Figure 5, can correctly identify the expected behavior in that particular setting, despite the fundamental issues illustrated in Section 5.

One can side-step these issues by instantiating our framework for the case of $\alpha = 1$ (the KL divergence) and approximating the real and fake data separately with multivariate Gaussians. By evaluating the end-points of the frontier, which corresponds to evaluating the KL divergence between these two Gaussians (seen by setting $R = P$ and $R = Q$ in Def. 2), one observes the expected behavior. Namely, as shown in Figure 6, $D_{\mathrm{KL}}(Q \,\|\, P)$ (sensitive to precision losses) increases while $D_{\mathrm{KL}}(P \,\|\, Q)$ (sensitive to recall losses) decreases as we increase the truncation parameter. Hence, when one approximates the frontiers from finite samples, extra care has to be taken. For example, while discretization might make sense in the presence of multi-modality, fitting exponential families could be more suitable in capturing the tails of the distributions.
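A minimal sketch of this end-point evaluation is given below. To stay within the standard library it fits Gaussians with diagonal covariance, which is a simplification of the full multivariate fit described above, and the feature vectors are made-up numbers rather than real network features. All function and variable names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def fit_diagonal_gaussian(samples):
    """Per-dimension mean and variance (SIMPLIFICATION: diagonal covariance only)."""
    dims = list(zip(*samples))
    return [fmean(d) for d in dims], [pvariance(d) for d in dims]

def kl_diagonal_gaussians(mean_a, var_a, mean_b, var_b):
    """KL(N(mean_a, diag(var_a)) || N(mean_b, diag(var_b)))."""
    return sum(
        0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
        for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b)
    )

# Made-up 2-D "features": the fake samples are less spread out than the real ones.
real = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 0.0)]
fake = [(0.5, 1.0), (1.0, 2.0), (1.5, 1.5), (2.0, 1.5)]

mean_p, var_p = fit_diagonal_gaussian(real)
mean_q, var_q = fit_diagonal_gaussian(fake)

kl_q_p = kl_diagonal_gaussians(mean_q, var_q, mean_p, var_p)  # KL(Q || P)
kl_p_q = kl_diagonal_gaussians(mean_p, var_p, mean_q, var_q)  # KL(P || Q)
print(kl_q_p, kl_p_q)
```

In this toy example the model is narrower than the data, so the divergence from the data to the model is the larger of the two, which is the direction one would expect to react to a loss of diversity.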

## 7 Conclusion

We developed a framework for comparing distributions via the Pareto frontiers of information divergences, and fully characterized them using efficient computational algorithms for a large family of distributions. We recovered previous approaches as special cases, and thus provided a novel perspective on their definitions and corresponding algorithms. Furthermore, we believe that we have also opened many interesting research questions related to classical approximate inference methods — can we use different divergences or extend the algorithms to even richer model families, and how to identify the correct approach for approximating the frontiers when we only have access to samples.

## References

• Banerjee et al. (2005) Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 2005.
• Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
• Brock et al. (2019) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. International Conference on Learning Representations, 2019.
• Gil et al. (2013) Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 2013.
• Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in Neural Information Processing Systems, 2014.
• Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 2017.
• Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding Variational Bayes. International Conference on Learning Representations, 2014.
• Kynkäänniemi et al. (2019) Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved Precision and Recall Metric for Assessing Generative Models. arXiv preprint arXiv:1904.06991, 2019.
• Li and Turner (2016) Yingzhen Li and Richard E Turner. Rényi divergence variational inference. Advances in Neural Information Processing Systems, 2016.
• Lucic et al. (2018) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-Scale Study. Advances in Neural Information Processing Systems, 2018.
• Minka (2001) Thomas P Minka. Expectation propagation for approximate Bayesian inference. Conference on Uncertainty in Artificial Intelligence, 2001.
• Minka et al. (2005) Tom Minka et al. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
• Nielsen and Nock (2007) Frank Nielsen and Richard Nock. On the centroids of symmetrized Bregman divergences. arXiv preprint arXiv:0711.3242, 2007.
• Nielsen and Nock (2009) Frank Nielsen and Richard Nock. The dual Voronoi diagrams with respect to representational Bregman divergences. International Symposium on Voronoi Diagrams, 2009.
• Papadopoulos and Troyanov (2014) Athanase Papadopoulos and Marc Troyanov. From Funk to Hilbert Geometry. arXiv preprint arXiv:1406.6983, 2014.
• Papadopoulos and Yamada (2013) Athanase Papadopoulos and Sumio Yamada. The Funk and Hilbert geometries for spaces of constant curvature. Monatshefte für Mathematik, 2013.
• Rényi (1961) Alfréd Rényi. On measures of information and entropy. Berkeley Symposium on Mathematics, Statistics and Probability, 1961.
• Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
• Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
• Sajjadi et al. (2018) Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. Advances in Neural Information Processing Systems, 2018.
• Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 2016.
• Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
• Van Erven and Harremos (2014) Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 2014.
• Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 2008.

## Appendix A Proofs

###### Proof of Proposition 4.

Even though this result follows clearly from the discussion just above the claim, we provide it for completeness. Namely, let be generated for some . Based on the argument below creftypecap 4 it follows that it must be equal to . Then, the pair is maximal in PRD iff is minimal in , i.e., iff . ∎

###### Proof of Proposition 5.

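If we also include the normalizer of $\gamma$, we have that

$$[\gamma(\lambda)]_i = \min\{p_i, q_i / \lambda\} / \beta(\lambda), \quad \text{where} \quad \beta(\lambda) = \sum_{i=1}^n \min\{p_i, q_i / \lambda\}.$$

The end-point condition is easy to check, namely

$$\begin{aligned} \left[\gamma\left(\min_j \{q_j / p_j\}\right)\right]_i &= \min\left\{p_i, \frac{q_i}{\min_j \{q_j / p_j\}}\right\} / \beta(\lambda) = p_i / \beta(\lambda) = p_i, \text{ and} \\ \left[\gamma\left(\max_j \{q_j / p_j\}\right)\right]_i &= \min\left\{p_i, \frac{q_i}{\max_j \{q_j / p_j\}}\right\} / \beta(\lambda) = \frac{q_i / \lambda}{\beta(\lambda)} = q_i. \end{aligned}$$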

Let us now show that . The right hand side can be re-written as

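$$F_\Delta(\gamma(\lambda), p) = \log \max_i \frac{\min\{p_i, q_i / \lambda\} / \beta(\lambda)}{p_i} = -\log \beta(\lambda) + \log \max_i \min\left\{1, \frac{q_i}{p_i \lambda}\right\}.$$

Note that the term inside the log is not one only if $q_i / (p_i \lambda) < 1$ for all $i$, which can happen only if $\lambda > \max_i q_i / p_i$, which is outside the domain of $\gamma$. Similarly,

$$\begin{aligned} F_\Delta(\gamma(\lambda), q) &= \log \max_i \frac{\min\{p_i, q_i / \lambda\} / \beta(\lambda)}{q_i} \\ &= -\log \beta(\lambda) + \log \max_i \min\left\{\frac{p_i}{q_i}, \frac{1}{\lambda}\right\} \\ &= -\log\left(\beta(\lambda) \lambda\right) + \log \max_i \min\left\{\lambda \frac{p_i}{q_i}, 1\right\}. \end{aligned}$$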

The claim follows because , and by noting that the maximum inside the logarithm is strictly less than one only if for all it holds that , which is outside the domain of .

Finally, let us show the geodesity of the curve.

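$$\begin{aligned} F_\Delta(p, \gamma(\lambda)) + F_\Delta(\gamma(\lambda), q) &= \log \max_i \frac{p_i}{\min\{p_i, q_i / \lambda\} / \beta(\lambda)} + \log \max_i \frac{\min\{p_i, q_i / \lambda\} / \beta(\lambda)}{q_i} \\ &= \max_i \log \max\left\{\lambda \frac{p_i}{q_i}, 1\right\} + \max_i \log \min\left\{\frac{p_i}{q_i}, \frac{1}{\lambda}\right\} \end{aligned}$$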
• Case (i): . Then, , so that the first term will be equal to . Similarly, , so that the second term is equal to , and the claimed equality is satisfied.

• Case (ii): . Note that

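$$\max_i \log \max\left\{\lambda \frac{p_i}{q_i}, 1\right\} + \max_i \log \min\left\{\frac{p_i}{q_i}, \frac{1}{\lambda}\right\} = \max_i \log \lambda \max\left\{\frac{p_i}{q_i}, \frac{1}{\lambda}\right\} + \max_i \log \frac{1}{\lambda} \min\left\{\lambda \frac{p_i}{q_i}, 1\right\},$$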

so that the problem is symmetric if we parametrize with and the argument from above holds.

###### Proof of Proposition 1.

Case (i) Remember that we want to minimize $D_\alpha(r, p)$ and $D_\alpha(r, q)$. Instead of minimizing the Rényi divergences $D_\alpha$, we can alternatively minimize the $\alpha$-divergences as they are monotone functions of each other. As the $\alpha$-divergence is an $f$-divergence (see e.g. [Nielsen and Nock, 2009, C]), it follows that it is jointly convex in both arguments. Hence the Pareto frontier can be computed using the linearly scalarized problem (for a proof see [Boyd and Vandenberghe, 2004, §4.7.3]). The fact that the solution to the scalarized problem is given by $\gamma(\lambda)$ is proven in [Nielsen and Nock, 2009, (12)]. Case (ii) This case follows analogously as above as the $\alpha$-divergence is jointly convex. The formula for the barycenter can be found in [Nielsen and Nock, 2009, (13)]. ∎

###### Proof of Proposition 2.

The proof follows the same argument as Nielsen and Nock [2007, §2], the main difference being that we also discuss Pareto optimality, while Nielsen and Nock [2007] only discuss the barycenter problem. Let us denote for any convex continuously differentiable function $F$ by $B_F$ the Bregman divergence generated by $F$, i.e.,

$$B_F(x, y) = F(x) - F(y) - \nabla F(y)^\top (x - y).$$

In the inclusive case, we want to minimize the objectives $D_{\mathrm{KL}}(P \,\|\, R)$ and $D_{\mathrm{KL}}(Q \,\|\, R)$ over $R \in \mathcal{M}$. In terms of Bregman divergences, we want to minimize $B_A(\theta_R, \theta_P)$ and $B_A(\theta_R, \theta_Q)$. Because Bregman divergences are convex in their first argument, as in the proof of Proposition 1 we need only consider the solutions to the linearly scalarized objective

$$\lambda B_A(\theta_R, \theta_P) + (1 - \lambda) B_A(\theta_R, \theta_Q),$$

whose solution is known (see e.g. Banerjee et al. [2005]) to be equal to , which we had to show. The exclusive case follows from the same argument using the fact that and that [Wainwright et al., 2008, Prop. B.2]. ∎