Finding Mixed Nash Equilibria of Generative Adversarial Networks

We reconsider the training objective of Generative Adversarial Networks (GANs) from the mixed Nash Equilibria (NE) perspective. Inspired by the classical prox methods, we develop a novel algorithmic framework for GANs via an infinite-dimensional two-player game and prove rigorous convergence rates to the mixed NE, resolving the longstanding problem that no provably convergent algorithm exists for general GANs. We then propose a principled procedure to reduce our novel prox methods to simple sampling routines, leading to practically efficient algorithms. Finally, we provide experimental evidence that our approach outperforms methods that seek pure strategy equilibria, such as SGD, Adam, and RMSProp, both in speed and quality.


1 Introduction

The Generative Adversarial Network (GAN) Goodfellow et al. (2014) has become one of the most powerful paradigms in learning real-world distributions, especially for image-related data. It has been successfully applied to a host of applications such as image translation Isola et al. (2017); Kim et al. (2017); Zhu et al. (2017), super-resolution imaging Wang et al. (2015), pose editing Pumarola et al. (2018b), and facial animation Pumarola et al. (2018a).

Despite its many accomplishments, the major hurdle blocking the full impact of GANs is the notoriously difficult training phase. In the language of game theory, a GAN seeks a pure strategy equilibrium, which is well-known to be ill-posed in many scenarios Dasgupta & Maskin (1986). Indeed, it is known that a pure strategy equilibrium might not exist Arora et al. (2017), might be degenerate Sønderby et al. (2017), or cannot be reliably reached by existing algorithms Mescheder et al. (2017).

Empirically, it has also been observed that common algorithms, such as SGD or Adam Kingma & Ba (2015), lead to unstable training. While much effort has been devoted to understanding the training dynamics of GANs Balduzzi et al. (2018); Gemp & Mahadevan (2018); Gidel et al. (2018a, b); Liang & Stokes (2018), a provably convergent algorithm for general GANs, even under reasonably strong assumptions, is still lacking.

In this paper, we address the above problems with the following contributions:

1. We propose to study the mixed Nash Equilibrium (NE) of GANs: Instead of searching for an optimal pure strategy which might not even exist, we optimize over the set of probability distributions over pure strategies of the networks. The existence of a solution to such problems was long established amongst the earliest game theory work Glicksberg (1952), leading to well-posed optimization problems.

2. We demonstrate that the prox methods of Nemirovsky & Yudin (1983); Beck & Teboulle (2003); Nemirovski (2004), which are fundamental building blocks for solving two-player games with finitely many strategies, can be extended to continuously many strategies, and are hence applicable to training GANs. We provide an elementary proof of their convergence rates for learning the mixed NE.

3. We construct a principled procedure to reduce our novel prox methods to certain sampling tasks that were empirically proven easy by recent work Chaudhari et al. (2017, 2018); Dziugaite & Roy (2018). We further establish heuristic guidelines to greatly scale down the memory and computational costs, resulting in simple algorithms whose per-iteration complexity is almost as cheap as SGD.

4. We experimentally show that our algorithms consistently achieve performance better than or comparable to popular baselines such as SGD, Adam, and RMSProp Tieleman & Hinton (2012).

Related Work: While the literature on training GANs is vast, to our knowledge, there exist only a few papers on the mixed NE perspective. The notion of mixed NE is already present in Goodfellow et al. (2014), but is stated only as an existential result. The authors of Arora et al. (2017) advocate mixed strategies, but do not provide a provably convergent algorithm. Oliehoek et al. (2018) also consider mixed NE, but only with finitely many parameters. The work Grnarova et al. (2018) proposes a provably convergent algorithm for finding the mixed NE of GANs under the unrealistic assumption that the discriminator is a single-layered neural network. In contrast, our results are applicable to arbitrary architectures, including popular ones Arjovsky et al. (2017); Gulrajani et al. (2017).

Due to their fundamental role in game theory, many prox methods have been applied to study the training of GANs Daskalakis et al. (2018); Gidel et al. (2018a); Mertikopoulos et al. (2018). However, these works focus on the classical pure strategy equilibria and are hence distinct from our problem formulation. In particular, they give rise to drastically different algorithms from ours and do not provide convergence rates for GANs.

In terms of analysis techniques, our framework is closely related to Balandat et al. (2016), but with several important distinctions. First, the analysis of Balandat et al. (2016) is based on dual averaging Nesterov (2009), while we consider Mirror Descent and also the more sophisticated Mirror-Prox (see Section 3). Second, unlike our work, Balandat et al. (2016) do not provide any convergence rate for learning mixed NE of two-player games. Finally, Balandat et al. (2016) is only of theoretical interest with no practical algorithm.

Notation: Throughout the paper, we use $z$ to denote a generic variable and $\mathcal{Z}$ its domain. We denote the set of all Borel probability measures on $\mathcal{Z}$ by $\mathcal{M}(\mathcal{Z})$, and the set of all functions on $\mathcal{Z}$ by $\mathcal{F}(\mathcal{Z})$. (Footnote 1: Strictly speaking, our derivation requires mild regularity assumptions (see Appendix A.1) on the probability measure and function classes, which are met by most practical applications.) We write $d\mu = \rho\,dz$ to mean that the density function of $\mu$ with respect to the Lebesgue measure is $\rho$. All integrals without specifying the measure are understood to be with respect to Lebesgue. For any objective of the form $\min_x \max_y F(x,y)$ with $(x_{\mathrm{NE}}, y_{\mathrm{NE}})$ achieving the saddle-point value, we say that $(x_T, y_T)$ is an $O(T^{-1/2})$-NE if $|F(x_T,y_T) - F(x_{\mathrm{NE}}, y_{\mathrm{NE}})| = O(T^{-1/2})$. Similarly we can define $O(T^{-1})$-NE. The symbol $\|\cdot\|_{L^\infty}$ denotes the $L^\infty$-norm of functions, and $\|\cdot\|_{TV}$ denotes the total variation norm of probability measures.

2 Problem Formulation

We review standard results in game theory in Section 2.1, whose proofs can be found in Bubeck (2013a, b, c). Section 2.2 relates the training of GANs to the two-player game in Section 2.1, thereby suggesting a generalization of the prox methods to infinite dimension.

2.1 Preliminary: Prox Methods for Finite Games

Consider the classical formulation of a two-player game with finitely many strategies:

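$$\min_{p\in\Delta_m}\ \max_{q\in\Delta_n}\ \langle q, a\rangle - \langle q, Ap\rangle, \tag{1}$$

where $A \in \mathbb{R}^{n\times m}$ is a payoff matrix, $a \in \mathbb{R}^n$ is a vector, and $\Delta_d$ is the probability simplex in $\mathbb{R}^d$, representing the mixed strategies (i.e., probability distributions) over $d$ pure strategies. A pair $(p_{\mathrm{NE}}, q_{\mathrm{NE}})$ achieving the min-max value in (1) is called a mixed NE.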

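Assume that the matrix $A$ is too expensive to evaluate whereas the (stochastic) gradients of (1) are easy to obtain. Under such settings, a celebrated algorithm, the so-called entropic Mirror Descent (entropic MD), learns an $O(T^{-1/2})$-NE: Let $\phi(z) = \sum_{i=1}^d z_i \log z_i$ be the entropy function and $\phi^\star(y) = \log\sum_{i=1}^d e^{y_i}$ be its Fenchel dual. For a learning rate $\eta$ and an arbitrary vector $b \in \mathbb{R}^d$, define the MD iterates as

$$z' = \mathrm{MD}_\eta(z, b) \;\equiv\; z' = \nabla\phi^\star\big(\nabla\phi(z) - \eta b\big) \;\equiv\; z'_i = \frac{z_i e^{-\eta b_i}}{\sum_{j=1}^d z_j e^{-\eta b_j}}, \quad \forall\, 1\le i\le d. \tag{2}$$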

The equivalence of the last two formulas in (2) can be readily checked.
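The check can also be done numerically. The following is a minimal NumPy sketch, assuming $\phi(z)=\sum_i z_i\log z_i$ so that $\nabla\phi(z)=1+\log z$ and $\nabla\phi^\star$ is the softmax map; the function names are illustrative and not part of the paper.

```python
import numpy as np

def md_closed_form(z, b, eta):
    """Multiplicative form in (2): z_i * exp(-eta * b_i), renormalized."""
    w = z * np.exp(-eta * b)
    return w / w.sum()

def md_mirror_form(z, b, eta):
    """Mirror form in (2): grad phi*(grad phi(z) - eta * b)."""
    y = 1.0 + np.log(z) - eta * b   # grad phi(z) - eta * b
    y = y - y.max()                 # softmax is shift-invariant; improves stability
    e = np.exp(y)
    return e / e.sum()              # grad phi*(y) = softmax(y)

rng = np.random.default_rng(0)
z = rng.dirichlet(np.ones(5))       # a point in the interior of the simplex
b = rng.normal(size=5)              # an arbitrary vector
p1 = md_closed_form(z, b, eta=0.3)
p2 = md_mirror_form(z, b, eta=0.3)
```

The additive constant in $\nabla\phi(z)$ cancels in the normalization, which is why the two forms agree.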

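Denote by $\bar p_T = \frac{1}{T}\sum_{t=1}^T p_t$ and $\bar q_T = \frac{1}{T}\sum_{t=1}^T q_t$ the ergodic averages of two sequences $\{p_t\}_{t=1}^T$ and $\{q_t\}_{t=1}^T$. Then, with a properly chosen step-size $\eta$, we have

$$\begin{cases} p_{t+1} = \mathrm{MD}_\eta(p_t, -A^\top q_t) \\ q_{t+1} = \mathrm{MD}_\eta(q_t, -a + A p_t) \end{cases} \ \Rightarrow\ (\bar p_T, \bar q_T) \text{ is an } O(T^{-1/2})\text{-NE}.$$

Moreover, a slightly more complicated algorithm, called the entropic Mirror-Prox (entropic MP) Nemirovski (2004), achieves a faster rate than the entropic MD:

$$\begin{cases} p_t = \mathrm{MD}_\eta(\tilde p_t, -A^\top \tilde q_t), & \tilde p_{t+1} = \mathrm{MD}_\eta(\tilde p_t, -A^\top q_t) \\ q_t = \mathrm{MD}_\eta(\tilde q_t, -a + A\tilde p_t), & \tilde q_{t+1} = \mathrm{MD}_\eta(\tilde q_t, -a + A p_t) \end{cases} \ \Rightarrow\ (\bar p_T, \bar q_T) \text{ is an } O(T^{-1})\text{-NE}.$$

If, instead of deterministic gradients, one uses unbiased stochastic gradients for entropic MD and MP, then both algorithms achieve an $O(T^{-1/2})$-NE in expectation.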
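To make the two update schemes concrete, here is a small NumPy sketch that runs entropic MD and entropic MP with deterministic gradients on a random finite game of the form (1) and reports the duality gap of the averaged iterates. The game size, step-size, and iteration count are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def md(z, b, eta):
    """Entropic MD step: reweight by exp(-eta * b) and renormalize."""
    w = z * np.exp(-eta * (b - b.min()))  # shifting b leaves the result unchanged
    return w / w.sum()

def gap(A, a, p, q):
    """Duality gap of F(p, q) = <q, a> - <q, A p>: max_q F(p, .) - min_p F(., q)."""
    return np.max(a - A @ p) - (q @ a - np.max(A.T @ q))

def entropic_md(A, a, T, eta):
    n, m = A.shape
    p, q = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    p_bar, q_bar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        p_bar += p / T
        q_bar += q / T
        # simultaneous update: both right-hand sides use the old (p, q)
        p, q = md(p, -A.T @ q, eta), md(q, -a + A @ p, eta)
    return p_bar, q_bar

def entropic_mp(A, a, T, eta):
    n, m = A.shape
    pt, qt = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # the "tilde" iterates
    p_bar, q_bar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        p, q = md(pt, -A.T @ qt, eta), md(qt, -a + A @ pt, eta)  # extrapolation
        pt, qt = md(pt, -A.T @ q, eta), md(qt, -a + A @ p, eta)  # update
        p_bar += p / T
        q_bar += q / T
    return p_bar, q_bar

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(4, 3))  # q lives on a 4-simplex, p on a 3-simplex
a = rng.uniform(-1.0, 1.0, size=4)
gap_md = gap(A, a, *entropic_md(A, a, T=5000, eta=0.02))
gap_mp = gap(A, a, *entropic_mp(A, a, T=5000, eta=0.02))
```

The duality gap is zero exactly at a mixed NE, so a small value for the averaged iterates indicates an approximate equilibrium.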

2.2 Mixed Strategy Formulation for Generative Adversarial Networks

For illustration, let us focus on the Wasserstein GAN Arjovsky et al. (2017). The training objective of Wasserstein GAN is

$$\min_{\theta\in\Theta}\ \max_{w\in\mathcal{W}}\ \mathbb{E}_{X\sim P_{\mathrm{real}}}[f_w(X)] - \mathbb{E}_{X\sim P_\theta}[f_w(X)], \tag{3}$$

where $\Theta$ is the set of parameters for the generator and $\mathcal{W}$ the set of parameters for the discriminator $f_w$ (also known as the “critic” in the Wasserstein GAN literature), typically both taken to be neural nets. As mentioned in the introduction, such an optimization problem can be ill-posed, which is also supported by empirical evidence.

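The high-level idea of our approach is that, instead of solving (3) directly, we focus on the mixed strategy formulation of (3). In other words, we consider the set of all probability distributions over $\Theta$ and $\mathcal{W}$, and we search for the optimal distribution that solves the following program:

$$\min_{\nu\in\mathcal{M}(\Theta)}\ \max_{\mu\in\mathcal{M}(\mathcal{W})}\ \mathbb{E}_{w\sim\mu}\mathbb{E}_{X\sim P_{\mathrm{real}}}[f_w(X)] - \mathbb{E}_{w\sim\mu}\mathbb{E}_{\theta\sim\nu}\mathbb{E}_{X\sim P_\theta}[f_w(X)]. \tag{4}$$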

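Define the function $g:\mathcal{W}\to\mathbb{R}$ by $g(w) \coloneqq \mathbb{E}_{X\sim P_{\mathrm{real}}}[f_w(X)]$ and the operator $G:\mathcal{M}(\Theta)\to\mathcal{F}(\mathcal{W})$ as $(G\nu)(w) \coloneqq \mathbb{E}_{\theta\sim\nu}\mathbb{E}_{X\sim P_\theta}[f_w(X)]$. Denoting $\langle\mu, h\rangle \coloneqq \int h\,d\mu$ for any probability measure $\mu$ and function $h$, we may rewrite (4) as

$$\min_{\nu\in\mathcal{M}(\Theta)}\ \max_{\mu\in\mathcal{M}(\mathcal{W})}\ \langle\mu, g\rangle - \langle\mu, G\nu\rangle. \tag{5}$$

Furthermore, the Fréchet derivative (the analogue of the gradient in infinite dimension) of (5) with respect to $\mu$ is simply $g - G\nu$, and the derivative of (5) with respect to $\nu$ is $-G^\dagger\mu$, where $G^\dagger:\mathcal{M}(\mathcal{W})\to\mathcal{F}(\Theta)$ is the adjoint operator of $G$ defined via the relation

$$\forall\,\mu\in\mathcal{M}(\mathcal{W}),\ \nu\in\mathcal{M}(\Theta),\qquad \langle\mu, G\nu\rangle = \langle\nu, G^\dagger\mu\rangle. \tag{6}$$

One can easily check that $(G^\dagger\mu)(\theta) \coloneqq \mathbb{E}_{X\sim P_\theta}\mathbb{E}_{w\sim\mu}[f_w(X)]$ achieves the equality in (6).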
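For completeness, the check can be written out in one line. The sketch below assumes that $(G\nu)(w) = \mathbb{E}_{\theta\sim\nu}\mathbb{E}_{X\sim P_\theta}[f_w(X)]$, takes $(G^\dagger\mu)(\theta) = \mathbb{E}_{X\sim P_\theta}\mathbb{E}_{w\sim\mu}[f_w(X)]$ as the candidate adjoint, and assumes $f_w(X)$ is bounded so that the order of integration can be exchanged (Fubini).

```latex
\begin{align*}
\langle \mu, G\nu \rangle
  &= \int_{\mathcal{W}} (G\nu)(w)\, d\mu(w)
   = \mathbb{E}_{w\sim\mu}\,\mathbb{E}_{\theta\sim\nu}\,\mathbb{E}_{X\sim P_\theta}\big[f_w(X)\big] \\
  &= \mathbb{E}_{\theta\sim\nu}\Big[\mathbb{E}_{X\sim P_\theta}\,\mathbb{E}_{w\sim\mu}\big[f_w(X)\big]\Big]
   = \int_{\Theta} (G^\dagger\mu)(\theta)\, d\nu(\theta)
   = \langle \nu, G^\dagger\mu \rangle .
\end{align*}
```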

To summarize, the mixed strategy formulation of Wasserstein GAN is (5), whose derivatives can be expressed in terms of $g$ and $G$. We now make the crucial observation that (5) is exactly the infinite-dimensional analogue of (1): The distributions over finite strategies are replaced with probability measures over a continuous parameter set, the vector $a$ is replaced with a function $g$, the matrix $A$ is replaced with a linear operator $G$ (the linearity of $G$ trivially follows from the linearity of expectation), and the gradients are replaced with Fréchet derivatives. Based on Section 2.1, it is then natural to ask:

Can the entropic Mirror Descent and Mirror-Prox be extended to infinite dimension to solve (5)? Can we retain the convergence rates?

We provide an affirmative answer to both questions in the next section.

Remark. The derivation in Section 2.2 can be applied to any GAN objective.

3 Infinite-Dimensional Prox Methods

This section builds a rigorous infinite-dimensional formalism in parallel to the finite-dimensional prox methods and proves their convergence rates. While simple in retrospect, to our knowledge, these results are new.

3.1 Preparation: The Mirror Descent Iterates

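We first recall the notion of (Fréchet) derivative in infinite-dimensional spaces. A (nonlinear) functional $\Phi:\mathcal{M}(\mathcal{Z})\to\mathbb{R}$ is said to possess a derivative at $\mu\in\mathcal{M}(\mathcal{Z})$ if there exists a function $d\Phi(\mu)\in\mathcal{F}(\mathcal{Z})$ such that, for all $\mu'\in\mathcal{M}(\mathcal{Z})$, we have

$$\Phi(\mu + \epsilon\mu') = \Phi(\mu) + \epsilon\langle\mu', d\Phi(\mu)\rangle + o(\epsilon).$$

Similarly, a (nonlinear) functional $\Phi^\star:\mathcal{F}(\mathcal{Z})\to\mathbb{R}$ is said to possess a derivative at $h\in\mathcal{F}(\mathcal{Z})$ if there exists a measure $d\Phi^\star(h)\in\mathcal{M}(\mathcal{Z})$ such that, for all $h'\in\mathcal{F}(\mathcal{Z})$, we have

$$\Phi^\star(h + \epsilon h') = \Phi^\star(h) + \epsilon\langle d\Phi^\star(h), h'\rangle + o(\epsilon).$$

The most important functionals in this paper are the (negative) Shannon entropy

$$\mu\in\mathcal{M}(\mathcal{Z}),\qquad \Phi(\mu) \coloneqq \int d\mu\,\log\frac{d\mu}{dz}$$

and its Fenchel dual

$$h\in\mathcal{F}(\mathcal{Z}),\qquad \Phi^\star(h) \coloneqq \log\int e^{h}\,dz.$$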

The first result of our paper is to show that, in direct analogy to (2), the infinite-dimensional MD iterates can be expressed as:

Theorem 1 (Infinite-Dimensional Mirror Descent, informal).

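For a learning rate $\eta$ and an arbitrary function $h$, we can equivalently define

$$\mu_+ = \mathrm{MD}_\eta(\mu, h) \;\equiv\; \mu_+ = d\Phi^\star\big(d\Phi(\mu) - \eta h\big) \;\equiv\; d\mu_+ = \frac{e^{-\eta h}\,d\mu}{\int e^{-\eta h}\,d\mu}. \tag{7}$$

Moreover, most of the essential ingredients in the analysis of finite-dimensional prox methods can be generalized to infinite dimension.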

See Theorem 4 of Appendix A for precise statements and a long list of “essential ingredients of prox methods” generalizable to infinite dimension.
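The equivalence of the last two expressions in (7) follows directly from the derivative formulas (12) and (13) of Theorem 4. As a short sketch, writing $d\mu = \rho\,dz$:

```latex
\begin{align*}
d\Phi(\mu) - \eta h &= 1 + \log\rho - \eta h, \\
d\Phi^\star\big(d\Phi(\mu) - \eta h\big)
  &= \frac{e^{\,1 + \log\rho - \eta h}\,dz}{\int e^{\,1 + \log\rho - \eta h}\,dz}
   = \frac{\rho\, e^{-\eta h}\,dz}{\int \rho\, e^{-\eta h}\,dz}
   = \frac{e^{-\eta h}\,d\mu}{\int e^{-\eta h}\,d\mu}.
\end{align*}
```

The constant $1$ in $d\Phi(\mu)$ cancels between numerator and denominator, exactly as in the finite-dimensional update (2).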

3.2 Infinite-Dimensional Prox Methods and Convergence Rates

Armed with results in Section 3.1, we now introduce two “conceptual” algorithms for solving the mixed NE of Wasserstein GANs: The infinite-dimensional entropic MD in Algorithm 1 and MP in Algorithm 2. These algorithms iterate over probability measures and cannot be directly used in practice, but they possess rigorous convergence rates, and hence motivate the reduction procedure in Section 4 to come.

Theorem 2 (Convergence Rates).

Let . Let be a constant such that , and be such that and . Let be the relative entropy, and denote by the initial distance to the mixed NE. Then

1. Assume that we have access to the deterministic derivatives and . Then Algorithm 1 achieves -NE with , and Algorithm 2 achieves -NE with .

2. Assume that we have access to unbiased stochastic derivatives and such that

, and the variance is upper bounded by

. Then Algorithm 1 with stochastic derivatives achieves -NE in expectation with , and Algorithm 2 with stochastic derivatives achieves -NE in expectation with .

The proofs can be found in Appendices B and C.

Remark. If, as in previous work Arora et al. (2017), we assume the output of the discriminator to be bounded by , then we have and in Theorem 2.

4 From Theory to Practice

Section 4.1 reduces Algorithm 1 and Algorithm 2 to a sampling routine Welling & Teh (2011) that has been widely used in machine learning. Section 4.2 proposes to further simplify the algorithms by summarizing a batch of samples by their mean.

For simplicity, we will only derive the algorithm for entropic MD; the case for entropic MP is similar but requires more computation. To ease the notation, we assume $\eta = 1$ throughout this section, as $\eta$ does not play an important role in the derivation below.

4.1 Implementable Entropic MD: From Probability Measure to Samples

Consider Algorithm 1. The reduction consists of three steps.

Step 1: Reformulating Entropic Mirror Descent Iterates

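The definition of the MD iterate (7) relates the updated probability measure $\mu_{t+1}$ to the current probability measure $\mu_t$, but it tells us nothing about the density function of $\mu_{t+1}$, from which we want to sample. Our first step is to express (7) in a more tractable form. By recursively applying (7) and using item 10 of Theorem 4 in Appendix A, we have, for some constants $C_1, \ldots, C_{T-1}$,

$$\begin{aligned} d\Phi(\mu_T) &= d\Phi(\mu_{T-1}) - (-g + G\nu_{T-1}) + C_{T-1} \\ &= d\Phi(\mu_{T-2}) - (-g + G\nu_{T-2}) - (-g + G\nu_{T-1}) + C_{T-1} + C_{T-2} \\ &= \cdots = d\Phi(\mu_1) - \Big(-(T-1)g + G\sum_{s=1}^{T-1}\nu_s\Big) + \sum_{s=1}^{T-1} C_s. \end{aligned}$$

For simplicity, assume that $\mu_1$ is uniform so that $d\Phi(\mu_1)$ is a constant function. Then, by (13) and the fact that $d\Phi^\star(d\Phi(\mu_T)) = \mu_T$, we see that the density function of $\mu_T$ is simply

$$d\mu_T = \frac{\exp\big\{(T-1)g - G\sum_{s=1}^{T-1}\nu_s\big\}\,dw}{\int \exp\big\{(T-1)g - G\sum_{s=1}^{T-1}\nu_s\big\}\,dw}.$$

Similarly, we have

$$d\nu_T = \frac{\exp\big\{G^\dagger\sum_{s=1}^{T-1}\mu_s\big\}\,d\theta}{\int \exp\big\{G^\dagger\sum_{s=1}^{T-1}\mu_s\big\}\,d\theta}.$$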
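The recursion in Step 1 can be illustrated on a one-dimensional grid. The sketch below is only a discretized stand-in for the measure-valued update (7) with $\eta = 1$: the grid, the Riemann-sum normalization, and the three test functions playing the role of the derivatives are all illustrative assumptions. It checks that iterating (7) from a uniform density yields a density proportional to the exponential of the negated sum of the functions.

```python
import numpy as np

z = np.linspace(0.0, 1.0, 2001)   # grid on the (assumed) domain [0, 1]
dz = z[1] - z[0]

def normalize(rho):
    """Rescale so the Riemann sum of the density equals one."""
    return rho / (rho.sum() * dz)

def md_step(rho, h, eta=1.0):
    """Discretized version of (7): d mu_+ proportional to exp(-eta * h) d mu."""
    return normalize(rho * np.exp(-eta * h))

# hypothetical derivative functions h_1, h_2, h_3 evaluated on the grid
hs = [np.sin(3.0 * z), z ** 2, -np.cos(5.0 * z)]

rho = normalize(np.ones_like(z))  # uniform initial density
for h in hs:
    rho = md_step(rho, h)

# closed form predicted by unrolling the recursion
closed_form = normalize(np.exp(-sum(hs)))
```

Because the normalizing constants only contribute additive constants on the dual side, they never need to be tracked, which is what makes the closed form available.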

Step 2: Empirical Approximation for Stochastic Derivatives

The derivatives of (5) involve the function $g$ and the operator $G$. Recall that $g$ requires taking expectation over the real data distribution, which we do not have access to. A common approach is to replace the true expectation with its empirical average:

$$g(w) = \mathbb{E}_{X\sim P_{\mathrm{real}}}[f_w(X)] \simeq \frac{1}{n}\sum_{i=1}^n f_w(X_i^{\mathrm{real}}) \triangleq \hat g(w)$$

where the $X_i^{\mathrm{real}}$'s are real data and $n$ is the batch size. Clearly, $\hat g$ is an unbiased estimator of $g$.

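On the other hand, $G\nu_t$ and $G^\dagger\mu_t$ involve expectation over $\nu_t$ and $\mu_t$, respectively, and also over the fake data distribution $P_\theta$. Therefore, if we are able to draw samples from $\mu_t$ and $\nu_t$, then we can again approximate the expectation via the empirical average:

$$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n')} \sim \nu_t,\quad \{X_i^{(j)}\}_{i=1}^n \sim P_{\theta^{(j)}},\qquad \hat G\nu_t(w) \simeq \frac{1}{nn'}\sum_{i=1}^n\sum_{j=1}^{n'} f_w(X_i^{(j)}),$$

$$w^{(1)}, w^{(2)}, \ldots, w^{(n')} \sim \mu_t,\quad \{X_i\}_{i=1}^n \sim P_\theta,\qquad \hat G^\dagger\mu_t(\theta) \simeq \frac{1}{nn'}\sum_{i=1}^n\sum_{j=1}^{n'} f_{w^{(j)}}(X_i).$$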
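The three empirical averages in Step 2 can be sketched in a few lines of NumPy. Everything concrete here is a toy assumption made for illustration: a scalar linear critic `f(w, x) = w * x`, a generator whose output distribution is a unit-variance Gaussian centered at `theta`, and real data drawn from a Gaussian centered at 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w, x):
    """Hypothetical critic f_w(x): linear in x."""
    return w * x

def sample_fake(theta, n):
    """Hypothetical generator: n draws from P_theta = N(theta, 1)."""
    return theta + rng.standard_normal(n)

def g_hat(w, x_real):
    """Empirical average of f_w over a batch of real data."""
    return f(w, x_real).mean()

def G_nu_hat(w, thetas, n):
    """Average f_w over n fake points for each of the n' samples theta^(j)."""
    return np.mean([f(w, sample_fake(th, n)).mean() for th in thetas])

def G_dagger_mu_hat(theta, ws, n):
    """Average f_{w^(j)} over n fake points from P_theta and n' samples w^(j)."""
    x = sample_fake(theta, n)
    return np.mean([f(w, x).mean() for w in ws])

x_real = 2.0 + rng.standard_normal(20000)               # toy "real" batch
est_g = g_hat(1.5, x_real)                              # close to 1.5 * 2
est_G = G_nu_hat(1.5, thetas=[0.0, 1.0], n=20000)       # close to 1.5 * 0.5
est_Gd = G_dagger_mu_hat(1.0, ws=[1.0, 3.0], n=20000)   # close to 2.0 * 1
```

In this toy model the exact expectations are available in closed form, which makes it easy to confirm that the estimators concentrate around them as the batch size grows.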

Now, assuming that we have obtained unbiased stochastic derivatives $-\hat g + \hat G\nu_t$ and $-\hat G^\dagger\mu_t$, how do we actually draw samples from $\mu_{t+1}$ and $\nu_{t+1}$? Provided we can answer this question, we can start with two easy-to-sample distributions $(\mu_1, \nu_1)$, and then we will be able to draw samples from $(\mu_2, \nu_2)$. These samples in turn will allow us to draw samples from $(\mu_3, \nu_3)$, and so on. Therefore, it only remains to answer the above question. This leads us to:

Step 3: Sampling by Stochastic Gradient Langevin Dynamics

For any probability distribution with density function $e^{-h}\,dz$, the Stochastic Gradient Langevin Dynamics (SGLD) Welling & Teh (2011) iterates as

$$z_{k+1} = z_k - \gamma\hat\nabla h(z_k) + \sqrt{2\gamma}\,\epsilon\,\xi_k, \tag{8}$$

where $\gamma$ is the step-size, $\hat\nabla h$ is an unbiased estimator of $\nabla h$, $\epsilon$ is the thermal noise, and $\xi_k$ is a standard normal vector, independently drawn across different iterations.
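As a sanity check of the SGLD iteration, the following NumPy sketch samples from a one-dimensional Gaussian. The target $h(z) = (z-1)^2/2$, the artificial gradient noise, the step-size, and the use of many parallel chains are illustrative assumptions; the thermal noise is set to 1 so that the target density is proportional to $e^{-h}$.

```python
import numpy as np

def sgld(grad_h_hat, z0, gamma, eps, n_steps, rng):
    """Iterate z <- z - gamma * grad_h_hat(z) + sqrt(2 * gamma) * eps * xi."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(z.shape)   # fresh standard normal noise
        z = z - gamma * grad_h_hat(z, rng) + np.sqrt(2.0 * gamma) * eps * xi
    return z

def noisy_grad(z, rng):
    # h(z) = (z - 1)^2 / 2, so grad h(z) = z - 1; the added zero-mean noise
    # mimics a mini-batch estimate while keeping the estimator unbiased
    return (z - 1.0) + 0.1 * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
# 5000 independent chains, all started at 0 and run for 2000 steps
samples = sgld(noisy_grad, z0=np.zeros(5000), gamma=0.01, eps=1.0,
               n_steps=2000, rng=rng)
```

For this target the final states should be approximately distributed as a unit-variance Gaussian centered at 1, up to a small discretization bias that shrinks with the step-size.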

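Suppose we start at $(\mu_1, \nu_1)$. Plugging $h \leftarrow -\hat G^\dagger\mu_1$ and $h \leftarrow -\hat g + \hat G\nu_1$ into (8), we obtain, for $\{\theta_k\}$ targeting $\nu_2$ and $\{w_k\}$ targeting $\mu_2$, the following update rules:

$$\begin{aligned} \theta_{k+1} &= \theta_k + \gamma\nabla_\theta\Big(\frac{1}{nn'}\sum_{i=1}^n\sum_{j=1}^{n'} f_{w^{(j)}}(X_i)\Big) + \sqrt{2\gamma}\,\epsilon\,\xi_k, \\ w_{k+1} &= w_k + \gamma\nabla_w\Big(\frac{1}{n}\sum_{i=1}^n f_{w_k}(X_i^{\mathrm{real}}) - \frac{1}{nn'}\sum_{i=1}^n\sum_{j=1}^{n'} f_{w_k}(X_i^{(j)})\Big) + \sqrt{2\gamma}\,\epsilon\,\xi'_k, \end{aligned}$$

where $\{X_i\}_{i=1}^n \sim P_{\theta_k}$ in the first update and the noise terms $\xi_k, \xi'_k$ are those of (8).

The theory of Welling & Teh (2011) states that, for large enough $k$, the iterates of SGLD above (approximately) generate samples according to the probability measures $(\mu_2, \nu_2)$. We can then apply this process recursively to obtain samples from $(\mu_3, \nu_3), \ldots, (\mu_T, \nu_T)$. Finally, since the entropic MD and MP output the averaged measure $(\bar\mu_T, \bar\nu_T)$, it suffices to pick a random index $\hat t \in \{1, \ldots, T\}$ and then output samples from $(\mu_{\hat t}, \nu_{\hat t})$.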

Putting Steps 1-3 together, we obtain Algorithms 4 and 5 in Appendix D.

Remark. In principle, any first-order sampling method is valid above. In the experimental section, we also use an RMSProp-preconditioned version of SGLD Li et al. (2016).

4.2 Summarizing Samples by Averaging: A Simple yet Effective Heuristic

Although Algorithm 4 and 5 are implementable, they are quite complicated and resource-intensive, as the total computational complexity is . This high complexity comes from the fact that, when computing the stochastic derivatives, we need to store all the historical samples and evaluate new gradients at these samples.

An intuitive approach to alleviate the above issue is to try to summarize each distribution by only one parameter. To this end, the mean of the distribution is the most natural candidate, as it not only stabilizes the algorithm, but also is often easier to acquire than the actual samples. For instance, computing the mean of distributions of the form $e^{-h}\,dz$, where $h$ is a loss function defined by deep neural networks, has been empirically proven successful in Chaudhari et al. (2017, 2018); Dziugaite & Roy (2018) via SGLD. In this paper, we adopt the same approach as in Chaudhari et al. (2017), where we use exponential damping (see the damping term in Algorithm 3) to increase stability. Algorithm 3, dubbed the Mirror-GAN, shows how to incorporate this idea into entropic MD; the pseudocode for the similar Mirror-Prox-GAN can be found in Algorithm 6 of Appendix D.
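To illustrate what summarizing samples by a damped mean can look like, here is a minimal NumPy sketch of an exponentially damped running average over a stream of sampler iterates. Algorithm 3 is not reproduced in this section, so the specific update rule below (a convex combination with damping factor `beta`) is an assumption chosen for illustration, not the paper's exact procedure.

```python
import numpy as np

def damped_mean(samples, beta):
    """Running summary: w_bar <- (1 - beta) * w_bar + beta * w (assumed form)."""
    w_bar = np.array(samples[0], dtype=float)   # initialize at the first sample
    for w in samples[1:]:
        w_bar = (1.0 - beta) * w_bar + beta * np.asarray(w, dtype=float)
    return w_bar

# a constant stream is left unchanged by the damping
const = damped_mean(np.full((50, 2), 3.0), beta=0.1)

# hypothetical stream of two-dimensional sampler iterates centered at 1
rng = np.random.default_rng(0)
draws = 1.0 + rng.standard_normal((500, 2))
summary = damped_mean(draws, beta=0.05)
```

Only a single vector per distribution is stored, so the memory cost no longer grows with the number of historical samples; the price is that the summary reflects the mean rather than the full distribution.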

5 Experimental Evidence

The purpose of our experiments is twofold. First, we use established baselines to demonstrate that Mirror- and Mirror-Prox-GAN consistently achieve performance better than or comparable to common algorithms. Second, we report that our algorithms are stable and always improve as the training process goes on. This is in contrast to unstable training algorithms, such as Adam, which often collapse to noise as the iteration count grows Cha (2017).

We use visual quality of the generated images to evaluate different algorithms. We avoid reporting numerical metrics, as recent studies Barratt & Sharma (2018); Borji (2018); Lucic et al. (2018) suggest that these metrics might be flawed. Setting of the hyperparameters and more auxiliary results can be found in Appendix E.

5.1 Synthetic Data

We repeat the synthetic setup of Gulrajani et al. (2017). The tasks include learning the distribution of 8 Gaussian mixtures, 25 Gaussian mixtures, and the Swiss Roll. For both the generator and discriminator, we use MLPs with three hidden layers of 512 neurons. We choose SGD and Adam as baselines, and we compare them to Mirror- and Mirror-Prox-GAN. All algorithms are run up to iterations (Footnote 4: One iteration here means using one mini-batch of data. It does not correspond to the $T$ in our algorithms, as there might be multiple SGLD iterations within each time step $t$.). The results of 25 Gaussian mixtures are shown in Figure 1; an enlarged figure of 25 Gaussian mixtures and other cases can be found in Appendix E.1.

As Figure 1 shows, SGD performs poorly in this task, while the other algorithms yield reasonable results. However, compared to Adam, Mirror- and Mirror-Prox-GAN fit the true distribution better in two aspects. First, the modes found by Mirror- and Mirror-Prox-GAN are more accurate than the ones by Adam, which are perceivably biased. Second, Mirror- and Mirror-Prox-GAN perform much better in capturing the variance (how spread the blue dots are), while Adam tends to collapse to modes. These observations are consistent throughout the synthetic experiments; see Appendix E.1.

5.2 Real Data

For real images, we use the LSUN bedroom dataset Yu et al. (2015). We have also conducted a similar study with MNIST; see Appendix E.2.1 for details.

We use the same architecture (DCGAN) as in Radford et al. (2015) with batch normalization. As the networks become deeper in this case, the gradient magnitudes differ significantly across different layers. As a result, non-adaptive methods such as SGD or SGLD do not perform well in this scenario. To alleviate such issues, we replace SGLD by the RMSProp-preconditioned SGLD Li et al. (2016) for our sampling routines. For baselines, we consider two adaptive gradient methods: RMSProp and Adam.

Figure 2 shows the results at the th iteration. The RMSProp and Mirror-GAN produce images with reasonable quality, while Adam outputs noise. The visual quality of Mirror-GAN is better than RMSProp, as RMSProp sometimes generates blurry images (the - and -th entry of Figure 8.(b)).

It is worth mentioning that Adam can learn the true distribution at intermediate iterations, but later on suffers from mode collapse and finally degenerates to noise; see Appendix E.2.2.

6 Conclusions

Systematically understanding and expanding on the game-theoretic perspective of mixed NE, together with stochastic Langevin dynamics, is a promising research vein for training GANs. While simple in retrospect, our work provides guidelines for developing approximate infinite-dimensional prox methods that closely mimic the provable optimization framework for learning the mixed NE of GANs. Our proposed Mirror- and Mirror-Prox-GAN algorithms feature cheap per-iteration complexity while rapidly converging to solutions of good quality.

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n 725594 - time-data), and Microsoft Research through its PhD scholarship Programme.

Appendix A A Framework for Infinite-Dimensional Mirror Descent

A.1 A Note on the Regularity

It is known that the (negative) Shannon entropy is not Fréchet differentiable in general. However, below we show that the Fréchet derivative can be well-defined if we restrict the probability measures to within the set

$$\mathcal{M}(\mathcal{Z}) \coloneqq \left\{\begin{array}{l}\text{all probability measures on } \mathcal{Z} \text{ that admit densities w.r.t. the Lebesgue measure,} \\ \text{and the density is continuous and positive almost everywhere on } \mathcal{Z}\end{array}\right\}.$$

We will also restrict the set of functions to be bounded and integrable:

$$\mathcal{F}(\mathcal{Z}) \coloneqq \{\text{all bounded integrable functions on } \mathcal{Z}\}.$$

It is important to notice that $\mu\in\mathcal{M}(\mathcal{Z})$ and $h\in\mathcal{F}(\mathcal{Z})$ imply $\mathrm{MD}_\eta(\mu, h)\in\mathcal{M}(\mathcal{Z})$; this readily follows from the formula (7).

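A.2 Properties of Entropic Mirror Map

The total variation of a (possibly non-probability) measure $\mu$ is defined as [25]

$$\|\mu\|_{TV} = \sup_{\|h\|_{L^\infty}\le 1}\int h\,d\mu = \sup_{\|h\|_{L^\infty}\le 1}\langle\mu, h\rangle.$$

We start from the fundamental Gibbs Variational Principle, which dates back to the earliest work of statistical mechanics [17]. For two probability measures $\mu, \mu'$, denote their relative entropy by (the reason for this notation will become clear in (14))

$$D_\Phi(\mu, \mu') \coloneqq \int_{\mathcal{Z}} d\mu\,\log\frac{d\mu}{d\mu'}.$$
Theorem 3 (Gibbs Variational Principle).

Let $h\in\mathcal{F}(\mathcal{Z})$ and $\mu'\in\mathcal{M}(\mathcal{Z})$ be a reference measure. Then

$$\log\int_{\mathcal{Z}} e^{h}\,d\mu' = \sup_{\mu\in\mathcal{M}(\mathcal{Z})}\ \langle\mu, h\rangle - D_\Phi(\mu, \mu'), \tag{9}$$

and equality is achieved by $d\mu^\star = \dfrac{e^{h}\,d\mu'}{\int_{\mathcal{Z}} e^{h}\,d\mu'}$.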
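The variational principle is easy to verify numerically on a finite set, where measures become probability vectors. In the NumPy sketch below, the finite support, the random reference measure, and the random function `h` are illustrative assumptions; the maximizer used is the Gibbs measure with weights proportional to $e^{h}$ times the reference measure.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 6
mu_ref = rng.dirichlet(np.ones(k))   # reference measure mu' on k points
h = rng.normal(size=k)               # an arbitrary bounded function

def objective(mu):
    """<mu, h> - D(mu, mu_ref), the quantity maximized in (9)."""
    return mu @ h - np.sum(mu * np.log(mu / mu_ref))

lhs = np.log(np.sum(np.exp(h) * mu_ref))   # log of the integral of e^h d mu'
mu_star = np.exp(h) * mu_ref / np.sum(np.exp(h) * mu_ref)   # Gibbs measure
at_star = objective(mu_star)

# no randomly drawn competitor should exceed the left-hand side
best_random = max(objective(rng.dirichlet(np.ones(k))) for _ in range(1000))
```

The Gibbs measure attains the supremum exactly, while every other probability vector falls strictly below it.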

Part of the following theorem is folklore in the mathematics and learning community. However, to the best of our knowledge, the relation to the entropic MD has not been systematically studied before, as we now do.

Theorem 4.

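For a probability measure $d\mu = \rho\,dz$, let $\Phi(\mu) = \int\rho\log\rho\,dz$ be the negative Shannon entropy, and let $\Phi^\star(h) = \log\int_{\mathcal{Z}} e^{h}\,dz$. Then

1. $\Phi^\star$ is the Fenchel conjugate of $\Phi$:

$$\Phi^\star(h) = \sup_{\mu\in\mathcal{M}(\mathcal{Z})}\ \langle\mu, h\rangle - \Phi(\mu), \tag{10}$$
$$\Phi(\mu) = \sup_{h\in\mathcal{F}(\mathcal{Z})}\ \langle\mu, h\rangle - \Phi^\star(h). \tag{11}$$
2. The derivatives admit the expression

$$d\Phi(\mu) = 1 + \log\rho, \tag{12}$$
$$d\Phi^\star(h) = \frac{e^{h}\,dz}{\int_{\mathcal{Z}} e^{h}\,dz} = \arg\max_{\mu\in\mathcal{M}(\mathcal{Z})}\ \langle\mu, h\rangle - \Phi(\mu). \tag{13}$$
3. The Bregman divergence of $\Phi$ is the relative entropy:

$$D_\Phi(\mu, \mu') \coloneqq \Phi(\mu) - \Phi(\mu') - \langle\mu - \mu', d\Phi(\mu')\rangle = \int_{\mathcal{Z}} d\mu\,\log\frac{d\mu}{d\mu'}. \tag{14}$$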
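4. $\Phi$ is 4-strongly convex with respect to the total variation norm: For all $\lambda\in(0,1)$,

$$\Phi(\lambda\mu + (1-\lambda)\mu') \le \lambda\Phi(\mu) + (1-\lambda)\Phi(\mu') - \frac{1}{2}\cdot 4\lambda(1-\lambda)\|\mu - \mu'\|_{TV}^2. \tag{15}$$
5. The following duality relation holds: For any constant $C$, we have

$$\forall\,\mu, \mu'\in\mathcal{M}(\mathcal{Z}),\qquad D_\Phi(\mu, \mu') = D_{\Phi^\star}\big(d\Phi(\mu'), d\Phi(\mu)\big) = D_{\Phi^\star}\big(d\Phi(\mu') + C, d\Phi(\mu)\big). \tag{16}$$
6. $\Phi^\star$ is $\frac{1}{4}$-smooth with respect to $\|\cdot\|_{L^\infty}$:

$$\forall\,h, h'\in\mathcal{F}(\mathcal{Z}),\qquad \big\|d\Phi^\star(h) - d\Phi^\star(h')\big\|_{TV} \le \frac{1}{4}\big\|h - h'\big\|_{L^\infty}. \tag{17}$$
7. Alternative to (17), we have the equivalent characterization of $\Phi^\star$:

$$\forall\,h, h'\in\mathcal{F}(\mathcal{Z}),\qquad \Phi^\star(h) \le \Phi^\star(h') + \langle d\Phi^\star(h'), h - h'\rangle + \frac{1}{2}\cdot\frac{1}{4}\big\|h - h'\big\|_{L^\infty}^2. \tag{18}$$
8. Similar to (16), we have

$$\forall\,h, h',\qquad D_{\Phi^\star}(h, h') = D_\Phi\big(d\Phi^\star(h'), d\Phi^\star(h)\big). \tag{19}$$
9. The following three-point identity holds for all $\mu, \mu', \mu''\in\mathcal{M}(\mathcal{Z})$:

$$\langle\mu'' - \mu, d\Phi(\mu') - d\Phi(\mu)\rangle = D_\Phi(\mu, \mu') + D_\Phi(\mu'', \mu) - D_\Phi(\mu'', \mu'). \tag{20}$$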
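10. Let the Mirror Descent iterate be defined as in (7). Then the following statements are equivalent:

1. $\mu_+ = \mathrm{MD}_\eta(\mu, h)$.

2. There exists a constant $C$ such that $d\Phi(\mu_+) = d\Phi(\mu) - \eta h + C$.

In particular, for any $\mu', \mu''\in\mathcal{M}(\mathcal{Z})$ we have

$$\langle\mu' - \mu'', \eta h\rangle = \langle\mu' - \mu'', d\Phi(\mu) - d\Phi(\mu_+)\rangle. \tag{21}$$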
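Two of the identities in Theorem 4 can be sanity-checked on a finite set, where $\Phi(\mu) = \sum_i \mu_i\log\mu_i$ and $d\Phi(\mu) = 1 + \log\mu$ elementwise. The NumPy sketch below checks that the Bregman divergence of $\Phi$ equals the relative entropy, as in (14), and that the three-point identity (20) holds; the random probability vectors are illustrative stand-ins for $\mu$, $\mu'$, and $\mu''$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, mu1, mu2 = rng.dirichlet(np.ones(5), size=3)   # stand-ins for mu, mu', mu''

def phi(m):
    """Negative Shannon entropy on a finite set."""
    return np.sum(m * np.log(m))

def dphi(m):
    """Derivative of phi, the discrete analogue of (12)."""
    return 1.0 + np.log(m)

def kl(m, m_ref):
    """Relative entropy D(m, m_ref)."""
    return np.sum(m * np.log(m / m_ref))

def bregman(m, m_ref):
    """Bregman divergence of phi, the left-hand definition in (14)."""
    return phi(m) - phi(m_ref) - (m - m_ref) @ dphi(m_ref)

breg = bregman(mu, mu1)
rel_ent = kl(mu, mu1)

# three-point identity (20)
lhs = (mu2 - mu) @ (dphi(mu1) - dphi(mu))
rhs = kl(mu, mu1) + kl(mu2, mu) - kl(mu2, mu1)
```

Both checks rely only on the fact that probability vectors sum to one, which is what removes the constant term of $d\Phi$.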
Proof.

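1. Equation (10) is simply the Gibbs variational principle (9) with $d\mu' = dz$.

By (10), we know that

$$\forall\,h\in\mathcal{F}(\mathcal{Z}),\qquad \Phi(\mu) \ge \langle\mu, h\rangle - \log\int_{\mathcal{Z}} e^{h}\,dz. \tag{22}$$

But for $d\mu = \rho\,dz$, the function $h = 1 + \log\rho$ saturates the equality in (22).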

2. We prove a more general result on the Bregman divergence in (23) below.

Let , and . Let be small enough such that is absolutely continuous with respect to ; note that this is possible because , and . We compute

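$$\begin{aligned} D_\Phi(\rho + \epsilon\rho'', \rho') &= \int_{\mathcal{Z}}\rho\log\frac{\rho}{\rho'} + \int_{\mathcal{Z}}\rho\log\Big(1 + \epsilon\frac{\rho''}{\rho}\Big) + \epsilon\int_{\mathcal{Z}}\rho''\log\frac{\rho}{\rho'} + \epsilon\int_{\mathcal{Z}}\rho''\log\Big(1 + \epsilon\frac{\rho''}{\rho}\Big) \\ &\overset{(i)}{=} \int_{\mathcal{Z}}\rho\log\frac{\rho}{\rho'} + \epsilon\int_{\mathcal{Z}}\rho'' + \epsilon\int_{\mathcal{Z}}\rho''\log\frac{\rho}{\rho'} + \epsilon^2\int_{\mathcal{Z}}\frac{\rho''^2}{\rho} + o(\epsilon) \\ &= D_\Phi(\rho, \rho') + \epsilon\int_{\mathcal{Z}}\rho''\Big(1 + \log\frac{\rho}{\rho'}\Big) + o(\epsilon), \end{aligned}$$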

where (i) uses as . In short, for all ,

 dμDΦ(μ,μ′)(μ′′)=⟨μ′′,1+logρρ′