# Provably convergent quasistatic dynamics for mean-field two-player zero-sum games

In this paper, we study the problem of finding mixed Nash equilibria for mean-field two-player zero-sum games. Solving this problem requires optimizing over two probability distributions. We consider a quasistatic Wasserstein gradient flow dynamics in which one probability distribution follows the Wasserstein gradient flow while the other is always at equilibrium. Theoretical analysis is conducted on this dynamics, showing its convergence to the mixed Nash equilibrium under mild conditions. Inspired by the continuous dynamics of probability distributions, we derive a quasistatic Langevin gradient descent method with inner-outer iterations, and test the method on different problems, including training mixtures of GANs.


## 1 Introduction

Finding Nash equilibria has seen many important applications in machine learning, such as generative adversarial networks (GANs) (Goodfellow et al., 2014a) and multi-agent reinforcement learning (Busoniu et al., 2008). In these problems, a pure Nash equilibrium is usually sought for a payoff function $f(x,y)$, i.e. one solves $\min_x \max_y f(x,y)$. Yet, the problems arising from machine learning are usually nonconvex in $x$ and nonconcave in $y$, in which case a pure Nash equilibrium may not exist. And even if it exists, there is no guarantee that any optimization algorithm can find it efficiently. This difficulty is reflected in practice: compared with simple minimization, machine learning applications involving Nash equilibria usually exhibit more complicated behaviors and more subtle dependence on hyper-parameters. For example, stable and efficient training of GANs requires a number of carefully designed tricks (Gao et al., 2018).

On the other hand, the mixed Nash equilibrium (MNE) is known to exist in much more general settings, e.g. when the strategy spaces are compact and the payoff function is continuous (Glicksberg, 1952). In the mixed Nash equilibrium problem, instead of taking "pure strategies" $x$ and $y$, two "mixed strategies" for $x$ and $y$, in the form of probability distributions, are considered, resulting in the following functional:

$$\int f(x,y)\,p(x)\,q(y)\,dx\,dy,$$

where $p$ and $q$ are the density functions of the probability distributions of $x$ and $y$, respectively. Efforts have been invested to develop theoretically endorsed algorithms that can efficiently find the MNE for high-dimensional problems, with applications to the training of mixtures of GANs. In Hsieh et al. (2019), a mirror-descent algorithm is proposed and its convergence is proven. In Domingo-Enrich et al. (2020), theoretical analysis and empirical experiments are conducted for a gradient descent-ascent flow under a Wasserstein-Fisher-Rao metric and its particle discretization.

In this paper, we also consider the mixed Nash equilibrium problem, and propose a simple QuasiStatic Wasserstein Gradient Flow (QSWGF) for solving it. In our dynamics, we treat $q$ as a component with much faster speed than $p$, hence $q$ is always at equilibrium as $p$ moves. With entropy regularization for both $p$ and $q$ (without requirements on the strength of the regularization), we prove that the QSWGF converges to the unique mixed Nash equilibrium from any initialization (under mild conditions). Furthermore, we show there is a simple way to discretize the QSWGF, despite the complexity of the Wasserstein gradient flow of $p$ induced by the fact that $q$ is always at equilibrium. Concretely, a partition function related with $q$ appears in the QSWGF dynamics, and we find an efficient way to approximate this partition function. By discretizing the QSWGF, we derive a particle dynamics with an inner-outer structure, named the QuasiStatic Langevin Gradient Descent algorithm (QSLGD). In QSLGD, after each iteration of the outer problem (for the $x$ particles), the inner loop conducts sufficient iterations to bring the $y$ particles to equilibrium. Numerical experiments show the effectiveness of QSLGD on synthetic examples and training mixtures of GANs. Our method outperforms the vanilla Langevin gradient descent-ascent method when the entropy regularization is weak.

As a summary, our two major contributions are:

1. We propose the quasistatic Wasserstein gradient flow dynamics for mixed Nash equilibrium problems, and show its convergence to the unique Nash equilibrium under weak assumptions. Our result neither requires the entropy regularization to be sufficiently strong, nor assumes a priori that the dynamics converge.

2. We derive a simple yet practical quasistatic Langevin gradient descent algorithm by discretizing the quasistatic Wasserstein gradient flow, after finding an efficient way to approximate the partition function appearing in the dynamics of $p$. The proposed algorithm is applied to several problems, including training mixtures of GANs.

## 2 Related work

The mixed Nash equilibrium problem has a long history, with the proof of its existence dating back to Morgenstern & Von Neumann (1953). It has drawn renewed attention in recent years, especially in the machine learning community, due to the development of GANs (Goodfellow et al., 2014a) and adversarial training (Goodfellow et al., 2014b). Training mixtures of GANs was already discussed in (Goodfellow et al., 2014a), and some numerical experiments were conducted in (Arora et al., 2017). In Grnarova et al. (2017), the authors proposed an online learning approach for training mixtures of GANs, and proved its effectiveness for semi-shallow GANs (GANs whose discriminator is a shallow neural network). Yet, rigorous theoretical treatment of an algorithm started from (Hsieh et al., 2019), in which a mirror descent method was studied and proven to converge. The implementation of the mirror descent method involves a large computational cost that calls for heuristics to alleviate. Later, (Domingo-Enrich et al., 2020) studied more efficient algorithms under a mixture of Wasserstein and Fisher-Rao metrics. Theoretically, the time average of the dynamics' trajectories is shown to converge to the mixed Nash equilibrium. As a comparison, in this work we show the global convergence of the quasistatic Wasserstein gradient flow without the need of taking a time average. Meanwhile, the Wasserstein nature of our dynamics makes it easy to implement as well.

The Wasserstein gradient flow in the density space has been explored in previous works. For example, (Wang & Li, 2019) studied Nesterov's accelerated gradient flows for probability distributions under the Wasserstein metric, and (Arbel et al., 2019) studied practical implementations of the natural gradient method for the Wasserstein metric. Both works focus on minimization problems instead of the min-max problems considered in this work. A more related work is Lin et al. (2021b), where a natural-gradient-based algorithm is proposed for training GANs. Yet, that method still optimizes one generator and one discriminator, searching for a pure Nash equilibrium. Another work that derives algorithms for GANs from a Wasserstein perspective is (Lin et al., 2021a).

Another body of work that studies the Wasserstein gradient flow in the machine learning context is the mean-field analysis of neural networks. This line of work started from two-layer neural networks (Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2020), and extends to deep fully-connected networks (Araújo et al., 2019; Sirignano & Spiliopoulos, 2021; Nguyen, 2019; Wojtowytsch et al., 2020) and residual networks (Lu et al., 2020; E et al., 2020). The mean-field formulations treat parameters as probability distributions, and the training dynamics are usually gradient flows under the Wasserstein metric. Attempts to prove convergence of the dynamics to global minima have been made (Mei et al., 2018; Chizat & Bach, 2018; Rotskoff et al., 2019), though in the case without entropy regularization a convergence assumption usually has to be made a priori.

## 3 The quasistatic dynamics

We consider the entropy-regularized mixed Nash equilibrium problem, which in our case is equivalent to solving the following minimax problem:

$$\min_{p\in\mathcal{P}(\Omega)}\max_{q\in\mathcal{P}(\Omega)} \int_{\Omega\times\Omega} K(x,y)\,p(x)\,q(y)\,dx\,dy + \beta^{-1}\int_\Omega p\log p\,dx - \beta^{-1}\int_\Omega q\log q\,dy. \tag{1}$$

In (1), $\Omega$ is a compact Riemannian manifold without boundary, and $\mathcal{P}(\Omega)$ is the set of probability distributions on $\Omega$. Since $\Omega$ is compact, any probability distribution in $\mathcal{P}(\Omega)$ naturally has finite moments. Let $E(p,q) = \int_{\Omega\times\Omega} K(x,y)\,p(x)\,q(y)\,dx\,dy$, and let $S(p)$ and $S(q)$ be the (negative) entropies of $p$ and $q$, respectively. Then, the minimax problem (1) can be written in short as

$$\min_{p\in\mathcal{P}(\Omega)}\max_{q\in\mathcal{P}(\Omega)} E(p,q) + \beta^{-1}S(p) - \beta^{-1}S(q). \tag{2}$$
###### Remark 1.

Strictly speaking, in (1) we should distinguish probability distributions from their density functions (if they exist), and the entropy should also be defined using the Radon-Nikodym derivative with respect to the canonical measure. In this paper, since $p$ and $q$ indeed have density functions because of the entropy regularization, we shall abuse notation by using $p$ and $q$ to represent both the probability distributions and their density functions.

The entropy regularization in (1) and (2) makes the problem strongly convex in $p$ and strongly concave in $q$. Hence, there exists a unique Nash equilibrium for the problem. Such results are shown, for example, by the following theorem from (Domingo-Enrich et al., 2020).

###### Theorem 1.

(Theorem 4 of (Domingo-Enrich et al., 2020)) Assume $\Omega$ is a compact Polish metric space equipped with the canonical Borel measure, and that $K$ is a continuous function on $\Omega\times\Omega$. Then, problem (2) has a unique Nash equilibrium given by the solution of the following fixed-point problem:

$$p(x) = \frac{1}{Z_p}\exp(-\beta U(x,q)), \qquad q(y) = \frac{1}{Z_q}\exp(\beta V(y,p)), \tag{3}$$

where $Z_p$ and $Z_q$ are normalization constants that make sure $p$ and $q$ are probability distributions, and $U$ and $V$ are defined as

$$U(x,q) = \frac{\delta E(p,q)}{\delta p}(x) = \int_\Omega K(x,y)\,q(y)\,dy, \qquad V(y,p) = \frac{\delta E(p,q)}{\delta q}(y) = \int_\Omega K(x,y)\,p(x)\,dx.$$
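To make the fixed-point characterization (3) concrete, the following sketch (hypothetical code, not from the paper; the grid, kernel, damping, and $\beta$ are illustrative choices) solves (3) by damped fixed-point iteration on a uniform grid over $\Omega = [0,1)$:

```python
# Hypothetical illustration (not code from the paper): solve the fixed-point
# problem (3) on a uniform grid over Omega = [0, 1) by damped fixed-point
# iteration, with the toy payoff K(x, y) = sin(2*pi*x) * sin(2*pi*y).
import numpy as np

def mixed_nash_fixed_point(K, beta, n_iter=500, damping=0.5):
    """Iterate p ~ exp(-beta * K q), q ~ exp(beta * K^T p) on the grid."""
    m = K.shape[0]
    p = np.full(m, 1.0 / m)
    q = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        p_new = np.exp(-beta * (K @ q))   # unnormalized exp(-beta * U(x, q))
        q_new = np.exp(beta * (K.T @ p))  # unnormalized exp(beta * V(y, p))
        p = (1 - damping) * p + damping * p_new / p_new.sum()
        q = (1 - damping) * q + damping * q_new / q_new.sum()
    return p, q

m = 64
x = (np.arange(m) + 0.5) / m             # grid midpoints on [0, 1)
K = np.outer(np.sin(2 * np.pi * x), np.sin(2 * np.pi * x))
p, q = mixed_nash_fixed_point(K, beta=2.0)
p_fp = np.exp(-2.0 * (K @ q)); p_fp /= p_fp.sum()   # fixed-point residual of (3)
residual = np.abs(p - p_fp).max()
```

For this particular kernel the unique regularized equilibrium is the uniform pair, so `p` and `q` stay uniform and the fixed-point residual vanishes up to floating point.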

Considering efficiency in high-dimensional cases, a natural dynamics of interest for finding the Nash equilibrium of (2) is the gradient descent-ascent flow under the Wasserstein metric,

$$\partial_t p_t = \nabla\cdot\big(p_t\,\nabla(U(x,q_t) + \beta^{-1}\log p_t)\big), \qquad \partial_t q_t = \nabla\cdot\big(q_t\,\nabla(-V(y,p_t) + \beta^{-1}\log q_t)\big), \tag{4}$$

because it can be easily discretized into a Langevin gradient descent-ascent method by treating the PDEs as Fokker-Planck equations of SDEs. When $\beta^{-1}$ is sufficiently large, (4) can be proven to converge linearly to the unique MNE of (2) (Eberle et al., 2019). However, when $\beta^{-1}$ is small, whether (4) converges remains open. This hinders the application of (4), because in practice the entropy terms are usually used as regularization and are kept small. (We learned that convergence is proven in Domingo-Enrich & Bruna (2022), which appeared while our work was under review.)

In (4), the dynamics of $p$ and $q$ have the same speed. In this work, instead, we study a quasistatic Wasserstein gradient descent dynamics, which can be understood as the limiting dynamics when the speed of $q$ becomes faster and faster compared with that of $p$. In this case, at any time $t$, we assume $q_t$ reaches the equilibrium of the maximization problem instantaneously with $p_t$ fixed in (2). That is to say, at any time $t$, $q_t$ is determined by

$$q_t = q[p_t] := \operatorname*{argmax}_{q\in\mathcal{P}(\Omega)}\; E(p_t,q) - \beta^{-1}S(q). \tag{5}$$

On the other hand, $p_t$ follows the Wasserstein gradient descent flow with $q$ at the equilibrium:

$$\partial_t p_t = \nabla\cdot\Big(p_t\,\nabla\Big(\frac{\delta\big(E(p_t,q[p_t]) - \beta^{-1}S(q[p_t])\big)}{\delta p_t} + \beta^{-1}\log p_t\Big)\Big). \tag{6}$$

The following theorem shows that $q[p_t]$ can be explicitly written as a Gibbs distribution depending on $p_t$, and thus the free energy in (6) can be simplified to one depending on a partition function related with $q$.

###### Theorem 2.

Assume $K$ is continuous on the compact set $\Omega\times\Omega$, and $\beta > 0$. Then, for fixed $p_t$ the maximization problem (5) has a unique solution

$$q[p_t](y) := \frac{1}{Z_q(p_t)}\exp(\beta V(y,p_t)), \tag{7}$$

where $Z_q(p_t) = \int_\Omega \exp(\beta V(y,p_t))\,dy$ is a normalization factor. Moreover, the dynamics (6) for $p_t$ can be written as

$$\partial_t p_t = \nabla\cdot\Big(p_t\,\nabla\Big(\frac{\delta\,\beta^{-1}\log Z_q(p_t)}{\delta p_t} + \beta^{-1}\log p_t\Big)\Big). \tag{8}$$

Let $F_{p,\beta}(p) := \beta^{-1}\log Z_q(p) + \beta^{-1}S(p)$. By Theorem 2, the dynamics (8) of $p_t$ is the Wasserstein gradient descent flow for minimizing $F_{p,\beta}$. By Proposition 3 below, $F_{p,\beta}$ is strongly convex with respect to $p$. Therefore, it is possible to prove global convergence for the dynamics (8), and thus convergence of the quasistatic Wasserstein gradient flow for the minimax problem (2).

###### Proposition 3.

For any probability distributions $p_1 \neq p_2$ in $\mathcal{P}(\Omega)$, and any $\lambda\in(0,1)$, we have

$$F_{p,\beta}\big(\lambda p_1 + (1-\lambda)p_2\big) < \lambda F_{p,\beta}(p_1) + (1-\lambda)F_{p,\beta}(p_2).$$

In practice, the partition function in (8) seems hard to approximate, especially when $\Omega$ is a high-dimensional space. However, we show in the following proposition that the variation of the partition function with respect to $p$ can be written in a simple form involving $q[p]$. This property will be used to derive a particle method in Section 5.

###### Proposition 4.

For any $p\in\mathcal{P}(\Omega)$, we have

$$\frac{\delta\,\beta^{-1}\log Z_q(p)}{\delta p} = U(\cdot, q[p]), \tag{9}$$

where $q[p]$ is defined in (7). Therefore, the dynamics (8) is equivalent to

$$\partial_t p_t = \nabla\cdot\big(p_t\,\nabla(U(x,q[p_t]) + \beta^{-1}\log p_t)\big). \tag{10}$$

## 4 Convergence analysis

In this section, we analyze the convergence of the quasistatic dynamics (7), (8). First, we make the following assumption on $K$.

###### Assumption 1.

Assume $K\in C^\infty(\Omega\times\Omega)$, which means $K$ has continuous derivatives of any order (with respect to both $x$ and $y$).

Since $\Omega$ is compact, Assumption 1 implies boundedness and Lipschitz continuity of all derivatives of $K$.

Now, we state our main theorem, which shows the convergence of QSWGF to the Nash equilibrium.

###### Theorem 5.

(main theorem) Assume Assumption 1 holds for $K$. Then, starting from any initial $p_0\in\mathcal{P}(\Omega)$, the dynamics (7), (8) has a unique solution $(p_t, q[p_t])$, and the solution converges weakly to the unique Nash equilibrium $(p^*, q^*)$ of (2), which satisfies the fixed-point problem (3).

Theorem 5 guarantees convergence of the quasistatic Wasserstein gradient flow for any $\beta > 0$, giving theoretical endorsement to the discretized algorithm that we will introduce in the next section. Note that the initialization of $q$ in the theorem is not important, because we assume $q$ achieves equilibrium immediately after the initialization.

###### Remark 2.

The assumption on $K$'s smoothness can be made weaker. For example, in the proof, derivatives of $K$ up to a finite order are enough to give sufficient regularity to the solution of the dynamics. We make the strong assumption partly to avoid tedious technical analysis, so as to focus on the ideas and insights.

#### Proof sketch

We provide some main steps and ideas of the proof of the main theorem in this section. The detailed proof is put in the appendix.

By the last section, since $q$ is always at equilibrium, we only need to consider a Wasserstein gradient descent flow for $p$. Therefore, we can build our analysis on the theories in (Mei et al., 2018) and (Jordan et al., 1998). However, compared with the analysis therein, our theory deals with a new energy term, $\beta^{-1}\log Z_q(p)$, which has not been studied by previous works. From now on, let $\Psi(x,p) := U(x,q[p])$. By simple calculation we have

$$\Psi(x,p) = U(x,q[p]) = \frac{1}{Z_q(p)}\int_\Omega K(x,y)\exp\Big(\beta\int_\Omega K(x',y)\,p(x')\,dx'\Big)\,dy. \tag{11}$$

First, we study the free energy $F_{p,\beta}$, and show that it has a unique minimizer which satisfies a fixed-point condition. This is a result of the convexity of $F_{p,\beta}$. We have the following lemma.

###### Lemma 1.

Assume Assumption 1 holds for $K$. Then, $F_{p,\beta}$ has a unique minimizer $p^*$, which satisfies

$$F_{p,\beta}(p^*) = \inf_{p\in\mathcal{P}(\Omega)} F_{p,\beta}(p).$$

Moreover, $p^*$ is the unique solution of the following fixed-point problem,

$$p^* = \frac{1}{Z}\exp(-\beta\Psi(x,p^*)), \tag{12}$$

where $Z$ is the normalization factor.

Next, we want to show that any trajectory given by the dynamics (10) converges to the unique minimizer of $F_{p,\beta}$. To achieve this, we first study the existence, uniqueness, and regularity of the solution to (8), i.e. we show that the trajectory indeed exists and is well behaved. Related results are given by the following lemma.

###### Lemma 2.

Assume Assumption 1 holds for $K$. Then, starting from any initial $p_0\in\mathcal{P}(\Omega)$, the weak solution to (8) exists and is unique. Moreover, the solution $p_t$ is smooth for $t > 0$.

The proof of Lemma 2 is based on Proposition 5.1 of (Jordan et al., 1998). In particular, the existence part is proven using the JKO scheme proposed in (Jordan et al., 1998). We consider a sequence of probability distributions given by the following discrete iteration scheme with time step $h$,

$$p_h^0 = p_0, \qquad p_h^k = \operatorname*{argmin}_{p\in\mathcal{P}(\Omega)}\Big\{\frac{1}{2}W_2^2\big(p, p_h^{k-1}\big) + h\,F_{p,\beta}(p)\Big\}, \quad k > 0,$$

where $W_2(p_1,p_2)$ denotes the 2-Wasserstein distance between probability distributions $p_1$ and $p_2$. Let $p_h$ be the piecewise constant interpolation of $\{p_h^k\}$ in time. We show that $p_h$ converges weakly (after taking a subsequence) to a weak solution of (8) as $h$ tends to $0$. Details are given in the appendix.

Finally, noting that $F_{p,\beta}$ is a Lyapunov function of the dynamics (8), we have the following lemma showing the convergence of $p_t$ to the solution of the Boltzmann fixed-point problem (12). This finishes the proof of the main theorem.

###### Lemma 3.

Let $p_t$ be the solution of (8) from any initial $p_0\in\mathcal{P}(\Omega)$. Let $p^*$ be the unique minimizer of $F_{p,\beta}$, given by (12). Then, $p_t$ converges to $p^*$ weakly as $t\to\infty$.

As a byproduct, since our convergence result does not impose requirements on $\beta$, if one is interested in the minimax problem without entropy regularization,

$$\min_{p\in\mathcal{P}(\Omega)}\max_{q\in\mathcal{P}(\Omega)} E(p,q), \tag{13}$$

then Theorem 5 in (Domingo-Enrich et al., 2020) ensures that the quasistatic dynamics converges to an approximate Nash equilibrium of (13) as long as $\beta^{-1}$ is small enough. Specifically, a pair of probability distributions $(p,q)$ is called an $\epsilon$-Nash equilibrium of (13) if

$$\sup_{q'\in\mathcal{P}(\Omega)} E(p,q') - \inf_{p'\in\mathcal{P}(\Omega)} E(p',q) \le \epsilon.$$

Then, we have the following theorem as a direct result of Theorem 5 in (Domingo-Enrich et al., 2020):

###### Theorem 6.

Let $C_K$ be a bound of $K$ satisfying $|K(x,y)|\le C_K$ for any $x,y\in\Omega$, and let $L$ be the Lipschitz constant of $K$. For any $\epsilon > 0$, let $\delta$ be a radius depending on $\epsilon$ and $L$, and let $V_\delta$ be the volume of a ball with radius $\delta$ in $\Omega$. Then, as long as

$$\beta > \frac{4}{\epsilon}\log\Big(\frac{2(1-V_\delta)}{V_\delta}\,\big(4C_K\epsilon^{-1}\big)\Big),$$

there exists $T > 0$, depending on $\epsilon$ and $\beta$, such that for any $t > T$, the solution of the dynamics (7) and (8) at time $t$ satisfies

$$\sup_{q'\in\mathcal{P}(\Omega)} E(p_t,q') - \inf_{p'\in\mathcal{P}(\Omega)} E(p',q_t) \le \epsilon.$$

## 5 The quasistatic Langevin gradient descent-ascent method

It is well known that PDEs of the form

$$\partial_t p(t,x) = \nabla\cdot\big(p(t,x)\,\mu(t,x)\big) + \lambda\Delta p(t,x)$$

are Fokker-Planck equations of SDEs of the form $dX_t = -\mu(t,X_t)\,dt + \sqrt{2\lambda}\,dW_t$, and the solution of the PDE characterizes the law of $X_t$, the solution of the SDE, at any time. This result connects the Wasserstein gradient flow with SDEs, and gives a natural particle discretization to approximate the continuous Wasserstein gradient flow. For example, the Wasserstein gradient descent-ascent flow dynamics (4) is the Fokker-Planck equation of the SDEs

$$dX_t = -\nabla_x U(X_t, q_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \qquad dY_t = \nabla_y V(Y_t, p_t)\,dt + \sqrt{2\beta^{-1}}\,dW'_t,$$

where $p_t$ and $q_t$ are the laws of $X_t$ and $Y_t$, respectively, and $W_t$ and $W'_t$ are two Brownian motions. Note that we have

$$\nabla_x U(x,q) = \int_\Omega \nabla_x K(x,y)\,q(y)\,dy, \qquad \nabla_y V(y,p) = \int_\Omega \nabla_y K(x,y)\,p(x)\,dx.$$

Therefore, picking $X_0^{(i)}$ and $Y_0^{(i)}$ i.i.d. from $p_0$ and $q_0$ for $i=1,\dots,n$, the particle update scheme, named Langevin Gradient Descent-Ascent (LGDA),

$$X^{(i)}_{k+1} = X^{(i)}_k - \frac{h}{n}\sum_{j=1}^n \nabla_x K\big(X^{(i)}_k, Y^{(j)}_k\big) + \sqrt{2h\beta^{-1}}\,\xi^{(i)}_k, \qquad Y^{(i)}_{k+1} = Y^{(i)}_k + \frac{h}{n}\sum_{j=1}^n \nabla_y K\big(X^{(j)}_k, Y^{(i)}_k\big) + \sqrt{2h\beta^{-1}}\,\zeta^{(i)}_k, \tag{14}$$

approximately solves the SDEs, and thus the empirical distributions of $\{X^{(i)}_k\}$ and $\{Y^{(i)}_k\}$ approximate the solutions of (4) when $n$ is large. Here, $\xi^{(i)}_k$ and $\zeta^{(i)}_k$ are i.i.d. samples from the standard Gaussian.
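The update (14) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the torus kernel $K(x,y)=\sin(2\pi x)\sin(2\pi y)$ from Section 5.1 is used, and the particle count, step size, and $\beta$ are illustrative choices.

```python
# A minimal sketch (not the authors' code) of the LGDA update (14) on the
# torus [0, 1), with K(x, y) = sin(2*pi*x) * sin(2*pi*y).
import numpy as np

rng = np.random.default_rng(0)

def grad_x_K(x, y):
    # d/dx of sin(2 pi x) sin(2 pi y); broadcasts over particle pairs
    return 2 * np.pi * np.cos(2 * np.pi * x) * np.sin(2 * np.pi * y)

def grad_y_K(x, y):
    return 2 * np.pi * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

def lgda(n=256, steps=200, h=1e-3, beta=10.0):
    X = rng.random(n)                    # i.i.d. from p_0 = Unif[0, 1)
    Y = rng.random(n)                    # i.i.d. from q_0 = Unif[0, 1)
    noise = np.sqrt(2 * h / beta)
    for _ in range(steps):
        # mean-field forces: average kernel gradients over opponent particles
        fx = grad_x_K(X[:, None], Y[None, :]).mean(axis=1)
        fy = grad_y_K(X[None, :], Y[:, None]).mean(axis=1)
        X = (X - h * fx + noise * rng.standard_normal(n)) % 1.0  # descent in x
        Y = (Y + h * fy + noise * rng.standard_normal(n)) % 1.0  # ascent in y
    return X, Y

X, Y = lgda()
```

The `% 1.0` wrap keeps the particles on the torus; on a general manifold $\Omega$ this step would be replaced by the appropriate retraction.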

#### Quasistatic Langevin gradient descent method

Similarly, the dynamics (8) for $p_t$ is the Fokker-Planck equation of the SDE

$$dX_t = -\nabla_x\Psi(X_t, p_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \tag{15}$$

where $p_t$ is the law of $X_t$. By Proposition 4 we have $\Psi(x,p_t) = U(x,q[p_t])$. Hence, (15) can be written as

$$dX_t = -\nabla_x U(X_t, q[p_t])\,dt + \sqrt{2\beta^{-1}}\,dW_t, \tag{16}$$

with $q[p_t]$ at the equilibrium of the maximization problem (5), which can be attained by solving the SDE

$$dY_t = \nabla_y V(Y_t, p_t)\,dt + \sqrt{2\beta^{-1}}\,dW'_t \tag{17}$$

for a sufficiently long time. This motivates us to design a quasistatic particle method as a discretization of the quasistatic Wasserstein gradient flow. Specifically, the method consists of an inner loop and an outer loop. The method starts from some particles $X_0^{(i)}$ and $Y_0^{(i)}$, $i=1,\dots,n$, sampled i.i.d. from $p_0$ and $q_0$, respectively. Then, at the $k$-th step, the inner loop conducts enough iterations on the $Y$ particles to solve (17) with $p$ fixed (i.e. with the $X$ particles fixed), which drives the empirical distribution of the $Y$ particles near equilibrium before each update of the outer loop. Next, the outer loop updates the $X$ particles according to the SDE (16). The algorithm is summarized in Algorithm 1.

###### Remark 3.

Generally speaking, Algorithm 1 consists of two nested loops. The inner loop brings the $Y$ particles to equilibrium in each step of the outer loop, while the outer loop makes one iteration at a time, using the equilibrated particles. Some additional explanations of the algorithm follow:

1. line 4: at the beginning of the algorithm, we conduct additional inner iterations, whose number may be large. This is because at the beginning the $Y$ particles are far from equilibrium. In later outer iterations, since each time the $X$ particles only move a small distance, the $Y$ particles stay close to the equilibrium. Therefore, the numbers of inner iterations in later steps need not be large.

2. line 17: In each inner loop, we conduct a number of inner iterations for the $Y$ particles, and collect the particles from the last several iterations. We use these collected particles in the update of the $X$ particles to approximate the distribution $q[p]$. We assume that during these last inner iterations the $Y$ particles are at equilibrium. One can collect only the last iteration if the number of particles is large enough, while collecting more iterations allows a smaller number of particles.

### 5.1 Examples

In this section, we apply the quasistatic Langevin gradient descent method to several problems.

#### 1-dimensional game on torus

We first consider a problem with $x$ and $y$ on the 1-dimensional torus. Specifically, we consider

$$K(x,y) = \sin(2\pi x)\sin(2\pi y),$$

where $x, y\in[0,1)$. It is easy to show that, with this $K$ and a positive $\beta^{-1}$, at the Nash equilibrium of problem (1) $p$ and $q$ are both uniform distributions. We take the initial distributions $p_0$ and $q_0$ to be the uniform distribution on $[0,1)$. Figure 1 shows the comparison of the quasistatic particle method with LGDA for different $\beta$, step lengths, and numbers of particles. In the experiments, all quasistatic methods use the same warm-up and collection settings, with the numbers of inner iterations shown in the legends. For each experiment, we conduct different numbers of outer iterations for LGDA, QS2, QS5, and QS10, in consideration of their different numbers of inner iterations. The error is computed after the last iteration, measured by the KL divergence between the empirical distribution given by the particles and the uniform distribution (both in the form of histograms with equi-length bins). Each point in the figures is an average over several runs.

As seen from the left figure, QSLGD has performance comparable to LGDA when $\beta$ is small, in which case diffusion dominates the dynamics, while it performs much better than LGDA when $\beta$ is large. We also see better tolerance to large $\beta$ when more inner iterations are conducted. This shows the advantage of QSLGD over LGDA when the regularization strength is weak. The middle figure shows slightly better performance of QSLGD when the step length (both inner and outer) is small. However, when the step length is big, LGDA tends to give smaller errors. This may be caused by the instability of the inner loop when the step length is big, and it guides us to pick small step lengths when applying the proposed method. Finally, the right figure compares the influence of the number of particles, in a setting where the two methods perform similarly. We can see that the errors of both methods decay at the same rate as the number of particles grows.
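The histogram-based KL error used above can be sketched as follows (the bin count is an illustrative choice, not necessarily the paper's setting):

```python
# KL divergence between the empirical histogram of particles on [0, 1) and
# the uniform distribution; the bin count is an illustrative choice.
import numpy as np

def kl_to_uniform(particles, bins=20):
    counts, _ = np.histogram(particles, bins=bins, range=(0.0, 1.0))
    p_hat = counts / counts.sum()        # empirical histogram
    u = 1.0 / bins                       # uniform weight per bin
    mask = p_hat > 0                     # 0 * log(0) = 0 convention
    return float(np.sum(p_hat[mask] * np.log(p_hat[mask] / u)))

samples = (np.arange(1000) + 0.5) / 1000          # perfectly uniform particles
uniform_err = kl_to_uniform(samples)              # 0 up to floating point
clustered_err = kl_to_uniform(np.zeros(1000))     # all mass in one bin: log(20)
```

The metric is zero exactly when the empirical histogram matches the uniform one, and grows as the particles cluster.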

#### Polynomial games on spheres

In the second example, we consider a polynomial game on the sphere, similar to that studied in (Domingo-Enrich et al., 2020),

$$K(x,y) = x^T A_0 x + x^T A_1 y + y^T A_2 y + y^T A_3 (x^2), \tag{18}$$

where $x$ and $y$ lie on the sphere and $x^2$ is the element-wise square of $x$. In this problem, we consider the Nash equilibrium of the unregularized objective. Hence, we take $\beta$ big (small $\beta^{-1}$) and compare the Nikaido-Isoda (NI) error of the solutions found by different methods (Nikaidô & Isoda, 1955). The NI error is defined by

$$\mathrm{NI}(p,q) := \sup_{q'\in\mathcal{P}(\Omega)} E(p,q') - \inf_{p'\in\mathcal{P}(\Omega)} E(p',q),$$

which is also the quantity used in Theorem 6. The left panel of Figure 2 shows the NI errors of the solutions found by different methods in different dimensions; we see performance of QSLGD comparable to that of LGDA.
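On a discretized strategy space the NI error is easy to evaluate: since $E(p,q')$ is linear in $q'$, the supremum over $q'$ is attained at a point mass, so on a grid $\mathrm{NI}(p,q) = \max_y V(y,p) - \min_x U(x,q)$. A sketch follows (the grid and the kernel are illustrative assumptions, not the paper's polynomial game):

```python
# NI error on a discretized strategy space: the sup/inf over distributions
# reduce to max/min over grid points because E is linear in each argument.
import numpy as np

def ni_error(K, p, q):
    U = K @ q        # U(x, q) at the grid points
    V = K.T @ p      # V(y, p) at the grid points
    return float(V.max() - U.min())

m = 64
x = (np.arange(m) + 0.5) / m
K = np.outer(np.sin(2 * np.pi * x), np.sin(2 * np.pi * x))
uniform = np.full(m, 1.0 / m)
ni_at_mne = ni_error(K, uniform, uniform)   # ~0: the uniform pair is the MNE here
```

The NI error is nonnegative by construction and vanishes exactly at a mixed Nash equilibrium of the discretized game.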

#### GANs

Finally, we test our method on the training of GANs. We train GANs to learn Gaussian mixtures. The results after training are shown in the middle and right panels of Figure 2, where Gaussian mixtures with different numbers of modes are learned. We train mixtures of GANs consisting of several generators and discriminators. The results show that the mixture of GANs trained by QSLGD can learn Gaussian mixtures successfully.

In the right panel of Figure 2, we show the results of learning high-dimensional Gaussian mixtures. In the $d$-dimensional experiment, the Gaussian mixture has modes centered at scaled standard basis vectors $e_i$ of $\mathbb{R}^d$. The model and algorithm use the same hyper-parameters as above. In the figure, we measure the average squared distance of the generated data to the closest mode center along the training process. The figure shows that the average squared distance can be reduced to a small value within the given number of iterations. While the ideal value is even smaller, the current results still show that the learnt distribution concentrates at the mode centers. Better results may be obtained with longer training or careful hyper-parameter tuning.

## 6 Discussion

In this paper, we study the quasistatic Wasserstein gradient flow for the mixed Nash equilibrium problem. We theoretically show the convergence of the continuous dynamics to the unique Nash equilibrium. Then, a quasistatic particle method is proposed by discretizing the continuous dynamics. The particle method consists of two nested loops, and conducts sufficient inner iterations in each step of the outer loop. Numerical experiments show the effectiveness of the method. Comparison with LGDA shows that the proposed method has an advantage over LGDA when $\beta$ is large (which is usually the case of interest), and performs as well as LGDA in most other cases.

Theoretical extensions are possible. For example, stronger convergence results may be established by approaches similar to those taken in (Feng & Li, 2020). We leave this as future work.

In practice, the idea of nested loops is not new for minimax optimization problems. It was already discussed and utilized in the earliest works on GANs (Goodfellow et al., 2014a) and Wasserstein GANs (Arjovsky et al., 2017), in which the discriminator is updated for several steps each time the generator is updated. Our work differs from these works because we consider mixed Nash equilibria, and hence our method is particle based, while their methods search for a pure Nash equilibrium.

Finally, though particle methods finding mixed Nash equilibria have stronger theoretical guarantees, applying these methods to the training of GANs faces the problem of computational cost. With both the generator and the discriminator being large neural networks, training a mixture of GANs with many generators and discriminators imposes formidable computational cost. Developing more efficient particle methods for GANs is an important direction of future work.

## Appendix A Proofs for Section 3

### a.1 Proof of Theorem 2

Note that the free energy maximized in (5) can be written as

$$\int_\Omega V(y,p_t)\,q(y)\,dy - \beta^{-1}S(q), \tag{19}$$

in which the first term is linear with respect to $q$. Hence, a calculation with a Lagrange multiplier shows that (19) has a unique maximizer in the form of a Gibbs distribution (e.g. see Chapter 4 of (Mezard & Montanari, 2009)):

$$q[p_t](y) = \frac{1}{Z_q(p_t)}\exp(\beta V(y,p_t)). \tag{20}$$

Next, we consider the free energy for $p_t$ when $q$ is at the equilibrium. By (19) we have

$$E(p_t, q[p_t]) = \int_\Omega V(y,p_t)\,q[p_t](y)\,dy = \frac{1}{Z_q(p_t)}\int_\Omega V(y,p_t)\exp(\beta V(y,p_t))\,dy. \tag{21}$$

On the other hand, we have

$$\begin{aligned} \beta^{-1}S(q[p_t]) &= \beta^{-1}\int_\Omega \frac{1}{Z_q(p_t)}\exp(\beta V(y,p_t))\log\Big(\frac{1}{Z_q(p_t)}\exp(\beta V(y,p_t))\Big)\,dy \\ &= \beta^{-1}\int_\Omega \frac{1}{Z_q(p_t)}\exp(\beta V(y,p_t))\big(\beta V(y,p_t) - \log Z_q(p_t)\big)\,dy \\ &= \frac{1}{Z_q(p_t)}\int_\Omega V(y,p_t)\exp(\beta V(y,p_t))\,dy - \beta^{-1}\log Z_q(p_t). \end{aligned} \tag{22}$$

Combining (21) and (22), we obtain

$$E(p_t, q[p_t]) + \beta^{-1}S(p_t) - \beta^{-1}S(q[p_t]) = \beta^{-1}\log Z_q(p_t) + \beta^{-1}S(p_t).$$

Therefore, the dynamics of $p_t$ is the Wasserstein gradient descent flow minimizing the free energy $\beta^{-1}\log Z_q(p_t) + \beta^{-1}S(p_t)$, given by

$$\partial_t p_t = \nabla\cdot\Big(p_t\,\nabla\Big(\frac{\delta\,\beta^{-1}\log Z_q(p_t)}{\delta p_t} + \beta^{-1}\log p_t\Big)\Big).$$

This finishes the proof.

### a.2 Proof of Proposition 3

Since $\beta^{-1}S(p)$ is strongly convex, it suffices to show that $\beta^{-1}\log Z_q(p)$ is convex. Recall that

$$\log Z_q(p) = \log\Big(\int_\Omega \exp(\beta V(y,p))\,dy\Big).$$

Noting that $V(y,p)$ is linear with respect to $p$, we have

$$V(\cdot, \lambda p_1 + (1-\lambda)p_2) = \lambda V(\cdot, p_1) + (1-\lambda)V(\cdot, p_2).$$

Hence,

$$\begin{aligned} \log Z_q(\lambda p_1 + (1-\lambda)p_2) &= \log\Big(\int_\Omega \exp\big(\beta V(y,\lambda p_1 + (1-\lambda)p_2)\big)\,dy\Big) \\ &= \log\Big(\int_\Omega \exp\big(\beta\lambda V(y,p_1)\big)\cdot\exp\big(\beta(1-\lambda)V(y,p_2)\big)\,dy\Big) \\ &\le \log\Big(\Big(\int_\Omega \exp\big(\beta V(y,p_1)\big)\,dy\Big)^{\lambda}\Big(\int_\Omega \exp\big(\beta V(y,p_2)\big)\,dy\Big)^{1-\lambda}\Big) \\ &= \lambda\log Z_q(p_1) + (1-\lambda)\log Z_q(p_2). \end{aligned} \tag{23}$$

The inequality is given by Hölder's inequality.

### a.3 Proof of Proposition 4

The proposition follows from the following derivations.

$$\begin{aligned} \frac{\delta\,\beta^{-1}\log Z_q(p)}{\delta p} &= \beta^{-1}\frac{1}{Z_q(p)}\frac{\delta Z_q(p)}{\delta p} = \beta^{-1}\frac{1}{Z_q(p)}\int_\Omega \exp(\beta V(y,p))\,\beta K(x,y)\,dy \\ &= \int_\Omega K(x,y)\,\frac{\exp(\beta V(y,p))}{Z_q(p)}\,dy = \int_\Omega K(x,y)\,q[p](y)\,dy = U(x, q[p]). \end{aligned} \tag{24}$$

## Appendix B Proof of Theorem 5

In this section, we prove our main theorem, following the sketch given in Section 4. Given the assumptions on $K$ and the conclusions of Lemma 1 and Lemma 2, Lemma 3 is a direct result of Lemma 10.12 in (Mei et al., 2018), so we omit its proof. In the following, we show Lemma 1 and Lemma 2. Some techniques in the proofs come from (Jordan et al., 1998) and (Mei et al., 2018).

### b.1 Proof of Lemma 1

Our proof follows the proof of Proposition 4.1 in (Jordan et al., 1998). First, we show the existence of the minimizer of $F_{p,\beta}$. To see this, note that $K$ is bounded on $\Omega\times\Omega$. Assume $C_K$ is a constant such that $|K(x,y)|\le C_K$ for any $x, y\in\Omega$. Then, we have

$$E(p, q[p]) = \int_\Omega U(x,q[p])\,p(x)\,dx = \int_{\Omega\times\Omega} K(x,y)\,q[p](y)\,p(x)\,dx\,dy \ge -C_K,$$

and

$$S(p) = \int_\Omega p(x)\log p(x)\,dx \ge \int_\Omega -\frac{1}{e}\,dx = -\frac{|\Omega|}{e}.$$

This means $F_{p,\beta}$ is lower bounded, i.e. $\inf_{p} F_{p,\beta}(p) > -\infty$. Hence, we can find a sequence $\{p_k\}$ such that

$$\lim_{k\to\infty} F_{p,\beta}(p_k) = \inf_{p} F_{p,\beta}(p).$$

Similar to (Jordan et al., 1998), we can show the boundedness of $S(p_k)$ and $F_{p,\beta}(p_k)$, which implies that $\{p_k\}$ is uniformly integrable, and thus there exists a weakly convergent subsequence of $\{p_k\}$.

Without loss of generality, assume $p_k \to p^*$ weakly in $\mathcal{P}(\Omega)$. Then we need to show $p^*$ is a minimizer of $F_{p,\beta}$. By (Jordan et al., 1998), the entropy term satisfies

$$S(p^*) \le \liminf_{k\to\infty} S(p_k).$$

Hence, the conclusion follows if $\beta^{-1}\log Z_q(p)$ is continuous in the weak topology. To show this, first note that for any $p\in\mathcal{P}(\Omega)$, we have

$$\int_\Omega \exp(\beta V(y,p))\,dy \ge e^{-\beta C_K}.$$

Because the function $\log z$ is $e^{\beta C_K}$-Lipschitz for $z \ge e^{-\beta C_K}$, for any $k$ we have

$$\big|\beta^{-1}\log Z_q(p_k) - \beta^{-1}\log Z_q(p^*)\big| \le \beta^{-1}e^{\beta C_K}\,\Big|\int_\Omega \big(e^{\beta V(y,p_k)} - e^{\beta V(y,p^*)}\big)\,dy\Big|.$$

By a similar boundedness argument, we have

$$\begin{aligned} \Big|\int_\Omega \big(e^{\beta V(y,p_k)} - e^{\beta V(y,p^*)}\big)\,dy\Big| &\le \int_\Omega \big|e^{\beta V(y,p_k)} - e^{\beta V(y,p^*)}\big|\,dy \\ &\le e^{\beta C_K}\int_\Omega \beta\,\big|V(y,p_k) - V(y,p^*)\big|\,dy \\ &\le \beta e^{\beta C_K}\int_\Omega \Big|\int_\Omega K(x,y)\big(p_k(x) - p^*(x)\big)\,dx\Big|\,dy. \end{aligned}$$

In total, we have

$$\big|\beta^{-1}\log Z_q(p_k) - \beta^{-1}\log Z_q(p^*)\big| \le e^{2\beta C_K}\int_\Omega \Big|\int_\Omega K(x,y)\big(p_k(x) - p^*(x)\big)\,dx\Big|\,dy.$$

Since $K$ is bounded and Lipschitz, it is easy to show that

$$\lim_{k\to\infty}\int_\Omega \Big|\int_\Omega K(x,y)\big(p_k(x) - p^*(x)\big)\,dx\Big|\,dy = 0.$$

Therefore, we have

$$F_{p,\beta}(p^*) \le \liminf_{k\to\infty} F_{p,\beta}(p_k) = \inf_{p} F_{p,\beta}(p),$$

and thus