1 Introduction
Finding Nash equilibrium has seen many important applications in machine learning, such as generative adversarial networks (GANs)
(Goodfellow et al., 2014a)(Busoniu et al., 2008). In these problems, pure Nash equilibria are usually search for a function . Yet, the problems arising from machine learning are usually nonconvex in and nonconcave in , in which case pure Nash equilibrium may not exist. And even if it exists, there is no guarantee for any optimization algorithm to find it efficiently. This difficulty is reflected in practice, that compared with simple minimization, machine learning applications involving Nash equilibria usually have more complicated behaviors and more subtle dependence on hyperparameters. For example, stable and efficient training of GANs requires a number of carefully designed tricks (Gao et al., 2018).On the other hand, the mixed Nash equilibrium (MNE) is known to exist in much more general settings, e.g. when the strategy spaces are compact and the payoff function is continuous (Glicksberg, 1952). In the mixed Nash equilibrium problem, instead of taking “pure strategies” and , two ”mixed strategies” for and , in the form of probability distributions, are considered, resulting in the following functional,
where and are density functions of probability distributions of and , respectively. Efforts are invested to develop theoretically endorsed algorithms that can efficiently find MNE for high dimensional problems, with applications on the training of mixture of GANs. In Hsieh et al. (2019), a mirrordescent algorithm is proposed and its convergence is proven. In DomingoEnrich et al. (2020), theoretical analysis and empirical experiments are conducted for a gradient descentascent flow under a WassersteinFisherRao metric and its particle discretization.
In this paper, we also consider the mixed Nash equilibrium problem, and propose a simple QuasiStatic Wasserstein Gradient Flow (QSWGF) for solving the problem. In our dynamics, we treat as a component with much faster speed than , hence is always at equilibrium as moves. With entropy regularization for both and (without requirement on the strength of the regularizations), we prove that the QSWGF converges to the unique mixed Nash equilibrium from any initialization (under mild conditions). Furthermore, we show there is a simple way to discretize the QSWGF, regardless of the complexity of the Wasserstein gradient flow of induced by the fact that is always at equilibrium. Concretely, a partition function related with appears in the QSWGF dynamics, and we find an efficient way to approximate the partition function. By discretizing the QSWGF, we derive a particle dynamics with an innerouter structure, named the QuasiStatic Langevin Gradient Descent algorithm (QSLGD). In QSLGD, after each iteration of the outer problem (for the particles), the inner loop conducts sufficient iterations to bring the particles to equilibrium. Numerical experiments show the effectiveness of QSLGD on synthetic examples and training mixture of GANs. Our method outperforms the vanilla Langevin gradient descentascent method when the entropy regularization is weak.
As a summary, our two major contributions are:

We propose the quasistatic Wasserstein gradient flow dynamics for mixed Nash equilibrium problems, and show its convergence to the unique Nash equilibrium under weak assumptions. Our result neither requires the entropy regularization to be sufficiently strong, nor assumes the dynamics to converge a priori.

We derive a simple while practical quasistatic Langevin gradient descent algorithm by discretizing the quasistatic Wasserstein gradient flow, by finding an efficient way to approximate the partition function appearing in the dynamics of . The proposed algorithm is applied on several problems including training mixtures of GANs.
2 Related work
The mixed Nash equilibrium problem has a long history, with the proof of its existence dates back to Morgenstern & Von Neumann (1953). It draws new attention in recent years, especially in the machine learning community, due to the development of GANs (Goodfellow et al., 2014a) and adversarial training (Goodfellow et al., 2014b). Training mixture of GANs is already discussed in paper (Goodfellow et al., 2014a). Some numerical experiments were conducted in (Arora et al., 2017). In Grnarova et al. (2017)
, the authors proposed an online learning approach for training mixture of GANs, and proved its effectiveness for semishallow GANs (GANs whose discriminator is a shallow neural network). Yet, rigorous theoretical treatment to an algorithm started from
(Hsieh et al., 2019), in which a mirror descent method was studied and proven to converge. The implementation of the mirror descent method involves big computational cost that asks for heuristics to alleviate. Later,
(DomingoEnrich et al., 2020) studied more efficient algorithms under a mixture of Wasserstein and FisherRao metrics. Theoretically, the time average of the dynamics’ trajectories is shown to converge to the mixed Nash equilibrium. As a comparison, in this work we show the global convergence of the quasistatic Wasserstein gradient flow without the need of taking time average. Meanwhile, the Wasserstein nature of our dynamics makes it easy to implement as well.The Wasserstein gradient flow in the density space has been explored in previous works. For example, (Wang & Li, 2019) studied the Nesterov’s accelerated gradient flows for probability distributions under the Wasserstein metric, and (Arbel et al., 2019) studied practical implementations of the natural gradient method for the Wasserstein metric. Both works focus on minimization problems instead of minmax problems considered in this work. A more related work is Lin et al. (2021b), where a natural gradient based algorithm is proposed for training GANs. Yet, the method still optimizes one generator and one discriminator, searching for pure Nash equilibrium. Another work that derives algorithms for GANs from a Wasserstein perspective is (Lin et al., 2021a).
Another volume of works that studies the Wasserstein gradient flow in the machine learning context is the meanfield analysis of neural networks. This line of works started from twolayer neural networks (Mei et al., 2018; Rotskoff & VandenEijnden, 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2020), to deep fullyconnected networks (Araújo et al., 2019; Sirignano & Spiliopoulos, 2021; Nguyen, 2019; Wojtowytsch et al., 2020), and residual networks (Lu et al., 2020; E et al., 2020). The meanfield formulations treat parameters as probability distributions, and the training dynamics are usually the gradient flow under Wasserstein metric. Attempts to prove convergence of the dynamics to global minima are made (Mei et al., 2018; Chizat & Bach, 2018; Rotskoff et al., 2019), though in the case without entropy regularization a convergence assumption should usually be made a priori.
3 The quasistatic dynamics
We consider the entropy regularized mixed Nash equilibrium problem, which in our case is equivalent with solving the following minimax problem:
(1) 
In (1), is a compact Riemannian manifold without boundary, and is the set of probability distributions on . Since is compact, any probability distribution in
naturally has finite moments. Let
, and and be the (negative) entropy of and , respectively. Then, the minimax problem (1) can be written in short as(2) 
Remark 1.
Strictly speaking, in (1) we should distinguish probability distributions and their density function (if exist), and the entropy should also be defined using the RadonNikodym derivative with canonical measure. In this paper, since and indeed have density functions because of the entropy regularization, we shall abuse the notation by using and to represent both probability distributions and their density functions.
The entropy regularizations in (1) and (2) make the problem strongly convex in and strongly concave in . Hence, there exists a unique Nash equilibrium for the problem. Such results are shown for example by the following theorem from (DomingoEnrich et al., 2020).
Theorem 1.
(Theorem 4 of (DomingoEnrich et al., 2020)) Assume is a compact Polish metric space equipped with canonical Borel measure, and that is a continuous function on . Then, problem (2) has a unique Nash equilibrium given by the solution of the following fixedpoint problem:
(3) 
where and are normalization constants to make sure and are probability distributions, and and are defined as
Considering the efficiency in highdimensional cases, a natural dynamics of interest to find the Nash equilibrium for (2) is the gradient descentascent flow under the Wasserstein metric,
(4) 
because it can be easily discretized into a Langevin gradient descentascent method by treating the PDEs as FokkerPlanck equations of SDEs. When is sufficiently large, (4) can be proven to converge linearly to the unique MNE of (2) (Eberle et al., 2019). However, when is small, whether (4) converges remains open. This hinders the application of (4) because in practice the entropy terms are usually used as regularization and are kept small. (We realize that it is proven in DomingoEnrich & Bruna (2022) when our work is under review.)
In (4), the dynamics of and have the same speed. In this work, instead, we study a quasistatic Wasserstein gradient descent dynamics, which can be understood as a limiting dynamics when the speed of becomes faster and faster compared with that of . In this case, at any time , we assume reaches at the equilibrium of the maximizing problem instantaneously by fixing in (2). That is to say, at any time , is determined by
(5) 
On the other hand, follows the Wasserstein gradient descent flow with at the equilibrium:
(6) 
The following theorem shows can be explicitly written as a Gibbs distribution depending on , and thus the free energy in (6) can be simplified to depend on a partition function related with .
Theorem 2.
Let . By Theorem 2, the dynamics (8) of is the Wasserstein gradient descent flow for minimizing . By the Proposition 3 below, is strongly convex with respect to . Therefore, it is possible to prove global convergence for the dynamics (8), and thus the convergence for the quasistatic Wasserstein gradient flow for the minimax problem (2).
Proposition 3.
For any probability distributions , in , and any , we have
In practice the partition function in (8) seems hard to approximate, especially when is in high dimensional spaces. However, we show in the following proposition that the variation of the partition function with respect to can be written as a simple form involving . This property will be used to derive a particle method in Section 5
4 Convergence analysis
In this section, we analyze the convergence of the quasistatic dynamics (7), (8). First, we make the following assumptions on .
Assumption 1.
Assume , which means has continuous derivatives of any order (with respect to both and ).
Since is compact, assumption 1 implies boundedness and Lipschitz continuity of any derivatives of .
Now, we state our main theorem, which shows the convergence of QSWGF to the Nash equilibrium.
Theorem 5.
Theorem 5 guarantees convergence of the quasistatic Wasserstein gradient flow for any , giving theoretical endorsement to the discretized algorithm that we will introduce in the next section. Note that the initialization in the theorem is not important, because we assume achieves equilibrium immediately after the initialization.
Remark 2.
The assumption on ’s smoothness can be made weaker. For example, during the proof, up to th order derivatives of is enough to give sufficient regularity to the solution of the dynamics. We make the strong assumption partly to prevent tedious technical analysis so as to focus on the idea and insights.
Proof sketch
We provide some main steps and ideas of the proof of the main theorem in this section. The detailed proof is put in the appendix.
By the last section, since is always at equilibrium, we only need to considering a Wasserstein gradient descent flow for . Therefore, we can build our analysis based on the theories in (Mei et al., 2018) and (Jordan et al., 1998). However, compared with the analysis therein, our theory deals with a new energy term—, which has not been studied by previous works. From now on, let , and . By simple calculation we have
(11) 
First, we study the free energy , and show that it has a unique minimizer which satisfies a fixed point condition. This is the result of the convexity of . We have the following lemma.
Lemma 1.
Assume Assumption 1 holds for . Then, has a unique minimizer that satisfies
Moreover, is the unique solution of the following fixed point problem,
(12) 
where is the normalization factor.
Next, we want to show that any trajectory given by dynamics (10) will converge to the unique minimizer of . To achieve this, we first study the existence, uniqueness, and regularity of the solution to (8), i.e. the trajectory indeed exists and is well behaved. Related results are given by the following lemma.
Lemma 2.
The proof of Lemma 2 is based on Proposition 5.1 of (Jordan et al., 1998). Especially, the existence part is proven using the JKO scheme proposed in (Jordan et al., 1998). We consider a sequence of probability distributions given by the following discrete iteration schemes with time step ,
where means the 2Wasserstein distance between probability distributions and . Let
be the piecewise constant interpolations of
on time. We show converges weakly (after taking a subsequence) to a weak solution of (8) as tends to . Details are given in the appendix.Finally, noting that is a Lyapunov function of the dynamics (8), we have the following lemma showing the convergence of to the solution of the Boltzmann fixed point problem (12). This finishes the proof of the main theorem.
Lemma 3.
As a byproduct, since our convergence results does not impose requirement on , if one is interested in the minimax problem without entropy regularization,
(13) 
then, Theorem 5 in (DomingoEnrich et al., 2020) ensures that the quasistatic dynamics converges to approximate Nash equilibrium of (13) as long as is small enough. Specifically, a pair of probability distributions is called Nash equilibrium of (13) if
Then, we have the following theorem as a direct results of Theorem 5 in (DomingoEnrich et al., 2020):
5 The quasistatic Langevin gradient descentascent method
It is well known that PDEs with the form
are FokkerPlanck equations for SDEs , and the solution for the PDE characterizes the law of —the solution of the SDE—at any time. This result connects the Wasserstein gradient flow with SDE, and gives a natural particle discretization to approximate the continuous Wasserstein gradient flow. For example, the Wasserstein gradient descentascent flow dynamics (4) is the FokkerPlanck equation of the SDEs
where and are the laws of and , respectively, and and are two Brownian motions. Note that we have
Therefore, i.i.d. picking and for , the particle update scheme, named Langevin Gradient DescentAscent (LGDA),
(14) 
approximately solves the SDEs, and thus the empirical distributions of and approximate the solutions of (4) when is large. Here, and are i.i.d. samples from the standard Gaussian.
Quasistatic Langevin gradient descent method
Similarly, the dynamics (8) for is the FokkerPlanck equation for the SDE
(15) 
where is the law of . By proposition 4 we have . Hence, (15) can be written as
(16) 
with at the equilibrium of the maximization problem (5), which can be attained by solving the SDE
(17) 
for sufficiently long time. This motivates us to design a quasistatic particle method as a discretization for the quasistatic Wasserstein gradient flow. Specifically, the method consists of an inner loop and an outer loop. The method starts from some particles and , , sampled i.i.d. from and , respectively. Then, at the th step, the inner loop conducts enough iterations on the particles to solve (17) with fixed (i.e. with the particles fixed), which drives the empirical distribution of near equilibrium before each update of the outer loop. Next, the outer loop updates according the SDE (16). The algorithm is summarized in Algorithm 1.
Remark 3.
Generally speaking, Algorithm 1 consists of two nested loops. The inner loop solves particles to equilibrium in each step of the outer loop, while the outer loop makes one iteration every time, using the equilibrium particles. In the following are some additional explanation for the Algorithm:

line 4: at the beginning of the algorithm, we conduct additional inner iterations for , where may be a large number. This is because at the beginning the particles are far from equilibrium. In later outer iterations, since each time the particles only move for a small distance, the particles are close to the equilibrium. Therefore, and need not to be large.

line 17: In each inner loop, we conduct inner iterations for the particles, and collect those from the last iterations. We use these particles in the update of particles to approximate the distribution . We assume during the last inner iterations the particles are at equilibrium. One can take to be if is large enough, while taking large allows smaller number of particles.
5.1 Examples
In this section, we apply the quasistatic Langevin gradient descent method to several problems.
dimensional game on torus
We first consider a problem with and on the dimensional torus. Specifically, we consider
where . It is easy to show that, with this and a positive , at the Nash equilibrium of the problem (1) and
are both uniform distributions. We take initial distributions
and to be the uniform distribution on . Figure 1 shows the comparison of the quasistatic particle method with LGDA for different , step length, and number of particles. In the experiments, all quasistatic methods take and , with different shown in the legends. For each experiment, we conduct , , , outer iterations for LGDA, QS2, QS5, and QS10, respectively. We take different different numbers of iterations for different methods in the consideration of different number of inner iterations. The error is then computed after the last iteration, measured by the KL divergence of the empirical distribution given by particles and the uniform distribution (both in the forms of histograms with equilength bins). Each point in the figures is an average of experiments.Seen from the left figure, the QSLGD has comparable performance than LGDA when is small, in which case diffusion dominates the dynamics, while it performs much better than LGDA when is large. We can also see better tolerance to large when more inner iterations are conducted. This shows the advantage of the QSLGD over LGDA when the regularization stength is weak. The middle figures shows slightly better performance of the QSLGD when the step length (both and ) is small. However, when is big, LGDA tends to give smaller error. The results may be caused by the instability of the inner loop when is big. It also guides us to pick small step length when applying the proposed method. Finally, the right figure compares the influence of the number of particles when and , in which case the two methods perform similarly. We can see that the errors for both methods scale in a rate as the number of particles changes.
Polynomial games on spheres
In the second example, we consider a polynomial games on sphere similar to that studied in (DomingoEnrich et al., 2020),
(18) 
where and is the elementwise square of . In this problem, we consider the Nash equilibrium of . Hence, we take big (small ) and compare the Nikaido and Isoda (NI) error of the solutions found by different methods (Nikaidô & Isoda, 1955). The NI error is defined by
which is also used in Theorem 6. The left panel of Figure 2 shows the NI errors of the solutions found by different methods with different dimensions, we see comparable performance of the QSLGD with LGDA.
GANs
Finally, we test our methods on the training of GANs. We train GANs to learn Gaussian mixtures. The results after training are shown in the middle and right panels of Figure 2, where Gaussian mixtures with and modes are learned, respectively. We train GANs with generators and discriminators, and take . The results show that the mixture of GANs trained by QSLGD can learn Gaussian mixtures successfully.
In the right panel of Figure 2, we show the results of learning high dimensional Gaussian mixtures. In the ddimensional experiment, the Gaussian mixture has modes centered at
with standard deviation
. Here, is theth unit vector in the standard basis of
. Model and algorithm with same hyperparameters as above are used. In the figure, we measure the average squared distance of the generated data to the closest mode center along the training process. The figure shows that the average squared distance can be reduced to after iterations. While the ideal value is , the current results still show that the learnt distribution concentrates at the mode centers. Better results may be obtained after longer training or careful hyperparameter tuning.6 Discussion
In this paper, we study the quasistatic Wasserstein gradient flow for the mixed Nash equilibrium problem. We theoretically show the convergence of the continuous dynamics to the unique Nash equilibrium. Then, a quasistatic particle method is proposed by discretizing the continuous dynamics. The particle method consists of two nested loops, and conduct sufficient inner loop in each step of the outer loop. Numerical experiments show the effectiveness of the method. Comparison with LGDA shows the proposed method has advantage over LGDA when is large (which is usually the case of interest), and performs as good as LGDA in most other cases.
Theoretical extensions are possible. For example, strong convergence results may be established by similar approaches taken in (Feng & Li, 2020). We leave this as future work.
In practice, the idea of nested loops is not new for minimax optimization problems. It is already discussed and utilized in the earliest works for GANs (Goodfellow et al., 2014a), and Wasserstein GANs (Arjovsky et al., 2017). In those works, the discriminator is updated for several steps each time the generator is updated. Our work is different from these works because we consider mixed Nash equilibrium and hence our method is particle based, while their method searches for pure Nash equilibrium.
Finally, though particle methods finding mixed Nash equilibria have stronger theoretical guarantees, applying these methods to the training of GANs faces the problem of computational cost. With both the generator and discriminator being large neural networks, training mixture of GANs with many generators and discriminators imposes formidable computational cost. Developing more efficient particle methods for GANs is an important future work.
References
 Araújo et al. (2019) Dyego Araújo, Roberto I Oliveira, and Daniel Yukimura. A meanfield limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.
 Arbel et al. (2019) Michael Arbel, Arthur Gretton, Wuchen Li, and Guido Montúfar. Kernelized wasserstein natural gradient. arXiv preprint arXiv:1910.09652, 2019.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. PMLR, 2017.
 Arora et al. (2017) Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pp. 224–232. PMLR, 2017.
 Busoniu et al. (2008) Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
 Chizat & Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
 DomingoEnrich & Bruna (2022) Carles DomingoEnrich and Joan Bruna. Simultaneous transport evolution for minimax equilibria on measures. arXiv preprint, https://arxiv.org/pdf/2202.06460.pdf, 2022.
 DomingoEnrich et al. (2020) Carles DomingoEnrich, Samy Jelassi, Arthur Mensch, Grant Rotskoff, and Joan Bruna. A meanfield analysis of twoplayer zerosum games. arXiv preprint arXiv:2002.06277, 2020.
 E et al. (2020) Weinan E, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint, i. Science China Mathematics, 63(11):2233–2266, 2020.
 Eberle et al. (2019) Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Quantitative harristype theorems for diffusions and mckean–vlasov processes. Transactions of the American Mathematical Society, 371(10):7135–7173, 2019.
 Feng & Li (2020) Qi Feng and Wuchen Li. Entropy dissipation via information gamma calculus: Nonreversible stochastic differential equations. arXiv preprint arXiv:2011.08058, 2020.
 Gao et al. (2018) Fei Gao, Yue Yang, Jun Wang, Jinping Sun, Erfu Yang, and Huiyu Zhou. A deep convolutional generative adversarial networks (dcgans)based semisupervised method for object recognition in synthetic aperture radar (sar) images. Remote Sensing, 10(6):846, 2018.
 Glicksberg (1952) Irving L Glicksberg. A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points. Proceedings of the American Mathematical Society, 3(1):170–174, 1952.
 Goodfellow et al. (2014a) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014a.
 Goodfellow et al. (2014b) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
 Grnarova et al. (2017) Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269, 2017.
 Hsieh et al. (2019) YaPing Hsieh, Chen Liu, and Volkan Cevher. Finding mixed nash equilibria of generative adversarial networks. In International Conference on Machine Learning, pp. 2810–2819. PMLR, 2019.
 Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
 Lin et al. (2021a) Alex Tong Lin, Samy Wu Fung, Wuchen Li, Levon Nurbekyan, and Stanley J Osher. Alternating the population and control neural networks to solve highdimensional stochastic meanfield games. Proceedings of the National Academy of Sciences, 118(31), 2021a.
 Lin et al. (2021b) Alex Tong Lin, Wuchen Li, Stanley Osher, and Guido Montúfar. Wasserstein proximal of gans. arXiv preprint arXiv:2102.06862, 2021b.
 Lu et al. (2020) Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep resnet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pp. 6426–6436. PMLR, 2020.
 Mei et al. (2018) Song Mei, Andrea Montanari, and PhanMinh Nguyen. A mean field view of the landscape of twolayer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
 Mezard & Montanari (2009) Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
 Morgenstern & Von Neumann (1953) Oskar Morgenstern and John Von Neumann. Theory of games and economic behavior. Princeton university press, 1953.
 Nguyen (2019) PhanMinh Nguyen. Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880, 2019.
 Nikaidô & Isoda (1955) Hukukane Nikaidô and Kazuo Isoda. Note on noncooperative convex games. Pacific Journal of Mathematics, 5(S1):807–815, 1955.
 Rotskoff et al. (2019) Grant Rotskoff, Samy Jelassi, Joan Bruna, and Eric VandenEijnden. Global convergence of neuron birthdeath dynamics. arXiv preprint arXiv:1902.01843, 2019.
 Rotskoff & VandenEijnden (2018) Grant M Rotskoff and Eric VandenEijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. stat, 1050:22, 2018.

Sirignano & Spiliopoulos (2020)
Justin Sirignano and Konstantinos Spiliopoulos.
Mean field analysis of neural networks: A central limit theorem.
Stochastic Processes and their Applications, 130(3):1820–1852, 2020.  Sirignano & Spiliopoulos (2021) Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of deep neural networks. Mathematics of Operations Research, 2021.
 Wang & Li (2019) Yifei Wang and Wuchen Li. Accelerated information gradient flow. arXiv preprint arXiv:1909.02102, 2019.
 Wojtowytsch et al. (2020) Stephan Wojtowytsch et al. On the banach spaces associated with multilayer relu networks: Function representation, approximation theory and gradient descent dynamics. arXiv preprint arXiv:2007.15623, 2020.
Appendix A Proofs for Section 3
a.1 Proof of Theorem 2
a.2 Proof of Proposition 3
Since is strongly convex, it suffices to show is convex. Recall that
Note that is linear with respect to , we have
Hence,
(23) 
The second last line is given by the Hölder inequality.
a.3 Proof of Proposition 4
The proposition follows from the following derivations.
(24) 
Appendix B Proof of Theorem 5
In this section, we prove our main theorem. The proof will follow the sketch given in Section 4. Given the assumptions on and the conclusions of Lemma 1 and Lemma 2, Lemma 3 is a direct result of Lemma 10.12 in (Mei et al., 2018), which we will ignore the proof. In the following, we show Lemma 1 and Lemma 2. Some techniques in the proof come from (Jordan et al., 1998) and (Mei et al., 2018).
b.1 Proof of Lemma 1
Our proof follows the proof of Proposition 4.1 in (Jordan et al., 1998). First, we show the existence of the minimizer for . To see this, note that is bounded on . Assume is a constant such that for any . Then, we have
and
This means is lower bounded, i.e. . Hence, we can find a sequence such that
Similar to (Jordan et al., 1998), we can show boundedness of and , which implies that is uniformly integrable, and thus there exists a weakly convergent subsequence of .
Without loss of generality, assume in . Then we need to show is a minimizer of . By (Jordan et al., 1998), the entropy term satisfies
Hence, the conclusion follows if is continuous in the weak topology. To show this, first note that for any , we have
Because the function is Lipschitz for , for any we have
By similar boundedness argument, we have
Totally we have
Since is bounded and Lipschitz, it is easy to show that
Therefore, we have
and thus