1 Introduction
In the theory of machine learning, one of the most exciting recent developments is the introduction of Generative Adversarial Networks (GANs) by [Goodfellow2014]. GANs provide a versatile class of generative models. The key idea behind GANs is to interpret the process of generative modeling as a competing game between two neural networks: a generator network $G$ and a discriminator network $D$. The generator network attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is fake or real. Since their introduction to the machine learning community, the popularity of GANs has grown exponentially, with numerous applications including high-resolution image generation [denton2015deep, radford2015unsupervised][yeh2016semantic], image super-resolution [ledig2016others], visual manipulation [zhu2016generative], text-to-image synthesis [reed2016generative], video generation [vondrick2016generating], semantic segmentation [luc2016semantic], and abstract reasoning diagram generation [ghosh2016contextual].
Meanwhile, the field of stochastic controls and games has witnessed tremendous growth in the theory of mean-field games (MFGs) since the pioneering works of Huang, Malhamé, and Caines [Huang2006] and Lasry and Lions [Lasry2007]. The formulation of MFGs provides an ingenious aggregation approach for analyzing the otherwise notoriously hard $N$-player stochastic games. Solutions to MFGs are shown to be good approximations of the Nash equilibria for the corresponding $N$-player games when $N$ is large. One of the approaches for analyzing MFGs is a fixed point method that involves solving a coupled partial differential equation (PDE) system: the backward Hamilton-Jacobi-Bellman (HJB) equation for the value function of the underlying control problem, and the forward Fokker-Planck (FP) equation for the dynamics of the controlled system. (See [Bensoussan2013], [Carmonaa], and the references therein.)
Curiously, these two vastly distinct subjects, GANs and MFGs, share an unexpected connection: both are minimax games. GANs are minimax games between the generator and the discriminator, whereas (variational) MFGs are minimax games between the value function and the controlled dynamics.
In this paper, we start by reviewing the minimax structures underlying both MFGs and GANs. We then establish theoretical connections between GANs and MFGs. We show that MFGs are intrinsically GANs, and that GANs are MFGs under the Pareto Optimality criterion. Interpreting MFGs as GANs, on one hand, allows us to devise GAN-based algorithms to solve MFGs and, potentially, any dynamic system with a variational structure. Interpreting GANs as MFGs, on the other hand, provides a new and probabilistic foundation for GANs. Moreover, this interpretation helps establish an analytical connection between a broad class of GANs and Optimal Transport (OT) problems. Our numerical experiments show strong performance of our proposed algorithm, which trains the two neural networks in an adversarial way.
Related works on connecting GANs with games.
Earlier works linking deep learning, especially GANs, with games can be found in [Tembine2019] and the references therein. In particular, [Tembine2019] makes the connection from a microscopic perspective, in the sense that each neuron/layer represents a player and that individual strategy lies in the choices of parameters at the specific neuron/layer. Here we present analytical connections between GANs and MFGs at a macroscopic level, and design algorithms for computing MFGs based on these connections.
Related works on computing MFGs.
Most computational approaches for solving MFGs adopt traditional numerical schemes, with the exception of [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] and [guo2019learning]. [guo2019learning] designs reinforcement learning algorithms for learning MFGs. To solve MFGs, [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] exploit machine learning techniques to approximate the density and the value function by two simultaneously trained neural networks. In contrast, our algorithm takes full advantage of the variational structure of MFGs and trains the neural networks in an adversarial manner. Our algorithm can be adapted for general dynamic systems with variational structures.
Related works on connecting GANs and OT.
The connection between GANs and OT can also be found in [salimans2018improving] and in [lei2017geometric]. The former uses a parametrization approach and the latter takes a geometric point of view. This connection has also been informally exploited for improving the stability of GANs training via the regularization method [gulrajani2017improved], [sanjabi2018convergence], and [chu2019].
2 Review: GANs and MFGs as minimax games
GANs.
GANs fall into the category of generative models. Recall that the procedure of generative modeling is to approximate an unknown probability distribution $\mathbb{P}_r$ by constructing a class of suitable parametric probability distributions $\mathbb{P}_\theta$. More specifically, given a latent space $\mathcal{Z}$ and a sample space $\mathcal{X}$, define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $\mathbb{P}_z$ and a sequence of parametric functions $G_\theta: \mathcal{Z} \to \mathcal{X}$. Then $\mathbb{P}_\theta$ is defined as the probability distribution of $G_\theta(Z)$.
In GANs, the parametric function $G_\theta$ is implemented using a neural network (NN) called the generator $G$. Meanwhile, another neural network for the discriminator $D$ assigns a score between $0$ and $1$ to each sample it receives, whether drawn from the true distribution $\mathbb{P}_r$ or from the approximate distribution $\mathbb{P}_\theta$. A higher score from the discriminator indicates that the sample is more likely to be from the true distribution. Mathematically, a GAN is the minimax game
(1) $\min_G \max_D \left\{ \mathbb{E}_{X \sim \mathbb{P}_r}[\log D(X)] + \mathbb{E}_{Z \sim \mathbb{P}_z}[\log(1 - D(G(Z)))] \right\}.$
A GAN is trained by optimizing $G$ and $D$ iteratively until $D$ can no longer distinguish between samples from $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Mathematically, training a GAN with an optimal discriminator amounts to minimizing some divergence between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Indeed, fixing $G$ and optimizing over $D$ in (1), the optimal discriminator is $D^*_G(x) = \frac{p_r(x)}{p_r(x) + p_\theta(x)}$, where $p_r$ and $p_\theta$ are the density functions of $\mathbb{P}_r$ and $\mathbb{P}_\theta$, respectively. Plugging this back into Equation (1), we have
(2) $\min_G \left\{ -\log 4 + 2\, JS(\mathbb{P}_r, \mathbb{P}_\theta) \right\}.$
That is, training GANs minimizes the Jensen-Shannon (JS) divergence between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. In essence, GANs minimize proper divergences between the true distribution and the generated distribution: for instance, [nock2017f] uses the f-divergence, [srivastava2019bregmn] explores the scaled Bregman divergence, [Arjovsky2017] adopts the Wasserstein-1 distance, and [guo2017relaxed] proposes the relaxed Wasserstein divergence.
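The step from the optimal discriminator to the JS divergence can be checked numerically. The following is a minimal sketch on a three-point sample space; the two probability vectors are arbitrary illustrative choices, not data from this paper. It evaluates the inner objective at the optimal discriminator, which assigns the score density-of-true over sum-of-densities, and compares it with $-\log 4 + 2\,JS$.

```python
import numpy as np

def gan_value(p_r, p_theta, d):
    """Inner GAN objective for discrete densities and discriminator scores d."""
    return np.sum(p_r * np.log(d)) + np.sum(p_theta * np.log(1.0 - d))

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive discrete densities."""
    return np.sum(p * np.log(p / q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative (assumed) true and generated distributions on three points.
p_r = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.2, 0.2, 0.6])

d_star = p_r / (p_r + p_theta)                      # optimal discriminator
value_at_optimum = gan_value(p_r, p_theta, d_star)
js_form = -np.log(4.0) + 2.0 * js_divergence(p_r, p_theta)

# Any other discriminator scores strictly lower for the same generator.
value_perturbed = gan_value(p_r, p_theta, d_star + 0.05)
```

The two quantities `value_at_optimum` and `js_form` agree up to floating-point error, while `value_perturbed` is smaller, which is the sense in which the discriminator step "computes" the divergence.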
MFGs.
MFGs are developed to approximate stochastic $N$-player games. The idea comes from the physics of interacting particle systems. By assuming players are indistinguishable and interchangeable, and by aggregation and the strong law of large numbers, MFGs focus on a representative player and the mean-field information. The value function of MFGs is then shown to approximate that of the corresponding $N$-player games with an error of order $1/\sqrt{N}$.
One of the classical solution approaches for MFGs is the iterative fixed point approach. It involves recursively solving a coupled PDE system: the backward HJB equation governing the value function and the forward FP equation governing the evolution of the optimally controlled state process.
The minimax structure of MFGs appears in [Cirant2018] when analyzing a class of MFGs on flat torus and a finite time horizon . Instead of solving the MFGs through the coupled PDE system, they analyze the following equivalent minimax game,
(3) 
where
Here, is the convex conjugate of the running cost of control for the game, is an integral of the running holding cost , is the value function for the game with the terminal condition , and is the density function for the controlled dynamics with the initial condition .
3 MFGs as GANs
The minimax representation (3) in the variational structure of MFGs motivates us to study the connection between GANs and a broader class of MFGs.
Recall a standard one-dimensional case where there is a continuum of rational and indistinguishable players. Let player be a representative player. Her objective is to choose the optimal control over an admissible control set for the following minimization problem for any and :
(MFG)  
Here the mean-field information is denoted by a flow of probability measures ; is the limiting empirical distribution of players’ states and, by the strong law of large numbers, for all .
Suppose the meanfield information admits a density function for all . Then the MFG (MFG) can be characterized by the following PDE system,
(HJB)  
(FP)  
Here the Hamiltonian in (HJB) is given by
and in (FP) is the optimal control.
Under proper technical assumptions on and , an iterative fixed point method solves (MFG) in the following way:

Fix the mean-field information denoted by the density function and solve the optimal control problem via solving (HJB).

Update the density function by solving (FP) under the resulting optimal control.

Iterate the previous two steps until convergence.
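The structure of this iteration can be illustrated on a toy problem. The sketch below is not a PDE solver: the function name `fixed_point_iteration`, its arguments, and the static quadratic game are all assumptions made for illustration. A scalar "mean field" stands in for the density, a closed-form best response stands in for solving (HJB), and the consistency update stands in for (FP).

```python
def fixed_point_iteration(best_response, consistency_update, m0,
                          tol=1e-10, max_iter=1000):
    """Alternate a best-response step and a mean-field update until they agree."""
    m = m0
    for k in range(max_iter):
        control = best_response(m)           # stands in for solving (HJB) with m fixed
        m_new = consistency_update(control)  # stands in for solving (FP) under that control
        if abs(m_new - m) < tol:
            return m_new, control, k + 1
        m = m_new
    raise RuntimeError("fixed-point iteration did not converge")

# Toy static game (assumed): each player picks a to minimize
# (a - (c + theta * m))**2, where m is the population mean action.
c, theta = 1.0, 0.5
m_star, a_star, n_iter = fixed_point_iteration(
    best_response=lambda m: c + theta * m,
    consistency_update=lambda a: a,  # identical players: the mean action equals a
    m0=0.0,
)
```

In this toy game the map is a contraction for $|\theta| < 1$, so the iteration converges to the equilibrium $m^* = c/(1-\theta)$; for genuine MFGs, convergence relies on the technical assumptions mentioned above.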
Now in light of the minimax representation of GANs and the variational structure of MFGs, we can recast MFGs as (dynamic) GANs, by specifying the roles of generator and discriminator, as summarized in Table 1.
 | GANs | MFGs
Generator G | NN for approximating the map $G: \mathcal{Z} \to \mathcal{X}$ | NN for solving HJB
Characterization of $\mathbb{P}_r$ | Sample data | FP equation for consistency
Discriminator D | NN measuring divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_r$ | NN for measuring differential residual from the FP equation
More precisely,

In GANs, the NN for the generator mimics the sample data in order to generate new samples that resemble the true distribution as closely as possible. The discriminator measures the performance of the generator through a second NN, and approximates some divergence between the generated distribution and the true distribution.

In MFGs, the generator is an NN that approximates the value function in equilibrium. The equilibrium, just like the true distribution in GANs, exists but is not explicitly available. The characterization of the equilibrium is through a consistency condition governed by (FP). The discriminator, through a second NN, measures the level of consistency via the differential residual of (FP).
This link between MFGs and GANs leads to a new computational algorithm for MFGs, namely Algorithm 1. This algorithm computes MFGs using two neural networks trained in an adversarial way: one NN approximating the unknown value function and the other approximating the unknown density function.
Note that Algorithm 1 can be adapted for broader classes of dynamic systems with variational structures. Such dynamic GANs structures have been exploited in [Yang2018] and [Yang2018a] to synthesize complex systems governed by physical laws.
4 GANs as MFGs
Having established MFGs as GANs, we now show that GANs are MFGs.
Theorem 1.
GANs as in [Goodfellow2014] are MFGs under the Pareto Optimality criterion.
Proof.
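Let $\mathbb{P}_r$ denote the probability distribution from which the real data is sampled on the sample space $\mathcal{X}$, and let $\mathbb{P}_z$ be the prior distribution of the input on $\mathcal{Z}$. A generator $G$ maps any $z \in \mathcal{Z}$ to $G(z) \in \mathcal{X}$. A discriminator $D$, on the other hand, takes any sample $x \in \mathcal{X}$ and returns the probability $D(x) \in [0, 1]$ of it being sampled from $\mathbb{P}_r$. The objective function of this GAN can be expressed as
(4) $\min_G \max_D \left\{ \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{z \sim \mathbb{P}_z}[\log(1 - D(G(z)))] \right\},$
where $G$ and $D$ are selected from appropriate functional spaces.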
Consider a group of indistinguishable players, each holding an initial belief distributed as . Players can access the sample data from a masked model , independent from , i.e., for ; each one is asked to find a strategy transforming the initial belief into a mimic version of the sample data so that on average the group can fool the best discriminator.
First, we define the set of admissible strategies and the candidate pool for discriminators. Let be the collection of mappings from to and be the collection of mappings from to . Fix any , let be player ’s initial belief and suppose . Let , , be the sample data. When player chooses strategy , , each player is subject to the same cost
where denotes the profile of strategies for all players.
Definition 1.
A profile of strategies is called a Pareto optimal point (PO) if , for all
Notice that the players are indistinguishable. Then there must be a symmetric PO consisting of the same strategy for all the players, provided that a PO exists. Let denote the set of symmetric strategies, i.e.,
When the number of players as well as the size of the sample data becomes large, by the strong law of large numbers, almost surely we have
From a game point of view, when the number of players is large, instead of the actual synthetic data made by the players, it is more feasible to focus on its distribution, . Furthermore, due to the strong law of large numbers, , with . Here, is called the mean field. Therefore, by the strong law of large numbers, sending and to infinity, the original loss for vanilla GANs is recovered,
∎
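The law-of-large-numbers step in the proof can be illustrated by simulation. The sketch below fixes one discriminator and one symmetric strategy and compares the finite-sample average cost with its expectation. The sigmoid discriminator, the affine generator, the standard normal data and prior, and the choice of equal numbers of players and data points are all assumptions made for illustration; the proof itself optimizes over the discriminator.

```python
import numpy as np

def log_sigmoid(t):
    """log D(t) for the assumed discriminator D(t) = 1 / (1 + exp(-t))."""
    return -np.logaddexp(0.0, -t)

def empirical_cost(real_samples, latent_samples, generator):
    """Finite-sample cost: mean log D on data plus mean log(1 - D) on synthetic data."""
    fake_samples = generator(latent_samples)
    # log(1 - sigmoid(t)) equals log_sigmoid(-t)
    return np.mean(log_sigmoid(real_samples)) + np.mean(log_sigmoid(-fake_samples))

def expected_cost(generator, n_nodes=80):
    """Limit of the cost for N(0, 1) data and prior, by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the N(0, 1) density
    return (np.sum(weights * log_sigmoid(nodes))
            + np.sum(weights * log_sigmoid(-generator(nodes))))

generator = lambda z: 2.0 * z + 1.0  # hypothetical symmetric strategy
rng = np.random.default_rng(0)
limit = expected_cost(generator)

errors = {}
for n in (100, 10_000, 1_000_000):
    real = rng.normal(size=n)    # sample data
    latent = rng.normal(size=n)  # players' initial beliefs
    errors[n] = abs(empirical_cost(real, latent, generator) - limit)
```

The entries of `errors` shrink as the number of players and data points grows, which is the convergence used to pass from the finite-player cost to the vanilla GAN loss.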
5 GANs as OT
As discussed in Section 2, through optimization over discriminators, GANs are essentially minimizing proper divergences between true distribution and the generated distribution over some sample space . Denote as the set of all probability distributions over the sample space . Define a generic divergence function,
For a broad class of GANs, if the divergence in the objective is viewed as the optimal cost of an optimal transport problem with cost function , then the optimization problem breaks down to two subproblems. The first subproblem is, under a fixed set of possible transport plans given by the generator , to compute or approximate the divergence , which is equivalent to solving the optimal transport problem. Note that this step can be done by solving the dual problem from Kantorovich–Rubinstein duality. The discriminator plays the role of the “price” function in the dual problem. The second subproblem is to minimize the optimal transport cost by finding the best .
This connection between GANs and OT is explicit in the case of Wasserstein GAN (WGAN).
Proposition 1.
In WGAN, the divergence is the optimal cost of the optimal transport problem with the cost function being the distance.
Proof.
To see this, take the Wasserstein-1 distance introduced in [Arjovsky2017], with the objective function given as
where the equivalence is based on the Kantorovich–Rubinstein duality. Here, the Wasserstein-1 distance between is given by where denotes the set of all possible couplings of and . Apart from the inaccessibility of an explicit form of , characterizing the infimum over couplings is also computationally challenging. Its dual form, on the other hand, gives rise to a natural adversarial training setup. Notice that there are two given distributions; one is the prior distribution for the latent variable that can be characterized analytically, and the other is the distribution of sample data whose analytical form is not available. Under any given generator , define a cost function ,
(5) 
Define as the set of all possible couplings between the probability distribution of , namely , and . The cost function (5) can be interpreted as the cost of transporting mass from a distribution to a different distribution . Then, consider the following optimal transport problem
(OT) 
Then, by Kantorovich–Rubinstein duality [villani2008optimal], the optimal transport problem (OT) becomes
(6) 
which is exactly the Wasserstein-1 distance between and . The role of the discriminator is to locate the best coupling among for (OT) under a given , whereas the role of the generator is to refine the set of possible couplings so that the infimum in (OT) becomes 0 eventually. ∎
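The primal and dual views of the Wasserstein-1 distance can be compared on a one-dimensional example, where the optimal coupling is known in closed form (sorted samples are matched). This is a sketch under simplifying assumptions: the data are synthetic, the "generated" sample is a shifted copy of the "real" one, and the candidate dual functions are hand-picked 1-Lipschitz functions rather than trained discriminators.

```python
import numpy as np

def w1_primal_1d(x, y):
    """Wasserstein-1 distance between two equal-size empirical measures on the line.

    In one dimension an optimal coupling pairs the sorted samples.
    """
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def dual_value(f, x, y):
    """Kantorovich-Rubinstein dual objective for a given 1-Lipschitz function f."""
    return np.mean(f(y)) - np.mean(f(x))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # stand-in for real samples
y = x[::-1] + 0.7          # stand-in for generated samples: same cloud, shifted

w1 = w1_primal_1d(x, y)
dual_at_identity = dual_value(lambda t: t, x, y)  # f(t) = t attains the supremum here
dual_at_tanh = dual_value(np.tanh, x, y)          # another 1-Lipschitz f: only a lower bound
```

Every 1-Lipschitz function gives a lower bound on the transport cost, and the best one attains it; in WGAN the discriminator plays the role of this dual function.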
We remark that in [salimans2018improving], the connection between GANs and OT is discussed from a different angle. Instead of using a fixed cost function , a class of cost functions parametrized by is considered. For the first subproblem, the primal form of the optimal transport problem is solved using the Sinkhorn algorithm proposed in [cuturi2013sinkhorn]. The parametrized cost function now plays the role of discriminator. In [lei2017geometric], a connection between GANs and optimal transport problem is established from a geometric point of view.
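For readers unfamiliar with the Sinkhorn algorithm mentioned above, the following is a minimal sketch of the standard entropy-regularized iteration on a small discrete problem. It uses a fixed absolute-difference cost and is not the parametrized-cost method of [salimans2018improving]; the regularization strength, iteration count, and marginals are illustrative assumptions.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.5, n_iter=200):
    """Entropy-regularized OT between discrete marginals a and b.

    Alternately rescales the rows and columns of the Gibbs kernel exp(-cost / reg)
    so that the resulting plan matches both marginals.
    """
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)  # match column marginals
        u = a / (kernel @ v)    # match row marginals
    plan = u[:, None] * kernel * v[None, :]
    return plan, np.sum(plan * cost)

# Five equally spaced support points on [0, 1] with assumed marginals.
points = np.linspace(0.0, 1.0, 5)
cost = np.abs(points[:, None] - points[None, :])
a = np.full(5, 0.2)
b = np.array([0.1, 0.1, 0.2, 0.3, 0.3])

plan, transport_cost = sinkhorn(a, b, cost)
```

The returned plan is a coupling of the two marginals, and its transport cost is an upper approximation of the unregularized optimal cost that tightens as `reg` decreases.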
6 Experiments
We now assess the quality of the proposed Algorithm 1, with a class of ergodic MFGs. Their explicit solution structures facilitate numerical comparison.
6.1 Ergodic MFGs with explicit solution
Take (MFG) and consider the following long-run average cost,
(7) 
then the PDE system (HJB)–(FP) becomes
(8) 
Here, the periodic value function , the periodic density function , and the unknown can be solved explicitly for
with
Indeed, assuming the existence of a smooth solution , in the second equation in (8) can be written as
(9) 
Hence the solution to (8) is given by
and
6.2 Result and analysis
Algorithm 1 is tested with the above MFGs in Section 6.1, for both one-dimensional and multi-dimensional cases.
Implementation.
As illustrated in Table 1, except for (FP), there is little information about the density function . Therefore, its NN approximation is assumed to be a maximum entropy probability distribution, i.e., ; see, for instance, [Finn2016]. Furthermore, the Deep Galerkin Method network architecture, proposed in [sirignano2018dgm], is adopted to implement both and .
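One common way to realize a maximum entropy parametrization is a Gibbs (energy-based) form, in which the density is proportional to the exponential of a negative energy. The sketch below assumes this form and normalizes it on a uniform grid over one period; the energy function used here is a hand-written placeholder for a network output, and the grid-based normalization is an illustrative choice rather than the paper's implementation.

```python
import numpy as np

def gibbs_density(energy, grid):
    """Normalize exp(-energy) into a probability density on a uniform periodic grid."""
    log_unnormalized = -energy(grid)
    log_unnormalized = log_unnormalized - log_unnormalized.max()  # numerical stability
    weights = np.exp(log_unnormalized)
    dx = grid[1] - grid[0]
    return weights / (weights.sum() * dx)

# Hypothetical energy standing in for a neural network output.
energy = lambda x: -2.0 * np.sin(2.0 * np.pi * x)

grid = np.linspace(0.0, 1.0, 200, endpoint=False)  # one period of the torus
density = gibbs_density(energy, grid)
total_mass = density.sum() * (grid[1] - grid[0])
```

By construction the density is strictly positive and integrates to one on the grid, so the network only has to learn the energy.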
Adaptation.
Since the MFG in Section 6.1 is of an ergodic type with a specified periodicity, Algorithm 1 is adapted accordingly.

To accommodate the periodicity given by the domain flat torus , for any data point , use
as input.

An additional trainable variable is introduced in the graphical model.

The loss functions and are modified according to the first and second equations of (8). The generator penalty becomes
Due to the structure , the discriminator penalty is no longer needed, i.e., .

We train the generator first in each inner loop, with more SGD steps and a larger learning rate than for the discriminator. (This is the opposite of typical GAN training.)
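The periodicity adaptation in the first item can be sketched as a feature map applied to each data point before it enters the network. The sine and cosine embedding with unit period used below is an assumption chosen for illustration, and the function name `periodic_features` is hypothetical.

```python
import numpy as np

def periodic_features(x, period=1.0):
    """Map points of shape (n, d) to periodic features of shape (n, 2 * d)."""
    x = np.atleast_2d(x)
    angle = 2.0 * np.pi * x / period
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=-1)

rng = np.random.default_rng(0)
batch = rng.uniform(size=(4, 3))  # four points in dimension three
features = periodic_features(batch)
shifted_features = periodic_features(batch + 1.0)
```

Any network fed with such features is periodic in its original input by construction, so periodicity does not have to be learnt or penalized.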
6.2.1 One-dimensional case
We first conduct a numerical experiment with one-dimensional input.
The DGM network for both and contains 1 hidden layer with nodes. The activation function for is the hyperbolic tangent function and that of is the sigmoid function. For the inner loops of generator and discriminator training, the minibatch size is . As mentioned in the adaptation, the number of SGD steps for the generator is with initial learning rate , whereas the number of SGD steps for the discriminator is with initial learning rate . The weight for the generator penalty is . The number of outer loops is . The Adam optimizer is used for the updates.
The result is summarized in Figure 1. Figures 1(a) and 1(b) show the learnt functions of and against the true ones, respectively. Both show good accuracy of the learnt functions versus the true ones. This is supported by the plots of loss in Figures 1(c) and 1(d), depicting the evolution of relative error as the number of outer iterations grows to . The relative error of a function against another function , with and not constant 0, is given by
Within iterations, the relative error of oscillates around , while the relative error of decreases below .
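A relative error of this kind can be computed on a grid of evaluation points. The sketch below assumes a discrete $L^2$ norm, with the error of the learnt function normalized by the norm of the reference function; the choice of norm and the test functions are assumptions made for illustration.

```python
import numpy as np

def relative_error(learnt_values, true_values):
    """Discrete L2 relative error of learnt values against reference values."""
    reference_norm = np.linalg.norm(true_values)
    if reference_norm == 0.0:
        raise ValueError("the reference function must not be identically zero")
    return np.linalg.norm(learnt_values - true_values) / reference_norm

grid = np.linspace(0.0, 1.0, 101)
true_values = np.sin(2.0 * np.pi * grid)
learnt_values = 1.1 * true_values  # a 10% multiplicative error

err = relative_error(learnt_values, true_values)
```

Here `err` is approximately 0.1, matching the 10% perturbation.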
To facilitate comparisons for broader classes of MFGs whose analytical solutions are not necessarily available, additional loss functions are adopted. Here we take the differential residuals of both the HJB and the FP equations as measures of performance. The evolution of the HJB and FP differential residual losses is shown in Figures 1(e) and 1(f), respectively. In these figures, the solid line is the average loss among 3 experiments, with the standard deviation captured by the shaded region around the line. Both differential residuals first rapidly descend to the magnitude of and then the descent slows down, accompanied by oscillation.
One may notice the difference between the training results of and . One reason is that and are implemented using different neural networks. The other is that different loss functions are adopted for training and .
Ablation study.
To understand possible contributing factors for the oscillation in the loss, especially for , an ablation study on the learning rate of the generator is conducted. In our test, the initial learning rate for the Adam optimizer takes the values of , , and , respectively.
From Figures 1(c) and 1(d), the relative error of oscillates more than that of . A similar phenomenon is observed in Figure 2. In particular, from Figure 2(a), a drastic decrease in oscillation can be seen as the generator learning rate decreases.
Turning to the differential residual losses, we observe from Figures 3 and 4 that, as the generator learning rate decreases from to , the residual losses for both HJB and FP decrease to a lower level with less oscillation.
Another parameter of interest is the number of samples in each minibatch, i.e., and in Algorithm 1. Setting , the cases of , , and are tested.
Figure 5(a) shows that the relative error of oscillates less as and increase from to . Moreover, comparing the cases of and , the residual losses for both HJB and FP decrease to a lower level with less oscillation as the minibatch size increases, as shown in Figures 6 and 7.
6.2.2 Multi-dimensional case
We also test with input of dimension , and the relative errors are shown in Figure 8.
The number of outer loops is increased to , with a generator learning rate of . Within iterations, the relative error of decreases below and that of decreases to .
Notice that a similar experiment for dimension has been conducted in [CarmonaLauriere_DL]; see Test Case 4. In comparison, their algorithms need a significantly larger number of iterations: of iterations vs our to achieve the same level of accuracy.