In the theory of machine learning, one of the most exciting recent developments is the introduction of Generative Adversarial Networks (GANs) by [Goodfellow2014]. GANs provide a versatile class of generative models. The key idea behind GANs is to interpret the process of generative modeling as a competing game between two neural networks: a generator network $G$ and a discriminator network $D$. The generator network attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is fake or real. Since their introduction to the machine learning community, the popularity of GANs has grown exponentially with numerous applications, including high resolution image generation [denton2015deep, radford2015unsupervised], image inpainting [yeh2016semantic], image super-resolution [ledig2016others], visual manipulation [zhu2016generative], text-to-image synthesis [reed2016generative], video generation [vondrick2016generating], semantic segmentation [luc2016semantic], and abstract reasoning diagram generation [ghosh2016contextual].
Meanwhile, the field of stochastic controls and games has witnessed tremendous growth in the theory of mean-field games (MFGs) since the pioneering works of Huang, Malhamé, and Caines [Huang2006] and Lasry and Lions [Lasry2007]. The formulation of MFGs provides an ingenious aggregation approach for analyzing the otherwise notoriously hard $N$-player stochastic games. Solutions to MFGs are shown to be good approximations of the Nash equilibria for the corresponding $N$-player games when $N$ is large. One of the approaches for analyzing MFGs is a fixed point method that involves solving a coupled partial differential equation (PDE) system: the backward Hamilton-Jacobi-Bellman (HJB) equation for the value function of the underlying control problem, and the forward Fokker-Planck (FP) equation for the dynamics of the controlled system. (See [Bensoussan2013], [Carmonaa], and the references therein.)
Curiously, these two vastly distinct subjects, GANs and MFGs, share an unexpected connection: both are minimax games. GANs are minimax games between the generator and the discriminator, whereas (variational) MFGs are minimax games between the value function and the controlled dynamics.
In this paper, we start by reviewing the minimax structures underlying both MFGs and GANs. We then establish theoretical connections between GANs and MFGs. We show that MFGs are intrinsically GANs, and GANs are MFGs under the Pareto Optimality criterion. Interpreting MFGs as GANs, on one hand, allows us to devise GANs-based algorithms to solve MFGs and, potentially, any dynamic systems with variational structures. Interpreting GANs as MFGs, on the other hand, provides a new and probabilistic foundation for GANs. Moreover, this interpretation helps establish an analytical connection between a broad class of GANs and Optimal Transport (OT) problems. Our numerical experiments show strong performance of our proposed algorithm, which trains the two neural networks in an adversarial way.
Related works on connecting GANs with games.
Earlier works linking deep learning, especially GANs, with games can be found in [Tembine2019] and the references therein. In particular, [Tembine2019] makes the connection from a microscopic perspective, in the sense that each neuron/layer represents a player and that the individual strategy lies in the choice of parameters at the specific neuron/layer. Here we present analytical connections between GANs and MFGs at a macroscopic level, and design algorithms for computing MFGs based on these connections.
Related works on computing MFGs.
Most computational approaches for solving MFGs adopt traditional numerical schemes, with the exception of [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] and [guo2019learning]. [guo2019learning] designs reinforcement learning algorithms for learning MFGs. To solve MFGs, [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] exploit machine learning techniques to approximate the density and the value function by two simultaneously trained neural networks. In contrast, our algorithm takes full advantage of the variational structure of MFGs and trains the neural networks in an adversarial manner. Our algorithm can be adapted for general dynamic systems with variational structures.
Related works on connecting GANs and OT.
The connection between GANs and OT can also be found in [salimans2018improving] and in [lei2017geometric]. The former uses a parametrization approach and the latter takes a geometric point of view. This connection has also been informally exploited for improving the stability of GANs training via the regularization method [gulrajani2017improved], [sanjabi2018convergence], and [chu2019].
2 Review: GANs and MFGs as minimax games
GANs fall into the category of generative models. Recall that the procedure of generative modeling is to approximate an unknown probability distribution $\mathbb{P}_r$ by constructing a class of suitable parametric probability distributions $\mathbb{P}_\theta$. More specifically, given a latent space $\mathcal{Z}$ and a sample space $\mathcal{X}$, define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $\mathbb{P}_z$ and a family of parametric functions $G_\theta: \mathcal{Z} \to \mathcal{X}$. Then $\mathbb{P}_\theta$ is defined as the probability distribution of $G_\theta(Z)$.
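In GANs, the parametric function $G_\theta$ is implemented using a neural network (NN) called the generator $G$. Meanwhile, another neural network for the discriminator $D$ will assign a score between $0$ and $1$ to each sample, drawn either from the true distribution $\mathbb{P}_r$ or from the approximate distribution $\mathbb{P}_\theta$. A higher score from the discriminator would indicate that the sample is more likely to be from the true distribution. Mathematically, a GAN is a minimax game:
$$\min_G \max_D \Big\{ \mathbb{E}_{X \sim \mathbb{P}_r}\big[\log D(X)\big] + \mathbb{E}_{Z \sim \mathbb{P}_z}\big[\log\big(1 - D(G(Z))\big)\big] \Big\}. \tag{1}$$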
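A GAN is trained by optimizing $G$ and $D$ iteratively until $D$ can no longer distinguish between samples from the true distribution $\mathbb{P}_r$ and the generated distribution $\mathbb{P}_\theta$. Mathematically, training a GAN with an optimal discriminator amounts to minimizing some divergence between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Indeed, fixing $G$ and optimizing for $D$ in (1), the optimal discriminator would be $D_G^*(x) = \frac{p_r(x)}{p_r(x) + p_\theta(x)}$, where $p_r$ and $p_\theta$ are the density functions of $\mathbb{P}_r$ and $\mathbb{P}_\theta$ respectively. Plugging this back into Equation (1), we have
$$\min_G \Big\{ -\log 4 + 2\, JS(\mathbb{P}_r, \mathbb{P}_\theta) \Big\}. \tag{2}$$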
That is, training GANs is to minimize the Jensen-Shannon (JS) divergence between the true distribution $\mathbb{P}_r$ and the generated distribution $\mathbb{P}_\theta$. In essence, GANs minimize proper divergences between the true distribution and the generated distribution: for instance, [nock2017f] uses f-divergence, [srivastava2019bregmn] explores scaled Bregman divergence, [Arjovsky2017] adopts Wasserstein-1 distance, and [guo2017relaxed] proposes relaxed Wasserstein divergence.
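The reduction from the minimax objective to the JS divergence can be checked numerically on a small finite sample space. The sketch below is illustrative only: the two discrete distributions `p_r` and `p_theta` are hypothetical, and the function names are not from the paper. It evaluates the GAN objective at the optimal discriminator $p_r/(p_r + p_\theta)$ and compares it with $-\log 4 + 2\,JS$.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p_r, p_theta, d):
    """E_{x~p_r}[log d(x)] + E_{x~p_theta}[log(1 - d(x))] on a finite space."""
    return float(np.sum(p_r * np.log(d)) + np.sum(p_theta * np.log(1.0 - d)))

# Hypothetical true and generated distributions on four points.
p_r = np.array([0.1, 0.2, 0.3, 0.4])
p_theta = np.array([0.4, 0.3, 0.2, 0.1])

# Optimal discriminator for a fixed generator.
d_star = p_r / (p_r + p_theta)
value_at_optimum = gan_value(p_r, p_theta, d_star)
closed_form = -np.log(4.0) + 2.0 * js(p_r, p_theta)

# Randomly perturbed discriminators should never beat d_star.
rng = np.random.default_rng(0)
perturbed_values = [
    gan_value(p_r, p_theta,
              np.clip(d_star + rng.normal(0.0, 0.05, size=4), 1e-6, 1.0 - 1e-6))
    for _ in range(100)
]
```

Here `value_at_optimum` and `closed_form` agree up to floating-point error, and every entry of `perturbed_values` is bounded above by `value_at_optimum`, consistent with the inner maximization over the discriminator.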
MFGs are developed to approximate stochastic $N$-player games. The idea comes from the physics of interacting particle systems. By assuming players are indistinguishable and interchangeable, and by aggregation and the strong law of large numbers, MFGs focus on a representative player and the mean-field information. The value function of MFGs is then shown to approximate that of the corresponding $N$-player games with an error of order $\frac{1}{\sqrt{N}}$.
One of the classical solution approaches for MFGs is the iterative fixed point approach. It involves recursively solving a coupled PDE system: the backward HJB equation governing the value function and the forward FP equation governing the evolution of the optimally controlled state process.
The minimax structure of MFGs appears in [Cirant2018] when analyzing a class of MFGs on flat torus and a finite time horizon . Instead of solving the MFGs through the coupled PDE system, they analyze the following equivalent minimax game,
Here, is the convex conjugate of the running cost of control for the game, is an integral of the running holding cost , is the value function for the game with the terminal condition , and is the density function for the controlled dynamics with the initial condition .
3 MFGs as GANs
The minimax representation (3) in the variational structure of MFGs motivates us to study the connection between GANs and a broader class of MFGs.
Recall a standard one-dimensional case where there is a continuum of rational and indistinguishable players. Let player be a representative player. Her objective is to choose the optimal control over an admissible control set for the following minimization problem for any and :
Here the mean-field information is denoted by a flow of probability measures ; is the limiting empirical distribution of players’ states and, by strong law of large numbers, for all .
Suppose the mean-field information admits a density function for all . Then the MFG (MFG) can be characterized by the following PDE system,
Here the Hamiltonian in (HJB) is given by
and in (FP) is the optimal control.
Under proper technical assumptions on and , an iterative fixed point method solves (MFG) in the following way:
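1. Fix the mean-field information denoted by the density function and solve the optimal control problem via solving (HJB).
2. Given the optimal control from the previous step, update the mean-field information via solving (FP).
3. Iterate the previous two steps until convergence.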
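The structure of this fixed point iteration can be illustrated on a static toy game, which is an assumption made purely for illustration and not the PDE system (MFG). In the sketch below, a player of type `c` picks an action `a` to minimize the hypothetical cost $\frac{1}{2}(a - c)^2 + \frac{\kappa}{2}(a - m)^2$, where `m` is the population mean action. The closed-form best response stands in for solving (HJB), and recomputing the population mean stands in for solving (FP).

```python
import numpy as np

def best_response(c, m, kappa):
    """Minimizer of 0.5*(a - c)**2 + 0.5*kappa*(a - m)**2 over a (toy stand-in for HJB)."""
    return (c + kappa * m) / (1.0 + kappa)

def solve_fixed_point(c, kappa, m0=0.0, tol=1e-10, max_iter=1000):
    """Alternate best response and mean-field update until the mean field stabilizes."""
    m = m0
    for k in range(max_iter):
        a = best_response(c, m, kappa)   # step 1: optimal control given the mean field
        m_new = float(np.mean(a))        # step 2: mean field induced by that control
        if abs(m_new - m) < tol:         # step 3: stop once the update is negligible
            return m_new, k + 1
        m = m_new
    raise RuntimeError("fixed point iteration did not converge")

# Hypothetical population of player types with mean 1.0.
c = np.linspace(-1.0, 3.0, 101)
m_star, n_iter = solve_fixed_point(c, kappa=2.0)
```

In this toy the update map is a contraction with factor $\kappa/(1+\kappa)$, so the iteration converges to the mean of the types. For the actual PDE system, convergence relies on the technical assumptions mentioned above.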
Now in light of the minimax representation of GANs and the variational structure of MFGs, we can recast MFGs as (dynamic) GANs, by specifying the roles of generator and discriminator, as summarized in Table 1.
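| | GANs | MFGs |
|---|---|---|
| Generator G | NN for approximating the map $G: \mathcal{Z} \to \mathcal{X}$ | NN for solving HJB |
| Characterization of $\mathbb{P}_r$ | Sample data | FP equation for consistency |
| Discriminator D | NN measuring divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_r$ | NN for measuring differential residual from the FP equation |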
In GANs, the NN for the generator mimics the sample data in order to generate new samples that resemble the true distribution as closely as possible. The discriminator measures the performance of the generator through a second NN, and approximates some divergence between the true distribution $\mathbb{P}_r$ and the generated distribution $\mathbb{P}_\theta$.
In MFGs, the generator is an NN that approximates the value function in equilibrium. The equilibrium, just like the true distribution in GANs, exists but is not explicitly available. The characterization of the equilibrium is through a consistency condition governed by (FP). The discriminator, through a second NN, measures the level of consistency via the differential residual of (FP).
This link between MFGs and GANs leads to a new computational algorithm for MFGs, namely Algorithm 1. This algorithm computes MFGs using two neural networks in an adversarial way: being the NN approximation of the unknown function and being the NN approximation of the unknown function .
Note that Algorithm 1 can be adapted for broader classes of dynamic systems with variational structures. Such dynamic GANs structures have been exploited in [Yang2018] and [Yang2018a] to synthesize complex systems governed by physical laws.
4 GANs as MFGs
Having established MFGs as GANs, we now show that GANs are MFGs.
GANs as in [Goodfellow2014] are MFGs under the Pareto Optimality criterion.
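Let $\mathbb{P}_r$ denote the probability distribution from which the real data is sampled on the sample space $\mathcal{X}$ and $\mathbb{P}_z$ be the prior distribution of the input on $\mathcal{Z}$. A generator $G$ maps any $z \in \mathcal{Z}$ to $G(z) \in \mathcal{X}$. A discriminator $D$, on the other hand, takes any sample $x \in \mathcal{X}$ and returns some probability $D(x) \in [0, 1]$ of $x$ being sampled from $\mathbb{P}_r$. The objective function of this GAN can be expressed as
$$\min_G \max_D \Big\{ \mathbb{E}_{X \sim \mathbb{P}_r}\big[\log D(X)\big] + \mathbb{E}_{Z \sim \mathbb{P}_z}\big[\log\big(1 - D(G(Z))\big)\big] \Big\},$$
where $G$ and $D$ are selected from appropriate functional spaces.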
Consider a group of indistinguishable players, each holding an initial belief distributed as . Players can access the sample data from a masked model , independent from , i.e., for ; each one is asked to find a strategy transforming the initial belief into a mimic version of the sample data so that on average the group can fool the best discriminator.
First, we define the set of admissible strategies and the candidate pool for discriminators. Let be the collection of mappings from to and be the collection of mappings from to . Fix any , let be player ’s initial belief and suppose . Let , , be the sample data. When player chooses strategy , , each player is subject to the same cost
where denotes the profile of strategies for all players.
A profile of strategies is called a Pareto optimal point (PO) if , for all
Notice that the players are indistinguishable. Then there must be a symmetric PO consisting of the same strategy for all the players, provided that a PO exists. Let denote the set of symmetric strategies, i.e.,
When the number of players and the size of the sample data both become large, by the strong law of large numbers, almost surely we have
From a game point of view, when the number of players is large, instead of the actual synthetic data made by the players, it is more feasible to focus on its distribution, . Furthermore, due to the strong law of large numbers, , with . Here, is called the mean field. Therefore, by the strong law of large numbers, sending and to infinity, the original loss for vanilla GANs is recovered,
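The law-of-large-numbers step can be illustrated with a Monte Carlo sketch. Everything concrete below is an assumption chosen for illustration: standard normal initial beliefs, sample data from $N(1, 1)$, a fixed common strategy `generator`, and a fixed `discriminator`. The empirical average cost over finitely many players and samples is taken in the form of the vanilla GAN loss, and it stabilizes as both counts grow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(z):
    """Hypothetical symmetric strategy used by every player."""
    return z + 0.5

def discriminator(x):
    """Hypothetical fixed discriminator returning a score in (0, 1)."""
    return sigmoid(x - 0.75)

def empirical_gan_loss(n_players, n_samples, rng):
    """Average of log D(x_j) over samples plus average of log(1 - D(G(z_i))) over players."""
    z = rng.standard_normal(n_players)          # initial beliefs, assumed N(0, 1)
    x = 1.0 + rng.standard_normal(n_samples)    # sample data, assumed N(1, 1)
    real_term = np.mean(np.log(discriminator(x)))
    fake_term = np.mean(np.log(1.0 - discriminator(generator(z))))
    return float(real_term + fake_term)

rng = np.random.default_rng(0)
losses = {n: empirical_gan_loss(n, n, rng) for n in (10**2, 10**4, 10**6)}
```

As the number of players and samples increases, the entries of `losses` settle around a common value, which is the expectation-based objective for this particular pair of generator and discriminator.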
5 GANs as OT
As discussed in Section 2, through optimization over discriminators, GANs are essentially minimizing proper divergences between true distribution and the generated distribution over some sample space . Denote as the set of all probability distributions over the sample space . Define a generic divergence function,
For a broad class of GANs, if the divergence in the objective is viewed as the optimal cost of an optimal transport problem with cost function , then the optimization problem breaks down to two sub-problems. The first sub-problem is, under a fixed set of possible transport plans given by the generator , to compute or approximate the divergence , which is equivalent to solving the optimal transport problem. Note that this step can be done by solving the dual problem from Kantorovich–Rubinstein duality. The discriminator plays the role of the “price” function in the dual problem. The second sub-problem is to minimize the optimal transport cost by finding the best .
This connection between GANs and OT is explicit in the case of Wasserstein-GAN (WGAN).
In WGAN, the divergence is the optimal cost of the optimal transport problem with the cost function being the distance.
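To see this, take the Wasserstein-1 distance introduced in [Arjovsky2017], with the objective function given as
$$\min_G \max_{\|D\|_L \le 1} \Big\{ \mathbb{E}_{X \sim \mathbb{P}_r}\big[D(X)\big] - \mathbb{E}_{Z \sim \mathbb{P}_z}\big[D(G(Z))\big] \Big\} = \min_G W_1(\mathbb{P}_r, \mathbb{P}_\theta),$$
where the maximum is taken over 1-Lipschitz functions $D$ and the equivalence is based on the Kantorovich-Rubinstein duality. Here, the Wasserstein-1 distance between the data distribution $\mathbb{P}_r$ and the generated distribution $\mathbb{P}_\theta$ is given by $W_1(\mathbb{P}_r, \mathbb{P}_\theta) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_\theta)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$, where $\Pi(\mathbb{P}_r, \mathbb{P}_\theta)$ denotes the set of all possible couplings of $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Apart from the inaccessibility of an explicit form of $\mathbb{P}_r$, characterizing the infimum over couplings is also computationally challenging. Its dual form, on the other hand, gives rise to a natural adversarial training setup. Notice that there are two given distributions; one is the prior distribution $\mathbb{P}_z$ for the latent variable $Z$, which can be characterized analytically, and the other is the distribution $\mathbb{P}_r$ of the sample data, whose analytical form is not available. Under any given generator $G$, define a cost function ,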
Define as the set of all possible couplings between the probability distribution of , namely , and . The cost function (5) can be interpreted as the cost of transporting mass from a distribution to a different distribution . Then, consider the following optimal transport problem
Then, by Kantorovich–Rubinstein duality [villani2008optimal], the optimal transport problem (OT) becomes
which is exactly the Wasserstein-1 distance between and . The role of the discriminator is to locate the best coupling among for (OT) under a given , whereas the role of the generator is to refine the set of possible couplings so that the infimum in (OT) becomes 0 eventually. ∎
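The primal and dual views of the Wasserstein-1 distance can be compared on a one-dimensional toy example. The sketch below is not the paper's construction: the support points and the two discrete distributions are hypothetical, and the cost is $|x - y|$. It builds a feasible coupling by the monotone (north-west corner) rule, builds a 1-Lipschitz potential playing the role of the discriminator, and checks that the two costs coincide. By weak duality, equality certifies that both are optimal.

```python
import numpy as np

def monotone_coupling(p, q, tol=1e-12):
    """Feasible coupling of p and q that matches mass from left to right."""
    p = np.array(p, dtype=float)   # copies, so the inputs are not modified
    q = np.array(q, dtype=float)
    gamma = np.zeros((p.size, q.size))
    i = j = 0
    while i < p.size and j < q.size:
        mass = min(p[i], q[j])
        gamma[i, j] += mass
        p[i] -= mass
        q[j] -= mass
        row_done = p[i] <= tol
        col_done = q[j] <= tol
        if row_done:
            i += 1
        if col_done:
            j += 1
    return gamma

def primal_cost(x, gamma):
    """Transport cost of a coupling under the cost |x_i - x_j|."""
    return float(np.sum(gamma * np.abs(x[:, None] - x[None, :])))

def dual_potential(x, p, q):
    """1-Lipschitz function on the grid whose slope is -sign(F_p - F_q)."""
    cdf_gap = np.cumsum(p - q)[:-1]
    increments = -np.sign(cdf_gap) * np.diff(x)
    return np.concatenate(([0.0], np.cumsum(increments)))

# Hypothetical support and distributions.
x = np.array([0.0, 1.0, 2.0, 3.0])
p_r = np.array([0.1, 0.2, 0.3, 0.4])
p_theta = np.array([0.4, 0.3, 0.2, 0.1])

gamma = monotone_coupling(p_r, p_theta)
primal = primal_cost(x, gamma)
d = dual_potential(x, p_r, p_theta)
dual = float(np.sum(d * (p_r - p_theta)))   # E_{p_r}[d] - E_{p_theta}[d]
```

In higher dimensions neither the optimal coupling nor the optimal potential has such a closed form, which is where the adversarial training of a discriminator network comes in.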
We remark that in [salimans2018improving], the connection between GANs and OT is discussed from a different angle. Instead of using a fixed cost function , a class of cost functions parametrized by is considered. For the first sub-problem, the primal form of the optimal transport problem is solved using the Sinkhorn algorithm proposed in [cuturi2013sinkhorn]. The parametrized cost function now plays the role of discriminator. In [lei2017geometric], a connection between GANs and optimal transport problem is established from a geometric point of view.
We now assess the quality of the proposed Algorithm 1, with a class of ergodic MFGs. Their explicit solution structures facilitate numerical comparison.
6.1 Ergodic MFGs with explicit solution
Take (MFG) and consider the following long-run average cost,
Here, the periodic value function , the periodic density function , and the unknown can be solved explicitly for
Indeed, assuming the existence of a smooth solution , in the second equation in (8) can be written as
Hence the solution to (8) is given by
6.2 Result and analysis
As illustrated in Table 1, except for (FP), there is little information about the density function . Therefore, its NN approximation is assumed to be a maximum entropy probability distribution, i.e., ; see, for instance, [Finn2016]. Furthermore, the Deep Galerkin Method (DGM) network architecture, proposed in [sirignano2018dgm], is adopted to implement both and .
To accommodate the periodicity given by the domain flat torus , for any data point , use
An additional trainable variable is introduced in the graphical model.
We train the generator first in each inner loop with more SGD steps and a larger learning rate compared with the discriminator. (This is opposite to the typical GANs training.)
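The exact input transformation used for periodicity is not recoverable from the text above. One common way to make a network periodic on the flat torus, shown here purely as an assumption, is to feed it sine and cosine features of each coordinate with period 1. The function name `periodic_features` is hypothetical.

```python
import numpy as np

def periodic_features(x):
    """Map points of shape (n, d) to (n, 2d) features that are 1-periodic in every coordinate.

    A one-dimensional input of length d is treated as a single point.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    angle = 2.0 * np.pi * x
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=1)

# Hypothetical batch of five points in dimension three.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(5, 3))
features = periodic_features(points)
shifted_features = periodic_features(points + 1.0)
```

Any network applied to `features` instead of `points` is automatically periodic, since shifting a coordinate by an integer leaves the features unchanged up to floating-point error.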
6.2.1 One-dimensional case
We first conduct a numerical experiment with one-dimensional input.
The DGM network for both and contains 1 hidden layer with nodes. The activation function for is the hyperbolic tangent function and that of is the sigmoid function. For the inner loops of generator and discriminator training, the minibatch size is . As mentioned in the adaptation, the number of SGD steps for the generator is with initial learning rate , whereas the number of SGD steps for the discriminator is with initial learning rate . The weight for the generator penalty is . The number of outer loops is . The Adam optimizer is used for the updates.
The result is summarized in Figure 1. Figures 1(a) and 1(b) show the learnt functions of and against the true ones, respectively. Both show good accuracy of the learnt functions versus the true ones. This is supported by the plots of loss in Figures 1(c) and 1(d), depicting the evolution of the relative error as the number of outer iterations grows to . The relative error of a function against another function , with and not constant 0, is given by
Within iterations, while the relative error of oscillates around , the relative error of decreases below .
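Since the displayed formula for the relative error is not available in the text above, the sketch below assumes a discrete $L^2$ norm on a set of evaluation points. The norm choice, the grid, and the function name `relative_error` are assumptions for illustration only.

```python
import numpy as np

def relative_error(learnt_vals, true_vals):
    """Assumed definition: ||learnt - true||_2 / ||true||_2 on a common set of points."""
    learnt_vals = np.asarray(learnt_vals, dtype=float)
    true_vals = np.asarray(true_vals, dtype=float)
    denom = np.linalg.norm(true_vals)
    if denom == 0.0:
        raise ValueError("reference function must not be identically zero")
    return float(np.linalg.norm(learnt_vals - true_vals) / denom)

# Hypothetical reference function and a learnt function that is 1% too large.
grid = np.linspace(0.0, 1.0, 200, endpoint=False)
true_vals = np.sin(2.0 * np.pi * grid)
learnt_vals = 1.01 * true_vals
err = relative_error(learnt_vals, true_vals)
```

With this definition a uniform 1% overshoot gives a relative error of 0.01 regardless of the grid, which makes the metric comparable across functions of different scale.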
To facilitate comparisons for broader classes of MFGs whose analytical solutions are not necessarily available, additional loss functions are adopted. Here we take the differential residuals of both the HJB and the FP equations as measures of performance. The evolution of the HJB and FP differential residual losses is shown in Figures 1(e) and 1(f), respectively. In these figures, the solid line is the average loss among 3 experiments, with the standard deviation captured by the shaded band around the line. Both differential residuals first rapidly descend to the magnitude of and then the descent slows down, accompanied by oscillation.
One may notice the difference between the training results of and . One reason is that and are implemented using different neural networks. The other is that different loss functions are adopted for training and .
To understand possible contributing factors for the oscillation in the loss, especially for , an ablation study on the learning rate of the generator is conducted. In our test, the initial learning rate for the Adam optimizer takes the values of , , and , respectively.
From Figures 1(c) and 1(d), the relative error on oscillates more than that of . A similar phenomenon is observed in Figure 2. In particular, from Figure 2(a), a drastic decrease in oscillation can be seen as the generator learning rate decreases.
Another parameter of interest is the number of samples in each minibatch, i.e., and in Algorithm 1. Setting , the cases of , , and are tested.
Figure 4(a) shows that the relative error of oscillates less as and increase from to . Moreover, comparing the cases of and , the residual losses for both HJB and FP decrease to a lower level with less oscillation as the minibatch size increases, as shown in Figures 6 and 7.
6.2.2 Multi-dimensional case
We also test with input of dimension and relative errors are shown in Figure 8.
The number of outer loops is increased to , with generator learning rate of . Within iterations, the relative error of decreases below and that of decreases to .
Notice that a similar experiment for dimension has been conducted in [CarmonaLauriere_DL]; see Test Case 4. In comparison, their algorithms need a significantly larger number of iterations: iterations versus our to achieve the same level of accuracy.