Connecting GANs and MFGs

02/10/2020 ∙ by Haoyang Cao, et al. ∙ Princeton University berkeley college 0

Generative Adversarial Networks (GANs), introduced in 2014 [12], have enjoyed great empirical success, especially in image generation and processing. Meanwhile, Mean-Field Games (MFGs), established in [17] and [16] as analytically feasible approximations for N-player games, have experienced rapid growth in theoretical studies. In this paper, we establish theoretical connections between GANs and MFGs. Interpreting MFGs as GANs, on one hand, allows us to devise GANs-based algorithms to solve MFGs. Interpreting GANs as MFGs, on the other hand, provides a new and probabilistic foundation for GANs. Moreover, this interpretation helps establish an analytical connection between GANs and Optimal Transport (OT) problems.


1 Introduction

In the theory of machine learning, one of the most exciting recent developments is the introduction of Generative Adversarial Networks (GANs) by [Goodfellow2014]. GANs provide a versatile class of generative models. The key idea behind GANs is to interpret the process of generative modeling as a competing game between two neural networks: a generator network $G$ and a discriminator network $D$. The generator network attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is fake or real. Since their introduction to the machine learning community, the popularity of GANs has grown rapidly, with numerous applications including high-resolution image generation [denton2015deep, radford2015unsupervised], image inpainting [yeh2016semantic], image super-resolution [ledig2016others], visual manipulation [zhu2016generative], text-to-image synthesis [reed2016generative], video generation [vondrick2016generating], semantic segmentation [luc2016semantic], and abstract reasoning diagram generation [ghosh2016contextual].

Meanwhile, the field of stochastic controls and games has witnessed tremendous growth in the theory of mean-field games (MFGs) since the pioneering works of Huang, Malhamé, and Caines [Huang2006] and Lasry and Lions [Lasry2007]. The formulation of MFGs provides an ingenious aggregation approach for analyzing the otherwise notoriously hard $N$-player stochastic games. Solutions to MFGs are shown to be good approximations of the Nash equilibria for the corresponding $N$-player games when $N$ is large. One of the approaches for analyzing MFGs is a fixed point method that involves solving a coupled partial differential equation (PDE) system: the backward Hamilton-Jacobi-Bellman (HJB) equation for the value function of the underlying control problem, and the forward Fokker-Planck (FP) equation for the dynamics of the controlled system. (See [Bensoussan2013], [Carmonaa], and the references therein.)

Curiously, these two vastly distinct subjects, GANs and MFGs, share an unexpected connection: both are minimax games. GANs are minimax games between the generator and the discriminator, whereas (variational) MFGs are minimax games between the value function and the controlled dynamics.

In this paper, we start by reviewing the minimax structures underlying both MFGs and GANs. We then establish theoretical connections between GANs and MFGs. We show that MFGs are intrinsically GANs, and GANs are MFGs under the Pareto Optimality criterion. Interpreting MFGs as GANs, on one hand, allows us to devise GANs-based algorithms to solve MFGs and, potentially, any dynamic systems with variational structures. Interpreting GANs as MFGs, on the other hand, provides a new and probabilistic foundation for GANs. Moreover, this interpretation helps establish an analytical connection between a broad class of GANs and Optimal Transport (OT) problems. Our numerical experiments show strong performance of our proposed algorithm, which trains the two neural networks in an adversarial way.

Related works on connecting GANs with games.

Earlier works linking deep learning, especially GANs, with games can be found in [Tembine2019] and the references therein. In particular, [Tembine2019] makes the connection from a microscopic perspective in the sense that each neuron/layer represents a player and that individual strategy lies in the choices of parameters at the specific neuron/layer. Here we present analytical connections between GANs and MFGs at a macroscopic level, and design algorithms for computing MFGs based on these connections.

Related works on computing MFGs.

Most computational approaches for solving MFGs adopt traditional numerical schemes, with the exception of [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] and [guo2019learning]. [guo2019learning] designs reinforcement learning algorithms for learning MFGs. To solve MFGs, [CarmonaLauriere_DL, CarmonaLauriere_DL_periodic] exploit machine learning techniques to approximate the density and the value function by two simultaneously trained neural networks. In contrast, our algorithm takes full advantage of the variational structure of MFGs and trains the neural networks in an adversarial manner. Our algorithm can be adapted for general dynamic systems with variational structures.

Related works on connecting GANs and OT.

The connection between GANs and OT can also be found in [salimans2018improving] and in [lei2017geometric]. The former uses a parametrization approach and the latter takes a geometric point of view. This connection has also been informally exploited for improving the stability of GANs training via the regularization method [gulrajani2017improved], [sanjabi2018convergence], and [chu2019].

2 Review: GANs and MFGs as minimax games

GANs.

GANs fall into the category of generative models. Recall that the procedure of generative modeling is to approximate an unknown probability distribution $\mathbb{P}_r$ by constructing a class of suitable parametric probability distributions $\mathbb{P}_\theta$. More specifically, given a latent space $\mathcal{Z}$ and a sample space $\mathcal{X}$, define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $\mathbb{P}_z$ and a sequence of parametric functions $G_\theta : \mathcal{Z} \to \mathcal{X}$. Then $\mathbb{P}_\theta$ is defined as the probability distribution of $G_\theta(Z)$.

In GANs, the parametric function $G_\theta$ is implemented using a neural network (NN) called the generator $G$. Meanwhile, another neural network for the discriminator $D$ assigns a score between $0$ and $1$ to each sample, whether it comes from the true distribution $\mathbb{P}_r$ or the approximate distribution $\mathbb{P}_\theta$. A higher score from the discriminator indicates that the sample is more likely to be from the true distribution. Mathematically, a GAN is a minimax game:

$$\min_G \max_D \left\{ \mathbb{E}_{X \sim \mathbb{P}_r}[\log D(X)] + \mathbb{E}_{Z \sim \mathbb{P}_z}[\log(1 - D(G(Z)))] \right\}. \tag{1}$$

A GAN is trained by optimizing $G$ and $D$ iteratively until $D$ can no longer distinguish between samples from $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Mathematically, training a GAN with an optimal discriminator amounts to minimizing some divergence between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. Indeed, fixing $G$ and optimizing for $D$ in (1), the optimal discriminator is $D^*_G(x) = \frac{p_r(x)}{p_r(x) + p_\theta(x)}$, where $p_r$ and $p_\theta$ are the density functions of $\mathbb{P}_r$ and $\mathbb{P}_\theta$ respectively. Plugging this back into Equation (1), we have

$$\min_G \left\{ -\log 4 + 2\, JS(\mathbb{P}_r, \mathbb{P}_\theta) \right\}. \tag{2}$$

That is, training GANs amounts to minimizing the Jensen-Shannon (JS) divergence between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. In essence, GANs minimize proper divergences between the true distribution and the generated distribution: for instance, [nock2017f] uses f-divergence, [srivastava2019bregmn] explores scaled Bregman divergence, [Arjovsky2017] adopts Wasserstein-1 distance, and [guo2017relaxed] proposes relaxed Wasserstein divergence.
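The relation between the GAN objective at the optimal discriminator and the JS divergence can be checked numerically on a small discrete example. The sketch below is illustrative only: the two probability vectors and the helper names (`kl`, `js`, `gan_value_at_optimal_d`) are assumptions introduced here and do not come from the paper.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence, using the mixture (p + q) / 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def gan_value_at_optimal_d(p_r, p_theta):
    """GAN objective evaluated at D*(x) = p_r(x) / (p_r(x) + p_theta(x))."""
    value = 0.0
    for pr, pt in zip(p_r, p_theta):
        d_star = pr / (pr + pt)
        if pr > 0:
            value += pr * math.log(d_star)
        if pt > 0:
            value += pt * math.log(1.0 - d_star)
    return value

p_r = [0.1, 0.4, 0.5]       # hypothetical "true" distribution
p_theta = [0.3, 0.3, 0.4]   # hypothetical generated distribution

lhs = gan_value_at_optimal_d(p_r, p_theta)
rhs = -math.log(4) + 2 * js(p_r, p_theta)
print(lhs, rhs)             # the two values agree up to rounding
```

When the generated distribution equals the true one, the JS term vanishes and the value reduces to $-\log 4$, which is the minimum of the objective over generators.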

MFGs.

MFGs are developed to approximate stochastic $N$-player games. The idea comes from the physics of interacting particle systems. By assuming players are indistinguishable and interchangeable, and by aggregation and the strong law of large numbers, MFGs focus on a representative player and the mean-field information. The value function of MFGs is then shown to approximate that of the corresponding $N$-player games with an error of order $O(1/\sqrt{N})$.

One of the classical solution approaches for MFGs is the iterative fixed point approach. It involves recursively solving a coupled PDE system: the HJB equation governing the value function and the forward FP equation governing the evolution of the optimally controlled state process.

The minimax structure of MFGs appears in [Cirant2018] when analyzing a class of MFGs on the flat torus $\mathbb{T}^d$ and a finite time horizon $[0, T]$. Instead of solving the MFGs through the coupled PDE system, they analyze the following equivalent minimax game,

(3)

where

Here, is the convex conjugate of the running cost of control for the game, is an integral of the running holding cost , is the value function for the game with the terminal condition , and is the density function for the controlled dynamics with the initial condition .

3 MFGs as GANs

The minimax representation (3) in the variational structure of MFGs motivates us to study the connection between GANs and a broader class of MFGs.

Recall a standard one-dimensional case where there is a continuum of rational and indistinguishable players. Let player be a representative player. Her objective is to choose the optimal control over an admissible control set for the following minimization problem for any and :

(MFG)

Here the mean-field information is denoted by a flow of probability measures ; is the limiting empirical distribution of players’ states and, by strong law of large numbers, for all .

Suppose the mean-field information admits a density function for all . Then the MFG (MFG) can be characterized by the following PDE system,

(HJB)
(FP)

Here the Hamiltonian in (HJB) is given by

and in (FP) is the optimal control.

Under proper technical assumptions on and , an iterative fixed point method solves (MFG) in the following way:

  1. Fix the mean-field information denoted by the density function $m$ and solve the optimal control problem via solving (HJB).

  2. Let $u$ be the solution to (HJB) from the previous step. Then, under the optimal control induced by $u$, update the mean-field information $m$ via (FP).

  3. Iterate the previous two steps until convergence.
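The three steps above can be illustrated on a toy static analogue, sketched below. This is not the coupled (HJB)–(FP) system: the quadratic cost, the constants `c` and `lam`, and the function names are all hypothetical choices made here so that the best response is available in closed form. Each identical player minimizes $\frac{1}{2}(a - c)^2 + \frac{\lambda}{2}(a - m)^2$ given the mean action $m$, and consistency requires $m$ to equal the best response.

```python
def best_response(m, c=1.0, lam=2.0):
    """Step 1 analogue: optimal action given a frozen mean field m.

    Minimizes 0.5 * (a - c)**2 + 0.5 * lam * (a - m)**2 over a.
    """
    return (c + lam * m) / (1.0 + lam)

def fixed_point_iteration(m0, tol=1e-10, max_iter=1000):
    """Alternate best response and mean-field update until convergence."""
    m = m0
    for k in range(max_iter):
        a = best_response(m)       # step 1: solve the control problem for fixed m
        m_new = a                  # step 2: identical players, so the mean equals a
        if abs(m_new - m) < tol:   # step 3: stop once the mean field is consistent
            return m_new, k + 1
        m = m_new
    return m, max_iter

m_star, n_iter = fixed_point_iteration(m0=5.0)
print(m_star, n_iter)
```

In this toy problem the update is a contraction with factor $\lambda / (1 + \lambda)$, so the iteration converges to the unique consistent mean field $m = c$. For the PDE system, convergence depends on the technical assumptions mentioned above.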

Now in light of the minimax representation of GANs and the variational structure of MFGs, we can recast MFGs as (dynamic) GANs, by specifying the roles of generator and discriminator, as summarized in Table 1.

| | GANs | MFGs |
|---|---|---|
| Generator G | NN for approximating the map $G: \mathcal{Z} \to \mathcal{X}$ | NN for solving HJB |
| Characterization of $\mathbb{P}_r$ | Sample data | FP equation for consistency |
| Discriminator D | NN measuring divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_r$ | NN for measuring differential residual from the FP equation |

Table 1: Link between GANs and MFGs

More precisely,

  • In GANs, the NN for the generator mimics the sample data in order to generate new samples that resemble the true distribution as closely as possible. The discriminator measures the performance of the generator through a second NN, and approximates some divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_r$.

  • In MFGs, the generator is an NN that approximates the value function in equilibrium. The equilibrium, just like the true distribution in GANs, exists but is not explicitly available. The characterization of the equilibrium is through a consistency condition governed by (FP). The discriminator, through a second NN, measures the level of consistency via the differential residual of (FP).

This link between MFGs and GANs leads to a new computational algorithm for MFGs, namely Algorithm 1. This algorithm computes MFGs using two neural networks in an adversarial way: $u_\theta$ being the NN approximation of the unknown value function $u$ and $m_\omega$ being the NN approximation of the unknown density function $m$.

Note that Algorithm 1 can be adapted for broader classes of dynamic systems with variational structures. Such dynamic GANs structures have been exploited in [Yang2018] and [Yang2018a] to synthesize complex systems governed by physical laws.

  At , initialize and . Let and be the number of training steps of the inner-loops and be that of the outer-loop. Let , .
  for  do
     Let , .
     Sample on according to a predetermined distribution , where denotes the number of training samples for updating loss related to FP residual.
     Let , with
where is a known density function for the initial distribution of the states and is the weight for the penalty on the initial condition of .
     for  do
         with learning rate .
        Increase .
     end for
     Sample on according to a predetermined distribution , where denotes the number of training samples for updating loss related to HJB residual.
     Let , with
where is the weight for the penalty on the terminal condition of .
     for  do
         with learning rate .
        Increase
     end for
     Increase .
  end for
  Return ,
Algorithm 1 MFGAN
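The loop structure of Algorithm 1 (an outer loop containing a discriminator inner loop followed by a generator inner loop) can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the "networks" are single scalars `theta` and `omega`, the two quadratic losses merely stand in for the HJB and FP residual losses, and each inner loop is assumed to perform plain gradient descent on its own loss.

```python
def fp_residual_loss_grad(theta, omega):
    """Gradient in omega of the stand-in discriminator loss (omega - theta)**2."""
    return 2.0 * (omega - theta)

def hjb_residual_loss_grad(theta, omega):
    """Gradient in theta of the stand-in generator loss (theta - 0.5 * (1 + omega))**2."""
    return 2.0 * (theta - 0.5 * (1.0 + omega))

def train(outer_steps=200, n_omega=5, n_theta=5, lr_d=0.1, lr_g=0.1):
    theta, omega = 0.0, 3.0                 # arbitrary initialization
    for _ in range(outer_steps):            # outer loop
        for _ in range(n_omega):            # inner loop: discriminator update
            omega -= lr_d * fp_residual_loss_grad(theta, omega)
        for _ in range(n_theta):            # inner loop: generator update
            theta -= lr_g * hjb_residual_loss_grad(theta, omega)
    return theta, omega

theta, omega = train()
print(theta, omega)
```

Both stand-in losses vanish simultaneously only at `theta = omega = 1`, and the alternating updates approach that point. In the actual algorithm the losses are Monte Carlo estimates of PDE residuals plus boundary penalties, so no such closed-form target is available.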

4 GANs as MFGs

Having established MFGs as GANs, we now show that GANs are MFGs.

Theorem 1.

GANs as in [Goodfellow2014] are MFGs under the Pareto Optimality criterion.

Proof.

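Let $\mathbb{P}_r$ denote the probability distribution from which the real data is sampled on the sample space $\mathcal{X}$ and $\mathbb{P}_z$ be the prior distribution of the input on $\mathcal{Z}$. A generator $G$ maps any $z \in \mathcal{Z}$ to $G(z) \in \mathcal{X}$. A discriminator $D$, on the other hand, takes any sample $x \in \mathcal{X}$ and returns the probability that it was sampled from $\mathbb{P}_r$. The objective function of this GAN can be expressed as

$$\min_G \max_D \left\{ \mathbb{E}_{X \sim \mathbb{P}_r}[\log D(X)] + \mathbb{E}_{Z \sim \mathbb{P}_z}[\log(1 - D(G(Z)))] \right\}, \tag{4}$$

where $G$ and $D$ are selected from appropriate functional spaces.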

Consider a group of indistinguishable players, each holding an initial belief distributed as . Players can access the sample data from a masked model , independent from , i.e., for ; each one is asked to find a strategy transforming the initial belief into a mimic version of the sample data so that on average the group can fool the best discriminator.

First, we define the set of admissible strategies and the candidate pool for discriminators. Let be the collection of mappings from to and be the collection of mappings from to . Fix any , let be player ’s initial belief and suppose . Let , , be the sample data. When player chooses strategy , , each player is subject to the same cost

where denotes the profile of strategies for all players.

Definition 1.

A profile of strategies is called a Pareto optimal point (PO) if , for all

Notice that the players are indistinguishable. Then there must be a symmetric PO consisting of the same strategy for all the players, provided that a PO exists. Let denote the set of symmetric strategies, i.e.,

When the number of players as well as the size of the sample data becomes large, by the strong law of large numbers, almost surely we have

From a game point of view, when the number of players is large, instead of the actual synthetic data made by the players, it is more feasible to focus on its distribution, . Furthermore, due to strong law of large numbers, , with . Here, is called the mean field. Therefore, by strong law of large numbers, sending and to infinity the original loss for vanilla GANs is recovered,
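The law-of-large-numbers step can be illustrated by a small simulation. The sketch below rests on assumptions that are not taken from the paper: the data and the initial beliefs are both uniform on $[0, 1]$, every player uses the same hypothetical strategy `generator`, the discriminator is fixed rather than optimized, and the finite-player cost is taken to be the empirical average of the vanilla GAN loss.

```python
import math
import random

def discriminator(x):
    return 0.25 + 0.5 * x          # hypothetical fixed discriminator with values in (0, 1)

def generator(z):
    return z * z                   # hypothetical common strategy shared by all players

def empirical_loss(n_players, n_samples, rng):
    """Average cost with finitely many players (beliefs z_i) and data points x_j."""
    real = sum(math.log(discriminator(rng.random())) for _ in range(n_samples)) / n_samples
    fake = sum(math.log(1.0 - discriminator(generator(rng.random())))
               for _ in range(n_players)) / n_players
    return real + fake

def mean_field_loss(n_grid=100_000):
    """Expected loss under uniform data and beliefs, via midpoint quadrature."""
    pts = [(i + 0.5) / n_grid for i in range(n_grid)]
    return sum(math.log(discriminator(p)) + math.log(1.0 - discriminator(generator(p)))
               for p in pts) / n_grid

rng = random.Random(0)
limit = mean_field_loss()
for n in (10, 1_000, 100_000):
    # gap between the finite-player cost and its mean-field limit
    print(n, abs(empirical_loss(n, n, rng) - limit))
```

The printed gaps are random, but they typically shrink as the number of players and data points grows, which is the sense in which the finite-player cost depends on the strategies only through the distribution of the generated data.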

5 GANs as OT

As discussed in Section 2, through optimization over discriminators, GANs are essentially minimizing proper divergences between true distribution and the generated distribution over some sample space . Denote as the set of all probability distributions over the sample space . Define a generic divergence function,

For a broad class of GANs, if the divergence in the objective is viewed as the optimal cost of an optimal transport problem with cost function , then the optimization problem breaks down to two sub-problems. The first sub-problem is, under a fixed set of possible transport plans given by the generator , to compute or approximate the divergence , which is equivalent to solving the optimal transport problem. Note that this step can be done by solving the dual problem from Kantorovich–Rubinstein duality. The discriminator plays the role of the “price” function in the dual problem. The second sub-problem is to minimize the optimal transport cost by finding the best .

This connection between GANs and OT is explicit in the case of Wasserstein-GAN (WGAN).

Proposition 1.

In WGAN, the divergence is the optimal cost of the optimal transport problem with the cost function being the distance.

Proof.

To see this, take the Wasserstein-1 distance introduced in [Arjovsky2017], with the objective function given as

where the equivalence is based on the Kantorovich-Rubinstein duality. Here, the Wasserstein-1 distance between is given by where denotes the set of all possible coupling of and . Apart from the inaccessibility of an explicit form of , characterizing the infimum over couplings is also computationally challenging. Its dual form, on the other hand, gives rise to a natural adversarial training setup. Notice that there are two given distributions; one is the prior distribution for the latent variable that can be characterized analytically, and the other is the distribution of sample data whose analytical form is not available. Under any given generator , define a cost function ,

(5)

Define as the set of all possible couplings between the probability distribution of , namely , and . The cost function (5) can be interpreted as the cost of transporting mass from a distribution to a different distribution . Then, consider the following optimal transport problem

(OT)

Then, by Kantorovich–Rubinstein duality [villani2008optimal], the optimal transport problem (OT) becomes

(6)

which is exactly the Wasserstein-1 distance between and . The role of the discriminator is to locate the best coupling among for (OT) under a given , whereas the role of the generator is to refine the set of possible couplings so that the infimum in (OT) becomes 0 eventually. ∎
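The primal and dual views of the Wasserstein-1 distance can be compared on a one-dimensional toy example. The sketch below assumes scalar samples, the cost $|x - y|$, and two empirical measures with the same number of equally weighted atoms; the sample values and function names are hypothetical. In one dimension the optimal coupling simply matches sorted samples, so no optimization is needed for the primal side.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two empirical measures with equally many, equally weighted atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(f, xs, ys):
    """Kantorovich-Rubinstein dual objective E[f(X)] - E[f(Y)] for a given f."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

real = [0.0, 1.0, 2.0, 4.0]             # hypothetical "real" samples
fake = [x - 1.5 for x in real]          # hypothetical generator output: a shifted copy

primal = wasserstein1_1d(real, fake)
dual = dual_value(lambda x: x, real, fake)   # f(x) = x is 1-Lipschitz
print(primal, dual)
```

Here the 1-Lipschitz function $f(x) = x$ attains the primal value, so it plays the role of an optimal discriminator for this pair of samples. Any other 1-Lipschitz function, such as $f(x) = |x|$, gives a dual value no larger than the primal one.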

We remark that in [salimans2018improving], the connection between GANs and OT is discussed from a different angle. Instead of using a fixed cost function , a class of cost functions parametrized by is considered. For the first sub-problem, the primal form of the optimal transport problem is solved using the Sinkhorn algorithm proposed in [cuturi2013sinkhorn]. The parametrized cost function now plays the role of discriminator. In [lei2017geometric], a connection between GANs and optimal transport problem is established from a geometric point of view.

6 Experiments

We now assess the quality of the proposed Algorithm 1, with a class of ergodic MFGs. Their explicit solution structures facilitate numerical comparison.

6.1 Ergodic MFGs with explicit solution

Take (MFG) and consider the following long-run average cost,

(7)

then the PDE system (HJB)–(FP) becomes

(8)

Here, the periodic value function , the periodic density function , and the unknown can be solved explicitly for

with

Indeed, assuming the existence of a smooth solution , in the second equation in (8) can be written as

(9)

Hence the solution to (8) is given by

and

6.2 Result and analysis

Algorithm 1 is tested with the above MFGs in Section 6.1, for both one-dimensional and multidimensional cases.

Implementation.

As illustrated in Table 1, except for (FP), there is little information about the density function $m$. Therefore, its NN approximation is assumed to be a maximum entropy probability distribution, i.e., of the form $m_\omega(x) = \exp(f_\omega(x)) / \int \exp(f_\omega(y))\,dy$; see, for instance, [Finn2016]. Furthermore, the Deep Galerkin Method (DGM) network architecture, proposed in [sirignano2018dgm], is adopted to implement both neural networks.
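A maximum entropy (Gibbs-type) parametrization can be sketched as follows, assuming the density takes the form $\exp(f) / \int \exp(f)$ on the one-dimensional periodic domain $[0, 1)$. The function `f_omega` below is a hypothetical stand-in for a network output, and the integral is approximated by a Riemann sum on a uniform grid; neither choice comes from the paper.

```python
import math

def normalized_density(f, n_grid=1000):
    """Return grid points and m(x) = exp(f(x)) / integral of exp(f) over [0, 1).

    The integral is approximated by a Riemann sum on a uniform periodic grid.
    """
    xs = [i / n_grid for i in range(n_grid)]
    weights = [math.exp(f(x)) for x in xs]
    z = sum(weights) / n_grid            # approximate normalizing constant
    return xs, [w / z for w in weights]

# Hypothetical stand-in for the network output f_omega
f_omega = lambda x: math.sin(2 * math.pi * x)

xs, m = normalized_density(f_omega)
total_mass = sum(m) / len(m)             # Riemann sum of m over [0, 1)
print(total_mass, min(m))
```

With this parametrization the output is positive and integrates to one by construction, whatever the unconstrained function `f_omega` is, so neither property has to be enforced through a penalty term.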

Adaptation.

Since the MFG in Section 6.1 is of an ergodic type with a specified periodicity, Algorithm 1 is adapted accordingly.

  • To accommodate the periodicity given by the domain flat torus , for any data point , use

    as input.

  • An additional trainable variable is introduced in the graphical model.

  • The loss functions

    and are modified according to the first and second equation of (8). The generator penalty becomes

    Due to the structure , the discriminator penalty is no longer needed, i.e., .

  • We train the generator first in each inner loop with more SGD steps and a larger learning rate compared with the discriminator. (This is opposite to the typical GANs training.)
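One common way to make a network respect periodicity is to feed it sine and cosine features of each coordinate instead of the raw coordinate. The sketch below assumes a unit period and this particular sin/cos embedding; both are illustrative assumptions, and the function name `periodic_features` is hypothetical.

```python
import math

def periodic_features(x, period=1.0):
    """Map each coordinate of x to (sin, cos) so the result is periodic in x."""
    feats = []
    for xi in x:
        angle = 2.0 * math.pi * xi / period
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats

a = periodic_features([0.2, 0.7])
b = periodic_features([1.2, -0.3])   # same point on the torus, shifted by whole periods
print(max(abs(u - v) for u, v in zip(a, b)))
```

Any function of these features takes the same value at points that differ by whole periods, so periodic boundary conditions hold without an additional penalty.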

6.2.1 One-dimensional case

We first conduct a numerical experiment with one-dimensional input.

(a) Value function $u$.
(b) Density function $m$.
(c) Relative error of $u$.
(d) Relative error of $m$.
(e) HJB residual loss.
(f) FP residual loss.
Figure 1: One-dimensional test case.

The DGM network for both and contains 1 hidden layer with

nodes. The activation function for

is hyperbolic tangent function and that of

is sigmoid function. For the inner loops of generator and discriminator training, the minibatch size is

. As mentioned in the adaptation, the number of SGD steps for the generator is with initial learning rate , whereas the number of SGD steps for the discriminator is with initial learning rate . The weight for the generator penalty is . The number of outer loops is . Adam optimizer is used for the updates.

The result is summarized in Figure 1. Figures 1(a) and 1(b) show the learnt functions $u$ and $m$ against the true ones, respectively. Both learnt functions closely match the true ones. This is supported by the plots in Figures 1(c) and 1(d), depicting the evolution of relative error as the number of outer iterations grows to . The relative error of a function $f$ against another function $g$, with $f$ and $g$ not identically 0, is given by

Within iterations, while the relative error of oscillates around , relative errors of decreases below .

To facilitate comparisons for broader classes of MFGs whose analytical solutions are not necessarily available, additional loss functions are adopted. Here we take the differential residuals of both the HJB and the FP equations as measures of performance. The evolution of the HJB and FP differential residual losses is shown in Figures 1(e) and 1(f), respectively. In these figures, the solid line is the average loss over 3 experiments, with the standard deviation captured by the shaded band around the line. Both differential residuals first rapidly descend to the magnitude of and then the descent slows down, accompanied by oscillation.

One may notice the difference between the training results of and . One reason is that and are implemented using different neural networks. The other is that different loss functions are adopted for training and .

Ablation study.

To understand possible contribution factors for the oscillation in the loss, especially for , an ablation study on the learning rate of the generator is conducted. In our test, the initial learning rate for the Adam Optimizer takes the values of , , and , respectively.

(a) Relative error of .
(b) Relative error of .
Figure 2: Impact of generator learning rate on relative error.

From Figures 1(c) and 1(d), the relative error on oscillates more than that of . A similar phenomenon is observed in Figure 2. In particular, from Figure 2(a), a drastic decrease in oscillation can be seen as the generator learning rate decreases.

(a) .
(b) .
Figure 3: HJB residual loss under different generator learning rate.
(a) .
(b) .
Figure 4: FP residual loss under different generator learning rate.

Turning to the differential residual losses, we observe from Figures 3 and 4 that, if decreasing from to , the residual losses for both HJB and FP decrease to a lower level with less oscillation.

Another parameter of interest is the number of samples in each minibatch, i.e., and in Algorithm 1. Setting , the cases of , , and are tested.

(a) Relative error of .
(b) Relative error of .
Figure 5: Impact of minibatch size on relative errors.

Figure 5(a) shows that the relative error of oscillates less as and increase from to . Moreover, comparing the case of and , the residual losses for both HJB and FP decrease to a lower level with less oscillation as the minibatch size increases, as shown in Figures 6 and 7.

(a) .
(b) .
Figure 6: HJB residual loss under different minibatch size.
(a) .
(b) .
Figure 7: FP residual loss under different minibatch size.

6.2.2 Multi-dimensional case

We also test with input of dimension $4$, and the relative errors are shown in Figure 8.

(a) Relative error of .
(b) Relative error of .
Figure 8: Input of dimension 4.

The number of outer loops is increased to , with generator learning rate of . Within iterations, the relative error of decreases below and that of decreases to .

Notice that a similar experiment for dimension has been conducted in [CarmonaLauriere_DL]; see Test Case 4. In comparison, their algorithms need a significantly larger number of iterations: of iterations vs our to achieve the same level of accuracy.

References