1 Introduction
Most problems in machine learning are formulated as an optimization problem over a single objective. However, a number of problems in machine learning lack a single unified cost, and instead consist of a hybrid of several models, each of which passes information to other models but tries to minimize its own private loss function. This upsets many of the assumptions behind most learning algorithms, and applying ordinary methods like gradient descent often leads to pathological behavior such as oscillations or collapse onto degenerate solutions. Despite these practical difficulties, there is great potential in models with hybrid or multilevel losses, and it has been hypothesized that the combination of many different local losses underlies the functioning of the brain as well
[marblestone2016towards].Actorcritic methods (AC) [sutton1999policy, konda2003onactor] and generative adversarial networks (GANs) [goodfellow2014generative] are two such classes of multilevel optimization problems which have close parallels. In both cases the information flow is a simple feedforward pass from one model which either takes an action (AC) or generates a sample (GANs) to a second model which evaluates the output of the first model. In both cases, the second model is the only one which has direct access to special information in the environment, either reward information (AC) or real samples from the distribution in question (GANs), and the first model must learn based on error signals from the second model alone. GANs and AC methods have important differences as well, and Section 2 we review both classes of algorithms and show a construction that bridges the two. Both of these models suffer from stability issues, and techniques for stabilizing training have been developed largely independently by the two communities. In Section 3 we review the different methods used to address model pathologies. We also provide a review of other related methods in the supplemental materials. The aim of this note is to raise awareness of the strong connection between the two classes of models.
2 Algorithms
Both GANs and AC can be seen as bilevel or twotimescale optimization problems, where one model is optimized with respect to the optimum of another model:
(1)  
(2) 
Bilevel optimization problems have been extensively studied in operations research, including the study of actorcritic methods [konda2002actor], but largely in the setting where the two optimization problems are linear or convex programs [colson2007overview]
. By contrast, the recent applications we review here have largely used deep neural networks as the class of functions over which to optimize.
2.1 Generative Adversarial Networks
Generative adversarial networks [goodfellow2014generative] formulate the unsupervised learning problem as a game between two opponents  a generator which samples from a distribution, and a discriminator
which classifies the samples as real or false. Typically the generator is represented as a deterministic feedforward neural network through which a fixed noise source
is passed and the discriminator is another neural network which maps an image to a binary classification probability. The GAN game is then formulated as a zero sum game where the value is the crossentropy loss between the discriminator’s prediction and the true identity of the image as real or generated, which we denote
:(3) 
To make sure the generator has gradients from which to learn even when the discriminator’s classification accuracy is high, the generator’s loss function is usually formulated as maximizing the probability of classifying a sample as true rather than minimizing its probability of being classified false. The modified loss is still easily formulated as a bilevel optimization problem:
(4)  
(5) 
2.2 ActorCritic Methods
Actorcritic methods are a longestablished class of techniques in reinforcement learning [sutton1999policy]
. While most reinforcement learning algorithms either focus on learning a value function, like value iteration and TDlearning, or learning a policy directly, as in policy gradient methods, AC methods learn both simultaneously  the actor being the policy and the critic being the value function. In some AC methods, the critic provides a lowervariance baseline for policy gradient methods than estimating the value from returns. In this case even a bad estimate of the value function can be useful, as the policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is updated with respect to the approximate value function, in which case pathologies similar to those in GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead to a bad policy which never fully explores the space, preventing a good value function from being found and leading to degenerate solutions. A number of techniques exist to remedy this problem.
Formally, consider the typical MDP setting for RL, where we have a set of states , actions , a distribution over initial states , transition function , reward distribution and discount factor . The aim of actorcritic methods is to simultaneously learn an actionvalue function that predicts the expected discounted reward:
(6) 
and learn a policy that is optimal for that value function:
(7) 
We can express as the solution to a minimization problem:
(8) 
Where is any divergence that is positive except when the two are equal. Now the actorcritic problem can be expressed as a bilevel optimization problem as well:
(9)  
(10) 
There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the actionvalue function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [silver2014determinisitc, lillicrap2015continuous] and its extension to stochastic policies, SVG(0) [heess2015learning], as well as neurallyfitted Qlearning with continuous actions (NFQCA) [hafner2011reinforcement]. These algorithms are all intended for the case where actions and observations are continuous, and use neural networks for function approximation for both the actionvalue function and policy. This is an established approach in RL with continuous actions [prokhorov1997adaptive], and all methods update the policy by passing back gradients of the estimated value with respect to the actions rather than passing the TD error directly. The distinction between the methods lies mainly in the way training proceeds. In NFQCA, the actor and critic are trained in batch mode after every episode, while in DPG and SVG(0), networks are trained online using temporal difference updates.
2.3 GANs as a kind of ActorCritic
The similarities between GANs and AC methods are summarized in Figure 1. In both situations, one model has access to information about errors from the environment (the discriminator in GANs and the critic in AC), while the other model must be updated based only on gradient information from the first model.
We can make this connection more precise and describe an MDP in which GANs are a modified type of actorcritic method. Consider an MDP where the actions set every pixel in an image. The environment randomly chooses either to show the image the actor generates or show a real image. Let the reward from the environment be 1 if the environment chose the real image and 0 if not. This MDP is stateless as the image generated by the actor does not affect future data.
An actorcritic architecture learning in this environment clearly closely resembles the GAN game. A few adjustments have to be made to make it identical. If the actor had access to the state it could trivially pass a real image forward, so the actor must be a blind actor, with no knowledge of the state. Since the MDP is stateless this doesn’t preclude the actor from learning. While the meansquared Bellman residual is usually used as a loss for the critic, crossentropy should be used instead to match the GAN loss. This still gives a loss function with a consistent value function as the minimum. Since the actor receives gradients of the value rather than gradients of the Bellman residual in RL, a scaling term proportional to , where is the crossentropy, should be included in the gradients passed to the actor (though in practice, a different loss is used for the generator precisely to deal with complications from this scaling term). Lastly, the actor’s parameters should not be updated if the environment shows a real image. This could be done by having the critic zero its gradients with respect to the action if the reward is 1. Thus GANs can be seen as a modified actorcritic method with blind actors in stateless MDPs.
There are several peculiar aspects of this MDP that give rise to some behaviors not typically associated with actorcritic methods. First, it is not immediately obvious why an actorcritic algorithm should lead to adversarial behavior. Typically the actor and critic are trying to optimize complimentary loss functions, rather than optimize the same loss in different directions. The adversarial behavior in GANs comes about because the MDP in which the GAN game is played is one in which the actor cannot have any causal effect on the reward. Essentially, it is an MDP in which the true policy gradient is always zero. A critic, however, cannot learn the causal structure of the game from input examples alone, and moves in the direction of features that predict reward. The actor moves in a direction to increase reward based on the best estimate from the critic, but this change cannot lead to an increase in the true reward, so the critic will quickly learn to assign lower value in the direction the actor has moved. Thus the updates to the actor and critic, which ideally would be orthogonal (as in compatible actorcritic) instead becomes adversarial. It is also worth noting that there are important consequences to the environment being only partially observable to the actor. For fully observable MDPs, the optimal policy is always deterministic. For GANs, however, the generator matching the true distribution is a fixed point of the minimax problem. Despite these differences between GANs and typical RL problems, we believe there are enough similarities to merit investigation into techniques to improve training that generalize between the two settings.
3 Stabilizing Strategies
Method  GANs  AC 

Freezing learning  yes  yes 
Label smoothing  yes  no 
Historical averaging  yes  no 
Minibatch discrimination  yes  no 
Batch normalization  yes  yes 
Target networks  n/a  yes 
Replay buffers  no  yes 
Entropy regularization  no  yes 
Compatibility  no  yes 
Having reviewed the basics of GANs, actorcritic algorithms and their extensions, here we discuss the “tricks of the trade" used by each community to make them work. The different methods are summarized in Table 1. While not meant as an exhaustive list, we have included those that we believe have either made the largest impact in their fields or have the greatest potential for crossover between fields.
3.1 GANs

Freezing learning. Frequently either the generator or discriminator will outpace the other and the GAN will get stuck in a degenerate solution. One simple remedy is to freeze learning in one model when it begins to get too strong. A very similar approach has been taken successfully in actorcritic learning, where learning is frozen for either the actor or the critic if the magnitude of the TD error goes below or above a certain threshold [degrispersonal].

Label smoothing. A simple trick to prevent gradients from vanishing when the discriminator’s predictions are very confident, label smoothing replaces 0/1 labels with /, which guarantees the generator will always have informative gradients. While specific to classification, there’s no reason this couldn’t equally well be applied in an RL setting where the rewards are 0/1 and gradients of the critic vanish.

Historical averaging [salimans2016improved]
. Loosely inspired by fictitious play in game theory, historical averaging adds a drag term to gradient descent that penalizes steps that are too far from the past average over parameters. Even though the average over parameters doesn’t have an intuitive meaning, this method is effective at preventing oscillations due to two models optimizing different objectives. Historical averaging is also closely related to PolyakRuppert averaging and extensions
[polyak1990new, ruppert1988efficient]. While PolyakRuppert style averaging has been formally analyzed by those in the RL community [konda2004convergence], it has not been adopted as a standard “tool of the trade" to the best of our knowledge. The use of replay buffers in DPG [lillicrap2015continuous] is also conceptually similar to fictitious play, but only applicable to the actor. 
Minibatch discrimination [salimans2016improved]. To prevent the generator from collapsing onto a single sample, minibatch discrimination extends the role of the discriminator from classifying single images to classifying entire minibatches. This helps increase the entropy of samples from the generator. In RL, the analogous problem of underexploration has been addressed by adding a penalty to the actor that encourages higher entropy policies. Computing the entropy directly becomes impractical for complex stochastic policies, and an RL equivalent to minibatch discrimination could be an effective alternative for encouraging exploration in continuous spaces.

Batch normalization [radford2016unsupervised, salimans2016improved]. Batch normalization has been critical for scaling GANs to deep convolutional networks, and virtual batch normalization has extended this by using a constant reference batch to prevent correlations in predictions due to the other elements of the minibatch. Batch normalization has also been helpful in a large fraction of environments DPG has been tried for, and it has not been investigated whether virtual batch normalization leads to even greater improvement.
3.2 ActorCritic

Replay buffers [lillicrap2015continuous]. Replay buffers have been very effective at removing correlations from the training data in both discrete and continuous RL settings. They are also conceptually similar to fictitious play in game theory, where one player plays the best response to the average over policies of the other player. Unlike fictitious play, where both players play against the historical average opponent, in the AC setting replay buffers can only be used for the critic. The critic can make offpolicy updates based on actions chosen by the actor in the past, but the actor cannot learn from gradients from the critic in the past, since those gradients were taken with respect to different actions. Similarly for GANs, we can keep a buffer of previously generated images to prevent the discriminator from fitting too closely to the current generator. We have experimented with using a replay buffer for GANs, but were not able to generate asymptotically correct samples even for simple distributions.

Target networks [lillicrap2015continuous]. Since the action value function appears twice in the Bellman recursion, stability can be an issue in Qlearning with function approximation. Target networks address this by fixing one of the networks in TD updates. Since the GAN game can be seen as a stateless MDP, the second appearance of the action value function disappears and learning the discriminator becomes an ordinary regression problem. For this reason we don’t consider target networks applicable to the GAN setting. However other multilevel optimization problems with Qlearning as a subproblem would likely benefit. It’s also worth noting that selfcontrastive estimation [goodfellow2014distinguishability] is conceptually similar to target networks, so the idea of a target network can still be applicable to other density estimation problems.

Entropy regularization. Actorcritic methods often fail to sufficiently explore the action space. To address this, an additional reward is sometimes added that encourages highentropy policies. When the action space is discrete this is easily accomplished, but is somewhat more difficult in the continuous control setting. Notably, an analogous problem is often encountered with GANs, where the generator collapses onto sampling only a few modes. Any method for encouraging exploration in continuous control is likely transferable to increasing sample diversity in GANs.

Compatibility. One of the unique theoretical developments of actorcritic methods is the notion of a compatible critic. If the critic is a linear function of the gradient of the policy with respect to its parameters, then using this in place of the true value in the gradient of the expected return gives an unbiased approximation if the critic is optimal [sutton1999policy]. This compatible critic is also closely related to the natural gradient of the expected return, and can be used for efficient natural gradient descent on the policy [kakade2001natural]. While elegant, it is not clear if the notion of compatibility can be naturally extended to the GAN setting. Since the true value of any policy will always be 0.5 in the GAN MDP, the true policy gradient is always zero. We would generally prefer our GANs to be adversarial than compatible.
4 Conclusions
Combining deep learning with multilevel optimization holds great promise for a diverse array of problems in machine learning and AI. Already GANs and actorcritic methods have made large impacts on their respective fields, despite the inherent difficulties in optimization and exploration. We hope that by pointing out the deep connections between the two we encourage the development and adoption of general techniques and free flow of ideas between different communities.
Acknowledgments
Thanks to David Silver, Thomas Degris, Marc Lanctot, Nicholas Heess, Ian Goodfellow, Scott Reed, Josh Merel, Razvan Pascanu, Greg Wayne, Shakir Mohamed, Balaji Lakshminarayanan, Chelsea Finn and Sergey Levine for helpful discussions and to Yuval Tassa for help with the figures.
References
Supplemental Material
Beyond the basic GAN and AC methods described in Section 2, there have been many extensions and other problems in machine learning which have even greater complexity. We review some of them here.
A GAN extensions
There have been many extensions to GANs since the original publication. In [nowozin2016f] it was shown that GANs can be interpreted as minimizing a lower bound on the fdivergence between the generator distribution and true distribution:
(11) 
Where is the distribution induced by the generator network . For the appropriate choice of convex the original GAN loss is recovered as a bound on the JensenShannon divergence. Other choices of lead to lower bounds on other divergences such as the KL divergence or squared Hellinger distance. GANs were also reformulated in [zhao2016energy] by replacing the discriminator with an energybased autoencoder trained with a contrastive loss. Both fGANs and energybased GANs preserve the information and gradient flow from the original GAN formulation but change the loss function.
GANs lack a natural way to infer the latent state for new data, leading to a number of extensions that include an inference network in addition to the generator and discriminator. A hybrid variational autoencoder [kingma2013auto, rezende2014stochastic] and GAN architecture was trained in [larsen2015autoencoding], while [dumoulin2016adversarially] and [donahue2016adversarial] both proposed an extension to GANs which trained the discriminator to classify the latent state of true data induced by an inference network in addition to the generated images. The structure of these models is illustrated in Figures 1(a) and 1(b) respectively. These extensions all added a third model to the mix, complicating optimization even further.
Further extensions include adversarial autoencoders [makhzani2015adversarial] which adversarially learn the structure of the approximate posterior in variational autoencoders (shown in Figure 1(c)), and InfoGANs [chen2016infogan], which augment the latent state in the generative model with another set of units trained to maximize the mutual information with the data to learn the most important factors of variation in the model.
B AC extensions
DPG, SVG(0) and NFQCA are not the only AC methods that work with continuous actions. Asynchronous advantage actorcritic (A3C) has shown promise for RL with both discrete and continuous actions [mnih2016asynchronous]. A3C learns only a state value function rather than an action value function and thus cannot pass back gradients of the value with respect to the action to the actor. Instead it approximates the action value with the rewards from several steps of experience and passes the TD error to the actor. It is therefore less closely connected to GANs, however it is worth pointing out that it has been a successful alternative to DPG and NFQCA as an actorcritic algorithm for continuous control.
Stochastic value gradients (SVG) [heess2015learning] generalize DPG to stochastic policies in a number of ways, giving a spectrum from modelbased to modelfree algorithms. While SVG(0) is a direct stochastic generalization of DPG, SVG(1) combines an actor, critic and model . The actor is trained through a combination of gradients from the critic, model and reward simultaneously.
C Imitation Learning and Inverse Reinforcement Learning
Another machine learning problem closely related to GANs and actorcritic methods is that of learning control policies from expert trajectories, variously referred to as apprenticeship learning [abbeel2005exploration], imitation learning or inverse reinforcement learning [ng2000algorithms] depending on the exact approach taken.
It has long been known that imitation learning can be formulated as a minimax problem where the aim is to learn the cost function that minimizes the amount by which the optimal policy outperforms the expert policy [syed2007game]. In [ho2016generative], a deep connection was shown between imitation learning and GANs. Learning a cost function was shown to be equivalent to minimizing the regularized distance between stateaction occupancy distributions for the expert policy and learned policy. By using a deep neural network with the crossentropy loss as the distance between occupancy distributions, learning a cost directly could be circumvented and the inverse reinforcement learning problem was cast in nearly the same form as GAN learning. The only major difference from GANs is that instead of learning to generate images by gradient descent a policy is learned by trust region policy optimization [schulman2015trust]. In principle an actorcritic method could be substituted for direct policy optimization, and the imitation learning problem would become a threelevel optimization problem that contains both GANs and actorcritic as subproblems.
It has also been shown in [finn2016connection] that generative adversarial networks have a close connection with maximum entropy inverse reinforcement learning. In particular, when the density function of the generative model is known, the GAN objective is identical to the MaxEnt IRL objective, and GAN training is identical to guided cost learning, which approximates the partition function in MaxEnt IRL using a learned importance sampling estimator. While in this paper we have shown a connection between actorcritic learning in a particular MDP and all GANs, their work shows that a particular extension to GANs is the same as guided cost learning in all cases.