1 Introduction
Reinforcement learning (RL) has recently had great success in solving complex tasks with continuous control [21, 22]. However, as these methods often yield high-variance results when dealing with unstable environments, distributional perspectives on the state-value function in RL have begun to gain popularity [2]. Note that the distributional perspective is distinct from Bayesian approaches to the RL problem, as the former models the inherent variability of the returns from a state and not the agent's confidence in its prediction of the average return.
Up to now, deep learning methods in RL have used multiple function approximators (typically a network with shared hidden layers) to fit a state value or state-action value distribution. For instance, [16] used heads on the state-action value function for every available action and used them to model a distribution. In [6], a Bayesian framework was applied to the actor-critic architecture by fitting a Gaussian Process (GP) instead of the critic, hence allowing for a closed-form derivation of update rules. More recently, [2] introduced a distributional algorithm, C51, which aims to solve the RL problem by learning a categorical probability vector over returns. Unlike [10], which uses a generative network to learn the underlying transition model of the environment, we utilize a generative network to model the distribution approximation of the Bellman updates.
In this work, we build on top of the aforementioned distributional RL methods and introduce a novel way to learn the state-value distribution. Inspired by the analogy between the actor-critic architecture and generative adversarial networks (GANs) [17], we leverage the latter in order to implicitly represent the distribution of the Bellman target update through a generator/discriminator architecture. We show that, although sometimes volatile, our proposed algorithm is a viable alternative to the now classical deep Q-networks (DQN). We aim to provide a unifying view on distributional RL through the minimization of the Earth-Mover distance and without explicitly using the distributional Bellman projection operator on the support of the values.
2 Related Work
2.1 Background
Multiple tasks in machine learning require finding an optimal behaviour in a given setting, i.e. solving the reinforcement learning problem. We proceed to formulate the task as follows.
Let $(\mathcal{S}, \mathcal{A}, R, P, \gamma, \mathcal{S}_0)$ be a tuple representing a Markov decision process (MDP), where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of allowed actions, $R$ the (deterministic or stochastic) reward function, $P$ the environment transition probabilities, $\gamma \in [0, 1)$ the discount factor and $\mathcal{S}_0$ the set of initial states. At a given time step $t$, an agent acts according to a policy $\pi(a \mid s)$. The environment is characterized by its set of initial states $\mathcal{S}_0$, in which the agent starts, as well as the transition model $P$, which encodes the mechanics of moving from one state to another. In order to compare states, we introduce the state value function $V^\pi(s)$, which gives the expected sum of discounted rewards from that state. That is, $V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s \right]$.
The reinforcement learning problem is twofold: (1) given a fixed policy $\pi$, we would like to obtain the correct state value function $V^\pi$, and (2) we wish to find the optimal policy $\pi^*$ which yields the highest $V^\pi(s)$ for all states of the MDP or, equivalently, $\pi^* = \arg\max_\pi V^\pi$. The first task is known in the reinforcement learning literature as prediction and the second as control.
In order to find the value function for each state, we need to solve the Bellman equations [3]:
(1) $V^\pi(s) = \mathbb{E}\left[ R(s, a) + \gamma V^\pi(s') \right]$
for $a \sim \pi(\cdot \mid s)$ and $s' \sim P(\cdot \mid s, a)$. If we define a state-action value function as $Q^\pi(s, a) = \mathbb{E}\left[ R(s, a) + \gamma V^\pi(s') \right]$, then we can rewrite Eq. 1 as:
(2) $Q^\pi(s, a) = \mathbb{E}\left[ R(s, a) + \gamma Q^\pi(s', a') \right]$
for $a' \sim \pi(\cdot \mid s')$.
Both Eq. 1 and Eq. 2 admit a direct solution obtained by matrix inversion: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$, where $P^\pi$ and $R^\pi$ denote the transition matrix and the vector of mean rewards under $\pi$. However, this solution is only well-defined for finite-state MDPs and, as the computation has complexity $O(|\mathcal{S}|^{2.376})$ [5], it is only computationally feasible for small MDPs. Therefore, sample-based algorithms such as Temporal Difference (TD) [23] for prediction and SARSA or Q-learning [18, 25] for control are preferred for more general classes of environments.
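As a minimal sketch of the direct solution, the snippet below evaluates a fixed policy on a hypothetical 2-state chain by solving $(I - \gamma P^\pi) V = R^\pi$ and cross-checks the result against repeated Bellman backups. The transition matrix, rewards and discount factor are illustrative assumptions, not values taken from the experiments.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P, R, gamma):
    """Return V solving (I - gamma * P) V = R for a fixed policy."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R)


# Hypothetical 2-state chain under a fixed policy (illustrative values).
P = [[0.9, 0.1],
     [0.0, 1.0]]   # state 1 is absorbing
R = [0.0, 1.0]     # mean one-step reward in each state
gamma = 0.5

V = evaluate_policy(P, R, gamma)

# Cross-check with repeated Bellman backups V <- R + gamma * P V.
V_iter = [0.0, 0.0]
for _ in range(200):
    V_iter = [R[i] + gamma * sum(P[i][j] * V_iter[j] for j in range(2))
              for i in range(2)]
```

The linear solve scales poorly with the number of states, which is the motivation given above for sample-based methods.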
2.2 Distributional Reinforcement Learning
In the setting of distributional reinforcement learning [2], we seek to learn the distribution of returns from a state, rather than the mean of the returns. We translate the Bellman operator on points to an operator on distributions. The vector of mean rewards is therefore replaced by the reward distribution $R(s, a)$, treated as a random variable. We can thus represent $Q^\pi$ as $Z^\pi$, a random variable whose law is that of the returns following a state. The expected and distributional quantities are linked through the relation $Q^\pi(s, a) = \mathbb{E}\left[ Z^\pi(s, a) \right]$. We finally arrive at the distributional Bellman equation:
(3) $\mathcal{T}^\pi Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a')$, with $s' \sim P(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$,
where $\mathcal{T}^\pi$ denotes an analogue of the well-known Bellman operator, now defined over distributions.
Eq. 3 is the distributional counterpart of Eq. 2, where equality holds in distribution for sequences of random variables.
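To make the link between the distributional and expected quantities concrete, here is a small sample-based sketch of one distributional backup. The reward distribution and the successor return samples are hypothetical; the point is only that the mean of the backed-up samples matches the expected Bellman backup.

```python
import random


def distributional_backup(reward_sampler, next_return_samples, gamma, rng):
    """Draw samples of R(s, a) + gamma * Z(s', a'), one per successor sample."""
    return [reward_sampler(rng) + gamma * z for z in next_return_samples]


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(0)
gamma = 0.9

# Hypothetical return samples for the successor pair (s', a').
z_next = [rng.gauss(1.0, 0.5) for _ in range(20000)]


def reward_sampler(r):
    """Hypothetical stochastic reward: +1 or -1 with equal probability (mean 0)."""
    return r.choice([-1.0, 1.0])


z_target = distributional_backup(reward_sampler, z_next, gamma, rng)

q_from_z = mean(z_target)                 # Monte Carlo estimate of E[Z(s, a)]
q_expected = 0.0 + gamma * mean(z_next)   # E[R] + gamma * E[Z(s', a')]
```

The two estimates agree up to sampling noise, while `z_target` additionally retains the spread of the returns that the scalar backup discards.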
In traditional reinforcement learning algorithms, we use experience in the MDP to improve an estimated state value function in order to minimize the expected distance between the value function’s output for a state and the actual returns from that state. In distributional reinforcement learning, we still aim to minimize the distance between our output and the true distribution, but now we have more freedom in how we choose our definition of "distance", as there are many metrics on probability distributions which have subtly different properties.
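The $p$-Wasserstein metric between two real-valued random variables $U$ and $V$ with cumulative distribution functions $F_U$ and $F_V$ is given by
(4) $d_p(U, V) = \left( \int_0^1 \left| F_U^{-1}(\omega) - F_V^{-1}(\omega) \right|^p d\omega \right)^{1/p}$
For $p = 1$ we recover the widely known Earth-Mover distance. More generally, for any $p \le q$, $d_p \le d_q$ holds [7] and is useful in contraction arguments. The maximal Wasserstein metric, $\bar{d}_p(Z_1, Z_2) = \sup_{s, a} d_p(Z_1(s, a), Z_2(s, a))$, is defined over state-action tuples for any two value distributions $Z_1, Z_2$.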
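For two empirical distributions with the same number of equally weighted samples, the inverse CDFs in Eq. 4 are step functions on a common grid, so the integral reduces to an average over sorted samples. The sketch below uses this special case; the function name and sample values are illustrative.

```python
def wasserstein_p(u_samples, v_samples, p=1.0):
    """p-Wasserstein distance between two equally sized empirical distributions."""
    if len(u_samples) != len(v_samples):
        raise ValueError("this sketch assumes equally sized sample sets")
    u = sorted(u_samples)   # sorted samples play the role of the quantile function
    v = sorted(v_samples)
    n = len(u)
    return (sum(abs(a - b) ** p for a, b in zip(u, v)) / n) ** (1.0 / p)


u = [0.0, 1.0, 3.0]
v = [5.0, 1.0, 2.0]

d1 = wasserstein_p(u, v, p=1)   # Earth-Mover distance
d2 = wasserstein_p(u, v, p=2)

# Shifting every sample by a constant moves the distribution by that constant.
shifted = wasserstein_p(u, [x + 2.0 for x in u], p=1)
```

Here `d1` does not exceed `d2`, in line with the ordering $d_p \le d_q$ for $p \le q$.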
It has been shown that while $\mathcal{T}^\pi$ is a contraction under the maximal Wasserstein metric $\bar{d}_p$ [19], the Bellman optimality operator $\mathcal{T}$ is not necessarily a contraction. This result implies that the control setting requires a treatment different from prediction, which is done in [2] through the C51 algorithm in order to guarantee proper convergence.
2.3 Generative Models
Generative models such as hidden Markov models (HMMs), restricted Boltzmann machines (RBMs) [20], variational autoencoders (VAEs) [11] and generative adversarial networks (GANs) [8] learn the distribution of data for all classes. Moreover, they provide a mechanism for sampling new observations from the learned distribution.
In this work, we make use of generative adversarial networks, which consist of two neural networks playing a zero-sum game against each other. The generator network $G$ is a mapping from a high-dimensional noise space onto the input space on which a target distribution is defined. The generator's task consists in fitting the underlying distribution of observed data as closely as possible. The discriminator network $D$ scores each input as the probability of coming from the real data distribution or from the generator $G$. Both networks are gradually improved through alternating or simultaneous gradient descent updates.
The classical GAN algorithm minimizes the Jensen-Shannon (JS) divergence between the real and generated distributions. Recently, [1] suggested replacing the JS divergence with the Wasserstein-1 or Earth-Mover distance. We make use of an improved version of this algorithm, Wasserstein GAN with Gradient Penalty (WGAN-GP) [9]. Its objective is given below:
(5) $L = \mathbb{E}_{z}\left[ D(G(z)) \right] - \mathbb{E}_{x}\left[ D(x) \right] + \lambda \, \mathbb{E}_{\hat{x}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]$
where $x$ is a sample from the real distribution, $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$, $\epsilon \sim U(0, 1)$, and $z$ is drawn from the noise distribution. Setting $\lambda = 0$ recovers the original WGAN objective.
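The effect of the penalty term can be seen on a deliberately tiny case. The sketch below assumes a 1-D linear critic $D(x) = wx + b$, which is not the architecture used in this work: for such a critic the input gradient equals $w$ at every interpolate, so the penalty collapses to $\lambda (|w| - 1)^2$ and no interpolates need to be sampled. All sample values are illustrative.

```python
def critic_loss(w, b, real, fake, lam):
    """WGAN-GP critic loss for the linear critic D(x) = w * x + b.

    The gradient of D with respect to its input is w everywhere, so the
    gradient penalty is lam * (|w| - 1)^2 regardless of the interpolates.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    d_real = mean([w * x + b for x in real])
    d_fake = mean([w * x + b for x in fake])
    penalty = lam * (abs(w) - 1.0) ** 2
    return d_fake - d_real + penalty


real = [2.0, 3.0, 4.0]   # samples standing in for the target distribution
fake = [0.0, 1.0, 2.0]   # samples standing in for generator output

# With the penalty, the loss has a finite minimizer near |w| = 1.
loss_at_optimum = critic_loss(1.1, 0.0, real, fake, lam=10.0)
loss_left = critic_loss(1.0, 0.0, real, fake, lam=10.0)
loss_right = critic_loss(1.2, 0.0, real, fake, lam=10.0)

# Without it (lam = 0), the loss keeps decreasing as w grows.
loss_unpenalized = critic_loss(100.0, 0.0, real, fake, lam=0.0)
```

At $w = 1$ the negated loss equals 2, which is the Earth-Mover distance between these two sample sets, since one is the other shifted by 2.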
3 GAN Q-learning
3.1 Motivation
We borrow the two-player game analogy from the original GAN paper [8]: the generator network $G$'s purpose is to produce realistic samples of the optimal state-action value distribution (an estimate of $Z(s, a)$). On the other hand, the discriminator network $D$ aims to distinguish real samples of the Bellman target $\mathcal{T} Z(s, a)$ from the samples output by $G$. The generator network improves its performance through the signal from the discriminator, which is reminiscent of the actor-critic architecture [17].
3.2 Algorithm
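At each timestep, $G$ receives stochastic noise $z$ and a state $s$ as input and returns a sample $G(z, s, a)$ for every action $a$ from the current estimate of $Z(s, a)$. We then select the action $a^* = \arg\max_a G(z, s, a)$. The agent applies the chosen action $a^*$, receives a reward $r$ and transitions to state $s'$. The tuple $(s, a^*, r, s')$ is saved into a replay buffer as done in [15]. Each update step consists of sampling a tuple $(s, a, r, s')$ uniformly from the buffer and updating the generator and discriminator according to Eq. 5.
Values obtained from the Bellman backup operator are considered as coming from the real distribution. The discriminator's goal is to differentiate between the Bellman target and the output produced by $G$. We obtain the following updates for $D$ and $G$, respectively:
(6) $w_D \leftarrow w_D - \alpha \nabla_{w_D} \left[ D(G(z, s, a)) - D(x) + \lambda \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right], \qquad w_G \leftarrow w_G + \alpha \nabla_{w_G} D(G(z, s, a))$
where $w_G, w_D$ are the weights of the generator and discriminator networks with respect to which the gradient updates are taken, $x$ is the Bellman target, $\hat{x}$ is the interpolate between $x$ and $G(z, s, a)$ as in Eq. 5, and $\alpha$ is the learning rate.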
Due to the nature of GANs, we advise training the model in a batch setting using experience replay. To further stabilize the training process, one can also use a second (target) network that is updated periodically, after a fixed number of epochs, as in [15].
Here, the objective function is identical to Eq. 5, where $x$ is taken to be the Bellman target $r + \gamma \max_{a'} G(z, s', a')$.
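The data flow of one interaction and one update can be sketched as follows. `toy_generator` is a hypothetical stand-in for the generator network, with made-up per-action values; reusing the same noise draw for the target and the generated sample is also an assumption of this sketch. Only the action selection, the replay buffer and the construction of the "real" and "generated" samples fed to the discriminator are shown, not the gradient updates.

```python
import random
from collections import deque


def toy_generator(z, state):
    """Hypothetical generator: one return sample per action, given noise z."""
    means = {0: [0.0, 1.0], 1: [2.0, 0.5]}   # illustrative per-action means
    return [m + 0.1 * z for m in means[state]]


def select_action(generator, state, z):
    """Pick the action whose generated return sample is the largest."""
    samples = generator(z, state)
    return max(range(len(samples)), key=lambda a: samples[a])


def bellman_target(generator, reward, next_state, z, gamma):
    """'Real' sample for the discriminator: r + gamma * max_a G(z, s', a)."""
    return reward + gamma * max(generator(z, next_state))


rng = random.Random(0)
gamma = 0.9
buffer = deque(maxlen=1000)   # replay buffer of (s, a, r, s') tuples

# One interaction step: act in state 0, observe reward 1.0 and next state 1.
z = rng.gauss(0.0, 1.0)
action = select_action(toy_generator, 0, z)
buffer.append((0, action, 1.0, 1))

# One update step: sample a transition uniformly and build both samples.
s, a, r, s_next = rng.choice(list(buffer))
z = rng.gauss(0.0, 1.0)
real_sample = bellman_target(toy_generator, r, s_next, z, gamma)
generated_sample = toy_generator(z, s)[a]
```

In a full implementation, `real_sample` and `generated_sample` would be passed to the discriminator and the weights updated with the objective of Eq. 5.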
Using a generative model to represent the state-action value distribution allows for an alternative to explicit exploration strategies. Indeed, at the beginning of the training process, taking the action $a^* = \arg\max_a G(z, s, a)$ is analogous to using a decaying exploration strategy (since $G$ has not yet clearly separated $Z(s, a)$ for every $(s, a)$ pair). Fig. 1 demonstrates how gradually separating the value distributions of different actions acts as implicit exploration by sampling suboptimal actions.
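A rough numerical illustration of this implicit exploration: if the sampled returns of two actions are modelled as unit-variance Gaussians (an assumption made here purely for illustration, with independent noise per action), the probability that the argmax picks the worse action shrinks as the two distributions separate.

```python
import random


def suboptimal_rate(gap, n_draws, rng):
    """Fraction of draws in which the worse action's sample is the larger one."""
    count = 0
    for _ in range(n_draws):
        worse = rng.gauss(0.0, 1.0)    # sample for the action with lower mean
        better = rng.gauss(gap, 1.0)   # sample for the action with higher mean
        if worse > better:
            count += 1
    return count / n_draws


rng = random.Random(0)
# The gap between the means stands in for training progress.
rates = [suboptimal_rate(gap, 5000, rng) for gap in (0.0, 1.0, 3.0)]
```

With no separation, the two actions are chosen about equally often; as the gap grows, the suboptimal action is sampled less and less, much like a decaying exploration rate.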
4 Convergence
It is well-known that Q-learning can exhibit divergent behaviour in the case of nonlinear function approximation [24]. Because nonlinear value function approximation can be viewed as a special case of the GAN framework where the latent random variable is sampled from a degenerate distribution, we can see that the class of problems for which GAN Q-learning can fail to converge contains those for which vanilla Q-learning does not converge.
Further, as explored in [14], we observe that in many popular GAN architectures, convergence to the target distribution is not guaranteed, and oscillatory behaviour can be observed. This is a double blow to the reinforcement learning setting, as we must guarantee both that a stationary distribution exists and that the GAN architecture can converge to this stationary distribution.
We also note that although, in an idealized setting for the Wasserstein GAN, the generator should be able to represent the target distribution and the discriminator should be able to learn any 1-Lipschitz function in order to produce the true Wasserstein distance, this is unlikely to be the case in practice. It is thus possible for the optimal generator-discriminator equilibrium to correspond to a generated distribution that has a different expected value from the target distribution. Consider, for example, a generator which produces a Dirac distribution, and a discriminator which can only compute quadratic functions of its input. The discriminator then attempts to approximate the Wasserstein distance by computing the supremum, over this restricted family, of the difference between its expected output on the target distribution and on the generated distribution.
Suppose we are in a 2-armed bandit setting, where arm A always returns a small constant reward and arm B gives stochastic rewards with a lower mean. Then the optimal generator (constrained to the class of Dirac distributions) will predict a Dirac distribution supported on the constant reward for arm A, and a Dirac distribution supported on a larger value for arm B. Consequently, an agent which has reached an equilibrium will incorrectly estimate arm B as being the optimal arm.
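One concrete instantiation of this failure mode can be checked numerically. The specifics below are assumptions chosen for illustration rather than values taken from the text: the discriminator family is $f(x) = a x^2$ with $|a| \le 1$, arm A pays a constant 0.1, and arm B pays standard normal rewards. For a Dirac generator at $g$, the discriminator's value is $\sup_{|a| \le 1} a \left( \mathbb{E}[x^2] - g^2 \right) = \left| \mathbb{E}[x^2] - g^2 \right|$, which is minimized at $g = \sqrt{\mathbb{E}[x^2]}$ (taking the non-negative root).

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def best_dirac(samples):
    """Dirac location g >= 0 minimizing |E[x^2] - g^2| for the given samples."""
    second_moment = mean([x * x for x in samples])
    return math.sqrt(second_moment)


rng = random.Random(0)
eps = 0.1                                           # assumed constant reward of arm A
arm_a = [eps] * 10000
arm_b = [rng.gauss(0.0, 1.0) for _ in range(10000)]  # assumed rewards of arm B

g_a = best_dirac(arm_a)      # matches arm A's reward exactly
g_b = best_dirac(arm_b)      # close to 1, although arm B's mean is close to 0
true_mean_b = mean(arm_b)
```

Under these assumptions the generator's estimate for arm B exceeds its estimate for arm A, even though arm A has the higher true mean, so a greedy agent would prefer the wrong arm.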
Empirical results reported in the next section demonstrate the ability of our algorithm to solve complex reinforcement learning tasks. However, providing convergence results for nonlinear $G$ and $D$ can be hindered by complex environment dynamics and the unstable nature of GANs. For instance, proving convergence of the generator-discriminator tuple to a saddle point requires an argument similar to [14].
5 Experiments
In order to compare the performance of the distributional approaches to traditional algorithms such as Q-learning, we conducted a series of experiments on tabular and continuous state space environments.
5.1 Environments
We considered the following environments:

10state chain with two goal states (2G Chain) ( is in the middle). Two deterministic actions (left, right) are allowed. A reward of is given when we stay in the goal state for one step, otherwise. The discount factor and the maximum episode length is 50;

Deterministic gridworld (Gridworld) with start and goal states in opposing corners and walls along the perimeter. A reward of 0 is given in the goal state, otherwise. The agent must reach the goal tile in the least number of steps while avoiding being stuck against walls. The discount factor and the maximum episode length is 100;

The simple two state MDP (2 States) presented in Fig.2. The discount factor and the maximum episode length is 25.

OpenAI Gym [4] environments CartPole-v0 and Acrobot-v1.
All experiments were conducted with a similar, one-hidden-layer architecture for GAN Q-learning and DQN. A total of 3 dense layers, of 64 units each for tabular environments and 128 units each for OpenAI environments, as well as nonlinearities were used. Note that a Convolutional Neural Network (CNN) [12] can be used to learn the rewards similarly to [15].
5.2 Results
In this section, we present empirical results obtained by our proposed algorithm. For brevity, we abbreviate the tabular version of distributional Q-learning to dQ-learning, and GAN Q-learning to GAN-DQN. The dQ-learning algorithm performs a mixture update between the predicted and target distribution for each state-action pair it visits, analogous to TD updates.
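The representation used by dQ-learning is not spelled out above, so the following is only one way such a mixture update might look, assuming each return distribution is stored as a dictionary mapping return values to probabilities. The step size `alpha` and all numbers are illustrative.

```python
def shift_scale(dist, reward, gamma):
    """Distribution of reward + gamma * Z when Z is distributed as `dist`."""
    out = {}
    for value, prob in dist.items():
        key = reward + gamma * value
        out[key] = out.get(key, 0.0) + prob
    return out


def mixture_update(pred, target, alpha):
    """Return the mixture (1 - alpha) * pred + alpha * target."""
    out = {value: (1.0 - alpha) * prob for value, prob in pred.items()}
    for value, prob in target.items():
        out[value] = out.get(value, 0.0) + alpha * prob
    return out


def expectation(dist):
    return sum(value * prob for value, prob in dist.items())


pred = {0.0: 1.0}                    # current estimate for the visited (s, a)
next_dist = {0.0: 0.5, 2.0: 0.5}     # estimate for the successor pair (s', a')

target = shift_scale(next_dist, reward=1.0, gamma=0.5)
updated = mixture_update(pred, target, alpha=0.1)

# The mean of the mixture follows the usual TD update on the means.
td_mean = (1.0 - 0.1) * expectation(pred) + 0.1 * (1.0 + 0.5 * expectation(next_dist))
```

The mixture keeps the result a valid probability distribution, and its mean moves exactly as a TD update of the scalar value would.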
All tabular experiments were run on 10 initial seeds and the reported scores are averaged across 300 episodes. Gym environments were tested on 5 initial seeds over 1000 episodes each, with no restrictions on the maximum number of steps.
In general, our GAN Q-learning results are on par with those obtained by tabular baseline algorithms such as Q-learning (see Fig. 3 and Table 1). The high variance in the first few episodes corresponds to the time needed for the generator network to separate real and generated distributions for each action. Note that although the Gridworld environment yields sparse rewards similar to Acrobot, GAN Q-learning eventually finds the optimal path to the end goal. Fig. 4 shows that our method can efficiently use the greedy policy in order to learn the state value function; lower state values are associated with the start state while the higher state values are attributed to both goal states.
Just like DQN, the proposed method demonstrates the ability to learn an optimal policy in both OpenAI environments. For instance, increasing the number of updates for $G$ and $D$ in the CartPole-v0 environment stabilized the algorithm and ensured the proper convergence of $G$ to the Bellman target. Fig. 5 demonstrates that, although sometimes unstable, control with GAN Q-learning has the capacity to learn an optimal policy and outperform DQN. Due to sparse rewards in Acrobot-v1, we had to rely on a target network as in [15] to stabilize the training process. Using GAN Q-learning without a second network in such environments would lead to increased variance in the agent's predictions and is hence discouraged.
Table 1
              Environment
Algorithm     2 State    2G Chain   Gridworld
Q-learning    426.613    0.953      8.059
dQ-learning   427.328    0.950      8.190
GAN-DQN       398.918    0.978      11.720
In addition to the common variance reduction practices mentioned above, we used a learning rate scheduler as a safeguard to reduce instability during the training process. We found that a schedule defined in terms of the timestep $t$, the initial learning rate and a constant varying between environments yielded the best results. Even though our model has an implicit exploration strategy induced by the generator, we used an $\epsilon$-greedy exploration policy like most traditional algorithms. We noticed that introducing $\epsilon$-greedy exploration helps the generator in separating the state value distributions for all actions. Note that for environments with less complex dynamics (e.g. tabular MDPs), our method does not require an explicit exploration strategy and has shown viable performance without it.
Unlike in the original WGAN-GP paper, where a strong Lipschitz constraint is enforced via a gradient penalty coefficient ($\lambda = 10$), we observed empirically that relaxing this property with $\lambda = 0.1$ gave better results.
6 Discussion
We introduced a novel framework based on techniques borrowed from the deep learning literature in order to successfully learn the state-action value distribution via an actor-critic-like architecture. Our experiments indicate that GAN Q-learning can be a viable alternative to classical algorithms such as DQN while having the appealing characteristics of a typical deep learning black-box model. The benefit of parametrizing the return distribution with a neural network is offset by the volatility of our approach in environments with particularly sparse rewards. We believe that a thorough understanding of the nonlinear dynamics of generative nets and of the convergence properties of MDPs is necessary to improve the algorithm. Recent work in the field hints that a saddle-point analysis of the objective function is a valid way to approach such problems [14].
Future work should prioritize the stability of the training procedure. Moreover, using a CNN on top of screen frames in order to encode the state can provide a meaningful approximation to the reward distribution. Our proposed algorithm opens possibilities to integrate the GAN architecture into more complex algorithms such as DDPG [13] and TRPO [21], which can be a potential topic of investigation.
7 Acknowledgments
We would like to thank Marc G. Bellemare from Google Brain for helpful advice throughout this paper.
References
 [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [2] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 [3] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
 [4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

 [5] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 1–6. ACM, 1987.
 [6] Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. Journal of Machine Learning Research, 17(66):1–53, 2016.
 [7] Clark R Givens, Rae Michael Shortt, et al. A class of wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
 [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
 [10] Vincent Huang, Tobias Ley, Martha Vlachou-Konchylaki, and Wenfeng Hu. Enhanced experience replay generation for efficient reinforcement learning. CoRR, abs/1705.08245, 2017.
 [11] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [13] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 [14] Lars Mescheder. On the convergence properties of gan training. arXiv preprint arXiv:1801.04406, 2018.
 [15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
 [16] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016.
 [17] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actorcritic methods. CoRR, abs/1610.01945, 2016.
 [18] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, 1994.
 [19] Ludger Rüschendorf. The wasserstein distance and approximation theorems. Probability Theory and Related Fields, 70(1):117–129, 1985.
 [20] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pages 791–798. ACM, 2007.
 [21] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
 [22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [23] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 [24] John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081, 1997.
 [25] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.