In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces such as in the Go game alpha_go and Atari games original_atari_paper ; human_level_control ; multitask_learning_model_based_rl ; combine_pg_dqn ; actor_mimic_multi_task_transfer_learning_rl ; sample_efficient_actor_critic ; sobolev_training . In all of these success stories, the size of the action space was relatively small. Many Reinforcement Learning (RL) problems, however, involve high-dimensional action spaces as well as high-dimensional state spaces. Examples include StarCraft deepmind_starcraft ; facebook_starcraft , where there are many agents, each of which can take a finite number of actions; and coordinating self-driving cars at an intersection, where each car can take a finite set of actions comm_net .
In policy gradient, in order to encourage sufficient exploration, an entropy bonus term is typically added to the objective function. However, in the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible.
In this paper, we develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. For all of these parameterizations, actions can be efficiently sampled from the policy distribution, and backpropagation can be employed for training. These parameterizations can be combined with the entropy bonus estimators and stochastic gradient descent, giving a new class of policy gradient algorithms with desirable exploration. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators can substantially improve performance with marginal additional computational cost.
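Consider a Markov Decision Process (MDP) with a $d$-dimensional action space $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_d$. Denote $\mathbf{a} = (a_1, \ldots, a_d)$ for an action in $\mathcal{A}$. A policy $\pi(\cdot \mid s)$ specifies for each state $s$ a distribution over the action space $\mathcal{A}$. In the standard RL setting, an agent interacts with an environment over a number of discrete time steps sutton_book ; silver_lectures . At time step $t$, the agent is in state $s_t$ and samples an action $\mathbf{a}_t$ from the policy distribution $\pi(\cdot \mid s_t)$. The agent then receives a scalar reward $r_t$ and the environment enters the next state $s_{t+1}$. The agent then samples $\mathbf{a}_{t+1}$ from $\pi(\cdot \mid s_{t+1})$, and so on. The process continues until the end of the episode, denoted by $T$. The return $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the discounted accumulated reward from time step $t$ until the end of the episode, where $\gamma \in (0, 1]$ is the discount factor.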
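As a small illustration of the return defined above, the following sketch computes the discounted return for every time step of a finished episode with a single backward pass. The function name `discounted_returns` and the example rewards are ours and serve only as an illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_T], where R_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative episode: a single reward of 1 at the last step.
rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.8)
print(returns)
```

The backward recursion $R_t = r_t + \gamma R_{t+1}$ avoids recomputing the inner sum for every $t$.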
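In policy gradient, we consider a set of parameterized policies $\pi_\theta(\cdot \mid s)$, $\theta \in \Theta$, and attempt to find a good $\theta$ within a parameter set $\Theta$. Typically, the policy $\pi_\theta(\cdot \mid s)$ is generated by a neural network with $\theta$ denoting the network’s weights and biases. The parameters $\theta$ are updated by performing stochastic gradient ascent on the expected reward. One example of such an algorithm is REINFORCE william , where in a given episode at time step $t$, the parameters $\theta$ are updated as follows:

$$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid s_t) \, \big(R_t - b_t(s_t)\big),$$

where $\alpha$ is the learning rate and $b_t(s_t)$ is a baseline. It is well known that the policy gradient algorithm often converges to a local optimum. To discourage convergence to a highly suboptimal policy, the policy entropy is typically added to the update rule:

$$\theta \leftarrow \theta + \alpha \Big[ \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid s_t) \, \big(R_t - b_t(s_t)\big) + \beta \, \nabla_\theta H_\theta(s_t) \Big], \qquad (1)$$

where $H_\theta(s_t) = -\sum_{\mathbf{a} \in \mathcal{A}} \pi_\theta(\mathbf{a} \mid s_t) \log \pi_\theta(\mathbf{a} \mid s_t)$.

This approach is often referred to as adding an entropy bonus or entropy regularization william and is widely used in different applications, such as optimal control in Atari games async_rl , multi-agent games multi_agent_openai and optimizer search for supervised machine learning with RL optimizer_search . $\beta$ is referred to as the entropy weight.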
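To make the entropy-regularized update concrete, here is a minimal sketch for the simplest possible case: a one-dimensional softmax policy whose parameters are the logits themselves, so that the exact entropy and its gradient are cheap. This is our own toy setting, not the networks used in the experiments; the names `reinforce_entropy_step`, `alpha` (step size) and `beta` (entropy weight) are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_entropy_step(logits, action, ret, baseline, alpha, beta):
    """One gradient-ascent step on the logits of a softmax policy.

    Uses d log pi(a) / d logit_j = 1[j == a] - pi_j and
    dH / d logit_j = -pi_j * (log pi_j + H).
    """
    probs = softmax(logits)
    h = entropy(probs)
    new_logits = []
    for j, (theta_j, p_j) in enumerate(zip(logits, probs)):
        grad_log_pi = (1.0 if j == action else 0.0) - p_j
        grad_entropy = -p_j * (math.log(p_j) + h)
        step = grad_log_pi * (ret - baseline) + beta * grad_entropy
        new_logits.append(theta_j + alpha * step)
    return new_logits


# With return == baseline only the entropy term acts on the parameters.
logits = [2.0, 0.0, 0.0]
new_logits = reinforce_entropy_step(
    logits, action=0, ret=1.0, baseline=1.0, alpha=0.1, beta=1.0
)
print(entropy(softmax(logits)), entropy(softmax(new_logits)))
```

In this example the entropy term alone pushes the peaked policy back toward uniform, which is the exploration effect the bonus is meant to provide.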
3 Policy Parameterization for Multidimensional Action Space
For problems with discrete action spaces, policies are commonly parameterized as a feed-forward neural network (FFN) with a softmax output layer of dimension $|\mathcal{A}|$, the number of actions. Therefore sampling from such a policy requires $O(|\mathcal{A}|)$ effort. For multidimensional action spaces, $|\mathcal{A}|$ grows exponentially with the number of dimensions $d$.
In order to sample efficiently from our policy, we consider autoregressive models, which can be sampled one dimension at a time. In our discussion, we will assume $|\mathcal{A}_1| = |\mathcal{A}_2| = \cdots = |\mathcal{A}_d| = K$. To handle action sets of different sizes, we will include inconsequential actions. Here we review two such models, and note that sampling from both models only requires $O(dK)$ effort as opposed to $O(K^d)$ effort. We emphasize that our use of an autoregressive model to create multi-dimensional probability distributions is not novel. However, we need to provide a brief review to motivate our entropy calculation algorithms.
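The gap between the two costs is easy to quantify. The sketch below compares the size of a single joint softmax with the total number of outputs produced when each dimension is handled by its own softmax; the values 9 and 5 are illustrative choices of ours (nine actions per dimension, five dimensions).

```python
def joint_softmax_size(num_choices, num_dims):
    """Number of outputs of one softmax over the whole joint action space."""
    return num_choices ** num_dims


def autoregressive_outputs(num_choices, num_dims):
    """Total outputs when each dimension gets its own softmax."""
    return num_choices * num_dims


joint = joint_softmax_size(9, 5)
sequential = autoregressive_outputs(9, 5)
print(joint, sequential)
```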
3.1 Using an LSTM to Generate the Parameterized Policy
LSTMs have recently been used with great success for autoregressive models in language translation tasks lstm_translation . An LSTM can also be used to create a parameterized multi-dimensional distribution and to sample from that distribution (Figure 1(a)). To generate $a_i$, we run a forward pass through the LSTM with the input being $a_{i-1}$ and the current state $s_t$ (and implicitly on $a_1, \ldots, a_{i-2}$, which influence $h_{i-1}$). This produces a hidden state $h_i$, which is then passed through a linear layer, producing a $K$-dimensional vector, where $K = |\mathcal{A}_i|$. The softmax of this vector is taken to produce the one-dimensional conditional distribution $\pi_\theta(a_i \mid a_1, \ldots, a_{i-1}, s_t)$. The component $a_i$ is sampled from this one-dimensional distribution, and is then fed into the next stage of the LSTM to produce $a_{i+1}$. We note that this approach is an adaptation of sequence modeling in supervised machine learning wave_net to reinforcement learning and has also been proposed by google_sdqn_cont_action and actor_critic_sequence_prediction .
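The sampling loop itself does not depend on the LSTM. The following sketch shows the sequential procedure with a hypothetical stand-in, `conditional_probs`, in place of the recurrent network: any function that maps the state and the components chosen so far to a distribution over the next component would do. All names and the toy logits are our assumptions.

```python
import math
import random

K, D = 3, 4  # illustrative: 3 choices per dimension, 4 dimensions


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def conditional_probs(state, prefix):
    """Hypothetical stand-in for the network: P(next component | state, prefix)."""
    shift = state + sum(prefix)
    return softmax([math.sin(shift + k) for k in range(K)])


def sample_action(state, rng):
    """Sample the components one at a time; also return log p(a | state)."""
    prefix = []
    log_prob = 0.0
    for _ in range(D):
        probs = conditional_probs(state, prefix)
        a_i = rng.choices(range(K), weights=probs)[0]
        log_prob += math.log(probs[a_i])
        prefix.append(a_i)
    return tuple(prefix), log_prob


rng = random.Random(0)
action, log_prob = sample_action(state=1.0, rng=rng)
print(action, log_prob)
```

Each call evaluates only `D` distributions of size `K`, and the log-probability of the sampled action comes for free as the sum of the conditional log-probabilities.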
3.2 Using MMDP to Generate Parameterized Policy
The underlying MDP can be modified to create an equivalent MDP for which the action space is one-dimensional. We refer to this MDP as the Modified MDP (MMDP). In the original MDP, we have state space $\mathcal{S}$ and action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_d$. In the MMDP, the state encapsulates the original state $s$ and all the action dimensions selected for state $s$ so far (Figure 1(b)). We note that google_sdqn_cont_action recently and independently proposed the reformulation of the MDP into the MMDP.
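One way to picture the MMDP is as a thin wrapper around the original environment. The sketch below is our own illustration under stated assumptions: the augmented state is the pair (original state, components chosen so far), intermediate steps give zero reward, and the original environment is only stepped once all `d` components are available. The helper `toy_env_step` is a hypothetical environment used purely to exercise the wrapper.

```python
def mmdp_step(state, chosen, component, d, env_step):
    """One transition of the modified MDP.

    Returns (next_state, next_chosen, reward). The zero intermediate reward
    is an assumption of this sketch, not something stated in the text.
    """
    chosen = chosen + (component,)
    if len(chosen) < d:
        # Still assembling the joint action: the original state is unchanged.
        return state, chosen, 0.0
    # All d components are selected: apply the joint action in the original MDP.
    next_state, reward = env_step(state, chosen)
    return next_state, (), reward


def toy_env_step(state, action):
    """Hypothetical original MDP: the state counts steps, reward sums the action."""
    return state + 1, float(sum(action))


state, chosen = 0, ()
trace = []
for component in (2, 0, 1):
    state, chosen, reward = mmdp_step(state, chosen, component, 3, toy_env_step)
    trace.append((state, chosen, reward))
print(trace)
```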
4 Entropy Bonus Approximation for Multidimensional Action Space
As shown in (1), an entropy bonus is typically included to enhance exploration. However, for a large multi-dimensional action space, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action. In this section, we develop computationally efficient unbiased estimates for the entropy and its gradient. These computationally efficient algorithms can be combined with the autoregressive models discussed in the previous section to provide end-to-end computationally efficient schemes.
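To abbreviate notation, we write $p_\theta(\mathbf{a})$ for $\pi_\theta(\mathbf{a} \mid s_t)$ and $\mathbf{a}_i$ for $(a_1, \ldots, a_i)$. We consider autoregressive models whereby the sample components $a_i$, $i = 1, \ldots, d$, are sequentially generated. In particular, after obtaining $a_1, \ldots, a_{i-1}$, we will generate $a_i \in \mathcal{A}_i$ from some parameterized distribution $p_\theta(\cdot \mid \mathbf{a}_{i-1})$ defined over the one-dimensional set $\mathcal{A}_i$. After generating the distributions $p_\theta(\cdot \mid \mathbf{a}_{i-1})$, $i = 1, \ldots, d$, and the action components $a_1, \ldots, a_d$ sequentially, we then define $p_\theta(\mathbf{a}) = \prod_{i=1}^{d} p_\theta(a_i \mid \mathbf{a}_{i-1})$.
Let $\mathbf{A} = (A_1, \ldots, A_d)$ denote a random variable with distribution $p_\theta(\cdot)$. Let $H_\theta$ denote the exact entropy of the distribution $p_\theta(\mathbf{a})$:

$$H_\theta = -\sum_{\mathbf{a} \in \mathcal{A}} p_\theta(\mathbf{a}) \log p_\theta(\mathbf{a}) = -E_{\mathbf{A} \sim p_\theta}\big[\log p_\theta(\mathbf{A})\big].$$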
4.1 Crude Unbiased Estimator
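During training within an episode, for each state $s_t$, the policy generates an action $\mathbf{a}$. We refer to this generated action as the episodic sample. A crude approximation of the entropy bonus is:

$$H^{\text{crude}}_\theta(\mathbf{a}) = -\log p_\theta(\mathbf{a}) = -\sum_{i=1}^{d} \log p_\theta(a_i \mid \mathbf{a}_{i-1}),$$

where $p_\theta(\mathbf{a})$ abbreviates $\pi_\theta(\mathbf{a} \mid s_t)$ and $\mathbf{a}_{i-1} = (a_1, \ldots, a_{i-1})$. This approximation is an unbiased estimate of the exact entropy $H_\theta$, but its variance is likely to be large. To reduce the variance, we can generate multiple action samples when in $s_t$ and average the negative log action probabilities over the samples. However, generating a large number of samples is costly, especially when each sample is generated from a neural network, since each sample requires one additional forward pass.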
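The behavior of the crude estimate can be checked exactly on a problem small enough to enumerate. The sketch below uses a hypothetical autoregressive toy policy (`cond`, with 3 choices in each of 3 dimensions, both our assumptions) and computes the mean and variance of the negative log-probability of a sampled action by summing over all 27 actions.

```python
import itertools
import math

K, D = 3, 3  # illustrative sizes, small enough to enumerate all K**D actions


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cond(prefix):
    """Hypothetical toy policy: P(next component | components chosen so far)."""
    scale = 0.8 * (len(prefix) + 1) - 0.5 * sum(prefix)
    return softmax([scale * k for k in range(K)])


def log_prob(action):
    return sum(math.log(cond(action[:i])[action[i]]) for i in range(D))


def crude_estimate(action):
    """Crude entropy estimate from a single sampled action."""
    return -log_prob(action)


actions = list(itertools.product(range(K), repeat=D))
probs = {a: math.exp(log_prob(a)) for a in actions}
exact_entropy = -sum(p * math.log(p) for p in probs.values())

# Exact mean and variance of the estimate under the policy, by enumeration.
crude_mean = sum(probs[a] * crude_estimate(a) for a in actions)
crude_var = sum(probs[a] * (crude_estimate(a) - exact_entropy) ** 2 for a in actions)
print(exact_entropy, crude_mean, crude_var)
```

The mean matches the exact entropy, as unbiasedness requires, while the variance is strictly positive for any non-uniform policy.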
4.2 Smoothed Estimator
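This section proposes an alternative unbiased estimator for $H_\theta$ which only requires the one episodic sample and accounts for the entropy along each dimension of the action space:

$$\tilde{H}_\theta(\mathbf{a}) := \sum_{i=1}^{d} H^{(i)}_\theta(\mathbf{a}_{i-1}), \qquad H^{(i)}_\theta(\mathbf{a}_{i-1}) := -\sum_{a \in \mathcal{A}_i} p_\theta(a \mid \mathbf{a}_{i-1}) \log p_\theta(a \mid \mathbf{a}_{i-1}),$$

where $H^{(i)}_\theta(\mathbf{a}_{i-1})$ is the entropy of $A_i$ conditioned on $\mathbf{a}_{i-1} = (a_1, \ldots, a_{i-1})$. This estimate of the entropy bonus is computationally efficient since for each dimension $i$, we would need to obtain $p_\theta(\cdot \mid \mathbf{a}_{i-1})$, its log and gradient anyway during training. We refer to this approximation as the smoothed entropy.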
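The smoothed estimate only needs the conditional distributions that were already evaluated while sampling the action. The sketch below, again on a hypothetical enumerable toy policy of our own (`cond`), sums the per-dimension conditional entropies along a sampled action and verifies by enumeration that the estimate averages to the exact entropy; it also computes the variances of the smoothed and crude estimates for comparison.

```python
import itertools
import math

K, D = 3, 3  # illustrative sizes, small enough to enumerate


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cond(prefix):
    """Hypothetical toy policy: P(next component | components chosen so far)."""
    scale = 0.8 * (len(prefix) + 1) - 0.5 * sum(prefix)
    return softmax([scale * k for k in range(K)])


def cond_entropy(prefix):
    return -sum(p * math.log(p) for p in cond(prefix))


def log_prob(action):
    return sum(math.log(cond(action[:i])[action[i]]) for i in range(D))


def smoothed_estimate(action):
    """Sum of the conditional entropies along the prefixes of one action."""
    return sum(cond_entropy(action[:i]) for i in range(D))


actions = list(itertools.product(range(K), repeat=D))
probs = {a: math.exp(log_prob(a)) for a in actions}
exact_entropy = -sum(p * math.log(p) for p in probs.values())

smoothed_mean = sum(probs[a] * smoothed_estimate(a) for a in actions)
smoothed_var = sum(
    probs[a] * (smoothed_estimate(a) - exact_entropy) ** 2 for a in actions
)
crude_var = sum(probs[a] * (-log_prob(a) - exact_entropy) ** 2 for a in actions)
print(exact_entropy, smoothed_mean, smoothed_var, crude_var)
```

Unbiasedness follows from the chain rule for entropy; which of the two variances is smaller depends on the policy and has to be checked case by case.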
The smoothed estimate of the entropy has several appealing properties. The proofs of Theorem 1 and Theorem 3 are straightforward and omitted.
Theorem 1. The smoothed estimate $\tilde{H}_\theta(\mathbf{A})$ is an unbiased estimator of the exact entropy $H_\theta$.
Theorem 2. If $p_\theta(\mathbf{a})$ has a multivariate normal distribution with mean and variance depending on $\theta$, then:

$$\tilde{H}_\theta(\mathbf{a}) = H_\theta \quad \text{for all } \mathbf{a} \in \mathcal{A}.$$

Thus, the smoothed estimate of the entropy equals the exact entropy for a multivariate normal parameterization of the policy.
See Appendix B for proof.
Theorem 3. (i) If there exists a sequence of weights $\theta_1, \theta_2, \ldots$ in $\Theta$ such that $p_{\theta_n}(\cdot)$ converges to the uniform distribution over $\mathcal{A}$, then
(ii) If there exists a sequence of weights $\theta_1, \theta_2, \ldots$ in $\Theta$ such that $p_{\theta_n}(\mathbf{a}^*) \to 1$ for some $\mathbf{a}^*$, then
Thus, the smoothed estimate of the entropy mimics the exact entropy in that it has the same supremum and infimum values as the exact entropy.
The above theorems indicate that the smoothed estimate $\tilde{H}_\theta(\mathbf{a})$ may serve as a good proxy for the exact entropy $H_\theta$: it is an unbiased estimator for $H_\theta$; it has the same minimum and maximum values when varying $\theta$; and in the special case when $p_\theta(\mathbf{a})$ has a multivariate normal distribution, it is actually equal to $H_\theta$ for all $\mathbf{a}$. Our numerical experiments have shown that the smoothed estimator $\tilde{H}_\theta(\mathbf{a})$ typically has lower variance than the crude estimator $H^{\text{crude}}_\theta(\mathbf{a})$. However, the smoothed estimator does not always have lower variance, as counterexamples can be found.
4.3 Smoothed Mode Estimator
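For the smoothed estimate of the entropy $\tilde{H}_\theta(\mathbf{a})$, we use the episodic action $\mathbf{a}$ to form the estimate. We now consider alternative choices of actions which may improve performance at modest additional computational cost. First consider $\tilde{H}_\theta(\mathbf{a}^*)$ where $\mathbf{a}^* = \arg\max_{\mathbf{a}} p_\theta(\mathbf{a})$. Thus in this case, instead of calculating the smoothed estimate of the entropy with the episodic action $\mathbf{a}$, we calculate it with the most likely action $\mathbf{a}^*$. The problem here is that it is not easy to find $\mathbf{a}^*$ when the conditional probabilities $p_\theta(a_i \mid \mathbf{a}_{i-1})$ are not in closed form but only available algorithmically as outputs of neural networks. A more computationally efficient approach would be to choose the action greedily: $\hat{a}_1 = \arg\max_{a \in \mathcal{A}_1} p_\theta(a)$ and $\hat{a}_i = \arg\max_{a \in \mathcal{A}_i} p_\theta(a \mid \hat{\mathbf{a}}_{i-1})$ for $i = 2, \ldots, d$. This leads to the definition $\tilde{H}^{\text{mode}}_\theta := \tilde{H}_\theta(\hat{\mathbf{a}})$. The action $\hat{\mathbf{a}}$ is an approximation for the mode of the distribution $p_\theta(\cdot)$. As is often done in NLP, we can use beam search to determine an action $\breve{\mathbf{a}}$ that has higher probability, that is, $p_\theta(\breve{\mathbf{a}}) \ge p_\theta(\hat{\mathbf{a}})$. Indeed, the above definition is beam search with beam size equal to 1. We refer to $\tilde{H}^{\text{mode}}_\theta$ as the smoothed mode estimate.
The smoothed mode estimate $\tilde{H}^{\text{mode}}_\theta$ with an appropriate beam size may be a better approximation for the entropy $H_\theta$ than $\tilde{H}_\theta(\mathbf{a})$. However, calculating $\tilde{H}^{\text{mode}}_\theta$ and its gradient comes with some computational cost. For example, with a beam size equal to one, we would have to make two passes through the policy neural network at each time step: one to obtain the episodic sample $\mathbf{a}$ and the other to obtain the greedy action $\hat{\mathbf{a}}$. For beam size $B$, we would need to make $B + 1$ passes. We note that $\tilde{H}^{\text{mode}}_\theta$ is a biased estimator for $H_\theta$ but with no variance. Thus there is a bias-variance tradeoff between $\tilde{H}^{\text{mode}}_\theta$ and $\tilde{H}_\theta(\mathbf{a})$. Note that $\tilde{H}^{\text{mode}}_\theta$ also satisfies Theorems 2 and 3 in subsection 4.2.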
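The greedy variant (beam size one) can be sketched in a few lines. The example below uses a hypothetical enumerable toy policy of our own (`cond`): it picks the most likely component in each dimension in turn, evaluates the sum of conditional entropies along that greedy action, and, because the toy problem is tiny, also finds the true mode by brute force for comparison.

```python
import itertools
import math

K, D = 3, 3  # illustrative sizes


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cond(prefix):
    """Hypothetical toy policy: P(next component | components chosen so far)."""
    scale = 0.8 * (len(prefix) + 1) - 0.5 * sum(prefix)
    return softmax([scale * k for k in range(K)])


def greedy_action():
    """Pick the most likely component in each dimension, one at a time."""
    prefix = ()
    for _ in range(D):
        probs = cond(prefix)
        best = max(range(K), key=lambda k: probs[k])
        prefix = prefix + (best,)
    return prefix


def smoothed_estimate(action):
    total = 0.0
    for i in range(D):
        total -= sum(p * math.log(p) for p in cond(action[:i]))
    return total


def joint_prob(action):
    p = 1.0
    for i in range(D):
        p *= cond(action[:i])[action[i]]
    return p


greedy = greedy_action()
mode_estimate = smoothed_estimate(greedy)  # deterministic given the parameters
true_mode = max(itertools.product(range(K), repeat=D), key=joint_prob)
print(greedy, mode_estimate, true_mode)
```

Because no sampling is involved, the estimate has zero variance for fixed parameters; the price is a bias that depends on how far the entropies along the greedy path are from their average.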
4.4 Estimating the Gradient of the Entropy
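So far we have been looking at estimates of the entropy. But the update rule (1) uses the gradient of the entropy rather than the entropy. As it turns out, the gradients of the crude estimator $H^{\text{crude}}_\theta(\mathbf{a})$ and the smoothed estimator $\tilde{H}_\theta(\mathbf{a})$ are not unbiased estimates of the gradient of the entropy. In this subsection, we provide unbiased estimators for the gradient of the entropy. For simplicity, in this section, we assume a one-step decision setting, such as in a multi-armed bandit problem. A straightforward calculation shows:

$$\nabla_\theta H_\theta = -E_{\mathbf{A} \sim p_\theta}\big[\log p_\theta(\mathbf{A}) \, \nabla_\theta \log p_\theta(\mathbf{A})\big]. \qquad (3)$$

Suppose $\mathbf{a}$ is one sample from $p_\theta(\cdot)$. A crude unbiased estimator for the gradient of the entropy therefore is: $-\log p_\theta(\mathbf{a}) \, \nabla_\theta \log p_\theta(\mathbf{a}) = \log p_\theta(\mathbf{a}) \, \nabla_\theta H^{\text{crude}}_\theta(\mathbf{a})$. Note that this estimator is equal to the gradient of the crude estimator multiplied by a correction factor.
Analogous to the smoothed estimator for entropy, we can also derive a smoothed estimator for the gradient of the entropy.
Theorem 4. If $\mathbf{a}$ is a sample from $p_\theta(\cdot)$, then

$$\nabla_\theta \tilde{H}_\theta(\mathbf{a}) + \sum_{i=1}^{d} H^{(i)}_\theta(\mathbf{a}_{i-1}) \, \nabla_\theta \log p_\theta(\mathbf{a}_{i-1})$$

is an unbiased estimator for the gradient of the entropy. Here $p_\theta(\mathbf{a}_{i-1}) = \prod_{j=1}^{i-1} p_\theta(a_j \mid \mathbf{a}_{j-1})$ is the probability of the first $i-1$ components, with $p_\theta(\mathbf{a}_0) = 1$.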
See Appendix C for proof.
Note that this estimate for the gradient of the entropy is equal to the gradient of the smoothed estimate plus a correction term. We refer to this estimate of the entropy gradient as the unbiased gradient estimate.
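The role of the correction term can be verified numerically on a problem with two binary action components. The sketch below is our own check, not part of the original experiments: the three-parameter toy policy, the use of central finite differences in place of backpropagation, and all function names are assumptions. It compares the expected value of the corrected estimate, and of the uncorrected gradient of the smoothed estimate, with the gradient of the exact entropy.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_entropy(q):
    return -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))


def cond_prob_one(theta, prefix):
    """P(next component = 1 | prefix) for a toy policy with two binary components."""
    if len(prefix) == 0:
        return sigmoid(theta[0])
    return sigmoid(theta[1 + prefix[0]])


def log_prob_prefix(theta, prefix):
    total = 0.0
    for i, a_i in enumerate(prefix):
        q = cond_prob_one(theta, prefix[:i])
        total += math.log(q if a_i == 1 else 1.0 - q)
    return total


def smoothed_entropy(theta, action):
    return sum(
        bernoulli_entropy(cond_prob_one(theta, action[:i])) for i in range(len(action))
    )


def exact_entropy(theta):
    total = 0.0
    for a in itertools.product((0, 1), repeat=2):
        lp = log_prob_prefix(theta, a)
        total -= math.exp(lp) * lp
    return total


def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient (stands in for backpropagation)."""
    g = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        g.append((f(up) - f(down)) / (2.0 * eps))
    return g


def unbiased_gradient_estimate(theta, action):
    est = grad(lambda t: smoothed_entropy(t, action), theta)
    # Correction: conditional entropy after each non-empty prefix times the
    # gradient of that prefix's log-probability (the empty prefix adds zero).
    for n in range(1, len(action)):
        prefix = action[:n]
        h = bernoulli_entropy(cond_prob_one(theta, prefix))
        g_lp = grad(lambda t: log_prob_prefix(t, prefix), theta)
        est = [e + h * g for e, g in zip(est, g_lp)]
    return est


theta = [0.3, -0.7, 1.1]
true_grad = grad(exact_entropy, theta)
expected_unbiased = [0.0, 0.0, 0.0]
expected_naive = [0.0, 0.0, 0.0]
for a in itertools.product((0, 1), repeat=2):
    p_a = math.exp(log_prob_prefix(theta, a))
    corrected = unbiased_gradient_estimate(theta, a)
    naive = grad(lambda t: smoothed_entropy(t, a), theta)
    for j in range(3):
        expected_unbiased[j] += p_a * corrected[j]
        expected_naive[j] += p_a * naive[j]
print(true_grad, expected_unbiased, expected_naive)
```

In this toy case the corrected estimate averages to the true gradient up to finite-difference error, while the uncorrected gradient of the smoothed estimate is off in the component that controls the first action.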
5 Experimental Results
We designed experiments to compare the different entropy estimators for the LSTM, MMDP, and CommNet models; CommNet is a related approach introduced by comm_net . As a baseline, we use the Independent Sampling (IS) model, which is an FFN that takes as input the state, creates a representation of the state, and from the representation outputs $d$ softmax heads, from which the value of each action dimension can be sampled independently comm_net . In this case, the smoothed estimate is equal to the exact entropy. For each entropy approximation, the entropy weight was tuned to give the highest reward. For IS and MMDP, the number of hidden layers was tuned from 1 to 7. For CommNet, the number of communication steps was tuned from 2 to 5, the learning rate was tuned between 3e-3 and 3e-4 and the size of the policy hidden layer was tuned between 128 and 256.
5.1 Hunters and Rabbits
In this environment, there is a square grid. At the beginning of each episode, the hunters and rabbits are randomly placed in the grid. The rabbits remain fixed during the episode, and each hunter can move to a neighboring square (including diagonal neighbors) or stay at the current square. So each hunter has nine possible actions, and altogether there are $9^m$ joint actions at each time step, where $m$ is the number of hunters. When a hunter enters a square with a rabbit, the hunter captures the rabbit and remains there until the end of the episode. In each episode, the goal is for the hunters to capture the rabbits as quickly as possible. Each episode is allowed to run for at most ten thousand time steps.
To provide a dense reward signal, we modify the goal as follows: capturing a rabbit gives a reward of , which is discounted by the number of time steps taken since the beginning of the episode. The discount factor is 0.8. The goal is to maximize the episode’s total discounted reward. After a hunter captures a rabbit, they both become inactive.
Comparison of different entropy estimates for IS, LSTM, MMDP and CommNet
Table 1 shows the performance of the IS, LSTM, MMDP and CommNet models with the different entropy estimates. Training and evaluation were performed in a square grid of 5 by 5 with 5 hunters and 5 rabbits. Results are averaged over 5 seeds. For each seed, training and evaluation were run for 1 million and 1 thousand episodes respectively.
As compared with no entropy, crude entropy can actually reduce performance. However, smoothed entropy and smoothed mode entropy always increase performance, often significantly. For the LSTM model, the best performing approximation is smoothed entropy, reducing the mean episode length by and increasing the mean episode reward by compared to without entropy. We also note that there is not a significant difference in performance between the smoothed entropy, smoothed mode entropy, and the unbiased gradient approaches. When comparing the four models, we see that the LSTM model with smoothed entropy does significantly better than the other three models. The CommNet model could potentially be improved by allowing the hunters to see more of the state; this could be investigated in future research.
| 98.7 ± 78.9 | 32 ± 12.3 | 11.8 ± 1.9 | 11.8 ± 1.9 | 11.8 ± 1.9 |
| 10.1 ± 1.9 | 19 ± 8.7 | 6.0 ± 0.2 | 6.0 ± 0.1 |
| 21.5 ± 3.7 | 37.3 ± 29.6 | 10.6 ± 0.7 | 10.6 ± 0.7 | 9.8 ± 0.6 |
| 22.7 ± 0.6 | 22.3 ± 0.4 | 21.9 ± 0.4 | 21.9 ± 0.4 | 21.9 ± 0.4 |
| 2.2 ± 0.03 | 2.4 ± 0.05 | 2.7 ± 0.01 | 2.7 ± 0.01 | 2.7 ± 0.01 |
| 3.0 ± 0.06 | 3.0 ± 0.03 | 3.2 ± 0.04 | 3.2 ± 0.02 |
| 2.8 ± 0.03 | 2.7 ± 0.03 | 2.9 ± 0.03 | 2.8 ± 0.04 | 2.9 ± 0.02 |
| 2.5 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01 | 2.6 ± 0.01 |
The smoothed estimator is also more robust with respect to the initial seed than without entropy as shown in Figure 2. For example, for the LSTM model, in the case of without entropy, seed 0 leads to significantly worse results than the seeds 1-4. This does not happen with the smoothed estimator.
Entropy approximations versus exact entropy
We now consider how policies trained with entropy approximations compare with policies trained with exact entropy. In order to calculate exact entropy in an acceptable amount of time, we reduced the number of hunters and rabbits to 4 hunters and 4 rabbits. Training was run for 50,000 episodes. Table 2 shows the performance differences between policies trained with entropy approximations and exact entropy. We see that the best entropy approximations perform only slightly worse than exact entropy for both LSTM and MMDP. Once again we see that the LSTM model performs better than the MMDP model.
5.2 Multi-agent Multi-arm Bandits
We examine a multi-agent version of the standard multi-armed bandit problem, where there are $d$ agents each pulling one of $K$ arms, with $d \le K$. The $k$-th arm generates a reward $r_k$. The total reward in a round is generated as follows. In each round, each agent chooses an arm. All of the chosen arms are then pulled, with each pulled arm generating a reward. Note that the total number of distinct arms chosen may be less than $d$ since some agents may choose the same arm. The total reward is the sum of rewards from the chosen arms. The optimal policy is for the agents to collectively pull the $d$ arms with the highest rewards. Additionally, among all the optimal assignments of agents to the arms that yield the highest reward, we add a bonus reward $r^*$ with probability $p^*$ if one particular agents-to-arms configuration is chosen.
We performed experiments with 4 agents and 10 arms, with the $k$-th arm providing a reward of $k$. The exceptional assignment gets a bonus of 166 (making a total reward of 200) with probability 0.01, and no bonus with probability 0.99. Thus the maximum expected reward is 35.66. Training was run for 100,000 rounds for each of 10 seeds. Table 3 shows average results for the last 500 of the 100,000 rounds.
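The quoted totals can be reproduced with a few lines of arithmetic. The sketch below assumes that arm $k$ pays a reward of $k$ for $k = 1, \ldots, 10$, which is the assignment consistent with the totals of 200 and 35.66 stated above; the variable names are ours.

```python
arm_rewards = list(range(1, 11))  # assumption: arm k pays k, for k = 1..10
num_agents = 4
bonus, bonus_prob = 166, 0.01

# Best case without the bonus: the agents cover the four highest-paying arms.
best_base = sum(sorted(arm_rewards, reverse=True)[:num_agents])
max_total_with_bonus = best_base + bonus
max_expected = best_base + bonus_prob * bonus
print(best_base, max_total_with_bonus, max_expected)
```

The four best arms pay 10 + 9 + 8 + 7 = 34, so the bonus of 166 brings the total to 200 and adds 1.66 in expectation.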
| 34.2 ± 1.3 | 34.4 ± 1.3 | 34.2 ± 1.3 |
| 34.9 ± 0.8 | 35.5 ± 1.1 | 35.9 ± 0.8 |
| 19.8 ± 39.7 | 29.7 ± 49.6 | 19.7 ± 39.7 |
| 39.8 ± 35.9 | 59.4 ± 35.7 | 95.0 ± 1.9 |
The results for the multi-agent bandit problem are consistent with those for the hunter-rabbit problem. Policies obtained with the entropy approximations all perform better than policies obtained without entropy or with crude entropy, particularly for the percentage of rounds the arms are pulled with the optimal configuration. Note that LSTM with the unbiased gradient estimator gives the best results.
6 Related Work
Metz et al. google_sdqn_cont_action recently and independently proposed the reformulation of the MDP into the MMDP and the LSTM policy parameterization. They inject noise into the action space to encourage exploration. Usunier et al. sc_episodic_explore use the MMDP and noise injection in the parameter space to achieve high performance in multi-agent StarCraft micro-management tasks. Instead of noise injection, we propose novel estimators for the entropy bonus that is often used to encourage exploration in policy gradient.
While entropy regularization has been mostly used in policy gradient algorithms, Schulman et al. equivalence_pg_soft_q apply entropy regularization to Q-learning. They make an important observation about the equivalence between policy gradient and entropy-regularized Q-learning.
To the best of our knowledge, no prior work has dealt with approximating the policy entropy for MDPs with large multi-dimensional discrete action spaces. On the other hand, there have been many attempts to devise methods to encourage beneficial exploration for policy gradient. Nachum et al. urex modify the entropy term by adding weights to the log action probabilities, leading to a new optimization objective termed under-appreciated reward exploration.
Dulac-Arnold et al. drl_large_discrete_action embed discrete actions in a continuous space, pick actions in the continuous space and map these actions back into the discrete space. However, their algorithm introduces a new hyper-parameter that requires tuning for every new task. Our approach involves no new hyper-parameter other than those normally used in deep learning.
In this paper, we developed several novel unbiased estimators for the entropy bonus and its gradient. We ran experiments on two environments with large multi-dimensional action spaces. We found that the smoothed estimate of the entropy and the unbiased estimate of the entropy gradient can significantly increase performance with marginal additional computational cost.
Appendix A. Hyperparameters
Hyperparameters for hunter-rabbit game
For IS, the numbers of hidden layers for smoothed entropy, unbiased gradient estimate, crude entropy and without entropy are 1, 1, 5 and 7 respectively. The entropy weights for smoothed entropy, unbiased gradient estimate and crude entropy are 0.03, 0.02 and 0.01 respectively. The hyper-parameters for smoothed mode entropy are not listed since the smoothed mode entropy equals the smoothed entropy for IS.
For CommNet, the numbers of communication steps for without entropy, crude entropy, smoothed entropy and unbiased entropy gradient are 2, 2, 2 and 2 respectively. The sizes of the policy hidden layer for without entropy, crude entropy, smoothed entropy and unbiased entropy gradient are 256, 256, 256 and 128 respectively. The entropy weights for crude entropy, smoothed entropy and unbiased entropy gradient are 0.04, 0.04 and 0.01 respectively. The policies were optimized using Adam adam with learning rate 3e-4. We found that 3e-4 gives better performance than the learning rate 3e-3 originally used in comm_net .
The LSTM policy has 128 hidden nodes. For the MMDP policy, the number of hidden layers for smoothed entropy, smoothed mode entropy, unbiased gradient estimate, crude entropy and without entropy are 5, 3, 3, 4 3 and 3 respectively. Each MMDP layer has 128 nodes. We parameterize the baseline in (2) with an FFN with one hidden layer of size 64. This network was trained using the first-visit Monte Carlo return to minimize the L1 loss between actual and predicted values of states visited during the episode.
Both the policies and the baseline are optimized after each episode with RMSprop rmsprops . The RHS of (2) is clipped to before updating the policy parameters. The learning rates for the baseline, IS, LSTM and MMDP are , , , respectively.
To obtain the results in Table 1, the entropy weights for LSTM smoothed entropy, LSTM smoothed mode entropy, LSTM unbiased gradient estimate, LSTM crude entropy, MMDP smoothed entropy, MMDP smoothed mode entropy, MMDP unbiased gradient estimate and MMDP crude entropy are 0.02, 0.021, 0.031, 0.04, 0.02, 0.03, 0.03 and 0.01 respectively.
To obtain the results in Table 2, the entropy weights for LSTM smoothed entropy, LSTM exact entropy, MMDP unbiased gradient estimate and MMDP exact entropy are 0.03, 0.01, 0.03 and 0.01 respectively. The MMDP networks have three layers with 128 nodes in each layer. Experimental results are averaged over five seeds (0-4).
Hyperparameters for Multi-Agent Multi-Arm Bandits
The experiments were run with 4 agents and 10 arms. For the 10 arms, their rewards are $r_k = k$ for $k = 1, \ldots, 10$. The LSTM policy has 32 hidden nodes. The baseline in (1) is a truncated average of the reward of the last 100 rounds. The entropy weights for crude entropy, smoothed entropy and unbiased gradient estimate are 0.005, 0.001 and 0.003 respectively. The learning rates for without entropy, crude entropy, smoothed entropy and unbiased gradient estimate are 0.006, 0.008, 0.002 and 0.005 respectively. Experimental results are averaged over ten seeds.
Appendix B. Proof of Theorem 2
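Theorem 2. If $p_\theta(\mathbf{a})$ has a multivariate normal distribution with mean and variance depending on $\theta$, then:

$$\tilde{H}_\theta(\mathbf{a}) = H_\theta \quad \text{for all } \mathbf{a}.$$

Thus, the smoothed estimate of the entropy equals the exact entropy for a multivariate normal parameterization of the policy.
We first note that for $X = (X_1, X_2) \sim N(\mu, \Sigma)$, where $X_1$ and $X_2$ are random vectors, $\mu = (\mu_1, \mu_2)$, and $\Sigma$ is partitioned into blocks $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$, we have $X_2 \mid X_1 = x_1 \sim N(\bar{\mu}, \bar{\Sigma})$, where

$$\bar{\mu} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \qquad \bar{\Sigma} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.$$

Observe that the covariance matrix of the conditional distribution does not depend on the value of $x_1$ applied_multi_stats .
Also note that for $X \sim N(\mu, \Sigma)$, the entropy of $X$ takes the form

$$H(X) = \frac{k}{2}\big(1 + \log 2\pi\big) + \frac{1}{2} \log |\Sigma|,$$

where $k$ is the dimension of $X$ and $|\cdot|$ denotes the determinant. Therefore, the entropy of a multivariate normal random variable depends only on the variance and not on the mean.
Because $\mathbf{A}$ is multivariate normal, the distribution of $A_i$ given $\mathbf{A}_{i-1} = \mathbf{a}_{i-1}$ is normal with a variance $\sigma_i^2$ that does not depend on $\mathbf{a}_{i-1}$. Therefore

$$H^{(i)}_\theta(\mathbf{a}_{i-1}) = \frac{1}{2} \log\big(2 \pi e \, \sigma_i^2\big)$$

does not depend on $\mathbf{a}_{i-1}$ and hence $\tilde{H}_\theta(\mathbf{a})$ does not depend on $\mathbf{a}$. Combining this with the fact that $\tilde{H}_\theta(\mathbf{A})$ is an unbiased estimator for $H_\theta$ gives $\tilde{H}_\theta(\mathbf{a}) = H_\theta$ for all $\mathbf{a}$. ∎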
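The argument can be checked numerically for a bivariate normal. The sketch below uses illustrative variances and covariance of our own choosing: it evaluates the joint entropy from the determinant of the covariance matrix and compares it with the sum of the entropy of the first component and the entropy of the second component given the first, using the conditional variance from the Schur complement.

```python
import math


def gaussian_entropy(cov_det, k):
    """Entropy of a k-dimensional normal whose covariance has determinant cov_det."""
    return 0.5 * k * (1.0 + math.log(2.0 * math.pi)) + 0.5 * math.log(cov_det)


# Illustrative covariance entries for (A_1, A_2).
var1, var2, cov12 = 2.0, 1.5, 0.9

joint_entropy = gaussian_entropy(var1 * var2 - cov12 ** 2, k=2)

# Conditional variance of A_2 given A_1; note it does not involve the value of A_1.
cond_var = var2 - cov12 ** 2 / var1
smoothed = gaussian_entropy(var1, k=1) + gaussian_entropy(cond_var, k=1)
print(joint_entropy, smoothed)
```

The two numbers agree because the determinant of the covariance matrix factors as `var1 * cond_var`, which is the two-dimensional instance of the proof above.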
Appendix C. Proof of Theorem 4
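Theorem 4. If $\mathbf{a}$ is a sample from $p_\theta(\cdot)$, then

$$\nabla_\theta \tilde{H}_\theta(\mathbf{a}) + \sum_{i=1}^{d} H^{(i)}_\theta(\mathbf{a}_{i-1}) \, \nabla_\theta \log p_\theta(\mathbf{a}_{i-1})$$

is an unbiased estimator for the gradient of the entropy.
From Equation (3), we have: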
We will now use conditional expectation to calculate the terms in the double sum.
Combining these three conditional expectations with (4), we obtain:
Alternatively, Theorem 4 could also be proven by applying Theorem 1 of schulman2015gradient . ∎
Appendix D. State Representation For CommNet
Sukhbaatar et al. comm_net propose CommNet to handle multi-agent environments where each agent observes only part of the state and the number of agents changes throughout an episode. We thus modify the state representation of the hunters and rabbits environment to better reflect the strengths of CommNet. Each hunter only sees its own id, its position and the positions of all rabbits. More precisely, the state each hunter receives is [hunter id, hunter position, all rabbit positions].
We would like to thank Martin Arjovsky for his input and suggestions at both the early and later stages of this research. Our gratitude also goes to the HPC team at NYU, NYU Shanghai, and NYU Abu Dhabi.
-  David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, jan 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning, Feb 2015.
-  Asier Mujika. Multi-task learning with deep model based reinforcement learning. CoRR, abs/1611.01457, 2016.
-  Brendan O’Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and q-learning, 2016.
-  Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015.
-  Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016.
-  Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. CoRR, abs/1706.04859, 2017.
-  Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. Starcraft II: A new challenge for reinforcement learning. CoRR, abs/1708.04782, 2017.
-  Zeming Lin, Jonas Gehring, Vasil Khalidov, and Gabriel Synnaeve. STARDATA: A starcraft AI research dataset. CoRR, abs/1708.02139, 2017.
-  Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. CoRR, abs/1605.07736, 2016.
-  Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
-  David Silver. UCL course on RL, 2015.
-  Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. 1991.
-  Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. 2016.
-  Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments, 2017.
-  Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc Le. Neural optimizer search with reinforcement learning. 2017.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks, 2014.
-  Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
-  Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. CoRR, abs/1705.05035, 2017.
-  Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. CoRR, abs/1607.07086, 2016.
-  Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks. CoRR, abs/1609.02993, 2016.
-  John Schulman, Pieter Abbeel, and Xi Chen. Equivalence between policy gradients and soft q-learning, 2017.
-  Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring under-appreciated rewards, 2016.
-  Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces, 2015.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  Tieleman and Hinton. Rmsprop: Divide the gradient by a running average of its recent magnitude - university of toronto, 2012.
-  R. A. Johnson and D. W. Wichern, editors. Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
-  John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.