The objective of reinforcement learning (RL) is to find, through trial and error, a policy that takes the best actions to maximize the accumulated long-term reward. There are two important notions in RL: the return, which is the weighted accumulation of future rewards after taking a series of actions, and the value, which is the expected return and serves as the objective that the policy maximizes. In classic RL (sutton1998introduction, ), the return is usually defined as the exponentially discounted sum of future rewards, where the discounting factor balances the importance of immediate and future rewards. The motivation for this mathematical form comes from economic theory (sutton1998introduction, ; pitis2019rethinking, ). However, from the perspective of machine learning, there can be many feasible ways to define the form of the return function (and thus the value) from which the same optimal policy can be derived. For example, consider the task of navigating through a maze. Any form of the return function that gives higher values to the correct path leading to the exit would generate the same optimal policy. However, these different forms might lead to dramatically different speeds of learning this policy. For example, suppose the agent only receives a long-delayed non-zero reward when it reaches the exit. A return function that exponentially discounts this reward backwards along the path might incur very slow learning at the beginning of the correct path, while a return function that propagates this reward back without decay might greatly ease the learning at the beginning. Based on this observation, our paper is concerned with the following question: is there a better form of the return function than the exponentially discounted sum, and can we design an algorithm to learn such a function automatically?
In fact, several previous works aim at finding better return or value functions that make policy learning faster and more stable. They fall mainly into two categories. The first kind modifies the form of the return function. For example, xu2018meta proposed to use meta-learning to learn the hyperparameters $\gamma$ and $\lambda$ used in TD($\lambda$) methods. This generalizes the case of fixed values of $\gamma$ and $\lambda$, but the return function is still constrained to the exponentially discounted sum form. arjona2018rudder employed an LSTM model to predict the delayed terminal reward at each time step, and then redistributed the delayed reward to previous time steps according to the prediction error (or a heuristic analysis of the hidden states of the LSTM). However, there is no guarantee that the heuristic redistribution leads to better performance or faster learning. The second kind modifies the rewards used for return computation and does not change the form of the return function at all. Examples include reward shaping (ng1999policy, ), reward clipping (mnih2013playing, ), intrinsic reward (chentanez2005intrinsically, ; icarte2018using, ), exploration bonus (bellemare2016unifying, ), and reward from auxiliary tasks (jaderberg2016reinforcement, ). In addition to these hand-crafted manipulations of rewards, a few recent works attempted to learn the manipulations automatically (zheng2018learning, ; xu2018meta, ).
In clear contrast to the aforementioned works, our work aims at automatically learning an appropriate mathematical form of the return function. Rather than just learning the hyperparameters of the exponentially discounted sum, we propose to use a general mathematical form for the return function, and employ meta-learning to learn the return function in an end-to-end manner. In particular, we argue that the general form of the return function should take into account the information of the whole trajectory instead of merely the current state. This mimics the process of human reasoning, as humans tend to analyze a sequence of moves as a whole. We implement our idea on top of modern actor-critic algorithms (babaeizadeh2017ga3c, ; schulman2017proximal, ), and demonstrate its efficiency first in a specially designed Maze environment (mattchantk_2016, ), and then in the high-dimensional Atari games using the Arcade Learning Environment (ALE) (bellemare13arcade, ; machado17arcade, ). The experimental results clearly indicate the advantages of automatically learning return functions in reinforcement learning.
2 Background and Related Work
As mentioned above, there are in general two ways to modify how the return is computed. The first directly modifies the mathematical form of the return function, and the second modifies the rewards used in the return computation. Our method falls into the first class and is orthogonal to the second.
2.1 Modifying The Mathematical Form of Return Function
Many works have studied the discounting factor in the exponentially discounted return function. lattimore2011time proposed to generalize exponential discounting through discount functions that change with the agent's age. Choosing a different $\gamma$ at different times has been proposed in (franccois2015discount, ; OpenAI_dota, ). xu2018meta treated the hyper-parameters [$\gamma$, $\lambda$] in TD($\lambda$) as learnable parameters that can be optimized by a meta-gradient approach. pitis2019rethinking introduced a flexible state-action discounting factor by characterizing rationality in sequential decision making. barnard1993temporal generalized the work of singh1992scaling to propose a multi-time-scale TD learning model, based on which romoff2019separating ; reinke2017average ; sherstan2018generalizing explored decomposing the original discounting factor into multiple ones. arjona2018rudder proposed to redistribute the delayed return to previous time steps by analyzing the hidden states of an LSTM. All these works suggest that the form of the return function is one of the key elements in RL. Our work generalizes them by proposing a general form of the return function.
2.2 Reward Augmentation
Another family of works focuses on manipulating the reward function to enhance the learning of the agent. Reward shaping (ng1999policy, ) shows what property a reward modification should possess to preserve the optimal policy, namely that it is potential-based. Many works focus on designing intrinsic rewards to help learning, e.g., (oudeyer2009intrinsic, ; schmidhuber2010formal, ; stadie2015incentivizing, ) used hand-engineered intrinsic rewards that greatly accelerate learning. Exploration bonuses form another class of methods, which add a bonus reward to improve the exploration behaviour of the agent, e.g., bellemare2016unifying ; martin2017count ; tang2017exploration propose to use the pseudo-count of states to derive the bonus reward. (sutton2011horde, ; mirowski2016learning, ; jaderberg2016reinforcement, ) propose to augment the reward by considering information coming from the task itself.
2.3 Relational Reinforcement Learning
The attention mechanism has proven successful at extracting the relationships between words in sentences in natural language processing vaswani2017attention ; bahdanau2014neural , and has recently been combined with relational learning (dvzeroski2001relational, ) to extract the relations between entities in a single state, so as to enhance learning in RL zambaldi2018relational . To encode the trajectory information in the general return function, we also employ the attention mechanism, applied not to different entities in a single state but to different state-action pairs in a trajectory. More details are given in Section 3.3.
3.1 Classic RL Settings
Here we briefly review the classic RL setting, where return and value are defined through the exponentially discounted sum of future rewards. We also review the policy gradient theorem, upon which our method is built. We assume an episodic setting. At each time step $t$, the agent receives a state $s_t$ from the state space $\mathcal{S}$, takes an action $a_t$ in the action space $\mathcal{A}$, receives a reward $r_t$ from the environment based on the reward function $r(s_t, a_t)$, and then transfers to the next state $s_{t+1}$ according to the transition dynamics $P(s_{t+1} \mid s_t, a_t)$. Given a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$, the return of $\tau$ is computed in the following exponentially discounted sum form: $G(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$, $\gamma \in [0, 1]$, where $\gamma$ is the discounting factor. Similarly, the return of a state-action pair $(s_t, a_t)$ is computed as:
$$G(s_t, a_t) = \sum_{l=t}^{T} \gamma^{\,l-t}\, r_l .$$
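The exponentially discounted return of every state-action pair in an episode can be computed in one backward pass. The following is a minimal sketch, not taken from the paper; the function name, the reward sequence and the value of the discounting factor are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G(s_t, a_t) = sum_{l >= t} gamma^(l - t) * r_l for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical sparse-reward episode: only the last step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)
```

With a delayed reward, the return seen at the first step shrinks geometrically with the length of the episode, which is the effect the introduction refers to.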
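A policy $\pi$ (which is usually parameterized by $\theta$ and then written $\pi_\theta$) is a probability distribution over actions given a state $s$, $\pi(a \mid s)$. We say a trajectory $\tau$ is generated under policy $\pi$ if all the actions along the trajectory are chosen following $\pi$, i.e., $\tau \sim \pi$ means $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Given a policy $\pi$, the value of a state $s$ is defined as the expected return of all the trajectories when the agent starts at $s$ and then follows $\pi$: $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[G(\tau) \mid s_0 = s]$. Similarly, the value of a state-action pair $(s, a)$ is defined as the expected return of all trajectories when the agent starts at $s$, takes action $a$, and then follows $\pi$: $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[G(\tau) \mid s_0 = s, a_0 = a]$.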
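The performance of a policy $\pi$ can be measured as $\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0}[V^{\pi}(s_0)]$, where $\rho_0$ is the initial state distribution. For brevity, we will write $\pi$ for $\pi_\theta$ throughout. The famous policy gradient theorem (sutton2000policy, ) states:
$$\nabla_\theta\, \eta(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi(a_t \mid s_t)\, Q^{\pi}(s_t, a_t)\Big].$$
Usually, a baseline value can be subtracted from $Q^{\pi}(s_t, a_t)$ to lower the variance, e.g., in A3C (mnih2016asynchronous, ) the state value $V^{\pi}(s_t)$ is used. In the following, for simplicity of derivation, we will assume that $Q^{\pi}(s_t, a_t)$ is computed using a sample return $G(s_t, a_t)$.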
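To make the sample-based estimator concrete, the sketch below computes the score-function gradient for a softmax policy, with and without a baseline. It is an illustration only: the policy is assumed to be state-independent (a single vector of logits), the baseline is a constant rather than a learned state value, and the actions and returns are made-up numbers.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(logits, actions, returns, baseline=0.0):
    """Sum over t of grad log pi(a_t) * (G_t - baseline), using sample returns."""
    grad = [0.0] * len(logits)
    for a, g in zip(actions, returns):
        glp = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += glp[i] * (g - baseline)
    return grad


# Hypothetical episode: two actions, three steps, hand-picked sample returns.
logits = [0.0, 0.0]
actions = [0, 1, 1]
returns = [1.0, 3.0, 2.0]
pg_plain = policy_gradient_estimate(logits, actions, returns)
pg_base = policy_gradient_estimate(logits, actions, returns, baseline=2.0)
print(pg_plain, pg_base)
```

On a single trajectory the two estimates differ; the baseline leaves the expectation of the estimator unchanged while typically shrinking the magnitude of individual samples.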
3.2 Learning a General Form of Return Function
The key of our method is that we do not restrict the functional form of the return to be the exponentially discounted sum of future rewards. Instead, we compute it using an arbitrary function $g$, which is parameterized by a neural network with parameters $\beta$, and we call $g_\beta$ the return generating model (RGM). In addition, the function $g_\beta$ considers the information of the whole trajectory:
$$G^{g}(s_t, a_t) = g_\beta(\tau, t).$$
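We can then optimize our policy with regard to this new return by substituting it into the policy gradient theorem:
$$\theta' = \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G^{g}(s_t, a_t),$$
where $\alpha$ is the learning rate and $\theta'$ denotes the updated policy parameters. We now show how to train the return generating model $g_\beta$, so that it renders faster learning when the policy is updated using the new return.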
For simplicity of derivation, we take $G^{g}$ in the following particular form:
$$G^{g}(s_t, a_t) = \sum_{l=t}^{T} w_l\, r_l, \qquad (w_0, \dots, w_T) = g_\beta(\tau),$$
i.e., the new return is calculated as a linear combination of future rewards, and the linear coefficients $w_l$ are computed by the return generating model $g_\beta$. For the sake of convergence, we require that $0 \le w_l \le 1$, as in the case of the exponentially discounted sum form.
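The linear-combination return can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the per-step scores stand in for the output of the return generating model, they are normalized with a softmax, and the numbers are hand-picked so that the last step dominates.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def linear_returns(rewards, coeffs):
    """Return sum_{l >= t} coeffs[l] * rewards[l] for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += coeffs[t] * rewards[t]
        returns[t] = running
    return returns


# Hypothetical episode with one delayed reward; the scores are assumed model outputs.
rewards = [0.0, 0.0, 0.0, 1.0]
scores = [0.0, 0.0, 0.0, 10.0]
coeffs = softmax(scores)
new_returns = linear_returns(rewards, coeffs)
print(coeffs, new_returns)
```

Because the coefficient attached to a reward does not depend on the step from which the return is measured, a coefficient close to 1 on the final reward delivers that reward almost undiminished to every earlier step.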
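Using meta-learning, it is straightforward to set the learning objective of $\beta$ to be the performance of the updated policy $\pi_{\theta'}$, and thus $\beta$ can be trained using the chain rule. A similar idea has also been used in xu2018meta and zheng2018learning . Formally, with the particular linear combination form of the new return, the update in Eq. 3 becomes:
$$\theta' = \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=t}^{T} w_l\, r_l ,$$
where $w_l$ is the coefficient that $g_\beta$ assigns to the reward $r_l$ and $\alpha$ is the learning rate. We want the new return to maximize the effect of such an update, i.e., to maximize the performance of the updated policy, $\eta(\pi_{\theta'})$. By the chain rule, we have:
$$\nabla_\beta\, \eta(\pi_{\theta'}) = \frac{\partial \eta(\pi_{\theta'})}{\partial \theta'}\, \frac{\partial \theta'}{\partial \beta}.$$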
In practice, when computing $\nabla_{\theta'} \eta(\pi_{\theta'})$, we do not actually sample a new trajectory using the updated policy $\pi_{\theta'}$, as this would increase the sample complexity. Instead, we compute it on the current trajectory with importance sampling:
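One way to make this meta-gradient computation concrete is the toy sketch below. It rests on several assumptions that are not taken from the paper: the policy is a state-independent softmax over two actions, the return generating model is replaced by a vector of free logits `beta` whose softmax gives the coefficients, the updated policy is evaluated with per-step importance ratios on the same trajectory, and the evaluation uses the ordinary discounted return. The analytic chain rule is written out by hand so that it can be checked against finite differences.

```python
import math


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]


def grad_log_pi(theta, a):
    """Gradient of log softmax(theta)[a] w.r.t. theta: one_hot(a) - pi."""
    p = softmax(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]


def inner_update(theta, beta, actions, rewards, alpha):
    """theta' = theta + alpha * sum_t grad log pi(a_t) * sum_{l>=t} w_l r_l."""
    w = softmax(beta)
    theta_new = list(theta)
    tail = 0.0
    for t in reversed(range(len(rewards))):
        tail += w[t] * rewards[t]
        g = grad_log_pi(theta, actions[t])
        theta_new = [x + alpha * gi * tail for x, gi in zip(theta_new, g)]
    return theta_new


def surrogate(theta_new, theta, actions, eval_returns):
    """Importance-weighted performance of the updated policy on the old trajectory."""
    p_new, p_old = softmax(theta_new), softmax(theta)
    return sum(p_new[a] / p_old[a] * g for a, g in zip(actions, eval_returns))


def meta_gradient(theta, beta, actions, rewards, eval_returns, alpha):
    """d surrogate / d beta via the chain rule through theta' and w = softmax(beta)."""
    w = softmax(beta)
    theta_new = inner_update(theta, beta, actions, rewards, alpha)
    p_new, p_old = softmax(theta_new), softmax(theta)
    # (1) d surrogate / d theta'
    d_theta = [0.0] * len(theta)
    for a, g in zip(actions, eval_returns):
        ratio = p_new[a] / p_old[a]
        glp = grad_log_pi(theta_new, a)
        d_theta = [d + ratio * g * x for d, x in zip(d_theta, glp)]
    # (2) d theta' / d w_l = alpha * r_l * sum_{t<=l} grad log pi(a_t)
    d_w, prefix = [], [0.0] * len(theta)
    for l in range(len(rewards)):
        prefix = [p + x for p, x in zip(prefix, grad_log_pi(theta, actions[l]))]
        d_w.append(alpha * rewards[l] * sum(d * p for d, p in zip(d_theta, prefix)))
    # (3) softmax Jacobian: d w_l / d beta_k = w_l * (delta_lk - w_k)
    avg = sum(wl * d for wl, d in zip(w, d_w))
    return [wk * (dk - avg) for wk, dk in zip(w, d_w)]


# Hypothetical three-step episode with a delayed reward.
theta, beta, alpha, gamma = [0.0, 0.0], [0.0, 0.0, 0.0], 0.5, 0.9
actions, rewards = [0, 1, 0], [0.0, -0.1, 1.0]
eval_returns = [sum(gamma ** (l - t) * rewards[l] for l in range(t, 3)) for t in range(3)]
meta_grad = meta_gradient(theta, beta, actions, rewards, eval_returns, alpha)
print(meta_grad)
```

In a real implementation the three factors would normally be obtained by automatic differentiation through the policy update rather than derived by hand.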
For clarity of illustration, we only show our algorithm in the simplest case, i.e., 1) using the linear combination of future rewards as the new form of the return function; 2) using vanilla gradient ascent to update $\theta$; and 3) using the sample return to approximate the Q-value $Q^{\pi}(s_t, a_t)$. In fact, our method can be applied much more broadly: 1) $G^{g}$ can take any form as long as it is differentiable with respect to the parameter $\beta$; 2) the update of the policy parameters $\theta$ can employ any modern optimizer like RMSProp (tieleman2012lecture, ) or Adam (kingma2014adam, ), as long as the procedure of computing $\theta'$ is fully differentiable with respect to $\beta$ (as in Eq. 8); and 3) our method can be combined with any advanced actor-critic algorithm like A3C (mnih2016asynchronous, ) or PPO (schulman2017proximal, ), as long as the new gradient $\nabla_{\theta'} \eta(\pi_{\theta'})$ can be effectively evaluated (as in Eq. 7). For example, we can use the clipped surrogate objective proposed in PPO to measure the performance of the updated policy $\pi_{\theta'}$, and Eq. 9 then becomes:
Viewing in Trajectory. In this subsection we present how we design the structure of the return generating model $g_\beta$. Our first design principle is that $g_\beta$ should be able to use the information of the whole trajectory to generate the return. This mimics the process of human reasoning, as we tend to analyze a sequence of moves as a whole. Our second design principle is that the model should excel at analyzing the relationship between different time steps when generating the return, which is again natural to human reasoning. With these two design principles, we employ the Multi-Head Attention module (vaswani2017attention, ) as the main building block of our model; specifically, we adopt the encoder part of the Transformer architecture vaswani2017attention . The model takes a whole trajectory $\tau$ as input, embeds it into a vector representation of dimension $d$, and then passes it through a stack of layers that consist of multi-head attention and residual feed-forward connections. Finally, the model uses a feed-forward generator to produce the new returns $G^{g}$. In the linear combination case, the generator uses a softmax to produce the normalized linear coefficients.
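The data flow described above can be sketched with a heavily simplified encoder. Everything below is an assumption made for illustration: a single attention head and a single layer instead of a multi-head stack, no layer normalization or positional encoding, random untrained weights, and made-up feature and embedding sizes. It only shows how per-step trajectory features can be turned into one normalized coefficient per time step.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rgm_coefficients(traj_feats, params):
    """Map (T, F) per-step state-action features to T coefficients summing to 1."""
    h = traj_feats @ params["embed"]                            # (T, d) embedding
    q, k, v = h @ params["wq"], h @ params["wk"], h @ params["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))              # (T, T) step-to-step weights
    h = h + attn @ v                                            # residual self-attention
    h = h + np.maximum(h @ params["w1"], 0.0) @ params["w2"]    # residual feed-forward
    scores = (h @ params["out"]).squeeze(-1)                    # (T,) one score per step
    return softmax(scores)                                      # normalized coefficients


# Hypothetical sizes and untrained weights.
rng = np.random.default_rng(0)
T, F, d = 5, 3, 8
params = {
    "embed": rng.normal(scale=0.1, size=(F, d)),
    "wq": rng.normal(scale=0.1, size=(d, d)),
    "wk": rng.normal(scale=0.1, size=(d, d)),
    "wv": rng.normal(scale=0.1, size=(d, d)),
    "w1": rng.normal(scale=0.1, size=(d, 2 * d)),
    "w2": rng.normal(scale=0.1, size=(2 * d, d)),
    "out": rng.normal(scale=0.1, size=(d, 1)),
}
traj_feats = rng.normal(size=(T, F))
coeffs = rgm_coefficients(traj_feats, params)
print(coeffs)
```

Each coefficient depends on every step of the trajectory through the attention weights, which is the property the first design principle asks for.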
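From New Return to New Value. The new return function naturally brings about a new value function. Indeed, we can similarly define the value of $s$ and $(s, a)$ as the expected value of the new returns of trajectories starting from $s$ (taking action $a$) and following policy $\pi$:
$$V_g^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\big[G^{g}(s_0, a_0) \mid s_0 = s\big], \qquad Q_g^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\big[G^{g}(s_0, a_0) \mid s_0 = s,\ a_0 = a\big].$$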
Here we give only a brief analysis of this new value definition (an obvious direction for future work is a deeper analysis of such definitions). The new value definition is far more general than the original one, but this generality also loses useful properties; perhaps the most important is that the Bellman equation no longer applies. The Bellman equation states that the value of the current state $s_t$ can be computed by bootstrapping from the value of the next state $s_{t+1}$, and the correctness of such bootstrapping relies on the special exponentially discounted sum form of the return function. Similarly, temporal difference learning methods (e.g., Q-learning, SARSA) would usually fail with the new value, as such methods can be viewed as sample approximations of the Bellman equation. On the other hand, Monte-Carlo methods for evaluating the new value still work, as they rely only on the law of large numbers.
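A small numerical check illustrates why bootstrapping is tied to the discounted form. The sketch below is an illustration with made-up rewards and coefficients, not an experiment from the paper: along a single trajectory the discounted return satisfies the one-step recursion exactly, whereas a return built from arbitrary per-step coefficients leaves a non-zero gap under the same recursion.

```python
def discounted_returns(rewards, gamma):
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def linear_returns(rewards, coeffs):
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += coeffs[t] * rewards[t]
        returns[t] = running
    return returns


# Hypothetical trajectory and hand-picked coefficients.
rewards = [1.0, 2.0, 3.0]
gamma = 0.5
G = discounted_returns(rewards, gamma)
# The discounted return obeys G_t = r_t + gamma * G_{t+1}.
bellman_ok = all(
    abs(G[t] - (rewards[t] + gamma * G[t + 1])) < 1e-12
    for t in range(len(rewards) - 1)
)

coeffs = [0.2, 0.3, 0.5]
H = linear_returns(rewards, coeffs)
# The same recursion applied to the general linear return leaves a residual.
bellman_gap = [H[t] - (rewards[t] + gamma * H[t + 1]) for t in range(len(rewards) - 1)]
print(bellman_ok, bellman_gap)
```

A Monte-Carlo estimate, which simply averages complete sampled returns, does not use this recursion and is therefore unaffected.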
4.1 General Setups
In this subsection, we describe the general setup for the following experiments. We implemented our RGM on top of the A2C (mnih2016asynchronous, ) algorithm. All experiments were run on a machine equipped with an Intel(R) Xeon(R) CPU E5-2690 and four Nvidia Tesla M40 GPUs. In the following, we denote the A2C baseline as the vanilla A2C, and denote our algorithm as A2C + RGM (short for Return Generating Model). In both the illustrative case and the Atari experiments, the new form of the return function is the linear combination of the future rewards, as discussed in Section 3.2 and Eq. 4. The RGM has 4 stacked layers and 4 heads in its multi-head attention.
4.2 Illustrative Case
In this section, we demonstrate our algorithm on an illustrative maze example. As shown in the leftmost panel of Figure 2, the maze (mattchantk_2016, ) is a 2-D grid world of size . The agent starts from the top-left corner and needs to find its way to the exit at the bottom-right corner. Black lines in the maze represent unbreakable walls. The key feature of the maze is that it has portals, shown as colored grids in Figure 2, which transport the agent immediately from one location to another with the same color. Indeed, we deliberately created four isolated rooms in the maze, so successfully reaching the exit requires the agent to use these portals to move between rooms. Two possible routes are marked in Figure 2.
At each time step, the agent gets a state which is the coordinate of its current location, chooses a moving direction among up, down, left and right, and receives a reward of if it does not reach the exit, and if it reaches the exit, which also ends the game. We set to be very small, e.g., in our experiments. Therefore, the maze is an environment with a very delayed reward. Besides, from a human perspective, we would pay special attention to the portals, as they enable non-consecutive changes of spatial location.
We compare the vanilla A2C algorithm and our A2C + RGM algorithm. We use separate value and policy networks; both are feed-forward MLPs with three hidden layers of 64 neurons and ReLU activations. Figure 3 (a) compares the learning curves of these two algorithms. As an ablation study, we implemented two more versions of our A2C + RGM besides the vanilla one. 1) ‘RGM + target’: since our RGM is always learning (changing) itself, the learning objective it provides to the value network is also changing, which might cause oscillation in the learning of the value network. To alleviate this, similar to DQN (mnih2013playing, ) and DDPG (lillicrap2015continuous, ), we use a target network of the RGM and optimize the value network towards the target network to stabilize its learning. The parameters of the target network are copied from the learning RGM periodically. 2) ‘RGM without attention’: we replaced all the attention modules with feed-forward layers to test whether the attention module is necessary. The results in Figure 3 (a) show that our algorithm learns dramatically faster and better than the vanilla A2C algorithm. They also show that the attention module is necessary for stable learning, and that using a target network does not make a large difference to learning in the maze environment.
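The periodic copy used by the ‘RGM + target’ variant can be sketched as below. The class name, the dictionary-of-lists parameter format and the synchronization period are hypothetical; the text does not specify how often the copy happens.

```python
import copy


class PeriodicTarget:
    """Keeps a frozen snapshot of some parameters, refreshed every `sync_every` steps."""

    def __init__(self, online_params, sync_every):
        self.sync_every = sync_every
        self.target_params = copy.deepcopy(online_params)
        self.steps = 0

    def step(self, online_params):
        """Call once per training step; returns the snapshot to compute targets with."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_params = copy.deepcopy(online_params)
        return self.target_params


# Hypothetical training loop: the online parameter changes at every step.
online = {"w": [0.0]}
target = PeriodicTarget(online, sync_every=3)
history = []
for _ in range(5):
    online["w"][0] += 1.0
    snapshot = target.step(online)
    history.append(snapshot["w"][0])
print(history)
```

Between synchronizations the value network sees returns generated by a fixed copy of the model, so its regression target changes only at the copy points.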
To gain a deeper understanding of how the RGM works, we visualized the values of the grids under the vanilla A2C's value network and our A2C + RGM's value network. The result is shown in Figure 2. As expected, the values of the grids learned by vanilla A2C decrease exponentially as the distance to the exit increases, and there is no difference in value between the portal grids and ordinary grids. However, the values learned with our RGM are quite different. They show that a large part of the return is redistributed to the beginning of the correct path that leads to the exit, and the value roughly follows a decreasing trend as the agent approaches the exit. Also, compared with the vanilla values, the portals in our A2C + RGM tend to have much higher values than the ordinary grids around them. As discussed in the last section, both value functions give the correct policy, as they both assign higher values along the correct path. However, as indicated by Figure 3 (a), these two value functions differ greatly in learning speed.
To further investigate how the new value is learned, we visualized the linear coefficients $w_l$ generated by our RGM along the correct path, and compared them with the traditional discounted weights $\gamma^{t}$. The result is shown in Figure 3 (b). We normalize both sets of coefficients so that each sums to 1. With , the discounted return gives almost equal weight to each reward encountered along the path. In contrast, our RGM assigns almost all the weight to the delayed reward at the last time step, which is the only positive reward the agent receives, upon reaching the exit. As a result, when using the traditional discounted return, the delayed reward (which in our setting is also the only effective learning signal) is exponentially decayed times before it can reach the initial state, which greatly hurts its propagation; on the other hand, our RGM can propagate this learning signal back to all previous states with nearly zero loss (recall that our RGM computes the return as $G^{g}(s_t, a_t) = \sum_{l=t}^{T} w_l r_l$; if some $w_l$ is near 1, then $r_l$ is used almost in full when computing every $G^{g}(s_t, a_t)$ with $t \le l$). With this effective back-propagation of the learning signal, our RGM greatly eases the learning of the agent.
Figure 3: (a) Learning curves of the compared algorithms. The Y-axis shows the moving average reward of the last 1000 episodes, and the shaded region represents half a standard deviation. (b) Visualization of the linear coefficients generated by the RGM and the discounted weights $\gamma^{t}$ along the correct path, which is labeled by the arrows in the left maze.
4.3 Deep Reinforcement Learning Results on Atari Games
We also tested our A2C + RGM on multiple Atari games from the Arcade Learning Environment (ALE) bellemare2013arcade . We implemented the baseline A2C algorithm in PyTorch with exactly the same network architecture as in mnih2016asynchronous , and trained it using the same hyper-parameters as in the OpenAI implementation dhariwal2017openai , except for the parameter $\epsilon$ used in RMSProp. Due to a slight difference between the implementations of RMSProp in PyTorch and TensorFlow, we could only reproduce results comparable to those in published papers by tuning $\epsilon$ of RMSProp in PyTorch to be . Our RGM has one extra hyper-parameter, which is the learning rate . We detail below how we set it. We do not use a separate target network for the RGM, as it did not bring significant help in the maze experiments.
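One commonly noted difference between the two RMSProp implementations, which may or may not be the one meant here, is where the stabilizing constant enters the denominator: PyTorch's documented update divides by the square root of the running mean square plus the constant, while the TensorFlow 1.x optimizer documents the constant inside the square root. The sketch below compares single update steps under these two conventions; the gradient, learning rate, decay and constant are made-up values, and momentum and centering are omitted.

```python
import math


def rmsprop_step_eps_outside(grad, mean_square, lr, decay, eps):
    """Step size lr * g / (sqrt(ms) + eps), the convention assumed for PyTorch."""
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    return lr * grad / (math.sqrt(mean_square) + eps), mean_square


def rmsprop_step_eps_inside(grad, mean_square, lr, decay, eps):
    """Step size lr * g / sqrt(ms + eps), the convention assumed for TensorFlow 1.x."""
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    return lr * grad / math.sqrt(mean_square + eps), mean_square


# Hypothetical small gradient, for which the constant dominates the denominator.
grad, lr, decay, eps = 1e-3, 7e-4, 0.99, 1e-5
step_outside, _ = rmsprop_step_eps_outside(grad, 0.0, lr, decay, eps)
step_inside, _ = rmsprop_step_eps_inside(grad, 0.0, lr, decay, eps)
print(step_outside, step_inside)
```

With the same nominal constant, the two conventions can produce very different effective step sizes when gradients are small, so a value tuned for one library need not transfer to the other.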
Figure 4 compares our A2C + RGM with the baseline A2C on 6 Atari games: Bowling, Privateeye, Frostbite, Tennis, Mspacman, and SpaceInvaders. The first four games all have very rare and delayed rewards, e.g., in Privateeye it takes the agent 180 time steps to collect the glue and get its first reward, while the last two games have rich immediate rewards. Using the original discounted sum return, learning in the delayed-reward games is much more difficult than in the rich-reward games. As shown in Figure 4, and as expected, our A2C + RGM brings large benefits in the delayed-reward games by learning a better form of return computation. Meanwhile, it also boosts performance in the rich-reward games, where learning with the original return is already easy. For our RGM, we explored the following values for learning rate: . We plotted the best results from the search.
Figure 4: (a) Bowling, (b) Privateeye, (c) Frostbite, (d) Tennis, (e) Mspacman, (f) SpaceInvaders.
Return and value serve as the key objectives that guide the learning of the policy. One key insight is that there can be many different ways to define the computational form of the return (and thus the value) from which the same optimal policy can be derived. However, these different forms can lead to dramatic differences in learning speed. In this paper, we propose to use an arbitrary general form for return computation, and design an end-to-end algorithm that learns such a general form by meta-gradient methods to enhance policy learning. We test our method on a specially designed maze environment and several Atari games, and the experimental results show that it effectively learns new return computation forms that greatly improve learning performance.
-  Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
-  Silviu Pitis. Rethinking the discount factor in reinforcement learning: A decision theoretic approach. arXiv preprint arXiv:1902.02893, 2019.
-  Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pages 2396–2407, 2018.
-  Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
-  Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
-  Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
-  Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2112–2121, 2018.
-  Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
-  Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
-  Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.
-  Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. Reinforcement learning through asynchronous advantage actor-critic on a gpu. In ICLR, 2017.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  MattChanTK. Mattchantk/gym-maze, 2016.
-  M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.
-  Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. CoRR, abs/1709.06009, 2017.
-  Tor Lattimore and Marcus Hutter. Time consistent discounting. In International Conference on Algorithmic Learning Theory, pages 383–397. Springer, 2011.
-  Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. How to discount deep reinforcement learning: Towards new dynamic strategies. arXiv preprint arXiv:1512.02011, 2015.
-  OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
-  Etienne Barnard. Temporal-difference methods and markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357–365, 1993.
-  Satinder P Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Machine Learning Proceedings 1992, pages 406–415. Elsevier, 1992.
-  Joshua Romoff, Peter Henderson, Ahmed Touati, Yann Ollivier, Emma Brunskill, and Joelle Pineau. Separating value functions across time-scales. arXiv preprint arXiv:1902.01883, 2019.
-  Chris Reinke, Eiji Uchibe, and Kenji Doya. Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing, pages 789–800. Springer, 2017.
-  Craig Sherstan, James MacGlashan, and Patrick M Pilarski. Generalizing value estimation over timescale. Network, 2:3, 2018.
-  Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
-  Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
-  Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. In ICLR, 2016.
-  Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.
-  Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2750–2759, 2017.
-  Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
-  Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  Sašo Džeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Machine learning, 43(1-2):7–52, 2001.
-  Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.
-  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
-  Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
-  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
-  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. GitHub, GitHub repository, 2017.