1 Introduction
One of the challenges facing an agent designer in formulating a sequential decision making task as a Reinforcement Learning (RL) problem is that of defining a reward function. In some cases a choice of reward function is clear from the designer’s understanding of the task. For example, in board games such as Chess or Go the notion of win/loss/draw comes with the game definition, and in Atari games there is a game score that is part of the game. In other cases there may not be any clear choice of reward function. For example, in domains in which the agent is interacting with humans in the environment and the objective is to maximize human satisfaction it can be hard to define a reward function. Similarly, when the task objective contains multiple criteria such as minimizing energy consumption, maximizing throughput, and minimizing latency, it is not clear how to combine these into a single scalar-valued reward function.
Even when a reward function can be defined, it is not unique in the sense that certain transformations of the reward function, e.g., adding a potential-based reward (Ng et al., 1999), will not change the resulting ordering over agent behaviors. While the choice of potential-based or other (policy) order-preserving reward function used to transform the original reward function does not change what the optimal policy is, it can change for better or for worse the sample (and computational) complexity of the RL agent learning from experience in its environment using the transformed reward function.
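As a small illustration of the potential-based transformation of Ng et al. (1999), the sketch below adds the shaping term F(s, s') = γΦ(s') − Φ(s) to a sparse reward and checks that the shaped return differs from the original one only by a term fixed by the first and last states. The chain of states, the potential values, and the function names are hypothetical and chosen only for this example.

```python
def shaped_reward(reward, phi_s, phi_next, gamma):
    """Return r + F, where F = gamma * Phi(s') - Phi(s) is the shaping term."""
    return reward + gamma * phi_next - phi_s


def discounted_sum(values, gamma):
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * v for t, v in enumerate(values))


# Hypothetical 4-state chain visited in the order 0 -> 1 -> 2 -> 3.
potential = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}  # illustrative Phi, an assumption
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]                      # sparse task reward
gamma = 0.9

shaped = [
    shaped_reward(r, potential[s], potential[s_next], gamma)
    for r, s, s_next in zip(rewards, states[:-1], states[1:])
]

# The shaping terms telescope, so the two returns differ only by
# gamma^T * Phi(s_T) - Phi(s_0), which depends on the first and last states.
offset = (gamma ** len(rewards)) * potential[states[-1]] - potential[states[0]]
```

Because the offset does not depend on the intermediate choices, the ordering of behaviors is preserved, while every step now carries a non-zero learning signal.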
Yet another aspect to the challenge of reward design stems from the observation that in many complex real-world tasks an RL agent is simply not going to learn an optimal policy because of various bounds (or limitations) on the agent-environment interaction (e.g., inadequate memory, representational capacity, computation, training data, etc.). Thus, in addressing the reward-design problem one may want to consider transformations of the task-specifying reward function that change the optimal policy. This is because it could result in the bounded agent achieving a more desirable (to the agent designer) policy than otherwise. This is often done in the form of shaping reward functions that are less sparse than an original reward function and so lead to faster learning of a good policy even if it in principle changes what the theoretically optimal policy might be (Rajeswaran et al., 2017). Another example of transforming the reward function to aid learning in RL agents is the use of exploration bonuses, e.g., count-based reward bonuses for agents that encourage experiencing infrequently visited states (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017).
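A count-based bonus can be sketched in a few lines for the tabular case. The example below is a simplification that assumes hashable states and exact visit counts, with a bonus of the form scale / sqrt(N(s)); the cited works use pseudo-counts or hashing to handle high-dimensional observations, which this sketch does not attempt. The class name and the `scale` parameter are illustrative.

```python
import math
from collections import Counter


class CountBonus:
    """Tabular exploration bonus: scale / sqrt(visit count of the state)."""

    def __init__(self, scale):
        self.scale = scale        # assumed bonus coefficient
        self.counts = Counter()   # visit counts per (hashable) state

    def __call__(self, state):
        # Count the current visit first, so the first visit yields `scale`.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


bonus = CountBonus(scale=1.0)
first_visit = bonus("s0")    # state seen once
second_visit = bonus("s0")   # same state seen twice, so the bonus shrinks
```

The bonus is added to the task reward at every step, so rarely visited states look temporarily more rewarding than frequently visited ones.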
The above challenges make reward design difficult, error-prone, and typically an iterative process. Reward functions that seem to capture the designer’s objective can sometimes lead to unexpected and undesired behaviors. Phenomena such as reward hacking (Amodei et al., 2016) illustrate this vividly. There are many formulations and resulting approaches to the problem of reward design including preference elicitation, inverse RL, intrinsically motivated RL, optimal rewards, potential-based shaping rewards, more general reward shaping, and mechanism design; often the details of the formulation depend on the class of RL domains being addressed. In this paper we build on the optimal rewards problem formulation of Singh et al. (2010). We discuss the optimal rewards framework as well as some other approaches for learning intrinsic rewards in Section 2.
Our main contribution in this paper is the derivation of a new stochastic-gradient-based method for learning parametric intrinsic rewards that when added to the task-specifying (hereafter extrinsic) rewards can improve the performance of policy-gradient based learning methods for solving RL problems. The policy-gradient updates the policy parameters to optimize the sum of the extrinsic and intrinsic rewards, while simultaneously our method updates the intrinsic reward parameters to optimize the extrinsic rewards achieved by the policy. We evaluate our method on several Atari games with a state-of-the-art A2C (Advantage Actor-Critic) (Mnih et al., 2016) agent as well as on a few Mujoco domains with a similarly state-of-the-art PPO agent and show that learning intrinsic rewards can outperform using just extrinsic reward as well as using a combination of extrinsic reward and a constant “live bonus” (Duan et al., 2016).
2 Background and Related Work
Optimal rewards and reward design.
Our work builds on the Optimal Reward Framework. Formally, the optimal intrinsic reward for a specific combination of RL agent and environment is defined as the reward which when used by the agent for its learning in its environment maximizes the extrinsic reward. The main intuition is that in practice all RL agents are bounded (computationally, representationally, in terms of data availability, etc.) and the optimal intrinsic reward can help mitigate these bounds. Computing the optimal reward remains a big challenge, of course. The paper introducing the framework used exhaustive search over a space of intrinsic reward functions and thus does not scale. Sorg et al. (2010) introduced PGRD (Policy Gradient for Reward Design), a scalable algorithm that only works with lookahead-search (UCT) based planning agents (and hence the agent itself is not a learning-based agent; only the reward to use with the fixed planner is learned). Its insight was that the intrinsic reward can be treated as a parameter that influences the outcome of the planning process and thus can be trained via gradient ascent as long as the planning process is differentiable (which UCT and related algorithms are). Guo et al. (2016) extended the scalability of PGRD to high-dimensional image inputs in Atari 2600 games and used the intrinsic reward as a reward bonus to improve the performance of the Monte Carlo Tree Search algorithm using the Atari emulator as a model for the planning. A big open challenge is deriving a sound algorithm for learning intrinsic rewards for learning-based RL agents and showing that it can learn intrinsic rewards fast enough to beneficially influence the online performance of the learning-based RL agent. Our main contribution in this paper is to answer this challenge.
Reward shaping and Auxiliary rewards.
Reward shaping (Ng et al., 1999) provides a general answer to the question of which reward function modifications do not change the optimal policy, specifically potential-based rewards. Other attempts have been made to design auxiliary rewards to derive policies with desired properties. For example, the UNREAL agent (Jaderberg et al., 2016) used pseudo-reward computed from unsupervised auxiliary tasks to refine its internal representations. In some other works (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017), a pseudo-count based reward bonus was given to the agent to encourage exploration. Pathak et al. (2017) used self-supervised prediction errors as intrinsic rewards to help the agent explore. In these and other similar examples (Schmidhuber, 2010; Stadie et al., 2015; Oudeyer and Kaplan, 2009), the agent’s learning performance improves through the reward transformations, but the reward transformations are expert-designed and not learned. The main departure point in this paper is that we learn the parameters of an intrinsic reward function that maps high-dimensional observations and actions to rewards.
Hierarchical RL.
Another approach to a form of intrinsic reward is in the work on hierarchical RL. For example, the recent FeUdal Networks (FuNs) (Vezhnevets et al., 2017) is a hierarchical architecture which contains a Manager and a Worker learning at different time scales. The Manager conveys abstract goals to the Worker and the Worker optimizes its policy to maximize the extrinsic reward and the cosine distance to the goal. The Manager optimizes its proposed goals to guide the Worker to learn a better policy in terms of the cumulative extrinsic reward. A large body of work on hierarchical RL also generally involves a higher level module choosing goals for lower level modules. All of this work can be viewed as a special case of creating intrinsic rewards within a multi-module agent architecture. One special aspect of hierarchical RL work is that these intrinsic rewards are usually associated with goals of achievement, i.e., achieving a specific goal state, while in our setting the intrinsic reward functions are general mappings from observation-action pairs to rewards. Another special aspect is that most evaluations of hierarchical RL show a benefit in the transfer setting with typically worse performance on early tasks while the manager is learning and better performance on later tasks once the manager has learned. In our setting we take on the challenge of showing that learning and using intrinsic rewards can help the RL agent perform better while it is learning on a single task. Finally, another difference is that hierarchical RL typically treats the lower-level learner as a black box while we train the intrinsic reward using gradients through the policy module in our architecture.
3 Gradient-Based Learning of Intrinsic Rewards: A Derivation
As noted earlier, the most practical previous work in learning intrinsic rewards using the Optimal Rewards framework was limited to settings where the underlying RL agent was a planning (i.e., needs a model of the environment) agent using lookahead search in some form (e.g., UCT). In these settings the only quantity being learned was the intrinsic reward function. By contrast, in this section we derive our algorithm for learning intrinsic rewards for the setting where the underlying RL agent is itself a learning agent, specifically a policy gradient based learning agent.
3.1 Policy Gradient based RL
Here we briefly describe how policy gradient based RL works, and then we present our method that incorporates it. We assume an episodic, discrete-actions, reinforcement learning setting. Within an episode, the state of the environment at time step $t$ is denoted by $s_t \in \mathcal{S}$, the action the agent takes from action space $\mathcal{A}$ at time step $t$ by $a_t$, and the reward at time step $t$ by $r_t$. The agent’s policy, parameterized by $\theta$ (for example the weights of a neural network), maps a representation of states to a probability distribution over actions. The value of a policy $\pi_\theta$, denoted $J(\pi_\theta)$ or equivalently $J(\theta)$, is the expected discounted sum of rewards obtained by the agent when executing actions according to policy $\pi_\theta$, i.e.,

$$J(\theta) = \mathbb{E}_{s_t \sim T(\cdot \mid s_{t-1}, a_{t-1}),\; a_t \sim \pi_\theta(\cdot \mid s_t)}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big], \tag{1}$$

where $T$ denotes the transition dynamics, $\gamma$ is the discount factor, and the initial state $s_0 \sim \mu$ is chosen from some distribution $\mu$ over states. Henceforth, for ease of notation we will write the above quantity as $J(\theta) = \mathbb{E}_{\theta}\big[\sum_{t=0}^{\infty} \gamma^{t} r_t\big]$.
The policy gradient theorem of Sutton et al. (2000) shows that the gradient of the value $J$ with respect to the policy parameters $\theta$ can be computed as follows from all time steps $t$ within an episode:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\theta}\big[G(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big], \tag{2}$$

where $G(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$ is the return until termination. Note that recent advances such as advantage actor-critic (A2C) learn a critic ($V_\theta(s_t)$) and use it to reduce the variance of the gradient and bootstrap the value after every $n$ steps. However, we present this simple policy gradient formulation (Eq. 2) in order to simplify the derivation of our proposed algorithm and aid understanding.

3.2 LIRPG: Learning Intrinsic Rewards for Policy Gradient
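Notation.
We use the following notation throughout.

$\theta$: policy parameters

$\eta$: intrinsic reward parameters

$r^{ex}$: extrinsic reward from the environment

$r^{in}_\eta = r^{in}_\eta(s, a)$: intrinsic reward estimated by $\eta$

$G^{ex}(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i-t} r^{ex}_i$: extrinsic return

$G^{in}(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i-t} r^{in}_\eta(s_i, a_i)$: intrinsic return

$G^{ex+in}(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i-t} \big(r^{ex}_i + \lambda r^{in}_\eta(s_i, a_i)\big)$: combined return

$J^{ex} = \mathbb{E}_\theta\big[\sum_{t=0}^{\infty} \gamma^{t} r^{ex}_t\big]$: extrinsic value

$J^{in} = \mathbb{E}_\theta\big[\sum_{t=0}^{\infty} \gamma^{t} r^{in}_\eta(s_t, a_t)\big]$: intrinsic value

$J^{ex+in} = \mathbb{E}_\theta\big[\sum_{t=0}^{\infty} \gamma^{t} \big(r^{ex}_t + \lambda r^{in}_\eta(s_t, a_t)\big)\big]$: combined value

$\lambda$: relative weight of intrinsic reward.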
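The returns in the notation above are discounted sums over one episode: one for the extrinsic reward, one for the intrinsic reward, and one for their weighted combination. A minimal sketch, assuming per-step rewards are already available as plain lists (the function names and toy numbers are illustrative, not taken from the paper's code):

```python
def discounted_returns(rewards, gamma):
    """G_t = sum_{i >= t} gamma^(i - t) * r_i for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mixed_returns(r_ex, r_in, gamma, lam):
    """Return of the combined reward r_ex + lam * r_in at every step."""
    mixed = [e + lam * i for e, i in zip(r_ex, r_in)]
    return discounted_returns(mixed, gamma)


# Toy episode with made-up numbers.
r_ex = [0.0, 0.0, 1.0]     # extrinsic rewards
r_in = [0.5, -0.2, 0.1]    # intrinsic rewards (would come from the reward module)
g_ex = discounted_returns(r_ex, gamma=0.5)
g_in = discounted_returns(r_in, gamma=0.5)
g_mix = mixed_returns(r_ex, r_in, gamma=0.5, lam=0.1)
```

Since the discounted sum is linear in the rewards, the combined return equals the extrinsic return plus the weight times the intrinsic return at every step.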
The departure point of our approach to reward optimization for policy gradient is to distinguish between the extrinsic reward, $r^{ex}$, that defines the task, and a separate intrinsic reward $r^{in}$ that additively transforms the extrinsic reward and influences learning via policy gradients. It is crucial to note that the ultimate measure of performance we care about improving is the value of the extrinsic rewards achieved by the agent; the intrinsic rewards serve only to influence the change in policy parameters. Figure 1 shows an abstract representation of our intrinsic reward augmented policy gradient based learning agent.
Algorithm Overview.
An overview of our algorithm, LIRPG, is presented in Algorithm 1. At each iteration of LIRPG, we simultaneously update the policy parameters $\theta$ and the intrinsic reward parameters $\eta$. More specifically, we first update $\theta$ in the direction of the gradient of $J^{ex+in}$, which is the weighted sum of intrinsic and extrinsic rewards. After updating policy parameters, we update $\eta$ in the direction of the gradient of $J^{ex}$, which is just the extrinsic rewards. Intuitively, the policy is updated to maximize both extrinsic and intrinsic reward, while the intrinsic reward function is updated to maximize only the extrinsic reward. We describe more details of each step below.
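Updating Policy Parameters ($\theta$).
Given an episode where the behavior is generated according to policy $\pi_\theta$, we update the policy parameters using regular policy gradient with the sum of intrinsic and extrinsic rewards as the reward:

$$\theta' = \theta + \alpha \nabla_\theta J^{ex+in}(\theta) \tag{3}$$
$$\approx \theta + \alpha\, G^{ex+in}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \tag{4}$$

where $\alpha$ is the policy step size, $G^{ex+in}(s_t, a_t)$ is the return of the combined reward $r^{ex} + \lambda r^{in}$, and Equation 4 is a stochastic gradient update.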
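The policy step can be sketched for a tabular softmax policy, where the gradient of the log-probability with respect to the logits of the visited state is the one-hot action vector minus the action probabilities. This is an illustrative simplification under stated assumptions: the experiments use neural networks and actor-critic variants, the dictionary-of-logits parameterization and function names are made up for the example, and the update is summed over all steps of one episode.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_update(theta, episode, gamma, lam, alpha):
    """One policy-gradient step on the combined reward r_ex + lam * r_in.

    theta:   dict mapping state -> list of action logits (assumed tabular policy).
    episode: list of (state, action, r_ex, r_in) tuples sampled with theta.
    Returns the updated parameters and leaves theta untouched.
    """
    new_theta = {s: list(v) for s, v in theta.items()}
    g = 0.0
    # Walk backwards so g is the combined return from step t onwards.
    for state, action, r_ex, r_in in reversed(episode):
        g = (r_ex + lam * r_in) + gamma * g
        grad = grad_log_pi(theta[state], action)  # evaluated at the old theta
        for k, gk in enumerate(grad):
            new_theta[state][k] += alpha * g * gk
    return new_theta


theta = {"s": [0.0, 0.0]}
episode = [("s", 0, 1.0, 0.0)]
theta_new = policy_update(theta, episode, gamma=0.9, lam=0.1, alpha=0.5)
```

In this toy episode the rewarded action's logit goes up and the other action's logit goes down by the same amount.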
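Updating Intrinsic Reward Parameters ($\eta$).
Given an episode and the updated policy parameters $\theta'$, we update intrinsic reward parameters. Intuitively, updating $\eta$ requires estimating the effect such a change would have on the extrinsic value through the change in the policy parameters. Our key idea is to use the chain rule to compute the gradient as follows:

$$\nabla_\eta J^{ex} = \nabla_{\theta'} J^{ex}\, \nabla_\eta \theta', \tag{5}$$

where the first term ($\nabla_{\theta'} J^{ex}$) sampled as

$$\nabla_{\theta'} J^{ex} \approx G^{ex}(s_t, a_t)\, \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \tag{6}$$

is an approximate stochastic gradient of the extrinsic value with respect to the updated policy parameters $\theta'$ when the behavior is generated by $\pi_{\theta'}$, and the second term can be computed as follows:

$$\nabla_\eta \theta' = \nabla_\eta \big(\theta + \alpha\, G^{ex+in}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big) \tag{7}$$
$$= \nabla_\eta \big(\alpha\, G^{ex+in}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big) \tag{8}$$
$$= \nabla_\eta \big(\alpha \lambda\, G^{in}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big) \tag{9}$$
$$= \alpha \lambda \sum_{i=t}^{\infty} \gamma^{i-t}\, \nabla_\eta r^{in}_\eta(s_i, a_i)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{10}$$

The step from Equation 8 to Equation 9 uses the fact that only the intrinsic part $\lambda G^{in}$ of the combined return $G^{ex+in}$ depends on $\eta$.
Note that to compute the gradient of the extrinsic value with respect to the intrinsic reward parameters $\eta$, we needed a new episode with the updated policy parameters $\theta'$ (cf. Equation 6), thus requiring two episodes per iteration. To improve data efficiency we instead reuse the episode generated by the policy parameters $\theta$ at the start of the iteration and correct for the resulting mismatch by replacing the on-policy update in Equation 6 with the following off-policy update using importance sampling:

$$\nabla_{\theta'} J^{ex} = G^{ex}(s_t, a_t)\, \frac{\nabla_{\theta'} \pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}. \tag{11}$$

The parameters $\eta$ are updated using the product of Equations 10 and 11 with a step-size parameter $\beta$; this approximates a stochastic gradient update (cf. Equation 5).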
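To make the two-level update concrete, the sketch below works through one iteration on a deliberately tiny problem: a single state, a softmax policy with one logit per action, and a tabular intrinsic reward with one value per action. With these assumptions all gradients can be written by hand using only the standard library. The function `lirpg_step` and its argument names are hypothetical; the released implementation relies on neural networks and automatic differentiation rather than explicit Jacobians.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def returns(rewards, gamma):
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def lirpg_step(theta, eta, actions, r_ex, gamma, lam, alpha, beta):
    """One LIRPG-style iteration on a single-state toy problem.

    theta: policy logits, one per action. eta: intrinsic reward, one per action.
    actions, r_ex: one episode sampled from the policy with parameters theta.
    """
    n = len(theta)
    pi = softmax(theta)
    r_in = [eta[a] for a in actions]
    g_mix = returns([e + lam * i for e, i in zip(r_ex, r_in)], gamma)
    g_ex = returns(r_ex, gamma)
    # Score function of the old policy at each step: onehot(a_t) - pi.
    score = [[(1.0 if k == a else 0.0) - pi[k] for k in range(n)] for a in actions]

    # Policy step on the combined return (Equations 3-4).
    theta_new = [theta[k] + alpha * sum(g * s[k] for g, s in zip(g_mix, score))
                 for k in range(n)]

    # Importance-weighted gradient of the extrinsic value at theta_new,
    # reusing the episode generated by the old policy (Equation 11).
    pi_new = softmax(theta_new)
    grad_theta_new = [0.0] * n
    for a, g in zip(actions, g_ex):
        ratio = pi_new[a] / pi[a]
        for k in range(n):
            grad_theta_new[k] += g * ratio * ((1.0 if k == a else 0.0) - pi_new[k])

    # Jacobian d theta_new[k] / d eta[j] (Equations 7-10):
    # alpha * lam * sum_t score_t[k] * d G_in_t / d eta[j].
    jac = [[0.0] * n for _ in range(n)]
    for t in range(len(actions)):
        d_gin = [0.0] * n
        for i in range(t, len(actions)):
            d_gin[actions[i]] += gamma ** (i - t)
        for k in range(n):
            for j in range(n):
                jac[k][j] += alpha * lam * score[t][k] * d_gin[j]

    # Chain rule (Equation 5), followed by gradient ascent on eta.
    grad_eta = [sum(grad_theta_new[k] * jac[k][j] for k in range(n)) for j in range(n)]
    eta_new = [eta[j] + beta * grad_eta[j] for j in range(n)]
    return theta_new, eta_new


theta_new, eta_new = lirpg_step(
    theta=[0.0, 0.0], eta=[0.0, 0.0], actions=[0], r_ex=[1.0],
    gamma=0.9, lam=0.5, alpha=1.0, beta=1.0,
)
```

In this one-step episode, action 0 produced extrinsic reward, so its intrinsic reward increases while the intrinsic reward of the untried action stays unchanged.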
Implementation on A2C and PPO.
We described LIRPG using the most basic policy gradient formulation for simplicity. There have been many advances in policy gradient methods that reduce the variance of the gradient and improve the data efficiency. Our LIRPG algorithm is also compatible with such actor-critic architectures. Specifically, for our experiments on Atari games we used a reasonably state-of-the-art advantage actor-critic (A2C) architecture, and for our experiments on Mujoco domains we used a similarly reasonably state-of-the-art proximal policy optimization (PPO) architecture. We provide all implementation details in supplementary material. Our implementation is available at: https://github.com/Hwhitetooth/lirpg
4 Experiments on Atari Games
Our overall objective in the following first set of experiments is to evaluate whether augmenting a policy gradient based RL agent with intrinsic rewards learned using our LIRPG algorithm (henceforth, augmented agent in short) improves performance relative to the baseline policy gradient based RL agent that uses just the extrinsic reward (henceforth, A2C baseline agent in short). To this end, we first perform this evaluation on multiple Atari games from the Arcade Learning Environment (ALE) platform (Bellemare et al., 2013) using the same open-source implementation with exactly the same hyperparameters of the A2C algorithm (Mnih et al., 2016) from OpenAI (Dhariwal et al., 2017) for both our augmented agent as well as the baseline agent. The extrinsic reward used is the game score change as is standard for the work on Atari games. The LIRPG algorithm has two additional parameters relative to the baseline algorithm, the parameter $\lambda$ that controls how the intrinsic reward is scaled before adding it to the extrinsic reward and the step-size $\beta$; we describe how we choose these parameters below in our results.
We also conducted experiments against another baseline which simply gave a constant positive value as a live bonus to the agent at each time step (henceforth, A2C-bonus baseline agent in short). The live bonus heuristic encourages the agent to live longer so that it will potentially have a better chance of getting extrinsic rewards.
Note that the policy module inside the agent is really two networks, a policy network and a value function network (that helps estimate $G^{ex+in}$ as required in Equation 4). Similarly the intrinsic reward module in the agent is also two networks, a reward function network and a value function network (that helps estimate $G^{ex}$ as required in Equation 6).
4.1 Overall Performance
Figure 2 shows the improvements of the augmented agents over baseline agents on 15 Atari games: Alien, Amidar, Asterix, Atlantis, BeamRider, Breakout, DemonAttack, DoubleDunk, MsPacman, Qbert, Riverraid, RoadRunner, SpaceInvaders, Tennis, and UpNDown. We picked as many games as our computational resources allowed in which the published performance of the underlying A2C baseline agents was good but where the learning was not so fast in terms of sample complexity so as to leave little room for improvement. We ran each agent for separate runs each for million time steps on each game for both the baseline agents and augmented agents. For the augmented agents, we explored the following values for the intrinsic reward weighting coefficient , and the following values for the term , , that weights the loss from the value function estimates with the loss from the intrinsic reward function (the policy component of the intrinsic reward module). The learning rate of the intrinsic reward module, i.e., , was set to for all experiments. We plotted the best results from the hyperparameter search in Figure 2. For the A2C-bonus baseline agents, we explored the value of live bonus over the set on two games, Amidar and MsPacman, and chose the best performing value of for all games. The learning curves of all agents are provided in the supplementary material.
The blue bars in Figure 2 show the human score normalized improvements of the augmented agents over the A2C baseline agents and the A2C-bonus baseline agents. We see that the augmented agent outperforms the A2C baseline agent on all games and has an improvement of more than ten percent on out of games. As for the comparison to the A2C-bonus baseline agent, the augmented agent still performed better on all games except for SpaceInvaders and Asterix. Note that most Atari games are shooting games so the A2C-bonus baseline agent is expected to be a stronger baseline.
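The text does not spell out how the human score normalization is computed. One common convention, assumed here purely for illustration, rescales a raw game score so that a random policy maps to 0 and a human player to 100, and then takes the difference between the augmented and the baseline agent. The function names and the numbers in the usage lines are hypothetical.

```python
def human_normalized(score, random_score, human_score):
    """Rescale a raw score so that random play is 0 and human play is 100."""
    return 100.0 * (score - random_score) / (human_score - random_score)


def normalized_improvement(augmented, baseline, random_score, human_score):
    """Difference of the two agents' human-normalized scores (an assumed definition)."""
    return (human_normalized(augmented, random_score, human_score)
            - human_normalized(baseline, random_score, human_score))


# Made-up scores for a single game.
gain = normalized_improvement(augmented=150.0, baseline=100.0,
                              random_score=0.0, human_score=200.0)
```

Normalizing in this way makes improvements comparable across games whose raw scores live on very different scales.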
4.2 Analysis of the Learned Intrinsic Reward
An interesting question is whether the learned intrinsic reward function learns a general state-independent bias over actions or whether it is an interesting function of state. To explore this question we used the learned intrinsic reward module and the policy module from the end of a good run (cf. Figure 2) for each game with no further learning to collect new data for each game. Figure 3 shows the variation in intrinsic rewards obtained and the actions selected by the agent over thousand steps, i.e. thousand frames, on games. The analysis for all games is in the supplementary material. The red bars show the average intrinsic reward per step for each action. The black segments show the standard deviation of the intrinsic rewards. The blue bars show the frequency of each action being selected. Figure 3 shows that the intrinsic rewards for most actions vary through the episode as shown by large black segments, indirectly confirming that the intrinsic reward module learns more than a state-independent constant bias over actions. By comparing the red bars and the blue bars, we see the expected correlation between aggregate intrinsic reward over actions and their selection (through the policy module that trains on the weighted sum of extrinsic and intrinsic rewards).
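The per-action statistics described above (mean intrinsic reward, its standard deviation, and selection frequency) can be computed from logged steps as in the following sketch. The logging format, a list of actions and a parallel list of intrinsic rewards, is an assumption made for the example, as are the toy numbers.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_action_stats(actions, intrinsic_rewards):
    """Mean and standard deviation of intrinsic reward, and frequency, per action."""
    by_action = defaultdict(list)
    for a, r in zip(actions, intrinsic_rewards):
        by_action[a].append(r)
    total = len(actions)
    return {
        a: {"mean": mean(rs), "std": pstdev(rs), "freq": len(rs) / total}
        for a, rs in by_action.items()
    }


# Toy log: action 0 is chosen three times with varying intrinsic reward.
stats = per_action_stats([0, 0, 1, 0], [0.2, 0.4, -0.1, 0.3])
```

A standard deviation close to zero for every action would indicate a state-independent bias, whereas large deviations indicate that the reward depends on the state in which the action is taken.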
5 Mujoco Experiments
Our main objective in the following experiments is to demonstrate that our LIRPG-based algorithm can extend to a different class of domains and a different choice of baseline actor-critic architecture (namely, PPO instead of A2C). Specifically, we explore domains from the Mujoco continuous control benchmark (Duan et al., 2016), and used the open-source implementation of the PPO (Schulman et al., 2017) algorithm from OpenAI (Dhariwal et al., 2017) as our baseline agent. We also compared LIRPG to the simple heuristic of giving a live bonus as intrinsic reward (PPO-bonus baseline agents for short). As for the Atari game results above, we kept all hyperparameters unchanged to default values for the policy module of both baseline and augmented agents. Finally, we also conduct a preliminary exploration into the question of how robust the learning of intrinsic rewards is to the sparsity of extrinsic rewards. Specifically, we used the delayed versions of the Mujoco domains, where the extrinsic reward is made sparse by accumulating the reward for time steps before providing it to the agent. Note that the live bonus is not delayed when we delay the extrinsic reward for the PPO-bonus baseline agent. We expect that the problem becomes more challenging with increasing but expect that the learning of intrinsic rewards (that are available at every time step) can help mitigate some of that increasing hardness.
Delayed Mujoco benchmark.
We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, HalfCheetah, Walker2d, Ant, and Humanoid. As noted above, to create a more challenging sparse-reward setting we accumulated rewards for , and steps (or until the end of the episode, whichever comes earlier) before giving them to the agent. We trained the baseline and augmented agents for million steps on each environment.
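The delayed-reward construction can be sketched as a transformation of one episode's reward sequence: rewards accumulate and are released every `delay` steps or at the end of the episode, while an optional live bonus is paid at every step without delay. The function name, its arguments, and the toy reward sequence are illustrative assumptions; the actual experiments apply this inside the environment loop.

```python
def delay_rewards(rewards, delay, live_bonus=0.0):
    """Return the per-step rewards the agent observes under delayed feedback.

    rewards:    per-step extrinsic rewards of one episode.
    delay:      number of steps over which rewards are accumulated.
    live_bonus: constant added at every step; it is never delayed.
    """
    observed = []
    accumulated = 0.0
    for t, r in enumerate(rewards):
        accumulated += r
        is_last = t == len(rewards) - 1
        if (t + 1) % delay == 0 or is_last:
            # Release everything accumulated since the last release.
            observed.append(accumulated + live_bonus)
            accumulated = 0.0
        else:
            observed.append(live_bonus)
    return observed


sparse = delay_rewards([1.0, 1.0, 1.0, 1.0, 1.0], delay=2)
with_bonus = delay_rewards([1.0, 1.0, 1.0, 1.0, 1.0], delay=2, live_bonus=0.5)
```

The total extrinsic reward of the episode is unchanged; only its timing becomes sparser, which is what makes credit assignment harder as the delay grows.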
5.1 Overall Performance
Our results comparing the use of learning intrinsic reward with using just extrinsic reward on top of a PPO architecture are shown in Figure 4. We only show the results of a delay of here; the full results can be found in the supplementary material. The dark blue curves are for PPO baseline agents. The light blue curves are PPO-bonus baseline agents, where we explored the value of live bonus over the set and plotted the curves for the domain-specific best performing choice. The red and curves are for the augmented LIRPG agents.
We see that in out of domains learning intrinsic rewards significantly improves the performance of PPO, while in one game (Ant) we got a degradation of performance. Although a live bonus did help on domains, i.e. Hopper and Walker2d, LIRPG still outperformed it on out of domains except for HalfCheetah on which LIRPG got comparable performance. We note that there was no domain-specific hyperparameter optimization for the results in this figure; with such optimization there might be an opportunity to get improved performance for our method in all the domains.
Training with Only Intrinsic Rewards.
We also conducted a more challenging experiment on Mujoco domains in which we used only intrinsic rewards to train the policy module. Recall that the intrinsic reward module is trained to optimize the extrinsic reward. In out of domains, as shown by the green curves denoted by ‘PPO+LIRPG()’ in Figure 4, using only intrinsic rewards achieved similar performance to the red curves where we used a mixture of extrinsic rewards and intrinsic rewards. Using only intrinsic rewards to train the policy performed worse than using the mixture on Hopper but performed even better on HalfCheetah. It is important to note that training the policy using only live-bonus reward without the extrinsic reward would completely fail, because there would be no learning signal that encourages the agent to move forward. In contrast, our result shows that the agent can learn complex behaviors solely from the learned intrinsic reward in the MuJoCo environments, and thus the intrinsic reward captures far more than a live bonus does; this is because the intrinsic reward module takes into account the extrinsic reward structure through its training.
6 Discussion and Conclusion
Our experiments on using LIRPG with A2C on multiple Atari games showed that it helped improve learning performance in all of the games we tried. Similarly using LIRPG with PPO on multiple Mujoco domains showed that it helped improve learning performance in out domains (for the version with a delay of ). Note that we used the same A2C / PPO architecture and hyperparameters in both our augmented and baseline agents. While more empirical work needs to be done to either make intrinsic reward learning more robust or to understand when it helps and when it does not, we believe our results show promise for the central idea of learning intrinsic rewards in complex RL domains.
In summary, we derived a novel practical algorithm, LIRPG, for learning intrinsic reward functions in problems with high-dimensional observations for use with policy gradient based RL agents. This is the first such algorithm to the best of our knowledge. Our empirical results show promise in using intrinsic reward function learning as a kind of meta-learning to improve the performance of modern policy gradient architectures like A2C and PPO.
Acknowledgments.
We thank Richard Lewis for conversations on optimal rewards. This work was supported by NSF grant IIS-1526059, by a grant from Toyota Research Institute (TRI), and by a grant from DARPA’s L2M program. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsor.
References
 Singh et al. (2010) Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.

 Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart J Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287. Morgan Kaufmann Publishers Inc., 1999.
 Rajeswaran et al. (2017) Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pages 6553–6564, 2017.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
 Ostrovski et al. (2017) Georg Ostrovski, Marc G Bellemare, Aäron Oord, and Rémi Munos. Count-based exploration with neural density models. In International Conference on Machine Learning, pages 2721–2730, 2017.
 Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2750–2759, 2017.
 Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 Sorg et al. (2010) Jonathan Sorg, Richard L Lewis, and Satinder Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pages 2190–2198, 2010.
 Guo et al. (2016) Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep learning for reward design to improve monte carlo tree search in atari games. arXiv preprint arXiv:1604.07095, 2016.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
 Schmidhuber (2010) Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 Stadie et al. (2015) Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
 Oudeyer and Kaplan (2009) Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
 Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549, 2017.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 47:253–279, 2013.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Schulman et al. (2015) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
Appendix A Implementation Details
A.1 Atari Experiments
Episode Generation.
As in Mnih et al. (2015), each episode starts by doing a no-op action for a random number of steps after restarting the game. The number of no-op steps is sampled from 0 to 30 uniformly. Within an episode, each action chosen is repeated for 4 frames, before selecting the next action. An episode ends when the game is over or the agent loses a life.
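The episode-generation protocol above can be sketched as a short loop. The environment interface used here (`reset()` returning an observation and a life count, `step(action)` returning observation, reward, game-over flag, and life count) and the index of the no-op action are hypothetical stand-ins chosen for the example; they are not the API of ALE or of the OpenAI implementation.

```python
import random

NOOP_ACTION = 0     # assumed index of the no-op action
MAX_NOOPS = 30      # no-op count is sampled uniformly from 0..30
ACTION_REPEAT = 4   # each chosen action is repeated for 4 frames


def run_episode(env, policy, rng):
    """Generate one episode as a list of (action, summed_reward) pairs.

    env is a hypothetical interface: reset() -> (obs, lives) and
    step(action) -> (obs, reward, game_over, lives).
    """
    obs, lives = env.reset()
    # Random number of no-op steps after restarting the game.
    for _ in range(rng.randint(0, MAX_NOOPS)):
        obs, _, game_over, lives = env.step(NOOP_ACTION)
        if game_over:
            obs, lives = env.reset()

    transitions = []
    done = False
    while not done:
        action = policy(obs)
        reward = 0.0
        lives_before = lives
        for _ in range(ACTION_REPEAT):
            obs, r, game_over, lives = env.step(action)
            reward += r
            # The episode ends on game over or when a life is lost.
            done = game_over or lives < lives_before
            if done:
                break
        transitions.append((action, reward))
    return transitions
```

Passing an explicit `random.Random` instance keeps the number of no-ops reproducible across runs.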
Input State Representation.
As in Mnih et al. (2015), we take the maximum value at each pixel from consecutive frames to compress them into one frame which is then rescaled to a gray scale image. The input to all four neural networks is the stack of the last gray scale images (thus capturing frame observations over frames). The extrinsic rewards from the game are clipped to .
Details of the Two Networks in the Policy Module.
Note that the policy module is unchanged from the OpenAI implementation. Specifically, the two networks are convolutional neural networks (CNNs) with three convolutional layers and one fully connected layer. The first convolutional layer has thirty-two filters with stride . The second convolutional layer has sixty-four filters with stride . The third convolutional layer has sixty-four filters with stride . The fourth layer is a fully connected layer with hidden units. Each hidden layer is followed by a rectifier nonlinearity. The value network (that estimates ) shares the parameters of the first four layers with the policy network. The policy network has a separate output layer with one output per action, passed through a softmax nonlinearity, while the value network separately outputs a single scalar for the value.
Details of the Two Networks in the Intrinsic Reward Module.
The intrinsic reward module uses two neural networks whose architectures closely mirror those of the policy module described above. The first is a “policy”-style network that, instead of a softmax over actions, produces a scalar reward for every action through a tanh nonlinearity that keeps the scalar in $[-1, 1]$; we will refer to it as the intrinsic reward function. The value network to estimate has the same architecture as the intrinsic reward network except for the output layer, which has a single scalar output without a nonlinear activation. These two networks share the parameters of the first four layers.
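The two output heads described here can be illustrated with a small sketch. It assumes the shared layers have already produced a feature vector; `linear`, `intrinsic_rewards`, and `intrinsic_value` are hypothetical helper names, and the weights in any usage are placeholders rather than trained values.

```python
import math


def linear(features, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def intrinsic_rewards(features, weights, bias):
    """One tanh-bounded intrinsic reward per action."""
    return [math.tanh(z) for z in linear(features, weights, bias)]


def intrinsic_value(features, weights, bias):
    """Single scalar value output with no nonlinearity."""
    return linear(features, [weights], [bias])[0]
```

Because tanh saturates, every per-action reward stays bounded regardless of how large the pre-activation becomes, whereas the value head is unconstrained.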
Hyperparameters for Policy Module.
We keep the default values of all hyperparameters in the original OpenAI implementation of the A2C-based policy module unchanged for both the augmented and baseline agents.^{2}
^{2} We use actor threads to generate episodes. For each training iteration, each actor acts for time steps. For training the policy, the weighting coefficients of the policy-gradient term, the value network loss term, and the entropy regularization term in the objective function are , , and . The learning rate for training the policy is set to at the beginning and anneals to linearly over 50 million steps. The discount factor is for all experiments.
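A linear annealing schedule of the kind mentioned for the learning rate can be written in a few lines. This is a generic sketch: `linear_anneal` is a hypothetical helper, and the initial and final values are left as arguments because they are not recoverable from the text above.

```python
def linear_anneal(initial, final, step, total_steps):
    """Interpolate linearly from `initial` to `final` over `total_steps`.

    Steps outside [0, total_steps] are clamped to the schedule's endpoints.
    """
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return initial + fraction * (final - initial)
```

With `total_steps=50_000_000`, the schedule reaches its final value exactly at the 50 millionth step and stays there afterwards.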
Hyperparameters for Intrinsic Reward Module in Augmented Agent.
We use RMSProp to optimize the two networks of the intrinsic reward module. The decay factor used for RMSProp is , and the is . We do not use momentum. Recall that there are two parameters special to LIRPG. Of these, the step size was initialized to and annealed linearly to zero over million time steps for all the experiments reported below. We did a small hyperparameter search for for each game (this is described below in the caption of Figure 5). As in the A2C implementation of the policy module, we clipped the gradients to a norm of 0.5 in the intrinsic reward module.
A.2 MuJoCo Experiments
Details of the Two Networks in the Policy Module.
Note that the policy module is unchanged from the OpenAI implementation; we provide details for completeness. The policy network is an MLP with hidden layers. The input to the policy network is the observation. The first two layers are fully connected layers with hidden units. Each hidden layer is followed by a tanh nonlinearity. The output layer outputs a vector whose size equals the dimension of the action space, with no nonlinearity applied to the output units. Gaussian noise is added to the output of the policy network to encourage exploration. The variance of the Gaussian noise is an input-independent parameter that is also trained by gradient descent. The corresponding value network (that estimates ) has an architecture similar to that of the policy network. The only difference is that the output layer outputs a single scalar without any nonlinear activation. These two networks do not share any parameters.
Details of the Two Networks in the Intrinsic Reward Module.
The intrinsic reward function networks are quite similar to the two networks in the policy module. Each network is a multilayer perceptron (MLP) with hidden layers. We concatenate the observation vector and the action vector as the input to the intrinsic reward network. The first two layers are fully connected layers with hidden units. Each hidden layer is followed by a tanh nonlinearity. The output layer has one scalar output. We apply tanh to the output to bound the intrinsic reward to $[-1, 1]$. The value network to estimate has the same architecture as the intrinsic reward network except for the output layer, which has a single scalar output without a nonlinear activation. These two networks do not share any parameters.
Hyperparameters for Policy Module.
We keep the default values of all hyperparameters in the original OpenAI implementation of PPO unchanged for both the augmented and baseline agents.^{3}
^{3} For each training iteration, the agent interacts with the environment for steps. The learning rate for training the policy is set to at the beginning and was fixed over training. We used a batch size of and swept over the data points for epochs before the next sequence of interaction. The discount factor is for all experiments.
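The pattern of sweeping over the collected data points for several epochs in minibatches can be sketched as below. `minibatch_epochs` is a hypothetical helper; reshuffling the indices at the start of each epoch is an assumption of this sketch, and the batch size and epoch count are parameters because their values are not recoverable from the text.

```python
import random


def minibatch_epochs(num_points, batch_size, num_epochs, rng):
    """Yield index minibatches covering all points once per epoch."""
    for _ in range(num_epochs):
        order = list(range(num_points))
        rng.shuffle(order)  # assumed: a fresh permutation per epoch
        for start in range(0, num_points, batch_size):
            yield order[start:start + batch_size]
```

Each yielded list indexes into the rollout collected during the most recent interaction phase; a gradient step would be taken on each minibatch before new data is gathered.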
Hyperparameters for Intrinsic Reward Module.
We use Adam [Kingma and Ba, 2014] to optimize the two networks of the intrinsic reward module. The step size was initialized to and was fixed over million time steps for all the experiments reported below. The mixing coefficient was fixed to and instead we multiplied the extrinsic reward by across all environments. The PPO implementation clips the gradient by norm to 0.5. We keep this part unchanged for the policy network and clip the gradients by the same norm for the reward network. We used generalized advantage estimation (GAE) [Schulman et al., 2015] for training both the reward network and the policy network. The weighting factor for GAE was .
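For reference, generalized advantage estimation and gradient clipping by norm can be sketched as follows. This is a generic illustration rather than the authors' implementation: `gae` assumes a single trajectory segment with no episode boundary inside it, and `clip_by_global_norm` assumes the norm is taken jointly over all gradient entries, which the text above does not specify. The discount and weighting factors are parameters because their values are not recoverable from the text.

```python
import math


def gae(rewards, values, last_value, gamma, lam):
    """Generalized advantage estimates for one trajectory segment.

    `values[t]` is the value estimate at step t and `last_value` is the
    bootstrap estimate for the state following the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a flat gradient list so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Setting `lam` to 0 reduces the estimate to the one-step TD error, while `lam` equal to 1 recovers the discounted return minus the value baseline.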