Reinforcement learning is a framework for learning a good policy, in terms of total expected extrinsic reward, by interacting with an environment. It has shown superhuman performance in the game of Go and in Atari games (Mnih et al., 2015; Silver et al., 2017). In the early days, RL algorithms such as Q-learning and state-action-reward-state-action (SARSA) were proposed (Sutton et al., 1998); more recently, more sophisticated algorithms have been developed. Among the latter, proximal policy optimization (PPO) is one of the most popular, because it can be applied to a variety of tasks such as Atari games and robotic control (Schulman et al., 2017).
However, learning a good policy is difficult when the agent rarely receives extrinsic rewards. Existing methods alleviate this problem by adding another type of reward, called an intrinsic reward. For example, as an intrinsic reward, Pathak et al. (2017) and Burda et al. (2019a) use the prediction error of the next state, and Burda et al. (2019b) use an evaluation of state novelty. However, these methods lack a solid theoretical foundation.
Uncertainty Bellman exploration (UBE) is another method for alleviating the sparse-reward problem, and it has a more solid theoretical background (O'Donoghue et al., 2017). UBE evaluates the value of a policy more highly when the estimate of the value is more uncertain, in the spirit of "optimism in the face of uncertainty" in multi-armed bandit problems (Bubeck et al., 2012). O'Donoghue et al. (2017) showed a relationship between the local uncertainty and the uncertainty of the expected return, and applied the uncertainty estimation to SARSA.
We apply the idea of UBE to PPO and propose a new algorithm named optimistic PPO (OPPO), which evaluates the uncertainty of the total return of a policy and updates the policy in the same way as PPO. Because the policy is updated as in PPO, it is expected to change smoothly, which allows OPPO to evaluate the uncertainty of estimated values in states far from the current state.
2.1 Uncertainty Bellman Equation and Exploration
Markov decision processes (MDPs) are models of sequential decision-making problems. In this paper, we focus on an MDP with a finite horizon and finite state and action spaces. An MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho, H)$, where $\mathcal{S}$ is a set of possible states; $\mathcal{A}$ is a set of possible actions; $r$ is a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which defines the expected reward when action $a$ is taken at state $s$; $P$ is a transition function $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, which defines the probability of transitioning to the next state $s'$ when action $a$ is taken at the current state $s$; $\rho$ is a probability distribution over the initial state; and $H$ is the horizon length of the MDP, i.e., the number of actions until the end of an episode.
The objective of an agent/learner is to learn a good policy in terms of expected total return. Formally, the policy $\pi_\theta(a|s)$ is the probability of taking action $a$ at state $s$, where $\theta$ is a set of parameters that determines the probability (for the sake of simplicity, we often omit $\theta$). The Q-value $Q^{\pi}_t(s, a)$ is the expected total return when the agent is at state $s$ at time-step $t$, takes action $a$, and follows policy $\pi$ thereafter.
Let us assume the Bayesian setting of Q-value estimation, where there are priors and posteriors over the mean reward function and the transition function. Let $\hat{r}$ be the sampled reward function, let $\hat{P}$ be the sampled transition function from the prior or posterior, and let $\mathcal{F}_k$ be the sigma-algebra of all data (e.g. states, actions, rewards) obtained by the $k$-th sampling. It is known that there exists a unique $\hat{Q}^{\pi}$ that satisfies the Bellman equation,
$$\hat{Q}^{\pi}_t(s, a) = \hat{r}(s, a) + \sum_{s', a'} \pi(a'|s')\, \hat{P}(s'|s, a)\, \hat{Q}^{\pi}_{t+1}(s', a'),$$
for all $(s, a)$ and $t = 1, \dots, H$, where $\hat{Q}^{\pi}_{H+1} = 0$. O'Donoghue et al. (2017) extend this Bellman equation to the variance/uncertainty of $\hat{Q}^{\pi}$.
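The finite-horizon Bellman equation above can be solved by backward induction. The following sketch illustrates this for a tabular MDP; the function and array names are illustrative, not part of the original method.

```python
import numpy as np

def evaluate_policy(r, p, pi, H):
    """Finite-horizon policy evaluation by backward induction.

    r:  (S, A) expected-reward array
    p:  (S, A, S) transition probabilities p[s, a, s']
    pi: (H, S, A) policy, pi[t, s, a] = probability of a at state s, step t
    Returns Q of shape (H, S, A) satisfying the Bellman equation.
    """
    S, A = r.shape
    Q = np.zeros((H, S, A))
    Q[H - 1] = r  # no future return at the last step
    for t in range(H - 2, -1, -1):
        # V_{t+1}(s') = sum_a' pi(a'|s') Q_{t+1}(s', a')
        V_next = np.einsum("sa,sa->s", pi[t + 1], Q[t + 1])
        # Q_t(s, a) = r(s, a) + sum_{s'} p(s'|s, a) V_{t+1}(s')
        Q[t] = r + p @ V_next
    return Q
```

For example, a single-state, single-action MDP with unit reward and horizon 3 yields a value of 3 at the first step.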
To prove theoretical results, let us assume that the state transition of the MDP is a directed acyclic graph (DAG) and that the expected reward is bounded for all states and actions. We denote the conditional variance of a random variable $X$ given the data $\mathcal{F}_k$ as $\mathrm{var}_k X = \mathbb{E}\!\left[(X - \mathbb{E}[X \mid \mathcal{F}_k])^2 \mid \mathcal{F}_k\right]$. We denote the maximum of the Q-value as $Q_{\max}$ and the local uncertainty as
$$\nu_t(s, a) = \mathrm{var}_k\, \hat{r}(s, a) + Q_{\max}^2 \sum_{s'} \mathrm{var}_k\, \hat{P}(s'|s, a),$$
where $\hat{r}$ and $\hat{P}$ are the sampled reward and transition functions. Then the following theorem holds (O'Donoghue et al., 2017).
For any policy $\pi$, there exists a unique $u^{\pi}$ that satisfies the uncertainty Bellman equation,
$$u^{\pi}_t(s, a) = \nu_t(s, a) + \sum_{s', a'} \pi(a'|s')\, \bar{P}(s'|s, a)\, u^{\pi}_{t+1}(s', a'),$$
for all $(s, a)$ and $t = 1, \dots, H$, where $u^{\pi}_{H+1} = 0$, $\bar{P}$ is the mean transition function, and $u^{\pi}_t(s, a) \ge \mathrm{var}_k\, \hat{Q}^{\pi}_t(s, a)$ point-wise.
This theorem shows a relationship between the local uncertainty $\nu$ and the uncertainty of the estimated Q-values.
For convenience of discussion in later sections, we introduce some notation. Let us denote the solution of the Bellman equation with the estimated mean reward $\bar{r}(s, a) = \mathbb{E}[\hat{r}(s, a) \mid \mathcal{F}_k]$ as $\bar{Q}^{\pi}$, where $\bar{P}$ is the estimated mean transition function. For $t = 1, \dots, H$,
$$\bar{Q}^{\pi}_t(s, a) = \bar{r}(s, a) + \sum_{s', a'} \pi(a'|s')\, \bar{P}(s'|s, a)\, \bar{Q}^{\pi}_{t+1}(s', a').$$
To estimate $\nu_t(s, a)$, O'Donoghue et al. (2017) start from the case where the domain is tabular. Let $n(s, a)$ denote the number of times action $a$ has been chosen at state $s$, and let $\sigma_r^2$ denote the variance of a reward sampled from the reward distribution. We assume that the reward distribution and its prior are Gaussian and that the prior over the transition function is Dirichlet; then
$$\nu_t(s, a) \le \frac{\sigma_r^2 + Q_{\max}^2\, |N_{s,a}|}{n(s, a)},$$
where $|N_{s,a}|$ is the number of next states reachable from $(s, a)$. Thus, there exists a constant $\beta$ which satisfies $\nu_t(s, a) \le \beta^2 / n(s, a)$, e.g. $\beta^2 = \sigma_r^2 + Q_{\max}^2 |\mathcal{S}|$. Since this exact upper bound is too loose in most cases, UBE heuristically chooses $\beta$ instead of using the parameter assured to satisfy the bound. In domains other than the tabular one, UBE extends the discussion above and uses pseudo-counts to estimate the local uncertainty. O'Donoghue et al. (2017) applied UBE to SARSA (Sutton et al., 1998), which is a more primitive algorithm than proximal policy optimization.
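The count-based bound above can be sketched in a few lines. The exact constant $\beta^2$ below follows the form of the bound in the text but is an illustrative assumption, as is the function name:

```python
import numpy as np

def local_uncertainty_bound(counts, sigma_r2, q_max, n_next):
    """Tabular upper bound beta^2 / n_{sa} on the local uncertainty.

    counts:   (S, A) visitation counts n(s, a), assumed >= 1
    sigma_r2: variance of the sampled reward
    q_max:    bound on the Q-value
    n_next:   number of next states reachable from each (s, a)
    """
    # beta^2 = sigma_r^2 + Q_max^2 * |N_{sa}| (illustrative constant)
    beta2 = sigma_r2 + q_max ** 2 * n_next
    return beta2 / counts
```

In practice, UBE replaces this loose constant with a tuned hyper-parameter $\beta$.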
2.2 Proximal Policy Optimization
Proximal policy optimization (PPO) is a simplified version of trust region policy optimization (TRPO). (While the original TRPO and PPO are formulated under the assumption that the policy is run for an MDP with an infinite horizon, they have recently been extended to the case of a finite horizon (Azizzadenesheli et al., 2018), which is the same setting as ours.) Although TRPO shows promising results in control tasks (Schulman et al., 2015a), PPO empirically performs better in most cases (Schulman et al., 2017). PPO uses a clipped variable as follows, so as not to change the policy drastically:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right],$$
where $\theta$ is the parameters of the policy, $t$ is the time-step, $\rho_t(\theta)$ is the probability ratio $\pi_\theta(a_t|s_t) / \pi_{\theta_{\mathrm{old}}}(a_t|s_t)$, $\hat{A}_t$ is the estimated advantage value, and $\hat{\mathbb{E}}_t$ is the empirical average over a batch of samples. The clipping function returns $1 - \epsilon$ if $\rho_t(\theta) < 1 - \epsilon$ and $1 + \epsilon$ if $\rho_t(\theta) > 1 + \epsilon$. PPO samples data by executing actions for $T$ time-steps following the policy, repeating this across $N$ actors, and updates the policy by maximizing $L^{CLIP}$ on the data.
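The clipped surrogate can be written compactly. The sketch below assumes precomputed probability ratios and advantage estimates, and the function name is illustrative:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped PPO surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio: pi_theta(a|s) / pi_theta_old(a|s) per sample
    adv:   estimated advantages per sample
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # take the pessimistic (lower) bound per sample, then average
    return np.mean(np.minimum(unclipped, clipped))
```

The `min` removes the incentive to move the ratio outside $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective, which is what keeps the policy update conservative.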
2.3 Exploration Based on Intrinsic Reward
Random network distillation (RND) was recently proposed to alleviate the problem of sparse rewards (Burda et al., 2019b). It has shown outstanding performance in Atari games. RND uses two neural networks, a target network $f$ and a predictor network $\hat{f}$. Each network maps a state/observation $o$ to a value $f(o)$ or $\hat{f}(o)$. The networks are randomly initialized; the target network's parameters are fixed, while the predictor learns the outputs of the target. The intrinsic reward for observation $o$ is defined as the difference of the outputs, $\|\hat{f}(o) - f(o)\|^2$. As a reward, RND uses the sum of the extrinsic and intrinsic rewards, instead of only the extrinsic one. With this combined reward, RND learns a policy in the same way as PPO, updating the policy to maximize $L^{CLIP}$ on the batch data. More observations of a state are expected to lead to a smaller difference between the outputs, i.e. a smaller intrinsic reward. In RND, the intrinsic rewards can thus be seen as a kind of pseudo-count bonus. However, there is no theoretical discussion about how this bonus should be used.
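The RND bonus reduces to a squared error between two function outputs. In the sketch below, fixed linear maps stand in for the two networks; the shapes and names are illustrative assumptions, not the original architecture:

```python
import numpy as np

def rnd_bonus(obs, target_w, predictor_w):
    """Intrinsic reward per observation: squared error between a fixed,
    randomly initialized target network and a trained predictor network.

    obs:         (batch, obs_dim) observations
    target_w:    (obs_dim, out_dim) weights of the frozen target
    predictor_w: (obs_dim, out_dim) weights of the trained predictor
    """
    target = obs @ target_w      # f(o), parameters never updated
    pred = obs @ predictor_w     # f_hat(o), trained to imitate f
    return np.sum((pred - target) ** 2, axis=-1)
```

As the predictor's weights approach the target's on frequently visited observations, the bonus for those observations shrinks toward zero.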
There are other methods for exploration via intrinsic rewards. To calculate the intrinsic rewards, Bellemare et al. (2016) used context tree switching, and Ostrovski et al. (2017) used PixelCNN. However, those methods depend on visual heuristics and are not straightforward to apply to tasks other than Atari games, e.g. control tasks whose inputs are sensor data. Ecoffet et al. (2019) proposed another method for exploration, which is based on memorization and random search rather than intrinsic reward. Although it shows state-of-the-art performance on Montezuma's Revenge, it is also not straightforward to extend to other tasks. Tang et al. (2017) proposed a method similar to RND which evaluates state novelty by using a hash function.
3 Optimistic Proximal Policy Optimization
We propose optimistic proximal policy optimization (OPPO), a variant of PPO. OPPO optimizes a policy based on an optimistic evaluation of the expected return, where the evaluation is optimistic by the amount of the uncertainty of that return.
First, we explain its theoretical background. We denote the optimistic value of policy $\pi$ as below:
$$V^{+}(\pi) = \mathbb{E}_{s \sim \rho,\, a \sim \pi}\!\left[\bar{Q}^{\pi}_1(s, a) + \beta \sqrt{u^{\pi}_1(s, a)}\right],$$
where $\beta \ge 0$ is a hyper-parameter for exploration. Setting $\beta$ to a high value means emphasizing exploration over exploitation. Let us denote the value of policy $\pi$ as $V(\pi)$. Then the following corollary is derived from Theorem 1.
This corollary shows that the second term of $V^{+}(\pi)$ is an upper bound on the uncertainty of the expected return of $\pi$. In general, more data lead to more accurate estimation, and hence to a lower local uncertainty $\nu$ and a lower $u^{\pi}$. Therefore, the difference between $V^{+}(\pi)$ and $V(\pi)$ decreases to zero as the amount of data increases. These facts show that evaluating $\pi$ by $V^{+}(\pi)$ is reasonable. Besides, $\bar{Q}^{\pi}$ is an estimate of the mean of $Q^{\pi}$. Thus, $V^{+}(\pi)$ has the form of the estimated return plus its uncertainty, and seeking a policy which maximizes $V^{+}(\pi)$ is reasonable in terms of "optimism in the face of uncertainty".
However, it is difficult to find the policy which maximizes $V^{+}$ by directly evaluating $V^{+}$. Thus, following PPO, OPPO approximates $V^{+}$ based on the current policy $\pi_{\theta_{\mathrm{old}}}$. Let $L^{+}(\theta)$ denote the resulting surrogate objective. Then the following equations are satisfied.
For any parameters $\theta$ of the policy,
Theorem 2 means that $V^{+}(\pi_\theta)$ can be approximated by $L^{+}(\theta)$ with sufficient accuracy if $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ are not very different. Therefore, OPPO chooses the next policy so as to increase the estimated value of $L^{+}(\theta)$ while regularizing the 'similarity' between $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ by the clipping function introduced in Section 2.2.
The objective function of OPPO is the same as in Equation (11), except that OPPO uses an optimistic advantage estimate instead of $\hat{A}_t$, combining the estimated advantage of the mean return with that of the uncertainty term $\sqrt{u + \epsilon}$. The parameter $\epsilon$ is introduced to stabilize the estimation when $u$ is nearly zero. Note that Theorem 2 is valid whether the square root in equations (12) and (14) is $\sqrt{u}$ or $\sqrt{u + \epsilon}$. The estimated terms are calculated based on generalized advantage estimation (Schulman et al., 2015b); we show the details in A.2. The other parts of the objective function of OPPO are the prediction error of V-values and the entropy of the policy, which are the same as in PPO.
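The optimistic evaluation amounts to adding a scaled square root of the uncertainty estimate to the value estimate. The sketch below shows this combination; the function name and the exact way the terms are combined are illustrative assumptions:

```python
import numpy as np

def optimistic_value(q_hat, u_hat, beta, eps=1e-8):
    """Optimistic evaluation Q + beta * sqrt(u + eps).

    q_hat: estimated mean return
    u_hat: estimated uncertainty of the return (clamped at zero)
    beta:  exploration coefficient; higher beta emphasizes exploration
    eps:   stabilizes the square root (and its gradient) near u = 0
    """
    return q_hat + beta * np.sqrt(np.maximum(u_hat, 0.0) + eps)
```

With $\beta = 0$ the optimistic value reduces to the plain estimate, recovering exploitation-only behavior.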
Note that simply adding the bonuses to the extrinsic rewards, as ordinary count-based exploration does (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017), rather than propagating them as in UBE and OPPO, may be overly optimistic, as shown by an example in O'Donoghue et al. (2017).
OPPO can be combined with an arbitrary estimator of the local uncertainty. For example, the local uncertainty can be directly evaluated by bootstrap sampling of the reward and transition functions, as in the Q-value estimators of Osband et al. (2016). In this paper, instead of the model-based approach, we take a model-free one for simplicity. We use the RND bonus of the next state as the local uncertainty of the state-action pair that precedes it. Although the networks in RND can easily be extended to evaluate the novelty of the state-action pair instead of the state, we follow the original RND implementation for a simple and clear comparison. We discuss the difference between the local-uncertainty evaluations in A.3. In this case, OPPO is equivalent to RND under a particular setting of its hyper-parameters. Testing OPPO with various local uncertainty estimators is left for future work. We also tested OPPO with local uncertainties based on exact visitation counts, i.e., $\beta^2 / n(s, a)$.
4.1 Tabular Domain
First, we examine the efficiency of the proposed algorithms in a tabular domain where visitation counts are easily calculated. We used a domain we call bandit tile: a kind of grid world containing two tiles on which the agent receives a stochastic reward. We show an example in Figure 2. In the figure, 'G' represents such a tile and 'S' represents the possible initial positions of the agent. The initial position is chosen at random between the two 'S' tiles. The reward is sampled from a Gaussian distribution; each 'G' tile has its own mean reward, and the variance is fixed. The episode ends when the agent reaches a 'G' tile or after 100 time-steps.
We compared OPPO with the bonus based on exact visitation counts against OPPO, RND, and PPO. Figure 2 shows that OPPO is more efficient than RND, and also suggests that OPPO can be improved further given a proper method to estimate the local uncertainty.
4.2 Atari Domain
Next, we show experimental results on more complex tasks, Atari games, which are popular testbeds for reinforcement learning. It has been pointed out that Atari games are deterministic, which makes them inappropriate as testbeds, so we added randomness via sticky actions (Machado et al., 2018). In a sticky-action environment, the chosen action is executed with probability $1 - p$, while the most recent action is repeated with probability $p$. We chose six games (Frostbite, Freeway, Solaris, Venture, Montezuma's Revenge, and Private Eye) to evaluate the proposed method and ran the algorithms for 100 million time-steps in Frostbite and 50 million in the other games. OPPO was more effective than RND on Frostbite in terms of learning speed, although the difference is not as salient as in the tabular case. The details are shown in Figure 4 in the Appendix.
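The sticky-action mechanism is typically implemented as a thin environment wrapper. The sketch below assumes an environment exposing `reset`/`step`; the class and attribute names are illustrative:

```python
import numpy as np

class StickyActions:
    """Wrapper sketch: repeats the previous action with probability p."""

    def __init__(self, env, p=0.25, seed=0):
        self.env, self.p = env, p
        self.rng = np.random.default_rng(seed)
        self.prev_action = None

    def reset(self):
        self.prev_action = None  # no action to repeat at episode start
        return self.env.reset()

    def step(self, action):
        if self.prev_action is not None and self.rng.random() < self.p:
            action = self.prev_action  # sticky: repeat the previous action
        self.prev_action = action
        return self.env.step(action)
```

Because the executed action, not the requested one, is stored as `prev_action`, a long run of repeats becomes possible, which is what breaks the determinism of open-loop policies.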
We have proposed a new algorithm, optimistic proximal policy optimization (OPPO), to alleviate the sparse-reward problem. OPPO is an extension of proximal policy optimization that considers the uncertainty of the estimated expected total return instead of simply estimating the return. OPPO optimistically evaluates the values of policies by the amount of their uncertainty and improves the policy as PPO does. Experimental results show that OPPO learns more effectively than the existing method, RND, in a tabular domain.
The computational resource of the AI Bridging Cloud Infrastructure (ABCI), provided by the National Institute of Advanced Industrial Science and Technology (AIST), was used.
- Azizzadenesheli et al.  Kamyar Azizzadenesheli, Manish Kumar Bera, and Animashree Anandkumar. Trust region policy optimization of pomdps. arXiv preprint arXiv:1810.07900, 2018.
- Bellemare et al.  Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Bubeck et al.  Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Burda et al. [2019a] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. Seventh International Conference on Learning Representations, 2019.
- Burda et al. [2019b] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. Seventh International Conference on Learning Representations, 2019.
- Ecoffet et al.  Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- Kakade and Langford  Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of international conference on Machine learning, volume 2, pages 267–274, 2002.
- Machado et al.  Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- O’Donoghue et al.  Brendan O’Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty Bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.
- Osband et al.  Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
- Ostrovski et al.  Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR. org, 2017.
- Pathak et al.  Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, volume 2017, 2017.
- Schulman et al. [2015a] John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of international conference on Machine learning, volume 37, pages 1889–1897, 2015.
- Schulman et al. [2015b] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Sutton et al.  Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
- Tang et al.  Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
- van Hasselt et al.  Hado P van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pages 4287–4295, 2016.
Appendix A Details of Proposed Method
Corollary 1 is derived from the following relations.
The first inequality is derived from Jensen’s inequality, and the second one is derived from Theorem 1. ∎
For convenience, we introduce some additional notation. Let $d^{\pi}_t(s)$ denote the probability of the agent being at state $s$ at time-step $t$ under policy $\pi$, and let $\mathbb{E}_{\pi}$ denote the expectation under policy $\pi$. Theorem 2 is derived from the following relations.
First, we show that $L^{+}$ satisfies the following equations,
The first equation is derived from the definition of the optimistic value and the fact that the sampling of the initial state depends only on the initial-state distribution; the third and fourth equations are derived from the corresponding definitions.
For simplicity, we denote as . By the fact that ,
The first equation is derived from equation (22). ∎
In the batch data, we denote the state, action, and reward at time-step $t$ sampled by actor $i$ as $s^i_t$, $a^i_t$, and $r^i_t$, respectively, and let $\hat{\nu}^i_t$ denote the local uncertainty of $(s^i_t, a^i_t)$. The uncertainty term in equation (17) is calculated as below, where $\gamma$ is a discount factor. The discount factor is often used even when the horizon is finite, so we follow the common implementations. The advantage estimate is calculated as below:
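The backward recursion used by generalized advantage estimation can be sketched as follows; the function and argument names are illustrative:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite trajectory.

    rewards: length-T array of rewards
    values:  length-(T+1) array of value estimates; the extra entry is
             the value of the state after the last reward (zero if the
             episode terminated there)
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in range(T - 1, -1, -1):
        # one-step TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With $\gamma = \lambda = 1$ and zero value estimates, the recursion reduces to the plain return-to-go, which is a quick sanity check on an implementation.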
Pseudo-code is shown in Algorithm 1.
A.3 Local Uncertainty Estimation
Let denote the local uncertainty based on the next state after pair. OPPO uses as the local uncertainty of instead of . There is a small gap between the discussion and the implementation of OPPO. However, using is reasonable if the state transition is a tree, a graph without cycles. Using means using the average of as the local uncertainty of . This can be approximated by . In the tree case, can be approximated by . Thus, if , the local uncertainty of can be approximated by . This means that can be approximated by the average of , if .
Appendix B Further Investigation in Tabular Domain
To confirm the validity of using the RND bonus in place of visitation counts, we measured the ratio between the two and checked whether it was stable at around one in the bandit-tile domain. Figure 3 shows that the ratio was around 1 for millions of time-steps, although it was high at the beginning and nearly zero at the end. We consider that OPPO is worse than OPPO with the exact count bonus by the amount of this overvaluation, and that the undervaluation was not harmful because it occurred after the policy had already learned to reach the best tile.
Appendix C Details of Results in Atari Games
We compared OPPO with RND in the six Atari games. The original RND implementation uses a reward-clipping technique which maps negative/positive extrinsic rewards to −1/+1, so we also used this technique in OPPO and RND. Note that we use a frame-skipping technique with four frame skips, so one time-step is equal to or less than four frames (it is less than four if the episode ends at a skipped frame).
Figure 4 shows that OPPO learns more effectively than RND in Frostbite, although there is only a slight difference in the other games. Figure 4 also shows that extrinsic rewards decrease in Frostbite. One of the reasons for the decrease may be the reward clipping, although further investigation is needed to confirm this. Under reward clipping, the agent learns a policy that receives positive rewards with high frequency, not one with high returns. The agent may therefore converge to a policy with the same frequency of rewards but a smaller total return as it receives more data. Note that there are both small and large rewards in Frostbite, and that a novel state leads to a higher reward in most Atari games [Burda et al., 2019a]. This problem could be alleviated by rescaling the reward according to its magnitude, e.g. with PopArt [van Hasselt et al., 2016], which is left for future work.