Maximum Entropy Model Rollouts: Fast Model Based Policy Optimization without Compounding Errors

06/08/2020, by Chi Zhang, et al.

Model usage is the central challenge of model-based reinforcement learning. Although dynamics models based on deep neural networks provide good generalization for single-step prediction, this ability is over-exploited when the model is used to predict long-horizon trajectories, due to compounding errors. In this work, we propose a Dyna-style model-based reinforcement learning algorithm, which we call Maximum Entropy Model Rollouts (MEMR). To eliminate compounding errors, we use our model to generate only single-step rollouts. Furthermore, we propose to generate diverse model rollouts by non-uniform sampling of the environment states such that the entropy of the model rollouts is maximized, and we utilize a prioritized experience replay to accomplish this objective. We mathematically show that the entropy of the model rollouts is maximally increased when the sampling criterion is the negative likelihood under the historical model rollout distribution. Our preliminary experiments on challenging locomotion benchmarks show that our approach achieves the same sample efficiency as the best model-based algorithms, matches the asymptotic performance of the best model-free algorithms, and significantly reduces the computational requirements of other model-based methods.


1 Introduction

Model-based reinforcement learning (MBRL) (Janner et al., 2019; Buckman et al., 2018; Xu et al., 2018; Chua et al., 2018) shows competitive performance compared with the best model-free reinforcement learning (MFRL) algorithms (Schulman et al., 2017, 2015; Mnih et al., 2013; Haarnoja et al., 2018a, b) with significantly fewer environment samples on challenging robotic locomotion benchmarks (Todorov et al., 2012). MFRL algorithms learn complex skills by maximizing a scalar reward designed by human engineering. However, their promising performance requires large amounts of environment interaction, which may take a long time in real-world applications. In such cases, MBRL is appealing due to its superior sample efficiency, which relies on the generalization of a learned predictive dynamics model. However, the quality of a policy trained on imagined trajectories is often asymptotically worse than that of the best MFRL counterparts, due to imperfect models.

Recently, (Janner et al., 2019) proposed Model-Based Policy Optimization (MBPO), including a theoretical framework that encourages short-horizon model usage based on an optimistic assumption of a bounded model generalization error under policy shift. Although empirical studies have shown support for this assumption, the property is hard to guarantee over the whole state distribution. Moreover, uniform sampling of the environment states to generate branched model rollouts degrades the diversity of the model dataset, especially when the policy shift is small, which makes the policy updates inefficient.

Our main contribution is a practical algorithm, which we call Maximum Entropy Model Rollouts (MEMR), based on the aforementioned insights. The differences between MEMR and MBPO are: 1) MEMR follows Dyna (Sutton, 1991) in generating only single-step model rollouts, while MBPO encourages generating short-horizon model rollouts. The generalization ability of MEMR is strictly guaranteed by supervised machine learning theory and can be empirically estimated by validation errors (Shalev-Shwartz and Ben-David, 2014). 2) MEMR utilizes a prioritized experience replay (Schaul et al., 2015) to generate maximum-diversity model rollouts for efficient policy updates. We validate this idea on challenging locomotion benchmarks (Todorov et al., 2012), and the experimental results show that MEMR matches the asymptotic performance and sample efficiency of MBPO (Janner et al., 2019) while significantly reducing the number of policy updates and model rollouts, which leads to faster learning.

2 Preliminaries

Reinforcement learning algorithms aim to solve a Markov Decision Process (MDP) with unknown dynamics. An MDP (Sutton and Barto, 2018) is defined as a tuple $(\mathcal{S}, \mathcal{A}, r, p, \rho_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r(s_t, a_t)$ defines the intermediate reward when the agent transits from state $s_t$ to $s_{t+1}$ by taking action $a_t$, $p(s_{t+1} \mid s_t, a_t)$ defines the probability that the agent transits from state $s_t$ to $s_{t+1}$ by taking action $a_t$, and $\rho_0$ defines the starting state distribution. The objective of reinforcement learning is to select a policy $\pi$ such that the expected discounted return

$$J(\pi) = \mathbb{E}_{\rho_0,\, \pi,\, p}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right] \quad (1)$$

is maximized, where $\gamma \in (0, 1)$ is the discount factor.
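As a small concrete illustration of the objective in Equation 1, the discounted return of a finite trajectory can be computed with a backward recursion (a minimal sketch; the reward values below are illustrative only):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite trajectory via the
    backward recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three-step trajectory with rewards [1, 0, 2] and gamma = 0.5:
# G = 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # -> 1.5
```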

2.1 Prioritized Experience Replay

Prioritized experience replay (Schaul et al., 2015) was introduced to increase the learning efficiency of DQN (Mnih et al., 2013), where the sampling probability of each transition is proportional to its absolute TD error (Watkins and Dayan, 1992). To avoid overfitting, stochastic prioritization is utilized, and the bias is corrected via annealed importance sampling. In this work, we adopt the same idea with a custom prioritization criterion such that the joint entropy of the states and actions in the model dataset is maximized.
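The proportional prioritization scheme of (Schaul et al., 2015), together with its bias-correcting importance weights, can be sketched as follows (a minimal NumPy sketch, not the authors' implementation; the `alpha` and `beta` defaults follow common PER settings):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with P(i) proportional to p_i^alpha and return
    importance weights w_i = (1 / (N * P(i)))^beta, normalized by the
    maximum weight for stability as in Schaul et al. (2015)."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()                       # sampling distribution P(i)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()                  # normalize so max weight is 1
    return idx, weights
```

Annealing `beta` toward 1.0 over training fully corrects the sampling bias by the end of learning.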

2.2 Model-based Policy Optimization

Model-based policy optimization (MBPO) (Janner et al., 2019) achieves state-of-the-art sample efficiency and matches the asymptotic performance of MFRL approaches. MBPO optimizes a policy with soft actor-critic (SAC) (Haarnoja et al., 2018a) under the data distribution collected by unrolling the learned dynamics model using the current policy. However, the sample efficiency comes at the cost of a 2.5x to 5x increase in the number of policy updates compared with SAC (Haarnoja et al., 2018a) and a large number of model rollouts, which significantly decreases training speed. To mitigate this bottleneck, we analyze the model usage and model rollout distribution of MBPO and propose insights on how to improve its computational efficiency.

Model usage.

In MBPO, the learned dynamics model is used to generate branched model rollouts with short horizons (Janner et al., 2019). Although (Janner et al., 2019) presented a theoretical analysis that bounds the performance of a policy trained on model-generated rollouts, the over-exploitation of model generalization cannot be eliminated. In this work, one of our core ideas is to rely on the learned model only for one-step rollouts, which we interpret as model-based exploration. The nice property of this model usage is the naturally bounded model generalization error, which can be estimated in practice on a validation dataset (Shalev-Shwartz and Ben-David, 2014).

Model rollout distribution.

Uniform sampling of true states (states encountered in the real environment, as opposed to imagined states generated by the model) to generate model rollouts is adopted in MBPO (Janner et al., 2019). This potentially generates large amounts of similar data when the policy and the learned model change slowly as training progresses. As a result, the efficiency of the policy updates deteriorates. In this work, we propose to sample true states to generate single-step model rollouts such that the joint entropy of the states and actions in the model dataset is maximized. The intuition is to increase the "diversity" of the model dataset, from which the policy can benefit for efficient learning.
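The single-step rollout generation described above can be sketched as follows; `model` and `policy` are hypothetical callables standing in for the learned dynamics model and the SAC policy, not MEMR's actual API:

```python
def generate_single_step_rollouts(model, policy, states):
    """Generate one-step imagined transitions from sampled real states.
    Assumed interfaces: policy(s) -> a, model(s, a) -> (s_next, r)."""
    rollouts = []
    for s in states:
        a = policy(s)               # act with the current policy
        s_next, r = model(s, a)     # one-step model prediction only
        rollouts.append((s, a, r, s_next))
    return rollouts
```

Because the model is never unrolled beyond one step, its prediction error does not compound across a trajectory.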

3 Maximum Entropy Model Rollouts

In this section, we present the technical details of our Maximum Entropy Model Rollouts (MEMR) for model-based policy optimization. First, we propose the Maximum Entropy Sampling Theorem to motivate the choice of our prioritization criterion. Based on the theoretical analysis, we propose a practical implementation of this idea and discuss the challenges posed by runtime complexity, along with their fixes.

3.1 Maximum Entropy Sampling Criteria

We begin by considering the following problem definition:

Problem 3.1 (Maximum Entropy Sampling).

Let $\mathcal{D}_{env} = \{s_i\}$ be the collection of all the states in the environment dataset. Let $\mathcal{D}_{model} = \{(s_j, a_j)\}$ be the collection of all the state-action pairs in the model dataset (the tasks considered in this work are deterministic, so we omit the next state for simplicity). Assume that for each state $s_i$ in $\mathcal{D}_{env}$, we sample an action using the current policy, denoted as $a_i \sim \pi_\theta(\cdot \mid s_i)$. Assume we parameterize the policy distribution derived from the model dataset as a Gaussian distribution with diagonal covariance: $\pi_{\mathcal{D}}(a \mid s) = \mathcal{N}\big(\mu(s), \mathrm{diag}(\sigma^2(s))\big)$. Let the joint entropy of the state-action pairs in the model dataset be $\mathcal{H}(S, A)$. Now we select $(s_i, a_i)$ and add it to $\mathcal{D}_{model}$. Let the joint entropy of the new $\mathcal{D}_{model}$ be $\mathcal{H}'(S, A)$; the optimal sampling criterion problem is to choose the index $i$ such that $\mathcal{H}'(S, A)$ is maximized.

Theorem 3.1 (Maximum Entropy Sampling Theorem).

Assume $|\mathcal{D}_{model}| \gg 1$, such that the state distributions of $\mathcal{D}_{model}$ and the new $\mathcal{D}_{model}$ are identical; then

$$i^* = \arg\max_i \left[ -\log \pi_{\mathcal{D}}(a_i \mid s_i) - \sum_k \log \sigma_k(s_i) \right] \quad (2)$$

where $\pi_{\mathcal{D}}(a_i \mid s_i)$ is the probability of the model data policy at $(s_i, a_i)$, and $\sigma_k(s_i)$ is the standard deviation of the $k$-th dimension of the conditional distribution $\pi_{\mathcal{D}}(\cdot \mid s_i)$.

Proof.

See Appendix A, Theorem A.2. ∎

3.2 Practical Implementation

Theorem 3.1 provides a mathematically justified criterion for selecting states from the environment dataset for rollout generation so as to maximize the "diversity" of the model dataset, yet it poses several practical challenges: 1) It requires a full sweep of all the states in the environment dataset before each sampling, which is $O(|\mathcal{D}_{env}|)$. This is problematic because $|\mathcal{D}_{env}|$ grows linearly as training progresses. 2) Stochastic gradient descent assumes uniform sampling of the data distribution, whereas prioritized sampling breaks this assumption and introduces bias. 3) Training the model data distribution to convergence is expensive but crucial before evaluating the priority. A complete algorithm that handles these practical challenges is presented in Algorithm 1.

1:  Initialize environment dataset $\mathcal{D}_{env}$ and model dataset $\mathcal{D}_{model}$
2:  Initialize SAC policy $\pi_\theta$, predictive model $p_\phi$, and model-derived policy distribution $\pi_{\mathcal{D}}$
3:  for each environment step do
4:     if it is time to retrain the model then
5:        Train model $p_\phi$ on $\mathcal{D}_{env}$ via maximum likelihood
6:     end if
7:     Sample $a_t \sim \pi_\theta(\cdot \mid s_t)$; execute $a_t$ in the environment and observe $(s_{t+1}, r_t)$
8:     Compute the priority of $(s_t, a_t)$ according to Equation 4; add $(s_t, a_t, r_t, s_{t+1})$ to $\mathcal{D}_{env}$
9:     for $j = 1, \dots, M$ do
10:        Sample $s_j$ from $\mathcal{D}_{env}$ with probability $P(j)$ (Equation 3)
11:        Compute the importance-sampling weight $w_j$ (Equation 5)
12:        Sample $a_j \sim \pi_\theta(\cdot \mid s_j)$; perform a one-step rollout using $p_\phi$ and obtain $(s_j, a_j, r_j, s'_j)$.
13:     end for
14:     Add the $M$ generated transitions to the next segment in $\mathcal{D}_{model}$
15:     Update $\pi_{\mathcal{D}}$ on the newly added data via maximum likelihood for a few epochs
16:     Update the priority of $s_j$ according to Equation 4 for all sampled $j$
17:     for each of the policy update iterations do
18:        Sample a segment index uniformly; sample a batch from that segment uniformly
19:        Update the Q network using the importance weights $w$
20:        Update the policy $\pi_\theta$ using SAC
21:     end for
22:  end for
Algorithm 1 Maximum Entropy Model Rollouts for Model-Based Policy Optimization
Figure 1: Segmented replay buffer for model generated rollouts. Each segment contains data sampled from the same environment state distribution.
Figure 2: Training curves of MEMR and two baselines. Solid curves depict the mean of five trials and shaded regions correspond to standard deviation among trials. The first row depicts the performance vs. the total number of environment interactions. We observe that MEMR matches the performance of state-of-the-art model-based and model-free algorithms. The second row shows the performance vs. the number of policy updates and we observe that MEMR converges as fast as SAC in terms of the number of updates. The third row shows that MEMR generates only a fraction of model rollouts compared to MBPO, which indicates far less training time.

Stochastic prioritization.

Inspired by (Schaul et al., 2015), we only update the priorities of the states that were just sampled, to avoid an expensive full sweep before each sampling. An immediate consequence of this approach is that certain states with low priorities will not be sampled for a very long time, which potentially leads to overfitting. Following (Schaul et al., 2015), we use stochastic prioritization that interpolates between pure greedy and uniform sampling, with the probability of sampling state $s_i$ given by

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha} \quad (3)$$

where $p_i$ is the priority of state $s_i$ and action $a_i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case. According to Theorem 3.1, we compute $p_i$ as

$$p_i = -\log \pi_{\mathcal{D}}(a_i \mid s_i) - \sum_k \log \sigma_k(s_i) \quad (4)$$
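Under the diagonal Gaussian parameterization of the model data policy, this priority reduces, up to additive constants that do not affect the ranking, to the squared normalized deviation of the action from the policy mean. A minimal sketch (an assumed interface, not the authors' code):

```python
import numpy as np

def priority(action, mu, sigma):
    """Priority of a candidate (s, a): the negative log-likelihood
    -log N(a; mu, diag(sigma^2)) minus sum(log sigma). The log-sigma
    terms cancel, leaving the squared normalized deviation plus a
    constant, which we drop since it does not change the ranking."""
    z = (np.asarray(action) - np.asarray(mu)) / np.asarray(sigma)
    return 0.5 * float(np.sum(z ** 2))
```

Intuitively, actions far from what the model dataset already contains (relative to its spread) receive high priority.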

Correcting the bias.

Using prioritized sampling introduces bias when fitting the Q network of SAC. Inspired by (Schaul et al., 2015), we apply weighted importance sampling (IS) when calculating the loss of the Q network, where the weight for sample $i$ is

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta} \quad (5)$$

and $\beta$ is annealed during training.

Segmented replay buffer.

According to Algorithm 1, we update the priorities after sampling states from the environment dataset to perform model rollouts. Thus, the sampling distribution of every model rollout generation is different. This leads to incorrect importance weights if we randomly sample a batch from the model dataset that contains data generated from different distributions to perform policy updates. To fix this, we introduce a segmented replay buffer that groups the rollouts generated in the same round into the same segment. During sampling for policy updates, we randomly sample a segment index, then sample a batch from that segment.
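A segmented replay buffer of this kind can be sketched as below (an illustrative data structure, not the authors' implementation):

```python
import random
from collections import deque

class SegmentedReplayBuffer:
    """Model rollouts grouped by generation round: each segment holds
    transitions drawn under a single sampling distribution, so a batch
    taken from one segment has consistent importance weights."""

    def __init__(self, max_segments):
        self.segments = deque(maxlen=max_segments)  # oldest segment evicted first

    def add_segment(self, transitions):
        self.segments.append(list(transitions))

    def sample(self, batch_size):
        seg = random.choice(self.segments)          # uniform over segments
        return [random.choice(seg) for _ in range(batch_size)]
```

Bounding the number of segments also bounds the total buffer size, since each rollout round contributes a fixed number of transitions.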

Training model derived policy distribution.

Fitting $\pi_{\mathcal{D}}$ on $\mathcal{D}_{model}$ via maximum likelihood until convergence is costly, since the size of $\mathcal{D}_{model}$ is large and this operation must be performed every time we generate model rollouts. Since the data in the model buffer is swapped rapidly, we treat it as an online learning procedure and only perform several gradient updates on the newly stored data.
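To illustrate the flavor of such an online update (as a simple stand-in for the actual gradient-based maximum likelihood updates, which we do not reproduce here), Welford's algorithm maintains a running maximum likelihood estimate of a diagonal Gaussian in O(1) per new sample:

```python
import numpy as np

class OnlineGaussian:
    """Running ML estimate of a diagonal Gaussian via Welford's
    algorithm: each new observation updates mean and variance in O(1),
    without revisiting old data."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)    # sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / max(self.n, 1)
```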

4 Experiments

Our experimental evaluation aims to study the following question: how well does MEMR perform on RL benchmarks, compared to state-of-the-art model-based and model-free algorithms, in terms of sample efficiency, asymptotic performance, and computational efficiency?

We evaluate MEMR on MuJoCo benchmarks (Todorov et al., 2012). We compare our method with the state-of-the-art model-based method, MBPO (Janner et al., 2019). As shown in Figure 2, MEMR matches the asymptotic performance of MBPO while using far fewer policy updates and only a fraction of the model rollouts. This indicates that MEMR makes more efficient use of model rollout data for policy updates, and it translates into a significant training speedup. Compared with the state-of-the-art model-free method, SAC (Haarnoja et al., 2018a), MEMR matches the asymptotic performance with far better data efficiency.

References

  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. CoRR abs/1807.01675. External Links: 1807.01675, Link Cited by: Appendix B, Appendix B, §1.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR abs/1805.12114. External Links: 1805.12114, Link Cited by: Appendix B, Appendix B, §1.
  • P. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2004) A tutorial on the cross-entropy method. ANNALS OF OPERATIONS RESEARCH 134. Cited by: Appendix B.
  • M. P. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, Madison, WI, USA, pp. 465–472. External Links: ISBN 9781450306195 Cited by: Appendix B, Appendix B.
  • V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine (2018) Model-based value estimation for efficient model-free reinforcement learning. CoRR abs/1803.00101. External Links: 1803.00101, Link Cited by: Appendix B.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290. External Links: 1801.01290, Link Cited by: Appendix B, Appendix B, §1, §2.2, §4.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018b) Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: 1812.05905, Link Cited by: §1.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. CoRR abs/1906.08253. External Links: 1906.08253, Link Cited by: Appendix B, Appendix B, Appendix C, §1, §1, §1, §2.2, §2.2, §2.2, §4.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, R. Sepassi, G. Tucker, and H. Michalewski (2019) Model-based reinforcement learning for atari. CoRR abs/1903.00374. External Links: 1903.00374, Link Cited by: Appendix B.
  • S. Kakade (2001) A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, Cambridge, MA, USA, pp. 1531–1538. Cited by: Appendix B.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. CoRR abs/1802.10592. External Links: 1802.10592, Link Cited by: Appendix B, Appendix B.
  • S. Levine and V. Koltun (2013) Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 1–9. External Links: Link Cited by: Appendix B.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. CoRR abs/1312.5602. External Links: 1312.5602, Link Cited by: Appendix B, Appendix B, §1, §2.1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2017) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR abs/1708.02596. External Links: 1708.02596, Link Cited by: Appendix B, Appendix B.
  • A. Rao (2010) A survey of numerical methods for optimal control. Advances in the Astronautical Sciences 135. Cited by: Appendix B.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. Note: cite arxiv:1511.05952Comment: Published at ICLR 2016 External Links: Link Cited by: Appendix C, §1, §2.1, §3.2, §3.2.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. External Links: 1502.05477, Link Cited by: Appendix B, §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: 1707.06347, Link Cited by: Appendix B, §1.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, USA. External Links: ISBN 1107057132 Cited by: §1, §2.2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA. External Links: ISBN 0262039249 Cited by: §2.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull. 2 (4), pp. 160–163. External Links: Document, ISSN 0163-5719, Link Cited by: Appendix B, §1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix B, §1, §1, §4.
  • T. Wang and J. Ba (2019) Exploring model-based planning with policy networks. CoRR abs/1906.08649. External Links: 1906.08649, Link Cited by: Appendix B.
  • C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3), pp. 279–292. External Links: Document, ISBN 1573-0565, Link Cited by: §2.1.
  • G. Williams, A. Aldrich, and E. A. Theodorou (2015) Model predictive path integral control using covariance variable importance sampling. CoRR abs/1509.01149. External Links: 1509.01149, Link Cited by: Appendix B.
  • H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma (2018) Algorithmic framework for model-based reinforcement learning with theoretical guarantees. CoRR abs/1807.03858. External Links: 1807.03858, Link Cited by: Appendix B, Appendix B, §1.

Appendix A Maximum Entropy Sampling Criteria

In this section, we present the proof of Theorem 3.1. We begin with a useful lemma:

Lemma A.1 (Entropy Gain of Gaussian distribution).

Suppose a random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown. Suppose we have observations $x_1, \dots, x_n$ and obtain a maximum likelihood estimate of the distribution, denoted $\mathcal{N}(\hat{\mu}_n, \hat{\sigma}_n^2)$. If we have one more observation $x_{n+1}$, we obtain a new estimate using $x_1, \dots, x_{n+1}$, denoted $\mathcal{N}(\hat{\mu}_{n+1}, \hat{\sigma}_{n+1}^2)$. Let the density of $\mathcal{N}(\hat{\mu}_n, \hat{\sigma}_n^2)$ be $f_n$. Let the differential entropies of the two estimates be $H_n$ and $H_{n+1}$, and let $\Delta H = H_{n+1} - H_n$. Then, for $n \gg 1$,

$$\Delta H \approx \frac{1}{n+1}\left( -\log f_n(x_{n+1}) - \log \hat{\sigma}_n - \frac{1}{2}\log 2\pi - \frac{1}{2} \right).$$

Proof.

According to the maximum likelihood estimation of a Gaussian distribution, we obtain $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_n)^2$, $\hat{\mu}_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} x_i$, $\hat{\sigma}_{n+1}^2 = \frac{1}{n+1}\sum_{i=1}^{n+1} (x_i - \hat{\mu}_{n+1})^2$. Then,

$$\hat{\mu}_{n+1} = \hat{\mu}_n + \frac{1}{n+1}\left(x_{n+1} - \hat{\mu}_n\right) \quad (6)$$
$$\hat{\sigma}_{n+1}^2 = \frac{n}{n+1}\hat{\sigma}_n^2 + \frac{n}{(n+1)^2}\left(x_{n+1} - \hat{\mu}_n\right)^2 \quad (7)$$
$$\Delta H = \frac{1}{2}\log\left(2\pi e\, \hat{\sigma}_{n+1}^2\right) - \frac{1}{2}\log\left(2\pi e\, \hat{\sigma}_n^2\right) \quad (8)$$
$$= \frac{1}{2}\log\frac{\hat{\sigma}_{n+1}^2}{\hat{\sigma}_n^2} \quad (9)$$
$$= \frac{1}{2}\log\left( \frac{n}{n+1} + \frac{n}{(n+1)^2} \cdot \frac{(x_{n+1} - \hat{\mu}_n)^2}{\hat{\sigma}_n^2} \right) \quad (10)$$
$$\approx \frac{1}{2}\left( -\frac{1}{n+1} + \frac{1}{n+1} \cdot \frac{(x_{n+1} - \hat{\mu}_n)^2}{\hat{\sigma}_n^2} \right) \quad (11)$$
$$= \frac{1}{n+1}\left( \frac{(x_{n+1} - \hat{\mu}_n)^2}{2\hat{\sigma}_n^2} - \frac{1}{2} \right) \quad (12)$$
$$= \frac{1}{n+1}\left( -\log f_n(x_{n+1}) - \log \hat{\sigma}_n - \frac{1}{2}\log 2\pi - \frac{1}{2} \right) \quad (13)$$

where (11) uses $\log(1 + u) \approx u$ and $\frac{n}{(n+1)^2} \approx \frac{1}{n+1}$ for $n \gg 1$, and (13) uses $-\log f_n(x) = \frac{(x - \hat{\mu}_n)^2}{2\hat{\sigma}_n^2} + \log\left(\sqrt{2\pi}\, \hat{\sigma}_n\right)$. ∎
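The first-order approximation in Lemma A.1 can be checked numerically: for a large sample, the exact entropy gain from one extra observation closely matches the lemma's expression. A verification sketch under the lemma's ML-Gaussian assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
x = rng.normal(0.0, 1.0, size=n)

def ml_entropy(samples):
    """Differential entropy of the ML Gaussian fit to the samples."""
    var = samples.var()  # ML (biased) variance estimate
    return 0.5 * np.log(2 * np.pi * np.e * var)

x_new = 3.0  # an outlying new observation
exact = ml_entropy(np.append(x, x_new)) - ml_entropy(x)

mu, sigma = x.mean(), x.std()
nll = 0.5 * ((x_new - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))
approx = (nll - np.log(sigma) - 0.5 * np.log(2 * np.pi) - 0.5) / (n + 1)

print(exact, approx)  # the two values agree closely for large n
```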

Theorem A.2 (Maximum Entropy Sampling Criteria).

Assume $|\mathcal{D}_{model}| \gg 1$, such that the state distributions of $\mathcal{D}_{model}$ and the new $\mathcal{D}_{model}$ are identical; then

$$i^* = \arg\max_i \left[ -\log \pi_{\mathcal{D}}(a_i \mid s_i) - \sum_k \log \sigma_k(s_i) \right] \quad (14)$$

where $\pi_{\mathcal{D}}(a_i \mid s_i)$ is the probability of the model data policy at $(s_i, a_i)$, and $\sigma_k(s_i)$ is the standard deviation of the $k$-th dimension of the conditional distribution $\pi_{\mathcal{D}}(\cdot \mid s_i)$.

Proof.

We begin by expanding the state-action joint entropy of the model dataset:

$$\mathcal{H}(S, A) = \mathcal{H}(S) + \mathcal{H}(A \mid S) \quad (15)$$
$$= \mathcal{H}(S) + \sum_{s} p(s)\, \mathcal{H}(A \mid S = s) \quad (16)$$

Since the state distribution is assumed unchanged, adding $(s_i, a_i)$ only changes the conditional entropy at $s_i$. Applying Lemma A.1 to each dimension of the diagonal Gaussian $\pi_{\mathcal{D}}(\cdot \mid s_i)$, we obtain

$$\Delta \mathcal{H} \approx \frac{p(s_i)}{n_i + 1}\left( -\log \pi_{\mathcal{D}}(a_i \mid s_i) - \sum_k \log \sigma_k(s_i) + \mathrm{const} \right) \quad (17)$$

Note that $n_i$ corresponds to $n$ in Lemma A.1 and denotes the number of occurrences of state $s_i$ in $\mathcal{D}_{model}$, which is approximately equal to $p(s_i) N$, where $N$ is the model dataset size. This is a rough density estimation, and more accurate methods are left for future work. The leading factor is then approximately $1/N$ for every candidate state, so

$$i^* = \arg\max_i \left[ -\log \pi_{\mathcal{D}}(a_i \mid s_i) - \sum_k \log \sigma_k(s_i) \right] \quad (18)$$

∎

Appendix B Related Work

Model-free reinforcement learning (MFRL) learns an optimal policy either by directly taking gradients of the objective function (Kakade, 2001; Schulman et al., 2015, 2017) or by estimating the state-action value function and deriving the optimal policy from it (Mnih et al., 2013; Haarnoja et al., 2018a). These approaches require large amounts of environment interaction, which is not suitable for environments where sampling is expensive. In contrast, model-based reinforcement learning (MBRL) learns a dynamics model and either performs model predictive control (MPC) directly (Nagabandi et al., 2017; Chua et al., 2018; Deisenroth and Rasmussen, 2011) or derives the policy using model-generated rollouts (Kurutach et al., 2018; Janner et al., 2019; Buckman et al., 2018; Xu et al., 2018).

Learning an accurate dynamics model is often the bottleneck for MBRL in matching the asymptotic performance of MFRL counterparts. Although Gaussian processes have been shown effective in the low-dimensional data regime (Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013), they are hard to generalize to high-dimensional environments (Todorov et al., 2012; Mnih et al., 2013). (Nagabandi et al., 2017) first utilized deterministic neural network dynamics models for model predictive control in robotics, and (Chua et al., 2018) improved the idea with probabilistic ensemble models that match the asymptotic performance of MFRL baselines. (Wang and Ba, 2019) combines policy networks with online planning and achieves superior performance on challenging benchmarks. Common MPC and planning methods include shooting methods (Rao, 2010), the cross-entropy method (de Boer et al., 2004), and model predictive path integral control (Williams et al., 2015). Such planning methods can over-exploit the learned dynamics on long-horizon predictions, which may impair performance, and they are also computationally expensive to run in real time. In such cases, learning a policy, e.g., one parameterized by a neural network, is desirable for better generalization.

Dyna-style MBRL utilizes the learned dynamics model to generate rollouts from which a good policy is learned (Sutton, 1991). (Feinberg et al., 2018) and (Buckman et al., 2018) utilize the model to better estimate the value function in order to improve sample efficiency. (Kurutach et al., 2018) optimizes the policy network via a policy gradient algorithm on trajectories generated by the models. (Kaiser et al., 2019) proposes a video prediction network for model-based Atari games. (Xu et al., 2018) develops a theoretical framework that provides monotonic improvement toward a local maximum of the expected reward for MBRL. Model-based policy optimization (MBPO) (Janner et al., 2019) achieves state-of-the-art sample efficiency and matches the asymptotic performance of MFRL approaches; it optimizes a policy with soft actor-critic (SAC) (Haarnoja et al., 2018a) under the data distribution collected by unrolling the learned dynamics model using the current policy. Our approach combines (Janner et al., 2019) and (Sutton, 1991), proposing a non-trivial sampling approach that significantly reduces the number of policy updates and model rollouts while retaining asymptotic performance.

Appendix C Ablation Study

In this section, we present ablation studies of our proposed method. We primarily analyze how the performance of our algorithm changes as we vary the size of the model dataset, the number of policy updates per environment step, and the prioritization strength $\alpha$.

Figure 3: The results of the ablation study. Model Dataset Size refers to the size of $\mathcal{D}_{model}$ shown in Algorithm 1. Policy Updates refers to the number of policy updates per environment step; it indicates how informative the model rollouts are to the SAC agent. Prioritization strength $\alpha$ is the exponent used to calculate the probability of states being sampled; the smaller it is, the closer the sampling distribution is to uniform.

Model dataset size

The size of the model dataset affects how fast the algorithm converges. Since SAC is an off-policy algorithm, the same experience is expected to be revisited several times on average (Schaul et al., 2015). A small dataset hinders learning progress because each transition resides in the buffer for only a short period. On the other hand, a large model dataset decreases the sample diversity of each batch used to perform policy updates.

Number of policy updates per environment step

As shown in Figure 2, MEMR converges as fast as SAC in terms of policy updates. Surprisingly, we found that increasing the number of policy updates per environment step doesn't help increase the convergence speed, as shown in Figure 3. This indicates that much of the computation in (Janner et al., 2019) is spent on less informative model rollouts that barely help the learning of the value functions in SAC.

Prioritization strength

Strong prioritization leads to overfitting to a local optimum, whereas weak prioritization reduces model rollout diversity. As shown in Figure 3, we found that $\alpha = 0.6$ works best across all benchmarks.

Appendix D Hyperparameter Settings

Hyperparameter | HalfCheetah-v2 | Walker2d-v2 | Ant-v2 | Hopper-v2
Total number of steps | 400000 | 300000 | 300000 | 125000
Model update frequency | 250 (all environments)
Model rollouts per environment step | 400 (all environments)
Prioritization strength ($\alpha$) | 0.6 (all environments)
Importance weights annealing ($\beta$) | linearly annealed from 0.4 to 1.0 (all environments)
Number of epochs to update model data policy | 2 (all environments)
Policy updates per environment step | 5 (all environments)
Size of model dataset | 3e6 | 6e6 | 1e6
Table 1: Hyperparameter settings for MEMR