## 1 Introduction

Model-based reinforcement learning (MBRL) (Janner et al., 2019; Buckman et al., 2018; Xu et al., 2018; Chua et al., 2018) shows competitive performance compared with the best model-free reinforcement learning (MFRL) algorithms (Schulman et al., 2017, 2015; Mnih et al., 2013; Haarnoja et al., 2018a, b) while using significantly fewer environment samples on challenging robotics locomotion benchmarks (Todorov et al., 2012). MFRL algorithms learn complex skills by maximizing a scalar reward designed by human engineers. However, their promising performance requires large amounts of environment interaction, which may take a long time in real-world applications. In such cases, MBRL is appealing because of its superior sample efficiency, which relies on the generalization of a learned predictive dynamics model. However, the asymptotic quality of a policy trained on imagined trajectories is often worse than that of the best MFRL counterparts because the learned model is imperfect.

Recently, Janner et al. (2019) proposed Model-based Policy Optimization (MBPO), including a theoretical framework that encourages short-horizon model usage based on an optimistic assumption of bounded model generalization error under policy shift. Although empirical studies support this assumption, the property is hard to guarantee over the whole state distribution. Moreover, *uniform* sampling of environment states to generate branched model rollouts degrades the *diversity* of the model dataset, especially when the policy shift is small, which makes the policy updates inefficient.

Our main contribution is a practical algorithm, which we call Maximum Entropy Model Rollouts (MEMR), based on the aforementioned insights. MEMR differs from MBPO in two ways: 1) MEMR follows Dyna (Sutton, 1991) and generates only single-step model rollouts, whereas MBPO encourages short-horizon model rollouts. The generalization ability of MEMR is therefore guaranteed by supervised machine learning theory and can be estimated empirically from validation errors (Shalev-Shwartz and Ben-David, 2014). 2) MEMR uses prioritized experience replay (Schaul et al., 2015) to generate *max-diversity* model rollouts for efficient policy updates. We validate this idea on challenging locomotion benchmarks (Todorov et al., 2012). The experimental results show that MEMR matches the asymptotic performance and sample efficiency of MBPO (Janner et al., 2019) while significantly reducing the number of policy updates and model rollouts, which leads to faster learning.

## 2 Preliminaries

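Reinforcement learning algorithms aim to solve a Markov decision process (MDP) with unknown dynamics. An MDP (Sutton and Barto, 2018) is defined as a tuple $(\mathcal{S}, \mathcal{A}, R, P, \rho_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $R(s, a, s')$ defines the intermediate reward when the agent transits from state $s$ to $s'$ by taking action $a$, $P(s' \mid s, a)$ defines the probability that the agent transits from state $s$ to $s'$ by taking action $a$, and $\rho_0$ defines the starting state distribution. The objective of reinforcement learning is to select a policy $\pi$ such that

$$J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right] \tag{1}$$

is maximized, where $\gamma \in [0, 1)$ is a discount factor.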

### 2.1 Prioritized Experience Replay

Prioritized experience replay (Schaul et al., 2015) was introduced to increase the learning efficiency of DQN (Mnih et al., 2013) by sampling each transition with probability proportional to its absolute TD error (Watkins and Dayan, 1992). To avoid overfitting, stochastic prioritization is used, and the resulting bias is corrected via annealed importance sampling. In this work, we adopt the same idea with a custom prioritization criterion chosen so that the joint entropy of the states and actions in the model dataset is maximized.

### 2.2 Model-based Policy Optimization

Model-based policy optimization (MBPO) (Janner et al., 2019) achieves state-of-the-art sample efficiency and matches the asymptotic performance of MFRL approaches. MBPO optimizes a policy with soft actor-critic (SAC) (Haarnoja et al., 2018a) under the data distribution collected by unrolling the learned dynamics model using the current policy. However, this sample efficiency comes at the cost of 2.5x to 5x more policy updates than SAC (Haarnoja et al., 2018a) and a large number of model rollouts, which significantly decreases the training speed. To mitigate this bottleneck, we analyze the model usage and the model rollout distribution, and propose insights on how to improve the computational efficiency of MBPO.

#### Model usage.

In MBPO, the learned dynamics model is used to generate branched model rollouts with short horizons (Janner et al., 2019). Although Janner et al. (2019) presented a theoretical analysis that bounds the performance of a policy trained on model-generated rollouts, over-exploitation of model generalization cannot be eliminated. In this work, *one of our core ideas is to rely on the learned model only to generate one-step rollouts, which we interpret as model-based exploration.* The nice property of this model usage is its naturally bounded model generalization error, which can be estimated in practice on a validation dataset (Shalev-Shwartz and Ben-David, 2014).
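The one-step model usage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `toy_policy` and `toy_model_step` are hypothetical stand-ins for the current policy and the learned dynamics model, and start states are drawn uniformly here for brevity.

```python
import random

def one_step_rollouts(env_states, policy, model_step, num_rollouts, rng):
    """Branch from real states and query the learned model exactly once per state."""
    model_data = []
    for _ in range(num_rollouts):
        s = rng.choice(env_states)        # start from a state seen in the real environment
        a = policy(s)                     # act with the current policy
        s_next, r = model_step(s, a)      # single model query, so errors cannot compound
        model_data.append((s, a, r, s_next))
    return model_data

# Hypothetical toy stand-ins: 1-D state, linear dynamics, quadratic cost.
def toy_policy(s):
    return -0.5 * s

def toy_model_step(s, a):
    s_next = s + 0.1 * a
    return s_next, -(s_next ** 2)

rng = random.Random(0)
env_states = [rng.uniform(-1.0, 1.0) for _ in range(100)]
rollouts = one_step_rollouts(env_states, toy_policy, toy_model_step, 8, rng)
```

Because every imagined transition starts from a real state and ends after one model query, the model is only ever evaluated on inputs of the kind its validation error is measured on.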

#### Model rollout distribution.

MBPO (Janner et al., 2019) generates model rollouts from uniformly sampled true states (states encountered in the real environment, as opposed to imagined states generated by the model). This potentially generates a large amount of similar data when the policy and the learned model change slowly as training progresses. As a result, the efficiency of the policy updates deteriorates. In this work, *we propose to sample true states to generate single-step model rollouts such that the joint entropy of the states and actions in the model dataset is maximized.* The intuition is to increase the "diversity" of the model dataset, from which the policy can benefit for efficient learning.

## 3 Maximum Entropy Model Rollouts

In this section, we present the technical details of Maximum Entropy Model Rollouts (MEMR) for model-based policy optimization. First, we state the Maximum Entropy Sampling Theorem to motivate our choice of prioritization criterion. Based on this theoretical analysis, we then propose a practical implementation and discuss the challenges posed by runtime complexity along with their fixes.

### 3.1 Maximum Entropy Sampling Criteria

We begin by considering the following problem definition:

###### Problem 3.1 (Maximum Entropy Sampling).

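Let $\mathcal{D}_{env} = \{s_i\}_{i=1}^{N}$ be the collection of all the states in the environment dataset. Let $\mathcal{D}_{model} = \{(s_j, a_j)\}_{j=1}^{M}$ be the collection of all the state-action pairs in the model dataset (the tasks considered in this work are deterministic, so we omit the next state $s'$ for simplicity). Assume that for each state $s_i$ in $\mathcal{D}_{env}$, we sample an action $a_i$ using the current policy, denoted as $a_i \sim \pi(\cdot \mid s_i)$. Assume we parameterize the policy distribution derived from the model dataset as a Gaussian distribution with diagonal covariance: $\pi_{model}(a \mid s) = \mathcal{N}\left(\mu(s), \sigma^2(s)\right)$. Let the joint entropy of the state-action pairs in the model dataset be $H(s, a)$. Now we select $s_i$ from $\mathcal{D}_{env}$ and add $(s_i, a_i)$ to $\mathcal{D}_{model}$. Let the joint entropy of the new $\mathcal{D}_{model}$ be $H'(s, a)$. The optimal sampling criterion problem is to choose the index $i$ such that $H'(s, a) - H(s, a)$ is maximized.

###### Theorem 3.1 (Maximum Entropy Sampling Theorem).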

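Assume the model dataset size $M = |\mathcal{D}_{model}|$ satisfies $M \gg 1$ such that the state distributions of $\mathcal{D}_{model}$ and the new $\mathcal{D}_{model}$ are identical, then

$$\arg\max_i \left[H'(s, a) - H(s, a)\right] = \arg\max_i \left[-\ln\left(\sigma_{s_i}\, \pi_{model}(a_i \mid s_i)\right)\right] \tag{2}$$

where $\pi_{model}(a_i \mid s_i)$ is the probability of the model data policy at $(s_i, a_i)$ and $\sigma_{s_i}$ is the standard deviation of the conditional distribution $\pi_{model}(\cdot \mid s_i)$ at $s_i$.

### 3.2 Practical Implementation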

Theorem 3.1 provides a mathematically justified criterion for selecting states from the environment dataset for rollout generation so as to maximize the "diversity" of the model dataset, yet it poses several practical challenges: 1) It requires a full sweep of all the states in the environment dataset before each sampling step, which is $O(|\mathcal{D}_{env}|)$. This is problematic because $|\mathcal{D}_{env}|$ grows linearly as training progresses. 2) Stochastic gradient descent assumes uniform sampling from the data distribution, whereas prioritized sampling breaks this assumption and introduces bias. 3) Training the model data distribution to convergence is expensive but crucial before evaluating the priority. A complete algorithm that handles these practical challenges is presented in Algorithm 1.

#### Stochastic prioritization.

Inspired by Schaul et al. (2015), we only update the priorities of the states that have just been sampled, which avoids an expensive full sweep before each sampling step. An immediate consequence of this approach is that certain states with low priorities will not be sampled for a very long time, which potentially leads to overfitting. Following Schaul et al. (2015), we use stochastic prioritization that interpolates between pure greedy and uniform sampling, with the following probability of sampling state $s_i$:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \tag{3}$$

where $p_i > 0$ is the priority of state $s_i$ and action $a_i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case. According to Theorem 3.1, we compute $p_i$ as

(4) |
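As a minimal sketch of stochastic prioritization, the snippet below turns positive priorities into sampling probabilities using the proportional scheme of Schaul et al. (2015), $P(i) = p_i^{\alpha} / \sum_k p_k^{\alpha}$. The priority values are hypothetical placeholders chosen for illustration and are not computed from the criterion of Theorem 3.1.

```python
import random

def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [x / total for x in scaled]

def sample_indices(priorities, alpha, batch_size, rng):
    """Draw state indices (with replacement) according to the prioritized distribution."""
    probs = sampling_probabilities(priorities, alpha)
    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    return indices, probs

priorities = [0.1, 1.0, 2.0, 4.0]            # hypothetical positive priorities
uniform = sampling_probabilities(priorities, alpha=0.0)
skewed = sampling_probabilities(priorities, alpha=0.6)
indices, _ = sample_indices(priorities, 0.6, 5, random.Random(0))
```

With $\alpha = 0$ every state is equally likely, while larger $\alpha$ shifts probability mass toward high-priority states without ever assigning zero probability to the others.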

#### Correcting the bias.

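Using prioritized sampling introduces bias when fitting the Q network of SAC. Inspired by Schaul et al. (2015), we apply weighted importance sampling (IS) when calculating the loss of the Q network, where the weight for sample $i$ is

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta} \tag{5}$$

where $N$ is the number of states in the environment dataset, $P(i)$ is the probability of sampling state $s_i$, and $\beta$ is the annealed importance-sampling exponent.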
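A small sketch of the bias correction, assuming the weight form of Schaul et al. (2015), $w_i = (N \cdot P(i))^{-\beta}$, and the linear annealing of $\beta$ from 0.4 to 1.0 listed in Appendix D. The function names are illustrative, and the normalization by the maximum weight that Schaul et al. (2015) additionally apply for stability is omitted.

```python
def importance_weights(probs, beta):
    """w_i = (1 / (N * P(i)))^beta, where N is the number of candidates."""
    n = len(probs)
    return [(1.0 / (n * p)) ** beta for p in probs]

def linear_beta(step, total_steps, start=0.4, end=1.0):
    """Linearly anneal beta from `start` to `end` over `total_steps`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

probs = [0.1, 0.2, 0.3, 0.4]                       # hypothetical sampling probabilities
w_full = importance_weights(probs, beta=1.0)       # full correction
w_none = importance_weights(probs, beta=0.0)       # no correction
```

Rarely sampled states receive weights above one and frequently sampled states receive weights below one, so the weighted Q loss approximates the loss under uniform sampling as $\beta$ approaches 1.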

#### Segmented replay buffer.

According to Algorithm 1, we update the priorities after sampling states from the environment dataset to perform model rollouts. Thus, the sampling distribution differs between rounds of model rollout generation. If we performed policy updates on a batch sampled at random from the whole model dataset, the batch would mix data generated under different distributions, and the importance weights would be incorrect. To fix this, we introduce a segmented replay buffer that groups the rollouts generated in the same round into one segment. When sampling for policy updates, we first sample a segment index at random and then sample a batch from that segment.
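The segmented replay buffer can be sketched as below. This is an illustrative data structure under assumptions, not the authors' implementation: the class name, the `max_segments` eviction rule and the storage of one importance weight per transition are all hypothetical choices.

```python
import random
from collections import deque

class SegmentedReplayBuffer:
    """Keeps each round of rollouts in its own segment so that a batch
    never mixes data generated under different sampling distributions."""

    def __init__(self, max_segments):
        self.segments = deque(maxlen=max_segments)   # the oldest segment is evicted first

    def add_segment(self, transitions, weights):
        assert len(transitions) == len(weights)
        self.segments.append(list(zip(transitions, weights)))

    def sample(self, batch_size, rng):
        segment = self.segments[rng.randrange(len(self.segments))]  # 1) pick a segment
        k = min(batch_size, len(segment))
        return rng.sample(segment, k)                                # 2) batch within it

buffer = SegmentedReplayBuffer(max_segments=2)
buffer.add_segment([("round0", i) for i in range(10)], [1.0] * 10)
buffer.add_segment([("round1", i) for i in range(10)], [0.5] * 10)
batch = buffer.sample(4, random.Random(0))
```

Every element of `batch` carries the importance weight computed under the sampling distribution of its own round, which is what keeps the weighted Q loss consistent.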

#### Training model derived policy distribution.

Fitting the model data policy $\pi_{model}$ on the model dataset $\mathcal{D}_{model}$ via maximum likelihood until convergence is costly, since $\mathcal{D}_{model}$ is large and this operation must be performed every time we generate model rollouts. Since the data in the model buffer is replaced rapidly, we treat the fitting as an online learning procedure and only perform several gradient updates on the newly stored data.

## 4 Experiments

Our experimental evaluation aims to study the following question: How well does MEMR perform on RL benchmarks, compared to state-of-the-art model-based and model-free algorithms, in terms of sample efficiency, asymptotic performance and computation efficiency?

We evaluate MEMR on MuJoCo benchmarks (Todorov et al., 2012) and compare it with the state-of-the-art model-based method, MBPO (Janner et al., 2019). As shown in Figure 2, MEMR matches the asymptotic performance of MBPO while using fewer policy updates and only a fraction of the model rollouts. This indicates that MEMR makes more efficient use of model rollout data for policy updates, which also translates into a training speedup. Compared with the state-of-the-art model-free method, SAC (Haarnoja et al., 2018a), MEMR matches both the asymptotic performance and the data efficiency.

## References

- Buckman et al. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. CoRR abs/1807.01675.
- Chua et al. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR abs/1805.12114.
- de Boer et al. (2004). A tutorial on the cross-entropy method. Annals of Operations Research 134.
- Deisenroth and Rasmussen (2011). PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, Madison, WI, USA, pp. 465–472. ISBN 9781450306195.
- Feinberg et al. (2018). Model-based value estimation for efficient model-free reinforcement learning. CoRR abs/1803.00101.
- Haarnoja et al. (2018a). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290.
- Haarnoja et al. (2018b). Soft actor-critic algorithms and applications. CoRR abs/1812.05905.
- Janner et al. (2019). When to trust your model: model-based policy optimization. CoRR abs/1906.08253.
- Kaiser et al. (2019). Model-based reinforcement learning for Atari. CoRR abs/1903.00374.
- Kakade (2001). A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS'01, Cambridge, MA, USA, pp. 1531–1538.
- Kurutach et al. (2018). Model-ensemble trust-region policy optimization. CoRR abs/1802.10592.
- Levine and Koltun (2013). Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 1–9.
- Mnih et al. (2013). Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
- Nagabandi et al. (2017). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR abs/1708.02596.
- Rao (2010). A survey of numerical methods for optimal control. Advances in the Astronautical Sciences 135.
- Schaul et al. (2015). Prioritized experience replay. arXiv:1511.05952. Published at ICLR 2016.
- Schulman et al. (2015). Trust region policy optimization. CoRR abs/1502.05477.
- Schulman et al. (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347.
- Shalev-Shwartz and Ben-David (2014). Understanding machine learning: from theory to algorithms. Cambridge University Press, USA. ISBN 1107057132.
- Sutton and Barto (2018). Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA. ISBN 0262039249.
- Sutton (1991). Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull. 2 (4), pp. 160–163. ISSN 0163-5719.
- Todorov et al. (2012). MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- Wang and Ba (2019). Exploring model-based planning with policy networks. CoRR abs/1906.08649.
- Watkins and Dayan (1992). Q-learning. Machine Learning 8 (3), pp. 279–292.
- Williams et al. (2015). Model predictive path integral control using covariance variable importance sampling. CoRR abs/1509.01149.
- Xu et al. (2018). Algorithmic framework for model-based reinforcement learning with theoretical guarantees. CoRR abs/1807.03858.

## Appendix A Maximum Entropy Sampling Criteria

In this section, we present the proof of Theorem 3.1. We begin with a useful lemma:

###### Lemma A.1 (Entropy Gain of Gaussian distribution).

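Suppose a random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown. Now suppose we have $N$ observations $x_1, \dots, x_N$ and obtain an estimate of the distribution, denoted as $\mathcal{N}(\mu_N, \sigma_N^2)$. Suppose we then have one more observation $x_{N+1}$ and obtain a new estimate of the distribution using $x_1, \dots, x_{N+1}$, denoted as $\mathcal{N}(\mu_{N+1}, \sigma_{N+1}^2)$. Let the density of $\mathcal{N}(\mu_N, \sigma_N^2)$ be $p(x)$. Let the differential entropies of $\mathcal{N}(\mu_N, \sigma_N^2)$ and $\mathcal{N}(\mu_{N+1}, \sigma_{N+1}^2)$ be $H_N$ and $H_{N+1}$. Let $\Delta H = H_{N+1} - H_N$. Then, $\Delta H = \frac{1}{2}\ln\frac{N}{N+1} + \frac{1}{2}\ln\left(1 - \frac{2}{N+1}\ln\left(\sqrt{2\pi}\,\sigma_N\, p(x_{N+1})\right)\right)$.

###### Proof.

According to the maximum likelihood estimation of the Gaussian distribution, we obtain $\mu_N = \frac{1}{N}\sum_{i=1}^{N} x_i$, $\sigma_N^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_N)^2$, $\mu_{N+1} = \frac{1}{N+1}\sum_{i=1}^{N+1} x_i$, $\sigma_{N+1}^2 = \frac{1}{N+1}\sum_{i=1}^{N+1}(x_i - \mu_{N+1})^2$. Note that $\mu_{N+1} = \frac{N\mu_N + x_{N+1}}{N+1}$ and that $\ln\left(\sqrt{2\pi}\,\sigma_N\, p(x)\right) = -\frac{(x - \mu_N)^2}{2\sigma_N^2}$. Then,

$$\Delta H = H_{N+1} - H_N \tag{6}$$

$$= \frac{1}{2}\ln\left(2\pi e\,\sigma_{N+1}^2\right) - \frac{1}{2}\ln\left(2\pi e\,\sigma_N^2\right) \tag{7}$$

$$= \frac{1}{2}\ln\frac{\sigma_{N+1}^2}{\sigma_N^2} \tag{8}$$

$$= \frac{1}{2}\ln\frac{\sum_{i=1}^{N+1}(x_i - \mu_{N+1})^2}{(N+1)\,\sigma_N^2} \tag{9}$$

$$= \frac{1}{2}\ln\frac{\sum_{i=1}^{N}(x_i - \mu_N)^2 + \frac{N}{N+1}(x_{N+1} - \mu_N)^2}{(N+1)\,\sigma_N^2} \tag{10}$$

$$= \frac{1}{2}\ln\left(\frac{N}{N+1} + \frac{N}{(N+1)^2}\cdot\frac{(x_{N+1} - \mu_N)^2}{\sigma_N^2}\right) \tag{11}$$

$$= \frac{1}{2}\ln\left(\frac{N}{N+1} - \frac{2N}{(N+1)^2}\ln\left(\sqrt{2\pi}\,\sigma_N\, p(x_{N+1})\right)\right) \tag{12}$$

$$= \frac{1}{2}\ln\frac{N}{N+1} + \frac{1}{2}\ln\left(1 - \frac{2}{N+1}\ln\left(\sqrt{2\pi}\,\sigma_N\, p(x_{N+1})\right)\right) \tag{13}$$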

∎
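As a numerical sanity check of the entropy gain of a Gaussian, the sketch below compares the gain computed directly from two maximum likelihood fits with the closed form $\Delta H = \frac{1}{2}\ln\frac{N}{N+1} + \frac{1}{2}\ln\left(1 - \frac{2}{N+1}\ln(\sqrt{2\pi}\,\sigma_N\, p(x_{N+1}))\right)$, which follows from the Gaussian MLE update. The sample data and function names are illustrative assumptions.

```python
import math
import random

def mle(xs):
    """Maximum likelihood mean and (biased) variance of a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gaussian_entropy(var):
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def entropy_gain_direct(xs, x_new):
    """Refit with the extra observation and subtract the two entropies."""
    _, var_old = mle(xs)
    _, var_new = mle(xs + [x_new])
    return gaussian_entropy(var_new) - gaussian_entropy(var_old)

def entropy_gain_closed_form(xs, x_new):
    """Closed form in terms of the old fit's density at the new observation."""
    n = len(xs)
    mu, var = mle(xs)
    sigma = math.sqrt(var)
    density = math.exp(-(x_new - mu) ** 2 / (2.0 * var)) / (math.sqrt(2.0 * math.pi) * sigma)
    log_term = math.log(math.sqrt(2.0 * math.pi) * sigma * density)
    return 0.5 * math.log(n / (n + 1)) + 0.5 * math.log(1.0 - 2.0 * log_term / (n + 1))

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]
gains = [(entropy_gain_direct(xs, x), entropy_gain_closed_form(xs, x)) for x in (-3.0, 0.2, 5.0)]
```

An observation near the current mean slightly lowers the estimated entropy, while an observation in a low-density region raises it, which is the intuition behind preferring low-probability actions.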

###### Theorem A.2 (Maximum Entropy Sampling Criteria).

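Assume the model dataset size $M = |\mathcal{D}_{model}|$ satisfies $M \gg 1$ such that the state distributions of $\mathcal{D}_{model}$ and the new $\mathcal{D}_{model}$ are identical, then

$$\arg\max_i \left[H'(s, a) - H(s, a)\right] = \arg\max_i \left[-\ln\left(\sigma_{s_i}\, \pi_{model}(a_i \mid s_i)\right)\right] \tag{14}$$

where $\pi_{model}(a_i \mid s_i)$ is the probability of the model data policy at $(s_i, a_i)$ and $\sigma_{s_i}$ is the standard deviation of the conditional distribution $\pi_{model}(\cdot \mid s_i)$ at $s_i$.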

###### Proof.

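We begin by expanding the state and action joint entropy of the model dataset

$$H(s, a) = H(s) + H(a \mid s) \tag{15}$$

$$= H(s) + \sum_{\tilde{s}} p(\tilde{s})\, H(a \mid \tilde{s}) \tag{16}$$

where $p(\tilde{s})$ is the state density of the model dataset. Since the state distribution is assumed to be unchanged by the addition, only the conditional entropy at $s_i$ changes. According to Lemma A.1, we obtain

$$H'(s, a) - H(s, a) = \frac{p(s_i)}{2}\left[\ln\frac{N}{N+1} + \ln\left(1 - \frac{2}{N+1}\ln\left(\sqrt{2\pi}\,\sigma_{s_i}\, \pi_{model}(a_i \mid s_i)\right)\right)\right] \tag{17}$$

Note that $N$ in Lemma A.1 denotes the number of occurrences of state $s_i$ in the model dataset, which is equal to $M\, p(s_i)$, where $M$ is the model dataset size. This is a rough density estimation and more accurate methods are left for future work. Because $M \gg 1$, expanding both logarithms to first order in $1/N$ gives $H'(s, a) - H(s, a) \approx -\frac{1}{2M}\left[1 + 2\ln\left(\sqrt{2\pi}\,\sigma_{s_i}\, \pi_{model}(a_i \mid s_i)\right)\right]$, in which $p(s_i)$ cancels. Thus,

$$\arg\max_i \left[H'(s, a) - H(s, a)\right] = \arg\max_i \left[-\ln\left(\sigma_{s_i}\, \pi_{model}(a_i \mid s_i)\right)\right] \tag{18}$$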

∎
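To make the selection rule concrete, the sketch below scores candidate state-action pairs by $-\ln(\sigma\, \pi_{model}(a \mid s))$ for a diagonal Gaussian model data policy. Taking $\sigma$ as the product of the per-dimension standard deviations is an assumption made here for the multi-dimensional case, and the candidate values are hypothetical. Under that assumption the score equals half the squared standardized distance of the action from the mean, plus a constant.

```python
import math

def diag_gaussian_log_prob(action, mean, std):
    """Log density of a diagonal Gaussian evaluated at `action`."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )

def diversity_score(action, mean, std):
    """-ln(prod_d sigma_d * pi(a|s)); a larger score means a larger estimated entropy gain."""
    log_sigma = sum(math.log(s) for s in std)
    return -(log_sigma + diag_gaussian_log_prob(action, mean, std))

def select_most_diverse(candidates):
    """Return the index of the highest-scoring (action, mean, std) candidate and all scores."""
    scores = [diversity_score(a, m, s) for a, m, s in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

candidates = [
    ([0.1, 0.0], [0.0, 0.0], [1.0, 1.0]),   # close to the mean
    ([2.0, 0.0], [0.0, 0.0], [4.0, 1.0]),   # far in raw units, but sigma is large
    ([0.5, 0.5], [0.0, 0.0], [0.2, 0.2]),   # small raw offset, but sigma is tiny
]
best, scores = select_most_diverse(candidates)
```

The third candidate wins even though its raw offset is small, because the score measures distance in units of the local standard deviation rather than in raw action units.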

## Appendix B Related Work

Model-free reinforcement learning (MFRL) learns the optimal policy either by directly taking the gradient of the objective function (Kakade, 2001; Schulman et al., 2015, 2017) or by estimating the state-action value function and deriving the optimal policy from it (Mnih et al., 2013; Haarnoja et al., 2018a). These approaches require large amounts of environment interaction, which makes them unsuitable for environments where sampling is expensive. In contrast, model-based reinforcement learning (MBRL) learns a dynamics model and either directly performs model predictive control (MPC) (Nagabandi et al., 2017; Chua et al., 2018; Deisenroth and Rasmussen, 2011) or derives the policy from model-generated rollouts (Kurutach et al., 2018; Janner et al., 2019; Buckman et al., 2018; Xu et al., 2018).

Learning an accurate dynamics model is often the bottleneck for MBRL to match the asymptotic performance of its MFRL counterparts. Although Gaussian processes have been shown to be effective in the low-dimensional data regime (Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013), they are hard to generalize to high-dimensional environments such as those of Todorov et al. (2012) and Mnih et al. (2013). Nagabandi et al. (2017) first use a deterministic neural network dynamics model for model predictive control in robotics, and Chua et al. (2018) improve on the idea with probabilistic ensemble models that match the asymptotic performance of MFRL baselines. Wang and Ba (2019) combine policy networks with online planning and achieve even better performance on challenging benchmarks. Common MPC or planning methods include the shooting method (Rao, 2010), the cross-entropy method (de Boer et al., 2004) and model predictive path integral control (Williams et al., 2015). Such planning methods can over-exploit the learned dynamics on long-horizon predictions, which may impair performance. They are also computationally expensive to run in real time. In such cases, learning a policy, e.g. one parameterized by a neural network, is desirable for better generalization.

Dyna-style MBRL uses learned dynamics to generate model rollouts from which a good policy is learned (Sutton, 1991). Feinberg et al. (2018) and Buckman et al. (2018) use the model to better estimate the value function in order to improve sample efficiency. Kurutach et al. (2018) optimize the policy network with a policy gradient algorithm on trajectories generated by the models. Kaiser et al. (2019) propose a video prediction network for model-based Atari games. Xu et al. (2018) develop a theoretical framework for MBRL that provides monotonic improvement to a local maximum of the expected reward. Model-based policy optimization (MBPO) (Janner et al., 2019) achieves state-of-the-art sample efficiency and matches the asymptotic performance of MFRL approaches. MBPO optimizes a policy with soft actor-critic (SAC) (Haarnoja et al., 2018a) under the data distribution collected by unrolling the learned dynamics model using the current policy. Our approach combines Janner et al. (2019) and Sutton (1991) by proposing a non-trivial sampling approach that significantly reduces the number of policy updates and model rollouts needed to reach asymptotic performance.

## Appendix C Ablation Study

In this section, we conduct ablation studies of our proposed method. We primarily analyze how the performance of our algorithm changes when varying the size of the model dataset, the number of policy updates per environment step and the prioritization strength $\alpha$.

#### Model dataset size

The size of the model dataset affects how fast the algorithm converges. Since SAC is an off-policy algorithm, the same experience is expected to be revisited several times on average (Schaul et al., 2015). A small dataset hinders learning because each transition resides in the buffer for only a short period. On the other hand, a large model dataset decreases the sample diversity of each batch used to perform policy updates.

#### Number of policy updates per environment step

As shown in Figure 2, MEMR converges as fast as SAC in terms of policy updates. Surprisingly, we found that increasing the number of policy updates per environment step does not increase the convergence speed, as shown in Figure 3. This indicates that MBPO (Janner et al., 2019) spends much of its computation on less informative model rollouts that barely help the learning of the value functions in SAC.

#### Prioritization strength

Strong prioritization leads to overfitting to a local optimum, whereas weak prioritization leads to less diverse model rollouts. As shown in Figure 3, we found that $\alpha = 0.6$ works best for all benchmarks.

## Appendix D Hyperparameter Settings

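| Hyperparameter | HalfCheetah-v2 | Walker2d-v2 | Ant-v2 | Hopper-v2 |
| --- | --- | --- | --- | --- |
| Total number of steps | 400000 | 300000 | 300000 | 125000 |
| Model update frequency | 250 | 250 | 250 | 250 |
| Model rollouts per environment step | 400 | 400 | 400 | 400 |
| Prioritization strength ($\alpha$) | 0.6 | 0.6 | 0.6 | 0.6 |
| Importance weights annealing ($\beta$) | Linear anneal from 0.4 to 1.0 | Linear anneal from 0.4 to 1.0 | Linear anneal from 0.4 to 1.0 | Linear anneal from 0.4 to 1.0 |
| Number of epochs to update model data policy | 2 | 2 | 2 | 2 |
| Policy updates per environment step | 5 | 5 | 5 | 5 |

The size of the model dataset is 3e6, 6e6 or 1e6, depending on the environment.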
