We consider agents in a human-like setting, where agents must act and learn in the world continuously, with limited computational resources. All decisions are made online; there are no discrete episodes. Furthermore, the world is vast – too large to explore exhaustively – and changes over the course of the agent’s lifetime, much as a robot’s actuators might deteriorate with continued use. There are no resets to wipe away past errors, and mistakes are costly, as they compound downstream. To perform well at reasonable computational cost, the agent must combine its past experience with new information about the world to make careful, yet performant, decisions.
Non-stationary worlds require algorithms that are fundamentally robust to changes in dynamics. Factors that would lead to a change in the environment may either be too difficult or principally undesirable to model: for example, humans might interact with the robot in unpredictable ways, or furniture in a robot’s environment could be rearranged. Therefore, we assume that the world can change unpredictably in ways that cannot be learned, and focus on developing algorithms that instead handle these changes gracefully, without using extensive computation.
Model-based trajectory optimization via planning is useful for quickly learning control, but is computationally demanding and can lead to bias due to the finite planning horizon. Model-free reinforcement learning is sample inefficient, but capable of cheaply accessing past experience without sacrifices to asymptotic performance. Consequently, we would like to distill expensive experience from an intelligent planner into neural networks to reduce computation for future decision making.
Deciding when to use the planner vs. a learned policy presents a difficult challenge, as it is hard to determine how much the planner would improve on the policy without actually running the planner. We tackle this as a problem of uncertainty. When uncertain about a course of action, humans use extended model-based search to evaluate long-term trajectories, but fall back on habitual behaviors learned with model-free paradigms when they are certain of what to do (Banks and Hope, 2014; Daw et al., 2005; Dayan and Berridge, 2014; Kahneman, 2003). By measuring this uncertainty, we can make informed decisions about when to use model-based planning vs. a model-free policy.
Our approach combines model-based planning with model-free policy learning, along with an adaptive computation mechanism, to tackle this setting. Like a robot that is well-calibrated when first coming out of a factory, we give the agent access to a ground-truth dynamics model that lacks information about future changes to the dynamics, such as the different settings in which the robot may be deployed. This allows us to make progress on continual learning under finite computation, without the complications introduced by model learning. The dynamics model is updated immediately when the world changes. However, as we show empirically, knowing the dynamics alone falls far short of success at this task.
We present a new algorithm, Adaptive Online Planning (AOP), that links Model Predictive Path Integral control (MPPI) (Williams et al., 2015), a model-based planner, with Twin Delayed DDPG (TD3) (Fujimoto et al., 2018b), a model-free policy learning method. We combine the model-based planning method of iteratively updating a planned trajectory with the model-free method of updating the network weights to develop a unified update rule formulation that is amenable to reduced computation when combined with a switching mechanism. We inform this mechanism with the uncertainty given by an ensemble of value functions. Access to the ground truth model is not sufficient by itself, as we show that PPO (Schulman et al., 2017) and TD3 perform poorly, even with the ground truth model. We demonstrate empirically that AOP is capable of integrating the two methodologies to reduce computation while achieving and maintaining strong performance in non-stationary worlds, outperforming other model-based planning methods and avoiding the empirical performance degradation of policy learning methods.
Our contributions include the proposal of a new algorithm combining model-based planning and model-free learning, the introduction of evaluation environments that target the challenges of lifelong reinforcement learning in which traditional methods struggle, and experiments showing the usefulness of utilizing both model-based and model-free methods in this setting. Code to run all experiments and video results are available at https://sites.google.com/berkeley.edu/aop/home.
We consider the world as an infinite-horizon Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, r, T, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is the reward function, $T$ are the transition probabilities, and $\gamma$ is the discount factor. The world can change over time: the transitions and rewards may change to some new $(T', r')$. Unlike traditional reinforcement learning, the agent’s state is not reset at these world changes. The agent can generate rollouts using the current $(T, r)$ starting from its current state, but not for future $(T', r')$. The agent’s goal is to execute a future sequence of actions, summarized as a policy $\pi$, that maximizes the expected future return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
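As a small illustration, the discounted return of an observed rollout can be estimated as follows (a minimal sketch; `rewards` is a hypothetical list of per-step rewards from one rollout):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards over an observed (finite) slice of an
    infinite-horizon rollout: sum_t gamma^t * r_t."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards, dtype=float)))
```

In practice, a terminal value estimate is added after the final observed step to account for the truncated tail of the infinite horizon.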
2.1 Continual Online Learning
In our work, we consider online learning in the same style as Lowrey et al. (2018), where both acting and learning must occur on a per-timestep basis, and there are no episodes that reset the state. At each timestep, the agent must execute its training procedure, and is then forced to immediately output an action. We also desire agents that are equipped, like humans, to handle different tasks in various environments. Continual learning is difficult, as agents must use their experience to learn to perform well in new tasks (forward transfer) while preserving the ability to perform well in old tasks (backward transfer). In addition to these difficulties, there is also the challenge of avoiding failure sink states that prevent future learning progress. We augment this task with a world where the dynamics continually change, creating a difficult setting for agent learning.
2.2 Model-Based Planning
Online model-based planning evaluates future sequences of actions using a model, develops a projected future trajectory over some time horizon, and then executes the first action of that trajectory, before repeating. We specifically focus on Model Predictive Control (MPC), which iteratively applies Gaussian noise to the prior predicted trajectory, evaluates the resulting candidate trajectories using the dynamics model, and combines them with an update rule. When the update rule is a softmax weighting, this procedure is called Model Predictive Path Integral (MPPI) control (Williams et al., 2015). Due to the nature of this iterative and extended update, this procedure is computationally expensive.
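As a rough sketch of the MPPI-style update described above (simplified: it omits the control-cost term and temporally correlated noise of the full method; `rollout_fn` is a hypothetical function that scores an action sequence under the dynamics model):

```python
import numpy as np

def mppi_update(plan, rollout_fn, n_samples=40, noise_std=0.1, temperature=0.01):
    """One MPPI iteration: perturb the current plan with Gaussian noise,
    score each candidate with the model, and softmax-average the candidates."""
    H, act_dim = plan.shape
    noise = np.random.randn(n_samples, H, act_dim) * noise_std
    candidates = plan[None] + noise                      # (n_samples, H, act_dim)
    returns = np.array([rollout_fn(c) for c in candidates])
    # softmax weighting over candidate returns (higher return -> larger weight)
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, candidates, axes=1)     # weighted new plan
```

Repeating this update for several iterations, and then executing only the first action of the resulting plan, gives the basic MPC loop.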
2.3 Model-Free Policy Optimization
Model-free algorithms encode the agent’s past experiences in a function dependent only on the current state, often in the form of a value function critic and/or policy actor. As a result, such algorithms can have difficulty learning long-term dependencies and struggle early on in training; temporally extended exploration is difficult. In exchange, they attain high asymptotic performance, having shown successes in a variety of tasks for the traditional offline setting (Mnih et al., 2013; Schulman et al., 2017). As a consequence of their compact nature, once learned, these algorithms tend to generate cyclic and regular behaviors, whereas model-based planners have no such guarantees (see Fig. 1 for an example).
We run online versions of TD3 (Fujimoto et al., 2018b) and PPO (Schulman et al., 2017) as baselines to AOP. While there is no natural way to give a policy access to the ground truth model, we allow the policies to train on future trajectories generated via the ground truth model, in similar fashion to algorithms that learn a model for this purpose (Buckman et al., 2018; Kurutach et al., 2018), in order to help facilitate fair comparisons to model-based planners.
2.4 Update Rule Perspective on Planning vs Policy Optimization
From a high-level perspective, the model-based planning and model-free policy optimization procedures are very similar (see Appendix B for a side-by-side comparison). Where the planner generates noisy rollouts to synthesize a new trajectory, the model-free algorithm applies noise to the policy to generate data for learning. After an update step, either an action from the planned trajectory or one call of the policy is executed. These procedures are only distinct in their respective update rules.
The primary contribution of AOP is unifying both update rules to compensate for their individual weaknesses. AOP distills the learned experience from the planner into the off-policy learning method of TD3 and a value function, so that planning and acting can be done cheaper in the future.
3 Adaptive Online Planning
3.1 Model-Based Planning with Terminal Value Approximation
For the model-based component, AOP uses MPPI, as described in Section 2.2, with a terminal value function $\hat{V}$, where trajectories are evaluated in the form of Eq. 1. This process is repeated for several iterations to improve the plan, and then the first action of the plan is executed in the environment.
$\hat{V}$ is generated by an ensemble of value functions (see Eq. 2), as proposed for MPC in POLO (Lowrey et al., 2018). The value ensemble improves the exploration ability of the optimization procedure (Osband et al., 2016, 2018). The log-sum-exp function serves as a soft maximum, enabling exploration while preserving a stable target for learning; its temperature controls how closely it tracks the hard maximum, which ensures that the approximation is still semantically meaningful for estimating the actual value of the state.
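A minimal sketch of this log-sum-exp soft maximum over an ensemble of value estimates (the exact normalization and temperature handling in POLO may differ; `kappa` is an assumed temperature parameter):

```python
import numpy as np

def ensemble_value(values, kappa=1.0):
    """Log-sum-exp ('soft maximum') over an ensemble of value estimates.
    Normalizing by the ensemble size keeps the result between the ensemble
    mean and the ensemble max, so it stays on the scale of a real value."""
    values = np.asarray(values, dtype=float)
    m = values.max()  # subtract the max for numerical stability
    return float(m + kappa * np.log(np.mean(np.exp((values - m) / kappa))))
```

Small `kappa` approaches the hard maximum (optimism, driving exploration); large `kappa` approaches the ensemble mean (a stable learning target).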
3.2 Early Planning Termination
Past model-based planning procedures (Chua et al., 2018; Wang and Ba, 2019) run a fixed number of iterations of MPC per timestep before executing an action in the environment. However, this is often wasteful: within a particular timestep, later planning iterations often improve the planned trajectory less than earlier iterations, and may not improve it at all. We propose instead to decide the number of planning iterations on a per-timestep basis. After generating a new trajectory from the $k$-th iteration of planning, we measure its improvement over the trajectory of the previous iteration (Eq. 3). When this improvement falls below a threshold $\epsilon$, we terminate planning for the current timestep with some fixed probability. A stochastic termination rule provides robustness against local minima that require more extensive planning to escape, a need that may not be evident from early planning iterations.
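The termination rule can be sketched as follows (a minimal illustration; `epsilon` and `stop_prob` are assumed hyperparameters, not values from the paper):

```python
import numpy as np

def should_stop_planning(prev_return, new_return, epsilon, stop_prob,
                         rng=np.random):
    """Stochastic early termination: once an extra planning iteration improves
    the plan's return by less than epsilon, stop with probability stop_prob
    rather than deterministically (robustness against local minima)."""
    improvement = new_return - prev_return
    if improvement >= epsilon:
        return False  # still improving meaningfully: keep planning
    return rng.random() < stop_prob
```

The per-timestep planning loop then calls this after each iteration, executing the first action of the current plan as soon as it returns true.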
3.3 Adaptive Planning Horizon
Since planning over a long time horizon is expensive, it would also be desirable to plan over a shorter horizon when the planner is confident that short-term planning suffices for long-term success. We represent the planner’s uncertainty with the value ensemble from Section 3.1, as the mean and standard deviation of the ensemble capture the epistemic uncertainty about the value of the state (Osband et al., 2018). To use a reduced time horizon, we require that the standard deviation of the value ensemble on the current state be lower than some threshold.
A problem with considering only the standard deviation is that this metric captures uncertainty with respect to the past – it does not immediately measure uncertainty in a changing-dynamics setting, which is only observed when considering experiences in the future. Therefore, we fine-tune the horizon length using the Bellman error (Eq. 4): the time horizon is the largest $H$ such that the Bellman error over the truncated horizon remains below a threshold. When the standard deviation exceeds its threshold, the full time horizon is always used, regardless of the Bellman error. This is to say that, if the value function can accurately approximate the latter part of the horizon, we can use the value function in its place. While the choices for these hyperparameters are somewhat arbitrary, we show in Appendix C.1.1 that AOP is not particularly sensitive to them.
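A sketch of one plausible horizon-selection rule under the stated assumptions (it takes per-step Bellman errors and hypothetical thresholds; the paper's exact criterion in Eq. 4 may aggregate the errors differently):

```python
def choose_horizon(value_std, bellman_errors, std_threshold,
                   bellman_threshold, full_horizon):
    """Pick a planning horizon: if the ensemble disagrees too much about the
    current state (high epistemic uncertainty), plan over the full horizon;
    otherwise use the longest prefix whose per-step Bellman error stays
    below the threshold."""
    if value_std > std_threshold:
        return full_horizon  # too uncertain to trust the value function
    h = 1  # always plan at least one step
    for t, err in enumerate(bellman_errors[:full_horizon], start=1):
        if err <= bellman_threshold:
            h = t
        else:
            break
    return h
```

Intuitively, a low Bellman error over the truncated tail means the value function can stand in for the latter part of the horizon.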
3.4 Off-Policy Model-Free Prior
We use TD3 as a prior for the planning procedure, with the policy learning from the data generated by the planner during planning, which allows the agent to recall past experience quickly. Consistent with past work (Rajeswaran et al., 2017a; Zhu et al., 2018), we found that pure imitation learning can cap the asymptotic performance of the learned policy. As a baseline, we also run behavior cloning (BC) as a prior, and refer to the resulting algorithms as AOP-TD3 and AOP-BC, respectively.
We note that MPC and policy optimization are both special cases of AOP. MPC is equivalent to AOP with a constant setting for the time horizon that always uses full planning iterations (i.e. a threshold of 0). Policy optimization is equivalent to AOP with one planning iteration, since the first plan is a noisy version of the policy, acting as the data collection procedure in standard policy learning.
4 Empirical Evaluations
We investigate several questions empirically:
What are the challenges in the continual lifelong learning setting? When developing further algorithms in this setting, what should we focus on?
How does AOP perform when acting from well-known states, novel states, and in changing worlds? How do traditional on-policy and off-policy methods fare in these situations?
Are the variance and the Bellman error of the value ensemble suitable metrics for determining the planning computational budget?
4.1 Lifelong Learning Environments
We propose three environments for evaluating algorithms in the continual lifelong learning setting – Hopper, Ant, and Maze – and release them for others to use. While these are not overly complex control tasks, they crisply highlight the difficulties of continual lifelong learning in MDPs. Pictures of these environments are included in Appendix D.
Hopper: First, we consider the OpenAI Gym environment Hopper (Brockman et al., 2016). The agent is rewarded based on how closely it matches an unobserved target velocity. Every 4000 timesteps, this target velocity changes. The environment is tough: it can be difficult to get up after falling down in a strange position, and momentum from past actions affects the state greatly, making it easy for the agent to fall over. We consider three versions of the Hopper environment: (1) a standard Hopper with an unchanging target velocity, (2) a novel states Hopper with the target velocity in the observation (so new target velocities correspond to the agent seeing a new state), and (3) a changing worlds Hopper, where the target velocity is not in the observation.
Ant: We also consider the Ant from Gym. The agent seeks to maximize its forward velocity, but a randomly chosen joint is disabled every 2000 timesteps. Once flipped over, getting back up is extremely difficult, which makes this environment harshly unforgiving. We consider two versions: (1) a standard Ant with no disabled joints, and (2) a changing worlds Ant with one changing disabled joint.
Maze: Like in POLO, we test in a 2D point-mass maze, where the agent seeks to reach a goal. The agent observes its own 2D position. Every 500 timesteps, the walls of the maze change, and the goal swaps locations. The difficulty lies in adapting quickly to new mazes while avoiding the negative transfer of old experience. We consider two versions: (1) a novel states Maze, where the walls of the maze remain constant, but new goals are introduced after 20 goal changes in the original positions, and (2) a changing worlds Maze, as described above. We also test both versions in a dense reward and a sparse reward setting, where the reward is either the negative L2 distance or a boolean value, respectively. In the sparse reward Maze, exploration can be particularly challenging.
4.2 Baselines and Ablations
We run AOP-BC, POLO, MPC, TD3, and PPO as baselines against AOP-TD3; they can be seen as ablations/special cases of our proposed algorithm (see Section 3.4). We consider two versions of MPC, with 8 and 3 planning iterations, henceforth referred to as MPC-8 and MPC-3, respectively.
Table 1: Fraction of planning timesteps used, relative to always planning fully (mean, with range across environments in parentheses):

| AOP-TD3 | AOP-BC | POLO | MPC-8 | MPC-3 |
|---|---|---|---|---|
| 11.39% (1.40% – 16.62%) | 11.40% (2.86% – 15.17%) | 37.50% | 100% | 37.50% |

Table 2: Average reward in each environment (S = standard, NS = novel states, CW = changing worlds; (D)/(S) = dense/sparse reward):

| Environment | AOP-TD3 | AOP-BC | POLO | TD3 | PPO | MPC-8 | MPC-3 |
|---|---|---|---|---|---|---|---|
| S Hopper | 0.12 ± 0.16 | 0.33 ± 0.22 | 0.51 | 0.23 | -14.41 | 0.36 | 0.19 |
| NS Hopper | 0.41 ± 0.18 | 0.53 ± 0.18 | 0.59 | 0.40 | -14.22 | -0.28 | -0.49 |
| CW Hopper | 0.48 ± 0.24 | 0.45 ± 0.12 | 0.57 | -2.42 | -13.14 | -0.30 | -0.48 |
| S Ant | 3.02 ± 0.13 | 3.38 ± 0.27 | 3.40 | 2.19 | n/a | 3.52 | 3.40 |
| CW Ant | 2.76 ± 0.47 | 3.11 ± 0.41 | 2.90 | 2.05 | n/a | 3.32 | 3.14 |
| NS Maze (D) | -0.21 ± 0.08 | -0.25 ± 0.02 | -0.25 | -1.81 | -2.14 | -0.19 | -0.25 |
| CW Maze (D) | -0.29 ± 0.07 | -0.34 ± 0.03 | -0.30 | -1.17 | -2.10 | -0.19 | -0.30 |
| NS Maze (S) | 0.85 ± 0.07 | 0.70 ± 0.06 | 0.62 | -0.68 | -0.88 | 0.69 | 0.61 |
| CW Maze (S) | 0.69 ± 0.20 | 0.56 ± 0.04 | 0.57 | -0.66 | -0.74 | 0.58 | 0.52 |
4.3 Challenges in Continual Lifelong Learning Setting
Planner usage is shown in Table 1 and rewards are in Table 2. AOP uses only about 11% of the planning timesteps of MPC-8, but achieves generally comparable or stronger performance in most environments. More detailed graphs can be found in Appendix A.
Reset-Free Setting: Even with model access, these environments are challenging for the algorithms to learn. In the standard offline (episodic) reinforcement learning setting, long-term action dependencies are learned from past experience over time, and this experience can be utilized when the agent resets to the initial state. In the online setting, however, these dependencies must be learned on the fly, and if the agent falls, it must first return to the prior state in order to use that information. In particular, falling is catastrophic in the Ant environment, as it takes a complex action sequence to return to standing. POLO-style optimistic exploration can thus be a disadvantage, encouraging the Ant to take on new and unstable behaviors. In spite of this, AOP, with roughly a third of the planning of POLO, achieves comparable performance to POLO; AOP-BC achieves very strong performance in general.
Vast Worlds: The performance gain of MPC-8 over MPC-3 shows that achieving strong performance is difficult with constrained computation. In the sparse mazes, MPC is significantly outperformed by AOP-TD3, and the model-free algorithms struggle to make any progress at all, showing their lackluster exploration. Even POLO – the exploration mechanism of AOP – faces weaker performance, indicating that AOP-TD3 has not only correctly identified when planning is important, but is able to effectively leverage additional computation to increase its performance whilst still using less overall computation. The additional performance in the novel states Maze (S) over MPC-8 also shows AOP’s ability to consolidate experience to improve performance in mazes it has seen before. Furthermore, in the changing worlds Maze (S), the performance of AOP improves over time (Fig. A.2), indicating that AOP has learned value and policy functions for effective forward transfer.
Policy Degradation: TD3’s performance significantly degrades in the changing worlds settings, as does PPO’s (see Fig. 2). PPO, an on-policy method, struggles in general. In the novel states Hopper, the variant where the policy is capable of directly seeing the target velocity, TD3 performs very well, even learning to outperform MPC. However, without the help of the observation, in the changing worlds, TD3’s performance quickly suffers after world changes. The model-based planning methods do not suffer this degradation, and AOP is able to maintain its performance and computational savings, even through many world changes, despite its reliance on model-free components.
4.4 Behavior of Policies in Continual Lifelong Learning Setting
In Fig. 3 (left), we plot the episodic reward of the policy running from the initial starting state after each timestep (for the current target velocity). Note that since the AOP policy is learned from off-policy data (the planner), it suffers from divergence issues and should be weaker than TD3 on its own (Fujimoto et al., 2018a). Matching the result in Fig. 2 (a), the TD3 policy degrades in performance over time, but the AOP policy does not. This suggests that the policy degradation effect might stem from exploration, rather than from an issue with the optimization algorithm.
In Fig. 3 (right), we compare fine-tuning the policy learned by AOP after seeing every target velocity once (blue) against fine-tuning the TD3 policy (red) and training a new policy from scratch (gray), in each case by running the standard episodic TD3 algorithm on the first target velocity. The AOP policy learns much faster, showing that AOP is capable of quick backward transfer and of adapting quickly to a different situation.
4.5 Behavior of AOP in Well-Known States/Novel States/Changing Worlds
Fig. 4 shows AOP behavior in Maze (D). When encountering novel states, the Bellman error is high, but as time progresses and the agent is confronted with the same states again, the Bellman error becomes low. The number of planning timesteps matches this: AOP correctly identifies a need to plan early on, but greatly saves computation later, when it immediately knows the correct action with almost no planning. The same effect occurs as a function of the time since the last world change in the changing worlds setting: at the beginning of a new world, the amount of planning is high, before quickly declining to nearly zero, almost running with the speed of a policy – much faster than MPC-8.
We plot the standard deviation and Bellman error over time of AOP for the changing worlds Hopper in Fig. 5. After each world change, the Bellman error spikes, and then decreases as time goes on. These trends are reflected in the time horizon (bottom center), which decreases as the agent trains in each world, and indicate that the standard deviation and Bellman error are suitable metrics for determining planning levels. The same effect also occurs for the number of planning iterations.
5 Related Work and Future Directions
Much of past lifelong learning work (Goodfellow et al., 2013; Parisi et al., 2018) has focused on catastrophic forgetting, which AOP is resilient to, though it was not a primary focus of our work. Kearns and Singh (2002) consider MDPs with various reward functions, using them to decide whether to explore or exploit, similar to our framework. Finn et al. (2019) use meta-learning to perform continual online learning, but the tasks are considered in episodes. Future work could investigate meta-learning an adaptive planning strategy. Nagabandi et al. (2018) learn multiple models to represent different tasks, which contrasts with our single unified networks.
Algorithms that combine planning with learning have been studied in both discrete and continuous domains (Anthony et al., 2017; Chua et al., 2018). Recent work (Guez et al., 2018) generalizes the MCTS algorithm and proposes to learn the algorithm instead; having the algorithm set computation levels could be effective in our setting. Levine and Koltun (2013) and Mordatch et al. (2015) propose priors that make the planner stay close to policy outputs, which is problematic in changing worlds, when the policy is not accurate. Azizzadenesheli et al. (2018) and Nagabandi et al. (2017) learn dynamics models and perform MPC using them. Clavera et al. (2018), Janner et al. (2019), and Kurutach et al. (2018) utilize model ensembles to reduce model overfitting for policy learning. Integrating AOP with a learned uncertainty-aware dynamics model would be interesting future work.
We proposed AOP, which incorporates model-based planning with model-free learning, and introduced environments for evaluating algorithms in the continual lifelong learning setting. We empirically analyzed the performance of and signals from the model-free components, and showed experimentally that AOP was able to successfully reduce computation while achieving high performance in difficult tasks, often competitive with a much more powerful MPC procedure.
We would like to thank Vikash Kumar, Aravind Rajeswaran, Michael Janner, and Marvin Zhang for helpful feedback, comments, and discussions.
References

- Anthony, T., Tian, Z., and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. CoRR abs/1705.08439.
- Azizzadenesheli, K., et al. (2018). Sample-efficient deep RL with generative adversarial tree search. CoRR abs/1806.05780.
- Banks, A. and Hope, C. (2014). Heuristic and analytic processes in reasoning: an event-related potential study of belief bias. Psychophysiology 51.
- Brockman, G., et al. (2016). OpenAI Gym. CoRR abs/1606.01540.
- Buckman, J., et al. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. CoRR abs/1807.01675.
- Chua, K., et al. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR abs/1805.12114.
- Clavera, I., et al. (2018). Model-based reinforcement learning via meta-policy optimization. CoRR abs/1809.05214.
- Daw, N. D., Niv, Y., and Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience 8(12), p. 1704.
- Dayan, P. and Berridge, K. C. (2014). Model-based and model-free Pavlovian reward learning: revaluation, revision, and revelation.
- Finn, C., et al. (2019). Online meta-learning. CoRR abs/1902.08438.
- Fujimoto, S., et al. (2018a). Off-policy deep reinforcement learning without exploration. CoRR abs/1812.02900.
- Fujimoto, S., et al. (2018b). Addressing function approximation error in actor-critic methods. CoRR abs/1802.09477.
- Goodfellow, I. J., et al. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks.
- Guez, A., et al. (2018). Learning to search with MCTSnets. CoRR abs/1802.04697.
- Janner, M., et al. (2019). When to trust your model: model-based policy optimization. CoRR abs/1906.08253.
- Kahneman, D. (2003). Maps of bounded rationality: psychology for behavioral economics. American Economic Review 93, pp. 1449–1475.
- Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2), pp. 209–232.
- Kurutach, T., et al. (2018). Model-ensemble trust-region policy optimization. CoRR abs/1802.10592.
- Levine, S. and Koltun, V. (2013). Guided policy search. In International Conference on Machine Learning, pp. 1–9.
- Lowrey, K., et al. (2018). Plan online, learn offline: efficient learning and exploration via model-based control. CoRR abs/1811.01848.
- Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
- Mordatch, I., et al. (2015). Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems, pp. 3132–3140.
- Nagabandi, A., et al. (2018). Deep online learning via meta-learning: continual adaptation for model-based RL. CoRR abs/1812.07671.
- Nagabandi, A., et al. (2017). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR abs/1708.02596.
- Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8617–8629.
- Osband, I., et al. (2016). Deep exploration via bootstrapped DQN. CoRR abs/1602.04621.
- Parisi, G. I., et al. (2018). Continual lifelong learning with neural networks: a review. CoRR abs/1802.07569.
- Rajeswaran, A., et al. (2017a). Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
- Rajeswaran, A., et al. (2017b). Towards generalization and simplicity in continuous control. CoRR abs/1703.02660.
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347.
- Wang, T. and Ba, J. (2019). Exploring model-based planning with policy networks. CoRR abs/1906.08649.
- Williams, G., et al. (2015). Model predictive path integral control using covariance variable importance sampling. CoRR abs/1509.01149.
- Zhu, Y., et al. (2018). Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564.
Appendix A Detailed Experimental Graphs
Appendix B Pseudocode for Algorithms
B.1 Model-Based Planning vs Policy Optimization Pseudocode
B.2 Adaptive Online Planning Pseudocode
Appendix C Hyperparameter Details
All of our implementations and hyperparameters are available at our website: https://sites.google.com/berkeley.edu/aop/home.
C.1 Adaptive Online Planning Hyperparameters
For AOP, we use the default thresholds listed in Appendix C.1.1 (a standard deviation threshold of 8 and a Bellman error threshold of 25). We did not tune these hyperparameters much, and similarly do not believe that the algorithm is overly sensitive to the thresholds in dense reward environments (see Appendix C.1.1). However, in the sparse Mazes, we adjust these thresholds in order to avoid early termination of exploration (we do not change the hyperparameters determining the number of planning iterations). Tuning these hyperparameters (for both dense and sparse rewards) could lead to better performance, if desired.
We use a lower improvement threshold $\epsilon$ for the first planning iteration than for the later planning iterations. We found that having a lower threshold for the first iteration helps the agent avoid getting stuck in poor trajectories (i.e. avoid relying only on the policy), alongside the stochastic decision rule. For the Ant environment, we additionally always require at least one planning iteration.
C.1.1 Sensitivity to Thresholds
We run a rough grid search over wider values for the standard deviation threshold and the Bellman error threshold, and calculate the average reward in the Hopper changing worlds environment. The average reward for each setting is shown in Table C.1 and learning curves are shown in Fig. C.1. AOP is somewhat more sensitive to the standard deviation threshold early in training, as a higher value corresponds to less planning, but this effect quickly dissipates. As a result, while the choice of these thresholds is fairly arbitrary, we do not believe that AOP is particularly sensitive to them, and we use the same values for all of the dense reward environments.
| Standard deviation threshold | 4 | 8 (default) | 14 |
|---|---|---|---|
| Average reward | 0.47 ± 0.20 | 0.42 ± 0.05 | 0.44 ± 0.16 |

| Bellman error threshold | 10 | 25 (default) | 40 |
|---|---|---|---|
| Average reward | 0.47 ± 0.17 | 0.47 ± 0.09 | 0.43 ± 0.24 |
C.2 Model Predictive Control Hyperparameters
Our MPPI temperature is set to 0.01. The other planning hyperparameters (kept constant across environments) are shown below; see Section 3.4 for the interpretation of policy optimization as a special case of AOP. Surprisingly, we found TD3 to perform worse with more than one trajectory per iteration.
| | AOP | POLO | TD3 | PPO | MPC |
|---|---|---|---|---|---|
| Planning iterations per timestep | 0–8 | 3 | 1 | 1 | 3, 8 |
| Trajectories per iteration | 40 | 40 | 1 | 32 | 40 |
| Noise standard deviation | 0.1 | 0.1 | 0.2 | - | 0.1 |
C.3 Network Architectures
For our value ensembles, we use a fixed ensemble size and network width across environments. The value functions are updated in minibatches for a fixed number of gradient steps at regular timestep intervals. All networks use tanh activations and are trained using Adam. Network sizes are shown below.
C.4 Policy Optimization Hyperparameters
Our TD3 uses the same hyperparameters as the original authors (Fujimoto et al., 2018b), where for every timestep we run a rollout of length 256 and run 256 gradient steps. In the TD3 used for the experiment in Section 4.4, we run rollouts of length 1000 and run 1000 gradient steps after each rollout, equivalent to the standard TD3 setting with no termination.
Our PPO uses fixed batch sizes and a fixed number of gradient steps per iteration. For behavior cloning, and likewise for the policy in AOP-TD3, we run a fixed number of gradient steps on fixed-size batches at regular timestep intervals.
Appendix D Environment Details
In the online setting, the agent receives no signal from termination states, i.e. it becomes more difficult to learn not to fall down in the cases of Hopper and Ant. To amend this, and achieve the same interpretable behavior as the standard reinforcement learning setting, we set the reward functions for our environments as follows, similar to Rajeswaran et al. (2017b):