1 Introduction
Real-world environments abound with complexity in their action space. Physical reality is continuous both in space and time; hence many important problems, most notably physical control tasks, have continuous multidimensional action spaces. The joints of a robotic hand can assume arbitrary angles; the acceleration of a self-driving car should vary smoothly to minimise discomfort for passengers. Discrete problems also often have high-dimensional action spaces, leading to an exponential number of possible actions. Many other domains have richly structured action spaces such as sentences, queries, images, or serialised objects. Consequently, a truly general reinforcement learning (RL) algorithm must be able to deal with such complex action spaces in order to be successfully applied to those real-world problems.
Recent advances in deep learning and RL have indeed led to remarkable progress in model-free RL algorithms for continuous action spaces (Lillicrap et al., 2015; Schulman et al., 2017; Barth-Maron et al., 2018; Abdolmaleki et al., 2018; Hoffman et al., 2020) and other complex action spaces (Dulac-Arnold et al., 2016). Simultaneously, planning-based methods have enjoyed huge successes in domains with discrete action spaces, surpassing human performance in the classical games of chess and Go (Silver et al., 2018) or poker (Brown and Sandholm, 2018; Moravčík et al., 2017). The prospect of combining these two areas of research holds great promise for real-world applications.
The model-based MuZero (Schrittwieser et al., 2020) RL algorithm took a step towards applicability in real-world problems by learning a model of the environment and thus unlocking the use of the powerful methods of planning in domains where the dynamics of the environment are unknown or impossible to simulate efficiently. However, MuZero was only applied to domains with relatively small action spaces; small enough to be in fact enumerated in full by the tree-based search at its core.
Sample-based methods provide a powerful approach to dealing with large complex action spaces. Rather than enumerating all possible actions, the idea is to sample a small subset of actions and compute the optimal policy or value function with respect to those samples. This simple strategy is so general that it can be applied to large, continuous, or structured action spaces. Specifically, action sampling can be used both to propose improvements to the policy at each of the sampled actions, and subsequently to evaluate the proposed improvements. However, to correctly improve or evaluate the policy across the entire action space, and not just the samples, one must understand how the sampling procedure interacts with both policy improvement and policy evaluation.
In this work, we propose a framework to reason in a principled way about policy improvement and evaluation computed over small subsets of sampled actions. We show how this local information can be used to train a global policy, act and even perform explicit steps of policy evaluation for the purpose of planning and local policy iteration. This sample-based framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an algorithmically simple extension of the MuZero algorithm that facilitates its application to domains with complex action spaces. (The discussion in this paper applies equally to AlphaZero and MuZero; in the text we will only refer to MuZero for simplicity.)
To demonstrate the generality of this approach, we apply our algorithm to two continuous control benchmark domains, the DeepMind Control Suite (Tassa et al., 2018) and Real-World RL Suite (Dulac-Arnold et al., 2020). We also demonstrate that our algorithm can be applied to large discrete action spaces, by sampling the actions in the game of Go, and show that high performance can be maintained even when subsampling a small fraction of possible moves.
2 Related Work
Previous research in reinforcement learning for complex or continuous action spaces has often focused on model-free algorithms.
Deep Deterministic Policy Gradient (DDPG) exploits the fact that in action spaces that are entirely continuous (no discrete action dimensions), the action-value function can be assumed to be differentiable with respect to the action in order to efficiently compute policy gradients (Silver et al., 2014; Lillicrap et al., 2015). Distributed Distributional Deterministic Policy Gradients (D4PG) extends DDPG by using a distributional value function and a distributed training setup (Barth-Maron et al., 2018). Trust Region Policy Optimisation (TRPO) uses a hard KL constraint to ensure that the updated policy remains close to the previous policy during the policy improvement step, to avoid catastrophic collapse (Schulman et al., 2015). Proximal Policy Optimisation (PPO) has the same goal as TRPO, but instead uses the KL divergence as a penalty in the loss function or clips the probability ratio in the surrogate objective (Schulman et al., 2017). This results in a simpler algorithm with empirically better performance. In the regime of data-efficient off-policy algorithms, recent advances have derived actor-critic algorithms that optimise a (relative-)entropy regularised RL objective such as SAC (Haarnoja et al., 2018), MPO (Abdolmaleki et al., 2018) and AWR (Peng et al., 2019). Among these, MPO uses a sample-based policy improvement step that can be related to our algorithm (see Section 4.4). Distributional MPO (DMPO) extends MPO to use a distributional Q-function (Hoffman et al., 2020).
Model-based control for high-dimensional action spaces has recently seen a resurgence of interest (see e.g. Byravan et al., 2020; Hafner et al., 2018, 2019; Koul et al., 2020). While most of these algorithms consider direct policy optimisation against a learned model, some have considered combinations of rollout-based search/planning with policy learning. Piché et al. (2018) use planning via sequential importance sampling of action sequences sampled from a SAC policy. Bhardwaj et al. (2020) use a learned simulator to construct K-step returns for learning a soft Q-function. Closest to our work, Springenberg et al. (2020) consider a sample-based policy update similar to ours, but using a policy improvement operator based on the KL-regularised objective rather than the MCTS-based policy improvement that we consider here.
Sparse sampling algorithms (Kearns et al., 1999) are an effective approach to planning in large state spaces. The main idea is to sample possible state transitions from each state, drawn from a generative model of the underlying MDP. Collectively, these samples provide a search tree over a subset of the MDP; planning over the sampled tree provides a near-optimal approximation, for a sufficiently large number of samples, to the optimal policy for the full MDP, independent of the size of the state space. Indeed, sampling is known to address the curse of dimensionality in some cases (Rust, 1997). However, sparse sampling typically enumerates all possible actions from each state, and does not address issues relating to large action spaces. In contrast, our method samples actions rather than state transitions. In principle, it would be straightforward to combine both ideas; however, we focus in this paper upon the novel aspect relating to large action spaces and utilise deterministic transition models.
There have been several previous attempts at generalising AlphaZero and MuZero to continuous action spaces. These attempts have shown that such an extension is possible in principle, but have so far been restricted to very low-dimensional cases and have not yet demonstrated effectiveness in high-dimensional tasks. A0C (Moerland et al., 2018) describes an extension of AlphaZero to continuous action spaces using a continuous policy representation and REINFORCE (Williams, 1992) to estimate the gradients for the reverse KL divergence between the neural network policy estimate and the target MCTS policy, demonstrating some learning on the 1D Pendulum task. Yang et al. (2020) describe a similar extension of MuZero to continuous actions and show promising results outperforming soft actor-critic (SAC) (Haarnoja et al., 2018) on environments with 1- and 2-dimensional action spaces.
The factorised policy representation described by Tang and Agrawal (2020) shows good results in a variety of domains; by representing each action dimension with a separate categorical distribution it efficiently avoids the exponential explosion in the number of actions faced by a simple discretisation scheme.
3 Background
We consider a standard reinforcement learning setup in which an agent acts in an environment by sequentially choosing actions over a sequence of timesteps in order to maximise a cumulative reward. We model the problem as a Markov decision process (MDP) which comprises a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution, a stationary transition dynamics distribution and a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
The agent’s behaviour is controlled by a policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ which maps states to a probability distribution over the action space. The return from a state is defined as the sum of discounted future rewards $G_t = \sum_{k \ge 0} \gamma^{k} r(s_{t+k}, a_{t+k})$, where $\gamma$ is a discount factor in $[0, 1]$. The goal of the agent is to learn a policy which maximises the expected return from the start distribution.
In order to do so, a common strategy called policy evaluation consists in learning a value function that estimates the expected return of following policy $\pi$ from a state $s$ or a state-action pair $(s, a)$. The value function can then be used in a process called policy improvement to find and learn better policies, for instance by increasing the probabilities of actions with higher values. The process of repeatedly doing policy evaluation followed by policy improvement is at the heart of many reinforcement learning algorithms and is called policy iteration.
Naturally, a lot of research focuses on improving the methods for policy evaluation and policy improvement. One direction for scaling the efficiency of both is to evaluate, from the current state, several possible actions, or even several possible future trajectories by using a model, instead of just extracting information from the trajectory that was executed. Those evaluations can then be used to build a locally better policy over those actions. Planning algorithms such as Monte Carlo Tree Search (MCTS) (Coulom, 2006) take this even further and make several local policy iteration steps by repeatedly performing a policy improvement step followed by an explicit local step of policy evaluation of the improved policy, with the aim of generating an even better policy locally.
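As a small illustration of the discounted return defined earlier in this section, the following Python sketch accumulates rewards backwards along a finite trajectory. The function name `discounted_return` and the truncation to a finite list of rewards are assumptions made for the example only.

```python
def discounted_return(rewards, gamma):
    """Sum of discounted future rewards: sum_k gamma**k * rewards[k]."""
    g = 0.0
    # Walk the trajectory backwards so each step is one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A value function, in the sense used above, is an estimate of the expectation of this quantity under a given policy.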
From this perspective, the MuZero algorithm can be understood as the combination of two processes of policy evaluation and policy improvement. The inner process, concretely MuZero’s MCTS search, provides the policy improvement for the outer process, which in turn learns the quantities necessary for the inner process: the model, the reward function, the value function and the policy. Specifically, in the outer process, MuZero learns a deep neural network parameterising a model, a reward function, a state-value function and a policy. Policy improvement is accomplished by regressing the parametric policy towards the improved policy built by MuZero’s MCTS search. The improved policy is also used for acting. The value function is learned using the usual tools of policy evaluation such as temporal-difference learning (Sutton, 1988). These two objectives coupled with the learning of the reward function drive the learning of the model. In the inner process, MuZero’s MCTS search takes several analytical policy iteration steps: values in the search tree are estimated by explicitly averaging n-step returns bootstrapped from the value function (policy evaluation) while visits are directed towards high policy and high value actions (policy improvement). This results in an improved policy and an estimate of the value of this improved policy that can be used for the outer process.
This raises a few questions, especially in the case where only a small subset of the action space can be evaluated to build the locally improved policy:
- how to select the actions or trajectories to be evaluated
- how to build a locally improved policy over those actions
- how to use the locally improved policy to learn about the global policy
- how to use it to act
- how to perform an explicit local step of policy evaluation of the improved policy for planning
- how all these steps interact with each other
In the following, we will assume that the actions to be evaluated are sampled from some proposal distribution and that we have at our disposal some process to build a locally improved policy. We will mainly focus on the last four questions and propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets.
4 Sample-based Policy Iteration
Let $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ be a policy and $\mathcal{I}\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ be an improved policy of $\pi$: $\forall s \in \mathcal{S},\ v^{\mathcal{I}\pi}(s) \ge v^{\pi}(s)$. If we had complete access to $\mathcal{I}\pi$, we would directly use it for policy improvement by projecting it back onto the space of realisable policies. However, when the action space $\mathcal{A}$ is too large, it might only be feasible to compute an improved policy over a small subset of actions.
It is not immediately clear how to use this locally improved policy to perform principled policy improvement, or policy evaluation of the improved policy, because this locally improved policy only gives us information regarding the sampled actions.
We propose a framework which relies on writing both operations as an expectation with respect to the fully improved policy $\mathcal{I}\pi$ and using the samples we have to estimate these expectations. This allows us to use the conceptually correct target to define the objectives and clearly surface the approximations that are introduced afterwards. Specifically, we will restrict ourselves to the general class of policy improvement operators that we call action-independent, as defined below in Section 4.2.
4.1 Operator view of Policy Improvement
We use the concepts introduced by Ghosh et al. (2020) and decompose policy improvement into the successive application of two operators: (a) a policy improvement operator $\mathcal{I}$ which maps any policy to a policy achieving strictly larger return; and (b) a projection operator $\mathcal{P}$, which finds the best approximation of this improved policy in the space of realisable policies. With those notations, the process of policy improvement can be written as $\mathcal{P} \circ \mathcal{I}$.
Ghosh et al. (2020) showed that the policy gradient algorithm can be thought of as having the following policy improvement operator: $\mathcal{I}\pi(a|s) \propto \pi(a|s)\, Q(s, a)$, where $Q$ is the action-value function. They also showed that PPO’s (Schulman et al., 2017) policy improvement operator is $\mathcal{I}\pi(a|s) \propto \exp(Q(s, a)/\tau)$, where $\tau$ is a temperature parameter. Similarly, MPO’s (Abdolmaleki et al., 2018) policy improvement operator can be written $\mathcal{I}\pi(a|s) \propto \pi(a|s) \exp(Q(s, a)/\tau)$, and AWR (Peng et al., 2019) uses a similar form of improved policy, replacing the action-value function by the advantage function $A(s, a)$.
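4.2 Action-Independent Policy Improvement Operator
We define a policy improvement operator $\mathcal{I}$ as action-independent if it can be written as:
$$\mathcal{I}\pi(a|s) = \pi(a|s)\, f(s, a, Z(s)),$$
where $Z(s)$ is a unique state-dependent normalising factor defined by $\forall a \in \mathcal{A},\ \pi(a|s)\, f(s, a, Z(s)) \ge 0$ and $\sum_{a \in \mathcal{A}} \pi(a|s)\, f(s, a, Z(s)) = 1$. (In the continuous case, sums would be replaced by integrals.)
All of the policy improvement operators described above are action-independent.
MPO Example: MPO’s policy improvement operator can be written $\mathcal{I}\pi(a|s) = \pi(a|s) \exp(Q(s, a)/\tau) / Z(s)$ and $Z(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \exp(Q(s, a)/\tau)$.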
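For a finite action set, the MPO-style operator in the example above can be sketched in a few lines of Python. This is a minimal illustration, assuming the operator re-weights the prior by exponentiated action values with a temperature; the function name `mpo_improved_policy` and the numbers are hypothetical.

```python
import math


def mpo_improved_policy(pi, q, tau):
    """Return pi(a) * exp(q(a) / tau) / Z over a finite action set."""
    # Subtracting max(q) keeps exp() numerically stable; the shift cancels
    # in the normalisation and does not change the resulting distribution.
    q_max = max(q)
    weights = [p * math.exp((qa - q_max) / tau) for p, qa in zip(pi, q)]
    z = sum(weights)  # state-dependent normalising factor (up to the shift)
    return [w / z for w in weights]


improved = mpo_improved_policy(pi=[0.5, 0.3, 0.2], q=[0.0, 1.0, 2.0], tau=1.0)
```

The normalising factor depends only on the state (here, on the whole vectors `pi` and `q`), which is what makes the per-action factor separable from the prior.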
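4.3 Sample-Based Action-Independent Policy Improvement Operator
Let $\{a_i\}_{i=1}^{K}$ be $K$ actions sampled from a proposal distribution $\beta$ and $\hat\beta(a|s) = \frac{1}{K} \sum_{i} \delta_{a, a_i}$ the corresponding empirical distribution, which is non-zero only on the sampled actions $\{a_i\}$. (Here $\delta_{a, a_i}$ represents the Kronecker delta function. In the continuous case, it would be replaced by the Dirac delta function $\delta(a - a_i)$.)
We define the sample-based action-independent policy improvement operator as
$$\hat{\mathcal{I}}_\beta\pi(a|s) = (\hat\beta/\beta)(a|s)\, \pi(a|s)\, f(s, a, \hat{Z}_\beta(s)),$$
where $\hat{Z}_\beta(s)$ is a unique state-dependent normalising factor defined by $\forall a \in \mathcal{A},\ (\hat\beta/\beta)(a|s)\, \pi(a|s)\, f(s, a, \hat{Z}_\beta(s)) \ge 0$ and $\sum_{a \in \mathcal{A}} (\hat\beta/\beta)(a|s)\, \pi(a|s)\, f(s, a, \hat{Z}_\beta(s)) = 1$. We have used the shorthand notation $(\hat\beta/\beta)(a|s)$ to mean $\hat\beta(a|s)/\beta(a|s)$. (We will omit the action-independent qualifier in the rest of the text when it is clear from the context.)
MPO Example: MPO’s sample-based action-independent policy improvement operator using $\beta$ would therefore be $\hat{\mathcal{I}}_\beta\pi(a|s) = (\hat\beta/\beta)(a|s)\, \pi(a|s) \exp(Q(s, a)/\tau) / \hat{Z}_\beta(s)$ with $\hat{Z}_\beta(s) = \sum_{a \in \mathcal{A}} (\hat\beta/\beta)(a|s)\, \pi(a|s) \exp(Q(s, a)/\tau)$.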
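The sample-based operator can be sketched for a finite action set as follows. This is an illustrative sketch, assuming an MPO-style weighting by exponentiated action values; the function name `sample_based_mpo_policy` and the toy inputs are hypothetical. Actions that were not sampled receive zero mass and are simply absent from the returned dictionary.

```python
import math
from collections import Counter


def sample_based_mpo_policy(samples, beta, pi, q, tau):
    """Improved policy restricted to the sampled actions.

    samples: list of action indices drawn from the proposal `beta`.
    Returns {action: probability} over the distinct sampled actions.
    """
    k = len(samples)
    weights = {}
    for a, n in Counter(samples).items():
        beta_hat = n / k  # empirical frequency of action a among the samples
        # Importance ratio beta_hat / beta corrects for the sampling proposal.
        weights[a] = (beta_hat / beta[a]) * pi[a] * math.exp(q[a] / tau)
    z_hat = sum(weights.values())  # sample-based normalising factor
    return {a: w / z_hat for a, w in weights.items()}


# With beta equal to pi and flat action values, the result is just the
# empirical distribution of the samples.
policy = sample_based_mpo_policy(
    samples=[0, 0, 1], beta=[0.5, 0.3, 0.2], pi=[0.5, 0.3, 0.2],
    q=[0.0, 0.0, 0.0], tau=1.0)
```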
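4.4 Computing an expectation with respect to $\mathcal{I}\pi$
We focus in this section on evaluating for a given state $s$ the expectation $\mathbb{E}_{a \sim \mathcal{I}\pi}[X \mid s]$ of a random variable $X$ given $K$ actions $\{a_i\}$ sampled from a distribution $\beta$ and the sample-based improved policy $\hat{\mathcal{I}}_\beta\pi$.
Theorem. For a given random variable $X$, we have
$$\mathbb{E}_{a \sim \mathcal{I}\pi}[X \mid s] = \lim_{K \to \infty} \sum_{a \in \mathcal{A}} \hat{\mathcal{I}}_\beta\pi(a|s)\, X(s, a).$$
Proof. See Appendix E.
Corollary. The sample-based policy improvement operator converges in distribution to the true policy improvement operator:
$$\lim_{K \to \infty} \hat{\mathcal{I}}_\beta\pi = \mathcal{I}\pi,$$
and is approximately normally distributed around the true policy improvement operator as $K \to \infty$.
Proof. See Appendix E.
We illustrate this result in Figure 1.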
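The convergence stated above can be checked numerically on a toy problem. The sketch below assumes a four-action space, a uniform proposal and an MPO-style improvement operator; all names and numbers are hypothetical and chosen only to make the effect visible.

```python
import math
import random
from collections import Counter


def improved_policy(pi, q, tau):
    """Exact improved policy over the full (finite) action set."""
    w = [p * math.exp(qa / tau) for p, qa in zip(pi, q)]
    z = sum(w)
    return [x / z for x in w]


def sampled_expectation(x, samples, beta, pi, q, tau):
    """Estimate of the expectation of x under the improved policy,
    computed from the sampled actions only."""
    k = len(samples)
    weights = {a: (n / k) / beta[a] * pi[a] * math.exp(q[a] / tau)
               for a, n in Counter(samples).items()}
    z_hat = sum(weights.values())
    return sum(w / z_hat * x[a] for a, w in weights.items())


pi = [0.1, 0.2, 0.3, 0.4]
beta = [0.25, 0.25, 0.25, 0.25]  # uniform proposal distribution
q = [1.0, 0.0, 2.0, 0.5]
x = [0.0, 1.0, 2.0, 3.0]         # random variable X(s, a) for a fixed state

exact = sum(p * xa for p, xa in zip(improved_policy(pi, q, 1.0), x))
rng = random.Random(0)
for k in (10, 100, 20000):
    samples = rng.choices(range(4), weights=beta, k=k)
    error = abs(sampled_expectation(x, samples, beta, pi, q, 1.0) - exact)
    print(k, error)
```

For small sample counts the estimate can be far off (some actions may not be sampled at all), while the error shrinks as the number of samples grows.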
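4.5 Sample-based Policy Evaluation and Improvement
The previous expression computing an estimate of $\mathbb{E}_{a \sim \mathcal{I}\pi}[X \mid s]$ using the quantity $\hat{\mathcal{I}}_\beta\pi$ and the sampled actions $\{a_i\}$ can be used for policy improvement and policy evaluation of the improved policy.
Policy improvement can be performed by for instance instantiating $X(s, a) = -\log \pi_\theta(a|s)$, minimising the cross-entropy between $\pi_\theta$ and the improved policy $\mathcal{I}\pi$: $\mathbb{E}_{a \sim \mathcal{I}\pi}[-\log \pi_\theta \mid s] \approx -\sum_{a \in \mathcal{A}} \hat{\mathcal{I}}_\beta\pi(a|s) \log \pi_\theta(a|s)$.
Additionally, samples from $\mathcal{I}\pi$ can be obtained by resampling an action from $\hat{\mathcal{I}}_\beta\pi$. This procedure, also known as Sampling Importance Resampling (SIR) (Rubin, 1987), gives us a way to act with the improved policy and reuse the usual tools such as temporal-difference learning to do policy evaluation of the improved policy.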
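A minimal sketch of Sampling Importance Resampling over a finite action set is given below. It assumes an MPO-style improvement weight; the function name `sir_sample` and its parameters are hypothetical. Drawing one of the proposals with probability proportional to its importance weight is equivalent to sampling from the sample-based improved policy, because an action that was proposed several times accumulates the weight of each of its copies.

```python
import math
import random


def sir_sample(rng, k, beta, pi, q, tau):
    """Draw one action approximately distributed as the improved policy."""
    actions = range(len(pi))
    # Step 1: propose k actions from the proposal distribution beta.
    proposals = rng.choices(actions, weights=beta, k=k)
    # Step 2: importance weight of each proposal (target over proposal).
    weights = [pi[a] * math.exp(q[a] / tau) / beta[a] for a in proposals]
    # Step 3: resample a single proposal according to those weights.
    return rng.choices(proposals, weights=weights, k=1)[0]


rng = random.Random(0)
action = sir_sample(rng, k=16, beta=[0.5, 0.5], pi=[0.9, 0.1],
                    q=[0.0, 0.0], tau=1.0)
```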
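Finally, for instance for the purpose of planning, an explicit step of policy evaluation of the improved policy can be computed by estimating 1-step or n-step returns. Using for example $X(s, a) = r(s, a) + \gamma V(s')$, where $s'$ is the state reached from $s$ by taking action $a$, lets us backpropagate the value $V(s')$ by one step in a search tree: $V(s) = \sum_{a \in \mathcal{A}} \hat{\mathcal{I}}_\beta\pi(a|s) \left( r(s, a) + \gamma V(s') \right)$.
5 Sampled MuZero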
Building on the sample-based policy iteration framework established in the previous section, we now instantiate those ideas in the context of a complete system. Concretely, we apply our sampling procedure to the MuZero algorithm, to produce a new algorithm that we term Sampled MuZero. This algorithm may be applied to any domain where MuZero can be applied; but furthermore can also be used, in principle, to learn and plan in domains with arbitrarily complex action spaces.
As introduced in the background section, MuZero may be understood as combining an inner process of policy iteration, within its Monte-Carlo tree search, and an outer process, in its overall interactions with the environment.
5.1 Inner Policy Evaluation and Improvement
Specifically, within its search tree, MuZero estimates values by explicitly averaging n-step return samples (policy evaluation) and selects the next node to evaluate and expand by recursively maximising (policy improvement) over the probabilistic upper confidence tree (PUCT) bound (Silver et al., 2016)
$$\arg\max_{a}\ Q(s, a) + c(s) \cdot \pi(a|s) \cdot \frac{\sqrt{\sum_{b} N(s, b)}}{1 + N(s, a)},$$
where $N(s, a)$ is the visit count of action $a$ in state $s$ and $c(s)$ is an exploration factor controlling the influence of the policy $\pi$ relative to the values $Q(s, a)$ as nodes are visited more often.
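The PUCT selection rule can be sketched as a small Python function. This is an illustration only: it treats the exploration factor as a constant `c`, whereas in MuZero it varies with the visit counts, and the name `puct_select` is hypothetical.

```python
import math


def puct_select(q_values, prior, visit_counts, c):
    """Index of the action maximising Q + c * prior * sqrt(sum N) / (1 + N)."""
    total_visits = sum(visit_counts)

    def score(a):
        exploration = c * prior[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        return q_values[a] + exploration

    return max(range(len(prior)), key=score)


# With equal values and priors, the less visited action is preferred.
chosen = puct_select(q_values=[0.0, 0.0], prior=[0.5, 0.5],
                     visit_counts=[1, 0], c=1.0)
```

As an action accumulates visits its exploration bonus shrinks, so the value estimate increasingly dominates the choice.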
Naive Modification. A first approach to extending MuZero’s MCTS search is to search over the sampled actions and keep the PUCT formula unchanged, directly using the probabilities $\pi$ coming from the policy network in the PUCT formula just like in MuZero. The search’s visit count distribution $f$ can then be used to construct the sample-based $\hat{\mathcal{I}}_\beta\pi$ to correct for the effect of sampling at policy network training time and acting time, but also dynamically as the tree is built for value backpropagation (inner policy evaluation). Theoretically this procedure is not complicated, but in practice it might lead to unstable results because of the $\hat\beta/\beta$ term, especially if $f$ is represented by normalised visit counts which have limited numerical precision.
Proposed Modification. Instead, we propose to search with probabilities proportional to $(\hat\beta/\beta)\,\pi$, in place of $\pi$ in the PUCT formula, and directly use the resulting visit count distributions just like in MuZero. We use the following Theorem to justify this proposed modification.
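Theorem. Let $\mathcal{I}_{\text{MCTS}}\pi$ be the visit count distribution (for a given number of simulations) of MuZero’s search using prior $\pi$ when considering the whole action space $\mathcal{A}$ and let $\hat{\mathcal{I}}_{\text{MCTS}}\pi$ be the visit count distribution obtained by searching using prior $(\hat\beta/\beta)\,\pi$. Then, $\hat{\mathcal{I}}_{\text{MCTS}}\pi$ is approximately equal to the sample-based policy improvement associated to $\mathcal{I}_{\text{MCTS}}\pi$. In other words, $\hat{\mathcal{I}}_{\text{MCTS}}\pi \approx \hat{\mathcal{I}}_\beta\pi$ for $\mathcal{I} = \mathcal{I}_{\text{MCTS}}$.
Proof. See Appendix F.
We can therefore directly use the results of the previous Section 4.4 and in particular,
$$\mathbb{E}_{a \sim \mathcal{I}_{\text{MCTS}}\pi}[X \mid s] \approx \sum_{a \in \mathcal{A}} \hat{\mathcal{I}}_{\text{MCTS}}\pi(a|s)\, X(s, a).$$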
This lets us conclude that the only modification beyond sampling that needs to be made to MuZero is to use $(\hat\beta/\beta)\,\pi$ instead of $\pi$ in the PUCT formula. The rest of the MuZero algorithm, from estimating the values in the search tree by averaging n-step returns, to acting and training the policy network using the visit count distribution, can proceed unchanged.
Remark. Note that, if $\beta = \pi^{1/\tau}$, the probabilities used in the PUCT formula can be written: $(\hat\beta/\beta)\,\pi \propto \hat\beta\, \pi^{1 - 1/\tau}$. If $\tau = 1$, $(\hat\beta/\beta)\,\pi$ is equal to the empirical sampling/prior distribution $\hat\beta$. This means that the search is guided by a potentially quasi-uniform prior $\hat\beta$ but only evaluates relatively high probability actions. If $\tau > 1$, the search evaluates more diverse samples but is guided by more peaked probabilities $\hat\beta\, \pi^{1 - 1/\tau}$.
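The remark can be made concrete with a short Python sketch over a finite action set. It assumes the proposal is the policy raised to the power 1/tau and renormalised; the function names `temperature_proposal` and `sampled_search_prior` are hypothetical.

```python
from collections import Counter


def temperature_proposal(pi, tau):
    """Proposal proportional to pi ** (1 / tau), normalised."""
    w = [p ** (1.0 / tau) for p in pi]
    z = sum(w)
    return [x / z for x in w]


def sampled_search_prior(samples, beta, pi):
    """Normalised (beta_hat / beta) * pi over the distinct sampled actions."""
    k = len(samples)
    w = {a: (n / k) / beta[a] * pi[a] for a, n in Counter(samples).items()}
    z = sum(w.values())
    return {a: x / z for a, x in w.items()}


pi = [0.5, 0.25, 0.25]

# tau = 1: the proposal is the policy itself, so the search prior collapses
# to the empirical distribution of the samples.
prior_tau1 = sampled_search_prior([0, 0, 1, 2], temperature_proposal(pi, 1.0), pi)

# tau = 2: the proposal is flatter, and the search prior keeps a residual
# factor pi ** (1 - 1 / tau) that favours high-probability actions.
prior_tau2 = sampled_search_prior([0, 1], temperature_proposal(pi, 2.0), pi)
```

In the second case both actions were sampled once, yet the prior still prefers action 0 by a factor equal to the square root of the ratio of the policy probabilities.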
5.2 Outer Policy Improvement
Once the inner iterations of policy improvement and policy evaluation within Monte-Carlo tree search have been completed, the net result is a set of visit counts $N(s, a_i)$ at the root state $s$ of the search tree, corresponding to each sampled action $a_i$. These visit counts may be normalised to provide the sample-based improved policy $\hat{\mathcal{I}}_\beta\pi(a|s) = N(s, a) / \sum_{b} N(s, b)$. Following the argument in the previous section, these visit counts already take account of the fact that the root actions were sampled according to $\beta$.
Hence all that remains is to project the sample-based improved policy back onto the space of representable policies, using an appropriate projection operator $\mathcal{P}$. Following MuZero, we choose a standard projection operator for probability distributions that selects parameters $\theta$ minimising the KL divergence $\mathrm{KL}(\hat{\mathcal{I}}_\beta\pi \,\|\, \pi_\theta)$.
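The projection target and its loss can be sketched for a finite set of sampled actions as follows. This is a minimal illustration with hypothetical function names; in practice the policy would be a neural network output and the loss would be minimised by gradient descent on its parameters.

```python
import math


def normalise_visits(visit_counts):
    """Turn root visit counts into a probability distribution."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


def kl_to_policy(target, policy):
    """KL(target || policy) over a shared finite action set.

    Terms with zero target mass contribute nothing, by the usual
    convention 0 * log 0 = 0.
    """
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0.0)


target = normalise_visits([30, 10, 0])
loss = kl_to_policy(target, [0.5, 0.3, 0.2])
```

Minimising this quantity with respect to the policy is equivalent to minimising the cross-entropy against the visit-count target, since the two differ only by the entropy of the target.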
5.3 Outer Policy Evaluation
To select actions, the agent samples its behaviour from its sample-based improved policy, $a \sim \hat{\mathcal{I}}_\beta\pi(\cdot|s)$. As above, we note that this already corrects for the sampling procedure in the construction of the visit counts, and hence may be used directly as a policy.
The outer policy evaluation step then follows directly from MuZero, i.e. a value function is trained from n-step returns, using trajectories of behaviour generated by the sample-based improved policy.
5.4 Search Tree Node Expansion
In MuZero, each time a leaf node is expanded, all the actions of the action space are returned alongside the probabilities the policy network assigns to each of those actions.
Proposed Modification. In Sampled MuZero, we instead sample $K$ actions from a distribution $\beta$ and return each action $a$ along with its corresponding probabilities $\pi(a|s)$ and $\beta(a|s)$.
We note that, if the number of simulations of the search is much bigger than $K$, techniques such as progressive widening (Chaslot et al., 2008) could in principle be used to dynamically sample more actions for nodes on highly visited search paths.
5.5 Sampling distribution
In principle any sampling distribution $\beta$ with a wide support can be used, including the uniform distribution. However, as only a limited number of samples can be drawn, it is preferable to sample moves that are likely according to our current estimate for the best policy, i.e. the policy network. (Note that MuZero with a limited number of simulations will only visit the high prior moves.)
Proposed Modification. We use $\beta = \pi$, potentially modulated by a temperature parameter. To encourage exploration and to make sure that even low prior moves have an opportunity to be reassessed from time to time, MuZero combines the prior produced by the policy network with Dirichlet noise at the root of the search tree. We obtain the same behaviour in Sampled MuZero by also including noise in the sampling distribution $\beta$ and in the search prior, ensuring that low prior moves can be sampled and searched.
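Mixing a prior with Dirichlet noise can be sketched with the Python standard library alone, using the fact that normalised independent Gamma draws follow a Dirichlet distribution. The function names and the values of `alpha` and `epsilon` below are illustrative assumptions, not settings taken from the text.

```python
import random


def dirichlet(rng, alpha, n):
    """Symmetric Dirichlet sample of dimension n via normalised Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    s = sum(g)
    return [x / s for x in g]


def add_root_noise(rng, prior, alpha, epsilon):
    """Convex mixture (1 - epsilon) * prior + epsilon * Dirichlet noise."""
    noise = dirichlet(rng, alpha, len(prior))
    return [(1.0 - epsilon) * p + epsilon * n for p, n in zip(prior, noise)]


rng = random.Random(0)
# A sharply peaked prior: without noise, the last three moves would
# almost never be sampled.
noisy_prior = add_root_noise(rng, [0.97, 0.01, 0.01, 0.01],
                             alpha=1.0, epsilon=0.25)
```

The mixture remains a valid probability distribution, and every move receives some additional mass from the noise term, so that low prior moves can occasionally be drawn when sampling from the noisy distribution.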
6 Experiments
We evaluated the performance of Sampled MuZero on a variety of reinforcement learning environments. We focus upon standard benchmark environments in which clear baselines are available for comparison. We use those benchmarks to explore two important properties of real-world applications. First, whether Sampled MuZero is sufficiently general to operate across discrete and continuous environments of very different types. Second, whether the algorithm is robust to sampling, that is, whether we can come close to the performance of algorithms that have access to the entire action set (and are therefore not scalable to large action spaces), when only sampling a small fraction of the action space.
6.1 Go
Go has long been a challenge problem for AI, with only the AlphaGo (Silver et al., 2016, 2018; Schrittwieser et al., 2020) family of algorithms finally surpassing human professional players. It is a domain that requires deep and precise planning and as such is an ideal domain to put the planning capabilities of Sampled MuZero to the test.
Using MuZero as a baseline, we trained multiple instances of Sampled MuZero with varying number of action samples $K$ (see Figure 2). The size of the action space in 19x19 Go is 362 (all board points plus pass), so all the tested values of $K$ only cover a small part of the action space. As expected, the performance improves as $K$ increases, with the larger values of $K$ already closely approaching the performance of the baseline that is allowed to search over all possible actions.
6.2 Atari
We also performed the same experiment as in Figure 2 for the arcade game of Ms. Pacman, from the classic Atari RL benchmark. The action space in Atari is of size 18. Searching with too few samples is not sufficient for efficient learning, but performance rapidly approaches the baseline that is allowed to search all possible actions without sampling as the number of samples increases (Figure 3).
6.3 DeepMind Control Suite
The DeepMind Control Suite (Tassa et al., 2018) provides a set of continuous control tasks based on MuJoCo (Todorov et al., 2012) and has been widely used as a benchmark to assess performance of continuous control algorithms. For the experiments in this paper we use the task classification and data budgets introduced in Acme (Hoffman et al., 2020), evaluating Sampled MuZero on the easy, medium and hard tasks. We additionally evaluated Sampled MuZero on the manipulator tasks which are known to be interesting and difficult.
In its most common setup, the control suite domains provide 1-dimensional state inputs (as opposed to 2-dimensional image inputs in board games and Atari as used by MuZero). We therefore used a variation of the MuZero model architecture in which all convolutions are replaced by fully-connected layers (see Appendix A for further details). For the policy prediction, we chose the factored policy representation introduced by Tang and Agrawal (2020), representing each dimension by a categorical distribution. There are however no difficulties in working directly with continuous actions and we show results with a policy prediction parameterised with a Gaussian distribution on the hard and manipulator tasks in the Appendix (Figure A.1).
Sampled MuZero showed good performance across the task set (see Figure 7 for full results), with especially good results for tasks in the most difficult hard and manipulator categories (Figure 4) such as humanoid.run or the manipulator tasks in general.
The control suite domains can also be configured to provide raw pixel inputs instead of 1-dimensional state inputs. We ran Sampled MuZero on the same tasks with the same data budget (25M frames) and the same hyperparameters. As demonstrated in Figure 5, Sampled MuZero can be applied to efficiently learn from raw pixel inputs as well. It is particularly remarkable that Sampled MuZero can learn to control the 21-dimensional humanoid from raw pixel inputs only. In addition, we compared Sampled MuZero to the Dreamer agent (Hafner et al., 2019) in Appendix A.2, Table 2. Sampled MuZero equalled or surpassed the Dreamer agent’s performance in all tasks, without any action repeat (Dreamer uses an action repeat of 2), observation reconstruction, or any hyperparameter re-tuning.
To investigate the scalability to more complex action spaces, we also applied Sampled MuZero to the dm_control (Tassa et al., 2020) based Locomotion environment. In this set of high-dimensional tasks, the agent must control a humanoid body with 56 action dimensions to accomplish a variety of goals (Figure 6). In all tasks Sampled MuZero not only outperformed previously reported results, but it did so using more than an order of magnitude fewer interactions with the environment.
Finally, we investigated the impact on performance of the number of samples $K$ in the Appendix (Figure 10). We show that Sampled MuZero can learn high-dimensional action tasks with only a small number of samples. Furthermore, we evaluated the stability of Sampled MuZero, both from state inputs and raw pixel inputs, in Figure 11 and Figure 12. We show that Sampled MuZero’s performance is overall very reproducible across tasks and number of samples. We also verified the practical importance of using $(\hat\beta/\beta)\,\pi$ instead of just $\pi$ in Sampled MuZero’s PUCT formula in Figure 13. We find that, as suggested by the theory, it is much more robust to use $(\hat\beta/\beta)\,\pi$.
6.4 Real-World RL Challenge Benchmark
The Real-World Reinforcement Learning (RWRL) Challenge set of benchmark tasks (Dulac-Arnold et al., 2020) is a set of continuous control tasks that aims to capture the aspects of real-world tasks that commonly cause RL algorithms to fail. We used this benchmark to test the robustness of our proposed algorithm to complications such as delays, partial observability or stochasticity. We used the same neural network architecture as for the DeepMind Control Suite with the addition of an LSTM (Hochreiter and Schmidhuber, 1997) to deal with partial observability.
As shown in Table 1, Sampled MuZero significantly outperformed baseline algorithms in all three challenge difficulties. We provide full learning curve results in the Appendix (Figure 8).
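| Difficulty | Agent | Cartpole | Walker | Quadruped | Humanoid |
|---|---|---|---|---|---|
| Easy | DMPO | 464.05 | 474.44 | 567.53 | 1.33 |
| Easy | D4PG | 482.32 | 512.44 | 787.73 | 102.92 |
| Easy | STACX | 734.40 | 487.75 | 865.80 | 1.21 |
| Easy | SMuZero | 861.05 | 959.83 | 987.20 | 289.36 |
| Medium | DMPO | 155.63 | 64.63 | 180.30 | 1.27 |
| Medium | D4PG | 175.47 | 75.49 | 268.01 | 1.28 |
| Medium | STACX | 398.71 | 94.01 | 466.43 | 1.18 |
| Medium | SMuZero | 516.69 | 448.51 | 946.21 | 108.56 |
| Hard | DMPO | 138.06 | 63.05 | 144.69 | 1.40 |
| Hard | D4PG | 108.20 | 59.85 | 280.75 | 1.27 |
| Hard | STACX | 135.26 | 58.11 | 351.56 | 1.26 |
| Hard | SMuZero | 244.71 | 71.16 | 348.09 | 1.19 |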
7 Conclusions
In this paper we introduced a unified framework for learning and planning in discrete, continuous and structured complex action spaces. Our approach is based upon a simple principle of sampling actions. By careful book-keeping we have shown how one may take account of the sampling process during policy improvement and policy evaluation. In principle, the same sample-based strategy could be applied to a variety of other reinforcement learning algorithms in which the policy is updated by, or approximated by, an action-independent improvement step. Concretely, we have focused upon applying our framework to the model-based planning algorithm of MuZero, resulting in our new algorithm Sampled MuZero. Our empirical results show that the idea is both general, succeeding across a wide variety of discrete and continuous benchmark environments, and robust, scaling gracefully down to small numbers of samples. These results suggest that the ideas introduced in this paper may also be effective in larger scale applications where it is not feasible to enumerate the action space.
Acknowledgements
We would like to thank Jost Tobias Springenberg for providing very detailed feedback and constructive suggestions.
References
 Relative entropy regularized policy iteration. External Links: 1812.02256 Cited by: §A.1, §1, §2, §4.1.
 Layer normalization. External Links: 1607.06450 Cited by: Appendix A.
 Distributed Distributional Deterministic Policy Gradients. External Links: 1804.08617 Cited by: Figure 7, Appendix A, §1, §2, Figure 4.
 Information theoretic model predictive Q-learning. In Learning for Dynamics and Control, pp. 840–850. Cited by: §2.
 Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359 (6374), pp. 418–424. Cited by: §1.
 Imagined value gradients: model-based policy optimization with transferable latent dynamics models. In Conference on Robot Learning, pp. 566–589. Cited by: §2.
 Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation 04, pp. 343–357. External Links: Document Cited by: §5.4.
 Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83. Cited by: §3.
 Deep reinforcement learning in large discrete action spaces. External Links: 1512.07679 Cited by: §1.
 An empirical investigation of the challenges of real-world reinforcement learning. External Links: 2003.11881 Cited by: §1, §6.4, Table 1.
 An operator view of policy gradient methods. External Links: 2006.11266 Cited by: §4.1, §4.1.
 Monte-Carlo tree search as regularized policy optimization. External Links: 2007.12509 Cited by: Appendix F, Appendix F.
 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. External Links: 1801.01290 Cited by: §2, §2.
 Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Cited by: §A.2, Table 2, §2, §6.3.
 Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §2.
 Identity mappings in deep residual networks. CoRR abs/1603.05027. External Links: Link, 1603.05027 Cited by: Appendix A.
 Haiku: Sonnet for JAX External Links: Link Cited by: Appendix A.
 Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 08997667, Link, Document Cited by: Appendix A, §6.4.
 Acme: a research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979. External Links: Link Cited by: Figure 7, Appendix A, Appendix A, §1, §2, Figure 4, §6.3.

 A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Volume 2, IJCAI’99, San Francisco, CA, USA, pp. 1324–1331. Cited by: §2.
 Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: Appendix A.
 Dream and search to control: latent space planning for continuous control. External Links: 2010.09832 Cited by: §2.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §2.
 Fixing weight decay regularization in adam. CoRR abs/1711.05101. External Links: Link, 1711.05101 Cited by: Appendix A.
 Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, External Links: Link Cited by: Figure 6.
 A0C: AlphaZero in continuous action space. External Links: 1805.09613 Cited by: §2.
 DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356 (6337), pp. 508–513. Cited by: §1.
 Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177. Cited by: §2, §4.1.
 Probabilistic planning with sequential monte carlo methods. In International Conference on Learning Representations, Cited by: §2.

 The calculation of posterior distributions by data augmentation: comment: a non-iterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm. Journal of the American Statistical Association 82 (398), pp. 543–546. External Links: ISSN 01621459, Link Cited by: §4.5.
 Using randomization to break the curse of dimensionality. Econometrica 65 (3), pp. 487–516. External Links: ISSN 00129682, 14680262, Link Cited by: §2.
 Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico. Cited by: Appendix A.
 Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature 588 (7839), pp. 604–609. Cited by: Appendix A, Appendix B, Appendix C, §1, §6.1.
 Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §2, §4.1.
 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §5.1, §6.1.
 A general reinforcement learning algorithm that masters chess, shogi, and Go through selfplay. Science 362 (6419), pp. 1140–1144. Cited by: §1, §6.1.
 Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Bejing, China, pp. 387–395. External Links: Link Cited by: §2.
 V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, External Links: Link Cited by: Figure 6.
 Local search for policy iteration in continuous control. arXiv preprint arXiv:2010.05545. Cited by: §2.
 Learning to predict by the methods of temporal differences. External Links: Link Cited by: §3.
 Discretizing continuous action space for onpolicy optimization. External Links: 1901.10500 Cited by: Appendix A, §2, §6.3.
 DeepMind control suite. External Links: 1801.00690 Cited by: §1, §6.3.
 Dm_control: software and tasks for continuous control. arXiv preprint arXiv:2006.12983. Cited by: Figure 6, §6.3.
 MuJoCo: a physics engine for model-based control. pp. 5026–5033. External Links: ISBN 9781467317375, Document Cited by: §6.3.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3–4), pp. 229–256. Cited by: §2.
 Continuous control for searching and planning with a learned model. External Links: 2006.07430 Cited by: §2.
 A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems 33. Cited by: Table 1.
Appendix A DeepMind Control Suite and Real-World RL Experiments
For the continuous control experiments where the input is 1-dimensional (as opposed to the 2-dimensional image inputs in board games and Atari used by MuZero), we used a variation of the MuZero model architecture in which all convolutions are replaced by fully connected layers.
The representation function processed the input via an input block composed of a linear layer, followed by a Layer Normalisation and a tanh activation. The resulting embedding was then processed by a ResNet v2 style pre-activation residual tower (He et al., 2016) coupled with Layer Normalisation (Ba et al., 2016) and Rectified Linear Unit (ReLU) activations. We used 10 blocks, each block containing 2 layers with a hidden size of 512.
For the Real-World RL experiments, we additionally inserted an LSTM module (Hochreiter and Schmidhuber, 1997) in the representation function between the input block and the residual tower to deal with partial observability. We trained the LSTM using truncated backpropagation through time for 8 steps, initialised from LSTM states stored during acting, each step having the last observations concatenated together, for an effective unroll step of steps.
The dynamics function processed the action via an action block composed of a linear layer, followed by a Layer Normalisation and a ReLU activation. The action embedding was then added to the dynamics function’s input embedding and then processed by a residual tower using the same architecture as the residual tower for the representation function.
The reward and value predictions used the categorical representation introduced in MuZero (Schrittwieser et al., 2020). We used 51 bins for both the value and the reward predictions with the value being able to represent values between and the reward being able to represent values between . We used n-step bootstrapping with and a discount of consistent with Acme (Hoffman et al., 2020).
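As an illustration of a categorical scalar representation, the sketch below encodes a scalar as a two-hot distribution over 51 evenly spaced bins and decodes it as the expectation over bin values. The support bounds `v_min` and `v_max`, the function names, and the omission of any value rescaling are assumptions made for illustration; this is not necessarily the exact encoding used in the experiments.

```python
def two_hot(x, v_min, v_max, num_bins=51):
    """Encode scalar x as probabilities over num_bins evenly spaced bin values."""
    x = min(max(x, v_min), v_max)          # clip to the representable range
    step = (v_max - v_min) / (num_bins - 1)
    pos = (x - v_min) / step               # fractional bin index
    lo = int(pos)
    hi = min(lo + 1, num_bins - 1)
    w_hi = pos - lo                        # weight on the upper neighbour
    probs = [0.0] * num_bins
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs


def decode(probs, v_min, v_max):
    """Recover the scalar as the expectation over bin values."""
    step = (v_max - v_min) / (len(probs) - 1)
    return sum(p * (v_min + i * step) for i, p in enumerate(probs))
```

Training a network against such targets with a cross-entropy loss turns scalar regression into classification; the network's scalar prediction is then read off with `decode`.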
We used the factored policy representation introduced by Tang and Agrawal (2020), representing each dimension by a categorical distribution over bins for the policy prediction.
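A factored categorical policy of this kind can be sketched as follows. The sketch assumes that each action dimension is bounded in [-1, 1] and discretised into evenly spaced bin values, and that dimensions are sampled independently; the helper names and the bounds are hypothetical.

```python
import math
import random


def bin_values(num_bins, low=-1.0, high=1.0):
    """Evenly spaced action values, including both endpoints."""
    step = (high - low) / (num_bins - 1)
    return [low + i * step for i in range(num_bins)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits_per_dim, rng, low=-1.0, high=1.0):
    """Sample every action dimension independently from its own categorical.

    Returns the continuous action and its log-probability under the
    factored policy (the sum of the per-dimension log-probabilities).
    """
    action, log_prob = [], 0.0
    for logits in logits_per_dim:
        probs = softmax(logits)
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        action.append(bin_values(len(probs), low, high)[idx])
        log_prob += math.log(probs[idx])
    return action, log_prob
```

With D dimensions and B bins the network outputs only D × B logits, whereas the joint discretised action space has B^D elements, which is why the joint space is sampled rather than enumerated.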
To implement the network, we used the modules provided by the Haiku neural network library (Hennigan et al., 2020).
We used the Adam optimiser (Kingma and Ba, 2015) with decoupled weight decay (Loshchilov and Hutter, 2017) for training. We used a weight decay scale of , a batch size of an initial learning rate of , decayed to 0 over 1 million training batches using a cosine schedule:
where and .
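A cosine schedule that decays to zero can be sketched as below, assuming the common half-cosine form lr_init · (1 + cos(π · step / max_steps)) / 2. The function name is hypothetical and `lr_init` is left as a parameter rather than a value taken from the experiments.

```python
import math


def cosine_lr(step, lr_init, max_steps=1_000_000):
    """Learning rate that decays from lr_init to 0 along a half cosine."""
    progress = min(step, max_steps) / max_steps
    return lr_init * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this form the rate is halved at the midpoint of training and stays at zero once `step` reaches `max_steps`.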
For replay, we keep a buffer of the most recent sequences, splitting episodes into subsequences of length up to . Samples are drawn from the replay buffer according to prioritised replay (Schaul et al., 2016) using the same priority and hyperparameters as in MuZero.
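The replay scheme can be sketched roughly as follows. The class and function names are hypothetical, and plain proportional sampling stands in for prioritised replay: the priority computation, priority exponent and importance-weight correction are omitted.

```python
import random
from collections import deque


def split_episode(episode, max_len):
    """Cut one episode into consecutive chunks of at most max_len steps."""
    return [episode[i:i + max_len] for i in range(0, len(episode), max_len)]


class ReplayBuffer:
    """Keeps only the most recent `capacity` sequences."""

    def __init__(self, capacity):
        self.sequences = deque(maxlen=capacity)   # oldest entries are evicted first
        self.priorities = deque(maxlen=capacity)

    def add_episode(self, episode, max_len, priority=1.0):
        for chunk in split_episode(episode, max_len):
            self.sequences.append(chunk)
            self.priorities.append(priority)

    def sample(self, batch_size, rng):
        """Draw sequences with probability proportional to their priority."""
        return rng.choices(
            list(self.sequences), weights=list(self.priorities), k=batch_size
        )
```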
We trained Sampled MuZero using samples and a search budget of simulations per move. At the root of the search tree only, we evaluated all sampled actions before the start of the search and used those to initialise the quantities in the PUCT formula (Appendix D). We evaluated Sampled MuZero’s network checkpoints throughout training playing games with a search budget of simulations per move and picked the move with the highest number of visits to act, consistent with previous work.
We used Acme (Hoffman et al., 2020) to produce the results for DMPO (Hoffman et al., 2020) and D4PG (Barth-Maron et al., 2018). Compared to Acme, we used bigger networks (Policy Network layers = (512, 512, 256, 128), Critic Network layers = (1024, 1024, 512, 256)) and a bigger batch size of for better comparison. Each task was run with three seeds.
We provide full learning curve results on the DeepMind Control Suite (Figure 7) and Real-World RL (Figure 8) tasks.
A.1 Gaussian policy parameterisation
Even though a categorical policy representation was used to compute the main results, Sampled MuZero can also work directly with continuous actions. Figure 9 shows results on the hard and manipulator tasks when the policy prediction is parameterised by a Gaussian distribution.
The performance is similar across almost all tasks, but we found that Gaussian distributions are harder to optimise than their categorical counterpart and that entropy regularisation was useful to produce better results (we used a coefficient of 5e-3). It is possible that these results could be improved with better regularisation schemes such as constraining the deviation of the mean and standard deviation as in the MPO (Abdolmaleki et al., 2018) algorithm. In contrast, we did not need to add any regularisation to train the categorical distribution.
A.2 Sampled MuZero from Pixels
In addition to Sampled MuZero’s results on the hard and manipulator tasks when learning from raw pixel inputs, we compared Sampled MuZero to the Dreamer agent (Hafner et al., 2019) in Table 2. We used the 20 tasks and the 5 million environment steps experimental setup defined by Hafner et al. (2019). Sampled MuZero equalled or surpassed the Dreamer agent’s performance in all 20 tasks, without any action repeat (Dreamer uses an action repeat of 2), observation reconstruction, or any hyperparameter retuning.
Tasks  Dreamer  SMuZero 

acrobot.swingup  365.26  417.52 
cartpole.balance  979.56  984.86 
cartpole.balance_sparse  941.84  998.14 
cartpole.swingup  833.66  868.87 
cartpole.swingup_sparse  812.22  846.91 
cheetah.run  894.56  914.39 
ball_in_cup.catch  962.48  977.38 
finger.spin  498.88  986.38 
finger.turn_easy  825.86  972.53 
finger.turn_hard  891.38  963.07 
hopper.hop  368.97  528.24 
hopper.stand  923.72  926.50 
pendulum.swingup  833.00  837.76 
quadruped.run  888.39  923.54 
quadruped.walk  931.61  933.77 
reacher.easy  935.08  982.26 
reacher.hard  817.05  971.53 
walker.run  824.67  931.06 
walker.stand  977.99  987.79 
walker.walk  961.67  975.46 
A.3 Ablation on the number of samples
We trained multiple instances of Sampled MuZero with varying numbers of action samples on the humanoid.run task, for which the action is 21-dimensional. We ran six seeds for each instance. Surprisingly is already sufficient to learn a good policy and performance does not seem to be improved by sampling more than samples (see Figure 10).
A.4 Reproducibility
In order to evaluate the reproducibility of Sampled MuZero from state inputs and raw pixel inputs, we show the individual performance of 3 seeds on the hard and manipulator tasks in Figure 11. Overall, the variation in performance across seeds is minimal.
In addition, we show the individual performance of 6 seeds when sampling actions on the humanoid.run task. We observe that even when the number of samples is small, performance stays very reproducible across runs.
A.5 Ablation on using vs
We evaluated the practical importance of using instead of just in Sampled MuZero’s PUCT formula and ran experiments on the humanoid.run task.
We expect that as the number of samples increases, the difference will go away as . We therefore looked at the difference in performance when drawing or samples.
Furthermore, evaluating the Q values of all sampled actions at the root of the search tree before the start of the search puts more emphasis on the values and less on the prior in the PUCT formula. We therefore also show the difference in performance with and without Q evaluations (no Q in the figure).
The experiments in Figure 13 confirm that it is much better to use when the number of samples is small and the Q values are not evaluated. The performance drop of using is attenuated by evaluating the Q values at the root of the search tree, but it is still better to use even in that case.
Appendix B Go Experiments
For the Go experiments, we mostly used the same neural network architecture, optimisation and hyperparameters used by MuZero (Schrittwieser et al., 2020) with the following differences. Instead of using the outcome of the game to train the value network, we used n-step bootstrapping with where the value used to bootstrap was the averaged predictions of a target network applied to consecutive states at indices for . We averaged multiple consecutive target network value predictions due to the alternation of perspective for value prediction in two-player games; using the average of multiple estimates ensures that learning is based on the estimates for both sides. We observed that this reduced value overfitting and allowed us to train MuZero while generating less data. In addition, we used a search budget of simulations per move instead of in order to use less computation.
We evaluated the network checkpoints of MuZero and Sampled MuZero throughout training playing matches with a search budget of simulations per move. We anchored the Elo scale to a final MuZero baseline performance of Elo.
Appendix C Atari Experiments
For the Atari experiments, we used the same architecture, optimisation and hyperparameters used by MuZero (Schrittwieser et al., 2020).
We evaluated the network checkpoints of MuZero and Sampled MuZero throughout training playing games with a search budget of simulations per move.
Appendix D Search
The full PUCT formula used in Sampled MuZero is:
where
with and in the experiments for this paper. Note that at visit counts , the in the exploration term is approximately and the formula can be written:
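A minimal sketch of a PUCT-style selection rule follows. It assumes the rule has the same shape as MuZero's, with a state-dependent exploration coefficient c(s) = c1 + log((N + c2 + 1) / c2) multiplying a per-action prior weight and the factor sqrt(N) / (1 + n_a). The function name, the argument `prior` (standing for whichever prior term the search uses), and the default constants, which are those published for MuZero, are assumptions rather than values confirmed here.

```python
import math


def puct_select(q, prior, visits, c1=1.25, c2=19652.0):
    """Return the index of the action maximising Q + c(s) * prior * sqrt(N) / (1 + n_a)."""
    total = sum(visits)                              # N: visits to the parent node
    c = c1 + math.log((total + c2 + 1.0) / c2)       # exploration coefficient c(s)
    scores = [
        q_a + c * p_a * math.sqrt(total) / (1.0 + n_a)
        for q_a, p_a, n_a in zip(q, prior, visits)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

When the total visit count is much smaller than `c2`, the logarithm is close to zero and `c` is approximately `c1`, so the rule reduces to a fixed exploration constant.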
Appendix E Sample-based Policy Improvement and Evaluation Proofs
Lemma. and are linked by:
Proof is defined such that .
Therefore
where we used to go from line 1 to 2.
We therefore have
which shows by the uniqueness of that .
Theorem. For a given random variable , we have
Furthermore, is approximately normally distributed around as :
where .
Proof. We have
where we used the law of large numbers to go from line 2 to 3, replacing the expectation with the limit of a sum, and the lemma to go from line 3 to 4.
Making the approximation of swapping in for based on the lemma, we obtain that as :
Corollary. The sample-based policy improvement operator converges in distribution to the true policy improvement operator:
Furthermore, the sample-based policy improvement operator is approximately normally distributed around the true policy improvement operator as :
where .
Proof. We obtain the corollary by using in conjunction with and
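The convergence statement can be checked numerically on a toy problem. The sketch below assumes a finite action set and one particular action-independent improvement, in which the improved policy is proportional to π(a) · exp(Q(a) / τ); the function names, the choice of improvement and the temperature `tau` are illustrative assumptions. Actions are drawn from a sampling distribution `beta`, each sample is reweighted by the unnormalised improved policy divided by `beta`, and the normaliser is itself estimated from the samples.

```python
import math
import random


def exact_expectation(pi, q, x, tau):
    """E[X] under the improved policy, computed over the full action set."""
    w = [p * math.exp(qa / tau) for p, qa in zip(pi, q)]
    z = sum(w)                                   # true normaliser
    return sum(wa * xa for wa, xa in zip(w, x)) / z


def sampled_expectation(pi, q, x, beta, tau, num_samples, rng):
    """Estimate the same expectation from num_samples actions drawn from beta."""
    actions = rng.choices(range(len(beta)), weights=beta, k=num_samples)
    # Unnormalised improved policy divided by the sampling probability.
    w = [pi[a] * math.exp(q[a] / tau) / beta[a] for a in actions]
    z_hat = sum(w)                               # sample-based normaliser
    return sum(wa * x[a] for wa, a in zip(w, actions)) / z_hat
```

As the number of samples grows, the gap between the two functions shrinks at roughly the 1/sqrt(K) rate expected from a central limit argument.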
Appendix F The MuZero Policy Improvement Operator
Recent work (Grill et al., 2020) showed that MuZero’s visit count distribution was tracking the solution of a regularised policy optimisation problem:
where KL is the Kullback–Leibler divergence and is a constant dependent on and the total number of simulations. can be computed analytically:
where is a normalising factor such that and .
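For intuition, the analytic solution can be evaluated numerically. The sketch assumes the closed form reported by Grill et al. (2020), in which the regularised policy is λ · π(a) / (α − Q(a)) with α chosen so that the probabilities sum to one, and it finds α by bisection. The function name and arguments are hypothetical, and the prior is assumed to be strictly positive.

```python
def regularised_policy(q, prior, lam, iters=100):
    """Solve sum_a lam * prior[a] / (alpha - q[a]) = 1 for alpha by bisection."""
    lo = max(qa + lam * pa for qa, pa in zip(q, prior))   # the sum is >= 1 here
    hi = max(q) + lam                                     # the sum is <= 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(lam * pa / (mid - qa) for qa, pa in zip(q, prior))
        if total > 1.0:
            lo = mid      # alpha too small: the policy has too much mass
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return [lam * pa / (alpha - qa) for qa, pa in zip(q, prior)]
```

Because every action shares the same α, the ratio between the regularised policy and the prior depends on an action only through its Q value, and it is larger for actions with higher Q.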
In other words, using the terminology introduced in Section 4, MuZero’s policy improvement can be approximately written:
where
and is therefore action-independent.
Let us consider the visit count distribution obtained by searching using prior .
This shows that is the action-independent sample-based policy improvement operator associated with .