1 Introduction
Reinforcement learning (RL) is a general formulation for the problem of sequential decision making under uncertainty, where a learning system (the agent) must learn to maximize the cumulative rewards provided by the world it is embedded in (the environment), from experience of interacting with that environment (Sutton & Barto, 2018). An agent is said to be value-based if its behavior, i.e. its policy, is inferred (e.g. by inspection) from learned value estimates (Sutton, 1988; Watkins, 1989; Rummery & Niranjan, 1994; Tesauro, 1995). In contrast, a policy-based agent directly updates a (parametric) policy (Williams, 1992; Sutton et al., 2000) based on past experience. We may also classify as model-free the agents that update values and policies directly from experience (Sutton, 1988), and as model-based those that use (learned) models (Oh et al., 2015; van Hasselt et al., 2019) to plan either global (Sutton, 1990) or local (Richalet et al., 1978; Kaelbling & Lozano-Pérez, 2010; Silver & Veness, 2010) values and policies. Such distinctions are useful for communication, but, to master the singular goal of optimizing rewards in an environment, agents often combine ideas from more than one of these areas (Hessel et al., 2018; Silver et al., 2016; Schrittwieser et al., 2020).

In this paper, we focus on a critical part of RL, namely policy optimization. We leave a precise formulation of the problem for later, but different policy optimization algorithms can be seen as answers to the following crucial question:
given data about an agent’s interactions with the world,
and predictions in the form of value functions or models,
how should we update the agent’s policy?
We start from an analysis of the desiderata for general policy optimization. These include support for partial observability and function approximation, the ability to learn stochastic policies, robustness to diverse environments or training regimes (e.g. off-policy data), and the ability to represent knowledge as value functions and models. See Section 3 for further details on our desiderata for policy optimization.
Then, we propose a policy update combining regularized policy optimization with model-based ideas so as to make progress on the dimensions highlighted in the desiderata. More specifically, we use a model inspired by MuZero (Schrittwieser et al., 2020) to estimate action values via one-step lookahead. These action values are then plugged into a modified Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) mechanism, based on clipped normalized advantages, that is robust to scaling issues without requiring constrained optimization. The overall update, named Muesli, then combines the clipped MPO targets and policy gradients into a direct method (Vieillard et al., 2020) for regularized policy optimization.
The majority of our experiments were performed on 57 classic Atari games from the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018), a popular benchmark for deep RL. We found that, on Atari, Muesli can match the state-of-the-art performance of MuZero, without requiring deep search, but instead acting directly with the policy network and using one-step lookaheads in the updates. To help understand the different design choices made in Muesli, our experiments on Atari include multiple ablations of our proposed update. Additionally, to evaluate how well our method generalises to different domains, we performed experiments on a suite of continuous control environments (based on MuJoCo and sourced from the OpenAI Gym (Brockman et al., 2016)). We also conducted experiments in 9x9 Go in self-play, to evaluate our policy update in a domain traditionally dominated by search methods.
2 Background
The environment.
We are interested in episodic environments with variable episode lengths (e.g. Atari games), formalized as Markov Decision Processes (MDPs) with initial state distribution $\mu$ and discount $\gamma \in [0, 1)$; ends of episodes correspond to absorbing states with no rewards.

The objective. The agent starts at a state $S_0$ drawn from the initial state distribution $\mu$. At each time step $t$, the agent takes an action $A_t$ from a policy $\pi(a|S_t)$, obtains the reward $R_{t+1}$ and transitions to the next state $S_{t+1}$. The expected sum of discounted rewards after a state-action pair is called the action-value or Q-value $q^\pi(s, a)$:
$$q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s,\, A_0 = a,\, \pi\right]. \quad (1)$$
The value of a state is $v^\pi(s) = \mathbb{E}_{A \sim \pi(\cdot|s)}\left[q^\pi(s, A)\right]$ and the objective is to find a policy $\pi$ that maximizes the expected value of the states from the initial state distribution:
$$J(\pi) = \mathbb{E}_{S_0 \sim \mu}\left[v^\pi(S_0)\right]. \quad (2)$$
Policy improvement. Policy improvement is one of the fundamental building blocks of reinforcement learning algorithms. Given a policy $\pi_{\text{prior}}$ and its Q-values $q^{\pi_{\text{prior}}}(s, a)$, a policy improvement step constructs a new policy $\pi$ such that $v^{\pi}(s) \geq v^{\pi_{\text{prior}}}(s)$ for all states $s$. For instance, a basic policy improvement step is to construct the greedy policy:
$$\pi'(s) = \operatorname*{arg\,max}_{a}\, q^{\pi_{\text{prior}}}(s, a). \quad (3)$$
Regularized policy optimization. A regularized policy optimization algorithm solves the following problem:
$$\pi_{\text{new}} = \operatorname*{arg\,max}_{\pi} \left( \mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat q_{\pi_{\text{prior}}}(s, A)\right] - \Omega(\pi) \right), \quad (4)$$
where $\hat q_{\pi_{\text{prior}}}(s, a)$ are approximate Q-values of the policy $\pi_{\text{prior}}$ and $\Omega(\pi)$ is a regularizer. For example, we may use as the regularizer the negative entropy of the policy, $\Omega(\pi) = -\lambda \mathrm{H}(\pi)$, weighted by an entropy cost $\lambda$ (Williams & Peng, 1991). Alternatively, we may also use $\Omega(\pi) = \lambda\, \mathrm{KL}(\pi_{\text{prior}}, \pi)$, where $\pi_{\text{prior}}$ is the previous policy, as used in TRPO (Schulman et al., 2015).
Following the terminology introduced by Vieillard et al. (2020), we can then solve Eq. 4 by either direct or indirect methods. If $\Omega(\pi)$ is differentiable with respect to the policy parameters, a direct method applies gradient ascent to
$$J(s, \pi) = \mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat q_{\pi_{\text{prior}}}(s, A)\right] - \Omega(\pi). \quad (5)$$
Using the log derivative trick to sample the gradient of the expectation results in the canonical (regularized) policy gradient update (Sutton et al., 2000).
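For intuition, the log-derivative trick for a softmax policy over a small discrete action set can be sketched as follows. This is our own minimal NumPy sketch, not the paper's implementation; the function names, toy Q-values, and hyperparameters are illustrative, and the regularizer is omitted for brevity.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sampled_pg_step(logits, q_hat, rng, n_samples=64, lr=0.1):
    """One gradient-ascent step on E_{A~pi}[q_hat(A)] for a single state.

    Uses the log-derivative trick: the gradient is estimated as the average of
    q_hat(a) * grad log pi(a) over actions sampled from pi itself, where
    grad log pi(a) w.r.t. the logits equals onehot(a) - pi for a softmax policy.
    """
    pi = softmax(logits)
    grad = np.zeros_like(logits)
    for a in rng.choice(len(pi), size=n_samples, p=pi):
        grad += q_hat[a] * (np.eye(len(pi))[a] - pi)
    return logits + lr * grad / n_samples

# Repeated updates concentrate probability on the high-value action.
rng = np.random.default_rng(0)
logits = np.zeros(4)
q_hat = np.array([0.0, 1.0, 0.0, 0.0])
for _ in range(300):
    logits = sampled_pg_step(logits, q_hat, rng)
```

In practice a baseline (e.g. a learned value estimate) would be subtracted from `q_hat` to reduce the variance of the estimator.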
In indirect methods, the solution of the optimization problem (4) is found exactly, or numerically, for one state and then distilled into a parametric policy. For example, Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) uses as regularizer $\Omega(\pi) = \lambda\, \mathrm{KL}(\pi, \pi_{\text{prior}})$, for which the exact solution to the regularized problem is
$$\pi_{\text{MPO}}(a|s) = \pi_{\text{prior}}(a|s)\, \frac{\exp\!\big(\hat q(s, a)/\lambda\big)}{z(s)}, \quad (6)$$
where $z(s) = \sum_b \pi_{\text{prior}}(b|s) \exp\!\big(\hat q(s, b)/\lambda\big)$ is a normalization factor that ensures that the resulting probabilities form a valid probability distribution (i.e. they sum up to 1).
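The exact MPO solution above can be computed directly for a small discrete action space. Below is a minimal NumPy sketch of Eq. 6; the function name and the toy prior and Q-values are ours, purely for illustration.

```python
import numpy as np

def mpo_target(pi_prior, q_hat, lam):
    """Exact solution of the KL-regularized problem (Eq. 6):
    pi(a|s) proportional to pi_prior(a|s) * exp(q_hat(s, a) / lam)."""
    logits = np.log(pi_prior) + q_hat / lam
    logits -= logits.max()            # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()                # dividing by z(s) normalizes the weights

pi_prior = np.full(4, 0.25)
q_hat = np.array([1.0, 2.0, 0.5, 1.5])
pi_new = mpo_target(pi_prior, q_hat, lam=1.0)
```

As the temperature `lam` shrinks toward zero the target approaches the greedy policy of Eq. 3, while a large `lam` keeps the target close to `pi_prior`.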
MuZero. MuZero (Schrittwieser et al., 2020) uses a weakly grounded (Grimm et al., 2020) transition model, trained end-to-end exclusively to support accurate reward, value and policy predictions. Since such a model can be unrolled to generate sequences of rewards and value estimates for different sequences of actions (or plans), it can be used to perform Monte-Carlo Tree Search, or MCTS (Coulom, 2006). MuZero then uses MCTS to construct a policy as the categorical distribution over the normalized visit counts for the actions in the root of the search tree; this policy is then used both to select actions, and as a policy target for the policy network. Despite MuZero being introduced with different motivations, Grill et al. (2020) showed that the MuZero policy update can also be interpreted as approximately solving a regularized policy optimization problem, with the regularizer $\mathrm{KL}(\pi_{\text{prior}}, \pi)$ also used by the TRPO algorithm (Schulman et al., 2015).
3 Desiderata and motivating principles
First, to motivate our investigation, we discuss a few desiderata for a general policy optimization algorithm.
3.1 Observability and function approximation
Being able to learn stochastic policies, and being able to leverage Monte-Carlo or multi-step bootstrapped return estimates, is important for a policy update to be truly general.
This is motivated by the challenges of learning in partially observable environments (Åström, 1965) or, more generally, in settings where function approximation is used (Sutton & Barto, 2018). Note that these two are closely related: if a chosen function approximation ignores a state feature, then the state feature is, for all practical purposes, not observable.
In POMDPs the optimal memoryless stochastic policy can be better than any memoryless deterministic policy, as shown by Singh et al. (1994). As an illustration, consider the MDP in Figure 2, where on each step the agent picks one of two actions. If the state representation is the same for all states, the optimal memoryless policy is stochastic, and its exact action probabilities can be found with pen and paper; see Appendix B for details.
It is also known that, in these settings, it is often preferable to leverage Monte-Carlo returns, or at least multi-step bootstrapped estimators, instead of using one-step targets (Jaakkola et al., 1994). Consider again the MDP in Figure 2: bootstrapping from the value of the aggregated state representation produces biased estimates of the expected return, because that value aggregates the values of multiple underlying states; again, see Appendix B for the derivation.
Among the methods in Section 2, both policy gradients and MPO allow convergence to stochastic policies, but only policy gradients naturally incorporate multi-step return estimators. In MPO, stochastic return estimates could make the agent overly optimistic, since $\mathbb{E}[\exp(\hat q)] \geq \exp(\mathbb{E}[\hat q])$ by Jensen's inequality.
3.2 Policy representation
Policies may be constructed from action values or they may combine action values and other quantities (e.g., a direct parametrization of the policy or historical data). We argue that the action values alone are not enough.
First, we show that action values are not always enough to represent the best stochastic policy. Consider again the MDP in Figure 2 with an identical state representation in all states. As discussed, the optimal memoryless policy is stochastic and non-uniform. This non-uniform policy cannot be inferred from Q-values, as these are the same for all actions and are thus wholly uninformative about the best probabilities. Similarly, a model on its own is also insufficient without a policy, as it would produce the same uninformative action values.
One approach to address this limitation is to parameterize the policy explicitly (e.g. via a policy network). This has the additional advantage of allowing us to directly sample both discrete (Mnih et al., 2016) and continuous (van Hasselt & Wiering, 2007; Degris et al., 2012; Silver et al., 2014) actions. In contrast, maximizing Q-values over continuous action spaces is challenging. Access to a parametric policy network that can be queried directly is also beneficial for agents that act by planning with a learned model (e.g. via MCTS), as it allows the policy to guide the search in large or continuous action spaces.
3.3 Robust learning
We seek algorithms that are robust to: 1) off-policy or historical data; 2) inaccuracies in values and models; 3) diversity of environments. In the following paragraphs we discuss what each of these entails.
Reusing data from previous iterations of the policy (Lin, 1992; Riedmiller, 2005; Mnih et al., 2015) can make RL more data efficient. However, when computing the gradient of the objective on data from an older policy $\pi_{\text{prior}}$, an unregularized application of the gradient can degrade the value of $\pi$. The amount of degradation depends on the total variation distance between $\pi$ and $\pi_{\text{prior}}$, and we can use a regularizer to control it, as in Conservative Policy Iteration (Kakade & Langford, 2002), Trust Region Policy Optimization (Schulman et al., 2015), and Appendix C.
Whether we learn on- or off-policy, agents' predictions incorporate errors. Regularization can also help here. For instance, if the Q-values have errors, the MPO regularizer $\lambda\, \mathrm{KL}(\pi, \pi_{\text{prior}})$ maintains a strong performance bound (Vieillard et al., 2020): the errors from multiple iterations average out, instead of accumulating in a discounted sum of the absolute errors. While not all assumptions behind this result apply in an approximate setting, Section 5 shows that MPO-like regularizers are helpful empirically.
Finally, robustness to diverse environments is critical to ensure a policy optimization algorithm operates effectively in novel settings. This can take various forms, but we focus on robustness to diverse reward scales and on minimizing problem-dependent hyperparameters. The latter are an especially subtle form of inductive bias that may limit the applicability of a method beyond established benchmarks (Hessel et al., 2019).

Table 1. Desiderata for general policy optimization.

Observability and function approximation
1a) Support learning stochastic policies
1b) Leverage Monte-Carlo targets

Policy representation
2a) Support learning the optimal memoryless policy
2b) Scale to (large) discrete action spaces
2c) Scale to continuous action spaces

Robust learning
3a) Support off-policy and historical data
3b) Deal gracefully with inaccuracies in the values/model
3c) Be robust to diverse reward scales
3d) Avoid problem-dependent hyperparameters

Rich representation of knowledge
4a) Estimate values (variance reduction, bootstrapping)
4b) Learn a model (representation, composability)
3.4 Rich representation of knowledge
Even if the policy is parametrized explicitly, we argue it is important for the agent to represent knowledge in multiple ways (Degris & Modayil, 2012) in order to update that policy reliably and robustly. Two classes of predictions have proven particularly useful: value functions and models.
Value functions (Sutton, 1988; Sutton et al., 2011) can capture knowledge about a cumulant over long horizons, but can be learned with a cost independent of the span of the predictions (van Hasselt & Sutton, 2015). They have been used extensively in policy optimization, e.g., to implement forms of variance reduction (Williams, 1992), and to allow updating policies online through bootstrapping, without waiting for episodes to fully resolve (Sutton et al., 2000).
Models can also be useful in various ways: 1) learning a model can act as an auxiliary task (Schmidhuber, 1990; Sutton et al., 2011; Jaderberg et al., 2017; Guez et al., 2020), and help with representation learning; 2) a learned model may be used to update policies and values via planning (Werbos, 1987; Sutton, 1990; Ha & Schmidhuber, 2018); 3) finally, the model may be used to plan for action selection (Richalet et al., 1978; Silver & Veness, 2010). These benefits of learned models are entangled in MuZero. Sometimes, it may be useful to decouple them, for instance to retain the benefits of models for representation learning and policy optimization, without depending on the computationally intensive process of planning for action selection.
4 Robust yet simple policy optimization
The full list of desiderata is presented in Table 1. These are far from solved problems, but they can be helpful to reason about policy updates. In this section, we describe a policy optimization algorithm designed to address these desiderata.
4.1 Our proposed clipped MPO (CMPO) regularizer
We use the Maximum a Posteriori Policy Optimization (MPO) algorithm (Abdolmaleki et al., 2018) as a starting point, since it can learn stochastic policies (1a), supports discrete and continuous action spaces (2c), can learn stably from off-policy data (3a), and has strong performance bounds even when using approximate Q-values (3b). We then improve the degree of control provided by MPO on the total variation distance between $\pi$ and $\pi_{\text{prior}}$ (3a), avoiding sensitive domain-specific hyperparameters (3d).
MPO uses the regularizer $\Omega(\pi) = \lambda\, \mathrm{KL}(\pi, \pi_{\text{prior}})$, where $\pi_{\text{prior}}$ is the previous policy. Since we are interested in learning from stale data, we allow $\pi_{\text{prior}}$ to correspond to arbitrary previous policies, and we introduce the regularizer $\Omega(\pi) = \lambda\, \mathrm{KL}(\pi_{\text{CMPO}}, \pi)$, based on the new target $\pi_{\text{CMPO}}$:
$$\pi_{\text{CMPO}}(a|s) = \frac{\pi_{\text{prior}}(a|s)\, \exp\!\big(\mathrm{clip}(\hat{\mathrm{adv}}(s, a), -c, c)\big)}{z_{\text{CMPO}}(s)}, \quad (7)$$
where $\hat{\mathrm{adv}}(s, a)$ is a non-stochastic approximation of the advantage and the factor $z_{\text{CMPO}}(s)$ ensures the policy is a valid probability distribution. The $\exp\!\big(\mathrm{clip}(\hat{\mathrm{adv}}(s, a), -c, c)\big)$ term we use in the regularizer has an interesting relation to natural policy gradients (Kakade, 2001): $\pi_{\text{CMPO}}$ is obtained if the natural gradient is computed with respect to the logits of $\pi_{\text{prior}}$ and the expected gradient is then clipped (for a proof, note that the natural policy gradient with respect to the logits is equal to the advantages (Agarwal et al., 2019)).

The clipping threshold $c$ controls the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$. Specifically, the total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ is defined as
$$D_{\mathrm{TV}}\big(\pi_{\text{CMPO}}(\cdot|s),\, \pi_{\text{prior}}(\cdot|s)\big) = \frac{1}{2} \sum_a \big|\pi_{\text{CMPO}}(a|s) - \pi_{\text{prior}}(a|s)\big|. \quad (8)$$
As discussed in Section 3.3, a constrained total variation distance supports robust off-policy learning. The clipped advantages allow us to derive not just a bound on the total variation distance but an exact formula:
Theorem 4.1 (Maximum CMPO total variation distance)
For any clipping threshold $c > 0$, we have:

$$\max_{\hat{\mathrm{adv}},\, \pi_{\text{prior}}} D_{\mathrm{TV}}\big(\pi_{\text{CMPO}},\, \pi_{\text{prior}}\big) = \tanh\!\left(\frac{c}{2}\right).$$
We refer readers to Appendix D for a proof of Theorem 4.1; we also verified the theorem's predictions numerically.
Note that the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ does not depend on the number of actions or other environment properties (3d); it only depends on the clipping threshold $c$, as visualized in Figure 3a. This allows us to control the maximum total variation distance under a CMPO update by setting $c$ directly, without requiring the constrained optimization procedure used in the original MPO paper. We used the same clipping threshold in our experiments across all domains.
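The numerical verification of Theorem 4.1 can be reproduced with a few lines of NumPy. This is our own sketch (function names and the choice of test distributions are ours), under our reading of the theorem's $\tanh(c/2)$ bound: a simple two-action worst case attains the bound exactly, while random priors and advantages never exceed it.

```python
import numpy as np

def cmpo_target(pi_prior, adv, c):
    """CMPO target (Eq. 7): prior re-weighted by exponentiated clipped advantages."""
    w = pi_prior * np.exp(np.clip(adv, -c, c))
    return w / w.sum()                      # dividing by z_CMPO(s) normalizes

def total_variation(p, q):
    """Total variation distance between two discrete distributions (Eq. 8)."""
    return 0.5 * np.abs(p - q).sum()

c = 1.0
# A two-action worst case: prior p* = 1/(e^c + 1) with advantages (+c, -c)
# attains the maximum distance tanh(c/2) exactly.
p_star = 1.0 / (np.exp(c) + 1.0)
pi_prior = np.array([p_star, 1.0 - p_star])
tv = total_variation(cmpo_target(pi_prior, np.array([c, -c]), c), pi_prior)

# Random (prior, advantage) pairs stay below the bound.
rng = np.random.default_rng(0)
for _ in range(1000):
    logits = rng.normal(size=4)
    prior = np.exp(logits) / np.exp(logits).sum()
    a = rng.normal(scale=10.0, size=4)
    assert total_variation(cmpo_target(prior, a, c), prior) <= np.tanh(c / 2) + 1e-12
```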
4.2 A novel policy update
Given the proposed regularizer $\Omega(\pi) = \lambda\, \mathrm{KL}(\pi_{\text{CMPO}}, \pi)$, we can update the policy by direct optimization of the regularized objective, that is, by gradient descent on
$$L_{\text{PG+CMPO}}(\pi, s) = -\, \mathbb{E}_{A \sim \pi(\cdot|s)}\big[\hat{\mathrm{adv}}(s, A)\big] + \lambda\, \mathrm{KL}\big(\pi_{\text{CMPO}}(\cdot|s),\, \pi(\cdot|s)\big), \quad (9)$$
where the advantage terms in each component of the loss can be normalized using the approach described in Section 4.5 to improve the robustness to reward scales.
The first term corresponds to a standard policy-gradient update, thus allowing stochastic estimates of $\hat{\mathrm{adv}}(s, A)$ that use Monte-Carlo or multi-step estimators (1b). The second term adds regularization via distillation of the CMPO target, to preserve the desiderata addressed in Section 4.1.
Critically, the hyperparameter $\lambda$ is easy to set (3d), because even if $\lambda$ is high, $\pi_{\text{CMPO}}$ still proposes improvements to the policy $\pi_{\text{prior}}$. This property is missing in popular regularizers that maximize entropy or minimize a distance from $\pi_{\text{prior}}$. We refer to the sensitivity analysis depicted in Figure 3b for a sample of the wide range of values of $\lambda$ that we found to perform well on Atari. We used the same value of $\lambda$ in all other experiments reported in the paper.
Both terms can be sampled, allowing us to trade off the computation cost and the variance of the update; this is especially useful in large or continuous action spaces (2b), (2c).
We can sample the gradient of the first term by computing the loss on data generated by a behavior policy $\pi_b$, and then use importance sampling to correct for the distribution shift with respect to $\pi$. This results in the estimator
$$-\, \frac{\pi(a|s)}{\pi_b(a|s)}\, \big(G^{v}(s, a) - \hat v(s)\big) \quad (10)$$
for the first term of the policy loss, with $a \sim \pi_b(\cdot|s)$. In this expression, $\pi_b$ is the behavior policy; the advantage combines a stochastic multi-step bootstrapped return estimator $G^{v}(s, a)$ and a learned baseline $\hat v(s)$.
We can also sample the regularizer, by computing a stochastic estimate of the KL on a subset of actions $a^{(1)}, \dots, a^{(N)}$, sampled from $\pi_{\text{prior}}(\cdot|s)$. In that case, the second term of Eq. 9 becomes (ignoring an additive constant):
$$\frac{\lambda}{N} \sum_{k=1}^{N} \left[ -\, \frac{\exp\!\big(\mathrm{clip}(\hat{\mathrm{adv}}(s, a^{(k)}), -c, c)\big)}{z_{\text{CMPO}}(s)}\, \log \pi(a^{(k)}|s) \right], \quad (11)$$
where $\hat{\mathrm{adv}}(s, a) = \hat q(s, a) - \hat v(s)$ is computed from the learned values $\hat q$ and $\hat v$. To support sampling just a few actions from the current state $s$, we can estimate $z_{\text{CMPO}}(s)$ for the $k$-th sample out of $N$ as:
$$\hat z^{(k)}(s) = \frac{z_{\text{init}} + \sum_{i \neq k} \exp\!\big(\mathrm{clip}(\hat{\mathrm{adv}}(s, a^{(i)}), -c, c)\big)}{N}, \quad (12)$$
where $z_{\text{init}}$ is an initial estimate. We use $z_{\text{init}} = 1$.
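The sampled KL term can be sketched for a single state as follows. This is a hedged, array-based sketch of our own: the function interface and names are ours, and the leave-one-out normalizer estimate follows Eq. 12 as reconstructed here.

```python
import numpy as np

def sampled_kl_term(log_pi, pi_prior, adv, rng, n=16, c=1.0, lam=1.0, z_init=1.0):
    """Stochastic estimate of lam * KL(pi_cmpo, pi), up to an additive constant.

    log_pi, pi_prior, adv are per-action arrays for one state. Actions are
    sampled from pi_prior, and z_cmpo(s) is estimated per sample with a
    leave-one-out average that includes the initial estimate z_init (Eq. 12)."""
    actions = rng.choice(len(pi_prior), size=n, p=pi_prior)
    w = np.exp(np.clip(adv[actions], -c, c))   # exp(clip(adv)) per sampled action
    total = 0.0
    for k in range(n):
        z_hat = (z_init + w.sum() - w[k]) / n  # leave-one-out normalizer estimate
        total += -(w[k] / z_hat) * log_pi[actions[k]]
    return lam * total / n

# Sanity check: with zero advantages the CMPO target equals pi_prior, and the
# estimate reduces to the cross-entropy between pi_prior and pi.
rng = np.random.default_rng(0)
uniform = np.full(4, 0.25)
loss = sampled_kl_term(np.log(uniform), uniform, np.zeros(4), rng)
```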
4.3 Learning a model
As discussed in Section 3.4, learning models has several potential benefits. Thus, we propose to train a model alongside policy and value estimates (4b). As in MuZero (Schrittwieser et al., 2020) our model is not trained to reconstruct observations, but is rather only required to provide accurate estimates of rewards, values and policies. It can be seen as an instance of value equivalent models (Grimm et al., 2020).
For training, the model is unrolled $K$ steps, taking as inputs an initial state $s_t$ and an action sequence $a_t, \dots, a_{t+K-1}$. On each step $k$, the model then predicts rewards $\hat r_k$, values $\hat v_k$ and policies $\hat \pi_k$. Rewards and values are trained to match the observed rewards and the values of the states actually visited when executing those actions.
Policy predictions after unrolling the model $k$ steps are trained to match the policy targets $\pi_{\text{CMPO}}(\cdot|s_{t+k})$ computed at the actually observed states $s_{t+k}$. The policy component of the model loss can then be written as:
$$L_m(\pi, s_t) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{KL}\big(\pi_{\text{CMPO}}(\cdot|s_{t+k}),\, \hat\pi_k(\cdot|s_t)\big). \quad (13)$$
This differs from MuZero in that here the policy predictions are updated towards the targets $\pi_{\text{CMPO}}$, instead of being updated to match targets constructed from the MCTS visit counts.
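A minimal sketch of this policy-model loss is shown below. The helper names are ours, and the averaging over unroll steps follows Eq. 13 as reconstructed here; the paper's exact weighting of the $K$ terms may differ.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def policy_model_loss(cmpo_targets, model_policies):
    """Policy component of the model loss: the k-step policy prediction of the
    unrolled model is distilled toward the CMPO target computed at the actually
    observed state s_{t+k}, averaged over the K unroll steps."""
    K = len(model_policies)
    return sum(kl(t, p) for t, p in zip(cmpo_targets, model_policies)) / K

# A perfect k-step prediction incurs zero policy-model loss.
uniform = np.full(3, 1.0 / 3.0)
zero_loss = policy_model_loss([uniform], [uniform])
```

Note the KL direction: the CMPO target is the first argument, so the model's policy head is distilled toward it, matching the distillation term of the Muesli policy loss.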
4.4 Using the model
The first use of a model is as an auxiliary task. We implement this by conditioning the model not on a raw environment state but, instead, on the activations from a hidden layer of the policy network. Gradients from the model loss are then propagated all the way into the shared encoder, to help learning good state representations.
The second use of the model is within the policy update from Eq. 9. Specifically, the model is used to estimate the action values $\hat q(s, a)$, via one-step lookahead:
$$\hat q(s_t, a) = \hat r_1(s_t, a) + \gamma\, \hat v_1(s_t, a), \quad (14)$$
and the model-based action values are then used in two ways. First, they are used to estimate the multi-step return in Eq. 10, by combining action values and observed rewards using the Retrace estimator (Munos et al., 2016). Second, the action values are used in the (non-stochastic) advantage estimate required by the regularization term in Eq. 11.
Using the model to compute the $\pi_{\text{CMPO}}$ target, instead of using it to construct a search-based policy, has advantages: a fast analytical formula, stochastic estimation of the KL in large action spaces (2b), and direct support for continuous actions (2c). In contrast, MuZero's targets are only an approximate solution to regularized policy optimization (Grill et al., 2020), and the approximation can be crude when using few simulations.
Note that we could have also used deep search to estimate action values, and used these in the proposed update. Deep search would, however, be computationally expensive, and may require more accurate models to be effective (3b).
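The one-step lookahead of Eq. 14 amounts to a single model call per action. Below is our own toy sketch: the `ToyModel` interface stands in for the learned model (the real model operates on latent states), and the discount value is illustrative.

```python
import numpy as np

class ToyModel:
    """A hypothetical stand-in for the learned model: given a state and a first
    action, it predicts the immediate reward and the value after one step."""
    def reward(self, s, a):
        return 1.0 if a == 1 else 0.0
    def value(self, s, a):
        return 0.5

def lookahead_q(model, s, actions, gamma=0.995):
    """One-step lookahead action values (Eq. 14): q(s, a) = r_1 + gamma * v_1."""
    return np.array([model.reward(s, a) + gamma * model.value(s, a)
                     for a in actions])

q_hat = lookahead_q(ToyModel(), s=0, actions=[0, 1])
adv_hat = q_hat - q_hat.mean()   # a simple advantage estimate around a baseline
```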
4.5 Normalization
CMPO avoids overly large changes but does not prevent updates from becoming vanishingly small due to small advantages. To increase robustness to reward scales (3c), we divide the advantages by the standard deviation of the advantage estimator. A similar normalization was used in PPO (Schulman et al., 2017), but we estimate the standard deviation using moving averages, to support small batches. Normalized advantages do not become small, even when the policy is close to optimal; for convergence, we rely on learning rate decay.
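A moving-average variant of advantage normalization can be sketched as follows. This is our own minimal sketch, not the paper's implementation: the decay and epsilon values are illustrative, and the Adam-style debiasing is our choice.

```python
import numpy as np

class AdvantageNormalizer:
    """Divide advantages by a running estimate of their standard deviation.

    Unlike per-batch normalization (as in PPO), exponential moving averages of
    the first and second moments keep the estimate usable with small batches."""
    def __init__(self, decay=0.99, eps=1e-8):
        self.decay, self.eps = decay, eps
        self.m1, self.m2, self.steps = 0.0, 0.0, 0

    def __call__(self, adv):
        self.steps += 1
        self.m1 = self.decay * self.m1 + (1 - self.decay) * adv.mean()
        self.m2 = self.decay * self.m2 + (1 - self.decay) * (adv ** 2).mean()
        corr = 1 - self.decay ** self.steps          # debias, as in Adam
        var = max(self.m2 / corr - (self.m1 / corr) ** 2, 0.0)
        return adv / (np.sqrt(var) + self.eps)

# Whatever the raw scale of the advantages, the normalized values converge
# to roughly unit scale.
rng = np.random.default_rng(0)
norm = AdvantageNormalizer()
for _ in range(500):
    out = norm(rng.normal(scale=100.0, size=64))
```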
All policy components can be normalized using this approach, but the model also predicts rewards and values, and the corresponding losses could be sensitive to reward scales. To avoid having to tune, per game, the weighting of these unnormalized components (3c), (3d), we compute the reward and value losses in a non-linearly transformed space (Pohlen et al., 2018; van Hasselt et al., 2019), using the categorical reparametrization introduced by MuZero (Schrittwieser et al., 2020).

5 An empirical study
In this section, we investigate empirically the policy updates described in Section 4. The full agent implementing our recommendations is named Muesli, as an homage to MuZero. The Muesli policy loss is $L_{\text{PG+CMPO}}(\pi, s) + L_m(\pi, s)$. All agents in this section are trained using the Sebulba podracer architecture (Hessel et al., 2021).
First, we use the 57 Atari games in the Arcade Learning Environment (Bellemare et al., 2013) to investigate the key design choices in Muesli, by comparing it to suitable baselines and ablations. We use sticky actions to make the environments stochastic (Machado et al., 2018). To ensure comparability, all agents use the same policy network, based on the IMPALA agent (Espeholt et al., 2018). When applicable, the model described in Section 4.3 is parametrized by an LSTM (Hochreiter & Schmidhuber, 1997), with a diagram in Figure 10 in the appendix. Agents are trained using uniform experience replay, and estimate multistep returns using Retrace (Munos et al., 2016).
In Figure 1a we compare the median humannormalized score on Atari achieved by Muesli to that of several baselines: policy gradients (in red), PPO (in green), MPO (in grey) and a policy gradient variant with TRPOlike regularization (in orange). The updates for each baseline are reported in Appendix F, and the agents differed only in the policy components of the losses. In all updates we used the same normalization, and trained a MuZerolike model grounded in values and rewards. In MPO and Muesli, the policy loss included the policy model loss from Eq. 13. For each update, we separately tuned hyperparameters on 10 of the 57 Atari games. We found the performance on the full benchmark to be substantially higher for Muesli (in blue). In the next experiments we investigate how different design choices contributed to Muesli’s performance.
In Figure 4 we use the Atari games beam_rider and gravitar to investigate advantage clipping. Here, we compare the updates that use clipped (in blue) and unclipped (in red) advantages, after first rescaling the advantages by constant factors spanning several orders of magnitude, to simulate diverse return scales. Without clipping, performance was sensitive to scale, and degraded quickly when scaling advantages by a factor of 100 or more. With clipping, learning was almost unaffected by the rescaling, without requiring more complicated solutions such as the constrained optimization introduced in related work to deal with this issue (Abdolmaleki et al., 2018).
In Figure 5 we show how Muesli combines the benefits of direct and indirect optimization. A direct MPO update uses the regularizer $\lambda\, \mathrm{KL}(\pi, \pi_{\text{prior}})$ as a penalty; cf. Mirror Descent Policy Optimization (Tomar et al., 2020). Indirect MPO first finds $\pi_{\text{MPO}}$ from Eq. 6 and then trains the policy with the distillation loss $\mathrm{KL}(\pi_{\text{MPO}}, \pi)$. Note the different directions of the KLs. Vieillard et al. (2020) observed that the best choice between direct and indirect MPO is problem dependent, and we found the same: compare the ordering of direct MPO (in green) and indirect CMPO (in yellow) on the two Atari games alien and robotank. In contrast, we found that the Muesli policy update (in blue) was typically able to combine the benefits of the two approaches, by performing as well as or better than the best of the two updates on each of the two games. See Figure 13 in the appendix for aggregate results across more games.
In Figure 6a we evaluate the importance of using multi-step bootstrapped returns and model-based action values in the policy-gradient-like component of Muesli's update. Replacing the multi-step return with the one-step estimate $\hat q(s, a)$ (in red in Figure 6a) degraded the performance of Muesli (in blue) by a large amount, showing the importance of leveraging multi-step estimators. We also evaluated the role of model-based action value estimates in the Retrace estimator, by comparing full Muesli to an ablation (in green) where we instead used model-free values in a V-trace estimator (Espeholt et al., 2018). The ablation performed worse.
In Figure 6b we compare the performance of Muesli when using different numbers of actions to estimate the KL term in Eq. 11. We found that the resulting agent performed well in absolute terms even when estimating the KL by sampling as few as a single action (brown). Performance increased when sampling up to 16 actions, at which point it was comparable to using the exact KL.
In Figure 7a we show the impact of different parts of the model loss on representation learning. The performance degraded when the model was only trained for one step (in green). This suggests that training a model to support deeper unrolls (5 steps in Muesli, in blue) is a useful auxiliary task, even if only one-step lookaheads are used in the policy update. In Figure 7a we also show that performance degraded even further if the model was not trained to output policy predictions at each step in the future, as per Eq. 13, but was instead only trained to predict rewards and values (in red). This is consistent with the value equivalence principle (Grimm et al., 2020): a rich signal from training models to support multiple predictions is critical for this kind of model.
In Figure 7b we compare Muesli to an MCTS baseline. As in MuZero, the baseline uses MCTS both for acting and learning. This is not a canonical MuZero, though, as it uses the (smaller) IMPALA network. MCTS (in purple) performed worse than Muesli (in blue) in this regime. We ran another MCTS variant with limited search depth (in green); this was better than full MCTS, suggesting that with insufficiently large networks, the model may not be sufficiently accurate to support deep search. In contrast, Muesli performed well even with these smaller models (3b).
Since we know from the literature that MCTS can be very effective in combination with larger models, in Figure 1b we reran Muesli with a much larger policy network and model, similar to that of MuZero. In this setting, Muesli matched the published performance of MuZero (the current state of the art on Atari in the 200M frames regime). Notably, Muesli achieved this without relying on deep search: it still sampled actions from the fast policy network and used one-step lookaheads in the policy update. We note that the resulting median score matches MuZero's and is substantially higher than that of all other published agents; see Table 2 to compare the final performance of Muesli to other baselines.
Next, we evaluated Muesli on learning 9x9 Go from self-play. This requires handling non-stationarity and a combinatorial action space. It is also a domain where deep search (e.g. MCTS) has historically been critical to reach non-trivial performance. In Figure 8a we show that Muesli (in blue) still outperformed the strongest baselines from Figure 1a, as well as CMPO on its own (in yellow). All policies were evaluated against Pachi (Baudiš & Gailly, 2011). Muesli reached a 75% win rate against Pachi: to the best of our knowledge, this is the first system to do so from self-play alone, without deep search. In the appendix we report even stronger win rates against GnuGo (Bump et al., 2005).
In Figure 8b, we compare Muesli to MCTS on Go; here, Muesli's performance (in blue) fell short of that of the MCTS baseline (in purple), suggesting there is still value in using deep search for acting in some domains. This is demonstrated also by another Muesli variant that uses deep search at evaluation only: this Muesli/MCTS[Eval] hybrid (in light blue) recovered part of the gap with the MCTS baseline, without slowing down training. For reference, the pink vertical line depicts published MuZero, with its even greater data efficiency thanks to more simulations, a different network, more replay, and early resignation.
Finally, we tested the same agents on MuJoCo environments in OpenAI Gym (Brockman et al., 2016), to assess whether Muesli can be effective on continuous domains and on smaller data budgets (2M frames). Muesli performed competitively. We refer readers to Figure 12, in the appendix, for the results.
Table 2. Median human-normalized score across 57 Atari games.

Agent | Median
DQN (Mnih et al., 2015) | 79%
DreamerV2 (Hafner et al., 2020) | 164%
IMPALA (Espeholt et al., 2018) | 192%
Rainbow (Hessel et al., 2018) | 231%
Meta-gradient (Xu et al., 2018) | 287%
STAC (Zahavy et al., 2020) | 364%
LASER (Schmitt et al., 2020) | 431%
MuZero Reanalyse (Schrittwieser et al., 2021) | 1,047 ± 40%
Muesli | 1,041 ± 40%

6 Conclusion
Starting from our desiderata for general policy optimization, we proposed an update (Muesli) that combines policy gradients with Maximum a Posteriori Policy Optimization (MPO) and model-based action values. We empirically evaluated the contributions of each design choice in Muesli, and compared the proposed update to related ideas from the literature. Muesli demonstrated state-of-the-art performance on Atari (matching MuZero's most recent results), without the need for deep search. Muesli even outperformed MCTS-based agents when evaluated in a regime of smaller networks and/or reduced computational budgets. Finally, we found that Muesli could be applied out of the box to self-play 9x9 Go and to continuous control problems, showing the generality of the update (although further research is needed to truly push the state of the art in these domains). We hope that our findings will motivate further research in the rich space of algorithms at the intersection of policy gradient methods, regularized policy optimization and planning.
Acknowledgements
We would like to thank Manuel Kroiss and Iurii Kemaev for developing the research platform we use to run and distribute experiments at scale. We also thank Dan Horgan, Alaa Saade, Nat McAleese and Charlie Beattie for their excellent help with reinforcement learning environments. Joseph Modayil improved the paper with wise comments and advice. Coding was made fun by the amazing JAX library (Bradbury et al., 2018) and the ecosystem around it (in particular the optimization library Optax, the neural network library Haiku, and the reinforcement learning library RLax). We thank the MuZero team at DeepMind for inspiring us.

References
 Abdolmaleki et al. (2018) Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a Posteriori Policy Optimisation. In International Conference on Learning Representations, 2018.
 Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. arXiv eprints, art. arXiv:1908.00261, August 2019.
 Agarwal et al. (2020) Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement Learning: Theory and Algorithms, 2020. URL https://rltheorybook.github.io.
 Åström (1965) Åström, K. Optimal Control of Markov Processes with Incomplete State Information I. Journal of Mathematical Analysis and Applications, 10:174–205, 1965. ISSN 0022247X.

 Baudiš & Gailly (2011) Baudiš, P. and Gailly, J.-l. Pachi: State of the art open source Go program. In Advances in Computer Games, pp. 24–38. Springer, 2011.
 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. JAIR, 2013.
 Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., WandermanMilne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv eprints, art. arXiv:1606.01540, June 2016.
 Bump et al. (2005) Bump et al. Gnugo, 2005. URL http://www.gnu.org/software/gnugo/gnugo.html.
 Byravan et al. (2020) Byravan, A., Springenberg, J. T., Abdolmaleki, A., Hafner, R., Neunert, M., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Imagined value gradients: Modelbased policy optimization with tranferable latent dynamics models. In Conference on Robot Learning, pp. 566–589. PMLR, 2020.
 Cobbe et al. (2020) Cobbe, K., Hilton, J., Klimov, O., and Schulman, J. Phasic Policy Gradient. arXiv eprints, art. arXiv:2009.04416, September 2020.
 Coulom (2006) Coulom, R. Efficient Selectivity and Backup Operators in MonteCarlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, CG’06, pp. 72–83, Berlin, Heidelberg, 2006. SpringerVerlag. ISBN 3540755373.

 Dabney et al. (2018) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit Quantile Networks for Distributional Reinforcement Learning. In International Conference on Machine Learning, pp. 1096–1105, 2018.
 Degris & Modayil (2012) Degris, T. and Modayil, J. Scaling-up Knowledge for a Cognizant Robot. In AAAI Spring Symposium: Designing Intelligent Robots, 2012.
 Degris et al. (2012) Degris, T., Pilarski, P. M., and Sutton, R. S. Modelfree reinforcement learning with continuous action in practice. In 2012 American Control Conference (ACC), pp. 2177–2182, 2012.
 Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. IMPALA: Scalable Distributed DeepRL with Importance Weighted ActorLearner Architectures. In International Conference on Machine Learning, 2018.
 Farquhar et al. (2018) Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable treestructured models for deep reinforcement learning. In International Conference on Learning Representations, volume 6. ICLR, 2018.
 Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actorcritic methods. In International Conference on Machine Learning, pp. 1582–1591, 2018.
 Gregor et al. (2019) Gregor, K., Jimenez Rezende, D., Besse, F., Wu, Y., Merzic, H., and van den Oord, A. Shaping belief states with generative environment models for rl. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 Grill et al. (2020) Grill, J.B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. MCTS as regularized policy optimization. In International Conference on Machine Learning, 2020.
 Grimm et al. (2020) Grimm, C., Barreto, A., Singh, S., and Silver, D. The value equivalence principle for modelbased reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.
 Gruslys et al. (2018) Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. The Reactor: A fast and sampleefficient ActorCritic agent for Reinforcement Learning. In International Conference on Learning Representations, 2018.
 Guez et al. (2019) Guez, A., Mirza, M., Gregor, K., Kabra, R., Racanière, S., Weber, T., Raposo, D., Santoro, A., Orseau, L., Eccles, T., et al. An investigation of modelfree planning. arXiv preprint arXiv:1901.03559, 2019.
 Guez et al. (2020) Guez, A., Viola, F., Weber, T., Buesing, L., Kapturowski, S., Precup, D., Silver, D., and Heess, N. Valuedriven hindsight modelling. In Advances in Neural Information Processing Systems, 2020.
 Guo et al. (2018) Guo, Z. D., Gheshlaghi Azar, M., Piot, B., Pires, B. A., and Munos, R. Neural Predictive Belief Representations. arXiv eprints, art. arXiv:1811.06407, November 2018.
 Guo et al. (2020) Guo, Z. D., Pires, B. A., Piot, B., Grill, J.B., Altché, F., Munos, R., and Azar, M. G. Bootstrap latentpredictive representations for multitask reinforcement learning. In International Conference on Machine Learning, pp. 3875–3886. PMLR, 2020.
 Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, volume 31, pp. 2450–2462. Curran Associates, Inc., 2018.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft ActorCritic Algorithms and Applications. arXiv eprints, art. arXiv:1812.05905, December 2018.
 Hafner et al. (2020) Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering Atari with Discrete World Models. arXiv eprints, art. arXiv:2010.02193, October 2020.

 Hamrick (2019) Hamrick, J. B. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8–16, 2019.
 Hamrick et al. (2020a) Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Pfaff, T., Weber, T., Buesing, L., and Battaglia, P. W. Combining Q-learning and search with amortized value estimates. In International Conference on Learning Representations, 2020a.
 Hamrick et al. (2020b) Hamrick, J. B., Friesen, A. L., Behbahani, F., Guez, A., Viola, F., Witherspoon, S., Anthony, T., Buesing, L., Veličković, P., and Weber, T. On the role of planning in modelbased deep reinforcement learning. arXiv preprint arXiv:2011.04021, 2020b.

 Hessel et al. (2018) Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence, 2018.
 Hessel et al. (2019) Hessel, M., van Hasselt, H., Modayil, J., and Silver, D. On Inductive Biases in Deep Reinforcement Learning. arXiv eprints, art. arXiv:1907.02908, July 2019.
 Hessel et al. (2021) Hessel, M., Kroiss, M., Clark, A., Kemaev, I., Quan, J., Keck, T., Viola, F., and van Hasselt, H. Podracer architectures for scalable reinforcement learning. arXiv eprints, April 2021.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Hubert et al. (2021) Hubert, T., Schrittwieser, J., Antonoglou, I., Barekatain, M., Schmitt, S., and Silver, D. Learning and Planning in Complex Action Spaces. arXiv eprints, April 2021.
 Jaakkola et al. (1994) Jaakkola, T., Singh, S. P., and Jordan, M. I. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 345–352, Cambridge, MA, USA, 1994. MIT Press.
 Jaderberg et al. (2017) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
 Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. When to Trust Your Model: ModelBased Policy Optimization. arXiv eprints, art. arXiv:1906.08253, June 2019.
 Kaelbling & LozanoPérez (2010) Kaelbling, L. P. and LozanoPérez, T. Hierarchical task and motion planning in the now. In Proceedings of the 1st AAAI Conference on Bridging the Gap Between Task and Motion Planning, AAAIWS’1001, pp. 33–42. AAAI Press, 2010.
 Kaiser et al. (2019) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., and Michalewski, H. ModelBased Reinforcement Learning for Atari. arXiv eprints, art. arXiv:1903.00374, March 2019.
 Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pp. 267–274, 2002.
 Kakade (2001) Kakade, S. M. A natural policy gradient. Advances in Neural Information Processing Systems, 14:1531–1538, 2001.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv eprints, art. arXiv:1412.6980, December 2014.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. AutoEncoding Variational Bayes. arXiv eprints, art. arXiv:1312.6114, December 2013.
 Lanctot et al. (2019) Lanctot, M., Lockhart, E., Lespiau, J.B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Ewalds, T., Faulkner, R., Kramár, J., De Vylder, B., Saeta, B., Bradbury, J., Ding, D., Borgeaud, S., Lai, M., Schrittwieser, J., Anthony, T., Hughes, E., Danihelka, I., and RyanDavis, J. OpenSpiel: A Framework for Reinforcement Learning in Games. arXiv eprints, art. arXiv:1908.09453, August 2019.
 Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv eprints, art. arXiv:2005.01643, May 2020.
 Lin (1992) Lin, L.J. Selfimproving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8(3–4):293–321, May 1992. ISSN 08856125.
 Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. arXiv eprints, art. arXiv:1711.05101, November 2017.
 Machado et al. (2018) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Humanlevel control through deep reinforcement learning. Nature, 2015.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Moerland et al. (2020) Moerland, T. M., Broekens, J., and Jonker, C. M. Modelbased Reinforcement Learning: A Survey. arXiv eprints, art. arXiv:2006.16712, June 2020.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Actionconditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2845–2853. Curran Associates, Inc., 2015.
 Oh et al. (2017) Oh, J., Singh, S., and Lee, H. Value prediction network. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
 Papamakarios et al. (2019) Papamakarios, G., Nalisnick, E., Jimenez Rezende, D., Mohamed, S., and Lakshminarayanan, B. Normalizing Flows for Probabilistic Modeling and Inference. arXiv eprints, art. arXiv:1912.02762, December 2019.
 Pascanu et al. (2017) Pascanu, R., Li, Y., Vinyals, O., Heess, N., Buesing, L., Racanière, S., Reichert, D., Weber, T., Wierstra, D., and Battaglia, P. Learning modelbased planning from scratch. arXiv eprints, art. arXiv:1707.06170, July 2017.
 Pohlen et al. (2018) Pohlen, T., Piot, B., Hester, T., Gheshlaghi Azar, M., Horgan, D., Budden, D., BarthMaron, G., van Hasselt, H., Quan, J., Večerík, M., Hessel, M., Munos, R., and Pietquin, O. Observe and Look Further: Achieving Consistent Performance on Atari. arXiv eprints, art. arXiv:1805.11593, May 2018.
 Racanière et al. (2017) Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Jimenez Rezende, D., Puigdomènech Badia, A., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D., and Wierstra, D. Imaginationaugmented agents for deep reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286, Beijing, China, 22–24 Jun 2014. PMLR.
 Rezende et al. (2020) Rezende, D. J., Danihelka, I., Papamakarios, G., Ke, N. R., Jiang, R., Weber, T., Gregor, K., Merzic, H., Viola, F., Wang, J., Mitrovic, J., Besse, F., Antonoglou, I., and Buesing, L. Causally Correct Partial Models for Reinforcement Learning. arXiv eprints, art. arXiv:2002.02836, February 2020.

 Richalet et al. (1978) Richalet, J., Rault, A., Testud, J. L., and Papon, J. Model predictive heuristic control. Automatica, 14(5):413–428, September 1978. ISSN 00051098.
 Riedmiller (2005) Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning, ECML’05, pp. 317–328, Berlin, Heidelberg, 2005. Springer-Verlag. ISBN 3540292438.
 Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. Online Qlearning using connectionist systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England, 1994.
 Schmidhuber (1990) Schmidhuber, J. An online algorithm for dynamic reinforcement learning and planning in reactive environments. In In Proc. IEEE/INNS International Joint Conference on Neural Networks, pp. 253–258. IEEE Press, 1990.
 Schmitt et al. (2020) Schmitt, S., Hessel, M., and Simonyan, K. OffPolicy ActorCritic with Shared Experience Replay. In International Conference on Machine Learning, volume 119, pp. 8545–8554, Virtual, 13–18 Jul 2020. PMLR.
 Schrittwieser et al. (2020) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, Dec 2020. ISSN 14764687.
 Schrittwieser et al. (2021) Schrittwieser, J., Hubert, T., Mandhane, A., Barekatain, M., Antonoglou, I., and Silver, D. Online and offline reinforcement learning by planning with a learned model. arXiv eprints, April 2021.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust Region Policy Optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv eprints, art. arXiv:1707.06347, July 2017.
 Silver & Veness (2010) Silver, D. and Veness, J. MonteCarlo Planning in Large POMDPs. In Advances in Neural Information Processing Systems, volume 23, pp. 2164–2172. Curran Associates, Inc., 2010.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. JMLR.org, 2014.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
 Silver et al. (2017) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., DulacArnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: Endtoend learning and planning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3191–3199. PMLR, 06–11 Aug 2017.
 Singh et al. (1994) Singh, S. P., Jaakkola, T., and Jordan, M. I. Learning without stateestimation in partially observable Markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Elsevier, 1994.
 Springenberg et al. (2020) Springenberg, J. T., Heess, N., Mankowitz, D., Merel, J., Byravan, A., Abdolmaleki, A., Kay, J., Degrave, J., Schrittwieser, J., Tassa, Y., Buchli, J., Belov, D., and Riedmiller, M. Local Search for Policy Iteration in Continuous Control. arXiv eprints, art. arXiv:2010.05545, October 2020.
 Srinivas et al. (2018) Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pp. 4732–4741. PMLR, 2018.
 Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 1988.
 Sutton (1990) Sutton, R. S. Integrated architectures for learning, planning and reacting based on dynamic programming. In Machine Learning: Proceedings of the Seventh International Workshop, 1990.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
 Sutton et al. (2011) Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable realtime architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent SystemsVolume 2, pp. 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
 Tamar et al. (2016) Tamar, A., WU, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
 Tesauro (1995) Tesauro, G. Temporal difference learning and TDGammon. Communications of the ACM, 38(3):58–68, March 1995.
 Tomar et al. (2020) Tomar, M., Shani, L., Efroni, Y., and Ghavamzadeh, M. Mirror Descent Policy Optimization. arXiv eprints, art. arXiv:2005.09814, May 2020.
 van Hasselt & Sutton (2015) van Hasselt, H. and Sutton, R. S. Learning to Predict Independent of Span. arXiv eprints, art. arXiv:1508.04582, August 2015.
 van Hasselt & Wiering (2007) van Hasselt, H. and Wiering, M. A. Reinforcement learning in continuous action spaces. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 272–279. IEEE, 2007.
 van Hasselt et al. (2019) van Hasselt, H., Quan, J., Hessel, M., Xu, Z., Borsa, D., and Barreto, A. General nonlinear Bellman equations. arXiv eprints, art. arXiv:1907.03687, July 2019.
 van Hasselt et al. (2016) van Hasselt, H. P., Guez, A., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

 van Hasselt et al. (2019) van Hasselt, H. P., Hessel, M., and Aslanides, J. When to use parametric models in reinforcement learning? In Advances in Neural Information Processing Systems, volume 32, pp. 14322–14333. Curran Associates, Inc., 2019.
 Vieillard et al. (2020) Vieillard, N., Kozuno, T., Scherrer, B., Pietquin, O., Munos, R., and Geist, M. Leverage the Average: an Analysis of KL Regularization in RL. In Advances in Neural Information Processing Systems, 2020.
 Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample Efficient ActorCritic with Experience Replay. arXiv eprints, art. arXiv:1611.01224, November 2016.
 Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, England, 1989.
 Werbos (1987) Werbos, P. J. Learning how the world works: Specifications for predictive networks in robots and brains. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, N.Y., 1987.
 Williams (1992) Williams, R. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, May 1992.
 Williams & Peng (1991) Williams, R. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3:241–, 09 1991.
 Xu et al. (2018) Xu, Z., van Hasselt, H., and Silver, D. Metagradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2402–2413, 2018.
 Zahavy et al. (2020) Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H., Silver, D., and Singh, S. A selftuning actorcritic algorithm, 2020.
 Zhang & Yao (2019) Zhang, S. and Yao, H. ACE: An actor ensemble algorithm for continuous control with tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5789–5796, 2019.
Appendix A Stochastic estimation details
In the policy-gradient term in Eq. 10, we clip the importance weight to a bounded interval. The importance weight clipping introduces a bias. To correct for it, we use leave-one-out (LOO) action-dependent baselines (Gruslys et al., 2018).
Although the LOO action-dependent baselines were not significant in the Muesli results, LOO was helpful for the policy gradients with the TRPO penalty (Figure 16).
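As a concrete illustration, a policy-gradient surrogate with clipped importance weights can be sketched as follows. This is a minimal sketch, not the paper's implementation: the clip interval `[0, clip_max]`, the function name, and the exact loss form are illustrative assumptions, and the LOO bias correction is omitted.

```python
import numpy as np

def clipped_pg_loss(logits, behaviour_logits, actions, advantages, clip_max=1.0):
    """Policy-gradient surrogate with clipped importance weights (a sketch).

    The clip interval [0, clip_max] is an illustrative assumption; the LOO
    action-dependent bias correction from Gruslys et al. (2018) is omitted.
    """
    def softmax(x):
        z = np.exp(x - x.max(-1, keepdims=True))
        return z / z.sum(-1, keepdims=True)

    pi, mu = softmax(logits), softmax(behaviour_logits)
    idx = np.arange(len(actions))
    # Clipped importance weight rho = min(pi/mu, clip_max): clipping bounds
    # the variance of the estimator but introduces a bias.
    rho = np.minimum(pi[idx, actions] / mu[idx, actions], clip_max)
    # REINFORCE-style surrogate: rho and the advantages are treated as
    # constants; the gradient flows only through log pi.
    return -(rho * advantages * np.log(pi[idx, actions])).mean()
```

The bias introduced by the clipping grows with the mismatch between the target and behaviour policies, which is what an action-dependent baseline can help correct.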
Appendix B The illustrative MDP example
Here we will analyze the values and the optimal policy for the MDP from Figure 9, when using the identical state representation $\phi$ in all states. With the shared state representation, the policy is restricted to be the same in all states. Let us denote the probability of the first action by $p$.
Given the policy, the following are the values of the different states:
(15)  
(16)  
(17)  
(18) 
Finding the optimal policy. Our objective is to maximize the value of the initial state, viewed as a function of $p$. We can find the maximum by looking at the derivatives. The derivative of the initial state value with respect to the policy parameter $p$ is:
(19) 
The second derivative is negative, so the maximum is at the point where the first derivative is zero; setting the first derivative to zero gives the optimal $p$.
Finding the action values of the optimal policy. We will now find $Q(\phi, a_1)$ and $Q(\phi, a_2)$. The Q-value $Q(\phi, a)$ is defined as the expected return after observing $\phi$, when taking the action $a$ (Singh et al., 1994):
(20) $Q(\phi, a) = \sum_s \Pr(s \mid \phi)\, Q(s, a)$
where $\Pr(s \mid \phi)$ is the probability of being in the state $s$ when observing $\phi$.
In our example, the Q-values are:
(21)  
(22)  
(23)  
(24) 
We can now substitute the optimal $p$ into the Q-values to find $Q(\phi, a_1)$ and $Q(\phi, a_2)$:
(25)  
(26) 
We see that these Q-values are the same and uninformative about the probabilities of the optimal (memoryless) stochastic policy. This generalizes to all environments: the optimal policy gives zero probability to all actions with lower Q-values. If the optimal policy at a given state representation gives non-zero probabilities to some actions, these actions must have the same Q-values.
Bootstrapping from $V(\phi)$ would be worse. We will now find $V(\phi)$, the value of the shared state representation, and show that bootstrapping from it would be misleading. In our example, $V(\phi)$ is:
(27)  
(28) 
We can notice that $V(\phi)$ is different from $Q(\phi, a_1)$ and $Q(\phi, a_2)$. Estimating the Q-values by bootstrapping from $V(\phi)$ instead of the true state values would be misleading. Here, it is better to estimate the Q-values based on Monte-Carlo returns.
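The Figure 9 MDP itself is not reproduced here, but the same phenomenon — an optimal memoryless policy that is strictly stochastic under state aliasing — can be checked numerically on a different standard example, the short-corridor gridworld of Sutton & Barto (2018, Example 13.1); that corridor layout is the textbook example, not this paper's MDP:

```python
import numpy as np

def expected_steps(p):
    """Expected steps to the goal in the short-corridor gridworld when the
    memoryless policy picks `right` with probability p in every state.

    States 0, 1, 2 share one representation; state 3 is the goal.
    In state 1 the action effects are reversed; with reward -1 per step,
    minimizing expected steps maximizes the return.
    """
    # E0 = 1 + p*E1 + (1-p)*E0   (right: 0 -> 1, left: stay at 0)
    # E1 = 1 + p*E0 + (1-p)*E2   (reversed: right: 1 -> 0, left: 1 -> 2)
    # E2 = 1 + (1-p)*E1          (right: 2 -> goal, left: 2 -> 1)
    A = np.array([[p, -p, 0.0],
                  [-p, 1.0, -(1.0 - p)],
                  [0.0, -(1.0 - p), 1.0]])
    return np.linalg.solve(A, np.ones(3))[0]

# Every deterministic memoryless policy is bad; the best memoryless policy
# is stochastic, with p = 2 - sqrt(2) ~= 0.586.
ps = np.linspace(0.05, 0.95, 181)
best_p = min(ps, key=expected_steps)
```

Consistent with the claim above, because the optimum is interior, both actions must have equal aliased Q-values $Q(\phi, a)$ at the optimal memoryless policy.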
Appendix C The motivation behind Conservative Policy Iteration and TRPO
In this section we will show that unregularized maximization of $\hat{J}(\mu, \pi)$ on data from an older policy $\mu$ can produce a policy worse than $\mu$. The size of the possible degradation will be related to the total variation distance between $\pi$ and $\mu$. The explanation is based on the proofs from the excellent book by Agarwal et al. (2020).
As before, our objective is to maximize the expected value of the states from an initial state distribution $d_0$:
(29) $J(\pi) = \mathbb{E}_{s_0 \sim d_0}\left[V^{\pi}(s_0)\right]$
It will be helpful to define the discounted state visitation distribution as:
(30) $d^{\pi}_{d_0}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0 \sim d_0, \pi)$
where $\Pr(s_t = s \mid s_0 \sim d_0, \pi)$ is the probability of being in the state $s$ at time $t$, if starting the episode from $s_0 \sim d_0$ and following the policy $\pi$. The scaling by $(1 - \gamma)$ ensures that $d^{\pi}_{d_0}$ sums to one.
From the policy gradient theorem (Sutton et al., 2000) we know that the gradient of $J(\pi_\theta)$ with respect to the policy parameters is
(31) $\nabla_\theta J(\pi_\theta) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}_{d_0}}\left[\sum_a \nabla_\theta \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a)\right]$
In practice, we often train on data from an older policy $\mu$. Training on such data maximizes a different function:
(32) $\hat{J}(\mu, \pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\mu}_{d_0}}\left[\sum_a \pi(a|s)\, A^{\mu}(s, a)\right]$
where $A^{\mu}$ is an advantage. Notice that the states are sampled from $d^{\mu}_{d_0}$ and the policy is criticized by the advantages of the older policy $\mu$. This often happens in practice, when updating the policy multiple times in an episode, using a replay buffer, or bootstrapping from a network trained on past data.
While maximization of $\hat{J}(\mu, \pi)$ is more practical, we will see that its unregularized maximization does not guarantee an improvement in our objective $J(\pi)$. The difference $J(\pi) - J(\mu)$ can even be negative, if we are not careful.
Kakade & Langford (2002) stated a useful lemma for the performance difference:
Lemma C.1 (The performance difference lemma)
For all policies $\pi$ and $\mu$:
(33) $J(\pi) - J(\mu) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi}_{d_0}}\left[\sum_a \pi(a|s)\, A^{\mu}(s, a)\right]$
We would like the performance difference $J(\pi) - J(\mu)$ to be positive. We can express the performance difference as $\hat{J}(\mu, \pi)$ plus an extra term:
(34) $J(\pi) - J(\mu) = \frac{1}{1 - \gamma} \sum_s d^{\pi}_{d_0}(s) \sum_a \pi(a|s)\, A^{\mu}(s, a)$
(35) $= \frac{1}{1 - \gamma} \sum_s d^{\mu}_{d_0}(s) \sum_a \pi(a|s)\, A^{\mu}(s, a) + \frac{1}{1 - \gamma} \sum_s \left(d^{\pi}_{d_0}(s) - d^{\mu}_{d_0}(s)\right) \sum_a \pi(a|s)\, A^{\mu}(s, a)$
(36) $= \hat{J}(\mu, \pi) + \frac{1}{1 - \gamma} \sum_s \left(d^{\pi}_{d_0}(s) - d^{\mu}_{d_0}(s)\right) \sum_a \pi(a|s)\, A^{\mu}(s, a)$
To get a positive performance difference, it is not enough to maximize $\hat{J}(\mu, \pi)$. We also need to make sure that the second term in (36) will not degrade the performance. The impact of the second term can be kept small by keeping the total variation distance between $\pi$ and $\mu$ small.
For example, the performance can degrade if $\pi$ is not trained at a state and that state then gets a higher visitation probability. The performance can also degrade if a stochastic policy is needed while the advantages are those of an older policy: $\pi$ would become deterministic if maximizing $\hat{J}(\mu, \pi)$ without any regularization.
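The performance difference lemma can be verified numerically on a small random MDP. This is a self-contained sketch; the MDP, seed, and variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A_n, gamma = 4, 3, 0.9
P = rng.random((S, A_n, S)); P /= P.sum(-1, keepdims=True)  # P[s, a, s']
r = rng.random((S, A_n))                                    # r(s, a)
d0 = np.full(S, 1.0 / S)                                    # initial distribution

def random_policy():
    pi = rng.random((S, A_n))
    return pi / pi.sum(-1, keepdims=True)

def state_values(pi):
    # V^pi = (I - gamma * P_pi)^{-1} r_pi
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(-1))

def visitation(pi):
    """Discounted state visitation d^pi_{d0}; sums to one."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

mu, pi = random_policy(), random_policy()
V_mu = state_values(mu)
adv_mu = r + gamma * P @ V_mu - V_mu[:, None]   # A^mu(s, a)
lhs = d0 @ state_values(pi) - d0 @ V_mu         # J(pi) - J(mu)
# Lemma: J(pi) - J(mu) = E_{s ~ d^pi}[sum_a pi(a|s) A^mu(s, a)] / (1 - gamma)
rhs = visitation(pi) @ (pi * adv_mu).sum(-1) / (1 - gamma)
```

The two sides agree to numerical precision, for any pair of random policies.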
C.1 Performance difference lower bound
We will express a lower bound on the performance difference as a function of the total variation distance between $\pi$ and $\mu$. Starting from Eq. 36, we can derive the TRPO lower bound for the performance difference. Let $\alpha$ be the maximum total variation distance between $\pi$ and $\mu$:
(37) $\alpha = \max_s D_{\mathrm{TV}}\!\left(\pi(\cdot|s), \mu(\cdot|s)\right)$
The performance difference is then bounded (see Agarwal et al., 2020, Similar policies imply similar state visitations):
(38) $J(\pi) - J(\mu) \geq \hat{J}(\mu, \pi) - \frac{4 \gamma \epsilon \alpha^2}{(1 - \gamma)^2}, \qquad \text{where } \epsilon = \max_{s,a} \left|A^{\mu}(s, a)\right|$
Appendix D Proof of Maximum CMPO total variation distance
We will prove the following theorem: for any clipping threshold $c > 0$, we have:
(39) $\max_{\hat{A}, \pi_{\mathrm{prior}}} D_{\mathrm{TV}}\!\left(\pi_{\mathrm{CMPO}}, \pi_{\mathrm{prior}}\right) = \tanh(c/2)$
Having 2 actions. We will first prove the theorem for a policy with 2 actions. To maximize the distance, the clipped advantages will be $c$ and $-c$. Let us denote the prior probabilities associated with these advantages by $p$ and $1 - p$, respectively.
The total variation distance is then:
(40) $D_{\mathrm{TV}}\!\left(\pi_{\mathrm{CMPO}}, \pi_{\mathrm{prior}}\right) = \frac{p\, e^{c}}{p\, e^{c} + (1 - p)\, e^{-c}} - p$
We will maximize the distance with respect to the parameter $p$.
The first derivative with respect to $p$ is:
(41) $\frac{\partial D_{\mathrm{TV}}}{\partial p} = \frac{1}{\left(p\, e^{c} + (1 - p)\, e^{-c}\right)^2} - 1$
The second derivative with respect to $p$ is:
(42) $\frac{\partial^2 D_{\mathrm{TV}}}{\partial p^2} = -\frac{2\left(e^{c} - e^{-c}\right)}{\left(p\, e^{c} + (1 - p)\, e^{-c}\right)^3}$
Because the second derivative is negative, the distance is a concave function of $p$. We will find the maximum at the point where the first derivative is zero. The solution is:
(43) $p^{*} = \frac{1 - e^{-c}}{e^{c} - e^{-c}}$
At the found point $p^{*}$, the maximum total variation distance is:
(44) $\max_{p} D_{\mathrm{TV}} = \frac{e^{c} + e^{-c} - 2}{e^{c} - e^{-c}} = \tanh\!\left(\frac{c}{2}\right)$
This completes the proof when having 2 actions.
Having any number of actions. We will now prove the theorem for a policy with any number of actions. To maximize the distance, the clipped advantages will be $c$ or $-c$. Let us denote the sums of the prior probabilities associated with these advantages by $p$ and $1 - p$, respectively.
The total variation distance is again:
(45) $D_{\mathrm{TV}}\!\left(\pi_{\mathrm{CMPO}}, \pi_{\mathrm{prior}}\right) = \frac{p\, e^{c}}{p\, e^{c} + (1 - p)\, e^{-c}} - p$
and the maximum distance is again $\tanh(c/2)$.
We also verified the theorem's predictions experimentally, by using gradient ascent to maximize the total variation distance.
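Following the two-action setup above (prior probability p on the action with clipped advantage +c), the maximum total variation distance can also be checked by a simple grid search over p (a sketch using grid search rather than gradient ascent):

```python
import numpy as np

def max_cmpo_tv(c, n=100_001):
    """Maximum total variation distance between the CMPO policy and a
    2-action prior, maximized over the prior probability p of the action
    whose clipped advantage is +c."""
    p = np.linspace(1e-6, 1.0 - 1e-6, n)
    # CMPO reweights the prior by exp(+c) and exp(-c) and renormalizes.
    p_cmpo = p * np.exp(c) / (p * np.exp(c) + (1.0 - p) * np.exp(-c))
    return float(np.max(p_cmpo - p))

# The numerical maximum matches the closed form tanh(c/2).
for c in (0.5, 1.0, 2.0):
    assert abs(max_cmpo_tv(c) - np.tanh(c / 2)) < 1e-4
```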
Appendix E Extended related work
We used the desiderata to motivate the design of the policy update. We will now use them again to discuss related methods for satisfying each desideratum. For a comprehensive overview of model-based reinforcement learning, we recommend the surveys by Moerland et al. (2020) and Hamrick (2019).
E.1 Observability and function approximation
1a) Support learning stochastic policies. The ability to learn a stochastic policy is one of the benefits of policy gradient methods.
1b) Leverage Monte-Carlo targets. Muesli uses multi-step returns to train the policy network and the Q-values. MPO and MuZero instead need to train the Q-values first, before using them to train the policy.
E.2 Policy representation
2a) Support learning the optimal memoryless policy. Muesli represents the stochastic policy by the learned policy network. In principle, acting can be based on a combination of the policy network and the Q-values; for example, one possibility is to act with the CMPO policy. ACER (Wang et al., 2016) used a similar form of acting. Although we have not seen benefits from this form of acting on Atari (Figure 15), we have seen better results on Go with a deeper search at the evaluation time.
2b) Scale to (large) discrete action spaces. Muesli supports large action spaces, because the policy loss can be estimated by sampling. MCTS is less suitable for large action spaces. This was addressed by Grill et al. (2020), who brilliantly revealed MCTS as regularized policy optimization and designed a tree search based on MPO or a different regularized policy optimization; the resulting tree search was less affected by a small number of simulations. Muesli is based on this view of regularized policy optimization as an alternative to MCTS. In another approach, MuZero was recently extended to support sampled actions and continuous actions (Hubert et al., 2021).
2c) Scale to continuous action spaces. Although we used the same estimator of the policy loss for discrete and continuous actions, it would be possible to exploit the structure of a continuous policy. For example, the continuous policy can be represented by a normalizing flow (Papamakarios et al., 2019) to model the joint distribution of the multi-dimensional actions. A continuous policy would also allow estimating the gradient of the policy regularizer with the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014). Soft Actor-Critic (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) achieved great results on the MuJoCo tasks by obtaining the gradient with respect to the action from an ensemble of approximate Q-functions. Such an ensemble of Q-functions would probably improve Muesli results as well.
E.3 Robust learning
3a) Support off-policy and historical data. Muesli supports off-policy data thanks to the regularized policy optimization, Retrace (Munos et al., 2016), and policy gradients with clipped importance weights (Gruslys et al., 2018). Many other methods deal with off-policy or offline data (Levine et al., 2020). Recently, MuZero Reanalyse (Schrittwieser et al., 2021) achieved state-of-the-art results on an offline RL benchmark by training only on the offline data.
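A minimal sketch of Retrace-style return targets with clipped importance weights (c_t = λ min(1, ρ_t)); the function and variable names are illustrative, not the agent's actual implementation:

```python
import numpy as np

def retrace_targets(q, v, rewards, rhos, gamma=0.99, lam=1.0):
    """Retrace return targets for a single trajectory.

    q:       Q(s_t, a_t) for the taken actions, length T.
    v:       state values V(s_t) = E_pi[Q(s_t, .)], length T + 1 (bootstrap).
    rewards: rewards r_t, length T.
    rhos:    importance ratios pi(a_t|s_t) / mu(a_t|s_t), length T.

    The trace coefficients c_t = lam * min(1, rho_t) cut the traces after
    off-policy actions, keeping the targets safe off-policy.
    """
    T = len(rewards)
    targets = np.empty(T)
    acc = 0.0  # discounted sum of future trace-weighted TD errors
    for t in reversed(range(T)):
        td = rewards[t] + gamma * v[t + 1] - q[t]
        if t + 1 < T:
            acc = td + gamma * lam * min(1.0, rhos[t + 1]) * acc
        else:
            acc = td
        targets[t] = q[t] + acc
    return targets
```

Two sanity checks: when the data is on-policy (all ratios 1, λ = 1) and Q agrees with V, the target reduces to the n-step return; when a later action has ratio 0, its TD errors are cut from the trace entirely.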
3b) Deal gracefully with inaccuracies in the values/model. Muesli does not fully trust the Q-values from the model. Muesli combines the Q-values with the prior policy to propose a new policy with a constrained total variation distance from the prior policy. Without the regularized policy optimization, the agent can be misled by an overestimated Q-value for a rarely taken action. Soft Actor-Critic (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) mitigate the overestimation by taking the minimum from a pair of Q-networks. In model-based reinforcement learning, a one-step model unrolled for multiple steps can struggle with compounding errors (Janner et al., 2019). VPN (Oh et al., 2017) and MuZero (Schrittwieser et al., 2020) avoid compounding errors by using multi-step predictions not conditioned on previous model predictions. While VPN and MuZero avoid compounding errors, these models are not suitable for planning a sequence of actions in a stochastic environment. In a stochastic environment, the sequence of actions needs to depend on the stochastic events that occurred; otherwise the planning is confounded and can underestimate or overestimate the state value (Rezende et al., 2020). Other models conditioned on limited information from generated (latent) variables can face similar problems in stochastic environments (e.g. DreamerV2 (Hafner et al., 2020)). Muesli is suitable for stochastic environments, because Muesli uses only a one-step lookahead. When combining Muesli with a deep search, we can use an adaptive search depth or a stochastic model sufficient for causally correct planning (Rezende et al., 2020). Another class of models deals with model errors by using the model as a part of the Q-network or policy network and training the whole network end-to-end.
These networks include VIN (Tamar et al., 2016), the Predictron (Silver et al., 2017), I2A (Racanière et al., 2017), IBP (Pascanu et al., 2017), TreeQN and ATreeC (Farquhar et al., 2018) (with scores in Table 3), ACE (Zhang & Yao, 2019), UPN (Srinivas et al., 2018), and implicit planning with DRC (Guez et al., 2019).
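The overestimation mitigation used by TD3 and Soft Actor-Critic, mentioned above (taking the minimum over a pair of Q-networks), can be sketched as follows; the arrays stand in for the outputs of two Q-heads and are purely illustrative:

```python
import numpy as np

def pessimistic_q(q_estimates):
    """Clipped double-Q: take the element-wise minimum over an ensemble of
    Q-estimates, so that a single overestimated head cannot dominate the
    action selection."""
    return np.min(q_estimates, axis=0)

# Two noisy Q-heads over 4 actions; head 0 badly overestimates action 2.
q1 = np.array([1.0, 0.5, 9.0, 0.2])
q2 = np.array([1.1, 0.4, 0.6, 0.3])
q_min = pessimistic_q(np.stack([q1, q2]))
```

With the minimum, the spuriously high estimate for action 2 is discarded and action 0, on which both heads agree, is preferred.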
3c) Be robust to diverse reward scales. Muesli benefits from the normalized advantages and from the advantage clipping inside π_CMPO. PopArt (van Hasselt et al., 2016) addressed learning values across many orders of magnitude. On Atari, the scores of the games vary from 21 on Pong to 1M on Atlantis. The non-linear transformation by Pohlen et al. (2018) is practically very helpful, although biased for stochastic returns.
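The transformation from Pohlen et al. (2018) and its exact inverse can be written as follows; the value ε = 10⁻³ is the one commonly used in the literature and is an assumption here:

```python
import numpy as np

EPS = 1e-3  # commonly used value for the linear term

def h(x):
    """Invertible squashing of value/reward targets: approximately sqrt for
    large |x|, nearly linear near 0, so targets across very different
    reward scales stay in a comparable range."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def h_inv(y):
    """Exact closed-form inverse of h."""
    return np.sign(y) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(y) + 1.0 + EPS)) - 1.0)
         / (2.0 * EPS)) ** 2 - 1.0
    )

values = np.array([-1_000_000.0, -21.0, 0.0, 21.0, 1_000_000.0])
roundtrip = h_inv(h(values))
```

Note how a return of one million is squashed to under 2000, while h remains invertible, so the untransformed value can always be recovered.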
3d) Avoid problem-dependent hyperparameters. The normalized advantages were used before in PPO (Schulman et al., 2017). The maximum CMPO total variation (Theorem 4.1) helps to explain the success of such normalization: if the normalized advantages lie in [−c, c], they behave like advantages clipped to [−c, c]. Notice that regularized policy optimization with the popular entropy regularizer is equivalent to MPO with a uniform prior (because the KL to the uniform distribution is the negative entropy plus a constant). As a simple modification, we recommend replacing the uniform prior with a prior π_prior based on a target network. That leads to the model-free direct MPO with normalized advantages, outperforming vanilla policy gradients (compare Figure 13 to Figure 1a).
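PPO-style advantage normalization can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize advantages across the batch so the scale of the policy
    update does not depend on the environment's reward magnitudes."""
    return (adv - adv.mean()) / (adv.std() + eps)

rng = np.random.default_rng(1)
raw = rng.normal(loc=500.0, scale=2000.0, size=1024)  # reward-scale dependent
norm = normalize_advantages(raw)
```

Regardless of the raw reward scale, the normalized advantages have roughly zero mean and unit variance, which is what makes a single clipping threshold reusable across problems.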
E.4 Rich representation of knowledge
4a) Estimate values (variance reduction, bootstrapping). In Muesli, the learned values are helpful for bootstrapping the Retrace returns, for computing the advantages and for constructing the π_CMPO policy. Q-values can also be helpful inside a search, as demonstrated by Hamrick et al. (2020a).
4b) Learn a model (representation, composability). Multiple works demonstrated benefits from learning a model. Like VPN and MuZero, Gregor et al. (2019) learn a multi-step action-conditional model; they learn the distribution of observations instead of actions and rewards, and focus on the benefits of representation learning in model-free RL induced by model learning; see also Guo et al. (2018, 2020). Springenberg et al. (2020) study an algorithm similar to MuZero with an MPO-like learning signal on the policy (similarly to SAC and Grill et al. (2020)) and obtain strong results on MuJoCo tasks in a transfer setting. Byravan et al. (2020) use a multi-step action model to derive a learning signal for policies on continuous-valued actions, leveraging the differentiability of the model and of the policy. Kaiser et al. (2019) show how to use a model for increasing data efficiency on Atari (using an algorithm similar to Dyna (Sutton, 1990)); but see also van Hasselt et al. (2019) for the relation between parametric models and replay. Finally, Hamrick et al. (2020b) investigate drivers of performance and generalization in MuZero-like algorithms.
Table 3: Atari scores of TreeQN, ATreeC and Muesli.

          Alien  Amidar  Crazy Climber  Enduro  Frostbite  Krull  Ms. Pacman  QBert  Seaquest
TreeQN-1   2321    1030         107983     800       2254  10836        3030  15688      9302
TreeQN-2   2497    1170         104932     825        581  11035        3277  15970      8241
ATreeC-1   3448    1578         102546     678       1035   8227        4866  25159      1734
ATreeC-2   2813    1566         110712     649        281   8134        4450  25459      2176
Muesli    16218     524         143898    2344      10919  15195       19244  30937    142431
Appendix F Experimental details
F.1 Common parts
Network architecture. The large MuZero network is used only in the large-scale Atari experiments (Figure 1b) and on Go. In all other Atari and MuJoCo experiments the network architecture is based on the IMPALA architecture (Espeholt et al., 2018). Like the LASER agent (Schmitt et al., 2020), we increase the number of channels 4 times. Specifically, the numbers of channels are (64, 128, 128, 64), followed by a fully connected layer and an LSTM (Hochreiter & Schmidhuber, 1997) with 512 hidden units. This LSTM inside the IMPALA representation network is different from the second LSTM used inside the model dynamics function, described later. In the Atari experiments, the network takes as input one RGB frame. Stacking more frames would help, as evidenced in Figure 17.
Q-network and model architecture. The original IMPALA agent did not learn a Q-function. Because we train a MuZero-like model, we can estimate the Q-values by:
    q̂(s, a) = r̂(s, a) + γ v̂(s, a),    (46)
where r̂(s, a) and v̂(s, a) are the reward model and the value model, respectively. The reward model and the value model are based on the MuZero dynamics and prediction functions (Schrittwieser et al., 2020). We use a very small dynamics function, consisting of a single LSTM layer with 1024 hidden units, conditioned on the selected action (Figure 10).
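In code form, this one-step Q-estimate from Equation 46 is simply the predicted reward plus the discounted predicted value; the discount and model outputs below are illustrative:

```python
import numpy as np

GAMMA = 0.995  # discount factor; illustrative value

def model_q_values(reward_pred, value_pred, gamma=GAMMA):
    """Estimate Q(s, a) for each action from one model step: the predicted
    immediate reward plus the discounted predicted value of the model state
    reached after taking that action."""
    return reward_pred + gamma * value_pred

r_hat = np.array([0.0, 1.0, -0.5])  # reward model output, one entry per action
v_hat = np.array([2.0, 1.5, 3.0])   # value model output for each next state
q_hat = model_q_values(r_hat, v_hat)
```

Because both predictions come from a single application of the dynamics function, no multi-step unroll (and hence no compounding of model errors) is involved.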
The decomposition of q̂(s, a) into a reward model and a value model is not crucial. The Muesli agent obtained a similar score with a model of the action-values (Figure 14).