Muesli: Combining Improvements in Policy Optimization

by Matteo Hessel, et al.

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.



1 Introduction

Reinforcement learning (RL) is a general formulation for the problem of sequential decision making under uncertainty, where a learning system (the agent) must learn to maximize the cumulative rewards provided by the world it is embedded in (the environment), from experience of interacting with that environment (Sutton & Barto, 2018). An agent is said to be value-based if its behavior, i.e. its policy, is inferred (e.g. by inspection) from learned value estimates (Sutton, 1988; Watkins, 1989; Rummery & Niranjan, 1994; Tesauro, 1995). In contrast, a policy-based agent directly updates a (parametric) policy (Williams, 1992; Sutton et al., 2000) based on past experience. We may also classify as model-free the agents that update values and policies directly from experience (Sutton, 1988), and as model-based those that use (learned) models (Oh et al., 2015; van Hasselt et al., 2019) to plan either global (Sutton, 1990) or local (Richalet et al., 1978; Kaelbling & Lozano-Pérez, 2010; Silver & Veness, 2010) values and policies. Such distinctions are useful for communication, but, to master the singular goal of optimizing rewards in an environment, agents often combine ideas from more than one of these areas (Hessel et al., 2018; Silver et al., 2016; Schrittwieser et al., 2020).

Figure 1: Median human normalized score across 57 Atari games. (a) Muesli and other policy updates; all these use the same IMPALA network and a moderate amount of replay data (75%). Shades denote standard errors across 5 seeds. (b) Muesli with the larger MuZero network and the high replay fraction used by MuZero (95%), compared to the latest version of MuZero (Schrittwieser et al., 2021). These large scale runs use 2 seeds. Muesli still acts directly with the policy network and uses one-step look-aheads in updates.

In this paper, we focus on a critical part of RL, namely policy optimization. We leave a precise formulation of the problem for later, but different policy optimization algorithms can be seen as answers to the following crucial question:

given data about an agent’s interactions with the world,

and predictions in the form of value functions or models,

how should we update the agent’s policy?

We start from an analysis of the desiderata for general policy optimization. These include support for partial observability and function approximation, the ability to learn stochastic policies, robustness to diverse environments or training regimes (e.g. off-policy data), and being able to represent knowledge as value functions and models. See Section 3 for further details on our desiderata for policy optimization.

Then, we propose a policy update combining regularized policy optimization with model-based ideas so as to make progress on the dimensions highlighted in the desiderata. More specifically, we use a model inspired by MuZero (Schrittwieser et al., 2020) to estimate action values via one-step look-ahead. These action values are then plugged into a modified Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) mechanism, based on clipped normalized advantages, that is robust to scaling issues without requiring constrained optimization. The overall update, named Muesli, then combines the clipped MPO targets and policy-gradients into a direct method (Vieillard et al., 2020) for regularized policy optimization.

The majority of our experiments were performed on 57 classic Atari games from the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018), a popular benchmark for deep RL. We found that, on Atari, Muesli can match the state of the art performance of MuZero, without requiring deep search, but instead acting directly with the policy network and using one-step look-aheads in the updates. To help understand the different design choices made in Muesli, our experiments on Atari include multiple ablations of our proposed update. Additionally, to evaluate how well our method generalises to different domains, we performed experiments on a suite of continuous control environments (based on MuJoCo and sourced from the OpenAI Gym (Brockman et al., 2016)). We also conducted experiments in 9x9 Go in self-play, to evaluate our policy update in a domain traditionally dominated by search methods.

2 Background

The environment. We are interested in episodic environments with variable episode lengths (e.g. Atari games), formalized as Markov Decision Processes (MDPs) with initial state distribution $\iota$ and discount $\gamma$; ends of episodes correspond to absorbing states with no rewards.

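The objective. The agent starts at a state $S_0 \sim \iota$ from the initial state distribution. At each time step $t$, the agent takes an action $A_t \sim \pi(A_t|S_t)$ from a policy $\pi$, obtains the reward $R_{t+1}$ and transitions to the next state $S_{t+1}$. The expected sum of discounted rewards after a state-action pair is called the action-value or Q-value $q_\pi(s,a)$:

$$q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0} \gamma^t R_{t+1} \,\middle|\, \pi, S_0 = s, A_0 = a\right] \tag{1}$$

The value of a state $s$ is $v_\pi(s) = \mathbb{E}_{A\sim\pi(\cdot|s)}[q_\pi(s,A)]$ and the objective is to find a policy $\pi$ that maximizes the expected value of the states from the initial state distribution:

$$J(\pi) = \mathbb{E}_{S\sim\iota}\left[v_\pi(S)\right] \tag{2}$$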

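Policy improvement. Policy improvement is one of the fundamental building blocks of reinforcement learning algorithms. Given a policy $\pi_{\text{prior}}$ and its Q-values $q_{\pi_{\text{prior}}}(s,a)$, a policy improvement step constructs a new policy $\pi$ such that $v_\pi(s) \ge v_{\pi_{\text{prior}}}(s)$ for all $s$. For instance, a basic policy improvement step is to construct the greedy policy:

$$\arg\max_\pi \mathbb{E}_{A\sim\pi(\cdot|s)}\left[q_{\pi_{\text{prior}}}(s,A)\right] \tag{3}$$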

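Regularized policy optimization. A regularized policy optimization algorithm solves the following problem:

$$\arg\max_\pi \left(\mathbb{E}_{A\sim\pi(\cdot|s)}\left[\hat q_{\pi_{\text{prior}}}(s,A)\right] - \Omega(\pi)\right) \tag{4}$$

where $\hat q_{\pi_{\text{prior}}}(s,a)$ are approximate Q-values of a policy $\pi_{\text{prior}}$ and $\Omega(\pi) \in \mathbb{R}$ is a regularizer. For example, we may use as the regularizer the negative entropy of the policy, $\Omega(\pi) = -\lambda H[\pi]$, weighted by an entropy cost $\lambda$ (Williams & Peng, 1991). Alternatively, we may also use $\Omega(\pi) = \lambda\,\text{KL}(\pi_{\text{prior}}, \pi)$, where $\pi_{\text{prior}}$ is the previous policy, as used in TRPO (Schulman et al., 2015).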

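Following the terminology introduced by Vieillard et al. (2020), we can then solve Eq. 4 by either direct or indirect methods. If $\Omega(\pi)$ is differentiable with respect to the policy parameters, a direct method applies gradient ascent to

$$J(s, \pi) = \mathbb{E}_{A\sim\pi(\cdot|s)}\left[\hat q_{\pi_{\text{prior}}}(s,A)\right] - \Omega(\pi) \tag{5}$$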

Using the log derivative trick to sample the gradient of the expectation results in the canonical (regularized) policy gradient update (Sutton et al., 2000).

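In indirect methods, the solution of the optimization problem (4) is found exactly, or numerically, for one state and then distilled into a parametric policy. For example, Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) uses as regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi, \pi_{\text{prior}})$, for which the exact solution to the regularized problem is

$$\pi_{\text{MPO}}(a|s) = \pi_{\text{prior}}(a|s)\,\exp\left(\frac{\hat q_{\pi_{\text{prior}}}(s,a)}{\lambda}\right)\frac{1}{z(s)} \tag{6}$$

where $z(s) = \mathbb{E}_{A\sim\pi_{\text{prior}}(\cdot|s)}\left[\exp\left(\hat q_{\pi_{\text{prior}}}(s,A)/\lambda\right)\right]$ is a normalization factor that ensures that the resulting probabilities form a valid probability distribution (i.e. they sum up to 1).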
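As a minimal sketch of the indirect MPO solution for a single state with a small discrete action set, the snippet below reweights the prior by exponentiated Q-values and normalizes. It assumes the target has the form prior times exp(q / lambda); the function name `mpo_target` and the toy numbers are illustrative, not taken from the paper.

```python
import math

def mpo_target(prior, q_values, lam):
    """Closed-form target for one state: prior(a) * exp(q(a) / lam), normalized."""
    q_max = max(q_values)  # shift for numerical stability; cancels after normalization
    weights = [p * math.exp((q - q_max) / lam) for p, q in zip(prior, q_values)]
    z = sum(weights)  # normalization factor
    return [w / z for w in weights]

# Hypothetical prior and Q-values for three actions.
prior = [0.5, 0.3, 0.2]
q_values = [1.0, 0.0, -1.0]
sharp = mpo_target(prior, q_values, lam=0.1)   # small lam: close to greedy
soft = mpo_target(prior, q_values, lam=100.0)  # large lam: close to the prior
```

The two calls illustrate why the result is sensitive to the ratio between the Q-value scale and the regularization strength: the same Q-values yield a nearly greedy or a nearly unchanged policy depending on `lam`.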

MuZero. MuZero (Schrittwieser et al., 2020) uses a weakly grounded (Grimm et al., 2020) transition model trained end to end exclusively to support accurate reward, value and policy predictions. Since such a model can be unrolled to generate sequences of rewards and value estimates for different sequences of actions (or plans), it can be used to perform Monte-Carlo Tree Search, or MCTS (Coulom, 2006). MuZero then uses MCTS to construct a policy as the categorical distribution over the normalized visit counts for the actions in the root of the search tree; this policy is then used both to select actions, and as a policy target for the policy network. Despite MuZero being introduced with different motivations, Grill et al. (2020) showed that the MuZero policy update can also be interpreted as approximately solving a regularized policy optimization problem with the regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi_{\text{prior}}, \pi)$ also used by the TRPO algorithm (Schulman et al., 2015).

3 Desiderata and motivating principles

First, to motivate our investigation, we discuss a few desiderata for a general policy optimization algorithm.

3.1 Observability and function approximation

Being able to learn stochastic policies, and being able to leverage Monte-Carlo or multi-step bootstrapped return estimates is important for a policy update to be truly general.

This is motivated by the challenges of learning in partially observable environments (Åström, 1965) or, more generally, in settings where function approximation is used (Sutton & Barto, 2018). Note that these two are closely related: if a chosen function approximation ignores a state feature, then the state feature is, for all practical purposes, not observable.

In POMDPs the optimal memory-less stochastic policy can be better than any memory-less deterministic policy, as shown by Singh et al. (1994). As an illustration, consider the MDP in Figure 2; in this problem we have 4 states and, on each step, 2 actions. If the state representation is the same for all states, the optimal policy is stochastic. We can easily find such a policy with pen and paper; see Appendix B for details.

It is also known that, in these settings, it is often preferable to leverage Monte-Carlo returns, or at least multi-step bootstrapped estimators, instead of using one-step targets (Jaakkola et al., 1994). Consider again the MDP in Figure 2: bootstrapping from the value of the shared state representation produces biased estimates of the expected return, because that value aggregates the values of multiple states; again, see Appendix B for the derivation.

Among the methods in Section 2, both policy gradients and MPO allow convergence to stochastic policies, but only policy gradients naturally incorporate multi-step return estimators. In MPO, stochastic return estimates could make the agent overly optimistic, because the returns are exponentiated and, by Jensen's inequality, $\mathbb{E}[\exp(G)] \ge \exp(\mathbb{E}[G])$.

3.2 Policy representation

Policies may be constructed from action values or they may combine action values and other quantities (e.g., a direct parametrization of the policy or historical data). We argue that the action values alone are not enough.

First, we show that action values are not always enough to represent the best stochastic policy. Consider again the MDP in Figure 2 with identical state representation in all states. As discussed, the optimal policy is stochastic. This non-uniform policy cannot be inferred from Q-values, as these are the same for all actions and are thus wholly uninformative about the best probabilities. Similarly, a model on its own is also insufficient without a policy, as it would produce the same uninformative action values.

One approach to address this limitation is to parameterize the policy explicitly (e.g. via a policy network). This has the additional advantage that it allows us to directly sample both discrete (Mnih et al., 2016) and continuous (van Hasselt & Wiering, 2007; Degris et al., 2012; Silver et al., 2014) actions. In contrast, maximizing Q-values over continuous action spaces is challenging. Access to a parametric policy network that can be queried directly is also beneficial for agents that act by planning with a learned model (e.g. via MCTS), as it can guide the search in large or continuous action spaces.

Figure 2: An episodic MDP with 4 states. State 1 is the initial state. State 4 is terminal. At each step, the agent can choose between two actions. The rewards range from -1 to 1, as displayed. The discount is 1. If the state representation is the same in all states, the best policy is stochastic.

3.3 Robust learning

We seek algorithms that are robust to 1) off-policy or historical data; 2) inaccuracies in values and models; 3) diversity of environments. In the following paragraphs we discuss what each of these entails.

Reusing data from previous iterations of the policy (Lin, 1992; Riedmiller, 2005; Mnih et al., 2015) can make RL more data efficient. However, if the gradient of the objective $\mathbb{E}_{A\sim\pi(\cdot|s)}[\hat q_{\pi_{\text{prior}}}(s,A)]$ is computed on data from an older policy $\pi_{\text{prior}}$, an unregularized application of the gradient can degrade the value of $\pi$. The amount of degradation depends on the total variation distance between $\pi$ and $\pi_{\text{prior}}$, and we can use a regularizer to control it, as in Conservative Policy Iteration (Kakade & Langford, 2002), Trust Region Policy Optimization (Schulman et al., 2015), and Appendix C.

Whether we learn on or off-policy, agents' predictions incorporate errors. Regularization can also help here. For instance, if Q-values have errors, the MPO regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi, \pi_{\text{prior}})$ maintains a strong performance bound (Vieillard et al., 2020). The errors from multiple iterations average out, instead of appearing in a discounted sum of the absolute errors. While not all assumptions behind this result apply in an approximate setting, Section 5 shows that MPO-like regularizers are helpful empirically.

Finally, robustness to diverse environments is critical to ensure a policy optimization algorithm operates effectively in novel settings. This can take various forms, but we focus on robustness to diverse reward scales and minimizing problem dependent hyperparameters. The latter are an especially subtle form of inductive bias that may limit the applicability of a method to established benchmarks (Hessel et al., 2019).

Observability and function approximation
1a) Support learning stochastic policies
1b) Leverage Monte-Carlo targets
Policy representation
2a) Support learning the optimal memory-less policy
2b) Scale to (large) discrete action spaces
2c) Scale to continuous action spaces
Robust learning
3a) Support off-policy and historical data
3b) Deal gracefully with inaccuracies in the values/model
3c) Be robust to diverse reward scales
3d) Avoid problem-dependent hyperparameters
Rich representation of knowledge
4a) Estimate values (variance reduction, bootstrapping)
4b) Learn a model (representation, composability)
Table 1: A recap of the desiderata or guiding principles that we believe are important when designing general policy optimization algorithms. These are discussed in Section 3.

3.4 Rich representation of knowledge

Even if the policy is parametrized explicitly, we argue it is important for the agent to represent knowledge in multiple ways (Degris & Modayil, 2012) to update such policy in a reliable and robust way. Two classes of predictions have proven particularly useful: value functions and models.

Value functions (Sutton, 1988; Sutton et al., 2011) can capture knowledge about a cumulant over long horizons, but can be learned with a cost independent of the span of the predictions (van Hasselt & Sutton, 2015). They have been used extensively in policy optimization, e.g., to implement forms of variance reduction (Williams, 1992), and to allow updating policies online through bootstrapping, without waiting for episodes to fully resolve (Sutton et al., 2000).

Models can also be useful in various ways: 1) learning a model can act as an auxiliary task (Schmidhuber, 1990; Sutton et al., 2011; Jaderberg et al., 2017; Guez et al., 2020), and help with representation learning; 2) a learned model may be used to update policies and values via planning (Werbos, 1987; Sutton, 1990; Ha & Schmidhuber, 2018); 3) finally, the model may be used to plan for action selection (Richalet et al., 1978; Silver & Veness, 2010). These benefits of learned models are entangled in MuZero. Sometimes, it may be useful to decouple them, for instance to retain the benefits of models for representation learning and policy optimization, without depending on the computationally intensive process of planning for action selection.

4 Robust yet simple policy optimization

The full list of desiderata is presented in Table 1. These are far from solved problems, but they can be helpful to reason about policy updates. In this section, we describe a policy optimization algorithm designed to address these desiderata.

4.1 Our proposed clipped MPO (CMPO) regularizer

We use the Maximum a Posteriori Policy Optimization (MPO) algorithm (Abdolmaleki et al., 2018) as a starting point, since it can learn stochastic policies (1a), supports discrete and continuous action spaces (2c), can learn stably from off-policy data (3a), and has strong performance bounds even when using approximate Q-values (3b). We then improve the degree of control provided by MPO on the total variation distance between $\pi$ and $\pi_{\text{prior}}$ (3a), avoiding sensitive domain-specific hyperparameters (3d).

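MPO uses a regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi, \pi_{\text{prior}})$, where $\pi_{\text{prior}}$ is the previous policy. Since we are interested in learning from stale data, we allow $\pi_{\text{prior}}$ to correspond to arbitrary previous policies, and we introduce a regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi_{\text{CMPO}}, \pi)$, based on the new target

$$\pi_{\text{CMPO}}(a|s) = \frac{\pi_{\text{prior}}(a|s)\,\exp\left(\text{clip}\left(\widehat{\text{adv}}(s,a), -c, c\right)\right)}{z_{\text{CMPO}}(s)} \tag{7}$$

where $\widehat{\text{adv}}(s,a)$ is a non-stochastic approximation of the advantage $\hat q_{\pi_{\text{prior}}}(s,a) - \hat v_{\pi_{\text{prior}}}(s)$ and the factor $z_{\text{CMPO}}(s)$ ensures the policy is a valid probability distribution. The term we use in the regularizer has an interesting relation to natural policy gradients (Kakade, 2001): $\pi_{\text{CMPO}}$ is obtained if the natural gradient is computed with respect to the logits of $\pi_{\text{prior}}$ and then the expected gradient is clipped (for a proof, note that the natural policy gradient with respect to the logits is equal to the advantages (Agarwal et al., 2019)).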
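The clipped target described above can be sketched for one state as follows, assuming it takes the form prior times exp(clip(adv, -c, c)), normalized over actions. The helper name `cmpo_target` and the example numbers are hypothetical.

```python
import math

def cmpo_target(prior, advantages, c=1.0):
    """Reweight the prior by exponentiated advantages clipped to [-c, c]."""
    clipped = [max(-c, min(c, adv)) for adv in advantages]
    weights = [p * math.exp(a) for p, a in zip(prior, clipped)]
    z = sum(weights)  # normalizer so the target sums to one
    return [w / z for w in weights]

# The same ordering of advantages at two very different scales.
prior = [0.25, 0.25, 0.5]
small_scale = cmpo_target(prior, [2.0, 0.0, -2.0])
large_scale = cmpo_target(prior, [2000.0, 0.0, -2000.0])
```

Because both advantage vectors clip to the same values, the two targets coincide, which is the sense in which clipping removes sensitivity to large advantage scales.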

The clipping threshold $c$ controls the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$. Specifically, the total variation distance between $\pi'$ and $\pi$ is defined as

$$D_{\text{TV}}\left(\pi'(\cdot|s), \pi(\cdot|s)\right) = \frac{1}{2}\sum_a \left|\pi'(a|s) - \pi(a|s)\right| \tag{8}$$

As discussed in Section 3.3, constrained total variation supports robust off-policy learning. The clipped advantages allow us to derive not only a bound for the total variation distance but an exact formula:

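Theorem 4.1 (Maximum CMPO total variation distance). For any clipping threshold $c > 0$, we have:

$$\max_{\pi_{\text{prior}},\, \widehat{\text{adv}},\, s} D_{\text{TV}}\left(\pi_{\text{CMPO}}(\cdot|s), \pi_{\text{prior}}(\cdot|s)\right) = \tanh\left(\frac{c}{2}\right)$$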

We refer readers to Appendix D for proof of Theorem 4.1; we also verified the theorem predictions numerically.
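A quick numerical sanity check of this kind can be sketched as below. It assumes the maximum in Theorem 4.1 equals tanh(c/2) and searches only over two-action priors with advantages at the clipping limits, which is where a short calculation places the worst case; it is not the proof from Appendix D. All function names are illustrative.

```python
import math

def cmpo_target(prior, advantages, c):
    clipped = [max(-c, min(c, adv)) for adv in advantages]
    weights = [p * math.exp(a) for p, a in zip(prior, clipped)]
    z = sum(weights)
    return [w / z for w in weights]

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same actions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_tv_two_actions(c, steps=10000):
    """Grid search over two-action priors with advantages +c and -c."""
    best = 0.0
    for i in range(1, steps):
        p = i / steps
        prior = [p, 1.0 - p]
        target = cmpo_target(prior, [c, -c], c)
        best = max(best, total_variation(target, prior))
    return best

c = 1.0
worst_case = max_tv_two_actions(c)
bound = math.tanh(c / 2)

# Inverting the formula gives the threshold for a desired maximum distance.
c_for_tv = 2 * math.atanh(0.2)
```

In this two-action family the grid maximum approaches the bound from below, and the last line shows how a threshold could be picked for a target distance of 0.2 (an arbitrary example value).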

Figure 3: (a) The maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ is exclusively a function of the clipping threshold $c$. (b) A comparison (on 10 Atari games) of the Muesli sensitivity to the regularizer multiplier $\lambda$. Each dot is the mean of 5 runs with different random seeds and the black line is the mean across all 10 games. With Muesli's normalized advantages, the good range of values for $\lambda$ is fairly large, not strongly problem dependent, and performs well on many environments.

Note that the maximum total variation distance between $\pi_{\text{CMPO}}$ and $\pi_{\text{prior}}$ does not depend on the number of actions or other environment properties (3d). It only depends on the clipping threshold $c$, as visualized in Figure 3a. This makes it possible to control the maximum total variation distance under a CMPO update, for instance by setting the maximum total variation distance to $\epsilon$, without requiring the constrained optimization procedure used in the original MPO paper. Instead of the constrained optimization, we just set $c = 2\,\text{arctanh}(\epsilon)$. We used $c = 1$ in our experiments, across all domains.

4.2 A novel policy update

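Given the proposed regularizer $\Omega(\pi) = \lambda\,\text{KL}(\pi_{\text{CMPO}}, \pi)$, we can update the policy by direct optimization of the regularized objective, that is by gradient descent on

$$L_{\text{PG+CMPO}}(\pi, s) = -\mathbb{E}_{A\sim\pi(\cdot|s)}\left[\widehat{\text{adv}}(s,A)\right] + \lambda\,\text{KL}\left(\pi_{\text{CMPO}}(\cdot|s), \pi(\cdot|s)\right) \tag{9}$$

where the advantage terms in each component of the loss can be normalized using the approach described in Section 4.5 to improve the robustness to reward scales.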

The first term corresponds to a standard policy gradient update, thus allowing stochastic estimates of $\widehat{\text{adv}}(s,A)$ that use Monte-Carlo or multi-step estimators (1b). The second term adds regularization via distillation of the CMPO target, to preserve the desiderata addressed in Section 4.1.

Critically, the hyper-parameter $\lambda$ is easy to set (3d), because even if $\lambda$ is high, $\text{KL}(\pi_{\text{CMPO}}, \pi)$ still proposes improvements to the policy $\pi_{\text{prior}}$. This property is missing in popular regularizers that maximize entropy or minimize a distance from $\pi_{\text{prior}}$. We refer to the sensitivity analysis depicted in Figure 3b for a sample of the wide range of values of $\lambda$ that we found to perform well on Atari. We used $\lambda = 1$ in all other experiments reported in the paper.

Both terms can be sampled, which allows trading off the computation cost against the variance of the update; this is especially useful in large or continuous action spaces (2b), (2c).

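We can sample the gradient of the first term by computing the loss on data generated by a prior policy $\pi_b$, and then use importance sampling to correct for the distribution shift with respect to $\pi$. This results in the estimator

$$-\frac{\pi(a|s)}{\pi_b(a|s)}\left(G^v(s,a) - \hat v_{\pi_{\text{prior}}}(s)\right) \tag{10}$$

for the first term of the policy loss. In this expression, $\pi_b(a|s)$ is the behavior policy; the advantage $\left(G^v(s,a) - \hat v_{\pi_{\text{prior}}}(s)\right)$ uses a stochastic multi-step bootstrapped estimator $G^v(s,a)$ and a learned baseline $\hat v_{\pi_{\text{prior}}}(s)$.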
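The importance-sampling correction can be illustrated with a small sketch, assuming the per-sample term is the negated probability ratio times the advantage. The function name `pg_term` and the toy distributions are hypothetical; in practice the gradient would be taken through `pi_a` only, with the return and baseline treated as constants.

```python
def pg_term(pi_a, pi_b_a, return_estimate, baseline):
    """Per-sample loss term: -(pi(a|s) / pi_b(a|s)) * (G - v)."""
    ratio = pi_a / pi_b_a  # importance weight correcting for the behavior policy
    return -ratio * (return_estimate - baseline)

# Toy three-action example with made-up numbers.
pi = [0.7, 0.2, 0.1]         # current policy
pi_b = [0.4, 0.4, 0.2]       # behavior policy that generated the data
returns = [1.0, 0.5, -1.0]   # return estimate per action
baseline = 0.3

# Averaging the corrected term under the behavior policy ...
expected_off_policy = sum(
    b * pg_term(p, b, g, baseline) for p, b, g in zip(pi, pi_b, returns)
)
# ... recovers the on-policy expectation of the negated advantage.
on_policy = -sum(p * (g - baseline) for p, g in zip(pi, returns))
```

The equality of the two quantities is the reason data from an older behavior policy can still be used for the first loss term.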

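We can also sample the regularizer, by computing a stochastic estimate of the KL on a subset of $N$ actions $A_k$, sampled from $\pi_{\text{prior}}(\cdot|s)$. In which case, the second term of Eq. 9 becomes (ignoring an additive constant):

$$-\frac{\lambda}{N}\sum_{k=1}^{N}\left[\frac{\exp\left(\text{clip}\left(\widehat{\text{adv}}(s,A_k), -c, c\right)\right)}{z_{\text{CMPO}}(s)}\log\pi(A_k|s)\right] \tag{11}$$

where $\widehat{\text{adv}}(s,a) = \hat q_{\pi_{\text{prior}}}(s,a) - \hat v_{\pi_{\text{prior}}}(s)$ is computed from the learned values $\hat q_{\pi_{\text{prior}}}$ and $\hat v_{\pi_{\text{prior}}}(s)$. To support sampling just a few actions from the current state $s$, we can estimate $z_{\text{CMPO}}(s)$ for the $i$-th sample out of $N$ as:

$$\tilde z^{(i)}_{\text{CMPO}}(s) = \frac{z_{\text{init}} + \sum_{k \ne i}^{N}\exp\left(\text{clip}\left(\widehat{\text{adv}}(s,A_k), -c, c\right)\right)}{N} \tag{12}$$

where $z_{\text{init}}$ is an initial estimate. We use $z_{\text{init}} = 1$.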
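A minimal sketch of the sampled regularizer follows. It assumes a leave-one-out normalizer of the form (z_init + sum of the other samples' clipped exponentials) / N and a loss term that weights each sampled log-probability by its clipped exponential over that normalizer. The names and default values (`z_init=1.0`, `c=1.0`, `lam=1.0`) are assumptions for illustration.

```python
import math

def clipped_exp(adv, c):
    return math.exp(max(-c, min(c, adv)))

def z_estimates(sample_advs, c=1.0, z_init=1.0):
    """Leave-one-out normalizer estimate for each of the N sampled actions."""
    n = len(sample_advs)
    exps = [clipped_exp(a, c) for a in sample_advs]
    total = sum(exps)
    # For sample i, replace its own contribution by the initial estimate.
    return [(z_init + total - e) / n for e in exps]

def sampled_cmpo_term(sample_advs, sample_log_probs, lam=1.0, c=1.0, z_init=1.0):
    """Sampled regularizer (up to an additive constant) from N prior samples."""
    n = len(sample_advs)
    zs = z_estimates(sample_advs, c, z_init)
    weighted = [
        clipped_exp(a, c) / z * lp
        for a, z, lp in zip(sample_advs, zs, sample_log_probs)
    ]
    return -(lam / n) * sum(weighted)

single = z_estimates([0.3])  # with one sample, only z_init remains
loss = sampled_cmpo_term([0.0, 0.0], [math.log(0.5), math.log(0.25)])
```

With zero advantages every weight is one, so the term reduces to the average negative log-probability of the sampled actions, i.e. plain distillation towards the prior.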

4.3 Learning a model

As discussed in Section 3.4, learning models has several potential benefits. Thus, we propose to train a model alongside policy and value estimates (4b). As in MuZero (Schrittwieser et al., 2020) our model is not trained to reconstruct observations, but is rather only required to provide accurate estimates of rewards, values and policies. It can be seen as an instance of value equivalent models (Grimm et al., 2020).

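For training, the model is unrolled $K > 1$ steps, taking as inputs an initial state $s_t$ and an action sequence $a_{<t+K} = a_t, a_{t+1}, \ldots, a_{t+K-1}$. On each step $k$ the model then predicts rewards $\hat r_k$, values $\hat v_k$ and policies $\hat\pi_k$. Rewards and values are trained to match the observed rewards and values of the states actually visited when executing those actions.

Policy predictions $\hat\pi_k$ after unrolling the model $k$ steps are trained to match the policy targets $\pi_{\text{CMPO}}$ computed in the actual observed states $s_{t+k}$. The policy component of the model loss can then be written as:

$$L_m(\pi, s_t) = \frac{1}{K}\sum_{k=1}^{K}\text{KL}\left(\pi_{\text{CMPO}}(\cdot|s_{t+k}),\, \hat\pi_k(\cdot|s_t, a_{<t+k})\right) \tag{13}$$

This differs from MuZero in that here the policy predictions $\hat\pi_k(\cdot|s_t, a_{<t+k})$ are updated towards the targets $\pi_{\text{CMPO}}(\cdot|s_{t+k})$, instead of being updated to match the targets $\pi_{\text{MCTS}}$ constructed from the MCTS visitations.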
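The policy component of the model loss can be sketched as an average of KL divergences over the unroll steps, assuming one target distribution and one model prediction per step. The helper names and the two-step toy data are hypothetical.

```python
import math

def kl(p, q):
    """KL(p, q) for discrete distributions; terms with p(a) = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def model_policy_loss(targets, predictions):
    """Average KL between per-step targets and the model's policy predictions."""
    k_steps = len(targets)
    return sum(kl(t, p) for t, p in zip(targets, predictions)) / k_steps

# Made-up targets for a 2-step unroll over two actions.
targets = [[0.6, 0.4], [0.1, 0.9]]
perfect = model_policy_loss(targets, targets)
uniform = model_policy_loss(targets, [[0.5, 0.5], [0.5, 0.5]])
```

The loss vanishes only when every unrolled prediction matches its target, so it penalizes a model whose later steps drift from the targets computed at the observed states.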

4.4 Using the model

The first use of a model is as an auxiliary task. We implement this by conditioning the model not on a raw environment state but, instead, on the activations from a hidden layer of the policy network. Gradients from the model loss are then propagated all the way into the shared encoder, to help learn good state representations.

The second use of the model is within the policy update from Eq. 9. Specifically, the model is used to estimate the action values $\hat q_{\pi_{\text{prior}}}(s,a)$, via one-step look-ahead:

$$\hat q_{\pi_{\text{prior}}}(s,a) = \hat r_1(s,a) + \gamma\,\hat v_1(s,a) \tag{14}$$

and the model-based action values are then used in two ways. First, they are used to estimate the multi-step return $G^v(s,a)$ in Eq. 10, by combining action values and observed rewards using the Retrace estimator (Munos et al., 2016). Second, the action values are used in the (non-stochastic) advantage estimate $\widehat{\text{adv}}(s,a) = \hat q_{\pi_{\text{prior}}}(s,a) - \hat v_{\pi_{\text{prior}}}(s)$ required by the regularisation term in Eq. 11.
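The one-step look-ahead can be sketched as follows, assuming the model returns, for each action, a predicted reward and a predicted value of the resulting state. The function names and numbers are illustrative, and the Retrace combination is deliberately left out.

```python
def one_step_q(model_rewards, model_next_values, discount):
    """Action values from a one-step model unroll: r_hat + discount * v_hat."""
    return [r + discount * v for r, v in zip(model_rewards, model_next_values)]

def advantages(q_values, state_value):
    """Non-stochastic advantage estimates: q_hat(s, a) - v_hat(s)."""
    return [q - state_value for q in q_values]

# Hypothetical model outputs for two actions in one state.
q_hat = one_step_q([1.0, 0.0], [0.5, 2.0], discount=0.9)
adv_hat = advantages(q_hat, state_value=1.5)
```

These advantages are the quantities that would then be clipped and exponentiated when forming the regularization target.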

Using the model to compute the $\pi_{\text{CMPO}}$ target instead of using it to construct the search-based policy $\pi_{\text{MCTS}}$ has advantages: a fast analytical formula, stochastic estimation of $\text{KL}(\pi_{\text{CMPO}}, \pi)$ in large action spaces (2b), and direct support for continuous actions (2c). In contrast, MuZero's targets are only an approximate solution to regularized policy optimization (Grill et al., 2020), and the approximation can be crude when using few simulations.

Note that we could have also used deep search to estimate action-values, and used these in the proposed update. Deep search would however be computationally expensive, and may require more accurate models to be effective (3b).

4.5 Normalization

CMPO avoids overly large changes but does not prevent updates from becoming vanishingly small due to small advantages. To increase robustness to reward scales (3c), we divide advantages $\widehat{\text{adv}}(s,a)$ by the standard deviation of the advantage estimator. A similar normalization was used in PPO (Schulman et al., 2017), but we estimate $\mathbb{E}_{S_t,A_t}\left[\left(G^v(S_t,A_t) - \hat v_{\pi_{\text{prior}}}(S_t)\right)^2\right]$ using moving averages, to support small batches. Normalized advantages do not become small, even when the policy is close to optimal; for convergence, we rely on learning rate decay.
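One way such a moving-average normalization could look is sketched below. It assumes an exponential moving average of the squared advantages with bias correction; the class name, the decay of 0.99 and the small epsilon are illustrative choices, not values confirmed by the text above.

```python
import math

class AdvantageNormalizer:
    """Tracks a moving average of squared advantages and rescales by its root."""

    def __init__(self, decay=0.99, eps=1e-12):
        self.decay = decay          # assumed decay rate of the moving average
        self.eps = eps              # assumed constant to avoid division by zero
        self.second_moment = 0.0
        self.weight = 0.0           # accumulated weight, used for bias correction

    def update(self, advantages):
        batch = sum(a * a for a in advantages) / len(advantages)
        self.second_moment = self.decay * self.second_moment + (1.0 - self.decay) * batch
        self.weight = self.decay * self.weight + (1.0 - self.decay)

    def normalize(self, advantages):
        if self.weight == 0.0:
            return list(advantages)  # no statistics yet
        variance = self.second_moment / self.weight  # bias-corrected estimate
        scale = math.sqrt(variance + self.eps)
        return [a / scale for a in advantages]

def run(scale):
    norm = AdvantageNormalizer()
    batch = [scale * a for a in (0.5, -1.0, 2.0)]
    norm.update(batch)
    return norm.normalize(batch)

small, large = run(0.01), run(1000.0)
```

The two runs differ in reward scale by five orders of magnitude, yet the normalized advantages nearly coincide, which is the property that makes a single clipping threshold usable across environments.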

All policy components can be normalized using this approach, but the model also predicts rewards and values, and the corresponding losses could be sensitive to reward scales. To avoid having to tune, per game, the weighting of these unnormalized components (3c), (3d), we compute losses in a non-linearly transformed space (Pohlen et al., 2018; van Hasselt et al., 2019), using the categorical reparametrization introduced by MuZero (Schrittwieser et al., 2020).

Figure 4: A comparison (on two Atari games) of the robustness of clipped and unclipped MPO agents to the scale of the advantages. Without clipping, we found that performance degraded quickly as the scale increased. In contrast, with CMPO, performance was almost unaffected by scales ranging from $10^{-3}$ to $10^{3}$.

Figure 5: A comparison (on two Atari games) of direct and indirect optimization. Whether direct MPO (in green) or indirect CMPO (in yellow) performs best depends on the environment. Muesli, however, typically performs as well as or better than either one of them. The aggregate scores across the 57 games for Muesli, direct MPO and CMPO are reported in Figure 13 of the appendix.

5 An empirical study

In this section, we investigate empirically the policy updates described in Section 4. The full agent implementing our recommendations is named Muesli, as homage to MuZero. The Muesli policy loss is $L_{\text{PG+CMPO}}(\pi, s) + L_m(\pi, s)$. All agents in this section are trained using the Sebulba podracer architecture (Hessel et al., 2021).

First, we use the 57 Atari games in the Arcade Learning Environment (Bellemare et al., 2013) to investigate the key design choices in Muesli, by comparing it to suitable baselines and ablations. We use sticky actions to make the environments stochastic (Machado et al., 2018). To ensure comparability, all agents use the same policy network, based on the IMPALA agent (Espeholt et al., 2018). When applicable, the model described in Section 4.3 is parametrized by an LSTM (Hochreiter & Schmidhuber, 1997), with a diagram in Figure 10 in the appendix. Agents are trained using uniform experience replay, and estimate multi-step returns using Retrace (Munos et al., 2016).

Figure 6: Median score across 57 Atari games. (a) Return ablations: 1) Retrace or V-trace, 2) training the policy with multi-step returns or with $\hat q_{\pi_{\text{prior}}}(s,a)$ only (in red). (b) Different numbers of samples to estimate the $\text{KL}(\pi_{\text{CMPO}}, \pi)$. The "1 sample, oracle" (pink) used the exact normalizer, which requires expanding all actions. The ablations were run with 2 random seeds.

In Figure 1a we compare the median human-normalized score on Atari achieved by Muesli to that of several baselines: policy gradients (in red), PPO (in green), MPO (in grey) and a policy gradient variant with TRPO-like regularization (in orange). The updates for each baseline are reported in Appendix F, and the agents differed only in the policy components of the losses. In all updates we used the same normalization, and trained a MuZero-like model grounded in values and rewards. In MPO and Muesli, the policy loss included the policy model loss from Eq. 13. For each update, we separately tuned hyperparameters on 10 of the 57 Atari games. We found the performance on the full benchmark to be substantially higher for Muesli (in blue). In the next experiments we investigate how different design choices contributed to Muesli’s performance.

In Figure 4 we use the Atari games beam_rider and gravitar to investigate advantage clipping. Here, we compare the updates that use clipped (in blue) and unclipped (in red) advantages, when first rescaling the advantages by factors ranging from $10^{-3}$ to $10^{3}$ to simulate diverse return scales. Without clipping, performance was sensitive to scale, and degraded quickly when scaling advantages by a factor of 100 or more. With clipping, learning was almost unaffected by rescaling, without requiring more complicated solutions such as the constrained optimization introduced in related work to deal with this issue (Abdolmaleki et al., 2018).

In Figure 5 we show how Muesli combines the benefits of direct and indirect optimization. A direct MPO update uses the regularizer $\lambda\,\text{KL}(\pi, \pi_{\text{prior}})$ as a penalty; cf. Mirror Descent Policy Optimization (Tomar et al., 2020). Indirect MPO first finds $\pi_{\text{MPO}}$ from Eq. 6 and then trains the policy $\pi$ by the distillation loss $\text{KL}(\pi_{\text{MPO}}, \pi)$. Note the different direction of the KLs. Vieillard et al. (2020) observed that the best choice between direct and indirect MPO is problem dependent, and we found the same: compare the ordering of direct MPO (in green) and indirect CMPO (in yellow) on the two Atari games alien and robotank. In contrast, we found that the Muesli policy update (in blue) was typically able to combine the benefits of the two approaches, by performing as well as or better than the better of the two updates on each of the two games. See Figure 13 in the appendix for aggregate results across more games.

In Figure 6a we evaluate the importance of using multi-step bootstrapped returns and model-based action values in the policy-gradient-like component of Muesli's update. Replacing the multi-step return $G^v(s,a)$ with an approximate $\hat q_{\pi_{\text{prior}}}(s,a)$ (in red in Figure 6a) degraded the performance of Muesli (in blue) by a large amount, showing the importance of leveraging multi-step estimators. We also evaluated the role of model-based action value estimates in the Retrace estimator, by comparing full Muesli to an ablation (in green) where we instead used model-free values in a V-trace estimator (Espeholt et al., 2018). The ablation performed worse.

Figure 7: Median score across 57 Atari games. (a) Muesli ablations that train one-step models (in green), or drop the policy component of the model (in red). (b) Muesli and two MCTS-baselines that act by sampling from $\pi_{\text{MCTS}}$ and learn using $\pi_{\text{MCTS}}$ as target; all use the IMPALA policy network and an LSTM model.

In Figure 6b we compare the performance of Muesli when using different numbers of actions to estimate the KL term in Eq. 9. We found that the resulting agent performed well in absolute terms of median human normalized performance when estimating the KL by sampling as few as a single action (brown). Performance increased by sampling up to 16 actions, at which point it was comparable to using the exact KL.

In Figure 7a we show the impact of different parts of the model loss on representation learning. The performance degraded when only training the model for one step (in green). This suggests that training a model to support deeper unrolls (5 steps in Muesli, in blue) is a useful auxiliary task even if using only one-step look-aheads in the policy update. In Figure 7a we also show that performance degraded even further if the model was not trained to output policy predictions at each step in the future, as per Eq. 13, but instead was only trained to predict rewards and values (in red). This is consistent with the value equivalence principle (Grimm et al., 2020): a rich signal from training models to support multiple predictions is critical for this kind of model.

In Figure 7b we compare Muesli to an MCTS baseline. As in MuZero, the baseline uses MCTS both for acting and learning. This is not a canonical MuZero, though, as it uses the (smaller) IMPALA network. MCTS (in purple) performed worse than Muesli (in blue) in this regime. We ran another MCTS variant with limited search depth (in green); this was better than full MCTS, suggesting that with insufficiently large networks, the model may not be sufficiently accurate to support deep search. In contrast, Muesli performed well even with these smaller models (3b).

Since we know from the literature that MCTS can be very effective in combination with larger models, in Figure 1b we reran Muesli with a much larger policy network and model, similar to that of MuZero. In this setting, Muesli matched the published performance of MuZero (the current state of the art on Atari in the 200M frames regime). Notably, Muesli achieved this without relying on deep search: it still sampled actions from the fast policy network and used one-step look-aheads in the policy update. We note that the resulting median score matches MuZero and is substantially higher than that of all other published agents; see Table 2 for a comparison of the final performance of Muesli with other baselines.

Figure 8: Win probability on 9x9 Go when training from scratch, by self-play, for 5B frames. Evaluating 3 seeds against Pachi with 10K simulations per move. (a) Muesli and other search-free baselines. (b) MuZero MCTS with 150 simulations and Muesli with and without the use of MCTS at the evaluation time only.

Next, we evaluated Muesli on learning 9x9 Go from self-play. This requires handling non-stationarity and a combinatorial space. It is also a domain where deep search (e.g. MCTS) has historically been critical to reach non-trivial performance. In Figure 8a we show that Muesli (in blue) still outperformed the strongest baselines from Figure 1a, as well as CMPO on its own (in yellow). All policies were evaluated against Pachi (Baudiš & Gailly, 2011). Muesli reached a 75% win rate against Pachi: to the best of our knowledge, this is the first system to do so from self-play alone without deep search. In the Appendix we report even stronger win rates against GnuGo (Bump et al., 2005).

In Figure 8b, we compare Muesli to MCTS on Go; here, Muesli’s performance (in blue) fell short of that of the MCTS baseline (in purple), suggesting there is still value in using deep search for acting in some domains. This is also demonstrated by another Muesli variant that uses deep search at evaluation only. This Muesli/MCTS[Eval] hybrid (in light blue) recovered part of the gap with the MCTS baseline, without slowing down training. For reference, the pink vertical line depicts published MuZero, with its even greater data efficiency thanks to more simulations, a different network, more replay, and early resignation.

Finally, we tested the same agents on MuJoCo environments in OpenAI Gym (Brockman et al., 2016), to test if Muesli can be effective on continuous domains and on smaller data budgets (2M frames). Muesli performed competitively. We refer readers to Figure 12, in the appendix, for the results.

Agent                                            Median
DQN (Mnih et al., 2015)                          79%
DreamerV2 (Hafner et al., 2020)                  164%
IMPALA (Espeholt et al., 2018)                   192%
Rainbow (Hessel et al., 2018)                    231%
Meta-gradient (Xu et al., 2018)                  287%
STAC (Zahavy et al., 2020)                       364%
LASER (Schmitt et al., 2020)                     431%
MuZero Reanalyse (Schrittwieser et al., 2021)    1,047 ± 40%
Muesli                                           1,041 ± 40%

Table 2: Median human-normalized score across 57 Atari games from the ALE, at 200M frames, for several published baselines. These results are sourced from different papers, thus the agents differ along multiple dimensions (e.g. network architecture and amount of experience replay). MuZero and Muesli both use a very similar network, the same proportion of replay, and both use the harder version of the ALE with sticky actions (Machado et al., 2018). The ± denotes the standard error over 2 random seeds.

6 Conclusion

Starting from our desiderata for general policy optimization, we proposed an update (Muesli) that combines policy gradients with Maximum a Posteriori Policy Optimization (MPO) and model-based action values. We empirically evaluated the contributions of each design choice in Muesli, and compared the proposed update to related ideas from the literature. Muesli demonstrated state-of-the-art performance on Atari (matching MuZero’s most recent results), without the need for deep search. Muesli even outperformed MCTS-based agents when evaluated in a regime of smaller networks and/or reduced computational budgets. Finally, we found that Muesli could be applied out of the box to self-play 9x9 Go and continuous control problems, showing the generality of the update (although further research is needed to really push the state of the art in these domains). We hope that our findings will motivate further research in the rich space of algorithms at the intersection of policy gradient methods, regularized policy optimization and planning.


We would like to thank Manuel Kroiss and Iurii Kemaev for developing the research platform we use to run and distribute experiments at scale. Also we thank Dan Horgan, Alaa Saade, Nat McAleese and Charlie Beattie for their excellent help with reinforcement learning environments. Joseph Modayil improved the paper by wise comments and advice. Coding was made fun by the amazing JAX library (Bradbury et al., 2018), and the ecosystem around it (in particular the optimisation library Optax, the neural network library Haiku, and the reinforcement learning library Rlax). We thank the MuZero team at DeepMind for inspiring us.


Appendix A Stochastic estimation details

In the policy-gradient term in Eq. 10, we clip the importance weight to a bounded range. This clipping introduces a bias. To correct for it, we use $\beta$-LOO action-dependent baselines (Gruslys et al., 2018).

Although the $\beta$-LOO action-dependent baselines were not significant in the Muesli results, $\beta$-LOO was helpful for the policy gradients with the TRPO penalty (Figure 16).

Figure 9: The episodic MDP from Figure 2, reproduced here for an easier reference. State 1 is the initial state. State 4 is terminal. The discount is 1.

Appendix B The illustrative MDP example

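Here we analyze the values and the optimal policy for the MDP from Figure 9 when the same state representation is used in all states. With the state representation $\phi(s) = \varnothing$, the policy is restricted to be the same in all states. Let us denote the probability of the $\mathrm{up}$ action by $p$.

Given the policy $p = \pi(\mathrm{up} \mid \phi(s))$, the following are the values of the different states:

$$v_\pi(3) = p + (1-p)(-1) = 2p - 1$$
$$v_\pi(2) = p \cdot (-1) + (1-p) = -2p + 1$$
$$v_\pi(1) = p\,(1 + v_\pi(2)) + (1-p)\,v_\pi(3) = -4p^2 + 5p - 1$$

Finding the optimal policy. Our objective is to maximize the value of the initial state. That means maximizing $v_\pi(1)$. We can find the maximum by looking at the derivatives. The derivative of $v_\pi(1)$ with respect to the policy parameter is:

$$\frac{d v_\pi(1)}{dp} = -8p + 5$$

The second derivative is negative, so the maximum of $v_\pi(1)$ is at the point where the first derivative is zero. We conclude that the maximum of $v_\pi(1)$ is at $p^* = \frac{5}{8}$.

Finding the action values of the optimal policy. We will now find $q_{\pi^*}(\varnothing, \mathrm{up})$ and $q_{\pi^*}(\varnothing, \mathrm{down})$. The $q_\pi(\phi(s), a)$ is defined as the expected return after observing $\phi(s)$, when doing the action $a$ (Singh et al., 1994):

$$q_\pi(\phi(s), a) = \sum_{s'} P_\pi(s' \mid \phi(s))\, q_\pi(s', a),$$

where $P_\pi(s' \mid \phi(s))$ is the probability of being in the state $s'$ when observing $\phi(s)$.

In our example, the Q-values are:

$$q_\pi(\varnothing, \mathrm{up}) = \tfrac{1}{2}\,(1 + v_\pi(2)) + \tfrac{1}{2}\,p\,(-1) + \tfrac{1}{2}\,(1-p) = -2p + \tfrac{3}{2}$$
$$q_\pi(\varnothing, \mathrm{down}) = \tfrac{1}{2}\,v_\pi(3) + \tfrac{1}{2}\,p + \tfrac{1}{2}\,(1-p)(-1) = 2p - 1$$

We can now substitute $p^* = \frac{5}{8}$ for $p$ to find $q_{\pi^*}(\varnothing, \mathrm{up})$ and $q_{\pi^*}(\varnothing, \mathrm{down})$:

$$q_{\pi^*}(\varnothing, \mathrm{up}) = \tfrac{1}{4}, \qquad q_{\pi^*}(\varnothing, \mathrm{down}) = \tfrac{1}{4}$$

We see that these Q-values are the same and uninformative about the probabilities of the optimal (memory-less) stochastic policy. This generalizes to all environments: the optimal policy gives zero probability to all actions with lower Q-values. If the optimal policy at a given state representation gives non-zero probabilities to some actions, these actions must have the same Q-values $q_{\pi^*}(\phi(s), a)$.

Bootstrapping from $v_\pi(\phi(s'))$ would be worse. We will find $v_\pi(\varnothing)$ and show that bootstrapping from it would be misleading. In our example, $v_\pi(\varnothing)$ is:

$$v_\pi(\varnothing) = \tfrac{1}{2}\,v_\pi(1) + \tfrac{1}{2}\,p\,v_\pi(2) + \tfrac{1}{2}\,(1-p)\,v_\pi(3) = -4p^2 + \tfrac{9}{2}\,p - 1$$

We can notice that $v_\pi(\varnothing)$ is different from $v_\pi(2)$ or $v_\pi(3)$. Estimating $q_\pi(\varnothing, \mathrm{up})$ by bootstrapping from $v_\pi(\phi(2))$ instead of $v_\pi(2)$ would be misleading. Here, it is better to estimate the Q-values based on Monte-Carlo returns.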
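The analysis in this appendix can be checked with exact rational arithmetic. The sketch below is only illustrative: the transition and reward structure encoded in the comments is an assumed reading of the MDP in Figure 9, and the function names are hypothetical.

```python
from fractions import Fraction

# Assumed reading of the MDP in Figure 9 (an assumption made for this sketch):
# from state 1, `up` gives reward 1 and leads to state 2, `down` gives reward 0
# and leads to state 3; in state 2, `up` gives -1 and `down` gives +1; in
# state 3, `up` gives +1 and `down` gives -1; states 2 and 3 lead to the
# terminal state 4. The discount is 1.


def state_values(p):
    """Values of states 1, 2, 3 when `up` has probability p in every state."""
    v3 = p * 1 + (1 - p) * (-1)
    v2 = p * (-1) + (1 - p) * 1
    v1 = p * (1 + v2) + (1 - p) * v3
    return v1, v2, v3


def aliased_q_values(p):
    """Q-values of the shared state representation under the policy p."""
    _, v2, v3 = state_values(p)
    half = Fraction(1, 2)
    # State 1 is visited once per episode, state 2 with probability p and
    # state 3 with probability 1 - p, so the weights are 1/2, p/2, (1 - p)/2.
    q_up = half * (1 + v2) + half * p * (-1) + half * (1 - p) * 1
    q_down = half * v3 + half * p * 1 + half * (1 - p) * (-1)
    return q_up, q_down


# Search a grid of policies for the one maximizing the initial-state value.
grid = [Fraction(i, 1000) for i in range(1001)]
best_p = max(grid, key=lambda p: state_values(p)[0])
q_up, q_down = aliased_q_values(best_p)
```

Under these assumptions the grid search selects a strictly stochastic policy, and the two aliased Q-values coincide at that policy, so they carry no information about the action probabilities.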

Appendix C The motivation behind Conservative Policy Iteration and TRPO

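In this section we will show that unregularized maximization of $\mathbb{E}_{A \sim \pi(\cdot|s)}\left[\hat q_{\pi_{\text{prior}}}(s, A)\right]$ on data from an older policy $\pi_{\text{prior}}$ can produce a policy worse than $\pi_{\text{prior}}$. The size of the possible degradation will be related to the total variation distance between $\pi$ and $\pi_{\text{prior}}$. The explanation is based on the proofs from the excellent book by Agarwal et al. (2020).

As before, our objective is to maximize the expected value of the states from an initial state distribution $\mu$:

$$J(\pi) = \mathbb{E}_{S \sim \mu}\left[v_\pi(S)\right].$$

It will be helpful to define the discounted state visitation distribution $d_\pi(s)$ as:

$$d_\pi(s) = (1-\gamma)\, \mathbb{E}_{S_0 \sim \mu}\left[\sum_{t=0}^{\infty} \gamma^t P(S_t = s \mid S_0, \pi)\right],$$

where $P(S_t = s \mid S_0, \pi)$ is the probability of $S_t$ being $s$, if starting the episode from $S_0$ and following the policy $\pi$. The scaling by $(1-\gamma)$ ensures that $d_\pi(s)$ sums to one.

From the policy gradient theorem (Sutton et al., 2000) we know that the gradient of $J(\pi)$ with respect to the policy parameters $\theta$ is

$$\frac{\partial J}{\partial \theta} = \frac{1}{1-\gamma} \sum_s d_\pi(s) \sum_a \frac{\partial \pi(a|s)}{\partial \theta}\, q_\pi(s, a).$$

In practice, we often train on data from an older policy $\pi_{\text{prior}}$. Training on such data maximizes a different function:

$$\mathrm{TotalAdv}_{\text{prior}}(\pi) = \frac{1}{1-\gamma} \sum_s d_{\pi_{\text{prior}}}(s) \sum_a \pi(a|s)\, \mathrm{adv}_{\pi_{\text{prior}}}(s, a),$$

where $\mathrm{adv}_{\pi_{\text{prior}}}(s, a) = q_{\pi_{\text{prior}}}(s, a) - v_{\pi_{\text{prior}}}(s)$ is an advantage. Notice that the states are sampled from $d_{\pi_{\text{prior}}}$ and the policy is criticized by $\mathrm{adv}_{\pi_{\text{prior}}}$. This often happens in practice, when updating the policy multiple times in an episode, using a replay buffer or bootstrapping from a network trained on past data.

While maximization of $\mathrm{TotalAdv}_{\text{prior}}(\pi)$ is more practical, we will see that unregularized maximization of $\mathrm{TotalAdv}_{\text{prior}}(\pi)$ does not guarantee an improvement in our objective $J$. The difference $J(\pi) - J(\pi_{\text{prior}})$ can even be negative, if we are not careful.

Kakade & Langford (2002) stated a useful lemma for the performance difference:

Lemma C.1 (The performance difference lemma)

For all policies $\pi$, $\pi_{\text{prior}}$,

$$J(\pi) - J(\pi_{\text{prior}}) = \frac{1}{1-\gamma} \sum_s d_\pi(s) \sum_a \pi(a|s)\, \mathrm{adv}_{\pi_{\text{prior}}}(s, a).$$

We would like $J(\pi) - J(\pi_{\text{prior}})$ to be positive. We can express the performance difference as $\mathrm{TotalAdv}_{\text{prior}}(\pi)$ plus an extra term:

$$J(\pi) - J(\pi_{\text{prior}}) = \mathrm{TotalAdv}_{\text{prior}}(\pi) + \frac{1}{1-\gamma} \sum_s \left(d_\pi(s) - d_{\pi_{\text{prior}}}(s)\right) \sum_a \pi(a|s)\, \mathrm{adv}_{\pi_{\text{prior}}}(s, a). \tag{36}$$

To get a positive performance difference, it is not enough to maximize $\mathrm{TotalAdv}_{\text{prior}}(\pi)$. We also need to make sure that the second term in (36) will not degrade the performance. The impact of the second term can be kept small by keeping the total variation distance between $\pi$ and $\pi_{\text{prior}}$ small.

For example, the performance can degrade if $\pi$ is not trained at a state and that state gets a higher probability. The performance can also degrade if a stochastic policy is needed and the advantages are for an older policy. The policy $\pi$ would become deterministic if maximizing $\sum_a \pi(a|s)\, \mathrm{adv}_{\pi_{\text{prior}}}(s, a)$ without any regularization.

C.1 Performance difference lower bound.

We will express a bound of the performance difference as a function of the total variation between $\pi$ and $\pi_{\text{prior}}$. Starting from Eq. 36, we can derive the TRPO lower bound for the performance difference. Let $\alpha$ be the maximum total variation distance between $\pi$ and $\pi_{\text{prior}}$:

$$\alpha = \max_s \frac{1}{2} \sum_a \left|\pi(a|s) - \pi_{\text{prior}}(a|s)\right|.$$

The $\|d_\pi - d_{\pi_{\text{prior}}}\|_1$ is then bounded (see Agarwal et al., 2020, Similar policies imply similar state visitations):

$$\|d_\pi - d_{\pi_{\text{prior}}}\|_1 \le \frac{2 \alpha \gamma}{1-\gamma}.$$

Finally, by plugging the bounds into Eq. 36, we can construct the lower bound for the performance difference. Because $\sum_a \pi_{\text{prior}}(a|s)\, \mathrm{adv}_{\pi_{\text{prior}}}(s, a) = 0$, the inner sum in the second term of Eq. 36 has magnitude at most $2 \alpha \epsilon_{\max}$, which gives:

$$J(\pi) - J(\pi_{\text{prior}}) \ge \mathrm{TotalAdv}_{\text{prior}}(\pi) - \frac{4 \alpha^2 \gamma\, \epsilon_{\max}}{(1-\gamma)^2},$$

where $\epsilon_{\max} = \max_{s,a} \left|\mathrm{adv}_{\pi_{\text{prior}}}(s, a)\right|$. The same bound was derived in TRPO (Schulman et al., 2015).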
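The performance difference lemma can be checked numerically on a small tabular problem. The following sketch is illustrative only: it assumes the standard form of the lemma, in which the performance difference equals the advantage of the older policy averaged under the new policy's discounted state visitation and scaled by 1/(1 - gamma). The randomly generated MDP, the two policies, and all names are hypothetical.

```python
import random

GAMMA = 0.9
N_S, N_A = 3, 2  # sizes of a small, randomly generated MDP (hypothetical)
rng = random.Random(0)


def random_dist(n):
    """Return a random probability vector of length n."""
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


P = [[random_dist(N_S) for _ in range(N_A)] for _ in range(N_S)]  # P[s][a][s']
R = [[rng.uniform(-1.0, 1.0) for _ in range(N_A)] for _ in range(N_S)]
mu = random_dist(N_S)  # initial state distribution
pi = [random_dist(N_A) for _ in range(N_S)]
pi_prior = [random_dist(N_A) for _ in range(N_S)]


def action_values(v):
    """q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) v(s')."""
    return [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in range(N_S))
             for a in range(N_A)] for s in range(N_S)]


def evaluate(policy, iters=2000):
    """Iterative policy evaluation; returns (v, q) for the given policy."""
    v = [0.0] * N_S
    for _ in range(iters):
        q = action_values(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return v, action_values(v)


def visitation(policy, iters=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t P(S_t = s), with S_0 drawn from mu."""
    state_dist, d = list(mu), [0.0] * N_S
    for t in range(iters):
        d = [d[s] + (1.0 - GAMMA) * GAMMA ** t * state_dist[s] for s in range(N_S)]
        state_dist = [sum(state_dist[s] * policy[s][a] * P[s][a][n]
                          for s in range(N_S) for a in range(N_A))
                      for n in range(N_S)]
    return d


def objective(policy):
    """J(policy): expected value of the initial state."""
    v, _ = evaluate(policy)
    return sum(mu[s] * v[s] for s in range(N_S))


v_prior, q_prior = evaluate(pi_prior)
adv_prior = [[q_prior[s][a] - v_prior[s] for a in range(N_A)] for s in range(N_S)]


def weighted_advantage(d):
    """(1 / (1 - gamma)) * sum_s d(s) sum_a pi(a|s) adv_prior(s, a)."""
    return sum(d[s] * sum(pi[s][a] * adv_prior[s][a] for a in range(N_A))
               for s in range(N_S)) / (1.0 - GAMMA)


difference = objective(pi) - objective(pi_prior)
lemma_rhs = weighted_advantage(visitation(pi))        # states weighted by d_pi
surrogate = weighted_advantage(visitation(pi_prior))  # states weighted by d_prior
```

In this sketch `difference` and `lemma_rhs` agree up to numerical error, while `surrogate` generally differs from both, since it weights states by the older policy's visitation distribution.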

Appendix D Proof of Maximum CMPO total variation distance

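We will prove the following theorem: For any clipping threshold $c > 0$, we have:

$$\max_{\pi_{\text{prior}},\, \hat{\mathrm{adv}},\, s} D_{\mathrm{TV}}\left(\pi_{\text{CMPO}}(\cdot|s),\, \pi_{\text{prior}}(\cdot|s)\right) = \tanh\left(\frac{c}{2}\right).$$

Having 2 actions. We will first prove the theorem when the policy has 2 actions. To maximize the distance, the clipped advantages will be $-c$ and $c$. Let’s denote the probabilities associated with these advantages as $1-p$ and $p$, respectively.

The total variation distance is then:

$$D_{\mathrm{TV}}\left(\pi_{\text{CMPO}}(\cdot|s),\, \pi_{\text{prior}}(\cdot|s)\right) = \frac{p\, e^{c}}{p\, e^{c} + (1-p)\, e^{-c}} - p.$$

We will maximize the distance with respect to the parameter $p \in [0, 1]$.

The first derivative with respect to $p$ is:

$$\frac{d D_{\mathrm{TV}}}{dp} = \frac{1}{\left(p\, e^{c} + (1-p)\, e^{-c}\right)^2} - 1.$$

The second derivative with respect to $p$ is:

$$\frac{d^2 D_{\mathrm{TV}}}{dp^2} = -\frac{2\left(e^{c} - e^{-c}\right)}{\left(p\, e^{c} + (1-p)\, e^{-c}\right)^3}.$$

Because the second derivative is negative, the distance is a concave function of $p$. We will find the maximum at the point where the first derivative is zero. The solution is:

$$p^* = \frac{1 - e^{-c}}{e^{c} - e^{-c}} = \frac{1}{e^{c} + 1}.$$

At the found point $p^*$, the maximum total variation distance is:

$$\max_p D_{\mathrm{TV}}\left(\pi_{\text{CMPO}}(\cdot|s),\, \pi_{\text{prior}}(\cdot|s)\right) = \frac{e^{c} - 1}{e^{c} + 1} = \tanh\left(\frac{c}{2}\right).$$

This completes the proof when having 2 actions.

Having any number of actions. We will now prove the theorem when the policy has any number of actions. To maximize the distance, the clipped advantages will be $-c$ or $c$. Let’s denote the sum of probabilities associated with these advantages as $1-p$ and $p$, respectively.

The total variation distance is again:

$$D_{\mathrm{TV}}\left(\pi_{\text{CMPO}}(\cdot|s),\, \pi_{\text{prior}}(\cdot|s)\right) = \frac{p\, e^{c}}{p\, e^{c} + (1-p)\, e^{-c}} - p,$$

and the maximum distance is again $\tanh\left(\frac{c}{2}\right)$.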

We also verified the theorem predictions experimentally, by using gradient ascent to maximize the total variation distance.
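A simple way to reproduce such a check is sketched below. It uses a grid search over the prior probability rather than gradient ascent, and it assumes that the CMPO policy has the form prior(a) · exp(clip(adv(a), -c, c)) / z, for which the maximum works out to tanh(c/2) in the two-action case. The function names are hypothetical.

```python
import math


def cmpo_policy(prior, advantages, c):
    """Assumed CMPO form: prior(a) * exp(clip(adv(a), -c, c)) / z."""
    weights = [p * math.exp(max(-c, min(c, adv)))
               for p, adv in zip(prior, advantages)]
    z = sum(weights)
    return [w / z for w in weights]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_tv_two_actions(c, grid_size=100_001):
    """Grid search over the prior probability p of the action with advantage +c."""
    best = 0.0
    for i in range(grid_size):
        p = i / (grid_size - 1)
        prior = [1.0 - p, p]
        cmpo = cmpo_policy(prior, [-c, c], c)
        best = max(best, total_variation(cmpo, prior))
    return best


# Compare the numerical maximum with the closed form for a few thresholds.
results = {c: (max_tv_two_actions(c), math.tanh(c / 2.0)) for c in (0.5, 1.0, 2.0)}
```

Because the advantages are clipped before exponentiation, arbitrarily large raw advantages cannot push the distance beyond the value found at the extremes `-c` and `c`.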

Appendix E Extended related work

We used the desiderata to motivate the design of the policy update. We now use the same desiderata to discuss related methods and how they address each requirement. For a comprehensive overview of model-based reinforcement learning, we recommend the surveys by Moerland et al. (2020) and Hamrick (2019).

E.1 Observability and function approximation

1a) Support learning stochastic policies. The ability to learn a stochastic policy is one of the benefits of policy gradient methods.

1b) Leverage Monte-Carlo targets. Muesli uses multi-step returns to train the policy network and Q-values. MPO and MuZero need to train the Q-values, before using the Q-values to train the policy.

E.2 Policy representation

2a) Support learning the optimal memory-less policy. Muesli represents the stochastic policy by the learned policy network. In principle, acting can be based on a combination of the policy network and the Q-values. For example, one possibility is to act with the $\pi_{\text{CMPO}}$ policy. ACER (Wang et al., 2016) used a similar way of acting. Although we have not seen benefits from acting based on $\pi_{\text{CMPO}}$ on Atari (Figure 15), we have seen better results on Go with a deeper search at evaluation time.

2b) Scale to (large) discrete action spaces. Muesli supports large action spaces, because the policy loss can be estimated by sampling. MCTS is less suitable for large action spaces. This was addressed by Grill et al. (2020), who brilliantly revealed MCTS as regularized policy optimization and designed a tree search based on MPO or a different regularized policy optimization. The resulting tree search was less affected by a small number of simulations. Muesli is based on this view of regularized policy optimization as an alternative to MCTS. In another approach, MuZero was recently extended to support sampled actions and continuous actions (Hubert et al., 2021).

2c) Scale to continuous action spaces. Although we used the same estimator of the policy loss for discrete and continuous actions, it would be possible to exploit the structure of the continuous policy. For example, the continuous policy can be represented by a normalizing flow (Papamakarios et al., 2019) to model the joint distribution of the multi-dimensional actions. The continuous policy would also allow estimating the gradient of the policy regularizer with the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014). Soft Actor-Critic (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) achieved great results on the MuJoCo tasks by obtaining the gradient with respect to the action from an ensemble of approximate Q-functions. An ensemble of Q-functions would probably also improve the Muesli results.

E.3 Robust learning

3a) Support off-policy and historical data. Muesli supports off-policy data thanks to the regularized policy optimization, Retrace (Munos et al., 2016) and policy gradients with clipped importance weights (Gruslys et al., 2018). Many other methods deal with off-policy or offline data (Levine et al., 2020). Recently MuZero Reanalyse (Schrittwieser et al., 2021) achieved state-of-the-art results on an offline RL benchmark by training only on the offline data.

3b) Deal gracefully with inaccuracies in the values/model. Muesli does not fully trust the Q-values from the model. Muesli combines the Q-values with the prior policy to propose a new policy with a constrained total variation distance from the prior policy. Without the regularized policy optimization, the agent can be misled by an overestimated Q-value for a rarely taken action. Soft Actor-Critic (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) mitigate the overestimation by taking the minimum from a pair of Q-networks. In model-based reinforcement learning, an unrolled one-step model would struggle with compounding errors (Janner et al., 2019). VPN (Oh et al., 2017) and MuZero (Schrittwieser et al., 2020) avoid compounding errors by using multi-step predictions that are not conditioned on previous model predictions. While VPN and MuZero avoid compounding errors, these models are not suitable for planning a sequence of actions in a stochastic environment. In a stochastic environment, the sequence of actions needs to depend on the stochastic events that occurred, otherwise the planning is confounded and can underestimate or overestimate the state value (Rezende et al., 2020). Other models conditioned on limited information from generated (latent) variables can face similar problems in stochastic environments (e.g. DreamerV2 (Hafner et al., 2020)). Muesli is suitable for stochastic environments, because Muesli uses only one-step look-ahead. If combining Muesli with a deep search, we can use an adaptive search depth or a stochastic model sufficient for causally correct planning (Rezende et al., 2020). Another class of models deals with model errors by using the model as a part of the Q-network or policy network and training the whole network end-to-end. These networks include VIN (Tamar et al., 2016), Predictron (Silver et al., 2017), I2A (Racanière et al., 2017), IBP (Pascanu et al., 2017), TreeQN, ATreeC (Farquhar et al., 2018) (with scores in Table 3), ACE (Zhang & Yao, 2019), UPN (Srinivas et al., 2018) and implicit planning with DRC (Guez et al., 2019).

3c) Be robust to diverse reward scales. Muesli benefits from the normalized advantages and from the advantage clipping inside $\pi_{\text{CMPO}}$. Pop-Art (van Hasselt et al., 2016) addressed learning values across many orders of magnitude. On Atari, the scores of the games vary from 21 on Pong to 1M on Atlantis. The non-linear transformation by Pohlen et al. (2018) is practically very helpful, although biased for stochastic returns.
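As an illustration of such a non-linear transformation, the sketch below implements the squashing function h(x) = sign(x)(sqrt(|x| + 1) - 1) + εx commonly attributed to Pohlen et al. (2018), together with its closed-form inverse. The value of ε and the function names are assumptions made for this example, not settings taken from the experiments here.

```python
import math

EPSILON = 1e-2  # assumed value for this sketch


def squash(x, eps=EPSILON):
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x


def unsquash(y, eps=EPSILON):
    """Closed-form inverse of squash, obtained by solving a quadratic in sqrt(|x| + 1)."""
    u = (math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(u * u - 1.0, y)


# Returns on the scale of Pong (21) and Atlantis (1M) are compressed into a
# much narrower range, and the inverse maps them back.
pong, atlantis = squash(21.0), squash(1_000_000.0)
recovered = [unsquash(squash(x)) for x in (-1_000_000.0, -21.0, 0.0, 21.0, 1_000_000.0)]
```

The bias mentioned above arises because the transformation is non-linear: averaging squashed stochastic returns and then inverting does not, in general, recover the average return.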

3d) Avoid problem-dependent hyperparameters. The normalized advantages were used before in PPO (Schulman et al., 2017). The maximum CMPO total variation (Theorem 4.1) helps to explain the success of such normalization. If the normalized advantages are from $[-c, c]$, they behave like advantages clipped to $[-c, c]$. Notice that the regularized policy optimization with the popular entropy regularizer is equivalent to MPO with a uniform $\pi_{\text{prior}}$ (because $-H[\pi] = \mathrm{KL}(\pi, \pi_{\text{uniform}}) + \text{const}$). As a simple modification, we recommend replacing the uniform prior with a $\pi_{\text{prior}}$ based on a target network. That leads to the model-free direct MPO with normalized advantages, outperforming vanilla policy gradients (compare Figure 13 to Figure 1a).
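The link between the entropy regularizer and a KL to a uniform prior follows from the identity KL(π, uniform) = log|A| - H[π], where |A| is the number of actions. A minimal numerical check, with a hypothetical three-action policy:

```python
import math


def entropy(pi):
    """Shannon entropy H[pi] of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


pi = [0.7, 0.2, 0.1]           # hypothetical policy over three actions
uniform = [1.0 / len(pi)] * len(pi)

# KL(pi, uniform) + H[pi] is a constant that depends only on the number of actions.
constant = kl(pi, uniform) + entropy(pi)
```

Since the constant does not depend on the policy, maximizing entropy and minimizing the KL to a uniform prior have the same gradient.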

E.4 Rich representation of knowledge

4a) Estimate values (variance reduction, bootstrapping). In Muesli, the learned values are helpful for bootstrapping Retrace returns, for computing the advantages and for constructing $\pi_{\text{CMPO}}$. Q-values can also be helpful inside a search, as demonstrated by Hamrick et al. (2020a).

4b) Learn a model (representation, composability). Multiple works demonstrated benefits from learning a model. Like VPN and MuZero, Gregor et al. (2019) learn a multi-step action-conditional model; they learn the distribution of observations instead of actions and rewards, and focus on the benefits of representation learning in model-free RL induced by model-learning; see also (Guo et al., 2018, 2020). Springenberg et al. (2020) study an algorithm similar to MuZero with an MPO-like learning signal on the policy (similarly to SAC and Grill et al. (2020)) and obtain strong results on MuJoCo tasks in a transfer setting. Byravan et al. (2020) use a multi-step action model to derive a learning signal for policies on continuous-valued actions, leveraging the differentiability of the model and of the policy. Kaiser et al. (2019) show how to use a model for increasing data-efficiency on Atari (using an algorithm similar to Dyna (Sutton, 1990)), but see also van Hasselt et al. (2019) for the relation between parametric models and replay. Finally, Hamrick et al. (2020b) investigate drivers of performance and generalization in MuZero-like algorithms.

Agent      Alien  Amidar  Crazy Climber  Enduro  Frostbite  Krull  Ms. Pacman  QBert  Seaquest
TreeQN-1    2321    1030         107983     800       2254  10836        3030  15688      9302
TreeQN-2    2497    1170         104932     825        581  11035        3277  15970      8241
ATreeC-1    3448    1578         102546     678       1035   8227        4866  25159      1734
ATreeC-2    2813    1566         110712     649        281   8134        4450  25459      2176
Muesli     16218     524         143898    2344      10919  15195       19244  30937    142431
Table 3: The mean score from the last 100 episodes at 40M frames on games used by TreeQN and ATreeC. The agents differ along multiple dimensions.

Appendix F Experimental details

F.1 Common parts

Network architecture. The large MuZero network is used only in the large-scale Atari experiments (Figure 1b) and on Go. In all other Atari and MuJoCo experiments the network architecture is based on the IMPALA architecture (Espeholt et al., 2018). Like the LASER agent (Schmitt et al., 2020), we increase the number of channels 4 times. Specifically, the numbers of channels are: (64, 128, 128, 64), followed by a fully connected layer and an LSTM (Hochreiter & Schmidhuber, 1997) with 512 hidden units. This LSTM inside the IMPALA representation network is different from the second LSTM used inside the model dynamics function, described later. In the Atari experiments, the network takes one RGB frame as input. Stacking more frames would help, as evidenced in Figure 17.
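For readers who want to reproduce the setup, the sizes quoted above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and only the numbers come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RepresentationConfig:
    """Hypothetical container for the representation-network sizes quoted in the text."""

    conv_channels: Tuple[int, ...] = (64, 128, 128, 64)  # channels per block
    lstm_hidden_units: int = 512                         # LSTM after the fully connected layer
    input_frames: int = 1                                # one RGB frame per step on Atari


config = RepresentationConfig()
```

Keeping the configuration immutable (`frozen=True`) makes it harder to change a size accidentally between experiments.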

Q-network and model architecture. The original IMPALA agent did not learn a Q-function. Because we train a MuZero-like model, we can estimate the Q-values by:

$$\hat q_{\pi_{\text{prior}}}(s, a) = \hat r_1(s, a) + \gamma\, \hat v_1(s, a),$$

where $\hat r_1(s, a)$ and $\hat v_1(s, a)$ are the reward model and the value model, respectively. The reward model and the value model are based on MuZero dynamics and prediction functions (Schrittwieser et al., 2020). We use a very small dynamics function, consisting of a single LSTM layer with 1024 hidden units, conditioned on the selected action (Figure 10).

The decomposition of $\hat q_{\pi_{\text{prior}}}$ into a reward model and a value model is not crucial. The Muesli agent obtained a similar score with a model of the action-values $\hat q_{\pi_{\text{prior}}}(s, a)$ (Figure 14).
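The one-step Q-value estimate described here amounts to adding the predicted reward to the discounted predicted value of the next state, once per action. A minimal sketch, in which the function name and the numerical model outputs are hypothetical:

```python
def one_step_q_values(rewards, next_values, discount):
    """q(s, a) = r_1(s, a) + discount * v_1(s, a), evaluated for every action a."""
    return [r + discount * v for r, v in zip(rewards, next_values)]


# Hypothetical model outputs for a state with three actions.
predicted_rewards = [0.0, 1.0, -1.0]
predicted_next_values = [2.0, 0.5, 3.0]
q_values = one_step_q_values(predicted_rewards, predicted_next_values, discount=0.5)
```

In a real agent the two input lists would come from unrolling the learned model for one step per action; only that single step is needed, which keeps the cost linear in the number of actions considered.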

Figure 10: The model architecture when using the IMPALA-based representation network. The predicts the reward . The