## 1 Introduction

Planning algorithms based on lookahead search have been very successful for decision-making problems with known dynamics, such as board games silver2017masteringshogi; silver2016mastering; silver2017mastering and simulated robot control tassa2012synthesis. However, to apply planning algorithms to more general tasks with unknown dynamics, the agent needs to learn the dynamics model from its interactions with the environment. Although learning the dynamics model has been a long-standing challenge, planning with a learned model has several benefits, including data efficiency, better performance, and adaptation to different tasks hafner2018learning. Recently, the model-based reinforcement learning algorithm MuZero schrittwieser2019mastering was proposed to extend planning to more general environments by learning the dynamics model from experience. Building upon AlphaZero’s silver2017mastering powerful search and search-based policy iteration algorithms, MuZero achieves state-of-the-art performance on Atari 2600 games, a visually rich domain, and on board games that require precision planning.

However, while MuZero can efficiently solve problems with high-dimensional observation spaces, it can only handle environments with discrete action spaces. Many real-world applications, especially physical control problems, require agents to sequentially choose actions from continuous action spaces. While discretizing the action space is a possible way to adapt MuZero to continuous control problems, the number of actions increases exponentially with the number of degrees of freedom. Moreover, action space discretization cannot preserve information about the structure of the action domain, which may be essential for solving many problems lillicrap2015continuous.

In this paper, we provide a possible way and the necessary theoretical results to extend the MuZero algorithm to continuous action space environments. More specifically, to enable the tree search process to handle the continuous action space, we use the progressive widening strategy chaslot2008progressive, which gradually adds actions from the action space to the search tree. For the parameterization of the policy network output, which aims to narrow down the search to high-probability moves, we use a Gaussian distribution to represent the policy and learn the statistics of the distribution from experience data sutton2018reinforcement. For policy training, a loss function is derived to match the policy predicted by the policy network to the search policy produced by the Monte Carlo Tree Search (MCTS) simulation process. With these extensions, we show that the proposed algorithm, continuous MuZero, outperforms the soft actor-critic (SAC) method haarnoja2018soft in relatively low-dimensional MuJoCo environments.

This paper is organized as follows. Section 2 presents related work on model-based reinforcement learning and tree search algorithms in continuous action spaces. In Section 3 we describe MCTS with the progressive widening strategy in continuous action space. Section 4 covers the network loss function and the algorithm training process. Section 5 presents the numerical experiment results, and Section 6 concludes this paper.

## 2 Related Work

In this section, we briefly review the model-based reinforcement learning algorithms and the Monte Carlo Tree Search (MCTS) in continuous action space.

### 2.1 Model-based Reinforcement Learning

Reinforcement learning algorithms are often divided into model-free and model-based algorithms Sutton1998. Model-based reinforcement learning algorithms learn a model of the environment, which they can use for planning and for predicting future steps. This gives the agent an advantage in solving problems that require sophisticated lookahead. A classic approach is to directly model the dynamics of the observations Sutton:1991b; NIPS2018_7512; Kaiser2020Model; hafner2018learning. Model-based reinforcement learning algorithms like Dyna-Q Sutton90integratedarchitectures; NIPS2009_3670 combine model-free and model-based approaches, using the model to generate samples that augment those obtained through interaction with the environment. Dyna-Q has been adapted to continuous control problems in 10.5555/3045390.3045688. This approach requires learning to reconstruct the observations without distinguishing between useful information and irrelevant details.

The MuZero algorithm schrittwieser2019mastering avoids this by encoding observations into a hidden state without requiring it to capture all the information necessary to reconstruct the original observation, which reduces computation by focusing on the information useful for planning. From the encoded hidden state (which has no environment-state semantics attached to it), MuZero predicts the quantities necessary for sophisticated lookahead: the policy and value function based on the current hidden state, and the reward and next hidden state based on the current hidden state and the selected action. MuZero uses its model to plan with an MCTS search, which outputs an improved policy target. It is quite similar to value prediction networks NIPS2017_7192 but uses a predicted policy in addition to the value to prune the search space. However, most of this work addresses discrete action spaces, whereas many real-world reinforcement learning domains have continuous action spaces. The most successful methods in domains with a continuous action space remain model-free algorithms wang2019benchmarking; pmlr-v80-haarnoja18b; schulman2017proximal.

### 2.2 MCTS with Continuous Action Space

Applying the tree search algorithm to the continuous action case creates a tension between exploring a larger set of candidate actions to cover more of the action space, and exploiting the current candidate actions to evaluate them more accurately through deeper search and more execution outcomes. Several recent works have sought to adapt tree search to continuous action spaces.

tesauro1997line proposed truncated Monte Carlo, which prunes away both candidate actions that are unlikely to be the best action and candidates with values close to the current best estimate (i.e., choosing either one would not make a significant difference). Similarly, AlphaGo silver2016mastering uses a trained policy network to narrow down the search to high-value actions. The classical approach of progressive widening (or unpruning) coulom2007computing; chaslot2008progressive; couetoux2011continuous; yee2016monte can handle continuous action spaces by considering a slowly growing discrete set of sampled actions, and has been theoretically analyzed in wang2009algorithms. mansley2011sample replaces the Upper Confidence Bound (UCB) method kocsis2006bandit in the MCTS algorithm with Hierarchical Optimistic Optimization (HOO) bubeck2011x, an algorithm with theoretical guarantees in continuous action spaces. However, the HOO method has quadratic running time, which makes it intractable in games that require extensive planning. In this paper, we apply the progressive widening strategy for its computational efficiency (it does not increase computation time during the tree search process).

## 3 Monte Carlo Tree Search in Continuous Action Space

We now describe the details of the MCTS algorithm in continuous action space, with a variant of the UCB kocsis2006bandit algorithm named PUCB (Predictor + UCB) rosin2011multi. UCB has some promising properties: it’s very efficient and guaranteed to be within a constant factor of the best possible bound on the growth of regret (defined as the expected loss due to selecting sub-optimal action), and it can balance exploration and exploitation very well kocsis2006bandit. Building upon UCB, PUCB incorporates the prior information for each action to help the agent select the most promising action, which can bring benefits especially for large/continuous action space.

For the MCTS algorithm with a discrete action space, the PUCB score rosin2011multi can be evaluated for all actions, and the action with the maximum PUCB score is selected. However, when the action space becomes large or continuous, it is impossible to enumerate the PUCB score for all possible actions. In such scenarios, the progressive widening strategy deals with the large or continuous action space by artificially limiting the number of actions in the search tree based on the number of visits to the node, and by slowly growing the discrete set of sampled actions during the simulation process. After the quality of the best available action is estimated well, additional actions are taken into consideration. More specifically, at the beginning of each action selection step, the algorithm either improves the estimated value of the current child actions in the search tree by selecting the action with the maximum PUCB score, or explores untried actions by adding an action under the current node. This decision is made by keeping the number of child actions of a node bounded by a sublinear function of the number of visits to the current node, denoted as $n(s)$:

$$C \cdot n(s)^{\alpha} \tag{1}$$

In Eq. 1, $C$ and $\alpha$ are two parameters that balance whether the MCTS algorithm should cover more actions or improve the estimate of a few actions. At each selection step, if the number of child actions of a node is smaller than $C \cdot n(s)^{\alpha}$, a new action is added as a child action of the current node. Otherwise, the agent selects an action from the current child actions according to their PUCB scores. The PUCB score ensures that the tree grows deeper more quickly in the promising parts of the search tree. The progressive widening strategy additionally makes the tree grow wider, so that some parts of the search tree are explored more.
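The widening decision can be illustrated with a minimal sketch. It assumes the bound in Eq. 1 takes the common form $C \cdot n(s)^{\alpha}$ and uses the values reported in Appendix C ($\alpha = 0.49$, $C = 1$); the function names `widening_limit` and `should_add_action` are hypothetical and not taken from the authors' code.

```python
def widening_limit(visits: int, c: float = 1.0, alpha: float = 0.49) -> float:
    """Bound on the number of child actions, assumed to be C * n(s)^alpha."""
    return c * visits ** alpha


def should_add_action(num_children: int, visits: int,
                      c: float = 1.0, alpha: float = 0.49) -> bool:
    """Return True when a new action should be sampled and added to the node.

    Otherwise the search would pick among the existing children by PUCB score.
    """
    return num_children < widening_limit(visits, c, alpha)


if __name__ == "__main__":
    # Show how slowly the bound grows with the visit count.
    for visits in (1, 10, 50):
        print(visits, round(widening_limit(visits), 2))
```

With 50 visits the bound is between 6 and 7, which is consistent with the roughly 7 root actions mentioned in Appendix C.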

To represent the action probability density function in continuous action space, we let the policy network learn a Gaussian distribution as the policy distribution. The policy network outputs $\mu$ and $\sigma$ as the mean and standard deviation of the normal distribution. Given the mean and standard deviation, the action is sampled from the distribution

$$a \sim \mathcal{N}(\mu, \sigma^{2}) \tag{2}$$

In the tree search process of the continuous MuZero algorithm, every node of the search tree is associated with a hidden state $s$, obtained either through the representation network or through the dynamics network prediction. For each action $a$ currently in the search tree (note that the number of child actions of each node keeps changing during the simulation process), there is an edge $(s, a)$ that stores a set of statistics $\{N(s,a), Q(s,a), \mu(s,a), \sigma(s,a), R(s,a), S(s,a)\}$, respectively representing the visit count $N$, mean value $Q$, policy mean $\mu$, policy standard deviation $\sigma$, reward $R$, and state transition $S$. Similar to the MuZero algorithm, the search is divided into three stages, repeated for a number of simulations.

Selection: Each simulation starts from the current root node and repeats the selection step until it reaches an unexpanded node that has no child actions. At each hypothetical time step, a decision is made by comparing the number of child actions of the current node with the value in Eq. 1.

If the number of child actions has reached the value in Eq. 1, an action is selected according to the stored statistics of the current node, by maximizing the PUCB score rosin2011multi; silver2018general

(3) |

Note that, unlike in the MuZero algorithm, the prior value here is normalized from the policy probability density value in Eq. 2, since the density value can be unbounded and the PUCB algorithm requires the prior values to be positive and to sum to 1.

(4) |

The constants $c_1$ and $c_2$ control the influence of the prior relative to the value, and follow the same parameter setting as the MuZero algorithm.
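The prior normalization and the score computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the score uses the PUCB form published for MuZero with the constants 1.25 and 19652 from Appendix C, the prior is normalized by dividing each child's Gaussian density by the sum over the current children, and all function names are hypothetical.

```python
import math


def gaussian_pdf(a: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at a; it can exceed 1 when sigma is small."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalized_priors(actions, mu, sigma):
    """Turn density values of the sampled child actions into priors summing to 1."""
    densities = [gaussian_pdf(a, mu, sigma) for a in actions]
    total = sum(densities)
    return [d / total for d in densities]


def pucb_score(q, prior, parent_visits, child_visits, c1=1.25, c2=19652.0):
    """MuZero-style PUCB score (assumed form).

    parent_visits is the total visit count over all children of the node.
    """
    exploration = math.sqrt(parent_visits) / (1 + child_visits)
    exploration *= c1 + math.log((parent_visits + c2 + 1.0) / c2)
    return q + prior * exploration
```

For example, `normalized_priors([-0.5, 0.0, 0.5], mu=0.0, sigma=0.1)` assigns almost all prior mass to the action at the mean, even though the raw density there is larger than 1.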

If the number of child actions is smaller than the value in Eq. 1, the agent selects a new action from the action space, adds it to the search tree, and expands this new edge. moerland2018a0c proposed to sample a new action according to the mean and standard deviation values output by the policy network and stored at the parent node, which can effectively prune away child actions with low prior value. In this paper, we adopt this simple strategy, since our focus is on a first step toward extending the MuZero algorithm to continuous action spaces. We expect that the algorithm's performance can be further improved with a better action sampling strategy, such as yee2016monte; lee2018deep.

Expansion: When the agent reaches an unexpanded node, either through PUCB score maximization or through the progressive widening strategy, the selection stage finishes and the new node is expanded (we denote this final hypothetical time step of the selection stage as $l$). In the node expansion process, the reward and next state are first computed by the dynamics function from the current state-action pair. Given the next state, the policy and value are then computed by the prediction function. After these predictions, the new node corresponding to the next state is added to the search tree under the edge that was just traversed.

In the MuZero algorithm, with a finite number of actions, each edge from the newly expanded node is initialized according to the predicted policy distribution. Since the action space is now continuous, only one edge, with an action value randomly sampled from the predicted policy distribution, is added to the newly expanded node. The statistics for this edge are then initialized, with the predicted policy mean and standard deviation used to determine the probability density prior for the sampled action. Similar to the MuZero and AlphaZero algorithms, the search algorithm with the progressive widening strategy makes at most one call to the dynamics function and one call to the prediction function per simulation, maintaining the same order of computational cost.

Backup: In the general MCTS algorithm, there is a simulation step that performs one random playout from the newly expanded node to a terminal state of the game. However, in the MuZero algorithm, each step from an unvisited node requires one call to the dynamics function and one to the prediction function, which makes playouts intractable for games that need a long trajectory to finish. Thus, similar to the MuZero algorithm, immediately after the Expansion step, the statistics (the mean value and the visit count) for each edge in the simulation path are updated based on the cumulative discounted reward, bootstrapping from the value function of the newly expanded node.
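The backup step can be sketched as below. This is a simplified illustration in the style of the publicly released MuZero pseudocode, with one assumption that differs from the text: statistics are stored on nodes rather than on edges, and each node's `reward` is the predicted reward for the transition into it. The `Node` class and `backup` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    reward: float = 0.0      # predicted reward for the transition into this node
    value_sum: float = 0.0   # sum of all returns backed up through this node
    visit_count: int = 0

    def mean_value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def backup(path, leaf_value: float, discount: float) -> None:
    """Propagate the bootstrapped return from the new leaf back to the root."""
    value = leaf_value
    for node in reversed(path):
        node.value_sum += value
        node.visit_count += 1
        # Discounted return as seen from the parent of this node.
        value = node.reward + discount * value
```

Each simulation therefore updates every node on the path once, without any random playout beyond the newly expanded node.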

## 4 Neural Network Training in Continuous Action Space

In the MuZero algorithm, the parameters of the representation, dynamics, and prediction networks are trained jointly, through backpropagation-through-time, to predict the policy, the value function, and the reward. At each training step, a trajectory of consecutive steps is sampled from the replay buffer, from which the network targets for the policy, value, and reward are calculated. The loss functions for the policy, value, and reward are designed to minimize the difference between the network predictions and their respective targets. In the following, we describe how we design the loss function for the policy network in continuous action space, and briefly review the loss function for the value/reward network.

### 4.1 Policy Network

Similar to other policy-based algorithms schulman2015trust; duan2016benchmarking, the policy network outputs the mean and the standard deviation of a Gaussian distribution. For the policy mean, we use a fully connected MLP with leaky ReLU nonlinearities for the hidden layers and a tanh activation for the output layer. A separate fully connected MLP specifies the log standard deviation, which also depends on the state input.

For the training target of the policy network, we want to transform the MCTS result at the root node into a continuous target probability density. In general, to estimate the density of a probability distribution with continuous support, independent and identically distributed (i.i.d.) samples are drawn from the underlying distribution perez2008kullback. However, the MCTS algorithm only returns a finite number of actions with different visit counts (for a given number of simulations, the number of root actions is given by the value in Eq. 1). Thus, we assume here that the density value of the target distribution at a root action is proportional to its visit count

(5) |

where specifies the temperature parameter, and is a normalization term that only depends on the current state and the temperature parameter.

For the loss function, we use the Kullback-Leibler divergence between the network output and the empirical density from the MCTS result.

(6) |

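In general, since the KL divergence between two distributions is asymmetric, we have the choice of minimizing either $\mathrm{KL}(\hat{\pi} \,\|\, \pi_\theta)$ or $\mathrm{KL}(\pi_\theta \,\|\, \hat{\pi})$, where $\pi_\theta$ denotes the policy network output and $\hat{\pi}$ the empirical density from the MCTS result. As illustrated in Fig. 1 from Goodfellow-et-al-2016, the choice of direction is problem dependent. The loss $\mathrm{KL}(\hat{\pi} \,\|\, \pi_\theta)$ requires $\pi_\theta$ to place high probability anywhere that $\hat{\pi}$ places high probability, while the loss $\mathrm{KL}(\pi_\theta \,\|\, \hat{\pi})$ requires $\pi_\theta$ to rarely place high probability anywhere that $\hat{\pi}$ places low probability. Since we want the trained policy network to prune undesired actions with low return, we choose $\mathrm{KL}(\pi_\theta \,\|\, \hat{\pi})$ as the policy loss function.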

Also note that the empirical density from the search result does not define a proper density, as we never specify the density value in between the finite support points. However, through the following theorem, we show that even if we only consider the loss at the support points, as shown in Eq. 7, the expectation of the loss function equals the true KL divergence if we sample actions according to the policy network prediction.

(7) |

###### Theorem 1.

If the actions are sampled according to , then for the empirical estimator of the policy loss function, we have

(8) |

Furthermore, the variance of the empirical estimator converges to 0 as the number of sampled actions grows.

We provide the proof in the Appendix. Theorem 1 states that if the actions are sampled according to the policy network prediction, then the empirical estimator for the policy loss function in Eq. 6 is an unbiased estimator.
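The statement of Theorem 1 can be checked numerically in a simplified setting. The sketch below is an illustration only: it replaces the MCTS target density with a second Gaussian so that the true KL divergence has a closed form, samples actions from the first Gaussian (playing the role of the policy network output), and averages the log-density difference at the sampled points. All names and the specific parameter values are assumptions made for this example.

```python
import math
import random


def gaussian_logpdf(a, mu, sigma):
    z = (a - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def empirical_kl(mu_q, sigma_q, mu_p, sigma_p, num_samples, rng):
    """Average of log q(a) - log p(a) over actions a sampled from q."""
    total = 0.0
    for _ in range(num_samples):
        a = rng.gauss(mu_q, sigma_q)
        total += gaussian_logpdf(a, mu_q, sigma_q) - gaussian_logpdf(a, mu_p, sigma_p)
    return total / num_samples


def closed_form_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two univariate Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 1000, 100000):
        print(n, empirical_kl(0.0, 1.0, 1.0, 2.0, n, rng),
              closed_form_kl(0.0, 1.0, 1.0, 2.0))
```

As the number of sampled actions increases, the empirical average settles around the closed-form value, which mirrors the unbiasedness and vanishing-variance claims of the theorem.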

By substituting Eq. 5 into the estimator in Eq. 7, we can further simplify the estimator:

(9) |

where the normalization term from Eq. 5 can be dropped since it does not depend on the neural network weights or the action, which means it is a constant given a specific data sample. Thus the policy loss function becomes

(10) |

We also give the derivation of the expected gradient for the empirical loss , with the details provided in the Appendix:

(11) |

Also note that for the estimator to be unbiased, the actions need to be sampled according to the current policy distribution, which depends on the mean and standard deviation predicted by the policy network. However, for the experience data sampled from the replay buffer, the actions were sampled according to the old policy network prediction during the self-play phase. In this paper, we replace the expectation over the current policy with the empirical support points from the old policy distribution, that is, the policy given by the old network weights used during the self-play phase. Although the empirical estimate becomes biased with this replacement, we did not observe it to affect the performance of the proposed algorithm in our numerical experiments. Future work beyond this paper would include using weighted sampling to obtain an unbiased estimate of the policy loss. The final policy loss function and its gradient become

(12) |

(13) |

### 4.2 Value/Reward Network

For the value/reward network, we follow the same setting as the MuZero algorithm, and we briefly review the details in this subsection. Following pohlen2018observe, the value/reward targets are scaled using an invertible transform

(14) |

where in our experiments. We then apply a transformation to the scalar reward and value targets to obtain equivalent categorical representations. For the reward/value target, we use a discrete support set of size 21 with one support for every integer between -10 and 10. Under this transformation, each scalar is represented as the linear combination of its two adjacent supports, such that the original value can be recovered by .

During inference, the actual value and rewards are obtained by first computing their expected value under their respective softmax distributions and then inverting the scaling transformation in Eq. 14. Scaling and transformation of the value and reward happen transparently on the network side and are not visible to the rest of the algorithm.
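The scaling and categorical encoding described above can be sketched as follows. The sketch assumes the transform of pohlen2018observe in its usual form, $h(x) = \mathrm{sign}(x)(\sqrt{|x|+1}-1) + \varepsilon x$, with $\varepsilon = 0.001$ as used by MuZero; that value is an assumption here and is not confirmed by the text. The function names are hypothetical.

```python
import math

EPS = 0.001  # assumed value (the one used by MuZero); not stated in the text


def h(x, eps=EPS):
    """Invertible scaling transform applied to value/reward targets."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x


def h_inverse(y, eps=EPS):
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|x|+1)."""
    root = math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps))
    return math.copysign(((root - 1.0) / (2.0 * eps)) ** 2 - 1.0, y)


def to_categorical(x, low=-10, high=10):
    """Spread a scalar over its two adjacent integer supports."""
    x = min(max(x, low), high)
    floor = math.floor(x)
    upper_weight = x - floor
    probs = [0.0] * (high - low + 1)
    probs[floor - low] = 1.0 - upper_weight
    if upper_weight > 0.0:
        probs[floor - low + 1] = upper_weight
    return probs


def from_categorical(probs, low=-10):
    """Recover the scalar as the expected value over the supports."""
    return sum(p * (low + i) for i, p in enumerate(probs))
```

A training target would be built as `to_categorical(h(x))`, and a prediction decoded as `h_inverse(from_categorical(probs))`, where `probs` is the softmax output of the network.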

With the above formulation, the loss for the value/reward has the following form

(15) |

(16) |

where are the value/reward targets, are the value/reward network output, and denotes the transformation from the scalar values to the categorical representations.

### 4.3 Loss Function

For reinforcement learning algorithms in continuous action space, there is a risk that the policy network may converge prematurely, hence losing any exploration haarnoja2018soft. To learn a policy that acts as randomly as possible while still being able to succeed at the task, in the numerical experiment we also add an entropy loss to the policy:

(17) |

With the above formulation, the loss function for the proposed continuous MuZero algorithm is a weighted sum of the loss functions described above, with a weight normalization term:

(18) |

where controls the contribution of the entropy loss and is the coefficient for the weight normalization term.

### 4.4 Network Training

The original version of the MuZero algorithm uses a residual network, which is suitable for image processing in Atari environments and board games. In our experiments with MuJoCo environments, we replace the networks (policy network, value network, reward network, and dynamics network) with fully connected networks. The hyperparameter and network architecture details are provided in the Appendix.

During the data generation process, we use 3 actors deployed on the CPU to continuously generate experience data using the proposed algorithm, periodically pulling the most recent network weights. The same exploration scheme as in the MuZero algorithm is used, where the visit count distribution is parameterized by a temperature parameter that balances exploitation and exploration.

At each training step, an episode is first sampled from the replay buffer, and then consecutive transitions are sampled from the episode. During the sampling process, the samples are drawn according to prioritized replay schaul2015prioritized. The priority for transition is , where is determined through the difference between the search value and the observed n-step return. The priority for an episode equals the mean priorities of all the transitions in this episode. To correct for sampling bias introduced by the prioritized sampling, we scale the loss using the importance sampling ratio .
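A minimal sketch of prioritized sampling with importance weights is given below. It follows the standard prioritized replay formulation of schaul2015prioritized, where the per-transition priority is the absolute difference between the search value and the observed n-step return; the exponents `alpha` and `beta` and all function names are hypothetical, and their defaults are not the values used in the experiments.

```python
def sampling_probabilities(priorities, alpha=1.0):
    """Sampling probability of each item, proportional to priority^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=1.0):
    """Importance sampling ratios that correct for the non-uniform sampling."""
    n = len(probabilities)
    return [(1.0 / (n * p)) ** beta for p in probabilities]


def episode_priority(transition_priorities):
    """Priority of an episode, taken as the mean priority of its transitions."""
    return sum(transition_priorities) / len(transition_priorities)


if __name__ == "__main__":
    # Priorities are |search value - n-step return| for each transition.
    priorities = [abs(v - z) for v, z in [(1.0, 0.0), (2.0, 5.0)]]
    probs = sampling_probabilities(priorities)
    print(probs, importance_weights(probs))
```

With `beta = 1`, the weights fully compensate for the sampling bias: the probability-weighted average of the importance weights equals 1.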

To maintain a similar magnitude of the gradient across different unroll steps, we scale the gradient following the MuZero algorithm. To improve the learning process and bound the activation function output, we scale the hidden state to the same range as the action input:

(19) |

## 5 Experimental Results

In this section, we show the preliminary experimental results on two relatively low-dimensional MuJoCo environments, compared with Soft Actor-Critic (SAC), a state-of-the-art model-free deep reinforcement learning algorithm. For the comparison with the SAC algorithm, the stable baselines stable-baselines implementation was used, with the same parameter setting following the SAC paper haarnoja2018soft.

We conducted experiments on the InvertedPendulum-v2 and InvertedDoublePendulum-v2 tasks over 5 random seeds, and the results are shown in Fig. 2. From this plot we can see that the proposed continuous MuZero algorithm consistently outperforms the SAC algorithm. Our proposed algorithm converges to the optimal score after training for 4k steps on Inverted Pendulum and 9k steps on Inverted Double Pendulum, thereby achieving better data efficiency.

In Fig. 3, we also varied the number of simulations in the experiments to illustrate its effect on the continuous MuZero algorithm. The algorithm trained and played with more simulations is able to converge faster, which corresponds to our intuition, since the simulation number determines the size of the search tree, where a higher number allows the action space to be explored more (resulting in a wider tree) and estimated with greater precision (resulting in a deeper tree), at the cost of more intensive calculations.

## 6 Conclusion

This paper provides a possible way and the necessary theoretical results to extend the MuZero algorithm to continuous action space environments. We propose a loss function for the policy in the continuous action case that helps the policy network match the search results of the MCTS algorithm. Progressive widening is used to gradually extend the set of actions in the search tree, which is an effective strategy for dealing with large or continuous action spaces. Preliminary results on two low-dimensional MuJoCo environments show that our approach consistently outperforms the soft actor-critic (SAC) algorithm. Future work will further explore the empirical performance of the continuous MuZero algorithm on MuJoCo environments with higher dimensions, since the adaptation of the MuZero algorithm proposed in this paper can be easily extended to higher-dimensional action spaces. Improving the selection process in MCTS progressive widening could also be a future direction to help speed up the algorithm's convergence.

## Broader Impact

In this paper, we introduce the continuous MuZero algorithm, which achieves state-of-the-art performance on some low-dimensional continuous control tasks. Although the experiments are for MuJoCo tasks, we broaden our focus to consider the longer-term impacts of developing decision-making agents with planning capabilities. Such capabilities could be applied to a range of domains, such as robotics, games, business management, finance, and transportation. Improvements to decision-making strategy likely have complex effects on welfare, depending on how these capabilities are distributed and the character of the strategic setting. For example, depending on who can use this scientific advance, such as criminals or well-motivated citizens, this technology may be socially harmful or beneficial.

## References

## Appendix A Proof of Theorem 1

###### Theorem 1.

If the actions are sampled according to , then for the empirical estimator of the policy loss function

(20) |

we have

(21) |

Furthermore, the variance of the empirical estimator converges to 0 as the number of sampled actions grows.

###### Proof.

In the above theorem, the actions are in a one-dimensional space. However, the proof is also valid for actions in higher-dimensional spaces. Thus, in the following proof, we write the actions in vector form to provide a more general proof.

We simplify the left hand side of Eq. 21.

(22) |

Since we assume that , we have

(23) |

from which we show the unbiasedness of the estimator .

Furthermore, due to the Strong Law of Large Numbers, as we increase the number of samples, the estimator becomes a closer and closer approximation:

(24) |

For the variance of the estimator, we have

(25) |

For a given state , is fixed and we denote it as , thus we have

(26) |

which converges to 0 as . ∎

## Appendix B Derivations of the policy loss gradient

Here we give the derivation of the expected gradient for the empirical loss :

(27) |

Similar to the proof of Theorem 1, this derivation is also valid for actions in higher dimensional space, and we will use and in vector form to provide a more general derivation.

From

(28) |

we have

(29) |

which follows directly from the fact that

(30) |

and

(31) |

## Appendix C Experiment parameters

Here we provide the details of the hyperparameters used in the experiments. We use the same set of parameters for both environments, InvertedPendulum-v2 and InvertedDoublePendulum-v2. We set the progressive widening exponent $\alpha$ to 0.49. In this case, when the agent runs 50 simulations at each decision-making step, a total of 7 actions are sampled from the root state, and each child action, as the root of a subtree, receives roughly 7 simulations. To speed up convergence, we scaled the reward by 1/20, similar to the soft actor-critic algorithm. The smaller value/reward scale enables us to use a smaller value/reward support size of 21 compared to the MuZero algorithm. For the networks, we used a relatively simple architecture with a single hidden layer of 64 neurons for the dynamics, reward, value, and policy mean networks, and no hidden layers for the representation and policy log std networks. For the optimizer, we observed that the Adam optimizer converges faster than stochastic gradient descent. The learning rate is kept fixed at 0.003 during the whole training process. Additionally, to avoid network overfitting, we keep the ratio of self-played games to training steps fixed at 1/5.

| Hyperparameter | Value |
| --- | --- |
| Number of simulations | 50 |
| Discount factor | 0.997 |
| PUCB $c_1$ | 1.25 |
| PUCB $c_2$ | 19652 |
| Progressive widening $\alpha$ | 0.49 |
| Progressive widening $C$ | 1 |
| Support size | 21 |
| Encoding size | 15 |
| Batch size | 128 |
| Entropy loss weight | 0.1 |
| Optimizer | Adam |
| Weight decay | |
| Learning rate | 0.003 |
| Representation network hidden layer size | None |
| Dynamics network hidden layer size | [64] |
| Reward network hidden layer size | [64] |
| Value network hidden layer size | [64] |
| Policy mean network hidden layer size | [64] |
| Policy log std network hidden layer size | None |
| Replay buffer size | 700 |
| Number of unroll steps | 10 |
| Number of TD steps | 50 |
| Prioritized replay | 0.1 |
| Prioritized replay | 1 |
| Self played games per training steps | 1/5 |
| Reward scale | 1/20 |
| Temperature | 1 |
