Introduction
In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control in reinforcement learning (RL). In continuous control, a deterministic policy (Silver et al. 2014), which is a mapping from states to actions, is a recent approach. In contrast, a stochastic policy is a mapping from a state to a probability distribution over actions.
Recently, neural networks have achieved great success as function approximators in various challenging domains (Tesauro 1995; Mnih et al. 2015; Silver et al. 2016). A deterministic policy parameterized by a neural network is usually trained via gradient ascent to maximize the critic, a state-action value function that is also parameterized by a neural network (Silver et al. 2014; Lillicrap et al. 2015; Barth-Maron et al. 2018). However, gradient ascent can easily be trapped by local maxima during the search for the global maximum. We utilize the ensemble technique to mitigate this issue. We train multiple actors (i.e., deterministic policies) in parallel, each with a different initialization. In this way, each actor is in charge of maximizing the state-action value function in a local area, and different actors may be trapped in different local maxima. By considering the maximum state-action value of the actions proposed by all the actors, we are more likely to find the global maximum of the state-action value function than with a single actor.
ACE fits in with the option framework (Sutton, Precup, and Singh 1999). First, each option has its intra-option policy, which maximizes the return in a certain area of the state space. Similarly, an actor in ACE maximizes the critic in a certain area of the critic's domain. It may be difficult for a single actor to maximize the critic over the whole domain due to the complexity of the critic's manifold. In contrast, the search is easier if we only ask an actor to find the best action in a local neighborhood of the action space. Second, we feed the outputs of all the actors to the critic, enabling a selection over the locally optimal action values. In this way, the critic works similarly to the policy over options in the option framework. We quantify this similarity between ensembles and options by extending the option-critic architecture (OCA, Bacon, Harb, and Precup 2017) with deterministic intra-option policies. In particular, we provide the Deterministic Intra-option Policy Gradient theorem, based on which we show that the actor ensemble in ACE is a special case of the general option-critic framework.
To make the state-action value function more accurate, which is essential for the actor selection, we perform a lookahead tree search with the multiple actors. Lookahead tree search has achieved great success in various discrete-action problems (Knuth and Moore 1975; Browne et al. 2012; Silver et al. 2016; Oh, Singh, and Lee 2017; Farquhar et al. 2018). Recently, lookahead tree search was extended to continuous-action problems. For example, Mansley, Weinstein, and Littman (2011) combined planning with adaptive discretization of a continuous action space, resulting in a performance boost in continuous bandit problems. Nitti, Belle, and De Raedt (2015) utilized probabilistic programming for planning in continuous action spaces. Yee et al. (2016) used kernel regression to generalize values between explored and unexplored actions, resulting in a new Monte Carlo tree search algorithm. However, to the best of our knowledge, a general tree search algorithm for continuous-action problems is still missing. In ACE, we use the multiple actors as meta-actions to perform a tree search with the help of a learned value prediction model (Oh, Singh, and Lee 2017; Farquhar et al. 2018).
We demonstrate the superiority of ACE over DDPG and its variants empirically in Roboschool (https://github.com/openai/roboschool), a challenging physical robot environment.
In the rest of this paper, we first present some preliminaries of RL. Then we detail ACE and show some empirical results, after which we discuss some related work, followed by closing remarks.
Preliminaries
We consider a Markov Decision Process (MDP) which consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, a transition kernel $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$, and a discount factor $\gamma \in [0, 1)$. We use $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ to denote a stochastic policy. At each time step $t$, an agent is at state $s_t$ and selects an action $a_t \sim \pi(\cdot | s_t)$. Then the environment gives the reward $r_{t+1}$ and leads the agent to the next state $s_{t+1}$ according to $p(\cdot | s_t, a_t)$. We use $Q^\pi(s, a)$ to denote the state-action value of a policy $\pi$, i.e., $Q^\pi(s, a) \doteq \mathbb{E}_\pi [\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a]$. In an RL problem, we are usually interested in finding an optimal policy $\pi^*$ s.t. $Q^{\pi^*}(s, a) \geq Q^\pi(s, a)$ for all $(\pi, s, a)$. All the optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$,

(1) $(\mathcal{T} Q)(s, a) \doteq \mathbb{E}[r(s, a)] + \gamma \mathbb{E}_{s' \sim p(\cdot | s, a)} [\max_{a'} Q(s', a')]$
where we use $Q$ to indicate an estimate of $Q^*$. With a tabular representation, Q-learning (Watkins and Dayan 1992) is a commonly used method to find this fixed point. The per-step update is

(2) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big)$

where $\alpha$ is a step size.
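As an illustration, the update in Equation (2) can be sketched in a few lines on a toy problem. The 4-state chain environment below is a hypothetical example for illustration, not from the paper:

```python
import numpy as np

# Tabular Q-learning (Equation 2) on an assumed 4-state chain:
# action 1 moves right, action 0 moves left; taking action 1 from the
# last non-terminal state yields reward 1 and ends the episode.
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Toy deterministic dynamics for the chain."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    r = 1.0 if (s == n_states - 2 and a == 1) else 0.0
    done = s_next == n_states - 1
    return r, s_next, done

rng = np.random.default_rng(0)
for _ in range(200):
    s = 0
    while True:
        a = int(rng.integers(n_actions))     # uniform exploration
        r, s_next, done = step(s, a)
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])  # Equation (2)
        if done:
            break
        s = s_next
```

After enough visits, `Q` approaches the fixed point of Equation (1): the optimal values decay by a factor of `gamma` per step of distance to the rewarding transition.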
Recently, Mnih et al. (2015) used a deep convolutional neural network to parameterize an estimate $Q$, resulting in the Deep-Q-Network (DQN). At each time step $t$, DQN samples a transition $(s, a, r, s')$ from a replay buffer (Lin 1992) and performs stochastic gradient descent to minimize the loss

$\frac{1}{2} \big( r + \gamma \max_{a'} Q^-(s', a') - Q(s, a) \big)^2$

where $Q^-$ is a target network (Mnih et al. 2015), which is a copy of $Q$ and is synchronized with $Q$ periodically.
Continuous Action Control
In continuous control problems, it is not straightforward to apply Q-learning, as the maximization over a continuous action space is itself an expensive optimization problem. The basic idea of deterministic policy algorithms (Silver et al. 2014) is to use a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ to approximate the greedy action $\arg\max_a Q(s, a)$. The deterministic policy is trained via gradient ascent according to the chain rule. The per-step gradient is

(3) $\nabla_a Q(s_t, a) |_{a = \mu(s_t)} \nabla_\phi \mu(s_t)$

where we assume $\mu$ is parameterized by $\phi$. With this actor $\mu$, we are able to update the state-action value function as usual. To be more specific, assuming $Q$ is parameterized by $\theta$, we perform gradient descent to minimize the following loss:

$\frac{1}{2} \big( r_{t+1} + \gamma Q(s_{t+1}, \mu(s_{t+1})) - Q(s_t, a_t) \big)^2$
Recently, Lillicrap et al. (2015) used neural networks to parameterize $\mu$ and $Q$, resulting in the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy control algorithm with experience replay and a target network.
In the on-policy setting, the gradient in Equation (3) guarantees policy improvement thanks to the Deterministic Policy Gradient theorem (Silver et al. 2014). In the off-policy setting, the policy improvement of this gradient relies on the Off-policy Policy Gradient theorem (OPG, Degris, White, and Sutton 2012). However, the errata of Degris, White, and Sutton (2012) show that OPG holds only for a tabular representation and certain linear function approximations. So far, no policy improvement guarantee exists for DDPG. Nevertheless, DDPG and its variants have achieved great success in various challenging domains (Lillicrap et al. 2015; Barth-Maron et al. 2018; Tassa et al. 2018). This success may be attributed to the gradient ascent via the chain rule in Equation (3).
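The chain-rule update in Equation (3) can be sketched with a hand-crafted critic whose action-gradient is known in closed form. The quadratic critic and linear actor below are illustrative assumptions, not the networks used by DDPG:

```python
import numpy as np

# A minimal sketch of the chain-rule actor update (Equation 3).
# Assumed critic: Q(s, a) = -(a - 2s)^2, so the greedy action is a* = 2s.
# The linear actor mu(s) = w * s should then learn w -> 2 by ascent on Q.
def dQ_da(s, a):
    return -2.0 * (a - 2.0 * s)   # gradient of the critic w.r.t. the action

w, lr = 0.0, 0.1
rng = np.random.default_rng(0)
for _ in range(500):
    s = rng.uniform(-1.0, 1.0)    # sampled state
    a = w * s                     # deterministic actor mu(s)
    # Equation (3): grad_w Q(s, mu(s)) = dQ/da * d mu / d w = dQ/da * s
    w += lr * dQ_da(s, a) * s
```

Each update moves `w` toward 2 by a factor proportional to `s**2`, so the actor converges to the greedy policy of this critic without ever enumerating actions.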
Option
An option $o$ is a triple $(\mathcal{I}_o, \pi_o, \beta_o)$, and we use $\mathcal{O}$ to indicate the option set. We use $\mathcal{I}_o \subseteq \mathcal{S}$ to denote the initiation set of $o$, indicating where the option can be initiated. In this paper, we consider $\mathcal{I}_o = \mathcal{S}$ for all $o \in \mathcal{O}$, meaning that each option can be initiated at all states. We use $\pi_o$ to denote the intra-option policy of $o$. Once the option $o$ is committed, the action selection is based on $\pi_o$. We use $\beta_o: \mathcal{S} \rightarrow [0, 1]$ to denote the termination function of $o$. At each time step $t$, the agent terminates the previous option $o_{t-1}$ with probability $\beta_{o_{t-1}}(s_t)$. In this paper, we consider the call-and-return option execution model (Sutton, Precup, and Singh 1999), where an agent executes an option until the option terminates.
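The call-and-return execution model can be sketched as follows. The two hand-coded options, the toy dynamics, and the policy over options are hypothetical, for illustration only:

```python
import numpy as np

# Call-and-return option execution: commit to an option and follow its
# intra-option policy until the termination function fires.
rng = np.random.default_rng(0)

options = {
    # each assumed option: an intra-option policy and a termination function
    "left":  {"policy": lambda s: -1, "beta": lambda s: 0.2},
    "right": {"policy": lambda s: +1, "beta": lambda s: 0.2},
}

def policy_over_options(s):
    return "right" if s < 5 else "left"   # toy policy over options

s, o, trajectory = 0, None, []
for t in range(20):
    if o is None or rng.uniform() < options[o]["beta"](s):
        o = policy_over_options(s)        # terminate, then (re)select
    a = options[o]["policy"](s)           # act with the intra-option policy
    s += a                                # toy deterministic dynamics
    trajectory.append((o, a, s))
```

Because each `beta` is 0.2, the agent commits to an option for five steps on average, which is exactly the temporal abstraction the option framework provides.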
An MDP augmented with options forms a Semi-MDP (Sutton, Precup, and Singh 1999). We use $\pi_{\mathcal{O}}$ to denote the policy over options. We use $Q_{\mathcal{O}}(s, o)$ to denote the option-value function and $V_{\mathcal{O}}(s)$ to denote the value function of $\pi_{\mathcal{O}}$. Furthermore, we use $U(s, o)$ to denote the option value upon arrival at the state-option pair $(s, o)$ and $Q_U(s, o, a)$ to denote the value of executing an action $a$ in the context of a state-option pair $(s, o)$. They are related as

(4) $Q_{\mathcal{O}}(s, o) = \sum_a \pi_o(a | s) Q_U(s, o, a)$

(5) $Q_U(s, o, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) U(s', o)$

(6) $U(s', o) = (1 - \beta_o(s')) Q_{\mathcal{O}}(s', o) + \beta_o(s') V_{\mathcal{O}}(s')$
Bacon, Harb, and Precup (2017) proposed a policy gradient method, the Option-Critic Architecture (OCA), to learn stochastic intra-option policies (parameterized by $\nu$) and termination functions (parameterized by $\eta$). The objective is to maximize the expected discounted return per episode, i.e., $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_{t+1}]$. Based on their Intra-option Policy Gradient Theorem and Termination Gradient Theorem, the per-step updates for $\nu$ and $\eta$ are

$\nu \leftarrow \nu + \alpha \nabla_\nu \log \pi_{o_t}(a_t | s_t) Q_U(s_t, o_t, a_t)$

$\eta \leftarrow \eta - \alpha \nabla_\eta \beta_{o_t}(s_{t+1}) \big( Q_{\mathcal{O}}(s_{t+1}, o_t) - V_{\mathcal{O}}(s_{t+1}) \big)$
And at each time step $t$, OCA takes one gradient descent step minimizing

(7) $\frac{1}{2} \big( y_t - Q_{\mathcal{O}}(s_t, o_t) \big)^2$

to update the critic $Q_{\mathcal{O}}$, where

$y_t \doteq r_{t+1} + \gamma \big( (1 - \beta_{o_t}(s_{t+1})) Q_{\mathcal{O}}(s_{t+1}, o_t) + \beta_{o_t}(s_{t+1}) \max_{o'} Q_{\mathcal{O}}(s_{t+1}, o') \big)$

The update target $y_t$ is also used in Intra-option Q-learning (Sutton, Precup, and Singh 1999).
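The intra-option Q-learning target used in Equation (7) mixes the value of continuing the current option with the value of switching, weighted by the termination probability. A minimal sketch, on assumed tabular option values:

```python
import numpy as np

# Intra-option Q-learning target (Equation 7) on a toy, hand-assigned
# tabular option-value function with 2 states and 2 options.
gamma = 0.9
Q_O = np.array([[1.0, 3.0],     # Q_O[s, o]
                [2.0, 0.5]])

def target(r, s_next, o, beta):
    """y = r + gamma * ((1 - beta) * Q_O[s', o] + beta * max_o' Q_O[s', o'])."""
    cont = Q_O[s_next, o]        # value of continuing the same option
    switch = Q_O[s_next].max()   # value of switching to the best option
    return r + gamma * ((1.0 - beta) * cont + beta * switch)
```

With `beta = 0` the target reduces to SARSA-style continuation of the current option; with `beta = 1` it reduces to the Q-learning target over options.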
Model-based RL
In RL, a transition model usually takes as input a state-action pair and generates the immediate reward and the next state. A transition model reflects the dynamics of an environment and can either be given or learned. A transition model can be used to generate imaginary transitions for training the agent, increasing data efficiency (Sutton 1990; Yao et al. 2009; Gu et al. 2016). A transition model can also be used to reason about the future. When making a decision, an agent can perform a lookahead tree search (Knuth and Moore 1975; Browne et al. 2012) with a transition model to maximize the possible rewards. A lookahead tree search can be performed with either a perfect model (Coulom 2006; Sturtevant 2008; Silver et al. 2016) or a learned model (Weber et al. 2017).
A latent state is an abstraction of the original state. A latent state is also referred to as an abstract state (Oh, Singh, and Lee 2017) or an encoded state (Farquhar et al. 2018). A latent state can be used as the input of a transition model, in which case the transition model predicts the next latent state instead of the original next state. A latent state is particularly useful for high-dimensional state spaces (e.g., images), where a latent state is usually a low-dimensional vector.
Recently, some works demonstrated that learning a value prediction model instead of a transition model is effective for a lookahead tree search. For example, VPN (Oh, Singh, and Lee 2017) predicted the value of the next latent state, instead of the next latent state itself. Although this value prediction was explicitly done in two phases (predicting the next latent state and then predicting its value), the loss of predicting the next latent state was not used in training; only the loss of the value prediction for the next latent state was used. Oh, Singh, and Lee (2017) showed this value prediction model is particularly helpful for non-deterministic environments. TreeQN (Farquhar et al. 2018) adopted a similar idea, where only the outcome value of a lookahead tree search is grounded in the loss. Farquhar et al. (2018) showed that grounding the predicted next latent state did not bring a performance boost. Although a value prediction model encodes much less information than a transition model, VPN and TreeQN demonstrated improved performance over baselines in challenging domains. In ACE, we follow these works and build a value prediction model similar to TreeQN.
The Actor Ensemble Algorithm
As discussed earlier, it is important for the actor to maximize the critic in DDPG. Lillicrap et al. (2015) trained the actor via gradient ascent. However, gradient ascent can easily be trapped by local maxima or saddle points.
To mitigate this issue, we propose an ensemble approach. In our work, we have $N$ actors $\{\mu_1, \ldots, \mu_N\}$. At each time step $t$, ACE selects the action over the proposals of all the actors,

(8) $a_t \doteq \arg\max_{a \in \{\mu_i(s_t)\}_{i=1,\ldots,N}} Q(s_t, a)$
Our $N$ actors are trained in parallel. Assuming those actors are parameterized by $\{\phi_1, \ldots, \phi_N\}$, each actor $\mu_i$ is trained such that, given a state $s$, the action $\mu_i(s)$ maximizes $Q(s, \cdot)$. We adopt the same chain-rule gradient ascent as DDPG. The gradient for $\phi_i$ at time step $t$ is

(9) $\nabla_a Q(s_t, a) |_{a = \mu_i(s_t)} \nabla_{\phi_i} \mu_i(s_t)$

for all $i \in \{1, \ldots, N\}$. ACE initializes each actor independently, so that the actors are likely to cover different local maxima of $Q(s, \cdot)$. By considering the maximum action value of the proposed actions, the action $a_t$ in Equation (8) is more likely to attain the global maximum of the critic $Q$ than the action of a single actor. To train the critic $Q$, ACE takes one gradient descent step at each time step minimizing

(10) $\frac{1}{2} \big( r_{t+1} + \gamma \max_{a' \in \{\mu_i(s_{t+1})\}_{i=1,\ldots,N}} Q(s_{t+1}, a') - Q(s_t, a_t) \big)^2$
Note that our actors are not independent. They reinforce each other by influencing the critic's update, which in turn gives them better policy gradients.
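The intuition can be sketched with a 1-D action space and a hand-crafted bimodal critic (an assumption for illustration): actors started in different places converge to different local maxima, and the ensemble selection in Equation (8) picks the best proposal.

```python
import numpy as np

# Assumed critic with a local maximum near a = -2 (height 0.5) and the
# global maximum near a = 2 (height 1.0).
def Q(a):
    return np.exp(-(a - 2.0) ** 2) + 0.5 * np.exp(-(a + 2.0) ** 2)

def dQ_da(a):
    return (-2.0 * (a - 2.0) * np.exp(-(a - 2.0) ** 2)
            - 1.0 * (a + 2.0) * np.exp(-(a + 2.0) ** 2))

def ascend(a, lr=0.1, steps=500):
    """Plain gradient ascent, as used to train each individual actor."""
    for _ in range(steps):
        a += lr * dQ_da(a)
    return a

# Differently initialized actors land in different local maxima ...
proposals = np.array([ascend(a0) for a0 in (-3.0, 0.5, 3.0)])
# ... and the ensemble selection (Equation 8) picks the best proposal.
a_best = proposals[np.argmax(Q(proposals))]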
An Option Perspective
Intuitively, each actor in ACE is similar to an option. To quantify the relationship between ACE and the option framework, we first extend OCA with deterministic intra-option policies, referred to as OCAD. For each option $o$, we use $\mu_o$ to denote its intra-option policy, which is a deterministic policy. The intra-option policies are parameterized by $\phi$. The termination functions are parameterized by $\eta$. We have
Theorem 1 (Deterministic Intra-option Policy Gradient)

Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. $\phi$ is:

$\sum_{s, o} \rho(s, o) \nabla_\phi \mu_o(s) \nabla_a Q_U(s, o, a) |_{a = \mu_o(s)}$

where $\rho(s, o) \doteq \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, o_t = o \mid s_0, o_0)$ is the limiting state-option pair distribution. Here $\gamma^t \Pr(s_t = s, o_t = o \mid s_0, o_0)$ represents the discounted probability of transitioning to $(s, o)$ from $(s_0, o_0)$ in $t$ steps.
Theorem 2 (Termination Policy Gradient)

Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. $\eta$ is:

$-\sum_{s', o} \rho(s', o) \nabla_\eta \beta_o(s') \big( Q_{\mathcal{O}}(s', o) - V_{\mathcal{O}}(s') \big)$
The proof of Theorem 1 follows a scheme similar to those of Sutton et al. (2000), Silver et al. (2014), and Bacon, Harb, and Precup (2017). The proof of Theorem 2 follows a scheme similar to that of Bacon, Harb, and Precup (2017). The conditions and proofs of both theorems are detailed in the Supplementary Material. The critic update of OCAD remains the same as in OCA (Equation 7).
We now show that the actor ensemble in ACE is a special setting of OCAD. The gradient update of the actors in ACE (Equation 9) can be justified via Theorem 1, and the critic update in ACE (Equation 10) is equivalent to the critic update in OCAD (Equation 13). We first consider a special setting where $\beta_o(s) \equiv 1$ for all $(s, o)$, which means each option terminates at every time step. In this setting, the value of $Q_U(s, o, a)$ does not depend on $o$ (Equations 4 and 5). Based on this observation, we rewrite $Q_U(s, o, a)$ as $Q(s, a)$. We have

(11) $Q_U(s, o, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) V_{\mathcal{O}}(s') \doteq Q(s, a)$
With the $Q_U$ values of all options being the same function $Q$, we rewrite the intra-option policy gradient update in OCAD according to Theorem 1 as

(12) $\nabla_a Q(s_t, a) |_{a = \mu_{o_t}(s_t)} \nabla_\phi \mu_{o_t}(s_t)$
And we rewrite the critic update in OCAD (Equation 7) as

(13) $\frac{1}{2} \big( r_{t+1} + \gamma \max_{o'} Q(s_{t+1}, \mu_{o'}(s_{t+1})) - Q(s_t, a_t) \big)^2$
Now the actor-critic updates in OCAD (Equations 12 and 13) recover the actor-critic updates in ACE (Equations 9 and 10), revealing the relationship between the ensemble in ACE and OCAD.
Note that in the intra-option policy update of OCAD (Equation 12), only one intra-option policy is updated at each time step, while in the actor ensemble update (Equation 9) all actors are updated. Based on the intra-option policy update of OCAD, we propose a variant of ACE, named Alternative ACE (ACE-Alt), where only the selected actor is updated at each time step. In practice, we add exploration noise to each action and use experience replay to stabilize the training of the neural network function approximators as in DDPG, resulting in off-policy learning.
Model-based Enhancements
To refine the state-action value estimate, which is essential for the actor selection, we utilize a lookahead tree search with a learned value prediction model similar to TreeQN. TreeQN was developed for discrete action spaces. We extend TreeQN to continuous control problems via the actor ensemble, searching over the actions proposed by the actors.
Formally speaking, we first define the following learnable functions:

- $f^{enc}: \mathcal{S} \rightarrow \mathbb{R}^n$, an encoding function that transforms a state into an $n$-dimensional latent state, parameterized by $\theta^{enc}$;
- $f^{r}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a reward prediction function that predicts the immediate reward given a latent state and an action, parameterized by $\theta^{r}$;
- $f^{trans}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}^n$, a transition function that predicts the next latent state given a latent state and an action, parameterized by $\theta^{trans}$;
- $f^{q}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a value prediction function that computes the value for a pair of a latent state and an action, parameterized by $\theta^{q}$;
- $\mu_i: \mathbb{R}^n \rightarrow \mathcal{A}$, an actor that computes an action given a latent state, parameterized by $\phi_i$, for each $i \in \{1, \ldots, N\}$.

We use $\theta$ to denote $\{\theta^{enc}, \theta^{r}, \theta^{trans}, \theta^{q}\}$ and $\phi$ to denote $\{\phi_1, \ldots, \phi_N\}$. Note the encoding function $f^{enc}$ is shared in our implementation.
We use $x \doteq f^{enc}(s)$ to represent the latent state of $s$ and $Q(s, a)$ to represent $f^{q}(x, a)$, where $x$ is a latent state and $a \in \mathcal{A}$. Furthermore, $f^{q}(x, a)$ can also be decomposed into the sum of the predicted immediate reward and the discounted value of the predicted next latent state, i.e.,

(14) $f^{q}(x, a) = f^{r}(x, a) + \gamma f^{q}(x', a')$

where

(15) $x' \doteq f^{trans}(x, a)$

(16) $a' \doteq \arg\max_{a'' \in \{\mu_i(x')\}_{i=1,\ldots,N}} f^{q}(x', a'')$
We apply Equation (14) recursively $d$ times, with latent state prediction and action selection as in Equations (15) and (16), resulting in a new estimator for $Q(s, a)$, which is defined as

(17) $Q^d(s, a) \doteq f^{q,d}(f^{enc}(s), a)$

where

$f^{q,0}(x, a) \doteq f^{q}(x, a)$

$f^{q,d}(x, a) \doteq f^{r}(x, a) + \gamma \max_{a' \in \{\mu_i(x')\}_{i=1,\ldots,N}} f^{q,d-1}(x', a')$, with $x' \doteq f^{trans}(x, a)$

The lookahead tree search and backup process (Equation 17) are illustrated in Figure 1. The value $f^{q,d-k}$ stands for the state-action value estimate for the predicted latent state and action after $k$ steps from $s$, with Equation (14) applied $k$ times.
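The recursion behind Equation (17) can be sketched with tiny hand-crafted stand-ins for the reward prediction function, the transition function, the value prediction function, and the actors. All of these components are assumptions for illustration, not the learned networks of the paper:

```python
import numpy as np

# Depth-d tree-search backup (Equation 17): expand with the predicted
# latent transition, branch over the actors' proposals, and back up
# with a max. gamma is an assumed discount for the sketch.
gamma = 0.99
f_r     = lambda x, a: float(np.dot(x, a))          # reward prediction
f_trans = lambda x, a: 0.5 * (x + a)                # latent transition
f_q     = lambda x, a: float(np.dot(x, x) - np.dot(a - x, a - x))
actors  = [lambda x: x, lambda x: -x]               # two toy actors

def q_d(x, a, d):
    """Estimate Q^d(x, a) by recursive expansion over actor proposals."""
    if d == 0:
        return f_q(x, a)
    x_next = f_trans(x, a)
    return f_r(x, a) + gamma * max(q_d(x_next, mu(x_next), d - 1)
                                   for mu in actors)

x0 = np.array([1.0, -1.0])   # an example latent state
a0 = np.array([0.5, 0.5])    # an example action
```

At depth 0 the estimator is just the one-step value prediction; each extra level of depth replaces it with a predicted reward plus a discounted max over the actors' proposals at the predicted next latent state, exactly mirroring Equations (14-16).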
As $Q^d(s, a)$ is fully differentiable w.r.t. $\theta$ and $\phi$, we plug in $Q^d$ whenever we need $Q$. We also ground the predicted reward in the first recursive expansion as suggested by Farquhar et al. (2018). To summarize, given a transition $(s, a, r, s')$, $\theta$ is updated via gradient descent on the loss

$\frac{1}{2} \big( r + \gamma \max_{a' \in \{\mu_i(s')\}_{i=1,\ldots,N}} Q^d(s', a') - Q^d(s, a) \big)^2 + \frac{1}{2} \big( r - f^{r}(f^{enc}(s), a) \big)^2$

and each $\phi_i$ is updated via gradient ascent following Equation (9) with $Q$ replaced by $Q^d$. ACE also utilizes experience replay and a target network similar to DDPG. The pseudocode of ACE is provided in the Supplementary Material.
Experiments
We designed experiments to answer the following questions:

- Does ACE outperform baseline algorithms?
- If so, how do the components of ACE contribute to the performance boost?
We used twelve continuous control tasks from Roboschool, a free port of Mujoco (http://www.mujoco.org/) by OpenAI. A state in Roboschool contains joint information of a robot (e.g., speed or angle) and is presented as a vector of real numbers. An action in Roboschool is a vector, with each dimension being a real number in $[-1, 1]$. All the implementations are made publicly available at https://github.com/ShangtongZhang/DeepRL.
ACE Architecture
In this section we describe the parameterization of $f^{enc}$, $f^{r}$, $f^{trans}$, $f^{q}$, and $\{\mu_i\}$ for Roboschool tasks. For each state $s$, we first transformed it into a latent state by $f^{enc}$, which was parameterized as a single neural network layer. This latent state was used as the input for all other functions (i.e., $f^{r}$, $f^{trans}$, $f^{q}$, and $\{\mu_i\}$). The networks for $f^{r}$, $f^{trans}$, and $f^{q}$ are single-hidden-layer networks with 300 hidden units, taking as inputs the concatenation of a latent state and an action. In particular, the network for $f^{trans}$ used two residual connections similar to Farquhar et al. (2018). The networks for $\{\mu_i\}$ are single-hidden-layer networks with 400 input units and 300 hidden units, and all the networks of $\{\mu_i\}$ shared a common first layer. The architecture of ACE is illustrated in Figure 4(a). We used tanh as the activation function for the hidden units. (This selection will be further discussed in the next section.) We set the number of actors to 5 (i.e., $N = 5$) and the planning depth to 1 (i.e., $d = 1$).

Baseline Algorithms
DDPG
In DDPG, Lillicrap et al. (2015) used two separate networks to parameterize the actor and the critic. Each network had 2 hidden layers with 400 and 300 hidden units respectively. Lillicrap et al. (2015) used the ReLU activation function (Nair and Hinton 2010) and applied $\ell_2$ regularization to the critic. However, our analysis experiments found that the tanh activation function outperformed ReLU with $\ell_2$ regularization. So throughout all our experiments, we always used the tanh activation function (without $\ell_2$ regularization) for all algorithms. All other hyperparameter values were the same as in Lillicrap et al. (2015). All the other compared algorithms inherited the hyperparameter values from DDPG without tuning. We used the same Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) as Lillicrap et al. (2015) for exploration in all the compared algorithms.

Wide-DDPG
ACE had more parameters than DDPG. To investigate the influence of the number of parameters, we implemented Wide-DDPG, where the hidden units were doubled (i.e., the two hidden layers had 800 and 600 units respectively). Wide-DDPG had a comparable number of parameters to ACE at the same depth.
Shared-DDPG

DDPG used separate networks for the actor and the critic, while the actor and critic in ACE share a common representation layer. To investigate the influence of a shared representation, we implemented Shared-DDPG, where the actor and the critic share a common bottom layer in the same manner as ACE.
Ensemble-DDPG

To investigate the influence of the tree search in ACE, we removed the tree search by setting $d = 0$, giving an ensemble version of DDPG, called Ensemble-DDPG. We still used 5 actors in Ensemble-DDPG.
TM-ACE

To investigate the usefulness of the value prediction model, we implemented Transition-Model-ACE (TM-ACE), where we learn a transition model instead of a value prediction model. To be more specific, $f^{r}$ and $f^{trans}$ were trained to fit sampled transitions from the replay buffer, minimizing the squared losses of the predicted reward and the predicted next latent state. This model was then used for a lookahead tree search. The pseudocode of TM-ACE is detailed in the Supplementary Material.
The architectures of all the above algorithms are illustrated in Supplementary Material.
Results
For each task, we trained each algorithm for 1 million steps. At every 10 thousand steps, we performed 20 deterministic evaluation episodes without exploration noise and computed the mean episode return. We report the best evaluation performance during training in Table 1, which is averaged over 5 independent runs. The full evaluation curves are reported in Supplementary Material.
In summary, either ACE or ACE-Alt was among the best algorithms in 11 out of the 12 games. With ACE-Alt included in the comparison, ACE was among the best algorithms in 8 games; without ACE-Alt, in 10 games. Likewise, with ACE included, ACE-Alt was among the best algorithms in 7 games; without ACE, in 10 games. Overall, ACE was slightly better than ACE-Alt. However, ACE-Alt enjoyed a lower variance than ACE. We conjecture this is because ACE performs more off-policy learning than ACE-Alt: off-policy learning improves data efficiency but increases variance.
Wide-DDPG outperformed DDPG in only 1 game, indicating that naively increasing the number of parameters does not guarantee a performance improvement. Shared-DDPG outperformed DDPG in only 2 games (lost in 2 games and tied in 8 games), showing that the shared representation contributes little to the overall performance of ACE. Ensemble-DDPG outperformed DDPG in 6 games (lost in 3 games and tied in 3 games), indicating that the DDPG agent benefits from an actor ensemble. This may be attributed to the fact that multiple actors are more likely to find the global maximum of the critic. ACE further outperformed Ensemble-DDPG in 9 games, indicating that the agent benefits from the lookahead tree search with a value prediction model. In contrast, TM-ACE outperformed Ensemble-DDPG in only 2 games (lost in 3 games and tied in 7 games), indicating that a value prediction model is better than a transition model for a lookahead tree search. This is consistent with the results observed earlier in VPN and TreeQN.
In conclusion, the actor ensemble and the lookahead tree search with a learned value prediction model are key to the performance boost.
ACE and ACE-Alt improve performance in terms of environment steps but require more computation than vanilla DDPG. We benchmarked the wall time of the algorithms; the results are reported in the Supplementary Material. We also verified the diversity of the actors in ACE and ACE-Alt in the Supplementary Material.
Ensemble Size and Planning Depth
In this section, we investigate how the ensemble size $N$ and the planning depth $d$ influence the performance of ACE. We performed experiments in HalfCheetah with various $N$ and $d$ and used the same evaluation protocol as before. As large $N$ and $d$ induce a significant increase in computation, we only used $N$ up to 10 and $d$ up to 2. The results are reported in Figure 2.
To summarize, $N = 5$ and $d = 1$ achieved the best performance. We hypothesize there is a trade-off in the selection of both the ensemble size and the planning depth. On the one hand, a single actor can easily be trapped in a local maximum during training; the more actors we have, the more likely we are to find the global maximum. On the other hand, all the actors share the same encoding function with the critic, and a large number of actors may dominate the training of the encoding function, damaging the critic learning. So a medium ensemble size is likely to achieve the best performance. A possible solution is to normalize the gradient according to the ensemble size, and we leave this for future work. With a perfect model, the more planning steps we take, the more accurate the estimate we get. However, with a learned value prediction model, errors compound during unrolling. So a medium planning depth is likely to achieve the best performance. Similar phenomena were also observed in multi-step Dyna planning (Yao et al. 2009).
ACE  ACE-Alt  TM-ACE  Ensemble-DDPG  Shared-DDPG  Wide-DDPG  DDPG  
Ant  1041(70.8)  983(36.8)  1031(55.6)  1026(87.2)  796(16.8)  871(19.9)  875(14.2) 
HalfCheetah  1667(40.4)  1023(60.4)  800(28.8)  812(49.4)  771(79.6)  733(52.5)  703(37.3) 
Hopper  2136(86.4)  1923(88.3)  1586(85.0)  1972(63.4)  2047(76.7)  2090(118.6)  2133(99.0) 
Humanoid  380(56.1)  441(90.1)  61(6.8)  76(11.4)  53(2.2)  54(1.0)  54(1.7) 
HF  311(30.3)  289(20.8)  126(39.6)  85(5.6)  53(1.0)  55(1.6)  53(1.4) 
HFH  22(2.4)  20(2.2)  4(2.1)  2(7.5)  6(6.4)  15(3.2)  15(2.5) 
IDP  7555(1610.9)  9356(1.1)  7549(1613.7)  4102(1923.2)  7549(1613.6)  7548(1618.7)  5662(1945.9) 
IP  417(212.8)  1000(0.0)  415(213.4)  1000(0.0)  1000(0.0)  1000(0.0)  1000(0.0) 
IPS  892(0.1)  892(0.2)  891(0.2)  891(0.2)  891(0.4)  546(308.7)  891(0.4) 
Pong  12(0.3)  11(0.1)  6(0.7)  8(0.9)  4(0.2)  4(0.1)  5(0.8) 
Reacher  16(0.7)  17(0.2)  17(0.5)  17(0.3)  20(0.7)  15(2.2)  18(0.9) 
Walker2d  1659(65.9)  1864(21.4)  1086(97.0)  1142(99.5)  1142(146.3)  1185(121.2)  815(11.4) 
Related Work
Continuous-action RL
Klissarov et al. (2017) extended OCA to continuous action problems with the Proximal Policy Option Critic (PPOC) algorithm. However, PPOC considered stochastic intra-option policies, and each intra-option policy was trained via a policy search method. In ACE, we consider deterministic intra-option policies, and the intra-option policies are optimized under the same objective as OCA.
Gu et al. (2016) parameterized the $Q$ function in a quadratic form to deal with continuous control problems. In this way, the global maximum can be determined analytically. However, in general, the optimal value function does not necessarily take this quadratic form. In ACE, we use an actor ensemble to search for the global maximum of the $Q$ function. Gu et al. (2016) also utilized a transition model to generate imaginary data, which is orthogonal to ACE.
Ensemble in RL
Wiering and van Hasselt (2008) designed four ensemble methods combining five RL algorithms with a voting scheme based on the value functions of the different RL algorithms. Hans and Udluft (2010) used a network ensemble to improve the performance of Fitted Q-Iteration. Osband et al. (2016) used a $Q$ ensemble to approximate Thompson sampling, resulting in improved exploration and a performance boost in challenging video games. Huang et al. (2017) used both an actor ensemble and a critic ensemble in continuous control problems. However, to the best of our knowledge, the present work is the first to relate ensembles with options and to use an ensemble for a lookahead tree search in continuous control problems.
Closing Remarks
In this paper, we propose the ACE algorithm for continuous control problems. From an ensemble perspective, ACE utilizes an actor ensemble to search for the global maximum of a critic function. From an option perspective, ACE is a special option-critic algorithm with deterministic intra-option policies. Thanks to the actor ensemble, ACE is able to perform a lookahead tree search with a learned value prediction model in continuous control problems, resulting in a significant performance boost in challenging robot manipulation tasks.
Acknowledgement
The authors thank Bo Liu for feedback on the first draft of this paper. We also appreciate the insightful reviews of the anonymous reviewers.
References

[Bacon, Harb, and
Precup2017]
Bacon, P.L.; Harb, J.; and Precup, D.
2017.
The optioncritic architecture.
In
Proceedings of the 31st AAAI Conference on Artificial Intelligence
.  [BarthMaron et al.2018] BarthMaron, G.; Hoffman, M. W.; Budden, D.; Dabney, W.; Horgan, D.; Muldal, A.; Heess, N.; and Lillicrap, T. 2018. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
 [Browne et al.2012] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games.
 [Coulom2006] Coulom, R. 2006. Efficient selectivity and backup operators in montecarlo tree search. In Proceedings of the International Conference on Computers and Games.
 [Degris, White, and Sutton2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Offpolicy actorcritic. arXiv preprint arXiv:1205.4839.
 [Farquhar et al.2018] Farquhar, G.; Rocktäschel, T.; Igl, M.; and Whiteson, S. 2018. Treeqn and atreec: Differentiable treestructured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.

[Gu et al.2016]
Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S.
2016.
Continuous deep qlearning with modelbased acceleration.
In
Proceedings of the 33rd International Conference on Machine Learning
.  [Hans and Udluft2010] Hans, A., and Udluft, S. 2010. Ensembles of neural networks for robust reinforcement learning. In Proceedings of the 9th International Conference on Machine Learning and Applications.
 [Huang et al.2017] Huang, Z.; Zhou, S.; Zhuang, B.; and Zhou, X. 2017. Learning to run with actorcritic ensemble. arXiv preprint arXiv:1712.08987.
 [Klissarov et al.2017] Klissarov, M.; Bacon, P.L.; Harb, J.; and Precup, D. 2017. Learnings options endtoend for continuous action tasks. arXiv preprint arXiv:1712.00004.
 [Knuth and Moore1975] Knuth, D. E., and Moore, R. W. 1975. An analysis of alphabeta pruning. Artificial Intelligence.
 [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 [Lin1992] Lin, L.-J. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
 [Mansley, Weinstein, and Littman2011] Mansley, C. R.; Weinstein, A.; and Littman, M. L. 2011. Sample-based planning for continuous action Markov decision processes. In Proceedings of the 21st International Conference on Automated Planning and Scheduling.
 [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
 [Nair and Hinton2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
 [Nitti, Belle, and De Raedt2015] Nitti, D.; Belle, V.; and De Raedt, L. 2015. Planning in discrete and continuous Markov decision processes by probabilistic programming. In Proceedings of the 17th Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
 [Oh, Singh, and Lee2017] Oh, J.; Singh, S.; and Lee, H. 2017. Value prediction network. In Advances in Neural Information Processing Systems.
 [Osband et al.2016] Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems.
 [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
 [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
 [Sturtevant2008] Sturtevant, N. 2008. An analysis of UCT in multi-player games. In Proceedings of the International Conference on Computers and Games.
 [Sutton et al.2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
 [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
 [Sutton1990] Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning.
 [Tassa et al.2018] Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; Casas, D. d. L.; Budden, D.; Abdolmaleki, A.; Merel, J.; Lefrancq, A.; et al. 2018. DeepMind control suite. arXiv preprint arXiv:1801.00690.
 [Tesauro1995] Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM.
 [Uhlenbeck and Ornstein1930] Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review.
 [Watkins and Dayan1992] Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning.
 [Weber et al.2017] Weber, T.; Racanière, S.; Reichert, D. P.; Buesing, L.; Guez, A.; Rezende, D. J.; Badia, A. P.; Vinyals, O.; Heess, N.; Li, Y.; et al. 2017. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.
 [Wiering and Van Hasselt2008] Wiering, M. A., and Van Hasselt, H. 2008. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).
 [Yao et al.2009] Yao, H.; Bhatnagar, S.; Diao, D.; Sutton, R. S.; and Szepesvári, C. 2009. Multi-step Dyna planning for policy evaluation and control. In Advances in Neural Information Processing Systems.
 [Yee, Lisỳ, and Bowling2016] Yee, T.; Lisỳ, V.; and Bowling, M. H. 2016. Monte Carlo tree search in continuous action spaces with execution uncertainty. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
Supplementary Material
Proof of Theorem 1
Under mild conditions (Bacon, Harb, and Precup 2017), the underlying Markov chain is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous. We compute the gradient as
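For concreteness, the one-step transition of this augmented process over state–option pairs can be sketched as follows. This is the standard construction of Bacon, Harb, and Precup (2017), with the stochastic intra-option policy replaced by a deterministic one; the symbols $\mu_o$ (deterministic intra-option policy), $\beta_o$ (termination function), and $\pi_\Omega$ (policy over options) follow their notation and are an assumption about the paper's conventions, not a quotation of them:
\begin{equation*}
P^{(1)}(s_{t+1}, o_{t+1} \mid s_t, o_t)
= P\big(s_{t+1} \mid s_t, \mu_{o_t}(s_t)\big)
\Big[ \big(1 - \beta_{o_t}(s_{t+1})\big)\, \mathbb{1}_{o_{t+1} = o_t}
+ \beta_{o_t}(s_{t+1})\, \pi_\Omega(o_{t+1} \mid s_{t+1}) \Big].
\end{equation*}
The current option either continues (with probability $1 - \beta_{o_t}(s_{t+1})$) or terminates, in which case a new option is drawn from $\pi_\Omega$.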
(18) 
From Equation (5), we have
(19) 
Plugging Equation (19) into Equation (18) and using the definition above, we have
(20) 
Expanding Equation (20) recursively and applying the augmented process, we end up with
Proof of Theorem 2
Under mild conditions (Bacon, Harb, and Precup 2017), the underlying Markov chain is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous.
We have
(21) 
The gradient of w.r.t. is
(22) 
Applying Equation (22) recursively and using the augmented process, we have