ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search

11/06/2018 · Shangtong Zhang et al. · Huawei Technologies Co., Ltd. and University of Alberta

In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning. In ACE, we use an actor ensemble (i.e., multiple actors) to search for the global maxima of the critic. Besides the ensemble perspective, we also formulate ACE in the option framework by extending the option-critic architecture with deterministic intra-option policies, revealing a relationship between ensembles and options. Furthermore, we perform a look-ahead tree search with those actors and a learned value prediction model, resulting in a refined value estimation. We demonstrate a significant performance boost of ACE over DDPG and its variants in challenging physical robot simulators.


Introduction

In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control in reinforcement learning (RL). In continuous control, a deterministic policy (Silver et al. 2014), which is a mapping from states to actions, is a recent approach. In contrast, a stochastic policy is a mapping from a state to a probability distribution over actions.

Recently, neural networks have achieved great success as function approximators in various challenging domains (Tesauro 1995; Mnih et al. 2015; Silver et al. 2016). A deterministic policy parameterized by a neural network is usually trained via gradient ascent to maximize the critic, a state-action value function parameterized by a neural network (Silver et al. 2014; Lillicrap et al. 2015; Barth-Maron et al. 2018). However, gradient ascent can easily be trapped by local maxima during the search for the global maxima. We utilize the ensemble technique to mitigate this issue. We train multiple actors (i.e., deterministic policies) in parallel, and each actor has a different initialization. In this way, each actor is in charge of maximizing the state-action value function in a local area. Different actors may be trapped in different local maxima. By considering the maximum state-action value of the actions proposed by all the actors, we are more likely to find the global maxima of the state-action value function than with a single actor.
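To make this intuition concrete, below is a small self-contained Python/NumPy sketch (not from the paper; the toy critic q and all names are ours) illustrating why several independently initialized gradient-ascent searches are more likely to reach the global maximum of a multimodal function than a single search.

```python
# Toy illustration: an "ensemble" of gradient-ascent restarts on a multimodal function.
import numpy as np

def q(a):                      # a toy multimodal "critic" over a 1-D action
    return np.sin(3 * a) + 0.5 * np.sin(7 * a)

def grad_q(a, eps=1e-5):       # numerical gradient of the toy critic
    return (q(a + eps) - q(a - eps)) / (2 * eps)

def ascend(a0, lr=0.01, steps=500):
    a = a0
    for _ in range(steps):
        a += lr * grad_q(a)    # plain gradient ascent; may stop at a local maximum
    return a

rng = np.random.default_rng(0)
inits = rng.uniform(-2, 2, size=5)          # 5 "actors" = 5 independent initializations
candidates = [ascend(a0) for a0 in inits]   # each converges to some local maximum
best = max(candidates, key=q)               # pick the proposal with the highest critic value
print(q(ascend(inits[0])), q(best))         # the ensemble's best is at least as good as any single run
```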

ACE fits in the option framework (Sutton et al. 1999). First, each option has its intra-option policy, which maximizes the return in a certain area of the state space. Similarly, an actor in ACE maximizes the critic in a certain area of the critic's domain. It may be difficult for a single actor to maximize the critic over the whole domain due to the complexity of the critic's manifold. In contrast, the search is easier if each actor only needs to find the best action in a local neighborhood of the action space. Second, we feed the outputs of all the actors into the critic, enabling a selection over the locally optimal action values. In this way, the critic works similarly to the policy over options in the option framework. We quantify this similarity between ensembles and options by extending the option-critic architecture (OCA, Bacon et al. 2017) with deterministic intra-option policies. In particular, we provide the Deterministic Intra-option Policy Gradient theorem, based on which we show that the actor ensemble in ACE is a special case of the general option-critic framework.

To make the state-action value function more accurate, which is essential for the actor selection, we perform a look-ahead tree search with the multiple actors. Look-ahead tree search has achieved great success in various discrete-action problems (Knuth & Moore 1975; Browne et al. 2012; Silver et al. 2016; Oh et al. 2017; Farquhar et al. 2018). Recently, look-ahead tree search was extended to continuous-action problems. For example, Mansley et al. (2011) combined planning with adaptive discretization of a continuous action space, resulting in a performance boost in continuous bandit problems. Nitti et al. (2015) utilized probabilistic programming for planning in continuous action spaces. Yee et al. (2016) used kernel regression to generalize values between explored and unexplored actions, resulting in a new Monte Carlo tree search algorithm. However, to the best of our knowledge, a general tree search algorithm for continuous-action problems is still missing. In ACE, we use the multiple actors as meta-actions to perform a tree search with the help of a learned value prediction model (Oh et al. 2017; Farquhar et al. 2018).

We demonstrate the superiority of ACE over DDPG and its variants empirically in Roboschool (https://github.com/openai/roboschool), a challenging physical robot simulator.

In the rest of this paper, we first present some preliminaries of RL. Then we detail ACE and show some empirical results, after which we discuss some related work, followed by closing remarks.

Preliminaries

We consider a Markov Decision Process (MDP), which consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, a transition kernel $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$, and a discount factor $\gamma \in [0, 1)$. We use $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ to denote a stochastic policy. At each time step $t$, an agent at state $s_t$ selects an action $a_t \sim \pi(\cdot \mid s_t)$. The environment then gives the reward $r_{t+1}$ and leads the agent to the next state $s_{t+1}$ according to $p(\cdot \mid s_t, a_t)$. We use $Q^\pi(s, a)$ to denote the state-action value of a policy $\pi$, i.e., $Q^\pi(s, a) \doteq \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a]$. In an RL problem, we are usually interested in finding an optimal policy $\pi^*$ s.t. $Q^{\pi^*}(s, a) \geq Q^\pi(s, a)$ for all $(\pi, s, a)$. All the optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$,

$(\mathcal{T}Q)(s, a) \doteq r(s, a) + \gamma \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[\max_{a'} Q(s', a')\big], \quad (1)$

where $Q$ indicates an estimate of $Q^*$. With a tabular representation, Q-learning (Watkins & Dayan 1992) is a commonly used method to find this fixed point. The per-step update is

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big), \quad (2)$

where $\alpha$ is a step size.
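For concreteness, here is a minimal Python sketch of the tabular update in Equation (2); the array Q and the names alpha and gamma are illustrative, not from the paper.

```python
# A minimal sketch of tabular Q-learning (Equation 2); episode termination is ignored for brevity.
import numpy as np

num_states, num_actions = 10, 4
Q = np.zeros((num_states, num_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```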

Recently, Mnih et al. (2015) used a deep convolutional neural network to parameterize an estimate $Q(s, a; \theta)$, resulting in the Deep-Q-Network (DQN). At each time step $t$, DQN samples a transition $(s, a, r, s')$ from a replay buffer (Lin 1992) and performs stochastic gradient descent to minimize the loss

$\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2,$

where $\theta^-$ is a target network (Mnih et al. 2015), which is a copy of $\theta$ and is synchronized with $\theta$ periodically.

Continuous Action Control

In continuous control problems, it is not straightforward to apply Q-learning. The basic idea of deterministic policy algorithms (Silver et al. 2014) is to use a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ to approximate the greedy action $\arg\max_a Q(s, a)$. The deterministic policy is trained via gradient ascent according to the chain rule. The per-step gradient is

$\nabla_a Q(s_t, a) \big|_{a = \mu(s_t; \theta)} \nabla_\theta \mu(s_t; \theta), \quad (3)$

where we assume $\mu$ is parameterized by $\theta$. With this actor $\mu$, we are able to update the state-action value function as usual. To be more specific, assuming $Q$ is parameterized by $w$, we perform gradient descent to minimize the loss

$\big(r_{t+1} + \gamma Q(s_{t+1}, \mu(s_{t+1}; \theta); w) - Q(s_t, a_t; w)\big)^2.$

Recently, Lillicrap et al. (2015) used neural networks to parameterize $\mu$ and $Q$, resulting in the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy control algorithm with experience replay and a target network.

In the on-policy setting, the gradient in Equation (3) guarantees policy improvement thanks to the Deterministic Policy Gradient theorem (Silver et al. 2014). In the off-policy setting, the policy improvement of this gradient is based on the Off-policy Policy Gradient theorem (OPG, Degris et al. 2012). However, the errata of Degris et al. (2012) show that OPG only holds for tabular representations and certain linear function approximations. So far, no policy improvement guarantee is available for DDPG. Nevertheless, DDPG and its variants have achieved great success in various challenging domains (Lillicrap et al. 2015; Barth-Maron et al. 2018; Tassa et al. 2018). This success may be attributed to the gradient ascent via the chain rule in Equation (3).
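As an illustration of the updates above, here is a hedged PyTorch sketch of a DDPG-style step: the critic descends on the TD error and the actor ascends $Q(s, \mu(s))$ via the chain rule of Equation (3). Network sizes, learning rates, and the omission of target networks are our simplifications, not the paper's configuration.

```python
# A minimal DDPG-style update (sketch only; target networks and replay omitted).
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def update(s, a, r, s_next):
    # Critic: one gradient descent step on the squared TD error.
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = (target - critic(torch.cat([s, a], dim=-1))).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s)) via the chain rule (Equation 3),
    # implemented as descent on the negated critic value.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```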

Option

An option $o$ is a triple $(\mathcal{I}_o, \pi_o, \beta_o)$, and we use $\mathcal{O}$ to denote the option set. We use $\mathcal{I}_o \subseteq \mathcal{S}$ to denote the initiation set of $o$, indicating where the option can be initiated. In this paper, we consider $\mathcal{I}_o = \mathcal{S}$ for all $o \in \mathcal{O}$, meaning that each option can be initiated at any state. We use $\pi_o$ to denote the intra-option policy of $o$. Once the option $o$ is committed, action selection is based on $\pi_o$. We use $\beta_o: \mathcal{S} \rightarrow [0, 1]$ to denote the termination function of $o$. At each time step $t$, the agent terminates the previous option $o_{t-1}$ with probability $\beta_{o_{t-1}}(s_t)$. In this paper, we consider the call-and-return option execution model (Sutton et al. 1999), where an agent executes an option until the option terminates.

An MDP augmented with options forms a Semi-MDP (Sutton et al. 1999). We use $\pi_\mathcal{O}$ to denote the policy over options. We use $Q_\mathcal{O}(s, o)$ to denote the option-value function and $V_\mathcal{O}(s)$ to denote the value function of $\pi_\mathcal{O}$. Furthermore, we use $U(s', o)$ to denote the option value upon arrival at the state-option pair $(s', o)$ and $Q_U(s, o, a)$ to denote the value of executing an action $a$ in the context of a state-option pair $(s, o)$. They are related as

$Q_\mathcal{O}(s, o) = \sum_a \pi_o(a \mid s) \, Q_U(s, o, a), \quad (4)$
$Q_U(s, o, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, U(s', o), \quad (5)$
$U(s', o) = \big(1 - \beta_o(s')\big) Q_\mathcal{O}(s', o) + \beta_o(s') V_\mathcal{O}(s'). \quad (6)$

Bacon et al. (2017) proposed a policy gradient method, the Option-Critic Architecture (OCA), to learn stochastic intra-option policies (parameterized by $\theta$) and termination functions (parameterized by $\nu$). The objective is to maximize the expected discounted return per episode, i.e., $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_{t+1}]$. Based on their Intra-option Policy Gradient Theorem and Termination Gradient Theorem, the per-step updates for $\theta$ and $\nu$ are

$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_{o_t}(a_t \mid s_t; \theta) \, Q_U(s_t, o_t, a_t),$
$\nu \leftarrow \nu - \alpha \, \nabla_\nu \beta_{o_t}(s_{t+1}; \nu) \big(Q_\mathcal{O}(s_{t+1}, o_t) - V_\mathcal{O}(s_{t+1})\big).$

And at each time step $t$, OCA takes one gradient descent step minimizing

$\big(y_t - Q_U(s_t, o_t, a_t)\big)^2 \quad (7)$

to update the critic $Q_U$, where

$y_t \doteq r_{t+1} + \gamma \Big[\big(1 - \beta_{o_t}(s_{t+1})\big) Q_\mathcal{O}(s_{t+1}, o_t) + \beta_{o_t}(s_{t+1}) \max_{o'} Q_\mathcal{O}(s_{t+1}, o')\Big].$

The update target $y_t$ is also used in Intra-option Q-learning (Sutton et al. 1999).
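The target $y_t$ can be computed directly from the quantities above; the following plain-Python sketch (with illustrative names, not the paper's code) spells it out.

```python
# A small sketch of the intra-option Q-learning target used by OCA's critic (Equation 7).
# `q_option[o]` stands for Q_O(s_{t+1}, o) and `beta` for beta_{o_t}(s_{t+1}).
def oca_target(r, gamma, beta, q_option, o_t):
    """r + gamma * [(1 - beta) * Q(s', o_t) + beta * max_o' Q(s', o')]."""
    continue_value = q_option[o_t]          # keep executing the current option
    switch_value = max(q_option.values())   # terminate and pick the best option
    return r + gamma * ((1 - beta) * continue_value + beta * switch_value)

# Example: two options; the current option o_t = 0 terminates with probability 0.3.
y = oca_target(r=1.0, gamma=0.99, beta=0.3, q_option={0: 2.0, 1: 3.0}, o_t=0)
```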

Model-based RL

In RL, a transition model usually takes a state-action pair as input and generates the immediate reward and the next state. A transition model reflects the dynamics of an environment and can either be given or learned. A transition model can be used to generate imaginary transitions for training the agent, increasing data efficiency (Sutton 1990; Yao et al. 2009; Gu et al. 2016). A transition model can also be used to reason about the future. When making a decision, an agent can perform a look-ahead tree search (Knuth & Moore 1975; Browne et al. 2012) with a transition model to maximize the possible rewards. A look-ahead tree search can be performed with either a perfect model (Coulom 2006; Sturtevant 2008; Silver et al. 2016) or a learned model (Weber et al. 2017).

A latent state is an abstraction of the original state. A latent state is also referred to as an abstract state (Oh et al. 2017) or an encoded state (Farquhar et al. 2018). A latent state can be used as the input of a transition model. Correspondingly, the transition model then predicts the next latent state, instead of the original next state. A latent state is particularly useful for high-dimensional state spaces (e.g., images), where a latent state is usually a low-dimensional vector.

Recently, several works demonstrated that learning a value prediction model instead of a transition model is effective for a look-ahead tree search. For example, VPN (Oh et al. 2017) predicted the value of the next latent state, instead of the next latent state itself. Although this value prediction was explicitly done in two phases (predicting the next latent state and then predicting its value), the loss of predicting the next latent state was not used in training. Instead, only the loss of the value prediction for the next latent state was used. Oh et al. (2017) showed that this value prediction model is particularly helpful in non-deterministic environments. TreeQN (Farquhar et al. 2018) adopted a similar idea, where only the outcome value of a look-ahead tree search is grounded in the loss. Farquhar et al. (2018) showed that grounding the predicted next latent state did not bring a performance boost. Although a value prediction model predicts much less information than a transition model, VPN and TreeQN demonstrated improved performance over baselines in challenging domains. In ACE, we follow these works and build a value prediction model similar to TreeQN.

The Actor Ensemble Algorithm

As discussed earlier, it is important for the actor to maximize the critic in DDPG. Lillicrap et al. (2015) trained the actor via gradient ascent. However, gradient ascent can easily be trapped by local maxima or saddle points.

To mitigate this issue, we propose an ensemble approach. In our work, we have $N$ actors $\{\mu_1, \dots, \mu_N\}$. At each time step $t$, ACE selects the action over the proposals of all the actors,

$a_t \doteq \arg\max_{a \in \{\mu_i(s_t)\}_{i=1,\dots,N}} Q(s_t, a). \quad (8)$

Our actors are trained in parallel. Assuming those actors are parameterized by $\{\theta_1, \dots, \theta_N\}$, each actor $\mu_i$ is trained such that, given a state $s$, the action $\mu_i(s; \theta_i)$ maximizes $Q(s, \cdot)$. We adopt the same kind of gradient ascent as DDPG. The gradient at time step $t$ is

$\nabla_a Q(s_t, a) \big|_{a = \mu_i(s_t; \theta_i)} \nabla_{\theta_i} \mu_i(s_t; \theta_i) \quad (9)$

for all $i \in \{1, \dots, N\}$. ACE initializes each actor independently, so that the actors are likely to cover different local maxima of $Q$. By considering the maximum action value of the proposed actions, the action selection in Equation (8) is more likely to find the global maxima of the critic $Q$ than a single actor. To train the critic $Q$ (parameterized by $w$), ACE takes one gradient descent step at each time step minimizing

$\Big(r_{t+1} + \gamma \max_{a' \in \{\mu_i(s_{t+1})\}_{i=1,\dots,N}} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w)\Big)^2. \quad (10)$

Note that our actors are not independent. They reinforce each other by influencing the critic update, which in turn gives them better policy gradients.
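Below is a hedged PyTorch sketch of Equations (8)-(10): ensemble action selection, per-actor gradient ascent on the critic, and the critic regression toward the best proposal at the next state. All module names and sizes are ours, and target networks are omitted for brevity.

```python
# A sketch of the ACE ensemble updates (Equations 8-10); not the paper's implementation.
import torch
import torch.nn as nn

state_dim, action_dim, num_actors = 8, 2, 5
actors = nn.ModuleList([
    nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim), nn.Tanh())
    for _ in range(num_actors)])                       # independently initialized actors
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actors.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def q(s, a):
    return critic(torch.cat([s, a], dim=-1))

def select_action(s):                                  # Equation (8): pick the proposal with the highest Q
    proposals = [mu(s) for mu in actors]
    values = torch.stack([q(s, a) for a in proposals])
    return proposals[values.argmax()]

def update(s, a, r, s_next):
    # Equation (10): critic regression toward the best proposal at s_next.
    with torch.no_grad():
        target = r + gamma * torch.stack([q(s_next, mu(s_next)) for mu in actors]).max(dim=0).values
    critic_loss = (target - q(s, a)).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Equation (9): every actor ascends the critic at its own proposal.
    actor_loss = -sum(q(s, mu(s)).mean() for mu in actors)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```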

An Option Perspective

Intuitively, each actor in ACE is similar to an option. To quantify the relationship between ACE and the option framework, we first extend OCA with deterministic intra-option policies, referred to as OCAD. For each option $o$, we use $\mu_o$ to denote its intra-option policy, which is a deterministic policy. The intra-option policies are parameterized by $\theta$. The termination functions are parameterized by $\nu$. We have

Theorem 1 (Deterministic Intra-option Policy Gradient)

Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. $\theta$ is:

$\sum_{s, o} \rho(s, o \mid s_0, o_0) \, \nabla_\theta \mu_o(s; \theta) \, \nabla_a Q_U(s, o, a) \big|_{a = \mu_o(s; \theta)},$

where $\rho(s, o \mid s_0, o_0) \doteq \sum_{k=0}^{\infty} \gamma^k P(s_k = s, o_k = o \mid s_0, o_0)$ is the limiting state-option pair distribution. Here $\gamma^k P(s_k = s, o_k = o \mid s_0, o_0)$ represents the $\gamma$-discounted probability of transitioning to $(s, o)$ from $(s_0, o_0)$ in $k$ steps.

Theorem 2 (Termination Policy Gradient)

Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. $\nu$ is:

$-\sum_{s', o} \rho(s', o \mid s_1, o_0) \, \nabla_\nu \beta_o(s'; \nu) \big(Q_\mathcal{O}(s', o) - V_\mathcal{O}(s')\big),$

where $\rho(s', o \mid s_1, o_0)$ is the corresponding limiting state-option pair distribution.

The proof of Theorem 1 follows a scheme similar to those of Sutton et al. (2000), Silver et al. (2014), and Bacon et al. (2017). The proof of Theorem 2 follows a scheme similar to that of Bacon et al. (2017). The conditions and proofs of both theorems are detailed in the Supplementary Material. The critic update of OCAD remains the same as in OCA (Equation 7).

We now show that the actor ensemble in ACE is a special setting of OCAD. The gradient update of the actors in ACE (Equation 9) can be justified via Theorem 1, and the critic update in ACE (Equation 10) is equivalent to the critic update in OCAD (Equation 13). We first consider a special setting where $\beta_o(s) \equiv 1$ for all $o \in \mathcal{O}$ and $s \in \mathcal{S}$, which means each option terminates at every time step. In this setting, the value of $Q_U(s, o, a)$ does not depend on $o$ (Equations 5 and 6). Based on this observation, we rewrite $Q_U(s, o, a)$ as $Q_U(s, a)$. We have

$Q_\mathcal{O}(s, o) = Q_U(s, o, \mu_o(s; \theta)) = Q_U(s, \mu_o(s; \theta)). \quad (11)$

With these value functions being the same, we rewrite the intra-option policy gradient update in OCAD according to Theorem 1 as

$\nabla_\theta \mu_{o_t}(s_t; \theta) \, \nabla_a Q_U(s_t, a) \big|_{a = \mu_{o_t}(s_t; \theta)}, \quad (12)$

and we rewrite the critic update in OCAD (Equation 7) as one gradient descent step minimizing

$\Big(r_{t+1} + \gamma \max_{o'} Q_U\big(s_{t+1}, \mu_{o'}(s_{t+1}; \theta)\big) - Q_U(s_t, a_t)\Big)^2. \quad (13)$

Now the actor-critic updates in OCAD (Equations 12 and 13) recover the actor-critic updates in ACE (Equations 9 and 10), revealing the relationship between the ensemble in ACE and OCAD.

Note that in the intra-option policy update of OCAD (Equation 12), only one intra-option policy is updated at each time step, while in the actor ensemble update (Equation 9), all actors are updated. Based on the intra-option policy update of OCAD, we propose a variant of ACE, named Alternative ACE (ACE-Alt), where only the selected actor is updated at each time step. In practice, we add exploration noise to each action and use experience replay to stabilize the training of the neural network function approximators, as in DDPG, resulting in off-policy learning.
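Continuing the hypothetical names from the previous sketch (actors, q, actor_opt), the difference between ACE and ACE-Alt amounts to which actors receive the gradient at each step:

```python
# ACE updates every actor at each step; ACE-Alt updates only the actor whose proposal was selected.
def actor_update(s, selected, alt=False):
    if alt:                                            # ACE-Alt: only the chosen actor
        loss = -q(s, actors[selected](s)).mean()
    else:                                              # ACE: all actors (Equation 9)
        loss = -sum(q(s, mu(s)).mean() for mu in actors)
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
```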

Model-based Enhancements

To refine the state-action value estimate, which is essential for actor selection, we utilize a look-ahead tree search with a learned value prediction model similar to TreeQN. TreeQN was developed for discrete action spaces. We extend TreeQN to continuous control problems via the actor ensemble, by searching over the actions proposed by the actors.

Formally speaking, we first define the following learnable functions:

  • $f^{enc}: \mathcal{S} \rightarrow \mathbb{R}^n$, an encoding function that transforms a state into an $n$-dimensional latent state, parameterized by $\theta^{enc}$

  • $f^{rew}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a reward prediction function that predicts the immediate reward given a latent state and an action, parameterized by $\theta^{rew}$

  • $f^{trans}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}^n$, a transition function that predicts the next latent state given a latent state and an action, parameterized by $\theta^{trans}$

  • $f^{q}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a value prediction function that computes the value of a pair of a latent state and an action, parameterized by $\theta^{q}$

  • $\mu_i: \mathbb{R}^n \rightarrow \mathcal{A}$, an actor that computes an action given a latent state, parameterized by $\theta^i$, for each $i \in \{1, \dots, N\}$

We use $\theta$ to denote the actor parameters $\{\theta^1, \dots, \theta^N\}$ and $w$ to denote the value-related parameters $\{\theta^{enc}, \theta^{rew}, \theta^{trans}, \theta^{q}\}$. Note the encoding function $f^{enc}$ is shared in our implementation.

We use $\bar{s}$ to denote the latent state $f^{enc}(s)$ and $q(\bar{s}, a)$ to denote $f^{q}(\bar{s}, a)$, where $\bar{s}$ is a latent state and $a \in \mathcal{A}$. Furthermore, $q(\bar{s}, a)$ can also be decomposed into the sum of the predicted immediate reward and the discounted value of the predicted next latent state, i.e.,

$q(\bar{s}, a) = f^{rew}(\bar{s}, a) + \gamma \, q(\bar{s}', a'), \quad (14)$

where

$\bar{s}' \doteq f^{trans}(\bar{s}, a), \quad (15)$
$a' \doteq \arg\max_{a'' \in \{\mu_i(\bar{s}')\}_{i=1,\dots,N}} q(\bar{s}', a''). \quad (16)$

We apply Equation (14) recursively $d$ times, with the state prediction and action selection in Equations (15, 16), resulting in a new estimator for $Q(s, a)$, which is defined as

$Q^d(s, a) \doteq q^d\big(f^{enc}(s), a\big), \quad (17)$

where

$q^0(\bar{s}, a) \doteq f^{q}(\bar{s}, a), \qquad q^l(\bar{s}, a) \doteq f^{rew}(\bar{s}, a) + \gamma \max_{a' \in \{\mu_i(\bar{s}')\}_{i=1,\dots,N}} q^{l-1}(\bar{s}', a'), \quad l = 1, \dots, d.$

The look-ahead tree search and backup process (Equation 17) are illustrated in Figure 1. The value $q^{l}(\bar{s}, a)$ stands for the state-action value estimate of a predicted latent state and action reached $d - l$ steps from $s$, with Equation (14) applied $l$ more times.
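Below is a framework-agnostic Python sketch of the depth-$d$ backup in Equation (17), using the notation reconstructed above; f_rew, f_trans, f_q, and actors are placeholders for the learned functions (returning plain floats for a single latent state), not the paper's code.

```python
def q_tree(latent, action, depth, f_rew, f_trans, f_q, actors, gamma=0.99):
    """Depth-`depth` estimate of the action value at a latent state (Equation 17)."""
    if depth == 0:
        return f_q(latent, action)                 # q^0: direct value prediction
    reward = f_rew(latent, action)                 # predicted immediate reward
    next_latent = f_trans(latent, action)          # Equation (15): latent transition
    # Equations (14, 16): expand each actor's proposal one step deeper, back up the max.
    backups = [q_tree(next_latent, mu(next_latent), depth - 1,
                      f_rew, f_trans, f_q, actors, gamma)
               for mu in actors]
    return reward + gamma * max(backups)
```

Calling q_tree(f_enc(s), a, d, ...) then plays the role of $Q^d(s, a)$; with depth 0 it reduces to the direct value prediction $f^{q}$.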

As $Q^d(s, a)$ is fully differentiable w.r.t. $\theta$ and $w$, we plug in $Q^d$ whenever we need $Q$. In particular, given a transition $(s_t, a_t, r_{t+1}, s_{t+1})$, the actor parameters $\theta$ follow the gradient in Equation (9) with $Q$ replaced by $Q^d$, and the value-related parameters $w$ follow the gradient of the loss in Equation (10) with $Q$ replaced by $Q^d$. We also ground the predicted reward in the first recursive expansion, as suggested by Farquhar et al. (2018), by additionally minimizing the squared error between $f^{rew}(f^{enc}(s_t), a_t)$ and $r_{t+1}$. ACE also utilizes experience replay and a target network similar to DDPG. The pseudo-code of ACE is provided in the Supplementary Material.

Figure 1: An example of the tree search (Equation 17) with two actors. We use $Q^d(s, a)$ as the estimate of $Q(s, a)$. Each node represents a predicted latent state some steps ahead of $s$, and the two actions expanded at each node are proposed by the two actors and depend on the latent state. Arrows represent the learned transition function, and brackets represent maximization (backup) operations. Reward prediction is omitted for simplicity.

Experiments

We designed experiments to answer the following questions:

  • Does ACE outperform baseline algorithms?

  • If so, how do the components of ACE contribute to the performance boost?

We used twelve continuous control tasks from Roboschool, a free port of MuJoCo (http://www.mujoco.org/) by OpenAI. A state in Roboschool contains joint information of a robot (e.g., speed or angle) and is represented as a vector of real numbers. An action in Roboschool is a vector, with each dimension being a real number in $[-1, 1]$. All our implementations are publicly available at https://github.com/ShangtongZhang/DeepRL.

ACE Architecture

In this section, we describe the parameterization of the functions in ACE for the Roboschool tasks. Each state $s$ was first transformed into a latent state by $f^{enc}$, which was parameterized as a single neural network layer. This latent state was used as the input to all other functions (i.e., $f^{rew}$, $f^{trans}$, $f^{q}$, and the actors $\{\mu_i\}$).

The networks for $f^{rew}$, $f^{trans}$, and $f^{q}$ are single-hidden-layer networks with 300 hidden units, taking as input the concatenation of a latent state and an action. In particular, the network for $f^{trans}$ used two residual connections, similar to Farquhar et al. (2018). The networks for the actors $\{\mu_i\}$ are single-hidden-layer networks with 400 input units and 300 hidden units, and all the actor networks shared a common first layer. The architecture of ACE is illustrated in Figure 4(a).

We used tanh as the activation function for the hidden units. (This choice will be further discussed in the next section.) We set the number of actors to 5 (i.e., $N = 5$) and the planning depth to 1 (i.e., $d = 1$).
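For reference, here is a hedged PyTorch sketch of this parameterization (single-layer 400-unit encoder, 300-unit tanh hidden layers, five actors sharing a first layer). The state and action dimensions, the residual wiring of the transition network, and all module names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, hidden, N = 26, 6, 400, 300, 5

# f_enc: a single layer mapping a state to a 400-dimensional latent state.
f_enc = nn.Sequential(nn.Linear(state_dim, latent_dim), nn.Tanh())

class LatentActionHead(nn.Module):
    """A single-hidden-layer head over the concatenation of a latent state and an action."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden),
                                 nn.Tanh(), nn.Linear(hidden, out_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

f_rew = LatentActionHead(1)          # reward prediction
f_q = LatentActionHead(1)            # value prediction

class Transition(nn.Module):
    """Transition predictor with a residual connection: predicts the change in latent state."""
    def __init__(self):
        super().__init__()
        self.head = LatentActionHead(latent_dim)
    def forward(self, z, a):
        return z + self.head(z, a)

f_trans = Transition()

# Actors: a shared 400 -> 300 first layer, followed by per-actor output layers.
shared = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh())
actors = nn.ModuleList([nn.Sequential(shared, nn.Linear(hidden, action_dim), nn.Tanh())
                        for _ in range(N)])
```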

Baseline Algorithms

Ddpg

In DDPG, Lillicrap et al. (2015) used two separate networks to parameterize the actor and the critic. Each network had 2 hidden layers with 400 and 300 hidden units respectively. Lillicrap et al. (2015) used the ReLU activation function (Nair & Hinton 2010) and applied $L_2$ regularization to the critic. However, our analysis experiments found that the tanh activation function outperformed ReLU with $L_2$ regularization. So throughout all our experiments, we always used the tanh activation function (without $L_2$ regularization) for all algorithms. All other hyper-parameter values were the same as in Lillicrap et al. (2015). All the other compared algorithms inherited the hyper-parameter values from DDPG without tuning. We used the same Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein 1930) as Lillicrap et al. (2015) for exploration in all the compared algorithms.

Wide-DDPG

ACE had more parameters than DDPG. To investigate the influence of the number of parameters, we implemented Wide-DDPG, where the hidden units were doubled (i.e., the two hidden layers had 800 and 600 units respectively). Wide-DDPG had a number of parameters comparable to ACE and the same depth as ACE.

Shared-DDPG

DDPG used separate networks for actor and critic, while the actor and critic in ACE shared a common representation layer. To investigate the influence of a shared representation, we implemented Shared-DDPG, where the actor and the critic shared a common bottom layer in the same manner as ACE.

Ensemble-DDPG

To investigate the influence of the tree search in ACE, we removed the tree search by setting $d = 0$, giving an ensemble version of DDPG, called Ensemble-DDPG. We still used 5 actors in Ensemble-DDPG.

Tm-Ace

To investigate the usefulness of the value prediction model, we implemented Transition-Model-ACE (TM-ACE), where we learn a transition model instead of a value prediction model. To be more specific, $f^{rew}$ and $f^{trans}$ were trained to fit sampled transitions from the replay buffer, minimizing the squared loss of the predicted reward and the predicted next latent state. This model was then used for a look-ahead tree search. The pseudo-code of TM-ACE is detailed in the Supplementary Material.

The architectures of all the above algorithms are illustrated in Supplementary Material.

Results

For each task, we trained each algorithm for 1 million steps. Every 10 thousand steps, we performed 20 deterministic evaluation episodes without exploration noise and computed the mean episode return. We report the best evaluation performance during training in Table 1, averaged over 5 independent runs. The full evaluation curves are reported in the Supplementary Material.

In summary, either ACE or ACE-Alt was among the best algorithms in 11 out of the 12 games. With ACE-Alt included in the comparison, ACE itself was among the best algorithms in 8 games; without considering ACE-Alt, ACE was among the best in 10 games. Likewise, with ACE included, ACE-Alt was among the best algorithms in 7 games; without considering ACE, ACE-Alt was among the best in 10 games. Overall, ACE was slightly better than ACE-Alt. However, ACE-Alt enjoyed lower variance than ACE. We conjecture this is because ACE involves more off-policy learning than ACE-Alt: off-policy learning improves data efficiency but increases variance.

Wide-DDPG outperformed DDPG in only 1 game, indicating that naively increasing the number of parameters does not guarantee a performance improvement. Shared-DDPG outperformed DDPG in only 2 games (lost in 2 games and tied in 8 games), showing that the shared representation contributes little to the overall performance of ACE. Ensemble-DDPG outperformed DDPG in 6 games (lost in 3 games and tied in 3 games), indicating that the DDPG agent benefits from an actor ensemble. This may be attributed to the fact that multiple actors are more likely to find the global maxima of the critic. ACE further outperformed Ensemble-DDPG in 9 games, indicating that the agent benefits from the look-ahead tree search with a value prediction model. In contrast, TM-ACE outperformed Ensemble-DDPG in only 2 games (lost in 3 games and tied in 7 games), indicating that a value prediction model is better suited than a transition model for a look-ahead tree search. This is consistent with the results observed earlier in VPN and TreeQN.

In conclusion, the actor ensemble and the look-ahead tree search with a learned value prediction model are key to the performance boost.

ACE and ACE-Alt improve performance in terms of environment steps but require more computation than vanilla DDPG. We benchmarked the wall-clock time of the algorithms; the results are reported in the Supplementary Material. We also verified the diversity of the actors in ACE and ACE-Alt in the Supplementary Material.

Ensemble Size and Planning Depth

In this section, we investigate how the ensemble size $N$ and the planning depth $d$ in ACE influence performance. We performed experiments in HalfCheetah with various $N$ and $d$, using the same evaluation protocol as before. As large $N$ and $d$ induce a significant increase in computation, we only used $N$ up to 10 and $d$ up to 2. The results are reported in Figure 2.

To summarize, moderate values of $N$ and $d$ achieved the best performance. We hypothesize there is a trade-off in the selection of both the ensemble size and the planning depth. On the one hand, a single actor can easily be trapped in local maxima during training; the more actors we have, the more likely we are to find the global maxima. On the other hand, all the actors share the same encoding function with the critic, and a large number of actors may dominate the training of the encoding function and harm the critic's learning. So a medium ensemble size is likely to achieve the best performance. A possible solution is to normalize the gradient according to the ensemble size, and we leave this for future work. With a perfect model, the more planning steps we have, the more accurate the estimate we can get. However, with a learned value prediction model, errors compound during unrolling. So a medium planning depth is likely to achieve the best performance. Similar phenomena were also observed in multi-step Dyna planning (Yao et al. 2009).

Figure 2: Evaluation curves of ACE in HalfCheetah with different $N$ and $d$. Each curve is averaged over 5 independent runs, and standard errors are plotted as shaded regions.

ACE ACE-Alt TM-ACE Ensemble-DDPG Shared-DDPG Wide-DDPG DDPG
Ant 1041(70.8) 983(36.8) 1031(55.6) 1026(87.2) 796(16.8) 871(19.9) 875(14.2)
HalfCheetah 1667(40.4) 1023(60.4) 800(28.8) 812(49.4) 771(79.6) 733(52.5) 703(37.3)
Hopper 2136(86.4) 1923(88.3) 1586(85.0) 1972(63.4) 2047(76.7) 2090(118.6) 2133(99.0)
Humanoid 380(56.1) 441(90.1) 61(6.8) 76(11.4) 53(2.2) 54(1.0) 54(1.7)
HF 311(30.3) 289(20.8) 126(39.6) 85(5.6) 53(1.0) 55(1.6) 53(1.4)
HFH 22(2.4) 20(2.2) -4(2.1) 2(7.5) 6(6.4) 15(3.2) 15(2.5)
IDP 7555(1610.9) 9356(1.1) 7549(1613.7) 4102(1923.2) 7549(1613.6) 7548(1618.7) 5662(1945.9)
IP 417(212.8) 1000(0.0) 415(213.4) 1000(0.0) 1000(0.0) 1000(0.0) 1000(0.0)
IPS 892(0.1) 892(0.2) 891(0.2) 891(0.2) 891(0.4) 546(308.7) 891(0.4)
Pong 12(0.3) 11(0.1) 6(0.7) 8(0.9) 4(0.2) 4(0.1) 5(0.8)
Reacher 16(0.7) 17(0.2) 17(0.5) 17(0.3) 20(0.7) 15(2.2) 18(0.9)
Walker2d 1659(65.9) 1864(21.4) 1086(97.0) 1142(99.5) 1142(146.3) 1185(121.2) 815(11.4)
Table 1: Best evaluation performance during training. Mean and standard error are reported, averaged over 5 independent runs. Bold numbers indicate the best performance. Numbers are rounded for ease of display. HF, HFH, IDP, IP, and IPS stand for HumanoidFlagrun, HumanoidFlagrunHarder, InvertedDoublePendulum, InvertedPendulum, and InvertedPendulumSwingup respectively.

Related Work

Continuous-action RL

Klissarov et al. (2017) extended OCA to continuous-action problems with the Proximal Policy Option Critic (PPOC) algorithm. However, PPOC considered stochastic intra-option policies, and each intra-option policy was trained via a policy search method. In ACE, we consider deterministic intra-option policies, and the intra-option policies are optimized under the same objective as OCA.

Gu et al. (2016) parameterized the $Q$ function in a quadratic form to deal with continuous control problems. In this way, the global maxima can be determined analytically. However, in general, the optimal value function does not necessarily take this quadratic form. In ACE, we use an actor ensemble to search for the global maxima of the $Q$ function. Gu et al. (2016) also utilized a transition model to generate imaginary data, which is orthogonal to ACE.

Ensemble in RL

Wiering and van Hasselt (2008) designed four ensemble methods combining five RL algorithms with a voting scheme based on the value functions of the different RL algorithms. Hans and Udluft (2010) used a network ensemble to improve the performance of Fitted Q-Iteration. Osband et al. (2016) used an ensemble to approximate Thompson sampling, resulting in improved exploration and a performance boost in challenging video games. Huang et al. (2017) used both an actor ensemble and a critic ensemble in continuous control problems. However, to the best of our knowledge, the present work is the first to relate ensembles to options and to use an ensemble for a look-ahead tree search in continuous control problems.

Closing Remarks

In this paper, we propose the ACE algorithm for continuous control problems. From an ensemble perspective, ACE utilizes an actor ensemble to search for the global maxima of a critic function. From an option perspective, ACE is a special option-critic algorithm with deterministic intra-option policies. Thanks to the actor ensemble, ACE is able to perform a look-ahead tree search with a learned value prediction model in continuous control problems, resulting in a significant performance boost in challenging robot simulation tasks.

Acknowledgement

The authors thank Bo Liu for feedback on the first draft of this paper. We also appreciate the insightful reviews of the anonymous reviewers.

References

Supplementary Material

Proof of Theorem 1

Under mild conditions (Bacon et al. 2017), the Markov chain underlying the state-option process is aperiodic and ergodic. We use the following augmented process defined by Bacon et al. (2017), which is homogeneous.

We compute the gradient as

(18)

From Equation (5), we have

(19)

Plugging Equation (19) into Equation (18) and expanding the definitions above, we have

(20)

Expanding Equation (20) recursively and applying the augmented process, we end up with

Proof of Theorem 2

Under mild conditions (Bacon et al. 2017), the Markov chain underlying the state-option process is aperiodic and ergodic. We use the following augmented process defined by Bacon et al. (2017), which is homogeneous.

We have

(21)

The gradient of $U$ w.r.t. $\nu$ is

(22)

Applying Equation (22) recursively and using the augmented process, we have

The ACE Algorithm

Input:
$N$: number of actors
$\mathcal{E}$: a noise process
$d$: plan depth
$\alpha, \beta$: two step sizes
$\{\mu_i\}_{i=1}^{N}, Q^d$: parameterized by $\theta$ and $w$ // see Section Model-based Enhancements
Output:
The parameters $\theta$ and $w$
Initialize the replay buffer $\mathcal{D}$
for each time step t do
       Observe the state $s_t$
       $a_t \leftarrow \arg\max_{a \in \{\mu_i(f^{enc}(s_t))\}_{i=1,\dots,N}} Q^d(s_t, a) + \epsilon_t$, with $\epsilon_t$ drawn from $\mathcal{E}$
       // see Equation 17 for $Q^d$
       Execute the action $a_t$, get reward $r_{t+1}$ and next state $s_{t+1}$
       Store $(s_t, a_t, r_{t+1}, s_{t+1})$ into $\mathcal{D}$
       Sample a batch of transitions from $\mathcal{D}$