In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control in reinforcement learning (RL). In continuous control, a deterministic policy (Silver et al. 2014), which is a mapping from states to actions, is a recent approach. In contrast, a stochastic policy is a mapping from states to probability distributions over the actions.
Recently, neural networks have achieved great success as function approximators in various challenging domains (Tesauro 1995; Mnih et al. 2015; Silver et al. 2016). A deterministic policy parameterized by a neural network is usually trained via gradient ascent to maximize the critic, which is a state-action value function parameterized by a neural network (Silver et al. 2014; Lillicrap et al. 2015; Barth-Maron et al. 2018). However, gradient ascent can easily be trapped by local maxima during the search for the global maximum. We utilize the ensemble technique to mitigate this issue. We train multiple actors (i.e., deterministic policies) in parallel, and each actor has a different initialization. In this way, each actor is in charge of maximizing the state-action value function in a local area. Different actors may be trapped in different local maxima. By considering the maximum state-action value of the actions proposed by all the actors, we are more likely to find the global maximum of the state-action value function than with a single actor.
ACE fits in with the option framework (Sutton, Precup, and Singh 1999). First, each option has its intra-option policy, which maximizes the return in a certain area of the state space. Similarly, an actor in ACE maximizes the critic in a certain area of the domain of the critic. It may be difficult for a single actor to maximize the critic in the whole domain due to the complexity of the manifold of the critic. In contrast, the action search is easier if we ask an actor to find the best action only in a local neighborhood of the action space. Second, we chain the outputs of all the actors to the critic, enabling a selection over the locally optimal action values. In this way, the critic works similarly to the policy over options in the option framework. We quantify this similarity between ensemble and options by extending the option-critic architecture (OCA, Bacon, Harb, and Precup 2017) with deterministic intra-option policies. Particularly, we provide the Deterministic Intra-option Policy Gradient theorem, based on which we show the actor ensemble in ACE is a special case of the general option-critic framework.
To make the state-action value function more accurate, which is essential in the actor selection, we perform a look-ahead tree search with the multiple actors. The look-ahead tree search has achieved great success in various discrete action problems (Knuth and Moore 1975; Browne et al. 2012; Silver et al. 2016; Oh, Singh, and Lee 2017; Farquhar et al. 2018). Recently, look-ahead tree search was extended to continuous-action problems. For example, Mansley, Weinstein, and Littman (2011) combined planning with adaptive discretization of a continuous action space, resulting in a performance boost in continuous bandit problems. Nitti, Belle, and De Raedt (2015) utilized probabilistic programming for planning in continuous action spaces. Yee, Lisý, and Bowling (2016) used kernel regression to generalize the values between explored actions and unexplored actions, resulting in a new Monte Carlo tree search algorithm. However, to the best of our knowledge, a general tree search algorithm for continuous-action problems is still missing. In ACE, we use the multiple actors as meta-actions to perform a tree search with the help of a learned value prediction model (Oh, Singh, and Lee 2017; Farquhar et al. 2018).
We demonstrate the superiority of ACE over DDPG and its variants empirically in Roboschool (https://github.com/openai/roboschool), a challenging physical robot environment.
In the rest of this paper, we first present some preliminaries of RL. Then we detail ACE and show some empirical results, after which we discuss some related work, followed by closing remarks.
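We consider a Markov Decision Process (MDP) which consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, and a discount factor $\gamma \in [0, 1)$. We use $\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ to denote a stochastic policy. At each time step $t$, an agent is at state $S_t$ and selects an action $A_t \sim \pi(\cdot | S_t)$. Then the environment gives the reward $R_{t+1}$ and leads the agent to the next state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$. We use $q_\pi$ to denote the state-action value of a policy $\pi$, i.e., $q_\pi(s, a) \doteq \mathbb{E}_\pi[\sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k} \mid S_t = s, A_t = a]$. In an RL problem, we are usually interested in finding an optimal policy $\pi^*$ s.t. $q_{\pi^*}(s, a) \geq q_\pi(s, a)$ for all $(\pi, s, a)$. All the optimal policies share the same optimal state-action value function $q_*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$,
$$\mathcal{T}Q(s, a) \doteq r(s, a) + \gamma \sum_{s'} p(s' | s, a) \max_{a'} Q(s', a'),$$
where we use $Q$ to indicate an estimate of $q_*$. With tabular representation, Q-learning (Watkins and Dayan 1992) is a commonly used method to find this fixed point. The per-step update is
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\big),$$
where $\alpha$ is a step size.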
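The tabular Q-learning update can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the dictionary-based table, the step size `alpha`, the `done` flag, and the toy states `"s0"` and `"s1"` are assumptions made for the example.

```python
from collections import defaultdict


def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical two-action problem; unseen entries of the table default to 0.
Q = defaultdict(float)
actions = [0, 1]
td = q_learning_step(Q, s="s0", a=1, r=1.0, s_next="s1", actions=actions)
```

The explicit `max` over `actions` is what becomes problematic once the action space is continuous, which motivates the deterministic policy methods discussed next.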
Recently, Mnih et al. (2015) used a deep convolutional neural network to parameterize an estimate $Q$, resulting in the Deep-Q-Network (DQN). At each time step $t$, DQN samples a transition
$$(s, a, r, s')$$
from a replay buffer (Lin 1992) and performs stochastic gradient descent to minimize the loss
$$\frac{1}{2}\big(r + \gamma \max_{a'} Q^{-}(s', a') - Q(s, a)\big)^2,$$
where $Q^{-}$ is a target network (Mnih et al. 2015), which is a copy of $Q$ and is synchronized with $Q$ periodically.
Continuous Action Control
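In continuous control problems, it is not straightforward to apply Q-learning, because the maximization over actions in the update target is itself a nontrivial optimization problem. The basic idea of the deterministic policy algorithms (Silver et al. 2014) is to use a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ to approximate the greedy action $\arg\max_a Q(s, a)$. The deterministic policy is trained via gradient ascent according to the chain rule. The gradient per step is
$$\nabla_{\theta^\mu} Q(S_t, \mu(S_t)) = \nabla_a Q(S_t, a)|_{a = \mu(S_t)} \nabla_{\theta^\mu} \mu(S_t), \tag{3}$$
where we assume $\mu$ is parameterized by $\theta^\mu$. With this actor $\mu$, we are able to update the state-action value function $Q$ as usual. To be more specific, assuming $Q$ is parameterized by $\theta^Q$, we perform gradient descent to minimize the following loss
$$\frac{1}{2}\big(R_{t+1} + \gamma Q(S_{t+1}, \mu(S_{t+1})) - Q(S_t, A_t)\big)^2.$$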
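The chain-rule actor gradient can be illustrated on a toy problem where every derivative is available in closed form. The quadratic critic, the linear actor with a single scalar parameter `theta`, the learning rate, and the sample states below are all assumptions chosen for the sketch; in DDPG both functions are neural networks and the derivatives come from automatic differentiation.

```python
def critic(s, a):
    # Toy critic: for a fixed state s, the value is maximized at a = 2 * s.
    return -(a - 2.0 * s) ** 2


def dcritic_da(s, a):
    # Analytic derivative of the toy critic with respect to the action.
    return -2.0 * (a - 2.0 * s)


def actor(theta, s):
    # Toy deterministic policy with one scalar parameter.
    return theta * s


def dactor_dtheta(theta, s):
    return s


def dpg_gradient(theta, s):
    """Chain rule: dQ/dtheta = dQ/da at a = mu(s), times dmu/dtheta."""
    a = actor(theta, s)
    return dcritic_da(s, a) * dactor_dtheta(theta, s)


theta = 0.0
states = [0.5, -1.0, 1.5]  # hypothetical sampled states
for _ in range(200):
    for s in states:
        theta += 0.05 * dpg_gradient(theta, s)  # gradient ascent on the critic
```

With this critic the greedy action is `2 * s`, so gradient ascent drives `theta` towards 2, which is the sense in which the actor approximates the greedy action.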
Recently, Lillicrap et al. (2015) used neural networks to parameterize the actor and the critic, resulting in the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy control algorithm with experience replay and a target network.
In the on-policy setting, the gradient in Equation (3) guarantees policy improvement thanks to the Deterministic Policy Gradient theorem (Silver et al. 2014). In the off-policy setting, the policy improvement of this gradient is based on the Off-policy Policy Gradient theorem (OPG, Degris, White, and Sutton 2012). However, the errata of Degris, White, and Sutton (2012) show that OPG only holds for tabular function representation and certain linear function approximation. So far no policy improvement is guaranteed for DDPG. Nevertheless, DDPG and its variants have achieved great success in various challenging domains (Lillicrap et al. 2015; Barth-Maron et al. 2018; Tassa et al. 2018). This success may be attributed to the gradient ascent via the chain rule in Equation (3).
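An option $\omega$ is a triple, $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$, and we use $\Omega$ to indicate the option set. We use $\mathcal{I}_\omega \subseteq \mathcal{S}$ to denote the initiation set of $\omega$, indicating where the option $\omega$ can be initiated. In this paper, we consider $\mathcal{I}_\omega \equiv \mathcal{S}$ for all $\omega \in \Omega$, meaning that each option can be initiated at all states. We use $\pi_\omega: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ to denote the intra-option policy of $\omega$. Once the option $\omega$ is committed, the action selection is based on $\pi_\omega$. We use $\beta_\omega: \mathcal{S} \rightarrow [0, 1]$ to denote the termination function of $\omega$. At each time step $t$, the agent terminates the previous option $\omega_{t-1}$ with probability $\beta_{\omega_{t-1}}(S_t)$. In this paper, we consider the call-and-return option execution model (Sutton, Precup, and Singh 1999), where an agent executes an option until the option terminates.
An MDP augmented with options forms a Semi-MDP (Sutton, Precup, and Singh 1999). We use $\pi_\Omega: \Omega \times \mathcal{S} \rightarrow [0, 1]$ to denote the policy over options. We use $Q_\Omega(s, \omega)$ to denote the option-value function and $V_\Omega(s)$ to denote the value function of $\pi_\Omega$. Furthermore, we use $U(\omega, s')$ to denote the option value upon arrival at the state-option pair $(s', \omega)$ and $Q_U(s, \omega, a)$ to denote the value of executing an action $a$ in the context of a state-option pair $(s, \omega)$. They are related as
$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a | s) Q_U(s, \omega, a),$$
$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) U(\omega, s'), \tag{4}$$
$$U(\omega, s') = (1 - \beta_\omega(s')) Q_\Omega(s', \omega) + \beta_\omega(s') V_\Omega(s'). \tag{5}$$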
Bacon, Harb, and Precup (2017) proposed a policy gradient method, the Option-Critic Architecture (OCA), to learn stochastic intra-option policies (parameterized by ) and termination functions (parameterized by ). The objective is to maximize the expected discounted return per episode, i.e., . Based on their Intra-option Policy Gradient Theorem and Termination Gradient Theorem, the per step updates for and are
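At each time step $t$, OCA takes one gradient descent step minimizing
$$\frac{1}{2}\big(g_t - Q_U(S_t, \omega_t, A_t)\big)^2 \tag{7}$$
to update the critic $Q_U$, where
$$g_t \doteq R_{t+1} + \gamma \Big[(1 - \beta_{\omega_t}(S_{t+1})) Q_\Omega(S_{t+1}, \omega_t) + \beta_{\omega_t}(S_{t+1}) \max_{\omega'} Q_\Omega(S_{t+1}, \omega')\Big].$$
The update target $g_t$ is also used in Intra-option Q-learning (Sutton, Precup, and Singh 1999).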
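As a rough sketch, the intra-option update target can be computed as below. The form of the target (continue the current option with probability one minus its termination probability, otherwise switch to the best option) follows the standard option-critic formulation and is an assumption here; the function and argument names are hypothetical.

```python
def intra_option_target(reward, gamma, beta_next, q_next, omega):
    """One-step target for the option omega executed at the previous step.

    beta_next: termination probability of omega at the next state.
    q_next:    option values at the next state, one entry per option.
    """
    continue_value = q_next[omega]  # omega keeps running
    switch_value = max(q_next)      # omega terminates, the best option takes over
    return reward + gamma * ((1.0 - beta_next) * continue_value + beta_next * switch_value)


# Hypothetical numbers: two options, option 0 was executing.
never_terminates = intra_option_target(1.0, 0.9, 0.0, [1.0, 3.0], omega=0)
always_terminates = intra_option_target(1.0, 0.9, 1.0, [1.0, 3.0], omega=0)
```

When the termination probability is 1, the target no longer depends on which option was executing, since it reduces to the reward plus the discounted maximum option value.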
In RL, a transition model usually takes as inputs a state-action pair and generates the immediate reward and the next state. A transition model reflects the dynamics of an environment and can either be given or learned. A transition model can be used to generate imaginary transitions for training the agent to increase data efficiency (Sutton 1990; Yao et al. 2009; Gu et al. 2016). A transition model can also be used to reason about the future. When making a decision, an agent can perform a look-ahead tree search (Knuth and Moore 1975; Browne et al. 2012) with a transition model to maximize the possible rewards. A look-ahead tree search can be performed with either a perfect model (Coulom 2006; Sturtevant 2008; Silver et al. 2016) or a learned model (Weber et al. 2017).
A latent state is an abstraction of the original state. A latent state is also referred to as an abstract state (Oh, Singh, and Lee 2017) or an encoded state (Farquhar et al. 2018). A latent state can be used as the input of a transition model. Correspondingly, the transition model then predicts the next latent state, instead of the original next state. A latent state is particularly useful for high dimensional state space (e.g., images), where a latent state is usually a low dimensional vector.
Recently, some works demonstrated that learning a value prediction model instead of a transition model is effective for a look-ahead tree search. For example, VPN (Oh, Singh, and Lee 2017) predicted the value of the next latent state, instead of the next latent state itself. Although this value prediction was explicitly done in two phases (predicting the next latent state and then predicting its value), the loss of predicting the next latent state was not used in training. Instead, only the loss of the value prediction for the next latent state was used. Oh, Singh, and Lee (2017) showed this value prediction model is particularly helpful for non-deterministic environments. TreeQN (Farquhar et al. 2018) adopted a similar idea, where only the outcome value of a look-ahead tree search is grounded in the loss. Farquhar et al. (2018) showed that grounding the predicted next latent state did not bring in a performance boost. Although a value prediction model predicts much less information than a transition model, VPN and TreeQN demonstrated improved performance over baselines in challenging domains. This value prediction model is particularly helpful for a look-ahead tree search in non-deterministic environments. In ACE, we follow these works and build a value prediction model similar to TreeQN.
The Actor Ensemble Algorithm
As discussed earlier, it is important for the actor to maximize the critic in DDPG. Lillicrap et al. (2015) trained the actor via gradient ascent. However, gradient ascent can easily be trapped by local maxima or saddle points.
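To mitigate this issue, we propose an ensemble approach. In our work, we have $N$ actors $\{\mu_1, \dots, \mu_N\}$. At each time step $t$, ACE selects the action over the proposals by all the actors,
$$A_t \doteq \arg\max_{a \in \{\mu_i(S_t)\}_{i = 1, \dots, N}} Q(S_t, a). \tag{8}$$
Our actors are trained in parallel. Assuming those actors are parameterized by $\theta^{\mu_1}, \dots, \theta^{\mu_N}$, each actor $\mu_i$ is trained such that given a state $s$ the action $\mu_i(s)$ maximizes $Q(s, \mu_i(s))$. We adopt the same gradient ascent as DDPG. The gradient at time step $t$ is
$$\nabla_{\theta^{\mu_i}} Q(S_t, \mu_i(S_t)) = \nabla_a Q(S_t, a)|_{a = \mu_i(S_t)} \nabla_{\theta^{\mu_i}} \mu_i(S_t) \tag{9}$$
for all $i \in \{1, \dots, N\}$. ACE initializes each actor independently, so that the actors are likely to cover different local maxima of $Q$. By considering the maximum action value of the proposed actions, the action in Equation (8) is more likely to be a global maximizer of the critic $Q$ than the action proposed by a single actor. To train the critic $Q$, ACE takes one gradient descent step at each time step $t$ minimizing
$$\frac{1}{2}\Big(R_{t+1} + \gamma \max_{i} Q(S_{t+1}, \mu_i(S_{t+1})) - Q(S_t, A_t)\Big)^2. \tag{10}$$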
Note that our actors are not independent. They reinforce each other by influencing the critic update, which in turn gives them better policy gradients.
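The benefit of selecting among several independently initialized actors can be seen in a toy setting. In the sketch below, the state is held fixed, each "actor" is reduced to a single scalar action refined by gradient ascent, and the multimodal critic, learning rate, and initial values are all assumptions made for illustration; it is not the ACE implementation, and the critic here is fixed rather than learned.

```python
import math


def critic(a):
    # Toy multimodal critic over a scalar action, with several local maxima.
    return math.sin(3.0 * a) - 0.1 * a * a


def dcritic_da(a):
    return 3.0 * math.cos(3.0 * a) - 0.2 * a


def train_actor(a_init, lr=0.05, steps=500):
    """Gradient ascent on the critic, starting from one initialization."""
    a = a_init
    for _ in range(steps):
        a += lr * dcritic_da(a)
    return a


# Five "actors" that differ only in their initialization.
inits = [-2.5, -1.2, 0.0, 1.5, 3.0]
proposals = [train_actor(a0) for a0 in inits]

# Ensemble action selection: the proposal with the highest critic value.
selected = max(proposals, key=critic)

# A single actor started at -2.5 stays in a worse local maximum.
single = proposals[0]
```

Here the actors settle in three different local maxima, and taking the maximum over the proposals recovers the best of them, while the single actor started at -2.5 remains stuck at a lower value.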
An Option Perspective
Intuitively, each actor in ACE is similar to an option. To quantify the relationship between ACE and the option framework, we first extend OCA with deterministic intra-option policies, referred to as OCAD. For each option , we use to denote its intra-option policy, which is a deterministic policy. The intra-option policies are parameterized by . The termination functions are parameterized by . We have
Theorem 1 (Deterministic Intra-option Policy Gradient)
Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. is:
where is the limiting state-option pair distribution. Here represents the -discounted probability of transitioning to from in steps.
Theorem 2 (Termination Policy Gradient)
Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. is:
The proof of Theorem 1 follows a scheme similar to that of Sutton et al. (2000), Silver et al. (2014), and Bacon, Harb, and Precup (2017). The proof of Theorem 2 follows a scheme similar to that of Bacon, Harb, and Precup (2017). The conditions and proofs of both theorems are detailed in Supplementary Material. The critic update of OCAD remains the same as OCA (Equation 7).
We now show that the actor ensemble in ACE is a special setting of OCAD. The gradient update of the actors in ACE (Equation 9) can be justified via Theorem 1. The critic update in ACE (Equation 10) is equivalent to the critic update in OCAD (Equation 13). We first consider a special setting, where $\beta_\omega(s) \equiv 1$ for all $\omega$ and $s$, which means each option terminates at every time step. In this setting, the value of $Q_U(s, \omega, a)$ does not depend on $\omega$ (Equations 4 and 5). Based on this observation, we rewrite $Q_U(s, \omega, a)$ as $Q(s, a)$. We have
With the three s being the same, we rewrite the intra-policy gradient update in OCAD according to Theorem 1 as
And we rewrite the critic update in OCAD (Equation 7) as
Note that in the intra-option policy update of OCAD (Equation 12) only one intra-option policy is updated at each time step, while in the actor ensemble update (Equation 9) all actors are updated. Based on the intra-option policy update of OCAD, we propose a variant of ACE, named Alternative ACE (ACE-Alt), where only the selected actor is updated at each time step. In practice, we add exploration noise to each action and use experience replay to stabilize the training of the neural network function approximators as in DDPG, resulting in off-policy learning.
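The difference between ACE and ACE-Alt comes down to which actors receive a gradient step. A minimal sketch, assuming the critic values of the actors' proposals at the current state are already available in a list (the function names and the string labels for the two variants are hypothetical):

```python
def select_actor(q_values):
    """Index of the actor whose proposed action has the highest critic value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def actors_to_update(q_values, variant="ACE"):
    """Indices of the actors that receive a gradient step at this time step."""
    if variant == "ACE":
        # Every actor in the ensemble is updated.
        return list(range(len(q_values)))
    if variant == "ACE-Alt":
        # Only the actor that proposed the selected action is updated.
        return [select_actor(q_values)]
    raise ValueError("unknown variant: " + variant)


# Hypothetical critic values for the proposals of three actors.
q_values = [0.2, 0.9, 0.5]
all_actors = actors_to_update(q_values, "ACE")
only_selected = actors_to_update(q_values, "ACE-Alt")
```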
To refine the state-action value function estimation, which is essential for actor selection, we utilize a look-ahead tree search method with a learned value prediction model similar to TreeQN. TreeQN was developed for discrete action space. We extend TreeQN to continuous control problems via the actor ensemble by searching over the actions proposed by the actors.
Formally speaking, we first define the following learnable functions:
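- $f_{enc}: \mathcal{S} \rightarrow \mathbb{R}^n$, an encoding function that transforms a state into an $n$-dimensional latent state, parameterized by $\theta^{enc}$
- $f_{rew}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a reward prediction function that predicts the immediate reward given a latent state and an action, parameterized by $\theta^{rew}$
- $f_{trans}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}^n$, a transition function that predicts the next latent state given a latent state and an action, parameterized by $\theta^{trans}$
- $f_{q}: \mathbb{R}^n \times \mathcal{A} \rightarrow \mathbb{R}$, a value prediction function that computes the value for a pair of a latent state and an action, parameterized by $\theta^{q}$
- $f_{\mu_i}: \mathbb{R}^n \rightarrow \mathcal{A}$, an actor that computes an action given a latent state, parameterized by $\theta^{\mu_i}$, for each $i \in \{1, \dots, N\}$
We use $Q(s, a)$ to denote $f_q(f_{enc}(s), a)$ and $\mu_i(s)$ to denote $f_{\mu_i}(f_{enc}(s))$. Note the encoding function is shared in our implementation.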
We use to represent and to represent , where is a latent state and . Furthermore, can also be decomposed into the sum of the predicted immediate reward and the value of the predicted next latent state, i.e.,
The look-ahead tree search and backup process (Equation 17) are illustrated in Figure 1. The value of stands for the state-action value estimation for the predicted latent state and action after steps from , with Equation (14) applied times.
As is fully differentiable w.r.t. , we plug in whenever we need . We also ground the predicted reward in the first recursive expansion as suggested by Farquhar et al. (2018). To summarize, given a transition , the gradients for and are
respectively. ACE also utilizes experience replay and a target network similar to DDPG. The pseudo-code of ACE is provided in Supplementary Material.
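The look-ahead over actor proposals can be sketched as a short recursion in latent space. This is an illustrative reading of the description above, not the authors' pseudo-code: the exact backup used in ACE may differ, and the scalar latent state, the toy model functions, and the two hand-written actors are assumptions made for the example.

```python
def q_lookahead(z, a, depth, f_rew, f_trans, f_q, actors, gamma):
    """Estimate the value of (z, a) by expanding the tree `depth` times."""
    if depth == 0:
        return f_q(z, a)  # leaf: plain value prediction
    z_next = f_trans(z, a)  # predicted next latent state
    # Each actor proposes one action at the predicted state; back up the best one.
    best_next = max(
        q_lookahead(z_next, mu(z_next), depth - 1, f_rew, f_trans, f_q, actors, gamma)
        for mu in actors
    )
    return f_rew(z, a) + gamma * best_next


def select_action(z, depth, f_rew, f_trans, f_q, actors, gamma):
    """Pick the actor proposal with the highest look-ahead value."""
    proposals = [mu(z) for mu in actors]
    return max(
        proposals,
        key=lambda a: q_lookahead(z, a, depth, f_rew, f_trans, f_q, actors, gamma),
    )


# Toy scalar model (all hypothetical): the agent is rewarded for reaching z = 0.
f_trans = lambda z, a: z + a
f_rew = lambda z, a: -(z + a) ** 2
f_q = lambda z, a: -(z + a) ** 2
actors = [lambda z: -z, lambda z: 1.0]

action = select_action(2.0, 1, f_rew, f_trans, f_q, actors, gamma=0.5)
```

The tree has one branch per actor at every level, so the cost of a search grows as the number of actors raised to the planning depth, which is why small depths are used in practice.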
We designed experiments to answer the following questions:
Does ACE outperform baseline algorithms?
If so, how do the components of ACE contribute to the performance boost?
We used twelve continuous control tasks from Roboschool, a free port of MuJoCo (http://www.mujoco.org/) by OpenAI. A state in Roboschool contains joint information of a robot (e.g., speed or angle) and is presented as a vector of real numbers. An action in Roboschool is a vector, with each dimension being a real number in $[-1, 1]$. All the implementations are made publicly available (https://github.com/ShangtongZhang/DeepRL).
In this section we describe the parameterization of and for Roboschool tasks. For each state , we first transformed it into a latent state by , which was parameterized as a single neural network layer. This latent state was used as the input for all other functions (i.e., ).
The networks for the reward prediction, transition, and value prediction functions are single hidden layer networks with $400 + m$ input units and 300 hidden units, taking as inputs the concatenation of a latent state and an $m$-dimensional action. Particularly, the network for the transition function used two residual connections similar to Farquhar et al. (2018). The networks for the actors are single hidden layer networks with 400 input units and 300 hidden units, and all the actor networks shared a common first layer. The architecture of ACE is illustrated in Figure 4(a). We used $\tanh$ as the activation function for the hidden units. (This selection will be further discussed in the next section.) We set the number of actors to 5 and the planning depth to 1.
In DDPG, Lillicrap et al. (2015) used two separate networks to parameterize the actor and the critic. Each network had 2 hidden layers with 400 and 300 hidden units respectively. Lillicrap et al. (2015) used the ReLU activation function (Nair and Hinton 2010) and applied an $L_2$ regularization to the critic. However, our analysis experiments found that the $\tanh$ activation function outperformed ReLU with $L_2$ regularization. So throughout all our experiments, we always used the $\tanh$ activation function (without $L_2$ regularization) for all algorithms. All other hyper-parameter values were the same as Lillicrap et al. (2015). All the other compared algorithms inherited the hyper-parameter values from DDPG without tuning. We used the same Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) as Lillicrap et al. (2015) for exploration in all the compared algorithms.
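For readers unfamiliar with Ornstein-Uhlenbeck exploration noise, a minimal discretized sketch follows. The defaults `theta=0.15` and `sigma=0.2` are the values reported by Lillicrap et al. (2015); the class name, the unit time step `dt`, the zero mean, and the clipping of the noisy action are assumptions of this sketch rather than details confirmed by the text.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, e.g. at the beginning of an episode.
        self.x = [self.mu] * self.size

    def sample(self):
        self.x = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


# Perturb a hypothetical 3-dimensional action and clip it to the valid range.
noise = OrnsteinUhlenbeckNoise(size=3, seed=0)
action = [0.1, -0.4, 0.9]
noisy_action = [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]
```

Because each sample depends on the previous one, consecutive perturbations are correlated, which tends to produce smoother exploration in physical control tasks than independent Gaussian noise.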
ACE had more parameters than DDPG. To investigate the influence of the number of parameters, we implemented Wide-DDPG, where the hidden units were doubled (i.e., the two hidden layers had 800 and 600 units respectively). Wide-DDPG had a number of parameters comparable to ACE and the same depth as ACE.
DDPG used separate networks for actor and critic, while the actor and critic in ACE shared a common representation layer. To investigate the influence of a shared representation, we implemented Shared-DDPG, where the actor and the critic shared a common bottom layer in the same manner as ACE.
To investigate the influence of the tree search in ACE, we removed the tree search in ACE by setting the planning depth to 0, giving an ensemble version of DDPG, called Ensemble-DDPG. We still used 5 actors in Ensemble-DDPG.
To investigate the usefulness of the value prediction model, we implemented Transition-Model-ACE (TM-ACE), where we learn a transition model instead of a value prediction model. To be more specific, and were trained to fit sampled transitions from the replay buffer to minimize the squared loss of the predicted reward and the predicted next latent state. This model was then used for a look-ahead tree search. The pseudo-code of TM-ACE is detailed in Supplementary Material.
The architectures of all the above algorithms are illustrated in Supplementary Material.
For each task, we trained each algorithm for 1 million steps. At every 10 thousand steps, we performed 20 deterministic evaluation episodes without exploration noise and computed the mean episode return. We report the best evaluation performance during training in Table 1, which is averaged over 5 independent runs. The full evaluation curves are reported in Supplementary Material.
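The reported statistic can be computed as in the sketch below. One reading of the protocol is assumed here, namely that the best mean evaluation return is taken within each run and the per-run bests are then averaged; the nested-list layout and the numbers are hypothetical.

```python
def best_evaluation(checkpoints):
    """Best mean episode return over all evaluation points of a single run.

    checkpoints: one list of episode returns per evaluation point.
    """
    return max(sum(returns) / len(returns) for returns in checkpoints)


def reported_score(runs):
    """Average of the per-run best evaluation performance."""
    bests = [best_evaluation(run) for run in runs]
    return sum(bests) / len(bests)


# Two hypothetical runs, each with two evaluation points of two episodes.
runs = [
    [[1.0, 3.0], [4.0, 6.0]],
    [[2.0, 2.0], [0.0, 2.0]],
]
score = reported_score(runs)
```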
In summary, either ACE or ACE-Alt was placed among the best algorithms in 11 out of the 12 games. ACE itself was placed among the best algorithms in 8 games when ACE-Alt was included in the comparison. Without considering ACE-Alt, ACE was placed among the best algorithms in 10 games. ACE-Alt itself was placed among the best algorithms in 7 games when ACE was included in the comparison. Without considering ACE, ACE-Alt was placed among the best algorithms in 10 games. Overall, ACE was slightly better than ACE-Alt. However, ACE-Alt enjoyed lower variance than ACE. We conjecture this is because ACE had more off-policy learning than ACE-Alt. Off-policy learning improved data efficiency but increased variance.
Wide-DDPG outperformed DDPG in only 1 game, indicating that naively increasing the number of parameters does not guarantee performance improvement. Shared-DDPG outperformed DDPG in only 2 games (lost in 2 games and tied in 8 games), showing that the shared representation contributes little to the overall performance of ACE. Ensemble-DDPG outperformed DDPG in 6 games (lost in 3 games and tied in 3 games), indicating that the DDPG agent benefits from an actor ensemble. This may be attributed to the fact that multiple actors are more likely to find the global maximum of the critic. ACE further outperformed Ensemble-DDPG in 9 games, indicating that the agent benefits from the look-ahead tree search with a value prediction model. In contrast, TM-ACE outperformed Ensemble-DDPG in only 2 games (lost in 3 games and tied in 7 games), indicating that a value prediction model is better than a transition model for a look-ahead tree search. This was also consistent with the results observed earlier in VPN and TreeQN.
In conclusion, the actor ensemble and the look-ahead tree search with a learned value prediction model are key to the performance boost.
ACE and ACE-Alt increase performance in terms of environment steps but require more computation than vanilla DDPG. We benchmarked the wall time for the algorithms. The results are reported in Supplementary Material. We also verified the diversity of the actors in ACE and ACE-Alt in Supplementary Material.
Ensemble Size and Planning Depth
In this section, we investigate how the ensemble size and the planning depth in ACE influence the performance. We performed experiments in HalfCheetah with various and and used the same evaluation protocol as before. As a large and induced a significant computation increase, we only used up to 10 and up to 2. The results are reported in Figure 2.
To summarize, and achieved the best performance. We hypothesize there is a trade-off in the selection of both the ensemble size and the planning depth. On the one hand, a single actor can easily be trapped in local maxima during training. The more actors we have, the more likely we are to find the global maximum. On the other hand, all the actors share the same encoding function with the critic. A large number of actors may dominate the training of the encoding function and damage the critic learning. So a medium ensemble size is likely to achieve the best performance. A possible solution is to normalize the gradient according to the ensemble size, and we leave this for future work. With a perfect model, the more planning steps we have, the more accurate the estimate we can get. However, with a learned value prediction model, errors compound during unrolling. So a medium planning depth is likely to achieve the best performance. Similar phenomena were also observed in multi-step Dyna planning (Yao et al. 2009).
Klissarov et al. (2017) extended OCA into continuous action problems with the Proximal Policy Option Critic (PPOC) algorithm. However, PPOC considered stochastic intra-option policies, and each intra-option policy was trained via a policy search method. In ACE, we consider deterministic intra-option policies, and the intra-option policies are optimized under the same objective as OCA.
Gu et al. (2016) parameterized the $Q$ function in a quadratic form to deal with continuous control problems. In this way, the global maximum can be determined analytically. However, in general, the optimal value function does not necessarily fall into this quadratic form. In ACE, we use an actor ensemble to search for the global maximum of the $Q$ function. Gu et al. (2016) also utilized a transition model to generate imaginary data, which is orthogonal to ACE.
Ensemble in RL
Wiering and Van Hasselt (2008) designed four ensemble methods combining five RL algorithms with a voting scheme based on value functions of different RL algorithms. Hans and Udluft (2010) used a network ensemble to improve the performance of Fitted Q-Iteration. Osband et al. (2016) used an ensemble to approximate Thompson sampling, resulting in improved exploration and a performance boost in challenging video games. Huang et al. (2017) used both an actor ensemble and a critic ensemble in continuous control problems. However, to the best of our knowledge, the present work is the first to relate ensembles with options and to use an ensemble for a look-ahead tree search in continuous control problems.
In this paper, we propose the ACE algorithm for continuous control problems. From an ensemble perspective, ACE utilizes an actor ensemble to search for the global maximum of a critic function. From an option perspective, ACE is a special option-critic algorithm with deterministic intra-option policies. Thanks to the actor ensemble, ACE is able to perform a look-ahead tree search with a learned value prediction model in continuous control problems, resulting in a significant performance boost in challenging robot manipulation tasks.
The authors thank Bo Liu for feedback on the first draft of this paper. We also appreciate the insightful reviews of the anonymous reviewers.
- [Bacon, Harb, and Precup2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
- [Barth-Maron et al.2018] Barth-Maron, G.; Hoffman, M. W.; Budden, D.; Dabney, W.; Horgan, D.; Muldal, A.; Heess, N.; and Lillicrap, T. 2018. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
- [Browne et al.2012] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games.
- [Coulom2006] Coulom, R. 2006. Efficient selectivity and backup operators in monte-carlo tree search. In Proceedings of the International Conference on Computers and Games.
- [Degris, White, and Sutton2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
- [Farquhar et al.2018] Farquhar, G.; Rocktäschel, T.; Igl, M.; and Whiteson, S. 2018. Treeqn and atreec: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.
- [Gu et al.2016] Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep q-learning with model-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning.
- [Hans and Udluft2010] Hans, A., and Udluft, S. 2010. Ensembles of neural networks for robust reinforcement learning. In Proceedings of the 9th International Conference on Machine Learning and Applications.
- [Huang et al.2017] Huang, Z.; Zhou, S.; Zhuang, B.; and Zhou, X. 2017. Learning to run with actor-critic ensemble. arXiv preprint arXiv:1712.08987.
- [Klissarov et al.2017] Klissarov, M.; Bacon, P.-L.; Harb, J.; and Precup, D. 2017. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004.
- [Knuth and Moore1975] Knuth, D. E., and Moore, R. W. 1975. An analysis of alpha-beta pruning. Artificial Intelligence.
- [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [Lin1992] Lin, L.-J. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
- [Mansley, Weinstein, and Littman2011] Mansley, C. R.; Weinstein, A.; and Littman, M. L. 2011. Sample-based planning for continuous action markov decision processes. In Proceedings of the 21st International Conference on Automated Planning and Scheduling.
- [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
- [Nair and Hinton2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
- [Nitti, Belle, and De Raedt2015] Nitti, D.; Belle, V.; and De Raedt, L. 2015. Planning in discrete and continuous markov decision processes by probabilistic programming. In Proceedings of the 17th Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
- [Oh, Singh, and Lee2017] Oh, J.; Singh, S.; and Lee, H. 2017. Value prediction network. In Advances in Neural Information Processing Systems.
- [Osband et al.2016] Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems.
- [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
- [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature.
- [Sturtevant2008] Sturtevant, N. 2008. An analysis of uct in multi-player games. In Proceedings of the International Conference on Computers and Games.
- [Sutton et al.2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
- [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
- [Sutton1990] Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning.
- [Tassa et al.2018] Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; Casas, D. d. L.; Budden, D.; Abdolmaleki, A.; Merel, J.; Lefrancq, A.; et al. 2018. Deepmind control suite. arXiv preprint arXiv:1801.00690.
- [Tesauro1995] Tesauro, G. 1995. Temporal difference learning and td-gammon. Communications of the ACM.
- [Uhlenbeck and Ornstein1930] Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the brownian motion. Physical review.
- [Watkins and Dayan1992] Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning.
- [Weber et al.2017] Weber, T.; Racanière, S.; Reichert, D. P.; Buesing, L.; Guez, A.; Rezende, D. J.; Badia, A. P.; Vinyals, O.; Heess, N.; Li, Y.; et al. 2017. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.
- [Wiering and Van Hasselt2008] Wiering, M. A., and Van Hasselt, H. 2008. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).
- [Yao et al.2009] Yao, H.; Bhatnagar, S.; Diao, D.; Sutton, R. S.; and Szepesvári, C. 2009. Multi-step dyna planning for policy evaluation and control. In Advances in Neural Information Processing Systems.
- [Yee, Lisý, and Bowling2016] Yee, T.; Lisý, V.; and Bowling, M. H. 2016. Monte carlo tree search in continuous action spaces with execution uncertainty. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
Proof of Theorem 1
Under mild conditions (Bacon, Harb, and Precup 2017), the Markov chain underlying is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous.
We compute the gradient as
From Equation (5), we have
Expanding with Equation (20) recursively and applying the augmented process, we end up with
Proof of Theorem 2
Under mild conditions (Bacon, Harb, and Precup 2017), the Markov chain underlying is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous.
The gradient of w.r.t. is
Applying Equation (22) recursively and using the augmented process, we have