1 Introduction
Temporal abstraction (i.e., hierarchy) is a key component in reinforcement learning (RL). A good temporal abstraction usually improves exploration (Machado et al., 2017b) and enhances the interpretability of agents’ behavior (Smith et al., 2018). The option framework (Sutton et al., 1999), which is commonly used to formulate temporal abstraction, gives rise to two problems: learning options (i.e., temporally extended actions) and learning a master policy (i.e., a policy over options, a.k.a. an inter-option policy).
A Markov Decision Process (MDP, Puterman 2014) with options can be interpreted as a Semi-MDP (SMDP, Puterman 2014), and a master policy is used in this SMDP for option selection. While in principle any SMDP algorithm can be used to learn a master policy, such algorithms are data inefficient as they cannot update a master policy during option execution. To address this issue, Sutton et al. (1999) propose intra-option algorithms, which can update a master policy at every time step during option execution. Intra-option $Q$-Learning (Sutton et al., 1999) is a value-based intra-option algorithm and has enjoyed great success (Bacon et al., 2017; Riemer et al., 2018; Zhang et al., 2019b).
However, in the MDP setting, policy-based methods are often preferred to value-based ones because they can cope better with large action spaces and enjoy better convergence properties with function approximation. Unfortunately, to the best of our knowledge, there are no policy-based intra-option algorithms for learning a master policy. This is the first issue we address in this paper.
Recently, gradient-based option learning algorithms have enjoyed great success (Levy and Shimkin, 2011; Bacon et al., 2017; Smith et al., 2018; Riemer et al., 2018; Zhang et al., 2019b). However, most require algorithms that are customized to the option-based SMDP. Consequently, we cannot directly leverage recent advances in gradient-based policy optimization from MDPs (e.g., Schulman et al. 2015, 2017; Haarnoja et al. 2018). This is the second issue we address in this paper.
To address these issues, we reformulate the SMDP of the option framework as two augmented MDPs. Under this novel formulation, all policy optimization algorithms can be used off the shelf for option learning and master policy learning, and the learning remains intra-option. We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. Furthermore, we show that, when state-value functions are used as critics, one critic can be expressed in terms of the other, and hence only one critic is necessary. Finally, we empirically study the combination of DAC and Proximal Policy Optimization (PPO, Schulman et al. 2017) in challenging robot simulation tasks. This combination outperforms previous gradient-based option learning algorithms by a large margin and significantly outperforms vanilla PPO in a transfer learning setting.
2 Background
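We consider an MDP consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r$, a transition kernel $p$, an initial distribution $p_0$ and a discount factor $\gamma$. We refer to this MDP as $M \doteq (\mathcal{S}, \mathcal{A}, r, p, p_0, \gamma)$ and consider episodic tasks. In the option framework (Sutton et al., 1999), an option $o$ is a triple $(\mathcal{I}_o, \pi_o, \beta_o)$, where $\mathcal{I}_o$ is an initiation set indicating where the option can be initiated, $\pi_o$ is an intra-option policy, and $\beta_o$ is a termination function. In this paper, we consider $\mathcal{I}_o \equiv \mathcal{S}$ following Bacon et al. (2017); Smith et al. (2018). We use $\mathcal{O}$ to denote the option set and assume all options are Markov. We use $\pi$ to denote a master policy and consider the call-and-return execution model (Sutton et al., 1999). Time-indexed capital letters are random variables. At time step $t$, an agent at state $S_t$ either terminates the previous option $O_{t-1}$ w.p. $\beta_{O_{t-1}}(S_t)$ and initiates a new option $O_t$ according to $\pi(\cdot | S_t)$, or proceeds with the previous option $O_{t-1}$ w.p. $1 - \beta_{O_{t-1}}(S_t)$ and sets $O_t \doteq O_{t-1}$. Then an action $A_t$ is selected according to $\pi_{O_t}(\cdot | S_t)$. The agent gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$ and proceeds to a new state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$. Under this execution model, we have
$$p(S_{t+1} | S_t, O_t) = \sum_a \pi_{O_t}(a | S_t) \, p(S_{t+1} | S_t, a),$$
$$p(O_t | S_t, O_{t-1}) = \big(1 - \beta_{O_{t-1}}(S_t)\big) \, \mathbb{I}_{O_{t-1} = O_t} + \beta_{O_{t-1}}(S_t) \, \pi(O_t | S_t),$$
$$p(S_{t+1}, O_{t+1} | S_t, O_t) = p(S_{t+1} | S_t, O_t) \, p(O_{t+1} | S_{t+1}, O_t),$$
where $\mathbb{I}$ is the indicator function. With a slight abuse of notation, we define $r(s, o) \doteq \sum_a \pi_o(a | s) \, r(s, a)$. The MDP $M$ and the options $\mathcal{O}$ form an SMDP. For each state-option pair $(s, o)$ and an action $a$, we define
$$q_\pi(s, o, a) \doteq \mathbb{E}\Big[\sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i} \,\Big|\, S_t = s, O_t = o, A_t = a\Big].$$
The state-option value of $\pi$ on the SMDP is $q_\pi(s, o) = \sum_a \pi_o(a | s) \, q_\pi(s, o, a)$. The state value of $\pi$ on the SMDP is $v_\pi(s) = \sum_o \pi(o | s) \, q_\pi(s, o)$. They are related as
$$q_\pi(s, o) = r(s, o) + \gamma \sum_{s'} p(s' | s, o) \, u_\pi(o, s'), \qquad u_\pi(o, s') = \big[1 - \beta_o(s')\big] \, q_\pi(s', o) + \beta_o(s') \, v_\pi(s'),$$
where $u_\pi(o, s')$ is the option-value upon arrival (Sutton et al., 1999). Correspondingly, we have the optimal master policy $\pi^*$ satisfying $v_{\pi^*}(s) \ge v_\pi(s)$ for all $(\pi, s)$. We use $q_*$ to denote the state-option value function of $\pi^*$.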
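The call-and-return execution model described above can be illustrated with a minimal tabular sketch. It is not the authors' code: the table layout (`beta[o][s]`, `master[s]`, `intra[o][s]`), the function names, and the use of `None` for "no previous option" at the first step are all assumptions made for illustration.

```python
import random


def sample_index(probs, rng):
    """Sample an index from a list of probabilities that sums to 1."""
    u, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def call_and_return_step(s, prev_option, beta, master, intra, rng):
    """Select (option, action) at state s under call-and-return.

    beta[o][s]  : termination probability of option o at state s
    master[s]   : master-policy probabilities over options at state s
    intra[o][s] : action probabilities of option o at state s
    """
    # Terminate the previous option w.p. beta, then ask the master policy;
    # otherwise keep executing the previous option.
    if prev_option is None or rng.random() < beta[prev_option][s]:
        option = sample_index(master[s], rng)
    else:
        option = prev_option
    action = sample_index(intra[option][s], rng)
    return option, action
```

With deterministic tables the behavior is easy to check by hand: an option whose termination probability is 0 is always continued, and one whose termination probability is 1 always hands control back to the master policy.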
Master Policy Learning: To learn the optimal master policy $\pi^*$ given a fixed $\mathcal{O}$, one value-based approach is to learn $q_*$ first and derive $\pi^*$ from $q_*$. We can use SMDP $Q$-Learning to update an estimate $Q$ for $q_*$ as
$$Q(S_t, O_t) \leftarrow Q(S_t, O_t) + \alpha \Big(\sum_{i=t}^{k-1} \gamma^{i-t} R_{i+1} + \gamma^{k-t} \max_o Q(S_k, o) - Q(S_t, O_t)\Big),$$
where we assume the option $O_t$ initiates at time $t$ and terminates at time $k$ (Sutton et al., 1999). Here the option $O_t$ lasts $k - t$ steps. However, SMDP $Q$-Learning performs only one single update, yielding significant data inefficiency. This is because SMDP algorithms simply interpret the option-based SMDP as a generic SMDP, ignoring the presence of options. By contrast, Sutton et al. (1999) propose to exploit the fact that the SMDP is generated by options, yielding an update rule:
$$Q(S_t, O_t) \leftarrow Q(S_t, O_t) + \alpha \big(R_{t+1} + \gamma U(O_t, S_{t+1}) - Q(S_t, O_t)\big), \qquad (1)$$
$$U(O_t, S_{t+1}) \doteq \big(1 - \beta_{O_t}(S_{t+1})\big) \, Q(S_{t+1}, O_t) + \beta_{O_t}(S_{t+1}) \max_o Q(S_{t+1}, o).$$
This update rule is efficient in that it updates $Q$ every time step. However, it is still inefficient in that it only updates $Q$ for the executed option $O_t$. We refer to this property as on-option. Sutton et al. (1999) further propose Intra-option $Q$-Learning, where the update (1) is applied to every option $o$ satisfying $\pi_o(A_t | S_t) > 0$. We refer to this property as off-option. Intra-option $Q$-Learning is theoretically justified only when all intra-option policies are deterministic (Sutton et al., 1999). The convergence analysis of Intra-option $Q$-Learning with stochastic intra-option policies remains an open problem (Sutton et al., 1999). The update (1) and Intra-option $Q$-Learning can also be applied to off-policy transitions.
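To make the difference between the on-option and off-option updates concrete, here is a small tabular sketch of update (1). It assumes the standard form of the intra-option target from Sutton et al. (1999), namely the reward plus the discounted option-value upon arrival; the list-of-lists layout and the function names are hypothetical.

```python
def intra_option_q_update(Q, beta, s, o, r, s_next, alpha, gamma, done=False):
    """Apply update (1) to Q[s][o] and return the new estimate.

    Q[s][o]    : tabular state-option value estimate
    beta[o][s] : termination probability of option o at state s
    """
    if done:
        target = r  # no bootstrapping from a terminal state
    else:
        b = beta[o][s_next]
        # Option-value upon arrival: continue with o, or terminate and act greedily.
        u = (1.0 - b) * Q[s_next][o] + b * max(Q[s_next])
        target = r + gamma * u
    Q[s][o] += alpha * (target - Q[s][o])
    return Q[s][o]


def off_option_q_update(Q, beta, consistent_options, s, r, s_next,
                        alpha, gamma, done=False):
    """Apply the same update to every option consistent with the action taken.

    consistent_options is assumed to be supplied by the caller, e.g. the
    options whose intra-option policy assigns nonzero probability to A_t.
    """
    for o in consistent_options:
        intra_option_q_update(Q, beta, s, o, r, s_next, alpha, gamma, done)
```

The on-option variant touches one table entry per transition, while the off-option variant reuses the same transition for several options, which is where the extra data efficiency comes from.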
The Option-Critic Architecture: Bacon et al. (2017) propose a gradient-based option learning algorithm, the Option-Critic (OC) architecture. Assuming $\pi_o$ is parameterized by $\nu$ and $\beta_o$ is parameterized by $\phi$, Bacon et al. (2017) prove that
$$\nabla_\nu q_\pi(S_0, O_0) = \sum_{s, o} \rho(s, o | S_0, O_0) \sum_a q_\pi(s, o, a) \, \nabla_\nu \pi_o(a | s),$$
$$\nabla_\phi u_\pi(O_0, S_1) = -\sum_{s', o} \rho(s', o | S_1, O_0) \big(q_\pi(s', o) - v_\pi(s')\big) \nabla_\phi \beta_o(s'),$$
where $\rho$ defined in Bacon et al. (2017) is the unnormalized discounted state-option pair occupancy measure. OC is on-option in that given a transition $(S_t, O_t, A_t, R_{t+1}, S_{t+1})$, it updates only parameters of the executed option $O_t$. OC provides the gradient for $\{\pi_o, \beta_o\}$ and can be combined with any master policy learning algorithm. In particular, Bacon et al. (2017) combine OC with (1). Hence, in this paper, we use OC to indicate this exact combination. OC has also been extended to multi-level options (Riemer et al., 2018) and deterministic intra-option policies (Zhang et al., 2019b).
Inferred Option Policy Gradient: We assume $\pi$ is parameterized by $\theta$ and define $\xi \doteq \{\theta, \nu, \phi\}$. We use $\tau \doteq (S_0, A_0, S_1, A_1, \dots, S_T)$ to denote a trajectory from $\{\pi, \mathcal{O}, M\}$, where $S_T$ is a terminal state. We use $r(\tau)$ to denote the total expected discounted rewards along $\tau$. Our goal is to maximize $J \doteq \int r(\tau) \, p(\tau) \, d\tau$. Smith et al. (2018) propose to interpret the options along the trajectory as latent variables and marginalize over them when computing $\nabla_\xi J$. In the Inferred Option Policy Gradient (IOPG), Smith et al. (2018) show
$$\nabla_\xi J = \mathbb{E}_\tau\Big[r(\tau) \sum_{t=0}^{T-1} \nabla_\xi \log p(A_t | H_t)\Big] = \mathbb{E}_\tau\Big[r(\tau) \sum_{t=0}^{T-1} \nabla_\xi \log \Big(\sum_o m_t(o) \, \pi_o(A_t | S_t)\Big)\Big],$$
where $H_t \doteq (S_0, A_0, \dots, S_{t-1}, A_{t-1}, S_t)$ is the state-action history and $m_t(o) \doteq p(O_t = o | H_t)$ is the probability of occupying an option $o$ at time $t$. Smith et al. (2018) further show that $m_t$ can be expressed recursively via $m_{t-1}$, allowing efficient computation of $\nabla_\xi J$. IOPG is an offline algorithm in that it has to wait for a complete trajectory before computing $\nabla_\xi J$. To admit online updates, Smith et al. (2018) propose to store $\nabla_\xi m_{t-1}$ at each time step and use the stored $\nabla_\xi m_{t-1}$ for computing $\nabla_\xi m_t$, yielding the Inferred Option Actor Critic (IOAC). IOAC is biased in that a stale approximation of $\nabla_\xi m_{t-1}$ is used for computing $\nabla_\xi m_t$. The longer a trajectory is, the more biased the IOAC gradient is. IOPG and IOAC are off-option in that given a transition $(S_t, O_t, A_t, R_{t+1}, S_{t+1})$, all options contribute to the gradient explicitly.
Augmented Hierarchical Policy: Levy and Shimkin (2011) propose the Augmented Hierarchical Policy (AHP) architecture. AHP reformulates the SMDP of the option framework as an augmented MDP. The new state space is $\mathcal{S}^{AHP} \doteq \mathcal{O} \times \mathcal{S}$. The new action space is $\mathcal{A}^{AHP} \doteq \mathcal{B} \times \mathcal{O} \times \mathcal{A}$, where $\mathcal{B} \doteq \{stop, continue\}$ indicates whether to terminate the previous option or not. All policy optimization algorithms can be used to learn an augmented policy $\pi^{AHP}$ under this new MDP, which learns $\pi$ and $\{\pi_o, \beta_o\}$ implicitly. However, the resulting gradient for the master policy is nonzero only when an option terminates (Equation 23 in Levy and Shimkin (2011)). This suggests that the master policy learning in AHP is SMDP-style. Moreover, the resulting gradient for an intra-option policy is nonzero only when the option is being executed (Equation 24 in Levy and Shimkin (2011)). This suggests that the option learning in AHP is on-option.
3 Two Augmented MDPs
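In this section, we reformulate the SMDP as two augmented MDPs: the high-MDP $M^H$ and the low-MDP $M^L$. The agent makes high-level decisions (i.e., option selection) in $M^H$ according to $\pi, \{\beta_o\}$ and thus optimizes $\pi, \{\beta_o\}$. The agent makes low-level decisions (i.e., action selection) in $M^L$ according to $\{\pi_o\}$ and thus optimizes $\{\pi_o\}$. Both augmented MDPs share the same samples with the SMDP.
We first define a dummy option $\#$ and $\mathcal{O}^+ \doteq \mathcal{O} \cup \{\#\}$. This dummy option is never executed. In the high-MDP, we interpret a state-option pair in the SMDP as a new state and an option in the SMDP as a new action. Formally speaking, we define
$$M^H \doteq (\mathcal{S}^H, \mathcal{A}^H, p^H, p_0^H, r^H, \gamma), \qquad \mathcal{S}^H \doteq \mathcal{O}^+ \times \mathcal{S}, \qquad \mathcal{A}^H \doteq \mathcal{O},$$
$$p^H(S^H_{t+1} | S^H_t, A^H_t) \doteq p^H\big((O_t, S_{t+1}) | (O_{t-1}, S_t), A^H_t\big) \doteq \mathbb{I}_{A^H_t = O_t} \, p(S_{t+1} | S_t, O_t),$$
$$p_0^H(S^H_0) \doteq p_0^H\big((O_{-1}, S_0)\big) \doteq p_0(S_0) \, \mathbb{I}_{O_{-1} = \#},$$
$$r^H(S^H_t, A^H_t) \doteq r^H\big((O_{t-1}, S_t), O_t\big) \doteq r(S_t, O_t).$$
We define a Markov policy $\pi^H$ on $M^H$ as
$$\pi^H(A^H_t | S^H_t) \doteq \pi^H\big(O_t | (O_{t-1}, S_t)\big) \doteq p(O_t | S_t, O_{t-1}) \, \mathbb{I}_{O_{t-1} \neq \#} + \pi(O_t | S_t) \, \mathbb{I}_{O_{t-1} = \#}.$$
In the low-MDP, we interpret a state-option pair in the SMDP as a new state and leave the action space unchanged. Formally speaking, we define
$$M^L \doteq (\mathcal{S}^L, \mathcal{A}^L, p^L, p_0^L, r^L, \gamma), \qquad \mathcal{S}^L \doteq \mathcal{S} \times \mathcal{O}, \qquad \mathcal{A}^L \doteq \mathcal{A},$$
$$p^L(S^L_{t+1} | S^L_t, A^L_t) \doteq p^L\big((S_{t+1}, O_{t+1}) | (S_t, O_t), A_t\big) \doteq p(S_{t+1} | S_t, A_t) \, p(O_{t+1} | S_{t+1}, O_t),$$
$$p_0^L(S^L_0) \doteq p_0^L\big((S_0, O_0)\big) \doteq p_0(S_0) \, \pi(O_0 | S_0),$$
$$r^L(S^L_t, A^L_t) \doteq r^L\big((S_t, O_t), A_t\big) \doteq r(S_t, A_t).$$
We define a Markov policy $\pi^L$ on $M^L$ as
$$\pi^L(A^L_t | S^L_t) \doteq \pi^L\big(A_t | (S_t, O_t)\big) \doteq \pi_{O_t}(A_t | S_t).$$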
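We consider trajectories with nonzero probabilities and define $\Omega \doteq \{\tau \mid p(\tau | \pi, \mathcal{O}, M) > 0\}$, $\Omega^H \doteq \{\tau^H \mid p(\tau^H | \pi^H, M^H) > 0\}$, $\Omega^L \doteq \{\tau^L \mid p(\tau^L | \pi^L, M^L) > 0\}$. With $\tau \doteq (S_0, O_0, S_1, O_1, \dots, S_T)$, we define a function $f^H: \Omega \to \Omega^H$, which maps $\tau$ to $\tau^H \doteq (S^H_0, A^H_0, S^H_1, A^H_1, \dots, S^H_T)$, where $S^H_t \doteq (O_{t-1}, S_t)$, $A^H_t \doteq O_t$, $O_{-1} \doteq \#$. We have:
Lemma 1
$p(\tau | \pi, \mathcal{O}, M) = p(\tau^H | \pi^H, M^H)$, $r(\tau) = r(\tau^H)$, and $f^H$ is a bijection.
Proof. See supplementary materials.
We now take action into consideration. With $\tau \doteq (S_0, O_0, A_0, S_1, O_1, A_1, \dots, S_T)$, we define a function $f^L: \Omega \to \Omega^L$, which maps $\tau$ to $\tau^L \doteq (S^L_0, A^L_0, S^L_1, A^L_1, \dots, S^L_T)$, where $S^L_t \doteq (S_t, O_t)$, $A^L_t \doteq A_t$. We have:
Lemma 2
$p(\tau | \pi, \mathcal{O}, M) = p(\tau^L | \pi^L, M^L)$, $r(\tau) = r(\tau^L)$, and $f^L$ is a bijection.
Proof. See supplementary materials.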
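The two trajectory maps can be pictured as simple re-labelings of the same samples. The sketch below is illustrative only: it assumes a trajectory is stored as a list of `(state, option, action)` steps plus a terminal state, uses the hypothetical marker `"#"` for the dummy option, and omits the terminal entry of the low-MDP view for brevity.

```python
DUMMY = "#"  # hypothetical marker for the dummy option, which is never executed


def to_high_trajectory(steps, terminal_state):
    """Re-label [(S_t, O_t, A_t)] as high-MDP pairs ((O_{t-1}, S_t), O_t).

    Returns the list of (high_state, high_action) pairs and the final
    high-MDP state (O_{T-1}, S_T).
    """
    high, prev_option = [], DUMMY
    for state, option, _action in steps:
        high.append(((prev_option, state), option))
        prev_option = option
    return high, (prev_option, terminal_state)


def to_low_trajectory(steps):
    """Re-label [(S_t, O_t, A_t)] as low-MDP pairs ((S_t, O_t), A_t)."""
    return [((state, option), action) for state, option, action in steps]
```

Both views are built from the same transitions, which is why the two actors can be trained without collecting separate data for each augmented MDP.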
Proposition 1
$$J = \int r(\tau) \, p(\tau | \pi, \mathcal{O}, M) \, d\tau = \int r(\tau^H) \, p(\tau^H | \pi^H, M^H) \, d\tau^H = \int r(\tau^L) \, p(\tau^L | \pi^L, M^L) \, d\tau^L.$$
Lemma 1 and Lemma 2 indicate that sampling from $\{\pi, \mathcal{O}, M\}$ is equivalent to sampling from $\{\pi^H, M^H\}$ and $\{\pi^L, M^L\}$. Proposition 1 indicates that optimizing $\pi, \mathcal{O}$ in $M$ is equivalent to optimizing $\pi^H$ in $M^H$ and optimizing $\pi^L$ in $M^L$. We now make two observations:
Observation 1
$M^H$ depends on $\{\pi_o\}$ while $\pi^H$ depends on $\pi$ and $\{\beta_o\}$.
Observation 2
$M^L$ depends on $\pi$ and $\{\beta_o\}$ while $\pi^L$ depends on $\{\pi_o\}$.
Observation 1 suggests that when we keep the intra-option policies $\{\pi_o\}$ fixed and optimize $\pi^H$, we are implicitly optimizing $\pi$ and $\{\beta_o\}$ (i.e., $\theta$ and $\phi$). Observation 2 suggests that when we keep the master policy $\pi$ and the termination conditions $\{\beta_o\}$ fixed and optimize $\pi^L$, we are implicitly optimizing $\{\pi_o\}$ (i.e., $\nu$). All policy optimization algorithms for MDPs can be used off the shelf to optimize the two actors $\pi^H$ and $\pi^L$ with samples from $\{\pi, \mathcal{O}, M\}$, yielding a new family of algorithms for master policy learning and option learning, which we refer to as the Double Actor-Critic (DAC) architecture. When we optimize $\pi^H$ and $\pi^L$ alternately with different samples, $J$ is guaranteed to improve unless we have already reached a local maximum, provided that policy improvement is guaranteed for the specific policy optimization algorithm we use. We can also optimize $\pi^H$ and $\pi^L$ with the same samples. This can improve data efficiency but introduces bias. The pseudocode of DAC is provided in the supplementary materials. We present a thorough comparison of DAC, OC, IOPG and AHP in Table 1. DAC combines the advantages of both AHP (i.e., compatibility) and OC (i.e., intra-option learning). Enabling off-option learning of intra-option policies in DAC, as in IOPG, is possible future work.
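Table 1: Comparison of AHP, OC, IOPG and DAC.
| Algorithm | Learning $\pi$ | Learning $\{\pi_o, \beta_o\}$ | Online Learning | Compatibility |
| --- | --- | --- | --- | --- |
| AHP | SMDP | on-option | yes | yes |
| OC | intra-option | on-option | yes | no |
| IOPG | intra-option | off-option | no | no |
| DAC | intra-option | on-option | yes | yes |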
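As a rough illustration of the two actors, the sketch below computes the action distributions of the high-MDP policy and the low-MDP policy from the master policy, the termination probability, and the intra-option policies. It assumes the high-MDP policy follows the call-and-return option transition and falls back to the master policy when the previous option is the dummy option; the index `-1` for the dummy option and the function names are hypothetical.

```python
DUMMY = -1  # hypothetical index for the dummy option


def high_policy_probs(prev_option, beta_prev, master_probs):
    """Distribution over the next option given the high-MDP state (O_{t-1}, S_t).

    beta_prev    : termination probability of the previous option at S_t
                   (ignored when the previous option is the dummy option)
    master_probs : master-policy probabilities over options at S_t
    """
    if prev_option == DUMMY:
        return list(master_probs)
    # Terminate w.p. beta_prev and re-select, or continue w.p. 1 - beta_prev.
    probs = [beta_prev * p for p in master_probs]
    probs[prev_option] += 1.0 - beta_prev
    return probs


def low_policy_probs(option, intra_option_probs):
    """Distribution over actions given the low-MDP state (S_t, O_t).

    intra_option_probs[o] holds the action probabilities of option o at S_t.
    """
    return list(intra_option_probs[option])
```

Written this way, a gradient step on the high-level distribution only touches the master policy and the termination function, and a step on the low-level distribution only touches the executing option's policy, which mirrors Observations 1 and 2.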
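In general, we need two critics in DAC, which can be learned via all policy evaluation algorithms. However, when state value functions are used as critics, Proposition 2 shows that the state value function in the high-MDP ($v_{\pi^H}$) can be expressed by the state value function in the low-MDP ($v_{\pi^L}$), and hence only one critic is needed.
Proposition 2
With $v_{\pi^H}$ and $v_{\pi^L}$ denoting the state value functions of $\pi^H$ on $M^H$ and of $\pi^L$ on $M^L$ respectively, we have
$$v_{\pi^H}\big((o, s')\big) = \sum_{o'} \pi^H\big(o' | (o, s')\big) \, v_{\pi^L}\big((s', o')\big).$$
Proof. See supplementary materials.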
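One way to read the single-critic result is that the high-MDP value is an expectation of low-MDP values over the next option. The sketch below assumes exactly that form (the high-MDP value at a previous-option/state pair equals the sum over next options of the high-level policy probability times the low-MDP value); the dictionary layout and the function name are hypothetical.

```python
def high_value_from_low(v_low, state, high_probs):
    """Recover the high-MDP state value from a low-MDP critic.

    v_low[(state, option)] : low-MDP state value estimate
    high_probs[option]     : high-level policy probabilities over the next
                             option, already conditioned on the previous option
    """
    return sum(p * v_low[(state, option)] for option, p in enumerate(high_probs))
```

For example, with low-MDP values 1.0 and 3.0 for two options and next-option probabilities 0.25 and 0.75, the high-MDP value is 0.25 * 1.0 + 0.75 * 3.0 = 2.5, so no second critic has to be trained.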
Beyond Intra-option $Q$-Learning: In terms of learning $\pi$ with a fixed $\mathcal{O}$, Observation 1 suggests we optimize $\pi^H$ on $M^H$. This immediately yields a family of policy-based algorithms for learning a master policy, all of which are intra-option. Particularly, when we use Off-Policy Expected Policy Gradients (Off-EPG, Ciosek and Whiteson 2017) for optimizing $\pi^H$, we get all the merits of both Intra-option $Q$-Learning and policy gradients for free. (1) By definition of $M^H$ and $\pi^H$, Off-EPG optimizes $\pi$ in an intra-option manner, and is as data efficient as Intra-option $Q$-Learning. (2) Off-EPG is an off-policy algorithm, so off-policy transitions can also be used, as in Intra-option $Q$-Learning. (3) Off-EPG is off-option in that all the options, not only the executed one, explicitly contribute to the policy gradient every time step. Particularly, unlike Intra-option $Q$-Learning, this off-option approach does not require deterministic intra-option policies. (4) Off-EPG uses a policy for decision making, which is more robust than value-based decision making. We leave an empirical study of this particular application for future work and focus in this paper on the more general problem, learning $\pi$ and $\mathcal{O}$ simultaneously. When $\mathcal{O}$ is not fixed, the MDP ($M^H$) for learning $\pi$ becomes non-stationary. We, therefore, prefer on-policy methods to off-policy methods.
4 Experimental Results
We design experiments to answer the following questions: (1) Can DAC outperform existing gradient-based option learning algorithms (e.g., AHP, OC, IOPG)? (2) Can options learned in DAC translate into a performance boost over its hierarchy-free counterparts? (3) What options does DAC learn?
DAC can be combined with any policy optimization algorithm, e.g., policy gradient (Sutton et al., 2000), Natural Actor-Critic (NAC, Peters and Schaal 2008), PPO, Soft Actor-Critic (Haarnoja et al., 2018), Generalized Off-Policy Actor-Critic (Zhang et al., 2019a). In this paper, we focus on the combination of DAC and PPO, given the great empirical success of PPO (OpenAI, 2018). Our PPO implementation uses the same architecture and hyperparameters reported by Schulman et al. (2017).
Levy and Shimkin (2011) combine AHP with NAC and present an empirical study on an inverted pendulum domain. In our experiments, we also combine AHP with PPO for a fair comparison. To the best of our knowledge, this is the first time that AHP has been evaluated with state-of-the-art policy optimization algorithms in prevailing deep RL benchmarks. We also implemented IOPG and OC as baselines. We use 4 options for all algorithms, following Smith et al. (2018). We report the online training episode return, smoothed by a sliding window of size 20. All curves are averaged over 10 independent runs and shaded regions indicate standard errors. More details about the experiments are provided in the supplementary materials.
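The reporting protocol above (sliding-window smoothing, then mean and standard error across runs) can be sketched as follows. This is not the authors' plotting code: the trailing-window convention, the handling of the first few episodes, and the function names are assumptions.

```python
import math


def sliding_window_mean(returns, window=20):
    """Smooth episode returns with a trailing window of the given size.

    Early entries average over however many episodes are available so far.
    """
    smoothed, running = [], 0.0
    for i, value in enumerate(returns):
        running += value
        if i >= window:
            running -= returns[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) across independent runs."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

In a setup like the one described, `sliding_window_mean` would be applied to each run's return sequence and `mean_and_standard_error` to the smoothed values of the 10 runs at each point on the curve.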
4.1 Single Task Learning
We consider four robot simulation tasks used by Smith et al. (2018) from OpenAI gym (Brockman et al., 2016). We also include the combination of DAC and A2C (Mnih et al., 2016) for reference. The results are reported in Figure 1.
Results: (1) Our implementations of OC and IOPG reach similar performance to that reported by Smith et al. (2018), which is significantly outperformed by both vanilla PPO and option-based PPO (i.e., DAC+PPO, AHP+PPO). However, the performance of DAC+A2C is similar to OC and IOPG. These results indicate that the performance boost of DAC+PPO and AHP+PPO mainly comes from the more advanced policy optimization algorithm (PPO). This is exactly the major advantage of DAC and AHP: they allow all state-of-the-art policy optimization algorithms to be used off the shelf to learn options. (2) The performance of DAC+PPO is similar to vanilla PPO in 3 out of 4 tasks. DAC+PPO outperforms PPO in Swimmer by a large margin. This performance similarity between an option-based algorithm and a hierarchy-free algorithm is expected and is also reported by Harb et al. (2018); Smith et al. (2018). Within a single task, it is usually hard to translate the automatically discovered options into a performance boost, as primitive actions are enough to express the optimal policy. In the meantime, learning the additional structure, the options, may also slow down learning of the original task. (3) The performance of DAC+PPO is similar to AHP+PPO, as expected. The main advantage of DAC over AHP is its data efficiency in learning the master policy. Within a single task, it is possible that an agent focuses on a “mighty” option and ignores other specialized options, making master policy learning less important. By contrast, when we switch tasks, cooperation among different options becomes more important. We, therefore, expect that the data efficiency in learning the master policy in DAC translates into a performance boost over AHP in a transfer learning setting.
4.2 Transfer Learning
We consider a transfer learning setting where after the first 1M training steps, we switch to a new task and train the agent for another 1M steps. The agent is not aware of the task switch. The two tasks are correlated and we expect learned options from the first task can be used to accelerate learning of the second task.
We use 6 pairs of tasks from DeepMind Control Suite (DMControl, Tassa et al. 2018): CartPole, Reacher, Cheetah, Fish, Walker1 and Walker2. Most of them are provided by DMControl and some we constructed in a manner similar to Hafner et al. (2018). The maximum score is always 1000. More details are provided in the supplementary materials. There are other possible paired tasks in DMControl, but we found that in such pairs PPO hardly learns anything in the second task. Hence, we omit those pairs from our experiments. The results are reported in Figure 2.
Results: (1) During the first task, DAC+PPO consistently outperforms OC and IOPG by a large margin and maintains a similar performance to PPO and AHP+PPO. These results are consistent with our previous observations in the single task learning setting. (2) After the task switch, the advantage of DAC+PPO becomes clear. DAC+PPO outperforms all other baselines by a large margin in 3 out of 6 tasks and is among the best algorithms in the other 3 tasks. This matches the expectation about DAC and AHP stated in Section 4.1. (3) We further study the influence of the number of options in Walker2. Results are provided in the supplementary materials. We find 8 options are slightly better than 4 options and 2 options are worse. We conjecture that 2 options are not enough for transferring the knowledge from the first task to the second.
4.3 Option Structures
We visualize the learned options and option occupancy of DAC+PPO on Cheetah in Figure 3. There are 4 options in total, displayed via different colors. The upper strip shows the option occupancy during an episode at the end of the training of the first task (run). The lower strip shows the option occupancy during an episode at the end of the training of the second task (backward). Both episodes last 1000 steps. The video of the two episodes is available at https://youtu.be/K0ZPHQtx6M. The four options are distinct. The blue option is mainly used when the cheetah is “flying”. The green option is mainly used when the cheetah pushes its left leg to move right. The yellow option is mainly used when the cheetah pushes its left leg to move left. The red option is mainly used when the cheetah pushes its right leg to move left. During the first task, the red option is rarely used. The cheetah uses the green and yellow options for pushing its left leg and uses the blue option for flying. The right leg rarely touches the ground during the first episode. After the task switch, the flying option (blue) transfers to the second task, the yellow option specializes for moving left, and the red option is developed for pushing the right leg to the left.
5 Related Work
Many components in DAC are not new. The idea of an augmented MDP is suggested by Levy and Shimkin (2011) in AHP. The augmented state spaces $\mathcal{S}^H$ and $\mathcal{S}^L$ are also used by Bacon et al. (2017) to simplify the derivation. Applying vanilla policy gradient to $\pi^L$ and $M^L$ leads immediately to the Intra-Option Policy Gradient Theorem (Bacon et al., 2017). The augmented policy $\pi^H$ is also used by Smith et al. (2018) to simplify the derivation. However, neither OC nor IOPG works on the augmented state space directly. To the best of our knowledge, DAC is the first work in which the two augmented MDPs are formulated explicitly. It is this explicit formulation that allows the off-the-shelf application of all state-of-the-art policy optimization algorithms and combines advantages from both OC and AHP, yielding a significant empirical performance boost. Furthermore, it is this explicit formulation that generates a family of policy-based intra-option algorithms for master policy learning.
Besides gradient-based option learning, there are also other option learning approaches based on finding bottleneck states or subgoals (Stolle and Precup, 2002; McGovern and Barto, 2001; Silver and Ciosek, 2012; Niekum and Barto, 2011; Machado et al., 2017a). In general, these approaches are expensive in terms of both samples and computation (Precup, 2018).
Besides the option framework, there are also other frameworks to describe hierarchies in RL. Dietterich (2000) decomposes the value function in the original MDP into value functions in smaller MDPs in the MAXQ framework. Dayan and Hinton (1993) employ multiple managers on different levels for describing a hierarchy. Vezhnevets et al. (2017) further extend this idea to FeUdal Networks, where a manager module sets abstract goals for workers. This goal-based hierarchy description is also explored by Schmidhuber and Wahnsiedler (1993); Levy et al. (2017); Nachum et al. (2018). Moreover, Florensa et al. (2017) use stochastic neural networks for hierarchical RL. We leave a comparison between the option framework and other hierarchical RL frameworks for future work.
6 Conclusions
In this paper, we reformulate the SMDP of the option framework as two augmented MDPs, allowing an off-the-shelf application of all policy optimization algorithms to option learning and master policy learning in an intra-option manner.
In DAC, there is no clear boundary between option termination functions and the master policy. They are different internal parts of the augmented policy $\pi^H$. We observe that the termination probability of the active option becomes high as training progresses, although $\pi^H$ still selects the same option. This is also observed by Bacon et al. (2017). To encourage long options, Harb et al. (2018) propose a cost model for option switching. Including this cost model in DAC is a possibility for future work.
Acknowledgments
SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. The authors thank Matthew Smith for an insightful discussion.
References

Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.
Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in Neural Information Processing Systems.
Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research.
Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
Levy, A., Platt, R., and Saenko, K. (2017). Hierarchical actor-critic. arXiv preprint arXiv:1712.00948.
Levy, K. Y. and Shimkin, N. (2011). Unified inter and intra options learning using policy gradient methods. In Proceedings of the 2011 European Workshop on Reinforcement Learning.
Machado, M. C., Bellemare, M. G., and Bowling, M. (2017a). A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.
Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2017b). Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089.
McGovern, A. and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.
Nachum, O., Gu, S. S., Lee, H., and Levine, S. (2018). Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
Niekum, S. and Barto, A. G. (2011). Clustering via Dirichlet process mixture models for portable skill discovery. In Advances in Neural Information Processing Systems.
OpenAI (2018). OpenAI Five. https://openai.com/five/.
Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing.
Precup, D. (2018). Temporal abstraction. URL: http://videolectures.net/site/normal_dl/tag=1199094/DLRLsummerschool2018_precup_temporal_abstraction_01.pdf.
Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Riemer, M., Liu, M., and Tesauro, G. (2018). Learning abstract options. In Advances in Neural Information Processing Systems.
Schmidhuber, J. and Wahnsiedler, R. (1993). Planning simple trajectories using neural subgoal generators. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silver, D. and Ciosek, K. (2012). Compositional planning using optimal option models. arXiv preprint arXiv:1206.6473.
Smith, M., Hoof, H., and Pineau, J. (2018). An inference-based policy gradient method for learning options. In Proceedings of the 35th International Conference on Machine Learning.
Stolle, M. and Precup, D. (2002). Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
Zhang, S., Boehmer, W., and Whiteson, S. (2019a). Generalized off-policy actor-critic. arXiv preprint arXiv:1903.11329.
Zhang, S., Chen, H., and Yao, H. (2019b). ACE: An actor ensemble algorithm for continuous control with tree search. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
Appendix A Assumptions and Proofs
A.1 Assumptions
A.2 Proof of Lemma 1
Proof.
follows directly from the definition of . is an injection by definition. The definition of guarantees is a surjection. So is a bijection.
A.3 Proof of Lemma 2
Proof.
follows directly from the definition of . is an injection by definition. The definition of guarantees is a surjection. So is a bijection.
A.4 Proof of Proposition 2
Proof.
Appendix B Details of Experiments
B.1 Pseudocode of DAC
Pseudocode of DAC is provided in Algorithm 1.
B.2 Details of Environments
CartPole consists of balance and balance_sparse, where the latter has a sparse reward. Reacher consists of easy and hard, where the latter has a smaller target sphere than the former. Those four tasks are provided in DMControl. Cheetah consists of run and backward. The former is from DMControl, the latter is from Hafner et al. (2018), where the horizontal speed of the cheetah is negated before being used for computing rewards. In this task, the cheetah is encouraged to run backward rather than forward. Fish consists of upright and downleft. The former is from DMControl. In the latter, we negate the uprightness before using it to compute rewards. This task encourages the fish to be “downleft”. Walker1 consists of squat and stand. The latter is from DMControl, where a reward is given when the torso height of the walker is larger than . In the former, we give a reward when the torso height is larger than . Walker2 consists of walk and backward. The former is from DMControl. In the latter, we negate the horizontal speed as in Cheetahbackward.
B.3 Parameterization
We base our parameterization on Schulman et al. (2017). For an option $o$, $\beta_o$ is parameterized as a two-hidden-layer network. A sigmoid activation function is used after the output layer. $\pi_o$ is parameterized as a two-hidden-layer network. A linear activation function is used after the output layer to output the mean of the Gaussian policy $\pi_o$. The std of $\pi_o$ is a state-independent variable as in Schulman et al. (2015, 2017). The master policy $\pi$ is parameterized in the same manner as . The value function has the same parameterization as except that the activation function after the output layer is linear. All hidden layers have 64 hidden units. For MuJoCo tasks, we use a Tanh activation for hidden layers as suggested by Schulman et al. (2017). For DMControl tasks, we find a ReLU (Nair and Hinton, 2010) activation for hidden layers produces better performance. We use 4 options as suggested by Smith et al. (2018). We use this parameterization for all compared algorithms. We use 4 workers for OC, IOPG and DAC+A2C as suggested by Smith et al. (2018). For OC, we set the option switching cost to 0.01 as suggested by Bacon et al. (2017); Harb et al. (2018). For DAC+PPO and DAC+A2C, the two optimization steps use the same samples. Our preliminary experiments show that performing the two optimization steps alternately leads to a similar performance.
B.4 Other Experimental Results
Figure 4 studies the influence of the number of options on performance. In the first task, the performance is similar. In the second task, 8 options are slightly better than 4 options, while 2 options are clearly worse.