DAC: The Double Actor-Critic Architecture for Learning Options

04/29/2019 ∙ by Shangtong Zhang, et al. ∙ University of Oxford

We reformulate the option framework as two parallel augmented MDPs. Under this novel formulation, all policy optimization algorithms can be used off the shelf to learn intra-option policies, option termination conditions, and a master policy over options. We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. Furthermore, we show that, when state-value functions are used as critics, one critic can be expressed in terms of the other, and hence only one critic is necessary. Our experiments on challenging robot simulation tasks demonstrate that DAC outperforms previous gradient-based option learning algorithms by a large margin and significantly outperforms its hierarchy-free counterparts in a transfer learning setting.


1 Introduction

Temporal abstraction (i.e., hierarchy) is a key component in reinforcement learning (RL). A good temporal abstraction usually improves exploration (Machado et al., 2017b) and enhances the interpretability of agents’ behavior (Smith et al., 2018). The option framework (Sutton et al., 1999), which is commonly used to formulate temporal abstraction, gives rise to two problems: learning options (i.e., temporally extended actions) and learning a master policy (i.e., a policy over options, a.k.a. an inter-option policy).

A Markov Decision Process (MDP, Puterman 2014) with options can be interpreted as a Semi-MDP (SMDP, Puterman 2014), and a master policy is used in this SMDP for option selection. While in principle, any SMDP algorithm can be used to learn a master policy, such algorithms are data inefficient as they cannot update a master policy during option execution. To address this issue, Sutton et al. (1999) propose intra-option algorithms, which can update a master policy at every time step during option execution. Intra-option Q-Learning (Sutton et al., 1999) is a value-based intra-option algorithm and has enjoyed great success (Bacon et al., 2017; Riemer et al., 2018; Zhang et al., 2019b).

However, in the MDP setting, policy-based methods are often preferred to value-based ones because they can cope better with large action spaces and enjoy better convergence properties with function approximation. Unfortunately, to the best of our knowledge, there are no policy-based intra-option algorithms for learning a master policy. This is the first issue we address in this paper.

Recently, gradient-based option learning algorithms have enjoyed great success (Levy and Shimkin, 2011; Bacon et al., 2017; Smith et al., 2018; Riemer et al., 2018; Zhang et al., 2019b). However, most require algorithms that are customized to the option-based SMDP. Consequently, we cannot directly leverage recent advances in gradient-based policy optimization from MDPs (e.g., Schulman et al. 2015, 2017; Haarnoja et al. 2018). This is the second issue we address in this paper.

To address these issues, we reformulate the SMDP of the option framework as two augmented MDPs. Under this novel formulation, all policy optimization algorithms can be used off the shelf for option learning and master policy learning, and the learning remains intra-option. We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. Furthermore, we show that, when state-value functions are used as critics, one critic can be expressed in terms of the other, and hence only one critic is necessary. Finally, we empirically study the combination of DAC and Proximal Policy Optimization (PPO, Schulman et al. 2017) in challenging robot simulation tasks. This combination outperforms previous gradient-based option learning algorithms by a large margin and significantly outperforms vanilla PPO in a transfer learning setting.

2 Background

We consider an MDP consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, an initial distribution $p_0: \mathcal{S} \rightarrow [0, 1]$ and a discount factor $\gamma \in [0, 1]$. We refer to this MDP as $\mathcal{M}$ and consider episodic tasks. In the option framework (Sutton et al., 1999), an option $o$ is a triple $(\mathcal{I}_o, \pi_o, \beta_o)$, where $\mathcal{I}_o \subseteq \mathcal{S}$ is an initiation set indicating where the option can be initiated, $\pi_o: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is an intra-option policy, and $\beta_o: \mathcal{S} \rightarrow [0, 1]$ is a termination function. In this paper, we consider $\mathcal{I}_o \equiv \mathcal{S}$, following Bacon et al. (2017); Smith et al. (2018). We use $\mathbb{O}$ to denote the option set and assume all options are Markov. We use $\pi: \mathbb{O} \times \mathcal{S} \rightarrow [0, 1]$ to denote a master policy and consider the call-and-return execution model (Sutton et al., 1999). Time-indexed capital letters are random variables. At time step $t$, an agent at state $S_t$ either terminates the previous option $O_{t-1}$ w.p. $\beta_{O_{t-1}}(S_t)$ and initiates a new option $O_t$ according to $\pi(\cdot \mid S_t)$, or proceeds with the previous option w.p. $1 - \beta_{O_{t-1}}(S_t)$ and sets $O_t \doteq O_{t-1}$. Then an action $A_t$ is selected according to $\pi_{O_t}(\cdot \mid S_t)$. The agent gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$ and proceeds to a new state $S_{t+1}$ according to $p(\cdot \mid S_t, A_t)$. Under this execution model, we have

$$p(O_t = o \mid S_t = s, O_{t-1} = o') = \beta_{o'}(s)\, \pi(o \mid s) + \big(1 - \beta_{o'}(s)\big)\, \mathbb{I}(o = o'),$$

where $\mathbb{I}$ is the indicator function. With a slight abuse of notation, we use $\pi(o \mid s, o')$ to denote this probability. The MDP $\mathcal{M}$ and the options $\mathbb{O}$ form an SMDP. For each state-option pair $(s, o)$ and an action $a$, we define $r(s, o, a) \doteq r(s, a)$.

The state-option value of $\pi$ on the SMDP is $q_\pi(s, o) \doteq \mathbb{E}\big[\sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i} \mid S_t = s, O_t = o\big]$. The state value of $\pi$ on the SMDP is $v_\pi(s) \doteq \mathbb{E}\big[\sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i} \mid S_t = s\big]$. They are related as

$$v_\pi(s) = \sum_o \pi(o \mid s)\, q_\pi(s, o), \qquad q_\pi(s, o) = \sum_a \pi_o(a \mid s)\Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, u_\pi(o, s') \Big],$$

where $u_\pi(o, s') \doteq \big(1 - \beta_o(s')\big) q_\pi(s', o) + \beta_o(s')\, v_\pi(s')$ is the option-value upon arrival (Sutton et al., 1999). Correspondingly, we have the optimal master policy $\pi_*$ satisfying $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s$ and all master policies $\pi$. We use $q_*$ to denote the state-option value function of $\pi_*$.
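As a concrete illustration of the call-and-return execution model, the following self-contained sketch samples one episode on a toy chain environment. The environment, the two hand-coded options, and the uniform master policy are illustrative placeholders of ours, not from the paper.

import random

# Minimal sketch of call-and-return option execution.
class ChainEnv:
    """5-state chain; action +1/-1 moves right/left; reward 1 at the right end."""
    def reset(self):
        self.s = 2
        return self.s
    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = self.s == 4
        return self.s, float(done), done

class Option:
    def __init__(self, action, term_prob):
        self.action, self.term_prob = action, term_prob
    def pi(self, s):          # intra-option policy: always pick self.action
        return self.action
    def beta(self, s):        # state-independent termination probability
        return self.term_prob

def run_episode(env, options, master_policy, gamma=0.99):
    s, o, done = env.reset(), None, False
    ret, discount = 0.0, 1.0
    while not done:
        # terminate the previous option w.p. beta_o(s); the first step always initiates
        if o is None or random.random() < options[o].beta(s):
            o = master_policy(s)            # initiate a new option O_t ~ pi(.|S_t)
        a = options[o].pi(s)                # A_t ~ pi_{O_t}(.|S_t)
        s, r, done = env.step(a)            # R_{t+1}, S_{t+1}
        ret += discount * r
        discount *= gamma
    return ret

options = [Option(action=+1, term_prob=0.1), Option(action=-1, term_prob=0.1)]
master_policy = lambda s: random.randrange(len(options))  # uniform over options
print(run_episode(ChainEnv(), options, master_policy))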

Master Policy Learning: To learn the optimal master policy given a fixed set of options, one value-based approach is to learn $q_*$ first and derive the optimal master policy from $q_*$. We can use SMDP Q-Learning to update an estimate $Q$ for $q_*$ as

$$Q(S_t, O_t) \leftarrow Q(S_t, O_t) + \alpha \Big[ \sum_{i=1}^{L} \gamma^{i-1} R_{t+i} + \gamma^{L} \max_{o} Q(S_{t+L}, o) - Q(S_t, O_t) \Big],$$

where we assume the option $O_t$ initiates at time $t$ and terminates at time $t+L$ (Sutton et al., 1999). Here the option lasts $L$ steps. However, SMDP Q-Learning performs only one single update per option execution, yielding significant data inefficiency. This is because SMDP algorithms simply interpret the option-based SMDP as a generic SMDP, ignoring the presence of options. By contrast, Sutton et al. (1999) propose to exploit the fact that the SMDP is generated by options, yielding an update rule:

$$Q(S_t, O_t) \leftarrow Q(S_t, O_t) + \alpha \Big[ R_{t+1} + \gamma \big(1 - \beta_{O_t}(S_{t+1})\big) Q(S_{t+1}, O_t) + \gamma\, \beta_{O_t}(S_{t+1}) \max_{o} Q(S_{t+1}, o) - Q(S_t, O_t) \Big] \qquad (1)$$

This update rule is efficient in that it updates $Q$ at every time step. However, it is still inefficient in that it only updates $Q$ for the executed option $O_t$. We refer to this property as on-option. Sutton et al. (1999) further propose Intra-option Q-Learning, where the update (1) is applied to every option consistent with the executed action (i.e., every $o$ with $\pi_o(A_t \mid S_t) > 0$), not just $O_t$. We refer to this property as off-option. Intra-option Q-Learning is theoretically justified only when all intra-option policies are deterministic (Sutton et al., 1999). The convergence analysis of Intra-option Q-Learning with stochastic intra-option policies remains an open problem (Sutton et al., 1999). The update (1) and Intra-option Q-Learning can also be applied to off-policy transitions.
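To make update (1) and its off-option extension concrete, here is a small tabular sketch. The array layout, the deterministic-option check, and all variable names are ours rather than the paper's.

import numpy as np

# Tabular sketch of update (1). Q has shape [num_states, num_options];
# beta[o, s] is the termination probability of option o at state s;
# option_action[o, s] is each (deterministic) option's action at s.
def intra_option_q_update(Q, beta, option_action, s, o, a, r, s_next,
                          alpha=0.1, gamma=0.99, off_option=False):
    if off_option:
        # every option that would have taken the observed action a at s
        targets = [o2 for o2 in range(Q.shape[1]) if option_action[o2, s] == a]
    else:
        targets = [o]                   # on-option: only the executed option
    for o2 in targets:
        # option-value upon arrival: continue o2 w.p. 1 - beta, otherwise re-select greedily
        u = (1.0 - beta[o2, s_next]) * Q[s_next, o2] + beta[o2, s_next] * Q[s_next].max()
        Q[s, o2] += alpha * (r + gamma * u - Q[s, o2])
    return Q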

The Option-Critic Architecture: Bacon et al. (2017) propose a gradient-based option learning algorithm, the Option-Critic (OC) architecture. Assuming the intra-option policies $\{\pi_o\}$ are parameterized by $\theta$ and the termination functions $\{\beta_o\}$ by $\nu$, Bacon et al. (2017) prove the Intra-Option Policy Gradient Theorem and the Termination Gradient Theorem, which give the gradients of the expected return w.r.t. $\theta$ and $\nu$; both gradients are expectations under the unnormalized discounted state-option pair occupancy measure defined in Bacon et al. (2017). OC is on-option in that given a transition $(S_t, O_t, A_t, R_{t+1}, S_{t+1})$, it updates only the parameters of the executed option $O_t$. OC provides the gradients for $\theta$ and $\nu$ and can be combined with any master policy learning algorithm. In particular, Bacon et al. (2017) combine OC with (1). Hence, in this paper, we use OC to refer to this exact combination. OC has also been extended to multi-level options (Riemer et al., 2018) and deterministic intra-option policies (Zhang et al., 2019b).

Inferred Option Policy Gradient: We assume the master policy $\pi$ is parameterized by $\phi$, so that all the parameters to learn are $\{\theta, \nu, \phi\}$. We use $\tau \doteq (S_0, A_0, R_1, S_1, A_1, \dots, S_T)$ to denote a trajectory from $\mathcal{M}$, where $S_T$ is a terminal state, and $G(\tau)$ to denote the total discounted rewards along $\tau$. Our goal is to maximize $J \doteq \mathbb{E}[G(\tau)]$. Smith et al. (2018) propose to interpret the options along the trajectory as latent variables and marginalize over them when computing $\nabla J$. In the Inferred Option Policy Gradient (IOPG), Smith et al. (2018) express $\nabla J$ in terms of the probability of occupying each option $o$ at time $t$ given the state-action history up to time $t$. Smith et al. (2018) further show that this option occupancy can be expressed recursively in terms of the occupancy at the previous time step, allowing efficient computation of $\nabla J$. IOPG is an off-line algorithm in that it has to wait for a complete trajectory before computing the gradient. To admit online updates, Smith et al. (2018) propose to store the option occupancy at each time step and use the stored values for computing the gradient, yielding the Inferred Option Actor Critic (IOAC). IOAC is biased in that a stale approximation of the occupancy is used. The longer a trajectory is, the more biased the IOAC gradient is. IOPG and IOAC are off-option in that given a transition, all options contribute to the gradient explicitly.
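The per-step option occupancy can be maintained by a simple forward (filtering) recursion. The sketch below is our reading of that recursion under the call-and-return model above; the exact form in Smith et al. (2018) may differ, and all names are ours.

import numpy as np

# Forward-filtering sketch of the option occupancy p(O_t = . | history).
# m_prev[o']   : p(O_{t-1} = o' | history up to S_{t-1})
# act_prob[o'] : pi_{o'}(A_{t-1} | S_{t-1}), likelihood of the action actually taken
# beta_s[o']   : beta_{o'}(S_t), termination probabilities at the new state
# pi_s[o]      : pi(o | S_t), master policy at the new state
def update_occupancy(m_prev, act_prob, beta_s, pi_s):
    w = m_prev * act_prob                           # weight options by action likelihood
    # the previous option either continues, or terminates and pi re-selects
    m = w * (1.0 - beta_s) + np.dot(w, beta_s) * pi_s
    return m / m.sum()                              # normalize to get p(O_t = . | history)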

Augmented Hierarchical Policy: Levy and Shimkin (2011) propose the Augmented Hierarchical Policy (AHP) architecture. AHP reformulates the SMDP of the option framework as a single augmented MDP. The new state space is the product of the original state space and the option set, and the new action space augments each original action with an option choice and a binary flag indicating whether to terminate the previous option. All policy optimization algorithms can be used to learn an augmented policy in this new MDP, which learns $\pi$, $\{\beta_o\}$, and $\{\pi_o\}$ implicitly. However, the resulting gradient for the master policy is non-zero only when an option terminates (Equation 23 in Levy and Shimkin (2011)). This suggests that master policy learning in AHP is SMDP-style. Moreover, the resulting gradient for an intra-option policy is non-zero only when the option is being executed (Equation 24 in Levy and Shimkin (2011)). This suggests that option learning in AHP is on-option.

3 Two Augmented MDPs

In this section, we reformulate the SMDP as two augmented MDPs: the high-MDP $\mathcal{M}_H$ and the low-MDP $\mathcal{M}_L$. The agent makes high-level decisions (i.e., option selection) in $\mathcal{M}_H$ according to a policy $\pi_H$, and thus optimizes the master policy $\pi$ and the termination functions $\{\beta_o\}$. The agent makes low-level decisions (i.e., action selection) in $\mathcal{M}_L$ according to a policy $\pi_L$, and thus optimizes the intra-option policies $\{\pi_o\}$. Both augmented MDPs share the same samples as the SMDP $\mathcal{M}$.

We first define a dummy option $o_{\#}$ with $\beta_{o_{\#}}(s) \equiv 1$ and set $O_{-1} \doteq o_{\#}$. This dummy option is never executed. In the high-MDP, we interpret a state-option pair in the SMDP as a new state and an option in the SMDP as a new action. Formally speaking, we define

$$\mathcal{M}_H \doteq (\mathcal{S}_H, \mathcal{A}_H, p_H, r_H, p_0^H, \gamma), \quad \mathcal{S}_H \doteq \mathcal{S} \times (\mathbb{O} \cup \{o_{\#}\}), \quad \mathcal{A}_H \doteq \mathbb{O},$$
$$p_H\big((s', o) \mid (s, o'), o''\big) \doteq \mathbb{I}(o = o'') \sum_a \pi_o(a \mid s)\, p(s' \mid s, a), \quad r_H\big((s, o'), o\big) \doteq \sum_a \pi_o(a \mid s)\, r(s, a), \quad p_0^H\big((s, o_{\#})\big) \doteq p_0(s).$$

We define a Markov policy $\pi_H$ on $\mathcal{M}_H$ as

$$\pi_H\big(o \mid (s, o')\big) \doteq \beta_{o'}(s)\, \pi(o \mid s) + \big(1 - \beta_{o'}(s)\big)\, \mathbb{I}(o = o') = \pi(o \mid s, o').$$

In the low-MDP, we interpret a state-option pair in the SMDP as a new state and leave the action space unchanged. Formally speaking, we define

$$\mathcal{M}_L \doteq (\mathcal{S}_L, \mathcal{A}_L, p_L, r_L, p_0^L, \gamma), \quad \mathcal{S}_L \doteq \mathcal{S} \times \mathbb{O}, \quad \mathcal{A}_L \doteq \mathcal{A},$$
$$p_L\big((s', o') \mid (s, o), a\big) \doteq p(s' \mid s, a)\, \pi(o' \mid s', o), \quad r_L\big((s, o), a\big) \doteq r(s, a), \quad p_0^L\big((s, o)\big) \doteq p_0(s)\, \pi(o \mid s).$$

We define a Markov policy $\pi_L$ on $\mathcal{M}_L$ as

$$\pi_L\big(a \mid (s, o)\big) \doteq \pi_o(a \mid s).$$
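For discrete option sets, the two augmented policies are straightforward to compute from $\pi$, $\{\beta_o\}$, and $\{\pi_o\}$. The following sketch spells this out; the function and argument names are ours, and `None` plays the role of the dummy option $o_{\#}$.

import numpy as np

# Sketch of the two augmented policies for a discrete option set.
# master(s) returns the vector pi(.|s); beta(s) returns the vector of
# termination probabilities beta_o(s); intra[o](s) returns pi_o(.|s).
def pi_H(s, o_prev, master, beta):
    b = 1.0 if o_prev is None else beta(s)[o_prev]   # beta_{o'}(s); dummy always terminates
    probs = b * master(s)                            # re-select a new option w.p. b
    if o_prev is not None:
        probs[o_prev] += 1.0 - b                     # otherwise keep the previous option
    return probs                                     # pi_H(.|(s, o'))

def pi_L(s, o, intra):
    return intra[o](s)                               # pi_L(.|(s, o)) = pi_o(.|s)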

We consider trajectories with nonzero probabilities in $\mathcal{M}$, $\mathcal{M}_H$, and $\mathcal{M}_L$. Ignoring actions for the moment, a trajectory of the SMDP $\mathcal{M}$ can be mapped to a trajectory of the high-MDP $\mathcal{M}_H$ by reading each state-option pair $(S_t, O_{t-1})$ as a high-MDP state and the selected option $O_t$ as a high-MDP action. We have:

Lemma 1

This mapping is a bijection between the trajectories of $\mathcal{M}$ and the trajectories of $\mathcal{M}_H$, and it preserves both trajectory probabilities and accumulated discounted rewards.

Proof. See supplementary materials.

We now take actions into consideration. Analogously, a trajectory of the SMDP $\mathcal{M}$ can be mapped to a trajectory of the low-MDP $\mathcal{M}_L$ by reading each state-option pair $(S_t, O_t)$ as a low-MDP state and keeping the actions $A_t$ unchanged. We have:

Lemma 2

This mapping is a bijection between the trajectories of $\mathcal{M}$ and the trajectories of $\mathcal{M}_L$, and it preserves both trajectory probabilities and accumulated discounted rewards.

Proof. See supplementary materials.

Proposition 1

The expected discounted return in the SMDP $\mathcal{M}$, the expected discounted return of $\pi_H$ in $\mathcal{M}_H$, and the expected discounted return of $\pi_L$ in $\mathcal{M}_L$ are all equal.

Proof. Follows directly from Lemma 1 and Lemma 2.

Lemma 1 and Lemma 2 indicate that sampling from $\mathcal{M}$ is equivalent to sampling from $\mathcal{M}_H$ and $\mathcal{M}_L$. Proposition 1 indicates that optimizing the expected return in $\mathcal{M}$ is equivalent to optimizing the expected return of $\pi_H$ in $\mathcal{M}_H$ and to optimizing the expected return of $\pi_L$ in $\mathcal{M}_L$. We now make two observations:

Observation 1

The transition kernel of $\mathcal{M}_H$ depends on the intra-option policies $\{\pi_o\}$, while the policy $\pi_H$ depends on $\pi$ and $\{\beta_o\}$.

Observation 2

The transition kernel of $\mathcal{M}_L$ depends on $\pi$ and $\{\beta_o\}$, while the policy $\pi_L$ depends on the intra-option policies $\{\pi_o\}$.

Observation 1 suggests that when we keep the intra-option policies fixed and optimize $\pi_H$, we are implicitly optimizing the master policy and the termination functions (i.e., $\pi$ and $\{\beta_o\}$). Observation 2 suggests that when we keep the master policy and the termination conditions fixed and optimize $\pi_L$, we are implicitly optimizing the intra-option policies (i.e., $\{\pi_o\}$). All policy optimization algorithms for MDPs can be used off the shelf to optimize the two actors $\pi_H$ and $\pi_L$ with samples from $\mathcal{M}$, yielding a new family of algorithms for master policy learning and option learning, which we refer to as the Double Actor-Critic (DAC) architecture. When we optimize $\pi_H$ and $\pi_L$ alternately with different samples, the expected return is guaranteed to improve unless we have already reached a local maximum, provided that policy improvement is guaranteed for the specific policy optimization algorithm we use. We can also optimize $\pi_H$ and $\pi_L$ with the same samples. This improves data efficiency but introduces bias. The pseudocode of DAC is provided in the supplementary materials. We present a thorough comparison of DAC, OC, IOPG and AHP in Table 1. DAC combines the advantages of both AHP (i.e., compatibility) and OC (i.e., intra-option learning). Enabling off-option learning of intra-option policies in DAC, as in IOPG, is a possible direction for future work.

          Learning $\pi$    Learning $\{\pi_o\}$   Online Learning   Compatibility
AHP       SMDP              on-option              yes               yes
OC        intra-option      on-option              yes               no
IOPG      intra-option      off-option             no                no
DAC       intra-option      on-option              yes               yes

Table 1: A comparison of AHP, OC, IOPG and DAC. (1) For learning the termination functions $\{\beta_o\}$, all four are intra-option. (2) IOAC is online but introduces bias and consumes extra memory. (3) Compatibility indicates whether a framework can be combined with any policy optimization algorithm off the shelf.
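To make the two-actor view concrete, the following sketch shows how a single environment transition is reinterpreted as one transition in each augmented MDP before being passed to two off-the-shelf learners (e.g., PPO or A2C). The learner interface (`observe`, `maybe_update`) is a placeholder of ours, not an API from the paper.

# Sketch (names ours) of one DAC step: a single SMDP transition
# (s, o_prev, o, a, r, s_next, o_next) is reinterpreted in both augmented MDPs.
def dac_step(high_learner, low_learner, s, o_prev, o, a, r, s_next, o_next, done):
    # High-MDP transition: state (s, o_prev), action o, next state (s_next, o).
    high_learner.observe(state=(s, o_prev), action=o, reward=r,
                         next_state=(s_next, o), done=done)
    # Low-MDP transition: state (s, o), action a, next state (s_next, o_next).
    low_learner.observe(state=(s, o), action=a, reward=r,
                        next_state=(s_next, o_next), done=done)
    # The two optimizations can share samples (biased) or alternate over disjoint samples.
    high_learner.maybe_update()
    low_learner.maybe_update()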

In general, we need two critics in DAC, which can be learned via any policy evaluation algorithm. However, when state-value functions are used as critics, Proposition 2 shows that the state-value function in the high-MDP can be expressed in terms of the state-value function in the low-MDP, and hence only one critic is needed.

Proposition 2

With $v_{\pi_H}$ and $v_{\pi_L}$ denoting the state-value functions of $\pi_H$ in $\mathcal{M}_H$ and of $\pi_L$ in $\mathcal{M}_L$ respectively, we have

$$v_{\pi_H}\big((s, o')\big) = \sum_{o \in \mathbb{O}} \pi_H\big(o \mid (s, o')\big)\, v_{\pi_L}\big((s, o)\big).$$

Proof. See supplementary materials.

Beyond Intra-option Q-Learning: In terms of learning a master policy with fixed options, Observation 1 suggests we optimize $\pi_H$ on $\mathcal{M}_H$. This immediately yields a family of policy-based algorithms for learning a master policy, all of which are intra-option. In particular, when we use Off-Policy Expected Policy Gradients (Off-EPG, Ciosek and Whiteson 2017) for optimizing $\pi_H$, we get the merits of both Intra-option Q-Learning and policy gradients for free. (1) By the definitions of $\mathcal{M}_H$ and $\pi_H$, Off-EPG optimizes the master policy in an intra-option manner and is as data efficient as Intra-option Q-Learning. (2) Off-EPG is an off-policy algorithm, so off-policy transitions can also be used, as in Intra-option Q-Learning. (3) Off-EPG is off-option in that all the options, not only the executed one, explicitly contribute to the policy gradient at every time step. In particular, this off-option approach does not require deterministic intra-option policies, unlike Intra-option Q-Learning. (4) Off-EPG uses a policy for decision making, which is more robust than value-based decision making. We leave an empirical study of this particular application for future work and focus in this paper on the more general problem of learning the master policy and the options simultaneously. When the options are not fixed, the MDP $\mathcal{M}_H$ for learning $\pi_H$ becomes non-stationary. We therefore prefer on-policy methods to off-policy methods.

4 Experimental Results

We design experiments to answer the following questions: (1) Can DAC outperform existing gradient-based option learning algorithms (e.g., AHP, OC, IOPG)? (2) Can options learned in DAC translate into a performance boost over its hierarchy-free counterparts? (3) What options does DAC learn?

DAC can be combined with any policy optimization algorithm, e.g., policy gradient (Sutton et al., 2000), Natural Actor Critic (NAC, Peters and Schaal 2008), PPO, Soft Actor Critic (Haarnoja et al., 2018), or Generalized Off-Policy Actor Critic (Zhang et al., 2019a). In this paper, we focus on the combination of DAC and PPO, given the great empirical success of PPO (OpenAI, 2018). Our PPO implementation uses the same architecture and hyperparameters reported by Schulman et al. (2017).

Levy and Shimkin (2011) combine AHP with NAC and present an empirical study on an inverted pendulum domain. In our experiments, we also combine AHP with PPO for a fair comparison. To the best of our knowledge, this is the first time that AHP has been evaluated with state-of-the-art policy optimization algorithms in prevailing deep RL benchmarks. We also implemented IOPG and OC as baselines. We use 4 options for all algorithms, following Smith et al. (2018). We report the online training episode return, smoothed by a sliding window of size 20. All curves are averaged over 10 independent runs and shaded regions indicate standard errors. More details about the experiments are provided in the supplementary materials.

4.1 Single Task Learning

We consider four robot simulation tasks used by Smith et al. (2018) from OpenAI gym (Brockman et al., 2016). We also include the combination of DAC and A2C (Mnih et al., 2016) for reference. The results are reported in Figure 1.

Figure 1: Online performance on a single task

Results: (1) Our implementations of OC and IOPG reach similar performance to that reported by Smith et al. (2018), and both are significantly outperformed by vanilla PPO and by option-based PPO (i.e., DAC+PPO, AHP+PPO). However, the performance of DAC+A2C is similar to that of OC and IOPG. These results indicate that the performance boost of DAC+PPO and AHP+PPO mainly comes from the more advanced policy optimization algorithm (PPO). This is exactly the major advantage of DAC and AHP: they allow all state-of-the-art policy optimization algorithms to be used off the shelf to learn options. (2) The performance of DAC+PPO is similar to vanilla PPO in 3 out of 4 tasks. DAC+PPO outperforms PPO in Swimmer by a large margin. This performance similarity between an option-based algorithm and a hierarchy-free algorithm is expected and is also reported by Harb et al. (2018); Smith et al. (2018). Within a single task, it is usually hard to translate the automatically discovered options into a performance boost, as primitive actions are enough to express the optimal policy. Meanwhile, learning the additional structure (the options) may also slow down learning on the original task. (3) The performance of DAC+PPO is similar to that of AHP+PPO, as expected. The main advantage of DAC over AHP is its data efficiency in learning the master policy. Within a single task, it is possible that an agent focuses on a single “mighty” option and ignores other specialized options, making master policy learning less important. By contrast, when we switch tasks, cooperation among different options becomes more important. We therefore expect the data efficiency of master policy learning in DAC to translate into a performance boost over AHP in a transfer learning setting.

4.2 Transfer Learning

We consider a transfer learning setting where, after the first 1M training steps, we switch to a new task and train the agent for another 1M steps. The agent is not aware of the task switch. The two tasks are correlated, and we expect that options learned in the first task can be used to accelerate learning in the second task.

We use 6 pairs of tasks based on the DeepMind Control Suite (DMControl, Tassa et al. 2018): CartPole, Reacher, Cheetah, Fish, Walker1, and Walker2. Most of the individual tasks are provided by DMControl, and some we constructed in a manner similar to Hafner et al. (2018). The maximum score is always 1000. More details are provided in the supplementary materials. There are other possible task pairs in DMControl, but we found that in such pairs PPO hardly learns anything in the second task. Hence, we omit those pairs from our experiments. The results are reported in Figure 2.

Figure 2: Online performance for transfer learning

Results: (1) During the first task, DAC+PPO consistently outperforms OC and IOPG by a large margin and maintains a performance similar to PPO and AHP+PPO. These results are consistent with our previous observations in the single-task setting. (2) After the task switch, the advantage of DAC+PPO becomes clear. DAC+PPO outperforms all other baselines by a large margin in 3 out of 6 tasks and is among the best algorithms in the other 3 tasks. This matches our expectation about DAC and AHP from Section 4.1. (3) We further study the influence of the number of options in Walker2. Results are provided in the supplementary materials. We find 8 options are slightly better than 4 options, while 2 options are worse. We conjecture that 2 options are not enough for transferring knowledge from the first task to the second.

4.3 Option Structures

We visualize the learned options and the option occupancy of DAC+PPO on Cheetah in Figure 3. There are 4 options in total, displayed via different colors. The upper strip shows the option occupancy during an episode at the end of training on the first task (run). The lower strip shows the option occupancy during an episode at the end of training on the second task (backward). Both episodes last 1000 steps. (A video of the two episodes is available at https://youtu.be/K0ZP-HQtx6M.) The four options are distinct. The blue option is mainly used when the cheetah is “flying”. The green option is mainly used when the cheetah pushes its left leg to move right. The yellow option is mainly used when the cheetah pushes its left leg to move left. The red option is mainly used when the cheetah pushes its right leg to move left. During the first task, the red option is rarely used. The cheetah uses the green and yellow options for pushing its left leg and uses the blue option for flying. The right leg rarely touches the ground during the first episode. After the task switch, the flying option (blue) transfers to the second task, the yellow option specializes for moving left, and the red option is developed for pushing the right leg to the left.

Figure 3: Learned options and option occupancy of DAC+PPO in Cheetah

5 Related Work

Many components of DAC are not new. The idea of an augmented MDP is suggested by Levy and Shimkin (2011) in AHP. The augmented state spaces $\mathcal{S}_H$ and $\mathcal{S}_L$ are also used by Bacon et al. (2017) to simplify their derivation. Applying a vanilla policy gradient to $\pi_L$ in $\mathcal{M}_L$ leads immediately to the Intra-Option Policy Gradient Theorem (Bacon et al., 2017). An augmented policy is also used by Smith et al. (2018) to simplify their derivation. However, neither OC nor IOPG works on the augmented state space directly. To the best of our knowledge, DAC is the first work to formulate the two augmented MDPs explicitly. It is this explicit formulation that allows the off-the-shelf application of all state-of-the-art policy optimization algorithms and combines advantages from both OC and AHP, yielding a significant empirical performance boost. Furthermore, it is this explicit formulation that generates a family of policy-based intra-option algorithms for master policy learning.

Besides gradient-based option learning, there are also other option learning approaches based on finding bottleneck states or subgoals (Stolle and Precup, 2002; McGovern and Barto, 2001; Silver and Ciosek, 2012; Niekum and Barto, 2011; Machado et al., 2017a). In general, these approaches are expensive in terms of both samples and computation (Precup, 2018).

Besides the option framework, there are also other frameworks to describe hierarchies in RL. Dietterich (2000) decomposes the value function in the original MDP into value functions in smaller MDPs in the MAXQ framework. Dayan and Hinton (1993) employ multiple managers on different levels for describing a hierarchy. Vezhnevets et al. (2017) further extend this idea to FeUdal Networks, where a manager module sets abstract goals for workers. This goal-based hierarchy description is also explored by Schmidhuber and Wahnsiedler (1993); Levy et al. (2017); Nachum et al. (2018). Moreover, Florensa et al. (2017) use stochastic neural networks for hierarchical RL. We leave a comparison between the option framework and other hierarchical RL frameworks for future work.

6 Conclusions

In this paper, we reformulate the SMDP of the option framework as two augmented MDPs, allowing an off-the-shelf application of all policy optimization algorithms to option learning and master policy learning in an intra-option manner.

In DAC, there is no clear boundary between the option termination functions and the master policy. They are simply different internal parts of the augmented policy $\pi_H$. We observe that the termination probability of the active option becomes high as training progresses, although the master policy still selects the same option. This is also observed by Bacon et al. (2017). To encourage long options, Harb et al. (2018) propose a cost model for option switching. Including this cost model in DAC is a possibility for future work.

Acknowledgments

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. The authors thank Matthew Smith for an insightful discussion.

References

Appendix A Assumptions and Proofs

A.1 Assumptions

We use standard assumptions (Sutton et al., 1999; Bacon et al., 2017) about the MDP and options. Particularly, we assume all options are Markov.

A.2 Proof of Lemma 1

Proof.

The equalities of trajectory probabilities and accumulated rewards follow directly from the definitions of $\mathcal{M}_H$ and $\pi_H$. The mapping is an injection by definition, and the definition of $\mathcal{M}_H$ guarantees it is a surjection, so it is a bijection.

A.3 Proof of Lemma 2

Proof.

The equalities of trajectory probabilities and accumulated rewards follow directly from the definitions of $\mathcal{M}_L$ and $\pi_L$. The mapping is an injection by definition, and the definition of $\mathcal{M}_L$ guarantees it is a surjection, so it is a bijection.

A.4 Proof of Proposition 2

Proof.

Appendix B Details of Experiments

B.1 Pseudocode of DAC

Pseudocode of DAC is provided in Algorithm 1.

Input: parameterized $\pi_H$ and $\pi_L$; policy optimization algorithms for $\mathcal{M}_H$ and $\mathcal{M}_L$; an initial state $S_0$; $O_{-1} \leftarrow o_{\#}$
while True do
       Sample $O_t$ from $\pi_H(\cdot \mid (S_t, O_{t-1}))$; sample $A_t$ from $\pi_L(\cdot \mid (S_t, O_t))$
       Execute $A_t$, get $R_{t+1}$ and $S_{t+1}$
       // The two optimizations can be done in any order or alternately
       Optimize $\pi_H$ with its algorithm and the samples interpreted in $\mathcal{M}_H$
       Optimize $\pi_L$ with its algorithm and the samples interpreted in $\mathcal{M}_L$
end while
Algorithm 1: Pseudocode of DAC

B.2 Details of Environments

CartPole consists of balance and balance_sparse, where the latter has a sparse reward. Reacher consists of easy and hard, where the latter has a smaller target sphere than the former. These four tasks are provided by DMControl. Cheetah consists of run and backward. The former is from DMControl; the latter follows Hafner et al. (2018), where the horizontal speed of the cheetah is negated before being used to compute rewards. In this task, the cheetah is encouraged to run backward rather than forward. Fish consists of upright and downleft. The former is from DMControl. In the latter, we negate the uprightness before using it to compute rewards. This task encourages the fish to be “downleft”. Walker1 consists of squat and stand. The latter is from DMControl, where a reward is given when the torso height of the walker exceeds a standing threshold. In the former, we give a reward when the torso height exceeds a lower, squatting threshold. Walker2 consists of walk and backward. The former is from DMControl. In the latter, we negate the horizontal speed as in Cheetah-backward.

B.3 Parameterization

We base our parameterization on Schulman et al. (2017). For an option $o$, the termination function $\beta_o$ is parameterized as a two-hidden-layer network with a sigmoid activation after the output layer. The intra-option policy $\pi_o$ is parameterized as a two-hidden-layer network with a linear activation after the output layer, which outputs the mean of the Gaussian policy $\pi_o$. The standard deviation of $\pi_o$ is a state-independent variable, as in Schulman et al. (2015, 2017). The master policy $\pi$ is parameterized as a two-hidden-layer network in the same manner. The value function has the same parameterization as $\beta_o$ except that the activation function after the output layer is linear. All hidden layers have 64 hidden units. For MuJoCo tasks, we use a Tanh activation for hidden layers as suggested by Schulman et al. (2017). For DMControl tasks, we find a ReLU (Nair and Hinton, 2010) activation for hidden layers produces better performance. We use 4 options as suggested by Smith et al. (2018). We use this parameterization for all compared algorithms. We use 4 workers for OC, IOPG and DAC+A2C as suggested by Smith et al. (2018). For OC, we set the option switching cost to 0.01 as suggested by Bacon et al. (2017); Harb et al. (2018). For DAC+PPO and DAC+A2C, the two optimization steps use the same samples. Our preliminary experiments show that performing the two optimization steps alternately leads to similar performance.
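For concreteness, here is one possible PyTorch rendering of this parameterization. The framework choice, the module names, the per-option critic head, and the softmax output for the master policy are our assumptions for illustration, not details prescribed above.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, act=nn.Tanh):
    # two hidden layers of 64 units; output activation handled by the caller
    return nn.Sequential(nn.Linear(in_dim, hidden), act(),
                         nn.Linear(hidden, hidden), act(),
                         nn.Linear(hidden, out_dim))

class DACNet(nn.Module):
    # Assumed module layout (ours): one head per component described above.
    def __init__(self, state_dim, action_dim, num_options=4, act=nn.Tanh):
        super().__init__()
        self.beta = mlp(state_dim, num_options, act=act)      # termination logits
        self.pi_o_mean = nn.ModuleList(
            [mlp(state_dim, action_dim, act=act) for _ in range(num_options)])
        self.log_std = nn.Parameter(torch.zeros(num_options, action_dim))
        self.master = mlp(state_dim, num_options, act=act)    # master policy logits
        self.value = mlp(state_dim, num_options, act=act)     # one value per (s, o)

    def forward(self, s):
        return {
            "beta": torch.sigmoid(self.beta(s)),               # beta_o(s), sigmoid output
            "pi_o_mean": [head(s) for head in self.pi_o_mean], # Gaussian means, linear output
            "pi_o_std": self.log_std.exp(),                    # state-independent stds
            "master": torch.softmax(self.master(s), dim=-1),   # pi(.|s), softmax (our choice)
            "value": self.value(s),                            # linear-output critic
        }

A ReLU activation can be swapped in for DMControl tasks by passing act=nn.ReLU.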

B.4 Other Experimental Results

Figure 4 studies the influence of the number of options on performance. In the first task, the performance is similar. In the second task, 8 options are slightly better than 4 options, while 2 options are clearly worse.

Figure 4: The influence of the number of options on performance.