SOAC: The Soft Option Actor-Critic Architecture

06/25/2020 ∙ by Chenghao Li, et al. ∙ Tsinghua University

The option framework has shown great promise by automatically extracting temporally-extended sub-tasks from a long-horizon task. Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy. However, existing methods typically suffer from two major challenges: ineffective exploration and unstable updates. In this paper, we present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges. Our approach introduces an information-theoretical intrinsic reward to encourage the identification of diverse and effective options. Meanwhile, we utilize a probability inference model to simplify the optimization problem into fitting optimal trajectories. Experimental results demonstrate that our approach significantly outperforms prior on-policy and off-policy methods on a range of Mujoco benchmark tasks while still providing benefits for transfer learning. In these tasks, our approach learns a diverse set of options, each of which occupies a coherent region of the state-action space.




1 Introduction

In the past few years, deep reinforcement learning (DRL) has shown remarkable progress in challenging application domains, such as Atari games Mnih et al. (2015), the game of Go Silver et al. (2017), poker Brown & Sandholm (2018), StarCraft II Vinyals et al. (2019), and Dota 2 OpenAI et al. (2019). The combination of RL and high-capacity function approximators, such as neural networks, holds the promise of solving complex tasks in continuous control. However, millions of steps of data collection are needed to train effective behaviors. This training process might be simplified with a comprehensive understanding of tasks. A sophisticated agent should have the ability to identify distinct temporally-extended sub-tasks in a long-horizon task. How to efficiently discover such temporal abstractions has been widely studied in reinforcement learning (RL) McGovern & Barto (2001); Barto & Mahadevan (2003); Konidaris & Barto (2009); Da Silva et al. (2012); Kulkarni et al. (2016); Li et al. (2019). In this paper, we focus on the option framework Sutton et al. (1999), a distinct temporal abstraction method that can automatically discover courses of action with different intervals Riemer et al. (2018). This hierarchical structure has achieved notable success recently Bacon et al. (2017); Fox et al. (2017).

However, there remain challenges hampering widespread adoption of the option framework. One important aspect is exploration. The option framework suffers from a degradation problem caused by ineffective exploration: a single option may be selected to complete the entire task, which is tantamount to traditional end-to-end learning. Previous research tends to use on-policy learning to concurrently train the option selection policy and the intra-option policies Riemer et al. (2018); Bacon et al. (2017); Fox et al. (2017); Zhang & Whiteson (2019). However, only the options actually invoked can be updated in on-policy learning. Intra-option policies that are sampled more frequently are trained better and thus gain more chances of being selected. This biased sampling makes the degradation problem worse. Another widely-known challenge is the instability caused by simultaneous updates of high-level and low-level policies. Learning of intra-option policies will be unstable if the option selection policy frequently switches options to solve one sub-task. Previous work adapts the option selection policy to the updates of the intra-option policies Zhang & Whiteson (2019); Osa et al. (2019). However, this short-sighted learning might exacerbate instability.

To address these challenges, we present an off-policy soft option actor-critic (SOAC) approach that maximizes discounted rewards with entropy terms. This maximum entropy formulation provides sufficient exploration and robustness while acquiring diverse behaviors Haarnoja et al. (2017, 2018). The entropy bonus encourages the option selection policy to consider each intra-option policy in a balanced way. In addition, we introduce an information-theoretical intrinsic reward to enhance the identifiability of intra-option policies. We combine this intrinsic reward with another intrinsic reward related to anti-interference to define the objective for learning the optimal option selection policy. Meanwhile, we utilize external rewards to define the objective for learning the action selection policy. We theoretically derive that optimizing our maximum entropy model is equivalent to fitting optimal trajectories. Our algorithm alternates between policy evaluation and policy improvement to learn optimal policies. Moreover, in our approach, the soft optimality of policies allows behavior policies to differ from target policies Levine (2018); Schulman et al. (2017a). With this flexibility, the option selection policy can be trained to select options by considering all historical behavior of each intra-option policy, reducing instability. As shown in Figure 1, our algorithm learns a deep hierarchy of options.

Experimental results indicate that our hierarchical approach significantly improves the performance of SAC Haarnoja et al. (2018) and outperforms state-of-the-art hierarchical RL algorithms Zhang & Whiteson (2019); Osa et al. (2019) on benchmark Mujoco tasks (Section 5). In addition, we observe an obvious distinction between options, which indicates that a well-trained option selection policy is sophisticated enough to invoke a diverse set of options in different situations. We also show that the option selection policy can be transferred to accelerate learning in a new environment, even if the target task is dramatically different from the original one.

(a) Graphical model for a basic option trajectory.
(b) Option trajectory with a probability inference model.
Figure 1: Grey nodes are hidden variables. Left: the option framework introduces a hidden variable representing the labels of intra-option policies. At each time step, the option selection policy first decides whether to terminate the previous intra-option policy. If so, it chooses another intra-option policy depending on the current state. Right: the optimality variable, theoretically indicating whether the current state-action pair is optimal, is introduced to the option framework. The hidden variable only affects the environment by guiding the option selection policy to choose intra-option policies, so it does not directly influence the optimality variables. Instead, we utilize the optimality variables to judge whether the option selection policy is optimal, as explained in Section 4.1.

2 Related Work

Considerable prior work has explored how to extend the option framework Sutton et al. (1999) to deep reinforcement learning (DRL). Compared with the end-to-end learning process, learning the option framework from a single task brings more complex networks and more computational complexity. How to quickly learn an effective hierarchical structure is still an open question. Bacon et al. (2017) train the whole option framework with a policy gradient method. To leverage recent advances in gradient-based policies, the option framework has been combined with PPO Zhang & Whiteson (2019); Schulman et al. (2017b), TD3 Osa et al. (2019); Fujimoto et al. (2018), and multitask learning Igl et al. (2019). To improve sample efficiency, all available intra-option policies can be trained simultaneously with a marginal distribution evaluating the probability of each option being selected Smith et al. (2018). In addition, importance sampling (IS) has been used to propose off-policy algorithms Harutyunyan et al. (2018); Guo et al. (2017) that reuse past experience. This research does show some interesting ways forward. However, these methods are not efficient enough compared with current baseline model-free DRL algorithms such as SAC Haarnoja et al. (2018).

Probability inference models provide a way to analyze the probability of optimal trajectories Levine (2018); Kappen et al. (2012); Schulman et al. (2017a). Recently, these models have been adapted to numerous environments with DRL. Eysenbach et al. (2018) utilize diversity alone to update parameters. Haarnoja et al. (2018) propose the Soft Actor-Critic algorithm, a state-of-the-art algorithm in single-agent DRL. Huang et al. (2019) optimize Partially Observable MDPs (POMDPs) with sequential variational soft Q-learning. Previous work based on soft optimality has shown both sample-efficient learning and stability. Meanwhile, the information bottleneck, related to mutual information (MI) and the Kullback-Leibler (KL) divergence, is widely used to control the spread of information Alemi et al. (2016); Galashov et al. (2019); Goyal et al. (2019a); Wang et al. (2019). It can be used to judge the division of state-action space Osa et al. (2019) or to distinguish different skills Sharma et al. (2019). We build our approach on this basis.

3 Background

3.1 The Option Framework

The traditional Markov Decision Process (MDP) considers a tuple (S, A, P, r, γ), where S is the state space, A is the action space, P is the transition probability, r is the reward function, and γ is a discount factor. The option framework extends the original MDP problem to a semi-MDP (SMDP) problem. Each option ω consists of three components: an intra-option policy π_ω, a termination condition β_ω, and an initiation set I_ω Sutton et al. (1999). In this paper, we use Ω to denote the option space. At each time step t, the agent decides whether to terminate the previous intra-option policy, labeled ω_{t−1}, with the termination probability β_{ω_{t−1}}(s_t). If the previous intra-option policy is terminated, another intra-option policy is sampled from π_Ω(ω_t | s_t). The whole probability of transitioning between options, written as

π_Ω(ω_t | s_t, ω_{t−1}) = (1 − β_{ω_{t−1}}(s_t)) 𝟙[ω_t = ω_{t−1}] + β_{ω_{t−1}}(s_t) π_Ω(ω_t | s_t),

is called the high-level option selection policy in this paper. Meanwhile, each action a_t is sampled from the intra-option policy π_{ω_t}(a_t | s_t) corresponding to the current option and state.
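The termination-then-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `beta` and `pi_omega` stand in for the learned termination and option-selection networks, and all names are ours.

```python
import random

def option_step(prev_option, state, beta, pi_omega):
    """One high-level transition of the option framework.

    beta(option, state) -> termination probability of prev_option.
    pi_omega(state) -> dict mapping each option to its selection probability.
    Both callables are placeholders for learned networks.
    """
    if random.random() < beta(prev_option, state):
        # Previous option terminates: sample a new one from pi_omega.
        options, probs = zip(*pi_omega(state).items())
        return random.choices(options, weights=probs, k=1)[0]
    # Otherwise keep executing the previous option.
    return prev_option
```

With a termination probability of zero the previous option always continues; with a termination probability of one a fresh option is always drawn, recovering the two extremes of the transition formula above.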


3.2 Probability Inference Models

Different from the general form of reinforcement learning problems, DRL based on probability inference models attempts to directly optimize the probability of optimal trajectories. An additional variable O_t is introduced to denote whether the current time step is optimal. This variable provides a mathematical formalization to analyze whether the current policies are optimal. The log probability of the optimal trajectories can be theoretically proved to have an evidence lower bound related to dense rewards and entropy Levine (2018):

log p(O_{1:T}) ≥ E_{τ∼π} [ Σ_t ( r(s_t, a_t) + H(π(· | s_t)) ) ],

where π is the actor policy and H(π(· | s_t)) is the entropy regularization term.
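The quantity inside this bound is an entropy-augmented return. A minimal sketch of computing it for a logged trajectory, using −log π(a_t|s_t) as a single-sample estimate of the entropy bonus (an illustrative simplification of ours that omits discounting):

```python
import math

def soft_return(rewards, log_probs):
    """Entropy-augmented return: sum_t ( r_t - log pi(a_t|s_t) ).

    The -log pi(a_t|s_t) term is a one-sample estimate of the entropy
    bonus H(pi(.|s_t)) appearing in the evidence lower bound.
    """
    return sum(r - lp for r, lp in zip(rewards, log_probs))
```

For two steps with reward 1 and action probability 0.5, the soft return is 2 + 2 log 2: a stochastic policy earns more "soft" reward than the raw return alone.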

4 Method

In this section, we propose a maximum entropy problem and simplify it as fitting optimal trajectories with probability inference models. We propose an algorithm to estimate optimal policies iteratively.

4.1 Problem Formulation

Although previous research based on the option framework usually considers directly maximizing the reward function, we are interested in optimizing a maximum entropy model to address the ineffective exploration challenge. In addition, we introduce mutual information as an intrinsic reward to enhance the identifiability of each intra-option policy. Meanwhile, a disturbance in a state-action pair should not lead to a substantial change in option selection Li et al. (2019); Puri et al. (2019). Therefore, we add another intrinsic reward based on the total variation (TV) distance to encourage the option selection policy to consider connectivity in the state-action space while allocating options. In this term, the added disturbances are Gaussian noise, and the model parameters can be represented by neural networks. The whole maximum entropy problem is:


where we label the high-level option selection policy as π_Ω and the low-level intra-option policies as π_ω, H denotes entropy, one hyperparameter represents the importance of the external rewards, and two further coefficients are the weights of the intrinsic rewards.

To simplify the above problem, we introduce probability inference models. An additional variable O_t is introduced to describe whether the current condition is optimal: O_t = 1 indicates that time step t is optimal, and O_t = 0 indicates that it is not. In the rest of this paper, we use O_t to represent O_t = 1 for conciseness. With this additional variable, we define a conditional probability model representing the probability of a trajectory under optimal policies:


where O_{1:T} means O_t = 1 for all steps from 1 to T. The probability that a state-action pair is optimal is defined below, based on the Boltzmann distribution of an energy Levine (2018).

p(O_t | s_t, a_t) ∝ exp(r(s_t, a_t))
Inspired by Equation 5, we utilize a similar exponential form to define the optimal option selection.


With the option selection policy selecting options and the intra-option policies selecting actions, the probability of sampling a trajectory is:

Theorem 1.

The original maximum entropy optimization problem shown in Equation 3 can be simplified as shrinking the Kullback-Leibler (KL) divergence between the trajectory distribution induced by the current policies and the optimal trajectory distribution.


Proof. See supplementary materials.

4.2 Optimal Policies with Probability Inference Models

In this subsection, we derive the optimal policies with probability inference models. First, we introduce three backward messages. These messages denote the probability that a trajectory starting from the corresponding condition is optimal. With these backward messages, we can derive the optimal option selection probability and the optimal action selection probability as below.


Inspired by Levine (2018), we use the log form of the three backward messages to define value functions, namely the U-value, Q-value, and V-value functions used below. With these value functions, the optimal high-level policy and the optimal low-level policy are derived as below.


where the temperature α controls the degree of exploration. If α approaches infinity, the optimal policies obey a uniform distribution. In contrast, if α approaches zero, the optimal policies become greedy. To estimate the value functions, we derive the relationships between them.
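The role of the temperature can be illustrated with a small sketch of the exponential-form policy over a discrete set of values (the continuous case replaces the sum with an integral); the function name and numerical details are ours:

```python
import math

def soft_policy(q_values, alpha):
    """pi(a|s) proportional to exp(Q(s,a) / alpha) over a discrete set.

    alpha is the temperature: a large alpha flattens the policy toward
    a uniform distribution, a small alpha concentrates it on the
    greedy action.
    """
    # Subtract the max logit for numerical stability.
    logits = [q / alpha for q in q_values]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With Q-values [1, 2], a temperature of 10^6 gives a near-uniform policy while 10^-2 is effectively greedy, matching the two limits described above.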

Lemma 1.

The relationship between and is:


Proof. See supplementary materials.

Lemma 2.

The relationship between and is:


Proof. See supplementary materials.

Lemma 3.

The relationship between and is:


Proof. See supplementary materials.

With these relationships between value functions, we can iteratively train them to estimate optimal policies. In the next sub-section, we will explain our algorithm in detail.

4.3 Algorithm

In this subsection, we describe our training process. We use function approximators and stochastic gradient descent to estimate and train the U-value functions, the Q-value functions, the option selection policy, and the intra-option policies. For more stable training, we utilize double neural networks Fujimoto et al. (2018); Van Hasselt et al. (2016) and target neural networks Van Hasselt et al. (2016); Mnih et al. (2015) while estimating the U-value and Q-value functions. The Q-value functions are trained by minimizing the Bellman residual shown below, where we use the relationship shown in Equation 13 to replace the next-step value.


The Bellman residual of the U-value functions is:


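As a hedged sketch of how the regression target for such a residual might be formed with clipped double Q-networks and target networks — the exact target in the paper may differ, and the `alpha` and `gamma` values here are illustrative:

```python
def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  alpha=0.2, gamma=0.99):
    """Regression target for a soft Q-function.

    next_q1 / next_q2: the two target-network estimates of Q(s', a')
    for a' sampled from the current policy; taking the min curbs
    overestimation (clipped double Q-learning, Fujimoto et al., 2018).
    The -alpha * log pi(a'|s') term is the entropy bonus of the
    maximum entropy model.
    """
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value
```

A terminal transition (done = 1) masks the bootstrap term, so the target reduces to the immediate reward.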
It is difficult to directly calculate the optimal high-level policy and the optimal low-level policy from Equation 11 and Equation 12. We therefore use the KL divergence to estimate the policies. The option selection policy can be optimized by minimizing its KL divergence from the exponential-form target. Our option space is discrete, so we calculate the expectation directly Christodoulou (2019).


where the two lists collect the corresponding termination and option-selection terms: the termination probability decides whether to terminate the previous option, and the selection distribution chooses new options. Both are trained by minimizing this KL divergence.
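Because the option space is discrete, the KL divergence to an exponential-form target can be computed exactly rather than estimated by sampling. A minimal sketch of ours, assuming the target is a Boltzmann distribution over option values:

```python
import math

def option_policy_kl(option_probs, option_q_values, alpha):
    """Exact KL(pi_Omega(.|s) || softmax(Q_Omega(s,.) / alpha)).

    option_probs: the current high-level policy over a discrete option set.
    option_q_values: the corresponding option values.
    The expectation is a finite sum, so no sampling is needed.
    """
    logits = [q / alpha for q in option_q_values]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    target = [e / z for e in exps]
    # 0 * log(0) is taken as 0 by skipping zero-probability options.
    return sum(p * (math.log(p) - math.log(t))
               for p, t in zip(option_probs, target) if p > 0)
```

The loss is zero exactly when the current option distribution already matches the Boltzmann target, and positive otherwise.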

The intra-option policies are also optimized by minimizing the KL divergence. We use the reparameterization trick to allow gradients to pass through the expectation operator. At each time step t, the action is computed from a noise vector sampled from a Gaussian distribution.


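The reparameterization trick itself can be sketched in one line; a tanh squashing, as in SAC, would be applied on top for bounded action spaces (this sketch and its names are ours):

```python
import random

def reparam_sample(mu, sigma):
    """Reparameterized Gaussian sample: a = mu + sigma * eps, eps ~ N(0, 1).

    Writing the sample this way moves the randomness into eps, so
    gradients can flow through mu and sigma during optimization.
    """
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps
```

With sigma = 0 the sample is deterministic and equal to the mean, which makes the gradient path through mu easy to see.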
With all the above loss functions, we can iteratively train the value functions and estimate the high-level and low-level optimal policies. The whole algorithm is listed in Algorithm 1.


1: Input: initial parameters of the U-value, Q-value, and policy networks
2: Initialize target network weights from the corresponding networks
3: Initialize an empty replay buffer
4: for each iteration do
5:     for each simulation step do
6:         Sample an option and an action, then store the transition in the replay buffer
7:     end for
8:     for each update step do
9:         Update the Q-value functions
10:        Update the U-value functions
11:        Update the option selection policy and the intra-option policies
12:     end for
13:     Softly update the target network weights
14: end for
Algorithm 1 Soft Option Actor-Critic

5 Experiment

In this section, we design experiments to answer the following questions: (1) Can the additional option framework accelerate training? (2) Does the state-action space related to each option show strong coherence? (3) What is the impact of a well-trained option selection policy on a task with an opposite reward function? We adapt several benchmark robot control tasks in the Mujoco domain to answer these questions.

5.1 Results and Comparisons

We compare our algorithm with three other algorithms: Soft Actor-Critic (SAC) Haarnoja et al. (2018), Double Actor-Critic (DAC) Zhang & Whiteson (2019), and adInfoHRL Osa et al. (2019). SAC is a current baseline off-policy DRL algorithm, which is also based on maximum entropy and probability inference models. We use it here to test whether our option framework can accelerate learning. Meanwhile, to the best of our knowledge, DAC and adInfoHRL are the current best on-policy and off-policy algorithms with a similar hierarchical structure, introducing a latent variable to abstractly represent the state-action space. All corresponding hyperparameters are listed in the supplementary materials.

Figure 2: Training curves of episode return in benchmark continuous control tasks.

Figure 2 demonstrates the average return of test rollouts during training for SOAC (our algorithm), SAC, DAC, and adInfoHRL on four Mujoco tasks. We train four different instances of each algorithm with random seeds from zero to three, with each performing ten evaluation rollouts every 5000 environment steps, and choose the best three instances. The solid curves represent the mean value smoothed by a moving average, and the shaded region represents the minimum and maximum returns over the related trials. We notice that our algorithm dramatically outperforms DAC and adInfoHRL, both in terms of learning speed and stability. For example, on Hopper-v2, DAC and adInfoHRL suffer from unstable learning, but our algorithm quickly stabilizes at the highest score. Meanwhile, on Ant-v2, adInfoHRL fails to make any progress, but our algorithm dramatically outperforms the other algorithms. Compared with SAC, our algorithm performs comparably on HalfCheetah-v2 and Walker2d-v2 and outperforms it on Hopper-v2 and Ant-v2. These results indicate that our algorithm can accelerate learning by softly dividing the state-action space based on the option framework with sufficient exploration. We attribute part of this improvement to the multimodal treatment of our actor's policy. To deal with a continuous action space, an actor's policy is usually defined as a normal distribution. However, this might not match the actual optimal policy. The entire policy of our actor has a multimodal distribution, similar to a Gaussian Mixture Model (GMM), which gives our agents a stronger ability to make decisions. In addition, parts of the neural networks related to different intra-option policies are shared to accelerate training Zheng et al. (2018). This provides the same feature extraction strategy for all intra-option policies.

5.2 Visualization of State-Action Space with Different Options

Our algorithm performs well in the Mujoco domain with stable learning curves. To verify whether our option selection policy is reasonable, we utilize the t-SNE method Maaten & Hinton (2008) to illustrate the state-action space corresponding to each option in Figure 3. We notice distinct clusters for different options in each Mujoco task. This indicates that our option selection policy is well trained to assign options to different situations. In addition, we notice multiple clusters related to one option, which is similar to Osa et al. (2019); Oord et al. (2018); Goyal et al. (2019b). This is because the option selection policy might assign different sub-tasks to one option, limited by the fixed number of options. How to determine the most suitable number of options is still an open question, although most previous research tends to set the number to four Bacon et al. (2017); Smith et al. (2018); Osa et al. (2019); Zhang & Whiteson (2019). Exploring a variable number of options is a direction for future work.

Figure 3: Embedding visualization of the state-action space with the t-SNE method.
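A minimal sketch of producing such a visualization with scikit-learn's t-SNE implementation; the feature construction (concatenating states and actions) and all parameter values are our assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_option_space(states, actions, seed=0):
    """Project (state, action) pairs into 2-D with t-SNE.

    states: array of shape (n, state_dim); actions: (n, action_dim).
    Coloring the returned 2-D points by option label then shows whether
    each option occupies a coherent region of the state-action space.
    """
    features = np.concatenate([states, actions], axis=1)
    tsne = TSNE(n_components=2, perplexity=5.0, init="random",
                random_state=seed)
    return tsne.fit_transform(features)
```

Each row of the returned array is one transition; scattering the rows with one color per option reproduces the kind of cluster plot shown in Figure 3.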

5.3 Transfer of Option Selection Policy

We wonder whether the option selection policy learned from a certain task can discover a general division of the environment. Even though our algorithm is not designed for transfer learning, we find that our well-trained option selection policy can accelerate training in a diametrically different task with an opposite reward function. As shown in Figure 4, transferring the high-level option selection policy accelerates learning compared with transferring nothing in most Mujoco domains. Meanwhile, the transferred option selection policy makes training more stable. Especially on Hopper-v2, a hopper suffers from falling down while attempting to jump backwards. With the transferred option selection policy, agents have more opportunities to learn to jump backwards rather than staying in place. These results indicate that our well-trained option selection policy can generally divide the environment and assign sub-tasks with probability models, which provides benefits for transfer learning.

Figure 4: Learning curves of transferring option selection policy compared with transferring nothing.
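One simple way to realize such a transfer is to copy only the high-level parameters into a freshly initialized agent. In the sketch below, the parameter layout (a flat dict whose keys carry an "option_policy/" prefix) is hypothetical, not the paper's actual checkpoint format:

```python
def transfer_option_policy(source_params, target_params):
    """Copy only high-level option-selection parameters into a new agent.

    Parameters are modeled as a flat dict mapping names to weights; keys
    prefixed with "option_policy/" are assumed to belong to the high-level
    policy. Intra-option policies are left to be relearned from scratch,
    mirroring the transfer setting in Section 5.3.
    """
    for name, value in source_params.items():
        if name.startswith("option_policy/"):
            target_params[name] = value
    return target_params
```

After the copy, only the option-selection weights carry over; the low-level weights keep their fresh initialization.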

6 Conclusion

In this paper, we propose the soft option actor-critic (SOAC), an off-policy maximum entropy DRL algorithm built on the option framework. With probability inference models, we theoretically derive optimal policies based on soft optimality and simplify our optimization problem into fitting optimal trajectories. We empirically demonstrate that our algorithm matches or exceeds prior on-policy and off-policy methods on a range of Mujoco benchmark tasks while still providing benefits for transfer learning. The state-action space associated with each option shows strong connectivity. These results indicate that our option selection policy is sophisticated enough to assign options to different situations. Our algorithm has shown the potential to boost sample efficiency with effective exploration, addressing well-known challenges that restrict the applicability of the option framework.

7 Broader Impact

Deep reinforcement learning (DRL) has achieved remarkable progress in recent years. It has exceeded human-level performance in many challenging environments, such as Atari games Mnih et al. (2015, 2013), the game of Go Silver et al. (2017), poker Brown & Sandholm (2018), and StarCraft II Vinyals et al. (2019). However, the classical end-to-end learning process still suffers from high-dimensional state and action spaces, which might slow convergence and cause unbearable training times. In this paper, we attempt to train an option framework, which can extract sub-tasks with arbitrary intervals from a long-horizon task to simplify the original MDP problem. We combine the option framework with probability inference models and information-theoretical intrinsic rewards, and propose a novel and stable off-policy algorithm to address the well-known challenges mentioned in the introduction. Creation starts from the ability to discover and summarize problems. With the option framework, agents can learn diverse skills from sub-tasks proposed by themselves while solving the entire task. In general, the option framework encourages agents to explore the environment and ask questions, which might be a key step toward more autonomous artificial intelligence. Learning the option framework will certainly bring more computational complexity. Nevertheless, our approach has shown that learning this hierarchical structure can accelerate training in the Mujoco domain. Our approach can be regarded as a step toward widespread adoption of the option framework.


Appendix A Theory Details

A.1 Graphical Models

The whole trajectory is shown in Figure 1. Its corresponding distribution is:


where β denotes the termination condition function and π_Ω the option choosing policy.

A.2 Derivation of the Optimization Problem

Based on the probability models corresponding to the optimality variables shown in Equation 5 and Equation 6, we can recover the explicit form of the optimal trajectory distribution from Equation 4.



Here, the leading factor is a constant representing the product of some prior probabilities.

Our optimization process can be defined as continuously shrinking the KL divergence from the optimal trajectory distribution, which can be written as


where the additive term is a constant that can be ignored while optimizing the policies. Our optimization problem can be further simplified to:


A.3 Relationship among Backward Messages

The relationship among the three backward messages is:


where the prior option choosing policy can be assumed to be a uniform distribution over the set of options.


where the prior action choosing policy can be assumed to be a uniform distribution over the set of actions.


A.4 Proof of Lemma 1


We assume the prior option choosing policy is uniform over all possible values. To simplify our formulation, we assume the corresponding prior term equals one regardless of its argument. This might cause the estimated V function to be a multiple of the actual V function. Our optimal option choosing policy and optimal action choosing policy have the softmax form, so this multiplicative error will not change the optimal policies. In addition, we believe a well-tuned α can offset the deviation. Based on the assumptions above, the V function can be written as: