Safe Option-Critic: Learning Safety in the Option-Critic Architecture

07/21/2018 ∙ by Arushi Jain, et al. ∙ McGill University

Designing hierarchical reinforcement learning algorithms that induce a notion of safety is not only vital for safety-critical applications, but also brings a better understanding of an artificially intelligent agent's decisions. While end-to-end learning of options has recently been realized, we propose a solution to learning safe options. We introduce the idea of controllability of states based on the temporal difference errors in the option-critic framework. We then derive the policy-gradient theorem with controllability and propose a novel framework called safe option-critic. We demonstrate the effectiveness of our approach in the four-rooms grid-world, cartpole, and three games in the Arcade Learning Environment (ALE): MsPacman, Amidar and Q*Bert. Learning end-to-end options with the proposed notion of safety reduces the variance of the return and boosts performance in environments with intrinsic variability in the reward structure. More importantly, the proposed algorithm outperforms vanilla options in all the environments and primitive actions in two out of three ALE games.




1. Introduction

Safety in Artificial Intelligence (AI) can be viewed from many perspectives. Traditionally, introducing some form of risk-awareness in AI systems has been a prime way of defining safety in machines. More recently, researchers have broadened the horizon of safety in AI to address different sources of errors and faulty behaviors Amodei et al. (2016). The Asilomar AI principles Future of Life Institute (2017) comprise varied aspects of safety such as risk-averseness, transparency, robustness, fairness, and also the legal and ethical values an agent should hold. In this work, we refer to the following definition of safety: preventing undesirable behavior, in particular, reducing the visits to undesirable states during the learning process in reinforcement learning (RL).

RL agents primarily learn by optimizing their discounted cumulative rewards Sutton and Barto (1998). While rewards are a good indicator of how to behave, they do not necessarily always lead to the most desired behavior. Optimal reward design Sorg et al. (2010) still poses a challenge for the algorithm designers with several issues such as misspecified rewards Amodei and Clark (2016); Hadfield-Menell et al. (2017) and corrupted reward channels Everitt et al. (2017) to name a few. Alternatively, learning with constraints allows us to introduce more clarity in the objective function Altman (1999).

During exploration, agents are naturally unaware of the states which may be prone to errors or may lead to catastrophic consequences. Risk-awareness has been introduced in agents by directing exploration safely Law et al. (2005), optimizing the worst-case performance Tamar et al. (2013), measuring the probabilities of visiting erroneous states Geibel and Wysotzki (2005), and several other approaches. García and Fernández (2015) present a comprehensive survey covering a broad range of techniques to realize safety in RL. In a Markov Decision Process (MDP), the majority of the methods seek to minimize the variance of return as a risk mitigation strategy. Many authors Sato et al. (2001); Mihatsch and Neuneier (2002); Tamar et al. (2012); Gehring and Precup (2013); Tamar et al. (2016); Sherstan et al. (2018) have used temporal difference (TD) learning for estimating the variance of return to capture the notion of uncertainty in the value of a state.

While some of the aforementioned approaches leverage TD learning in estimating errors and risks, all of them define notions of safety in the primitive action space. Temporally abstract actions provide a way to represent information hierarchically. Learning and planning in a hierarchical fashion is very close to how humans think about and approach a problem. Temporal abstractions have long been vital to the AI community Fikes et al. (1981); Iba (1989); Korf (1983); McGovern and Barto (2001); Menache et al. (2002); Barto and Mahadevan (2003). Prior research has shown that temporal abstractions improve exploration, reduce the complexity of choosing actions, and enhance robustness to misspecified models. The options framework Sutton et al. (1999); Precup (2000) provided an intuitive way to plan, reason, and act in a continual fashion as opposed to learning with the primitive actions. Many authors Stolle and Precup (2002); Daniel et al. (2016); Konidaris and Barto (2007); Konidaris et al. (2011); Kulkarni et al. (2016); Vezhnevets et al. (2016); Mankowitz et al. (2016) provide methods for discovering subgoals and then learning policies to achieve those subgoals.

The option-critic framework Bacon et al. (2017) enables end-to-end learning of the options. However, defining a safe option which does not lead to the erroneous states during the learning process still remains an open question. We introduce the idea of controllability Gehring and Precup (2013) in the options framework as an additional condition in the optimality criterion which constrains the variance of the TD error as a measure of uncertainty about the value of a state-option pair. In this work, we propose a new framework called safe option-critic for learning the safety in options.

Key Contributions: This work incorporates the notion of safety in the option-critic framework and presents a mechanism to automatically learn safe options. We derive the policy-gradient theorem for the safe option-critic framework using constraint-based optimization. We then demonstrate through experiments in the four-rooms grid environment that learning the options with controllability (a term quantifying the controllable behavior of an agent) results in safer policies which avoid states with high variance in the TD error. Empirically, we show the benefits of learning safe options in the ALE environments with high intrinsic variability in the rewards. Our approach outperforms the vanilla options with no notion of safety in three Atari games, namely MsPacman, Amidar and Q*Bert. In two out of three games, learning the safe options also outperforms the primitive actions. To this end, we propose a novel Safe Option-Critic framework for future research in the AI Safety paradigm.

2. Preliminaries

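In RL, an agent interacts with the environment at discrete time steps $t$, where it observes a state $s_t \in \mathcal{S}$. The agent then chooses an action $a_t \in \mathcal{A}$ from a policy $\pi$, which defines a probability distribution over actions for each state, $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$. After choosing an action, the agent transitions to a new state $s_{t+1}$ according to the transition function $P$ and receives a reward $r_{t+1}$, where the reward function is defined as $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. An MDP is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\gamma$ is a discount factor. A discounted state-action value function is defined as $Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$. The value of $Q_\pi$ can be learned incrementally using one-step TD learning, also written as TD(0), which is a special case of TD($\lambda$) Sutton (1988). The state-action value is updated using the equation $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \delta_t$. Here $\alpha$ is the step size and $\delta_t$ is the TD(0) error, defined as $\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$.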
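The one-step TD update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the table layout, the function name `td0_update`, and the default step size and discount factor are assumptions chosen for the example.

```python
from collections import defaultdict


def td0_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one TD(0) update to q[(s, a)] and return the TD error.

    alpha and gamma are illustrative defaults, not values from the paper.
    """
    # No bootstrapping from a terminal successor state.
    bootstrap = 0.0 if terminal else gamma * q[(s_next, a_next)]
    delta = r + bootstrap - q[(s, a)]
    q[(s, a)] += alpha * delta
    return delta


# Hypothetical transition: state 0, action 1, reward 1.0, then state 1, action 0.
q_table = defaultdict(float)
first_delta = td0_update(q_table, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

The returned TD error is the quantity whose variance the later sections use as a safety signal.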

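The policy gradient theorem Sutton et al. (2000) presents a way of updating the parameterized policy $\pi_\theta$ according to the gradient of the expected discounted return, which is defined as $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$. The gradient with respect to the policy parameter $\theta$ is given as:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_{s} \mu_{\pi_\theta}(s \mid s_0) \sum_{a} \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a), \tag{1}$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is the discounted weighting of the states with the starting state as $s_0$.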

2.1. Options

The options framework Sutton et al. (1999); Precup (2000) provides a way to incorporate temporally abstract knowledge into RL with no change in the existing setup. An option $\omega$ is defined as a tuple $(I_\omega, \pi_\omega, \beta_\omega)$, where $I_\omega \subseteq \mathcal{S}$ is the initiation set containing the states from which the option can start, $\pi_\omega$ is the option policy defining a distribution over actions given a state, and $\beta_\omega$ is the termination condition of the option, defined as the probability of terminating in a state. As an example, options could correspond to high-level sub-goals such as going to a market, buying vegetables and making a dish, whereas the primitive actions could be muscle twitches.
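To make the three components of an option concrete, here is a small Python sketch of the tuple as a data structure. The class name `Option`, its method names, and the toy corridor option are hypothetical and only illustrate the definition above.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Option:
    """An option as (initiation set, intra-option policy, termination condition)."""
    initiation_set: Set[int]
    policy: Callable[[int], Dict[str, float]]  # state -> action probabilities
    termination: Callable[[int], float]        # state -> probability of terminating

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set

    def act(self, state: int, rng: random.Random) -> str:
        probs = self.policy(state)
        actions = list(probs)
        return rng.choices(actions, weights=[probs[a] for a in actions])[0]

    def terminates(self, state: int, rng: random.Random) -> bool:
        return rng.random() < self.termination(state)


# Hypothetical option for a 5-state corridor: walk right, stop at state 4.
go_right = Option(
    initiation_set={0, 1, 2, 3},
    policy=lambda s: {"left": 0.0, "right": 1.0},
    termination=lambda s: 1.0 if s == 4 else 0.0,
)
```

Keeping the policy and termination as plain callables makes it easy to swap in learned, parameterized versions later.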

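When the options are Markov, the intra-option Bellman equation Sutton et al. (1999) provides an off-policy method for updating the Q value of a state-option pair, which can be written as:

$$Q(s, \omega) = \sum_{a} \pi_\omega(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \Big( \big(1 - \beta_\omega(s')\big) Q(s', \omega) + \beta_\omega(s') Q(s', \omega') \Big) \right], \tag{2}$$

where $\omega'$ is selected from the policy over options $\pi_\Omega$.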

2.2. Learning Options

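The intra-option value learning Sutton et al. (1999) lays the foundation for learning the options in the option-critic architecture (Bacon et al., 2017). It is a policy-gradient based method for learning the intra-option policies and the termination conditions of the options. Bacon et al. (2017) considered the call-and-return option execution model, where an option $\omega$ is chosen according to the policy over options $\pi_\Omega$, and the intra-option policy $\pi_\omega$ is followed until the termination condition $\beta_\omega$ is met. Once the current option terminates, another option to be executed at that state is selected in the same fashion. $\pi_{\omega,\theta}$ denotes the intra-option policy parameterized by $\theta$, and $\beta_{\omega,\nu}$ represents the option termination parameterized by $\nu$. The value of executing an action at a particular state-option pair is then given by $Q_U(s, \omega, a)$, where

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s'), \tag{3}$$

where $U(\omega, s')$ represents the value of executing an option $\omega$ at a state $s'$:

$$U(\omega, s') = \big(1 - \beta_{\omega,\nu}(s')\big) Q_\Omega(s', \omega) + \beta_{\omega,\nu}(s') V_\Omega(s'). \tag{4}$$

Here, $Q_\Omega$ represents the optimal-value function for a given option $\omega$, given by $Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s) Q_U(s, \omega, a)$. $V_\Omega$ represents the optimal-value function over the options $\Omega$. Bacon et al. (2017) derived the gradient of the expected discounted return with respect to $\theta$ and the initial condition $(s_0, \omega_0)$ as:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a), \tag{5}$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$ is the discounted weighting of a state-option pair. The gradient of the expected discounted return with respect to the option termination parameter $\nu$ and the initial condition $(s_1, \omega_0)$ is described as:

$$\frac{\partial U(\omega_0, s_1)}{\partial \nu} = -\sum_{s', \omega} \mu_\Omega(s', \omega \mid s_1, \omega_0) \frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu} A_\Omega(s', \omega), \tag{6}$$

where $A_\Omega$ is an advantage function, $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$.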
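The call-and-return execution model described in this section can be sketched as a short loop. This is an illustrative toy, assuming a deterministic 5-state corridor and hand-written options; the names `run_call_and_return`, `corridor_step`, `toy_options` and `prefer_right` are hypothetical and nothing here is learned.

```python
import random


def run_call_and_return(start, goal, options, choose_option, step, rng, max_steps=50):
    """Execute options until `goal` is reached or `max_steps` elapse."""
    state, trace = start, []
    option = choose_option(state, rng)
    for _ in range(max_steps):
        action = options[option]["policy"](state, rng)
        trace.append((state, option, action))
        state = step(state, action)
        if state == goal:
            break
        # Call-and-return: keep the current option unless its termination fires.
        if rng.random() < options[option]["beta"](state):
            option = choose_option(state, rng)
    return state, trace


def corridor_step(state, action):
    """Deterministic corridor with states 0..4 (an assumption for the example)."""
    return max(0, min(4, state + (1 if action == "right" else -1)))


toy_options = {
    "go_right": {"policy": lambda s, rng: "right",
                 "beta": lambda s: 1.0 if s == 2 else 0.0},
    "go_left": {"policy": lambda s, rng: "left",
                "beta": lambda s: 1.0 if s == 0 else 0.0},
}


def prefer_right(state, rng):
    """Stand-in for the policy over options; always picks 'go_right'."""
    return "go_right"


final_state, trace = run_call_and_return(
    0, 4, toy_options, prefer_right, corridor_step, random.Random(0))
```

In the option-critic setting, the fixed `policy` and `beta` callables would be replaced by the parameterized intra-option policy and termination function that the gradients in this section update.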

3. Safe Option-Critic Model

Taking inspiration from the work of Gehring and Precup (2013), we define controllability as the negation of the variance in the TD error of a state-option-action pair. We use this definition of controllability to introduce the concept of safety in the option-critic architecture, where it aids in measuring the uncertainty about the value of a state-option pair. The higher the variance in the TD error of a state-option pair, the higher the uncertainty in the value of that pair. In safety-critical applications, the agent should learn to eventually avoid such pairs as they induce variability in the return. We optimize for the expected discounted return along with the controllability value of the initial state-option pair. Depending on the nature of the application, one can limit or encourage the agent visiting a state-option pair based on the degree of controllability. Introducing controllability using the TD error lets the method scale linearly with the number of state-option pairs.
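As a rough illustration of this definition, the sketch below keeps a running estimate of the squared TD error per key and reports its negation as controllability. The class name, the exponential moving average, and the step size are assumptions made for the example; the source only states that controllability is the negated variance of the TD error, whose mean is taken to be zero.

```python
from collections import defaultdict


class ControllabilityEstimator:
    """Tracks -E[delta^2] per key with an exponential moving average."""

    def __init__(self, step_size=0.1):
        self.step_size = step_size          # illustrative value
        self.sq_td = defaultdict(float)     # running estimate of E[delta^2]

    def update(self, key, td_error):
        self.sq_td[key] += self.step_size * (td_error ** 2 - self.sq_td[key])
        return self.value(key)

    def value(self, key):
        return -self.sq_td[key]


estimator = ControllabilityEstimator()
# Hypothetical TD errors: one pair with noisy errors, one with none.
for delta in (1.0, -1.0, 1.0, -1.0):
    estimator.update(("frozen_state", "option_0"), delta)
    estimator.update(("normal_state", "option_0"), 0.0)
```

One table entry per state-option pair is enough, which matches the linear scaling noted above.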

Continuing with the notation used in Bacon et al. (2017), we introduce a parameter vector composed of $\theta$, an intra-option policy parameter, and $\nu$, an option termination parameter. We assume that an option can be initialized from any state $s \in \mathcal{S}$. Given a state-option pair, uncertainty in its value is measured by the controllability $C$, which is given by the negation of the variance in its TD error $\delta$. The expected value of the TD error would converge to zero, so the variance reduces to the second moment and controllability is written as:

$$C = -\mathbb{E}\left[\delta^2\right], \tag{7}$$

where the expectation is taken for the given state-option pair.
From now onwards, we would refer as whose value is given by:


where and are defined in (3) and (4) respectively. The aim here is to maximize the expected discounted return along with the controllability criterion of a state-option pair. We call this objective , where we want to:


where acts as a regularizer for the controllability and is the initial state-option pair distribution. The value of a state-option pair is defined as . The above objective can also be interpreted as a constrained optimization problem with an additional constraint on the controllability function. We will now derive the gradient of the performance evaluator with respect to the intra-option policy parameter assuming they are differentiable. First we will take the gradient of with . Following from (7):


where the gradient of TD error w.r.t. using (8):


Next, the gradient of w.r.t. is:


The gradient of w.r.t. following from (9), (10), (11) and (12) is reduced to:


where the gradient of using (3) is:


and the gradient of using (4) is:


Substituting the above gradient value of and from (14) and (15) in (13), the gradient of w.r.t. becomes:


where . (Bacon et al., 2017) derived the gradient of as:


On expanding the gradient of as in (17), the gradient of following (16) becomes:


Here, corresponds to the initial state-option pair. The gradient of here describes that each option aims to maximize its own reward with controllability as a constraint pertaining to that option only. Our interpretation here is that each option learnt with this safety constraint translates to an overall risk-averse behavior.

Now we will compute the gradient of with respect to the option termination function parameter . The gradient of controllability with can be written following (7) and (8) as:


where . The gradient of w.r.t is written as:


Using (19) and (20) the gradient of w.r.t. is:


Therefore, the gradient of the objective w.r.t. the termination parameter reduces to the Termination Gradient Theorem of Bacon et al. (2017) in (6). This agrees with the way the notion of safety has been conceptualized: each option is responsible for making its intra-option policy safe by incorporating controllability. We use the one-step TD error, i.e. TD(0), while updating the Q value of a state-option pair. Under the assumption that each option takes care of its own safety through its intra-option policy, terminating an option is only concerned with choosing an option which maximizes the expected discounted return from the next state-option pair. As the derivation above shows, introducing controllability therefore does not impact the termination of an option. Algorithm 1 shows the implementation details of controllability in the option-critic architecture in a tabular setting.

  Here stands for step size of critic, intra-option policy and termination respectively. is controllability regularization parameter.
  Select using softmax policy over options
  Let initial be
      using softmax intra-option policy
     Let initial taken at be
     Maintain at the beginning of the episode
     if  is non-terminal state then
     end if
     if  then
     end if
     if  terminates then
        Choose new
     end if
  until  is a terminal state
Algorithm 1 Safe Option-Critic with tabular intra-option Q learning
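Since the update lines of Algorithm 1 did not survive in this copy, the following Python sketch shows one plausible tabular step consistent with the surrounding text. It is not the authors' algorithm. It assumes a softmax intra-option policy, a sigmoid termination function, a greedy value over options, a single-sample estimate $-\delta^2$ of controllability, and an actor signal of the form $Q_U + \psi C$, where `psi` is the name used here for the controllability regularizer. All names and default values are hypothetical.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_tables(n_states, n_options, n_actions):
    def zeros3():
        return [[[0.0] * n_actions for _ in range(n_options)] for _ in range(n_states)]
    return {"q_u": zeros3(), "theta": zeros3(),
            "nu": [[0.0] * n_options for _ in range(n_states)]}


def option_value(tables, state, option):
    """Value of a state-option pair: sum_a pi(a|s) * Q_U(s, option, a)."""
    pi = softmax(tables["theta"][state][option])
    return sum(p * q for p, q in zip(pi, tables["q_u"][state][option]))


def safe_oc_step(tables, s, w, a, r, s_next, done,
                 alpha_q=0.5, alpha_theta=0.1, alpha_nu=0.1, psi=0.1, gamma=0.99):
    """One assumed safe option-critic update; returns the TD error."""
    q_u, theta, nu = tables["q_u"], tables["theta"], tables["nu"]
    target = r
    if not done:
        beta = sigmoid(nu[s_next][w])
        q_next = [option_value(tables, s_next, o) for o in range(len(nu[s_next]))]
        v_next = max(q_next)  # assumption: greedy value over options
        target += gamma * ((1.0 - beta) * q_next[w] + beta * v_next)
        # Termination update: unchanged by controllability (advantage-based).
        nu[s_next][w] -= alpha_nu * beta * (1.0 - beta) * (q_next[w] - v_next)
    delta = target - q_u[s][w][a]
    controllability = -(delta ** 2)  # assumption: single-sample estimate
    # Intra-option policy update with the extra controllability term.
    pi = softmax(theta[s][w])
    signal = q_u[s][w][a] + psi * controllability
    for b in range(len(pi)):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][w][b] += alpha_theta * grad_log_pi * signal
    # Critic update.
    q_u[s][w][a] += alpha_q * delta
    return delta


# Same surprising transition, without and with the controllability term.
plain, safe = make_tables(2, 2, 2), make_tables(2, 2, 2)
safe_oc_step(plain, 0, 0, 0, r=2.0, s_next=1, done=True, psi=0.0)
safe_oc_step(safe, 0, 0, 0, r=2.0, s_next=1, done=True, psi=1.0)
```

In this toy call, the large TD error lowers the preference for the taken action only when `psi` is non-zero, which is the qualitative effect the section describes: the penalty enters the intra-option policy and leaves the termination update untouched.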

4. Experiments

4.1. Grid World

First, we consider a simple navigation task in a two dimensional grid environment using a variant of the four-rooms domain as described in Sutton et al. (1999). As seen in the Fig. 1, similar to Gehring and Precup (2013), we define some slippery frozen states in the environment which are unsafe to visit. We accomplish this by introducing variability in their rewards. States labeled F and G indicate the frozen and goal states respectively.

Figure 1. Four Room Environment: F and G depict the unsafe frozen states and the goal state respectively. The lightest color represents the normal states whereas the darkest color shows the wall.

An agent can be initialized with any random start state in the environment apart from the goal state. The action space consists of four stochastic actions namely, up, down, left, and right. The random actions are taken with probability in the environment. The task is to navigate through the rooms to a fixed goal state as depicted in Fig. 1. The dark states in Fig. 1 depict the walls. The agent remains in the same state with a reward of if the agent hits the wall. A reward of and is given to the agent while transitioning into the normal and the goal state respectively. The rewards for the unsafe states are drawn uniformly from while the agent transitions to a slippery state. The expected value of the reward for the normal and the slippery states is kept same.
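The reward structure described above can be sketched as follows. The numeric values in this copy of the text are missing, so `NORMAL_REWARD`, `GOAL_REWARD` and `FROZEN_SPREAD` below are hypothetical placeholders, and the wall penalty is omitted. The only property taken from the text is that frozen states have the same expected reward as normal states but a non-zero variance.

```python
import random

NORMAL_REWARD = 0.0    # hypothetical placeholder value
GOAL_REWARD = 50.0     # hypothetical placeholder value
FROZEN_SPREAD = 15.0   # hypothetical half-width of the uniform range


def sample_reward(cell, rng):
    """Reward for transitioning into a cell labeled 'G', 'F', or anything else."""
    if cell == "G":
        return GOAL_REWARD
    if cell == "F":
        # Same mean as a normal state, but drawn uniformly around it.
        return rng.uniform(NORMAL_REWARD - FROZEN_SPREAD, NORMAL_REWARD + FROZEN_SPREAD)
    return NORMAL_REWARD


rng = random.Random(0)
frozen_samples = [sample_reward("F", rng) for _ in range(20000)]
frozen_mean = sum(frozen_samples) / len(frozen_samples)
```

Because the means match, an agent that only maximizes expected return has no reason to avoid the frozen cells; only a variance-sensitive term can separate the two kinds of state.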

In the safe option-critic framework, we learn both the policy over options and the intra-option policies with the Boltzmann distribution. We ran the experiments with varying controllability factor for learning 4 options. We optimize for the hyperparameters: temperature and for both Option-Critic (OC) with and safe OC. The discount factor is set to . The step size of the intra-option policy is set to . The best performance is achieved for with the step size of termination and critic as and respectively. The optimal value of controllability was achieved at with the step size of termination and critic at and respectively. The temperature for the Boltzmann distribution is set to . The results are achieved with total of episodes averaged over trials where training in each trial starts from scratch. In each episode, the agent is allowed to take only steps; if the agent fails to reach the goal state within those steps, the episode terminates.
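For readers unfamiliar with Boltzmann (softmax) policies, the sketch below shows how the temperature controls how sharply the policy favors higher values. The preference values and temperatures are made up for illustration and are not the settings used in the experiments.

```python
import math
import random


def boltzmann_probs(values, temperature):
    """Softmax over values / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_boltzmann(values, temperature, rng):
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs)[0]


example_values = [1.0, 2.0, 0.5]          # hypothetical option or action values
sharp = boltzmann_probs(example_values, 0.1)   # low temperature: near-greedy
flat = boltzmann_probs(example_values, 10.0)   # high temperature: near-uniform
chosen = sample_boltzmann(example_values, 1.0, random.Random(0))
```

A low temperature makes the policy almost greedy, while a high temperature keeps it close to uniform, which is why the temperature is tuned alongside the step sizes.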

Figure 2. Learning curve with 4 options in Four Room Environment: The graph depicts the return averaged over 200 trials in the four room environment with 4 options. The bands around the solid lines represent the standard deviation of the return. The experiment with controllability has a smaller standard deviation in the observed return than the one without controllability.

To evaluate these experiments, we consider the following metrics: the learned policy, average cumulative discounted return of episodes and the density of the state visits. It can be observed from Fig. 2 that the options with the controllability (Safe-OC) have lower variance in the return of an episode as compared to the options without the controllability (OC). This highlights the fact that the controllability helps the agent in avoiding the unsafe states (inducing variability in the return value). To validate that learning with the controllability causes fewer visits to the unsafe state, we visualize the state frequency graph depicted in the Fig. 3. It is observed that the options with the controllability have lower frequency of visit to the unsafe states as opposed to the vanilla options.

Learning safe options brings transparency to the behavior of an agent. This is most explicitly demonstrated through the path taken by the agent with and without controllability in the options, as shown in Fig. 4. Regardless of the start state, the Safe-OC agent navigates to the goal state while avoiding the states with a high variance in the reward, as opposed to the OC agent, which finds a shortest route while being unaware of the error-prone states.

(a) OC ()
(b) Safe-OC ()
Figure 3. State frequency in Four Room Environment: Density graph represents the number of times a state was visited during testing over 80 trials. Darker shade represents the higher density. a) Model without safety has equally likely density for both the hallways. b) Model with safety shows higher density for the path without the frozen states.
(a) OC Policy ()
(b) OC Policy ()
(c) Safe-OC Policy()
(d) Safe-OC Policy()
Figure 4. Policy in Four Room Environment: Policy learned with 4 options where and represents the start & goal state. denotes 4 actions agent takes according to the learned policy; might take different actions due to environment stochasticity. Change in color represents the option switching. Same color represents the same option. The and states are depicted with red color. Light blue patch represents the frozen states. a) & b) shows the policy with passing through the frozen area. c) & d) depicts policy learnt with avoiding the frozen area due to the inbuilt safety constraint.

4.2. CartPole Environment

We consider linear function approximation with the options. In the CartPole environment, a pole is attached to a cart which can move along the horizontal axis. The environment has four continuous features: position, velocity, pole angle and angular velocity of the pole. There are two discrete actions that can be taken, namely left or right. In the environment, a reward of + is achieved as long as the pole is maintained upright between a certain angle and a position. The discount factor is set to .

The experiment is conducted with 4 options. We use intra-option Q-learning in the critic for learning the policy over options. The Boltzmann distribution was used for learning both the intra-option policies and the policy over options. The linear-sigmoid function was used for the termination of options. The hyperparameters were fine-tuned using a grid search over the parameter space. The optimal performance was obtained with the step size being set to for termination, intra-option and critic. The temperature for the Boltzmann distribution was set to . Sutton and Barto's (1998) open source tile coding implementation is used for discretization of the state space. Ten dimensional features (joint space of continuous features) are used to represent the state space. The continuous features position, velocity and pole angle were discretized into bins and the angular velocity was discretized into bins.

Fig. 5 shows the averaged return over 50 trials with the different degrees of controllability . The best performance is achieved with . The figure shows that with the right degree of controllability, the variance in the return reduces and learning becomes faster in terms of the mean return score. The controllability helps in the identification of the features which lead to the consistent behavior of the agent, thus learning to avoid state-action pairs which might lead the cart pole to topple. The code for the experiments in the grid world and the cartpole environment is available on GitHub.

Figure 5. Learning curve for 4 options in Cart Pole Environment: Results are averaged over 50 trials. The band around the solid horizontal lines represents the standard deviation of the return. performs the best in case of 4 options.

4.3. Arcade Learning Environment

In this section, we discuss our experiments in the ALE domain. Recent work in learning options introduced a deliberation cost Harb et al. (2017) in the option-critic framework Bacon et al. (2017). The deliberation cost could be interpreted as a penalty for terminating an option, thereby leading to temporally extended options. We use the asynchronous advantage option-critic (A2OC) Harb et al. (2017) algorithm as our baseline for learning the ‘safe’ options with the non-linear function approximation. Within the option-critic architecture, A2OC works in a similar fashion as the asynchronous advantage actor-critic (A3C) algorithm Mnih et al. (2016).

Introducing controllability in the A2OC algorithm results in an additional term to the intra-option policy gradient alone as shown in Equation (18). Our update rule for the intra-option policy gradient in the A2OC with controllability setting thus becomes:


Here the return is a mixture of $n$-step returns similar to A2OC, with the difference that we consider this return only for the duration in which an option persisted. Without loss of generality, the one-step TD error in the definition of controllability can be substituted with the $n$-step TD error only if the same option has continued up until the $n$-th step. Similarly, as discussed in Equation (21), there is no change in the termination gradient and we use the same update rule as derived in the A2OC algorithm, which includes the deliberation cost.


We use primarily three games, MsPacman, Amidar and Q*Bert, from the ATARI 2600 suite to test our Safe-A2OC algorithm and analyze its performance. We introduce Safe-A2OC, built using the same deep network architecture as A2OC, wherein the policy over options is $\epsilon$-greedy, the intra-option policies are linear softmax functions, and the termination functions use sigmoid activation functions, along with linear function approximation for the Q values. For hyperparameters, we learn 4 options, with a fixed deliberation cost of , margin cost of , step size of , and entropy regularization of for varying degrees of controllability () and . The training used parallel threads for all our experiments. We optimized the parameter for no controllability (). For a fair analysis, we compare the best performance of A2OC against different degrees of the controllability parameter with Safe-A2OC.

Results and Evaluation: To evaluate the performances, we use two metrics, namely the learning curves Machado et al. (2017) and the average performance over k games. Figures 6, 7 and 8 show the learning curves over 80M frames with varying controllability parameter. It is observed that for specific degrees of controllability, options learned with our notion of safety (Safe-A2OC) outperform the vanilla options (A2OC). It is important to note that the different values of control the degree to which an agent would be risk averse. A grid search over the different degrees of the controllability hyperparameter resulted in a narrow range of to . For a very high value of , we observe that the agents become extremely risk-averse, resulting in poor performance. An optimum value of for all the three games is obtained around . We present the videos of some of these trained agents as qualitative results in the supplementary material. Upon visual inspection of the trained Safe-A2OC agent, we observe that explicitly optimizing for the variance in TD error results in the agent learning to avoid states with higher variance in TD error. For instance, in MsPacman, the acquisition of the corner diamonds provides the intrinsic variability in the reward structure. Our objective function helps the agent understand such intrinsic variability in reward, thus boosting the overall performance.

Figure 6. Learning curve for 4 options in MsPacman: Options with a controllability factor of learn better than the best performing scenario of no controllability (, ). Higher degrees of results in poor performance.

Figure 7. Learning curve for 4 options in Amidar: Options with a controllability factor of outperform vanilla options (, ). Higher degrees of controllability () result in reduced exploration and adversely affect the performance.

Figure 8. Learning curve for 4 options in Q*bert: Options with a controllability factor of see better average performance in the learning than the one with no controllability (, ).
Algorithm MsPacman Amidar Q*Bert
Double DQN
, 17642.0 (3346.85)
, 2710.9 (598.69) 925.43 (211.52)
Table 1. ALE Final Scores: Average Performance over games once training is completed after frames. Scores in boxes highlight the performance with no controllability, whereas aqua highlighted cells indicate the benefits of introducing our notion of safety in learning end-to-end options. Introducing controllability in options outperforms the best performances of primitive actions in two out of three games analyzed here. Learning options with our notion of safety outperforms vanilla A2OC in all three games. A3C scores have been taken from Mnih et al. (2016), DQN from Nair et al. (2015), Double DQN from Van Hasselt et al. (2016), and Dueling from Wang et al. (2015). represents the degree of controllability. Values in brackets indicate standard deviation across games.

The trained agents are then tested for their averaged performance across games as shown in the Table 1. Safe-A2OC with a controllability value of in Q*Bert and in MsPacman and Amidar outperforms the score achieved by A2OC. In MsPacman and Amidar, Safe-A2OC also outperforms the other state-of-the-art approaches Mnih et al. (2016); Nair et al. (2015); Van Hasselt et al. (2016); Wang et al. (2015) using the primitive actions. Empirical effects of introducing the right degree of controllability in options demonstrates that an agent which additionally optimizes for low variance in the TD errors learns better than the one optimizing only for the cumulative reward. The intuition here is that using variance in the TD error as a measure of safety in hierarchical RL helps the agents avoid states with high intrinsic variability. Depending on the nature of the game itself, we observe different degrees of response to different levels of controllability in Q*bert, Amidar, and MsPacman.

5. Discussion

In this work, we introduce a new framework called Safe Option-Critic wherein we define safety in learning end-to-end options. We extend the idea of controllability, based on the temporal difference error, from the primitive action space to the option-critic architecture for incorporating safety. The underlying idea of this learning process is to discourage the agent from visiting harmful or undesirable state-option pairs by constraining the variance in the TD error. Recent work by Sherstan et al. (2018) proposed a direct method to calculate the variance of the return instead of the traditional indirect approaches which use the second order moment. The authors proposed a Bellman operator which uses the square of the TD error to measure the variance of return. This work further supports our approach of estimating the risk through the square of the TD error.

Our experiments in the tabular methods empirically demonstrate the reduced variance in the return. Moreover, we observe a boost in the overall performance in both the tabular and the linear approximation methods. Furthermore, experiments in the ALE domain demonstrate that an RL agent was able to learn about the intrinsic variability in a large and complicated state-space such as images with non-linear function approximation. Results from ALE also demonstrated that the options with the notion of safety outperform the algorithms using the primitive actions.

Limitations and Future Work: In this work, we limit the return calculation until an option terminates. Using the n-step returns during the intermediate switching of the options at the SMDP level is of potential interest for future work. Additionally, it is currently assumed that all the options are available in all the states. In the context of safety, it might be of interest to understand what happens if the option initiation sets were limited to a subset of the entire state space. One could also vary the controllability regularizer during training, starting from a low value to support exploration in the beginning and gradually increasing it to limit the exploration of unsafe states.
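The idea of varying the controllability regularizer over training can be illustrated with a simple linear schedule. This is only a sketch of one possible schedule; the function name `psi_schedule`, the linear shape, and the start and end values are assumptions, not something evaluated in this work.

```python
def psi_schedule(episode, total_episodes, psi_start=0.0, psi_end=0.1):
    """Linearly increase the controllability regularizer over training.

    psi_start and psi_end are hypothetical values chosen for illustration.
    """
    if total_episodes <= 1:
        return psi_end
    fraction = min(max(episode / (total_episodes - 1), 0.0), 1.0)
    return psi_start + fraction * (psi_end - psi_start)


# Regularizer value at each of five hypothetical training episodes.
schedule = [psi_schedule(e, 5) for e in range(5)]
```

Early episodes then behave like the unregularized agent and explore freely, while later episodes are increasingly penalized for visiting state-option pairs with high TD-error variance.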

A potential direction of future work is the extension of controllability to more than just the initial state-option pair. One could extend the definition of controllability to all the state-option pairs in the trajectory which could potentially expedite the effects of the risk mitigation. The proposed notion of safety could also be extended to different levels of hierarchy in the framework. For instance, a mixture of options with varying degrees of controllability can be learned, wherein at policy over the options level, one could select an option based on how much controllability is desirable for a subset of an environment. The intra-option policy could still retain the current formalizations.

The authors would like to thank their colleagues Herke van Hoof, Pierre-Luc Bacon, Jean Harb, Ayush Jain and Martin Klissarov for their useful comments and discussions throughout the duration of this work. The authors would also like to thank Open Philanthropy for funding this work, and Compute Canada for the computing resources.