1. Introduction
Safety in Artificial Intelligence (AI) can be viewed from many perspectives. Traditionally, introducing some form of risk-awareness in AI systems has been a prime way of defining safety in machines. More recently, researchers have broadened the horizon of safety in AI to address different sources of errors and faulty behaviors Amodei et al. (2016). The Asilomar AI principles Future of Life Institute (2017) comprise varied aspects of safety such as risk-averseness, transparency, robustness and fairness, as well as the legal and ethical values an agent should hold. In this work, we refer to the following definition of safety: preventing undesirable behavior, in particular reducing visits to undesirable states during the learning process in reinforcement learning (RL).
RL agents primarily learn by optimizing their discounted cumulative rewards Sutton and Barto (1998). While rewards are a good indicator of how to behave, they do not always lead to the most desired behavior. Optimal reward design Sorg et al. (2010) still poses a challenge for algorithm designers, with issues such as misspecified rewards Amodei and Clark (2016); Hadfield-Menell et al. (2017) and corrupted reward channels Everitt et al. (2017), to name a few. Alternatively, learning with constraints allows us to introduce more clarity into the objective function Altman (1999).
During exploration, agents are naturally unaware of states which may be prone to errors or may lead to catastrophic consequences. Risk-awareness has been introduced in agents by directing exploration safely Law et al. (2005), optimizing worst-case performance Tamar et al. (2013), measuring the probabilities of visiting erroneous states Geibel and Wysotzki (2005), and several other approaches. García and Fernández (2015) present a comprehensive survey covering a broad range of techniques for realizing safety in RL. In a Markov Decision Process (MDP), the majority of methods seek to minimize the variance of the return as a risk mitigation strategy. Many authors Sato et al. (2001); Mihatsch and Neuneier (2002); Tamar et al. (2012); Gehring and Precup (2013); Tamar et al. (2016); Sherstan et al. (2018) have used temporal difference (TD) learning to estimate the variance of the return and thereby capture the notion of uncertainty in the value of a state.
While some of the aforementioned approaches leverage TD learning to estimate errors and risks, all of them define notions of safety in the primitive action space. Temporally abstract actions provide a way to represent information hierarchically. The concept of learning and planning in a hierarchical fashion is close to how humans think about and approach a problem. Temporal abstractions have long been of interest to the AI community Fikes et al. (1981); Iba (1989); Korf (1983); McGovern and Barto (2001); Menache et al. (2002); Barto and Mahadevan (2003). Prior research has shown that temporal abstractions improve exploration, reduce the complexity of action selection, and enhance robustness to misspecified models. The options framework Sutton et al. (1999); Precup (2000) provides an intuitive way to plan, reason and act in a continual fashion, as opposed to learning with primitive actions alone. Many authors Stolle and Precup (2002); Daniel et al. (2016); Konidaris and Barto (2007); Konidaris et al. (2011); Kulkarni et al. (2016); Vezhnevets et al. (2016); Mankowitz et al. (2016) provide methods for discovering subgoals and then learning policies to achieve those subgoals.
The option-critic framework Bacon et al. (2017) enables end-to-end learning of options. However, defining a safe option which does not lead to erroneous states during the learning process remains an open question. We introduce the idea of controllability Gehring and Precup (2013) into the options framework as an additional condition in the optimality criterion, constraining the variance of the TD error as a measure of uncertainty about the value of a state-option pair. In this work, we propose a new framework, the safe option-critic, for learning safe options.
Key Contributions: This work incorporates the notion of safety into the option-critic framework and presents a mechanism to automatically learn safe options. We derive the policy-gradient theorem for the safe option-critic framework using constraint-based optimization. We then demonstrate, through experiments in the four-rooms grid environment, that learning options with controllability (a term quantifying the controllable behavior of an agent) results in safer policies which avoid states with high variance in the TD error. Empirically, we show the benefits of learning safe options in ALE environments with high intrinsic variability in the rewards. Our approach outperforms vanilla options with no notion of safety in the Atari games MsPacman, Amidar and Q*Bert. In two out of the three games, learning safe options also outperforms the primitive actions. To this end, we propose the novel Safe Option-Critic framework for future research in the AI safety paradigm.
2. Preliminaries
In RL, an agent interacts with the environment at discrete time steps t, where it observes a state s_t ∈ S. The agent then chooses an action a_t ∈ A from a policy π, which defines a probability distribution over actions for each state. After choosing an action, the agent transitions to a new state s_{t+1} according to the transition function P(s_{t+1} | s_t, a_t) and receives a reward r_{t+1}, where the reward function is defined as r : S × A → ℝ. An MDP is defined by a tuple (S, A, P, r, γ), where γ ∈ [0, 1) is a discount factor. A discounted state-action value function is defined as Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a ], with actions chosen according to π. The value of Q can be learned incrementally using one-step TD learning, also written as TD(0), which is a special case of TD(λ) Sutton (1988). The state-action value is updated using the equation Q(s_t, a_t) ← Q(s_t, a_t) + α δ_t. Here α is the step size and δ_t is the TD(0) error, defined as δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t). The policy gradient theorem Sutton et al. (2000) presents a way of updating a parameterized policy π_θ according to the gradient of the expected discounted return ρ(θ, s_0). The gradient with respect to the policy parameters θ is given as:

(1) ∂ρ(θ, s_0)/∂θ = Σ_s d^{π_θ}(s | s_0) Σ_a (∂π_θ(a | s)/∂θ) Q^{π_θ}(s, a)

where d^{π_θ}(s | s_0) is the discounted weighting of the states with starting state s_0.
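The TD(0) update described above can be sketched in a few lines of tabular code. The dictionary-based Q table, the state and action names, and the hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
# Minimal tabular TD(0) update for Q(s, a):
#   delta = r + gamma * Q(s', a') - Q(s, a)
#   Q(s, a) <- Q(s, a) + alpha * delta
# Q is a dict mapping (state, action) pairs to values; alpha and gamma
# are illustrative defaults.
def td0_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta
```

Repeating this update along trajectories sampled from the policy makes Q converge toward the discounted state-action value function under the usual step-size conditions.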
2.1. Options
The options framework Sutton et al. (1999); Precup (2000) facilitates incorporating temporally abstract knowledge into RL with no change to the existing setup. An option ω is defined as a tuple (I_ω, π_ω, β_ω), where I_ω is the initiation set containing the states from which the option can start, π_ω is the option policy defining a distribution over actions given a state, and β_ω is the termination condition of the option, defined as the probability of terminating in a state. An example of options could be high-level subgoals like going to a market, buying vegetables and making a dish, whereas the primitive actions could be, for instance, muscle twitches.
In the case of Markov options, the intra-option Bellman equation Sutton et al. (1999) provides an off-policy method for updating the Q value of a state-option pair, which can be written as:

(2) Q(s_t, o) ← Q(s_t, o) + α [ r_{t+1} + γ ( (1 − β_o(s_{t+1})) Q(s_{t+1}, o) + β_o(s_{t+1}) Q(s_{t+1}, o') ) − Q(s_t, o) ]

where o' is selected from the policy over options μ.
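A minimal tabular sketch of the intra-option update in (2) follows. For simplicity the switching value is taken greedily over options, which is an assumption rather than the on-policy selection described above; the Q table, the beta functions and the hyperparameters are likewise illustrative.

```python
# One intra-option Q update: with probability 1 - beta_o(s') the option
# continues, otherwise the agent switches (here, greedily) to another option.
# Q maps (state, option) pairs to values; beta maps options to callables.
def intra_option_update(Q, beta, s, o, r, s_next, options, alpha=0.25, gamma=0.99):
    continue_q = Q[(s_next, o)]
    switch_q = max(Q[(s_next, op)] for op in options)   # greedy simplification
    u = (1 - beta[o](s_next)) * continue_q + beta[o](s_next) * switch_q
    delta = r + gamma * u - Q[(s, o)]
    Q[(s, o)] += alpha * delta
    return delta
```

Because the update only needs the experienced transition, it can be applied off-policy to every option consistent with the action taken, which is the appeal of intra-option learning.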
2.2. Learning Options
The intra-option value learning of Sutton et al. (1999) lays the foundation for learning options in the option-critic architecture (Bacon et al., 2017), a policy-gradient based method for learning the intra-option policies and the termination conditions of the options. Bacon et al. (2017) considered the call-and-return option execution model, where an option o is chosen according to the policy over options μ(o | s), and the intra-option policy π_{o,θ} is followed until the termination condition β_{o,ϑ} is met. Once the current option terminates, another option to be executed at that state is selected in the same fashion. π_{o,θ} denotes the intra-option policy parameterized by θ, and β_{o,ϑ} represents the option termination parameterized by ϑ. The value of executing an action a at a particular state-option pair (s, o) is then given by Q_U(s, o, a), where

(3) Q_U(s, o, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) U(o, s')

and Q_Ω(s, o) represents the value of executing an option o at a state s:

(4) Q_Ω(s, o) = Σ_a π_{o,θ}(a | s) Q_U(s, o, a)

Here, U(o, s') represents the value of executing option o upon arrival at state s', given by U(o, s') = (1 − β_{o,ϑ}(s')) Q_Ω(s', o) + β_{o,ϑ}(s') V_Ω(s'), and V_Ω(s) represents the value function over options, given by V_Ω(s) = Σ_o μ(o | s) Q_Ω(s, o). Bacon et al. (2017) derived the gradient of the discounted return with respect to θ and the initial condition (s_0, o_0) as:

(5) ∂Q_Ω(s_0, o_0)/∂θ = Σ_{s,o} μ_Ω(s, o | s_0, o_0) Σ_a (∂π_{o,θ}(a | s)/∂θ) Q_U(s, o, a)

where μ_Ω(s, o | s_0, o_0) is the discounted weighting of state-option pairs starting from (s_0, o_0). The gradient of the expected discounted return with respect to the option termination parameter ϑ and the initial condition (s_1, o_0) is described as:

(6) ∂Q_Ω(s_1, o_0)/∂ϑ = − Σ_{s',o} μ_Ω(s', o | s_1, o_0) (∂β_{o,ϑ}(s')/∂ϑ) A_Ω(s', o)

where A_Ω(s', o) = Q_Ω(s', o) − V_Ω(s') is the advantage function over options.
3. Safe OptionCritic Model
Taking inspiration from the work of Gehring and Precup (2013), we define controllability as the negation of the variance in the TD error of a state-option pair. We use this definition of controllability to introduce a concept of safety into the option-critic architecture, which aids in measuring the uncertainty about the value of a state-option pair. The higher the variance in the TD error of a state-option pair, the higher the uncertainty in the value of that pair. In safety-critical applications, the agent should learn to eventually avoid such pairs, as they induce variability in the return. We optimize for the expected discounted return along with the controllability of the initial state-option pair. Depending on the nature of the application, one can limit or encourage visits to a state-option pair based on its degree of controllability. Introducing controllability through the TD error allows the method to scale linearly with the number of state-option pairs.
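As a rough illustration of the quantity being discussed, the negated variance of the TD error for a state-option pair can be tracked with a simple running average. Since the expected TD error tends to zero, the variance is approximated by the mean squared TD error; the exponential-average form and the step size below are assumptions for illustration.

```python
# Incremental estimate of controllability c(s, o) ~= -E[delta(s, o)^2].
# c is a dict mapping (state, option) pairs to the current estimate; the
# exponential moving average and its step size are illustrative choices.
def update_controllability(c, s, o, delta, step=0.1):
    c[(s, o)] += step * (-delta ** 2 - c[(s, o)])
    return c[(s, o)]
```

A state-option pair whose TD errors fluctuate strongly accumulates a very negative controllability estimate, flagging it as uncertain.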
Continuing with the notation used in Bacon et al. (2017), we introduce a parameter vector Θ = [θ, ϑ], where θ is the intra-option policy parameter and ϑ is the option termination parameter. We assume that an option can be initiated from any state s ∈ S. Given a state-option pair, the uncertainty in its value is measured by the controllability c(s, o), which is given by the negation of the variance in its TD error δ(s, o). The expected value of the TD error converges to zero, hence the controllability is written as:

(7) c(s, o) = − Var[δ(s, o)] = − E[δ(s, o)²]
From now onwards, we refer to δ(s_t, o) as δ_t, whose value is given by:

(8) δ_t = r_{t+1} + γ [ (1 − β_{o,ϑ}(s_{t+1})) Q_Ω(s_{t+1}, o) + β_{o,ϑ}(s_{t+1}) V_Ω(s_{t+1}) ] − Q_Ω(s_t, o)
where Q_Ω and V_Ω follow from the definitions in (3) and (4). The aim here is to maximize the expected discounted return along with the controllability criterion of a state-option pair. We call this objective J, where we want to:

(9) max_Θ J = E_{(s_0, o_0) ∼ d_0} [ Q_Ω(s_0, o_0) + ψ c(s_0, o_0) ]

where ψ acts as a regularizer for the controllability and d_0 is the initial state-option pair distribution. The value of a state-option pair is defined as Q_Ω(s, o) = E[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, o_0 = o ]. The above objective can also be interpreted as a constrained optimization problem with an additional constraint on the controllability function. We will now derive the gradient of the objective with respect to the intra-option policy parameter θ, assuming the policies are differentiable. First, we take the gradient of the controllability with respect to θ. Following from (7):
(10) ∂c(s_0, o_0)/∂θ = − ∂E[δ(s_0, o_0)²]/∂θ = − E[ 2 δ(s_0, o_0) (∂δ(s_0, o_0)/∂θ) ]
where the gradient of the TD error with respect to θ, using (8), is:
(11) 
Next, the gradient of the expected discounted return with respect to θ is:
(12) 
The gradient of the objective with respect to θ, following from (9), (10), (11) and (12), reduces to:
(13) 
where the gradient of Q_U, using (3), is:
(14) 
and the gradient of Q_Ω, using (4), is:
(15) 
Substituting the gradient values from (14) and (15) into (13), the gradient of the objective with respect to θ becomes:
(16) 
Bacon et al. (2017) derived this gradient as:
(17) 
Expanding the gradient as in (17), the gradient of the objective following (16) becomes:
(18) 
Here, (s_0, o_0) corresponds to the initial state-option pair. The gradient of the objective shows that each option aims to maximize its own reward, with controllability as a constraint pertaining to that option only. Our interpretation is that each option learned with this safety constraint translates to an overall risk-averse behavior.
Now we compute the gradient of the objective with respect to the option termination parameter ϑ. The gradient of the controllability with respect to ϑ can be written, following (7) and (8), as:
(19) 
The gradient of the expected discounted return with respect to ϑ is written as:
(20) 
Using (19) and (20), the gradient of the objective with respect to ϑ is:
(21) 
Therefore, the gradient of the objective with respect to ϑ becomes equal to that of the expected discounted return, which matches the Termination Gradient Theorem of Bacon et al. (2017) in (6). The interpretation of the derivation is in accordance with the way the notion of safety has been conceptualized: each option is responsible for making its intra-option policy safe by incorporating the controllability factor. We use one-step TD, i.e. TD(0), when updating the Q value of a state-option pair. Because each option takes care of its own safety through its intra-option policy, when terminating an option one is only concerned with choosing the next option that maximizes the expected discounted return from the next state-option pair. As the derivation above shows, introducing controllability therefore does not impact the termination of an option. Algorithm 1 shows the implementation details of controllability in the option-critic architecture in a tabular setting.
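To make the tabular procedure concrete, here is a schematic actor update for a Boltzmann intra-option policy. Weighting the score function by (Q_U − ψ·δ²) is an illustrative simplification of the controllability-regularized gradient, not the paper's exact update rule; all names and hyperparameters are assumptions.

```python
import math

# Boltzmann (softmax) intra-option policy over a small discrete action set.
# theta maps (state, option, action) tuples to preferences.
def boltzmann(theta, s, o, actions, temp=1.0):
    prefs = [theta[(s, o, a)] / temp for a in actions]
    m = max(prefs)                       # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Schematic safe actor step: the usual score-function update, with the
# action value penalized by psi * delta^2 (an illustrative simplification
# of the controllability term, not the paper's exact rule).
def safe_actor_update(theta, s, o, a, actions, q_u, delta, psi, lr=0.01, temp=1.0):
    probs = boltzmann(theta, s, o, actions, temp)
    weight = q_u - psi * delta ** 2
    for ai, act in enumerate(actions):
        grad_log = ((1.0 if act == a else 0.0) - probs[ai]) / temp
        theta[(s, o, act)] += lr * weight * grad_log
```

With psi = 0 this reduces to a plain option-critic style actor step; increasing psi shrinks the effective advantage of actions taken in high-TD-variance situations.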
4. Experiments
4.1. Grid World
First, we consider a simple navigation task in a two-dimensional grid environment, using a variant of the four-rooms domain described in Sutton et al. (1999). As seen in Fig. 1, similar to Gehring and Precup (2013), we define slippery frozen states in the environment which are unsafe to visit; we accomplish this by introducing variability in their rewards. States labeled F and G indicate the frozen and goal states respectively.
An agent can be initialized in any random start state in the environment apart from the goal state. The action space consists of four stochastic actions: up, down, left, and right. Random actions are taken with some fixed probability. The task is to navigate through the rooms to a fixed goal state, as depicted in Fig. 1. The dark states in Fig. 1 depict walls; the agent remains in the same state, receiving a fixed reward, if it hits a wall. Fixed rewards are also given for transitioning into a normal state and into the goal state. The rewards for the unsafe states are drawn uniformly from an interval when the agent transitions into a slippery state; the expected value of the reward is kept the same for the normal and the slippery states.
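The reward scheme above can be sketched as follows. The numeric values are placeholders (the paper's exact values are not reproduced here); the key property preserved is that frozen states share the normal-state mean reward but have high variance.

```python
import random

# Illustrative placeholder values, not the paper's settings.
NORMAL_REWARD = 0.0
GOAL_REWARD = 1.0
FROZEN_SPREAD = 8.0   # frozen rewards ~ Uniform(mean - spread, mean + spread)

def reward(next_state, frozen_states, goal_state, rng=random):
    if next_state == goal_state:
        return GOAL_REWARD
    if next_state in frozen_states:
        # same expected reward as a normal state, but high variance
        return NORMAL_REWARD + rng.uniform(-FROZEN_SPREAD, FROZEN_SPREAD)
    return NORMAL_REWARD
```

Because the means coincide, a risk-neutral learner sees no difference between normal and frozen states in expectation; only a variance-sensitive objective distinguishes them.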
In the safe option-critic framework, we learn both the policy over options and the intra-option policies with the Boltzmann distribution. We ran the experiments with a varying controllability factor ψ for learning the options. We optimize over the hyperparameters (the Boltzmann temperature and the step sizes of the intra-option policy, termination and critic) for both the baseline Option-Critic (OC) and the safe OC, and report the settings that achieved the best performance for each. The results are averaged over independent trials, with training in each trial starting from scratch. In each episode, the agent is allowed a fixed maximum number of steps; if it fails to reach the goal within those steps, the episode terminates.

To evaluate these experiments, we consider the following metrics: the learned policy, the average cumulative discounted return per episode, and the density of state visits. It can be observed from Fig. 2 that options with controllability (SafeOC) have lower variance in the return of an episode compared to options without controllability (OC). This highlights that controllability helps the agent avoid the unsafe states, which induce variability in the return. To validate that learning with controllability causes fewer visits to the unsafe states, we visualize the state-visit frequencies in Fig. 3. The options with controllability visit the unsafe states less frequently than the vanilla options.
Learning safe options adds transparency to the behavior of the agent. This is most explicitly demonstrated by the paths taken by the agent with and without controllability in the options, as shown in Fig. 4. Regardless of the start state, the SafeOC agent navigates to the goal while avoiding the states with high variance in the reward, whereas the OC agent finds the shortest route, being unaware of the error-prone states.
4.2. CartPole Environment
We consider linear function approximation with the options. In the CartPole environment (https://gym.openai.com/envs/CartPole-v0/), a pole is attached to a cart which can move along the horizontal axis. The environment has four continuous features: position, velocity, pole angle and angular velocity of the pole. There are two discrete actions, left and right. A positive reward is received as long as the pole is kept upright within certain angle and position limits. The discount factor is fixed across runs.
The experiment is conducted with a small number of options. We use intra-option Q-learning in the critic for learning the policy over options. The Boltzmann distribution was used for learning both the intra-option policies and the policy over options, and a linear-sigmoid function was used for the termination of options. The hyperparameters were fine-tuned using a grid search over the parameter space, with separate step sizes for the termination, intra-option policies and critic, and a fixed temperature for the Boltzmann distribution. Sutton and Barto's (1998) open-source tile-coding implementation (http://incompleteideas.net/tiles/tiles3.py) is used for discretization of the state space. Ten-dimensional features (the joint space of the continuous features) are used to represent the state space; the position, velocity, pole angle and angular velocity were each discretized into a number of bins.

Fig. 5 shows the return averaged over trials for different degrees of controllability ψ. The best performance is achieved for a specific degree of controllability. The figure shows that, with the right degree of controllability, the variance in the return reduces, leading to faster learning in terms of the mean return. Controllability helps identify the features which lead to consistent behavior of the agent, thus learning to avoid state-action pairs which might lead the pole to topple. The code for the experiments in the grid world and the CartPole environment is available on GitHub (https://github.com/arushi12130/SafeOptionCritic).
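As a simple stand-in for the tile-coding discretization, uniform binning of a single feature can be sketched as follows; the bounds and bin counts are illustrative assumptions rather than the paper's settings.

```python
# Map a continuous feature value to an integer bin index in [0, n_bins - 1].
# Values outside [low, high] are clipped to the boundary bins.
def discretize(value, low, high, n_bins):
    v = min(max(value, low), high)
    frac = (v - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)
```

Applying this per feature and combining the indices yields a discrete joint state, analogous in spirit to (though much coarser than) tile coding.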
4.3. Arcade Learning Environment
In this section, we discuss our experiments in the ALE domain. Recent work on learning options introduced a deliberation cost Harb et al. (2017) in the option-critic framework Bacon et al. (2017). The deliberation cost can be interpreted as a penalty for terminating an option, thereby leading to temporally extended options. We use the asynchronous advantage option-critic (A2OC) Harb et al. (2017) algorithm as our baseline for learning 'safe' options with nonlinear function approximation. Within the option-critic architecture, A2OC works in a similar fashion to the asynchronous advantage actor-critic (A3C) algorithm Mnih et al. (2016).
Introducing controllability in the A2OC algorithm adds a term to the intra-option policy gradient alone, as shown in Equation (18). Our update rule for the intra-option policy gradient in A2OC with controllability thus becomes:
(22) 
Here G is a mixture of n-step returns, similar to A2OC, with the difference that we consider this return only for the duration an option persists. Without loss of generality, the one-step TD error in the definition of controllability can be substituted with the n-step TD error, provided the same option has continued up to the n-th step. Similarly, as discussed for Equation (21), there is no change in the termination gradient, and we use the same update rule as derived in the A2OC algorithm, including its deliberation cost:
(23) 
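The option-limited n-step return described above can be sketched as follows: rewards are accumulated only while the same option persists, and the return is then bootstrapped from the option's value at the cut-off state. The function name and interface are illustrative.

```python
# n-step return restricted to one option's duration:
#   G = r_1 + gamma * r_2 + ... + gamma^(n-1) * r_n + gamma^n * Q(s_n, o)
# rewards: rewards collected while the option persisted (oldest first);
# bootstrap_q: the state-option value at the state where the option was cut off.
def nstep_return_within_option(rewards, bootstrap_q, gamma=0.99):
    g = bootstrap_q
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Cutting the return at option boundaries is what lets the n-step TD error stand in for the one-step error in the controllability term, as noted above.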
We use three games, MsPacman, Amidar and Q*Bert, from the Atari 2600 suite to test our Safe-A2OC algorithm and analyze its performance. Safe-A2OC (source code: https://github.com/kkhetarpal/safe_a2oc_delib) is built with the same deep network architecture as A2OC, wherein the policy over options is epsilon-greedy, the intra-option policies are linear softmax functions, and the termination functions use sigmoid activations along with linear function approximation for the Q values. For the hyperparameters, we learn a fixed number of options, with a fixed deliberation cost, margin cost, step size and entropy regularization, for varying degrees of controllability ψ. The training used parallel threads for all our experiments. We tuned the parameters for no controllability (ψ = 0); for a fair analysis, we compare the best performance of A2OC against Safe-A2OC with different degrees of the controllability parameter ψ.

Results and Evaluation: To evaluate the performance, we use two metrics: the learning curves Machado et al. (2017) and the average performance over a large number of games. Figures 6, 7 and 8 show the learning curves over 80M frames with a varying controllability parameter. For specific degrees of controllability, options learned with our notion of safety (Safe-A2OC) outperform the vanilla options (A2OC). It is important to note that different values of ψ control the degree to which an agent is risk-averse. A grid search over the controllability hyperparameter resulted in a narrow useful range of ψ. For very high values of ψ, the agents become extremely risk-averse, resulting in poor performance. An optimal value of ψ for all three games is obtained in a similar region. We present videos of some of these trained agents as qualitative results in the supplementary material (https://sites.google.com/view/safeoptioncritic). Upon visual inspection of the trained Safe-A2OC agent, we observe that explicitly optimizing for the variance in the TD error results in the agent learning to avoid states with a higher variance in the TD error. For instance, in MsPacman, the acquisition of the corner diamonds provides intrinsic variability in the reward structure. Our objective function helps the agent understand such intrinsic variability in the reward, thus boosting the overall performance.
Algorithm                          MsPacman            Amidar             Q*Bert
A3C
DQN
Double DQN
Dueling
Safe-A2OC (deliberation cost, ψ)
Safe-A2OC (deliberation cost, ψ)                                          17642.0 (3346.85)
Safe-A2OC (deliberation cost, ψ)   2710.9 (598.69)     925.43 (211.52)
Safe-A2OC (deliberation cost, ψ)
Safe-A2OC (deliberation cost, ψ)
The trained agents are then tested for their average performance across games, as shown in Table 1. Safe-A2OC with an appropriate controllability value outperforms the score achieved by A2OC in Q*Bert, MsPacman and Amidar. In MsPacman and Amidar, Safe-A2OC also outperforms the other state-of-the-art approaches Mnih et al. (2016); Nair et al. (2015); Van Hasselt et al. (2016); Wang et al. (2015) using primitive actions. The empirical effect of introducing the right degree of controllability in options demonstrates that an agent which additionally optimizes for low variance in the TD error learns better than one optimizing only for the cumulative reward. The intuition here is that using the variance in the TD error as a measure of safety in hierarchical RL helps agents avoid states with high intrinsic variability. Depending on the nature of the game itself, we observe different degrees of response to different levels of controllability in Q*Bert, Amidar and MsPacman.
5. Discussion
In this work, we introduce a new framework, Safe Option-Critic, wherein we define safety in learning end-to-end options. We extend the idea of controllability from the primitive action space, based on the temporal difference error, to the option-critic architecture in order to incorporate safety. The underlying idea of this learning process is to discourage the agent from visiting harmful or undesirable state-option pairs by constraining the variance in the TD error. Recent work by Sherstan et al. (2018) proposed a direct method to calculate the variance of the return, instead of the traditional indirect approaches which use the second moment. The authors proposed a Bellman operator which uses the square of the TD error to measure the variance of the return. This work further supports our approach of estimating risk through the square of the TD error.
Our experiments with tabular methods empirically demonstrate the reduced variance in the return. Moreover, we observe a boost in overall performance with both the tabular and the linear approximation methods. Furthermore, the experiments in the ALE domain demonstrate that an RL agent is able to learn about intrinsic variability in a large and complicated state space, such as images, with nonlinear function approximation. The results from ALE also demonstrate that options with the notion of safety can outperform algorithms using primitive actions.
Limitations and Future Work: In this work, we limit the return calculation to the point where an option terminates. Using the n-step returns across the intermediate switching of options at the SMDP level is of potential interest for future work. Additionally, it is currently assumed that all options are available in all states; in the context of safety, it would be interesting to understand what happens if the options' initiation sets were limited to a subset of the entire state space. One could also vary the degree of the controllability regularizer ψ over time, starting from zero to support exploration in the beginning and gradually increasing ψ to limit exploration of the unsafe states.
A potential direction for future work is the extension of controllability beyond the initial state-option pair. One could extend the definition of controllability to all the state-option pairs in a trajectory, which could potentially amplify the risk-mitigation effects. The proposed notion of safety could also be extended to different levels of the hierarchy in the framework. For instance, a mixture of options with varying degrees of controllability could be learned, wherein at the level of the policy over options, one could select an option based on how much controllability is desirable for a given part of the environment. The intra-option policies could still retain the current formulation.
The authors would like to thank their colleagues Herke van Hoof, Pierre-Luc Bacon, Jean Harb, Ayush Jain and Martin Klissarov for their useful comments and discussions throughout the duration of this work. The authors would also like to thank Open Philanthropy for funding this work, and Compute Canada for the computing resources.
References
 Altman (1999) Eitan Altman. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
 Amodei and Clark (2016) Dario Amodei and Jack Clark. 2016. Faulty Reward Functions in the Wild. (2016). https://blog.openai.com/faulty-reward-functions/
 Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. CoRR (2016). arXiv:1606.06565
 Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In AAAI. 1726–1734.
 Barto and Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13, 4 (2003), 341–379.
 Daniel et al. (2016) Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104, 2-3 (2016), 337–357.
 Everitt et al. (2017) Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. 2017. Reinforcement Learning with a Corrupted Reward Channel. arXiv preprint arXiv:1705.08417 (2017).
 Fikes et al. (1981) Richard E Fikes, Peter E Hart, and Nils J Nilsson. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.
 Future of Life Institute (2017) Future of Life Institute. 2017. Asilomar AI Principles. (2017). https://futureoflife.org/2017/01/17/principled-ai-discussion-asilomar/
 García and Fernández (2015) Javier García and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.
 Gehring and Precup (2013) Clement Gehring and Doina Precup. 2013. Smart Exploration in Reinforcement Learning Using Absolute Temporal Difference Errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’13). 1037–1044.
 Geibel and Wysotzki (2005) Peter Geibel and Fritz Wysotzki. 2005. Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res. (JAIR) 24 (2005), 81–108.
 Hadfield-Menell et al. (2017) Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. 2017. Inverse reward design. In Advances in Neural Information Processing Systems. 6768–6777.
 Harb et al. (2017) Jean Harb, PierreLuc Bacon, Martin Klissarov, and Doina Precup. 2017. When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571 (2017).

 Iba (1989) Glenn A Iba. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3, 4 (1989), 285–317.
 Konidaris and Barto (2007) George Konidaris and Andrew G Barto. 2007. Building Portable Options: Skill Transfer in Reinforcement Learning. In IJCAI, Vol. 7. 895–900.
 Konidaris et al. (2011) George Konidaris, Scott Kuindersma, Roderic A Grupen, and Andrew G Barto. 2011. Autonomous Skill Acquisition on a Mobile Manipulator.. In AAAI.
 Korf (1983) Richard E Korf. 1983. Learning to Solve Problems by Searching for Macrooperators. Ph.D. Dissertation. Pittsburgh, PA, USA. AAI8425820.
 Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems. 3675–3683.
 Law et al. (2005) Edith LM Law, Melanie Coggan, Doina Precup, and Bohdana Ratitch. 2005. Risk-directed Exploration in Reinforcement Learning. Planning and Learning in A Priori Unknown or Dynamic Domains (2005), 97.
 Machado et al. (2017) M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. 2017. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. ArXiv eprints (Sept. 2017). arXiv:cs.LG/1709.06009
 Mankowitz et al. (2016) Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. 2016. Adaptive Skills Adaptive Partitions (ASAP). In Advances in Neural Information Processing Systems. 1588–1596.
 McGovern and Barto (2001) Amy McGovern and Andrew G Barto. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, Vol. 1. 361–368.
 Menache et al. (2002) Ishai Menache, Shie Mannor, and Nahum Shimkin. 2002. Q-cut: dynamic discovery of subgoals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.
 Mihatsch and Neuneier (2002) Oliver Mihatsch and Ralph Neuneier. 2002. Risk-sensitive reinforcement learning. Machine Learning 49, 2-3 (2002), 267–290.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
 Nair et al. (2015) Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. 2015. Massively Parallel Methods for Deep Reinforcement Learning. CoRR (2015). arXiv:1507.04296
 Precup (2000) Doina Precup. 2000. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst.
 Sato et al. (2001) Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. 2001. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16, 3 (2001), 353–362.
 Sherstan et al. (2018) C. Sherstan, B. Bennett, K. Young, D. R. Ashley, A. White, M. White, and R. S. Sutton. 2018. Directly Estimating the Variance of the Return Using TemporalDifference Methods. ArXiv eprints (Jan. 2018). arXiv:cs.AI/1801.08287
 Sorg et al. (2010) Jonathan Sorg, Richard L Lewis, and Satinder P Singh. 2010. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems. 2190–2198.
 Stolle and Precup (2002) Martin Stolle and Doina Precup. 2002. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation. Springer, 212–223.
 Sutton (1988) Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.
 Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems. 1057–1063.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1-2 (1999), 181–211.
 Tamar et al. (2012) Aviv Tamar, Dotan Di Castro, and Shie Mannor. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning. 387–396.
 Tamar et al. (2016) Aviv Tamar, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the rewardtogo. Journal of Machine Learning Research 17, 13 (2016), 1–36.
 Tamar et al. (2013) Aviv Tamar, Huan Xu, and Shie Mannor. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189 (2013).
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI, Vol. 16. 2094–2100.
 Vezhnevets et al. (2016) Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems. 3486–3494.
 Wang et al. (2015) Ziyu Wang, Nando de Freitas, and Marc Lanctot. 2015. Dueling Network Architectures for Deep Reinforcement Learning. CoRR (2015). arXiv:1511.06581