When Waiting is not an Option: Learning Options with a Deliberation Cost

09/14/2017
by Jean Harb, et al.
McGill University

Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.


Introduction

Temporal abstraction has a rich history in AI [Minsky1961, Fikes et al.1972, Kuipers1979, Korf1983, Iba1989, Drescher1991, Dayan and Hinton1992, Kaelbling1993, Thrun and Schwartz1995, Parr and Russell1998, Dietterich1998]

and has been presented as a useful mechanism for a variety of problems that affect AI systems in many settings, including generating shorter plans, speeding up planning, improving generalization, yielding better exploration, and increasing robustness against model mis-specification or partial observability. In reinforcement learning,

options [Sutton et al.1999b] provide a framework to represent, learn and plan with temporally extended actions. Interest in temporal abstraction in reinforcement learning has increased substantially in the last couple of years, due to increasing success in constructing such abstractions automatically from data, e.g. [Bacon et al.2017, Kulkarni et al.2016, Daniel et al.2016, Mankowitz et al.2016, Machado et al.2017]. However, defining what constitutes a good set of options remains an open problem.

In this paper, we aim to leverage the bounded rationality framework [Simon1957]

in order to explain what would make good temporal abstractions for an RL system. A lot of existing reinforcement learning work has focused on Markov Decision Processes, where optimal policies can be obtained under certain assumptions. However, optimality does not take into account possible resource limitations of the agent, which is assumed to have access to a lot of data and computation time. Indeed, options help agents overcome such limitations, by allowing policies to be computed faster

[Dietterich1998, Precup2000]. However, from the point of view of absolute optimality, temporal abstractions are not necessary: the optimal policy is achieved by primitive actions. Therefore, it has been difficult to formalize in what precise theoretical sense temporally abstract actions are helpful.

Bounded rationality is a very important framework for understanding rationality in both natural and artificial systems. In this paper, we propose bounded rationality as a lens through which we can describe the desiderata for constructing temporal abstractions, as their goal is mainly to help agents which are restricted in terms of computation time. This perspective helps us formulate more precisely the objective criteria that should be fulfilled during option construction. We propose that good options are those which allow an agent to learn and plan faster, and provide an optimization objective for learning options based on this idea. We implement the optimization using the option-critic framework [Bacon et al.2017] and illustrate its usefulness with experiments in Atari games.

Preliminaries

A finite discounted Markov Decision Process is a tuple $\langle\mathcal{S}, \mathcal{A}, P, r, \gamma\rangle$ where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action sets respectively, and $\gamma \in [0,1)$ is a discount factor. The reward function $r$ is often assumed to be a deterministic function of the state and action, but can also map to a distribution over rewards (a perspective which we use in our formulation). The transition matrix $P(s' \mid s, a)$ is a conditional distribution over next states given that action $a$ is taken in state $s$. The interaction of a randomized stationary policy $\pi : \mathcal{S} \to \mathrm{Dist}(\mathcal{A})$ or a deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$ with an MDP induces a Markov process over states, actions and rewards, over which is defined the expected discounted return $\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t\, r(S_t, A_t)\right]$. The value function $V_\pi$ of a policy $\pi$ satisfies the Bellman equations:

$V_\pi(s) = \sum_a \pi(a \mid s)\Big[r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\,V_\pi(s')\Big].$
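To make the Bellman equations above concrete, here is a minimal policy-evaluation sketch for a small tabular MDP; the transition tensor, rewards and policy below are made-up placeholders rather than anything from the paper.

    import numpy as np

    def policy_evaluation(P, r, pi, gamma=0.99, tol=1e-8):
        """Iteratively apply the Bellman operator for a fixed policy.

        P:  transition tensor of shape [S, A, S], P[s, a, s'] = P(s' | s, a)
        r:  reward matrix of shape [S, A]
        pi: policy matrix of shape [S, A], pi[s, a] = pi(a | s)
        """
        V = np.zeros(P.shape[0])
        while True:
            # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
            Q = r + gamma * P @ V              # shape [S, A]
            V_new = (pi * Q).sum(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    # Tiny two-state, two-action example (made-up numbers).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    pi = np.array([[0.5, 0.5],
                   [0.5, 0.5]])
    print(policy_evaluation(P, r, pi))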

In the control problem, we are interested in finding an optimal policy for a given MDP. A policy $\pi^*$ is said to be optimal if $V_{\pi^*}(s) \ge V_\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$.

An important class of control methods in reinforcement learning is based on the actor-critic architecture [Sutton1984]. In the same way that function approximation can be used for value functions, policies can also be approximated within a parameterized family $\{\pi_\theta\}$ which is searched over. The policy gradient theorem [Sutton et al.1999a] shows that the gradient of the expected discounted return with respect to the parameters $\theta$ of a policy $\pi_\theta$ is of the form $\sum_s \mu_{\pi_\theta}(s \mid s_0)\sum_a \frac{\partial\pi_\theta(a \mid s)}{\partial\theta}\,Q_{\pi_\theta}(s,a)$, where $\mu_{\pi_\theta}(s \mid s_0)$ is a discounted weighting of the states encountered from an initial state distribution. A locally optimal policy can then be found by stochastic gradient ascent over the policy parameters while simultaneously learning the action-value function $Q_{\pi_\theta}$ (usually by TD).
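As an illustration of the actor-critic idea, the following sketch performs one on-line update for a tabular softmax policy with a SARSA-style critic; the parameterization, step sizes and array shapes are illustrative assumptions, not the architecture used later in the paper.

    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    def actor_critic_step(theta, Q, s, a, r, s_next, a_next,
                          gamma=0.99, alpha_q=0.1, alpha_pi=0.01):
        """One on-line actor-critic update for a tabular softmax policy.

        theta: policy preferences, shape [S, A]
        Q:     critic estimates,   shape [S, A]
        """
        # Critic: one-step SARSA-style TD update of Q.
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha_q * (td_target - Q[s, a])

        # Actor: gradient of log pi(a|s) for a softmax policy is
        # one_hot(a) - pi(.|s); scale it by the critic's value estimate.
        pi_s = softmax(theta[s])
        grad_log = -pi_s
        grad_log[a] += 1.0
        theta[s] += alpha_pi * grad_log * Q[s, a]
        return theta, Q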

Options

Options [Sutton et al.1999b] provide a framework for representing, planning and learning with temporally abstract actions. The options framework assumes the existence of a base MDP on which are overlaid temporally abstract actions called options. An option $o$ is defined as a triple $(\mathcal{I}_o, \pi_o, \beta_o)$ where $\mathcal{I}_o \subseteq \mathcal{S}$ is an initiation set, $\pi_o$ is the policy of the option (which can also be deterministic) and $\beta_o : \mathcal{S} \to [0,1]$ is a termination condition. In the call-and-return execution model, a policy over options $\pi_\Omega$ (deterministic if desired) chooses an option among those which can be initiated in a given state and executes the policy of that option until termination. Once the chosen option has terminated, the policy over options chooses a new option and the process is repeated until the end of the episode.
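A minimal sketch of call-and-return execution under these definitions; `env`, `policy_over_options` and the option triples follow a hypothetical Gym-like interface and are placeholders, not the paper's code.

    import random

    def call_and_return(env, options, policy_over_options, max_steps=1000):
        """Execute options in call-and-return mode.

        options: list of (initiation_set, option_policy, termination_prob)
                 triples, where option_policy(s) returns an action and
                 termination_prob(s) gives the probability of terminating in s.
        policy_over_options(s, available): returns the index of an option
                 whose initiation set contains s.
        """
        s = env.reset()
        o = None
        for _ in range(max_steps):
            if o is None:  # pick a new option among those that can start in s
                available = [i for i, (I, _, _) in enumerate(options) if s in I]
                o = policy_over_options(s, available)
            _, pi_o, beta_o = options[o]
            s, reward, done, _ = env.step(pi_o(s))
            if done:
                break
            if random.random() < beta_o(s):  # option terminates; choose anew
                o = None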

The combination of a set of options and a base MDP leads to a semi-Markov decision process (SMDP) [Howard1963, Puterman1994]

in which the transition time between two decision points is a random variable. When considering the induced process only at the level of state-option pairs, usual dynamic programming results can be reused after a transformation to an equivalent MDP

[Puterman1994]. To see this, we need to define two kinds of models for every option $o$: a reward model $b_o(s)$ and a transition model $F_o(s' \mid s)$. If an option does not depend on the history since initiation, we can write its models either in closed form or as the solution to Bellman-like equations [Sutton et al.1999b]. The expected discounted return associated with a set of options and a policy over them is then the solution to a set of Bellman equations:

$Q_\Omega(s,o) = b_o(s) + \sum_{s'} F_o(s' \mid s)\sum_{o'}\pi_\Omega(o' \mid s')\,Q_\Omega(s',o'),$

where the subscript $\Omega$ denotes the concatenation of the policy over options $\pi_\Omega$, the option policies and the termination conditions.
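For Markov options, the Bellman-like equations for these models can be written (in the notation above, with the discount folded into $F_o$) in the standard form of [Sutton et al.1999b]:

    \begin{align*}
    b_o(s)        &= \sum_a \pi_o(a \mid s)\Big[r(s,a)
                     + \gamma \sum_{s'} P(s' \mid s, a)\,(1 - \beta_o(s'))\, b_o(s')\Big],\\
    F_o(x \mid s) &= \sum_a \pi_o(a \mid s)\,\gamma \sum_{s'} P(s' \mid s, a)
                     \Big[(1 - \beta_o(s'))\, F_o(x \mid s') + \beta_o(s')\,\mathbb{1}[s' = x]\Big].
    \end{align*}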

Intra-Option Bellman Equations

In the case of Markov options, there exists another form for the Bellman equations, called the intra-option Bellman equations [Sutton et al.1999b], which are key for deriving gradient-based algorithms for learning options.

Let $Z_t = (S_t, O_t)$ be a random variable over state-option tuples. We call the space of state-option pairs the augmented state space. This augmentation is sufficient to provide the Markov property, which would otherwise be lost when considering the process at the flat level of state-action pairs [Sutton et al.1999b]. The transition matrix of the Markov process over the augmented state space [Bacon et al.2017] is given by:

$\tilde P(s', o' \mid s, o) = \sum_a \pi_o(a \mid s)\,P(s' \mid s, a)\Big[(1-\beta_o(s'))\,\mathbb{1}_{o'=o} + \beta_o(s')\,\pi_\Omega(o' \mid s')\Big].$

Using this chain structure, we can define the MDP whose associated value function is:

$Q_\Omega(s,o) = \tilde r(s,o) + \gamma\sum_{s',o'}\tilde P(s',o' \mid s,o)\,Q_\Omega(s',o').$   (1)

Since the rewards come from the base (primitive) MDP, we can simply write $\tilde r(s,o) = \sum_a \pi_o(a \mid s)\,r(s,a)$, and because $V_\Omega(s') = \sum_{o'}\pi_\Omega(o' \mid s')\,Q_\Omega(s',o')$, we get:

$\sum_{o'}\Big[(1-\beta_o(s'))\,\mathbb{1}_{o'=o} + \beta_o(s')\,\pi_\Omega(o' \mid s')\Big]\,Q_\Omega(s',o') = (1-\beta_o(s'))\,Q_\Omega(s',o) + \beta_o(s')\,V_\Omega(s').$

Hence, when taking the expectation in (1) over the next values, we obtain:

$Q_\Omega(s,o) = \sum_a\pi_o(a \mid s)\Big[r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\big(Q_\Omega(s',o) - \beta_o(s')\,A_\Omega(s',o)\big)\Big],$   (2)

where $A_\Omega(s,o) = Q_\Omega(s,o) - V_\Omega(s)$ is the advantage function [Baird1993]. The equations in (2) correspond exactly to the intra-option Bellman equations [Sutton et al.1999b]. However, we chose to present them under an alternate – but more convenient – form highlighting a connection to the advantage function:

$U(s',o) = (1-\beta_o(s'))\,Q_\Omega(s',o) + \beta_o(s')\,V_\Omega(s') = Q_\Omega(s',o) - \beta_o(s')\,A_\Omega(s',o),$

where $U(s',o)$ represents the utility of continuing with the same option or switching to a better one.
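As a small consistency check of the two equivalent forms of $U$ above, here is a sketch in which the option values, terminations and policy over options are assumed to be lookup tables (shapes are illustrative only):

    import numpy as np

    def utility_of_continuing(Q, beta, pi_omega, s_next, o):
        """U(s', o): value of arriving in s' while committed to option o.

        Q:        option values,              shape [S, O]
        beta:     termination probabilities,  shape [S, O]
        pi_omega: policy over options,        shape [S, O]
        """
        V = (pi_omega[s_next] * Q[s_next]).sum()          # V_Omega(s')
        A = Q[s_next, o] - V                              # advantage A_Omega(s', o)
        u_mixture = (1 - beta[s_next, o]) * Q[s_next, o] + beta[s_next, o] * V
        u_advantage = Q[s_next, o] - beta[s_next, o] * A  # equivalent form
        assert np.isclose(u_mixture, u_advantage)
        return u_mixture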

Optimization

The option-critic architecture [Bacon et al.2017] is a gradient-based actor-critic architecture for learning options end-to-end. As in actor-critic methods, the idea is to parametrize the option policies and termination conditions and to learn their parameters jointly by stochastic gradient ascent on the expected discounted return. [Bacon et al.2017] provided the form of the gradients for both the option policies and the termination functions under the assumption that options are available everywhere. In the following, we further assume that the parameter vector $\theta$ is partitioned into disjoint sets of parameters for the policy over options, the option policies ($\theta_\pi$) and the termination functions ($\theta_\beta$).

In the gradient theorem for option policies [Bacon et al.2017], the result maintains the same form as the original policy gradient theorem for MDPs [Sutton et al.1999a], but over the augmented state space. If $J(\theta) = \mathbb{E}_{(s_0,o_0)}\left[Q_\Omega(s_0,o_0)\right]$ is the expected discounted return for the set of options and the policy over them, then the gradient with respect to the option-policy parameters (which are independent from the terminations) is:

$\frac{\partial J}{\partial\theta_\pi} = \sum_{s,o}\mu_\Omega(s,o \mid s_0,o_0)\sum_a \frac{\partial\pi_o(a \mid s)}{\partial\theta_\pi}\,Q_U(s,o,a),$

where $Q_U(s,o,a) = r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\,U(s',o)$, $\mu_\Omega$ is a discounted weighting of state-option pairs, and $(s_0,o_0)$ is drawn from an initial distribution over states and options.

To obtain the gradient for the termination functions, let us first take the derivative of the intra-option Bellman equations (2) with respect to the termination parameters $\theta_\beta$:

$\frac{\partial Q_\Omega(s,o)}{\partial\theta_\beta} = -\gamma\sum_a\pi_o(a \mid s)\sum_{s'}P(s' \mid s,a)\,\frac{\partial\beta_o(s')}{\partial\theta_\beta}\,A_\Omega(s',o) + \gamma\sum_{s',o'}\tilde P(s',o' \mid s,o)\,\frac{\partial Q_\Omega(s',o')}{\partial\theta_\beta}.$   (3)

By noticing the similarity between (3) and (1), we can easily solve for the recursive form of the derivative. Indeed, it suffices to see that $-\frac{\partial\beta_o(s')}{\partial\theta_\beta}A_\Omega(s',o)$ plays the role of the “reward” term in the usual Bellman equations (see [Bacon et al.2017] for a detailed proof) and conclude that:

$\frac{\partial Q_\Omega(s,o)}{\partial\theta_\beta} = -\sum_{s',o'}\mu_\Omega(s',o' \mid s,o)\,\frac{\partial\beta_{o'}(s')}{\partial\theta_\beta}\,A_\Omega(s',o').$   (4)

Hence, the termination gradient shows that if an option is advantageous, the probability of termination should be lowered, making that option longer. Conversely, if the value of an option is less than what could be achieved through a different choice of option at a given state, the termination gradient will make it more likely to terminate at this state. The termination gradient has the same structure as the interruption operator [Mann et al.2014] in the interruption execution model [Sutton et al.1999b]. Rather than executing the policy of an option irrevocably until termination, interruption execution consists in choosing a new option whenever $Q_\Omega(s,o) < V_\Omega(s)$. Moving the value function to the left-hand side, interruption execution can also be understood in terms of the advantage function: the option is interrupted whenever $A_\Omega(s,o) < 0$. As for the termination gradient, interruption execution leads to the termination of an option whenever there is no advantage (a negative advantage) in maintaining it. Interestingly, [Mann et al.2014] also considered adding a scalar regularizer to the advantage function to favor longer options. From the more general perspective of bounded rationality, we also recover this regularizer, but within a larger family which follows from the notion of deliberation cost.
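A sketch of the interruption rule with the scalar regularizer of [Mann et al.2014] folded in as a margin; `Q`, `V` and `margin` are assumed to be lookup tables and a hypothetical constant rather than anything from the paper.

    def should_interrupt(Q, V, s, o, margin=0.0):
        """Interruption execution: switch options when there is no advantage
        (beyond the margin) in continuing with the current one."""
        advantage = Q[s][o] - V[s]
        return advantage + margin < 0.0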

Deliberation Cost Model

From a representation learning perspective, good options ought to allow an agent to learn and plan faster [Minsky1961]. Due to their temporal structure, options offer a mechanism through which an agent can make better use of its limited computational resources and act faster. Once an option has been chosen, we assume that the computational cost of executing it is negligible or constant until termination. After deliberating on the choice of option, an agent can relax thanks to the fast – but perhaps imperfect – knowledge compiled within its policy.

This perspective on options is similar to the fast and frugal heuristics of [Gigerenzer and Selten2001], which form a decision repertoire for efficient decision making under limited resources. Our assumption on the cost structure is also consistent with models of the prefrontal areas [Botvinick et al.2009, Solway et al.2014] which present decision making over options as a slower, model-based planning process, as opposed to the fast and habitual learning taking place within an option. When planning with options (in computers), there is also a cost for deciding which option to choose next by making predictions based on their models. For example, option models could be given by deep networks, necessitating costly back-and-forth communication with the GPU, or by a simulator with expensive explicit rollouts [Guo et al.2014, Mann et al.2015].

Bounded rationality can also be useful for understanding how efficient communication can take place between two agents over a limited channel [Neyman1985]. Options offer a mechanism for communicating intent to and from an agent [Branavan et al.2012, Andreas et al.2017] more efficiently, by compressing the information into a simpler form: only the identifier of an option is sent, not its details. Having longer options is also a way to provide better interpretability, simplifying communication and understanding by compressing information.

Consider the cost model (fig. 1) in which executing a primitive action within an option is free, but switching to a new option upon arriving in a new state incurs a cost $\eta$. To build some intuition, let us further assume that the termination function of an option is constant over all states. If $1-\beta$ is the continuation probability of that option, its expected discounted duration is $\frac{1}{1-\gamma(1-\beta)}$. When a fixed cost $\eta$ is incurred upon termination, the average cost per step for that option is then $\eta\,(1-\gamma(1-\beta))$. Hence, as the probability of continuation increases and options get longer, the cost rate decreases. Conversely, if an option only terminates after one step – a primitive option – its expected duration is $1$ and the cost rate is $\eta$. The fact that longer options lead to a better amortization of the deliberation cost is key to understanding their benefit in comparison to using only primitive actions.
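A quick Monte Carlo check of this amortization effect, ignoring discounting for simplicity; the termination probabilities, cost and horizon below are arbitrary illustrative values.

    import random

    def average_cost_per_step(beta, eta=1.0, steps=100000, seed=0):
        """Charge a cost eta every time the option terminates (and a new one
        must be chosen); report the average cost per environment step."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(steps):
            if rng.random() < beta:   # option terminates, pay deliberation cost
                total += eta
        return total / steps

    for beta in (1.0, 0.5, 0.1, 0.01):
        print(beta, round(average_cost_per_step(beta), 3))
    # Longer options (smaller beta) amortize the switching cost over more steps.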

Figure 1: A deliberation cost is incurred upon switching to a new option and is subtracted from the reward of the base MDP. Open circles represent SMDP decision points while filled circles are primitive steps within an option. The cost rate for each option is represented by the intensity of the subtrajectory.

Formulation

In addition to the value function over the base MDP and the options over it, we define an immediate cost function $c$ and a corresponding deliberation cost function $D_\Omega$. The expected sum of discounted costs associated with a set of options and the policy over them is given by:

$D_\Omega(s,o) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\, c(S_t, O_t, S_{t+1}) \,\middle|\, S_0 = s,\ O_0 = o\right].$

We first formulate our goal of maximizing the expected return while keeping the deliberation cost low as a constrained optimization problem:

$\max_{\theta}\ \mathbb{E}_{(s_0,o_0)\sim\alpha}\left[Q_\Omega(s_0,o_0)\right] \quad \text{subject to} \quad \mathbb{E}_{(s_0,o_0)\sim\alpha}\left[D_\Omega(s_0,o_0)\right] \le k,$

where $\alpha$ is an initial distribution over state-option pairs. But in general, solving a problem of this form [Altman1999] requires a Linear Programming (LP) formulation, which is both expensive to solve and incompatible with the model-free learning methods adopted in this work. Instead, we consider the unconstrained optimization problem arising from the Lagrangian formulation [Sennott1991, Altman1999]:

$J(\theta) = \mathbb{E}_{(s_0,o_0)\sim\alpha}\left[Q_\Omega(s_0,o_0) - \eta\, D_\Omega(s_0,o_0)\right],$   (5)

where $\eta$ is a regularization coefficient. While (5) shows the option-value function and the deliberation cost function as separate entities, they can in fact be seen as a single MDP whose reward function is the difference of the base MDP reward and the (scaled) cost function:

$\hat r(s, o, a, s') = r(s,a) - \eta\, c(s, o, s').$

Therefore, there is a set of Bellman equations which the value function over the transformed reward function satisfies:

$\hat Q_\Omega(s,o) = \sum_a \pi_o(a \mid s)\sum_{s'}P(s' \mid s,a)\Big[\hat r(s,o,a,s') + \gamma\,\hat U(s',o)\Big],$   (6)

where $\hat U(s',o) = (1-\beta_o(s'))\,\hat Q_\Omega(s',o) + \beta_o(s')\,\hat V_\Omega(s')$. Similarly, there exist Bellman optimality equations in the sense of [Sutton et al.1999b] for the parameters of the policy over options $\pi_\Omega$:

$\hat Q^*_\Omega(s,o) = \sum_a \pi_o(a \mid s)\sum_{s'}P(s' \mid s,a)\Big[\hat r(s,o,a,s') + \gamma\Big((1-\beta_o(s'))\,\hat Q^*_\Omega(s',o) + \beta_o(s')\max_{o'}\hat Q^*_\Omega(s',o')\Big)\Big],$   (7)

where the notation indicates that the parameters of the options are kept fixed and only the policy over options is allowed to change. A policy over options is said to be $\eta$-optimal with respect to a set of options if it reaches the maximum in (7) for a given $\eta$. Clearly, when $\eta = 0$, the corresponding policy over options is also optimal in the base MDP and there is no loss of optimality in this regard.

Switching Cost and its Interpretation as a Margin

One way to favor long options is through a cost function which penalizes frequent option switches. In the same way that the MDP formulation allows for randomized reward functions [Puterman1994], we can also capture the random event of switching through the immediate cost function $c$. Since $\beta_o(s')$ is the mean of a Bernoulli random variable over the two possible outcomes, switching or continuing ($1$ or $0$), the cost function corresponding to the switching event is $c(s,o,s') = \gamma\,\beta_o(s')$ (where $\gamma$ was added for mathematical convenience).

When expanding the value function over the transformed reward (6) for this choice of $c$, we get:

$\hat Q_\Omega(s,o) = \sum_a \pi_o(a \mid s)\Big[r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\big(\hat Q_\Omega(s',o) - \beta_o(s')\big(\hat A_\Omega(s',o) + \eta\big)\big)\Big],$   (8)

with $\eta$ appearing along with the advantage function $\hat A_\Omega$: a term which would otherwise be absent from the intra-option Bellman equations over the base MDP (2). Therefore, adding the switching cost function to the base MDP reward contributes a scalar margin $\eta$ to the advantage function over the transformed reward. When learning termination functions in option-critic, the termination gradient for the unconstrained problem (5) is then of the form:

$\frac{\partial J}{\partial\theta_\beta} = -\sum_{s',o}\mu_\Omega(s',o)\,\frac{\partial\beta_o(s')}{\partial\theta_\beta}\left(\hat A_\Omega(s',o) + \eta\right).$   (9)

Hence, $\eta$ sets a margin, or a baseline, for how good an option ought to be: a correction which might be due to approximation error, or which may reflect some form of uncertainty in the value estimates. By increasing its value, we can reduce the gap in the advantage function, tilting the balance in favor of maintaining an option rather than terminating it.
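A sketch of how this margin could enter a gradient-based termination update for a sigmoid termination function, in the spirit of (9); the tabular logit parameterization and step size are assumptions, not the released implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def termination_update(z, s, o, advantage, eta, lr=1e-3):
        """Gradient step on the termination logits z[s, o].

        Following (9), the update pushes beta down (options get longer)
        whenever advantage + eta > 0, i.e. whenever the option is still good
        enough once the deliberation margin eta is taken into account.
        """
        beta = sigmoid(z[s, o])
        dbeta_dz = beta * (1.0 - beta)          # d beta / d logit for a sigmoid
        # ascending the return corresponds to descending beta * (A + eta)
        z[s, o] -= lr * dbeta_dz * (advantage + eta)
        return z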

Computational Horizon

Due to the generality of our formulation, the discount factor of the deliberation cost function can be different from that of the value function over the base MDP reward. The unconstrained formulation (5) then becomes a function of two discount factors: $\gamma$ for the base MDP and $\gamma_d$ for the deliberation cost function:

$J_{\gamma,\gamma_d}(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\, r(S_t, A_t)\right] - \eta\,\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma_d^t\, c(S_t, O_t, S_{t+1})\right].$

Since the derivative of the deliberation cost function with respect to the termination parameters is a discounted sum of future cost derivatives, setting $\gamma_d = 0$ when the cost function is $c(s,o,s') = \gamma\beta_o(s')$ leaves only the immediate term $\gamma\,\frac{\partial\beta_o(s')}{\partial\theta_\beta}$. Hence, by linearity with (4), the derivative over the mixed objective is:

$\frac{\partial J_{\gamma,0}}{\partial\theta_\beta} = -\sum_{s',o}\mu_\Omega(s',o)\,\frac{\partial\beta_o(s')}{\partial\theta_\beta}\left(A_\Omega(s',o) + \eta\right).$   (10)

While similar to (9) in the sense that the margin $\eta$ also enters the advantage function, (10) differs fundamentally in that it depends on $A_\Omega$ and not $\hat A_\Omega$, the advantage function over the transformed reward. We can also see that setting $\gamma_d = \gamma$ recovers the same form as the derivative of the expected return in the transformed MDP from (9).

The discount factor $\gamma_d$ for the deliberation cost function provides a mechanism for truncating the sum of costs. Therefore, it plays a distinct role from the regularization coefficient $\eta$, which merely scales the deliberation cost function but does not affect the computational horizon. As opposed to the random horizon set by the discount factor of the environment, $\gamma_d$ pertains to the internal environment of the agent: the cost of its own cognitive or computational processes. It is a parameter of an introspective process of self-prediction, namely of how internal costs are likely to accumulate as a result of deliberating about courses of action in the outside environment. In accordance with more general results on discounting [Petrik and Scherrer2008, Jiang et al.2015], $\gamma_d$ should be aligned with the representational capacity of the system, since a larger $\gamma_d$ involves an increasingly difficult prediction problem.

In that sense, $\gamma_d = 0$ indicates that only the immediate computational cost should be considered when learning options that also maximize the reward. When learning termination functions, the resulting shallow evaluation under small values of $\gamma_d$ might not take into account the possibility that the overall expected cost could be lowered in exchange for a less favorable immediate cost: it lacks foresight. Despite the fact that the full effect of a change in the options or the policy over them cannot be captured in this regime, the corresponding gradient (9) is still useful in practice. It leads to both the regularization strategy proposed in [Bacon et al.2017] for gradient-based learning and that of [Mann et al.2014] in the dynamic programming case. Furthermore, since (9) does not depend on the full deliberation cost function, values can be learned for the original reward function only, without requiring mixed or separate estimates.

Experiments

Previous results [Bacon et al.2017] in the Arcade Learning Environment [Bellemare et al.2013] have shown that while learning options end-to-end is possible, frequent terminations can become an issue unless regularization is used. Hence, we chose to apply the idea of deliberation cost in combination with a novel option-critic implementation based on the Asynchronous Advantage Actor-Critic (A3C) architecture of [Mnih et al.2016]. More specifically, our experiments are meant to assess: the interpretability of the resulting options, whether degeneracies to single-step options (frequent terminations) can be controlled, and whether the deliberation cost can provide an inductive bias for learning faster.

Asynchronous Advantage Option-Critic (A2OC)

Initialize global counter T ← 0
Initialize thread counter t ← 0
repeat
       Reset gradients: dθ_π ← 0, dθ_β ← 0 and dθ_Q ← 0
       t_start ← t; observe state s
       Choose o with an ε-greedy policy over options π_Ω(s)
       repeat
             Choose a according to π_o(· | s)
             Take action a in s, observe r, s'
             if the current option o terminates in s' then
                    choose new o with ε-greedy π_Ω(s')
             else
                    keep the current o
             end if
             s ← s'; t ← t + 1; T ← T + 1
       until episode ends, or t − t_start = t_max, or (t − t_start > 0 and o terminated)
       for each step k of the rollout (in reverse order) do
             Accumulate thread-specific gradients dθ_π, dθ_β, dθ_Q from step k
       end for
       Update global parameters with thread gradients
until T > T_max
Algorithm 1 Asynchronous Advantage Option-Critic

The option-critic architecture [Bacon et al.2017] introduced a deep RL version of the algorithm, which allowed options to be learned in an end-to-end fashion, directly from pixels. However, it was built on top of the DQN algorithm [Mnih et al.2015], which is an off-policy algorithm using samples from an experience replay buffer. Option-critic, on the other hand, is an on-line algorithm which uses every new sampled transition for its updates. Using on-line samples has been known to cause issues when training deep networks.

Recently, the asynchronous advantage actor-critic (A3C) algorithm [Mnih et al.2016] addressed this issue and led to stable on-line learning by running multiple parallel agents. The parallel agents allow the deep networks to see samples from very different states, which greatly stabilizes learning. This algorithm is also much more consistent with the spirit of option-critic, as both use on-line policy gradients for training. We introduce the asynchronous advantage option-critic (A2OC), an algorithm (Alg. 1) that learns options in a similar way to A3C but within the option-critic architecture. The source code is available at https://github.com/jeanharb/a2oc_delib.

The architecture used for A2OC was kept as consistent with A3C as possible. We use a convolutional neural network of the same size, which outputs a feature vector shared among 3 heads as in [Bacon et al.2017]: the option policies, the termination functions and the Q-value network. The option policies are linear softmax functions, the termination functions use sigmoid activations to represent probabilities of terminating, and the Q-values are simply linear layers. During training, all gradients are summed together, and updating is performed in a single thread instance. A3C only needs to learn a value function for its policy, as opposed to Q-values for every action. Similarly, A2OC gets away without the action dimension through sampling [Bacon et al.2017], but needs to maintain state-option values because of the underlying augmented state space.
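A minimal PyTorch-style sketch of such a shared network with three heads; the layer sizes mirror the standard A3C convolutional stack, but this is an illustrative reconstruction rather than the authors' released code.

    import torch
    import torch.nn as nn

    class A2OCNet(nn.Module):
        """Shared convolutional trunk with three heads: option policies,
        termination probabilities, and option values Q(s, o)."""

        def __init__(self, num_options, num_actions, in_channels=4):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # assumes 84x84 inputs
            )
            self.option_policies = nn.Linear(256, num_options * num_actions)
            self.terminations = nn.Linear(256, num_options)
            self.q_values = nn.Linear(256, num_options)
            self.num_options = num_options
            self.num_actions = num_actions

        def forward(self, frames):
            h = self.trunk(frames)
            logits = self.option_policies(h).view(-1, self.num_options, self.num_actions)
            pi = torch.softmax(logits, dim=-1)          # intra-option action probabilities
            beta = torch.sigmoid(self.terminations(h))  # termination probabilities
            q = self.q_values(h)                        # state-option values
            return pi, beta, q

    # Example: a batch of two 4x84x84 frame stacks.
    net = A2OCNet(num_options=8, num_actions=6)
    pi, beta, q = net(torch.zeros(2, 4, 84, 84))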

As for the hyperparameters, we use an ε-greedy policy over options. The preprocessing is the same as in A3C: RGB frames are scaled to grayscale images, the agent repeats each action for 4 consecutive frames, and receives stacks of 4 frames as input. We used entropy regularization on the option policies, which pushes them not to collapse to deterministic policies, and a fixed learning rate in all experiments. The agent was trained with multiple parallel threads.

Empirical Effects of Deliberation Cost

(a) Without a deliberation cost, options terminate instantly and are used in any scenario without specialization.
(b) Options are used for extended periods and in specific scenarios through a trajectory, when using a deliberation cost.
(c) Termination is sparse when using the deliberation cost. The agent terminates options at intersections requiring high level decisions.
Figure 2: We show the effects of using deliberation costs on both the option termination and policies. In figures (a) and (b), every color in the agent trajectory represents a different option being executed. This environment is the game Amidar, of the Atari 2600 suite.

We use Amidar, a game of the Atari 2600 suite, as an environment that allows us to analyze the option policies and terminations qualitatively. The game is grid-like and the task is to cover as much ground as possible, without running into enemies.

Without a deliberation cost, the options eventually learn to terminate at every step, as seen in figure 2(a), where every color represents a different option being executed at the time the agent was in that location. In contrast, figure 2(b) shows the effect of training an agent with a deliberation cost: the agent persists with an option over a long period of time. The temporally extended structure of the options shown by color does not result from simply terminating and re-picking the same option at every step, but truly represents a contiguous segment of the trajectory where that option was maintained in a call-and-return fashion. Only at certain intersections do the options terminate, allowing the agent to select an option which will lead it in a different direction. As opposed to the agent trained without a deliberation cost (fig. 2(a)), figure 2(b) shows that the options learned with the regularizer are specialized and only selected in specific scenarios. Figure 2(c) shows where the agent terminated options along its trajectory. The options clearly terminate at intersections, which represent key decision points.

Algorithm Amidar Asterix Breakout Hero
[Mnih et al.2015]
[Mnih et al.2016]
No deliberation cost
Table 1: Final performance for different levels of regularization. Note that the A3C Deepmind scores use a nonpublic human starts evaluation and may not be directly comparable to our random start initialization.

We also trained agents with multiple levels of deliberation cost. The range of values was chosen according to the general scale of the value estimates in these environments. As figure 4 shows, increasing the deliberation cost quickly decreases the average termination probabilities, as expected from the formulation (5). When no deliberation cost is used, the termination probability quickly rises to nearly 100%, meaning each option lasts only a single time-step. The decrease in probability is not the same in every environment; this is due to the difference in returns, since the deliberation cost has an effect proportional to its ratio to the state values. Intuitively, environments with many high rewards require a larger deliberation cost to have a substantial effect.

Conclusion and Future Work

We presented the use of deliberation cost as a way to incentivize the creation of options which persist for a longer period of time. Using this approach within the option-critic architecture yields both good performance and options which are intuitive and do not shrink over time. In doing so, we also outlined a connection between our more general notion of deliberation cost and previous notions of regularization from [Mann et al.2014] and [Bacon et al.2017].

The deliberation cost goes beyond penalizing lengthy computation. It can also be used to incorporate other forms of bounds intrinsic to an agent in its environment. One interesting direction for future work is to think of deliberation cost in terms of missed opportunity, opening the way for an implicit form of regularization when interacting asynchronously with an environment. Another interesting form of limitation inherent to reinforcement learning agents has to do with their representational capacity when estimating action values. Preliminary work seems to indicate that the error decomposition for the action values could also be expressed in the form of a deliberation cost.

References

  • [Altman1999] E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
  • [Andreas et al.2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In ICML, pages 166–175, 2017.
  • [Bacon et al.2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
  • [Baird1993] Leemon C. Baird. Advantage updating. Technical Report WL–TR-93-1146, Wright Laboratory, 1993.
  • [Bellemare et al.2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents.

    Journal of Artificial Intelligence Research

    , 47:253–279, 06 2013.
  • [Botvinick et al.2009] Matthew M. Botvinick, Yael Niv, and Andrew C. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262 – 280, 2009.
  • [Branavan et al.2012] S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. Learning high-level planning from text. In ACL, pages 126–135, 2012.
  • [Daniel et al.2016] C. Daniel, H. van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue, 104(2):337–357, 2016.
  • [Dayan and Hinton1992] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In NIPS, pages 271–278, 1992.
  • [Dietterich1998] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In ICML, pages 118–126, 1998.
  • [Drescher1991] Gary L. Drescher. Made-up Minds: A Constructivist Approach to Artificial Intelligence. MIT Press, Cambridge, MA, USA, 1991.
  • [Fikes et al.1972] Richard Fikes, Peter E. Hart, and Nils J. Nilsson. Learning and executing generalized robot plans. Artif. Intell., 3(1-3):251–288, 1972.
  • [Gigerenzer and Selten2001] Gerd Gigerenzer and R. Selten. Bounded Rationality: The adaptive toolbox. Cambridge: The MIT Press, 2001.
  • [Guo et al.2014] Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3338–3346. Curran Associates, Inc., 2014.
  • [Howard1963] Ronald A. Howard. Semi-markovian decision processes. In Proceedings 34th Session International Statistical Institute, pages 625–652, 1963.
  • [Iba1989] Glenn A. Iba. A heuristic approach to the discovery of macro-operators. Machine Learning, 3:285–317, 1989.
  • [Jiang et al.2015] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard L. Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2015, Istanbul, Turkey, May 4-8, 2015, pages 1181–1189, 2015.
  • [Kaelbling1993] Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In ICML, pages 167–173, 1993.
  • [Korf1983] Richard Earl Korf. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1983.
  • [Kuipers1979] Benjamin Kuipers. Commonsense knowledge of space: Learning from experience. In Proceedings of the 6th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’79, pages 499–501, San Francisco, CA, USA, 1979. Morgan Kaufmann Publishers Inc.
  • [Kulkarni et al.2016] Tejas Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Joshua Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29, 2016.
  • [Machado et al.2017] Marlos C. Machado, Marc G. Bellemare, and Michael H. Bowling. A Laplacian framework for option discovery in reinforcement learning. In ICML, pages 2295–2304, 2017.
  • [Mankowitz et al.2016] Daniel J. Mankowitz, Timothy Arthur Mann, and Shie Mannor. Adaptive skills, adaptive partitions (ASAP). In Advances in Neural Information Processing Systems 29, 2016.
  • [Mann et al.2014] Timothy Arthur Mann, Daniel J. Mankowitz, and Shie Mannor. Time-regularized interrupting options (TRIO). In ICML, pages 1350–1358, 2014.
  • [Mann et al.2015] Timothy Arthur Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. J. Artif. Intell. Res., 53:375–438, 2015.
  • [Minsky1961] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, January 1961.
  • [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [Mnih et al.2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
  • [Neyman1985] Abraham Neyman. Bounded complexity justifies cooperation in the finitely repeated prisoner's dilemma. Economics Letters, 19(3):227–229, January 1985.
  • [Parr and Russell1998] Ronald Parr and Stuart J. Russell. Reinforcement learning with hierarchies of machines. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 1043–1049. MIT Press, 1998.
  • [Petrik and Scherrer2008] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1265–1272, 2008.
  • [Precup2000] Doina Precup. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2000.
  • [Puterman1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
  • [Sennott1991] Linn I. Sennott. Constrained discounted markov decision chains. Probability in the Engineering and Informational Sciences, 5(4):463–475, 1991.
  • [Simon1957] Herbert A. Simon. Models of man: social and rational; mathematical essays on rational human behavior in society setting. Wiley, 1957.
  • [Solway et al.2014] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G. Barto, Yael Niv, and Matthew M. Botvinick. Optimal behavioral hierarchy. PLOS Computational Biology, 10(8):1–10, 08 2014.
  • [Sutton et al.1999a] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 1999.
  • [Sutton et al.1999b] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211, 1999.
  • [Sutton1984] Richard S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.
  • [Thrun and Schwartz1995] Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In NIPS, 1995.

Appendix

Figure 3: Training curves with different deliberation costs on 4 Atari 2600 games. Trained for up to 80M frames.
Figure 4: Average termination probabilities through training, with varying amounts of deliberation costs. With no deliberation, the termination rate quickly goes to 100% (black curve).