Learning with Options that Terminate Off-Policy

11/10/2017 · by Anna Harutyunyan, et al.

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.


1 Introduction

Abstraction is essential for scaling up learning, and there has been renewed interest in methods that extract or leverage it [Vezhnevets et al.2016, Kulkarni et al.2016, Tessler et al.2017]. The options framework [Sutton, Precup, and Singh1999] is the standard for modeling temporal abstraction in reinforcement learning. The temporal aspect of an option is determined by its termination condition $\beta$, which roughly determines its length. Learning and planning with longer options is known to be more efficient [Mann, Mannor, and Precup2015]. This is partly due to an option having similar properties to the familiar multi-step $\lambda$-returns (the similarity is particularly relevant since a full $\beta$-model [Sutton1995] is at the basis of both paradigms), which are known to yield faster convergence [Bertsekas and Tsitsiklis1996]. The key qualitative difference between $\beta$ and $\lambda$, however, is that $\beta$ directly affects the solution rather than, like $\lambda$, just the rate of convergence. If $\beta$ is not trivial, this couples the quality of the solution with the quality of the options at hand. This can be restrictive, especially if the options are not perfect, which of course is likely. Indeed, when a set of options is given, we can show that the more they terminate, the more optimal the resulting policy is at the primitive-action level. This poses a challenge: on the one hand, we wish for the options to be long, to yield fast convergence and meaningful exploration; but on the other hand, if these options are not ideal, the more we commit to them, the poorer the quality of our solution. Interrupting suboptimal options is one way of addressing this [Sutton, Precup, and Singh1999], but like “cutting” traces in off-policy learning, it may prevent us from following a coherent policy for more than a couple of steps.

To address this, we propose to terminate options off-policy, that is: to decouple the behavior termination condition that the options execute with, from the target termination condition that is to be factored into the solution. The behavior terminations then become solely a means of influencing convergence speed. We describe a new algorithm, Q(β), that achieves this by leveraging connections to several old and new multi-step off-policy temporal difference algorithms.

The paper is organized as follows. After introducing the relevant background, we derive the fixed point of the classical call-and-return option operator. Using its shape for intuition, we then incrementally build up to our proposed off-policy termination option operator, starting from the classical intra-option equations. We analyze the convergence properties of this operator for both policy evaluation and control, and state the corresponding online algorithm Q(β). Finally, we validate Q(β) empirically, and show that it holds up to its motivating claims: it can (1) learn to evaluate a task w.r.t. options terminating off-policy, and (2) learn an optimal solution from suboptimal options quicker than the alternatives.

2 Framework and notation

We assume the standard reinforcement learning setting [Sutton and Barto2017] of an MDP $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of discrete actions; $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ the transition function that specifies the environment dynamics, with $P(s'|s,a)$ (or $P^a_{ss'}$ in short) denoting the probability of transitioning to state $s'$ upon taking action $a$ in $s$; $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0,1)$ the scalar discount factor. A policy $\pi$ is a probabilistic mapping from states to actions. For a policy $\pi$, and the MDP transition matrix $P$, let the matrix $P^\pi$ denote the dynamics of the induced Markov chain:

$$P^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s)\, P^a_{ss'},$$

and $r^\pi$ the reward expected for each state under $\pi$:

$$r^\pi_s = \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s,a).$$

Consider the Q-function as a mapping $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and define the one-step transition operator over Q-functions as:

$$(P^\pi Q)(s,a) = \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q(s',a'). \qquad (1)$$

Using operator notation, define the Q-function corresponding to the value of policy $\pi$ as:

$$Q^\pi := \sum_{t \ge 0} \gamma^t (P^\pi)^t r = r + \gamma P^\pi Q^\pi,$$

the recursive form of which defines the Bellman equation for the policy $\pi$ [Bellman1957]. The corresponding one-step Bellman operator $T^\pi$ can be applied to any Q-function $Q$:

$$T^\pi Q := r + \gamma P^\pi Q, \qquad (2)$$

and its repeated applications are guaranteed to produce its fixed point $Q^\pi$ [Puterman1994]. The policy evaluation setting is concerned with estimating this quantity for a given policy $\pi$.

The Bellman optimality operator $T$ introduces maximization over the set of policies: $T Q := r + \gamma \max_\pi P^\pi Q$, and its repeated applications are guaranteed to produce its fixed point $Q^*$, which is the value of the optimal policy $\pi^*$. The control setting is concerned with finding this value. A policy is greedy w.r.t. a Q-function if at each state it picks the action of maximum value. The optimal policy $\pi^*$ is greedy w.r.t. the optimal Q-function $Q^*$.

Throughout, we will write $\mathbf{1}$ and $\mathbf{0}$ for the vectors with all components equal to 1 or 0. For a policy $\pi$, we will use $\mathbb{E}_\pi$ as a shorthand for the expectation w.r.t. the trajectory $s_0, a_0, r_0, s_1, a_1, \ldots$, with $a_t$ drawn according to $\pi(\cdot|s_t)$ and $s_{t+1}$ drawn according to $P(\cdot|s_t, a_t)$. We will occasionally use $\mathbb{E}_\pi Q(s, \cdot)$ to denote $\sum_a \pi(a|s)\, Q(s,a)$, and write arguments in subscripts (i.e. use $Q_{s,a}$ and $Q(s,a)$ equivalently). In the learning setting, the algorithms often update with a sample of the residual:

$$\delta_t = r_t + \gamma\, \mathbb{E}_\pi Q(s_{t+1}, \cdot) - Q(s_t, a_t),$$

which is referred to as a temporal difference (TD) error.
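As a concrete illustration, here is a minimal Python sketch of this expected TD error for a single transition; the tabular array shapes and the helper name are illustrative assumptions, not from the paper.

```python
import numpy as np

def td_error(Q, pi, s, a, r, s_next, gamma=0.99):
    """One-step TD error: delta = r + gamma * E_pi[Q(s', .)] - Q(s, a).

    Q  : np.ndarray of shape [num_states, num_actions], current estimate.
    pi : np.ndarray of shape [num_states, num_actions], target policy pi(a|s).
    """
    expected_next = np.dot(pi[s_next], Q[s_next])  # E_pi Q(s', .)
    return r + gamma * expected_next - Q[s, a]

# Example: a 3-state, 2-action problem with a uniform target policy.
Q = np.zeros((3, 2))
pi = np.full((3, 2), 0.5)
delta = td_error(Q, pi, s=0, a=1, r=1.0, s_next=2)
```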

2.1 Multi-step off-policy TD learning

One need not consider only single-step operators, and may apply $T^\pi$ and $T$ repeatedly. The $\lambda$-operator is a particularly flexible mixture of such multi-step operators:

$$T^\pi_\lambda := (1-\lambda) \sum_{n \ge 0} \lambda^n (T^\pi)^{n+1}.$$

In the learning setting, this corresponds to considering multi-step returns, rather than one-step samples, for updating the estimate, and $\lambda$ trades off the bias of bootstrapping with an approximate estimate against the variance of sampling multiple steps. A high intermediate value of $\lambda$ is typically best in practice [Kearns and Singh2000].

Off-policy learning is the setting where the behavior and target policies are decoupled, that is: $\mu \neq \pi$. Multi-step methods pose a challenge when considered off-policy, and advances have recently been made in this direction [Munos et al.2016, Mahmood, Yu, and Sutton2017]. To this end, Munos et al. (2016) unified several off-policy return-based algorithms under a common umbrella:

$$\mathcal{R} Q(s,a) := Q(s,a) + \mathbb{E}_\mu\bigg[\sum_{t \ge 0} \gamma^t \Big(\prod_{i=1}^{t} c_i\Big)\, \delta_t\bigg], \qquad (3)$$

where

$$\delta_t = r_t + \gamma\, \mathbb{E}_\pi Q(s_{t+1}, \cdot) - Q(s_t, a_t),$$

and the trace $c_i = c(a_i, s_i)$ is a state-action coefficient, whose specific shapes correspond to different algorithms. In particular, Tree-Backup(λ) sets $c_i$ to $\lambda\, \pi(a_i|s_i)$ [Precup, Sutton, and Singh2000], while Retrace(λ) sets $c_i$ to $\lambda \min\big(1, \frac{\pi(a_i|s_i)}{\mu(a_i|s_i)}\big)$ [Munos et al.2016].
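For concreteness, the small sketch below computes the per-step trace coefficient $c_i$ for the two algorithms just named; the function is a hypothetical helper for illustration, not code from any of the cited papers.

```python
def trace_coefficient(algorithm, lam, pi_a, mu_a):
    """Per-step trace c_i for the general off-policy operator in Eq. (3).

    lam  : eligibility-trace parameter lambda.
    pi_a : target-policy probability pi(a_i | s_i) of the action actually taken.
    mu_a : behavior-policy probability mu(a_i | s_i) of the action actually taken.
    """
    if algorithm == "tree-backup":   # c_i = lambda * pi(a_i|s_i)
        return lam * pi_a
    if algorithm == "retrace":       # c_i = lambda * min(1, pi(a_i|s_i) / mu(a_i|s_i))
        return lam * min(1.0, pi_a / mu_a)
    raise ValueError(algorithm)
```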

Unifying algorithm

Recently, De Asis et al. (2017) formulated an algorithm, Q(σ), that encompasses the ones discussed so far even more generally. Instead of the binary taxonomy of on- and off-policy algorithms, the authors introduce a parameter $\sigma$ that smoothly transitions between the two. This algorithm can also be expressed via Eq. (3) with:

$$c_i = \lambda\big(\sigma + (1-\sigma)\,\pi(a_i|s_i)\big), \qquad \delta^\sigma_t = r_t + \gamma\big(\sigma\, Q(s_{t+1}, a_{t+1}) + (1-\sigma)\, \mathbb{E}_\pi Q(s_{t+1}, \cdot)\big) - Q(s_t, a_t).$$

In particular, $\sigma = 1$ corresponds to the on-policy SARSA(λ) algorithm, while $\sigma = 0$ to Tree-Backup(λ).
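The σ-mixture can be made concrete with a short sketch of the Q(σ) bootstrap target, which blends the sampled (SARSA-style) and expected (Tree-Backup-style) next-state values; the helper below is illustrative and assumes a tabular Q.

```python
import numpy as np

def q_sigma_target(Q, pi, r, s_next, a_next, sigma, gamma=0.99):
    """Bootstrap target mixing SARSA (sigma=1) and Tree-Backup (sigma=0)."""
    sampled = Q[s_next, a_next]                # SARSA-style sampled value
    expected = np.dot(pi[s_next], Q[s_next])   # expected-SARSA / Tree-Backup value
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```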

2.2 Options

An option $o$ is a tuple $(\mathcal{I}_o, \beta_o, \pi_o)$, with $\mathcal{I}_o \subseteq \mathcal{S}$ the initiation set, from which the option may start, $\beta_o : \mathcal{S} \to [0,1]$ the probabilistic termination condition, and $\pi_o$ the option policy with which it navigates through the environment. Just like the MDP reward and transition models $r$ and $P$, options can be seen to induce semi-MDP [Puterman1994] models $r^o$ and $P^o$ as follows [Sutton, Precup, and Singh1999]:

$$r^o_s = \mathbb{E}_o\big[r_0 + \gamma r_1 + \cdots + \gamma^{\tau_s - 1} r_{\tau_s - 1} \,\big|\, s_0 = s\big], \qquad P^o_{ss'} = \mathbb{E}_o\big[\gamma^{\tau_{ss'}} \,\big|\, s_0 = s\big], \qquad (4)$$

where $\tau_s$ and $\tau_{ss'}$ denote the option duration from state $s$ and the travel time between states $s$ and $s'$, respectively, with the expectations taken w.r.t. the option dynamics ($\pi_o$ and $P$) and the termination condition $\beta_o$.

In the call-and-return model of option execution, an option is run until completion (according to its termination condition), and only then is a new option choice made [Precup, Sutton, and Singh1998]. This suggests the following state-option analogues of the state-action transition operator from Eq. (1) and the Bellman operator from Eq. (2). For a policy over options $\mu$:

$$(\mathcal{P}^\mu Q)(s,o) = \sum_{s'} P^o_{ss'} \sum_{o'} \mu(o'|s')\, Q(s',o'), \qquad (5)$$
$$(\mathcal{T}^\mu Q)(s,o) = r^o_s + (\mathcal{P}^\mu Q)(s,o). \qquad (6)$$

It will also be relevant to consider the marginal flat policy over primitive actions:

$$\pi_\mu(a|s) := \sum_{o \in \mathcal{O}} \mu(o|s)\, \pi_o(a|s). \qquad (7)$$

For simplicity, we assume that options can initiate anywhere: $\mathcal{I}_o = \mathcal{S}$ for all $o$. Finally, in the main departure from the standard framework, we will distinguish between the target termination condition $\beta$ that is to be estimated, and a behavior termination condition $\zeta$ (“zeta”) with which the options actually terminate.
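To make the option formalism concrete, the following sketch represents an option as an (initiation set, termination condition, policy) triple and runs it in the call-and-return model until it terminates; the class layout and the environment interface `env_step(s, a) -> (next_state, reward)` are illustrative assumptions, not the paper's code.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set[int]              # I_o: states where the option may start
    beta: Callable[[int], float]      # beta_o(s): probability of terminating in s
    policy: Callable[[int], int]      # pi_o(s): action to take in s

def run_call_and_return(env_step, option, s):
    """Follow `option` from state `s` until it terminates (call-and-return model).

    Returns the visited states, the collected rewards, and the termination state.
    Note: an option whose beta is identically zero never returns control.
    """
    states, rewards = [s], []
    while True:
        s_next, r = env_step(s, option.policy(s))
        states.append(s_next)
        rewards.append(r)
        if random.random() < option.beta(s_next):   # terminate with probability beta_o(s')
            return states, rewards, s_next
        s = s_next
```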

3 The call-and-return operator

Before proceeding to describe our key idea and algorithm, let us quantify the role of $\beta$ in the target solution of planning with options. To do this, we derive this solution at the primitive-action resolution.

Let $\mu$ be an arbitrary policy over options, and $\nu : \mathcal{S} \times \mathcal{O} \to [0,1]$ a coefficient function. Consider the following transition operator and its corresponding Bellman operator:

$$(P^\mu_\nu Q)(s,o) = \sum_a \pi_o(a|s) \sum_{s'} P(s'|s,a)\, \nu(s',o) \sum_{o'} \mu(o'|s')\, Q(s',o'), \qquad T^\mu_\nu Q := r + \gamma P^\mu_\nu Q,$$

where, overloading notation, $r(s,o) := \sum_a \pi_o(a|s)\, r(s,a)$ is the one-step expected reward under the option policy, and $\beta$ (resp. $1-\beta$) denotes the vector of termination (resp. continuation) probabilities $\beta_o(s)$ over all state-option pairs. Note that $P^\mu_\nu$, like $P^\pi$, is a one-step operator, whereas $\mathcal{P}^\mu$ is an option-level operator. The transition operator in particular defines the following operators corresponding to option continuation and termination, respectively:

$$P^\iota_{1-\beta} \qquad \text{and} \qquad P^\mu_\beta.$$

That is: $\iota$ (“iota”) is the policy over options that maintains the current (argument) option. Using these operators, we can express $\mathcal{P}^\mu$ from Eq. (5) and the reward model $r^o$ from Eq. (4) concisely for all state-option pairs:

$$\mathcal{P}^\mu = \sum_{k \ge 0} \big(\gamma P^\iota_{1-\beta}\big)^k\, \gamma P^\mu_\beta, \qquad r^o = \sum_{k \ge 0} \big(\gamma P^\iota_{1-\beta}\big)^k\, r,$$

and rewrite the option-level Bellman operator from Eq. (6):

$$\mathcal{T}^\mu Q = \sum_{k \ge 0} \big(\gamma P^\iota_{1-\beta}\big)^k \big(r + \gamma P^\mu_\beta Q\big). \qquad (8)$$

The following proposition derives the fixed point of $\mathcal{T}^\mu$ in terms of the one-step operators $P^\iota_{1-\beta}$ and $P^\mu_\beta$.

Proposition 1.

The fixed point of the operator $\mathcal{T}^\mu$ is the same as the fixed point of the one-step operator $Q \mapsto r + \gamma P^\iota_{1-\beta} Q + \gamma P^\mu_\beta Q$, and writes:

$$Q_{\mu,\beta} = \big(I - \gamma P^\iota_{1-\beta} - \gamma P^\mu_\beta\big)^{-1} r. \qquad (9)$$

Thus, the termination scheme $\beta$ directly affects the convergence limit: in one extreme, $\beta = 0$, options never terminate, and we have the fixed point of $Q \mapsto r + \gamma P^\iota_{\mathbf{1}} Q$, i.e. the value $Q^{\pi_o}$ of each option policy $\pi_o$. In the other extreme, $\beta = 1$, the options terminate at every step and we have the fixed point of $Q \mapsto r + \gamma P^\mu_{\mathbf{1}} Q$, which can be shown (Theorem 1 in [Bacon and Precup2016]) to correspond to the value of the marginal policy $\pi_\mu$ from Eq. (7):

$$Q_{\mu,\mathbf{1}}(s,o) = \sum_a \pi_o(a|s)\, Q^{\pi_\mu}(s,a). \qquad (10)$$
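The role of $\beta$ in the fixed point is visible directly in the one-step backup it induces: with weight $1-\beta$ the current option continues (and its own future value is used), and with weight $\beta$ it terminates and a new option is re-selected according to $\mu$. The sketch below implements that expected backup for a single state-option pair under a known model; the array layout and names are illustrative assumptions.

```python
import numpy as np

def option_backup(Q, P, R, pi_o, mu, beta, s, o, gamma=0.99):
    """Expected one-step backup for the pair (s, o):

        E_{a~pi_o}[ r(s,a) + gamma * sum_{s'} P(s'|s,a) *
                    ( (1-beta(s')) Q(s',o) + beta(s') sum_{o'} mu(o'|s') Q(s',o') ) ]

    Q    : [S, O] option values, P : [S, A, S] transitions, R : [S, A] rewards,
    pi_o : [S, A] policy of option o, mu : [S, O] policy over options,
    beta : [S] termination probabilities of option o.
    """
    value = 0.0
    for a, pa in enumerate(pi_o[s]):
        if pa == 0.0:
            continue
        cont = (1.0 - beta) * Q[:, o]            # option continues in s'
        term = beta * (mu * Q).sum(axis=1)       # option terminates, new o' ~ mu(.|s')
        value += pa * (R[s, a] + gamma * np.dot(P[s, a], cont + term))
    return value
```

With `beta` identically zero this reduces to evaluating the option's own policy, and with `beta` identically one it re-selects an option at every step, recovering the two extremes discussed above.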

4 Off-policy option termination

We would like to decouple the target termination condition $\beta$ that factors into the solution from the behavior termination condition $\zeta$ that governs how long the options are followed. Apart from the theoretical appeal of this freedom, a key motivation lies in the fact that, on the one hand, just like with multi-step returns, the less options terminate, the faster the convergence; on the other hand, the more options terminate, the better the control solution (as we show formally in the next section). Decoupling the two allows one to achieve the best of both worlds, and is exactly what we propose to do in this paper.

The critical insight in our approach is the off-policy-ness at the primitive-action level introduced by the discrepancy between the policies $\mu$ (which picks a new option) and $\iota$ (which maintains the current option) in Eq. (9). The degree of this off-policy-ness is modulated exactly by the termination condition, and is usually implicit. We propose to leverage multi-step off-policy learning and to impose a desired degree $\beta$ explicitly, independent of the behavior terminations $\zeta$. (Technically, this should be referred to as off-termination learning, since the difference is indeed in termination conditions, of which the induced policies are a consequence: the “behavior policy” takes the current option w.p. 1, while the “target policy” is the actual policy over options $\mu$.) In one extreme, we can learn the value of the marginal policy $\pi_\mu$ directly. In the other extreme, we can learn the value of the current option. The two extremes are traded off via $\beta$. The algorithm we propose is analogous to the unifying algorithm Q(σ), in which $\sigma$ modulates the degree of off-policy-ness [Asis et al.2017].

To highlight the parallels between learning with options and learning off-policy from multi-step returns, this section will incrementally build the technical intuition towards the proposed off-policy termination operator, starting from the classical intra-option equations. The next section will analyze the convergence of this operator.

4.1 From one-step intra-option learning to General Q(λ)

To begin, let us first re-derive the target from Proposition 1, starting from the familiar intra-option equations [Sutton and Precup1998]. At this point, let us assume that the options execute w.r.t. some behavior termination condition $\zeta$. Letting $\Delta Q$ denote the update on the estimated Q-function, we have at time $t$, for the current option $o$:

$$\Delta Q(s_t, o) = r_t + \gamma\Big[(1-\zeta_{t+1})\, Q(s_{t+1}, o) + \zeta_{t+1} \sum_{o'} \mu(o'|s_{t+1})\, Q(s_{t+1}, o')\Big] - Q(s_t, o), \qquad (11)$$

where as before we write $\zeta_{t+1} := \zeta(s_{t+1}, o)$. Notice that this is exactly a sample of the one-step update corresponding to $r + \gamma P^\iota_{1-\zeta} Q + \gamma P^\mu_\zeta Q$. In fact, if we roll it out over multiple steps and take an expectation, we obtain Eq. (8), with the behavior terminations $\zeta$ in place of $\beta$, exactly:

$$\mathcal{T}^\mu_\zeta Q = \sum_{k \ge 0} \big(\gamma P^\iota_{1-\zeta}\big)^k \big(r + \gamma P^\mu_\zeta Q\big), \qquad (12)$$

from which some simple algebra yields:

$$\mathcal{T}^\mu_\zeta Q = Q + \sum_{k \ge 0} \big(\gamma P^\iota_{1-\zeta}\big)^k \big(r + \gamma P^\iota_{1-\zeta} Q + \gamma P^\mu_\zeta Q - Q\big). \qquad (13)$$

This is the same (expected) update as Peng’s Q(λ) for greedy policies [Peng and Williams1996], or General Q(λ) (i.e. off-policy Expected SARSA(λ) [Van Seijen et al.2009]) for arbitrary policies, with $1-\zeta$ being the state-option analogue of $\lambda$ from those algorithms. The fixed point of both of those algorithms is given in Proposition 1 of [Harutyunyan et al.2016], and is indeed analogous to the one given in Proposition 1 of this paper.

4.2 From General Q(λ) to Tree-Backup(λ)

The update (13) converges to the fixed point from Prop. 1 (with $\zeta$ in the role of $\beta$), which is an on-/off-policy mixture regulated by the behavior terminations $\zeta$. We wish to reweigh this mixture by the desired target terminations $\beta$. Let us now begin introducing the off-policy machinery that will allow us to do so. It will be useful to cast the multi-step intra-option update (13) in the form of the general off-policy operator (3). This can be done by replacing the second expectation in the TD error with a point estimate, which introduces off-policy corrections. The resulting update will converge to the intended “off-policy” fixed point only if $\mu$ and $\iota$ are sufficiently close [Harutyunyan et al.2016], which, in our particular case of the indicator behavior policy $\iota$, is not a very interesting scenario. If we further augment the “trace” with the policy probability coefficient $\mu(o|s_i)$, we obtain option-level Tree-Backup(λ) [Precup, Sutton, and Singh2000], whose target policy is $\mu$ and behavior policy is $\iota$, with $1-\zeta$ being the state-option analogue of $\lambda$:

$$\mathcal{T}^{\mathrm{TB}} Q(s,o) = Q(s,o) + \mathbb{E}\bigg[\sum_{t \ge 0} \gamma^t \Big(\prod_{i=1}^{t} (1-\zeta_i)\, \mu(o|s_i)\Big) \Big(r_t + \gamma \sum_{o'} \mu(o'|s_{t+1})\, Q(s_{t+1}, o') - Q(s_t, o)\Big)\bigg], \qquad (14)$$

where the expectation is over trajectories generated by following the option policy $\pi_o$.

From the convergence guarantees of Tree-Backup, we know that this update converges to the fixed point of $Q \mapsto r + \gamma P^\mu_{\mathbf{1}} Q$, which in turn corresponds to the value of the marginal policy $\pi_\mu$ (Eq. (10)).

4.3 The off-policy termination operator

We are now ready to present the operator underlying the Q(β) algorithm, which is the main contribution of this paper. Eq. (14) can be considered a special case in which we correct all of the off-policy-ness, thus implicitly assuming $\beta = 1$. To obtain the general case, we need to split the target in each TD error into an off- and an on-policy term:

$$\mathcal{T}^\mu_{\zeta,\beta} Q(s,o) = Q(s,o) + \mathbb{E}\bigg[\sum_{t \ge 0} \gamma^t \Big(\prod_{i=1}^{t} (1-\zeta_i)\big((1-\beta_i) + \beta_i\, \mu(o|s_i)\big)\Big)\, \delta^\beta_t\bigg], \qquad (15)$$
$$\delta^\beta_t = r_t + \gamma\Big[(1-\beta_{t+1})\, Q(s_{t+1}, o) + \beta_{t+1} \sum_{o'} \mu(o'|s_{t+1})\, Q(s_{t+1}, o')\Big] - Q(s_t, o),$$

where $\beta_i := \beta(s_i, o)$, $\zeta_i := \zeta(s_i, o)$, and the expectation is over trajectories generated by following the option policy $\pi_o$.

Inspecting these updates, it is clear that the parameter $\beta$ controls the degree of partial “off-policy-ness”: in the next state, with weight $1-\beta_{t+1}$ the option continues and the update considers the “on-policy” current option value $Q(s_{t+1}, o)$, and with weight $\beta_{t+1}$ the option terminates and the update considers the “off-policy” value w.r.t. the policy over options, $\sum_{o'} \mu(o'|s_{t+1})\, Q(s_{t+1}, o')$. Note that these coefficients appear in the correction (trace) terms as well, since in order for the algorithm to converge to the correct on-/off-policy mixed target, the $\beta$-weighted off-policy portion needs to be corrected. Finally, note that in the learning setting the second factor of the trace, $(1-\beta_i) + \beta_i\, \mu(o|s_i)$, is applied explicitly, while the factor $(1-\zeta_i)$ is realized by sampling from the current option. Algorithm 1 presents the forward view of the algorithm underlying this expected operator, for the general case of an evolving sequence of policies $(\mu_k)_{k \in \mathbb{N}}$.

This algorithm is very similar to the recently formalized Q(σ) [Asis et al.2017, Sutton and Barto2017], with $\beta$ being a state-option generalization of $1-\sigma$. In Q(σ), the parameter $\sigma$ controls the degree of off-policy-ness: $\sigma = 1$ corresponds to (on-policy) SARSA, and $\sigma = 0$ to (off-policy) Tree-Backup. Analogously, Q(β) with $\beta = 0$ learns the “on-policy” value of the current option policy (i.e. the policy $\iota$), and Q(β) with $\beta = 1$ learns the “off-policy” value of the marginal policy (i.e. the policy $\mu$). The behavior terminations $\zeta$, on the other hand, have a role analogous to that of the eligibility trace parameter $\lambda$.

0:   Require: option set $\mathcal{O}$, target termination function $\beta$, initial Q-function $Q_0$, step-sizes $(\alpha_k)_{k \ge 0}$, start state $s_0$
1:   $t \leftarrow 0$, $\Delta Q \leftarrow 0$
2:   for $k = 0, 1, 2, \ldots$ do
3:      Sample an option $o$ from $\mu_k(\cdot|s_t)$
4:      Follow $\pi_o$ until it terminates at time $T$ ($T$ is determined by sampling $\zeta$), observing $s_t, r_t, \ldots, r_{T-1}, s_T$
5:      for $i = t, \ldots, T-1$ do
6:         $\delta_i \leftarrow r_i + \gamma\big[(1-\beta(s_{i+1},o))\, Q_k(s_{i+1},o) + \beta(s_{i+1},o) \sum_{o'} \mu_k(o'|s_{i+1})\, Q_k(s_{i+1},o')\big] - Q_k(s_i,o)$
7:         $c_i \leftarrow (1-\beta(s_i,o)) + \beta(s_i,o)\, \mu_k(o|s_i)$
8:         $\Delta Q(s_j,o) \leftarrow \Delta Q(s_j,o) + \gamma^{i-j}\big(\prod_{l=j+1}^{i} c_l\big)\, \delta_i$, for $j = t, \ldots, i$
9:      end for
10:     $Q_{k+1} \leftarrow Q_k + \alpha_k \Delta Q$;  $\Delta Q \leftarrow 0$;  $t \leftarrow T$
11:  end for
Algorithm 1 The Q(β) algorithm (forward view)
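Below is a minimal tabular Python sketch in the spirit of the forward view above: it assumes a trajectory already generated by following $\pi_o$ until its $\zeta$-sampled termination, computes the $\beta$-mixed TD errors and trace products described in this section, and updates every visited state-option pair. Variable names, the in-place update, and the omission of the evolving policy $\mu_k$ and step-size schedule are simplifying assumptions, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def q_beta_forward_update(Q, mu, o, states, rewards, beta, alpha=0.1, gamma=0.99):
    """Forward-view update for one run of option `o` (updates Q in place).

    Q       : [S, O] option-value table.
    mu      : [S, O] policy over options.
    states  : visited states s_t, ..., s_T (option followed until its sampled termination T).
    rewards : rewards r_t, ..., r_{T-1}.
    beta    : [S] target termination probabilities beta(., o).
    """
    n = len(rewards)
    deltas, traces = np.zeros(n), np.ones(n)
    for i in range(n):
        s, s_next = states[i], states[i + 1]
        mixed = ((1.0 - beta[s_next]) * Q[s_next, o]                # "on-policy" part: continue with o
                 + beta[s_next] * np.dot(mu[s_next], Q[s_next]))    # "off-policy" part: re-select via mu
        deltas[i] = rewards[i] + gamma * mixed - Q[s, o]
        traces[i] = (1.0 - beta[s]) + beta[s] * mu[s, o]            # prob. of still following o under beta
    for j in range(n):   # every visited pair (s_j, o) gets its trace-weighted sum of later TD errors
        correction, weight = 0.0, 1.0
        for i in range(j, n):
            correction += weight * deltas[i]
            if i + 1 < n:
                weight *= gamma * traces[i + 1]
        Q[states[j], o] += alpha * correction
```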

4.4 Relationship with intra-option learning

The off-policy-ness discussed so far is different from that in the more familiar off-policy intra-option setting. The intra-option learning algorithm suggests applying the update (11) to all options “consistent with” the experience stream [Sutton and Precup1998]. When considering multiple steps, this amounts to introducing importance sampling into Eq. (12). That is, given a behavior option $o_b$, the trace coefficient $(1-\zeta_i)$ from Eq. (12) now becomes:

$$(1-\zeta_i)\, \frac{\pi_o(a_i|s_i)}{\pi_{o_b}(a_i|s_i)}. \qquad (16)$$

Hence, writing $\rho_i$ for the importance-sampling ratio $\frac{\pi_o(a_i|s_i)}{\pi_{o_b}(a_i|s_i)}$, the value of an option $o$ from Eq. (12) can be written with the corrected trace $(1-\zeta_i)\,\rho_i$ in place of $(1-\zeta_i)$. Now, notice that there are two sources of off-policy-ness in these formulas. One is $\pi_o$ vs. $\pi_{o_b}$, the contrast between option policies, and the other is in the target: $\mu$ vs. $\iota$ itself, as discussed before. Indeed, writing the above in the form from Eq. (13) yields a different correction than the one appearing in Eq. (15).

Since the corrections for the two sources of off-policy-ness are orthogonal, it should be possible to combine them. We leave this for future work.

5 Analysis

In this section we will analyze the convergence behavior of the off-policy termination operator in both policy evaluation and control settings, and show that learning about shorter target options off-policy is generally asymptotically more efficient than on-policy. We will then consider the relationship of the solution quality in control with option duration, and show that shorter options generally yield better solutions.

5.1 Policy evaluation

We will prove that the evaluation operator is a contraction around the appropriate fixed point, and that its contraction factor is smaller than that of the respective on-policy operator, provided the behavior options are longer than the target ones.

Theorem 1 (Policy evaluation).

The operator $\mathcal{T}^\mu_{\zeta,\beta}$ defined in Eq. (15) has a unique fixed point $Q_{\mu,\beta}$, as defined in Eq. (9). Furthermore, for any Q-function $Q$:

$$\big\|\mathcal{T}^\mu_{\zeta,\beta} Q - Q_{\mu,\beta}\big\| \le \eta\, \big\|Q - Q_{\mu,\beta}\big\|,$$

where

$$\eta = \max_{s,o}\bigg[1 - (1-\gamma)\, \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t \prod_{i=1}^{t} c_i\Big]\bigg] \le \gamma < 1, \qquad c_i = (1-\zeta_i)\big((1-\beta_i) + \beta_i\, \mu(o|s_i)\big).$$

Proof.

The proof is analogous to that of Theorem 1 from [Munos et al.2016] and is given in appendix. ∎

Figure 1: Prediction error on the 19-state chain task. Each variant is an average of 10 seeds. Left: Sum error for each $\zeta$–$\beta$ combination. Q(β) always gets more efficient as $\zeta$ decreases (the behavior options get longer). Right: Example learning curves. The lines for some of the settings lie outside the axes' bounds. The shaded region covers the standard deviation.

The contraction coefficient $\eta$ controls the convergence speed of this operator: the smaller $\eta$ (and the larger the traces), the fewer iterations are needed to converge, but the larger the computational expense when planning, or the variance when learning [Bertsekas and Ioffe1996, Munos et al.2016]. Since the traces are bounded by 1, and options terminate eventually, the variance is less significant here, and generally, larger traces will yield faster convergence. In our case, since a behavior option is assumed to be followed (i.e. the factor $1-\zeta_i$ in Eq. (15) is fixed), the additional $\beta$-term in the trace can only reduce it. However, since we are interested in learning about a different target, we ought to compare these traces to the setting in which that same target is learnt on-policy. The following corollary derives a sufficient condition for Q(β) to maintain larger traces than its on-policy counterpart.

Corollary 1.

If for all states $s$ and options $o$ we have:

$$(1-\zeta(s,o))\big((1-\beta(s,o)) + \beta(s,o)\, \mu(o|s)\big) \ge 1 - \beta(s,o),$$

then the iteration (15) converges faster than if $\beta$ were used on-policy, in place of $\zeta$, in iteration (12).

In particular, if $\beta = 1$, any $\zeta$ satisfies this, irrespective of $\mu$. If $\mu$ is deterministic, this holds for all $\zeta \le \beta$ for the chosen option. In general, the intuition is that it is easier to learn from longer option traces about shorter options than vice versa.

5.2 Control

Let us now formulate the control analogue of Theorem 1. Its proof is a simpler version of that of Theorem 2 from [Munos et al.2016], and we omit it here.

Theorem 2 (Control).

Consider a sequence of policies over options $(\mu_k)_{k \in \mathbb{N}}$ that are greedy w.r.t. the sequence of estimates $(Q_k)_{k \in \mathbb{N}}$, and consider the update:

$$Q_{k+1} = \mathcal{T}_k Q_k,$$

where the operator $\mathcal{T}_k$ is defined by Eq. (15) for the $k$-th policy $\mu_k$. Let $Q^*_\beta$ denote the optimal Q-function over options w.r.t. the target terminations $\beta$. Suppose that $\mathcal{T}_0 Q_0 \ge Q_0$ (this can be attained by pessimistic initialization, e.g. $Q_0 \equiv \min_{s,a} r(s,a)/(1-\gamma)$). Then for any $k$:

$$\|Q_{k+1} - Q^*_\beta\| \le \gamma\, \|Q_k - Q^*_\beta\|.$$

It follows that $Q_k \to Q^*_\beta$ as $k \to \infty$.

We do not give a proof of online convergence at this time, but verify it empirically in our experiments.

5.3 Option duration and solution quality

Our convergence results show that learning about shorter options off-policy is more efficient than on-policy. Here, we motivate why one would want to learn about shorter options in the first place. In particular, we will show that given a set of options, the more they terminate, the better the resulting control solution at the primitive action resolution. Intuitively, this is because upon termination, the learner picks the current best option, whereas during option execution, the target value includes the potentially suboptimal current option (Proposition 1). The following theorem (proof in appendix) formalizes this intuition, and may be of independent interest.

Theorem 3 (The more options terminate, the better the solution.).

Given a set of options $\mathcal{O}$ and a greedy policy $\mu$ over options, let $\beta_1 \ge \beta_2$ be two termination conditions for the options in $\mathcal{O}$. Then:

$$Q^*_{\beta_1} \ge Q^*_{\beta_2},$$

where $Q^*_\beta$ denotes the control solution from Theorem 2 for the termination condition $\beta$.

Note that this result refers to the target solution. During learning, the more decisions there are to make, the more potential there is for error. As such, in reasonably complex tasks we expect the performance to obey a tradeoff on $\beta$.

6 Experiments

Finally, let us evaluate our algorithm empirically. We aim to illustrate the following claims:

  • The learning speed improves as $\zeta$ gets smaller (behavior options get longer);

  • The control performance improves as $\beta$ gets larger (target options get shorter);

  • Q(β) converges with off-policy terminations.

For simplicity, we assume that options terminate deterministically in a set of goal states. We hence reduce $\zeta$ and $\beta$ to single scalar parameters that determine the likelihood of terminating before reaching the goal. The “plain” variant refers to the on-policy intra-option update from Eq. (12). The complete experimental details are provided in Appendix E.

Figure 2: The 19-state random walk task. The agent starts in the middle. Transitions are deterministic, and the task terminates at either end.

6.1 Policy evaluation

First, we show that Q(β) learns the correct values on the 19-state random walk task (Fig. 2). There are two options: one leads all the way to the left, the other to the right. The policy over options is uniform. The task is to estimate the value function w.r.t. target terminations $\beta$. The results are given in Figure 1. Q(β) is able to learn the correct values (up to an irreducible exploration-related error). As expected, Q(β) gets more efficient as the behavior options get longer ($\zeta$ gets smaller). The opposite is true for the plain algorithm, since without interrupting sufficiently often, it does not have a chance to update the intermediate states with anything other than the option policy. There is a small inflection point in the performance of the plain algorithm at the on-policy value $\zeta = \beta$.
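The setup can be sketched in a few lines of Python; the state indexing and the reward placement at the two ends are assumptions for illustration (the paper's exact reward scheme is in its appendix).

```python
import random

N = 19                                        # states 0..18; the agent starts in the middle
START, LEFT_END, RIGHT_END = N // 2, -1, N    # stepping outside the chain ends the task

def step(s, a):
    """Deterministic chain dynamics: a = -1 moves left, a = +1 moves right.
    Assumed rewards: +1 for exiting on the right, -1 on the left, 0 otherwise."""
    s_next = s + a
    if s_next == RIGHT_END:
        return s_next, 1.0, True
    if s_next == LEFT_END:
        return s_next, -1.0, True
    return s_next, 0.0, False

def run_option(s, direction, zeta):
    """Follow the 'go left' (direction=-1) or 'go right' (+1) option from s.
    The option runs toward its end of the chain, interrupted w.p. zeta at each step."""
    transitions, done = [], False
    while not done:
        s_next, r, done = step(s, direction)
        transitions.append((s, r, s_next))
        if done or random.random() < zeta:
            break
        s = s_next
    return transitions

# The uniform policy over the two options used for evaluation:
def mu(s):
    return random.choice([-1, +1])
```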

Figure 3: Control performance. (a) Cliffwalk. Each variant is evaluated on 5 seeds for 10 runs each. Left: Average performance per value of $\beta$ on all seeds. Right: Learning curves for the best seeds per variant. Notice how Q(β) is the only variant that escapes the plateau of the suboptimal policy. (b) Pinball. Each variant is evaluated on 20 independent runs. Top row: Influence of $\zeta$ and $\beta$ on Q(β): performance improves as $\zeta$ gets smaller. Overall, intermediate target $\beta$-s are best, but $\beta = 1$ reaches slightly better performance at the end of learning. Bottom row: Comparison between the variants for select values of $\zeta$.
Figure 4: Left: The modified cliffwalk task. Shaded regions are cliffs. Right: Pinball domain configuration used. The red ball must be moved to the blue hole. Each black diamond indicates an option landmark.

6.2 Control

To demonstrate the benefit of decoupling off- and on-policy terminations, we compare our algorithm with the plain on-policy variant (labelled: onpolicy-plain), which uses $\zeta$ during both learning and evaluation. In order to demonstrate that the learning target plays a role, we also compare with the plain algorithm that uses $\beta$ at evaluation only (labelled: offpolicy-plain). The behavior termination $\zeta$ is held fixed, unless specified otherwise.

Modified Cliffwalk

We illustrate the benefits of off-policy termination on a modified Cliffwalk example (Fig. 4, left). The agent starts in a position inside a grid, with the goal of reaching a corner where a positive reward is given. The step reward is zero, but there are small cliffs along the borders that are not fatal, but induce a penalty. We have four options, one for each cardinal direction, each of which takes the agent all the way to the corresponding border (and cliff). Thus, while these options are able to reach the goal in an optimal number of steps, they are unable to express the optimal policy, which moves only inside the grid so as not to encounter the cliffs. To ensure adequate exploration, we use $\epsilon$-soft option policies, as well as the usual $\epsilon$-greedy policy over options during learning (but not during evaluation). The results are plotted in Fig. 3(a). Q(β) outperforms the alternatives in all cases, and its performance improves with larger $\beta$. Note that it is the only variant able to surpass the value of the suboptimal upper bound of the naive approach.

Pinball

We finally evaluate our algorithm on a variation of the Pinball domain [Konidaris and Barto2009]. Here, a small ball must be maneuvered through a set of obstacles into a hole. Observations consist of four continuous variables describing the ball's position and velocity. There are five primitive actions: the first four apply a small force to either the $x$ or $y$ velocity, and the final action leaves all velocities unchanged. There is a step penalty and a final reward. We define a set of landmark options [Mann, Mannor, and Precup2015] that move the ball near a target location on the board. The agent can initiate and terminate each option within some initiation and termination distances from the respective landmark. To illustrate the benefits of terminating suboptimal options, the landmarks were placed such that the paths from the start to the goal via the landmarks are suboptimal (Fig. 4, right). The results are plotted in Fig. 3(b). As expected, the performance of Q(β) improves drastically with longer behavior options. The influence of $\beta$ is more subtle: intermediate $\beta$-s perform best overall, but $\beta = 1$ reaches slightly better eventual performance.

In a comparison, Q(β) outperforms the on-policy variant that learns with $\zeta$. However, in this domain, the off-policy variant (which learns with $\zeta$ but evaluates with $\beta$) performs comparably to Q(β). This may be due in part to the use of function approximation, which allows the plain update to generalize meaningfully within the option trajectory, and in part to the noisy nature of Pinball, in which there are many near-optimal policies of similar value. Since, as we have seen, Q(β) is the only variant to learn accurate values, we expect it to stand out more in settings where the reward scheme is more intricate.

7 Related work

Much of the related work has already been discussed throughout. We mention a few more relevant works below.

Mann, Mankowitz, and Mannor (2014) propose an algorithm for multi-step option interruption that stems from the same motivation of mitigating poor-quality options. In order to avoid the resulting options becoming too short, they introduce a time-regularization term. Our approach bypasses the need to do so, by interrupting off-policy and by the ability to explicitly specify target terminations.

Ryan (2002) also considers the problem of termination improvement, or interrupting options when they are no longer relevant, in a hybrid decision-theoretic and classical planning setting. The intuitions in that work are particularly aligned with the ones here: in particular, the tradeoff between efficiency and optimality is hinted at, and it is even suggested that “persistence [with an option] is a kind of off-policy exploration, at the behavior level.” In this work, we formalize and exploit this intuition.

Yu and Bertsekas (2012) consider general λ-operators with a state-dependent $\lambda$. In the case of options, the termination condition takes the role of a state-option-dependent $\lambda$. White (2017) proposes to consider termination as part of a transition-based discount instead.

8 Discussion

We propose decoupling the behavior and target termination conditions, just as is done with policies in off-policy learning. We formulate an algorithm, Q(β), for learning target terminations off-policy, analyze the convergence of its expected operator, and validate it empirically, confirming the theoretical intuition that learning shorter target options from longer behavior options is beneficial both computationally and qualitatively. More generally, we cast learning with options into a common framework with well-studied multi-step off-policy temporal difference learning, which allows us to carry over existing results with ease.

Learning longer options from shorter options.

We have assumed here that the options are given, but may not express the optimal policy well. This scenario applies when the options describe simple rules of thumb, or are transferred from a different task. If the options are not given, but learnt end-to-end, the wish typically is to distill meaningful behavior into them. Instead, however, the result often reduces to degenerate options [Bacon, Harb, and Precup2017, Mann, Mankowitz, and Mannor2014]. Being able to impose longer durations on the target off-policy may mitigate this, though it should be noted that our convergence results suggest that learning may not be as efficient in that regime.

Action-level importance sampling.

The option policy term $\mu(o|s_i)$ in the trace is indifferent to the action choice. Thus, if $\mu(o|s_i)$ is small, the trace will be small, even if the action taken is consistent with the option policy $\pi_o$. It would be interesting to replace this term with an importance sampling ratio at the primitive-action level, which corresponds to another multi-step off-policy algorithm, Retrace(λ) [Munos et al.2016]. Another direction towards this goal is to incorporate the intra-option correction from Eq. (16).

Online convergence.

Proving online convergence of Q(β) remains an open problem. Supported by its reliable empirical behavior, we hypothesize that it holds under reasonable conditions, and we plan to investigate this further in the future.

Future.

A natural direction for future work is to learn many $\beta$-s in parallel, as is done in classical off-policy learning [Sutton et al.2011], or to learn a new kind of universal option model (one indifferent not to the reward, as in [Szepesvari et al.2014], but to the termination scheme) using the ideas from [Schaul et al.2015]. It is also worthwhile to extend the convergence results to approximate state spaces, perhaps similarly to [Touati et al.2017].

References

  • [Asis et al.2017] Asis, K. D.; Hernandez-Garcia, J. F.; Holland, G. Z.; and Sutton, R. S. 2017. Multi-step reinforcement learning: A unifying algorithm. CoRR abs/1703.01327.
  • [Bacon and Precup2016] Bacon, P.-L., and Precup, D. 2016. A matrix splitting perspective on planning with options. CoRR abs/1612.00916.
  • [Bacon, Harb, and Precup2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of AAAI, 1726–1734.
  • [Bellman1957] Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press.
  • [Bertsekas and Ioffe1996] Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, MIT.
  • [Bertsekas and Tsitsiklis1996] Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.
  • [Harutyunyan et al.2016] Harutyunyan, A.; Bellemare, M. G.; Stepleton, T.; and Munos, R. 2016. Q(λ) with off-policy corrections. In Proceedings of the Conference on Algorithmic Learning Theory, 305–320.
  • [Kearns and Singh2000] Kearns, M. J., and Singh, S. P. 2000. Bias-variance error bounds for temporal difference updates. In COLT, 142–147.
  • [Konidaris and Barto2009] Konidaris, G., and Barto, A. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 1015–1023.
  • [Kulkarni et al.2016] Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
  • [Mahmood, Yu, and Sutton2017] Mahmood, A. R.; Yu, H.; and Sutton, R. S. 2017. Multi-step off-policy learning without importance sampling ratios. CoRR abs/1702.03006.
  • [Mann, Mankowitz, and Mannor2014] Mann, T.; Mankowitz, D.; and Mannor, S. 2014. Time-regularized interrupting options (TRIO). In Proceedings of the International Conference on Machine Learning, 1350–1358.
  • [Mann, Mannor, and Precup2015] Mann, T. A.; Mannor, S.; and Precup, D. 2015. Approximate value iteration with temporally extended actions. JAIR 53:375–438.
  • [Munos et al.2016] Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 1046–1054.
  • [Peng and Williams1996] Peng, J., and Williams, R. J. 1996. Incremental multi-step q-learning. Machine Learning 22(1-3):283–290.
  • [Precup, Sutton, and Singh1998] Precup, D.; Sutton, R. S.; and Singh, S. 1998. Theoretical results on reinforcement learning with temporally abstract options. In Proceedings of the European conference on machine learning, 382–393.
  • [Precup, Sutton, and Singh2000] Precup, D.; Sutton, R. S.; and Singh, S. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning, 759–766.
  • [Puterman1994] Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition.
  • [Ryan2002] Ryan, M. R. K. 2002. Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In Proceedings of the International Conference on Machine Learning, 522–529.
  • [Schaul et al.2015] Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal value function approximators. In Proceedings of the International Conference on Machine Learning (ICML-15), 1312–1320.
  • [Sutton and Barto2017] Sutton, R. S., and Barto, A. G. 2017. Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
  • [Sutton and Precup1998] Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the International Conference on Machine Learning, 556–564.
  • [Sutton et al.2011] Sutton, R.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, AAMAS, 761–768.
  • [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.
  • [Sutton1995] Sutton, R. S. 1995. Td models: Modeling the world at a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning, 531–539.
  • [Szepesvari et al.2014] Szepesvari, C.; Sutton, R. S.; Modayil, J.; Bhatnagar, S.; et al. 2014. Universal option models. In Advances in Neural Information Processing Systems, 990–998.
  • [Tessler et al.2017] Tessler, C.; Givony, S.; Zahavy, T.; Mankowitz, D. J.; and Mannor, S. 2017. A deep hierarchical approach to lifelong learning in minecraft. In Proceedings of AAAI, 1553–1561.
  • [Touati et al.2017] Touati, A.; Bacon, P.-L.; Precup, D.; and Vincent, P. 2017. Convergent tree-backup and retrace with function approximation. CoRR abs/1705.09322.
  • [Van Seijen et al.2009] Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.
  • [Vezhnevets et al.2016] Vezhnevets, A.; Mnih, V.; Osindero, S.; Graves, A.; Vinyals, O.; Agapiou, J.; and Kavukcuoglu, K. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 3486–3494.
  • [White2017] White, M. 2017. Unifying task specification in reinforcement learning. In Proceedings of the International Conference on Machine Learning, volume 70, 3742–3750.
  • [Yu and Bertsekas2012] Yu, H., and Bertsekas, D. P. 2012. Weighted bellman equations and their applications in approximate dynamic programming. Technical Report LIDS-P-2876, MIT.

Appendix A Proof of Proposition 1

Proof.

From Eq. (8), writing $P^\iota$ for $P^\iota_{1-\beta}$ to reduce clutter, a fixed point $Q$ of $\mathcal{T}^\mu$ satisfies:

$$Q = \sum_{k \ge 0} \big(\gamma P^\iota\big)^k \big(r + \gamma P^\mu_\beta Q\big) = \big(I - \gamma P^\iota\big)^{-1}\big(r + \gamma P^\mu_\beta Q\big),$$

so that $\big(I - \gamma P^\iota - \gamma P^\mu_\beta\big) Q = r$. Solving for $Q$ yields the result. ∎

Appendix B Proof of Theorem 1

Proof.

Write $\Delta Q$ for $Q - Q_{\mu,\beta}$. The fact that the fixed point is the desired one is clear from Eq. (15) and Proposition 1. Recall that:

Now, let us derive the contraction result. Eq. (15) can be written:

Let , and be defined analogously. Since , and after shifting the sum index forward, we have:

where we write for the indicator function. Since and , , we have a linear combination of weighted by non-negative coefficients defined as: