1 Introduction
Abstraction is essential for scaling up learning, and there has been renewed interest in methods that extract or leverage it [Vezhnevets et al. 2016, Kulkarni et al. 2016, Tessler et al. 2017]. The options framework [Sutton, Precup, and Singh 1999] is the standard for modeling temporal abstraction in reinforcement learning. The temporal aspect of an option is determined by its termination condition $\beta$, which roughly determines its length. Learning and planning with longer options is known to be more efficient [Mann, Mannor, and Precup 2015]. This is partly due to an option having similar properties to the familiar multi-step $\lambda$-returns,^1 which are known to yield faster convergence [Bertsekas and Tsitsiklis 1996]. The key qualitative difference between $\beta$ and $\lambda$, however, is that $\beta$ directly affects the solution rather than, like $\lambda$, just the rate of convergence. If $\beta$ is not trivial, this couples the quality of the solution with the quality of the options at hand. This can be restrictive, especially if the options are not perfect, which of course is likely. Indeed, when a set of options is given, we can show that the more they terminate, the more optimal the resulting policy is at the primitive action level. This poses a challenge: on the one hand, we wish for the options to be long to yield fast convergence and meaningful exploration, but on the other, if these options are not ideal, the more we commit to them, the poorer the quality of our solution. Interrupting suboptimal options is one way of addressing this [Sutton, Precup, and Singh 1999], but like "cutting" traces in off-policy learning, it may prevent us from following a coherent policy for more than a couple of steps.

^1 The similarity is particularly relevant since a full model [Sutton 1995] is at the basis of both paradigms.

To this end, we propose to terminate options off-policy, that is: decouple the behavior termination condition that the options execute with from the target termination condition that is to be factored into the solution. The behavior terminations then solely become a means of influencing convergence speed. We describe a new algorithm, Q($\beta$), that achieves this by leveraging connections to several old and new multi-step off-policy temporal difference algorithms.
The paper is organized as follows. After introducing relevant background, we will derive the fixed point of the classical call-and-return option operator. Using its shape for intuition, we will then incrementally build up to our proposed off-policy termination option operator, starting from the classical intra-option equations. We will analyze the convergence properties of this operator for both policy evaluation and control, and state the corresponding online algorithm Q($\beta$). Finally, we will validate Q($\beta$) empirically, and show that it can (1) learn to evaluate a task w.r.t. options terminating off-policy, and (2) learn an optimal solution from suboptimal options quicker than the alternatives.
2 Framework and notation
We assume the standard reinforcement learning setting [Sutton and Barto 2017] of an MDP $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of discrete actions; $p$ the transition that specifies the environment dynamics, with $p(s'|s,a)$ (or $p^a_{ss'}$ in short) denoting the probability of transitioning to state $s'$ upon taking action $a$ in $s$; $r$ is the reward function, and $\gamma$ the scalar discount factor. A policy $\pi$ is a probabilistic mapping from states to actions. For a policy $\pi$ and the MDP transition matrix $p$, let the matrix $p^\pi$ denote the dynamics of the induced Markov chain:
$$p^\pi(s'|s) = \sum_{a} \pi(a|s)\, p(s'|s,a),$$
and $r^\pi$ the reward expected for each state under $\pi$:
$$r^\pi(s) = \sum_{a} \pi(a|s)\, r(s,a).$$
Consider the Q-function $q$ as a mapping $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and define the one-step transition operator $P^\pi$ over Q-functions as:
$$(P^\pi q)(s,a) = \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, q(s',a'). \tag{1}$$
Using operator notation, define the Q-function corresponding to the value of policy $\pi$ as:
$$q^\pi = \sum_{t \ge 0} \gamma^t (P^\pi)^t r,$$
the recursive form of which defines the Bellman equation for the policy $\pi$ [Bellman 1957]. The corresponding one-step Bellman operator can be applied to any Q-function:
$$T^\pi q = r + \gamma P^\pi q, \tag{2}$$
and its repeated applications are guaranteed to produce its fixed point $q^\pi$ [Puterman 1994]. The policy evaluation setting is concerned with estimating this quantity for a given policy $\pi$.

The Bellman optimality operator introduces maximization over the set of policies:
$$T q = r + \gamma \max_{\pi} P^\pi q,$$
and its repeated applications are guaranteed to produce its fixed point, $q^*$, which is the value of the optimal policy $\pi^*$. The control setting is concerned with finding this value. A policy is greedy w.r.t. a Q-function if at each state it picks the action of maximum value. The optimal policy $\pi^*$ is greedy w.r.t. the optimal Q-function $q^*$.
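To make the policy evaluation setting concrete, the following is a minimal sketch, assuming a small tabular MDP stored as NumPy arrays. It applies the one-step Bellman operator repeatedly and compares the result with the closed-form solution of the Bellman equation. The helper `random_mdp`, the array layout `p[s, a, s']`, and all function names are illustrative assumptions, not part of the original text.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical helper: a random MDP with p[s, a, s'] and r[s, a]."""
    rng = np.random.default_rng(seed)
    p = rng.random((n_states, n_actions, n_states))
    p /= p.sum(axis=2, keepdims=True)   # each p[s, a, :] is a distribution
    r = rng.random((n_states, n_actions))
    return p, r

def transition_operator(p, pi, q):
    """(P^pi q)(s, a) = sum_{s'} p(s'|s, a) sum_{a'} pi(a'|s') q(s', a')."""
    v_next = (pi * q).sum(axis=1)       # expected value of each state under pi
    return p @ v_next                   # shape (S, A)

def bellman_operator(p, r, gamma, pi, q):
    """One application of T^pi q = r + gamma P^pi q."""
    return r + gamma * transition_operator(p, pi, q)

def evaluate_by_iteration(p, r, gamma, pi, n_iters=500):
    """Repeatedly apply T^pi starting from q = 0."""
    q = np.zeros_like(r)
    for _ in range(n_iters):
        q = bellman_operator(p, r, gamma, pi, q)
    return q

def evaluate_in_closed_form(p, r, gamma, pi):
    """Solve (I - gamma P^pi) q = r over all state-action pairs."""
    n_states, n_actions = r.shape
    n = n_states * n_actions
    # P^pi as a matrix: entry [(s, a), (s', a')] = p(s'|s, a) * pi(a'|s').
    big_p = (p[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
    q = np.linalg.solve(np.eye(n) - gamma * big_p, r.reshape(-1))
    return q.reshape(n_states, n_actions)

if __name__ == "__main__":
    p, r = random_mdp(n_states=4, n_actions=3, seed=1)
    pi = np.full((4, 3), 1.0 / 3.0)     # uniform policy
    gap = np.abs(evaluate_by_iteration(p, r, 0.9, pi)
                 - evaluate_in_closed_form(p, r, 0.9, pi)).max()
    print("max difference between the two solutions:", gap)
```

Both routes should agree up to numerical precision, since the operator is a contraction for a discount factor below one.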
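Throughout, we will write $\mathbf{1}$ and $\mathbf{0}$ for the vectors with 1- or 0-components. For a policy $\pi$, we will use $\mathbb{E}_\pi$ as a shorthand for the expectation w.r.t. the trajectory $s_0, a_0, s_1, a_1, \ldots$, with $a_t$ drawn according to $\pi(\cdot|s_t)$ and $s_{t+1}$ drawn according to $p(\cdot|s_t, a_t)$. We will occasionally use $\mathbb{E}_\pi q(s, \cdot)$ to denote $\sum_a \pi(a|s)\, q(s,a)$, and write arguments in subscripts (i.e. use $q_{s,a}$ and $q(s,a)$ equivalently). In the learning setting, the algorithms often update with a sample of the residual $T^\pi q - q$:
$$\delta_t = r_t + \gamma\, \mathbb{E}_\pi q(s_{t+1}, \cdot) - q(s_t, a_t),$$
which is referred to as a temporal difference (TD) error.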
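2.1 Multi-step off-policy TD learning
One need not only consider single-step operators, and may apply $T^\pi$ and $T$ repeatedly. The $\lambda$-operator $T^\pi_\lambda$ is a particularly flexible mixture of such multi-step operators:
$$T^\pi_\lambda q = (1-\lambda) \sum_{n \ge 0} \lambda^n (T^\pi)^{n+1} q = q + (I - \lambda\gamma P^\pi)^{-1} (T^\pi q - q).$$
In the learning setting, this corresponds to considering multi-step returns, rather than one-step samples, for updating the estimate, and $\lambda$ trades off the bias of bootstrapping with an approximate estimate against the variance of sampling multiple steps. A high intermediate value of $\lambda$ is typically best in practice [Kearns and Singh 2000].

Off-policy learning is the setting where the behavior and target policies are decoupled. That is: $b \neq \pi$, for a behavior policy $b$ and a target policy $\pi$. Multi-step methods pose a challenge when considered off-policy, and advances have recently been made in this direction [Munos et al. 2016, Mahmood, Yu, and Sutton 2017]. To this end, Munos et al. (2016) unified several off-policy return-based algorithms under a common umbrella:
$$\mathcal{R}^{\pi,b}_c q(s,a) = q(s,a) + \mathbb{E}_b \Big[ \sum_{t \ge 0} \gamma^t \Big( \prod_{i=1}^{t} c_i \Big) \delta^\pi_t \,\Big|\, s_0 = s, a_0 = a \Big], \tag{3}$$
where
$$\delta^\pi_t = r_t + \gamma\, \mathbb{E}_\pi q(s_{t+1}, \cdot) - q(s_t, a_t),$$
and the trace $c_i = c(s_i, a_i)$ is a state-action coefficient, whose specific shapes correspond to different algorithms. In particular, TreeBackup($\lambda$) sets $c_i$ to $\lambda \pi(a_i|s_i)$ [Precup, Sutton, and Singh 2000], while Retrace($\lambda$) sets $c_i$ to $\lambda \min\big(1, \pi(a_i|s_i) / b(a_i|s_i)\big)$ [Munos et al. 2016].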
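As an illustration of how the trace coefficient shapes the multi-step correction, the sketch below computes the TreeBackup and Retrace traces and accumulates the trace-weighted sum of TD errors for a single sampled trajectory. The function names and the argument `b_prob` (the behavior policy's probability of the taken action) are assumptions made for this example.

```python
def tree_backup_trace(lam, pi_prob):
    """TreeBackup(lambda): c = lambda * pi(a|s)."""
    return lam * pi_prob

def retrace_trace(lam, pi_prob, b_prob):
    """Retrace(lambda): c = lambda * min(1, pi(a|s) / b(a|s))."""
    return lam * min(1.0, pi_prob / b_prob)

def trace_weighted_correction(td_errors, traces, gamma):
    """sum_t gamma^t (prod_{i=1..t} c_i) delta_t for one sampled trajectory.

    td_errors[t] is delta_t and traces[t] is c_t; traces[0] is ignored,
    since the product over i = 1..t is empty at t = 0.
    """
    total, weight = 0.0, 1.0
    for t, delta in enumerate(td_errors):
        if t > 0:
            weight *= gamma * traces[t]
        total += weight * delta
    return total

if __name__ == "__main__":
    # Three unit TD errors, constant trace 0.5, discount 0.5.
    print(trace_weighted_correction([1.0, 1.0, 1.0], [1.0, 0.5, 0.5], 0.5))
```

Smaller traces cut the contribution of later TD errors more aggressively, which is the mechanism the algorithms above use to stay safe off-policy.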
Unifying algorithm
Recently, De Asis et al. (2017) formulated an algorithm that encompasses the ones discussed so far even more generally. Instead of the binary taxonomy of on- and off-policy algorithms, the authors introduce a parameter $\sigma$ that smoothly transitions between the two. This algorithm can also be expressed via Eq. (3) with:
$$c_i = \lambda \big( \sigma + (1-\sigma)\, \pi(a_i|s_i) \big).$$
In particular, $\sigma = 1$ corresponds to the on-policy SARSA($\lambda$) algorithm, while $\sigma = 0$ to TreeBackup($\lambda$).
2.2 Options
An option $o$ is a tuple $(\mathcal{I}^o, \beta^o, \pi^o)$, with $\mathcal{I}^o$ the initiation set, from which an option may start, $\beta^o$ the probabilistic termination condition, and $\pi^o$ the option policy with which it navigates through the environment. Just like the MDP reward and transition models $r$ and $p$, options can be seen to induce semi-MDP [Puterman 1994] reward and transition models as follows [Sutton, Precup, and Singh 1999]:
(4) 
where the expectations are those of the option duration from state $s$ and of the travel time between states $s$ and $s'$, respectively, w.r.t. the option dynamics and the termination condition $\beta$.
In the call-and-return model of option execution, an option is run until completion (according to its termination condition), and only then is a new option choice made [Precup, Sutton, and Singh 1998]. This suggests the following state-option analogues of the state-action transition operator from Eq. (1), and the Bellman operator from Eq. (2). For a policy over options $\mu$:
(5)  
(6) 
It will also be relevant to consider the marginal flat policy over primitive actions:
$$\kappa(a|s) = \sum_{o} \mu(o|s)\, \pi^o(a|s). \tag{7}$$
For simplicity, we assume that options can initiate anywhere: $\mathcal{I}^o = \mathcal{S}$ for all options $o$. Finally, in the main departure from the standard framework, we will distinguish between the target termination condition $\beta$ that is to be estimated, and a behavior termination condition $\zeta$ ("zeta") that the options actually terminate with.
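The marginal flat policy over primitive actions mixes the option policies according to the policy over options. A minimal sketch follows, assuming tabular arrays `mu[s, o]` for the policy over options and `option_policies[o, s, a]` for the option policies; the names and array layout are illustrative choices.

```python
import numpy as np

def marginal_flat_policy(mu, option_policies):
    """Probability of each primitive action when an option is drawn from mu
    and an action is then drawn from that option's policy.

    mu:              array of shape (S, O), rows sum to one
    option_policies: array of shape (O, S, A), last axis sums to one
    returns:         array of shape (S, A), rows sum to one
    """
    return np.einsum('so,osa->sa', mu, option_policies)

if __name__ == "__main__":
    # Two states, two options, two actions.
    mu = np.array([[0.5, 0.5],
                   [1.0, 0.0]])
    option_policies = np.array([
        [[1.0, 0.0], [1.0, 0.0]],   # option 0 always takes action 0
        [[0.0, 1.0], [0.0, 1.0]],   # option 1 always takes action 1
    ])
    print(marginal_flat_policy(mu, option_policies))
```

In the first state both actions are equally likely, while in the second state only option 0 can be chosen, so the flat policy there is deterministic.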
3 The call-and-return operator
Before proceeding to describe our key idea and algorithm, let us quantify the role of $\beta$ in the target solution of planning with options. To do this, we derive this solution at the primitive action resolution.
Let $\mu$ be an arbitrary policy over options, and $c$ a coefficient function. Consider the following transition operator and its corresponding Bellman operator:
where the reward term is the vector of expected one-step rewards for all options. Note that this transition operator, like $P^\pi$ from Eq. (1), is a one-step operator, whereas the operator from Eq. (5) is an option-level operator. The transition operator in particular defines the following operators corresponding to option continuation and termination, respectively:
That is: $\iota$ ("iota") is the policy over options that maintains the current (argument) option. Using these operators, we can express the transition operator from Eq. (5) and the reward model from Eq. (4) concisely for all state-option pairs:
and rewrite the option-level Bellman operator from Eq. (6):
(8) 
The following proposition derives the fixed point of the option-level Bellman operator from Eq. (8) in terms of the one-step continuation and termination operators.
Proposition 1.
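The fixed point of the option-level Bellman operator from Eq. (8) is the same as the fixed point of the corresponding one-step Bellman operator that combines continuation and termination, and is given by: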
(9) 
Thus, the termination scheme directly affects the convergence limit: in the extreme, if $\beta = 0$, options never terminate, and we have the fixed point of the Bellman operator of $\iota$: the value of the option $o$. In the other extreme, $\beta = 1$, the options terminate at every step and we have the fixed point of the Bellman operator of $\mu$, which can be shown (Theorem 1 in [Bacon and Precup 2016]) to correspond to the value of the marginal policy from Eq. (7):
(10) 
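The two extremes can be checked numerically. The sketch below is not the paper's derivation: it assumes the standard one-step intra-option Bellman equations, in which after each primitive step the current option continues with probability $1-\beta$ and a new option is drawn from the policy over options with probability $\beta$, and solves them as a linear system over state-option pairs. All names and array layouts are illustrative assumptions.

```python
import numpy as np

def option_models(p, r, option_policies):
    """One-step dynamics and rewards induced by each option policy.

    p: (S, A, S), r: (S, A), option_policies: (O, S, A).
    Returns p_o of shape (O, S, S) and r_o of shape (O, S).
    """
    p_o = np.einsum('osa,sat->ost', option_policies, p)
    r_o = np.einsum('osa,sa->os', option_policies, r)
    return p_o, r_o

def intra_option_fixed_point(p, r, gamma, option_policies, mu, beta):
    """Solve q(s,o) = r_o(s) + gamma * sum_t p_o(t|s) *
    [(1 - beta(t,o)) q(t,o) + beta(t,o) sum_o' mu(o'|t) q(t,o')].

    mu and beta have shape (S, O). Returns q of shape (S, O).
    """
    n_options, n_states, _ = option_policies.shape
    p_o, r_o = option_models(p, r, option_policies)
    n = n_states * n_options
    m = np.zeros((n, n))                 # rows/columns indexed by s * O + o
    for s in range(n_states):
        for o in range(n_options):
            row = s * n_options + o
            for t in range(n_states):
                w = p_o[o, s, t]
                # Continuation: keep the current option in the next state.
                m[row, t * n_options + o] += w * (1.0 - beta[t, o])
                # Termination: pick a new option according to mu.
                for o2 in range(n_options):
                    m[row, t * n_options + o2] += w * beta[t, o] * mu[t, o2]
    rhs = r_o.T.reshape(-1)              # (S, O) flattened as s * O + o
    q = np.linalg.solve(np.eye(n) - gamma * m, rhs)
    return q.reshape(n_states, n_options)
```

With `beta` set to zero everywhere, each column of the result should equal the value of the corresponding option policy followed forever; with `beta` set to one, averaging the result over options with `mu` should recover the state value of the marginal flat policy.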
4 Off-policy option termination
We would like to decouple the target termination condition $\beta$ that factors into the solution from the behavior termination condition $\zeta$ that governs for how long the options are followed. Apart from the theoretical appeal of the freedom that this allows, a key motivation is that, on the one hand, just like with multi-step returns, the less options terminate, the faster the convergence, but on the other, the more options terminate, the better the control solution (as we show formally in the next section). Being able to decouple the two allows one to achieve the best of both worlds, and is exactly what we propose to do in this paper.
The critical insight in our approach is the off-policyness at the primitive action level that is introduced by the discrepancy between the policies $\mu$ (that picks a new option) and $\iota$ (that maintains the current option) in Eq. (9). The degree of this off-policyness is modulated exactly by the termination condition $\beta$, and is usually implicit. We propose to leverage multi-step off-policy learning and to impose a desired degree $\beta$ explicitly, independent of the behavior terminations $\zeta$.^2 In the extreme, we can learn the marginal policy directly. In the other extreme, we can learn the value of the current option. The two extremes are traded off via $\beta$. The algorithm we propose is analogous to the unifying algorithm Q($\sigma$) in which $\sigma$ modulates the degree of off-policyness [Asis et al. 2017].

^2 Technically this should be referred to as off-termination learning, since the difference is indeed in termination conditions, of which the induced policies are a consequence: the "behavior policy" $\iota$ takes the current option w.p. 1, while the "target policy" is the actual policy over options $\mu$.
To highlight the parallels between learning with options and learning off-policy from multi-step returns, this section will incrementally build the technical intuition towards the proposed off-policy termination operator, starting from the classical intra-option equations. The next section will analyze the convergence of this operator.
4.1 From one-step intra-option learning to General Q($\lambda$)
To begin, let us first rederive the target from Proposition 1 starting from the familiar intra-option equations [Sutton and Precup 1998]. At this point let us assume that the options execute w.r.t. some behavior condition $\zeta$. Letting $\Delta q$ denote the update on the estimated Q-function, we have at time $t$ and the current option $o$:
(11)  
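where the notation is as before. Notice that this is exactly a sample of the one-step update corresponding to the one-step Bellman operator from Proposition 1, taken with the behavior terminations $\zeta$. In fact, if we roll it out over multiple steps, and take an expectation, we obtain Eq. (8) for the behavior terminations $\zeta$ exactly: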
(12) 
from which some simple algebra yields:
(13)  
This is the same (expected) update as Peng's Q($\lambda$) for greedy policies [Peng and Williams 1996] or General Q($\lambda$) (i.e. off-policy Expected SARSA($\lambda$) [Van Seijen et al. 2009]) for arbitrary policies, with $1-\zeta$ being the state-option analogue of $\lambda$ from those algorithms. The fixed point of both of those algorithms is given in Proposition 1 in [Harutyunyan et al. 2016] and is indeed analogous to that given in Proposition 1 in this paper.
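The one-step intra-option update that this subsection starts from can be sketched as follows. The exact form of Eq. (11) is not legible in this text, so the sketch assumes the familiar intra-option value learning target: in the next state, the current option's value is used with the continuation probability, and the expected value under the policy over options is used with the termination probability. The function name, the step size `alpha`, and the array layouts are illustrative assumptions.

```python
import numpy as np

def intra_option_update(q, s, o, reward, s_next, mu, zeta, gamma, alpha):
    """Apply one tabular intra-option TD update in place and return the TD error.

    q:    array (S, O) of state-option value estimates
    mu:   array (S, O), policy over options
    zeta: array (S, O), behavior termination probabilities
    """
    continue_value = (1.0 - zeta[s_next, o]) * q[s_next, o]
    terminate_value = zeta[s_next, o] * np.dot(mu[s_next], q[s_next])
    delta = reward + gamma * (continue_value + terminate_value) - q[s, o]
    q[s, o] += alpha * delta
    return delta

if __name__ == "__main__":
    q = np.zeros((2, 2))
    q[1] = [2.0, 4.0]
    mu = np.full((2, 2), 0.5)
    zeta = np.ones((2, 2))               # terminate at every step
    print(intra_option_update(q, 0, 0, 1.0, 1, mu, zeta, gamma=0.5, alpha=0.1))
```

With terminations at every step the target bootstraps from the average over options, and with no terminations it bootstraps from the current option alone.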
4.2 General Q($\lambda$) to TreeBackup($\lambda$)
The update (13) converges to the fixed point from Prop. 1, which is an on/off-policy mixture, regulated by the behavior terminations $\zeta$. We wish to reweigh this mixture by the desired target terminations $\beta$. Let us now begin introducing the off-policy machinery that will allow us to do so. It will be useful to cast the multi-step intra-option update (13) in the form of the general off-policy operator (3). This can be done by replacing the second expectation in the TD-error with the corresponding point estimate, which introduces off-policy corrections:
This update will converge to the "off-policy" fixed point of the Bellman operator of $\mu$ only if $\iota$ and $\mu$ are close [Harutyunyan et al. 2016], which, in our particular case of the indicator behavior policy $\iota$, is not a very interesting scenario. If we further augment the "trace" $1-\zeta$ with the policy probability coefficient $\mu(o|s)$, we obtain option-level TreeBackup($\lambda$) [Precup, Sutton, and Singh 2000] whose target policy is $\mu$, and behavior policy is $\iota$, and with $1-\zeta$ being the state-option analogue of $\lambda$:
(14)  
From the convergence guarantees of TreeBackup, we know that this update converges to the fixed point of the Bellman operator of $\mu$, which in turn corresponds to the value of the marginal policy (Eq. (10)).
4.3 The off-policy termination operator
We are now ready to present the operator underlying the Q($\beta$) algorithm, which is the main contribution of this paper. Eq. (14) can be considered a special case where we correct all of the off-policyness, thus implicitly assuming $\beta = 1$. To obtain the general case, we need to split the target in each TD-error into two off- and on-policy terms:
(15)  
Inspecting these updates, it is clear that the parameter $\beta$ controls the degree of partial "off-policyness": in the next state, with a weight $1-\beta$ the option continues and the update considers the "on-policy" current option value $q(s_{t+1}, o)$, and with a weight $\beta$, the option terminates, and the update considers the "off-policy" value w.r.t. the policy over options $\mu$. Note that these coefficients appear in the correction terms as well, since in order for the algorithm to converge to the correct on/off-policy mixed target, the weighted off-policy portion needs to be corrected. Finally, note that in the learning setting, the second factor in the trace is explicit, while $1-\zeta$ is sampled from the current option. Algorithm 1 presents the forward view of the algorithm underlying this expected operator, for the general case of an evolving sequence of policies $(\mu_k)$.
This algorithm is very similar to the recently formalized Q($\sigma$) [Asis et al. 2017, Sutton and Barto 2017], with $1-\beta$ being a state-option generalization of $\sigma$. In Q($\sigma$), the parameter $\sigma$ controls the degree of off-policyness: $\sigma = 1$ corresponds to (on-policy) SARSA, and $\sigma = 0$ to (off-policy) TreeBackup. Analogously, Q($\beta$) with $\beta = 0$ learns the "on-policy" value of the current option policy (i.e. the policy $\iota$), and Q($\beta$) with $\beta = 1$ learns the "off-policy" value of the marginal policy (i.e. the policy $\mu$). The behavior terminations $\zeta$ on the other hand have a role analogous to that of the eligibility trace parameter $\lambda$.
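A forward-view correction in the spirit of Q($\beta$) can be sketched as below. This is a reading of the prose above rather than the paper's Algorithm 1, whose exact form is not reproduced in this text. The sketch assumes that each TD error mixes the current option's value (weight $1-\beta$) with the expected value under the policy over options (weight $\beta$), and that the explicit trace factor at a visited state is $(1-\beta) + \beta\,\mu(o|s)$; the continuation under $\zeta$ is taken to be sampled, so the trajectory passed in is simply the segment during which the option ran. All names are illustrative.

```python
import numpy as np

def q_beta_correction(q, mu, beta, gamma, states, rewards, option):
    """Trace-weighted sum of mixed TD errors along one option segment.

    q, mu, beta: arrays of shape (S, O)
    states:      visited states s_0, ..., s_T (length T + 1)
    rewards:     rewards r_0, ..., r_{T-1} (length T)
    option:      index of the option that was executed throughout
    """
    total, weight = 0.0, 1.0
    for t, reward in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        if t > 0:
            # Assumed explicit trace factor at the visited state s_t.
            b = beta[s, option]
            weight *= gamma * ((1.0 - b) + b * mu[s, option])
        b_next = beta[s_next, option]
        # Mixed target: on-policy part (continue) and off-policy part (terminate).
        target = ((1.0 - b_next) * q[s_next, option]
                  + b_next * np.dot(mu[s_next], q[s_next]))
        delta = reward + gamma * target - q[s, option]
        total += weight * delta
    return total

if __name__ == "__main__":
    q = np.zeros((3, 2))
    mu = np.full((3, 2), 0.5)
    states, rewards = [0, 1, 2, 0], [1.0, 1.0, 1.0]
    for b in (0.0, 1.0):
        beta = np.full((3, 2), b)
        print(b, q_beta_correction(q, mu, beta, 0.5, states, rewards, option=0))
```

Under these assumptions, a target termination of zero leaves the traces uncut and yields the discounted return of the current option, while a target termination of one scales each step by the probability of re-selecting the option, as in TreeBackup.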
4.4 Relationship with intra-option learning
The off-policyness discussed so far is different from that in the more familiar off-policy intra-option setting. The intra-option learning algorithm suggests applying the one-step intra-option update (Eq. (11)) to all options “consistent with” the experience stream [Sutton and Precup 1998]. When considering multiple steps, this amounts to introducing importance sampling into Eq. (12). That is, given a behavior option, the trace coefficient from Eq. (12) now becomes
(16) 
Hence, the value of the target option from Eq. (12) becomes:
Now, notice that there are two sources of off-policyness in these formulas. One is the contrast between option policies (that of the behavior option vs. that of the target option), and the other is in the target: the expected value under $\mu$ vs. the current option value itself, as discussed before. Indeed, if we write the above in the form from Eq. (13), we get a different correction:
Since the corrections for the two sources of off-policyness are orthogonal, it could be possible to combine them. We leave this for the future.
5 Analysis
In this section we will analyze the convergence behavior of the off-policy termination operator in both policy evaluation and control settings, and show that learning about shorter target options off-policy is generally asymptotically more efficient than on-policy. We will then consider the relationship of the solution quality in control with option duration, and show that shorter options generally yield better solutions.
5.1 Policy evaluation
We will prove that the evaluation operator is contractive around the appropriate fixed point, and that its contraction factor is less than that of the respective on-policy operator, given the target options are longer than the behavior ones.
Theorem 1 (Policy evaluation).
Proof.
The proof is analogous to that of Theorem 1 from [Munos et al. 2016] and is given in the appendix. ∎
[Figure caption, beginning lost in extraction: ... are outside the axes' bounds. The shaded region covers standard deviation.]
The contraction coefficient controls the convergence speed of this operator: the smaller it is (and the larger the traces), the fewer iterations are needed to converge, but the larger the computational expense when planning, or the variance when learning [Bertsekas and Ioffe 1996, Munos et al. 2016]. Since the traces are bounded by one, and options terminate eventually, the variance is less significant here, and generally, larger traces will yield faster convergence. In our case, since a behavior option is assumed (i.e. the factor $1-\zeta$ in Eq. (15) is fixed), the additional term in the trace can only reduce the existing trace. However, since we are interested in learning about a different target, we ought to compare these traces to the setting in which that same target is learnt on-policy. The following corollary derives a sufficient condition for Q($\beta$) to maintain larger traces than its on-policy counterpart.
Corollary 1.
In particular if , any satisfies this, irrespective of . If is deterministic, this holds for all for the chosen . In general, the intuition here is that it’s easier to learn from longer option traces about shorter options than vice versa.
5.2 Control
Let us now formulate the control analogue of Theorem 1. Its proof is a simpler version of that of Theorem 2 from [Munos et al.2016], and we omit it here.
Theorem 2 (Control).
Consider a sequence of policies over options that are greedy w.r.t. the sequence of estimates , and consider the update:
where the operator is defined by Eq. (15) for the th policy . Let , and let . Suppose that . ^{3}^{3}3 This can be attained by pessimistic initialization: Then for any ,
It follows that as .
We do not give a proof of online convergence at this time, but verify it empirically in our experiments.
5.3 Option duration and solution quality
Our convergence results show that learning about shorter options off-policy is more efficient than on-policy. Here, we motivate why one would want to learn about shorter options in the first place. In particular, we will show that given a set of options, the more they terminate, the better the resulting control solution at the primitive action resolution. Intuitively, this is because upon termination, the learner picks the current best option, whereas during option execution, the target value includes the potentially suboptimal current option (Proposition 1). The following theorem (proof in the appendix) formalizes this intuition, and may be of independent interest.
Theorem 3 (The more options terminate, the better the solution.).
Consider a set of options $\mathcal{O}$, and a greedy policy over options $\mu$. Let $\beta$ and $\beta'$ be two termination conditions for the options in $\mathcal{O}$. Then:
Note that this result refers to the target solution. During learning, the more decisions there are to make, the more potential there is for error. As such, in reasonably complex tasks we expect the performance to obey a tradeoff on $\beta$.
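The qualitative claim of this subsection can be probed numerically. The sketch below assumes a reading of the theorem based on its title: with a greedy policy over options, the control fixed point under a termination condition that terminates more often dominates the one under a condition that terminates less often. It runs value iteration on the greedy intra-option Bellman equations for a randomly generated set of option models; all names and the random instance are illustrative assumptions.

```python
import numpy as np

def greedy_option_value_iteration(p_o, r_o, beta, gamma, n_iters=1000):
    """Iterate q(s,o) <- r_o(s) + gamma * sum_t p_o(t|s) *
    [(1 - beta(t,o)) q(t,o) + beta(t,o) max_o' q(t,o')].

    p_o: (O, S, S) one-step option dynamics, r_o: (O, S) one-step option
    rewards, beta: (S, O) termination probabilities. Returns q of shape (S, O).
    """
    n_options, n_states, _ = p_o.shape
    q = np.zeros((n_states, n_options))
    for _ in range(n_iters):
        best = q.max(axis=1, keepdims=True)            # greedy choice on termination
        arrival = (1.0 - beta) * q + beta * best       # value on arriving with option o
        q = r_o.T + gamma * np.einsum('ost,to->so', p_o, arrival)
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_options = 5, 3
    p_o = rng.random((n_options, n_states, n_states))
    p_o /= p_o.sum(axis=2, keepdims=True)
    r_o = rng.random((n_options, n_states))
    q_rare = greedy_option_value_iteration(p_o, r_o, np.full((5, 3), 0.2), 0.9)
    q_often = greedy_option_value_iteration(p_o, r_o, np.full((5, 3), 0.8), 0.9)
    print("smallest improvement from terminating more:", (q_often - q_rare).min())
```

The ordering follows because the greedy value upon termination is never below the value of continuing with the current option, so terminating more often can only raise the backed-up target.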
6 Experiments
Finally, let us evaluate our algorithm empirically. We aim to illustrate the following claims:

(1) The learning speed improves as $\zeta$ gets smaller (behavior options get longer);
(2) The control performance improves as $\beta$ gets larger (target options get shorter);
(3) Q($\beta$) converges with off-policy terminations.
For simplicity, we assume that options terminate deterministically in a set of goal states. We hence reduce $\beta$ and $\zeta$ to single parameters that determine the likelihood of terminating before reaching the goal. The "plain" variant refers to the on-policy intra-option update from Eq. (12). The complete experimental details are provided in Appendix E.
6.1 Policy evaluation
First, we show that Q($\beta$) learns the correct values on the 19-state random walk task (Fig. 2). There are two options: one leads all the way to the left, the other to the right. The policy over options is uniform. The task is to estimate the value function w.r.t. the target terminations $\beta$. The results are given in Figure 1. Q($\beta$) is able to learn the correct values (up to an irreducible exploration-related error). As expected, Q($\beta$) gets more efficient as behavior options get longer ($\zeta$ gets smaller). The opposite is true for the plain algorithm, since without interrupting sufficiently, it does not have a chance to update the intermediate states with anything other than the option policy. There is a small inflection point in the performance of the plain algorithm at the on-policy value of $\zeta = \beta$.

6.2 Control
To demonstrate the benefit of decoupled off- and on-policy terminations, we compare our algorithm with the plain on-policy variant (labelled: onpolicyplain) that uses $\beta$ during both learning and evaluation. In order to demonstrate that the learning target plays a role, we also compare it with the plain algorithm that uses $\beta$ at evaluation only (labelled: offpolicyplain). The behavior , unless specified otherwise.
Modified Cliffwalk
We illustrate the benefits of off-policy termination on a modified Cliffwalk example (Fig. 4, left). The agent starts in a position inside a grid with the goal of getting to a corner where a positive reward is given. The step reward is zero, but there are small cliffs along the border that are not fatal, but induce a penalty. We have four options, one for each cardinal direction, that take the agent up until the corresponding border (and cliff). Thus, while these options are able to learn to reach the goal in an optimal number of steps, they are unable to learn the optimal policy, which only moves inside the grid so as to not encounter cliffs. To ensure adequate exploration we consider soft option policies, as well as a usual $\epsilon$-greedy policy over options during learning (but not during evaluation). The results are plotted in Fig. 2(a). Q($\beta$) outperforms the alternatives in all cases, and its performance improves with larger $\beta$. Note that it is the only variant able to surpass the value of the suboptimal upper bound of the naive approach.
Pinball
We finally evaluate our algorithm on a variation of the Pinball domain [Konidaris and Barto 2009]. Here, a small ball must be maneuvered through a set of obstacles into a hole. Observations consist of four continuous variables describing the ball's positions and velocities. There are five primitive actions: the first four apply a small force to either the $x$ or $y$ velocity, the final action leaves all velocities unchanged. There is a step penalty and a final reward. We define a set of landmark options [Mann, Mannor, and Precup 2015] that move the ball near a target goal location on the board. The agent can initiate and terminate each option from within some initiation and termination distances from the respective landmark. To illustrate the benefits of terminating suboptimal options, landmarks were placed in such a way that the paths from start to the goal via the landmarks are suboptimal (Fig. 4, right). The results are plotted in Fig. 2(b). As expected, the performance of Q($\beta$) drastically improves with longer behavior options. The influence of $\beta$ is more subtle: intermediate $\beta$s perform best overall, but reaches slightly better eventual performance.
In a comparison, Q($\beta$) outperforms the on-policy variant that learns with $\beta$. However, in this domain, the off-policy variant (that learns with $\zeta$, but evaluates with $\beta$) performs comparably to Q($\beta$). This may in part be due to the use of function approximation, which allows the plain update to generalize meaningfully within the option trajectory, and in part due to the noisy nature of Pinball, in which there are many optimal policies of similar values. Since, as we have seen, Q($\beta$) is the only variant to learn accurate values, we expect it to stand out more in settings where the reward scheme is more intricate.
7 Related work
Much of the related work has already been discussed throughout. We mention a few more relevant works below.
Mann, Mankowitz, and Mannor (2014) propose an algorithm for multi-step option interruption that stems from the same motivation of mitigating poor-quality options. In order to avoid the resulting options being too short, they introduce a time-regularization term. Our approach bypasses the need to do so by interrupting off-policy and through the ability to explicitly specify target terminations.
Ryan (2002) also considers the problem of termination improvement, or interrupting options when they are no longer relevant, in a hybrid decision-theoretic and classical planning setting. The intuitions in that work are particularly aligned with the ones here: in particular, the tradeoff between efficiency and optimality is hinted at, and it is even suggested that "persistence [with an option] is a kind of off-policy exploration, at the behavior level." In this work, we formalize and exploit this intuition.
Yu and Bertsekas (2012) consider general operators with state-dependent $\lambda$. In the case of options, the termination condition takes the role of a state-option-dependent $\lambda$. White (2017) proposes to consider the termination condition as part of the transition-based discount instead.
8 Discussion
We propose decoupling behavior and target termination conditions, as is done with policies in off-policy learning. We formulate an algorithm for learning target terminations off-policy, analyze its expected convergence, and validate it empirically, confirming the theoretical intuition that learning shorter options from longer options is beneficial both computationally and qualitatively. More generally, we cast learning with options into a common framework with well-studied multi-step off-policy temporal difference learning, which allows us to carry over existing results with ease.
Learning longer options from shorter options.
We have assumed here that the options are given, but may not express the optimal policy well. This scenario applies when the options describe simple rules of thumb, or are transferred from a different task. If the options are not given, but learnt end-to-end, our wish typically is to distill meaningful behavior in them. However, the result often ends up reducing to degenerate options instead [Bacon, Harb, and Precup 2017, Mann, Mankowitz, and Mannor 2014]. Being able to impose longer durations on the target off-policy may mitigate this, though it should be noted that our convergence results suggest that learning may not be as efficient then.
Action-level importance sampling.
The policy term $\mu(o|s)$ in the trace is indifferent to the action choice. Thus if $\mu(o|s)$ is small, the trace will be small, even if the taken action is consistent with the option policy $\pi^o$. It would be interesting to replace this term with the importance sampling ratio at the primitive action level, which corresponds to another multi-step off-policy algorithm, Retrace($\lambda$) [Munos et al. 2016]. Another direction towards this goal is to incorporate the intra-option correction from Eq. (16).
Online convergence.
Proving online convergence of Q($\beta$) remains an open problem. Supported by reliable empirical behavior, we hypothesize that it holds under reasonable conditions and plan to investigate it further in the future.
Future.
A natural direction for future work is to learn many $\beta$s in parallel, as is done in classical off-policy learning [Sutton et al. 2011], or to learn a new kind^4 of universal option model using the ideas from [Schaul et al. 2015]. It is also worthwhile to extend the convergence results to approximate state spaces, perhaps similarly to [Touati et al. 2017].

^4 I.e., not one indifferent to the reward, as in [Szepesvari et al. 2014], but one indifferent to the termination scheme.
References
[Asis et al. 2017] Asis, K. D.; Hernandez-Garcia, J. F.; Holland, G. Z.; and Sutton, R. S. 2017. Multi-step reinforcement learning: A unifying algorithm. CoRR abs/1703.01327.
[Bacon and Precup 2016] Bacon, P.-L., and Precup, D. 2016. A matrix splitting perspective on planning with options. CoRR abs/1612.00916.
[Bacon, Harb, and Precup 2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of AAAI, 1726–1734.
[Bellman 1957] Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press.
[Bertsekas and Ioffe 1996] Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, MIT.
[Bertsekas and Tsitsiklis 1996] Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.
[Harutyunyan et al. 2016] Harutyunyan, A.; Bellemare, M. G.; Stepleton, T.; and Munos, R. 2016. Q($\lambda$) with off-policy corrections. In Proceedings of the Conference on Algorithmic Learning Theory, 305–320.
[Kearns and Singh 2000] Kearns, M. J., and Singh, S. P. 2000. Bias-variance error bounds for temporal difference updates. In COLT, 142–147.
[Konidaris and Barto 2009] Konidaris, G., and Barto, A. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 1015–1023.
[Kulkarni et al. 2016] Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
[Mahmood, Yu, and Sutton 2017] Mahmood, A. R.; Yu, H.; and Sutton, R. S. 2017. Multi-step off-policy learning without importance sampling ratios. CoRR abs/1702.03006.

[Mann, Mankowitz, and Mannor 2014] Mann, T.; Mankowitz, D.; and Mannor, S. 2014. Time-regularized interrupting options (TRIO). In Proceedings of the International Conference on Machine Learning, 1350–1358.
[Mann, Mannor, and Precup 2015] Mann, T. A.; Mannor, S.; and Precup, D. 2015. Approximate value iteration with temporally extended actions. JAIR 53:375–438.
[Munos et al. 2016] Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 1046–1054.
[Peng and Williams 1996] Peng, J., and Williams, R. J. 1996. Incremental multi-step Q-learning. Machine Learning 22(1-3):283–290.
[Precup, Sutton, and Singh 1998] Precup, D.; Sutton, R. S.; and Singh, S. 1998. Theoretical results on reinforcement learning with temporally abstract options. In Proceedings of the European Conference on Machine Learning, 382–393.
[Precup, Sutton, and Singh 2000] Precup, D.; Sutton, R. S.; and Singh, S. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning, 759–766.
[Puterman 1994] Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition.
[Ryan 2002] Ryan, M. R. K. 2002. Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In Proceedings of the International Conference on Machine Learning, 522–529.
[Schaul et al. 2015] Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal value function approximators. In Proceedings of the International Conference on Machine Learning (ICML-15), 1312–1320.
[Sutton and Barto 2017] Sutton, R. S., and Barto, A. G. 2017. Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
[Sutton and Precup 1998] Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the International Conference on Machine Learning, 556–564.
[Sutton et al. 2011] Sutton, R.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, AAMAS, 761–768.
[Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
[Sutton 1995] Sutton, R. S. 1995. TD models: Modeling the world at a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning, 531–539.
[Szepesvari et al. 2014] Szepesvari, C.; Sutton, R. S.; Modayil, J.; Bhatnagar, S.; et al. 2014. Universal option models. In Advances in Neural Information Processing Systems, 990–998.
[Tessler et al. 2017] Tessler, C.; Givony, S.; Zahavy, T.; Mankowitz, D. J.; and Mannor, S. 2017. A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of AAAI, 1553–1561.
[Touati et al. 2017] Touati, A.; Bacon, P.-L.; Precup, D.; and Vincent, P. 2017. Convergent tree-backup and retrace with function approximation. CoRR abs/1705.09322.
[Van Seijen et al. 2009] Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected Sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.
[Vezhnevets et al. 2016] Vezhnevets, A.; Mnih, V.; Osindero, S.; Graves, A.; Vinyals, O.; Agapiou, J.; and Kavukcuoglu, K. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 3486–3494.
[White 2017] White, M. 2017. Unifying task specification in reinforcement learning. In Proceedings of the International Conference on Machine Learning, volume 70, 3742–3750.
[Yu and Bertsekas 2012] Yu, H., and Bertsekas, D. P. 2012. Weighted Bellman equations and their applications in approximate dynamic programming. Technical Report LIDS-P-2876, MIT.
Appendix A Proof of Proposition 1
Proof.
Appendix B Proof of Theorem 1
Proof.
Write for . The fact that the fixed point is the desired one is clear from Eq. (15) and Proposition 1. Recall that:
Now, let us derive the contraction result. Eq. (15) can be written:
Let , and be defined analogously. Since , and after shifting the sum index forward, we have:
where we write for the indicator function. Since and , , we have a linear combination of weighted by nonnegative coefficients defined as: