1 Introduction
The ability to represent and plan with temporally extended actions has long been recognized as an essential component for scaling up reinforcement learning systems. The options framework (Sutton et al., 1999; Precup, 2000) formalized this idea and showed how temporally extended actions can be used for learning and planning in reinforcement learning. Options can sometimes lead to better and faster exploration, learning, planning and transfer (Sutton et al., 1999) while being robust to model misspecification and uncertainty (He et al., 2011). While efficient algorithms for learning options have recently been proposed (Bacon et al., 2016; Mankowitz et al., 2016; Vezhnevets et al., 2016; Daniel et al., 2016), there is still no consensus on what constitutes a good set of options.
In this paper, we show that a choice of options is equivalent to a choice of iterative algorithm for solving Markov decision problems. We reach this conclusion by noting that the generalized Bellman operator underlying options and their models admits a linear representation as a matrix splitting (Varga, 1962; Young and Rheinboldt, 1971; Puterman, 1994), a notion which goes hand in hand with that of matrix preconditioning. As with options, the goal of these methods is to transform a linear system into one with the same solution but which is easier to solve. This new perspective makes the question of what good options are clearer and offers new theoretical tools to analyze them.
2 Background and Notation
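We restrict our attention to the class of discounted Markovian decision problems with finite state and action spaces. A discounted Markov Decision Process is specified by a finite set of states $\mathcal{S}$, a finite set of actions $\mathcal{A}$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition function $P : \mathcal{S} \times \mathcal{A} \to (\mathcal{S} \to [0, 1])$ and a discount factor $\gamma \in [0, 1)$. We write $\pi(a \mid s)$ to denote the probability of taking action $a$ in state $s$ under the stochastic policy $\pi$: that is $\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$. The value function $v_\pi$ represents the expected sum of discounted rewards encountered along the trajectories induced by the MDP and policy: $v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \mid S_0 = s\right]$. We also write $r_\pi(s) = \sum_a \pi(a \mid s)\, r(s, a)$ and $P_\pi(s, s') = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$. The spectral radius of a matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_n$ is $\rho(A) = \max_i |\lambda_i|$.
An option $o$ is a triple $(\mathcal{I}_o, \pi_o, \beta_o)$ where $\mathcal{I}_o \subseteq \mathcal{S}$ is the initiation set of $o$, $\pi_o$ is its policy, and $\beta_o : \mathcal{S} \to [0, 1]$ is its termination function. In addition, there is a policy over options $\mu$ whose role is to select an option whenever a termination event is sampled from the termination function of the current option. In the call-and-return model of execution, $\mu$ picks among the options available in state $s$ and executes the policy of the selected option irrevocably until termination. We distinguish the call-and-return model from what we call gating execution, in which the choice of option is reconsidered at every step. The results in sections 3 and 4 apply only to the gating model. However, we also show in section 5 that the call-and-return execution model can be studied in the matrix splitting framework. For simplicity, we finally assume that options are available everywhere, that is: $\mathcal{I}_o = \mathcal{S}$ for every option $o$.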
3 Generalized Bellman Equations for Gating Execution
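Planning with the value iteration algorithm over primitive actions usually involves a Bellman operator of the form $T_\pi v = r_\pi + \gamma P_\pi v$. Rather than backing up values for only one step ahead, Sutton (1995) showed that multi-step backups can equally be used for planning as long as the corresponding generalized Bellman operators satisfy the Bellman equations. We can extend this idea (Precup and Sutton, 1998) by making the number of Bellman backups a random variable which is determined by the termination events of an options-based process. Let $d$ be a random variable representing the number of backups performed per iteration; we then define our generalized Bellman operator $L$ as follows:
$$(Lv)(s) = \mathbb{E}\left[\sum_{t=0}^{d-1} \gamma^t r(S_t, A_t) + \gamma^d v(S_d) \,\middle|\, S_0 = s\right] \tag{1}$$
By linearity and with Markov options, we can decompose $L$ into a reward and a transition part, $Lv = b + Fv$. The reward model $b$ underlying $L$ is the discounted sum of rewards until termination, averaged over all options:
$$b = r_\sigma + \gamma P_\sharp b \tag{2}$$
where $r_\sigma$ and $P_\sharp$ are defined as follows:
$$r_\sigma(s) = \sum_o \mu(o \mid s) \sum_a \pi_o(a \mid s)\, r(s, a), \qquad P_\sharp(s, s') = \sum_o \mu(o \mid s) \sum_a \pi_o(a \mid s)\, P(s' \mid s, a)\,(1 - \beta_o(s'))$$
Here $\sigma(a \mid s) = \sum_o \mu(o \mid s)\, \pi_o(a \mid s)$ denotes the marginal policy over primitive actions. The sharp symbol here stands for “continuation”. Likewise, the transition model $F$ has the following recursive form:
$$F = \gamma P_\perp + \gamma P_\sharp F \tag{3}$$
where $P_\perp(s, s') = \sum_o \mu(o \mid s) \sum_a \pi_o(a \mid s)\, P(s' \mid s, a)\, \beta_o(s')$ and $\perp$ stands for “termination”. The linear system of equations corresponding to the reward and transition models admits a solution given that the matrix $I - \gamma P_\sharp$ is nonsingular (which we prove below):
$$b = (I - \gamma P_\sharp)^{-1} r_\sigma, \qquad F = \gamma (I - \gamma P_\sharp)^{-1} P_\perp \tag{4}$$
The generalized Bellman operator then becomes:
$$Lv = (I - \gamma P_\sharp)^{-1} \left(r_\sigma + \gamma P_\perp v\right) \tag{5}$$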
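The reward and transition models can be checked numerically on a small random problem. The NumPy sketch below is illustrative only: it assumes that the continuation and termination matrices weight each transition by $1 - \beta_o(s')$ and $\beta_o(s')$ respectively, averaged over the policy over options. The array layout, the function name `gating_models`, and the random MDP are hypothetical choices that do not come from the paper.

```python
import numpy as np

def gating_models(P, r, mu, pi, beta, gamma):
    """Reward model b and transition model F under gating execution.

    Assumed layout (hypothetical): P[a, s, t] = P(t | s, a), r[s, a],
    mu[s, o] = policy over options, pi[o, s, a] = option policies,
    beta[o, t] = probability that option o terminates on entering t.
    """
    n_states = r.shape[0]
    # Marginal policy over primitive actions and its expected reward.
    sigma = np.einsum("so,osa->sa", mu, pi)
    r_sigma = np.sum(sigma * r, axis=1)
    # Continuation ("sharp") and termination ("perp") transition matrices.
    P_cont = np.einsum("so,osa,ast,ot->st", mu, pi, P, 1.0 - beta)
    P_term = np.einsum("so,osa,ast,ot->st", mu, pi, P, beta)
    M = np.eye(n_states) - gamma * P_cont
    b = np.linalg.solve(M, r_sigma)         # b = M^{-1} r_sigma
    F = gamma * np.linalg.solve(M, P_term)  # F = gamma M^{-1} P_perp
    return b, F, r_sigma, P_cont, P_term

# Small random MDP and options (illustrative data only).
rng = np.random.default_rng(0)
n_states, n_actions, n_options, gamma = 5, 3, 2, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
mu = rng.random((n_states, n_options))
mu /= mu.sum(axis=1, keepdims=True)
pi = rng.random((n_options, n_states, n_actions))
pi /= pi.sum(axis=2, keepdims=True)
beta = rng.uniform(0.1, 0.9, size=(n_options, n_states))

b, F, r_sigma, P_cont, P_term = gating_models(P, r, mu, pi, beta, gamma)

def generalized_bellman(v):
    """One application of the generalized operator: L v = b + F v."""
    return b + F @ v

# The closed forms should satisfy the recursive definitions of both models.
reward_residual = np.max(np.abs(b - (r_sigma + gamma * P_cont @ b)))
transition_residual = np.max(np.abs(F - gamma * (P_term + P_cont @ F)))
print(reward_residual, transition_residual)
```

Both residuals should be at the level of floating-point round-off. The continuation and termination matrices also sum to the transition matrix of the marginal policy, which is what makes the splitting interpretation possible.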
van Nunen and Wessels (1976) showed that the basic iterative methods such as the Gauss-Seidel, Jacobi, Successive Over-relaxation, or Richardson’s variants (Puterman, 1994) of value iteration can be obtained through different stopping time functions in an operator of the same form as (1), or equivalently, as a linear transformation of an MDP into an equivalent one. This perspective was leveraged by Porteus (1975) to derive better bounds by transforming an MDP into one which has the same optimal policy but whose spectral radius is smaller. The idea that stopping time functions (termination functions) lead to a transformation of an MDP is also what we just found by writing (1) as (5). Using Porteus’ terminology, (5) in fact corresponds to a pre-inverse transformation through the reward and transition models of (4). The pre-inverse transform as well as the basic iterative methods can also be studied more generally via the notion of matrix splittings (Varga, 1962) developed in the context of matrix iterative analysis.
4 Matrix Splitting
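For a splitting $A = M - N$ with $\rho(M^{-1}N) < 1$, and provided that $A$ and $M$ are nonsingular, Varga (1962) showed that the iterates $x_{k+1} = M^{-1} N x_k + M^{-1} c$ converge to the unique solution of the linear system of equations $Ax = c$. In the policy evaluation problem, we are working with the linear system of equations $(I - \gamma P_\pi) v = r_\pi$ where $\pi$ is the target policy to evaluate. We now show that the reward and transition models (4) precisely correspond to the notion of a matrix splitting for the matrix $I - \gamma P_\sigma$ associated with the marginal policy $\sigma$ over primitive actions induced by the options.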
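To make the splitting iteration concrete, the following sketch applies it to a policy evaluation system with two classical choices of $M$: the identity (Richardson, i.e. the usual one-step backup) and the diagonal of $A$ (Jacobi). The random transition matrix, the helper names, and the iteration count are assumptions for illustration.

```python
import numpy as np

def spectral_radius(X):
    """Largest eigenvalue magnitude of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(X)))

def splitting_iterates(M, N, c, num_iters):
    """Iterate x_{k+1} = M^{-1} (N x_k + c), starting from zero."""
    x = np.zeros_like(c)
    for _ in range(num_iters):
        x = np.linalg.solve(M, N @ x + c)
    return x

# Random policy evaluation problem (illustrative data only).
rng = np.random.default_rng(1)
n_states, gamma = 6, 0.9
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n_states)

A = np.eye(n_states) - gamma * P_pi
v_exact = np.linalg.solve(A, r_pi)

# Richardson splitting: M = I gives the usual one-step backup.
M_richardson = np.eye(n_states)
# Jacobi splitting: M keeps only the diagonal of A.
M_jacobi = np.diag(np.diag(A))

results = {}
for name, M in [("richardson", M_richardson), ("jacobi", M_jacobi)]:
    N = M - A
    rho = spectral_radius(np.linalg.solve(M, N))
    v = splitting_iterates(M, N, r_pi, num_iters=500)
    results[name] = (rho, np.max(np.abs(v - v_exact)))
    print(name, rho, results[name][1])
```

Both splittings reach the same solution; they differ only in the spectral radius of $M^{-1}N$, which governs how quickly the error shrinks.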
Theorem 1 (Matrix Splitting in the Gating Model).
Corollary 1.
Since a set of options and the policy over them induce a matrix splitting, a choice of options is in fact a choice of algorithm for solving MDPs. An important property of an iterative solver, besides its computational efficiency, is that it should converge to a solution of the original problem. We should therefore ask ourselves whether the iterates corresponding to (5) converge to the true value function underlying a given target policy. Theorem 2 shows that the successive approximation method induced by a set of options and the policy over them is consistent (Young and Rheinboldt, 1971) provided that the marginal action probabilities are equal to those of the target policy.
Theorem 2 (Consistency of Policy Evaluation in the Gating Model).
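The iterative method associated with the splitting (5) is a consistent policy evaluation method in the gating model if the set of options and the policy over them are such that
$$\sum_o \mu(o \mid s)\, \pi_o(a \mid s) = \pi(a \mid s) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A},$$
where $\pi$ is the target policy to be evaluated.
Proof:
Let $v$ be the unique solution to the generalized Bellman equations (5); we have:
$$v = (I - \gamma P_\sharp)^{-1}\left(r_\sigma + \gamma P_\perp v\right) \iff \left(I - \gamma P_\sharp - \gamma P_\perp\right) v = r_\sigma \iff (I - \gamma P_\pi)\, v = r_\pi,$$
since $P_\sharp + P_\perp = P_\pi$ and $r_\sigma = r_\pi$ under the marginal condition.
Therefore $v$ is also the solution to the policy evaluation problem for the policy $\pi$.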
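The consistency property can be illustrated numerically. In the sketch below, the target policy is constructed as the marginal of the options so that the condition of Theorem 2 holds by design; the iteration induced by the options is then compared with a direct solve of the policy evaluation system. The array layout and the random problem are assumptions, not part of the paper.

```python
import numpy as np

# Random MDP and options (illustrative data only).
rng = np.random.default_rng(2)
n_states, n_actions, n_options, gamma = 4, 2, 3, 0.9
P = rng.random((n_actions, n_states, n_states))      # P[a, s, t]
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))                 # r[s, a]
mu = rng.random((n_states, n_options))                # mu[s, o]
mu /= mu.sum(axis=1, keepdims=True)
pi_o = rng.random((n_options, n_states, n_actions))   # pi_o[o, s, a]
pi_o /= pi_o.sum(axis=2, keepdims=True)
beta = rng.uniform(0.2, 0.8, size=(n_options, n_states))  # beta[o, t]

# Target policy chosen to satisfy the marginal condition of Theorem 2.
target = np.einsum("so,osa->sa", mu, pi_o)
r_target = np.sum(target * r, axis=1)
P_target = np.einsum("sa,ast->st", target, P)
v_target = np.linalg.solve(np.eye(n_states) - gamma * P_target, r_target)

# Splitting induced by the options: M = I - gamma P_sharp, N = gamma P_perp.
P_sharp = np.einsum("so,osa,ast,ot->st", mu, pi_o, P, 1.0 - beta)
P_perp = np.einsum("so,osa,ast,ot->st", mu, pi_o, P, beta)
M = np.eye(n_states) - gamma * P_sharp
N = gamma * P_perp

# Successive approximation with the options-induced splitting.
v = np.zeros(n_states)
for _ in range(300):
    v = np.linalg.solve(M, r_target + N @ v)

max_error = np.max(np.abs(v - v_target))
print(max_error)
```

Changing `beta` alters the path taken by the iterates but not their limit, as long as the marginal policy is left unchanged.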
While many sets of options can satisfy the marginal condition in the gating model, not all of them converge equally fast. Using comparison theorems for regular matrix splittings (Varga, 1962) we can better understand the effect of modelling the world at different timescales on the asymptotic performance of the induced algorithms.
Theorem 3 (Predict further, plan faster).
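In the gating model, if two sets of options share the same intra-option policies and policy over options but their termination functions are such that $\beta_o(s) \le \beta'_o(s)$ for every option $o$ and state $s$, then $\rho(M^{-1}N) \le \rho(M'^{-1}N') < 1$, where $M - N$ and $M' - N'$ are the splittings induced by the termination functions $\beta$ and $\beta'$ respectively.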
Theorem 3 consolidates the idea that modelling the world over longer time horizons increases the asymptotic rate of convergence. This also becomes apparent when writing (5) in the following form, with $M = I - \gamma P_\sharp$ and $\sigma$ the marginal policy over primitive actions induced by the options:
$$Lv = v + M^{-1}\left(r_\sigma - (I - \gamma P_\sigma)\, v\right) \tag{6}$$
Therefore, options enter the linear system of equations through the preconditioning matrix $M$ (Saad, 2003) and yield the following transformed linear system of equations:
$$M^{-1} (I - \gamma P_\sigma)\, v = M^{-1} r_\sigma \tag{7}$$
As the options' timescales increase and $\beta_o \to 0$, then $M^{-1} \to (I - \gamma P_\sigma)^{-1}$ and the solution is obtained directly on the right-hand side of (7). The corresponding generalized Bellman operator also becomes $Lv = (I - \gamma P_\sigma)^{-1} r_\sigma$ and solves the original system in one iteration. On the other hand, if the termination functions are such that the options terminate after only one step, we get $Lv = r_\sigma + \gamma P_\sigma v$, the usual one-step Bellman operator of the value iteration algorithm. Since the spectral radius associated with matrix splitting methods is given by $\rho(M^{-1}N)$, we also have the following:
$$0 \le \rho(M^{-1}N) \le \gamma$$
This suggests that in terms of asymptotic performance, a good set of options should be such that it induces a preconditioning matrix that is close to $I - \gamma P_\sigma$ in some sense but whose inverse is easier to compute.
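The effect of the options' timescale on the rate of convergence can be explored with a short experiment. The sketch below scales a hypothetical base termination function towards zero and records the spectral radius of the resulting iteration matrix, together with the one-step case where every option terminates immediately. Function and variable names are illustrative assumptions.

```python
import numpy as np

def splitting_radius(P, mu, pi_o, beta, gamma):
    """Spectral radius of M^{-1} N for the splitting induced by the options."""
    n_states = mu.shape[0]
    P_sharp = np.einsum("so,osa,ast,ot->st", mu, pi_o, P, 1.0 - beta)
    P_perp = np.einsum("so,osa,ast,ot->st", mu, pi_o, P, beta)
    M = np.eye(n_states) - gamma * P_sharp
    iteration_matrix = np.linalg.solve(M, gamma * P_perp)
    return np.max(np.abs(np.linalg.eigvals(iteration_matrix)))

# Random MDP and options (illustrative data only).
rng = np.random.default_rng(3)
n_states, n_actions, n_options, gamma = 6, 3, 2, 0.9
P = rng.random((n_actions, n_states, n_states))      # P[a, s, t]
P /= P.sum(axis=2, keepdims=True)
mu = rng.random((n_states, n_options))                # mu[s, o]
mu /= mu.sum(axis=1, keepdims=True)
pi_o = rng.random((n_options, n_states, n_actions))   # pi_o[o, s, a]
pi_o /= pi_o.sum(axis=2, keepdims=True)
base_beta = rng.uniform(0.2, 1.0, size=(n_options, n_states))

# Smaller scale means lower termination probabilities, hence longer options.
scales = [1.0, 0.5, 0.25, 0.0]
radii = [splitting_radius(P, mu, pi_o, scale * base_beta, gamma)
         for scale in scales]

# Options that always terminate after one step recover value iteration.
one_step = splitting_radius(P, mu, pi_o, np.ones_like(base_beta), gamma)

for scale, radius in zip(scales, radii):
    print(f"scale={scale:.2f}  rho={radius:.4f}")
print(f"one-step rho={one_step:.4f}")
```

The recorded radii should not increase as the termination probabilities shrink, with the one-step case sitting at the discount factor and the never-terminating case at zero.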
5 Call-and-Return Execution Model
The gating execution model assumed so far does not account for the notion of temporal commitment provided by the call-and-return model. Since a choice of option is made at every step in the gating model, the policy over options can be decoupled from the termination functions. This gives us the ability to express the value function as the solution of a matrix splitting over states. However, it is known (Sutton et al., 1999) that a set of options with call-and-return execution and an MDP induce a semi-Markov Decision Process (SMDP), even in the case of Markov options. This means that the trajectories over states and actions generated with options might no longer correspond to the dynamics of a Markov process. Hence, the existence of an equivalent marginal policy in Theorem 2 cannot be guaranteed under the call-and-return model.
To restore the Markov property with call-and-return execution, the choice of option must be remembered as part of the state. The resulting process defines a Markov chain in an augmented state space over state-option pairs (Bacon et al., 2017). This conditioning on both states and options is key to the derivation of the Bellman-like expressions of Sutton et al. (1999) for the reward and transition models of options. In the following, we show that the solution to these equations also yields a matrix splitting.
Sutton et al. (1999) showed that the reward model of an option can be written recursively as:
If we define and , we can see that the fixed point of these equations is :
Similarly, the transition model of an option also admits a set of Bellman-like equations of the form:
whose fixed point can be written using the termination operator :
Therefore, with and we have as a splitting.
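One possible construction of such a splitting over state-option pairs is sketched below. It rests on assumptions that are not spelled out in the text: pairs are indexed as `s * n_options + o`, the continuation operator keeps the current option with probability $1 - \beta_o(s')$, and the termination operator draws a fresh option from the policy over options with probability $\beta_o(s')$. The solution of the augmented linear system is then cross-checked against per-option reward and transition models in the style of Sutton et al. (1999).

```python
import numpy as np

# Random MDP and options (illustrative data only).
rng = np.random.default_rng(4)
n_states, n_actions, n_options, gamma = 4, 2, 3, 0.9
P = rng.random((n_actions, n_states, n_states))      # P[a, s, t]
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))                 # r[s, a]
mu = rng.random((n_states, n_options))                # mu[s, o]
mu /= mu.sum(axis=1, keepdims=True)
pi_o = rng.random((n_options, n_states, n_actions))   # pi_o[o, s, a]
pi_o /= pi_o.sum(axis=2, keepdims=True)
beta = rng.uniform(0.2, 0.8, size=(n_options, n_states))  # beta[o, t]

# State-to-state transitions and rewards while following option o.
P_o = np.einsum("osa,ast->ost", pi_o, P)   # P_o[o, s, t]
r_o = np.einsum("osa,sa->so", pi_o, r)     # r_o[s, o]

# Operators over state-option pairs, indexed as (s, o) -> s * n_options + o.
# Continuation keeps the current option; termination draws a new one from mu.
n_pairs = n_states * n_options
P_sharp = np.einsum("ost,ot,op->sotp", P_o, 1.0 - beta, np.eye(n_options))
P_perp = np.einsum("ost,ot,tp->sotp", P_o, beta, mu)
P_sharp = P_sharp.reshape(n_pairs, n_pairs)
P_perp = P_perp.reshape(n_pairs, n_pairs)

# Candidate splitting of the augmented system: M - N = I - gamma (P_sharp + P_perp).
M = np.eye(n_pairs) - gamma * P_sharp
N = gamma * P_perp
q = np.linalg.solve(M - N, r_o.reshape(n_pairs)).reshape(n_states, n_options)

# Cross-check: q(s, o) = b_o(s) + sum_t F_o(s, t) sum_p mu(p | t) q(t, p).
u = np.sum(mu * q, axis=1)   # value of re-selecting an option in each state
smdp_gap = 0.0
for o in range(n_options):
    cont = P_o[o] * (1.0 - beta[o])   # scales column t by 1 - beta[o, t]
    term = P_o[o] * beta[o]           # scales column t by beta[o, t]
    M_o = np.eye(n_states) - gamma * cont
    b_o = np.linalg.solve(M_o, r_o[:, o])
    F_o = gamma * np.linalg.solve(M_o, term)
    smdp_gap = max(smdp_gap, np.max(np.abs(q[:, o] - (b_o + F_o @ u))))

rho = np.max(np.abs(np.linalg.eigvals(np.linalg.solve(M, N))))
print(smdp_gap, rho)
```

Under these assumptions the continuation and termination operators sum to a stochastic matrix over state-option pairs, so the same comparison arguments used in the gating model would be expected to carry over to the augmented chain.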
6 Implications
Given the interpretation of options in terms of matrix splittings, and consequently as preconditioners, it comes as no surprise that preconditioning methods share the same goals as options. Indeed in both cases we seek a representation of the problem which is easier to solve than the original one. As Herbert Simon wrote in his Sciences of the Artificial: “[…] solving a problem simply means representing it so as to make the solution transparent” (Simon, 1969). Hence, the problem of finding good options or preconditioners is closely related to the representation learning problem (Minsky, 1961).
As with options, the design of general and fast preconditioners is a long-standing problem of numerical analysis. In some cases, good preconditioners can be found when problem-specific knowledge is available. However, manual design of preconditioners, and of options, quickly becomes a tedious process for large problems or when only partial knowledge about the domain is available. This is especially true in the context of reinforcement learning where the MDP is assumed to be unknown or too large to manipulate directly. When solving a single problem with options, it is also clear from the connection with preconditioners that the initial setup cost and subsequent cost per preconditioned iteration should not outweigh the cost associated with the original problem. This leads to a fundamental tension between the improvement effort per iteration and the number of overall iterations. In our framework, this tension exists between the two poles $M = I$ (options that terminate after one step) and $M = I - \gamma P_\sigma$ (options that never terminate). While $M = I - \gamma P_\sigma$ has the fastest convergence, it also is just as expensive as solving the original problem directly. This is a reminder that modelling far comes with a cost: computational here but also statistical in the learning case. In fact, the choice of timescale for options also falls under the same bias-variance tradeoff as for the $\lambda$ operator (Kearns and Singh, 2000) with $\beta = 1$ mirroring the choice $\lambda = 0$ and $\beta = 0$ for $\lambda = 1$.
Computational expenses associated with building preconditioners can be amortized across related problems. With options, this would correspond to the typical case in which options are reused to speed up learning and planning in a transfer or continual learning setting. Reusability of options can take place for example when the transition structure remains fixed but the reward function on the right-hand side of (7) changes. The accounting exercise necessary to justify the use of options in a transfer setting was considered by Solway et al. (2014). From a Bayesian perspective, the authors showed that a particular kind of bottleneck options is in fact optimal when a learning system must solve a series of related tasks. The idea of adapting the options structure to minimize the computational effort associated with solving a single task was also explored in Bacon and Precup (2015). Using the bounded rationality framework, it was argued that options should primarily help computationally restricted adaptive systems: an idea which naturally fits with the preconditioning point of view.
References
 Bacon and Precup [2015] Pierre-Luc Bacon and Doina Precup. Learning with options: Just deliberate and relax. In NIPS Bounded Optimality and Rational Metareasoning Workshop, 2015.
 Bacon et al. [2016] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. CoRR, abs/1609.05140, 2016.
 Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1726–1734, 2017.
 Berman and Plemmons [1979] A. Berman and R.J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Academic Press, New York, 1979.
 Daniel et al. [2016] C. Daniel, H. van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue, 104(2):337–357, 2016.
 He et al. [2011] Ruijie He, Emma Brunskill, and Nicholas Roy. Efficient planning under uncertainty with macro-actions. J. Artif. Intell. Res. (JAIR), 40:523–570, 2011.
 Kearns and Singh [2000] Michael J. Kearns and Satinder P. Singh. Bias-variance error bounds for temporal difference updates. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, COLT ’00, pages 142–147, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
 Mankowitz et al. [2016] Daniel J. Mankowitz, Timothy Arthur Mann, and Shie Mannor. Adaptive skills, adaptive partitions (ASAP). In Advances in Neural Information Processing Systems 29, 2016.
 Minsky [1961] Marvin Minsky. Steps toward artificial intelligence. In Computers and Thought, pages 406–450. McGraw-Hill, 1961.
 Porteus [1975] Evan L. Porteus. Bounds and transformations for discounted finite Markov decision chains. Oper. Res., 23(4):761–784, August 1975.
 Precup and Sutton [1998] Doina Precup and Richard S. Sutton. Multi-time models for temporally abstract planning. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 1050–1056. MIT Press, 1998.
 Precup [2000] Doina Precup. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, 2000.
 Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
 Saad [2003] Yousef Saad. Preconditioned Iterations, chapter 9, pages 261–281. 2003.
 Simon [1969] H.A. Simon. The Sciences of the Artificial. Karl Taylor Compton lectures. M.I.T. Press, 1969.
 Solway et al. [2014] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G. Barto, Yael Niv, and Matthew M. Botvinick. Optimal behavioral hierarchy. PLoS Comput Biol, 10(8):e1003779, August 2014.
 Sutton et al. [1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211, 1999.
 Sutton [1995] Richard S. Sutton. TD models: Modeling the world at a mixture of time scales. In Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, USA, July 9-12, 1995, pages 531–539, 1995.
 van Nunen and Wessels [1976] J.A.E.E. van Nunen and J. Wessels. Stopping times and Markov programming. Technical Report 76-22, Technische Hogeschool Eindhoven, Eindhoven, 1976.
 Varga [1962] Richard S. Varga. Matrix iterative analysis. Prentice-Hall, Englewood Cliffs, 1962.
 Vezhnevets et al. [2016] Alexander (Sasha) Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, and Koray Kavukcuoglu. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 2016.
 Young and Rheinboldt [1971] David Matheson Young and Werner Rheinboldt. Iterative solution of large linear systems. Academic Press, New York, NY, 1971.