The classical Policy Iteration (PI) algorithm alternates between greedy one-step policy improvement and policy evaluation. Recent literature shows that multi-step lookahead policy improvement leads to a better convergence rate at the expense of increased complexity per iteration. However, prior to running the algorithm, one cannot tell what the best fixed lookahead horizon is. Moreover, within a given run, using a lookahead horizon larger than one is often wasteful. In this work, we propose for the first time to dynamically adapt the multi-step lookahead horizon as a function of the state and of the value estimate. We devise two PI variants and analyze the trade-off between iteration count and computational complexity per iteration. The first variant takes the desired contraction factor as the objective and minimizes the per-iteration complexity. The second variant takes as input the computational complexity per iteration and minimizes the overall contraction factor. We then devise a corresponding DQN-based algorithm with an adaptive tree search horizon. We also include a novel enhancement for on-policy learning: a per-depth value function estimator. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and in Atari.


## 1 Introduction

The classic Policy Iteration (PI) (howard1960dynamic) and Value Iteration (VI) algorithms are the basis for most state-of-the-art reinforcement learning (RL) algorithms. As both PI and VI are based on a one-step greedy approach for policy improvement, so are the most commonly used policy-gradient (schulman2017proximal; haarnoja2018soft) and Q-learning (mnih2013playing; hessel2018rainbow) based approaches. In each iteration, they perform an improvement of their current policy by looking one step forward and acting greedily. While this is the simplest and most common paradigm, stronger performance was recently achieved using multi-step lookahead. Notable examples are AlphaGo (silver2018general) and MuZero (schrittwieser2020mastering), where the multi-step lookahead is implemented via Monte Carlo Tree Search (MCTS) (browne2012survey).

Several recent works rigorously analyzed the properties of multi-step lookahead in common RL schemes (efroni2018beyond; efroni2018multiple; efroni2019combine; efroni2020online; hallak2021improve). This and other related literature studied a fixed planning horizon chosen in advance. However, both in simulated and real-world environments, different states benefit differently from different lookahead horizons. A grasping robot far from its target will learn very little from looking a few steps into the future, but if the target is within reach, much more precision and planning is required to grasp the object correctly. Similarly, at the beginning of a chess game (the opening stage), lookahead grants little information as to which move is better, while agents in intricate situations in the middle game benefit immensely from considering all future possibilities for the next few moves. In this work we devise a principled methodology for adaptively choosing the planning horizon in each state, and show that it achieves a significant speed-up of the learning process.

We propose two complementary approaches to determine the suitable horizon per state in each PI iteration. To do so, we keep track of the distance between the value function estimate and the optimal value. Our first algorithm, Threshold-based Lookahead PI (TLPI), ensures a desired convergence rate and minimizes the computational complexity of each iteration. Alternatively, our second algorithm, Quantile-based Lookahead PI (QLPI), takes the per-iteration computational complexity as a given budget and aims for the best possible convergence rate. We then prove that both TLPI and QLPI converge to the optimum and achieve significantly lower computational cost than their fixed-horizon alternative.

Next, we devise QL-DQN: a DQN (mnih2013playing) variant of QLPI. In QL-DQN, the policy chooses an action by employing an exhaustive tree search (hallak2021improve) looking $h$ steps into the future. The tree depth is chosen adaptively per state to achieve an overall improved convergence rate at a reduced computational cost. To sustain on-policy consistency while generalizing over the multiple depths, we use a different value network per depth, where the first layers are shared across networks. We test our method on Atari and show it improves upon a fixed-depth tree search.

To summarize, our contributions are:

1. We are the first to propose adaptive state-dependent lookahead and devise two corresponding algorithms. Our analysis shows they converge with improved computational complexity.

2. We extend our approach to online learning with a DQN that uses exhaustive tree search with adaptive depth and a per-depth value network.

3. We evaluate the proposed methods on maze and Atari environments and show better results compared to a fixed lookahead horizon.

## 2 Preliminaries

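We consider a discounted MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a finite state space of size $S$, $\mathcal{A}$ is a finite action space of size $A$, $r$ is the reward function, $P$ is the transition function, and $\gamma \in (0,1)$ is the discount factor. Let $\pi: \mathcal{S} \to \mathcal{A}$ be a stationary policy, and let $V^\pi$ be the value function of $\pi$ defined by $V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s\right]$, where $s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))$.

The goal of a planning algorithm is to find the optimal policy $\pi^\star$ such that, for every $s \in \mathcal{S}$,

$$V^\star(s) = V^{\pi^\star}(s) = \max_{\pi: \mathcal{S} \to \mathcal{A}} V^\pi(s).$$

Given a policy $\pi$, let $T^\pi$ be the operator:

$$T^\pi[V] = r^\pi + \gamma P^\pi V,$$

where $r^\pi(s) = r(s, \pi(s))$ and $P^\pi(s' \mid s) = P(s' \mid s, \pi(s))$. It is well known that the value of policy $\pi$ is the unique solution to the linear equations: $T^\pi[V^\pi] = V^\pi$.

Let $T$ be the Bellman operator defined as:

$$T[V](s) = \max_{a \in \mathcal{A}} r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a) V(s').$$

Then, the optimal value $V^\star$ is the unique solution to the nonlinear equations: $T[V^\star] = V^\star$.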
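As a minimal sketch of the two operators above, the following pure-Python snippet applies $T^\pi$ and $T$ to a tabular MDP and finds their fixed points by repeated application. The list-based MDP encoding, the helper names, and the two-state toy MDP are illustrative assumptions, not part of the paper.

```python
# Hypothetical tabular encoding (an assumption for illustration):
# P[s][a] is a list of (probability, next_state) pairs, r[s][a] is a reward.

def backup(P, r, gamma, V, s, a):
    """One-step backup r(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return r[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def T_pi(P, r, gamma, pi, V):
    """Policy operator T^pi[V] = r^pi + gamma * P^pi V."""
    return [backup(P, r, gamma, V, s, pi[s]) for s in range(len(P))]


def T(P, r, gamma, V):
    """Bellman operator T[V](s) = max_a backup(s, a)."""
    return [max(backup(P, r, gamma, V, s, a) for a in range(len(P[s])))
            for s in range(len(P))]


def fixed_point(op, V0, tol=1e-12, max_iter=100_000):
    """Iterate V <- op(V) until the sup-norm change drops below tol."""
    V = list(V0)
    for _ in range(max_iter):
        V_new = op(V)
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Toy MDP: action 0 stays in place, action 1 moves to the other state.
# Staying in state 1 pays reward 1; everything else pays 0.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)], [(1.0, 0)]]]
r = [[0.0, 0.0], [1.0, 0.0]]
gamma = 0.9

V_star = fixed_point(lambda V: T(P, r, gamma, V), [0.0, 0.0])
V_stay = fixed_point(lambda V: T_pi(P, r, gamma, [0, 0], V), [0.0, 0.0])
```

Both operators are $\gamma$-contractions in the sup norm, which is why plain repeated application converges here; an exact linear solve would be the usual choice for evaluating $T^\pi$ in larger problems.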

### 2.1 PI and h-PI

PI starts from an arbitrary policy $\pi_0$ and performs iterations that consist of: (1) an evaluation step that evaluates the value of the current policy, and (2) an improvement step that performs a $1$-step improvement based on the computed value. That is, for every iteration $n \ge 0$ and state $s \in \mathcal{S}$,

$$\pi_{n+1}(s) = \arg\max_{a \in \mathcal{A}} r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a) V^{\pi_n}(s').$$

By the contraction property of the Bellman operator, one can prove that PI finds the optimal policy after at most $O\left(SA \cdot \frac{\log\frac{1}{1-\gamma}}{\log\frac{1}{\gamma}}\right)$ iterations (scherrer2016improved).
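The evaluation/improvement loop can be sketched as follows. This is an illustrative implementation under assumptions: the MDP encoding (`P[s][a]` as a list of `(probability, next_state)` pairs), the iterative evaluation, and the tie-breaking margin are choices made for this example, not details from the paper.

```python
def evaluate(P, r, gamma, pi, tol=1e-12):
    """Policy evaluation: iterate V <- T^pi[V] to numerical convergence."""
    V = [0.0] * len(P)
    while True:
        V_new = [r[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                 for s in range(len(P))]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def improve(P, r, gamma, pi, V, margin=1e-9):
    """1-step greedy improvement; keeps the current action on near-ties."""
    new_pi = list(pi)
    for s in range(len(P)):
        q = [r[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]
        best = max(range(len(q)), key=q.__getitem__)
        if q[best] > q[pi[s]] + margin:
            new_pi[s] = best
    return new_pi


def policy_iteration(P, r, gamma, pi0):
    """Returns the final policy, its value, and the number of policy updates."""
    pi, n_updates = list(pi0), 0
    while True:
        V = evaluate(P, r, gamma, pi)
        new_pi = improve(P, r, gamma, pi, V)
        if new_pi == pi:
            return pi, V, n_updates
        pi, n_updates = new_pi, n_updates + 1


# Toy MDP (assumed): action 0 stays, action 1 switches state;
# staying in state 1 pays reward 1.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)], [(1.0, 0)]]]
r = [[0.0, 0.0], [1.0, 0.0]]
pi_opt, V_opt, n_updates = policy_iteration(P, r, 0.9, [0, 0])
```

Keeping the current action on ties is what makes the stopping test `new_pi == pi` reliable; without it, floating-point noise could make the policy oscillate between equally good actions.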

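The PI algorithm can be extended to $h$-PI by performing $h$-step improvements (instead of $1$-step). Formally, define the $Q$-function of policy $\pi$ with an $h$-step lookahead as follows,

$$Q^\pi_h(s,a) = \max_{\{\pi_t\}_{t=1}^{h}} \mathbb{E}_{s,a}\left[\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h V^\pi(s_h)\right],$$

where $\mathbb{E}_{s,a}$ denotes the expectation over trajectories that start with $s_0 = s$ and $\pi_0(s_0) = a$. Then, the update rule of $h$-PI is $\pi_{n+1}(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_n}_h(s,a)$.

The operator induced by $h$-step lookahead is $\gamma^h$-contracting, which reduces the bound on the number of iterations until convergence by a factor of $h$ (efroni2018beyond), i.e., the number of iterations is bounded by $O\left(SA \cdot \frac{\log\frac{1}{1-\gamma}}{h \log\frac{1}{\gamma}}\right)$.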

Multi-step lookahead guarantees that the number of iterations to convergence is smaller than with $1$-step lookahead, but it comes with a computational cost. Computing the $h$-step improvement may take time exponential in $h$. In tabular MDPs, this can be mitigated with the use of dynamic programming (efroni2020online), while in MDPs with a large (or infinite) state space, MCTS (browne2012survey) or the alternative exhaustive tree search (hallak2021improve) are used in a forward-looking fashion. To compare our algorithms, in the rest of the paper we measure the computational complexity as follows:

###### Definition 2.1.

Let $c(h)$ be the computational cost of performing an $h$-step improvement in a single state. For example, in a deterministic full $A$-ary tree we have $c(h) = A^h$.
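To make the cost measure concrete, here is a small sketch of an exhaustive depth-$h$ tree search that bootstraps with $V$ at the leaves and counts how many leaves it evaluates. The MDP encoding, the function name `tree_value`, and the toy MDP are assumptions made for illustration; with deterministic transitions and $A$ actions per state the leaf count is $A^h$.

```python
def tree_value(P, r, gamma, V, s, depth, counter):
    """Best `depth`-step return from state s, bootstrapping with V at leaves.

    counter is a one-element list used to count evaluated leaves.
    """
    if depth == 0:
        counter[0] += 1
        return V[s]
    return max(
        r[s][a] + gamma * sum(
            p * tree_value(P, r, gamma, V, s2, depth - 1, counter)
            for p, s2 in P[s][a])
        for a in range(len(P[s])))


# Toy deterministic MDP (assumed): action 0 stays, action 1 switches state;
# staying in state 1 pays reward 1.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)], [(1.0, 0)]]]
r = [[0.0, 0.0], [1.0, 0.0]]

leaves = [0]
value = tree_value(P, r, 0.9, [0.0, 0.0], s=0, depth=3, counter=leaves)
```

In this toy case the best 3-step plan from state 0 is to switch once and then stay twice, and the search visits $2^3$ leaves to find it, which illustrates why the per-state cost grows exponentially with the horizon.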

## 3 Motivating Example

To show the potential of our approach, consider the chain MDP example from derman2021acting (in Figure 1):

###### Example 3.1 (Chain MDP).

Let $\mathcal{M}$ be an MDP with $n$ states $s_0, \dots, s_{n-1}$ and a single sink state $s_n$. Each of the states transitions to the consecutive state by applying action $a_1$ and to the sink state with action $a_0$. All rewards are $0$ except for state $s_{n-1}$, where action $a_1$ obtains reward $1$.

Now consider the standard PI algorithm when initialized with $\pi_0(s) = a_0$ for all $s$. Since the reward at the end of the chain needs to propagate backward, in each iteration the value of only a single state is updated. Thus, PI takes exactly $n$ iterations to converge to the optimal policy $\pi^\star(s) = a_1$ for all $s$. When instead a fixed horizon $h = 2$ is used, the reward propagates through two states in each iteration (instead of one) and therefore convergence takes $n/2$ iterations. Generally, performing PI with a fixed horizon $h$, i.e., $h$-PI, takes $n/h$ iterations to converge (up to a rounding factor).

While $h$-PI converges faster (in terms of iterations) as $h$ increases, in most states, performing $h$-step lookahead does not contribute to the speed-up at all. For instance, taking $h = 2$, we can achieve the exact same convergence rate as $2$-PI by using a $2$-step lookahead in only a single state in each iteration (and $1$-step in all other states). Specifically, we need to pick the state that is exactly $2$ steps behind the last updated state in the chain. For general $h$, consider applying, for each $h' = 2, \dots, h$, an $h'$-step lookahead in only one state (the one that is $h'$ steps behind the last updated state in the chain) and $1$-step in the others. This guarantees the same number of iterations until convergence as $h$-PI, but with much less computation time. Namely, while the per-iteration computational cost of $h$-PI is $n \cdot c(h)$, we can achieve the same convergence rate with just $n \cdot c(1) + \sum_{h'=2}^{h} c(h')$. In practice, when $n$ is large and $c(h)$ can scale exponentially with $h$, this gap can be immense: $O(n A^h)$ versus $O(nA + A^h)$.
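The argument above can be checked numerically with a small simulation. The sketch below is an illustration under assumptions: states are indexed `0..n-1` with sink `n`, the final reward is taken to be 1, the cost is $c(h) = 2^h$ (two actions), and the adaptive rule locates the last updated state directly from the policy rather than through a value estimate.

```python
def make_chain(n):
    """States 0..n-1 plus sink n; action 1 moves forward, action 0 goes to the sink."""
    nxt = [[n, s + 1] for s in range(n)] + [[n, n]]
    rew = [[0.0, 0.0] for _ in range(n + 1)]
    rew[n - 1][1] = 1.0  # only the last forward move is rewarded (assumed value)
    return nxt, rew


def evaluate(nxt, rew, gamma, pi):
    """Exact after len(nxt) sweeps, since every path reaches the sink."""
    V = [0.0] * len(nxt)
    for _ in range(len(nxt)):
        V = [rew[s][pi[s]] + gamma * V[nxt[s][pi[s]]] for s in range(len(nxt))]
    return V


def lookahead(nxt, rew, gamma, V, s, a, h):
    """Q_h(s, a): best h-step return from (s, a), bootstrapping with V."""
    s2 = nxt[s][a]
    if h == 1:
        return rew[s][a] + gamma * V[s2]
    return rew[s][a] + gamma * max(
        lookahead(nxt, rew, gamma, V, s2, b, h - 1) for b in (0, 1))


def run(n, gamma, horizon_of):
    """Returns (number of policy updates, total cost with c(h) = 2**h)."""
    nxt, rew = make_chain(n)
    pi, iters, cost = [0] * (n + 1), 0, 0
    while True:
        V = evaluate(nxt, rew, gamma, pi)
        # Index of the last updated state; n means nothing was updated yet.
        frontier = min([s for s in range(n) if pi[s] == 1], default=n)
        new_pi = list(pi)
        for s in range(n):
            h = horizon_of(s, frontier)
            cost += 2 ** h
            q = [lookahead(nxt, rew, gamma, V, s, a, h) for a in (0, 1)]
            if q[1] > q[0] + 1e-12:
                new_pi[s] = 1
        if new_pi == pi:
            return iters, cost
        pi, iters = new_pi, iters + 1


def fixed(h):
    return lambda s, frontier: h


def adaptive(h_max):
    # h'-step lookahead only in the state h' steps behind the frontier.
    return lambda s, frontier: (frontier - s) if 2 <= frontier - s <= h_max else 1


standard = run(12, 0.9, fixed(1))
fixed3 = run(12, 0.9, fixed(3))
adaptive3 = run(12, 0.9, adaptive(3))
```

With these assumptions, the adaptive rule needs the same number of policy updates as fixed 3-step lookahead while spending far fewer lookahead operations; the totals include the final sweep that only confirms convergence.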

## 4 Adaptive Lookahead Policy Iteration

In this section, we introduce the concept of dynamically adapting the planning lookahead horizon during runtime, based on the contraction observed online.

As shown in Example 3.1, the convergence rate of $h$-PI can be achieved when using a lookahead larger than $1$ in just $h-1$ states. A prominent question is thus how to choose these states. In the example, the chosen states are evidently those with maximal distance between their $1$-step improvement and optimal value, i.e.,

$$\arg\max_{s \in \mathcal{S}} \left|V^\star(s) - T[V^{\pi_t}](s)\right|. \tag{1}$$

In this section, we show this approach also leads to theoretical guarantees on the convergence of the PI algorithm.

To understand the connection between convergence rates and the quantity in Equation 1, we need to delve into the theoretical properties of PI. Since the standard $1$-step improvement yields a contraction of $\gamma$ while the $h$-step improvement gives $\gamma^h$, $h$-PI converges $h$ times faster than standard PI (efroni2018beyond). Importantly, this contraction is with respect to the $L_\infty$ norm; i.e., the states with the smallest contraction (that is, largest contraction coefficient) determine the convergence rate of PI. This behaviour is the source of weakness of using a fixed lookahead. Example 3.1 shows that one state may slow down convergence, but it also hints at an elegant solution: use a larger lookahead value in states with small contraction.

We leverage this observation and present two new algorithms: TLPI, which aims to achieve a fixed contraction in all states at a reduced computational cost, and QLPI, which aims to achieve maximal contraction in every iteration within a fixed computational budget. While both algorithms seek to optimize a similar problem, their analysis differs and sheds light on the problem from different perspectives: TLPI depends on the actual value of the contraction per state, while QLPI considers the ordering with respect to the contraction factors of all states and the value itself is less significant.

In the coming subsections, our vanilla algorithms assume knowledge of $V^\star$. This is clearly a strong assumption. However, we make it only as the basis of our theoretical analysis. In Section 5 we provide analysis after relaxing this assumption, while in Sections 6 and 7 we present experiments that use alternative warm-start value functions.

### 4.1 Threshold-based Lookahead Policy Iteration

TLPI (Algorithm 1) takes as input the optimal value function $V^\star$ and a desired contraction factor $\kappa$. The algorithm ensures that in each iteration, the value in every state contracts by at least $\kappa$. This is achieved by first performing $1$-step improvement in all states, and then performing $h(\kappa)$-step improvement in states whose measured contraction is less than $\kappa$, where $h(\kappa)$ is the smallest integer such that $\gamma^{h(\kappa)} \le \kappa$.

The following result states that TLPI converges at least as fast as $h$-PI (efroni2018beyond) with $\kappa$ set to $\gamma^h$ and with improved computational complexity. To measure the trade-off between the contraction factor (that determines the convergence rate) and the computational complexity needed to achieve it, Definition 4.1 presents $\theta(\kappa)$ as the fraction of states in which we perform a large lookahead.
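The selection rule described above can be sketched as a function that maps each state to a lookahead horizon. This is a hedged illustration, not the paper's Algorithm 1: the function names, the small numerical tolerances, and the chain values used in the example (six chain states plus a sink, final reward 1, all-zero initial value) are assumptions.

```python
def h_of_kappa(gamma, kappa):
    """Smallest integer h with gamma**h <= kappa (small tolerance for floats)."""
    h = 1
    while gamma ** h > kappa * (1 + 1e-12):
        h += 1
    return h


def tlpi_horizons(V_star, V_pi, TV, gamma, kappa):
    """Horizon per state: 1 if the 1-step improvement TV already contracts
    by kappa, otherwise h(kappa)."""
    err = max(abs(a - b) for a, b in zip(V_star, V_pi))
    h = h_of_kappa(gamma, kappa)
    return [1 if abs(V_star[s] - TV[s]) <= kappa * err + 1e-12 else h
            for s in range(len(V_star))]


# First iteration on a 6-state chain (plus sink) with an all-zero value.
gamma, n = 0.9, 6
V_star = [gamma ** (n - 1 - s) for s in range(n)] + [0.0]
V_pi = [0.0] * (n + 1)
TV = [0.0] * (n - 1) + [1.0, 0.0]  # only the last chain state sees the reward
horizons = tlpi_horizons(V_star, V_pi, TV, gamma, kappa=gamma ** 3)
```

In this example only the two states just behind the last chain state fail the $\kappa$ test, so only they receive the deeper $3$-step lookahead.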

###### Definition 4.1 (Def. of θ(κ)).

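Let $\{\pi_t\}_{t \ge 0}$ be the sequence of policies generated by TLPI. For the contraction factor $\kappa$, define

$$X_t = \left\{s : \left|V^\star(s) - T[V^{\pi_t}](s)\right| \le \kappa \|V^\star - V^{\pi_t}\|_\infty\right\}$$

as the set of states contracted by $\kappa$ after $1$-step improvement in iteration $t$. Then, denote by $\theta(\kappa) = \max_t |\mathcal{S} \setminus X_t| / S$ the largest fraction of states with contraction less than $\kappa$ observed along all policy updates.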

###### Theorem 4.2.

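The TLPI algorithm converges in at most $O\left(SA \cdot \frac{\log\frac{1}{1-\gamma}}{\log\frac{1}{\kappa}}\right)$ iterations. Moreover, its per-iteration computational complexity is bounded by $O\left(S \cdot c(1) + \theta(\kappa) S \cdot c(h(\kappa))\right)$.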

###### Proof sketch.

The proof of the bound on the number of iterations follows scherrer2016improved while utilizing two key observations. First, the convergence analysis of PI only uses the contraction property of the Bellman operator w.r.t. $V^\star$, and not w.r.t. an arbitrary pivot vector. Second, by the construction of the algorithm, a contraction of at least $\kappa$ in every state is guaranteed. The computational complexity follows because we perform $1$-step lookahead in all states and $h(\kappa)$-step lookahead in at most a $\theta(\kappa)$ fraction of the states by Definition 4.1. For the complete proof see Appendix A.1. ∎

To illustrate the merits of TLPI and Theorem 4.2, consider the Chain MDP in Example 3.1 where we set $\kappa = \gamma^h$ for some $h$. In every iteration, the states not contracted by $\kappa$ after $1$-step improvement are the $h-1$ states closest to the end of the chain that have not been updated yet (recall that each state reaches the correct optimal value after just one non-idle update). Thus, $\theta(\kappa) = (h-1)/S$ and the per-iteration computational complexity is $O\left(S \cdot c(1) + (h-1) \cdot c(h)\right)$.

### 4.2 Quantile-based Lookahead Policy Iteration

QLPI (Algorithm 2) resembles TLPI, but instead of a contraction coefficient it takes as input a vector of quantiles (budgets) $\theta = (\theta_1, \dots, \theta_H)$ for some predetermined maximal considered lookahead $H$. QLPI attempts to maximize the contraction in every iteration while using $h$-step lookahead in at most $\theta_h S$ states. This is achieved by performing $h$-step improvement on the $\theta_h$ portion of states that are furthest away from $V^\star$. Note that QLPI is a generalization of $h$-PI, obtained by setting $\theta_h = 1$ and $\theta_{h'} = 0$ for all $h' \neq h$.

The following result is complementary to Theorem 4.2: now, instead of choosing the desired iteration complexity (via $\kappa$ in TLPI), we choose the desired computational complexity per iteration via the budgets $\theta$. For the resulting iteration complexity we define the induced contraction factor:
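One way to read the quantile rule is sketched below: states are ranked by their distance from $V^\star$, and for each $h$ the top $\theta_h$ fraction is granted at least an $h$-step lookahead. This is an assumption-laden illustration rather than the paper's Algorithm 2; in particular, the nested assignment (a state keeps the largest horizon it qualifies for), the rounding of $\theta_h S$, and the example distances are choices made here.

```python
def qlpi_horizons(dist, theta):
    """dist[s]: distance of state s from V*; theta[h-1]: budget for h-step lookahead."""
    S = len(dist)
    # Furthest states first; sorted() is stable, so ties keep index order.
    order = sorted(range(S), key=lambda s: -dist[s])
    horizons = [1] * S
    for h, th in enumerate(theta, start=1):
        k = int(th * S + 1e-9)  # number of states allowed an h-step lookahead
        for s in order[:k]:
            horizons[s] = max(horizons[s], h)
    return horizons


dist = [0.1, 0.9, 0.5, 0.7, 0.0]
horizons = qlpi_horizons(dist, theta=[1.0, 0.4, 0.2])
```

Because only the ranking of `dist` matters, any monotone distortion of the distances leaves the assignment unchanged, which is the property the approximate-value analysis relies on.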

###### Definition 4.3 (Def. of κ(θ)).

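Let $\{\pi_t\}_{t \ge 0}$ be the sequence of policies generated by QLPI. Let $h^\theta_t(s)$ be the largest lookahead applied in state $s$ in iteration $t$ when running QLPI with quantiles $\theta$. For a given $\kappa$, define

$$Y_t(\kappa) = \left\{s : \left|V^\star(s) - T^{h^\theta_t(s)}[V^{\pi_t}](s)\right| \le \kappa \|V^\star - V^{\pi_t}\|_\infty\right\}$$

as the set of states contracted by $\kappa$ in iteration $t$. The induced contraction factor $\kappa(\theta)$ is defined as the minimal $\kappa$ such that $Y_t(\kappa) = \mathcal{S}$ for every $t$.

Though its formal definition may seem complex, $\kappa(\theta)$ is simply the effective contraction obtained by QLPI.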

###### Theorem 4.4.

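The QLPI algorithm converges in at most $O\left(SA \cdot \frac{\log\frac{1}{1-\gamma}}{\log\frac{1}{\kappa(\theta)}}\right)$ iterations. Moreover, its per-iteration computational complexity is bounded by $O\left(S \sum_{h=1}^{H} \theta_h \cdot c(h)\right)$.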

We provide the proof in Appendix A.2; it is based on similar ideas as the proof of Theorem 4.2.

To illustrate the merits of QLPI and Theorem 4.4, consider the Chain MDP in Example 3.1 where we set $\theta_1 = 1$ and $\theta_h = 1/S$ for $h = 2, \dots, H$, for some $H$. In every iteration QLPI first performs $1$-step lookahead in all states, and then, for each $h = 2, \dots, H$, it performs $h$-step lookahead in exactly one state: the state that is $h$ steps behind the last updated state in the chain. As explained in Section 3, the induced contraction is $\gamma^H$ and QLPI converges in $n/H$ iterations with optimal per-iteration complexity of $O\left(S \cdot c(1) + \sum_{h=2}^{H} c(h)\right)$.

Finally, we highlight the complementary nature of the two algorithms: while in TLPI the complexity parameter is governed by the desired contraction coefficient, in QLPI the induced contraction is the outcome of the pre-determined computational budget.

## 5 Adaptive Lookahead with Approximate V⋆

The algorithms discussed in the previous section suffer from a major disadvantage: they rely on the unknown quantity $V^\star$, which is exactly what we are interested in finding. In many cases, we can obtain an approximation of $V^\star$, denoted by $\tilde V^\star$, through, e.g., state aggregation, training agents on similar tasks, or running PI for a small number of iterations. In this section, we show how our algorithms from Section 4 can leverage $\tilde V^\star$, and prove that they maintain theoretical guarantees.

We start with TLPI and show that it can maintain convergence guarantees with a simple tweak to the algorithm, as long as we have a bound on the quality of our approximation. Formally, assume that $\|V^\star - \tilde V^\star\|_\infty \le \epsilon$. This implies we can measure contraction up to some approximation error that scales with $\epsilon$. Thus, we define the algorithm TLPI with correction term $\beta$ to follow TLPI, but replace Equation 2 with the following condition:

$$\left|\tilde V^\star(s) - \max_{a \in \mathcal{A}} U(s,a)\right| > \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \beta.$$

Recall that TLPI imposes an $h(\kappa)$-step lookahead in all states that do not achieve a $\kappa$ contraction after $1$-step lookahead. The gap $\beta$ then ensures that no state merely appears to achieve the desired contraction because of the approximation error $\epsilon$. The following result ties $\beta$ to the required $\epsilon$. Similarly to Definition 4.1, we define $\tilde\theta(\kappa)$ as the fraction of states in which we need to perform a large lookahead when the gap $\beta$ is used. Notice that now the computational complexity may increase as a consequence of the approximation error $\epsilon$.

###### Definition 5.1 (Def. of ~θ(κ)).

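Let $\{\pi_t\}_{t \ge 0}$ be the sequence of policies generated by TLPI with correction term $\beta$. For the contraction factor $\kappa$, define

$$\tilde X_t = \left\{s : \left|\tilde V^\star(s) - T[V^{\pi_t}](s)\right| \le \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \beta\right\}$$

as the set of states that, after $1$-step improvement in iteration $t$, are $\beta$-close to being contracted by $\kappa$ with respect to $\tilde V^\star$. Then, denote by $\tilde\theta(\kappa) = \max_t |\mathcal{S} \setminus \tilde X_t| / S$ the largest fraction of states with contraction less than $\kappa$ observed along all policy updates.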

###### Theorem 5.2.

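Running TLPI with $\tilde V^\star$ and a correction term $\beta = (1+\kappa)\epsilon$ guarantees the same number of iterations to convergence as TLPI($\kappa$) with the real $V^\star$. Moreover, the additional computational complexity of a single iteration is bounded by $O\left((\tilde\theta(\kappa) - \theta(\kappa)) \cdot S \cdot c(h(\kappa))\right)$.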

Theorem 5.2 points to a weakness in TLPI with approximate $V^\star$: it is highly sensitive to approximation errors since it measures the contraction directly and thus requires a uniform bound on the approximation error. On the other hand, QLPI may preserve convergence guarantees even with a large approximation error, since it only relies on the ordering of the states in terms of distance from the optimum. Intuitively, as long as the orderings with respect to $V^\star$ and $\tilde V^\star$ are close, analogous performance guarantees can be achieved.

###### Definition 5.3.

Consider the positions of a state $s$ in the orderings of the states induced by $V^\star$ and by $\tilde V^\star$, respectively. We define the approximation $\tilde V^\star$ to be $m$-order-preserving if, for every state $s$, these two positions differ by at most $m$.

If $\tilde V^\star$ is $m$-order-preserving, then by using the larger quantiles $\theta_h + m/S$, we would still induce the same contraction factor $\kappa(\theta)$ and thus preserve the same bound on the number of iterations. However, since the quantiles are larger, so is the computational cost.

###### Theorem 5.4.

Running QLPI with an $m$-order-preserving $\tilde V^\star$ and the quantiles $\theta_h + m/S$ guarantees the same number of iterations to convergence as QLPI with the real $V^\star$ and $\theta$. The additional computational complexity of a single iteration is $O\left(m \cdot \sum_{h=1}^{H} c(h)\right)$.

State-aggregation is an example of an approximation that preserves orders, and that is available in many domains where states are based on locality (like the maze environment considered in Section 6). Assume that we have access to a state-aggregation scheme that splits the state space into groups of size such that for every two states in the same group and for any state from a different group . Then the optimal value of the aggregated MDP is -order-preserving as long as for every , since the position of a state can be shifted due to the aggregation by at most the size of its group .

## 6 Maze Experiments

In a first set of experiments, we evaluate our adaptive lookahead algorithms, TLPI and QLPI, on a grid world with walls (tennenholtz2022covariate). Specifically, we used a grid world that is divided into four rooms with doors between them; see Figure 2. The agent is spawned in the top left corner (blue) and needs to reach one of four randomly chosen goals (green) where the reward is , while avoiding the trap (red) that incurs a reward of . There are four deterministic actions (up, down, right, left). Upon reaching a goal, the agent is moved to a random state. We set

We begin with testing the fixed-horizon $h$-PI with values . To corroborate that larger lookahead values reduce the number of PI iterations required for convergence, in Figure 4, we show the distance from the solution as a function of the iteration for the different depths. The plot demonstrates the effect of the lookahead in a less pathological example than Example 3.1.

In Figure 5, we compare the overall computational complexity, and not only the number of iterations, of the different fixed lookahead values. To measure performance, we count the number of queries to the simulator (environment) until convergence to the optimal value. More efficient lookahead horizons will require fewer overall calls to the simulator.

Beginning with the fixed lookahead results in the leftmost plot, we see the trade-off when picking the lookahead. For a lookahead that is too short (1 in this case), convergence requires so many iterations that even the low computational complexity of each iteration does not compensate for the total compute time. Note that $h$-PI with $h = 1$ is the standard PI algorithm, which evidently performs worse than the best fixed lookahead although it is overwhelmingly the most widely used version of PI. On the other extreme of very large lookahead, each iteration is too computationally expensive, despite the smaller number of iterations.

#### TLPI.

To verify our observation that a long lookahead is wasteful in large parts of the state space, we first plot a histogram of the contraction factor along several PI iterations in Figure 3. Here we see that indeed the effective contraction factor is much smaller than $\gamma$ (i.e., more contractive than $1$-step lookahead) in roughly 90% of the states.

Next, we run TLPI with . The results are given in Figure 5, second plot. By Theorem 4.2, when setting $\kappa = \gamma^h$, we expect the same number of iterations until convergence as $h$-PI but with better computational complexity. In fact, the results reveal even stronger behavior: TLPI($\kappa$) for all tested values of $\kappa$ achieves similar computational complexity compared to the best fixed lookahead witnessed in $h$-PI.

#### QLPI.

We run QLPI with and in all our experiments. For $\theta$ we set the following values: , , and which respectively depict decreasing weights to depths The results are presented in Figure 5, third plot. Again we can see that for all the parameters, QLPI performs as well as the best fixed lookahead. Moreover, notice that for some choices of the vector $\theta$, the performance significantly improves upon the best fixed horizon.

#### Approximate V⋆ via state aggregation.

We again run QLPI with budget values , but replace $V^\star$ with an approximation we obtain with state aggregation. Namely, we merge squares of into a single state, solve the smaller aggregated MDP, and use its optimal value as an approximation for $V^\star$.

We perform this experiment with and include the aggregated MDP solution process as part of the total simulator query count. This way, our final algorithm does not have any prior knowledge of $V^\star$. The results are presented in Figure 5, last plot. As expected, the performance is slightly worse than the original QLPI that uses the accurate $V^\star$, but for all different aggregation choices the algorithm still performs as well as the best fixed lookahead in $h$-PI.

To summarize, the maze experiments show that with adaptive planning lookahead, we manage to reach the solution with better sample complexity compared to fixed-horizon $h$-PI. More importantly, our methods are robust to hyperparameter choices: the improved results are obtained uniformly across all tested parameters of TLPI and QLPI. This alleviates the heavy burden of finding the best fixed horizon for a given environment.

## 7 QL-DQN and Atari Experiments

In this section, we extend our adaptive lookahead algorithm QLPI to neural network function approximation. We present Quantile-based lookahead DQN (QL-DQN): the first DQN algorithm that uses state-dependent lookahead that is dynamically chosen throughout the learning process.

To extend QLPI to QL-DQN, we introduce three key features. First, we need to compute the quantile of the current state $s$ in order to determine which lookahead to use. However, we cannot go over the entire state space to find the position of $s$ in the ordering as we did in the tabular case. Instead, we propose to estimate the position of $s$ by using the replay buffer. Namely, we compute the ordering-based quantile of $s$ in the replay buffer and use lookahead of $h$ if . For $\tilde V^\star$ we use a trained agent with depth which we know to perform poorly compared to larger lookahead values.

Second, once a lookahead $h$ is chosen, we need to perform $h$-step improvement. To implement $h$-step lookahead in Atari, we build upon the batch-BFS algorithm of hallak2021improve. In short, the lookahead mechanism relies on an exhaustive tree search that at each step spans the tree of outcomes up to depth $h$ from each state. This is done efficiently via parallel GPU emulation of Atari.

Third, we introduce a per-depth Q-function. This feature is crucial in order to keep online consistency and achieve convergence. Without it, values across mixed depths (which are essentially mixed policies) suffer from off-policy distortion and the agent fails to converge. Technically, we maintain $H$ parallel $Q$-networks (where $H$ is the maximal tree depth) and use the $h$-th network when predicting the value of a leaf in depth $h$ in the tree. To improve generalization and data re-use, the networks share the initial layers (feature extractors). When storing states to the replay buffer, we attach their chosen lookahead depth and use it later for the target loss. All other parts of the algorithm and hyper-parameter choices are taken as-is from the original DQN paper (mnih2013playing). QL-DQN is visualized in Figure 7.
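The first feature, estimating a state's quantile from the replay buffer, can be sketched as follows. Everything here is an assumption for illustration: the per-state `score` (some distance-from-optimum estimate), the rule that picks the largest depth whose budget covers the estimated quantile, and all function names are hypothetical and not taken from the paper.

```python
import bisect


def buffer_quantile(score, sorted_scores):
    """Fraction of buffered scores strictly greater than `score`.

    sorted_scores must be sorted in ascending order.
    """
    idx = bisect.bisect_right(sorted_scores, score)
    return (len(sorted_scores) - idx) / len(sorted_scores)


def choose_depth(q, theta):
    """Largest depth h whose budget theta[h-1] covers quantile q (assumed rule)."""
    depth = 1
    for h, th in enumerate(theta, start=1):
        if q < th:
            depth = h
    return depth


# Toy buffer of ten scores 0.0, 0.1, ..., 0.9 and budgets for depths 1..3.
sorted_scores = [i / 10 for i in range(10)]
theta = [1.0, 0.5, 0.2]
depths = [choose_depth(buffer_quantile(x, sorted_scores), theta)
          for x in (0.85, 0.65, 0.45)]
```

Keeping the buffered scores sorted makes each quantile lookup logarithmic in the buffer size; a real replay buffer would additionally need to refresh the scores as the value estimate changes.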

We train QL-DQN on several Atari environments (bellemare2013arcade). Since the goal of our work is to improve sample complexity over fixed-horizon baselines, our metric of interest here is the reward as a function of training time. Hence, in Figure 6 we present the convergence of QL-DQN versus DQN with fixed depths through , as a function of time. The plots consist of the average score across seeds together with std values. As seen, QL-DQN achieves better performance on VideoPinball and Tutankham, while on Solaris and Berzerk, it is on-par with the best fixed lookahead.

The conclusion here is again that we obtain a better or similarly-performing agent to a pre-determined fixed planning horizon. This comes with the benefit of robustness to the expensive hyper-parameter choice of the best fixed horizon per a given environment.

## 8 Discussion

In this paper we propose the first planning and learning algorithms that dynamically adapt the multi-step lookahead horizon as a function of the state and the current value function estimate. We demonstrate the significant potential of adaptive lookahead both theoretically, by proving convergence with improved computational complexity, and empirically, by demonstrating favorable performance in a maze and in Atari. Our algorithms perform as well as the best fixed horizon in hindsight in almost all the experiments, and in some cases they surpass it. Future work should investigate whether the best fixed horizon can always be outperformed by an adaptive horizon.

Theoretically, our guarantees rely on prior knowledge of the (approximate) optimal value, which raises the question whether one can choose lookahead horizons adaptively without any prior knowledge, for example using transfer learning based on similarity between domains. Moreover, when the forward model used to perform the lookahead is inaccurate or learned from data, the adaptive state-dependent lookahead itself may serve as a quantifier for the level of trust in the value function estimate (short lookahead) versus the model (long lookahead). In essence, this can offer a way for state-wise regularization of the learning or planning problem. Our work is also related to the growing Sim2Real literature. In particular, consider the case of having several simulators with different computational costs and fidelity levels. The lookahead problem then translates to choosing in which states to use which simulator with which lookahead.

Our focus in this paper was to reduce the iteration and overall complexity; we thus ignored more intricate details of the forward search itself. Additional practical aspects such as CPU-GPU planning efficiency tradeoffs hallak2021improve can also affect the lookahead selection problem.

## Appendix A Proofs

### A.1 Proof of Theorem 4.2

#### Bounding the number of iterations to convergence.

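Let $B_t$ be the operator induced by the algorithm, i.e., it is $T$ for states that are contracted by $\kappa$ after $1$-step lookahead and $T^{h(\kappa)}$ for the other states. The TLPI algorithm ensures that, for every iteration $t$ and state $s$,

$$\left|B_t[V^{\pi_t}](s) - V^\star(s)\right| \le \kappa \|V^\star - V^{\pi_t}\|_\infty,$$

and therefore: $\|B_t[V^{\pi_t}] - V^\star\|_\infty \le \kappa \|V^\star - V^{\pi_t}\|_\infty$. Next, we show that the sequence $\{\|V^\star - V^{\pi_t}\|_\infty\}_{t \ge 0}$ is contracting with coefficient $\kappa$. To show this, we split the states into two groups:

1. $s \in X_t$. Thus, TLPI uses $1$-step lookahead for $s$, which implies:

$$
\begin{aligned}
V^\star(s) - V^{\pi_{t+1}}(s) &= V^\star(s) - T[V^{\pi_t}](s) + T[V^{\pi_t}](s) - T^{\pi_{t+1}}[V^{\pi_t}](s) + T^{\pi_{t+1}}[V^{\pi_t}](s) - T^{\pi_{t+1}}[V^{\pi_{t+1}}](s) \\
&\le V^\star(s) - T[V^{\pi_t}](s) + T^{\pi_{t+1}}[V^{\pi_t}](s) - T^{\pi_{t+1}}[V^{\pi_{t+1}}](s) \\
&\le \kappa \|V^\star - V^{\pi_t}\|_\infty + \gamma P^{\pi_{t+1}}\left(V^{\pi_t} - V^{\pi_{t+1}}\right)(s) \\
&\le \kappa \|V^\star - V^{\pi_t}\|_\infty,
\end{aligned}
$$

where the second step follows since $T[V^{\pi_t}](s) = T^{\pi_{t+1}}[V^{\pi_t}](s)$, and the last step is by monotonicity of PI.

2. $s \notin X_t$. Thus, TLPI uses $h(\kappa)$-step lookahead for $s$, which implies:

$$
\begin{aligned}
V^\star(s) - V^{\pi_{t+1}}(s) &= T T^{h(\kappa)-1}[V^\star](s) - T T^{h(\kappa)-1}[V^{\pi_t}](s) + T T^{h(\kappa)-1}[V^{\pi_t}](s) \\
&\quad - T^{\pi_{t+1}} T^{h(\kappa)-1}[V^{\pi_t}](s) + T^{\pi_{t+1}} T^{h(\kappa)-1}[V^{\pi_t}](s) - T^{\pi_{t+1}}[V^{\pi_{t+1}}](s) \\
&\le T T^{h(\kappa)-1}[V^\star](s) - T T^{h(\kappa)-1}[V^{\pi_t}](s) + T^{\pi_{t+1}} T^{h(\kappa)-1}[V^{\pi_t}](s) - T^{\pi_{t+1}}[V^{\pi_{t+1}}](s) \\
&\le \gamma \left\|T^{h(\kappa)-1}[V^\star] - T^{h(\kappa)-1}[V^{\pi_t}]\right\|_\infty + \gamma P^{\pi_{t+1}}\left(T^{h(\kappa)-1}[V^{\pi_t}] - V^{\pi_{t+1}}\right)(s) \\
&\le \gamma \left\|T^{h(\kappa)-1}[V^\star] - T^{h(\kappa)-1}[V^{\pi_t}]\right\|_\infty \\
&\le \gamma^{h(\kappa)} \|V^\star - V^{\pi_t}\|_\infty \le \kappa \|V^\star - V^{\pi_t}\|_\infty,
\end{aligned}
$$

where the second step follows since $T T^{h(\kappa)-1}[V^{\pi_t}](s) = T^{\pi_{t+1}} T^{h(\kappa)-1}[V^{\pi_t}](s)$, the fourth step is by monotonicity of PI, the fifth step is since $T^{h(\kappa)-1}$ is $\gamma^{h(\kappa)-1}$-contracting (because $T$ is $\gamma$-contracting), and the last step is by definition of $h(\kappa)$.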

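We now follow the proof of scherrer2016improved for bounding the number of iterations of PI, and introduce the notation $A^{\pi'}_{\pi} = T^{\pi'}[V^{\pi}] - V^{\pi}$. Then,

$$\|A^{\pi_t}_{\pi^\star}\|_\infty = \|T^{\pi_t}[V^\star] - V^\star\|_\infty \le \|V^\star - V^{\pi_t}\|_\infty \le \kappa^t \|V^\star - V^{\pi_0}\|_\infty \le \frac{\kappa^t}{1-\gamma} \|A^{\pi_0}_{\pi^\star}\|_\infty,$$

where the first and last inequalities are by scherrer2016improved, and the second inequality is because the sequence $\{\|V^\star - V^{\pi_t}\|_\infty\}_{t \ge 0}$ is contracting with coefficient $\kappa$. By the definition of the max-norm, and as $A^{\pi_0}_{\pi^\star} \le 0$ (using the fact that $\pi^\star$ is optimal), there exists a state $s_0$ such that $-A^{\pi_0}_{\pi^\star}(s_0) = \|A^{\pi_0}_{\pi^\star}\|_\infty$. We deduce that for all $t$,

$$-A^{\pi_t}_{\pi^\star}(s_0) \le \|A^{\pi_t}_{\pi^\star}\|_\infty \le \frac{\kappa^t}{1-\gamma} \|A^{\pi_0}_{\pi^\star}\|_\infty = -\frac{\kappa^t}{1-\gamma} A^{\pi_0}_{\pi^\star}(s_0).$$

As a consequence, the action $\pi_t(s_0)$ must be different from $\pi_0(s_0)$ when $\frac{\kappa^t}{1-\gamma} < 1$, that is for all values of $t$ satisfying

$$t \ge t^\star = \left\lceil \frac{\log\frac{1}{1-\gamma}}{\log\frac{1}{\kappa}} \right\rceil.$$

In other words, if some policy $\pi$ is not optimal, then one of its non-optimal actions will be eliminated for good after at most $t^\star$ iterations. By repeating this argument, one can eliminate all non-optimal actions (there are at most $SA$ of them), and the result follows.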
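For a sense of scale, the elimination horizon $t^\star$ can be evaluated numerically. The snippet below is a small illustration with assumed values $\gamma = 0.9$ and $\kappa \in \{\gamma, \gamma^3\}$; the function name is hypothetical.

```python
import math


def t_star(gamma, kappa):
    """Elimination horizon: ceil(log(1/(1-gamma)) / log(1/kappa))."""
    return math.ceil(math.log(1.0 / (1.0 - gamma)) / math.log(1.0 / kappa))


one_step = t_star(0.9, 0.9)         # kappa = gamma, i.e. standard PI
three_step = t_star(0.9, 0.9 ** 3)  # kappa = gamma**3
```

Since $\log\frac{1}{\gamma^h} = h \log\frac{1}{\gamma}$, moving from $\kappa = \gamma$ to $\kappa = \gamma^h$ shrinks the ratio inside the ceiling by a factor of $h$, which is the source of the iteration savings.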

#### Bounding the per-iteration computational complexity.

In every iteration $t$, we first perform $1$-step improvement in all of the states. This has a computational cost of $S \cdot c(1)$. By definition of $\theta(\kappa)$, after performing $1$-step improvement in all of the states, there are at most $\theta(\kappa) S$ states that are not contracted by at least $\kappa$. Thus, by definition of the TLPI algorithm, we perform $h(\kappa)$-step improvement in at most $\theta(\kappa) S$ states. This has a computational cost of $\theta(\kappa) S \cdot c(h(\kappa))$.

### A.2 Proof of Theorem 4.4

#### Bounding the number of iterations to convergence.

The proof follows the same path as the proof of Theorem 4.2, but now the operator induced by the algorithm is $\kappa(\theta)$-contracting instead of $\kappa$-contracting.

#### Bounding the per-iteration computational complexity.

By definition of the QLPI algorithm, in every iteration $t$, we perform $h$-step lookahead in at most $\theta_h S$ states. For every $h$, the computational complexity of performing $h$-step lookahead in $\theta_h S$ states is at most $\theta_h S \cdot c(h)$, which gives a total computational complexity of $S \sum_{h=1}^{H} \theta_h \cdot c(h)$ per iteration.

### A.3 Proof of Theorem 5.2

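To prove the theorem, it suffices to show that we use $h(\kappa)$-step improvement in all the states that do not achieve $\kappa$-contraction after $1$-step improvement. Let $s$ be such that $\left|V^\star(s) - T[V^{\pi_t}](s)\right| > \kappa \|V^\star - V^{\pi_t}\|_\infty$. We want to show that $\left|\tilde V^\star(s) - T[V^{\pi_t}](s)\right| > \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \beta$, because this implies that $h(\kappa)$-step improvement will be used. Indeed,

$$
\begin{aligned}
\left|\tilde V^\star(s) - T[V^{\pi_t}](s)\right| &= \left|\tilde V^\star(s) - V^\star(s) + V^\star(s) - T[V^{\pi_t}](s)\right| \\
&\ge \left|V^\star(s) - T[V^{\pi_t}](s)\right| - \left|\tilde V^\star(s) - V^\star(s)\right| \\
&\ge \left|V^\star(s) - T[V^{\pi_t}](s)\right| - \epsilon \\
&> \kappa \|V^\star - V^{\pi_t}\|_\infty - \epsilon \\
&= \kappa \|V^\star - \tilde V^\star + \tilde V^\star - V^{\pi_t}\|_\infty - \epsilon \\
&\ge \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \kappa \|V^\star - \tilde V^\star\|_\infty - \epsilon \\
&\ge \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \kappa\epsilon - \epsilon = \kappa \|\tilde V^\star - V^{\pi_t}\|_\infty - \beta,
\end{aligned}
$$

where the second and sixth steps follow by the triangle inequality, the third and seventh steps are by definition of $\epsilon$, the fourth step follows because $s$ did not achieve $\kappa$-contraction after $1$-step improvement, and the last step is by definition of $\beta$.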

### A.4 Proof of Theorem 5.4

#### Bounding the number of iterations to convergence.

Denote by $\widetilde{\text{ALG}}$ the algorithm QLPI with the approximation $\tilde V^\star$ and the quantiles $\theta_h + m/S$, and by ALG the algorithm QLPI with the accurate $V^\star$ and the quantiles $\theta$. In order to bound the number of iterations of $\widetilde{\text{ALG}}$ by that of ALG, we need to show that a contraction factor of $\kappa(\theta)$ is still achieved. This follows because, for every state $s$, $\widetilde{\text{ALG}}$ performs lookahead with a horizon that is at least as large as the one ALG uses, as we show next.

Let $s \in \mathcal{S}$ and assume that ALG uses $h$-step lookahead in state $s$ in some iteration $t$. Then, this means that in the ordering of the states induced by $V^\star$, the position of $s$ is such that lookahead $h$ is chosen for it. Recall that $\tilde V^\star$ is $m$-order-preserving, which means that this position shifts by at most $m$ in the ordering induced by $\tilde V^\star$, but since the quantiles are now also larger by $m/S$ we will still use lookahead of $h$ in state $s$.

#### Bounding the per-iteration computational complexity.

By Theorem 4.4, when we run QLPI with the quantiles $\theta_h + m/S$, the per-iteration computational complexity is:

$$O\left(S \cdot \sum_{h=1}^{H} (\theta_h + m/S) \cdot c(h)\right) = O\left(S \cdot \sum_{h=1}^{H} \theta_h \cdot c(h)\right) + O\left(m \cdot \sum_{h=1}^{H} c(h)\right),$$

and the first term on the right hand side of the equation is exactly the per-iteration computational complexity of running QLPI with the quantiles $\theta$.