# Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs - Extended Version

In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties. This extended version contains the rigorous proofs of the main theorem. A version of this paper appeared in ICML'09.

## 1. Introduction

Factored Markov decision processes (FMDPs) are practical ways to compactly formulate sequential decision problems—provided that we have ways to solve them. When the environment is unknown, all effective reinforcement learning methods apply some form of the “optimism in the face of uncertainty” principle: whenever the learning agent faces the unknown, it should assume high rewards in order to encourage exploration. Factored optimistic initial model (FOIM) takes this principle to the extreme: its model is initialized to be overly optimistic. For more often visited areas of the state space, the model gradually gets more realistic, inspiring the agent to head for unknown regions and explore them, in search of some imaginary “Garden of Eden”. The working of the algorithm is simple to the extreme: it will not make any explicit effort to balance exploration and exploitation, but always follows the greedy optimal policy with respect to its model. We show in this paper that this simple (even simplistic) trick is sufficient for effective FMDP learning.

The algorithm is an extension of OIM (optimistic initial model) [Szita08Many], which is a sample-efficient learning algorithm for flat MDPs. There is an important difference, however, in the way the model is solved. Every time the model is updated, the corresponding value function needs to be re-calculated (or updated). For flat MDPs, this is not a problem: various dynamic programming-based algorithms (like value iteration) can solve the model to any required accuracy in polynomial time.

The situation is less bright for generating near-optimal FMDP solutions: all currently known algorithms may take exponential time, e.g. the approximate policy iteration of [Boutilier00Stochastic] using decision-tree representations of policies, or solving the exponential-size flattened version of the FMDP. If we require polynomial running time (as we do in this paper in search for a practical algorithm), then we have to accept sub-optimal solutions. The only known example of a polynomial-time FMDP planner is factored value iteration (FVI) [Szita08Factored], which will serve as the base planner for our learning method. This planner is guaranteed to converge, and the error of its solution is bounded by a term depending only on the quality of function approximators.

Our analysis of the algorithm will follow the established techniques for analyzing sample-efficient reinforcement learning (like the works of [Kearns98Near-Optimal, Brafman01R-MAX, Kakade03Sample, Strehl05Theoretical, Szita08Many] on flat MDPs and [Strehl07Model-Based] on FMDPs). However, the listed proofs of convergence rely critically on access to a near-optimal planner, so they have to be generalized suitably. By doing so, we are able to show that FOIM converges to a bounded-error solution in polynomial time with high probability.

We introduce basic concepts and notations in section 2, then in section 3 we review existing work, with special emphasis on the immediate ancestors of our method. In sections 4 and 5 we describe the blocks of FOIM and the FOIM algorithm, respectively. We finish the paper with a short analysis and discussion.

## 2. Basic concepts and notations

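An MDP is characterized by a quintuple $(X, A, R, P, \gamma)$, where $X$ is a finite set of states; $A$ is a finite set of possible actions; $R : X \times A \to \mathbb{R}$ is the reward function of the agent; $P : X \times A \times X \to [0,1]$ is the transition function; and finally, $\gamma \in [0,1)$ is the discount rate on future rewards. A (stationary, Markov) policy of the agent is a mapping $\pi : X \times A \to [0,1]$. The optimal value function $V^* : X \to \mathbb{R}$ gives the maximum attainable total rewards for each state, and satisfies the Bellman equation

$$V^*(x) = \max_{a \in A} \Big[ R(x,a) + \gamma \sum_{y \in X} P(y \mid x, a)\, V^*(y) \Big]. \tag{1}$$

Given the optimal value function, it is easy to get an optimal policy: $\pi^*(x,a) = 1$ iff $a = \arg\max_{a' \in A} \big[ R(x,a') + \gamma \sum_{y \in X} P(y \mid x, a')\, V^*(y) \big]$ and $0$ otherwise.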

### 2.1. Vector notation

Let $N := |X|$, and suppose that states are integers from 1 to $N$, i.e. $X = \{1, 2, \ldots, N\}$. Clearly, value functions are equivalent to $N$-dimensional vectors of reals, which may be indexed with states. The vector corresponding to $V$ will be denoted as $v$ and the value of state $x$ by $v_x$. Similarly, for each $a \in A$ let us define the $N$-dimensional column vector $r^a$ with entries $r^a_x = R(x,a)$ and the $N \times N$ matrix $P^a$ with entries $P^a_{x,y} = P(y \mid x, a)$.

The Bellman equations can be expressed in vector notation as $v^* = \max_{a \in A} (r^a + \gamma P^a v^*)$, where max denotes the componentwise maximum operator. The Bellman equations are the basis of many RL algorithms, most notably, value iteration:

$$v_{t+1} := \max_{a \in A} \big( r^a + \gamma P^a v_t \big), \tag{2}$$

which converges to $v^*$ for any initial vector $v_0$.
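As an illustration of the value iteration update (2), here is a minimal pure-Python sketch. The function name, the stopping tolerance and the two-state example MDP are assumptions made for this illustration and do not come from the paper.

```python
def value_iteration(rewards, transitions, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v <- max_a (r^a + gamma * P^a v) until the update is below tol.

    rewards[a][x]        : reward for taking action a in state x
    transitions[a][x][y] : probability of moving from x to y under action a
    """
    n_states = len(rewards[0])
    v = [0.0] * n_states
    for _ in range(max_iters):
        new_v = [
            max(
                rewards[a][x]
                + gamma * sum(transitions[a][x][y] * v[y] for y in range(n_states))
                for a in range(len(rewards))
            )
            for x in range(n_states)
        ]
        if max(abs(p - q) for p, q in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical two-state, two-action MDP (not from the paper).
rewards = [[0.0, 1.0], [0.0, 1.0]]        # reward 1 whenever the agent is in state 1
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],             # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],             # action 1: switch
]
v_star = value_iteration(rewards, transitions, gamma=0.9)
```

In this toy problem the agent should move to state 1 and stay there, so the values approach 9 and 10 for states 0 and 1. The cost of each sweep grows with the number of states, which is what makes the flat approach impractical for factored problems.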

### 2.2. Factored structure

We assume that $X$ is the Cartesian product of $m$ smaller state spaces (corresponding to individual variables):

$$X = X_1 \times X_2 \times \ldots \times X_m.$$

For the sake of notational convenience we will assume that each $X_i$ has the same size, $|X_1| = |X_2| = \ldots = |X_m| = n$. With this notation, the size of the full state space is $N = |X| = n^m$. We note that all derivations and proofs carry through to different size variable spaces.

###### Definition 2.1.

For any subset of variable indices $Z \subseteq \{1, 2, \ldots, m\}$, let $X[Z] := \times_{i \in Z} X_i$, furthermore, for any $x \in X$, let $x[Z]$ denote the value of the variables with indices in $Z$. We shall also use the notation $x[Z]$ without specifying a full vector of values $x$, in such cases $x[Z]$ denotes an element in $X[Z]$. For single-element sets $Z = \{i\}$ we shall also use the shorthand $x[\{i\}] = x[i]$.

###### Definition 2.2 (Local-scope function).

A function $f$ is a local-scope function if it is defined over a subspace $X[Z]$ of the state space, where $Z$ is a (presumably small) index set.

If $|Z|$ is small, local-scope functions can be represented efficiently, as they can take only $n^{|Z|}$ different values.

###### Definition 2.3 (Extension).

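Let $f : X[Z] \to \mathbb{R}$ be a local-scope function. Its extension to the whole state space is defined by $f(x) := f(x[Z])$. The extension operator for $Z$ is a linear operator with a matrix $E[Z] \in \mathbb{R}^{|X| \times |X[Z]|}$, with entries

$$(E[Z])_{u, v[Z]} = \begin{cases} 1, & \text{if } u[Z] = v[Z]; \\ 0, & \text{otherwise.} \end{cases}$$

For any local-scope function $f$ with a corresponding vector representation $\mathbf{f}$, $E[Z]\mathbf{f}$ is the vector representation of the extended function.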
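The extension operator can be made concrete with a small sketch. It builds the 0/1 matrix $E[Z]$ explicitly for a toy problem and applies it to the table of a local-scope function. The helper names, the 0-based indexing and the three-variable example are assumptions of this illustration; an efficient implementation would never materialize the full matrix.

```python
from itertools import product


def extension_matrix(domain_sizes, scope):
    """Build E[Z] as a list of rows: one row per full state, one column per local state.

    domain_sizes : sizes of the variable domains X_1, ..., X_m
    scope        : tuple of 0-based variable indices forming Z
    """
    full_states = list(product(*(range(s) for s in domain_sizes)))
    local_states = list(product(*(range(domain_sizes[i]) for i in scope)))
    matrix = [
        [1 if tuple(u[i] for i in scope) == v else 0 for v in local_states]
        for u in full_states
    ]
    return full_states, local_states, matrix


def extend(matrix, local_values):
    """Multiply E[Z] by the vector representation of a local-scope function."""
    return [sum(e * f for e, f in zip(row, local_values)) for row in matrix]


# Three binary variables, local scope Z = {0, 2} (0-based): the table has 2**2 entries.
full_states, local_states, E = extension_matrix([2, 2, 2], scope=(0, 2))
f_local = [10.0, 11.0, 12.0, 13.0]        # values for (0,0), (0,1), (1,0), (1,1)
f_full = extend(E, f_local)
```

Each row of `E` contains exactly one 1, so extending a function only copies table entries: the full vector has $n^m = 8$ entries, but only $n^{|Z|} = 4$ distinct parameters.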

We assume that the reward function is the sum of $J$ local-scope functions $R_j$ with scopes $Z_j$: $R(x,a) = \sum_{j=1}^{J} R_j(x[Z_j], a)$. In vector notation: $r^a = \sum_{j=1}^{J} E[Z_j]\, r^a_j$. We also assume that for each variable $i$ there exist neighborhood sets $\Gamma_i$ such that the value of $x_{t+1}[i]$ depends only on $x_t[\Gamma_i]$ and the action $a_t$ taken. Then we can write the transition probabilities in a factored form

$$P(y \mid x, a) = \prod_{i=1}^{m} P_i\big(y[i] \mid x[\Gamma_i], a\big) \tag{3}$$

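for each $x, y \in X$, $a \in A$, where each factor $P_i$ is a local-scope function $P_i : X[\Gamma_i] \times A \times X_i \to [0,1]$ (for all $i \in \{1, \ldots, m\}$). In vector/matrix notation, for any vector $v \in \mathbb{R}^{|X|}$, $P^a v = \bigotimes_{i=1}^{m} P^a_i v$, where $\otimes$ denotes the Kronecker product. Finally, we assume that the size of all local scopes are bounded by a small constant $m_f$: $|\Gamma_i| \le m_f$ for all $i$. As a consequence, all probability factors can be represented with tables having at most $N_f := n^{m_f}$ rows.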
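The factored form (3) can be illustrated with a short sketch that multiplies local factors to obtain a full transition probability. The data layout (one dictionary per factor, keyed by the local state and the action), the 0-based indices and the numbers are hypothetical choices for this example only.

```python
from itertools import product


def factored_transition_prob(y, x, a, factors, neighborhoods):
    """Return P(y | x, a) as the product of the local factors P_i.

    factors[i][(x_local, a)][y_i] : probability that variable i takes value y_i
    neighborhoods[i]              : tuple of 0-based indices forming Gamma_i
    """
    prob = 1.0
    for i, (factor, gamma_i) in enumerate(zip(factors, neighborhoods)):
        x_local = tuple(x[j] for j in gamma_i)   # the part of x that factor i sees
        prob *= factor[(x_local, a)][y[i]]
    return prob


# Hypothetical FMDP with two binary variables and a single action (a = 0).
# Variable 0 depends only on itself; variable 1 depends on both variables.
neighborhoods = [(0,), (0, 1)]
factors = [
    {((0,), 0): [0.9, 0.1], ((1,), 0): [0.2, 0.8]},
    {
        ((0, 0), 0): [1.0, 0.0],
        ((0, 1), 0): [0.5, 0.5],
        ((1, 0), 0): [0.3, 0.7],
        ((1, 1), 0): [0.0, 1.0],
    },
]
x = (1, 0)
row = {
    y: factored_transition_prob(y, x, 0, factors, neighborhoods)
    for y in product(range(2), repeat=2)
}
```

Because every factor is a proper distribution over its own variable, the products over all next states sum to one. The model stores one small table per factor instead of one row per full state.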

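An FMDP is fully characterized by the tuple $M = \big(\{X_i\}_{i=1}^{m};\ A;\ \{R_j\}_{j=1}^{J};\ \{Z_j\}_{j=1}^{J};\ \{\Gamma_i\}_{i=1}^{m};\ \{P_i\}_{i=1}^{m};\ \gamma\big)$.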

## 3. Related literature

The idea of representing a large MDP using a factored model was first proposed by [Koller00Policy] but similar ideas appear already in the works of [Boutilier95Exploiting, Boutilier00Stochastic].

### 3.1. Planning in known FMDPs

Decision trees (or equivalently, decision lists) provide a way to represent the agent’s policy compactly. [Koller00Policy] and [Boutilier95Exploiting, Boutilier00Stochastic] present algorithms to evaluate and improve such policies, according to the policy iteration scheme. Unfortunately, the size of the policies may grow exponentially even with a decision tree representation [Boutilier00Stochastic, Liberatore02Size].

The exact Bellman equations (1) can be transformed to an equivalent linear program with $N$ variables and $N \cdot |A|$ constraints, where $N = |X|$ is the number of states. In the approximate linear programming approach, we approximate the value function as a linear combination of $K$ basis functions, resulting in an approximate LP with $K$ variables and $N \cdot |A|$ constraints. Both the objective function and the constraints can be written in compact forms, exploiting the local-scope property of the appearing functions. [Guestrin02Efficient] show that the maximum of exponentially many local-scope functions can be computed by rephrasing the task as a non-serial dynamic programming task and eliminating variables one by one. Therefore, the equations can be transformed to an equivalent, more compact linear program. The gain may be exponential, but this is not necessarily so in all cases. Furthermore, solutions will not be (near-)optimal because of the function approximation; the best that can be proved is bounded error from the optimum (where the bound depends on the quality of basis functions used for approximation).

The approximate policy iteration algorithm [Koller00Policy, Guestrin02Efficient] also uses an approximate LP reformulation, but it is based on the policy-evaluation Bellman equations. Policy-evaluation equations are, however, linear and do not contain the maximum operator, so there is no need for a costly transformation step. On the other hand, the algorithm needs an explicit decision tree representation of the policy. [Liberatore02Size] has shown that the size of the decision tree representation can grow exponentially. Furthermore, the convergence properties of these algorithms are unknown.

Factored value iteration [Szita08Factored] also approximates the value function as a linear combination of basis functions, but uses a variant of approximate value iteration: the projection operator is modified to avoid divergence. FVI converges in a polynomial number of steps, but the solution may be sub-optimal. The error of the solution has bounded distance from the optimal value function, where the bound depends on the quality of function approximation. As an integral part of FOIM, FVI is described in detail in Section 4.1.

### 3.2. Reinforcement Learning in FMDPs

In the reinforcement learning setting, the agent interacts with an FMDP environment with unknown parameters. In the model-based approach, the agent has to learn the structure of the FMDP (i.e., the dependency sets $\Gamma_i$ and the reward domains $Z_j$), the transition probability factors $P_i$ and the reward factors $R_j$.

Unknown transitions. Most approaches assume that the structure of the FMDP and the reward functions are known, so only transition probabilities need to be learnt. Examples include the factored versions of sample-efficient model-based RL algorithms: factored E³ [Kearns99Efficient], factored R-max [Guestrin02Algorithm-Directed], or factored MBIE [Strehl07Model-Based]. All the above-mentioned algorithms have polynomial sample complexity (in all relevant task parameters), and require polynomially many calls to an FMDP-planner. Note however, that all of the mentioned approaches require access to a planner that is able to produce $\epsilon$-optimal solutions, and to date, no algorithm exists that would accomplish this accuracy in polynomial time. (The assumption of [Kearns99Efficient] is slightly less restrictive: they only require that the value of the returned policy is at least a fixed fraction of the optimal value. However, no planner is known that can achieve this and cannot achieve near-optimality.) [Guestrin02Algorithm-Directed] also present an algorithm where exploration is guided by the uncertainties of the linear programming solution. While this approach does not require access to a near-optimal planner, no formal performance bounds are known.

Unknown rewards. Typically, it is asserted that the rewards can be approximated from observations analogously to transition probabilities. However, if the reward is composed of multiple factors (i.e., $J > 1$), then we can only observe the sums of unknown quantities, not the individual quantities themselves. To date, we know of no efficient approximation method for learning factored rewards.

Unknown structure. Few attempts exist that try to obtain the structure of the FMDP automatically. [Strehl07Efficient] present a method that learns the structure of an FMDP in polynomial time (in all relevant parameters).

## 4. Building blocks of FOIM

We describe the two main building blocks of our algorithm, factored value iteration and optimistic initial model.

### 4.1. Factored value iteration

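We assume that all value functions are approximated as the linear combination of $K$ basis functions $h_k : X \to \mathbb{R}$: $V(x) = \sum_{k=1}^{K} w_k h_k(x)$.

Let $H$ be the $|X| \times K$ matrix mapping feature weights to state values, with entries $H_{x,k} = h_k(x)$, and let $G$ be an arbitrary $K \times |X|$ linear mapping projecting state values to feature weights. Let $w \in \mathbb{R}^K$ denote the weight vector of the basis functions. It is known that if $\|HG\|_\infty \le 1$, then the approximate Bellman equations $w^\times = G \max_{a \in A} (r^a + \gamma P^a H w^\times)$ have a unique fixed point solution $w^\times$, and approximate value iteration (AVI)

$$w_{t+1} := G \max_{a \in A} \big( r^a + \gamma P^a H w_t \big) \tag{4}$$

converges there for any starting vector $w_0$.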
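To make the AVI update (4) tangible, the following sketch runs it on a tiny flat problem. The four-state cycle, the state-aggregation features `H` and the averaging projection `G` are hypothetical and chosen only so that the max-norm of $HG$ equals 1; this is not the factored, sampled computation that FVI performs.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def avi_step(w, rewards, transitions, H, G, gamma):
    """One update w <- G max_a (r^a + gamma * P^a H w)."""
    v = matvec(H, w)  # state values implied by the current weights
    backups = [
        [r + gamma * pv for r, pv in zip(rewards[a], matvec(transitions[a], v))]
        for a in range(len(rewards))
    ]
    greedy = [max(per_state) for per_state in zip(*backups)]  # componentwise max
    return matvec(G, greedy)


def approximate_value_iteration(rewards, transitions, H, G, gamma, iters=500):
    w = [0.0] * len(G)
    for _ in range(iters):
        w = avi_step(w, rewards, transitions, H, G, gamma)
    return w


# Hypothetical 4-state cycle; reward 1 in state 3 for either action.
rewards = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
stay = [[1.0 if y == x else 0.0 for y in range(4)] for x in range(4)]
step = [[1.0 if y == (x + 1) % 4 else 0.0 for y in range(4)] for x in range(4)]
transitions = [stay, step]

# State aggregation: feature 0 covers states {0, 1}, feature 1 covers {2, 3}.
H = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
G = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # averages within each block

# Max-norm of HG (largest absolute row sum); the convergence condition needs <= 1.
HG = [[sum(H[x][k] * G[k][y] for k in range(2)) for y in range(4)] for x in range(4)]
hg_norm = max(sum(abs(e) for e in row) for row in HG)

w_fixed = approximate_value_iteration(rewards, transitions, H, G, gamma=0.9)
```

With these features the iteration settles at weights of about 4.09 and 5 for the two blocks. These are the values of the AVI fixed point, not of the true optimum, which illustrates why the guarantees in the text are stated relative to the AVI solution.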

###### Definition 4.1.

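Let the AVI-optimal value function be defined as $v^\times = H w^\times$.

As shown by [Szita08Factored], the distance of AVI-optimal value function from the true optimum is bounded by the projection error of $v^*$:

$$\|v^\times - v^*\|_\infty \le \frac{1}{1-\gamma} \|HGv^* - v^*\|_\infty. \tag{5}$$

We make the further assumption that all the basis functions are local-scope ones: for each $k \in \{1, \ldots, K\}$, $h_k : X[C_k] \to \mathbb{R}$, with feature matrices $h_k \in \mathbb{R}^{|X[C_k]| \times 1}$. The feature matrix $H$ can be decomposed as $H = \big(E[C_1] h_1, \ldots, E[C_K] h_K\big)$.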

###### Definition 4.2.

For any matrices and , let the row-normalization of be a matrix of the same size as , and having the entries

Throughout the paper, we shall use the projection matrix .

The AVI equation (4) can be considered as the product of the $K \times |X|$ matrix $G$ and an $|X| \times 1$ vector $v_t$. Using the above assumptions and notations, we can see that for any $x \in X$, the corresponding column of $G$ and the corresponding element of $v_t$ can be computed in polynomial time:

 [G]k,x=1∥[HHT]k,∗∥∞K∑k′=1[HTk′]∗,x[Ck′]; [vt]x=maxa∈A[J∑j=1[raj]x[Zaj]+γK∑k=1E[Γ∪Ck](⨂i∈CkPai)(hkwk,t)]

Factored value iteration draws states uniformly at random, and performs approximate value iteration on this reduced state set.

###### Theorem 4.3 ([Szita08Factored]).

Suppose that For any , , if the sample size is , then with probability at least , factored value iteration converges to a weight vector such that . In terms of the optimal value function,

$$\|v^\times - v^*\|_\infty \le \frac{1}{1-\gamma} \|HGv^* - v^*\|_\infty + \epsilon. \tag{6}$$

### 4.2. Optimistic initial model for flat MDPs

There are a number of sample-efficient learning algorithms for MDPs, e.g., E³, R-max, MBIE, and most recently, OIM. The underlying principle of all these methods is similar: they all maintain an approximate MDP model of the environment. Wherever the uncertainty of the model parameters is high, the models are optimistic. This way, the agent is encouraged to explore the unknown areas, reducing the uncertainty of the models.

Here, we shall use and extend OIM to factored environments. In the OIM algorithm, we introduce a hypothetical “garden of Eden” (GOE) state $x_E$, where the agent gets a very large reward $R_E$ and remains there indefinitely. The model is initialized with fake experience, according to which the agent has experienced an $(x, a, x_E)$ transition for all $x \in X$ and $a \in A$. According to this initial model, every state has a very large value, which is a major overestimation of the true values. The model is continuously updated by the collected experience of the agent, who always takes the greedy optimal action with respect to its current model. For well-explored $(x, a)$ pairs, the optimism of the model vanishes, thus encouraging the agent to explore the less-known areas.

The reason for choosing OIM is twofold: (1) The optimism of the model is ensured at initialization time, and after that, no extra work is needed to ensure the optimism of the model or to encourage exploration. (2) Results on several standard benchmark MDPs indicate that OIM is superior to the other algorithms mentioned.
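The optimistic initialization described above can be sketched for a flat MDP as follows. This is a simplified illustration, not the OIM implementation of [Szita08Many]: the class and method names are hypothetical, the fake experience is a single transition per state-action pair, and the garden-of-Eden state is assumed to be absorbing with reward `r_eden` at every step.

```python
class OptimisticModel:
    """Empirical flat-MDP model seeded with one fake transition to a garden-of-Eden state."""

    def __init__(self, n_states, n_actions, r_eden):
        self.n_states, self.n_actions = n_states, n_actions
        self.eden = n_states                      # index of the extra state
        self.r_eden = r_eden
        # counts[x][a][y]; the fake experience is one (x, a, eden) transition per pair
        self.counts = [[[0] * (n_states + 1) for _ in range(n_actions)]
                       for _ in range(n_states)]
        for x in range(n_states):
            for a in range(n_actions):
                self.counts[x][a][self.eden] = 1

    def update(self, x, a, y):
        """Record one real transition (x, a, y)."""
        self.counts[x][a][y] += 1

    def transition_probs(self, x, a):
        total = sum(self.counts[x][a])
        return [c / total for c in self.counts[x][a]]

    def plan(self, rewards, gamma, iters=200):
        """Value iteration on the current model; rewards[x][a] are assumed known."""
        v_eden = self.r_eden / (1.0 - gamma)      # value of staying in Eden forever
        v = [0.0] * self.n_states + [v_eden]
        for _ in range(iters):
            q = [[rewards[x][a] + gamma * sum(
                      p * vy for p, vy in zip(self.transition_probs(x, a), v))
                  for a in range(self.n_actions)]
                 for x in range(self.n_states)]
            v = [max(row) for row in q] + [v_eden]
        return q


model = OptimisticModel(n_states=2, n_actions=2, r_eden=10.0)
rewards = [[0.0, 0.0], [0.0, 0.0]]
initial_eden_prob = model.transition_probs(0, 0)[model.eden]
for _ in range(9):                                # nine real visits to (x=0, a=0)
    model.update(0, 0, 1)
q = model.plan(rewards, gamma=0.9)
```

After nine real visits the pair (0, 0) leads to Eden with probability 0.1 only, so the greedy agent prefers the untried action 1 in state 0. Exploration arises purely from the initialization and the greedy choice, with no separate exploration rule.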

## 5. Learning in FMDPs with an Optimistic initial model

Similarly to other approaches, we will make the assumptions that (a) the dependencies are known, and (b) the reward function is known, only the transition probabilities need to be learned.

### 5.1. Optimistic initial model for factored MDPs

During the learning process, we will maintain approximations of the model, in particular, of the transition probability factors. We extend all state factors with the hypothetical “garden of Eden” state $x_E$. Seeing the current state $x$ and the action $a$ taken, the transition model should give the probabilities of various next states $y$. Specifically, the $i$th factor of the transition model should give the probabilities of various $y[i]$ values, given $x[\Gamma_i]$ and $a$. Initially, the agent has no idea, so we let it start with an overly optimistic model: we inject the fake experience to the model that taking action $a$ in $x[\Gamma_i]$ leads to a state with $i$th component $x_E$. This optimistic model will encourage the agent to explore action $a$ whenever its state is consistent with $x[\Gamma_i]$. After many visits to $(x[\Gamma_i], a)$, the weight of the initial fake experience will shrink, and the optimistic belief of the agent (together with its exploration-boosting effect) fades away. However, by that time, the collected experience provides an accurate approximation of the $P_i(y[i] \mid x[\Gamma_i], a)$ values.

So, according to the initial model (based purely on fake experience),

$$\hat P(y \mid x, a) = \begin{cases} 1, & \text{if } y = (x_E, \ldots, x_E); \\ 0, & \text{otherwise,} \end{cases}$$

if components of are . This model is optimistic indeed, all non-GOE states have value at least . Note that it is not possible to encode the -rewards for the states using the original set of reward factors, so for all state factor , we add a new reward factor with local scope : , defining With this modification, we are able to fully specify our algorithm, as shown in the pseudocode below.

### 5.2. Analysis

Below we prove that FOIM gets as good as possible. What is “as good as possible”? We clearly cannot expect better policies than the one the planner would output, were the parameters of the FMDP known. And because of the polynomial-running-time constraint on the planner, it will not be able to compute a near-optimal solution. However, we can prove that FOIM gets $\epsilon$-close to the solution of the planner (which is AVI-near-optimal if the planner is FVI), except for a polynomial number of mistakes during its run. (We use the terms polynomial and polynomial in all relevant quantities as shorthand for polynomial in the parameters that appear in the bound of Theorem 5.1.)

###### Theorem 5.1.

Suppose that an agent is following FOIM in an unknown FMDP, where all reward components fall into the interval $[0, R_{\max}]$, there are $m$ state factors, and all probability- and reward-factors depend on at most $m_f$ factors. Let $N_f = n^{m_f}$ and let $\epsilon > 0$ and $\delta > 0$. If the initial values of FOIM satisfy

$$R_E = c \cdot \frac{m R_{\max}^2}{(1-\gamma)^4 \epsilon} \left[ \log \frac{m N_f |A|}{(1-\gamma)\epsilon\delta} \right],$$

then the number of timesteps when FOIM makes non-AVI-near-optimal moves, i.e., when is bounded by

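$$O\left( \frac{R_{\max}^2 m^4 N_f |A|}{\epsilon^4 (1-\gamma)^4} \log^3 \frac{1}{\delta} \log^2 \frac{m N_f |A|}{\epsilon} \right)$$

with probability at least $1 - \delta$.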

Proof sketch. The proof uses standard techniques from the literature of sample-efficient reinforcement learning. Most notably, our proof follows the structure of [Strehl07Model-Based]. There are two important differences compared to previous approaches: (1) we may not assume that the planner is able to output a near-optimal solution, and (2) FOIM may make an unbounded number of model updates, so we cannot make use of the standard argument that “we are encountering only finitely many different models, each of them fails with negligible probability, so the whole algorithm fails with negligible probability”. Instead, a more careful analysis of the failure probability is needed. The rigorous proof can be found in the appendix.

#### 5.2.1. Boundedness of value functions

According to our assumptions, all rewards fall between and . From this, it is easy to derive an upper bound on the magnitude of the AVI-optimal value function . The bound we get is . For future reference, we note that .

#### 5.2.2. From visit counts to model accuracy

The FOIM algorithm builds a transition probability model by keeping track of visit counts to state-action components $(x[\Gamma_i], a)$ and state-action-state transition components $(x[\Gamma_i], a, y[i])$. First of all, we show that if a state-action component is visited many times, then the corresponding probability components become accurate.

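Let us fix a timestep $t$, a probability factor $i$ and a state-action component $(x[\Gamma_i], a)$, and $\epsilon_t > 0$. Let us denote the number of visits to the component up to time $t$ by $k_t(x[\Gamma_i], a)$. Let us introduce the shorthands $p_i = P_i(y \mid x[\Gamma_i], a)$ and $\hat p_{t,i} = \hat P_{t,i}(y \mid x[\Gamma_i], a)$. By Theorem 3 of [Strehl07Model-Based] (an application of the Hoeffding–Azuma inequality),

$$\Pr\Big( \sum_{y \in X_i} \big| p_i - \hat p_{t,i} \big| > \epsilon_t \Big) \le 2^n \exp\Big( -\frac{\epsilon_t^2\, k_t(x[\Gamma_i], a)}{2} \Big). \tag{7}$$

Unfortunately, the above inequality only speaks about a single time step $t$, but we need to estimate the failure probability for the whole run of the algorithm. By the union bound, that is at most

$$\sum_{k=1}^{\infty} \Pr\Big( \sum_{y \in X_i} \big| p_i - \hat p_{t_k,i} \big| > \epsilon_{t_k} \Big). \tag{8}$$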

Let . For , the number of visits is too low, so in eq. (7), either , or the right-hand side is too big. We choose the former: we make the failure probability less than some constant by setting , where . For , the number of visits is sufficiently large, so we can decrease either the accuracy or the failure probability (or even both). It turns out that an approximation accuracy is sufficient, so we decrease failure probability. Let us set . With this choice of and , whenever , furthermore, , so we get that

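$$\sum_{k=1}^{\infty} \Pr\Big( \sum_{y \in X_i} \big| p_i - \hat p_{t_k,i} \big| > \max\Big( \frac{\beta(\delta')}{\sqrt{k}}, \frac{\epsilon(1-\gamma)}{m} \Big) \Big) \le \sum_{k=1}^{k_0 - 1} \delta' + \sum_{k=k_0}^{\infty} 2^n \exp\Big( -\frac{k \epsilon^2}{2 m^2} \Big)$$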

We can repeat this estimation for every state-action components . There are at most of these, so the total failure probability is still less than . This means that

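$$\sum_{y \in X_i} \big| p_i - \hat p_{t,i} \big| \le \max\Big( \frac{\beta(\delta')}{\sqrt{k_t(x[\Gamma_i], a)}}, \frac{\epsilon(1-\gamma)}{m} \Big) \tag{9}$$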

will hold for all pairs and all timesteps with high probability. From now on, we will consider only realizations where the failure event does not happen, but bear in mind that all our statements that are based on (9) are true only with probability.

From (9), we can easily get bounds on the accuracy of the full transition probability function: for all and for all .

#### 5.2.3. The known-state FMDP

A state-action component is called known at timestep if it has been visited at least times, i.e., if . We define the known-component FMDP as follows: (1) its state and action space, rewards, and the decompositions of the transition probabilities (i.e., the dependency sets ) are identical to the corresponding quantities of the true FMDP , and hence to the current approximate FMDP ; (2) for all , and , for any , the corresponding transition probability component is

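$$\begin{cases} \hat P_{t,i}(y_i \mid x[\Gamma_i], a), & \text{if } (x[\Gamma_i], a) \in K_t; \\ P_i(y_i \mid x[\Gamma_i], a), & \text{if } (x[\Gamma_i], a) \notin K_t. \end{cases}$$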

Note that FMDPs and are very close to each other: unknown state-action components have identical transition functions by definition, while for known components, . Consequently, for all ,

$$\sum_{y \in X} \big| P^{K_t}(y \mid x, a) - \hat P_t(y \mid x, a) \big| \le \frac{\epsilon(1-\gamma)}{\gamma V_0}. \tag{10}$$

For an arbitrary policy , let and be the value functions (the fixed points of the approximate Bellman equations) of in and , respectively. By a suitable variant of the Simulation Lemma (see supplementary material) that works with the approximate Bellman equations, we get that whenever (10) holds, .

#### 5.2.4. The FOIM model is optimistic

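First of all, note that FOIM is not directly using the empirical transition probabilities $\hat P_{t,i}$, but it is more optimistic; it gives some chance for getting to the garden of Eden state $x_E$:

$$\hat P^{\mathrm{FOIM}}_{t,i}(y_i \mid x[\Gamma_i], a) = \begin{cases} \dfrac{k_{t,i}}{k_{t,i}+1}\, \hat P_{t,i}(y_i \mid x[\Gamma_i], a), & \text{if } y_i \ne x_E; \\[2mm] \dfrac{1}{k_{t,i}+1}, & \text{otherwise,} \end{cases}$$

where we introduced the shorthand $k_{t,i} = k_t(x[\Gamma_i], a)$.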
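The optimistic factor estimate above amounts to adding one fake transition to the garden of Eden on top of the real counts. A minimal sketch, with a hypothetical function name and count layout, and assuming the Eden label never occurs among the real observations:

```python
def foim_factor_probs(transition_counts, eden):
    """Optimistic estimate for one factor i and one component (x[Gamma_i], a).

    transition_counts : dict mapping observed next values y_i to their counts
    eden              : label of the garden-of-Eden value x_E (never observed)
    """
    k = sum(transition_counts.values())           # number of real visits
    # k/(k+1) times the empirical frequency count/k simplifies to count/(k+1)
    probs = {y: c / (k + 1) for y, c in transition_counts.items()}
    probs[eden] = 1.0 / (k + 1)                   # weight of the single fake transition
    return probs


unvisited = foim_factor_probs({}, eden="E")
visited = foim_factor_probs({0: 3, 1: 1}, eden="E")
```

For an unvisited component all probability mass sits on the Eden value. After four real visits the Eden value keeps weight 1/5, and this weight keeps shrinking as the visit count grows.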

Now, we show that

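$$Q^\times(x,a) - \Big[ R(x,a) + \gamma \sum_{y \in X} \hat P^{\mathrm{FOIM}}_t(y \mid x, a)\, V^\times(y) \Big] \le \Theta\big(\epsilon(1-\gamma)\big), \tag{11}$$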

or equivalently,

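$$\sum_{y \in X} \Big( P(y \mid x, a) - \hat P^{\mathrm{FOIM}}_t(y \mid x, a) \Big) V^\times(y) \ge \sum_{i=1}^{m} \Big[ -\max\Big( \frac{\beta}{\sqrt{k_{t,i}}}, \frac{\epsilon(1-\gamma)}{m} \Big) \cdot \frac{k_{t,i}+1}{k_{t,i}} V_0 + \frac{1}{k_{t,i}} V_E \Big].$$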

Every term in the right-hand side is larger than , provided that we can prove the slightly stronger inequality

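$$-\max\Big( \frac{\beta}{\sqrt{k_{t,i}}}, \frac{\epsilon(1-\gamma)}{m} \Big) \cdot 2 V_0 + \frac{1}{k_{t,i}} V_E \ge -\frac{\epsilon(1-\gamma)}{m}.$$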

First note that if the second term dominates the max expression, then the inequality is automatically true, so we only have to deal with the situation when the first term dominates. In this case, the inequality takes the form which always holds because of our choice of .

We show by induction that and for all and all . The inequalities hold for . When moving from step to ,

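$$\begin{aligned} Q^{(t+1)}(x,a) &= R(x,a) + \gamma \sum_{y \in X} \hat P_t(y \mid x, a)\, V^{(t)}(y) \\ &\ge R(x,a) + \gamma \sum_{y \in X} \hat P_t(y \mid x, a) \big( V^\times(y) - \Theta(\epsilon) \big) \\ &\ge Q^\times(x,a) - \gamma \Theta(\epsilon) - \Theta\big((1-\gamma)\epsilon\big) \end{aligned}$$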

for all , where we applied the induction assumption and eq. (11). Consequently, for all . Note that according to our assumptions, all entries of are nonnegative as well as the entries of , so multiplication by rows of is a monotonous operator, furthermore, all rows sum to 1, yielding

$$\sum_{x \in X} [HG]_{y,x} \max_{a \in A} Q^{(t+1)}(x,a) \ge \sum_{x \in X} [HG]_{y,x} \Big( \max_{a \in A} Q^\times(x,a) - \Theta(\epsilon) \Big),$$

that is, .

#### 5.2.5. Proximity of value functions

The rest of the proof is standard, so we give here a very rough sketch only. We define a cutoff horizon and an escape event which happens at timestep if the agent encounters an unknown transition in the next steps. We will separate two cases depending on whether is smaller than or not. If the probability of escape is low, then we can show that . Otherwise, if is large, then an unknown state-action component is found with significant probability. However, this can happen only at most times (because all components become known after visits), which is polynomial, so the second case can happen only a polynomial number of times.

Finally, we remind that the statements are true only with probability . To round off the proof, we note that we are free to choose the constant in the definition of (as it is hidden in the notation), so we set it in a way that and become at most and , respectively.

## 6. Discussion

FOIM is conceptually very simple: the exploration-exploitation dilemma is resolved without any explicit exploration, action selection is always greedy. The model update and model solution are also at least as simple as the alternatives found in the literature. Further, FOIM has some favorable theoretical properties. FOIM is the first example of an RL algorithm that has a polynomial per-step computational complexity in FMDPs. To achieve this, we had to relax the near-optimality of the FMDP planner. The particular planner we used, FVI, runs in polynomial time, it does reach a bounded error, and the looseness of the bound depends on the quality of basis functions. In almost all time steps, FOIM gets $\epsilon$-close to the FVI value function with high probability (for any pre-specified $\epsilon$). The number of timesteps when this does not happen is polynomial. (Note that in general there may be some hard-to-reach states that are visited after a very long time only, so not all steps will be near-optimal after a polynomial number of steps. This issue was analyzed by [Kakade03Sample], who defined an analogue of “probably approximately correctness” for MDPs.)

From a practical point of view, calling an FMDP model-solver in each iteration could be prohibitive. However, the model and the value function usually change very little after a single model update, so we may initialize FVI with the previous value function, and a few iterations might be sufficient.

## Acknowledgements

This work has been supported by the EC NEST Perceptual Consciousness: Explication and Testing grant under contract 043261. Opinions and errors in this manuscript are the authors' responsibility, they do not necessarily reflect the opinions of the EC or other project members. The first author has been partially supported by the Fulbright Scholarship.

## Appendix A The Proof of Theorem 5.1

### A.1. General lemmas

###### Lemma A.1.

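(Azuma’s Inequality) If the random variables $X_1, X_2, \ldots$ form a martingale difference sequence, meaning that $E[X_k \mid X_1, \ldots, X_{k-1}] = 0$ for all $k$, and $|X_k| \le b$ for each $k$, then

$$\Pr\Big[ \sum_{i=1}^{k} X_i \ge a \Big] \le \exp\Big( -\frac{a^2}{2 b^2 k} \Big)$$

and

$$\Pr\Big[ \Big| \sum_{i=1}^{k} X_i \Big| \ge a \Big] \le 2 \exp\Big( -\frac{a^2}{2 b^2 k} \Big).$$
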
###### Lemma A.2 (Theorem 3 of [Strehl07Model-Based]).

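Fix a probability factor $i$ and a pair $(x[\Gamma_i], a)$. Let $\hat P_i(\cdot \mid x[\Gamma_i], a)$ be the empirical distribution of $P_i(\cdot \mid x[\Gamma_i], a)$ after $k_i$ visits to $(x[\Gamma_i], a)$. Then for all $\epsilon_1 > 0$, the $L_1$-error of the approximation will be small with high probability:

$$\Pr\Big( \sum_{y \in X_i} \big| P_i(y \mid x[\Gamma_i], a) - \hat P_i(y \mid x[\Gamma_i], a) \big| > \epsilon_1 \Big) \le 2^n \exp\Big( -\frac{k_i \epsilon_1^2}{2} \Big)$$
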
###### Corollary A.3.

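For any $\delta_1 > 0$, define

$$\beta(\delta_1) := \sqrt{2 \Big( \log \frac{1}{\delta_1} + n \log 2 \Big)}. \tag{12}$$

Then with probability at least $1 - \delta_1$,

$$\sum_{y \in X_i} \big| \hat P_i(y \mid x[\Gamma_i], a) - P_i(y \mid x[\Gamma_i], a) \big| \le \frac{\beta(\delta_1)}{\sqrt{k_i}}.$$
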
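The corollary follows by choosing the accuracy in Lemma A.2 so that the right-hand side equals the target failure probability. The short numeric check below reads eq. (12) as the square root of $2(\log(1/\delta_1) + n \log 2)$, which is an interpretation of the flattened formula; the function names and the sample values of $n$, $k$ and $\delta_1$ are arbitrary.

```python
import math


def beta(delta_1, n):
    """beta(delta_1) = sqrt(2 * (log(1/delta_1) + n * log 2)), read from eq. (12)."""
    return math.sqrt(2.0 * (math.log(1.0 / delta_1) + n * math.log(2.0)))


def l1_failure_bound(k, eps, n):
    """Right-hand side of Lemma A.2: 2^n * exp(-k * eps^2 / 2)."""
    return 2.0 ** n * math.exp(-k * eps * eps / 2.0)


n, k, delta_1 = 3, 200, 0.05
eps = beta(delta_1, n) / math.sqrt(k)   # accuracy guaranteed after k visits
bound = l1_failure_bound(k, eps, n)     # should reproduce delta_1
```

With the accuracy set to $\beta(\delta_1)/\sqrt{k}$, the bound of Lemma A.2 evaluates to $\delta_1$ up to rounding. The accuracy improves like $1/\sqrt{k}$ with the number of visits.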
###### Lemma A.4.

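Let $\delta_2 > 0$, $\epsilon_2 > 0$. Let

$$\delta' := C' \delta_2 \epsilon_2^2 \Big/ \log \frac{1}{\delta_2 \epsilon_2}$$

with some suitable constant $C'$. Then

$$\Pr\Big( \sum_{y \in X_i} \big| P_i(y \mid x[\Gamma_i], a) - \hat P_{t,i}(y \mid x[\Gamma_i], a) \big| > \max\Big( \frac{\beta(\delta')}{\sqrt{k_t}}, \epsilon_2 \Big) \text{ for any } t = 1, 2, \ldots \Big) \le \delta,$$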

that is, the probability is very low that the approximate transition probabilities ever get very far from their exact values.

Proof. By the union bound, the above probability is at most

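$$\sum_{k=1}^{\infty} \Pr\Big( \sum_{y \in X_i} \big| P_i(y \mid x[\Gamma_i], a) - \hat P_{t,i}(y \mid x[\Gamma_i], a) \big| > \max\Big( \frac{\beta(\delta')}{\sqrt{k_t}}, \epsilon_2 \Big) \Big). \tag{13}$$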

We will cut the sum into two parts, the cutting point is a constant to be determined later. Define the auxiliary constants

$$e := \frac{1}{1 - \exp(-\epsilon_2^2 / 2)}, \qquad \delta' := \frac{\delta_2}{k_0 + e}.$$

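Let $k_0$ be such that $\beta(\delta')/\sqrt{k}$ becomes smaller than $\epsilon_2$ after $k_0$ terms, that is, $\beta(\delta')/\sqrt{k_0} \le \epsilon_2$, or equivalently,

$$k_0 \ge \frac{\beta^2(\delta')}{\epsilon_2^2} = \frac{1}{\epsilon_2^2} \Big( 2 \log \frac{1}{\delta'} + n \log 2 \Big) = \frac{1}{\epsilon_2^2} \Big( 2 \log(k_0 + e) + 2 \log \frac{1}{\delta_2} + n \log 2 \Big).$$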

Using the very loose inequality with , we get that the above inequality holds if the stronger inequality

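$$k_0 \ge \frac{1}{2}(k_0 + e) + \frac{4}{\epsilon_2^2} + \frac{4}{\epsilon_2^2} \log \frac{4}{\epsilon_2^2} + \frac{1}{\epsilon_2^2} \Big( 2 \log \frac{1}{\delta_2} + n \log 2 \Big),$$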

holds, that is, for While this is a lower bound on , this also means that there is a constant such that

$$k_0 = \frac{C_1}{\epsilon_2^2} \log \frac{1}{\delta_2 \epsilon_2} \tag{14}$$

satisfies the inequality.

Using the above facts and Corollary A.3, the sum of terms up to is bounded by

 k