# On the Correctness and Sample Complexity of Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) is the problem of finding a reward function that generates a given optimal policy for a given Markov Decision Process. This paper presents an algorithm-independent geometric analysis of the IRL problem with finite states and actions. An L1-regularized Support Vector Machine formulation of the IRL problem, motivated by the geometric analysis, is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. The paper further analyzes the proposed formulation of inverse reinforcement learning with n states and k actions, and shows a sample complexity of O(n^2 log(nk)) for recovering a reward function that generates a policy satisfying Bellman's optimality condition with respect to the true transition probabilities.


## 1 Introduction

Reinforcement learning is the process of generating an optimal policy for a given Markov Decision Process together with a reward function. Often, as in apprenticeship learning, the reward function is unknown but the optimal policy can be observed through the actions of an expert. In such cases, it is desirable to learn a reward function that generates the observed optimal policy. This problem is referred to as Inverse Reinforcement Learning (IRL) [5]. It is well known that such a reward function is not necessarily unique. Various algorithms to solve the IRL problem have been proposed, including linear programming [5] and Bayesian estimation [6]. [2] looked at using IRL to solve the apprenticeship learning problem by trying to find a reward function that maximizes the margin of the expert's policy. The goal of the problem presented in [2] is to find a policy that comes close in value to the expert's policy for some unspecified true reward function. However, none of the prior works provide a formal guarantee that the reward function obtained from empirical data is optimal for the true transition probabilities in inverse reinforcement learning.

This paper looks at formulating the IRL problem by using the basic objective of inverse reinforcement: to find a reward function that generates a specified optimal policy. The paper also establishes a sample complexity sufficient to meet this basic goal when the transition probabilities are estimated from observed trajectories. To achieve this, an algorithm-independent geometric analysis of the IRL problem with finite states and actions as presented in [5] is provided. An L1-regularized Support Vector Machine (SVM) formulation of the IRL problem, motivated by the geometric analysis, is then proposed. Theoretical analysis of the sample complexity of the L1 SVM formulation is then performed. Finally, experimental results comparing the L1 SVM formulation to the linear programming formulation presented in [5] are presented, showing the improved performance of the L1 SVM formulation with respect to the basic objective, i.e., Bellman optimality with respect to the true transition probabilities. To the best of our knowledge, we are the first to provide an algorithm with formal guarantees for inverse reinforcement learning.

## 2 Preliminaries

The formulation of the IRL problem is based on a Markov Decision Process (MDP) (S, A, {Pa}, γ, R), where

• S = {s1, …, sn} is a finite set of states.

• A = {a1, …, ak} is a set of actions.

• {Pa} are the state transition probabilities for action a. We use Pa(i) to represent the row of state transition probabilities for action a in state si and Pa(i, j) to represent the probability of going from state si to state sj when taking action a.

• γ ∈ [0, 1) is the discount factor.

• R : S → ℝ is the reinforcement or reward function.

It is important to note that the state transition probability matrices are right stochastic. Mathematically this can be stated as

 ∑j=1..nPa(i,j)=1 and Pa(i,j)≥0 ∀i,a

In the subsequent sections, we use the notation

 Fai:=(Pa1(i)−Pa(i))(I−γPa1)−1
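As an illustrative sketch (not code from the paper), the matrix whose i-th row is Fai can be computed directly from the two transition matrices; the function name `compute_F` and the toy matrices are our own:

```python
import numpy as np

def compute_F(P_opt, P_a, gamma):
    """Rows of the returned matrix are F_ai = (P_a1(i) - P_a(i)) (I - gamma P_a1)^{-1}.

    P_opt: n x n transition matrix of the optimal action a1 (right stochastic).
    P_a:   n x n transition matrix of a competing action a (right stochastic).
    """
    n = P_opt.shape[0]
    M = np.linalg.inv(np.eye(n) - gamma * P_opt)  # (I - gamma P_a1)^{-1}
    return (P_opt - P_a) @ M

# Tiny 2-state example with row-stochastic matrices.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.5, 0.5], [0.6, 0.4]])
F = compute_F(P1, P2, gamma=0.9)
```

A quick sanity property: since both matrices are right stochastic, (P1 − P2) annihilates the all-ones vector, so every row of F sums to zero.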

The empirical maximum likelihood estimates of the transition probabilities from sampled trajectories are denoted by ^Pa, and in a similar fashion we use the notation

 ^Fai:=(^Pa1(i)−^Pa(i))(I−γ^Pa1)−1

The following norms are used throughout this paper. The infinity norm of a matrix is defined as ∥A∥∞ = maxi,j |Aij|. The L1 norm of a vector is defined as ∥x∥1 = ∑i |xi|. The induced matrix norm is defined as |||A|||∞ = maxi ∥Ai∥1, where Ai is the i-th row of the matrix A. Note that for a right stochastic matrix P, we can see that ∥P∥∞ ≤ 1 and |||P|||∞ = 1.
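These two norm facts can be checked numerically on a random right stochastic matrix (an illustrative sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)       # normalize rows -> right stochastic

entrywise = np.abs(P).max()             # ||P||_inf = max_{i,j} |P_ij|
induced = np.abs(P).sum(axis=1).max()   # |||P|||_inf = max_i ||P_i||_1
```

For a right stochastic matrix every entry lies in [0, 1] and every row sums to exactly 1, so `entrywise <= 1` and `induced == 1`.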

In this paper the reward function is assumed to be a function purely of the state rather than of the state and the action. This assumption is also made for the initial results in [5]. A policy is defined as a map π : S → A. Given a policy π, we can define two functions.

The value function at a state s1 is defined as

 Vπ(s1)=E[R(s1)+γR(π(s1))+γ2R(π(π(s1)))+…∣π]

The Q function is defined as

 Qπ(s,a)=R(s)+γEs′∼Pa(s)[Vπ(s′)]
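For a fixed policy with a state-only reward, the value function solves a linear system, V = (I − γPπ)^(−1) R; a minimal numerical sketch (helper names are our own):

```python
import numpy as np

def policy_value(P_pi, R, gamma):
    """Solve V = R + gamma * P_pi V, i.e. V = (I - gamma P_pi)^{-1} R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def q_value(P_a, R, V, gamma):
    """Q^pi(., a) = R + gamma * P_a V for a state-only reward R."""
    return R + gamma * P_a @ V

P = np.array([[0.5, 0.5], [0.1, 0.9]])   # transition matrix of the policy's action
R = np.array([1.0, 0.0])
V = policy_value(P, R, gamma=0.9)
```

By construction, V satisfies the Bellman equation V = R + γPV, and Q evaluated at the policy's own action recovers V.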

From [5], the inverse reinforcement learning problem for finite states and actions can be formulated as the following linear programming problem, assuming without loss of generality that the observed optimal policy is π ≡ a1. By enforcing the Bellman optimality of the policy π, linear constraints on the reward function are formed. [5] then suggest some "natural" criteria that form the basis of the objective function to be maximized to obtain the desired reward function. The formulation presented in [5] is as follows.

 maximizeR N∑i=1mina∈{a2,…,ak}(^FTaiR)−λ∥R∥1 (2.1) subject to ^FTaiR≥0∀a∈A∖a1,i=1,…,n ∥R∥∞≤Rmax

## 3 Geometric analysis of the IRL problem

The objective of the inverse reinforcement learning problem is to find a reward function that generates an optimal policy. As shown in [5], the necessary and sufficient conditions for a policy (without loss of generality π ≡ a1) to be optimal are given by the Bellman optimality principle and can be stated mathematically as

 FTaiR≥0∀a∈A∖a1,i=1,…,n

Clearly, R = 0 is always a solution. However, this solution is degenerate in the sense that it also allows any and every other policy to be "optimal" and as a result is not of practical use. If the constraint ∥R∥∞ ≤ Rmax is considered, then by noticing that the points Fai lie in ℝn, the set of reward functions generating the optimal policy π ≡ a1 corresponds to the normal vectors of hyperplanes passing through the origin for which the entire collection of points {Fai} lies in one half space. The problem of inverse reinforcement learning is then equivalent to the problem of finding such a separating hyperplane passing through the origin for the points {Fai}. Here we also assume that none of the Fai = 0, as this would mean that there is no distinction between the policies a and a1.

This geometric perspective of the IRL problem allows the classification of all finite state, finite action IRL problems into 3 regimes, graphically visualized in Figure 1:

Regime 1: In this regime, there is no hyperplane passing through the origin for which all the points Fai lie in one half space. This is equivalent to saying that the origin is in the interior of the convex hull of the points {Fai}. In this case, independent of the algorithm, there is no nonzero reward function for which the policy π ≡ a1 is optimal.

Regime 2: In this regime, up to scaling by a constant, there can be one or more hyperplanes passing through the origin for which all the points Fai lie in one half space; however, the hyperplanes always contain one of the points Fai. This is equivalent to saying that the origin is on the boundary of the convex hull of the points {Fai} but is not one of the vertices, since by assumption Fai ≠ 0. In this case, up to a constant scaling, there are one or more nonzero reward functions that generate the optimal policy π ≡ a1. It is also important to notice that the policy cannot be strictly optimal for any of these reward functions.

Regime 3: In this regime, up to scaling by a constant, there are infinitely many hyperplanes passing through the origin for which all the points Fai lie in one half space. This is equivalent to saying that the origin is outside the convex hull of the points {Fai}. In this case, up to a constant scaling, there are infinitely many nonzero reward functions that generate the optimal policy π ≡ a1, and it is possible to find a reward function for which the policy is strictly optimal.
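The Regime 3 test (is there a hyperplane through the origin with all points strictly on one side?) can be sketched as a small linear program: maximize the worst-case margin t = min_i wᵀf_i over normal vectors w in a box. This is our own illustrative formulation, not an algorithm from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def strictly_separable(points):
    """Return (True, w) if some hyperplane through the origin has all `points`
    strictly on one side (Regime 3): solve max t s.t. w.f_i >= t, -1 <= w <= 1.
    The optimum t is > 0 exactly in the strictly separable case."""
    pts = np.asarray(points, dtype=float)
    m, n = pts.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                              # minimize -t, i.e. maximize t
    A_ub = np.hstack([-pts, np.ones((m, 1))]) # t - w.f_i <= 0 for each point
    b_ub = np.zeros(m)
    bounds = [(-1, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    t = res.x[-1]
    return (t > 1e-9, res.x[:n] if t > 1e-9 else None)
```

Points all in one open half space give t > 0; points surrounding the origin (Regime 1) force the optimum margin to 0, attained only at w = 0.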

These geometric regimes and their implication on the finite state, finite action inverse reinforcement learning problem are summed up in the following theorem.

###### Theorem 3.1.

There exists a hyperplane passing through the origin such that all the points Fai lie on one side of the hyperplane (or on the hyperplane) if and only if there is a non-zero reward function R that generates the optimal policy π ≡ a1 for the inverse reinforcement learning problem, i.e. ∃R ≠ 0 such that FTaiR ≥ 0 ∀a∈A∖a1, i=1,…,n.

###### Remark 3.1.

Notice that as an extension of Theorem 3.1, there is an R for which the policy π ≡ a1 is strictly optimal if and only if there exists a hyperplane for which all the points Fai lie strictly on one side.

###### Remark 3.2.

Note that it is possible to find a separating hyperplane between the origin and the collection of points {Fai} if and only if the problem is in Regime 3. Therefore, the problem of inverse reinforcement learning can be viewed as a one-class support vector machine problem (or as a two-class support vector machine problem with the origin as the negative class) in this regime. This, along with the objective of determining sample complexity, leads into the formulation of the problem discussed in the next section.

## 4 Formulation of optimization problem

The objective function formulation of the inverse reinforcement problem described in [5] was formed by imposing the conditions that the value from the optimal policy be as far as possible from that of the next best action at each state, as well as sparseness of the reward function. These were choices made by the authors to enable a unique solution to the proposed linear programming problem. We propose a different formulation, in terms of a one-class L1-regularized support vector machine, that allows for a geometric interpretation as well as provides an efficient sample complexity. The inverse reinforcement learning problem is now considered in Regime 3. Here it is known that there is a separating hyperplane between the origin and the points {Fai}, so the strict inequality FTaiR > 0 holds, which by scaling of R is equivalent to FTaiR ≥ 1. Formally this assumption is stated as follows.

###### Definition 4.1 (β-Strict Separability).

An inverse reinforcement learning problem satisfies β-strict separability if and only if there exists an R∗ such that

 ∥R∗∥1=1andFTaiR∗≥β>0∀a∈A∖a1,i=1,…,n

Notice that the IRL problem is in Regime 3 (i.e. ∃R ≠ 0 such that FTaiR > 0 ∀a, i) if and only if the β-strict separability assumption is satisfied for some β > 0.

Strict nonzero assumptions are well accepted in the statistical learning theory community, and have been used, for instance, in compressed sensing [8], Markov random fields [7], nonparametric regression [4], and diffusion networks [3].

Problems in Regime 2 are avoided since, based on the statistical estimation of the transition probability matrices from empirical data, the problem can easily tip into Regime 1 or Regime 3, as shown in Figure 2. To solve problems in Regime 2, an infinite number of samples would be required, whereas problems in Regime 3 can be solved with a large enough number of samples.

Given the strict separability assumption, the optimization problem proposed is as follows

 minimizeR ∥R∥1 (4.1) subject to ^FTaiR≥1∀a∈A∖a1,i=1,…,n

This problem is in the form of a one-class L1-regularized Support Vector Machine [9], except that we use hard margins instead of soft margins. The minimization of the L1 norm plays a twofold role in this formulation. First, it promotes a sparse reward function, keeping in line with the idea of simplicity. Second, it also plays a role in establishing the sample complexity bounds of the inverse reinforcement learning problem, as shown in the subsequent section. The constraints derive from strict Bellman optimality in the separable case (Regime 3) of inverse reinforcement learning and help avoid the degenerate solution R = 0. We now use this optimization problem along with the objective of finding a reward function for which the policy π ≡ a1 is optimal to establish the correctness and sample complexity of the inverse reinforcement learning problem.
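Problem (4.1) can be solved as a standard linear program via the usual L1 lift (auxiliary variables u ≥ |R|); a minimal sketch under our own variable names, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog

def irl_l1(F_hat):
    """Solve (4.1): minimize ||R||_1 subject to F_hat @ R >= 1, where each row
    of F_hat is one constraint vector \hat F_ai. Lift: variables (R, u) with
    R - u <= 0 and -R - u <= 0, u >= 0; minimize sum(u)."""
    F = np.asarray(F_hat, dtype=float)
    m, n = F.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])     # objective: sum of u
    A1 = np.hstack([-F, np.zeros((m, n))])            # -F R <= -1  <=>  F R >= 1
    A2 = np.hstack([np.eye(n), -np.eye(n)])           #  R - u <= 0
    A3 = np.hstack([-np.eye(n), -np.eye(n)])          # -R - u <= 0
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n] if res.success else None
```

With F_hat = I the constraints read R_i >= 1, so the minimum-L1 solution is the all-ones reward.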

## 5 Correctness and sample complexity of Inverse Reinforcement Learning

Consider the inverse reinforcement learning problem in the strictly separable case (Regime 3). We have R∗ with ∥R∗∥1 = 1 such that

 FTaiR∗≥β>0∀a∈A∖a1,i=1,…,n

Let ε := maxa,i ∥Fai − ^Fai∥∞. Let ^R be the solution to the optimization problem 4.1 with ^Fai. We desire that

 FTai^R≥0∀a∈A∖a1,i=1,…,n

i.e. the reward we obtain from the problem using the estimated transition probability matrices also generates π ≡ a1 as the optimal policy for the problem with the true transition probabilities. This can be done by reducing ε, i.e. by using more samples. The result in the strictly separable case follows from the following theorem.

###### Theorem 5.1.

Let M be an inverse reinforcement learning problem that is β-strictly separable. Let ^Fai be the values of Fai computed using estimates of the transition probability matrices such that ∥Fai − ^Fai∥∞ ≤ ε. Let ^R be the solution to the optimization problem 4.1 with ^Fai. Let c ∈ [0, 1) and

 ε≤((1−c)/(2−c))·β

Then we have FTai^R ≥ c ∀a∈A∖a1, i=1,…,n.

###### Proof.

Consider FTai^R; using Hölder's inequality we have

 FTai^R≥−∥Fai−^Fai∥∞∥^R∥1+^FTai^R≥−ε∥^R∥1+1 (5.1)

Now let ~R = (K/β)R∗, where K > 0 and R∗ is the reward satisfying the β-strict separability for the problem. We have ∥~R∥1 = K/β as well as FTai~R ≥ K. Now we have

 ^FTai~R≥−∥Fai−^Fai∥∞∥~R∥1+FTai~R≥−ε∥~R∥1+K=−(Kε/β)+K=K(1−ε/β)

We now construct ~R to satisfy the constraints of the optimization problem 4.1 with ^Fai by choosing K such that

 ^FTai~R≥K(1−ε/β)≥1⟹K=1/(1−ε/β)

Notice that since ε ≤ ((1−c)/(2−c))β < β, we have 1 − ε/β > 0 and hence K > 0.

Now since ~R is a feasible solution to the optimization problem 4.1 with ^Fai, for which ^R is the optimal solution, we have from the objective function

 ∥^R∥1≤∥~R∥1=K/β

Substituting this upper bound for ∥^R∥1 in (5.1) we get,

 FTai^R≥−(εK/β)+1=1−(ε/β)·(1/(1−ε/β))≥1−((1−c)/(2−c))·(2−c)=1−(1−c)=c ∎

###### Remark 5.1.

It is important to note that since ε ≥ 0 and β > 0, we have c < 1 whenever ε > 0, with c = 1 attainable only when ε = 0, i.e. with infinitely many samples. This shows the equivalence of the problems with the true and the estimated transition probabilities in the case of infinite samples.

Our desired result then follows as a corollary of the above theorem.

###### Corollary 5.1.

Let M be an inverse reinforcement learning problem that is β-strictly separable. Let ^Fai be the values of Fai computed using estimates of the transition probability matrices such that ∥Fai − ^Fai∥∞ ≤ ε. Let ^R be the solution to the optimization problem 4.1 with ^Fai. Let

 ε≤(1/2)·β

Then we have FTai^R ≥ 0 ∀a∈A∖a1, i=1,…,n.

###### Proof.

Straightforwardly, by setting c = 0 in Theorem 5.1. ∎

###### Theorem 5.2.

Let M be an inverse reinforcement learning problem that is β-strictly separable. Let every state be reachable from the starting state in one step with probability at least α. Let ^R be the solution to the optimization problem 4.1 with ^Fai, with transition probability matrices ^Pa that are maximum likelihood estimates of Pa formed from m samples, where

 m≥(64/(αβ2))·(((n−1)γ+1)/(1−γ)2)2·log(4nk/δ)

Then with probability at least 1−δ, we have FTai^R ≥ 0 ∀a∈A∖a1, i=1,…,n.
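To get a feel for the bound, it can be evaluated numerically (an illustrative helper of our own, using the bound as reconstructed above):

```python
import math

def sample_bound(n, k, alpha, beta, gamma, delta):
    """Sufficient sample size from Theorem 5.2 as read here:
    m >= 64/(alpha*beta^2) * (((n-1)*gamma + 1)/(1-gamma)^2)^2 * log(4*n*k/delta)."""
    geom = ((n - 1) * gamma + 1) / (1 - gamma) ** 2
    return 64.0 / (alpha * beta ** 2) * geom ** 2 * math.log(4 * n * k / delta)

m = sample_bound(n=10, k=5, alpha=0.1, beta=0.1, gamma=0.9, delta=0.05)
```

The bound grows roughly as n² log(nk) and shrinks as 1/β², matching the discussion in Section 7.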

The theorem above follows from concentration inequalities for the estimation of the transition probabilities, which are detailed in the following section. (All missing proofs are included in the Supplementary Material.)

## 6 Concentration inequalities

In this section we look at the propagation of the concentration of the empirical estimate of the transition probabilities around their true values.

###### Lemma 6.1.

Let A and B be two matrices, we have

 ∥AB∥∞≤|||A|||∞∥B∥∞

Next we look at the propagation of the concentration of a right stochastic matrix to the concentration of its k-th power.

###### Lemma 6.2.

Let P be a right stochastic matrix and let ^P be an estimate of P such that

 ∥^P−P∥∞≤ε

then,

 ∥^Pk−Pk∥∞≤((k−1)n+1)ε
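A quick numerical check of the lemma's bound on a random pair of right stochastic matrices (our own sketch; the perturbation scheme is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # right stochastic

# Perturb P into another right stochastic matrix and measure eps entrywise.
Q = np.clip(P + 0.01 * (rng.random((n, n)) - 0.5), 1e-9, None)
Q /= Q.sum(axis=1, keepdims=True)
eps = np.abs(Q - P).max()

lhs = np.abs(np.linalg.matrix_power(Q, k) - np.linalg.matrix_power(P, k)).max()
rhs = ((k - 1) * n + 1) * eps                # Lemma 6.2 bound
```

The lemma guarantees `lhs <= rhs` for any such pair of right stochastic matrices.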

Now we can consider the concentration of the expression Fai = (Pa1(i)−Pa(i))(I−γPa1)−1.

Notice that since Pa1 is a right stochastic matrix and γ < 1, we can expand (I−γPa1)−1 as a Neumann series, and therefore

 (Pa1(i)−Pa(i))(I−γPa1)−1=(Pa1(i)−Pa(i))∞∑j=0(γPa1)j
###### Theorem 6.1.

Let Pa1 and Pa be right stochastic matrices corresponding to actions a1 and a, and let γ ∈ [0, 1). Let ^Pa1 and ^Pa be estimates of Pa1 and Pa such that

 ∥^Pa−Pa∥∞≤εand∥^Pa1−Pa1∥∞≤ε

Then,

 ∥∥∥(^Pa1−^Pa)(I−γ^Pa1)−1−(Pa1−Pa)(I−γPa1)−1∥∥∥∞≤2ε(n−1)γ+1(1−γ)2

Note that this result is for each action. The concentration over all actions can be found by using the union bound over the set of actions.

An estimate of the value of ∥^Pa−Pa∥∞ when the estimation is done using m samples can be shown using the Dvoretzky–Kiefer–Wolfowitz inequality [1] to be on the order of 1/√m.

This result is shown in the following Theorem 6.2.

###### Theorem 6.2.

Let Pa be a right stochastic matrix for an action a and let ^Pa be a maximum likelihood estimate of Pa formed from m samples. If m≥(2/ε2)·log(2n/δ), then we have

 P[∥∥^Pa−Pa∥∥∞≤ε]≥1−δ
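The maximum likelihood estimator in question is just the empirical frequency of next states; a small simulation sketch (our own) shows the entrywise error shrinking with m:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 20000
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # true right stochastic matrix

# Maximum likelihood estimate: m next-state draws from every row of P.
P_hat = np.zeros_like(P)
for i in range(n):
    draws = rng.choice(n, size=m, p=P[i])
    P_hat[i] = np.bincount(draws, minlength=n) / m

err = np.abs(P_hat - P).max()            # entrywise infinity-norm error
```

With m = 20000 the error is on the order of 1/√m, i.e. well below 0.05, and each estimated row remains a probability distribution.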

The theorem above assumes that it is possible to start in any given state. However, this may not always be the case. As long as every state is reachable from an initial state with probability at least α, the result presented in Theorem 5.2 can be modified to use Theorem 6.3 instead of Theorem 6.2.

###### Theorem 6.3.

Let Pa be a right stochastic matrix for an action a and let ^Pa be a maximum likelihood estimate of Pa formed from m samples. Let every state be reachable from the starting state in one step with probability at least α. If m≥(4/(αε2))·log(4n/δ), then

 P[∥∥^Pa−Pa∥∥∞≤ε]≥1−δ,δ∈(0,1)∀a∈A

## 7 Discussion

The result of Theorem 5.2 shows that the number of samples required to solve a β-strictly separable inverse reinforcement learning problem and obtain a reward that generates the desired optimal policy is on the order of (n2/β2)·log(nk). Notice that the number of samples is inversely proportional to β2. Thus by viewing the case of Regime 2 as the limit β → 0 of the β-strict separable case (Regime 3), it is easy to see that an infinite number of samples is required to guarantee that the reward obtained will generate the optimal policy for the MDP with the true transition probability matrices.

In practical applications, however, it may be difficult to determine whether an inverse reinforcement learning problem is β-strictly separable (Regime 3) or not. In this case, the result of equation (5.1) can be used as a witness to determine that the obtained ^R satisfies Bellman's optimality condition with respect to the true transition probability matrices with high probability, as shown in the following remark.

###### Remark 7.1.

Let M be an inverse reinforcement learning problem. Let every state be reachable from the starting state in one step with probability at least α. Let ^R be the solution to the optimization problem 4.1 with ^Fai, with transition probability matrices that are maximum likelihood estimates of Pa formed from m samples, and let

 ε=2·√((4/(αm))·log(4nk/δ))⋅((n−1)γ+1)/(1−γ)2

If ε∥^R∥1 < 1, then with probability at least 1−δ, we have FTai^R ≥ 0 ∀a∈A∖a1, i=1,…,n.
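The witness from inequality (5.1) is cheap to evaluate: if every estimated constraint is satisfied and ε∥^R∥1 < 1, the true constraints hold with margin 1 − ε∥^R∥1. A minimal sketch (function name is our own):

```python
import numpy as np

def bellman_witness(F_hat_rows, R_hat, eps):
    """Certificate from Remark 7.1 via inequality (5.1): if every row satisfies
    F_hat_ai . R_hat >= 1 and eps * ||R_hat||_1 < 1, then for the true F_ai,
    F_ai . R_hat >= 1 - eps*||R_hat||_1 > 0 with high probability."""
    F = np.asarray(F_hat_rows, dtype=float)
    feasible = bool(np.all(F @ R_hat >= 1 - 1e-9))
    margin = 1.0 - eps * np.abs(R_hat).sum()
    return feasible and margin > 0, margin
```

For example, with F_hat = I, R_hat = (1, 1), and eps = 0.1 the witness certifies a margin of 0.8; with eps = 0.6 the certificate fails because ε∥^R∥1 = 1.2 > 1.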

## 8 Experimental results

Experiments were performed using randomly generated transition probability matrices for β-strictly separable MDPs of two sizes (a smaller and a larger number of states and actions). Both experiments were done with π ≡ a1 as the optimal policy. Thirty randomly generated MDPs were considered in each case, and a varying number of samples was used to find estimates of the transition probability matrices in each trial. Reward functions were found by solving Problem 4.1 for our L1-regularized SVM formulation, and Problem 2.1 for the method of [5], using the same set of estimated transition probabilities ^Pa. The resulting reward functions were then tested using the true transition probabilities Pa. The percentage of trials for which FTai^R ≥ 0 held true for both of the methods is shown in Figure 3 for different numbers of samples used. As we can observe, the success rate increases with the number of samples, as expected. The L1-regularized support vector machine, however, significantly outperforms the linear programming formulation proposed in [5], reaching near-perfect success shortly after the sufficient number of samples prescribed by Theorem 5.2, while the method proposed by [5] falls far behind: the reward function given by the L1-regularized support vector machine formulation successfully generates the optimal policy in almost all trials given enough samples, while the reward function estimated by the method presented in [5] frequently fails to generate the desired optimal policy.

## 9 Concluding remarks

The L1-regularized support vector formulation along with the geometric interpretation provide a useful way of looking at the inverse reinforcement learning problem with strong, formal guarantees. Possible future work on this problem includes extension to the inverse reinforcement learning problem with continuous states by using sets of basis functions as presented in [5].

## References

• [1] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
• [2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004. ACM.
• [3] Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In International Conference on Machine Learning, pages 793–801, 2014.
• [4] Han Liu, Larry Wasserman, John D. Lafferty, and Pradeep K. Ravikumar. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems, pages 1201–1208, 2008.
• [5] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
• [6] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007.
• [7] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using L1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
• [8] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
• [9] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J. Hastie. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.

## Appendix A Proofs of Lemmas and Theorems

### a.1 Proof of Theorem 3.1

###### Proof.

The proof follows from the fact that the points Fai lie on one side of the hyperplane passing through the origin with normal vector w if and only if

 wTFai≥0∀a∈A∖a1,i=1,…,n

or

 wTFai≤0∀a∈A∖a1,i=1,…,n

The proof in the 'if' direction follows by taking the hyperplane with normal vector w = R and noticing that RTFai ≥ 0, so all the points lie on one side of the hyperplane passing through the origin with normal vector R.

The proof in the 'only if' direction is as follows. Consider a separating hyperplane with normal vector w. Without loss of generality,

 wTFai≥0

Now let R = w; then FTaiR ≥ 0, so R generates the optimal policy π ≡ a1. ∎

### a.2 Proof of Theorem 5.2

###### Proof.

The proof of this theorem is a consequence of Corollary 5.1 and Theorems 6.1 and 6.3. Note that from Theorem 6.3, we want the concentration to hold with probability 1−δ simultaneously for all transition probability matrices corresponding to the set of actions. Taking a union bound over the k actions, this can be viewed as the concentration inequality holding for a single matrix with failure probability δ/k, which gives us the result for

 m≥(4/(αε12))·log(4nk/δ)
 ⟹P[∥^Pa−Pa∥∞≤ε1]≥1−δ∀a∈A

The result then follows from substituting this value of ε1 for the ε in Theorem 6.1 and the consequent bound into Corollary 5.1. ∎

### a.3 Proof of Lemma 6.1

###### Proof.

Let C = AB; we have

 ∥AB∥∞=∥C∥∞=supi,j|cij|

From Hölder's inequality we get

 ∥AB∥∞=supi,j{∣∑kaikbkj∣}≤supi{∑k|aik|}·supk,j|bkj|=|||A|||∞∥B∥∞ ∎

### a.4 Proof of Lemma 6.2

###### Proof.

First note that if P is a right stochastic matrix then Pk is a right stochastic matrix for all natural numbers k. Consider right stochastic matrices A, B, C, D and the expression ∥AC−BD∥∞. From Lemma 6.1 we get

 ∥AC−BD∥∞=∥AC−AD+AD−BD∥∞≤∥AC−AD∥∞+∥AD−BD∥∞

Notice that ∥AC−AD∥∞≤|||A|||∞∥C−D∥∞ with |||A|||∞=1, and ∥AD−BD∥∞=∥(A−B)D∥∞≤|||A−B|||∞∥D∥∞≤n∥A−B∥∞; thus we have

 ∥AC−BD∥∞≤∥C−D∥∞+n∥A−B∥∞

Now we will prove the lemma by induction on k. For k=1 we have

 ∥^P−P∥∞≤ε=((1−1)n+1)ε

Assume the statement is true for k−1, i.e.

 ∥^Pk−1−Pk−1∥∞≤((k−2)n+1)ε

Consider the previous result with A=^P, B=P, C=^Pk−1, D=Pk−1. Substituting, we get

 ∥^P^Pk−1−PPk−1∥∞≤((k−2)n+1)ε+nε
 ⟹∥^Pk−Pk∥∞≤((k−1)n+1)ε ∎

### a.5 Proof of Theorem 6.1

###### Proof.

Consider the expression from the theorem

 ∥∥(^Pa1−^Pa)(I−γ^Pa1)−1−(Pa1−Pa)(I−γPa1)−1∥∥∞
 =∥∥∥(^Pa1−^Pa)∞∑j=0(γ^Pa1)j−(Pa1−Pa)∞∑j=0(γPa1)j∥∥∥∞
 =∥∥∥∞∑j=0γj(^Pj+1a1−Pj+1a1)−(^Pa)∞∑j=0γj(^Pja1−Pja1)−(^Pa−Pa)∞∑j=0γj(Pja1)∥∥∥∞
 ≤∞∑j=0γj∥∥(^Pj+1a1−Pj+1a1)∥∥∞+∞∑j=0γj∥∥(^Pa)(^Pja1−Pja1)∥∥∞+∞∑j=0γj∥∥(^Pa−Pa)(Pja1)∥∥∞

From Lemma 6.1 and Lemma 6.2; and the fact that for a right stochastic matrix , and ; we have

 ∞∑j=0γj∥∥(^Pj+1a1−Pj+1a1)∥∥∞+∞∑j=0γj∥∥(^Pa)(^Pja1−Pja1)∥∥∞+∞∑j=0γj∥∥(^Pa−Pa)(Pja1)∥∥∞
 ≤∞∑j=0γj∥∥(^Pj+1a1−Pj+1a1)∥∥∞+∞∑j=0γj∥∥(^Pja1−Pja1)∥∥∞+∞∑j=0γjn∥∥^Pa−Pa∥∥∞
 ≤∞∑j=0γj((j)n+1)ε+∞∑j=0γj((j−1)n+1)ε+∞∑j=0γjnε =ε∞∑j=0γj((jn+1)+((j−1)n+1)+n) =2nε∞∑j=0jγj+2ε∞∑j=0γj =2ε(nγ(1−γ)2+11−γ) =2ε(n−1)γ+1(1−γ)2

### a.6 Proof of Theorem 6.2

###### Proof.

Here we invoke the Dvoretzky–Kiefer–Wolfowitz inequality [1]. Consider m samples of a random variable Yia with domain {1,…,n}, where Yia corresponds to the observed resulting state under an action a taken at a state si. Let ^Tia be the empirical estimate of the CDF of Yia and let Tia be the actual CDF. From the Dvoretzky–Kiefer–Wolfowitz inequality we have

 P(sups∈{1,…,n}∣∣^Tia(s)−Tia(s)∣∣>ε)≤2e−2mε2
 ⟹P(sups∈{1,…,n}∣∣^Tia(s)−Tia(s)∣∣≤ε)>1−2e−2mε2

Now consider the probability mass function of Yia given by pia(s)=Tia(s)−Tia(s−1). Notice that

 ∣^pia(s)−pia(s)∣=∣(^Tia(s)−^Tia(s−1))−(Tia(s)−Tia(s−1))∣≤∣^Tia(s)−Tia(s)∣+∣^Tia(s−1)−Tia(s−1)∣

So if we have

 sups∈{1,…,n}∣∣^Tia(s)−Tia(s)∣∣≤ε

then

 sups∈{1,…,n}∣^pia(s)−pia(s)∣≤2ε

Replacing ε with ε/2 in the DKW bound,

 ⟹P(sups∈{1,…,n}∣^pia(s)−pia(s)∣≤ε)>1−2e−mε2/2

Here we can interpret ^pia and pia as the i-th rows of the matrices ^Pa and Pa respectively, where ^Pa is the maximum likelihood estimator formed from m samples. From application of the union bound over all n rows of the matrix, we have, for δ ∈ (0,1) and m samples,

 P((∀i∈1,…,n)∥^pia−pia∥∞<ε)>1−2ne−mε2/2 ⟹P[∥^Pa−Pa∥∞≤ε]≥1−δ,δ∈(0,1)

if m≥(2/ε2)·log(2n/δ). ∎

### a.7 Proof of Theorem 6.3

###### Proof.

Without loss of generality, let every state sj be reachable from the starting state in one step with probability at least α. Let Yj be a random variable with domain {1,…,n} representing the state observed after the chain passes through state sj, and let Zj be a Bernoulli random variable indicating that state sj was reached, so that P(Zj=1) ≥ α. Let (y(l)j, z(l)j), l=1,…,m, be pairs of independent samples of Yj and Zj.

Consider the event A1 ≡ {∑ml=1 z(l)j ≥ (α−ϵ)m ∀j}. By the one-sided Hoeffding's inequality and taking the union bound over all states we have

 P(A1)≥1−ne−2ϵ2m

We also have the conditional maximum likelihood probability estimator

 ^p(Yj=s|Zj=1)=1∑mk=1z(k)jm∑l=11[(y(l)j=s)∧z(l)j]

From Theorem 6.2, applied to the (at least) (α−ϵ)m samples available under A1, we have for the event

 A2≡{∥∥^p(Yja|Zj=1)−p(Yja|Zj=1)∥∥∞≤β}
 P(A2|A1)≥1−2ne−2β2m(α−ϵ)/2

By the law of total probability

 P(A2)≥P(A2,A1) =P(A2|A1)P(A1) =(1−ne−2ϵ2m)(1−2ne−2β2m(α−ϵ)/2) ≥1−ne−2ϵ2m−2ne−2β2m(α−ϵ)/2

By solving ne−2ϵ2m≤δ/2 and 2ne−2β2m(α−ϵ)/2≤δ/2 for m, we can see that the stated sample size suffices. ∎