Reinforcement learning is the process of generating an optimal policy for a given Markov Decision Process (MDP) together with a reward function. Often, as in apprenticeship learning, the reward function is unknown, but the optimal policy can be observed through the actions of an expert. In such cases it is desirable to learn a reward function that generates the observed optimal policy. This problem is referred to as inverse reinforcement learning (IRL). It is well known that such a reward function is not necessarily unique. Various algorithms to solve the IRL problem have been proposed, including linear programming and Bayesian estimation. Abbeel and Ng looked at using IRL to solve the apprenticeship learning problem by trying to find a reward function that maximizes the margin of the expert's policy; their goal is to find a policy that comes close in value to the expert's policy for some unspecified true reward function. However, none of these prior works provides a formal guarantee that the reward function obtained from empirical data is optimal for the true transition probabilities in inverse reinforcement learning.
This paper formulates the IRL problem using the basic objective of inverse reinforcement learning: to find a reward function that generates a specified optimal policy. It also establishes a sample complexity for meeting this basic goal when the transition probabilities are estimated from observed trajectories. To achieve this, we first provide an algorithm-independent geometric analysis of the IRL problem with finite states and actions as formulated by Ng and Russell. Motivated by the geometric analysis, we then propose an L1-regularized Support Vector Machine (SVM) formulation of the IRL problem and analyze its sample complexity. Finally, we present experimental results comparing the L1 SVM formulation to the linear programming formulation of Ng and Russell, showing the improved performance of the L1 SVM formulation with respect to the basic objective, i.e., Bellman optimality with respect to the true transition probabilities. To the best of our knowledge, we are the first to provide an algorithm with formal guarantees for inverse reinforcement learning.
The formulation of the IRL problem is based on a Markov Decision Process (MDP) (S, A, {P_a}, γ, R), where
S is a finite set of N states.
A = {a_1, ..., a_k} is a set of k actions.
P_a are the state transition probabilities for action a. We use P_a(s) to represent the vector of state transition probabilities for action a in state s, and P_a(s, s') to represent the probability of going from state s to state s' when taking action a.
γ ∈ [0, 1) is the discount factor.
R is the reinforcement or reward function.
It is important to note that the state transition probability matrices are right stochastic. Mathematically, this can be stated as Σ_{s'∈S} P_a(s, s') = 1 and P_a(s, s') ≥ 0 for every s ∈ S and a ∈ A.
In the subsequent sections, we use the notation
The empirical maximum likelihood estimates of the transition probabilities from sampled trajectories are denoted by and in a similar fashion we use the notation
The following norms are used throughout this paper. The infinity norm of a vector x is defined as ||x||_∞ = max_i |x_i|. The L1 norm of a vector x is defined as ||x||_1 = Σ_i |x_i|. The induced matrix norm is defined as ||A||_∞ = max_i ||A_i||_1, where A_i is the i-th row of the matrix A. Note that for a right stochastic matrix P, we can see that ||P||_∞ = 1 and ||P^k||_∞ = 1 for every natural number k.
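As a concrete check of these definitions, a small numpy sketch (the matrix P and vector v are arbitrary examples, not quantities from the paper):

```python
import numpy as np

# A right stochastic matrix: each row is a probability distribution.
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])

# L1 norm of a vector: the sum of absolute entries.
v = np.array([1.0, -2.0, 3.0])
l1 = np.abs(v).sum()                      # = 6.0

# Induced infinity norm of a matrix: the maximum L1 norm over its rows.
inf_norm = np.abs(P).sum(axis=1).max()    # = 1 for any stochastic matrix

# Powers of a right stochastic matrix are again right stochastic,
# so their infinity norm is also 1.
P2 = P @ P
inf_norm_P2 = np.abs(P2).sum(axis=1).max()
```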
In this paper the reward function is assumed to be a function of the state alone rather than of the state and action. This assumption is also made for the initial results of Ng and Russell. A policy is defined as a map π: S → A. Given a policy π, we can define two functions.
The value function at a state s_1 under policy π is defined as V^π(s_1) = E[R(s_1) + γR(s_2) + γ²R(s_3) + ⋯ | π].
The Q-function is defined as Q^π(s, a) = R(s) + γ E_{s' ∼ P_a(s)}[V^π(s')].
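For a finite MDP, both functions have closed forms that can be computed directly. A minimal numpy sketch, assuming the standard definitions above (the names P_pi, P_a, and R are randomly generated stand-ins, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9

# Hypothetical transition matrix under a fixed policy pi, and a
# state-only reward R.
P_pi = rng.dirichlet(np.ones(n), size=n)   # right stochastic
R = rng.standard_normal(n)

# The Bellman equation V = R + gamma * P_pi V in matrix form gives
# V = (I - gamma * P_pi)^{-1} R.
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Q-value of taking an action a (with transitions P_a) in each state,
# then following pi afterwards.
P_a = rng.dirichlet(np.ones(n), size=n)
Q_a = R + gamma * P_a @ V
```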
Following Ng and Russell, the inverse reinforcement learning problem for finite states and actions can be formulated as the following linear programming problem, assuming without loss of generality that the optimal policy is π(s) ≡ a_1. By enforcing the Bellman optimality of the policy π, linear constraints on the reward function are formed. Ng and Russell then suggest some "natural" criteria that form the basis of the objective function minimized to obtain the desired reward function. Their formulation is as follows.
3 Geometric analysis of the IRL problem
The objective of the inverse reinforcement learning problem is to find a reward function that generates an optimal policy. As shown by Ng and Russell, the necessary and sufficient condition for a policy (without loss of generality π(s) ≡ a_1) to be optimal is given by the Bellman optimality principle and can be stated mathematically as (P_{a_1} − P_a)(I − γP_{a_1})^{-1} R ⪰ 0 for all a ∈ A.
Clearly, R = 0 is always a solution. However, this solution is degenerate in the sense that it allows any and every other policy to be "optimal" as well, and as a result it is of no practical use. If the constraint R ≠ 0 is considered, then, denoting by x_{a,i} the i-th row of (P_{a_1} − P_a)(I − γP_{a_1})^{-1}, the set of reward functions generating the optimal policy is the set of normals of hyperplanes passing through the origin for which the entire collection of points {x_{a,i}} lies in one half space. The problem of inverse reinforcement learning is then equivalent to the problem of finding such a separating hyperplane passing through the origin for the points {x_{a,i}}. Here we also assume that none of the x_{a,i} is zero, as a zero row would mean that there is no distinction between the actions a_1 and a at state i.
This geometric perspective of the IRL problem allows the classification of all finite state, finite action IRL problems into 3 regimes, graphically visualized in Figure 1:
Regime 1: In this regime, there is no hyperplane passing through the origin for which all the points lie in one half space. This is equivalent to saying that the origin is in the interior of the convex hull of the points . In this case, independent of the algorithm, there is no nonzero reward function for which the policy is optimal.
Regime 2: In this regime, up to scaling by a constant, there can be one or more hyperplanes passing through the origin for which all the points lie in one half space; however, each such hyperplane contains at least one of the points. This is equivalent to saying that the origin is on the boundary of the convex hull of the points but is not one of the vertices, since by assumption none of the points is zero. In this case, up to a constant scaling, there are one or more nonzero reward functions that generate the optimal policy. It is also important to notice that the policy cannot be strictly optimal for any of these reward functions.
Regime 3: In this regime, up to scaling by a constant, there are infinitely many hyperplanes passing through the origin for which all the points lie in one half space. This is equivalent to saying that the origin is outside the convex hull of the points. In this case, up to a constant scaling, there are infinitely many nonzero reward functions that generate the optimal policy, and it is possible to find a reward function for which the policy is strictly optimal.
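The regime of a given collection of points can be checked numerically: up to scaling, a strict separator through the origin exists exactly when the feasibility linear program below has a solution. A sketch using scipy, where the helper and both example point sets are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def strictly_separable(points):
    """Regime 3 test: does some R satisfy points @ R >= 1 elementwise?
    Any strict separator can be rescaled to achieve margin 1, so this
    feasibility LP is equivalent to strict separation from the origin."""
    m, d = points.shape
    res = linprog(c=np.zeros(d), A_ub=-points, b_ub=-np.ones(m),
                  bounds=[(None, None)] * d, method="highs")
    return res.status == 0   # status 0 = feasible, 2 = infeasible

# Points contained in an open half space: Regime 3 (a separator exists).
regime3 = np.array([[1.0, 0.2], [0.5, 1.0], [2.0, -0.1]])
# Points surrounding the origin: Regime 1 (no separator exists).
regime1 = np.array([[1.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]])
```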
These geometric regimes and their implication on the finite state, finite action inverse reinforcement learning problem are summed up in the following theorem.
There exists a hyperplane passing through the origin such that all the points lie on one side of the hyperplane (or on the hyperplane) if and only if there is a nonzero reward function that generates the optimal policy for the inverse reinforcement learning problem, i.e., a reward R ≠ 0 satisfying the Bellman optimality condition.
Notice that, as an extension of Theorem 3.1, there is a reward for which the policy is strictly optimal if and only if there exists a hyperplane for which all the points lie strictly on one side.
Note that it is possible to find a separating hyperplane between the origin and the collection of points if and only if the problem is in Regime 3. Therefore, in this regime the problem of inverse reinforcement learning can be viewed as a one-class support vector machine problem (or as a two-class support vector machine problem with the origin as the negative class). This, along with the objective of determining sample complexity, leads to the formulation of the problem discussed in the next section.
4 Formulation of optimization problem
The objective function of the inverse reinforcement learning formulation of Ng and Russell was formed by imposing the conditions that the value of the optimal policy be as far as possible from that of the next best action at each state, along with sparseness of the reward function. These were choices made by the authors to enable a unique solution to the proposed linear programming problem. We propose a different formulation, in terms of a one-class L1-regularized support vector machine, that allows for a geometric interpretation and provides an efficient sample complexity. The inverse reinforcement learning problem is now considered in Regime 3. Here it is known that there is a separating hyperplane between the origin and the points, so the strict inequality holds, which by scaling of the reward is equivalent to a margin of at least one. Formally, this assumption is stated as follows.
Definition 4.1 (-Strict Separability).
An inverse reinforcement learning problem satisfies -strict separability if and only if there exists a such that
Notice that the IRL problem is in Regime 3 (i.e. such that ) if and only if the strict separability assumption is satisfied.
Strict nonzero assumptions are well accepted in the statistical learning theory community and have been used, for instance, in compressed sensing, Markov random fields, nonparametric regression, and diffusion networks.
Problems in Regime 2 are avoided since, based on the statistical estimation of the transition probability matrices from empirical data, the problem can easily tip into Regime 1 or Regime 3, as shown in Figure 2. To solve problems in Regime 2, an infinite number of samples would be required, whereas problems in Regime 3 can be solved with a large enough number of samples.
Given the strict separability assumption, the optimization problem proposed is as follows
This problem is in the form of a one-class L1-regularized Support Vector Machine, except that we use hard margins instead of soft margins. The minimization of the L1 norm plays a twofold role in this formulation. First, it promotes a sparse reward function, in line with the idea of simplicity. Second, it plays a role in establishing the sample complexity bounds of the inverse reinforcement learning problem, as shown in the subsequent section. The constraints derive from strict Bellman optimality in the separable case (Regime 3) of inverse reinforcement learning and help avoid the degenerate solution R = 0. We now use this optimization problem, along with the objective of finding a reward function for which the policy is optimal, to establish the correctness and sample complexity of inverse reinforcement learning.
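A minimal sketch of this formulation as a linear program, assuming the constraint matrix has rows (P_opt − P_a)(I − γ P_opt)^{-1} as in the geometric analysis and using the standard split |R_i| ≤ t_i to linearize the L1 objective. The function name and the toy two-state MDP are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def irl_l1_svm(P_opt, P_others, gamma):
    """min ||R||_1  s.t.  (P_opt - P_a)(I - gamma*P_opt)^{-1} R >= 1
    elementwise for every non-optimal action a."""
    n = P_opt.shape[0]
    inv = np.linalg.inv(np.eye(n) - gamma * P_opt)
    X = np.vstack([(P_opt - P_a) @ inv for P_a in P_others])
    m = X.shape[0]
    c = np.concatenate([np.zeros(n), np.ones(n)])   # minimize sum(t)
    A_ub = np.block([
        [-X, np.zeros((m, n))],    # -X R <= -1   (i.e. X R >= 1)
        [np.eye(n), -np.eye(n)],   #  R - t <= 0
        [-np.eye(n), -np.eye(n)],  # -R - t <= 0
    ])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * n), method="highs")
    return res.x[:n] if res.status == 0 else None

# Two-state, two-action toy MDP where the first action (stay in state 0)
# is the optimal policy.
P_opt = np.array([[1.0, 0.0], [1.0, 0.0]])
P_a2  = np.array([[0.0, 1.0], [0.0, 1.0]])
R = irl_l1_svm(P_opt, [P_a2], gamma=0.5)
```

The recovered reward satisfies the margin constraints and has the minimum L1 norm among all rewards that do.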
5 Correctness and sample complexity of Inverse Reinforcement Learning
Consider the inverse reinforcement learning problem in the strictly separable case (Regime 3). We have such that
Let . Let be the solution to the optimization problem 4.1 with . We desire that
i.e., the reward we obtain from the problem using the estimated transition probability matrices also generates the same optimal policy for the problem with the true transition probabilities. This can be achieved by reducing the estimation error, i.e., by using more samples. The result in the strictly separable case follows from the following theorem.
Let be an inverse reinforcement learning problem that is - strictly separable. Let be the values of using estimates of the transition probability matrices such that . Let be the solution to the optimization problem 4.1 with . Let
Then we have .
Consider , using Hölder’s inequality we have
Now let where and is the reward satisfying the -strict separability for the problem. We have as well as . Now we have
We now construct to satisfy the constraints of the optimization problem 4.1 with by choosing such that
Notice here since we have , then
Now since is a feasible solution to the optimization problem 4.1 with for which is the optimal solution, we have from the objective function
Substituting this upper bound for in (5.1) we get,
It is important to note that since and and , we have with equality holding only when , i.e., with infinitely many samples. This shows the equivalence of the problems with the true and the estimated transition probabilities in the case of infinite samples.
Our desired result then follows as a corollary of the above theorem.
Let be an inverse reinforcement learning problem that is - strictly separable. Let be the values of using estimates of the transition probability matrices such that . Let be the solution to the optimization problem 4.1 with .
Then we have .
This follows straightforwardly by setting in Theorem 5.1. ∎
Let be an inverse reinforcement learning problem that is - strictly separable. Let every state be reachable from the starting state in one step with probability at least . Let be the solution to the optimization problem 4.1 with with transition probability matrices that are maximum likelihood estimates of formed from samples where
Then with probability at least , we have .
The theorem above follows from concentration inequalities for the estimation of the transition probabilities, which are detailed in the following section. (All missing proofs are included in the Supplementary Material.)
6 Concentration inequalities
In this section we look at the propagation of the concentration of the empirical estimate of the transition probabilities around their true values.
Let A and B be two matrices; then we have
Next we look at the propagation of the concentration of a right stochastic matrix to the concentration of its powers.
Let be a right stochastic matrix and let be an estimate of such that
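Assuming the bound in this lemma has the form ||P̂^k − P^k||_∞ ≤ k ||P̂ − P||_∞ (the statement above is partly elided), a quick numerical check with a randomly generated stochastic matrix and a stochastic perturbation of it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
P = rng.dirichlet(np.ones(n), size=n)   # true right stochastic matrix
Q = rng.dirichlet(np.ones(n), size=n)
P_hat = 0.99 * P + 0.01 * Q             # stochastic perturbation of P

inf_norm = lambda M: np.abs(M).sum(axis=1).max()
delta = inf_norm(P_hat - P)

# The k-th power of the estimate stays within k * delta of the k-th
# power of the truth (both matrices have infinity norm 1).
errors_ok = all(
    inf_norm(np.linalg.matrix_power(P_hat, k)
             - np.linalg.matrix_power(P, k)) <= k * delta + 1e-12
    for k in range(1, 6)
)
```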
Now we can consider the concentration of the expression .
Notice that since is a right stochastic matrix and , we can expand the inverse as the Neumann series Σ_{t=0}^∞ γ^t P̂^t and therefore
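The Neumann-series expansion, and the resulting bound ||(I − γP)^{-1}||_∞ ≤ 1/(1 − γ), can be verified numerically; the matrix below is an arbitrary stochastic example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 3, 0.8
P = rng.dirichlet(np.ones(n), size=n)   # right stochastic

inv = np.linalg.inv(np.eye(n) - gamma * P)

# Truncated Neumann series: sum_{t=0}^{200} gamma^t P^t.
acc, term = np.zeros((n, n)), np.eye(n)
for _ in range(201):
    acc += term
    term = gamma * (term @ P)
# The tail is bounded by gamma^201 / (1 - gamma), which is negligible,
# so acc should match the matrix inverse to machine precision.
```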
Let and be right stochastic matrices corresponding to actions and and let . Let and be estimates of and such that
Note that this result is for each action. The concentration over all actions can be found by using the union bound over the set of actions.
An estimate of the value of when the estimation is done using samples can be shown using the Dvoretzky-Kiefer-Wolfowitz inequality  to be on the order of .
This result is shown in the following Theorem 6.2.
Let be a right stochastic matrix for an action and let be a maximum likelihood estimate of formed from samples. If , then we have
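A sketch of this estimation procedure with a randomly generated true matrix; it illustrates that the worst-row L1 error of the maximum likelihood estimate shrinks as the number of samples m grows, in line with the DKW-style rate (the helper name and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
P = rng.dirichlet(np.ones(n), size=n)   # true transition matrix

def mle_estimate(P, m, rng):
    """Row-wise maximum-likelihood estimate from m sampled transitions
    out of every state (assumes we may start in any state)."""
    P_hat = np.empty_like(P)
    for s in range(P.shape[0]):
        nxt = rng.choice(P.shape[1], size=m, p=P[s])
        P_hat[s] = np.bincount(nxt, minlength=P.shape[1]) / m
    return P_hat

# Worst-row L1 error (the infinity norm of the difference) shrinks
# roughly like 1/sqrt(m).
err_small = np.abs(mle_estimate(P, 100, rng) - P).sum(axis=1).max()
err_large = np.abs(mle_estimate(P, 10000, rng) - P).sum(axis=1).max()
```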
The theorem above assumes that it is possible to start in any given state. However, this may not always be the case. When it is not, as long as every state is reachable from an initial state with probability at least , the result presented in Theorem 5.2 can be modified to use Theorem 6.3 instead of Theorem 6.2.
Let be a right stochastic matrix for an action and let be a maximum likelihood estimate of formed from samples. Let every state be reachable from the starting state in one step with probability at least . If , then
The result of Theorem 5.2 shows that the number of samples required to solve a -strict separable inverse reinforcement learning problem and obtain a reward that generates the desired optimal policy is on the order of . Notice that the number of samples is inversely proportional to . Thus, by viewing Regime 2 as a limiting case of the -strict separable case (Regime 3), it is easy to see that an infinite number of samples is required to guarantee that the reward obtained will generate the optimal policy for the MDP with the true transition probability matrices.
In practical applications, however, it may be difficult to determine whether an inverse reinforcement learning problem is -strict separable (Regime 3) or not. In this case, the result of equation (5.1) can be used as a witness to certify that the obtained reward satisfies Bellman's optimality condition with respect to the true transition probability matrices with high probability, as shown in the following remark.
Let be an inverse reinforcement learning problem. Let every state be reachable from the starting state in one step with probability at least . Let be the solution to the optimization problem 4.1 with with transition probability matrices that are maximum likelihood estimates of formed from samples and let
If , then with probability at least , we have .
8 Experimental results
Experiments were performed using randomly generated transition probability matrices for -strictly separable MDPs with states, actions, and with states, actions, . Both experiments were done with as the optimal policy. Thirty randomly generated MDPs were considered in each case, and a varying number of samples was used to estimate the transition probability matrices in each trial. Reward functions were found by solving Problem 4.1 for our L1-regularized SVM formulation and Problem 2.1 for the method of Ng and Russell, using the same set of estimated transition probabilities. The resulting reward functions were then tested using the true transition probabilities. The percentage of trials for which the Bellman optimality condition held for each method is shown in Figure 3 for different numbers of samples. As prescribed by Theorem 5.2, the sufficient number of samples for the success of our method is . As expected, the success rate increases with the number of samples. The L1-regularized support vector machine, however, significantly outperforms the linear programming formulation of Ng and Russell, reaching success shortly after the sufficient number of samples while the linear programming method falls far behind: the reward function given by the L1-regularized support vector machine formulation successfully generates the optimal policy in almost all trials once the sample size is large enough, while the reward function estimated by the linear programming method fails to generate the desired optimal policy.
9 Concluding remarks
The L1-regularized support vector machine formulation, along with the geometric interpretation, provides a useful way of looking at the inverse reinforcement learning problem with strong, formal guarantees. Possible future work includes extending to the inverse reinforcement learning problem with continuous states by using sets of basis functions as presented by Ng and Russell.
-  A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
-  Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004. ACM.
-  Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In International Conference on Machine Learning, pages 793–801, 2014.
-  Han Liu, Larry Wasserman, John D. Lafferty, and Pradeep K. Ravikumar. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems, pages 1201–1208, 2008.
-  A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
-  Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007.
-  Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using L1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
-  Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
-  Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J. Hastie. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.
Appendix A Proofs of Lemmas and Theorems
a.1 Proof of Theorem 3.1
The proof follows from the fact that the points lie on one side of the hyperplane passing through the origin given by if and only if
The proof in the ’if’ direction follows by taking the hyperplane defined by and noticing that so all the points lie on one side of the hyperplane passing through the origin given by
The proof in the ’only if’ direction is as follows. Consider a separating hyperplane . Without loss of generality,
Now let then so generates the optimal policy . ∎
a.2 Proof of Theorem 5.2
The proof of this theorem is a consequence of Corollary 5.1 and Theorems 6.1 and 6.3. Note that from Theorem 6.3, we want the concentration to hold with probability for all transition probability matrices corresponding to the set of actions. This can be viewed as the concentration inequality holding for a single matrix which gives us the result for samples
a.3 Proof of Lemma 6.1
Let , we have
From Hölder's inequality we get
a.4 Proof of Lemma 6.2
First note that if is a right stochastic matrix, then is a right stochastic matrix for all natural numbers . Consider right stochastic matrices . Consider the expression . From Lemma 6.1, we get
Notice that and and , thus we have
Now we will prove the lemma by induction. We have
Assume the statement for is true. For we have
Consider the previous result with . Substituting, we get
a.5 Proof of Theorem 6.1
a.6 Proof of Theorem 6.2
Here we invoke the Dvoretzky–Kiefer–Wolfowitz inequality. Consider
samples of a random variable with domain , and let correspond to the observed resulting state under an action taken at a state . Let be an estimate of the CDF of and let be the actual CDF. From the Dvoretzky–Kiefer–Wolfowitz inequality we have
Now consider the PDF of given by . Notice that
So if we have
Here we can interpret and as the -th rows of the matrices and respectively, where is the maximum likelihood estimator formed from samples. Applying the union bound over all rows of the matrix , we have for and samples,
a.7 Proof of Theorem 6.3
Without loss of generality, let every state be reachable from state by action after one step with probability at least . Let be a random variable with domain . Let be a Bernoulli random variable such that . Let be pairs of independent samples of and . Here represents the state chain
Consider the event . By the one-sided Hoeffding’s inequality and taking the union bound over all states we have
We also have the conditional maximum likelihood probability estimator
From Theorem 6.2 we have for event
By the law of total probability
By solving and we can see that if