. When the state-action space is not large and transition probabilities for all state-action pairs are known, standard techniques such as policy iteration and value iteration can compute an optimal policy. More often than not, however, problem instances of realistic size have very large state-action spaces, and it becomes impractical to know the transition probabilities everywhere and compute an optimal policy using these off-line methods.
For such large problems, one resorts to approximate methods, collectively referred to as Approximate Dynamic Programming (ADP) [3, 4]. ADP methods approximate the value function, the policy, or both, and optimize with respect to the approximation parameters, as is done, for instance, in actor-critic methods [5, 6, 7]. Optimizing the approximation parameters requires the learner to have access to the system and to be able to observe the effect of applied control actions.
In this paper, we adopt a different perspective and assume that the learner has no direct access to the system but only has samples of the actions applied in various states of the system. These actions are generated according to a policy that is fixed but unknown. As an example, the actions could be those chosen by an expert player who follows an optimal policy. Our goal is to learn a policy consistent with the observed state-action pairs, which we will call demonstrations.
A plausible way to obtain such a policy is to learn a mapping of the states-actions to a lower dimensional space. Borrowing from ADP methods, we can obtain a lower-dimensional representation of the state-action space through the use of features that are functions of the state and the action taken at that state. In particular, we will consider policies that combine various features through a weight vector and reduce the problem of learning the policy to learning this weight/parameter vector. Essentially, we will be learning a parsimonious parametrization of the policy used by the expert.
The related work in the literature on learning through demonstrations can be broadly classified into direct and indirect methods. In the direct methods, supervised learning techniques are used to obtain a best estimate of the expert’s behavior; specifically, a best fit to the expert’s behavior is obtained by minimizing an appropriately defined loss function. A key limitation of this approach is that the estimated policy is not well adapted to the parts of the state space not visited often by the expert, thus resulting in poor action selection if the system enters these states. Furthermore, the loss function can be non-convex in the policy, rendering the corresponding problem hard to solve.
Indirect methods, on the other hand, evaluate a policy by learning the full description of the MDP. In particular, one solves a so-called inverse reinforcement learning problem, which assumes that the dynamics of the environment are known but the one-step reward function is unknown. Then, one estimates from the demonstrations the reward function that the expert is aiming to maximize, and obtains the policy simply by solving the MDP. A policy obtained in this fashion tends to generalize better to the states visited infrequently by the expert, compared to policies obtained by direct methods. The main drawback of inverse reinforcement learning is that each iteration requires solving an MDP, which is computationally expensive. In addition, the assumption that the dynamics of the environment are known for all states and actions is unrealistic for problems with very large state-action spaces.
In this work, we exploit the benefits of both direct and indirect methods by assuming that the expert is using a Randomized Stationary Policy (RSP). As we alluded to earlier, an RSP is characterized in terms of a vector of features associated with state-action pairs and a parameter vector $\boldsymbol{\theta}$ weighing the various elements of the feature vector. We consider the case where we have many features, but only relatively few of them are sufficient to approximate the target policy well. However, we do not know in advance which features are relevant; learning them and the associated weights (elements of $\boldsymbol{\theta}$) is part of the learning problem.
We will use supervised learning to obtain the best estimate of the expert’s policy. As in actor-critic methods, we use an RSP which is a parameterized “Boltzmann” policy and rely on an $\ell_1$-regularized maximum likelihood estimator of the policy parameter vector. An $\ell_1$-norm regularization induces sparse estimates, which is useful in obtaining an RSP that uses only the relevant features. It has been shown that the sample complexity of $\ell_1$-penalized logistic regression grows as $O(\log n)$, where $n$ is the number of features. As a result, we can learn the parameter vector of the target RSP with relatively few samples, and the RSP we learn generalizes well across states that are not included in the demonstrations of the expert. Furthermore, $\ell_1$-regularized logistic regression is a convex optimization problem which can be solved efficiently.
1.1 Related Work
There is substantial work in the literature on learning MDP policies by observing experts; see  for a survey. We next discuss papers that are more closely related to our work.
In the context of indirect methods,  develops techniques for estimating a reward function from demonstrations under varying degrees of generality on the availability of an explicit model.  introduces an inverse reinforcement learning algorithm that obtains a policy from observed MDP trajectories followed by an expert. The policy is guaranteed to perform as well as that of the expert’s policy, even though the algorithm does not necessarily recover the expert’s reward function. In , the authors combine supervised learning and inverse reinforcement learning by fitting a policy to empirical estimates of the expert demonstrations over a space of policies that are optimal for a set of parameterized reward functions.  shows that the policy obtained generalizes well over the entire state space.  uses a supervised learning method to learn an optimal policy by leveraging the structure of the MDP, utilizing a kernel-based approach. Finally,  develops the DAGGER (Dataset Aggregation) algorithm that trains a deterministic policy which achieves good performance guarantees under its induced distribution of states.
Our work is different in that we focus on a parameterized set of policies rather than parameterized rewards. In this regard, our work is similar to approximate DP methods which parameterize the policy, e.g., expressing the policy as a linear functional in a lower dimensional parameter space. This lower-dimensional representation is critical in overcoming the well known “curse of dimensionality.”
We adopt $\ell_1$-regularized logistic regression to estimate a target RSP that generates a given collection of state-action samples. We evaluate the performance of the estimated policy and derive a bound on the difference between the average reward of the estimated RSP and the target RSP, typically referred to as regret. We show that a good estimate of the parameter of the target RSP also implies a good bound on the regret. To that end, we generalize a sample complexity result on the log-loss of the maximum likelihood estimates from the case of two actions available at each state to the multi-action case. Using this result, we establish a sample complexity result on the regret.
Our analysis is based on the novel idea of separating the loss in average reward into two parts. The first part is due to the error in the policy estimation (training error) and the second part is due to the perturbation in the stationary distribution of the Markov chain caused by using the estimated policy instead of the true target policy (perturbation error). We bound the first part by relating the result on the log-loss error to the Kullback-Leibler divergence between the estimated and the target RSPs. The second part is bounded using the ergodic coefficient of the induced Markov chain. Finally, we evaluate the performance of our method on a synthetic example.
The paper is organized as follows. In Section 2, we introduce some of our notation and state the problem. In Section 3, we describe the supervised learning algorithm used to train the policy. In Section 4, we establish a result on the (log-loss) error in policy estimation. In Section 5, we state our main result, which is a bound on the regret of the estimated policy, and in Section 6 we prove it. In Section 7, we introduce a robot navigation example and present our numerical results. We end with concluding remarks in Section 8.
Notational conventions. Bold letters are used to denote vectors and matrices; typically vectors are lower case and matrices upper case. Vectors are column vectors, unless explicitly stated otherwise. Prime denotes transpose. For the column vector $\mathbf{x} \in \mathbb{R}^n$ we write $\mathbf{x} = (x_1, \ldots, x_n)$ for economy of space, while $\|\mathbf{x}\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ denotes its $\ell_p$-norm. Vectors or matrices with all zeroes are written as $\mathbf{0}$, the identity matrix as $\mathbf{I}$, and $\mathbf{e}$ is the vector with all entries set to $1$. For any set $\mathcal{S}$, $|\mathcal{S}|$ denotes its cardinality. We use $\log$ to denote the natural logarithm and a subscript to denote different bases, e.g., $\log_2$ denotes logarithm with base $2$.
2 Problem Formulation
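Let $(\mathcal{X}, \mathcal{A}, \mathbf{P}, R)$ denote a Markov Decision Process (MDP) with a finite set of states $\mathcal{X}$ and a finite set of actions $\mathcal{A}$. For a state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$, let $P(y|x, a)$ denote the probability of transitioning to state $y \in \mathcal{X}$ after taking action $a$ in state $x$. The function $R: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ denotes the one-step reward function.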
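Let us now define a class of Randomized Stationary Policies (RSPs) parameterized by vectors $\boldsymbol{\theta} \in \mathbb{R}^n$. Let $\{\mu_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^n\}$ denote the set of RSPs we are considering. For a given parameter $\boldsymbol{\theta}$ and a state $x \in \mathcal{X}$, $\mu_{\boldsymbol{\theta}}(a|x)$ denotes the probability of taking action $a$ at state $x$. Specifically, we consider the Boltzmann-type RSPs of the form
$$\mu_{\boldsymbol{\theta}}(a|x) = \frac{\exp\{\boldsymbol{\theta}'\boldsymbol{\phi}(x,a)\}}{\sum_{b \in \mathcal{A}} \exp\{\boldsymbol{\theta}'\boldsymbol{\phi}(x,b)\}}, \tag{1}$$
where $\boldsymbol{\phi}(x,a) \in \mathbb{R}^n$ is a vector of features associated with each state-action pair $(x,a)$. (Features are normalized to take values in $[0,1]$.) Henceforth, we identify an RSP by its parameter $\boldsymbol{\theta}$. We assume that the policy is sparse, that is, the vector $\boldsymbol{\theta}$ has only $r$ non-zero components and each is bounded by $K$, i.e., $|\theta_i| \le K$ for all $i$. Given an RSP $\boldsymbol{\theta}$, the resulting transition probability matrix of the Markov chain is denoted by $\mathbf{P}_{\boldsymbol{\theta}}$, whose $(x,y)$ element is $P_{\boldsymbol{\theta}}(y|x) = \sum_{a \in \mathcal{A}} \mu_{\boldsymbol{\theta}}(a|x) P(y|x,a)$ for all state pairs $(x,y)$.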
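As an illustration of the Boltzmann-type policy just described, the following minimal sketch computes the action probabilities at a single state as a softmax of the weighted features. The function name, the three-action state, and the feature values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def boltzmann_policy(theta, features):
    """Return the action probabilities of a Boltzmann-type policy at one state.

    theta    : list of n weights (the policy parameter).
    features : list with one length-n feature vector per available action.
    """
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    shift = max(scores)  # subtract the largest score for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical state with three actions and four features in [0, 1].
phi_x = [
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.9],
    [0.0, 0.0, 0.5, 0.4],
]
theta = [2.0, 0.0, 0.0, 0.0]  # sparse parameter: only the first feature matters

probs = boltzmann_policy(theta, phi_x)
print([round(p, 3) for p in probs])
```

With a sparse parameter such as the one above, features whose weight is zero have no influence on the action probabilities, which is the property the sparsity assumption is meant to exploit.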
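Notice that for any RSP $\boldsymbol{\theta}$, the sequence of states $\{x_k\}$ and the sequence of state-action pairs $\{(x_k, a_k)\}$ form Markov chains with state space $\mathcal{X}$ and $\mathcal{X} \times \mathcal{A}$, respectively. We assume that for every $\boldsymbol{\theta}$, the Markov chains $\{x_k\}$ and $\{(x_k, a_k)\}$ are irreducible and aperiodic with stationary probabilities $\pi_{\boldsymbol{\theta}}(x)$ and $\eta_{\boldsymbol{\theta}}(x,a) = \pi_{\boldsymbol{\theta}}(x)\mu_{\boldsymbol{\theta}}(a|x)$, respectively.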
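The average reward function associated with an RSP $\boldsymbol{\theta}$ is a function $\overline{R}: \mathbb{R}^n \to \mathbb{R}$ defined as
$$\overline{R}(\boldsymbol{\theta}) = \sum_{(x,a) \in \mathcal{X} \times \mathcal{A}} \eta_{\boldsymbol{\theta}}(x,a) R(x,a).$$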
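Let us now fix a target RSP $\boldsymbol{\theta}^*$. As we assumed above, $\boldsymbol{\theta}^*$ is sparse, having at most $r$ non-zero components $\theta^*_i$, each satisfying $|\theta^*_i| \le K$. This is simply the policy used by an expert (not necessarily optimal) which we wish to learn. We let $\mathcal{S} = \{(x_i, a_i)\}_{i=1}^m$ denote a set of $m$ state-action samples generated by playing policy $\boldsymbol{\theta}^*$. The state samples $x_1, \ldots, x_m$ are independent and identically distributed (i.i.d.), drawn from the stationary distribution $\pi_{\boldsymbol{\theta}^*}$, and $a_i$ is the action taken in state $x_i$ according to the policy $\mu_{\boldsymbol{\theta}^*}$. It follows that the samples in $\mathcal{S}$ are i.i.d. according to the distribution $\mathcal{D}$ with probabilities $\eta_{\boldsymbol{\theta}^*}(x,a)$.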
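We assume we have access only to the demonstrations $\mathcal{S}$ while the transition probability matrix $\mathbf{P}_{\boldsymbol{\theta}^*}$ and the target RSP $\boldsymbol{\theta}^*$ are unknown. The goal is to learn the target policy $\boldsymbol{\theta}^*$ and characterize the average reward obtained when the learned RSP is applied to the Markov process. In particular, we are interested in estimating the target parameter $\boldsymbol{\theta}^*$ from the samples $\mathcal{S}$ efficiently and evaluating the performance of the estimated RSP with respect to the target RSP, i.e., bounding the regret defined as
$$\mathrm{Reg}(\mathcal{S}) = \overline{R}(\boldsymbol{\theta}^*) - \overline{R}(\hat{\boldsymbol{\theta}}),$$
where $\hat{\boldsymbol{\theta}}$ is the estimated RSP from the samples $\mathcal{S}$.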
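To make the notions of induced Markov chain, average reward, and regret concrete, here is a small sketch on a hypothetical two-state, two-action MDP. The transition probabilities, rewards, and the two policies are invented for illustration, and the stationary distribution is obtained by plain power iteration, which assumes the induced chain is irreducible and aperiodic.

```python
def induced_chain(policy, P):
    """State transition matrix under a randomized policy.

    policy[x][a] is the probability of action a in state x;
    P[x][a][y] is the probability of moving from x to y under a.
    """
    n_states = len(P)
    return [[sum(policy[x][a] * P[x][a][y] for a in range(len(P[x])))
             for y in range(n_states)] for x in range(n_states)]

def stationary_distribution(T, iters=1000):
    """Power iteration; assumes an irreducible, aperiodic chain."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * T[x][y] for x in range(n)) for y in range(n)]
    return pi

def average_reward(policy, P, R):
    """Sum over (x, a) of stationary state-action probability times reward."""
    pi = stationary_distribution(induced_chain(policy, P))
    return sum(pi[x] * policy[x][a] * R[x][a]
               for x in range(len(P)) for a in range(len(P[x])))

# Hypothetical MDP: action 0 drifts toward state 0, action 1 toward state 1.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.9, 0.1], [0.1, 0.9]]]
R = [[0.0, 0.0], [1.0, 1.0]]           # reward 1 whenever the chain is in state 1
target = [[0.2, 0.8], [0.2, 0.8]]      # stands in for the expert policy
estimate = [[0.3, 0.7], [0.3, 0.7]]    # stands in for the learned policy

regret = average_reward(target, P, R) - average_reward(estimate, P, R)
print(round(regret, 6))
```

The regret here reflects both the mismatch in action probabilities and the shift in how often each state is visited, which is the decomposition the analysis later relies on.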
3 Estimating the policy
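Next we discuss how to estimate the target RSP $\boldsymbol{\theta}^*$ from the $m$ i.i.d. state-action training samples in $\mathcal{S}$. Given the Boltzmann structure of the RSP we have assumed, we fit a logistic regression function using a regularized maximum likelihood estimator as follows:
$$\max_{\boldsymbol{\theta} \in \mathbb{R}^n} \ \sum_{i=1}^m \log \mu_{\boldsymbol{\theta}}(a_i|x_i) \quad \text{s.t.} \quad \|\boldsymbol{\theta}\|_1 \le \beta,$$
where $\beta$ is a parameter that adjusts the trade-off between fitting the training data “well” and obtaining a sparse RSP that generalizes well on a test sample.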
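The following sketch shows one way such a sparse maximum likelihood fit could be computed. It is an assumption-laden illustration, not the paper's Algorithm 1: it solves the penalized form of the problem (the average log-likelihood minus a multiple `lam` of the $\ell_1$-norm) by proximal gradient ascent with soft-thresholding, and the toy samples, step size, and iteration count are hypothetical.

```python
import math

def policy_probs(theta, feats):
    """Boltzmann action probabilities; feats[b] is the feature vector of action b."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in feats]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [v / z for v in w]

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_policy(samples, n, lam, step=0.5, iters=500):
    """Proximal gradient ascent on average log-likelihood minus lam * ||theta||_1.

    samples: list of (feats, a) pairs, where a is the observed action index.
    """
    theta = [0.0] * n
    m = len(samples)
    for _ in range(iters):
        grad = [0.0] * n
        for feats, a in samples:
            p = policy_probs(theta, feats)
            for j in range(n):
                expected = sum(p[b] * feats[b][j] for b in range(len(feats)))
                grad[j] += (feats[a][j] - expected) / m
        theta = [soft_threshold(theta[j] + step * grad[j], step * lam)
                 for j in range(n)]
    return theta

# Hypothetical data: two state types, two actions, three features.
# Only feature 0 distinguishes the actions; the expert picks the action
# whose feature 0 equals 1 in 80% of the samples.
feats_a = [[1.0, 0.3, 0.5], [0.0, 0.3, 0.5]]
feats_b = [[0.0, 0.7, 0.5], [1.0, 0.7, 0.5]]
samples = ([(feats_a, 0)] * 4 + [(feats_a, 1)] +
           [(feats_b, 1)] * 4 + [(feats_b, 0)])

theta_hat = fit_l1_policy(samples, n=3, lam=0.05)
print([round(t, 4) for t in theta_hat])
```

In this toy problem the irrelevant features receive weight exactly zero, and a sufficiently large penalty drives the whole vector to zero, which mirrors the trade-off between fit and sparsity described above.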
We can evaluate how well the fitted logistic function explains the samples using a log-loss metric, defined as the expected negative log-likelihood over the (random) test data. Formally, for any parameter $\boldsymbol{\theta}$, log-loss is given by
$$\epsilon(\boldsymbol{\theta}) = \mathbb{E}_{(x,a) \sim \mathcal{D}}\left[-\log \mu_{\boldsymbol{\theta}}(a|x)\right],$$
where the expectation is taken over state-action pairs $(x,a)$ drawn from the distribution $\mathcal{D}$; recall that we defined $\mathcal{D}$ to be the stationary distribution of the state-action pairs induced by the policy $\boldsymbol{\theta}^*$. Since the expectation is taken with respect to new data not included in the training set $\mathcal{S}$, we can think of $\epsilon(\boldsymbol{\theta})$ as an out-of-sample metric of how well the RSP $\boldsymbol{\theta}$ approximates the actions taken by the target RSP $\boldsymbol{\theta}^*$.
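We also define a sample-average version of log-loss: given any set $\mathcal{S}' = \{(x_i, a_i)\}_{i=1}^{|\mathcal{S}'|}$ of state-action pairs, define
$$\hat{\epsilon}_{\mathcal{S}'}(\boldsymbol{\theta}) = \frac{1}{|\mathcal{S}'|} \sum_{i=1}^{|\mathcal{S}'|} \left(-\log \mu_{\boldsymbol{\theta}}(a_i|x_i)\right).$$
We will use the term empirical log-loss to refer to log-loss over the training set $\mathcal{S}$ and use the notation $\hat{\epsilon}(\boldsymbol{\theta}) = \hat{\epsilon}_{\mathcal{S}}(\boldsymbol{\theta})$.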
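A sample-average log-loss of this kind can be sketched as below: the mean negative log-probability that a Boltzmann policy assigns to the observed actions. The helper names and the toy samples are hypothetical; the log-probability is computed with a log-sum-exp shift to avoid overflow.

```python
import math

def log_prob(theta, feats, a):
    """Log-probability of action a under a Boltzmann policy.

    feats[b] is the feature vector of action b at the sampled state.
    """
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in feats]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[a] - log_z

def empirical_log_loss(theta, samples):
    """Average negative log-likelihood over (feats, action) samples."""
    return -sum(log_prob(theta, feats, a) for feats, a in samples) / len(samples)

# Hypothetical samples: two actions, the expert prefers the action whose
# first feature equals 1 in 80% of the cases.
feats_a = [[1.0, 0.3], [0.0, 0.3]]
feats_b = [[0.0, 0.7], [1.0, 0.7]]
samples = ([(feats_a, 0)] * 4 + [(feats_a, 1)] +
           [(feats_b, 1)] * 4 + [(feats_b, 0)])

loss_uniform = empirical_log_loss([0.0, 0.0], samples)         # uniform policy
loss_fitted = empirical_log_loss([math.log(4.0), 0.0], samples)
print(round(loss_uniform, 4), round(loss_fitted, 4))
```

Evaluating the same function on held-out samples rather than on the training set gives an estimate of the out-of-sample log-loss.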
4 Log-loss performance
In this section we establish a sample complexity result indicating that relatively few training samples (logarithmic in the dimension $n$ of the RSP parameter $\boldsymbol{\theta}$) are needed to guarantee that the estimated RSP $\hat{\boldsymbol{\theta}}$ has out-of-sample log-loss close to that of the target RSP $\boldsymbol{\theta}^*$. We follow an existing line of analysis for the case where only two actions are available at every state and generalize the result to the general multi-action setting.
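We start by relating the difference of the log-loss function associated with RSP $\boldsymbol{\theta}^*$ and its estimate $\hat{\boldsymbol{\theta}}$ to the relative entropy, or Kullback-Leibler (KL) divergence, between the corresponding RSPs. For a given $x \in \mathcal{X}$, we denote the KL-divergence between RSPs $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ as $D(\mu_{\boldsymbol{\theta}_1}(\cdot|x) \,\|\, \mu_{\boldsymbol{\theta}_2}(\cdot|x))$, where $\mu_{\boldsymbol{\theta}}(\cdot|x)$ denotes the probability distribution induced by RSP $\boldsymbol{\theta}$ on $\mathcal{A}$ in state $x$, and is given as follows:
$$D(\mu_{\boldsymbol{\theta}_1}(\cdot|x) \,\|\, \mu_{\boldsymbol{\theta}_2}(\cdot|x)) = \sum_{a \in \mathcal{A}} \mu_{\boldsymbol{\theta}_1}(a|x) \log \frac{\mu_{\boldsymbol{\theta}_1}(a|x)}{\mu_{\boldsymbol{\theta}_2}(a|x)}.$$
We also define the average KL-divergence, denoted by $D_{\boldsymbol{\theta}_1}(\mu_{\boldsymbol{\theta}_1} \,\|\, \mu_{\boldsymbol{\theta}_2})$, as the average of $D(\mu_{\boldsymbol{\theta}_1}(\cdot|x) \,\|\, \mu_{\boldsymbol{\theta}_2}(\cdot|x))$ over states visited according to the stationary distribution $\pi_{\boldsymbol{\theta}_1}$ of the Markov chain induced by policy $\boldsymbol{\theta}_1$. Specifically, we define
$$D_{\boldsymbol{\theta}_1}(\mu_{\boldsymbol{\theta}_1} \,\|\, \mu_{\boldsymbol{\theta}_2}) = \sum_{x \in \mathcal{X}} \pi_{\boldsymbol{\theta}_1}(x)\, D(\mu_{\boldsymbol{\theta}_1}(\cdot|x) \,\|\, \mu_{\boldsymbol{\theta}_2}(\cdot|x)).$$
Let $\hat{\boldsymbol{\theta}}$ be an estimate of RSP $\boldsymbol{\theta}^*$. Then,
$$\epsilon(\hat{\boldsymbol{\theta}}) - \epsilon(\boldsymbol{\theta}^*) = D_{\boldsymbol{\theta}^*}(\mu_{\boldsymbol{\theta}^*} \,\|\, \mu_{\hat{\boldsymbol{\theta}}}).$$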
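The relation between the log-loss gap and the average KL-divergence can be checked numerically. The sketch below assumes that log-loss is the expected negative log-probability with states drawn from a stationary distribution and actions drawn from the target policy; the two-state distribution and the two policies are hypothetical numbers chosen for illustration.

```python
import math

def kl(p, q):
    """KL divergence between two action distributions at one state."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

def average_kl(pi, target, estimate):
    """State-wise KL divergence averaged over the distribution pi."""
    return sum(pi[x] * kl(target[x], estimate[x]) for x in range(len(pi)))

def log_loss(pi, target, policy):
    """Expected negative log-probability under policy when the state is drawn
    from pi and the action from the target policy."""
    return -sum(pi[x] * target[x][a] * math.log(policy[x][a])
                for x in range(len(pi)) for a in range(len(target[x])))

# Hypothetical stationary distribution and policies over two states, two actions.
pi = [0.26, 0.74]
target = [[0.2, 0.8], [0.6, 0.4]]
estimate = [[0.3, 0.7], [0.5, 0.5]]

gap = log_loss(pi, target, estimate) - log_loss(pi, target, target)
print(round(gap, 6), round(average_kl(pi, target, estimate), 6))
```

The two printed numbers coincide up to rounding, and both are non-negative, so under these assumptions no policy can achieve a lower log-loss than the target itself.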
For the case of a binary action at every state, the following result is known. To state the theorem, let $\mathrm{poly}(\cdot)$ denote a function which is polynomial in its arguments and recall that $m$ is the number of state-action pairs in the training set used to learn $\hat{\boldsymbol{\theta}}$.
Theorem 4.2 (, Thm. 3.1)
Suppose action set contains only two actions, i.e. . Let and . In order to guarantee that, with probability at least , produced by the algorithm performs as well as , i.e.,
it suffices that .
We will generalize the result to the case when more than two actions are available at each state. We assume that , that is, at most actions are available at each state. By introducing in (1) features that get activated at specific states, it is possible to accommodate MDPs where some of the actions are not available at these states.
Theorem 4.3
Let and . In order to guarantee that, with probability at least , produced by Algorithm 1 performs as well as , i.e.,
it suffices that
Furthermore, in terms of only , .
The remainder of this section is devoted to proving Theorem 4.3. The proof is similar to the proof of Theorem 4.2 (Theorem 3.1 in ) but with key differences to accommodate the multiple actions per state. We start by introducing some notations and stating necessary lemmata.
Denote by a class of functions over some domain and range . Let , which is the valuation of the class of functions at a certain collection of points . It is said that a set of vectors in -covers in the -norm, if for every point there exists some , , such that . Let also denote the size of the smallest set that -covers in the -norm. Finally, let
To simplify the representation of log-loss of the general logistic function in (1), we use the following notations. For each , denotes feature vectors associated with each action. For any , define the -likelihood function as
Note . Further, let be the class of functions . We can then rewrite using as .
Lemma 4.5 ()
Suppose and the input has a norm-bound such that , where . Then
Let be a class of functions from to some bounded set in . Consider as a class of functions from to some bounded set in , with the following form:
If is Lipschitz with Lipschitz constant in the -norm for every , then we have
Proof: Let . It is sufficient to show for every inputs
we can find points in that -cover . Fix some set of points . From the definition of , for each ,
Let be a set of points in that -covers . (One may find fewer than points, but we consider the worst-case scenario.) We use notation to denote the th element of vector . Then, for any and , there exists a such that
Now consider points with the following form
Given a and , let be as defined above and consider
where we used the Lipschitz property of the function in the first inequality and the last inequality follows from (4). Thus, the points -cover in -norm. Finally, notice that the set of points is arbitrary, which concludes the proof. ∎
We now continue with the proof of Theorem 4.3. Recall Algorithm 1 and let be the smallest integer in that is greater than or equal to . Notice that in Algorithm 1 one can use a larger , but we select the smallest possible to obtain a tighter bound. For such a , it follows that . Define a class of functions with domain as
The partial derivatives of the log-loss function are
and it can be seen that
Hence, the Lipschitz constant for is for any .
We next find the range of class . To begin with, the range of class is
Since is Lipschitz in -norm with Lipschitz constant and (by the fact ), then
Treat as a constant. To upper bound the right hand side of the above equation by , it suffices to have
Thus, using our definition of in (9), for any with and with probability at least , we have
Therefore, for all with and with probability at least , it holds
where is the training set from Algorithm 1.
where is the target policy and the last inequality follows simply from the fact that .
Eq. (15) indicates that the training step of Algorithm 1 finds at least one parameter vector whose performance is nearly as good as that of . At the validation step, we select one of the found during training. It can be shown that with a validation set of the same order of magnitude as the training set (and independent of ), we can ensure that with probability at least , the selected parameter vector will have performance at most worse than that of the best performing vector discovered in the training step. Hence, with probability at least , the output of our algorithm satisfies
Finally, replacing with and with everywhere in the proof, establishes Theorem 4.3.
5 Bounds on Regret
Theorem 4.3 provides a sufficient condition on the number of samples required to learn a policy whose log-loss performance is close to the target policy. In this section we study the regret of the estimated policy, defined in Sec. 2 as the difference between the average reward of the target policy and the estimated policy. Given that we use a number of samples in the training set proportional to the expression provided in Theorem 4.3, we establish explicit bounds on the regret.
We will bound the regret of the estimated policy by separating the effect of the error in estimating the policy function (which is characterized by Theorem 4.3) and the effect the estimated policy function introduces in the stationary distribution of the Markov chain governing how states are visited. To bound the regret due to the perturbation of the stationary distribution, we will use results from the sensitivity analysis of Markov chains. In Section 5.1 we provide some standard definitions for Markov chains and state our result on the regret, while in Section 6 we provide a proof of this result.
5.1 Main Result
We start by defining the fundamental matrix of a Markov chain.
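The fundamental matrix of a Markov chain with state transition probability matrix $\mathbf{P}_{\boldsymbol{\theta}}$ induced by RSP $\boldsymbol{\theta}$ is
$$\mathbf{Z}_{\boldsymbol{\theta}} = \left(\mathbf{I} - \mathbf{P}_{\boldsymbol{\theta}} + \mathbf{e}\boldsymbol{\pi}_{\boldsymbol{\theta}}'\right)^{-1},$$
where $\mathbf{e}$ denotes the vector of all $1$’s, and $\boldsymbol{\pi}_{\boldsymbol{\theta}}$ denotes the stationary distribution associated with $\mathbf{P}_{\boldsymbol{\theta}}$. Also, the group inverse of $\mathbf{A}_{\boldsymbol{\theta}} = \mathbf{I} - \mathbf{P}_{\boldsymbol{\theta}}$, denoted as $\mathbf{A}_{\boldsymbol{\theta}}^{\#}$, is the unique matrix satisfying
$$\mathbf{A}_{\boldsymbol{\theta}}\mathbf{A}_{\boldsymbol{\theta}}^{\#}\mathbf{A}_{\boldsymbol{\theta}} = \mathbf{A}_{\boldsymbol{\theta}}, \qquad \mathbf{A}_{\boldsymbol{\theta}}^{\#}\mathbf{A}_{\boldsymbol{\theta}}\mathbf{A}_{\boldsymbol{\theta}}^{\#} = \mathbf{A}_{\boldsymbol{\theta}}^{\#}, \qquad \mathbf{A}_{\boldsymbol{\theta}}\mathbf{A}_{\boldsymbol{\theta}}^{\#} = \mathbf{A}_{\boldsymbol{\theta}}^{\#}\mathbf{A}_{\boldsymbol{\theta}}.$$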
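Most of the properties of a Markov chain can be expressed in terms of the fundamental matrix $\mathbf{Z}$. For example, if $\boldsymbol{\pi}$ is the stationary probability distribution associated with a Markov chain whose transition probability matrix is $\mathbf{P}$ and if $\tilde{\mathbf{P}} = \mathbf{P} + \mathbf{E}$ for some perturbation matrix $\mathbf{E}$, then the stationary probability distribution $\tilde{\boldsymbol{\pi}}$ of the Markov chain with transition probability matrix $\tilde{\mathbf{P}}$ satisfies the relation
$$\tilde{\boldsymbol{\pi}}' - \boldsymbol{\pi}' = \tilde{\boldsymbol{\pi}}'\mathbf{E}\mathbf{Z},$$
where $\mathbf{Z}$ is the fundamental matrix associated with $\mathbf{P}$.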
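The perturbation relation can be verified on a small example. This sketch assumes the standard definition of the fundamental matrix as the inverse of the identity minus the transition matrix plus the rank-one matrix whose rows all equal the stationary distribution, and checks that the change in the stationary distribution equals the perturbed stationary row vector times the perturbation times the fundamental matrix. The two transition matrices are hypothetical, and the inverse is computed with a plain Gauss-Jordan routine to keep the example dependency-free.

```python
def invert(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(M)
    A = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]

def stationary_distribution(P, iters=2000):
    """Power iteration; assumes an irreducible, aperiodic chain."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def fundamental_matrix(P, pi):
    """Inverse of (I - P + e pi'), where (e pi')[i][j] = pi[j]."""
    n = len(P)
    return invert([[float(i == j) - P[i][j] + pi[j] for j in range(n)]
                   for i in range(n)])

# Hypothetical chain and a perturbed version of it.
P = [[0.9, 0.1], [0.2, 0.8]]
P_tilde = [[0.8, 0.2], [0.3, 0.7]]
n = len(P)

pi = stationary_distribution(P)
pi_tilde = stationary_distribution(P_tilde)
Z = fundamental_matrix(P, pi)
E = [[P_tilde[i][j] - P[i][j] for j in range(n)] for i in range(n)]

lhs = [pi_tilde[j] - pi[j] for j in range(n)]
pi_E = [sum(pi_tilde[i] * E[i][j] for i in range(n)) for j in range(n)]
rhs = [sum(pi_E[i] * Z[i][j] for i in range(n)) for j in range(n)]
print([round(v, 6) for v in lhs], [round(v, 6) for v in rhs])
```

Both sides agree up to rounding, which illustrates why norms of the fundamental matrix can serve as a measure of how sensitive the stationary distribution is to changes in the policy.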
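The ergodic coefficient of a matrix $\mathbf{B}$ with equal row sums is
$$\tau_1(\mathbf{B}) = \sup_{\mathbf{v}'\mathbf{e} = 0,\ \|\mathbf{v}\|_1 = 1} \|\mathbf{v}'\mathbf{B}\|_1 = \frac{1}{2} \max_{i,j} \sum_{s} |b_{is} - b_{js}|.$$
The ergodic coefficient of a Markov chain indicates the sensitivity of its stationary distribution. For any stochastic matrix $\mathbf{P}$, $0 \le \tau_1(\mathbf{P}) \le 1$.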
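As a quick illustration, the sketch below computes an ergodic coefficient under the assumption that it is the standard $\ell_1$ version, which for a matrix with equal row sums equals half the largest $\ell_1$ distance between any two rows. The example matrices are hypothetical.

```python
def ergodic_coefficient(B):
    """Half the largest l1 distance between two rows of B.

    Assumes B has equal row sums (e.g., a stochastic matrix).
    """
    n_rows, n_cols = len(B), len(B[0])
    return 0.5 * max(
        sum(abs(B[i][s] - B[j][s]) for s in range(n_cols))
        for i in range(n_rows) for j in range(n_rows)
    )

identity = [[1.0, 0.0], [0.0, 1.0]]      # chain that never mixes
equal_rows = [[0.26, 0.74], [0.26, 0.74]]  # chain that mixes in one step
slow_chain = [[0.9, 0.1], [0.2, 0.8]]

print(ergodic_coefficient(identity),
      ergodic_coefficient(equal_rows),
      round(ergodic_coefficient(slow_chain), 6))
```

A value near 0 means all rows of the transition matrix are nearly identical, so the chain forgets its starting state quickly; a value near 1 indicates slow mixing and hence a stationary distribution that is more sensitive to perturbations.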
We now have all the ingredients to state our main result bounding regret.
Theorem 5.1
Given and , suppose
i.i.d. samples are used by Algorithm 1 to produce an estimate of the unknown target RSP parameter . Then with probability at least , we have
and is a constant that depends on the RSP and can be any of the following:
The constant is referred to as condition number. The regret is thus governed by the condition number of the estimated RSP; the smaller the condition number of the trained policy, the smaller is the regret.
6 Proof of the Main Result
In this section we analyze the average reward obtained by the MDP when we apply the estimated RSP , and prove the regret bound in Theorem 5.1. First, we bound the regret as the sum of two parts.
Note that the first absolute sum above has terms for all that are related to the estimation error from fitting the RSP policy to