IRL addresses an ill-posed inverse problem: recovering a reward function in a Markov decision process (MDP) from expert demonstrations. The recovered reward function quantifies how good or bad certain actions are, and with knowledge of the reward function, agents can perform better. However, not all reward functions provide a succinct, robust, and transferable definition of the learning task, and a policy may be optimal for many reward functions. For example, any given policy is optimal for a constant reward function in an MDP. Thus, the core of IRL is to explore the regular structure of reward functions.
Many existing methods impose a regular structure on reward functions as a combination of hand-selected features. Formally, a set of real-valued feature functions is hand-selected by experts as basis functions of the reward function space in an MDP. Then, the reward function is approximated by a linear or nonlinear combination of the hand-selected feature functions. The goal of this approach is to find the best-fitting weights of the feature functions in the reward function. One framework in this approach formulates the IRL problem as a numerical optimization problem [1; 2; 3; 4], and the other is a probabilistic approach based on maximum a posteriori estimation [5; 6; 7; 8].
In this paper, we propose a generalized perspective on the IRL problem called stochastic inverse reinforcement learning (SIRL), which is formulated as an expectation optimization problem that aims to recover a probability distribution over reward functions from expert demonstrations. Typically, the solution to classic IRL is not always best-fitting, because a highly nonlinear inverse problem with limited information from a collection of expert behavior is very likely to get trapped in a secondary maximum for a partially observable system. Thus, we employ the Monte Carlo expectation-maximization (MCEM) approach, which adopts a Monte Carlo mechanism for exhaustive search as a global search method, to give the first solution to the SIRL problem in a model-based environment. We then obtain the desired reward conditional probability distribution, which can generate more than one set of weights for the reward feature functions, composing alternative solutions to the IRL problem. The main benefit of our generalized perspective is a method that allows the analysis and display of any given highly nonlinear IRL problem through a large collection of pseudorandomly generated local likelihood maxima. In view of the successful application of IRL to imitation learning and apprenticeship learning in industry [4; 10; 11; 12], our generalized method has great potential practical value.
An MDP is defined as a tuple $M = (S, A, T, R, \gamma)$, where $S$ is the set of states, $A$ is the set of actions, and the transition function $T(s' \mid s, a)$ for $s, s' \in S$ and $a \in A$ records the probability of being in the current state $s$, taking action $a$, and reaching the next state $s'$. The reward $R(s)$ is a real-valued function and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$, which is a map from states to actions, has two formulations. The stochastic policy refers to a conditional distribution $\pi(a \mid s)$, and the deterministic policy is represented by a deterministic function $a = \pi(s)$. Sequential decisions are recorded in a series of episodes which consist of states, actions, and rewards. The goal of a reinforcement learning task is to find the optimal policy $\pi^*$ that maximizes the expected total reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right]$. In an IRL problem setting, we have an MDP without a reward function, denoted as an MDPR, and a collection of expert demonstrations $\zeta_E$. Each demonstration $\zeta_i \in \zeta_E$ consists of sequential state-action pairs representing the behavior of an expert. The goal of the IRL problem is to estimate the unknown reward function $R$ from the expert demonstrations $\zeta_E$ for an MDPR. The learned complete MDP yields an optimal policy that acts as closely as possible to the expert demonstrations.
2.1 MaxEnt and DeepMaxEnt
In this section, we give a brief overview of the probabilistic approach to the IRL problem under two existing kinds of regular structure on reward functions. The first kind is a linear structure on the reward function. A reward function is written as $R(s) = W^\top \phi(s)$, where $\phi: S \to \mathbb{R}^d$ are $d$-dimensional feature functions hand-selected by experts. Ziebart et al. propose a probabilistic approach that deals with the ambiguity of the IRL problem based on the principle of maximum entropy, which is called maximum entropy IRL (MaxEnt). In MaxEnt, we assume that trajectories with higher rewards are exponentially more preferred, $P(\zeta) \propto \exp\left(\sum_{s \in \zeta} R(s)\right)$, where $\zeta$ is one trajectory from the expert demonstrations. The objective function for MaxEnt is derived from maximizing the likelihood of the expert trajectories under the maximum entropy distribution, and it is always convex for deterministic MDPs. Typically, the optimum is obtained by a gradient-based method. The second kind is a nonlinear structure on the reward function of the form $R(s) = F(\phi(s))$, where $F$ is a nonlinear function of the feature basis $\phi$. Following the principle of maximum entropy, Wulfmeier et al. extend MaxEnt by approximating the unknown nonlinear reward with a neural network, which is called maximum entropy deep IRL (DeepMaxEnt).
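The gradient-based fit for the linear case can be illustrated with a minimal sketch. It assumes the candidate trajectories are few enough to enumerate, which is a simplification: practical MaxEnt implementations compute expected feature counts with dynamic programming instead. The names `maxent_grad`, `fit_maxent`, and the toy feature matrix are hypothetical and not taken from the paper.

```python
import numpy as np


def maxent_grad(weights, candidate_features, expert_indices):
    """Gradient of the mean expert log-likelihood under P(traj) ~ exp(w . f(traj)).

    candidate_features: (num_trajectories, d) summed features of each candidate.
    expert_indices: indices of the candidates demonstrated by the expert.
    """
    scores = candidate_features @ weights
    scores = scores - scores.max()                  # stabilise the softmax
    probs = np.exp(scores) / np.exp(scores).sum()   # MaxEnt trajectory distribution
    expert_mean = candidate_features[expert_indices].mean(axis=0)
    model_mean = probs @ candidate_features         # expected features under the model
    return expert_mean - model_mean


def fit_maxent(candidate_features, expert_indices, lr=0.5, steps=2000):
    """Plain gradient ascent on the (concave) log-likelihood."""
    weights = np.zeros(candidate_features.shape[1])
    for _ in range(steps):
        weights += lr * maxent_grad(weights, candidate_features, expert_indices)
    return weights


# Toy problem (hypothetical): three candidate trajectories, two features.
candidate_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expert_indices = np.array([2, 2, 0, 1])
weights = fit_maxent(candidate_features, expert_indices)
print(weights)
```

At the optimum the model's expected features match the expert's empirical features, which is the feature-matching property that the maximum entropy principle enforces.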
To generalize the classic regular structures on the reward function, we propose a stochastic regular structure on the reward function in the following section.
2.2 Problem Statement
Formally, we are given an MDPR with a known transition function $T(s' \mid s, a)$ for $s, s' \in S$ and $a \in A$, and a hand-crafted reward feature basis $\phi$. A stochastic regular structure on the reward function assumes that the weights $W$ of the reward feature functions $\phi$ are random variables with a reward conditional probability distribution $\mathcal{D}_E(W \mid \zeta_E)$ conditional on the expert demonstrations $\zeta_E$. Parametrizing $\mathcal{D}_E(W \mid \zeta_E)$ with parameter $\Theta^E$, our aim is to estimate the best-fitting parameter $\Theta^E$ from the expert demonstrations $\zeta_E$, such that $\mathcal{D}_E(W \mid \zeta_E)$ is more likely to generate weights that compose reward functions like the ones derived from the expert demonstrations. This is called the stochastic inverse reinforcement learning problem.
In practice, expert demonstrations can be observed but often lack sampling representativeness. For example, one driver's demonstrations encode his own preferences in driving style but may not reflect the true rewards of an environment. To overcome this limitation, we introduce a representative trajectory class $\mathcal{C}_\epsilon^E$ such that each trajectory element set $O \in \mathcal{C}_\epsilon^E$ is a subset of the expert demonstrations $\zeta_E$ with cardinality at least $\epsilon \cdot |\zeta_E|$, where $\epsilon$ is a preset threshold and $|\zeta_E|$ is the number of expert demonstrations. It is written as $\mathcal{C}_\epsilon^E = \{O \mid O \subset \zeta_E,\ |O| \geq \epsilon \cdot |\zeta_E|\}$.
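Drawing a trajectory element set from the representative trajectory class can be sketched as follows. This is an illustrative sampler under stated assumptions, not the authors' implementation: the function name and arguments are hypothetical, and subset sizes are weighted by binomial coefficients so that every admissible subset is equally likely.

```python
import math
import random


def sample_trajectory_element_set(demonstrations, epsilon, rng):
    """Draw one subset of the demonstrations with at least epsilon * N elements.

    Sizes are weighted by the number of subsets of that size, so the draw is
    uniform over all admissible subsets.
    """
    n = len(demonstrations)
    # Small slack guards against floating-point error in epsilon * n.
    min_size = max(0, math.ceil(epsilon * n - 1e-9))
    sizes = list(range(min_size, n + 1))
    size_weights = [math.comb(n, k) for k in sizes]
    size = rng.choices(sizes, weights=size_weights, k=1)[0]
    return rng.sample(demonstrations, size)


# Hypothetical usage: 200 demonstrations identified by index, threshold 0.95.
demonstrations = list(range(200))
rng = random.Random(0)
subset = sample_trajectory_element_set(demonstrations, 0.95, rng)
print(len(subset))
```

For very large demonstration sets the binomial weights can exceed floating-point range, in which case log-weights or a different sampling scheme would be needed.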
We integrate out the unobserved weights $W$, and the SIRL problem is then formulated as estimating parameter $\Theta$ in an expectation optimization problem over the representative trajectory class as follows:
$$\max_{\Theta}\ \mathbb{E}_{O \in \mathcal{C}_\epsilon^E}\left[\int_W f_M(O, W \mid \Theta)\, dW\right], \tag{1}$$
where the trajectory element set $O$ is assumed to be uniformly distributed for the sake of simplicity in this study, although in practice its distribution is usually known from a rough estimate of the statistics of the expert demonstrations, and $f_M$ is the conditional joint probability density function of the trajectory element set $O$ and the weights $W$ of the reward feature functions, conditional on parameter $\Theta$.
In the following section, we propose a novel approach to estimate the best-fitting parameter $\Theta$ in Equation 1, which is called the two-stage hierarchical method, a variant of the MCEM method.
3.1 Two-stage Hierarchical Method
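The two-stage hierarchical method requires us to write parameter $\Theta$ in a profile form $\Theta = (\Theta_1, \Theta_2)$. The conditional joint density $f_M(O, W \mid \Theta)$ in Equation 1 can be written as the product of two conditional densities $g_M$ and $h_M$ as follows:
$$f_M(O, W \mid \Theta_1, \Theta_2) = g_M(O \mid W, \Theta_1) \cdot h_M(W \mid \Theta_2). \tag{2}$$
Taking the log of both sides of Equation 2, we have
$$\log f_M(O, W \mid \Theta_1, \Theta_2) = \log g_M(O \mid W, \Theta_1) + \log h_M(W \mid \Theta_2). \tag{3}$$
We optimize the right side of Equation 3 over the profile parameter $\Theta$ in the expectation-maximization (EM) update steps at the $t$-th iteration independently as follows,
$$\Theta_1^{t+1} = \arg\max_{\Theta_1} \mathbb{E}\left[\log g_M(O \mid W, \Theta_1) \mid \mathcal{C}_\epsilon^E, \Theta^t\right], \tag{4}$$
$$\Theta_2^{t+1} = \arg\max_{\Theta_2} \mathbb{E}\left[\log h_M(W \mid \Theta_2) \mid \mathcal{C}_\epsilon^E, \Theta^t\right]. \tag{5}$$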
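We randomly initialize the profile parameter $\Theta^0 = (\Theta_1^0, \Theta_2^0)$ and sample a collection of $m_0$ reward weights $\{W_i^0\}_{i=1}^{m_0}$. The reward weights $W_i^0$ compose the reward function in each learning task for $i = 1, \dots, m_0$.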
3.1.2 First Stage
In the first stage, we aim to update parameter $\Theta_1$ for the intractable expectation in Equation 4 in each iteration. Specifically, we take a Monte Carlo method to estimate the model parameters through an empirical expectation at the $t$-th iteration,
$$\Theta_1^{t+1} = \arg\max_{\Theta_1} \frac{1}{m_t} \sum_{i=1}^{m_t} \log g_M(O_i^t \mid W_i^t, \Theta_1), \tag{6}$$
where the reward weights $W_i^t$ at the $t$-th iteration are randomly drawn from the reward conditional probability distribution and compose a set of learning tasks, each with a trajectory element set $O_i^t$ uniformly drawn from the representative trajectory class $\mathcal{C}_\epsilon^E$, for $i = 1, \dots, m_t$.
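The parameter $\Theta_1^{t+1}$ in Equation 6 has $m_t$ coordinates, written as $\Theta_1^{t+1} = \left((\Theta_1^{t+1})_1, \dots, (\Theta_1^{t+1})_{m_t}\right)$. For each learning task $i$, the $i$-th coordinate $(\Theta_1^{t+1})_i$ is derived from maximum a posteriori estimation, the same trick as in MaxEnt and DeepMaxEnt [6; 7], as follows:
$$(\Theta_1^{t+1})_i = \arg\max_{\theta} \log g_M(O_i^t \mid W_i^t, \theta), \tag{7}$$
which is a convex formulation optimized by gradient ascent.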
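In practice, we move $k$ steps uphill toward the optimum in each learning task $i$. The update formula of the $k$-step reward weights $W_i^{(k)}$ is written as
$$W_i^{(k)} = W_i^{(k-1)} + \lambda_t \nabla_{\theta} \log g_M(O_i^t \mid W_i^t, \theta)\Big|_{\theta = W_i^{(k-1)}}, \tag{8}$$
where the learning rate $\lambda_t$ at the $t$-th iteration is preset. Hence, the parameter $\Theta_1^{t+1}$ in practice is represented as $\left(W_1^{(k)}, \dots, W_{m_t}^{(k)}\right)$.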
3.1.3 Second Stage
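In the second stage, we aim to update parameter $\Theta_2$ for the intractable expectation in Equation 5 in each iteration. Specifically, we consider the empirical expectation at the $t$-th iteration as follows,
$$\Theta_2^{t+1} = \arg\max_{\Theta_2} \frac{1}{m_t} \sum_{i=1}^{m_t} \log h_M\left(W_i^{(k)} \mid \Theta_2\right), \tag{9}$$
where $h_M$ is implicit, but fitting the set of $k$-step reward weights $\{W_i^{(k)}\}_{i=1}^{m_t}$ with a generative model yields a large empirical expectation value. The reward conditional probability distribution $\mathcal{D}_E(W \mid \zeta_E)$ is the generative model, formulated in practice as a Gaussian mixture model (GMM), i.e.
$$\mathcal{D}_E(W \mid \zeta_E, \Theta_2) = \sum_{j=1}^{K} \alpha_j\, \mathcal{N}(W \mid \mu_j, \Sigma_j),$$
with $\alpha_j \geq 0$, $\sum_{j=1}^{K} \alpha_j = 1$, and parameter set $\Theta_2 = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^{K}$.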
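We estimate parameter $\Theta_2^{t+1}$ of the GMM via an EM approach and initialize the GMM with the $t$-th iteration parameter $\Theta_2^{t}$. The EM procedures are given as follows: for $j = 1, \dots, K$,
Expectation Step: Compute the responsibility $r_{ij}$ for the $k$-step reward weight $W_i^{(k)}$,
$$r_{ij} = \frac{\alpha_j\, \mathcal{N}\left(W_i^{(k)} \mid \mu_j, \Sigma_j\right)}{\sum_{l=1}^{K} \alpha_l\, \mathcal{N}\left(W_i^{(k)} \mid \mu_l, \Sigma_l\right)}.$$
Maximization Step: Compute the weighted mean $\mu_j$, the weighted covariance $\Sigma_j$, and the mixing weight $\alpha_j$,
$$\mu_j = \frac{\sum_{i=1}^{m_t} r_{ij} W_i^{(k)}}{\sum_{i=1}^{m_t} r_{ij}}, \quad \Sigma_j = \frac{\sum_{i=1}^{m_t} r_{ij} \left(W_i^{(k)} - \mu_j\right)\left(W_i^{(k)} - \mu_j\right)^\top}{\sum_{i=1}^{m_t} r_{ij}}, \quad \alpha_j = \frac{1}{m_t} \sum_{i=1}^{m_t} r_{ij}.$$
After EM converges, the GMM parameter at this iteration is $\Theta_2^{t+1} = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^{K}$, and the profile parameter is $\Theta^{t+1} = (\Theta_1^{t+1}, \Theta_2^{t+1})$.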
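The expectation and maximization steps for a Gaussian mixture can be sketched in NumPy as follows. This is a generic textbook EM step, not the authors' code: the function names, the small covariance regularizer `reg`, and the synthetic two-cluster data standing in for sampled reward weights are all assumptions made for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of N(mean, cov) evaluated at each row of x."""
    d = mean.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(exponent) / norm


def em_step(samples, alphas, means, covs, reg=1e-6):
    """One EM iteration for a GMM fitted to the rows of `samples`."""
    n, d = samples.shape
    k = alphas.shape[0]
    # Expectation step: responsibility of component j for sample i.
    resp = np.stack(
        [alphas[j] * gaussian_pdf(samples, means[j], covs[j]) for j in range(k)],
        axis=1,
    )
    resp /= resp.sum(axis=1, keepdims=True)
    # Maximization step: weighted means, covariances, and mixing weights.
    mass = resp.sum(axis=0)
    new_alphas = mass / n
    new_means = (resp.T @ samples) / mass[:, None]
    new_covs = np.empty((k, d, d))
    for j in range(k):
        diff = samples - new_means[j]
        new_covs[j] = (resp[:, j, None] * diff).T @ diff / mass[j] + reg * np.eye(d)
    return new_alphas, new_means, new_covs, resp


# Synthetic stand-in for sampled reward weights: two clusters in 2-D.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(-2.0, 0.5, size=(100, 2)),
    rng.normal(2.0, 0.5, size=(100, 2)),
])
alphas = np.array([0.5, 0.5])
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
for _ in range(20):
    alphas, means, covs, resp = em_step(samples, alphas, means, covs)
print(alphas, means)
```

Warm-starting each call from the previous parameters, as the loop does here, mirrors the idea of initializing the GMM with the parameter from the previous iteration.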
Finally, when the two-stage hierarchical method converges, our desired best-fitting parameter $\Theta^E$ in $\mathcal{D}_E(W \mid \zeta_E)$ is parameter $\Theta_2$ in the profile parameter $\Theta = (\Theta_1, \Theta_2)$.
3.2 Termination Criteria
In this section, we discuss the termination criteria of our algorithm. An ordinary EM algorithm usually terminates when the parameters do not change substantively after enough iterations. For example, one classic termination criterion in the EM algorithm terminates at the $t$-th iteration satisfying,
$$\max_i \frac{\left|\theta_i^{t+1} - \theta_i^{t}\right|}{\left|\theta_i^{t}\right| + \epsilon_1} < \epsilon_2,$$
for user-specified $\epsilon_1$ and $\epsilon_2$, where $\theta$ is the model parameter in the EM algorithm.
However, such a termination criterion in MCEM risks terminating early because of the Monte Carlo error in the update step. Hence, we adopt a practical method in which the following similar stopping criterion must hold three consecutive times,
$$\max_i \frac{\left|\Theta_i^{t+1} - \Theta_i^{t}\right|}{\left|\Theta_i^{t}\right| + \epsilon_1} < \epsilon_2.$$
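A stopping rule that must hold on several consecutive iterations can be sketched as below. The relative-change form of the criterion and the names `ConsecutiveStopper`, `eps1`, `eps2`, and `patience` are assumptions for illustration, since the exact expression is not fixed by the surrounding text.

```python
import numpy as np


class ConsecutiveStopper:
    """Signal termination once a relative-change test passes `patience` times in a row."""

    def __init__(self, eps1=1e-3, eps2=1e-3, patience=3):
        self.eps1 = eps1
        self.eps2 = eps2
        self.patience = patience
        self.streak = 0

    def update(self, old_params, new_params):
        old = np.asarray(old_params, dtype=float)
        new = np.asarray(new_params, dtype=float)
        relative_change = np.max(np.abs(new - old) / (np.abs(old) + self.eps1))
        # A single noisy jump resets the streak, guarding against Monte Carlo error.
        self.streak = self.streak + 1 if relative_change < self.eps2 else 0
        return self.streak >= self.patience


# Hypothetical parameter trace with one large jump in the middle.
trace = [1.0, 1.0001, 1.5, 1.5001, 1.5002, 1.5003]
stopper = ConsecutiveStopper()
decisions = [stopper.update([a], [b]) for a, b in zip(trace, trace[1:])]
print(decisions)
```

In this trace the first small change does not trigger termination on its own, and the jump to 1.5 resets the count, so the rule fires only after three small changes in a row.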
3.3 Convergence Issue
The convergence issue of MCEM is more complicated than that of ordinary EM. Because the MDPR is model-based and interactive, we can always increase the sample size of MCEM per iteration. In practice, we require the Monte Carlo sample size $m_t$ per iteration to satisfy the following inequality,
$$\sum_{t} \frac{1}{m_t} < \infty.$$
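As a worked check, suppose the requirement is the summability condition $\sum_t 1/m_t < \infty$ that is common in the MCEM literature (an assumption about the exact form of the inequality). A schedule that starts from 10 samples and doubles every iteration, as used in the evaluation, then satisfies it through a geometric series:

```latex
m_t = 10 \cdot 2^{t}, \qquad
\sum_{t=0}^{\infty} \frac{1}{m_t}
  = \frac{1}{10} \sum_{t=0}^{\infty} \frac{1}{2^{t}}
  = \frac{1}{10} \cdot 2
  = \frac{1}{5} < \infty .
```

A constant sample size $m_t = m$ would not satisfy the same condition, since $\sum_t 1/m$ diverges.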
A pseudocode of our approach is given in Algorithm 1.
We evaluate our approach on the objectworld environment, which is particularly challenging because of its large number of irrelevant features and its highly nonlinear reward functions. We employ the expected value difference (EVD) as the metric of optimality:
$$\mathrm{EVD}(W) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi^*\right] - \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi(W)\right],$$
which measures the difference between the expected reward earned under the optimal policy $\pi^*$, given by the true reward, and under the policy $\pi(W)$ derived from rewards sampled from our reward conditional probability distribution $\mathcal{D}_E(W \mid \zeta_E)$. Notice that we use $\Theta^E$ to denote the best estimated parameter in our approach.
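The EVD computation can be sketched for a small tabular MDP as follows. This is a minimal illustration assuming state-based rewards, a transition tensor indexed as `transition[a, s, s_next]`, and a uniform start distribution; the function names and the three-state chain are hypothetical.

```python
import numpy as np


def greedy_policy(transition, reward, gamma, tol=1e-8):
    """Value iteration; returns a deterministic greedy policy of shape (S,)."""
    values = np.zeros(reward.shape[0])
    while True:
        q = reward[None, :] + gamma * (transition @ values)   # shape (A, S)
        new_values = q.max(axis=0)
        if np.max(np.abs(new_values - values)) < tol:
            return q.argmax(axis=0)
        values = new_values


def policy_value(transition, reward, gamma, policy):
    """Exact value of a deterministic policy by solving (I - gamma P) v = r."""
    n = reward.shape[0]
    p_pi = transition[policy, np.arange(n), :]
    return np.linalg.solve(np.eye(n) - gamma * p_pi, reward)


def expected_value_difference(transition, true_reward, learned_reward, gamma, start):
    """Value lost under the true reward when acting optimally for the learned reward."""
    pi_star = greedy_policy(transition, true_reward, gamma)
    pi_hat = greedy_policy(transition, learned_reward, gamma)
    v_star = policy_value(transition, true_reward, gamma, pi_star)
    v_hat = policy_value(transition, true_reward, gamma, pi_hat)
    return float(start @ (v_star - v_hat))


# Hypothetical 3-state chain with actions 0 (left) and 1 (right).
transition = np.zeros((2, 3, 3))
for s in range(3):
    transition[0, s, max(s - 1, 0)] = 1.0
    transition[1, s, min(s + 1, 2)] = 1.0
true_reward = np.array([0.0, 0.0, 1.0])
start = np.full(3, 1.0 / 3.0)
evd_same = expected_value_difference(transition, true_reward, true_reward, 0.9, start)
evd_wrong = expected_value_difference(
    transition, true_reward, np.array([1.0, 0.0, 0.0]), 0.9, start
)
print(evd_same, evd_wrong)
```

Both policies are evaluated under the true reward, so a learned reward that induces the same optimal policy has zero EVD even if its numeric values differ.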
The objectworld is an IRL environment proposed by Levine et al., which is an $N \times N$ grid board with colored objects placed in randomly selected cells. Each colored object is assigned one inner color and one outer color from $C$ preselected colors. Each cell on the grid board is a state, and the five actions are stepping to one of the four neighboring cells (up, down, left, right) or staying in place (stay), each with a 30% chance of moving in a random direction.
The ground truth of the reward function is defined in the following way. Suppose two primary colors among the preselected colors are red and blue. The reward of a state is 1 if the state is within 3 steps of an outer red object and within 2 steps of an outer blue object, -1 if the state is within 3 steps of an outer red object but not within 2 steps of an outer blue object, and 0 otherwise. The other pairs of inner and outer colors are distractors. Continuous and discrete versions of the feature basis functions are provided. For the continuous version, $\phi(s)$ is a $2C$-dimensional real-valued feature vector. Each dimension records the Euclidean distance from the state to objects. For example, the first and second coordinates are the distances to the nearest inner and outer red object respectively, and so on through all $C$ colors. For the discrete version, $\phi(s)$ is a $2CN$-dimensional binary feature vector. Each $N$-dimensional block records a binary representation of the distance to the nearest object of a given inner or outer color, with the $d$-th coordinate equal to 1 if the corresponding continuous distance is less than $d$.
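The reward rule and the two feature encodings can be sketched as follows. The sketch assumes that "within n steps" is measured with the same Euclidean distance as the continuous features, which is a simplification; the object tuples `(x, y, inner_color, outer_color)` and all function names are hypothetical.

```python
import math


def nearest_distance(state, positions):
    """Euclidean distance from `state` to the closest position (inf if none)."""
    return min((math.dist(state, p) for p in positions), default=math.inf)


def continuous_features(state, objects, colors):
    """For each color: distance to nearest inner-colored, then outer-colored object."""
    features = []
    for color in colors:
        inner = [(x, y) for x, y, inner_c, _ in objects if inner_c == color]
        outer = [(x, y) for x, y, _, outer_c in objects if outer_c == color]
        features.append(nearest_distance(state, inner))
        features.append(nearest_distance(state, outer))
    return features


def discretize(distance, grid_size):
    """Binary encoding: the d-th coordinate (d = 1..grid_size) is 1 if distance < d."""
    return [1 if distance < d else 0 for d in range(1, grid_size + 1)]


def true_reward(dist_outer_red, dist_outer_blue):
    """Ground-truth reward from the distances to the nearest outer red/blue objects."""
    near_red = dist_outer_red <= 3
    near_blue = dist_outer_blue <= 2
    if near_red and near_blue:
        return 1.0
    if near_red:
        return -1.0
    return 0.0


# Hypothetical layout: one object with outer color red, one with outer color blue.
objects = [(3, 0, "blue", "red"), (0, 1, "red", "blue")]
colors = ["red", "blue"]
feats = continuous_features((0, 0), objects, colors)
rewards = []
for state in [(0, 0), (5, 0), (9, 9)]:
    f = continuous_features(state, objects, colors)
    rewards.append(true_reward(f[1], f[3]))   # outer red, outer blue
print(feats, rewards, discretize(2.5, 5))
```

Only the two outer-color distances for red and blue enter the reward, so the remaining feature dimensions act as distractors for the learner.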
4.2 Evaluation Procedure and Analysis
In this section, we design several tasks to evaluate our generative model, the reward conditional probability distribution $\mathcal{D}_E(W \mid \zeta_E)$. For each task, the environment setting is as follows. The instance of objectworld has 25 random objects with 2 colors and a discount factor of 0.9. For the recovery, 200 expert demonstrations are generated according to the true optimal policy. The trajectory length of each expert demonstration is 5 × grid size. We evaluate four algorithms: MaxEnt, DeepMaxEnt, SIRL, and DSIRL. SIRL and DSIRL are implemented as in Algorithm 1 under the assumption of a linear and a nonlinear structure of reward functions respectively, i.e. the weights drawn from the reward conditional probability distribution compose the coefficients in a linear or nonlinear combination of feature functions.
In our evaluation, SIRL and DSIRL start from 10 samples and double the sample size per iteration until they converge. In the first stage, the number of epochs per algorithm iteration is set to 20 and the learning rate is 0.01. The threshold $\epsilon$ in the representative trajectory class $\mathcal{C}_\epsilon^E$ is preset to 0.95. In the second stage, the GMM in both SIRL and DSIRL has three components with at most 1000 iterations before convergence. Additionally, the neural networks for DeepMaxEnt and DSIRL are both implemented as a 3-layer fully-connected architecture with the sigmoid function as the activation function.
4.2.1 Evaluation Platform
4.2.2 Recovery Experiment
In this experiment, we compare the ground truth of the reward function, the optimal policy, and the optimal value with the ones derived from the four algorithms on objectworld. For SIRL and DSIRL, the mean of the reward conditional probability distribution is used as the comparison object. In Figure 1, we notice that the mean of DSIRL performs better than DeepMaxEnt, and the mean of SIRL performs better than MaxEnt, because of the Monte Carlo mechanism in our algorithm, which acts as a global search approach. Both MaxEnt and DeepMaxEnt are very likely to get trapped in a secondary maximum. Because the ground truth is highly nonlinear, the mean of DSIRL beats the mean of SIRL in this task. Our generalized perspective can be regarded as an average over all outcomes, which tends to imitate the true optimal value from finite expert demonstrations.
4.2.3 Generative Experiment
In this experiment, we evaluate the generative ability of the reward conditional probability distribution, which generates more than one reward function as a solution to the IRL problem. Each optimal value derived from a drawn reward has a small EVD value compared with the true original reward. We design a generative algorithm to capture robust solutions. The pseudocode of the generative algorithm is given in Algorithm 2, and the experimental results are shown in Figure 2.
In the generative algorithm, we use the Frobenius norm to measure the distance between weights (matrices) drawn from the reward conditional probability distribution, given by
$$\|W\|_F = \sqrt{\mathrm{Tr}\left(W^\top W\right)}.$$
Each drawn weight $W$ in the solution set $G$ should satisfy
$$\|W - W'\|_F > \delta \quad \text{and} \quad \mathrm{EVD}(W) < \eta,$$
where $W'$ represents any other member of the solution set $G$, and $\delta$ and $\eta$ are the preset thresholds in the generative algorithm.
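One way to build such a solution set is a simple accept/reject loop, sketched below. It is not a transcription of Algorithm 2, which is not reproduced in the text: it assumes a drawn weight is kept when its EVD is below a threshold and its Frobenius distance to every member already kept exceeds another threshold. The callables `draw_weight` and `evd_of`, the threshold names, and the toy EVD proxy are hypothetical.

```python
import numpy as np


def frobenius_distance(a, b):
    """Frobenius norm of the difference of two equally shaped arrays."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))


def collect_solutions(draw_weight, evd_of, min_distance, max_evd, num_draws):
    """Keep drawn weights that have low EVD and are mutually well separated."""
    solutions = []
    for _ in range(num_draws):
        weight = draw_weight()
        if evd_of(weight) >= max_evd:
            continue                      # not a good enough solution
        if all(frobenius_distance(weight, other) > min_distance for other in solutions):
            solutions.append(weight)      # distinct from everything kept so far
    return solutions


# Hypothetical draws and a toy stand-in for the EVD of a weight matrix.
candidates = iter([
    np.array([[0.0, 0.0]]),
    np.array([[0.05, 0.0]]),
    np.array([[1.0, 0.0]]),
    np.array([[5.0, 5.0]]),
])
solutions = collect_solutions(
    draw_weight=lambda: next(candidates),
    evd_of=lambda w: 0.1 * float(np.abs(w).sum()),
    min_distance=0.1,
    max_evd=0.5,
    num_draws=4,
)
print(len(solutions))
```

In practice `draw_weight` would sample from the fitted mixture model and `evd_of` would run the planner, which makes the EVD check the expensive part of the loop.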
4.2.4 Hyperparameter Experiment
In this experiment, we evaluate the effectiveness of our approach under varying quantities and qualities of expert demonstrations. The amount of information carried in the expert demonstrations composes a specific learning environment, and hence has an impact on the effectiveness of our generative model. Due to the page limit, we only verify three hyperparameters on objectworld: the number of expert demonstrations in Figure 5, the trajectory length of expert demonstrations in Figure 5, and the portion size in the representative trajectory class in Figure 5. The shaded region around each line in the figures represents the standard error over the experimental trials. Notice that the EVDs for SIRL and DSIRL both decrease as the number of expert demonstrations, their trajectory length, and the portion size in the representative trajectory class increase.
In this paper, we propose a generalized perspective on the IRL problem called the stochastic inverse reinforcement learning problem. We formulate it as an expectation optimization problem and adopt the MCEM method to give the first solution to it. The solution to SIRL gives a generative model that produces more than one reward function for the original IRL problem, making it possible to analyze and display a highly nonlinear IRL problem from a global viewpoint. The experimental results demonstrate the recovery and generative ability of the generative model under the comparison metric EVD. We then show the effectiveness of our model under the influence of a set of hyperparameters of the expert demonstrations.
-  Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
-  Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
-  Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
-  Pieter Abbeel, Dmitri Dolgov, Andrew Y Ng, and Sebastian Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1083–1090. IEEE, 2008.
-  Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
-  Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
-  Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
-  Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.
-  Greg CG Wei and Martin A Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990.
-  Henrik Kretzschmar, Markus Spies, Christoph Sprunk, and Wolfram Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307, 2016.
-  Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in neural information processing systems, pages 1–8, 2007.
-  Dizan Vasquez, Billy Okal, and Kai O Arras. Inverse reinforcement learning algorithms and features for robot navigation in crowds: an experimental comparison. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1341–1346. IEEE, 2014.
-  William Kruskal and Frederick Mosteller. Representative sampling, iii: The current statistical literature. International Statistical Review/Revue Internationale de Statistique, pages 245–265, 1979.
-  James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1):265–285, 1999.
-  Brian S Caffo, Wolfgang Jank, and Galin L Jones. Ascent-based Monte Carlo expectation–maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251, 2005.
-  KS Chan and Johannes Ledolter. Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association, 90(429):242–252, 1995.
-  Gersende Fort, Eric Moulines, et al. Convergence of the Monte Carlo expectation maximization for curved exponential families. The Annals of Statistics, 31(4):1220–1259, 2003.
-  Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.