1 Introduction
The problem of extracting the reward function of a task from observed optimal behavior has been studied in parallel in both robotics and economics. In robotics this literature is collected under the heading "Inverse Reinforcement Learning" (IRL) [Ng and Russell 2000; Abbeel and Ng 2004]. The aim here is to learn a reward function that best explains demonstrations of expert behavior, so that a robotic system can reproduce expert-like behavior. In economics it is referred to as "structural econometrics" [Miller 1984; Pakes 1986; Rust 1987] and is used to help economists better understand human decision making. Although both fields developed in parallel, they are similar in that both seek to uncover a latent reward function of an underlying Markov Decision Process (MDP).
One of the main challenges in IRL is the large computational complexity of current state-of-the-art algorithms [Ziebart et al. 2008; Ratliff, Bagnell, and Zinkevich 2006]. To infer the reward function of the underlying MDP, we need to repeatedly solve this MDP at every step of a reward parameter optimization scheme. The MDP solution, which is characterized by a value function, requires a computationally expensive Dynamic Programming (DP) procedure. Solving this DP step repeatedly makes IRL algorithms computationally prohibitive; thus, recent works have looked at scaling IRL algorithms to large environments [Finn, Levine, and Abbeel 2016; Levine and Koltun 2012].
This problem of large computational complexity has also been studied in economics [Hotz and Miller 1993; Su and Judd 2012; Aguirregabiria and Mira 2002]. Among the many works, Conditional Choice Probability (CCP) estimators [Hotz and Miller 1993] are particularly interesting because of their computational efficiency. CCP estimators use CCP values to estimate the reward function of the MDP. The CCP values give the probability of each action being chosen in a given state and are estimated from expert demonstrations. These estimators are computationally efficient because they avoid the repeated computation of the DP step by using an alternative representation of the MDP's value function.
In this paper we leverage results from [Rust 1987], [Hotz and Miller 1993] and [Magnac and Thesmar 2002] to formulate an estimation routine for the reward function with CCPs that avoids repeated calls to the solver of the full dynamic decision problem. The key insight from [Hotz and Miller 1993] is that differences in current rewards and future values between actions can be calculated from CCPs. This allows us to express future value functions in terms of difference value functions, and therefore in terms of CCPs. Since CCPs are directly observed in the data, we can use this representation to estimate the value function of the MDP at each step of the optimization process without solving the expensive dynamic programming (DP) formulation. This results in an algorithm whose overall computational time is comparable to a single MDP computation of a traditional gradient-based IRL method.
In this work we introduce CCPIRL by incorporating CCPs into IRL. We test the CCPIRL algorithm on multiple IRL benchmarks and compare the results to the state-of-the-art IRL algorithm MaxEntIRL [Ziebart et al. 2008]. We show that CCPIRL achieves up to a 5× speedup without affecting the quality of the inferred reward function. We also show that this speedup holds across large state spaces and increases for complex problems, such as problems where value iteration takes much longer to converge.
2 Preliminaries
In this section, we first introduce the MDP formulation as used in the econometrics literature under the name "Dynamic Discrete Choice Model". Following this, we show how the optimality equation is formulated under these assumptions, and how the resulting optimization problems relate to traditional IRL algorithms.
2.1 Dynamic Discrete Choice Model
A dynamic discrete choice (DDC) model (i.e., a discrete Markov decision process with action shocks) is defined as a tuple $(S, A, T, r, \beta)$. We assume a discrete state space, although this is not strictly necessary. $S$ is a countable set of states with cardinality $|S|$, and $A$ is a finite set of actions with cardinality $|A|$. $T$ is the transition function, where $T(s' \mid s, a)$ is the probability of reaching state $s'$ given current state $s$ and action $a$. The reward function is a mapping $r : S \times A \to \mathbb{R}$, and $\beta \in (0,1)$ is the discount factor.
Different from MDPs typically used in RL, each action $a$ also has a "payoff shock" $\varepsilon(a)$ associated with it that enters payoffs additively. Intuitively, the shock variable accounts for the possibility that an agent takes non-optimal behavior due to some factor of the environment or agent that is unobserved by us. The vector of shocks is denoted $\varepsilon = (\varepsilon(1), \ldots, \varepsilon(|A|)) \in \mathbb{R}^{|A|}$. Total rewards for action $a$ in state $s$ are therefore given by:

$u(s,a) = r(s,a) + \varepsilon(a)$   (1)

A shock value is often assumed to be distributed according to a Gumbel or Type 1 Extreme Value (TIEV) distribution, with density

$q(\varepsilon(a)) = \exp(-\varepsilon(a)) \exp\left(-\exp(-\varepsilon(a))\right)$   (2)
We will see that the use of a TIEV distribution is numerically convenient for the following derivations; however, alternative algorithms can be derived for other functional forms. Each shock is independently and identically drawn from the distribution in (2). This ensures that state transitions are conditionally independent: all serial dependence between the shocks at successive time steps is transmitted through the observed state. [Rust 1988] proves the existence of optimal stationary policies in this setting.
2.2 Bellman Optimality Equation Derivation
Consider a system currently in state $(s, \varepsilon)$, where $\varepsilon$ is a vector of shock values. The decision problem is to select the action that maximizes the payoff:

$V(s, \varepsilon) = \max_{a \in A} \left[ r(s,a) + \varepsilon(a) + \beta\, \mathbb{E}\big[ V(s_{t+1}, \varepsilon_{t+1}) \mid s, a \big] \right]$   (3)

where $V$ is the value function, $\beta$ is the discount factor and $\varepsilon_t(a)$ is the shock value for action $a$ at time $t$.
Given the conditional independence assumption of the shock variable described previously, we can separate the integration over next states and future shocks. Define the ex-ante value function (i.e., prior to the revelation of the values of $\varepsilon$) as:

$\bar{V}(s) = \mathbb{E}_{\varepsilon}\big[ V(s, \varepsilon) \big]$   (4)

that is, the expectation of the value function with respect to the shock distribution. Using this notation and conditional independence, we can write the original decision problem as:

$V(s, \varepsilon) = \max_{a \in A} \left[ r(s,a) + \varepsilon(a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right]$
The ex-ante value function also follows a Bellman-like equation:

$\bar{V}(s) = \mathbb{E}_{\varepsilon}\left[ \max_{a \in A} \left( r(s,a) + \varepsilon(a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right) \right]$   (5)
Assuming the TIEV distribution for the shock values, one obtains the following expression for the ex-ante value function, as shown by [Rust 1987]:

$\bar{V}(s) = \gamma + \ln \sum_{a \in A} \exp\left( r(s,a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right)$   (6)

where $\gamma$ is Euler's constant. The expectation of the maximum is equal to a weighted average of the expected value functions conditional on choosing action $a$, with the shock integrated out using the TIEV density; the weights in the average are given by the CCPs of choosing each action $a$.
Notice that the above is exactly the recursive representation of the Maximum Causal Entropy IOC algorithm as derived in Theorem 6.8 of [Ziebart 2010a]. In our setting, the softmax recursion is a consequence of Bellman's optimality principle with a separable stochastic payoff shock following a TIEV distribution, while [Ziebart 2010a] derives the recursion from an information-theoretic perspective that enforces a maximum causal entropy distribution over trajectories.
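Since the expected-maximum identity behind (6) is central to everything that follows, it is easy to verify numerically. The following sketch (our illustration, not from the paper; the values in `v` are arbitrary) checks the TIEV identity with a Monte Carlo estimate:

```python
import numpy as np

# Monte Carlo sanity check of the closed form behind eq. (6): for i.i.d.
# standard Gumbel (TIEV) shocks, E_eps[max_a(v(a) + eps(a))] equals
# gamma + ln(sum_a exp(v(a))), where gamma is Euler's constant.
rng = np.random.default_rng(0)
EULER_GAMMA = 0.5772156649015329

v = np.array([0.5, -0.2, 1.3])               # arbitrary choice-specific values
eps = rng.gumbel(size=(1_000_000, v.size))   # standard Gumbel draws
mc_estimate = (v + eps).max(axis=1).mean()   # empirical expected maximum
closed_form = EULER_GAMMA + np.log(np.exp(v).sum())
print(mc_estimate, closed_form)              # agree up to Monte Carlo error
```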
3 Conditional Choice Probability Inverse Reinforcement Learning
We will now show how it is possible to efficiently recover the optimal value function, and consequently the underlying reward function, using the DDC model. The key insight is that the optimal value function can be directly estimated from the conditional choice probabilities, i.e., the observed state-action pairs aggregated over a large set of expert demonstrations. When these CCP estimates are consistent, the optimal value function can be represented as a linear function of the CCPs and efficiently computed for different parameter values without solving the DP problem iteratively.
3.1 Conditional Choice Probabilities
Since an outside observer does not have access to the shock $\varepsilon$, the underlying deterministic policy of the expert is not directly measurable. However, if we average decisions across trajectories conditioned on the same state variables, we can identify the integrated policy. We denote this integrated policy by $P(a \mid s)$, the conditional choice probability (CCP) of action $a$ being chosen conditioned on state $s$:

$P(a \mid s) = \mathbb{E}_{\varepsilon}\left[ \mathbb{1}\left\{ a = \arg\max_{a'} \left( r(s,a') + \varepsilon(a') + \beta \sum_{s'} T(s' \mid s,a')\, \bar{V}(s') \right) \right\} \right]$   (7)

where $\mathbb{1}\{\cdot\}$ is the indicator function. The event in the indicator function is equivalent to the event:

$r(s,a) + \varepsilon(a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \;\geq\; r(s,a') + \varepsilon(a') + \beta \sum_{s'} T(s' \mid s,a')\, \bar{V}(s') \quad \forall\, a' \in A$   (8)
Expanding the expectation under the TIEV assumption on the shock variable allows the CCPs to be solved in closed form:

$P(a \mid s) = \frac{\exp\left( r(s,a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right)}{\sum_{a'} \exp\left( r(s,a') + \beta \sum_{s'} T(s' \mid s,a')\, \bar{V}(s') \right)}$   (9)
Notice that (9) is identical to the definition of the policy in the MaxEnt formulation of [Ziebart 2010a], which is derived there from an entropic prior over trajectories; here, the CCP arises from integrating out the TIEV shock variable.
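As an illustration, the closed form in (9) is just a row-wise softmax over choice-specific values. A minimal sketch (our own; the values in `v` are made up):

```python
import numpy as np

# Sketch of eq. (9): the CCPs are a softmax over the choice-specific
# values v(s, a) = r(s, a) + beta * sum_s' T(s'|s, a) * V(s').
def ccp_from_values(v):
    """Row-wise softmax; v has shape (num_states, num_actions)."""
    z = np.exp(v - v.max(axis=1, keepdims=True))  # subtract max for stability
    return z / z.sum(axis=1, keepdims=True)

v = np.array([[1.0, 0.0],   # state 0: action 0 looks better
              [0.0, 0.0]])  # state 1: both actions equal
p = ccp_from_values(v)
print(p)  # rows sum to 1; state 1 gets a uniform CCP
```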
4 HotzMiller’s CCP Method
Our aim in inverse reinforcement learning is to find the parameterized reward function $r_\theta$ for the given MDP/R. We now show how we can leverage nonparametric estimates of the conditional choice probabilities to efficiently estimate the parameters $\theta$ of the reward function.
First, we look at an alternative representation of the ex-ante value function, which can be derived from the CCP representation. Using this alternative representation, we will see how we can avoid solving the original MDP with the expensive dynamic programming formulation at every update of $\theta$.
Returning to the definition of the ex-ante value function (4), we know that,

$\bar{V}(s) = \sum_a P(a \mid s) \left( r(s,a) + \mathbb{E}\big[ \varepsilon(a) \mid a \text{ optimal} \big] + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right)$   (10)
[Hotz and Miller 1993] show that if we can get consistent CCP estimates $\hat{P}$ from the data, the above equation (10) can be estimated as,

$\bar{V}(s) = \sum_a \hat{P}(a \mid s) \left( r(s,a) + \mathbb{E}\big[ \varepsilon(a) \mid a \text{ optimal} \big] + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right)$   (11)
Now, defining the expected shock given that action $a$ is optimal as $e(s,a) = \mathbb{E}[\varepsilon(a) \mid a \text{ optimal in } s]$, we can rewrite (11) as,

$\bar{V}(s) = \sum_a \hat{P}(a \mid s) \left( r(s,a) + e(s,a) + \beta \sum_{s'} T(s' \mid s,a)\, \bar{V}(s') \right)$   (12)
It was further shown that $e(s,a)$ depends only on the CCPs and the distribution of $\varepsilon$. [Hotz and Miller 1993] prove that the mapping between CCPs and choice-specific value functions is invertible. Using this inverse mapping and assuming the TIEV distribution for $\varepsilon$, we get $e(s,a) = \gamma - \ln P(a \mid s)$.
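This inversion result can also be checked by simulation. A sketch (our illustration; the choice-specific values are hypothetical):

```python
import numpy as np

# Monte Carlo check of the Hotz-Miller inversion under TIEV shocks:
# E[eps(a) | a optimal] = gamma - ln P(a|s).
rng = np.random.default_rng(1)
EULER_GAMMA = 0.5772156649015329

v = np.array([0.8, 0.0])                  # hypothetical choice-specific values
eps = rng.gumbel(size=(2_000_000, 2))     # standard Gumbel shocks
chosen = (v + eps).argmax(axis=1)         # optimal action per draw
p0 = (chosen == 0).mean()                 # empirical CCP of action 0
e0 = eps[chosen == 0, 0].mean()           # E[eps(0) | action 0 optimal]
print(e0, EULER_GAMMA - np.log(p0))       # the two agree up to MC error
```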
From (12) we can see that, excluding the unknown reward function, all other terms can be estimated from CCPs. We can now stack the ex-ante value function over all states,

$\bar{V} = \sum_a \mathrm{diag}(P(a)) \left( r(a) + e(a) + \beta F(a)\, \bar{V} \right)$   (13)

where $\bar{V} \in \mathbb{R}^{|S|}$ stacks $\bar{V}(s)$ over states; $P(a)$, $r(a)$ and $e(a)$ are the vectors stacking $P(a \mid s)$, $r(s,a)$ and $e(s,a)$ over states; and $F(a)$ is the $|S| \times |S|$ transition matrix for action $a$ with entries $F(a)_{s,s'} = T(s' \mid s, a)$.
Notice that (13) is linear in the ex-ante value function $\bar{V}$, so we can write a closed-form solution for it. First, rearranging the terms we get,

$\left( I - \beta \sum_a \mathrm{diag}(P(a))\, F(a) \right) \bar{V} = \sum_a \mathrm{diag}(P(a)) \left( r(a) + e(a) \right)$   (14)
Define $\mathbf{1}$ as a vector of ones and let $F^P = \sum_a \mathrm{diag}(P(a))\, F(a)$ denote the policy-averaged transition matrix. Substituting $e(a) = \gamma \mathbf{1} - \ln P(a)$, we can now write the closed-form solution for $\bar{V}$ as,

$\bar{V} = \left( I - \beta F^P \right)^{-1} \sum_a \mathrm{diag}(P(a)) \left( r(a) + \gamma \mathbf{1} - \ln P(a) \right)$   (15)
The above is the value function representation used by [Pesendorfer and Schmidt-Dengler 2008] and discussed in [Arcidiacono and Ellickson 2011].
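To make the representation concrete, the following self-contained sketch (our illustration on a small random DDC model; all sizes, rewards and transitions are made up) computes the ex-ante value function twice, once by the soft Bellman recursion of (6) and once by the CCP closed form of (15), and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(2)
EULER_GAMMA = 0.5772156649015329
S, A, beta = 5, 3, 0.9                    # made-up model sizes

T = rng.random((A, S, S))
T /= T.sum(axis=2, keepdims=True)         # T[a, s, s'] = T(s'|s, a)
r = rng.standard_normal((S, A))           # reward r(s, a)

# 1) Soft value iteration, eq. (6): V(s) = gamma + ln sum_a exp(q(s, a)).
V = np.zeros(S)
for _ in range(2000):
    q = r + beta * np.einsum('ast,t->sa', T, V)   # choice-specific values
    m = q.max(axis=1, keepdims=True)
    V = EULER_GAMMA + m[:, 0] + np.log(np.exp(q - m).sum(axis=1))

# 2) CCPs implied by those values, eq. (9): a row-wise softmax.
z = np.exp(q - q.max(axis=1, keepdims=True))
P = z / z.sum(axis=1, keepdims=True)

# 3) Hotz-Miller closed form, eq. (15): V = (I - beta F^P)^{-1} b.
FP = np.einsum('sa,ast->st', P, T)                 # policy-averaged transitions
b = (P * (r + EULER_GAMMA - np.log(P))).sum(axis=1)
V_ccp = np.linalg.solve(np.eye(S) - beta * FP, b)
print(np.abs(V - V_ccp).max())  # close to zero
```

Because the closed form only needs the CCPs and the transition matrices, step 3 involves no value iteration at all.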
We now discuss the CCPIRL algorithm and how it avoids repeatedly solving the original MDP problem. The pseudocode for CCPIRL is given in Algorithm 1, where $\bar{f}$ is the expert's feature expectations and $f$ are the features at every state. Notice that the only quantity in (15) that depends on $\theta$ is the reward vector. Thus, to estimate the ex-ante value function using (15), we recalculate the reward for every value of $\theta$, i.e., at every step of the iteration (Line 6). The inverse matrix in (15), however, is independent of $\theta$ and hence can be precomputed once for all iterations (Line 3). This inverse matrix encodes the discounted state visitation frequency of each state and hence encompasses a large part of the calculations involved in MaxEnt [Ziebart 2010a]. Given this inverse matrix, computing the value function at any $\theta$ requires only simple matrix operations (Line 7), which allows us to avoid solving the MDP by dynamic programming at every step of the iteration. Lines 8–10 calculate the gradient for the reward parameters and are explained in [Kitani et al. 2012].
We also note how to calculate the initial CCP estimates $\hat{P}$. In their simplest form, they can be computed in tabular form directly from $M$ expert trajectories, each with $T$ time periods:

$\hat{P}(a \mid s) = \frac{\sum_{m=1}^{M} \sum_{t=1}^{T} \mathbb{1}\{ s_t^m = s,\; a_t^m = a \}}{\sum_{m=1}^{M} \sum_{t=1}^{T} \mathbb{1}\{ s_t^m = s \}}$

This initial maximum likelihood estimate can be computed by maintaining a table of state-action pair occurrences.
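In code, this tabular estimate is a pair of nested loops over the demonstrations. A minimal sketch (our own; the smoothing constant is our addition to avoid log(0) in later steps):

```python
import numpy as np

# Tabular maximum-likelihood CCP estimate: count state-action occurrences
# across trajectories and normalize per state.
def estimate_ccps(trajectories, n_states, n_actions, smoothing=1e-6):
    counts = np.full((n_states, n_actions), smoothing)
    for traj in trajectories:            # traj: sequence of (state, action)
        for s, a in traj:
            counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

trajs = [[(0, 1), (1, 0)],
         [(0, 1), (1, 1)]]
P_hat = estimate_ccps(trajs, n_states=2, n_actions=2)
print(P_hat)  # state 0 almost always chooses action 1; state 1 is 50/50
```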
4.1 Complexity Analysis
The main computation in CCPIRL is estimating the inverse matrix in (15). In contrast, the main computation in MaxEntIRL is solving the MDP by dynamic programming. However, unlike MaxEntIRL, where we need to repeatedly solve the MDP by dynamic programming, we only need to estimate the inverse matrix once. Once the matrix inverse has been found, evaluating the value function in CCPIRL involves simple matrix computations and hence incurs no significant computational overhead.
Thus, assuming a total of $N$ iterations for the entire MaxEntIRL optimization and $L$ sweeps for each backwards recursion, MaxEntIRL takes a total of $O(N L |S|^2 |A|)$ time [Ziebart 2010a]. For CCPIRL, assuming the matrix inversion can be performed with a state-of-the-art matrix inversion method in $O(|S|^{2.373})$ time, we get a corresponding runtime of $O(|S|^{2.373} + N |S|^2 |A|)$. This complexity can be further reduced for linear reward formulations, since the product of the inverse matrix with the state features can itself be precomputed once.
We also note that for large state spaces we can avoid matrix inversion altogether and instead estimate the inverse matrix in (15) by successive approximations. Defining $M = (I - \beta F^P)^{-1}$, where $F^P$ is shorthand for the policy-averaged transition matrix, we can write $M (I - \beta F^P) = I$. Expanding the product and rearranging, we get $M = I + \beta M F^P$. We can now iterate this last equation to estimate the inverse matrix by successive approximations. From a computational perspective this can be much more efficient than computing the inverse directly. Next, we empirically show the above computational gains of CCPIRL and discuss its expert data requirements.
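The successive-approximation scheme takes only a few lines of code. A sketch (our illustration; `F` is a made-up row-stochastic matrix standing in for the policy-averaged transition matrix):

```python
import numpy as np

# Successive approximation of (I - beta*F)^{-1}: iterating M <- I + beta*F*M
# converges to the inverse (a Neumann series), so no explicit matrix
# inversion is needed.
def approx_inverse(F, beta, iters=500):
    I = np.eye(F.shape[0])
    M = I.copy()
    for _ in range(iters):
        M = I + beta * F @ M
    return M

rng = np.random.default_rng(3)
F = rng.random((4, 4))
F /= F.sum(axis=1, keepdims=True)   # row-stochastic, so beta*F is a contraction
beta = 0.9
M = approx_inverse(F, beta)
print(np.abs(M - np.linalg.inv(np.eye(4) - beta * F)).max())
```

Since the spectral radius of $\beta F$ is below one, the iteration is a contraction and the error shrinks geometrically with the number of sweeps.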
5 Experiments
In this section we empirically validate (1) the computational efficiency of CCPIRL and (2) the underlying assumption of consistent CCP estimates, i.e., we show the data requirements of CCPIRL. To this end we evaluate the performance of CCPIRL on three standard IRL tasks. Since CCPIRL is most closely related to traditional MaxEntIRL [Ziebart et al. 2008], we use it as the baseline method on the benchmark tasks. Previously, both linear [Ziebart et al. 2008; Ziebart 2010b] and nonlinear [Wulfmeier, Ondruska, and Posner 2015] formulations of MaxEntIRL have been used to estimate reward functions; hence, we discuss results for both linear and nonlinear parameterizations of CCPIRL. For the former, we focus on navigation problems in a traditional Gridworld setting with stochastic dynamics, while for the latter we choose the Objectworld task as described in [Levine and Koltun 2013].
For comparative analysis, we use both qualitative and quantitative results. For qualitative analysis, we directly compare visualizations of the inferred reward functions for CCPIRL and MaxEntIRL. For quantitative comparison, we use negative log likelihood (NLL) [Kitani et al. 2012] and expected value difference (EVD) [Levine, Popovic, and Koltun 2011] as the evaluation criteria. NLL is a probabilistic comparison metric that evaluates the likelihood of a path under the predicted policy. For a policy $\pi$ and a set of demonstrated trajectories $D$, NLL is defined as,

$\mathrm{NLL}(\pi) = -\frac{1}{\sum_{\zeta \in D} |\zeta|} \sum_{\zeta \in D} \sum_{(s_t, a_t) \in \zeta} \ln \pi(a_t \mid s_t)$   (16)
As another metric of success, similar to related works [Levine and Koltun 2013; Wulfmeier, Ondruska, and Posner 2015], we use the expected value difference (EVD). EVD measures the difference between the optimal and learned policies by comparing the value functions obtained with each policy under the true reward distribution. Further, to verify the computational improvement of CCPIRL, we record the time taken by each algorithm as well as the number of iterations each needs to converge. We show that our algorithm achieves similar qualitative and quantitative performance with much less computational time.
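A minimal sketch of this NLL computation (our own; the policy and trajectory are illustrative):

```python
import numpy as np

# Average negative log-likelihood of expert state-action pairs under a
# learned stochastic policy pi of shape (num_states, num_actions).
def neg_log_likelihood(trajectories, pi):
    logs = [np.log(pi[s, a]) for traj in trajectories for s, a in traj]
    return -float(np.mean(logs))

pi = np.array([[0.9, 0.1],
               [0.5, 0.5]])
trajs = [[(0, 0), (1, 1)]]
nll = neg_log_likelihood(trajs, pi)
print(nll)
```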
5.1 Gridworld: Evaluating Linear Rewards
We use the Gridworld experiments to show the computational efficiency of CCPIRL under a linear parameterization, since the Gridworld reward function is approximately linear. We test two increasingly difficult Gridworld settings, Fixed Target and Macro Cells (described below), to show that CCPIRL provides a computational advantage on both tasks. For the more complex Macro Cells task, we also show how CCPIRL requires consistent CCP estimates, which in turn depend on the amount of expert demonstrations available.
5.1.1 Fixed Target Gridworld
For our initial experiment we focus on the standard RL task of navigation in a grid world. We show that CCPIRL provides a significant computational advantage compared to MaxEntIRL. Additionally, we show that, like MaxEntIRL, CCPIRL is able to extract the underlying reward function across large state spaces.
In this setting, the agent is required to move to a specific target location while avoiding obstacles. The initial start location is randomly distributed over the grid. The agent gets a large positive reward at the target location and a large negative reward at states with obstacles; all other states yield no reward. The agent can only move in four directions (North, South, East, West), i.e., no diagonal movement is allowed. To make the environment more challenging, we assume a stochastic wind that moves the agent to a random neighboring location with a certain probability. For our feature representation, we use the distance to the target location along with the state of each grid cell, i.e., whether the grid cell contains an obstacle or not.
First, we compare the EVD performance of our proposed CCPIRL algorithm against the MaxEntIRL baseline in Figure 1 (Right). As seen in the plot, both algorithms converge to the expert behavior with a similar amount of data. Hence our proposed CCPIRL algorithm is able to correctly infer the underlying reward distribution.
We now observe the computational gain provided by CCPIRL. The first three rows in Table 1 compare the running times of CCPIRL and MaxEntIRL for increasing state spaces. Notice that CCPIRL is at least 2× faster than MaxEntIRL for both small and large state spaces. This is expected given that we do not use backwards recursion to solve the MDP at every iteration.
Next, we look at the convergence rate of both algorithms. This is important since CCPIRL provides a much larger computational advantage as the number of iterations grows. Figure 1 (Left) shows the NLL values for both algorithms against an increasing number of iterations. Notice that both algorithms converge to similar results within the same number of iterations for different amounts of input trajectories, showing that both have a similar rate of convergence.
We also compare the computation time of each algorithm against the discount factor $\beta$ of the underlying MDP. By varying $\beta$ we vary the complexity of the original MDP, since a large value gives more weight to future actions and thus each solution of the value iteration DP takes longer to converge. Figure 4 shows the computation time for both algorithms against different $\beta$ values. As expected, we see an almost exponential rise in the computation time of MaxEntIRL, while CCPIRL shows a negligible increase, indicating that CCPIRL provides much larger gains on more complex MDP problems.
Table 1: Computation time for MaxEntIRL and CCPIRL on the Gridworld tasks. The first three rows use the Fixed Target setting (no macro cells); the last three use Macro Cells. The speedup column is the ratio of the two timings.

| N | Cell Size | MaxEnt (sec) | CCP (sec) | Speedup |
|-----|-----------|--------------|-----------|---------|
| 32 | – | 584.31 | 270.52 | 2.2× |
| 64 | – | 1812.94 | 552.18 | 3.3× |
| 128 | – | 15062.24 | 3119.20 | 4.8× |
| 32 | 8 | 635.63 | 266.18 | 2.4× |
| 32 | 4 | 584.30 | 283.81 | 2.1× |
| 64 | 8 | 3224.97 | 1024.42 | 3.1× |
(Figure: Results for a gridworld of size 16 with macro cells of size 2. Left: minimum NLL with a varying number of trajectories. Right: expected value difference. For few trajectories, CCPIRL shows much larger variance than MaxEntIRL.)
(Figure: Inferred reward visualizations — True Reward, MaxEnt, CCP.)
5.1.2 Macro Cells
We use the more complex macro-cell Gridworld environment [Abbeel and Ng 2004] to demonstrate how CCPIRL's performance depends on the expert trajectories. As discussed above, CCPIRL requires consistent CCP estimates to recover the reward function; since the CCP estimates are calculated from expert trajectories, we study how CCPIRL's performance varies with the number of expert trajectories available.
In this setting, the grid is divided into non-overlapping square regions (macro cells). Each region contains multiple grid cells, and every cell in a region shares the same reward. Similar to [Abbeel and Ng 2004], for every region we select a positive reward with small probability and a zero reward otherwise. This reward distribution leads to positive rewards in only a few macro cells, which results in interesting policies to be learned and hence requires more precise CCP estimates to match expert behavior.
Since all the cells in the same region share the same reward, our feature representation is a one-hot encoding of the region a cell belongs to; e.g., in a grid world of size 64 with macro cells of size 8 we have 64 regions, and thus each state's feature vector is of size 64. As before, we assume a stochastic wind, and the agent can move only in four directions.

We analyze the performance of both algorithms given different amounts of expert trajectories. Figure 2 compares the NLL and EVD results against an increasing number of expert trajectories. Both algorithms show poor performance given very few trajectories (< 20). However, with a moderate number of trajectories MaxEntIRL approaches expert behavior while CCPIRL is still comparatively worse. Finally, with a sufficiently large number of trajectories (> 60) both algorithms converge to expert behavior. CCPIRL's poor performance with few expert demonstrations reflects its dependence on a sufficient amount of input data: since the CCP estimates are calculated from the input data, CCPIRL needs a relatively larger number of trajectories than MaxEntIRL to obtain consistent CCP estimates.
We also qualitatively compare the rewards inferred by both algorithms given few trajectories in Figure 3. Notice that the darker regions of the true reward are similarly dark for both algorithms; thus, both algorithms are able to infer the general reward distribution. However, MaxEntIRL matches the true reward distribution at a much finer level (it shows less discrepancy from the true reward) and hence matches the underlying policy more closely than CCPIRL. Thus, given few input trajectories, MaxEntIRL performs better than our proposed CCPIRL algorithm.
We verify the computational advantage of CCPIRL across large state spaces in Table 1. As seen before, CCPIRL is at least 2× faster than MaxEntIRL, and its computational efficiency increases for larger state spaces.
Table 2: Computation time (sec) for DeepMaxEntIRL and DeepCCPIRL on Objectworld. The speedup column is the ratio of the two timings.

| Experiment Setting | MaxEnt | CCP | Speedup |
|----------------------|----------|---------|---------|
| Grid size: 16, C = 2 | 1622.63 | 296.43 | 5.5× |
| Grid size: 32, C = 2 | 9115.50 | 1580.22 | 5.8× |
| Grid size: 16, C = 8 | 2535.38 | 545.95 | 4.6× |
| Grid size: 32, C = 8 | 19445.66 | 4799.02 | 4.1× |
(Figure: Inferred reward visualizations — True Reward, MaxEnt, CCP.)
5.2 Objectworld: Evaluating NonLinear Rewards
We now look at CCPIRL's performance when the true reward function is a nonlinear parameterization of the feature vector. For this, we use the Objectworld environment [Levine, Popovic, and Koltun 2011], since its reward function is a nonlinear function of the state features. Similar to related work [Wulfmeier, Ondruska, and Posner 2015], we use a deep neural network (DNN) as the nonlinear function approximator. As before, we verify both (1) the computational advantage provided by CCPIRL (here, DeepCCPIRL) and (2) the data requirement of CCPIRL in this scenario.
The Objectworld environment consists of an $N \times N$ grid of states. At each state the agent can take 5 actions: movement in 4 directions and staying in place. Spread through the grid are randomly placed objects, each with an inner and an outer color chosen from a set of $C$ colors. The reward for each cell (state) is positive if the cell is within distance 3 of color 1 and distance 2 of color 2, negative if only within distance 3 of color 1, and zero in all other cases. For our feature vector we use the continuous values $\{d_i^{\mathrm{in}}, d_i^{\mathrm{out}}\}_{i=1}^{C}$, where $d_i^{\mathrm{in}}$ and $d_i^{\mathrm{out}}$ are the shortest distances from the state to the $i$'th inner and outer color respectively. Since the reward depends on only two colors, the features for the other colors act as distractors.
We use DeepMaxEntIRL [Wulfmeier, Ondruska, and Posner 2015] as the baseline, using a similar deep neural network architecture for both algorithms. Precisely, we use a 2-layer feedforward network with rectified linear units, trained with the Adam optimizer [Kingma and Ba 2014] with a fixed initial learning rate.

We quantitatively analyze the performance of our proposed DeepCCPIRL algorithm. Figure 5 compares the NLL and EVD results for both algorithms. Notice that, as observed before, both algorithms perform poorly given few expert trajectories. However, DeepMaxEntIRL matches expert performance with a moderate number of trajectories, while DeepCCPIRL requires a relatively large number of trajectories. This is expected since CCPIRL requires a larger number of expert trajectories to obtain consistent CCP estimates.
Also, we qualitatively look at the inferred reward to verify how well the DNN is able to approximate the nonlinear reward. Figure 7 plots the inferred rewards against the true reward function. Notice that both algorithms capture the nonlinearities in the underlying reward function and consequently match the expert behavior. Thus, a deep neural network suffices as a nonlinear function approximator for CCPIRL.
We now analyze the computational gain in the nonlinear case. Table 2 shows the computation time for different sized state spaces and different sized feature vectors. Notice that DeepCCPIRL is almost 5× as fast as DeepMaxEntIRL across both small and large state spaces. Thus we see that CCPIRL provides a much larger computational advantage in the nonlinear case, which we believe is because the Objectworld MDP is more complex than the above grid world experiments. Both algorithms therefore require a larger number of iterations to converge, which leads to a large computational increase for DeepMaxEntIRL compared to DeepCCPIRL. This computational increase with a larger number of iterations is also shown in Figure 6 (Right): as the number of iterations increases, our proposed DeepCCPIRL algorithm shows only a minor computational increase compared to DeepMaxEntIRL. Thus, for significantly complex MDP problems that require a large number of iterations, our proposed CCPIRL algorithm should require much less computation time than MaxEntIRL.
6 Conclusion
We have described an alternative framework for inverse reinforcement learning (IRL) problems that avoids value function iteration or backward induction. In IRL problems, the aim is to estimate the reward function from observed trajectories of a Markov decision process (MDP). We first analyze the decision problem and introduce an alternative representation of value functions due to [Hotz and Miller 1993]. This representation allows us to express value functions in terms of objects that are empirically estimable from state-action data, together with the unknown parameters of the reward function. We then show that it is possible to estimate reward functions with few parametric restrictions.
References

[Abbeel and Ng 2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 1. ACM.
 [Aguirregabiria and Mira 2002] Aguirregabiria, V., and Mira, P. 2002. Swapping the nested fixed point algorithm: A class of estimators for discrete Markov decision models. Econometrica 70(4):1519–1543.
 [Arcidiacono and Ellickson2011] Arcidiacono, P., and Ellickson, P. B. 2011. Practical methods for estimation of dynamic discrete choice models. Annu. Rev. Econ. 3(1):363–394.
 [Finn, Levine, and Abbeel2016] Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 49–58.
 [Hotz and Miller1993] Hotz, V. J., and Miller, R. A. 1993. Conditional choice probabilities and the estimation of dynamic models. Review of Economic Studies 60:497–529.
 [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kitani et al. 2012] Kitani, K. M.; Ziebart, B. D.; Bagnell, J. A.; and Hebert, M. 2012. Activity forecasting. In European Conference on Computer Vision, 201–214. Springer.
 [Levine and Koltun 2012] Levine, S., and Koltun, V. 2012. Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
 [Levine and Koltun2013] Levine, S., and Koltun, V. 2013. Guided Policy Search. Proceedings of the 30th International Conference on Machine Learning 28:1–9.
 [Levine, Popovic, and Koltun 2011] Levine, S.; Popovic, Z.; and Koltun, V. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, 19–27.
 [Magnac and Thesmar2002] Magnac, T., and Thesmar, D. 2002. Identifying dynamic discrete decision processes. Econometrica 70:801–816.
 [Miller1984] Miller, R. A. 1984. Job matching and occupational choice. Journal of Political Economy 92(6):1086–1120.
 [Ng and Russell2000] Ng, A., and Russell, S. 2000. Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning 0:663–670.
 [Pakes 1986] Pakes, A. 1986. Patents as options: Some estimates of the value of holding European patent stocks. Econometrica 54(4):755–784.
 [Pesendorfer and Schmidt-Dengler 2008] Pesendorfer, M., and Schmidt-Dengler, P. 2008. Asymptotic least squares estimators for dynamic games. Review of Economic Studies 901–928.
 [Ratliff, Bagnell, and Zinkevich2006] Ratliff, N. D.; Bagnell, J. A.; and Zinkevich, M. A. 2006. Maximum margin planning. Proceedings of the 23rd International Conference on Machine Learning (ICML) 729–736.
 [Rust 1987] Rust, J. 1987. Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica 55:999–1033.
 [Rust1988] Rust, J. 1988. Maximum likelihood estimation of discrete control processes. SIAM Journal on Control and Optimization 26(5):1006–1024.
 [Su and Judd2012] Su, C.L., and Judd, K. L. 2012. Constrained optimization approaches to estimation of structural models. Econometrica 80(5):2213–2230.
 [Wulfmeier, Ondruska, and Posner2015] Wulfmeier, M.; Ondruska, P.; and Posner, I. 2015. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888.
 [Ziebart et al.2008] Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, 1433–1438. Chicago, IL, USA.
 [Ziebart2010a] Ziebart, B. D. 2010a. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Dissertation, Carnegie Mellon University.
 [Ziebart2010b] Ziebart, B. D. 2010b. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.