1 Introduction
The problem of inverse reinforcement learning (IRL) is to infer the latent reward function that an agent is optimizing, given observations of its demonstrations or trajectories in a task. It has been successfully applied in scientific inquiries, e.g., animal and human behavior modeling (Ng et al., 2000), as well as practical challenges, e.g., navigation (Ratliff et al., 2006; Abbeel et al., 2008; Ziebart et al., 2008) and intelligent building controls (Barrett and Linder, 2015). By learning the reward function, which provides the most succinct and transferable definition of a task, IRL has enabled advancing the state of the art in robotic domains (Abbeel and Ng, 2004; Kolter et al., 2007).
Previous IRL algorithms treat the underlying reward as a linear (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Syed and Schapire, 2007; Ratliff et al., 2009) or nonparametric (Levine et al., 2010; 2011) function of the state features. The main formulations within the linear category include maximum margin (Ratliff et al., 2006), which presupposes that the optimal reward function leads to a maximal difference in expected reward between the demonstrated and random strategies, and feature expectation matching (Abbeel and Ng, 2004; Syed et al., 2008), based on the observation that matching the feature expectation of a policy to that of the expert suffices to guarantee similar performance. The reward function can also be regarded as the parameters of the policy class, such that the likelihood of observing the demonstrations is maximized under the true reward function, e.g., the maximum entropy approach (Ziebart et al., 2008).
As the representational power is limited by the linearity assumption, nonlinear formulations (Levine et al., 2010) were proposed to learn a set of composite features based on logical conjunctions. Nonparametric methods, pioneered by (Levine et al., 2011) based on Gaussian processes (GPs) (Rasmussen, 2006), greatly enlarge the function space of latent rewards to allow for nonlinearity, and have been shown to achieve state-of-the-art performance on benchmark tests, e.g., object world and simulated highway driving (Abbeel and Ng, 2004; Syed and Schapire, 2007; Levine et al., 2010; 2011). Nevertheless, the heavy reliance on predefined or handcrafted features becomes a bottleneck for existing methods, especially when the complexity or essence of the reward cannot be captured by the given features. Finding such features automatically from data would be highly desirable.
In this paper, we propose an approach which performs feature and inverse reinforcement learning simultaneously and coherently within the same model by incorporating deep learning. The success of deep learning in a wide range of domains has drawn the community’s attention to its structural advantages that can improve learning in complicated scenarios, e.g.,
Mnih et al. (2013) recently achieved a deep reinforcement learning (RL) breakthrough. Nevertheless, most deep models require massive data to be properly trained and can become impractical for IRL. Deep Gaussian processes (deep GPs) (Damianou and Lawrence, 2013; Damianou, 2015), on the other hand, not only learn abstract structures from smaller data sets, but also retain the nonparametric properties that Levine et al. (2011) demonstrated to be important for IRL. A deep GP is a deep belief network comprising a hierarchy of latent variables with Gaussian process mappings between the layers. Analogously to how gradients are propagated through a standard neural network, deep GPs propagate uncertainty through Bayesian learning of latent posteriors. This is a useful property for approaches involving stochastic decision making, and it also guards against overfitting by allowing for noisy features. However, previous methodologies for approximate Bayesian learning of deep GPs
(Damianou and Lawrence, 2013; Hensman and Lawrence, 2014; Bui et al., 2015; Mattos et al., 2016) fail when diverging from the simple case of fixed output data modeled through a Gaussian regression model. In particular, in the IRL setting, the reward (output) is only revealed through the demonstrations, which are guided by the policy obtained from reinforcement learning (Damianou, 2015). The main contributions of our paper are as follows.

We extend the deep GP framework to the IRL domain (Fig. 1), allowing for learning latent rewards with more complex structures from limited data.

We derive a variational lower bound on the marginal log likelihood using an innovative definition of the variational distributions. This methodological contribution enables Bayesian learning in our model and can be applied to other scenarios where the observation layer’s dynamics cause similar intractabilities.

We compare the proposed deep GP for IRL with existing approaches in benchmark tests as well as newly defined tests with more challenging rewards.
In the following, we review the problem of inverse reinforcement learning (IRL). The list of notations used throughout the paper is summarized in Table 1.
Table 1: Summary of notation.
- $t$: time step
- $i$: index of demonstrations
- $s$: state; $a$: action
- $\gamma$: RL discount factor, $\gamma \in (0, 1)$
- $\mathbf{r}$: reward vector for all states
- $r(\cdot)$: reward function for states
- $Q(s, a)$: Q-value function
- $q(\cdot)$: variational distribution
- $V(s)$: state value function
- $\pi$: policy; $\hat\pi$, $\pi^*$: the corresponding estimated and optimal versions
- $X$: $D$-dimensional feature matrix for the $N$ discrete states, $X \in \mathbb{R}^{N \times D}$
- $x_n$: features for the $n$th state; $x_{:,d}$: the $d$th feature for all states
- $k_\theta(\cdot, \cdot)$: covariance function parametrized by $\theta$
- $K$: covariance matrix, $K_{n,n'} = k_\theta(x_n, x_{n'})$
- $H$: latent state matrix with $J$-dimensional features, $H \in \mathbb{R}^{N \times J}$
- $\hat H$: $H$ with Gaussian noise, $\hat H = H + \epsilon$
- $Z_U$, $Z_V$: inducing inputs
- $U$, $V$: inducing outputs for the lower and top layers, respectively
- tilde: mean of the corresponding variational distribution, e.g., $\tilde U$
1.1 Inverse Reinforcement Learning
The Markov Decision Process (MDP) is characterized by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, r)$, which represents the state space, action space, transition model, discount factor, and reward function, respectively. Take robot navigation as an example. The goal is to travel to a target spot while avoiding stairwells. The state describes the current location and heading. The robot can choose among the actions of going forward or backward and turning left or right. The transition model specifies $\mathcal{T}(s' \mid s, a)$, i.e., the probability of reaching the next state given the current state and action, which accounts for the kinematic dynamics. The reward is +1 if the robot achieves the goal, −1 if it ends up in a stairwell, and 0 otherwise. The discount factor, $\gamma$, is a positive number less than 1, e.g., 0.9, which discounts future rewards. The optimal policy is then given by maximizing the expected discounted reward, i.e.,
$$\pi^* = \arg\max_{\pi} \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t) \,\Big|\, \pi\Big]. \qquad (1)$$
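As a concrete illustration (separate from the navigation example above), the optimum in (1) can be computed exactly in the tabular case by value iteration; the MDP below is a toy stand-in with randomly generated transition probabilities:

```python
import numpy as np

# Toy tabular MDP: n_states states, n_actions actions.
# All numbers here are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 16, 4, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = np.zeros(n_states)
r[-1], r[0] = 1.0, -1.0  # goal reward +1, "stairwell" penalty -1

def value_iteration(P, r, gamma, tol=1e-8):
    """Compute V* and a greedy optimal policy for reward vector r."""
    V = np.zeros(len(r))
    while True:
        Q = r[:, None] + gamma * P @ V          # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, pi_star = value_iteration(P, r, gamma)
```

The returned `pi_star` is a deterministic greedy policy; the fixed point satisfies the Bellman optimality equation up to the tolerance.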
The IRL task is to find the reward function such that the induced optimal policy matches the demonstrations, given the reward-free MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma)$ and the demonstrations $\mathcal{D}$, where each demonstration is a trajectory consisting of state-action pairs. Under the linearity assumption, the feature representation of states forms the linear basis of the reward, $r(s) = w^\top \phi(s)$, where $\phi$ is the $D$-dimensional mapping from a state to its feature vector. From this definition, the expected reward for policy $\pi$ is given by:
$$\mathbb{E}[r \mid \pi] = w^\top \mu(\pi), \quad \mu(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big],$$
where $\mu(\pi)$ is the feature expectation for policy $\pi$. The reward parameter $w$ is learned such that
$$w^\top \mu(\pi^*) \;\ge\; w^\top \mu(\pi) \quad \text{for all } \pi, \qquad (2)$$
a prevalent idea that appears in maximum margin planning (MMP) (Ratliff et al., 2006) and feature expectation matching (Syed and Schapire, 2007).
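The quantity both formulations rely on, the feature expectation, can be estimated empirically from demonstration trajectories by discounted averaging of state features; the trajectories and indicator features below are toy assumptions:

```python
import numpy as np

# Empirical estimate of mu(pi) = E[sum_t gamma^t phi(s_t)] from sampled
# trajectories. Shapes and data here are illustrative assumptions.
def feature_expectation(trajectories, phi, gamma):
    """trajectories: list of state-index sequences; phi: (n_states, d) feature matrix."""
    mu = np.zeros(phi.shape[1])
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi[s]
    return mu / len(trajectories)

phi = np.eye(4)                      # indicator features for 4 states
trajs = [[0, 1, 2], [0, 2, 3]]       # two demonstration trajectories
mu_demo = feature_expectation(trajs, phi, gamma=0.9)
# Under a linear reward r = phi @ w, the expert's expected reward is mu_demo @ w.
```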
Motivated by the perspective of the expected reward parametrizing the policy class, the maximum entropy (MaxEnt) model (Ziebart et al., 2008) considers a stochastic decision model, where the optimal policy randomly chooses actions according to the associated reward:
$$\pi(a \mid s) = \frac{\exp\big(Q(s, a)\big)}{\sum_{a'} \exp\big(Q(s, a')\big)}, \qquad (3)$$
where $Q(s, a)$ follows the Bellman equation; $V(s)$ and $Q(s, a)$ measure how desirable the corresponding state and state-action pair are under rewards $\mathbf{r}$. In principle, for a given state $s$, the best action corresponds to the highest Q-value, which represents the "desirability" of the action. Assuming independence among the state-action pairs in the demonstrations, the likelihood of the demonstrations corresponds to the joint probability of taking a sequence of actions under the visited states according to the Bellman equation:
$$P(\mathcal{D} \mid \mathbf{r}) = \prod_{i} \prod_{t} \pi(a_t^i \mid s_t^i) = \prod_{i} \prod_{t} \frac{\exp\big(Q(s_t^i, a_t^i)\big)}{\sum_{a'} \exp\big(Q(s_t^i, a')\big)}. \qquad (4)$$
Though directly optimizing the above criterion with respect to $\mathbf{r}$ is possible, it does not lead to generalized solutions transferrable to a new test case where no demonstrations are available; hence, we need a "model" of $\mathbf{r}$. MaxEnt assumes a linear structure for rewards, while GPIRL (Levine et al., 2011) uses GPs to relate the states to rewards. In Section 2.1, we give a brief overview of GPs and GPIRL.
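The stochastic policy of (3) and the demonstration likelihood of (4) can be sketched as follows, with a toy Q-table standing in for the Bellman solution:

```python
import numpy as np

# MaxEnt-style stochastic policy: pi(a|s) proportional to exp(Q(s, a)), and the
# demonstration log-likelihood under independence of state-action pairs.
# The Q-table below is a toy assumption, not the solution of a real task.
def maxent_policy(Q):
    Z = np.exp(Q - Q.max(axis=1, keepdims=True))   # stabilized softmax over actions
    return Z / Z.sum(axis=1, keepdims=True)

def demo_log_likelihood(Q, demos):
    """demos: iterable of (state, action) index pairs."""
    pi = maxent_policy(Q)
    return sum(np.log(pi[s, a]) for s, a in demos)

Q = np.array([[1.0, 0.0],
              [0.0, 2.0]])
ll = demo_log_likelihood(Q, [(0, 0), (1, 1)])
```

Actions with higher Q-value receive higher probability, but every action keeps nonzero mass, which is what makes the likelihood of imperfect demonstrations well defined.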
2 The Model
In this section, we start by discussing reward modeling through Gaussian processes (GPs) following (Levine et al., 2011), proceed to incorporate the representation learning module as additional GP layers, then develop our variational training framework, and, finally, describe transfer between tasks.
2.1 Gaussian Process Reward Modeling
We consider the setup of discretizing the world into $N$ states. Assume that the observed state-action pairs (demonstrations) $\mathcal{D}$ are generated from a set of $D$-dimensional state features $X$ through the reward function $r$. Throughout this paper we denote points (rows of a matrix) as $x_n$, features (columns) as $x_{:,d}$, and single elements as $x_{n,d}$.
In this modeling framework, the reward function plays the role of an unknown mapping, thus we wish to treat it as latent and keep it flexible and nonlinear. Therefore, we follow (Levine et al., 2011) and model it with a zero-mean GP prior (Rasmussen, 2006): $r \sim \mathcal{GP}\big(0, k_\theta(\cdot, \cdot)\big)$, where $k_\theta$ denotes the covariance function, e.g., the RBF kernel $k_\theta(x, x') = \sigma^2 \exp\big(-\|x - x'\|^2 / (2\ell^2)\big)$ with $\theta = \{\sigma, \ell\}$. Given a finite amount of data, this induces the probability $P(\mathbf{r} \mid X, \theta) = \mathcal{N}(\mathbf{r}; \mathbf{0}, K_{XX})$, where the covariance matrix $K_{XX}$ is obtained by $(K_{XX})_{n,n'} = k_\theta(x_n, x_{n'})$. The GPIRL training objective comes from integrating out the latent reward:
$$P(\mathcal{D} \mid X, \theta) = \int P(\mathcal{D} \mid \mathbf{r})\, P(\mathbf{r} \mid X, \theta)\, d\mathbf{r}, \qquad (5)$$
and maximizing over $\theta$, which we drop from our expressions from now on.
The above integral is intractable, because $P(\mathcal{D} \mid \mathbf{r})$ has the complicated expression of (4) (in contrast to traditional GP regression, where the likelihood is a Gaussian or other simple distribution). This can be alleviated using the approximation of (Levine et al., 2011). We describe this approximation in the next section, as it is also used by our approach. Notice that all latent function instantiations are linked through a joint multivariate Gaussian. Thus, the prediction of the function value $r^*$ at a test input $x^*$ is found through the conditional
$$P(r^* \mid x^*, \mathbf{r}, X) = \mathcal{N}\big(r^*;\; K_{x^*X} K_{XX}^{-1} \mathbf{r},\; K_{x^*x^*} - K_{x^*X} K_{XX}^{-1} K_{Xx^*}\big).$$
As can be seen, the prediction is reliant on the effectiveness of the feature representation: states with features that are close in Euclidean distance are assumed to be associated with similar rewards. This motivates our novel deep GPIRL method, which is obtained by considering additional layers, as we describe next.
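A minimal sketch of this GP reward prediction, using the standard GP regression conditional with an RBF kernel (the hyperparameters and data are illustrative assumptions):

```python
import numpy as np

# GP posterior mean/variance for rewards at test states (standard GP
# regression formulae; a small jitter stabilizes the linear solves).
def rbf(A, B, ell=1.0, sf2=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, r, X_star, jitter=1e-6):
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X_star, X)
    mean = Ks @ np.linalg.solve(K, r)
    cov = rbf(X_star, X_star) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([[0.0], [1.0], [2.0]])   # toy state features
r = np.array([0.0, 1.0, 0.0])         # toy rewards at those states
mean, cov = gp_predict(X, r, np.array([[1.0]]))
```

Predicting at a training input recovers (up to jitter) the training reward, which makes concrete how nearby features imply nearby predicted rewards.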
2.2 Incorporating the Representation Learning Layers
The traditional model-based IRL approach is to learn the latent reward (operating on fixed state features $X$) that best explains the demonstrations $\mathcal{D}$. In this paper we additionally and simultaneously wish to uncover a highly descriptive state feature representation. To achieve this, we introduce a latent state feature representation $H$. $H$ constitutes the instantiations of an introduced function which is learned as a nonlinear GP transformation from $X$. To account for noise, we further introduce $\hat H$ as the noisy version of $H$, i.e., $\hat H = H + \epsilon$, with Gaussian noise $\epsilon$.
Importantly, rather than performing two separate steps of learning (for the GPs on $X$ and on $\hat H$), we nest them into a single objective function, to maintain the flow of information during optimization. This results in a deep GP whose top layers perform representation learning and whose lower layers perform model-based IRL. Fig. 1 outlines our model, Deep Gaussian Process for Inverse Reinforcement Learning (DGPIRL). Using $h_{:,j}$ to represent the $j$th column of $H$, and similarly for $\hat H$, the full generative model is written as follows:
$$P(\mathcal{D}, \mathbf{r}, \hat H, H \mid X) = P(\mathcal{D} \mid \mathbf{r})\, P(\mathbf{r} \mid \hat H)\, P(\hat H \mid H)\, P(H \mid X), \qquad (6)$$
where the IRL term $P(\mathcal{D} \mid \mathbf{r})$ takes the form of (4), as suggested by Ziebart et al. (2008). $P(\mathbf{r} \mid \hat H)$ and $P(H \mid X)$ involve the covariance matrices of each layer, constructed with the covariance functions of the lower and top layers, respectively. Compared to GPIRL, the proposed framework gains substantial flexibility by introducing the abstract representation of states in the hidden layers. Note that the model in Fig. 1 can be extended in depth by introducing additional hidden layers connected with additional GP mappings; it is only for simplicity of illustration that we base our derivation on the two-layered structure.
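To make the two-layer structure concrete, the sketch below forward-samples from a chain of the same shape: a GP layer maps state features to a latent representation, Gaussian noise is added, and a second GP layer maps the noisy representation to latent rewards. The kernel, noise level, and inputs are illustrative assumptions, not the trained model:

```python
import numpy as np

# Forward sample from a two-layer GP chain: X -> H (GP) -> noisy H -> r (GP).
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def sample_two_layer(X, latent_dim=2, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    K1 = rbf(X, X) + 1e-6 * np.eye(n)                        # top-layer kernel
    H = rng.multivariate_normal(np.zeros(n), K1, size=latent_dim).T  # (n, latent_dim)
    H_noisy = H + noise * rng.normal(size=H.shape)           # Gaussian-corrupted layer
    K2 = rbf(H_noisy, H_noisy) + 1e-6 * np.eye(n)            # lower-layer kernel
    r = rng.multivariate_normal(np.zeros(n), K2)             # latent rewards
    return H_noisy, r

X = np.linspace(0.0, 1.0, 10)[:, None]
H, r = sample_two_layer(X)
```

Because the lower-layer kernel is evaluated on the sampled latent features rather than on $X$, states end up correlated in reward space according to the learned (here, sampled) representation.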
We can compress the statistical power of each generative layer into a set of auxiliary variables within a sparse GP framework (Snelson and Ghahramani, 2005). Specifically, we introduce inducing outputs and inputs, denoted by $U$ and $Z_U$ respectively for the lower layer, and by $V$ and $Z_V$ for the top layer (as in Fig. 1). The inducing outputs and inputs are related through the same GP prior appearing in each layer; for example, $P(U \mid Z_U) = \mathcal{N}(U; \mathbf{0}, K_{Z_U Z_U})$. By relating the original and inducing variables through the conditional Gaussian distribution, the auxiliary variables are learned to be sufficient statistics of the GP. The augmented model, shown in Fig. 1, has the following definition: (7)
where we adopt the Fully Independent Training Conditional (FITC) approximation to preserve the exact variances in the top layer, and the Deterministic Training Conditional (DTC) (Quiñonero-Candela and Rasmussen, 2005) in the reward layer, as in GPIRL, to facilitate the integration of $\mathbf{r}$ in the training objective (see the next section). In the following, we omit the inducing inputs in the conditioning sets, following the convention of treating them as model parameters (Damianou and Lawrence, 2013; Damianou, 2015; Kandemir, 2015b). By selecting $M \ll N$ inducing points, the complexity reduces from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$. While DGPIRL resolves the case where the outputs have complex dependencies on the latent layers, training the model with variational inference requires gradients with respect to the parameters, as in backpropagation, whose convergence can be improved by leveraging advancements in deep learning.
Additionally, in DGPIRL the role of the auxiliary variables goes beyond introducing scalability. Indeed, as we shall see next, they play a distinct role in our model by forming the basis of a variational framework for Bayesian inference.
2.3 Variational Inference
We wish to optimize the model evidence for training.
However, this quantity is intractable: firstly, because the latent variables appear nonlinearly in the inverses of the covariance matrices; secondly, because the latent rewards relate to the observations through the reinforcement learning layer. The choice of DTC in (7) does not completely solve this problem, because in DGPIRL there is additional uncertainty propagated by the latent layers. This indicates that the Laplace approximation is not practical, and neither is the variational method employed for deep GPs (Damianou and Lawrence, 2013; Kandemir, 2015a), where the output is related to the latent variable through a simple regression framework.
To this end, we show that we can derive an analytic lower bound on the model evidence by constructing a variational framework using the following special form of variational distribution:
(8)  
(9)  
(10)  
(11) 
The delta distribution is equivalent to taking the mean of the corresponding normal distribution for prediction, which is reasonable in the context of reinforcement learning (Levine et al., 2011). Also note that the delta distribution is applied only in the bottom layer and not repeatedly; therefore, representation learning is indeed manifested in the latent layers. In addition, the variational factor over the latent layer conditioned on its inducing outputs matches the exact conditional, so that these two terms cancel in the fraction of (13) and the number of variational parameters is minimized, as in (Titsias and Lawrence, 2010). The distributions over the inducing outputs are chosen as delta distributions so that, combined with the IRL term, (12) becomes tractable and information can flow through the latent layers.
The variational marginal $q(H)$ is factorized across its dimensions with fully parameterized normal densities, as in (Damianou and Lawrence, 2013). Notice that $\tilde U$ and $\tilde V$ are the means of the inducing outputs (Table 1), corresponding to the pseudo-inputs $Z_U$ and $Z_V$, where $Z_V$ is initialized with random numbers from uniform distributions (Sutskever et al., 2013) and learned to further maximize the marginal likelihood, and $Z_U$ is chosen as a subset of the latent points. The variational means of $H$ can be augmented with the input data $X$ to improve stability during training (Duvenaud et al., 2014).
The variational lower bound on the log evidence follows from Jensen's inequality and can be derived analytically thanks to the choice of variational distribution (see Section 2 of the Appendix for details):
(12)  
(13)  
(14) 
where
(15)  
(16)  
(17) 
where $|\cdot|$ denotes the matrix determinant; one term is associated with RL, another is the Gaussian prior on the inducing outputs, and the Kullback-Leibler (KL) divergence from the variational posterior to the prior acts as a regularization term. The lower bound can be optimized with gradient-based methods, with the gradients computed by backpropagation. In addition, we can find optimal fixed-point equations for the variational distribution parameters using variational calculus, in order to raise the variational lower bound further (refer to Section 3 of the supplement for this derivation).
Notice that the approximate marginalization of all hidden spaces in (14) approximates a Bayesian training procedure, whereby model complexity is automatically balanced through the Bayesian Occam's razor principle. Optimizing the objective turns the variational distribution into an approximation of the true model posterior.
2.4 Transfer to New Tasks
The inducing points provide a succinct summary of the data, by the property of FITC (Quiñonero-Candela and Rasmussen, 2005), which means that only the inducing points are necessary for prediction. Given a set of new states $X^*$, DGPIRL can infer the latent rewards $\mathbf{r}^*$ through the full Bayesian treatment:
(18)
Given that the above integral is computationally intensive to evaluate, a practical alternative, adopted in our implementation, is to use point estimates for the latent variables; hence, the rewards are given by:
(19)
where the latent representation of the new states is first estimated through the top-layer posterior mean. The above formulae suggest that, instead of making inference directly from the observed layer as in Levine et al. (2011), DGPIRL first estimates the latent representation of the states and then performs GP regression using these latent variables.
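This two-stage point-estimate prediction can be sketched as follows: first map new states to the latent space with a top-layer GP posterior mean, then run GP regression from latent features to rewards. The data, kernel, and the stand-in latent features below are toy assumptions:

```python
import numpy as np

# Two-stage prediction: (1) latent representation of new states from the
# top-layer GP mean; (2) GP regression from latent space to rewards.
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def transfer_predict(X, H, r, X_new, jitter=1e-6):
    Kxx = rbf(X, X) + jitter * np.eye(len(X))
    H_new = rbf(X_new, X) @ np.linalg.solve(Kxx, H)   # stage 1: latent features
    Khh = rbf(H, H) + jitter * np.eye(len(H))
    return rbf(H_new, H) @ np.linalg.solve(Khh, r)    # stage 2: rewards

X = np.random.default_rng(1).normal(size=(5, 3))      # observed state features
H = X[:, :2]                                          # stand-in latent features
r = np.array([0.0, 1.0, 0.0, -1.0, 0.0])              # toy learned rewards
r_new = transfer_predict(X, H, r, X[:2])              # reuse two states as "new"
```

Reusing training states as the "new" states shows the consistency of the two stages: their latent features, and hence their rewards, are recovered up to the jitter.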
3 Experiments
For the experimental evaluation, we employ the expected value difference (EVD) as a metric of optimality, given by:
$$\mathrm{EVD} = \mathbb{E}\Big[\sum_t \gamma^t r(s_t) \,\Big|\, \pi^*\Big] - \mathbb{E}\Big[\sum_t \gamma^t r(s_t) \,\Big|\, \hat\pi\Big], \qquad (20)$$
which is the difference between the expected reward earned under the optimal policy $\pi^*$, derived from the true rewards, and that earned under the policy $\hat\pi$ derived from the IRL rewards. Our software implementation is included in the supplementary material.
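The EVD metric can be sketched for a toy tabular MDP by evaluating both policies exactly under the true reward; the MDP and the uniform start distribution below are illustrative assumptions:

```python
import numpy as np

# EVD: evaluate both policies under the TRUE reward and compare their
# expected returns. The tabular MDP here is a toy assumption.
def policy_value(P, r, pi, gamma):
    """Exact value of a deterministic policy pi (array of action indices)."""
    Ppi = P[np.arange(len(r)), pi]                     # state transitions under pi
    return np.linalg.solve(np.eye(len(r)) - gamma * Ppi, r)

def evd(P, r_true, pi_opt, pi_irl, gamma):
    start = np.ones(len(r_true)) / len(r_true)         # uniform start distribution
    return start @ (policy_value(P, r_true, pi_opt, gamma)
                    - policy_value(P, r_true, pi_irl, gamma))

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))             # P[s, a, s']
r_true = np.array([0.0, 0.0, 0.0, 1.0])
gap = evd(P, r_true, np.zeros(4, dtype=int), np.ones(4, dtype=int), gamma=0.9)
```

When the two policies coincide the EVD is exactly zero; a perfect IRL reward recovery therefore drives the metric to zero.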
3.1 Object World Benchmark
The Object World (OW) benchmark, originally implemented by Levine et al. (2011), is a gridworld where dots of primary colors, e.g., red and blue, and secondary colors, e.g., purple and green, are placed on the grid at random, as shown in Fig. 2. Each state, i.e., grid block, is described by its shortest distances to the dots of each color group. The latent reward is assigned as follows: if a block is within 1 step of a red dot and within 3 steps of a blue dot, the reward is +1; if it is within 3 steps of a blue dot only, the reward is −1; and the reward is 0 otherwise. The agent maximizes its expected discounted reward by following a policy which provides the probabilities of the actions (moving up/down/left/right, or staying still) at each state, subject to the transition probabilities.
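The OW reward rule can be sketched directly from the description above; the grid size and dot positions are arbitrary assumptions, and Manhattan distance is assumed for "steps":

```python
import numpy as np

# Object World latent reward: +1 within 1 step of a red dot AND within 3 of a
# blue dot, -1 within 3 of a blue dot only, 0 otherwise (Manhattan distance
# assumed; grid size and dot placements are illustrative).
def ow_reward(n, red, blue):
    """red, blue: lists of (row, col) dot positions on an n x n grid."""
    def dmin(dots, i, j):
        return min(abs(i - a) + abs(j - b) for a, b in dots)
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            near_red = dmin(red, i, j) <= 1
            near_blue = dmin(blue, i, j) <= 3
            R[i, j] = 1.0 if (near_red and near_blue) else (-1.0 if near_blue else 0.0)
    return R

R = ow_reward(8, red=[(2, 2)], blue=[(2, 4)])
```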
The objective of the experiment is to compare the performance of DGPIRL with previous methods as the number of demonstrations varies. The evaluated candidates include Multiplicative Weights for Apprenticeship Learning (MWAL) (Syed and Schapire, 2007), MaxEnt, and MMP, which assume a linear reward function, and GPIRL, the state-of-the-art method on the benchmark. Linear models, as shown in Fig. 2, cannot capture the complex structure, while GPIRL learns more accurate yet still noisy rewards, limited by feature discriminability; DGPIRL, by contrast, makes inferences closest to the ground truth even with limited data, thanks to its increased representational power and robust training through variational inference.
Additionally, a transferability test is carried out by examining the EVD in a new world where no demonstrations are available, which requires transferring knowledge from the previous learning scenario. DGPIRL outperforms GPIRL and the other models in both the training and transfer cases, and the improvement grows as more data become accessible (Fig. 3). The shaded areas (Figs. 3, 6, and 7(a)) are the standard deviations of the EVD across independent experiments, which reflect that DGPIRL and GPIRL are more reliable.
3.2 Binary World Benchmark
Though the rewards in OW are nonlinear functions of the features, they form separated clusters in the subspace spanned by two dimensions, i.e., the distances to the nearest red and blue dots. Binary World (BW) is a benchmark introduced by Wulfmeier et al. (2015) whose reward depends on the combinatorics of the features. More specifically, in a grid world where each block is randomly assigned either a blue or a red dot, a state is associated with a +1 reward if there are exactly 4 blue dots in its neighborhood, −1 for 5 blue dots, and 0 otherwise. The feature vector represents the colors of the 9 dots in the neighborhood. BW sets up a challenging scenario, where states that are maximally separated in feature space can have the same rewards, yet those that are close in Euclidean distance may have opposite rewards.
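The BW reward rule can be sketched directly from the description above (grid contents are toy inputs; the neighborhood is taken as the 3x3 patch around each block):

```python
import numpy as np

# Binary World reward: +1 if exactly 4 blue dots in the 3x3 neighborhood,
# -1 for exactly 5, 0 otherwise. Grid contents below are toy inputs.
def bw_reward(blue_mask):
    """blue_mask: boolean (n, n) array, True where the dot is blue."""
    n = blue_mask.shape[0]
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            patch = blue_mask[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            count = int(patch.sum())
            R[i, j] = 1.0 if count == 4 else (-1.0 if count == 5 else 0.0)
    return R

rng = np.random.default_rng(0)
R = bw_reward(rng.random((6, 6)) < 0.5)
```

Note how flipping a single neighborhood dot moves a state between the +1, −1, and 0 classes, which is what makes the reward a combinatorial, highly nonlinear function of the raw features.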
The task is to learn the latent rewards of the states given limited demonstrations, as shown in Fig. 4. While linear models are limited by their representational capacity, the results of GPIRL also deviate from the latent rewards, as it cannot generalize from the training data with such convoluted features. DGPIRL, nevertheless, is able to recover the ground truth with the highest fidelity.
By successively warping the original feature space through the latent layers, DGPIRL can learn an abstract representation that reveals the reward structure. As illustrated in Fig. 5, though the points are mixed in the input space, making it impossible to separate those with the same rewards, their positions in the latent space clearly form clusters, which indicates that DGPIRL has uncovered the mechanism of reward generation simply by observing traces of actions.
The advantage of simultaneous representation and inverse reinforcement learning is demonstrated in Fig. 6, which plots the EVD for the training and transfer cases. As the features are interlinked, not only with the reward but also with one another, in a very nonlinear way, this scenario is particularly challenging for linear models such as LEARCH (Ratliff et al., 2009), MaxEnt, and MMP. While both GPIRL and DGPIRL perform satisfactorily in the training case, DGPIRL significantly outperforms GPIRL in the transfer task, which indicates that the learned latent space is transferrable across different scenarios.
3.3 Highway Driving Behaviors
Highway driving behavior modeling, as investigated by Levine et al. (2010; 2011), is a concrete example for examining the capacity of IRL algorithms to learn underlying motives from demonstrations, based on a simple simulator. On a three-lane highway, vehicles of specific classes (civilian or police) and categories (car or motorcycle) are positioned at random, all driving at the same constant speed. The robot car can switch lanes and navigate at up to three times the traffic speed. The state is described by continuous features consisting of the closest distances to vehicles of each class and category in the same lane, as well as in the left, right, and any lane, both in front of and behind the robot car, in addition to the current speed and position.
The goal is to navigate the robot car as fast as possible while avoiding speeding: the speed should be no greater than twice the current traffic speed whenever a police car is within 2 car lengths. As the reward is a nonlinear function of the current speed and the distance to the police, linear models are outperformed by GPIRL and DGPIRL. Performance generally improves with more demonstrations, and DGPIRL consistently yields the policy closest to the optimal in EVD, with the minimal probability of speeding, as illustrated in Fig. 7(a) and 7(b), respectively.
4 Conclusion and Future Work
DGPIRL is proposed as a solution to IRL based on deep GPs. By extending the structure with additional GP layers that learn an abstract representation of the state features, DGPIRL has more representational capability than linear and GP-based methods. Meanwhile, the Bayesian training approach through variational inference guards against overfitting, bringing the advantages of automatic capacity control and principled uncertainty handling.
Our proposed DGPIRL outperforms state-of-the-art methods in our experiments and is shown to learn efficiently even from few demonstrations. For future work, the unique properties of DGPIRL enable easy incorporation of side knowledge into IRL (through priors on the latent space), and our work also opens the way for combining deep GPs with other complicated inference engines, e.g., selective attention models (Gregor et al., 2015). We plan to investigate these ideas in the future. Another promising direction is to construct transferable models where the latent layer is relied upon for knowledge sharing. Finally, we plan to investigate some of the many applications where DGPIRL can prove especially beneficial, such as intelligent building and grid controls (Jin et al., 2017a; b) and human-in-the-loop gamification (Ratliff et al., 2014).
References

Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
Abbeel et al. (2008) Pieter Abbeel, Dmitri Dolgov, Andrew Y Ng, and Sebastian Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, pages 1083–1090. IEEE, 2008.
Barrett and Linder (2015) Enda Barrett and Stephen Linder. Autonomous HVAC control, a reinforcement learning approach. In Machine Learning and Knowledge Discovery in Databases, pages 3–19. Springer, 2015.
Bui et al. (2015) Thang D Bui, José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, and Richard E Turner. Training deep Gaussian processes using stochastic expectation propagation and probabilistic backpropagation. arXiv preprint arXiv:1511.03405, 2015.
Damianou (2015) Andreas Damianou. Deep Gaussian processes and variational propagation of uncertainty. PhD thesis, University of Sheffield, 2015.

Damianou and Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 207–215, 2013.
Duvenaud et al. (2014) David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, 2014.
Gregor et al. (2015) Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
Hensman and Lawrence (2014) James Hensman and Neil D Lawrence. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
Jin et al. (2017a) Ming Jin, Wei Feng, Ping Liu, Chris Marnay, and Costas Spanos. Microgrid optimal dispatch with demand response. Applied Energy, 187:758–776, 2017a.
Jin et al. (2017b) Ming Jin, Ruoxi Jia, and Costas Spanos. Virtual occupancy sensing: Using smart meters to indicate your presence. IEEE Transactions on Mobile Computing, 99:1, 2017b.

Kandemir (2015a) Melih Kandemir. Asymmetric transfer learning with deep Gaussian processes. 2015a.
Kandemir (2015b) Melih Kandemir. Asymmetric transfer learning with deep Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 730–738, 2015b.
 Kolter et al. (2007) J Zico Kolter, Pieter Abbeel, and Andrew Y Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in Neural Information Processing Systems, pages 769–776, 2007.
 Levine et al. (2010) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1342–1350, 2010.
 Levine et al. (2011) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
 Mattos et al. (2016) César Lincoln C Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A Barreto, and Neil D Lawrence. Recurrent Gaussian processes. International Conference on Learning Representations (ICLR), 2016.
Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Ng et al. (2000) Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
Quiñonero-Candela and Rasmussen (2005) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
Rasmussen (2006) Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
 Ratliff et al. (2014) Lillian J Ratliff, Ming Jin, Ioannis C Konstantakopoulos, Costas Spanos, and S Shankar Sastry. Social game for building energy efficiency: Incentive design. In 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014, pages 1011–1018, 2014.
 Ratliff et al. (2006) Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.

Ratliff et al. (2009)
Nathan D Ratliff, David Silver, and J Andrew Bagnell.
Learning to search: Functional gradient techniques for imitation learning.
Autonomous Robots, 27(1):25–53, 2009.  Snelson and Ghahramani (2005) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudoinputs. In Advances in neural information processing systems, pages 1257–1264, 2005.
 Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 1139–1147, 2013.
Syed and Schapire (2007) Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, volume 20, pages 1–8, 2007.
Syed et al. (2008) Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pages 1032–1039. ACM, 2008.
Titsias and Lawrence (2010) Michalis K Titsias and Neil D Lawrence. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.
 Wulfmeier et al. (2015) Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.