Inverse Reinforcement Learning via Deep Gaussian Process

Ming Jin, et al. ∙ 12/26/2015

We propose a new approach to inverse reinforcement learning (IRL) based on the deep Gaussian process (deep GP) model, which is capable of learning complicated reward structures from few demonstrations. Our model stacks multiple latent GP layers to learn abstract representations of the state feature space, which are linked to the demonstrations through the maximum entropy learning framework. Incorporating the IRL engine into the nonlinear latent structure renders existing deep GP inference approaches intractable. To tackle this, we develop a non-standard variational approximation framework which extends previous inference schemes. This allows for an approximate Bayesian treatment of the feature space and guards against overfitting. By carrying out representation and inverse reinforcement learning simultaneously within the same model, our approach outperforms state-of-the-art methods, as we demonstrate with experiments on standard benchmarks ("object world", "highway driving") and a new benchmark ("binary world").


1 Introduction

The problem of inverse reinforcement learning (IRL) is to infer the latent reward function that an agent is optimizing by observing its demonstrations, or trajectories, in a task. IRL has been successfully applied in scientific inquiries, e.g., animal and human behavior modeling (Ng et al., 2000), as well as practical challenges, e.g., navigation (Ratliff et al., 2006; Abbeel et al., 2008; Ziebart et al., 2008) and intelligent building controls (Barrett and Linder, 2015). By learning the reward function, which provides the most succinct and transferable definition of a task, IRL has enabled advances in the state of the art in robotic domains (Abbeel and Ng, 2004; Kolter et al., 2007).

Previous IRL algorithms treat the underlying reward as a linear (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Syed and Schapire, 2007; Ratliff et al., 2009) or non-parametric function (Levine et al., 2010; 2011) of the state features. The main formulations within the linearity category include maximum margin (Ratliff et al., 2006), which presupposes that the optimal reward function leads to the maximal difference of expected reward between the demonstrated and random strategies, and feature expectation matching (Abbeel and Ng, 2004; Syed et al., 2008), based on the observation that matching the feature expectation of a policy to that of the expert suffices to guarantee similar performance. The reward function can also be regarded as parametrizing the policy class, such that the likelihood of observing the demonstrations is maximized under the true reward function, e.g., the maximum entropy approach (Ziebart et al., 2008).

As the representation power is limited by the linearity assumption, nonlinear formulations (Levine et al., 2010) have been proposed to learn a set of composite features based on logical conjunctions. Non-parametric methods, pioneered by (Levine et al., 2011) based on Gaussian processes (GPs) (Rasmussen, 2006), greatly enlarge the function space of the latent reward to allow for non-linearity, and have been shown to achieve state-of-the-art performance on benchmark tests, e.g., object world and simulated highway driving (Abbeel and Ng, 2004; Syed and Schapire, 2007; Levine et al., 2010; 2011). Nevertheless, the heavy reliance on predefined or handcrafted features becomes a bottleneck for existing methods, especially when the complexity or essence of the reward cannot be captured by the given features. Finding such features automatically from data would be highly desirable.

In this paper, we propose an approach which performs feature and inverse reinforcement learning simultaneously and coherently within the same model by incorporating deep learning. The success of deep learning in a wide range of domains has drawn the community's attention to its structural advantages, which can improve learning in complicated scenarios; e.g., Mnih et al. (2013) recently achieved a deep reinforcement learning (RL) breakthrough. Nevertheless, most deep models require massive data to be properly trained, which can make them impractical for IRL. Deep Gaussian processes (deep GPs) (Damianou and Lawrence, 2013; Damianou, 2015), on the other hand, not only learn abstract structures from smaller data sets, but also retain the non-parametric properties that Levine et al. (2011) demonstrated to be important for IRL.

A deep GP is a deep belief network comprising a hierarchy of latent variables with Gaussian process mappings between the layers. Analogously to how gradients are propagated through a standard neural network, deep GPs aim at propagating uncertainty through Bayesian learning of latent posteriors. This constitutes a useful property for approaches involving stochastic decision making and also guards against overfitting by allowing for noisy features. However, previous methodologies employed for approximate Bayesian learning of deep GPs (Damianou and Lawrence, 2013; Hensman and Lawrence, 2014; Bui et al., 2015; Mattos et al., 2016) fail when diverging from the simple case of fixed output data modeled through a Gaussian regression model. In particular, in the IRL setting, the reward (output) is only revealed indirectly through the demonstrations, which are guided by the policy produced by reinforcement learning (Damianou, 2015).

The main contributions of our paper are as follows.

  • We extend the deep GP framework to the IRL domain (Fig. 1), allowing for learning latent rewards with more complex structures from limited data.

  • We derive a variational lower bound on the marginal log likelihood using an innovative definition of the variational distributions. This methodological contribution enables Bayesian learning in our model and can be applied to other scenarios where the observation layer’s dynamics cause similar intractabilities.

  • We compare the proposed deep GP for IRL with existing approaches in benchmark tests as well as newly defined tests with more challenging rewards.


Figure 1: Left: The proposed deep GP model for IRL. The two layers of GPs are stacked to generate the latent reward $\mathbf{r}$, which is input to the reinforcement learning (RL) engine to produce an optimal control policy and demonstrations. $\mathbf{X}$ denotes the initial feature representation of the states. Right: Illustration of DGP-IRL augmented with inducing outputs and corresponding inducing inputs.

In the following, we review the problem of inverse reinforcement learning (IRL). The list of notations used throughout the paper is summarized in Table 1.

$t$ : time step
$i$ : index of demonstrations
$s \in \mathcal{S}$ : state; $a \in \mathcal{A}$ : action
$\gamma$ : RL discount factor, $\gamma \in (0, 1)$
$\mathbf{r} \in \mathbb{R}^{n_s}$ : reward vector for all states
$r(\cdot)$ : reward function for states
$Q$ : Q-value function
$q(\cdot)$ : variational distribution
$V$ : state value function
$\pi$ : policy
$\hat{\cdot}$, $\cdot^\star$ : the corresponding estimated and optimal versions, e.g., $\hat{\pi}$, $\pi^\star$
$\mathbf{X} \in \mathbb{R}^{n_s \times d_X}$ : $d_X$-dimensional feature matrix for the $n_s$ discrete states
$\mathbf{x}_i$ : features of the $i$-th state; $\mathbf{x}^{(j)}$ : the $j$-th feature for all states
$k_\theta(\cdot, \cdot)$ : covariance function parametrized by $\theta$
$\mathbf{K}_{\mathbf{XX}}$ : covariance matrix, $(\mathbf{K}_{\mathbf{XX}})_{ij} = k_\theta(\mathbf{x}_i, \mathbf{x}_j)$
$\mathbf{A} \in \mathbb{R}^{n_s \times d_A}$ : latent state matrix with $d_A$-dimensional features
$\mathbf{B}$ : $\mathbf{A}$ with Gaussian noise, $\mathbf{B} = \mathbf{A} + \boldsymbol{\epsilon}$
$\mathbf{Z}_X$, $\mathbf{Z}_B$ : inducing inputs for the top and bottom layer
$\mathbf{U}$, $\mathbf{u}$ : inducing outputs for the top and bottom layer
$\tilde{\cdot}$ : mean of the corresponding variable's variational distribution, e.g., $\tilde{\mathbf{u}}$
Table 1: Summary of notations.

1.1 Inverse Reinforcement Learning

The Markov Decision Process (MDP) is characterized by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, r)$, which represents the state space, action space, transition model, discount factor, and reward function, respectively.

Take robot navigation as an example. The goal is to reach a target location while avoiding stairwells. The state describes the current location and heading. The robot can choose among the actions of going forward or backward and turning left or right. The transition model specifies $\mathcal{P}(s' \mid s, a)$, i.e., the probability of reaching the next state $s'$ given the current state $s$ and action $a$, which accounts for the kinematic dynamics. The reward is +1 if the robot reaches the goal, -1 if it ends up in a stairwell, and 0 otherwise. The discount factor $\gamma$ is a positive number less than 1, e.g., 0.9, which discounts future rewards. The optimal policy is then given by maximizing the expected discounted reward, i.e.,

$$\pi^\star = \arg\max_{\pi} \; \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t) \,\Big|\, \pi \Big]. \qquad (1)$$

The IRL task is to find the reward function such that the induced optimal policy matches the demonstrations, given the MDP without the reward and a set of demonstrations $\mathcal{D} = \{\zeta_1, \ldots, \zeta_N\}$, where each trajectory $\zeta_i$ consists of state-action pairs $(s_{i,t}, a_{i,t})$. Under the linearity assumption, the feature representation of the states forms a linear basis of the reward, $r(s) = \boldsymbol{\theta}^\top \phi(s)$, where $\phi : \mathcal{S} \to \mathbb{R}^{d_X}$ is the $d_X$-dimensional mapping from a state to its feature vector. From this definition, the expected reward for a policy $\pi$ is given by:

$$\mathbb{E}\Big[ \textstyle\sum_{t} \gamma^t\, r(s_t) \,\Big|\, \pi \Big] = \boldsymbol{\theta}^\top \mu(\pi),$$

where $\mu(\pi) = \mathbb{E}\big[ \sum_t \gamma^t \phi(s_t) \mid \pi \big]$ is the feature expectation for policy $\pi$. The reward parameter $\boldsymbol{\theta}$ is learned such that

$$\boldsymbol{\theta}^\top \mu(\pi^\star) \geq \boldsymbol{\theta}^\top \mu(\pi), \quad \forall \pi, \qquad (2)$$

a prevalent idea that appears in maximum margin planning (MMP) (Ratliff et al., 2006) and feature expectation matching (Syed and Schapire, 2007).
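
As a concrete illustration of feature expectations and the linear reward model, the following minimal sketch (with illustrative names, independent of the implementation in the supplementary material) assumes trajectories are given as lists of state indices and Phi is the state feature matrix.

```python
import numpy as np

def feature_expectation(trajectories, Phi, gamma=0.9):
    """Empirical discounted feature expectation mu(pi), estimated from
    demonstration trajectories given as lists of state indices."""
    mu = np.zeros(Phi.shape[1])
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * Phi[s]
    return mu / len(trajectories)

def linear_reward(theta, Phi):
    """Linear reward model r(s) = theta^T phi(s), evaluated for all states."""
    return Phi @ theta
```

Matching the empirical feature expectation of the learner's policy to that of the expert then guarantees equal expected reward for any $\boldsymbol{\theta}$, which is the intuition behind (2).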

Motivated by the perspective of the expected reward parametrizing the policy class, the maximum entropy (MaxEnt) model (Ziebart et al., 2008) considers a stochastic decision model, where the optimal policy randomly chooses actions according to the associated reward:

$$\pi(a \mid s) = \exp\big( Q_{\mathbf{r}}(s, a) - V_{\mathbf{r}}(s) \big), \qquad (3)$$

where $Q_{\mathbf{r}}$ follows the Bellman equation, and $V_{\mathbf{r}}(s)$ and $Q_{\mathbf{r}}(s, a)$ are measures of how desirable the corresponding state and state-action pair are under the rewards $\mathbf{r}$. In principle, for a given state $s$, the best action corresponds to the highest Q-value, which represents the "desirability" of the action. Assuming independence among the state-action pairs of the demonstrations, the likelihood of the demonstrations corresponds to the joint probability of taking the observed sequences of actions under the visited states, according to the Bellman equation:

$$P(\mathcal{D} \mid \mathbf{r}) = \prod_{i, t} \pi(a_{i,t} \mid s_{i,t}) = \prod_{i, t} \exp\big( Q_{\mathbf{r}}(s_{i,t}, a_{i,t}) - V_{\mathbf{r}}(s_{i,t}) \big). \qquad (4)$$

Though directly optimizing the above criterion with respect to $\mathbf{r}$ is possible, it does not lead to generalizable solutions that are transferable to a new test case where no demonstrations are available; hence, we need a "model" of $\mathbf{r}$. MaxEnt assumes a linear structure for the rewards, while GPIRL (Levine et al., 2011) uses GPs to relate the states to the rewards. In Section 2.1, we give a brief overview of GPs and GPIRL.
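
For reference, a minimal soft value iteration sketch for the MaxEnt policy and the likelihood in (4); this is a common discounted variant under an assumed transition tensor P of shape (states, actions, states), and not necessarily the exact recursion used in our implementation.

```python
import numpy as np
from scipy.special import logsumexp

def maxent_policy(r, P, gamma=0.9, iters=200):
    """Soft value iteration: V(s) = logsumexp_a Q(s, a), with
    Q(s, a) = r(s) + gamma * sum_s' P[s, a, s'] * V(s').
    Returns the stochastic policy pi(a | s) = exp(Q - V)."""
    n_s, n_a, _ = P.shape
    V = np.zeros(n_s)
    for _ in range(iters):
        Q = r[:, None] + gamma * P.reshape(n_s * n_a, n_s).dot(V).reshape(n_s, n_a)
        V = logsumexp(Q, axis=1)
    return np.exp(Q - V[:, None])

def demo_log_likelihood(demos, r, P, gamma=0.9):
    """Log of Eq. (4) for demonstrations given as (state, action) pairs."""
    pi = maxent_policy(r, P, gamma)
    return sum(np.log(pi[s, a]) for s, a in demos)
```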

2 The Model

In this section, we start by discussing reward modeling through Gaussian processes (GPs) following (Levine et al., 2011), proceed to incorporate the representation learning module as additional GP layers, then develop our variational training framework, and finally describe how the learned model is transferred between tasks.

2.1 Gaussian Process Reward Modeling

We consider the setup of discretizing the world into $n_s$ states. Assume that the observed state-action pairs (demonstrations) $\mathcal{D}$ are generated from a set of $d_X$-dimensional state features $\mathbf{X} \in \mathbb{R}^{n_s \times d_X}$ through the reward function $r$. Throughout this paper we denote points (rows of a matrix) as $\mathbf{x}_i$, features (columns) as $\mathbf{x}^{(j)}$, and single elements as $x_{i,j}$.

In this modeling framework, the reward function plays the role of an unknown mapping, thus we wish to treat it as latent and keep it flexible and non-linear. Therefore, we follow (Levine et al., 2011) and model it with a zero-mean GP prior (Rasmussen, 2006):

$$r(\cdot) \sim \mathcal{GP}\big(0, k_\theta(\cdot, \cdot)\big),$$

where $k_\theta$ denotes the covariance function, e.g., the automatic relevance determination (ARD) kernel $k_\theta(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\big(-\tfrac{1}{2} \sum_{k} \lambda_k (x_{i,k} - x_{j,k})^2\big)$. Given a finite amount of data, this induces the probability $p(\mathbf{r} \mid \mathbf{X}, \theta) = \mathcal{N}(\mathbf{r} \mid \mathbf{0}, \mathbf{K}_{\mathbf{XX}})$, where the covariance matrix is obtained by $(\mathbf{K}_{\mathbf{XX}})_{ij} = k_\theta(\mathbf{x}_i, \mathbf{x}_j)$. The GPIRL training objective comes from integrating out the latent reward:

$$\log p(\mathcal{D} \mid \mathbf{X}, \theta) = \log \int P(\mathcal{D} \mid \mathbf{r})\, p(\mathbf{r} \mid \mathbf{X}, \theta)\, d\mathbf{r}, \qquad (5)$$

and maximizing over the hyperparameters $\theta$, which we drop from our expressions from now on.

The above integral is intractable, because $P(\mathcal{D} \mid \mathbf{r})$ has the complicated form of (4) (in contrast to traditional GP regression, where the likelihood is a Gaussian or another simple distribution). This can be alleviated using the approximation of (Levine et al., 2011). We will describe this approximation in the next section, as it is also used by our approach. Notice that all latent function instantiations are linked through a joint multivariate Gaussian. Thus, the prediction of the function value $r^*$ at a test input $\mathbf{x}^*$ is found through the conditional

$$p(r^* \mid \mathbf{x}^*, \mathbf{r}, \mathbf{X}) = \mathcal{N}\big( r^* \mid \mathbf{k}_{*\mathbf{X}} \mathbf{K}_{\mathbf{XX}}^{-1} \mathbf{r},\; k_\theta(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}_{*\mathbf{X}} \mathbf{K}_{\mathbf{XX}}^{-1} \mathbf{k}_{*\mathbf{X}}^\top \big), \quad (\mathbf{k}_{*\mathbf{X}})_i = k_\theta(\mathbf{x}^*, \mathbf{x}_i).$$

As can be seen, the prediction relies on the effectiveness of the feature representation: states whose features are close in Euclidean distance are assumed to be associated with similar rewards. This motivates our novel deep GP-IRL method, which is obtained by considering additional layers, as we describe next.
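
The GP reward prediction above can be sketched in a few lines; this is a generic GP regression snippet with an assumed single-lengthscale RBF kernel, not the GPIRL toolbox code.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, lengthscale=1.0):
    """RBF covariance with a single shared lengthscale (ARD omitted for brevity)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict_reward(X_train, r_train, X_test, jitter=1e-6, **kern_args):
    """Posterior mean and covariance of the reward at test states,
    i.e., the standard GP conditional used in Section 2.1."""
    K = rbf_kernel(X_train, X_train, **kern_args) + jitter * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train, **kern_args)
    K_ss = rbf_kernel(X_test, X_test, **kern_args)
    mean = K_star @ np.linalg.solve(K, r_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```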

2.2 Incorporating the Representation Learning Layers

The traditional model-based IRL approach is to learn the latent reward (operating on the fixed state features $\mathbf{X}$) that best explains the demonstrations $\mathcal{D}$. In this paper we wish to additionally and simultaneously uncover a highly descriptive state feature representation. To achieve this, we introduce a latent state feature representation $\mathbf{A} \in \mathbb{R}^{n_s \times d_A}$. $\mathbf{A}$ constitutes the instantiations of an introduced function which is learned as a non-linear GP transformation of $\mathbf{X}$. To account for noise, we further introduce $\mathbf{B}$ as the noisy version of $\mathbf{A}$, i.e., $\mathbf{B} = \mathbf{A} + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.

Importantly, rather than performing two separate learning steps (for the GP on $\mathbf{X}$ and the GP on $\mathbf{B}$), we nest them into a single objective function, to maintain the flow of information during optimization. This results in a deep GP whose top layers perform representation learning and whose lower layers perform model-based IRL. Fig. 1 outlines our model, the Deep Gaussian Process for Inverse Reinforcement Learning (DGP-IRL). By using $\mathbf{a}^{(j)}$ to represent the $j$-th column of $\mathbf{A}$, and similarly $\mathbf{b}^{(j)}$ for $\mathbf{B}$, the full generative model is written as follows:

$$p(\mathcal{D}, \mathbf{r}, \mathbf{B}, \mathbf{A} \mid \mathbf{X}) = P(\mathcal{D} \mid \mathbf{r})\; p(\mathbf{r} \mid \mathbf{B})\; p(\mathbf{B} \mid \mathbf{A})\; p(\mathbf{A} \mid \mathbf{X}), \qquad (6)$$

with $p(\mathbf{r} \mid \mathbf{B}) = \mathcal{N}(\mathbf{r} \mid \mathbf{0}, \mathbf{K}_{\mathbf{BB}})$, $p(\mathbf{B} \mid \mathbf{A}) = \prod_{j} \mathcal{N}(\mathbf{b}^{(j)} \mid \mathbf{a}^{(j)}, \sigma^2 \mathbf{I})$, and $p(\mathbf{A} \mid \mathbf{X}) = \prod_{j} \mathcal{N}(\mathbf{a}^{(j)} \mid \mathbf{0}, \mathbf{K}_{\mathbf{XX}})$, where the IRL term $P(\mathcal{D} \mid \mathbf{r})$ takes the form of (4) as suggested by Ziebart et al. (2008). $\mathbf{K}_{\mathbf{BB}}$ and $\mathbf{K}_{\mathbf{XX}}$ are the covariance matrices in each layer, constructed with their respective covariance functions. Compared to GPIRL, the proposed framework gains substantial flexibility by introducing the abstract representation of the states in the hidden layers $\mathbf{A}$ and $\mathbf{B}$. Note that the model in Fig. 1 can be extended in depth by introducing additional hidden layers connected with additional GP mappings; it is only for illustrative simplicity that we base our derivation on the two-layered structure.
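
To make the generative structure concrete, here is a small ancestral-sampling sketch of the two-layer prior described above; the kernel choice, dimensions, and all names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def _rbf(X1, X2, lengthscale=1.0):
    """Simple RBF covariance used only for this illustration."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def sample_dgp_irl_prior(X, d_latent=2, noise_var=0.01, seed=0):
    """Ancestral sample from Eq. (6): X --GP--> A --+noise--> B --GP--> r."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K_X = _rbf(X, X) + 1e-6 * np.eye(n)
    # Top layer: each latent feature column is an independent GP draw on X.
    A = rng.multivariate_normal(np.zeros(n), K_X, size=d_latent).T
    B = A + np.sqrt(noise_var) * rng.standard_normal(A.shape)
    # Bottom layer: the reward is a GP draw indexed by the noisy latent features.
    K_B = _rbf(B, B) + 1e-6 * np.eye(n)
    r = rng.multivariate_normal(np.zeros(n), K_B)
    return A, B, r
```

The demonstrations would then be generated by running the MaxEnt RL engine of (3)-(4) on the sampled reward.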

We can compress the statistical power of each generative layer into a set of auxiliary variables within a sparse GP framework (Snelson and Ghahramani, 2005). Specifically, we introduce inducing outputs and inputs, denoted by $\mathbf{u}$ and $\mathbf{Z}_B$ respectively for the lower layer and by $\mathbf{U}$ and $\mathbf{Z}_X$ for the top layer (as in Fig. 1). The inducing outputs and inputs are related through the same GP prior appearing in each layer; for example, $p(\mathbf{u} \mid \mathbf{Z}_B) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{\mathbf{Z}_B \mathbf{Z}_B})$. By relating the original and inducing variables through the conditional Gaussian distribution, the auxiliary variables are learned so as to act as sufficient statistics of the GP. The augmented model, shown in Fig. 1, has the following definition:

(7)

where we adopt the Fully Independent Training Conditional (FITC) approximation to preserve the exact variances in the latent-layer conditional, and the Deterministic Training Conditional (DTC) (Quiñonero-Candela and Rasmussen, 2005) in the reward-layer conditional, as in GPIRL, to facilitate the integration of the reward $\mathbf{r}$ in the training objective (see next section).

In the following, we will omit the inducing inputs in the conditioning sets, with the convention of treating them as model parameters (Damianou and Lawrence, 2013; Damianou, 2015; Kandemir, 2015b). By selecting a number of inducing points $m \ll n_s$, the complexity reduces from $\mathcal{O}(n_s^3)$ to $\mathcal{O}(n_s m^2)$. While DGP-IRL resolves the case where the outputs have complex dependencies on the latent layers, training the model by variational inference requires gradients with respect to the parameters, as in backpropagation, and their convergence can be improved by leveraging advances in deep learning.

Additionally, in DGP-IRL, the role of auxiliary variables goes further than just introducing scalability. Indeed, as we shall see next, the auxiliary variables play a distinct role in our model, by forming the base of a variational framework for Bayesian inference.
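
As an illustration of how the inducing variables compress the reward layer, the following sketch reconstructs the full reward vector from $m$ inducing outputs using a deterministic (DTC-style) conditional; the variable names and kernel are assumptions made for the example.

```python
import numpy as np

def dtc_reward(B, Z_B, u, lengthscale=1.0):
    """Deterministic reconstruction of the reward at all n states from the
    m inducing outputs u located at inducing inputs Z_B:
        r = K_{B,Z} K_{Z,Z}^{-1} u
    This costs O(n m^2) rather than the O(n^3) of a full GP."""
    def k(X1, X2):
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / lengthscale ** 2)
    K_bz = k(B, Z_B)
    K_zz = k(Z_B, Z_B) + 1e-6 * np.eye(len(Z_B))
    return K_bz @ np.linalg.solve(K_zz, u)
```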

2.3 Variational Inference

We wish to optimize the model evidence $\log p(\mathcal{D} \mid \mathbf{X})$, obtained by marginalizing the latent variables out of the augmented model (7), for training. However, this quantity is intractable: firstly, because the latent variables appear nonlinearly in the inverses of covariance matrices; secondly, because the latent rewards relate to the observations through the reinforcement learning layer. The choice of the deterministic conditional in (7) does not completely solve this problem, because in DGP-IRL there is additional uncertainty propagated by the latent layers. This indicates that the Laplace approximation is not practical, nor is the variational method employed for deep GPs (Damianou and Lawrence, 2013; Kandemir, 2015a), where the output is related to the latent variables in a simple regression framework.

To this end, we show that we can derive an analytic lower bound on the model evidence by constructing a variational framework using the following special form of variational distribution:

(8)
(9)
(10)
(11)

The delta distribution is equivalent to taking the mean of the normal distribution for prediction, which is reasonable in the context of reinforcement learning (Levine et al., 2011). Also note that the delta distribution is applied only in the bottom layer and not repeatedly; therefore, representation learning is indeed manifested in the latent layers.

In addition, the variational conditional over the latent layer matches the exact model conditional, so that these two terms cancel in the fraction of (13) and the number of variational parameters is minimized, as in (Titsias and Lawrence, 2010). As for the remaining variational factors over the inducing outputs, they are chosen as delta distributions so that, combined with the IRL term in (12), the bound becomes tractable and information can flow through the latent layers.

The variational marginal over the latent layer is factorized across its dimensions with fully parameterized normal densities, as in (Damianou and Lawrence, 2013). The delta locations are the means of the inducing outputs (Table 1), corresponding to pseudo-inputs, which are either initialized with random numbers drawn from uniform distributions (Sutskever et al., 2013) and learned to further maximize the marginal likelihood, or chosen as a subset of existing points.

The variational means of the latent layer can be augmented with the input data to improve stability during training (Duvenaud et al., 2014).

The variational lower bound follows from Jensen's inequality and can be derived analytically due to the choice of variational distribution (see Section 2 of the Appendix for details):

(12)
(13)
(14)

where

(15)
(16)
(17)

These are, respectively, the term associated with the RL likelihood, the Gaussian prior on the inducing outputs, and the Kullback-Leibler (KL) divergence between the variational posterior and the prior, which acts as a regularization term. The lower bound can be optimized with gradient-based methods, with the gradients computed by backpropagation. In addition, we can find optimal fixed-point equations for the variational distribution parameters using variational calculus, in order to raise the variational lower bound further (refer to Section 3 of the supplement for this derivation).
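
As one concrete ingredient of such a bound, the KL regularizer between a Gaussian variational posterior and a zero-mean Gaussian (GP) prior has the standard closed form below; this generic sketch is ours and is not the full objective derived in the Appendix.

```python
import numpy as np

def gauss_kl(m, S, K):
    """KL( N(m, S) || N(0, K) ) for a single latent dimension:
    0.5 * [ tr(K^-1 S) + m^T K^-1 m - n + log|K| - log|S| ]."""
    n = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))
    mahalanobis = m @ np.linalg.solve(K, m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahalanobis - n + logdet_K - logdet_S)
```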

Notice that the approximate marginalization of all hidden spaces in (14) amounts to an approximately Bayesian training procedure, in which model complexity is automatically balanced through the Bayesian Occam's razor principle. Optimizing the objective turns the variational distribution into an approximation of the true model posterior.

2.4 Transfer to New Tasks

The inducing points provide a succinct summary of the data, by the property of FITC (Quiñonero-Candela and Rasmussen, 2005), which means that only the inducing points are necessary for prediction. Given a set of new states with features $\mathbf{X}^*$, DGP-IRL can infer the latent reward through the full Bayesian treatment:

(18)

Given that the above integral is computationally intensive to evaluate, a practical alternative, adopted in our implementation, is to use point estimates for the latent variables; the rewards are then given by:

(19)

i.e., instead of making inference based on the feature layer directly, as in Levine et al. (2011), DGP-IRL first estimates the latent representation of the new states and then performs GP regression using these latent variables.
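
A point-estimate prediction for unseen states can be sketched as the two-stage procedure just described: project the new features into the latent space via the top-layer inducing variables, then regress the reward via the reward-layer inducing variables. All names and the kernel below are illustrative assumptions.

```python
import numpy as np

def predict_reward_new_states(X_new, Z_X, U_mean, Z_B, u_mean, lengthscale=1.0):
    """Two-stage point prediction: features -> latent representation -> reward."""
    def k(X1, X2):
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / lengthscale ** 2)
    # Stage 1: posterior-mean latent representation of the new states.
    K_xz = k(X_new, Z_X)
    K_zz = k(Z_X, Z_X) + 1e-6 * np.eye(len(Z_X))
    B_new = K_xz @ np.linalg.solve(K_zz, U_mean)
    # Stage 2: GP-regress the reward from the latent representation.
    K_bz = k(B_new, Z_B)
    K_bb = k(Z_B, Z_B) + 1e-6 * np.eye(len(Z_B))
    return K_bz @ np.linalg.solve(K_bb, u_mean)
```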

3 Experiments

For the experimental evaluation, we employ the expected value difference (EVD) as a metric of optimality:

$$\mathrm{EVD} = \mathbb{E}\Big[ \textstyle\sum_{t} \gamma^t\, r_{\text{true}}(s_t) \,\Big|\, \pi^\star \Big] - \mathbb{E}\Big[ \textstyle\sum_{t} \gamma^t\, r_{\text{true}}(s_t) \,\Big|\, \hat{\pi} \Big], \qquad (20)$$

which is the difference between the expected reward earned under the optimal policy $\pi^\star$ given the true rewards, and under the policy $\hat{\pi}$ derived from the IRL rewards, both evaluated on the true rewards. Our software implementation is included in the supplementary material.
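
For tabular MDPs, the EVD in (20) can be computed exactly by solving the Bellman equations for each policy; a minimal sketch under an assumed transition tensor P and initial-state distribution p0 follows.

```python
import numpy as np

def policy_value(pi, r_true, P, gamma=0.9):
    """State values of a stochastic policy pi(a|s) under the true reward,
    from the linear Bellman equations V = r + gamma * P_pi V."""
    n_s, _, _ = P.shape
    P_pi = np.einsum('sa,sat->st', pi, P)   # policy-averaged transition matrix
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_true)

def expected_value_difference(pi_opt, pi_irl, r_true, P, p0, gamma=0.9):
    """EVD of Eq. (20): value lost by acting optimally w.r.t. the learned
    reward instead of the true reward, averaged over initial states p0."""
    v_opt = policy_value(pi_opt, r_true, P, gamma)
    v_irl = policy_value(pi_irl, r_true, P, gamma)
    return p0 @ (v_opt - v_irl)
```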

3.1 Object World Benchmark

The Object World (OW) benchmark, originally implemented by Levine et al. (2011), is a gridworld where dots of primary colors, e.g., red and blue, and secondary colors, e.g., purple and green, are placed on the grid at random, as shown in Fig. 2. Each state, i.e., grid block, is described by its shortest distance to the dots of each color group. The latent reward is assigned such that if a block is within 1 step of a red dot and within 3 steps of a blue dot, the reward is +1; if it is within 3 steps of a blue dot only, the reward is -1; and the reward is 0 otherwise. The agent maximizes its expected discounted reward by following a policy which provides the probabilities of the actions (moving up/down/left/right, or staying still) at each state, subject to transition probabilities.
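
The OW reward rule can be written directly as a function of the two relevant features (distance to the nearest red and nearest blue dot); the snippet below is a sketch of that rule, not the benchmark toolbox code.

```python
import numpy as np

def object_world_reward(dist_red, dist_blue):
    """Latent OW reward: +1 within 1 step of red and 3 of blue,
    -1 within 3 steps of blue only, 0 otherwise."""
    dist_red, dist_blue = np.asarray(dist_red), np.asarray(dist_blue)
    reward = np.zeros(dist_red.shape)
    reward[(dist_blue <= 3) & (dist_red <= 1)] = 1.0
    reward[(dist_blue <= 3) & (dist_red > 1)] = -1.0
    return reward
```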

The objective of this experiment is to compare the performance of DGP-IRL with previous methods as the number of demonstrations varies. The evaluated candidates include Multiplicative Weights for Apprenticeship Learning (MWAL) (Syed and Schapire, 2007), MaxEnt, and MMP, which assume a linear reward function, and GPIRL, which is the state-of-the-art method on this benchmark. Linear models, as shown in Fig. 2, cannot capture the complex structure, while GPIRL learns more accurate yet still noisy rewards, limited by the discriminability of the features; DGP-IRL, by contrast, produces the inference closest to the ground truth even with limited data, thanks to its increased representational power and robust training through variational inference.

(a) Ground Truth  (b) DGP-IRL  (c) GPIRL  (d) MWAL  (e) MaxEnt  (f) MMP
Figure 2: OW benchmark for IRL, showing (a) the ground-truth rewards and the rewards recovered by (b) DGP-IRL, (c) GPIRL, (d) MWAL, (e) MaxEnt, and (f) MMP, with 64 demonstrations and continuous features. Except for DGP-IRL, all the other algorithms are evaluated with the toolbox by Levine et al. (2011).

(a) Training  (b) Transfer
Figure 3: Plots of EVD in the training (a) and transfer (b) tests for the OW benchmark, where continuous features are employed.

Additionally, a transferability test is carried out by examining the EVD in a new world where no demonstrations are available, which requires transferring knowledge from the previous learning scenario. DGP-IRL outperforms GPIRL and the other models in both the training and transfer cases, and the improvement becomes more pronounced as more data become available (Fig. 3). The shaded areas (Figs. 3, 6, and 7(a)) show the standard deviation of the EVD across independent experiments, which indicates that DGP-IRL and GPIRL are more reliable.

3.2 Binary World Benchmark

Though the rewards in OW are nonlinear functions of the features, they form separated clusters in the subspace spanned by two dimensions, i.e., the distances to the nearest red and blue dots. Binary world (BW) is a benchmark introduced by Wulfmeier et al. (2015) whose reward depends on combinations of features. More specifically, in a grid world where each block is randomly assigned either a blue or a red dot, a state is associated with a +1 reward if there are exactly 4 blue dots in its neighborhood, -1 if there are 5, and 0 otherwise. The feature vector represents the colors of the 9 dots in the neighborhood. BW sets up a challenging scenario, where states that are maximally separated in feature space can have the same reward, yet states that are close in Euclidean distance may have opposite rewards.
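
The BW reward is a purely combinatorial function of the 3x3 neighborhood colors; a one-function sketch of the rule (our own illustration, with 1 encoding blue) is:

```python
import numpy as np

def binary_world_reward(neighborhood):
    """BW reward from the 9 neighborhood colors (1 = blue, 0 = red):
    +1 for exactly 4 blue dots, -1 for exactly 5, 0 otherwise."""
    n_blue = int(np.sum(neighborhood))
    return {4: 1.0, 5: -1.0}.get(n_blue, 0.0)
```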

The task is to learn the latent rewards of the states given limited demonstrations, as shown in Fig. 4. While linear models are limited by their representational capacity, the results of GPIRL also deviate from the latent rewards, as it cannot generalize from the training data with such convoluted features. DGP-IRL, nevertheless, is able to recover the ground truth with the highest fidelity.

(a) Ground Truth  (b) DGP-IRL  (c) GPIRL  (d) LEARCH  (e) MaxEnt  (f) MMP
Figure 4: BW benchmark evaluated with 128 demonstrated traces for DGP-IRL, GPIRL, LEARCH (Ratliff et al., 2009), MaxEnt, and MMP.

(a) Input space  (b) Latent space
Figure 5: Visualization of points along two arbitrary dimensions of the (a) input space and (b) latent space of DGP-IRL. The points represent features of states and are color-coded with the associated rewards. The rewards are entangled in the input space but separated in the latent space.

(a) Training  (b) Transfer
Figure 6: Plots of EVD in the training (a) and transfer (b) tests for the BW benchmark as the number of training samples varies.

(a) Training EVD  (b) Speeding Prob.
Figure 7: Plots of EVD in the training test (a) and of the probability of speeding with 64 demonstrations (b) for the highway driving simulation benchmark, with three lanes and 32 car lengths.

By successively warping the original feature space through the latent layers, DGP-IRL can learn an abstract representation that reveals the reward structure. As illustrated in Fig. 5, although the points are mixed up in the input space, making it impossible to separate those with the same rewards, their positions in the latent space clearly form clusters, indicating that DGP-IRL has uncovered the mechanism of reward generation simply by observing traces of actions.

The advantage of simultaneous representation and inverse reinforcement learning is demonstrated in Fig. 6, which plots the EVD for the training and transfer cases. As the features are interlinked not only with the reward but also with each other in a very nonlinear way, this scenario is particularly challenging for linear models such as LEARCH (Ratliff et al., 2009), MaxEnt, and MMP. While both GPIRL and DGP-IRL perform satisfactorily in the training case, DGP-IRL significantly outperforms GPIRL in the transfer task, which indicates that the learned latent space is transferable across different scenarios.

3.3 Highway Driving Behaviors

Highway driving behavior modeling, as investigated by Levine et al. (2010; 2011), is a concrete example for examining the capacity of IRL algorithms to learn the underlying motives from demonstrations, based on a simple simulator. On a three-lane highway, vehicles of a specific class (civilian or police) and category (car or motorcycle) are positioned at random, all driving at the same constant speed. The robot car can switch lanes and navigate at up to three times the traffic speed. The state is described by a continuous feature vector consisting of the closest distances to vehicles of each class and category in the same lane, the left lane, the right lane, and any lane, both in front of and behind the robot car, in addition to the current speed and position.

The goal is to navigate the robot car as fast as possible while avoiding speeding, defined as driving faster than twice the traffic speed when a police car is within 2 car lengths. As the reward is a nonlinear function of the current speed and the distance to the police car, linear models are outperformed by GPIRL and DGP-IRL. Performance generally improves with more demonstrations, and DGP-IRL consistently yields the policy closest to optimal in terms of EVD, and with the lowest probability of speeding, as illustrated in Fig. 7(a) and 7(b), respectively.

4 Conclusion and Future Work

DGP-IRL is proposed as a solution for IRL based on deep GPs. By extending the structure with additional GP layers to learn the abstract representation of the state features, DGP-IRL has more representational capability than linear-based and GP-based methods. Meanwhile, the Bayesian training approach through variational inference guards against overfitting, bringing the advantage of automatic capacity control and principled uncertainty handling.

Our proposed DGP-IRL outperforms state-of-the-art methods in our experiments, and is shown to learn efficiently even from few demonstrations. For future work, the unique properties of DGP-IRL enable the easy incorporation of side knowledge (through priors on the latent space) into IRL, and our work also opens up the way for combining deep GPs with other complicated inference engines, e.g., selective attention models (Gregor et al., 2015). We plan to investigate these ideas in the future. Another promising direction is to construct transferable models in which the latent layer is relied upon for knowledge sharing. Finally, we plan to investigate some of the many applications where DGP-IRL can prove especially beneficial, such as intelligent building and grid controls (Jin et al., 2017a; b) and human-in-the-loop gamification (Ratliff et al., 2014).

References

  • Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
  • Abbeel et al. (2008) Pieter Abbeel, Dmitri Dolgov, Andrew Y Ng, and Sebastian Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, pages 1083–1090. IEEE, 2008.
  • Barrett and Linder (2015) Enda Barrett and Stephen Linder. Autonomous HVAC control, a reinforcement learning approach. In Machine Learning and Knowledge Discovery in Databases, pages 3–19. Springer, 2015.
  • Bui et al. (2015) Thang D Bui, José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, and Richard E Turner. Training deep Gaussian processes using stochastic expectation propagation and probabilistic backpropagation. arXiv preprint arXiv:1511.03405, 2015.
  • Damianou (2015) Andreas Damianou. Deep Gaussian processes and variational propagation of uncertainty. PhD thesis, University of Sheffield, 2015.
  • Damianou and Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 207–215, 2013.
  • Duvenaud et al. (2014) David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, 2014.
  • Gregor et al. (2015) Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • Hensman and Lawrence (2014) James Hensman and Neil D Lawrence. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
  • Jin et al. (2017a) Ming Jin, Wei Feng, Ping Liu, Chris Marnay, and Costas Spanos. Mod-dr: Microgrid optimal dispatch with demand response. Applied Energy, 187:758–776, 2017a.
  • Jin et al. (2017b) Ming Jin, Ruoxi Jia, and Costas Spanos. Virtual occupancy sensing: Using smart meters to indicate your presence. IEEE Transactions on Mobile Computing, 99:1, 2017b.
  • Kandemir (2015a) Melih Kandemir. Asymmetric transfer learning with deep Gaussian processes. 2015a.
  • Kandemir (2015b) Melih Kandemir. Asymmetric transfer learning with deep Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 730–738, 2015b.
  • Kolter et al. (2007) J Zico Kolter, Pieter Abbeel, and Andrew Y Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in Neural Information Processing Systems, pages 769–776, 2007.
  • Levine et al. (2010) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1342–1350, 2010.
  • Levine et al. (2011) Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
  • Mattos et al. (2016) César Lincoln C Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A Barreto, and Neil D Lawrence. Recurrent Gaussian processes. International Conference on Learning Representations (ICLR), 2016.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Ng et al. (2000) Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
  • Quiñonero-Candela and Rasmussen (2005) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
  • Rasmussen (2006) Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
  • Ratliff et al. (2014) Lillian J Ratliff, Ming Jin, Ioannis C Konstantakopoulos, Costas Spanos, and S Shankar Sastry. Social game for building energy efficiency: Incentive design. In 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014, pages 1011–1018, 2014.
  • Ratliff et al. (2006) Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
  • Ratliff et al. (2009) Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.
  • Snelson and Ghahramani (2005) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264, 2005.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.
  • Syed and Schapire (2007) Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, volume 20, pages 1–8, 2007.
  • Syed et al. (2008) Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039. ACM, 2008.
  • Titsias and Lawrence (2010) Michalis K Titsias and Neil D Lawrence. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.
  • Wulfmeier et al. (2015) Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
  • Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.