Option Compatible Reward Inverse Reinforcement Learning

11/07/2019 · Rakhoon Hwang et al.

Reinforcement learning with complex tasks is a challenging problem. Often, expert demonstrations of complex multitasking operations are required to train agents, yet it is difficult to design a reward function for such complex tasks. In this paper, we solve a hierarchical inverse reinforcement learning (IRL) problem within the framework of options. A gradient method for parametrized options is used to deduce a defining equation for the Q-feature space, which leads to a reward feature space. Using a second-order optimality condition for the option parameters, an optimal reward function is selected. Experimental results in both discrete and continuous domains confirm that our segmented rewards provide a solution to the IRL problem for multitasking operations and show good performance and robustness against noise in the expert demonstrations.


1 Introduction

The reinforcement-learning (RL) method seeks an optimal policy for a given reward function in a Markov decision process (MDP). There are several circumstances in which an agent can learn only from an expert demonstration, because it is difficult to prescribe a proper reward function for a given task. Inverse reinforcement learning (IRL) aims to find a reward function for which the expert's policy is optimal. When the IRL method is applied to a large domain, the size of each trajectory of the required expert demonstration can be huge. There are also certain complex tasks that must be segmented into a sequence of sub-tasks, e.g., robotics for ubiquitous general-purpose automation ([11], [13]), robotic surgical procedure training ([7], [14]), hierarchical human behavior modeling [21], and autonomous driving [16]. For such complex tasks, a problem designer can decompose them hierarchically, and an expert can then easily demonstrate them at different levels of implementation.

Another challenge with the IRL method is the design of feature spaces that capture the structure of the reward functions. Linear models for reward functions have been used in existing IRL algorithms, although nonlinear models have recently been introduced [15], [6], [17]. Exploring more general feature spaces for reward functions becomes necessary when expert intuition is insufficient to design good features, even for linear models. This issue is of particular concern in robotics [20].

Regarding the first aspect of our problem, several works have considered the decomposition of underlying reward functions for given expert demonstrations in RL and IRL problems ([9], [3], [12]). For hierarchical IRL problems, most works focus on how to segment demonstrations of complex tasks and find suitable reward functions. For the IRL problem in the options framework, option discovery must first be carried out as a segmentation process. Since our work focuses on hierarchical extensions of policy-gradient-based IRL algorithms, we assign options for each given domain instead of applying a particular option-discovery algorithm.

To simultaneously resolve the problems of segmentation and feature construction, we propose a new method that applies the options framework presented by [2] to compatible reward inverse reinforcement learning (CR-IRL) [17], a recent approach to generating a feature space of rewards. Our method is called Option CR-IRL. Previous works on the selection of proper reward functions for the IRL problem require designing features tailored to the environment of the problem, whereas the CR-IRL algorithm directly provides a space of features from which compatible reward functions can be constructed.

The main contributions of our work are as follows.

  • A new method of assigning task-wise reward functions for a hierarchical IRL problem is introduced. Although both OptionGAN [9] and our work use policy-gradient methods as a common grounding component, the former adopts a GAN approach to solve the IRL problem, whereas we construct an explicit defining equation for the reward features.

  • For a multitask IRL problem, we simultaneously obtain feature spaces for the sub-tasks. When task-wise reward functions are constructed by combining the obtained features, they share the same combination parameters. As a consequence, we avoid task-wise feature design and perform a one-shot optimal-parameter selection across the entire task domain.

  • Our options-framework approach to solving the IRL problem in a multitask environment outperforms other algorithms that do not consider the multitask nature of the problem. While handling the termination of each subtask, introducing parameters for the termination and subtask policy functions in the policy-gradient framework allows us to fine-tune the subtask reward functions, providing a better reward selection. It also shows better robustness to noise in the expert policy than algorithms that do not use a hierarchical learning framework. The noise robustness of our algorithm is enabled by a feature-space dimension larger than that available to the non-option CR-IRL algorithm.

Our work differs in several respects from recent works [9], [18], and [12] on the segmentation of reward functions in IRL problems. [18] uses Bayesian nonparametric mixture models to simultaneously partition the demonstration and learn associated reward functions. It has an advantage in domains in which the subgoals of each subtask are definite; for such domains, a successful segmentation directly defines task-wise reward functions. Our work, however, allows for indefinite subgoals, for which the assignment of rewards is not simple. [12] focuses on segmentation using transitions defined as changes in local linearity with respect to a kernel function and assumes pre-designed features for the reward functions. In contrast, our method does not assume any prior knowledge of the feature space.

2 Preliminaries

Markov decision process A Markov decision process comprises a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $P(s'\mid s,a)$, and a reward function $R(s,a)$. A policy $\pi(a\mid s)$ is a probability distribution over actions conditioned on the states. The value of a policy is defined as $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\mid s_{0}=s\right]$, and the action-value function is $Q^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\mid s_{0}=s,\,a_{0}=a\right]$, where $\gamma\in[0,1)$ is the discount factor.

Policy Gradients Policy gradient methods [22] aim to optimize a parametrized policy $\pi_{\theta}$ via stochastic gradient ascent. In a discounted setting, the optimization of the expected $\gamma$-discounted return $J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right]$ is considered. It can be written as
$$J(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(a\mid s)\,R(s,a),$$
where $d^{\pi_{\theta}}(s)=\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_{t}=s\mid\pi_{\theta})$ is the discounted state occupancy. The policy gradient theorem [22] states
$$\nabla_{\theta}J(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(a\mid s)\,Q^{\pi_{\theta}}(s,a).$$
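
For orientation, the following sketch shows a standard REINFORCE-style Monte-Carlo estimator of this gradient from sampled trajectories; it is generic (the trajectory format and function names are assumptions of this example), not a piece of the authors' implementation.

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, gamma=0.99):
    """Monte-Carlo estimate of the policy gradient from sampled trajectories.

    trajectories : list of episodes, each a list of (state, action, reward)
    grad_log_pi  : function (state, action) -> grad_theta log pi_theta(a|s)
    """
    grads = []
    for episode in trajectories:
        # discounted return-to-go G_t for every step of the episode
        G = np.zeros(len(episode))
        running = 0.0
        for t in reversed(range(len(episode))):
            running = episode[t][2] + gamma * running
            G[t] = running
        g = sum((gamma ** t) * G[t] * grad_log_pi(episode[t][0], episode[t][1])
                for t in range(len(episode)))
        grads.append(g)
    return sum(grads) / len(grads)
```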

CR-IRL The CR-IRL method is an algorithm that generates a set of basis functions spanning the subspace of reward functions for which the policy gradient vanishes. As input, it takes a parametric policy space and a set of trajectories from the expert policy. It first builds the features of the action-value function that cause the policy gradient to vanish. These features can be transformed into reward features via the Bellman equation (model-based case) or reward shaping [19] (model-free case). Then, a reward function that maximizes the expected return is chosen by enforcing a second-order optimality condition based on the policy Hessian [10], [8].
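
As a concrete illustration of the CR-IRL idea, the sketch below builds a Q-feature basis in a tabular MDP under a Boltzmann expert policy by computing the null space of the estimated policy-gradient constraint; the policy form, the feature array, and the occupancy estimate are assumptions of this example, not the authors' implementation.

```python
import numpy as np

def boltzmann_grad_log_pi(theta, phi, s, a):
    """grad_theta log pi(a|s) for a Boltzmann policy pi(a|s) ~ exp(theta . phi(s, a))."""
    logits = phi[s] @ theta                     # (num_actions,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi[s, a] - probs @ phi[s]           # phi(s, a) - E_{a'~pi}[phi(s, a')]

def q_feature_basis(theta, phi, visited, occupancy):
    """Null space of the first-order optimality condition (model-free, tabular).

    phi       : array of shape (S, A, d) with feature vectors phi(s, a)
    visited   : list of (s, a) pairs appearing in the expert trajectories
    occupancy : dict mapping s -> estimated discounted state occupancy d(s)
    Returns a (len(visited), k) matrix whose columns are Q-features that make
    the estimated policy gradient vanish.
    """
    rows = []
    for (s, a) in visited:
        logits = phi[s] @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # column for (s, a): d(s) * pi(a|s) * grad_theta log pi(a|s)
        rows.append(occupancy[s] * probs[a] * boltzmann_grad_log_pi(theta, phi, s, a))
    G = np.array(rows).T                        # (d, num_visited)
    _, sing, vt = np.linalg.svd(G)
    tol = max(G.shape) * np.finfo(float).eps * (sing[0] if sing.size else 1.0)
    rank = int((sing > tol).sum())
    return vt[rank:].T                          # orthonormal basis of the null space
```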

The options framework We use the options framework for hierarchical learning; see [23] for details. A Markovian option $\omega\in\Omega$ is a triple $(I_{\omega},\pi_{\omega},\beta_{\omega})$, where $I_{\omega}\subseteq\mathcal{S}$ is an initiation set, $\pi_{\omega}$ is an intra-option policy, and $\beta_{\omega}:\mathcal{S}\to[0,1]$ is a termination function. Following [2], we consider the call-and-return option execution model in which a fixed trajectory of options is given. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$, and $\beta_{\omega,\vartheta}$ the termination function of the same option parametrized by $\vartheta$.

Option-critic architecture [2] proposed a method of option discovery based on gradient descent applied to the expected discounted return, defined by
$$\rho(\Omega,\theta,\vartheta,s_{0},\omega_{0})=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\,\middle|\,s_{0},\omega_{0}\right].$$
The objective function used here depends on the policy over options and on the parameters of the intra-option policies and termination functions. Its gradient with respect to these parameters is obtained through the following equations: the option-value function can be written as
$$Q_{\Omega}(s,\omega)=\sum_{a}\pi_{\omega,\theta}(a\mid s)\,Q_{U}(s,\omega,a),$$
where
$$Q_{U}(s,\omega,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,U(\omega,s')$$
is the action-value function for the state-option pair, and
$$U(\omega,s')=\bigl(1-\beta_{\omega,\vartheta}(s')\bigr)\,Q_{\Omega}(s',\omega)+\beta_{\omega,\vartheta}(s')\,V_{\Omega}(s')$$
is the option-value function upon arrival.
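
As a small illustration of these relations, the following tabular sketch computes $Q_{\Omega}$, $V_{\Omega}$, and $U$ from an array of state-option-action values; the array shapes and variable names are assumptions made for this example.

```python
import numpy as np

def option_values(Q_U, pi_w, pi_Omega, beta):
    """Compute Q_Omega, V_Omega, and the value upon arrival U (tabular case).

    Shapes: Q_U (S, W, A), pi_w (W, S, A), pi_Omega (S, W), beta (W, S).
    """
    # Q_Omega(s, w) = sum_a pi_w(a | s) Q_U(s, w, a)
    Q_Omega = np.einsum('wsa,swa->sw', pi_w, Q_U)
    # V_Omega(s) = sum_w pi_Omega(w | s) Q_Omega(s, w)
    V_Omega = (pi_Omega * Q_Omega).sum(axis=1)
    # U(w, s') = (1 - beta_w(s')) Q_Omega(s', w) + beta_w(s') V_Omega(s')
    U = (1.0 - beta) * Q_Omega.T + beta * V_Omega[None, :]
    return Q_Omega, V_Omega, U
```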

3 Generation of Q-features compatible with the optimal policy

The first step in obtaining a reward function as a solution of a given IRL problem is to generate Q-features compatible with an expert policy using the gradient of the expected discounted return. We assume that the parametrized expert intra-option policies $\pi_{\omega,\theta}$ are differentiable with respect to $\theta$. By the intra-option policy gradient theorem [2], the gradient of the expected discounted return with respect to $\theta$ vanishes at the expert parameters:

$$\frac{\partial\rho}{\partial\theta}=\sum_{s,\omega}\mu_{\Omega}(s,\omega\mid s_{0},\omega_{0})\sum_{a}\frac{\partial\pi_{\omega,\theta}(a\mid s)}{\partial\theta}\,Q_{U}(s,\omega,a)=0, \tag{1}$$

where $\mu_{\Omega}(s,\omega\mid s_{0},\omega_{0})$ is the discounted occupancy measure of state-option pairs.
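
In the model-free setting, the occupancy measure must be estimated from the expert trajectories. One simple estimator, shown only for illustration (the trajectory format is an assumption of this example), is a discounted visit count.

```python
from collections import defaultdict

def estimate_occupancy(trajectories, gamma=0.99):
    """Discounted empirical occupancy of (state, option) pairs.

    trajectories : list of episodes, each a list of (state, option, action)
    Returns a dict mapping (state, option) -> average discounted visit count.
    """
    mu = defaultdict(float)
    for episode in trajectories:
        for t, (s, w, _a) in enumerate(episode):
            mu[(s, w)] += gamma ** t
    n = max(len(trajectories), 1)
    return {key: value / n for key, value in mu.items()}
```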

The first-order optimality condition, $\partial\rho/\partial\theta=0$, gives a defining equation for Q-features compatible with the optimal policy. It is convenient to describe the space of such compatible Q-features inside the Hilbert space of functions on $\mathcal{S}\times\Omega\times\mathcal{A}$. We define the inner product
$$\langle f,g\rangle=\sum_{s,\omega,a}\mu_{\Omega}(s,\omega)\,\pi_{\omega,\theta}(a\mid s)\,f(s,\omega,a)\,g(s,\omega,a).$$
Consider the subspace $\mathcal{G}_{\theta}$ of the Hilbert space of functions on $\mathcal{S}\times\Omega\times\mathcal{A}$ spanned by the components of $\nabla_{\theta}\log\pi_{\omega,\theta}(a\mid s)$, with the inner product defined above. Then, the space of Q-features can be represented by the orthogonal complement $\mathcal{G}_{\theta}^{\perp}$ of $\mathcal{G}_{\theta}$.

Parametrizing the terminations is expected to allow more finely tuned option-wise reward functions in IRL problems. We can impose an additional optimality condition on the expected discounted return with respect to the parameters of the termination functions. Let $U(\omega_{0},s_{1})$ be the expected discounted return with initial condition $(s_{1},\omega_{0})$. By the termination gradient theorem [2], one has

$$\frac{\partial U(\omega_{0},s_{1})}{\partial\vartheta}=-\sum_{s',\omega}\mu_{\Omega}(s',\omega\mid s_{1},\omega_{0})\,\frac{\partial\beta_{\omega,\vartheta}(s')}{\partial\vartheta}\,A_{\Omega}(s',\omega)=0, \tag{2}$$

where $A_{\Omega}(s,\omega)=Q_{\Omega}(s,\omega)-V_{\Omega}(s)$ is the advantage function over options.

The vanishing equation (2) gives a constraint on the space of the Q-feature, $Q_{U}$. For simplicity, set $Q=Q_{U}$. The constraint equation for $Q$ is given by

$$\sum_{s',\omega}\mu_{\Omega}(s',\omega\mid s_{1},\omega_{0})\,\frac{\partial\beta_{\omega,\vartheta}(s')}{\partial\vartheta}\left(\sum_{a}\pi_{\omega,\theta}(a\mid s')\,Q(s',\omega,a)-V_{\Omega}(s')\right)=0, \tag{3}$$

where
$$V_{\Omega}(s')=\sum_{\omega'}\pi_{\Omega}(\omega'\mid s')\sum_{a}\pi_{\omega',\theta}(a\mid s')\,Q(s',\omega',a).$$

Thus, we can combine the two linear equations (1) and (3) for $Q$ to define the space of Q-features.

Let $d_{\theta}$ be the number of parameters $\theta$ and let $d_{\vartheta}$ be the number of parameters $\vartheta$. Because equation (1) consists of $d_{\theta}$ scalar equations and equation (3) of $d_{\vartheta}$ scalar equations, the rank of equation (1) is not greater than $d_{\theta}$, and the rank of equation (3) is not greater than $d_{\vartheta}$. Thus, the dimension of the space of Q-features defined by (1) and (3) is not less than the dimension of the ambient function space minus $d_{\theta}+d_{\vartheta}$.
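
Computationally, once the two constraint systems have been estimated on the visited state-option-action triples, the Q-feature space can be obtained by stacking them and taking a single null space, which is equivalent to intersecting the two individual null spaces. A minimal sketch (matrix names and shapes are assumptions of this example) is given below.

```python
import numpy as np

def null_space(M, rtol=1e-10):
    """Orthonormal basis of the null space of M, via SVD."""
    _, s, vt = np.linalg.svd(M)
    rank = int((s > rtol * (s[0] if s.size else 1.0)).sum())
    return vt[rank:].T

def combined_q_features(G_policy, G_termination):
    """Q-features satisfying both (1) and (3).

    G_policy      : (d_theta, N) constraint matrix from the intra-option
                    policy gradient, N = number of visited (s, w, a) triples
    G_termination : (d_vartheta, N) constraint matrix from the termination
                    gradient
    The stacked system has rank at most d_theta + d_vartheta, so the returned
    basis has at least N - d_theta - d_vartheta columns.
    """
    G = np.vstack([G_policy, G_termination])
    return null_space(G)
```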

4 Reward function from Q-functions

For the MDP model, if two reward functions share the same optimal policy, then they satisfy the following ([19]):
$$R'(s,a,s')=R(s,a,s')+\gamma\Phi(s')-\Phi(s)$$
for some potential function $\Phi$. If we take $\Phi=V^{\pi}$, then the shaped reward becomes, in expectation over $s'$, the advantage function:
$$\bar{R}(s,a)=R(s,a)+\gamma\,\mathbb{E}_{s'}\!\left[V^{\pi}(s')\right]-V^{\pi}(s)=Q^{\pi}(s,a)-V^{\pi}(s)=A^{\pi}(s,a).$$

Because the $Q$-value function depends on the option in the options framework, the potential function $\Phi$ also depends on the option. We thus need to consider reward shaping with respect to the intra-option policy $\pi_{\omega}$, so the reward functions also need to be defined in the intra-option sense and therefore depend on options. This viewpoint is essential to our work and is similar to the approach taken in [9], in which a reward option $R_{\omega}$ was introduced corresponding to the intra-option policy $\pi_{\omega}$. Reward functions $R_{\omega}$ sharing the same intra-option policy $\pi_{\omega}$ satisfy
$$R'_{\omega}(s,a,s')=R_{\omega}(s,a,s')+\gamma\Phi_{\omega}(s')-\Phi_{\omega}(s).$$
If we take $\Phi_{\omega}(s)=Q_{\Omega}(s,\omega)=\sum_{a}\pi_{\omega,\theta}(a\mid s)\,Q(s,\omega,a)$, then the shaped reward becomes, in expectation, the intra-option advantage
$$A(s,\omega,a)=Q(s,\omega,a)-Q_{\Omega}(s,\omega).$$
This provides us with a way to generate reward functions from $Q$-features in the options framework.
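
The sketch below shows one way to carry out this step in a tabular setting: each Q-feature is mapped to its intra-option advantage, which the shaping argument above identifies as a reward feature for the corresponding option. The array layout is an assumption of this example, not the authors' implementation.

```python
import numpy as np

def reward_features_from_q(Psi, pi_w):
    """Map tabular Q-features to option-wise reward features via shaping.

    Psi  : (S, W, A, k) array, k Q-feature functions on (state, option, action)
    pi_w : (W, S, A) intra-option policies
    Returns an (S, W, A, k) array of reward features
        psi(s, w, a) - sum_{a'} pi_w(a'|s) psi(s, w, a'),
    i.e. the intra-option advantages of the Q-features.
    """
    # baseline(s, w, j) = E_{a'~pi_w(.|s)}[psi_j(s, w, a')]
    baseline = np.einsum('wsa,swak->swk', pi_w, Psi)
    return Psi - baseline[:, :, None, :]
```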

5 Reward selection via the second-order optimality condition

Among the linear combinations of the reward features constructed in the previous section, we must select one that maximizes the two expected returns, $\rho$ and $U$. For the purpose of optimization, we use a second-order optimality condition based on the Hessians of $\rho$ and $U$.

Consider a trajectory $\tau=(s_{0},\omega_{0},a_{0},b_{1},s_{1},\omega_{1},a_{1},\dots,s_{T})$ with termination indicators $b_{t}$ and terminal state $s_{T}$. The termination indicator $b_{t}$ is 1 if the previous option terminates at step $t$ and 0 otherwise. The probability density of the trajectory is given by
$$p(\tau;\theta,\vartheta)=p(s_{0})\,\pi_{\Omega}(\omega_{0}\mid s_{0})\prod_{t=0}^{T-1}\pi_{\omega_{t},\theta}(a_{t}\mid s_{t})\,P(s_{t+1}\mid s_{t},a_{t})\,p_{\vartheta}(\omega_{t+1},b_{t+1}\mid\omega_{t},s_{t+1}),$$
where
$$p_{\vartheta}(\omega_{t+1},b_{t+1}\mid\omega_{t},s_{t+1})=\begin{cases}\beta_{\omega_{t},\vartheta}(s_{t+1})\,\pi_{\Omega}(\omega_{t+1}\mid s_{t+1}), & b_{t+1}=1,\\ \bigl(1-\beta_{\omega_{t},\vartheta}(s_{t+1})\bigr)\,\mathbb{1}\{\omega_{t+1}=\omega_{t}\}, & b_{t+1}=0.\end{cases}$$

We denote the set of all possible trajectories by $\mathcal{T}$ and the $\gamma$-discounted trajectory reward by $R(\tau)=\sum_{t=0}^{T-1}\gamma^{t}r_{t+1}$. Then, the objective function can be rewritten as
$$\rho(\theta,\vartheta)=\int_{\mathcal{T}}p(\tau;\theta,\vartheta)\,R(\tau)\,d\tau.$$
Its gradient and Hessian with respect to $\theta$ can be expressed as
$$\nabla_{\theta}\rho=\int_{\mathcal{T}}p(\tau;\theta,\vartheta)\,\nabla_{\theta}\log p(\tau;\theta,\vartheta)\,R(\tau)\,d\tau$$
and
$$\nabla_{\theta}^{2}\rho=\int_{\mathcal{T}}p(\tau;\theta,\vartheta)\Bigl(\nabla_{\theta}\log p(\tau;\theta,\vartheta)\,\nabla_{\theta}\log p(\tau;\theta,\vartheta)^{\top}+\nabla_{\theta}^{2}\log p(\tau;\theta,\vartheta)\Bigr)R(\tau)\,d\tau.$$

The second objective function can be written as
$$U(\omega_{0},s_{1})=\int_{\mathcal{T}_{1}}p(\tau_{1};\theta,\vartheta)\,R(\tau_{1})\,d\tau_{1},$$
where $\tau_{1}$ is a trajectory beginning with $(\omega_{0},s_{1})$ with the probability distribution
$$p(\tau_{1};\theta,\vartheta)=p_{\vartheta}(\omega_{1},b_{1}\mid\omega_{0},s_{1})\prod_{t=1}^{T-1}\pi_{\omega_{t},\theta}(a_{t}\mid s_{t})\,P(s_{t+1}\mid s_{t},a_{t})\,p_{\vartheta}(\omega_{t+1},b_{t+1}\mid\omega_{t},s_{t+1}).$$
Then, its Hessian with respect to $\vartheta$ can be written as
$$\nabla_{\vartheta}^{2}U=\int_{\mathcal{T}_{1}}p(\tau_{1};\theta,\vartheta)\Bigl(\nabla_{\vartheta}\log p(\tau_{1};\theta,\vartheta)\,\nabla_{\vartheta}\log p(\tau_{1};\theta,\vartheta)^{\top}+\nabla_{\vartheta}^{2}\log p(\tau_{1};\theta,\vartheta)\Bigr)R(\tau_{1})\,d\tau_{1}.$$

Let $\hat{r}_{1},\dots,\hat{r}_{q}$ be the reward features constructed in the previous section. Rewrite each Hessian as $\nabla_{\theta}^{2}\rho(w)=\sum_{i}w_{i}H_{\rho,i}$, where $H_{\rho,i}$ is the Hessian of the expected return $\rho$ with respect to $\theta$ for the reward function $\hat{r}_{i}$, and $\nabla_{\vartheta}^{2}U(w)=\sum_{i}w_{i}H_{U,i}$, where $H_{U,i}$ is the Hessian of the expected return $U$ with respect to $\vartheta$ for the reward function $\hat{r}_{i}$, for $i=1,\dots,q$.

We want to determine the reward weight $w$ for the reward function $r_{w}=\sum_{i}w_{i}\hat{r}_{i}$ that yields negative definite Hessians with minimal traces. Geometrically, this corresponds to choosing a reward function that makes the expected returns locally the sharpest maximum points. Here, to relieve the computational burden, we exploit a heuristic method suggested by [17]. Thus, we only keep reward features having negative definite Hessians, compute the trace of each Hessian, and collect them in the vectors $t_{\rho}=(\operatorname{tr}H_{\rho,i})_{i}$ and $t_{U}=(\operatorname{tr}H_{U,i})_{i}$. We determine $w$ by solving the multi-objective problem
$$\min_{\lVert w\rVert_{2}=1}\;\bigl(w^{\top}t_{\rho},\;w^{\top}t_{U}\bigr).$$

Typically, multi-objective optimization problems have no single solution that optimizes all objective functions simultaneously. One well-known approach to tackling this problem is linear scalarization. Thus, we consider the following single-objective problem:
$$\min_{\lVert w\rVert_{2}=1}\;c_{1}\,w^{\top}t_{\rho}+c_{2}\,w^{\top}t_{U}$$
with positive weights $c_{1}$ and $c_{2}$. A closed-form solution is given by $w=-(c_{1}t_{\rho}+c_{2}t_{U})/\lVert c_{1}t_{\rho}+c_{2}t_{U}\rVert_{2}$. With a different choice of scalarization weights $c_{1}$ and $c_{2}$, different reward functions can be produced. In practice, it is difficult to choose a proper weight pair, because the gap between the magnitudes of the two trace vectors can be too large. Alternatively, we consider a weighted sum of the solutions to the two single-objective optimization problems, $w=c_{1}w_{\rho}+c_{2}w_{U}$ with $w_{\rho}=-t_{\rho}/\lVert t_{\rho}\rVert_{2}$ and $w_{U}=-t_{U}/\lVert t_{U}\rVert_{2}$, satisfying $c_{1}+c_{2}=1$ and $c_{1},c_{2}\geq 0$. With this choice, we cannot guarantee that the obtained solution is Pareto optimal.
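
A sketch of this selection step is given below. It follows the heuristic described above (discard features with indefinite Hessians, flip the sign of those with positive semi-definite Hessians, and combine the two trace-based closed-form solutions), but the exact weighting and normalization used in the paper may differ.

```python
import numpy as np

def select_reward_weights(H_rho, H_U, c1=0.5, c2=0.5):
    """Choose reward weights from per-feature Hessians of the two objectives.

    H_rho[i], H_U[i] : estimated Hessians of rho (w.r.t. theta) and U
                       (w.r.t. vartheta) when feature i alone is the reward.
    Returns a weight vector over all features (zero for discarded features).
    """
    kept, tr_rho, tr_U = [], [], []
    for i, (Hp, Hu) in enumerate(zip(H_rho, H_U)):
        ep, eu = np.linalg.eigvalsh(Hp), np.linalg.eigvalsh(Hu)
        if (ep <= 0).all() and (eu <= 0).all():
            sign = 1.0                       # negative semi-definite: keep
        elif (ep >= 0).all() and (eu >= 0).all():
            sign = -1.0                      # positive semi-definite: flip sign
        else:
            continue                         # indefinite: discard
        kept.append((i, sign))
        tr_rho.append(sign * np.trace(Hp))
        tr_U.append(sign * np.trace(Hu))
    if not kept:
        raise ValueError("no reward feature passed the Hessian test")
    tr_rho, tr_U = np.array(tr_rho), np.array(tr_U)
    # closed-form single-objective solutions: minimize t^T w subject to ||w|| = 1
    w_rho = -tr_rho / np.linalg.norm(tr_rho)
    w_U = -tr_U / np.linalg.norm(tr_U)
    w = c1 * w_rho + c2 * w_U                # weighted sum of the two solutions
    w /= np.linalg.norm(w)
    full = np.zeros(len(H_rho))
    for (i, sign), wi in zip(kept, w):
        full[i] = sign * wi
    return full
```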

6 Algorithm

We summarize our algorithm for solving the IRL problem in the options framework as follows:

0:  Input: (1) expert demonstration trajectories; (2) options $\omega\in\Omega$ with parametrized expert intra-option policies $\pi_{\omega,\theta}$ and termination functions $\beta_{\omega,\vartheta}$, and a policy over options $\pi_{\Omega}$.
0:  Output: reward function $\hat{R}$.
Phase 1
1:  Estimate $\nabla_{\theta}\log\pi_{\omega,\theta}(a\mid s)$ for the visited state-action pairs.
2:  Assemble the constraint matrices of equation (1) from these estimates and the occupancy weights.
3:  Compute the null space of the resulting matrix.
4:  Estimate $\nabla_{\vartheta}\beta_{\omega,\vartheta}(s)$ for the visited state-action pairs.
5:  Assemble the constraint matrices of equation (3), including the row-repetition matrix of the policy over options and the block-diagonal intra-option policy matrix described below.
6:  Compute the null space of the resulting matrix.
7:  Find the intersection of the two null spaces.
8:  Get the set of advantage functions from the intersection basis.
Phase 2
9:  Get the set of reward functions by applying option-wise reward shaping.
10: Apply singular value decomposition to orthogonalize the reward features.
Phase 3
11: Estimate the Hessians $H_{\rho,i}$ and $H_{U,i}$ for each reward feature $\hat{r}_{i}$.
12: Discard the reward features having an indefinite Hessian, switch the sign of those having a positive semi-definite Hessian, and compute the trace vectors $t_{\rho}$ and $t_{U}$.
13: Reward function $\hat{R}=\sum_{i}w_{i}\hat{r}_{i}$, with $w$ selected as in Section 5.
Algorithm 1 Option Compatible Reward IRL

Our algorithm consists of three phases. In the first phase, we obtain a basis for the space of Q-features by solving linear equations. These equations consist of two parts: the first is defined by the gradient of the logarithm of the intra-option policies, and the second by the gradient of the option terminations. Two auxiliary matrices are introduced to carry out the computation for the second part: one is the row repetition of the policy over options, $\pi_{\Omega}$, over the visited option-state pairs, and the other is block diagonal, with each block containing the intra-option policy over the visited state-action pairs for the corresponding option.

In the second phase, we obtain a basis for the reward features using option-wise reward shaping. In the last phase, we select the final reward function by applying the Hessian test to the two objective functions.

Our algorithm naturally extends to continuous state and action spaces. In continuous domains, we use a $k$-nearest-neighbors method to extend the recovered reward functions to non-visited state-action pairs. Additional penalization terms can be included. Implementation details are presented in Section 7.2.

7 Experiment

7.1 Taxi

Figure 1: Taxi domain

We test Option CR-IRL in the Taxi domain defined in [4]. The environment is a 5×5 grid world (Figure 1) with four landmarks (R, G, Y, B). The state vector comprises the current position of the taxi, the location of the passenger, and the passenger's destination. The possible actions are movements in the four directions, plus the pick-up and drop-off actions. In each episode, the positions of the taxi, the passenger, and the destination are assigned to random landmarks. The task of the driver is to pick up the passenger and drop him off at the destination. Depending on where the passenger or the destination is located, each subtask involves navigating to one of the four landmarks; we therefore define one option for each landmark subgoal.

The intra-option policy of each option can be induced from the expert policy computed by value iteration by duplicating the policy parameters relevant to its subgoal. The policy over options, $\pi_{\Omega}$, is deterministic, because the subgoal can be directly specified from the current state. This user-defined option set captures the hierarchical structure of the Taxi environment. In addition, we further evaluate our algorithm using the options discovered by the option-critic architecture [2].

Each intra-option policy is parametrized as an $\epsilon$-Boltzmann policy whose features $\phi$ are the state features defined by the current location, the passenger location, the destination location, and whether the passenger has been picked up. The noise level $\epsilon$ is included to evaluate the robustness to imperfect experts. The termination probability $\beta_{\omega,\vartheta}$ is parametrized as a sigmoid function.
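
For concreteness, the sketch below shows one way such a parametrization can look; the exact form of the $\epsilon$-noise mixing and of the feature map are assumptions of this example.

```python
import numpy as np

def eps_boltzmann_policy(theta_w, phi_s, eps):
    """epsilon-Boltzmann intra-option policy over discrete actions.

    theta_w : (A, d) parameters of option w;  phi_s : (d,) state features.
    With probability eps a uniformly random action is taken; otherwise the
    action follows the Boltzmann distribution over theta_w @ phi_s.
    """
    logits = theta_w @ phi_s
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return (1.0 - eps) * p + eps / len(p)

def sigmoid_termination(vartheta_w, phi_s):
    """Termination probability beta_w(s) = sigmoid(vartheta_w . phi(s))."""
    return 1.0 / (1.0 + np.exp(-(vartheta_w @ phi_s)))
```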

For comparison, we give weights to the option-wise reward functions $\hat{R}(\cdot,\omega,\cdot)$ based on the policy over options:
$$\bar{R}(s,a)=\sum_{\omega}\pi_{\Omega}(\omega\mid s)\,\hat{R}(s,\omega,a).$$
This makes it easy to compare against other IRL algorithms, because the rewards assigned to each option are combined while the modified reward maintains the nature of each task. We evaluate Option CR-IRL against behavior cloning (BC), maximum entropy IRL (ME-IRL) [24], and linear programming apprenticeship learning (LPAL) [1]. A natural choice of reward feature for ME-IRL and LPAL is the policy feature $\phi$ defined above. Figures 2 and 3 show the results of training a Boltzmann policy using REINFORCE with the recovered reward function, for the user-defined options and the options discovered by the option-critic architecture, respectively. Each result is averaged over 10 repetitions, and the error bars correspond to the standard deviation. Option CR-IRL converges to the optimal policy faster than the original reward function and ME-IRL when the user-defined options are used; when the discovered options are used, ME-IRL often fails to learn the optimal policy. BC and LPAL are very sensitive to noise, whereas our algorithm is not significantly affected by the noise level. The noise robustness of our algorithm can be explained by the larger dimension of the feature space relative to the non-option CR-IRL algorithm: as explained at the end of Section 3, the dimension of the space of Q-features grows with the number of options and can absorb changes in the defining equations of the Q-features.

Figure 2: Average return in the Taxi domain as a function of the number of REINFORCE iterations, for the user-defined options.
Figure 3: Average return in the Taxi domain as a function of the number of REINFORCE iterations, for the discovered options.

7.2 Car on the Hill

We test Option CR-IRL in the continuous Car-on-the-Hill domain [5]. A car traveling on a hill must reach the top of the hill. The shape of the hill is given by the function
$$\mathrm{Hill}(p)=\begin{cases} p^{2}+p, & p<0,\\ \dfrac{p}{\sqrt{1+5p^{2}}}, & p\geq 0.\end{cases}$$
The state space is continuous and two-dimensional: the position $p$ and the speed $s$ of the car, with $p\in[-1,1]$ and $s\in[-3,3]$. The action $u\in\{-4,4\}$ acts on the car's acceleration. The reward function is defined as
$$r(p_{t+1},s_{t+1})=\begin{cases}-1, & \text{if } p_{t+1}<-1 \text{ or } |s_{t+1}|>3,\\ 1, & \text{if } p_{t+1}>1 \text{ and } |s_{t+1}|\leq 3,\\ 0, & \text{otherwise.}\end{cases}$$
The discount factor $\gamma$ is 0.95, and the initial state is $(p_{0},s_{0})=(-0.5,0)$.

Because the car engine is not strong enough, simply accelerating up the slope cannot bring the car to the desired goal. The entire task can be divided into two subtasks: gaining enough speed at the bottom of the valley to leverage potential energy (subgoal 1), and driving to the top (subgoal 2). To evaluate our algorithm, we introduce a hand-crafted option for each subgoal. The intra-option policies are defined by approximating the deterministic intra-option policies obtained via fitted Q-iteration (FQI) [5] on the two corresponding small MDPs. We consider noisy intra-option policies in which, for each option, a random action is selected with probability $\epsilon$. Each intra-option policy is parametrized as a Gaussian policy $\mathcal{N}(\mu_{\theta}(s),\sigma^{2})$, where $\sigma$ is fixed to 0.01 and the mean $\mu_{\theta}(s)$ is obtained using radial basis functions centered on a uniform grid in the state space. The parameter $\theta$ is estimated using 20 expert trajectories for each option. The termination probability $\beta_{\omega,\vartheta}$ is parametrized as a sigmoid function.
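
The sketch below illustrates such a parametrization; the RBF grid size, kernel width, and variable names are assumptions of this example rather than the exact values used in the experiments.

```python
import numpy as np

def rbf_features(state, centers, width):
    """Radial basis features of a (position, speed) state over grid centers."""
    d2 = ((centers - state[None, :]) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def gaussian_policy_sample(theta_w, state, centers, width, sigma=0.01, rng=None):
    """Sample an action from N(theta_w . phi(s), sigma^2) for option w."""
    rng = np.random.default_rng() if rng is None else rng
    mean = float(theta_w @ rbf_features(state, centers, width))
    return rng.normal(mean, sigma)

# Example: a uniform 5 x 5 grid of RBF centers over position [-1, 1], speed [-3, 3]
centers = np.array([[p, s] for p in np.linspace(-1, 1, 5)
                            for s in np.linspace(-3, 3, 5)])
```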

For comparison, the task-wise reward functions $\hat{R}(\cdot,\omega,\cdot)$ are merged into one reward $\bar{R}$ by omitting the option term. This modification is possible because the policy over options is deterministic in our setting. The merged reward function $\bar{R}$ can then be compared with other reward functions using a non-hierarchical RL algorithm. We extend the recovered reward function to non-visited state-action pairs using a kernel $k$-nearest-neighbors (KNN) regression with a Gaussian kernel:
$$\bar{R}(s,a)=\frac{\sum_{(s_{i},a_{i})\in N_{k}(s,a)}K\bigl((s,a),(s_{i},a_{i})\bigr)\,\bar{R}(s_{i},a_{i})}{\sum_{(s_{i},a_{i})\in N_{k}(s,a)}K\bigl((s,a),(s_{i},a_{i})\bigr)},$$
where $N_{k}(s,a)$ is the set of the $k$ nearest state-action pairs in the demonstrations and $K$ is a Gaussian kernel over $\mathcal{S}\times\mathcal{A}$:
$$K(x,y)=\exp\!\left(-\frac{\lVert x-y\rVert^{2}}{2h^{2}}\right).$$
We also penalize the reward function at non-visited state-action pairs that are far from the visited ones. The penalization term is obtained from a KNN regression with a Gaussian kernel for a state-action occupancy measure; the reward is first scaled to a fixed interval, and the occupancy estimate is computed with the same kernel KNN regression as above. These reward extensions and penalties are based on [17].
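
A sketch of this extension is given below: a Gaussian-kernel k-NN regression extends the recovered reward to unseen state-action pairs, and a second kernel k-NN estimate of the state-action occupancy penalizes queries far from the demonstrations. The precise form of the penalty used in the paper is not reproduced here; the combination below is an assumption for illustration.

```python
import numpy as np

def kernel_knn(query, points, values, k=5, bandwidth=0.1):
    """Gaussian-kernel k-NN regression of `values` at `query`.

    points : (N, d) demonstration state-action pairs, values : (N,) targets.
    """
    d2 = np.sum((points - query) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    w = np.exp(-d2[idx] / (2.0 * bandwidth ** 2))
    return float(w @ values[idx] / max(w.sum(), 1e-12))

def extended_reward(query, demo_sa, demo_reward, demo_occupancy,
                    k=5, bandwidth=0.1):
    """Recovered reward at an unseen (s, a), minus an occupancy-based penalty.

    demo_reward is assumed to be pre-scaled to [0, 1]; demo_occupancy holds an
    estimate of the state-action occupancy at each demonstration pair.
    """
    r = kernel_knn(query, demo_sa, demo_reward, k, bandwidth)
    density = kernel_knn(query, demo_sa, demo_occupancy, k, bandwidth)
    return r - (1.0 - density)   # low estimated occupancy -> larger penalty
```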

The recovered rewards are obtained from expert demonstrations with different levels of noise $\epsilon$. We repeated the evaluation over 10 runs. As shown in Figure 4, FQI with the recovered reward function outperforms the original reward in terms of convergence speed, regardless of the noise level. When $\epsilon=0$, Option CR-IRL converges to the optimal policy in only one iteration. As the noise level increases, BC yields worse performance, whereas Option CR-IRL remains robust to the noise.

Figure 5 displays the trajectories of the expert's policy, the BC policy, and the policy computed via FQI with the reward recovered by Option CR-IRL. When $\epsilon=0$, the trajectories almost overlap. As $\epsilon$ increases, BC requires more steps to reach the termination state, and some runs cannot finish the task properly. On the other hand, our reward function can recover the optimal policy even when the expert demonstrations are not close to optimal.

Figure 4: Average return in the Car-on-the-Hill domain as a function of the number of FQI iterations.
Figure 5: Trajectories of the expert’s policy, the BC policy, and the policy computed via FQI with the reward recovered by Option CR-IRL.

8 Discussion

We developed a model-free IRL algorithm for hierarchical tasks modeled in the options framework. Our algorithm, Option CR-IRL, extracts reward features using first-order optimality conditions based on the gradients with respect to the intra-option policies and termination functions. It then constructs task-wise reward functions from the extracted reward space using a second-order optimality condition. The recovered reward functions explain the expert's behavior and the underlying hierarchical structure.

Most IRL algorithms require hand-crafted reward features, which are crucial to the quality of the recovered reward functions. Our algorithm directly builds an approximate space of reward functions from expert demonstrations. Additionally, unlike other IRL methods, our algorithm does not require solving a forward problem as an inner step.

Some heuristic methods were used to solve the multi-objective optimization problem in the reward-selection step. We used the weighted sum of the solutions obtained from the two separate single-objective optimization problems and empirically found that any combination of weights resulted in good performance. In general, depending on the type of option used, the parameters of the intra-option policies or those of the termination functions can be more sensitive than the other, so the choice of weights can make a difference in the final performance. We also tested the linear scalarization approach, and our algorithm performed well, except in the Taxi domain with the user-defined options. In that case, the two trace vectors computed from the policies and the terminations differed too much in magnitude, so the alternative approach was necessary.

Our algorithm was validated on several classical benchmark domains, but applying it to real-world problems will require experiments in more complex environments. More sophisticated options should be used to better capture the complex nature of a hierarchical task, which would make such extensions easier.

9 Acknowledgement

Hyung Ju Hwang was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1E1A1A03070105, NRF-2019R1A5A1028324) and by the Institute for Information & Communications Technology Promotion (IITP) under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01441).

References

  • [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
  • [2] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [3] Jaedeug Choi and Kee-Eung Kim. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems, pages 305–313, 2012.
  • [4] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • [5] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • [6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
  • [7] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • [8] Thomas Furmston and David Barber. A unifying perspective of parametric policy search methods for Markov decision processes. In Advances in Neural Information Processing Systems, pages 2717–2725, 2012.
  • [9] Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [10] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
  • [11] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012.
  • [12] Sanjay Krishnan, Animesh Garg, Richard Liaw, Lauren Miller, Florian T Pokorny, and Ken Goldberg. HIRL: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv preprint arXiv:1604.06508, 2016.
  • [13] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, and Ken Goldberg. SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. The International Journal of Robotics Research, 38(2-3):126–145, 2019.
  • [14] Sanjay Krishnan, Animesh Garg, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, and Ken Goldberg. Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning. In Robotics Research, pages 91–110. Springer, 2018.
  • [15] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.
  • [16] Richard Liaw, Sanjay Krishnan, Animesh Garg, Daniel Crankshaw, Joseph E Gonzalez, and Ken Goldberg. Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. arXiv preprint arXiv:1711.01503, 2017.
  • [17] Alberto Maria Metelli, Matteo Pirotta, and Marcello Restelli. Compatible reward inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 2050–2059, 2017.
  • [18] Bernard Michini, Thomas Walsh, Ali-akbar Agha-mohammadi, and Jonathan P How. Bayesian nonparametric reward learning from demonstration. IEEE Transactions on Robotics, 31:369–386, 2015.
  • [19] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • [20] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
  • [21] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G Barto, Yael Niv, and Matthew M Botvinick. Optimal behavioral hierarchy. PLoS computational biology, 10(8):e1003779, 2014.
  • [22] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [23] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
  • [24] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Twenty-Third AAAI Conference on Artificial Intelligence, volume 8, pages 1433–1438, 2008.