Meta Inverse Reinforcement Learning via Maximum Reward Sharing for Human Motion Analysis

10/07/2017 ∙ by Kun Li, et al. ∙ California Institute of Technology

This work handles the inverse reinforcement learning (IRL) problem where only a small number of demonstrations are available from a demonstrator for each high-dimensional task, insufficient to estimate an accurate reward function. Observing that each demonstrator has an inherent reward for each state and the task-specific behaviors mainly depend on a small number of key states, we propose a meta IRL algorithm that first models the reward function for each task as a distribution conditioned on a baseline reward function shared by all tasks and dependent only on the demonstrator, and then finds the most likely reward function in the distribution that explains the task-specific behaviors. We test the method in a simulated environment on path planning tasks with limited demonstrations, and show that the accuracy of the learned reward function is significantly improved. We also apply the method to analyze the motion of a patient under rehabilitation.




I Introduction

Inverse reinforcement learning (IRL) [1] algorithms estimate a reward function that explains the motions demonstrated by an operator or other agents on a task described by a Markov Decision Process (MDP) [2]. The recovered reward function can be used by a robot to replicate the demonstrated task [3], or by an algorithm to analyze the demonstrator’s preference [4]. Therefore, IRL algorithms can simplify multi-task robot control by alleviating the need to explicitly set a cost function for each task, and make robots friendlier by personalizing services based on the recovered condition and preference of the operator.

The accuracy of the recovered function depends heavily on the ratio of visited states in the demonstrations to the whole state space, because the demonstrator’s motion policy can be estimated more accurately if every state is repeatedly visited. However, the ratio is low for many useful applications, since they usually have huge or high-dimensional state spaces, while the demonstrations are relatively rare for each task. For example, in a path planning task on a moderately sized grid, the demonstrator chooses paths based on the destination, but may not move to the same destination hundreds of times in practice. For robot manipulation tasks based on ordinary RGB images, each task specifies a final result, but it is expensive to repeat each task millions of times. For human motion analysis, it is physically impractical for a person to follow an instruction thousands of times in the huge state space of human poses. Therefore, it is difficult to estimate an accurate reward function for a single task with limited data.

In practice, multiple tasks can usually be observed from the same demonstrator, and the problem of rare demonstrations can be handled by combining data from all tasks, which leads to a meta-learning problem. Existing meta-learning solutions mainly target classification problems. They use the data from all tasks to learn an optimizer for each task, to learn a metric space in which a single task can be learned more accurately with limited data, or to learn a good initial parameter for each task. Some of these methods are applicable to inverse reinforcement learning problems, but they mainly consider transfer of motion policy.

Fig. 1: Different behaviors under different goal states and goal rewards: Figures (a) and (b) share the same goal state, but the goal reward of Figure (b) is larger than that of Figure (a). Figures (c) and (d) share the same goal state, but the goal reward of Figure (d) is larger than that of Figure (c). Five trajectories are plotted in each figure, where red dots denote the starting points and black dots denote the ending points.

In many IRL applications, we observe that a demonstrator usually has an inherent reward for each state, materialized as the innate state preferences of a human, the hardware-dependent cost function of a robot, the default structure of an environment, etc. For a given task, the demonstrators are usually reluctant to drastically change the inherent reward function to complete the task; instead, they alter the innate reward function minimally to generate a task-specific reward function and plan the motion. For example, in path planning, the C-space of a mobile robot at home rarely changes, and the robot’s motion depends on the goal state; in human motion analysis, the costs of different poses are mostly invariant, while the actual motion depends on the desired directions.

Based on this observation, we propose a meta inverse reinforcement learning algorithm by maximizing the shared rewards among all tasks. We model the reward function for each task as a probabilistic distribution conditioned on an inherent baseline function, and estimate the most likely reward function in the distribution that explains the observed task-specific demonstrations.

We review existing IRL and meta-learning algorithms in Section II, and then introduce the proposed method in Section III. We present a simulated experiment for evaluation and a real-world experiment for application in Section IV, with conclusions in Section V.

II Related Works

The idea of inverse optimal control is proposed by Kalman [5], while the inverse reinforcement learning problem is first formulated in [1], where the agent observes the states resulting from a presumably optimal policy, and tries to learn a reward function that makes the policy better than all alternatives. Since the goal can be achieved by multiple reward functions, that paper tries to find one that maximizes the difference between the observed policy and the second best policy. This idea is extended by [6], under the name of max-margin learning for inverse optimal control. Another extension is proposed in [3], where the purpose is not to recover the real reward function, but to find a reward function that leads to a policy equivalent to the observed one, measured by the amount of rewards collected by following that policy.

Since a motion policy may be difficult to estimate from observations, a behavior-based method is proposed in [7], which models the distribution of behaviors as a maximum-entropy model on the amount of reward collected from each behavior. This model has many applications and extensions. For example, [8] considers a sequence of changing reward functions instead of a single reward function. [9] and [10] consider complex reward functions, instead of linear ones, and use Gaussian processes and neural networks, respectively, to model the reward function. [11] considers complex environments, instead of a well-observed Markov Decision Process, and combines a partially observed Markov Decision Process with reward learning. [12] models the behaviors based on the local optimality of a behavior, instead of the summation of rewards. [13] uses a multi-layer neural network to represent nonlinear reward functions.

Another method is proposed in [14], which models the probability of a behavior as the product of each state-action’s probability, and learns the reward function via maximum a posteriori estimation. However, due to the complex relation between the reward function and the behavior distribution, the authors use computationally expensive Monte-Carlo methods to sample the distribution. This work is extended by [15], which uses sub-gradient methods to simplify the problem. Another extension is shown in [16], which tries to find a reward function that matches the observed behavior. For motions involving multiple tasks and varying reward functions, methods are developed in [17] and [18], which try to learn multiple reward functions.

Most of these methods need to solve a reinforcement learning problem in each step of reward learning, thus practical large-scale application is computationally infeasible. Several methods are applicable to large-scale applications. The method in [1] uses a linear approximation of the value function, but it requires a set of manually defined basis functions. The methods in [10, 19] update the reward function parameter by minimizing the relative entropy between the observed trajectories and a set of sampled trajectories based on the reward function, but they require a set of manually segmented trajectories of human motion, where the choice of trajectory length will affect the result. Besides, these methods solve large-scale problems by approximating the Bellman Optimality Equation, thus the learned reward function and Q function are only approximately optimal. In our previous work, we proposed an approximation method that guarantees the optimality of the learned functions as well as the scalability to large state space problems [20].

To learn a model from limited data, meta-learning algorithms have been developed. A survey of different work is given in [21], which views the meta-learner as a way to improve biases for base-learners. The method in [22] uses a neural memory machine for meta learning. The method in [23] minimizes the representations. The method in [24] learns by gradient descent. The method in [25] learns optimizers.

Meta learning algorithms are also applied to reinforcement learning problems. The method in [26] tunes meta parameters for reinforcement learning, including the learning rate for TD learning, the action selection trade-off, and the discount factor. The method in [27] uses one network to play multiple games. The method in [28] trains reinforcement learning with slower RL. The method in [29] learns a good initial parameter that reaches optimal parameters with limited gradient descent. Meta learning in inverse reinforcement learning focuses on imitation learning, like one-shot imitation learning.


III Meta Inverse Reinforcement Learning

III-A Meta Inverse Reinforcement Learning

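We assume that an agent needs to handle multiple tasks in an environment, denoted by $\{T_i \mid i = 1, \dots, N\}$, where $T_i$ denotes the $i$-th task and $N$ denotes the number of tasks.

We describe a task as a Markov Decision Process, consisting of the following variables:

  • $S$, a set of states

  • $A$, a set of actions

  • $P(s' \mid s, a)$, a state transition function that defines the probability that state $s$ becomes $s'$ after action $a$.

  • $r(s)$, a reward function that defines the immediate reward of state $s$.

  • $\gamma$, a discount factor that ensures the convergence of the MDP over an infinite horizon.

For a task $T_i$, the agent performs a set of demonstrations $\zeta_i$, represented by sequences of state-action pairs:

$$\zeta_i = \{(s_t^i, a_t^i) \mid t = 1, \dots, K_i\},$$

where $K_i$ denotes the length of the sequence $\zeta_i$. Given the observed sequences $\{\zeta_i \mid i = 1, \dots, N\}$ for the $N$ tasks, inverse reinforcement learning algorithms try to recover a reward function $r_i$ for each task.

Our key observation in multi-task IRL is that the demonstrator has an inherent reward function $r_b$, generating a baseline reward for each state in all tasks. To complete the $i$-th task, the agent generates a reward function $r_i$ from a distribution $P(r_i \mid r_b)$ conditioned on $r_b$ to plan the motion. Therefore, the motion is generated as:

$$r_i \sim P(r_i \mid r_b), \qquad \zeta_i \sim P(\zeta_i \mid r_i).$$

For the $i$-th task, we want to find the most likely $r_i$ sampled from $P(r_i \mid r_b)$ that explains the demonstration $\zeta_i$. Assuming all the tasks are independent from each other, the following joint distribution is formulated:

$$P(\zeta_1, \dots, \zeta_N, r_1, \dots, r_N \mid r_b) = \prod_{i=1}^{N} P(\zeta_i \mid r_i) \, P(r_i \mid r_b).$$

The reward functions can be found via maximum-likelihood estimation:

$$\max_{r_b,\, r_i \in \mathcal{F}} \prod_{i=1}^{N} P(\zeta_i \mid r_i) \, P(r_i \mid r_b) \;\Longleftrightarrow\; \min_{r_b,\, r_i \in \mathcal{F}} \sum_{i=1}^{N} \left[ L_{IRL}(\zeta_i \mid r_i) + L_{share}(r_i \mid r_b) \right], \tag{1}$$

where $\mathcal{F}$ denotes a function space, $L_{IRL}(\zeta_i \mid r_i)$ is the negative log-likelihood of $P(\zeta_i \mid r_i)$, and $L_{share}(r_i \mid r_b)$ is the negative log-likelihood of $P(r_i \mid r_b)$.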

III-B Loss for Inverse Reinforcement Learning

While many solutions exist for the inverse reinforcement learning problem, we adopt the solution based on function approximation developed in our earlier work [20] to handle the practical high-dimensional state spaces.

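The core idea of the method is to approximate the Bellman Optimality Equation [2] with a function approximation framework. The Bellman Optimality Equation is given as:

$$V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a) \left[ r(s') + \gamma V^*(s') \right], \tag{2}$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s') + \gamma \max_{a' \in A} Q^*(s', a') \right]. \tag{3}$$

It is computationally prohibitive to solve in high-dimensional state spaces.

Instead, we introduce a parameterized VR function that describes the sum of the reward function and the discounted optimal value function:

$$f(s, \theta) = r(s) + \gamma V^*(s), \tag{4}$$

where $\theta$ denotes the parameter of the VR function. The function value of a state is called its VR value.

Substituting Equation (4) into the Bellman Optimality Equation, the optimal Q function is given as:

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \, f(s', \theta), \tag{5}$$

the optimal value function is given as:

$$V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a) \, f(s', \theta), \tag{6}$$

and the reward function can be computed as:

$$r(s) = f(s, \theta) - \gamma \max_{a \in A} \sum_{s'} P(s' \mid s, a) \, f(s', \theta). \tag{7}$$

This framework avoids solving the Bellman Optimality Equation. Besides, this formulation can be generalized to other extensions of the Bellman Optimality Equation by replacing the $\max$ operator with other types of Bellman backup operators. For example, the soft maximum $\log \sum_{a} \exp(\cdot)$ is used in the maximum-entropy method [7]; a differentiable approximation of the $\max$ operator, such as the generalized soft maximum $\frac{1}{k} \log \sum_{a} \exp(k \, \cdot)$, is used in Bellman Gradient Iteration [31].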
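The relation between VR values, Q values, value function, and reward can be checked on a toy problem. The following is a minimal sketch, assuming that the VR value of a state equals its reward plus the discounted optimal value, and that the Bellman backup averages VR values over next states. The tabular transition table `P[s][a][s2]`, the three-state chain, and the function names are illustrative assumptions, not part of the original work, which uses a parameterized function instead of a table.

```python
def q_from_vr(P, f):
    """Q*(s, a) = sum over s' of P(s'|s, a) * f(s')."""
    return [[sum(p * f[s2] for s2, p in enumerate(row)) for row in P_s] for P_s in P]


def v_from_q(Q):
    """V*(s) = max over a of Q*(s, a)."""
    return [max(q_s) for q_s in Q]


def reward_from_vr(P, f, gamma):
    """r(s) = f(s) - gamma * V*(s)."""
    V = v_from_q(q_from_vr(P, f))
    return [f_s - gamma * v_s for f_s, v_s in zip(f, V)]


# Hypothetical 3-state chain with two deterministic actions: 0 = stay, 1 = move right.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
gamma = 0.5
true_r = [0.0, 0.0, 1.0]

# Value iteration is used here only to build consistent VR values for the check.
V = [0.0, 0.0, 0.0]
for _ in range(200):
    V = [max(sum(p * (true_r[s2] + gamma * V[s2]) for s2, p in enumerate(row))
             for row in P_s) for P_s in P]

f = [r + gamma * v for r, v in zip(true_r, V)]   # VR values
recovered_r = reward_from_vr(P, f, gamma)        # matches true_r up to rounding
```

In this sketch the reward is recovered from the VR values with a single backup, without iterating the Bellman equation again.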

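To apply this framework to IRL problems, this work chooses a motion model based on the optimal Q function [14]:

$$P(a \mid s) = \frac{\exp\left(b \, Q^*(s, a)\right)}{\sum_{\tilde{a} \in A} \exp\left(b \, Q^*(s, \tilde{a})\right)},$$

where $b$ is a parameter controlling the degree of confidence in the agent’s ability to choose actions based on Q values. Other models can also be used, like the one in [7].

Assuming the approximation function is a neural network whose parameter $\theta$ consists of the weights and biases, the negative log-likelihood of $P(\zeta \mid \theta)$ is given by:

$$L_{IRL}(\theta) = -\sum_{(s, a) \in \zeta} \left( b \, Q^*(s, a) - \log \sum_{\tilde{a} \in A} \exp\left(b \, Q^*(s, \tilde{a})\right) \right),$$

where the optimal Q function is given by Equation (5). After estimating the parameter $\theta$, the value function and reward function can be computed with Equations (4), (6), and (7).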
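A small numerical sketch may help make the motion model concrete. It assumes a softmax over Q values scaled by a confidence parameter `b`, and sums the negative log-probabilities of the demonstrated state-action pairs. The Q table, the demonstrations, and the value `b = 5.0` are hypothetical values chosen for illustration.

```python
import math


def action_log_probs(q_s, b):
    """Log of the softmax action distribution at one state, with confidence b."""
    scaled = [b * q for q in q_s]
    m = max(scaled)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def irl_nll(Q, demos, b):
    """Negative log-likelihood of the observed (state, action) pairs."""
    return -sum(action_log_probs(Q[s], b)[a] for s, a in demos)


# Hypothetical Q values for 3 states and 2 actions.
Q = [[0.5, 1.0], [1.0, 2.0], [2.0, 2.0]]

# A demonstration that follows the larger Q value in every state ...
nll_greedy = irl_nll(Q, [(0, 1), (1, 1), (2, 0)], b=5.0)
# ... and one that picks the smaller Q value in states 0 and 1.
nll_suboptimal = irl_nll(Q, [(0, 0), (1, 0), (2, 0)], b=5.0)
```

Under this model, demonstrations that agree with the Q values receive a lower loss, which is what drives the parameter update during learning.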

III-C Loss for Reward Sharing

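Since the demonstrator makes minimal changes to adapt the inherent reward function $r_b$ into the task-specific one $r_i$, we model the distribution $P(r_i \mid r_b)$ as:

$$P(r_i \mid r_b) = \frac{1}{Z} \exp\left(-D(r_i, r_b)\right),$$

where $D(r_i, r_b)$ measures the difference between $r_i$ and $r_b$. Thus the loss function for reward sharing is given as:

$$L_{share}(r_i \mid r_b) = D(r_i, r_b) + \log Z,$$

where $Z$ is the partition function and remains the same for all $r_i$.

We test several functions as $D(r_i, r_b)$. The first choice is the L2 loss, where

$$D(r_i, r_b) = \sum_{s} d(s)^2, \qquad d = \{ r_i(s) - r_b(s) \},$$

where $d$ denotes the set of differences, evaluated on the full state space or only the visited states.

The second choice is the Huber loss with threshold $\delta$, a differentiable approximation of the L1 loss popular in sparse models:

$$D(r_i, r_b) = \sum_{s} h_\delta\left(d(s)\right), \qquad h_\delta(x) = \begin{cases} \frac{1}{2} x^2, & |x| \le \delta, \\ \delta \left( |x| - \frac{1}{2} \delta \right), & \text{otherwise.} \end{cases}$$

The third choice is the standard deviation:

$$D(r_i, r_b) = \sqrt{\frac{1}{|d|} \sum_{s} \left( d(s) - \bar{d} \right)^2}, \qquad \bar{d} = \frac{1}{|d|} \sum_{s} d(s).$$

The fourth choice is the information entropy, after converting $d$ into a probability distribution with the softmax function:

$$D(r_i, r_b) = -\sum_{s} p(s) \log p(s), \qquad p(s) = \frac{\exp\left(d(s)\right)}{\sum_{s'} \exp\left(d(s')\right)}.$$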
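The four candidate measures can be compared side by side on a small example. This is a sketch under stated assumptions: the differences are taken as task reward minus baseline reward per state, the L2 loss is the sum of squared differences, and the Huber threshold `delta = 1.0` is a hypothetical value, since the value used in the experiments is not available here. The function names are illustrative.

```python
import math


def differences(r_task, r_base):
    """Per-state difference between a task reward and the baseline reward."""
    return [a - b for a, b in zip(r_task, r_base)]


def l2_loss(d):
    return sum(x * x for x in d)


def huber_loss(d, delta=1.0):
    """Quadratic near zero, linear beyond delta (assumed threshold)."""
    return sum(0.5 * x * x if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)
               for x in d)


def std_loss(d):
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))


def entropy_loss(d):
    """Entropy of softmax(d); terms with zero probability contribute nothing."""
    m = max(d)
    w = [math.exp(x - m) for x in d]
    z = sum(w)
    return -sum((x / z) * math.log(x / z) for x in w if x > 0.0)


# Hypothetical case: the task reward differs from the baseline only at one goal state.
d = differences([0.0, 0.0, 0.0, 3.0], [0.0, 0.0, 0.0, 0.0])
l2, huber, std, entropy = l2_loss(d), huber_loss(d), std_loss(d), entropy_loss(d)

# A constant offset between task and baseline reward is free under the
# standard deviation, but gives the maximum entropy log(4) for four states.
std_of_offset = std_loss([2.0, 2.0, 2.0, 2.0])
entropy_of_offset = entropy_loss([2.0, 2.0, 2.0, 2.0])
```

For a single large difference, the Huber loss grows linearly while the L2 loss grows quadratically, so the Huber loss penalizes a change concentrated on a few key states less severely.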

With the loss functions for IRL and reward sharing, the reward functions can be learned via gradient-based optimization. The algorithm is shown in Algorithm 1.

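Algorithm 1 Meta IRL
  Data: $\zeta_1, \dots, \zeta_N$
  Result: reward values $r_b, r_1, \dots, r_N$
  create variable $\theta_b$ for a neural network
  create variable $\theta_i$ for each task
  initialize $\theta_b, \theta_1, \dots, \theta_N$
  while not converged do
      update $\theta_b, \theta_1, \dots, \theta_N$ based on optimization (1)
  end while
  compute and return all $r_b, r_1, \dots, r_N$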
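The loop in Algorithm 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: it swaps the neural network for tabular VR values per task, swaps backpropagation with Adam for finite-difference gradients with plain gradient descent, weights the two losses equally, and drops the constant partition term. The four-state chain, the demonstrations, the step size, and all names are hypothetical.

```python
import math

N_STATES, ACTIONS, GAMMA, B = 4, (-1, +1), 0.9, 1.0


def step(s, a):
    """Deterministic chain: move left or right, clipped at both ends."""
    return min(max(s + a, 0), N_STATES - 1)


def q_values(f):
    """With deterministic transitions, Q(s, a) is the VR value of the next state."""
    return [[f[step(s, a)] for a in ACTIONS] for s in range(N_STATES)]


def rewards(f):
    """r(s) = f(s) - gamma * max_a Q(s, a)."""
    return [f[s] - GAMMA * max(q_s) for s, q_s in enumerate(q_values(f))]


def irl_loss(f, demos):
    """Negative log-likelihood of (state, action index) pairs under a softmax policy."""
    Q, total = q_values(f), 0.0
    for s, a_idx in demos:
        scaled = [B * q for q in Q[s]]
        m = max(scaled)
        total -= scaled[a_idx] - (m + math.log(sum(math.exp(x - m) for x in scaled)))
    return total


def huber(x, delta=1.0):
    return 0.5 * x * x if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)


def total_loss(params, all_demos):
    """params = [r_b, f_1, ..., f_N], each a list with one entry per state."""
    r_b, fs = params[0], params[1:]
    return sum(irl_loss(f, demos) + sum(huber(r - rb) for r, rb in zip(rewards(f), r_b))
               for f, demos in zip(fs, all_demos))


def numeric_grad(params, all_demos, eps=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = [[0.0] * N_STATES for _ in params]
    for k, vec in enumerate(params):
        for s in range(N_STATES):
            old = vec[s]
            vec[s] = old + eps
            up = total_loss(params, all_demos)
            vec[s] = old - eps
            down = total_loss(params, all_demos)
            vec[s] = old
            grad[k][s] = (up - down) / (2 * eps)
    return grad


all_demos = [
    [(0, 1), (1, 1), (2, 1)],  # task 1: walk right towards state 3
    [(3, 0), (2, 0), (1, 0)],  # task 2: walk left towards state 0
]
params = [[0.0] * N_STATES for _ in range(1 + len(all_demos))]
initial_loss = total_loss(params, all_demos)
for _ in range(200):
    grad = numeric_grad(params, all_demos)
    for vec, g in zip(params, grad):
        for s in range(N_STATES):
            vec[s] -= 0.05 * g[s]
final_loss = total_loss(params, all_demos)
baseline_reward, task_rewards = params[0], [rewards(f) for f in params[1:]]
```

The baseline reward and the task-specific VR values are updated jointly, so the evidence from every task shapes the shared baseline, which is the point of the reward sharing term.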

IV Experiments

IV-A Path Planning

We consider a path planning problem on an uneven terrain, where an agent can observe the whole terrain to find the optimal paths from random starting points to arbitrary goal points, but a mobile robot can only observe the agent’s demonstrations to learn how to plan paths. Given a starting point and a goal point, an optimal path depends solely on the costs to move across the terrain. To learn the costs, we formulate a Markov Decision Process for each goal point, where a state denotes a small region of the terrain and an action denotes a possible movement. The reward of a state equals the negative of the cost to move across the corresponding region, while the goal state has an additional reward to attract movements.

In this work, we create a discretized terrain with several hills, where each hill is defined as a peak of the cost distribution and the costs around each hill decay exponentially. The true cost of a region is the sum of the costs from all hills. Ten worlds are randomly generated, and in each world, ten tasks are generated, each with a different goal state. For each task, the agent demonstrates ten trajectories, where the length of a trajectory depends on the number of steps needed to reach the goal state.
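A terrain of this kind could be generated as in the following sketch. The grid size, the number of hills, the hill heights, the decay rate, and the goal bonus are all hypothetical values, and decay in Euclidean distance is an assumption, since the exact settings of the experiment are not stated here.

```python
import math
import random


def make_terrain(size, hills, decay):
    """Cost of each cell = sum over hills of height * exp(-decay * distance to peak)."""
    cost = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            for h_row, h_col, height in hills:
                dist = math.hypot(row - h_row, col - h_col)
                cost[row][col] += height * math.exp(-decay * dist)
    return cost


rng = random.Random(0)  # fixed seed so the example is reproducible
size = 10
hills = [(rng.randrange(size), rng.randrange(size), rng.uniform(1.0, 3.0))
         for _ in range(3)]
cost = make_terrain(size, hills, decay=0.5)

# One task: the reward is the negative cost, plus a bonus at the goal state.
goal = (rng.randrange(size), rng.randrange(size))
goal_bonus = 5.0
reward = [[-c for c in row] for row in cost]
reward[goal[0]][goal[1]] += goal_bonus
```

Different tasks in the same world would share `cost` and differ only in `goal`, which matches the setting where the baseline reward is common and only a few key states change.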

We evaluate the proposed method with different reward sharing loss functions under different numbers of tasks and trajectories. The evaluated settings include no reward sharing, and reward sharing with standard deviation, information entropy, L2 loss, and Huber loss. The number of tasks ranges from 1 to 16, and for each task, the number of trajectories ranges from 1 to 10. The optimizer is Adam with a learning rate of 0.01. The accuracy of a learned reward is computed as the correlation coefficient between the learned reward function and the ground-truth one. The results are shown in Figure 2.
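The accuracy measure can be written out explicitly. The sketch below assumes the correlation coefficient is the Pearson correlation computed over per-state reward values; the reward vectors are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


true_reward = [-1.0, -2.0, -0.5, 3.0]
# A learned reward that is a scaled and shifted copy of the ground truth.
learned_reward = [2.0 * r + 5.0 for r in true_reward]
accuracy = pearson_correlation(learned_reward, true_reward)
# A sign-flipped copy is perfectly anti-correlated.
flipped_accuracy = pearson_correlation([-r for r in true_reward], true_reward)
```

Because the Pearson correlation is invariant to a positive scale and a constant offset, a learned reward that differs from the ground truth only in those two respects still receives the maximum score.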

The results show that the meta-learning step significantly improves the accuracy of reward learning, and the Huber loss leads to the best performance on average. L2 loss and standard deviation perform similarly, which is not surprising. However, information entropy performs poorly.

Fig. 2: Results with the five reward sharing loss functions on up to 16 tasks, with at most 10 demonstrations for each task, in 10 environments. Panels (a) to (p) show the results for 1 to 16 tasks, respectively.

IV-B Motion Analysis

During rehabilitation, a patient with spinal cord injuries sits on a box, with a flat plate force sensor mounted on the box to capture the center-of-pressure (COP) of the patient during movement. Each experiment is composed of two sessions, one without transcutaneous stimulation and one with stimulation. The electrode configuration and stimulation signal pattern are manually selected by the clinician [32].

In each session, the physician gives eight (or four) directions for the patient to follow, including left, forward left, forward, forward right, right, right backward, backward, backward left, and the patient moves continuously to follow the instruction. The physician observes the patient’s behaviors and decides the moment to change the instruction.

Six experiments are conducted, each with two sessions. The COP trajectories in Figure 3 correspond to the experiment with four directional instructions; Figures 4, 5, 6, 7, and 8 correspond to the experiments with eight directional instructions.

Fig. 3: Patient 1 under four directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.
Fig. 4: Patient 2 under eight directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.
Fig. 5: Patient 3 under eight directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.
Fig. 6: Patient 4 under eight directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.
Fig. 7: Patient 5 under eight directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.
Fig. 8: Patient 6 under eight directional instructions: “unstimulated motion” means that the patient moves without transcutaneous stimulations, while “stimulated motion” represents the motion under stimulations.

The COP sensory data from each session is discretized on a $100 \times 100$ grid, which is fine enough to capture the patient’s small movements. The problem is formulated as an MDP, where each state captures the patient’s discretized location and velocity, and the set of actions changes the velocity into eight possible directions. The velocity is represented with a two-dimensional vector showing eight possible velocity directions. Thus the problem has 80000 states and 8 actions, and each action is assumed to lead to a deterministic state.
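One way to index such a state space is sketched below. It assumes a grid of 100 cells per axis, which together with eight velocity directions gives the 80000 states mentioned in the text. The ordering of the directions, the clipping at the grid border, and the rule that the location moves one cell along the newly chosen direction are illustrative assumptions, not details taken from the experiment.

```python
GRID = 100  # assumed cells per axis: 100 * 100 * 8 = 80000 states
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def encode(x, y, d):
    """Map a grid cell (x, y) and a velocity direction index d to a state index."""
    return (x * GRID + y) * len(DIRECTIONS) + d


def decode(state):
    """Inverse of encode."""
    cell, d = divmod(state, len(DIRECTIONS))
    x, y = divmod(cell, GRID)
    return x, y, d


def transition(state, action):
    """Deterministic step: the action sets the velocity direction,
    then the location moves one cell along it, clipped at the border."""
    x, y, _ = decode(state)
    dx, dy = DIRECTIONS[action]
    nx = min(max(x + dx, 0), GRID - 1)
    ny = min(max(y + dy, 0), GRID - 1)
    return encode(nx, ny, action)


n_states = GRID * GRID * len(DIRECTIONS)
n_actions = len(DIRECTIONS)
next_state = transition(encode(10, 10, 0), 2)  # move along direction (0, 1)
```

With a deterministic transition like this, the sum over next states in the Bellman backup reduces to a single lookup, which keeps the computation cheap even with 80000 states.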

      forward       backward      left          right         top left      top right     bottom left   bottom right  origin
1u    0.411741      0.257564      0.0691989     -0.210216     n/a           n/a           n/a           n/a           0.49016
1s    0.200355      0.486723      0.129839      0.436533      n/a           n/a           n/a           n/a           0.207188
2u    0.161595      -0.17814      0.153376      -0.16767      0.162906      0.105993      -0.0211192    -0.220457     0.156034
2s    -0.0310265    -0.0803484    0.0474505     -0.00146299   0.0442916     0.0874981     0.00668849    0.0742221     0.0726437
3u    0.362801      -0.2995       0.245916      -0.178778     0.386421      0.0148849     -0.00335653   -0.385605     0.0507719
3s    -0.265834     0.146516      0.379665      -0.272437     0.138805      -0.2683       0.212331      0.00301386    -0.182916
4u    0.301472      -0.281474     0.377787      -0.320403     0.410212      -0.119599     0.136309      -0.306677     0.171433
4s    -0.104719     0.0930068     0.327783      -0.229091     0.175432      -0.161819     0.323862      -0.0521654    -0.202197
5u    0.360293      -0.311692     -0.253715     0.260426      0.0863029     0.495134      -0.38137      -0.140836     -0.160687
5s    -0.212823     0.0414435     0.0908994     -0.124174     0.00414109    -0.107462     0.122018      0.0453461     0.145686
6u    -0.0416432    0.0570847     0.210028      -0.104113     0.0363181     -0.0672399    0.0704143     -0.00392284   0.190253
6s    -0.157148     0.178879      0.0880393     -0.0718817    -0.102579     -0.298918     0.307328      0.171319      0.359168
TABLE I: Evaluation of the learned rewards. Column names such as “forward” denote the instructed direction. In a row name such as “1u”, the number is the patient ID, “u” denotes the unstimulated session, and “s” denotes the stimulated session. Each entry is the correlation coefficient between the ideal reward and the recovered reward.

To learn the reward function from the observed trajectories based on the formulated MDP, we use the coordinate and velocity direction of each grid cell as the feature, and learn the reward function parameter from each set of data after segmentation based on peak detection on distances from the origin. The function approximator is a neural network with three hidden layers and nodes. The Huber loss function is used in reward sharing, and the result is shown in Table I.

The results show that the patient’s ability to follow instructions varies among different directions, and the values will assist physicians in designing the stimulation signals.

V Conclusions

This work proposes a method to learn an accurate reward function for each task from limited demonstrations by the same demonstrator, by maximizing the shared rewards among different tasks. We propose several loss functions to maximize the shared reward, and compare their accuracies in a simulated environment. The results show that the Huber loss has the best performance.

In future work, we will apply the proposed method to imitation learning.