Learning from Sparse Demonstrations

08/05/2020 ∙ by Wanxin Jin, et al. ∙ Purdue University

This paper proposes an approach that enables a robot to learn an objective function from sparse demonstrations of an expert. The demonstrations are given by a small number of sparse waypoints; the waypoints are desired outputs of the robot's trajectory at certain time instances, sparsely located within a demonstration time horizon. The duration of the expert's demonstration may be different from the actual duration of the robot's execution. The proposed method jointly learns an objective function and a time-warping function such that the robot's reproduced trajectory has minimal distance to the sparse demonstration waypoints. Unlike existing inverse reinforcement learning techniques, the proposed approach uses the differential Pontryagin's maximum principle, which allows direct minimization of the distance between the robot's trajectory and the sparse demonstration waypoints and enables simultaneous learning of an objective function and a time-warping function. We demonstrate the effectiveness of the proposed approach in various simulated scenarios. We apply the method to learn motion planning/control of a 6-DoF maneuvering unmanned aerial vehicle (UAV) and a robot arm in environments with obstacles. The results show that a robot is able to learn a valid objective function to avoid obstacles with few demonstrated waypoints.




I Introduction

The appeal of learning from demonstrations (LfD) lies in its capability to facilitate robot programming by simply providing demonstrations from an expert. It circumvents the need for expertise in controller design and coding, which is required by traditional robot programming, and empowers non-experts to program a robot as needed [39]. LfD has been successfully applied to various scenarios such as manufacturing [8], assistive robots [26], and autonomous vehicles [18].

LfD techniques can be broadly categorized by what is learned from the observed demonstrations. A branch of LfD focuses on learning a policy [33, 10, 6, 37, 41], which directly maps from the robot’s states, environment, or raw observation information to the robot’s actions, based on supervised machine learning techniques. While effective in many situations, policy learning typically requires a considerable amount of demonstration data, and the learned policy may generalize poorly to unseen or long-horizon tasks [39]. To alleviate this, another direction of LfD research focuses on learning control objective (e.g., cost or reward) functions from demonstrations [1], based on which the optimal policies or trajectories are derived. These methods assume the optimality of demonstrations and use inverse reinforcement learning (IRL) [29] or inverse optimal control (IOC) [27] to estimate the control objective function. Since an objective function is a more compact and high-level representation of a task, LfD via learning objective functions has demonstrated advantages over policy learning in terms of better generalization to unseen situations [24] and relatively lower sample complexity [1]. Despite significant progress along this direction, LfD based on objective learning still inherits some limitations from the core IRL and IOC techniques, which are summarized below.

  • Most IOC/IRL techniques require entire demonstrations of a complete task [38, 42, 25, 16, 36, 11]. Such requirements make it challenging to collect demonstration data, especially for high degree-of-freedom systems such as humanoid robots.

  • The majority of existing IOC and IRL methods [38, 42, 25, 16, 36, 11] assume an objective function as a linear combination of selected features, and their algorithms are designed in the feature space by taking advantage of the linearity of the feature weights [15]. Those approaches typically do not directly minimize the discrepancy between the robot’s reproduced trajectory and the demonstrations in trajectory space, and cannot be readily extended to non-linear parameterization of an objective function.

  • There may be a time-scale discrepancy between the expert’s demonstrations and the actual actuation of the robot [17]. For instance, consider a robot that learns from human motion. The duration of the human demonstration may not reflect the dynamic constraints of the robot, as the robot may be actuated by a weak servo motor and cannot move as fast as the human.

Fig. 1: Illustration of learning from sparse demonstrations. The red dots are the expert’s sparse demonstration waypoints, from which the robot learns a control objective function such that its reproduced trajectory (blue line) is closest to these waypoints. At first sight, the robot’s reproduced trajectory (blue line) may seem to be the result of a ‘curve fitting’ method (which inherently belongs to policy learning methods); however, a key difference from ‘curve fitting’ is that the robot here learns a control objective function instead of imitating a trajectory, and the learned control objective function is generalizable to unseen situations, such as new initial conditions or longer time horizons. Video demos are available at https://wanxinjin.github.io/posts/lfsd.

In recognition of these limitations, in this paper we propose a new method to learn from sparse demonstrations, which has the following advantages over existing methods:

  • First, the proposed method learns an objective function using only sparse demonstration data, which consists of a small number of desired outputs of the robot’s trajectory at some sparse time instances within a time horizon, as shown in Fig. 1. The given sparse demonstrations do not necessarily contain control input information.

  • Second, the proposed method learns an objective function over a parameterized function set by directly minimizing the distance between the robot’s reproduced trajectory and the sparse demonstrations. Even though the demonstrations may not correspond to an exact objective function within the parameterized function set, e.g., demonstrations are not optimal or even randomly given, the method can still find a ‘best’ objective function within the parameterized function set such that the reproduced trajectory is closest to the given demonstrations in Euclidean distance, as shown in Fig. 1.

  • Third, since the time requirement associated with the sparse waypoints may not be achievable with the robot actuation, in addition to learning an objective function, the proposed method jointly learns a time-warping function, which maps from the demonstration time axis to the robot execution time axis. This addresses the potential issue of time misalignment for existing IOC/IRL methods.

I-A Background and Related Work

Since the theme of the method devised in this paper belongs to the category of LfD based on learning objective functions, here we mainly focus on the related work of IRL/IOC methods, which share the same goal of learning objective functions from demonstrations. For the other types of LfD methods, e.g., policy learning, or the comparison between them, we refer the reader to recent surveys [39] and [31] for more details.

Over the past decades, various IRL and IOC techniques have been proposed, with different work emphasizing different formulations to infer an objective function. One popular method is feature matching [1], in which the objective function is updated towards matching the feature values of demonstrations with one of the reproduced trajectories. Another method is maximum margin [38], in which an objective function is solved by maximizing the margin between the objective values of demonstrations and those of the reproduced trajectories. Lastly, maximum entropy [42] is a method that finds an objective function such that the trajectory distribution has the maximum entropy while subject to empirical feature values of the observed demonstrations.

Another line of IOC/IRL research [16, 36, 11, 15, 14] directly solves for the objective function parameters by establishing optimality conditions, such as Karush-Kuhn-Tucker conditions [19] or Pontryagin’s maximum principle [34]. The key idea is that the demonstration data is assumed to be optimal and thus must satisfy the optimality conditions. By directly minimizing the violation of the optimality conditions by the demonstration data over objective function parameters, one can obtain an estimate of the objective function. The benefit of doing so is that these methods avoid repeatedly solving the direct optimal control or reinforcement learning problem in each iteration, but a potential drawback is that they may not be robust to demonstrations that deviate significantly from the optimal ones.

All the above IOC and IRL techniques assume a linear combination of selected features as their parameterized objective functions with unknown feature weights. Learning objective functions is not formulated on a trajectory space, that is, they do not directly minimize discrepancy between the reproduced trajectory and demonstrations. Instead, they design algorithms in the selected feature space by taking advantage of the linearity of the feature weights. For example, the maximum margin IRL [38] and feature matching IRL [1] focus on maximizing and equaling the feature values between the given demonstrations and the reproduced trajectories, respectively. The recent work in [9] attempts to formulate the IOC problem as minimization of direct loss. However, the algorithm is still similar to the maximum margin approach in a selected-feature space. In [25], the authors use a double-layer optimization to solve the IOC problem and directly minimize a trajectory loss function. In the upper layer of updating the objective function, a derivative-free method [35] is used by approximating the loss function with a quadratic function. This involves multiple evaluations of the loss and thus requires solving the optimal control problem repetitively in each update, which is computationally expensive. More importantly, the derivative-free method has inherent limitations for handling problems of large size [40]. For the second line of IOC methods, they solve the feature weights by minimizing the extent to which the demonstrations violate the optimality conditions, and thus still only indirectly consider trajectory error.

Learning objective functions in a linear feature space may facilitate the design of learning algorithms (such as by taking advantage of linearity of feature weights), but their performance heavily relies on the choice of features. Many IOC approaches [16, 36, 11] assume the optimality of demonstration data; that is, the observed demonstrations are a result of optimizing the parameterized objective functions. However, this assumption is subject to observation noise and good feature selection, and recent work [4] shows that large data noise is likely to lead to failure of the methods.

The other challenges of existing IOC and IRL techniques are listed below. First, existing methods require as input the continuous demonstration data of an entire task; in other words, a given demonstration needs to be a complete trajectory over the entire course of execution time. Thus, demonstration data needs to be carefully collected from an expert, which can be burdensome especially for high-dimensional systems. Instead, it is relatively easier to provide only sparse demonstrations. Although [15] proposes a method to solve IOC from incomplete trajectory data, it still requires a trajectory segment to be long enough to satisfy a recovery condition and thus cannot handle very sparse demonstrations as shown in Fig. 1. In [2], the authors develop a method for learning from keyframe demonstrations. This method is a policy learning technique: it learns a kinematic trajectory model (Gaussian mixture models) instead of learning an objective function. The unseen motion between keyframes is handled by interpolation. Such a process leads to poor generalization and high sample complexity (we will show this later in experiments). Another limitation of existing IOC and IRL methods is that they rarely account for the time misalignment between the demonstrations and the feasible actuation capabilities of a robot. This is critical in practical implementation. For example, consider a humanoid robot that learns to imitate a human demonstrator. The robot may be actuated by a weak servo motor which may not move as fast as a human. The demonstrations thus cannot be directly used for objective function learning. To address this, [17] learns a time-warping function between a robot and a demonstrator, but this method is used to align the time of a demonstrated trajectory for optimal tracking instead of learning objective functions.

I-B Contributions

We propose a new approach to learn objective functions from sparse demonstrations. The contributions of the method relative to existing IRL/IOC methods are listed below.

  • The proposed method learns an objective function by directly minimizing a trajectory loss, which quantifies the discrepancy between a robot’s reproduced trajectory and the observed demonstrations. Unlike [25], which uses derivative-free techniques [35], the proposed approach is a gradient-based optimization method, which can handle high-dimensional systems.

  • The proposed method accepts a general parameterization of objective functions (e.g., nonlinear in function parameters such as neural networks), which is not necessarily a linear combination of features. The algorithm finds an objective function within the given function set such that the reproduced trajectory has minimum Euclidean distance to demonstrations, even though the demonstrations may not be optimal and the exact corresponding objective function does not exist in the function set.

  • The proposed learning algorithm permits sparse demonstrations, which consist of a small number of desired outputs of the robot’s trajectory at sparse time instances. The algorithm will find an objective function such that the reproduced trajectory gets closest to the given waypoints in Euclidean distance. In addition to learning the objective function, the method jointly learns a time-warping function to align the duration between the expert’s demonstration and the feasible motion of the robot.

  • The theory developed in this paper is the differential Pontryagin’s Maximum Principle. This allows us to obtain the analytical gradient of the system optimal trajectory with respect to the objective function parameter, thus enabling update of objective function using gradient descent.

This paper is organized as follows: Section II formulates the problem. Section III discusses the time-warping technique and reformulates the problem under a unified time axis. Section IV proposes the learning algorithm. Experiments are provided in Sections V and VI. Section VII discusses the method, and finally Section VIII draws conclusions.

II Problem Formulation

Consider a robot with the following continuous dynamics:


where is the robot state; is the control input; vector function is assumed to be twice-differentiable, and is time. Suppose the robot motion over a time horizon is controlled by optimizing the following parameterized objective function:


where and are the running and final costs, respectively, both of which are assumed twice-differentiable; and is a tunable parameter vector. For a fixed choice of , the robot produces a trajectory of states and inputs


which optimizes the objective function (2). Here the subscript in indicates that the trajectory implicitly depends on .
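For concreteness, the setup described above can be written as follows. This is a sketch under assumed notation: the symbols $x$, $u$, $f$, $x_0$, $t_f$, $c$, $h$, $p$, and $\xi_p$ are illustrative choices, numbered to match the references to (1)–(3) in the text.

```latex
\begin{align}
\dot{x}(t) &= f\big(x(t), u(t)\big), \qquad x(0) = x_0, \tag{1} \\
J(p) &= \int_{0}^{t_f} c\big(x(t), u(t), p\big)\,dt + h\big(x(t_f), p\big), \tag{2} \\
\xi_p &= \big\{\, x_p(t),\ u_p(t) \ \big|\ 0 \le t \le t_f \,\big\}. \tag{3}
\end{align}
```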

The goal of learning from demonstrations is to estimate the objective function parameter based on the observed demonstrations of an expert (usually a human operator). Here, we suppose that an expert provides demonstrations through a known output function


where defines a map from the robot’s state and input to an output . The expert’s demonstrations include (i) an expected time horizon , and (ii) a number of waypoints, each of which is a desired output for the robot to reach at an expected time instance, denoted as


Here, is the th waypoint demonstrated by the expert, and is the expected time instance at which the expert wants the robot to reach the waypoint . As the expert can freely provide the number of waypoints and choose the positions of expected time instances relative to the expected horizon , we refer to as sparse demonstrations. As will be shown later in simulations, here can be small.

Note that both the expected time horizon and the expected time instances are in the time axis of the expert’s demonstrations. This demonstration time axis may not be identical to the actual time axis of execution of the robot; in other words, the given times and may not be achievable by the robot. For example, when the robot is actuated by a weak servo motor, its motion inherently cannot meet the time step required by a human demonstrator. To accommodate the misalignment of duration between the robot and expert’s demonstrations, we introduce a time warping function


which defines a map from the expert’s demonstration time axis to the robot time axis . We make the following reasonable assumption: is strictly increasing for the range of and continuously differentiable function with .

Given the sparse demonstrations , the problem of interest is to find an objective function parameter and a time-warping function such that the following trajectory loss is minimized:


where is a given differentiable scalar function to quantify a point distance metric between vectors and , e.g., . Minimizing the loss in (7) means that we want the robot to find the ‘best’ objective function within the parameterized objective function set (2), together with a time-warping function, such that its reproduced trajectory is as close to the given sparse demonstrations as possible.
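Under assumed notation (output map $g$, waypoints $y^*(\tau_i)$ at demonstration times $\tau_i$, expected horizon $T$, warping function $w$ with $w(0) = 0$, point distance $l$, objective parameter $p$), the output function, the sparse demonstrations, the time-warping map, and the trajectory loss referenced as (4)–(7) might be written as below. The symbols are illustrative; a typical choice for the point distance would be the squared Euclidean distance $l(a, b) = \|a - b\|^2$.

```latex
\begin{align}
y &= g(x, u), \tag{4} \\
\mathcal{D} &= \big\{ \big(y^*(\tau_i),\, \tau_i\big) \ \big|\ \tau_i \in [0, T],\ i = 1, \dots, N \big\}, \tag{5} \\
t &= w(\tau), \tag{6} \\
\min_{p,\, w}\ & \sum_{i=1}^{N} l\Big( y^*(\tau_i),\ g\big( x_p(w(\tau_i)),\ u_p(w(\tau_i)) \big) \Big). \tag{7}
\end{align}
```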

III Problem Reformulation by Time-Warping

In this section, we present the parameterization of the time-warping function, and then re-formulate the problem of interest presented in the previous section under a unified time axis.

III-A Parametric Time-Warping Function

To facilitate learning of an unknown time-warping function, we parameterize the time-warping function. Suppose that a differentiable time-warping function satisfies and is strictly increasing in the range . Then the derivative


for all . We use a polynomial time-warping function:


where is the coefficient vector of the polynomial. Since , there is no constant (zero-order) term in (9) (i.e., ). Due to the requirement for all in (8), one can always obtain a feasible (e.g. compact) set for , denoted as , such that for all if .
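The polynomial parameterization and its feasibility requirement can be illustrated with a short sketch. The names `warp`, `warp_rate`, and `is_feasible`, the coefficient values, and the horizon are all hypothetical; the positivity of the derivative is only checked on a finite grid, which is a heuristic test rather than a proof of feasibility.

```python
def warp(beta, tau):
    """Polynomial time warp w(tau) = sum_i beta[i-1] * tau**i (no constant term)."""
    return sum(b * tau ** (i + 1) for i, b in enumerate(beta))


def warp_rate(beta, tau):
    """Derivative dw/dtau = sum_i i * beta[i-1] * tau**(i-1)."""
    return sum((i + 1) * b * tau ** i for i, b in enumerate(beta))


def is_feasible(beta, horizon, n_grid=1000):
    """Grid check that dw/dtau > 0 on [0, horizon], i.e. the warp is increasing."""
    return all(
        warp_rate(beta, horizon * k / n_grid) > 0.0 for k in range(n_grid + 1)
    )


beta_ok = [2.0, 0.1]    # w(tau) = 2*tau + 0.1*tau**2, rate 2 + 0.2*tau stays positive
beta_bad = [1.0, -1.0]  # w(tau) = tau - tau**2, rate 1 - 2*tau turns negative
horizon = 1.0           # hypothetical demonstration horizon

print(warp(beta_ok, 0.0), is_feasible(beta_ok, horizon), is_feasible(beta_bad, horizon))
```

Because the polynomial has no zero-order term, `warp(beta, 0.0)` is zero for any coefficient vector, matching the requirement that the two time axes share their origin.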

III-B Equivalent Formulation under a Unified Time Axis

Substituting the parametric time-warping function in (9) into both the robot’s dynamics (1) and the control objective function (2), we obtain the following time-warped dynamics


and the time-warped objective function


Here, the left side of (10) is due to the chain rule: , and the time horizon satisfies (note that is specified by the expert). For notational simplicity, we write , , , and . Then, the above time-warped dynamics (10) and time-warped objective function (11) are rewritten as:


respectively. We concatenate the unknown objective function parameter vector and unknown time-warping function parameter vector as


For a choice of , the time-warped optimal trajectory resulting from solving the above time-warped optimal control system (12) is rewritten as


with . The trajectory distance loss in (7) to be minimized can now be defined as


Minimizing the above loss function in (15) over the unknown parameter vector is a process of simultaneously learning the control objective function in (2) and the time-warping function in (9).
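The chain-rule step behind the time-warped dynamics can be checked numerically on a toy system. The sketch below assumes scalar dynamics dx/dt = -x and a linear warp t = 2τ, neither of which comes from the paper; it integrates dx/dτ = (dw/dτ)·f(x) on the demonstration time axis and compares the result with the exact solution of the original dynamics evaluated at t = w(T).

```python
import math


def f(x):
    """Toy scalar dynamics dx/dt = -x (a stand-in for the robot dynamics)."""
    return -x


def warp(tau, beta=2.0):
    """Linear time warp t = w(tau) = beta * tau (assumed first-order form)."""
    return beta * tau


def warp_rate(tau, beta=2.0):
    """dw/dtau for the linear warp (constant in tau)."""
    return beta


def integrate_warped(x0, horizon, n_steps=100_000):
    """Forward-Euler integration of dx/dtau = (dw/dtau) * f(x) over [0, horizon]."""
    d_tau = horizon / n_steps
    x = x0
    for k in range(n_steps):
        x += d_tau * warp_rate(k * d_tau) * f(x)
    return x


T = 1.0
x_warped = integrate_warped(1.0, T)
x_exact = math.exp(-warp(T))  # exact solution of dx/dt = -x at t = w(T)
print(abs(x_warped - x_exact))
```

The two values agree up to the discretization error of the Euler scheme, which is the sense in which the warped and original systems describe the same motion on different time axes.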

In summary, the problem of interest is reformulated as an optimization problem of jointly learning the objective function in (2) and time-warping function in (9):


Here defines a feasible domain of variable , ; the constraint in optimization (16) says that is an optimal trajectory generated by the optimal control system (12) with the control objective function (12b) and dynamics (12a). In the next section, we will focus on developing a new learning algorithm to efficiently solve the above optimization problem.
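On the unified time axis, the loss only needs the output of the time-warped trajectory at the demonstration time instances. A minimal sketch, assuming a squared Euclidean point distance and using hypothetical names (`point_loss`, `trajectory_loss`, `line_output`) and made-up waypoints:

```python
def point_loss(y_ref, y):
    """Squared Euclidean distance between two output vectors."""
    return sum((a - b) ** 2 for a, b in zip(y_ref, y))


def trajectory_loss(waypoints, output_at):
    """Sum of point losses over the sparse waypoints.

    waypoints: list of (tau_i, y_i) pairs on the demonstration time axis.
    output_at: callable tau -> output vector of the time-warped trajectory.
    """
    return sum(point_loss(y_i, output_at(tau_i)) for tau_i, y_i in waypoints)


def line_output(tau):
    """Hypothetical straight-line output trajectory used only for illustration."""
    return (2.0 * tau, tau)


# Two made-up waypoints; the second lies on the line, the first is 0.5 away.
waypoints = [(0.5, (1.0, 0.0)), (1.0, (2.0, 1.0))]
print(trajectory_loss(waypoints, line_output))
```

Only the sampled instants enter the sum, so the trajectory between waypoints is shaped entirely by the optimal control problem rather than by the data.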

IV Proposed Learning Algorithm

IV-A Algorithm Overview

To solve the optimization (16), we start with an arbitrary initial guess , and apply the gradient descent


where is the iteration index; is the step size (or learning rate); is a projection operator to guarantee the feasibility of in each update, e.g., ; and denotes the gradient of the given loss function (15) directly with respect to evaluated at . Applying the chain rule to the gradient term, we have


where is the gradient of the single point distance loss defined in (15) with respect to the -time trajectory point, , evaluated at point , and is the gradient of the -time trajectory point, , with respect to the parameter vector evaluated at value . From (17) and (18), we can note that at each iteration , the update of the parameter includes the following three steps:

  • Step 1: With the current parameter estimate , generate the optimal trajectory in (14) by solving the optimal control problem in (12);

  • Step 2: Compute the gradients and ; apply the chain rule (18) to compute ;

  • Step 3: Update using (17) for the next iteration.

The interpretation of the above procedure is straightforward: In each update , first, with the current parameter estimate , the optimal control system (12) produces an optimal trajectory , and the corresponding trajectory loss (that is, the distance to the given sparse demonstrations) is computed; second, the current gradient of the trajectory loss with respect to , , is solved; finally, this gradient is used to update the current estimate for the next iteration .
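The three-step loop can be sketched on a toy problem. This is not the paper's algorithm: Step 1 is replaced by an assumed closed-form trajectory x(τ) = x0·exp(-θτ) instead of an optimal control solve, Step 2 uses its analytic derivative instead of the differential Pontryagin's Maximum Principle, and the feasible interval, step size, and waypoints are hypothetical.

```python
import math

X0 = 1.0                         # initial state of the toy trajectory
THETA_MIN, THETA_MAX = 0.1, 5.0  # hypothetical feasible interval


def trajectory(theta, tau):
    """Step 1 stand-in: closed-form trajectory x(tau) = X0 * exp(-theta * tau)."""
    return X0 * math.exp(-theta * tau)


def trajectory_grad(theta, tau):
    """Step 2 stand-in: d x(tau) / d theta for the closed-form trajectory."""
    return -tau * X0 * math.exp(-theta * tau)


def loss_grad(theta, waypoints):
    """Chain rule: sum_i dl/dx(tau_i) * dx(tau_i)/dtheta, with l = (x - y)^2."""
    return sum(
        2.0 * (trajectory(theta, tau) - y) * trajectory_grad(theta, tau)
        for tau, y in waypoints
    )


def project(theta):
    """Projection onto the feasible interval [THETA_MIN, THETA_MAX]."""
    return min(max(theta, THETA_MIN), THETA_MAX)


def learn(waypoints, theta0=0.5, step=0.5, n_iters=500):
    """Projected gradient descent on the waypoint loss."""
    theta = theta0
    for _ in range(n_iters):
        theta = project(theta - step * loss_grad(theta, waypoints))  # Step 3
    return theta


theta_true = 1.5
waypoints = [(tau, trajectory(theta_true, tau)) for tau in (0.5, 1.0, 2.0)]
theta_hat = learn(waypoints)
print(theta_hat)
```

In this realizable toy case the estimate approaches the parameter that generated the three waypoints; in the actual method the two stand-in functions would be replaced by an optimal control solver and the trajectory gradient of Lemma 1.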

In Step 1 of the learning procedure, the optimal trajectory for the current parameter estimate is solved using any available optimal control solver, such as CasADi [3]. In Step 2, the gradient quantities can be readily computed by directly differentiating the given trajectory loss function (15). The main challenge, however, lies in how to obtain the gradient , that is, the gradient of the system optimal trajectory with respect to the parameter for the optimal control system (12). In what follows, we will show how to efficiently compute it by proposing the technique of differential Pontryagin’s Maximum Principle. In the following, we suppress the iteration index for notational simplicity.

IV-B Differential Pontryagin’s Maximum Principle

Consider the system optimal trajectory in (14) produced by the optimal control system (12) under a fixed choice of . The Pontryagin’s Maximum Principle [34] states an optimality condition that the optimal trajectory must satisfy. To present Pontryagin’s Maximum Principle, we define the Hamiltonian:


where is called the costate or adjoint variable for . According to Pontryagin’s Maximum Principle, there exists a costate trajectory


which is associated with the optimal trajectory in (14), such that the following conditions hold:


In fact, given , one can always solve for the corresponding costate trajectory by integrating the ODE (21b) backward in time with the end condition given by (21d).
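In standard textbook form, and under assumed notation ($c$ and $f$ for the running cost and dynamics of the time-warped system (12), $h$ for the final cost, $\lambda$ for the costate, $T$ for the horizon, unconstrained inputs), the Hamiltonian and the optimality conditions referenced as (19) and (21a)–(21d) would typically read as follows. The ordering of the four conditions is inferred from the text, which refers to (21b) as the costate ODE and (21d) as its end condition.

```latex
\begin{align}
H &= c\big(x, u, \theta\big) + \lambda^{\top} f\big(x, u, \theta\big), \tag{19} \\
\frac{d x_\theta}{d\tau} &= \frac{\partial H}{\partial \lambda}\big(x_\theta, u_\theta, \lambda_\theta, \theta\big), \tag{21a} \\
-\frac{d \lambda_\theta}{d\tau} &= \frac{\partial H}{\partial x}\big(x_\theta, u_\theta, \lambda_\theta, \theta\big), \tag{21b} \\
0 &= \frac{\partial H}{\partial u}\big(x_\theta, u_\theta, \lambda_\theta, \theta\big), \tag{21c} \\
\lambda_\theta(T) &= \frac{\partial h}{\partial x}\big(x_\theta(T), \theta\big). \tag{21d}
\end{align}
```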

Recall that our technical challenge in the previous part is to obtain the gradient . Towards this goal, we differentiate the above Pontryagin’s Maximum Principle equations in (21) on both sides with respect to the parameter , which yields the following differential Pontryagin’s Maximum Principle


Here the coefficient matrices in (22) are defined as


Once we obtain the optimal trajectory and the associated costate trajectory in (20), all the above coefficient matrices in (23) are known and their computation is straightforward. Using these matrices (23) and (22), the lemma below presents an iterative method to solve the gradient .

Lemma 1.

If in (23c) is invertible for all , define the following differential equations for matrix variables and :


with and Here, is identity,


are all known given (23). Then, the gradient of the optimal trajectory at any time instance , denoted as


is obtained by integrating the following equations up to :


with (because is given), where the matrices and are the solutions to the differential equations in (24a) and (24b), respectively.

The proof of Lemma 1 is given in the Appendix. Lemma 1 states that for the optimal control system (12), the gradient of its optimal trajectory (the trajectory satisfying Pontryagin’s Maximum Principle) with respect to parameter can be obtained in two steps: first, integrate (24) backward in time to obtain matrices and for ; and second, obtain by integrating (27). With the differential Pontryagin’s Maximum Principle, Lemma 1 states an efficient way to obtain the gradient of the optimal trajectory with respect to the unknown parameters in an optimal control system. By Lemma 1, one can obtain the derivative of any trajectory point , for any , along the optimal trajectory , with respect to the parameter , .
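Lemma 1 relies on the coefficient matrices in (23) and the differential equations in (24), which are not reproduced here. As a much simpler analogue of its forward pass, the sketch below integrates the sensitivity of a plain parameterized ODE (no optimal control involved) starting from a zero initial gradient, because the initial state does not depend on the parameter. The toy dynamics dx/dτ = -θx and all names are assumptions; the result is compared against a central finite difference on the same integrator.

```python
def simulate(theta, horizon=1.0, n_steps=1000, x0=1.0):
    """Euler-integrate dx/dtau = -theta * x together with its sensitivity.

    The sensitivity s = dx/dtheta obeys ds/dtau = (df/dx) * s + df/dtheta
    = -theta * s - x, with s(0) = 0 because x(0) is fixed.
    """
    d_tau = horizon / n_steps
    x, s = x0, 0.0
    for _ in range(n_steps):
        # Both updates use the values of x and s from the previous step.
        x, s = x + d_tau * (-theta * x), s + d_tau * (-theta * s - x)
    return x, s


theta = 2.0
_, s_final = simulate(theta)

# Central finite difference on the same integrator, for comparison.
eps = 1e-5
x_plus, _ = simulate(theta + eps)
x_minus, _ = simulate(theta - eps)
fd = (x_plus - x_minus) / (2.0 * eps)

print(s_final, fd)
```

The analytic gradient costs one extra forward integration, whereas finite differences need two additional simulations per parameter, which hints at why an analytic trajectory gradient matters for high-dimensional parameter vectors.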

Based on Lemma 1, we summarize the overall algorithm to solve the optimization problem (16) in Algorithm 1.


V Numerical Examples

We demonstrate the proposed approach using two systems: (i) an inverted pendulum, and (ii) 6-DoF UAV maneuvering control. We compare the proposed method with related work.

V-A Inverted Pendulum

The dynamics of the pendulum is


with being the angle between the pendulum and direction of gravity, is the torque applied to the pivot, m, kg, and are the length, mass, and damping ratio of the pendulum, respectively. We define the state and control variables of the pendulum system as and , respectively, and set the initial state . For the inverted pendulum control, we set the parameterized cost function in (2) as


with the parameter vector to be determined. For the parametric time-warping function (9), we simply use a linear function:


with (we will discuss the use of more complex time-warping functions later). The overall parameter vector to be determined is .
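A damped, torque-driven pendulum of the kind described above can be simulated in a few lines. The model form (viscous damping at the pivot), the numerical values of length, mass, and damping, and the function names are all assumptions made for illustration; they are not taken from the paper's equation (28).

```python
import math

G = 9.81       # gravitational acceleration, m/s^2
LENGTH = 1.0   # pendulum length, m (assumed value)
MASS = 1.0     # pendulum mass, kg (assumed value)
DAMPING = 0.1  # damping coefficient at the pivot (assumed value)


def pendulum_rhs(alpha, omega, torque):
    """Time derivative of (angle, angular velocity) for a damped pendulum.

    alpha is measured from the direction of gravity, so alpha = 0 is hanging down.
    """
    inertia = MASS * LENGTH ** 2
    alpha_ddot = (
        -G / LENGTH * math.sin(alpha)
        - DAMPING / inertia * omega
        + torque / inertia
    )
    return omega, alpha_ddot


def energy(alpha, omega):
    """Kinetic plus potential energy, zero at the hanging rest position."""
    kinetic = 0.5 * MASS * LENGTH ** 2 * omega ** 2
    potential = MASS * G * LENGTH * (1.0 - math.cos(alpha))
    return kinetic + potential


def rollout(alpha, omega, torque=0.0, dt=1e-3, n_steps=5000):
    """Forward-Euler rollout under a constant torque."""
    for _ in range(n_steps):
        d_alpha, d_omega = pendulum_rhs(alpha, omega, torque)
        alpha, omega = alpha + dt * d_alpha, omega + dt * d_omega
    return alpha, omega


alpha_T, omega_T = rollout(1.0, 0.0)
print(energy(1.0, 0.0), energy(alpha_T, omega_T))
```

With zero torque the damping dissipates energy over the rollout; an optimal control solver would instead choose the torque sequence that minimizes the parameterized cost.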

The output function (4) is set as which means that the expert only provides the position information, not including the velocity information. For the trajectory loss function in (15), we use the norm to quantify the distance measure:


V-A1 Known Ground Truth

First, we generate sparse demonstrations to test the proposed method when the true objective function and time-warping function are both known. Specifically, we set the true parameter , based on which we generate the trajectory by solving the optimal control problem (12). Then, we pick some points as the sparse demonstrations , listed in Table I. We want to see if the proposed method can correctly learn from these sparse points. Given the sparse waypoints in Table I, we apply Algorithm 1 to learn the parameter by solving (16). In Algorithm 1, we set the learning rate , and initialize the parameter randomly.

Demonstration time instance waypoints
Time horizon s
TABLE I: Sparse demonstrations for inverted pendulum.
Fig. 2: Learning from sparse demonstrations for inverted pendulum using data in Table I. Left: the loss value (31) versus the number of iterations. Right: the convergence of the pendulum’s (time-warped) trajectory as iteration increases, where the color from light to dark gray corresponds to increasing iteration number, and the red dots are waypoints in Table I.

We plot the loss value in (31) versus the number of iterations in Fig. 2. The result shows that as the iteration number increases, the loss diminishes fast and finally converges to zero. This indicates that the trajectory gradually gets close to the sparse demonstrations and finally passes through them. This convergence is also illustrated by the right panel of Fig. 2, where we plot the pendulum’s (time-warped) trajectory in each iteration, where the color going from light to dark gray corresponds to increasing iteration number, and the red dots indicate the sparse demonstrations. As shown by the results, the initial trajectory (lightest gray) is far away from the sparse demonstrations, and as updates, the trajectory (with increasingly dark colors) approaches and finally passes through the waypoints (i.e., the converged loss is zero). To illustrate whether the parameters converge to the ground truth , we define the following parameter error: and plot the parameter error versus the number of iterations in Fig. 3, from which we note that as the number of iterations increases, converges to zero, indicating that the true parameter of the objective and time-warping functions is successfully learned.

Fig. 3: Parameter error versus iteration number.

V-A2 Non-realizable Case

In this case, we use random sparse demonstrations, where the waypoints are sampled from a uniform distribution centered at the ones in Table I. The randomness of the given sparse demonstrations means that an exact objective function (whose optimal trajectory exactly passes through the sparse demonstrations) may not exist within the given parameterized function set in (29) because of limited expressive power. The random sparse demonstrations are listed in Table II, and the other settings are the same as the previous case. The learning results are shown in Fig. 4. The results show that as the number of iterations increases, the loss value (31) is decreasing and converging to a value of but not zero. This is because the waypoints are randomly given, thus there does not exist such that the corresponding system trajectory exactly passes through these given waypoints. It shows that the proposed method can always find the ‘best’ objective function and the ‘best’ time-warping function within the parametric function sets, which finally leads the reproduced trajectory to be closest to the waypoints in a sense of having the minimal distance loss (7), as shown in the right panel of Fig. 4.

Demonstration time instance waypoints
Time horizon s
TABLE II: Sparse demonstrations for pendulum system.
Fig. 4: Learning from sparse demonstrations for inverted pendulum from data in Table II. Left: the loss value (31) versus the number of iterations. Right: the convergence of the pendulum’s (time-warped) trajectory as the number of iterations increases, where the color from light to dark gray corresponds to increasing iteration number, and the red dots are waypoints in Table II.

V-A3 Different Parametric Time-Warping Functions

In this case, we test the performance of the method using different parametric time-warping functions. The sparse demonstrations are in Table III, where the demonstration time labels are infeasible for the pendulum actuation. The other experimental settings are the same as the previous cases, except that we use the parametric polynomial time-warping function (9) with different degrees . We summarize in Table IV the learned time-warping function and the obtained minimal loss value of (31), i.e., .

Demonstration time instance waypoints
Time horizon s
TABLE III: Sparse demonstrations for pendulum system.
Learned time-warping function
TABLE IV: Different polynomial time-warping functions

As shown in Table IV, more complex time-warping functions lead to a lower minimal loss value of . This is understandable because using a higher-degree polynomial will introduce additional degrees of freedom, which contribute to further decreasing the loss in terms of generating a ‘more-deformed’ time axis. Also from a system perspective, if we look at the entire parameterized optimal control system (12), use of a higher-degree polynomial time-warping function will make the parameterized system more expressive, achieving a lower loss on the same training data.

From Table IV, we further observe that the first-order terms in all learned time-warping polynomials are approximately the same, and the higher-order terms are relatively small compared to the first-order term and they do not significantly contribute to lowering the final training loss. This indicates that the first-order term dominates the time scale difference between the demonstration and robot’s execution, because here is small and the higher-order terms thus are not significant compared to the first-order term. In the following experiments, we therefore only use the first-order polynomial time-warping functions.
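
To make the dominance of the first-order term concrete, the sketch below evaluates a polynomial time warp term by term. It assumes the polynomial has no constant term (so that zero maps to zero); the coefficients are hypothetical and are not the values from Table IV.

```python
def term_contributions(beta, tau):
    """Return the individual terms beta[i] * tau**(i+1) of the polynomial warp."""
    return [b * tau ** (i + 1) for i, b in enumerate(beta)]

def poly_time_warp(beta, tau):
    """Evaluate t = beta[0]*tau + beta[1]*tau**2 + ... (assumed form, no constant term)."""
    return sum(term_contributions(beta, tau))

# Hypothetical coefficients chosen only for illustration.
beta = [2.0, 0.3, 0.05]
tau = 0.5
terms = term_contributions(beta, tau)   # first-, second-, third-order terms
t = poly_time_warp(beta, tau)
```

With these illustrative numbers the first-order term contributes 1.0 while the second- and third-order terms contribute 0.075 and 0.00625, which mirrors the observation that higher powers of a small argument add little.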

V-A4 Neural Objective Functions

Instead of using parameterization (29), we here represent the objective function using a neural network and aim to learn a neural objective function. We test this still using the inverted pendulum system. Specifically, the parameterized objective function is represented as


where is a 2-2-1 fully-connected neural network with activation functions [30] (i.e., 2-neuron input layer, 2-neuron hidden layer, and 1-neuron output layer), and is the parameter vector of the neural network, that is, the weight matrices and bias vectors. The time-warping polynomial is first-order as in (30) and the loss function is (31). We use the sparse demonstration data in Table III, and the learning rate is set as . We plot the learning results in Fig. 5, which shows that the proposed approach can successfully learn a neural objective function from sparse demonstrations, such that the pendulum’s reproduced trajectory is close to the given waypoints in Euclidean distance.
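
A 2-2-1 fully-connected network of the kind described above can be written out explicitly. The sketch below is an assumption-laden illustration: the activation is taken to be tanh and the output layer is taken to be linear, neither of which is confirmed by the text, and the flat parameter layout in `unpack` is hypothetical. It does show that such a network has 9 parameters (4 + 2 hidden weights and biases, 2 + 1 output weights and bias).

```python
import math

def unpack(theta):
    """Split a flat 9-element parameter vector into weights and biases."""
    W1 = [theta[0:2], theta[2:4]]  # hidden-layer weights, 2x2
    b1 = theta[4:6]                # hidden-layer biases
    W2 = theta[6:8]                # output-layer weights, 1x2
    b2 = theta[8]                  # output-layer bias
    return W1, b1, W2, b2

def mlp_221(theta, x):
    """Forward pass of a 2-2-1 network (tanh hidden layer, linear output; both assumed)."""
    W1, b1, W2, b2 = unpack(theta)
    hidden = [math.tanh(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    return W2[0] * hidden[0] + W2[1] * hidden[1] + b2

# Hypothetical parameter values for illustration only.
theta = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.5]
value_at_origin = mlp_221(theta, [0.0, 0.0])
```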

Fig. 5: Learning from sparse waypoints with the objective function represented by a neural network. Left: the loss value (31) versus the number of iterations, and the loss finally converges to . Right: the learned time-warped trajectory, where the red dots are waypoints in Table III.

In the left panel of Fig. 5, the converged loss is , which is lower than the loss of in Table IV for the weighted distance parameterization (29). This difference can be also seen by comparing the right panel of Fig. 5 with the one in Fig. 4. The lower loss here is because neural network representation is more expressive than weighted distance parameterization. The results in Fig. 5 demonstrate the capability of the proposed method to learn complex parametric objective functions, and it shows the utility of the method when the knowledge-based parametric objective function is not readily available.

However, despite the convenience of universal neural network objective functions, the structure and hyper-parameters of the neural network (such as the number of layers/neurons and the type of activation functions) still need to be specified. In our experience, neural objective functions have other drawbacks as well, including a lack of physical interpretability of the learned results, more iterations needed to reach convergence (as shown empirically in the left panel of Fig. 5), and a tendency to get trapped in locally optimal solutions. In Section VII, we provide a further analysis of the choice of parametric objective functions.

V-B Comparison with other Methods

V-B1 Comparison with Learning From KeyFrames [2]

We first compare the proposed method with the method of learning from keyframe demonstrations developed in [2]. As discussed in the related work, this is a policy-learning based method: a Gaussian mixture model (GMM) is first learned from keyframe demonstrations, based on which a trajectory is then reproduced using Gaussian mixture regression (GMR). In this comparison experiment, we use the inverted pendulum system with the same setting as in Section V-A1. Here, we provide 20 waypoints (with the time instances evenly populated over ; we find that a smaller number of waypoints leads to failure of the GMM method). During trajectory reproduction, we set a new time duration (note that the training data uses ) to test the generalization performance of each method. Comparison results are plotted in Fig. 6, where we also plot the ground-truth for reference.

Fig. 6: Reproduced trajectories with a new time duration (note that the demonstration data is with the duration ).

From Fig. 6, we observe that under unseen information (here with a longer time horizon), our method produces a trajectory much closer to the ground truth than [2]. This indicates better generalization of the proposed method to unseen settings (or long horizon tasks). In fact, better generalization is generally one of the advantages of objective function learning over policy learning, as discussed in [24].

V-B2 Comparison with Numerical Gradient Descent

Here, we compare the proposed method with direct gradient descent, where the gradient is estimated numerically. Specifically, in each update we use the numerical differentiation to approximate the gradient . The experiment uses the pendulum system with the same settings as Section V-A. Here we have tried two cases: the first case uses the sparse demonstration data in Table I, and the second case uses the sparse demonstration data in Table II. The comparison results are shown in Fig. 7.

Fig. 7: Comparison between the proposed method and numerical gradient descent. Left: using the sparse demonstrations in Table I; and right: using the sparse data in Table II. Both methods use the same learning rate .

From Fig. 7, we observe that the proposed method has a clear advantage in terms of lower training loss and faster convergence. Numerical gradient descent is effective in this case but less accurate because of the error introduced by gradient approximation. Because of this approximation error, the loss does not descend along the ‘steepest’ direction, which leads to slower convergence. Here the optimization variable is low-dimensional, so the numerical gradient is relatively easy to compute and numerical gradient descent works. For high-dimensional tasks, as we will show below, we found that numerical gradient descent is prone to failure because of inaccurate gradient estimation.
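
For reference, the numerical-gradient baseline can be sketched as follows. This is a generic central-difference scheme applied to a toy quadratic, not the setup used in the experiments; the step size `eps`, the learning rate, and the toy loss are all assumptions. In the paper's setting each loss evaluation would involve solving an optimal control problem, so the two evaluations per parameter used here become costly as the dimension grows.

```python
def numerical_gradient(loss, theta, eps=1e-6):
    """Central-difference estimate of the gradient of loss at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up) - loss(down)) / (2.0 * eps))
    return grad

def gradient_descent(loss, theta0, lr, iters):
    """Plain gradient descent driven by the numerical gradient."""
    theta = list(theta0)
    for _ in range(iters):
        g = numerical_gradient(loss, theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy quadratic loss with minimizer (1.0, -0.5); it stands in for the real loss.
def toy_loss(theta):
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

theta_hat = gradient_descent(toy_loss, [0.0, 0.0], lr=0.1, iters=200)
```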

V-C Experiment on 6 DoF Maneuvering UAVs

We here show the effectiveness of the proposed method on a more complex 6-DoF UAV maneuvering control system. The equation of motion of a quadrotor UAV flying in SE(3) (full position and attitude) space is given by


Here, subscripts and denote a quantity expressed in the UAV body frame and world reference frame, respectively; is the mass of the UAV; and are the position and velocity of the UAV; is the moment of inertia of the UAV expressed in its body frame; is the angular velocity of the UAV; is the unit quaternion [20] describing the attitude of the UAV with respect to the world frame; (33c) is the time derivative of quaternion with being the matrix notation of used for quaternion multiplication [20]; is the torque vector applied to the UAV; and is the total force vector applied to the UAV’s center of mass. The total force magnitude (along the z-axis of UAV’s body frame) and torque are generated by thrust of the four rotating propellers, which can be written as


with denoting the UAV’s wing length and a fixed constant. In our experiment, the gravity constant is set as and all the other constant parameters are units. We define the state variable


and define the control variable


To achieve SE(3) maneuvering control, we need to carefully design the attitude error. As in [23], we define the attitude error between the UAV’s current attitude and goal attitude as


where is the direction cosine matrix corresponding to the quaternion (see [20] for more details).
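
As an illustration of an attitude error built from direction cosine matrices, the sketch below converts unit quaternions to rotation matrices and evaluates a trace-based error. Both the scalar-first quaternion ordering `[w, x, y, z]` and the specific error form, one half of the trace of the identity minus the product of the transposed goal matrix and the current matrix, are assumptions made for this example and may differ from the definition used in the paper and in [23].

```python
import math

def quat_to_dcm(q):
    """Rotation matrix of a unit quaternion q = [w, x, y, z] (scalar-first, assumed)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def attitude_error(q, q_goal):
    """Assumed error 0.5 * trace(I - R_goal^T R); equals 1 - cos(relative angle)."""
    R = quat_to_dcm(q)
    Rg = quat_to_dcm(q_goal)
    # trace(Rg^T R) is the sum of elementwise products of Rg and R
    tr = sum(Rg[i][j] * R[i][j] for i in range(3) for j in range(3))
    return 0.5 * (3.0 - tr)

identity = [1.0, 0.0, 0.0, 0.0]
# 90-degree rotation about the z-axis
yaw90 = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
err_same = attitude_error(identity, identity)
err_yaw90 = attitude_error(yaw90, identity)
```

Under these assumptions the error is zero when the two attitudes coincide and grows smoothly to 2 for a 180-degree relative rotation, which makes it convenient as a differentiable cost term.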

The parameterized cost function in (2) is set as


Here, , , , and are the goal position, velocity, orientation, and angular velocity, respectively; the objective function parameter vector here is


For the parametric time-warping function, we use the first-degree polynomial as in (30). The total parameter vector to be determined is


We set the output function in (4) as


which means that the expert can only provide the position and attitude demonstrations for UAV maneuvering (not including velocity information).

TABLE V: Sparse demonstrations for UAV maneuvering (columns: time instance; waypoints; time horizon in s).
Fig. 8: Learning from sparse demonstrations for 6-DoF UAV maneuvering. Left: the loss function value versus the number of iterations. Right: the UAV trajectory before learning (red) and the UAV trajectory after learning (blue), and green objects are the sparse demonstrations in Table V.

The sparse demonstrations are in Table V. The loss function is defined using Euclidean distance as in (31). In Algorithm 1, we set the learning rate . We plot the learning results in Fig. 8. The results show that, as the parameter is updated at each iteration, the loss value quickly diminishes to zero, meaning that the UAV’s reproduced trajectory approaches the sparse demonstrations in Table V. The right panel of Fig. 8 shows the final reproduced trajectory, which passes exactly through the given sparse demonstrations. This indicates the capability of the method in handling more complex systems.

VI Application: Learning for Obstacle Avoidance

In this section, we apply the proposed method to learning robot motion control in an environment with obstacles. Here, a human provides a few waypoints in the vicinity of obstacles, and the robot learns a control objective function from those waypoints such that its resulting motion gets around the obstacles. We experiment on two systems: a 6-DoF maneuvering UAV and a two-link robot arm.

VI-A 6 DoF Maneuvering UAV

The dynamics of a 6-DoF UAV are given in (33). For the parameterized control objective function (2), instead of using the weighted distance to the goal state, we here use a general second-order polynomial parameterization as follows: