I Introduction
The appeal of learning from demonstrations (LfD) lies in its capability to facilitate robot programming by simply providing demonstrations from an expert. It circumvents the need for expertise in controller design and coding, which is required by traditional robot programming, and empowers nonexperts to program a robot as needed [39]. LfD has been successfully applied to various scenarios such as manufacturing [8], assistive robots [26], and autonomous vehicles [18].
LfD techniques can be broadly categorized based on what is learned from the observed demonstrations. One branch of LfD focuses on learning a policy [33, 10, 6, 37, 41], which maps directly from the robot’s states, environment, or raw observations to the robot’s actions using supervised machine learning. While effective in many situations, policy learning typically requires a considerable amount of demonstration data, and the learned policy may generalize poorly to unseen or long-horizon tasks [39]. To alleviate this, another direction of LfD research focuses on learning control objective (e.g., cost or reward) functions from demonstrations [1], from which the optimal policies or trajectories are then derived. These methods assume the optimality of demonstrations and use inverse reinforcement learning (IRL) [29] or inverse optimal control (IOC) [27] to estimate the control objective function. Since an objective function is a more compact and high-level representation of a task, LfD via learning objective functions has demonstrated advantages over policy learning in terms of better generalization to unseen situations [24] and relatively lower sample complexity [1]. Despite significant progress along this direction, LfD based on objective learning still inherits some limitations from the core IRL and IOC techniques, which are summarized below.
The majority of existing IOC and IRL methods [38, 42, 25, 16, 36, 11] assume an objective function as a linear combination of selected features, and their algorithms are designed in the feature space by taking advantage of the linearity of the feature weights [15]. Those approaches typically do not directly minimize the discrepancy between the robot’s reproduced trajectory and the demonstrations in trajectory space, and cannot be readily extended to nonlinear parameterization of an objective function.

There may be a timescale discrepancy between the expert’s demonstrations and the actual actuation of the robot [17]. For instance, consider a robot that learns from human motion. The duration of the human demonstration may not respect the dynamics constraints of the robot, since the robot may be actuated by a weak servo motor and cannot move as fast as the human.
In recognition of these limitations, in this paper we propose a new method to learn from sparse demonstrations, which has the following advantages over existing methods:

First, the proposed method learns an objective function using only sparse demonstration data, which consist of a small number of desired outputs of the robot’s trajectory at sparse time instances within a time horizon, as shown in Fig. 1. The given sparse demonstrations do not necessarily contain control input information.

Second, the proposed method learns an objective function over a parameterized function set by directly minimizing the distance between the robot’s reproduced trajectory and the sparse demonstrations. Even though the demonstrations may not correspond to an exact objective function within the parameterized function set, e.g., demonstrations are not optimal or even randomly given, the method can still find a ‘best’ objective function within the parameterized function set such that the reproduced trajectory is closest to the given demonstrations in Euclidean distance, as shown in Fig. 1.

Third, since the timing associated with the sparse waypoints may not be achievable with the robot’s actuation, the proposed method jointly learns, in addition to an objective function, a time-warping function that maps from the demonstration time axis to the robot execution time axis. This addresses the potential issue of time misalignment in existing IOC/IRL methods.
I-A Background and Related Work
Since the theme of the method devised in this paper belongs to the category of LfD based on learning objective functions, here we mainly focus on the related work of IRL/IOC methods, which share the same goal of learning objective functions from demonstrations. For the other types of LfD methods, e.g., policy learning, or the comparison between them, we refer the reader to recent surveys [39] and [31] for more details.
Over the past decades, various IRL and IOC techniques have been proposed, with different works emphasizing different formulations to infer an objective function. One popular method is feature matching [1], in which the objective function is updated towards matching the feature values of demonstrations with those of the reproduced trajectories. Another method is maximum margin [38], in which an objective function is solved by maximizing the margin between the objective values of demonstrations and those of the reproduced trajectories. Lastly, maximum entropy [42] is a method that finds an objective function such that the trajectory distribution has the maximum entropy while being subject to the empirical feature values of the observed demonstrations.
Another line of IOC/IRL research [16, 36, 11, 15, 14] directly solves for the objective function parameters by establishing optimality conditions, such as the Karush-Kuhn-Tucker conditions [19] or Pontryagin’s maximum principle [34]. The key idea is that the demonstration data is assumed to be optimal and thus must satisfy the optimality conditions. By directly minimizing the violation of the optimality conditions by the demonstration data over the objective function parameters, one can obtain an estimate of the objective function. The benefit of doing so is that these methods avoid repeatedly solving the direct optimal control or reinforcement learning problem in each iteration, but a potential drawback is that they may not be robust to demonstrations that deviate significantly from the optimal ones.
All the above IOC and IRL techniques assume a linear combination of selected features as their parameterized objective functions, with unknown feature weights. In these methods, learning objective functions is not formulated in trajectory space; that is, they do not directly minimize the discrepancy between the reproduced trajectory and the demonstrations. Instead, they design algorithms in the selected feature space by taking advantage of the linearity of the feature weights. For example, maximum margin IRL [38] and feature matching IRL [1] focus on maximizing the margin between, and matching, the feature values of the given demonstrations and the reproduced trajectories, respectively. The recent work in [9] attempts to formulate the IOC problem as the minimization of a direct loss. However, the algorithm is still similar to the maximum margin approach in a selected-feature space. In [25], the authors use a double-layer optimization to solve the IOC problem and directly minimize a trajectory loss function. In the upper layer, which updates the objective function, a derivative-free method [35] is used that approximates the loss function with a quadratic function. This involves multiple evaluations of the loss and thus requires solving the optimal control problem repeatedly in each update, which is computationally expensive. More importantly, the derivative-free method has inherent limitations in handling large problems [40]. The second line of IOC methods solves for the feature weights by minimizing the extent to which the demonstrations violate the optimality conditions, and thus still considers trajectory error only indirectly.
Learning objective functions in a linear feature space may facilitate the design of learning algorithms (for example, by taking advantage of the linearity of feature weights), but the performance of such algorithms relies heavily on the choice of features. Many IOC approaches [16, 36, 11] assume the optimality of demonstration data; that is, the observed demonstrations are the result of optimizing the parameterized objective function. However, whether this assumption holds depends on observation noise and on good feature selection, and recent work [4] shows that large data noise is likely to cause these methods to fail.
Existing IOC and IRL techniques face further challenges. First, existing methods require as input the continuous demonstration data of an entire task; in other words, a given demonstration needs to be a complete trajectory over the entire course of execution. Thus, demonstration data needs to be carefully collected from an expert, which can be burdensome, especially for high-dimensional systems. By contrast, it is relatively easy to provide only sparse demonstrations. Although [15] proposes a method to solve IOC from incomplete trajectory data, it still requires a trajectory segment long enough to satisfy a recovery condition and thus cannot handle very sparse demonstrations such as those shown in Fig. 1. In [2], the authors develop a method for learning from keyframe demonstrations. This method is a policy learning technique: it learns a kinematic trajectory model (Gaussian mixture models) instead of an objective function. The unseen motion between keyframes is handled by interpolation. Such a process leads to poor generalization and high sample complexity (we will show this later in experiments).
Another limitation of existing IOC and IRL methods is that they rarely account for the time misalignment between the demonstrations and the feasible actuation capabilities of a robot. This is critical in practical implementation. For example, consider a humanoid robot that learns to imitate a human demonstrator. The robot may be actuated by a weak servo motor that cannot move as fast as the human. The demonstrations thus cannot be directly used for objective function learning. To address this, [17] learns a time-warping function between a robot and a demonstrator, but this method is used to align the time of a demonstrated trajectory for optimal tracking rather than to learn objective functions.
I-B Contributions
We propose a new approach to learn objective functions from sparse demonstrations. The contributions of the method relative to existing IRL/IOC methods are listed below.

The proposed method learns an objective function by directly minimizing a trajectory loss, which quantifies the discrepancy between a robot’s reproduced trajectory and the observed demonstrations. Unlike [25], which uses derivative-free techniques [35], the proposed approach is a gradient-based optimization method that can handle high-dimensional systems.

The proposed method accepts a general parameterization of objective functions (e.g., nonlinear in function parameters such as neural networks), which is not necessarily a linear combination of features. The algorithm finds an objective function within the given function set such that the reproduced trajectory has minimum Euclidean distance to demonstrations, even though the demonstrations may not be optimal and the exact corresponding objective function does not exist in the function set.

The proposed learning algorithm permits sparse demonstrations, which consist of a small number of desired outputs of the robot’s trajectory at sparse time instances. The algorithm finds an objective function such that the reproduced trajectory is closest to the given waypoints in Euclidean distance. In addition to learning the objective function, the method jointly learns a time-warping function to align the duration of the expert’s demonstration with the feasible motion of the robot.

The theory developed in this paper is the differential Pontryagin’s Maximum Principle. This allows us to obtain the analytical gradient of the system optimal trajectory with respect to the objective function parameter, thus enabling update of objective function using gradient descent.
The organization of this paper is as follows: Section II formulates the problem. Section III discusses the time-warping technique and reformulates the problem under a unified time axis. Section IV proposes the learning algorithm. Experiments are provided in Sections V and VI. Section VII discusses the method, and finally Section VIII draws conclusions.
II Problem Formulation
Consider a robot with the following continuous dynamics:
(1) 
where is the robot state; is the control input; vector function is assumed to be twice-differentiable, and is time. Suppose the robot motion over a time horizon is controlled by optimizing the following parameterized objective function:
(2) 
where and are the running and final costs, respectively, both of which are assumed twice-differentiable; and is a tunable parameter vector. For a fixed choice of , the robot produces a trajectory of states and inputs
(3) 
which optimizes the objective function (2). Here the subscript in indicates that the trajectory implicitly depends on .
The goal of learning from demonstrations is to estimate the objective function parameter based on the observed demonstrations of an expert (usually a human operator). Here, we suppose that an expert provides demonstrations through a known output function
(4) 
where defines a map from the robot’s state and input to an output . The expert’s demonstrations include (i) an expected time horizon , and (ii) a number of waypoints, each of which is a desired output for the robot to reach at an expected time instance, denoted as
(5) 
Here, is the th waypoint demonstrated by the expert, and is the expected time instance at which the expert wants the robot to reach the waypoint . As the expert can freely provide the number of waypoints and choose the positions of expected time instances relative to the expected horizon , we refer to as sparse demonstrations. As will be shown later in simulations, here can be small.
Note that both the expected time horizon and the expected time instances are in the time axis of the expert’s demonstrations. This demonstration time axis may not be identical to the actual time axis of execution of the robot; in other words, the given times and may not be achievable by the robot. For example, when the robot is actuated by a weak servo motor, its motion inherently cannot meet the time step required by a human demonstrator. To accommodate the misalignment of duration between the robot and expert’s demonstrations, we introduce a time warping function
(6) 
which defines a map from the expert’s demonstration time axis to the robot time axis . We make the following reasonable assumption: is strictly increasing for the range of and continuously differentiable function with .
Given the sparse demonstrations , the problem of interest is to find an objective function parameter and a time-warping function such that the following trajectory loss is minimized:
(7) 
where is a given differentiable scalar function that quantifies a pointwise distance between vectors and , e.g., . Minimizing the loss in (7) means that we want the robot to find the ‘best’ objective function within the parameterized objective function set (2), together with a time-warping function, such that its reproduced trajectory is as close to the given sparse demonstrations as possible.
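As a compact restatement of the setup in this section, the learning problem can be sketched in assumed notation. The symbols $x$, $u$, $f$, $c$, $h$, $p$, $g$, $w$, $l$, $y^*$, $\tau_i$, $T$, $t_f$, and $N$ below are illustrative choices made for this sketch and may differ from the symbols used elsewhere in the paper:

```latex
\begin{aligned}
&\text{dynamics:} && \dot{x}(t) = f\big(x(t), u(t)\big), \quad x(0) = x_0,\\
&\text{objective:} && J(p) = \int_0^{t_f} c\big(x(t), u(t), p\big)\,dt + h\big(x(t_f), p\big),\\
&\text{output:} && y = g(x, u),\\
&\text{demonstrations:} && \mathcal{D} = \{\, y^*(\tau_i) \;:\; \tau_i \in [0, T],\ i = 1, \dots, N \,\},\\
&\text{time warping:} && t = w(\tau), \quad w(0) = 0, \quad w \text{ strictly increasing on } [0, T],\\
&\text{learning problem:} && \min_{p,\, w}\ \sum_{i=1}^{N} l\Big(y^*(\tau_i),\ g\big(x_p(w(\tau_i)),\, u_p(w(\tau_i))\big)\Big).
\end{aligned}
```

Here $(x_p, u_p)$ denotes the trajectory that optimizes $J(p)$ subject to the dynamics, so the outer minimization over $(p, w)$ contains an inner optimal control problem.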
III Problem Reformulation by Time-Warping
In this section, we present the parameterization of the time-warping function, and then reformulate the problem of interest presented in the previous section under a unified time axis.
III-A Parametric Time-Warping Function
To facilitate learning of an unknown time-warping function, we parameterize the time-warping function. Suppose that a differentiable time-warping function satisfies and is strictly increasing in the range . Then the derivative
(8) 
for all . We use a polynomial timewarping function:
(9) 
where is the coefficient vector of the polynomial. Since , there is no constant (zero-order) term in (9) (i.e., ). Due to the requirement for all in (8), one can always obtain a feasible (e.g., compact) set for , denoted as , such that for all if .
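The polynomial parameterization and the positivity requirement on its derivative can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names (`warp`, `warp_rate`, `is_feasible`), the grid-based feasibility check, and the example coefficients are hypothetical and chosen only for illustration.

```python
import numpy as np

def _coeffs(beta):
    # np.polyval expects the highest-degree coefficient first; a trailing 0
    # encodes the missing constant term, so the warp maps 0 to 0.
    return np.r_[np.asarray(beta, dtype=float)[::-1], 0.0]

def warp(beta, tau):
    """Polynomial time warp: sum over i of beta[i-1] * tau**i."""
    return np.polyval(_coeffs(beta), tau)

def warp_rate(beta, tau):
    """Derivative of the warp with respect to tau."""
    return np.polyval(np.polyder(_coeffs(beta)), tau)

def is_feasible(beta, horizon, n_grid=200):
    """Grid check (an approximation) that the warp is strictly increasing on [0, horizon]."""
    grid = np.linspace(0.0, horizon, n_grid)
    return bool(np.all(warp_rate(beta, grid) > 0.0))

beta_ok = [2.0, 0.1]    # 2*tau + 0.1*tau**2: rate stays positive on [0, 1]
beta_bad = [1.0, -1.0]  # tau - tau**2: rate becomes negative for tau > 0.5
print(is_feasible(beta_ok, 1.0), is_feasible(beta_bad, 1.0))
```

A check of this kind is one simple way to decide whether a candidate coefficient vector lies in the feasible set before it is used to warp the dynamics.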
III-B Equivalent Formulation under a Unified Time Axis
Substituting the parametric time-warping function in (9) into both the robot’s dynamics (1) and the control objective function (2), we obtain the following time-warped dynamics
(10) 
and the time-warped objective function
(11) 
Here, the left side of (10) is due to the chain rule: , and the time horizon satisfies (note that is specified by the expert). For notational simplicity, we write , , , and . Then, the above time-warped dynamics (10) and time-warped objective function (11) are rewritten as:
(12a) 
and
(12b) 
respectively. We concatenate the unknown objective function parameter vector and unknown timewarping function parameter vector as
(13) 
For a choice of , the time-warped optimal trajectory resulting from solving the above time-warped optimal control system (12) is rewritten as
(14) 
with . The trajectory distance loss in (7) to be minimized can now be defined as
(15) 
Minimizing the above loss function in (15) over the unknown parameter vector is a process of simultaneously learning the control objective function in (2) and the time-warping function in (9).
In summary, the problem of interest is reformulated as an optimization problem of jointly learning the objective function in (2) and the time-warping function in (9):
(16)  
s.t. 
Here defines a feasible domain of variable , ; the constraint in optimization (16) says that is an optimal trajectory generated by the optimal control system (12) with the control objective function (12b) and dynamics (12a). In the next section, we will focus on developing a new learning algorithm to efficiently solve the above optimization problem.
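The loss being minimized can be illustrated by evaluating it for a given trajectory. The sketch below uses the equivalent form of (7): a robot-time trajectory is sampled at the warped demonstration times and compared with the waypoints. It assumes a squared Euclidean distance, linear interpolation between trajectory samples, and a trajectory stored as arrays; the function name `sparse_loss` and all numerical values are hypothetical.

```python
import numpy as np

def sparse_loss(traj_times, traj_outputs, demo_times, demo_waypoints, warp):
    """Sum of squared distances between waypoints and the trajectory at warped times.

    traj_times:     (M,) increasing robot-time grid
    traj_outputs:   (M, d) outputs of the reproduced trajectory on that grid
    demo_times:     (N,) demonstration time instances
    demo_waypoints: (N, d) desired outputs
    warp:           callable mapping demonstration time to robot time
    """
    traj_outputs = np.asarray(traj_outputs, dtype=float)
    robot_times = warp(np.asarray(demo_times, dtype=float))
    # Interpolate each output dimension at the warped time instances.
    sampled = np.column_stack([
        np.interp(robot_times, traj_times, traj_outputs[:, k])
        for k in range(traj_outputs.shape[1])
    ])
    diff = sampled - np.asarray(demo_waypoints, dtype=float)
    return float(np.sum(diff ** 2))

# Toy trajectory with outputs (t, 2t) and a linear warp t = 2 * tau.
t_grid = np.linspace(0.0, 4.0, 41)
outputs = np.column_stack([t_grid, 2.0 * t_grid])
waypoints = np.array([[1.0, 2.0], [2.0, 5.0]])
loss = sparse_loss(t_grid, outputs, [0.5, 1.0], waypoints, lambda tau: 2.0 * tau)
print(loss)
```

In the learning problem itself the trajectory is not fixed: it is the solution of the optimal control system for the current parameter, which is why the gradient of the trajectory with respect to the parameter is needed.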
IV Proposed Learning Algorithm
IV-A Algorithm Overview
To solve the optimization (16), we start with an arbitrary initial guess , and apply the gradient descent
(17) 
where is the iteration index; is the step size (or learning rate); is a projection operator to guarantee the feasibility of in each update, e.g., ; and denotes the gradient of the given loss function (15) directly with respect to evaluated at . Applying the chain rule to the gradient term, we have
(18) 
where is the gradient of the single point distance loss defined in (15) with respect to the time trajectory point, , evaluated at point , and is the gradient of the time trajectory point, , with respect to the parameter vector evaluated at value . From (17) and (18), we can note that at each iteration , the update of the parameter includes the following three steps:
Step 1: Solve the optimal control system (12) under the current parameter estimate to obtain the optimal trajectory, and evaluate the trajectory loss (15).
Step 2: Compute the two gradient terms in (18): the gradient of the loss with respect to the trajectory points, and the gradient of the trajectory points with respect to the parameter vector.
Step 3: Combine the two terms by the chain rule (18) and update the parameter estimate using (17).
The interpretation of the above procedure is straightforward: In each update , first, with the current parameter estimate , the optimal control system (12) produces an optimal trajectory , and the corresponding trajectory loss (that is, the distance to the given sparse demonstrations) is computed; second, the current gradient of the trajectory loss with respect to , , is solved; finally, this gradient is used to update the current estimate for the next iteration .
In Step 1 of the learning procedure, the optimal trajectory for the current parameter estimate is solved using any available optimal control solver, such as CasADi [3]. In Step 2, the gradient quantities can be readily computed by directly differentiating the given trajectory loss function (15). The main challenge, however, lies in how to obtain the gradient , that is, the gradient of the system optimal trajectory with respect to the parameter for the optimal control system (12). In what follows, we show how to compute it efficiently using the proposed differential Pontryagin’s Maximum Principle. In the following, we suppress the iteration index for notational simplicity.
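The chain rule (18) and the projected update (17) can be sketched as follows. This is a minimal sketch under stated assumptions: the array layout, the function name `projected_gradient_step`, and the box-shaped feasible set used for the projection are illustrative choices, and the two gradient arrays are taken as given inputs rather than computed from an optimal control solver.

```python
import numpy as np

def projected_gradient_step(theta, dl_dxi, dxi_dtheta, step, lower, upper):
    """One projected gradient update of the parameter vector.

    theta:      (r,) current parameter estimate
    dl_dxi:     (N, d) gradient of each point loss w.r.t. its trajectory point
    dxi_dtheta: (N, d, r) gradient of each trajectory point w.r.t. theta
    lower, upper: bounds of the (assumed) box-shaped feasible set
    """
    # Chain rule: sum over waypoints of (dl/dxi) times (dxi/dtheta).
    grad = np.einsum("nd,ndr->r", dl_dxi, dxi_dtheta)
    # Gradient step followed by projection onto the feasible box.
    return np.clip(theta - step * grad, lower, upper)

theta = np.array([1.0, 0.5])
dl_dxi = np.array([[1.0], [2.0]])                # two waypoints, scalar output
dxi_dtheta = np.array([[[1.0, 0.0]], [[0.5, 1.0]]])
theta_next = projected_gradient_step(theta, dl_dxi, dxi_dtheta, 0.3, 0.0, 5.0)
print(theta_next)
```

In this example the unconstrained update would make the second component negative, so the projection moves it back to the boundary of the feasible set.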
IV-B Differential Pontryagin’s Maximum Principle
Consider the system optimal trajectory in (14) produced by the optimal control system (12) under a fixed choice of . Pontryagin’s Maximum Principle [34] states an optimality condition that the optimal trajectory must satisfy. To present Pontryagin’s Maximum Principle, we define the Hamiltonian:
(19) 
where is called the costate or adjoint variable for . According to Pontryagin’s Maximum Principle, there exists a costate trajectory
(20) 
which is associated with the optimal trajectory in (14), such that the following conditions hold:
(21a)  
(21b)  
(21c)  
(21d) 
In fact, given , one can always solve for the corresponding costate trajectory by integrating the ODE (21b) backward in time with the terminal condition given by (21d).
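For reference, conditions of this kind take the following standard form when written out in assumed notation. The symbols $H$, $c$, $h$, $f$, $\lambda$, $\theta$, and $T$ are illustrative choices for this sketch, and the sign convention is the common one for cost minimization, which may differ from the convention used in (19) and (21). Here $f$ and $c$ stand for the time-warped dynamics and running cost in (12), which may also depend explicitly on $\tau$:

```latex
\begin{aligned}
H(x, u, \lambda, \theta) &= c(x, u, \theta) + \lambda^{\top} f(x, u, \theta),\\
\frac{dx}{d\tau} &= \frac{\partial H}{\partial \lambda} = f(x, u, \theta), \qquad x(0) = x_0,\\
-\frac{d\lambda}{d\tau} &= \frac{\partial H}{\partial x},\\
0 &= \frac{\partial H}{\partial u},\\
\lambda(T) &= \frac{\partial h}{\partial x}\big(x(T), \theta\big).
\end{aligned}
```

Under this convention, the costate equation is linear in $\lambda$ once the optimal state and input are known, which is why it can be integrated backward from the terminal condition.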
Recall that our technical challenge in the previous part is to obtain the gradient . Towards this goal, we differentiate the above Pontryagin’s Maximum Principle equations in (21) on both sides with respect to the parameter , which yields the following differential Pontryagin’s Maximum Principle
(22a)  
(22b)  
(22c)  
(22d) 
Here the coefficient matrices in (22) are defined as
(23a)  
(23b)  
(23c)  
(23d) 
Once we obtain the optimal trajectory and the associated costate trajectory in (20), all the above coefficient matrices in (23) are known and their computation is straightforward. Using these matrices (23) and (22), the lemma below presents an iterative method to solve the gradient .
Lemma 1.
If in (23c) is invertible for all , define the following differential equations for matrix variables and :
(24a)  
(24b) 
with and Here, is identity,
(25a)  
(25b)  
(25c)  
(25d)  
(25e) 
are all known given (23). Then, the gradient of the optimal trajectory at any time instance , denoted as
(26) 
is obtained by integrating the following equations up to :
The proof of Lemma 1 is given in the Appendix. Lemma 1 states that for the optimal control system (12), the gradient of its optimal trajectory (the trajectory satisfying Pontryagin’s Maximum Principle) with respect to parameter can be obtained in two steps: first, integrate (24) backward in time to obtain matrices and for ; and second, obtain by integrating (27). With the differential Pontryagin’s Maximum Principle, Lemma 1 thus provides an efficient way to obtain the gradient of the optimal trajectory with respect to the unknown parameters in an optimal control system. By Lemma 1, one can obtain the derivative of any trajectory point , for any , along the optimal trajectory , with respect to the parameter , .
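As a worked illustration of differentiating the maximum principle, consider a toy problem chosen here for its closed-form solution (it is not one of the paper’s examples, and all symbols are assumptions of this sketch): the scalar system $\dot{x} = u$, $x(0) = x_0$, with cost $\int_0^T (p\,x^2 + u^2)\,dt$, $p > 0$, and no final cost.

```latex
\begin{aligned}
&H = p\,x^2 + u^2 + \lambda u,
\qquad \dot{x} = u,
\qquad \dot{\lambda} = -2 p\,x,
\qquad 0 = 2u + \lambda,
\qquad \lambda(T) = 0,\\[4pt]
&\Rightarrow\ \ddot{x} = p\,x,\quad \dot{x}(T) = 0
\ \Rightarrow\
x_p(t) = x_0\,\frac{\cosh\!\big(\sqrt{p}\,(T - t)\big)}{\cosh\!\big(\sqrt{p}\,T\big)}.\\[4pt]
&\text{Differentiating each condition with respect to } p,\ \text{with }
X = \frac{\partial x}{\partial p},\ U = \frac{\partial u}{\partial p},\ \Lambda = \frac{\partial \lambda}{\partial p}:\\
&\dot{X} = U,
\qquad \dot{\Lambda} = -2x_p - 2p\,X,
\qquad 0 = 2U + \Lambda,
\qquad X(0) = 0,
\qquad \Lambda(T) = 0,\\[4pt]
&\Rightarrow\ \ddot{X} = p\,X + x_p(t),\qquad X(0) = 0,\qquad \dot{X}(T) = 0.
\end{aligned}
```

The differentiated conditions form a linear two-point boundary value problem in $(X, \Lambda)$ whose coefficients depend only on the already-computed optimal trajectory. This is the kind of structure that a backward integration followed by a forward integration, as in Lemma 1, is suited to solve.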
Based on Lemma 1, we summarize the overall algorithm to solve the optimization problem (16) in Algorithm 1.
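The overall learning loop can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions and not the paper’s algorithm listing: the optimal control problem is the scalar one with dynamics x' = u and cost equal to the integral of p x^2 + u^2 over [0, T], whose optimal state has a closed form, so the closed-form trajectory and its analytic derivative stand in for the optimal control solver and for the gradient computation of Lemma 1. No time warping is learned here, and all names and numerical values are hypothetical.

```python
import numpy as np

X0, T = 1.0, 1.0  # initial state and horizon of the toy problem (assumed values)

def optimal_state(p, t):
    """Closed-form optimal state for: min integral of (p x^2 + u^2), x' = u, x(0) = X0."""
    s = np.sqrt(p)
    return X0 * np.cosh(s * (T - t)) / np.cosh(s * T)

def optimal_state_grad(p, t):
    """Analytic derivative of optimal_state with respect to p."""
    s = np.sqrt(p)
    num = ((T - t) * np.sinh(s * (T - t)) * np.cosh(s * T)
           - np.cosh(s * (T - t)) * T * np.sinh(s * T))
    return X0 * num / np.cosh(s * T) ** 2 / (2.0 * s)

def learn(way_t, way_y, p_init=1.0, step=2.0, iters=2000, p_min=1e-3, p_max=100.0):
    p = p_init
    for _ in range(iters):
        # Step 1: reproduce the trajectory at the waypoint times.
        residual = optimal_state(p, way_t) - way_y
        # Step 2: chain rule, loss gradient times trajectory gradient.
        grad = np.sum(2.0 * residual * optimal_state_grad(p, way_t))
        # Step 3: gradient step with projection onto [p_min, p_max].
        p = float(np.clip(p - step * grad, p_min, p_max))
    return p

way_t = np.array([0.25, 0.5, 0.75, 1.0])   # sparse time instances
way_y = optimal_state(4.0, way_t)          # waypoints generated with p = 4
p_hat = learn(way_t, way_y)
print(p_hat)
```

Because the waypoints are generated by a parameter inside the search range, the loop recovers it; with waypoints that no parameter reproduces exactly, the same loop would instead settle at the parameter with the smallest loss.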
V Numerical Examples
We demonstrate the proposed approach using two systems: (i) an inverted pendulum, and (ii) 6-DoF UAV maneuvering control. We compare the proposed method with related work.
V-A Inverted Pendulum
The dynamics of the pendulum is
(28) 
with being the angle between the pendulum and direction of gravity, is the torque applied to the pivot, m, kg, and are the length, mass, and damping ratio of the pendulum, respectively. We define the state and control variables of the pendulum system as and , respectively, and set the initial state . For the inverted pendulum control, we set the parameterized cost function in (2) as
(29)  
with the parameter vector to be determined. For the parametric time-warping function (9), we simply use a linear function:
(30) 
with (we will discuss the use of more complex time-warping functions later). The overall parameter vector to be determined is .
The output function (4) is set as which means that the expert only provides the position information, not including the velocity information. For the trajectory loss function in (15), we use the norm to quantify the distance measure:
(31) 
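For readers who want to experiment with a system of this kind, a damped pendulum can be simulated as sketched below. This is a sketch under stated assumptions: the equation of motion is the standard torque-driven damped pendulum with the angle measured from the direction of gravity, and the length, mass, damping, and gravity values are placeholders rather than the values used in the experiments. The output keeps only the angle, mirroring the position-only output function described above.

```python
import numpy as np

def pendulum_rhs(state, torque, length=1.0, mass=1.0, damping=0.1, gravity=9.81):
    """Assumed dynamics: m l^2 * alpha'' = u - m g l sin(alpha) - d * alpha'."""
    alpha, alpha_dot = state
    alpha_ddot = (torque - mass * gravity * length * np.sin(alpha)
                  - damping * alpha_dot) / (mass * length ** 2)
    return np.array([alpha_dot, alpha_ddot])

def rk4_step(state, torque, dt):
    """One classical Runge-Kutta step with the torque held constant."""
    k1 = pendulum_rhs(state, torque)
    k2 = pendulum_rhs(state + 0.5 * dt * k1, torque)
    k3 = pendulum_rhs(state + 0.5 * dt * k2, torque)
    k4 = pendulum_rhs(state + dt * k3, torque)
    return state + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def simulate(state0, torques, dt):
    """Roll out the state sequence for a sequence of piecewise-constant torques."""
    states = [np.asarray(state0, dtype=float)]
    for u in torques:
        states.append(rk4_step(states[-1], u, dt))
    return np.array(states)

def energy(state, length=1.0, mass=1.0, gravity=9.81):
    """Kinetic plus potential energy, zero at the hanging rest position."""
    alpha, alpha_dot = state
    return 0.5 * mass * length ** 2 * alpha_dot ** 2 + mass * gravity * length * (1.0 - np.cos(alpha))

states = simulate([0.5, 0.0], np.zeros(500), dt=0.01)  # free swing for 5 s
angles = states[:, 0]                                   # position-only output
print(energy(states[0]), energy(states[-1]))
```

With zero torque, the damping term removes energy over time, which gives a quick sanity check on the integrator before it is embedded in an optimal control solver.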
V-A1 Known Ground Truth
First, we generate sparse demonstrations to test the proposed method when the true objective function and time-warping function are both known. Specifically, we set the true parameter , based on which we generate the trajectory by solving the optimal control problem (12). Then, we pick some points as the sparse demonstrations , listed in Table I. We want to see whether the proposed method can correctly learn from these sparse points. Given the sparse waypoints in Table I, we apply Algorithm 1 to learn the parameter by solving (16). In Algorithm 1, we set the learning rate , and initialize the parameter randomly.
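Table I: Sparse demonstrations for the known ground truth case. Columns: demonstration time instance (s) and waypoint, for five waypoints; the final row gives the time horizon (s).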
We plot the loss value in (31) versus the number of iterations in Fig. 2. The result shows that as the iteration number increases, the loss decreases quickly and finally converges to zero. This indicates that the trajectory gradually approaches the sparse demonstrations and finally passes through them. This convergence is also illustrated in the right panel of Fig. 2, where we plot the pendulum’s (time-warped) trajectory in each iteration; the color going from light to dark gray corresponds to increasing iteration number, and the red dots indicate the sparse demonstrations. As shown by the results, the initial trajectory (lightest gray) is far from the sparse demonstrations, and as updates, the trajectory (with increasingly dark colors) approaches and finally passes through the waypoints (i.e., the converged loss is zero). To illustrate whether the parameters converge to the ground truth , we define the following parameter error: and plot the parameter error versus the number of iterations in Fig. 3, from which we note that as the number of iterations increases, converges to zero, indicating that the true parameter of the objective and time-warping functions is successfully learned.
V-A2 Non-realizable Case
In this case, we use random sparse demonstrations, where the waypoints are sampled from a uniform distribution centered at the ones in Table I. The randomness of the given sparse demonstrations means that an exact objective function (whose optimal trajectory passes exactly through the sparse demonstrations) may not exist within the given parameterized function set in (29) because of its limited expressive power. The random sparse demonstrations are listed in Table II, and the other settings are the same as in the previous case. The learning results are shown in Fig. 4. The results show that as the number of iterations increases, the loss value (31) decreases and converges to a value of , but not to zero. This is because the waypoints are randomly given, and thus there does not exist such that the corresponding system trajectory passes exactly through the given waypoints. It shows that the proposed method can still find the ‘best’ objective function and the ‘best’ time-warping function within the parametric function sets, which leads the reproduced trajectory to be closest to the waypoints in the sense of having the minimal distance loss (7), as shown in the right panel of Fig. 4.
Table II: Random sparse demonstrations for the non-realizable case. Columns: demonstration time instance (s) and waypoint, for five waypoints; the final row gives the time horizon (s).
V-A3 Different Parametric Time-Warping Functions
In this case, we test the performance of the method using different parametric time-warping functions. The sparse demonstrations are in Table III, where the demonstration time labels are infeasible for the pendulum actuation. The other experimental settings are the same as the previous cases, except that we use the parametric polynomial time-warping function (9) with different degrees . We summarize in Table IV the learned time-warping function and the obtained minimal loss value of (31), i.e., .
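Table III: Sparse demonstrations with demonstration time labels that are infeasible for the pendulum actuation. Columns: demonstration time instance (s) and waypoint, for five waypoints; the final row gives the time horizon (s).
Table IV: Learned time-warping function and minimal loss value for each polynomial degree.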
As shown in Table IV, more complex time-warping functions lead to a lower minimal loss value of . This is understandable because using a higher-degree polynomial introduces additional degrees of freedom, which contribute to further decreasing the loss by generating a ‘more-deformed’ time axis. Also from a system perspective, if we look at the entire parameterized optimal control system (12), use of a higher-degree polynomial time-warping function makes the parameterized system more expressive, achieving a lower loss on the same training data.
From Table IV, we further observe that the first-order terms in all learned time-warping polynomials are approximately the same, and the higher-order terms are relatively small compared to the first-order term and do not contribute significantly to lowering the final training loss. This indicates that the first-order term dominates the time scale difference between the demonstration and the robot’s execution, because here is small and the higher-order terms are thus not significant compared to the first-order term. In the following experiments, we therefore only use first-order polynomial time-warping functions.
V-A4 Neural Objective Functions
Instead of using parameterization (29), we here represent the objective function using a neural network and aim to learn a neural objective function. We again test this on the inverted pendulum system. Specifically, the parameterized objective function is represented as
(32) 
where is a 2-2-1 fully-connected neural network with activation functions [30] (i.e., a 2-neuron input layer, a 2-neuron hidden layer, and a 1-neuron output layer), and is the parameter vector of the neural network, that is, the weight matrices and bias vectors. The time-warping polynomial is first-order as in (30) and the loss function is (31). We use the sparse demonstration data in Table III, and the learning rate is set as . We plot the learning results in Fig. 5, which shows that the proposed approach can successfully learn a neural objective function from sparse demonstrations, such that the pendulum’s reproduced trajectory is close to the given waypoints in Euclidean distance.
In the left panel of Fig. 5, the converged loss is , which is lower than the loss of in Table IV for the weighted distance parameterization (29). This difference can also be seen by comparing the right panel of Fig. 5 with the one in Fig. 4. The lower loss here is because the neural network representation is more expressive than the weighted distance parameterization. The results in Fig. 5 demonstrate the capability of the proposed method to learn complex parametric objective functions, and show the utility of the method when a knowledge-based parametric objective function is not readily available.
However, despite the convenience of using universal neural network objective functions, an appropriate structure and hyperparameters for the neural network (such as the number of layers and neurons and the type of activation functions) still need to be chosen. In our empirical experience, neural objective functions have further drawbacks, including a lack of physical interpretability of the learned results, more iterations needed to reach convergence (as shown empirically in the left panel of Fig. 5), and a tendency to get trapped in locally optimal solutions. In Section VII, we will provide a further analysis of the choice of parametric objective functions.
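A 2-2-1 fully-connected network of the kind described above has nine parameters, and its forward pass can be sketched as follows. This is a minimal sketch under stated assumptions: the tanh activation, the use of the two-dimensional pendulum state as the network input, the ordering of the flattened parameter vector, and the names `unpack` and `neural_cost` are all choices made for this illustration.

```python
import numpy as np

N_PARAMS = 2 * 2 + 2 + 1 * 2 + 1  # hidden weights and biases, output weights and bias

def unpack(p):
    """Split a flat parameter vector into weight matrices and bias vectors."""
    p = np.asarray(p, dtype=float)
    W1 = p[0:4].reshape(2, 2)   # input layer to hidden layer
    b1 = p[4:6]
    W2 = p[6:8].reshape(1, 2)   # hidden layer to output
    b2 = p[8:9]
    return W1, b1, W2, b2

def neural_cost(state, p):
    """Scalar output of the 2-2-1 network for a two-dimensional input."""
    W1, b1, W2, b2 = unpack(p)
    hidden = np.tanh(W1 @ np.asarray(state, dtype=float) + b1)  # assumed activation
    return float((W2 @ hidden + b2)[0])

p_example = np.array([1.0, 0.0, 0.0, 1.0,   # W1 = identity
                      0.0, 0.0,             # b1
                      1.0, 1.0,             # W2
                      0.5])                 # b2
value = neural_cost([1.0, 0.0], p_example)
print(N_PARAMS, value)
```

Keeping all weights and biases in one flat vector matches the way the learning algorithm treats the objective function parameter, as a single vector updated by gradient descent.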
VB Comparison with other Methods
VB1 Comparison with Learning From KeyFrames [2]
We first compare the proposed method with the method of learning from keyframe demonstrations developed in [2]. As discussed in the related work, this is a policylearning based method: a Gaussian mixture model (GMM) is first learned from keyframe demonstrations, based on which a trajectory is then reproduced using Gaussian mixture regression (GMR). In this comparison experiment, we use the inverted pendulum system with the same setting as in Section VA1. Here, we provide 20 waypoints (with the time instances evenly populated over ; we find that a smaller number of waypoints leads to failure of the GMM method). During trajectory reproduction, we set a new time duration (note that the training data uses ) to test the generalization performance of each method. Comparison results are plotted in Fig. 6, where we also plot the groundtruth for reference.
From Fig. 6, we observe that in an unseen setting (here, a longer time horizon), our method produces a trajectory much closer to the ground truth than [2]. This indicates that the proposed method generalizes better to unseen settings (or long-horizon tasks). In fact, better generalization is one of the general advantages of objective function learning over policy learning, as discussed in [24].
V-B2 Comparison with Numerical Gradient Descent
Here, we compare the proposed method with direct gradient descent, where the gradient is estimated numerically. Specifically, in each update we use numerical differentiation to approximate the gradient . The experiment uses the pendulum system with the same settings as in Section V-A. We consider two cases: the first uses the sparse demonstration data in Table I, and the second uses the sparse demonstration data in Table II. The comparison results are shown in Fig. 7.
From Fig. 7, we observe that the proposed method has a clear advantage in terms of lower training loss and faster convergence. Numerical gradient descent is effective in this case but less accurate, owing to the error introduced by the gradient approximation. Because of this approximation error, the loss does not descend along the ‘steepest’ direction, which leads to slower convergence. Here, the optimization variable is low-dimensional, so the numerical gradient is relatively easy to compute and numerical gradient descent works. For high-dimensional tasks, as we will show below, we found that numerical gradient descent is prone to fail because of inaccurate gradient estimation.
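The numerical-gradient baseline discussed above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the central-difference scheme, the step size `eps`, the learning rate, and the quadratic `toy_loss` (a stand-in for the trajectory loss) are all assumptions made for the example.

```python
def numerical_gradient(loss, theta, eps=1e-5):
    """Central-difference estimate of the gradient of loss at theta (list of floats)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        # Two extra loss evaluations per parameter dimension.
        grad.append((loss(up) - loss(down)) / (2.0 * eps))
    return grad


def gradient_descent(loss, theta0, lr=0.1, iters=200, eps=1e-5):
    """Plain gradient descent driven by the numerical gradient estimate."""
    theta = list(theta0)
    for _ in range(iters):
        g = numerical_gradient(loss, theta, eps)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def toy_loss(theta):
    # Hypothetical stand-in for the trajectory loss: minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2


theta_hat = gradient_descent(toy_loss, [0.0, 0.0])
```

Each gradient estimate costs two loss evaluations per parameter. In the learning problem considered here, a loss evaluation would presumably require re-solving the optimal control problem for the perturbed parameter, which suggests why this baseline scales poorly with the parameter dimension.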
V-C Experiment on 6-DoF Maneuvering UAVs
Here we show the effectiveness of the proposed method on a more complex 6-DoF UAV maneuvering control system. The equations of motion of a quadrotor UAV flying in SE(3) (full position and attitude) space are given by
(33a)  
(33b)  
(33c)  
(33d) 
Here, subscripts and denote a quantity expressed in the UAV body frame and world reference frame, respectively; is the mass of the UAV; and are the position and velocity of the UAV;
is the moment of inertia of the UAV expressed in its body frame;
is the angular velocity of the UAV; is the unit quaternion [20] describing the attitude of the UAV with respect to the world frame; (33c) is the time derivative of quaternion with being the matrix notation of used for quaternion multiplication [20]; is the torque vector applied to the UAV; and is the total force vector applied to the UAV’s center of mass. The total force magnitude (along the z-axis of the UAV’s body frame) and torque are generated by the thrusts of the four rotating propellers, which can be written as
(34)
with denoting the UAV’s wing length and a fixed constant. In our experiment, the gravity constant is set as and all the other constant parameters are set to unit values. We define the state variable
(35) 
and define the control variable
(36) 
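The mapping from the four propeller thrusts to the total force magnitude and body torque described above can be illustrated with a short sketch. The rotor layout (a "plus" configuration with arms of half the wing length), the sign conventions, and the names `thrusts_to_wrench`, `wing_len`, and `c` are assumptions for illustration; the coefficients of (34) itself are not reproduced here.

```python
def thrusts_to_wrench(T, wing_len=1.0, c=1.0):
    """Map four propeller thrusts to (total force magnitude, body torque).

    Assumed layout: rotors 1 and 3 lie on the body x-axis, rotors 2 and 4 on
    the body y-axis, each at distance wing_len / 2 from the center of mass;
    adjacent rotors spin in opposite directions, and c is the fixed constant
    relating a rotor's thrust to its reaction torque about the body z-axis.
    """
    T1, T2, T3, T4 = T
    half = wing_len / 2.0
    force = T1 + T2 + T3 + T4          # acts along the body z-axis
    Mx = half * (T4 - T2)              # roll torque
    My = half * (T3 - T1)              # pitch torque
    Mz = c * (T1 - T2 + T3 - T4)       # yaw torque from rotor drag
    return force, (Mx, My, Mz)


# Equal thrusts produce pure lift and no torque under these assumptions.
hover_force, hover_torque = thrusts_to_wrench((1.0, 1.0, 1.0, 1.0))
```

With unit parameters, as in the experiment setup described above, the map is linear in the thrusts, so the control variable can be taken to be the four thrusts directly.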
To achieve SE(3) maneuvering control, we need to carefully design the attitude error. As in [23], we define the attitude error between the UAV’s current attitude and goal attitude as
(37) 
where is the direction cosine matrix corresponding to the quaternion (see [20] for more details).
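An attitude error built from direction cosine matrices can be sketched as below. The exact form of (37) is not recoverable from the text above, so the sketch assumes a commonly used trace-based error, 0.5 * trace(I - R_goal^T R); the scalar-first quaternion ordering and the function names are likewise assumptions.

```python
import math


def quat_to_dcm(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z), scalar first."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def attitude_error(q, q_goal):
    """Assumed trace-based error: 0.5 * trace(I - R(q_goal)^T R(q))."""
    R = quat_to_dcm(q)
    Rg = quat_to_dcm(q_goal)
    # trace(Rg^T R) equals the sum of elementwise products of Rg and R.
    tr = sum(Rg[i][j] * R[i][j] for i in range(3) for j in range(3))
    return 0.5 * (3.0 - tr)


identity = (1.0, 0.0, 0.0, 0.0)
# Rotation of 90 degrees about the body x-axis.
quarter_turn = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)
err = attitude_error(quarter_turn, identity)
```

Under this assumed form the error equals 1 - cos(angle), where angle is the rotation angle between the two attitudes. It is therefore zero only when the attitudes coincide, and it takes the same value for q and -q, which represent the same attitude.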
The parameterized cost function in (2) is set as
(38a)  
(38b) 
Here, , , , and are the goal position, velocity, orientation, and angular velocity, respectively; the objective function parameter vector here is
(39) 
For the parametric time-warping function, we use the first-degree polynomial as in (30). The total parameter vector to be determined is
(40) 
We set the output function in (4) as
(41) 
which means that the expert can only provide the position and attitude demonstrations for UAV maneuvering (not including velocity information).
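TABLE V: Sparse demonstrations for the UAV maneuvering experiment (five waypoints, each with a time instance in seconds, and a time horizon in seconds).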
The sparse demonstrations are given in Table V. The loss function is defined using the Euclidean distance as in (31). In Algorithm 1, we set the learning rate . We plot the learning results in Fig. 8. The results show that, as the parameter is updated at each iteration, the loss quickly diminishes to zero, meaning that the UAV’s reproduced trajectory approaches the sparse demonstrations in Table V. The right panel of Fig. 8 shows the final reproduced trajectory, which passes exactly through the given sparse demonstrations. This indicates that the method can handle more complex systems.
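To make the loss evaluation concrete, the following sketch computes a distance-based loss between a reproduced output trajectory and a set of timed waypoints. It assumes a squared Euclidean distance and a time-warping function of the form tau = beta * t (one possible first-degree polynomial); the names `sparse_waypoint_loss`, `beta`, and the constant-velocity `line` trajectory are hypothetical and serve only the illustration.

```python
def sparse_waypoint_loss(trajectory, waypoints, beta=1.0):
    """Sum of squared distances between trajectory(beta * t_i) and waypoint y_i.

    trajectory: callable mapping robot time tau to an output tuple.
    waypoints:  list of (t_i, y_i) pairs in the demonstrator's time axis.
    beta:       assumed linear time-warping coefficient, tau = beta * t.
    """
    total = 0.0
    for t_i, y_i in waypoints:
        y = trajectory(beta * t_i)
        total += sum((a - b) ** 2 for a, b in zip(y, y_i))
    return total


def line(tau):
    # Hypothetical reproduced output: constant-velocity motion in the plane.
    return (tau, 2.0 * tau)


demo = [(1.0, (2.0, 4.0)), (2.0, (4.0, 8.0))]
loss_unwarped = sparse_waypoint_loss(line, demo, beta=1.0)
loss_warped = sparse_waypoint_loss(line, demo, beta=2.0)
```

In this toy case the trajectory passes through both waypoints but at different times than demonstrated, so the loss vanishes only once the time-warping coefficient is adjusted. This illustrates why the time-warping parameter is learned jointly with the objective function parameter.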
VI Application: Learning for Obstacle Avoidance
In this section, we apply the proposed method to learning robot motion control in an environment with obstacles. Here, a human provides a few waypoints in the vicinity of the obstacles, and the robot learns a control objective function from those waypoints such that its resulting motion avoids the obstacles. We experiment on two systems: a 6-DoF maneuvering UAV and a two-link robot arm.