# Learning to control from expert demonstrations

In this paper, we revisit the problem of learning a stabilizing controller from a finite number of demonstrations by an expert. By first focusing on feedback linearizable systems, we show how to combine expert demonstrations into a stabilizing controller, provided that demonstrations are sufficiently long and there are at least n+1 of them, where n is the number of states of the system being controlled. When we have more than n+1 demonstrations, we discuss how to choose the best n+1 demonstrations to construct the stabilizing controller. We then extend these results to a class of systems that can be embedded into a higher-dimensional system containing a chain of integrators. The feasibility of the proposed algorithm is demonstrated by applying it to a CrazyFlie 2.0 quadrotor.


## 1. Introduction

### 1.1. Motivation

The usefulness of learning from demonstrations has been well-argued in the literature (see [2, 4, 41]). In the context of control, imagine that we need to design a controller for an autonomous car that prioritizes comfort of its passengers. It is not obvious how to capture the idea of comfortable driving in a mathematical expression. It is fairly straightforward, however, to collect demonstrations of comfortable driving from human drivers. There are many other control tasks where providing examples of the desired behaviour is easier than defining such behaviour formally (e.g., teaching a robot to manipulate objects). The growing research interest in learning from demonstrations (LfD) for robot control [41] reflects the need for a well-defined controller design methodology for such tasks. In this work, we propose a methodology that uses expert demonstrations to construct a stabilizing controller.

The literature contains many examples of LfD methodologies applied to robots [41]. The most popular application of LfD so far is in robotic manipulators. More specifically, LfD is used to teach manipulators skills to perform tasks in manufacturing [49], health-care [48, 28], and human-robot interaction [30, 40]. In addition, LfD has been applied with significant success to ground vehicles [10, 37], aerial vehicles [24, 1], bipedal robots [12, 32], and quadrupedal robots [26, 33]. These examples illustrate that, for these platforms, there exist control tasks for which LfD techniques are preferable to traditional control approaches.

### 1.2. Related work

In this section, we describe the previous work in learning from demonstrations to indicate where our approach lies within the existing landscape. This is in no way a comprehensive account of the literature on learning from demonstrations, but rather an overview of approaches related to ours (please refer to [41] or [27] for a description of the literature on LfD).

Policy-learning LfD methods, to which this work belongs, assume that there exists a mapping from state (or observations) to control input that dictates the expert’s behaviour. This mapping is referred to as the expert’s policy. The goal of these methods is to find (or approximate) the expert’s policy given expert demonstrations. In many machine-learning-based LfD methods, policy learning is viewed as a supervised-learning problem where states and control inputs are treated as features and labels, respectively. We refer to these methods as behavioural cloning methods. Pioneered in the 1980s by works such as [39], this class of methods is still popular today. Behavioural cloning methods are typically agnostic to the nature of the expert: demonstrations can be provided by a human (see [5, 10]), an offline optimal controller (see [29, 9]), or a controller with access to privileged state information (see [8, 24]). They do, however, require a large number of demonstrations to work well in practice and, if trained solely on data from unmodified expert demonstrations, generate unstable policies that cannot recover from drifts or disturbances [10]. The latter problem can be fixed using online meta-algorithms like DAgger [42], which ensure that training data includes observations of recoveries from perturbations. Using such algorithms, however, comes at the expense of enlarging the training dataset. Moreover, the works on behavioural cloning typically provide few formal stability guarantees and, instead, illustrate performance with experiments.

Currently, there is a concerted effort to develop policy-learning LfD methods that improve on existing techniques using tools from control theory. In that context, the work that is closest to ours is described in [36], where the authors use convex optimization to construct a linear policy that is both close to expert demonstrations and stabilizes a linear system. They guarantee that the resulting controller is optimal with respect to some quadratic cost by adding a set of constraints (originally proposed in [23]) to the optimization problem. This work has been extended in [18] to enforce other properties, such as stability, optimality, and -robustness. Our methodology differs from those in [36] and [18] because we do not assume the expert to be a linear time-invariant controller.

### 1.3. Contributions

In this paper, we propose a methodology for constructing a controller for a known nonlinear system from a finite number of expert demonstrations of desired behaviour, provided their number exceeds the number of states and the demonstrations are sufficiently long. Our approach consists of two steps:

• use feedback linearization to transform the nonlinear system into a chain of integrators;

• use affine combinations of demonstrations in the transformed coordinates to construct a control law stabilizing the original system.

The expert demonstrations are assumed to be of finite length, whereas the resulting controller is expected to control the system indefinitely, making this a non-trivial problem to address. In this paper, we formally prove that the learned controller asymptotically stabilizes the system. Furthermore, when there are more demonstrations than states, we determine which subset of demonstrations needs to be chosen to minimize the error between the trajectory of the learned controller and the trajectory of the expert controller. To demonstrate the feasibility of this methodology, we apply it to the problem of quadrotor control. Unlike [36], our methodology produces a controller that is time-varying and not linear in the original coordinates. This reflects our belief that, in many cases, the expert demonstration is produced by a nonlinear controller. We also extend the proposed methodology beyond the class of feedback linearizable systems by using the embedding technique described in [43] and demonstrate its feasibility on the classical example of the ball-and-beam system.

A preliminary version of this methodology was introduced in [46]. In [45], it was combined with the data-driven control results from [13] to learn to control unknown SISO systems from demonstrations. This paper provides a unified presentation of the results from [46], as well as several new results, such as the discussion on the optimality of the controller approximation error and the extension of the results beyond the class of feedback linearizable systems.

## 2. Problem Statement and Preliminaries

### 2.1. Notations and basic definitions

The notation used in this paper is fairly standard. The integers are denoted by $\mathbb{Z}$, the natural numbers, including zero, by $\mathbb{N}_0$, the real numbers by $\mathbb{R}$, the positive real numbers by $\mathbb{R}^+$, and the non-negative real numbers by $\mathbb{R}_0^+$. We denote by $\|x\|$ (or by $\|A\|$) the standard Euclidean norm or the induced matrix 2-norm; and by $\|A\|_F$ the matrix Frobenius norm. A set of vectors $\{x_1, \ldots, x_k\}$ in $\mathbb{R}^n$ is affinely independent if the set $\{x_2 - x_1, \ldots, x_k - x_1\}$ is linearly independent.

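A function $\alpha: \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is of class $\mathcal{K}$ if $\alpha$ is continuous, strictly increasing, and $\alpha(0) = 0$. If $\alpha$ is also unbounded, it is of class $\mathcal{K}_\infty$. A function $\beta: \mathbb{R}_0^+ \times \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is of class $\mathcal{KL}$ if, for fixed $t$, $\beta(\cdot, t)$ is of class $\mathcal{K}$ and $\beta(r, \cdot)$ decreases to $0$ as $t \to \infty$ for each fixed $r$.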

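The Lie derivative of a function $h: \mathbb{R}^n \to \mathbb{R}$ along a vector field $f: \mathbb{R}^n \to \mathbb{R}^n$, given by $\frac{\partial h}{\partial x} f$, is denoted by $L_f h$. We use the notation $L_f^k h$ for the iterated Lie derivative, i.e., $L_f^k h = L_f(L_f^{k-1} h)$, with $L_f^0 h = h$. Given open sets $U \subseteq \mathbb{R}^n$ and $V \subseteq \mathbb{R}^n$, a smooth map $\Phi: U \to V$ is called a diffeomorphism from $U$ to $V$ if it is a bijection and its inverse is smooth.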

Consider the continuous-time system:

$$\dot{x} = f(t, x), \tag{1}$$

where $x \in \mathbb{R}^n$ is the state and $f$ is a smooth function. The origin of (1) is uniformly asymptotically stable if there exist $\beta \in \mathcal{KL}$ and $c > 0$ such that, for all $\|x(t_0)\| < c$, the following is satisfied [25]:

$$\|x(t)\| \le \beta(\|x(t_0)\|, t - t_0), \quad \forall t \ge t_0 \ge 0. \tag{2}$$
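As a minimal worked illustration of the bound (2), consider the scalar system below. It is chosen here purely for illustration and does not appear in the original text.

```latex
% Illustrative example (an assumption, not from the paper): the scalar system
\dot{x} = -x
% has the explicit solution
x(t) = e^{-(t - t_0)}\, x(t_0),
% so the bound (2) holds with equality for
\beta(r, s) = r\, e^{-s}.
% For fixed s, \beta(\cdot, s) is continuous, strictly increasing, and zero at r = 0;
% for fixed r, \beta(r, \cdot) decreases to 0 as s \to \infty.
```

Because this bound holds for every initial condition and does not depend on $t_0$, the origin of this example is uniformly asymptotically stable.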

Consider the continuous-time control system:

$$\dot{x} = f(t, x, u), \tag{3}$$

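where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^m$ is the input, and $f$ is a smooth function. The system (3) is said to be input-to-state stable (ISS) if there exist $\beta \in \mathcal{KL}$ and $\gamma \in \mathcal{K}$ such that for any $x(t_0) \in \mathbb{R}^n$ and any bounded input $u$, the following is satisfied: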

$$\|x(t)\| \le \beta(\|x(t_0)\|, t - t_0) + \gamma\Big(\sup_{t_0 \le \tau \le t} \|u(\tau)\|\Big). \tag{4}$$

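Let $X = \{x_1, \ldots, x_k\}$ be a set of points in $\mathbb{R}^n$. A point $x = \sum_{i=1}^{k} \theta_i x_i$ with $\sum_{i=1}^{k} \theta_i = 1$ is called an affine combination of points in $X$. If, in addition, $\theta_i \ge 0$ for all $i \in \{1, \ldots, k\}$, then $x$ is a convex combination of points in $X$.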
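The definitions of affine independence and affine combination can be checked numerically. The sketch below assumes NumPy; the function names `is_affinely_independent` and `affine_coefficients` and the sample points are hypothetical, introduced only for illustration. It computes the weights $\theta_i$ of a point with respect to $n+1$ affinely independent points in $\mathbb{R}^n$ by stacking the point equation with the sum-to-one constraint.

```python
import numpy as np

def is_affinely_independent(points):
    """True if {x_2 - x_1, ..., x_k - x_1} is linearly independent."""
    pts = np.asarray(points, dtype=float)
    diffs = pts[1:] - pts[0]
    return np.linalg.matrix_rank(diffs) == len(pts) - 1

def affine_coefficients(points, x):
    """Solve x = sum_i theta_i x_i with sum_i theta_i = 1.

    Assumes exactly n+1 affinely independent points in R^n,
    so that the stacked linear system is square and invertible.
    """
    pts = np.asarray(points, dtype=float)
    k = pts.shape[0]
    lhs = np.vstack([pts.T, np.ones(k)])            # point equation + sum-to-one row
    rhs = np.append(np.asarray(x, dtype=float), 1.0)
    return np.linalg.solve(lhs, rhs)

# Three affinely independent points in the plane (illustrative data).
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
theta_in = affine_coefficients(points, [0.25, 0.25])   # point inside the triangle
theta_out = affine_coefficients(points, [1.0, 1.0])    # point outside the triangle

# A convex combination has only non-negative weights.
is_convex_in = bool(np.all(theta_in >= 0))
is_convex_out = bool(np.all(theta_out >= 0))
```

In this example the first point is a convex combination of the three vertices, whereas the second is only an affine combination because one weight is negative.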

### 2.2. Problem Statement

Consider a known continuous-time control-affine system:

$$\Sigma: \ \dot{x} = f(x) + g(x)u, \tag{5}$$

where and are the state and the input, respectively; and , are smooth functions. Assume that the origin is an equilibrium point of (5). We call a pair a solution of the system (5) if, for all , the equation (5) is satisfied. Furthermore, we refer to the functions and as a trajectory and a control input of the system (5).

We say that a controller is asymptotically stabilizing for the system (5) if the origin is uniformly asymptotically stable for the system (5) with . Suppose there exists an unknown asymptotically stabilizing controller , which we call the expert controller. We assume that is smooth. Our goal is to learn a controller such that having asymptotically stabilizes the origin of the system (5). Towards this goal, we use a set of finite-length expert solutions of (5), where: for each , the trajectory and the control input are smooth and satisfy for all ; is the length of a solution; and . We also ascertain that the “trivial” expert solution, wherein and for all , is included in .

###### Remark 2.1.

In practice, we can record the values of continuous solutions provided by the expert only at certain sampling instants. In this work, however, we choose to work in continuous-time to simplify the theoretical analysis. We can do this without sacrificing practical applicability because it is well-known that continuous-time controller designs can be implemented via emulation and still guarantee stability [34].

We make the assumption that the system (5) is feedback linearizable on an open set containing the origin and the expert demonstrations belong to for all . To avoid the cumbersome notation that comes with feedback linearization of multiple-input systems, we assume that , that is, the system (5) only has a single input. Readers familiar with feedback linearization can verify that all the results extend to multiple-input case, mutatis mutandis (refer to [21, Ch. 4-5] for a complete introduction to feedback linearization). In the single-input case, the system (5) is feedback linearizable on the open set if there is an output function that has relative degree , i.e., for all , for and . Moreover, the map:

$$z = \Phi(x) = \begin{bmatrix} h(x) & L_f h(x) & \cdots & L_f^{n-1} h(x) \end{bmatrix}^T, \tag{6}$$

is a diffeomorphism from to its image , i.e., the inverse exists and is also smooth. We further assume, without loss of generality, that .
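To make the relative-degree condition and the map (6) concrete, here is a small worked example. The two-state system below is hypothetical and chosen only for illustration; it is not one of the systems studied in the paper.

```latex
% Illustrative system (an assumption, not from the paper), with n = 2:
\dot{x}_1 = x_2 + x_1^2, \qquad \dot{x}_2 = u,
\qquad f(x) = \begin{bmatrix} x_2 + x_1^2 \\ 0 \end{bmatrix},
\quad g(x) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},
\quad h(x) = x_1.
% Lie derivatives of the output:
L_g h(x) = 0, \qquad L_f h(x) = x_2 + x_1^2, \qquad L_g L_f h(x) = 1 \neq 0,
% so h has relative degree 2 = n everywhere. The coordinate change (6) is
z = \Phi(x) = \begin{bmatrix} x_1 \\ x_2 + x_1^2 \end{bmatrix},
\qquad
x = \Phi^{-1}(z) = \begin{bmatrix} z_1 \\ z_2 - z_1^2 \end{bmatrix},
% which is smooth with a smooth inverse and satisfies \Phi(0) = 0.
```

Here the origin is an equilibrium of the drift, since $f(0) = 0$, and $\Phi$ is a global diffeomorphism, so this example is feedback linearizable on all of $\mathbb{R}^2$.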

## 3. Learning a stabilizing controller from n+1 expert demonstrations

Here, we describe the methodology for constructing an asymptotically stabilizing controller when . We consider the case when in Section 4.

### 3.1. Feedback linearization

Recall that using the feedback linearizability assumption, we can rewrite the system dynamics (5) in the coordinates given by (6) resulting in:

$$\begin{aligned} \dot{z}_1 &= z_2, \\ &\ \ \vdots \\ \dot{z}_{n-1} &= z_n, \\ \dot{z}_n &= a(z) + b(z)u, \end{aligned} \tag{7}$$

where $a(z) = L_f^n h(\Phi^{-1}(z))$ and $b(z) = L_g L_f^{n-1} h(\Phi^{-1}(z))$. The feedback law:

$$u = b(z)^{-1}(-a(z) + v), \tag{8}$$

further transforms the system (5) into the system given by:

$$\dot{z} = Az + Bv, \tag{9}$$

where $(A, B)$ is a Brunovsky pair.
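The following sketch, assuming NumPy, builds the Brunovsky pair of a single-input chain of $n$ integrators and evaluates the linearizing feedback (8). The helper names `brunovsky_pair` and `linearizing_input`, and the example functions `a` and `b`, are hypothetical and serve only to illustrate the construction.

```python
import numpy as np

def brunovsky_pair(n):
    """Return (A, B) for a single-input chain of n integrators."""
    A = np.diag(np.ones(n - 1), k=1)   # ones on the superdiagonal: z_i' = z_{i+1}
    B = np.zeros((n, 1))
    B[-1, 0] = 1.0                     # the new input v enters the last integrator
    return A, B

def linearizing_input(a, b, z, v):
    """Feedback law (8): u = b(z)^{-1} (-a(z) + v), for scalar b(z) != 0."""
    return (v - a(z)) / b(z)

# Hypothetical nonlinear terms for a two-state example (not from the paper).
a = lambda z: 2.0 * z[0] * z[1]
b = lambda z: 1.0

A, B = brunovsky_pair(2)
z = np.array([0.5, -1.0])
v = 0.3
u = linearizing_input(a, b, z, v)

# With this u, the last equation of (7) reduces to z_n' = v.
zn_dot = a(z) + b(z) * u
```

After the feedback, the nonlinearities `a` and `b` no longer appear in the closed-loop dynamics, which is why the demonstrations can be combined linearly in the $z$-coordinates.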

###### Remark 3.1.

The expert controller in the -coordinates is given by . The smoothness of implies that the function is also smooth.

### 3.2. Expert demonstrations

Recall that the set of demonstrations consists of solutions of the system (5). Using (6) and (8), we can represent the demonstrations in -coordinates. We denote the resulting set by , where functions and are given by:

$$z_i(t) \triangleq \Phi(x_i(t)), \tag{10}$$

$$v_i(t) \triangleq L_f^n h(x_i(t)) + L_g L_f^{n-1} h(x_i(t))\, u_i(t), \tag{11}$$

for all and for all . We define the set of demonstrations evaluated at time as:

$$\mathcal{D}_{(z,v)}(t) = \{(z_i(t), v_i(t))\}_{i=1}^{M}.$$

It can be easily verified that the demonstrations in satisfy the dynamics (9) and .

### 3.3. Constructing the learned controller

We denote by the controller learned from the expert demonstrations. We begin by partitioning time into intervals of length and indexing these intervals with . Let us construct the following matrices for :

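$$Z(t) \triangleq \begin{bmatrix} z_2(t) - z_1(t) & \cdots & z_{n+1}(t) - z_1(t) \end{bmatrix}, \tag{12}$$

$$V(t) \triangleq \begin{bmatrix} v_2(t) - v_1(t) & \cdots & v_{n+1}(t) - v_1(t) \end{bmatrix}. \tag{13}$$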

Our first attempt at constructing the learned controller, which we improve upon later, is to use the piecewise-continuous controller for all , where:

$$\hat{\kappa}(t, z(pT)) = V(t - pT)\,\zeta(p), \tag{14}$$

with $\zeta(p) = Z^{-1}(0)\, z(pT)$, and $Z$ and $V$ defined in (12) and (13), respectively.

The next lemma formally shows that an affine combination of trajectories of (9) is a valid trajectory for (9).

###### Lemma 3.2.

Suppose we are given a set of finite-length solutions of the system (9), where each is defined for , . Assume that is an affinely independent set. Then, under the control law with , the solution of the system (9) is , for , where and are defined in (12) and (13), respectively.

###### Proof.

This lemma can be verified by substitution. ∎

###### Remark 3.3.

Affine independence of the set is a generic property, i.e., this is true for almost all expert demonstrations. In practice, if this set is not affinely independent, a user can eliminate the affinely dependent demonstrations and request the expert to provide additional demonstrations.

We note, however, that the control law (14) samples the state with a sampling time and essentially operates in open loop in between these samples. To allow for closed-loop control, we propose the improved controller that has, for all , the following form:

$$v(t) = \hat{\kappa}(t, z(t)) = V(t - pT)\,\zeta(p, t), \qquad \zeta(p, t) = Z^{-1}(t - pT)\, z(t). \tag{15}$$

In the absence of uncertainties and disturbances, by Lemma 3.2, the coefficients satisfy:

$$\zeta(p, t) = Z^{-1}(t - pT)\, z(t) = Z^{-1}(0)\, z(pT), \tag{16}$$

i.e., the controller (15) applies the input equal to that applied by the controller (14).
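The closed-loop controller (15) can be sketched for a double integrator ($n = 2$) as follows. Everything specific here is an assumption made for illustration: the expert is taken to be a hypothetical linear law $v = -Kz$ so that demonstrations have a closed form, demonstration 1 is taken to be the trivial solution ($z_1 \equiv 0$, $v_1 \equiv 0$), and the helper `expm` is a simple truncated-series matrix exponential written for this sketch rather than a library routine.

```python
import numpy as np

def expm(M, terms=20, squarings=8):
    """Matrix exponential via scaling-and-squaring of a truncated Taylor series."""
    S = M / (2 ** squarings)
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ S / k
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

# Chain of two integrators in Brunovsky form, eq. (9).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])

# Hypothetical linear expert v = -K z (closed-loop poles at -1 and -2).
K = np.array([[2.0, 3.0]])
A_cl = A - B @ K

# Initial states of demonstrations 2 and 3 (columns); demonstration 1 is trivial.
Z0 = np.array([[1.0, 0.0], [0.0, 1.0]])

def Z(t):
    """Eq. (12) with z_1 = 0: columns are z_2(t) and z_3(t)."""
    return expm(A_cl * t) @ Z0

def V(t):
    """Eq. (13) with v_1 = 0: expert inputs along the demonstrations."""
    return -K @ Z(t)

def learned_controller(t, z, T):
    """Eq. (15): v = V(t - pT) Z^{-1}(t - pT) z(t) on the p-th interval."""
    p = np.floor(t / T)
    s = t - p * T
    zeta = np.linalg.solve(Z(s), z)    # affine weights of the current state
    return (V(s) @ zeta).item()

z_now = np.array([0.3, -0.7])
v_learned = learned_controller(1.7, z_now, T=1.0)
v_expert = (-K @ z_now).item()
```

For a linear expert the learned input coincides with the expert input, since $V(s)Z^{-1}(s) = -K$. For a nonlinear expert the two generally differ away from the demonstrations.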

### 3.4. Stability of the learned controller

Assuming (16) holds, the system (9) in closed loop with (15) has the following form:

$$\dot{z} = Az + B V(t - pT) Z^{-1}(0)\, z(pT), \tag{17}$$

for all . Integrating the dynamics, we show that the sequence satisfies:

$$z((p+1)T) = \Psi(T)\, z(pT), \tag{18}$$

where:

$$\Psi(T) \triangleq e^{AT} + \int_0^T e^{A(T - \tau)} B V(\tau) Z^{-1}(0)\, d\tau. \tag{19}$$

By adopting a term from Floquet’s theory, we refer to $\Psi(T)$ in (19) as the closed-loop monodromy matrix [22].

This section’s main result provides sufficient conditions for asymptotic stability of system (5) in closed loop with (8)-(15).

###### Theorem 3.4.

Consider the system (5) and assume it is feedback linearizable on an open set containing the origin. Let and suppose we are given a finite set of demonstrations generated by the system (5), in closed loop with a smooth asymptotically stabilizing controller , and satisfying for all . Assume that is affinely independent for all . Then, there is a such that for all , the origin of system (5) in closed-loop with controller (8)-(15) is uniformly asymptotically stable.

###### Proof.

The asymptotic stability of (5) and (9) are equivalent on and [44], and, therefore, the set given by (10) and (11) also consists of asymptotically stable solutions, i.e., there exists such that for all :

$$\|z_i(t)\| \le \beta(\|z_i(0)\|, t), \quad \forall t \in \mathbb{R}_0^+. \tag{20}$$

Consider the closed-loop system (17). By Lemma 3.2:

$$z((p+1)T) = Z(T) Z^{-1}(0)\, z(pT), \quad \forall T \in \mathbb{R}_0^+.$$

Combining this with (18) implies that:

$$\Psi(T) = Z(T) Z^{-1}(0). \tag{21}$$

We claim that, for any constants , there exists such that for all . This claim will be shown using an argument similar to that of the proof of Lemma 16 in [14]. Using Lemma 4.3 from [47], there exist class functions such that for all . Let . Define, for all , to be the solution of and obtain:

$$t(r) = -\log \frac{\sigma_1^{-1}(c - \varepsilon)}{\sigma_2(r)}.$$

Since is a continuous function and is compact, the extreme value theorem implies that is well-defined. For all , it is true that:

$$\beta(r, t^*) \le \sigma_1\big(\sigma_2(r)\, e^{-t^*}\big) \le c - \varepsilon.$$

Using the previous claim with , and , we conclude the existence of such that, for all , the following inequality holds:

$$\beta(\|z_i(0)\|, T) < \frac{1}{2\sqrt{n}\, \|Z^{-1}(0)\|},$$

for all . Therefore, by (20), for all and for all , we have:

$$\|z_i(T)\| < \frac{1}{2\sqrt{n}\, \|Z^{-1}(0)\|}. \tag{22}$$

Using (21) and (22), for all , we have:

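$$\begin{aligned} \|\Psi(T)\| &\le \|Z(T)\|\, \|Z^{-1}(0)\| \le \|Z(T)\|_F\, \|Z^{-1}(0)\| \\ &= \Big( \sum_{i=2}^{n+1} \|z_i(T) - z_1(T)\|^2 \Big)^{\frac{1}{2}} \|Z^{-1}(0)\| \\ &< \frac{\sqrt{n}}{\sqrt{n}\, \|Z^{-1}(0)\|} \cdot \|Z^{-1}(0)\| = 1. \end{aligned} \tag{23}$$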

According to stability conditions for linear discrete-time systems (see Theorem 10.9 in [38]), the equation (23) implies that, for all , the system (18) is uniformly exponentially stable. From [22], we know that uniform exponential stability of the sampled-data system (18) implies uniform exponential stability of the system (9)-(15) because the matrices are bounded for . Uniform asymptotic stability of the origin for the system (9)-(15) in the -coordinates implies uniform asymptotic stability of the origin for the feedback equivalent system (5)-(8)-(15) in -coordinates [44]. ∎

###### Remark 3.5.

Theorem 3.4 shows the existence of such that for all . In practice, a user can determine satisfying this condition by directly computing for various .
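The direct computation suggested in this remark can be sketched from sampled demonstrations using the identity $\Psi(T) = Z(T)Z^{-1}(0)$ from (21). The sketch assumes NumPy. The nonlinear expert law, the initial conditions, and the fixed-step RK4 integrator are hypothetical stand-ins for recorded expert data, and demonstration 1 is assumed to be the trivial solution so that the columns of $Z$ are the recorded trajectories themselves.

```python
import numpy as np

def expert(z):
    # Hypothetical nonlinear stabilizing expert for a double integrator.
    return -z[0] - 2.0 * z[1] - z[0] ** 3

def f(z):
    # Closed-loop dynamics (9) under the expert: z1' = z2, z2' = v.
    return np.array([z[1], expert(z)])

def rk4_demo(z0, dt, steps):
    """Generate one sampled demonstration with fixed-step RK4."""
    traj = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        z = traj[-1]
        k1 = f(z)
        k2 = f(z + 0.5 * dt * k1)
        k3 = f(z + 0.5 * dt * k2)
        k4 = f(z + dt * k3)
        traj.append(z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)              # shape (steps + 1, n)

dt, steps = 0.01, 1000
demos = [rk4_demo(z0, dt, steps) for z0 in ([1.0, 0.0], [0.5, 1.0])]

# Z[k] has the demonstrations at time k*dt as columns (z_1 = 0 assumed).
Z = np.stack(demos, axis=2)
Z0_inv = np.linalg.inv(Z[0])

times = dt * np.arange(steps + 1)
norms = np.array([np.linalg.norm(Zk @ Z0_inv, 2) for Zk in Z])

# Smallest sampled T after which ||Psi(T)|| stays below 1 on the recorded horizon.
last_bad = np.max(np.nonzero(norms >= 1.0)[0], initial=-1)
T_est = times[last_bad + 1] if last_bad < steps else None
```

The estimate `T_est` is only meaningful on the recorded horizon: the check cannot say anything about interval lengths longer than the demonstrations.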

###### Remark 3.6.

The fact that we assume feedback linearizability on some open set presents the user with the opportunity to use either local or global feedback linearization results, depending on what their application allows for. We recommend [21] as a good starting point to find conditions for both local (see Theorem 4.2.3 in [21]) and global (see Theorem 9.1.1 in [21]) feedback linearizability.

###### Remark 3.7.

In Theorem 3.4, we guarantee that the learned controller stabilizes the system at the origin. This result can also be useful when the objective of the learned controller is to track a trajectory. The key idea is to recast the problem of trajectory tracking into that of stabilizing the error dynamics (see Section 4.5 in [21]). We consider this generality of the learned controller to be a strength of this approach. We will experimentally illustrate this in Section 6.1.

###### Remark 3.8.

Although we assume in this work an exact knowledge of the state, in most applications, the state is estimated via an observer. Depending on the design of the observer, the stability results of our methodology may also vary. To give an example, using Lemma III.8 from [45], we can show that, with a well-designed sampled-data observer providing state estimates of both the expert demonstrations and the current state, we can still retain asymptotic stability. In general, however, a persistent error between the state estimate and the current state can weaken the guarantee of asymptotic stability of the closed-loop system to that of practical stability.

## 4. Learning from more than n+1 expert demonstrations

Here, we extend the previous results to the case where more than $n+1$ demonstrations are available. For every interval of length $T$, we show how to select a subset of demonstrations that results in the best approximation of the expert controller.

### 4.1. Preliminaries

We begin by reviewing several key concepts from multivariate linear interpolation. Let be a finite set of points in . The convex hull of a set , denoted , is the set of all convex combinations of points in [6]. For any , we define the subset . A Cartesian product of two sets has a natural left projection map (resp., right projection map ) given by (resp., ). An -simplex is the convex hull of a set of affinely independent points. A triangulation of points in , denoted , is a collection of -simplices such that their vertices are points in , their interiors are disjoint, and their union is . We denote the -simplex in containing by and define a vertex index set associated with in , denoted , as to satisfy . The Delaunay triangulation of , denoted , is a triangulation with the property that the circum-hypersphere of every -simplex in the triangulation contains no point from in its interior. It is unique if no $n+1$ points are on the same hyperplane and no $n+2$ points are on the same hypersphere [7].

Let be an unknown function. Given a finite set of points and a set of function values , an interpolant is an approximation of that satisfies for all . We define an interpolant , called a piecewise-linear interpolant based on , as:

$$\hat{\psi}^{X,Y}_{\mathcal{T}}(x) = \sum_{i \in I_{\mathcal{T}}(x)} \theta_i\, y_i,$$

where satisfy:

$$x = \sum_{i \in I_{\mathcal{T}}(x)} \theta_i\, x_i, \qquad \sum_{i \in I_{\mathcal{T}}(x)} \theta_i = 1.$$
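A piecewise-linear interpolant of this kind can be sketched as follows, assuming NumPy. The triangulation is hand-written rather than computed, the point location is a brute-force search over simplices, and the names `barycentric` and `interpolate` and the sample data are hypothetical. The weights $\theta_i$ are the barycentric coordinates of the query point in the simplex that contains it.

```python
import numpy as np

def barycentric(vertices, x):
    """Weights theta with x = sum_i theta_i x_i and sum_i theta_i = 1."""
    verts = np.asarray(vertices, dtype=float)          # shape (n + 1, n)
    lhs = np.vstack([verts.T, np.ones(len(verts))])
    rhs = np.append(np.asarray(x, dtype=float), 1.0)
    return np.linalg.solve(lhs, rhs)

def interpolate(X, Y, simplices, x, tol=1e-12):
    """Evaluate the piecewise-linear interpolant of (X, Y) at x."""
    for simplex in simplices:
        idx = list(simplex)
        theta = barycentric(X[idx], x)
        if np.all(theta >= -tol):                      # x lies in this simplex
            return float(theta @ Y[idx])
    raise ValueError("x is outside the convex hull of X")

# Corners of the unit square with sample values (illustrative data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([0.0, 1.0, 2.0, 5.0])

# One of the two possible triangulations of the square (vertex index sets).
simplices = [(0, 1, 2), (1, 2, 3)]

value_lower = interpolate(X, Y, simplices, [0.25, 0.25])
value_upper = interpolate(X, Y, simplices, [0.75, 0.75])
```

The four corners of a square lie on a common circle, so this point set has two equally valid triangulations (one per diagonal), and the interpolated value inside the square depends on which one is used. This is the dependence on the triangulation that the rest of the section addresses.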

### 4.2. Constructing the learned controller

Let us describe the construction of the controller for . Define and . We partition time into intervals of length , indexed by . For each , we propose using the piecewise-continuous control law , where is defined as follows:

1. For , the value of is given by the value at of a piecewise-linear interpolant . Since a piecewise-linear interpolant is determined by an associated triangulation [7], this implies that there is a family of possible learned controllers we can construct from . Moreover, the value of the interpolant depends only on the values of and , where is a vertex set associated with in .

2. For , let be the Euclidean projection of onto . Define the index set and express as an affine combination . Then, the value of is given by .

In both cases, the controller can be concisely expressed if, given a vertex index set for and , we construct the following matrices:

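$$Z_I(t) \triangleq \begin{bmatrix} z_{i_2}(t) - z_{i_1}(t) & \cdots & z_{i_{n+1}}(t) - z_{i_1}(t) \end{bmatrix}, \tag{24}$$

$$V_I(t) \triangleq \begin{bmatrix} v_{i_2}(t) - v_{i_1}(t) & \cdots & v_{i_{n+1}}(t) - v_{i_1}(t) \end{bmatrix}, \tag{25}$$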

for . Then, using (24) and (25), the proposed control law, for all , is given by:

$$v(t) = \hat{\kappa}_{\mathcal{T}}(t, z(t)) = V_{I_{\mathcal{T}}(z(t))}(t - pT)\, \zeta(p, t), \qquad \zeta(p, t) = Z^{-1}_{I_{\mathcal{T}}(z(t))}(t - pT)\, z(t). \tag{26}$$

Note that, in the absence of uncertainties and disturbances, by Lemma 3.2, the coefficients satisfy:

$$\zeta(p, t) = Z^{-1}_{I_{\mathcal{T}}(z(t))}(t - pT)\, z(t) = Z^{-1}_{I_{\mathcal{T}}(z(pT))}(0)\, z(pT). \tag{27}$$

Therefore, for all , the controller (26) applies the input equal to that applied by the following controller:

$$v(t) = \hat{\kappa}_{\mathcal{T}}(t, z(pT)) = V_{I_{\mathcal{T}}(z(pT))}(t - pT)\, \zeta(p), \qquad \zeta(p) = Z^{-1}_{I_{\mathcal{T}}(z(pT))}(0)\, z(pT). \tag{28}$$

Incidentally, this corresponds to the value of the piecewise-linear interpolant at .

### 4.3. Stability of the learned controller

Let us define the collection of index sets , where each selects vertices of an -simplex in and . Note that is a finite set because there are only finitely many -simplices in . Suppose the index set associated with in is for some . Assuming (27) holds, the system (9) in closed loop with (26) is given by:

$$\dot{z} = Az + B V_{I_{j(p)}}(t - pT)\, Z^{-1}_{I_{j(p)}}(0)\, z(pT), \tag{29}$$

for all . Integrating the dynamics shows that the sequence satisfies:

$$z((p+1)T) = \Psi_{j(p)}(T)\, z(pT), \tag{30}$$

where

$$\Psi_{j(p)}(T) \triangleq e^{AT} + \int_0^T e^{A(T - \tau)} B V_{I_{j(p)}}(\tau)\, Z^{-1}_{I_{j(p)}}(0)\, d\tau.$$

Note that now, instead of a single monodromy matrix, we have a set of monodromy matrices .

The following result is an extension of Theorem 3.4 for demonstrations.

###### Theorem 4.1.

Consider the system (5) and assume it is feedback linearizable on an open set containing the origin. Let and suppose we are given a finite set of demonstrations generated by the system (5), in closed loop with a smooth asymptotically stabilizing controller , and satisfying for all . Assume that is affinely independent for all . Then, there exists a such that for all , the origin of system (5) in closed-loop with controller (8)-(26) is uniformly asymptotically stable.

###### Proof.

The proof of Theorem 3.4 implies the existence of such that for all . We choose . The system (9) in closed loop with controller (26) can be represented as a switched system (30), where is a switching sequence. By Theorem 3 in [15], the fact that for all and implies that, for any switching signal , the system (30) is uniformly exponentially stable. Since the matrices are bounded for , the system (9) in closed loop with controller (26) is uniformly exponentially stable. Uniform asymptotic stability of the origin for the system (9)-(26) in the -coordinates implies uniform asymptotic stability of the origin for the feedback equivalent system (5)-(8)-(26) in -coordinates [44]. ∎

### 4.4. Optimality of the learned controller

Recall that the piecewise-linear interpolant defining the controller depends on the choice of the triangulation . Assuming (27) holds, this choice reduces to the choice of the triangulation , which dictates the index set of demonstrations used to construct the solution for each interval . Without loss of generality, in what follows we discuss the solutions on the interval only — a solution on can be represented as a solution on with the initial condition equal to .

Typically, there are several triangulations one can define given a set of sample points . We want our choice of triangulation to result in closed-loop trajectories that approximate expert trajectories well for any initial state distinct from . More precisely, we want to find a triangulation that best approximates the function , which defines solutions of (9) under the expert controller , by the function , which defines the solutions of (9) under the learned controller . That is, we seek a solution to:

$$\min_{\mathcal{T}(Z(0))}\ \sup_{\phi \in \mathcal{F}}\ \max_{t \in [0, T]} \big\| \phi(t, z_0) - \hat{\phi}_{\mathcal{T}}(t, z_0) \big\|, \tag{31}$$

where is the class of functions to which the expert solutions belong. We can view (31) as a game where we pick , and the adversary, upon seeing our choice of , picks to maximize the cost.

Let us leverage the properties has by virtue of describing solutions of (9) under the expert controller to determine the class . We will use the notation for . By Theorem 4.1 in [16, Ch. V], since is a smooth function, the Hessians of the coordinate functions of the solution are continuous with respect to and . By the extreme value theorem, compactness of implies that, for every , there exists such that for all and . Thus, the norms of the Hessians of the coordinate functions can be bounded by . We denote the class of functions whose coordinate functions have the Hessian norm smaller or equal to by . For a fixed , and, therefore, the function belongs to , the set of all functions from to .

###### Definition 4.2.

For any and any learned controller , the worst-case trajectory approximation error on the interval is given by:

$$\sup_{\phi \in \mathcal{F}(H)^{[0,T]}}\ \max_{t \in [0, T]} \big\| \phi(t, z_0) - \hat{\phi}_{\mathcal{T}}(t, z_0) \big\|,$$

where is the trajectory of the system (9) with the initial condition under the expert controller , is the trajectory of the system (9) with the same initial condition under the learned controller , and is the set of all functions from to . The smallest worst-case trajectory approximation error on the interval is given by:

 (32) minT(Z(0))supϕ∈F(H)[0,