Learning Constraints from Demonstrations

12/17/2018 ∙ by Glen Chou, et al. ∙ University of Michigan

We extend the learning from demonstration paradigm by providing a method for learning unknown constraints shared across tasks, using demonstrations of the tasks, their cost functions, and knowledge of the system dynamics and control constraints. Given safe demonstrations, our method uses hit-and-run sampling to obtain lower cost, and thus unsafe, trajectories. Both safe and unsafe trajectories are used to obtain a consistent representation of the unsafe set via solving an integer program. Our method generalizes across system dynamics and learns a guaranteed subset of the constraint. We also provide theoretical analysis on what subset of the constraint can be learnable from safe demonstrations. We demonstrate our method on linear and nonlinear system dynamics, show that it can be modified to work with suboptimal demonstrations, and that it can also be used to learn constraints in a feature space.

1 Introduction

Inverse optimal control and inverse reinforcement learning (IOC/IRL) [26, 2, 6, 22] have proven to be powerful tools in enabling robots to perform complex goal-directed tasks. These methods learn a cost function that replicates the behavior of an expert demonstrator when optimized. However, planning for many robotics and automation tasks also requires knowing constraints, which define what states or trajectories are safe. For example, the task of safely and efficiently navigating an autonomous vehicle can naturally be described by a cost function trading off user comfort and efficiency and by the constraints of collision avoidance and executing only legal driving behaviors. In some situations, constraints can provide a more interpretable representation of a behavior than cost functions. For example, in safety critical environments, recovering a hard constraint or an explicit representation of an unsafe set in the environment is more useful than learning a “softened” cost function representation of the constraint as a penalty term in the Lagrangian. Consider the autonomous vehicle, which absolutely must avoid collision, not simply give collisions a cost penalty. Furthermore, learning global constraints shared across many tasks can be useful for generalization. Again consider the autonomous vehicle, which must avoid the scene of a car accident: a shared constraint that holds regardless of the task it is trying to complete.

While constraints are important, it can be impractical for a user to exhaustively program into a robot all the possible constraints it should obey when performing its repertoire of tasks. To avoid this, we consider in this paper the problem of recovering the latent constraints within expert demonstrations that are shared across tasks in the environment. Our method is based on the key insight that each safe, optimal demonstration induces a set of lower-cost trajectories that must be unsafe due to violation of an unknown constraint. Our method samples these unsafe trajectories, ensuring they are also consistent with the known constraints (system dynamics, control constraints, and start/goal constraints), and uses these unsafe trajectories together with the safe demonstrations as constraints in an “inverse” integer program which recovers a consistent unsafe set. Our contributions are fourfold:

  • We pose the novel problem of learning a shared constraint across tasks.

  • We propose an algorithm that, given known constraints and boundedly suboptimal demonstrations of state-control sequences, extracts unknown constraints defined in a wide range of constraint spaces (not limited to the trajectory or state spaces) shared across demonstrations of different tasks.

  • We provide theoretical analysis on the limits of what subsets of a constraint can be learned, depending on the demonstrations, the system dynamics, and the trajectory discretization. We also show that our method can recover a guaranteed underapproximation of the constraint.

  • We provide experiments that justify our theory and show that our algorithm can recover an unsafe set with few demonstrations, across different types of linear and nonlinear dynamics, and can be adapted to work with boundedly suboptimal demonstrations. We also demonstrate that our method can learn constraints in the state space and a feature space.

2 Related Work

Inverse optimal control [14, 16] (IOC) and inverse reinforcement learning (IRL) [22] aim to recover an objective function consistent with the received expert demonstrations, in the sense that the demonstrations (approximately) optimize the cost function. Our method is complementary to these approaches; if the demonstration is solving a constrained optimization problem, we are finding its constraints, given the objective function; IOC/IRL finds the objective function, given its constraints. For example, [12] attempts to learn the cost function of a constrained optimization problem from optimal demonstrations by minimizing the residuals of the KKT conditions, but the constraints themselves are assumed known. Another approach [5] can represent a state-space constraint shared across tasks as a penalty term in the reward function of an MDP. However, when viewing a constraint as a penalty, it becomes unclear if a demonstrated motion was performed to avoid a penalty or to improve the cost of the trajectory in terms of the true cost function (or both). Thus, learning a constraint which generalizes between tasks with different cost functions becomes difficult. To avoid this issue, we assume a known cost function to explicitly reason about the constraint.

One branch of safe reinforcement learning aims to perform exploration while minimizing visitation of unsafe states. Several methods for safe exploration in the state space [28, 29, 3] use a Gaussian process (GP) to explore safe regions in the state space. These approaches differ from ours in that they use exploration instead of demonstrations. Some drawbacks to these methods include that unsafe states can still be visited, Lipschitz continuity of the safety function is assumed, or the dynamics are unknown but the safe set is known. Furthermore, states themselves are required to be explicitly labeled as safe or unsafe, while we only require the labeling of whole trajectories. Our method is capable of learning a binary constraint defined in other spaces, using only state-control trajectories.

There exists prior work in learning geometric constraints in the workspace. In [7], a method is proposed for learning Pfaffian constraints, recovering a linear constraint parametrization. In [25], a method is proposed to learn geometric constraints which can be described by the classes of considered constraint templates. Our method generalizes these methods by being able to learn a nonlinear constraint defined in any constraint space (not limited to the state space).

Learning local trajectory-based constraints has also been explored in the literature. The method in [19] samples feasible poses around waypoints of a single demonstration; areas where few feasible poses can be sampled are assumed to be constrained. Similarly, [20] performs online constraint inference in the feature space from a single trajectory, and then learns a mapping to the task space. The methods in [23, 30, 10, 9] also learn constraints in a single task. These methods are inherently local since only one trajectory or task is provided, unlike our method, which aims to learn a global constraint shared across tasks.

3 Preliminaries and Problem Statement

The goal of this work is to recover unknown constraints shared across a collection of optimization problems, given boundedly suboptimal solutions, the cost functions, and knowledge of the dynamics, control constraints, and start/goal constraints. We discuss the forward problem, which generates the demonstrations, and the inverse problem: the core of this work, which recovers the constraints.

3.1 Forward optimal control problem

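Consider an agent described by a state $x \in \mathcal{X}$ in some state space $\mathcal{X}$. It can take control actions $u \in \mathcal{U}$ to change its state. The agent performs tasks $\Pi$ drawn from a set of tasks $\mathcal{P}$, where each task $\Pi$ can be written as a constrained optimization problem over state trajectories $\xi_x$ in state trajectory space $\mathcal{T}^x$ and control trajectories $\xi_u$ in control trajectory space $\mathcal{T}^u$:

[Forward problem / “task” $\Pi$]

$$\begin{array}{ll} \underset{\xi_x,\, \xi_u}{\text{minimize}} & c_\Pi(\xi_x, \xi_u) \\ \text{subject to} & \phi(\xi_x, \xi_u) \in \mathcal{S} \subseteq \mathcal{C} \\ & \bar{\phi}(\xi_x, \xi_u) \in \bar{\mathcal{S}} \subseteq \bar{\mathcal{C}} \\ & \phi_\Pi(\xi_x, \xi_u) \in \mathcal{S}_\Pi \subseteq \mathcal{C}_\Pi \end{array}$$

(1)

where $c_\Pi(\cdot)$ is a cost function for task $\Pi$, and $\phi(\cdot, \cdot)$ is a known feature function mapping state-control trajectories to some constraint space $\mathcal{C}$. $\bar{\phi}(\cdot, \cdot)$ and $\phi_\Pi(\cdot, \cdot)$ are known and map to potentially different constraint spaces $\bar{\mathcal{C}}$ and $\mathcal{C}_\Pi$, containing a known shared safe set $\bar{\mathcal{S}}$ and a known task-dependent safe set $\mathcal{S}_\Pi$, respectively. $\mathcal{S}$ is an unknown safe set, and the inverse problem aims to recover its complement, $\mathcal{A} \doteq \mathcal{S}^c$, the “unsafe” set. In this paper, we focus on constraints separable in time: $\phi(\xi_x, \xi_u) \in \mathcal{S}$ if and only if $\phi(\xi_x(t), \xi_u(t)) \in \mathcal{S}$ for all time-steps $t$, where we overload $\phi$ so it applies to the instantaneous values of the state and the input. An analogous definition holds for the continuous time case. Our method easily learns non-separable trajectory constraints as well (footnote 1: write the constraints of Problem 1 as sums over partially separable/inseparable feature components instead of completely separable components).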

A demonstration, $\xi_{xu} \doteq (\xi_x, \xi_u)$, is a state-control trajectory which is a boundedly suboptimal solution to Problem 1, i.e. the demonstration satisfies all constraints and its cost is at most a factor of $\delta$ above the cost of the optimal solution $\xi^*_{xu}$, i.e. $c_\Pi(\xi^*_{xu}) \le c_\Pi(\xi_{xu}) \le (1 + \delta)\, c_\Pi(\xi^*_{xu})$, where $c_\Pi$ is the cost function of the task. Furthermore, let $T$ be a finite time horizon which is allowed to vary. If $\xi_{xu}$ is a discrete-time trajectory (a finite sequence of states and controls), Problem 1 is a finite-dimensional optimization problem, while Problem 1 becomes a functional optimization problem if $\xi_{xu}$ is a continuous-time trajectory (states and controls given as functions of time on $[0, T]$). We emphasize this setup does not restrict the unknown constraint to be defined on the trajectory space; it allows for constraints to be defined on any space described by the range of some known feature function.

We assume the trajectories are generated by a dynamical system $x_{t+1} = f(x_t, u_t, t)$ or $\dot{x} = f(x, u, t)$ with control constraints $u_t \in \mathcal{U}$, for all $t$, and that the dynamics, control constraints, and start/goal constraints are known. We further denote the set of state-control trajectories satisfying the unknown shared constraint, the known shared constraint, and the known task-dependent constraint as $\mathcal{T}_{\mathcal{S}}$, $\mathcal{T}_{\bar{\mathcal{S}}}$, and $\mathcal{T}_{\mathcal{S}_\Pi}$, respectively. Lastly, we also denote the set of trajectories satisfying all known constraints but violating the unknown constraint as $\mathcal{T}_{\mathcal{A}}$.

3.2 Inverse constraint learning problem

Figure 1: Discretized constraint space with cells . The trajectory’s constraint values are assigned to the red cells.

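The goal of the inverse constraint learning problem is to recover an unsafe set, $\mathcal{A} \subseteq \mathcal{C}$, using $N_s$ provided safe demonstrations $\xi^*_{s_j}$, $j = 1, \ldots, N_s$, known constraints, and $N_{\neg s}$ inferred unsafe trajectories, $\xi_{\neg s_k}$, $k = 1, \ldots, N_{\neg s}$, generated by our method, which can come from multiple tasks. These trajectories can together be thought of as a set of constraints on the possible assignments of unsafe elements in $\mathcal{C}$. To recover a gridded approximation of the unsafe set $\mathcal{A}$ that is consistent with these trajectories, we first discretize $\mathcal{C}$ into a finite set of $G$ discrete cells $\mathcal{Z} \doteq \{z_1, \ldots, z_G\}$ and define an occupancy function, $\mathcal{O}(\cdot)$, which maps each cell to its safeness: $\mathcal{O}: \mathcal{Z} \to \{0, 1\}$, where $\mathcal{O}(z_i) = 1$ if $z_i \in \mathcal{A}$, and $0$ otherwise. Continuous space trajectories are gridded by concatenating the set of grid cells that the trajectory's constraint values lie in, which is graphically shown in Figure 1 with a non-uniform grid. Then, the problem can be written down as an integer feasibility problem:

[Inverse feasibility problem]

$$\begin{array}{ll} \text{find} & \mathcal{O}(z_1), \ldots, \mathcal{O}(z_G) \in \{0, 1\} \\ \text{subject to} & \sum_{z_i \in \phi(\xi^*_{s_j})} \mathcal{O}(z_i) = 0, \quad j = 1, \ldots, N_s \\ & \sum_{z_i \in \phi(\xi_{\neg s_k})} \mathcal{O}(z_i) \ge 1, \quad k = 1, \ldots, N_{\neg s} \end{array}$$

(2)

Here each sum runs over the grid cells traversed by the corresponding gridded trajectory: every cell on a safe demonstration must be safe, and at least one cell on each unsafe trajectory must be unsafe.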

Inferring unsafe trajectories is the most difficult part of this problem, since finding lower-cost trajectories consistent with known constraints that complete a task is essentially a planning problem. Much of the next section shows how to obtain these trajectories efficiently. Further details on Problem 2, including conservativeness guarantees, incorporating a prior on the constraint, and a continuous relaxation can be found in Section 4.4.
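The structure of the feasibility constraints can be illustrated on a toy grid. The sketch below is not the authors' implementation: it enumerates every binary occupancy assignment instead of calling an integer programming solver, and the helper name `consistent_unsafe_sets`, the five-cell grid, and the cell indices are all made up for illustration.

```python
from itertools import product

def consistent_unsafe_sets(num_cells, safe_trajs, unsafe_trajs):
    """Enumerate occupancy vectors consistent with gridded trajectories.

    Each trajectory is a collection of the cell indices it traverses.
    A safe trajectory forces every traversed cell to be safe (0);
    an unsafe trajectory needs at least one traversed cell to be unsafe (1).
    """
    solutions = []
    for occ in product((0, 1), repeat=num_cells):
        safe_ok = all(occ[z] == 0 for traj in safe_trajs for z in traj)
        unsafe_ok = all(any(occ[z] == 1 for z in traj) for traj in unsafe_trajs)
        if safe_ok and unsafe_ok:
            solutions.append(occ)
    return solutions

# Hypothetical 1x5 grid: the demonstration visits cells 0 and 4, and a
# sampled lower-cost trajectory passes through cells 0, 2 and 4.
sols = consistent_unsafe_sets(5, safe_trajs=[[0, 4]], unsafe_trajs=[[0, 2, 4]])
print(len(sols))  # cells 1 and 3 stay undetermined, so 4 sets remain
```

Brute force scales as 2^G and is only meant to make the constraints concrete; any realistic grid needs a proper solver.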

4 Method

The key to our method lies in finding lower-cost trajectories that do not violate the known constraints, given a demonstration with boundedly-suboptimal cost satisfying all constraints. Such trajectories must then violate the unknown constraint. Our goal is to determine an unsafe set in the constraint space from these trajectories using Problem 2. In the following, Section 4.1 describes lower-cost trajectories consistent with the known constraints; Section 4.2 describes how to sample such trajectories; Section 4.3 describes how to get more information from unsafe trajectories; Section 4.4 describes details and extensions to Problem 2; Section 4.5 discusses how to extend our method to suboptimal demonstrations. The complete flow of our method is described in Algorithm 2.

4.1 Trajectories satisfying known constraints

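Consider the forward problem (Problem 1). We define the set of unsafe state-control trajectories induced by an optimal, safe demonstration $\xi^*_{xu}$, denoted $\mathcal{T}_{\mathcal{A}}^{\xi^*_{xu}}$, as the set of state-control trajectories of lower cost that obey the known constraints:

$$\mathcal{T}_{\mathcal{A}}^{\xi^*_{xu}} \doteq \left\{ \xi_{xu} \in \mathcal{T}_{\bar{\mathcal{S}}} \cap \mathcal{T}_{\mathcal{S}_\Pi} \;\middle|\; c_\Pi(\xi_{xu}) < c_\Pi(\xi^*_{xu}) \right\}$$

(3)

where $\mathcal{T}_{\bar{\mathcal{S}}}$ and $\mathcal{T}_{\mathcal{S}_\Pi}$ denote the sets of trajectories satisfying the known shared and known task-dependent constraints, respectively, and $c_\Pi$ is the cost function of the task.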

In this paper, we deal with the known constraints from the system dynamics, the control limits, and task-dependent start and goal state constraints. Hence, , where denotes the set of dynamically feasible trajectories and denotes the set of trajectories using controls in at each time-step. denotes trajectories satisfying start and goal constraints. We develop the method for discrete time trajectories, but analogous definitions hold in continuous time. For discrete time, length trajectories, , , and are explicitly:

(4)

4.2 Sampling trajectories satisfying known constraints

Dynamics | Cost function | Control constraints | Sampling method
Linear | Quadratic | Convex | Ellipsoid hit-and-run (Section 4.2.1)
Linear | Convex | Convex | Convex hit-and-run (Section 4.2.2)
Else (any other combination) | Non-convex hit-and-run (Section 4.2.3)
Table 1: Sampling methods for different dynamics/costs/feasible controls.

We sample from the set defined in Eq. 3 to obtain lower-cost trajectories obeying the known constraints. For the most part, we use hit-and-run sampling [18] over this set, a method guaranteeing convergence to a uniform distribution of samples over the set in the limit; the method is detailed in Algorithm 1 and an illustration is shown in Figure 2. Hit-and-run starts from an initial point within the set, chooses a direction uniformly at random, moves a random amount in that direction such that the new point remains within the set, and repeats.

Depending on the convexity of the cost function and the control constraints and the form of the dynamics, different sampling techniques can be used, organized in Table 1. The following sections describe each sampling method.

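Algorithm 1: Hit-and-run
Input: an initial point inside the set to be sampled, the number of samples
Output: samples from the set
for each sample do
    choose a direction uniformly at random
    find the endpoints of the line through the current point along that direction that remain within the set
    sample a point uniformly between the endpoints and move to it
end for

Figure 2: Illustration of hit-and-run. Blue lines denote sampled random directions, black dots denote samples.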

4.2.1 Ellipsoid hit-and-run

When we have a linear system with quadratic cost and convex control constraints - a very common setup in the optimal control literature - is an ellipsoid in the trajectory space, which can be efficiently sampled via a specially-tailored hit-and-run method. Here, the quadratic cost is written as , where is a matrix of cost parameters, and we omit the control and task constraints for now. Without dynamics, the endpoints of the line , , (c.f. Alg. 1), can be found by solving a quadratic equation . We show that this can still be done with linear dynamics by writing in a special way.

The set of dynamically feasible trajectories can be written as an eigenspace of a singular “dynamics consistency” matrix, which converts any arbitrary state-control trajectory to one that satisfies the dynamics, one time-step at a time. Precisely, if the dynamics can be written as $x_{t+1} = A x_t + B u_t$, we can write this matrix as:

(5)

that fixes the controls and the initial state and performs a one-step rollout, replacing the second state with the dynamically correct state. Eq. 5 uses separate notation for a state that cannot be reached by applying the given control to the preceding state. Multiplying the one-step corrected trajectory by the matrix again replaces the next state with its dynamically reachable counterpart. Applying the matrix to the original infeasible trajectory once per remaining time-step results in a dynamically feasible trajectory. Further, note that the set of dynamically feasible trajectories is the set of trajectories that the matrix leaves unchanged, which is the span of the eigenvectors of the matrix associated with eigenvalue 1. Thus, obtaining a feasible trajectory via repeated multiplication is akin to finding the eigenspace via power iteration [13]. One can also interpret this as propagating through the dynamics with a fixed control sequence. Now, we can write the set of lower-cost, dynamically feasible trajectories as another ellipsoid, which can be efficiently sampled by finding the line endpoints via a quadratic equation:

(6)
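The rollout-by-multiplication idea can be sketched as follows. This is an illustrative variant under stated assumptions, not the matrix of Eq. 5: the stacking order of the trajectory vector, the function name `dynamics_consistency_matrix`, and the toy system matrices are all hypothetical, and this version updates every state on each multiplication, so the first k+1 states are correct after k multiplications.

```python
import numpy as np

def dynamics_consistency_matrix(A, B, T):
    """One-step rollout matrix for x_{t+1} = A x_t + B u_t.

    Assumed stacking (not taken from the paper):
    xi = [x_1, ..., x_T, u_1, ..., u_{T-1}].
    """
    n, m = B.shape
    nx = n * T
    D = np.zeros((nx + m * (T - 1), nx + m * (T - 1)))
    D[:n, :n] = np.eye(n)               # keep the initial state
    D[nx:, nx:] = np.eye(m * (T - 1))   # keep all controls
    for t in range(T - 1):
        rows = slice(n * (t + 1), n * (t + 2))
        D[rows, n * t:n * (t + 1)] = A              # x_{t+1} <- A x_t
        D[rows, nx + m * t:nx + m * (t + 1)] = B    #          + B u_t
    return D

# Toy double-integrator-like system (hypothetical numbers).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
T = 5
D = dynamics_consistency_matrix(A, B, T)

rng = np.random.default_rng(0)
xi = rng.normal(size=D.shape[0])                     # arbitrary, infeasible
xi_feasible = np.linalg.matrix_power(D, T - 1) @ xi  # T-1 rollout steps
```

In this sketch a feasible trajectory is a fixed point of `D`, i.e. an eigenvector with eigenvalue 1, which is the property the power-iteration analogy relies on.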

We deal with control constraints separately, as the intersection of and Eq. 6 is in general not an ellipsoid. To ensure control constraint satisfaction, we reject samples with controls outside of ; this works if is not measure zero. For task constraints, we ensure all sampled rollouts obey the goal constraints by adding a large penalty term to the cost function: , where is a large scalar, which can be incorporated into Eq. 6 by modifying and including in ; all trajectories sampled in this modified set satisfy the goal constraints to an arbitrarily small tolerance , depending on the value of . The start constraint is satisfied trivially: all rollouts start at . Note the demonstration cost remains the same, since the demonstration satisfies the start and goal constraints; this modification is made purely to ensure these constraints hold for sampled trajectories.
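A minimal sketch of hit-and-run over an ellipsoid, assuming a generic sublevel set {x : xᵀVx ≤ c_max} with V symmetric positive definite. It leaves out the dynamics, the control-constraint rejection step, and the goal penalty described above; the function name and the numbers are illustrative only.

```python
import numpy as np

def ellipsoid_hit_and_run(V, c_max, x0, num_samples, rng):
    """Hit-and-run over {x : x^T V x <= c_max}; x0 must lie inside the set."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(num_samples):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)              # uniformly random direction
        # (x + t d)^T V (x + t d) = c_max  ->  a t^2 + b t + e = 0
        a = d @ V @ d
        b = 2.0 * (x @ V @ d)
        e = x @ V @ x - c_max               # <= 0 while x is inside
        disc = np.sqrt(max(b * b - 4.0 * a * e, 0.0))
        t_lo, t_hi = (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)
        x = x + rng.uniform(t_lo, t_hi) * d  # uniform point on the chord
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
V = np.diag([1.0, 4.0])
samples = ellipsoid_hit_and_run(V, 1.0, np.zeros(2), 500, rng)
```

Because both chord endpoints come from a closed-form quadratic, each step costs only a few matrix-vector products, which is what makes this case cheaper than the general convex one.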

4.2.2 Convex hit-and-run

For general convex cost functions, the same sampling method holds, but the endpoints of the line can no longer be found by solving a quadratic equation. Instead, we find them via a root-finding algorithm or line search.
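One way to realize the line search is bisection on the cost along the sampled direction. The sketch below assumes a convex cost, a strictly feasible starting point, and a hypothetical search bound `t_max`; it ignores dynamics and control constraints, and the function names are not from the paper.

```python
import numpy as np

def line_endpoint(cost, c_max, x, d, t_max=1e3, tol=1e-8):
    """Largest t in [0, t_max] with cost(x + t d) <= c_max, via bisection.

    Relies on convexity: the feasible t along the ray form one interval.
    """
    lo, hi = 0.0, t_max
    if cost(x + hi * d) <= c_max:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cost(x + mid * d) <= c_max:
            lo = mid
        else:
            hi = mid
    return lo  # always a feasible step length

def convex_hit_and_run(cost, c_max, x0, num_samples, rng):
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(num_samples):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)
        t_hi = line_endpoint(cost, c_max, x, d)
        t_lo = -line_endpoint(cost, c_max, x, -d)
        x = x + rng.uniform(t_lo, t_hi) * d
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
l1_cost = lambda p: np.sum(np.abs(p))      # convex but not quadratic
samples = convex_hit_and_run(l1_cost, 1.0, np.zeros(2), 300, rng)
```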

4.2.3 Non-convex hit-and-run

If is non-convex, can now in general be a union of disjoint line segments. In this scenario, we perform a “backtracking” line search by setting to lie in some initial range: ; sampling within this range and then evaluating the cost function to see whether or not lies within the intersection. If it does, the sample is kept and hit-and-run proceeds normally; if not, then the range of possible values is restricted to if is negative, and otherwise. Then, new s are re-sampled until either the interval length shrinks below a threshold or a feasible sample is found. This altered hit-and-run technique still converges to a uniform distribution on the set in the limit, but has a slower mixing time than for the convex case, where mixing time describes the number of samples needed until the total variation distance to the steady state distribution is less than a small threshold [1]. Further, we accelerate sampling spread by relaxing the goal constraint to a larger tolerance but keeping only the trajectories reaching within of the goal.
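The backtracking step can be sketched as below, assuming a membership test `feasible` that stands in for "lower cost and consistent with the known constraints". The function name, the initial range, the retry cap, and the annulus used as a non-convex test set are all illustrative assumptions.

```python
import numpy as np

def backtracking_step(feasible, x, d, t_range, rng, min_width=1e-6, max_tries=100):
    """One hit-and-run move along d using a shrinking sampling interval."""
    t_lo, t_hi = -t_range, t_range
    for _ in range(max_tries):
        if t_hi - t_lo < min_width:
            break
        t = rng.uniform(t_lo, t_hi)
        candidate = x + t * d
        if feasible(candidate):
            return candidate
        # Rejected: shrink toward t = 0, where the current point lies.
        if t < 0:
            t_lo = t
        else:
            t_hi = t
    return x  # no feasible sample found, stay at the current point

rng = np.random.default_rng(0)
in_annulus = lambda p: 0.5 <= np.linalg.norm(p) <= 1.0   # non-convex set
x = np.array([0.75, 0.0])
visited = []
for _ in range(200):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    x = backtracking_step(in_annulus, x, d, 2.0, rng)
    visited.append(x.copy())
```

The interval always contains t = 0, so the current point remains reachable and the chain never leaves the feasible set, even when the chord through it is split into disjoint segments.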

4.3 Improving learnability using cost function structure

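Input: safe demonstrations, their cost functions, known constraints
Output: learned unsafe set
for each demonstration do
    /* Sample unsafe trajectories */
    if dynamics linear, cost quadratic, control constraints convex then
        sample with ellipsoid hit-and-run (Section 4.2.1)
    else if dynamics linear, cost convex, control constraints convex then
        sample with convex hit-and-run (Section 4.2.2)
    else
        sample with non-convex hit-and-run (Section 4.2.3)
    end if
end for
/* Constraint recovery */
if prior, continuous then
    solve the continuous minimization problem (8)
else if prior, binary then
    solve the binary minimization problem (7)
else
    solve the feasibility problem (2)
end if
Algorithm 2: Overall method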

Naïvely, the sampled unsafe trajectories may provide little information. Consider an unsafe, length- discrete-time trajectory , with start and end states in the safe set. This only says there exists at least one intermediate unsafe state in the trajectory, but says nothing directly about which state was unsafe. The weakness of this information can be made concrete using the notion of a version space. In machine learning, the version space is the set of consistent hypotheses given a set of examples [27]. In our setting, hypotheses are possible unsafe sets, and examples are the safe and unsafe trajectories. Knowing is unsafe only disallows unsafe sets that mark every element of the constraint space that traverses as safe: . If is gridded into cells, this information invalidates at most out of possible unsafe sets. We could do exponentially better if we reduced the number of cells that implies could be unsafe.

We can achieve this by sampling sub-segments (or sub-trajectories) of the larger demonstrations, holding other portions of the demonstration fixed. For example, say we fix all but one of the points on when sampling unsafe lower-cost trajectories. Since only one state can be different from the known safe demonstration, the unsafeness of the trajectory can be uniquely localized to whatever new point was sampled: then, this trajectory will reduce the version space by at most a factor of , invalidating at most unsafe sets. One can sample these sub-trajectories in the full-length trajectory space by fixing appropriate waypoints during sampling: this ensures the full trajectory has lower cost and only perturbs desired waypoints. However, to speed up sampling, sub-trajectories can be sampled directly in the lower dimensional sub-trajectory space if the cost function that is being optimized is strictly monotone [21]: for any costs , control , and state , , for all , where represents the cost of starting with initial cost at state and taking control . Strictly monotone cost functions include separable cost functions with additive or multiplicative stage costs, which are common in motion planning and optimal control. If the cost function is strictly monotone, we can sample lower-cost trajectories from sub-segments of the optimal path; otherwise it is possible that even if a new sub-segment with lower cost than the original sub-segment were sampled, the full trajectory containing the sub-segment could have a higher cost than the demonstration.
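The gain from localizing unsafeness can be checked by brute force on a small grid. In this sketch the grid size and cell indices are arbitrary, and `invalidated_by_unsafe` is a made-up helper that counts the hypotheses a single unsafe trajectory rules out.

```python
from itertools import product

def invalidated_by_unsafe(num_cells, traversed):
    """Count hypotheses ruled out by one unsafe trajectory.

    A hypothesis (a binary occupancy over all cells) is ruled out exactly
    when it marks every traversed cell as safe.
    """
    return sum(
        all(occ[z] == 0 for z in traversed)
        for occ in product((0, 1), repeat=num_cells)
    )

G = 8
# A full-length unsafe trajectory crossing five distinct cells.
whole = invalidated_by_unsafe(G, traversed=[0, 1, 2, 3, 4])   # 2**(8-5) = 8
# A sub-trajectory sample whose only new point falls in one cell.
single = invalidated_by_unsafe(G, traversed=[3])              # 2**(8-1) = 128
```

With only one candidate cell, the same unsafe label removes half of the version space instead of a small fraction of it.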

4.4 Integer program formulation

After sampling, we can solve Problem 2 to find an unsafe set consistent with the safe and unsafe trajectories. We now discuss the details of this process.

Conservative estimate: One can obtain a conservative estimate of the unsafe set from Problem 2 by intersecting all possible solutions: if the unsafeness of a cell is shared across all feasible solutions, that cell must be occupied. In practice, it may be difficult to directly find all solutions to the feasibility problem, as in the worst case, finding the set of all feasible solutions is equivalent to exhaustive search in the full gridded space [24]. A more efficient method is to loop over all grid cells, set each one to be safe in turn, and check whether the optimizer can still find a feasible solution. Cells for which no feasible solution exists are guaranteed unsafe. This amounts to solving one binary integer feasibility problem per grid cell, which can be trivially parallelized. Furthermore, any cells that are known safe (from demonstrations) do not need to be checked. We use this method to compute the “learned guaranteed unsafe set” in Section 6.
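The per-cell check can be sketched without a solver when only the two constraint types of the feasibility problem are present and no prior is used. The shortcut in `is_feasible` is an assumption of this sketch, not the paper's procedure: since marking more cells unsafe can never violate an "at least one unsafe cell" constraint, it suffices to test the assignment that marks every non-forced cell unsafe. All names and indices are hypothetical.

```python
def is_feasible(safe_trajs, unsafe_trajs, forced_safe=()):
    """Feasibility of the occupancy constraints with extra cells forced safe.

    Marking every cell that is not forced safe as unsafe is the most
    permissive assignment, so the problem is feasible iff that assignment
    leaves at least one unsafe cell on every unsafe trajectory.
    """
    safe = {z for traj in safe_trajs for z in traj} | set(forced_safe)
    return all(any(z not in safe for z in traj) for traj in unsafe_trajs)

def guaranteed_unsafe(num_cells, safe_trajs, unsafe_trajs):
    """Cells that are unsafe in every feasible solution."""
    known_safe = {z for traj in safe_trajs for z in traj}
    return [
        z for z in range(num_cells)
        if z not in known_safe   # cells on demonstrations need no check
        and not is_feasible(safe_trajs, unsafe_trajs, forced_safe=[z])
    ]

# Hypothetical 6-cell example: the first unsafe trajectory leaves cell 2 as
# its only possibly-unsafe cell; the second cannot tell cells 3 and 4 apart.
cells = guaranteed_unsafe(
    6, safe_trajs=[[0, 5]], unsafe_trajs=[[0, 2, 5], [0, 3, 4, 5]]
)
```

Each cell is checked independently of the others, which is why the loop parallelizes trivially.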

A prior on the constraint: As will be further discussed in Section 5.1, it may be fundamentally impossible to recover a unique unsafe set. If we have some prior on the nature of the unsafe set, such as it being simply connected, or that certain regions of the constraint space are unlikely to be unsafe, we can make the constraint learning problem more well-posed. Assume that this prior knowledge can be encoded in some “energy” function mapping the set of binary occupancies to a scalar value, which indicates the desirability of a particular unsafe set configuration. Using this energy function as the objective function in Problem 2 results in a binary integer program, which finds an unsafe set consistent with the safe and unsafe trajectories, and minimizes the energy:
[Inverse binary minimization constraint recovery]

(7)

Probabilistic setting and continuous relaxation:

A similar problem can be posed for a probabilistic setting, where grid cell occupancies represent beliefs over unsafeness: instead of the occupancy of a cell being an indicator variable, it is instead a random variable that takes value 1 with the cell's occupancy probability and value 0 otherwise. Here, the occupancy probability function maps cells to occupancy probabilities in $[0, 1]$.

Trajectories can now be unsafe with some probability. We obtain analogous constraints from the integer program in Section 4.4 in the probabilistic setting. Known safe trajectories traverse cells that are unsafe with probability 0; we enforce this with the constraint : if the unsafeness probabilities are all zero along a trajectory, then the trajectory must be safe. Trajectories that are unsafe with probability satisfy where we denote the number of unsafe grid cells traverses when the trajectory is unsafe as , where . The following problem directly optimizes over occupancy probabilities: [Inverse continuous minimization constraint recovery]

(8)

When every unsafe trajectory is assigned probability 1 (i.e. all unsafe trajectories are unsafe for sure), this probabilistic formulation coincides with the continuous relaxation of the binary minimization problem (7). This justifies interpreting the solution of the continuous relaxation as occupancy probabilities for each cell. Note that the minimization problems (7) and (8) have no conservativeness guarantees and use prior assumptions to make the problem more well-posed. However, we observe that they improve constraint recovery in our experiments.

4.5 Bounded suboptimality of demonstrations

If we are given a $\delta$-suboptimal demonstration, i.e. one whose cost is at most $(1 + \delta)$ times the cost of an optimal demonstration, we can still apply the sampling techniques discussed in earlier sections, but we must ensure that sampled unsafe trajectories are truly unsafe: a sampled trajectory whose cost is lower than the demonstration's but no lower than the optimal cost can potentially be safe. Two options follow. One is to only keep trajectories with cost less than the demonstration's cost divided by $(1 + \delta)$, which is a lower bound on the optimal cost, but this can cause little to be learned if $\delta$ is large. The other is to assume a distribution on suboptimality, so that each sampled lower-cost trajectory is known to be unsafe with some probability determined by its cost and the cost of the demonstration. We can then use these probabilities to solve the continuous minimization problem (8). We implement this in the experiments.
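The first option reduces to a one-line filter. The sketch below assumes the bounded-suboptimality condition has the multiplicative form demo_cost ≤ (1 + δ) · optimal_cost; the function name and the numbers are illustrative.

```python
def certainly_unsafe(sample_cost, demo_cost, delta):
    """True if a sampled trajectory must violate the unknown constraint.

    Assumes demo_cost <= (1 + delta) * optimal_cost, which gives
    optimal_cost >= demo_cost / (1 + delta). A sample that satisfies the
    known constraints yet costs less than this bound beats the constrained
    optimum, so it cannot be safe.
    """
    return sample_cost < demo_cost / (1.0 + delta)

# Hypothetical numbers: demonstration cost 10 with delta = 0.25 gives a
# threshold of 8.
keep = [c for c in (7.0, 7.9, 8.5, 9.9) if certainly_unsafe(c, 10.0, 0.25)]
```

Samples with cost between the threshold and the demonstration cost are the ones the probabilistic option is meant to handle.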

5 Analysis

Due to space constraints, the proofs and additional remarks can be found in the appendix.

5.1 Learnability

We provide analysis on the learnability of unsafe sets, given the known constraints and cost function. Most analysis assumes unsafe sets defined over the state space: , but we extend it to the feature space in Corollary 9. We provide some definitions and state a result bounding , the set of all states that can be learned guaranteed unsafe. We first define the signed distance:

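[Signed distance] The signed distance from a point $p$ to a set $S$ is $\mathrm{sd}(p, S) = -\inf_{y \in \partial S} \|p - y\|$ if $p \in S$, and $\mathrm{sd}(p, S) = \inf_{y \in \partial S} \|p - y\|$ if $p \in S^c$.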

[Learnability (discrete time)] For trajectories generated by a discrete time dynamical system satisfying for all , the set of learnable guaranteed unsafe states is a subset of the outermost shell of the unsafe set: (see Section A.1 for illustration).

[Learnability (continuous time)] For continuous trajectories , the set of learnable guaranteed unsafe states shrinks to the boundary of the unsafe set: .

Depending on the cost function, can become arbitrarily small: some cost functions are not very informative for recovering a constraint. For example, the path length cost function used in many of the experiments (which was chosen due to its common use in the motion planning community), prevents any lower-cost sub-trajectories from being sampled from straight sub-trajectories. The system’s controllability also impacts learnability: the more controllable the system, the more of the shell is reachable. We present a theorem quantifying when the dynamics allow unsafe trajectories to be sampled in Theorem A.1.1.

5.2 Conservativeness

We discuss conditions on and discretization which ensure our method provides a conservative estimate of . For analysis, we assume has a Lipschitz boundary [11]. We begin with notation (explanatory illustrations are in Section A.2):

[Set thickness] Denote the outward-pointing normal vector at a point as . Furthermore, at non-differentiable points on , is replaced by the set of normal vectors for the sub-gradient of the Lipschitz function describing at that point [4]. The set has a thickness larger than if .

[-offset padding] Define the -offset padding as: .

[-padded set] We define the -padded set of the unsafe set , , as the union of the -offset padding and : .

[Conservative recovery of unsafe set] A sufficient condition ensuring that the set of learned guaranteed unsafe states is contained in is that has a set thickness greater than or equal to (c.f. Definition 5.1).

If we use continuous trajectories directly, the guaranteed learnable set shrinks to a subset of the boundary of the unsafe set (cf. Corollary 5.1). However, if we discretize these trajectories, we can learn unsafe states lying in the interior, at the cost of conservativeness holding only for a padded unsafe set.

[Continuous-to-discrete time conservativeness] Let be a continuous trajectory: . The system dynamics are described by . The trajectory is discretized in time, potentially non-uniformly, resulting in a discretized trajectory for all . Assume the maximum discretization time is . Denote

(9)

Then, our method recovers a subset of the -padded unsafe set, .

[Continuous-to-discrete time and space conservativeness] Let the largest grid cell in the constraint space be contained by a ball of radius . Then, if trajectories are discretized both in space and time, our method recovers a subset of the -padded unsafe set, .

[Continuous-to-discrete feature space conservativeness] Let the feature mapping from the state space to the constraint space be Lipschitz continuous with Lipschitz constant . Then, our method recovers a subset of the -padded unsafe set in the feature space, .

6 Evaluations

We provide an example showing the importance of using unsafe trajectories, and experiments showing that our method generalizes across system dynamics, that it works with discretization and suboptimal demonstrations, and that it learns a constraint in a feature space from a single demonstration. See Appendix B for parameters, cost functions, the dynamics, control constraints, and timings.

Version space example: Consider a simple 8-connected grid world in which the tasks are to go from a start to a goal, minimizing Euclidean path length while staying out of the unsafe “U-shape”, the outline of which is drawn in black (Fig. 3). Four demonstrations are provided, shown in Fig. 3 on the far left. Initially, the version space contains possible unsafe sets. Each safe trajectory of length reduces the version space at most by a factor of , invalidating at most possible unsafe sets. Unsafe trajectories are computed by enumerating the set of trajectories going from the start to the goal at lower cost than the demonstration. The numbers of unsafe sets consistent with the safe and unsafe trajectories for varying numbers of safe trajectories are given in Table 2.

No. of demonstrations | 1 | 2 | 3 | 4
Safe | 262144 | 4096 | 1024 | 256
Safe & unsafe | 11648 | 48 | 12 | 3
Table 2: Number of consistent unsafe sets, varying the no. of demonstrations, using/not using unsafe trajectories.

Ultimately, it is impossible to distinguish between the three unsafe sets on the right in Fig. 3. This is because there exists no task where a trajectory with cost lower than the demonstration can be sampled which only goes through one of the two uncertain states. Further, though the uncertain states are in the shell of the constraint, due to the limitations of the cost function, we can only learn a subset of that shell (c.f. Theorem 5.1).

There are two main takeaways from this experiment. First, by generating unsafe trajectories, we can reduce the uncertainty arising from the ill-posedness of constraint learning: after 4 demonstrations, using unsafe trajectories enables us to reduce the number of possible constraints by nearly a factor of 100, from 256 to 3. Second, due to limitations in the cost function, it may be impossible to recover a unique unsafe set, but the version space can be reduced substantially by sampling unsafe trajectories.

Figure 3: Leftmost: Demonstrations and unsafe set. Rest: Set of possible constraints. Postulated unsafe cells are plotted in red, safe states in blue.

Dynamics and discretization: In the experiments in Fig. 4, we show that our method can be applied to several types of system dynamics, can learn non-convex/multiple unsafe sets, and can use continuous trajectories. The dynamics, control constraints, and cost functions for each experiment are given in Table 5 in Appendix B. All unsafe sets are open sets. We solve Problems 4.4 and 4.4, with an energy function promoting smoothness by penalizing squared deviations of the occupancy $\mathcal{O}(z_i)$ of a grid cell $z_i$ from its 4-connected neighbors $N(z_i)$: $\sum_{i} \sum_{z_j \in N(z_i)} (\mathcal{O}(z_i) - \mathcal{O}(z_j))^2$. In all experiments, the mean squared error (MSE) is computed as $\frac{1}{|\mathcal{Z}|} \sum_{i} (\mathcal{O}(z_i) - \mathcal{O}^*(z_i))^2$, where $\mathcal{O}^*(z_i)$ is the ground truth occupancy. The demonstrations are color-matched with their corresponding number on the x-axis of the MSE plots. For experiments with more demonstrations, only those causing a notable change in the MSE were color-coded. The learned guaranteed unsafe states are colored red on the left column.
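The smoothness energy and the MSE described above can be sketched for a small occupancy grid. This is a minimal illustration under stated assumptions: occupancies are stored as nested lists, each ordered pair of 4-connected neighbors contributes one squared difference, and the MSE averages over all cells. The function names and the 3-by-3 grids are hypothetical.

```python
def smoothness_energy(occ):
    """Sum of squared occupancy differences over ordered 4-connected pairs."""
    rows, cols = len(occ), len(occ[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                # Skip neighbors that fall outside the grid.
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += (occ[r][c] - occ[rr][cc]) ** 2
    return total


def mean_squared_error(occ, truth):
    """Mean squared difference between estimated and true occupancy."""
    n_cells = len(occ) * len(occ[0])
    squared = sum(
        (o - t) ** 2
        for occ_row, truth_row in zip(occ, truth)
        for o, t in zip(occ_row, truth_row)
    )
    return squared / n_cells


# Hypothetical ground truth with one unsafe cell, and an all-safe estimate.
truth = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
estimate = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

energy_truth = smoothness_energy(truth)
error = mean_squared_error(estimate, truth)
```

An isolated unsafe cell is penalized by the energy, which is why such a term favors contiguous unsafe regions.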

We recover a non-convex “U-shaped” unsafe set in the state space using trivial 2D single-integrator dynamics (row 1 of Fig. 4). The solutions to both Problems 4.4 and 4.4 return reasonable results, and the solution of Problem 4.4 achieves zero error. The second row shows learning two polyhedral unsafe sets in the state space with 4D double integrator linear dynamics, yielding similar results. We note that the linear interpolation of some demonstrations in rows 1 and 2 enters the unsafe set; this is because both sets of dynamics are in discrete time and only the discrete waypoints must stay out of the unsafe set.

The third row shows learning a polyhedral unsafe set in the state space, with time-discretized continuous, nonlinear Dubins’ car dynamics, which has a 3D state (planar position and heading). These dynamics are more constrained than the previous cases, so sampling lower cost trajectories becomes more difficult, but despite this we can still achieve near zero error solving Problem 4.4. Some over-approximation results from sampled unsafe trajectories entering regions not covered by the safe trajectories. For example, the cluster of red blocks to the top left of the unsafe set is generated by lower-cost trajectories that offset the increased cost of entering the top left region by entering the unsafe set. This phenomenon is consistent with Theorem 3 of the appendix; we recover a set that is contained within the appropriate padded unsafe set (the max discretization time was 4.5 seconds). Spikes in the learning curve occur when overapproximation occurs.

Overall, we note that the learned guaranteed unsafe set tends to be a significant underapproximation of the true unsafe set due to the chosen cost function and limited demonstrations. For example, in row 1 of Fig. 4, the learned guaranteed unsafe set cannot contain the portion of the unsafe set near long straight edges, since there exists no shorter path going from any start to any goal with only one state within that region. For row 3 of Fig. 4, we learn less of the bottom part of the unsafe set because most demonstrations’ start and goal locations make it harder to sample feasible control trajectories going through that region; with more demonstrations, this issue becomes less pronounced.
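The remark about interpolated demonstrations crossing the unsafe set can be checked numerically. The sketch below assumes a hypothetical open axis-aligned box as the unsafe set and tests both the discrete waypoints and densely sampled points on the straight segments between them. All names and coordinates are illustrative assumptions.

```python
def in_open_box(p, lo, hi):
    """True if point p lies strictly inside the box with corners lo and hi."""
    return all(l < x < h for x, l, h in zip(p, lo, hi))


def waypoints_are_safe(waypoints, lo, hi):
    """True if no discrete waypoint lies inside the open box."""
    return not any(in_open_box(p, lo, hi) for p in waypoints)


def interpolation_enters_box(waypoints, lo, hi, samples=100):
    """True if a sampled point on any linear segment lies inside the box."""
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(samples + 1):
            s = k / samples
            p = tuple(ai + s * (bi - ai) for ai, bi in zip(a, b))
            if in_open_box(p, lo, hi):
                return True
    return False


# Hypothetical unsafe box and a two-waypoint trajectory that jumps over it.
box_lo, box_hi = (0.0, 0.0), (1.0, 1.0)
jumping = [(-0.5, 0.5), (1.5, 0.5)]
passing_above = [(-0.5, 1.5), (1.5, 1.5)]

jump_is_safe = waypoints_are_safe(jumping, box_lo, box_hi)
jump_crosses = interpolation_enters_box(jumping, box_lo, box_hi)
```

The first trajectory satisfies the discrete-time constraint at its waypoints even though its interpolation crosses the box, which matches the behavior visible in rows 1 and 2 of Fig. 4.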

Figure 4: Results across dynamics, discretization. Rows (top-to-bottom): Single integrator; double integrator; Dubins’ car (CT). Columns, left-to-right: Demos., unsafe set, and learned guaranteed unsafe states; MSE; Problem 4.4 solution, all demos.; Problem 4.4 solution, all demos.

Suboptimal human demonstrations: We demonstrate our method on suboptimal demonstrations collected via a driving simulator, using a car model with CT Dubins’ car dynamics. Human steering commands were recorded as demonstrations, where the task was to navigate around the orange box and drive between the trees (Fig. 5). For a demonstration of cost , trajectories with cost less than were believed unsafe with probability 1. Trajectories with cost in the interval were believed unsafe with probability . MSE for Problem 4.4 is shown in Fig. 5 (Problem 4.4 is not solved since the probabilistic interpretation is needed). The maximum discretization time is seconds; hence, despite suboptimality, the learned guaranteed unsafe set is a subset of . While the MSE is highest here of all experiments, this is expected, as trajectories may be incorrectly labeled safe/unsafe with some probability.

Figure 5: Suboptimal demonstrations: left: setup, center: demonstrations, , , center-right: MSE, right: solution to Problem 4.4.
Figure 6: Demonstration (red: start, green: goal). Unsafe set is plotted in orange. Terrain isocontours are overlaid.

Feature space constraint: We demonstrate that our framework is not limited to the state space by learning a constraint in a feature space. Consider the scenario of planning a safe path for a mobile robot with continuous Dubins’ car dynamics through hilly terrain, where the magnitude of the terrain’s slope is given as a feature map (i.e. , where and is the elevation map). The robot will slip if the magnitude of the terrain slope is too large, so we generate a demonstration which obeys the ground truth constraint ; hence, the ground truth unsafe set is . From one safe trajectory (Fig. 6) generated by RRT* [15] and gridding the feature space as , we recover the constraint exactly.
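A slope-magnitude feature of this kind can be sketched as follows. The elevation function, the finite-difference step, and the slope threshold are all hypothetical stand-ins chosen for illustration; they are not the terrain or the constraint value used in the experiment.

```python
import math


def elevation(x, y):
    """Hypothetical terrain: a single smooth hill centered at the origin."""
    return math.exp(-(x ** 2 + y ** 2))


def slope_magnitude(x, y, h=1e-5):
    """Norm of the elevation gradient, estimated by central differences."""
    dz_dx = (elevation(x + h, y) - elevation(x - h, y)) / (2 * h)
    dz_dy = (elevation(x, y + h) - elevation(x, y - h)) / (2 * h)
    return math.hypot(dz_dx, dz_dy)


def trajectory_is_safe(waypoints, max_slope):
    """True if the slope feature stays at or below max_slope everywhere."""
    return all(slope_magnitude(x, y) <= max_slope for x, y in waypoints)


# Hypothetical threshold and two candidate paths.
MAX_SLOPE = 0.5
around_hill = [(2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
over_hill = [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

around_ok = trajectory_is_safe(around_hill, MAX_SLOPE)
over_ok = trajectory_is_safe(over_hill, MAX_SLOPE)
```

Because the constraint is expressed on the scalar feature rather than on position, the unsafe region in the state space can have an irregular shape while remaining a simple interval in the feature space.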

7 Conclusion

In this paper we propose an algorithm that learns constraints from demonstrations, which acts as a complementary method to IOC/IRL algorithms. We analyze the properties of our algorithm as well as the theoretical limits of what subset of an unsafe set can be learned from safe demonstrations. The method works well on a variety of system dynamics and can be adapted to work with suboptimal demonstrations. We further show that our method can also learn constraints in a feature space. The largest shortcoming of our method is the constraint space gridding, which yields a complex constraint representation and causes the method to scale poorly to higher dimensional constraints. We aim to remedy this issue in future work by developing a grid-free counterpart of our method for convex unsafe sets, which can directly describe standard pose constraints like task space regions [8].

Acknowledgements

This work was supported in part by a Rackham first-year graduate fellowship, ONR grants N00014-18-1-2501 and N00014-17-1-2050, and NSF grants CNS-1446298, ECCS-1553873, and IIS-1750489.

References

  • [1] Y. Abbasi-Yadkori, P. L. Bartlett, V. Gabillon, and A. Malek. Hit-and-run for sampling and planning in non-convex spaces. In AISTATS 2017, 2017.
  • [2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  • [3] A. Akametalu, J. Fisac, J. Gillula, S. Kaynama, M. Zeilinger, and C. Tomlin. Reachability-based safe learning with Gaussian processes. In CDC, Dec 2014.
  • [4] G. Allaire, F. Jouve, and G. Michailidis. Thickness control in structural optimization via a level set method. Struct. and Multidisciplinary Optimization, 2016.
  • [5] K. Amin, N. Jiang, and S. P. Singh. Repeated inverse reinforcement learning. In NIPS, pages 1813–1822, 2017.
  • [6] B. Argall, S. Chernova, M. M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • [7] L. Armesto, J. Bosga, V. Ivan, and S. Vijayakumar. Efficient learning of constraints and generic null space policies. In ICRA, pages 1520–1526. IEEE, 2017.
  • [8] D. Berenson, S. S. Srinivasa, and J. J. Kuffner. Task space regions: A framework for pose-constrained manipulation planning. IJRR, 30(12):1435–1460, 2011.
  • [9] S. Calinon and A. Billard. Incremental learning of gestures by imitation in a humanoid robot. In HRI 2007, pages 255–262, 2007.
  • [10] S. Calinon and A. Billard. A probabilistic programming by demonstration framework handling constraints in joint space and task space. In RSJ, 2008.
  • [11] B. Dacorogna. Introduction to the calculus of variations. Imp. College Press, 2015.
  • [12] P. Englert, N. A. Vien, and M. Toussaint. Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. IJRR, 36(13-14):1474–1488, 2017.
  • [13] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd Ed.). 1996.
  • [14] R. E. Kalman. When is a linear control system optimal? Journal of Basic Engineering, 86(1):51–60, Mar 1964.
  • [15] S. Karaman and E. Frazzoli. Incremental sampling-based algorithms for optimal motion planning. In RSS, 2010.
  • [16] A. Keshavarz, Y. Wang, and S. P. Boyd. Imputing a convex objective function. In ISIC, pages 613–619. IEEE, 2011.
  • [17] H. K. Khalil. Nonlinear systems. Prentice-Hall, Upper Saddle River, NJ, 2002.
  • [18] S. Kiatsupaibul, R. L. Smith, and Z. B. Zabinsky. An analysis of a variation of hit-and-run for uniform sampling from general regions. TOMACS, 2011.
  • [19] C. Li and D. Berenson. Learning object orientation constraints and guiding constraints for narrow passages from one demonstration. In ISER. Springer, 2016.
  • [20] N. Mehr, R. Horowitz, and A. D. Dragan. Inferring and assisting with constraints in shared autonomy. In CDC, pages 6689–6696, Dec 2016.
  • [21] T. L. Morin. Monotonicity and the principle of optimality. Journal of Mathematical Analysis and Applications, 88(2):665 – 674, 1982.
  • [22] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML ’00, pages 663–670, San Francisco, CA, USA, 2000.
  • [23] A. L. Pais, K. Umezawa, Y. Nakamura, and A. Billard. Learning robot skills through motion segmentation and constraints extraction. HRI, 2013.
  • [24] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ, 1982.
  • [25] C. Pérez-D’Arpino and J. A. Shah. C-LEARN: learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy. In ICRA, 2017.
  • [26] N. D. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In ICML 2006, pages 729–736, 2006.
  • [27] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 2003.
  • [28] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint. Safe exploration for active learning with Gaussian processes. In ECML, 2015.
  • [29] M. Turchetta, F. Berkenkamp, and A. Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS, pages 4305–4313, 2016.
  • [30] G. Ye and R. Alterovitz. Demonstration-guided motion planning. In ISRR, 2011.

Appendix A Analysis

A brief overview of the most important results in this section:

  • Theorem A.1 shows that all states that can be guaranteed unsafe must lie within some distance to the boundary of the unsafe set. Corollary A.1 shows that the set of guaranteed unsafe states shrinks to a subset of the boundary of the unsafe set when using a continuous demonstration directly to learn the constraint.

  • Corollary 9 shows that for the discrete time case and the continuous, non-discretized case, our estimate of the unsafe set is a guaranteed underapproximation of the true unsafe set if the unsafe set is sufficiently “thick”.

  • For continuous trajectories that are then discretized, Theorem 9 shows us that the guaranteed unsafe set can be made to contain states on the interior of the unsafe set, but at the cost of potentially labeling states within some distance outside of the unsafe set as unsafe as well.

For convenience, we repeat the definitions, along with some illustrations for the sake of visualization.

A.1 Learnability

In this section, we analyze the learnability of unsafe sets, given the known constraints and cost function. Most of the analysis is based on unsafe sets defined over the state space, but we extend it to the feature space in Corollary 9. If a state can be learned to be guaranteed unsafe, we say that it belongs to the set of all states that can be learned guaranteed unsafe.

We begin our analysis with some notation.

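[Signed distance] Signed distance from point $p \in \mathbb{R}^m$ to set $\mathcal{S} \subseteq \mathbb{R}^m$: $\mathrm{sd}(p, \mathcal{S}) = -\inf_{y \in \partial \mathcal{S}} \|p - y\|$ if $p \in \mathcal{S}$; $\mathrm{sd}(p, \mathcal{S}) = \inf_{y \in \partial \mathcal{S}} \|p - y\|$ if $p \notin \mathcal{S}$.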

The following theorem describes the nature of :

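[Learnability (discrete time)] For trajectories generated by a discrete time dynamical system satisfying $\|x_{t+1} - x_t\| \le \Delta x$ for all $t$, the set of learnable guaranteed unsafe states is a subset of the outermost $\Delta x$-shell of the unsafe set $\mathcal{A}$: it is contained in $\{x \in \mathcal{A} \mid -\Delta x \le \mathrm{sd}(x, \mathcal{A}) \le 0\}$, where $\mathrm{sd}$ denotes the signed distance.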

Proof.

Consider the case of a length unsafe trajectory , . For a state to be learned guaranteed unsafe, states in must be learned safe. This implies that regardless of where that unsafe state is located in the trajectory, it must be reachable from some safe state within one time-step. This is because if multiple states in differ from the original safe trajectory , to learn that one state is unsafe with certainty means that the others should be learned safe from some other demonstration. Say that , i.e. they are learned safe. Since and , must be within of the boundary of the unsafe set: , implying .

Figure 7: Illustration of the outermost shell (shown in red) of the unsafe set . The hatched area cannot be learned guaranteed safe.
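The shell in the theorem can be made concrete for a simple unsafe set. The sketch below assumes the unsafe set is an axis-aligned box, that the signed distance is negative inside the set, and that `delta_x` (a hypothetical name) bounds how far the state can move in one time-step. It only tests membership in the shell; it does not decide which shell states are actually learnable.

```python
import math


def signed_distance_box(p, lo, hi):
    """Signed distance from p to a box: negative inside, positive outside."""
    # Per-axis offset from the box faces: positive outside, negative inside.
    q = [max(l - x, x - h) for x, l, h in zip(p, lo, hi)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside


def in_outer_shell(p, lo, hi, delta_x):
    """True if p is in the box and within delta_x of its boundary."""
    sd = signed_distance_box(p, lo, hi)
    return -delta_x <= sd <= 0.0


# Hypothetical unsafe box and one-step displacement bound.
box_lo, box_hi = (0.0, 0.0), (4.0, 4.0)
DELTA_X = 1.0

near_boundary = in_outer_shell((0.5, 2.0), box_lo, box_hi, DELTA_X)
deep_inside = in_outer_shell((2.0, 2.0), box_lo, box_hi, DELTA_X)
```

With these values, the center of the box lies two units from the boundary and therefore falls in the region that cannot be learned guaranteed unsafe.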
Remark.

For linear dynamics, can be found exactly via

(10)

where convexity depends on the convexity of and .

In the case of general dynamics, an upper bound on can be found via

(11)

[Learnability (continuous time)] For continuous trajectories , the set of learnable guaranteed unsafe states shrinks to the boundary of the unsafe set: .

Proof.

The output trajectory of a continuous time system can be seen as the output of a discrete time system in the limit as the time-step is taken to 0. In this case, as long as the dynamics are locally Lipschitz continuous, [17], and via Theorem 5.1, the corollary is proved. ∎

It is worth noting that depending on the cost function chosen, can become arbitrarily small; in other words, some cost functions are more informative than others in recovering a constraint. An interesting avenue of future work is to investigate the properties of cost functions that enable more to be learned about the constraints and how this knowledge can help inform reward (or cost) shaping.

A.1.1 Learnability (dynamics)

Depending on the dynamics of the system, it may be impossible to obtain sub-trajectories with few perturbed waypoints from sampling, due to there only being one feasible control sequence that takes the system from a start to a goal state. We formalize this intuition in the following theorem:

[Forward reachable set] The forward reachable set is the set of all states that a dynamical system can reach at time starting from at time , using controls drawn from an admissible set of controls :

(12)

[Learnability (dynamics)] Let be consecutive waypoints on a safe trajectory at times , with time discretization between states and , where all but are free to move. Then, a necessary condition for being able to sample unsafe trajectories is that such that : i.e. there exists at least one state that the dynamics allow to be moved from the demonstrated trajectory.

Proof.

Proof by contradiction. Assume that there does not exist an such that . Then, there exists no alternate sequence of controls taking the system from to ; hence no trajectories satisfying the start and goal constraints can be satisfied.

Additionally, the same analysis can be used for continuous trajectories in the limit as the time-step between consecutive waypoints, , goes to 0. ∎

Remark.

This implies that when the dynamics are highly restrictive, less of the unsafe set can be learned to be guaranteed unsafe, and the learnable subset of the -shell of the unsafe set (as described in Theorem 5.1) can become small.
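The necessary condition can be illustrated with a toy system. The sketch below assumes a hypothetical single integrator on an integer grid with 8-connected moves, considers three consecutive waypoints with the first and last held fixed, and lists the alternative middle waypoints that the dynamics allow. The function names and the control set are illustrative assumptions.

```python
def one_step_reachable(x, controls):
    """States reachable from x in one step under the given controls."""
    return {(x[0] + u[0], x[1] + u[1]) for u in controls}


def alternate_midpoints(x_prev, x_mid, x_next, controls):
    """Middle waypoints other than x_mid that still connect x_prev to x_next."""
    forward = one_step_reachable(x_prev, controls)
    return {
        y
        for y in forward
        if y != x_mid and x_next in one_step_reachable(y, controls)
    }


# Hypothetical control set: the eight moves of an 8-connected grid.
EIGHT_CONNECTED = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]

straight = alternate_midpoints((0, 0), (1, 0), (2, 0), EIGHT_CONNECTED)
diagonal = alternate_midpoints((0, 0), (1, 1), (2, 2), EIGHT_CONNECTED)
```

For the straight segment the middle waypoint can be moved to either side, whereas for the diagonal segment no alternative exists, so no perturbed trajectory can be sampled between those endpoints in two steps.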

A.2 Conservativeness

For the analysis in this section, we will assume that the unsafe set has a Lipschitz boundary; informally, this means that can be locally described by the graph of a Lipschitz continuous function. A formal definition can be found in [11]. We define some notation:

[Set thickness] Denote the outward-pointing normal vector at a point as . Furthermore, at non-differentiable points on , is replaced by the set of normal vectors for the sub-gradient of the Lipschitz function describing at that point [4]. The set has a thickness larger than if .

[-offset padding] Define the -offset padding as: .

[-padded set] We define the -padded set of the unsafe set , , as the union of the -offset padding and : .

Figure 8: Illustration of thickness, c.f. Definition 5.2.
Figure 9: Illustration of the -padded set , marked in red. The -offset padding is displayed in red. The original set is shown in white.

[Conservative recovery of unsafe set] A sufficient condition ensuring that the set of recovered guaranteed unsafe states is contained in is that has a set thickness greater than or equal to (c.f. Definition A.1).

Proof.

Via Theorem 5.1, our method will not determine that any state further inside than the outer -shell is unsafe for sure. If has thickness at least , then our method will only determine states that are within the unsafe set to be guaranteed unsafe. This holds for discrete time dynamics and continuous time dynamics as well as . ∎

Note that if we deal with continuous trajectories directly, the guaranteed learnable set shrinks to a subset of the boundary of the unsafe set, . However, if we discretize these trajectories, we can learn unsafe states lying in the interior, at the cost of conservativeness holding only for a padded unsafe set.

[Continuous-to-discrete time conservativeness] Let be a continuous trajectory: . The system dynamics are described by . The trajectory is discretized in time, potentially non-uniformly, resulting in a discretized trajectory for all . Assume the maximum discretization time is . Denote