Discretizing Dynamics for Maximum Likelihood Constraint Inference

by   Kaylene C. Stocking, et al.

Maximum likelihood constraint inference is a powerful technique for identifying unmodeled constraints that affect the behavior of a demonstrator acting under a known objective function. However, it was originally formulated only for discrete state-action spaces. Continuous dynamics are more useful for modeling many real-world systems of interest, including the movements of humans and robots. We present a method to generate a tabular state-action space that approximates continuous dynamics and can be used for constraint inference on demonstrations that obey the true system dynamics. We then demonstrate accurate constraint inference on nonlinear pendulum systems with 2- and 4-dimensional state spaces, and show that performance is robust to a range of hyperparameters. The demonstrations are not required to be fully optimal with respect to the objective, and the most likely constraints can be identified even when demonstrations cover only a small portion of the state space. For these reasons, the proposed approach may be especially useful for inferring constraints on human demonstrators, which has important applications in human-robot interaction and biomechanical medicine.




I Introduction

Inverse reinforcement learning (IRL) allows an agent to infer the goals driving someone’s behavior and learn to complete the same task simply by observing. This is a powerful paradigm for learning new behaviors from scratch, but doesn’t encompass all of the useful information we may extract from observations. For example, consider the case where the agent already has a good policy for some task, but notices that an expert demonstrator is deviating from the optimal behavior. A reasonable explanation would be that the demonstrator is acting under new environmental constraints that the agent is unaware of. As a concrete example, we can imagine a scenario where an autonomous vehicle (the agent) is following a car that suddenly swerves (the demonstrator). Since both vehicles have the same policy of avoiding collisions and following the road, the agent can infer that an obstacle suddenly appeared in the road and take evasive action even before it can detect the obstacle directly. Taking cues from other agents is an important aspect of intelligent behavior that can help compensate for problems such as sensor failure or perceptual error.

Fig. 1: An overview of maximum likelihood constraint inference on a continuous system. (a) A human demonstrates task performance while obeying an unknown constraint. (b) Their continuous trajectory is projected onto the approximate state space of a tabular Markov Decision Process (MDP). (c) The human’s behavior deviates from what the MDP expects due to the unmodeled constraint. (d) The most likely constraint can be inferred, causing the tabular MDP trajectory to more closely match that of the human. The inferred constraint region can be mapped back into the continuous state space for applications such as identifying potential biomechanical problems.

The process of detecting constraints that help explain the behavior of a demonstrator is called constraint inference. Scobee and Sastry [17] applied the maximum entropy IRL framework to this problem, resulting in an algorithm that can identify the most likely constraints from a hypothesis set. However, this approach is limited to systems with tabular state-action spaces. This precludes its use in many real systems of interest whose dynamics are inherently continuous. In this paper, we describe a procedure for creating a tabular approximation of an arbitrary continuous system, and show that maximum likelihood constraint inference (MLCI) can be used to infer constraints on the approximated system that transfer well to the original continuous system. We analyze the effects of various approximation hyperparameters on the accuracy of constraint inference on an example 2-dimensional pendulum system. Although this analysis does not necessarily generalize to other sets of dynamics, following a similar procedure on a system of interest can indicate whether the approximation is sufficient for meaningful constraint inference. We also present a technique for estimating confidence in the inferred constraint in the form of a Bayesian probability update.

In addition to applications where the agent wishes to use learned constraints to improve its own policy, this extension of MLCI allows us to perform constraint inference directly from observed human movements. One exciting potential application of this work is in individuals with non-specific low-back pain, pain that is not immediately attributable to a specific pathology. This affects 84% of people in their lifetime, with around 12% of people being disabled from this pain [1]. Our proposed constraint inference approach would enable an explanatory biomechanical tool to infer joint level limitations from a series of full-body movements, where traditional biomechanical methods have seen limited success [14]. This motivates the telescoping inverted pendulum model (section VI) which has been used to model different standing patterns in clinical populations [13].

We first present related work in section II, before briefly introducing MLCI in section III and our method for translating continuous dynamics into a tabular Markov Decision Process that can be used with MLCI in section IV. We then perform experimental analysis in section V, and finally show an example with clinically motivated 4D telescoping inverted pendulum dynamics in section VI.

II Related Work

Previous work on constraint inference can be split into two categories: approaches that infer the most likely constraints but require a tabular state-action space, and those that admit continuous dynamics but drop the maximum likelihood feature. In the former category, in addition to Scobee and Sastry [17], Vazquez-Chanlatte et al. [19] learn task specifications, which can be thought of as a generalization of state-space constraints to include complex multi-step behaviors. Unfortunately, neither of these approaches can be applied directly to many real-world systems that are inherently continuous.

There are many proposed methods for identifying constraints in systems that cannot be tabulated. Some use heuristics such as assuming that constrained behaviors will have high intra-demonstration variance and low inter-demonstration variance, or that maintaining an end effector in the same orientation throughout a demonstration suggests a constraint [12, 5]. [7] presents a kinematics-based approach for learning constraints that affect how a nominal policy is executed in different environments, but doesn’t assume an objective function and therefore requires demonstrations to cover much of the state space for the inference to be well-defined. [10] is specialized for online constraint inference in the context of shared autonomy, where mis-identified constraints can be corrected by the user. [2] provides a flexible approach for learning state-space constraints by sampling from possible trajectories with lower costs than the demonstrations.

Although the present work introduces error by estimating continuous dynamics with a finite state-action space, it provides two key advantages over previous methods that work with continuous dynamics. First, using the maximum entropy framework allows us to model the demonstrators as soft-optimal with respect to a reward function, which may be especially appropriate for human demonstrators. Second, we are able to estimate and rank the most likely constraints even in situations where demonstrations cover only a small portion of the state space and do not provide enough information to fully resolve ambiguity in possible constraints.

III Markov Decision Processes and Maximum Likelihood Constraint Inference

To perform maximum likelihood constraint inference (MLCI), we adapt the approach developed by [17]. In this section, we present a brief overview of the MLCI algorithm.

III-A Markov Decision Dynamics

MLCI is formulated as an operation on a tabular Markov Decision Process (MDP). The MDP is a tuple of four elements:

  • A state space $S$ to navigate. $S$ is a finite set of discrete state values: $s \in S$.

  • A set of actions $A$ to decide between. $A$ is a finite set of discrete input values: $a \in A$.

  • A transition kernel $P(s_{t+1} \mid s_t, a_t)$ that determines the influence of $a_t$ on $s_{t+1}$. The repeated action of this transition kernel generates a sequence of states $s_{1:T}$ over a time horizon $T$ given a sequence of action choices $a_{1:T}$ up to the horizon. The couple of state sequence and action sequence is the trajectory $\xi = (s_{1:T}, a_{1:T})$, and the space of all possible trajectories is $\Xi$.

  • An objective metric $R(\xi)$ that measures the quality of trajectories.

This work focuses on deterministic dynamics, so the transition kernel will be singleton distributions with zero probability of all next states except the deterministic successor $\delta(s_t, a_t)$. That is, we focus on MDPs with transitions of the form:

$$P(s_{t+1} \mid s_t, a_t) = \mathbb{1}[s_{t+1} = \delta(s_t, a_t)] \tag{1}$$

MLCI requires that $S$ and $A$ be finite sets, and we refer to an MDP which satisfies this property as tabular.

III-B Maximum Entropy Likelihood on Trajectories

This work leverages the maximum entropy likelihood distribution advanced in [21] and extended to constraint inference in [17]. This distribution’s randomness reflects epistemological uncertainty in the estimated reward function of the demonstrator. Under this distribution, the likelihood of a trajectory $\xi$ is defined on the deterministic MDP as:

$$P(\xi) = \frac{e^{R(\xi)}}{Z} \tag{2}$$

where $Z$ is the normalizing constant:

$$Z = \sum_{\xi \in \Xi} e^{R(\xi)}$$

This work investigates how dynamic agents avoid certain sets of states $C \subseteq S$. These constrained states further refine the choice distribution by zeroing out illegal choices:

$$P(\xi \mid C) = \frac{e^{R(\xi)}}{Z_C} \, \mathbb{1}[\xi \in \Xi_C] \tag{3}$$

where the partition constant $Z$ decreases to $Z_C$ for this new distribution, which constrains out much of the previous support. Let $\Xi_C$ be the subset of trajectories that don’t violate the constraint $C$:

$$\Xi_C = \{\xi \in \Xi : s_t \notin C \ \text{for all } t\}$$

so that $Z_C$ may be simply defined as:

$$Z_C = \sum_{\xi \in \Xi_C} e^{R(\xi)}$$
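To make the partition constants concrete, the following is a minimal sketch on a hypothetical toy MDP (four states on a line, three actions, a three-step horizon, and a unit cost for moving; none of these come from the paper). It computes the sum of exponentiated trajectory rewards over constraint-respecting trajectories in two ways: by brute-force enumeration of action sequences, and by a backward recursion over time, which is the idea behind a Bellman-style backup.

```python
import math
from itertools import product

# Hypothetical toy MDP (an assumption for illustration only):
# 4 states on a line; actions move left (-1), stay (0), or right (+1).
STATES = range(4)
ACTIONS = (-1, 0, 1)
HORIZON = 3  # number of transitions per trajectory


def step(s, a):
    """Deterministic successor, clipped to the ends of the line."""
    return min(max(s + a, 0), 3)


def reward(s, a):
    """Assumed per-step reward: unit cost for moving, none for staying."""
    return -abs(a)


def partition_by_enumeration(start, constraint=frozenset()):
    """Sum exp(R(xi)) over all action sequences whose states avoid `constraint`."""
    total = 0.0
    for actions in product(ACTIONS, repeat=HORIZON):
        s, ret, ok = start, 0.0, start not in constraint
        for a in actions:
            ret += reward(s, a)
            s = step(s, a)
            ok = ok and s not in constraint
        if ok:
            total += math.exp(ret)
    return total


def partition_by_backup(start, constraint=frozenset()):
    """Same quantity via a backward recursion over the time horizon."""
    z = {s: 0.0 if s in constraint else 1.0 for s in STATES}
    for _ in range(HORIZON):
        z = {s: 0.0 if s in constraint else
                sum(math.exp(reward(s, a)) * z[step(s, a)] for a in ACTIONS)
             for s in STATES}
    return z[start]
```

Both functions return the same value for any start state and constraint set, and adding a constraint can only shrink the result, since it removes trajectories from the sum. The recursion scales linearly with the horizon, whereas enumeration grows exponentially.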

III-C Constraint Inference

The distribution in equation (3) describes the likelihood of observing any demonstrated trajectory given a constraint set $C$. Given a set $D = \{\xi_1, \dots, \xi_N\}$ of independent and identically distributed sample trajectories, the likelihood of observing this dataset is:

$$P(D \mid C) = \prod_{i=1}^{N} P(\xi_i \mid C) = \prod_{i=1}^{N} \frac{e^{R(\xi_i)}}{Z_C} \, \mathbb{1}[\xi_i \in \Xi_C]$$

Adding a constraint to the model that helps explain the demonstrations will increase this likelihood. Therefore, the most likely constraint $C^*$ is the one that maximizes $P(D \mid C)$. Note two properties that will aid in finding $C^*$:

Remark 1.

The optimal constraint set $C^*$ must have all $\xi_i \in D$ inside of its corresponding $\Xi_{C^*}$. Otherwise its likelihood would be 0, a lower likelihood even than having no constraints at all. This would contradict its being the optimum.

Therefore for any feasible candidate constraints, the indicator will always evaluate to 1. With the zero-case ruled out, the likelihood can be straightforwardly characterized by factoring out the remaining $C$-dependent component:

$$P(D \mid C) = \left( \prod_{i=1}^{N} e^{R(\xi_i)} \right) Z_C^{-N}$$

Remark 2.

When comparing the likelihood amongst feasible constraint sets, they are only re-scalings of the same dataset-determined constant by $Z_C^{-N}$. So the maximum likelihood constraint set is simply whichever set $C$, amongst the feasible constraint sets, has the smallest $Z_C$.

For every hypothesized constraint set $C$, $Z_C$ can be computed by a Bellman backup or by forward simulation. The latter approach is favored by [17] as it makes a direct parallel to the seminal Maximum Entropy IRL work [21]. Let $C_0$ be some baseline set of known constraints (e.g. the empty set for the unconstrained case). The forward simulation relies on the fact that $Z_C$ is proportional to $Z_{C_0}$.

The constant of proportionality can be calculated by forward simulating the state distribution under $C_0$’s maximum entropy distribution and observing the probability that trajectories violate the constraint $C$ up to time $T$. Call that quantity $F_C$, then:

$$Z_C = Z_{C_0} (1 - F_C)$$

Therefore, the most likely constraint minimizes $Z_C$, or equivalently, maximizes $F_C$. The quantity $F_C$ will be useful in some of our subsequent analysis.
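The selection rule described in this section can be sketched as follows, assuming the partition constants have already been computed for the baseline model and for each hypothesized constraint. The function name, the representation of constraints as frozensets of discrete states, and the representation of demonstrations as lists of visited states are all illustrative assumptions rather than the authors' implementation.

```python
def most_likely_constraint(z_baseline, z_by_constraint, demos):
    """Pick the feasible hypothesis with the smallest partition constant.

    z_baseline      -- partition constant of the baseline (e.g. unconstrained) model
    z_by_constraint -- dict mapping each hypothesized constraint (a frozenset of
                       discrete states) to its partition constant
    demos           -- list of demonstrations, each a sequence of visited states

    Returns (constraint, violation_probability), where the second entry is the
    probability that a baseline-model trajectory violates the chosen constraint.
    Returns (None, 0.0) if every hypothesis is violated by some demonstration.
    """
    visited = set().union(*map(set, demos))
    # Remark 1: a constraint touched by any demonstration has zero likelihood.
    feasible = {c: z for c, z in z_by_constraint.items() if not (c & visited)}
    if not feasible:
        return None, 0.0
    # Remark 2: among feasible constraints, the smallest partition constant wins.
    best = min(feasible, key=feasible.get)
    return best, 1.0 - feasible[best] / z_baseline
```

For example, with a baseline constant of 10.0 and hypotheses `{2}`, `{3}`, `{0}` having constants 4.0, 7.0, and 9.0, a demonstration that visits states 1 and 0 rules out `{0}` and selects `{2}`. Sorting the feasible hypotheses by partition constant instead of taking the minimum would give the ranked list of most likely constraints.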

IV Formulation of Approximate MDP

Given an arbitrary set of continuous dynamics of the form $\dot{x} = f(x, u)$, we wish to generate an appropriate tabular state-action space that can be used with the MLCI algorithm described in section III. We will illustrate this process with a pendulum model that we return to for experimental analysis in section V.

IV-A Running Example: Pendulum System

The pendulum model consists of a 2-dimensional state space (angle $\theta$ and angular velocity $\dot{\theta}$). The 1-dimensional control input $u$ is the normalized torque applied at the base of the pendulum:

$$\ddot{\theta} = -\frac{g}{l} \sin\theta + u$$

where the gravitational constant $g$ and the length of the pendulum $l$ are both assumed to be 1 for simplicity. The constraint hypothesis set is an evenly spaced 10-by-10 grid of non-overlapping cells that cover the state space, for a total of 100 possible constraints. (Note that any set of state space regions is acceptable as the constraint hypothesis set, including overlapping regions or ones that do not cover the whole state space, but it is typically appropriate for them to be equally sized. This is because a larger constraint region is able to “explain away” more demonstrator sub-optimality and is therefore likely to have a larger $F_C$, making it difficult to directly compare constraint regions of different sizes when choosing the most likely one.) The demonstrator wants to arrive at a particular goal state $x_{\text{goal}}$ at the end of a $T$ = 5s period while minimizing the total squared torque and avoiding the true constraint region, $C_{\text{true}}$:

$$\min_{u(\cdot)} \int_0^T u(\tau)^2 \, d\tau \quad \text{subject to} \quad x(T) = x_{\text{goal}}, \quad x(\tau) \notin C_{\text{true}} \ \text{for all } \tau \in [0, T] \tag{8}$$

where $x = (\theta, \dot{\theta})$, and we use $\tau$ to refer to continuous time and $t$ to refer to discrete time steps.
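As a rough illustration of the running example, the sketch below simulates a torque-driven pendulum with unit gravity and length and evaluates the total squared torque of a piecewise-constant control sequence. The sign convention (angle measured from the hanging position), the hand-written fixed-step Runge-Kutta integrator, and all function names are assumptions made for this sketch.

```python
import math


def pendulum_deriv(state, u, g=1.0, length=1.0):
    """Time derivative of (theta, omega) under normalized torque u.

    Assumption: theta is measured from the downward (hanging) position,
    so gravity contributes -(g / length) * sin(theta).
    """
    theta, omega = state
    return (omega, -(g / length) * math.sin(theta) + u)


def rk4_step(state, u, dt):
    """One classical fourth-order Runge-Kutta step with u held constant."""
    def shifted(k, h):
        return (state[0] + h * k[0], state[1] + h * k[1])

    k1 = pendulum_deriv(state, u)
    k2 = pendulum_deriv(shifted(k1, dt / 2), u)
    k3 = pendulum_deriv(shifted(k2, dt / 2), u)
    k4 = pendulum_deriv(shifted(k3, dt), u)
    return (state[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            state[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))


def simulate(state, controls, dt):
    """Apply each control for dt seconds; return the list of visited states."""
    trajectory = [state]
    for u in controls:
        state = rk4_step(state, u, dt)
        trajectory.append(state)
    return trajectory


def squared_torque_cost(controls, dt):
    """Total squared torque of a piecewise-constant control sequence."""
    return sum(u * u for u in controls) * dt
```

With zero torque the pendulum's energy, 0.5 * omega**2 - cos(theta) under these conventions, should stay nearly constant along a simulated trajectory, which gives a quick sanity check on the integrator.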

IV-B Forming The Tabular State-Action Space

First, we choose appropriate bounds for each dimension of the state space and control input, which can come from domain knowledge or observing the range of values in the demonstrations. For the pendulum system, it is natural to bound the angle $\theta$ to a single revolution of $2\pi$ rad. We select a velocity bound spanning 12 rad/s, and the control input bound is chosen by observing that the controls used by continuous trajectories optimizing the objective in equation (8) rarely exceed this range. We then grid up the continuous state space by dividing it into disjoint cells that completely cover the bounded area. A reasonable default is to use equally sized boxes. For example, we can divide the pendulum state space into 100 cells, 10 along each dimension, each encompassing a $\pi/5$ rad angle width and a 1.2 rad/s angular velocity range. The set of these cells is $S$. Similarly, the range of possible control inputs is divided into discrete points to give $A$. We use $s$ and $a$ to label the discrete states and actions, respectively. $x_s$ is the value of the continuous state at the center point of state cell $s$, while $u_a$ is the value of the control input associated with discrete action choice $a$.
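The gridding step can be sketched as a pair of maps between continuous states and cell indices. The specific bounds below (angle in one revolution centered on zero, velocity in a symmetric 12 rad/s range), the decision to wrap the angle, and the control bound passed to `action_values` are assumptions for illustration.

```python
import math

# Assumed bounds: one full revolution in angle, 10 cells x 1.2 rad/s in velocity.
THETA_MIN, THETA_MAX = -math.pi, math.pi
OMEGA_MIN, OMEGA_MAX = -6.0, 6.0
N_THETA, N_OMEGA = 10, 10


def cell_of(theta, omega):
    """Map a continuous state to its (i, j) cell, or None if velocity is out of bounds."""
    if not OMEGA_MIN <= omega <= OMEGA_MAX:
        return None
    # Assumption: the angle is periodic, so wrap it into [THETA_MIN, THETA_MAX).
    theta = (theta - THETA_MIN) % (2 * math.pi) + THETA_MIN
    i = int((theta - THETA_MIN) / (THETA_MAX - THETA_MIN) * N_THETA)
    j = int((omega - OMEGA_MIN) / (OMEGA_MAX - OMEGA_MIN) * N_OMEGA)
    # Points exactly on the upper boundary belong to the last cell.
    return min(i, N_THETA - 1), min(j, N_OMEGA - 1)


def center_of(i, j):
    """Continuous state at the center of cell (i, j)."""
    d_theta = (THETA_MAX - THETA_MIN) / N_THETA
    d_omega = (OMEGA_MAX - OMEGA_MIN) / N_OMEGA
    return (THETA_MIN + (i + 0.5) * d_theta, OMEGA_MIN + (j + 0.5) * d_omega)


def action_values(n_actions, u_max):
    """Evenly spaced control inputs on [-u_max, u_max] (u_max is hypothetical)."""
    return [-u_max + 2 * u_max * k / (n_actions - 1) for k in range(n_actions)]
```

Mapping a cell center back through `cell_of` returns the same cell, which is the property the tabular model relies on when it starts each transition from the center point.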

IV-C Tabular MDP Transition Kernel and Objective

To complete the tabular MDP representation, we need to determine the transition and reward associated with each $(s, a)$. For “gridworld” environments frequently used in inverse reinforcement learning, the agent is allowed to transition to any adjacent cell. However, for arbitrary continuous dynamics, this behavior may result in trajectories that bear little resemblance to what is possible under the true dynamics. For example, consider that in the pendulum system, allowing a transition to an adjacent cell with a larger angle is nonsensical if the current velocity is a large negative value, regardless of the control input.

To resolve this problem, we select a constant time interval $\Delta t$ that represents the amount of time that passes between state transitions in the tabular model. For each discrete $(s, a)$, we use an ODE solver to determine the trajectory that would result from starting at the center of state cell $s$, $x_s$, and applying a constant control input of $u_a$ for $\Delta t$ time. We can then determine which state cell the agent would land in at the end of this trajectory segment, which becomes the successor state $s'$. While this is sufficient for determining appropriate discrete transitions, the start and successor cells alone do not tell us which state-based constraints may have been violated while taking a particular transition. Therefore, we also keep track of which hypothesized constraints would be violated while executing the continuous trajectory underlying the discrete transition. This ensures that the agent isn’t allowed to “warp through” constraints even when $s$ and $s'$ are not adjacent cells.
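A minimal sketch of this table-building procedure is given below. It assumes the pendulum sign convention with the angle measured from the hanging position, a 10-by-10 grid that doubles as the constraint hypothesis set, a hand-written Runge-Kutta integrator in place of a library ODE solver, and a policy of dropping transitions that leave the velocity bounds. All names and bounds are illustrative.

```python
import math

N = 10                                   # cells per dimension (assumed)
TH_LO, TH_HI = -math.pi, math.pi         # assumed angle bounds
OM_LO, OM_HI = -6.0, 6.0                 # assumed velocity bounds


def deriv(x, u):
    """Pendulum dynamics with unit gravity and length (assumed sign convention)."""
    return (x[1], -math.sin(x[0]) + u)


def rk4(x, u, h):
    """One classical Runge-Kutta step with the control held constant."""
    def shifted(k, scale):
        return (x[0] + scale * k[0], x[1] + scale * k[1])

    k1 = deriv(x, u)
    k2 = deriv(shifted(k1, h / 2), u)
    k3 = deriv(shifted(k2, h / 2), u)
    k4 = deriv(shifted(k3, h), u)
    return tuple(x[d] + h / 6 * (k1[d] + 2 * k2[d] + 2 * k3[d] + k4[d])
                 for d in range(2))


def cell_of(x):
    """Grid cell containing state x (angle wrapped), or None if out of bounds."""
    if not OM_LO <= x[1] <= OM_HI:
        return None
    th = (x[0] - TH_LO) % (TH_HI - TH_LO)
    i = min(int(th / (TH_HI - TH_LO) * N), N - 1)
    j = min(int((x[1] - OM_LO) / (OM_HI - OM_LO) * N), N - 1)
    return (i, j)


def center_of(cell):
    """Continuous state at the center of a grid cell."""
    i, j = cell
    return (TH_LO + (i + 0.5) * (TH_HI - TH_LO) / N,
            OM_LO + (j + 0.5) * (OM_HI - OM_LO) / N)


def build_transitions(actions, dt, substeps=20):
    """Return {(cell, action_index): (successor_cell, crossed_cells)}.

    crossed_cells records every grid cell touched while integrating from the
    cell center, so constraints cannot be skipped over between start and end.
    """
    table = {}
    for i in range(N):
        for j in range(N):
            for k, u in enumerate(actions):
                x, crossed, valid = center_of((i, j)), {(i, j)}, True
                for _ in range(substeps):
                    x = rk4(x, u, dt / substeps)
                    cell = cell_of(x)
                    if cell is None:      # left the bounded region: drop it
                        valid = False
                        break
                    crossed.add(cell)
                if valid:
                    table[(i, j), k] = (cell_of(x), frozenset(crossed))
    return table
```

Checking `crossed_cells` against a hypothesized constraint, rather than only the start and successor cells, is what blocks transitions that would pass through a constrained region. The number of substeps trades off build time against the chance of missing a briefly visited cell.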

Finally, we assume that the ground-truth reward for an entire trajectory can be expressed as $\int_0^T r(x(\tau), u(\tau)) \, d\tau$ for some function $r$ of the continuous state and control input. We estimate the tabular reward function as $R(\xi) = \sum_{t} r(x_{s_t}, u_{a_t}) \, \Delta t$, where the sequence $(s_t, a_t)$ is the sequence of discrete state-action pairs over the course of a trajectory on the tabular MDP.

It is worth noting that trajectories allowable under the tabular MDP described above are not necessarily feasible or safe under the true continuous dynamics. For example, starting from different points within the same cell might result in slightly different constraint violations, while we only track violations that result from starting in the center of each cell. This is acceptable for our application because we are trying to obtain estimates of general behavior that enable reasonable likelihood-based constraint inference. Similarly, there is no well-defined mapping from a particular continuous trajectory to a feasible discrete state-action sequence under the approximate tabular dynamics. Since we only handle state-based constraints, it is sufficient to determine which possible constraints a demonstration violates without trying to construct a discrete version of the trajectory. This can be done by sampling points along the trajectory to determine which constraint regions it passes through.
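The point-sampling check for demonstrations can be sketched as below, assuming each hypothesized constraint is an axis-aligned box in the (angle, angular velocity) plane and the demonstration is available as a list of sampled states. The function name and box representation are illustrative assumptions.

```python
def violated_regions(trajectory, regions):
    """Indices of the boxes in `regions` touched by any sampled trajectory point.

    trajectory -- iterable of (theta, omega) samples along the demonstration
    regions    -- list of ((theta_lo, theta_hi), (omega_lo, omega_hi)) boxes,
                  treated as half-open intervals so adjacent boxes do not overlap
    """
    hits = set()
    for theta, omega in trajectory:
        for idx, ((t_lo, t_hi), (o_lo, o_hi)) in enumerate(regions):
            if t_lo <= theta < t_hi and o_lo <= omega < o_hi:
                hits.add(idx)
    return hits
```

Any hypothesis whose index appears in the returned set is infeasible for that demonstration. If the samples are too sparse, a trajectory can clip the corner of a region between samples without being detected, so the sampling interval should be small relative to the time needed to cross a region.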

The primary hyperparameters that determine the final tabular MDP are the number of cells to use for each state dimension, the number of actions, and the transition time step $\Delta t$. These parameters can be tuned using domain knowledge or by running simulated experiments with known constraints to determine which model obtains the best performance. An example of these experiments and the resulting constraint inference performance for the pendulum system is described in the following section. Once an appropriate model has been selected, it can be used with any combination of reward function and demonstration set. Additionally, if the objective of the demonstrators is known in advance, MLCI can be performed on the appropriately initialized discrete MDP as a pre-computation step, and constraints can be inferred online with very little additional computation.

V Analysis On Pendulum System

After following the procedure outlined in section IV for the pendulum system, we now have a tabular MDP representation that can be used with MLCI as described in section III. We next turn to analyzing the behavior of this approximate MDP. For our experiments, we tested two possible ground-truth constraints, each of which prohibits a range of angles while the angular velocity lies within a given range. Both ground-truth constraints are aligned with the constraint hypothesis set. The constraint hypothesis space is illustrated in Fig. 2. For each ground truth constraint, we randomly sampled 100 pairs of start and end states from the safe set (defined in equation (10) below) for agents to satisfy while optimizing the objective in equation (8). Some of these start-end state pairs were ill-posed since the pendulum could not travel between them in the fixed 5 second time horizon provided. After removing these configurations, a smaller set of demonstration trajectories remained.

Fig. 2: We test our constraint inference on a fixed-base pendulum model. The constraint hypothesis space evenly divides the state space into 100 cells, 10 along the angle axis and 10 along angular velocity. The two constraints used in our experiments are shown here in different shades of red. The constraints cover different angle regions but the same angular velocity.

V-A Accuracy Of Tabular MDP Dynamics

We first examine how accurately the tabular MDP recovers the true continuous dynamics under goal-directed behavior induced by the objective function. For each ground-truth constraint and random start-goal pair, we initialized the MDP while incorporating the true constraint into the MDP dynamics (i.e., actions that would result in violating the true constraint were not allowed). We then performed a Bellman backup to determine the distribution of soft optimal policies on the tabular MDP. Intuitively, if the MDP perfectly describes the true continuous dynamics, we expect that running a simulation with the ground-truth dynamics while taking the sequence of actions determined by one of these policies will cause the agent to land exactly at the goal state. Following this intuition, we sampled and executed a random policy from each MDP and measured the normalized Euclidean distance between the final state and goal state. As shown in Fig. 3, increasing the number of state cells reduces the “round-off” error associated with each discrete state transition and results in a final state that is closer to the intended goal. Since the objective function specifies a fixed time horizon, increasing $\Delta t$ decreases the number of transitions over the course of a trajectory and therefore reduces final state error as well.

Fig. 3: Dividing the state space into larger numbers of smaller cells in the MDP allows for a more accurate approximation of the continuous dynamics when executing optimal policies over 5s. A smaller $\Delta t$ means that a larger number of discrete transitions are taken over the same time horizon, leading to decreased accuracy, especially for a coarser discretized state space. Pink circles: 100 state cells, green crosses: 400, yellow squares: 900, blue triangles: 1600. Note that there is decreasing benefit to further refining the state space grid above 900 cells.

V-B Generating Simulated Expert Demonstrations

To understand the accuracy of constraint inference with the tabular MDP, we first need expert demonstrations that follow the ground-truth dynamics. For each ground-truth constraint, 100 random pairs of states were sampled to serve as the start and goal points for independent demonstrations. These expert continuous demonstrations were synthesized using a second-order descent method with a simulation time step much finer than the $\Delta t$ used in the tabular MDP. The demonstrations are optimized using a Gauss-Newton-style descent method known as the Iterative Linear-Quadratic Regulator (iLQR) [6]. The optimization is halted after ten iterations. For each start-goal pair, the best of three optimizations is picked (each with randomly sampled initial controls) to reject optimizations that get stuck in local minima. The optimizations that could not reach their goal were filtered out, reducing the size of the dataset.

The state constraints are blocked out as rectangular polytope constraints in the continuous state-space. They are enforced using an interior-point method that supersedes any controls (as in [3]) that would reach the constrained states. This backwards-reachable set that forms the barrier-certificate [15] is computed via a Hamilton-Jacobi Isaacs Partial Differential Equation [11]. For a continuous dynamic $\dot{x} = f(x, u)$, the robust backwards reachable set of the constraint region $C$ can be computed as the sub-zero level set of the value function $V$ that solves, backward in time over the horizon $[-T, 0]$:

$$\frac{\partial V}{\partial t}(x, t) + \min\left\{0, \ \max_{u} \nabla_x V(x, t)^\top f(x, u)\right\} = 0$$

where $V$ is initialized to the signed distance $d_C$ from $C$:

$$V(x, 0) = d_C(x)$$

Let $X_{\text{safe}}$ be the complement of this backwards reachable set:

$$X_{\text{safe}} = \{x : V(x, -T) \geq 0\} \tag{10}$$

As the complement of the reachable set, $X_{\text{safe}}$ is the set from which there is a way to avoid the keepout set $C$. Since there exists an avoidant strategy, this is a control-invariant set. So long as the system is initialized within $X_{\text{safe}}$ it is possible to remain safe. Furthermore, any controls can be taken up to crossing the border from $X_{\text{safe}}$ into the backwards reachable set. At this point, the maximally safe action must be taken. This is the safety strategy advanced in [3].

This safety strategy ensures the system will stay on the interior of the feasible region. Because it intervenes only when absolutely necessary (i.e. when the system is about to cross out of the safe set), this intervention is also the least restrictive. It will not eliminate any trajectories that weren’t already infeasible. Therefore the set of feasible solutions remains unchanged after instituting these dynamics. The optimal trajectory of the non-intervened dynamics will be the same as the optimal trajectory on the intervened dynamics.

This constraint-enforcing switching control is non-differentiable, so derivative-based optimizations on the controls cannot be used. Fortunately, new relaxations of switched dynamics [20] can substitute a relaxed problem whose solutions will converge to the true unrelaxed solution as the relaxation is tightened.

V-C Constraint Inference Performance

Accurate constraint inference relies on a close match between expert demonstrations and soft-optimal trajectories on the tabular MDP that incorporates the ground-truth constraint. For the purposes of constraint inference, two trajectories are equivalent if they violate the same constraints in the constraint hypothesis set. Therefore, we next examine the difference between the expected constraint violation under the tabular MDP and the actual constraints violated by independent continuous demonstrations. Over all of the possible constraints, this difference can be expressed as

$$\sum_{C} \left| F_C - \frac{1}{N} \sum_{i=1}^{N} \phi_C(\xi_i) \right| \tag{11}$$

where $\phi_C(\xi_i)$ is an indicator for whether demonstration $\xi_i$ violates constraint $C$. If the approximate MDP perfectly tracks the true constraint violation distribution and demonstrations are distributed according to soft optimality, we expect the quantity in equation (11) to go asymptotically to 0 as the number of demonstrations increases. Results for different model hyperparameters and a single demonstration (averaged over 65 trials and the two alternative ground-truth constraints) are shown in Fig. 4. Increasing the number of state space grid cells from 100 to 400 lowers constraint violation error, but increasing the number of states in the tabular MDP beyond this point does not have much effect. This suggests that the greater accuracy of the approximate MDP for larger numbers of discrete states does not necessarily translate into improved constraint inference. Error is stable across different values of the discrete time interval $\Delta t$.

Fig. 4: Accuracy in estimating constraint violation for random combinations of start and goal is relatively constant over various hyperparameter values, although there is a small benefit for increasing the state space grid size from 100 cells (pink circles) to 400 cells (green crosses). The metric used here is quantified in equation (11). Pink circles: 100 state cells, green crosses: 400, yellow squares: 900, blue triangles: 1600.

We see a very similar trend when examining the performance of constraint inference across MDPs generated with different hyperparameters, as can be seen in Fig. 5. After choosing an appropriate $\Delta t$, tabular MDPs with 100 to 1600 states are able to successfully identify the true constraint as one of the top-5 likeliest constraints after 9 demonstrations. Increasing the number of states to at least 400 stabilizes performance across different choices of $\Delta t$. Even though the approximate MDPs do not capture the true continuous dynamics with high fidelity, especially for the coarsest state-space grid, constraint inference still works well and is robust to a range of hyperparameters.

Fig. 5: The true constraint is ranked as one of the most likely possibilities after just a few demonstrations across many hyperparameter choices. For the 10x10 state space grid (pink circles), a small $\Delta t$ gives poor performance because it often isn’t possible for the agent to reach a new cell within this time frame, so most transitions result in the tabular agent erroneously staying in the same location. Overall, there is a small benefit for increasing from 100 state cells to 400 cells (green crosses). A 1600 state cell version performs equivalently to 900 cells (yellow squares) but is omitted for clarity. Pink circles: 100 state cells, green crosses: 400, yellow squares: 900.

In addition to these average trends, we can qualitatively examine the approximation quality by sampling a trajectory from the discrete MDP and comparing it to the original continuous demonstration. An example of this for a single trial is shown in Fig. 6.

Fig. 6: We can directly compare the approximated MDP to individual demonstrated trajectories by sampling a discrete trajectory from the MDP when the true constraint is known. From left to right, a trajectory sampled from an MDP with 100, 900, and 1600 states, respectively. The blue line is the demonstration, which is attempting to get from the start (cyan box) to the goal (green box) while avoiding the true constraint (red box). The states in the discrete trajectory are shown with shaded pink boxes. Possible constraints violated by the demonstration are shown with blue diagonal lines going from top left to bottom right, while possible constraints violated by the discrete trajectory are shown with pink diagonal lines going from bottom left to top right. For the finer grid state spaces in the center and right panels, the states visited by the discrete trajectory are smaller than the constraint regions. For this example, the trajectory from the 900-state MDP happens to most closely match the demonstration.

For all of the analyses described above, we also varied the number of discrete actions in the approximate MDP but found that this made little difference to any of the measures we examined. A larger number of actions allows the discrete agent more possible routes to the goal, but it may be that these routes do not change constraint violation behavior in expectation across the soft-optimal policy distribution. Fig. 3 through Fig. 6 show results using 9 actions evenly spaced across the bounded control input range.

V-D Confidence In Found Constraints

In addition to identifying the most likely constraints influencing agent behavior, it is desirable to calculate the probability of there being a constraint at all. First, consider the simple case where we assume that there is at most one constraint, and if there is one, it is the most likely one identified via MLCI. Let $A$ be the event that this is truly a constraint, and $B$ be the event that $N$ independent trajectories do not violate this constraint. We would like to calculate $P(A \mid B)$. We know that $P(B \mid A) = 1$ since no demonstrations may violate a constraint, and that $P(B \mid \neg A) = (1 - F_C)^N$ (i.e. the probability of the demonstrations not violating this constraint by coincidence, even if the agent isn’t really subject to it), which we obtain from the MLCI algorithm. We can therefore use Bayes’ Rule to obtain the following formula:

$$P(A \mid B) = \frac{P(A)}{P(A) + (1 - F_C)^N \, (1 - P(A))}$$

where $P(A)$ is a prior on the probability of the constraint being present. This simple formula introduces no additional approximation error beyond what is already present in the model under the assumptions described above, and presents an important advantage of the MLCI approach to constraint inference over previous approaches that cannot provide confidence estimates of found constraints. Unfortunately, relaxing the assumptions on possible constraints and calculating probabilities of all possible constraints quickly becomes computationally intractable. Providing estimates of these probabilities is left for future work.
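The Bayesian update described in this section reduces to a few lines of arithmetic. The sketch below assumes a single candidate constraint, a known prior, and a per-trajectory violation probability under the unconstrained model (the quantity obtained from MLCI); the function and argument names are illustrative.

```python
def constraint_posterior(prior, violation_prob, n_demos):
    """Posterior probability that the candidate constraint is real.

    prior          -- prior probability that the constraint is present (0 < prior <= 1)
    violation_prob -- probability that one trajectory from the unconstrained
                      model violates the candidate constraint
    n_demos        -- number of independent demonstrations, none of which
                      violate the candidate constraint
    """
    # Chance that all demonstrations avoid the region purely by coincidence.
    avoid_by_chance = (1.0 - violation_prob) ** n_demos
    # Bayes' rule with P(no violations | constraint present) = 1.
    return prior / (prior + (1.0 - prior) * avoid_by_chance)
```

For instance, with an even prior and a violation probability of 0.5, a single non-violating demonstration raises the posterior to 2/3. If the unconstrained model would never enter the region (violation probability 0), the demonstrations carry no information and the posterior equals the prior.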

VI Potential Application: Sit-to-Stand and Lower Back Pain

The robustness to hyperparameter selection, the low number of required demonstrations, and the ability to provide a confidence estimate for the identified constraints support the use of the MLCI approach to identify patient-specific impairments from observed motion. One potential application is the analysis of individuals with low back pain (LBP).

LBP affects 70-90% of adults during their lifetime and can be extremely debilitating [18]. However, it is often difficult to determine the source of the pain and therefore prescribe an appropriate treatment. Disorders of the lower spine, hip, and pelvic region can all cause LBP [16]. Treating the wrong problem may result in an unnecessary surgery that doesn’t resolve the patient’s LBP. When a treatment plan that addresses the physical cause of the pain can’t be identified, patients may be prescribed opioids for chronic pain management, even though these are ineffective and can lead to abuse and addiction [9]. Therefore, there is a pressing clinical need to develop better methods for understanding the source of LBP.

There is a recent body of literature suggesting that LBP may be linked to irregularities in movement patterns. For example, inappropriate amounts of pelvic movement during various motions appear to contribute to LBP [8]. This pelvic movement may be compensatory for a limited range of motion in other joints, that is, for constraints on the achievable joint angles. We would also expect the resulting movement patterns to avoid regions of the biomechanical state space associated with pain. Identifying both physical and pain-related constraints on movement could therefore lead us to a better understanding of the underlying cause of the LBP. For this reason, we would like to infer the most likely constraints a person is acting under when observing their movements. A particularly promising movement pattern for demonstrations is the sit-to-stand trajectory, which exerts significant strain on several joints implicated in LBP [4]. A telescoping inverted pendulum system has been used to model this movement, which reduces the problem to 4 dimensions while allowing for clinically relevant discovery [13].

VI-A Constraint Inference on Telescoping Inverted Pendulum

Following the above motivation, we next demonstrate successful constraint inference on a telescoping inverted pendulum (TIP) model. The dynamics for this model are as follows:

These dynamics omit the cross-coupling term between angular acceleration and linear velocity for simplicity. For this experiment, we chose the goal set as the set of all states within a certain range of pendulum length and angle, leaving velocity as a free parameter. The objective is to reach the goal set at = 5s while minimizing . The constraint hypothesis space is a 10x10 evenly spaced grid along the angle and length dimensions, so that if a particular (angle, length) combination is constrained, the agent is not allowed to enter that combination at any velocity. We generated 5 demonstrations with random start and goal states following the same procedure as in section V-B. We then formulated a tabular MDP with 2500 states (10 cells each for angle and length, and 5 cells each for angular and linear velocity) and 15 actions (5 discrete torque options and 3 discrete linear force options) and performed constraint inference on the demonstrations. The ground truth constraint and top 2 likeliest inferred constraints are shown in Fig. 7. Despite the coarseness of the tabular state-action space and a mismatch between the constraint hypothesis space and the true constraint region, MLCI correctly identifies the ground truth constraint and takes about 5 minutes with no optimization effort on a single CPU core. If the start and goal states of demonstrations are known in advance, as is likely to be the case in a clinical test, this computation can be done ahead of time and inferring constraints after observing the actual demonstration trajectories is virtually instantaneous.
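
To make the bookkeeping of this tabular space concrete, the following Python sketch builds the index structure described above: 10 × 10 × 5 × 5 = 2500 states, 5 × 3 = 15 actions, and an (angle, length) constraint cell that covers every velocity combination. It is only an illustration: the state bounds `LOWER` and `UPPER`, the helper names `state_to_index` and `constraint_states`, and the simple cell-containment lookup on a uniform grid are all assumptions, not the authors' implementation.

```python
import numpy as np

# Cells per dimension, as described in the text:
# (angle, length, angular velocity, linear velocity).
GRID_SHAPE = (10, 10, 5, 5)
N_TORQUES, N_FORCES = 5, 3

# Hypothetical state bounds, chosen only for illustration.
LOWER = np.array([-1.0, 0.5, -2.0, -1.0])
UPPER = np.array([1.0, 1.5, 2.0, 1.0])


def state_to_index(state):
    """Map a continuous state to the flat index of the grid cell containing it."""
    frac = (np.asarray(state, dtype=float) - LOWER) / (UPPER - LOWER)
    cell = np.floor(frac * np.array(GRID_SHAPE)).astype(int)
    # Clamp states that lie on or outside the grid boundary into the edge cells.
    cell = np.clip(cell, 0, np.array(GRID_SHAPE) - 1)
    return int(np.ravel_multi_index(tuple(cell), GRID_SHAPE))


def constraint_states(angle_cell, length_cell):
    """Flat indices of all tabular states covered by one (angle, length) cell,
    i.e. that position cell at every angular and linear velocity."""
    all_indices = np.arange(np.prod(GRID_SHAPE)).reshape(GRID_SHAPE)
    return set(all_indices[angle_cell, length_cell].ravel().tolist())


n_states = int(np.prod(GRID_SHAPE))   # 2500
n_actions = N_TORQUES * N_FORCES      # 15

blocked = constraint_states(5, 5)     # 25 tabular states share this position cell
print(n_states, n_actions, len(blocked))
print(state_to_index([0.05, 1.03, 0.1, 0.1]) in blocked)
```

Because each hypothesis cell ignores velocity, one candidate constraint removes 25 tabular states at once, which is why the hypothesis space has 100 entries even though the MDP has 2500 states.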

Fig. 7: Ground truth and inferred constraints for a 4-dimensional telescopic inverted pendulum (TIP) model. The constraint hypothesis space is 100 evenly spaced cells in length and angle. The ground truth constraint, which does not align exactly with the constraint hypothesis set, is shown in red. After 5 demonstrations, the top 2 most likely constraints, shown as cells with dark black outlines, coincide very well with the true constraint. Two of the demonstrations used in inference, along with the constraints they violated (shaded cells), are shown in blue and purple. The tabular MDP must keep track of the linear and angular velocity dimensions as well as the position dimensions shown here.

VII Conclusion

We have presented a methodology for forming a tabular MDP approximation of continuous dynamics that can be used for maximum likelihood constraint inference. Although the approximation introduces some error into the estimation, constraint inference works well with pendulum dynamics over a range of hyperparameters, including a small discrete state space. The present approach allows for ranking possible constraints by their likelihood, which is especially useful in applications with significant uncertainty. It also uses the maximum entropy framework, which may be an especially good fit for human demonstrators, who tend to act sub-optimally. Future work should characterize the kinds of dynamics for which this approach works well and whether techniques such as variable grid size may allow for higher accuracy and increased computational efficiency.


  • [1] F. Balagué, A. F. Mannion, F. Pellisé, and C. Cedraschi (2012-02) Non-specific low back pain. The Lancet 379 (9814), pp. 482–491. ISSN 0140-6736. Cited by: §I.
  • [2] G. Chou, N. Ozay, and D. Berenson (2020-04) Learning Constraints From Locally-Optimal Demonstrations Under Cost Function Uncertainty. IEEE Robotics and Automation Letters 5 (2), pp. 3682–3690. ISSN 2377-3766. Cited by: §II.
  • [3] G. M. Hoffmann and C. J. Tomlin (2008-12) Decentralized cooperative collision avoidance for acceleration constrained vehicles. In 2008 47th IEEE Conference on Decision and Control, pp. 4357–4363. ISSN 0191-2216. Cited by: §V-B.
  • [4] M. A. Hughes, D. K. Weiner, M. L. Schenkman, R. M. Long, and S. A. Studenski (1994-05) Chair rise strategies in the elderly. Clinical Biomechanics 9 (3), pp. 187–192. ISSN 0268-0033. Cited by: §VI.
  • [5] C. Li and D. Berenson (2017) Learning Object Orientation Constraints and Guiding Constraints for Narrow Passages from One Demonstration. In 2016 International Symposium on Experimental Robotics, D. Kulić, Y. Nakamura, O. Khatib, and G. Venture (Eds.), Springer Proceedings in Advanced Robotics, Cham, pp. 197–210. ISBN 978-3-319-50115-4. Cited by: §II.
  • [6] W. Li and E. Todorov (2004) Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems. In Proceedings of the First International Conference on Informatics in Control, Automation and Robotics, Setúbal, Portugal, pp. 222–229. ISBN 978-972-8865-12-2. Cited by: §V-B.
  • [7] H. Lin, M. Howard, and S. Vijayakumar (2015-05) Learning null space projections. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 2613–2619. ISSN 1050-4729. Cited by: §II.
  • [8] M. Sadeghisani, F. D. Manshadi, K. K. Kalantari, A. Rahimi, N. Namnik, M. T. Karimi, and A. E. Oskouei (2015-10) Correlation between Hip Rotation Range-of-Motion Impairment and Low Back Pain. A Literature Review. Ortopedia, Traumatologia, Rehabilitacja 17 (5), pp. 455–462. ISSN 1509-3492, 2084-4336. Cited by: §VI.
  • [9] B. A. Martell, P. G. O’Connor, R. D. Kerns, W. C. Becker, K. H. Morales, T. R. Kosten, and D. A. Fiellin (2007-01) Systematic Review: Opioid Treatment for Chronic Back Pain: Prevalence, Efficacy, and Association with Addiction. Annals of Internal Medicine 146 (2), pp. 116. ISSN 0003-4819. Cited by: §VI.
  • [10] N. Mehr, R. Horowitz, and A. D. Dragan (2016-12) Inferring and assisting with constraints in shared autonomy. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 6689–6696. Cited by: §II.
  • [11] I. Mitchell (2007) A Toolbox of Level Set Methods. UBC Department of Computer Science Technical Report TR-2007-11, pp. 31. Cited by: §V-B.
  • [12] L. Pais, K. Umezawa, Y. Nakamura, and A. Billard (2013) Learning Robot Skills Through Motion Segmentation and Constraints Extraction. HRI Workshop on Collaborative Manipulation, pp. 5. Cited by: §II.
  • [13] E. Papa and A. Cappozzo (1999-11) A telescopic inverted-pendulum model of the musculo-skeletal system and its use for the analysis of the sit-to-stand motor task. Journal of Biomechanics 32 (11), pp. 1205–1212. ISSN 0021-9290. Cited by: §I, §VI.
  • [14] E. Papi, A. M. J. Bull, and A. H. McGregor (2018-06) Is there evidence to use kinematic/kinetic measures clinically in low back pain patients? A systematic review. Clinical Biomechanics 55, pp. 53–64. ISSN 0268-0033. Cited by: §I.
  • [15] S. Prajna and A. Jadbabaie (2004) Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, R. Alur and G. J. Pappas (Eds.), Lecture Notes in Computer Science, Berlin, Heidelberg, pp. 477–492. ISBN 978-3-540-24743-2. Cited by: §V-B.
  • [16] H. Prather and L. v. Dillen (2019) Links between the Hip and the Lumbar Spine (Hip Spine Syndrome) as they Relate to Clinical Decision Making for Patients with Lumbopelvic Pain. PM&R 11 (S1), pp. S64–S72. https://onlinelibrary.wiley.com/doi/pdf/10.1002/pmrj.12187. ISSN 1934-1563. Cited by: §VI.
  • [17] D. R. R. Scobee and S. S. Sastry (2019-09) Maximum Likelihood Constraint Inference for Inverse Reinforcement Learning. arXiv:1909.05477 [cs, eess, stat]. Cited by: §I, §II, §III-B, §III-C, §III.
  • [18] V. Thiruganasambandamoorthy, E. Turko, D. Ansell, A. Vaidyanathan, G. A. Wells, and I. G. Stiell (2014-07) Risk Factors for Serious Underlying Pathology in Adult Emergency Department Nontraumatic Low Back Pain Patients. The Journal of Emergency Medicine 47 (1), pp. 1–11. ISSN 0736-4679. Cited by: §VI.
  • [19] M. Vazquez-Chanlatte, S. Jha, A. Tiwari, M. K. Ho, and S. Seshia (2018) Learning Task Specifications from Demonstrations. Advances in Neural Information Processing Systems 31, pp. 5367–5377. Cited by: §II.
  • [20] T. Westenbroek, H. Gonzalez, and S. S. Sastry (2018-12) A New Solution Concept and Family of Relaxations for Hybrid Dynamical Systems. In 2018 IEEE Conference on Decision and Control (CDC), pp. 743–750. ISSN 2576-2370. Cited by: §V-B.
  • [21] B. D. Ziebart, J. A. Bagnell, and A. K. Dey (2010) Modeling Interaction via the Principle of Maximum Causal Entropy. In International Conference on Machine Learning, pp. 8. Cited by: §III-B, §III-C.