RAT iLQR: A Risk Auto-Tuning Controller to Optimally Account for Stochastic Model Mismatch

10/16/2020 ∙ by Haruki Nishimura, et al. ∙ Toyota Research Institute, University of Illinois at Urbana-Champaign, Stanford University

Successful robotic operation in stochastic environments relies on accurate characterization of the underlying probability distributions, yet this is often imperfect due to limited knowledge. This work presents a control algorithm that is capable of handling such distributional mismatches. Specifically, we propose a novel nonlinear MPC for distributionally robust control, which plans locally optimal feedback policies against a worst-case distribution within a given KL divergence bound from a Gaussian distribution. Leveraging the mathematical equivalence between distributionally robust control and risk-sensitive optimal control, our framework also provides an algorithm to dynamically adjust the risk-sensitivity level online for risk-sensitive control. The benefits of the distributional robustness as well as the automatic risk-sensitivity adjustment are demonstrated in a dynamic collision avoidance scenario where the predictive distribution of human motion is erroneous.

I Introduction

Proper modeling of a stochastic system of interest is a key step towards successful control and decision making under uncertainty. In particular, accurate characterization of the underlying probability distribution is crucial, as it encodes how we expect the system to behave unexpectedly over time. However, such a modeling process can pose significant challenges in real-world problems. On the one hand, we may have only limited knowledge of the underlying system, which would force us to use an erroneous model. On the other hand, even if we can perfectly model a complicated stochastic phenomenon, such as a complex multi-modal distribution, the resulting model may still be unsuitable for real-time control or planning. Indeed, many model-based stochastic control methods require a Gaussian noise assumption, and many others need computationally intensive sampling. The present work addresses this problem via distributionally robust control, wherein a potential distributional mismatch between a baseline Gaussian process noise and the true, unknown model is considered within a certain Kullback-Leibler (KL) divergence bound. Using a Gaussian baseline retains computational tractability without the need for sampling in the state space.

Our contribution is a novel model predictive control (MPC) method for nonlinear, non-Gaussian systems with non-convex costs. This controller would be useful, for example, to safely navigate a robot among human pedestrians when the stochastic transition model for humans is imperfect.

It is important to note that our contribution is built on the mathematical equivalence between distributionally robust control and risk-sensitive optimal control [22]. Unlike conventional stochastic optimal control, which is concerned with the expected cost, risk-sensitive optimal control seeks to optimize the following entropic risk measure [18]:

$$\mathcal{R}_{p,\theta}(J) \triangleq \frac{1}{\theta} \log \mathbb{E}_{p}\left[\exp(\theta J)\right] \tag{1}$$

where $p$ is a probability distribution characterizing any source of randomness in the system, $\theta > 0$ is a user-defined scalar parameter called the risk-sensitivity parameter, and $J$ is an optimal control cost. The risk-sensitivity parameter $\theta$ determines a relative weight between the expected cost and other higher-order moments such as the variance [36]. Loosely speaking, the larger $\theta$ becomes, the more the objective cares about the variance and the more risk-sensitive it is.
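To make (1) concrete, here is a small Python sketch, not taken from the paper, that estimates the entropic risk from Monte Carlo samples of the cost, using a log-sum-exp shift for numerical stability. For a Gaussian cost $J \sim \mathcal{N}(m, s^2)$ the risk has the closed form $m + \theta s^2/2$, and for a general cost $\mathcal{R}_{p,\theta}(J) = \mathbb{E}_p[J] + \frac{\theta}{2}\mathrm{Var}_p[J] + O(\theta^2)$ as $\theta \to 0$, which is the variance weighting described above. The function name `entropic_risk` and the sample sizes are illustrative assumptions.

```python
import math
import random


def entropic_risk(costs, theta):
    """Monte Carlo estimate of (1/theta) * log E[exp(theta * J)] from cost samples."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    scaled = [theta * c for c in costs]
    shift = max(scaled)  # log-sum-exp shift keeps exp() from overflowing
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / theta


if __name__ == "__main__":
    rng = random.Random(0)
    m, s = 1.0, 1.0  # Gaussian cost J ~ N(m, s^2)
    samples = [rng.gauss(m, s) for _ in range(100_000)]
    for theta in (0.1, 0.5, 1.0):
        closed_form = m + 0.5 * theta * s**2
        print(f"theta={theta}: MC={entropic_risk(samples, theta):.3f}, exact={closed_form:.3f}")
```

The Monte Carlo estimates should track the closed-form values, and increasing $\theta$ pushes the risk further above the sample mean, consistent with a more risk-sensitive objective.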

Fig. 1: Model-based stochastic control methods often require a Gaussian noise assumption, such as the one on the left that represents process noise in pedestrian motion under a collision avoidance scenario (see Section V). However, the true stochastic model can be highly multi-modal and better captured by a more complex distribution, as shown on the right, which we may not know exactly. The proposed MPC effectively handles such a model mismatch without knowledge of the true distribution, except for a bound on the KL divergence between the two.

Our distributionally robust control algorithm can alternatively be viewed as an algorithm for automatic online tuning of the risk-sensitivity parameter in risk-sensitive control. Risk-sensitive optimal control has been shown to be effective in many robotics applications [19, 20, 2, 21]. However, in prior work the user has to specify a fixed risk-sensitivity parameter offline, which requires an extensive trial-and-error process until a desired robot behavior is observed. Furthermore, a risk-sensitivity parameter that works in a certain state can be infeasible in another state, as we will see in Section IV. Ideally, the risk-sensitivity should be adapted online depending on the situation to obtain a desired robot behavior [20, 21], yet this is highly nontrivial, as no simple general relationship is known between the risk-sensitivity parameter and the performance of the robot. Our algorithm addresses this challenge as a secondary contribution. Due to the fundamental equivalence between distributionally robust control and risk-sensitive control, it serves as a nonlinear risk-sensitive controller that can dynamically adjust the risk-sensitivity parameter depending on the state of the robot as well as the surrounding environment.

The rest of the paper is organized as follows. Section II reviews related work in the controls and robotics literature. Section III summarizes the theoretical results originally presented in [22] that connect distributionally robust control to risk-sensitive optimal control. Section IV develops this theory into an algorithm that provides a locally optimal solution for general nonlinear systems with non-convex cost functions, which is a novel contribution of this paper. In Section V, we test our method in a collision avoidance scenario wherein the predictive distribution of pedestrian motion is erroneous, and we further show its benefits as a risk-sensitive optimal controller that automatically adjusts its risk-sensitivity parameter. The paper concludes in Section VI with potential future research directions.

II Related Work

II-A Distributional Robustness and Risk-Sensitivity

Distributionally robust control seeks to optimize control actions against a worst-case distribution within a given set of probability distributions, often called the ambiguity set [32, 26]. There exist various formulations that account for distributional robustness in optimal control. Some works are concerned with minimizing the worst-case expectation of a cost objective [26, 28], while others enforce risk-based or chance constraint satisfaction under a worst-case distribution [10, 32]. The present work belongs to the former class. Existing methods also differ in the formulation of the ambiguity set. Moment-based ambiguity sets require knowledge of the moments of the ground-truth distribution up to a finite order [32, 26], which is often overly conservative [10]. Statistical distance-based ambiguity sets are also gaining attention. The authors of [10] use a Wasserstein metric to define the ambiguity set for motion planning with collision avoidance, but their MPC formulation is not suited for nonlinear systems. Statistical divergences, including the general family of φ-divergences to which the KL divergence belongs, are employed in [28], similar to the present work. However, the ambiguity set considered in [28] is restricted to categorical distributions, while our work requires no assumption on the class of the ground-truth distributions. Furthermore, we make use of risk-sensitive optimal control to obtain planned robot trajectories with feedback, unlike the sampling-based implementation in [28].

Optimization of the entropic risk measure has been an active research topic in the economics and controls literature since the 1970s [14, 35, 36, 8]. The concept of risk-sensitive optimal control has been successfully applied to robotics in various domains, including haptic assistance [19, 20], model-based reinforcement learning (RL) [2], and safe robot navigation [21, 34], to name a few. In all these works, the risk-sensitivity parameter is introduced as a user-specified constant and is found to significantly affect the behavior of the robot. For instance, our prior work on safe robot navigation in human crowds [21] reveals that a robot with higher risk-sensitivity tends to yield more to oncoming human pedestrians. However, how to find a desirable risk-sensitivity parameter remains an open research question; in the robot navigation problem, the robot simply freezes if it is too risk-sensitive when the scene is crowded. As the authors of [20] point out, the robot should adapt its risk-sensitivity level depending on the situation, yet no effective algorithmic framework exists to automate this, due to the issues discussed in Section I. In this work, we provide such an algorithm for nonlinear, non-Gaussian stochastic systems. As mentioned earlier, our approach is built on previously established theoretical results that link risk-sensitive and distributionally robust control [22].

II-B Approximate Methods for Optimal Feedback Control

The theory of optimal control lets us derive an optimal feedback control law via dynamic programming (DP) [4]. For linear systems with additive Gaussian white noise and quadratic cost functions, the exact DP solution is tractable and is known as the Linear-Quadratic-Gaussian (LQG) [1] or Linear-Exponential-Quadratic-Gaussian (LEQG) [35] controller. They differ in that LQG optimizes the expected cost while LEQG optimizes the entropic risk measure, although the two DP recursions are quite similar. However, solving general optimal control problems for nonlinear systems remains a challenge due to the lack of analytical tractability. Hence, approximate local optimization methods have been developed, including Differential Dynamic Programming (DDP) [13], the iterative Linear-Quadratic Regulator (iLQR) [17], and the iterative Linear-Quadratic-Gaussian (iLQG) [30, 29]. While both DDP and iLQR are designed for deterministic systems with quadratic cost functions, iLQG can locally optimize the expected cost objective for Gaussian stochastic systems with non-convex cost functions. Similarly, the iterative Linear-Exponential-Quadratic-Gaussian (iLEQG) algorithm has recently been proposed to locally optimize the entropic risk for Gaussian systems with non-convex costs [9, 34, 23]. Note, however, that these methods are not designed to be robust to the model mismatches that we consider in this paper. In fact, it is known that even LQG does not possess guaranteed robustness [7].

III Problem Statement

III-A Distributionally Robust Optimal Control

Consider the following stochastic nonlinear system:

$$x_{k+1} = f(x_k, u_k) + g(x_k, u_k)\, w_k \tag{2}$$

where $x_k \in \mathbb{R}^n$ denotes the state, $u_k \in \mathbb{R}^m$ the control, and $w_k \in \mathbb{R}^r$ the noise input to the system at time $k$. For some finite horizon $N$, let $w_{0:N} \triangleq (w_0, \dots, w_N)$ denote the joint noise vector with probability distribution $p(w_{0:N})$. This distribution is assumed to be a known Gaussian white noise process, i.e. $w_i$ is independent of $w_j$ for all $i \neq j$, and we call (2) the reference system. Ideally, we would like the model distribution $p$ to perfectly characterize the noise in the dynamical system. In reality, however, the noise may come from a different, more complex distribution that we may not know exactly. Let $v_{0:N} \triangleq (v_0, \dots, v_N)$ denote a perturbed noise vector that is distributed according to $q(v_{0:N})$. We define the following perturbed system that characterizes the true but unknown dynamics:

$$x_{k+1} = f(x_k, u_k) + g(x_k, u_k)\, v_k \tag{3}$$

Note that we make no assumptions on the Gaussianity or whiteness of $q$. One could also attribute it to potentially unmodeled dynamics. The true, unknown probability distribution $q$ is contained in the set $\mathcal{P}$ of all probability distributions on the support $\mathbb{R}^{r(N+1)}$. We assume that $q$ is not "too different" from $p$. This is encoded as the following constraint on the KL divergence between $q$ and $p$:

$$D_{\mathrm{KL}}(q \,\|\, p) \le d \tag{4}$$

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the KL divergence and $d$ is a given constant. Note that $D_{\mathrm{KL}}(q \,\|\, p) \ge 0$ always holds, with equality if and only if $q = p$. The set of all probability distributions $q \in \mathcal{P}$ satisfying (4) is denoted by $\Xi$, which we define as our ambiguity set. Note that $\Xi$ is a convex subset of $\mathcal{P}$ for a fixed $p$. We are interested in controlling the perturbed system (3) with a state feedback controller of the form $u_k = \mathcal{K}(k, x_{0:k})$. The operator $\mathcal{K}$ maps the time index and the observed states into the control space $\mathbb{R}^m$. The class of all such controllers is denoted $\Lambda$. The cost functions considered in this paper are given by

$$J(x_{0:N+1}, u_{0:N}) \triangleq \sum_{k=0}^{N} c(k, x_k, u_k) + h(x_{N+1}) \tag{5}$$

We assume that the above objective satisfies the following non-negativity assumption.

Assumption 1 (Assumption 3.1, [22])

The functions $h$ and $c$ satisfy $h(x) \ge 0$ and $c(k, x, u) \ge 0$ for all $k \in \{0, \dots, N\}$, $x \in \mathbb{R}^n$, and $u \in \mathbb{R}^m$.

Under the dynamics model (3), the cost model (5), and the KL divergence constraint (4) on $q$, we are interested in finding an admissible controller $\mathcal{K} \in \Lambda$ that minimizes the worst-case expected value of the cost objective (5). In other words, we are concerned with the following distributionally robust optimal control problem:

$$\min_{\mathcal{K} \in \Lambda} \, \max_{q \in \Xi} \, \mathbb{E}_q\left[J(x_{0:N+1}, u_{0:N})\right] \tag{6}$$

where $\mathbb{E}_q[\cdot]$ indicates that the expectation is taken with respect to the true, unknown distribution $q$. In this formulation, the robustness arises from the ability of the controller to plan against a worst-case distribution in the ambiguity set $\Xi$.

Remark 1

If the KL divergence bound $d$ is zero, then $q = p$ is necessary. In this degenerate case, (6) reduces to the standard stochastic optimal control problem:

$$\min_{\mathcal{K} \in \Lambda} \, \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right] \tag{7}$$

III-B Equivalent Risk-Sensitive Optimal Control

Unfortunately, the distributionally robust optimal control problem (6) is intractable, as it involves maximization with respect to a probability distribution $q$. To circumvent this, [22] proves that problem (6) is equivalent to a bilevel optimization problem involving risk-sensitive optimal control with respect to the model distribution $p$. We refer the reader to [22] for the derivation and only restate the main results in this section for completeness. Before doing so, we impose an additional assumption on the worst-case expected cost.

Assumption 2 (Assumption 3.2, [22])

For any admissible controller $\mathcal{K} \in \Lambda$, the resulting closed-loop system satisfies

$$\sup_{q \in \mathcal{P}} \, \mathbb{E}_q\left[J(x_{0:N+1}, u_{0:N})\right] = \infty \tag{8}$$

This assumption states that, without the KL divergence constraint, some adversarially-chosen noise could make the expected cost objective arbitrarily large in the worst case. It amounts to a controllability-type assumption with respect to the noise input and an observability-type assumption with respect to the cost objective [22]. Under Assumptions 1 and 2, the following theorem holds.

Theorem 1

Consider the stochastic systems (2), (3) with the KL divergence constraint (4) and the cost model (5). Under Assumptions 1 and 2, the following equivalence holds for the distributionally robust optimal control problem (6):

$$\min_{\mathcal{K} \in \Lambda} \, \max_{q \in \Xi} \, \mathbb{E}_q[J] = \inf_{\tau \in \Gamma} \left( \min_{\mathcal{K} \in \Lambda} \, \tau \log \mathbb{E}_p\left[\exp\left(\frac{J}{\tau}\right)\right] + \tau d \right) \tag{9}$$

provided that the set

$$\Gamma \triangleq \left\{ \tau > 0 : \min_{\mathcal{K} \in \Lambda} \, \tau \log \mathbb{E}_p\left[\exp\left(\frac{J}{\tau}\right)\right] \text{ is finite} \right\} \tag{10}$$

is non-empty.

Proof: See Theorems 3.1 and 3.2 in [22].

Remark 2

Notice that the first term in the right-hand side of (9) is the entropic risk measure $\mathcal{R}_{p,1/\tau}(J)$, where the risk is computed with respect to the model distribution $p$ and $\tau$ serves as the inverse of the risk-sensitivity parameter. Rewriting the equation in terms of the risk-sensitivity parameter $\theta = 1/\tau$, we see that the right-hand side of (9) is equivalent to

$$\inf_{\theta \in \Gamma'} \left( \min_{\mathcal{K} \in \Lambda} \, \frac{1}{\theta} \log \mathbb{E}_p\left[\exp(\theta J)\right] + \frac{d}{\theta} \right) \tag{11}$$

where $\Gamma' \triangleq \{\theta > 0 : 1/\theta \in \Gamma\}$. Theorem 1 shows that the original distributionally robust optimal control problem (6) is mathematically equivalent to the bilevel optimization problem (11) involving risk-sensitive optimal control. Note that the new problem does not involve any optimization with respect to the true distribution $q$.
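The duality behind Theorem 1 can be checked numerically for a fixed controller, i.e. a fixed cost distribution. The Python sketch below is an illustration under simplifying assumptions (a finite set of outcomes with known probabilities $p$ and costs $J$), not the construction used in [22]: it finds $\theta^*$ by bisection so that the exponentially tilted distribution $q_\theta \propto p\,e^{\theta J}$ satisfies $D_{\mathrm{KL}}(q_\theta \,\|\, p) = d$, and then compares the worst-case expectation $\mathbb{E}_{q_{\theta^*}}[J]$ with the outer objective of (11) at $\theta^*$. All function names are hypothetical.

```python
import math


def tilted(p, costs, theta):
    """Exponentially tilted distribution q_theta(i) proportional to p(i) * exp(theta * J_i)."""
    top = max(theta * c for c in costs)
    w = [pi * math.exp(theta * c - top) for pi, c in zip(p, costs)]
    z = sum(w)
    return [wi / z for wi in w]


def kl(q, p):
    """D_KL(q || p) for distributions on the same finite set of outcomes."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def dual_objective(p, costs, theta, d):
    """Outer objective of (11) for a fixed cost distribution: (1/theta) log E_p[e^(theta J)] + d/theta."""
    top = max(theta * c for c in costs)
    log_mgf = top + math.log(sum(pi * math.exp(theta * c - top) for pi, c in zip(p, costs)))
    return (log_mgf + d) / theta


def worst_case(p, costs, d, theta_hi=100.0, iters=200):
    """Bisection for theta* with KL(q_theta || p) = d; this KL is nondecreasing in theta.

    Assumes d < -log p(max-cost outcomes), so that theta* is finite and below theta_hi.
    """
    lo, hi = 1e-9, theta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilted(p, costs, mid), p) < d:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return theta, tilted(p, costs, theta)


if __name__ == "__main__":
    p, costs, d = [0.5, 0.3, 0.2], [0.0, 1.0, 5.0], 0.5
    theta_star, q = worst_case(p, costs, d)
    worst = sum(qi * c for qi, c in zip(q, costs))
    print(theta_star, worst, dual_objective(p, costs, theta_star, d))
```

The derivative of the outer objective with respect to $\theta$ equals $(D_{\mathrm{KL}}(q_\theta \,\|\, p) - d)/\theta^2$, so the bisection point is also the minimizer over $\theta$, which is why the last two printed values coincide.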

IV RAT iLQR Algorithm

Even though the mathematical equivalence shown in [22] and summarized in Section III-B is general, it does not immediately lead to a tractable method to efficiently solve (11) for general nonlinear systems. There are two major challenges to be addressed. First, exact optimization of the entropic risk with a state feedback control law is intractable, except for linear systems with quadratic costs. Second, the optimal risk-sensitivity parameter $\theta$ has to be searched efficiently over the feasible space $\Gamma'$, which is not only unknown but also varies with the initial state $x_0$. A novel contribution of this paper is a tractable algorithm that approximately solves both problems for general nonlinear systems with non-convex cost functions. In what follows, we detail how we solve both the inner and the outer loop of (11) to develop a distributionally robust, risk-sensitive MPC.

IV-A Iterative Linear-Exponential-Quadratic-Gaussian

Let us first consider the inner minimization of (11):

$$\min_{\mathcal{K} \in \Lambda} \, \frac{1}{\theta} \log \mathbb{E}_p\left[\exp(\theta J)\right] \tag{12}$$

where we omitted the extra term $d/\theta$ as it is constant with respect to the controller $\mathcal{K}$. This amounts to solving a risk-sensitive optimal control problem for a nonlinear Gaussian system. Recently, a computationally efficient, local optimization method called iterative Linear-Exponential-Quadratic-Gaussian (iLEQG) has been proposed for both continuous-time systems [9] and their discrete-time counterparts [23, 34]. Both versions locally optimize the entropic risk measure with respect to a receding-horizon, affine feedback control law for general nonlinear systems with non-convex costs. We adopt a variant of the discrete-time iLEQG algorithm [34] to obtain a locally optimal solution to (12). In what follows, we assume for simplicity that the noise coefficient function $g$ in (2) is the identity mapping, but nonlinear functions can be handled in a similar manner, as discussed in [30]. The algorithm starts by applying a given nominal control sequence $\bar{u}^{(0)}_{0:N}$ to the noiseless dynamics to obtain the corresponding nominal state trajectory $\bar{x}^{(0)}_{0:N+1}$. In each iteration, the algorithm maintains and updates a locally optimal controller of the form:

$$u_k = \bar{u}^{(i)}_k + L^{(i)}_k \left(x_k - \bar{x}^{(i)}_k\right) \tag{13}$$

where $L^{(i)}_k$ denotes the feedback gain matrix. The $i$-th iteration of our iLEQG implementation consists of the following four steps:

  1. Local Approximation: Given the nominal trajectory $(\bar{x}^{(i)}_{0:N+1}, \bar{u}^{(i)}_{0:N})$, we compute the following linear approximation of the dynamics as well as the quadratic approximation of the cost functions:

    (14)
    (15)
    (16)
    (17)
    (18)
    (19)
    (20)
    (21)

    for $k = 0$ to $N$, where $D$ is the differentiation operator. The terminal cost $h$ is approximated likewise by its value, gradient, and Hessian at $\bar{x}^{(i)}_{N+1}$.

  2. Backward Pass: We perform approximate DP using the current feedback gain matrices $L^{(i)}_{0:N}$ as well as the approximated model obtained in the previous step. Suppose that the noise vector $w_k$ is Gaussian-distributed with zero mean and covariance $W_k \succ 0$. Starting from terminal conditions given by the terminal cost approximation, we recursively compute the following quantities:

    (22)
    (23)
    (24)
    (25)

    and

    (26)
    (27)
    (28)

    from $k = N$ down to $k = 0$. Note that a matrix in this recursion must be positive definite, and hence invertible, which may not hold if $\theta$ is too large. This is called “neurotic breakdown,” in which the optimizer is so pessimistic that the cost-to-go approximation becomes infinite [36]. Otherwise, the recursion yields the approximate cost-to-go of this optimal control problem under the controller (13).

  3. Regularization and Control Computation: Having derived the DP solution, we compute new control gains $L^{(i+1)}_{0:N}$ and offset updates as follows:

    (29)
    (30)

    where a regularization parameter is added to prevent the matrix being inverted from having negative eigenvalues. We adaptively change this parameter across iterations as suggested in [29], so the algorithm enjoys fast convergence near a local minimum while ensuring the positive-definiteness of that matrix at all times.

  4. Line Search for Ensuring Convergence: It is known that the update could lead to increased cost or even divergence if a new trajectory strays too far from the region where the local approximation is valid [29]. Thus, the new nominal control trajectory $\bar{u}^{(i+1)}_{0:N}$ is computed by backtracking line search with line search parameter $\epsilon$. Starting from an initial value of $\epsilon$, we derive a new candidate nominal trajectory as follows:

    (31)
    (32)

    If this candidate trajectory results in a lower cost-to-go than the current nominal trajectory, it is accepted and returned as $(\bar{x}^{(i+1)}_{0:N+1}, \bar{u}^{(i+1)}_{0:N})$. Otherwise, it is rejected and re-derived with a smaller $\epsilon$ until it is accepted. More details on this line search can be found in [31].
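The line search in step 4 can be sketched as follows. This Python example is a simplified scalar version under stated assumptions: the candidate controls use the standard iLQR/iLQG forward-pass form $u_k = \bar{u}_k + \epsilon\, l_k + L_k(x_k - \bar{x}_k)$ with a hypothetical offset update $l_k$, the step is halved after each rejection, and acceptance is judged by a deterministic rollout cost rather than the risk-sensitive cost-to-go used in the paper.

```python
def backtracking_line_search(f, cost, x0, x_nom, u_nom, l, L,
                             eps=1.0, shrink=0.5, min_eps=1e-6):
    """Backtracking line search on an iLQR/iLQG-style forward rollout (scalar sketch).

    Candidate controls: u_k = u_nom[k] + eps * l[k] + L[k] * (x_k - x_nom[k]).
    Returns (eps, x_new, u_new) for the first eps that lowers the cost, else None.
    """
    base = cost(x_nom, u_nom)
    while eps >= min_eps:
        x_new, u_new = [x0], []
        for k in range(len(u_nom)):
            u = u_nom[k] + eps * l[k] + L[k] * (x_new[k] - x_nom[k])  # feedback on deviation
            u_new.append(u)
            x_new.append(f(x_new[k], u))                               # roll out the dynamics
        if cost(x_new, u_new) < base:
            return eps, x_new, u_new                                   # accept candidate
        eps *= shrink                                                  # reject: shrink the step
    return None


if __name__ == "__main__":
    def toy_dynamics(x, u):
        return x + u

    def toy_cost(xs, us):
        return sum(x * x for x in xs) + sum(u * u for u in us)

    result = backtracking_line_search(toy_dynamics, toy_cost, 1.0, [1.0] * 4, [0.0] * 3,
                                      l=[-3.0] * 3, L=[0.0] * 3)
    print(result)
```

In the toy run the full step overshoots badly, so the search backs off until a sufficiently small $\epsilon$ yields a lower cost than the nominal trajectory.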

The above procedure is iterated until the change in the nominal control, measured in some norm, falls below a threshold. Once converged, the algorithm returns the nominal trajectory $(\bar{x}_{0:N+1}, \bar{u}_{0:N})$ as well as the feedback gains $L_{0:N}$ and the approximate cost-to-go.
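For intuition about the backward pass and neurotic breakdown, the recursion can be written in closed form for a scalar linear system with quadratic cost, where iLEQG reduces to exact LEQG. The Python sketch below assumes $x_{k+1} = a x_k + b u_k + w_k$ with $w_k \sim \mathcal{N}(0, W)$; it is a textbook LEQG recursion rather than the paper's equations (22) to (28), and all names are illustrative. The expectation of $\exp(\theta \cdot \text{cost-to-go})$ is finite only while $W^{-1} - \theta S > 0$, where $S$ is the curvature of the cost-to-go; when this fails, the function returns `None`.

```python
import math


def scalar_leqg(a, b, q, r, q_f, W, theta, N):
    """Exact LEQG backward pass for x_{k+1} = a x_k + b u_k + w_k, w_k ~ N(0, W).

    Cost: sum_k 0.5*(q x_k^2 + r u_k^2) + 0.5*q_f x_N^2, risk parameter theta > 0.
    Returns (gains, S0, c0), where u_k = gains[k] * x_k and the optimal entropic
    cost-to-go from x0 is 0.5 * S0 * x0**2 + c0, or None on neurotic breakdown.
    """
    if theta <= 0 or W <= 0:
        raise ValueError("theta and W must be positive")
    S, c = q_f, 0.0                       # terminal cost-to-go 0.5 * q_f * x^2
    gains = [0.0] * N
    for k in reversed(range(N)):
        margin = 1.0 - theta * W * S      # same sign as W^{-1} - theta * S
        if margin <= 0.0:
            return None                   # E[exp(theta * cost-to-go)] diverges
        S_tilde = S / margin              # risk-adjusted curvature (S^{-1} - theta W)^{-1}
        c -= math.log1p(-theta * W * S) / (2.0 * theta)  # log-det term of the Gaussian integral
        gains[k] = -a * b * S_tilde / (r + b * b * S_tilde)
        S = q + a * a * r * S_tilde / (r + b * b * S_tilde)
    return gains, S, c


if __name__ == "__main__":
    for theta in (1e-9, 0.5, 1.5):
        result = scalar_leqg(1.0, 1.0, 1.0, 1.0, 1.0, 0.5, theta, 20)
        print(theta, None if result is None else result[1])
```

With $a = b = q = r = q_f = 1$ and $W = 0.5$, a tiny $\theta$ recovers the LQG Riccati solution, $\theta = 0.5$ gives a larger curvature $S_0$, and $\theta = 1.5$ breaks down at the second backward step, mirroring how the feasible set $\Gamma'$ is bounded above.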

IV-B Cross-Entropy Method

Having implemented the iLEQG algorithm for the inner-loop optimization of (11), it remains to solve the outer-loop optimization for the optimal risk-sensitivity parameter $\theta^*$. This is a one-dimensional optimization problem in which each function evaluation is done by solving the corresponding risk-sensitive optimal control problem (12). In this work we choose to adapt the cross entropy method [24, 15] to derive an approximately optimal value for $\theta^*$. This method is favorable for online optimization due to its anytime nature and the high parallelizability of its Monte Carlo sampling, but other methods could be used as well. The cross entropy method is a stochastic method that maintains an explicit probability distribution over the design space. At each step, a set of $M$ Monte Carlo samples is drawn from the distribution, out of which a subset of $K_e$ “elite samples” that achieve the best performance is retained. The parameters of the distribution are then updated to the maximum likelihood estimate on the elite samples. The algorithm stops after a desired number of steps $T$.

In our implementation, we model the distribution as a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$. A remaining issue is that iLEQG may return an infinite cost-to-go if a sampled $\theta$ is too large, due to neurotic breakdown. Since our search space is limited to $\Gamma'$, where $\theta$ yields a finite cost-to-go, we have to ensure that each iteration has enough samples in $\Gamma'$. To address this problem, we augment the cross entropy method with rejection and re-sampling. Out of the $M$ samples drawn from the univariate Gaussian, we first discard all non-positive samples. For each of the remaining samples, we evaluate the objective (11) by a call to iLEQG, and then count the number of samples that obtained a finite cost-to-go. Let $K$ be the number of such valid samples. If $K \ge K_e$, we proceed and fit the distribution. Otherwise, we redo the sampling procedure, as there are not sufficiently many valid samples to choose the elites from. In practice, re-sampling is unlikely to occur after the first iteration of the cross entropy method. At the same time, we empirically found that the first iteration risks re-sampling multiple times, which degrades efficiency. We therefore also perform an adaptive initialization of the Gaussian parameters $\mu_{\mathrm{init}}$ and $\sigma_{\mathrm{init}}$ in the first iteration as follows. If the first iteration with $(\mu_{\mathrm{init}}, \sigma_{\mathrm{init}})$ results in re-sampling, we not only re-sample but also halve $\mu_{\mathrm{init}}$ and $\sigma_{\mathrm{init}}$. If all of the samples are valid, on the other hand, we accept them but double $\mu_{\mathrm{init}}$ and $\sigma_{\mathrm{init}}$, since this implies that the initial set of samples is not spread widely enough to cover the whole feasible set $\Gamma'$. The parameters $\mu_{\mathrm{init}}$ and $\sigma_{\mathrm{init}}$ are stored internally in the cross entropy solver and carried over to the next call to the algorithm.
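A minimal Python sketch of this augmented cross entropy search is given below, assuming the iLEQG call is abstracted as a function `objective(theta)` that returns the outer objective of (11), or `math.inf` on neurotic breakdown. The class name, the default sample counts, the small floor on $\sigma$, the cap on re-sampling rounds, and returning the final Gaussian mean as $\theta^*$ are all assumptions of this sketch, not details from the paper.

```python
import math
import random
import statistics


class RiskParamCEM:
    """Cross entropy search over a scalar risk-sensitivity parameter theta > 0.

    Sketch of the rejection / re-sampling and adaptive initialization scheme
    described in the text. Names and default values are hypothetical.
    """

    def __init__(self, mu_init=1.0, sigma_init=1.0, num_samples=20,
                 num_elites=5, num_iters=5, max_resamples=50, seed=0):
        self.mu_init = mu_init          # initial Gaussian, carried over between calls
        self.sigma_init = sigma_init
        self.M = num_samples
        self.K_e = num_elites
        self.T = num_iters
        self.max_resamples = max_resamples
        self.rng = random.Random(seed)

    def solve(self, objective):
        """objective(theta) returns a finite value, or math.inf on neurotic breakdown."""
        mu, sigma = self.mu_init, self.sigma_init
        for t in range(self.T):
            for _ in range(self.max_resamples):
                if t == 0:
                    mu, sigma = self.mu_init, self.sigma_init
                draws = [self.rng.gauss(mu, sigma) for _ in range(self.M)]
                thetas = [th for th in draws if th > 0.0]     # discard non-positive samples
                scored = [(objective(th), th) for th in thetas]
                valid = [(v, th) for v, th in scored if math.isfinite(v)]
                if t == 0 and len(valid) < self.K_e:
                    self.mu_init /= 2.0                       # too few feasible: shrink, retry
                    self.sigma_init /= 2.0
                    continue
                if t == 0 and len(valid) == self.M:
                    self.mu_init *= 2.0                       # all feasible: widen next time
                    self.sigma_init *= 2.0
                if len(valid) >= self.K_e:
                    break
            else:
                raise RuntimeError("could not draw enough feasible samples")
            valid.sort()                                      # ascending objective value
            elites = [th for _, th in valid[: self.K_e]]
            mu = statistics.fmean(elites)                     # maximum likelihood fit
            sigma = max(statistics.pstdev(elites), 1e-6)      # floor keeps sigma positive
        return mu                                             # assumed choice of theta*


if __name__ == "__main__":
    def toy_objective(theta):
        # Minimum at theta = 2; infinite beyond 5 to mimic neurotic breakdown.
        return theta + 4.0 / theta if theta < 5.0 else math.inf

    print(RiskParamCEM().solve(toy_objective))
```

The toy objective $\theta + 4/\theta$ mimics the $d/\theta$ term of (11) and is made infinite for $\theta \ge 5$ to emulate neurotic breakdown; its minimizer is $\theta = 2$. Keeping the solver object alive between calls reproduces the carry-over of the adapted initial parameters.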

IV-C RAT iLQR as MPC

We name the proposed bilevel optimization algorithm RAT iLQR. The pseudo-code is given in Algorithm 1. At run time, it is executed as an MPC in a receding-horizon fashion; the control is re-computed after executing the first control input and transitioning to a new state. A previously-computed control trajectory is reused for the initial nominal control trajectory at the next time step to warm-start the computation.

1:Initial state , initial nominal control trajectory , KL divergence bound
2:New nominal trajectory , control gains , risk-sensitivity parameter
3:Compute initial nominal state trajectory using
4:
5:while  do
6:     while True do
7:         if  then
8:              
9:         else
10:              
11:         end if
12:         array Empty array of size
13:         for  do
14:              Solve iLEQG with
15:              Obtain approximate cost-to-go
16:              
17:         end for
18:         
19:         if  and  then
20:              
21:         else if  and  then
22:              
23:              break
24:         else if  then
25:              break
26:         end if
27:     end while
28:     
29:     
30:     
31:end while
32:
33:Solve iLEQG with
34:Obtain new nominal trajectory and control gains
35:return
Algorithm 1 RAT iLQR Algorithm
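The receding-horizon use of the planner, including the warm start, can be sketched as follows. The shift-and-repeat warm start below (drop the executed input, repeat the last one) is one common choice and an assumption here, since the text only states that the previously computed control trajectory is reused; `plan` and `step` are hypothetical stand-ins for a RAT iLQR solve and the true system transition.

```python
def shift_warm_start(u_prev):
    """Drop the input that was just executed and repeat the last one (assumed scheme)."""
    if not u_prev:
        raise ValueError("empty control trajectory")
    return list(u_prev[1:]) + [u_prev[-1]]


def run_mpc(x0, u_init, plan, step, num_steps):
    """Receding-horizon loop: re-plan from the current state, apply the first input only."""
    x, u_nom, applied = x0, list(u_init), []
    for _ in range(num_steps):
        u_nom = plan(x, u_nom)           # hypothetical solver call, warm-started with u_nom
        applied.append(u_nom[0])
        x = step(x, u_nom[0])            # transition to the new state (noisy in practice)
        u_nom = shift_warm_start(u_nom)  # reuse the remaining plan as the next initial guess
    return x, applied


if __name__ == "__main__":
    # Trivial stand-ins: a "planner" that returns its warm start and integrator dynamics.
    print(run_mpc(0.0, [1.0, 2.0, 3.0], lambda x, u: u, lambda x, u: x + u, 4))
```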

V Results

This section presents qualitative and quantitative results of the simulation study that we conducted to show the effectiveness of the RAT iLQR algorithm. We provide the problem setup as well as implementation details in Section V-A. The goals of this study are two-fold. First, we demonstrate that the robot controlled by RAT iLQR can successfully accomplish its task in the presence of stochastic disturbances, without access to the ground-truth distribution and with knowledge of only the KL divergence bound. This is presented in Section V-B with comparisons to (non-robust) iLQG and a model-based MPC with sampling from the true generative model. Second, Section V-C focuses on the nonlinear risk-sensitive optimal control aspect of RAT iLQR to show its value as an algorithm that can optimally adjust the risk-sensitivity parameter online, which itself is a novel contribution.

V-A Problem Setup

We consider a dynamic collision avoidance problem where a unicycle robot has to avoid a pedestrian on a collision course, as illustrated in Figure 2. Collision avoidance problems are often modeled by stochastic optimal control in the autonomous systems literature [16] and the human-robot interaction literature [27, 12], including our prior work [21]. The state of the robot consists of its position [m], velocity [m/s], and heading angle [rad]. The robot's control input consists of the acceleration [m/s²] and the angular velocity [rad/s]. The pedestrian is modeled as a single integrator whose state is its position [m]. We assume a constant nominal velocity [m/s] for the pedestrian. The joint system thus has a six-dimensional state, comprising the four robot states and the two-dimensional pedestrian position. The dynamics are propagated by Euler integration with a fixed time interval and additive noise on the joint state. The model distribution for the noise is a zero-mean Gaussian. The ground-truth distribution for the robot is the same Gaussian as in the model, but the pedestrian's distribution is a mixture of Gaussians that is independent of the robot's noise. Both the model and the true distributions for the pedestrian are illustrated in Figure 1. Gaussian mixtures are favored by many recent papers in machine learning to account for multi-modality in humans' decision making [5, 33, 25].

RAT iLQR requires an upper bound on the KL divergence between the model and the true distributions. For the sake of this paper, we assume that there is a separate module that provides an estimate. In this specific simulation study, we performed Monte Carlo integration offline with samples drawn from the true distribution. During the simulation, however, we did not reveal any information on the true distribution to RAT iLQR other than the estimated KL value of 32.02. This offline computation was possible due to our assumption that the Gaussian mixture is time-invariant. If one were to use a more realistic data-driven prediction instead, it would be necessary to estimate the KL divergence online, since the predictive distribution may change over time as the human-robot interaction evolves. Note that efficient and accurate estimation of information measures (including KL divergence) is an active area of research in information theory and machine learning [3, 11], and it is one of our future research directions.

The cost functions for this problem are given by

(33)

where the three terms denote, respectively, a quadratic cost that penalizes the deviation from a given target robot trajectory, a collision penalty that incurs a high cost when the robot is too close to the pedestrian, and a small quadratic cost on the control input. Mirroring the formulation in [34], we used the following collision cost:

(34)

RAT iLQR was implemented in Julia, and the Monte Carlo sampling of the cross entropy method was distributed across multiple CPU cores. With our choice of algorithm parameters, the implementation yielded an average computation time of 0.27 [s], which is 2.7 times slower than real time. We expect to achieve improved efficiency through further parameter tuning and more careful parallelization.
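The offline KL estimate described above can be sketched with plain Monte Carlo integration, $D_{\mathrm{KL}}(q \,\|\, p) \approx \frac{1}{n}\sum_i \left[\log q(v_i) - \log p(v_i)\right]$ with $v_i \sim q$. For brevity, this Python sketch assumes a one-dimensional Gaussian mixture $q$ and a Gaussian $p$ for a single time step, with made-up parameters; the paper's pedestrian noise is two-dimensional. If the noise is independent across time steps under both distributions, the KL divergence of the joint noise vector is the sum of the per-step divergences.

```python
import math
import random


def gauss_logpdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)


def mixture_logpdf(x, weights, mus, sigmas):
    """Log-density of a 1-D Gaussian mixture, computed with log-sum-exp."""
    logs = [math.log(w) + gauss_logpdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def kl_mixture_vs_gaussian(weights, mus, sigmas, mu_p, sigma_p, n=100_000, seed=0):
    """Monte Carlo estimate of D_KL(q || p) with q a Gaussian mixture and p a Gaussian."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        i = rng.choices(range(len(weights)), weights=weights)[0]  # pick a mixture component
        v = rng.gauss(mus[i], sigmas[i])                           # v ~ q
        total += mixture_logpdf(v, weights, mus, sigmas) - gauss_logpdf(v, mu_p, sigma_p)
    return total / n


def kl_gaussians(mu_q, s_q, mu_p, s_p):
    """Closed-form D_KL(N(mu_q, s_q^2) || N(mu_p, s_p^2)), used as a sanity check."""
    return math.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p) ** 2) / (2.0 * s_p**2) - 0.5


if __name__ == "__main__":
    print(kl_mixture_vs_gaussian([0.5, 0.5], [-1.5, 1.5], [0.5, 0.5], 0.0, 1.0))
```

A one-component "mixture" reduces to a Gaussian, so comparing against `kl_gaussians` is a convenient check that the estimator is unbiased up to Monte Carlo error.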

V-B Comparison with Baseline MPC Algorithms

Fig. 2: A unicycle robot avoiding collision with a road-crossing pedestrian. The figures are overlaid with predictions drawn from the model distribution (Figure 1, left) and closed-loop motion plans with RAT iLQR. Note that the prediction is erroneous, since the actual pedestrian motion follows the Gaussian mixture distribution (Figure 1, right). (Left) When the KL bound is set to zero, RAT iLQR ignores this model error and reduces to iLQG. (Right) With the correct information on the KL divergence, RAT iLQR is aware of the prediction error and optimally adjusts the risk-sensitivity parameter for iLEQG, planning a trajectory that stays farther away from the pedestrian.
Fig. 3: Histograms of the minimum separation distance between the robot and the pedestrian. A negative value indicates that a collision has occurred in that run. For each control algorithm, we performed 30 runs of the simulation with randomized pedestrian start positions. RAT iLQR consistently maintained a sufficient safety margin to avoid collision, while iLQG and PETS both failed. See Table I for the summary statistics of these data.

We compared the performance of RAT iLQR against two baseline MPC algorithms, iLQG [30] and PETS [6]. iLQG corresponds to RAT iLQR with a KL bound of zero, i.e. no distributional robustness is considered. In exchange, it is more computationally efficient than RAT iLQR, taking only 0.01 [s]. PETS is a state-of-the-art, model-based stochastic MPC algorithm with sampling that was originally proposed in a model-based reinforcement learning (RL) context. We chose PETS as our baseline since, like RAT iLQR, it relies on the cross entropy method for online control optimization and is not limited to Gaussian distributions. However, there are three major differences between PETS and RAT iLQR. First, PETS performs the cross entropy optimization directly in the high-dimensional control sequence space, which is far less sample efficient than RAT iLQR, which uses the cross entropy method only to optimize the scalar risk-sensitivity parameter. Second, unlike RAT iLQR, PETS does not consider feedback during planning. Third, PETS requires access to the exact ground-truth Gaussian mixture distribution to perform sampling, while RAT iLQR relies only on the KL divergence bound and the Gaussian distribution that we have modeled. We let PETS perform multiple iterations of the cross entropy optimization, each with a set of control sequence samples coupled with samples for the joint state trajectory prediction, which resulted in an average computation time of 0.67 [s].

Method Min. Sep. Dist. [m] Total Collision Count
RAT iLQR (Ours) 0
iLQG [30] 1
PETS [6] 4
TABLE I: Statistics summarizing the histogram plots presented in Figure 3. RAT iLQR achieved the largest average value for the minimum separation distance with the smallest standard deviation, which contributed to safe robot navigation without a single collision. Note that PETS had multiple collisions despite its access to the true Gaussian mixture distribution.

We performed 30 runs of the simulation for each algorithm, with randomized pedestrian start positions and stochastic transitions. To measure the performance, we computed the minimum separation distance between the robot and the pedestrian in each run, assuming that both agents are circular with the same diameter. The histogram plots presented in Figure 3 clearly indicate the failure of iLQG and PETS, as well as RAT iLQR's capability to maintain a sufficient safety margin for collision avoidance despite the distributional model mismatch. As summarized in Table I, RAT iLQR achieved the largest minimum separation distance on average with the smallest standard deviation, which contributed to safe robot navigation. Note that iLQG had one collision under this large model mismatch. Figure 2 provides a qualitative explanation of this failure: the trajectories planned by iLQG tend to be much closer to the passing pedestrian than those planned by the risk-sensitive RAT iLQR. This difference is consistent with our earlier observations in prior work [21], where risk-sensitive optimal control is shown to affect the global behavior of the robot.
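The separation metric and the summary statistics can be computed as below. This Python sketch assumes that the positions of both agents are sampled at the same time steps and that each agent is a circle of the given diameter, so the gap is the center distance minus one diameter (the sum of the two radii); the use of the sample standard deviation in `summarize` is an assumption.

```python
import math
import statistics


def min_separation(robot_xy, ped_xy, diameter):
    """Minimum gap between two equal circular agents over one run (< 0 means collision)."""
    closest = min(math.dist(r, p) for r, p in zip(robot_xy, ped_xy))
    return closest - diameter   # center distance minus the sum of the two radii


def summarize(min_seps):
    """Mean, sample standard deviation, and collision count over runs."""
    return (statistics.fmean(min_seps),
            statistics.stdev(min_seps),
            sum(1 for s in min_seps if s < 0.0))


if __name__ == "__main__":
    robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    ped = [(2.0, 2.0), (1.0, 0.5), (2.0, 3.0)]
    print(min_separation(robot, ped, diameter=1.0))
```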

V-C Benefits of Risk-Sensitivity Parameter Optimization

Fig. 4: Time-averaged ratio of the optimal risk-sensitivity parameter found by RAT iLQR to the maximum feasible value before the neurotic breakdown occurs, plotted for three distinct KL divergence values. As the KL bound increased from 1.34 to 32.02, the ratio consistently increased from 0.66 to 0.93, while its standard deviation decreased from 0.29 to 0.10. This suggests that the robot becomes more risk-sensitive as the KL bound increases, and yet it does not choose the maximum value all the time.
KL Bound:
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 |
iLEQG with max. feasible risk-sensitivity | 0 |

KL Bound:
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 |
iLEQG with max. feasible risk-sensitivity | 0 |

KL Bound:
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 |
iLEQG with max. feasible risk-sensitivity | 0 |
TABLE II: Our comparative study between RAT iLQR with the optimized risk-sensitivity parameter and iLEQG with the maximum feasible risk-sensitivity parameter reveals that RAT iLQR’s optimal choice of the risk-sensitivity parameter results in more efficient robot navigation with smaller trajectory tracking errors, while still achieving collision avoidance under the model mismatch. With RAT iLQR, the average tracking error was reduced for each of the 3 true distributions with different KL divergences.

Aside from the baseline comparison, we also performed two additional sets of 30 simulations for RAT iLQR, with different ground-truth distributions that have smaller KL divergences. This is to study how the KL divergence bound given to RAT iLQR affects the resulting risk-sensitivity parameter. The results are shown in Figure 4. As the KL bound increases from 1.34 to 32.02, the ratio of the optimal risk-sensitivity parameter found by RAT iLQR to the maximum feasible value also increases. This matches our intuition that the larger the model mismatch, the more risk-sensitive the robot becomes. However, we also note that the robot does not always saturate the risk-sensitivity parameter, even under the largest KL bound of 32.02. This raises a fundamental question about the benefits of RAT iLQR as a risk-sensitive optimal control algorithm: how favorable is RAT iLQR with the optimal risk-sensitivity parameter compared to iLEQG with the highest feasible risk-sensitivity? To answer this question, we performed a comparative study between RAT iLQR with the optimized parameter and iLEQG with the parameter fixed at the maximum feasible value found during the cross entropy sampling of RAT iLQR, under the same simulation setup as before. The results are reported in Table II. In terms of collision avoidance, both algorithms were equally safe, with a collision count of zero. However, RAT iLQR achieved significantly more efficient robot navigation than iLEQG with the maximum feasible parameter, reducing the average tracking error for all three KL values. The efficiency and the safety of robot navigation are often in conflict in dynamic collision avoidance, and prior work [21] struggles to find the right balance by manually tuning a fixed risk-sensitivity parameter. RAT iLQR eliminates the need for such manual tuning, since the algorithm dynamically adjusts the risk-sensitivity parameter to best handle the potential model mismatch specified by the KL bound.
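The trend in Figure 4 can be reproduced qualitatively on a toy problem where everything is closed-form. The setup below is our own illustrative assumption, not the paper's scenario: a scalar cost J = x²/2 with nominal noise x ~ N(0, σ²). Then E_p[exp(θJ)] = (1 - θσ²)^(-1/2) for θ < 1/σ² and is infinite otherwise, so θ_max = 1/σ² plays the role of the neurotic breakdown. We also assume the standard duality between a KL-constrained worst-case expectation and the entropic risk measure, in the spirit of the equivalence in [22]: the worst-case expected cost over all q with KL(q‖p) ≤ d equals the minimum over θ > 0 of (1/θ)(log E_p[exp(θJ)] + d). Setting the derivative to zero shows that the optimal ratio s = θ*/θ_max solves g(s) = s / (2(1 - s)) + ½ log(1 - s) = d. Since g is increasing on (0, 1) and diverges as s → 1, the ratio grows with the KL bound but never reaches 1. The helper names are hypothetical.

```python
import math


def g(s):
    """Stationarity condition of the toy dual objective, with s = theta / theta_max."""
    return 0.5 * s / (1.0 - s) + 0.5 * math.log(1.0 - s)


def optimal_ratio(kl_bound, tol=1e-12):
    """Solve g(s) = kl_bound for s in (0, 1) by bisection (g is increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < kl_bound:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def worst_case_cost(kl_bound, sigma=1.0):
    """Toy dual value (d + log E[exp(theta * J)]) / theta at the optimal theta."""
    theta_max = 1.0 / sigma ** 2  # neurotic breakdown of the toy model
    theta = optimal_ratio(kl_bound) * theta_max
    return (kl_bound - 0.5 * math.log(1.0 - theta * sigma ** 2)) / theta


for d in (1.34, 32.02):
    print(f"KL bound {d}: theta*/theta_max = {optimal_ratio(d):.3f}")
```

For this toy model the worst-case distribution is itself Gaussian, with variance σ²/(1 - θ*σ²), which gives an independent check on `worst_case_cost`. The toy ratios are not comparable to the values in Figure 4; only the monotone dependence on the KL bound carries over.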

VI Conclusions

In this work we propose RAT iLQR, a novel nonlinear MPC algorithm for distributionally robust control under a KL divergence bound. Our method is based on the mathematical equivalence between distributionally robust control and risk-sensitive optimal control. A locally optimal solution to the resulting bilevel optimization problem is derived with iLEQG and the cross entropy method. The simulation study shows that RAT iLQR successfully accounts for the distributional mismatch during collision avoidance. It also shows the effectiveness of dynamic adjustment of the risk-sensitivity parameter by RAT iLQR, which overcomes a limitation of conventional risk-sensitive optimal control methods. Future work will focus on accurate online estimation of the KL divergence from a stream of data. We are also interested in exploring applications of RAT iLQR, including control of learned dynamical systems and perception-aware control.

References

  • [1] K. J. Åström (1970) Introduction to stochastic control theory. Academic Press.
  • [2] S. Bechtle, Y. Lin, A. Rai, L. Righetti, and F. Meier (2020) Curious iLQR: resolving uncertainty in model-based RL. In Conference on Robot Learning, pp. 162–171.
  • [3] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
  • [4] D. P. Bertsekas (1976) Dynamic programming and stochastic control. Academic Press, Inc., USA. ISBN 0120932504.
  • [5] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In Conference on Robot Learning, pp. 86–99.
  • [6] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.
  • [7] J. C. Doyle (1978) Guaranteed margins for LQG regulators. IEEE Transactions on Automatic Control 23 (4), pp. 756–757.
  • [8] I. Exarchos, E. A. Theodorou, and P. Tsiotras (2016) Game-theoretic and risk-sensitive stochastic optimal control via forward and backward stochastic differential equations. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 6154–6160.
  • [9] F. Farshidian and J. Buchli (2015) Risk sensitive, nonlinear optimal control: iterative linear exponential-quadratic optimal control with Gaussian noise. arXiv preprint arXiv:1512.07173.
  • [10] A. Hakobyan and I. Yang (2020) Wasserstein distributionally robust motion planning and control with safety constraints using conditional value-at-risk. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 490–496.
  • [11] M. N. Iranzad (2019) Estimation of information measures and its applications in machine learning. Ph.D. Thesis, University of Michigan.
  • [12] B. Ivanovic, A. Elhafsi, G. Rosman, A. Gaidon, and M. Pavone (2020) MATS: an interpretable trajectory forecasting representation for planning and control. arXiv preprint arXiv:2009.07517.
  • [13] D. H. Jacobson and D. Q. Mayne (1970) Differential dynamic programming.
  • [14] D. Jacobson (1973) Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control 18 (2), pp. 124–131.
  • [15] M. J. Kochenderfer and T. A. Wheeler (2019) Algorithms for optimization. MIT Press.
  • [16] R. Kusumoto, L. Palmieri, M. Spies, A. Csiszar, and K. O. Arras (2019) Informed information theoretic model predictive control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 2047–2053.
  • [17] W. Li and E. Todorov (2004) Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pp. 222–229.
  • [18] A. Majumdar and M. Pavone (2020) How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pp. 75–84.
  • [19] J. R. Medina, D. Lee, and S. Hirche (2012) Risk-sensitive optimal feedback control for haptic assistance. In 2012 IEEE International Conference on Robotics and Automation, pp. 1025–1031.
  • [20] J. R. Medina, T. Lorenz, D. Lee, and S. Hirche (2012) Disagreement-aware physical assistance through risk-sensitive optimal feedback control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3639–3645.
  • [21] H. Nishimura, B. Ivanovic, A. Gaidon, M. Pavone, and M. Schwager (2020) Risk-sensitive sequential action control with multi-modal human trajectory forecasting for safe crowd-robot interaction. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • [22] I. R. Petersen, M. R. James, and P. Dupuis (2000) Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control 45 (3), pp. 398–412.
  • [23] V. Roulet, M. Fazel, S. Srinivasa, and Z. Harchaoui (2020) On the convergence of the iterative linear exponential quadratic Gaussian algorithm to stationary points. In 2020 American Control Conference (ACC), pp. 132–137.
  • [24] R. Y. Rubinstein and D. P. Kroese (2013) The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media.
  • [25] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone (2020) Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In European Conference on Computer Vision.
  • [26] S. Samuelson and I. Yang (2017) Data-driven distributionally robust control of energy storage to manage wind power fluctuations. In 2017 IEEE Conference on Control Technology and Applications (CCTA), pp. 199–204.
  • [27] E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone (2018) Multimodal probabilistic model-based planning for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9.
  • [28] A. Sinha, M. O’Kelly, H. Zheng, R. Mangharam, J. Duchi, and R. Tedrake (2020) FormulaZero: distributionally robust online adaptation via offline population synthesis. arXiv preprint arXiv:2003.03900.
  • [29] Y. Tassa, T. Erez, and E. Todorov (2012) Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913.
  • [30] E. Todorov and W. Li (2005) A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pp. 300–306.
  • [31] J. Van Den Berg, S. Patil, and R. Alterovitz (2012) Motion planning under uncertainty using iterative local optimization in belief space. The International Journal of Robotics Research 31 (11), pp. 1263–1278.
  • [32] B. P. G. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari (2016) Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control 61 (2), pp. 430–442.
  • [33] A. Wang, X. Huang, A. Jasour, and B. Williams (2020) Fast risk assessment for autonomous vehicles using learned models of agent futures. In Robotics: Science and Systems 2020.
  • [34] M. Wang, N. Mehr, A. Gaidon, and M. Schwager (2020) Game-theoretic planning for risk-aware interactive agents. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems.
  • [35] P. Whittle (1981) Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability 13 (4), pp. 764–777.
  • [36] P. Whittle (2002) Risk sensitivity, a strangely pervasive concept. Macroeconomic Dynamics 6 (1), pp. 5–18.