Efficient Iterative Linear-Quadratic Approximations for Nonlinear Multi-Player General-Sum Differential Games

09/10/2019 · David Fridovich-Keil, et al. · UC Berkeley

Differential games offer a powerful theoretical framework for formulating safety and robustness problems in optimal control. Unfortunately, numerical solution techniques for general nonlinear dynamical systems scale poorly with state dimension and are rarely used in applications requiring real-time computation. For single-agent optimal control problems, however, local methods based on efficiently solving iterated approximations with linear dynamics and quadratic costs are becoming increasingly popular. We take inspiration from one such method, the iterative linear quadratic regulator (ILQR), and observe that efficient algorithms also exist to solve multi-player linear-quadratic games. Whereas ILQR converges to a local solution of the optimal control problem, if our method converges it returns a local Nash equilibrium of the differential game. We benchmark our method in a three-player general-sum simulated example, in which it takes < 0.75 s to identify a solution and < 50 ms to solve warm-started subproblems in a receding horizon. We also demonstrate our approach in hardware, operating in real-time and following a 10 s receding horizon.


I Introduction

Differential game theory is a popular framework for formulating safety, robustness, and human-robot interaction problems. For example, Turetsky et al. [turetsky2014robust] compute a robust tracking controller by solving a two-player zero-sum game, and Wang et al. [wang2019game] model the behavior of human-driven vehicles on the road using a general-sum game.

Most classes of differential games have no analytic solution, and many numerical techniques suffer from the so-called "curse of dimensionality" [bellman1956dynamic]. Numerical dynamic programming solutions for general nonlinear systems have been studied extensively, though primarily in cases with a priori known objectives and constraints which permit offline computation, such as automated aerial refueling [ding2008reachability]. Approaches such as [herbert2017fastrack, fisac2018hierarchical], which decouple offline game analysis from online operation, are promising. Still, scenarios with more than two players remain extremely challenging, and the practical restriction of solving games offline prevents differential games from being widely used in many applications of interest, such as autonomous driving.

Like differential games, global solution techniques (i.e., those that return global optima) for single-agent optimal control problems also suffer from the curse of dimensionality. Recently, however, local methods such as differential dynamic programming (DDP) [mayne1966second] and iterative linear quadratic regulation (ILQR) [li2004iterative], which return only local optima, have become popular [koenemann2015whole, kitaev2015physics, chen2017constrained] in the nonlinear model predictive control and motion planning communities. These methods repeatedly refine an initial control strategy by efficiently solving linear-quadratic (LQ) approximations of the problem.

Fig. 1: Demonstration of the proposed algorithm for a three-player general-sum game modeling an intersection. Two cars (red and green triangles) navigate the intersection while a pedestrian (blue triangle) traverses a crosswalk. Observe how both cars swerve slightly to provide extra clearance to the pedestrian.

We observe that, just as LQ optimal control problems can be solved efficiently for a single agent, LQ games afford an equivalently efficient numerical solution for multiple players. In this paper, we present an algorithm for computing an approximate solution to differential games which leverages the computational efficiency of solving these LQ games. Like ILQR, our method is local. In the context of differential games, this means that if our approach converges, the solution it returns is a local Nash equilibrium under easily-verified conditions (rather than a local optimum, as in ILQR). However, like other local methods for solving games (e.g., [wang2019game]), our approach is not guaranteed to converge from arbitrary initializations.

The core computational steps of our approach are substantially similar to those of ILQR, as both algorithms require linearizing dynamics, quadraticizing costs, and solving a Riccati equation. For this reason, we believe that our work will enable and encourage a broad range of planning and control practitioners already familiar with ILQR to formulate problems robustly as differential games and seamlessly integrate with existing computational tools. For example, in Fig. 1 we demonstrate our approach in a three-player general-sum game intended to model traffic at an intersection. In this example, our algorithm converges to a set of strategies which exhibit non-trivial coordination to avoid collision. These types of coordinated strategies could be useful for modeling human behavior, as well as for computing a motion plan for an autonomous car acting as one of the players.

II Background & Related Work

Differential games have been widely studied since the introduction of pursuit-evasion games in [isaacs1951games]. Here, we survey both zero-sum and general-sum games and discuss approximate solution techniques. We also summarize iterative linear quadratic methods used both for optimal control and games and discuss their bearing on this work.

II-A Zero-sum games

In zero-sum games, two (groups of) players choose control strategies to optimize equal and opposite objectives. Two-player zero-sum games are often formulated through a Hamilton-Jacobi-Isaacs (HJI) PDE, e.g. [isaacs1999differential, evans1983differential, lions1985differential, mitchell2005time, margellos2011hamilton]. More complicated games, such as active target defense and multi-player capture-avoid games, are also addressed in a zero-sum framework in [garcia2019design, fisac2015pursuit] and [zhou2016cooperative], respectively.

II-B General-sum games

Initially formulated in [starr1969nonzero, starr1969further], general-sum differential games generalize zero-sum games to model situations in which players have competing, but not necessarily opposite, costs. Like zero-sum games, general-sum games are characterized by Hamilton-Jacobi equations [starr1969nonzero] in which each player's Hamiltonian is coupled with the other players'. Both zero-sum and general-sum games, and especially games with many players, are generally difficult to solve numerically. However, efficient methods do exist for solving games with linear dynamics and quadratic costs, e.g. [li1995lyapunov, basar1999dynamic]. Dockner et al. [dockner1985tractable] also characterize classes of games which admit tractable open-loop, rather than feedback, solutions.

II-C Approximation techniques

While general-sum games may be analyzed, in principle, by solving coupled Hamilton-Jacobi equations [starr1969nonzero], doing so unfortunately requires computational time and memory that scale exponentially with the state dimension. A number of more tractable approximate solution techniques have been proposed for zero-sum games, many of which require linear system dynamics, e.g. [kurzhanski00, kurzhanski02, greenstreet1999reachability, maidens13], or decomposable dynamics [chen2018decomposition]. Approximate dynamic programming techniques such as [bertsekas1996neuro] are not restricted to linear dynamics. Still, scalability to online, real-time operation remains a challenge.

Iterative best response algorithms form another class of approximate methods for solving general-sum games. Here, in each iteration, every player solves (or approximately solves) the optimal control problem that results from holding other players' strategies fixed. This reduction to a sequence of optimal control problems is attractive; however, it can also be computationally inefficient. Still, recent work demonstrates the effectiveness of iterative best response in lane changes [fisac2018hierarchical] and multi-vehicle racing [wang2019game].

Another similarly-motivated class of approximations involves changing the information structure of the game. For example, Chen et al. [chen2015safe] solve a multi-player reach-avoid game by pre-specifying an ordering amongst the players and allowing earlier players to communicate their intended strategies to later players. Zhou et al. [zhou2012general] and Liu et al. [liu2014evasion] operate in a similar setting, but solve for open-loop conservative strategies.

II-D Iterative linear-quadratic (LQ) methods

Iterative LQ approximation methods are increasingly common in the robotics and control communities. Our work builds directly upon the iterative linear-quadratic regulator (ILQR) algorithm [li2004iterative, todorov2005generalized]. ILQR is closely related to differential dynamic programming [mayne1966second, jacobson1970differential], and is widely used to find local solutions to smooth nonlinear optimal control problems. ILQR has been demonstrated in a variety of applications including driving [chen2017constrained], humanoid locomotion [koenemann2015whole], and grasping [kitaev2015physics]. There are also many extensions to ILQR, including trajectory smoothing [van2014iterated] and constraint handling via barrier functions [chen2017constrained].

At each iteration, ILQR simulates the full nonlinear system trajectory, computes a discrete-time linear dynamics approximation and quadratic cost approximation, and solves an LQR subproblem to generate the next control strategy iterate. While structurally similar to ILQR, our approach solves an LQ game at each iteration instead of an LQR problem. This core idea is related to the sequential linear-quadratic method of [mukai2000sequential, tanikawa2012local], which is restricted to the two-player zero-sum context. In this paper, we show that LQ approximations can be applied to N-player, general-sum games, provide a theoretical characterization of convergence properties, and, significantly, demonstrate that our approach is faster than existing approaches and is easily real-time for moderate- to large-scale problems.

III Problem Formulation

We consider an $N$-player finite-horizon general-sum differential game characterized by nonlinear system dynamics

$\dot{x} = f(t, x, u_{1:N}),$   (1)

where $x \in \mathbb{R}^n$ is the state of the system and $u_i \in \mathbb{R}^{m_i}$ is the control input of player $i$. Each player has a cost function $J_i$ defined as an integral of running costs $g_i$. $J_i$ is understood to depend implicitly upon the state trajectory $x(\cdot)$, which itself depends upon the initial state $x(0)$ and the control signals $u_{1:N}(\cdot)$:

$J_i\big(u_{1:N}(\cdot)\big) \triangleq \int_0^T g_i\big(t, x(t), u_{1:N}(t)\big)\,dt, \qquad i \in \{1, \dots, N\}.$   (2)

We shall presume that $f$ is continuous in $t$ and continuously differentiable in $\{x, u_i\}$ uniformly in $t$. We shall also require each $g_i$ to be twice differentiable in $\{x, u_i\}$ uniformly in $t$. Without any practical loss of generality, we shall also presume that each running cost $g_i$ is nonnegative.

Ideally, we would like to find time-varying state feedback control strategies $\gamma_i \in \Gamma_i$ for each player $i$ which constitute a global Nash equilibrium for the game defined by (1) and (2). Here, the strategy space $\Gamma_i$ for player $i$ is the set of measurable functions $\gamma_i : [0, T] \times \mathbb{R}^n \to \mathbb{R}^{m_i}$ mapping time and state to player $i$'s control. Note that, in this formulation, player $i$ only observes the state of the system at each time and is unaware of other players' control strategies. With a slight abuse of notation, writing $J_i(\gamma_1; \dots; \gamma_N)$ for the cost of player $i$ when all players follow their feedback strategies, the global Nash equilibrium is defined as the set of strategies $\{\gamma_i^*\}$ for which the following inequalities hold (see, e.g., [basar1999dynamic, Chapter 6]):

$J_i^* \triangleq J_i(\gamma_1^*; \dots; \gamma_i^*; \dots; \gamma_N^*) \le J_i(\gamma_1^*; \dots; \gamma_{i-1}^*; \gamma_i; \gamma_{i+1}^*; \dots; \gamma_N^*), \qquad \forall i \in \{1, \dots, N\}.$   (3)

In (3), the inequalities must hold for all admissible strategies $\gamma_i \in \Gamma_i$. Informally, a set of feedback strategies is a global Nash equilibrium if no player has a unilateral incentive to deviate from his/her current strategy.

Since finding a global Nash equilibrium is generally computationally intensive, in this work we will be concerned with finding local Nash equilibria. A local Nash equilibrium is characterized similarly to (3), except that the inequalities need only hold in an open neighborhood of each equilibrium strategy $\gamma_i^*$ [ratliff2016characterization, Definition 1]. For a detailed characterization of local Nash equilibria, please refer to [ratliff2016characterization]. Intuitively, we shall seek strategies for all players such that no player has a unilateral incentive to make a small deviation from his/her current strategy. Although a player operating at a local Nash equilibrium might prefer a very different strategy, restricting our attention to local equilibria will be computationally advantageous. Moreover, as we shall see in Section V, these local equilibria may still involve intricate coordination between players.

IV Iterative Linear-Quadratic Games

We approach the $N$-player general-sum game with costs (2) and nonlinear dynamics (1) from the perspective of classical LQ games. It is well known that equilibrium strategies for finite-horizon LQ games satisfy coupled Riccati differential equations. These coupled Riccati equations may be derived directly from the first-order necessary conditions of optimality for each player [basar1999dynamic, Chapter 6], or equivalently by substituting linear dynamics and quadratic running costs into the generalized coupled HJ equations [starr1969further]. These coupled differential equations may be solved approximately in discrete time using dynamic programming [basar1999dynamic]. We will leverage the existence and computational efficiency of this discrete-time LQ solution to solve successive approximations to the original nonlinear, non-quadratic game.

IV-A Iterative LQ game algorithm

Our iterative LQ game approach proceeds in stages, as summarized in Algorithm 1. We begin with an initial state $x(0)$ and initial feedback control strategies $\{\gamma_i^0\}$ (or open-loop controls) for each player $i$, and integrate the system forward (line 2 of Algorithm 1) to obtain the current trajectory iterate $\xi^k \equiv \{\hat{x}(t), \hat{u}_{1:N}(t)\}_{t \in [0, T]}$. Next (line 3) we obtain a Jacobian linearization of the dynamics about trajectory $\xi^k$. At each time $t$ and for arbitrary states $x(t)$ and controls $u_i(t)$ we define deviations from this trajectory $\delta x(t) = x(t) - \hat{x}(t)$ and $\delta u_i(t) = u_i(t) - \hat{u}_i(t)$. Thus equipped, we compute a continuous-time linear system approximation about $\xi^k$:

$\dot{\delta x}(t) \approx A(t)\,\delta x(t) + \sum_{i=1}^{N} B_i(t)\,\delta u_i(t),$   (4)

where $A(t)$ is the Jacobian $D_x f\big(t, \hat{x}(t), \hat{u}_{1:N}(t)\big)$ and $B_i(t)$ is likewise $D_{u_i} f\big(t, \hat{x}(t), \hat{u}_{1:N}(t)\big)$.
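For concreteness, the sketch below shows one way to compute the Jacobians $A(t)$ and $B_i(t)$ in (4) numerically. It is our own illustration, not the authors' implementation: `f` stands for any user-supplied function implementing the joint dynamics (1), and central finite differences stand in for whatever derivative method (analytic or automatic differentiation) a real implementation might use.

```python
import numpy as np

def linearize_dynamics(f, x_nom, u_nom_list, eps=1e-5):
    """Central finite-difference Jacobians of x_dot = f(x, [u_1, ..., u_N])
    about a nominal state x_nom and nominal controls u_nom_list.

    Returns A (n x n) and a list of B_i (n x m_i), as in Eq. (4)."""
    n = x_nom.size

    def perturb(vec, idx, delta):
        out = vec.copy()
        out[idx] += delta
        return out

    # A(t) = df/dx evaluated at the nominal operating point.
    A = np.zeros((n, n))
    for j in range(n):
        fp = f(perturb(x_nom, j, eps), u_nom_list)
        fm = f(perturb(x_nom, j, -eps), u_nom_list)
        A[:, j] = (fp - fm) / (2 * eps)

    # B_i(t) = df/du_i evaluated at the nominal operating point.
    B_list = []
    for i, u_nom in enumerate(u_nom_list):
        B_i = np.zeros((n, u_nom.size))
        for j in range(u_nom.size):
            up = [u.copy() for u in u_nom_list]
            um = [u.copy() for u in u_nom_list]
            up[i] = perturb(u_nom, j, eps)
            um[i] = perturb(u_nom, j, -eps)
            B_i[:, j] = (f(x_nom, up) - f(x_nom, um)) / (2 * eps)
        B_list.append(B_i)
    return A, B_list
```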

Input: initial state $x(0)$, control strategies $\{\gamma_i^0\}_{i=1}^N$, time horizon $T$
Output: converged control strategies $\{\gamma_i^*\}_{i=1}^N$
1 for iteration $k = 1, 2, \dots$ do
2       $\xi^k \leftarrow$ getTrajectory$\big(x(0), \{\gamma_i^{k-1}\}\big)$;
3       $\{A(t), B_i(t)\} \leftarrow$ linearizeDynamics$(\xi^k)$;
4       $\{l_i(t), Q_i(t), R_{ij}(t)\} \leftarrow$ quadraticizeCost$(\xi^k)$;
5       $\{\tilde{\gamma}_i^k\} \leftarrow$ solveLQGame$\big(\{A, B_i, l_i, Q_i, R_{ij}\}\big)$;
6       $\{\gamma_i^k\} \leftarrow$ stepToward$\big(\{\gamma_i^{k-1}\}, \{\tilde{\gamma}_i^k\}\big)$;
7       if converged then
8             return $\{\gamma_i^k\}$
Algorithm 1 Iterative LQ Games

We also obtain a quadratic approximation to the running cost $g_i$ for each player (see line 4 of Algorithm 1):

$g_i\big(t, x(t), u_{1:N}(t)\big) \approx g_i\big(t, \hat{x}(t), \hat{u}_{1:N}(t)\big) + \tfrac{1}{2}\,\delta x(t)^\top \big(Q_i(t)\,\delta x(t) + 2\,l_i(t)\big) + \tfrac{1}{2} \sum_{j=1}^{N} \delta u_j(t)^\top R_{ij}(t)\,\delta u_j(t),$   (5)

where the vector $l_i(t)$ is the gradient $D_x g_i$, and the matrices $Q_i(t)$ and $R_{ij}(t)$ are the Hessians $D^2_{xx} g_i$ and $D^2_{u_j u_j} g_i$, respectively, all evaluated along the trajectory $\xi^k$.
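As an illustration (ours, not the authors' code), the sketch below quadraticizes one player's running cost by finite differences, producing the gradient $l_i$, state Hessian $Q_i$, and control Hessians $R_{ij}$ appearing in (5). The function name and signature are assumptions for this example.

```python
import numpy as np

def quadraticize_cost(g_i, x_nom, u_nom_list, eps=1e-4):
    """Finite-difference quadraticization of one player's running cost
    g_i(x, [u_1, ..., u_N]) about a nominal operating point.

    Returns l_i = D_x g_i, Q_i = D_xx g_i, and [R_i1, ..., R_iN] with
    R_ij = D_{u_j u_j} g_i, as in Eq. (5)."""

    def grad(fun, z0):
        g = np.zeros(z0.size)
        for k in range(z0.size):
            zp, zm = z0.copy(), z0.copy()
            zp[k] += eps
            zm[k] -= eps
            g[k] = (fun(zp) - fun(zm)) / (2 * eps)
        return g

    def hess(fun, z0):
        H = np.zeros((z0.size, z0.size))
        for k in range(z0.size):
            zp, zm = z0.copy(), z0.copy()
            zp[k] += eps
            zm[k] -= eps
            H[:, k] = (grad(fun, zp) - grad(fun, zm)) / (2 * eps)
        return 0.5 * (H + H.T)   # symmetrize against round-off error

    cost_of_x = lambda x: g_i(x, u_nom_list)
    l_i = grad(cost_of_x, x_nom)
    Q_i = hess(cost_of_x, x_nom)

    R_i = []
    for j, u_nom in enumerate(u_nom_list):
        def cost_of_uj(u, j=j):
            u_list = [v.copy() for v in u_nom_list]
            u_list[j] = u
            return g_i(x_nom, u_list)
        R_i.append(hess(cost_of_uj, u_nom))
    return l_i, Q_i, R_i
```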

Thus, we have constructed a finite-horizon continuous-time LQ game, which may be solved via coupled Riccati differential equations [basar1999dynamic, green2012linear]. This results in a new set of candidate feedback strategies $\{\tilde{\gamma}_i^k\}$ which constitute a feedback (global) Nash equilibrium of the LQ game [basar1999dynamic]. In fact, these feedback strategies are affine maps of the form:

$\tilde{\gamma}_i^k(t, x) = \hat{u}_i(t) - P_i(t)\,\delta x(t) - \alpha_i(t),$   (6)

with gains $P_i(t)$ and affine terms $\alpha_i(t)$.
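Although the strategies above are written in continuous time, in practice the coupled equations are solved in discrete time by dynamic programming, as noted later in this section. The sketch below is a minimal discrete-time backward pass in that spirit. It is our reconstruction of the standard coupled-Riccati recursion for feedback Nash equilibria, not the authors' released implementation (see github.com/HJReachability/ilqgames); for simplicity it assumes no explicit terminal cost.

```python
import numpy as np

def solve_lq_game(As, Bs, Qs, ls, Rs):
    """Backward pass for a discrete-time N-player LQ game.

    As:  list over time of A_k (n x n)
    Bs:  list over time of [B_{1,k}, ..., B_{N,k}], each (n x m_i)
    Qs:  Qs[i][k] is player i's state cost Hessian at step k (n x n)
    ls:  ls[i][k] is player i's state cost gradient at step k (n,)
    Rs:  Rs[i][j][k] is player i's Hessian w.r.t. player j's control (m_j x m_j)

    Returns feedback gains P[i][k] (m_i x n) and affine terms alpha[i][k] (m_i,),
    so that u_i = u_hat_i - P_i dx - alpha_i as in Eqs. (6)-(7)."""
    T, N = len(As), len(Bs[0])
    n = As[0].shape[0]
    m = [Bs[0][i].shape[1] for i in range(N)]
    m_total = sum(m)

    # Value function of player i: V_i(x) ~ 0.5 x' Z_i x + zeta_i' x (+ const).
    Z = [np.zeros((n, n)) for _ in range(N)]      # no terminal cost in this sketch
    zeta = [np.zeros(n) for _ in range(N)]
    Ps = [[None] * T for _ in range(N)]
    alphas = [[None] * T for _ in range(N)]

    for k in range(T - 1, -1, -1):
        A, B = As[k], Bs[k]

        # Assemble the coupled linear equations for the gains and offsets.
        S = np.zeros((m_total, m_total))
        YP = np.zeros((m_total, n))
        Ya = np.zeros(m_total)
        row = 0
        for i in range(N):
            col = 0
            for j in range(N):
                block = B[i].T @ Z[i] @ B[j]
                if i == j:
                    block = block + Rs[i][i][k]
                S[row:row + m[i], col:col + m[j]] = block
                col += m[j]
            YP[row:row + m[i], :] = B[i].T @ Z[i] @ A
            Ya[row:row + m[i]] = B[i].T @ zeta[i]
            row += m[i]

        P_stack = np.linalg.solve(S, YP)
        a_stack = np.linalg.solve(S, Ya)

        P, alpha, row = [], [], 0
        for i in range(N):
            P.append(P_stack[row:row + m[i], :])
            alpha.append(a_stack[row:row + m[i]])
            Ps[i][k], alphas[i][k] = P[i], alpha[i]
            row += m[i]

        # Closed-loop dynamics and value function update (backward Riccati step).
        F = A - sum(B[j] @ P[j] for j in range(N))
        beta = -sum(B[j] @ alpha[j] for j in range(N))
        for i in range(N):
            Z_new = Qs[i][k] + sum(P[j].T @ Rs[i][j][k] @ P[j] for j in range(N)) + F.T @ Z[i] @ F
            zeta_new = (ls[i][k] + sum(P[j].T @ Rs[i][j][k] @ alpha[j] for j in range(N))
                        + F.T @ (Z[i] @ beta + zeta[i]))
            Z[i], zeta[i] = 0.5 * (Z_new + Z_new.T), zeta_new

    return Ps, alphas
```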

However, we find that setting $\gamma_i^k = \tilde{\gamma}_i^k$ often causes the algorithm to diverge, because the trajectory resulting from the $\{\tilde{\gamma}_i^k\}$ is far enough from the current trajectory iterate $\xi^k$ that the dynamics linearizations (Algorithm 1, line 3) and cost quadraticizations (line 4) no longer hold. As in ILQR [tassa2014control], to improve convergence, we take only a small step in the "direction" of $\tilde{\gamma}_i^k$. More precisely, for some choice of step size $\eta \in (0, 1]$, we set

$\gamma_i^k(t, x) = \hat{u}_i(t) - P_i(t)\,\delta x(t) - \eta\,\alpha_i(t),$   (7)

which corresponds to line 6 in Algorithm 1. Note that at time $t = 0$, $\delta x(0) = 0$ and hence $\gamma_i^k(0, x(0)) = \hat{u}_i(0) - \eta\,\alpha_i(0)$. Thus, taking $\eta = 0$, we have $\delta x(t) = 0$ for all $t$ (which may be verified recursively). That is, when $\eta = 0$ we recover the open-loop controls from the previous iterate, and hence $\xi^{k+1} = \xi^k$. Taking $\eta = 1$, we recover the LQ solution in (6). Similar logic implies the following lemma, which we will use shortly in the proof of our main theoretical result.

Lemma 1

Suppose that trajectory $\xi^*$ is a fixed point of Algorithm 1, with step size $\eta > 0$. Then the converged affine terms $\alpha_i(t)$ must all be identically zero for all time.

In ILQR, it is common to perform an Armijo line-search over the step size to ensure a sufficient decrease in cost at every iteration, and thereby improve convergence (e.g., [tassa2014control]). In the context of a noncooperative game, however, line-searching to decrease "cost" does not make sense, as the players' costs may conflict. While we believe this to be a rich topic of future research, practically we find that our algorithm typically converges for a fixed, small step size $\eta$. Heuristically decaying the step size with each iteration, or line-searching until the change between successive trajectory iterates falls below a threshold, are also promising alternatives.
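Putting the pieces together, a minimal skeleton of Algorithm 1 might look like the following. This is a sketch under assumed interfaces (the `step`, `linearize`, `quadraticize`, and `solve_lq_game` callables are placeholders for the operations described above), not the authors' implementation; in particular, it uses a fixed step size and a simple trajectory-change convergence test.

```python
import numpy as np

def rollout(x0, step, u_ol):
    """Open-loop rollout; u_ol[i] is a (T, m_i) array, x0 a NumPy state vector."""
    T = u_ol[0].shape[0]
    xs = [x0]
    for k in range(T):
        xs.append(step(xs[k], [u_ol[i][k] for i in range(len(u_ol))]))
    return xs

def iterative_lq_games(x0, step, linearize, quadraticize, solve_lq_game,
                       u_init, eta=0.1, max_iters=50, tol=1e-3):
    """Sketch of Algorithm 1: repeatedly solve LQ approximations of the game.

    step(x, [u_1, ..., u_N])          -> next state (discrete-time dynamics)
    linearize(x, [u_i])               -> (A, [B_i])
    quadraticize(i, x, [u_j])         -> (l_i, Q_i, [R_i1, ..., R_iN])
    solve_lq_game(As, Bs, Qs, ls, Rs) -> (P, alpha), see the earlier sketch
    u_init[i]                         -> (T, m_i) array of initial open-loop controls
    """
    N = len(u_init)
    T = u_init[0].shape[0]
    u_hat = [np.array(ui, dtype=float) for ui in u_init]
    x_hat = rollout(x0, step, u_hat)

    for _ in range(max_iters):
        # Linearize dynamics and quadraticize costs along the current iterate (lines 3-4).
        As, Bs = [], []
        Qs = [[] for _ in range(N)]
        ls = [[] for _ in range(N)]
        Rs = [[[] for _ in range(N)] for _ in range(N)]
        for k in range(T):
            u_k = [u_hat[i][k] for i in range(N)]
            A, B = linearize(x_hat[k], u_k)
            As.append(A)
            Bs.append(B)
            for i in range(N):
                l_i, Q_i, R_i = quadraticize(i, x_hat[k], u_k)
                ls[i].append(l_i)
                Qs[i].append(Q_i)
                for j in range(N):
                    Rs[i][j].append(R_i[j])

        # Solve the LQ game approximation for gains and affine terms (line 5).
        P, alpha = solve_lq_game(As, Bs, Qs, ls, Rs)

        # Forward pass with the blended strategies of Eq. (7) (lines 2 and 6).
        xs = [x0]
        us = [np.zeros_like(u_hat[i]) for i in range(N)]
        for k in range(T):
            dx = xs[k] - x_hat[k]
            u_k = [u_hat[i][k] - P[i][k] @ dx - eta * alpha[i][k] for i in range(N)]
            for i in range(N):
                us[i][k] = u_k[i]
            xs.append(step(xs[k], u_k))

        # Convergence check: has the trajectory iterate stopped changing?
        if max(np.linalg.norm(xs[k] - x_hat[k]) for k in range(T + 1)) < tol:
            return P, alpha, xs, us
        x_hat, u_hat = xs, us

    return P, alpha, x_hat, u_hat
```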

IV-B Convergence

Like other local methods (e.g., [wang2019game]), our approach is not guaranteed to converge to a fixed strategy for each player. However, presuming that our algorithm does converge for a particular problem instantiation, Theorem 1 provides conditions under which the control signals associated with the converged strategies constitute a local Nash equilibrium in open-loop strategies. The feedback controllers may be understood as stabilizing about this open-loop equilibrium point.

Theorem 1

(Characterization of Fixed Points) Suppose that Algorithm 1 converges to feedback controllers $\{\gamma_i^*\}$, and let $\xi^*$ and $\{J_i^*\}$ respectively be the trajectory and costs corresponding to the $\{\gamma_i^*\}$ and initial condition $x(0)$. If the quadratic approximation (5) of each player's running cost about $\xi^*$ is convex in state and controls (i.e., $Q_i(t) \succeq 0$ and $R_{ij}(t) \succeq 0$ for all $t$), then the open-loop control signals $u_i^*(t) \equiv \gamma_i^*\big(t, x^*(t)\big)$ constitute a local Nash equilibrium in open-loop strategies for the game specified by dynamics (1) and costs (2). That is, the $u_i^*$ satisfy (3) locally on the space of time-varying but state-independent strategies.

Proof: It suffices to show, without loss of generality, that small perturbations of player 1's control signal away from $u_1^*(\cdot)$ result in higher cost $J_1$, provided the convexity condition of Theorem 1 holds. The proof follows two steps: first, we prove the result for the linearized system (4), then we invoke standard results from system theory on the sensitivity of trajectories to extend the proof to the nonlinear system.

Suppose that player 1's quadraticized running cost is convex, i.e., $Q_1(t) \succeq 0$ and $R_{1j}(t) \succeq 0$; this implies that $g_1$ is convex in $(x, u_{1:N})$ in a neighborhood of $\xi^*$. However, the state at time $t$, $x(t)$, depends upon the controls applied at all prior times. We shall first show that $J_1$ is convex in $u_1(\cdot)$ near $u_1^*(\cdot)$ for trajectories of the Jacobian linearization (4) about $\xi^*$. Fixing the other players' control signals $u_{2:N}^*(\cdot)$, the state (and hence $\delta x$) of the resulting affine time-varying system is affine in $u_1(\cdot)$ (and hence in $\delta u_1(\cdot)$) [callier2012linear]. Convex functions composed with affine functions are still convex, and integrals of nonnegative convex functions are also convex; hence, $J_1$ is also convex in $u_1(\cdot)$ in a neighborhood of $u_1^*(\cdot)$ for trajectories of the linearized system.

Additionally, we know that the converged strategies $\{\gamma_i^*\}$ comprise a fixed point of Algorithm 1. By Lemma 1, the affine terms $\alpha_i(t)$ are identically zero. Moreover, the $\{\gamma_i^*\}$ comprise the unique global Nash equilibrium of the LQ approximation (about $\xi^*$) to the original game. Hence, $J_1$ has a local minimum at $u_1^*(\cdot)$ for the linearized dynamics.

It remains to argue that these results extend to trajectories of the nonlinear system (1). Fortunately, since $f$ is continuously differentiable in $\{x, u_i\}$ uniformly in $t$, we know that trajectories of the nonlinear system vary smoothly with the control inputs $u_{1:N}(\cdot)$. Moreover, by continuity, trajectories of (1) can be driven arbitrarily close to trajectories of the linearized system by keeping control perturbations sufficiently small. Formally, for every $\epsilon > 0$ there exists $\delta > 0$ such that whenever $\|u_1(\cdot) - u_1^*(\cdot)\| < \delta$, the nonlinear and linearized state trajectories differ by at most $\epsilon$. Taking $\epsilon$ sufficiently small, this implies that the difference in cost for trajectories of the nonlinear system can also be made arbitrarily close to that for trajectories of the linearized system. Thus, our analysis of the linearized system holds locally for the nonlinear dynamics.

We offer further interpretation of Theorem 1 below.

Fig. 2: Three-player general-sum game which models traffic at an intersection, solved over a fixed time horizon with a uniform time discretization. Two cars (red and green triangles) and a pedestrian (blue triangle) wish to navigate the intersection while avoiding collision. (Left) The green car seeks the lane center and then swerves slightly to avoid the pedestrian. (Center) The red car weaves in front of the green car and slows slightly to allow the pedestrian to pass. (Right) The red car swerves left to give the pedestrian a wide berth.
Remark 1

This characterization of the fixed points as potentially local Nash equilibria in open-loop strategies is roughly equivalent to the characterization of the convergence of other local methods for solving differential games, such as iterated best response, e.g., [wang2019game, Theorem 1].

Remark 2

The condition that $Q_i(t)$ be nonnegative-definite is a common requirement in general-sum LQ games. However, Başar and Olsder [basar1999dynamic, Remark 6.4] remark that nonnegative-definiteness is sufficient, but not necessary, for the correctness of the LQ game solution; indeed, it is never satisfied in a zero-sum game. We conjecture that the definiteness condition in Theorem 1 may also be sufficient, but not necessary.

Note: Although we have presented our algorithm in continuous time, in practice we solve the coupled Riccati equations analytically in discrete time via dynamic programming. Please refer to [basar1999dynamic, Corollary 6.1] for a full derivation. To discretize time at resolution $\Delta t$, we employ Runge-Kutta integration of the nonlinear dynamics (1) with a zero-order hold for the control inputs over each time interval $[t, t + \Delta t)$. That is, we numerically compute:

$x(t + \Delta t) = x(t) + \int_{t}^{t + \Delta t} f\big(s, x(s), u_{1:N}(t)\big)\,ds.$   (8)
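For illustration, a classical fourth-order Runge-Kutta step with a zero-order hold on the controls, matching the form of (8), might be implemented as follows. The paper does not specify the order of its Runge-Kutta scheme; RK4 is an assumption here.

```python
def rk4_step(f, t, x, u_list, dt):
    """One RK4 step of x_dot = f(t, x, [u_1, ..., u_N]) with the controls
    held constant (zero-order hold) over [t, t + dt], as in Eq. (8)."""
    k1 = f(t, x, u_list)
    k2 = f(t + 0.5 * dt, x + 0.5 * dt * k1, u_list)
    k3 = f(t + 0.5 * dt, x + 0.5 * dt * k2, u_list)
    k4 = f(t + dt, x + dt * k3, u_list)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```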

IV-C Computational complexity and runtime

The per-iteration computational complexity of our approach is comparable to that of ILQR, and scales modestly with the number of players, $N$. Specifically, at each iteration, we first linearize the system dynamics about $\xi^k$. Presuming that the state dimension $n$ is larger than each control dimension $m_i$, linearization requires computing $O(n^2)$ partial derivatives at each time step (which also holds for ILQR). We also quadraticize costs, which requires $O(N n^2)$ partial derivatives at each time step (compared to $O(n^2)$ for ILQR). Finally, solving the coupled Riccati equations of the resulting LQ game at each time step amounts to assembling and solving a linear system in all players' feedback gains, whose size scales with the total control dimension $\sum_i m_i$; the precise complexity may be verified by inspecting [basar1999dynamic, Corollary 6.1] (for ILQR, the corresponding Riccati step has complexity $O(n^3)$).

Total algorithmic complexity depends upon the number of iterations, which we currently have no theory to bound. However, empirical results are extremely promising (code available at: github.com/HJReachability/ilqgames). For the three-player, 14-state game described in Section V-A, the entire game can be solved from a zero initialization in under 0.75 s, and receding horizon invocations warm-started with the previous solution can be re-solved in under 50 ms. All computation times are reported for single-threaded operation on a 2017 MacBook Pro with a 2.8 GHz Intel Core i7 CPU. For reference, the iterative best response scheme of [wangmingyu2019game] reports its update rate for a receding horizon two-player zero-sum racing game, and the method of [tanikawa2012local] reportedly takes several minutes to converge for a different two-player zero-sum example. The dynamics and costs in both cases differ from those in Section V (or are not clearly reported); nonetheless, the runtime of our approach compares favorably.

V Examples

In this section, we demonstrate our algorithm experimentally in three-player noncooperative settings, both in software simulation and hardware.

V-A Three-player intersection (software)

We begin by testing our algorithm in software simulation. As shown in Fig. 2, we consider an intersection with two cars and one pedestrian, all of which must cross paths to reach their desired goal locations. The game is solved over a fixed time horizon with a uniform time discretization. We model collision-avoiding interactions with semi-quadratic penalties on the pairwise distances between players. We assign asymmetric weights to the different players, so that the two cars are more strongly penalized for near-misses and therefore bear a greater burden for taking evasive action. Additionally, we assign each player a quadratic penalty for distance to his/her goal location, and cars are penalized quadratically for their distance to the appropriate lane center.
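As an illustration of this cost structure, the sketch below implements one plausible reading of the penalties described above: a "semi-quadratic" proximity cost that is active only within a threshold distance, plus quadratic goal and lane-center costs. All function names, thresholds, and weights are hypothetical; the paper's exact cost parameters are not reported here.

```python
import numpy as np

def proximity_cost(p_a, p_b, min_dist, weight):
    """Semi-quadratic collision-avoidance penalty: zero when players a and b
    are farther apart than min_dist, quadratic in the violation otherwise."""
    dist = np.linalg.norm(p_a - p_b)
    violation = max(min_dist - dist, 0.0)
    return weight * violation ** 2

def goal_cost(p, p_goal, weight):
    """Quadratic penalty on a player's distance to its goal location."""
    return weight * float((p - p_goal) @ (p - p_goal))

def lane_center_cost(lateral_offset, weight):
    """Quadratic penalty on a car's lateral distance to the lane center."""
    return weight * lateral_offset ** 2
```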

Fig. 3: Time-lapse of a hardware demonstration of Algorithm 1. We model the interaction of a holonomic robot (blue triangle) and two pedestrians (purple and red triangles) using a differential game in which each agent wishes to reach a goal location while maintaining sufficient distance from other agents. Our algorithm solves receding horizon instantiations of this game in real-time, and successfully plans and executes interactive collision-avoiding maneuvers. Planned (and predicted) trajectories are shown in blue (robot), purple, and red (pedestrians).

We model the cars' dynamics using a classical 5D bicycle model and the pedestrian's dynamics using a 4D unicycle model:

$\dot{p}_x = v\cos\theta, \quad \dot{p}_y = v\sin\theta, \quad \dot{\theta} = \tfrac{v}{L}\tan\phi, \quad \dot{\phi} = \omega, \quad \dot{v} = a$ (bicycle)
$\dot{p}_x = v\cos\theta, \quad \dot{p}_y = v\sin\theta, \quad \dot{\theta} = \omega, \quad \dot{v} = a$ (unicycle)   (9)

where $(p_x, p_y)$ is the position of the center of the (rear) axle, $\theta$ is the heading relative to the positive $x$-axis, $\phi$ is the front wheel angle (bicycle only), and $v$ is the speed. The car is controlled by the front wheel turning rate $\omega$ and tangential acceleration $a$, while the unicycle is controlled by the angular rate $\omega$ and tangential acceleration $a$. $L$ is the distance between the car's front and rear axles. Together, the state of the three-player game is 14-dimensional.
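A direct transcription of these models into code might look as follows (our sketch; the state and control orderings and the axle-length value are assumptions for illustration only).

```python
import numpy as np

def bicycle_dynamics(x, u, L=2.5):
    """5D kinematic bicycle model of Eq. (9); x = [px, py, theta, phi, v],
    u = [omega, a]. L is the inter-axle distance (placeholder value)."""
    px, py, theta, phi, v = x
    omega, a = u
    return np.array([v * np.cos(theta),
                     v * np.sin(theta),
                     (v / L) * np.tan(phi),
                     omega,
                     a])

def unicycle_dynamics(x, u):
    """4D unicycle model of Eq. (9); x = [px, py, theta, v], u = [omega, a]."""
    px, py, theta, v = x
    omega, a = u
    return np.array([v * np.cos(theta),
                     v * np.sin(theta),
                     omega,
                     a])
```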

Fig. 2 shows a time-lapse of the converged solution identified by our algorithm. While this may or may not be a global Nash equilibrium, it certainly exhibits nontrivial coordination among the players, and with a careful choice of costs it satisfies the sufficient conditions for being an open-loop local Nash equilibrium (Theorem 1). Observe how, in the left panel, the green car initially seeks the lane center to minimize its cost, but then turns slightly to avoid the pedestrian (blue). In the center panel, the red car turns slightly right to pass in front of the green car, and then slows and begins to turn left to give the pedestrian time to cross. Finally (right panel), the red car turns left to give the pedestrian a wide berth.

V-B Receding horizon motion planning (hardware)

We next implement Algorithm 1 within the Robot Operating System (ROS) framework, and evaluate it in a real-time hardware test. Here, we set up a game in which a TurtleBot 2 ground robot and two pedestrians seek to cross a room as quickly as possible while maintaining a minimum separation between all agents. We model the robot as a 4D unicycle (9) and the pedestrians as 3D Dubins cars moving at constant speed $v$:

$\dot{p}_x = v\cos\theta, \quad \dot{p}_y = v\sin\theta, \quad \dot{\theta} = \omega.$   (10)

We use a similar cost structure to that of Section V-A, and re-solve the game over a 10 s receding horizon with a uniform time discretization, replanning at a fixed rate. We gather state information for all agents using a motion capture system. Fig. 3 shows a time-lapse of a typical interaction. Internally, we initialize Algorithm 1 with all agents' strategies identically zero (i.e., $P_i(t) \equiv 0$ and $\alpha_i(t) \equiv 0$), and warm-start each successive receding horizon invocation with the previous solution. Initially, in frame (a), Algorithm 1 identifies a set of strategies which steer each agent to their respective goals while maintaining comfortable separation. Of course, the pedestrians do not actually follow these precise trajectories; hence later receding horizon invocations converge to slightly different strategies. In fact, between frames (c) and (d) the red pedestrian makes an unanticipated sharp right-hand turn, which forces the (blue) robot to stay to the right of its previous plan and then turn left in order to maintain sufficient separation from both pedestrians. We note that our assumed cost structure models all agents as wishing to avoid collision; hence the resulting strategies may be less conservative than those that would arise from a non-game-theoretic motion planning approach. As our primary objective is to demonstrate the real-time performance of Algorithm 1, we leave a more complete study of agent intent modeling and its impact on Nash equilibria for future work.
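The receding horizon usage described above can be summarized by a simple loop like the one below. This is a schematic sketch with hypothetical helper functions (`get_state`, `send_controls`, and the `solver` wrapper around Algorithm 1); it is not the authors' ROS implementation.

```python
import time

def receding_horizon_loop(solver, get_state, send_controls, u_zero, replan_period):
    """Re-solve the game in a receding horizon, warm-starting each invocation.

    solver(x0, u_warm) runs Algorithm 1 from the measured state x0, warm-started
    with the previous solution, and returns (P, alpha, xs, us)."""
    u_warm = u_zero                       # first solve starts from all-zero strategies
    while True:
        x0 = get_state()                  # e.g., from a motion capture system
        P, alpha, xs, us = solver(x0, u_warm)
        send_controls(P, alpha, xs)       # execute the robot's feedback strategy
        u_warm = us                       # warm-start the next invocation
        time.sleep(replan_period)         # replan at a fixed period
```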

VI Discussion

We have presented a novel algorithm for finding local Nash equilibria in multi-player general-sum differential games. Our approach is closely related to the popular iterative linear quadratic regulator (ILQR) algorithm [li2004iterative], and offers a relatively straightforward way for practitioners already using ILQR to add robustness to their systems and to model multi-agent interactions with differential games. We demonstrated our method in a 14-dimensional three-player example, in which it finds an interactive solution to a traffic scenario. We also demonstrated our approach in a hardware test, in which it operates in real-time following a receding horizon.

There are several other approaches to identifying local Nash equilibria in differential games, e.g. iterative best response [wang2019game]. We have shown the relative computational efficiency of our approach. However, quantitatively comparing the equilibria identified by these algorithms is challenging because, in arbitrary general-sum games, different players may prefer different equilibria. Studying the qualitative differences in these equilibria is an important direction of future research.

While we identify sufficient conditions for converged strategies to be a local Nash equilibrium (Theorem 1), our approach is not guaranteed to converge from arbitrary initializations. However, in our experience convergence can almost always be achieved with a sufficiently small step size. Future work will seek a theoretical explanation of this empirical property. Another important point of practical concern, e.g. in motion planning, is how to estimate appropriate cost functions for each player; we hope to direct future work in this direction as well. Finally, it will also be critical to develop a theory for understanding the topology of the local Nash equilibria identified by our algorithm, and their sensitivity to both misspecified objectives and sub-optimal play.

Acknowledgments

The authors would like to thank Andrew Packard for his helpful insights on LQ games, as well as Somil Bansal, Jaime Fisac, Tyler Westenbroek, and Eric Mazumdar for helpful discussions about the convergence properties of our approach.