In robotics, a wide variety of decision making problems, including low-level control, motion planning, and task planning, are often best expressed as optimal control problems. Specific algorithms and solution strategies may differ depending upon factors such as system dynamics and cost structure; yet, modern methods such as model predictive control have proven extremely effective in many applications of interest. Still, optimal control formulations are fundamentally limited to solving decision problems for a single agent.
Dynamic game theory, the study of games played over time, provides a natural extension of optimal control to the multi-agent setting. For example, nearby vehicles at an intersection (e.g., Fig. 1) mutually influence one another as they attempt to make forward progress in a desired direction while avoiding collision. Abstractly, dynamic games provide each agent, or “player,” a separate input to the system, and allow each player to have a different cost function which they wish to optimize. Players may wish to cooperate with one another in some situations and not cooperate in others, leading to complicated, coupled optimal play. Moreover, different players may know different pieces of information at any point in time. Optimal play depends strongly upon this information structure; a player with an informational advantage can often exploit that knowledge to the detriment of any competitors. For example, in poker a player who cheats and looks at the top card in the deck is more certain of who will win the next hand, and hence can assume less risk in betting.
In this paper, we consider dynamic games played in continuous time, or differential games. Historically, differential games were first studied in zero-sum (perfectly competitive) settings such as pursuit-evasion problems [isaacs1951games]. Here, optimal play is described by the Hamilton-Jacobi-Isaacs (HJI) PDE, in which the Hamiltonian includes a minimax optimization problem that encodes the instantaneous preferences of both players. These results extend to general-sum games as well, in which optimal play follows coupled HJ equations [starr1969nonzero, starr1969further]. Unfortunately, numerical solutions to these coupled PDEs often prove intractable because they operate on a densely discretized state space. Approximate dynamic programming methods [bertsekas1996neuro] offer a promising alternative; still, computational efficiency remains a significant challenge.
Recently, the robotics community has shown renewed interest in dynamic and differential games, with a variety of new approximate algorithms for identifying locally optimal play. For example, Sadigh et al. [sadigh2016planning] optimize the behavior of a self-driving car while accounting for the reaction of a human driver. Wang et al. [wang2019game] demonstrate a real-time iterative best response algorithm for planning competitive trajectories in a 6-player drone racing game. Building upon the earlier sequential linear-quadratic method of [mukai2000sequential] and the well-known iterative linear-quadratic regulator [li2004iterative, jacobson1970differential], our own prior work [fridovich2019efficient] solves warm-started 3-player differential games in under 50 ms each, operating single-threaded on a consumer laptop.
In this paper, we extend and improve upon our previous work [fridovich2019efficient] by exploiting the structure in a broad class of dynamical systems. Many systems, including quadcopters and the planar unicycle and bicycle models commonly used to model automobiles, are feedback linearizable. That is, there exists a (nonlinear) control law which renders the closed-loop input-output dynamics of these nonlinear systems linear. Here, we develop an algorithm for identifying locally optimal play in differential games with feedback linearizable dynamics. We establish theoretical equivalence between the solutions identified using this algorithm and those which do not exploit the feedback linearizable structure. By exploiting the structure, however, our algorithm is able to take much larger steps at each iteration and generally converge to an equilibrium more quickly and more reliably than was previously possible. Experimental results in Section VI confirm these computational advantages for the interactive traffic scenarios shown in Fig. 1.
II Related Work
To put our work in context, here we provide a brief summary of iterative linear-quadratic (ILQ) methods of solving differential games, other approximate techniques of solving games, and common ways in which feedback linearization is used to accelerate motion planning.
II-A ILQ methods and other approximate techniques
Iterative linear-quadratic (ILQ) methods are increasingly popular in the nonlinear model predictive control (MPC) and motion planning communities [li2004iterative, van2014iterated]. These algorithms refine an initial control law at each iteration by forming a Jacobian linearization of system dynamics and a quadratic approximation of cost, and solving the resulting LQR subproblem. Because LQ games also admit an efficient solution, this approach has been applied in the context of two-player zero-sum differential games by [mukai2000sequential] and recently extended in [fridovich2019efficient] to the $N$-player general-sum setting. ILQ methods are local. For optimal control problems, this means that they generally converge to local optima; for differential games, if they converge, they converge to local Nash equilibria under certain conditions [fridovich2019efficient, Theorem 1]. Importantly, these methods scale favorably with state dimension (cubic), number of players (cubic), and time horizon (linear), and [fridovich2019efficient] reports real-time operation for several three-player examples.
Iterative best response (IBR) algorithms comprise another class of methods for solving games. Here, in each iteration players sequentially solve (or approximately solve) the optimal control problem which results when all other players’ strategies are fixed. IBR has been demonstrated in a wide variety of settings, including congestion games [jonsson2011scaling], drone racing [wang2019game], and autonomous driving [wangmingyu2019game]. As in the case of ILQ methods, convergence is not generally guaranteed for arbitrary initializations and, at best, IBR converges to local Nash equilibria (e.g., [wang2019game]). However, by reducing the game to a sequence of optimal control problems, IBR algorithms can take advantage of existing MPC and planning tools.
II-B Feedback linearization in motion planning
Feedback linearization is a popular differential geometric control technique which renders a class of nonlinear systems’ input-output response linear. We provide a brief technical overview of feedback linearization in Section IV; here, we summarize its relevance to motion planning. One of the early successes of feedback linearization was its effectiveness in planning for chained systems, e.g., a car with multiple trailers [murray1993nonholonomic, rouchon1993flatness]. Feedback linearization (and the related notion of differential flatness) is also commonly used for minimum snap control of quadrotors [mellinger2011minimum, richter2016polynomial]. Here, the differentially flat structure of the underlying system dynamics allows planners to generate piecewise polynomial trajectories which the system can track exactly. This concept is extended to the case of differential games in [wang2019game], where each iteration of IBR yields a new spline trajectory.
III Problem Formulation
We consider $N$-player, general-sum differential games with control-affine dynamics. That is, we presume that the game state $x$ evolves as
$$\dot{x} = f(x) + \sum_{i=1}^{N} g_i(x)\, u_i, \tag{1}$$
where $u_i$ is the control input of player $i$. In our examples (Section VI), $x$ will be the concatenated states of multiple subsystems, but this is not strictly necessary. We assume that (1) is full-state feedback linearizable, i.e. there exist outputs $y = h(x)$ such that $y$ and finitely many of its time derivatives evolve linearly as a function of some auxiliary inputs $z$, for some control law $u = u(x, z)$. A brief review of feedback linearization may be found in Section IV.
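Next, we suppose that each player wishes to minimize a running cost $\ell_i$ over finite time horizon $T$:
$$J_i(u_1, \dots, u_N) = \int_0^T \ell_i\big(t; x(t), u_1(t), \dots, u_N(t)\big)\, dt. \tag{2}$$
We shall require $\ell_i$ to be in $C^2$, uniformly in time $t$. Player $i$’s total cost $J_i$ then depends explicitly upon each player’s control input signal $u_j(\cdot)$ and implicitly upon the initial condition $x(0)$.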
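Finally, we presume that each player has access to the state $x$ at every time $t$, but not other players’ control inputs $u_j$, $j \neq i$, i.e.
$$u_i(t) = \gamma_i\big(t, x(t)\big) \tag{3}$$
for some measurable function $\gamma_i$. We shall denote the set of such functions $\Gamma_i$. For clarity, we shall also overload the notation of costs, writing $J_i(\gamma_1, \dots, \gamma_N)$ for the cost player $i$ incurs when every player $j$ applies $u_j(t) = \gamma_j(t, x(t))$.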
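Definition 1 (Nash equilibrium, [basar1999dynamic, Chapter 6]). A set of strategies $(\gamma_1^*, \dots, \gamma_N^*)$ constitutes a Nash equilibrium if no player has a unilateral incentive to deviate from his or her strategy. Precisely, the following inequality must hold for each player $i$:
$$J_i(\gamma_1^*, \dots, \gamma_{i-1}^*, \gamma_i^*, \gamma_{i+1}^*, \dots, \gamma_N^*) \le J_i(\gamma_1^*, \dots, \gamma_{i-1}^*, \gamma_i, \gamma_{i+1}^*, \dots, \gamma_N^*), \quad \forall \gamma_i \in \Gamma_i.$$
In practice, we shall only be able to check whether these conditions are satisfied locally, in a neighborhood of each strategy $\gamma_i^*$. That is, we will only be able to verify whether our algorithm has found a local Nash equilibrium.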
IV Background: Feedback Linearization
This section provides a brief review of feedback linearization, a geometric control technique popularly used across a wide range of robotic applications including manipulation, quadrotor flight, and autonomous driving.
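Recall dynamics (1), and define the matrix
$$G(x) = \big[g_1(x), \dots, g_N(x)\big]$$
and vector $u = [u_1^\top, \dots, u_N^\top]^\top \in \mathbb{R}^k$, with $k$ the total control dimension. Thus, (1) may be rewritten as
$$\dot{x} = f(x) + G(x)\, u, \qquad y = h(x), \tag{4}$$
where $y$ is the output of the system, and the functions $f$, $G$ and $h$ are sufficiently smooth.
Suppose that (4) has well-defined vector relative degree $(r_1, \dots, r_k)$ [sastry1999nonlinear, Definition 9.15] and is full-state feedback linearizable. Then, there exists a matrix $M(x)$ and vector $m(x)$ such that the time derivatives of the outputs follow
$$\big[y_1^{(r_1)}, \dots, y_k^{(r_k)}\big]^\top = M^{-1}(x)\, u + m(x). \tag{5}$$
Presuming the invertibility of the so-called “decoupling matrix” $M^{-1}(x)$, we may design the following feedback linearizing control law as a function of both state $x$ and an auxiliary input $z$:
$$u(x, z) = M(x)\big(z - m(x)\big), \tag{6}$$
which renders the input-output dynamics linear in the new auxiliary inputs $z$:
$$\big[y_1^{(r_1)}, \dots, y_k^{(r_k)}\big]^\top = z. \tag{7}$$
Note that, as for $u$ in (1), we shall consider $z$ to be a concatenation of auxiliary inputs for each player, with $z = [z_1^\top, \dots, z_N^\top]^\top$.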
IV-B Change of coordinates
We have seen how a careful choice of feedback linearizing controller renders the dynamics of the output and its derivatives linear. Define the state of this linear system as $\xi$, the vector that stacks each output $y_j$ together with its first $r_j - 1$ time derivatives. Just as there is a bijective map (6) between control $u$ and auxiliary input $z$ whenever the decoupling matrix $M^{-1}(x)$ is invertible, there is also a bijection $x = \lambda(\xi)$ between state $x$ and linear system state $\xi$ [sastry1999nonlinear] because (4) is full-state feedback linearizable. We shall use both bijective maps (and their derivatives) in Section V to rewrite costs (2) in terms of the linearized dynamics (7).
In this section we present our main contribution, a computationally stable and efficient algorithm for identifying local Nash equilibria of general-sum games with feedback linearizable dynamics. We begin in Section V-A by computing a feedback linearizing controller for unicycle dynamics, which we shall use as a running example throughout the paper. Then, in Section V-B we show how to transform the costs for each player to depend upon linear system state $\xi$ and auxiliary inputs $z$ rather than state $x$ and controls $u$. In Section V-C we introduce the main algorithm, and finally in Section V-D we summarize the effects of using feedback linearization.
V-A Feedback linearization by example
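Consider the following (single player) 4D unicycle dynamical model:
$$\dot{p}_x = v \cos\theta, \qquad \dot{p}_y = v \sin\theta, \qquad \dot{\theta} = \omega, \qquad \dot{v} = a, \tag{8}$$
representing the evolution of the positions $p_x$ and $p_y$, the orientation $\theta$, and speed $v$. The inputs $\omega$ and $a$ represent the angular rate and the acceleration. By taking time derivatives of the output $y = [p_x, p_y]^\top$ following the procedure from Section IV, we obtain the new set of states for the linearized system $\xi = [p_x, \dot{p}_x, p_y, \dot{p}_y]^\top$. Differentiation reveals that
$$\begin{bmatrix} \ddot{p}_x \\ \ddot{p}_y \end{bmatrix} = \begin{bmatrix} -v \sin\theta & \cos\theta \\ v \cos\theta & \sin\theta \end{bmatrix} \begin{bmatrix} \omega \\ a \end{bmatrix}. \tag{9}$$
From this result, we compute the inverse decoupling matrix and drift term as
$$M(x) = \begin{bmatrix} -\sin\theta / v & \cos\theta / v \\ \cos\theta & \sin\theta \end{bmatrix}, \qquad m(x) = 0. \tag{10}$$
Finally, we can also explicitly derive the state conversion map (valid for $v > 0$)
$$x = \lambda(\xi) = \Big[p_x,\; p_y,\; \operatorname{atan2}(\dot{p}_y, \dot{p}_x),\; \sqrt{\dot{p}_x^2 + \dot{p}_y^2}\Big]^\top. \tag{11}$$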
Now, consider a differential game with two players, each of whom independently follows dynamics (8). The inverse decoupling matrix and the Jacobian of the state conversion map for the full system will be block diagonal.
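The unicycle example can be checked numerically. The following is a minimal sketch, assuming the state ordering $x = (p_x, p_y, \theta, v)$, the linearized state ordering $\xi = (p_x, \dot{p}_x, p_y, \dot{p}_y)$, and $v > 0$; all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def inverse_decoupling_matrix(x):
    """M(x) for the unicycle: maps z = (px_ddot, py_ddot) to u = (omega, a)."""
    _, _, theta, v = x
    if abs(v) < 1e-9:
        # The decoupling matrix has determinant -v, so it is singular at v = 0.
        raise ValueError("feedback linearization is singular at zero speed")
    s, c = np.sin(theta), np.cos(theta)
    return np.array([[-s / v, c / v], [c, s]])

def linearizing_control(x, z):
    """u = M(x) (z - m(x)), with drift m(x) = 0 for the unicycle."""
    return inverse_decoupling_matrix(x) @ np.asarray(z, dtype=float)

def output_acceleration(x, u):
    """(px_ddot, py_ddot) obtained by differentiating the unicycle outputs twice."""
    _, _, theta, v = x
    omega, a = u
    s, c = np.sin(theta), np.cos(theta)
    return np.array([a * c - v * omega * s, a * s + v * omega * c])

def to_linearized_state(x):
    """xi = (px, px_dot, py, py_dot)."""
    px, py, theta, v = x
    return np.array([px, v * np.cos(theta), py, v * np.sin(theta)])

def from_linearized_state(xi):
    """State conversion map x = lambda(xi); assumes v > 0."""
    px, vx, py, vy = xi
    return np.array([px, py, np.arctan2(vy, vx), np.hypot(vx, vy)])
```

Applying `linearizing_control` and then `output_acceleration` should return the requested auxiliary input, which is exactly the statement that the input-output dynamics are linear in $z$. For two independent players, stacking two copies of `inverse_decoupling_matrix` along the diagonal gives the block diagonal structure described above.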
V-B Transforming costs
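So far, we have introduced feedback linearization and shown how to derive the mappings from auxiliary input $z$ to control $u$ and linearized system state $\xi$ to state $x$. To exploit the feedback linearizable structure of (4) when solving the game, we must rewrite running costs $\ell_i(t; x, u_1, \dots, u_N)$ in terms of $\xi$ and $z$. Overloading notation, we shall denote the transformed running costs as
$$\ell_i(t; \xi, z_1, \dots, z_N) \equiv \ell_i\big(t; \lambda(\xi), u_1(\lambda(\xi), z), \dots, u_N(\lambda(\xi), z)\big), \tag{12}$$
where $u(x, z)$ is given in (6) and $x = \lambda(\xi)$ is the state conversion map.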
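Section V-C presents our main algorithm; a core step will be to compute first and second derivatives of each player’s running cost with respect to the new state $\xi$ and inputs $z_i$. This may be done efficiently using the chain rule and exploiting known sparsity patterns for particular systems and costs. For completeness, however, we shall ignore sparsity and illustrate computing the first derivative of $\ell_i$ with respect to the $j$th dimension of $\xi$, denoted $\xi_j$:
$$\frac{\partial \ell_i}{\partial \xi_j} = \sum_{k} \frac{\partial \ell_i}{\partial x_k} \frac{\partial \lambda_k}{\partial \xi_j} + \sum_{p=1}^{N} \sum_{q} \frac{\partial \ell_i}{\partial u_{p,q}} \sum_{k} \frac{\partial u_{p,q}}{\partial x_k} \frac{\partial \lambda_k}{\partial \xi_j}, \tag{13}$$
where $u_{p,q}$ is the $q$th entry of the $p$th player’s control input.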
Second derivatives may be computed similarly, though again we stress that for specific dynamics and cost functions it is often much more efficient to exploit the a priori known sparsity of partial derivatives. Interestingly, we also observe that the terms arising from the second sum in (13), which account for the state-dependence of the feedback linearizing controllers (6), are often negligible in practice and may be dropped without significant impact on solution quality.
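To make the chain rule concrete, the sketch below assembles the gradient of a transformed cost with respect to $\xi$ for the single-player unicycle and compares it against central finite differences. The running cost (a speed penalty plus a control penalty), the nominal speed `V_NOM`, and all function names are assumptions made for illustration; they are not the costs used in the experiments.

```python
import numpy as np

V_NOM = 2.0  # hypothetical nominal speed

def state_from_xi(xi):
    # x = (px, py, theta, v) from xi = (px, px_dot, py, py_dot); assumes v > 0.
    px, vx, py, vy = xi
    return np.array([px, py, np.arctan2(vy, vx), np.hypot(vx, vy)])

def control_from(x, z):
    # u = (omega, a) = M(x) z for the unicycle (drift term is zero).
    _, _, theta, v = x
    s, c = np.sin(theta), np.cos(theta)
    return np.array([(-s * z[0] + c * z[1]) / v, c * z[0] + s * z[1]])

def cost(x, u):
    # Illustrative running cost in the original coordinates.
    return (x[3] - V_NOM) ** 2 + 0.5 * (u @ u)

def transformed_cost(xi, z):
    x = state_from_xi(xi)
    return cost(x, control_from(x, z))

def transformed_cost_gradient(xi, z):
    """Gradient of the transformed cost in xi, built from chain-rule factors."""
    _, vx, _, vy = xi
    x = state_from_xi(xi)
    _, _, theta, v = x
    s, c = np.sin(theta), np.cos(theta)
    omega, a = control_from(x, z)
    dl_dx = np.array([0.0, 0.0, 0.0, 2.0 * (v - V_NOM)])  # direct state dependence
    dl_du = np.array([omega, a])
    du_dx = np.array([  # state dependence of the linearizing controller
        [0.0, 0.0, (-c * z[0] - s * z[1]) / v, -omega / v],
        [0.0, 0.0, -s * z[0] + c * z[1], 0.0],
    ])
    dx_dxi = np.array([  # Jacobian of the state conversion map
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, -vy / v**2, 0.0, vx / v**2],
        [0.0, vx / v, 0.0, vy / v],
    ])
    return dx_dxi.T @ (dl_dx + du_dx.T @ dl_du)

def finite_difference_gradient(xi, z, h=1e-6):
    grad = np.zeros(4)
    for j in range(4):
        step = np.zeros(4)
        step[j] = h
        grad[j] = (transformed_cost(xi + step, z) - transformed_cost(xi - step, z)) / (2 * h)
    return grad
```

The term `du_dx.T @ dl_du` is the contribution that accounts for the state dependence of the controller; dropping it corresponds to the approximation discussed above.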
V-C Core algorithm
Like the original iterative LQ game algorithm, we proceed from a set of initial strategies for each player (understood now to map from time and linearized state $\xi$ to auxiliary input $z_i$) and iteratively refine them by solving LQ approximations. Our main contribution, therefore, lies in the transformation of the game itself into the coordinates which correspond to feedback linearized dynamics. As we shall see in Section VI, iterative LQ approximations are much more stable in the transformed coordinates and converge at least as quickly.
Algorithm 1 outlines the major steps in the resulting algorithm. We begin at the given initial condition $\xi(0)$ for the linearized system and initial strategies $\gamma_i^0$ for each player. Note that these strategies define control laws for the linearized system, i.e. $z_i(t) = \gamma_i^0(t, \xi(t))$.
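At each iteration, we first (Algorithm 1, line 1) integrate the linearized dynamics (7) forward to obtain the current operating point $(\hat{\xi}(\cdot), \hat{z}_1(\cdot), \dots, \hat{z}_N(\cdot))$. Then (Algorithm 1, line 1), we compute a quadratic approximation to each player’s running cost in terms of the variations $\delta\xi = \xi - \hat{\xi}$ and $\delta z_j = z_j - \hat{z}_j$,
$$\ell_i(t; \xi, z_1, \dots, z_N) - \ell_i(t; \hat{\xi}, \hat{z}_1, \dots, \hat{z}_N) \approx \tfrac{1}{2}\, \delta\xi^\top \big(Q_i\, \delta\xi + 2 l_i\big) + \tfrac{1}{2} \sum_{j=1}^{N} \delta z_j^\top \big(R_{ij}\, \delta z_j + 2 r_{ij}\big), \tag{14}$$
using the chain rule as in (13) to compute the terms $l_i, Q_i$ and $r_{ij}, R_{ij}$ for each player.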
Equipped with linear dynamics (7) and quadratic costs (14), the solution of the resulting general-sum LQ game is given by a set of coupled Riccati differential equations, which may be derived from the first order necessary conditions of optimality for each player [basar1999dynamic, Chapter 6]. In practice (Algorithm 1, line 1), we numerically solve these equations in discrete time using a time step of $\Delta t$. If a solution exists at the $k$th iteration, it is known to take the form
$$\gamma_i^{k}(t, \xi) = \hat{z}_i(t) - P_i^k(t)\, \delta\xi(t) - \alpha_i^k(t) \tag{15}$$
for matrix $P_i^k(t)$ and vector $\alpha_i^k(t)$ [basar1999dynamic, Corollary 6.1].
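As an illustration of the discrete-time solve, the sketch below runs the coupled Riccati recursion for a feedback Nash equilibrium of a finite-horizon LQ game. It is a simplified stand-in rather than the implementation used in the experiments: it assumes time-invariant matrices, a discretized linear system $\xi_{t+1} = A \xi_t + \sum_i B_i z_i$, purely quadratic costs (no linear terms, so the affine part of each strategy vanishes), and a terminal cost equal to the state cost. The function name and argument layout are hypothetical.

```python
import numpy as np

def solve_lq_game(A, Bs, Qs, Rs, horizon):
    """Feedback Nash gains for xi_{t+1} = A xi_t + sum_i B_i z_i.

    Player i pays 0.5 * sum_t (xi' Q_i xi + sum_j z_j' R_ij z_j), with
    Rs[i][j] = R_ij. Returns gains[t][i] such that z_i = -gains[t][i] @ xi.
    """
    n_players = len(Bs)
    dims = [B.shape[1] for B in Bs]
    offsets = np.concatenate(([0], np.cumsum(dims)))
    Zs = [Q.copy() for Q in Qs]  # value matrices, initialized at the final time
    gains = []
    for _ in range(horizon):  # march backward in time
        # Stack every player's first-order optimality condition into S P = Y.
        S = np.vstack([
            np.hstack([
                Bs[i].T @ Zs[i] @ Bs[j] + (Rs[i][i] if i == j else 0.0)
                for j in range(n_players)
            ])
            for i in range(n_players)
        ])
        Y = np.vstack([Bs[i].T @ Zs[i] @ A for i in range(n_players)])
        P = np.linalg.solve(S, Y)
        Ps = [P[offsets[i]:offsets[i + 1]] for i in range(n_players)]
        # Closed-loop dynamics under all players' gains.
        F = A - sum(Bs[i] @ Ps[i] for i in range(n_players))
        # Update each player's value matrix.
        Zs = [
            Qs[i] + F.T @ Zs[i] @ F
            + sum(Ps[j].T @ Rs[i][j] @ Ps[j] for j in range(n_players))
            for i in range(n_players)
        ]
        gains.append(Ps)
    return gains[::-1]  # reorder so that gains[0] applies at the initial time
```

The linear system `S P = Y` couples all players' gains, which is what distinguishes this recursion from running a separate LQR solve per player. It can fail to have a solution when `S` is singular, matching the caveat that a solution need not exist at every iteration.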
Without further assumptions on the curvature and convexity of running costs $\ell_i$, however, we cannot simply use these strategies at the $(k+1)$th iteration or we risk diverging. In fact, these costs are generally nonconvex when expressed in terms of $\xi$ and $z_i$ (12), which necessitates some care in updating strategies. To address this issue (Algorithm 1, line 1), we follow a common practice in the ILQR and sequential quadratic programming literature (e.g., [tassa2014control]) and introduce a step size parameter $\eta \in (0, 1]$:
$$\gamma_i^{k}(t, \xi) = \hat{z}_i(t) - P_i^k(t)\, \delta\xi(t) - \eta\, \alpha_i^k(t). \tag{16}$$
Observe that, taking $\eta = 0$ and recalling that $\xi(0) = \hat{\xi}(0)$, we recover the previous open-loop control signal $\hat{z}_i(\cdot)$. Taking $\eta = 1$, we recover the LQ solution from this iteration (15). As is common in the literature, we perform a backtracking linesearch on $\eta$, starting with initial value $\eta_0$ and terminating when the trajectory that results from (16) satisfies a trust region constraint at level $\epsilon$. In our experiments, we use an $L_\infty$ constraint, i.e.
$$\|\xi(t) - \hat{\xi}(t)\|_\infty < \epsilon, \quad \forall t,$$
and also check to ensure that $M(x)$ exists at each time.
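The linesearch can be sketched as follows, under several simplifying assumptions: the linearized dynamics are discretized as $\xi_{t+1} = A \xi_t + B z_t$ with all players' auxiliary inputs stacked into one vector, the previous operating point is dynamically consistent, and the step size is halved on each failure. The names `rollout`, `backtracking_linesearch`, and `is_regular` are hypothetical; `is_regular` stands in for the check that the feedback linearization is nonsingular.

```python
import numpy as np

def rollout(A, B, xi_hat, z_hat, P, alpha, eta):
    """Simulate z_t = z_hat_t - P_t (xi_t - xi_hat_t) - eta * alpha_t.

    Shapes: xi_hat (T+1, n), z_hat (T, m), P (T, m, n), alpha (T, m).
    """
    xi = np.empty_like(xi_hat)
    z = np.empty_like(z_hat)
    xi[0] = xi_hat[0]  # same initial condition as the previous trajectory
    for t in range(len(z_hat)):
        z[t] = z_hat[t] - P[t] @ (xi[t] - xi_hat[t]) - eta * alpha[t]
        xi[t + 1] = A @ xi[t] + B @ z[t]
    return xi, z

def backtracking_linesearch(A, B, xi_hat, z_hat, P, alpha,
                            eta0=1.0, epsilon=1.0, shrink=0.5, max_halvings=20,
                            is_regular=lambda xi: True):
    """Shrink eta until the new trajectory stays within the trust region."""
    eta = eta0
    for _ in range(max_halvings):
        xi, z = rollout(A, B, xi_hat, z_hat, P, alpha, eta)
        # L-infinity trust region: every state coordinate at every time step.
        within_trust_region = np.max(np.abs(xi - xi_hat)) < epsilon
        if within_trust_region and all(is_regular(point) for point in xi):
            return eta, xi, z
        eta *= shrink
    # eta = 0 reproduces the previous operating point.
    return 0.0, xi_hat.copy(), z_hat.copy()
```

Because the rollout uses the linear dynamics directly, each trial step costs only one forward pass; no relinearization is needed between trials.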
V-D Effect of feedback linearization
We conclude this section with several remarks about the theoretical soundness of our approach and the overall impact of exploiting feedback linearization.
Remark 1 (Criterion for Convergence to Local Nash Equilibrium). Suppose that Algorithm 1 converges to strategies $(\gamma_1^*, \dots, \gamma_N^*)$. Then, from [fridovich2019efficient, Theorem 1] and presuming the invertibility of the decoupling matrix $M^{-1}$, we have that if Hessians at convergence, then the open-loop controllers defined by comprise a local Nash equilibrium in open-loop strategies for the original system. That is, taking
Remark 2 (Benefits of Feedback Linearization). In comparison to the non-feedback linearizable case, the linearized dynamics (7) are trajectory- (and hence iteration-) independent. That is, in the non-feedback linearizable case [fridovich2019efficient], each iteration begins by constructing a Jacobian linearization of dynamics (1); this is superfluous in our case. As a consequence, large changes in auxiliary input $z$ between iterations, which lead to large changes in state trajectory, are trivially consistent with the feedback linearized dynamics (7). By contrast, a large change in control $u$ may take the nonlinear dynamics (1) far away from the previous Jacobian linearization, which causes the algorithm from [fridovich2019efficient] to be fairly sensitive to step size $\eta$ and trust region size $\epsilon$. We study this sensitivity more carefully in Section VI.
Remark 3 (Drawbacks of Feedback Linearization). While many systems of interest (e.g., manipulators, cars, and quadrotors) are feedback linearizable, this is not true of all systems. Additionally, there are two major drawbacks of our algorithm. First, we must take care to avoid singularities (regions in which $M(x)$ does not exist), especially when constructing the costs. Second, and more importantly, the transformed costs $\ell_i(t; \xi, z_1, \dots, z_N)$ may have much more varied, extreme curvature than the original costs $\ell_i(t; x, u_1, \dots, u_N)$. In some cases, this can make Algorithm 1 sensitive to linesearch parameters $\eta_0$ and $\epsilon$, even offsetting the benefits from Remark 2. We defer further discussion and empirical study to Section VI.
In this section, we study the empirical performance of Algorithm 1. In Section VI-A, we quantify the improvements in algorithmic stability from Remark 2 for an intersection scenario. In Section VI-B, we discuss a case in which the extreme curvature of the transformed cost alluded to in Remark 3 causes Algorithm 1 to converge very slowly. In practice, however, this is not necessarily a serious problem. In Section VI-C, we redesign this problematic cost function to depend explicitly upon the linearized state $\xi$ rather than the original state $x$ without changing the semantic character of equilibria.
VI-A Improvements in solver stability
To showcase the benefits of our feedback linearization-based approach, we study the empirical sensitivity of solutions to the initial step size $\eta_0$ and trust region size $\epsilon$ hyperparameters from Section V-C. We shall consider a three-player intersection example and compare the strategies identified by Algorithm 1 with those identified on the original dynamics, using the algorithm from [fridovich2019efficient]. Here, two cars, modeled with bicycle dynamics
(with inter-axle distance and inputs controlling front wheel rate and controlling jerk), and a pedestrian modeled with dynamics (8) navigate an intersection. Like (8), bicycle dynamics (18) are feedback linearizable in the outputs . We place quadratic penalties on each player’s distance from the appropriate lane center and from a fixed goal location, as well as on the difference between speed and a fixed nominal speed . Players are also penalized quadratically within a fixed distance of one another.
In order to assess the quality of a trajectory generated by a particular algorithm, we define the following metric:
Here, we take to be the equilibrium trajectory which that algorithm ideally converges to. The norm measures Euclidean distance only in the dimensions. Trajectories that diverge or converge to unreasonable solutions yield high values for , while trajectories that closely match incur low values.
We fix the initial conditions and cost weights identically for both algorithms. Thus, any trajectory identified by the solver will solely be a function of the initial step size and trust region size . Therefore, we will overload the penalty metric notation as . Given this metric we study the quality of solutions over the ranges and , and test uniformly sampled pairs.
Fig. 2 displays the sampled pairs over the space of $\eta_0$ and $\epsilon$. For clarity, we set a success threshold and color “successful” pairs blue and “unsuccessful” pairs red. Fig. 3 shows histograms of solution quality for each algorithm, with a horizontal line denoting threshold . We observe that solving the game using feedback linearization converges much more reliably than solving it for the original nonlinear system. Moreover, for converged trajectories with low -value, the average computation time was s (mean standard deviation) for our method and for the baseline.
VI-B Sensitivity to transformed cost landscape
Unfortunately, these results do not generalize to all games. As per Remark 3, in some cases the cost landscape gets much more complicated when expressed in linearized system coordinates $\xi$. For example, a simple quadratic penalty on a single player’s speed difference from nominal $v_{\text{nom}}$ in (8) is nonconvex and non-smooth near the origin of $(\dot{p}_x, \dot{p}_y)$ when expressed as a function of linearized system state $\xi$:
$$(v - v_{\text{nom}})^2 = \Big(\sqrt{\dot{p}_x^2 + \dot{p}_y^2} - v_{\text{nom}}\Big)^2. \tag{20}$$
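Both properties are easy to confirm numerically. The short sketch below evaluates the speed penalty as a function of the velocity components $(\dot{p}_x, \dot{p}_y)$; the function names and the choice of test points are illustrative assumptions.

```python
import numpy as np

def speed_cost(vx, vy, v_nom):
    """Quadratic speed penalty written in linearized coordinates."""
    return (np.hypot(vx, vy) - v_nom) ** 2

def violates_midpoint_convexity(v_nom):
    """Compare the cost at the origin with the average of two zero-cost points."""
    left = speed_cost(-v_nom, 0.0, v_nom)   # moving backward at nominal speed
    right = speed_cost(v_nom, 0.0, v_nom)   # moving forward at nominal speed
    middle = speed_cost(0.0, 0.0, v_nom)    # midpoint of the two: at rest
    return middle > 0.5 * (left + right)

def one_sided_slopes_at_origin(v_nom, h=1e-6):
    """Left and right difference quotients along the vx axis at zero velocity."""
    at_rest = speed_cost(0.0, 0.0, v_nom)
    right = (speed_cost(h, 0.0, v_nom) - at_rest) / h
    left = (at_rest - speed_cost(-h, 0.0, v_nom)) / h
    return left, right
```

A convex function can never exceed the average of its endpoint values at the midpoint, so a `True` result from `violates_midpoint_convexity` certifies nonconvexity. The one-sided slopes have opposite signs, which shows the kink at zero velocity.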
Consequences vary; the effect is negligible in the intersection example from Fig. 3, but it is more significant in the roundabout example below in Section VI-C, where cars must slow down before turning into the roundabout.
VI-C Designing costs directly for the linearized system
Fortunately, in practical settings of interest it is typically straightforward to design smooth, semantically equivalent costs explicitly as functions of the linearized system coordinates $\xi$. For example, we can replace the nominal speed cost of (20) with a time-varying quadratic penalty in that player’s position $(p_x, p_y)$:
$$\big\| (p_x(t), p_y(t)) - p_{\text{lane}}(v_{\text{nom}} t) \big\|^2, \tag{21}$$
where $p_{\text{lane}}(v_{\text{nom}} t)$ defines the point on the lane center a distance $v_{\text{nom}} t$ from the initial condition.
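One way such a penalty might be implemented is sketched below. It assumes the lane center is given as a polyline of waypoints that starts at the player's initial position, that the reference point advances along it at the nominal speed, and that $\xi = (p_x, \dot{p}_x, p_y, \dot{p}_y)$; the function names and the polyline representation are hypothetical.

```python
import numpy as np

def point_on_lane(waypoints, distance):
    """Point at arc length `distance` along a polyline, clamped to its ends."""
    waypoints = np.asarray(waypoints, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(waypoints, axis=0), axis=1)
    arc_length = np.concatenate(([0.0], np.cumsum(segment_lengths)))
    d = np.clip(distance, 0.0, arc_length[-1])
    # Interpolate each coordinate linearly in arc length.
    return np.array([
        np.interp(d, arc_length, waypoints[:, 0]),
        np.interp(d, arc_length, waypoints[:, 1]),
    ])

def progress_cost(xi, t, waypoints, v_nom):
    """Squared distance between position and a reference moving at v_nom."""
    px, _, py, _ = xi
    reference = point_on_lane(waypoints, v_nom * t)
    return (px - reference[0]) ** 2 + (py - reference[1]) ** 2
```

At any fixed time this cost is a convex quadratic in $\xi$ with a constant Hessian, so the quadratic approximation used by the solver is exact for this term.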
We demonstrate the effectiveness of this substitution in two examples—merging into a roundabout, and overtaking a lead vehicle—in which the original cost (20) led to instability in Algorithm 1. In both cases, we also use simple quadratic penalties for (rather than transforming into linearized coordinates), albeit with different weightings. Results for the roundabout merging and overtaking examples are shown in Figures 4 and 5, respectively. From the 324 samples in each (drawn from expanded ranges ), we see that Algorithm 1 converged more frequently than the method of [fridovich2019efficient]. Moreover, when successful, the average computational time in the roundabout example was s for our method and s for the baseline. Runtimes for the overtaking example were s (ours) and s (baseline). Observe how runtimes for our approach cluster more tightly around the mean, indicating a more reliable convergence rate.
We have presented a novel algorithm for identifying local Nash equilibria in differential games with feedback linearizable dynamics. Our method works by repeatedly solving LQ games in the linearized system coordinates, rather than in the original system coordinates. By working with the linearized system, our algorithm becomes less sensitive to parameters such as initial step size and trust region size, which often leads it to converge more quickly. Our method is fully general, i.e. any cost expressed in terms of nonlinear system coordinates may also be expressed in terms of linearized coordinates, which implies sufficient conditions for fixed points of our algorithm to be local Nash equilibria. However, in some cases transforming costs in this way makes the cost landscape extremely complicated. In such cases, it is often possible to design semantically equivalent replacement costs directly in the linearized coordinates. We test our method in a variety of competitive traffic scenarios. Using appropriately redesigned costs when necessary, our experiments confirm the computational stability and efficiency of our approach.