# Convergence of Learning Dynamics in Stackelberg Games

This paper investigates the convergence of learning dynamics in Stackelberg games. In the class of games we consider, there is a hierarchical game being played between a leader and a follower with continuous action spaces. We show that in zero-sum games, the only stable attractors of the Stackelberg gradient dynamics are Stackelberg equilibria. This insight allows us to develop a gradient-based update for the leader that converges to Stackelberg equilibria in zero-sum games and the set of stable attractors in general-sum games. We then consider a follower employing a gradient-play update rule instead of a best response strategy and propose a two-timescale algorithm with similar asymptotic convergence results. For this algorithm, we also provide finite-time high probability bounds for local convergence to a neighborhood of a stable Stackelberg equilibrium in general-sum games.


## 1 Introduction

Tools from game theory now play a prominent role in machine learning. The emerging coupling between the fields can be credited to the formulation of learning problems as interactions between competing algorithms and the desire to characterize the limiting behaviors of such strategic interactions. Indeed, game theory provides a systematic framework to model the strategic interactions found in modern machine learning problems.

A significant portion of the game theory literature concerns games of simultaneous play and equilibrium analysis. In simultaneous play games, each player reveals the strategy they have selected concurrently. The solution concept often adopted in non-cooperative simultaneous play games is the Nash equilibrium. In a Nash equilibrium, the strategy of each player is a best response to the joint strategy of the competitors so that no player can benefit from unilaterally deviating from this strategy.

The study of equilibrium gives rise to the question of when and why the observed play in a game can be expected to correspond to an equilibrium. A common explanation is that an equilibrium emerges as the long run outcome of a process in which players repeatedly play a game and compete for optimality over time [16]. Consequently, a fundamental question in the study of learning in games is the convergence behavior of interacting learning algorithms. It is often viewed as a desirable property of such algorithms to converge to a Nash equilibrium. For this reason, much of the work in this field is geared toward designing learning rules that converge to Nash equilibria in broad classes of games [16]. In the process of this endeavor, many negative results for algorithms and classes of games have been established that characterize the limits of what is achievable under this solution concept [40, 28, 30, 6].

The classic objective of learning in games is now being widely embraced in the machine learning community. While not all-encompassing, the prevailing research areas epitomizing this phenomenon are adversarial training and multi-agent learning. A substantial amount of interest has been given to Generative Adversarial Networks (GANs) [17]. Finding Nash equilibria in GANs is challenging, and there is a surge of interest in developing principled training algorithms for this purpose [32, 31, 19, 29, 3, 27]. The majority of these works attempt to use second-order gradient information to speed up convergence. In our work, we draw connections to this literature and believe that the problem we study gives an unexplored perspective that may provide valuable insights moving forward.

Seeking equilibria in multi-agent learning gained prominence much earlier than adversarial training. However, following initial works on this topic [25, 18, 20], scrutiny was given to the solution concepts being considered [41] and the field cooled. Owing to the arising applications with interacting agents, problems of this form are being studied extensively again. There has also been a shift toward analyzing gradient-based learning rules and convergence analysis [43, 3, 24, 15, 28].

The progress in analyzing learning dynamics and seeking equilibria in games is promising, but the work has been narrowly focused on simultaneous play games and the Nash equilibrium solution concept. There are many problems exhibiting a hierarchical order of play between agents in a diverse set of fields. Examples include human-robot collaboration and interacting autonomous systems in artificial intelligence [33, 26, 14, 38], incentive design and control [13, 35, 36], and organizational structures in economics [10, 2]. In game theory, this type of game is known as a Stackelberg game and the solution concept studied is called a Stackelberg equilibrium.

In the simplest formulation of a Stackelberg game, there is a leader and a follower that interact in a hierarchical structure. The sequential order of play is such that the leader is endowed with the power to select an action to which the follower can then respond. In a Stackelberg equilibrium, the follower plays a best response to the strategy of the leader and the leader uses this knowledge to its advantage when selecting a strategy. As we highlight in this paper, the ability of the leader to act before the follower can give the leader a distinct advantage.

In this paper, we study the convergence of learning dynamics in Stackelberg games. Our motivation stems from the emergence of problems in which there is a distinct order of play between interacting learning agents and the lack of existing theoretical convergence guarantees in this domain. The rigorous study of the learning dynamics in Stackelberg games we provide also has implications for simultaneous play games relevant to adversarial training. The insights we discover come as a direct consequence of taking up a viewpoint deviating from that appearing in the resurgent literature on learning in games.

##### Contributions

A novelty of our work is the exploration of a topic relevant to adversarial training and multi-agent learning that has not been sufficiently scrutinized. Before summarizing our contributions, we mention a few papers that have similar objectives in mind. The recent work of Jin et al. [21] proposes a local minmax equilibrium notion that is similar to the Stackelberg equilibrium notion we adopt. However, the results in that work do not bear a strong resemblance to ours since the problem is analyzed without noise and under constant step-sizes. It is also worth pointing out that the multi-agent learning papers of Foerster et al. [15] and Letcher et al. [24] do in some sense seek to give a player an advantage, but nevertheless focus on the Nash equilibrium concept in any analysis that is provided. The following is a summary of our contributions:


• We show that stable Nash equilibria are Stackelberg equilibria in zero-sum games. Moreover, there exist stable attractors of the gradient dynamics that are Stackelberg equilibria and not Nash equilibria.

• We demonstrate that the only stable attractors of the Stackelberg gradient dynamics are Stackelberg equilibria in zero-sum games. This allows us to define a gradient-based learning rule for the leader that converges to Stackelberg equilibria in zero-sum games and the set of stable attractors in general-sum games.

• We consider a follower that uses a gradient-play update rule instead of an exact best response strategy and propose a two-timescale algorithm to learn Stackelberg equilibria. We show almost sure asymptotic convergence to Stackelberg equilibria in zero-sum games and to stable attractors in general-sum games; a finite-time high probability bound for local convergence to a neighborhood of a stable Stackelberg equilibrium in general-sum games is also given.

We present this paper with a single leader and a single follower, but this is only for ease of presentation. The extension to followers that play in a staggered hierarchical structure or simultaneously is in Appendix C; equivalent results hold with some additional assumptions.

##### Organization.

In Section 2, we formalize the problem we study and provide background material on Stackelberg games. We then draw connections between learning in Stackelberg games and existing work in zero-sum and general-sum games relevant to GANs and multi-agent learning, respectively. In Section 3, we give a rigorous convergence analysis of learning in Stackelberg games. Numerical examples are provided in Section 4 and we conclude in Section 5.

## 2 Preliminaries

We leverage the rich theory of continuous games and dynamical systems in order to analyze algorithms implemented by agents interacting in a hierarchical game. In particular, each agent has an objective they want to selfishly optimize which depends not only on their own actions but also on the actions of their competitor. However, there is an order of play in the sense that one player is the leader and the other player is the follower. (While we present the work for a single leader and a single follower, the theory extends to the multi-follower case, which we discuss in Appendix C, and to the case where the single leader abstracts multiple cooperating agents.) The leader then optimizes its objective with the knowledge that the follower will respond by selecting a best response. We refer to algorithms for learning in this setting as hierarchical learning algorithms. We specifically consider a class of learning algorithms in which the agents act myopically with respect to their given objective and role in the underlying hierarchical game by following the gradient of their objective with respect to their choice variable.

To concretize ideas, consider a game between two agents where one agent is deemed the leader and the other the follower. The leader has cost $f_1: X \to \mathbb{R}$ and the follower has cost $f_2: X \to \mathbb{R}$, where $X = X_1 \times X_2$ with the action space of the leader being $X_1$ and the action space of the follower being $X_2$. The designation of ‘leader’ and ‘follower’ indicates the order of play between the two agents, meaning the leader plays first and the follower second. The leader and the follower need not be cooperative. Such a game is known as a Stackelberg game.

### 2.1 Stackelberg Games

Let us adopt the typical game theoretic notation in which the player index set is $\mathcal{I}$ and $x_{-i}$ denotes the joint action profile of all agents excluding agent $i$. In the Stackelberg case, $\mathcal{I} = \{1, 2\}$, where player 1 is the leader and player 2 is the follower. We assume throughout that each $f_i$ is sufficiently smooth, meaning $f_i \in C^q(X, \mathbb{R})$ for some $q \geq 2$ and for each $i \in \mathcal{I}$.

The leader aims to solve the optimization problem given by

$$\min_{x_1 \in X_1} \left\{ f_1(x_1, x_2) \;\middle|\; x_2 \in \arg\min_{y \in X_2} f_2(x_1, y) \right\}$$

and the follower aims to solve the optimization problem $\min_{x_2 \in X_2} f_2(x_1, x_2)$. As noted above, the learning algorithms we study are such that the agents follow myopic update rules which take steps in the direction of steepest descent with respect to the above two optimization problems, the former for the leader and the latter for the follower.

Before formalizing these updates, let us first discuss the equilibrium concept studied for simultaneous play games and contrast it with that which is studied in the hierarchical play counterpart. The typical equilibrium notion in continuous games is the pure strategy Nash equilibrium in simultaneous play games and the Stackelberg equilibrium in hierarchical play games. Each notion of equilibria can be characterized as the intersection points of the reaction curves of the players [4].

Definition (Nash Equilibrium). The joint strategy $x^* \in X$ is a Nash equilibrium if for each $i \in \mathcal{I}$,

$$f_i(x^*) \leq f_i(x_i, x_{-i}^*), \quad \forall\, x_i \in X_i.$$

The strategy is a local Nash equilibrium on $W \subset X$ if for each $i \in \mathcal{I}$,

$$f_i(x^*) \leq f_i(x_i, x_{-i}^*), \quad \forall\, x_i \in W_i \subset X_i.$$

Definition (Stackelberg Equilibrium). In a two-player game with player 1 as the leader, a strategy $x_1^* \in X_1$ is called a Stackelberg equilibrium strategy for the leader if

$$\sup_{x_2 \in \mathcal{R}(x_1^*)} f_1(x_1^*, x_2) \leq \sup_{x_2 \in \mathcal{R}(x_1)} f_1(x_1, x_2), \quad \forall\, x_1 \in X_1,$$

where $\mathcal{R}(x_1) = \{y \in X_2 \mid f_2(x_1, y) \leq f_2(x_1, x_2),\ \forall\, x_2 \in X_2\}$ is the rational reaction set of $x_2$. This definition naturally extends to the $n$-follower setting when $\mathcal{R}(x_1)$ is replaced with the set of Nash equilibria $\mathrm{NE}(x_1)$, given that player 1 is playing $x_1$ so that the follower’s reaction set is a Nash equilibrium.

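We denote by $D_i f_i$ the derivative of $f_i$ with respect to $x_i$ and by $D(\cdot)$ the total derivative. (For example, given a function $f(x, r(x))$, $Df = D_1 f + D_2 f \, \partial r / \partial x$.) Denote by

$$\omega(x) = (D_1 f_1(x), D_2 f_2(x))$$

the vector of individual gradients for simultaneous play and

$$\omega_{\mathcal{S}}(x) = (D f_1(x), D_2 f_2(x))$$

as the equivalent for hierarchical play, where $Df_1$ is the total derivative of $f_1$ with respect to $x_1$ and $x_2$ is implicitly a function of $x_1$, which captures the fact that the leader operates under the assumption that the follower will play a best response to its choice of $x_1$.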

It is possible to characterize a local Nash equilibrium using sufficient conditions for Definition 2.1.

Definition (Differential Nash Equilibrium [34]). The joint strategy $x^* \in X$ is a differential Nash equilibrium if $D_i f_i(x^*) = 0$ and $D_i^2 f_i(x^*) > 0$ for each $i \in \mathcal{I}$.

Analogous sufficient conditions can be stated which characterize a local Stackelberg equilibrium strategy for the leader using first and second order conditions on the leader’s optimization problem. Indeed, if $Df_1(x_1^*, r(x_1^*)) = 0$ and $D^2 f_1(x_1^*, r(x_1^*))$ is positive definite, then $x_1^*$ is a local Stackelberg equilibrium strategy for the leader. We use these sufficient conditions to define the following refinement of the Stackelberg equilibrium concept.

Definition (Differential Stackelberg Equilibrium). The pair $(x_1^*, x_2^*) \in X$ with $x_2^* = r(x_1^*)$, where $r$ is implicitly defined by $D_2 f_2(x_1^*, x_2^*) = 0$, is a differential Stackelberg equilibrium for the game $(f_1, f_2)$ with player 1 as the leader if $Df_1(x_1^*, r(x_1^*)) = 0$ and $D^2 f_1(x_1^*, r(x_1^*))$ is positive definite.

We utilize these local characterizations in terms of first and second order conditions to formulate the myopic hierarchical learning algorithms we study. Indeed, following the preceding discussion, consider the learning rule for each player to be given by

$$x_{i,k+1} = x_{i,k} - \gamma_{i,k}\left(\omega_{\mathcal{S},i}(x_k) + w_{i,k+1}\right), \tag{1}$$

where recall that $\omega_{\mathcal{S}}(x) = (Df_1(x), D_2 f_2(x))$ and the notation $\omega_{\mathcal{S},i}$ indicates the entry of $\omega_{\mathcal{S}}$ corresponding to the $i$-th player. Moreover, $\{\gamma_{i,k}\}$ is the sequence of learning rates and $\{w_{i,k}\}$ is the noise process for player $i$, both of which satisfy the usual assumptions from the theory of stochastic approximation provided in detail in Section 3. We note that the component $\omega_{\mathcal{S},i}(x_k) + w_{i,k+1}$ of the update captures the case in which each agent does not have oracle access to $\omega_{\mathcal{S},i}$, but instead has an unbiased estimator for it. The given update formalizes the class of learning algorithms we study in this paper.

##### Leader-Follower Timescale Separation.

We require a timescale separation between the leader and the follower: the leader is assumed to be learning at a slower rate than the follower so that $\gamma_{1,k} = o(\gamma_{2,k})$. The reason for this timescale separation is that the leader’s update is formulated using the reaction curve of the follower. In the gradient-based learning setting considered, the reaction curve can be characterized by the set of critical points of $f_2(x_{1,k}, \cdot)$ that have a local positive definite structure in the direction of $x_2$, which is

$$\{x_2 \mid D_2 f_2(x_{1,k}, x_2) = 0,\ D_2^2 f_2(x_{1,k}, x_2) \geq 0\}.$$

This set can be characterized in terms of an implicit map $r$, defined by the leader’s belief that the follower is playing a best response to its choice at each iteration, which would imply $D_2 f_2(x_{1,k}, x_{2,k}) = 0$, and under sufficient regularity conditions the implicit mapping theorem [23] gives rise to the implicit map $r: U \to X_2$ on a neighborhood $U \subset X_1$ of $x_{1,k}$. Formalized in Section 3, we note that when $r$ is defined uniformly in $x_1$ on the domain for which convergence is being assessed, the update in (1) is well-defined in the sense that the component of the derivative $Df_1$ corresponding to the implicit dependence of the follower’s action on $x_1$ via $r$ is well-defined and locally consistent. In particular, for a given point $x = (x_1, x_2)$ such that $D_2 f_2(x_1, x_2) = 0$ with $D_2^2 f_2(x)$ an isomorphism, the implicit function theorem implies there exists an open set $U \subset X_1$ such that there exists a unique continuously differentiable function $r: U \to X_2$ such that $r(x_1) = x_2$ and $D_2 f_2(x_1, r(x_1)) = 0$ for all $x_1 \in U$. Moreover,

$$Dr(x_1) = -(D_2^2 f_2)^{-1}(x_1, r(x_1)) \, D_{21} f_2(x_1, r(x_1))$$

on $U$. Thus, in the limit of the two-timescale setting, the leader sees the follower as having equilibrated (i.e., $D_2 f_2 \equiv 0$) so that

$$\begin{aligned} Df_1(x_1, x_2) &= D_1 f_1(x_1, x_2) + D_2 f_1(x_1, x_2) \, Dr(x_1) \\ &= D_1 f_1(x_1, x_2) - D_2 f_1(x_1, x_2) \, (D_2^2 f_2)^{-1}(x_1, x_2) \, D_{21} f_2(x_1, x_2). \end{aligned}$$

The map $r$ is an implicit representation of the follower’s reaction curve.
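As an illustration of the implicit derivative and the total derivative above, the following Python sketch evaluates $\omega$ and $\omega_{\mathcal{S}}$ on a scalar quadratic game. The game $f_1(x_1, x_2) = \tfrac{1}{2} x_1^2 + x_1 x_2 - x_1$, $f_2(x_1, x_2) = \tfrac{1}{2} x_2^2 - x_1 x_2$ and all function names are assumptions chosen for illustration; they do not appear in the paper. For this game $r(x_1) = x_1$ and $Dr = 1$, and the sketch compares the formula for $Df_1$ with a finite difference of $x_1 \mapsto f_1(x_1, r(x_1))$.

```python
# Illustrative scalar game (an assumption, not taken from the paper):
#   f1(x1, x2) = 0.5*x1**2 + x1*x2 - x1   (leader cost)
#   f2(x1, x2) = 0.5*x2**2 - x1*x2        (follower cost)

def f1(x1, x2):
    return 0.5 * x1**2 + x1 * x2 - x1

def d1_f1(x1, x2):   # D_1 f_1
    return x1 + x2 - 1.0

def d2_f1(x1, x2):   # D_2 f_1
    return x1

def d2_f2(x1, x2):   # D_2 f_2
    return x2 - x1

def d22_f2(x1, x2):  # D_2^2 f_2
    return 1.0

def d21_f2(x1, x2):  # D_{21} f_2, the derivative of D_2 f_2 with respect to x1
    return -1.0

def reaction(x1):
    # Follower reaction r(x1): the solution of D_2 f_2(x1, x2) = 0.
    return x1

def implicit_derivative(x1, x2):
    # Dr = -(D_2^2 f_2)^{-1} D_{21} f_2
    return -d21_f2(x1, x2) / d22_f2(x1, x2)

def total_derivative_f1(x1, x2):
    # Df_1 = D_1 f_1 + D_2 f_1 * Dr
    return d1_f1(x1, x2) + d2_f1(x1, x2) * implicit_derivative(x1, x2)

def omega(x1, x2):
    # Individual gradients for simultaneous play.
    return (d1_f1(x1, x2), d2_f2(x1, x2))

def omega_s(x1, x2):
    # Hierarchical counterpart: the leader uses the total derivative.
    return (total_derivative_f1(x1, x2), d2_f2(x1, x2))

def finite_difference_along_reaction(x1, h=1e-6):
    # Central difference of the leader's cost restricted to the reaction curve.
    return (f1(x1 + h, reaction(x1 + h)) - f1(x1 - h, reaction(x1 - h))) / (2.0 * h)

x1 = 0.7
print("Df1 (formula):          ", total_derivative_f1(x1, reaction(x1)))
print("Df1 (finite difference):", finite_difference_along_reaction(x1))
```

For this particular game, $\omega$ vanishes at $(1/2, 1/2)$ while $\omega_{\mathcal{S}}$ vanishes at $(1/3, 1/3)$, so the two vector fields have different critical points.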

##### Overview of analysis techniques.

The following describes the general approach to studying the hierarchical learning dynamics in (1). The purpose of this overview is to provide the reader with the high-level architecture of the analysis approach.

The analysis techniques we employ combine tools from dynamical systems theory with the theory of stochastic approximation. In particular, we leverage the limiting continuous time dynamical systems derived from (1) to characterize concentration bounds for iterates or samples generated by (1). We note that the hierarchical learning update in (1) with timescale separation has a limiting dynamical system that takes the form of a singularly perturbed dynamical system given by

$$\begin{aligned} \dot{x}_1(t) &= -\tau \, Df_1(x_1(t), x_2(t)) \\ \dot{x}_2(t) &= -D_2 f_2(x_1(t), x_2(t)) \end{aligned} \tag{2}$$

where, in the limit as $\tau \to 0$, the above approximates (1).

The limiting dynamical system has known convergence properties (asymptotic convergence in a region of attraction for a locally asymptotically stable attractor). Such convergence properties can be translated in some sense to the discrete time system by comparing pseudo-trajectories (in this case, linear interpolations between sample points of the update process) generated by sample points of (1) and the limiting system flow for initializations containing the set of sample points of (1). Indeed, the limiting dynamical system is then used to generate flows initialized from the sample points generated by (1). Creating pseudo-trajectories, we then bound the probability that the pseudo-trajectories deviate by some small amount from the limiting dynamical system flow over each continuous time interval between the sample points. A concentration bound can be constructed by taking a union bound over all the time intervals after a finite time after which we can guarantee the sample path has entered the region of attraction on which we can produce a Lyapunov function for the continuous time dynamical system. The analysis in this paper is based on the above high-level idea.
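To make the update in (1) with timescale separation concrete, the following is a minimal simulation sketch. The scalar game $f_1 = \tfrac{1}{2} x_1^2 + x_1 x_2 - x_1$, $f_2 = \tfrac{1}{2} x_2^2 - x_1 x_2$ (for which $Df_1 = 2 x_1 + x_2 - 1$ and the differential Stackelberg equilibrium is $(1/3, 1/3)$), the step-size schedules $\gamma_{1,k} = 1/(k+10)$ and $\gamma_{2,k} = 1/(k+10)^{0.6}$, the Gaussian noise, and the function names are all assumptions made for illustration rather than choices taken from the paper.

```python
import random

def total_derivative_f1(x1, x2):
    # Df_1 = D_1 f_1 + D_2 f_1 * Dr = (x1 + x2 - 1) + x1 * 1 for the assumed game.
    return 2.0 * x1 + x2 - 1.0

def d2_f2(x1, x2):
    # D_2 f_2 for the assumed follower cost f2 = 0.5*x2**2 - x1*x2.
    return x2 - x1

def run_two_timescale(num_steps=20000, noise_std=0.1, seed=0):
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    for k in range(num_steps):
        gamma1 = 1.0 / (k + 10)          # leader: slower timescale
        gamma2 = 1.0 / (k + 10) ** 0.6   # follower: faster, gamma1 / gamma2 -> 0
        w1 = rng.gauss(0.0, noise_std)   # zero-mean noise on the leader's gradient
        w2 = rng.gauss(0.0, noise_std)   # zero-mean noise on the follower's gradient
        # Both players update from the same iterate x_k, as in (1).
        new_x1 = x1 - gamma1 * (total_derivative_f1(x1, x2) + w1)
        new_x2 = x2 - gamma2 * (d2_f2(x1, x2) + w2)
        x1, x2 = new_x1, new_x2
    return x1, x2

final_x1, final_x2 = run_two_timescale()
print("final iterate:", final_x1, final_x2)
```

Both schedules are non-summable and square-summable, and their ratio tends to zero, which is the separation the limiting system (2) relies on as $\tau \to 0$.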

### 2.2 Connections and Implications

Before presenting convergence analysis of the update in (1), we draw some connections to application domains (including adversarial learning, where zero-sum game abstractions have been recently touted for finding robust parameter configurations for neural networks, and opponent shaping in multi-agent learning) and equilibrium concepts commonly used in these domains. Let us first remind the reader of some common definitions from dynamical systems theory.

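Given a sufficiently smooth function $f \in C^q(X, \mathbb{R})$, a critical point $x^*$ of $f$ is said to be stable if for all $t_0 \geq 0$ and $\varepsilon > 0$, there exists $\delta(t_0, \varepsilon)$ such that

$$x_0 \in B_\delta(x^*) \implies x(t) \in B_\varepsilon(x^*), \quad \forall\, t \geq t_0.$$

Further, $x^*$ is said to be asymptotically stable if $x^*$ is additionally attractive, that is, for all $t_0 \geq 0$, there exists $\delta(t_0)$ such that

$$x_0 \in B_\delta(x^*) \implies \lim_{t \to \infty} \|x(t) - x^*\| = 0.$$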

A critical point is said to be non-degenerate if the determinant of the Jacobian of the dynamics at the critical point is non-zero. For a non-degenerate critical point, the Hartman-Grobman theorem [39] enables us to check the eigenvalues of the Jacobian to determine asymptotic stability. In particular, at a non-degenerate critical point, if the eigenvalues of the Jacobian are in the open left-half complex plane, then the critical point is asymptotically stable. The dynamical systems we study in this paper are of the form $\dot{x} = -F(x)$ for some vector field $F$ determined by the gradient based update rules employed by the agents. Hence, to determine if a critical point is stable, we simply need to check that the spectrum of the Jacobian of $F$ is in the open right-half complex plane.

For the dynamics $\dot{x} = -\omega(x)$, let $J(x)$ denote the Jacobian of the vector field $\omega(x)$. Similarly, for the dynamics $\dot{x} = -\omega_{\mathcal{S}}(x)$, let $J_{\mathcal{S}}(x)$ denote the Jacobian of the vector field $\omega_{\mathcal{S}}(x)$. Then, we say a differential Nash equilibrium $x$ of a continuous game with corresponding individual gradient vector field $\omega$ is stable if $\mathrm{spec}(J(x)) \subset \mathbb{C}_+^\circ$, where $\mathrm{spec}(\cdot)$ denotes the spectrum of its argument and $\mathbb{C}_+^\circ$ denotes the open right-half complex plane. Similarly, we say a differential Stackelberg equilibrium $x$ is stable if $\mathrm{spec}(J_{\mathcal{S}}(x)) \subset \mathbb{C}_+^\circ$.
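The spectral condition above can be checked numerically. The sketch below does so for $2 \times 2$ Jacobians using the closed-form eigenvalues of a $2 \times 2$ matrix. The two matrices are the Jacobians of $\omega$ and $\omega_{\mathcal{S}}$ for an assumed scalar game, $f_1 = \tfrac{1}{2} x_1^2 + x_1 x_2 - x_1$ and $f_2 = \tfrac{1}{2} x_2^2 - x_1 x_2$; the game and the helper names are illustrative assumptions, not part of the paper.

```python
import cmath

def eigenvalues_2x2(m):
    # Closed-form eigenvalues of a real 2x2 matrix [[a, b], [c, d]].
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return ((trace + disc) / 2.0, (trace - disc) / 2.0)

def in_open_right_half_plane(m):
    # True when every eigenvalue has strictly positive real part.
    return all(ev.real > 0.0 for ev in eigenvalues_2x2(m))

# omega(x)   = (x1 + x2 - 1, x2 - x1)    for the assumed game
# omega_S(x) = (2*x1 + x2 - 1, x2 - x1)  for the assumed game
J = [[1.0, 1.0], [-1.0, 1.0]]
J_S = [[2.0, 1.0], [-1.0, 1.0]]

print("spec(J):  ", eigenvalues_2x2(J))
print("spec(J_S):", eigenvalues_2x2(J_S))
print("stable under simultaneous play: ", in_open_right_half_plane(J))
print("stable under hierarchical play: ", in_open_right_half_plane(J_S))
```

Because the dynamics are written as $\dot{x} = -F(x)$, the check is on the right-half plane for the Jacobian of $F$ rather than the left-half plane for the Jacobian of the dynamics.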

#### 2.2.1 Implications for Zero-Sum Settings

Zero-sum games are a very special class since there is a strong connection between Nash equilibria and Stackelberg equilibria.

Proposition. Stable differential Nash equilibria in continuous zero-sum games are differential Stackelberg equilibria. That is, given a zero-sum game $(f, -f)$ defined by a sufficiently smooth function $f \in C^q(X, \mathbb{R})$ with $q \geq 2$, a stable differential Nash equilibrium $x$ is a differential Stackelberg equilibrium.

###### Proof.

Consider an arbitrary sufficiently smooth zero-sum game $(f, -f)$ on continuous strategy spaces. Suppose $x$ is a stable differential Nash equilibrium so that by definition $D_1^2 f(x) > 0$, $-D_2^2 f(x) > 0$, and

$$J(x) = \begin{bmatrix} D_1^2 f(x) & D_{12} f(x) \\ -D_{21} f(x) & -D_2^2 f(x) \end{bmatrix} > 0.$$

Then, the Schur complement of $J(x)$ is also positive definite:

$$D_1^2 f(x) - D_{21} f(x)^\top (D_2^2 f(x))^{-1} D_{21} f(x) > 0.$$

Hence, $x$ is a differential Stackelberg equilibrium since the Schur complement of $J$ is exactly the derivative $D^2 f$ at critical points and $-D_2^2 f(x) > 0$ since $x$ is a differential Nash equilibrium. ∎

In the zero-sum setting, the fact that Nash equilibria are a subset of Stackelberg equilibria (or minimax equilibria) for finite games is well-known [4]. We show the result for the notion of differential Stackelberg equilibria for continuous action space games that we introduce. It is interesting to point out that for a subclass of zero-sum continuous games with a convex-concave structure for the leader’s cost, the sets of (differential) Nash and (differential) Stackelberg equilibria coincide. Indeed, in convex-concave games the first-order conditions $D_1 f(x) = 0$ and $D_2 f(x) = 0$ at a critical point are sufficient for each player’s optimality, so that if $x$ is a differential Stackelberg equilibrium, it is also a Nash equilibrium.

In recent work on GANs [32], hierarchical learning of a nature similar to that proposed in this paper is studied in the context of zero-sum games. The result of Proposition 2.2.1 says that two-timescale gradient-based procedures for GANs, in which the generator and the discriminator update their parameters following their individual gradients with the generator on a slower timescale, lead to Stackelberg equilibria. It is worth studying whether the distortion of the vector field from the timescale separation produces more efficient equilibria. Empirically, GANs learned with such procedures seem to outperform gradient descent with uniform stepsizes [32].

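Proposition. Consider the class of continuous zero-sum games defined by $f \in C^q(X, \mathbb{R})$ for some $q \geq 2$. Stable attractors $x$ of $\dot{x} = -\omega(x)$ at which $D_1^2 f(x)$ and $-D_2^2 f(x)$ are non-degenerate and either $D_1^2 f(x)$ or $-D_2^2 f(x)$ is positive definite are differential Stackelberg equilibria and attractors of $\dot{x} = -\omega_{\mathcal{S}}(x)$.

###### Proof.

Without loss of generality, let $-D_2^2 f(x) > 0$. Since $x$ is a stable attractor, the Jacobian $J(x)$ of $\omega$ is positive definite. Hence, with the fact that $-D_2^2 f(x) > 0$, the Schur complement of $J(x)$ is positive definite:

$$D_1^2 f(x) - D_{21} f(x)^\top (D_2^2 f(x))^{-1} D_{21} f(x) > 0.$$

Thus, $x$ is a differential Stackelberg equilibrium. Moreover, since

$$D_1(Df)(x) = \mathrm{schur}(J)(x) = D_1^2 f(x) - D_{21} f(x)^\top (D_2^2 f(x))^{-1} D_{21} f(x) > 0,$$

the Jacobian of the Stackelberg limiting dynamics with player 1 as the leader,

$$J_{\mathcal{S}}(x) = \begin{bmatrix} D_1(Df)(x) & 0 \\ -D_{21} f(x) & -D_2^2 f(x) \end{bmatrix}, \tag{3}$$

is positive definite. The structure of $J_{\mathcal{S}}$ at critical points follows from the fact that $D_2(Df)(x) = 0$ since $D_2 f(x) = 0$ at critical points. ∎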

This result implies that some of the non-Nash attractors of $\dot{x} = -\omega(x)$ are in fact Stackelberg equilibria which, in the case of GANs, may be desirable equilibria to find as suggested by the success of the techniques and implementation proposed in Metz et al. [32]. This is a surprising result to some extent since recent works such as Mazumdar et al. [27] propose schemes to avoid such attractors because they have been classified or viewed as being undesirable. This further suggests that techniques such as those proposed in Mazumdar et al. [27] requiring strong coordination between players may be relaxed to require less coordination if Stackelberg equilibria are acceptable for the application.
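A small numerical instance of a non-Nash attractor that is a differential Stackelberg equilibrium can be sketched as follows. The zero-sum game $f(x_1, x_2) = -\tfrac{1}{2} x_1^2 + 2 x_1 x_2 - x_2^2$ with critical point at the origin is an assumption chosen for illustration and is not an example from the paper; the variable names are likewise hypothetical. The blocks of $J$ and $J_{\mathcal{S}}$ follow the matrix structure used in the proofs above.

```python
import cmath

def eigenvalues_2x2(m):
    # Closed-form eigenvalues of a real 2x2 matrix [[a, b], [c, d]].
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return ((trace + disc) / 2.0, (trace - disc) / 2.0)

# Second derivatives of the assumed f(x1, x2) = -0.5*x1**2 + 2*x1*x2 - x2**2.
d11_f = -1.0   # D_1^2 f
d12_f = 2.0    # D_{12} f
d21_f = 2.0    # D_{21} f
d22_f = -2.0   # D_2^2 f

# Jacobian of omega = (D_1 f, -D_2 f) for simultaneous play.
J = [[d11_f, d12_f], [-d21_f, -d22_f]]

# Schur complement: D_1^2 f - D_{21} f^T (D_2^2 f)^{-1} D_{21} f.
schur = d11_f - d21_f * (1.0 / d22_f) * d21_f

# Jacobian of the Stackelberg limiting dynamics at the critical point, as in (3).
J_S = [[schur, 0.0], [-d21_f, -d22_f]]

is_attractor_simultaneous = all(ev.real > 0.0 for ev in eigenvalues_2x2(J))
is_differential_nash = d11_f > 0.0 and -d22_f > 0.0
is_differential_stackelberg = schur > 0.0 and -d22_f > 0.0
is_attractor_stackelberg = all(ev.real > 0.0 for ev in eigenvalues_2x2(J_S))

print("attractor of simultaneous play:     ", is_attractor_simultaneous)
print("differential Nash equilibrium:      ", is_differential_nash)
print("differential Stackelberg equilibrium:", is_differential_stackelberg)
print("attractor of Stackelberg dynamics:  ", is_attractor_stackelberg)
```

In this instance the leader's own second-order condition fails ($D_1^2 f < 0$), so the origin is not a differential Nash equilibrium, yet the Schur complement is positive and the origin attracts both sets of dynamics.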

Proposition. For zero-sum games, all stable attractors of $\dot{x} = -\omega_{\mathcal{S}}(x)$ are differential Stackelberg equilibria.

###### Proof.

The result follows directly from the structure of the Jacobian $J_{\mathcal{S}}(x)$. ∎

The result of Proposition 2.2.1 implies that with appropriately chosen stepsizes the update rule in (1) will only converge to Stackelberg equilibria and thus, unlike simultaneous play individual gradient descent (known as gradient-play in the game theory literature), will not converge to spurious locally asymptotically stable attractors of the dynamics that are not relevant to the underlying game. This means the hierarchical learning dynamics will not converge to non-Stackelberg equilibria in zero-sum games.

#### 2.2.2 Connections to Opponent Shaping

Beyond the work in zero-sum games and applications to GANs, there has also been recent work, which we will refer to as ‘opponent shaping’, where one or more players take into account their opponents’ response to their actions [24, 15, 43]. The initial work of Foerster et al. [15] bears the most resemblance to the learning algorithms studied in this paper. The update rule (LOLA) considered there (in the deterministic setting with constant stepsizes) takes the following form:

$$\begin{aligned} x_1^+ &= x_1 - \gamma_1 \left( D_1 f_1(x) - \gamma_2 D_2 f_1(x)^\top D_{21} f_2(x) \right) \\ x_2^+ &= x_2 - \gamma_2 D_2 f_2(x) \end{aligned}$$

The attractors of these dynamics are not necessarily Nash equilibria nor are they Stackelberg equilibria as can be seen by looking at the critical points of the dynamics. Indeed, the LOLA dynamics lead only to Nash or non-Nash stable attractors of the limiting dynamics. The effect of the additional ‘look-ahead’ term is simply that it changes the vector field and region of attraction for stable critical points. In the zero-sum case, however, the critical points of the above are the same as those of simultaneous play individual gradient updates, yet the Jacobian is not the same and it is still possible to converge to a non-Nash attractor.

With a few modifications, the above update rule can be massaged into a form which more closely resembles the hierarchical learning rules we study in this paper. In particular, if instead of $\gamma_2$, player 2 employed a Newton stepsize of $(D_2^2 f_2)^{-1}$, then the update would look like

$$\begin{aligned} x_1^+ &= x_1 - \gamma_1 \left( D_1 f_1(x) - D_2 f_1(x)^\top (D_2^2 f_2)^{-1}(x) D_{21} f_2(x) \right) \\ x_2^+ &= x_2 - \gamma_2 D_2 f_2(x) \end{aligned}$$

which resembles a deterministic version of (1). The critical points of this update coincide with the critical points of a Stackelberg game $(f_1, f_2)$. With appropriately chosen stepsizes and with an initialization in a region on which the implicit map, which defines the $-(D_2^2 f_2)^{-1}(x) D_{21} f_2(x)$ component of the update, is well-defined uniformly in $x_1$, the above dynamics will converge to Stackelberg equilibria. In this paper, we provide an in-depth convergence analysis for the stochastic setting of the above update. (In [15], the authors do not provide convergence analysis; they do in their extension, yet only for constant and uniform stepsizes and for a learning rule that is different than the one studied in this paper as all players are conjecturing about the behavior of their opponents. This distinguishes the present work from their setting.)
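The difference between the two deterministic updates above can be seen on a small example. The sketch below iterates both on an assumed scalar game, $f_1 = \tfrac{1}{2} x_1^2 + x_1 x_2 - x_1$ and $f_2 = \tfrac{1}{2} x_2^2 - x_1 x_2$, with constant stepsizes $\gamma_1 = 0.05$ and $\gamma_2 = 0.1$. The game, the stepsizes, and the names `iterate` and `leader_scale` are illustrative assumptions; `leader_scale` stands for the factor multiplying $D_2 f_1 \, D_{21} f_2$, which is $\gamma_2$ in the LOLA form and $(D_2^2 f_2)^{-1}$ in the Newton form.

```python
def d1_f1(x1, x2):   # D_1 f_1
    return x1 + x2 - 1.0

def d2_f1(x1, x2):   # D_2 f_1
    return x1

def d2_f2(x1, x2):   # D_2 f_2
    return x2 - x1

D21_F2 = -1.0        # D_{21} f_2 (constant for this quadratic game)
D22_F2 = 1.0         # D_2^2 f_2 (constant for this quadratic game)

def iterate(leader_scale, gamma1=0.05, gamma2=0.1, num_steps=2000):
    # Deterministic updates with constant stepsizes, starting from the origin.
    x1, x2 = 0.0, 0.0
    for _ in range(num_steps):
        leader_grad = d1_f1(x1, x2) - leader_scale * d2_f1(x1, x2) * D21_F2
        follower_grad = d2_f2(x1, x2)
        x1, x2 = x1 - gamma1 * leader_grad, x2 - gamma2 * follower_grad
    return x1, x2

GAMMA2 = 0.1
lola_point = iterate(leader_scale=GAMMA2, gamma2=GAMMA2)             # look-ahead scaled by gamma_2
stackelberg_point = iterate(leader_scale=1.0 / D22_F2, gamma2=GAMMA2)  # Newton scaling

print("LOLA fixed point:       ", lola_point)
print("Stackelberg fixed point:", stackelberg_point)
```

In this example the LOLA iteration settles at $x_1 = x_2 = 1/(2 + \gamma_2)$, which depends on the follower's stepsize, whereas the Newton-scaled iteration settles at the Stackelberg point $x_1 = x_2 = 1/3$.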

#### 2.2.3 Comparing Nash and Stackelberg Equilibrium Cost

We have alluded to the idea that the ability to act first gives the leader a distinct advantage over the follower in a hierarchical game. We now formalize this statement with a known result that compares the cost of the leader at Nash and Stackelberg equilibrium.

Proposition ([4, Proposition 4.4]). Consider an arbitrary sufficiently smooth two-player general-sum game $(f_1, f_2)$ on continuous strategy spaces. Let $f_1^{N}$ denote the infimum of all Nash equilibrium costs for player 1 and $f_1^{S}$ denote an arbitrary Stackelberg equilibrium cost for player 1. Then, if $\mathcal{R}(x_1)$ is a singleton for every $x_1 \in X_1$, $f_1^{S} \leq f_1^{N}$.

This result says that the leader never favors the simultaneous play game instead of the hierarchical play game in two-player general-sum games with unique follower responses. On the other hand, the follower may or may not prefer the simultaneous play game over the hierarchical play game.

The fact that under certain conditions the leader can obtain lower cost under a Stackelberg equilibrium compared to any Nash equilibrium may provide further explanation for the success of the methods in [32]. Commonly, the discriminator can overpower the generator when training a GAN [32] and giving the generator an advantage may mitigate this problem. In the context of multi-agent learning, the advantage of the leader in hierarchical games leads to the question of how the roles of each player in a game are decided. While we do not focus on this question, it is worth noting that when each player mutually benefits from the leadership of a player the solution is called concurrent and when each player prefers to be the leader the solution is called non-concurrent. We believe that exploring classes of games in which each solution concept arises is an interesting direction of future work.
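The cost comparison can be illustrated on a small example. The sketch below computes the leader's cost at the Nash and Stackelberg equilibria of an assumed scalar game, $f_1 = \tfrac{1}{2} x_1^2 + x_1 x_2 - x_1$ and $f_2 = \tfrac{1}{2} x_2^2 - x_1 x_2$, whose follower reaction set is the singleton $\{x_1\}$. The game, the stepsize, and the function names are illustrative assumptions; both equilibria are found by plain deterministic gradient iterations rather than by any method from the paper.

```python
def f1(x1, x2):
    # Assumed leader cost.
    return 0.5 * x1**2 + x1 * x2 - x1

def reaction(x1):
    # Unique follower best response: argmin over x2 of 0.5*x2**2 - x1*x2.
    return x1

def nash_by_gradient_play(step=0.1, num_steps=2000):
    # Simultaneous individual gradient descent on (f1, f2).
    x1, x2 = 0.0, 0.0
    for _ in range(num_steps):
        g1 = x1 + x2 - 1.0   # D_1 f_1
        g2 = x2 - x1         # D_2 f_2
        x1, x2 = x1 - step * g1, x2 - step * g2
    return x1, x2

def stackelberg_by_leader_descent(step=0.1, num_steps=2000):
    # Leader descends the total derivative while the follower best responds.
    x1 = 0.0
    for _ in range(num_steps):
        x2 = reaction(x1)
        total_grad = (x1 + x2 - 1.0) + x1 * 1.0   # Df_1 with Dr = 1
        x1 = x1 - step * total_grad
    return x1, reaction(x1)

nash_point = nash_by_gradient_play()
stackelberg_point = stackelberg_by_leader_descent()
nash_cost = f1(*nash_point)
stackelberg_cost = f1(*stackelberg_point)

print("Nash point:       ", nash_point, "leader cost:", nash_cost)
print("Stackelberg point:", stackelberg_point, "leader cost:", stackelberg_cost)
```

For this game the leader's cost is $-1/8$ at the Nash equilibrium $(1/2, 1/2)$ and $-1/6$ at the Stackelberg equilibrium $(1/3, 1/3)$, consistent with the inequality in the proposition.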

## 3 Convergence Analysis

Following the preceding discussion, consider the learning rule for each player to be given by

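$$x_{i,k+1} = x_{i,k} - \gamma_{i,k}\left(\omega_{\mathcal{S},i}(x_k) + w_{i,k+1}\right), \tag{4}$$

where recall that $\omega_{\mathcal{S}}(x) = (Df_1(x), D_2 f_2(x))$. Moreover, for each $i \in \mathcal{I}$, $\{\gamma_{i,k}\}$ is the sequence of learning rates and $\{w_{i,k}\}$ is the noise process for player $i$. As before, suppose player 1 is the leader and conjectures that player 2 updates its action $x_2$ in each round via $r(x_1)$. This setting captures the scenario in which players do not have oracle access to their gradients, but do have an unbiased estimator. As an example, players could be performing policy gradient reinforcement learning or alternative gradient-based learning schemes. Let $\dim(X_i) = d_i$ for each $i \in \mathcal{I}$ and $d = d_1 + d_2$.

Assumption 3. The following hold:

1. The maps $Df_1: \mathbb{R}^d \to \mathbb{R}^{d_1}$, $D_2 f_2: \mathbb{R}^d \to \mathbb{R}^{d_2}$ are $L_1$, $L_2$ Lipschitz, and $\|Df_1\| \leq M_1 < \infty$.

2. For each $i \in \mathcal{I}$, the learning rates satisfy $\sum_k \gamma_{i,k} = \infty$, $\sum_k \gamma_{i,k}^2 < \infty$.

3. The noise processes $\{w_{i,k}\}$ are zero mean, martingale difference sequences. That is, given the filtration $\mathcal{F}_k = \sigma(x_s, w_{1,s}, w_{2,s},\ s \leq k)$, $\{w_{i,k+1}\}_{i \in \mathcal{I}}$ are conditionally independent, $\mathbb{E}[w_{i,k+1} \mid \mathcal{F}_k] = 0$ a.s., and $\mathbb{E}[\|w_{i,k+1}\| \mid \mathcal{F}_k] \leq c_i (1 + \|x_k\|)$ a.s. for some constants $c_i \geq 0$, $i \in \mathcal{I}$.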

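Before diving into the convergence analysis, we need some machinery from dynamical systems theory. Consider the dynamics from (4) written as a continuous time combined system $\dot{\xi}_t(z) = F(\xi_t(z))$, where $\xi_t(z) = \xi(t, z)$ is a continuous map and $\xi = \{\xi_t\}_{t \in \mathbb{R}}$ is the flow of $F$. A set $A$ is said to be invariant under the flow $\xi$ if for all $t \in \mathbb{R}$, $\xi_t(A) \subset A$, in which case $\xi|_A$ denotes the semi-flow. A point $x$ is an equilibrium if $\xi_t(x) = x$ for all $t$ and, of course, when $\xi$ is induced by $F$, equilibria coincide with critical points of $F$. Let $X$ be a topological metric space with metric $\rho$, an example being $X = \mathbb{R}^d$ endowed with the Euclidean distance. A nonempty invariant set $A \subset X$ for $\xi$ is said to be internally chain transitive if for any $a, b \in A$ and $\delta > 0$, $T > 0$, there exists a finite sequence $\{x_1 = a, x_2, \ldots, x_{k-1}, x_k = b;\ t_1, \ldots, t_{k-1}\}$ with $x_i \in A$ and $t_i \geq T$, $1 \leq i \leq k-1$, such that $\rho(\xi_{t_i}(x_i), x_{i+1}) < \delta$, $\forall\, 1 \leq i \leq k-1$.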

### 3.1 Learning Stackelberg Solutions for the Leader

Suppose that the leader (player 1) operates under the assumption that the follower (player 2) is playing a local optimum in each round. That is, given , for which is a first-order local optimality condition. If, for a given , is invertible and , then the implicit function theorem implies that there exist neighborhoods and and a smooth map such that . For every , has a globally asymptotically stable equilibrium uniformly in and is –Lipschitz. Consider the leader’s learning rule

$$x_{1,k+1} = x_{1,k} - \gamma_{1,k}\big(Df_1(x_{1,k}, x_{2,k}) + w_{1,k+1}\big) \tag{5}$$

where is defined via the map defined implicitly in a neighborhood of .

Suppose that for each , is non-degenerate and Assumption 3 holds for . Then, converges almost surely to a (possibly sample path dependent) equilibrium point which is a local Stackelberg solution for the leader. Moreover, if Assumption 3 holds for and Assumption 3.1 holds, then so that is a differential Stackelberg equilibrium.

###### Proof.

This proof follows primarily from using known stochastic approximation results. The update rule in (5) is a stochastic approximation of and consequently is expected to track this ODE asymptotically. The main idea behind the analysis is to construct a continuous interpolated trajectory for and show it asymptotically almost surely approaches the solution set to the ODE. Under Assumptions 33.2.1, results from [9, §2.1] imply that the sequence generated from (5) converges almost surely to a compact internally chain transitive set of . Furthermore, it can be observed that the only internally chain transitive invariant sets of the dynamics are differential Stackelberg equilibria since at any stable attractor of the dynamics and from assumption . Finally, from [9, §2.2], we can conclude that the update from (5) almost surely converges to a possibly sample path dependent equilibrium point since the only internally chain transitive invariant sets for  are equilibria. The final claim that is guaranteed since is Lipschitz and . ∎

The above result can be stated with a relaxed version of Assumption 3.1. Given a differential Stackelberg equilibrium , let for some on which is non-degenerate. Suppose that Assumption 3 holds for and that . Then, converges almost surely to . Moreover, if Assumption 3 holds for , is a locally asymptotically stable equilibrium uniformly in on the ball , and , then . The proof follows the same arguments as the proof of Proposition 3.1.
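The leader's update (5) can be sketched on a small example. Everything specific below is an assumption made for illustration: the scalar quadratic game, the Gaussian noise, the step sizes $\gamma_{1,k} = 1/(k+1)$, and the expression used for the total derivative, which is the standard implicit-differentiation formula $Df_1 = D_1 f_1 - D_2 f_1\,(D_2^2 f_2)^{-1} D_{21} f_2$ written for scalars (the excerpt does not display the paper's own expression).

```python
import random


# Hypothetical scalar quadratic game (an assumption, not from the paper):
#   leader cost   f1(x1, x2) = 0.5 * (x1 - 1)**2 + x1 * x2
#   follower cost f2(x1, x2) = 0.5 * x2**2 - x1 * x2
def d1_f1(x1, x2):
    return (x1 - 1.0) + x2


def d2_f1(x1, x2):
    return x1


def d22_f2(x1, x2):
    return 1.0  # second derivative of f2 in x2 (positive, hence invertible)


def d21_f2(x1, x2):
    return -1.0  # derivative of D2 f2 = x2 - x1 with respect to x1


def best_response(x1):
    # Solves the first-order condition D2 f2(x1, x2) = x2 - x1 = 0 for x2.
    return x1


def total_derivative_f1(x1, x2):
    """Df1 = D1 f1 + D2 f1 * dr/dx1, with dr/dx1 = -D21 f2 / D22 f2."""
    dr_dx1 = -d21_f2(x1, x2) / d22_f2(x1, x2)
    return d1_f1(x1, x2) + d2_f1(x1, x2) * dr_dx1


def run_leader(x1=2.0, num_steps=20000, noise_std=0.1, seed=0):
    """Noisy leader update with the follower at its local optimum."""
    rng = random.Random(seed)
    for k in range(num_steps):
        x2 = best_response(x1)   # follower plays its local optimum
        gamma = 1.0 / (k + 1)    # assumed step-size schedule
        grad = total_derivative_f1(x1, x2) + rng.gauss(0.0, noise_std)
        x1 = x1 - gamma * grad
    return x1, best_response(x1)


if __name__ == "__main__":
    print(run_leader())
```

In this example the reduced cost $f_1(x_1, r(x_1)) = \tfrac{1}{2}(x_1 - 1)^2 + x_1^2$ is minimized at $x_1 = 1/3$, so the iterates should settle near $(1/3, 1/3)$. The simultaneous-gradient (Nash) critical point of the same game is $(1/2, 1/2)$, where the leader's cost is higher.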

### 3.2 Learning Stackelberg Equilibria: Two-Timescale Analysis

Now, let us consider the case where the leader again operates under the assumption that the follower is playing (locally) optimally at each round so that the belief is , but the follower is actually performing the update where . The learning dynamics in this setting are then

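$$\begin{aligned} x_{1,k+1} &= x_{1,k} - \gamma_{1,k}\big(Df_1(x_k) + w_{1,k+1}\big) &\qquad (6)\\ x_{2,k+1} &= x_{2,k} - \gamma_{2,k}\big(D_2 f_2(x_k) + w_{2,k+1}\big) &\qquad (7) \end{aligned}$$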

where . Suppose that faster than so that in the limit , the above approximates the singularly perturbed system defined by

$$\begin{aligned} \dot{x}_1(t) &= -\tau\, Df_1(x_1(t), x_2(t)) \\ \dot{x}_2(t) &= -D_2 f_2(x_1(t), x_2(t)) \end{aligned} \qquad (8)$$

The learning rates can be seen as stepsizes in a discretization scheme for solving the above dynamics. The condition that induces a timescale separation in which evolves on a faster timescale than . That is, the fast transient player is the follower and the slow component is the leader since implies that from the perspective of the follower, appears quasi-static and from the perspective of the leader, appears to have equilibrated, meaning given . From this point of view, the learning dynamics (6)–(7) approximate the dynamics in the preceding section. Moreover, stable attractors of the dynamics are such that the leader is at a local optimum for , not just along its coordinate axis but in both coordinates constrained to the manifold ; this distinguishes them from differential Nash equilibria, in which agents are at local optima aligned with their individual coordinate axes.
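A minimal simulation of the two-timescale updates (6)–(7) is sketched below. The scalar quadratic game, the Gaussian noise, and the step sizes $\gamma_{1,k} = 1/(k+1)$ and $\gamma_{2,k} = 1/(k+1)^{0.6}$ are assumptions chosen for illustration, so that the leader's step size shrinks faster than the follower's; none of these choices come from the paper.

```python
import random


# Hypothetical scalar game (an assumption, not from the paper):
#   f1(x1, x2) = 0.5 * (x1 - 1)**2 + x1 * x2   (leader)
#   f2(x1, x2) = 0.5 * x2**2 - x1 * x2         (follower)
def total_derivative_f1(x1, x2):
    # D1 f1 + D2 f1 * dr/dx1, where dr/dx1 = 1 for this game.
    return (x1 - 1.0 + x2) + x1


def d2_f2(x1, x2):
    return x2 - x1


def run_two_timescale(x1=2.0, x2=-1.0, num_steps=50000,
                      noise_std=0.1, seed=0):
    """Leader on the slow timescale, follower on the fast timescale."""
    rng = random.Random(seed)
    for k in range(num_steps):
        gamma1 = 1.0 / (k + 1)          # slow (leader) step size
        gamma2 = 1.0 / (k + 1) ** 0.6   # fast (follower) step size
        g1 = total_derivative_f1(x1, x2) + rng.gauss(0.0, noise_std)
        g2 = d2_f2(x1, x2) + rng.gauss(0.0, noise_std)
        # Simultaneous update: both gradients use the same iterate x_k.
        x1, x2 = x1 - gamma1 * g1, x2 - gamma2 * g2
    return x1, x2


if __name__ == "__main__":
    print(run_two_timescale())
```

For this game the only point where both $Df_1$ and $D_2 f_2$ vanish is $(1/3, 1/3)$, so the iterates should approach it even though the follower is only taking gradient steps rather than best responding.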

#### 3.2.1 Asymptotic Almost Sure Convergence

The following two results are fairly classical in stochastic approximation. They are leveraged here to draw conclusions about convergence to Stackelberg equilibria in hierarchical learning settings.

While we do not need the following assumption for all the results in this section, it is required for asymptotic convergence of the two-timescale process in (6)–(7). The dynamics have a globally asymptotically stable equilibrium.

Under Assumptions 3 and 3.2.1, and the assumption that , classical results imply that the dynamics (6)–(7) converge almost surely to a compact internally chain transitive set of (8); see, e.g., [9, §6.1-2], [7, §3.3]. Furthermore, it is straightforward to see that stable differential Nash equilibria are internally chain transitive sets since they are stable attractors of the dynamics from (8).

There are two important points to remark on at this juncture. First, the flow of the dynamics (8) is not necessarily a gradient flow, meaning that the dynamics may admit non-equilibrium attractors such as periodic orbits. The dynamics correspond to a gradient vector field if and only if , meaning when the dynamics admit a potential function. Equilibria may also not be isolated unless the Jacobian of , say , is non-degenerate at the points. Second, except in the case of zero-sum settings in which , non-Stackelberg locally asymptotically stable equilibria are attractors. That is, convergence does not imply that the players have settled on a Stackelberg equilibrium, and this can occur even if the dynamics admit a potential.

Let be the (continuous) time accumulated after samples of the slow component . Define to be the flow of starting at time from initialization . Suppose that Assumptions 3 and 3.1 hold. Then, conditioning on the event , for any integer , almost surely.

###### Proof.

The proof follows standard arguments in stochastic approximation. We simply provide a sketch here to give some intuition. First, we show that conditioned on the event , almost surely. Let . Hence the leader’s sample path is generated by which tracks since so that it is asymptotically negligible. In particular, tracks . That is, on intervals where , the norm difference between interpolated trajectories of the sample paths and the trajectories of vanishes a.s. as . Since the leader is tracking , the follower can be viewed as tracking . Then applying Lemma A provided in Appendix A, almost surely.

Now, by Assumption 3, is Lipschitz and bounded (in fact, independent of 1, since , , it is locally Lipschitz and, on the event , it is bounded). In turn, it induces a continuous globally integrable vector field, and therefore satisfies the assumptions of Benaïm [5, Prop. 4.1]. Moreover, under Assumptions 2 and 3, the assumptions of Benaïm [5, Prop. 4.2] are satisfied, which gives the desired result. ∎

Under Assumption 3.2.1 and the assumptions of Proposition 3.2.1, almost surely conditioned on the event . That is, the learning dynamics (6)–(7) converge to stable attractors of (8), the set of which includes the stable differential Stackelberg equilibria.

###### Proof.

Continuing with the conclusion of the proof of Proposition 3.2.1, on intervals the norm difference between interpolates of the sample path and the trajectories of vanishes asymptotically; applying Lemma A (Appendix A) gives the result. ∎

Leveraging the results in Section 2.2.1, the convergence guarantees are stronger since in zero-sum settings all attractors are Stackelberg; this contrasts with the Nash equilibrium concept. Consider a zero-sum setting . Under the assumptions of Proposition 3.2.1 and Assumption 3.2.1, conditioning on the event , the learning dynamics (6)–(7) converge to a differential Stackelberg equilibrium almost surely. The proof of this corollary follows the above analysis and invokes Proposition 2.2.1.

As with Corollary 3.1, we can relax Assumptions 3.1 and 3.2.1 to local asymptotic stability assumptions. In this case, we would again need to assume only that for a given ball around a differential Nash equilibrium , the dynamics have a locally asymptotically stable attractor uniformly in on , the dynamics have a locally asymptotically stable attractor on , and that .

#### 3.2.2 Finite-Time High-Probability Guarantees

While the asymptotic guarantees of the preceding section are useful, high-probability finite-time guarantees can be leveraged more directly in analysis and synthesis, e.g., of mechanisms to coordinate otherwise autonomous agents. In this section, we aim to provide concentration bounds for the purpose of deriving convergence rate and error bounds in support of this objective. The results in this section follow the very recent work by Borkar and Pattathil [8]. We highlight key differences and, in particular, where the analysis may lead to insights relevant for learning in hierarchical decision problems between non-cooperative agents.

Consider a locally asymptotically stable differential Stackelberg equilibrium and let be an radius ball around contained in the region of attraction. Stability implies that the Jacobian is positive definite and by the converse Lyapunov theorem [39, Chap. 5] there exist local Lyapunov functions for the dynamics and for the dynamics , for each fixed . In particular, there exists a local Lyapunov function with , and for . For , let . Then, there is also and such that for , where . An analogously defined exists for the dynamics for each fixed .

For now, fix sufficiently large; we specify the values of for which the theory holds before the statement of Theorem 3.2.2. Define the event