
Robust Commitments and Partial Reputation

Agents rarely act in isolation -- their behavioral history, in particular, is public to others. We seek a non-asymptotic understanding of how a leader agent should shape this history to its maximal advantage, knowing that follower agent(s) will be learning and responding to it. We study Stackelberg leader-follower games with finite observations of the leader commitment, which commonly models security games and network routing in engineering, and persuasion mechanisms in economics. First, we formally show that when the game is not zero-sum and the vanilla Stackelberg commitment is mixed, it is not robust to observational uncertainty. We propose observation-robust, polynomial-time-computable commitment constructions for leader strategies that approximate the Stackelberg payoff, and also show that these commitment rules approximate the maximum obtainable payoff (which could in general be greater than the Stackelberg payoff).


1 Introduction

Consider a selfish, rational agent (designated as the “leader”) who is interacting non-cooperatively with another selfish, rational agent (designated as the “follower”). If the agents are interacting simultaneously, and know the game that they are playing, they will naturally play a Nash equilibrium of the two-player game. This is the traditionally studied solution concept in game theory. Now, let’s say that the leader has the ability to reveal its strategy in advance – in the form of a mixed strategy commitment – and the follower has the ability to observe this commitment and respond to it. The optimal strategy for the leader now corresponds to the Stackelberg equilibrium of the two-player leader-follower game. The Stackelberg solution concept enjoys application in several engineering settings where commitment is the natural framework for the leader, such as security [PPM08], network routing [Rou04] and law enforcement [MS17]. The solution concept is interpreted in a broader sense as the ensuing strategy played by a patient leader that wishes to build a reputation by playing against an infinite number of myopic followers [KW82, MR82, FL89, FL92]. Crucially, one can show rigorously that the leader benefits significantly from commitment power, i.e. its ensuing Stackelberg equilibrium payoff is at least as much as the simultaneous equilibrium payoff [VSZ10]. Further, several mechanism design problems that involve private information revelation – including signalling games [CS82] and persuasion games [KG11] – can be thought of as Stackelberg games, and the optimal mechanism can be interpreted as the Stackelberg commitment (albeit sometimes with multiple followers).

But whichever interpretation one chooses, the Stackelberg solution concept assumes a very idealized setting (even over and above the assumptions of selfishness and infinite rationality) in which the mixed strategy commitment is exactly revealed to the follower. Further, the follower 100% believes that the leader will actually stick to her commitment. What happens when these assumptions are relaxed? What if the leader could only demonstrate her commitment in a finite number of interactions – how would she modify her commitment to maximize payoff, and how much commitment power would she continue to enjoy? Is she even incentivized to help the follower estimate her commitment effectively?

What changes in a finite-interaction regime is that the follower only observes a part of the leader’s behavioral history and needs to learn about the leader’s strategic behavior – to the extent that he is able to respond as optimally as possible. By restricting our attention to finitely repeated play, we arrive at a problem setting that is fairly general: these follower agents in general will not know the preferences of the leader agent. When provided with historical context, we assume that they will use it rather than ignore it. A broad umbrella of problems that has received significant attention in the machine learning literature is learning of strategic behavior from samples of play; for example, learning the agent’s utility function through inverse reinforcement learning [ZMBD08], learning the agent’s level of rationality [WZB13], and inverse game theory [KS15]. While significant progress has been made in this goal of learning strategic behavior, attention has been restricted to the passive learning setting in which the leading agent is unaware of the presence of the learner, or agnostic to the learner’s utility. In many situations, the agent herself will be invested in the outcome of the learning process. In this paper, we put ourselves in the shoes of an agent who is shaping her historical context and is aware of the learner’s presence as well as preferences, and study her choice of optimal strategy. (It is worth mentioning the recent paradigm of cooperative inverse reinforcement learning [HMRAD16], which studies the problem of agent investment in principal learning where the incentives are not completely aligned, but the setting is cooperative. In contrast, we focus on non-cooperative settings.) As we will see, the answer will depend on her utility function itself, as well as what kind of response she is able to elicit from the learner.

1.1 Related work

The Stackelberg solution concept is used in the engineering and economics literature to model a number of scenarios. For one, the Stackelberg security game is played between a defender, who places different utility levels on different targets to be protected and accordingly uses her resources to defend some subset of targets; and an attacker, who observes the defender strategy and wishes to attack some of the targets depending on whether he thinks they are left open as well as how much he values those targets. Stackelberg games can also be modeled with a single leader and multiple followers, such as in computer network routing applications [Rou04]. Many mechanism design problems involve computing an optimal mechanism to commit to, or an optimal way of revealing private information - this includes auctions and, more recently, Bayesian persuasion mechanisms [KG11].

Economists have established an important link between the Stackelberg solution concept and the asymptotic limit of reputation building. Reputation effects were first observed in the chain-store paradox, a firm-competition game where an incumbent firm would often deviate from Nash equilibrium behavior and play its aggressive Stackelberg (pure, in this case) strategy [Sel78]. Theoretical justification was provided for this particular application [KW82, MR82] by modeling an interaction between a leader and multiple followers, and studying the Nash equilibrium of the ensuing game as the number of followers grows. It was shown that the leader would play its pure Stackelberg strategy (clearly more restrictive than the mixed strategy Stackelberg solution concept, and not necessarily advantageous over Nash, but it turns out to be so in the firm-competition case) in the Bayes-Nash equilibrium of this game endowed with a common prior on the leader’s payoff structure (a more nuanced model considered a Bayesian prior over the leader being constrained to play its pure Stackelberg strategy as opposed to unconstrained play). This model was generalized to such leader-follower ensembles for a general two-player game, allowing for the possibility of mixed-strategy reputation while still retaining the asymptotic nature of the results [FL89, FL92]. The “first-player” advantage, and the entire Stackelberg solution concept, rely on an important assumption: that the commitment is perfectly revealed to the follower. This is usually not the case: in security games, the attacker will usually observe a finite number of deployments of the defender’s resource, as opposed to the allocation strategy itself (which is often mixed). In theoretical models for Bayesian persuasion, the persuader conveys a conditional distribution on her signal given the privately observed state of the world – but what will be practically observed is her history, and thus realizations of the signal, not the distribution itself. In all of these models, the leader establishes her reputation only partially, and the manifestation of the revelation is itself random. It is natural to ask how she should plan to optimally reveal her information under this constraint.

The idea of a robust solution concept in game theory is certainly not new. The concept of trembling-hand-perfect-equilibrium [Sel75] explicitly studies how robust mixed-strategy Nash equilibria are to slight perturbations in the mixtures themselves, and a similar concept was proposed for Stackelberg [VDH97]. Another solution concept, quantal-response-equilibrium [MP95], studies agents that are boundedly rational, an orthogonal but important source of uncertainty in response. In the Stackelberg setting, it was noted that robust commitments exist that preserve the Stackelberg guarantee for small enough amounts of noise in the commitment; however, this is still an asymptotic perspective and does not directly help us answer our key computational questions: can we construct a robust commitment efficiently when the game is multi-dimensional, and does the leader want to use the noise to reveal or obfuscate her commitment?

The problem of computing the optimal commitment under finitely limited observability corresponds to a robust optimization problem that is, in fact, NP-hard [AKK12, SAY12], so directly reasoning about the optimal commitment is not easy. Whether there exists a polynomial-time approximation scheme for this problem was also unclear. A pair of papers [AKK12, SAY12] considered a model of full-fledged observational uncertainty with a Bayesian prior and posterior update based on samples of behavior, and proposed heuristic algorithmic techniques to compute the optimum. In fact, they also accounted for quantal responses using a bounded rationality model [MP95]. This work showed through simulations that there could be a positive return over and above the Stackelberg payoff. In one important piece of analytical work, the problem was also considered for the special case of zero-sum games [BHP14], and it was shown that the Stackelberg commitment itself approximates the optimal payoff. In this result, the extent of approximation depends on the amount of observational uncertainty itself – the results we prove for all non-zero-sum games have a similar flavor.

The problem of communication constraints in the commitment has also received a lot of interest in the recent algorithmic persuasion literature, but with quantitatively different models for the uncertainty. Communication constraints on signaling in bilateral trading games [DKQ16] and auction design [DIR14] have been studied from a compression perspective, where the leading agent can design the observation channel – while in our model the observation channel is even more constrained, consisting of a finite number of random realizations of the mixed commitment. Further, in many of these settings, the principal is naturally incentivized to reveal the private information, and the problem primarily becomes one of communication complexity and of whether the social welfare of the optimal mechanism can be approximated. (These are clearly interesting algorithmic questions in themselves, especially in the case of multiple receivers and private vs. public signaling [DX16], but they do not directly address the questions we have raised.) In security games, the possibility of the mixed commitment being either fully observed or not observed at all has been considered [KCP11], as well as different ways of handling the uncertainty, e.g. showing that for some security games the Stackelberg and Nash equilibria coincide and observation of the commitment does not matter [KYK11]. Pita et al. [PJT10] first proposed a model for the defender (leader) to account for attacker (follower) observational uncertainty by allowing the follower to anchor [RT97] to a certain extent on its uniform prior. While they showed significant returns from using their model through extensive experiments, they largely circumvented the algorithmic and analytical challenges by not explicitly considering random samples of defender behavior, thus keeping the attacker response deterministic but shifted. Our work limits observation in the most natural way for the applications that we consider (i.e. the number of samples of leader behavior), and because the manifestation of the uncertainty is itself random, our results have distinct and new implications.

1.2 Our contributions

Our main contribution is to understand the extent of reputational advantage when interaction is finite, and prescribe approximately optimal commitment rules for this regime.

We study Stackelberg leader-follower games in which a follower obtains a limited number of observations of the leader commitment. We first prove that in most non-zero-sum games the payoff of the Stackelberg commitment is not robust to even an infinitesimal amount of observational uncertainty; therefore, the Stackelberg commitment is suboptimal in its payoff. (This property had been proved for special examples of Stackelberg games [VDH97], but it was unclear whether it holds for all or most games.) Next, we propose robust commitment rules for leaders and show that we can approach the Stackelberg payoff as the number of observations increases. The robust commitment construction involves optimizing a tradeoff between preserving the follower best response and staying close to the ideal Stackelberg commitment, by moving the commitment a little bit into the interior of an appropriate convex polytope [CS06]. The analysis of the payoff of the commitment construction is inspired by interior-point convex geometry [KN12]. Finally, we show that any advantage the leader could gain from limited observability is related only to follower response mismatch, and that this advantage is limited. Computationally speaking, the corollary is that we are able to approximate the optimal payoff through a simple construction which can be obtained in constant time from computation of the Stackelberg commitment (itself a polynomial-time operation [CS06]). Philosophically, this result implies that a leader can gain only to a very limited extent by misrepresenting her commitment and eliciting a suboptimal response from the follower. We corroborate our theoretical results with simulations on illustrative examples and random ensembles of security games.

2 Problem statement

2.1 Preliminaries

We represent a two-player leader-follower game in normal form by a pair of payoff matrices, one for the leader and one for the follower. The leader’s mixed strategy space is the probability simplex over her pure strategies, and the follower’s mixed strategy space is the probability simplex over his. From now on, we define an effective dimension of the game as the dimension of the space in which the leader’s strategies actually manifest: the effective payoff matrices of the leader and follower are defined on this space, and the effective set of leader strategies is given by a convex polytope within it. (This definition is important in the context of Stackelberg security games, for which the leader strategy space looks exponential in the number of targets – but the actual manifestation of all leader strategies is low-dimensional. In particular, a defender strategy manifests as a distribution over the different targets being covered.)

We consider a setting of asymmetric private information in which the leader knows the follower’s preferences (i.e. she knows the follower’s payoff matrix) while the follower does not know the leader’s preferences (i.e. he possesses no knowledge of the leader’s payoff matrix). (This is an important assumption for the paper, and is in fact used in traditional reputation-building frameworks. In the future, we will want to better understand situations of repeated interaction in which the leader and follower are both learning about one another.)

With infinite experience, the well-established effect from the follower’s point of view is that the leader has established commitment, or developed a reputation, for playing according to some mixed strategy. We denote by the follower’s best-response set the set of pure strategies that maximize the follower’s expected payoff against a given mixed strategy commitment.

An important assumption that we make (and that has been made in the classical literature [CS06]) is that the follower actually responds with the pure strategy in this set that is most beneficial to the leader. (The technical reason for this tie-breaking rule is to be able to define the Stackelberg commitment as an explicit maximum – this in itself gives a subtle clue of its fragility.)

Then, we also define best-response regions: the best-response region of a given follower pure strategy is the set of leader commitments that would elicit that pure strategy as the response from the follower.

With these definitions, we can define the leader’s ideal payoff to be expected with an infinite reputation:

Definition 1.

A leader with an infinite reputation of playing according to a given mixed strategy should expect the payoff obtained when the follower best-responds to that strategy, breaking ties in the leader’s favor.

Therefore, the leader’s Stackelberg payoff is the maximum of this ideal payoff over all mixed strategy commitments. The argmax of this program is denoted as the Stackelberg commitment. Further, we denote the best response faced in Stackelberg equilibrium as the Stackelberg response.
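For concreteness, one standard way to write these objects is the following, under purely illustrative notation (leader payoff matrix A, follower payoff matrix B, mixed commitment x in the leader simplex Δ, follower pure strategies indexed by j); this notation is introduced here only for exposition and need not match the paper’s original symbols.

% Illustrative notation: A = leader payoff matrix, B = follower payoff matrix,
% x = leader mixed commitment in the simplex \Delta, e_j = follower pure strategy j.
\[
  \mathrm{BR}(x) = \arg\max_{j} \; x^{\top} B e_{j},
  \qquad
  j^{*}(x) = \arg\max_{j \in \mathrm{BR}(x)} \; x^{\top} A e_{j},
\]
\[
  f(x) = x^{\top} A e_{j^{*}(x)},
  \qquad
  f^{*} = \max_{x \in \Delta} f(x),
  \qquad
  x^{*} = \arg\max_{x \in \Delta} f(x),
\]
\[
  \mathcal{P}_{j} = \{\, x \in \Delta : j \in \mathrm{BR}(x) \,\}
  \quad \text{(best-response region of follower pure strategy } j\text{)}.
\]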

It is clear that the Stackelberg commitment is optimal for the leader under two conditions: the leader is 100% known to be committed to a fixed strategy, and the follower knows exactly the leader’s committed-to strategy. For a finite number of interactions, neither is true.

2.2 Observational uncertainty with established commitment

Even assuming that there is a shared belief in commitment, there is uncertainty. In particular, with a finite number of plays, the follower does not know the exact strategy that the leader has committed to, and only has an estimate.

Consider the situation where a leader can only reveal its commitment through a finite number of pure strategy plays. The commitment is known (to both leader and followers) to come from a set of mixed strategies. We denote by the empirical mixture the maximum likelihood estimate of the leader’s mixed strategy, as seen by the follower. It is reasonable to expect, under certainty of commitment, that a “rational” follower would best-respond to this estimate (“rational” is in quotes because the follower is not necessarily using expected-utility theory, although there is an expected-utility-maximization interpretation of this estimate if the mixed strategy were drawn uniformly from the commitment set), i.e. play the pure strategy

(1)

We can express the expected leader payoff under this learning rule.

Definition 2.

A leader who plays according to a hidden mixed strategy can expect, in each of her finitely many plays, the payoff obtained against a follower that plays according to (1). The maximal payoff a leader can expect is the maximum of this expected payoff over all commitments, and she acquires this payoff by playing the corresponding argmax strategy.
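The quantity in Definition 2 can be illustrated with a minimal Monte Carlo sketch, using the same illustrative notation as above; the matrices A, B and the mixture x below are hypothetical placeholders rather than games from the paper.

import numpy as np

def expected_payoff(x, A, B, n, trials=20000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the leader's expected payoff when the follower
    best-responds to the empirical (maximum-likelihood) estimate of the
    commitment x formed from n observed pure-strategy plays. Ties in the
    follower's response are broken in the leader's favor, mirroring the
    tie-breaking assumption in the text."""
    total = 0.0
    for _ in range(trials):
        counts = rng.multinomial(n, x)   # n iid draws from the committed mixture
        x_hat = counts / n               # follower's empirical estimate
        follower_vals = x_hat @ B        # follower's estimated payoff per pure response
        best = np.flatnonzero(np.isclose(follower_vals, follower_vals.max()))
        leader_vals = x @ A              # true expected leader payoff per response
        j = best[np.argmax(leader_vals[best])]   # tie-break in the leader's favor
        total += leader_vals[j]
    return total / trials

# Hypothetical 2x2 game, for illustration only (not from the paper):
A = np.array([[1.0, 0.0], [0.5, 0.8]])
B = np.array([[0.2, 0.9], [0.7, 0.1]])
x = np.array([0.6, 0.4])
print(expected_payoff(x, A, B, n=10))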

Ideally, we want to understand how close this maximal finite-observation payoff is to the Stackelberg payoff, and also how close the corresponding optimal commitment is to the Stackelberg commitment. An answer to the former question would tell us how observational uncertainty impacts the first-player advantage. An answer to the latter question would shed light on whether the best course of action deviates significantly from the Stackelberg commitment. We are also interested in algorithmic techniques for approximately computing the maximal finite-observation payoff, as computing it exactly would involve solving a non-convex optimization problem.

3 Main Results

3.1 Robustness of Stackelberg commitment to observational uncertainty

A natural first question is whether the Stackelberg commitment, which is clearly optimal if the game were being played infinitely (or equivalently, if the leader had infinite commitment power and exact public commitment), is also suitable for finite play. In particular, we might be interested in evaluating whether we can do better than the baseline Stackelberg performance. We show through a few paradigmatic examples that the answer can vary.

Figure 1: Illustration of examples of a zero-sum game (a) and two non-zero-sum games (b, c), in the form of normal-form tables and the ideal leader payoff function. In each game, the leader’s mixed commitment is fully described by the probability with which she plays her first strategy.
Example 1.
Figure 2: Semilog plot of the extent of advantage over the Stackelberg payoff as a function of the number of observations, in the zero-sum game depicted in Figure 1(a).

We consider a zero-sum game, represented in normal form in Figure 1(a), in which the leader strategy is fully described by the probability with which she picks her first strategy. Since the game is zero-sum, the follower responds in a way that is worst-case for the leader, so the ideal leader payoff is the lower envelope of the payoffs under the two follower responses. This leader payoff structure is depicted in Figure 1(a), and the Stackelberg payoff is attained at the crossing point of the two response payoffs.

We wish to evaluate the leader’s optimal payoff under finite observations. It was noted [BHP14] that, by the minimax theorem, this payoff is at least the Stackelberg payoff, but it was not always clear whether strict inequality would hold (that is, whether observational uncertainty gives a strict advantage). For this example, we can actually get a sizeable improvement: with very few observations, committing to the Stackelberg mixture elicits the more favorable follower response with sizeable probability, yielding a strictly better expected payoff.

The semilog plot in Figure 2 shows that this improvement persists for larger numbers of observations, although the extent of improvement decreases exponentially with the number of observations. We can bound the advantage by the probability that the follower’s empirical estimate of the commitment falls in the more favorable best-response region, which decays exponentially at a rate governed by the Kullback-Leibler divergence between that region and the committed mixture; the last step is due to Sanov’s theorem [CK11].

This shows analytically that the advantage does indeed decrease exponentially with the number of observations. Naturally, this is because the stochasticity that elicits the more favorable follower response diminishes as the number of observations grows.
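For reference, the standard form of Sanov’s theorem that such a bound invokes is the following, in the illustrative notation used above; the constants are the generic ones and need not coincide with those of the original statement.

% Sanov's theorem (standard form): for n iid draws from a mixture p over an
% alphabet of size m, and any set E of distributions,
\[
  \Pr\left( \hat{p}_{n} \in E \right)
  \;\le\; (n+1)^{m}\, \exp\!\Big( -\, n \inf_{q \in E} D(q \,\|\, p) \Big),
\]
% where D(q || p) is the Kullback--Leibler divergence. Applied with E equal to
% the more favorable best-response region and p equal to the committed mixture,
% this gives the exponential decay of the advantage described above.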

Example 1 showed us how Stackelberg commitment power could be increased by stochastically eliciting more favorable responses. We now see an example illustrating that the commitment power can disappear completely.

Example 2.
Figure 3: Example of the non-zero-sum game depicted in Figure 1(b), for which observational uncertainty is always undesirable. (a) Performance of the sequence of robust commitments and of the Stackelberg commitment as a function of the number of observations, benchmarked against the idealized Stackelberg payoff. (b) Performance of the sequence of robust commitments compared to the optimum performance (brute-forced).

We consider a non-zero-sum game, represented in normal form and leader payoff structure in Figure 1(b). This is essentially the example reproduced in [BHP14], which we repeat for storytelling value. Notice that the Stackelberg commitment is mixed and enjoys an idealized payoff advantage, but that advantage evaporates with observational uncertainty: for any finite number of observations, the follower misperceives the commitment and responds unfavorably with constant probability. Remarkably, this implies that the expected payoff of the Stackelberg commitment remains bounded away from the idealized Stackelberg payoff even as the number of observations grows! This is clearly a very negative result for the robustness of the Stackelberg commitment, and as a very pragmatic matter it tells us that the idealized Stackelberg commitment is far from ideal in finite-observation settings. This example shows a case where stochasticity in the follower response is not desired, principally because of the discontinuity in the leader payoff function at the Stackelberg commitment.

Example 2 displayed to the fullest the significant disadvantage of observational uncertainty. The game considered was special in that there was no potential for limited-observation gain, while in the game presented in Example 1 there was only potential for limited-observation gain. What could happen in general? Our next and final example provides an illustration.

Example 3.
Figure 4: Example of the non-zero-sum game depicted in Figure 1(c), in which observational uncertainty could either help or hurt the leader. (a) Extent of (dis)advantage over the Stackelberg payoff as a function of the number of observations. (b) Performance of the sequence of robust commitments and of the Stackelberg commitment as a function of the number of observations, benchmarked against the idealized Stackelberg payoff. (c) Performance of the sequence of robust commitments compared to the optimum performance (brute-forced).

Our final example considers a non-zero-sum game, whose normal form and leader payoff structure are depicted in Figure 1(c). As in the other examples, the leader’s commitment is described by the probability of playing her first strategy. Notice that this example captures both positive and negative effects of stochasticity in response. On one hand, one alternate follower response is highly undesirable (a la Example 2), but another follower response is highly desirable (a la Example 1). What is the net effect? A quick calculation tells us that once the number of observations is moderately large, the expected payoff of the Stackelberg commitment falls below the idealized Stackelberg payoff, showing that Stackelberg in fact has poor robustness for this example. Intuitively, the probability of the “bad” stochastic event remains constant while the probability of the “good” stochastic event decreases exponentially with the number of observations. Even more damningly, the gap does not vanish even in the limit of infinitely many observations, again showing that the Stackelberg commitment is far from ideal. We can see the dramatic decay of the leader advantage over and above Stackelberg, and the ensuing disadvantage even for a very small number of observations, in Figure 4(a).

While the three examples detailed above provide differing conclusions, there are some common threads. For one, in all the examples it is the case that committing to the Stackelberg mixture can leave the follower agnostic between more than one response. Only one of these responses is desirable for the leader. A very slight misperception in the estimation of the true commitment can therefore lead to a different, worse-than-expected response, and this misperception happens with a sizeable, non-vanishing probability. On the flipside, a different response could also lead to better-than-expected payoff, raising the potential for a gain over and above the Stackelberg payoff. However, these better-than-expected responses cannot share a boundary with the Stackelberg commitment, and we will see that the probability of eliciting them decreases exponentially with the number of observations. The net effect is that the Stackelberg commitment is, most often, not robust – and critically, this is the case even for small amounts of uncertainty.

Our first result is a formal statement of the instability of Stackelberg commitments for a general game. We describe a leader commitment by the probabilities it places on the leader’s pure strategies, and similarly for the Stackelberg commitment. Furthermore, we let Φ(·) denote the CDF of the standard normal distribution. We are now ready to state the result.

Theorem 1.

For any leader-follower game in which the Stackelberg commitment is mixed and the ideal leader payoff function is discontinuous at the Stackelberg commitment, we have

(2)

where the constants appearing in (2) are strictly positive and depend on the parameters of the game. This directly implies the following:

  1. Beyond some finite number of observations, the expected payoff of the Stackelberg commitment is strictly below the idealized Stackelberg payoff.

  2. The gap does not vanish: the limiting expected payoff of the Stackelberg commitment, as the number of observations grows, remains strictly below the idealized Stackelberg payoff.

The proof of Theorem 1 is contained in Section A.1. The technical ingredients in the proof are the Berry-Esseen theorem [Ber41, Ess42], used to show that the detrimental alternate responses on the Stackelberg boundary are non-vanishingly likely, and the Hoeffding bound, used to tail-bound the probability of potentially beneficial alternate responses not on the boundary. (It is worth noting that a similar argument to the one presented here could be extended to a general game, using iid random vectors instead of random variables and considering a demarcation into best-response regions as illustrated in Figure 7. We restrict attention to the lower-dimensional case for ease of exposition.)

For non-robustness of Stackelberg commitment to hold, the two critical conditions for the game are that there is a discontinuity at the Stackelberg boundary, and that the Stackelberg commitment is mixed. For a zero-sum game, the first condition does not hold and the Stackelberg commitment stays robust as we saw in Example 1.

The theorem directly implies that the ideal Stackelberg payoff is only obtained in the idealized case where the commitment is perfectly observed, and that for any finite number of observations there is a non-vanishing reduction in payoff. In the simulations in Section 4, we will see that this gap is empirically significant.

3.2 Robust commitments achieving close-to-Stackelberg performance

The surprising message of Theorem 1 is that, in general, the Stackelberg commitment is undesirable. The commitment is pushed to the extreme point of the best-response-region to ensure optimality under idealized conditions; and this is precisely what makes it sub-optimal under uncertainty. What if we could move our commitment a little bit into the interior of the region instead, such that we can get a high-probability-guarantee on eliciting the expected response, while staying sufficiently close to the idealized optimum? Our next result quantifies the ensuing tradeoff and shows that we can cleverly construct the commitment to approximate Stackelberg performance.

Theorem 2.

Let the Stackelberg best-response polytope have non-empty interior. Then, provided that the number of samples exceeds the effective dimension of the game, we can construct, for every number of observations, a commitment such that

(3)

Furthermore, these constructions are computable in constant time given knowledge of the Stackelberg commitment. (The bound in (3) hides constant factors that depend on both the local and global geometry of the best-response region. For a fully formal statement that includes these factors, see Lemma 6.)

The full proof of Theorem 2, deferred to Appendix A.2, involves some technical steps to achieve as good a scaling as possible in the number of observations. The caveat of Theorem 2 is that commitment power can be robustly exploited in this way only if there are enough observations of the commitment. One obvious requirement is that the best-response region needs to have non-empty interior. Second, the number of observations needs to be greater than the effective dimension of the game for the leader; this is a natural requirement to ensure that the follower has learned at least a meaningful estimate of the commitment. Third, the “constant” factors in Theorem 2 actually reflect properties of both the local and global geometry of the polytope; see Appendix A.2 for more details. Intuitively, geometric properties that lead to undesirable scaling in the constant factors in the robustness guarantee are listed below:

  1. The Stackelberg commitment being a “pointy” vertex: this can lead to a commitment being far away from the boundary in certain directions, but closer in others, making it more likely for a different response to be elicited.

  2. Local constraints being very different from global constraints, which implies that commitments too far in the interior of the local feasibility set will no longer satisfy all the constraints of the best-response-region.

Even with these caveats, Theorem 2 provides an attractive general framework for constructing robust commitments by making a natural connection to interior-point methods in optimization. (Noting that interior-point methods are provably polynomial-time algorithms for solving LPs, it is plausible that stopping an interior-point method appropriately early would also give us a robustness guarantee – which would imply that finding optimal robust commitments is even easier than finding optimal commitments!) We observe significant empirical benefit from the constructions in the simulations in Section 4; a simplified numerical sketch of the construction appears below.
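The following is a minimal sketch of this idea, not the exact Dikin-ellipsoid construction analyzed in the appendix: it shrinks the Stackelberg commitment toward an assumed strictly interior point of its best-response region by a margin that decays with the number of observations n. The sqrt(d/n) rate, the interior anchor point, and the tuning constant are illustrative assumptions.

import numpy as np

def robust_commitment(x_star, x_interior, n, d, c=1.0):
    """Simplified sketch (not the paper's exact Dikin-ellipsoid construction):
    shrink the Stackelberg commitment x_star toward a strictly interior point
    x_interior of its best-response region by a margin that decays with the
    number of observations n. Here d is the effective dimension of the game
    and c is a tuning constant; the sqrt(d/n) rate is illustrative only."""
    eps = min(1.0, c * np.sqrt(d / n))        # deviation shrinks as n grows
    x_n = (1.0 - eps) * x_star + eps * x_interior
    return x_n / x_n.sum()                    # re-normalize to remain a mixture

# Hypothetical usage: x_star from the Stackelberg LP, x_interior any point
# with slack in all constraints of the best-response region.
x_star = np.array([0.7, 0.3])
x_interior = np.array([0.6, 0.4])
print(robust_commitment(x_star, x_interior, n=100, d=1))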

We also mention a couple of special cases of leader-follower games for which the robust commitment constructions of Theorem 2 are not required; in fact, it is simply optimal to play Stackelberg.

Remark 1.

For games in which the mixed-strategy Stackelberg equilibrium coincides with a pure strategy, the follower’s best response is always as expected regardless of the number of observations. There is no tradeoff and it is simply optimal to play Stackelberg even under observational uncertainty.

Remark 2.

For the zero-sum case, it was observed [BHP14] that the Stackelberg commitment is made assuming that the follower will respond in the worst case. If there is observational uncertainty, the follower can only respond in a way that yields payoff for the leader that is better than expected. This results in an expected payoff greater than or equal to the Stackelberg payoff, and it simply makes sense to stick with the Stackelberg commitment. As we have seen, this logic does not hold up for non-zero-sum games, because different responses can lead to worse-than-expected payoff. One way of thinking about this is that the ideal leader payoff function can generally be discontinuous in the commitment for a non-zero-sum game, but is always continuous in the special case of zero-sum games.

3.3 Approximating the maximum possible payoff

So far, we have considered the limited-observability problem and shown that the Stackelberg commitment is not a suitable choice. We have constructed robust commitments that come close to the idealized Stackelberg payoff and shown that the guarantee fundamentally depends on the number of observations scaling with the effective dimension of the game. Now, we turn to the question of whether we can approximate the actual optimum of the finite-observation program. Note that since the problem is in general non-convex in the commitment, it is NP-hard to compute exactly.

Rather than the traditional approach of constructing a polynomial-time approximation algorithm, our approach is approximation-theoretic. (In other words, the extent of approximation is measured by the number of samples as opposed to the runtime of an algorithm; this is very much the flavor of previously obtained results on Stackelberg zero-sum security games [BHP14].) We first show that in the large-sample case, we cannot do much better than the actual Stackelberg payoff; informally speaking, our ability to fool the follower into responding strictly better than expected is limited. Combining this with the robust commitment construction of Theorem 2, we obtain an approximation to the optimum payoff.

The main result of this section is stated below.

Theorem 3.

The maximum payoff obtainable under a finite number of observations exceeds the idealized Stackelberg payoff by at most an additive term that vanishes as the number of observations grows, with a constant depending on the parameters of the game.

As a corollary, the commitment construction defined in Theorem 2 provides an additive approximation algorithm for the maximum obtainable payoff. The proof of Theorem 3 is provided in the appendix.

Intellectually, Theorem 3 tells us that the robust commitments are essentially optimal. The practical benefit that Theorem 3 affords us is that we now have an approximation to the optimum payoff the leader could possibly obtain, which can be computed in constant time after computing the Stackelberg equilibrium, itself a polynomial-time operation [CS06]. This is because the robust commitment is obtained by first computing the Stackelberg equilibrium, and then deviating away from it in the magnitude and direction specified. We will now study the empirical benefits of our robust commitment constructions.

4 Simulations

4.1 Example games

First, we return to the non-zero-sum games described in Examples 2 and 3. These were small games with two leader actions, and the Stackelberg commitment was non-robust for both games. Now, armed with the results in Theorem 2, we can employ our robust commitment constructions and study their performance. To construct our robust commitments, we first computed the Stackelberg commitment using the LP solver in scipy (scipy.optimize.linprog), and then used the construction in Theorem 2.
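As a sketch of this step, the standard multiple-LPs method of [CS06] can be implemented with scipy.optimize.linprog as below; the payoff matrices here are hypothetical placeholders, and this sketch need not coincide with the exact script used for the simulations.

import numpy as np
from scipy.optimize import linprog

def stackelberg_commitment(A, B):
    """Sketch of the multiple-LPs method of Conitzer and Sandholm [CS06]:
    for each follower pure strategy j, find the leader mixture maximizing
    leader payoff subject to j being a follower best response; return the
    best commitment over all j."""
    m, k = A.shape
    best_val, best_x = -np.inf, None
    for j in range(k):
        # Constraints: x.B[:, j] >= x.B[:, l] for all l (j is a best response)
        A_ub = (B - B[:, [j]]).T             # row l encodes (B[:, l] - B[:, j]) . x <= 0
        b_ub = np.zeros(k)
        res = linprog(c=-A[:, j],            # maximize leader payoff under response j
                      A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.ones((1, m)), b_eq=np.array([1.0]),
                      bounds=[(0, 1)] * m, method="highs")
        if res.success and -res.fun > best_val:
            best_val, best_x = -res.fun, res.x
    return best_x, best_val

# Hypothetical game (illustration only):
A = np.array([[1.0, 0.0], [0.5, 0.8]])
B = np.array([[0.2, 0.9], [0.7, 0.1]])
print(stackelberg_commitment(A, B))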

Figures 3(a) and 4(b) compare the expected payoff obtained by our robust commitment construction scheme for different numbers of samples, for the games described in Examples 2 and 3 respectively. The benchmark with respect to which we measure this expected payoff is the Stackelberg payoff (obtained by the Stackelberg commitment under infinite observability and tie-breaking in favor of the leader). We also observe a significant gap between the payoffs obtained by these robust commitment constructions and the payoff obtained if we used the Stackelberg commitment. We showed in theory that there is significant benefit to choosing the commitment to factor in such observational uncertainty, and we can now see it in practice.

Furthermore, for these games with a small number of leader actions we were able to brute-force the maximum possible obtainable payoff and compare its value to the robust commitment payoff. (We first used scipy.optimize.brute with an appropriate grid size to initialize, and then ran a gradient descent from that initialization point; this was feasible for such a small number of pure strategies.) This comparison is particularly valuable for smaller numbers of observations, as shown in Figures 3(b) and 4(c). We notice that the values are much closer than even our theory would have predicted, including for small numbers of observations. Thus, our constructions have significant practical benefit as well: we are able to get close to the optimum while drastically reducing the required computation (to just solving LPs!).

Since these examples involved games with only two leader actions, the commitment construction became trivial (i.e. there is only one direction to move along) – next, we test our commitment constructions on security games. Instead of looking at specific examples, we now look at a random ensemble to see what behavior ensues.

4.2 Random security games

Figure 5: Illustration of the random ensemble of security games.

Figure 6: Illustration of the performance of robust commitments and the Stackelberg commitment in random Stackelberg security games for a finite number of observations of the defender commitment. (a) Expected defender payoff when the defender uses robust commitments, compared to the Stackelberg commitment as well as the idealized Stackelberg payoff. (b) Log-log plot of the gap between robust commitment payoff and idealized Stackelberg payoff. (c) Percentage plot of the gap between robust commitment payoff and idealized Stackelberg payoff.

Our next set of simulations is inspired by the security games framework. We create a random ensemble of security games in which the defender can defend one of several targets, and the attacker can attack any one of these targets. The defender and attacker rewards for each target are chosen uniformly at random from a positive range, and their penalties uniformly at random from a negative range. This is essentially the random ensemble that was created in previous empirical work on security games [AKK12]. Figure 5 shows the construction of this ensemble; a sketch of such a generator appears below.
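A minimal generator for an ensemble of this kind is sketched below; the number of targets and the specific reward and penalty ranges are illustrative assumptions rather than the values used in the experiments.

import numpy as np

def random_security_game(num_targets, rng=np.random.default_rng(0),
                         reward_range=(0.0, 1.0), penalty_range=(-1.0, 0.0)):
    """Sketch of a random security-game ensemble (the exact ranges and the
    number of targets are assumptions, not taken from the paper). The defender
    covers one target and the attacker attacks one target. If the attacked
    target is covered, the defender earns its reward and the attacker its
    penalty; otherwise the roles are reversed."""
    r_def = rng.uniform(*reward_range, num_targets)
    r_att = rng.uniform(*reward_range, num_targets)
    p_def = rng.uniform(*penalty_range, num_targets)
    p_att = rng.uniform(*penalty_range, num_targets)

    A = np.empty((num_targets, num_targets))   # defender (leader) payoffs
    B = np.empty((num_targets, num_targets))   # attacker (follower) payoffs
    for c in range(num_targets):               # c: covered target
        for a in range(num_targets):           # a: attacked target
            if a == c:
                A[c, a], B[c, a] = r_def[a], p_att[a]
            else:
                A[c, a], B[c, a] = p_def[a], r_att[a]
    return A, B

A, B = random_security_game(num_targets=4)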

The purpose of the random security games is to show that the properties we observed above – an unstable Stackelberg commitment, and robust commitment payoff approximating the optimum – are the norm rather than the exception. Figure 6 illustrates the results for random security games. The performance of the sequence of robust commitments, as well as of the Stackelberg commitment, is plotted in Figure 6(a) against the benchmark of idealized Stackelberg performance. Figure 6(b) depicts the gap between robust commitment performance and the idealized Stackelberg payoff – we can clearly see the rate of convergence in this plot. Finally, Figure 6(c) plots the percentage gap between robust commitment payoff and idealized Stackelberg payoff as a function of the number of observations.

We can make the following conclusions from these plots:

  1. The Stackelberg commitment is extremely non-robust on average. In fact we noticed that this was the case with high probability. This happens because the Stackelberg commitment, although it can vary widely for different games in the random ensemble, is very likely on a boundary shared with other responses and therefore unstable.

  2. The robust commitments do much better on average than the original Stackelberg commitment, even for very large numbers of observations. The stark difference in payoff between the two motivates the construction of the robust commitment, which was as easy to compute as the Stackelberg commitment.

5 Proof sketches

Figure 7: Illustration of the partition of the set of follower responses into the expected Stackelberg response (red region), alternate best responses sharing the Stackelberg boundary (purple regions), and everything else (orange regions).

In this section we briefly describe the philosophy behind the proofs of our main theorems. To understand the strong lack of robustness of the Stackelberg equilibrium, it is essential to visualize the best-response regions of the leader, i.e. the subsets of the mixed strategy space for which the follower best response is a particular pure strategy. (Note that there are as many best-response regions as follower pure strategies.) Figure 7 depicts an illustration of these best-response regions, with the region corresponding to the follower’s best response to the Stackelberg commitment highlighted in red. The figure shows the Stackelberg commitment at a vertex (extreme point) of the best-response polytope; this is generally the case [CS06].

First, the reason for the strong instability of the Stackelberg commitment to even an infinitesimal amount of uncertainty can be seen from Figure 7: an infinitesimal amount of fluctuation in how the leader commitment is observed will make the follower respond with a different pure strategy with constant probability, corresponding to the regions depicted in purple. Because of the tie-breaking assumption, it turns out that the expected payoff from any of these alternate responses is strictly worse than the Stackelberg payoff. These facts are proved formally using the Berry-Esseen theorem. Note that an uncertainty in commitment could also lead to a response from one of the orange regions in the figure (which could either hurt or benefit the leader), but the probability of this happening turns out to decay exponentially.

This observation implies that the optimality of the Stackelberg commitment under ideal assumptions is exactly what makes it suboptimal under a small amount of uncertainty; we exploit this insight in the robust commitment constructions of Theorem 2. The qualitative idea is to push the commitment a small distance into the interior of the best-response region so that it simultaneously stays “close” to the Stackelberg commitment, while also ensuring that the fluctuations in its empirical estimate are highly likely to stay within the region (which guarantees that the identity of the follower’s best response is preserved). For the special case of two leader actions, this is a simple tradeoff to navigate, as there is only one direction in which one can move into the interior. For higher dimensions, we take inspiration from the rich literature on interior-point methods and, in fact, use Dikin ellipsoids [KN12] for both the commitment construction and the analysis. Ensuring that the fluctuations of the commitment preserve the follower best response with high probability, in particular, requires sophisticated tail bounds on discrete distribution learning and a careful consideration of the best-response-polytope geometry.
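For reference, the standard definition of the Dikin ellipsoid at an interior point of a polytope is recalled below, written in illustrative notation (halfspace description G y ≤ h with rows g_i and slacks s_i); the notation is ours.

% Dikin ellipsoid at an interior point y_0 of the polytope {y : G y <= h},
% with slacks s_i(y_0) = h_i - g_i^T y_0 and log-barrier Hessian H(y_0):
\[
  H(y_0) = \sum_{i} \frac{g_i g_i^{\top}}{s_i(y_0)^2},
  \qquad
  \mathcal{E}(y_0) = \left\{\, y : (y - y_0)^{\top} H(y_0)\,(y - y_0) \le 1 \,\right\}.
\]
% The (unit-radius) Dikin ellipsoid is always contained in the polytope, which
% is what makes it a convenient local region in which to place a perturbed commitment.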

The proof of Theorem 3 ties together several facts that we have seen formally or alluded to. First, a generalization of Theorem 1 tells us that we cannot improve sizeably over Stackelberg by committing to any mixed strategy on the boundary between two or more best-response regions. Second, we show that the improvement gained by a fixed commitment in the interior of any best-response region decreases exponentially with the number of observations, simply because the probability of eliciting a better-than-expected response decreases exponentially. Putting these two facts together, the natural thing to try would be commitments that approach a boundary as the number of observations increases (much like our robust commitment constructions, but now with a different motive). This should happen fast enough that we maintain a sizeable probability of eliciting a different response for every number of observations, while simultaneously ensuring that this different response is actually better than expected. We then show that the ensuing gain over and above Stackelberg would have to decrease with the number of observations at the specified rate.

6 Conclusions and Discussion

We presented robust commitment constructions with several advantages. First, we are able to effectively preserve the Stackelberg payoff by ensuring a high-probability guarantee on the follower responding as expected. An oblique, but significant, philosophical advantage of our robust commitments is that their guarantees hold even if we removed the pivotal assumption of the follower breaking ties in favor of the leader. We essentially showed that as the number of observations grows, our construction naturally converges to the Stackelberg commitment at a specific rate. We also argued that the constructions, which were inspired by interior-point geometry, are computable in polynomial time given the Stackelberg commitment.

Second, we established fundamental limits on the ability of the leader to gain over and above Stackelberg payoff. We formally showed that this ability disappears in the large-sample regime, and in a certain sense that our robust commitments are approximately optimal. Our results established a formal connection between leader payoff and follower discrete distribution learning, and in the context of these limits, both players are mutually incentivized to increase learnability under limited samples, even though the setting is non-cooperative – which was a rather surprising conclusion.

Our work provides implications for both leader and follower payoffs when the leader is known to be committed to a fixed strategy, but the commitment can only be revealed partially. However, our model took commitment establishment for granted, i.e. the follower assumed that the leader would indeed be drawing its pure strategies iid from the same mixture in every round. The partial reputation setting should most generally be modeled as a repeated game (either with a finite-horizon or discounted model), in which the belief in commitment needs to be built up over time. Studying the problem of finite observability of commitment in isolation is, in our view, an important first step towards eventually solving this problem, which poses many modeling challenges in itself. In earlier rounds, directly responding to the empirical estimate of leader commitment will be suboptimal for the follower. Instead, he may want to maintain a possibility that the leader will play minimax/Nash and respond accordingly. From the point of view of the leader, if the iid assumption is removed, an interesting question is whether the leader could choose to play more deterministically, in such a way to increase strategy learnability while maintaining a follower impression of iid commitment. By doing this, the leader could establish commitment faster but also run the risk of looking too deterministic/predictable in time, in which case the follower would take undue advantage. Conversely, the leader may not even be incentivized to increase follower learnability in the finitely repeated, or discounted setting.

Finally, it is interesting to think about the applicability of the robust commitment perspective to algorithmically more difficult problems like Bayesian persuasion and public/private signalling games with multiple followers, which can have observational limitations on information transfer in much the same way as has been described for the applications in this paper.

Acknowledgments

We thank the anonymous reviewers for valuable feedback. We gratefully acknowledge the support of the NSF through grant AST-1444078, and the Berkeley ML4Wireless research center.

References

  • [AKK12] Bo An, David Kempe, Christopher Kiekintveld, Eric Shieh, Satinder Singh, Milind Tambe, and Yevgeniy Vorobeychik. Security games with limited surveillance. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1241–1248. AAAI Press, 2012.
  • [Ber41] Andrew C Berry. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122–136, 1941.
  • [BHP14] Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Lazy Defenders Are Almost Optimal against Diligent Attackers. In AAAI, pages 573–579, 2014.
  • [CK11] Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
  • [CS82] Vincent P Crawford and Joel Sobel. Strategic information transmission. Econometrica: Journal of the Econometric Society, pages 1431–1451, 1982.
  • [CS06] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 82–90. ACM, 2006.
  • [Dev83] Luc Devroye. The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. The Annals of Statistics, pages 896–904, 1983.
  • [DIR14] Shaddin Dughmi, Nicole Immorlica, and Aaron Roth. Constrained signaling in auction design. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1341–1357. Society for Industrial and Applied Mathematics, 2014.
  • [DKQ16] Shaddin Dughmi, David Kempe, and Ruixin Qiang. Persuasion with limited communication. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 663–680. ACM, 2016.
  • [DX16] Shaddin Dughmi and Haifeng Xu. Algorithmic Bayesian persuasion. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 412–425. ACM, 2016.
  • [Ess42] Carl-Gustaf Esseen. On the Liapounoff limit of error in the theory of probability. Almqvist & Wiksell, 1942.
  • [FL89] Drew Fudenberg and David K Levine. Reputation and equilibrium selection in games with a patient player. Econometrica: Journal of the Econometric Society, pages 759–778, 1989.
  • [FL92] Drew Fudenberg and David K Levine. Maintaining a reputation when strategies are imperfectly observed. The Review of Economic Studies, 59(3):561–579, 1992.
  • [HMRAD16] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909–3917, 2016.
  • [KCP11] Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Solving Stackelberg games with uncertain observability. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, pages 1013–1020. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
  • [KG11] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. The American Economic Review, 101(6):2590–2615, 2011.
  • [KN12] Ravindran Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. Mathematics of Operations Research, 37(1):1–20, 2012.
  • [KS15] Volodymyr Kuleshov and Okke Schrijvers. Inverse game theory. Web and Internet Economics, 2015.
  • [KW82] David M Kreps and Robert Wilson. Reputation and imperfect information. Journal of economic theory, 27(2):253–279, 1982.
  • [KYK11] Dmytro Korzhyk, Zhengyu Yin, Christopher Kiekintveld, Vincent Conitzer, and Milind Tambe. Stackelberg vs. Nash in security games: An extended investigation of interchangeability, equivalence, and uniqueness. Journal of Artificial Intelligence Research, 2011.
  • [MP95] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
  • [MR82] Paul Milgrom and John Roberts. Predation, reputation, and entry deterrence. Journal of economic theory, 27(2):280–312, 1982.
  • [MS17] Vidya Muthukumar and Anant Sahai. Fundamental limits on ex-post enforcement and implications for spectrum rights. In Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium on, pages 1–10. IEEE, 2017.
  • [PJT10] James Pita, Manish Jain, Milind Tambe, Fernando Ordóñez, and Sarit Kraus. Robust solutions to Stackelberg games: Addressing bounded rationality and limited observations in human cognition. Artificial Intelligence, 174(15):1142–1171, 2010.
  • [PPM08] Praveen Paruchuri, Jonathan P Pearce, Janusz Marecki, Milind Tambe, Fernando Ordonez, and Sarit Kraus. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 2, pages 895–902. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
  • [Rou04] Tim Roughgarden. Stackelberg scheduling strategies. SIAM Journal on Computing, 33(2):332–350, 2004.
  • [RT97] Yuval Rottenstreich and Amos Tversky. Unpacking, repacking, and anchoring: advances in support theory. Psychological review, 104(2):406, 1997.
  • [SAY12] Eric Shieh, Bo An, Rong Yang, Milind Tambe, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. Protect: A deployed game theoretic system to protect the ports of the United States. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 13–20. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
  • [Sel75] Reinhard Selten. Reexamination of the perfectness concept for equilibrium points in extensive games. International journal of game theory, 4(1):25–55, 1975.
  • [Sel78] Reinhard Selten. The chain store paradox. Theory and decision, 9(2):127–159, 1978.
  • [VDH97] Eric Van Damme and Sjaak Hurkens. Games with imperfectly observable commitment. Games and Economic Behavior, 21(1-2):282–308, 1997.
  • [VSZ10] Bernhard Von Stengel and Shmuel Zamir. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446–457, 2010.
  • [WZB13] Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.
  • [ZMBD08] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

Appendix A Proofs

Before moving into the proofs themselves, we define some additional notation.

Definition 3.

The set of alternate follower best responses to a mixed commitment is the set of follower best responses other than the one selected by the leader-favoring tie-breaking rule.

We will be particularly interested in this set for the Stackelberg commitment. In general, the set will be non-empty, as the follower could be agnostic between more than one pure strategy in response – he responds with the particular pure strategy he does only to break ties in the leader’s favor. Figure 7 shows this demarcation of follower responses into the expected response and the alternate responses to the Stackelberg commitment.

Further, we will use the maximum and minimum obtainable leader payoffs as constants in the bounds that follow.

A.1 Proof of Theorem 1

We consider a general game and describe the Stackelberg commitment by the probabilities it places on the leader’s pure strategies; recall that these probabilities are non-degenerate, since we have assumed for the proof that the Stackelberg commitment is mixed. Let there be an alternate best response to the Stackelberg commitment, lying on the boundary of the Stackelberg best-response region. Without loss of generality, the best-response regions can be described by linear inequalities in the commitment.

Finally, since we are considering leader-follower games for which the ideal payoff function is discontinuous at the Stackelberg commitment, the tie-breaking assumption on the Stackelberg commitment implies that the leader payoff under this alternate response is strictly worse than the Stackelberg payoff.

Now, we consider the expected payoff of the Stackelberg commitment under a finite number of observations. Denoting by the empirical estimate the follower’s maximum likelihood estimate of the commitment, we can decompose this expected payoff according to which best-response region the empirical estimate falls into.

We will now proceed to bound the probabilities of the empirical estimate falling into the alternate best-response region on the boundary, and into the remaining best-response regions.

First, we deal with the probability of a mismatched response that is neither the Stackelberg response nor the alternate response on the boundary. By the Hoeffding bound, the empirical estimate deviates from the commitment in any coordinate by a constant amount with probability that is exponentially small in the number of observations; a union bound over the relevant constraints then gives

(4)

and, as expected, this probability decays exponentially with the number of observations.
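One plausible form of the bound being invoked is the standard Hoeffding inequality for the empirical frequencies, written here in illustrative notation; the exact constants in the display (4) may differ.

% Hoeffding bound for the empirical frequency \hat{p}_i of leader action i
% after n observations of a commitment with true probability p_i:
\[
  \Pr\left( \left| \hat{p}_{i} - p_{i} \right| \ge t \right)
  \;\le\; 2 \exp\!\left( -2 n t^{2} \right),
\]
% so, with t taken to be the (constant) distance from the commitment to the
% nearest non-adjacent best-response region, and a union bound over coordinates,
% the probability of eliciting such a response decays exponentially in n.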

Next, we deal with the probability of eliciting the alternate response on the Stackelberg boundary. We show that this event is non-vanishingly probable.

We define the following quantities:

(5)
(6)

which describe the linear statistic separating the Stackelberg best-response region from the alternate best-response region. Recall that this statistic, evaluated at a single observation, is a real-valued random variable; we denote its cumulative distribution function accordingly.

By a simple change of variables, the probability of eliciting the alternate response can be written in terms of this cumulative distribution function evaluated at a suitably centered and scaled threshold.

Now, recall that the follower’s empirical estimate is an average of iid random variables. Also note that, since we have considered games with a mixed Stackelberg commitment, the variance of the relevant random variable is strictly positive. We now invoke the first half of the classical Berry-Esseen theorem [Ber41, Ess42], stated here as a lemma.

Lemma 1.

There exists a positive constant $C$ such that if $X_1, \dots, X_n$ are iid random variables with $\mathbb{E}[X_i] = \mu$, $\mathrm{Var}(X_i) = \sigma^2 > 0$, and $\mathbb{E}\big[|X_i - \mu|^3\big] = \rho < \infty$, then

\[
  \sup_{x \in \mathbb{R}} \left| \Pr\!\left( \frac{\sum_{i=1}^{n} (X_i - \mu)}{\sigma \sqrt{n}} \le x \right) - \Phi(x) \right| \;\le\; \frac{C \rho}{\sigma^{3} \sqrt{n}}
\]

for all $n$, where $\Phi$ denotes the CDF of the standard normal distribution.

It is easy to verify that the distribution of the relevant random variable satisfies the above conditions. Therefore, we can directly apply Lemma 1 and obtain a lower bound on the probability of eliciting the alternate response, for a positive constant, thus giving

(7)

Substituting the expressions for the two probabilities bounded above, we now have a bound that corresponds exactly to Equation (2). Clearly, the right hand side of this equation is decreasing in the number of observations, and so the first corollary – that the expected payoff of the Stackelberg commitment is strictly below the idealized Stackelberg payoff for all sufficiently large numbers of observations – holds. Taking the limit as the number of observations grows then yields the second corollary of Theorem 1 and completes the proof. ∎

A.2 Proof of Theorem 2

A.2.1 Notation

For this proof, it will be convenient to work with a lower-dimensional representation of the probability simplex, in which one coordinate is dropped (since it is determined by the others). A commitment is then represented by this lower-dimensional vector, and the leader payoff when the follower responds with a given pure strategy can be written as an affine function of this representation; the corresponding follower payoff can be represented similarly. We also use the same lower-dimensional representation for the empirical estimate of the commitment formed from the follower’s samples, and for the Stackelberg commitment itself.

All the functions introduced in Section 2.2 in terms of the commitment can now be equivalently defined in terms of its lower-dimensional representation.

We also make use of standard operator norms of matrices in the bounds below.

A.2.2 The commitment construction

We consider the lower-dimensional representation of the best-response region corresponding to the Stackelberg commitment. There are many things to consider while constructing a robust commitment. The first, and most obvious, is that the follower should respond in the same way as it would to the Stackelberg commitment when it observes the full mixture; alternatively stated, the robust commitment should lie in the Stackelberg best-response region.

Intuitively, the expected payoff of a leader commitment under observational uncertainty, particularly in terms of gap to the optimal Stackelberg payoff, will depend on two factors: one, how likely the follower is to respond the same as it would if it observed the full commitment; and two, how “far” the leader commitment mixture is from the optimal Stackelberg commitment mixture. We qualitatively show this dependence in the following lemma.

Lemma 2.

Consider a commitment in the Stackelberg best-response region for which we can guarantee that, with high probability, the follower’s best response to its empirical estimate coincides with the Stackelberg response. We then have a bound on the gap between this commitment’s expected payoff and the idealized Stackelberg payoff, in terms of the failure probability of this guarantee and the distance of the commitment from the Stackelberg commitment.

Proof.

We decompose the expected payoff of the commitment according to whether or not the follower responds with the Stackelberg response. Recall that the payoff under any response lies between the minimum and maximum obtainable leader payoffs. Therefore, the gap from the Stackelberg payoff is bounded by a term proportional to the failure probability plus a term proportional to the distance of the commitment from the Stackelberg commitment, where the second inequality follows from Hölder’s inequality. This proves the lemma. ∎
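A sketch of the decomposition that such a proof would use, in the illustrative notation from earlier (failure probability δ, Stackelberg response j*, maximum and minimum leader payoffs U_max and U_min); the exact constants in the formal statement of Lemma 2 may differ.

% Let E be the event that the follower responds with the Stackelberg response j*
% to its empirical estimate, with Pr(E) >= 1 - \delta, and let x be the commitment.
\[
  f_{n}(x) \;\ge\; (1-\delta)\, x^{\top} A e_{j^{*}} \;+\; \delta\, U_{\min},
\]
\[
  f^{*} - f_{n}(x)
  \;\le\; \big| (x^{*} - x)^{\top} A e_{j^{*}} \big| \;+\; \delta \left( U_{\max} - U_{\min} \right)
  \;\le\; \|A e_{j^{*}}\|_{\infty}\, \|x^{*} - x\|_{1} \;+\; \delta \left( U_{\max} - U_{\min} \right),
\]
% where the last step is Hölder's inequality, matching the role it plays in the proof above.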

This lemma implies that we want a commitment construction with the following two-fold guarantee. (Interestingly, the fact that the Stackelberg commitment lies at an extreme point of the best-response region implies that the two conditions are at odds with one another, and we will need to trade them off. For instance, choosing the Stackelberg commitment itself would be as close as possible to the Stackelberg commitment, but there would be no guarantee on the best response, as it lies on the boundary of the best-response region.)

  1. The distance of the commitment from the Stackelberg commitment is bounded (and ideally vanishes with the number of observations).

  2. The follower responds with the Stackelberg best response with high probability.

A.2.3 Commitment construction using localized geometry

We will leverage the special structure of the Dikin ellipsoid [KN12] used in interior-point methods to make our commitment constructions. Observe that the Stackelberg commitment is always going to be at an extreme point (vertex) of the best-response polytope (recall that the Stackelberg equilibrium is the solution to an LP defined on the best-response polytope [CS06]). We now collect the constraints that are satisfied with equality at the Stackelberg vertex.

This is simply the constraint set for commitments such that the follower prefers the Stackelberg response over any pure strategy whose best-response polytope shares a boundary with the Stackelberg best-response polytope at that vertex; it can be thought of as the set of local constraints at the Stackelberg vertex in the best-response polytope. We also collect the other constraints that describe the polytope, and together with the local constraints at the Stackelberg vertex, these describe the global constraints for the polytope.

We represent this system of inequalities in matrix form, and leverage the following useful fact about a general set of linear constraints.

Fact 1.

For any parameterization of linear constraints, there exists an invertible affine transformation of the variable space under which the constraint set takes a canonical form. We denote this transformation function and its inverse accordingly.

The above fact is useful because it is most convenient to define our class of commitments in the transformed space. (A subtle point is that there do exist special cases of polytope constraints for which Fact 1 holds only after augmenting the dimension of the variable space, in which case defining the invertible map becomes trickier. Nevertheless, for ease of exposition and clarity in the proof, we assume that we can indeed carry out the affine transformation without augmenting the dimension.)

Definition 4.

For a particular deviation magnitude, Stackelberg commitment, and set of local constraints, we define a deviation commitment by moving the (transformed) Stackelberg commitment into the interior of the local constraint set by that magnitude.

Our robust commitments are going to be taken from the set of such deviation commitments, with appropriately chosen deviation magnitudes. Clearly, the computational complexity of constructing any deviation commitment is equivalent to the complexity of computing the Stackelberg equilibrium itself.

To understand how to set these deviation magnitudes, we turn to the question of how to satisfy the conditions above.

First, we observe that the deviation commitment satisfies the local constraints for any admissible deviation magnitude. Because of Fact 1, it suffices to show that its affine transformation satisfies the transformed local constraints. Recall that the Stackelberg commitment satisfies all the local constraints with equality; from the definition of the deviation commitment, the desired inequalities then follow.

Next, we turn to the question of how far such a commitment is from the Stackelberg commitment in norm. For this, we have

Therefore, we have

(8)

In light of Lemma 2, we wish to choose values (to create commitments