1 Introduction
Consider a selfish, rational agent (designated as the “leader”) who is interacting noncooperatively with another selfish, rational agent (designated as the “follower”). If the agents are interacting simultaneously
, and know the game that they are playing, they will naturally play a Nash equilibrium(a) of the twoplayer game. This is the traditionally studied solution concept in game theory. Now, let’s say that the leader has the ability to reveal its strategy in advance – in the form of a
mixed strategy commitment, and the follower has the ability to observe this commitment and respond to it. The optimal strategy for the leader now corresponds to the Stackelberg equilibrium of the twoplayer leaderfollower game. The Stackelberg solution concept enjoys application in several engineering settings where commitment is the natural framework for the leader, such as security [PPM08], network routing [Rou04] and law enforcement [MS17]. The solution concept is interpreted in a broader sense as the ensuing strategy played by a patient leader that wishes to build a reputation by playing against an infinite number of myopic followers [KW82, MR82, FL89, FL92]. Crucially, one can show rigorously that the leader will benefit significantly from commitment power, i.e. its ensuing Stackelberg equilibrium payoff is at least as much as the simultaneous equilibrium payoff [VSZ10]. Further, several mechanism design problems that involve private information revelation  this includes signalling games [CS82] and persuasion games [KG11]  can be thought of as Stackelberg games, and the optimal mechanism can be interpreted as the Stackelberg commitment^{1}^{1}1albeit sometimes with multiple followers..But whichever interpretation one chooses, the Stackelberg solution concept assumes a very idealized setting (even over and above the assumptions of selfishness and infinite rationality) in which the mixed strategy commitment is exactly revealed
to the follower. Further, the follower 100% believes that the leader will actually stick to her commitment. What happens when these assumptions are relaxed? What if the leader could only demonstrate her commitment in a finite number of interactions – how would she modify her commitment to maximize payoff, and how much commitment power would she continue to enjoy? Is she even incentivized to help the follower estimate her commitment effectively?
What changes in a finiteinteraction regime is that the follower only observes a part of the leader behavioral history and needs to learn about the leader’s strategic behavior
– to the extent that he is able to respond as optimally as possible. By restricting our attention to finitely repeated play, we arrive at a problem setting that is fairly general: these follower agents in general will not know about the preferences of the leader agent. When provided with historical context, we assume that they will use it rather than ignore it. A broad umbrella of problems that has received significant attention in the machine learning literature is learning of strategic behavior from samples of play; for example, learning the agent’s utility function through
inverse reinforcement learning
[ZMBD08], learning the agent’s level of rationality [WZB13], and inverse game theory [KS15]. While significant progress has been made in this goal of learning strategic behavior, attention has been restricted to the passive learning setting in which the leading agent is unaware of the presence of the learner, or agnostic to the learner’s utility. In many situations, the agent himself will be invested in the outcome of the learning process. In this paper, we put ourselves in the shoes of an agent who is shaping her historical context and is aware of the learner’s presence as well as preferences, and study her choice of optimal strategy^{2}^{2}2It is worth mentioning the recent paradigm of cooperative inverse reinforcement learning [HMRAD16] which studies the problem of agent investment in principal learning where the incentives are not completely aligned, but the setting is cooperative. In contrast, we focus on noncooperative settings.. As we will see, the answer will depend on her utility function itself, as well as what kind of response she is able to elicit from the learner.1.1 Related work
The Stackelberg solution concept is used in the engineering and economics literature to model a number of scenarios. For one, the Stackelberg security game is played between a defender, who places different utility levels on different targets to be protected and accordingly uses her resources to defend some subset of targets; and an attacker, who observes the defender strategy and wishes to attack some of the targets depending on whether he thinks they are left open as well as how much he values those targets. Stackelberg games can also be modeled with a single leader and multiple followers, such as in computer network routing applications [Rou04]. Many mechanism design problems involve computing an optimal mechanism to commit to, or an optimal way of revealing private information  this includes auctions and, more recently, Bayesian persuasion mechanisms [KG11].
Economists have established an important link between the Stackelberg solution concept and the asymptotic limit of reputation building. Reputation effects were first observed in the chainstore paradox, a firmcompetition game where an incumbent firm would often deviate from Nash equilibrium behavior and play its aggressive Stackelberg (pure, in this case) strategy [Sel78]. Theoretical justification was provided for this particular application [KW82, MR82] by modeling a player interaction between a leader and multiple followers, and studying the Nash equilibrium of an ensuing game as . It was shown that the leader would play its pure Stackelberg strategy^{3}^{3}3This is clearly more restrictive than the mixed strategy Stackelberg solution concept, and not necessarily advantageous over Nash, but it turns out to be so in the firm competition case. in the BayesNash equilibrium of this game endowed with a common prior on the leader’s payoff structure^{4}^{4}4A more nuanced model considered a Bayesian prior over a leader being constrained to play its pure Stackelberg strategy as opposed to unconstrained play.. This model was generalized to such leaderfollower ensembles for a general twoplayer game, and considering the possibility of mixedstrategy reputation, still retaining the asymptotic nature of results [FL89, FL92]. The “firstplayer” advantage, and the entire Stackelberg solution concept, rely on an important assumption: that the commitment is perfectly revealed to the follower. This is usually not the case: in security games, the attacker will usually observe a finite number of deployments of the defender’s resource, as opposed to the allocation strategy itself (which is often mixed). In theoretical models for Bayesian persuasion, the persuader conveys a conditional distribution on her signal given the privately observed state of the world  but what will be practically observed is her history, and thus realizations of the signal, not the distribution itself. In all of these models, the leader establishes her reputation only partially, and the manifestation of the revelation is itself random. It is natural to ask how she should plan to optimally reveal her information under this constraint.
The idea of a robust solution concept in game theory is certainly not new. The concept of tremblinghandperfectequilibrium [Sel75] explicitly studies how robust mixedstrategy Nash equilibria are to slight perturbations in the mixtures themselves, and a similar concept was proposed for Stackelberg [VDH97]. Another solution concept, quantalresponseequilibrium [MP95], studies agents that are boundedly rational, an orthogonal but important source of uncertainty in response. In the Stackelberg setting, it was noted that robust commitments exist that preserve the Stackelberg guarantee for small enough amounts of noise in the commitment; however, this is still an asymptotic perspective and does not directly help us answer our key computational questions: can we construct a robust commitment efficiently when the game is multidimensional, and does the leader want to use the noise to reveal or obfuscate her commitment?
The problem of computing the optimal commitment under finitely limited observability corresponds to a robust optimization problem that is, in fact, NPhard [AKK12, SAY12]; so directly reasoning about the optimal commitment is not easy. Whether there exists a polynomialtime approximation scheme for this problem was also unclear. A duo of papers [AKK12, SAY12]
considered a model of fullfledged observational uncertainty with a Bayesian prior and posterior update based on samples of behavior, and proposed heuristic algorithmic techniques to compute the optimum. In fact, they also factored for quantal responses using a bounded rationality model
[MP95]. This work showed through simulations that there could be a positive return over and above Stackelberg payoff. In one important piece of analytical work, the problem was also considered for the special case of zerosum games [BHP14], and it was shown that the Stackelberg commitment itself approximated the optimal payoff. In this result, the extent of approximation actually depends on the amount of observational uncertainty itself  the results we prove for all nonzerosum games have a similar flavor.The problem of communication constraints in the commitment has also received a lot of interest in the recent algorithmic persuasion literature, but with quantitatively different models for the uncertainty. Communication constraints on signaling in bilateral trading games [DKQ16] and auction design [DIR14] have been studied from a compression perspective, where the leading agent can design the observation channel  while in our model the observation channel is even more constrained to be a finite number of random realizations of the mixed commitment. Further, in many of these settings, the principal is naturally incentivized to reveal the private information and the problem primarily becomes about the communication complexity, and whether the social welfare of the optimal mechanism can be approximated^{5}^{5}5These are clearly interesting algorithmic questions in themselves, especially in the case of multiple receivers and private vs public signaling [DX16], but do not directly address the questions we have raised.. In security games, the possibility of the mixed commitment being either fully observed or not observed at all has been considered [KCP11]; as well as different ways of handling the uncertainty, eg: showing that for some security games the Stackelberg and Nash equilibria coincide and observation of commitment does not matter [KYK11]. Pita et al [PJT10] first proposed a model for the defender (leader) to account for attacker (follower) observational uncertainty by allowing the follower to anchor [RT97] to a certain extent on its uniform prior. While they showed significant returns from using their model through extensive experiments, they largely circumvented the algorithmic and analytical challenges by not explicitly considering random samples of defender behavior, thus keeping the attacker response deterministic but shifted. Our work limits observation in the most natural way for the applications that we consider (i.e. number of samples of leader behavior), and because the manifestation of the uncertainty is itself random, our results have distinct and new implications.
1.2 Our contributions
Our main contribution is to understand the extent of reputational advantage when interaction is finite, and prescribe approximately optimal commitment rules for this regime.
We study Stackelberg leaderfollower games in which a follower obtains a limited number of observations of the leader commitment. We first prove that in most nonzerosum games the payoff of the Stackelberg commitment is not robust to even an infinitesmal amount of observational uncertainty. Therefore, the Stackelberg commitment is suboptimal in its payoff^{6}^{6}6This property has actually been proved for special examples of Stackelberg games [VDH97], but it was unclear whether it holds for all or most games.. Next, we propose robust commitment rules for leaders and show that we can approach the Stackelberg payoff as the number of observations increases. The robust commitment construction involves optimizing a tradeoff between preserving the follower best response and staying close to the ideal Stackelberg commitment, by moving the commitment a little bit into the interior of an appropriate convex polytope [CS06]. The analysis of payoff of the commitment construction is inspired by interior point convex geometry [KN12]. Finally, we show that a possible advantage for the leader from limited observability is only related to follower response mismatch, and show that this advantage is limited. Computationally speaking, the corollary is that we are able to approximate the optimal payoff through a simple construction which can be obtained in constant time from computation of the Stackelberg commitment (itself a polynomialtime operation [CS06]). Philosophically, this result implies that a leader can gain to a very limited extent by misrepresenting her commitment and eliciting a suboptimal response from the follower. We corroborate our theoretical results with simulations on illustrative examples and random ensembles of security games.
2 Problem statement
2.1 Preliminaries
We represent a twoplayer leaderfollower game in normal form by the pair of matrices , where denotes the leader payoff matrix and denotes the follower payoff matrix. We denote the leader mixed strategy space by (where for any represents the
dimensional probability simplex) and the follower mixed strategy space by
. From now on, we define an effective dimension of a game as a number for which the effective payoff matrices of leader and follower respectively are , , and the effective set of leader strategies is given by a convex polytope ^{7}^{7}7This definition is important in the context of Stackelberg security games, for which the leader strategy space looks exponential in the number of targets  but the actual manifestation of all leader strategies is in fact dimensional. In particular, a defender strategy manifests as a distribution over different targets being covered. .We consider a setting of asymmetric private information in which the leader knows about the follower preferences (i.e. she knows the matrix ) while the follower does not know about the leader preferences (i.e. he possesses no knowledge of the matrix )^{8}^{8}8This is an important assumption for the paper, and is in fact used in traditional reputation building frameworks. In future, we will want to better understand situations of repeated interaction where the level leader and level follower are both learning about one another..
With infinite experience, the wellestablished effect from the follower’s point of view is that the leader has established commitment, or developed a reputation, for playing according to some mixed strategy . We denote the follower’s set of theoretically best purestrategy responses to a mixed strategy commitment by . Explicitly, we have
An important assumption that we make (and that has been made in the classical literature [CS06]) is the follower actually responds with the pure strategy in the set that is most beneficial to the leader^{9}^{9}9The technical reason for this tiebreaking rule is to be able to explicitly define the Stackelberg commitment as an explicit maximum  this in itself gives a subtle clue of its fragility. . That is, the follower responds with pure strategy
Then, we also define bestresponse regions as the set of leader commitments that would elicit the pure strategy response from the follower, i.e. .
With these definitions, we can define the leader’s ideal payoff to be expected with an infinite reputation:
Definition 1.
A leader with an infinite reputation of playing according to the strategy should expect payoff
Therefore, the leader’s Stackelberg payoff is the solution to the program
The argmax of this program is denoted as the Stackelberg commitment . Further, we denote the best response faced in Stackelberg equilibrium by .
It is clear that the Stackelberg commitment is optimal for the leader under two conditions: the leader is 100% known to be committed to a fixed strategy, and the follower knows exactly the leader’s committedto strategy. For a finite number of interactions, neither is true.
2.2 Observational uncertainty with established commitment
Even assuming that there is a shared belief in commitment, there is uncertainty. In particular, with a finite number of plays, the follower does not know the exact strategy that the leader has committed to, and only has an estimate.
Consider the situation where a leader can only reveal its commitment through pure strategy plays . The commitment is known (to both leader and followers) to come from a set of mixed strategies . We denote the maximum likelihood estimate of the leader’s mixed strategy, as seen by the follower, by . It is reasonable to expect, under certainty of commitment, that a “rational”^{10}^{10}10Rational is in quotes because the follower is not necessarily using expectedutility theory (although there is an expectedutilitymaximization interpetation to this estimate if the mixed strategy were uniform drawn from ). follower would bestrespond to , i.e. play the pure strategy
(1) 
We can express the expected leader payoff under this learning rule.
Definition 2.
A leader which will have plays according to the hidden strategy can expect payoff in the th play of
against a follower that plays according to (1). The maximal payoff a leader can expect is
and it acquires this payoff by playing the argmax strategy .
Ideally, we want to understand how close is to , and also how close is to . An answer to the former question would tell us how observational uncertainty impacts the firstplayer advantage. An answer to the latter question would shed light on whether the best course of action deviates significantly from Stackelberg commitment. We are also interested in algorithmic techniques for approximately computing the quantity , as doing so exactly would involve solving a nonconvex optimization problem.
3 Main Results
3.1 Robustness of Stackelberg commitment to observational uncertainty
A natural first question is whether the Stackelberg commitment, which is clearly optimal if the game were being played infinitely (or equivalently, if the leader had infinite commitment power and exact public commitment), is also suitable for finite play. In particular, we might be interested in evaluating whether we can do better than the baseline Stackelberg performance . We show through a few paradigmatic examples that the answer can vary.
Example 1.
We consider a zerosum game, represented in normal form in Figure 1(a), in which we can express the leader strategy according to the probability with which she will pick strategy , and leader payoff is as follows:
Since the game is zerosum, the follower responds in a way that is worstcase for the leader. This means that we can express the leader payoff as
This leader payoff structure is depicted in Figure 1(a). Therefore, we can express the Stackelberg payoff as
attained at . We wish to evaluate . It was noted [BHP14] that by the minimax theorem, but not always clear whether strict inequality would hold (that is, if observational uncertainty gives a strict advantage). For this example, we can actually get a sizeable improvement! To see this, look at the simple example of . Denoting , we have
The semilog plot in Figure 2 shows that this improvement persists for larger values of , although the extent of improvement decreases exponentially with . We can show that
where
denotes the KullbackLeibler divergence, and the last inequality is due to Sanov’s theorem
[CK11].This shows analytically that the advantage does indeed decrease exponentially with . Naturally, this is because we see a decrease in the stochasticity that elicits the more favorable follower response with action with the number of observations .
Example 1 showed us how Stackelberg commitment power could be increased by stochastically eliciting more favorable responses. We now see an example illustrating that the commitment power can disappear completely.
Example 2.
We consider a nonzerosum game, represented in normal form and leader payoff structure in Figure 1(b). Explicitly, the ideal leader payoff function is
This is essentially the example reproduced in [BHP14], which we repeat for storytelling value. Notice that , but the advantage evaporates with observational uncertainty. For any finite , we have
Remarkably, this implies that and so ! This is clearly a very negative result for the robustness of Stackelberg commitment, and as a very pragmatic matter tells us that the idealized Stackelberg commitment is far from ideal in finiteobservation settings. This example shows us a case where stochasticity in follower response is not desired, principally because of the discontinuity in the leader payoff function at .
Example 2 displayed to the fullest the significant disadvantage of observational uncertainty. The game considered was special in that there was no potential for limitedobservation gain, while in the game presented in Example 1 there was only potential for limitedobservational gain. What could happen in general? Our next and final example provides an illustration.
Example 3.
Our final example considers a nonzerosum game, whose normal form and leader payoff structure are depicted in Figure 1(c). The ideal leader payoff function is
As in the other examples, . Notice that this example captures both positive and negative effects of stochasticity in response. On one hand, follower response is highly undesirable (a la Example 2) but follower response is highly desirable (a la Example 1). What is the net effect? We have
A quick calculation thus tells us that if , showing that Stackelberg in fact has poor robustness for this example. Intuitively, the probability of the “bad” stochastic event remains constant while the probability of the “good” stochastic event decreases exponentially with . Even more damningly, we see that , again showing that the Stackelberg commitment is far from ideal. We can see the dramatic decay of leader advantage over and above Stackelberg, and ensuing disadvantage even for a very small number of observations, in Figure 3(a).
While the three examples detailed above provide differing conclusions, there are some common threads. For one, in all the examples it is the case that committing to the Stackelberg mixture can result in the follower being agnostic between more than one response. Only one of these responses, the pure strategy , is desirable for the leader. A very slight misperception in the estimation of the true value can therefore lead to a different, worsethanexpected response and this misperception happens with a sizeable, nonvanishing probability. On the flipside, a different response could also lead to betterthanexpected payoff, raising the potential for a gain over and above . However, these betterthanexpected responses cannot share a boundary with the Stackelberg commitment, and we will see that the probability of eliciting them decreases exponentially with . The net effect is that the Stackelberg commitment is, most often, not robust – and critically, this is even the case for small amounts of uncertainty.
Our first result is a formal statement of instability of Stackelberg commitments for a general game. We denote the leader probability of playing strategy by , and the Stackelberg commitment’s probability of playing strategy by .
Furthermore, let
denote the CDF of the standard normal distribution
. We are now ready to state the result.Theorem 1.
For any leaderfollower game in which and discontinuous at , we have
(2) 
where are strictly positive constants depending on the parameters of the game. This directly implies the following:

For some , we have for all .

We have .
The proof of Theorem 1 is contained in Section A.1. The technical ingredients in the proof are the BerryEsseen theorem [Ber41, Ess42], used to show that the detrimental alternate responses on the Stackelberg boundary are nonvanishingly likely – and the Hoeffding bound, used to tail bound the probability of potentially beneficial alternate responses not on the boundary^{11}^{11}11It is worth noting that a similar argument as presented here could be extended to a general
game, using iid random vectors instead of random variables and considering a demarcation into bestresponse regions as illustrated in Figure
7. We restrict attention to the case for ease of exposition.For nonrobustness of Stackelberg commitment to hold, the two critical conditions for the game are that there is a discontinuity at the Stackelberg boundary, and that the Stackelberg commitment is mixed. For a zerosum game, the first condition does not hold and the Stackelberg commitment stays robust as we saw in Example 1.
The theorem directly implies that the ideal Stackelberg payoff is only obtained for the exact case of (when the commitment is perfectly observed), and that for any value of there is a nonvanishing reduction in payoff. In the simulations in Section 4, we will see that this gap is empirically significant.
3.2 Robust commitments achieving closetoStackelberg performance
The surprising message of Theorem 1 is that, in general, the Stackelberg commitment is undesirable. The commitment is pushed to the extreme point of the bestresponseregion to ensure optimality under idealized conditions; and this is precisely what makes it suboptimal under uncertainty. What if we could move our commitment a little bit into the interior of the region instead, such that we can get a highprobabilityguarantee on eliciting the expected response, while staying sufficiently close to the idealized optimum? Our next result quantifies the ensuing tradeoff and shows that we can cleverly construct the commitment to approximate Stackelberg performance.
Theorem 2.
Let the bestresponse polytope be nonempty in . Then, provided that the number of samples , we can construct commitment for every such that
(3) 
Furthermore, these constructions are computable in time with knowledge of the Stackelberg commitment . (The contains constant factors that depend on both the local and global geometry of the bestresponseregion . For a fully formal statement that includes these factors, see Lemma 6.)
The full proof of Theorem 2, deferred to Appendix A.2, involves some technical steps to achieve as good as possible a scaling in . The caveat of Theorem 2 is that commitment power can be robustly exploited in this way only if there are enough observations of the commitment. One obvious requirement is that the bestresponseregion needs to be nonempty in . Second, the number of observations needs to be greater than the effective dimension of the game for the leader, . This is a natural requirement to ensure that the follower has learned at least a meaningful estimate of the commitment. Third, the “constant” factors in Theorem 2 actually reflect properties about both the local and global geometry of the polytope; see Appendix A.2 for more details. Intuitively, geometric properties that lead to undesirable scaling in the constant factors in the robustness guarantee are listed below:

The Stackelberg commitment being a “pointy” vertex: this can lead to a commitment being far away from the boundary in certain directions, but closer in others, making it more likely for a different response to be elicited.

Local constraints being very different from global constraints, which implies that commitments too far in the interior of the local feasibility set will no longer satisfy all the constraints of the bestresponseregion.
Even with these caveats, Theorem 2 provides an attractive general framework for constructing robust commitments by making a natural connection to interiorpoint methods in optimization^{12}^{12}12Noting that interior point methods are provably polynomialtime algorithms to solve LPs, it is plausible to think that in fact, stopping the interior point method appropriately early would also give us a robustness guarantee  which would imply that finding optimal robust commitments is even easier than finding optimal commitments!. We observe significant empirical benefit from the constructions in the simulations in Section 4.
We also mention a couple of special cases of leaderfollower games for which the robust commitment constructions of Theorem 2 are not required; in fact, it is simply optimal to play Stackelberg.
Remark 1.
For games in which the mixedstrategy Stackelberg equilibrium coincides with a pure strategy, the follower’s best response is always as expected regardless of the number of observations. There is no tradeoff and it is simply optimal to play Stackelberg even under observational uncertainty.
Remark 2.
For the zerosum case, it was observed [BHP14] that a Stackelberg commitment is made assuming that the follower will respond in the worst case. If there is observational uncertainty, the follower can only respond in a way that yields payoff for the leader that is better than expected. This results in an expected payoff greater than or equal to the Stackelberg payoff , and it simply makes sense to stick with the Stackelberg commitment . As we have seen, this logic does not hold up for nonzerosum games because different responses can lead to worsethanexpected payoff. One way of thinking of this is that the function can generally be discontinuous in for a nonzerosum game, but is always continuous for the special case of zerosum.
3.3 Approximating the maximum possible payoff
So far, we have considered the limitedobservability problem and shown that the Stackelberg commitment is not a suitable choice. We have constructed robust commitments that come close to idealized Stackelberg payoff and shown that the guarantee fundamentally depends on the number of observations scaling with the effective dimension of the game. Now, we turn to the question of whether we can approximate , the actual optimum of the program. Note that since the problem is in general nonconvex in , it is NPhard to exactly compute.
Rather than the traditional approach of constructing a polynomialtimeapproximationalgorithm, our approach is approximationtheoretic^{13}^{13}13In other words, the extent of approximation is measured by the number of samples as opposed to the runtime of an algorithm. This is very much the flavor of previouslyobtained results on Stackelberg zerosum security games [BHP14].. We first show that in the largesample case, we cannot do much better than the actual Stackelberg payoff ; informally speaking, our ability to fool the follower into responding strictlybetterthanexpected is limited. Combining this with the robust commitment construction of Theorem 2, we obtain an approximation to the optimum payoff.
The main result of this section is stated below.
Theorem 3.
We have
for some constant depending on the parameters of the game .
As a corollary the commitment construction defined in Theorem 2 provides a additiveapproximation algorithm to . The proof of Theorem 3 is provided in the appendix.
Intellectually, Theorem 3 tells us that the robust commitments are essentially optimal. The practical benefit that Theorem 3 affords us is that we now have an approximation to the optimum payoff the leader could possibly obtain, which can be computed in constant time after computing the Stackelberg equilibrium, which itself is polynomial time [CS06]. This is because the robust commitment is obtained by first computing Stackelberg equilibrium , and then deviating away from in the magnitude and direction specified. We will now study the empirical benefits of our robust commitment constructions.
4 Simulations
4.1 Example and games
First, we return to the nonzerosum games described in Examples 2 and 3. These were and games respectively, and the Stackelberg commitment was nonrobust for both games. Now, armed with the results in Theorem 2, we can employ our robust commitment constructions and study their performance. To construct our robust commitments, we first computed the Stackelberg commitment using the LP solver in scipy (scipy.optimize.linprog), and then used the construction in Theorem 2.
Figures 3(a) and 4(b) compares the expected payoff obtained by our robust commitment construction scheme for different numbers of samples , and for the games described in Examples 2 and 3 respectively. The benchmark with respect to which we measure this expected payoff is the Stackelberg payoff (obtained by Stackelberg commitment under infinite observability and tiebreakability in favor of the leader). We also observe a significant gap between the payoffs obtained by these robust commitment constructions and the payoff obtained if we used the Stackelberg commitment . We showed in theory that there is significant benefit for choosing the commitment to factor in such observational uncertainty, and we can now see it in practice.
Furthermore, for the case of leader actions we were able to bruteforce the maximum possible obtainable payoff^{14}^{14}14First we used scipy.optimize.brute with an appropriate grid size to initialize, and then ran a gradient descent at that initialization point. This was feasible for the case of pure strategies. , and compare the value to the robust commitment payoff. This comparison is particularly valuable for smaller values of , as shown in Figures 3(b) and 4(c). We notice that the values are much closer even than our theory would have predicted, and even for small values of . Thus, our constructions have significant practical benefit as well: we are able to get close to the optimum while drastically reducing the required computation (to just solving LPs!).
Since these examples involved games, the commitment construction became trivial (i.e. only one direction to move along) – next, we test our commitment constructions for security games. Instead of looking at specific examples, we now look at a random ensemble to see what behavior ensues.
4.2 Random security games
Our next set of simulations is inspired by the security games framework. We create a random ensemble of security games in which the defender can defend one of targets, and the attacker can attack one of these targets. The defender and attacker rewards are chosen to be uniformly at random between , and their penalties are uniform at random between . This is essentially the random ensemble that was created in previous empirical work on security games [AKK12]. Figure 5 shows the construction of this ensemble.
The purpose of random security games is to show that the properties we observed above – unstable Stackelberg commitment, robust commitment payoff approximating the optimum – are the norm rather than the exception. Figure 6 illustrates the results for random security games. The performance of the sequence of robust commitments , as well as the Stackelberg commitment is plotted in Figure 6(a) against the benchmark of idealized Stackelberg performance . Figure 6(b) depicts the rate of convergence of the gap in robust commitment performance to the idealized Stackelberg payoff – we can clearly see the rate of convergence in this plot. Finally, Figure 6(c) plots the percentage gap between robust commitment payoff and idealized Stackelberg payoff as a function of .
We can make the following conclusions from these plots:

The Stackelberg commitment is extremely nonrobust on average. In fact we noticed that this was the case with high probability. This happens because the Stackelberg commitment, although it can vary widely for different games in the random ensemble, is very likely on a boundary shared with other responses and therefore unstable.

The robust commitments are doing much better on average than the original Stackelberg commitment even for very large values of . The stark difference in payoff between the two motivates the construction of the robust commitment, which was as easy to compute as the Stackelberg commitment.
5 Proof sketches
In this section we describe briefly the philosophy for the proofs of our main theorems. To understand the strong lack of robustness in Stackelberg equilibrium, it is essential to visualize the bestresponseregions of the leader, i.e. subsets of the mixed strategy space for which the follower best response is a particular pure strategy. (Note that there are such bestresponse regions.) Figure 7 depicts an illustration of these bestresponseregions, with the region corresponding to the follower’s best response to the Stackelberg commitment highlighted in red. The figure shows the Stackelberg commitment at a vertex (extremepoint) of the bestresponse polytope ; this is generally the case [CS06].
First, the reason for the strong instability of Stackelberg commitment to even an infinitesmal amount of uncertainty can be seen from Figure 7: an infinitesmal amount of fluctuation in how the leader commitment is observed will make the follower respond with a different pure strategy with constant probability, corresponding to the regions depicted in purple. Because of the tiebreaking assumption, it turns out that the expected payoff from any of these alternate responses is strictly worse than the Stackelberg payoff. These facts are proved formally using the BerryEsseen theorem. Note that an uncertainty in commitment could also lead to a response from one of the yellow regions in the figure (which could either hurt or benefit the leader), but the probability of this happening turns out to decay exponentially.
This observation implied that the optimality of the Stackelberg commitment under ideal assumptions was exactly what made it suboptimal under a small amount of uncertainty; we exploit this to construct the robust commitment constructions of Theorem 2. The qualitative idea is to push the commitment to a small extent into the interior of the bestresponseregion so that it simultaneously satisfies a property of being “close” to the Stackelberg commitment, while also ensuring that the fluctuations in its empirical estimate are highly likely to stay in (which guarantees that the identity of the best response of the follower is preserved). For the special case of , this is a simple tradeoff to navigate as there is only one direction in which one can move into the interior. For higher dimensions, we take inspiration from the rich literature on interiorpoint methods and, in fact, use Dikin ellipsoids [KN12] for both the commitment construction and analysis. Ensuring that the fluctuations of the commitment preserve the follower best response with high probability, in particular, requires sophisticated tail bounds on discrete distribution learning and a careful consideration of the bestresponsepolytope geometry.
The proof of Theorem 3 ties several facts that we have seen formally, as well as alluded to, together. First, a generalization of Theorem 1 tells us that we cannot improve sizeably over Stackelberg by committing to any mixed strategy on the boundary between two or more bestresponse regions. Second, we show that the improvement gained by a fixed commitment in the interior of any bestresponseregion decreases exponentially with , simply because the probability of eliciting a betterthanexpected response decreases exponentially with . Putting these two facts together, the natural thing to try would be commitments that approach a boundary as increases (much like our robust commitment constructions, but now with a different motive). This should happen fast enough that we maintain a sizeable probability of eliciting a different response for every , while simultaneously ensuring that that different response is actually betterthanexpected. We then show that the ensuing gain over and above Stackelberg would have to decrease with according to the rate specified.
6 Conclusions and Discussion
We constructed robust commitment constructions with several advantages. First, we are able to effectively preserve the Stackelberg payoff by ensuring a highprobability guarantee on the follower responding as expected. An oblique, but significant philosophical advantage to our robust commitments is that their guarantees hold even if we removed the pivotal assumption of follower breaking ties in favor of the leader. We essentially showed that as the number of observations grows, our construction naturally converges to the Stackelberg commitment at a specific rate. We also motivated that the constructions, which were inspired by interior point geometry, are computable in polynomial time given the Stackelberg commitment.
Second, we established fundamental limits on the ability of the leader to gain over and above Stackelberg payoff. We formally showed that this ability disappears in the largesample regime, and in a certain sense that our robust commitments are approximately optimal. Our results established a formal connection between leader payoff and follower discrete distribution learning, and in the context of these limits, both players are mutually incentivized to increase learnability under limited samples, even though the setting is noncooperative – which was a rather surprising conclusion.
Our work provides implications for both leader and follower payoffs when the leader is known to be committed to a fixed strategy, but the commitment can only be revealed partially. However, our model took commitment establishment for granted, i.e. the follower assumed that the leader would indeed be drawing its pure strategies iid from the same mixture in every round. The partial reputation setting should most generally be modeled as a repeated game (either with a finitehorizon or discounted model), in which the belief in commitment needs to be built up over time. Studying the problem of finite observability of commitment in isolation is, in our view, an important first step towards eventually solving this problem, which poses many modeling challenges in itself. In earlier rounds, directly responding to the empirical estimate of leader commitment will be suboptimal for the follower. Instead, he may want to maintain a possibility that the leader will play minimax/Nash and respond accordingly. From the point of view of the leader, if the iid assumption is removed, an interesting question is whether the leader could choose to play more deterministically, in such a way to increase strategy learnability while maintaining a follower impression of iid commitment. By doing this, the leader could establish commitment faster but also run the risk of looking too deterministic/predictable in time, in which case the follower would take undue advantage. Conversely, the leader may not even be incentivized to increase follower learnability in the finitely repeated, or discounted setting.
Finally, it is interesting to think about the applicability of the robust commitment perspective to algorithmically more difficult problems like Bayesian persuasion and public/private signalling games with multiple followers, which can have observational limitations on information transfer in much the same way as has been described for the applications in this paper.
Acknowledgments
We thank the anonymous reviewers for valuable feedback. We gratefully acknowledge the support of the NSF through grant AST1444078, and the Berkeley ML4Wireless research center.
References

[AKK12]
Bo An, David Kempe, Christopher Kiekintveld, Eric Shieh, Satinder Singh, Milind
Tambe, and Yevgeniy Vorobeychik.
Security games with limited surveillance.
In
Proceedings of the TwentySixth AAAI Conference on Artificial Intelligence
, pages 1241–1248. AAAI Press, 2012.  [Ber41] Andrew C Berry. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122–136, 1941.
 [BHP14] Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Lazy Defenders Are Almost Optimal against Diligent Attackers. In AAAI, pages 573–579, 2014.
 [CK11] Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
 [CS82] Vincent P Crawford and Joel Sobel. Strategic information transmission. Econometrica: Journal of the Econometric Society, pages 1431–1451, 1982.
 [CS06] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 82–90. ACM, 2006.

[Dev83]
Luc Devroye.
The equivalence of weak, strong and complete convergence in l1 for kernel density estimates.
The Annals of Statistics, pages 896–904, 1983.  [DIR14] Shaddin Dughmi, Nicole Immorlica, and Aaron Roth. Constrained signaling in auction design. In Proceedings of the twentyfifth annual ACMSIAM symposium on Discrete algorithms, pages 1341–1357. Society for Industrial and Applied Mathematics, 2014.
 [DKQ16] Shaddin Dughmi, David Kempe, and Ruixin Qiang. Persuasion with limited communication. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 663–680. ACM, 2016.

[DX16]
Shaddin Dughmi and Haifeng Xu.
Algorithmic bayesian persuasion.
In
Proceedings of the fortyeighth annual ACM symposium on Theory of Computing
, pages 412–425. ACM, 2016.  [Ess42] CarlGustaf Esseen. On the Liapounoff limit of error in the theory of probability. Almqvist & Wiksell, 1942.
 [FL89] Drew Fudenberg and David K Levine. Reputation and equilibrium selection in games with a patient player. Econometrica: Journal of the Econometric Society, pages 759–778, 1989.
 [FL92] Drew Fudenberg and David K Levine. Maintaining a reputation when strategies are imperfectly observed. The Review of Economic Studies, 59(3):561–579, 1992.
 [HMRAD16] Dylan HadfieldMenell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909–3917, 2016.
 [KCP11] Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Solving Stackelberg games with uncertain observability. In The 10th International Conference on Autonomous Agents and Multiagent SystemsVolume 3, pages 1013–1020. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
 [KG11] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. The American Economic Review, 101(6):2590–2615, 2011.

[KN12]
Ravindran Kannan and Hariharan Narayanan.
Random walks on polytopes and an affine interior point method for linear programming.
Mathematics of Operations Research, 37(1):1–20, 2012.  [KS15] Volodymyr Kuleshov and Okke Schrijvers. Inverse game theory. Web and Internet Economics, 2015.
 [KW82] David M Kreps and Robert Wilson. Reputation and imperfect information. Journal of economic theory, 27(2):253–279, 1982.
 [KYK11] Dmytro Korzhyk, Zhengyu Yin, Christopher Kiekintveld, Vincent Conitzer, and Milind Tambe. Stackelberg vs. Nash in security games: An extended investigation of interchangeability, equivalence, and uniqueness. Journal of Artificial Intelligence Research, 2011.
 [MP95] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
 [MR82] Paul Milgrom and John Roberts. Predation, reputation, and entry deterrence. Journal of economic theory, 27(2):280–312, 1982.
 [MS17] Vidya Muthukumar and Anant Sahai. Fundamental limits on expost enforcement and implications for spectrum rights. In Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium on, pages 1–10. IEEE, 2017.
 [PJT10] James Pita, Manish Jain, Milind Tambe, Fernando Ordóñez, and Sarit Kraus. Robust solutions to Stackelberg games: Addressing bounded rationality and limited observations in human cognition. Artificial Intelligence, 174(15):1142–1171, 2010.
 [PPM08] Praveen Paruchuri, Jonathan P Pearce, Janusz Marecki, Milind Tambe, Fernando Ordonez, and Sarit Kraus. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systemsVolume 2, pages 895–902. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
 [Rou04] Tim Roughgarden. Stackelberg scheduling strategies. SIAM Journal on Computing, 33(2):332–350, 2004.
 [RT97] Yuval Rottenstreich and Amos Tversky. Unpacking, repacking, and anchoring: advances in support theory. Psychological review, 104(2):406, 1997.
 [SAY12] Eric Shieh, Bo An, Rong Yang, Milind Tambe, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. Protect: A deployed game theoretic system to protect the ports of the United States. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent SystemsVolume 1, pages 13–20. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
 [Sel75] Reinhard Selten. Reexamination of the perfectness concept for equilibrium points in extensive games. International journal of game theory, 4(1):25–55, 1975.
 [Sel78] Reinhard Selten. The chain store paradox. Theory and decision, 9(2):127–159, 1978.
 [VDH97] Eric Van Damme and Sjaak Hurkens. Games with imperfectly observable commitment. Games and Economic Behavior, 21(12):282–308, 1997.
 [VSZ10] Bernhard Von Stengel and Shmuel Zamir. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446–457, 2010.
 [WZB13] Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.
 [ZMBD08] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Proofs
Before moving into the proofs themselves, we define some additional notation.
Definition 3.
The set of alternate follower best response to the mixed commitment is denoted by
We will be particularly interested in this set for the Stackelberg commitment, that is, . In general, the set will be nonempty as the follower could be agnostic between more than one pure strategy in response – it is only responding with the pure strategy to break ties in the leader’s favor. Figure 7 shows this demarcation of follower responses into the expected response , and alternate responses to the Stackelberg commitment .
Further, we denote maximum and minimum obtainable leader payoffs respectively by
a.1 Proof of Theorem 1
We consider a general game and denote the Stackelberg probability of leader playing pure strategy by . Recall that (since we have assumed for the proof that the Stackelberg commitment is mixed). Let be the alternate response to the Stackelberg commitment, i.e. we have . Without loss of generality, the bestresponse regions can be described as
Finally, we define . Since we are considering leaderfollower games for which the function is discontinuous at , by the tiebreaking assumption on Stackelberg commitment we will have .
Now, we consider the quantity . Denoting as the empirical estimate of the quantity , we have
We will now proceed to bound the probabilities and .
First, we deal with the quantity , which reflects the probability of a mismatched response that is neither Stackelberg nor the alternate response on the boundary. By the Hoeffding bound, we have
Denoting , we then have
(4) 
and as expected, this probability decays exponentially with .
Next, we deal with the quantity , which reflects the probability of eliciting the alternate response on the Stackelberg boundary. We show that this event is nonvanishingly probable.
We define the following quantities
(5)  
(6) 
By a simple change of variables, we then have
Now, recall that for iid random variables . Also note that since we have considered games with mixed Stackelberg commitment, we have . We now invoke the first half of the classical BerryEsseen theorem [Ber41, Ess42] stated here as a lemma.
Lemma 1.
There exists a positive constant such that if are iid random variables with , and , we have
for all , where denotes the CDF of the standard normal distribution .
It is easy to verify that the distribution satisfies the above conditions. Therefore, we can directly apply Lemma 1 and get
for positive constant , thus giving
(7) 
Substituting for the expressions for and , we now have
which corresponds exactly to Equation (2). Clearly, the right hand side of this equation is decreasing in and so the first corollary – that for – holds. Precisely, we have
and so we have
This is the second corollary from Theorem 1 and completes the proof. ∎
a.2 Proof of Theorem 2
a.2.1 Notation
For this proof, it will be convenient to consider the dimensional representation of the probability simplex, i.e.
Then, we can represent a commitment by its dimensional representation , and the leader payoff if the follower were to respond with pure strategy by
where we have
Similarly, we can represent the corresponding follower payoff by
where we have
We can also represent this representation of the empirical estimate of from samples by , and this representation Stackelberg commitment by .
Now, we can consider all the functions introduced in Section 2.2 in terms of the commitment and equivalently define them in terms of the dimensional representation of the commitment, .
We also denote the th operator norm of a matrix by .
a.2.2 The commitment construction
We consider the dimensional representation of the bestresponseregion corresponding to the Stackelberg commitment, . There are many things to consider while constructing a robust commitment. The first, and obvious, one would be that the follower should respond the same way as it would to Stackelberg when it observes the full mixture. That is, we would have or alternatively stated, .
Intuitively, the expected payoff of a leader commitment under observational uncertainty, particularly in terms of gap to the optimal Stackelberg payoff, will depend on two factors: one, how likely the follower is to respond the same as it would if it observed the full commitment; and two, how “far” the leader commitment mixture is from the optimal Stackelberg commitment mixture. We qualitatively show this dependence in the following lemma.
Lemma 2.
Consider a commitment for which we can provide the following guarantee:
We then have
Proof.
We have
Recall that we have . Therefore, the gap from Stackelberg is bounded as
where the second inequality follows from Holder’s inequality. This proves the lemma. ∎
This lemma implies that we want a commitment construction with the following twofold guarantee^{15}^{15}15Interestingly, the fact that is on an extreme point of
will imply that the two conditions are at odds with one another, and we will need to trade them off. For instance, choosing
would satisfy the second condition perfectly by being as close as possible to the Stackelberg commitment, but there would be no guarantee on the bestresponse as it lies on the boundary of the bestresponse region..
is bounded (and ideally vanishes with ).

with high probability.
a.2.3 Commitment construction using localized geometry
We will leverage the special structure of the Dikin ellipsoid [KN12] used in interiorpoint methods to make our commitment constructions. Observe that is always going to be on an extreme point (vertex) of the bestresponsepolytope^{16}^{16}16Recall that the Stackelberg equilibrium is the solution to the LP defined on the bestresponsepolytope [CS06]. . We now collect the constraints that are satisfied with equality at :
This is simply the constraint set for commitments such that the follower prefers to respond with pure strategy over any pure strategy (i.e. any pure strategy whose corresponding bestresponsepolytope shares a boundary with the Stackelberg bestresponsepolytope at point ), and can be thought of as the set of local constraints to the Stackelberg vertex in the bestresponse polytope . We also collect the other constraints that describe :
and together with the local constraints at the Stackelberg vertex, these describe the global constraints for the polytope.
We represent the system of inequalities in matrix form as: for some and some . We leverage the following useful fact about a general set of linear constraints.
Fact 1.
For any parameterization of linear constraints , there exists an affine transformation (where is invertible and ) and a matrix such that
We denote the transformation function by and its inverse by . In particular, we note the relationship .
The above fact is useful^{17}^{17}17A subtle point is that there do exist special cases of polytope constraints for which Fact 1 is true only with an augmentation of the variable space from to dimensions. Then, defining the invertible map becomes trickier. Nevertheless, for ease of exposition and clarity in the proof, we assume that we can indeed carry out the affine transformation without augmenting the dimension. because it is most convenient to define our class of commitments in the transformed space .
Definition 4.
For a particular value of , Stackelberg commitment , and local constraints modeled by , we define a deviation commitment by
Our robust commitments are going to be taken out of the set of deviation commitments, with appropriately chosen values of . Clearly, the computational complexity of constructing any deviation commitment is equivalent to the complexity of computing the Stackelberg equilibrium itself.
To understand how to set these values, we will turn to the question of how to satisfy the three conditions above.
First, we observe that satisfies the local constraints for any . Because of Fact 1, it suffices to show that its affine transformation satisfies the local constraints . Recall that satisfies all the local constraints with equality, i.e. we have . From the definition of the commitment, we thus have
Next, we turn to the question of how close such a defined commitment would be from the Stackelberg commitment , in terms of the norm. For this, we have
Therefore, we have
(8) 
In lieu of Lemma 2, we wish to choose values (to create commitments