1 Introduction
Many real-world settings call for allocating limited defender resources against a strategic adversary, such as protecting public infrastructure [Tambe2011], transportation networks [Okamoto et al.2012], large public events [Yin et al.2014], urban crime [Zhang et al.2015], and green security [Fang et al.2015]. Stackelberg security games (SSGs) are a critical framework for computing defender strategies that maximize expected defender utility to protect important targets from an intelligent adversary [Tambe2011].
In many SSG settings, the adversary’s utility function is not known a priori. In domains where there are many interactions with the adversary, the history of interactions can be leveraged to construct an adversary behavior model: a mapping from target features to values [Kar et al.2016]. An example of such a domain is protecting wildlife from poaching [Fang et al.2015]. The adversary’s behavior is observable because snares are left behind, which rangers aim to remove (Fig. 1). Various features such as animal counts, distance to the edge of the park, weather and time of day may affect how attractive a particular target is to the adversary.
We focus on the problem of learning adversary models that generalize well: the training data consists of adversary behavior in the context of particular sets of targets, and we wish to achieve a high defender utility when we play against the same adversary on new sets of targets. In the problem of poaching prevention, rangers patrol a small portion of the park each day and aim to predict poacher behavior across a large park consisting of targets with novel feature values [Gholami et al.2018].
The standard approach to this problem [Nguyen et al.2013, Yang et al.2011, Kar et al.2016] breaks the problem into two stages. In the first, the adversary model is fit to the historical data using a standard machine learning loss function, such as mean squared error. In the second, the defender optimizes her allocation of defense resources against the model of adversary behavior learned in the first stage. Extensive research has focused on the first, predictive stage: developing better models of human behavior [Cui and John2014, Abbasi et al.2016]. We show that models that provide better predictions may not improve the defender’s true objective: higher expected utility. This was observed previously by Ford et al. [2015] in the context of network security games, motivating our approach.
We propose a decision-focused approach to adversary modeling in SSGs which directly trains the predictive model to maximize defender expected utility on the historical data. Our approach builds on a recently proposed framework (outside of security games) called decision-focused learning, which aims to optimize the quality of the decisions induced by the predictive model, instead of focusing solely on predictive accuracy [Wilder et al.2019]; Fig. 2 illustrates our approach vs. a standard two-stage method. The main idea is to integrate a solver for the defender’s equilibrium strategy into the loop of machine learning training and update the model to improve the decisions output by the solver.
While decision-focused learning has recently been explored in other domains (see related work), we overcome two main challenges to extend it to SSGs. First, the defender optimization problem is typically nonconvex, whereas previous work has focused on convex problems. Second, decision-focused learning requires counterfactual data: we need to know what our decision outcome quality would have been, had we taken a different action than the one observed in training. By contrast, in SSGs we typically only observe the attacker’s response to a fixed historical mixed strategy.
In summary, our contributions are: First, we provide a theoretical justification for why decision-focused approaches can outperform two-stage approaches in SSGs. Second, we develop a decision-focused learning approach to adversary modeling in SSGs, showing both how to differentiate through general nonconvex problems and how to estimate counterfactual utilities for subjective utility quantal response [Nguyen et al.2013] and related adversary models. Third, we test our approach on a combination of synthetic and human subject data and show that decision-focused learning outperforms a two-stage approach in many settings.
Related Work.
There is a rich literature on SSGs, ranging from information revelation [Korzhyk et al.2011, Guo et al.2017] to extensive-form models [Cermak et al.2016] to patrolling on graphs [Basilico et al.2012, Basilico et al.2017]. Adversary modeling in particular has been a subject of extensive study. Yang et al. [2011] show that modeling the adversary with quantal response (QR) results in more accurate attack predictions. Nguyen et al. [2013] develop subjective utility quantal response (SUQR), which is more accurate than QR. SUQR is the basis of other models such as SHARP [Kar et al.2016]. We focus on SUQR in our experiments because it is a relatively simple and widely used approach. Our decision-focused approach extends to other models that decompose the attacker’s behavior into the impact of coverage and target value. Sinha et al. [2016] and Haghtalab et al. [2016] study the sample complexity (i.e., the number of attacks required) of learning an adversary model. Our setting differs from theirs because their defender observes attacks on the same target set that their defense performance is evaluated on. Ling et al. [2018, 2019] use a differentiable QR equilibrium solver to reconstruct the payoffs of both players from play. This differs from our objective of maximizing the defender’s expected utility.
Outside of SSGs, Hartford et al. [2016] and Wright and Leyton-Brown [2017] study the problem of predicting play in unseen games assuming that all payoffs are fully observable; in our case, the defender seeks to maximize expected utility and does not observe the attacker’s payoffs. Hartford et al. [2016] is the only other work to apply deep learning to modeling boundedly rational players in games.
Wilder et al. [2019] and Donti et al. [2017] study decision-focused learning for discrete and convex optimization, respectively. Donti et al. use sequential quadratic programming to solve a convex nonquadratic objective and use the last program to calculate derivatives. Here we propose an approach that works for the broader family of nonconvex functions.
2 Setting
Stackelberg Security Games (SSGs).
Our focus is on optimizing defender strategies for SSGs, which describe the problem of protecting a set of targets given limited defense resources and constraints on how the resources may be deployed [Tambe2011]. Formally, an SSG is a tuple $\{T, u_d, u_a, C_d\}$, where $T$ is a set of targets, $u_d(i)$ is the defender’s payoff if target $i$ is successfully attacked, $u_a(i)$ is the attacker’s, and $C_d$ is the set of constraints the defender’s strategy must satisfy. Both players receive a payoff of zero when the attacker attacks a target that is defended.
The game proceeds in two stages: the defender computes a mixed strategy that satisfies the constraints $C_d$, which induces a marginal coverage probability (or coverage) $p = \{p_i : i \in T\}$. The attacker’s attack function $q$ determines which target is attacked, inducing an attack probability $q_i(p)$ for each target $i$. The defender seeks to maximize her expected utility:
$$\max_{p \text{ satisfying } C_d} \mathrm{DEU}(p; q) = \max_{p \text{ satisfying } C_d} \sum_{i \in T} (1 - p_i)\, q_i(p)\, u_d(i). \tag{1}$$
The attacker’s $q$ function can represent a rational attacker, e.g., $q_i(p) = 1$ if $i = \arg\max_{j \in T} (1 - p_j)\, u_a(j)$ and $0$ otherwise, or a boundedly rational attacker. A QR attacker [McKelvey and Palfrey1995] attacks each target with probability proportional to the exponential of its payoff scaled by a constant $\lambda$, i.e., $q_i(p) \propto \exp(\lambda (1 - p_i)\, u_a(i))$. An SUQR [Nguyen et al.2013] attacker attacks each target with probability proportional to the exponential of an attractiveness function:
$$q_i(p, y) \propto \exp(w p_i + \phi(y_i)), \tag{2}$$
where $y_i$ is a vector of features of target $i$ and $w < 0$ is a constant. We call $\phi$ the target value function.
Learning in SSGs.
We consider the problem of learning to play against an attacker with an unknown attack function $q$. We observe attacks made by the adversary against sets of targets with differing features, and our goal is to generalize to new sets of targets with unseen feature values.
Formally, let $\langle q, C_d, D_{\text{train}}, D_{\text{test}} \rangle$ be an instance of a Stackelberg security game with latent attack function (SSG-LA). $q$, which is not observed by the defender, is the true mapping from the features and coverage of each target to the probability that the attacker will attack that target. $C_d$ is the set of constraints that a mixed strategy defense must satisfy for the defender. $D_{\text{train}}$ are training games of the form $\langle T, y, A, u_d, p^{\text{historical}} \rangle$, where $T$ is the set of targets, and $y$, $A$, $u_d$, and $p^{\text{historical}}$ are the features, observed attacks, defender’s utility function, and historical coverage probabilities, respectively, for each target $i \in T$. $D_{\text{test}}$ are test games $\langle T, y, u_d \rangle$, each containing a set of targets and the associated features and defender values for each target. We assume that all games are drawn i.i.d. from the same distribution. In a green security setting, the training games represent the results of patrols on limited areas of the park and the test games represent the entire park.
The defender’s goal is to select a coverage function $x$ that takes the parameters of each test game as input and maximizes her expected utility across the test games against the attacker’s true $q$:
$$\max_{x \text{ satisfying } C_d} \; \mathbb{E}_{\langle T, y, u_d \rangle \sim D_{\text{test}}} \left[ \mathrm{DEU}(x(T, y, u_d); q) \right]. \tag{3}$$
To achieve this, she can observe the attacker’s behavior in the training data and learn how he values different combinations of features. We now explore two approaches to the learning problem: the standard twostage approach taken by previous work and our proposed decisionfocused approach.
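As a concrete reference for the quantities the defender reasons about, the following minimal sketch evaluates an SUQR attack distribution (Eq. 2) and the defender's expected utility (Eq. 1) for one fixed coverage vector. The function names, the three-target game, and the coverage weight `w = -4.0` are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def suqr_attack_probs(coverage, target_values, w):
    """SUQR attack distribution: q_i proportional to exp(w * p_i + phi_i)."""
    logits = w * coverage + target_values
    logits = logits - logits.max()  # stabilize the softmax
    weights = np.exp(logits)
    return weights / weights.sum()

def defender_expected_utility(coverage, attack_probs, defender_values):
    """DEU = sum_i (1 - p_i) * q_i * u_d(i); an attack on a covered target pays zero."""
    return float(np.sum((1.0 - coverage) * attack_probs * defender_values))

# Hypothetical 3-target game (all numbers are made up for illustration).
coverage = np.array([0.5, 0.3, 0.2])            # marginal coverage p
target_values = np.array([1.0, 0.0, -1.0])      # phi(y_i)
defender_values = np.array([-5.0, -2.0, -1.0])  # u_d(i)
w = -4.0                                        # assumed coverage weight

q = suqr_attack_probs(coverage, target_values, w)
deu = defender_expected_utility(coverage, q, defender_values)
print(q, deu)
```

Because `w` is negative in this sketch, raising coverage on a target lowers its attack probability, which is the deterrence effect the defender's optimization exploits.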
Two-Stage Approach.
A standard two-stage approach to the defender’s problem is to estimate the attacker’s $q$ function from the training data and optimize against the estimate during testing. This process resembles multiclass classification where the targets are the classes: the inputs are the target features $y$ and historical coverages $p^{\text{historical}}$, and the output is a distribution over the predicted attack. Specifically, the defender fits a function $\hat{q}$ to the training data that minimizes a loss function. Using the cross entropy, the loss for a particular training example is
$$\mathcal{L}(\hat{q}(y, p^{\text{historical}}), \tilde{q}) = -\sum_{i \in T} \tilde{q}_i \log \hat{q}_i(y, p^{\text{historical}}), \tag{4}$$
where $\tilde{q}_i = \tilde{z}_i / \sum_{j \in T} \tilde{z}_j$ is the empirical attack distribution and $\tilde{z}_i$ is the number of historical attacks that were observed on target $i$. Note that we use hats to indicate model outputs and tildes to indicate the ground truth. For each test game $\langle T, y, u_d \rangle$, coverage is selected by maximizing the defender’s expected utility (Eq. 1) assuming the attack function is $\hat{q}$:
$$\max_{p \text{ satisfying } C_d} \mathrm{DEU}(p; \hat{q}). \tag{5}$$
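The first-stage loss can be sketched in a few lines. The snippet below builds the empirical attack distribution from attack counts and evaluates the cross entropy of two candidate predictions; the counts and predictions are hypothetical, and the clipping constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import numpy as np

def empirical_attack_distribution(attack_counts):
    """Normalize per-target attack counts into a probability vector."""
    counts = np.asarray(attack_counts, dtype=float)
    return counts / counts.sum()

def cross_entropy(predicted_probs, empirical_probs, eps=1e-12):
    """Cross entropy between the empirical and predicted attack distributions."""
    predicted = np.clip(np.asarray(predicted_probs, dtype=float), eps, 1.0)
    return float(-np.sum(empirical_probs * np.log(predicted)))

# Hypothetical game with 5 observed attacks on 3 targets.
counts = [3, 1, 1]
q_tilde = empirical_attack_distribution(counts)

loss_good = cross_entropy([0.6, 0.2, 0.2], q_tilde)  # matches the data
loss_bad = cross_entropy([0.2, 0.2, 0.6], q_tilde)   # misplaces the mass
print(loss_good, loss_bad)
```

Note that this loss scores only how well the predicted distribution matches the observed attacks; it never consults the defender's utilities, which is the misalignment the decision-focused approach targets.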
Decision-Focused Learning.
The standard approach may fall short when the loss function (e.g., cross entropy) does not align with the true goal of maximizing expected utility. Ultimately, the defender just wants the learned attack function $\hat{q}$ to induce the correct mixed strategy, regardless of how accurate it is in a general sense. The idea behind our decision-focused learning approach is to directly train $\hat{q}$ to maximize defender utility. Define
$$p^*(\hat{q}) = \arg\max_{p \text{ satisfying } C_d} \mathrm{DEU}(p; \hat{q}) \tag{6}$$
to be the optimal defender coverage function given attack function $\hat{q}$, where $C_d$ denotes the defender’s constraints. Ideally, we would find a $\hat{q}$ which maximizes
$$\mathbb{E}_{\langle T, y, u_d \rangle \sim D_{\text{test}}} \left[ \mathrm{DEU}(p^*(\hat{q}); q) \right]. \tag{7}$$
This is just the defender’s expected utility on the test games $D_{\text{test}}$ when she plans her mixed strategy defense based on attack function $\hat{q}$ but the true function is $q$. While we do not have access to $D_{\text{test}}$, we can estimate Eq. 7 using samples from the training games $D_{\text{train}}$ (taking the usual precaution of controlling model complexity to avoid overfitting). The idea behind decision-focused learning is to directly optimize Eq. 7 on the training data instead of using an intermediate loss function such as cross entropy. Maximizing Eq. 7 on the training set via gradient ascent requires the gradient, which we can derive using the chain rule:
$$\frac{\partial\, \mathrm{DEU}(p^*(\hat{q}); q)}{\partial \hat{q}} = \frac{\partial\, \mathrm{DEU}(p^*(\hat{q}); q)}{\partial p^*(\hat{q})} \cdot \frac{\partial p^*(\hat{q})}{\partial \hat{q}}.$$
Here, $\frac{\partial \mathrm{DEU}(p^*(\hat{q}); q)}{\partial p^*(\hat{q})}$ describes how the defender’s true utility with respect to $q$ changes as a function of her strategy $p^*$. $\frac{\partial p^*(\hat{q})}{\partial \hat{q}}$ describes how $p^*$ depends on the estimated attack function $\hat{q}$, which requires differentiating through the optimization problem in Eq. 6. Suppose that we have a means to calculate both terms. Then we can estimate $\frac{\partial \mathrm{DEU}(p^*(\hat{q}); q)}{\partial \hat{q}}$ by sampling example games from $D_{\text{train}}$ and computing gradients on the samples. If $\hat{q}$ is itself implemented in a differentiable manner (e.g., a neural network), this allows us to train the entire system end-to-end via gradient descent. Previous work has explored decision-focused learning in other contexts [Donti et al.2017, Wilder et al.2019], but SSGs pose unique challenges that complicate the process of computing both of the required terms above. In Sec. 4, we explore these challenges and propose solutions.
3 Impact of Two-Stage Learning on DEU
We demonstrate that, for natural two-stage training loss functions, decreasing the loss may not lead to increasing the DEU. This indicates that we may be able to improve decision quality by making use of decision-focused learning because a decision-focused approach uses the decision objective as the loss. Thus, reducing the loss function increases the DEU in decision-focused learning.
We begin with a simple case: two-target games with a rational attacker and zero-sum utilities. All proofs are in the appendix.
Theorem 1.
Consider a two-target SSG with a rational attacker, zero-sum utilities, and a single defense resource to allocate, which is not subject to scheduling constraints (i.e., any nonnegative marginal coverage that sums to one is feasible). Let $u_a(0) \ge u_a(1)$ be the attacker’s values for the targets, which are observed by the attacker, but not the defender, and which we assume w.l.o.g. are nonnegative and sum to 1.
The defender has an estimate $(\hat{u}_a(0), \hat{u}_a(1))$ of the attacker’s values with mean squared error (MSE) $\epsilon^2$. Suppose the defender optimizes coverage against this estimate. If $\epsilon \le 1 - u_a(0)$, the ratio between the lowest DEU under an estimate of $(u_a(0), u_a(1))$ with MSE $\epsilon^2$ and the highest is:
$$\frac{(u_a(1) + \epsilon)\, u_a(0)}{(u_a(0) + \epsilon)\, u_a(1)}. \tag{8}$$
The reason for the gap in defender expected utilities is that the attacker attacks the target whose value is underestimated by $\hat{u}_a$. This target has less coverage than it would have if the defender knew the attacker’s utilities precisely, allowing the attacker to benefit. When the defender reduces the coverage on the larger value target, the attacker benefits more, causing the gap in expected defender utilities.
Note that because (8) is at least one (since DEUs are negative, a ratio above one means the numerator is the worse outcome), decreasing the MSE does not necessarily lead to higher DEU. For $\epsilon' < \epsilon$, the learned model at MSE $= \epsilon^2$ will have higher DEU than the model at MSE $= \epsilon'^2$ if the former underestimates the value of $u_a(1)$, the latter underestimates the value of $u_a(0)$, and $\epsilon$ and $\epsilon'$ are sufficiently close (specifically, $\epsilon\, u_a(1) < \epsilon'\, u_a(0)$). In decision-focused learning, the DEU is used as the loss directly; thus, a model with lower loss must have higher DEU.
In the case of Thm. 1, the defender can lose value $\epsilon\, u_a(0)$, or $\epsilon$ as $u_a(0) \to 1$, compared to the optimum because of an unfavorable distribution of estimation error. We show that this carries over to a boundedly rational QR attacker, with the degree of loss converging towards the rational case as the QR constant $\lambda$ increases.
Theorem 2.
Consider the setting of Thm. 1, but in the case of a QR attacker. For any , if , the defender’s loss compared to the optimum may be as much as under a target value estimate with MSE .
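The two-target argument behind Thm. 1 can be checked numerically. The sketch below assumes attacker values normalized to sum to one, zero-sum payoffs, and a defender who covers each target in proportion to her estimate (the Lemma 1 coverage); the specific values `0.7`, `0.3`, `0.1`, and `0.08` are hypothetical. It compares two estimates with the same MSE, and then a third estimate with lower MSE but an unfavorable error direction.

```python
import numpy as np

def rational_attack_deu(coverage, attacker_values):
    """Zero-sum DEU when a rational attacker picks the best uncovered payoff."""
    attacker_payoffs = (1.0 - coverage) * attacker_values
    return -float(attacker_payoffs.max())

def coverage_from_estimate(estimated_values):
    """Coverage proportional to estimated values (Lemma 1 under normalization)."""
    return estimated_values / estimated_values.sum()

u = np.array([0.7, 0.3])  # hypothetical true values, u_a(0) >= u_a(1)
eps = 0.1                 # both estimates below have MSE eps**2

deu_opt = rational_attack_deu(coverage_from_estimate(u), u)
under0 = np.array([u[0] - eps, u[1] + eps])  # underestimates target 0
under1 = np.array([u[0] + eps, u[1] - eps])  # underestimates target 1
deu_under0 = rational_attack_deu(coverage_from_estimate(under0), u)
deu_under1 = rational_attack_deu(coverage_from_estimate(under1), u)

# A lower-MSE estimate that errs in the unfavorable direction.
eps_small = 0.08
under0_small = np.array([u[0] - eps_small, u[1] + eps_small])
deu_small_mse = rational_attack_deu(coverage_from_estimate(under0_small), u)

print(deu_opt, deu_under0, deu_under1, deu_small_mse)
```

In this instance the estimate with MSE `0.08**2` yields a lower DEU than one of the estimates with MSE `0.1**2`, illustrating that predictive error and decision quality need not move together.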
4 Decision-Focused Learning in SSGs with an SUQR Adversary
We now present our technical approach to decision-focused learning in SSGs. Throughout, $q$ is the true attack function, $\hat{q}$ is its learned estimate, and $p^*(\hat{q})$ is the defender’s optimal coverage against $\hat{q}$. As discussed above, we use $\mathrm{DEU}(p^*(\hat{q}); q)$, the expected utility induced by an estimate $\hat{q}$, as the objective for training. The key idea is to embed the defender optimization problem into training and compute gradients of DEU with respect to the model’s predictions. In order to do so, we need two quantities, each of which poses a unique challenge in the context of SSGs.
First, we need $\frac{\partial p^*(\hat{q})}{\partial \hat{q}}$, which describes how the defender’s strategy $p^*$ depends on $\hat{q}$. Computing this requires differentiating through the defender’s optimization problem. Previous work on differentiable optimization considers convex problems [Amos and Kolter2017]. However, typical bounded rationality models for $q$ (e.g., QR, SUQR, and SHARP) all induce nonconvex defender problems. We resolve this challenge by showing how to differentiate through the local optimum output by a black-box nonconvex solver.
Second, we need $\frac{\partial \mathrm{DEU}(p^*(\hat{q}); q)}{\partial p^*(\hat{q})}$, which describes how the defender’s true utility with respect to $q$ depends on her strategy $p^*$. Computing this term requires a counterfactual estimate of how the attacker would react to a different coverage vector than the historical one. Unfortunately, typical datasets only contain a set of sampled attacker responses to a particular historical defender mixed strategy. Previous work on decision-focused learning in other domains [Donti et al.2017, Wilder et al.2019] assumes that the historical data specifies the utility of any possible decision, but this assumption breaks down under the limited data available in SSGs. We show that common models like SUQR exhibit a crucial decomposition property that enables unbiased counterfactual estimates. We now explain both steps in more detail.
4.1 Decision-Focused Learning for Nonconvex Optimization
Under nonconvexity, all that we can (in general) hope for is a local optimum. Since there may be many local optima, it is unclear what it means to differentiate through the solution to the problem. We assume that we have black-box access to a nonconvex solver which outputs a fixed local optimum. We show that we can obtain derivatives of that particular optimum by differentiating through a convex quadratic approximation around the solver’s output (since existing techniques apply to the quadratic approximation).
We prove that this procedure works for a wide range of nonconvex problems. Specifically, we consider the generic problem $\min_{x \in X} f(x, \theta)$, where $f$ is a (potentially nonconvex) objective which depends on a learned parameter $\theta$. $X \subseteq \mathbb{R}^n$ is a feasible set that is representable as $\{x : g_1(x), \ldots, g_m(x) \le 0,\ h_1(x), \ldots, h_\ell(x) = 0\}$ for some convex functions $g_1, \ldots, g_m$ and affine functions $h_1, \ldots, h_\ell$. We assume there exists some $x \in X$ with $g(x) < 0$, where $g$ is the vector of constraints. In SSGs, $f$ is the defender objective DEU (negated, since DEU is maximized), $\theta$ is the attack function $\hat{q}$, and $X$ is the set of coverage vectors $p$ satisfying the defender’s constraints $C_d$. We assume that $f$ is twice continuously differentiable. These two assumptions capture smooth nonconvex problems over a nondegenerate convex feasible set.
Suppose that we can obtain a local optimum of $f$. Formally, we say that $x$ is a strict local minimizer of $f$ if (1) there exist $\mu \in \mathbb{R}^m_{\ge 0}$ and $\nu \in \mathbb{R}^\ell$ such that $\nabla_x f(x, \theta) + \mu^\top \nabla g(x) + \nu^\top \nabla h(x) = 0$ and $\mu \odot g(x) = 0$ and (2) $\nabla^2_x f(x, \theta) \succ 0$. Intuitively, the first condition is first-order stationarity, where $\mu$ and $\nu$ are dual multipliers for the constraints, while the second condition says that the objective is strictly convex at $x$ (i.e., we have a strict local minimum, not a plateau or saddle point). We prove the following:
Theorem 3.
Let $x$ be a strict local minimizer of $f$ over $X$. Then, except on a measure zero set, there exists a convex set $\mathcal{I}$ around $x$ such that $x^*_{\mathcal{I}}(\theta) = \arg\min_{x' \in \mathcal{I} \cap X} f(x', \theta)$ is differentiable. The gradients of $x^*_{\mathcal{I}}(\theta)$ with respect to $\theta$ are given by the gradients of solutions to the local quadratic approximation $\min_{x' \in X} \nabla_x f(x, \theta)^\top (x' - x) + \frac{1}{2} (x' - x)^\top \nabla^2_x f(x, \theta) (x' - x)$.
This states that the local minimizer within the region output by the nonconvex solver varies smoothly with $\theta$, and we can obtain gradients of it by applying existing techniques [Amos and Kolter2017] to the local quadratic approximation. It is easy to verify that the defender utility maximization problem for an SUQR attacker satisfies the assumptions of Theorem 3 since the objective is smooth and typical constraint sets for SSGs are polytopes with nonempty interior (see [Xu2016] for a list of examples). In fact, our approach is quite general and applies to a range of behavioral models such as QR, SUQR, and SHARP since the defender optimization problem remains smooth in all.
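To illustrate the idea of differentiating through a local optimum, the sketch below uses a toy nonconvex objective rather than the defender's problem: a separable double well $f(x, \theta) = \sum_i (x_i^2 - 1)^2 + \theta_i x_i$ with a single equality constraint $\sum_i x_i = c$, loosely mimicking a resource budget. Everything here is an assumption made for illustration (the objective, the Newton solver standing in for a black-box solver, and the absence of active inequality constraints). The Jacobian is obtained by solving the linear KKT system of the quadratic model at the optimum and is compared against finite differences of re-solved problems.

```python
import numpy as np

def grad_f(x, theta):
    # Gradient of f(x, theta) = sum_i (x_i**2 - 1)**2 + theta_i * x_i.
    return 4.0 * x * (x**2 - 1.0) + theta

def hess_f(x):
    # Hessian in x; positive definite only where 12 * x_i**2 > 4.
    return np.diag(12.0 * x**2 - 4.0)

def kkt_matrix(x):
    # KKT matrix of the quadratic model with the constraint sum(x) = const.
    ones = np.ones((1, x.size))
    return np.block([[hess_f(x), ones.T], [ones, np.zeros((1, 1))]])

def solve_local(theta, total, x0, iters=50):
    """Newton iterations on the KKT system; finds the local optimum near x0."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        rhs = np.concatenate([-grad_f(x, theta), [total - x.sum()]])
        x = x + np.linalg.solve(kkt_matrix(x), rhs)[: x.size]
    return x

def jacobian_at_optimum(x_star):
    """dx*/dtheta from the differentiated KKT system of the quadratic model."""
    n = x_star.size
    rhs = np.vstack([-np.eye(n), np.zeros((1, n))])  # -(d^2 f / dx dtheta) = -I
    return np.linalg.solve(kkt_matrix(x_star), rhs)[:n]

theta = np.array([0.1, -0.2, 0.3])
x0 = np.full(3, 0.9)
total = 2.7
x_star = solve_local(theta, total, x0)
J = jacobian_at_optimum(x_star)

# Finite-difference check: re-solve with perturbed theta from the same start.
h = 1e-5
J_fd = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J_fd[:, j] = (solve_local(theta + e, total, x0)
                  - solve_local(theta - e, total, x0)) / (2.0 * h)
print(np.max(np.abs(J - J_fd)))
```

The design point is that only the Hessian and constraint matrix at the returned optimum are needed, so the solver that produced `x_star` can remain a black box.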
4.2 Counterfactual Adversary Estimates
We now turn to the second challenge, that of estimating how well a different strategy would perform on the historical games. We focus here on the SUQR attacker, but the main idea extends more widely (as we discuss below). For SUQR, if the historical attractiveness values $\phi(y_i)$ were known, then the DEU of any coverage $p$ could be easily computed in closed form using Eq. 2. The difficulty is that we typically only observe samples from the attack distribution $q$, where for SUQR, $q_i \propto \exp(w p_i + \phi(y_i))$. $\phi(y_i)$ itself is not observed directly.
The crucial property enabling counterfactual estimates is that the attacker’s behavior can be decomposed into his reaction to the defender’s coverage ($w p_i$) and the impact of target values ($\phi(y_i)$). Suppose that we know $w$ and observe sampled attacks for a particular historical game. Because we can estimate $q$ and the term $w p_i$ is known, we can invert the $\exp$ function to obtain an estimate of $\phi(y_i)$ (formally, this corresponds to the maximum likelihood estimator under the empirical attack distribution). Note that we do not know the entire function $\phi$, only its value at $y_i$, and that the inversion yields $\phi(y_i)$ that is unique up to a constant additive factor. Having recovered $\phi(y_i)$, we can then perform complete counterfactual reasoning for the defender on the historical games.
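The inversion described above can be sketched as follows, assuming an SUQR attacker with known coverage weight `w`. The function names, the example counts, and the additive `smoothing` term (an assumption added here so that targets with zero observed attacks do not produce an infinite logarithm) are illustrative and not part of the paper's method. The recovered values are centered to fix the additive constant, which does not affect the resulting attack distribution.

```python
import numpy as np

def recover_target_values(attack_counts, historical_coverage, w, smoothing=0.5):
    """Invert SUQR: phi_i = log(q_i) - w * p_i, up to an additive constant."""
    counts = np.asarray(attack_counts, dtype=float) + smoothing
    q_tilde = counts / counts.sum()
    phi = np.log(q_tilde) - w * np.asarray(historical_coverage, dtype=float)
    return phi - phi.mean()  # fix the free additive constant

def counterfactual_deu(new_coverage, phi, w, defender_values):
    """DEU of an arbitrary coverage vector under the recovered SUQR model."""
    logits = w * new_coverage + phi
    q = np.exp(logits - logits.max())
    q = q / q.sum()
    return float(np.sum((1.0 - new_coverage) * q * defender_values))

# Hypothetical historical game: 10 attacks observed under uniform coverage.
counts = np.array([6, 3, 1, 0])
p_hist = np.array([0.25, 0.25, 0.25, 0.25])
u_d = np.array([-4.0, -3.0, -2.0, -1.0])
w = -4.0  # assumed known coverage weight

phi_hat = recover_target_values(counts, p_hist, w)
p_new = np.array([0.5, 0.3, 0.15, 0.05])  # a coverage never played historically
deu_hist = counterfactual_deu(p_hist, phi_hat, w, u_d)
deu_new = counterfactual_deu(p_new, phi_hat, w, u_d)
print(deu_hist, deu_new)
```

With exact attack probabilities in place of counts and `smoothing=0.0`, the inversion recovers the true target values up to their mean, so the counterfactual DEU matches the one computed from the true values.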
5 Experiments
We compare the performance of decision-focused and two-stage approaches across a range of settings both simulated and real (using data from Nguyen et al. [2013]). We find that decision-focused learning outperforms two-stage when the number of training games is low, the number of attacks observed on each training game is low, and the number of target features is high. We compare the following three defender strategies:
Decision-focused (DF) is our decision-focused approach. For the prediction neural network, we use a single layer with ReLU activations with 200 hidden units on synthetic data and 10 hidden units on the simpler human subject data. We do not tune DF.
Two-stage (2S) is a standard two-stage approach, where a neural network is fit to predict attacks, minimizing cross entropy on the training data, using the same architecture as DF. We find that two-stage is sensitive to overfitting, and thus, we use Dropout and early stopping based on a validation set.
Uniform attacker values (Unif) is a baseline where the defender assumes that the attacker’s value for all targets is equal and maximizes DEU under that assumption.
5.1 Experiments in Simulation
We perform experiments against an attacker with an SUQR target attractiveness function. Raw feature values are sampled i.i.d. from the uniform distribution over [−10, 10]. Because it is necessary that the attacker target value function is a function of the features, we sample the attacker and defender target value functions by generating a random neural network for each of the attacker and the defender. Our other parameter settings are chosen to align with the human subject data of Nguyen et al. [2013]. We rescale defender values to be between −10 and 0.
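The sampling procedure can be sketched as below, assuming the intended feature range is [−10, 10] and defender values are rescaled to [−10, 0]. The network shape (one hidden ReLU layer), the weight scaling, the per-game min-max rescaling, and all function names are assumptions for illustration; the text does not specify these details.

```python
import numpy as np

def random_value_network(num_features, hidden_units, rng):
    """Random one-hidden-layer ReLU network mapping features to a scalar value."""
    w1 = rng.normal(size=(num_features, hidden_units)) / np.sqrt(num_features)
    w2 = rng.normal(size=hidden_units) / np.sqrt(hidden_units)

    def value(features):
        return np.maximum(features @ w1, 0.0) @ w2

    return value

def rescale(values, low, high):
    """Min-max rescale a vector into [low, high] (assumed per-game scaling)."""
    span = values.max() - values.min()
    return low + (values - values.min()) * (high - low) / span

def sample_game(num_targets, num_features, attacker_net, defender_net, rng):
    """Draw features, attacker target values, and defender values for one game."""
    features = rng.uniform(-10.0, 10.0, size=(num_targets, num_features))
    phi = attacker_net(features)
    u_d = rescale(defender_net(features), -10.0, 0.0)
    return features, phi, u_d

rng = np.random.default_rng(0)
attacker_net = random_value_network(num_features=5, hidden_units=8, rng=rng)
defender_net = random_value_network(num_features=5, hidden_units=8, rng=rng)
features, phi, u_d = sample_game(6, 5, attacker_net, defender_net, rng)
print(features.shape, phi.shape, u_d.min(), u_d.max())
```

The same two networks are reused across games in this sketch, so that target values remain a fixed function of the features while the features themselves vary from game to game.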
We choose instance parameters to illustrate the differences in performance between decision-focused and two-stage approaches. We run 28 trials per parameter combination. Unless it is varied in an experiment, the parameters are:

Number of targets .

Features per target .

Number of training games . We fix the number of test games .

Number of attacks per training game .

Defender resources is the number of defense resources available. We use 3 for 8 targets and 9 for 24.

We fix the attacker’s weight on defender coverage to be (see Eq. 2), a value chosen because of its resemblance to observed attacker in human subject experiments [Nguyen et al.2013, Yang et al.2014]. All strategies receive access to this value, which would require the defender to vary her mixed strategies to learn.

Historical coverage is the coverage generated by Unif, which is fixed for each training game.
Results (Simulations).
Fig. 3 shows the results of the experiments in simulation, comparing DF and 2S across a variety of problem types. DF yields higher DEU than 2S across most tested parameter settings, and DF especially excels in problems where learning is more difficult: more features, fewer training games and fewer attacks. The vertical axis of each graph is median DEU minus the DEU achieved by Unif. Because Unif does not perform learning, its DEU is unaffected by the horizontal axis parameter variation, which only affects the difficulty of the learning problem, not the difficulty of the game. The average for 8 targets and for 24.
The left column of Fig. 3 compares DF to 2S as the number of attacks observed per game increases. For both 8 and 24 targets, DF receives higher DEU than 2S across the tested range. 2S fails to outperform Unif at 2 attacks per target, whereas DF receives 75% of the DEU it receives at 15 attacks per target.
The center column of Fig. 3 compares DEU as the number of training games increases. Note that without training games, no learning is possible and . DF receives equal or higher DEU than 2S, except for 24 targets and 200 training games.
The right column of Fig. 3 compares DEU as the number of features decreases. A larger number of features results in a harder learning problem, as each feature increases the complexity of the attacker’s value function. Of the parameters we vary, the number of features has the largest impact on the relative performance of DF and 2S. DF performs better than 2S for more than 50 features (for 8 targets) and 100 features (for 24 targets). For more than 150 features, 2S fails to learn for both 8 and 24 targets and performs extremely poorly.
5.2 Experiments on Human Subject Data
We use data from human subject experiments performed by Nguyen et al. [2013]. The data consists of an 8-target setting with 3 defender resources and a 24-target setting with 9. Each setting has 44 games. Historical coverage is the optimal coverage assuming a QR attacker with . For each game, 30–45 attacks by human subjects are recorded.
We use the attacker coverage parameter calculated by Nguyen et al. [2013]: . We use maximum likelihood estimation to calculate the ground truth target values for the test games. There are four features for each target: attacker’s reward and defender’s penalty for a successful attack, attacker’s penalty and defender’s reward for a failed attack. Note that to be consistent with the rest of the paper, we assume the defender receives a reward of 0 if she successfully prevents an attack.
Results (Human Subject Data).
We find that DF receives higher DEU than 2S on the human subject data. Fig. 4 summarizes our results as the number of training attacks per target and the number of training games are varied. Varying the number of attacks, for 8 targets, DF achieves its highest percentage improvement in DEU at 5 attacks, where it receives 28% more than 2S. For 24 targets, DF achieves its largest improvement of 66% more than 2S at 1 attack.
Varying the number of games, DF outperforms 2S except for fewer than 10 training games in the 8-target case. The percentage advantage is greatest for 8-target games at 20 training games (33%) and at 2 training games for 24-target games, where 2S barely outperforms Unif.
The theorems of Sec. 3 suggest that models with higher DEU may not have higher predictive accuracy. We find that, indeed, this can occur. The effect is most pronounced in the human subject experiments, where 2S has lower test cross entropy than DF by 2–20%. Note that we measure test cross entropy against the attacks generated by Unif, the same defender strategy used to generate the training data, and that 2S received extensive hyperparameter tuning to improve validation cross entropy and DF did not.
6 Conclusion
We present a decision-focused approach to adversary modeling in security games. We provide a theoretical justification as to why training an attacker model to maximize DEU can provide higher DEU than training the model to maximize predictive accuracy. We extend past work in decision-focused learning to smooth nonconvex objectives, accounting for the defender’s optimization in SSGs against many attacker types, including SUQR. We show empirically, in both synthetic and human subject data, that our decision-focused approach outperforms standard two-stage approaches.
We conclude that improving predictive accuracy does not guarantee increased DEU in SSGs. We believe this conclusion has important consequences for future research and that our decision-focused approach can be extended to a variety of SSG models where smooth nonconvex objectives and polytope feasible regions are common.
References
 [Abbasi et al.2016] Y. Abbasi, N. Ben-Asher, C. Gonzalez, D. Morrison, N. Sintov, and M. Tambe. Adversaries wising up: Modeling heterogeneity and dynamics of behavior. In International Conference on Cognitive Modeling, 2016.
 [Amos and Kolter2017] B. Amos and J. Z. Kolter. Optnet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
 [Basilico et al.2012] N. Basilico, N. Gatti, and F. Amigoni. Patrolling security games: Definition and algorithms for solving large instances with single patroller and single intruder. Artificial Intelligence, 2012.
 [Basilico et al.2017] N. Basilico, G. De Nittis, and N. Gatti. Adversarial patrolling with spatially uncertain alarm signals. Artificial Intelligence, 2017.
 [Cermak et al.2016] J. Cermak, B. Bosansky, K. Durkota, V. Lisy, and C. Kiekintveld. Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In AAAI, 2016.
 [Cui and John2014] J. Cui and R. S. John. Empirical comparisons of descriptive multi-objective adversary models in Stackelberg security games. In GameSec, 2014.
 [Donti et al.2017] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic optimization. In NIPS, 2017.
 [Fang et al.2015] F. Fang, P. Stone, and M. Tambe. When security games go green: Designing defender strategies to prevent poaching and illegal fishing. In IJCAI, 2015.
 [Ford et al.2015] B. Ford, T. Nguyen, M. Tambe, N. Sintov, and F. Delle Fave. Beware the soothsayer: From attack prediction accuracy to predictive reliability in security games. In GameSec, 2015.
 [Gholami et al.2018] S. Gholami, S. McCarthy, B. Dilkina, A. Plumptre, M. Tambe, M. Driciru, F. Wanyama, A. Rwetsiba, M. Nsubaga, J. Mabonga, et al. Adversary models account for imperfect crime data: Forecasting and planning against real-world poachers. In AAMAS, 2018.
 [Guo et al.2017] Q. Guo, B. An, B. Bosanskỳ, and C. Kiekintveld. Comparing strategic secrecy and Stackelberg commitment in security games. In IJCAI, 2017.
 [Haghtalab et al.2016] N. Haghtalab, F. Fang, T. H. Nguyen, A. Sinha, A. D. Procaccia, and M. Tambe. Three strategies to success: Learning adversary models in security games. In IJCAI, 2016.
 [Hartford et al.2016] J. S. Hartford, J. R. Wright, and K. Leyton-Brown. Deep learning for predicting human strategic behavior. In NIPS, 2016.
 [Kar et al.2016] D. Kar, F. Fang, F. M. Delle Fave, N. Sintov, M. Tambe, and A. Lyet. Comparing human behavior models in repeated Stackelberg security games: An extended study. Artificial Intelligence, 2016.
 [Korzhyk et al.2011] D. Korzhyk, V. Conitzer, and R. Parr. Solving Stackelberg games with uncertain observability. In AAMAS, 2011.
 [Ling et al.2018] C. K. Ling, F. Fang, and J. Z. Kolter. What game are we playing? End-to-end learning in normal and extensive form games. In IJCAI, 2018.
 [Ling et al.2019] C. K. Ling, F. Fang, and J. Z. Kolter. Large scale learning of agent rationality in two-player zero-sum games. In AAAI, 2019.
 [McKelvey and Palfrey1995] R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 1995.
 [Nguyen et al.2013] T. H. Nguyen, R. Yang, A. Azaria, S. Kraus, and M. Tambe. Analyzing the effectiveness of adversary modeling in security games. In AAAI, 2013.
 [Okamoto et al.2012] S. Okamoto, N. Hazon, and K. Sycara. Solving non-zero sum multiagent network flow security games with attack costs. In AAMAS, 2012.
 [Sinha et al.2016] A. Sinha, D. Kar, and M. Tambe. Learning adversary behavior in security games: A PAC model perspective. In AAMAS, Singapore, 2016.

 [Tambe2011] M. Tambe. Security and game theory: algorithms, deployed systems, lessons learned. Cambridge University Press, 2011.
 [Wilder et al.2019] B. Wilder, B. Dilkina, and M. Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In AAAI, 2019.
 [Wright and Leyton-Brown2017] J. R. Wright and K. Leyton-Brown. Predicting human behavior in unrepeated, simultaneous-move games. Games and Economic Behavior, 2017.
 [Xu2016] H. Xu. The mysteries of security games: Equilibrium computation becomes combinatorial algorithm design. In EC, 2016.
 [Yang et al.2011] R. Yang, C. Kiekintveld, F. Ordonez, M. Tambe, and R. John. Improving resource allocation strategy against human adversaries in security games. In IJCAI, 2011.
 [Yang et al.2014] R. Yang, B. Ford, M. Tambe, and A. Lemieux. Adaptive resource allocation for wildlife protection against illegal poachers. In AAMAS, 2014.
 [Yin et al.2014] Y. Yin, B. An, and M. Jain. Game-theoretic resource allocation for protecting large public events. In AAAI, 2014.
 [Zhang et al.2015] C. Zhang, A. Sinha, and M. Tambe. Keeping pace with criminals: Designing patrol allocation against adaptive opportunistic criminals. In AAMAS, 2015.
Appendix A Proof of Theorem 1
We use to represent the attacker’s value for successfully attacking target .
Lemma 1.
Consider a two-target, zero-sum SSG with a rational attacker and a single defense resource that is not subject to scheduling constraints. The optimal defender coverage is and , and the defender’s payoff under this coverage is .
Proof.
The defender’s maximum payoff is achieved when the expected value for attacking each target is equal, and we require that for feasibility. With and , the attacker’s payoff is if he attacks target 0 and if he attacks target 1. ∎
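The equal-payoff argument in this proof can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: the names `v0`, `v1` (attacker values) and `p0` (coverage on target 0) are hypothetical, the attacker is assumed to receive `v_i` when target `i` is attacked while uncovered and 0 when it is covered, and the defender is assumed to use the whole resource. Under those assumptions, a grid search over coverage recovers the coverage that equalizes the attacker's two payoffs.

```python
import numpy as np

def defender_utility(p0, v0, v1):
    """Defender utility in the zero-sum game when coverage is (p0, 1 - p0).

    Assumed payoff convention (not from the paper): attacking target i yields
    v_i to the attacker if the target is uncovered and 0 if it is covered.
    """
    payoff_target0 = (1.0 - p0) * v0   # target 0 is uncovered with prob. 1 - p0
    payoff_target1 = p0 * v1           # target 1 is uncovered with prob. p0
    # A rational attacker takes the better target; the defender gets the negative.
    return -np.maximum(payoff_target0, payoff_target1)

v0, v1 = 0.7, 0.3                      # illustrative attacker values
grid = np.linspace(0.0, 1.0, 100001)   # candidate coverage on target 0
p0_grid = grid[np.argmax(defender_utility(grid, v0, v1))]

# Closed form obtained by equalizing the two attacker payoffs.
p0_closed = v0 / (v0 + v1)
u_closed = -v0 * v1 / (v0 + v1)

print(p0_grid, p0_closed, defender_utility(p0_closed, v0, v1), u_closed)
```

The grid optimum and the equalizing coverage should agree up to the grid resolution, which is the behavior the proof relies on.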
Theorem 4.
Consider a two-target SSG with a rational attacker, zero-sum utilities, and a single defense resource to allocate, which is not subject to scheduling constraints (i.e., any nonnegative marginal coverage that sums to one is feasible). Assume the attacker observes the utilities of the attacker and defender, which we assume w.l.o.g. are nonnegative and sum to one. The defender has an estimate of the attacker’s utility with mean squared error (MSE) . Suppose the defender optimizes her coverage against this estimate. If , the ratio between the defender’s expected utility under the worst estimate of with MSE and that with the best is:
(9) 
where are the attacker’s values for the targets.
Proof.
Given the condition that , there are two configurations of that have mean squared error : , yielding defender utility and , respectively, because the attacker always attacks the target with underestimated value. The condition on is required to make both estimates feasible. Because , . ∎
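The claim that the attacker always attacks the target with the underestimated value can be illustrated with a small numerical sketch. Everything named here is an assumption made for illustration: `v0`, `v1` are hypothetical true attacker values summing to one, `eps` is a hypothetical per-target error (so that the two estimates below both have mean squared error `eps**2` when the error is averaged over the two targets), and the payoff convention is the same assumed one as in Lemma 1 (value `v_i` if uncovered, 0 if covered).

```python
def coverage_against(est0, est1):
    """Coverage that equalizes attacker payoffs under the estimated values."""
    total = est0 + est1
    return est0 / total, est1 / total

def true_outcome(v0, v1, p0, p1):
    """Rational attacker best-responds to the TRUE values.

    Returns (attacked target, defender utility) for the zero-sum game.
    """
    payoffs = ((1.0 - p0) * v0, (1.0 - p1) * v1)
    target = 0 if payoffs[0] >= payoffs[1] else 1
    return target, -payoffs[target]

v0, v1, eps = 0.6, 0.4, 0.1            # illustrative values, not from the paper

# The two estimates at distance eps from the truth on each target.
estimates = {
    "target0_overestimated": (v0 + eps, v1 - eps),
    "target0_underestimated": (v0 - eps, v1 + eps),
}

results = {}
for name, (est0, est1) in estimates.items():
    p0, p1 = coverage_against(est0, est1)
    results[name] = true_outcome(v0, v1, p0, p1)

# Baseline: coverage computed from the true values.
baseline = true_outcome(v0, v1, *coverage_against(v0, v1))
print(results, baseline)
```

In this instance both misestimates leave the defender worse off than the baseline, and the two misestimates yield different defender utilities even though their squared error is identical.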
Appendix B Proof of Theorem 2
Let denote the defender’s utility with coverage probability against a perfectly rational attacker and denote their utility against a QR attacker. Suppose that we have a bound
for some value . Let be the optimal coverage probability under perfect rationality. Note that for an alternate probability
(since holds for all ) and so any is guaranteed to have , implying that the defender must have in the optimal QR solution.
We now turn to estimating how large must be in order to get a sufficiently small . Let be the probability that the attacker chooses the first target under QR. Note that we have and . We have
For two targets with value 1 and , is given by
Provided that , we will have . Suppose that we would like this bound to hold over all for some . Then, and so suffices. Now if we take , we have that for , the QR optimal strategy must satisfy , implying that the defender allocates at least coverage to the target with true value 0. Suppose the attacker chooses the target with value 1 with probability . Then, the defender’s loss compared to the optimum is . By a similar argument as above, it is easy to verify that under our stated conditions on , and assuming , we have , for total defender loss .
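The role of the bound between the QR and perfectly rational utilities can be made concrete with a sketch of the logit quantal response model [McKelvey and Palfrey1995]. The code is an illustration under assumptions, not the paper's construction: `lam` is a hypothetical name for the rationality parameter, the values `v0`, `v1` and coverage `p0` are arbitrary, and the payoff convention (value `v_i` if uncovered, 0 if covered) is assumed. It shows the gap between the defender's utility against a QR attacker and against a rational attacker shrinking as `lam` grows, for a fixed coverage.

```python
import math

def qr_attack_probs(p0, v0, v1, lam):
    """Logit quantal response over two targets given coverage (p0, 1 - p0)."""
    u0 = (1.0 - p0) * v0               # attacker's expected payoff, target 0
    u1 = p0 * v1                       # attacker's expected payoff, target 1
    shift = max(lam * u0, lam * u1)    # subtract the max for numerical stability
    w0 = math.exp(lam * u0 - shift)
    w1 = math.exp(lam * u1 - shift)
    return w0 / (w0 + w1), w1 / (w0 + w1)

def defender_utility_qr(p0, v0, v1, lam):
    """Zero-sum defender utility against the QR attacker."""
    q0, q1 = qr_attack_probs(p0, v0, v1, lam)
    return -(q0 * (1.0 - p0) * v0 + q1 * p0 * v1)

def defender_utility_rational(p0, v0, v1):
    """Zero-sum defender utility against a best-responding attacker."""
    return -max((1.0 - p0) * v0, p0 * v1)

p0, v0, v1 = 0.5, 1.0, 0.5             # illustrative values, not from the paper
lams = (1.0, 10.0, 100.0, 1000.0)
gaps = [
    abs(defender_utility_qr(p0, v0, v1, lam) - defender_utility_rational(p0, v0, v1))
    for lam in lams
]
for lam, gap in zip(lams, gaps):
    print(lam, gap)
```

With `lam = 0` the attacker in this sketch chooses uniformly at random, which is the opposite extreme from the rational attacker.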
Appendix C Proofs for nonconvex optimization
Theorem 5.
Let be twice continuously differentiable and be a strict local minimizer of over . Then, at except on a measure zero set, there exists a convex set around such that is differentiable. The gradients of are given by the gradients of solutions to the local quadratic approximation .
Proof.
By continuity, there exists an open ball around on which is negative definite; let be this ball. Restricted to , the optimization problem is convex, and satisfies Slater’s condition by our assumption on combined with Lemma 2. Therefore, the KKT conditions are a necessary and sufficient description of . Since the KKT conditions depend only on second-order information, is differentiable whenever the quadratic approximation is differentiable. Note that in the quadratic approximation, we can drop the requirement that since the minimizer over already lies in by continuity. Using Theorem 1 of Amos and Kolter (2017), the quadratic approximation is differentiable except at a measure zero set, proving the theorem. ∎
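The idea of differentiating the solution of a quadratic program through its KKT conditions can be sketched in a simplified setting. The example below is an assumption-laden illustration and not the construction of Amos and Kolter (2017): it handles only equality constraints (no inequalities), uses a hypothetical positive definite matrix `Q`, and differentiates the minimizer of `0.5 * x^T Q x + c^T x` subject to `A x = b` with respect to `c`. The analytic Jacobian read off the KKT system is compared against finite differences.

```python
import numpy as np

def solve_eq_qp(Q, c, A, b):
    """Minimize 0.5 x^T Q x + c^T x subject to A x = b via the KKT system."""
    n, m = Q.shape[0], A.shape[0]
    # KKT conditions: Q x + A^T nu = -c and A x = b, written as K z = rhs.
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, b])
    z = np.linalg.solve(K, rhs)
    return z[:n], K

def jacobian_wrt_c(K, n):
    """d x*/d c from implicit differentiation of the KKT system."""
    return -np.linalg.inv(K)[:n, :n]

# Illustrative problem data (hypothetical, not from the paper).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
c = np.array([-1.0, 0.3])
A = np.array([[1.0, 1.0]])               # constraint x_0 + x_1 = 1
b = np.array([1.0])

x_star, K = solve_eq_qp(Q, c, A, b)
J = jacobian_wrt_c(K, n=2)

# Finite-difference check of the Jacobian, one column per entry of c.
h = 1e-6
J_fd = np.zeros((2, 2))
for j in range(2):
    dc = np.zeros(2)
    dc[j] = h
    x_plus, _ = solve_eq_qp(Q, c + dc, A, b)
    J_fd[:, j] = (x_plus - x_star) / h

print(x_star, J, J_fd)
```

Because the constraint must keep holding as `c` changes, every column of the Jacobian lies in the null space of `A`.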
Lemma 2.
Let be convex functions and consider the set . If there is a point which satisfies , then for any point , the set contains a point satisfying and .
Proof.
By convexity, for any , the point lies in , and for , satisfies . Moreover, for sufficiently large (but strictly less than 1), we must have , proving the existence of . ∎
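The convex-combination argument can be illustrated numerically. The sketch below is a hedged example with hypothetical choices throughout: the two convex constraint functions, the strictly feasible point `x_interior`, the boundary point `x_boundary`, and the name `theta` for the combination weight are all assumptions made for illustration. It moves from the boundary point slightly toward the strictly feasible point and confirms that the result is strictly feasible and within the requested radius.

```python
import numpy as np

# Two convex constraints g_i(x) <= 0 defining the feasible set (illustrative).
constraints = [
    lambda x: float(x @ x - 1.0),        # unit disk
    lambda x: float(x[0] + x[1] - 1.0),  # half-plane
]

def strictly_feasible_near(x_feasible, x_interior, radius):
    """Return a point within `radius` of x_feasible at which every g_i < 0.

    Assumes g_i(x_feasible) <= 0 and g_i(x_interior) < 0 for all i.
    """
    gap = np.linalg.norm(x_feasible - x_interior)
    if gap == 0.0:
        return x_interior.copy()
    # Weight on x_feasible: strictly below 1, but close enough to 1 that the
    # combination stays within radius / 2 of x_feasible.
    theta = max(0.0, 1.0 - radius / (2.0 * gap))
    return theta * x_feasible + (1.0 - theta) * x_interior

x_interior = np.array([0.0, 0.0])   # both constraints strictly satisfied
x_boundary = np.array([1.0, 0.0])   # both constraints hold with equality
x_near = strictly_feasible_near(x_boundary, x_interior, radius=1e-3)

print(x_near, [g(x_near) for g in constraints])
```

Strict feasibility of the combined point follows from convexity: each `g_i` at the combination is at most the weighted average of its values at the two endpoints, and the weight on the strictly feasible endpoint is positive.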