As our societal systems become increasingly populated with smart devices, a growing number of learning agents are collecting data and drawing inferences. Furthermore, as more and more decisions are made by (automated) learning agents, a growing number of secondary agents will arise to monitor their decision-making processes.
The associated problem that one faces when learning about the relevant environment becomes much more complex in such settings. Agents no longer learn about their environment in isolation; an agent's learning process (the act of collecting information) is subject to observation by others who are attempting to infer its private information. Since beliefs influence actions, and actions influence payoffs, each agent must be mindful of how its learning actions shape what others believe about its private information. This is especially a concern when the interests of the agents are misaligned. (The influence of one's actions/strategies on the beliefs of others is known in the economics literature as signaling; signaling is present in both competitive (game) settings and cooperative (decentralized control) settings [2, 3].) As a result, a complete analysis of how an agent should learn in such environments must explicitly take into account how the learning process itself influences players' beliefs and their subsequent choices.
The general problem of learning while under observation arises in many real-world scenarios. Examples range from e-commerce and online marketplaces, e.g., where a merchant estimates a user's willingness to pay from their browsing patterns in order to set revenue-maximizing prices, to cyber security, e.g., where a defender partially observes an attacker's reconnaissance of various target locations in order to determine where to allocate defensive resources.
For the purposes of this paper, our focus is on cyber security. As a motivating scenario, consider a hacker carrying out preliminary reconnaissance on potential attack vectors, e.g., sending commands to a particular server to see which ports are open, before launching an attack. Such recon actions not only provide information to the hacker about the structure of the network but also, due to monitoring systems deployed throughout the network, reveal information to the defender about the hacker's knowledge and true intent. The defender thus faces the problem of inferring the hacker's state of knowledge from the recon actions. This is further complicated by the fact that the defender may only partially observe the recon actions: for instance, while the defender can determine which server received a malicious command, it may not know what information the hacker received. Consequently, the defender does not know the hacker's updated state of knowledge. Conversely, the hacker is unsure of the value of the various attack vectors/targets, and thus does not know how its private information influenced the defender's view. The interaction is thus one of asymmetric information: each agent possesses information not available to the other. For the remainder of the paper, we present the results in the context of cyber security and refer to the agents as the attacker and the defender.
A first step in analyzing this problem is understanding how the learning agent's private information influences the observing agent's inference process. In this paper, we introduce a stylized game-theoretic model to capture the basic strategic components of this interaction in the context of the previously described cyber security scenario. Specifically, we consider an attacker and a defender who simultaneously choose between two targets. If the attacker chooses the uncertain target, it receives a randomized reward drawn from a distribution unknown to it; if the attacker chooses the other target, it receives a fixed and known reward. In both cases, the defender loses the same amount. If the defender chooses to defend the target that the attacker attacked, then the attacker incurs a cost of capture and the defender gains the same amount (a reward for the capture). The interesting part of this model is what happens beforehand. Before the attacker and the defender play the game, the attacker receives a single private sample from the uncertain target's reward distribution. While the attacker's prior before this observation is common knowledge, the posterior after the observation is known only to the attacker. The defender knows the true distribution of the reward, but does not see the attacker's sample and thus does not know the attacker's posterior. The fundamental problem we investigate is the influence of the attacker's private sample on the beliefs and equilibria of the game.
I-A Background & Related Work
The structure of our model bears similarity to some existing models in the economics and learning literatures. One such model, termed strategic experimentation [4], describes a setting where agents learn from the outcomes of others' experiments. In the original model [4], each experiment, defined as the selection of an alternative, yields an outcome that is commonly observed by the agents. The model is a multi-agent extension of the two-armed bandit problem [6, 7] (see [5] for an overview of multi-armed bandits and some fundamental results), where agents choose between a “risky” alternative (with unknown statistics) and a “known” alternative, with the objective of maximizing expected payoff. The main insight from the classical model is that, due to free-riding, any equilibrium in which players only use the common belief is socially suboptimal.
The original strategic experimentation model has been extended in many directions. Most relevant to our setting are the models of [8, 9], which consider the case of private information. Under private information, the outcomes of experiments are no longer publicly observed; rather, they are private to the players. The model of [8] studies a two-player problem where each player faces a bandit problem consisting of a risky arm (of two possible types) and a known arm. Similarly, [9] study a related problem but allow for communication (via cheap talk) among the players and for the players to reverse their decisions to stop experimenting, distinguishing it from the setup of [8].
A related setting is the Bayesian exploration model of [10], which considers a problem where a principal attempts to incentivize (a series of) myopic agents to explore, rather than greedily exploit. This is similar in spirit to the motivation found in strategic experimentation settings.
Given the motivation for our study, it is useful to also discuss some models from the security literature. The general problem of allocating defenses in the presence of a strategic attacker is a well-studied topic. One prominent model is the Colonel Blotto game [11, 12, 13], in which two players allocate resources (troops) to alternatives (battlefields) in order to maximize the number of battlefields won (by majority rule). Another popular model is that of Stackelberg security games [14], where a defender (leader) allocates defenses to a collection of targets before an attacker (follower) launches its attack. While differing in their timing and informational assumptions, both the Colonel Blotto and Stackelberg security game settings provide insight into how defensive resources should be allocated in order to minimize the chance of a successful attack.
Compared to the aforementioned work, the model presented in this paper studies a simple setting in which the learning agent receives a single sample of private information from a distribution privately known by another (observing) agent. Our model aims to isolate the effect of this single private sample on the inference process of the observing agent. This structured information asymmetry has some natural applications (some of which were discussed earlier), especially in the context of cyber security. To the best of our knowledge, models possessing this structure have not yet been investigated in the economics, learning, or security literatures.
Comparing our model to the more general strategic experimentation setting, ours is a model of private information; however, the learner does not actively decide to receive the sample; it receives the sample regardless of its choice. As a result, we refer to the model of this paper as strategic inference rather than strategic experimentation.
In summary, the contribution of our work is a game formulation, motivated by the aforementioned foundational security setting, that isolates a structured class of asymmetric information games that admits tractable analysis. The model provides insight into how an observing agent reasons about the private information revealed to a learning agent and its subsequent impact on decisions.
II The Game Model
Consider the following two-player game in which an attacker (A) and a defender (D) choose between two potential targets. The reward of the first target is uncertain to the attacker, dictated by an unknown distribution, whereas the second target has a known reward. The attacker is assumed to possess a prior over the unknown distribution. For the purposes of this paper, we assume a Bernoulli reward distribution (for the unknown target) and a beta prior; this choice is for mathematical simplicity (conjugate priors), and the model can also accommodate alternative distributions, such as Gaussian. The Bernoulli parameter, denoted by θ, is known only to the defender, whereas the beta prior, with parameters (α, β), is assumed to be common knowledge.
The game proceeds as follows. First, the attacker receives a sample s ∈ {0, 1} from the unknown target, which it uses to form an updated (posterior) belief. Since the prior is beta, with parameters (α, β), and the trials are Bernoulli, the updated posterior is simply Beta(α + s, β + 1 − s). The defender does not see the realized sample s. The players then act simultaneously, the attacker choosing which target to attack and the defender choosing which target to defend. Fig. 1 depicts the timeline of this interaction.
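As an illustrative sketch (not part of the formal model), the conjugate update just described can be written in a few lines; the function names are hypothetical, while the prior parameters (α, β) and Bernoulli sample s follow the notation assumed in this section.

```python
def posterior(alpha: float, beta: float, s: int) -> tuple:
    """Beta posterior parameters after observing one Bernoulli sample s."""
    if s not in (0, 1):
        raise ValueError("a Bernoulli sample must be 0 or 1")
    return (alpha + s, beta + 1 - s)

def posterior_mean(alpha: float, beta: float, s: int) -> float:
    """Mean of the updated belief over the uncertain reward parameter."""
    a, b = posterior(alpha, beta, s)
    return a / (a + b)
```

For example, a Beta(2, 2) prior combined with a positive sample yields the posterior Beta(3, 2), whose mean is 3/5.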
The objectives of the attacker and the defender are misaligned: the attacker wishes to avoid the defender, whereas the defender wishes to catch the attacker. These preferences are encoded in payoffs as follows. If the attacker selects the uncertain target, it receives a payoff of 1 with probability θ and 0 otherwise. If the attacker picks the known target, it receives a known payoff of r, where r is a constant. The attacker incurs a capture cost of c if the defender picks the same target. Payoffs are zero-sum, and the attacker is assumed to be maximizing.
II-A Subjective Payoffs & Costs
Before describing payoffs, we define the players' types. The attacker's type is the sample s it receives from the uncertain target. The defender's type is the true parameter θ of the reward distribution of the uncertain target. Note that while the defender has private knowledge of the true θ, it does not know the sample that the attacker received. In other words, neither player knows the other player's true type.
The attacker's uncertainty about the defender's type leaves it uncertain about the true payoffs of the game. The attacker can only compute its expected payoff using its prior and the information it gathers from the received sample. Specifically, the attacker's best estimate of the true parameter θ is the mean of its posterior after the sample is received, and its expected reward for selecting the uncertain target is this posterior mean. The defender, on the other hand, knows the true expected costs, since it knows the reward parameter θ. Furthermore, since the reward r of the other target is known, the payoff for selecting it is known to both players. The players' subjective payoffs/costs are illustrated in Table I.
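To make the subjective payoff entries concrete, the sketch below tabulates the attacker's view of the game under the notation assumed in this section (posterior mean in place of the unknown reward parameter, known reward r, capture cost c). The function name, the target labels, and the unit-reward normalization for the uncertain target are assumptions of this sketch, not the paper's exact table.

```python
def attacker_payoff_table(alpha: float, beta: float, s: int, r: float, c: float):
    """The attacker's subjective payoff for each (attack, defend) pair.

    Targets are labeled 'U' (uncertain reward) and 'K' (known reward r).
    The attacker replaces the unknown Bernoulli parameter with its
    posterior mean after observing the sample s.
    """
    m = (alpha + s) / (alpha + beta + 1)  # posterior mean reward of U
    return {
        ("U", "U"): m - c,  # attacks U and gets caught
        ("U", "K"): m,      # attacks U and evades the defender
        ("K", "U"): r,      # attacks K and evades the defender
        ("K", "K"): r - c,  # attacks K and gets caught
    }
```

For instance, with a Beta(2, 2) prior, a positive sample, r = 0.5, and c = 0.2, the attacker's subjective value of attacking the uncertain target while defended is 0.6 − 0.2 = 0.4.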
II-B Best Response Functions
The strategies of the attacker and the defender represent the probabilities with which each player chooses the uncertain target. The best response functions of the players are given by the following optimizations.
Note that the attacker is maximizing and the defender is minimizing, reflecting the fact that the attacker wants to avoid the defender but the defender wants to catch the attacker. Substitution of the subjective payoffs/costs from Table I yields the following best response functions.
The best response functions of the players are
where is the attacker’s expectation of the defender’s strategy and is the defender’s expectation of the attacker’s strategy.
See Appendix A-A.
The form of the expectations of the strategies in Lemma 1 reflects the information asymmetry between the players: the attacker sees the sample but does not know the reward parameter; the defender knows the reward parameter but does not see the sample. The attacker computes the expectation of the defender's strategy using its private sample: given a beta prior, the attacker's distribution over the true reward parameter, as a function of the private sample, is itself a beta distribution (the posterior). The defender computes the expectation of the attacker's strategy given its knowledge of the reward parameter: the defender's distribution over the attacker's private sample is a Bernoulli distribution with that parameter. Since the attacker's type (sample) is generated from a probability distribution dictated by the defender's type (the parameter of the reward distribution), the game is one of correlated types. This structure is core to our analysis; under correlation, knowledge of one's own type is informative for inferring the other player's type.
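The two expectation computations can be illustrated numerically as follows. This is a sketch under the Beta–Bernoulli notation assumed earlier (posterior Beta(α + s, β + 1 − s) for the attacker, Bernoulli(θ) sample distribution for the defender), with hypothetical function names and a simple midpoint-rule integration in place of a closed form.

```python
import math

def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of the Beta(a, b) distribution at x in (0, 1)."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_b)

def attacker_expectation(defender_strategy, alpha, beta, s, n=8000):
    """E[gamma_D(theta) | s]: the attacker averages the defender's
    strategy over its Beta posterior on the reward parameter."""
    a, b = alpha + s, beta + 1 - s
    return sum(
        defender_strategy((i + 0.5) / n) * beta_pdf((i + 0.5) / n, a, b)
        for i in range(n)
    ) / n

def defender_expectation(attacker_strategy, theta):
    """E[gamma_A(s) | theta]: the defender averages the attacker's
    strategy over the Bernoulli distribution of the sample."""
    return theta * attacker_strategy(1) + (1 - theta) * attacker_strategy(0)
```

As a sanity check, taking the defender's strategy to be the identity function recovers the attacker's posterior mean of the reward parameter.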
III Pure Strategy Equilibria
As mentioned in the introduction, we are interested in studying pure strategy equilibria. The following lemma shows that, for a given set of parameters (the prior parameters, the reward parameter, the known reward, and the capture cost), the resulting game has at most one pure strategy equilibrium, which can take one of three possible forms.
The private sampling game has at most one pure strategy saddle-point equilibrium, characterized by three disjoint parameter regions: 1) the attacker attacks the uncertain target for all samples; 2) the attacker attacks the known target for all samples; and 3) the attacker follows its sample, attacking the uncertain target if and only if the sample is positive, while the defender defends the more valuable target. The threshold conditions delineating these regions are expressed in terms of the beta function and the regularized incomplete beta function.
See Appendix A-B.
III-A Equilibrium Discussion
Our discussion proceeds in two steps. First, for fixed reward and cost parameters, we describe the region of prior parameters where each equilibrium condition of Lemma 2 holds. Second, we vary the capture cost to investigate how the equilibrium-supporting regions change.
We study the parameter regime in which the known reward is below the highest possible payoff the attacker can receive from the uncertain target when caught (that is, the maximal reward net of the capture cost). The rationale for this choice is as follows. Although the maximal net payoff of the uncertain target exceeds the known reward in this regime, the attacker does not always receive the maximal reward; it may get unlucky and receive a zero reward (since the reward is stochastic) in addition to getting caught. Due to the attacker's prior uncertainty over the true reward parameter, the attacker's subjective value of playing the uncertain target (when the defender also defends it), regardless of the received sample, is strictly less than this maximal net payoff. By analyzing equilibria in this regime, we can investigate the specific level of uncertainty (quantified by the prior) at which the attacker would prefer the known target over the uncertain one. We accordingly fix the known reward, the capture cost, and the true reward parameter for the following analysis.
For fixed reward and cost parameters, the equilibrium conditions of Lemma 2 describe regions in the space of prior parameters (α, β). In particular, the first two regions in Lemma 2, namely 1) the attacker always attacks the uncertain target and 2) the attacker always attacks the known target, are illustrated in Fig. 2. The interpretation of these two regions is straightforward. Recall from Lemma 2 that the condition for the first region must hold for every possible sample. Thus, regardless of the sample the attacker receives, it must believe with sufficiently high confidence that the uncertain target will yield its high reward. This is only true when α is much larger than β, characterized by the shaded region in Fig. 2(a). The reasoning for the second region follows similarly (see Fig. 2(b)).
In the third region, the attacker follows its observation and the defender defends the more valuable target. That is, if the attacker receives a positive sample it attacks the uncertain target, and if it receives a zero sample it attacks the known target; the defender defends the uncertain target if and only if the true reward parameter is sufficiently large. This equilibrium is best interpreted by separating the interval condition of Lemma 2 into its two inequalities. The lower-bound condition, corresponding to the positive sample, describes a region in the space of prior parameters, illustrated in Fig. 3, which is further subdivided into four subregions with the following interpretations:
In this subregion, the attacker's posterior estimate of the uncertain reward is high relative to the known reward, so the uncertain target looks sufficiently desirable even if the attacker gets caught.
In this subregion, the estimated uncertain reward is still sufficiently larger than the known reward, so the uncertain reward appears better to the attacker than the certain one. However, the attacker believes it very likely that the defender will defend the uncertain target, and therefore that it would get caught should it choose it, outweighing its gain in reward. The capture cost deters the attacker from choosing the uncertain target at this point.
Analogous to case (ii), capture considerations dominate: the attacker is sufficiently confident that the defender will defend the known target, and thus it chooses the uncertain target, even though the expected reward of the uncertain target is lower than the certain reward.
Analogous to (i), the attacker is confident that the uncertain target will yield no reward, and thus it does not choose it.
The reasoning for the zero-sample condition (derived from the upper bound in Lemma 2) follows identically. Combining the lower and upper bounds, the equilibrium holds in the intersection of the two regions, as illustrated in Fig. 4.
We now vary the capture cost c to see how the equilibrium-supporting regions change. In particular, we examine the three regions of Lemma 2 as c increases. As c is increased (see Figs. 5(a)–5(c)), the attacker grows increasingly concerned with getting caught. Sustaining a pure strategy thus requires the attacker to hold an increasingly confident belief that the uncertain target will pay off, in order to outweigh the higher risk of capture. For a large enough capture cost, all pure strategy equilibrium regions vanish: committing to either target (a pure strategy) becomes too risky, causing the attacker to (either partially or fully) mix between targets. In other words, a sufficiently large capture cost will partially deter an attack on either of the targets. A full characterization of these mixed strategies is left for future work.
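The exact thresholds of Lemma 2 involve the regularized incomplete beta function and are not reproduced here; as a rough numerical illustration of the same deviation logic, the following sketch checks whether a candidate pure attacker profile, paired with the defender's best response, survives deviation. It relies on the payoff notation assumed in this rewrite (reward 1 with probability θ for the uncertain target, known reward r, capture cost c); the function names and all parameter values are hypothetical.

```python
import math

def _beta_pdf(x: float, a: float, b: float) -> float:
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_b)

def _expect(f, a: float, b: float, n: int = 8000) -> float:
    # Midpoint-rule approximation of E[f(theta)] for theta ~ Beta(a, b).
    return sum(
        f((i + 0.5) / n) * _beta_pdf((i + 0.5) / n, a, b) for i in range(n)
    ) / n

def is_pure_equilibrium(attack_u, alpha, beta, r, c, tol=1e-6):
    """attack_u maps the sample s in {0, 1} to True iff the uncertain
    target is attacked; returns True if no sample type of the attacker
    gains by deviating against the defender's best response."""
    def p_attack(theta):  # prob. the uncertain target is attacked, given theta
        return theta * attack_u[1] + (1 - theta) * attack_u[0]
    def defends_u(theta):  # defender guards the target more likely attacked
        return 1.0 if p_attack(theta) > 0.5 else 0.0
    for s in (0, 1):
        a, b = alpha + s, beta + 1 - s      # attacker's posterior
        m = a / (a + b)                      # posterior mean reward of U
        q = _expect(defends_u, a, b)         # subjective prob. U is defended
        payoff_u = m - c * q                 # attack the uncertain target
        payoff_k = r - c * (1 - q)           # attack the known target
        if attack_u[s] != (payoff_u >= payoff_k) and abs(payoff_u - payoff_k) > tol:
            return False
    return True
```

With a Beta(2, 2) prior, r = 0.5, and a small capture cost c = 0.2, only the follow-the-sample profile survives deviation; raising c to 0.6 eliminates every pure profile, consistent with the discussion above.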
IV Concluding Remarks
Motivated by cyber security settings, we have introduced a simple asymmetric information game for describing the influence of a learner's (the attacker's) single private sample on the inference process of an observing agent (the defender). The resulting game admits at most one pure strategy equilibrium which, depending on the parameters of the game, takes different forms. We showed that, even when the attacker is confident that the uncertain target will produce a reward higher than the known reward associated with the other target, it is not necessarily optimal for the attacker to attack the uncertain target, as it then faces an increased chance of being caught. Future work includes considering multiple unknown targets and allowing the attacker to make sampling decisions; if those choices (but not the received samples) are observable to the defender, the game becomes a signaling game.
-  H. Tavafoghi, Y. Ouyang, and D. Teneketzis, “A unified approach to dynamic decision problems with asymmetric information-part II: Strategic agents,” arXiv preprint arXiv:1812.01132, 2018.
-  Y.-C. Ho, M. Kastner, and E. Wong, “Teams, signaling, and information theory,” IEEE Transactions on Automatic Control, vol. 23, no. 2, pp. 305–312, 1978.
-  H. Tavafoghi, Y. Ouyang, and D. Teneketzis, “A unified approach to dynamic decision problems with asymmetric information-part I: Non-strategic agents,” arXiv preprint arXiv:1812.01130, 2018.
-  P. Bolton and C. Harris, “Strategic experimentation,” Econometrica, vol. 67, no. 2, pp. 349–374, 1999.
-  A. Mahajan and D. Teneketzis, “Multi-armed bandit problems,” in Foundations and Applications of Sensor Management. Springer, 2008, pp. 121–151.
-  M. Woodroofe, “A one-armed bandit problem with a concomitant variable,” Journal of the American Statistical Association, vol. 74, no. 368, pp. 799–806, 1979.
-  D. A. Berry and B. Fristedt, “Two-armed bandits with a goal, I. One arm known,” Advances in Applied Probability, vol. 12, no. 3, pp. 775–798, 1980.
-  D. Rosenberg, E. Solan, and N. Vieille, “Social learning in one-arm bandit problems,” Econometrica, vol. 75, no. 6, pp. 1591–1611, 2007.
-  P. Heidhues, S. Rady, and P. Strack, “Strategic experimentation with private payoffs,” Journal of Economic Theory, vol. 159, pp. 531–551, 2015.
-  Y. Mansour, A. Slivkins, V. Syrgkanis, and Z. S. Wu, “Bayesian exploration: Incentivizing exploration in Bayesian games,” in Proceedings of the 2016 ACM Conference on Economics and Computation. New York, NY, USA: ACM, 2016, p. 661.
-  É. Borel, “The theory of play and integral equations with skew symmetric kernels,” Econometrica, pp. 97–100, 1953.
-  A. Ferdowsi, A. Sanjab, W. Saad, and T. Başar, “Generalized Colonel Blotto game,” in 2018 American Control Conference (ACC). IEEE, 2018, pp. 5744–5749.
-  A. Gupta, T. Başar, and G. A. Schwartz, “A three-stage Colonel Blotto game: When to provide more information to an adversary,” in International Conference on Decision and Game Theory for Security. Springer, 2014, pp. 216–233.
-  A. Sinha, F. Fang, B. An, C. Kiekintveld, and M. Tambe, “Stackelberg security games: Looking beyond a decade of success,” in IJCAI, 2018, pp. 5494–5501.
-  D. Fudenberg and J. A. Tirole, Game Theory. MIT Press, 1991.
Appendix A Proofs
A-A Proof of Lemma 1
Denote the attacker's subjective payoff for each pair of target selections as dictated by the subjective payoffs in Table I(a). Introducing shorthand for the attacker's expectation of the defender's strategy and substituting in the subjective payoffs, the attacker's expected payoff is
The first two terms do not influence the maximizing argument, and thus one can equivalently maximize the remaining term. Similarly, using Table I(b) and introducing shorthand for the defender's expectation of the attacker's strategy, the defender's expected cost is
Again, the constant terms do not influence the minimizing argument, and one can equivalently minimize the remaining term.
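The reduction above can be sketched in notation assumed for this rewrite (the uncertain target pays 1 with probability θ; the known target pays r; capture costs c; m_s and q_s denote the attacker's posterior mean of θ and subjective probability that the uncertain target is defended, given sample s; and p is the attacker's probability of attacking the uncertain target). This is a sketch under assumed payoffs, not the paper's exact expression:

```latex
\begin{aligned}
U_A(p) &= p\,\bigl(m_s - c\,q_s\bigr) + (1-p)\,\bigl(r - c\,(1-q_s)\bigr) \\
       &= \underbrace{r - c\,(1-q_s)}_{\text{independent of } p}
          + p\,\bigl[\,m_s - r - c\,(2q_s - 1)\,\bigr],
\end{aligned}
```

so maximizing U_A over p is equivalent to maximizing the single p-dependent term, matching the observation that the first two terms do not influence the maximizing argument.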
A-B Proof of Lemma 2
Throughout the proof, the attacker's expected strategy from the defender's perspective is computed using the defender's knowledge of the reward parameter θ. There are four possible cases, one for each mapping from the attacker's sample to a target:
1.i) Assume the attacker plays the uncertain target regardless of its type. Then, by Lemma 1, the defender's best response is to defend the uncertain target. The attacker does not deviate iff
1.ii) Assume the attacker plays the known target regardless of its type. Then the defender's best response is to defend the known target. Similar to case (1.i), the attacker does not deviate iff the analogous condition holds.
1.iii) Assume the attacker follows its sample, playing the uncertain target if and only if the sample is positive. The defender's best response is then (almost surely) determined by its knowledge of the reward parameter. By Lemma 1, the attacker plays the uncertain target after a positive sample iff
or equivalently, using the fact that ,
By a similar argument, the attacker plays the known target after a zero sample iff
Using consecutive neighbor identities of the regularized incomplete beta function, the above inequality becomes
1.iv) Assume the attacker plays against its sample, playing the uncertain target if and only if the sample is zero. The defender's best response follows as in the previous case. The attacker plays the uncertain target after a zero sample iff
Since , an equivalent condition is
Similarly, the attacker plays the known target after a positive sample iff
Or equivalently, by consecutive neighbor identities,
which can be seen to be non-empty by selecting the prior parameters appropriately.