I Introduction
Deception refers to a deliberate attempt to mislead or confuse adversaries so that they adopt strategies that are in the defender’s favor [1]. Deception can limit the effectiveness of an adversary’s attack, waste the adversary’s resources, and prevent the leakage of critical information [2]. It is a widely observed behavior in nature for self-defense and survival. Deception also plays a key role in many aspects of human society, such as economics [3], warfare [4], games [5], and cyber security [2].
In this paper, we focus on the scenario in which the adversary acts in an environment and this interaction is modeled as a Markov decision process (MDP) [6]. The adversary’s aim is to collect rewards at each state of the MDP, and the defender tries to minimize the accumulated reward through deception. Many existing approaches for deception rely on a rational adversary with sufficient memory and computation power to find its optimal policy [7, 1]. However, deceiving an adversary with only bounded rationality [8], i.e., one whose decisions may follow certain rules that deviate from the optimal action [9], has not been adequately studied so far. Deceiving an adversary with bounded rationality finds application, for example, in intrusion detection and protection [10] and in public safety [11]. In contrast to obfuscating sensitive system information from the adversary [12, 13], by deception we mean that the defender optimally assigns a limited resource to each state such that the expected cost from the defender’s perspective (or equivalently, the reward for the adversary) incurred by an adversary is minimized, even though the adversary expects more based on his cognitively biased view of rewards.
To deceive a human more effectively, it is essential to understand the human’s cognitive characteristics and what affects his decisions (particularly with stochastic outcomes). Works in behavioral psychology, e.g., [14], suggest that human decision-making follows intuition and bounded rationality. Empirical evidence has shown that humans tend to evaluate gains and losses differently in decision-making [15]. Humans tend to overestimate the likelihood of low-probability events and underestimate the likelihood of high-probability events in a nonlinear fashion [16, 15]. Risk-sensitive measures, such as those in the so-called prospect theory [16], capture such biases and are widely used in psychology and economics to characterize human preferences. Furthermore, humans tend to make decisions that are often suboptimal [17]. It is generally believed that such suboptimality is the result of intuitive decisions or preferences that happen automatically and quickly without much reflection [17, 14]. Human decisions are subject to stochasticity due to limited computational capacity and inherent noise [18]. Consequently, human decisions are often cognitively biased (have a different reward mechanism), probabilistic (have a stochastic action selection policy), and memoryless (depend only on the current state). These are the very characteristics of human decision-making that we account for in reward-based deception.
This paper investigates how one can deceive a human adversary by optimally allocating limited resources to minimize his rewards. We model the environment as an MDP to capture the choices available to a human decision-maker and their probabilistic outcomes. We consider opportunistic human adversaries, i.e., adversaries who usually do not plan extensively and act only based on immediately available rewards [11]. We describe the human adversary’s policy for selecting actions following prospect theory and bounded rationality [8]. We model both the adversary’s perceived reward and the defender’s cost (equivalently, the adversary’s reward from the defender’s point of view) as functions of the resources available at each state of the MDP. Additionally, we define a subset of the states in the MDP as sensitive states that the human adversary should be kept from visiting.
We then formulate the optimal resource allocation problem as a signomial program (SP) to minimize the defender’s cost. SPs are a special form of nonlinear programs (NLPs), and they are generally nonconvex. Solving nonconvex NLPs is NP-hard [19] in general, and a globally optimal solution of an SP cannot be computed efficiently. SPs generalize geometric programs (GPs), which can be transformed into convex optimization problems and then solved efficiently [20]. In this paper, we approximate the proposed SP by a sequence of GPs. In numerical experiments, we show that this approach obtains locally optimal solutions of the SP efficiently by solving a number of GPs. We demonstrate the approach on a problem of assigning police patrol hours against opportunistic criminals [11].
The problem we study is closely related to the Stackelberg security game (SSG), which consists of an attacker and a defender that interact with each other. In an SSG, the defender acts first with limited resources, and then the attacker plays in response [21]. The SSG is a popular formalism to study security problems against human adversaries. Early efforts focused on one-shot games, where an adversary can only take one move [22], without considering human bounded rationality. Repeated SSGs were then considered in wildlife security [10] and fisheries [23], where the defender and the adversary interact repeatedly. However, neither of these papers considered how a human perceives probabilities, although the existence of nonlinear probability weighting curves is a well-known result in prospect theory [16]. This phenomenon was taken into account in [24] and [25]. However, [24] only studied one-shot games, and [25] did not consider that adversaries may move from place to place.
The rest of this paper is organized as follows. We first provide the necessary preliminaries for stochastic environment modeling, human cognitive biases, and decision-making in Section II. Then we formulate the human deception problem in terms of resource allocation in Section III and show that it can be transformed into a signomial program in Section IV. We propose a computational approach to solve the signomial program in Section V. Section VI shows simulation results and discusses their implications. We conclude the paper and discuss possible future directions in Section VII.
II Preliminaries
II-A Monomials, Posynomials, and Signomials.
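Let $V = \{x_1, \ldots, x_n\}$ be a finite set of strictly positive real-valued variables. A monomial over $V$ is an expression of the form
$$g = c \cdot x_1^{a_1} \cdots x_n^{a_n}, \tag{1}$$
where $c \in \mathbb{R}_{>0}$ is a positive coefficient, and $a_i \in \mathbb{R}$ are exponents for $1 \le i \le n$. A posynomial over $V$ is a sum of one or more monomials:
$$f = \sum_{k=1}^{K} c_k \cdot x_1^{a_{1k}} \cdots x_n^{a_{nk}}. \tag{2}$$
If $c_k$ is allowed to be a negative real number for any $1 \le k \le K$, then the expression (2) is a signomial.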
This definition of monomials differs from the standard algebraic definition, where exponents are nonnegative integers with no restriction on the coefficient sign. A sum of such monomials is then called a polynomial.
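The distinction between the three classes can be made concrete with a small Python sketch. The encoding of an expression as a list of `(coefficient, exponents)` pairs and the helper names `evaluate` and `classify` are assumptions made for illustration only.

```python
from math import prod

def evaluate(terms, x):
    """Evaluate sum_k c_k * prod_i x_i ** a_ik at a strictly positive point x.

    `terms` is a list of (c_k, (a_1k, ..., a_nk)) pairs (illustrative encoding).
    """
    if any(xi <= 0 for xi in x):
        raise ValueError("all variables must be strictly positive")
    return sum(c * prod(xi ** a for xi, a in zip(x, exps)) for c, exps in terms)

def classify(terms):
    """Return the most specific class: monomial, posynomial, or signomial."""
    if all(c > 0 for c, _ in terms):
        return "monomial" if len(terms) == 1 else "posynomial"
    return "signomial"

# Hypothetical expressions over (x1, x2); exponents may be any real numbers.
g = [(2.0, (1.0, -0.5))]          # 2 * x1 * x2**-0.5
f = g + [(3.0, (0.0, 2.0))]       # g + 3 * x2**2
h = f + [(-1.0, (1.0, 1.0))]      # f - x1 * x2 (negative coefficient)
```

Every monomial is also a posynomial, and every posynomial is also a signomial; `classify` only reports the narrowest class that applies.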
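II-B Nonlinear programs.
A general nonlinear program (NLP) over a set of real-valued variables $V$ is
$$\text{minimize} \quad f \tag{3}$$
subject to
$$\forall i.\ 1 \le i \le m: \quad g_i \le 1, \tag{4}$$
$$\forall j.\ 1 \le j \le p: \quad h_j = 1, \tag{5}$$
where $f$, $g_i$, and $h_j$ are arbitrary functions over $V$, and $m$ and $p$ are the numbers of inequality and equality constraints of the program, respectively.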
II-C Signomial programs and geometric programs.
A special class of NLPs known as signomial programs (SPs) is of the form (3)–(5), where $f$, $g_i$, and $h_j$ are signomials over $V$ (see Section II-A). A geometric program (GP) is an SP of the form (3)–(5) where $f$ and $g_i$ are posynomial functions and $h_j$ are monomial functions over $V$. GPs can be transformed into convex programs [20, §2.5] and then solved efficiently using interior-point methods [26]. SPs are nonconvex programs in general, and therefore no efficient algorithm is known for computing globally optimal solutions of SPs. However, we can efficiently obtain locally optimal solutions for SPs in our setting, as shown in the following sections.
In this paper, the adversary with bounded rationality moves in an environment modeled as a Markov decision process (MDP) [6].
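II-D Markov Decision Processes.
A Markov decision process (MDP) is a tuple $\mathcal{M} = (S, \hat{s}, A, T, U)$, where
• $S$ is a finite set of states;
• $\hat{s}: S \to [0, 1]$ is the initial state distribution;
• $A$ is a finite set of actions;
• $T: S \times A \times S \to [0, 1]$ is the transition function with $T(s, a, s') = P(s' \mid s, a)$. That is, the probability of transiting from $s$ to $s'$ with action $a$; and
• $U: S \to \mathbb{R}_{\ge 0}$ is the utility function that assigns resources with a quantity $U(s)$ to state $s$.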
At each state $s \in S$, an adversary has a set of actions available to choose from. The nondeterminism of the action selection has to be resolved by a policy $\pi$ executed by the adversary. A (memoryless) policy of an MDP $\mathcal{M}$ is a function $\pi: S \times A \to [0, 1]$ that maps every state-action pair $(s, a)$, where $s \in S$ and $a \in A$, to a probability $\pi(s, a)$.
By definition, the policy $\pi$ specifies the probability of the next action $a$ to be taken at the current state $s$. A bounded rational adversary is often limited in memory and computation power; therefore, we only consider memoryless policies.
In an MDP $\mathcal{M}$, a finite state-action path is $\omega = s_0 a_0 s_1 a_1 \ldots s_N$, where $s_i \in S$ and $a_i \in A$. Given a policy $\pi$, it is possible to calculate the probability of such a path as
$$P(\omega) = \hat{s}(s_0) \prod_{i=0}^{N-1} \pi(s_i, a_i)\, T(s_i, a_i, s_{i+1}). \tag{6}$$
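As an illustration of the path probability in (6), the following Python sketch multiplies the initial probability of the first state by the policy and transition probabilities along the path. The dictionary-based encoding, the function name `path_probability`, and the two-state MDP are hypothetical and serve only as an example.

```python
def path_probability(init, policy, trans, states, actions):
    """Probability of the path s_0 a_0 s_1 ... a_{N-1} s_N under a memoryless policy.

    init[s]          : initial probability of state s
    policy[s][a]     : probability of choosing action a in state s
    trans[(s, a)][t] : probability of moving from s to t under action a
    """
    if len(states) != len(actions) + 1:
        raise ValueError("a path with N actions visits N + 1 states")
    prob = init.get(states[0], 0.0)
    for s, a, s_next in zip(states, actions, states[1:]):
        prob *= policy[s][a] * trans[(s, a)].get(s_next, 0.0)
    return prob

# Hypothetical two-state MDP used only for illustration.
init = {"s0": 1.0}
policy = {"s0": {"a": 0.25, "b": 0.75}, "s1": {"a": 0.5, "b": 0.5}}
trans = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s0": 1.0},
}
p = path_probability(init, policy, trans, ["s0", "s1", "s0"], ["a", "b"])
```

Because the policy is memoryless, each factor depends only on the current state and action, so the product can be accumulated in a single pass over the path.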
III Reward-Based Deception
We assume that an adversary with bounded rationality moves around in an environment modeled as an MDP $\mathcal{M}$. When the adversary is at a state $s$, from the defender’s point of view, the immediate reward for the human adversary (or equivalently, the cost for the defender) is
$$R(s) = f(U(s)),$$
which is a function of the allocated resource $U(s)$. However, due to bounded rationality and cognitive biases, the immediate reward at state $s$ perceived by the adversary is a different function of $U(s)$, given by
$$R_h(s) = g(U(s)),$$
where $g$ is another function over $U(s)$. For a given policy $\pi$, expected rewards at each state $s$ and time $t$ with a finite time horizon can be evaluated as
(7) 
where , . Therefore, represents the expected accumulated cost of the defender, or equivalently, the expected reward for the human adversary obtained from the policy .
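A finite-horizon evaluation of this kind can be sketched as a backward recursion in Python. The sketch assumes that the value of a state is its immediate reward plus the policy-weighted expected value of its successors, and that nothing is collected once the horizon is exhausted; this terminal condition, the function names, and the two-state MDP are illustrative assumptions.

```python
def expected_rewards(states, actions, trans, policy, reward, horizon):
    """Expected accumulated reward from each state over `horizon` steps."""
    # Assumed terminal condition: no reward is collected after the horizon.
    value = {s: 0.0 for s in states}
    for _ in range(horizon):
        value = {
            s: reward[s] + sum(
                policy[s][a] * sum(p * value[t] for t, p in trans[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return value

def defender_cost(init, value):
    """Expected accumulated reward weighted by the initial state distribution."""
    return sum(p * value[s] for s, p in init.items())

# Hypothetical two-state MDP used only for illustration.
states, actions = ["s0", "s1"], ["a", "b"]
trans = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s0": 1.0},
}
policy = {s: {"a": 0.5, "b": 0.5} for s in states}
reward = {"s0": 1.0, "s1": 2.0}
value = expected_rewards(states, actions, trans, policy, reward, horizon=2)
cost = defender_cost({"s0": 1.0}, value)
```

With a fixed policy the recursion is linear in the rewards; the difficulty addressed later arises because the policy itself depends on the allocated resources.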
The defender’s objective is to optimally assign the resources to each state to minimize his cost (equivalently, the adversary’s reward) , where
(8) 
by designing the utility function , where the resources are of limited quantity, i.e., . Suppose also that there is a set of sensitive states that the adversary should be kept away from. Denote the set of paths that reach in steps as such that for each where , we require and . In particular, given a policy , can be calculated as
(9) 
Problem 1
Remark 1
Problem 1 studies how to optimally assign the reward to trick the adversary into thinking that his policy could obtain more rewards but in fact, the actual expected reward is minimized with a low probability of visiting sensitive states .
III-A Human Adversaries with Cognitive Biases
To solve Problem 1, it is essential to find the adversary’s policy. In this paper, we take a human as the adversary with bounded rationality who is opportunistic, meaning that he neither has a specific attack goal nor plans strategically, but is flexible about his movement plan and seeks opportunities for attacks [27]. Those attacks may yield rewards for the human adversary and consequently incur certain costs for the defender. The process of human decision-making typically follows several steps [28]. First, a human recognizes his current situation or state. Second, he evaluates each available action based on the potential immediate reward it can bring. Third, he selects an action following some rules. Then he receives a reward and observes a new state. In this section, we introduce the modeling framework for the second and third steps.
For a human with bounded rationality, the value of a reward from an action is a function of the possible outcomes and their associated probabilities. The prospect theory developed by Kahneman and Tversky [16] is a frequently used modeling framework to characterize the reward perceived by a human. Prospect theory claims that humans tend to overestimate the low probabilities and underestimate the high probabilities in a nonlinear fashion. For example, between winning dollar with probability and nothing else, or dollar with probability , humans tend to prefer the former, even though both have the same expectation.
Given $X$ as a discrete random variable that has a finite set of outcomes $\{x_1, \ldots, x_K\}$, a general form of the prospect theory utility (i.e., the reward anticipated by a human) is the following:
$$V(X) = \sum_{i=1}^{K} v(x_i)\, w(p(x_i)), \tag{11}$$
where $v(x_i)$ denotes the reward perceived by a human from the outcome $x_i$. The probability $p(x_i)$ to get the outcome $x_i$ is weighted by a nonlinear function $w$ that captures the human tendency to overestimate low probabilities and underestimate high probabilities.
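A short Python sketch can make the effect of probability weighting concrete. The code assumes the one-parameter weighting function of Tversky and Kahneman [15], $w(p) = p^{\gamma} / (p^{\gamma} + (1-p)^{\gamma})^{1/\gamma}$ with $\gamma = 0.61$, and an identity value function. Both choices are illustrative assumptions and need not match the forms used elsewhere in this paper; the lotteries are hypothetical.

```python
def weight(p, gamma=0.61):
    """Inverse-S-shaped probability weighting (assumed form, see the text above)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def prospect_value(outcomes, value=lambda x: x, gamma=0.61):
    """Sum of perceived rewards weighted by distorted probabilities.

    `outcomes` is a list of (probability, reward) pairs; `value` is the
    perceived-reward function (identity here, an illustrative assumption).
    """
    return sum(weight(p, gamma) * value(x) for p, x in outcomes)

# Two hypothetical lotteries with the same expectation of 1.
risky = [(0.01, 100.0), (0.99, 0.0)]
sure = [(1.0, 1.0)]
```

Under these assumed parameters, `prospect_value(risky)` exceeds `prospect_value(sure)` because the small probability of the large reward is overweighted, even though both lotteries have the same expectation. Setting `gamma=1.0` recovers the undistorted expectation.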
The expected immediate reward to perform an action at state is
(12) 
However, according to prospect theory, from a human’s perspective, the perceived expected immediate reward is different. Let be the random variable for the outcome of executing action at state . We have where denotes the event that the state transits from to with an action . The distribution of is defined as follows.
The human perceived reward for the outcome depends on received from reaching the state , which is denoted by
As a result, is denoted by
(13) 
An empirical form of is the following [16].
(14) 
Given an MDP as depicted in Figure 1, where , . We assume that , , in (14). It can be found from (12) and (13) that and . Since , suppose a human is at , from human’s perspective, he will prefer the action . However, which indicates that action actually has more expected immediate rewards.
Remark 2
In this example, the rewards are already given, and it can be seen that the human could make a suboptimal decision. It illustrates how cognitive bias can cause human behavior to deviate from the optimum.
After evaluating the outcome of each candidate action by , a human then needs to make an action selection. Humans are known to only have quite limited cognitive capabilities. Human’s policy to choose an action can be described as a random process that biases toward the actions of high , such that
(15) 
where denotes the probability of executing the action at state . Such bounded rational behavior has been observed in humans, for example in urban criminal activities [29]. Intuitively, it implies that the human selects the action opportunistically at each state with a probability proportional to the perceived immediate reward .
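The proportional action selection described above can be sketched in a few lines of Python. The sketch assumes that the perceived immediate rewards are strictly positive and simply normalizes them per state; the function name `proportional_policy` and the example values are hypothetical.

```python
def proportional_policy(perceived):
    """Turn perceived[s][a] > 0 into action probabilities proportional to it."""
    policy = {}
    for s, rewards in perceived.items():
        if any(r <= 0 for r in rewards.values()):
            raise ValueError("perceived rewards must be strictly positive")
        total = sum(rewards.values())
        policy[s] = {a: r / total for a, r in rewards.items()}
    return policy

# Hypothetical perceived rewards for two actions at one state.
policy = proportional_policy({"s0": {"left": 1.0, "right": 3.0}})
```

In this example the action with three times the perceived reward is chosen three times as often, but the less attractive action still retains a nonzero probability, which reflects the stochastic, non-maximizing behavior of a bounded rational adversary.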
Now we are ready to redefine Problem 1 as follows.
IV Signomial Programming Formulation
Given an MDP , time horizon , human reward function and policy as defined in (13) and (15), the solution of the Problem 1 can be computed by solving the following signomial program. The and are assumed to be monomial functions of for our solution method.
(16)  
(17)  
(18)  
(19)  
(20)  
(21)  
(22)  
(23) 
where variables are for rewards in each state , are for utilities in each state , are for the probability of taking action in state are for each state and action, are for the expected reward of the state and time step , and are for the probability of reaching the set of target states in each state and time step .
The objective in (16) minimizes the accumulated expected reward from the initial state distribution over a time horizon . In (17), we compute by adding the immediate reward in state and the expected reward of the successor states according to the policy variables for each action . The probability of reaching each successor state depends on the policy variables in each state and action . Similar to the constraint in (17), the variables are assigned to the probability of reaching the set of target states from state and time step in (18).
The probability of reaching any state in each horizon from the states in is set to 1 as in (19). The constraint in (20) assures that the probability of reaching any state from the initial state distribution is less than . The constraint in (21) computes the policy using the model in (15). We give the relationship between rewards and utilities in (22). Finally, (23) gives the total budget for utilities.
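For a fixed policy, the reachability quantities encoded by (18) and (19) can be sketched as a backward recursion in Python. The sketch assumes that a sensitive state counts as reached with probability one and that a non-sensitive state has probability zero once no steps remain; the function name, the encoding, and the three-state MDP are illustrative assumptions.

```python
def reach_probability(states, actions, trans, policy, sensitive, horizon):
    """Probability of visiting a sensitive state within `horizon` steps, per start state."""
    # Assumed boundary condition: with no steps left, only sensitive states count as reached.
    prob = {s: 1.0 if s in sensitive else 0.0 for s in states}
    for _ in range(horizon):
        prob = {
            s: 1.0 if s in sensitive else sum(
                policy[s][a] * sum(p * prob[t] for t, p in trans[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return prob

# Hypothetical MDP: s1 is sensitive, s2 is a safe absorbing state.
states, actions = ["s0", "s1", "s2"], ["a", "b"]
trans = {
    ("s0", "a"): {"s1": 0.5, "s2": 0.5},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s1": 1.0},
    ("s2", "a"): {"s2": 1.0},
    ("s2", "b"): {"s2": 1.0},
}
policy = {s: {"a": 0.5, "b": 0.5} for s in states}
reach = reach_probability(states, actions, trans, policy, {"s1"}, horizon=2)
```

Weighting `reach` by the initial state distribution gives the quantity that the reachability constraint bounds from above.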
The constraints in (17) and (18) are convex constraints, because the functions on the right-hand sides are posynomial functions and the functions on the left-hand sides are monomial functions. The constraints in (19) and (20) are affine constraints; therefore, they are convex. The constraints in (21) and (23) are equality constraints with posynomials; therefore, they belong to the class of signomial constraints and are not convex. In the literature, there are various methods to deal with nonconvex constraints to obtain a locally optimal solution, including sequential convex programming, convex-concave programming, branch and bound, and cutting plane methods [30, 20, 31].
V Computational Approach for the Signomial Program
In this section, we discuss how to compute a locally optimal solution efficiently for Problem 1 by solving the signomial program in (16)–(23). Following [20, §9.1], we propose a sequential convex programming method that computes a local optimum of the signomial program by solving a sequence of GPs. We obtain each GP by replacing the posynomials in the equality constraints of the signomial program in (16)–(23) with their monomial approximations.
V-A Monomial approximation
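Given a posynomial $f$, a set of variables $V = \{x_1, \ldots, x_n\}$, and an initial point $\hat{x}$, a monomial approximation [20] for $f$ around $\hat{x}$ is
$$\hat{f}(x) = f(\hat{x}) \prod_{i=1}^{n} \left( \frac{x_i}{\hat{x}_i} \right)^{a_i}, \quad \text{where} \quad a_i = \frac{\hat{x}_i}{f(\hat{x})} \frac{\partial f}{\partial x_i}(\hat{x}).$$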
Intuitively, a monomial approximation of a posynomial around an initial point corresponds to an affine approximation of the posynomial after the logarithmic change of variables used for GPs. Such an approximation is provided by the first-order Taylor approximation of the transformed posynomial; see [20, §9.1] for more details.
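The approximation can be computed directly from the terms of the posynomial. The following Python sketch follows the local monomial approximation described in [20]: each exponent of the approximating monomial is the average of the corresponding term exponents, weighted by the term values at the expansion point. The `(coefficient, exponents)` encoding and the function names are assumptions made for illustration.

```python
from math import prod

def monomial_approximation(terms, x0):
    """Return (coeff, exponents) of the monomial that matches a posynomial
    and its first-order behavior at the strictly positive point x0.

    `terms` is a list of (c_k, (a_1k, ..., a_nk)) pairs with c_k > 0.
    """
    values = [c * prod(xi ** a for xi, a in zip(x0, exps)) for c, exps in terms]
    f0 = sum(values)
    # a_i = (x0_i / f(x0)) * df/dx_i(x0), which equals a value-weighted
    # average of the exponents of x_i over the terms.
    exponents = [
        sum(v * exps[i] for v, (_, exps) in zip(values, terms)) / f0
        for i in range(len(x0))
    ]
    coeff = f0 / prod(xi ** a for xi, a in zip(x0, exponents))
    return coeff, exponents

def eval_monomial(coeff, exponents, x):
    """Evaluate coeff * prod_i x_i ** a_i."""
    return coeff * prod(xi ** a for xi, a in zip(x, exponents))

# Posynomial x1 + x2 approximated around (1, 1).
terms = [(1.0, (1.0, 0.0)), (1.0, (0.0, 1.0))]
coeff, exponents = monomial_approximation(terms, (1.0, 1.0))
```

For this example the approximation is $2\sqrt{x_1 x_2}$, which agrees with $x_1 + x_2$ at the expansion point and never exceeds it elsewhere, as expected from an affine under-approximation of a convex function in the transformed variables.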
For a given instantiation of the utility and policy variables, we approximate the SP in (16)–(23) to obtain a GP as follows. We first normalize the utility values to ensure that they sum up to the total budget. Then, using those utility values, we compute the policy according to the constraint in (21). After the policy computation, we compute a monomial approximation of each posynomial term in the constraints (21) and (23) around the previous instantiation of the utility and policy variables. After the approximation, we solve the approximate GP. We repeat this procedure until it converges.
One key problem with this approach is that it requires an initial feasible point for the signomial program in (16)–(23), which may be hard to find because of the reachability constraint in (20). Therefore, we introduce a new variable and replace the reachability constraint in (20) by the following constraints:
(24)  
(25) 
By replacing the reachability constraint, we ensure that any initial utility function and policy is feasible for the signomial program in (16)–(25). To enforce the feasibility of the reachability constraint in (20), we change the objective in (16) as follows:
(26) 
where is a positive penalty parameter that determines the violation rate for the soft constraint in (24). In our formulation, we increase after each iteration to satisfy the reachability constraint.
We stop the iterations when the change in the value of is less than a small positive constant . Intuitively, defines the required improvement on the objective value for each iteration; once there is not enough improvement, the process terminates.
VI Numerical Experiment
Let us consider an urban security problem, where a criminal plans his next move randomly based on his local information about nearby locations that are protected by police patrols. Such a criminal is opportunistic, i.e., he is not highly strategic and does not conduct careful surveillance and rational planning before making moves. It is known that this kind of opportunistic adversary contributes to the majority of urban crimes [32]. For prevention and protection, each location should be assigned a certain number of police patrol hours. Due to the limited amount of police resources, the total number of patrol hours is limited as well.
Figure 2 shows intersections in San Francisco, CA with rows and columns. We use an MDP to describe the network of the set of intersections . The number of crimes that occurred in the first four weeks of October 2018 within 500 feet of each intersection is shown in Figure 2. The crime data are obtained from https://www.crimemapping.com/map/ca/sanfrancisco. The criminal can choose to move left, right, up, or down to the immediately neighboring intersections. Consequently, there are four actions available. The execution of each action leads the human to the intended neighboring intersection with a high probability () and to the other neighboring intersections with a small probability, to account for unexpected changes of movement plan.
Initially, the criminal has equal probability to appear at any state, i.e., for any . The utility denotes the number of police patrol hours that should be allocated to the vicinity of each intersection. The total number of police patrol hours is . If a location is assigned with patrol hours, its reward to the criminal (equivalently, the cost to the defender) is
Intuitively, it means that the reward to the human adversary, from the defender’s point of view, is proportional to the crime rate indicated by and inversely proportional to the police patrol hours. The reward from the human adversary’s view is evaluated as
which is a function commonly seen in the literature to describe how human biases the reward [15].
Initially, the criminal is at with probability , where he tries to plan his move over the next steps. The objective is to assign the police patrol hours to each state, such that the expected accumulated reward in steps received by the criminal is minimized. The sensitive states should be visited with a probability no larger than , i.e.
The sensitive states are also shown as blue circles in Figure 2.
We formulate the problem as a signomial program. From an initial uniform utility distribution, we instantiate the policies and reward functions. Then, from the initial values, we linearize the signomial program in (16)–(26) to a geometric program. We parse the geometric programs using the tool GPkit [33], and solve them using the solver MOSEK [34]. We set for convergence tolerance. All experiments were run on a 2.3 GHz machine with 16 GB RAM. The procedure converged after 32 iterations for a problem with horizon length in 230.06 seconds. The expected reward from the initial state distribution is 117.15, and the reachability probability of the sensitive states from the initial state distribution is , which satisfies the reachability specification.
The result is shown in Figure 3. Different colors at each intersection show the number of patrol hours, i.e., the resource , assigned to each location s. In Figure 3, is shown with a logarithmic scale for better illustration. As the color bar at the bottom of the figure indicates, the closer the color at a location is to the right side of this bar, the more patrol hours are assigned. For example, the state at (the third state from the first row), where gets assigned patrol hour equals which is approximately in logarithmic scale. Therefore, its color is yellow in Figure 3, as indicated by the right tip of the color bar. Together with Figure 2, it can be observed that sensitive places and places with a higher number of crimes are assigned more patrol hours. Consequently, the rewards at those states are fairly low, which discourages the criminal from visiting them. The cost at each location is proportional to the crime rate and inversely proportional to the police patrol hours. The patrol hours assigned to each place are intended to minimize the expected cost incurred by the human adversary.
VII Conclusion
This paper introduces a general framework for deceiving adversaries with bounded rationality by minimizing the reward they obtain. Leveraging the cognitive bias of humans from the well-known prospect theory, we formulate reward-based deception as a resource allocation problem in a Markov decision process environment and solve it as a signomial program to minimize the adversary’s expected reward. We use police patrol hour assignment as an illustrative example and show the validity of our proposed solution approach. This work opens doors for further research on scenarios where the defender can move around and react to the human adversaries in real time, and where the human adversary has a learning capability to adapt to the defender’s deceiving policy.
References
 [1] M. H. Almeshekah and E. H. Spafford, “Cyber security deception,” in Cyber deception. Springer, 2016, pp. 23–50.
 [2] J. Pawlick, E. Colbert, and Q. Zhu, “A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy,” arXiv preprint arXiv:1712.05441, 2017.
 [3] S. Bonetti, “Experimental economics and deception,” Journal of Economic Psychology, vol. 19, no. 3, pp. 377–395, 1998.
 [4] T. Holt, The deceivers: Allied military deception in the Second World War. Simon and Schuster, 2010.
 [5] E. Morgulev, O. H. Azar, R. Lidor, E. Sabag, and M. Bar-Eli, “Deception and decision making in professional basketball: Is it beneficial to flop?” Journal of Economic Behavior & Organization, vol. 102, pp. 108–118, 2014.
 [6] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

 [7] K. Horák, Q. Zhu, and B. Bošanský, “Manipulating adversary’s belief: A dynamic game approach to deception by design for proactive network security,” in International Conference on Decision and Game Theory for Security. Springer, 2017, pp. 273–294.
 [8] H. A. Simon, “Models of man; social and rational.” 1957.
 [9] C. F. Camerer, Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2011.
 [10] R. Yang, B. Ford, M. Tambe, and A. Lemieux, “Adaptive resource allocation for wildlife protection against illegal poachers,” in Proceedings of the 2014 international conference on Autonomous agents and multiagent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 453–460.
 [11] C. Zhang, A. X. Jiang, M. B. Short, P. J. Brantingham, and M. Tambe, “Defending against opportunistic criminals: New game-theoretic frameworks and algorithms,” in International Conference on Decision and Game Theory for Security. Springer, 2014, pp. 3–22.
 [12] B. Wu and H. Lin, “Privacy verification and enforcement via belief abstraction,” IEEE Control Systems Letters, vol. 2, no. 4, pp. 815–820, Oct 2018.

 [13] P. Masters and S. Sardina, “Deceptive path-planning,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017, pp. 4368–4375.
 [14] D. Kahneman, Thinking, Fast and Slow. Macmillan, 2011.
 [15] A. Tversky and D. Kahneman, “Advances in prospect theory: Cumulative representation of uncertainty,” Journal of Risk and uncertainty, vol. 5, no. 4, pp. 297–323, 1992.
 [16] D. Kahneman and A. Tversky, “Prospect theory: An analysis of decision under risk,” in Handbook of the fundamentals of financial decision making: Part I. World Scientific, 2013, pp. 99–127.
 [17] E. Norling, “Folk psychology for human modelling: Extending the BDI paradigm,” in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1. IEEE Computer Society, 2004, pp. 202–209.
 [18] P. B. Reverdy, V. Srivastava, and N. E. Leonard, “Modeling human decision making in generalized gaussian multiarmed bandits,” Proceedings of the IEEE, vol. 102, no. 4, pp. 544–571, 2014.
 [19] D. S. Hochbaum, “Complexity and algorithms for nonlinear optimization problems,” Annals of Operations Research, vol. 153, no. 1, pp. 257–296, 2007.
 [20] S. Boyd, S.J. Kim, L. Vandenberghe, and A. Hassibi, “A tutorial on geometric programming,” Optimization and Engineering, vol. 8, no. 1, 2007.
 [21] B. An and M. Tambe, “Stackelberg security games (ssg) basics and application overview,” Improving Homeland Security Decisions, p. 485, 2017.
 [22] M. Jain, J. Tsai, J. Pita, C. Kiekintveld, S. Rathi, M. Tambe, and F. Ordónez, “Software assistants for randomized patrol planning for the lax airport police and the federal air marshal service,” Interfaces, vol. 40, no. 4, pp. 267–290, 2010.
 [23] W. B. Haskell, D. Kar, F. Fang, M. Tambe, S. Cheung, and E. Denicola, “Robust protection of fisheries with compass.” 2014.
 [24] R. Yang, C. Kiekintveld, F. Ordóñez, M. Tambe, and R. John, “Improving resource allocation strategies against human adversaries in security games: An extended study,” Artificial Intelligence, vol. 195, pp. 440–469, 2013.
 [25] D. Kar, F. Fang, F. Delle Fave, N. Sintov, and M. Tambe, “A game of thrones: when human behavior models compete in repeated stackelberg security games,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2015, pp. 1381–1390.
 [26] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
 [27] Y. D. Abbasi, M. Short, A. Sinha, N. Sintov, C. Zhang, and M. Tambe, “Human adversaries in opportunistic crime security games: Evaluating competing bounded rationality models,” in Proceedings of the Third Annual Conference on Advances in Cognitive Systems ACS, 2015, p. 2.
 [28] K. Doya, “Modulators of decision making,” Nature neuroscience, vol. 11, no. 4, p. 410, 2008.
 [29] M. B. Short, M. R. D’orsogna, V. B. Pasour, G. E. Tita, P. J. Brantingham, A. L. Bertozzi, and L. B. Chayes, “A statistical model of criminal behavior,” Mathematical Models and Methods in Applied Sciences, vol. 18, no. supp01, pp. 1249–1267, 2008.
 [30] R. E. Moore, “Global optimization to prescribed accuracy,” Computers & mathematics with applications, vol. 21, no. 6-7, pp. 25–39, 1991.
 [31] E. L. Lawler and D. E. Wood, “Branch-and-bound methods: A survey,” Operations research, vol. 14, no. 4, pp. 699–719, 1966.
 [32] P. J. Brantingham and G. Tita, “Offender mobility and crime pattern formation from first principles,” in Artificial crime analysis systems: using computer simulations and geographic information systems. IGI Global, 2008, pp. 193–208.
 [33] E. Burnell and W. Hoburg, “Gpkit software for geometric programming,” https://github.com/convexengineering/gpkit, 2018, version 0.7.0.
 [34] M. ApS, The MOSEK optimization toolbox for PYTHON. Version 7.1 (Revision 60), 2015. [Online]. Available: http://docs.mosek.com/7.1/quickstart/Using_MOSEK_from_Python.html