I Introduction
We consider games with non-classical and asymmetric information (as specified in [1]), inspired by the full information games considered in [2]. In [2], agents attempt to acquire available destinations; each agent controls its rate of advertisement through a social network to increase its chances of winning the destinations, while trading off the cost of acquisition. The authors of [2] considered full information and no-information games, and studied discrete time policies obtained by uniformization of the controlled Markov process. In full information games, the agents know at any point of time the number of available destinations or, equivalently, the number of destinations already acquired by one or the other agent. In no-information games, the agents have no information; they do not even know the number of destinations acquired by themselves.
It is more realistic to assume that the agents know the number of destinations acquired by themselves, but do not know the number acquired by others. This leads to partial, asymmetric and non-classical information games, which are the main focus of this paper. Basar et al. [1] describe when a game is of non-classical information type, and we restate the description in our own words: if the state of agent depends upon the actions of agent , and if agent knows some information which is not available to agent we have a non-classical information game. These kinds of games are typically hard to solve [1]; when one attempts to find a best response against a strategy profile of the others, one requires beliefs about the states of the others, beliefs about the beliefs of the others, and so on.
Our approach to this problem can be summarized as “open loop control till information update”. With no information, one has to resort to open loop policies; this is the best one can do without access to information updates. With full information one can have closed loop policies, where the optimal action depends upon the state of the system at the decision epoch. In full information controlled Markov jump processes, every agent is informed immediately of a jump in the state and can change its action based on the change. In our case the agents have access to partial information: they can observe only some of the jumps. We therefore need policies that are of open loop type till an information update. At every information update, one can choose a new open loop control depending upon the new information.
We consider one and two lock acquisition problems. An agent wins a reward of one if it acquires all the locks and is the first one to acquire them. The agents have no access to the information/state of the others; however, upon contacting a lock they learn whether they are the first to contact it. We obtain a Nash equilibrium for these partial, asymmetric information games: a pair of (partial) state-dependent time threshold policies forms a Nash equilibrium. We obtain these results (basically best responses) by solving dynamic programming equations applicable at (partial) information update epochs; each stage of the dynamic programming equations is solved via an appropriate optimal control problem and the corresponding Hamilton-Jacobi-Bellman (HJB) equations.
A blockchain network is a distributed ledger that is completely open to all nodes; each node has a copy of the transactions (in the case of currency exchange). If a new transaction is to be added (linked) to the existing chain in the form of a new block, the miners (designated special nodes) must search for a key (encryption) that enables the block to be added to the ledger. This search requires computational power and time. The first miner to find the right key gets a financial reward. If the miners do not know the status of the search efforts of the others, the resulting game is exactly our one lock problem. The two lock problem can be viewed as an extension of the one lock problem, wherein a second key is required to gain the reward.
II Problem Description
Two agents compete to win a project. There are one or two lock(s) to win the project, and the aim of the agents is to reach these as quickly as possible. The agent that contacts all the locks before the other gets a reward of one unit. Further, the agents need to contact the lock(s) before the deadline .
The contact process is controllable; the agents control the rate of the contact process continuously over time and incur a cost for acceleration. The acquisition/contact process is modelled by a Poisson process. The rate of the contact process can vary with time over the interval ; it can further depend upon the number of locks acquired, etc. The higher the rate, the higher the chances of success, but also the higher the cost of acquisition.
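The controlled contact clock can be illustrated with a small simulation. The following is a minimal sketch, assuming a hypothetical threshold control (a constant rate `rate_max` up to time `threshold`, zero afterwards); the function and parameter names are illustrative and not taken from the paper.

```python
import math
import random

def first_contact_time(rate_max, threshold, deadline, rng):
    """Sample the first contact time of a Poisson clock whose rate is
    rate_max on [0, threshold] and 0 afterwards (assumed threshold control).
    Returns math.inf when no contact occurs before the deadline."""
    active = min(threshold, deadline)
    if rate_max <= 0.0 or active <= 0.0:
        return math.inf
    # With a constant rate on [0, active], the first arrival is exponential.
    t = rng.expovariate(rate_max)
    return t if t <= active else math.inf

def contact_probability(rate_max, threshold, deadline):
    """Closed form: P(contact before deadline) = 1 - exp(-rate_max * active time)."""
    return 1.0 - math.exp(-rate_max * min(threshold, deadline))

# Monte Carlo estimate of the contact probability for one illustrative setting.
rng = random.Random(0)
n = 100_000
hits = sum(first_contact_time(2.0, 0.5, 1.0, rng) < math.inf for _ in range(n))
estimate = hits / n
```

The estimate can be compared with `contact_probability(2.0, 0.5, 1.0)`; a higher rate or a later threshold raises the contact probability, which is the trade-off against the acceleration cost described above.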
Information structure: The agents have partial/asymmetric information about the locks acquired by the various agents and use the available information to design their acceleration strategies optimally. An agent knows at all times the information (contacted/not contacted etc., at a given time) related to its own contact attempts; however, it has limited access to that of the others. When an agent contacts a lock, it learns whether the contact is successful; we call a contact successful if the agent is the first one to contact that particular lock. If the other agent had contacted the same lock before the tagged agent, the tagged agent has an unsuccessful contact. The agent gets this update immediately after a contact, and this also reveals some information about the status of the other agents. For example, if the agent learns upon a contact that the contact is successful, it also learns that it is the first one to contact that lock.
Decision epochs: Every agent has a continuous (at all time instances) update of the status of its contact process, however there is a major update in its information only at (lock) contact instances. At these epochs it would know if the contact is successful/unsuccessful which in turn would reveal some information about the state of the other agents. Hence these form the natural decision epochs; an agent can choose a different action at such time points. Further, it is clear that the decision epochs of different agents are not synchronous.
Actions: The agents should choose an appropriate rate (function) for the contact/acceleration process. The rate of contact, for agent , at any time point can be between . Each agent chooses an action which specifies the acceleration process at the beginning, and changes its action every time it contacts a lock (successfully or unsuccessfully). The action at any decision epoch is a measurable acceleration process (that can potentially span over time interval ). To be precise agent at decision epoch (the instant at which it contacted the -customer) chooses an , as the acceleration process to be used till the next acquisition. Here is the space of all measurable functions that are essentially bounded, i.e., the space of functions with finite essential supremum norm:
State: We have two decision epochs in the two lock problem and one decision epoch in the one lock problem, with a corresponding number of major state updates. Let represent the information available to agent immediately after -th contact (by convention, the process commences with the 0-th contact). Here has two components: a) the first component is a flag indicating whether the contact was successful; b) the second component is the time of contact. The first decision epoch (the only decision epoch in the one lock problem) is at time 0, and the state is simply set to to indicate 0 contacts and ’0’ contact time (which is of no importance since there is no contact yet). The state at the second decision epoch , where flag implies successful contact and implies unsuccessful contact while represents the time of first contact. Let represent the (random) -th contact instant of agent . Here we view as a fictitious random contact instant which can take any value from and with indicating that the agent could not contact the -th lock before the deadline.
We distinguish the one lock problem from the two lock problem by using to represent the number of locks.
Strategy: The strategy of player
where represents the acceleration process used at start, represents the acceleration process used after contacting one lock and this choice depends upon the available information. To keep the notation simple, the state is omitted from the actions unless clarity requires it. One can easily extend this theory to a more general problem with locks, but the emphasis of this paper is on the case with . We briefly discuss the extension to a larger towards the end of the paper.
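The structure of such a strategy (one open-loop rate function at the start, and a second one chosen from the state observed at the first contact) can be sketched as a small data structure. This is only an illustration: the class name `TwoStageStrategy`, the helper names and the encoding of the state as a `(flag, contact_time)` pair with flag 1 for a successful contact are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed state encoding: (flag, contact time), flag 1 = successful, 0 = unsuccessful.
State = Tuple[int, float]

@dataclass(frozen=True)
class TwoStageStrategy:
    initial: Callable[[float], float]               # open-loop rate used from time 0
    after_contact: Callable[[State, float], float]  # open-loop rate chosen at the first contact

def threshold_rate(rate_max: float, threshold: float) -> Callable[[float], float]:
    """Open-loop control: attempt at rate_max up to the threshold, then stop."""
    return lambda t: rate_max if t <= threshold else 0.0

def example_strategy(rate_max: float, theta_first: float, theta_second: float) -> TwoStageStrategy:
    def after_contact(state: State, t: float) -> float:
        flag, contact_time = state
        if flag == 0:
            return 0.0  # unsuccessful first contact: no incentive to continue
        return rate_max if contact_time <= t <= theta_second else 0.0
    return TwoStageStrategy(threshold_rate(rate_max, theta_first), after_contact)

strategy = example_strategy(2.0, 0.4, 1.0)
```

Between two information updates the agent simply follows the currently selected rate function, which is the open-loop-till-update idea in code form.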
Rewards/Costs: The reward of an agent is one only if it succeeds in contacting all locks and, further, if it is the first one to contact the first lock. Let ( implies minimum of the two) and respectively represent the durations of the first and the second contact process. (As already mentioned, the contact clocks are free running Poisson clocks; however, we are interested only in those contacts that occur before the deadline .) The cost spent on acceleration for the -th contact equals,
(1)
The paper mainly considers the two agent problem. The agent problem is discussed in Section V and the results are conjectured using the two agent results. Thus, for simpler notation, we represent the tagged agent by while the other agent is indexed by .
The expected utility for stage equals:
where represents the probability of eventual success (when all the locks are contacted before and when the first contact is successful) and is the trade-off factor between the reward and the cost. Note here that the reward (probability of success) is added only to the last stage, i.e., only when . For the one lock problem, i.e., when , the probability of success equals
(3)
where is the probability that the other agent has not contacted the lock before the agent , i.e., before time and (see equation (1) for definition of )
(4)
is the density. Note that the complementary CDF of (time till agent contacts the destination, irrespective of whether it is the first one or not and whether it is before or not), under deterministic policy , is not influenced by the strategies of the others and is given by:
In a similar way for the two lock problem,
(5)
where is the probability that the other agent (agent ) has not contacted the first lock before agent (which is the same as the one discussed in the one lock problem) and
(6)
is the density of the second-lock contact process of agent .
It is easy to observe that for any given , the expected cost equals (see (1) and with ):
(7)
If the contact occurs after the deadline , one has to pay for the entire duration (with zero reward) and hence the first term in the above equation.
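The expected acceleration cost of a stage can be checked numerically. The sketch below assumes that the cost of a stage is the time integral of the chosen rate up to the contact time or the deadline, whichever comes first (before scaling by the trade-off factor); this cost form and all names are assumptions made for illustration. It uses the identity that the expectation of an integral up to a random time equals the integral of the rate weighted by the probability that the contact has not yet occurred.

```python
import math

def expected_cost(rate, deadline, steps=20_000):
    """E[ integral of rate over [0, min(tau, deadline)] ], by a midpoint sum of
    rate(t) * P(tau > t), with P(tau > t) = exp(-integrated rate up to t)."""
    dt = deadline / steps
    cumulative = 0.0   # integrated rate up to the left end of the current step
    total = 0.0
    for k in range(steps):
        a = rate((k + 0.5) * dt)
        survival_mid = math.exp(-(cumulative + 0.5 * a * dt))
        total += a * survival_mid * dt
        cumulative += a * dt
    return total

def threshold_rate(t):
    # Hypothetical control: rate 2 up to time 0.5, silent afterwards.
    return 2.0 if t <= 0.5 else 0.0

numeric = expected_cost(threshold_rate, 1.0)
# Under the assumed cost form the expectation collapses to the contact probability.
closed_form = 1.0 - math.exp(-2.0 * 0.5)
```

The part of the expectation coming from paths with no contact before the deadline corresponds to paying for the entire duration, as noted above.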
Game Formulation: The overall utility of agent , when player chooses the strategy is given by
(8)
Thus we have a strategic/normal form non-cooperative game problem and our aim is to find a pair of policies (that depend only upon the available (partial) information) that form the Nash equilibrium (NE).
We begin with a one lock problem in the next section.
III One lock problem
We specialize to the one lock problem in this section, while the two lock problem is considered in the next section. Here the agent that first gets the lock gets the reward. At any time, the agents are aware of their own state, but an agent learns information about the other only when it contacts the lock. At that time, the contact can be successful or unsuccessful, with the former event implying that the other agent has not yet contacted the lock. In this case the agents have to choose one contact rate function / at the start, and stop either after a contact or after the deadline expires. There is no major information update at any time point before the (only) contact and hence this control process is sufficient.
We prove that a pair of (time) threshold policies form an NE for this game. We prove this by showing that the best response against a threshold policy is again threshold. Towards this, we first discuss the best response against any given strategy of the opponent. Given a policy with of agent , it is clear that the failure probability of the other agent in equation (3) equals:
where is as defined in (1). The best response for this case is obtained by (see equations (1)-(7)):
(9)
For any given policy of the other agent, the above best response is clearly a finite horizon () optimal control problem as can be rewritten as
(10)
with state process
a given function
and with terminal cost
(11)
Here we need to note that represents the state process for the optimal control problem that is used as a tool to solve the original best response problem and is not the state process of the actual/original problem. Further in two lock problem (considered in the next section), we will have one such optimal control problem for each stage and for each state and each of those optimal control problems will have their own state processes.
Conjecture: We aim to prove, using Hamilton-Jacobi-Bellman (HJB) equations (which provide the solution to the above mentioned optimal control problems of each stage), that the best response against any policy of agent is a time-threshold type policy as discussed below. We are yet to prove this. However, from the nature of the HJB equations one can conclude that the best response policies are of bang-bang type. For now, we continue by deriving the best response against time-threshold policies.
III-A Best response against silent opponent
We begin with best response of an agent, when the other agent is silent, i.e., when for all . This particular result will be used repeatedly (also for the case with two locks) and hence is stated first. Let .
Theorem 1.
[Best response, against silent opponent] When an agent attempts to acquire a lock for any given time , and when there is no competition and if it receives a reward upon success: i) the best response of (9) is derived by solving the following optimal control problem:
ii) The solution of the above problem is the following:
Proof: We drop the superscript in this proof, for simpler notations. Using density (4), the expected reward (against silent opponent) equals
(13)
The cost does not depend upon the existence of other players and hence remains the same as in (7), reproduced here for clarity:
Then the overall problem is to maximize and hence we consider
Thus we need the solution of the following Hamilton-Jacobi-Bellman (HJB) PDE, with representing the value function and with , its partial derivatives
or in other words
with boundary condition
Claim: We claim the following is a solution satisfying the above boundary value problem (when ); we derived it by solving the HJB PDE after replacing the maximizer with :
Note that its partial derivatives are:
Thus if , the maximizer in HJB PDE is and then satisfies the HJB PDE,
and also satisfies the boundary condition
It is further easy to verify that for all and satisfy equation (5.7) of [3, Theorem 5.1]. Thus when or equivalently when then the optimal policy is to attempt with the highest possible rate at all times.
When , using similar logic one can show that (for all ) is the solution of the HJB PDE and for all .
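The all-or-nothing character of this best response can be illustrated within the class of threshold policies. The sketch below assumes that the stage cost is the trade-off factor `beta` times the integrated rate and that success pays `reward`; under these assumptions the utility against a silent opponent reduces to `(reward - beta)` times the contact probability. The names `reward`, `beta` and `rate_max` are hypothetical, and the grid search is a numerical check, not the HJB argument of the proof.

```python
import math

def utility_vs_silent(theta, rate_max, reward, beta):
    """Utility of attempting at rate_max on [0, theta] with no competition:
    (reward - beta) * P(contact before theta), under the assumed cost form."""
    return (reward - beta) * (1.0 - math.exp(-rate_max * theta))

def best_threshold_vs_silent(rate_max, reward, beta, deadline, grid=100):
    """Grid search over thresholds in [0, deadline]."""
    candidates = [deadline * k / grid for k in range(grid + 1)]
    return max(candidates, key=lambda th: utility_vs_silent(th, rate_max, reward, beta))

cheap = best_threshold_vs_silent(2.0, 1.0, 0.5, 1.0)    # beta below the reward
costly = best_threshold_vs_silent(2.0, 1.0, 1.5, 1.0)   # beta above the reward
```

With a trade-off factor below the reward the search returns the deadline (attempt at full rate throughout), and with a factor above the reward it returns zero (remain silent), mirroring the two cases of the proof.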
Using similar techniques one can find best response against any given time-threshold policy, which is next considered.
III-B Best response against a time threshold policy
Assume now player uses the following time-threshold policy, represented by:
Basically agent attempts with maximum acceleration till time and stops completely after that. In this case the failure probability of agent in equation (3) simplifies to:
and so
(14)
The best response against such a threshold policy of agent is obtained in the following. From Theorem 1, it is clear that the best response against any strategy is to remain silent when and when the reward equals one. Thus the Nash equilibrium strategy is for both agents to remain silent for . From now on, we consider
Theorem 2.
[Best response] Assume The best response of agent against () policy of agent is given by:
where,
Proof: The details of this proof are in Appendix A. Thus when agent uses threshold strategy with small , best response of agent is to attempt till the end and if the threshold of agent is larger, then the best response of agent is to try till (irrespective of the actual value of ).
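The qualitative behaviour described here can be reproduced numerically. The sketch below works under assumptions made for illustration: the opponent uses rate `rate_j` on `[0, psi]`, the tagged agent uses rate `rate_i` on `[0, theta]`, the reward is one, and the cost is `beta` times the integrated rate. The utility is then the integral, over the agent's own contact density, of the probability that the opponent has not yet contacted minus `beta`. All names are hypothetical, and the grid search does not replace the proof in Appendix A.

```python
import math

def utility(theta, psi, rate_i, rate_j, beta, deadline, steps=2000):
    """Utility of rate_i on [0, theta] against rate_j on [0, psi] (midpoint sum)."""
    dt = deadline / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        if t > theta:
            break
        own_density = rate_i * math.exp(-rate_i * t)      # density of own contact time
        other_absent = math.exp(-rate_j * min(t, psi))    # P(opponent has not contacted by t)
        total += own_density * (other_absent - beta) * dt
    return total

def best_threshold(psi, rate_i, rate_j, beta, deadline, grid=200):
    """Best threshold on a grid of [0, deadline]."""
    candidates = [deadline * k / grid for k in range(grid + 1)]
    return max(candidates, key=lambda th: utility(th, psi, rate_i, rate_j, beta, deadline))

against_small = best_threshold(0.2, 2.0, 2.0, 0.5, 1.0)  # opponent stops early
against_large = best_threshold(0.8, 2.0, 2.0, 0.5, 1.0)  # opponent persists
```

In this example the best response to the early-stopping opponent is to attempt till the deadline, while the best response to the persistent opponent is a threshold near `-log(beta) / rate_j`, which does not depend on the opponent's actual threshold. Whether this value coincides with the threshold of Theorem 2 depends on the assumed cost form matching the model.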
III-C Nash Equilibrium
We observe from the above result that the best response against a threshold policy is again a threshold policy. Thus one can get the Nash equilibrium if one can find two thresholds one for each agent, such that
From Theorem 2, it is easy to find such a pair of thresholds and is also easy to verify that this pair of thresholds is unique. We have the following:
Theorem 3.
[Nash Equilibrium] Assume , and without loss of generality . For a two agent partial information game, we have a Nash equilibrium among (time) threshold policies, as defined below:
where threshold is as in Theorem 2, while is given by:
Proof: The first line is easily evident from Theorem 2. For the second one, observe the following: when :
because (note ) and thus
Now if , then clearly On the other hand if , then .
It is further clear (by simple calculations using Theorem 2) that we have a unique Nash equilibrium among time threshold policies. It would be more interesting to show that this is the unique NE, but that is part of future work.
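The fixed-point condition behind a threshold equilibrium (each threshold is a best response to the other) can be checked mechanically. The sketch below uses a hypothetical closed-form best-response map derived under the illustrative assumptions of unit reward and a cost equal to the trade-off factor times the integrated rate; it is not the statement of Theorem 3, and ties (where the agent is indifferent) are broken towards the smaller threshold.

```python
import math

def best_response(psi, rate_other, beta_own, deadline, tol=1e-12):
    """Assumed best-response threshold: keep attempting while
    exp(-rate_other * min(t, psi)) exceeds beta_own."""
    switch = -math.log(beta_own) / rate_other   # time at which exp(-rate_other * t) = beta_own
    if rate_other * psi < -math.log(beta_own) - tol:
        return deadline                          # opponent stops early: attempt till the deadline
    return min(switch, deadline)

def is_equilibrium(theta1, theta2, rates, betas, deadline, tol=1e-9):
    """Check that each threshold is the best response to the other."""
    br1 = best_response(theta2, rates[1], betas[0], deadline)
    br2 = best_response(theta1, rates[0], betas[1], deadline)
    return abs(br1 - theta1) < tol and abs(br2 - theta2) < tol

# Illustrative parameters: equal maximum rates, agent 1 has the smaller trade-off factor.
rates, betas, deadline = (2.0, 2.0), (0.5, 0.6), 1.0
theta2_star = -math.log(betas[1]) / rates[0]
found = is_equilibrium(deadline, theta2_star, rates, betas, deadline)
```

In this example the pair in which agent 1 attempts till the deadline and agent 2 stops at `theta2_star` passes the check, whereas the pair in which both attempt till the deadline does not.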
Thus when one has no access to the information of the other agent till their own contact with the lock, the NE are given by open loop policies. But this is true only for one lock () problem. With large , we will have closed loop policies but the policies change only at major information change epochs. In all, we will see that the NE will again have a group of open loop policies, each of which is used till a major change in the information.
IV Two Lock Problem
Before we proceed with the analysis we summarize the protocol again. An agent succeeds only if it contacts lock one followed by lock two, and only the agent that gets both locks receives reward one. If an agent contacts lock one, we say it had an unsuccessful contact if it is not the first one to contact the lock. If an agent’s contact is unsuccessful, there is no incentive for the agent to try any further. On the other hand, when an agent is successful, it knows that it is the only one chasing the second lock. We can use the previous ( case) analysis, Theorem 1, to compute the best response against a silent opponent (for the second lock).
This is a two stage problem, as the utility of agent is given by:
For this two lock case, the best response of agent against any given strategy of player can be solved using (two stage) dynamic programming (DP) equations as below
(15)
with stage wise costs as defined in equations (2)-(7). Note these DPs hold even when the action spaces are Banach spaces, as in our case.
Like in the one-lock case, we obtain an NE by finding the best response against appropriate threshold strategies.
Threshold strategy for two-lock problem: Our conjecture is that the strategy constructed using state dependent time-threshold policies will form a part of the NE. At contact instance of the first lock, the contact could be successful or unsuccessful. Thus we have two types of states immediately after the first contact, i.e., the state after the first contact is either given by or by . We compactly represent Threshold policy by which means the following:
at start, use
Theorem 1 inspires us to conjecture that this kind of threshold strategy becomes a part of the NE, and the same is proved in Theorem 4. We begin with the best response.
IV-A Best response against a threshold strategy
Say agent uses threshold strategy . We obtain the best response by solving the DP equations (15) using backward induction. When in (15) and if it is immediately clear that (see (2))
as failure with the first lock implies zero reward. If , i.e., if the player is successful with the first lock and the contact was at , the agent will either have an unsuccessful contact or may not even contact the first lock before the deadline . Further, because agent uses policy it would not try for the second lock. Thus agent will attempt for the second lock, while the other agent is silent with respect to the second lock. Thus the optimization problem corresponding to this stage from equations (2)-(7) is given by:
This is exactly the optimization problem considered in Theorem 1 with and hence the best response (with ) is given by: (attempt with maximum for the rest of the period). Thus from Theorem 1 with and we have:
Now solving the DP equations for , it is easy to verify that the corresponding optimization problem is (with as before and see (2), (15)):
This optimization problem is once again solved using optimal control theory based tools and we directly obtain the following. When it is easy to verify that, both agents being silent is the Nash equilibrium. This result can easily be derived (by finding the best responses as in Theorem 5 of Appendix B, which provides the best response against the silent opponent).
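The backward induction over the two stages can be sketched numerically. The example assumes unit reward, a cost equal to the trade-off factor `beta` times the integrated rate, an opponent that attempts the first lock at rate `rate_other` on `[0, psi]`, and a second stage played against a silent opponent at full rate till the deadline when `beta` is below one. All names are hypothetical and the grid search only illustrates the structure of the DP.

```python
import math

def stage2_value(s, rate_max, beta, deadline):
    """Value after a successful first contact at time s, against a silent
    opponent: attempt at rate_max till the deadline if beta < 1, else stay silent."""
    if beta >= 1.0:
        return 0.0
    return (1.0 - beta) * (1.0 - math.exp(-rate_max * (deadline - s)))

def stage1_utility(theta, psi, rate_max, rate_other, beta, deadline, steps=2000):
    """First-stage utility of attempting at rate_max on [0, theta] (midpoint sum)."""
    dt = deadline / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        if t > theta:
            break
        own_density = rate_max * math.exp(-rate_max * t)
        first = math.exp(-rate_other * min(t, psi))   # P(own first contact is successful)
        continuation = first * stage2_value(t, rate_max, beta, deadline)
        total += own_density * (continuation - beta) * dt
    return total

def best_first_threshold(psi, rate_max, rate_other, beta, deadline, grid=200):
    """Best first-stage threshold on a grid of [0, deadline]."""
    candidates = [deadline * k / grid for k in range(grid + 1)]
    return max(candidates,
               key=lambda th: stage1_utility(th, psi, rate_max, rate_other, beta, deadline))

theta_vs_silent = best_first_threshold(0.0, 2.0, 2.0, 0.2, 1.0)
theta_vs_active = best_first_threshold(0.3, 2.0, 2.0, 0.2, 1.0)
```

In this example the first-stage threshold stays strictly below the deadline even against a silent opponent, because the time left for the second lock shrinks as the first contact is delayed, and it shrinks further when the opponent is active.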
Theorem 4.
[Nash Equilibrium] Let and assume . The NE is given by the following, under the conditions:
The above . Now consider that Let satisfy
and satisfy the following equation
If the following two conditions are satisfied
(16)
then the pair forms a Nash equilibrium. The above . The conditions (16) are immediately satisfied with , in which case the common
Remarks: A few interesting observations for the cases in which we derived the result: a) in the two lock problem none of the agents at an NE would try till (in contrast to the one lock problem); b) the agents either remain silent or attempt for a time period that is strictly less than ; and c) we obtained the NE for all the values of the parameters for the case when
V Extensions and Future work
One can easily extend the results to -player game with symmetric parameters, i.e., to the case when for all . For one lock problem it is not difficult to conjecture that the Nash equilibrium among time-Threshold policies is given by,
In a similar way the two-lock Nash equilibrium for symmetric agents, could probably be obtained using Theorem 4; with the parameter of the opponent as and with . We conjecture the NE for this case to be, . These are only conjectures and we need to verify and prove the same. Further we would like to work with asymmetric agents.
It would be equally interesting to work with -lock problem with . We anticipate that the silence theorem (like Theorems 1 and 5) should be extended and then the analysis would follow easily. It would be more interesting to work with the problem in which each lock fetches a reward. For all these and more general problems, the methodology would be the same: one needs to consider open loop control till a new information update. Thus these partial information problems span from completely open loop policies (no information) to completely closed loop policies (full information).
Conclusions
We considered lock acquisition games with partial, asymmetric information. Agents control the rate of their Poisson clocks to acquire two locks; the first one to get both gets the reward. There is a deadline before which the locks are to be acquired, only the first agent to contact a lock can acquire it, and the agents are not aware of the acquisition status of the others. It is possible that an agent continues its acquisition attempts while the lock has already been acquired by another agent. The agents pay a cost proportional to their rates of acquisition. We proposed a new approach to solve these asymmetric and non-classical information games: “open loop control till the information update”. With this approach we have dynamic programming equations applicable at state change update instants, and each stage of the dynamic programming equations is then solved by optimal control theory based tools (HJB equations). We showed that a pair of (available) state dependent time threshold policies forms a Nash equilibrium. We also conjectured the results for the games with -agents.
References
- [1] T. Basar and J. B. Cruz Jr., “Concepts and methods in multi-person coordination and control,” Decision and Control Laboratory, University of Illinois at Urbana, Tech. Rep., 1981.
- [2] E. Altman, “A stochastic game approach for competition over popularity in social networks,” Dynamic Games and Applications, vol. 3, no. 2, pp. 313–323, 2013.
- [3] W. H. Fleming and H. M. Soner, Controlled Markov processes and viscosity solutions. Springer Science & Business Media, 2006, vol. 25.
Appendix A: Proofs related to One Lock Problem
Proof of Theorem 2: The best response against a threshold policy can be obtained by solving the optimal control problem (see equation (10) with as in (14))
with state update equation given by
Further the terminal cost is . Thus the HJB (PDE) equation that needs to be solved as in the proof of Theorem 1 is given by the following:
(17)
Let and . We conjecture that the optimal control for this problem is a threshold policy for some appropriate .
Claim: We further claim the following to be the solution of the above PDE, and we prove this claim alongside computing (we actually show that or ). We compute the following solutions by replacing the maximizers in the HJB PDEs ; one of them is for the case when and one for the other case:
where,
The partial derivatives of the above are:
We now check whether the above partial derivatives satisfy PDE (17).
Case 1: When We will prove for this case that .
Thus for all we have:
For in this range we will require and this is true if for all