I Introduction
There has recently been much interest in networked systems for collaborative resource utilization. These are systems in which agents contribute to the overall welfare through their individual actions. Usually, each agent has a certain amount of resources, and can choose how much to contribute based on the perceived return via repeated interactions with the system. An example is a peer-to-peer file sharing network, wherein each peer can contribute upload bandwidth by transmitting chunks to a peer, and receive downloads of chunks from that peer as a reward. Interactions are bilateral, and hence tit-for-tat type strategies are successful in preventing free-riding behavior [1]. More generally, collaborative systems entail multilateral interactions in which the actions of each agent affect and are affected by the collective behavior of a subset of agents. Here, more complex mechanisms are needed to accurately determine the value of the contribution of each individual to the group.
An example of a collaborative system with repeated multilateral interactions is a device-to-device (D2D) wireless network. Suppose that multiple devices require the same content chunk. The broadcast nature of the wireless medium implies that several agents can be simultaneously satisfied by a single transmission. However, they might each have different values for that particular chunk, and may have contributed different amounts in the past to the transmitting agent. Furthermore, D2D systems undergo “churn” in which devices join and leave different clusters as they move around. How then is an agent to determine whether to collaborate with others, and whether it has received a fair compensation for its contribution?
Our objective in this paper is to design mechanisms for cooperation in systems with repeated multilateral interactions. As in earlier literature, we assume that there exists a currency to transfer utility between agents [2, 3], and our goal is to determine how much should be transferred for optimal collaboration. We focus on wireless content streaming as our motivating example. In particular, as shown in Figure 1, we assume that all devices are interested in the same content stream, and receive a portion of chunks corresponding to this stream via a unicast base-station-to-device (B2D) interface. The B2D interface has a large energy and dollar cost for usage, and the devices seek to mitigate this cost via sharing chunks through broadcast D2D communication. (As we describe in greater detail later in this paper, it is possible to enable the usage of both the 3G (unicast) and WiFi (broadcast) interfaces simultaneously on Android smart phones.)
A content sharing system is described in [4], in which the objective is to achieve live streaming of content synchronously to multiple co-located devices. The system architecture of that work forms an ideal setting for studying mechanism design in which multilateral interactions occur. The setup is illustrated in Figure 2. Here, time is divided into frames, which are subdivided into slots. A block of data is generated by the content server in each frame, and the objective is to ensure that this block can be played out by all devices two frames after its generation. Such a strict delay constraint between the time of generation and playout of each data block ensures that the live aspect of streaming is maintained.
Upon generation of a block, the content server divides it into chunks and performs random linear coding (RLC) over these chunks [5]. The server unicasts some of these coded chunks to each device using its B2D interface. This number is to be kept small to reduce B2D usage. In the next frame, the devices use the broadcast D2D network to disseminate these chunks among themselves. At the end of that frame, the devices attempt to decode the block. If a device has received enough coded chunks to decode the block, it plays out that block during the following frame; otherwise, it will be idle during that frame. The use of RLC results in two desirable system features. First, the server can unicast a fixed number of chunks to the devices in each frame over a lossy channel (Internet plus B2D link) without any feedback. Second, the devices do not need to keep track of what chunks each one possesses while performing D2D broadcasts.
The notion of quality of experience (QoE) here is the delivery ratio, which is the average ratio of blocks desired to the blocks generated [6]. For instance, a delivery ratio below one means that it is acceptable for the remaining fraction of the blocks to be skipped. A device can keep track of its QoE thus far via the “deficit” incurred up to the current frame, which is the difference between the actual number of blocks successfully decoded by that frame and the target value. In [4], it was shown that, assuming complete cooperation by the participating devices, it is possible to design a chunk sharing scheme whereby all devices would meet their QoE targets with minimal usage of the B2D interface. But how do we design a mechanism to ensure that the devices cooperate?
The setting of interest in this paper is that of a large number of D2D clusters, each with a fixed number of agents, and with all clusters interested in the same content stream. Examples of such settings are sports stadia, concerts or protest meetings, where a large number of agents gather together, and desire to receive the same live stream (replays, commentary, live video, etc.). Devices move between clusters as agents move around, causing churn. The objective of our work is to develop an incentive framework wherein each device truthfully reports the number of chunks that it receives via B2D and its deficit in each frame, so that a system-wide optimal allocation policy can be employed. Such an incentive framework should be lightweight and compatible with minimal amounts of history retention. Finally, we also desire to implement the system on Android smart phones and measure its real world performance.
Related Work
The question of how to assign value to wireless broadcast transmissions is intriguing. For instance, [7] considers a problem of repeated interaction with time deadlines by which each node needs to receive a packet. Each node declares its readiness to help others after waiting for a while; the game lies in choosing this time optimally, and the main result is to characterize the price of anarchy that results. However, decision making is myopic, i.e., devices do not estimate future states while taking actions. In a similar fashion, the authors of [3] propose a scheme for sharing 3G services via WiFi hotspots using a heuristic scheme that possesses some attractive properties. Here too, decision making is myopic. The question of fair scheduling at a base station that uses the history of interactions with individual stations in order to identify whether they are telling the truth about their state is considered in [8]. However, since the devices in our network undergo churn and keeping track of device identities is infeasible, we desire a scheme that does not use identities or history to enable truthful revelation of state. The initial version of this work was presented in [9], in which all proofs were omitted due to space constraints. This paper presents complete details of the analytical methodology.
Perfect Bayesian and Mean Field Equilibria
The typical solution concept in dynamic games is that of the Perfect Bayesian Equilibrium (PBE). Consider a strategy profile for all players, as well as beliefs about the other players’ types at all information sets. This strategy profile and belief system is a PBE if: (i) Sequential rationality: Each player’s strategy specifies optimal actions, given her beliefs and the strategies of other players; (ii) Consistency of beliefs: Each player’s belief is consistent with the strategy profile (following Bayes’ rule). The PBE requires each agent to keep track of their beliefs on the future plays of all other agents in the system, and play the best response to that belief. The dynamic pivot mechanism [10] extends the truth-telling VCG idea [11] to dynamic games. It provides a basis for designing allocation schemes that are underpinned by truthful reporting. Translating the model in [8] to the language of [10], it is possible to use the dynamic pivot mechanism to develop a scheme (say FiniteDPM) with appropriate transfers that will be efficient, dominant strategy incentive compatible and per-period individually rational; note that while this scheme would use the identities of the devices, it will not need to build up a history of interactions. We omit the details of this as it is a straightforward application of the general theory from [10].
Computation of PBE becomes intractable when the number of agents is large. An accurate approximation of the Bayesian game in this regime is that of a Mean Field Game (MFG) [12, 13, 14]. In MFG, the agents assume that each opponent would play an action drawn independently from a static distribution over its action space. The agent chooses an action that is the best response against actions drawn in this fashion. The system is said to be at Mean Field Equilibrium (MFE) if this best response action is itself a sample drawn from the assumed distribution, i.e., the assumed distribution and the best response action are consistent with each other [15, 16, 17]. Essentially, this is the canonical problem in game theory of showing the existence of a Nash equilibrium, as it applies to the regime with a large number of agents. We will use this concept in our setting where there are a large number of peer devices with peer churn.
To the best of our knowledge, there is no prior work that considers mechanism design for multilateral repeated games in the mean field setting. One of the important contributions of this paper is in providing a truth-telling mechanism for a mean field game. In the process of developing the mechanism we will also highlight the nuances to be considered in the mean field setting. In particular, we will see that aligning two concepts of value (from the system perspective and from that of the agents) is crucial to our goal of truth-telling.
Organization and Main Results
We describe our system model in Section II. Our system consists of a large number of clusters, with agents moving between clusters. The lifetime of an agent is geometric; an agent is replaced with a new one when it exits. Each agent receives a random number of B2D chunks by the beginning of each frame, which it then shares using D2D transmissions.
In Section III, we present an MFG approximation of the system, which is accurate when the number of clusters is large. Here, the agents assume that the B2D chunks received and deficits of the other agents would be drawn independently from some distributions in the future, and optimize against that assumption when declaring their states. The objective is to incentivize agents to truthfully report their states (B2D chunks and deficit) such that a schedule of transmissions (called an “allocation”) that minimizes the discounted sum of costs can be used in each frame. The mechanism takes the form of a scheme in which tokens are used to transfer utility between agents. A nuance of this regime is that while the system designer sees each cluster as having a new set of users (with IID states) in each time frame, each user sees the states of all its competitors, but not its own state, as following the mean field distribution. Reconciling the two viewpoints is needed to construct a cost-minimizing pivot mechanism, whose truth-telling nature is shown in Section IV. This is our main contribution in this paper. The allocation itself turns out to be computationally simple, and follows a version of a min-deficit-first policy [4].
Next, in Sections V–VI, we present details on how to prove the existence of the MFE in our setting. Although this proof is quite involved, at a high level it follows the approach of [15, 16]. We then turn to computing the MFE and the value functions needed to determine the transfers in Section VIII. The value iteration needed to choose the allocation is straightforward.
We present details of our Android implementation of a music streaming app used to collect real world traces in Section IX. We discuss the viability of our system in Section X, and illustrate that under the current price of cellular data access, our system provides sufficient incentives to participate. Finally, we conclude in Section XI.
II Content Streaming Model
We consider a large number of D2D clusters, each with a fixed number of agents, and with all clusters interested in the same content stream. We assume that a cluster consists of a fixed number of co-located peer devices. (Our analysis is essentially unchanged when there are a random but finite number of devices in each cluster.) The data source generates the stream in the form of a sequence of blocks. Each block is further divided into chunks for transmission. We use random linear network coding over the chunks of each block, with coefficients in a finite field. We assume that the field size is very large; this assumption can be relaxed without changing our cooperation results. Time is divided into frames, which are further divided into slots. In each time slot, each device can simultaneously receive up to one chunk on each interface.
B2D Interface: Each device has a (lossy) B2D unicast channel to a base station. For each device, we model the number of chunks received using the B2D interface in the previous frame by a random variable with a given (cumulative) distribution, independent of the other devices. The support of this distribution is a set of nonnegative integers. The statistics of this distribution depend on the number of chunks transmitted by the server and the loss probability of the channel. In [4], a method for calculating these statistics based on the desired quality of service is presented. We take the distribution as given.
D2D Interface: Each device has a zero-cost D2D broadcast interface, and only one device can broadcast over the D2D network at any time. For simplicity of exposition, we will assume that the D2D broadcasts are always successful; the more complex algorithm proposed in [4] to account for unreliable D2D is fully consistent with our incentive scheme (we will discuss this at the end of Section VI). Since each D2D broadcast is received by all devices, there is no need to rebroadcast any information. It is then straightforward to verify that the order of D2D transmissions does not impact performance. Thus, we only need to keep track of the number of chunks transmitted over the D2D interfaces during a frame in order to determine the final state of the system.
Allocation: For each block, we keep track of the total number of coded chunks of that block delivered to each device via the B2D network. We call the vector consisting of the number of transmissions made by each device via the D2D interface, during the frame in which that block is shared, the “allocation” pertaining to that block. We also keep track of the number of chunks of the block that each device receives via D2D during that frame. Due to the large field size assumption, if the number of coded chunks that a device receives over the two interfaces equals the number of chunks in the block, then the block can be decoded, and hence can be played out. For simplicity of exposition, we develop our results assuming that the allocation is computed in a centralized fashion in each cluster. However, we actually implement a distributed version on the testbed. (At the end of Section VI we will argue that the distributed implementation is also consistent with our incentive scheme.)
Quality of Experience: Each device has a delivery ratio, which is the minimum acceptable long-run average number of frames that the device must play out. In the mobile agents model, we assume that all devices have the same delivery ratio for simplicity. It is straightforward to extend our results to the case where delivery ratios are drawn from some finite set of values. The device keeps track of the current deficit using a deficit queue. The set of possible deficit values is countable, and all of its elements are nonnegative. In fact, by the well-ordering principle, this set can be written as an increasing sequence (without bound), and we will use this representation to enumerate its elements. If a device fails to decode a particular block, its deficit increases; otherwise it decreases. The impact of deficit on the user’s quality of experience is modeled by a function which is convex, differentiable and monotone increasing. The idea is that user unhappiness increases more with each additional skipped block.
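As a rough illustration of the deficit bookkeeping described above, the following Python sketch tracks a deficit queue over a few frames. The update amounts are an assumption made for this example only: the deficit is taken to grow by the delivery ratio `eta` when a block is missed and to shrink by `1 - eta` when a block is decoded, floored at zero, which is a common form for delivery-ratio deficit queues. The names `update_deficit`, `holding_cost`, `run_frames` and `eta`, and the quadratic cost, are hypothetical.

```python
def update_deficit(deficit, eta, decoded):
    """One-frame deficit update under the ASSUMED rule:
    +eta when the block is missed, -(1 - eta) when it is decoded,
    and never below zero."""
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    step = eta - (1.0 if decoded else 0.0)
    return max(deficit + step, 0.0)


def holding_cost(deficit):
    """Hypothetical convex, increasing holding cost (quadratic)."""
    return deficit ** 2


def run_frames(eta, outcomes, start=0.0):
    """Deficit trajectory for a sequence of decode outcomes (True = decoded)."""
    trajectory = [start]
    for decoded in outcomes:
        trajectory.append(update_deficit(trajectory[-1], eta, decoded))
    return trajectory


if __name__ == "__main__":
    # Two missed blocks followed by one decoded block.
    for d in run_frames(0.9, [False, False, True]):
        print(round(d, 3), round(holding_cost(d), 3))
```

Under this assumed rule, a run of missed blocks raises the deficit linearly while the convex cost grows faster than linearly, which matches the intuition that each additional skipped block hurts more.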
Transfers: We assume the existence of a currency (either internal or a monetary value) that can be used to transfer utility between agents [2, 3]. In our system, a negative transfer is a price paid by the agent, while a positive value indicates that the agent is paid by the system. Such transfer systems are well established; see for instance a review in [2]. Transfers are used by agents either to pay for value received through others’ transmissions, or to be compensated for value added to others by transmitting a chunk. We assume that the transmissions in the system are monitored by a reliable device, which can then report these values to decide on the transfers. In practice we use the device that creates each ad hoc network as the monitor.
An allocation policy maps the values of the B2D chunks received and the deficits, as revealed by the agents, to an allocation for that frame. Given an allocation, agents have no incentive to deviate from it, since an agent that does not transmit the allocated number of chunks would see no benefit; those time slots would have no transmissions by other agents either. The fundamental question is how to incentivize the agents to reveal their states truthfully so that the constructed allocation can maximize system-wide welfare.
III Mean Field Model and Mechanism Design
Our system consists of agents (or users) organized into clusters, with a fixed number of agents per cluster. As mentioned earlier, time is slotted into frames. At the end of a frame, any agent can leave the system, only to be replaced by a new agent (denoted by the same index) whose initial deficit is drawn from a given (cumulative) distribution. This event occurs with a fixed probability, independently for each agent, so that the lifetimes of the agents are geometrically distributed. As described in the previous section, we assume that the number of chunks received via B2D by each agent in each frame is chosen in an i.i.d. fashion according to a given (cumulative) distribution; one such distribution is the binomial distribution. In addition to the agents having geometrically distributed lifetimes, we also allow mobility in our setup. In particular, in every frame we assume that all the agents are randomly permuted and then assigned to clusters such that every cluster again contains the same fixed number of agents. Using this system as a starting point, we will develop our mean field model, which will be applicable when the number of clusters is extremely large.
The mean field framework in Figure 3 illustrates system relationships that will be discussed below. The blue/dark tiles apply to the value determination process for mechanism design, which will be discussed in this section. The beige/light tiles are relevant to showing the existence of an MFE on which the mechanism depends, which will be discussed in Sections V–VI.
The mean field model yields informational and computational savings, since otherwise each agent will need to not only be cognizant of the values and actions of all agents, but also track their mobility patterns. Additionally, the mean field distribution accounts for regenerations, which do not have to be explicitly accounted for when determining best responses.
There is, however, an important nuance that the mean field analysis introduces. When there are a large number of clusters, each cluster sees a different group of agents in every frame, with their states drawn from the mean field distribution. An agent also interacts with a new set of agents in every frame, but its own state is updated based on the allocations made to it. The differing viewpoints of the two entities therefore need to be reconciled while providing any incentives.
The number of chunks received over the B2D interface and the deficit value constitute the state of an agent at the beginning of a frame. In each frame, we collect together the state variables of all the agents in the system. Our mechanism then aims to achieve
(1) 
where is the number of clusters in the system, is the set of agents in cluster at frame , is the allocation in cluster and is the value that agent makes from the allocation in frame . For agent set to be the cluster he belongs in during frame , i.e., . Note that the probability of remaining in the system appears as the discount factor in the above expression.
Given the allocation in each cluster, if agent does not regenerate, then his deficit gets updated as
(2) 
where , whereas if the agent regenerates, then where is drawn i.i.d. with distribution . Here,
(3) 
where is if and only if agent obtains all coded chunks to be able to decode a block, is the number of packets agent can get during a frame under the allocation (where we suppress the dependence of on ). We specialize to the case where the value per frame for agent with system state and vector of allocations is given by if there is no regeneration and otherwise, where is i.i.d. with distribution and is the holding cost function that is assumed to be convex and monotone increasing.
As there are a large number of clusters, in every frame there is a completely different set of agents that appear at any given cluster. The revealed states of these agents will be drawn from the mean field distribution. Hence, from the perspective of a cluster, the revealed states of the agents in that cluster are drawn i.i.d. from the mean field distribution, which consists of a distribution pertaining to the deficit and a distribution pertaining to the B2D transmissions received by that agent. From the perspective of a particular agent, in contrast, it is the revealed states of all the other agents in that cluster that are drawn in this manner. These facts will simplify the allocation problem in each cluster and also allow us to analyze the MFE by tracking a particular agent.
First, we consider the allocation problem as seen by the clusters. Pick any finite number of clusters. In the mean field limit, the agents in each cluster will be different from frame to frame; therefore the allocation decision in each cluster can be made in a distributed manner, independent of the other clusters. This is one of the chaos hypotheses of the mean field model. This then implies that the objective in (1) is achieved by individual optimization in each cluster, i.e.,
(4) 
where we recall that is the revealed state of agents in cluster at time and
(5) 
Under the mean field assumption, the method of determining value does not change from step to step. The value function in the mean field regime is determined by first solving the following Bellman equation
(6) 
to obtain function , where is the dimensional revealed state vector (with elements ) and the future revealed state vector is chosen according to , and thereafter setting for every . This observation then considerably simplifies the allocation in each cluster to be the greedy optimal, i.e., determine (multi)function
(7) 
and for we set .
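The reason the allocation reduces to a greedy one can be seen in a small sketch. If the next revealed state is drawn from a fixed distribution regardless of the current allocation, the discounted continuation term in the Bellman equation is the same for every allocation, so the minimizer of the full recursion coincides with the minimizer of the per-frame cost. The Python toy below is only an illustration under stated assumptions: a scalar state stands in for the cluster's revealed state vector, and `toy_cost`, `rho`, `delta`, `value_iteration` and `closed_form` are hypothetical names, not quantities taken from the paper.

```python
def value_iteration(states, actions, cost, rho, delta, tol=1e-10, max_iter=10_000):
    """Solve V(s) = min_a cost(s, a) + delta * E_rho[V(S')] by value iteration.
    The next state S' is drawn from rho independently of the action (ASSUMED)."""
    values = {s: 0.0 for s in states}
    for _ in range(max_iter):
        # The continuation term does not depend on the action taken now.
        continuation = delta * sum(rho[s] * values[s] for s in states)
        updated = {s: min(cost(s, a) for a in actions) + continuation
                   for s in states}
        gap = max(abs(updated[s] - values[s]) for s in states)
        values = updated
        if gap < tol:
            break
    return values


def closed_form(states, actions, cost, rho, delta):
    """Same value function, obtained from the greedy per-frame minimum."""
    greedy = {s: min(cost(s, a) for a in actions) for s in states}
    mean_greedy = sum(rho[s] * greedy[s] for s in states)
    return {s: greedy[s] + delta * mean_greedy / (1.0 - delta) for s in states}


def toy_cost(state, action):
    """Hypothetical per-frame cost used only for this illustration."""
    return (state - action) ** 2 + 0.5 * action


if __name__ == "__main__":
    states, actions = [0, 1, 2], [0, 1]
    rho = {0: 0.5, 1: 0.3, 2: 0.2}
    print(value_iteration(states, actions, toy_cost, rho, 0.9))
    print(closed_form(states, actions, toy_cost, rho, 0.9))
```

In this toy, the two functions agree up to the stopping tolerance, which is the sense in which value iteration for choosing the allocation is straightforward: only the per-frame minimization has to be carried out.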
Next, we consider the system from the viewpoint of a typical agent ; w.l.o.g let . Any allocation results in the deficit changing according to (2) and the future B2D packets drawn according to whereas the state of every other agent that agent interacts with in the future gets chosen according to the mean field distribution. Then the value function (of the cluster) from the perspective of agent is determined using
(8) 
Here, represents the revealed states of all the agents in cluster except and for the deficit term is determined via (2) (setting ) while the B2D term follows This recursion yields a function which applies to all agents. Using this function, one can also determine the allocation that agent expects his cluster to perform, namely,
(9) 
Using the two allocations and we can write down the value of agent from the system optimal allocation and the value of agent in the allocation that the agent thinks that the system will be performing. For a given allocation function (for the state of agents in the cluster where agent resides at present), we determine the solution to the following recursion
(10) 
to get function , where is an arbitrary state variable, the deficit term of follows (2) while the B2D term is generated independently (setting ), is an arbitrary allocation, the B2D term is generated independently, and is chosen using the meanfield distribution. Notice that would yield the true value of allocation to agent By the cluster optimal allocation (what the cluster actually does), agent gets whereas from the perception of agent he thinks he should be getting (based on what he thinks the cluster should be doing).
Transfer
We will use the different value functions to define the transfer for agent depending on the reported state variable such that the transfer depends on the difference between what he gets from the system optimal allocation and what he expects the system to do from his own perspective. Using this logic we set the transfer for agent as
(11) 
where following the Groves pivot mechanism, can be chosen using the recursion
(12) 
where and is used to denote an allocation in a system in which agent is not present.
The Clarke pivot mechanism idea ensures that the net cost of the agent equals the value of the system as a whole from the viewpoint of that agent, minus a term that does not depend on the agent's own report. As in the Vickrey-Clarke-Groves mechanism, such a formulation of the net cost naturally promotes truth-telling as a dominant strategy at each step.
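The pivot idea can be illustrated with a static, one-shot toy that is much simpler than the dynamic mechanism of this section. Assume, for this example only, that exactly one agent in a cluster must give up the current block, that each agent privately knows the cost it would incur by doing so, and that the mechanism picks the agent with the smallest reported cost. A Groves transfer with a pivot term then pays each agent the cost the others would bear without it, minus the cost the others bear under the chosen allocation. The function names and the cost model below are hypothetical and are not the value functions used in (11) and (12).

```python
def sacrifice_cost(agent, chosen, theta):
    """Cost to `agent` (private type `theta`) when `chosen` gives up the block."""
    return theta if agent == chosen else 0.0


def efficient_allocation(reports, candidates):
    """Pick the candidate with the smallest reported cost (ties: lowest index)."""
    return min(candidates, key=lambda a: (reports[a], a))


def groves_pivot(reports):
    """Return the chosen agent and the transfer paid TO each agent."""
    everyone = list(range(len(reports)))
    chosen = efficient_allocation(reports, everyone)
    transfers = []
    for i in everyone:
        others = [j for j in everyone if j != i]
        # Cost borne by the others under the allocation actually chosen.
        with_i = sum(sacrifice_cost(j, chosen, reports[j]) for j in others)
        # Pivot term: the best the others could do if agent i were absent.
        alt = efficient_allocation(reports, others)
        without_i = sum(sacrifice_cost(j, alt, reports[j]) for j in others)
        transfers.append(without_i - with_i)
    return chosen, transfers


def net_cost(i, true_theta, reports):
    """Agent i's true cost minus the transfer it receives."""
    chosen, transfers = groves_pivot(reports)
    return sacrifice_cost(i, chosen, true_theta) - transfers[i]


if __name__ == "__main__":
    print(groves_pivot([3.0, 1.0, 2.0]))
```

In this toy, the chosen agent is paid the second-lowest reported cost and every other agent receives zero, so transfers are nonnegative and no agent can lower its net cost by misreporting, whatever the others report.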
Allocation Scheme
The basic building block of our mechanism is the per-frame optimal allocations that solve (1). We will now spell out the allocation in greater detail. First, we observe that the allocation problem separates into independent allocation problems in each cluster that have the same basic structure. Therefore, it suffices to discuss the allocation problem for one cluster.
From (7), the objective in this cluster is
(13) 
An optimal allocation is determined using the following observations. First, we partition the agents into two sets: those who cannot decode the block even if they never transmit during the slots of the D2D phase, and the rest; the former agents are made to transmit first. After this, we determine the agents who have extra chunks (the number of slots in which they can transmit such that there is still time to decode the whole block) and make these agents transmit their extra chunks. After all the extra chunks have been transmitted, it is easy to see using the properties of the holding cost function that agents should be made to transmit in a minimum-deficit-first fashion, which protects agents with large deficits. This is summarized in the following lemma.
Lemma 1
The algorithm delineated in Algorithm 1 provides an optimal greedy allocation.
Proof:
Given the B2D arrivals, we partition the set of devices into two sets based on whether or not a device can still collect enough chunks to decode the block. Those agents that satisfy this condition can potentially receive enough chunks during the D2D phase that they can decode the block, whereas the others cannot. Hence, all members of the latter set can potentially transmit their chunks in the allocation solving (13), and we can devote the first slots of the current frame (at most the whole frame) to transmissions from these devices.
Let the number of transmissions made by agent in allocation be denoted by We can write down the constraints that any feasible allocation must satisfy as
(14) 
Observe that each agent can transmit a certain number of chunks without affecting the above constraints (i.e., doing so does not change its chances of being able to decode the block, as there is enough time left for it to receive the chunks that it requires). We call these “extra” chunks. Suppose that all extra chunks have been transmitted and no device has yet reached full rank. At this point, all agents in the system need the same number of chunks, and any agent that transmits a chunk will not be able to receive enough chunks to decode the block. In other words, agents now have to “sacrifice” themselves one at a time, and transmit all their chunks. The question is, in what order should such sacrifices take place?
Compare two agents and with deficits Also, let Now, for either value of
Hence, since is convex and monotone increasing,
(15)  
(16) 
Now, consider the following problem with under the constraint
(17)  
(18) 
Then, from the above discussion, the solution is for the agent with the smaller deficit to make the sacrifice. Thus, comparing (17) and (13), the final stage of the allocation should be for agents to sacrifice themselves according to a min-deficit-first type policy. Algorithm 1 describes the final allocation rule.
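The pairwise comparison at the heart of this argument can be checked numerically. The Python sketch below is a one-step illustration under assumptions that are ours alone: the deficit of an agent that misses the block grows by `eta`, the deficit of an agent that decodes it shrinks by `1 - eta` (floored at zero), and the cost is a convex increasing function of the next deficit. It is not a reproduction of Algorithm 1, and `next_deficit`, `pair_cost` and `min_deficit_first` are hypothetical names.

```python
def next_deficit(deficit, eta, decoded):
    """ASSUMED deficit update: +eta on a miss, -(1 - eta) on a decode, floor 0."""
    return max(deficit + eta - (1.0 if decoded else 0.0), 0.0)


def pair_cost(d_sacrificed, d_served, eta, cost):
    """Total next-frame cost when one agent gives up the block
    and the other agent decodes it."""
    return (cost(next_deficit(d_sacrificed, eta, False))
            + cost(next_deficit(d_served, eta, True)))


def min_deficit_first(deficits):
    """Order in which agents give up the block: smallest deficit first."""
    return sorted(range(len(deficits)), key=lambda i: (deficits[i], i))


if __name__ == "__main__":
    quadratic = lambda d: d ** 2
    low, high, eta = 0.2, 1.7, 0.9
    # Sacrificing the low-deficit agent versus the high-deficit agent.
    print(pair_cost(low, high, eta, quadratic))
    print(pair_cost(high, low, eta, quadratic))
    print(min_deficit_first([1.7, 0.2, 0.9]))
```

For a convex increasing cost, the first printed total is never larger than the second, because the cost increase caused by a miss grows with the current deficit; this is the one-step reason for ordering sacrifices by increasing deficit.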
IV Properties of mechanism
IV-A Truth-telling as dominant strategy
Since we consider a mean field setting, we will assume that the deficit of the agent under consideration changes via the allocation, while the deficits of all the other agents are drawn from the given mean field deficit distribution. The B2D values are generated i.i.d. from their given distribution. Based on the system state report in each frame, we assume that the mechanism makes the optimal greedy allocation from (7) and levies transfers from (11) that use the allocations from the agent’s perspective from (9). We can then show that truthfully revealing the state at the beginning of every frame is incentive compatible.
Definition 1
A direct mechanism (or social choice function) is dominant strategy incentive compatible if is a dominant strategy at for each and , where is a decision rule and is a transfer function.
Theorem 2
Our mechanism is dominant strategy incentive compatible.
Proof:
The netcost in frame for agent when reporting versus is given by
(19) 
where the comparison is between the true type and an arbitrary type; the equalities hold true due to the definitions of the value function and the transfer; the last inequality follows because the optimal allocation in the cluster maximizes the system utility from the perspective of the agent. Therefore, in every frame it is best for the agent to report truthfully, and this holds irrespective of the reports of the other agents.
IV-B Nature of transfers
We now determine the nature of the transfers that are required to promote truth-telling. We will show that the transfers constructed in (11) are always nonnegative, i.e., the system needs to pay the agents in order for them to participate. In other words, each agent needs a subsidy to use the system, since it could simply choose not to participate otherwise. Thus, the system is not budget-balanced. We will show in Section X how the savings in B2D usage that result from our system provide the necessary subsidy. Given these transfers, we will also see that our mechanism is individually rational so that users participate in each frame.
Lemma 3
The transfers defined in (11) are always nonnegative.
Proof:
From (11), we have
(20)  
where follows from the definition of allocation and the inequality is true by the monotonicity argument below.
We assume that under both systems (with the allocations and ), the deficits are initialized with the same value. Also note that all the agents follow the same reporting strategy in frame , and hence, and can be compared. Under allocation , agent never transmits and will pick up free chunks from other agents’ transmissions. However, agent may have to transmit under allocation . Thus, we have
(21) 
as is true for every .
Using this we can compare the two deficits by considering the same allocation policy. For , we have
(22) 
(23)  
with for all , which implies that . Since the function in (10) can be obtained by value iteration starting with , then by the definition of value function and the monotonicity of holding cost function in , we have being an increasing function in . Then it directly follows that
(24) 
which completes our proof.
The proof of individual rationality follows along the same lines as Lemma 3.
Lemma 4
Our mechanism is individually rational, i.e., the voluntary participation constraint is satisfied.
Proof:
We remark that not participating in a frame is equivalent to free-riding, and our transfers ensure that a lower cost is obtained when participating. However, as the net payment to the users is nonnegative, we will not immediately have budget balance. (Footnote 5: While we do not prove it, we expect the transfer to be positive if the agent transmits, but we also note that it need not be zero if the agent does not, owing to the translation of viewpoints mentioned earlier.) For the broader class of Bayes-Nash incentive-compatible mechanisms, [18] shows that the budget can be balanced ex-interim only under the assumption of “independent types” (the distribution of each agent’s information is not directly affected by the other agents’ information). However, in our system, each agent’s information has an impact on the other agents’ information through the allocation. Nevertheless, using the same technique of an initial sum being placed in escrow with the expectation that it would be returned at each stage (i.e., interim), our system may be budget-balanced. Details using current prices of B2D service are provided in Section X.
IV-C Value functions and optimal strategies
We will now show that the value function given by the solution to (6) is well-defined and can be obtained using value iteration. Similarly, we will show that both the value function and the optimal allocation policy from an agent’s perspective, given by (8) and (9) respectively, exist and can also be determined via value iteration.
Theorem 5
The following hold:
1) There exists a unique such that , and given for every , we have ;
2) There exists a unique such that , and given for every , we have ; and
3) The Markov policy obtained from (9) is an optimal policy to be used in cluster from the viewpoint of agent .
Proof:
First, we consider statement 1). The proof follows by applying Theorem 6.10.4 in Puterman [19], and verifying the Assumptions 6.10.1, 6.10.2 and Propositions 6.10.1, 6.10.3.
Define the set of functions
(28) 
where . Note that is a Banach space with norm,
(29) 
Also define the operator as
(30) 
where .
First, we need to show that for , . From Equation (30) and the definition of value functions, we know that the sum of all users’ values is bounded, say . Then we have
(31) 
where the right-hand side is bounded by the sum of and some multiple of . Hence, .
Next, we need to verify Assumptions 6.10.1 and 6.10.2 in Puterman [19]. Our theorem requires the verification of the following three conditions. Let be the random variable denoting the current system state at frame , where . Then we must show that , for some constants , and ,
(32) 
(33) 
(34) 
(32) holds from the definition of .
(33) holds true since
(35)  
since, in our mean-field model, are all drawn i.i.d. from the given distribution with pertaining to the deficit, and pertaining to the B2D transmissions received by that agent; this gives the first inequality in (35).
Finally, we have (34) since,
(36)  
The first equality holds by the definition of , and the first inequality holds because, in our mean-field model, are all drawn i.i.d. from the given distribution with pertaining to the deficit, and pertaining to the B2D transmissions received by that agent, so it is identical for all .
Since we have verified all the three conditions required by Theorem 6.10.4 in Puterman, Statement 1) holds true.
For statement 2), the same argument as in the proof above shows the existence of a fixed point; we omit the details here. The last part of Theorem 5 follows from the discussion preceding the statement of the theorem.
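As a small illustration of the fixed-point argument, the following sketch runs value iteration on a toy discounted cost-minimization problem. It is an assumption-laden stand-in for (6): the state space is a finite set of three deficit levels (so the plain sup norm suffices, rather than the weighted norm of (29)), there are two hypothetical actions (idle, transmit), and the costs, transition probabilities, and discount factor `BETA` are invented for the example.

```python
N_STATES, BETA = 3, 0.9

# Hypothetical per-frame cost: holding cost equal to the deficit level,
# plus 0.5 when the agent transmits (action 1).
COST = [[float(s), s + 0.5] for s in range(N_STATES)]

def kernel(s, a):
    """Assumed transition row: idling lets the deficit drift up,
    transmitting tends to push it down."""
    row = [0.0] * N_STATES
    if a == 0:
        row[min(s + 1, N_STATES - 1)] += 0.5
        row[s] += 0.5
    else:
        row[max(s - 1, 0)] += 0.8
        row[s] += 0.2
    return row

P = [[kernel(s, a) for a in (0, 1)] for s in range(N_STATES)]

def bellman(v, cost, P, beta):
    """One application of the Bellman (cost-minimization) operator."""
    n = len(v)
    return [
        min(cost[s][a] + beta * sum(P[s][a][t] * v[t] for t in range(n))
            for a in range(len(cost[s])))
        for s in range(n)
    ]

def value_iteration(cost, P, beta, v0, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman operator from v0 until successive iterates agree."""
    v = list(v0)
    for _ in range(max_iter):
        w = bellman(v, cost, P, beta)
        if max(abs(x - y) for x, y in zip(v, w)) < tol:
            return w
        v = w
    raise RuntimeError("value iteration did not converge")

if __name__ == "__main__":
    print(value_iteration(COST, P, BETA, [0.0] * N_STATES))
    print(value_iteration(COST, P, BETA, [100.0, -50.0, 7.0]))
```

Both starting points reach the same limit up to the tolerance, which is what uniqueness of the fixed point predicts; in this finite toy the operator is a `BETA`-contraction in the sup norm.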
V Mean Field Equilibrium
In the mean-field setting, assuming the state of every other agent is drawn i.i.d. with distribution , the deficit of any given agent evolves as a Markov chain. We start by showing that this Markov chain has a stationary distribution. If this stationary distribution is the same as , then the distribution is defined as a mean-field equilibrium (MFE); we use the Schauder fixed point theorem to show the existence of a fixed point . Using the regenerative representation of the stationary distribution of deficits given and a strong coupling result, we prove that the mapping that takes to the stationary distribution of deficits is continuous. Finally, we show that the set of probability measures to be considered is convex and compact, so that existence follows.
V-A Stationary distribution of deficits
Fix a typical agent and consider the state process . This is a Markov process in the mean-field setting: if there is no regeneration, then the deficit changes as per the allocation and the number of B2D packets received; otherwise, it is chosen via the regeneration distribution. The allocation is a function of the past , the number of B2D packets received and the state of the other agents. The number of B2D packets received and the state of the other agents are chosen in every frame. This Markov process has an invariant transition kernel. We construct it by first presenting the form given the past state and the allocations, namely,
(37)  
where is a Borel set and is the density function of the regeneration process for the deficit. In the above expression, the first term corresponds to the event that agent can either decode the packet using D2D transmissions or not, and the second term captures the event that the agent regenerates after frame . Using (37) we can define the one-step transition kernel for the Markov process as
(38) 
For later use we also define the transition kernel without regeneration but one obtained by averaging the states of the other users while retaining the state of user , i.e.,
(39) 
The fold iteration of this transition kernel is denoted by .
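To make the role of regeneration concrete, here is a minimal sketch of a kernel with the same two-part structure as (37), under loud assumptions: the deficit takes values in a small finite set, the no-regeneration dynamics are a simple up/down walk rather than the allocation-driven update, and the regeneration probability `DELTA`, regeneration distribution `PSI`, and drift `P_UP` are all hypothetical. With probability `DELTA` the state is redrawn from `PSI` regardless of the past, which is what forces a unique stationary distribution here.

```python
N, DELTA, P_UP = 5, 0.1, 0.4
PSI = [1.0 / N] * N  # hypothetical regeneration distribution (uniform)

def transition_matrix(n_states, delta, psi, p_up):
    """Mixture kernel: (1 - delta) * no-regeneration walk + delta * psi."""
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        up, down = min(s + 1, n_states - 1), max(s - 1, 0)
        P[s][up] += (1 - delta) * p_up
        P[s][down] += (1 - delta) * (1 - p_up)
        for t in range(n_states):
            P[s][t] += delta * psi[t]
    return P

def step(pi, P):
    """Push a distribution over deficits one frame forward: pi -> pi P."""
    n = len(pi)
    return [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]

def stationary(P, pi0, tol=1e-12, max_iter=100_000):
    """Iterate the kernel from pi0 until the distribution stops changing."""
    pi = list(pi0)
    for _ in range(max_iter):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("iteration did not converge")

KERNEL = transition_matrix(N, DELTA, PSI, P_UP)

if __name__ == "__main__":
    low_start = [1.0] + [0.0] * (N - 1)
    high_start = [0.0] * (N - 1) + [1.0]
    print(stationary(KERNEL, low_start))
    print(stationary(KERNEL, high_start))
```

Starting from the lowest or the highest deficit gives the same limiting distribution in this toy, since the regeneration term makes every row of the kernel dominate `DELTA * PSI`.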
Lemma 6
The Markov chain where the allocation is determined using (7) based on choosing the states of all users other than i.i.d. with distribution and the number of B2D packets of user independently with distribution , and the transition probabilities in (37) is positive Harris recurrent and has a unique stationary distribution. We denote the unique stationary distribution for the deficit of a typical agent by ; the dependence on is suppressed. The expression of this stationary distribution in term of