Device-to-device (D2D) communication is expected to increase the spectral efficiency of the fifth generation (5G) network. D2D users within a certain range of each other can communicate directly using the resources of the cellular users (CUs) in a Long Term Evolution (LTE) network. In such a network, the CUs are treated as primary users while the D2D users are treated as secondary users. To ensure that the quality of service (QoS) of the CUs is not affected, efficient resource allocation algorithms need to be devised for the D2D users. Most of the previous research on D2D resource allocation assumes knowledge of all the channel gains. However, the channel gain between a D2D transmitter and its receiver and the channel gain between a CU and a D2D receiver are difficult to acquire because the number of such gains is large: it equals the number of CUs times the number of D2D receivers, plus the number of D2D pairs. Conveying these to the base station (BS) requires extra power and control overhead. We consider a realistic situation in which these gains are unknown at the BS, which is referred to as partial channel state information (CSI). We solve the power and resource allocation problem for the D2D users with partial CSI by formulating it as a multi-player multi-armed bandit (MP-MAB) problem.
Most applications of MP-MAB in wireless communications are in the field of cognitive radio networks (CRNs). In a CRN, the mean reward of a channel (arm) does not vary from one secondary user to another, so the optimal channel is the same for all of them. If a perfect collision model is assumed, then colliding users obtain a reward of zero. Since UCB1 determines the single best arm among all the arms, applying it directly to this problem causes the secondary users to collide with each other as they contend for the same optimal arm. This has motivated researchers to propose variants of UCB1 such as the distributed learning algorithm with prioritization (DLP), the distributed learning algorithm with fairness (DLF) and th-UCB1. In these algorithms, the secondary users are assigned different ranks, and in the long run each user selects the arm whose mean-reward rank matches its own rank rather than the arm that gives it the highest mean reward. Thus, the chance that secondary users collide with each other decreases. DLP is extended to DLF to ensure fairness among the secondary users: the rank of each user is changed in a round robin (RR) manner such that the users have unique ranks in each time slot.
The D2D resource allocation problem is significantly different from the channel selection problem considered in a CRN. In this paper, we consider multiple D2D users and CUs in a macro cell. We utilize the MP-MAB framework to solve the problem of power and resource allocation for the D2D users. We assume that a CU is allocated only one resource block and that a D2D user can be allocated the resource block of only one CU. Our objective is to maximize the expected value of the cumulative sum throughput of the D2D users up to a time horizon. For power to be allocated to a D2D user, the signal-to-noise ratio (SNR) of the CU whose resource block is allocated to the D2D user should be greater than a certain threshold. Only then can power be allocated to the D2D user while ensuring that the CU's signal-to-interference-plus-noise ratio (SINR) meets this minimum threshold. Thus, we guarantee a minimum SINR to the CUs whose resource blocks are allocated to the D2D users. We model the instantaneous reward received by a D2D user when it selects a CU's resource block as its throughput, normalized to lie in [0, 1]. Therefore, the mean reward of an arm differs from one D2D user to another. If there are more CUs than D2D users, the chance that each D2D user's optimal arm is a different CU increases. Hence, it is appropriate to apply the UCB1 algorithm to this problem.
I-B Related Work
In , the authors have proposed two index based distributed learning algorithms and for channel selection in a CRN. is based on the -greedy algorithm of , where the secondary users are ranked a priori. The algorithm is based on adaptive randomization, in which every user randomizes its channel selection if there was a collision in the previous time slot. In , another index based algorithm called dUCB4 is investigated, which achieves order log-squared regret growth with the time horizon. However, the players are allowed to communicate with each other, which increases the communication overhead. In , the authors present a generalization of UCB1 called selective learning of the largest expected rewards (SL(K)), where the secondary user learns to select the arm which gives the largest expected reward among all the arms. Moreover, they extend this algorithm to multiple users, in which the users are ranked, and propose a policy called DLP. To ensure fairness among the users, they further propose DLF, which rotates the rank among the users. They have proved that DLF is order-optimal.
Another order-optimal policy, known as time division fair sharing (TDFS), is considered in  and guarantees fairness among the secondary users. The users have different offsets in their time division selection schedules to avoid collisions during channel selection. However, it is computationally complex. In , an algorithm based on deterministic sequencing of exploration and exploitation (DSEE) is developed, in which time is divided into exploration and exploitation sequences. The main design criterion is to determine the cardinality of the exploration sequence. However, knowledge of a lower bound on the difference in the mean rewards of the best and the second best arms is required, which is difficult to obtain a priori. In , a secondary user is assigned a rank a priori and an algorithm called th-UCB1 is proposed.
The adversarial MAB problem is different from the stochastic MAB problem in that there is no assumption on the distribution of the reward process of each arm. When the rewards of all the arms can be observed by a player (full information game), the Hedge algorithm 
can be used. A player chooses an action according to a probability distribution over the arms. Hedge is a modification of the weighted majority algorithm, and the Exponential-weight algorithm for Exploration and Exploitation (Exp3)  is in turn a modification of the Hedge algorithm. These are weighted average prediction (WAP) algorithms: each arm is assigned a weight, from which a probability distribution over the arms is computed and used to choose an arm.
Though MAB has been applied to CRNs, its application to other wireless communication problems is limited. We next discuss a few works pertaining to the application of MAB to 4G/5G networks. In , the authors have used the Exp3.M algorithm to address the problem of dynamically activating or deactivating small cells in a macrocell by modeling it within the framework of a combinatorial MP-MAB. In , the relay selection problem is modeled as a stochastic covariate MAB where side information is available to the players before each trial. In , the authors have considered the problem of selecting sub-bands for each picocell in an LTE network and have solved it using the UCB1 algorithm. To allocate the resources of the sub-band chosen by a picocell to its users, each picocell utilizes the proportional fair scheduling algorithm. Among the recent works in the field of D2D communications,  addresses the problem of selecting the CU transmission mode or the D2D transmission mode by modeling it as a two-armed Levy-bandit game. The authors have considered the number of interferers to be random and have studied two cases of the game model: multiple independent players and multiple cooperative players. The optimal strategy in both cases is a cut-off strategy for all the users. In , the authors have considered resource allocation for the D2D users by formulating it within the framework of an MP-MAB game with side information. They have combined no-regret learning with calibrated forecasting and proved that the empirical joint frequencies of the game converge to a set of correlated equilibria. However, even this algorithm is computationally expensive.
The main contributions of this paper are as follows.
Ours is the first work which models the D2D power and resource allocation problem within an MP-MAB framework. Our proposed algorithms ensure that the D2D users reuse the resources of the CUs efficiently without hindering CU communications. Our work differs from  as the authors have not considered the reuse of CUs’ resources. They assume that the D2D users use only the vacant channels of the CUs.
We propose a power allocation algorithm for the D2D users that ensures a minimum quality of service (QoS) to a CU so that its communications are not hampered due to interference from a D2D user.
Our work differs from  as follows. In , the UCB1 algorithm is used but when multiple picocells select the same sub-band and the resources of this sub-band are allocated to users, those users which are allocated the same resource block suffer from inter-picocell interference when they transmit. However, in our case we employ the perfect collision model of  in which if multiple D2D users select the same CU’s resource block, i.e. collisions occur, then the D2D users do not transmit. This ensures that there is no inter-D2D interference. Moreover, the CU also does not get affected due to interference from multiple D2D users.
We apply DLF and th-UCB1, which we extend to multiple D2D users and modify to ensure fairness as in . Since the D2D resource allocation problem can also be solved within the framework of an adversarial bandit problem, we also apply the Exp3 algorithm for allocating resources to the D2D users.
This paper is organized as follows. In Section II, we discuss the system model of an underlay D2D network and the problem formulation in an MP-MAB setting. In Section III, we propose two power and resource allocation algorithms for a single D2D user based on UCB1 and Exp3. In Section IV, we propose four power and resource allocation algorithms for multiple D2D users by extending UCB1 and Exp3 to multiple users and applying DLF and th-UCB1 to the multi-player setting. We illustrate the performance of our proposed algorithms through simulation results in Section V and conclude the paper in Section VI.
II System Model and Problem Formulation
II-A Network Model
We consider CUs and D2D users in a macrocell, comprising the sets and respectively. We assume that the CUs are allocated uplink resource blocks by the base station (BS) in every subframe. In an LTE network, time is divided into subframes, which we denote by . We consider path loss, shadowing and fast fading. The channel gain between a CU and the BS is represented by , while that between a D2D user's transmitter and receiver by . The channel gain of the interfering link between a CU and a D2D user 's receiver is given by , and that between the D2D user's transmitter and the BS by , as shown in Fig. 1. In an LTE network, a user equipment (UE) sends its channel quality information (CQI) to the BS either periodically or aperiodically. Thus, the channel gains and are known at the BS. However, it is difficult to obtain the channel gains and . A CU's transmit power is assumed to be constant. We assume that each CU is allocated one resource block in each subframe and each D2D player can be allocated only one resource block.
II-B MP-MAB Framework
We formulate the D2D resource allocation problem within the framework of an MP-MAB problem where the players are the D2D users. The arms of the bandit are the available CUs. The action of a D2D player refers to the selection of a CU. When a D2D player selects an arm (a CU ), it obtains a random reward , with mean , from an unknown stationary distribution of the reward process of the arm. We model the reward of a D2D player when it selects a CU to be its throughput normalized to lie in [0, 1]. The normalized throughput is denoted by . When collisions occur, we assume that the D2D players which select the same CU receive a reward of zero. The power allocated to each D2D player depends on whether a minimum SINR can be guaranteed to a CU; it therefore depends on the channel gains , and the power transmitted by the CU . Thus, the throughput, and therefore the reward, depends on the channel gains , , , and the power transmitted by the CU. These quantities determine the reward process corresponding to each arm, that is, a CU. Thus, the distribution of the reward process of each arm is different for different D2D players. Since the channel gains are independent and identically distributed (IID) random variables, the reward process of an arm is also IID for each D2D player.
A local policy of a D2D player is a sequence of functions , where maps the player's previously observed rewards and actions to the present action of the player at time instant . It is a decision rule that specifies for each D2D player which action to take at each time instant. A policy is therefore a concatenation of the local policies of all the players .
() The optimal arm for a D2D player is the CU whose expected reward is the highest among all the arms (CUs). Thus, the optimal CU is given by,
For a given policy , we define its regret as the difference between the expected total reward of the D2D players up to time when they select their optimal arms and the expected total reward of the D2D players obtained under policy up to time . It is given by,
where is the mean of the optimal arm for the D2D player .
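In standard notation, this regret takes the usual stochastic MP-MAB form. The symbols below are our reconstruction and are assumptions: $N$ denotes the number of D2D players, $T$ the time horizon, $\mu_j^{*}$ the mean reward of player $j$'s optimal arm, and $r_j^{\pi}(t)$ the reward obtained by player $j$ at time $t$ under policy $\pi$:

```latex
R^{\pi}(T) \;=\; T \sum_{j=1}^{N} \mu_j^{*}
\;-\; \mathbb{E}^{\pi}\!\left[ \sum_{t=1}^{T} \sum_{j=1}^{N} r_j^{\pi}(t) \right]
```

The first term is the reward the players would accumulate by always playing their optimal arms; the second is what the policy actually collects in expectation.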
The performance of a policy is measured by the regret . By observing how the regret grows with time for different policies, we can compare the policies.
If the D2D players are ranked in each subframe, then the definition of regret changes as per .
() If the rank of a D2D player is in a subframe, then let the arm which has the -th largest expected reward be and the mean of this arm be . Then the regret is given by,
is an indicator random variable which is 1 if D2D player is the only player that selects the CU as per the policy and zero otherwise.
For an adversarial bandit, there are no statistical assumptions on the way the rewards are generated . The regret definition differs from Eqns. 2 and 3. The optimal CU's definition also changes as follows.
() For an adversarial bandit framework, the optimal CU for a D2D player is the one which maximizes its return, i.e., the sum of rewards it receives up to time .
() In an adversarial bandit framework, let the actions taken by a D2D player till time be . Then the expectation of the total reward obtained with the policy up to time is,
where is the reward received by the D2D player when it takes an action . Let the expected total reward from the optimal arm till time be,
The regret for a D2D player is given by,
The regret for multiple D2D players is the sum of individual regrets of each D2D player and is given by,
Minimizing the regret of Eqn. 2, 3 or 4 is equivalent to maximizing the expected total reward of the D2D players up to time .
Our objective is to determine a policy which maximizes the expected value of the cumulative sum throughput of the D2D players till time .
We next propose two power and resource allocation policies for a single D2D player and then extend them to multiple D2D players.
III Allocation Policies for a Single D2D Player
The D2D resource allocation problem can be solved within the framework of both a stochastic MAB (UCB1 algorithm ) and an adversarial MAB (Exp3 algorithm ). We propose two power and resource allocation policies based on the UCB1 and Exp3 algorithms. In the UCB1 based algorithm the D2D player selects a CU as per the UCB1 index. It selects that CU in each subframe which has the maximum UCB1 index. In Exp3 each D2D player selects a CU randomly as per a probability distribution over the CUs. We next discuss these two policies.
III-A Power and Resource Allocation Based on UCB1
III-A1 Initialization Phase
The UCB1 algorithm starts with an initialization phase (refer Algorithm 1) in which a D2D player samples all the arms (the CUs) sequentially in the first subframes. In every subframe , it conveys its selection of a CU to the BS. The BS then allocates power to it as follows.
III-A2 Power Allocation
When a D2D user selects a CU 's resource block, the SINR of the CU must be greater than or equal to the SINR threshold ,
where is the average noise power at the BS. Then,
So, the maximum power that can be allocated to the D2D player is,
This means that is equal to . Note that this also implies that the SNR of a CU (without interference from a D2D player) should be greater than ; only then can power be allocated to the D2D player. This condition is checked by the BS. If the SNR is less than , then becomes negative, and if it is equal to , then becomes zero in order to ensure . In both these cases, power cannot be allocated to the D2D player. If exceeds the maximum allowed transmission power of a UE , then is limited to . This means that calculated from Eq. 7 can be large, but in practice a smaller value of has to be allocated, which causes to become more than (refer Algorithm 2).
The throughput observed by a D2D player is given by,
where is the transmission bandwidth and is the average noise power at the D2D receiver.
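The power check and the resulting D2D throughput described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function and variable names (`allocate_power`, `d2d_throughput`, the linear-gain arguments) are our own, and only the comments tie them back to the quantities in the text.

```python
import math

def allocate_power(p_c, g_cB, g_dB, gamma_min, noise_bs, p_max):
    """Largest D2D power keeping the CU's SINR at its threshold.

    From  p_c * g_cB / (noise_bs + p_d * g_dB) >= gamma_min  we get
    p_d <= (p_c * g_cB / gamma_min - noise_bs) / g_dB.  A non-positive
    value means the CU's SNR is already at or below the threshold, so
    the D2D player gets no power; otherwise cap at the UE maximum.
    """
    p_d = (p_c * g_cB / gamma_min - noise_bs) / g_dB
    if p_d <= 0.0:
        return 0.0
    return min(p_d, p_max)

def d2d_throughput(p_d, g_dd, p_c, g_cd, bw, noise_rx):
    """Shannon rate of the D2D link, interfered by the co-channel CU."""
    return bw * math.log2(1.0 + p_d * g_dd / (noise_rx + p_c * g_cd))
```

The `min(..., p_max)` cap reflects the case in the text where the computed power exceeds the UE's maximum allowed transmission power, which in turn pushes the CU's SINR above the threshold.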
III-A3 Resource Allocation
We model the reward to be the normalized throughput which the D2D player receives when it selects CU . is the throughput normalized to lie between 0 and 1. The reward is therefore a random variable drawn from the distribution of the reward process of an arm .
After the initialization phase of subframes, the D2D player has stored one sample of the reward from each of the arms. Thereafter, in every subframe , the D2D player averages its previous and present rewards accrued whenever it selects a CU till subframe . This is the empirical mean reward, which is calculated as per the following Monte Carlo equation,
where is the number of times that it selects a CU till the previous subframe . The UCB1 index is a function of the previous empirical mean reward and corresponding to a CU . A D2D player calculates the UCB1 index of each CU which is given by,
The policy chooses, in each subframe, the CU whose UCB1 index is the maximum among the UCB1 indices of all the CUs. This power and resource allocation algorithm is order-optimal as it achieves logarithmic regret , .
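As a sketch (the function and variable names here are ours, not the paper's), the incremental mean update and the UCB1 selection rule described above look like:

```python
import math

def update_mean(mean_k, count_k, reward):
    """Incremental form of the running average of rewards for one arm."""
    count_k += 1
    mean_k += (reward - mean_k) / count_k
    return mean_k, count_k

def ucb1_select(means, counts, t):
    """Return the arm maximizing  mean_k + sqrt(2 ln t / count_k)."""
    def index(k):
        return means[k] + math.sqrt(2.0 * math.log(t) / counts[k])
    return max(range(len(means)), key=index)
```

The square-root term is the confidence bonus of UCB1: an arm sampled few times keeps a large bonus, so the player keeps exploring it until its empirical mean is trusted.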
III-B Power and Resource Allocation Based on Exp3
We now solve the D2D power and resource allocation problem for a single D2D player by applying the Exp3 algorithm (refer Algorithm 3). The arms of the adversarial bandit are the CUs. We consider the adversary to be the channel, which decides the rewards given to the D2D player when it selects a CU. We model the reward obtained by a D2D player from each arm as a Bernoulli random variable whose distribution is unknown. When the D2D player achieves a throughput greater than or equal to the guaranteed throughput , its reward is one; else it is zero. If the reward is one most of the time when the player selects a particular CU as compared to the other CUs, then this choice of CU is better than the others, because on average the player gets a better throughput with this CU. Maximizing this reward is equivalent to maximizing the expected cumulative reward of the D2D player till a certain time.
The Exp3 algorithm selects a CU as per a probability density function (PDF) over the CUs. This PDF is computed by the D2D player by assigning a weight to each CU . The weights of all the CUs are initialized to one, that is, . The PDF is computed from the weights of the CUs as,
where . The factor ensures that the D2D player explores all the arms often enough.
This guarantees that there are sufficient reward samples from each arm to determine . The parameter decides the trade-off between exploration and exploitation. The D2D player selects a CU according to the PDF and conveys its choice to the BS. The BS checks whether an SINR of can be guaranteed to the CU. If so, the BS allocates a power of as per the power allocation algorithm discussed for the UCB1 based policy (refer Algorithm 2). The D2D player then transmits using CU 's resource block and obtains a throughput of , which is given by Eqn. 8. The reward received by the D2D player is one if its throughput is greater than or equal to the target rate , else it is zero. The D2D player then updates the weight assigned to this CU to by multiplying it with an exponential factor which is a function of the reward and the probability of selecting this CU,
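The mixing and reweighting steps above can be sketched as below. This is a generic Exp3 sketch under our own naming (`exp3_probs`, `exp3_round`, a `reward_fn` standing in for the throughput-threshold reward), not the paper's code; the exact exponent the paper uses is elided in the text, so the standard importance-weighted Exp3 update is shown.

```python
import math
import random

def exp3_probs(weights, gamma):
    """Mix the weight-proportional distribution with the uniform one."""
    total = sum(weights)
    k_arms = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k_arms for w in weights]

def exp3_round(weights, gamma, reward_fn):
    """One round: draw a CU from the PDF, observe a 0/1 reward, reweight."""
    probs = exp3_probs(weights, gamma)
    arm = random.choices(range(len(weights)), weights=probs)[0]
    reward = reward_fn(arm)
    # Importance-weighted reward estimate keeps the update unbiased even
    # though only the chosen arm's reward is observed.
    est = reward / probs[arm]
    weights[arm] *= math.exp(gamma * est / len(weights))
    return arm, reward
```

Dividing the observed reward by the selection probability compensates rarely-picked arms, which is what lets Exp3 bound regret without any distributional assumption on the rewards.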
IV Allocation Policies for Multiple D2D Players
We next address the D2D resource allocation problem for multiple D2D players. First, we discuss the index based algorithms. We extend UCB1 from a single player to a multi-player setting and refer to it as multi-player UCB1 (MP-UCB1). Next, we employ the DLF and th-UCB1 algorithms for resource allocation to multiple D2D players. We extend the th-UCB1 algorithm for a single player given in  to multiple players and refer to it as fair th-UCB1. We also extend the Exp3 algorithm from a single player to a multi-player setting and solve the D2D resource allocation problem within an adversarial bandit framework in which the channel and the other players together act as the adversary.
IV-A Index Based Allocation Policies
We initialize MP-UCB1 and th-UCB1 for multiple players in the same way as DLF. Thus, the initialization phase is the same for all three policies (refer Algorithm 4), and so is the power allocation algorithm. They differ in the way the CUs' resources are allocated to the D2D players over the subframes, and in that the players are ranked in the DLF and fair th-UCB1 algorithms, unlike in MP-UCB1.
IV-A1 Initialization Phase
In the first subframes, every D2D player selects the CUs sequentially such that each samples one value of the reward from each of the CUs. In a subframe , the D2D players choose the CUs in an RR order so that they do not collide with each other, . After this initialization phase, each of them conveys its selection of a CU to the BS. The BS then allocates power to them.
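One collision-free way to realize this round-robin sampling is the offset rule below. The specific rule (shift each player's choice by its rank) is our assumption for illustration; the text only requires that the players' choices be distinct in every initialization subframe while each player visits every CU once.

```python
def rr_init_arm(rank, subframe, num_cus):
    """CU sampled by the player with the given (unique) rank in an
    initialization subframe.  Shifting by rank keeps all players'
    choices distinct per subframe, and over num_cus subframes each
    player cycles through every CU exactly once."""
    return (rank + subframe) % num_cus
```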
IV-A2 Power Allocation
The power allocation algorithm is the same as in the single player setting. If a D2D player selects a CU , it is allocated a power of as per Eqn. 7 if a minimum SINR can be guaranteed to the CU. If exceeds the maximum allowed transmission power of a UE , then it is limited to (refer Algorithm 2). A D2D player transmits at a rate of ; its normalized rate is its reward. If multiple D2D players collide, they get a reward of zero (perfect collision model ). This implies that the transmit powers of the D2D players that collide are set to zero.
IV-A3 Resource Allocation
After subframes, each D2D player chooses its CU as per the index of the MP-UCB1, DLF or fair th-UCB1 algorithm. Every time a D2D player chooses a CU , it averages the rewards obtained with this CU till subframe as per Eqn. 9 and computes the empirical mean reward . It also updates , the number of times that it has selected the CU till subframe . The index of each algorithm depends on these two quantities, based on which a D2D player selects its CU.
1. MP-UCB1 Algorithm: Each D2D player runs UCB1 at its end. A D2D player selects that CU which has the maximum UCB1 index .
2. DLF Algorithm: Unlike MP-UCB1, in this algorithm a D2D player is assigned a rank in subframe that changes over subframes in a RR order. In a subframe , every D2D player is assigned a unique rank. A D2D player ranked in subframe tries to select a CU that gives it the largest mean reward. It sorts the UCB1 indices of all the CUs and selects the first CUs with the largest UCB1 indices to form a set . It selects a CU from which minimizes the following index,
3. Fair th-UCB1 Algorithm: In this algorithm also, each D2D player is ranked, and its rank changes as per RR in every subframe. A D2D player that is ranked in subframe chooses the CUs with the largest UCB1 indices to form a set . It then selects the CU in the set which has the minimum UCB1 index. With a probability of it chooses this CU; else it selects a CU uniformly at random from with a probability of , where . If is a constant, the policy incurs linear regret . Therefore, is set such that it decreases with time , which ensures logarithmic regret.
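The rank-based selection shared by DLF and fair th-UCB1 can be sketched as below. This is a simplified illustration with hypothetical names: DLF's actual tie-breaking metric (Eqn. 13) is elided in the text, so here the player simply targets the rank-th largest UCB1 index, and the exploration step mirrors the fair th-UCB1 description.

```python
import math
import random

def ucb_indices(means, counts, t):
    """UCB1 index of every arm."""
    return [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]

def select_by_rank(means, counts, t, rank):
    """Top-`rank` arms by UCB1 index; the target arm is the one with the
    smallest index in that set, i.e. the rank-th largest overall."""
    idx = ucb_indices(means, counts, t)
    top = sorted(range(len(means)), key=lambda k: idx[k], reverse=True)[:rank]
    return top[-1], top

def fair_kth_ucb1(means, counts, t, rank, eps):
    """Exploit the rank-th best arm w.p. 1-eps, else explore uniformly
    among the top-rank set; eps must decay with t for logarithmic regret."""
    arm, top = select_by_rank(means, counts, t, rank)
    if random.random() < 1.0 - eps:
        return arm
    return random.choice(top)
```

Because the ranks are unique in each subframe and rotate round-robin, differently ranked players target differently ranked arms, which is what reduces collisions relative to every player chasing the single best arm.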
IV-B Adversarial Bandit Based Allocation Policy
We now extend the Exp3 based policy discussed for a single player to multiple players (refer Algorithm 3). A D2D player selects a CU as per the PDF over the CUs, as discussed for the single player policy. In a multi-player setting, not only the channel but also the other players decide the rewards that a D2D player obtains. When the D2D players convey their choices of CUs to the BS, the BS allocates power as per the power allocation algorithm discussed for the single player policy (refer Algorithm 2). First, it checks for collisions among the players; if they collide, their allocated power is zero. If a D2D player suffers no collision, the BS checks whether the CU it has selected can be guaranteed an SINR of . If so, the D2D player is allocated power; if not, it is not allocated any power and its transmission rate becomes zero. If the allocated power is greater than , then . If is greater than , the D2D player gets a reward of one; else it gets a reward of zero.
V Simulation Results
We consider a macrocell of radius 250 m with = 20 CUs and = 5 D2D players uniformly distributed in it. The D2D receiver is distributed uniformly around its D2D transmitter within a range of 50 m. As per LTE specifications , the path loss model is . We model shadow fading with a lognormal random variable whose standard deviation is 8 dB. We assume that the channel is a fast fading channel and model its gains by exponentially distributed random variables of mean 1. The transmit power of a CU is set to 250 mW. The UE and the BS noise figures are 9 dB and 5 dB respectively. The thermal noise density is set to -174 dBm/Hz. The SINR threshold for the CUs is 10 dB. The total number of subframes is , each of duration 1 ms. The transmission bandwidth is 180 kHz. For the th-UCB1 based policy we set . For the Exp3 based policy, we set and kbps. Every plot is obtained after 50 Monte Carlo (MC) simulations of the policy with a fixed topology and 10 MC simulations over topologies. We next compare the performances of our proposed algorithms.
Regret: For the MP-UCB1, DLF and th-UCB1 based policies, the distribution and the mean of the reward process of each arm are unknown. For the Exp3 based policy, however, we model the reward such that the distribution of each arm is Bernoulli with an unknown mean. The mean reward of an arm differs between the two cases because the reward models differ.
Thus, we cannot compare the regret of the Exp3 based policy with the regret of the index based policies even though the channel parameters are the same.
For the index based allocation policies too, we cannot compare the regret of the MP-UCB1 based policy with those of the DLF and th-UCB1 based policies because the regret definitions differ. For a single D2D player, the regret plots of our proposed policies with UCB1 and Exp3 are shown in Figs. 2 and 3, while for multiple D2D players, Fig. 4 shows the regret of the index based policies and Fig. 5 that of the Exp3 based policy. The regret of the th-UCB1 based policy is more than that of DLF because it selects the arm with the largest UCB1 index with probability and explores the arms in with probability in every subframe, whereas the DLF based policy chooses an arm according to the metric of Eqn. 13 in every subframe and does not explore.
Collision Percentage: We observe from the bar graph of Fig. 6 that the D2D players collide the most under the th-UCB1 based policy, followed by DLF. For these policies we had expected that the chance of two differently ranked D2D users selecting the same CU would reduce, thereby decreasing collisions; however, this is not the case. The collision percentage is lower for the MP-UCB1 and Exp3 based policies than for th-UCB1 and DLF.
We define an algorithm to be fair when each D2D player gets an equal opportunity over time of being allocated the CU that gives it the highest expected reward (index based algorithms) or the highest return (Exp3). Fairness is measured as the percentage of the number of times each D2D player is allocated this CU over the subframes.
Fairness: We observe from Fig. 7 that under all the algorithms, every D2D player gets a fair chance to select the arm that gives it the highest expected reward (index based policies) or the highest return (Exp3) over the subframes. Note that for MP-UCB1, the percentage of times each D2D player selects the arm with the highest expected reward is higher than for th-UCB1 and DLF, because th-UCB1 and DLF rank the players while MP-UCB1 does not. Each D2D player in MP-UCB1 (or Exp3) tries to choose the CU that gives it the highest expected reward (or highest return) in every subframe, whereas in th-UCB1 and DLF each D2D player tries to choose this CU only every subframes, when its turn comes.
Sum Throughput: 1) We observe that the sum throughput of the D2D players is low initially as they suffer from more collisions, due to which the power allocated to them becomes zero. 2) The plots of Figs. 8 - 11 show that the sum throughput of the D2D players in the long run is higher for the MP-UCB1 based policy than for the th-UCB1, DLF and Exp3 based policies. For MP-UCB1, the D2D players collide less, as seen from Fig. 6, and the percentage of times that they try to select their optimal CUs (fairness percentage) is also high compared to the other algorithms; thus, their sum throughput is higher. For Exp3, the fairness percentage is low, which means the players are often unable to select their optimal CUs over the subframes; because of this, the sum throughput of the D2D players is lower in the long run. For th-UCB1 and DLF, though the collision percentage is high and the fairness percentage is low, the sum throughput of the D2D players is still high because each D2D player succeeds in choosing the arm which gives it the largest expected reward as per its rank .
However, the sum throughputs of the D2D players for all four algorithms are comparable. 3) We have plotted the sum throughput of the 5 CUs whose resources the D2D players select for reuse out of the 20 CUs in every subframe. We observe from the plots that the sum throughput of the CUs is maintained at a constant rate on average. The fluctuations in the sum throughput of the CUs arise for the following reasons. When a D2D player selects a CU whose SNR is less than , power is not allocated to the player, since otherwise the CU's SINR would drop below ; in this case the CU's throughput is lower than . Moreover, if the D2D player collides with another D2D player, it is again not allocated any power, and the CU's throughput can then take any value. When the D2D player is allocated power, the CU's throughput gets fixed at . When the power allocated to the D2D player is limited to , although more power could have been allocated theoretically, the CU's rate becomes greater than . 4) We observe that the sum throughput of the D2D players is more than that of the CUs. This is because the communication range between a D2D transmitter and its receiver is small compared to that between a CU and the BS.
VI Conclusion
In this work, we have considered the problem of power and resource allocation to the D2D players with partial CSI. We formulate this problem within the framework of MP-MAB. Our proposed policies guarantee QoS to the CUs by using the excess SNR of the CUs above a certain threshold to allocate power to the D2D players. We propose two optimal learning policies based on UCB1 and Exp3 for a single player and then extend them to the multi-player setting. By applying the Exp3 algorithm, we show that the D2D resource allocation problem can also be solved within an adversarial MP-MAB framework. We propose two more policies based on DLF and th-UCB1. We demonstrate through simulation results that our proposed policies are fair and perform well. By comparing them, we find that MP-UCB1 performs better than the others for this problem.
-  Y. Gai and B. Krishnamachari, “Distributed Stochastic Online Learning Policies for Opportunistic Spectrum Access,” IEEE Trans. Sig. Proc., vol. 62, no. 23, pp. 6184-6193, Dec. 2014.
-  Y. Chen, H. Zhou, R. Kong, L. Zhu and H. Mao, “Decentralized Blind Spectrum Selection in Cognitive Radio Networks Considering Handoff Cost,” Future Internet, vol. 9, no. 2: 10, Mar. 2017.
-  A. Anandkumar, N. Michael and A. Tang, “Opportunistic Spectrum Access with Multiple Players: Learning under Competition,” in Proc. INFOCOM, pp. 1-9, 2010. doi: 10.1109/INFCOM.2010.5462144
-  P. Auer, N. Cesa-Bianchi and P. Fischer, “Finite-time Analysis of the Multiarmed Bandit Problem,” Mach. Learn., vol. 47, no. 2, pp. 235-256, May 2002.
-  D. Kalathil, N. Nayyar, and R. Jain, “Decentralized Learning for Multiplayer Multiarmed Bandits,” IEEE Trans. Inf. Theory, vol. 60, no. 4, pp. 2331-2345, Apr. 2014.
-  K. Liu and Q. Zhao, “Distributed Learning in Cognitive Radio Networks: Multi-Armed Bandit with Distributed Multiple Players,” in Proc. ICASSP, pp. 3010-3013, 2010. doi: 10.1109/ICASSP.2010.5496131
-  S. Vakili, K. Liu and Q. Zhao, “Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems,” IEEE J. Sel. Topics Sig. Proc., vol. 7, no. 5, pp. 759-767, Oct. 2013.
-  P. Auer, N. Cesa-Bianchi, Y. Freund and R. E. Schapire, “Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit Problem,” The ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2 Technical Report Series, NC2-TR-1998-025, Aug. 17, 1998. [Online]. Available: http://www.dklevine.com/archive/refs4462.pdf.
-  N. Littlestone and M. K. Warmuth, “The Weighted Majority Algorithm,” Inf. Comput., vol. 108, no. 2, pp. 212-261, Feb. 1994.
-  P. Auer, N. Cesa-Bianchi, Y. Freund and R. E. Schapire, “The Nonstochastic Multiarmed Bandit Problem,” SIAM J. Comput., vol. 32, no. 1, pp. 48-77, 2002. https://doi.org/10.1137/S0097539701398375
-  S. Maghsudi and E. Hossain, “Multi-armed Bandits with Application to 5G Small Cells,” IEEE Wireless Commun., vol. 23, no. 3, pp. 64-73, June 2016.
-  S. Maghsudi and S. Stanczak, “Dynamic Bandit with Covariates: Strategic Solutions with Application to Wireless Resource Allocation,” in Proc. ICC, pp. 5898-5902, 2013. doi: 10.1109/ICC.2013.6655540
-  A. Feki and V. Capdevielle, “Autonomous Resource Allocation for Dense LTE networks: A Multi Armed Bandit Formulation,” in Proc. PIMRC, pp. 66-70, 2011. doi: 10.1109/PIMRC.2011.6140047
-  S. Maghsudi and S. Stanczak, “Transmission Mode Selection for Network-Assisted Device to Device Communication: A Levy-Bandit Approach,” in Proc. ICASSP, pp. 7009-7013, 2014. doi: 10.1109/ICASSP.2014.6854959
-  S. Maghsudi and S. Stanczak, “Channel Selection for Network-Assisted D2D Communication via No-Regret Bandit Learning With Calibrated Forecasting,” in IEEE Trans. Wireless Commun., vol. 14, no. 3, pp. 1309-1322, Mar. 2015.
-  I. Mondal, A. Neogi, P. Chaporkar and A. Karandikar, “Bipartite Graph Based Proportional Fair Resource Allocation for D2D Communication,” in Proc. WCNC, pp. 1-6, 2017. doi: 10.1109/WCNC.2017.7925780
-  3GPP TR 36.942 V15.0.0, “Evolved Universal Terrestrial Radio Access (E-UTRA); Radio Frequency (RF) system scenarios”, 2018.