Unmanned Aerial Vehicles (UAVs) have recently been used in many civilian, commercial and military applications Razi_Asilomar17; Peng2018; Mousavi_INFOCOM18; Afghah_ACC18; Khaledi_SECON18; Afghah_NWRCS. With recent advances in the design and production of UAVs, the global market revenue of UAVs is expected to reach $11.2 billion by 2020 Gartner_UAV.
Spectrum management is one of the key challenges in UAV networks, since spectrum shortage can impede the operation of these networks. In particular, in applications involving low-latency video streaming, the UAVs may require additional spectrum to complete their mission. Conventional spectrum sharing mechanisms such as spectrum sensing may not be practical in UAV systems, given the considerable energy that sensing requires and the fact that such mechanisms cannot guarantee continuous spectrum access. Property-right spectrum sharing techniques instead operate based on an agreement between the licensed and unlicensed users, where the spectrum owners lease their spectrum to the unlicensed users in exchange for certain services such as cooperative relaying or energy harvesting.
In this paper, we study the problem of limited spectrum in UAV networks and consider a relay-based cooperative spectrum leasing scenario in which a group of UAVs in the network cooperatively forward data packets for a ground primary user (PU) in exchange for spectrum access. The rest of the UAVs in the network utilize the obtained spectrum for transmission and completion of the remote sensing operation. Thus, the main problem is to partition the UAV network into two task groups in a distributed way.
It is worth noting that cooperative spectrum sharing has been studied previously in the context of cognitive radio networks Stanojev08; Afghah_CDC2013; Korenda_CISS; Namvar15. The existing models are mostly centralized and the set of relay nodes is typically chosen by the PU. Such solutions, however, are not applicable to UAV networks, due to their distributed infrastructure and autonomous functionality.
To tackle this problem, we utilize multi-agent reinforcement learning Guestrin2002; Lauer2000; mousavi2016deep; Chalkiadakis2003; mousavi2017traffic; mousavi2016learning; mousavi2017applying, which is an effective tool for designing algorithms in distributed systems, where the environment is unknown and a reliable communication among agents is not guaranteed. The main problems in distributed multi-agent reinforcement learning include dealing with state space complexity and the lack of complete information about other agents. There have been proposals in the literature to address these issues through message passing or simplifying assumptions. For instance, Guestrin2002 assumes that the decision of an agent depends only on a limited group of other agents, which decomposes the state space and simplifies the problem. In another work Chalkiadakis2003, a Bayesian setting is proposed where each agent has some distributional knowledge about other agents’ decisions. Such simplifications, however, are not applicable to the distributed UAV network environment.
In this paper, we propose a distributed multi-agent reinforcement learning algorithm for task allocation among UAVs. Each UAV either joins a relaying group to provide relaying service for the PU or performs data transmission to the UAV fusion center. In this approach, each UAV maintains a local table of the rewards associated with its actions in different states. The tables are updated locally based on feedback from the PU receiver and the UAV fusion node. We define utilities for both the PU and the UAV network, and the objective is to maximize the total utility of the system (the sum utility of the PU and the UAV network). We discuss the convergence of our learning algorithm and present simulation results to verify the convergence to the optimal solution.
The remainder of this paper is organized as follows. In Section 2, the system model and the assumptions of the proposed model are described. In Section 3, we propose a distributed multi-agent learning algorithm to solve the spectrum sharing problem. In Section 4, we present simulation results and discuss the performance of our distributed learning algorithm. Finally, we make concluding remarks in Section 5.
2 System Model
We consider a licensed primary user (PU) who is willing to share a part of its spectrum with a network of UAVs, in exchange for receiving a cooperative relaying service. The UAV network consists of UAVs which can be partitioned into two sets depending on the task of the UAV. In fact, UAVs either relay for the PU or utilize the spectrum to transmit their own packets to the fusion center. Let be the number of nodes who perform the relaying task and denote the number of UAVs that transmit packets to their fusion center. In this paper, we assume that both the PU’s transmitter and receiver are terrestrial, while UAVs are operating in high elevation. Also, we assume no reliable direct link exists between the PU’s transmitter and receiver. Moreover, there is zero chance for direct transmission between the UAVs’ source and fusion, due to their distances from the fusion center. Fig. 1 illustrates a sample scenario with 6 total nodes, where the nodes are partitioned into a set of 4 relay nodes for the fusion center, and 2 other nodes relay information for the PU receiver on the ground.
The PU’s transmitter intends to send its packet to a designated receiver, which is far away from its location. Hence, a single or a number of UAVs are required to deliver its information to the receiver. In addition, we assume that the UAVs’ spectrum is congested or unreliable, therefore the UAVs are required to lease additional spectrum from the PU to communicate with their fusion. By delivering the PU’s packet, the UAVs gain spectrum access to send their own packets. All the UAVs transmitters and receivers are assumed to be equipped with a single antenna. Also, we assume that the channels between UAVs, source, fusion, and PU transmitter and receiver are slow Rayleigh fading with a constant coefficient over one time slot. The channel coefficients are defined as follows: i) refers to the channel parameters between the PU’s transmitter and UAV; ii) denotes the parameters between the UAV and the PU’s receiver; iii) and , respectively denote the channel coefficients between the Source and the UAV, and between the UAV and the fusion center. For the sake of simplicity, the instant Channel State Information (CSI) are assumed to be available for all UAVs following similar works in stanojev2013improving; wu2011information; al2016enhancingg; afghah2018reputation; shamsoshoara2015enhanced
The source of the noise at the receivers is considered as a symmetric normally distributed random variable, denoted by. Many works such as roberge2013comparison; mozaffari2016optimal; shamsoshoara2015enhanced optimized the power consumption and nodes’ lifetime in this area. Power optimization, however, is not the purpose of this work, hence we assume constant powers during the transmissions. The transmission power for the Source and the PU transmitter is nevertheless lower than that of the UAVs. A half-duplex strategy is utilized in this work. Without loss of generality, time-division notations are used in order to ensure half-duplex operation. Under these assumptions, the channel and system model for a single relay is shown in Fig. 2. In this model, all UAVs and terminals utilize a single antenna for transmission.
In the first half of a transmission cycle, the source transmits its packet and the relay UAVs receive the information. The channel model for the first half is presented as follows:
where is the source’s transmitted signal and is the UAV’s received signal. Then, in the second half of the transmission, the UAV forwards the packet it received in the previous time slot. The received signal in the second half can be written as follows:
where is the UAV’s transmitted signal and is the destination’s received signal.
In equations (1) and (2), the CSI parameters represent the effects of the path loss and likewise represents the effect of noise and interference terms at the receiver, where and . In our scenario, is calculated by the proper receiver.
Based on equations (1) and (2), the throughput capacity of the non-degraded discrete memoryless broadcast channel is expressed in (3) AnalyzingAFZhang:
where , is the message word and is the codeword which has been assigned to each message by the encoder. Preferably, equation (3) should be solved for the optimal joint distribution of both and . However, as discussed in shafiee2007achievable, we can achieve the suboptimal throughput rate in (4), with the aid of assumption . Also, denotes the probability mass function (pmf) for the codeword.
In scenarios where users can exploit the existence of UAVs, different cooperation protocols such as Decode and Forward (DF) and Amplify and Forward (AF) can be used laneman2001efficient. The idea behind the concept of cooperative relaying is that a set of relay nodes decode, amplify and collectively “beam-form” the signal received from the source node (potentially with help of the source node itself) towards a designated destination in order to exploit transmission diversity and increase the overall throughput of the system levin2012amplify.
Considering an AF cooperation, each UAV first amplifies the signals from the source and then cooperates with source to send its information to the fusion center or to the PU-Receiver. According to laneman2004cooperative, the mutual information for i) the first set of source, UAV, fusion and ii) the second set of PU-T, UAV, PU-R can be written as equations (5) and (6), respectively. In these equations, denotes the transmitter power from the source of the UAV network and specifies the index for the UAV.
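As a rough illustration of the single-relay AF rate, the sketch below evaluates the standard half-duplex AF mutual-information expression from laneman2004cooperative, namely one half of the logarithm of one plus the direct-link SNR plus a combined relay-path term. The function name and the parameterization by received SNRs (transmit power times squared channel gain, divided by noise power) are assumptions made for illustration and may not match the notation of (5) and (6).

```python
import math


def af_mutual_information(snr_sd: float, snr_sr: float, snr_rd: float) -> float:
    """Half-duplex amplify-and-forward mutual information (bits per channel use).

    Hypothetical helper. Each argument is a received SNR for one link:
    source-destination, source-relay, and relay-destination.
    """
    # Combined contribution of the two-hop relay path: f(x, y) = xy / (x + y + 1).
    relay_term = (snr_sr * snr_rd) / (snr_sr + snr_rd + 1.0)
    # The factor 1/2 accounts for the two half-slots of a transmission cycle.
    return 0.5 * math.log2(1.0 + snr_sd + relay_term)


# With a negligible direct link, the rate is carried by the relay path alone.
rate_via_relay = af_mutual_information(snr_sd=0.0, snr_sr=10.0, snr_rd=10.0)
```

Setting the direct-link SNR to zero mirrors the assumption in the system model that no reliable direct link exists between a transmitter and its receiver.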
It is noteworthy that these equations are valid only for cooperation with a single Relay or UAV. However, the objective of this paper is dealing with Multi-UAV or Multi-Agent relays. Fig. 3 demonstrates the distribution of UAVs into two groups including UAVs facilitating the air source-to-fusion communication and UAVs providing relaying service for a ground-based primary transmitter-receiver pair. Hence, the equations for multi-UAV should be changed to (9) and (10). In (9), defines the lower bound for the first UAV in the source-fusion pair and denotes the upper bound.
Here, is the achievable rate for the fusion center. This rate is achieved with the help of () UAVs. and are transmission powers for the source and the UAV, respectively. Also, denotes the channel coefficient for the pair of source-fusion center, stands for the channel between the source and UAV, and finally denotes CSI for the UAV and the fusion. In (10), and define the lower and upper bound for the first and last UAV in the source-fusion pair respectively.
In (10), is the achievable rate for the primary transmitter-receiver pair with the aid of UAVs. and are transmission power for the primary user and the UAV, respectively. Moreover, , denotes the channel coefficients for primary transmitter and receiver. stands for the primary transmitter and UAV. Finally is CSI parameters for the UAV and the primary receiver. Based on the assumption of long distance between the source and the fusion center and also the long distance between the primary transmitter and receiver, we can assume that and are negligible.
Time is slotted and at the end of each time slot, the fusion center and the primary receiver send feedback to the UAVs informing them about the achieved accumulated rates. This information is used by each UAV to decide on joining a task group. The goal is to find the optimal task allocation for UAVs in a fully distributed way such that the total utility of the system (i.e. sum utility of UAV network (9) and the PU (10)) is maximized. We assume that the UAVs decide locally with no information exchange among themselves.
It is noteworthy that in some cases, the maximum throughput is achieved when all UAVs join the same set and deliver packets only for one set, which is not consistent with the proposed model. If all UAVs are distributed in the set of source-fusion, then the total throughput rate is zero because there is no available spectrum for UAVs to utilize for their transmission. Also, if all UAVs are partitioned in the primary set, then the sum throughput rate is equal to the rate of the primary user. In this case the proposed method handles this issue by considering the Jain fairness index lan2010axiomatic. Based on the fact that we only have two sets and based on the Jain index definition, (11) describes the fairness for the proposed method in our system model.
Here, is equal to 2 and which indicates the set of source-fusion or Primary Users. We assume that and are equal to the number of UAVs in the Fusion-Source set and the Primary Users set, respectively. Therefore, we can define the fairness as (12).
Now, if all UAVs are distributed in one set, then the fairness will be minimum (0.5), and if the UAVs are partitioned equally among the two sets, then the fairness will be maximum (1).
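The behavior of the fairness term can be checked numerically with a minimal sketch of the Jain index, computed as the squared sum of the group sizes divided by the number of groups times the sum of squared sizes. The function name is an assumption for illustration, and the inputs are the numbers of UAVs in the source-fusion set and the primary set.

```python
def jain_fairness(counts):
    """Jain fairness index of a list of non-negative group sizes (hypothetical helper)."""
    total = sum(counts)
    sum_of_squares = sum(c * c for c in counts)
    if sum_of_squares == 0:
        raise ValueError("at least one group must contain a UAV")
    return (total * total) / (len(counts) * sum_of_squares)


# Six UAVs split between the source-fusion set and the primary set.
all_in_one_set = jain_fairness([6, 0])  # 0.5, the two-set minimum
equal_split = jain_fairness([3, 3])     # 1.0, the maximum
uneven_split = jain_fairness([4, 2])    # 0.9
```

For two sets the index ranges from 0.5 to 1, so it penalizes allocations that place every UAV in the same task group.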
Based on these definitions, we define (13), as the gain value for each time slot which indicates the efficiency and performance for the distributed UAVs in two sets.
In (13), is the difference between the rate at time and the average of previous rates for the fusion center and is the difference between the rate at time and the average of previous rates for the primary user. Also, , , and are defined to control the gain value. Then, we use this gain in our proposed method as described in section 3.
3 The Distributed Learning Algorithm for Task Allocation
The proposed method is a general form of the Q-learning algorithm watkins1992q for a distributed multi-agent environment.
Let denote the action chosen by UAV at time , and let denote the set of all possible actions for UAV . We consider two possible actions for a UAV that correspond to either joining the relaying task group or the fusion task partition. Therefore, the set of possible actions is identical across UAVs. We denote the action vector of UAVs at time by , and we refer to the set of all possible action vectors by . There is a finite set of states , where state corresponds to the current task partition. A deterministic transition rule governs the transition between states, i.e. . The reward function maps the current state and action vector to a real value, that is . At the beginning of each time step, the UAVs observe the current state (this information is obtained by the feedback from the previous step). Then, each UAV independently decides on its action (i.e. which task group to join) without knowing any information about the actions of the other agents. The rewards associated with the UAVs’ actions are computed by the PU receiver and the UAV fusion. The reward is the gain obtained from the task partitioning, taking into account the utilities of the PU and the UAV network. After the reward is calculated, a feedback message from the PU receiver and the UAV fusion is broadcast to the UAVs. This feedback message contains the reward and the current task partitions.
The feedback information is used to update and maintain local Q-tables at each UAV. A Q-table basically represents the quality of different actions for a given state. For instance, denotes the quality of action at state for UAV at time . Individual Q-tables are updated as follows. At first, the tables are initialized with . Then, the following equation is used to update the Q-tables:
where is the learning rate, is the reward or the gain obtained at time , as defined in the system model, and is the discount factor to control the weight of future rewards in the current decisions.
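To make the roles of the learning rate, reward and discount factor concrete, the following is a minimal sketch of a local Q-table update in the textbook tabular Q-learning form. It is an illustration under assumptions rather than the exact update rule of this paper: the function names, the zero initialization, the default parameter values, and the encoding of a state as a tuple of group labels are all hypothetical.

```python
from collections import defaultdict

# Action labels follow Section 4: 0 = source-fusion group, 1 = PU relaying group.
ACTIONS = (0, 1)


def make_q_table():
    """Local Q-table keyed by (state, own_action); zero initialization is assumed."""
    return defaultdict(float)


def update_q(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one textbook Q-learning step to a UAV's local table and return the new value."""
    # Best value this UAV can expect from the next state under its own actions.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    # Blend the old estimate with the new target using the learning rate.
    q[(state, action)] = (1.0 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


# One UAV, two-UAV network: states are tuples of group labels.
q = make_q_table()
first = update_q(q, state=(0, 1), action=1, reward=2.0, next_state=(1, 1), alpha=0.5)
```

Each UAV stores only entries for its own two actions per state, which is what keeps the local table small compared with a table indexed by the joint action vector.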
The main idea is that in our distributed environment, the UAVs are unable to keep a global Q-table, corresponding to the current action vectors, i.e. . Instead, each UAV keeps a local (and considerably smaller) Q-table which cares about its own current action, i.e. . This approach significantly reduces the complexity of the algorithm and eliminates the need for coordination (or sharing information) with other UAVs at the time of decision making. However, we need a projection method that compresses the information of the global Q-table into the local small tables.
The results in Lauer2000 prove that in a deterministic multi-agent Markov decision process and for the same sequence of states and actions, if every independent learner chooses locally optimal actions, the result would be the same as choosing the optimal action from a global table. We utilize this result and consider an optimistic projection method that assumes each UAV chooses the maximum quality action from its local table. This reasonable assumption is a necessary condition for the optimality of the learning algorithm. It is worth noting that the existence of a unique optimal solution is the sufficient condition for the optimality of this algorithm. It means that there should be a unique task partition, which results in the maximum total utility. If multiple task partitions yield the maximum utility, it is possible that the UAVs act optimally and choose the optimal actions in their local Q-tables, but the combination of their actions may not be optimal. In this case, message passing among UAVs is needed as they need to coordinate decisions at every step.
It should also be noted that in learning algorithms we need a balance between exploring new actions and exploiting the previously learned quality of actions. Therefore, a greedy strategy that always exploits the Q-table and chooses the optimal action from the Q-table may not provide enough exploration for the UAV to guarantee an optimal performance. A very common approach is to add some randomness to the policy Singh2000. We use -greedy with a decaying exploration, in which a UAV chooses a random exploratory action at state with probability , where and is the number of times the state has been observed so far. The UAV exploits greedily from its Q-table with probability of . In this approach, the probability of exploration decays over time as the UAVs learn more.
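The exploration rule can be sketched as follows. The decay schedule used here, an exploration probability of min(1, c / n(s)) with n(s) the visit count of state s, and the constant c are assumptions chosen for illustration; the function and variable names are hypothetical as well.

```python
import random

ACTIONS = (0, 1)  # 0 = source-fusion group, 1 = PU relaying group


def choose_action(q, state, visits, c=0.5, rng=random):
    """Pick an action with decaying epsilon-greedy exploration.

    q is a dict keyed by (state, action); visits maps each state to the
    number of times it has been observed and is updated in place.
    """
    visits[state] = visits.get(state, 0) + 1
    epsilon = min(1.0, c / visits[state])  # assumed decay schedule
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    # Exploit: highest-quality action in the local table (missing entries count as 0).
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


q_local = {("s", 0): 1.0, ("s", 1): 2.0}
visit_counts = {}
greedy_action = choose_action(q_local, "s", visit_counts, c=0.0)  # c = 0 disables exploration
```

Because the exploration probability depends on the per-state visit count, rarely seen states keep being explored while frequently visited ones are exploited almost greedily.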
Similar to the original Q-learning for single-agent environments, the proposed learning algorithm converges if the state-action pairs are observed infinitely many times. Also, the time complexity of the algorithm is in the order of , where is the size of the state space, and is the size of the action space for UAV . Since there are only two possible actions in our application, the complexity can be expressed as . In terms of space complexity, each UAV needs to keep a table of size .
4 Simulation Results
In this section, we present the simulation results to evaluate the performance of the proposed method. We simulate our system model for a ground-based primary transmitter-receiver pair along with the pair of source and fusion for the UAV network. The location of primary users, source and fusion are fixed during the simulation. However, the UAVs are distributed randomly in the environment. The channels between nodes and are obtained from , where is the distance between nodes and . The duration of one time slot, , is assumed to be equal to 1. The values of , and are set to , and , respectively.
Scenario I: 2 UAVs
In the first scenario, we consider two UAVs to be partitioned into two task groups.
The network topology for this scenario is demonstrated in Fig. 4. Since in this scenario we only have 2 nodes, the number of possible states for task allocation is equal to 2^2 = 4. Hence, the Q-tables are learned after a few iterations. Fig. 5 illustrates the summation of the obtained throughput. The convergence to the optimal task allocation occurs after the 35th iteration, since the number of states is relatively small.
The matrix below shows the final task allocation values for these UAVs.
In this notation, 0 corresponds to the set of source-fusion and 1 means the set of the primary users. UAV1, which has a lower relative distance to the source-fusion pair, is allocated to the fusion set, while UAV2 is allocated to the other set to relay for the primary network.
Scenario II: 6 UAVs
In this scenario, we consider 6 UAVs to show that the convergence of the proposed method is achieved after more iterations compared to the case of 2 UAVs in the first scenario, since the number of states with 6 nodes is equal to 2^6 = 64. This means that at least 64 iterations are required for the algorithm just to visit all the states.
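The growth of the state space can be made explicit by enumerating every task partition, as in the short sketch below. The function name is an assumption for illustration; each state is represented as a tuple with one group label per UAV, using the 0/1 labels of the task matrices.

```python
from itertools import product


def task_partitions(num_uavs):
    """List every assignment of UAVs to the two task groups (0 or 1 per UAV)."""
    return list(product((0, 1), repeat=num_uavs))


states_scenario_1 = task_partitions(2)  # 4 states
states_scenario_2 = task_partitions(6)  # 64 states
```

The count doubles with every added UAV, which is consistent with the larger number of iterations needed before convergence in the second scenario.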
Fig. 6 demonstrates the network topology with these 6 UAVs for the primary user and the fusion. As we can see in Fig. 7, the convergence to the best task allocation occurred after 240 iterations. This implies that the more UAVs are added to the model, the more iterations are needed to reach convergence. Moreover, Fig. 8 shows the number of UAVs switching their actions (i.e. task partitions) in this scenario. After the 240th iteration, when convergence happens, no UAV changes its task partition, and the number of switches stays at zero.
Also, the task matrix shown below denotes the final task allocation for the 6 UAVs.
Based on this matrix, ; are considered for the set of source-fusion and the rest of UAVs are assigned to the relay task group for the primary network. This allocation makes sense considering the location of UAVs and their relative distances.
5 Conclusion
In this paper, we studied the task allocation problem for spectrum management in UAV networks. We considered a cooperative relay system in which a group of UAVs provide relaying service for a ground-based primary user in exchange for spectrum access. The borrowed spectrum is not necessarily used by the relay UAVs, but rather by other UAVs to transmit their own information to a fusion center. This creates a win-win situation for both networks. We defined utilities for both the UAV network and the ground-based primary network based on the achieved rates. Next, we proposed a distributed learning algorithm by which the UAVs make decisions on joining the relaying or fusion task groups without the need for information exchange or knowledge about other UAVs’ decisions. The algorithm converges to the optimal task partitioning that maximizes the total utility of the system. Simulation results were presented in different scenarios to verify the convergence of the proposed algorithm.