Distributed Deep Reinforcement Learning for Collaborative Spectrum Sharing

04/06/2021 ∙ by Pranav M. Pawar, et al. ∙ Bar-Ilan University

Spectrum sharing among users is a fundamental problem in the management of any wireless network. In this paper, we discuss the problem of distributed spectrum collaboration without central management under general unknown channels. Since the cost of communication, coordination and control is rapidly increasing with the number of devices and the expanding bandwidth used, there is an obvious need to develop distributed techniques for spectrum collaboration where no explicit signaling is used. We combine game-theoretic insights with deep Q-learning to provide a novel asymptotically optimal solution to the spectrum collaboration problem. We propose a deterministic distributed deep reinforcement learning (D3RL) mechanism using a deep Q-network (DQN). It restricts each user to the few channels with the highest Q-values and, among those, selects the least loaded channel. Using insights from both game theory and combinatorial optimization, we show that this technique is asymptotically optimal for large overloaded networks. The selected channel and the outcome of the transmission are fed back into the deep Q-network to update the Q-values. We also analyzed performance to understand the behavior of D3RL in differ



I Introduction

Spectrum sharing is a fundamental problem for the efficient management of wireless communication networks. There are two paradigms for network management: centralized and distributed. The former is prevalent in cellular networks, while the latter has been mainly used in ad-hoc networks. However, the growing demand for bandwidth, the increase in the number of users, and the scarcity of the available spectrum make collecting channel state information increasingly hard [4]. In this paper, we approach the problem of spectrum sharing by combining insights from our previous game-theoretic analysis of the interference game [2] with deep Q-learning. This enables us to boost the performance of deep Q-learning by compared to the algorithm in [11] in overloaded settings when the load per frequency channel is larger than .

The most suitable and efficient way to manage channels in this type of scenario is to learn the channel behavior (state and action profile) and allocate the channels accordingly. Hence, developing distributed learning solutions for the channel allocation problem has attracted major attention in recent years. This paper aims to develop a distributed learning algorithm for channel allocation in real-time multi-user networks, without the exchange of multiple messages or large state information. We adopted a deep reinforcement learning (DRL) [5] mechanism to achieve this goal because it gives a good approximation of objective values. The objective of DRL is to learn efficient strategies and rules for a given decision problem. Here, we use the most popular reinforcement learning algorithm, Q-learning, combined with a deep neural network, i.e., a deep Q-network (DQN) [5]. The DQN maps states to actions so as to maximize the Q-value.

The basic drawback of multi-agent Q-learning is that even if convergence is achieved, it is only to a Nash equilibrium (NE) point. As is well known, a NE can be highly sub-optimal in terms of social welfare and fairness [8]. However, modifying the utilities that the agents optimize and allowing each agent to select only among a few best strategies produces a game in which all NE points are near-optimal (in terms of the original utility) [2]. This suggests that a similar modification to the Q-learning process might yield better allocations. In this paper, we use the basic idea of balanced allocation together with the classical result reported in [1] to obtain asymptotically optimal learning rules.

Our main contribution in this paper is a modification of the architecture in [11]: we change the allocation rules and the game used for the multi-agent Q-learning by exploiting the game-theoretic result of [2]. First, we limit the strategy space to the (actually is sufficient) channels with the highest Q-values. This results in a game where all NE points are near-optimal from a social welfare perspective. Next, we use a deterministic strategy to choose the least loaded channel among these best channels. This yields a near-optimal allocation in the following sense:

  • Each user obtains one of its best channels, so by order statistics the allocation is asymptotically optimal, as long as and the fading statistics are exponentially dominated (an assumption which holds for all standard fading models).

  • The load on all channels is nearly equal; i.e., .

The intuition behind this second step is that we consider all the best channels as having the same utility, which results in an equivalent game with only good equilibria. The proof of these two results will be given in the full version of the paper. It combines ideas from [2] and [1].
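The load-balancing effect behind the second property can be illustrated with a small simulation in the spirit of the balanced-allocation setting of [1]. This is only a sketch under stated assumptions: the function name `balanced_allocation`, the parameter `d` (number of candidate channels per user), and the use of uniformly random candidates sampled without replacement are hypothetical choices made for illustration. In D3RL the candidates are each user's best channels according to the Q-values, not random draws.

```python
import random


def balanced_allocation(n_users, n_channels, d, seed=0):
    """Assign each user to the least loaded of d candidate channels.

    Assumption: candidates are drawn uniformly at random without
    replacement; this stands in for "the user's d best channels".
    """
    rng = random.Random(seed)
    loads = [0] * n_channels
    for _ in range(n_users):
        candidates = rng.sample(range(n_channels), d)
        # Deterministic rule: pick the candidate with the smallest load.
        chosen = min(candidates, key=lambda c: loads[c])
        loads[chosen] += 1
    return loads


if __name__ == "__main__":
    # Compare a single random choice with the least-loaded-of-two rule.
    for d in (1, 2):
        loads = balanced_allocation(n_users=100, n_channels=25, d=d)
        print(f"d={d}: min load={min(loads)}, max load={max(loads)}")
```

With `d = 1` the rule reduces to a purely random assignment, while larger `d` lets each user avoid its most crowded candidates, which is the mechanism that keeps the per-channel loads close to each other.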

II Related Work

Many recent developments in distributed learning methods have demonstrated that it can be an efficient technique to solve the spectrum (or channel) sharing problems in wireless networks. This section discusses the state-of-the-art in the area of distributed learning methods for spectrum sharing.

In [11], deep Q-learning for spectrum access (DQSA) is discussed. Here, spectrum access actions for every user are learned by training a DQN. The algorithm learns good strategies for every user online and in a distributed fashion, without message exchange or online coordination among users. In [13], the authors put forward an optimal access policy based on the state transition probabilities of Markov channels. They also propose an access policy for these channels using deep Q-learning when the transition probabilities are unknown. Here, the DQN uses ε-greedy policies to accumulate the training data. Spectrum resource allocation in a cognitive radio network is explored in [9], which develops a deep reinforcement learning (DRL) mechanism for power control of the secondary user so that it can share the spectrum with a primary user. The Q-learning-based DRL is implemented by the secondary users to learn and adjust their transmit power after interactions with the primary users so that both users can transmit successfully. In [19], distributed learning for channel allocation is studied in a multi-armed bandit setting, achieving an optimal regret of . It describes a distributed channel allocation mechanism that assumes carrier sense multiple access (CSMA). Here, the user learns from observing a single channel, without decoding the channel and without exchanging extra information with other users. Online self-decision and offline self-learning algorithms are proposed in [14] for channel allocation in multi-channel wireless sensor networks (WSNs). A non-cooperative game is used for the online self-decision algorithm, and a Q-learning-based DRL approach is used for offline self-learning of channels. The findings show that offline self-learning converges to the optimal channel selection with lower computational and storage resources. In [15], a channel access application of DQN is considered in a multi-user, multi-hop, simultaneous-transmission WSN scenario. It proposes a DQN that accesses the channel using online learning. The DQN approach handles a large system and finds an optimal policy from historical observations without prior knowledge of the system dynamics. Deep-Reinforcement Learning Multiple Access (DLMA) [18] is a heterogeneous MAC protocol using DRL. Here, the DRL agent learns an optimal medium access control (MAC) strategy for efficient co-existence with time division multiple access (TDMA) and ALOHA nodes. The agent learns through a series of state-action-reward observations. This work also analyzes the characteristics of DRL compared to other neural network techniques before applying it to wireless networks. The sensor scheduling problem for wireless channel allocation using DRL was studied in [7], where the scheduling problem is formulated as a Markov decision process (MDP) and solved using a DQN. The authors report good performance compared to other sub-optimal sensor scheduling policies. Overall, these studies focus primarily on non-deterministic solutions for spectrum access using Q-learning, which are more complex to implement and make it harder to find optimal values in real-world conditions.

The remainder of this paper is organized as follows. Section III presents the network model for the proposed technique and formulates the problem. Section IV describes the D3RL mechanism, including its architecture, algorithm, and operation. Section V presents the communication and neural network parameters used in the simulations and analyzes the simulation results. Section VI concludes the paper and outlines future work.

III Network Model and Problem Statement

We consider an ad-hoc network with users and channels, where . The training is performed in a distributed manner at each user. For simplicity of exposition, we assume that all users know both and (these can be broadcast to all users if needed). Let be a parameter designating a small set of best channels for transmission, . All users in the network have similar capabilities. All nodes in the network are synchronized, as is typical in slotted random access networks.

We assume a slotted random access mechanism for sharing the spectrum; specifically, we consider slotted multi-channel ALOHA transmission, where each user is allowed to transmit on a specific channel in a specific slot according to a certain transmission probability [3]. Here, each user transmits with probability and does not transmit with probability . We do not assume that users know a priori the channel qualities or the loads on the channels. Our goal is to devise a multi-agent learning algorithm that leads the network to an allocation that is good for all users: each user transmits over one of its best channels, and the load on all channels is approximately identical.
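To make the access model concrete, a single slot of slotted multi-channel ALOHA can be sketched as follows. The function name `simulate_slot`, the outcome labels, and the assumption that a transmission succeeds exactly when no other user transmits on the same channel in that slot are illustrative choices, not notation taken from the paper.

```python
import random
from collections import Counter


def simulate_slot(channel_of_user, p, rng):
    """Simulate one slot of slotted multi-channel ALOHA.

    channel_of_user[i] is the channel that user i is tuned to.
    Each user transmits independently with probability p.
    Returns one outcome per user: "idle", "success" or "collision".
    """
    transmits = [rng.random() < p for _ in channel_of_user]
    # Count how many users actually transmit on each channel.
    active = Counter(ch for ch, tx in zip(channel_of_user, transmits) if tx)
    outcomes = []
    for ch, tx in zip(channel_of_user, transmits):
        if not tx:
            outcomes.append("idle")
        elif active[ch] == 1:
            outcomes.append("success")  # an ACK would be received
        else:
            outcomes.append("collision")
    return outcomes


if __name__ == "__main__":
    rng = random.Random(1)
    # Four users: two share channel 0, the others are alone.
    print(simulate_slot([0, 0, 1, 2], p=0.5, rng=rng))
```

The binary success or failure outcome is the only feedback a user needs in this sketch, which matches the requirement that no explicit signaling is exchanged between users.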

To analyze the network, we assume that at all times each user has pure action set and transmits over channel using a slotted ALOHA protocol. For loaded ALOHA we will mostly depend on a mixed strategy of the players. A mixed strategy for a player is a probability distribution over the possible channels (pure strategies) , where is the probability of not transmitting at time , and is the probability of user transmitting over channel at time for .

Let be a binary observation indicating whether a packet is successfully delivered or not; i.e., if acknowledgment (ACK) is received and otherwise. Let be the action of user at time and be the strategy of user at time .

Let be the reward that obtains at time , be the utility of user and be the load on channel .

Definition 1: The history of user at time is the set of all actions, observations, and channel loads up to time , defined as


History is used for training the users to learn their best strategy.

Definition 2: The utility function (an instantaneous function) that defines the throughput of user on channel is


Here, is the user's transmission power, is the channel gain, is the bandwidth of each channel, and is a vector of actions (where is the action of user , i.e., the frequency it selected).

Definition 3: The multi-channel random access game is defined by:


where is the set of actions of each player. denotes no transmission, while denotes the identity of the selected channel for transmission.

Definition 4: A mixed strategy is a probability distribution with respect to pure strategy , which is


Here, is a strategy vector and is a payoff for the considered strategy vector.

Definition 5: The total accumulated reward with discount factor for player , is given by:


and is the time-horizon.
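As a small worked example of Definition 5, the accumulated discounted reward of a finite reward sequence can be computed as below. The helper name `discounted_return` is hypothetical, and the indexing convention (the first reward is weighted by the discount factor raised to the power zero) is an assumption, since the exact form of the sum is not reproduced here.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for t = 0, ..., T-1.

    Assumption: the first reward is undiscounted (weight gamma**0).
    """
    total = 0.0
    # Horner-style evaluation from the last reward backwards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three successful slots with unit reward and discount factor 0.5.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```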

Definition 6: The strategy profile is called a NE in the multi-channel random access game if


for all and all . Here, is a strategy profile for all users except user .

The objective is to find a strategy for a user , which maximizes the expected accumulated discounted reward, .

A NE is a stable point of the dynamics; reaching such an equilibrium is therefore desirable, since it prevents network fluctuations. In general, however, such equilibria can be highly sub-optimal [8]. Recently, Bistritz and Leshem [2] proved that by changing the utility or restricting the strategies available to each player, a competitive game can be formed in which each NE point is near-optimal with respect to the sum rate of all users.

In this paper, we develop a learning strategy that always achieves a NE with high utility for all users. To that end, we exploit the results of [2] to modify the DQN learning [10]. This is achieved using DRL techniques: Q-learning [17], DQN [10], and double Q-learning [5].

IV Deterministic Distributed Deep Reinforcement Learning

In this section, we present a new collaborative spectrum access technique called Deterministic Distributed Deep Reinforcement Learning (D3RL). The basic idea is to deterministically limit the set of strategies of each player and to force users to use the least loaded channels among their best strategies. The technique implements a deep reinforcement network for learning the strategies, together with a decision mechanism for the transmission channel that is deterministic given the computed Q-values. The first subsection describes the architecture, followed by the operation phase and the load estimation used in the algorithm. Finally, we discuss the training of the algorithm.

IV-A Architecture

Fig. 1: Architecture of the DQN used in D3RL

This subsection describes the proposed layered architecture for the reinforcement learning used in D3RL to solve the channel allocation problem in the multi-user network. The architecture of D3RL consists of an input layer, long short-term memory (LSTM) layers, value layers, advantage layers, an output layer, a selection layer, and distributed double Q-learning. Here, D3RL chooses the channel in a deterministic way by using the Q-values and the load on each channel, and passes the choice on as learning information to the LSTM layers for the next learning step. LSTM layers are useful for preserving the internal state and the aggregated observations as time elapses, which helps to estimate the true state. Double Q-learning is used to reduce overestimation in the Q-value estimates [5][16]. Each user updates its DQN weights after completion of the training phase. Consistent with the requirements of a lightweight multi-user network, the implementation of the proposed algorithm is very simple. The network is trained in a distributed manner, and training is executed whenever there are significant changes in the network environment.

The proposed layer architecture is shown in Fig. 1. The architecture consists of the following layers:

  • Input layer: The input is a vector of size where each coordinate contains the number of users that selected the given action. The input is updated during each iteration. The next iterations use a history profile (which changes according to the Q-values) for allocating the best channel for multi-user communication.

  • LSTM layer: The LSTM layer [6] is used to retain the internal state of the network and also helps to accumulate observations. This is needed for estimating the correct state of the network, which relies on the history information. Here, the state of the network for each user is the load it experienced on each channel and the selected channel. In short, the layer learns through experience, aggregates that experience, and passes it on over time.

  • Value and Advantage layers: These layers help to cope with observability problems in the DQN [16]. Here, we estimate the average Q-value of a state because every state is good or bad depending on the action taken. The average Q-value of an action is estimated as the value of the state plus Adv, the advantage derived from the action.

  • Output layer: It outputs a vector of size . It consists of the estimated Q-value for transmission on each channel.

  • Selection layer: The job of the selection layer is to select the best channel profile according to the Q-values and the load on the channels. Algorithm 1 specifies the process for the selection layer.

  • Distributed Double Q-Learning: Here, the DQN is trained using distributed double Q-learning, which decouples the selection of an action from the estimation of its Q-value [5][11]. We implemented two DQNs: DQN1 for selection of the action and DQN2 for estimation of the Q-value of a given action.
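The value/advantage combination and the double Q-learning target described in the list above can be sketched in a few lines. Both helpers are hypothetical illustrations rather than the authors' implementation: `dueling_q_values` subtracts the mean advantage as in the dueling architecture of [16] (the text above only states "value plus advantage"), and `double_q_target` follows the usual double Q-learning rule of [5], with one network choosing the action and the other evaluating it.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Assumption: the mean advantage is subtracted, as in the dueling
    architecture, so that the value and advantage parts are identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_q_target(reward, gamma, q_select_next, q_eval_next):
    """Double Q-learning target for one transition.

    q_select_next: next-state Q-values from the selecting network (DQN1).
    q_eval_next:   next-state Q-values from the evaluating network (DQN2).
    """
    # DQN1 picks the greedy action, DQN2 supplies its value.
    best_action = max(range(len(q_select_next)), key=lambda a: q_select_next[a])
    return reward + gamma * q_eval_next[best_action]


if __name__ == "__main__":
    print(dueling_q_values(1.0, [1.0, 3.0]))                 # [0.0, 2.0]
    print(double_q_target(1.0, 0.5, [0.1, 0.9], [4.0, 2.0]))  # 1 + 0.5 * 2 = 2.0
```

Separating selection from evaluation means that an action whose value is overestimated by one network is not automatically assigned that inflated value in the training target.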

IV-B The D3RL algorithm

Fig. 2: D3RL structure
Input: , , and
Output: User selects a channel profile and outputs a Q-value
for to do
    for to do
        User observes the current state and chooses a channel profile
        Update Q-values for users according to the chosen channel profile.
    end for
end for
where = is one of the largest values.
Algorithm 1 Selection layer
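The core decision made by the selection layer, namely restricting attention to the channels with the highest Q-values and then taking the least loaded one, can be sketched as follows. The function name `select_channel`, the parameter name `k` for the number of best channels, and the zero-based channel indexing are assumptions made for this illustration.

```python
def select_channel(q_values, loads, k):
    """Return the least loaded channel among the k highest-Q channels.

    q_values[c]: estimated Q-value of transmitting on channel c.
    loads[c]:    estimated load on channel c.
    Ties are broken in favor of the lower channel index.
    """
    # Rank channels by Q-value, highest first (stable sort keeps index order on ties).
    ranked = sorted(range(len(q_values)), key=lambda c: q_values[c], reverse=True)
    best_k = ranked[:k]
    # Deterministic choice: minimum load within the restricted set.
    return min(best_k, key=lambda c: loads[c])


if __name__ == "__main__":
    q = [0.9, 0.8, 0.1, 0.7]
    load = [5, 2, 0, 3]
    print(select_channel(q, load, k=2))  # channels {0, 1} qualify, channel 1 is less loaded
```

Note that channel 2 has the lowest load in this example but is never chosen for `k = 2`, because its Q-value does not place it among the best channels.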

Here, each user selects a channel with maximum Q-values out of the available Q-values and transmits on the channel with the least load accordingly. The flow of D3RL is shown in Fig. 2. The algorithm collects the Q-value outputs from the DQN, sorts them in descending order, considers the best channels with maximum Q-value, and selects the channel with the minimum load (as in Algorithm 1). In this step, user acts according to the following strategy:


In the next step, the user transmits through the selected channel using ALOHA with certain transmission probabilities, generates a new action profile for the channel, updates the Q-values, and passes these new Q-values on to the next iteration of learning. Here, is the success probability, is the probability of no transmission, and is the probability of collision. These probabilities are given by:

Input: Input vector
Output: Training of DQN and output estimated Q-values for transmission through channels
for to do
    for to do
        for to do
            for to do
                Feed into . Estimate the Q-values for all available actions . Take action (according to Algorithm 1) and obtain a reward .
            end for
            for to do
                Feed into and estimate the Q-values , , for all actions . Construct a target vector for training by replacing the by,
            end for
        end for
    end for
    Train with and output . Every iteration set
end for
Algorithm 2 D3RL training

IV-C Load Estimation

The algorithm selects the channel with minimal load among the best channels. The load estimator uses standard ALOHA-based load estimation: the load on each channel is estimated from the ratio of the number of successful transmissions to the total number of transmissions on the channel. Using (8)-(10), this ratio is given by:


According to equation (11), the approximate load on the channel is


At each time step, users observe the load on each of the best channels (selected according to the Q-values) and choose among these channels the one with the minimum load according to equation (7). Here, is the total number of transmissions and is the total number of successful transmissions.
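A success-ratio load estimator of this kind can be sketched under the standard slotted ALOHA model. The relations used below are an assumption of this sketch and are not claimed to be the exact form of equations (8)-(12): if a channel is shared by n users that each transmit independently with probability p, a tagged user's transmission succeeds with probability (1 - p)^(n - 1), so the observed ratio of successes to transmissions can be inverted to estimate n. The function names are hypothetical.

```python
import math


def aloha_user_probabilities(n_users, p):
    """Per-slot outcome probabilities for one tagged user.

    Assumes n_users users share the channel and each transmits
    independently with probability p in every slot.
    Returns (success, no transmission, collision).
    """
    p_success = p * (1.0 - p) ** (n_users - 1)
    p_idle = 1.0 - p
    p_collision = p - p_success
    return p_success, p_idle, p_collision


def estimate_load(n_success, n_transmissions, p):
    """Estimate the number of users on a channel from the success ratio.

    Inverts ratio = (1 - p) ** (n - 1), which is an assumption of this sketch.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n_transmissions <= 0 or n_success <= 0:
        raise ValueError("need at least one transmission and one success")
    ratio = n_success / n_transmissions
    return 1.0 + math.log(ratio) / math.log(1.0 - p)


if __name__ == "__main__":
    print(aloha_user_probabilities(n_users=4, p=0.5))
    print(estimate_load(n_success=125, n_transmissions=1000, p=0.5))
```

Because the estimate depends only on the user's own ACK statistics, it can be computed locally without any exchange of messages, at the cost of being noisy when few transmissions have been observed.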

IV-D Training Mechanism

The DQN is trained using the D3RL training algorithm (Algorithm 2). All users are trained in a distributed manner. Training is only required when the characteristics of the network change, for example when nodes or links are added. Algorithm 2 runs for iterations; in each iteration it calculates a Q-value for all available actions on the channel and learns from it, and it outputs the best channel allocation for each user after iterations. Here, is the number of episodes.

IV-E Complexity Analysis of D3RL

When using D3RL, each user needs to find the best channels (with respect to the Q-values). The total time required to find the best channels has a complexity of operations per step.

The complexity of D3RL in terms of the number of multiplications performed during the operation of the DQN is easy to compute. Assume a DQN with layers, where the size of the input layer is and each of the other layers has size . Then the real-time computational complexity (for forward and backward propagation) for every user at each time step is:


The training-phase computational complexity for each user over iterations is on the order of

V Simulation Results

In this section, we present several simulated experiments comparing the proposed D3RL algorithm to DQSA [11]. We have implemented D3RL with channels. The common communication and neural network parameters used for the simulation are listed in Tables I and II, respectively.

Parameter                        Value
Number of users                  100
Number of channels               50 or 25
Type of channel                  Rayleigh fading channel
SNR                              35 dB
Bandwidth                        20 MHz
Maximal Doppler shift            100 Hz
TABLE I: Communication Parameters

Parameter                        Value
Number of LSTM layers            100
Number of advantage layers       10
Number of value layers           10
Minibatch size                   16 episodes
Time steps                       10 to 100
Discount factor                  0.95
Alpha factor                     0.05
Temperature                      1 to 20
Number of training iterations    10000
TABLE II: Neural network parameters

Fig. 3: Comparative average reward with channel (100 users 50 channels)

Fig. 3 depicts the performance of D3RL over Rayleigh fading channels. The algorithm (with ) achieves rates which are higher than DQSA and when using . The improvement can be attributed to the deterministic learning used in the D3RL which ensures an asymptotically balanced channel allocation for each user. For comparison, we also provide an upper bound on the average rewards similar to the upper bound in [12].

Fig. 4: Comparative average reward with channel (100 users 25 channels)

Fig. 4 compares the performance of the D3RL algorithm under higher loads (100 users and 25 channels) as a function of the time step. In this loaded situation, D3RL outperformed DQSA by (with ) and (with ). Fig. 4 also shows the upper bound on the average rewards on the channel. As can be seen, the D3RL performance is very close to the optimal allocation, as expected from the analysis in [2] and the results of [1].

Figs. 3 and 4 also depict the performance of D3RL with . As discussed above, increasing improves the performance with a minor increase in computational complexity. However, the gains are insignificant for . This is very reasonable, since by the random graph argument of [2], substituting leads with high probability to a completely balanced NE, because in this graph there is a perfectly balanced matching with high probability.

VI Conclusions and Future Work

We considered the problem of collaborative spectrum sharing in a multi-user ad-hoc network. We designed a distributed learning algorithm that finds an asymptotically balanced channel allocation for each user. By exploiting the properties of the multi-channel ALOHA protocol, the proposed mechanism allows each user to learn its best strategies without exchanging any extra messages in the network. The experimental results show that D3RL performs strongly in a complex multi-user network.

In the future, this work can be further extended to analyze system dynamics by using game theoretic techniques and developing a complete intelligent MAC mechanism exploiting DRL techniques. It can also be extended to a hardware implementation of the algorithm and testing using a hardware test-bed.


  • [1] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal (1999) Balanced allocations. SIAM J. Comput. 29 (1), pp. 180–200.
  • [2] I. Bistritz and A. Leshem (2019) Game theoretic dynamic channel allocation for frequency-selective interference channels. IEEE Transactions on Information Theory 65 (1), pp. 330–353.
  • [3] K. Cohen and A. Leshem (2016) Distributed game-theoretic optimization and management of multichannel ALOHA networks. IEEE/ACM Transactions on Networking 24 (3), pp. 1718–1731.
  • [4] L. Duan, L. Gao, and J. Huang (2014) Cooperative spectrum sharing: a contract-based approach. IEEE Transactions on Mobile Computing 13 (1), pp. 174–187.
  • [5] H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100.
  • [6] M. Hausknecht and P. Stone (2015) Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15).
  • [7] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi (2020) Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems. Automatica 113, pp. 108759.
  • [8] A. Leshem and E. Zehavi (2009) Game theory and the frequency selective interference channel. IEEE Signal Processing Magazine 26 (5), pp. 28–40.
  • [9] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li (2018) Intelligent power control for spectrum sharing in cognitive radios: a deep reinforcement learning approach. IEEE Access 6, pp. 25463–25473.
  • [10] V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
  • [11] O. Naparstek and K. Cohen (2019) Deep multi-user reinforcement learning for distributed dynamic spectrum access. IEEE Transactions on Wireless Communications 18 (1), pp. 310–323.
  • [12] O. Naparstek and A. Leshem (2012) Bounds on the expected optimal channel assignment in Rayleigh channels. In 2012 IEEE 13th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 294–298.
  • [13] H. Q. Nguyen, B. T. Nguyen, T. Q. Dong, D. T. Ngo, and T. A. Nguyen (2018) Deep Q-learning with multiband sensing for dynamic spectrum access. In 2018 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), pp. 1–5.
  • [14] M. Qiao, H. Zhao, S. Huang, L. Zhou, and S. Wang (2017) Optimal channel selection based on online decision and offline learning in multichannel wireless sensor networks. Wireless Communications and Mobile Computing 2017.
  • [15] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari (2018) Deep reinforcement learning for dynamic multichannel access in wireless networks. IEEE Transactions on Cognitive Communications and Networking 4 (2), pp. 257–265.
  • [16] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003.
  • [17] C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning, pp. 279–292.
  • [18] Y. Yu, T. Wang, and S. C. Liew (2019) Deep-reinforcement learning multiple access for heterogeneous wireless networks. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1277–1290.
  • [19] S. M. Zafaruddin, I. Bistritz, A. Leshem, and D. Niyato (2019) Distributed learning for channel allocation over a shared spectrum. IEEE Journal on Selected Areas in Communications 37 (10), pp. 2337–2349.