I Introduction
Spectrum sharing is a fundamental problem for the efficient management of wireless communication networks. There are two paradigms for network management: centralized and distributed. The former is prevalent in cellular networks, while the latter has been mainly used in ad hoc networks. However, the growing demand for bandwidth, the increase in the number of users, and the scarcity of the available spectrum make collecting channel state information increasingly hard [4]. In this paper, we approach the problem of spectrum sharing by combining insights from our previous game-theoretic analysis of the interference game [2] with deep learning. This enables us to boost the performance of deep learning compared to the algorithm in [11] in overloaded settings, where the load per frequency channel is high.
The most suitable and efficient way to manage channels in this type of scenario is to learn the channel behavior (state and action profile) and allocate the channels accordingly. Hence, developing distributed learning solutions for the channel allocation problem has attracted major attention in recent years. This paper aims to develop a distributed learning algorithm for channel allocation in real-time multi-user networks, without the exchange of multiple messages or large state information. We adopt a deep reinforcement learning (DRL) [5] mechanism to achieve this goal because it gives a good approximation of objective values. The objective of DRL is to learn efficient strategies and rules for a given decision problem. Here, we use the most popular reinforcement learning algorithm, Q-learning, combined with a deep neural network, i.e., a deep Q-network (DQN) [5]. The DQN maps states to actions so as to maximize the Q-value.
The basic drawback of multi-agent learning is that even if convergence is achieved, it is only to a Nash equilibrium (NE) point. As is well known, an NE can be highly suboptimal in terms of social welfare and fairness [8]. However, modifying the utilities that the agents optimize and allowing each agent to select only among a few best strategies produces a game in which all NE points are indeed near-optimal (in terms of the original utility) [2]. This suggests that a similar modification to the Q-learning process might yield better allocations. In this paper, we use the basic idea of balanced allocation together with the classical result reported in [1] to obtain asymptotically optimal learning rules.
Our main contribution in this paper is a modification of the architecture in [11]: we change the allocation rules and the game used for the multi-agent Q-learning by exploiting the game-theoretic result of [2]. First, we limit the strategy space of each user to its few channels with the highest Q-values (a small number of channels is actually sufficient). This results in a game where all NE points are near-optimal from a social welfare perspective. Next, we use a deterministic strategy to choose the least loaded channel among these best channels. This yields a near-optimal allocation in the following sense:

Each user obtains one of its best channels, so by order statistics the allocation is asymptotically optimal, as long as and the fading statistics are exponentially dominated (an assumption which holds for all standard fading models).

The load on all channels is nearly equal; i.e., .
The intuition behind this second step is that we treat all the best channels as having the same utility, which results in an equivalent game with only good equilibria. The proofs of these two results will be given in the full version of the paper; they combine ideas from [2] and [1].
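The balancing effect behind the second step can be illustrated with a small simulation in the spirit of the balanced-allocations result [1]. This is only a sketch under simplifying assumptions that are not taken from the paper: each user's set of best channels is drawn uniformly at random, users choose one after another, and the names `max_load` and `num_best` are hypothetical.

```python
import random


def max_load(num_users, num_channels, num_best, seed=0):
    """Assign users one by one and return the maximum channel load.

    Assumption (illustrative only): each user's `num_best` best channels
    form a uniformly random subset of all channels.
    """
    rng = random.Random(seed)
    load = [0] * num_channels
    for _ in range(num_users):
        candidates = rng.sample(range(num_channels), num_best)
        # Deterministic rule: pick the least loaded candidate channel.
        chosen = min(candidates, key=lambda ch: load[ch])
        load[chosen] += 1
    return max(load)
```

Calling `max_load(100, 25, num_best)` for a few values of `num_best` gives a feel for how quickly the maximum load approaches the ideal value of 4 users per channel in the loaded scenario considered later; with `num_best=1` the rule degenerates to a purely random assignment.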
II Related Work
Many recent developments in distributed learning methods have demonstrated that they can be an efficient technique for solving spectrum (or channel) sharing problems in wireless networks. This section discusses the state of the art in distributed learning methods for spectrum sharing.
In [11], deep Q-learning for spectrum access (DQSA) is discussed. Here, spectrum access actions for every user are learned by training a DQN. The algorithm learns good strategies for every user online and in a distributed fashion, without message exchange or online coordination among users. In [13], the authors put forward an optimal access policy that uses the state transition probabilities of Markov channels, and propose a deep Q-learning based access policy for these channels when the transition probabilities are unknown. Here, the DQN uses greedy policies to accumulate the training data. Spectrum resource allocation in a cognitive radio network is explored in [9], which develops a deep reinforcement learning (DRL) mechanism for power control that lets a secondary user share the spectrum with a primary user. The secondary users apply DRL to learn and adjust their transmit power after interactions with the primary users so that both can transmit successfully. In [19], distributed learning for channel allocation is studied in a multi-armed bandit setting, where it achieves optimal regret. The work describes a distributed channel allocation mechanism that assumes carrier sense multiple access (CSMA). Here, each user learns from observing a single channel, without decoding the channel and without exchanging extra information with other users. Online self-decision and offline self-learning algorithms are proposed in [14] for channel allocation in multichannel wireless sensor networks (WSNs). A non-cooperative game is used for the online self-decision algorithm, and a learning-based DRL approach is used for offline self-learning of channels. The findings show that offline self-learning converges to the optimal channel selection with lower computational and storage resources. In [15], a channel access application of DQN is considered in a multi-user, multi-hop, simultaneous transmission WSN scenario. It proposes a DQN that accesses channels using online learning. The DQN approach handles a large system and finds an optimal policy from historical observations without prior knowledge of the system dynamics. Deep-Reinforcement Learning Multiple Access (DLMA) [18] is a heterogeneous MAC protocol using DRL. Here, the DRL agent learns an optimal medium access control (MAC) strategy for efficient coexistence with time division multiple access (TDMA) and ALOHA nodes. The agent learns through a series of state-action-reward observations.
This work also analyzes the characteristics of DRL compared to other neural network techniques in order to apply it to wireless networks. The sensor scheduling problem for wireless channel allocation using DRL was studied in [7], where the scheduling problem is formulated as a Markov decision process (MDP) and solved using a DQN. The authors report good performance compared to other suboptimal sensor scheduling policies. Overall, these studies focus primarily on non-deterministic solutions for spectrum access using Q-learning, which are more complex to implement and make it harder to find optimal values in real-world conditions.
The remainder of this paper is organized as follows. Section III presents the network model for the proposed technique and formulates the problem. Section IV describes the D3RL mechanism with its architecture, algorithm, and working. Section V presents the communication and neural network parameters for simulation and analyzes the simulation results. Section VI concludes the paper and outlines future work.
III Network Model and Problem Statement
We consider an ad hoc network with users and channels, where . The training is performed in a distributed manner at each user. For simplicity of exposition, we assume that all users know both the number of users and the number of channels (these can be broadcast to all users if needed). Let be a parameter designating a small set of best channels for transmission, . All users in the network have similar capabilities. All nodes in the network are synchronized, as is typical in slotted random access networks.
We assume a slotted random access mechanism for sharing the spectrum; specifically, we consider slotted multichannel ALOHA transmission, where each user is allowed to transmit on a specific channel in a specific slot according to a certain transmission probability [3]. Here, in each slot a user transmits with a given probability and stays silent with the complementary probability. We do not assume that users know a priori the channel qualities or the loads on the channels. Our goal is to devise a multi-agent learning algorithm that leads the network to an allocation that is good for all users, in which each user transmits over one of its best channels and the load on all channels is approximately identical.
To analyze the network, we assume that at all times each user has pure action set and transmits over channel using a slotted ALOHA protocol. For loaded ALOHA we will mostly depend on mixed strategies of the players. A mixed strategy for a player is a probability distribution over the possible channels (pure strategies) , where is the probability of not transmitting at time , and is the probability of user transmitting over channel at time for . Let be a binary observation indicating whether a packet is successfully delivered or not; i.e., if an acknowledgment (ACK) is received and otherwise. Let be the action of user at time and be the strategy of user at time .
Let be the reward that obtains at time , be the utility of user and is the load on channel .
Definition 1: The history of user at time is the set of all actions, observations, and channel loads up to time , defined as
(1) 
History is used for training the users to learn their best strategy.
Definition 2: The utility function (an instantaneous function) that defines the throughput of user on channel is
(2) 
Here, is the user's transmission power, is the channel gain, is the bandwidth of each channel, and is a vector of actions (where is the action of user , i.e., the frequency it selected).
Definition 3: The multichannel random access game is defined by:
(3) 
where is the set of actions of each player. denotes no transmission, while denotes the identity of the selected channel for transmission.
Definition 4: A mixed strategy is a probability distribution with respect to pure strategy , which is
(4) 
Here, is a strategy vector and is a payoff for the considered strategy vector.
Definition 5: The total accumulated reward with discount factor for player , is given by:
(5) 
and is the time horizon.
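As a concrete illustration of Definition 5, a finite-horizon discounted sum of rewards can be computed as follows. The sketch assumes the common convention in which the first reward is undiscounted and each later reward is multiplied by one more factor of the discount; the function name and argument names are hypothetical, and the exact indexing of (5) may differ.

```python
def discounted_return(rewards, gamma):
    """Return sum over t = 1..T of gamma**(t-1) * rewards[t-1].

    `rewards` is the sequence of per-slot rewards up to the time horizon T,
    and `gamma` is the discount factor (assumed to lie in [0, 1]).
    """
    total = 0.0
    # Accumulate from the last reward backwards (Horner-style evaluation).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For instance, three unit rewards with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.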
Definition 6: The strategy profile is called a NE in the multichannel random access game if
(6) 
for all and all . Here, is a strategy profile for all users except user .
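For reference, the standard form of the equilibrium condition is sketched below. The symbols are assumed notation introduced only for this illustration ($\sigma_n$ for the strategy of user $n$, $\sigma_{-n}$ for the strategies of all other users, and $R_n$ for the payoff of user $n$) and may differ from the notation of (6).

```latex
R_n\left(\sigma_n^{*}, \sigma_{-n}^{*}\right) \;\ge\; R_n\left(\sigma_n, \sigma_{-n}^{*}\right)
\quad \text{for every user } n \text{ and every strategy } \sigma_n .
```

In words, no user can increase its own payoff by deviating unilaterally while all other users keep their equilibrium strategies.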
The objective is to find a strategy for a user , which maximizes the expected accumulated discounted reward, .
An NE is a stable point of the dynamics; hence, reaching such an equilibrium is desirable because it prevents network fluctuations. In general, however, such equilibria can be highly suboptimal [8]. Recently, Bistritz and Leshem [2] proved that by changing the utility or restricting the strategies available to each player, a competitive game can be formed in which every NE point is near-optimal with respect to the sum rate of all users.
IV Deterministic Distributed Deep Reinforcement Learning
In this section, we present a new collaborative spectrum access technique called Deterministic, Distributed Deep Reinforcement Learning (D3RL). The basic idea is to deterministically limit the set of strategies of each player and to force users to use the least loaded channels among their best strategies. The technique implements a deep reinforcement network for learning the strategies and a decision mechanism for the transmission channel that is deterministic given the computed Q-values. The first subsection describes the architecture, followed by the operation phase and the load estimation used in the algorithm. Finally, we discuss the training of the algorithm.
IV-A Architecture
This subsection describes the proposed layered architecture for the reinforcement learning used in D3RL to solve the channel allocation problem in the multi-user network. The architecture of D3RL consists of an input layer, long short-term memory (LSTM) layers, value layers, advantage layers, an output layer, a selection layer, and distributed double Q-learning. Here, D3RL chooses the channel in a deterministic way by using the Q-values and the load on each channel, and passes the choice on as learning information to the LSTM layers for the next learning step. LSTM layers are useful for preserving the internal state and the aggregated observations as time elapses, which helps to estimate the true state. Double Q-learning is used to reduce the effect of bad states on the estimation of the Q-value [5][16]. Each user updates its DQN weights after completion of the training phase. Consistent with the requirements of a lightweight multi-user network, the implementation of the proposed algorithm is very simple. Training is distributed and is executed whenever there are significant changes in the network environment.
The proposed layered architecture is shown in Fig. 1. It consists of the following layers:

Input layer: The input is a vector in which each coordinate contains the number of users that selected the given action. The input is updated during each iteration. Subsequent iterations use a history profile (which changes according to the Q-values) to allocate the best channel for multi-user communication.

LSTM layer: The LSTM layer [6] retains the internal state of the network and accumulates observations. This is needed for estimating the correct state of the network, which relies on the history information. Here, the state of the network for each user is the load it experienced on each channel together with the selected channel. In short, the layer learns from experience, aggregates that experience, and passes it on over time.

Value and Advantage layers: These layers help to cope with observability problems in the DQN [16]. Here, we estimate the average value of a state, because every state is good or bad depending on the action taken. The average Q-value of an action is estimated as the value of the state plus Adv, the advantage derived from the action.

Output layer: The output is a vector containing the estimated Q-value for transmission on each channel.

Selection layer: The selection layer selects the best channel profile according to the Q-values and the load on each channel. Algorithm 1 specifies the selection process.
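The way the value and advantage layers combine into per-action Q-values can be sketched as follows. The text above describes the combination as the state value plus the advantage; the mean subtraction used here is the aggregation proposed for dueling networks in [16], and whether D3RL uses exactly this form is an assumption. The function name and the use of plain Python lists in place of layer outputs are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Uses the mean-subtracted dueling aggregation:
    Q(a) = V + (Adv(a) - mean(Adv)).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to every advantage leaves the resulting Q-values unchanged.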
IV-B The D3RL algorithm
Here, each user considers the channels with the highest Q-values and transmits on the one with the least load. The flow of D3RL is shown in Fig. 2. The algorithm collects the Q-value outputs from the DQN, sorts them in descending order, keeps the best channels (those with the highest Q-values), and selects among them the channel with the minimum load (as in Algorithm 1). In this step, each user acts according to the following strategy:
(7) 
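The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: the function name `select_channel`, the parameter `num_best` (the number of best channels considered), the list layout, and the tie-breaking rule are all assumptions made for the example.

```python
import heapq


def select_channel(q_values, loads, num_best):
    """Return the least loaded channel among the `num_best` channels
    with the highest Q-values.

    q_values[k] and loads[k] refer to channel k (hypothetical layout;
    the "no transmission" action is not modeled here).
    """
    channels = range(len(q_values))
    # Keep the `num_best` channels with the largest Q-values.
    best = heapq.nlargest(num_best, channels, key=lambda k: q_values[k])
    # Among them, take the least loaded; ties go to the higher Q-value.
    return min(best, key=lambda k: (loads[k], -q_values[k]))
```

For example, with Q-values `[0.9, 0.8, 0.1, 0.7]`, loads `[5, 2, 0, 3]`, and `num_best=3`, the three best channels are 0, 1, and 3, and the sketch returns channel 1, the least loaded of them. Channel 2 has the lowest load overall but is excluded because its Q-value is poor.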
In the next step, the user transmits on the selected channel using ALOHA with a certain transmission probability, generates a new action profile for the channel, updates the Q-value, and passes the new values on to the next iteration of learning. The success probability, the probability of no transmission, and the probability of collision are given by:
(8)  
(9)  
(10) 
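For orientation, the textbook slotted ALOHA expressions for these three probabilities can be computed as follows. The sketch assumes that `num_users` users share the channel and that each transmits independently with the same probability `p` in every slot; whether (8)-(10) take exactly this form is an assumption, and the function name is hypothetical.

```python
def aloha_slot_probabilities(num_users, p):
    """Return (success, idle, collision) probabilities for one slot.

    Assumes `num_users` users on the channel, each transmitting
    independently with probability `p`.
    """
    if num_users == 0:
        return 0.0, 1.0, 0.0
    # Idle: nobody transmits.
    idle = (1.0 - p) ** num_users
    # Success: exactly one user transmits.
    success = num_users * p * (1.0 - p) ** (num_users - 1)
    # Collision: two or more users transmit.
    collision = 1.0 - idle - success
    return success, idle, collision
```

With two users and `p = 0.5`, a slot is successful with probability 0.5, idle with probability 0.25, and in collision with probability 0.25.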
IV-C Load Estimation
The algorithm selects the channel with minimal load among the best channels. The load estimator uses standard ALOHA-based load estimation: the load on each channel is estimated from the ratio of the number of successful transmissions to the total number of transmissions on the channel. Using (8)-(10), this ratio is given by:
(11) 
According to equation (11), the approximate load on the channel is
(12) 
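One way to turn the success ratio into a load estimate is to invert the textbook slotted ALOHA relation. The sketch below assumes that a transmitting user succeeds exactly when all other users on the channel stay silent, each with probability `1 - p`, so that its own success ratio is approximately `(1 - p) ** (n - 1)` for a load of `n` users. This relation, the function name, and the handling of the edge cases are assumptions and need not coincide with (11) and (12).

```python
import math


def estimate_load(num_successes, num_transmissions, p):
    """Estimate the number of users on a channel from one user's own
    success ratio, assuming ratio = (1 - p) ** (n - 1).
    """
    if num_transmissions <= 0:
        raise ValueError("need at least one transmission")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if num_successes == 0:
        # No success observed yet: treat the load as unbounded.
        return math.inf
    ratio = num_successes / num_transmissions
    return 1.0 + math.log(ratio) / math.log(1.0 - p)
```

For example, 25 successes out of 100 transmissions with `p = 0.5` give a ratio of 0.25 and an estimated load of 3 users, while a perfect success ratio gives an estimated load of 1 (the user is alone on the channel).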
At each time step, users observe the load on each of their best channels (selected according to the Q-values) and choose among these channels the one with the minimum load, according to equation (7). Here, is the total number of transmissions and is the total number of successful transmissions.
IV-D Training Mechanism
The DQN is trained using the D3RL training algorithm (Algorithm 2). All users are trained in a distributed manner. Training is only required when the characteristics of the network change, for example when new nodes or links are added. Algorithm 2 runs for iterations; during each iteration it calculates a Q-value for every available action on the channel and learns from it, and after iterations it outputs the best channel allocation for each user. Here, is the number of episodes.
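The double Q-learning update used during training can be illustrated by the target computation for a single transition, following the general scheme of [5]. The sketch is not Algorithm 2: the function name, the `done` flag, and the use of plain Python lists in place of network outputs are assumptions for illustration.

```python
def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target for one transition.

    The online network selects the next action and the target network
    evaluates it, which decouples action selection from evaluation.
    """
    if done:
        # Terminal transition: no bootstrapping from the next state.
        return float(reward)
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

For example, with reward 1, discount 0.5, online estimates `[1, 3, 2]`, and target estimates `[10, 20, 30]`, the online network selects action 1 and the target is 1 + 0.5 * 20 = 11, rather than the 16 that a plain maximum over the target estimates would give.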
IV-E Complexity Analysis of D3RL
When using D3RL, each user needs to find its best channels (with respect to the Q-values). The total time required to find the best channels has a complexity of operations per step.
The complexity of D3RL in terms of the number of multiplications performed during the operation of the DQN is easy to compute. Assume a DQN with layers, where the size of the input layer is and each of the other layers has size . Then the real-time computational complexity (for forward and backward propagation) for every user at each time step is:
(13) 
The training phase computational complexity for each user over iterations is in order of
V Simulation Results
In this section, we present several simulated experiments comparing the proposed D3RL algorithm to DQSA [11]. We have implemented D3RL with channels. The common communication parameters and the neural network parameters used for the simulation are given in Tables I and II, respectively.
Table I: Communication parameters
Parameter | Value
Number of users | 100
Number of channels | 50 or 25
Type of channel | Rayleigh fading channel
SNR | 35 dB
Bandwidth | 20 MHz
Maximal Doppler shift | 100 Hz

Table II: Neural network parameters
Parameter | Value
Number of LSTM layers | 100
Number of advantage layers | 10
Number of value layers | 10
Minibatch size | 16 episodes
Time steps | 10 to 100
Discount factor | 0.95
Alpha factor | 0.05
Temperature | 1 to 20
Number of training iterations | 10000
Fig. 3 depicts the performance of D3RL over Rayleigh fading channels. The algorithm (with ) achieves rates which are higher than DQSA and when using . The improvement can be attributed to the deterministic learning used in the D3RL which ensures an asymptotically balanced channel allocation for each user. For comparison, we also provide an upper bound on the average rewards similar to the upper bound in [12].
Fig. 4 compares the performance of the D3RL algorithm under higher loads (100 users and 25 channels) as a function of the time step. In this loaded situation, D3RL outperformed DQSA by (with ) and (with ). Fig. 4 also shows the upper bound on the average rewards on the channel. As can be seen, the D3RL performance is indeed very close to the optimal allocation, as expected from the analysis in [2] and the results of [1].
Fig. 3 and 4 also depict the performance of D3RL with . As discussed above increasing improves the performance with a minor increase in computational complexity. However, the gains are insignificant for . This is very reasonable, since by the random graph argument of [2] and substituting leads with high probability to a completely balanced NE, since in this graph there is a perfectly balanced matching with high probability.
VI Conclusions and Future Work
We considered the problem of collaborative spectrum sharing in a multi-user ad hoc network. We designed a distributed learning algorithm that finds an asymptotically balanced channel allocation for each user. The proposed mechanism allows each user to learn its best strategies through Q-learning without exchanging any extra messages in the network, by exploiting the properties of the multichannel ALOHA protocol. The experimental results show strong performance of D3RL in a complex multi-user network.
In the future, this work can be extended to analyze the system dynamics using game-theoretic techniques and to develop a complete intelligent MAC mechanism that exploits DRL techniques. It can also be extended to a hardware implementation of the algorithm, tested on a hardware testbed.
References
[1] (1999) Balanced allocations. SIAM J. Comput. 29 (1), pp. 180–200.
[2] (2019) Game theoretic dynamic channel allocation for frequency-selective interference channels. IEEE Transactions on Information Theory 65 (1), pp. 330–353.
[3] (2016) Distributed game-theoretic optimization and management of multichannel ALOHA networks. IEEE/ACM Transactions on Networking 24 (3), pp. 1718–1731.
[4] (2014) Cooperative spectrum sharing: a contract-based approach. IEEE Transactions on Mobile Computing 13 (1), pp. 174–187.
[5] (2016) Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 2094–2100.
[6] (2015) Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15).
[7] (2020) Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems. Automatica 113, pp. 108759.
[8] (2009) Game theory and the frequency selective interference channel. IEEE Signal Processing Magazine 26 (5), pp. 28–40.
[9] (2018) Intelligent power control for spectrum sharing in cognitive radios: a deep reinforcement learning approach. IEEE Access 6, pp. 25463–25473.
[10] (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
[11] (2019) Deep multi-user reinforcement learning for distributed dynamic spectrum access. IEEE Transactions on Wireless Communications 18 (1), pp. 310–323.
[12] (2012) Bounds on the expected optimal channel assignment in Rayleigh channels. In 2012 IEEE 13th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 294–298.
[13] (2018) Deep Q-learning with multiband sensing for dynamic spectrum access. In 2018 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), pp. 1–5.
[14] (2017) Optimal channel selection based on online decision and offline learning in multichannel wireless sensor networks. Wireless Communications and Mobile Computing 2017.
[15] (2018) Deep reinforcement learning for dynamic multichannel access in wireless networks. IEEE Transactions on Cognitive Communications and Networking 4 (2), pp. 257–265.
[16] (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003.
[17] (1992) Q-learning. In Machine Learning, pp. 279–292.
[18] (2019) Deep-reinforcement learning multiple access for heterogeneous wireless networks. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1277–1290.
[19] (2019) Distributed learning for channel allocation over a shared spectrum. IEEE Journal on Selected Areas in Communications 37 (10), pp. 2337–2349.