A Deep Reinforcement Learning Framework for Contention-Based Spectrum Sharing

10/05/2021 ∙ by Akash Doshi, et al. ∙ Qualcomm ∙ The University of Texas at Austin

The increasing number of wireless devices operating in unlicensed spectrum motivates the development of intelligent adaptive approaches to spectrum access. We consider decentralized contention-based medium access for base stations (BSs) operating on unlicensed shared spectrum, where each BS autonomously decides whether or not to transmit on a given resource. The contention decision attempts to maximize not its own downlink throughput, but rather a network-wide objective. We formulate this problem as a decentralized partially observable Markov decision process with a novel reward structure that provides long term proportional fairness in terms of throughput. We then introduce a two-stage Markov decision process in each time slot that uses information from spectrum sensing and reception quality to make a medium access decision. Finally, we incorporate these features into a distributed reinforcement learning framework for contention-based spectrum access. Our formulation provides decentralized inference, online adaptability and also caters to partial observability of the environment through recurrent Q-learning. Empirically, we find its maximization of the proportional fairness metric to be competitive with a genie-aided adaptive energy detection threshold, while being robust to channel fading and small contention windows.




I Introduction

Spectrum sharing attempts to allow different transmitters to operate on the same allocated resource (spectrum/time) in a fair manner, while also providing high throughput. Because each transmitter cannot be cognizant of the complete system state, an optimal decentralized policy for all transmitters is infeasible [zhao2007decentralized]. Centralized approaches require excessive real-time overhead messaging between the transmitters, and obtaining the solution is known to be NP-hard, since all the transmissions and throughputs are coupled, and has been shown in [zheng2005collaboration] to be equivalent to graph coloring. We assume that transmitters will operate in a decentralized fashion, such that there is no direct exchange of messages between them before making a decision in a given time slot. Instead, each transmitter will perform spectrum sensing and also utilize local side information before it accesses the channel.

I-A Motivation

To alleviate spectrum constraints, the usage of Long Term Evolution (LTE) in the unlicensed 5 GHz band (LTE-U) was introduced in 2014 by Qualcomm [qualcommlteu2014]. Subsequently, 3GPP [3gpp.36.889] standardized License Assisted Access (LAA) for LTE. With LTE-LAA, a licensed carrier used for control signalling acts as an anchor that stays connected while unlicensed carriers, used to transport data, are added to or dropped from the combination of carriers in use between a device and the network. The approach adopted to access the unlicensed spectrum in LAA is known as Listen-Before-Talk (LBT) [3gpp.36.889][etsi] and requires each transmitter to perform a Clear Channel Assessment (CCA) before accessing spectrum. In other words, a base station (BS) is allowed to transmit on a channel only if the energy level in the channel is less than the CCA threshold level for the duration of the CCA observation time [etsi]. The duration of the CCA observation time can be fixed or randomized depending on the mode of LBT being employed [3gpp.38.889]. 5G New Radio Unlicensed (NR-U) is a successor to LTE-LAA that is designed to operate in the 5 and 6 GHz bands and encompasses both license-assisted and standalone access, i.e. without any anchor in licensed spectrum. The 3GPP study on NR-U [3gpp.38.889] mentions that LTE-LAA LBT would be the starting point of the design for the 6 GHz band. Hence, in this paper, we utilize spectrum sensing via LBT as a baseline to benchmark the performance of the algorithm we develop to improve UE throughput under NR-U in the 6 GHz band.

Using a fixed CCA threshold level, also referred to as an energy detect (ED) threshold in NR-U, as a channel access policy can often negatively impact the throughput. For instance, consider the toy scenario depicted in Fig. 1 involving only BS to User Equipment (UE) downlink (DL) transmissions as indicated. In the scenario on the left, there is minimal interference across links, hence simultaneous DL transmission is possible, while in the other, there is strong interference across links, so the DL transmissions should be time multiplexed. A common ED threshold at the BSs cannot account for both scenarios, since "energy detect" at the BSs does not actually reflect the quality of reception (SINR) at the UE. Moreover, an ED threshold-based medium access mechanism is incapable of adapting to varying path gains brought about by fading and mobile UE positions.

Hence what we need is a medium access algorithm that is able to jointly process (i) information about reception quality at the UE from the previous time slot(s), which can be acquired using standard channel quality indicator (CQI) feedback as well as monitoring the throughput achieved to that UE, and (ii) energy levels currently sensed at the BS, which reflect the current interference state but not at the UE itself. Thus, we can track past interference statistics at the desired location (the receiver), and observe the current interference at the transmitter. Based on these observations, a BS would have to decide whether or not to transmit in the current time slot. This must be done without direct awareness of the actions of the other BSs, or knowledge of the long term throughput patterns at UEs that the BS does not serve.

Multi-agent reinforcement learning (RL) techniques [bucsoniu2010multi] provide a principled approach to incorporating environment dynamics and designing agents that are trained to take decisions in the face of uncertainty. This motivates us to design a novel distributed reinforcement learning-based approach for medium access at the BS that uses information about the past quality of signal reception at the UE it serves in combination with channel sensing to adapt its medium access policy.


Fig. 1: Two simple layouts of two BSs and two UEs, depicting the need for an adaptive ED threshold at the BS.

I-B Related Work and Approaches

Contention-based medium access is a fundamental and classic problem in wireless systems. The notion of LBT, in which a transmitter decides whether the channel is clear based on an energy threshold, was introduced in the earliest IEEE 802.11 WiFi protocols and is still used [patriciello2020nr], with very little subsequent improvement. In systems where data is transmitted in time slots of fixed length, referred to as Frame Based Equipment (FBE) by [etsi], each time slot is divided into a contention period and a data transmission phase (see Fig. 2). During the contention period (also known as I-CCA in [3gpp.38.889]), the BS senses the channel and, if it remains idle (based on the ED threshold), the BS transmits data for a fixed time period (a maximum of ms) and then contends for the channel again. Later variants of LBT, such as Cat3 and Cat4 [3gpp.38.889], introduced an extended CCA period (E-CCA) in which a BS senses the channel for a further random duration if the channel remained idle during I-CCA. To determine this random duration, a BS will draw a random counter between and (where CWS denotes the contention window size e.g. CWS = 6 in Fig. 2). It will then decrement by 1 every , and transmit after hits only if the channel remained idle for the entire duration. In this paper, we adopt a contention-based channel access scheme that is a simplified version of Cat3 LBT in NR-U [3gpp.38.889].
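The I-CCA and E-CCA steps described above can be illustrated with a minimal sketch. It is not the 3GPP procedure itself: the function name `lbt_attempt`, the counter range `[0, cws - 1]`, the one-sample-per-sensing-slot model and the numeric threshold are all assumptions made for illustration.

```python
import random


def lbt_attempt(energy_dbm, ed_threshold_dbm, cws, rng):
    """Return how many sensing slots elapse before transmission, or None if access is denied.

    energy_dbm[0] is the energy sensed during I-CCA; later entries are E-CCA sensing slots.
    """
    # I-CCA: the channel must be idle in the initial sensing slot.
    if energy_dbm[0] >= ed_threshold_dbm:
        return None
    # E-CCA: draw a random counter (the range [0, cws - 1] is an assumption here).
    counter = rng.randrange(cws)
    slot = 1
    while counter > 0:
        # Give up if the sensing slots run out or the channel turns busy.
        if slot >= len(energy_dbm) or energy_dbm[slot] >= ed_threshold_dbm:
            return None
        counter -= 1
        slot += 1
    return slot


rng = random.Random(0)
idle_channel = [-90.0] * 8  # hypothetical sensed energies in dBm
print(lbt_attempt(idle_channel, -72.0, 6, rng))
```

A BS that finds the channel busy at any point simply does not transmit in that slot, which matches the "idle for the entire duration" condition in the text rather than the counter-freezing behaviour of other LBT categories.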

IEEE 802.11 also supports an optional collision reduction scheme known as RTS/CTS, in which one node (e.g. a BS) sends a Request-to-Send (RTS) frame prior to sending data if it senses a clear channel. In response, the receiving node (e.g. a UE) sends a Clear-to-Send (CTS) frame if the received Signal-to-Interference-plus-Noise Ratio (SINR) of the RTS frame was above a certain threshold [jamil2015efficient]. However, RTS/CTS can inhibit potentially successful transmissions, and introduces significant additional overhead and latency [sobrinho2005rts]. Several empirical Media Access Control (MAC) layer optimizations have been proposed subsequently, including dynamic blocking notification schemes [zhang2006ebn] [chong2014dynamic], but they all suffer from similar drawbacks.

A more recent methodology is to apply multi-agent reinforcement learning to design state-based policies that can improve the performance of unlicensed spectrum sharing. Initial work such as [li2010multiagent] and [lunden2011reinforcement] utilized function approximation and tabular listing of Q-values for spectrum access in cognitive radio systems, which was not scalable to large state spaces. More recent work such as [tonnemacher2018opportunistic] and [tonnemacher2019machine] began to utilize deep Q-learning [mnih2015human] to choose either an action that adapts the ED threshold to the BS queue length or the optimal subcarrier. Most recently, [liang2019spectrum] and [li2019multi] used deep Q-learning to improve sum rate in vehicle-to-vehicle (V2V) and device-to-device (D2D) communication, while [chang2018distributive] additionally employed reservoir computing, a special type of recurrent neural network, to combat partial observability. However, these papers are all focused on supporting underlay communications and not on optimizing medium access of the primary spectrum user. A key feature of the 6 GHz band is that no unlicensed devices currently operate there and hence NR-U channel access protocols designed for it would not be constrained by primary spectrum users.


In [naparstek2018deep], game theoretic principles are used in designing an RL reward function that maximizes the number of successful transmissions in a distributed setting. However, the authors require receipt of an ACK as part of their algorithm, consider all links at the same SNR and do not utilize information from spectrum sensing. Most recently, [naderializadeh2021resource] presented a robust and scalable distributed RL design for radio resource management to mitigate interference. None of these papers has attempted to model the asynchronous nature of the decisions made by the transmitters owing to contention. Moreover, [naderializadeh2021resource] assumes that APs exchange messages in every time slot on backhaul links to provide remote observations required for their policy, which may not be possible in a practical deployment, particularly across different operators or systems.

I-C Contributions

We design a novel distributed reinforcement learning approach to optimizing medium access that is aimed at improving the current LBT-based approach in NR-U [3gpp.38.889], by constraining our problem to a contention-based access mechanism. We employ the paradigm of centralized learning with decentralized execution, such that each BS will decide whether and how to transmit based only on its own observations of the system. Our technical contributions are now summarized.

Formulating Medium Access as a DEC-POMDP. In a practical deployment, a BS will only have access to delayed copies of the parameters of the UEs it serves in each time slot and will not be able to directly observe the action of all neighbouring BSs. Moreover, there is no central controller that can determine the action of each BS. We formulate medium access decisions as a decentralized partially observable Markov decision process (DEC-POMDP) [bernstein2002complexity] by incorporating these key features of a practical deployment.

Adapting a Medium Access DEC-POMDP to Contention. In each time slot, a BS can either be transmitting to a UE or waiting in the contention queue. This motivates a 2-state transition diagram, such that a BS can either be in the data transmission state (until the end of the time slot) or the contention state. We define each of the states in accordance with the information available at a BS in each time slot. We formulate a reward structure associated with each state transition, such that maximization of the sum of the rewards accumulated over time provides long term proportional fairness (PF) of the throughput delivered to all UEs.

Solving a Medium Access DEC-POMDP using an Independent DQN. We adapt a distributed reinforcement learning strategy that combines Deep Q-networks (DQN) with independent Q-learning [tampuu2017multiagent] and adds recurrency to model partial observability [hausknecht2015deep]. Inspired by the inter-agent communication framework of [foerster2016learning] that provides for end-to-end learning across agents in a decentralized setting, we exploit the inter-BS energies detected along with the local action-observation history to successfully train two DQNs at each BS using a centralized training procedure that provides for decentralized inference. We show that this approach, after a modest number of iterations, achieves maximization of the PF metric competitive with a genie-aided adaptive ED threshold that unrealistically presumes knowledge of the UE locations to choose the optimal energy threshold.

The paper is organized as follows. In Section II, the system model is given, and a mathematical formulation of the problem statement is provided in terms of proportional fairness, along with a solution that requires a central controller. A realistic decentralized inference framework for medium access is developed in Section III, which is adapted to a distributed Reinforcement Learning (RL) framework in Section IV. The simulations and detailed results are presented in Sections V and VI respectively, followed by the conclusions and possible future directions in Section VII.

II Problem Statement and System Model

II-A An Overview

We consider a downlink cellular deployment of BSs, each BS having at least one active UE desiring data transmission. Assume, as in LAA, that the DL transmissions for all BSs occur on the same shared sub-band of unlicensed spectrum, while the uplink (UL) transmission of CQI and other control information takes place on separate licensed channels for each BS-UE link. We separate the scheduler block that determines the served UE per BS from the medium access block, and focus only on medium access control (MAC). The BS transmits at the maximum MCS (Modulation and Coding Scheme) allowable at that SINR, so that the UE throughput can be approximated by the Shannon capacity. Assuming that the same UE is scheduled for reception for consecutive time slots and each BS transmits at a constant power, the MAC algorithm at each BS has to decide whether or not to transmit to the UE in each time slot. Moreover, we assume that the BSs are backlogged with sufficient packets such that the BS always has traffic to be delivered to its scheduled UE, and hence always participates in contention. This is the worst-case scenario: if a BS does not choose to participate in contention, it improves throughput for other cells' UEs.


Fig. 2: Contention-based access for FBE to unlicensed spectrum in BS-UE DL transmissions

We consider a simplified contention-based access mechanism for FBE as shown in Fig. 2. Each BS draws a random counter from the contention window at the start of a time slot. For simplicity, we assume that each BS draws a unique counter such that there is no collision among counters drawn at different BSs in the contention process. (Footnote 1: This assumption is for the simplicity of the state-action transition; the algorithm itself does not require such uniqueness. Refer to Section VI-A.) Hence, the contention window (CW) is of length at least equal to the number of BSs in the layout. When this counter expires, the BS ascertains if the channel is clear before transmitting as shown in Fig. 2. If the channel is clear, the BS transmits a unique preamble (which can be used by other BSs to identify the transmitting BS) for the remainder of the contention phase, followed by data symbols from the beginning of the data transmission period. The objective of each BS is to maximize the long-term throughput seen by the UE. We now formulate this mathematically.
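As a rough illustration of this simplified mechanism, the sketch below draws unique counters and lets the BSs decide in counter order. The helper names and the `channel_clear` callback (a stand-in for whatever rule a BS uses to judge the channel) are hypothetical and not part of the paper.

```python
import random


def draw_unique_counters(num_bs, cw_size, rng):
    """Draw one distinct back-off counter per BS from a window of size cw_size."""
    if cw_size < num_bs:
        raise ValueError("contention window must be at least the number of BSs")
    return rng.sample(range(cw_size), num_bs)


def contention_order(counters):
    """BS indices sorted by the time their counter expires."""
    return sorted(range(len(counters)), key=lambda i: counters[i])


def run_slot(counters, channel_clear):
    """Let each BS decide in turn; channel_clear(bs, transmitting) is a hypothetical rule."""
    transmitting = []
    for bs in contention_order(counters):
        if channel_clear(bs, transmitting):
            transmitting.append(bs)  # this BS starts sending its preamble
    return transmitting


counters = draw_unique_counters(num_bs=4, cw_size=6, rng=random.Random(1))
# Example rule: transmit only if nobody else is already on the air.
print(run_slot(counters, lambda bs, tx: len(tx) == 0))
```

With the example rule, only the BS whose counter expires first transmits; a rule that always returns `True` would let every BS transmit in counter order.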

II-B Mathematical Formulation

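A single UE is scheduled per contention slot per BS; we index the UE served by BS $i$ as UE $i$, and let $N$ denote the number of BSs. For each BS $i$, $a_i^t \in \{0, 1\}$ denotes the action chosen by the BS to transmit or not in time slot $t$, with transmission denoted by $a_i^t = 1$ and the action vector of all BSs given by $\mathbf{a}^t = [a_1^t, \ldots, a_N^t]$. Considering single-input single-output (SISO) communication between BSs and UEs in a single sub-band, we denote the channel between BS $i$ and UE $j$ by $h_{ij}^t$, and the channel between BS $i$ and BS $j$ by $g_{ij}^t$. The respective path gains are $|h_{ij}^t|^2$ and $|g_{ij}^t|^2$. For each UE $j$, $S_j^t$ denotes the desired signal power, $I_j^t$ the total interference power, $R_j^t$ the data rate experienced and $\bar{R}_j^t$ the exponentially smoothed average rate seen by the UE, with the vectors containing these terms for all UEs being denoted by $\mathbf{S}^t$, $\mathbf{I}^t$, $\mathbf{R}^t$ and $\bar{\mathbf{R}}^t$ respectively. Denote the noise variance for DL receptions at the UE by $\sigma_{\mathrm{UE}}^2$, and at the BS by $\sigma_{\mathrm{BS}}^2$.

Consider transmissions on a DL slot from BSs to UEs. Assuming each BS transmits at a fixed power $P$, the received signal at each UE $j$ is given by

$$y_j^t = \sum_{i=1}^{N} a_i^t \sqrt{P}\, h_{ij}^t x_i^t + n_j^t, \qquad (1)$$

where $x_i^t$ is the unit-power symbol transmitted by BS $i$ and $n_j^t \sim \mathcal{CN}(0, \sigma_{\mathrm{UE}}^2)$, with $\mathcal{CN}$ denoting the complex normal distribution and $\sigma_{\mathrm{UE}}^2$ the noise variance at the UE. Then the SINR measured at UE $j$ at the end of data transmission in time slot $t$ is given by

$$\mathrm{SINR}_j^t = \frac{S_j^t}{I_j^t + \sigma_{\mathrm{UE}}^2} = \frac{a_j^t P |h_{jj}^t|^2}{\sum_{i \neq j} a_i^t P |h_{ij}^t|^2 + \sigma_{\mathrm{UE}}^2}. \qquad (2)$$

It should be noted that while we restrict ourselves in this exposition to SISO on a single sub-band, the definition of $S_j^t$ could easily be generalized to simply represent the received signal power at UE $j$, which could then be utilized to incorporate both MIMO and frequency-selective channels. We assume the throughput experienced by UE $j$ in time slot $t$ to be ideal; for a bandwidth $W$ it is approximated by the Shannon capacity formula

$$R_j^t = W \log_2\left(1 + \mathrm{SINR}_j^t\right). \qquad (3)$$

Defining the utility function as

$$U(\bar{\mathbf{R}}^t) = \sum_{j=1}^{N} \log \bar{R}_j^t, \qquad (4)$$

our objective is to have the BSs choose an action vector $\mathbf{a}^t$ in each time slot that provides for long term proportional fairness. This means that we want to maximize the long term average rate $\bar{R}_j^t$ of each UE $j$, where $\bar{R}_j^t$ is the exponentially smoothed average rate seen by UE $j$ up to time $t$ and is given by

$$\bar{R}_j^t = \left(1 - \frac{1}{B}\right) \bar{R}_j^{t-1} + \frac{1}{B} R_j^t, \qquad (5)$$

with $B > 1$ being a parameter which balances the weights of past and current transmission rates. Mathematically, in [kelly1998rate], this is proved to be equivalent to maximizing the utility function defined in (4), under the conditions stated there. In [wang2010scheduling], it is proven that maximizing $U$ can be achieved iteratively through slot-by-slot BS scheduling. At each time slot, the iterative BS scheduler has to find a rate vector $\mathbf{R}^t$ such that

$$\mathbf{R}^t = \arg\max_{\mathbf{R}} \sum_{j=1}^{N} \frac{R_j}{\bar{R}_j^{t-1}}, \qquad (6)$$

where the maximization is over the rate vectors $\mathbf{R}$ achievable by some action vector in $\{0, 1\}^N$. For a given time slot $t$, this is equivalent to finding the action vector $\mathbf{a}^t$ that maximizes the summation in (6). This is a combinatorial search over $2^N$ possible binary action vectors. However, it is important to note that this solution requires a central controller and hence is not realizable in any practical decentralized deployment. We relax this assumption next.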
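A minimal sketch of the centralized search may help make this concrete. It assumes the standard forms of the quantities involved: SINR as desired power over interference plus noise, Shannon rate as bandwidth times log2(1 + SINR), a per-slot objective that sums each UE's rate divided by its current average rate, and exponential smoothing with a window parameter. The function names, the convention that BS `i` serves UE `i`, and the layout `gains[i][j]` (path gain from BS `i` to UE `j`) are illustrative assumptions.

```python
import itertools
import math


def sinr(actions, gains, tx_power, noise_var, ue):
    """SINR at UE `ue`, assuming BS `ue` is its serving BS."""
    signal = actions[ue] * tx_power * gains[ue][ue]
    interference = sum(actions[i] * tx_power * gains[i][ue]
                       for i in range(len(actions)) if i != ue)
    return signal / (interference + noise_var)


def rates(actions, gains, tx_power, noise_var, bandwidth=1.0):
    """Shannon rate of every UE for a given binary action vector."""
    return [bandwidth * math.log2(1.0 + sinr(actions, gains, tx_power, noise_var, j))
            for j in range(len(actions))]


def pf_schedule(gains, avg_rates, tx_power, noise_var):
    """Exhaustive search over all 2^N action vectors for the PF-weighted sum rate."""
    best_actions, best_score = None, -math.inf
    for actions in itertools.product((0, 1), repeat=len(avg_rates)):
        r = rates(actions, gains, tx_power, noise_var)
        score = sum(rj / avg for rj, avg in zip(r, avg_rates))
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions


def update_avg_rates(avg_rates, slot_rates, window):
    """Exponential smoothing of the per-UE average rate."""
    return [(1.0 - 1.0 / window) * avg + (1.0 / window) * r
            for avg, r in zip(avg_rates, slot_rates)]


weak = [[1.0, 0.001], [0.001, 1.0]]   # negligible cross-link interference
strong = [[1.0, 0.9], [0.9, 1.0]]     # strong cross-link interference
print(pf_schedule(weak, [1.0, 1.0], 1.0, 0.01))
print(pf_schedule(strong, [1.0, 2.0], 1.0, 0.01))
```

In the weak-interference layout the search lets both BSs transmit, while in the strong-interference layout it serves only the UE with the lower average rate, mirroring the two scenarios of Fig. 1. The exhaustive loop also shows why the approach does not scale: the number of candidates doubles with every added BS.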

III A Decentralized Medium Access Formulation

As described in Section II, the PF-based BS scheduler requires awareness of all the link strengths and the average rates of all UEs, such that it can instruct all BSs on the decision to make in each time slot. In reality, we have distributed control by BSs, with the decision of each BS determined by its own observation of the environment. As a consequence, each BS is usually not aware of the decisions of all other BSs in the given time slot. Moreover, each BS is cognizant of only the average rate of the UE it serves, and not that of all the UEs. Finally, in a fading environment, each BS will not be aware of the entire interference channel matrix; however, it can utilize the signal and interference power experienced by the UE it serves in the previous data transmission, which are fed back to the BS in uplink, as indicators of the magnitude of the channel coefficients. This motivates us to formally define the MAC for BSs operating on shared spectrum under the framework of a DEC-POMDP [bernstein2002complexity].

III-A Defining a DEC-POMDP

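The medium access DEC-POMDP can be defined as the 8-tuple $(\mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \Omega, \mathcal{O}, \gamma)$, where $\mathcal{N}$ is a set of BSs (agents), $\mathcal{S}$ is the state space, $\mathcal{A} = \times_{i \in \mathcal{N}} \mathcal{A}_i$ ($\times$ denotes a Cartesian product) is the joint action space, and $\Omega = \times_{i \in \mathcal{N}} \Omega_i$ is the joint observation space of all BSs. At each time step $t$, each BS $i$ executes an action $a_i^t \in \mathcal{A}_i$, causing the environment state $s^t \in \mathcal{S}$ to transition to $s^{t+1}$ with probability $\mathcal{P}(s^{t+1} \mid s^t, \mathbf{a}^t)$. Each BS then receives observation $o_i^{t+1} \in \Omega_i$ based on the joint observation function $\mathcal{O}(\mathbf{o}^{t+1} \mid s^{t+1}, \mathbf{a}^t)$, where $\mathbf{o}^{t+1} = [o_1^{t+1}, \ldots, o_N^{t+1}]$. Define the local action-observation history at timestep $t$ as $\tau_i^t = (o_i^1, a_i^1, \ldots, a_i^{t-1}, o_i^t)$, where $\tau_i^t \in \mathcal{T}_i$. Single-agent policies $\pi_i(a_i \mid \tau_i)$ select an action for each BS $i$, with the joint policy being denoted by $\boldsymbol{\pi}$. All BSs receive a single common reward $r^t = \mathcal{R}(s^t, \mathbf{a}^t)$ at the end of each time slot. The exact formulations for $\mathcal{P}$ and $\mathcal{R}$ will be described in Section III-B; however, it should be noted that we assume centralized computation of $r^t$ offline, since online computation of a centralized reward would require communication among BSs on backhaul links. Finally, the objective is to learn the optimal joint policy $\boldsymbol{\pi}^*$ that maximizes the expected cumulative reward $\mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t} r^t\right]$ over a finite time horizon of $T$ consecutive time slots for some discount parameter $\gamma$. We will now define the state space $\mathcal{S}$, transition function $\mathcal{P}$, reward function $\mathcal{R}$ and joint observation space $\Omega$ used for modelling medium access. The notation employed has been summarized in Table I.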

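TABLE I: Notation

Symbol            Description
$a_i$             Action chosen by BS $i$
$\mathcal{A}_i$   Action space of BS $i$
$o_i$             Local observation received by BS $i$
$\Omega_i$        Observation space of BS $i$
$\tau_i$          Local action-observation history of BS $i$
$\mathcal{T}_i$   Action-observation history space of BS $i$
$\pi_i$           Policy used to choose action at BS $i$
$r$               Single common reward distributed to all BSs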

III-B Formulating Medium Access as a DEC-POMDP

Denote by the downlink interference channel matrix in time slot . (Footnote 2: In reality, the BS cluster size would be determined by the number of BS transmissions resulting in non-negligible interference at a given site. However, in this work, we consider a fixed throughout.) To solve (6), in each time slot, we would require knowledge of the action of all BSs , the average rates at the beginning of the time slot and the interference matrix . Hence, the state of the system . The transition function is given by combining (2), (3) and (5). Hence, with , we have


where and is the noise term. We drop the bandwidth scaling factor since it would just add a constant term to . Note how also depends on the action of all BSs in (7). The probabilistic nature of the transition from to is captured by the noise term in (9). The function in (9) is used to convey a first order Markov chain representation of a channel model, known to be sufficient for modelling very slow fading channels [tan1998first]. Given time slots, while long term PF-based BS scheduling would seek to maximize , solving the medium access DEC-POMDP entails maximization of . Hence we would like to have


In Appendix -A, we prove that by defining the per-timestep reward as and where


and setting the discount factor , we satisfy the approximation postulated in (10). In case all BSs choose to remain off, we apply a large negative reward (empirically set to for some constant ) for that time-step during training. This does not impact (10) as no optimal joint policy would have all the BSs remain off in any time slot. Finally, we define the observation received in each timestep by BS as


where denotes that the observation seen at the beginning of the time slot by the BS is the average rate, signal and interference power of the UE it serves from the previous data frame (enforcing causality). Both and can be obtained in practice from UE feedback comprising CSI, RSRP and RSRQ (Reference Signal Received Power and Quality) measurements (refer to [3gpp.38.214], [dahlman20205g] for details). Note that the signal and interference terms and act as partial observations of and . This can be seen from (7) where knowledge of and provides the BS a better estimate of , which is highly correlated with in a slow fading channel model. Moreover, , hence also provides the action of BS in the previous time step as an input while choosing its action for time step .

We will now describe a distributed RL framework that can be adapted to this DEC-POMDP formulation, while catering to a contention-based medium access mechanism and providing for partial observability of the system state.

IV Adapting Independent DQN to a Medium Access DEC-POMDP

The DEC-POMDP formulation presented in Section III-B essentially entails each BS, in each time slot, receiving its local observation, taking an action, and receiving a common reward once all BSs in the layout have taken an action. This is summarized in the 1-state MDP depicted on the left in Fig. 3, with the term inside the square denoting the observation of the BS. Note that while the word MDP may be a misnomer here, it is only used to capture the nature of the transition diagram and does not confer an MDP's mathematical properties. (Footnote 3: We are dealing with a DEC-POMDP, hence a policy at a BS is a function that takes as input the action-observation history and outputs an action. This will be provided for in Section IV-C using recurrency.) However, this formulation fails to exploit a key aspect of medium access: the asynchronous nature of contention-based access in LBT-based spectrum sharing mechanisms. Essentially, when a BS performs CCA, we assume that the subset of BSs that have already chosen to transmit will be transmitting a unique preamble (akin to an RTS frame). (Footnote 4: Receipt of the confirmatory CTS message at the BS is not required to make a decision, thus eliminating the associated overhead and latency.) This frame would contain the MAC address of the transmitter, enabling a BS listening in on this transmission to map the received signal to the corresponding transmitter. Such a strategy could be easily supported in LAA/NR-U, such that the energy measured at a BS could not only be apportioned into the energy from each BS that is already transmitting but also mapped to them.

IV-A EOS and CON States

This motivates us to consider the partitioning of the 1-state MDP into 2 states, End-Of-Slot (EOS) and Contention (CON), as shown on the right in Fig. 3. In each time slot , a back-off counter is randomly generated for each BS from a fixed contention window (CW). When that counter expires, the BS measures the energy from each already transmitting BS, denoted by the -dimensional vector , such that is the energy received at BS due to an ongoing transmission between BS and UE , and is given by


where , if and is the received signal corresponding to . In other words, will have a component other than noise only if BS is placed before BS in the contention queue and has chosen to transmit. Note that if , then the value of is inconsequential in (13), ensuring the construction of is causal. It also shows that BS has only partial observability of the action of the other BSs. For instance, if , then BS cannot deduce which entries of correspond to BSs whose counter versus BSs whose counter and have chosen . However, this partial observability can be remedied to some extent at BS by knowledge of its own counter . For instance, if is large and , then BS would be privy to the fact that most BSs , , have chosen not to transmit. Consequently, and are defined to be part of the observation in the CON state.
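The construction of the energy vector can be sketched as follows. This is only an illustration of the causality rule stated above (a BS hears another BS only if that BS precedes it in the contention queue and has chosen to transmit). The exact expression of (13) is not reproduced here: modelling a heard entry as transmit power times path gain plus noise variance, and every other entry as the noise variance alone, is an assumption, as are the function name and the `bs_gains[j][i]` layout.

```python
def energy_vector(bs, counters, actions, bs_gains, tx_power, noise_var):
    """Energy that BS `bs` attributes to each BS at the moment its own counter expires.

    bs_gains[j][bs] is the (hypothetical) path gain from BS j to BS `bs`.
    """
    energies = []
    for j in range(len(counters)):
        # BS j is heard only if it went earlier in the queue and decided to transmit.
        heard = j != bs and counters[j] < counters[bs] and actions[j] == 1
        if heard:
            energies.append(tx_power * bs_gains[j][bs] + noise_var)
        else:
            energies.append(noise_var)  # noise only: later BSs and silent BSs look the same
    return energies


counters = [2, 0, 1]
actions = [1, 1, 0]
gains = [[0.5] * 3 for _ in range(3)]
print(energy_vector(0, counters, actions, gains, tx_power=2.0, noise_var=0.1))
```

The noise-only entries show the partial observability discussed in the text: from the vector alone, a BS cannot tell a silent earlier BS from one whose counter has not yet expired, which is why its own counter is also part of the CON observation.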


Fig. 3: The 2 state MDP at each agent capturing the actions taken and reward obtained on transitioning between the End-Of-Slot (EOS) and Contention (CON) states

Moreover, we note that . It may not be immediately apparent why one would need to retain as part of the state definition; experimental evidence and a possible explanation is presented in Section VI-B.

BS then selects an action based on the policy (refer footnote 3). If a BS decides to transmit, data transmission takes place until the end of slot. At the end of slot , using the action vector , average rates and path gain matrix , we compute using (11). The average rates are updated to using (7), and along with the signal power and interference power measured at each UE , are used to calculate the observation seen by each BS in the next time slot, given by . Hence we denote the EOS observation by , and the CON observation by . Both and default to 0, since these correspond to a transition within the time slot.

IV-B Deep Q-Networks and Independent DQN

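In a single agent, fully-observable RL setting, an agent observes the current state $s$, chooses an action $a$ according to a policy $\pi$, receives a reward $r$ and transitions to a new state $s'$. The objective is to learn the optimal policy $\pi^*$ that maximizes the expected discounted sum of rewards $\mathbb{E}\left[\sum_{t} \gamma^{t} r^t\right]$. Denote by $Q^{\pi}(s, a)$ the expected discounted reward earned by the agent starting from state $s$, taking action $a$, and thereafter following $\pi$. Hence the Q-value corresponding to the optimal policy,

$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad (14)$$

is given by the recursive Bellman optimality equation [sutton2018reinforcement], with $s'$ and $a'$ denoting the next state and action:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]. \qquad (15)$$

In deep Q-learning, $Q^*$ is represented by a neural network $Q(s, a; \theta)$ whose weights $\theta$ are optimized by minimizing

$$L(\theta) = \left( y - Q(s, a; \theta) \right)^2$$

at each iteration, with $y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$. Here $\theta^{-}$ denotes the weights of a target network that is frozen for a few iterations while updating the online network $Q(s, a; \theta)$.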
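To make the update concrete, the sketch below performs one gradient step on the squared temporal-difference error, using a linear function (one weight vector per action) as a stand-in for the neural network. The linear model, the function names and the learning rate are assumptions chosen to keep the example dependency-free; a real DQN would use a deep network and a replay buffer.

```python
def q_values(weights, features):
    """Linear stand-in for the Q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def td_target(reward, next_q_values, gamma):
    """Target y = r + gamma * max_a' Q_target(s', a')."""
    return reward + gamma * max(next_q_values)


def sgd_step(online, target_net, features, action, reward, next_features, gamma, lr):
    """One gradient step on (y - Q(s, a))^2; the target network is held fixed."""
    y = td_target(reward, q_values(target_net, next_features), gamma)
    q = q_values(online, features)[action]
    # d/dw (y - q)^2 = -2 (y - q) x, so gradient descent adds 2 * lr * (y - q) * x.
    online[action] = [w + lr * 2.0 * (y - q) * x
                      for w, x in zip(online[action], features)]
    return (y - q) ** 2


online = [[0.0, 0.0], [0.0, 0.0]]
frozen = [[1.0, 0.0], [0.0, 2.0]]  # frozen target-network weights
for _ in range(2):
    loss = sgd_step(online, frozen, [1.0, 0.0], 0, 1.0, [1.0, 1.0], gamma=0.5, lr=0.25)
    print(loss)
```

Only the weights of the action actually taken are updated, and the target is computed from the frozen copy, which is what keeps the regression target from moving during the step.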

DQN has been extended to cooperative multi-agent settings, in which each agent observes the complete state, selects an individual action, and receives a team reward shared among all agents. The authors of [tampuu2017multiagent] address this setting with a framework that combines DQN with independent Q-learning, such that each agent simultaneously learns its own Q-function. While independent Q-learning can in principle lead to convergence problems due to each agent seeing a non-stationary environment, it has been surprisingly successful empirically.

IV-C Deep Recurrent Q-Learning

In Section III-A, we defined a BS's policy as a mapping from the local action-observation history to the action space. One way to incorporate the action-observation history into deep Q-learning is to stack the observations obtained over a finite number of time steps and feed all these observations into the DQN. However, the authors of [hausknecht2015deep] hypothesize that a Deep Recurrent Q-Network, which is a combination of DQN with a Long Short Term Memory (LSTM) [hochreiter1997long] layer, can better approximate actual Q-values from sequences of observations, leading to better policies in partially observed environments.

One example of a DQN architecture incorporating an LSTM layer is depicted in Fig. 6. In each time slot, the DQN receives as input the hidden state and cell state of its LSTM layer from the last time slot. Each DQN then outputs the updated hidden and cell states, which are fed into the DQN in the next time step. Intuitively, the hidden and cell state are compressed representations of the local action-observation history, such that the policy learnt by the DQN is indeed a mapping from that history to an action.
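The way the hidden and cell states carry the history forward can be seen in a toy scalar LSTM cell, sketched below using the standard LSTM gate equations. The scalar weights, the parameter layout and the function names are assumptions for illustration; the architecture of Fig. 6 is not reproduced.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h + b)

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c_new = f * c + i * g          # cell state mixes old memory with the new candidate
    h_new = o * math.tanh(c_new)   # hidden state is what the Q-value head would consume
    return h_new, c_new


def encode_history(observations, p):
    """Feed observations one slot at a time, passing (h, c) from slot to slot."""
    h, c = 0.0, 0.0
    for x in observations:
        h, c = lstm_step(x, h, c, p)
    return h, c


params = {name: (1.0, 1.0, 0.0) for name in ("input", "forget", "output", "candidate")}
h_a, _ = encode_history([1.0, 0.0], params)
h_b, _ = encode_history([0.0, 0.0], params)
print(h_a, h_b)
```

Both sequences end with the same observation, yet the final hidden states differ, which is the sense in which the recurrent state summarizes the history rather than only the latest input.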

IV-D Bellman Updates for EOS and CON States

Solving the medium access DEC-POMDP entails finding the optimal joint policy that maximizes the expected cumulative reward. This requires computing the optimal policy at each BS. To this end, we will adopt the Independent DQN framework and define two networks at each BS, one for the EOS state and one for the CON state. As a first step towards addressing partial observability, each of these networks will be modelled as a recurrent Q-network as described in Section IV-C. However, in addition to partial state observability, a decentralized setting makes the action of one BS not directly observable at any other BS. In other words, each BS would not be able to observe the entire action vector. This is where the energy vector comes in handy. As described in Section IV-A, it enables a BS to observe the action of another BS that precedes it in the contention queue and has chosen to transmit.

This indirect exchange of “messages” between BSs via the energy vector during contention has parallels to the technique for inter-agent communication in a DEC-POMDP via a limited-bandwidth channel [foerster2016learning]. A technique called differentiable inter-agent learning (DIAL) is proposed in [foerster2016learning] that not only uses deep Q-learning [mnih2015human] with a recurrent network to address partial observability [hausknecht2015deep], but also provides for the passage of customized real-valued messages between agents, such that gradients can be pushed through the communication channel, yielding a system that is end-to-end trainable across agents. The authors of [foerster2016learning] employ direct connections between the output of one agent’s network and the input of another, via a discrete/regularize unit (DRU), allowing the message to be learnt during training. In our case, however, the contents of the “message” cannot be modified to improve learning at the BS. In Section VI, we will provide experimental evidence that the energy vector nevertheless enables the learning of a competitive decentralized medium access policy.

Hence, using (15), in accordance with the backup diagram depicted in Fig. 4, we can derive the sampled Q-value updates for and as


where we utilize , hence . Also note that and . We replaced with and in (16) and (17) respectively, since and are recurrent Q-networks. The recurrence implicitly captures the fact that the observation-action history is also part of the input to each network.

Fig. 4: Backup diagram for calculating values at both and . Each open circle denotes a state, and each open square a state. Each solid circle denotes a state-action pair (), and each solid square a state-action pair. The dotted arc represents that the maximum of the two branches is taken.

An advantage of these alternating Q-updates is that we eliminate the need for double deep Q-learning using a target Q-network as described in Section IV-B. Since a typical Bellman equation would contain Q-values corresponding to the same state space on both sides of the equation, double deep Q-learning is introduced to avoid maximization bias [sutton2018reinforcement]. In (16) and (17), we have Q-values corresponding to different state spaces on the two sides, hence circumventing the issue. At the same time, it must be noted that (16) and (17) are equivalent to the original Bellman equation, albeit with a discount factor of . This can be seen by substituting (16) in (17). We now describe how recurrent DQNs are combined with message passing between BSs to generate an episode, using the example depicted in Fig. 5.

IV-E Generating an Episode Using DQNs

An episode refers to a collection of consecutive time slots. Consider the time slot in the episode. At the beginning of time slot , a random counter is drawn for each BS as shown in the table on the left in Fig. 5. Note that for time slot 0, the same observation with is fed to at each BS . In subsequent time steps, is computed in the previous time slot as explained in Section IV-A. Recall that . On the other hand, each CON DQN outputs two -values corresponding to the actions 0 and 1, with


Note that if we were generating an episode during the training of the algorithm, we would use an ε-greedy policy to choose the action, to provide for the exploration-exploitation trade-off [mnih2015human]. Hence, with probability 1 − ε we would choose the action using (18), and with probability ε we would pick it randomly.
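As a minimal illustration of the ε-greedy rule described above, the following Python sketch chooses between the two contention actions. The function name `epsilon_greedy` and the plain-list representation of the Q-values are assumptions made for this example only; they are not taken from the original implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise.

    q_values : list of Q-values, one per action (here: [Q(no transmit), Q(transmit)])
    epsilon  : exploration probability in [0, 1]
    rng      : any object with random() and randrange(), e.g. random.Random(seed)
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.2, 0.9], epsilon=0.0)
```

Setting `epsilon=0.0` corresponds to evaluating the learnt policy, while `epsilon=1.0` corresponds to fully random action selection.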

In the example shown in Fig. 5, since BS 1 has the smallest counter, it goes first and measures the energy from ongoing transmissions to compute its energy vector using (13). It senses no other BSs transmitting (ignoring the noise term in (13) for simplicity), and in combination with the average rate, signal power and interference power of the UE it serves from the previous time slot, it determines its action using the policy given in (18). Let us assume it chooses to transmit (note that transmission is not a given simply because it senses no ongoing transmissions; it is a complex decision made by the DQN based on many factors). BS 2 is scheduled next and detects that BS 1 is transmitting, so that its energy vector is non-zero, and its DQN instructs it not to transmit. Finally, BS 0 also detects a non-zero energy vector, but chooses to transmit. At this point, we save the tuple at each BS . Note that while the training procedure, elaborated in Section V-C, requires training both the CON and EOS DQNs, testing the learnt policy requires only the CON DQN, as is evident from (18).

Once all the BSs have taken an action , the action vector in combination with the channel gains is used to calculate and the updated average rates . These determine the observations for the next time slot. At this point, we save the quadruple at each BS . In the next time slot , a new counter is drawn at each BS and the process is repeated.
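The contention order in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `contention_round`, the `policy` callable, the toy path gains, and the use of a simple energy threshold in place of the CON DQN are all hypothetical, counters are assumed to be unique, and the received energy is simplified to transmit power times path gain with noise ignored.

```python
def contention_round(counters, gains, tx_power_mw, policy):
    """Run one contention phase and return (order, actions).

    counters[i]  : backoff counter of BS i (assumed unique in this sketch)
    gains[i][j]  : linear path gain from BS j to BS i (toy values)
    policy(i, e) : hypothetical callable returning 0 (stay silent) or 1 (transmit)
    """
    n = len(counters)
    actions = [0] * n
    # BSs contend in increasing order of their counters.
    order = sorted(range(n), key=lambda i: counters[i])
    for i in order:
        # Energy received at BS i from BSs that have already chosen to transmit.
        energy = [tx_power_mw * gains[i][j] * actions[j] if j != i else 0.0
                  for j in range(n)]
        actions[i] = policy(i, energy)
    return order, actions


# Toy 3-BS example mirroring the order in the text: BS 1, then BS 2, then BS 0.
gains = [[0.0, 1e-10, 1e-9],
         [1e-10, 0.0, 1e-6],
         [1e-9, 1e-6, 0.0]]
threshold_mw = 10 ** (-72 / 10)  # -72 dBm, used here as a stand-in policy
order, actions = contention_round(
    counters=[2, 0, 1], gains=gains, tx_power_mw=200.0,
    policy=lambda i, e: int(sum(e) < threshold_mw))
```

In the framework described in the text, the `policy` argument would be replaced by the CON DQN at each BS, which also takes the average rate and the signal and interference power of the served UE into account.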

Fig. 5: Information flow between BS’s and the environment. The evolution of the hidden and cell state, and is shown only for , but holds for too. The red circle denotes the EOS DQN and the green square, the CON DQN.

V Simulation Details

The performance metric is the expected cumulative reward introduced in Section III-B, with given by (11). A smoothing window of , and time slots are used in all the subsequent simulations.

V-A Data Generation & Pre-processing

We consider an indoor hotspot deployment, a scenario intended to capture typical indoor situations such as office environments comprising open cubicle areas, walled offices, open areas and corridors (InH-Office in 3GPP TR 38.901 [3gpp.38.901]). The BSs are located at 10, 30, 50, 70, 90 and 110 m on the x-axis and at 15 and 35 m on the y-axis, and are mounted at a height of 3 m on the ceiling. 120 UEs are uniformly distributed in a 120 m x 50 m layout and have a height of 1.5 m. Each BS is associated with 10 UEs. The pathloss (PL) model between nodes (BSs and UEs) captures the Line-Of-Sight (LOS)/Non-Line-Of-Sight (NLOS) properties of a link, the frequency-dependent path loss for LOS/NLOS links, and shadowing as part of the large-scale fading parameters, and is given in Table II. In accordance with the indoor open office model [3gpp.38.901], the links are designated as LOS/NLOS probabilistically with given by


with denoting the distance between the BS and UE on the floor, while will be used to denote the distance between the highest points of the BS and UE respectively.
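For readers who want to reproduce the probabilistic LOS/NLOS designation, the sketch below encodes the indoor open office LOS probability as we recall it from 3GPP TR 38.901 (Table 7.4.2-1). The breakpoints (5 m, 49 m) and constants (70.8, 211.7, 0.54) are quoted from memory of that specification and should be checked against it; the function names are hypothetical.

```python
import math
import random


def p_los_inh_open_office(d_2d_m):
    """LOS probability versus 2D BS-UE distance in metres.

    Constants are assumed to match the InH open office entry of
    3GPP TR 38.901, Table 7.4.2-1; verify against the specification.
    """
    if d_2d_m <= 5.0:
        return 1.0
    if d_2d_m <= 49.0:
        return math.exp(-(d_2d_m - 5.0) / 70.8)
    return 0.54 * math.exp(-(d_2d_m - 49.0) / 211.7)


def draw_is_los(d_2d_m, rng=random):
    """Designate a link as LOS (True) or NLOS (False) at random."""
    return rng.random() < p_los_inh_open_office(d_2d_m)
```

Short links are therefore always LOS, and the LOS probability decays with the floor-plane distance.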

The center frequency used for modeling is 6 GHz. From these 12 BSs, various configurations of BSs and their associated UEs are chosen for the simulations; see, e.g., Fig. 7(a) and Fig. 7(b).


TABLE II: InH-Office PL Model. Shadow fading distribution is log-normal. Reproduced from [3gpp.38.901].

At each BS, to select the active UE, we have 10 UEs to choose from. From these possible configurations of 4 UEs, are chosen for training the RL algorithm, and the rest are used for randomly sampling validation and test configurations. Replay memories and , modelled as double-ended queues, are created, each containing episodes, one for each possible training configuration. Each episode in the initial replay memory is generated with an ε-greedy policy with ε = 1, i.e. at each time step in the episode, the action is chosen randomly at each BS. However, ε decays over the course of training for generating new episodes, as described in Section V-C.

Within each episode, slow fading with a first order IIR filter is used to model the time evolution of each of the channel coefficients and . Denote the large scale fading (given by the InH-Office PL model) path gain coefficient as , and the small scale fading coefficient as . Then we have


with , and . The length of one slot is given by the COT which ranges from 1 to 9 ms. Solving with yields . Hence, the time taken for the channel to decorrelate , which is on the order of 100 ms, typical for a pedestrian setting. Key parameters used for generating the data are summarized in Table IIIa.
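To make the link between the fading coefficient and the decorrelation time concrete, the sketch below uses a generic first-order autoregressive recursion. The exact filter and decorrelation criterion used in the paper are not reproduced here, so both the recursion h[t] = a·h[t-1] + sqrt(1 − a²)·w[t] with a = 1 − ρ and the decorrelation `level` are assumptions, and the function names are hypothetical.

```python
import math
import random


def ar1_step(h_prev, rho, rng=random):
    """One step of an assumed unit-variance first-order IIR (AR(1)) fading model."""
    a = 1.0 - rho
    # Complex Gaussian innovation with unit variance (0.5 per real dimension).
    w = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
    return a * h_prev + math.sqrt(1.0 - a * a) * w


def slots_to_decorrelate(rho, level):
    """Smallest n such that the lag-n correlation (1 - rho)**n is <= level."""
    return math.ceil(math.log(level) / math.log(1.0 - rho))


# With rho = 0.01 (the fading coefficient in Table III):
n_half = slots_to_decorrelate(0.01, 0.5)             # correlation drops to 1/2
n_inv_e = slots_to_decorrelate(0.01, math.exp(-1))   # correlation drops to 1/e
```

Under these assumptions the channel decorrelates over several tens to a hundred slots, which for slot lengths of a few milliseconds is of the same order as the figure quoted in the text.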

Both and in the definition of are normalized by the standard deviation of the BS-UE path gains before being input to the CON and EOS DQNs. Similarly, each entry of in is normalized by the standard deviation of the BS-BS path gains .

V-B DQN Architecture

Both the CON and EOS DQNs have a similar architecture, the only difference being the size of the input. The neural network (NN) architecture of a DQN is depicted in Fig. 6. Since the input matrix to an LSTM (see https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) has three dimensions , with capturing the temporal dependence, while the input to a fully connected layer is two dimensional, with the second dimension being , we flatten the dimension such that the input to the DQN is of size .


Fig. 6: NN Architecture of , depicting the passage of the hidden and cell state of the LSTM operation from one time slot to the next.

The input state is fed into a 2-layer fully connected DNN with 512 neurons per layer and tanh activation. Hence these two layers are applied to each time step separately. At this stage, the output of size is reshaped into before being fed into an LSTM layer with a hidden state and cell state of size 256. Only the value of the last time step (corresponding to index ) is extracted from the LSTM output, such that the output is of size . The LSTM output is then fed into two fully connected layers, one of which outputs a state value, while the other outputs 2 advantage values corresponding to both possible actions. The advantage is the gain, in expected cumulative reward , of choosing action . The Q value is then computed as


This approach is known as dueling [wang2016dueling], and helps in learning the state-value function efficiently. Finally, always has a default action of , so the Q-value of is discarded.
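As an illustration of how a dueling head combines its two outputs, the sketch below uses the mean-subtracted aggregation that is common for dueling networks. Whether the paper uses exactly this aggregation is an assumption, since the equation is not reproduced here; `dueling_q` is a hypothetical helper operating on plain Python numbers.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Assumes the mean-subtracted form Q(a) = V + A(a) - mean(A),
    a commonly used dueling aggregation.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Two actions (0 = do not transmit, 1 = transmit) with toy numbers.
q_values = dueling_q(1.0, [0.5, -0.5])
```

Subtracting the mean advantage keeps the decomposition identifiable, so that the value output tracks the average Q-value over the actions.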

V-C DQN Training Procedure

In each training iteration, an episode is generated as described in Section IV-E using an ε-greedy policy at each BS, with ε decaying from 1 to 0.25 uniformly with each iteration. An episode yields tuples and quadruples from each BS , denoted as and respectively. These episodes are collectively appended as a new row in replay memories and respectively.
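The exploration schedule can be sketched as follows. We interpret "uniformly" as a linear decay over the 15000 training iterations listed in Table III; this interpretation, and the function name `epsilon_at`, are assumptions.

```python
def epsilon_at(iteration, total_iterations=15000, eps_start=1.0, eps_end=0.25):
    """Linearly interpolate epsilon from eps_start to eps_end over training."""
    # Clamp the iteration index to the valid range [0, total_iterations - 1].
    clamped = min(max(iteration, 0), total_iterations - 1)
    frac = clamped / (total_iterations - 1)
    return eps_start + (eps_end - eps_start) * frac


eps_first = epsilon_at(0)       # start of training
eps_last = epsilon_at(14999)    # final training iteration
```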

Then episodes are randomly sampled from both memories, and starting from a random point in each episode, consecutive transitions are chosen for training. We employ to carry the LSTM hidden state forward to some extent during training, while also better adhering to a DQN’s random sampling policy for training than (refer to the discussion on Bootstrapped Random Updates in [hausknecht2015deep]). Subsequently, we employ the 2-stage Bellman update equations (23) and (24) derived in Section IV-D to calculate labels and corresponding to the predictions of and respectively at each BS as follows


We then update the weights of each DQN using the mean squared error between the prediction and label as the loss function. The remaining parameters used for training the NNs are summarized in Table IIIb. The choice of initial learning rate depends on the layout we are training on, and is given for both layouts utilized in Section VI. It should be noted that the outlined training procedure is performed completely offline prior to deployment, since it requires centralized computation of the reward . However, it can be adapted to an online setting as well (refer to Section VI-D).
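The sampling step described earlier in this subsection, in which episodes are drawn from a replay memory and a run of consecutive transitions is taken from a random starting point, can be sketched as below. The replay memory is modelled as a `collections.deque` of episodes, as in the text; the function name, argument names, and the representation of transitions as list elements are assumptions.

```python
import random
from collections import deque


def sample_sequences(memory, num_episodes, seq_len, rng=random):
    """Sample `num_episodes` episodes and return one run of `seq_len`
    consecutive transitions from each, starting at a random offset."""
    episodes = rng.sample(list(memory), num_episodes)
    batch = []
    for episode in episodes:
        start = rng.randrange(len(episode) - seq_len + 1)
        batch.append(episode[start:start + seq_len])
    return batch


# Toy memory: 5 episodes of 20 integer "transitions" each.
memory = deque([list(range(100 * k, 100 * k + 20)) for k in range(5)])
batch = sample_sequences(memory, num_episodes=3, seq_len=4, rng=random.Random(0))
```

Using runs longer than one transition lets the LSTM hidden state be carried forward within each sampled run, while the random starting points keep the samples decorrelated.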

Transmit Power 23 dBm
Noise PSD -174 dBm/Hz
Bandwidth 20 MHz
UE Noise Figure 9 dB
BS Noise Figure 5 dB
Fading Coefficient 0.01
(a) Data Generation Parameters
Initial Learning Rate L1: L2:
Learning Rate Decay 0.85  /  500 updates
Weight Decay 0.001
Optimizer Adam[kingma2014adam]
Batch Size 5000
Training Iterations 15000
(b) DQN Training Parameters
TABLE III: Simulation Parameters
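As a worked example of how the entries in Table IIIa combine, the sketch below computes the receiver noise power from the noise PSD, bandwidth and noise figure using the standard relation N = PSD + 10·log10(B) + NF in dB units. The function name is hypothetical.

```python
import math


def thermal_noise_dbm(bandwidth_hz, noise_figure_db, psd_dbm_per_hz=-174.0):
    """Receiver noise power in dBm for a given bandwidth and noise figure."""
    return psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz) + noise_figure_db


ue_noise_dbm = thermal_noise_dbm(20e6, 9.0)  # UE: about -92 dBm
bs_noise_dbm = thermal_noise_dbm(20e6, 5.0)  # BS: about -96 dBm
```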

V-D PF Scheduler and ED Threshold Baselines

The evaluation of the distributed RL algorithm was outlined in Section IV-E. Note that and are used during testing, and the hidden and cell state and from time slot are explicitly provided as input to the DQN in the next time slot to emulate a local action-observation history.

Three baselines are used for assessing the performance of the distributed RL algorithm. First, the centralized PF-based BS scheduler presented in Section II provides an approximate upper bound on the attainable reward. It is approximate for two reasons: it maximizes for , and since the centralized scheduler determines the action of all BS’s at the beginning of a time slot, it uses the path gains from the previous time slot. However, with and slow fading coefficient , the approximations prove sufficiently accurate. For the reasons outlined in detail in the beginning of Section III, the PF scheduler is not realizable in any practical decentralized deployment in the absence of a central controller.

The second baseline is the ED threshold, which allows a BS to transmit if the received sum of energies of the already transmitting BSs is less than a fixed threshold. We employ a threshold of -72 dBm [3gpp.36.889].
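A minimal sketch of this rule is given below. Received energies are assumed to be supplied in dBm and are summed in linear units (mW) before the comparison; the helper names are hypothetical.

```python
def dbm_to_mw(dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (dbm / 10.0)


def ed_allows_transmit(received_dbm, threshold_dbm=-72.0):
    """Return True if the summed received energy is below the ED threshold.

    received_dbm : energies (in dBm) sensed from BSs that are already transmitting
    """
    total_mw = sum(dbm_to_mw(e) for e in received_dbm)
    return total_mw < dbm_to_mw(threshold_dbm)


idle_channel = ed_allows_transmit([])            # nothing sensed
two_weak = ed_allows_transmit([-80.0, -85.0])    # sum stays below -72 dBm
two_medium = ed_allows_transmit([-74.0, -74.0])  # sum exceeds -72 dBm
```

The last case shows why the energies are summed before thresholding: two interferers that are each below the threshold can jointly exceed it.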

The third baseline, referred to as “Adaptive ED”, is a configuration-adaptive ED threshold. It finds the ED threshold that maximizes for the given configuration of UEs from a set of ED thresholds ranging from -32 dBm to -92 dBm. Note that “Adaptive ED” is only used to benchmark the performance of the RL algorithm, and is not directly realizable, since a BS cannot be cognizant of the configuration of the UEs before transmission. However, one of the strategies considered to make a BS aware of the channel state at the UE is for the UE to respond with a CTS message only if the SINR of the received RTS is greater than a threshold [jamil2015efficient], [lien2016configurable]. In other words, CCA is also carried out at the UE. If the BS does not receive the CTS, it can adapt its ED threshold accordingly. The performance of “Adaptive ED” is indicative of such receiver-based LBT mechanisms. This approach, however, results in high signaling overhead, high latency and wastage of radio resources. The RL algorithm, on the other hand, does not suffer from any of the aforementioned drawbacks.
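The per-configuration selection behind “Adaptive ED” amounts to taking, for each UE configuration, the best reward over the candidate thresholds. The sketch below illustrates this with toy reward values; the numbers, the candidate thresholds shown, and the function name are illustrative assumptions and are not results from the paper.

```python
def adaptive_ed_envelope(rewards_by_threshold):
    """For each configuration, return (best_threshold_dbm, best_reward).

    rewards_by_threshold : dict mapping an ED threshold in dBm to a list of
                           rewards, one per UE configuration.
    """
    thresholds = list(rewards_by_threshold)
    num_configs = len(rewards_by_threshold[thresholds[0]])
    envelope = []
    for c in range(num_configs):
        best = max(thresholds, key=lambda th: rewards_by_threshold[th][c])
        envelope.append((best, rewards_by_threshold[best][c]))
    return envelope


# Toy rewards for three thresholds and two configurations (illustrative only).
toy_rewards = {-92: [5.0, 6.0], -72: [7.0, 6.5], -52: [6.0, 7.5]}
envelope = adaptive_ed_envelope(toy_rewards)
```

Because the best threshold is chosen with knowledge of each configuration, this envelope is a benchmark rather than a deployable rule.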

VI Results & Discussion

We consider 4 BSs lying at the corners of a rectangle of breadth 20 m. In Layout 1 (L1), the length of the rectangle is 100 m, while in Layout 2 (L2), it is 40 m. Henceforth, we will refer to the position of the BSs as a Layout, and the position of the UEs as a configuration. We consider only two Layouts throughout the simulations, but a large number of configurations. The primary difference between the two layouts is that for most choices of 4 UEs, the inter-BS energies will accurately reflect the quality of reception in Layout 1, but not in Layout 2. This is evident from Fig. 7(a), where the separation between UEs served by different BSs reflects the inter-BS separation more accurately than in Fig. 7(b).


(a) Layout 1.


(b) Layout 2.
Fig. 7: Two layouts of 4 BS’s at the corners of a m rectangle. A BS is referred to as gNB in NR terminology.

The validation curves for Layout 1 and Layout 2 are shown in Fig. 8(a) and Fig. 8(b) respectively. They are obtained by evaluating the trained models obtained after every 600 iterations on 10 randomly sampled configurations (not part of the training set), averaged over 10 realizations of each configuration. The constant benchmarks provided by the PF and ED baselines, averaged over the same validation configurations, are also plotted. Clearly evident is the increasing gap between the “Adaptive ED” and PF baselines as we go from Layout 1 to Layout 2, because the separation between UEs served by different BSs reflects the inter-BS separation more precisely in Layout 1. As a consequence, the comparison of Layout 1 and Layout 2 also shows that a single standardised threshold of -72 dBm cannot provide the same degree of fairness in different scenarios. In both cases, the reward accumulated by the RL algorithm gradually merges with the Adaptive ED threshold baseline, and is significantly higher than the average reward obtained using the standardised -72 dBm ED threshold.


(a) Layout 1


(b) Layout 2
Fig. 8: Cumulative Reward evaluated on Validation Set for Layout 1 and Layout 2

For both layouts, the trained models obtained at the end of 15000 training iterations are evaluated on 15 randomly sampled configurations (not part of the training and validation set), with the performance metric averaged over 120 realizations of each configuration (to average over fading and different counter realizations). As in the training phase, unique counters are used for each BS with each counter and . The same configurations are also used for evaluating the centralized PF-based BS scheduler and ED threshold baselines. The trained RL models, PF and ED algorithms are evaluated on the same fading and counter realizations. The envelope of all the ED threshold curves is used to create the “Adaptive ED” threshold curve. The mean reward over all test UE configurations is summarized in Table IV for Layout 1 and 2 under the tab UC (Unique Counter).


TABLE IV: Average Reward for Layout 1 and 2 with CW = 4. UC/NUC is Unique/Non-Unique counter for each BS.

Fig. 8 and Table IV highlight the proximity of the RL algorithm to the configuration adaptive ED threshold in terms of maximization of the expected cumulative reward. This is one of the key findings of this paper, since the adaptive ED threshold optimizes the ED threshold for each UE configuration, thus subsuming knowledge of the BS-UE path gains, while the RL algorithm is not provided with this information. On the other hand, the performance of the standardised -72 dBm ED threshold varies significantly depending on the Layout as discussed before. We now elaborate on certain key features of the RL algorithm and aspects of medium access protocols that it has the potential to improve.

VI-A Non-Unique Counters

Consider employing non-unique counters while evaluating the trained models, i.e. each BS can receive a random counter , with the possibility that for . However, we maintain . The average rewards are given in Table IV under the tab NUC (Non-Unique Counter), and the results for Layout 2 are depicted in Fig. 9 as a function of the Layout Index ranging from 1 to 15. While the average reward earned by the ED threshold of -72 dBm is significantly reduced (7 to 3.67 and 7.89 to 6.64), the RL algorithm is much more robust to the presence of counter collisions (7.46 to 6.80 and 7.94 to 7.23).


Fig. 9: Using non-unique counters depicts the increased complexity of learnt RL policy as compared to ED threshold.

It is evident from Fig. 9 that RL achieves a significantly higher reward than even the “Adaptive ED” baseline. RL provides for the application of a state-based policy at each BS for determining , and in addition to the inter-BS energy vector , this state also contains the average rate , signal and interference power of the served UE from the previous time slot. Moreover, the LSTM hidden and cell state and provide access to the local action-observation history. However, if two or more BSs have the same counter, each of these BS’s misses out on the chance to learn about the decision of another to transmit, thus increasing the extent of partial observability in . Our results show that RL learns a complex policy that depends on all the input variables in the CON and EOS state definitions, and is not as impacted by a less informative as the ED threshold, which is entirely dependent on for determining . Comparing the UC and NUC tabs in Table IV, it is evident that RL is impacted by counter collisions, but not to the extent of the ED threshold.

Conventional WiFi systems reduce the probability of counter collision by using a larger CW size of at least 15 [is2012ieee]. An RL based approach can thus reduce the size of the contention window, paving the way for improved resource utilization in data transmission. Training the RL algorithm with counter collisions and larger CWs could further improve the robustness to non-unique counters, and is left for future work.
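To give a feel for how often counter collisions occur with a small contention window, the sketch below computes the probability that all BSs draw distinct counters. It assumes that counters are drawn independently and uniformly from a set of `num_values` possible values; this sampling model and the function name are assumptions for illustration.

```python
from math import perm


def prob_all_counters_unique(num_bs, num_values):
    """Probability that num_bs i.i.d. uniform counters are all distinct."""
    if num_bs > num_values:
        return 0.0
    # Ordered selections without repetition divided by all ordered selections.
    return perm(num_values, num_bs) / num_values ** num_bs


p_small = prob_all_counters_unique(4, 4)    # 4 BSs, 4 possible counter values
p_large = prob_all_counters_unique(4, 16)   # 4 BSs, 16 possible counter values
```

Under this model, with 4 BSs and only 4 counter values, fewer than one in ten draws is collision-free, whereas with 16 values about two in three draws are, which illustrates why larger contention windows reduce collisions.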

VI-B Length of Energy Vector

Since we assume that only DL BS-UE transmissions are sharing the same spectrum, and the rest of the transmissions do not interfere, the term would be zero (the BS noise figure used in simulations is small enough to be neglected). A seemingly obvious strategy, which could even bring about some reduction in computational complexity, is to simply delete this entry before inputting to . The smoothed validation curves for Layout 1 with this modification are presented in Fig. 10 with the label E3 (since was truncated to length ), while the original validation curve from Fig. 8(a) is reproduced here with the label E4. For two distinct choices of initial learning rate , we are unable to achieve training convergence. A possible explanation is as follows. Consider again the 3 BS example depicted in Fig. 5, where BS 1 had the smallest counter and chose to transmit first. If we truncated the length of the energy vector from to , then the index of the non-zero term in and would be different, i.e. the first term would be non-zero in and the second term in . As a result, in the end-to-end training across agents brought about by the vector and the common reward , the CON DQN at each BS would be unable to assign an identity to the interfering transmitter. Hence, when evaluating the learnt policy, it is unable to utilize the information contained in effectively to estimate the interference at the UE in the current time slot.
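The index shift described in this explanation can be reproduced with a small sketch of the 3-BS example. The function name, the unit energies, and the list-based representation are assumptions made for illustration.

```python
def energy_vector(bs_index, transmitting, energies, truncate_self=False):
    """Energy vector observed at BS `bs_index`.

    transmitting[j] : True if BS j is already transmitting
    energies[i][j]  : energy received at BS i from BS j (toy values)
    truncate_self   : if True, delete the BS's own (always zero) entry
    """
    n = len(transmitting)
    full = [energies[bs_index][j] if (transmitting[j] and j != bs_index) else 0.0
            for j in range(n)]
    if truncate_self:
        return full[:bs_index] + full[bs_index + 1:]
    return full


# BS 1 transmits first; BS 0 and BS 2 both sense it.
tx = [False, True, False]
e = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
full_at_0 = energy_vector(0, tx, e)                        # [0.0, 1.0, 0.0]
full_at_2 = energy_vector(2, tx, e)                        # [0.0, 1.0, 0.0]
trunc_at_0 = energy_vector(0, tx, e, truncate_self=True)   # [1.0, 0.0]
trunc_at_2 = energy_vector(2, tx, e, truncate_self=True)   # [0.0, 1.0]
```

With the full-length vector, the non-zero entry sits at index 1 at both observers and thus identifies BS 1; after truncation the same transmitter appears at different indices at BS 0 and BS 2.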


Fig. 10: Varying length of affects training convergence

VI-C Impact of Fading

To investigate the impact of fading on all links on the training convergence of the RL algorithm, we performed two experiments. First, we removed fading in both the training and the testing of the algorithm. The smoothed validation curve for Layout 2 in the absence of fading is shown in Fig. 11(a) in blue, overlaid with the validation curve in the presence of fading from Fig. 8(b) in green. It appears that the RL algorithm is not significantly impacted by fading, since the and terms in the state definition serve as good indicators of the channel coefficients in a slow fading environment. While both ED baselines in the absence of fading may appear to significantly outperform their fading counterparts, this is simply an artifact of that particular snapshot of the channel gains under “no fading” being favorable to the ED baselines, and they are shown only for reference. Second, we increased the slow fading coefficient to , and the corresponding validation curve for Layout 2 is depicted in Fig. 11(b). Both the PF and ED baselines have also been plotted using , and the RL algorithm continues to perform as well as the Adaptive ED baseline. This further validates the observation that the RL algorithm is not noticeably impacted by fading.


(a) Validation Curve for Layout 2 without fading (in blue)


(b) Validation Curve for Layout 2 with .
Fig. 11: Our RL-based approach’s training convergence and performance is robust to the degree of fading

VI-D Online Adaptability

One of the key advantages of the formulation described thus far is the per-timestep reward structure presented in (11). This reward function is independent of the episode length . Moreover, consider an online deployment where every time slots, a BS receives messaging on backhaul links from other BSs that could contain the average rate of the remaining UEs and channel state information (CSI) that could aid in determining contributing to interference at the UE served. Then the BS could utilize this information to update the weights of its DQNs by simply using the same reward function from (11) to calculate the labels and as given by (23) and (24). This follows from the observation that in the absence of an update, if we consider to equal its expected value , then


Hence the cumulative reward for remains unaffected by missed updates, and we continue to optimize for long term PF.

VII Conclusions and Future Directions

Transmitter-side MAC mechanisms such as the ED threshold algorithm, adopted as the starting point of the design of LAA [3gpp.36.889] and the unlicensed 6 GHz band in NR-U [3gpp.38.889], incorrectly assume that interference sensed at the BS is representative of the quality of reception at the UE. To develop a state-based MAC policy, we first formulated medium access as a DEC-POMDP and incorporated contention to derive a 2-state transition diagram for the BS in each time slot. We then presented a distributed deep Q-learning algorithm that utilizes an LSTM layer in each DQN to capture the local action-observation history, in combination with message passing to provide for end-to-end training across BSs. The algorithm developed in this paper jointly utilizes the information from LBT-based spectrum sensing at the BS along with the average rate, signal power and interference power seen by the UE it serves to determine whether the BS will transmit in the designated time slot. Moreover, the DQNs at each BS can be trained offline and refined periodically online owing to the structure of the per-timestep reward, providing for online adaptability. These features make our approach well suited for deployment in actual BSs.

Utilizing a centralized training procedure with decentralized execution, the distributed RL algorithm was found to match the performance of a configuration adaptive ED threshold. Moreover, we showed that it has the potential to reduce the size of the contention window, providing for improved resource utilization in data transmission, and its training convergence is minimally impacted by the presence of fading on all links.

One of the drawbacks of the approach we have presented is the need for deploying a different DQN at each BS. The current framework has no parameter sharing among DQNs at different BSs. Ideally, we would want to train a single DQN that can be deployed at any BS, and refine it online via periodic updates. Moreover, there is still significant room for improvement in terms of the gap from the centralized PF-based BS scheduler, and policy gradient methods [sutton2018reinforcement] could help. With a view to the design of a learning-based BS, the framework developed in this paper has the potential to be applied in a variety of decision-making problems, including rate control, beam selection for scheduling, coordinated scheduling and channel selection in the frequency domain. In essence, these extensions tremendously increase the dimensionality of the output action space, from a simple transmit Yes/No decision to a choice of MCS, beamformer, user and subcarrier. With machine learning (ML) envisioned to be a key enabler for 6G [dang2020should], distributed learning designs of spectrum sharing mechanisms for the increasing amounts of unlicensed spectrum being made available should prove to be of paramount importance.

-a Proof of Eqn (11)

In order to satisfy (10), let us rewrite (5) as follows, with