I. Introduction
This paper considers the problem of efficient and equitable spectrum sharing among a group of colocated heterogeneous wireless networks. These networks may adopt different medium access control (MAC) protocols, and each network does not know the MAC protocols of the others. This scenario is envisioned by the DARPA Spectrum Collaboration Challenge (SC2) competition as a future spectrum-sharing paradigm [DARPAwebsite, tilghman2019will]. In this futuristic scenario, unlike in conventional cognitive radio, all users/networks are on an equal footing in that they are not divided into primaries and secondaries. To share the spectrum in an efficient and equitable manner, each network must respect the spectrum usage of other networks in that it must not hog the spectrum to their detriment. A major challenge for a particular network is how to coexist with other networks, without knowing their MACs, while achieving efficient and equitable spectrum usage among all networks.
Widely used wireless MAC protocols today are often designed for homogeneous networks in which all nodes use the same MAC. A case in point is WiFi, which adopts a particular form of carrier-sense multiple access (CSMA) with collision avoidance [Tanenbaum:2010:CN:1942194]. The carrier sensing and binary exponential backoff mechanisms of WiFi [bianchi2000performance, liew2010back] work well only if all nodes in the network adopt the same mechanisms; they do not work well in heterogeneous networks. To illustrate, consider the coexistence of a WiFi node and a node operating the time division multiple access (TDMA) protocol. The TDMA node transmits in specific time slots in a frame consisting of multiple time slots, in a repetitive manner from frame to frame, as illustrated in Fig. 1. In particular, the TDMA channel access pattern is oblivious of the MAC of WiFi, and vice versa. As shown in Fig. 1, the WiFi node may sense the channel to be idle and decide to transmit a packet, only to have the TDMA node transmit a packet shortly thereafter, resulting in a collision. This leads to inefficient spectrum usage in the heterogeneous network setting.
A goal of this paper is to circumvent this problem with a new class of CSMA protocols based on deep reinforcement learning (DRL) [sutton2018reinforcement]. DRL is a machine learning technique that combines deep learning and reinforcement learning. DRL has had success in solving a wide range of complex decision-making tasks, including video game playing, robotic control, smart grid management, and wireless communication [li2017deep, DQNpaper, silver2016mastering, gu2017deep, ruelens2016residential, luong2019applications, sun2019application]. A salient feature of our DRL-based MAC protocol is that it does not need to know the operating mechanisms of the coexisting MACs: it learns to coexist harmoniously with them by trial and error. Throughout this paper, the DRL-based MAC protocol is referred to as Carrier-Sense Deep-reinforcement Learning Multiple Access (CSDLMA). The nodes operating CSDLMA are referred to as CSDLMA nodes, and the corresponding radio network is referred to as a CSDLMA network. In general, CSDLMA can have different objectives when coexisting with other MACs, e.g., maximizing sum throughput, achieving proportional fairness, or achieving max-min fairness [mo2000fair]. For generality, this paper adopts α-fairness [mo2000fair] as the objective of CSDLMA. With α-fairness, CSDLMA can achieve a range of different objectives, including the aforementioned ones, by changing the value of α. We show that CSDLMA can achieve near-optimal results with respect to different α values when coexisting with other MAC protocols, such as TDMA, ALOHA, and WiFi. Moreover, we demonstrate that CSDLMA is more Pareto efficient [myerson2013game] than p-persistent CSMA [Tanenbaum:2010:CN:1942194] when coexisting with WiFi.
The underpinning DRL technique in CSDLMA is the deep Q-network (DQN) [DQNpaper], developed by DeepMind to achieve superhuman-level performance in playing Atari games. However, the original DQN in [DQNpaper] is not directly applicable for our purpose, for two reasons:

1) The original DQN aims only to maximize the cumulative discounted "rewards", i.e., the objective or "return" to be optimized is a weighted linear sum of the rewards in consecutive time steps [sutton2018reinforcement]. This does not fit the α-fairness objective, which in general is a nonlinear combination of utility functions.

2) The original DQN is built on a discrete-time framework in which an underlying assumption is that the time steps are of uniform duration. Specifically, this implicit assumption is made in the way it discounts the rewards from time step to time step in a uniform manner. For CSMA protocols, the time slots are non-uniform in nature: the minislots used for carrier sensing are of smaller duration than the time slots used for data transmission.
Our previous work [yu2019deep] put forth a multi-dimensional DQN algorithm to solve issue 1). This paper introduces a non-uniform time-step formulation of DQN to address issue 2). The key idea in non-uniform time-step DQN is that the "reward" must be discounted according to the duration of each time step.¹
¹We remark that our non-uniform time-step DQN formulation is also potentially applicable to other decision-making problems. For example, in the problem of Treasury bond investment, the maturity dates and interest rates of different bonds may differ. If the investment strategy is to decide the maturity date of the bond to be purchased based on certain observed "environmental states", then the time duration between successive decision-making/investment epochs may vary with the maturity dates of the bonds. To discount properly, the DRL agent needs to take the different time durations into account.
I-A. Contributions
We summarize our contributions in this paper as follows:

1) We develop a new class of CSMA protocols based on DRL, referred to as CSDLMA, for spectrum sharing in heterogeneous wireless networks. A salient feature of CSDLMA is that it optimizes not only its own throughput but also the throughputs of the other coexisting networks, according to the general α-fairness objective. Importantly, CSDLMA achieves this without knowing the MAC protocols of the other networks.

2) We demonstrate that CSDLMA can achieve the general α-fairness objective when coexisting with the TDMA, ALOHA, and WiFi protocols by adjusting its own transmission strategy. Interestingly, we find that CSDLMA is more Pareto efficient than other CSMA protocols, e.g., p-persistent CSMA, when coexisting with WiFi.

3) We put forth a non-uniform time-step multi-dimensional DQN algorithm to enable CSDLMA to achieve the above performance. Although we focus on the use of the modified DQN algorithm for wireless networking, we believe it can also find use in other domains with similar non-uniform time-step and multi-dimensional characteristics.
I-B. Related Work
Since this paper focuses on MAC designs based on DRL techniques, we limit our review to related work in this area. A substantial body of past work on DRL-based MAC, e.g., [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep], did not consider MACs with carrier sensing. In the setup of that work, the decision-making time steps are all of the same duration. For a MAC with carrier sensing, the carrier sensing time and the packet transmission time are of different durations, and conventional DRL algorithms, with their implicit assumption of uniform time steps, are no longer suitable. This is why, in the current paper, we need to modify conventional DRL techniques so that they can be used to design a DRL MAC with carrier sensing and to support its coexistence with other MACs that also have carrier-sensing capability.
The authors in [challita2018proactive] and [tan2019deep] investigated the coexistence of a DRL-based LTE network with WiFi, which has carrier-sensing capability. However, the LTE MACs in [challita2018proactive] and [tan2019deep] exercise coarse-grained control in that decisions are not made on a packet-by-packet basis. Specifically, the LTE MAC does not decide whether to transmit each packet; rather, it decides a stretch of time for LTE transmissions and a stretch of time for WiFi transmissions. During the respective stretches of time, LTE/WiFi get to keep transmitting packets without interruption from the other network. By contrast, the MAC of our design exercises fine-grained control in that it decides whether to transmit the next packet based on carrier sensing of the environment as well as the past history of the environmental state.
In the following, we elaborate on other finer differences between our work and [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep, challita2018proactive, tan2019deep]. The DRL MAC proposed in [naparstek2018deep] is targeted at homogeneous wireless networks. Specifically, in [naparstek2018deep], multiple nodes access multiple time-invariant orthogonal channels using the same DRL MAC. By contrast, we focus on heterogeneous networks in which our CSDLMA protocol must learn to coexist with other MAC protocols. The MACs in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] are also concerned with multiple-channel access. Unlike [naparstek2018deep], the channels in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] are time-varying and may be occupied by "primary" or "legacy" nodes. The DRL nodes there aim to maximize their own throughputs by learning the channel characteristics and the transmission patterns of the "primary" or "legacy" nodes. By contrast, the CSDLMA nodes in our work aim to achieve a global α-fairness objective, which includes maximum sum throughput, proportional fairness, and max-min fairness as subcases.
Both MAC schemes in [challita2018proactive] and [tan2019deep] are model-aware in that the LTE base stations know that the coexisting network is WiFi. Therefore, the approaches in [challita2018proactive] and [tan2019deep] do not generalize to situations where the LTE stations coexist with other networks. For example, suppose that instead of WiFi, the other network is ALOHA. Since ALOHA does not perform carrier sensing, an ALOHA node may still transmit while an LTE node transmits during the stretch of time allocated to LTE, leading to collisions. By contrast, our CSDLMA protocol is model-free in that it does not presume knowledge of the coexisting networks; by nature, it can coexist with any MAC protocol.
In our previous work [yu2019deep], we developed deep reinforcement learning multiple access (DLMA) protocols for heterogeneous networking without carrier sensing. Furthermore, we assumed that nodes of different MACs use the same packet length. This assumption limits the application of DLMA in more general heterogeneous settings in which nodes of different MACs may adopt different packet lengths. Our early work [yu2018carrier] incorporated carrier sensing into DLMA. However, for simplicity, [yu2018carrier] assumed that the durations of carrier sensing and packet transmissions of DRL nodes are the same. Our current paper removes this impractical assumption. As a result, the DRL time steps are now of non-uniform durations, and we put forth a non-uniform time-step formulation of the DQN algorithm to address this issue.
II. Reinforcement Learning Preliminaries
This section overviews the reinforcement learning (RL) techniques [sutton2018reinforcement] used in this paper. In the RL framework, a decision-making agent interacts with an environment in discrete time steps. At time step $t$, the agent observes the environment state $s_t$ and performs an action $a_t$ chosen from an action set $\mathcal{A}$ according to a policy $\pi$. The policy $\pi$ is a mapping from states to actions. Following the action $a_t$, the agent receives a reward $r_t$ and the environment transits to state $s_{t+1}$ at time step $t+1$. There are different techniques for reinforcement learning. This paper adapts and extends the Q-learning technique [watkins1992q] for our particular application.
II-A. Q-Learning
Given a series of rewards, $r_t, r_{t+1}, r_{t+2}, \ldots$, resulting from state-action pairs $(s_t, a_t), (s_{t+1}, a_{t+1}), \ldots$, the cumulative discounted return going forward, pinned at time step $t$, is given by $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in (0,1]$ is a discount factor. Because of the randomness in the state transitions, $G_t$ is a random variable in general. Q-learning captures the expected cumulative discounted reward of a state-action pair $(s,a)$ under a policy $\pi$ by a Q action-value function: $Q^{\pi}(s,a) = \mathbb{E}\left[G_t \mid s_t = s, a_t = a, \pi\right]$. The Q function of an optimal policy among all policies is $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$. In Q-learning, the goal of the agent is to learn the optimal policy in an online manner by observing the rewards while taking actions in successive time steps. In particular, the agent maintains the Q function, $Q(s,a)$, for all state-action pairs $(s,a)$ in a tabular form. At time step $t$, given state $s_t$, the agent selects an action $a_t$ based on its current estimated Q table. This causes the system to return a reward $r_t$ and move to state $s_{t+1}$. The experience at time step $t$ is captured by the quadruplet $e_t = (s_t, a_t, r_t, s_{t+1})$. At the end of time step $t$, experience $e_t$ is used to update the entry $Q(s_t, a_t)$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \qquad (1)$$

where $\beta$ is referred to as the learning rate.
In Q-learning, the so-called $\epsilon$-greedy algorithm is often adopted for action selection. For the $\epsilon$-greedy algorithm, the action $a_t = \arg\max_{a} Q(s_t, a)$ is chosen with probability $1-\epsilon$, and a random action is chosen uniformly among all possible actions with probability $\epsilon$. The random action is incorporated to prevent the algorithm from zooming in on a locally optimal policy and to allow the agent to explore a wider spectrum of actions in search of the optimal policy, particularly in the early stage of the learning process. Q-learning is a model-free learning framework in that it tries to learn the optimal policy without a model that describes the operating behavior of the environment beyond what can be observed through the experiences. In particular, it does not have knowledge of the transition probability $P(s_{t+1} \mid s_t, a_t)$.
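The tabular update (1) and $\epsilon$-greedy selection described above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`q_update`, `epsilon_greedy`), not the paper's implementation; the learning rate and discount factor values are arbitrary:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
    """One tabular Q-learning update from experience (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += beta * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a uniformly random action w.p. epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# One update on a zero-initialized table: Q[(0,1)] becomes 0.1 * (1.0 + 0) = 0.1.
Q = defaultdict(float)
q_update(Q, s=0, a=1, r=1.0, s_next=0, actions=[0, 1])
```

The `defaultdict` plays the role of the Q table, returning 0 for unseen state-action pairs.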
II-B. Deep Q-Network
It has been shown that, in a stationary environment that can be fully captured by a Markov decision process, the Q-values will converge to the optimal $Q^{*}(s,a)$ if the learning rate decays appropriately and each state-action pair $(s,a)$ is executed an infinite number of times in the process [watkins1992q]. For many real-world problems, the state-action space can be so huge that the tabular update method, which updates only one entry of $Q(s,a)$ in each time step, can take an excessive amount of time for $Q(s,a)$ to converge to $Q^{*}(s,a)$. If the environment changes in the meantime (e.g., $P(s_{t+1} \mid s_t, a_t)$ changes), convergence can never be attained. To allow fast convergence, function approximation methods are often used to approximate the Q-values [sutton2018reinforcement]. The seminal work [DQNpaper] put forth an algorithm referred to as the deep Q-network (DQN), wherein a deep neural network model is used to approximate the action-value function $Q^{*}(s,a)$. To avoid confusion between the DQN algorithm and the neural network used in the algorithm, in this paper we refer to the neural network as the Q neural network (QNN). For the same algorithm, different possible QNNs could be used. The input to a QNN is a state $s$, and the outputs are the approximated Q-values for different actions, $\{q(s,a;\theta)\}_{a \in \mathcal{A}}$, where $\theta$ is a parameter vector consisting of the weights of the edges in the neural network and $\mathcal{A}$ is the set of possible actions. At the end of time step $t$, for action execution, the $\epsilon$-greedy algorithm, wherein $Q(s_t,a)$ is replaced by $q(s_t,a;\theta)$, is adopted. For training of the QNN, its parameters $\theta$ are updated by minimizing the following loss function:

$$L(\theta) = \sum_{e_t \in \mathcal{M}} \left[ r_t + \gamma \max_{a'} q\left(s_{t+1}, a'; \theta^{-}\right) - q\left(s_t, a_t; \theta\right) \right]^2 \qquad (2)$$
In (2), two important learning techniques of DQN are embedded to stabilize the algorithm [DQNpaper]. The first is "experience replay" [lin1992self, DQNpaper]. Instead of training the QNN with a single experience at each time step, multiple experiences are pooled together for batch training. Specifically, a FIFO experience buffer is used to store a fixed number of experiences gathered from different time steps. For a round of training, a minibatch $\mathcal{M}$ consisting of random experiences is taken from the experience buffer for the computation of (2), wherein the time index $t$ denotes the time step at which that experience tuple was collected. The second technique is the use of a separate "target neural network" in the computation of the target value $r_t + \gamma \max_{a'} q(s_{t+1}, a'; \theta^{-})$ in (2). In particular, the target neural network's parameter vector is $\theta^{-}$ rather than the $\theta$ of the QNN being trained. This separate target neural network, named the target QNN, is a copy of a previously used QNN: the parameter $\theta^{-}$ of the target QNN is updated to the latest $\theta$ of the QNN once in a while.
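The interplay of minibatch sampling and the target network can be illustrated with a small sketch. The helper names (`dqn_targets`, `sample_minibatch`) and the toy two-action value table are hypothetical, standing in for a real target QNN:

```python
import random

def dqn_targets(batch, q_target, gamma=0.9):
    """Compute y_t = r_t + gamma * max_a' q(s', a'; theta^-) for a minibatch.

    q_target(s) returns the target network's Q-values, one per action.
    Each experience in the batch is a tuple (s, a, r, s_next).
    """
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def sample_minibatch(buffer, batch_size):
    """Uniformly sample a minibatch from the FIFO replay buffer."""
    return random.sample(buffer, batch_size)

# Toy "target network": fixed per-state Q-values for two actions.
q_tab = {0: [0.0, 1.0], 1: [0.5, 0.2]}
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
targets = dqn_targets(batch, lambda s: q_tab[s])
```

In a full implementation, the squared differences between these targets and the trained network's outputs would form the loss (2), and `q_tab` would periodically be refreshed from the trained network.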
III. System Model and Objective
This section first introduces the system model used in this paper. After that, we give the overall system objective.
III-A. System Model
The system model considered in this paper is inspired by the network model of the DARPA Spectrum Collaboration Challenge (SC2) [DARPAwebsite, tilghman2019will]. As illustrated in Fig. 2, the model of DARPA SC2 is composed of a collaboration network and multiple radio networks. All the radio networks share a common wireless medium. In DARPA SC2, the collaboration network is a control network separate from the wireless data network. The collaboration network allows different radio networks to exchange high-level collaborative information (e.g., the frequency spectrum used by a radio network, the throughput and quality of service observed in a radio network, etc.). Each radio network, however, does not reveal its MAC protocol to the other networks.
On the wireless data channel, the nodes in each radio network can transmit data packets to each other, whereas the nodes belonging to different radio networks do not exchange data packets. A packet is deemed to be successfully transmitted if there are no concurrent transmissions by other nodes. Otherwise, the packet is deemed to be lost due to a collision.
Each radio network operates a MAC protocol that determines the transmission strategy of its nodes. The coexisting radio networks are heterogeneous in that they may adopt different MAC protocols. Importantly, each radio network does not know the MAC protocols of other radio networks. The goal, in DARPA’s vision, is to optimize the aggregate wireless spectrum usage across all radio networks [DARPAwebsite, tilghman2019will].
An important feature of DARPA SC2's model is that in each radio network, a node is designated as a gateway for collaborative information exchange with gateways of other radio networks through the collaboration network. In this work, as will be elaborated later, we assume the collaborative information includes the transmission results of the networks, such as successes/failures of packet transmissions and packet durations. The gateway of a radio network may in turn share the transmission results of other radio networks with its own nodes. Using collaborative information, a radio network can then adjust its transmission strategy through an adaptive MAC protocol so as to achieve a certain global objective for sharing the wireless spectrum with other networks in an equitable and efficient manner. For example, if the objective is proportional fairness, the adaptive MAC protocol will aim to maximize the sum of the log throughputs of all networks [yu2019deep].
This paper focuses on the design of the MAC protocol of a particular radio network. The goal is to achieve a general global objective for wireless-spectrum sharing without knowing the MAC protocols of the other radio networks.
We assume the nodes in our radio network have carrier-sensing (CS) capability, and our MAC protocol exploits deep reinforcement learning techniques to learn a transmission strategy that can achieve the global objective. We refer to our MAC protocol as Carrier-Sense Deep-reinforcement Learning Multiple Access (CSDLMA). Our network and nodes are referred to as the CSDLMA network and CSDLMA nodes, respectively.
Carrier sensing allows radio nodes to listen to the wireless channel before transmitting their data packets so as to avoid collisions [Tanenbaum:2010:CN:1942194]. The carrier sensing operation typically takes up some time that includes the signal processing and circuit delay within a node, as well as the largest possible signal propagation delay over the air between nodes. To be effective, the carrier sensing time must be small relative to the data packet duration. In this paper, we refer to the time required for carrier sensing as a "minislot".
We consider three types of MAC protocols used by the other radio networks: (i) TDMA, (ii) ALOHA, and (iii) WiFi (more exactly, a simplified WiFi-like CSMA protocol) [Tanenbaum:2010:CN:1942194]. Among these protocols, WiFi has the capability of carrier sensing, while TDMA and ALOHA do not. For simplicity, we assume the minislots used by CSDLMA and WiFi nodes for carrier sensing are of the same duration. In addition, we assume that the packet durations of different networks are integer multiples of minislots (note that packet duration here refers to the time needed to transmit the MAC-layer packet header plus the data). Specifically, $l$, $L_W$, $L_T$, and $L_A$ represent the packet durations (in minislots) of CSDLMA, WiFi, TDMA, and ALOHA, respectively. The durations of the packet headers of all networks are assumed to be the same. In particular, the packet-header duration is a fraction $h$ of a minislot (the value of $h$ used in our later evaluations is specified there).
We allow the packet duration of CSDLMA, $l$, to vary in time as part of its adaptive strategy. The variable $l$ gives added flexibility to CSDLMA. For example, if the channel is deemed unlikely to be used by others for a long duration of time, CSDLMA can transmit a large packet with a large $l$ to reduce the packet-header overhead and carrier-sensing overhead; a small $l$, on the other hand, allows CSDLMA to squeeze a small packet transmission in between transmissions by others.
We summarize the MAC protocols of the different networks in Table I. We assume slotted operation of TDMA/ALOHA: a TDMA/ALOHA network can only initiate a transmission at the beginning of a TDMA/ALOHA slot, and the transmission ends at the end of that slot. Thus, each TDMA/ALOHA slot lasts one TDMA/ALOHA packet duration (e.g., a TDMA slot is $L_T$ minislots in duration), as indicated in Table I.
III-B. α-Fairness Objective
We adopt the general α-fairness objective as the performance metric of this paper [mo2000fair]. In particular, we assume that there are altogether $N$ nodes in the overall heterogeneous wireless networks. For a particular node $i \in \{1, \ldots, N\}$, its local utility function is given by

$$f_{\alpha}(x_i) = \begin{cases} \log x_i, & \text{if } \alpha = 1, \\[4pt] \dfrac{x_i^{1-\alpha}}{1-\alpha}, & \text{if } \alpha \geq 0,\ \alpha \neq 1, \end{cases} \qquad (3)$$

where $\alpha$ is used to specify a range of fairness criteria and $x_i$ is the throughput of node $i$.
The objective of the overall system is to maximize the sum of all the local utility functions:

$$\max_{x_1, \ldots, x_N} \; \sum_{i=1}^{N} f_{\alpha}(x_i). \qquad (4)$$
In (4), when $\alpha = 0$, the objective is to maximize the sum throughput; when $\alpha = 1$, the objective is to achieve proportional fairness; when $\alpha \to \infty$, the objective is to achieve max-min fairness [mo2000fair].
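The α-fairness utility (3) and objective (4) are straightforward to compute. The following is a minimal sketch with hypothetical function names; it assumes strictly positive throughputs, as required by the log and the negative-power cases:

```python
import math

def alpha_fairness(x, alpha):
    """alpha-fairness utility of a throughput x > 0:
    log(x) when alpha = 1, else x^(1-alpha) / (1-alpha)."""
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)

def objective(throughputs, alpha):
    """System objective: the sum of the per-node utilities."""
    return sum(alpha_fairness(x, alpha) for x in throughputs)
```

With `alpha=0` the objective reduces to the plain sum of throughputs; with `alpha=1` it becomes the sum of log throughputs (proportional fairness); large `alpha` increasingly penalizes the smallest throughput, approaching max-min fairness.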
IV. CSDLMA Framework
This section first transforms the multiple access problem faced by our CSDLMA network into an RL problem by defining the action, state, and reward, the three key components of RL. We then modify the original DQN algorithm, which deals with uniform time-step problems, and put forth a non-uniform time-step multi-dimensional DQN algorithm that realizes CSDLMA. After that, we discuss the implementation of CSDLMA. For simplicity of exposition, we assume there is only one CSDLMA node in this section; we extend the framework to the case of multiple CSDLMA nodes in Section VI.
IV-A. Action, State, and Reward
IV-A1. Action
As described in Table I, the possible decisions of a CSDLMA node include 1) performing carrier sensing and 2) transmitting a packet of a certain length. We denote the action of a CSDLMA node at time step $t$ by $a_t \in \{0, 1, \ldots, L\}$, where $L$ is the maximum packet length of CSDLMA (a "time step" here corresponds to a decision epoch of CSDLMA, and the duration of each time step can be either one minislot or multiple minislots). If $a_t = 0$, the CSDLMA node will not transmit and will only perform carrier sensing in the next minislot. The carrier sensing results in an observation of BUSY or IDLE, indicating whether or not the channel was occupied by other nodes in that minislot. If $a_t = l \in \{1, \ldots, L\}$, the CSDLMA node will transmit a packet of length $l$ in the next $l$ minislots. At the end of the transmission, SUCCESSFUL or COLLIDED will be observed, indicating whether or not the packet was successfully received. As long as another node transmits in at least one of the $l$ minislots, COLLIDED will be observed.
IV-A2. State
We first define the channel state of CSDLMA at time step $t$ as the action-observation pair $c_t = (a_t, o_t)$, where $o_t$ is the observation resulting from action $a_t$. We then define the state of CSDLMA at time step $t+1$ as $s_{t+1} = \{c_{t-M+1}, \ldots, c_{t-1}, c_t\}$, i.e., the state is the combination of the $M$ past channel states. The state history length $M$ is the number of past time steps tracked by CSDLMA.
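Maintaining such a state is just a sliding window over the most recent channel states. A minimal sketch (the helper name and the example action-observation pairs are hypothetical):

```python
from collections import deque

def make_state_tracker(M):
    """Sliding window holding the M most recent channel states.

    A deque with maxlen=M automatically discards the oldest channel state
    when a new action-observation pair is appended.
    """
    return deque(maxlen=M)

history = make_state_tracker(M=3)
for c in [(0, "IDLE"), (1, "SUCCESSFUL"), (0, "BUSY"), (0, "IDLE")]:
    history.append(c)
state = list(history)  # the RL state: the M most recent channel states
```

After the fourth append, the first pair has been evicted, so the state covers only the last three time steps.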
IV-A3. Reward
In the conventional RL framework, the reward is a scalar and the RL agent learns to maximize the cumulative discounted reward [sutton2018reinforcement], which is a weighted linear sum of the rewards in the time steps going forward. The goal of CSDLMA, however, is to achieve α-fairness among all the nodes, which in general is a nonlinear function of the individual cumulative discounted rewards (i.e., individual throughputs) of the nodes. We therefore use a reward vector to keep track of the individual rewards in each time step, from which we can obtain the individual cumulative discounted rewards for the computation of the α-fairness objective function. Specifically, after taking action $a_t$, a reward vector $\mathbf{r}_t = (r_t^1, r_t^2, \ldots, r_t^N)$ is obtained from the environment at the end of time step $t$. The element $r_t^1$ is the reward of the CSDLMA node: if the CSDLMA node successfully transmitted a packet of length $l$ in time step $t$, then $r_t^1 = l$; otherwise $r_t^1 = 0$. The element $r_t^i$, $i \in \{2, \ldots, N\}$, is the reward of node $i$ from the other networks, where $N-1$ is the total number of nodes in the other networks. If node $i$ successfully transmitted a packet of length $l^i$ in time step $t$, then $r_t^i = l^i$; otherwise $r_t^i = 0$.
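The rule "reward equals packet length on success, zero otherwise" can be sketched as follows (the function name is hypothetical; node 1 is the CSDLMA node and the rest are nodes of the other networks):

```python
def reward_vector(successes, lengths):
    """Per-node rewards for one time step.

    successes[i] is True if node i successfully delivered a packet in this
    time step; lengths[i] is that packet's duration in minislots. A node's
    reward equals the packet length on success and 0 otherwise, so the
    accumulated rewards are proportional to useful airtime (throughput).
    """
    return [l if ok else 0 for ok, l in zip(successes, lengths)]
```

For example, if nodes 1 and 3 succeed with packets of 3 and 1 minislots while node 2 fails, the reward vector is `[3, 0, 1]`.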
$$L(\theta) = \sum_{e_t \in \mathcal{M}} \sum_{i=1}^{N} \left[ r_t^i + \gamma\, q^i\!\left(s_{t+1}, a_{t+1}'; \theta^{-}\right) - q^i\!\left(s_t, a_t; \theta\right) \right]^2 \qquad (5)$$

$$a_{t+1}' = \arg\max_{a \in \{0,1\}} \sum_{i=1}^{N} f_{\alpha}\!\left( q^i\!\left(s_{t+1}, a; \theta^{-}\right) \right) \qquad (6)$$

$$L(\theta) = \sum_{e_t \in \mathcal{M}} \sum_{i=1}^{N} \left[ r_t^i\, \frac{1-\gamma^{\tau_t}}{(1-\gamma)\,\tau_t} + \gamma^{\tau_t} q^i\!\left(s_{t+1}, a_{t+1}'; \theta^{-}\right) - q^i\!\left(s_t, a_t; \theta\right) \right]^2 \qquad (7)$$

$$a_{t+1} = \begin{cases} 0, & \text{if } o_t \in \{\text{BUSY}, \text{SUCCESSFUL}, \text{COLLIDED}\}, \\[4pt] \arg\max_{a \in \{0,1,\ldots,L\}} \sum_{i=1}^{N} f_{\alpha}\!\left( q^i\!\left(s_{t+1}, a; \theta\right) \right), & \text{w.p. } 1-\epsilon, \text{ if } o_t = \text{IDLE}, \\[4pt] \text{uniformly random } a \in \{0,1,\ldots,L\}, & \text{w.p. } \epsilon, \text{ if } o_t = \text{IDLE}. \end{cases} \qquad (8)$$
IV-B. Non-Uniform Time-Step Multi-Dimensional DQN
In our earlier work [yu2019deep], we put forth a multi-dimensional DQN framework for DLMA that deals with time steps of uniform duration. Specifically, in [yu2019deep], none of the involved MACs had carrier-sensing functionality, and time-slotted systems with time slots of fixed duration were considered. The duration of a time slot in [yu2019deep] corresponds to the duration of a packet transmission. By contrast, the current work extends the multi-dimensional DQN of [yu2019deep] to scenarios in which the time slots for carrier sensing (i.e., the minislots in this work) are of smaller duration than the time slots for packet transmissions. For this extension, we need to modify the discounting mechanism and the action selection method of conventional DQN. We lay out the principles for these modifications here.
In conventional DQN [DQNpaper], the outputs of the neural network are the approximated Q-values for different actions, $\{q(s,a;\theta)\}_{a \in \mathcal{A}}$, where the Q-value $q(s,a;\theta)$ is the approximated cumulative discounted reward of a state-action pair $(s,a)$. In the multi-dimensional DQN of [yu2019deep], the outputs of the neural network are a vector $\left(q^1(s,a;\theta), \ldots, q^N(s,a;\theta)\right)_{a \in \{0,1\}}$, where $q^1$ is the approximated cumulative discounted reward of the DLMA node, $q^i$, $i \in \{2,\ldots,N\}$, is the approximated cumulative discounted reward of node $i$ from the other networks, and $a = 0$/$a = 1$ corresponds to "NOT Transmit"/"Transmit" (the number of actions in [yu2019deep] is two). Furthermore, the experience tuple is augmented to $e_t = (s_t, a_t, \mathbf{r}_t, s_{t+1})$, wherein $\mathbf{r}_t$ is a vector consisting of the individual rewards of the different nodes in the heterogeneous network, as opposed to the scalar reward of a single entity in conventional DQN. With the above two modifications, in [yu2019deep], the loss function (2) was rewritten as (5), and $a_{t+1}'$ in (5) was chosen according to (6).
In this paper, for the study of CSDLMA, the time duration of each time step, i.e., the duration of each action $a_t$, is non-uniform because the packet length of the CSDLMA node can vary (recall that $a_t \in \{0,1,\ldots,L\}$). The discounting mechanism in (5) must be modified to take the non-uniform time steps into account. In particular, large time steps need to be discounted more than small time steps because the former extend further into the future.
To extend the uniform time-step multi-dimensional DQN of [yu2019deep] to non-uniform time-step multi-dimensional DQN, the outputs of the QNN are modified to $\left(q^1(s,a;\theta), \ldots, q^N(s,a;\theta)\right)_{a \in \{0,1,\ldots,L\}}$, i.e., the outputs of the QNN are the approximated Q-values for the different actions and different nodes. In addition, we let $\tau_t$ denote the time duration of action $a_t$ in terms of the number of minislots. Specifically, $\tau_t = 1$ if the CSDLMA node performs carrier sensing over one minislot, and $\tau_t = l$ if the CSDLMA node transmits a packet of duration $l$ minislots. We then augment the experience to $e_t = (s_t, a_t, \tau_t, \mathbf{r}_t, s_{t+1})$. Finally, the loss function (5) is modified to (7). Note that $a_{t+1}'$ in (7) is the same as in (6), except that the set of possible actions is $\{0,1,\ldots,L\}$ rather than $\{0,1\}$.
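The non-uniform discounting can be sketched for a single node as follows. This is an illustration with a hypothetical function name, assuming the amortized form in which the step's reward is spread over its $\tau$ minislots with minislot-by-minislot discounting and the bootstrap value is discounted by $\gamma^{\tau}$:

```python
def nonuniform_target(r, tau, next_q, gamma=0.9):
    """Non-uniform time-step DQN target for one node.

    The reward r earned over a step of tau minislots is amortized as
    (r / tau) * (1 + gamma + ... + gamma^(tau - 1)), and the bootstrap
    value next_q is discounted by gamma^tau, since the next state lies
    tau minislots in the future.
    """
    amortized = (r / tau) * sum(gamma ** k for k in range(tau))
    return amortized + gamma ** tau * next_q
```

For `tau = 1` this reduces to the familiar uniform-step target `r + gamma * next_q`, so the uniform time-step formulation is recovered as a special case.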
We can write $r_t^i \frac{1-\gamma^{\tau_t}}{(1-\gamma)\tau_t}$ in (7) as $\frac{r_t^i}{\tau_t}\left(1 + \gamma + \cdots + \gamma^{\tau_t - 1}\right)$, corresponding to amortizing the reward of a non-uniform time step over its $\tau_t$ minislots with minislot-by-minislot discounting. The training of the non-uniform time-step multi-dimensional DQN can then be done by minimizing the loss function (7) using stochastic gradient descent [lecun2015deep]. For action selection in CSDLMA, we put forth a carrier-sense $\epsilon$-greedy algorithm. Suppose that at the beginning of time step $t+1$, the state of the CSDLMA node is $s_{t+1}$ and the CSDLMA node needs to select an action $a_{t+1}$. The carrier-sense $\epsilon$-greedy algorithm that decides $a_{t+1}$ is given by (8).
We now explain (8) line by line. The first line of (8) is a result of the carrier sensing mechanism: a node operating carrier sensing needs to sense the channel to be idle before it can transmit, and the sensing operation takes a nonzero amount of time (one minislot in our system, since even if sensing could be completed in less than one minislot, the node would still need to wait for the next minislot boundary to begin transmission if the channel is sensed to be idle). If the channel was not idle in time step $t$, then the CSDLMA node cannot transmit in time step $t+1$. In time step $t$, the channel could be non-idle either because the CSDLMA node itself was transmitting, or because another node was transmitting and the CSDLMA node sensed the medium to be busy, i.e., the observation was BUSY, SUCCESSFUL, or COLLIDED.
The second and third lines of (8) describe action selection in time step $t+1$ when the CSDLMA node did not transmit in time step $t$ and sensed the medium to be idle (i.e., other nodes did not transmit either). In this case, the CSDLMA node can decide to transmit or not to transmit in time step $t+1$; if it decides to transmit, it also needs to decide the packet length of the transmission. With an $\epsilon$-greedy algorithm, with probability $\epsilon$ the choice is made uniformly at random, as in the third line of (8). The second line of (8) is a departure from the conventional $\epsilon$-greedy DQN algorithm: in conventional DQN, with probability $1-\epsilon$ the action that yields the maximum Q-value is selected [DQNpaper]. To capture the essence of the α-fairness objective, our multi-dimensional DQN selects the action that maximizes an α-fair nonlinear combination of the different Q-values.
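The three-case rule just described can be sketched as follows. The function and parameter names are hypothetical; `fair` stands for the α-fair combination of the per-node Q-values (e.g., the sum of their α-fairness utilities):

```python
import random

def cs_epsilon_greedy(last_observation, q_values, fair, epsilon=0.1):
    """Carrier-sense epsilon-greedy action selection (a sketch).

    last_observation is the previous time step's observation; BUSY,
    SUCCESSFUL, or COLLIDED all mean the channel was not idle, forcing
    a = 0 (carrier sensing only). Otherwise, with probability 1 - epsilon
    pick the action maximizing the fair combination of the per-node
    Q-values; with probability epsilon pick an action at random.
    q_values[a] is the list of per-node Q-values for action a.
    """
    if last_observation in ("BUSY", "SUCCESSFUL", "COLLIDED"):
        return 0
    actions = list(range(len(q_values)))
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: fair(q_values[a]))
```

Note that the greedy branch maximizes `fair(q_values[a])` rather than a single Q-value, which is the departure from conventional DQN described above.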
IV-C. CSDLMA Implementation
Fig. 3 shows the overall DQN architecture that realizes CSDLMA.² We now describe three key components of the architecture: 1) the neural network, 2) the experience buffer, and 3) continuous experience replay.

²The simulation codes of CSDLMA are partly open-sourced: https://github.com/YidingYu/CSDLMA.

IV-C1. Neural Network
The neural network, i.e., the QNN, used in the non-uniform time-step multi-dimensional DQN is a recurrent neural network (RNN). The RNN consists of an input layer, two hidden layers, and an output layer. The input to the RNN is the current state. The two hidden layers consist of a long short-term memory (LSTM) [hochreiter1997long] layer and a feedforward layer. The outputs are the approximated Q-values for the different actions and different nodes given the input state. Instead of an RNN, a feedforward neural network (FNN) could also be used, wherein the hidden layers are all pure feedforward layers. Fig. 4 shows the difference between the FNN-based QNN and the RNN-based QNN in processing the state $s_t$ received from the input layer at time step $t$. After receiving $s_t$, the FNN processes it directly; by contrast, the RNN processes the elements (channel states) in $s_t$ sequentially, keeping an internal state as it injects the elements one by one into the input. In this way, the causal relationship between the elements in $s_t$ (e.g., $c_{t-2}$ precedes $c_{t-1}$) is explicitly embedded in the way the RNN processes the input [hochreiter1997long]. On the other hand, this causal relationship is not explicitly given to the FNN; the FNN will need to learn it, if it manages to learn it at all.
IV-C2 Experience Buffer
For implementation, it is inefficient to store experiences in the full form (state, action, reward, next state), since two consecutive experiences have many common elements. In particular, the next state is merely a time-shifted version of the current state, with the head-end discarded and a new tail-end appended; it is superfluous to store the overlapped elements for both experiences. A more efficient implementation is to store an abbreviated experience containing only the new action-observation-reward information of each time step. The complete experience can then be reconstructed from consecutive abbreviated experiences by means of continuous experience replay.
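A minimal sketch of such a buffer, storing only a per-time-step (action, observation, reward) triple in FIFO fashion. The class name and triple layout are illustrative assumptions:

```python
from collections import deque

class ExperienceBuffer:
    """FIFO buffer of abbreviated experiences.

    Instead of storing the full (state, action, reward, next state) tuple,
    whose consecutive states overlap heavily, each entry keeps only the new
    information generated at one time step: (action, observation, reward).
    """
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first

    def add(self, action, observation, reward):
        self.buffer.append((action, observation, reward))

    def __len__(self):
        return len(self.buffer)
```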
IV-C3 Continuous Experience Replay
In conventional experience replay [DQNpaper], random experiences are sampled from the experience buffer to compute the loss function, with each sample being a full experience (state, action, reward, next state). After downsizing each stored entry to an abbreviated experience, we instead sample continuous runs of experiences to extract the information necessary for computing the loss function (7). As illustrated in Fig. 5, each sample contains consecutive abbreviated experiences, from which the state, action, reward, and next state of a full experience can be extracted.
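A sketch of how one full experience can be rebuilt from consecutive abbreviated (action, observation, reward) entries. The triple layout and function name are illustrative assumptions; the exact fields extracted in the paper's Fig. 5 may differ:

```python
import random

def sample_continuous(buffer, history_len, rng=random):
    """Rebuild one full experience (s, a, r, s') from consecutive abbreviated
    entries (action, observation, reward) in `buffer`.

    The state is a run of `history_len` consecutive action-observation pairs;
    the transition is the entry that immediately follows them.
    """
    # need history_len entries for s plus one more for (a, r) and s'
    start = rng.randrange(len(buffer) - history_len)
    window = list(buffer)[start:start + history_len + 1]
    state = [(a, o) for (a, o, r) in window[:history_len]]
    action, observation, reward = window[history_len]
    next_state = state[1:] + [(action, observation)]  # time-shifted state
    return state, action, reward, next_state
```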
V Performance Evaluation
This section evaluates the performance of CS-DLMA. After introducing the simulation setup, we first investigate the coexistence of CS-DLMA with TDMA and ALOHA, two MAC protocols without carrier sensing. Following that, we investigate the coexistence of CS-DLMA with WiFi, a MAC protocol with carrier sensing. For concreteness, this paper focuses on saturated networks, i.e., all the nodes in the networks always have packets to transmit. In addition, since we have no control over TDMA, ALOHA, and WiFi, we assume the packet lengths of these nodes are fixed in our evaluation.
V-A Simulation Setup
V-A1 Hyperparameters
We adopt the RNN architecture in CS-DLMA unless stated otherwise (we motivate the use of the RNN by comparing its performance with that of the FNN in Appendix A). As shown in Fig. 3, the RNN architecture has two hidden layers: one LSTM layer followed by one feedforward layer. Each layer has 64 neurons and the activation functions are ReLU [lecun2015deep]. Since we assume CS-DLMA does not know the mechanisms of the coexisting MACs, we use a relatively large state history length M to cover a longer history so as to learn the behavior of potentially complex MACs. Specifically, for our simulations, we set M = 20, i.e., the state of CS-DLMA covers the action-observation pairs of the past 20 time steps. The value of ε in the carrier-sense ε-greedy algorithm is initially set to 1 and decays at a rate of 0.995 per time step until it reaches 0.005, i.e., ε is updated as ε ← max(0.995ε, 0.005) in each time step. The discount factor γ in (7) is set to 0.999. The size of the experience buffer is 1000 and the buffer is updated in a FIFO manner [yu2019deep]. The RMSProp algorithm [tieleman2012lecture] is used to conduct minibatch gradient descent over the loss function (7). The minibatch size is set to 32. The target network is updated every 20 time steps. Table II summarizes the values of the hyperparameters.
Hyperparameter  Value
State history length M  20
ε in carrier-sense ε-greedy algorithm  1 to 0.005
Discount factor γ  0.999
Experience buffer size  1000
Experience-replay minibatch size  32
Target network update frequency (time steps)  20
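The ε schedule described above (multiplicative decay at 0.995 per time step, floored at 0.005) can be written as a one-line update:

```python
def decay_epsilon(epsilon, rate=0.995, floor=0.005):
    """One time step of the exploration schedule: multiplicative decay to a floor."""
    return max(floor, epsilon * rate)

# starting from 1, epsilon shrinks by 0.5% per time step until it reaches 0.005
eps = 1.0
for _ in range(2000):
    eps = decay_epsilon(eps)
```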
V-A2 Performance Metric
We evaluate the performance of CS-DLMA by examining whether the objective in (4) can be achieved. In particular, we define the "throughput" of a node at a given time step by
(9)
where the rewards of the node at the ends of the time steps are normalized by the time durations of the corresponding actions, measured in number of minislots. The throughput here is thus an average reward per unit time and reflects the performance of each node in the long run.
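Reading the metric as cumulative reward normalized by elapsed minislots, it can be sketched as follows. This is a hedged interpretation; the exact windowing in (9) may differ:

```python
def throughput(rewards, durations):
    """Average reward per minislot up to the current time step:
    the sum of one node's rewards divided by the total elapsed minislots.

    rewards:   per-time-step rewards of one node
    durations: per-time-step action durations, in minislots
    """
    total_time = sum(durations)
    return sum(rewards) / total_time if total_time else 0.0
```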
V-B CS-DLMA Coexists with TDMA and ALOHA
This subsection investigates the coexistence of one CS-DLMA node with one TDMA node and one ALOHA node. We first introduce the settings of each node. We then examine whether CS-DLMA can achieve the general α-fairness objective when coexisting with TDMA and ALOHA.
In our experimental setup, the TDMA node occupies the second and the fifth slots within a TDMA frame of five slots; the ALOHA node transmits with a fixed probability in each ALOHA slot. The packet lengths of TDMA and ALOHA are both fixed at 10 minislots. The CS-DLMA node runs our CS-DLMA protocol and can transmit packets of variable length, up to a maximum of 10 minislots. For benchmarking, in place of the model-free CS-DLMA node, we imagine a model-aware node that is aware of the packet lengths as well as the MAC mechanisms of TDMA and ALOHA. As for CS-DLMA, we also assume that the packet length of the model-aware node can vary from 1 to 10 minislots. The optimal strategy of the model-aware node, summarized below, achieves the general α-fairness objective:
At the beginning of each TDMA/ALOHA slot, the model-aware node performs carrier sensing. If the channel is idle, the model-aware node transmits in the next 9 minislots; if the channel is busy (i.e., either TDMA or ALOHA is transmitting), the model-aware node keeps silent in the next 9 minislots.
A point to note here is that the optimal strategy of the model-aware node is the same for different α values. The detailed analysis is provided in Appendix B.
We now examine whether CS-DLMA can find the optimal strategies for different α values without being aware of the MACs of TDMA and ALOHA. Fig. 6 plots the individual throughputs of CS-DLMA, TDMA, and ALOHA achieved by CS-DLMA, as well as the corresponding optimal individual throughputs achieved by the model-aware node. As can be seen from Fig. 6, for different α values, the individual throughputs of each node all approximate the corresponding optimal results, indicating that CS-DLMA can indeed find strategies that achieve the fairness objective for different α values.
V-C CS-DLMA Coexists with WiFi
We next investigate the coexistence of one CS-DLMA node with one WiFi node. The CS-DLMA node is the same as in Section V-B. The WiFi node uses the following settings: the packet length is fixed at 10 minislots; the initial window size is 2; the maximum backoff stage is 6.
We first present the individual throughputs of CS-DLMA and WiFi for different α values. As can be seen from Fig. 7, as the value of α increases from 0 to 50, the throughputs of CS-DLMA and WiFi get closer. In particular, when α = 0, CS-DLMA aims to maximize the sum throughput, and the strategy found by CS-DLMA is a greedy one, i.e., CS-DLMA always transmits if the channel is sensed idle; as α increases, CS-DLMA becomes less aggressive and leaves more opportunities for WiFi until the throughputs of CS-DLMA and WiFi are almost equal. This demonstrates that CS-DLMA can indeed adjust its strategy according to the value of α.
For comparison purposes, we replace CS-DLMA with p-persistent CSMA (p-CSMA) [Tanenbaum:2010:CN:1942194] in the above experiment, i.e., we consider the coexistence of p-CSMA with WiFi. If the channel is sensed idle, p-CSMA transmits a packet with probability p. The value of p can be adjusted to achieve different throughput allocations between p-CSMA and WiFi when they coexist.
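The p-CSMA transmit rule is essentially a one-liner. The sketch below assumes an idealized carrier-sense indication supplied by the caller:

```python
import random

def p_csma_step(channel_idle, p, rng=random):
    """One decision of p-persistent CSMA: if the channel is sensed idle,
    transmit with probability p; if busy, stay silent."""
    return channel_idle and rng.random() < p
```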
Fig. 8 plots the throughputs of CS-DLMA/p-CSMA versus WiFi. Specifically, in Fig. 8, the x-axis is the throughput of WiFi and the y-axis is the throughput of CS-DLMA/p-CSMA. Each circle corresponds to the throughput allocation achieved by CS-DLMA with a particular α; each square corresponds to the throughput allocation achieved by p-CSMA with a particular p. As can be seen from Fig. 8, CS-DLMA achieves a Pareto improvement [myerson2013game] over p-CSMA when coexisting with WiFi. Interestingly, if we also plot the individual throughputs of two homogeneous WiFi nodes, denoted by the red star in Fig. 8, we find that CS-DLMA also achieves a Pareto improvement over WiFi itself when coexisting with WiFi.
An intuitive reason why CS-DLMA manages to obtain more Pareto-efficient performance than p-CSMA is that the CS-DLMA node looks at a longer state history before deciding on its next action, while the p-CSMA node does not (its effective state history length is one). Since the behavior of the WiFi node is not Markovian, in that its behavior depends not just on whether it is currently transmitting but also on its experiences stretching further into the past, having access to a longer state history helps.
VI Multi-Node CS-DLMA Framework
Section IV introduced the CS-DLMA framework with only one CS-DLMA node. This section generalizes the one-node CS-DLMA framework to a multi-node CS-DLMA framework, with which we investigate the coexistence of multiple CS-DLMA nodes with multiple other nodes.
Revisiting the system model introduced in Section III-A, recall that although the CS-DLMA network has no control over other networks, CS-DLMA nodes running the same protocol can be coordinated. For the multi-node CS-DLMA framework studied here, we put forth a CS-DLMA protocol that enables the CS-DLMA network to achieve the α-fairness objective. In particular, we assume there is a CS-DLMA gateway associated with the CS-DLMA nodes in the CS-DLMA network. The gateway is responsible for coordinating the operations of the CS-DLMA nodes so that they coexist among themselves, and with nodes running other protocols, in a manner that meets the fairness objective.
(10) 
(11) 
(12) 
If the CS-DLMA gateway decides to perform carrier sensing, it listens to the channel and checks whether the channel is occupied by nodes from other networks; if the CS-DLMA gateway decides to transmit a packet, it selects one of the CS-DLMA nodes in a round-robin manner to transmit (the CS-DLMA gateway itself is also a CS-DLMA node). The instruction from the CS-DLMA gateway to the other CS-DLMA nodes can be sent over a control channel within the CS-DLMA network. For example, the control channel can be implemented as a "short time slot" before each packet transmission. The duration of this "short time slot" can be smaller than a minislot and is neglected in the performance evaluation.³
³For concreteness and simplicity, we focus on a design with centralized coordination of all CS-DLMA nodes by a gateway. Decentralized coordination is also possible. For example, if all the CS-DLMA nodes run the same algorithm as the gateway algorithm described in this paper, and all CS-DLMA nodes have the same observations, then the CS-DLMA nodes will be in consensus as to the action to be taken by the CS-DLMA network next (i.e., whether a CS-DLMA node should transmit and, if so, which one). With a decentralized implementation, there is no need for a control channel for a central controller (gateway) to send instructions to the CS-DLMA nodes. However, how to ensure consensus among the CS-DLMA nodes, taking into consideration possible discrepancies in their observations, would be a key issue.
We now transform the multiple access problem faced by the CS-DLMA network into a reinforcement learning problem. In particular, our multi-node CS-DLMA framework is the same as the one-node CS-DLMA framework except for the following modifications:
VI-1 Action
At the beginning of each time step, the CS-DLMA gateway decides on an action. If the action is to sense, the gateway performs carrier sensing in the next minislot; it then obtains an observation, BUSY or IDLE, indicating whether or not the channel is occupied by other nodes. If the action is to transmit, the gateway selects one CS-DLMA node in a round-robin manner to transmit a packet of the chosen length over the corresponding number of minislots; it then obtains an observation, SUCCESSFUL or COLLIDED, indicating whether or not the packet is successfully received.
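The gateway's round-robin selection of the transmitting node can be sketched as follows. The class name is illustrative, with node 0 playing the role of the gateway itself:

```python
class Gateway:
    """Round-robin scheduling of transmit opportunities among CS-DLMA nodes
    (the gateway itself is node 0)."""
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.next_node = 0

    def pick_transmitter(self):
        """Return the node that transmits this opportunity, then advance the
        round-robin pointer so opportunities are shared evenly."""
        node = self.next_node
        self.next_node = (self.next_node + 1) % self.num_nodes
        return node
```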
VI-2 Reward
After taking the action, the CS-DLMA gateway obtains a reward vector from the environment at the end of the time step. The first element of the vector is the reward of the CS-DLMA network as a whole: if any CS-DLMA node successfully transmitted a packet in the time step, the reward equals the packet length; otherwise the reward is zero. The rewards of the nodes from other networks have the same definitions as in Section IV-A.
VI-3 Non-Uniform Time-Step Multi-Dimensional DQN
The outputs of the neural network in the non-uniform time-step multi-dimensional DQN keep the same form, but the first output is now the approximated cumulative discounted reward of the CS-DLMA network as a whole, rather than that of one particular CS-DLMA node (this modification is consistent with the definition of reward for multi-node CS-DLMA). The loss function is now given by (10).
Note that (10) has the same form as (7), but the index refers to the CS-DLMA network rather than a particular CS-DLMA node. In addition, the target value in (10) differs from that in (6) and is given by (11).
The first term in (11) is the utility function of the CS-DLMA network, where N is the number of CS-DLMA nodes and the network's Q value divided by N can be regarded as the approximated cumulative discounted reward of each individual CS-DLMA node. The second term is the sum of the utility functions of all the nodes from other networks.
VI-4 Carrier-Sense ε-Greedy Algorithm
VII Multi-Node CS-DLMA Performance Evaluation
This section evaluates the performance of the multi-node CS-DLMA framework. We first consider the coexistence of two CS-DLMA nodes with one WiFi node, to examine whether the multi-node CS-DLMA framework can adjust its transmission strategy according to both the value of α and the number of CS-DLMA nodes. One of the two CS-DLMA nodes is designated as the gateway. As in Sections V-B and V-C, we assume CS-DLMA nodes can transmit packets of variable length, up to a maximum of 10 minislots. The settings of the WiFi node are the same as in Section V-C.
Fig. 9 plots the sum throughput of the two CS-DLMA nodes and the throughput of the WiFi node. As can be seen from Fig. 9, as α increases, the sum throughput of CS-DLMA and the throughput of WiFi get closer. Specifically, when α = 50, the sum throughput of CS-DLMA is twice the throughput of WiFi, which means the throughput of each CS-DLMA node is equal to the throughput of WiFi. This is consistent with our observation in Fig. 7 that when α = 50, the throughput of one CS-DLMA node is equal to the throughput of one WiFi node. This demonstrates that our multi-node CS-DLMA formulation can adjust the weight of CS-DLMA according to the number of CS-DLMA nodes.
To further demonstrate the performance of the multi-node CS-DLMA framework, we now consider three coexistence scenarios:

four CS-DLMA nodes with four WiFi nodes;

four p-CSMA nodes with four WiFi nodes;

eight WiFi nodes.
In scenario 1), the value of α is set to 50, i.e., we want to achieve equal throughputs among the four CS-DLMA nodes and four WiFi nodes; in scenario 2), each p-CSMA node adopts the same value of p, and we adjust p so that the throughput of each p-CSMA node equals the throughput of each WiFi node; in scenario 3), the eight WiFi nodes are homogeneous. In addition, CS-DLMA, p-CSMA, and WiFi all adopt the same settings as in Section V-C.
Fig. 10 presents the individual throughputs of each node in the above three scenarios. Overall, roughly equal throughputs among all nodes are achieved in all three scenarios. However, the per-node throughput in scenario 1) is noticeably higher than in scenarios 2) and 3).
VIII Conclusion
In this paper, we developed a deep reinforcement learning multiple access protocol with carrier sensing capability, referred to as CS-DLMA. The goal of CS-DLMA is to enable efficient and equitable spectrum sharing among a group of co-located heterogeneous wireless networks. A salient feature of CS-DLMA is that it can coexist harmoniously with other MAC protocols in a heterogeneous environment without knowing the MAC details of the other networks. In particular, we demonstrated that CS-DLMA can achieve a general α-fairness objective [mo2000fair] when coexisting with the TDMA, ALOHA, and WiFi protocols by adjusting its own transmission strategy. Interestingly, we also found that CS-DLMA is more Pareto efficient than other CSMA protocols, e.g., p-persistent CSMA, when coexisting with WiFi.
The underpinning DRL technique in CS-DLMA is the deep Q-network (DQN). However, the original DQN and its extension, multi-dimensional DQN [yu2019deep], are not applicable to CSMA protocol design because of the underlying uniform time-step assumption in the DQN framework: for CSMA protocols, time steps are non-uniform in that the duration of carrier sensing is smaller than the duration of data transmission. In this paper, we introduced a non-uniform time-step formulation of DQN to address this issue. Although we focused on the use of the modified DQN algorithm for wireless networking, we believe the non-uniform time-step DQN can also find use in other domains, e.g., the Treasury bond investment problem mentioned in this paper.
The CS-DLMA framework in this paper assumes a saturated scenario in which all the nodes always have packets to transmit. This will be the case, for example, when the nodes are transmitting large files containing many packets. In other practical scenarios, some nodes may be unsaturated in that they only have packets to transmit intermittently. It will be of interest, in future work, to investigate CS-DLMA variants that can deal with heterogeneous networks containing a mix of saturated and unsaturated nodes.
Appendix A
This appendix compares the performance of the RNN and the FNN in the CS-DLMA design. In Section V-B, we showed that CS-DLMA with the RNN architecture can find the optimal strategies for different α values. In this appendix, we again consider the coexistence of one CS-DLMA node with one TDMA node and one ALOHA node. The settings are the same as in Section V-B except that we use the FNN architecture instead of the RNN in CS-DLMA. In particular, the FNN with two hidden layers is the same as the RNN introduced in Section V-A except that we replace the LSTM layer with a feedforward layer. For FNNs with more hidden layers (e.g., 10, 20, and 40), we adopt the residual network structure as in [yu2019deep]. The reason for using the residual network structure is to avoid the potential training degradation associated with large numbers of hidden layers [he2016deep].
Fig. 11 presents the individual throughputs of CS-DLMA, TDMA, and ALOHA, together with the corresponding optimal results. In particular, different rows in Fig. 11 correspond to CS-DLMA with different numbers of hidden layers; different columns test the performance of CS-DLMA for different α values. As can be seen from Fig. 11, CS-DLMA with the FNN fails to find the optimal strategies in most cases, whereas, from Fig. 6 in Section V-B, we can see that CS-DLMA with the RNN finds the optimal strategies for the different α values.
As mentioned earlier in Section IV-C, the causal relationship between the elements in the input is explicitly modeled by the RNN but not by the FNN. We conjecture that this allows the RNN to search within a narrower solution space for a good solution (i.e., the RNN only needs to learn within a smaller space, allowing it to learn a good solution in a more focused manner).
Appendix B
This appendix derives the benchmark for the case of one CS-DLMA node coexisting with one TDMA node and one ALOHA node. These nodes adopt the settings introduced in Section V-B: the CS-DLMA node can transmit packets of variable length, up to a maximum of 10 minislots; the TDMA node occupies the second and the fifth slots within a TDMA frame of five slots; the ALOHA node transmits with a fixed probability in each ALOHA slot; and the packet durations of TDMA and ALOHA are both fixed at 10 minislots.
To derive the benchmark, we imagine a model-aware node that is aware of the MAC details as well as the packet durations of TDMA and ALOHA. We replace the CS-DLMA node with this model-aware node in the setting described in the previous paragraph and examine the network performance that the model-aware node can achieve. Given that the packet durations of TDMA and ALOHA are the same, we assume that the TDMA slots and the ALOHA slots are aligned in time. In the rest of this appendix, "slot" refers to a TDMA/ALOHA slot.
The transmission pattern of TDMA is fixed and not probabilistic. We can therefore divide the slots into two categories according to the usage pattern of TDMA: 1) slots occupied by TDMA and 2) slots not occupied by TDMA. For category 1), the optimal strategy of the model-aware node is not to transmit, for any value of α (transmissions by the model-aware node in these slots would result in collisions and would not contribute to the throughput of TDMA, ALOHA, or the model-aware node). For category 2), the problem simplifies to the coexistence of the model-aware node with one ALOHA node.
In general, when coexisting with the ALOHA node, the model-aware node has two candidate strategies, one of which is the optimal strategy for any particular value of α. The two strategies are as follows:

Greedy strategy: the model-aware node transmits in all slots of category 2), which drives the throughput of the ALOHA node to zero.

Polite strategy: the model-aware node first performs carrier sensing in the first minislot of a slot and then decides whether to transmit in the next 9 minislots based on the carrier sensing result: if the channel is sensed idle (i.e., ALOHA is not transmitting), the model-aware node transmits a packet in the next 9 minislots; if the channel is sensed busy (i.e., ALOHA is transmitting a packet in the current slot), the model-aware node keeps silent in the next 9 minislots.
We can calculate the individual throughputs of the model-aware node and the ALOHA node within a category-2) slot for these two strategies; the results are summarized in Table III.
Strategy  Model-aware node  ALOHA node
Greedy strategy  –  0
Polite strategy  0.425  0.475
From the results shown in Table III, it follows that the polite strategy is the optimal strategy for any value of α. Therefore, the optimal strategy of the model-aware node when coexisting with one TDMA node and one ALOHA node under the settings of Section V-B can be summarized as follows:
At the beginning of each TDMA/ALOHA slot, the modelaware node performs carrier sensing. If the channel is idle, the modelaware node transmits in the next 9 minislots; if the channel is busy, the modelaware node keeps silent in the next 9 minislots.
Based on the above strategy, the individual throughputs of the model-aware node, the TDMA node, and the ALOHA node can be calculated as 0.255, 0.19, and 0.285, respectively.