This paper considers the problem of efficient and equitable spectrum sharing among a group of co-located heterogeneous wireless networks. These networks may adopt different medium access control (MAC) protocols and they do not know the MAC protocols of other networks. This scenario is envisioned by DARPA Spectrum Collaboration Challenge (SC2) competition as a future spectrum sharing paradigm [DARPAwebsite, tilghman2019will]. In this futuristic scenario, unlike in the conventional cognitive radio, all users/networks are on equal footing in that they are not divided into primaries and secondaries. When sharing the spectrum in an efficient and equitable manner, each network must respect spectrum usages by other networks in that it must not hog the spectrum to the detriment of other networks. A major challenge for one particular network is how to coexist with other networks without knowing the MACs of other networks while achieving efficient and equitable spectrum usage among all networks.
Widely used wireless MAC protocols today are often designed for homogeneous networks in which all nodes use the same MAC. A case in point is WiFi, which adopts a particular form of carrier-sense multiple-access (CSMA) with collision avoidance [Tanenbaum:2010:CN:1942194]. The carrier sensing and binary exponential backoff mechanisms of WiFi [bianchi2000performance, liew2010back] work well only if all nodes in the network adopt the same mechanisms. They do not work well in heterogeneous networks. To illustrate, consider the coexistence of a WiFi node and a node operating the time division multiple access (TDMA) protocol. The TDMA node transmits in specific time slots in a frame consisting of multiple time slots, in a repetitive manner from frame to frame, as illustrated in Fig. 1. In particular, the TDMA channel access pattern is oblivious of the MAC of WiFi; similarly, the MAC of WiFi is oblivious of the TDMA channel access pattern. As shown in Fig. 1, the WiFi node may sense the channel to be idle and decide to transmit a packet, only to have the TDMA node transmit a packet shortly thereafter to result in a collision. This leads to inefficiency of the spectrum usage in the heterogeneous network setting.
A goal of this paper is to circumvent this problem with a new class of CSMA protocols based on deep reinforcement learning (DRL) [sutton2018reinforcement]. DRL is a machine learning technique that combines deep learning and reinforcement learning. DRL has had success in solving a wide range of complex decision-making tasks, including video game playing, robotic control, smart grid management, and wireless communication [li2017deep, DQNpaper, silver2016mastering, gu2017deep, ruelens2016residential, luong2019applications, sun2019application]. A salient feature of our DRL-based MAC protocol is that it does not need to know the operation mechanism of the coexisting MACs; it learns to coexist harmoniously with other MACs by trial and error. Throughout this paper, the DRL-based MAC protocol is referred to as Carrier-Sense Deep-reinforcement Learning Multiple Access (CS-DLMA). The nodes operating CS-DLMA are referred to as CS-DLMA nodes and the corresponding radio network is referred to as the CS-DLMA network.
In general, CS-DLMA can have different objectives when coexisting with other MACs, e.g., maximize sum throughput, achieve proportional fairness, or achieve max-min fairness [mo2000fair]. For generality, this paper adopts $\alpha$-fairness [mo2000fair] as the objective of CS-DLMA. With $\alpha$-fairness, CS-DLMA can achieve a range of different objectives by changing the value of $\alpha$, including the aforementioned objectives. We show that CS-DLMA can achieve near-optimal results with respect to different $\alpha$ values when coexisting with other MAC protocols, such as TDMA, ALOHA, and WiFi. Moreover, we demonstrate that CS-DLMA is more Pareto efficient [myerson2013game] than p-persistent CSMA [Tanenbaum:2010:CN:1942194] when coexisting with WiFi.
The underpinning DRL technique in CS-DLMA is deep Q-network (DQN) [DQNpaper], developed by DeepMind to achieve superhuman-level performance in playing Atari games. However, the original DQN in [DQNpaper] is not directly applicable for our purpose for two reasons:
1) The original DQN only aims to maximize the cumulative discounted “rewards”, i.e., the objective or “return” to be optimized is a weighted linear sum of the rewards in consecutive time steps [sutton2018reinforcement]. This does not fit the $\alpha$-fairness objective, which in general is a nonlinear combination of utility functions.
2) The original DQN is built on a discrete-time framework wherein an underlying assumption is that the time steps are of uniform duration. Specifically, this implicit assumption is made in the way that it discounts the rewards from time step to time step in a uniform manner. For CSMA protocols, the time slots are non-uniform in nature in that the minislots used for carrier sensing are of smaller duration than time slots used for data transmission.
Our previous work [yu2019deep] put forth a multi-dimensional DQN algorithm to solve issue 1). This paper introduces a non-uniform time-step formulation in DQN to address issue 2). The key idea in non-uniform time-step DQN is that we need to discount “reward” according to the duration of each time step.
We remark that our non-uniform time-step DQN formulation is also potentially applicable to other decision-making problems. For example, in the problem of Treasury bond investment, the maturity dates and the interest rates of different bonds may be different. If the investment strategy is to decide the maturity date of the bond to be purchased based on certain observed “environmental states”, then the time duration between successive decision-making/investment epochs may vary according to the maturity dates of the bonds. To discount properly, the DRL agent needs to take into account the different time durations.
We summarize our contributions in this paper as follows:
We develop a new class of CSMA protocols based on DRL, referred to as CS-DLMA, for spectrum sharing in heterogeneous wireless networks. A salient feature of CS-DLMA is that it optimizes not only its own throughput but also the throughputs of other coexisting networks according to the general $\alpha$-fairness objective. Importantly, CS-DLMA achieves this without knowing the MAC protocols of other networks.
We demonstrate that CS-DLMA can achieve the general $\alpha$-fairness objective when coexisting with the TDMA, ALOHA, and WiFi protocols by adjusting its own transmission strategies. Interestingly, we find that CS-DLMA is more Pareto efficient than other CSMA protocols, e.g., p-persistent CSMA, when coexisting with WiFi.
We put forth a non-uniform time-step multi-dimensional DQN algorithm to enable CS-DLMA to achieve the above performance. Although we only focus on the use of the modified DQN algorithm for wireless networking, we believe it can also find use in other domains with similar non-uniform time-step and multi-dimensional characteristics.
I-B Related Work
Since this paper focuses on MAC designs based on DRL techniques, we limit our review of related work to the same area. A substantial body of past work on DRL-based MAC, for example, [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep], did not consider MAC with carrier sensing. In the set-up of the past work, the time steps in decision making all have the same duration. For MAC with carrier sensing, the carrier sensing time and the packet transmission time are of different durations, and the conventional DRL algorithms, with their implicit assumption of uniform time steps, are no longer suitable. This is why the current paper modifies the conventional DRL techniques so that they can be used for the design of DRL MAC with carrier sensing and for the coexistence of the DRL MAC with other MACs that have the carrier sensing capability.
The authors in [challita2018proactive] and [tan2019deep] investigated the coexistence of the DRL-based LTE network with WiFi, which has the carrier sensing capability. However, the LTE MACs in [challita2018proactive] and [tan2019deep] exercise coarse-grained control in that the decisions are not made on a packet-by-packet basis. Specifically, the LTE MAC does not decide whether to transmit on a packet-by-packet basis, but rather decides a stretch of time for LTE transmissions and a stretch of time for WiFi transmissions. During the respective stretches of time, LTE/WiFi get to keep transmitting packets without interruption from the other network. By contrast, the MAC of our design in this paper exercises fine-grained control in that the MAC decides whether to transmit the next packet based on carrier sensing of the environment as well as the past history of the environmental state.
In the following, we elaborate other fine differences between our work and [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep, challita2018proactive, tan2019deep]. The DRL MAC proposed in [naparstek2018deep] is targeted for homogeneous wireless networks. Specifically, in [naparstek2018deep], multiple nodes access multiple time-invariant orthogonal channels using the same DRL MAC. By contrast, we focus on heterogeneous networks in which our CS-DLMA protocol must learn to coexist with other MAC protocols. The MACs in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] are also concerned with multiple-channel access. Unlike [naparstek2018deep], the channels in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] are time varying and the channels may be occupied by some “primary” or “legacy” nodes. The DRL nodes aim to maximize their own throughputs by learning the channel characteristics and the transmission patterns of the “primary” or “legacy” nodes. By contrast, the CS-DLMA nodes in our work aim to achieve a global $\alpha$-fairness objective, which includes achieving maximum sum throughput, proportional fairness, and max-min fairness as subcases.
Both the MAC schemes in [challita2018proactive] and [tan2019deep] are model-aware in that the LTE base stations know that the coexisting network is WiFi. Therefore, the approaches in [challita2018proactive] and [tan2019deep] are not generalizable to situations where the LTE stations coexist with other networks. For example, suppose that instead of WiFi, the other network is ALOHA. Given that ALOHA does not perform carrier sensing, an ALOHA node may still transmit while an LTE node transmits during the stretch of time allocated to LTE, leading to collisions. By contrast, our CS-DLMA protocol is model-free in that it does not presume knowledge of coexisting networks—our CS-DLMA protocol can coexist with any MAC protocol by nature.
In our previous work [yu2019deep], we developed deep reinforcement learning multiple access (DLMA) protocols for heterogeneous networking without carrier sensing. Furthermore, we assumed that nodes of different MACs use the same packet length. This assumption limits the application of DLMA in more general heterogeneous settings in which nodes of different MACs may adopt different packet lengths. Our early work [yu2018carrier] incorporated carrier sensing into DLMA. However, for simplicity, [yu2018carrier] assumed the durations of carrier sensing and packet transmissions of DRL nodes are the same. Our current paper removes this impractical assumption. As a result, the DRL time steps are of non-uniform durations now. We put forth a non-uniform time-step formulation of the DQN algorithm to address the issue.
II Reinforcement Learning Preliminaries
This section gives an overview of reinforcement learning (RL) techniques [sutton2018reinforcement]. In the RL framework, a decision-making agent interacts with an environment in discrete time steps. At time step $t$, the agent observes the environment state $s_t$ and performs an action $a_t$ chosen from an action set $\mathcal{A}$ according to a policy $\pi$. The policy is a mapping from states to actions. Following the action $a_t$, the agent receives a reward $r_{t+1}$ and the environment transits to state $s_{t+1}$ at time step $t+1$. There are different techniques for reinforcement learning. This paper adapts and extends the Q-learning technique [watkins1992q] for our particular application.
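Given a series of rewards, $\{r_t\}_{t=1,2,\ldots}$, resulting from state-action pairs $\{(s_t, a_t)\}_{t=0,1,\ldots}$, for Q-learning, the cumulative discounted return going forward pinned at time step $t$ is given by $R_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau+1}$, where $\gamma$ is a discount factor. Because of the randomness in the state transitions, $R_t$ is a random variable in general. Q-learning captures the expected cumulative discounted reward of a state-action pair $(s, a)$ of a policy $\pi$ by a Q action-value function: $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$. The Q function of an optimal policy among all policies is $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$.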
In Q-learning, the goal of the agent is to learn the optimal policy in an online manner by observing the rewards while taking actions in successive time steps. In particular, the agent maintains the Q function, $Q(s, a)$, for any state-action pair $(s, a)$, in a tabular form. At time step $t$, given state $s_t$, the agent selects an action $a_t$ based on its current estimated Q table. This will cause the system to return a reward $r_{t+1}$ and move to state $s_{t+1}$. The experience at time step $t$ is captured by the quadruplet $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$. At the end of time step $t$, experience $e_t$ is used to update $Q(s, a)$ for entry $(s_t, a_t)$ as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \tag{1}$$
where $\beta$ is referred to as the learning rate.
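As a minimal illustration of the tabular update described above, the following Python sketch applies one Q-learning step to a dictionary-based Q table. The function name, the default learning rate and discount factor, and the toy states and actions are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      learning_rate=0.1, discount=0.9):
    """Apply one tabular Q-learning update for the experience (s, a, r, s')."""
    # Greedy value of the next state under the current table.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + discount * best_next
    # Move the stored estimate a fraction `learning_rate` toward the target.
    q_table[(state, action)] += learning_rate * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen entries default to 0.0
actions = [0, 1]              # hypothetical action set
new_value = q_learning_update(q_table, "s0", 1, 1.0, "s1", actions)
```

Only the entry for the visited state-action pair changes in each step, which is why the tabular method converges slowly when the state-action space is large.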
In Q-learning, the so-called $\epsilon$-greedy algorithm is often adopted for action selection. For the $\epsilon$-greedy algorithm, the action $a_t = \arg\max_{a} Q(s_t, a)$ is chosen with probability $1 - \epsilon$, and a random action is chosen uniformly among all possible actions with probability $\epsilon$. The random action is incorporated to prevent the algorithm from zooming into a locally optimal policy and to allow the agent to explore a wider spectrum of different actions in search of the optimal policy, particularly at the early stage of the learning process.
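The action-selection rule above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the dictionary representation of the Q-values are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from `q_values`, a dict mapping action -> estimated Q-value.

    With probability `epsilon` a uniformly random action is returned (exploration);
    otherwise the action with the largest estimated Q-value is returned (exploitation).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


greedy_choice = epsilon_greedy({0: 0.2, 1: 0.7}, epsilon=0.0)
explored_choice = epsilon_greedy({0: 0.2, 1: 0.7}, epsilon=1.0)
```

In practice $\epsilon$ is usually decreased over time so that exploration dominates early and exploitation dominates later.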
Q-learning is a model-free learning framework in that it tries to learn the optimal policy without having a model that describes the operating behavior of the environment beyond what can be observed through the experiences. In particular, it does not have knowledge of the transition probability $\Pr\{s_{t+1} \mid s_t, a_t\}$.
II-B Deep Q-Network
It has been shown that in a stationary environment that can be fully captured by a Markov decision process, the Q-values will converge to the optimal $Q^{*}(s, a)$ if the learning rate decays appropriately and each action in the state-action pair is executed an infinite number of times in the process [watkins1992q]. For many real-world problems, the state-action space for $Q(s, a)$ can be so huge that the tabular update method, which updates only one entry in $Q(s, a)$ in each time step, can take an excessive amount of time for $Q(s, a)$ to converge to $Q^{*}(s, a)$. If the environment changes in the meantime (e.g., the transition probability changes), convergence can never be attained. To allow fast convergence, function approximation methods are often used to approximate the Q-values [sutton2018reinforcement].
The seminal work [DQNpaper] put forth an algorithm referred to as the deep Q-network (DQN), wherein a deep neural network model is used to approximate the action-value function. To avoid confusion between the DQN algorithm and the neural network used in the algorithm, in this paper we refer to the neural network as the Q neural network (QNN). For the same algorithm, different possible QNNs could be used.
The input to a QNN is a state $s$, and the outputs are the approximated Q-values for different actions, $\{q(s, a; \theta) \mid a \in \mathcal{A}\}$, where $\theta$ is a parameter vector consisting of the weights of the edges in the neural network and $\mathcal{A}$ is the set of possible actions. At the end of time step $t$, for action execution, the $\epsilon$-greedy algorithm, wherein $Q(s, a)$ is replaced by $q(s, a; \theta)$, is adopted.
For training of the QNN, the parameters of the QNN, $\theta$, are updated by minimizing the following loss function:
$$L(\theta) = \sum_{(s_\tau, a_\tau, r_{\tau+1}, s_{\tau+1}) \in E} \left( y_\tau - q(s_\tau, a_\tau; \theta) \right)^2, \qquad y_\tau = r_{\tau+1} + \gamma \max_{a'} q(s_{\tau+1}, a'; \theta^{-}). \tag{2}$$
In (2), two important learning techniques in DQN are embedded to stabilize the algorithm [DQNpaper]. The first is “experience replay” [lin1992self, DQNpaper]. Instead of training QNN with a single experience at each time step, multiple experiences are pooled together for batch training. Specifically, a FIFO experience buffer is used to store a fixed number of experiences gathered from different time steps. For a round of training, a minibatch $E$ consisting of random experiences is taken from the experience buffer in the computation of (2), wherein the time index $\tau$ denotes the time step at which that experience tuple was collected. The second technique is the use of a separate “target neural network” in the computation of $y_\tau$ in (2). In particular, the target neural network’s parameter vector is $\theta^{-}$ rather than $\theta$ in the QNN being trained. This separate target neural network is named target QNN and is a copy of a previously used QNN. The parameter $\theta^{-}$ of target QNN is updated to the latest $\theta$ of QNN once in a while.
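The two stabilizing techniques can be illustrated with a small, framework-free Python sketch. It assumes a FIFO buffer built on `collections.deque`, toy constant functions standing in for the QNN and the target QNN, and hypothetical buffer and minibatch sizes; none of these values come from the paper.

```python
import random
from collections import deque


def td_targets(minibatch, target_q, actions, discount):
    """Targets y = r + discount * max_a' q(s', a'; theta^-), using the target network."""
    return [reward + discount * max(target_q(next_state, a) for a in actions)
            for (_state, _action, reward, next_state) in minibatch]


def squared_loss(minibatch, online_q, targets):
    """Sum of squared differences between targets and the online network's Q-values."""
    return sum((y - online_q(state, action)) ** 2
               for (state, action, _reward, _next_state), y in zip(minibatch, targets))


# FIFO experience buffer: once full, the oldest experience is dropped automatically.
buffer = deque(maxlen=500)
for t in range(1000):
    buffer.append((t, t % 2, 1.0, t + 1))  # toy experience (s, a, r, s')

minibatch = random.sample(list(buffer), 32)  # random experiences from different time steps

# Toy stand-ins for the QNN (theta) and the target QNN (theta^-).
online_q = lambda state, action: 0.0
target_q = lambda state, action: 1.0 if action == 1 else 0.5

targets = td_targets(minibatch, target_q, actions=[0, 1], discount=0.9)
loss = squared_loss(minibatch, online_q, targets)
```

In a full implementation, the loss would be minimized with respect to the QNN parameters only, and the target QNN parameters would be overwritten with a copy of the QNN parameters at fixed intervals.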
III System Model and Objective
This section first introduces the system model used in this paper. After that, we give the overall system objective.
III-A System Model
The system model considered in this paper is inspired by the network model of DARPA Spectrum Collaboration Challenge (SC2) [DARPAwebsite, tilghman2019will]. As illustrated in Fig. 2, the model of DARPA SC2 is composed of a collaboration network and multiple radio networks. All the radio networks share a common wireless medium. In DARPA SC2, the collaboration network is a separate control network from the wireless data network. The collaboration network allows different radio networks to communicate collaborative information at the high level (e.g., the frequency spectrum used by a radio network, the throughput and the quality of service observed in a radio network, etc.). Each radio network, however, does not tell the other networks its MAC protocol.
On the wireless data channel, the nodes in each radio network can transmit data packets to each other, whereas the nodes belonging to different radio networks do not exchange data packets. A packet is deemed to be successfully transmitted if there are no concurrent transmissions by other nodes. Otherwise, the packet is deemed to be lost due to a collision.
Each radio network operates a MAC protocol that determines the transmission strategy of its nodes. The coexisting radio networks are heterogeneous in that they may adopt different MAC protocols. Importantly, each radio network does not know the MAC protocols of other radio networks. The goal, in DARPA’s vision, is to optimize the aggregate wireless spectrum usage across all radio networks [DARPAwebsite, tilghman2019will].
An important feature of DARPA SC2’s model is that in each radio network, a node is designated as a gateway for collaborative information exchange with gateways of other radio networks through the collaboration network. In this work, as will be elaborated later, we assume the collaborative information includes transmission results of networks, such as successes/failures of packet transmissions and packet durations. The gateway of a radio network may in turn share the transmission results of other radio networks with its own nodes. Using collaborative information, a radio network can then adjust its transmission strategy through an adaptive MAC protocol to achieve a certain global objective to share the wireless spectrum with other networks in an equitable and optimal manner. For example, if the objective is to achieve proportional fairness, the adaptive MAC protocol will aim to maximize the sum log throughputs of all networks [yu2019deep].
This paper focuses on the design of the MAC protocol of a particular radio network. The goal is to be able to achieve a general global objective for wireless-spectrum sharing without knowing the MAC protocols of other radio networks.
We assume the nodes in our radio network have carrier-sensing (CS) capability, and our MAC protocol exploits deep reinforcement learning techniques to learn a transmission strategy that can achieve the global objective. We refer to our MAC protocol as Carrier-Sense Deep-reinforcement Learning Multiple Access (CS-DLMA). Our network and nodes are referred to as CS-DLMA network and CS-DLMA nodes.
Carrier sensing allows radio nodes to listen to the wireless channel before transmitting their data packets so as to avoid collisions [Tanenbaum:2010:CN:1942194]. The carrier sensing operation typically takes up some time that includes the signal processing and circuit delay within a node, as well as the largest possible signal propagation over the air between nodes. To be effective, carrier sensing time must be small relative to the data packet duration. In this paper, we refer to the time required for carrier sensing as a “minislot”.
We consider three types of MAC protocols used by other radio networks: (i) TDMA, (ii) ALOHA, (iii) WiFi (more exactly, a simplified WiFi-like CSMA protocol) [Tanenbaum:2010:CN:1942194]. Among these protocols, WiFi has the capability of carrier sensing, while TDMA and ALOHA do not. For simplicity, we assume the minislots used by CS-DLMA and WiFi nodes for carrier sensing are of the same duration. In addition, we assume that packet durations of different networks are integer multiples of minislots (note that packet duration here refers to the time needed to transmit MAC-layer packet header plus the data). Specifically, /// represent the packet durations of CS-DLMA/WiFi/TDMA/ALOHA. The durations of the packet headers of all networks are assumed to be the same. In particular, the packet-header duration is a fraction of minislot ( is used in our later evaluations).
We allow the packet duration of CS-DLMA, , to vary in time as part of its adaptive strategy. Variable gives added flexibility in CS-DLMA. For example, if the channel is deemed to be not likely used by others for a long duration of time, CS-DLMA can transmit a large packet with a large to reduce the packet-header overhead and carrier-sensing overhead; a small , on the other hand, allows CS-DLMA to squeeze in a small packet transmission in between transmissions by others.
We summarize the MAC protocols of different networks in Table I. We assume slotted operations of TDMA/ALOHA. A TDMA/ALOHA network can only initiate a transmission at the beginning of a TDMA/ALOHA slot, and the transmission ends at the end of a TDMA/ALOHA slot. Thus, each TDMA/ALOHA slot lasts a TDMA/ALOHA packet duration (e.g., a TDMA slot is of minislots in duration) in Table I.
III-B $\alpha$-Fairness Objective
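We adopt the general $\alpha$-fairness objective as the performance metric of this paper [mo2000fair]. In particular, we assume that there are altogether $N$ nodes in the overall heterogeneous wireless networks. For a particular node $i$, its local utility function is given by
$$f_{\alpha}\left(x^{(i)}\right) = \begin{cases} (1 - \alpha)^{-1} \left(x^{(i)}\right)^{1 - \alpha}, & \alpha \geq 0,\ \alpha \neq 1, \\ \log x^{(i)}, & \alpha = 1, \end{cases} \tag{3}$$
where $\alpha \in [0, \infty)$ is used to specify a range of fairness criteria and $x^{(i)}$ is the throughput of node $i$.
The objective of the overall system is to maximize the sum of all the local utility functions:
$$\max \sum_{i=1}^{N} f_{\alpha}\left(x^{(i)}\right). \tag{4}$$
In (4), when $\alpha = 0$, the objective is to maximize the sum throughput; when $\alpha = 1$, the objective is to achieve proportional fairness; when $\alpha \to \infty$, the objective is to achieve max-min fairness [mo2000fair].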
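To make the role of $\alpha$ concrete, the sketch below evaluates the standard $\alpha$-fair utility of [mo2000fair] and the resulting system objective for two hypothetical throughput allocations. The function names and the throughput values are illustrative assumptions.

```python
import math


def alpha_fair_utility(x, alpha):
    """Alpha-fair utility of a throughput x >= 0 for a fairness parameter alpha >= 0."""
    if x < 0:
        raise ValueError("throughput must be non-negative")
    if x == 0 and alpha >= 1:
        return float("-inf")  # a starved node is infinitely penalized when alpha >= 1
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)


def system_objective(throughputs, alpha):
    """Sum of the local alpha-fair utilities over all nodes."""
    return sum(alpha_fair_utility(x, alpha) for x in throughputs)


equal_split = [0.4, 0.4]    # hypothetical throughputs of two nodes
unequal_split = [0.7, 0.1]  # same sum throughput, shared unevenly

sum_throughput_equal = system_objective(equal_split, alpha=0)
sum_throughput_unequal = system_objective(unequal_split, alpha=0)
prop_fair_equal = system_objective(equal_split, alpha=1)
prop_fair_unequal = system_objective(unequal_split, alpha=1)
```

With $\alpha = 0$ the two allocations score the same because only the sum throughput matters, whereas with $\alpha = 1$ the even allocation scores higher, reflecting proportional fairness.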
IV CS-DLMA Framework
This section first transforms the multiple access problem faced by our CS-DLMA network to an RL problem by defining action, state and reward—three key components in RL. We then modify the original DQN algorithm and put forth a non-uniform time-step multi-dimensional DQN algorithm that realizes CS-DLMA—the original DQN deals with uniform time-step problems. After that, we discuss the implementations of CS-DLMA. For simple exposition, we assume there is only one CS-DLMA node in this section. We will extend the framework to the case with multiple CS-DLMA nodes in Section VI.
IV-A Action, State, and Reward
As described in Table I, the possible decisions of a CS-DLMA node include 1) performing carrier sensing and 2) transmitting a packet of a chosen length. We denote the action of a CS-DLMA node at time step $t$ by $a_t \in \{0, 1, \ldots, L_{\max}\}$, where $L_{\max}$ is the maximum packet length of CS-DLMA (a “time step” here corresponds to a decision epoch of CS-DLMA and the duration of each time step can be either one minislot or multiple minislots). If $a_t = 0$, the CS-DLMA node will not transmit and will only perform carrier sensing in the next minislot. The carrier sensing results in an observation BUSY or IDLE, indicating whether the channel was occupied or not occupied by other nodes in that minislot. If $a_t \geq 1$, the CS-DLMA node will transmit a packet with a length of $a_t$ in the next $a_t$ minislots. At the end of the transmission, SUCCESSFUL or COLLIDED will be observed, indicating whether the packet was successfully received or not. As long as another node transmits in at least one of the $a_t$ minislots, COLLIDED would be observed.
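We first define the channel state of CS-DLMA at time step $t$ as the action-observation pair $c_t = (a_{t-1}, z_{t-1})$, where $z_{t-1}$ is the observation that followed action $a_{t-1}$. We then define the state of CS-DLMA at time step $t$ as $s_t = [c_{t-M+1}, \ldots, c_{t-1}, c_t]$, i.e., the state is the combination of the past $M$ channel states. The state history length $M$ is the number of past time steps to be tracked by CS-DLMA.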
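One way to maintain such a state is a fixed-length sliding window over action-observation pairs. The sketch below assumes a bounded `collections.deque`, a hypothetical history length of three, and an arbitrary padding pair used before enough history exists; the padding choice in particular is an assumption and is not specified in the text.

```python
from collections import deque


def make_state_tracker(history_length, padding=(0, "IDLE")):
    """Return a function that appends a channel state and yields the current state.

    The state is the tuple of the `history_length` most recent
    (action, observation) pairs; `padding` fills the window at start-up.
    """
    window = deque([padding] * history_length, maxlen=history_length)

    def push(action, observation):
        window.append((action, observation))  # the oldest pair falls off the head
        return tuple(window)

    return push


push = make_state_tracker(history_length=3)
state_after_one = push(0, "BUSY")
push(2, "SUCCESSFUL")
state_after_three = push(0, "IDLE")
```

Each new channel state shifts the window by one position, so consecutive states share all but one element.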
In the conventional RL framework, the reward is a scalar and the RL agent learns to maximize the cumulative discounted reward [sutton2018reinforcement], which is a weighted linear sum of rewards in the time steps going forward. The goal of CS-DLMA, however, is to achieve $\alpha$-fairness among all the nodes, which in general is a nonlinear function of the individual cumulative discounted rewards (i.e., individual throughputs) of the nodes. We use a reward vector to keep track of the individual rewards in each time step, from which we can obtain the individual cumulative discounted rewards for the computation of the $\alpha$-fairness objective function. Specifically, after taking action $a_t$, a reward vector is obtained from the environment at the end of time step $t$. One element of the vector is the reward of the CS-DLMA node: it is nonzero only if the CS-DLMA node successfully transmitted a packet in time step $t$, in which case it is determined by the packet length $a_t$. Each remaining element is the reward of one node from the other networks, and the number of such elements equals the total number of nodes in other networks. The reward of such a node is nonzero only if that node successfully transmitted a packet in time step $t$, in which case it is determined by the length of that packet.
IV-B Non-Uniform Time-Step Multi-Dimensional DQN
In our early work [yu2019deep], we put forth a multi-dimensional DQN framework for DLMA that deals with time steps of uniform duration. Specifically, in [yu2019deep], there was no carrier sensing functionality for all involved MACs, and time-slotted systems with time slots of fixed duration were considered. The duration of a time slot in [yu2019deep] corresponds to the duration of a packet transmission. By contrast, the current work extends the multi-dimensional DQN in [yu2019deep] to scenarios in which the time slots for carrier sensing (i.e., minislots in this work) are of a smaller duration than the time slots for packet transmissions. For this extension, we need to modify the discounting mechanism and action selection method of conventional DQN. We lay out the principle for the modifications here.
In conventional DQN [DQNpaper], the outputs of the neural network are the approximated Q-values for different actions, $\{q(s, a; \theta) \mid a \in \mathcal{A}\}$, where the Q-value $q(s, a; \theta)$ is the approximated cumulative discounted reward of a state-action pair $(s, a)$. In the multi-dimensional DQN in [yu2019deep], the outputs of the neural network are a vector of per-node Q-values for each action $a \in \{0, 1\}$: one element is the approximated cumulative discounted reward of the DLMA node, each of the other elements is the approximated cumulative discounted reward of one node from the other networks, and $a = 0$ / $a = 1$ corresponds to “NOT Transmit”/“Transmit” (the number of actions in [yu2019deep] is two). Furthermore, the experience tuple is augmented so that its reward entry is a vector consisting of the individual rewards of different nodes in the heterogeneous network, as opposed to the scalar reward of a single entity in conventional DQN. With the above two modifications, in [yu2019deep], the loss function (2) was rewritten as (5), and the next-state action $a'$ in (5) was chosen according to (6).
In this paper, for the study of CS-DLMA, the time duration of each time step, i.e., the duration of each action $a_t$, is non-uniform in that the packet length of the CS-DLMA node can vary (recall that the action also selects the packet length). The discounting mechanism in (5) needs to be modified to take the non-uniform time steps into account. In particular, long time steps need to be discounted more than short time steps because the former extend further into the future.
To extend the uniform time-step multi-dimensional DQN in [yu2019deep] to non-uniform time-step multi-dimensional DQN, the outputs of QNN are modified to cover the enlarged action set, i.e., the outputs of QNN are the approximated Q values for different actions and different nodes. In addition, we let $d_t$ denote the time duration of action $a_t$ in terms of the number of minislots. Specifically, $d_t = 1$ if the CS-DLMA node performs carrier sensing over one minislot, and $d_t = a_t$ if the CS-DLMA node transmits a packet of duration $a_t$ minislots. We then augment the experience to include $d_t$. Finally, the loss function (5) can be modified as (7). Note that $a'$ in (7) is the same as in (6) except that the set of possible actions is the full CS-DLMA action set (carrier sensing plus every admissible packet length) rather than $\{0, 1\}$.
We can write the reward term in (7) as a sum of per-minislot rewards, each discounted according to its own minislot, corresponding to amortizing the reward in a non-uniform time step over its minislots by minislot discounting. Now, the training of non-uniform time-step multi-dimensional DQN can be done by minimizing the loss function (7) using stochastic gradient descent [lecun2015deep].
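The following Python sketch shows one possible reading of the amortization described above; it is inferred from the prose and is not a reproduction of (7). It assumes that a reward earned over a time step of `duration` minislots is split evenly across those minislots, that each share is discounted by one extra factor of the discount factor per minislot, and that the bootstrapped Q-value of the next state is discounted by the discount factor raised to the power `duration`. All names and numbers are hypothetical.

```python
def amortized_reward(reward, duration, discount):
    """Spread `reward` evenly over `duration` minislots and discount each share."""
    per_minislot = reward / duration
    return sum(per_minislot * discount ** k for k in range(duration))


def nonuniform_target(reward, duration, discount, next_q):
    """Target for a time step lasting `duration` minislots.

    The next-state Q-value is discounted by discount**duration, so longer
    time steps push the future further away than shorter ones.
    """
    return amortized_reward(reward, duration, discount) + discount ** duration * next_q


sensing_target = nonuniform_target(reward=0.0, duration=1, discount=0.9, next_q=10.0)
packet_target = nonuniform_target(reward=3.0, duration=3, discount=0.9, next_q=10.0)
```

With a duration of one minislot the expression reduces to the usual one-step target, which is a useful sanity check for any implementation of non-uniform discounting.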
For action selection in CS-DLMA, we put forth a carrier-sense $\epsilon$-greedy algorithm. Suppose that at the beginning of time step $t$, the state of the CS-DLMA node is $s_t$ and the CS-DLMA node needs to select an action $a_t$. The carrier-sense $\epsilon$-greedy algorithm that decides $a_t$ is given by (8).
We now explain (8) line by line. The first line in (8) is a result of the carrier sensing mechanism. A node operating carrier sensing needs to sense the channel to be idle before it can transmit. The sensing operation will take a certain amount of time. For our system, we assume one minislot is used for sensing (since even if sensing can be completed in less than one minislot, the node will still need to wait for the next minislot to begin transmission, if the wireless channel is sensed to be idle). The first line in (8) is attributed to the non-zero amount of time (i.e., one minislot in our case) needed for carrier sensing. If the channel is not idle in time step $t-1$, then the CS-DLMA node cannot transmit in time step $t$. In time step $t-1$, the channel could be non-idle either because the CS-DLMA node was transmitting or because another node was transmitting and the CS-DLMA node sensed the medium to be busy, i.e., the observation in time step $t-1$ was BUSY, SUCCESSFUL, or COLLIDED.
The second and third lines in (8) describe action selection in time step $t$ if the CS-DLMA node did not transmit at time step $t-1$ and it sensed the medium to be idle at time step $t-1$ (i.e., other nodes did not transmit either). In this case, the CS-DLMA node can decide to transmit or not to transmit in time step $t$. If it decides to transmit, the CS-DLMA node also needs to decide the packet length of the transmission. With an $\epsilon$-greedy algorithm, with probability $\epsilon$ the choice is made uniformly at random, as in the third line of (8). The second line of (8) is a departure from the conventional $\epsilon$-greedy DQN algorithm. In conventional DQN, with probability $1 - \epsilon$ the action that yields the maximum Q value is selected [DQNpaper]. To capture the essence of the $\alpha$-fairness objective, our multi-dimensional DQN selects the action that maximizes an $\alpha$-fairness nonlinear combination of different Q values.
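The three cases explained above can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of (8): the $\alpha$-fairness combination is assumed to be the sum of $\alpha$-fair utilities of the per-node Q-values, the Q-values are assumed to be positive, and the function names, the dictionary layout, and the example numbers are hypothetical.

```python
import math
import random


def alpha_fair_utility(x, alpha):
    """Alpha-fair utility of a positive value x."""
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)


def carrier_sense_epsilon_greedy(prev_observation, q_values, alpha, epsilon, rng=random):
    """Select an action given the previous observation and per-node Q-values.

    `q_values` maps each action (0 = sense only, k >= 1 = send k minislots)
    to a list of positive Q-value estimates, one per node.
    """
    # Case 1: the channel was not idle in the previous time step, so sense first.
    if prev_observation != "IDLE":
        return 0
    # Case 3: explore with probability epsilon.
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    # Case 2: exploit by maximizing the alpha-fair combination of per-node Q-values.
    return max(q_values,
               key=lambda a: sum(alpha_fair_utility(q, alpha) for q in q_values[a]))


example_q = {0: [0.1, 0.6], 2: [0.9, 0.05]}  # hypothetical estimates for two nodes
after_busy = carrier_sense_epsilon_greedy("BUSY", example_q, alpha=0, epsilon=0.0)
sum_rate_choice = carrier_sense_epsilon_greedy("IDLE", example_q, alpha=0, epsilon=0.0)
prop_fair_choice = carrier_sense_epsilon_greedy("IDLE", example_q, alpha=1, epsilon=0.0)
```

In this toy example the sum-throughput objective favors transmitting, while the proportionally fair objective favors deferring to the other node, which shows how the value of $\alpha$ changes the greedy choice.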
IV-C CS-DLMA Implementation
Fig. 3 shows the overall DQN architecture that realizes CS-DLMA. The simulation code of CS-DLMA is partly open-sourced: https://github.com/YidingYu/CS-DLMA. We now describe three key components in the architecture: 1) neural network, 2) experience buffer, and 3) continuous experience replay.
IV-C1 Neural Network
The neural network, i.e., QNN, used in non-uniform multi-dimensional DQN is a recurrent neural network (RNN). The RNN consists of an input layer, two hidden layers, and an output layer. The input to the RNN is the current state. The two hidden layers consist of a long short-term memory (LSTM) [hochreiter1997long] layer and a feedforward layer. The outputs are the approximated Q values for different actions and different nodes given the input state.
Instead of an RNN, a feedforward neural network (FNN) could also be used, wherein the hidden layers are all pure feedforward layers. Fig. 4 shows the difference between the FNN-based QNN and the RNN-based QNN in processing the state s_t received from the input layer at time step t. After receiving s_t, the FNN processes it directly; by contrast, the RNN processes the elements of s_t sequentially, keeping an internal state as it injects the elements one by one into the input. In this way, the causal relationship between the elements of s_t (i.e., which element precedes which) is explicitly embedded in the way the RNN processes the input [hochreiter1997long]. On the other hand, this causal relationship is not explicitly given to the FNN, which must learn it from data, if it manages to learn it at all.
IV-C2 Experience Buffer
For implementation, it is inefficient to store complete experiences, since two consecutive experiences have many elements in common. For example, the next state in an experience is only a time-shifted version of the current state, with the head end discarded and a new tail end appended. It is superfluous to store the overlapping elements for both experiences. A more efficient implementation is to store an abbreviated experience that omits the overlapping elements. The complete experience can then be recovered from consecutive abbreviated experiences by means of continuous experience replay.
IV-C3 Continuous Experience Replay
In conventional experience replay [DQNpaper], random experiences are sampled from the experience buffer to compute the loss function, with each sample being one complete experience. After downsizing the stored experiences to their abbreviated form, we instead sample runs of continuous experiences to extract the information necessary for computing the loss function (7). As illustrated in Fig. 5, each sample contains a run of continuous abbreviated experiences, from which the current state, the action, the reward, and the next state needed for the loss can be extracted.
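The buffer and replay scheme described above can be sketched as follows. This is an illustrative sketch only: it assumes that the state is the tuple of the most recent `history_len` (action, observation) pairs and that each abbreviated experience is an (action, observation, reward) triple. The class and method names are hypothetical, and the quantities actually stored by the authors' code may differ.

```python
import random
from collections import deque
from itertools import islice


class ContinuousReplayBuffer:
    """Stores abbreviated experiences and rebuilds full ones from contiguous runs."""

    def __init__(self, capacity, history_len):
        self.buf = deque(maxlen=capacity)   # oldest entries are dropped automatically
        self.history_len = history_len      # number of past steps that make up a state

    def add(self, action, observation, reward):
        # Only the new elements of this time step are stored, not the whole state.
        self.buf.append((action, observation, reward))

    def sample(self, rng=random):
        # history_len entries rebuild the state; one more supplies the
        # action, the reward, and the tail end of the next state.
        n = self.history_len + 1
        if len(self.buf) < n:
            raise ValueError("not enough experiences stored yet")
        start = rng.randrange(len(self.buf) - n + 1)
        window = list(islice(self.buf, start, start + n))
        state = tuple((a, z) for a, z, _ in window[:-1])
        action, _, reward = window[-1]
        next_state = tuple((a, z) for a, z, _ in window[1:])
        return state, action, reward, next_state
```

Each stored entry is shared by up to `history_len + 1` reconstructed experiences, which is where the memory saving over storing full states comes from.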
V Performance Evaluation
This section evaluates the performance of CS-DLMA. After introducing the simulation setup, we first investigate the coexistence of CS-DLMA with TDMA and ALOHA, two MAC protocols without carrier sensing. Following that, we investigate the coexistence of CS-DLMA with WiFi, a MAC protocol with carrier sensing. For concreteness, this paper focuses on saturated networks, i.e., all the nodes in the networks always have packets to transmit. In addition, since we have no control of TDMA, ALOHA and WiFi, we assume the packet lengths of these nodes are fixed in our evaluation.
V-A Simulation Setup
The RMSProp algorithm [tieleman2012lecture] is used to conduct minibatch gradient descent over the loss function (7). The minibatch size is set to 32. The target network is updated every 20 time steps. Table II summarizes the values of the hyperparameters.

| Hyperparameter | Value |
|---|---|
| State history length | 20 |
| ε in carrier-sense ε-greedy algorithm | 1 to 0.005 |
| Experience buffer size | 1000 |
| Experience-replay minibatch size | 32 |
| Target network update frequency | 20 |
V-A2 Performance Metric
We evaluate the performance of CS-DLMA by examining whether the objective in (4) can be achieved. In particular, we define the “throughput” of node at time step by
where is the reward of node at the end of time step and is the time duration of action in terms of number of minislots. The throughput here is the average reward and reflects the performance of each node in the long run.
V-B CS-DLMA coexists with TDMA and ALOHA
This subsection investigates the coexistence of one CS-DLMA node with one TDMA node and one ALOHA node. We first introduce the settings of each node. We then examine whether CS-DLMA can achieve a general α-fairness objective when coexisting with TDMA and ALOHA.
In our experimental setup, the TDMA node occupies the second and the fifth TDMA slots within a TDMA frame of five TDMA slots; the ALOHA node transmits with a fixed probability in each ALOHA slot. The packet lengths of TDMA and ALOHA are both fixed at 10 minislots. The CS-DLMA node practices our CS-DLMA protocol and can transmit packets of variable length, with a maximum length of 10 minislots. For benchmarking, in place of the model-free CS-DLMA node, we imagine a model-aware node that is aware of the packet lengths as well as the MAC mechanisms of TDMA and ALOHA. As for CS-DLMA, we also assume that the packet length of the model-aware node can vary from 1 to 10 minislots. The optimal strategy of the model-aware node, summarized below, can achieve the general α-fairness objective:
At the beginning of each TDMA/ALOHA slot, the model-aware node performs carrier sensing. If the channel is idle, the model-aware node transmits in the next 9 minislots; if the channel is busy (either TDMA or ALOHA transmits), the model-aware node keeps silent in the next 9 minislots.
A point to note here is that the optimal strategy of the model-aware node is the same for different α values. The detailed analyses are provided in Appendix B.
We now examine whether CS-DLMA can find the optimal strategies for different α values without being aware of the MACs of TDMA and ALOHA. Fig. 6 plots the individual throughputs of CS-DLMA, TDMA and ALOHA achieved by CS-DLMA as well as the corresponding optimal individual throughputs achieved by the model-aware node. As can be seen from Fig. 6, for different α values, the individual throughputs of each node all approximate their corresponding optimal results, indicating that CS-DLMA can indeed find a strategy that achieves the α-fairness objective for different values of α.
V-C CS-DLMA coexists with WiFi
We next investigate the coexistence of one CS-DLMA node with one WiFi node. The CS-DLMA node is the same as in Section V-B. The WiFi node uses the following settings: the packet length is fixed at 10 minislots; the initial window size is 2; the maximum backoff stage is 6.
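For readers unfamiliar with the WiFi settings quoted above, the following sketch shows how an initial window size of 2 and a maximum backoff stage of 6 translate into contention windows under the usual binary exponential backoff model. It assumes the common convention that the window doubles after each collision up to the maximum stage and that the backoff counter is drawn uniformly from the window. The function names are hypothetical, and the paper's simulator may differ in detail.

```python
import random


def contention_window(stage, initial_window=2, max_stage=6):
    """Window size at a given backoff stage; doubling stops at max_stage."""
    return initial_window * 2 ** min(stage, max_stage)


def draw_backoff(stage, rng=random, initial_window=2, max_stage=6):
    """Backoff counter drawn uniformly from {0, ..., window - 1}."""
    return rng.randrange(contention_window(stage, initial_window, max_stage))
```

Under these assumptions the window grows from 2 at stage 0 to 128 at stage 6 and stays there for any further retransmissions.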
We first present the individual throughputs of CS-DLMA and WiFi for different α values. As can be seen from Fig. 7, when the value of α increases from 0 to 50, the throughputs of CS-DLMA and WiFi get closer. In particular, when α = 0, CS-DLMA aims to maximize the sum throughput and the strategy found by CS-DLMA is a greedy strategy, i.e., CS-DLMA always transmits if the channel is sensed idle; when α increases, CS-DLMA becomes less aggressive and leaves more opportunities for WiFi until the throughputs of CS-DLMA and WiFi are almost equal. This demonstrates that CS-DLMA can indeed adjust its strategy according to the value of α.
For comparison purposes, we replace CS-DLMA with p-persistent CSMA (p-CSMA) [Tanenbaum:2010:CN:1942194] in the above experiment, i.e., we consider the coexistence of p-CSMA with WiFi. If the channel is sensed idle, p-CSMA transmits a packet with probability p. The value of p can be adjusted to achieve different throughput allocations between p-CSMA and WiFi when they coexist.
Fig. 8 plots the throughputs of CS-DLMA/p-CSMA versus WiFi. Specifically, in Fig. 8, the x-axis is the throughput of WiFi and the y-axis is the throughput of CS-DLMA/p-CSMA. Each circle corresponds to the throughput allocation achieved by CS-DLMA with a particular α; each square corresponds to the throughput allocation achieved by p-CSMA with a particular p. As can be seen from Fig. 8, CS-DLMA can achieve Pareto improvement [myerson2013game] over p-CSMA when coexisting with WiFi. Interestingly, if we also plot the individual throughputs of two homogeneous WiFi nodes (denoted by the red star in Fig. 8), we find that CS-DLMA can also achieve Pareto improvement over WiFi when coexisting with WiFi.
An intuitive reason why CS-DLMA achieves more Pareto-efficient performance than p-CSMA is that the CS-DLMA node looks at a longer state history before deciding on its next action, while the p-CSMA node does not (its effective state history length is 1). The behavior of the WiFi node is not Markovian: it depends not just on whether the node is currently transmitting, but also on its experiences stretching further into the past. Having access to a longer state history therefore helps.
VI Multi-Node CS-DLMA Framework
Section IV introduced the CS-DLMA framework with only one CS-DLMA node. This section generalizes the one-node CS-DLMA framework to the multi-node CS-DLMA framework. With this framework, we will investigate the coexistence of multiple CS-DLMA nodes with multiple other nodes.
Revisiting the system model introduced in Section III-A, we know that although the CS-DLMA network has no control over other networks, CS-DLMA nodes running the same protocol can be coordinated. For the multi-node CS-DLMA framework studied here, we put forth a CS-DLMA protocol to enable the CS-DLMA network to achieve the α-fairness objective. In particular, we assume there is a CS-DLMA gateway associated with the CS-DLMA nodes in the CS-DLMA network. The gateway is responsible for coordinating the operations of the CS-DLMA nodes so that they coexist among themselves and with nodes running other protocols to meet the α-fairness objective.
If the CS-DLMA gateway decides to perform carrier sensing, it will listen to the channel and check whether the channel is occupied by nodes from other networks; if the CS-DLMA gateway decides to transmit a packet, it will select one of the CS-DLMA nodes in a round-robin manner to transmit (the CS-DLMA gateway itself is also a CS-DLMA node). The instruction from the CS-DLMA gateway to the other CS-DLMA nodes can be sent through a control channel within the CS-DLMA network. For example, the control channel can be implemented as a “short time slot” before each packet transmission. The time duration of the “short time slot” can be even smaller than a minislot and can be neglected in the performance evaluation. (Footnote 3: For concreteness and for simplicity, we focus here on a design with centralized coordination of all CS-DLMA nodes by a gateway. Decentralized coordination is also possible. For example, if all the CS-DLMA nodes run the same algorithm as the gateway algorithm described in this paper, and all CS-DLMA nodes have the same observations, then the CS-DLMA nodes will be in consensus as to the action to be taken by the CS-DLMA network next, i.e., whether a CS-DLMA node should transmit and if so, which CS-DLMA node should transmit. For the decentralized implementation, there will be no need for a control channel for a central controller (gateway) to send instructions to the CS-DLMA nodes. However, how to ensure consensus among the CS-DLMA nodes, taking into consideration the possibility of discrepancies in their observations, will be a key issue.)
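The gateway's round-robin dispatch described above can be sketched as follows. This is a minimal illustration: the class name, the node identifiers, and the encoding of action 0 as "sense" and action k ≥ 1 as "transmit a k-minislot packet" are assumptions made for the example, and the control-channel signaling is not modeled.

```python
class RoundRobinGateway:
    """Chooses which CS-DLMA node transmits whenever the action is 'transmit'."""

    def __init__(self, node_ids):
        if not node_ids:
            raise ValueError("at least one CS-DLMA node is required")
        self.node_ids = list(node_ids)  # the gateway itself may be one of these
        self._next = 0                  # index of the node whose turn is next

    def dispatch(self, action):
        """Return None for sensing, else (node_id, packet_length)."""
        if action == 0:
            # The gateway senses the channel itself; no node is scheduled
            # and the round-robin pointer does not advance.
            return None
        node = self.node_ids[self._next]
        self._next = (self._next + 1) % len(self.node_ids)
        return node, action
```

For example, with nodes `["gateway", "node1"]`, successive transmit decisions alternate between the two nodes regardless of how many sensing steps occur in between.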
We now transform the multiple access problem faced by the CS-DLMA network to a reinforcement learning problem. In particular, our multi-node CS-DLMA framework is the same as the one-node CS-DLMA framework except that the following modifications are made:
At the beginning of each time step t, the CS-DLMA gateway decides an action a_t. If the action is to sense, the CS-DLMA gateway will perform carrier sensing in the next minislot. After that, it will get an observation BUSY or IDLE, indicating whether or not the channel is occupied by other nodes. If the action is to transmit, the CS-DLMA gateway will select one CS-DLMA node in a round-robin manner to transmit a packet of the chosen length in the following minislots. After that, it will get an observation SUCCESSFUL or COLLIDED, indicating whether or not the packet was successfully received.
After taking action , the CS-DLMA gateway obtains a reward vector from the environment at the end of time step . The element is the reward of the CS-DLMA network. If any CS-DLMA node successfully transmitted a packet with length in time step , then ; otherwise . The reward of the node from other networks, , , has the same definition as in Section IV-A.
VI-3 Non-Uniform Time-Step Multi-Dimensional DQN
The outputs of the neural network in non-uniform multi-dimensional DQN keep the same notation as before, but the output associated with CS-DLMA is now the approximated cumulative discounted reward of the CS-DLMA network, rather than the approximated cumulative discounted reward of one particular CS-DLMA node (this modification is consistent with the definition of reward for multi-node CS-DLMA). The loss function is now given by (10).
The first term in (11) is the utility function of the CS-DLMA network, where is the number of all the CS-DLMA nodes and can be regarded as the approximated cumulative discounted reward of each CS-DLMA node. The second term is the sum of utility functions of all the nodes from other networks.
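One plausible reading of the two terms described above is sketched below. It assumes that the network term is the per-node utility multiplied by the number of CS-DLMA nodes, i.e., N · f_α(Q_network / N), and that f_α is the standard α-fairness utility. Both the exact form of (11) and all names in the code are assumptions for illustration.

```python
import math


def alpha_fair_utility(x, alpha, floor=1e-6):
    """Standard alpha-fairness utility; `floor` (an assumption) keeps it finite."""
    x = max(x, floor)
    if alpha == 1.0:
        return math.log(x)
    return x ** (1.0 - alpha) / (1.0 - alpha)


def network_objective(q_network, q_others, num_cs_nodes, alpha):
    """Alpha-fair objective for one candidate action.

    q_network: approximate Q value of the whole CS-DLMA network.
    q_others:  approximate Q values of the nodes from other networks.
    """
    # Each CS-DLMA node is credited with an equal share of the network's Q value.
    per_node = q_network / num_cs_nodes
    own_term = num_cs_nodes * alpha_fair_utility(per_node, alpha)
    others_term = sum(alpha_fair_utility(q, alpha) for q in q_others)
    return own_term + others_term
```

Under this reading, with two CS-DLMA nodes and a large α, an allocation in which the network obtains twice the Q value of a single other node scores higher than one in which the network and the other node obtain equal totals, which matches the per-node fairness interpretation.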
VI-4 Carrier-Sense ε-greedy Algorithm
VII Multi-Node CS-DLMA Performance Evaluation
This section evaluates the performance of the multi-node CS-DLMA framework. We first consider the coexistence of two CS-DLMA nodes with one WiFi node to examine whether our multi-node CS-DLMA framework can adjust its transmission strategy according to both the value of α and the number of CS-DLMA nodes. One of the two CS-DLMA nodes is designated as the gateway. As in Sections V-B and V-C, we also assume CS-DLMA nodes can transmit packets of variable length, with a maximum length of 10 minislots. The settings of the WiFi node are the same as in Section V-C.
Fig. 9 plots the sum throughput of the two CS-DLMA nodes and the throughput of the WiFi node. As can be seen from Fig. 9, when α increases, the sum throughput of CS-DLMA and the throughput of WiFi get closer. Specifically, when α = 50, the sum throughput of CS-DLMA is twice the throughput of WiFi, which means the throughput of each CS-DLMA node is equal to the throughput of WiFi. This is consistent with our observation in Fig. 7 that when α = 50, the throughput of one CS-DLMA node is equal to the throughput of one WiFi node. This demonstrates that our formulation of multi-node CS-DLMA can adjust the weight of CS-DLMA according to the number of CS-DLMA nodes.
To further demonstrate the performance of the multi-node CS-DLMA framework, we now consider three coexistence scenarios:
1) four CS-DLMA nodes with four WiFi nodes;
2) four p-CSMA nodes with four WiFi nodes;
3) eight WiFi nodes.
In scenario 1), the value of α is set to 50, i.e., we want to achieve equal throughputs among the four CS-DLMA nodes and the four WiFi nodes; in scenario 2), each p-CSMA node adopts the same value of p, and we adjust the value of p so that the throughput of each p-CSMA node is equal to the throughput of each WiFi node; in scenario 3), the eight WiFi nodes are homogeneous. In addition, CS-DLMA, p-CSMA, and WiFi all adopt the same settings as in Section V-C.
Fig. 10 presents the individual throughputs of each node in the above three scenarios. Overall, roughly equal throughputs among all nodes can be achieved in all scenarios. However, the throughput in scenario 1) is about higher than those of scenarios 2) and 3).
In this paper, we developed a deep reinforcement learning multiple access protocol with carrier sensing capability, referred to as CS-DLMA. The goal of CS-DLMA is to enable efficient and equitable spectrum sharing among a group of co-located heterogeneous wireless networks. A salient feature of CS-DLMA is that it can coexist harmoniously with other MAC protocols in the heterogeneous environment without knowing the MAC details of other networks. In particular, we demonstrated that CS-DLMA can achieve a general α-fairness objective [mo2000fair] when coexisting with TDMA, ALOHA, and WiFi protocols by adjusting its own transmission strategies. Interestingly, we also found that CS-DLMA is more Pareto efficient than other CSMA protocols, e.g., p-persistent CSMA, when coexisting with WiFi.
The underpinning DRL technique in CS-DLMA is deep Q-network (DQN). However, the original DQN and its extension, multi-dimensional DQN [yu2019deep], are not applicable to CSMA protocol design because of the uniform time-step assumption underlying the DQN framework: for CSMA protocols, time steps are non-uniform in that the duration of carrier sensing is smaller than the duration of data transmission. In this paper, we introduced a non-uniform time-step formulation of DQN to address this issue. Although we focus only on the use of the modified DQN algorithm for wireless networking, we believe the non-uniform time-step DQN can also find use in other domains, e.g., the Treasury bond investment problem mentioned in this paper.
The CS-DLMA framework in this paper assumes the saturated scenario in which all the nodes always have packets to transmit. This will be the case, for example, when the nodes are transmitting large files containing many packets. In other practical scenarios, some nodes may be unsaturated in that they only have packets to transmit intermittently. It will be of interest to investigate CS-DLMA that can deal with heterogeneous networks with a mix of saturated nodes and unsaturated nodes in the future.
This appendix compares the performance of RNN and FNN in CS-DLMA design. In Section V-B, we showed that CS-DLMA with the RNN architecture can find the optimal strategies for different α values. In this appendix, we again consider the coexistence of one CS-DLMA node with one TDMA node and one ALOHA node. The settings are the same as in Section V-B except that we use the FNN architecture instead of the RNN in CS-DLMA. In particular, the FNN with two hidden layers is the same as the RNN introduced in Section V-A except that we replace the LSTM layer in the RNN with a feedforward layer. For FNNs with more hidden layers (e.g., 10, 20 and 40), we adopt the residual network structure as in [yu2019deep]. The reason for using the residual network structure is to ease the training of networks with large numbers of hidden layers [he2016deep].
Fig. 11 presents the individual throughputs of CS-DLMA, TDMA and ALOHA, and their corresponding optimal results. In particular, for different rows in Fig. 11, CS-DLMA uses a different number of hidden layers; for different columns, we test the performance of CS-DLMA for different α values. As can be seen from Fig. 11, CS-DLMA with FNN fails to find the optimal strategies in most of the cases, while from Fig. 6 in Section V-B, we can see that CS-DLMA with RNN can find the optimal strategies for different α values.
As mentioned earlier in Section IV-C, the causal relationship between different elements in the input is explicitly modeled in the RNN but not in the FNN. We conjecture that this allows the RNN to search within a narrower solution space for a good solution (i.e., the RNN only needs to learn within a smaller space, allowing it to learn a good solution in a more focused manner).
This appendix derives the benchmark for the case of one CS-DLMA node coexisting with one TDMA node and one ALOHA node. These nodes adopt the settings introduced in Section V-B: the CS-DLMA node can transmit packets of variable length with a maximum of 10 minislots; the TDMA node occupies the second and the fifth TDMA slots within a TDMA frame of five TDMA slots; the ALOHA node transmits with a fixed probability in each ALOHA slot; and the packet durations of TDMA and ALOHA are both fixed at 10 minislots.
To derive the benchmark, we imagine a model-aware node that is aware of the MAC details as well as the packet durations of TDMA and ALOHA. We replace the CS-DLMA node with this model-aware node in the setting described in the previous paragraph and examine the network performance that can be achieved by this model-aware node. Given that the packet durations of TDMA and ALOHA are the same, we assume that the TDMA slots and the ALOHA slots are aligned in time. In the rest of this appendix, “slot” refers to the TDMA/ALOHA slot.
The transmission pattern of TDMA is fixed and not probabilistic. We can divide slots into two categories according to the usage pattern of TDMA: 1) slots occupied by TDMA and 2) slots not occupied by TDMA. For 1), the optimal strategy of the model-aware node is “not to transmit” for any value of α (transmissions by the model-aware node in these slots will result in collisions and will not contribute to the throughput of TDMA, ALOHA, or the model-aware node). For 2), we can simplify the problem to the coexistence of the model-aware node with one ALOHA node.
In general, when coexisting with the ALOHA node, the model-aware node has two candidate strategies, one of which is the optimal strategy for a particular value of α. These two strategies are given as follows:
Greedy strategy: the model-aware node transmits in all slots of category 2), which results in the throughput of the ALOHA node being zero.
Polite strategy: the model-aware node first performs carrier sensing in the first minislot and then decides whether to transmit in the next 9 minislots based on the carrier sensing result: if the channel is sensed idle (i.e., ALOHA is not transmitting), then the model-aware node transmits a packet in the next 9 minislots; if the channel is sensed busy (i.e., ALOHA is transmitting a packet in the current slot), then the model-aware node keeps silent in the next 9 minislots.
We can calculate the individual throughputs of the model-aware node and the ALOHA node in an ALOHA slot for these two strategies; the results are summarized in Table III.
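As a rough illustration of this comparison, the sketch below evaluates the two strategies under an idealized accounting in which a node's per-slot throughput is the expected fraction of the slot's 10 minislots that carry its successful packet. The ALOHA transmission probability `q`, the value 0.5 used in the usage note, and all function names are hypothetical; any overheads in the paper's reward definition are not modeled, so the numbers need not match Table III.

```python
import math


def alpha_fair_utility(x, alpha):
    """Standard alpha-fairness utility, with -inf for zero throughput when alpha >= 1."""
    if x <= 0.0 and alpha >= 1.0:
        return -math.inf
    if alpha == 1.0:
        return math.log(x)
    return x ** (1.0 - alpha) / (1.0 - alpha)


def slot_throughputs(strategy, q, slot_len=10):
    """Return (model-aware throughput, ALOHA throughput) for a slot not used by TDMA."""
    if strategy == "greedy":
        # Transmits in every slot: succeeds only when ALOHA is silent,
        # and every ALOHA packet collides.
        return (1.0 - q, 0.0)
    if strategy == "polite":
        # One minislot of sensing, then slot_len - 1 minislots of data
        # if ALOHA is silent; ALOHA packets are never disturbed.
        return ((1.0 - q) * (slot_len - 1) / slot_len, q)
    raise ValueError(f"unknown strategy: {strategy}")


def objective(strategy, q, alpha):
    """Sum of alpha-fair utilities of the two nodes."""
    return sum(alpha_fair_utility(x, alpha) for x in slot_throughputs(strategy, q))
```

With the illustrative value q = 0.5, `objective("polite", 0.5, alpha)` exceeds `objective("greedy", 0.5, alpha)` for α = 0, 1, 2 and 50 in this simplified model. For α = 0 and a very small q the ordering can flip, so the conclusion depends on the ALOHA transmission probability in use.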
From the results shown in Table III, the polite strategy is the optimal strategy for any value of α. Therefore, the optimal strategy of the model-aware node when coexisting with one TDMA node and one ALOHA node under the settings in Section V-B can be summarized as follows:
At the beginning of each TDMA/ALOHA slot, the model-aware node performs carrier sensing. If the channel is idle, the model-aware node transmits in the next 9 minislots; if the channel is busy, the model-aware node keeps silent in the next 9 minislots.
Based on the above strategy, the individual throughputs of the model-aware node, the TDMA node, and the ALOHA node can be calculated as 0.255, 0.19, and 0.285, respectively.