Carrier-Sense Multiple Access for Heterogeneous Wireless Networks Using Deep Reinforcement Learning

10/16/2018 ∙ by Yiding Yu, et al. ∙ The Chinese University of Hong Kong

This paper investigates a new class of carrier-sense multiple access (CSMA) protocols that employ deep reinforcement learning (DRL) techniques for heterogeneous wireless networking, referred to as carrier-sense deep-reinforcement learning multiple access (CS-DLMA). Existing CSMA protocols, such as the medium access control (MAC) of WiFi, are designed for a homogeneous network environment in which all nodes adopt the same protocol. Such protocols suffer from severe performance degradation in a heterogeneous environment where there are nodes adopting other MAC protocols. This paper shows that DRL techniques can be used to design efficient MAC protocols for heterogeneous networking. In particular, in a heterogeneous environment with nodes adopting different MAC protocols (e.g., CS-DLMA, TDMA, and ALOHA), a CS-DLMA node can learn to maximize the sum throughput of all nodes. Furthermore, compared with WiFi's CSMA, CS-DLMA can achieve both higher sum throughput and individual throughputs when coexisting with other MAC protocols. Last but not least, a salient feature of CS-DLMA is that it does not need to know the operating mechanisms of the co-existing MACs. Neither does it need to know the number of nodes using these other MACs.




I Introduction

This paper investigates a new class of carrier-sense multiple access (CSMA) protocols based on deep reinforcement learning (DRL) for heterogeneous wireless networking, referred to as carrier-sense deep-reinforcement learning multiple access (CS-DLMA). We show that nodes adopting CS-DLMA can learn a medium access strategy that maximizes the sum throughput of a heterogeneous network consisting of nodes adopting different medium access control (MAC) protocols. Furthermore, CS-DLMA achieves this without prior knowledge of the participating MAC protocols in the heterogeneous network.

CSMA MAC protocols are widely used in practical networks today. However, these CSMA MACs are designed for homogeneous networks in which all nodes use the same CSMA MAC. A case in point is WiFi. The carrier sensing, collision avoidance, and binary exponential backoff mechanisms of WiFi [1] work well only if all nodes in the network adopt the same mechanisms. They do not work well in a heterogeneous network. To illustrate, consider the co-existence of a WiFi node and a node operating the time-division multiple access (TDMA) protocol. The TDMA node transmits in specific time slots in a frame consisting of multiple time slots, in a repetitive manner from frame to frame, as illustrated in Fig. 1. In particular, the TDMA channel access pattern is oblivious of the MAC protocol of WiFi; similarly, the MAC of WiFi is oblivious of the TDMA channel access pattern. As shown in Fig. 1, the WiFi node may sense the channel to be idle and decide to transmit a packet, only to have the TDMA node transmit a packet shortly thereafter to result in a collision. A goal of CS-DLMA is to circumvent this problem through DRL.

In particular, we are interested in a “model-free” approach in which the CSMA protocol does not have detailed knowledge of the operating mechanisms of the other co-existing MAC protocols. Furthermore, the number of nodes operating each MAC protocol is also unknown. In other words, the CSMA node does not have a model that describes the heterogeneous environment precisely. The nodes operating CS-DLMA, referred to as DLMA nodes, must learn on the fly. Although there has been prior work on heterogeneous wireless networking, much prior work adopts a model-aware approach in which the knowledge of the co-existing MAC protocols is available – e.g., [2] investigated an LTE network that employs DRL for harmonious co-existence with WiFi assuming full knowledge of the WiFi MAC mechanism.

Fig. 1: Inharmonious co-existence of a TDMA node with a WiFi node. For simplicity, this example assumes each WiFi packet lasts four minislots, where a minislot is the slot used for carrier sensing by WiFi. This example also assumes each TDMA slot lasts four minislots. The TDMA node transmits packets in specific time slots within a TDMA frame repeatedly, from frame to frame, regardless of the MAC of the WiFi node. When the WiFi node senses the carrier to be idle and transmits in the subsequent four minislots, its transmission may collide with a TDMA packet that follows shortly, since the TDMA node does not perform carrier sensing before it transmits.

In general, there are many DRL techniques [3, 4]. In this paper, we focus on adapting the deep Q-network (DQN) for use in CS-DLMA. DQN is a DRL technique originally proposed for Atari game playing in the seminal paper [5]. In [5], one-step DQN was adopted, and in [6], n-step DQN was adopted. In this paper, we study both one-step and n-step DQN (see Sections II-C and III-B for details). In addition, we propose and study a new variant of DQN, referred to as reward-backpropagation DQN (RB-DQN). We show that RB-DQN can have better performance than one-step DQN and n-step DQN in a heterogeneous network consisting of ALOHA nodes, TDMA nodes, and DLMA nodes. Specifically, RB-DQN can achieve near-optimal sum throughput with faster convergence.

I-A Related Work

Since this paper focuses on MAC protocols that make use of DRL, we limit our review of related work to this domain. The DRL MAC proposed in [7] is targeted at homogeneous wireless networks. Specifically, in [7], multiple nodes access multiple orthogonal channels using the same DRL MAC. By contrast, we focus on heterogeneous networks in which our CS-DLMA protocol must learn to co-exist with other MAC protocols. In this paper, we focus on the objective of maximizing the sum throughput of all nodes in the heterogeneous environment. The generalization of this objective will be explored in our future work.

The MACs in [8] and [9] also concern multiple-channel access. Unlike in [7], the channels in [8] and [9] are time-varying: each channel follows a two-state Markov chain. In particular, [8] assumes perfect spectrum sensing and multiple correlated channels, and [9] assumes imperfect spectrum sensing and multiple orthogonal channels. In both [8] and [9], the nodes use DRL techniques to learn the system statistics (including state transition probabilities of the channels and spectrum sensing errors) to improve the spectrum utilization efficiency. By contrast, our CS-DLMA uses DRL to learn the heterogeneous nodes’ transmission patterns in time so that CS-DLMA nodes can schedule their own transmissions to achieve a certain system objective (in this paper, we focus on the objective of maximizing the sum throughput).

In [2], the authors investigated an LTE network that employs DRL for harmonious co-existence with WiFi. The focus of [2] is to allow the downlinks of LTE base stations to use the WiFi channels in a non-disruptive way. Importantly, the scheme in [2] is model-aware in that the LTE base stations know that the co-existing network is WiFi. By contrast, our CS-DLMA is model-free in that it does not presume knowledge of co-existing networks.

In our previous work [10, 11], we developed deep-reinforcement learning multiple access (DLMA) protocols for heterogeneous wireless networks. In [10, 11], we assumed that nodes of different MACs use the same packet length. This assumption limits the application of DLMA in more general heterogeneous settings in which nodes of different MACs may adopt different packet lengths. This paper removes the same-packet-length assumption and introduces carrier sensing into DLMA.

II Reinforcement Learning Preliminaries

This section overviews the reinforcement learning techniques used in our CS-DLMA protocol. There are different techniques for reinforcement learning. This paper makes use of Q-learning [4, 12]. In the reinforcement learning (RL) framework, a decision-making agent interacts with an environment in discrete time steps [4]. At time step $t$, the agent observes the environment state $s_t$ and performs an action $a_t$ chosen from an action set $A$ according to a policy $\pi$. The policy is a mapping from states to actions. Following action $a_t$, the agent receives a reward $r_{t+1}$ and the environment transits to state $s_{t+1}$ in time step $t+1$.

II-A One-Step Q-Learning

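Given a series of rewards, $\{r_{t+1}, r_{t+2}, \ldots\}$, resulting from state-action pairs $\{(s_t, a_t), (s_{t+1}, a_{t+1}), \ldots\}$, the accumulated discounted return going forward pinned at time $t$ is given by $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r_{\tau+1}$, where $\gamma$ is a discount factor. Because of the randomness in the state transitions, $R_t$ is in general a random variable. The expected accumulated discounted return of a state-action pair $(s, a)$ of a policy $\pi$ is captured by a Q action-value function, $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a_t = a, \pi]$. The Q function associated with that of an optimal policy is $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$.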

In Q-learning, the goal of the agent is to learn the optimal policy in an online manner by observing the rewards while it takes actions in successive time steps. In particular, the agent maintains the Q function, $Q(s, a)$, for any state-action pair $(s, a)$, in a tabular form. At time step $t$, given state $s_t$, the agent selects an action $a_t$ based on its current Q table. This will cause the system to return a reward $r_{t+1}$ and move to state $s_{t+1}$. The experience at time step $t$ is captured by the quadruplet $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$. At the end of time step $t$, experience $e_t$ is used to update entry $Q(s_t, a_t)$ in the Q table as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \tag{II-A}$$

The above is a smoothing operation that combines the old Q value, $Q(s_t, a_t)$, with the new sample of expected return based on $r_{t+1}$ and $s_{t+1}$, $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$, to arrive at a new Q value. The parameter $\alpha$ captures the learning rate: in the computation of the new Q value, the relative weight of the old Q value is $1 - \alpha$ and the relative weight of the new Q value sample is $\alpha$.

As a variation to (II-A), the $\varepsilon$-greedy algorithm is often adopted in action selection. Specifically, for the $\varepsilon$-greedy algorithm, the greedy action $a_t = \arg\max_{a} Q(s_t, a)$ is chosen with probability $1 - \varepsilon$, and a random action is chosen uniformly from the action set with probability $\varepsilon$. This prevents the algorithm from zooming in on a locally optimal policy and allows the agent to explore a wider spectrum of actions in search of the optimal policy [4].

Note that Q-learning is a model-free learning framework in that it tries to learn the optimal policy without having a model that describes the operating behavior of the environment beyond what can be observed through the experiences.
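The tabular update and the $\varepsilon$-greedy selection rule described in this subsection can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `epsilon_greedy` and `q_update`, the dictionary-based Q table, and the toy state labels `"s0"` and `"s1"` are assumptions made for the example.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, else a uniform random one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """One-step Q-learning: blend the old Q value with the new return sample."""
    sample = reward + gamma * max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] = (1 - alpha) * q_table[(state, action)] + alpha * sample
    return q_table[(state, action)]


# Toy usage: all Q values start at zero (hypothetical states "s0" and "s1").
q = defaultdict(float)
actions = ["TRANSMIT", "SENSE"]
q_update(q, "s0", "TRANSMIT", reward=1.0, next_state="s1",
         actions=actions, alpha=0.5, gamma=0.9)
```

With a learning rate of 0.5, the updated entry is the midpoint between the old value and the new return sample, which is the smoothing behavior described above.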

II-B n-Step Q-Learning

The above Q-learning is referred to as one-step Q-learning because it updates $Q(s_t, a_t)$ based on the one-step return, $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ [4]. One drawback of one-step Q-learning is that the reward $r_{t+1}$ only directly affects $Q(s_t, a_t)$, and not $Q(s_\tau, a_\tau)$ for $\tau < t$ in previous time steps. The values of other state-action pairs are affected only indirectly through the updated value of $Q(s_t, a_t)$ in later learning steps. This can potentially slow down the learning process, since many updates are required to propagate a reward to relevant preceding states and actions. One way to speed up the propagation of rewards is to use the n-step return [4, 13, 6]. In n-step Q-learning, we replace the one-step return with the n-step return, $\sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a')$. This results in a single reward directly affecting the values of $n$ preceding state-action pairs.
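As a small worked illustration of the n-step return, the sketch below computes the discounted sum of the next n rewards plus a bootstrapped tail. The function name `n_step_target` and its argument layout are assumptions for this example; the formula follows the standard n-step return of [4, 6].

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Compute the n-step return.

    rewards: the n rewards observed after the state-action pair, in order.
    bootstrap_value: max over actions of the Q estimate at the state n steps ahead.
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_value


# A reward of 4 arriving three steps later still reaches the first
# state-action pair directly, discounted by gamma^3.
target = n_step_target([0.0, 0.0, 0.0, 4.0], bootstrap_value=0.0, gamma=0.9)
```

With a single reward in the list, the same function reduces to the one-step return.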

II-C Deep Q-Network

It has been shown that in a stationary environment that can be fully captured by a Markov decision process, the Q values will converge to the optimal $Q^{*}(s, a)$ if the learning rate decays appropriately and each action in the state-action pair is executed an infinite number of times in the process [4, 12]. For many real-world problems, the state-action space can be so huge that the tabular update method, which updates only one entry of $Q(s, a)$ in each time step, can take an excessive amount of time for $Q(s, a)$ to converge to $Q^{*}(s, a)$. If the environment changes in the meantime, convergence may never be attained. To allow fast convergence, function approximation methods are often used to approximate the Q values [4].

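The seminal work [5] put forth deep Q-network (DQN), wherein a deep neural network model is used to approximate the action-value function Q. For simplicity, we refer to the neural network in DQN as QNN. The input to QNN is a state $s$, and the outputs are the approximated Q values for different actions, $\{Q(s, a; \theta) : a \in A\}$, where $\theta$ is a parameter vector consisting of the weights of the edges in the neural network. For action execution, the $\varepsilon$-greedy algorithm based on the approximated Q values is adopted. For training, the parameter vector $\theta$ is updated by minimizing the following loss function:

$$L(\theta) = \sum_{(s_t, a_t, r_{t+1}, s_{t+1}) \in E} \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^2 \tag{II-C}$$

where $E$ is the minibatch of experiences used for training and $\theta^{-}$ is the parameter vector of a separate target network, both described next.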


There are two important ingredients in DQN. The first ingredient is experience replay [14, 5]. Instead of training QNN with a single experience associated with one action execution, multiple experiences can be pooled together for batch training. In particular, an experience buffer stores a fixed number of experiences gathered from different time steps. For a round of training, a minibatch $E$ consisting of random experiences taken from the experience buffer is used in the computation of (II-C). The second ingredient is the use of a separate “target” neural network in the computation of the target value, $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$, in (II-C). In particular, the target neural network’s parameter vector is $\theta^{-}$ rather than the $\theta$ of the QNN being trained. This separate target neural network is named target QNN and is a copy of a previously used QNN: the parameter vector $\theta^{-}$ of target QNN is periodically updated to the latest $\theta$ of QNN.
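The two ingredients can be sketched in a few lines. The following is an illustrative outline under stated assumptions, not the authors' code: the class `ReplayBuffer`, the helper `maybe_sync_target`, and the representation of network parameters as a plain list are all hypothetical simplifications.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience buffer; the oldest experience is dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng=random):
        """Draw a random minibatch (smaller if the buffer is not yet filled)."""
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))


def maybe_sync_target(step, theta, theta_target, update_every):
    """Return a fresh copy of the online parameters every `update_every` steps."""
    if step % update_every == 0:
        return list(theta)
    return theta_target


buf = ReplayBuffer(capacity=3)
for experience in range(5):
    buf.store(experience)
```

Keeping the target parameters frozen between synchronizations means the regression target in the loss does not shift with every gradient step.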

We refer to the above DQN algorithm as one-step DQN. The extension to the n-step DQN algorithm is obvious (details to be given in Section III-B).

III CS-DLMA

This section specifies the system model and the methodology of CS-DLMA investigated in this paper.

III-A System Model

We consider time-slotted heterogeneous wireless networks in which different nodes transmit packets to an access point (AP) via a shared wireless channel. In this paper, we consider four types of networks whose nodes use different protocols: (i) CS-DLMA, (ii) WiFi (more exactly, a simplified WiFi-like CSMA protocol), (iii) TDMA, and (iv) different variants of ALOHA. Among them, CS-DLMA and WiFi nodes have the capability for carrier sensing, while TDMA and ALOHA nodes do not.

We assume different networks may have different slot granularities. The smallest slot is the basic slot used by DLMA nodes to perform carrier sensing or to transmit packets. The basic slot is also used by WiFi nodes to perform carrier sensing. The WiFi slot, TDMA slot and ALOHA slot each consist of multiple basic slots and are used by WiFi nodes, TDMA nodes and ALOHA nodes, respectively, to transmit packets (i.e., a WiFi/TDMA/ALOHA packet lasts the duration of one WiFi/TDMA/ALOHA slot). Each network is thus characterized by the ratio of its slot length to the basic slot length, i.e., the number of basic slots in one WiFi, TDMA or ALOHA slot. We assume a node can begin transmission only at the beginning of its own packet slot and must finish the transmission at the end of this packet slot. Simultaneous transmissions by multiple nodes result in a collision. A packet transmitted without collision is successfully received by the AP. After each successful transmission, the AP broadcasts an acknowledgment that contains the packet length information, interpreted as a “reward” in RL, as will be elaborated later in this subsection.

TABLE I: MAC mechanisms of different nodes.

Table I summarizes the MAC mechanisms of different nodes. Part of this paper will investigate and compare “co-existence of DLMA with TDMA and ALOHA” with “co-existence of WiFi with TDMA and ALOHA” (see Section IV-C).

We now give the details of CS-DLMA. To transform the medium access problem faced by a DLMA node to a reinforcement learning problem, we need to define the corresponding action, state, and reward in RL.

The action taken by a DLMA node in basic slot $t$ is $a_t \in$ {TRANSMIT, SENSE}, where TRANSMIT means that the DLMA node transmits, and SENSE means that it performs carrier sensing (i.e., it does not transmit). If TRANSMIT, the agent will get an observation $z_t \in$ {SUCCESSFUL, COLLIDED} from the AP, indicating whether the packet is successfully transmitted or not; if SENSE, the agent will get an observation $z_t \in$ {BUSY, IDLE}, indicating whether the channel is being occupied or not occupied by other nodes. We define the channel state in basic slot $t$ as the action-observation pair $c_t = (a_t, z_t)$. There are four possibilities for $c_t$: {TRANSMIT, SUCCESSFUL}, {TRANSMIT, COLLIDED}, {SENSE, BUSY} and {SENSE, IDLE}. We define the environmental state in basic slot $t+1$ to be $s_{t+1} = [c_{t-M+1}, \ldots, c_{t-1}, c_t]$, where the parameter $M$ is the state history length (number of past basic slots) to be tracked by the DLMA node.

After taking action $a_t$, a reward $r_{t+1}$ is generated at the end of basic slot $t$ and the state becomes $s_{t+1}$ in basic slot $t+1$. If the channel is idle or there is a collision in basic slot $t$, then $r_{t+1} = 0$. For a successful transmission, the reward varies according to the length of the packet transmitted. In particular, if a DLMA node transmits a packet of one basic slot in duration, then $r_{t+1} = 1$; if a WiFi/TDMA/ALOHA node successfully completes the transmission of a packet lasting a few basic slots, then the reward equals the packet length in basic slots (e.g., for TDMA, if a TDMA packet of $R$ basic slots begins transmission in basic slot $t$ and the transmission is completed successfully in basic slot $t+R-1$, then the reward at the end of basic slot $t+R-1$ is $r_{t+R} = R$). Note that in this basic scheme, for a packet lasting more than one basic slot, the reward is given only at the end of the last basic slot of the successful transmission, and no reward is given in the earlier basic slots. In this study, both one-step DQN and n-step DQN use this basic reward scheme. We will also introduce and investigate another reward scheme called reward-backpropagation that amortizes the reward over every basic slot during which the packet is in transmission.

III-B Methodology

In [10, 11], we put forth DLMA protocols without carrier sensing for co-existence with different nodes transmitting packets of the same length. DLMA is based on one-step DQN in which the QNN is a feedforward neural network (FNN).

However, in our new setting here with the introduction of carrier sensing and different slot lengths, we find that CS-DLMA using “FNN + one-step DQN” fails to learn an optimal strategy, as will be detailed in Section IV-B. As a potential solution, we put forth a reward-backpropagation DQN (RB-DQN) algorithm that outperforms the original one-step DQN and n-step DQN. Furthermore, we explore the use of recurrent neural networks (RNN) as a replacement for FNN.

Fig. 2 shows the overall implementation architecture that realizes CS-DLMA assuming the QNN is an RNN. We now describe four key components in the architecture: (i) neural network, (ii) experience buffer, (iii) continuous experience replay and (iv) loss function.

III-B1 Neural Network

The RNN consists of an input layer, two hidden layers, and an output layer. The input to the RNN is the current state. The two hidden layers consist of a long-short-term-memory (LSTM) [15] layer and an FNN layer. The outputs are the approximated Q values for different actions given the input state.

Fig. 2: Architecture of components realizing CS-DLMA. (Footnote 2: For convenience, in our simulation, we assume execution of decisions and training of QNN are synchronous. In particular, the training is done at the end of each time step after an execution. In practice, execution and training can be done asynchronously and in parallel. A detailed discussion can be found in our paper [11].)

In particular, the input to the input layer in basic slot $t$ is state $s_t = [c_{t-M}, \ldots, c_{t-1}]$, where $c_\tau = (a_\tau, z_\tau)$ is the channel state of basic slot $\tau$, and $z_\tau$ in $c_\tau$ is the observation of the DLMA node with four possibilities: SUCCESSFUL, COLLIDED, BUSY or IDLE. We adopt one-hot encoding [16] to encode these four possibilities.
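A possible one-hot encoding of the state history is sketched below. The ordering of the four observations in `OBSERVATIONS` and the function names are assumptions chosen for illustration; the paper does not specify them. Since each observation implies the action that produced it (SUCCESSFUL and COLLIDED follow TRANSMIT, BUSY and IDLE follow SENSE), encoding the observation alone is enough to identify the channel state.

```python
# Assumed ordering of the four possible observations (hypothetical).
OBSERVATIONS = ["SUCCESSFUL", "COLLIDED", "BUSY", "IDLE"]


def one_hot(observation):
    """Map one observation to a 4-dimensional one-hot vector."""
    vec = [0] * len(OBSERVATIONS)
    vec[OBSERVATIONS.index(observation)] = 1
    return vec


def encode_state(history):
    """Encode a history of observations as a sequence of one-hot vectors.

    The result has one row per tracked basic slot, oldest slot first, so it can
    be fed element by element to a recurrent layer or flattened for an FNN.
    """
    return [one_hot(obs) for obs in history]


state = encode_state(["IDLE", "SUCCESSFUL", "BUSY"])
```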

Fig. 3 shows the difference between FNN-based QNN and RNN-based QNN in processing the state $s_t$ received from the input layer. After receiving $s_t$, FNN processes it directly; by contrast, after receiving $s_t$, RNN processes the elements (the channel states $c_{t-M}, \ldots, c_{t-1}$) in $s_t$ sequentially, keeping an internal state as it moves from one element to the next. In this way, the causal relationship between elements in $s_t$ (e.g., $c_{t-2}$ precedes $c_{t-1}$) is explicitly embedded into the way RNN processes the input [16]. On the other hand, the causal relationship between elements in $s_t$ is not explicitly given to FNN. FNN will need to learn this relationship, if it manages to learn it at all.

(a) FNN
(b) RNN
Fig. 3: FNN-based QNN versus RNN-based QNN.

III-B2 Experience Buffer

In one-step DQN [5], an experience is defined by the quadruplet $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$ and is stored in the experience buffer after each interaction between the agent and the environment. In n-step DQN and RB-DQN, there are some modifications.

  • n-step DQN. The experience is redefined as $(s_t, a_t, r_{t+1}, \ldots, r_{t+n}, s_{t+n})$ in order to compute the n-step return in the loss function (given in the later part of this subsection).

  • RB-DQN. After storing $(s_t, a_t, r_{t+1}, s_{t+1})$ into the experience buffer, a reward-backpropagation mechanism is performed. This mechanism first checks the value of $r_{t+1}$. If $r_{t+1} = R > 1$, then it sets $r_{t+1} = r_t = \cdots = r_{t-R+2} = 1$, amortizing and backpropagating the reward to experiences of earlier time steps. If $r_{t+1} = 0$ or $r_{t+1} = 1$, then it does nothing.

For implementation, it is inefficient to store an experience in the form of $(s_t, a_t, r_{t+1}, s_{t+1})$ or $(s_t, a_t, r_{t+1}, \ldots, r_{t+n}, s_{t+n})$ since two consecutive experiences have many common elements. For example, $s_{t+1}$ in $e_{t+1}$ is only a time-shifted version of $s_t$ in $e_t$. It is superfluous to store the overlapping elements for both experiences. A more efficient implementation is to store the abbreviated experience $(c_t, r_{t+1})$, where $c_t = (a_t, z_t)$ is the channel state (action-observation pair) of basic slot $t$. The complete experience $(s_t, a_t, r_{t+1}, s_{t+1})$ or $(s_t, a_t, r_{t+1}, \ldots, r_{t+n}, s_{t+n})$ can be obtained from consecutive abbreviated experiences by means of continuous experience replay (detailed in the next paragraph). Note that for n-step DQN, we do not need to redefine an abbreviated experience to $(c_t, r_{t+1}, \ldots, r_{t+n})$. For RB-DQN, the reward-backpropagation mechanism is still necessary.
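The reward-backpropagation step on a buffer of abbreviated experiences might look like the sketch below. It assumes, as one reading of the mechanism described above, that a reward larger than 1 is replaced by a reward of 1 in each of the most recent experiences covering the packet. The function name `store_with_backprop` and the list-of-lists buffer layout are hypothetical.

```python
def store_with_backprop(buffer, channel_state, reward):
    """Append an abbreviated experience [channel_state, reward], then amortize.

    If the new reward exceeds 1 (a multi-slot packet just completed), the reward
    is spread backwards: each of the last `reward` experiences gets reward 1.
    Rewards of 0 or 1 are stored unchanged.
    """
    buffer.append([channel_state, reward])
    if reward > 1:
        length = int(reward)
        for entry in buffer[-length:]:
            entry[1] = 1


# Toy run: one idle slot, then a four-slot packet from another node
# that the DLMA node senses as BUSY and that completes in the last slot.
buffer = []
store_with_backprop(buffer, ("SENSE", "IDLE"), 0)
for slot_reward in [0, 0, 0, 4]:
    store_with_backprop(buffer, ("SENSE", "BUSY"), slot_reward)
rewards = [r for _, r in buffer]
```

The total reward credited to the packet is unchanged, but every slot in which the packet was on the air now carries a share of it.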

III-B3 Continuous Experience Replay

In conventional experience replay [14, 5], random experiences are sampled from the experience buffer to compute the loss function, with each sample being an experience $(s_t, a_t, r_{t+1}, s_{t+1})$. After downsizing the experience to the abbreviated form $(c_t, r_{t+1})$, we sample continuous (i.e., consecutive) experiences instead to extract the information necessary for computing the loss function. For one-step DQN and RB-DQN, as illustrated in Fig. 4(a), each sample contains $M+1$ continuous experiences, and we extract $s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$ from it. For n-step DQN, as illustrated in Fig. 4(b), each sample contains $M+n$ continuous experiences, and we extract $s_t$, $a_t$, $r_{t+1}, \ldots, r_{t+n}$, $s_{t+n}$ from it.
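The extraction of a full one-step experience from a run of consecutive abbreviated experiences can be sketched as follows. The layout is an assumption for illustration: each buffer entry is a pair of a channel state (an action-observation tuple) and the reward that followed it, and `M` is the state history length. The function names are hypothetical.

```python
import random


def extract_one_step(buffer, start, M):
    """Rebuild (state, action, reward, next_state) from M + 1 consecutive entries.

    The first M channel states form the state, the last M form the next state,
    and the final entry supplies the action taken and the reward received.
    """
    window = buffer[start:start + M + 1]
    channel_states = [c for c, _ in window]
    state = channel_states[:M]
    next_state = channel_states[1:]
    action = window[-1][0][0]  # action part of the last (action, observation) pair
    reward = window[-1][1]
    return state, action, reward, next_state


def sample_one_step(buffer, M, rng=random):
    """Pick a random start index such that M + 1 consecutive entries exist."""
    start = rng.randrange(len(buffer) - M)
    return extract_one_step(buffer, start, M)


buffer = [
    (("SENSE", "IDLE"), 0),
    (("SENSE", "BUSY"), 0),
    (("TRANSMIT", "SUCCESSFUL"), 1),
    (("SENSE", "IDLE"), 0),
]
state, action, reward, next_state = extract_one_step(buffer, start=0, M=2)
```

Because the state and the next state overlap in all but one element, storing only the per-slot pair avoids duplicating the shared history in every experience.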

(a) one-step DQN or RB-DQN
(b) n-step DQN
Fig. 4: A sample in continuous experience replay.

III-B4 Loss Function

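The loss function (II-C) is only suitable for one-step DQN and RB-DQN. A more general loss function that takes the n-step return into consideration is given by:

$$L(\theta) = \sum_{e \in E} \left( \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^2 \tag{III-B4}$$

where $e = (s_t, a_t, r_{t+1}, \ldots, r_{t+n}, s_{t+n})$ is an experience in the minibatch $E$. When $n = 1$, (III-B4) is the same as (II-C). With a loss function definition (II-C) or (III-B4), a minibatch gradient descent algorithm [16] can then be used to update parameter $\theta$.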
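To make the structure of the n-step loss concrete, the sketch below evaluates it for a minibatch given two Q-value functions. It only computes the loss value; gradient computation is left to whatever deep learning framework is used. The name `n_step_loss` and the representation of the online and target networks as Python callables returning a dictionary of Q values are assumptions for illustration.

```python
def n_step_loss(batch, q_online, q_target, gamma):
    """Sum of squared n-step temporal-difference errors over a minibatch.

    batch: list of (state, action, rewards, state_after_n_steps), where
           rewards is the list of the n rewards that followed the action.
    q_online, q_target: callables mapping a state to {action: Q value}.
    """
    loss = 0.0
    for state, action, rewards, state_n in batch:
        n = len(rewards)
        target = sum(gamma ** k * r for k, r in enumerate(rewards))
        target += gamma ** n * max(q_target(state_n).values())
        loss += (target - q_online(state)[action]) ** 2
    return loss


def q_online(state):
    # Hypothetical fixed outputs standing in for the QNN being trained.
    return {"TRANSMIT": 1.0, "SENSE": 0.5}


def q_target(state):
    # Hypothetical fixed outputs standing in for the target QNN.
    return {"TRANSMIT": 2.0, "SENSE": 0.0}


one_step = n_step_loss([("s", "TRANSMIT", [1.0], "s2")], q_online, q_target, 0.5)
two_step = n_step_loss([("s", "SENSE", [0.0, 4.0], "s3")], q_online, q_target, 0.5)
```

Passing a single reward per experience recovers the one-step loss, which mirrors the statement that the general loss reduces to the one-step loss when n equals 1.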

IV Performance Evaluation

This section evaluates the performance of CS-DLMA. First, we describe the simulation setup, including the values of the hyperparameters used in DQN, the performance metric, and the benchmark. Second, we compare the performances of variants of CS-DLMA with different neural networks and different DQN implementations. Third, we compare the performances of CS-DLMA and WiFi when they co-exist with ALOHA and TDMA. Finally, we present detailed performance results of “RNN + RB-DQN”, the best-performing CS-DLMA variant studied in this paper, under different heterogeneous network settings.

IV-A Simulation Setup

IV-A1 Hyperparameters

As shown in Fig. 3, the RNN has two hidden layers: one LSTM layer followed by one FNN layer. The number of neurons for each layer is 64 and the activation functions are ReLU [16]. We use RMSProp [17] to conduct minibatch gradient descent on (II-C) or (III-B4). Since we assume CS-DLMA does not know the mechanisms of the co-existing MACs, we use a relatively large state history length $M$ to cover a longer history so as to learn the behavior of potentially complex MACs (although in actuality, the MACs that we study here are not that complex and a small $M$ may suffice). Specifically, for our simulations, we set $M = 40$. To prevent the algorithms from getting stuck with a suboptimal decision policy before they gather enough experiences, we apply an exponential-decay $\varepsilon$-greedy method: $\varepsilon$ is initially set to 0.1 and decays by a multiplicative factor of 0.995 every basic slot until its value reaches 0.005. The values of hyperparameters are summarized in Table II.
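The exploration schedule stated above (start at 0.1, multiply by 0.995 every basic slot, stop at 0.005) can be written out as a short sketch. The function names are hypothetical; the numeric defaults are the values given in the text.

```python
import math


def epsilon_at(slot, start=0.1, decay=0.995, floor=0.005):
    """Exploration probability after `slot` basic slots of exponential decay."""
    return max(floor, start * decay ** slot)


def slots_until_floor(start=0.1, decay=0.995, floor=0.005):
    """Smallest number of slots after which the decayed value is at or below the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


floor_slot = slots_until_floor()
```

Under these values the floor is reached after roughly 600 basic slots, which is short compared with the simulated horizon, so the agent acts almost greedily for most of each run.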

Hyperparameter                                     Value
State history length                               40
$\varepsilon$ in $\varepsilon$-greedy algorithm    0.1 to 0.005
Discount factor                                    0.9
Experience buffer size                             500
Experience-replay minibatch size                   32
Target network update frequency                    200
TABLE II: CS-DLMA Hyperparameters

(a) Different DQNs using two-hidden-layer FNN
(b) Different DQNs using RNN
(c) RB-DQN using RNN and RB-DQN using FNNs of different numbers of hidden layers
Fig. 5: Short-term sum throughputs when one DLMA node (using different CS-DLMA algorithms) co-exists with one $q$-ALOHA node and one TDMA node. Each line (except the black line) is averaged over four different runs.

IV-A2 Performance Metrics

In this paper, the objective of the DLMA node is to maximize the overall sum throughput. The throughput is defined as the reward (i.e., the number of basic slots carrying successfully delivered packets) accumulated over a smoothing window of $N$ basic slots, divided by $N$. In our performance study, for “short-term throughput” at basic slot $t$, $N$ is set to 1000 (i.e., the throughput averaged over the past 1000 basic slots); for “long-term cumulative throughput” at basic slot $t$, $N$ is set to cover all basic slots so far (i.e., the throughput averaged from basic slot 0 to basic slot $t$).
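Both throughput metrics can be tracked with a small helper like the one below. This is a sketch under the assumption that each basic slot is credited with the number of successfully delivered packet-slots attributed to it; the class name `ThroughputMeter` and its methods are hypothetical.

```python
from collections import deque


class ThroughputMeter:
    """Tracks short-term (sliding window) and long-term cumulative throughput."""

    def __init__(self, window):
        self.window = window
        self.recent = deque(maxlen=window)  # credits of the last `window` slots
        self.total = 0.0
        self.slots = 0

    def record(self, delivered):
        """Record the credit earned in one basic slot."""
        self.recent.append(delivered)
        self.total += delivered
        self.slots += 1

    def short_term(self):
        """Average over the smoothing window of the most recent slots."""
        return sum(self.recent) / self.window

    def cumulative(self):
        """Average over all slots recorded so far."""
        return self.total / self.slots


meter = ThroughputMeter(window=4)
for credit in [1, 1, 1, 1, 0, 0]:
    meter.record(credit)
```

The short-term value reacts quickly to changes in the learned policy, while the cumulative value smooths out the early exploration phase over the whole run.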

IV-A3 Benchmark

The benchmark used in this paper is the optimal sum throughput that can be achieved by a model-aware node. The model-aware node knows the MAC mechanisms of co-existing nodes as well as the number of nodes executing each MAC protocol. For example, for co-existence with TDMA and ALOHA, the model-aware node knows the time slots during which TDMA nodes transmit and the random-access mechanism of the ALOHA nodes, as well as the number of TDMA nodes and the number of ALOHA nodes. The model-aware node executes an optimal MAC that maximizes the sum throughput based on this knowledge. The derivations of the optimal MAC and the associated sum throughput are given in [18]; we omit the derivations here to save space.

IV-B Different Variants of CS-DLMA

We first present performance results of different variants of CS-DLMA under a specific heterogeneous network setting. In particular, we consider the co-existence of one DLMA node with one $q$-ALOHA node and one TDMA node.

The transmission probability of the $q$-ALOHA node is 0.4; the TDMA node occupies 2 TDMA slots within a TDMA frame of 5 TDMA slots. Both $q$-ALOHA and TDMA have a packet length of 4 basic slots in this study (we will study $q$-ALOHA and TDMA with different packet lengths later). For n-step DQN, we set $n = 4$, i.e., $n$ equals the packet length of the $q$-ALOHA and TDMA nodes.

Fig. 5 presents the short-term sum throughputs of different variants of CS-DLMA algorithms studied here. We also present the optimal sum throughput achieved by a model-aware node when it replaces the DLMA node. For the results in Fig. 5(a) and Fig. 5(b), the numbers of hidden layers in both FNN and RNN are 2—the only difference is that the first hidden layer of RNN is LSTM (see Fig. 3).

From Fig. 5(a), we can see that “FNN + one-step DQN” cannot achieve optimal sum throughput within the 100 thousand simulated basic slots. “FNN + n-step DQN” did even worse (we leave the detailed investigation of why that is the case for the future). By contrast, “FNN + RB-DQN” can achieve near-optimal sum throughput. As we can see from Fig. 5(b), after replacing FNN with RNN, “RNN + one-step DQN” and “RNN + RB-DQN” can both achieve near-optimal sum throughput. Furthermore, compared with using FNN, using RNN allows faster convergence and smoother throughput with less jitter. Between “RNN + one-step DQN” and “RNN + RB-DQN”, we notice that the latter has faster convergence: “RNN + RB-DQN” needs around 3500 basic slots to achieve the near-optimal performance, while “RNN + one-step DQN” needs more than 10000 basic slots to do so.

The single LSTM layer in the RNN structure, when unfolded in time, corresponds to $M$ layers of computation (see Fig. 3). We next explore if FNN can achieve performance comparable to RNN when we increase the number of hidden layers in the FNN structure. Fig. 5(c) compares the results between “RNN + RB-DQN” and “FNN + RB-DQN”. The RNN is the same as in Fig. 5(a) and 5(b), while the number of hidden layers of FNN varies. For the FNNs with larger numbers of hidden layers, the FNN adopts the residual network structure as in [11]. The reason to use the residual network structure is to avoid potential overfitting due to the large number of hidden layers [19].

As can be seen from Fig. 5(c), “FNN + RB-DQN” with more hidden layers cannot achieve performance comparable to that of “RNN + RB-DQN” either. As mentioned earlier in Section III-B, the causal relationship between different elements in the input is explicitly modeled in RNN but not in FNN. Perhaps this allows the RNN to search within a narrower solution space for a good solution (i.e., RNN only needs to learn within a smaller solution space, allowing it to learn a good solution in a more focused manner).

IV-C CS-DLMA versus WiFi

We next compare the performances of CS-DLMA and WiFi in heterogeneous networks. As in Section IV-B, we consider the co-existence with a $q$-ALOHA node and a TDMA node. The setups of the $q$-ALOHA node and the TDMA node are the same as in Section IV-B. For CS-DLMA, we adopt “RNN + RB-DQN”. We then replace the DLMA node by a WiFi node and run the experiment again. For the WiFi node, the carrier sensing slot and the backoff slot of WiFi are both set to one basic slot, the initial window size is set to 2, and the maximum backoff stage of WiFi is set to 2. The packet length of the WiFi node varies from 1 basic slot to 4 basic slots. As a side note, we did try WiFi with different initial window sizes, maximum backoff stages, and packet lengths, but found no substantial performance difference among different settings. To conserve space, here we only present the results with varying packet lengths.
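For readers unfamiliar with binary exponential backoff, the sketch below shows how the contention window would evolve under the stated WiFi settings (initial window size 2, maximum backoff stage 2). It assumes the textbook rule in which the window doubles at each backoff stage and the counter is drawn uniformly from the window; the paper's simplified WiFi-like protocol may differ in details, and the function names are hypothetical.

```python
import random


def contention_window(stage, initial_window=2, max_stage=2):
    """Window size at a given backoff stage: doubles per stage, capped at max_stage."""
    return initial_window * 2 ** min(stage, max_stage)


def draw_backoff(stage, rng=random, initial_window=2, max_stage=2):
    """Draw a backoff counter, in basic slots, uniformly from [0, window - 1]."""
    return rng.randrange(contention_window(stage, initial_window, max_stage))


windows = [contention_window(stage) for stage in range(4)]
```

Nothing in this rule reacts to a periodic pattern such as a TDMA frame: the counter depends only on the node's own collision history, which is the limitation discussed in this subsection.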

(a) Cumulative sum throughputs
(b) Individual cumulative throughputs
Fig. 6: Long-term cumulative sum throughputs and individual cumulative throughputs when one DLMA/WiFi node co-exists with one $q$-ALOHA node and one TDMA node. The DLMA node adopts “RNN + RB-DQN” in both Fig. 6(a) and Fig. 6(b). The packet length of the WiFi node varies from 1 to 4 in Fig. 6(a) and is fixed to 2 in Fig. 6(b).

As can be seen from Fig. 6(a), CS-DLMA can approach the near-optimal sum throughput while WiFi fails to do so. For further details, Fig. 6(b) presents the individual throughputs of different nodes in the “RNN + RB-DQN” experiment and in the WiFi experiment with a WiFi packet length of 2 basic slots. As we can see from Fig. 6(b), the individual throughputs of the different nodes in the “RNN + RB-DQN” experiment are larger than the corresponding individual throughputs in the WiFi experiment. That is, compared with the WiFi node, the DLMA node not only manages to achieve higher throughput for itself, but also allows higher throughputs for the $q$-ALOHA node and the TDMA node.

As mentioned earlier in this paper, the carrier-sensing, collision avoidance, and backoff mechanisms of WiFi are designed for a homogeneous network in which all nodes are WiFi nodes. For our case here, for example, WiFi has no mechanism to detect the repetitive channel access pattern of TDMA and to avoid the time slots occupied by the TDMA node. CS-DLMA, on the other hand, is based on an RL mechanism that has the means to learn the channel access patterns of other nodes.

IV-D CS-DLMA under Different Heterogeneous Network Settings

We next investigate the performance of “RNN + RB-DQN” under different heterogeneous network settings. We first consider the co-existence of one DLMA node with one ALOHA node, wherein ALOHA node could adopt possibly different variants of ALOHA protocols. The ALOHA node has a packet length of 4 basic slots, i.e., .

We then consider a setup in which one DLMA node co-exists with one q-ALOHA node and one TDMA node. Unlike in Section IV-B, the q-ALOHA node and the TDMA node now have different packet lengths. As in Section IV-B, the transmission probability of q-ALOHA is fixed to 0.4, and the TDMA node transmits in 2 TDMA slots out of each TDMA frame of 5 TDMA slots.
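The access patterns of the two legacy nodes in this setup can be sketched as below. This is a deliberately simplified model: it treats every decision as a single slot, ignores the differing packet lengths and carrier sensing, and assumes the TDMA node owns the first two slots of each frame (the text does not say which two). All function names are hypothetical.

```python
import random


def tdma_transmits(slot, frame_len=5, assigned=(0, 1)):
    """True if the TDMA node owns this slot of the frame.

    The section fixes 2 slots out of a 5-slot frame; which two slots are
    used is an assumption here (the first two).
    """
    return slot % frame_len in assigned


def aloha_transmits(q=0.4, rng=random):
    """True with probability q, independently in each slot."""
    return rng.random() < q


def idle_fraction(num_slots, rng=random):
    """Fraction of slots in which neither legacy node transmits."""
    idle = sum(
        1
        for slot in range(num_slots)
        if not tdma_transmits(slot) and not aloha_transmits(rng=rng)
    )
    return idle / num_slots


if __name__ == "__main__":
    print(idle_fraction(100_000, random.Random(1)))
```

Under these simplifying assumptions, about 3/5 × 0.6 = 0.36 of the slots are left unused by the two legacy nodes, which gives a rough sense of the room a learning node has to exploit.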

(a) DLMA and q-ALOHA
Fig. 7: Short-term sum throughput and individual throughputs when DLMA co-exists with (a), (b), (c) different variants of ALOHA, and (d) ALOHA and TDMA. Each line (except the black line) is averaged over 4 different runs, with the shaded areas marking the range within the standard deviation.

Fig. 7 presents the short-term sum throughputs and individual throughputs for the above settings. In particular, the DLMA node co-exists with one q-ALOHA node in Fig. 7(a), with one FW-ALOHA node in Fig. 7(b), with one EB-ALOHA node in Fig. 7(c), and with one q-ALOHA node and one TDMA node in Fig. 7(d). As these figures show, near-optimal sum throughput is achieved in all cases.
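The curves and shaded bands of this kind can be reproduced from raw per-slot outcomes roughly as follows. This sketch assumes short-term throughput is a sliding-window average of per-slot success indicators and that the band uses the population standard deviation across runs; the window length, the choice of standard deviation, and the function names are assumptions rather than details from the paper.

```python
from statistics import mean, pstdev


def short_term_throughput(success, window):
    """Sliding-window average of success indicators (shorter at the start)."""
    out, running = [], 0
    for t, s in enumerate(success):
        running += s
        if t >= window:
            running -= success[t - window]  # drop the slot leaving the window
        out.append(running / min(t + 1, window))
    return out


def mean_and_band(runs):
    """Per-slot mean and standard deviation across equal-length runs."""
    columns = list(zip(*runs))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


if __name__ == "__main__":
    runs = [[1, 0, 1, 1], [0, 1, 1, 1]]  # hypothetical outcomes of two runs
    curves = [short_term_throughput(r, window=2) for r in runs]
    means, stds = mean_and_band(curves)
    print(means)
    print(stds)
```

The shaded area would then span from `means[t] - stds[t]` to `means[t] + stds[t]` at each slot `t`.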

V Conclusion

In this paper, we showed that deep reinforcement learning (DRL) techniques can be used to design efficient MAC protocols for heterogeneous networking. In particular, in a heterogeneous network consisting of nodes adopting different MAC protocols (e.g., ALOHA, TDMA), a node that makes use of a MAC protocol based on DRL can learn to maximize the sum throughput of all nodes in the heterogeneous environment. Furthermore, a salient feature of our DRL MAC is that it does not need to know the operating mechanisms of the co-existing MACs and the numbers of nodes using the other MACs. The DRL MAC learns to maximize the sum throughput by trial-and-error interactions with these other MACs.

We refer to our proposed DRL MAC as deep-reinforcement learning multiple access (DLMA). Compared with our past work on DLMA [10, 11], the current work introduces carrier sensing into DLMA to further improve its efficiency and flexibility. We refer to this new class of DLMA as carrier-sense DLMA (CS-DLMA). We demonstrated in this paper that CS-DLMA is more suitable for heterogeneous networking than the WiFi MAC, a popular legacy protocol that also has carrier-sensing capability. In particular, we showed that CS-DLMA can learn to co-exist with q-ALOHA and TDMA to achieve near-optimal sum throughput while WiFi cannot.

This paper also investigated several variants of CS-DLMA in which different neural networks and different reinforcement learning techniques are adopted. We found that, in general, recurrent neural networks (RNN) can allow CS-DLMA to achieve higher sum throughput and faster convergence than feedforward neural networks (FNN) can.

As far as reinforcement learning is concerned, this paper focused on the techniques of deep Q-network (DQN) [5]. We studied the conventional one-step DQN and n-step DQN [5, 6]. In addition, we also put forth a new technique referred to as reward-backpropagation DQN (RB-DQN). We showed that RB-DQN can achieve near-optimal sum throughput with faster convergence than one-step DQN and n-step DQN can. Furthermore, RB-DQN using RNN can achieve near-optimal sum throughput in different heterogeneous network settings (e.g., the co-existence of CS-DLMA with different variants of ALOHA, and the co-existence of CS-DLMA with q-ALOHA and TDMA with different packet lengths).


References

  • [1] G. Bianchi, “Performance analysis of the IEEE 802.11 distributed coordination function,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 3, pp. 535–547, 2000.
  • [2] U. Challita, L. Dong, and W. Saad, “Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,” IEEE Transactions on Wireless Communications, 2018.
  • [3] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017.
  • [4] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press Cambridge, 1998, vol. 1, no. 1.
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [6] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
  • [7] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks,” arXiv preprint arXiv, vol. 1704, 2017.
  • [8] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Transactions on Cognitive Communications and Networking, 2018.
  • [9] H.-H. Chang, H. Song, Y. Yi, J. Zhang, H. He, and L. Liu, “Distributive dynamic spectrum access through deep reinforcement learning: A reservoir computing based approach,” IEEE Internet of Things Journal, 2018.
  • [10] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” in 2018 IEEE International Conference on Communications (ICC).   IEEE, 2018, pp. 1–7.
  • [11] ——, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” (full version), arXiv preprint arXiv:1712.00162, 2017.
  • [12] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [13] J. Peng and R. J. Williams, “Incremental multi-step Q-learning,” in Machine Learning Proceedings 1994.   Elsevier, 1994, pp. 226–232.
  • [14] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine learning, vol. 8, no. 3-4, pp. 293–321, 1992.
  • [15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [16] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [17] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
  • [18]
  • [19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.