I. Introduction
This paper investigates a new generation of wireless multiple access control (MAC) protocol that leverages the latest advances in deep reinforcement learning. The work is partially inspired by our participation in the Spectrum Collaboration Challenge (SC2), a three-year competition hosted by DARPA of the United States [1].^1

^1 Two of the authors are currently participating in this competition.
Quoting DARPA, “SC2 is the first-of-its-kind collaborative machine-learning competition to overcome scarcity in the radio frequency (RF) spectrum. Today, spectrum is managed by dividing it into rigid, exclusively licensed bands. In SC2, competitors will reimagine a new, more efficient wireless paradigm in which radio networks autonomously collaborate to dynamically determine how the spectrum should be used moment to moment.” In other words, DARPA aims for a clean-slate design in which different wireless networks share spectrum in a very dynamic manner based on instantaneous supply and demand. In DARPA’s vision, “the winning design is the one that best shares spectrum with any network(s), in any environment, without prior knowledge, leveraging machine-learning techniques”. DARPA’s vision necessitates a total re-engineering of the PHY, MAC, and network layers of wireless networks.
As a first step, this paper investigates a new MAC design that exploits the deep Q-network (DQN) algorithm [2], a deep reinforcement learning (DRL) algorithm that combines deep neural networks [3] with Q-learning [4]. DQN was shown to be able to achieve superhuman-level playing performance in video games. Our MAC design aims to learn an optimal way to use the time-spectrum resources through a series of observations and actions, without the need to know the operating mechanisms of the MAC protocols of other coexisting networks. In particular, our MAC strives to achieve optimal performance as if it knew the MAC protocols of these networks in detail. In this paper, we refer to our MAC protocol as deep-reinforcement learning multiple access (abbreviated as DLMA), and a radio node operating DLMA as a DRL agent.
For focus, this paper considers time-slotted systems and the problem of sharing the time slots among multiple wireless networks. In general, DLMA can adopt different objectives in time-slot sharing. We first consider the objective of maximizing the sum throughput of all the networks. We then reformulate DLMA to achieve a general fairness objective. In particular, we show that DLMA can achieve near-optimal sum throughput and proportional fairness when coexisting with a TDMA network, an ALOHA network, and a mix of TDMA and ALOHA networks, without knowing that the coexisting networks are TDMA and ALOHA networks. Learning from the experience it gathers from a series of state-action-reward observations, a DRL agent tunes the weights of the neural network within its MAC machine to home in on an optimal MAC strategy.
This paper also addresses the issue of why DRL is preferable to traditional reinforcement learning (RL) [5] for wireless networking. Specifically, we demonstrate that the use of deep neural networks (DNN) in DRL affords us two properties essential to wireless MAC: (i) fast convergence to near-optimal solutions; (ii) robustness against non-optimal parameter settings (i.e., fine parameter tuning and optimization are unnecessary with our DRL framework). Compared with MAC based on traditional RL, DRL converges faster and is more robust. Fast convergence is critical to wireless networks because the wireless environment may change quickly as new nodes arrive and existing nodes move or depart. If the environmental “coherence time” is much shorter than the convergence time of the wireless MAC, the optimal strategy would elude the wireless MAC as it continually tries to catch up with the environment. Robustness against non-optimal parameter settings is essential because the optimal parameter settings for DRL (and RL) in the presence of different coexisting networks may be different. Without knowledge of the coexisting networks, DRL (and RL) cannot optimize its parameter settings a priori. If non-optimal parameter settings can also achieve roughly the same optimal throughput at roughly the same convergence rate, then optimal parameter settings are not essential for practical deployment.
In our earlier work [6], we adopted a plain DNN as the neural network in our overall DRL framework. In this work, we adopt a deep residual network (ResNet) [7]. The results of all sections in the current paper are based on ResNet, except Section III-E, where we study deep ResNet versus plain DNN. A key advantage of ResNet over plain DNN is that the same static ResNet architecture can be used in DRL for different wireless network scenarios, whereas for plain DNN, the optimal neural network depth varies from case to case.
Overall, our main contributions are as follows:

• We employ DRL for the design of DLMA, a MAC protocol for heterogeneous wireless networking. Our DLMA framework is formulated to achieve general fairness among the heterogeneous networks. Extensive simulation results show that DLMA can achieve near-optimal sum throughput and proportional fairness objectives. In particular, DLMA achieves these objectives without knowing the operating mechanisms of the MAC protocols of the other coexisting networks.

• We demonstrate the advantages of exploiting DRL in heterogeneous wireless networking compared with the traditional RL method. In particular, we show that DRL can accelerate convergence to an optimal solution and is more robust against non-optimal parameter settings, two essential properties for practical deployment of DLMA in real wireless networks.

• In the course of our generalization to the fairness objective in wireless networking, we discovered an approach to generalize the Q-learning framework so that more general objectives can be achieved. In particular, we argue that, for generality, we need to separate the Q function from the objective function that actions are chosen to optimize; in conventional Q-learning, the Q function itself is the objective function. We give a framework on how to relate the objective function and the Q function in the general setup.
I-A. Related Work
RL is a machine-learning paradigm in which agents learn successful strategies that yield the largest long-term reward from trial-and-error interactions with their environment [5]. The most representative RL algorithm is the Q-learning algorithm [4]. Q-learning can learn a good policy by updating an action-value function, referred to as the Q function, without an operating model of the environment. When the state-action space is large and complex, deep neural networks can be used to approximate the Q function, and the corresponding algorithm is called DRL [2]. This work employs DRL to speed up convergence and increase the robustness of DLMA (see our results in Section III-D).
RL was employed to develop channel access schemes for cognitive radios [8, 9, 10, 11] and wireless sensor networks [12, 13]. Unlike this paper, these works do not leverage the recent advances in DRL.
There has been little prior work exploring the use of DRL to solve MAC problems, given that DRL itself is a new research topic. The MAC scheme in [14] employs DRL in homogeneous wireless networks. Specifically, [14] considered a network in which radio nodes dynamically access orthogonal channels using the same DRL MAC protocol. By contrast, we are interested in heterogeneous networks in which the DRL nodes must learn to collaborate with nodes employing other MAC protocols.
The authors of [15] proposed a DRL-based channel access scheme for wireless sensor networks. Multiple frequency channels were considered. In RL terminology, the multiple frequency channels with the associated Markov interference models form the “environment” with which the DRL agent interacts. There are some notable differences between [15] and our investigation here. The Markov environmental model in [15] cannot capture the interactions among nodes due to their MAC protocols. In particular, the Markov environmental model in [15] is a “passive” model not affected by the “actions” of the DRL agent. For example, if there is one exponential backoff ALOHA node (see Section II-A for the definition) transmitting on a channel, the collisions caused by transmissions of the DRL agent will cause the channel state to evolve in intricate ways not captured by the model in [15].
In [16], the authors employed DRL for channel selection and channel access in LTE-U networks. Although [16] also aims for heterogeneous networking in which LTE-U base stations coexist with WiFi APs, its focus is on matching downlink channels to base stations; we focus on sharing an uplink channel among users. More importantly, the scheme in [16] is model-aware in that the LTE-U base stations know that the other networks are WiFi. For example, it uses an analytical equation (equation (1) in [16]) to predict the transmission probability of WiFi stations. By contrast, our DLMA protocol is model-free in that it does not presume knowledge of coexisting networks, and it is outcome-based in that it derives information by observing its interactions with the other stations in the heterogeneous environment.
II. DLMA Protocol
This section first introduces the time-slotted heterogeneous wireless networks considered in this paper. Then a short overview of RL is given. After that, we present our DLMA protocol, focusing on the objective of maximizing the sum throughput of the overall system. A generalized DLMA protocol that can achieve a fairness objective will be given in Section IV.
II-A. Time-Slotted Heterogeneous Wireless Networks
We consider time-slotted heterogeneous wireless networks in which different radio nodes transmit packets to an access point (AP) via a shared wireless channel, as illustrated in Fig. 1. We assume all the nodes can begin transmission only at the beginning of a time slot and must finish transmission within that time slot. Simultaneous transmissions of multiple nodes in the same time slot result in collisions. The nodes may not use the same MAC protocol: some may use TDMA and/or ALOHA, and at least one node uses our proposed DLMA protocol. Detailed descriptions of the different radio nodes are given below:

• TDMA: A TDMA node transmits in $X$ specific slots within each frame of $N$ slots in a repetitive manner from frame to frame.

• q-ALOHA: A q-ALOHA node transmits with a fixed probability $q$ in each time slot in an i.i.d. manner from slot to slot.

• Fixed-window ALOHA: A fixed-window ALOHA (FW-ALOHA) node generates a random counter value in the range $\{0, 1, \ldots, W-1\}$ after it transmits in a time slot. It then waits for that number of slots before its next transmission. The parameter $W$ is referred to as the window size.

• Exponential backoff ALOHA: Exponential backoff ALOHA (EB-ALOHA) is a variation of window-based ALOHA in which the window size is not fixed. Specifically, an EB-ALOHA node doubles its window size each time its transmission encounters a collision, until a maximum window size $2^m W$ is reached, where $m$ is the “maximum backoff stage”. Upon a successful transmission, the window size reverts back to the initial window size $W$.

• DRL agent/node: A DRL agent/node is a radio node that adopts our DLMA protocol. If a DRL node transmits, it will receive an immediate ACK from the AP, indicating whether the transmission is successful; if it does not transmit, it will listen to the channel and obtain an observation from the environment, indicating the other nodes’ transmission results or the idleness of the channel. Based on the observed results, the DRL node can pursue different objectives, such as maximizing the sum throughput of the overall system (as formulated in Part C of this section) or achieving a general fairness objective (as formulated in Section IV).
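As a minimal sketch, the transmission behaviors of the probability-based ALOHA node and the exponential backoff ALOHA node described above can be simulated as follows (class, function, and parameter names are ours, not from the paper):

```python
import random

def aloha_transmit(q):
    """Probability-based ALOHA: transmit with fixed probability q each slot, i.i.d."""
    return random.random() < q

class EBAlohaNode:
    """Exponential backoff ALOHA: the backoff window doubles on each collision,
    up to (2 ** max_stage) * init_window, and reverts to init_window on success."""
    def __init__(self, init_window=2, max_stage=4):
        self.init_window = init_window
        self.max_stage = max_stage
        self.stage = 0
        self._draw_counter()

    def _draw_counter(self):
        window = self.init_window * (2 ** self.stage)
        self.counter = random.randrange(window)   # slots to wait before transmitting

    def tick(self):
        """Call once per slot; returns True in the slot the node transmits."""
        if self.counter > 0:
            self.counter -= 1
            return False
        return True

    def on_result(self, collided):
        """Update the backoff window from the transmission outcome."""
        self.stage = min(self.stage + 1, self.max_stage) if collided else 0
        self._draw_counter()
```

Note that the backoff stage, rather than the window itself, is what is tracked: the effective window is recomputed as `init_window * 2 ** stage` each time a counter is drawn.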
II-B. Overview of RL
In RL [5], an agent interacts with an environment in a sequence of discrete times, $t = 0, 1, 2, \ldots$, to accomplish a task, as shown in Fig. 2. At time $t$, the agent observes the state of the environment $s_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of possible states. It then takes an action $a_t \in \mathcal{A}_{s_t}$, where $\mathcal{A}_{s_t}$ is the set of possible actions at state $s_t$. As a result of the state-action pair $(s_t, a_t)$, the agent receives a reward $r_{t+1}$, and the environment moves to a new state $s_{t+1}$ at time $t+1$. The goal of the agent is to effect a series of rewards through its actions to maximize some performance criterion. For example, the performance criterion to be maximized at time $t$ could be $\mathbb{E}\big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\big]$, where $\gamma \in (0, 1]$ is a discount factor for weighting future rewards. In general, the agent takes actions according to some decision policy $\pi$. RL methods specify how the agent changes its policy as a result of its experiences. With sufficient experiences, the agent can learn an optimal decision policy to maximize the long-term accumulated reward [5].
Q-learning [4] is one of the most popular RL methods. A Q-learning RL agent learns an action-value function $Q^{\pi}(s, a)$, corresponding to the expected accumulated reward when action $a$ is taken in environmental state $s$ under the decision policy $\pi$:

$Q^{\pi}(s, a) = \mathbb{E}\big[\textstyle\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\big|\, s_t = s, a_t = a, \pi\big]$   (1)
The optimal action-value function, $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$, obeys the Bellman optimality equation [5]:

$Q^{*}(s, a) = \mathbb{E}\big[r_{t+1} + \gamma \max_{a'} Q^{*}(s', a') \,\big|\, s_t = s, a_t = a\big]$   (2)
where $s'$ is the new state after the state-action pair $(s, a)$. The main idea behind Q-learning is that we can iteratively estimate $Q^{*}(s, a)$ at the occurrences of each state-action pair $(s, a)$. Let $Q(s, a)$ be the estimated action-value function during the iterative process. Upon a state-action pair $(s_t, a_t)$ and a resulting reward $r_{t+1}$, Q-learning updates $Q(s_t, a_t)$ as follows:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$   (3)

where $\alpha \in (0, 1]$ is the learning rate.
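The tabular update (3) can be sketched in a few lines; the dictionary-backed Q table and the function name are ours:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, update (3):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

Starting from an all-zero table, a reward of 1 moves the updated entry to `alpha * 1 = 0.1`, illustrating how estimates creep toward the Bellman target one visit at a time.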
While the system is updating $Q(s, a)$, it also makes decisions based on $Q(s, a)$. The $\epsilon$-greedy policy is often adopted, i.e.,

$a_t = \begin{cases} \arg\max_{a} Q(s_t, a), & \text{with probability } 1 - \epsilon \\ \text{a random action}, & \text{with probability } \epsilon \end{cases}$   (4)

A reason for randomly selecting an action with probability $\epsilon$ is to avoid getting stuck with a $Q(s, a)$ function that has not yet converged to $Q^{*}(s, a)$.
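Policy (4) can be sketched as follows (the dictionary lookup with a 0.0 default is our convention for unvisited state-action pairs):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Policy (4): exploit the current Q estimate with probability 1 - epsilon,
    explore with a uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit
```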
II-C. DLMA Protocol Using DRL
This subsection describes the construction of our DLMA protocol using the DRL framework.
The action taken by a DRL agent at time $t$ is $a_t \in$ {TRANSMIT, WAIT}, where TRANSMIT means that the agent transmits, and WAIT means that the agent does not transmit. We denote the channel observation after taking action $a_t$ by $z_t \in$ {SUCCESS, COLLISION, IDLENESS}, where SUCCESS means one and only one station transmits on the channel; COLLISION means multiple stations transmit, causing a collision; and IDLENESS means no station transmits. The DRL agent determines $z_t$ from an ACK signal from the AP (if it transmits) or by listening to the channel (if it waits). We define the channel state at time $t+1$ as the action-observation pair $c_{t+1} = (a_t, z_t)$. There are five possibilities for $c_{t+1}$: {TRANSMIT, SUCCESS}, {TRANSMIT, COLLISION}, {WAIT, SUCCESS}, {WAIT, COLLISION} and {WAIT, IDLENESS}. We define the environmental state at time $t+1$ to be $s_{t+1} = [c_{t-M+2}, \ldots, c_t, c_{t+1}]$, where the parameter $M$ is the state history length to be tracked by the agent. After taking action $a_t$, the transition from state $s_t$ to $s_{t+1}$ generates a reward $r_{t+1}$, where $r_{t+1} = 1$ if $z_t =$ SUCCESS, and $r_{t+1} = 0$ if $z_t =$ COLLISION or IDLENESS. The definition of reward here corresponds to the objective of maximizing the sum throughput. We define a reward vector in Section IV so as to generalize DLMA to achieve the fairness objective.

So far, the above definitions of “action”, “state” and “reward” also apply to an RL agent that adopts update rule (3) in Section II-B as its learning algorithm. We next motivate the use of DRL and then provide the details of its use.
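The environmental state, a sliding window over the most recent action-observation pairs, and the binary reward can be sketched as follows (class and function names are ours; the neutral seed pair is an assumption for bootstrapping):

```python
from collections import deque

def reward_of(observation):
    """Reward is 1 on SUCCESS, 0 on COLLISION or IDLENESS."""
    return 1 if observation == 'SUCCESS' else 0

class StateTracker:
    """Maintains the environmental state: the last M action-observation pairs."""
    def __init__(self, M=20):
        # Seed the history with a neutral pair so the state always has length M.
        self.history = deque([('WAIT', 'IDLENESS')] * M, maxlen=M)

    def step(self, action, observation):
        """Append the newest channel state; the oldest pair falls off the front."""
        self.history.append((action, observation))
        return tuple(self.history)
```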
Intuitively, subject to a non-changing or slow-changing environment, the longer the state history length $M$, the better the decision that can be made by the agent. However, a large $M$ induces a large state space for the RL algorithm. With a large number of state-action entries to be tracked, the step-by-step and entry-by-entry update is very inefficient. To get a rough idea, suppose that $M = 10$ (a rather small state history to keep track of); then there are $5^{10} \approx 9.8$ million possible values for state $s$. Suppose that for convergence to the optimal solution, each state-action value must be visited at least once. If each time slot is 1 ms in duration (typical wireless packet transmission time), the convergence of RL will take at least $2 \times 5^{10}$ ms, or more than 5 hours. Due to node mobility, arrivals, and departures, the wireless environment will most likely have changed well before then. Section III-D of this paper shows that applying DRL to DLMA accelerates the convergence speed significantly (convergence is obtained in seconds, not hours).
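The state-space arithmetic above can be checked in a few lines (assuming, as in the text, 5 possible channel states per history step, 2 actions, and 1 ms per slot):

```python
# Back-of-the-envelope check of the RL convergence estimate:
M = 10
n_states = 5 ** M                  # possible environmental states
n_entries = 2 * n_states           # one Q entry per (state, action) pair
hours = n_entries * 1e-3 / 3600    # visit every entry once at 1 ms per visit
```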
In DRL, a deep neural network [3] is used to approximate the action-value function, $Q(s, a) \approx Q(s, a; \theta)$, where $Q(s, a; \theta)$ is the approximation given by the neural network and $\theta$ is a parameter vector containing the weights of the edges in the neural network. The input to the neural network is a state $s$, and the outputs are the approximated values $Q(s, a; \theta)$ for the different actions $a$. We refer to the neural network as the Q neural network (QNN) and the corresponding RL algorithm as DRL. Rather than following the tabular update rule of traditional RL in (3), DRL updates $Q(s, a; \theta)$ by adjusting the $\theta$ in the QNN through a training process.
In particular, QNN is trained by minimizing prediction errors of $Q(s, a; \theta)$. Suppose that at time $t$, the state is $s_t$ and the weights of QNN are $\theta_t$. The DRL agent takes an action $a_t = \arg\max_{a} Q(s_t, a; \theta_t)$, where $Q(s_t, a; \theta_t)$ for the different actions $a$ are given by the outputs of QNN. Suppose that the resulting reward is $r_{t+1}$ and the state moves to $s_{t+1}$. Then, $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$ constitutes an “experience sample” that will be used to train the QNN. For training, we define the prediction error of QNN for the particular experience sample to be

$L(\theta) = \big(y_t - Q(s_t, a_t; \theta)\big)^2$   (5)

where $\theta$ are the weights in QNN, $Q(s_t, a_t; \theta)$ is the approximation given by QNN, and $y_t$ is the target output for QNN given by

$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$   (6)

Note that $y_t$ is a refined target output based on the current reward plus the predicted discounted rewards going forward given by QNN. We can train QNN, i.e., update $\theta$, by applying a semi-gradient algorithm [5] to (5). The iteration process of $\theta$ is given by

$\theta \leftarrow \theta + \rho \big[y_t - Q(s_t, a_t; \theta)\big] \nabla_{\theta} Q(s_t, a_t; \theta)$   (7)

where $\rho$ is the step size in each adjustment.
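A minimal sketch of one semi-gradient step in the spirit of (5)-(7), using a linear approximator $Q(s, a; \theta) = \theta \cdot \phi(s, a)$ so that the gradient is simply the feature vector; the paper uses a QNN instead, so this is illustrative only, and all names are ours:

```python
import numpy as np

def semi_gradient_step(theta, phi_sa, r, phi_next_all, gamma=0.9, rho=0.01):
    """One semi-gradient update for a linear Q approximator.
    phi_sa: features of (s_t, a_t); phi_next_all: features of (s_{t+1}, a') for all a'."""
    y = r + gamma * max(theta @ phi for phi in phi_next_all)   # target, cf. (6)
    td_error = y - theta @ phi_sa                              # prediction error, cf. (5)
    return theta + rho * td_error * phi_sa                     # update, cf. (7)
```

The "semi" in semi-gradient refers to the target $y$ being treated as a constant: it depends on $\theta$ but is not differentiated through.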
For algorithm stability, the “experience replay” and “quasi-static target network” techniques can be used [2]. For “experience replay”, instead of training QNN with a single experience at the end of each execution step, we pool together many experiences for batch training. In particular, an experience memory with a fixed storage capacity is used to store the experiences gathered from different time steps in a FIFO manner, i.e., once the experience memory is full, the oldest experience is removed from, and the new experience is put into, the experience memory. For a round of training, a mini-batch $E_t$ consisting of $N_b$ random experiences is taken from the experience memory for the computation of the loss function. For “quasi-static target network”, a separate target QNN with parameter vector $\theta^{-}$ is used as the target network for training purposes. Specifically, the target in (6) is computed based on this separate target QNN, while the prediction in (5) is based on the QNN under training. The target QNN is a copy of an earlier QNN: every $F$ time steps, the target QNN is replaced by the latest QNN, i.e., $\theta^{-}$ is set to the latest $\theta$ of QNN. With these two techniques, equations (5), (6), and (7) are replaced by the following:

$L(\theta) = \textstyle\sum_{e_\tau \in E_t} \big(y_\tau - Q(s_\tau, a_\tau; \theta)\big)^2$   (8)

$y_\tau = r_{\tau+1} + \gamma \max_{a'} Q(s_{\tau+1}, a'; \theta^{-})$   (9)

$\theta \leftarrow \theta + \rho \textstyle\sum_{e_\tau \in E_t} \big[y_\tau - Q(s_\tau, a_\tau; \theta)\big] \nabla_{\theta} Q(s_\tau, a_\tau; \theta)$   (10)

III. Sum Throughput Performance Evaluation
This section investigates the performance of DLMA with the objective of maximizing the sum throughput of all the coexisting networks. For our investigations, we consider the interactions of DRL nodes with TDMA nodes, ALOHA nodes, and a mix of TDMA and ALOHA nodes. Section IV will reformulate the DLMA framework to achieve a general fairness objective (which includes the maximizing-sum-throughput objective as a subcase); Section V will present the corresponding results.
As illustrated in Fig. 3, the architecture of the QNN used in DLMA is a six-hidden-layer ResNet with 64 neurons in each hidden layer. The activation functions used for the neurons are ReLU functions [3]. The first two hidden layers of QNN are fully connected, followed by two ResNet blocks. Each ResNet block contains two fully connected hidden layers plus one “shortcut” from the input to the output of the ResNet block. The state, action, and reward of DRL follow the definitions in Section II-C. The state history length $M$ is set to 20, unless stated otherwise. When updating the weights $\theta$ of QNN, a mini-batch of 32 experience samples is randomly selected from an experience-replay reservoir of 500 prior experiences for the computation of the loss function (8). The experience-replay reservoir is updated in a FIFO manner: a new experience replaces the oldest experience in it. The RMSProp algorithm [17] is used to conduct mini-batch gradient descent for the update of $\theta$. To avoid getting stuck with a suboptimal decision policy before sufficient learning experiences, we apply an exponential-decay $\epsilon$-greedy algorithm: $\epsilon$ is initially set to 0.1 and decays at a rate of 0.995 every time slot until its value reaches 0.005. A reason for not decreasing $\epsilon$ all the way to zero is that, in a general wireless setting, the wireless environment may change dynamically with time (e.g., nodes are leaving and joining the network). Having a positive $\epsilon$ at all times allows the decision policy to adapt to future changes. Table I summarizes the hyperparameter settings in our investigations.

Hyperparameters                           Value
State history length M                    20, unless stated otherwise
Discount factor γ                         0.9
ε in ε-greedy algorithm                   0.1 to 0.005
Learning rate used in RMSProp             0.01
Target network update frequency F         200
Experience-replay mini-batch size         32
Experience-replay memory capacity         500
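The experience-replay reservoir and quasi-static target network, with the capacities listed in Table I, can be sketched minimally as follows; the class name and the dictionaries standing in for the QNN weight vectors are ours:

```python
import random
from collections import deque

class ReplayTrainer:
    """Experience replay plus a quasi-static target network.
    Dicts stand in for the weight vectors theta and theta^-."""
    def __init__(self, capacity=500, batch_size=32, target_update=200):
        self.memory = deque(maxlen=capacity)   # FIFO experience reservoir
        self.batch_size = batch_size
        self.target_update = target_update
        self.steps = 0
        self.qnn = {}          # stand-in for theta (network under training)
        self.target_qnn = {}   # stand-in for theta^- (quasi-static copy)

    def store(self, experience):
        """New experiences evict the oldest once the reservoir is full."""
        self.memory.append(experience)

    def sample(self):
        """Random mini-batch used to compute the batch loss."""
        return random.sample(list(self.memory), min(self.batch_size, len(self.memory)))

    def maybe_sync_target(self):
        """Copy theta into theta^- every `target_update` steps."""
        self.steps += 1
        if self.steps % self.target_update == 0:
            self.target_qnn = dict(self.qnn)
```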
A salient feature of our DLMA framework is that it is model-free: it does not need to know the protocols adopted by the other coexisting nodes. For benchmarking, we consider model-aware nodes. Specifically, a model-aware node knows the MAC mechanisms of the coexisting nodes, and it executes an optimal MAC protocol derived from this knowledge. We will show that our model-free DRL node can achieve near-optimal throughput with respect to the optimal throughput of the model-aware node. The derivations of the optimal throughputs for the different cases below, which are interesting in their own right, are provided in [18]. We omit them here to save space.
III-A. Coexistence with TDMA Networks
We first present the results of the coexistence of one DRL node with one TDMA node. The TDMA node transmits in $X$ specific slots within each frame of $N$ slots in a repetitive manner from frame to frame. For benchmarking, we consider a TDMA-aware node that has full knowledge of the slots used by the TDMA node. To maximize the overall system throughput, the TDMA-aware node will transmit in all the slots not used by the TDMA node. The optimal sum throughput is one packet per time slot. The DRL agent, unlike the TDMA-aware node, does not know that the other node is a TDMA node (as a matter of fact, it does not even know how many other nodes there are) and just uses the DRL algorithm to learn the optimal strategy.
Fig. 4(a) presents the throughput^3 results when $N = 10$ and $X$ varies from 2 to 8. The green line is the sum throughput of the DRL node and the TDMA node. We see that it is very close to 1. This demonstrates that the DRL node can capture all the unused slots without knowing the TDMA protocol adopted by the other node.

^3 Unless stated otherwise, “throughput” in this paper is the “short-term throughput”, calculated as $\sum_{\tau=t-W+1}^{t} r_\tau / W$, where $W = 1000$. If one time step is 1 ms in duration, then this is the throughput over the past second. In the bar charts presented in this paper, “throughput” is the average reward over the last portion of an experiment with a length of 50000 steps, and we take the average of 10 experiments for each case to get the final value.
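The short-term throughput defined in footnote 3 is a sliding-window average; a minimal sketch, with a class name of our choosing:

```python
from collections import deque

class ShortTermThroughput:
    """Short-term throughput: sum of rewards over the last W slots, divided by W."""
    def __init__(self, W=1000):
        self.window = deque(maxlen=W)   # oldest reward falls off after W slots
        self.W = W

    def record(self, reward):
        self.window.append(reward)
        return sum(self.window) / self.W   # packets per slot over the past W slots
```

Dividing by the fixed $W$ (rather than the current window length) matches the footnote's formula; the average ramps up from zero during the first $W$ slots.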
III-B. Coexistence with ALOHA Networks
We next present the results of the coexistence of one DRL node with one q-ALOHA node, one FW-ALOHA node, and one EB-ALOHA node, respectively. We emphasize that the exact same DLMA algorithm as in Part A is used here, even though the other protocols are no longer TDMA. For benchmarking, we consider model-aware nodes that operate with optimal MACs tailored to the operating mechanisms of the three ALOHA variants [18].
Fig. 4(b) presents the experimental results for the coexistence of one DRL node and one q-ALOHA node. The results show that the DRL node can learn a strategy that achieves the optimal throughputs despite the fact that it is not aware that the other node is a q-ALOHA node or what its transmission probability $q$ is. Fig. 4(c) presents the results for the coexistence of one DRL node and one FW-ALOHA node with different fixed window sizes $W$. Fig. 4(d) presents the results for the coexistence of one DRL node and one EB-ALOHA node with different initial window sizes $W$ and maximum backoff stage $m$. As shown, the DRL node can again achieve near-optimal throughputs in these two cases.
III-C. Coexistence with a Mix of TDMA and ALOHA Networks
We now present the results of a setup in which one DRL agent coexists with one TDMA node and one q-ALOHA node simultaneously. Again, the same DLMA algorithm is used. We consider two cases. In the first case, the TDMA node transmits in 3 slots out of the 10 slots in a frame, and the transmission probability $q$ of the q-ALOHA node varies. In the second case, $q$ of the q-ALOHA node is fixed to 0.2, and $X$, the number of slots used by the TDMA node in a frame, varies. Fig. 4(e) and Fig. 4(f) present the results of the first and second cases, respectively. For both cases, we see that our DRL node can approximate the optimal results without knowing the transmission schemes of the TDMA and q-ALOHA nodes.
We next consider a setup in which multiple DRL nodes coexist with a mix of TDMA and ALOHA nodes. Specifically, the setup consists of three DRL nodes, one TDMA node that transmits in 2 slots out of the 10 slots in a frame, and two q-ALOHA nodes with a fixed transmission probability $q$. In Fig. 5, we can see that DLMA can also achieve near-optimal sum throughput in this more complex setup. However, when we focus on the individual throughputs of the nodes, we find that, since there is no coordination among the three DRL nodes, one DRL node may preempt all the slots not occupied by the TDMA node, causing the other two DRL nodes and the two q-ALOHA nodes to get zero throughput. This observation motivates us to consider fairness among different nodes in Section IV.
III-D. RL versus DRL
We now present results demonstrating the advantages of the “deep” approach, using the scenario in which one DRL/RL agent coexists with one TDMA node. Fig. 6 compares the convergence time of the Q-learning based RL approach and the QNN-based DRL approach. The sum throughput in the figure is the “cumulative sum throughput” starting from the beginning: $\sum_{\tau=1}^{t} r_\tau / t$. It can be seen that DRL converges to the optimal throughput of 1 at a much faster rate than RL does. For example, DRL requires less than 5000 steps (5 s if each step corresponds to a packet transmission time of 1 ms) to approach within 80% of the optimal throughput. Note that when the state history length $M$ increases from 10 to 16, RL learns progressively slower, but the convergence time of DRL varies only slightly as $M$ increases. In general, for a model-free MAC protocol, we do not know what other MAC protocols there are besides our own. Therefore, we will not optimize on $M$ and will likely use a large $M$ to cater for a large range of other possible MAC protocols. The robustness, in terms of the insensitivity of the convergence time to $M$, is a significant practical advantage of DRL.
Fig. 7 presents the throughput evolutions of TDMA+RL and TDMA+DRL versus time. Unlike in Fig. 6, the sum throughput in Fig. 7 is the “short-term sum throughput” rather than the “cumulative sum throughput” starting from the beginning. Specifically, the sum throughput in Fig. 7 is $\sum_{\tau=t-W+1}^{t} r_\tau / W$, where $W = 1000$. If one time step is 1 ms in duration, then this is the throughput over the past second. As can be seen, although both RL and DRL converge to the optimal throughput in the end, DRL takes a much shorter time to do so. Furthermore, the fluctuations in throughput experienced by RL along the way are much larger. To dig deeper into this phenomenon, we examine $n(s_t)$, defined to be the number of previous visits to state $s_t$ prior to time step $t$: i.e., we look at the number of previous visits to $s_t$, the particular state being visited at time step $t$, which Fig. 7 also plots. As can be seen, for RL, each drop in the throughput coincides with a visit to a state with $n(s_t) = 0$. In other words, the RL algorithm has not learned the optimal action for this state yet because of the lack of prior visits. From Fig. 7, we also see that it takes a while before RL extricates itself from persistent and consecutive visits to a number of states with $n(s_t) = 0$. This persistency results in large throughput drops until RL extricates itself from the situation. By contrast, although DRL also occasionally visits a state with $n(s_t) = 0$, it is able to take an appropriate action in the unfamiliar territory, thanks to the “extrapolation” ability of the neural network to infer a good action to take at $s_t$ based on prior visits to states other than $s_t$: recall that each update of $\theta$ changes the values of $Q(s, a; \theta)$ for all $(s, a)$, not just that of a particular $(s, a)$. DRL manages to extricate itself from unfamiliar territories quickly and evolve back to optimal territories where it transmits only in the time slots not used by TDMA.
Fig. 8 presents the evolutions of the number of distinct states visited by the RL and DRL agents in the same experiment as in Fig. 7. In this case, both RL and DRL find an optimal strategy in the end, but RL requires more time to do so. Once the optimal strategies are found, the RL and DRL agents seldom explore new states, except in the “exploration” step of the $\epsilon$-greedy algorithm. As indicated in Fig. 8, the number of distinct states visited by RL on its journey to the optimal strategy is much larger than that of DRL. From Fig. 8, we see that RL spends 35000 time steps finding the optimal strategy, having visited 23000 distinct states before doing so. By contrast, it takes only 10000 time steps for DRL to find the optimal strategy, and the number of distinct states visited is only around 1000. In other words, DRL can better narrow down its choice of states to visit in order to find the optimal strategy, hence the faster convergence speed.
III-E. Plain DNN versus Deep ResNet
We now demonstrate the advantages of deep ResNet over plain DNN using two cases: 1) one DRL node coexisting with one TDMA node, wherein the TDMA node occupies 2 slots out of the 10 slots in a frame; 2) one DRL node coexisting with one TDMA node and one q-ALOHA node, wherein the TDMA node is the same as in 1) and the q-ALOHA node transmits with a fixed probability $q$. The optimal sum throughputs of a model-aware protocol for 1) and 2) can be established analytically to be 1 and 0.9, respectively (see [18] for the derivations). For each case, we compare the cumulative sum throughputs of the plain DNN based approach and the deep ResNet based approach with different numbers of hidden layers $H$.
As can be seen from the upper parts of Fig. 9(a) and Fig. 9(b), the plain DNN is not robust against variation of $H$, i.e., the performance varies with $H$. Furthermore, the value of $H$ that achieves the best performance differs between cases 1) and 2). This implies that it is difficult to use a common plain DNN architecture for different wireless setups. In other words, the optimal $H$ may be different under different scenarios. If the environment changes dynamically, there is no single $H$ that is optimal for all scenarios. In contrast to plain DNN’s non-robustness to $H$, deep ResNet can always achieve near-optimal performance for different $H$ in both cases, as illustrated in the lower parts of Fig. 9(a) and Fig. 9(b).
For wireless networking, the environment may change quickly as new nodes arrive and existing nodes move or depart. It is desirable to adopt a one-size-fits-all neural network architecture in DRL. Our results show that deep ResNet is more desirable than plain DNN in this regard.
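The ResNet block used in the QNN (two fully connected hidden layers plus a shortcut from the block input to its output, per Section III) can be illustrated with a toy numpy forward pass. Dimensions, weight initialization, and the exact placement of the shortcut addition relative to the activation are our assumptions, not details from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def resnet_block(x, W1, b1, W2, b2):
    """One ResNet block: two fully connected layers, with the block input
    added back in (the shortcut) before the final activation."""
    h = relu(x @ W1 + b1)
    return relu(h @ W2 + b2 + x)   # shortcut: x skips both layers

# Toy forward pass; the paper's QNN uses 64 neurons per hidden layer.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((4, 4))
b1 = np.zeros(4)
W2 = rng.standard_normal((4, 4))
b2 = np.zeros(4)
y = resnet_block(x, W1, b1, W2, b2)
```

The shortcut is what makes depth benign: with the layer weights near zero, the block approximately passes its input through unchanged, so stacking extra blocks does not easily degrade an already-good shallower network.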
IV. General-Objective DLMA Protocol
This section first introduces the well-known $\alpha$-fairness utility function [19]. Then, a multi-dimensional Q-learning algorithm is proposed to incorporate the $\alpha$-fairness utility function in a general reformulation of DLMA.
IV-A. α-Fairness Objective
Instead of sum throughput, we now adopt the $\alpha$-fairness index as the metric of the overall system performance. The parameter $\alpha \geq 0$ is used to specify a range of fairness criteria: e.g., when $\alpha = 0$, maximizing the $\alpha$-fairness objective corresponds to maximizing the sum throughput (the corresponding results were presented in Section III); when $\alpha = 1$, maximizing the $\alpha$-fairness objective corresponds to achieving proportional fairness; when $\alpha \to \infty$, the minimum throughput among the nodes is being maximized. Specifically, we consider a system with $K$ nodes; for a particular node $i$, its throughput is denoted by $x_i$, and its $\alpha$-fairness local utility function is given by
f_α(x_i) = { log(x_i),             if α = 1
           { x_i^{1−α} / (1 − α),  otherwise          (12)
The objective of the overall system is to maximize the sum of all the local utility functions:
maximize   Σ_{i=1}^{N} f_α(x_i)
subject to (x_1, …, x_N) being an achievable throughput vector.   (13)
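The α-fairness utility and the system objective above can be sketched in Python as follows; the function names are our own, not from the paper.

```python
import math

def alpha_fairness(x, alpha):
    """alpha-fairness local utility f_alpha(x) of (12) for a throughput x > 0.

    alpha = 0 -> f(x) = x          (sum-throughput objective)
    alpha = 1 -> f(x) = log(x)     (proportional fairness)
    alpha -> infinity              (approaches max-min fairness)
    """
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)

def system_objective(throughputs, alpha):
    """Sum of local utilities over all N nodes: the quantity maximized in (13)."""
    return sum(alpha_fairness(x, alpha) for x in throughputs)
```

For example, with α = 0 the objective degenerates to the plain sum of the throughputs, recovering the sum-throughput metric of Section III.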
IV-B DLMA Reformulation
We now reformulate our system model as a semi-distributed system consisting of several wireless networks with different MAC protocols. Nodes in different networks cannot communicate with each other. For the nodes within the DLMA network, a DLMA central gateway coordinates their transmissions. Similarly, for the nodes within the TDMA network, there is implicitly a TDMA central gateway that decides the time slots in which TDMA nodes transmit.
Among the nodes in the wireless networks, let L be the number of DRL nodes in the DLMA network and N be the number of non-DRL nodes. In the DLMA protocol described in Section II-C, all DRL nodes individually adopt the single-agent DRL algorithm and independently perform training and execution. Unlike the DLMA protocol in Section II-C, we now consider a DRL algorithm with "centralized training at the gateway node and independent execution at DRL nodes". The gateway in the DLMA network associates with all other DRL nodes in the DLMA network and coordinates the coexistence of the DLMA network with other networks (e.g., the TDMA and ALOHA networks). In each time slot, the gateway decides whether a node in the DLMA network should transmit. If YES, the gateway selects one of the L DRL nodes in a round-robin manner to transmit; after transmitting, the selected DRL node receives feedback from the system and relays this information to the gateway. If NO, all DRL nodes keep silent. In this manner, the gateway can be regarded as a virtual "big agent" that is a combination of the L DRL nodes. The coordination information from the gateway to the DRL nodes can be sent through a control channel; for example, the control channel can be implemented as a short time slot after each time slot of information transmission. Other implementations are also possible, but we omit the discussion here since the focus of this paper is not implementation details. The above reformulates the system to contain N + 1 nodes: one DRL big-agent node (indexed by i = 1) and N other legacy nodes (indexed by i = 2, …, N + 1).
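The gateway's per-slot coordination described above can be sketched as follows. Here `decide_transmit` is a hypothetical stand-in for the big agent's learned policy, and the control-channel signaling is abstracted away.

```python
class DLMAGateway:
    """Per-slot coordination sketch: the gateway acts as one 'big agent' and,
    when it decides to transmit, hands the slot to its DRL nodes in
    round-robin order.  (Names and structure are illustrative, not the
    paper's implementation.)"""

    def __init__(self, num_drl_nodes, decide_transmit):
        self.num_drl_nodes = num_drl_nodes
        self.decide_transmit = decide_transmit  # state -> bool: big agent's action
        self.next_node = 0

    def slot(self, state):
        """Return the index of the DRL node scheduled this slot, or None if
        the big agent keeps all DRL nodes silent."""
        if not self.decide_transmit(state):
            return None
        node = self.next_node
        self.next_node = (self.next_node + 1) % self.num_drl_nodes
        return node
```

The round-robin rotation makes the L DRL nodes interchangeable from the environment's perspective, which is what justifies treating them as a single big agent.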
We now modify the original Q-learning algorithm. The original Q-learning algorithm is designed for the single-agent case with the objective of maximizing the accumulated reward of the agent; it cannot be directly applied to the multi-node/multi-agent case to meet arbitrary fairness objectives. We therefore put forth a multi-dimensional Q-learning algorithm to cater for the α-fairness objective.
In the original Q-learning algorithm, each agent receives a scalar reward from the environment. The scalar reward, representing the overall system transmission result (success, idleness, or collision), is regarded as the overall reward to the system, and each agent uses it to compute the sum-throughput objective. By contrast, in our new multi-dimensional Q-learning algorithm, the big agent receives an (N + 1)-dimensional vector of rewards from the environment, each element of which represents the transmission result of one particular node. The reward vector is used to compute the α-fairness objective. Specifically, let r_{t+1}^i be the reward of node i; the received reward vector is then (r_{t+1}^1, …, r_{t+1}^{N+1}). For a state-action pair (s, a), instead of maintaining an action-value scalar Q(s, a), the big agent maintains an action-value vector (q^1(s, a), …, q^{N+1}(s, a)), where the element q^i(s, a) is the expected accumulated discounted reward of node i.
Let q^i(s, a) be the estimate of the i-th elementary action value in the action-value vector. Suppose that at time t, the state is s_t. For decision making, we still adopt the ε-greedy algorithm. When selecting the greedy action, the objective in (13) can be applied to meet an arbitrary fairness objective, i.e.,
a_t = argmax_a [ Σ_{i=2}^{N+1} f_α(q^i(s_t, a)) + L·f_α(q^1(s_t, a)/L) ]   (14)
After taking action a_t, the big agent employs the multi-dimensional Q-learning algorithm to update the elementary action-value estimates q^i(s_t, a_t), i = 1, …, N + 1, in parallel, as
q^i(s_t, a_t) ← q^i(s_t, a_t) + β[ r_{t+1}^i + γ·q^i(s_{t+1}, a_{t+1}) − q^i(s_t, a_t) ]   (15)

where β is the learning rate, γ is the discount factor, and the action a_{t+1} at state s_{t+1} is again selected according to the fairness objective:

a_{t+1} = argmax_a [ Σ_{i=2}^{N+1} f_α(q^i(s_{t+1}, a)) + L·f_α(q^1(s_{t+1}, a)/L) ]   (16)
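A tabular sketch of the multi-dimensional Q-learning algorithm in (14)-(16) is given below. The learning rate, discount factor, and node counts are illustrative assumptions, a small epsilon keeps the utilities finite for near-zero estimates, and the ε-greedy exploration step is omitted for brevity.

```python
import math
from collections import defaultdict

GAMMA, BETA = 0.9, 0.1   # discount factor and learning rate (assumed values)
L = 3                    # number of DRL nodes merged into the big agent
N = 2                    # number of legacy nodes; Q vectors have N + 1 entries

def f(x, alpha, eps=1e-6):
    """alpha-fairness utility of (12); eps keeps log/powers finite near zero."""
    x = max(x, eps)
    return math.log(x) if alpha == 1 else x ** (1 - alpha) / (1 - alpha)

# q[(s, a)][0] is the big agent's estimate q^1; entries 1..N are legacy nodes'.
q = defaultdict(lambda: [0.0] * (N + 1))

def greedy_action(s, actions, alpha):
    """Action selection of (14)/(16): legacy utilities plus L * f(q^1 / L)."""
    def objective(a):
        v = q[(s, a)]
        return sum(f(v[i], alpha) for i in range(1, N + 1)) + L * f(v[0] / L, alpha)
    return max(actions, key=objective)

def update(s, a, rewards, s_next, actions, alpha):
    """Parallel update (15): one TD step per elementary action value q^i."""
    a_next = greedy_action(s_next, actions, alpha)  # chosen by (16), not per-element max
    for i in range(N + 1):
        td_target = rewards[i] + GAMMA * q[(s_next, a_next)][i]
        q[(s, a)][i] += BETA * (td_target - q[(s, a)][i])
```

Note that the same action a_{t+1} is used for all N + 1 elementary updates: the per-node Q values remain plain discounted-reward projections, while the fairness objective enters only through the action choice.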
Here, it is important to point out the subtleties in (14)-(16) and how they differ from the conventional Q-learning update equation in (3). In conventional Q-learning, an action that maximizes the Q function is chosen (as explained in Section II-B); in other words, the Q function is the objective function to be optimized. However, the Q function, as embodied in (3) and (15), is a projected (estimated) weighted sum of the current and future rewards. To be more specific, consider the term r_{t+1} + γ·max_{a'} Q(s_{t+1}, a') in (3). It can be taken as a new estimate of Q(s_t, a_t), a weighted sum of the current reward and future rewards with discount factor γ. In (3), for the purpose of estimation smoothing, we apply a weight of β (the learning rate) to this new estimate and a weight of (1 − β) to the previous value of Q(s_t, a_t) to come up with a new value for Q(s_t, a_t). Nevertheless, Q(s_t, a_t) still embodies a weighted sum of the current and future rewards. Since in conventional Q-learning an action that gives the maximum Q(s_t, a) is taken at each step t, the objective can be viewed as maximizing a weighted sum of rewards with discount factor γ. However, not all objectives can be conveniently expressed as a weighted sum of rewards; an example is the α-fairness objective of focus here. A contribution of this paper is the realization that, for generality, we need to separate the objective upon which the optimizing action is chosen from the Q function itself.
Objectives can often be expressed as a function of several components, wherein each component can be expressed as a Q function (e.g., as in (14)). In this more general setup, the update equation of the Q function retains the same form (i.e., (15) has the same form as (3)). However, the action a_{t+1} chosen a time step later in (15) is not the action that gives the maximum q^i(s_{t+1}, a), but the action selected according to (16). Thus, the Q function is still a projected weighted sum of rewards, but the policy that gives rise to the rewards is based on maximizing a more general objective rather than the weighted sum of rewards itself.
Returning to our wireless setting, the first term in (14) is the sum of the local utility functions of all legacy nodes. Since the big agent (indexed by i = 1) is actually a combination of the L DRL nodes, and q^1(s, a)/L is the estimated accumulated reward of each DRL node, the second term in (14) is the sum of the local utility functions of all the DRL nodes. We make two remarks: i) the q^i in (14) is an estimate of the expected accumulated discounted reward of node i (as expressed in (15)), rather than the exact throughput x_i in (12); ii) we use q^i to help the agent make decisions because the exact throughput is not known. Our evaluation results in Section V show that this method can achieve the fairness objective.
We continue the reformulation of DLMA by incorporating deep neural networks. Incorporating deep neural networks into the multi-dimensional Q-learning algorithm calls for two additional modifications. The first is to use a QNN to approximate the action-value vector as (q^1(s, a; θ), …, q^{N+1}(s, a; θ)), where θ denotes the weights of the QNN. The second is to augment the experience tuple to e_t = (s_t, a_t, r_{t+1}^1, …, r_{t+1}^{N+1}, s_{t+1}). With these two modifications, the counterparts of the loss function (8), the target (9), and the action selection (10) are now given by
L(θ) = Σ_{i=1}^{N+1} ( y_t^i − q^i(s_t, a_t; θ) )^2   (17)

y_t^i = r_{t+1}^i + γ·q^i(s_{t+1}, a_{t+1}; θ⁻)   (18)

where

a_{t+1} = argmax_a [ Σ_{i=2}^{N+1} f_α(q^i(s_{t+1}, a; θ⁻)) + L·f_α(q^1(s_{t+1}, a; θ⁻)/L) ]   (19)

a_t = argmax_a [ Σ_{i=2}^{N+1} f_α(q^i(s_t, a; θ)) + L·f_α(q^1(s_t, a; θ)/L) ]   (20)
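The per-node targets and loss above can be sketched with numpy as follows. The two QNNs are abstracted as callables returning an action-to-vector table, which is an illustrative simplification of the network forward pass; node counts are assumed values.

```python
import numpy as np

GAMMA = 0.9
L, N = 3, 2   # DRL nodes in the big agent; legacy nodes (assumed counts)

def f(x, alpha, eps=1e-6):
    """alpha-fairness utility; eps keeps log/powers finite near zero."""
    x = np.maximum(x, eps)
    return np.log(x) if alpha == 1 else x ** (1 - alpha) / (1 - alpha)

def target_and_loss(q_net, q_target_net, experience, actions, alpha):
    """Targets and squared-error loss for one augmented experience tuple
    (s, a, rewards r^1..r^{N+1}, s').  q_net / q_target_net map a state to
    an {action: vector of N+1 values} table (stand-ins for the two QNNs)."""
    s, a, rewards, s_next = experience

    # Greedy next action under the fairness objective, using target weights.
    def objective(act):
        v = q_target_net(s_next)[act]
        return sum(f(v[i], alpha) for i in range(1, N + 1)) + L * f(v[0] / L, alpha)
    a_next = max(actions, key=objective)

    # Per-node TD targets, then the summed squared error against estimates.
    y = rewards + GAMMA * np.asarray(q_target_net(s_next)[a_next])
    loss = float(np.sum((y - np.asarray(q_net(s)[a])) ** 2))
    return y, loss
```

In a full implementation the loss would be averaged over a minibatch of experiences and minimized by gradient descent on θ, with the target-network weights θ⁻ refreshed periodically as in standard DQN.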
The pseudocode of the reformulated DLMA protocol is summarized in Algorithm 2.
V Proportional Fairness Performance Evaluation
This section investigates the performance when DRL nodes aim to achieve proportional fairness among nodes (α = 1), as a representative example of the general-fairness DLMA formulation. We investigate the interactions of DRL nodes with TDMA nodes, ALOHA nodes, and a mix of TDMA and ALOHA nodes, respectively. The optimal results for benchmarking purposes can again be derived by imagining a model-aware node for the different cases (the derivations are provided in [18] and omitted here).
V-A Coexistence with TDMA Networks
We first present the results of the coexistence of one DRL node with one TDMA node. In this trivial case, achieving proportional fairness is the same as maximizing sum throughput: the optimal strategy of the DRL node is to transmit in the slots not occupied by the TDMA node and to keep silent in the occupied slots. Fig. 10(a) presents the results when the number of slots assigned to the TDMA node is 2, 3, 7, and 8 out of the 10 slots within a frame. We can see that the reformulated DLMA protocol achieves proportional fairness in this case.
V-B Coexistence with ALOHA Networks
We next present the results of the coexistence of one DRL node with one q-ALOHA node, one FW-ALOHA node, and one EB-ALOHA node, respectively. Fig. 10(b) presents the results with different transmission probabilities q for the coexistence of one DRL node with one q-ALOHA node. Fig. 10(c) presents the results with different fixed window sizes for the coexistence of one DRL node with one FW-ALOHA node. Fig. 10(d) presents the results with different initial window sizes and maximum backoff stages for the coexistence of one DRL node with one EB-ALOHA node. As these results show, the reformulated DLMA protocol can again achieve proportional fairness without knowing the transmission schemes of the different ALOHA variants.
V-C Coexistence with a Mix of TDMA and ALOHA Networks
We now present the results of a setup where one DRL node coexists with one TDMA node and one ALOHA node simultaneously. We again consider the two cases investigated in Section III-C, but the objective now is to achieve proportional fairness among all the nodes. Fig. 10(e) and Fig. 10(f) present the results of the two cases. We can see that with the reformulated DLMA protocol, the individual throughputs achieved approximate the optimal individual throughputs obtained by imagining a model-aware node.
We now present the results when three DRL nodes coexist with one TDMA node and two ALOHA nodes. The case investigated here is the same as that presented in Fig. 5, but the three DRL nodes are now formulated as one big agent and the objective is now to achieve proportional fairness among all the nodes. The optimal results for the big agent, the TDMA node, and each ALOHA node are derived in [18]. As shown in Fig. 11, the optimal results can again be approximated using the reformulated DLMA protocol.
VI Conclusion
This paper proposed and investigated a MAC protocol based on DRL for heterogeneous wireless networking, referred to as DLMA. A salient feature of DLMA is that it can learn to achieve an overall objective (e.g., an α-fairness objective) from a series of state-action-reward observations while operating in the heterogeneous environment. In particular, it can achieve near-optimal performance with respect to the objective without knowing the detailed operating mechanisms of the other coexisting MACs.
This paper also demonstrated the advantages of using neural networks in reinforcement learning for wireless networking. Specifically, compared with traditional RL, DRL can acquire a near-optimal strategy with faster convergence and higher robustness, two essential properties for the practical deployment of the MAC protocol in dynamically changing wireless environments.
Last but not least, in the course of this work, we discovered an approach to generalize the Q-learning framework so that more general objectives can be achieved. In particular, we argued that, for generality, the Q function must be separated from the objective function upon which actions are chosen. A framework relating the objective function and the Q function in this general setup was presented in this paper.
References
 [1] DARPA SC2 Website: https://spectrumcollaborationchallenge.com/.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [3] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
 [4] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
 [5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [6] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” in Communications (ICC), 2018 IEEE International Conference on. IEEE, 2018, pp. 1–7.

 [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [8] H. Li, “Multi-agent Q-learning for Aloha-like spectrum access in cognitive radio systems,” EURASIP Journal on Wireless Communications and Networking, vol. 2010, no. 1, p. 876216, 2010.
 [9] K.-L. A. Yau, P. Komisarczuk, and P. D. Teal, “Enhancing network performance in distributed cognitive radio networks using single-agent and multi-agent reinforcement learning,” in 2010 IEEE 35th Conference on Local Computer Networks (LCN). IEEE, 2010, pp. 152–159.
 [10] C. Wu, K. Chowdhury, M. Di Felice, and W. Meleis, “Spectrum management of cognitive radio using multiagent reinforcement learning,” in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Industry track. International Foundation for Autonomous Agents and Multiagent Systems, 2010, pp. 1705–1712.
 [11] M. Bkassiny, S. K. Jayaweera, and K. A. Avery, “Distributed reinforcement learning based MAC protocols for autonomous cognitive secondary users,” in 2011 20th Annual Wireless and Optical Communications Conference (WOCC). IEEE, 2011, pp. 1–6.
 [12] Z. Liu and I. Elhanany, “RL-MAC: A QoS-aware reinforcement learning based MAC protocol for wireless sensor networks,” in Networking, Sensing and Control, 2006. ICNSC’06. Proceedings of the 2006 IEEE International Conference on. IEEE, 2006, pp. 768–773.
 [13] Y. Chu, P. D. Mitchell, and D. Grace, “ALOHA and Q-learning based medium access control for wireless sensor networks,” in 2012 International Symposium on Wireless Communication Systems (ISWCS). IEEE, 2012, pp. 511–515.
 [14] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks,” arXiv preprint arXiv:1704.02613, 2017.
 [15] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Transactions on Cognitive Communications and Networking, 2018.
 [16] U. Challita, L. Dong, and W. Saad, “Deep learning for proactive resource allocation in LTE-U networks,” in Proceedings of the 23rd European Wireless Conference. VDE, 2017, pp. 1–6.
 [17] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
 [18] Y. Yu, T. Wang, and S. C. Liew, “Model-aware nodes in heterogeneous networks: A supplementary document to paper ‘Deep-reinforcement learning multiple access for heterogeneous wireless networks’,” Technical report, available at: https://github.com/YidingYu/DLMA/blob/master/DLMAbenchmark.pdf.
 [19] J. Mo and J. Walrand, “Fair end-to-end window-based congestion control,” IEEE/ACM Transactions on Networking (ToN), vol. 8, no. 5, pp. 556–567, 2000.