This paper investigates a futuristic spectrum sharing paradigm where multiple separate wireless networks adopt different medium access control (MAC) protocols to transmit packets on a common wireless spectrum. (This futuristic spectrum sharing paradigm, wherein heterogeneous networks collaborate to use the shared spectrum without detailed knowledge of each other's MAC, was first envisioned by the DARPA Spectrum Collaboration Challenge (SC2) competition [tilghman2019will] to spur the development of next-generation wireless networks. Our current paper investigates a particular scenario in which the heterogeneous networks share the spectrum in the time domain.) In sharing the spectrum, each network must respect spectrum usage by other networks and must not hog the spectrum to the detriment of other networks. Unlike in conventional cognitive radio networks [liang2008sensing], all network users in our scenario are on an equal footing in that they are not divided into primary users and secondary users. This paper puts forth an intelligent MAC protocol for a particular network—among the networks sharing the spectrum—to achieve efficient and equitable spectrum sharing among all networks. A major challenge for the intelligent MAC protocol is that the particular network concerned does not know the MAC protocols used by the other networks. Our MAC protocol must learn the medium-access behavior of the other networks on the fly so that harmonious coexistence and fair sharing of the spectrum can be achieved.
The fundamental technique in our MAC protocol design is deep reinforcement learning (DRL). DRL is a machine learning technique that combines the decision-making ability of reinforcement learning (RL)[sutton2018reinforcement]
and the function approximation ability of deep neural networks[lecun2015deep] to solve complex decision-making problems, including game playing, robot control, wireless communications, and network management and control [DQNpaper, silver2016mastering, gu2017deep, luong2019applications, sun2019application, shao2019significant]. In RL/DRL, in each time step, the decision-making agent interacts with its external environment by executing an action. The agent then receives feedback in the form of a reward that tells the agent how good the action was. The agent strives to optimize the rewards it receives over its lifetime [sutton2018reinforcement].
A key advantage of our DRL-based MAC protocol is that it can learn to coexist with other MAC protocols without knowing their operational details. In this paper, we refer to our DRL MAC protocol as deep reinforcement learning multiple access (DLMA), and the network adopting DLMA as the DLMA network. Within the DLMA network, each user is regarded as a DRL agent that employs the DLMA protocol to make its MAC decisions, i.e., to transmit or not to transmit data packets.
To enable efficient and equitable spectrum sharing among all the users, the agents adopt "$\alpha$-fairness" [mo2000fair] as their objective. The $\alpha$-fairness objective is global and general in that (i) each agent optimizes not only its own spectrum usage, but also the spectrum usage of the other agents and of the non-agent users of the coexisting networks; (ii) the agents can achieve different specific objectives by adjusting the value of the parameter $\alpha$, e.g., $\alpha = 0$ corresponds to maximizing the sum throughput of all users and $\alpha = 1$ corresponds to achieving proportional fairness among all users.
The conventional RL/DRL framework has inherent limitations. First, the conventional RL/DRL framework aims only to maximize the cumulative discounted rewards of the agent, i.e., the objective to be optimized is a weighted sum of the agent's rewards. However, the $\alpha$-fairness objective function, in general, is a nonlinear combination of the utility functions of all users [mo2000fair]. Therefore, in our design, we adopt the multi-dimensional DRL framework of [yu2019deep] to solve this problem.
Second, in the conventional RL/DRL framework, the feedback to the agent about the reward is assumed to be always correctly received. However, in the wireless environment, the feedback may be lost due to noise and interference in the wireless channel. Without the correct reward values, the agent may fail to find the optimal strategy. This paper puts forth a feedback recovery mechanism, incorporated into DLMA, to reduce the detrimental effects of imperfect channels. The key idea is that the reward of the current time step is fed back not just in the current time step, but also in subsequent time steps. Thus, a reward missing earlier may be recovered in a later time step. The essence is that late learning is better than no learning.
Third, the conventional RL/DRL framework only works for single-agent problems. For multi-agent problems where each agent makes decisions on its own without collaboration with other agents, it is difficult to guarantee that the multiple agents will work together toward the same objective [zhang2019multi]. To avoid conflicting decisions among the agents, we put forth a two-stage action selection mechanism in DLMA. In this mechanism, each agent first decides on the "network action" for the DLMA network, i.e., whether a DLMA agent should transmit a packet or not. The network action can be regarded as a collective policy of all the agents. If the decision is to transmit, the agent then decides whether it is the one to transmit. With this mechanism, the agents strive to coexist harmoniously with the non-agent users of the other networks in the first decision stage and to reduce collisions among the agents in the second decision stage.
Extensive simulation results show that our feedback recovery mechanism can effectively reduce the detrimental effects of the imperfect feedback channels in the heterogeneous wireless networks. For benchmarking, we replace the agents (i.e., the DLMA users) with model-aware users that are aware of the operational details of the other MACs of the coexisting networks and we assume the feedback channels of the model-aware users are perfect. We demonstrate that the results achieved by DLMA can approximate the optimal benchmarks even though DLMA is deployed in an imperfect channel setting. We also demonstrate the capability of our two-stage action selection mechanism in reducing collisions among the multiple agents. Specifically, we demonstrate that when the channels of the agents are “perfect” or “imperfect but dependent”, collisions can be avoided among the agents; when the channels are “imperfect and independent”, collisions among the agents cannot be eliminated but can be significantly reduced.
Overall, the main contributions of this paper are summarized as follows:
We develop a distributed DRL based MAC protocol for efficient and equitable spectrum sharing in heterogeneous wireless networks with imperfect channels.
We demonstrate that our proposed feedback recovery mechanism can effectively solve the problem of imperfect feedback channels. We also demonstrate that collisions among the multiple distributed agents can be significantly reduced with our two-stage action selection mechanism.
We believe that the feedback recovery mechanism and the two-stage action selection mechanism in our MAC protocol design can also find use in a wide range of applications. Specifically, the idea of feedback/reward recovery can be applied to the general corrupted-reward problem [ijcai2017-656, wang2020rlnoisy] in reinforcement learning, and the two-stage action selection process can be used in distributed multi-agent problems where the agents need to avoid conflicts with each other [zhang2019multi].
I-A Related Work
In our previous work [yu2019deep], we developed a DRL based MAC protocol for heterogeneous wireless networks with perfect channels. In [yu2019deep], the access point (AP) is regarded as a centralized DRL agent. The agent is responsible for making MAC decisions and it broadcasts the control information containing the decisions to the users, telling them who should transmit in what slots. However, for a practical scenario with noisy wireless channels, the control information may be lost. Without immediate and correct control information in a particular time slot, the users will not know whether to transmit or not. Thus, a method is needed so that users can make appropriate decisions even without immediate feedback.
The current paper provides a method to do so. Specifically, this paper removes the perfect channel assumption in [yu2019deep] and puts forth a distributed DRL MAC protocol for heterogeneous wireless networks with imperfect channels. In this paper, rather than having the AP serve as the single centralized agent for the whole network, each user is a DRL agent, and each agent makes its own MAC decisions. We put forth a two-stage action selection mechanism to reduce collisions among the agents' transmissions. When the channels are imperfect, conventional DRL techniques with the implicit assumption of perfect feedback are no longer suitable. Therefore, this paper proposes a feedback recovery mechanism to reduce the detrimental effects of imperfect feedback channels. The detailed differences, especially the difference in DRL formulations, between [yu2019deep] and our current paper will be discussed in the main body of our paper.
Apart from [yu2019deep], there has also been other work on DRL-based MAC. As in [yu2019deep], the investigations in [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep, tan2019deep] also assumed the physical channels are perfect and did not have a scheme to deal with imperfect channels. Further differences between [naparstek2018deep, wang2018deep, chang2018distributive, zhong2018actor, xu2018deep, tan2019deep] and our current work are elaborated in the following.
The MAC in [naparstek2018deep] is designed for homogeneous wireless networks, where all the users adopt the same MAC protocol to dynamically access multiple wireless channels. By contrast, our MAC protocol targets heterogeneous networks in which our MAC must learn to coexist with other MAC protocols. Both [naparstek2018deep] and our current paper consider achieving a global $\alpha$-fairness objective. However, the "fairness" in [naparstek2018deep] is achieved among the homogeneous DRL users. By contrast, our DRL users aim to achieve "fairness" among the heterogeneous DRL users and non-DRL users.
The authors in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] also investigated the multi-channel access problem as in [naparstek2018deep]. The difference is that in [naparstek2018deep], the authors assumed the channels are time-invariant, whereas the channels in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] may be time-varying in that some "primary" or "legacy" users may occupy the channels from time to time. Therefore, the wireless networks in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] can also be regarded as "heterogeneous". However, the DRL users in [wang2018deep, chang2018distributive, zhong2018actor, xu2018deep] aim to maximize their own throughputs by learning the channel characteristics and the transmission patterns of the "primary" or "legacy" users. By contrast, the DRL users in our work (as in [naparstek2018deep]) aim to achieve a global $\alpha$-fairness objective, which includes achieving maximum sum throughput, proportional fairness, and max-min fairness as subcases [mo2000fair].
The authors in [tan2019deep] investigated heterogeneous wireless networks. Specifically, in [tan2019deep], an LTE network exercises a coarse-grained DRL based MAC control to coexist with a WiFi network. The LTE MAC in [tan2019deep] decides a period for LTE transmissions and a period for WiFi transmissions. During the corresponding periods, LTE and WiFi transmit packets without interfering with each other. By contrast, the MAC of our design exercises fine-grained control in that our MAC makes decisions (i.e., to transmit or not to transmit) on a packet-by-packet basis. Furthermore, in [tan2019deep], the LTE network is model-aware in that the LTE network knows the coexisting network is WiFi. Therefore, the approach in [tan2019deep] is not generalizable to situations where the LTE network coexists with other networks. By contrast, our DRL based MAC protocol is model-free in that it does not presume knowledge of the operational details of the MAC protocols of coexisting networks.
II Overview of Reinforcement Learning
The underpinning technique in our DLMA algorithm is deep reinforcement learning, especially the Deep Q-Network (DQN) algorithm [DQNpaper]. This section first presents Q-learning, a representative reinforcement learning (RL) algorithm. After that, the DQN algorithm is introduced.
In the RL framework, an agent interacts with an external environment in discrete time steps [sutton2018reinforcement], as illustrated in Fig. 1. Particularly, in time step $t$, given the environment state $s_t$, the agent takes an action $a_t$ according to a policy $\pi$, which maps the states of the environment to the actions of the agent. After receiving action $a_t$, the environment feeds back a reward $r_t$ to the agent to evaluate the agent's performance in time step $t$. In addition, the environment state transits from $s_t$ to $s_{t+1}$ in time step $t+1$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the cumulative discounted rewards $\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau}$, where $\gamma \in (0, 1]$ is a discount factor.
For a particular policy $\pi$ and a particular state-action pair $(s, a)$, Q-learning captures the expected cumulative discounted rewards with an action-value function, i.e., the Q function: $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau} \,\middle|\, s_t = s, a_t = a, \pi\right]$. The Q function of the optimal policy $\pi^*$ is given by $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. To find the optimal policy $\pi^*$, the Q-learning agent maintains the Q function, $Q(s, a)$, for any state-action pair $(s, a)$, in a tabular form. In time step $t$, given state $s_t$, the agent selects an action $a_t$ based on its current Q table. After receiving the reward $r_t$ and observing the new state $s_{t+1}$, the agent constructs an experience tuple $(s_t, a_t, r_t, s_{t+1})$. Then the experience tuple is used to update $Q(s_t, a_t)$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \quad (1)$$

where $\beta$ is the learning rate of the algorithm.
In Q-learning, for action selection, the $\varepsilon$-greedy algorithm is often adopted. Specifically, the greedy action $a_t = \arg\max_{a} Q(s_t, a)$ is chosen with probability $1 - \varepsilon$, and a random action is chosen uniformly among all actions with probability $\varepsilon$. The random action selection prevents the algorithm from settling into a locally optimal policy and allows the agent to explore a wider spectrum of actions in search of the optimal policy.
One thing to be pointed out here is that Q-learning is a model-free algorithm in that it does not need to know the state transition probabilities. The Q-learning agent learns to find the optimal policy through the experiences obtained during its interaction with the environment.
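The tabular Q-learning update and $\varepsilon$-greedy selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the toy one-state environment and the hyperparameter values are our own choices.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Choose the greedy action with probability 1 - epsilon,
    and a uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
    """Standard Q-learning update of Q(s, a) from the experience tuple
    (s, a, r, s_next), with learning rate beta and discount factor gamma."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += beta * (target - Q[(s, a)])

# Toy usage: a one-state environment where action 1 always yields reward 1.
Q = defaultdict(float)
for _ in range(100):
    a = epsilon_greedy(Q, 's0', [0, 1], epsilon=0.1)
    r = 1.0 if a == 1 else 0.0
    q_update(Q, 's0', a, r, 's0', [0, 1])
```

Over the loop, the Q value of action 1 grows larger than that of action 0, so the greedy choice converges to the rewarding action.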
II-B Deep Q-Network
In a stationary environment, Q-learning is shown to converge if the learning rate decays appropriately and each state-action pair is executed an infinite number of times [watkins1992q]. For real-world problems, the state space and/or the action space can be huge and it will take an excessive amount of time for Q-learning to converge.
To allow fast convergence, the deep Q-network (DQN) algorithm uses a deep neural network model to approximate the Q values in Q-learning [DQNpaper]. For ease of exposition, we refer to this deep neural network model as DNN throughout this paper. The input to the DNN is the agent's current state $s_t$, and the outputs are the approximated Q values for the different actions, $Q(s_t, a; \theta)$, $a \in \mathcal{A}$, where $\theta$ is the neural network parameter vector and $\mathcal{A}$ is the agent's action set. In time step $t$, for action selection, $\arg\max_{a} Q(s_t, a)$ in the $\varepsilon$-greedy algorithm is replaced by $\arg\max_{a} Q(s_t, a; \theta)$.
For the training of the DNN, the parameter vector $\theta$ is updated by minimizing the following loss function using a gradient descent method [lecun2015deep]:

$$L(\theta) = \mathbb{E}\left[ \left( r_t + \gamma \max_{a'} Q\left(s_{t+1}, a'; \theta^{-}\right) - Q(s_t, a_t; \theta) \right)^2 \right]. \quad (2)$$
To stabilize the algorithm, the “experience replay” [lin1992self] and “target neural network” techniques are embedded into (2). The details of these two techniques are given below.
Experience Replay: A first-in-first-out experience buffer is used to store a fixed number of experiences gathered from different time steps. Instead of training the DNN with a single experience, multiple experiences are pooled together for batch training. Specifically, for each round of training, a minibatch of random experiences $(s_\tau, a_\tau, r_\tau, s_{\tau+1})$ is sampled from the experience buffer in the computation of (2), wherein the time index $\tau$ denotes the time step at which that experience tuple was collected.
Target Neural Network: A separate "target neural network" is used in the computation of $\max_{a'} Q(s_{\tau+1}, a'; \theta^{-})$ in (2). In particular, the target neural network's parameter vector is $\theta^{-}$ rather than $\theta$ in the DNN being trained. This separate target neural network, named target DNN, is a copy of the DNN. The parameter $\theta^{-}$ of the target DNN is updated to the latest $\theta$ of the DNN once every fixed number of time steps, while $\theta$ is updated in every time step by minimizing the loss function (2).
III System Model
III-A Heterogeneous Wireless Networks
As illustrated in Fig. 2, we consider heterogeneous wireless networks, where multiple separate wireless networks share a common wireless spectrum to transmit data packets in a time-slotted manner. These wireless networks are heterogeneous in that they may use different MAC protocols. Within each wireless network, multiple users adopt the same MAC protocol to transmit data packets to an access point (AP). The APs belonging to different networks are connected to a collaboration network [tilghman2019will]. The collaboration network is a control network that is separate from the wireless networks, and allows different wireless networks to communicate collaborative information at a high level, e.g., the transmission results of different users in different wireless networks.
The focus of this paper is to design a MAC protocol for a particular wireless network. The underpinning technique of our MAC protocol is deep reinforcement learning. We refer to our MAC protocol as Deep-reinforcement Learning Multiple Access (DLMA) protocol. This particular wireless network is referred to as the DLMA network, and the users within the DLMA network are referred to as DLMA users.
The objective of the DLMA network is to achieve efficient and equitable spectrum allocation among the different wireless networks. To this end, we adopt "$\alpha$-fairness" [mo2000fair] as the general objective of the DLMA network. The DLMA network can adjust the value of the parameter $\alpha$ to achieve different specific objectives. For example, $\alpha = 0$ corresponds to maximizing the sum throughput of all the wireless networks; $\alpha = 1$ corresponds to achieving proportional fairness; $\alpha \to \infty$ corresponds to achieving max-min fairness. The detailed formulation of the $\alpha$-fairness objective is given in Section III-B.
III-B $\alpha$-Fairness Objective
We assume there are $N$ DLMA users in the DLMA network and $M$ non-DLMA users in the other networks. We index the DLMA users by $n \in \{1, \dots, N\}$ and index the non-DLMA users by $m \in \{N+1, \dots, N+M\}$. We define the "throughput" of a user as its average successful packet rate. Particularly, we use $x_i$ to denote the throughput of user $i$, and if user $i$ successfully transmitted $d_i$ packets within $t$ time slots, then $x_i$ is calculated as $x_i = d_i / t$.
According to the $\alpha$-fairness objective [mo2000fair], the utility function of user $i$, $u_i(x_i)$, is given by

$$u_i(x_i) = \begin{cases} \log x_i, & \alpha = 1, \\ \dfrac{x_i^{1-\alpha}}{1-\alpha}, & \alpha \geq 0, \ \alpha \neq 1. \end{cases} \quad (3)$$

The overall objective of the DLMA network is to maximize the sum of the utility functions of all the users:

$$\max \ \sum_{i=1}^{N+M} u_i(x_i). \quad (4)$$
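The $\alpha$-fairness utility and the network objective above can be sketched numerically as follows. The function names are ours, not the paper's; the snippet only evaluates the objective for a given throughput vector.

```python
import math

def alpha_utility(x, alpha):
    """alpha-fairness utility u(x) from [mo2000fair]:
    u(x) = log(x) if alpha == 1, else x**(1 - alpha) / (1 - alpha)."""
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)

def network_objective(throughputs, alpha):
    """Sum of utilities over all users: the quantity the DLMA network maximizes."""
    return sum(alpha_utility(x, alpha) for x in throughputs)
```

For $\alpha = 0$ the utility reduces to $u(x) = x$, so the objective is simply the sum throughput; for $\alpha = 1$ maximizing the sum of logarithms yields proportional fairness.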
III-C AP-User Pair in the DLMA Network
We now describe the operations between one particular DLMA user and the AP in the DLMA network. Within each time slot, there are two phases between the DLMA user and the DLMA AP: the uplink phase for transmitting a data packet and the downlink phase for transmitting an ACK packet, as illustrated in Fig. 3. We assume both the feedforward channel in the uplink phase and the feedback channel in the downlink phase are packet erasure channels [bertsekas1992data] with erasure probabilities $p_u$ and $p_d$, respectively. Specifically, a data packet from the DLMA user to the AP is lost with probability $p_u$, and an ACK packet from the AP to the DLMA user is lost with probability $p_d$. The details of these two phases are given below.
III-C1 Uplink Phase
In the uplink phase, at the beginning of each time slot, the MAC module in the DLMA user decides whether or not to transmit a data packet to the DLMA AP (the MAC decision making will be detailed in Section IV). Specifically, $a_t$ is the decision variable for time slot $t$, with $a_t = 1$ if the DLMA user transmits, and $a_t = 0$ if the DLMA user does not transmit. At the AP side, the indicator variable $z_t$ records the outcome, with $z_t = 1$ if a data packet from the DLMA user is received, and $z_t = 0$ otherwise. Note that three possibilities can lead to $z_t = 0$: (i) the DLMA user transmitted a data packet, but the data packet was corrupted by channel noise; (ii) the DLMA user transmitted a data packet, but other users also transmitted in the same time slot, leading to a collision; (iii) the DLMA user did not transmit.
III-C2 Downlink Phase
In the downlink phase, the DLMA AP broadcasts an ACK packet (or feedback packet) to all the DLMA users. Even if no DLMA user transmits, the DLMA AP still broadcasts an ACK to indicate that it did not receive anything from the DLMA users in the time slot that has just transpired. As shown in Fig. 3, the ACK packet includes two parts. The first part summarizes the transmission results of all the users. Specifically, at the end of time slot $t$, the first part of the ACK packet contains $\boldsymbol{z}_\tau = \left(z_\tau^1, \dots, z_\tau^{N+M}\right)$, $\tau = t-L+1, \dots, t$, where $L$ is the history length of the transmission results maintained by the AP. The reason for keeping $L$ transmission results in the ACK is to compensate for the possibility of lost ACKs, hence the loss of feedback, in earlier time slots (see Section IV-B2 for details). The DLMA AP can construct $z_\tau^i$ for the DLMA users based on their transmission results and can obtain $z_\tau^i$ for the non-DLMA users from the collaboration network. At the DLMA user side, we use $\hat{\boldsymbol{z}}_\tau$ to represent the transmission results known to the DLMA user. Specifically, if the ACK packet is successfully received, then $\hat{\boldsymbol{z}}_\tau = \boldsymbol{z}_\tau$ for all $\tau$; if the ACK packet is lost, $\hat{\boldsymbol{z}}_\tau = \text{null}$ for all $\tau$.
The second part of the ACK packet is the throughput vector $\left(x_t^1, \dots, x_t^N\right)$ of all the DLMA users (here, $x_t^n$, $n = 1, \dots, N$, is the throughput of DLMA user $n$ in time slot $t$). At the DLMA user side, we use $\hat{x}_t^n$ to represent the throughput results known to the DLMA users. Specifically, if the ACK packet is successfully received, then $\hat{x}_t^n = x_t^n$ for all $n$; if the ACK packet is lost, then $\hat{x}_t^n = \text{null}$ for all $n$. We will describe how our algorithm uses the throughput vector in Section IV-B1.
IV DLMA Protocol Design with DRL
This section describes the design of our DLMA protocol. We first provide the definitions for agent, action, observation, reward, and state. Then we present our proposed DLMA algorithm.
IV-A Definitions of Agent, Action, Observation, Reward, and State
Agent: We consider each DLMA user to be an agent in the parlance of machine learning. Each of the DLMA users makes its own transmission decision. We assume the DLMA users can only communicate with the DLMA AP and they do not have direct communications with each other.
In our previous work [yu2019deep], where the channels between the DLMA users and the DLMA AP were assumed to be perfect, the DLMA AP was regarded as a centralized agent responsible for making decisions for all DLMA users. In [yu2019deep], the DLMA AP broadcasts the control information containing the decisions to the DLMA users, telling them who should transmit in what slots. However, when the channels between the DLMA users and the DLMA AP are imperfect, the control information may be lost. Without immediate and correct control information in a particular time slot, all users will refrain from transmitting even if the AP intends one user to transmit, resulting in significant performance degradation.
Unlike [yu2019deep], this paper adopts a distributed DLMA algorithm to solve the imperfect channel problem. In the distributed algorithm, all the DLMA users (i.e., the agents) make decisions on their own. Therefore, our problem is a multi-agent problem. Even if the feedback ACK is lost in a particular time slot, a DLMA user may still decide to transmit. To avoid collisions between the DLMA users, we propose a two-stage action selection mechanism in Section IV-B1—more exactly, we can only reduce collisions but not avoid them altogether when the feedback channels are imperfect, as will be detailed later. We further propose a feedback recovery mechanism in Section IV-B2 to reduce the detrimental effects of the imperfect feedback channels.
Action: The possible actions of each agent are "to transmit" and "not to transmit". For each agent, the decision/action variable values $a_t = 1$ and $a_t = 0$ represent "to transmit" and "not to transmit" in time slot $t$, respectively. In the following, we focus on one particular agent and, for ease of exposition, omit the index of this agent.
Observation: After the execution of an action $a_t$ in time slot $t$, the agent has an observation $o_t$. If $a_t = 0$, then there are two possible observations: $o_t = \text{B}$ or $o_t = \text{I}$, indicating that the channel was used by other users (Busy) or the channel was idle (Idle) in time slot $t$, respectively. If $a_t = 1$ and the ACK packet was successfully received, then there are two possible observations: $o_t = \text{S}$ or $o_t = \text{F}$, indicating that the data packet was successfully received (Success) or not successfully received (Failure) by the AP, respectively. If $a_t = 1$ and the ACK packet was not successfully received, then $o_t = \text{null}$. Table I summarizes the five possible observations.
| Observation | Condition |
| B | $a_t = 0$; the channel is being used by other user(s) |
| I | $a_t = 0$; the channel is not used by any user |
| S | $a_t = 1$; data packet is successful; ACK packet is successful |
| F | $a_t = 1$; data packet is unsuccessful; ACK packet is successful |
| null | $a_t = 1$; ACK packet is unsuccessful |
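The mapping from one slot's outcome to the five observations in Table I can be sketched as follows. The function and argument names are ours, used only to make the case analysis explicit.

```python
def observation(transmitted, ack_received, channel_busy=None, data_success=None):
    """Map one slot's outcome to the five observations of Table I.
    transmitted:  whether the agent transmitted (a_t = 1)
    ack_received: whether the ACK packet arrived
    channel_busy: for non-transmitting slots, whether others used the channel
    data_success: for transmitting slots, whether the AP received the packet."""
    if not transmitted:
        return 'B' if channel_busy else 'I'
    if not ack_received:
        return 'null'   # ACK lost: the agent cannot tell S from F
    return 'S' if data_success else 'F'
```

Note that the "null" case arises only when the agent transmitted: without the ACK, the agent has no way to distinguish a success from a failure.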
Reward: We define a reward for each user to indicate whether its data packet was successfully received. To achieve the global $\alpha$-fairness objective, we assume each agent knows the rewards of all the users. Specifically, at the end of time slot $t$, each agent can find out the rewards of all users from $\hat{\boldsymbol{z}}_t$. Let the reward vector maintained by the agent be $\boldsymbol{r}_t = \left(r_t^1, \dots, r_t^{N+M}\right)$. The individual reward $r_t^i$ of user $i$ ($i = 1, \dots, N+M$) in $\boldsymbol{r}_t$ is decided as follows. If $\hat{\boldsymbol{z}}_t = \text{null}$, i.e., the ACK packet was not received, then $r_t^i = \text{null}$. If the ACK packet was successfully received and $\hat{z}_t^i = 1$, then $r_t^i = 1$. If the ACK packet was successfully received and $\hat{z}_t^i = 0$, then $r_t^i = 0$. Overall, $r_t^i$ can be decided as follows:

$$r_t^i = \begin{cases} 1, & \hat{z}_t^i = 1, \\ 0, & \hat{z}_t^i = 0, \\ \text{null}, & \hat{\boldsymbol{z}}_t = \text{null}. \end{cases} \quad (5)$$
We can see that the loss of the ACK packet results in $r_t^i$ being "null", with no indication of whether the data packet was successfully received or not. We say that $\boldsymbol{r}_t$ is erroneous if the ACK is lost. Erroneous rewards will not "advise"/"reinforce" the agent correctly and may lead to a wrong solution. We describe how we tackle this problem with the feedback recovery mechanism in Section IV-B2.
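The reward rule above can be sketched as follows, using Python's `None` to stand in for "null". The function name and argument layout are ours.

```python
def reward_vector(ack_received, success_flags, n_users):
    """Per-user rewards decided from the ACK: null (None) for every user if
    the ACK is lost; otherwise 1 for a successful packet and 0 otherwise.
    success_flags[i] is True iff user i's packet was received by its AP."""
    if not ack_received:
        return [None] * n_users          # 'null' rewards: incomplete experience
    return [1 if success_flags[i] else 0 for i in range(n_users)]
```

A reward vector containing `None` marks the corresponding experience as incomplete, which is exactly the case the feedback recovery mechanism of Section IV-B2 handles.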
State: We define the "channel state" of the agent in time slot $t$ as $c_t = \left(a_t, o_t, \boldsymbol{r}_t\right)$. The channel state captures the action of the agent and the results of the action in the past time slot. We further define the "state" of the agent in time slot $t$ as a concatenation of the past $H$ channel states, i.e., $s_t = \left(c_{t-H}, \dots, c_{t-1}\right)$, where $H$ is the state history length.
Two things need to be pointed out here. The first is that the state of the agent may be “noisy” since the rewards of the agent can be erroneous due to imperfect feedback. The second is that due to local actions and local observations, the states of different agents may be different in each time slot. The discrepancies in the states kept by different agents pose a challenge for these agents to come up with an overall coherent action plan. For example, the discrepancies may result in two agents deciding to transmit at the same time in the next time slot, resulting in a collision. This is the reason why we put forth a two-stage action selection mechanism to reduce collisions among the agents in Section IV-B1.
With the above definitions, we have formulated our multiple access problem as a multi-agent reinforcement learning problem. The modified reinforcement learning framework is illustrated in Fig. 4. In the following, we present our DLMA algorithm for solving this multi-agent problem.
IV-B DLMA Algorithm
The DLMA algorithm consists of two parts: (i) two-stage action selection; (ii) neural network training. An agent employs the two-stage action selection process to decide its action. Each agent keeps a deep neural network (DNN) model and trains it to optimize the collective policy of the agents from the agent’s own perspective. The details of these two parts are given below.
IV-B1 Two-Stage Action Selection
As illustrated in Fig. 5, in Stage 1 of the two-stage action selection process, an agent makes use of its DNN to come up with a decision for the overall DLMA network as to whether one of the agents should transmit. We refer to the decision for the overall DLMA network as the "network action". If the network action in Stage 1 is to transmit, then in Stage 2, the agent decides whether it is the one among all the agents that will perform the transmission. If the network action in Stage 1 is not to transmit, then in Stage 2, the agent simply does not transmit. We refer to the decision of the agent in Stage 2 as the "agent action". We remark that the purpose of Stage 1 is to enable the DLMA network to achieve the $\alpha$-fairness objective when coexisting with other networks, and the purpose of Stage 2 is to break ties among the agents within the DLMA network. (Note that, as an alternative to the two-stage action selection process, each agent could directly generate its own action based only on the outputs of its DNN, i.e., the DNN outputs the agent action directly rather than the network action. The implication is that the agent has to learn to coexist with the users in other networks and with the other agents in the DLMA network simultaneously. Due to the lack of coordination among the agents, it is then more likely for more than one agent to decide to transmit at the same time, and more frequent collisions among the DLMA agents may happen as a result. This intuition has been borne out by our investigations: when we studied the coexistence of two agents under the direct action-generation approach, we found that collisions between the two agents occur from time to time even though they have the same objective.) The details of the two stages are given as follows:
Stage 1: At the beginning of time slot $t$, the agent inputs its current state $s_t$ into the DNN, and the DNN outputs multiple Q values $Q^{\text{net}}(s_t, a; \theta)$ and $Q^{m}(s_t, a; \theta)$, $m = N+1, \dots, N+M$, where $a \in \{0, 1\}$ is the network action that needs to be decided, $Q^{\text{net}}(s_t, a; \theta)$ is the estimated cumulative discounted rewards of the DLMA network, and $Q^{m}(s_t, a; \theta)$ is the estimated cumulative discounted rewards of the non-DLMA user $m$. Based on the Q values, the agent can decide the network action according to (6).
In (6), $Q^{\text{net}}(s_t, a; \theta)/N$ is the estimated cumulative discounted rewards of each agent, and $N \cdot u\!\left(Q^{\text{net}}(s_t, a; \theta)/N\right)$ can be regarded as the sum utility of all the agents (i.e., the utility of the DLMA network). Meanwhile, $\sum_{m=N+1}^{N+M} u\!\left(Q^{m}(s_t, a; \theta)\right)$ can be regarded as the sum utility of all the non-agent users.
Stage 2: In time slot $t$, the agent action of agent $n$ ($n = 1, \dots, N$) is determined by the network action and the throughput vector of all the agents. Specifically, if the network action is to transmit and the throughput vector satisfies (7), then the agent action of agent $n$ is to transmit; otherwise, the agent action is not to transmit.
The interpretation of (7) is as follows: agent $n$ transmits only if its throughput is the smallest among the throughputs of all the agents, and the throughputs of the agents with indexes smaller than $n$ are strictly larger than the throughput of agent $n$. In other words, among the agents tied at the minimum throughput, only the one with the smallest index transmits, so ties are broken deterministically.
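One plausible reading of the Stage-2 rule can be sketched as follows. The function and argument names are ours; the rule lets exactly the lowest-indexed agent attaining the minimum throughput transmit, so at most one agent transmits per slot.

```python
def should_transmit(network_action, throughputs, n):
    """Stage-2 sketch: agent n transmits only if the network action is
    'transmit' (1) and n is the lowest-indexed agent whose throughput equals
    the minimum over all agents (one interpretation of (7))."""
    if network_action == 0:
        return False
    x_min = min(throughputs)
    return throughputs[n] == x_min and all(x > x_min for x in throughputs[:n])
```

Because every agent evaluates the same deterministic rule on the same throughput vector, the agents reach a consistent decision without exchanging messages, provided their copies of the throughput vector agree.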
IV-B2 Neural Network Training
We next describe the training of the DNN used in Stage 1 of the two-stage action selection process. In the original DQN algorithm [DQNpaper] where the feedback from the environment is assumed to be perfect and without errors, the agent stores the experiences in the form of (state, action, reward, next state) in an experience buffer, and samples multiple experiences in each time step to calculate a loss function and update the DNN using a gradient descent method (see the introduction of DQN in Section II-B).
As in [DQNpaper], we also define an experience in the form of (state, action, reward, next state) in our DLMA algorithm. A key difference is that we use a reward vector rather than a scalar reward as in [DQNpaper]. Particularly, the experience collected in time slot $t$ is represented by $e_t = \left(s_t, a_t, \boldsymbol{r}_t, s_{t+1}\right)$. Note that in the experience $e_t$, we use the network action $a_t$ rather than the agent action, since we use the DNN to generate the network action rather than the agent action.
Unlike the problem in [DQNpaper], in our problem the feedback from the environment may be erroneous and incomplete due to imperfect feedback channels. In particular, when the ACK packet is lost in a time slot, as explained in Section IV-A, all the reward entries of the experience collected in that slot are “null”, and the experience is deemed incomplete. When the ACK packet is successfully received, the rewards are not “null”, and the experience is deemed complete. Incomplete experiences do not “advise”/“reinforce” the agents correctly and therefore are not used to train the DNN. However, it is possible to fill in the missing rewards of incomplete experiences via ACKs received later. Once the missing reward values are filled in with the correct values, an incomplete experience becomes a complete experience that can be used to train the DNN. We refer to this process of making incomplete experiences complete as a feedback recovery mechanism.
In the feedback recovery mechanism, instead of using one experience buffer, we use two experience buffers for each agent: an incomplete-experience buffer for storing the incomplete experiences and a complete-experience buffer for storing the complete experiences. Incomplete experiences are not used to train the DNN; only complete experiences are sampled to calculate the loss function (given later) and train the DNN. If the missing rewards of an incomplete experience are obtained through an ACK packet that arrives later (recall that an ACK packet contains the transmission results of up to M past time slots), the experience becomes complete and is moved to the complete-experience buffer. Fig. 6 illustrates the operation of the feedback recovery mechanism, and an example is given below to explain the details.
For each agent, if the ACK packet is lost at the end of time slot t, the experience collected in slot t is stored in the incomplete-experience buffer with all its reward entries set to “null”.
If the ACK packet is successfully received at the end of time slot t, then for each experience in the incomplete-experience buffer whose time slot falls within the M-slot window covered by the ACK (if there is no experience in the incomplete-experience buffer, then do nothing), the “null” rewards are replaced with the reward values determined by the transmission results carried in the ACK, and the experience is moved to the complete-experience buffer. After that, we clear the incomplete-experience buffer. The newly collected experience of slot t, whose rewards are known from the ACK, is stored directly into the complete-experience buffer (for notational consistency, recovered reward values are marked with a hat).
Note that when the ACK packet is successfully received at the end of time slot t, the incomplete-experience buffer may still contain incomplete experiences older than the M-slot window of the ACK. The ACK cannot correct the null rewards in these experiences, so we discard them by clearing the incomplete-experience buffer. The probability of an experience being incomplete without ever being made complete is q^M, where q is the downlink erasure probability; when M is large, q^M is close to 0.
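The buffer bookkeeping above can be sketched in a few lines. The dictionary-based experience format, the field names, and the buffer capacity are illustrative assumptions rather than the paper's exact data structures.

```python
from collections import deque

class FeedbackRecovery:
    """Two-buffer feedback recovery (sketch). An experience is a dict with a
    'slot' index and a 'rewards' field that stays None until the
    corresponding ACK is heard; field names and capacity are illustrative."""

    def __init__(self, capacity=1000):
        self.incomplete = []                    # experiences with null rewards
        self.complete = deque(maxlen=capacity)  # experiences used for training

    def store(self, experience, ack_received):
        # experiences with known rewards go straight to the training buffer
        (self.complete if ack_received else self.incomplete).append(experience)

    def on_ack(self, reward_history):
        """reward_history maps slot -> recovered reward vector for the last
        M slots reported in the ACK."""
        for exp in self.incomplete:
            if exp["slot"] in reward_history:
                exp["rewards"] = reward_history[exp["slot"]]
                self.complete.append(exp)       # now complete: usable for training
        # experiences older than the ACK's window can never be recovered,
        # so the incomplete buffer is cleared either way
        self.incomplete = []
```

Any experience whose slot is older than the ACK's reporting window is silently dropped when the incomplete buffer is cleared, matching the discard rule described above.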
To train the DNN, multiple experiences are sampled from the complete-experience buffer to compute the loss function (8) on page 9, wherein the target value is given by (9). After computing the loss function (8), the DNN parameters can be updated using a gradient descent method [lecun2015deep]. Overall, the pseudocode of our DLMA algorithm is summarized in Algorithm 1.
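The training step can be sketched with the standard DQN loss. For readability this sketch uses a scalar reward and plain state-action arrays in place of the online and target networks, whereas the paper itself uses a reward vector and a DNN; it is an illustration of the loss structure of (8)-(9), not the exact implementation.

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.9):
    """Mean squared TD error over a minibatch of complete experiences
    (sketch of (8)-(9)). q_online and q_target are |S| x |A| arrays standing
    in for the online and target networks."""
    errors = []
    for s, a, r, s_next in batch:
        y = r + gamma * np.max(q_target[s_next])  # bootstrapped target, cf. (9)
        errors.append((y - q_online[s, a]) ** 2)  # squared TD error, cf. (8)
    return float(np.mean(errors))
```

In the actual algorithm, the gradient of this loss with respect to the online-network parameters is then taken by RMSProp, and the target network is refreshed periodically.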
V Performance Evaluation
This section evaluates the performance of our DLMA protocol. First, we present the simulation setup. After that, we examine the performance of DLMA in three scenarios: a single agent with imperfect channels, multiple agents with perfect channels, and multiple agents with imperfect channels.
V-A Simulation Setup
In our distributed DLMA algorithm, all the agents have the same objective and run the same algorithm simultaneously. Table II summarizes the hyperparameters adopted by each agent. In particular, the state of an agent contains a history of channel states (see the definitions in Section IV-A). The parameter ε in the ε-greedy algorithm is initially set to 1 and decays in each time slot until it reaches 0.05. The size of the complete-experience buffer is 1000, and 64 experiences are sampled from the complete-experience buffer in each time slot to perform gradient descent over the loss function (8) using the RMSProp algorithm [lecun2015deep]. The target network is updated once every 20 time slots. For the DNN, the input is the current state of the agent; the hidden layers consist of one long short-term memory (LSTM) layer [hochreiter1997long] and feedforward layers [lecun2015deep]; the number of neurons in the output layer equals the number of Q values to be estimated. An illustration of the DNN architecture is presented in Fig. 7.
|ε in ε-greedy algorithm||1 to 0.05|
|Experience buffer size||1000|
|Experience-replay minibatch size||64|
|Target network update frequency||once every 20 time slots|
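The ε annealing in Table II might be implemented with a simple multiplicative schedule, sketched below. The decay factor 0.995 is an illustrative assumption; the paper's exact per-slot update rule is not reproduced here.

```python
def epsilon_schedule(step, start=1.0, end=0.05, decay=0.995):
    """Epsilon annealing for epsilon-greedy exploration (sketch): epsilon
    starts at 1 and shrinks multiplicatively every time slot, floored at
    0.05 so the agent never stops exploring entirely."""
    return max(end, start * decay ** step)
```

With probability ε the agent picks a random action, and with probability 1 - ε it picks the action maximizing the combined Q value, so early slots are exploration-heavy and later slots mostly exploit the learned policy.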
V-B Evaluation of DLMA in Different Scenarios
V-B1 Single Agent with Imperfect Channels
We now examine the ability of DLMA to deal with the imperfect channel problem. In particular, we consider the coexistence of one DLMA agent with one TDMA user and one ALOHA user. The TDMA user is from the TDMA network and transmits in the 2nd slot out of 5 slots within a TDMA frame. The ALOHA user is from the ALOHA network and transmits with a probability of 0.2 in each time slot.
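The two coexisting users can be modeled in a few lines of Python; the function names and the zero-based slot indexing are our own conventions.

```python
import random

def tdma_transmits(t, slot_in_frame=1, frame_len=5):
    """The TDMA user transmits in the 2nd slot (zero-based index 1) of
    every 5-slot frame, i.e. deterministically once per frame."""
    return t % frame_len == slot_in_frame

def aloha_transmits(p=0.2, rng=random):
    """The ALOHA user transmits with probability p in each time slot,
    independently of the past."""
    return rng.random() < p
```

The DLMA agent must learn, purely from observed transmission outcomes, both the periodic pattern of the TDMA user and the statistical behavior of the ALOHA user.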
We first study the performance of DLMA in reducing the detrimental effect of imperfect feedback channels when the objective of the agent is to maximize sum throughput. For maximizing sum throughput, we set α = 0 in the α-fairness objective. To study channels of varying imperfection, we set the uplink channel erasure probability to 0 and vary the downlink erasure probability q.
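For reference, the α-fairness utility reduces to sum throughput at α = 0 and to proportional fairness (sum of log throughputs) at α = 1. The sketch below assumes the standard α-fair utility form; (4) in the paper may differ in normalization.

```python
import math

def alpha_fair_utility(throughputs, alpha):
    """Standard alpha-fairness utility (sketch): alpha = 0 gives the sum
    throughput; alpha = 1 gives proportional fairness (sum of logs);
    larger alpha weights the worst-off users more heavily."""
    if alpha == 1:
        return sum(math.log(x) for x in throughputs)
    return sum(x ** (1 - alpha) / (1 - alpha) for x in throughputs)
```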
For benchmarking, we replace the DLMA agent with a model-aware user. The model-aware user has the same objective as the agent and knows the transmission pattern of TDMA and the transmission probability of ALOHA. In addition, we assume the feedback channel of the model-aware user is perfect. A separate document [benchmark] derives the benchmark for this scenario and the benchmarks for the scenarios in the subsequent sections. To conserve space, we omit the details of the derivations in this paper.
Fig. 8 presents the short-term sum throughput of the agent, TDMA, and ALOHA, where the short-term throughput is defined as the average successful packet rate over a recent window of time slots. In each subfigure of Fig. 8, we fix the value of q and vary the value of M (i.e., the history length of the transmission results maintained in the ACK packet) in the feedback recovery mechanism. The black line in Fig. 8 is the sum throughput benchmark. Each line except the black line is the average result over 10 different experiments, with the shaded areas indicating one standard deviation.
As we can see from Fig. 8, for M = 1, the results fail to approximate the benchmark for all values of q. In addition, as q increases, the gap between the M = 1 results and the benchmark widens. This is because when M = 1, no incomplete experiences are made complete and all the incomplete experiences are discarded. Without enough complete experiences, especially when q is large, the agent may not learn a good solution.
We can also see that when q is small, the results corresponding to all values of M approximate the benchmark. As q increases, the result corresponding to M = 2 gets worse and fails to approximate the benchmark when q = 0.6 (the value of q could be as large as 0.6 when the AP is blocked by buildings in a real wireless communication scenario). The reason for this phenomenon is that in our feedback recovery mechanism, the probability of an incomplete experience never being made complete is q^M (as mentioned in Section IV-B2). When q = 0.6 and M = 2, q^M = 0.36, so around 36% of the experiences remain incomplete, resulting in severe performance degradation. We can conclude that a small value of M is not a good option. However, as M increases, the agent checks more incomplete experiences whenever an ACK is received, which requires more computation each time. Unless stated otherwise, henceforth we adopt a modest value of M in our evaluation.
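The 36% figure above is easy to reproduce; a one-line sketch, with q and M as in the text (the function name is ours):

```python
def unrecovered_prob(q, M):
    """Probability that an incomplete experience is never made complete
    (sketch): recovery fails only if the ACK is lost in every one of the M
    slots whose ACKs could report the missing result, giving q**M for
    downlink erasure probability q."""
    return q ** M
```

The sharp decay of q**M in M explains why even a modest history length makes feedback losses almost harmless.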
We then study the performance of DLMA in reducing the detrimental effect of imperfect feedback channels when the DLMA agent aims to achieve a different objective. To conserve space, we only present results for the objective of proportional fairness (i.e., α = 1). Fig. 9 presents the short-term sum-log throughput of the agent, TDMA, and ALOHA, where the sum-log throughput corresponds to the α-fairness objective (4) with α = 1. As we can see from Fig. 9, the same observations can be made and the same conclusions drawn as in Fig. 8. Overall, both Fig. 8 and Fig. 9 demonstrate the capability of our feedback recovery mechanism in reducing the detrimental effect of imperfect feedback channels.
We next evaluate the performance of DLMA when both the uplink channels and the downlink channels are imperfect (i.e., both erasure probabilities are nonzero). Let us first consider the same setting as in Fig. 8. Fig. 10 presents the sum throughput of the DLMA agent, TDMA, and ALOHA for different combinations of uplink and downlink erasure probabilities, together with the corresponding benchmarks. Note that due to the inherent and unavoidable uplink channel errors, the benchmarks for different uplink erasure probabilities differ: as the uplink erasure probability increases, the sum throughput benchmark decreases [benchmark]. As we can see from Fig. 10, the performance of DLMA can approach the benchmark performance in the different imperfect channel scenarios. When we change the objective of the agent in Fig. 10 to achieving proportional fairness, as we can see from Fig. 11, the sum-log throughput benchmarks can also be achieved in the different imperfect channel scenarios. This demonstrates that our DLMA algorithm also works well when the uplink channels are imperfect.
V-B2 Multiple Agents with Perfect Channels
This subsection examines the capability of the two-stage action selection approach to tackle the multi-agent problem. Specifically, we consider a case where four DLMA agents coexist with one TDMA user. As in Section V-B1, the TDMA user transmits in the 2nd slot out of 5 slots within a TDMA frame. In this part, we assume the channels are perfect (i.e., both erasure probabilities are 0) and we set M to 1, since there is no need to perform feedback recovery when the channels are perfect. We vary the value of α in our evaluation. The benchmark is the same for the different values of α: the throughput of each agent is 0.2, and the throughput of TDMA is also 0.2.
Table III summarizes the individual throughputs of all the agents and the TDMA user when the agents adopt different values of α. As we can see from Table III, the throughputs of all the users approximate the optimal result 0.2. This demonstrates that the four agents not only learn the transmission pattern of TDMA but also avoid collisions with each other when the channels are perfect.
|Agent 1||Agent 2||Agent 3||Agent 4||TDMA|
V-B3 Multiple Agents with Imperfect Channels
This subsection studies the multi-agent scenario with imperfect feedback channels. We first consider the coexistence of four agents with one TDMA user. The setup is the same as in Section V-B2 except that the downlink channels are now imperfect. As in Section V-B2, the optimal benchmark for each user is 0.2. For this coexistence case, we consider two subcases: (i) the feedback channels between the agents and the AP are completely dependent; specifically, if a channel error happens (does not happen), it happens (does not happen) on all the channels at the same time (this will be the case when there are obstacles between the AP and the users and the obstacles are very close to the AP; for example, if the AP is very close to a train station, the channels between the AP and the users will be in good condition when there are no trains in the station, and may all be blocked simultaneously when one or more trains are in the station); (ii) the feedback channels are independent, i.e., channel errors happen independently on each channel.
Table IV and Table V summarize the individual throughputs of all the agents and the TDMA user for subcases (i) and (ii), respectively. As we can see from Table IV, the throughput of each user approximates the optimal result 0.2 for different values of the downlink erasure probability. This demonstrates that when the channels are imperfect and dependent, the agents not only learn the transmission pattern of TDMA but also learn to avoid collisions with the other agents.
As we can see from Table V, only the throughput of TDMA approximates the optimal result 0.2, while the throughput of each agent is only around 0.18, leaving a gap to the optimal result. The reason for the gap is as follows. In subcase (i), the throughput vector used to break ties among the agents (see Section IV-B1) is the same for every agent in every time slot. In subcase (ii), by contrast, the throughput vector known to each agent may differ, since in a given slot some agents may receive the ACK correctly while others may not. The discrepancy in the throughput vectors may cause more than one agent to transmit in a time slot, resulting in inter-agent collisions. Furthermore, for subcase (ii), we can lower-bound the probability that the four agents hold the same throughput vector in a time slot, which in turn yields a lower bound on each agent's throughput; the throughput of each agent in Table V indeed falls between this lower bound and the optimal result 0.2. We can therefore conclude that when the channels are imperfect and independent, the agents can learn the transmission pattern of TDMA and reduce, rather than avoid, the collisions among the agents.
We next consider a more complex scenario where five DLMA agents coexist with two TDMA users and three ALOHA users. The first TDMA user transmits in the 2nd slot out of 10 slots within a TDMA frame, and the second TDMA user transmits in the 8th slot out of 10 slots within a TDMA frame. The transmission probability of each ALOHA user is 0.1. Both the uplink and downlink channel error probabilities are nonzero. For simplicity, we assume the channels between the agents and the DLMA AP are dependent. Fig. 12 and Fig. 13 present the sum throughput and the sum-log throughput of all the users when the objectives of the agents are to maximize sum throughput and to achieve proportional fairness, respectively. The corresponding benchmarks are also plotted in Fig. 12 and Fig. 13. As we can see, the optimal benchmarks can also be approximated in this complex scenario.
VI Conclusion
This paper developed a distributed DLMA protocol for efficient and equitable spectrum sharing in heterogeneous wireless networks with imperfect channels. Each user in the DLMA network is regarded as an agent and adopts the DLMA protocol to coexist with the other DLMA agents and the users from the other networks. The agents do not collaborate with each other and are unaware of the operational details of the MACs of the users of other networks. Through a DRL process, the agents learn to work together to achieve a global α-fairness objective.
The conventional RL/DRL framework assumes that the feedback/reward to the agent is always correctly received. In wireless networks, however, the feedback/reward may be lost due to channel errors. Without correct feedback information, the agent may fail to find a good solution. Moreover, in the distributed protocol, each agent makes decisions on its own. It is a challenge to guarantee that the multiple agents will make coherent decisions and work together to achieve the same objective, particularly in the face of imperfect feedback channels. To deal with the challenge, we put forth a feedback recovery mechanism to recover missing feedback information and a two-stage action selection mechanism to aid coherent decision making to reduce transmission collisions among the agents. Extensive simulation results demonstrated the effectiveness of these two mechanisms.
Last but not least, we believe that the feedback recovery mechanism and the two-stage action selection mechanism proposed in this paper can also be used in general distributed multi-agent reinforcement learning problems in which feedback information on rewards can be corrupted.