I Introduction
One of the key drivers for improving throughput in future wireless networks, including fifth generation mobile networks (5G), is the densification achieved by deploying more base stations. The rise of such ultradense network paradigms implies that the limited physical wireless resources (in time, frequency, etc.) need to support an increasing number of simultaneous transmissions. Effective radio resource management procedures are, therefore, critical to mitigate the interference among such concurrent transmissions and achieve the desired performance enhancement in these ultradense environments.
The radio resource management problem is in general nonconvex and therefore computationally complex, especially as the network size increases. There is a rich literature of centralized and distributed algorithms for radio resource management, drawing on techniques from different areas such as geometric programming [1], weighted minimum mean square optimization [2, 3], information theory [4, 5], and fractional programming [6]. Due to the dynamic nature of wireless networks, however, these radio resource management algorithms may fail to guarantee a reasonable level of performance across all ranges of scenarios. Such dynamics may be better handled by algorithms that learn from interactions with the environment. In particular, frameworks that base their decision-making process on the massive amounts of data that are already available in wireless communication networks are well suited to cope with these challenges.
A specific subset of machine learning algorithms, called reinforcement learning (RL) methods, is uniquely positioned in this regard. In the simplest form, RL algorithms consider an agent which interacts with an environment over time by receiving observations, taking actions, and collecting rewards, while the environment transitions to the subsequent step, emitting a new set of observations. Collecting experiences in such a framework, the ultimate goal of these algorithms is to train the agent to take actions that maximize its reward over time. Recent years have seen the rise of deep reinforcement learning, where deep neural networks (DNNs) are used as function approximators to estimate the probability of taking each action given each observation, and/or the value of each observation-action pair. Deep RL algorithms have achieved resounding success in solving challenging sequential decision-making problems, especially in various gaming environments, such as Atari 2600 and Go [7, 8, 9, 10].
These promising results have motivated researchers in other domains to apply deep RL algorithms to attack challenging problems in their areas, especially when deriving optimal “ground-truth” solutions is difficult, if not impossible. Of particular interest to us are the numerous recent works that have attempted to tackle various radio resource management problems using deep RL techniques. In particular, [11, 12] use deep RL for the problem of spectrum sharing and resource allocation in cognitive radio networks. In [13], the authors propose a multi-agent deep RL approach for spectrum sharing in vehicular networks, where each vehicle-to-vehicle (V2V) link acts as an agent, learning how to reuse the resources utilized by vehicle-to-infrastructure (V2I) links in order to improve system-wide performance metrics. In [14], deep RL is leveraged to address demand-aware resource allocation in network slicing. Moreover, several works have focused on downlink power control in cellular networks using various single-agent and multi-agent deep RL architectures [15, 16, 17, 18, 19].
There are, however, several drawbacks to these prior works. First, most of these works intend to optimize a single metric or objective function, a prominent example of which is the sum-throughput of the links/users across the network. However, resource allocation solutions which optimize the sum-throughput often allocate resources unfairly among users, as they only focus on the average performance and fail to guarantee a minimum performance. Second, measurements of channels and other metrics at each node in a real-world wireless network reach the other nodes in the network with certain amounts of delay, while many of the past works assume ideal message passing among transmitter/user nodes. Moreover, many of the aforementioned RL-based solutions are not scalable, in the sense that they do not address the possible mismatch between training and deployment environments and do not typically consider the applicability and robustness of their solution to variations in the environment.
In this paper, we consider the application of deep RL techniques to the problem of distributed user scheduling and downlink power control in multi-cell wireless networks, and we propose a mechanism for scheduling transmissions using deep RL so as to be fair among all users throughout the network. We evaluate our proposed multi-agent deep RL algorithm using a system-level simulator and compare its performance against several decentralized and centralized baseline algorithms. In particular, we show that our trained agents outperform two decentralized baseline scheduling algorithms in terms of the tradeoff between sum-rate (representative of “cell-center” users, i.e., the ones with relatively good channel conditions) and percentile rate (representative of “cell-edge” users with poor channel conditions). Moreover, our agents attain competitive performance as compared to a centralized binary power control method, called ITLinQ [4, 20].
Our proposed design for the deep RL agents is scalable, and ensures that their DNN structure does not vary with the actual size of the wireless network, i.e., number of transmitters and users. We test the robustness of our trained agents with respect to changes in the environment, and demonstrate that our agents maintain their performance gains throughout a range of network configurations. We also shed light on the interpretability aspect of the agents and analyze their decision making criteria in various network conditions.
We make the following main contributions in this paper:

We introduce a multi-agent deep RL algorithm, which performs joint optimization of user selection and power control decisions in a wireless environment with multiple transmitters and multiple users per transmitter.

We consider the agents’ observations to be undersampled and delayed, to account for real-world measurement feedback periods, and communication and processing delays.

We introduce a scalable design of observation and action spaces, allowing an agent with a fixed-size neural network to operate in a variety of wireless network sizes and densities.

We introduce a novel method for normalizing the observation variables, which are input to the agent’s neural network, using a percentile-based preprocessing technique on an offline collected dataset from the actual train/test deployment.

We utilize a configurable reward which allows us to achieve the right balance between average user rate and the percentile user rate, representing “cell-center” and “cell-edge” user experiences, respectively.
The rest of this paper is organized as follows. In Section II, we present our system model and formulate the problem. We then describe our proposed multiagent deep RL framework, including the environment, observations, actions, and rewards in Section III. We present our simulation results in Section IV. We provide further discussion on the results and several future directions in Section V. Finally, we conclude the paper in Section VI.
II System Model and Problem Formulation
We consider the downlink direction of a wireless network, consisting of access points (APs) and user equipment devices (UEs), as illustrated in Figure 1, where the APs intend to transmit data to the UEs across the network. We assume that each AP maintains a local pool of users associated with it based on the long-term component of the channel gain, i.e., pathloss and shadowing. In particular, let us use to denote the reference signal received power by from , and we define the set of UEs associated with , denoted by , as
(1) 
We assume that each AP has at least one UE associated with it. This, in conjunction with (1), ensures that the set of user pools for all APs is a partition of the entire set of UEs; i.e.,
(2)  
(3)  
(4) 
We consider a synchronous timeslotted communication framework, where each makes the following two major decisions at each scheduling interval :

User scheduling: It selects one of the UEs from its local user pool to serve. We let denote the UE that is scheduled to be served by at scheduling interval .

Power control: It selects a transmit power level , where denotes the maximum transmit power.
Given the above decisions, the received signal at (we removed the dependence on for brevity) at scheduling interval can be written as
(5) 
where denotes the channel gain between each and at scheduling interval , denotes the signal transmitted by at scheduling interval and denotes the additive white Gaussian noise at in scheduling interval , with denoting the noise variance. This implies that the achievable rate of at scheduling interval (taken as the Shannon capacity) can be written as
(6) 
We assume that the system runs for consecutive scheduling intervals, and we define the average rate of each over this period as
(7) 
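As an illustration of the rate model described above, the following minimal sketch computes the per-interval Shannon rate of each scheduled UE, treating the transmissions of all other APs as noise. The function name, the array layout, and the use of channel power gains are assumptions made for illustration, since the symbols of (5)-(7) are not reproduced here; the average rate over the run is then simply the mean of these values over the scheduling intervals.

```python
import numpy as np

def achievable_rates(gains, scheduled_ue, tx_power, noise_var):
    """Shannon rate (bits/s/Hz) of the UE scheduled by each AP in one interval.

    gains[i, j]     : channel power gain from AP i to UE j (assumed layout)
    scheduled_ue[i] : index of the UE served by AP i
    tx_power[i]     : transmit power of AP i (0 means the AP stays silent)
    noise_var       : noise variance at every UE
    """
    gains = np.asarray(gains, dtype=float)
    tx_power = np.asarray(tx_power, dtype=float)
    scheduled_ue = np.asarray(scheduled_ue)
    # rx[k, i] = power received from AP k at the UE scheduled by AP i
    rx = tx_power[:, None] * gains[:, scheduled_ue]
    signal = np.diag(rx)                      # desired-link power
    interference = rx.sum(axis=0) - signal    # power from all other APs
    sinr = signal / (interference + noise_var)
    return np.log2(1.0 + sinr)
```

A silent AP (zero transmit power) yields zero signal power and hence a zero rate for its UE, while also removing its contribution to the interference seen by the other UEs.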
We now specify the following two metrics that we intend to optimize in this paper:

Sum-rate: This metric is defined as the aggregate average throughput across the entire network over the course of scheduling intervals; i.e.,
(8) 
percentile rate: As the name suggests, this metric is defined as the average rate threshold achieved by at least of the UEs over scheduling intervals. In a probabilistic sense, we have the following definition:
(9)
Note that sum-rate provides an indication of how high the throughputs of the users are on average across the network (i.e., “cell-center” users), while the percentile rate shows the performance of the worst-case users that are in poor channel conditions (i.e., “cell-edge” users). These two metrics are in natural conflict with each other, and our goal in this paper is to devise a scheduling algorithm that outputs a sequence of joint user scheduling and power control decisions across the network such that we achieve the best tradeoff between the two metrics.
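The two metrics can be computed from a matrix of per-interval UE rates, as in the following minimal sketch. The function names and the `rates[t, j]` layout are illustrative assumptions, and the percentile level `q` is left as a parameter because its value is not recoverable from the text above. Note that `numpy.percentile` interpolates linearly between order statistics by default, which is one of several reasonable empirical counterparts of the probabilistic definition.

```python
import numpy as np

def sum_rate(rates):
    """Sum over UEs of the per-UE average rate; rates[t, j] is UE j's rate at interval t."""
    rates = np.asarray(rates, dtype=float)
    return float(rates.mean(axis=0).sum())

def percentile_rate(rates, q):
    """q-th percentile (0..100) of the per-UE average rates across the network."""
    rates = np.asarray(rates, dtype=float)
    return float(np.percentile(rates.mean(axis=0), q))
```

A low percentile level reflects the experience of the worst-served UEs, whereas the sum-rate is dominated by the UEs with the highest average rates.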
II-A Distributed Scheduling, Feedback and Backhaul Delays
In this paper, we particularly aim to design a distributed scheduling algorithm in the sense that each AP in the network should make its user scheduling and power control decisions on its own. In order to enable such decision making, the APs rely on information fed back to them by the UEs in a periodic manner. In particular, we assume that each measures status indicators and reports them back to its associated every scheduling intervals. In order to account for real-world measurement, processing and communication latencies, we assume that these feedback reports arrive at the AP with a certain delay of scheduling intervals.
In addition to the feedback each AP receives from its own associated UEs, we assume that there is some periodic message passing among the APs across the network via a delayed backhaul interface. To be specific, we consider the case that the feedback reports communicated between agents are additionally delayed for scheduling intervals. This allows each AP to have access to the feedback reports from the UEs associated with other (neighboring) APs as well, albeit with some delay.
Figure 2 visualizes how feedback reports are exchanged between the UEs and their associated and non-associated APs over time. For each access point , we denote by the most recent set of available measurements at scheduling interval that have been reported by each . As the figure shows, for any , can be written as
(10) 
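One plausible way to read the timing described above is sketched below: reports are sampled at multiples of the feedback period and become visible once the feedback delay (plus the backhaul delay, for UEs of other APs) has elapsed. The function name, the assumption that sampling starts at interval 0, and the convention of returning `None` before the first report arrives are all illustrative choices rather than the exact expression in (10).

```python
def latest_report_time(t, period, feedback_delay, backhaul_delay=0):
    """Sampling interval of the most recent report available at interval t.

    Assumes reports are sampled at 0, period, 2*period, ... and arrive
    feedback_delay (+ backhaul_delay for remote UEs) intervals later.
    Returns None if no report has arrived yet.
    """
    total_delay = feedback_delay + backhaul_delay
    if t < total_delay:
        return None
    # Latest multiple of `period` that is at least `total_delay` intervals old
    return ((t - total_delay) // period) * period
```

For instance, with a period of 10 and a feedback delay of 3, an AP at interval 25 would act on the report sampled at interval 20, while a neighboring AP with an extra backhaul delay of 5 would still be using the report sampled at interval 10.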
III Proposed Deep Reinforcement Learning Framework
We model the distributed scheduler as a multi-agent deep RL system. In particular, we propose to equip each AP with a deep RL agent, as illustrated in Figure 3. As mentioned in Section II-A, each agent observes the state of the UEs in its local user pool, and it also exchanges information with neighboring agents, thus observing the state of neighboring APs’ associated UEs. We utilize a centralized training procedure, collecting the experiences of all agents and using them to train a single policy, which is then used by all the agents. Even though training is centralized, the execution phase is distributed, with each agent making its own decision at each scheduling interval based only on the specific observations it receives from the environment.
III-A Environment
The environment is parametrized by the physical size of the deployment area, number of APs and UEs, and parameters governing the channel models used to create AP-UE channel realizations. At the start of each episode, the environment is reset, resulting in a new set of physical locations for all the APs and UEs in the network, along with channel realizations between them, modeling both long-term and short-term fading components.
III-A1 Observations
Observations available to each agent at each scheduling interval consist of local observations, representing the state of the UEs associated with the corresponding AP and remote observations, representing the state of the UEs associated with neighboring APs. In the following, we will elaborate on these observations in more detail:

Local observations: These observations are based on measurements made and reported to APs by their associated UEs, as mentioned in Section II-A. In particular, we consider the case where each UE reports measurements to its associated AP, namely its weight and signal-to-interference-plus-noise ratio (SINR). We define the weight of each at scheduling interval as
(11)
where represents the long-term average rate of since the beginning of the episode, defined as
(12)
(13)
In the above equations, is a parameter close to zero, which specifies the window size for the exponential moving average operation. Moreover, for each we define the measured SINR of at scheduling interval as
(14)
where denotes the long-term average interference received by since the beginning of the episode, and is calculated recursively as
(15)
(16)
with being a parameter close to zero that determines the window size for the exponential moving average of the interference.
The number of UEs associated with each AP can be different from AP to AP and from deployment to deployment. For our algorithm to be applicable to any scenario, we bound the dimension of the observation space by including observations (weights and SINRs) from a constant number of UEs per AP in any environment configuration, which we denote by . In order for each AP to select the top UEs whose data is included in its local observation vector at each scheduling interval, we use the proportional-fairness (PF) ratio, defined as
(17)
The PF ratio provides a notion of priority for the UEs, where the UEs with higher PF ratios are more in need of being scheduled. Therefore, at each scheduling interval, each AP sorts the UEs in its user pool according to their PF ratios, and selects the top UEs to include in its local observation vector.

Remote observations: The user scheduling and power control decisions made by an agent affect the performance of its surrounding APs and their associated UEs due to interference. As mentioned in Section II-A, we assume that neighboring agents communicate their local observations (weights and SINRs of (selected) UEs) among themselves. We bound the number of agents whose observations are included in each agent’s observation vector to a fixed number, which we denote by . We use a distance-based criterion for selecting the top remote agents for each agent. In particular, at each environment reset, we build a directed observation-exchange graph, where for each AP, we sort the other APs based on their distances to that AP, and select the closest APs as the (sorted) tuple of remote agents for that AP. Figure 4 shows an example of a distance-based observation-exchange graph for a network with APs, where the agent at each AP includes the observations from closest agents. Note how agents 3 and 4 end up being remote agents to all the other agents, because of their critical locations and potential impact as strong interferers.
Figure 4: Observation exchange graph for the configuration in Figure 1 with APs and remote agents per AP.
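The distance-based construction of the observation-exchange graph can be sketched as follows. This is a minimal illustration under assumed names (`remote_agents`, `ap_xy`, `num_remote`) and 2-D AP coordinates; ties are broken by AP index, which the text does not specify.

```python
import numpy as np

def remote_agents(ap_xy, num_remote):
    """For each AP, return the indices of its closest APs, nearest first.

    ap_xy      : array of shape (num_aps, 2) with AP coordinates
    num_remote : number of remote agents per AP (at most num_aps - 1)
    """
    ap_xy = np.asarray(ap_xy, dtype=float)
    # Pairwise Euclidean distances between all APs
    dist = np.linalg.norm(ap_xy[:, None, :] - ap_xy[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # an AP is never its own remote agent
    order = np.argsort(dist, axis=1, kind="stable")
    return order[:, :num_remote]
```

Row `i` of the result is the sorted tuple of remote agents for AP `i`. The graph is directed: AP `k` appearing in row `i` does not imply that AP `i` appears in row `k`.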
Figure 5 illustrates how the local and remote observations are concatenated together at each agent, resulting in a fixed-length observation vector. As there are remote agents per agent, and UEs’ observations are included per agent, the length of the observation vector for each agent equals .
Remark 1
Note that the dimension of the observation space, and therefore the input size to the deep RL agent’s neural network, does not depend on the number of APs and/or UEs in the network. This makes our algorithm scalable regardless of the specific environment parameters.
Remark 2
When there are fewer than UEs associated with an AP, we set the corresponding values in the local/remote observations to default values, similar to a zero-padding operation. In particular, we use default values of 0 for weight and 60 dB for SINR.
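The following sketch shows one way an agent's fixed-length observation vector could be assembled: the UEs of an AP are ranked by PF ratio, the weights and SINRs of the top UEs are kept, missing entries are padded with default values, and the remote agents' blocks are appended. The function names, the interleaved (weight, SINR) layout, and passing the padding defaults as arguments are illustrative assumptions; the PF ratios are taken as given inputs.

```python
import numpy as np

def local_observation(weights, sinrs_db, pf_ratios, top_n, pad_weight, pad_sinr_db):
    """Fixed-length local observation: (weight, SINR) of the top_n UEs by PF ratio."""
    weights = np.asarray(weights, dtype=float)
    sinrs_db = np.asarray(sinrs_db, dtype=float)
    # Indices of the UEs with the highest PF ratios, best first
    order = np.argsort(-np.asarray(pf_ratios, dtype=float), kind="stable")[:top_n]
    obs = np.empty((top_n, 2))
    obs[:, 0] = pad_weight    # defaults cover slots with no associated UE
    obs[:, 1] = pad_sinr_db
    obs[:len(order), 0] = weights[order]
    obs[:len(order), 1] = sinrs_db[order]
    return obs.ravel(), order

def agent_observation(local_obs, remote_obs_list):
    """Concatenate the local block with the blocks received from remote agents."""
    return np.concatenate([local_obs, *remote_obs_list])
```

With this layout, the observation length is 2 × (number of included UEs per agent) × (1 + number of remote agents), independent of how many APs and UEs the network actually contains.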
Remark 3
The current LTE and future 5G cellular standards developed by 3GPP support periodic channel quality indicator (CQI) feedback reports from UEs to their serving base stations. Moreover, the weights can either be reported by the UEs, or the base stations may keep track of the long-term average rates of their associated UEs. The base stations can also exchange observations among each other through backhaul links, such as the X2 interface [21]. Therefore, our proposed observation structure is practical and may be readily implemented in current and future cellular networks.
III-A2 Actions
As mentioned in Section II, at each scheduling interval, each AP needs to select a target UE from its user pool to serve, and a transmit power level to transmit data to the scheduled UE. To jointly optimize the user scheduling and power control decisions, we define a joint action space, where each action represents a (transmit power level, target UE) pair.
We quantize the range of positive transmit powers to (potentially nonuniform) discrete power levels. Moreover, because the number of UEs associated with an AP can be varying and/or potentially large, we take a similar approach to deal with this issue as in forming the observations: At each scheduling interval, we limit the choice of target UE to one of the top UEs included in the local observations. This is also reasonable because the agent has information solely on those users in its local observation vector.
Given the above considerations, the number of possible actions for each agent at each scheduling interval is , where the additional action is one in which the agent remains silent for that scheduling interval and selects none of the top UEs to serve. In the event that the AP has fewer than associated UEs and erroneously selects a target UE which does not exist, we map the action to the “off” action, indicating that the AP should not transmit.
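A flat action index can be decoded into a (power level, target UE) pair as sketched below. The specific encoding (index 0 for the silent action, then power levels in the outer loop and UE ranks in the inner loop) is an assumption made for illustration; the text only fixes the total count, namely the number of power levels times the number of candidate UEs, plus one.

```python
OFF = None  # sentinel for "the AP stays silent in this interval"

def decode_action(action, num_power_levels, top_n, num_associated):
    """Map a flat action index to (power_level_index, ue_rank), or OFF.

    ue_rank refers to the position of the UE in the PF-sorted local
    observation vector; num_associated is the AP's actual number of UEs.
    """
    num_actions = num_power_levels * top_n + 1
    if not 0 <= action < num_actions:
        raise ValueError("action index out of range")
    if action == 0:
        return OFF
    power_idx, ue_rank = divmod(action - 1, top_n)
    if ue_rank >= num_associated:
        return OFF  # target UE does not exist: fall back to the silent action
    return power_idx, ue_rank
```

With binary power control there is a single positive power level, so the agent chooses among the candidate UEs or stays silent.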
Remark 4
Note that, similar to the observation space, the action space dimension does not depend on the network size either. This allows us to have a robust agent architecture that can be trained in a specific environment and then deployed in a different environment in terms of, for instance, the number of APs and/or UEs. In Section IV-F, we show how well the agent performs in such mismatched scenarios.
III-A3 Rewards
As shown in Figure 3, we utilize a centralized reward based on the actions of all the agents at each scheduling interval. In particular, assuming that each has selected to serve at scheduling interval , the reward emitted to each of the agents is a weighted sum-rate reward, calculated as
(18) 
where is the most recent reported weight measurement by available at , is the rate achieved by , and is a parameter which determines the tradeoff between and , the two metrics that we intend to optimize. Specifically, turns the reward to sum-rate, favoring cell-center UEs, while changes the reward to approximately the summation of the scheduled UEs’ PF ratios, hence appealing to cell-edge UEs.
There are, however, two exceptions to the reward emitted to the agents, which are as follows:

All agents off: Due to the distributed nature of decision making by the agents, it is possible that at a scheduling interval, all agents decide to remain silent. This is clearly a suboptimal joint action vector at any scheduling interval. Therefore, in this case, we penalize the agent, whose top user has the highest PF ratio among all UEs in the network, with the negative of that PF ratio, while the rest of the agents will receive a zero reward, according to (18).

Invalid user selected: As mentioned in Section III-A2, it might happen that an AP has fewer than associated UEs in its user pool, and selects an invalid UE to serve at a scheduling interval. In that case, the agent corresponding to that AP is penalized by receiving a zero reward regardless of the actual weighted sum-rate reward of the other agents as given in (18).
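The reward logic, including the two exceptions, might be sketched as follows. The functional form used here (each scheduled UE's rate multiplied by its weight raised to the trade-off parameter `lam`) is an assumption that is consistent with the two limiting cases described above, namely a plain sum-rate for `lam = 0` and approximately a sum of PF ratios for `lam = 1`; it should not be read as the exact expression in (18). All names are hypothetical.

```python
import numpy as np

def agent_rewards(weights, rates, active, invalid, top_pf, lam):
    """Per-agent rewards for one scheduling interval.

    weights[i], rates[i] : reported weight / achieved rate of AP i's scheduled UE
    active[i]            : True if AP i transmits in this interval
    invalid[i]           : True if agent i selected a non-existent UE
    top_pf[i]            : PF ratio of the top UE in AP i's user pool
    lam                  : trade-off parameter (assumed exponent on the weight)
    """
    weights = np.asarray(weights, dtype=float)
    rates = np.asarray(rates, dtype=float)
    active = np.asarray(active, dtype=bool)
    invalid = np.asarray(invalid, dtype=bool)
    top_pf = np.asarray(top_pf, dtype=float)
    n = len(active)
    if not active.any():
        # All agents off: only the agent holding the highest-PF UE is penalized
        rewards = np.zeros(n)
        worst = int(np.argmax(top_pf))
        rewards[worst] = -top_pf[worst]
        return rewards
    shared = float(np.sum(weights[active] ** lam * rates[active]))
    rewards = np.full(n, shared)   # centralized reward shared by all agents
    rewards[invalid] = 0.0         # invalid selections receive zero reward
    return rewards
```

Because the reward is shared, every agent is credited for the joint outcome, while the two exceptions give individual agents a direct signal about clearly undesirable choices.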
III-B Normalizing the Agents’ Observations and Rewards
It is widely known that normalizing DNN inputs and outputs has a significant impact on training behavior and predictive performance [22]. Before training or testing our model in a specific environment, we first create a number of environment realizations in an offline fashion and run one or several simple baseline scheduling algorithms in those realizations. While doing so, we collect data on the observations and rewards of all the agents throughout all environment realizations and all scheduling intervals within each realization. We then leverage the resulting dataset in the following way to preprocess the observations and rewards before using them to train the agent’s DNN:

Observations: As mentioned in Section III-A1, we have two types of observations: weights and SINRs. For notational simplicity, we describe the normalization process for the weight observations; the process for normalizing SINR observations follows similarly.
Considering the weight observations, we derive the empirical distribution of the observed weights in the aforementioned dataset. We then use the distribution to calculate multiple percentiles of the observed weights. In particular, we consider percentile values , denoted respectively by , as depicted in Figure 6 (for both weights and SINRs). Note that and are equal to the minimum and maximum weights observed in the dataset, respectively. Afterwards, we map each subsequent weight observation during training/inference before feeding it to the neural network as
(19)
The mapping in (19) is in fact applying a (shifted version of the) CDF of the weight observation to itself, which is known to be uniformly distributed. Therefore, this mapping guarantees that the observations fed into the neural network will (approximately) follow a discrete uniform distribution over the set .
Rewards: We follow a well-known standardization procedure for normalizing the rewards, where we use the dataset to estimate the mean and standard deviation of the reward, denoted by and , respectively. Each reward during training is then normalized as
(20)
ensuring that the neural network outputs have (approximately) zero mean and unit variance.
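The two preprocessing steps can be sketched as follows. The percentile mapping below is one possible realization of a shifted empirical CDF: it counts how many percentile thresholds an observation has reached and rescales the result to the interval [-0.5, 0.5]. The shift of 0.5, the function names, and the clipping of out-of-range values are assumptions for illustration and need not match (19) exactly; the reward standardization is the usual subtraction of the mean followed by division by the standard deviation.

```python
import numpy as np

def fit_percentiles(samples, levels):
    """Percentile thresholds of an offline dataset; levels in [0, 100], increasing."""
    return np.percentile(np.asarray(samples, dtype=float), levels)

def percentile_map(x, thresholds):
    """Map observations to equally spaced values in [-0.5, 0.5] (assumed shift)."""
    x = np.asarray(x, dtype=float)
    m = len(thresholds) - 1
    # Index of the last threshold that x has reached, clipped to the valid range
    k = np.clip(np.searchsorted(thresholds, x, side="right") - 1, 0, m)
    return k / m - 0.5

def standardize_reward(reward, mean, std):
    """Zero-mean, unit-variance scaling using offline reward statistics."""
    return (reward - mean) / std
```

Because the thresholds are percentiles of the data itself, each output level is hit about equally often, which keeps heavy-tailed quantities such as weights from dominating the network input.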
III-C Training and Validation Procedure
We consider an episodic training procedure, in which each episode represents a realization of the environment in which the locations of the APs and UEs and channel realizations are randomly selected following a set of probability distributions and constraints on minimum AP-AP and UE-AP distances. We control the density of APs and UEs in our environment by fixing the size of the deployment area and selecting different numbers of APs and UEs for different training sessions. Because the channels between APs and UEs depend heavily on their relative locations, each new episode allows the system to experience a potentially unexplored subset of the observation space. The associations between UEs and APs take place as in (1) as a new episode begins, and remain fixed for the duration of that episode. An episode consists of a fixed number of scheduling intervals, where at each interval, the agents decide on which user scheduling and power control actions the APs should take.
We further structure the training process into epochs, each of which consists of a fixed number of consecutive training episodes. At the completion of each epoch, we pause training in order to evaluate the current policy against a fixed set of validation environments carefully selected to be representative of all possible environments. We use a score metric , defined as
(21)
to quantify the performance after each epoch and select the best model during training as the one achieving the best performance in terms of .
Footnote 1: We have used the factor of 3 for the percentile rate in (21), because prior experience has shown that improving cell-edge performance is typically three times more challenging than enhancing cell-center performance. Hence, the score metric emphasizes the percentile rate three times more than the average rate of the UEs across the network.
IV Simulation Results
In this section, we first mention the details of the wireless system parameters. Next, we present the baseline algorithms that we use to compare the performance of our proposed method against. We then discuss our considered deep RL agents and their corresponding parameters. Finally, we proceed to present our simulation results.
IV-A Description of the Wireless Environment
We consider networks with APs and UEs, dropped randomly within a square area. We impose a minimum AP-AP distance of m and AP-UE distance of m. We consider a bandwidth of MHz, maximum AP transmit power of dBm, noise power spectral density of dBm/Hz, and episode length of scheduling intervals.
The communication channel between APs and UEs consists of three different components:

Pathloss: We consider a dual-slope pathloss model [23, 24], which states that the pathloss at distance equals
where denotes the pathloss at distance m, denotes the breakpoint distance, and and denote the pathloss exponents before and after the breakpoint distance, respectively (). In this paper, we set dB, , , and m.

Shadowing: We assume that all the links experience log-normal shadowing with a standard deviation of dB.

Short-term fading: We use the sum of sinusoids (SoS) model [25] for short-term flat Rayleigh fading (with pedestrian node velocity of m/s) in order to model the dynamics of the communication channel over time.
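For reference, a dual-slope pathloss model of the kind described in the first item above is commonly written as in the following sketch, where the pathloss grows with one exponent up to the breakpoint distance and with a second exponent beyond it. The parameter names, the default reference distance `d0`, and the exact piecewise form are assumptions for illustration; the numerical values used in the simulations are not reproduced here.

```python
import math

def dual_slope_pathloss_db(d, pl0_db, alpha1, alpha2, d_bp, d0=1.0):
    """Pathloss in dB at distance d (same unit as d_bp and d0).

    pl0_db : pathloss at the reference distance d0
    alpha1 : pathloss exponent up to the breakpoint distance d_bp
    alpha2 : pathloss exponent beyond the breakpoint distance
    """
    if d <= 0:
        raise ValueError("distance must be positive")
    if d <= d_bp:
        return pl0_db + 10.0 * alpha1 * math.log10(d / d0)
    # Beyond the breakpoint: continue from the loss at d_bp with the second slope
    return (pl0_db + 10.0 * alpha1 * math.log10(d_bp / d0)
            + 10.0 * alpha2 * math.log10(d / d_bp))
```

Written this way, the two branches agree at the breakpoint, so the pathloss is continuous in distance.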
As for the feedback reports, each UE is assumed to sample and send its measurements to its associated AP every scheduling intervals, and these reports arrive at the associated AP after a delay of scheduling intervals. Moreover, we assume a backhaul delay of scheduling intervals for observation exchange among the APs.
IV-B Baseline Algorithms
We compare the performance of our proposed scheduler against several baseline algorithms.

Full reuse: At each scheduling interval, each AP schedules the UE in its local user pool with the highest PF ratio (PF-based user scheduling), and serves it with full transmit power.

Time division multiplexing (TDM): The UEs are scheduled in a round-robin fashion. In particular, at scheduling interval , is scheduled to be served with full transmit power by its associated AP, while the rest of the APs remain silent.

Information-theoretic link scheduling (ITLinQ) [4, 20]: This is a centralized binary power control algorithm, in which UEs are first selected using PF-based scheduling, and then the AP-UE pairs are sorted in the descending order of the selected UEs’ PF ratios. The AP whose selected UE has the highest PF ratio is scheduled to transmit with full power, and going down the ordered list, each is also scheduled to transmit with full power to its selected if and only if
(22)
where and are design parameters. Otherwise, will remain silent for that scheduling interval. The condition in (22) is inspired by the information-theoretic condition for the optimality of treating interference as noise [26], ensuring that the interference-to-noise ratios (INRs), both caused by at already-scheduled UEs and received by from already-scheduled APs, are “weak enough” compared to the signal-to-noise ratio (SNR) between and . For our simulations, we consider and .
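The greedy structure of the ITLinQ baseline can be sketched as follows. The admission test used here, which compares each INR against `M * SNR ** eta` for the candidate link, is an assumed form of the condition in (22) based on its verbal description; the names `M` and `eta` for the two design parameters, as well as the array layout, are likewise illustrative.

```python
import numpy as np

def itlinq_schedule(snr, inr, pf, M, eta):
    """Greedy binary power control over PF-sorted AP-UE pairs.

    snr[i]    : SNR (linear) between AP i and its selected UE
    inr[k, i] : INR (linear) caused by AP k at the UE selected by AP i
    pf[i]     : PF ratio of the UE selected by AP i
    Returns a boolean array; True means AP i transmits with full power.
    """
    snr = np.asarray(snr, dtype=float)
    inr = np.asarray(inr, dtype=float)
    order = np.argsort(-np.asarray(pf, dtype=float), kind="stable")
    scheduled = []
    on = np.zeros(len(snr), dtype=bool)
    for i in order:
        threshold = M * snr[i] ** eta
        # Interference caused to, and received from, every already-scheduled link
        if all(inr[i, k] <= threshold and inr[k, i] <= threshold for k in scheduled):
            scheduled.append(i)
            on[i] = True
    return on
```

The pair with the highest PF ratio is always admitted, since the list of already-scheduled links is empty when it is examined.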
IV-C Deep RL Agents and their Hyperparameters
We consider the following two different types of agents:

Double Deep Q-Network (DQN) [7, 27]: We consider a double DQN agent, which is a value-based model-free deep RL method, with a 2-layer fully-connected DNN, 128 neurons per layer, and tanh activation function. We create experiences using a set of 4 parallel environments, and save them in an experience buffer of size 25,000 samples. We use the Adam optimizer [28] to perform a round of training on the main DQN every 100 scheduling intervals, using a batch of samples, consisting of a set of 1024 scheduling intervals, each containing concurrent experiences for all the agents at that scheduling interval [29]. We update the target DQN every 10,000 scheduling intervals by replacing its parameters with those of the main DQN. We initialize the learning rate at 0.01 and decay it by half every 5,000 training iterations. We consider a set of pre-training episodes, in which we completely fill the experience buffer with the agents taking completely random actions. Afterwards, we use an ε-greedy policy, with the probability of random actions decaying from 100% to 1% over 25 training episodes. We use a discount factor of for the agent to consider the impact of its actions on the subsequent rewards in next scheduling intervals. In order to improve the generalization capabilities of our agent, we use regularization on the DQN weights with a coefficient of 0.001 [30].
Advantage Actor-Critic (A2C) [31]: We use the OpenAI Baselines [32] implementation of an A2C agent, which is a policy-based model-free deep RL method and a synchronous version of the asynchronous advantage actor-critic (A3C) agent [33], with a 2-layer fully-connected DNN, 128 neurons per layer, and tanh activation function. We consider a set of 10 parallel training environments. We use the RMSProp optimizer (with parameters and ) to perform a round of training using trajectories of length 100 scheduling intervals, each collected from one of the parallel environments. We initialize the learning rate at and cut it in half every 12,000 training iterations. We use gradient clipping with a maximum magnitude of 1. The loss function consists of the policy loss with a coefficient of 1, value function loss with a coefficient of 1, and an entropy regularization term with a coefficient of 0.05. Similar to the DQN agent, we use regularization on the A2C neural network weights with a coefficient of 0.001. Moreover, the reward discount factor is set to .
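The step-wise learning-rate decay used by both agents and the exploration schedule of the DQN agent can be sketched as below. The halving intervals (5,000 iterations for DQN, 12,000 for A2C) and the 100% to 1% exploration range over 25 episodes come from the descriptions above; the linear shape of the exploration decay and the function names are assumptions, since the text does not state how the probability is annealed.

```python
def step_decay_lr(initial_lr, iteration, halve_every):
    """Learning rate that is cut in half every `halve_every` training iterations."""
    return initial_lr * 0.5 ** (iteration // halve_every)

def exploration_prob(episode, start=1.0, end=0.01, decay_episodes=25):
    """Probability of a random action, annealed linearly (assumed) over episodes."""
    frac = min(max(episode / decay_episodes, 0.0), 1.0)
    return start + frac * (end - start)
```

For example, a DQN learning rate initialized at 0.01 would be 0.005 after 5,000 training iterations and 0.0025 after 10,000.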
For validation purposes during training, we create a set of 50 validation environments, whose average and percentile rates are within a relative error of those achieved over 1000 random environment realizations by both full reuse and TDM baseline algorithms. We define an epoch as a group of 10 consecutive episodes, and we run the training for 200 epochs, or equivalently, 2000 episodes, amounting to a total of 16 million and 40 million training scheduling intervals for DQN and A2C, respectively (due to different numbers of parallel environments). Once training is complete, we test the resulting models across another randomly generated set of 1000 environment realizations.
As for the observations, we consider each agent to include weights and SINRs from UEs having the highest PF ratios, alongside receiving remote observations from a set of remote agents. This implies that the size of the observation vector of each agent at each scheduling interval, hence the size of the input layer of each agent’s neural network, is equal to . We consider the moving average parameters for the long-term average rate and interference at the UEs to be and , respectively. We also consider a set of percentile levels for mapping and normalizing both weight and SINR observations, calculated using an offline dataset generated by both full reuse and TDM baseline algorithms.
We consider a binary power control policy, where an AP at any given scheduling interval is either off, or serves a UE with full transmit power. This implies that the total number of actions, and therefore the size of the output layer of each agent’s neural network, equals . Moreover, for the reward function as defined in (18), we consider , which helps strike the right trade-off between the average and percentile rates as we will show next.
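A minimal sketch of this binary power control action space follows. It assumes, based on the later discussion of the agent's actions, that action 0 means "off" and that action k selects the k-th of the agent's top-3 UEs (ranked by PF ratio) at full transmit power. The function and parameter names are hypothetical.

```python
def num_actions(num_top_ues=3):
    """One 'off' action plus one action per candidate UE."""
    return num_top_ues + 1


def decode_action(action, top_ues, max_power):
    """Map a discrete action index to a (served UE, transmit power) pair.

    action 0      -> AP stays silent (no UE, zero power)
    action k >= 1 -> AP serves top_ues[k - 1] at full power
    """
    if not 0 <= action <= len(top_ues):
        raise ValueError("action index out of range")
    if action == 0:
        return None, 0.0
    return top_ues[action - 1], max_power
```

For instance, `decode_action(2, ["ue_a", "ue_b", "ue_c"], 1.0)` returns `("ue_b", 1.0)`, that is, the second-ranked UE served at full power.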
For each type of environment configuration, we train 5 models, utilizing different random number generator seeds. In the following sections, we report the mean of the results across the 5 trained models, with the shaded regions around the curves illustrating the standard deviation.
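The per-seed aggregation can be sketched as below. The text does not say whether the population or the sample standard deviation is plotted; the population form (`pstdev`) is used here as an assumption, and the function name is illustrative.

```python
from statistics import mean, pstdev


def summarize_across_seeds(results_per_seed):
    """Aggregate curves from several seeds.

    results_per_seed: list of equal-length lists, one curve per seed.
    Returns (means, stds), each with one value per point on the curve.
    """
    columns = list(zip(*results_per_seed))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return means, stds
```

The shaded region at each point of a curve would then span the mean minus and plus the corresponding standard deviation.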
Remark 5
We analyze how many remote agents’ observations should be included in each agent’s observations, where as mentioned in Section III-A1, the remote agents are selected based on physical proximity. Figure 7 shows the variation in the mean long-term UE SINR in networks with APs and 100 UEs, where for each configuration of APs, the interference includes the contribution from remote agents physically closest to the serving AP.
As the figure demonstrates, by far the largest reduction in SINR occurs when the closest AP transmits. Moreover, the curves flatten out as the number of included interference terms increases, indicating that interference from farther APs is less consequential and may be safely omitted from the observation space. This justifies why including observations from remote agents is a reasonable choice.
IV-D Validation Performance during Training
We first demonstrate how the behavior of the model evolves as training proceeds. Figure 8 illustrates the evolution of validation sum-rate, percentile rate, and the score metric , as defined in (21), when training on environments with APs and UEs.
As the plots in Figure 8 show, both DQN and A2C agents initially favor sum-rate performance, while suffering in terms of the percentile rate. As training proceeds, the agents learn a better balance between the two metrics, trading off sum-rate for improvements in terms of percentile rate. As the figure shows, A2C achieves a better sum-rate, while DQN achieves better coverage and also a better score, outperforming the centralized ITLinQ approach after only 12 epochs. Moreover, DQN converges faster than A2C, due to better sample efficiency thanks to the experience buffer.
As mentioned in Section III-C, for each training run, we select the model at the epoch which yields the highest score level. We can then use the resulting model to conduct final, large-scale test evaluations upon the completion of training, as we will show next.
IV-E Final Test Performance with Similar Train and Test Configurations
In this section, we present the final test results for models tested on the same configuration as the one in their training environment. Figure 9 demonstrates the achievable sum-rate and percentile rate for the environments with 4 APs and varying numbers of UEs. As the plots show, our proposed deep RL methods significantly outperform TDM in both sum-rate and percentile rate, and they also provide considerable percentile rate gains over full reuse. Our reward design helps the agents achieve a balance between sum-rate and percentile rate, helping the DQN agent attain percentile rate values which are on par with ITLinQ for large numbers of UEs (32–40), while outperforming it for smaller numbers of UEs (16–24). The A2C agent, on the other hand, performs consistently well in terms of the sum-rate, approaching ITLinQ as the number of users increases across the network.
In Figure 10, we plot the sum-rate and percentile rate for the configurations with 40 UEs and different numbers of APs. As the figure shows, the relative trends in sum-rate are similar to the previous case, with A2C outperforming ITLinQ for networks with 8 APs. In terms of the percentile rate, both agents outperform TDM and full reuse, but they fall behind the centralized ITLinQ approach as the number of APs, and equivalently the number of agents, gets larger.
IV-F Final Test Performance with Discrepant Train and Test Configurations
As mentioned before, our design of the observation and action spaces is such that they have a fixed size regardless of the actual training configuration. We test the robustness of our models with respect to network density by deploying policies trained on one density in environments of other densities. To reduce clutter, in the following, we only plot the average results over the 5 seeds and remove the shaded regions representing the standard deviation of the results.
We first test models trained on environments with APs and different numbers of UEs against each other, and plot the results in Figure 11. We observe that all DQN and A2C agents are robust in terms of both metrics. Interestingly, the A2C model trained on the case with 16 UEs has a much better performance (especially in terms of sum-rate) than its counterpart DQN model as the number of UEs in the test deployment increases. For models with 40 UEs, however, DQN tends to perform better, especially in terms of the percentile rate.
Next, we cross-test the models trained on environments with 40 UEs and different numbers of APs against each other. Figure 12 shows the sum-rates and percentile rates achieved by these models. All the models exhibit fairly robust behaviors with the exception of the DQN model trained on configurations with 4 APs, whose percentile rate performance deteriorates for higher numbers of APs. Note that in this case, the number of agents changes across different scenarios, and we observe that in general, training with more agents leads to more capable models, which can still perform well when deployed in sparser scenarios, while training with few agents may not scale well as the number of agents increases.
Remark 6
We have also tested our trained models with observations mapped using 20 percentile levels on test environments in which the observations were mapped using different numbers of percentile levels. For the scenario with APs and UEs, we observed that using 10–100 percentile levels during the test phase achieves results very similar to (within of) the ones obtained using 20 percentile levels. This shows that our proposed approach is very robust to the granularity of mapping the observations fed into the agent’s neural network.
IV-G Interpreting the Agent’s Decisions
In this section, we attempt to interpret our trained agent’s decisions during the test phase. In particular, we collect data on the inputs and outputs of a DQN agent, trained on a network with APs and UEs and tested on the same configuration. Using this data, we will try to visualize the agent’s actions in different situations.
Figure 13 shows a scatter plot of the SINR and weight of the agent’s “top UE,” i.e., the UE in the AP’s user pool with the highest PF ratio.
The red points illustrate the cases where the agent decided to remain silent, while the green points represent the cases in which the agent served one of its top-3 UEs. As expected, higher weights and/or higher SINRs lead to a higher chance of the AP being active. Quite interestingly, the boundary between the green and red regions can be approximately characterized as , which is effectively a linear boundary on the PF ratio; i.e., the agent decides to be active if and only if the PF ratio of its top UE is above some threshold that it has learned based on its interactions with the environment.
Given that the PF ratio is a reasonable indicator of the status of each UE, Figure 14 compares the PF ratios of the top-3 UEs included in the agent’s observation and action spaces in the cases where the agent decided to serve one of those UEs.
As the figure shows, the agent’s user scheduling decision heavily depends on the relative difference between the PF ratios of the top-3 UEs. In general, the second and third UEs have some chance of being scheduled if they have a PF ratio close to that of the top UE. However, this chance is significantly reduced for the third UE, as highlighted by the regions corresponding to different user scheduling actions.
Moreover, Figure 15 shows the impact of remote observations on the agent’s power control decisions.
In particular, the figure demonstrates the cases where the agent either remains silent (red points), or decides to serve its top UE (green points). We observe that the agent learns a nonlinear decision boundary between the PF ratio of its top UE and the PF ratios of the top UE of each remote agent. Notably, the green region becomes larger as we go from the left plot to the right plot. This implies that the agent “respects” the PF ratio of the top UE of its closest remote agent more than those of the second and third closest remote agents, since the interference between them tends to be stronger and their actions therefore impact each other more significantly.
V Discussion
In this section, we discuss some of the implications of our proposed framework in more detail and provide ideas for future research on how to improve upon the current work.
V-A Analysis on the Number of Observable UEs by the Agent
As described in Section III, we bound the dimension of the agent’s observation and action space by selecting a finite number of UEs, whose observations are included in the agent’s observation vector. We select these UEs by sorting the user pool of each AP according to their PF ratios. In this section, we shed light on the trade-offs implied by such a “user filtering” method.
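The “user filtering” step can be sketched as follows. This is a minimal illustration with hypothetical names: each UE is represented as a pair of an identifier and an already-computed PF ratio, so no particular PF formula is assumed, and the handling of user pools with fewer than 3 UEs (e.g., padding the observation vector) is not addressed here.

```python
def select_top_ues(user_pool, k=3):
    """Return the identifiers of the k UEs with the highest PF ratios.

    user_pool: list of (ue_id, pf_ratio) pairs for one AP.
    The result is ordered from highest to lowest PF ratio; if the pool
    holds fewer than k UEs, all of them are returned.
    """
    ranked = sorted(user_pool, key=lambda ue: ue[1], reverse=True)
    return [ue_id for ue_id, _ in ranked[:k]]
```

For example, a pool of `[("u1", 0.2), ("u2", 1.5), ("u3", 0.9), ("u4", 1.1)]` yields `["u2", "u4", "u3"]`. Because the pool is re-sorted at every scheduling interval, a given input position of the neural network may correspond to different UEs over time.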
We first analyze the actions taken by the DQN agent when trained and tested in a network with APs and UEs. Recall that the agent can either take an “off” action, or decide to serve one of its top-3 UEs. We observe in Figure 15(a) that the algorithm selects action 0 (no transmission) or 1 (the top UE) most of the time and rarely selects the other two UEs. In this formulation of the algorithm, including information from more than 3 UEs would most likely not improve the performance. This seems logical given that the PF ratio represents the short-term ability of the UE to achieve a high rate (represented by the SINR term) along with the long-term demand to be scheduled for the sake of fairness (represented by the weight term).
Given our goal of having a scalable agent that can be employed by any AP having an arbitrary number of associated users, it is not feasible to have an agent which can observe all its UEs at every scheduling interval in a general network configuration. However, we conducted a controlled experiment, where we restricted the environment realizations to the ones in which all APs have a constant number of associated UEs. In particular, we considered a configuration with APs and UEs, where each AP has exactly 6 UEs associated with it. In such a scenario, we are indeed able to design an agent which can observe the state of all of its associated UEs, as well as all UEs associated with all of its remote agents. Furthermore, we did not sort the UEs of each agent according to their PF ratios. This ensures that each input port to the agent’s neural network contains an observation from the same UE over time. Note that the size of the observation vector is now and the number of actions equals .
After training and testing a DQN agent on the above scenario, we observed that the resulting sum-rate and percentile rate were within 6% of the original model using the sorted top-3 UEs. This demonstrates that the algorithm is indeed able to learn user scheduling without the “aid” of sorting the UEs by PF ratio. Moreover, Figure 15(b) demonstrates the percentage of time that the model with observations from all (unsorted) UEs selects the “off” action and each of the UEs. For the purposes of this figure, we sorted the UEs by their PF ratios, so the percentage of time that the algorithm selects the top UE represents the percentage of time that the agent selects the UE with the top PF ratio, regardless of its position in the observation/action space. We see that the algorithm learns to select the UE with the highest PF ratio most often, but interestingly, the distribution among the various UEs is slightly more even as compared to Figure 15(a). This implies that by letting the agent observe all UEs in an arbitrary order, its resulting user scheduling behavior is similar to a PF-based scheduler, but not exactly the same. Further interpretation of this result is left for future study.
V-B Enhanced Training Procedure
One of the unique challenges in training agents to perform scheduling tasks in a multi-node wireless environment is that we can only simulate individual snapshots of the environment and explore a limited subset of the state space. In our training procedure, we fixed the parameters governing our environment (size of the deployment area, number of APs and UEs, minimum distance constraints, maximum transmit powers, etc.) and selected AP and UE locations randomly at each environment reset. We utilized multiple parallel environments to ensure that training batches contained experiences from different snapshots. This approach, along with a carefully selected learning rate schedule and a sufficiently long training period, allowed our agents to achieve the performance that we reported in Section IV.
An interesting question remains whether a more deliberate selection of training environments can accelerate training and/or result in better-performing agents. The fact that we can simulate only a limited number of environment realizations at a time provides an opportunity to guide the exploration and learning process.
One possible direction is to control the density of the environments experienced during training by systematically varying the parameters governing the environment. Possible strategies could be to move from less dense to more dense deployments or vice versa, or to more carefully select training batches to always include experiences from a range of densities.
Another possible direction is to control the complexity of the environments on which the agent is being trained. Intuitively, some environment realizations are easy and some are hard. For example, an environment in which all UEs are very close to their associated AP, i.e., a cell-center scenario, is easy because the optimal policy is for the APs to transmit at every scheduling interval. An environment where all UEs are clustered around the borders between the coverage areas of adjacent APs, i.e., a cell-edge scenario, is slightly more difficult because the optimal policy is for neighboring APs to coordinate not to transmit at the same time. Environments in which users are distributed throughout the coverage areas of the APs are, however, much more complex because the optimal user scheduling and power control choices become completely nontrivial. Various approaches to curriculum learning could potentially be applied [34, 35]. The main difficulty with this approach is determining a more granular measure of environment complexity and a procedure for generating environment realizations that exhibit the desired difficulty levels.
V-C Capturing Temporal Dynamics
One of the main challenges faced by the scheduler is dealing with delayed observations available to the agent. We have shown that our agents are able to successfully cope with this problem, but an interesting question is whether we can include recurrent architectures in the agent’s neural network to learn, predict, and leverage network dynamics based on the sequence of observations it receives over the course of an episode.
Two approaches are possible. The first is to include recurrent neural network (RNN) elements, such as long short-term memory (LSTM) [36], or attention mechanisms such as Transformers [37], at the inputs, with the goal of predicting the actual current observations based on the past history of delayed observations. A second approach would be to place the RNN/attention elements at the output similar to [38, 39]. One thing to note here is that in order for the system to learn the temporal dynamics of the environment, it must be exposed to the observations of each individual UE over time. This means that the approach of including observations from the top-3 UEs, sorted by their PF ratios, will likely not work and observations from all UEs must be included in an unsorted order. This is challenging due to the variable number of UEs associated with each AP, and we leave this for future work.
VI Concluding Remarks
We introduced a distributed multi-agent deep RL framework for performing joint user selection and power control decisions in a dense multi-AP wireless network. Our framework takes into account real-world measurement, feedback, and backhaul communication delays and is scalable with respect to the size and density of the wireless network. We show, through simulation results, that our approach, despite being executed in a distributed fashion, achieves a trade-off between sum-rate and percentile rate, which is close to that of a centralized information-theoretic scheduling algorithm. We also show that our algorithm is robust to variations in the density of the wireless network and maintains its performance gains even if the number of APs and/or UEs changes in the network during deployment.
References
 [1] A. Gjendemsjø, D. Gesbert, G. E. Øien, and S. G. Kiani, “Binary power control for sum rate maximization over multiple interfering links,” IEEE Transactions on Wireless Communications, vol. 7, no. 8, pp. 3164–3173, 2008.
 [2] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, 2011.
 [3] L. Song, D. Niyato, Z. Han, and E. Hossain, “Game-theoretic resource allocation methods for device-to-device communication,” IEEE Wireless Communications, vol. 21, no. 3, pp. 136–144, 2014.
 [4] N. Naderializadeh and A. S. Avestimehr, “ITLinQ: A new approach for spectrum sharing in device-to-device communication systems,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1139–1151, 2014.
 [5] X. Yi and G. Caire, “ITLinQ+: An improved spectrum sharing mechanism for device-to-device communications,” in 2015 49th Asilomar Conference on Signals, Systems and Computers. IEEE, 2015, pp. 1310–1314.
 [6] K. Shen and W. Yu, “FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 2323–2327.
 [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [8] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move evaluation in Go using deep convolutional neural networks,” arXiv preprint arXiv:1412.6564, 2014.
 [9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
 [10] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
 [11] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, “Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,” IEEE Access, vol. 6, pp. 25 463–25 473, 2018.
 [12] A. Tondwalkar and A. Kwasinski, “Deep reinforcement learning for distributed uncoordinated cognitive radios resource allocation,” arXiv preprint arXiv:1911.03366, 2019.
 [13] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2282–2292, 2019.
 [14] Y. Hua, R. Li, Z. Zhao, X. Chen, and H. Zhang, “GAN-powered deep distributional reinforcement learning for resource management in network slicing,” IEEE Journal on Selected Areas in Communications, 2019.
 [15] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–7.
 [16] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep Q learning approach,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019.
 [17] K. I. Ahmed and E. Hossain, “A deep Q-learning method for downlink power allocation in multi-cell networks,” arXiv preprint arXiv:1904.13032, 2019.
 [18] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239–2250, 2019.
 [19] G. Zhao, Y. Li, C. Xu, Z. Han, Y. Xing, and S. Yu, “Joint power control and channel allocation for interference mitigation based on reinforcement learning,” IEEE Access, vol. 7, pp. 177 254–177 265, 2019.
 [20] N. Naderializadeh, O. Orhan, H. Nikopour, and S. Talwar, “Ultra-dense networks in 5G: Interference management via non-orthogonal multiple access and treating interference as noise,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall). IEEE, 2017, pp. 1–6.
 [21] G. Nardini, A. Virdis, and G. Stea, “Modeling X2 backhauling for LTE-advanced and assessing its effect on CoMP coordinated scheduling,” in 2016 1st International Workshop on Link- and System Level Simulations (IWSLS). IEEE, 2016, pp. 1–6.
 [22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.
 [23] J. G. Andrews, X. Zhang, G. D. Durgin, and A. K. Gupta, “Are we approaching the fundamental limits of wireless network densification?” IEEE Communications Magazine, vol. 54, no. 10, pp. 184–190, 2016.
 [24] X. Zhang and J. G. Andrews, “Downlink cellular network analysis with multi-slope path loss models,” IEEE Transactions on Communications, vol. 63, no. 5, pp. 1881–1894, 2015.
 [25] Y. Li and X. Huang, “The simulation of independent Rayleigh faders,” IEEE Transactions on Communications, vol. 50, no. 9, pp. 1503–1514, 2002.
 [26] C. Geng, N. Naderializadeh, A. S. Avestimehr, and S. A. Jafar, “On the optimality of treating interference as noise,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1753–1767, 2015.

 [27] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [29] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2681–2690.
 [30] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, “Quantifying generalization in reinforcement learning,” arXiv preprint arXiv:1812.02341, 2018.
 [31] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.
 [32] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “OpenAI baselines,” https://github.com/openai/baselines, 2017.
 [33] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
 [34] G. Hacohen and D. Weinshall, “On the power of curriculum learning in training deep networks,” arXiv preprint arXiv:1904.03626, 2019.
 [35] D. Seita, D. Chan, R. Rao, C. Tang, M. Zhao, and J. Canny, “ZPD teaching strategies for deep reinforcement learning from demonstrations,” arXiv preprint arXiv:1910.12154, 2019.
 [36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
 [38] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. G. Bellemare, and R. Munos, “The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning,” in Seventh International Conference on Learning Representations (ICLR), 2018.
 [39] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in Seventh International Conference on Learning Representations (ICLR), 2019.