Resource Management in Wireless Networks via Multi-Agent Deep Reinforcement Learning

02/14/2020 ∙ by Navid NaderiAlizadeh, et al.

We propose a mechanism for distributed radio resource management using multi-agent deep reinforcement learning (RL) for interference mitigation in wireless networks. We equip each transmitter in the network with a deep RL agent, which receives partial delayed observations from its associated users, while also exchanging observations with its neighboring agents, and decides on which user to serve and what transmit power to use at each scheduling interval. Our proposed framework enables the agents to make decisions simultaneously and in a distributed manner, without any knowledge about the concurrent decisions of other agents. Moreover, our design of the agents' observation and action spaces is scalable, in the sense that an agent trained on a scenario with a specific number of transmitters and receivers can be readily applied to scenarios with different numbers of transmitters and/or receivers. Simulation results demonstrate the superiority of our proposed approach compared to decentralized baselines in terms of the tradeoff between average and 5th percentile user rates, while achieving performance close to, and even in certain cases outperforming, that of a centralized information-theoretic scheduling algorithm. We also show that our trained agents are robust and maintain their performance gains when experiencing mismatches between training and testing deployments.




I Introduction

One of the key drivers for improving throughput in future wireless networks, including fifth generation mobile networks (5G), is the densification achieved by deploying more base stations. The rise of such ultra-dense network paradigms implies that the limited physical wireless resources (in time, frequency, etc.) need to support an increasing number of simultaneous transmissions. Effective radio resource management procedures are, therefore, critical to mitigate the interference among such concurrent transmissions and achieve the desired performance enhancement in these ultra-dense environments.

The radio resource management problem is in general non-convex and therefore computationally complex, especially as the network size increases. There is a rich literature of centralized and distributed algorithms for radio resource management, using various techniques in different areas such as geometric programming [1], weighted minimum mean square optimization [2], game theory [3], information theory [4, 5], and fractional programming [6].

Due to the dynamic nature of wireless networks, these radio resource management algorithms may, however, fail to guarantee a reasonable level of performance across all ranges of scenarios. Such dynamics may better be handled by algorithms that learn from interactions with the environment. Particularly, frameworks that base their decision making process on the massive amounts of data that are already available in wireless communication networks are well suited to cope with these challenges.

A specific subset of machine learning algorithms, called reinforcement learning (RL) methods, is uniquely positioned in this regard. In the simplest form, RL algorithms consider an agent which interacts with an environment over time by receiving observations, taking actions, and collecting rewards, while the environment transitions to the subsequent step, emitting a new set of observations. By collecting experiences in such a framework, these algorithms ultimately aim to train the agent to take actions that maximize its reward over time. Recent years have seen the rise of deep reinforcement learning, where deep neural networks (DNNs) are used as function approximators to estimate the probability of taking each action given each observation, and/or the value of each observation-action pair. Deep RL algorithms have achieved resounding success in solving challenging sequential decision making problems, especially in various gaming environments, such as Atari 2600 and Go [7, 8, 9, 10].

These promising results have motivated researchers in other domains to apply deep RL algorithms to attack challenging problems in their areas, especially when deriving optimal “ground-truth” solutions is difficult, if not impossible. Of particular interest to us are the numerous recent works that have attempted to tackle various radio resource management problems using deep RL techniques. In particular, [11, 12] use deep RL for the problem of spectrum sharing and resource allocation in cognitive radio networks. In [13], the authors propose a multi-agent deep RL approach for spectrum sharing in vehicular networks, where each vehicle-to-vehicle (V2V) link acts as an agent, learning how to reuse the resources utilized by vehicle-to-infrastructure (V2I) links in order to improve system-wide performance metrics. In [14], deep RL is leveraged to address demand-aware resource allocation in network slicing. Moreover, several works have focused on downlink power control in cellular networks using various single-agent and multi-agent deep RL architectures [15, 16, 17, 18, 19].

There are, however, several drawbacks to these prior works. First, most of these works aim to optimize a single metric or objective function, a prominent example of which is the sum-throughput of the links/users across the network. However, resource allocation solutions which optimize the sum-throughput often allocate resources unfairly among users, as they only focus on the average performance and fail to guarantee a minimum performance. Second, measurements of channels and other metrics at each node in a real-world wireless network reach the other nodes in the network with certain amounts of delay, while many of the past works assume ideal message passing among transmitter/user nodes. Third, many of the aforementioned RL-based solutions are not scalable, in the sense that they do not address the possible mismatch between training and deployment environments and do not typically consider the applicability and robustness of their solution to variations in the environment.

In this paper, we consider the application of deep RL techniques to the problem of distributed user scheduling and downlink power control in multi-cell wireless networks, and we propose a mechanism for scheduling transmissions using deep RL so as to be fair among all users throughout the network. We evaluate our proposed multi-agent deep RL algorithm using a system-level simulator and compare its performance against several decentralized and centralized baseline algorithms. In particular, we show that our trained agents outperform two decentralized baseline scheduling algorithms in terms of the tradeoff between sum-rate (representative of “cell-center” users, i.e., the ones with relatively good channel conditions) and 5th percentile rate (representative of “cell-edge” users with poor channel conditions). Moreover, our agents attain competitive performance as compared to a centralized binary power control method, called ITLinQ [4, 20].

Our proposed design for the deep RL agents is scalable, and ensures that their DNN structure does not vary with the actual size of the wireless network, i.e., number of transmitters and users. We test the robustness of our trained agents with respect to changes in the environment, and demonstrate that our agents maintain their performance gains throughout a range of network configurations. We also shed light on the interpretability aspect of the agents and analyze their decision making criteria in various network conditions.

We make the following main contributions in this paper:

  • We introduce a multi-agent deep RL algorithm, which performs joint optimization of user selection and power control decisions in a wireless environment with multiple transmitters and multiple users per transmitter.

  • We consider the agents’ observations to be undersampled and delayed, to account for real-world measurement feedback periods, and communication and processing delays.

  • We introduce a scalable design of observation and action spaces, allowing an agent with a fixed-size neural network to operate in a variety of wireless network sizes and densities.

  • We introduce a novel method for normalizing the observation variables, which are input to the agent’s neural network, using a percentile-based pre-processing technique on an offline collected dataset from the actual train/test deployment.

  • We utilize a configurable reward which allows us to achieve the right balance between the average user rate and the 5th percentile user rate, representing “cell-center” and “cell-edge” user experiences, respectively.

The rest of this paper is organized as follows. In Section II, we present our system model and formulate the problem. We then describe our proposed multi-agent deep RL framework, including the environment, observations, actions, and rewards in Section III. We present our simulation results in Section IV. We provide further discussion on the results and several future directions in Section V. Finally, we conclude the paper in Section VI.

II System Model and Problem Formulation

Figure 1: A wireless network comprising multiple access points (APs) and user equipment devices (UEs). The solid green lines denote the signal links between APs and their associated UEs, while the dashed red lines denote (strong) interference links between APs and neighboring non-associated UEs.

We consider the downlink direction of a wireless network, consisting of access points (APs) and user equipment devices (UEs), as illustrated in Figure 1, where the APs intend to transmit data to the UEs across the network. We assume that each AP maintains a local pool of users associated with it based on the long-term component of the channel gain, i.e., path-loss and shadowing. In particular, let us use to denote the reference signal received power by from , and we define the set of UEs associated with , denoted by , as


We assume that each AP has at least one UE associated with it. This, in conjunction with (1), ensures that the set of user pools for all APs is a partition of the entire set of UEs; i.e.,


We consider a synchronous time-slotted communication framework, where each makes the following two major decisions at each scheduling interval :

  • User scheduling: It selects one of the UEs from its local user pool to serve. We let denote the UE that is scheduled to be served by at scheduling interval .

  • Power control: It selects a transmit power level , where denotes the maximum transmit power.

Given the above decisions, the received signal at (we removed the dependence on for brevity) at scheduling interval can be written as


where denotes the channel gain between each and at scheduling interval , denotes the signal transmitted by at scheduling interval and denotes the additive white Gaussian noise at in scheduling interval , with denoting the noise variance. This implies that the achievable rate of at scheduling interval (taken as the Shannon capacity) can be written as


We assume that the system runs for consecutive scheduling intervals, and we define the average rate of each over this period as


We now specify the following two metrics that we intend to optimize in this paper:

  • Sum-rate: This metric is defined as the aggregate average throughput across the entire network over the course of scheduling intervals; i.e.,

  • 5th percentile rate: As the name suggests, this metric is defined as the average rate threshold achieved by at least 95% of the UEs over scheduling intervals. In a probabilistic sense, we have the following definition:


Note that the sum-rate indicates how high the throughputs of the users are on average across the network (i.e., “cell-center” users), while the 5th percentile rate shows the performance of the worst-case users that are in poor channel conditions (i.e., “cell-edge” users). These two metrics are in natural conflict with each other, and our goal in this paper is to devise a scheduling algorithm that outputs a sequence of joint user scheduling and power control decisions across the network such that we achieve the best tradeoff between the two metrics.
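As an illustration of the two metrics, the following minimal sketch computes per-UE average rates, the sum-rate, and the 5th percentile rate from a matrix of per-interval rates. The function names, the example numbers, and the choice to express rates as spectral efficiency (bit/s/Hz, without a bandwidth factor) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def shannon_rate(signal_power, interference_power, noise_power):
    """Spectral efficiency in bit/s/Hz, treating interference as noise."""
    sinr = signal_power / (interference_power + noise_power)
    return np.log2(1.0 + sinr)

def sum_rate(avg_rates):
    """Aggregate of the per-UE average rates across the network."""
    return float(np.sum(avg_rates))

def fifth_percentile_rate(avg_rates):
    """Rate threshold that roughly 95% of the UEs meet or exceed."""
    return float(np.percentile(avg_rates, 5))

# Hypothetical rates: rows are scheduling intervals, columns are UEs.
rates = np.array([[1.0, 0.0, 3.0, 2.0],
                  [3.0, 2.0, 1.0, 0.0]])
avg_rates = rates.mean(axis=0)  # average rate of each UE over the run

print(sum_rate(avg_rates), fifth_percentile_rate(avg_rates))
```

A scheduler that always serves the strongest UEs raises the first number while starving the weakest UEs, which lowers the second; this is the tension the two metrics are meant to capture.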

II-A Distributed Scheduling, Feedback and Backhaul Delays

Figure 2: An illustration of the timeline for reporting the status updates from the UEs back to their associated APs with a period of scheduling intervals and delay of scheduling intervals. These feedback reports are exchanged between the APs via a backhaul interface with a further delay of scheduling intervals.

In this paper, we particularly aim to design a distributed scheduling algorithm in the sense that each AP in the network should make its user scheduling and power control decisions on its own. In order to enable such decision making, the APs rely on information fed back to them by the UEs in a periodic manner. In particular, we assume that each measures status indicators and reports them back to its associated every scheduling intervals. In order to account for real-world measurement, processing and communication latencies, we assume that these feedback reports arrive at the AP with a certain delay of scheduling intervals.

In addition to the feedback each AP receives from its own associated UEs, we assume that there is some periodic message passing among the APs across the network via a delayed backhaul interface. To be specific, we consider the case that the feedback reports communicated between agents are additionally delayed for scheduling intervals. This allows each AP to have access to the feedback reports from the UEs associated with other (neighboring) APs as well, albeit with some delay.
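The timing of periodic, delayed feedback can be sketched as follows. This is a simplified illustration, not the paper's exact expression: it assumes that measurements are taken at scheduling intervals 0, T, 2T, and so on, and the names `period`, `feedback_delay`, and `backhaul_delay` are hypothetical.

```python
def latest_report_time(t, period, feedback_delay, backhaul_delay=0):
    """Measurement time of the newest report available at interval t.

    Reports are assumed to be measured at multiples of `period` and to
    arrive `feedback_delay` intervals later at the associated AP, plus
    `backhaul_delay` further intervals at a neighboring AP.
    Returns None if no report has arrived yet.
    """
    total_delay = feedback_delay + backhaul_delay
    if t < total_delay:
        return None
    return ((t - total_delay) // period) * period

# At interval 25 with a period of 10 and a feedback delay of 3, the
# associated AP sees the report measured at interval 20, while a
# neighbor behind a backhaul delay of 5 still sees the one from 10.
print(latest_report_time(25, 10, 3), latest_report_time(25, 10, 3, 5))
```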

Figure 2 visualizes how feedback reports are exchanged between the UEs and their associated and non-associated APs over time. For each access point , we denote by the most recent set of available measurements at scheduling interval that have been reported by each . As the figure shows, for any , can be written as


III Proposed Deep Reinforcement Learning Framework

We model the distributed scheduler as a multi-agent deep RL system. In particular, we propose to equip each AP with a deep RL agent, as illustrated in Figure 3. As mentioned in Section II-A, each agent observes the state of the UEs in its local user pool, and it also exchanges information with neighboring agents, thus observing the state of neighboring APs’ associated UEs. We utilize a centralized training procedure, collecting the experiences of all agents and using them to train a single policy, which is then used by all the agents. Even though training is centralized, the execution phase is distributed, with each agent making its own decision at each scheduling interval only based on the specific observations it receives from the environment.

Figure 3: Multi-agent deep reinforcement learning diagram, where the agents receive local observations from their associated UEs, while also receiving remote observations from their neighboring agents. Upon taking their actions, the agents receive a centralized reward, which helps them tune their policies to take actions that maximize their rewards over time.

III-A Environment

The environment is parametrized by the physical size of the deployment area, number of APs and UEs, and parameters governing the channel models used to create AP-UE channel realizations. At the start of each episode, the environment is reset, resulting in a new set of physical locations for all the APs and UEs in the network, along with channel realizations between them, modeling both long-term and short-term fading components.

III-A1 Observations

Observations available to each agent at each scheduling interval consist of local observations, representing the state of the UEs associated with the corresponding AP and remote observations, representing the state of the UEs associated with neighboring APs. In the following, we will elaborate on these observations in more detail:

  • Local observations: These observations are based on measurements made and reported to APs by their associated UEs, as mentioned in Section II-A. In particular, we consider the case where each UE reports measurements to its associated AP, namely its weight and signal-to-interference-plus-noise ratio (SINR). We define the weight of each at scheduling interval as


    where represents the long-term average rate of since the beginning of the episode, defined as


    In the above equations, is a parameter close to zero, which specifies the window size for the exponential moving average operation. Moreover, for each we define the measured SINR of at scheduling interval as


    where denotes the long-term average interference received by since the beginning of the episode, and is calculated recursively as


    with being a parameter close to zero that determines the window size for the exponential moving average of the interference.

    The number of UEs associated with each AP can differ from AP to AP and from deployment to deployment. For our algorithm to be applicable to any scenario, we bound the dimension of the observation space by including observations (weights and SINRs) from a constant number of UEs per AP in any environment configuration, which we denote by . In order for each AP to select the top- UEs whose data is included in its local observation vector at each scheduling interval, we use the proportional-fairness (PF) ratio, defined as


    The PF ratio provides a notion of priority for the UEs, where the UEs with higher PF ratios have a greater need to be scheduled. Therefore, at each scheduling interval, each AP sorts the UEs in its user pool according to their PF ratios, and selects the top- UEs to include in its local observation vector.

  • Remote observations: The user scheduling and power control decisions made by an agent affect the performance of its surrounding APs and their associated UEs due to interference. As mentioned in Section II-A, we assume that neighboring agents communicate their local observations (weights and SINRs of (selected) UEs) among themselves. We bound the number of agents whose observations are included in each agent’s observation vector to a fixed number, which we denote by . We use a distance-based criterion for selecting the top- remote agents for each agent. In particular, at each environment reset, we build a directed observation-exchange graph, where for each AP, we sort the other APs based on their distances to that AP, and select the closest APs as the (sorted) -tuple of remote agents for that AP. Figure 4 shows an example of a distance-based observation-exchange graph for a network with APs, where the agent at each AP includes the observations from closest agents. Note how agents 3 and 4 end up being remote agents to all the other agents, because of their critical locations and potential impact as strong interferers.

    Figure 4: Observation exchange graph for the configuration in Figure 1 with APs and remote agents per AP (shown in the original as a table listing the remote agents of each agent).
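The distance-based selection of remote agents described in the list above can be sketched as a nearest-neighbor lookup. This is a minimal illustration; the function name, the 2-D coordinates, and the tie-breaking by index order are assumptions for this example.

```python
import numpy as np

def remote_agents(ap_positions, num_remote):
    """For each AP, return the indices of its closest other APs, nearest first."""
    pos = np.asarray(ap_positions, dtype=float)
    # Pairwise Euclidean distances between all APs.
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # an AP is never its own remote agent
    order = np.argsort(dists, axis=1, kind="stable")
    return order[:, :num_remote]

# Hypothetical deployment of four APs on a line.
graph = remote_agents([[0, 0], [1, 0], [3, 0], [7, 0]], num_remote=2)
print(graph.tolist())
```

The resulting graph is directed: in this example the last AP lists the third AP as a remote agent, but not the other way around.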

Figure 5 illustrates how the local and remote observations are concatenated together at each agent, resulting in a fixed-length observation vector. As there are remote agents per agent, and UEs’ observations are included per agent, the length of the observation vector for each agent equals .

Remark 1

Note that the dimension of the observation space, and therefore the input size to the deep RL agent’s neural network, does not depend on the number of APs and/or UEs in the network. This makes our algorithm scalable regardless of the specific environment parameters.

Remark 2

When there are fewer than UEs associated with an AP, we set the corresponding values in the local/remote observations to default values, similar to a zero-padding operation. In particular, we use default values of 0 for weight and -60 dB for SINR.
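The construction of the fixed-length observation vector, including the padding of Remark 2, can be sketched as below. The padding values come from the text; the interleaved (weight, SINR) layout, the function names, and the example numbers are assumptions, and the actual ordering in Figure 5 may differ.

```python
import numpy as np

DEFAULT_WEIGHT = 0.0     # padding value for a missing UE's weight
DEFAULT_SINR_DB = -60.0  # padding value for a missing UE's SINR (dB)

def top_ue_features(weights, sinrs_db, pf_ratios, num_ues):
    """[w, sinr] pairs of the top UEs by PF ratio, padded to num_ues pairs."""
    order = np.argsort(-np.asarray(pf_ratios, dtype=float), kind="stable")
    order = order[:num_ues]
    feats = []
    for j in order:
        feats += [float(weights[j]), float(sinrs_db[j])]
    missing = num_ues - len(order)
    feats += [DEFAULT_WEIGHT, DEFAULT_SINR_DB] * missing
    return feats

def observation_vector(local, remotes, num_ues):
    """Concatenate local features with those of each remote agent, in order."""
    vec = top_ue_features(*local, num_ues)
    for remote in remotes:
        vec += top_ue_features(*remote, num_ues)
    return np.array(vec)

# Each agent is a (weights, sinrs_db, pf_ratios) triple; values are made up.
local = ([0.5, 2.0], [10.0, -5.0], [1.0, 3.0])
remote = ([1.0], [0.0], [0.7])
obs = observation_vector(local, [remote], num_ues=3)
print(len(obs))  # 2 features x 3 UEs x (1 local + 1 remote agent) = 12
```

Because the length depends only on the number of UEs kept per agent and the number of remote agents, the same network input size works for any deployment.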

Remark 3

The current LTE and future 5G cellular standards developed by 3GPP support periodic channel quality indicator (CQI) feedback reports from UEs to their serving base stations. Moreover, the weights can either be reported by the UEs, or the base stations may keep track of the long-term average rates of their associated UEs. The base stations can also exchange observations among each other through backhaul links, such as the X2 interface [21]. Therefore, our proposed observation structure is completely practical and may be readily implemented in current and future cellular networks.

Figure 5: The fixed-length observation vector structure at each agent, composed of local and remote observations. With a slight abuse of notation, we use to denote the top- selected UEs associated with the agent, and to denote the top- selected UEs associated with the remote agent, all sorted based on their PF ratios.

III-A2 Actions

As mentioned in Section II, at each scheduling interval, each AP needs to select a target UE from its user pool to serve, and a transmit power level to transmit data to the scheduled UE. To jointly optimize the user scheduling and power control decisions, we define a joint action space, where each action represents a (transmit power level, target UE) pair.

We quantize the range of positive transmit powers to (potentially non-uniform) discrete power levels. Moreover, because the number of UEs associated with an AP can be varying and/or potentially large, we take a similar approach to deal with this issue as in forming the observations: At each scheduling interval, we limit the choice of target UE to one of the top- UEs included in the local observations. This is also reasonable because the agent has information solely on those users in its local observation vector.

Given the above considerations, the number of possible actions for each agent at each scheduling interval is , where the additional action is one in which the agent remains silent for that scheduling interval and selects none of the top- UEs to serve. In the event that the AP has fewer than associated UEs and erroneously selects a target UE which does not exist, we map the action to the “off” action, indicating that the AP should not transmit.
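One possible encoding of this joint action space is sketched below. The text fixes only the number of actions and the fallback to the "off" action; the index convention (action 0 is "off", the remaining indices enumerate power levels first and UE ranks second) and all names are assumptions for illustration.

```python
def num_actions(num_power_levels, num_ues):
    """One action per (power level, UE) pair, plus one 'off' action."""
    return num_power_levels * num_ues + 1

def decode_action(action, power_levels, num_ues, num_associated):
    """Map an action index to (transmit_power, ue_rank), or None for 'off'.

    `ue_rank` is the position of the target UE among the top UEs in the
    local observation vector. Selecting a rank that does not exist
    (fewer associated UEs than ranks) falls back to 'off'.
    """
    if action == 0:
        return None
    power_idx, ue_rank = divmod(action - 1, num_ues)
    if ue_rank >= num_associated:
        return None
    return power_levels[power_idx], ue_rank

levels = [10.0, 20.0]  # hypothetical discrete power levels
print(num_actions(len(levels), 3))                    # 7
print(decode_action(5, levels, 3, num_associated=3))  # (20.0, 1)
print(decode_action(3, levels, 3, num_associated=2))  # None
```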

Remark 4

Note that, similar to the observation space, the action space dimension does not depend on the network size either. This allows us to have a robust agent architecture that can be trained in a specific environment and then deployed in a different environment, for instance one with a different number of APs and/or UEs. In Section IV-F, we show how well the agent performs in such mismatched scenarios.

III-A3 Rewards

As shown in Figure 3, we utilize a centralized reward based on the actions of all the agents at each scheduling interval. In particular, assuming that each has selected to serve at scheduling interval , the reward emitted to each of the agents is a weighted sum-rate reward, calculated as


where is the most recent reported weight measurement by available at , is the rate achieved by , and is a parameter which determines the tradeoff between and , the two metrics that we intend to optimize. Specifically, turns the reward to sum-rate, favoring cell-center UEs, while changes the reward to approximately the summation of the scheduled UEs’ PF ratios, hence appealing to cell-edge UEs.

There are, however, two exceptions to the reward emitted to the agents, which are as follows:

  • All agents off: Due to the distributed nature of decision making by the agents, it is possible that at a scheduling interval, all agents decide to remain silent. This is clearly a suboptimal joint action vector at any scheduling interval. Therefore, in this case, we penalize the agent, whose top user has the highest PF ratio among all UEs in the network, with the negative of that PF ratio, while the rest of the agents will receive a zero reward, according to (18).

  • Invalid user selected: As mentioned in Section III-A2, it might happen that an AP has fewer than associated UEs in its user pool, and selects an invalid UE to serve at a scheduling interval. In that case, the agent corresponding to that AP is penalized by receiving a zero reward regardless of the actual weighted sum-rate reward of the other agents as given in (18).
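The reward logic, including the two exceptions above, can be sketched as follows. The exact form of (18) is not reproduced in this text, so the sketch assumes a reward equal to the sum over active agents of the scheduled UE's weight raised to a tradeoff exponent, multiplied by its rate. This assumption is consistent with the stated limits: an exponent of 0 gives the sum-rate, and an exponent of 1 gives approximately the sum of PF ratios. All names are hypothetical.

```python
import numpy as np

def agent_rewards(weights, rates, is_off, is_invalid, top_pf, lam):
    """Per-agent rewards for one scheduling interval (illustrative only).

    weights, rates : weight and achieved rate of each agent's scheduled UE
    is_off         : True where the agent stays silent
    is_invalid     : True where the agent picked a non-existent UE
    top_pf         : highest PF ratio in each agent's user pool
    lam            : assumed tradeoff exponent applied to the weights
    """
    weights = np.asarray(weights, dtype=float)
    rates = np.asarray(rates, dtype=float)
    is_off = np.asarray(is_off, dtype=bool)
    is_invalid = np.asarray(is_invalid, dtype=bool)
    top_pf = np.asarray(top_pf, dtype=float)
    n = len(rates)
    if is_off.all():
        # All agents silent: penalize only the agent with the neediest UE.
        rewards = np.zeros(n)
        worst = int(np.argmax(top_pf))
        rewards[worst] = -top_pf[worst]
        return rewards
    active = ~is_off
    shared = float(np.sum(weights[active] ** lam * rates[active]))
    rewards = np.full(n, shared)   # the same centralized reward for everyone
    rewards[is_invalid] = 0.0      # except agents that chose an invalid UE
    return rewards

print(agent_rewards([2.0, 0.5], [1.0, 4.0], [False, False],
                    [False, False], [0.3, 0.9], lam=0.0))
```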

III-B Normalizing the Agents’ Observations and Rewards

It is widely known that normalizing DNN inputs and outputs has a significant impact on the network's training behavior and predictive performance [22]. Before training or testing our model in a specific environment, we first create a number of environment realizations in an offline fashion and run one or several simple baseline scheduling algorithms in those realizations. While doing so, we collect data on the observations and rewards of all the agents throughout all environment realizations and all scheduling intervals within each realization. We then leverage the resulting dataset in the following way to pre-process the observations and rewards before using them to train the agent’s DNN:

  • Observations: As mentioned in Section III-A1, we have two types of observations: weights and SINRs. For notational simplicity, we describe the normalization process for the weight observations; the process for normalizing SINR observations follows similarly.

    Considering the weight observations, we derive the empirical distribution of the observed weights in the aforementioned dataset. We then use the distribution to calculate multiple percentiles of the observed weights. In particular, we consider percentile values , denoted respectively by , as depicted in Figure 6 (for both weights and SINRs). Note that and are equal to the minimum and maximum weights observed in the dataset, respectively. Afterwards, we map each subsequent weight observation during training/inference before feeding to the neural network as


    The above mapping in effect applies a (shifted version of the) CDF of the weight observation to the observation itself, and the result is known to be uniformly distributed. Therefore, this mapping guarantees that the observations fed into the neural network will (approximately) follow a discrete uniform distribution over the set


  • Rewards: We follow a well-known standardization procedure for normalizing the rewards, where we use the dataset to estimate the mean and standard deviation of the reward, denoted by and , respectively. Each reward during training is then normalized as


    ensuring that the neural network outputs have (approximately) zero mean and unit variance.
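The two normalization steps above can be sketched as below. The exact mapping and output set are not reproduced in this text, so the sketch makes its own assumptions: percentile levels are evenly spaced between 0 and 100, and an observation is mapped to the index of the largest percentile value not exceeding it, rescaled and shifted to lie in [-0.5, 0.5]. The function names are hypothetical.

```python
import numpy as np

def fit_percentiles(samples, num_levels):
    """Percentile values of the offline dataset at evenly spaced levels."""
    levels = np.linspace(0.0, 100.0, num_levels)
    return np.percentile(samples, levels)

def normalize_observation(x, percentile_values):
    """Shifted empirical-CDF value of x, quantized to the percentile grid."""
    m = len(percentile_values) - 1
    idx = np.searchsorted(percentile_values, x, side="right") - 1
    idx = np.clip(idx, 0, m)  # values outside the dataset range are clamped
    return idx / m - 0.5

def normalize_reward(r, mean, std):
    """Standardize a reward with statistics estimated offline."""
    return (r - mean) / std

# Hypothetical offline dataset of weight observations.
pv = fit_percentiles(np.arange(101.0), num_levels=11)
print(normalize_observation(55.0, pv), normalize_reward(3.0, 1.0, 2.0))
```

Because the mapping depends only on the rank of an observation within the offline dataset, it is insensitive to the heavy tails visible in distributions such as those in Figure 6.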

Figure 6: Example empirical distributions of weight and SINR observations (blue) and their corresponding percentile values (orange) using percentile levels.

III-C Training and Validation Procedure

We consider an episodic training procedure, in which each episode represents a realization of the environment in which the locations of the APs and UEs and channel realizations are randomly selected following a set of probability distributions and constraints on minimum AP-AP and UE-AP distances. We control the density of APs and UEs in our environment by fixing the size of the deployment area and selecting different numbers of APs and UEs for different training sessions. Because the channels between APs and UEs depend heavily on their relative locations, each new episode allows the system to experience a potentially unexplored subset of the observation space. The associations between UEs and APs take place as in (1) as a new episode begins, and remain fixed for the duration of that episode. An episode consists of a fixed number of scheduling intervals, where at each interval, the agents decide on which user scheduling and power control actions the APs should take.

We further structure the training process into epochs, each of which consists of a fixed number of consecutive training episodes. At the completion of each epoch, we pause training in order to evaluate the current policy against a fixed set of validation environments carefully selected to be representative of all possible environments. We use a score metric , defined as


to quantify the performance after each epoch and select the best model during training as the one achieving the best performance in terms of .

Footnote 1: We have used the factor of 3 for the percentile rate in (21), because prior experience has shown that improving cell-edge performance is typically three times more challenging than enhancing cell-center performance. Hence, the score metric emphasizes the percentile rate three times more than the average rate of the UEs across the network.
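Model selection with such a score can be sketched as follows. The exact form of (21) is not reproduced in this text; based on the footnote, the sketch assumes the score is the average user rate plus three times the 5th percentile rate. The function names and the example rates are hypothetical.

```python
import numpy as np

def score(avg_rates):
    """Assumed score: mean user rate plus 3x the 5th percentile rate."""
    return float(np.mean(avg_rates) + 3.0 * np.percentile(avg_rates, 5))

def best_epoch(per_epoch_rates):
    """Index of the epoch whose validation rates give the highest score."""
    scores = [score(r) for r in per_epoch_rates]
    return int(np.argmax(scores)), scores

# Epoch 0 is fair but modest; epoch 1 has a higher mean but starves one UE.
epoch_rates = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 2.0, 4.0]]
print(best_epoch(epoch_rates))
```

In this example the fairer policy of epoch 0 wins despite its lower mean rate, which is the behavior the weighting is intended to encourage.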

IV Simulation Results

In this section, we first mention the details of the wireless system parameters. Next, we present the baseline algorithms that we use to compare the performance of our proposed method against. We then discuss our considered deep RL agents and their corresponding parameters. Finally, we proceed to present our simulation results.

IV-A Description of the Wireless Environment

We consider networks with APs and UEs, dropped randomly within a square area. We impose a minimum AP-AP distance of m and AP-UE distance of m. We consider a bandwidth of MHz, maximum AP transmit power of dBm, noise power spectral density of dBm/Hz, and episode length of scheduling intervals.

The communication channel between APs and UEs consists of three different components:

  • Path-loss: We consider a dual-slope path-loss model [23, 24], which states that the path-loss at distance equals

    where denotes the path-loss at distance m, denotes the break-point distance, and and denote the path-loss exponents before and after the break-point distance, respectively (). In this paper, we set dB, , , and m.

  • Shadowing: We assume that all the links experience log-normal shadowing with a standard deviation of dB.

  • Short-term fading: We use the sum of sinusoids (SoS) model [25] for short-term flat Rayleigh fading (with pedestrian node velocity of m/s) in order to model the dynamics of the communication channel over time.
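The dual-slope path-loss model in the list above can be sketched in its commonly used form, which is continuous at the break-point distance. The parameter values of the paper are not reproduced in this text, so the numbers in the example are placeholders, and the reference distance `d0` defaulting to 1 m is an assumption.

```python
import math

def dual_slope_path_loss_db(d, pl0_db, alpha1, alpha2, d_bp, d0=1.0):
    """Path loss in dB at distance d (same unit as d_bp and d0).

    pl0_db : path loss at the reference distance d0
    alpha1 : path-loss exponent up to the break-point distance d_bp
    alpha2 : path-loss exponent beyond d_bp (typically larger)
    """
    if d <= d_bp:
        return pl0_db + 10.0 * alpha1 * math.log10(d / d0)
    return (pl0_db
            + 10.0 * alpha1 * math.log10(d_bp / d0)
            + 10.0 * alpha2 * math.log10(d / d_bp))

# Placeholder parameters, chosen only to make the arithmetic easy to follow.
for dist in (10.0, 100.0, 1000.0):
    print(dist, dual_slope_path_loss_db(dist, 40.0, 2.0, 4.0, 100.0))
```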

As for the feedback reports, each UE is assumed to sample and send its measurements to its associated AP every scheduling intervals, and these reports arrive at the associated AP after a delay of scheduling intervals. Moreover, we assume a backhaul delay of scheduling intervals for observation exchange among the APs.

IV-B Baseline Algorithms

We compare the performance of our proposed scheduler against several baseline algorithms.

  • Full reuse: At each scheduling interval, each AP schedules the UE in its local user pool with the highest PF ratio (PF-based user scheduling), and serves it with full transmit power.

  • Time division multiplexing (TDM): The UEs are scheduled in a round-robin fashion. In particular, at scheduling interval , is scheduled to be served with full transmit power by its associated AP, while the rest of the APs remain silent.

  • Information-theoretic link scheduling (ITLinQ) [4, 20]: This is a centralized binary power control algorithm, in which UEs are first selected using PF-based scheduling, and then the AP-UE pairs are sorted in the descending order of the selected UEs’ PF ratios. The AP whose selected UE has the highest PF ratio is scheduled to transmit with full power, and going down the ordered list, each is also scheduled to transmit with full power to its selected if and only if


    where and are design parameters. Otherwise, the AP will remain silent for that scheduling interval. The condition in (22) is inspired by the information-theoretic condition for the optimality of treating interference as noise [26], ensuring that the interference-to-noise ratios (INRs), both caused by the candidate AP at already-scheduled UEs and received by its selected UE from already-scheduled APs, are “weak enough” compared to the signal-to-noise ratio (SNR) between the candidate AP and its selected UE. For our simulations, we consider and .
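The two decentralized baselines can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical, the PF ratios are taken as given inputs, and the round-robin index `t mod n_ues` is an assumed way of cycling through the UEs. ITLinQ is omitted because it depends on the condition in (22).

```python
import numpy as np


def full_reuse(pf_ratio, assoc, n_aps):
    """Each AP serves its associated UE with the highest PF ratio at full power.

    pf_ratio: shape (n_ues,), PF ratio of every UE (taken as given).
    assoc:    shape (n_ues,), index of the AP each UE is associated with.
    Returns the scheduled UE index per AP (None if the AP has no UE).
    """
    schedule = []
    for ap in range(n_aps):
        pool = np.flatnonzero(assoc == ap)  # local user pool of this AP
        schedule.append(int(pool[np.argmax(pf_ratio[pool])]) if pool.size else None)
    return schedule


def tdm(t, assoc, n_aps):
    """Round-robin: one UE is served per interval; all other APs stay silent."""
    ue = t % len(assoc)  # assumed round-robin indexing
    schedule = [None] * n_aps
    schedule[int(assoc[ue])] = ue
    return schedule


pf = np.array([0.2, 0.9, 0.5, 0.1])
assoc = np.array([0, 0, 1, 1])
reuse_schedule = full_reuse(pf, assoc, n_aps=2)  # UE 1 on AP 0, UE 2 on AP 1
tdm_schedule = tdm(5, assoc, n_aps=2)            # only UE 1 (on AP 0) is served
```

Full reuse keeps every AP active and therefore maximizes concurrency at the cost of interference, whereas TDM removes interference entirely but serves a single UE per interval.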

IV-C Deep RL Agents and their Hyperparameters

We consider the following two different types of agents:

  • Double Deep-Q Network (DQN) [7, 27]:

    We consider a double DQN agent, which is a value-based model-free deep RL method, with a 2-layer fully-connected DNN, 128 neurons per layer, and tanh activation function. We create experiences using a set of 4 parallel environments, and save them in an experience buffer of size 25,000 samples. We use the Adam optimizer [28] to perform a round of training on the main DQN every 100 scheduling intervals, using a batch of samples, consisting of a set of 1024 scheduling intervals, each containing concurrent experiences for all the agents at that scheduling interval [29]. We update the target DQN every 10,000 scheduling intervals by replacing its parameters with those of the main DQN. We initialize the learning rate at 0.01 and decay it by half every 5,000 training iterations. We consider a set of pre-training episodes, in which we fill the experience buffer by having the agents take completely random actions. Afterwards, we use an ε-greedy policy, with the probability of random actions decaying from 100% to 1% over 25 training episodes. We use a discount factor of for the agent to consider the impact of its actions on the subsequent rewards in next scheduling intervals. In order to improve the generalization capabilities of our agent, we use regularization on the DQN weights with a coefficient of 0.001 [30].

  • Advantage Actor-Critic (A2C) [31]: We use the OpenAI baselines [32] implementation of an A2C agent, which is a policy-based model-free deep RL method and a synchronous version of the asynchronous advantage actor critic (A3C) agent [33], with a 2-layer fully-connected DNN, 128 neurons per layer, and tanh activation function. We consider a set of 10 parallel training environments. We use the RMSProp optimizer (with parameters and ) to perform a round of training using trajectories of length 100 scheduling intervals, each collected from one of the parallel environments. We initialize the learning rate at and cut it in half every 12,000 training iterations. We use gradient clipping with a maximum magnitude of 1. The loss function consists of the policy loss with a coefficient of 1, value function loss with a coefficient of 1, and an entropy regularization term with a coefficient of 0.05. Similar to the DQN agent, we use regularization on the A2C neural network weights with a coefficient of 0.001. Moreover, the reward discount factor is set to .
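For the DQN agent, the learning-rate and exploration schedules and the double-DQN target of [27] can be written compactly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the exploration probability is assumed to decay linearly (the shape of the decay is not specified above), the discount factor `gamma` is left as an argument, and terminal-state handling is ignored.

```python
import numpy as np


def dqn_learning_rate(iteration, lr0=0.01, decay_every=5000):
    """Step schedule: start at lr0 and halve every `decay_every` training iterations."""
    return lr0 * 0.5 ** (iteration // decay_every)


def exploration_prob(episode, eps_start=1.0, eps_end=0.01, decay_episodes=25):
    """Probability of a random action; linear decay is an assumption."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def double_dqn_target(reward, q_main_next, q_target_next, gamma):
    """Double-DQN target: the main network picks the next action,
    the target network evaluates it.

    reward:        shape (batch,)
    q_main_next:   shape (batch, n_actions), main-network Q-values at the next state
    q_target_next: shape (batch, n_actions), target-network Q-values at the next state
    """
    best = np.argmax(q_main_next, axis=-1)
    evaluated = np.take_along_axis(q_target_next, best[..., None], axis=-1)[..., 0]
    return reward + gamma * evaluated
```

Decoupling action selection from action evaluation in this way is what distinguishes double DQN from the original DQN, and it is intended to reduce the over-estimation of Q-values.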

For validation purposes during training, we create a set of 50 validation environments, whose average and percentile rates are within a relative error of those achieved over 1000 random environment realizations by both full reuse and TDM baseline algorithms. We define an epoch as a group of 10 consecutive episodes, and we run the training for 200 epochs, or equivalently, 2000 episodes, amounting to a total of 16 million and 40 million training scheduling intervals for DQN and A2C, respectively (due to different numbers of parallel environments). Once training is complete, we test the resulting models across another randomly-generated set of 1000 environment realizations.
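The totals quoted above can be cross-checked with a short calculation. The sketch assumes that the episode count applies to each parallel environment; the implied episode length is then an inference from the quoted numbers, not a value stated in this section.

```python
episodes = 10 * 200                              # 10 episodes per epoch, 200 epochs
dqn_envs, a2c_envs = 4, 10                       # parallel environments per agent type
dqn_total, a2c_total = 16_000_000, 40_000_000    # totals quoted in the text

# Implied number of scheduling intervals per episode in each environment.
dqn_episode_len = dqn_total // (episodes * dqn_envs)
a2c_episode_len = a2c_total // (episodes * a2c_envs)
```

Both agent types yield the same implied episode length of 2,000 scheduling intervals, which is consistent with the difference in totals being due solely to the number of parallel environments.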

As for the observations, we consider each agent to include weights and SINRs from UEs having the highest PF ratios, alongside receiving remote observations from a set of remote agents. This implies that the size of the observation vector of each agent at each scheduling interval, hence the size of the input layer of each agent’s neural network, is equal to . We consider the moving average parameters for the long-term average rate and interference at the UEs to be and , respectively. We also consider a set of percentile levels for mapping and normalizing both weight and SINR observations, calculated using an offline dataset generated by both full reuse and TDM baseline algorithms.
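One way to realize such a percentile-based mapping is sketched below. It is an illustration under assumptions rather than the exact procedure used here: the function names are hypothetical, the percentile levels are assumed to be evenly spaced, and the mapped value is assumed to be the bin index scaled to the interval [0, 1].

```python
import numpy as np


def fit_percentile_edges(offline_samples, n_levels):
    """Bin edges at evenly spaced percentiles of an offline dataset."""
    qs = np.linspace(0.0, 100.0, n_levels + 1)[1:-1]  # interior percentiles only
    return np.percentile(offline_samples, qs)


def map_observation(x, edges):
    """Replace raw values by their percentile bin, scaled to [0, 1]."""
    bins = np.searchsorted(edges, x, side="right")    # integers in [0, len(edges)]
    return bins / len(edges)


# Synthetic stand-in for an offline dataset of weights or SINRs.
offline = np.arange(1000.0)
edges = fit_percentile_edges(offline, n_levels=20)
mapped = map_observation(np.array([-5.0, 499.0, 2000.0]), edges)
```

Because the mapping depends only on the rank of a value within the offline distribution, heavy-tailed quantities such as SINR are compressed into a bounded range before they reach the neural network.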

We consider a binary power control policy, where an AP at any given scheduling interval is either off, or serves a UE with full transmit power. This implies that the total number of actions, and therefore the size of the output layer of each agent’s neural network, equals . Moreover, for the reward function as defined in (18), we consider , which helps strike the right tradeoff between the average and percentile rates as we will show next.
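Under binary power control, each discrete action maps to a (UE, power) pair. The decoder below is a minimal sketch: the function name is hypothetical, and the convention that action 0 means "off" and action k serves the k-th ranked UE follows the action labels used later in Section V-A.

```python
def decode_action(action, top_ues, p_max):
    """Map a discrete action to (served UE, transmit power).

    action 0      -> the AP stays off
    action k >= 1 -> serve the k-th UE in `top_ues` at full power
    """
    if action == 0:
        return None, 0.0
    return top_ues[action - 1], p_max


top_ues = [7, 2, 5]             # hypothetical UE indices, sorted by PF ratio
n_actions = len(top_ues) + 1    # one "off" action plus one action per observed UE
served, power = decode_action(2, top_ues, p_max=1.0)
```

The output layer size therefore grows with the number of observed UEs, not with the number of UEs in the network.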

For each type of environment configuration, we train 5 models, utilizing different random number generator seeds. In the following sections, we report the mean of the results across the 5 trained models, with the shaded regions around the curves illustrating the standard deviation.

Remark 5

We analyze how many remote agents’ observations should be included in each agent’s observations, where as mentioned in Section III-A1, the remote agents are selected based on physical proximity. Figure 7 shows the variation in the mean long-term UE SINR in networks with APs and 100 UEs, where for each configuration of APs, the interference includes the contribution from remote agents physically-closest to the serving AP.

Figure 7: Impact of the number of strongest interferers included in the interference calculation on the average long-term SINR of UEs in networks with APs and UEs.

As the figure demonstrates, by far the largest reduction in SINR occurs when the closest AP transmits. Moreover, the curves flatten out as the number of included interference terms increases, indicating that interference from farther APs is less consequential and may be safely omitted from the observation space. This justifies why including observations from remote agents is a reasonable choice.

IV-D Validation Performance during Training

We first demonstrate how the behavior of the model evolves as training proceeds. Figure 8 illustrates the evolution of validation sum-rate, percentile rate, and the score metric , as defined in (21), when training on environments with APs and UEs.

Figure 8: The sum-rates, percentile rates, and scores achieved by the DQN and A2C agents alongside baseline algorithms over the 50 validation environments, when training on environments with 4 APs and 24 UEs.

As the plots in Figure 8 show, both DQN and A2C agents initially favor sum-rate performance, while suffering in terms of the percentile rate. As training proceeds, the agents learn a better balance between the two metrics, trading off sum-rate for improvements in terms of percentile rate. A2C achieves a better sum-rate, while DQN achieves better coverage and also a better score, outperforming the centralized ITLinQ approach after only 12 epochs. Moreover, DQN converges faster than A2C, due to better sample efficiency thanks to the experience buffer.

As mentioned in Section III-C, for each training run, we select the model at the epoch which yields the highest score level. We can then use the resulting model to conduct final, large-scale, test evaluations upon the completion of training, as we will show next.

IV-E Final Test Performance with Similar Train and Test Configurations

In this section, we present the final test results for models tested on the same configuration as the one in their training environment. Figure 9 demonstrates the achievable sum-rate and percentile rate for the environments with 4 APs and varying numbers of UEs. As the plots show, our proposed deep RL methods significantly outperform TDM in both sum-rate and percentile rate, and they also provide considerable percentile rate gains over full reuse. Our reward design helps the agents achieve a balance between sum-rate and percentile rate, helping the DQN agent attain percentile rate values which are on par with ITLinQ for large numbers of UEs (32-40), while outperforming it for smaller numbers of UEs (16-24). The A2C agent, on the other hand, performs consistently well in terms of the sum-rate, approaching ITLinQ as the number of users increases across the network.

Figure 9: Test results on the sum-rate and percentile rate from models trained on environments with 4 APs and various numbers of UEs, where the model trained on each configuration was deployed on the same configuration during the test phase.

In Figure 10, we plot the sum-rate and percentile rate for the configurations with 40 UEs and different numbers of APs. As the figure shows, the relative trends are similar to the previous case in terms of sum-rate, with A2C outperforming ITLinQ for networks with 8 APs. In terms of the percentile rate, both agents outperform TDM and full reuse, but their performance falls below that of the centralized ITLinQ approach as the number of APs, and equivalently the number of agents, gets larger.

Figure 10: Test results on the sum-rate and percentile rate from models trained on environments with 40 UEs and various numbers of APs, where the model trained on each configuration was deployed on the same configuration during the test phase.

IV-F Final Test Performance with Discrepant Train and Test Configurations

As mentioned before, our design of the observation and action spaces is such that they have a fixed size regardless of the actual training configuration. We test the robustness of our models with respect to network density by testing policies trained on one density deployed in environments of other densities. To reduce clutter, in the following, we only plot the average results over the 5 seeds and remove the shaded regions representing the standard deviation of the results.

We first test models trained on environments with APs and different numbers of UEs against each other, and plot the results in Figure 11. We observe that all DQN and A2C agents are robust in terms of both metrics. Interestingly, the A2C model trained on the case with 16 UEs has a much better performance (especially in terms of sum-rate) than its counterpart DQN model as the number of UEs in the test deployment increases. For models with 40 UEs, however, DQN tends to perform better, especially in terms of the percentile rate.

Figure 11: Test results on the sum-rate and percentile rate from models trained on environments with 4 APs and various numbers of UEs, where the model trained on each configuration was deployed on all the other configurations as well. The first and second element of each tuple in the legends represent numbers of APs and UEs in the training environment, respectively.

Next, we cross-test the models trained on environments with 40 UEs and different numbers of APs against each other. Figure 12 shows the sum-rates and percentile rates achieved by these models. All the models exhibit fairly robust behaviors with the exception of the DQN model trained on configurations with 4 APs, whose percentile rate performance deteriorates for higher numbers of APs. Note that in this case, the number of agents changes across different scenarios, and we observe that in general, training with more agents leads to more capable models, which can still perform well when deployed in sparser scenarios, while training with few agents may not scale well as the number of agents increases.

Figure 12: Test results on the sum-rate and percentile rate from models trained on environments with 40 UEs and various numbers of APs, where the model trained on each configuration was deployed on all the other configurations as well. The first and second element of each tuple in the legends represent numbers of APs and UEs in the training environment, respectively.

Remark 6

We have also tested our trained models with observations mapped using 20 percentile levels on test environments in which the observations were mapped using different numbers of percentile levels. For the scenario with APs and UEs, we observed that using 10-100 percentile levels during the test phase achieves results very similar to (within of) the ones obtained using 20 percentile levels. This shows that our proposed approach is very robust to the granularity of mapping the observations fed into the agent’s neural network.

IV-G Interpreting the Agent’s Decisions

In this section, we attempt to interpret our trained agent’s decisions during the test phase. In particular, we collect data on the inputs and outputs of a DQN agent, trained on a network with APs and UEs and tested on the same configuration. Using this data, we will try to visualize the agent’s actions in different situations.

Figure 13 shows a scatter plot of the SINR and weight of the agent’s “top UE,” i.e., the UE in the AP’s user pool with the highest PF ratio.

Figure 13: Weight vs. SINR scatter plot of the top UE of a DQN agent trained and tested on networks with APs and UEs. Red (resp., green) points represent the scenarios where the agent decided to stay silent (resp., serve one of its top-3 UEs).

The red points illustrate the cases where the agent decided to remain silent, while the green points represent the cases in which the agent served one of its top-3 UEs. As expected, higher weights and/or higher SINRs lead to a higher chance of the AP being active. Quite interestingly, the boundary between the green and red regions can be approximately characterized as , which is effectively a linear boundary on the PF ratio; i.e., the agent decides to be active if and only if the PF ratio of its top UE is above some threshold that it has learned based on its interactions with the environment.

Given that the PF ratio is a reasonable indicator of the status of each UE, Figure 14 compares the PF ratios of the top-3 UEs included in the agent’s observation and action spaces in the cases where the agent decided to serve one of those UEs.

Figure 14: Comparison of the PF ratios of the top-3 UEs observed by a DQN agent trained and tested on networks with APs and UEs for the cases in which the agent decided to serve one of those UEs.

As the figure shows, the agent’s user scheduling decision heavily depends on the relative difference between the PF ratios of the top-3 UEs. In general, the second and third UEs have some chance of being scheduled if they have a PF ratio close to that of the top UE. However, this chance is significantly reduced for the third UE, as highlighted by the regions corresponding to different user scheduling actions.

Moreover, Figure 15 shows the impact of remote observations on the agent’s power control decisions.

Figure 15: Comparison of the PF ratios of the DQN agent’s top UE and the remote agents’ top UEs for an agent trained and tested on networks with APs and UEs. Red (resp., green) points represent the scenarios where the agent decided to remain silent (resp., serve its top UE).

In particular, the figure demonstrates the cases where the agent either remains silent (red points), or decides to serve its top UE (green points). We observe that the agent learns a non-linear decision boundary between the PF ratio of its top UE and the PF ratios of the top UE of each remote agent. Notably, the green region becomes larger as we go from the left plot to the right plot. This implies that the agent “respects” the PF ratio of the top UE of its closest remote agent more as compared to the second and third closest remote agents, since the interference between them tends to be stronger, and hence their actions impact each other more significantly.

V Discussion

In this section, we discuss some of the implications of our proposed framework in more detail and provide ideas for future research on how to improve upon the current work.

V-A Analysis on the Number of Observable UEs by the Agent

As described in Section III, we bound the dimension of the agent’s observation and action space by selecting a finite number of UEs, whose observations are included in the agent’s observation vector. We select these UEs by sorting the user pool of each AP according to their PF ratios. In this section, we shed light on the tradeoffs implied by such a “user filtering” method.
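The user filtering step can be sketched as a top-k selection over the PF ratios of an AP's user pool. This is a minimal illustration: the function name is hypothetical, and padding with -1 when an AP has fewer than k associated UEs is an assumption made only to keep the output size fixed.

```python
import numpy as np


def select_top_ues(pf_ratio, k=3):
    """Indices of the k UEs with the highest PF ratio, best first.

    Pads with -1 when fewer than k UEs are available (padding is an assumption).
    """
    order = np.argsort(-np.asarray(pf_ratio, dtype=float), kind="stable")[:k]
    padding = np.full(k - order.size, -1, dtype=order.dtype)
    return np.concatenate([order, padding])


top3 = select_top_ues([0.1, 0.7, 0.3, 0.9])   # pool of four UEs
short_pool = select_top_ues([0.4, 0.8])       # pool smaller than k
```

Because the ranking is recomputed at every scheduling interval, a given position in the observation vector may correspond to different UEs over time, which is the tradeoff examined in the remainder of this section.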

We first analyze the actions taken by the DQN agent when trained and tested in a network with APs and UEs. Recall that the agent can either take an “off” action, or decide to serve one of its top-3 UEs. We observe in Figure 16(a) that the algorithm selects action 0 (no transmission) or 1 (the top UE) most of the time and rarely selects the other two UEs. In this formulation of the algorithm, including information from more than 3 UEs would most likely not improve the performance. This seems logical given that the PF ratio represents the short-term ability of the UE to achieve a high rate (represented by the SINR term) along with the long-term demand to be scheduled for the sake of fairness (represented by the weight term).

Figure 16: Distribution of the DQN agent’s actions in a network with APs and UEs, where (a) the agent observes the top-3 UEs sorted based on their PF ratios, and (b) each AP is restricted to have exactly 6 associated UEs, and the agent is able to observe all 6 UEs in an unsorted manner. For the latter case, the UE indices on the x-axis represent the rank of the selected UE at the corresponding scheduling interval according to the UEs’ PF ratios, even though such a ranking was not used for sorting the UEs in the observation and action spaces.

Given our goal of having a scalable agent that can be employed by any AP having an arbitrary number of associated users, it is not feasible to have an agent which can observe all its UEs at every scheduling interval in a general network configuration. However, we conducted a controlled experiment, where we restricted the environment realizations to the ones in which all APs have a constant number of associated UEs. In particular, we considered a configuration with APs and UEs, where each AP has exactly 6 UEs associated with it. In such a scenario, we are indeed able to design an agent which can observe the state of all of its associated UEs, as well as all UEs associated with its remote agents. Furthermore, we did not sort the UEs of each agent according to their PF ratios. This ensures that each input port to the agent’s neural network contains an observation from the same UE over time. Note that the size of the observation vector is now and the number of actions equals .

After training and testing a DQN agent on the above scenario, we observed that the resulting sum-rate and percentile rate were within 6% of the original model using the sorted top-3 UEs. This demonstrates that the algorithm is indeed able to learn user scheduling without the “aid” of sorting the UEs by PF ratio. Moreover, Figure 16(b) demonstrates the percentage of time that the model with observations from all (unsorted) UEs selects the “off” action and each of the UEs. For the purposes of this figure, we sorted the UEs by their PF ratios, so the percentage of time that the algorithm selects the top UE represents the percentage of time that the agent selects the UE with the top PF ratio, regardless of its position in the observation/action space. We see that the algorithm learns to select the UE with the highest PF ratio most often, but interestingly, the distribution among the various UEs is slightly more even as compared to Figure 16(a). This implies that by letting the agent observe all UEs in an arbitrary order, its resulting user scheduling behavior is similar to a PF-based scheduler, but not exactly the same. Further interpretation of this result is left for future study.

V-B Enhanced Training Procedure

One of the unique challenges in training agents to perform scheduling tasks in a multi-node wireless environment is that we can only simulate individual snapshots of the environment and explore a limited subset of the state space. In our training procedure, we fixed the parameters governing our environment (size of the deployment area, number of APs and UEs, minimum distance constraints, maximum transmit powers, etc.) and selected AP and UE locations randomly at each environment reset. We utilized multiple parallel environments to ensure that training batches contained experiences from different snapshots. This approach, along with a carefully-selected learning rate schedule and a sufficiently long training period, allowed our agents to achieve the performance that we reported in Section IV.

An interesting question remains whether a more deliberate selection of training environments can accelerate training and/or result in better-performing agents. The fact that we can simulate only a limited number of environment realizations at a time provides an opportunity to guide the exploration and learning process.

One possible direction is to systematically control the density of the environments experienced during training, by systematically varying the parameters governing the environment. Possible strategies could be to move from less dense to more dense deployments or vice versa, or to more carefully select training batches to always include experiences from a range of densities.
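As a purely illustrative sketch of this idea, the generator below produces the number of UEs to use in each training episode under three hypothetical strategies. The function name, the `mode` options, and the equal split of episodes across stages are all assumptions; the default UE counts are the ones used in the experiments with 4 APs in Section IV.

```python
def density_schedule(n_episodes, ue_counts=(16, 24, 32, 40), mode="ascending"):
    """Yield the number of UEs to simulate in each training episode.

    mode="ascending":  move from less dense to more dense deployments
    mode="descending": move from more dense to less dense deployments
    mode="mixed":      cycle through all densities so nearby episodes differ
    """
    counts = sorted(ue_counts, reverse=(mode == "descending"))
    for ep in range(n_episodes):
        if mode == "mixed":
            yield counts[ep % len(counts)]
        else:
            # Split the training run into equally long stages, one per density.
            stage = min(ep * len(counts) // n_episodes, len(counts) - 1)
            yield counts[stage]


ascending = list(density_schedule(8))
mixed = list(density_schedule(4, mode="mixed"))
```

Whether any of these orderings actually accelerates training or improves the final policy is an open question that would need to be evaluated experimentally.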

Another possible direction is to control the complexity of the environments on which the agent is being trained. Intuitively, some environment realizations are easy and some are hard. For example, an environment in which all UEs are very close to their associated AP, i.e., cell-center scenario, is easy because the optimal policy is for the APs to transmit at every scheduling interval. An environment where all UEs are clustered around the borders between the coverage areas of adjacent APs, i.e., cell-edge scenario, is slightly more difficult because the optimal policy is for neighboring APs to coordinate not to transmit at the same time. Environments in which users are distributed throughout the coverage areas of the APs are, however, much more complex because the optimal user scheduling and power control choices become completely non-trivial. Various approaches to curriculum learning could potentially be applied [34, 35]. The main difficulty with this approach is determining a more granular measure of environment complexity and a procedure for generating environment realizations that exhibit the desired difficulty levels.

V-C Capturing Temporal Dynamics

One of the main challenges faced by the scheduler is dealing with delayed observations available to the agent. We have shown that our agents are able to successfully cope with this problem, but an interesting question is whether we can include recurrent architectures in the agent’s neural network to learn, predict, and leverage network dynamics based on the sequence of observations it receives over the course of an episode.

Two approaches are possible. The first is to include recurrent neural network (RNN) elements, such as long short-term memory (LSTM) [36], or attention mechanisms such as Transformers [37], at the inputs, with the goal of predicting the actual current observations based on the past history of delayed observations. A second approach would be to place the RNN/attention elements at the output similar to [38, 39]. One thing to note here is that in order for the system to learn the temporal dynamics of the environment, it must be exposed to the observations of each individual UE over time. This means that the approach of including observations from the top-3 UEs, sorted by their PF ratios, will likely not work and observations from all UEs must be included in an unsorted order. This is challenging due to the variable number of UEs associated with each AP, and we leave this for future work.

VI Concluding Remarks

We introduced a distributed multi-agent deep RL framework for performing joint user selection and power control decisions in a dense multi-AP wireless network. Our framework takes into account real-world measurement, feedback, and backhaul communication delays and is scalable with respect to the size and density of the wireless network. We show, through simulation results, that our approach, despite being executed in a distributed fashion, achieves a tradeoff between sum-rate and percentile rate, which is close to that of a centralized information-theoretic scheduling algorithm. We also show that our algorithm is robust to variations in the density of the wireless network and maintains its performance gains even if the number of APs and/or UEs changes in the network during deployment.


  • [1] A. Gjendemsjø, D. Gesbert, G. E. Øien, and S. G. Kiani, “Binary power control for sum rate maximization over multiple interfering links,” IEEE Transactions on Wireless Communications, vol. 7, no. 8, pp. 3164–3173, 2008.
  • [2] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, 2011.
  • [3] L. Song, D. Niyato, Z. Han, and E. Hossain, “Game-theoretic resource allocation methods for device-to-device communication,” IEEE Wireless Communications, vol. 21, no. 3, pp. 136–144, 2014.
  • [4] N. Naderializadeh and A. S. Avestimehr, “ITLinQ: A new approach for spectrum sharing in device-to-device communication systems,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1139–1151, 2014.
  • [5] X. Yi and G. Caire, “ITLinQ+: An improved spectrum sharing mechanism for device-to-device communications,” in 2015 49th Asilomar Conference on Signals, Systems and Computers.   IEEE, 2015, pp. 1310–1314.
  • [6] K. Shen and W. Yu, “FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications,” in 2017 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2017, pp. 2323–2327.
  • [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [8] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move evaluation in Go using deep convolutional neural networks,” arXiv preprint arXiv:1412.6564, 2014.
  • [9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
  • [10] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
  • [11] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, “Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,” IEEE Access, vol. 6, pp. 25 463–25 473, 2018.
  • [12] A. Tondwalkar and A. Kwasinski, “Deep reinforcement learning for distributed uncoordinated cognitive radios resource allocation,” arXiv preprint arXiv:1911.03366, 2019.
  • [13] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2282–2292, 2019.
  • [14] Y. Hua, R. Li, Z. Zhao, X. Chen, and H. Zhang, “GAN-powered deep distributional reinforcement learning for resource management in network slicing,” IEEE Journal on Selected Areas in Communications, 2019.
  • [15] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications (ICC).   IEEE, 2017, pp. 1–7.
  • [16] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep Q learning approach,” in ICC 2019-2019 IEEE International Conference on Communications (ICC).   IEEE, 2019.
  • [17] K. I. Ahmed and E. Hossain, “A deep Q-learning method for downlink power allocation in multi-cell networks,” arXiv preprint arXiv:1904.13032, 2019.
  • [18] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239–2250, 2019.
  • [19] G. Zhao, Y. Li, C. Xu, Z. Han, Y. Xing, and S. Yu, “Joint power control and channel allocation for interference mitigation based on reinforcement learning,” IEEE Access, vol. 7, pp. 177 254–177 265, 2019.
  • [20] N. Naderializadeh, O. Orhan, H. Nikopour, and S. Talwar, “Ultra-dense networks in 5G: Interference management via non-orthogonal multiple access and treating interference as noise,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall).   IEEE, 2017, pp. 1–6.
  • [21] G. Nardini, A. Virdis, and G. Stea, “Modeling X2 backhauling for LTE-advanced and assessing its effect on CoMP coordinated scheduling,” in 2016 1st International Workshop on Link-and System Level Simulations (IWSLS).   IEEE, 2016, pp. 1–6.
  • [22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade.   Springer, 2012, pp. 9–48.
  • [23] J. G. Andrews, X. Zhang, G. D. Durgin, and A. K. Gupta, “Are we approaching the fundamental limits of wireless network densification?” IEEE Communications Magazine, vol. 54, no. 10, pp. 184–190, 2016.
  • [24] X. Zhang and J. G. Andrews, “Downlink cellular network analysis with multi-slope path loss models,” IEEE Transactions on Communications, vol. 63, no. 5, pp. 1881–1894, 2015.
  • [25] Y. Li and X. Huang, “The simulation of independent Rayleigh faders,” IEEE Transactions on Communications, vol. 50, no. 9, pp. 1503–1514, 2002.
  • [26] C. Geng, N. Naderializadeh, A. S. Avestimehr, and S. A. Jafar, “On the optimality of treating interference as noise,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1753–1767, 2015.
  • [27] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [29] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017, pp. 2681–2690.
  • [30] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, “Quantifying generalization in reinforcement learning,” arXiv preprint arXiv:1812.02341, 2018.
  • [31] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.
  • [32] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “OpenAI baselines,” 2017.
  • [33] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
  • [34] G. Hacohen and D. Weinshall, “On the power of curriculum learning in training deep networks,” arXiv preprint arXiv:1904.03626, 2019.
  • [35] D. Seita, D. Chan, R. Rao, C. Tang, M. Zhao, and J. Canny, “ZPD teaching strategies for deep reinforcement learning from demonstrations,” arXiv preprint arXiv:1910.12154, 2019.
  • [36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [38] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. G. Bellemare, and R. Munos, “The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning,” in Sixth International Conference on Learning Representations (ICLR), 2018.
  • [39] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in Seventh International Conference on Learning Representations (ICLR), 2019.