I Introduction
In emerging and future wireless networks, inter-cell interference management is one of the key technological challenges as access points (APs) become denser to meet ever-increasing capacity demands. A transmitter may increase its transmit power to improve its own data rate, but at the same time it may degrade the links it interferes with. Transmit power control has been implemented since the first-generation cellular networks [chiang2007power]. Our goal here is to maximize an arbitrary weighted sum-rate objective, which includes maximum sum-rate and proportionally fair scheduling as special cases.
A number of centralized and distributed optimization techniques have been used to develop algorithms for reaching a suboptimal power allocation [shi2011wmmse, shen2018fractional, illsoo2015distributedpower, chiang2007power, huang2006distributedpower, kiani2007distributed, zhang2011proportionally]. We select two state-of-the-art algorithms as benchmarks: the weighted minimum mean square error (WMMSE) algorithm [shi2011wmmse] and an iterative algorithm based on fractional programming (FP) [shen2018fractional]. In their generic form, both algorithms require full up-to-date cross-cell channel state information (CSI). To the best of our knowledge, this work is the first to apply deep reinforcement learning to power control [nasir2018deep]. Sun et al. [sun2017learning] proposed a centralized supervised learning approach to train a fast deep neural network (DNN) that achieves 90% or higher of the sum-rate achieved by the WMMSE algorithm. However, this approach still requires acquiring the full CSI. Another issue is that training the DNN depends on a massive dataset of the WMMSE algorithm's output for randomly generated CSI matrices. Such a dataset takes a significant amount of time to produce due to WMMSE's computational complexity. As the network gets larger, the total number of the DNN's input and output ports also increases, which raises questions about the scalability of the centralized solution of [sun2017learning]. Furthermore, the success of supervised learning is highly dependent on the accuracy of the system model underlying the computed training data, which requires a new set of training data every time the system model or key parameters change.
In this work, we design a distributively executed algorithm to be employed by all transmitters to compute their best power allocation in real time. A dynamic power allocation problem with time-varying channel conditions, for a different system model and network setup, was studied in [neely2005dynamicpowercontrol], where the delay performance of the classical dynamic backpressure algorithm was improved by exploiting the stochastic Lyapunov optimization framework.
The main contributions in this paper and some advantages of the proposed scheme are summarized as follows.

The proposed algorithm is one of the first power allocation schemes to use deep reinforcement learning in the literature. In particular, the distributively executed algorithm is based on deep Q-learning [mnih2015human], which is model-free and robust to unpredictable changes in the wireless environment.

The complexity of the distributively executed algorithm does not depend on the network size. In particular, the proposed algorithm is computationally scalable to networks that cover arbitrarily large geographical areas if the number of links per unit area remains upper bounded by the same constant everywhere.

The proposed algorithm learns a policy that guides all links to adjust their power levels under important practical constraints such as delayed information exchange and incomplete cross-link CSI.

Unlike the supervised learning approach of [sun2017learning], there is no need to run an existing near-optimal algorithm to produce a large amount of training data. Instead, we use a centralized network trainer that gathers local observations from all network agents. This approach is computationally efficient and robust. In fact, a pre-trained neural network can achieve performance comparable to that of the centralized optimization-based algorithms.

We compare the reinforcement learning outcomes with state-of-the-art optimization-based algorithms. We also show the scalability and the robustness of the proposed algorithm using simulations. In the simulations, we model channel variations using the Jakes fading model [liang2017delayedCSI]. In certain scenarios the proposed distributed algorithm even outperforms the centralized iterative algorithms introduced in [shi2011wmmse, shen2018fractional]. We also address some important practical constraints that are not considered in [shi2011wmmse, shen2018fractional].
The deep reinforcement learning framework has been used in some other wireless communication problems [luong2018deepRLsurvey, ye2017deep, yu2017deep, zhao2018slicing]. Classical Q-learning techniques have been applied to the power allocation problem in [bennis2010qlearning, simsek2011qtable, amiri2018mlpower, ghadimi2017dynamicpower, calabrese2017learning]. The goal in [bennis2010qlearning, simsek2011qtable] is to reduce the interference in LTE femtocells. Unlike the deep Q-learning algorithm, the classical algorithm builds a lookup table to represent the value of state-action pairs, so [bennis2010qlearning] and [simsek2011qtable] represent the wireless environment using a discrete state set and limit the number of learning agents. Amiri et al. [amiri2018mlpower] have used cooperative Q-learning based power control to increase the QoS of users in femtocells, without considering channel variations. Deep Q-learning based power allocation to maximize the network objective has also been considered in [ghadimi2017dynamicpower, calabrese2017learning]. Similar to the proposed approach, the work in [ghadimi2017dynamicpower, calabrese2017learning] is also based on a distributed framework with a centralized training assumption, but the benchmark used to evaluate the performance of their algorithms was a fixed power allocation scheme rather than state-of-the-art algorithms. Our treatment of the state of the wireless environment and of the reward function is also novel. Specifically, the proposed approach addresses the stochastic nature of the wireless environment as well as incomplete/delayed CSI, and arrives at highly competitive strategies quickly.
The remainder of this paper is organized as follows. We give the system model in Section II. In Section III, we formulate the dynamic power allocation problem and give our practical constraints on the local information. In Section IV, we first give an overview of deep Qlearning and then describe the proposed algorithm. We give simulation results in Section V. We conclude with a discussion of possible future work in Section VI.
II System Model
We first consider the classical power allocation problem in a network of n links. We assume that all transmitters and receivers are equipped with a single antenna. The model is often used to describe a mobile ad hoc network (MANET) [huang2006distributedpower]. The model has also been used to describe a simple cellular network with n APs, where each AP serves a single user device [shen2018fractional, illsoo2015distributedpower]. Let N = {1, ..., n} denote the set of link indexes. We consider a fully synchronized time-slotted system with slot duration T. For simplicity, we consider a single frequency band with flat fading. We adopt a block fading model and denote the downlink channel gain from transmitter i to receiver j in time slot t as
g_{i,j}(t) = |h_{i,j}(t)|^2 \alpha_{i,j}.   (1)
Here, \alpha_{i,j} represents the large-scale fading component including path loss and log-normal shadowing, which remains the same over many time slots. Following the Jakes fading model [liang2017delayedCSI], we express the small-scale Rayleigh fading component h_{i,j}(t) as a first-order complex Gauss-Markov process:
h_{i,j}(t) = \rho h_{i,j}(t-1) + \sqrt{1 - \rho^2} e_{i,j}(t),   (2)
where h_{i,j}(0) and the channel innovation process e_{i,j}(1), e_{i,j}(2), ... are independent and identically distributed circularly symmetric complex Gaussian (CSCG) random variables with unit variance. The correlation \rho = J_0(2\pi f_d T), where J_0(\cdot) is the zeroth-order Bessel function of the first kind and f_d is the maximum Doppler frequency. The received signal-to-interference-plus-noise ratio (SINR) of link i in time slot t is a function of the power allocation p(t):
\gamma_i(t) = \frac{g_{i,i}(t) p_i(t)}{\sum_{j \neq i} g_{j,i}(t) p_j(t) + \sigma^2},   (3)
where \sigma^2 is the additive white Gaussian noise (AWGN) power spectral density (PSD). We assume the same noise PSD at all receivers without loss of generality. The downlink spectral efficiency of link i at time t can be expressed as:
C_i(t) = \log_2(1 + \gamma_i(t)).   (4)
The transmit power of transmitter i in time slot t is denoted as p_i(t). We denote the power allocation of the network in time slot t as p(t) = [p_1(t), ..., p_n(t)]^T.
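The channel model (1)-(2) and the rate expressions (3)-(4) can be simulated in a few lines. Below is a minimal numpy sketch (function and variable names are our own, not from the paper); the correlation \rho would come from the Bessel-function expression, e.g., scipy.special.j0(2 * pi * f_d * T):

```python
import numpy as np

def jakes_step(h_prev, rho, rng):
    """One step of the first-order Gauss-Markov fading process in (2)."""
    n = h_prev.shape[0]
    # CSCG innovation with unit variance: real/imag parts are N(0, 1/2)
    e = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    return rho * h_prev + np.sqrt(1.0 - rho ** 2) * e

def sinr_and_rate(g, p, noise):
    """SINR (3) and spectral efficiency (4); g[j, i] is the gain from tx j to rx i."""
    signal = np.diag(g) * p
    interference = g.T @ p - signal  # row i: sum_j g[j, i] p[j], minus own signal
    gamma = signal / (interference + noise)
    return gamma, np.log2(1.0 + gamma)
```

With slot duration T = 20 ms and f_d = 10 Hz (the simulation values of Section V-A), \rho = J_0(2\pi \cdot 10 \cdot 0.02) \approx 0.64, so the channel is strongly but not perfectly correlated across slots.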
III Dynamic Power Control
We are interested in maximizing a generic weighted sum-rate objective function. Specifically, the dynamic power allocation problem in slot t is formulated as
\max_{p(t)} \sum_i w_i(t) C_i(t), \quad \text{subject to } 0 \le p_i(t) \le P_{\max}, \; \forall i,   (5)
where w_i(t) is the given nonnegative weight of link i in time slot t, and P_{\max} is the maximum PSD a transmitter can emit. Hence, the dynamic power allocator has to solve an independent problem in the form of (5) at the beginning of every time slot. In time slot t, the optimal power allocation solution is denoted as p^*(t). Problem (5) is in general non-convex and has been shown to be NP-hard [Luo2008dynamicspectrum].
We consider two special cases. In the first case, the objective is to maximize the sum-rate by letting w_i(t) = 1 for all i and t. In the second case, the weights vary in a controlled manner to ensure proportional fairness [tse2005fundamentals, zhang2011proportionally]. Specifically, at the end of time slot t, receiver i computes its weighted average spectral efficiency as
\bar{C}_i(t) = \beta \bar{C}_i(t-1) + (1 - \beta) C_i(t),   (6)
where \beta \in (0, 1) is used to control the impact of history. User i updates its link weight as:
w_i(t+1) = 1 / \bar{C}_i(t).   (7)
This power allocation algorithm maximizes the sum of the log-average spectral efficiencies [tse2005fundamentals], i.e.,
\max \sum_i \log \bar{C}_i,   (8)
where a user's long-term average throughput is proportional to its long-term channel quality in some sense.
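The proportional-fair bookkeeping in (6)-(7) amounts to an exponentially weighted average followed by a reciprocal. A minimal sketch, assuming the smoothing orientation shown in (6) (names are ours):

```python
def update_pf_weights(C_bar, C_now, beta):
    """Exponentially weighted average rate (6) and PF weight update (7).

    beta in [0, 1) controls the impact of history (assumed orientation)."""
    C_bar = beta * C_bar + (1.0 - beta) * C_now
    w = 1.0 / C_bar  # weight is inversely proportional to the average rate
    return C_bar, w
```

A link whose average spectral efficiency drops is assigned a larger weight, so the scheduler favors it in subsequent slots.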
We use two popular (suboptimal) power allocation algorithms as benchmarks: the WMMSE algorithm [shi2011wmmse] and the FP algorithm [shen2018fractional]. Both are centralized and iterative in their original forms. The closed-form FP algorithm used in this paper is formulated in [shen2018fractional, Algorithm 3]. Similarly, a detailed explanation and pseudocode of the WMMSE algorithm are given in [sun2017learning, Algorithm 1]. Both algorithms require full cross-link CSI. A centralized mechanism is suitable for a stationary environment with slowly varying weights and no fast fading. For a network with a non-stationary environment, it is infeasible to instantaneously collect all CSI over a large network.
It is fair to assume that the feedback delay from a receiver to its corresponding transmitter is much smaller than the slot duration T, so the prediction error due to this feedback delay is negligible. Therefore, once receiver i completes a direct channel measurement, we assume that it is also available at transmitter i.
For the centralized approach, once a link acquires the CSI of its direct channel and all other interfering channels to its receiver, passing this information to a central controller is another burden. This is typically resolved using a backhaul network between the APs and the central controller. The CSI of cross links is usually delayed or even outdated. Furthermore, the central controller can only return the optimal power allocation as the iterative algorithm converges, which is another limitation on the scalability.
Our goal is to design a scalable algorithm, so we limit the information exchange to nearby transmitters. We define two neighborhood sets for every link i: let the set of transmitters whose SNR at receiver i was above a certain threshold during the past time slot be denoted as
I_i(t) = \{ j \neq i : g_{j,i}(t-1) p_j(t-1) > \eta \sigma^2 \},   (9)
and let the set of receiver indexes whose SNR from transmitter i was above the threshold in the past slot be denoted as
O_i(t) = \{ k \neq i : g_{i,k}(t-1) p_i(t-1) > \eta \sigma^2 \}.   (10)
From link i's viewpoint, I_i(t) represents the set of “interferers”, whereas O_i(t) represents the set of the “interfered” neighbors.
We next discuss the local information transmitter i possesses at the beginning of time slot t. First, we assume that transmitter i learns, via receiver feedback, the direct downlink channel gain g_{i,i}(t). Further, transmitter i also learns the current total received interference-plus-noise power at receiver i before the global power update (as a result of the new gains and the yet-to-be-updated powers). In addition, by the beginning of slot t, receiver i has informed transmitter i of the received power from every interferer j \in I_i(t), i.e., g_{j,i}(t-1) p_j(t-1). These measurements can only be available at transmitter i just before the beginning of slot t. Hence, in the previous slot t-1, receiver i also informs transmitter i of the outdated versions of these measurements, to be used in the information exchange performed in slot t-1 between transmitter i and its interferers.
To clarify, as shown in Fig. 1, transmitter i has sent the following outdated information to each interferer j \in I_i(t) in return for interferer j's corresponding items:

the weight of link i, w_i(t-1),

the spectral efficiency of link i computed from (4), C_i(t-1),

the direct gain, g_{i,i}(t-1),

the received interference power from transmitter j, g_{j,i}(t-1) p_j(t-1),

the total interference-plus-noise power at receiver i.
As assumed earlier, these measurements are accurate; the uncertainty about the current CSI is entirely due to the latency of the information exchange (one slot). By the same token, from every interfered neighbor k \in O_i(t), transmitter i also obtains k's items listed above.
IV Deep Reinforcement Learning for Dynamic Power Allocation
IV-A Overview of Deep Q-Learning
A reinforcement learning agent learns its best policy by observing the rewards of trial-and-error interactions with its environment over time [kaelbling1996reinforcement, sutton1998reinforcement]. Let S denote the set of possible states and A denote a discrete set of actions. The state s \in S is a tuple of the environment's features that are relevant to the problem at hand, and it describes the agent's relation with its environment [ghadimi2017dynamicpower]. Assuming discrete time steps, the agent observes the state of its environment, s_t \in S, at time step t. It then takes an action a_t \in A according to a certain policy \pi. The policy \pi(s, a) is the probability of taking action a conditioned on the current state being s. The policy function must satisfy \sum_{a \in A} \pi(s, a) = 1. Once the agent takes an action a_t, its environment moves from the current state s_t to the next state s_{t+1}. As a result of this transition, the agent gets a reward r_{t+1} that characterizes its benefit from taking action a_t at state s_t. This scheme forms an experience at time t+1, hereby defined as e_{t+1} = (s_t, a_t, r_{t+1}, s_{t+1}), which describes an interaction with the environment [mnih2015human].

The well-known Q-learning algorithm aims to compute an optimal policy that maximizes a certain expected reward without knowledge of the functional form of the reward and the state transitions. Here we let the reward be the future cumulative discounted reward at time t:
R_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau+1},   (11)
where \gamma \in (0, 1] is the discount factor for future rewards. In the stationary setting, we define a Q-function associated with a certain policy \pi as the expected reward once action a is taken under state s [singh2000convergence], i.e.,
Q^{\pi}(s, a) = E\left[ R_t \mid s_t = s, a_t = a \right].   (12)
As an action-value function, the Q-function satisfies a Bellman equation [serrano2010qlearning]:
Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(s', a') Q^{\pi}(s', a'),   (13)
where r(s, a) is the expected reward of taking action a at state s, and P(s' \mid s, a) is the transition probability from given state s to state s' with action a. From the fixed-point equation (13), the value of Q^{\pi}(s, a) can be recovered from all values of Q^{\pi}(s', a'). It has been proved that iterative approaches such as the Q-learning algorithm efficiently converge to the action-value function (12) [singh2000convergence]. Clearly, it suffices to let the optimal policy place all probability on the most favorable action. From (13), the optimal Q-function associated with the optimal policy is then expressed as
Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a'} Q^*(s', a').   (14)
The classical Q-learning algorithm constructs a lookup table, q(s, a), as a surrogate of the optimal Q-function. Once this lookup table is randomly initialized, the agent takes actions according to the \epsilon-greedy policy at each time step. The \epsilon-greedy policy implies that with probability 1 - \epsilon the agent takes the action that gives the maximum lookup table value for the given current state, whereas it picks a random action with probability \epsilon to avoid getting stuck at non-optimal policies [mnih2015human]. After acquiring a new experience as a result of the taken action, the Q-learning algorithm updates the corresponding entry of the lookup table according to:
q(s_t, a_t) \leftarrow (1 - \lambda) q(s_t, a_t) + \lambda \left( r_{t+1} + \gamma \max_{a'} q(s_{t+1}, a') \right),   (15)
where \lambda is the learning rate [singh2000convergence].
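The tabular update (15) together with the \epsilon-greedy rule can be sketched as follows; this is a toy illustration with hashable states, and the names are ours:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, lam, gamma):
    """One lookup-table update following (15)."""
    best_next = max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] = (1 - lam) * Q[(s, a)] + lam * (r + gamma * best_next)

def eps_greedy(Q, s, actions, eps):
    """Random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Using a defaultdict makes every unvisited state-action pair start at value zero, which mirrors the (here trivial) random initialization of the table.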
When the state and action spaces are very large, as is the case for the power control problem at hand, the classical Q-learning algorithm fails, mainly for two reasons:

Many states are rarely visited, and

the storage of the lookup table in (15) becomes impractical [naparstek2017deep].
Both issues can be solved with deep reinforcement learning, e.g., deep Q-learning [mnih2015human]. A deep neural network called a deep Q-network (DQN) is used to estimate the Q-function in lieu of a lookup table. The DQN can be expressed as q(s, a; \theta), where the real-valued vector \theta represents its parameters. The essence of the DQN is that the function q(\cdot, \cdot; \theta) is completely determined by \theta. As such, the task of finding the best Q-function in a functional space of uncountably many dimensions is reduced to searching for the best \theta of finite dimensions. Similar to classical Q-learning, the agent collects experiences through its interaction with the environment. The agent or the network trainer forms a data set by collecting the experiences until time t. As the “quasi-static target network” method [mnih2015human] implies, we define two DQNs: the target DQN with parameters \theta^-_t and the train DQN with parameters \theta_t. \theta^-_t is updated to be equal to \theta_t once every T_u steps. Following “experience replay” [mnih2015human], the least-squares loss of the train DQN for a random mini-batch D_t at time t is

L(\theta_t) = \sum_{(s, a, r', s') \in D_t} \left( y(r', s') - q(s, a; \theta_t) \right)^2,   (16)
where the target is

y(r', s') = r' + \gamma \max_{a'} q(s', a'; \theta^-_t).   (17)
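The target (17) and loss (16) can be sketched numerically. Below is a minimal numpy sketch (names are ours), assuming the mini-batch's next-state Q-values under the target network are given as a matrix with one row per sample and one column per action:

```python
import numpy as np

def dqn_targets(r, q_next_target, gamma):
    """Targets (17): y = r + gamma * max_a' q(s', a'; theta_minus)."""
    return r + gamma * q_next_target.max(axis=1)

def dqn_loss(q_taken, y):
    """Least-squares loss (16) summed over the mini-batch."""
    return float(np.sum((y - q_taken) ** 2))
```

Note that the targets are computed with the frozen parameters \theta^-_t, so the gradient in the subsequent optimization step flows only through q(s, a; \theta_t).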
Finally, we assume that in each time step the stochastic gradient descent algorithm that minimizes the loss function (16) is run on the mini-batch D_t. Stochastic gradient descent returns the new parameters of the train DQN using a gradient computed from just a few samples of the dataset and has been shown to converge to a set of good parameters quickly [lecun2015deep].

IV-B Proposed Multi-Agent Deep Reinforcement Learning Algorithm
As depicted in Fig. 2, we propose a multi-agent deep reinforcement learning scheme with each transmitter as an agent. Similar to [hu1998online], we define the local state of learning agent i as s_i, which is composed of environment features that are relevant to agent i's action a_i. In the multi-agent learning system, the state transitions of the common environment depend on the agents' joint actions. An agent's environment transition probabilities in (13) may not be stationary as other learning agents update their policies. The Markov property introduced for the single-agent case in Section IV-A no longer holds in general [nguyen2018multisurvey]. This “environment non-stationarity” issue may cause instability during the learning process. One way to tackle the issue is to train a single meta agent with a DQN that outputs joint actions for all agents [foerster2017stabilising]. The complexity of the state-action space, and consequently the DQN complexity, will then be proportional to the total number of agents in the system. The single meta agent approach is not suitable for our dynamic setup and the distributed execution framework, since its DQN can only forward the action assignments to the transmitters after acquiring the global state information. There is extensive research on multi-agent learning frameworks and there exist several multi-agent Q-learning adaptations [tampuu2017multiagent, nguyen2018multisurvey]. However, multi-agent learning is an open research area and theoretical guarantees for these adaptations are rare and incomplete despite their good empirical performances [tampuu2017multiagent, nguyen2018multisurvey].
In this work, we take an alternative approach where the DQNs are distributively executed at the transmitters, whereas training is centralized to ease implementation and to improve stability. Each agent i has the same copy of the DQN with parameters \theta_t at time slot t. The centralized network trainer trains a single DQN using the experiences gathered from all agents. This significantly reduces the amount of memory and computational resources required by training. The centralized training framework is also similar to the parameter-sharing concept, which allows the learning algorithm to draw advantage from the fact that agents are learning together for faster convergence [gupta2017cooperative]. Since agents work collaboratively to maximize the global objective in (5), with an appropriate reward function design to be discussed in Section IV-E, each agent can benefit from the experiences of others. Note that sharing the same DQN parameters still allows different behavior among agents, because they execute the same DQN with different local states as input.
As illustrated in Fig. 2, at the beginning of time slot t, agent i takes action a_i(t) as a function of s_i(t) based on the current decision policy. All agents are synchronized and take their actions at the same time. Prior to taking action, agent i has observed the effect of the past actions of its neighbors on its current state, but it has no knowledge of a_j(t), j \neq i. From past experiences, agent i is able to acquire an estimate of the impact of its own actions on the future actions of its neighbors, and it can determine a policy that maximizes its discounted expected future reward with the help of deep Q-learning.
The proposed DQN is a fully connected deep neural network [watt2016machine, Chapter 5] that consists of five layers, as shown in Fig. (a)a. The first layer is fed by the input state vector of length N_0. We relegate the detailed design of the state vector elements to Section IV-C. The input layer is followed by three hidden layers with 200, 100, and 40 neurons, respectively. At the output layer, each port gives an estimate of the Q-function with the given state input and the corresponding action output. The total number of DQN output ports equals the cardinality of the action set to be described in Section IV-D. The agent finds the action that has the maximum value at the DQN output and takes this action as its transmit power.
In Fig. (a)a, we also depict the connections between these layers using the weights and biases of the DQN, which form the set of parameters \theta. The total number of scalar parameters in the fully connected DQN is

\theta_{\text{total}} = \sum_{l=1}^{4} (N_{l-1} + 1) N_l,   (18)

where N_l denotes the number of neurons at layer l. In addition, Fig. (b)b describes the functionality of a single neuron, which applies a nonlinear activation function to its combined inputs.
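As a sanity check on (18), the parameter count of a fully connected network can be computed from its layer widths. The widths below (57 inputs, hidden layers of 200, 100, and 40 neurons, 10 outputs) are assumptions consistent with the totals reported in Section V-A:

```python
def dqn_param_count(layer_sizes):
    """Weights plus biases of a fully connected network, cf. (18)."""
    return sum((n_in + 1) * n_out  # +1 accounts for the bias of each neuron
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
```

With the assumed widths, dqn_param_count([57, 200, 100, 40, 10]) reproduces the 36,150 parameters stated in Section V-A.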
During the training stage, in each time slot, the trainer randomly selects a mini-batch D_t of experiences from an experience-replay memory [mnih2015human] that stores the experiences of all agents. The experience-replay memory is a FIFO queue [yu2017deep] whose length is proportional to the total number of agents n; a new experience replaces the oldest experience in the queue. At time slot t, the most recent experience available from agent i is e_i(t-1), due to delay. Once the trainer picks D_t, it updates the parameters \theta to minimize the loss in (16) using an appropriate optimizer, e.g., the stochastic gradient descent method [lecun2015deep]. As also explained in Fig. 2, once per T_u time slots, the trainer broadcasts the latest trained parameters. The new parameters become available at the agents after some time slots due to the transmission delay through the backhaul network. Training may be terminated once the parameters converge.
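The experience-replay memory described above is essentially a bounded FIFO queue with random sampling. A minimal sketch (class and method names are ours):

```python
import random
from collections import deque

class ReplayMemory:
    """FIFO experience-replay queue shared by all agents (sketch)."""

    def __init__(self, max_len):
        # Oldest experience is evicted automatically once the queue is full.
        self.buf = deque(maxlen=max_len)

    def add(self, experience):
        # experience is a tuple (s, a, r, s_next) from one agent.
        self.buf.append(experience)

    def sample(self, batch_size):
        # Random mini-batch for the trainer; smaller if memory is not yet full.
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))
```

In the multi-agent setting, max_len would scale with the number of agents so that every agent keeps a comparable history in the shared memory.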
IV-C States
As described in Section III, agent i builds its state using information from the interferer and interfered sets given by (9) and (10), respectively. To control the complexity, we set |I_i(t)| = |O_i(t)| = c, where c is the restriction on the number of interferers and interfered neighbors the AP communicates with. At the beginning of time slot t, agent i sorts its interferers by the current received power from each interferer j at receiver i, i.e., g_{j,i}(t) p_j(t-1). This sorting process allows agent i to prioritize its interferers. If |I_i(t)| > c, we want to keep the strong interferers, which have a higher impact on agent i's next action. On the other hand, if |I_i(t)| < c, agent i adds virtual noise agents to I_i(t) to fit the fixed DQN input size. A virtual noise agent is assigned an arbitrary negative weight and spectral efficiency. Its downlink and interfering channel gains are taken as zero in order to avoid any impact on agent i's decision-making. The purpose of having these virtual agents as placeholders is to provide inconsequential inputs that fill the input elements of fixed length, like 'padding zeros'. After adding virtual noise agents (if needed), agent i takes the first c interferers to form I_i(t). For the interfered neighbors, agent i follows a similar procedure, but this time the sorting criterion is agent i's share of the interference at receiver k, i.e., g_{i,k}(t-1) p_i(t-1), in order to give priority to the interfered neighbors most significantly affected by agent i's interference.

The way we organize the local information to build s_i(t) follows some intuitive and systematic principles. Based on these principles, we refined our design by trial and error with some preliminary simulations. We now describe the state of agent i at time slot t, i.e., s_i(t), by dividing it into three main feature groups:
IV-C1 Local Information
The first element of this feature group is agent i's transmit power during the previous time slot, i.e., p_i(t-1). This is followed by the second and third elements, which specify agent i's most recent potential contribution to the network objective (5). For the second element, we do not directly use the weight w_i(t), which tends to be quite large when the average spectral efficiency in (7) is close to zero; we found that using its reciprocal is more desirable. Finally, the last four elements of this feature group are the last two measurements of the direct downlink channel and of the total interference-plus-noise power at receiver i. Hence, a total of seven input ports of the input layer are reserved for this feature group. In our state set design, we take the last two measurements into account to give the agent a better chance to track its environment's changes. Intuitively, the lower the maximum Doppler frequency, the slower the environment changes, so that having more past measurements will help the agent to make better decisions [yu2017deep]. On the other hand, this would result in more state information, which may increase the complexity and decrease the learning efficiency. Based on preliminary simulations, we include the two most recent measurements.
IV-C2 Interfering Neighbors
This feature group lets agent i observe the interference from its neighbors to receiver i and the contribution of these interferers to the objective (5). For each interferer j \in I_i(t), three input ports are reserved: the first indicates the interference that agent i faces from its interferer j; the other two imply the significance of agent j in the objective (5). Similar to the local information feature group explained in the previous paragraph, agent i also considers the history of its interferers in order to track changes in its own receiver's interference condition, so three further input ports per interferer are reserved for the previous values of the same quantities. A total of 6c state elements are reserved for this feature group.
IV-C3 Interfered Neighbors
Finally, agent i uses the feedback from its interfered neighbors to gauge its interference to nearby receivers and their contribution to the objective (5). If agent i's link was inactive during the previous time slot, then O_i(t) is empty. In this case, if we were to ignore the history and directly consider only the current interfered neighbor set, the corresponding state elements would be useless. Note that agent i's link became inactive when its own estimated contribution to the objective (5) was not significant enough compared to its interference to its interfered neighbors. Thus, after agent i's link becomes inactive, in order to decide when to reactivate the link, the agent should keep track of the interfered neighbors that implicitly silenced it. We solve this issue by defining the last time slot in which agent i was active; the agent carries the feedback from the interfered neighbors of that slot. We also pay attention to the fact that an interfered neighbor with no current knowledge of agent i's interference is still able to send its local information to agent i. Therefore, agent i reserves four elements of its state set for each interfered neighbor k. This makes a total of 4c elements of the state set reserved for the interfered neighbors.
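The sorting-and-padding procedure of this section can be illustrated as follows; this is a simplified sketch (names are ours) that ranks interferers by received power and pads with zero-valued "virtual noise agent" placeholders:

```python
import numpy as np

def top_c_interferers(received_powers, c, pad_value=0.0):
    """Keep the c strongest interferers; pad with virtual noise agents if fewer.

    received_powers maps interferer index -> interference power at this receiver.
    Padded entries use zeros so they cannot influence the agent's decision."""
    ranked = sorted(received_powers, key=received_powers.get, reverse=True)[:c]
    feats = [received_powers[j] for j in ranked]
    feats += [pad_value] * (c - len(feats))  # placeholder inputs, like padding zeros
    return ranked, np.array(feats)
```

The same routine applies to the interfered neighbors, with the sorting key replaced by the agent's share of the interference at each neighboring receiver.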
IV-D Actions
Unlike taking discrete steps relative to the previous transmit power level (see, e.g., [ghadimi2017dynamicpower]), we use discrete power levels taken between 0 and P_{\max}. All agents have the same action space. Suppose we have |A| discrete power levels. Then, the action set is given by
A = \left\{ 0, \frac{P_{\max}}{|A| - 1}, \frac{2 P_{\max}}{|A| - 1}, \ldots, P_{\max} \right\}.   (19)
The total number of DQN output ports shown in Fig. (a)a is equal to |A|. Agent i is only allowed to pick an action a_i(t) \in A to update its power strategy at time slot t. This way of approaching the problem could increase the number of DQN output ports compared to [ghadimi2017dynamicpower], but it increases the robustness of the learning algorithm. For example, as the maximum Doppler frequency or the time slot duration increases, the correlation \rho in (2) decreases and the channel state varies more. This situation may require the agents to react faster, e.g., with a possible transition from zero power to full power, which can be handled efficiently with an action set composed of discrete power levels.
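The action set (19) is simply an evenly spaced grid of power levels, assuming linear spacing between 0 and P_max (names are ours):

```python
import numpy as np

def action_set(p_max, num_levels):
    """Evenly spaced discrete power levels from 0 to P_max, cf. (19)."""
    return np.linspace(0.0, p_max, num_levels)
```

The index of the selected output port of the DQN maps directly to an entry of this grid.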
IV-E Reward Function
The reward function is designed to optimize the network objective (5). We interpret the reward as the way the action of agent i in time slot t, i.e., p_i(t), affects the weighted sum-rate of its own link and of its future interfered neighbors O_i(t+1). During time slot t, for each agent i, the network trainer calculates the spectral efficiency of each link k \in O_i(t+1) without the interference from transmitter i as
C_{k \setminus i}(t) = \log_2\left( 1 + \frac{g_{k,k}(t) p_k(t)}{\sum_{j \neq i, k} g_{j,k}(t) p_j(t) + \sigma^2} \right).   (20)
The network trainer computes the interference-free denominator in (20) by simply subtracting g_{i,k}(t) p_i(t) from the total interference-plus-noise power at receiver k in time slot t. As assumed in Section III, since transmitter i \in I_k(t+1), its interference to link k in slot t, i.e., g_{i,k}(t) p_i(t), is accurately measurable by receiver k and has been delivered to the network trainer.
In time slot t, we account for the externality that link i causes to link k using a price charged to link i for generating interference to link k [huang2006distributedpower]:
\pi_{i \to k}(t) = w_k(t) \left( C_{k \setminus i}(t) - C_k(t) \right).   (21)
Then, the reward function of agent i at time slot t+1 is defined as
r_i(t+1) = w_i(t) C_i(t) - \sum_{k \in O_i(t+1)} \pi_{i \to k}(t).   (22)
The reward of agent i consists of two main components: its direct contribution to the network objective (5) and the penalty due to its interference to all interfered neighbors. Evidently, transmitting at peak power maximizes the direct contribution as well as the penalty, whereas being silent earns zero reward.
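The reward (22) with the prices (21) can be sketched as below; C_without_i stands for the interference-free spectral efficiencies of (20), and all names are ours:

```python
def agent_reward(w, C, i, interfered, C_without_i):
    """Reward (22): own weighted rate minus the externality prices (21).

    C_without_i[k] is link k's spectral efficiency recomputed without
    transmitter i's interference, cf. (20)."""
    penalty = sum(w[k] * (C_without_i[k] - C[k]) for k in interfered)
    return w[i] * C[i] - penalty
```

Note that when transmitter i is silent, its own rate is zero and each C_without_i[k] coincides with C[k], so the reward is exactly zero, as stated above.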
V Simulation Results
V-A Simulation Setup
To begin with, we consider links on homogeneously deployed cells, where we choose to be between 19 and 100. Transmitter is located at the center of cell and receiver is located randomly within the cell. We also discuss the extendability of our algorithm to multilink per cell scenarios in Section VB. The half transmittertotransmitter distance is denoted as and it is between 100 and 1000 meters. We also define an inner region of radius where no receiver is allowed to be placed. We set the to be between and meters. Receiver
is placed randomly according to a uniform distribution on the area between out of the inner region of radius
and the cell boundary. Fig. 4 shows two network configuration examples.We set , i.e., the maximum transmit power level of transmitter , to 38 dBm over 10 MHz frequency band which is fully reusable across all links. The distance dependent path loss between all transmitters and receivers is simulated by (in dB), where is transmittertoreceiver distance in km. This path loss model is compliant with the LTE standard [LTEA]
. The lognormal shadowing standard deviation is taken as 8 dB. The AWGN power
is 114 dBm. We set the threshold in (9) and (10) to 5. We assume fullbuffer traffic model. Similar to [zhuang2016energy], if the received SINR is greater than 30 dB, it is capped at 30 dB in the calculation of spectral efficiency by (4). This is to account for typical limitations of finiteprecision digital processing. In addition to these parameters, we take the period of the timeslotted system to be 20 ms. Unless otherwise stated, the maximum Doppler frequency is 10 Hz and identical for all receivers.We next describe the hyperparameters used for the architecture of our algorithm. Since our goal is to ensure that the agents make their decisions as quickly as possible, we do not overparameterize the network architecture and we use a relatively small network for training purposes. Our algorithm trains a DQN with one input layer, three hidden layers, and one output layer. The hidden layers have , , and neurons, respectively. We have DQN input ports reserved for the local information feature group explained in Section IVC. The cardinality constraint on the neighbor sets is 5 agents. Hence, again from Section IVC, the input ports reserved for the interferer and the interfered neighbors are and , respectively. This makes a total of input ports reserved for the state set. (We also normalize the inputs with some constants depending on , maximum intracell path loss, etc., to optimize the performance.) We use ten discrete power levels,
. Thus, the DQN has ten output ports. Initial parameters of the DQN are generated with the truncated normal distribution function of TensorFlow [abadi2015tensorflow]. For our application, we observed that the rectified linear unit (ReLU) activation converges to a desirable power allocation slightly more slowly than the hyperbolic tangent (tanh), so we use tanh as the DQN's activation function. The memory parameters at the network trainer, namely the mini-batch size and the experience-replay memory size, are 256 and 1,000 samples, respectively. We use the RMSProp algorithm
[ruder2016overview] with an adaptive learning rate. For a more stable deep Q-learning outcome, the learning rate is decayed multiplicatively over time, following [lavet2015discountfactor]. We also apply an adaptive $\epsilon$-greedy algorithm: $\epsilon$ is initialized to 0.2 and decays multiplicatively over time toward a small minimum value.

Although the discount factor $\gamma$ is almost arbitrarily chosen to be close to 1, and increasing $\gamma$ potentially improves the outcomes of deep Q-learning for most of its applications [lavet2015discountfactor], we set $\gamma$ to 0.5. The reason we use a moderate $\gamma$ is that the correlation between an agent's actions and its future rewards tends to be smaller in our application due to fading: an agent's action affects its own future reward only through its impact on the interference conditions of its neighbors and the consequences of their unpredictable actions. We observed that a higher $\gamma$ is not desirable either: it slows the DQN's reaction to channel changes, i.e., the high-$f_d$ case. For high $\gamma$, the DQN converges to a strategy that makes the links with better steady-state channel conditions greedy. Because of fading, however, the links with poor steady-state channel conditions may become more advantageous in some time slots; a moderate $\gamma$ helps detect these cases and allows the poor links to be activated during the time slots in which they can contribute to the network objective (5).

Further, the training cycle duration is 100 time slots. After we set the parameters in (18), we can compute the total number of DQN parameters as 36,150. After every 100 time slots, the trained parameters at the central controller are delivered to all agents within 50 time slots via the backhaul network, as explained in Section IV-B. We assume that the parameters are transferred without any compression and that the backhaul network uses a pure peer-to-peer architecture. As 50 time slots correspond to 1 second, the minimum required downlink/uplink capacity for all backhaul links is about 1 Mbps.
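The quoted 1 Mbps backhaul requirement can be sanity-checked with a short calculation. The sketch below assumes each of the 36,150 DQN parameters is delivered as an uncompressed 32-bit value over a 1-second delivery window; the 32-bit width is our assumption, consistent with the no-compression statement above.

```python
# Back-of-the-envelope check of the ~1 Mbps backhaul figure.
# Assumption (ours): each DQN parameter is sent as an uncompressed
# 32-bit float; the delivery window is 50 slots of 20 ms, i.e., 1 s.

NUM_PARAMS = 36_150
BITS_PER_PARAM = 32      # assumed uncompressed float32
SLOT_MS = 20             # period of the time-slotted system
DELIVERY_SLOTS = 50      # delivery window in slots

delivery_seconds = DELIVERY_SLOTS * SLOT_MS / 1000.0  # = 1.0 s
required_bps = NUM_PARAMS * BITS_PER_PARAM / delivery_seconds

print(f"required backhaul capacity = {required_bps / 1e6:.2f} Mbps")
```

The result, roughly 1.16 Mbps per agent, matches the "about 1 Mbps" figure in the text.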
Once the training stage is completed, the backhaul links will be used only for limited information exchange between neighbors which requires negligible backhaul link capacity.
We empirically validate the functionality of our algorithm, which we implemented with TensorFlow [abadi2015tensorflow]. Each result is an average of at least 10 randomly initialized simulations. The simulations have two main phases: training and testing. Each training phase lasts 40,000 time slots, i.e., 40,000 × 20 ms = 800 seconds, and each testing phase lasts 5,000 time slots, i.e., 100 seconds. During testing, the trainer leaves the network and the $\epsilon$-greedy algorithm is terminated, i.e., agents stop exploring the environment.
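The train-then-test exploration behavior can be sketched as a simple schedule. Only the initial value of 0.2, the 40,000-slot training phase, and the termination of exploration during testing come from the text; the decay rate and the floor value below are illustrative placeholders.

```python
# Sketch of the adaptive epsilon-greedy schedule: epsilon starts at 0.2,
# decays multiplicatively during the 40,000-slot training phase, and is
# forced to 0 during testing (no exploration). LAMBDA_EPS and EPS_MIN
# are assumed values for illustration, not constants from the paper.

EPS_INIT = 0.2
EPS_MIN = 0.01        # assumed floor
LAMBDA_EPS = 1e-4     # assumed per-slot decay rate
TRAIN_SLOTS = 40_000

def epsilon(t: int) -> float:
    """Exploration probability at time slot t (0-indexed)."""
    if t >= TRAIN_SLOTS:  # testing phase: exploration terminated
        return 0.0
    return max(EPS_MIN, EPS_INIT * (1.0 - LAMBDA_EPS) ** t)

print(epsilon(0), epsilon(20_000), epsilon(45_000))
```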
We have five benchmarks to evaluate the performance of our algorithm. The first two benchmarks are 'ideal WMMSE' and 'ideal FP', i.e., the centralized algorithm outcomes with instantaneous full CSI. The third benchmark is the 'central power allocation' (central), where we introduce a one-time-slot delay on the full CSI and feed it to the FP algorithm. Even a single time slot of delay to acquire the full CSI is a generous assumption, but it is a useful way to reflect the potential performance of the negligible-computation-time supervised learning approach introduced in [sun2017learning]. The next benchmark is the 'random' allocation, where each agent chooses its transmit power for each slot uniformly at random between 0 and $P_{\max}$. The last benchmark is the 'full-power' allocation, i.e., each agent's transmit power is $P_{\max}$ for all slots.
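A minimal sketch of the two naive benchmarks, together with the 30 dB SINR cap used when computing spectral efficiency, might look as follows ($P_{\max}$ = 38 dBm from the setup above; the helper names are ours):

```python
import math
import random

P_MAX_DBM = 38.0
P_MAX_W = 10 ** ((P_MAX_DBM - 30) / 10)  # 38 dBm is about 6.31 W
SINR_CAP = 10 ** (30 / 10)               # 30 dB cap, linear scale

def random_power() -> float:
    """'random' benchmark: uniform transmit power in [0, P_max]."""
    return random.uniform(0.0, P_MAX_W)

def full_power() -> float:
    """'full-power' benchmark: always transmit at P_max."""
    return P_MAX_W

def spectral_efficiency(sinr_linear: float) -> float:
    """Spectral efficiency with the received SINR capped at 30 dB."""
    return math.log2(1.0 + min(sinr_linear, SINR_CAP))

# The cap makes any SINR above 30 dB equivalent to exactly 30 dB.
print(spectral_efficiency(10 ** 6) == spectral_efficiency(SINR_CAP))
```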
V-B Sum-Rate Maximization
In this subsection, we focus on the sumrate by setting the weights of all network agents to 1 through all time slots.
V-B1 Robustness
We fix $N$ = 19 links and use two approaches to evaluate performance. The first approach is the 'matched' DQN, where we use the first 40,000 time slots to train a DQN from scratch. For the 'unmatched' DQN, we ignore the matched DQN specialized for the given initialization and, for the testing stage (the last 5,000 time slots), randomly pick another DQN trained for a different initialization with the same $N$ and $f_d$ parameters. In other words, for the unmatched DQN case, we skip the training stage and reuse a matched DQN that was trained for a different network initialization scenario and stored in memory. Here an unmatched DQN is always trained for a random initialization with $N$ = 19 links and $f_d$ = 10 Hz.
In Table I, we vary $f_d$ and see that training a DQN from scratch for the specific initialization is able to outperform both state-of-the-art centralized algorithms, even though they operate under ideal conditions such as full CSI and no delay. Interestingly, the unmatched DQN approach converges to the performance of the central power allocation, where we feed the FP algorithm with delayed full CSI; the DQN approach achieves this performance with distributed execution and incomplete CSI. In addition, training a DQN from scratch enables our algorithm to learn to compensate for CSI delays and to specialize for its network initialization scenario. Training a DQN from scratch swiftly converges in about 25,000 time slots (shown in Fig. 5(a)).
Additional simulations with $R$ and $r$ taken as variables are summarized in Table II and Table III, respectively. As the area of the receiver-free inner region increases, the receivers get closer to the interfering transmitters and interference mitigation becomes more necessary. Hence, the random and full-power allocations tend to show much lower sum-rate performance compared to the centralized algorithms. In that case, our algorithm still shows decent performance and the convergence rate remains about 25,000 time slots. We also stress the DQN under various scenarios. As we reduce $f_d$, its sum-rate performance remains unchanged, but the convergence time drops to 15,000 time slots. In the limit of large $f_d$, i.e., when we remove the temporal correlation between the current channel condition and past channel conditions, convergence takes more than 35,000 time slots. Intuitively, this effect on the convergence rate arises because the variation of states visited during the training phase grows with $f_d$. Further, the comparable performance of the unmatched DQN with the central power allocation shows the robustness of our algorithm to changes in the interference conditions and fading characteristics of the environment.
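The role of $f_d$ in the convergence discussion can be made concrete with the widely used Jakes autocorrelation model, in which the fading correlation between successive slots is $\rho = J_0(2\pi f_d T)$. The excerpt does not reproduce the paper's fading model, so the snippet below is illustrative only: it shows that a higher maximum Doppler frequency yields lower slot-to-slot correlation, i.e., more varied states during training.

```python
import math

def bessel_j0(x: float, steps: int = 10_000) -> float:
    """Bessel J0 via its integral form: (1/pi) * int_0^pi cos(x sin t) dt,
    evaluated with the trapezoid rule (stdlib only, no SciPy needed)."""
    h = math.pi / steps
    total = 0.5 * (math.cos(x * math.sin(0.0)) +
                   math.cos(x * math.sin(math.pi)))
    for k in range(1, steps):
        total += math.cos(x * math.sin(k * h))
    return total * h / math.pi

def jakes_correlation(f_d_hz: float, slot_s: float = 0.020) -> float:
    """Slot-to-slot fading correlation rho = J0(2*pi*f_d*T), Jakes model."""
    return bessel_j0(2.0 * math.pi * f_d_hz * slot_s)

for f_d in (2.0, 10.0, 100.0):
    print(f"f_d = {f_d:5.1f} Hz -> rho = {jakes_correlation(f_d):+.3f}")
```

With the 20 ms slot period from the setup above, slow fading (2 Hz) keeps successive slots highly correlated, while fast fading (100 Hz) makes them nearly independent.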
V-B2 Scalability
In this subsection, we increase the total number of links $N$ to investigate the scalability of our algorithm. As we increase $N$ to 50 links, the DQN still converges in 25,000 time slots with high sum-rate performance. As we keep increasing $N$ to 100 links, Table IV shows that the matched DQN's sum-rate advantage shrinks because of the fixed input architecture of the DQN.
Note that each agent only considers $c$ = 5 interferer and $c$ = 5 interfered neighbors. The performance of the DQN can be improved in that case by increasing $c$, at the cost of higher computational complexity. Additionally, the unmatched DQN trained for just 19 links still shows good performance as we increase the number of links.
It is worth pointing out that each agent is able to determine its own action in far less than the 20 ms slot duration on a personal computer. Therefore, our algorithm is suitable for dynamic power allocation. In addition, running a single training batch takes less than 20 ms. Most importantly, because of the fixed architecture of the DQN, increasing the total number of links from 19 to 100 has no impact on these values; it only increases the queue memory at the network trainer. The FP algorithm takes about 15 ms to converge for $N$ = 19 links, but with $N$ = 100 links this grows to 35 ms. The WMMSE algorithm converges slightly more slowly, and its convergence time likewise grows with $N$, which limits its scalability.
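The scalability argument rests on the per-agent input size being fixed by the neighbor-set cardinality $c$ rather than by the network size $N$. The per-group feature counts below are illustrative placeholders (the exact state design is in Section IV-C, not reproduced here):

```python
# Why per-agent execution cost does not grow with the network size N:
# each agent's DQN input is built only from its own local features plus
# a fixed number c of interferer and c of interfered neighbors. The
# feature counts below are assumed values for illustration only.

C_NEIGHBORS = 5            # cardinality constraint on each neighbor set
LOCAL_FEATURES = 6         # assumed size of the local feature group
FEATURES_PER_NEIGHBOR = 4  # assumed features reported per neighbor

def state_size(num_links: int) -> int:
    """DQN input size for one agent; note num_links never enters the
    computation, so the architecture is independent of N."""
    neighbors = 2 * C_NEIGHBORS  # c interferers + c interfered
    return LOCAL_FEATURES + neighbors * FEATURES_PER_NEIGHBOR

print(state_size(19), state_size(100))  # identical
```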
V-B3 Extendability to Multi-Link per Cell Scenarios and Different Channel Models
In this subsection, we first consider a special homogeneous cell deployment case with colocated transmitters at the cell centers. We also assume that the colocated transmitters within a cell do not perform successive interference cancellation [sun2017learning]. The WMMSE and FP algorithms can be applied to this multilink per cell scenario without any modifications.
We fix $R$ and $r$ to 500 and 10 meters, respectively. We set $f_d$ to 10 Hz and the total number of cells to 19. We first consider two scenarios where each cell has 2 and 4 links, respectively. The third scenario assigns each cell a random number of links, from 1 to 4 per cell, as shown in Fig. 4(b). The testing stage results for these multi-link per cell scenarios are given in Table V. As shown in Table VI, we further test these scenarios using a different channel model, the urban micro-cell (UMi) street canyon model of [tr38901]. For this model, we take the carrier frequency as 1 GHz. The transmitter and receiver antenna heights are assumed to be 10 and 1.5 meters, respectively.
Our simulations for these scenarios show that as we increase the number of links per cell, the training stage still converges in about 25,000 time slots. Fig. 6(a) shows the convergence rate of the training stage for the 4-links-per-cell scenario with 76 links; it also shows that using a different channel model, i.e., UMi street canyon, does not affect the convergence rate. Although the convergence rate is unaffected, the proposed algorithm's average sum-rate performance decreases as we increase the number of links per cell. Our algorithm still outperforms the centralized algorithms even in the 4-links-per-cell scenario for both channel models. Another interesting observation is that although the unmatched DQN was trained for a single-link deployment scenario and cannot handle the delayed-CSI constraint as well as the matched DQN, it gives performance comparable to the 'central' case. Thus, the unmatched DQN is capable of finding good estimates of optimal actions for unseen local state inputs.
V-C Proportionally Fair Scheduling
In this subsection, we change the link weights according to (7) to ensure fairness, as described in Section III. We choose the averaging term in (6) to be 0.01 and use convergence to the objective in (8) as the performance metric of the DQN. We also make some additions to the training and testing stages of the DQN. The link weights need an initialization: this is done by letting all transmitters serve their receivers with full power at $t$ = 0 and initializing the weights according to the initial spectral efficiencies computed from (4). For the testing stage, we reinitialize the weights after the first 40,000 slots to see whether the trained DQN can achieve fairness as fast as the centralized algorithms.
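Equations (6) and (7) are not reproduced in this excerpt, so the sketch below uses the standard proportionally fair form: an exponentially weighted moving average of each link's spectral efficiency with the 0.01 averaging term, and a weight inversely proportional to that average, so recently under-served links get higher priority.

```python
# Sketch of proportionally fair weight maintenance, assuming the common
# EWMA form for (6)-(7); the exact equations are not shown in this
# excerpt, so treat this as an illustrative stand-in.

BETA = 0.01  # the averaging term chosen above

def update_average(avg_rate: float, current_rate: float) -> float:
    """EWMA of a link's spectral efficiency, eq. (6)-style (assumed form)."""
    return (1.0 - BETA) * avg_rate + BETA * current_rate

def pf_weight(avg_rate: float) -> float:
    """PF link weight, eq. (7)-style: inversely proportional to the
    long-term average, so starved links are prioritized."""
    return 1.0 / avg_rate

# A starved link (low average rate) gets a larger weight than a busy one.
starved, busy = 0.2, 4.0
print(pf_weight(starved) > pf_weight(busy))
```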
As shown in Fig. 7, the training stage converges to a desirable scheduling in about 30,000 time slots. Once the network is trained and we reinitialize the link weights, our algorithm converges to an optimal scheduling in a distributed fashion as fast as the centralized algorithms. Next, we vary the network parameters to obtain the results in Table VII and Table VIII. We see that the DQN trained from scratch still outperforms the centralized algorithms in most of the initializations, and using the unmatched DQN also achieves high performance, similar to the previous subsections.
VI Conclusion and Future Work
In this paper, we have proposed a distributively executed, model-free power allocation algorithm which outperforms, or achieves comparable performance with, existing state-of-the-art centralized algorithms. We see potential in applying reinforcement learning techniques to various dynamic wireless network resource management tasks in place of optimization techniques. The proposed approach returns a new suboptimal power allocation much more quickly than the two popular centralized algorithms taken as benchmarks in this paper. In addition, using only limited local CSI and under realistic practical constraints, our deep Q-learning approach usually outperforms the generic WMMSE and FP algorithms, which require full CSI, an often impractical condition. Unlike most advanced optimization-based power control algorithms, e.g., WMMSE and FP, which require both instant and accurate measurements of individual channel gains, our algorithm only requires accurate measurements of some delayed received power values that are higher than a certain threshold above the noise level. An extension to the imperfect CSI case with inaccurate measurements is left for future work.
The work of Meng et al. [meng2018deepmulti] extends our preprint [nasir2018deep] to multiple users per cell, a scenario also addressed in the current paper. Although the centralized training phase may seem to limit the scalability of the proposed algorithm, we have shown that a DQN trained for a smaller wireless network can be applied to a larger one, and that the training of a DQN can be jump-started with initial parameters taken from another DQN previously trained for a different setup.
Finally, we used global training in this paper; reinitializing a local training over the regions where new links have joined or performance has dropped below a certain threshold is an interesting direction to consider. Beyond such regional training, completely distributed training could be considered as well. While a centralized training approach saves computational resources and converges faster, distributed training may pave the way for extending the proposed algorithm to other deployment scenarios that involve mobile users. The main hurdle in applying distributed training is avoiding the instability caused by the non-stationarity of the environment.
VII Acknowledgement
We thank Dr. Mingyi Hong, Dr. Wei Yu, Dr. Georgios Giannakis, and Dr. Gang Qian for stimulating discussions.