In emerging and future wireless networks, inter-cell interference management is one of the key technological challenges as access points (APs) become denser to meet ever-increasing capacity demands. A transmitter may increase its transmit power to improve its own data rate, but at the same time it may degrade the links it interferes with. Transmit power control has been implemented since the first-generation cellular networks [chiang2007power]. Our goal here is to maximize an arbitrary weighted sum-rate objective, which achieves maximum sum-rate or proportionally fair scheduling as special cases.
A number of centralized and distributed optimization techniques have been used to develop algorithms for reaching a suboptimal power allocation [shi2011wmmse, shen2018fractional, illsoo2015distributedpower, chiang2007power, huang2006distributedpower, kiani2007distributed, zhang2011proportionally]. We select two state-of-the-art algorithms as benchmarks. These are the weighted minimum mean square error (WMMSE) algorithm [shi2011wmmse] and an iterative algorithm based on fractional programming (FP) [shen2018fractional]. In their generic form, both algorithms require full up-to-date cross-cell channel state information (CSI). To the best of our knowledge, this work is the first to apply deep reinforcement learning to power control [nasir2018deep]. Sun et al. [sun2017learning] proposed a centralized supervised learning
approach to train a fast deep neural network (DNN) that achieves 90% or higher of the sum-rate achieved by the WMMSE algorithm. However, this approach still requires acquiring the full CSI. Another issue is that training the DNN depends on a massive dataset of the WMMSE algorithm's output for randomly generated CSI matrices. Such a dataset takes a significant amount of time to produce due to WMMSE's computational complexity. As the network gets larger, the total number of the DNN's input and output ports also increases, which raises questions about the scalability of the centralized solution of [sun2017learning]. Furthermore, the success of supervised learning is highly dependent on the accuracy of the system model underlying the computed training data, which requires a new set of training data every time the system model or key parameters change.
In this work, we design a distributively executed algorithm to be employed by all transmitters to compute their best power allocation in real time. Such a dynamic power allocation problem with time-varying channel conditions for a different system model and network setup was studied in [neely2005dynamicpowercontrol] and the delay performance of the classical dynamic backpressure algorithm was improved by exploiting the stochastic Lyapunov optimization framework.
The main contributions in this paper and some advantages of the proposed scheme are summarized as follows.
The proposed algorithm is one of the first power allocation schemes to use deep reinforcement learning in the literature. In particular, the distributively executed algorithm is based on deep Q-learning [mnih2015human], which is model-free and robust to unpredictable changes in the wireless environment.
The complexity of the distributively executed algorithm does not depend on the network size. In particular, the proposed algorithm is computationally scalable to networks that cover arbitrarily large geographical areas if the number of links per unit area remains upper bounded by the same constant everywhere.
The proposed algorithm learns a policy that guides all links to adjust their power levels under important practical constraints such as delayed information exchange and incomplete cross-link CSI.
Unlike the supervised learning approach [sun2017learning], there is no need to run an existing near-optimal algorithm to produce a large amount of training data. We use a centralized network trainer approach that gathers local observations from all network agents. This approach is computationally efficient and robust. In fact, a pretrained neural network can achieve performance comparable to that of the centralized optimization-based algorithms.
We compare the reinforcement learning outcomes with state-of-the-art optimization-based algorithms. We also show the scalability and the robustness of the proposed algorithm using simulations. In the simulations, we model channel variations, which are not known a priori to the learning algorithm, using the Jakes fading model [liang2017delayedCSI]. In certain scenarios the proposed distributed algorithm even outperforms the centralized iterative algorithms introduced in [shi2011wmmse, shen2018fractional]. We also address some important practical constraints that are not included in [shi2011wmmse, shen2018fractional].
The deep reinforcement learning framework has been used in some other wireless communications problems [luong2018deepRLsurvey, ye2017deep, yu2017deep, zhao2018slicing]. Classical Q-learning techniques have been applied to the power allocation problem in [bennis2010qlearning, simsek2011qtable, amiri2018mlpower, ghadimi2017dynamicpower, calabrese2017learning]. The goal in [bennis2010qlearning, simsek2011qtable] is to reduce the interference in LTE-Femtocells. Unlike the deep Q-learning algorithm, the classical algorithm builds a lookup table to represent the value of state-action pairs, so [bennis2010qlearning] and [simsek2011qtable] represent the wireless environment using a discrete state set and limit the number of learning agents. Amiri et al. [amiri2018mlpower] have used cooperative Q-learning based power control to increase the QoS of users in femtocells without considering channel variations. Deep Q-learning based power allocation to maximize a network objective has also been considered in [ghadimi2017dynamicpower, calabrese2017learning]. Similar to the proposed approach, the work in [ghadimi2017dynamicpower, calabrese2017learning] is also based on a distributed framework with a centralized training assumption, but the benchmark used to evaluate the performance of their algorithms was a fixed power allocation scheme instead of state-of-the-art algorithms. Our representation of the state of the wireless environment and our reward function are also novel. Specifically, the proposed approach addresses the stochastic nature of the wireless environment as well as incomplete/delayed CSI, and arrives at highly competitive strategies quickly.
The remainder of this paper is organized as follows. We give the system model in Section II. In Section III, we formulate the dynamic power allocation problem and give our practical constraints on the local information. In Section IV, we first give an overview of deep Q-learning and then describe the proposed algorithm. We give simulation results in Section V. We conclude with a discussion of possible future work in Section VI.
II System Model
We first consider the classical power allocation problem in a network of links. We assume that all transmitters and receivers are equipped with a single antenna. The model is often used to describe a mobile ad hoc network (MANET) [huang2006distributedpower]. The model has also been used to describe a simple cellular network with APs, where each AP serves a single user device [shen2018fractional, illsoo2015distributedpower]. Let denote the set of link indexes. We consider a fully synchronized time slotted system with slot duration . For simplicity, we consider a single frequency band with flat fading. We adopt a block fading model to denote the downlink channel gain from transmitter to receiver in time slot as
Here, represents the large-scale fading component including path loss and log-normal shadowing, which remains the same over many time slots. Following Jakes fading model [liang2017delayedCSI], we express the small-scale Rayleigh fading component as a first-order complex Gauss-Markov process:
where and the channel innovation process, where is the zeroth-order Bessel function of the first kind and is the maximum Doppler frequency.
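As a concrete illustration, the first-order Gauss-Markov recursion above can be simulated in a few lines. This is a hedged, self-contained sketch: the names (`f_d` for the maximum Doppler frequency, `T` for the slot duration) are assumptions, and the zeroth-order Bessel function of the first kind is evaluated by its power series.

```python
import math
import random

def bessel_j0(x, terms=20):
    """Power series of the zeroth-order Bessel function of the first kind:
    J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2, accurate for moderate x."""
    total, term = 0.0, 1.0
    for k in range(terms):
        if k > 0:
            term *= -(x * x / 4.0) / (k * k)  # t_k = t_{k-1} * (-x^2/4) / k^2
        total += term
    return total

def simulate_jakes(f_d=10.0, T=0.02, n_slots=2000, seed=1):
    """Rayleigh fading as a first-order complex Gauss-Markov process:
    h(t) = rho*h(t-1) + e(t), rho = J0(2*pi*f_d*T), e(t) ~ CN(0, 1-rho^2),
    so the average channel power E|h|^2 stays at 1."""
    rng = random.Random(seed)
    rho = bessel_j0(2 * math.pi * f_d * T)
    std = math.sqrt((1 - rho * rho) / 2)  # per real dimension of e(t)
    h = [complex(rng.gauss(0, math.sqrt(0.5)), rng.gauss(0, math.sqrt(0.5)))]
    for _ in range(n_slots - 1):
        e = complex(rng.gauss(0, std), rng.gauss(0, std))
        h.append(rho * h[-1] + e)
    return rho, h
```

With the 20 ms slots and 10 Hz maximum Doppler frequency used later in Section V, the correlation works out to roughly 0.64, i.e., the channel decorrelates within a few slots.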
The received signal-to-interference-plus-noise ratio (SINR) of link in time slot is a function of the allocation :
where is the additive white Gaussian noise (AWGN) power spectral density (PSD). We assume the same noise PSD in all receivers without loss of generality. The downlink spectral efficiency of link at time can be expressed as:
The transmit power of transmitter in time slot is denoted as . We denote the power allocation of the network in time slot as .
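To make the SINR and spectral-efficiency expressions above concrete, here is a minimal sketch; the container names are illustrative, with `G[j][i]` assumed to be the channel gain from transmitter j to receiver i (diagonal entries are the direct links).

```python
import math

def sinr_and_rates(G, p, noise):
    """Per-link SINR and spectral efficiency for an N-link network.

    G[j][i]: gain from transmitter j to receiver i; p[i]: transmit power
    of link i; noise: AWGN power at every receiver (assumed identical).
    """
    n = len(p)
    sinr, rates = [], []
    for i in range(n):
        interference = sum(G[j][i] * p[j] for j in range(n) if j != i)
        s = G[i][i] * p[i] / (interference + noise)
        sinr.append(s)
        rates.append(math.log2(1.0 + s))  # spectral efficiency in bps/Hz
    return sinr, rates
```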
III Dynamic Power Control
We are interested in maximizing a generic weighted sum-rate objective function. Specifically, the dynamic power allocation problem in slot is formulated as
where is the given nonnegative weight of link in time slot , and is the maximum PSD a transmitter can emit. Hence, the dynamic power allocator has to solve an independent problem in the form of (5) at the beginning of every time slot. In time slot , the optimal power allocation solution is denoted as . Problem (5) is in general non-convex and has been shown to be NP-hard [Luo2008dynamicspectrum].
We consider two special cases. In the first case, the objective is to maximize the sum-rate by letting for all and . In the second case, the weights vary in a controlled manner to ensure proportional fairness [tse2005fundamentals, zhang2011proportionally]. Specifically, at the end of time slot , receiver computes its weighted average spectral efficiency as
where is used to control the impact of history. User updates its link weight as:
This power allocation algorithm maximizes the sum of log-average spectral efficiency [tse2005fundamentals], i.e.,
where a user’s long-term average throughput is proportional to its long-term channel quality in some sense.
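The proportional-fairness bookkeeping can be sketched as below. The exponential-filter convention and the parameter name `beta` are assumptions consistent with the description above; each link's weight is taken as the inverse of its running average spectral efficiency, so starved links receive larger weights in the next slot.

```python
def update_pf_weights(avg_rate, rate, beta=0.01, eps=1e-12):
    """One proportional-fairness step (per-link lists).

    The running average spectral efficiency is an exponential filter in
    which beta controls the impact of history; the weight of each link is
    the reciprocal of its own average (eps guards against division by zero).
    """
    new_avg = [(1 - beta) * a + beta * r for a, r in zip(avg_rate, rate)]
    weights = [1.0 / (a + eps) for a in new_avg]
    return new_avg, weights
```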
We use two popular (suboptimal) power allocation algorithms as benchmarks, namely the WMMSE algorithm [shi2011wmmse] and the FP algorithm [shen2018fractional]. Both are centralized and iterative in their original form. The closed-form FP algorithm used in this paper is formulated in [shen2018fractional, Algorithm 3]. Similarly, a detailed explanation and pseudo code of the WMMSE algorithm are given in [sun2017learning, Algorithm 1]. Both algorithms require full cross-link CSI. The centralized mechanism is suitable for a stationary environment with slowly varying weights and no fast fading. For a network in a non-stationary environment, it is infeasible to instantaneously collect all CSI over a large network.
It is fair to assume that the feedback delay from a receiver to its corresponding transmitter is much smaller than the slot duration , so the prediction error due to the feedback delay is neglected. Therefore, once receiver completes a direct channel measurement, we assume that it is also available at the transmitter .
For the centralized approach, once a link acquires the CSI of its direct channel and all other interfering channels to its receiver, passing this information to a central controller is another burden. This is typically resolved using a backhaul network between the APs and the central controller. The CSI of cross links is usually delayed or even outdated. Furthermore, the central controller can only return the optimal power allocation as the iterative algorithm converges, which is another limitation on the scalability.
Our goal is to design a scalable algorithm, so we limit information exchange to nearby transmitters. We define two neighborhood sets for every : let the set of transmitters whose SNR at receiver was above a certain threshold during the past time slot be denoted as
Let the set of receiver indexes whose SNR from transmitter was above a threshold in slot be denoted as
From link ’s viewpoint, represents the set of “interferers”, whereas represents the set of the “interfered” neighbors.
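A hedged sketch of how these two neighborhood sets could be computed from the previous slot's powers (the function and variable names, and expressing the threshold in dB, are illustrative choices):

```python
def neighbor_sets(G, p, noise, eta_db=5.0):
    """Interferer and interfered sets, cf. (9)-(10).

    Transmitter j is an 'interferer' of link i if its SNR at receiver i,
    G[j][i]*p[j]/noise, exceeds the threshold eta; the 'interfered' set of
    link i collects the links whose receivers hear i above the threshold.
    """
    eta = 10 ** (eta_db / 10)
    n = len(p)
    interferers = {i: {j for j in range(n)
                       if j != i and G[j][i] * p[j] / noise > eta}
                   for i in range(n)}
    interfered = {i: {j for j in range(n) if i in interferers[j]}
                  for i in range(n)}
    return interferers, interfered
```

Note the duality: j is in i's interferer set exactly when i is in j's interfered set, which is what makes the pairwise information exchange of Fig. 1 well defined.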
We next discuss the local information a transmitter possesses at the beginning of time slot . First, we assume that transmitter learns, via receiver feedback, the direct downlink channel gain, . Further, transmitter also learns the current total received interference-plus-noise power at receiver before the global power update, i.e., (as a result of the new gains and the yet-to-be-updated powers). In addition, by the beginning of slot , receiver has informed transmitter of the received power from every interferer , i.e., . These measurements become available at transmitter only just before the beginning of slot . Hence, in the previous slot , receiver also informs transmitter of the outdated versions of these measurements, to be used in the information exchange performed in slot between transmitter and its interferers.
To clarify, as shown in Fig. 1, transmitter has sent the following outdated information to interferer in return for and :
the weight of link , ,
the spectral efficiency of link computed from (4), ,
the direct gain, ,
the received interference power from transmitter , ,
the total interference-plus-noise power at receiver , i.e., .
As assumed earlier, these measurements are accurate, where the uncertainty about the current CSI is entirely due to the latency of information exchange (one slot). By the same token, from every interfered , transmitter also obtains ’s items listed above.
IV Deep Reinforcement Learning for Dynamic Power Allocation
IV-A Overview of Deep Q-Learning
A reinforcement learning agent learns its best policy by observing the rewards of trial-and-error interactions with its environment over time [kaelbling1996reinforcement, sutton1998reinforcement]. Let denote a set of possible states and denote a discrete set of actions. The state is a tuple of the environment's features that are relevant to the problem at hand; it describes the agent's relation with its environment [ghadimi2017dynamicpower]. Assuming discrete time steps, the agent observes the state of its environment, at time step . It then takes an action according to a certain policy . The policy is the probability of taking action conditioned on the current state being . The policy function must satisfy . Once the agent takes an action , its environment moves from the current state to the next state . As a result of this transition, the agent gets a reward that characterizes its benefit from taking action at state . This scheme forms an experience at time , hereby defined as , which describes an interaction with the environment [mnih2015human].
The well-known Q-learning algorithm aims to compute an optimal policy that maximizes a certain expected reward without knowledge of the functional form of the reward and the state transitions. Here we let the return be the future cumulative discounted reward at time :
where is the discount factor for future rewards. In the stationary setting, we define a Q-function associated with a certain policy as the expected reward once action is taken under state [singh2000convergence], i.e.,
As an action value function, the Q-function satisfies a Bellman equation [serrano2010qlearning]:
where is the expected reward of taking action at state , and is the transition probability from given state to state with action . From the fixed-point equation (13), the value of can be recovered from all values of . It has been proved that iterative approaches such as the Q-learning algorithm converge efficiently to the action value function (12) [singh2000convergence]. Clearly, it suffices for an optimal policy to assign probability 1 to the most favorable action. From (13), the optimal Q-function associated with the optimal policy is then expressed as
The classical Q-learning algorithm constructs a lookup table, , as a surrogate of the optimal Q-function. Once this lookup table is randomly initialized, the agent takes actions according to the -greedy policy for each time step. The -greedy policy implies that with probability the agent takes the action that gives the maximum lookup table value for a given current state, whereas it picks a random action with probability to avoid getting stuck at non-optimal policies [mnih2015human]. After acquiring a new experience as a result of the taken action, the Q-learning algorithm updates a corresponding entry of the lookup table according to:
where is the learning rate [singh2000convergence].
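For concreteness, one classical tabular update step, cf. (15), might look like this; the dictionary-of-lists layout of the table and the parameter names are illustrative choices.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.5):
    """One classical Q-learning table update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q is a table mapping each state to a list of action values; alpha is
    the learning rate and gamma the discount factor.
    """
    target = r + gamma * max(Q[s_next])   # bootstrapped target
    Q[s][a] += alpha * (target - Q[s][a]) # move the entry toward the target
    return Q[s][a]
```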
When the state and action spaces are very large, as is the case for the power control problem at hand, the classical Q-learning algorithm fails mainly for two reasons:
Many states are rarely visited, and
the storage of the lookup table in (15) becomes impractical [naparstek2017deep].
Both issues can be solved with deep reinforcement learning, e.g., deep Q-learning [mnih2015human]. A deep neural network called a deep Q-network (DQN) is used to estimate the Q-function in lieu of a lookup table. The DQN can be expressed as , where the real-valued vector represents its parameters. The essence of the DQN is that the function is completely determined by . As such, the task of finding the best Q-function in a functional space of uncountably many dimensions is reduced to searching for the best of finite dimensions. Similar to classical Q-learning, the agent collects experiences through its interaction with the environment. The agent or the network trainer forms a data set by collecting the experiences until time in the form of . As the "quasi-static target network" method [mnih2015human] implies, we define two DQNs: the target DQN with parameters and the train DQN with parameters . is updated to be equal to once every steps. Following "experience replay" [mnih2015human], the least-squares loss of the train DQN for a random mini-batch at time is
where the target is
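In code, the target and mini-batch loss can be sketched as follows. Here `q_train` and `q_target` stand for the train and target DQNs as plain functions from a state to a list of action values; this abstraction is illustrative, not the paper's implementation.

```python
def dqn_loss(q_train, q_target, batch, gamma=0.5):
    """Least-squares loss over a mini-batch with a quasi-static target
    network. Each experience is a tuple (s, a, r, s_next); the target
    y = r + gamma * max_a' Q_target(s', a') is held fixed while the train
    network's prediction Q_train(s)[a] is regressed toward it.
    """
    losses = []
    for s, a, r, s_next in batch:
        y = r + gamma * max(q_target(s_next))
        losses.append((y - q_train(s)[a]) ** 2)
    return sum(losses) / len(losses)
```

Keeping `q_target` frozen between periodic parameter copies is what stabilizes the regression: the target does not chase the train network within a mini-batch.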
IV-B Proposed Multi-Agent Deep Reinforcement Learning Algorithm
As depicted in Fig. 2, we propose a multi-agent deep reinforcement learning scheme with each transmitter as an agent. Similar to [hu1998online], we define the local state of learning agent as , which is composed of environment features that are relevant to agent 's action . In the multi-agent learning system, the state transitions of the common environment depend on the agents' joint actions. An agent's environment transition probabilities in (13) may not be stationary as other learning agents update their policies. The Markov property introduced for the single-agent case in Section IV-A no longer holds in general [nguyen2018multisurvey]. This "environment non-stationarity" issue may cause instability during the learning process. One way to tackle the issue is to train a single meta agent with a DQN that outputs joint actions for all agents [foerster2017stabilising]. The complexity of the state-action space, and consequently the DQN complexity, will then be proportional to the total number of agents in the system. The single meta agent approach is not suitable for our dynamic setup and distributed execution framework, since its DQN can only forward the action assignments to the transmitters after acquiring the global state information. There is extensive research on multi-agent learning frameworks, and several multi-agent Q-learning adaptations exist [tampuu2017multiagent, nguyen2018multisurvey]. However, multi-agent learning is an open research area, and theoretical guarantees for these adaptations are rare and incomplete despite their good empirical performance [tampuu2017multiagent, nguyen2018multisurvey].
In this work, we take an alternative approach where the DQNs are distributively executed at the transmitters, whereas training is centralized to ease implementation and to improve stability. Each agent has the same copy of the DQN with parameters at time slot . The centralized network trainer trains a single DQN by using the experiences gathered from all agents. This significantly reduces the amount of memory and computational resources required by training. The centralized training framework is also similar to the parameter sharing concept which allows the learning algorithm to draw advantage from the fact that agents are learning together for faster convergence [gupta2017cooperative]. Since agents are working collaboratively to maximize the global objective in (5) with an appropriate reward function design to be discussed in Section IV-E, each agent can benefit from experiences of others. Note that sharing the same DQN parameters still allows different behavior among agents, because they execute the same DQN with different local states as input.
As illustrated in Fig. 2, at the beginning of time slot , agent takes action as a function of based on the current decision policy. All agents are synchronized and take their actions at the same time. Prior to taking action, agent has observed the effect of the past actions of its neighbors on its current state, but it has no knowledge of , . From the past experiences, agent is able to acquire an estimation of what is the impact of its own actions on future actions of its neighbors, and it can determine a policy that maximizes its discounted expected future reward with the help of deep Q-learning.
The proposed DQN is a fully-connected deep neural network [watt2016machine, Chapter 5] that consists of five layers, as shown in Fig. (a). The first layer is fed by the input state vector of length . We relegate the detailed design of the state vector elements to Section IV-C. The input layer is followed by three hidden layers with , , and neurons, respectively. At the output layer, each port gives an estimate of the Q-function for the given state input and the corresponding action output. The total number of DQN output ports is denoted as , which is equal to the cardinality of the action set to be described in Section IV-D. The agent finds the action that has the maximum value at the DQN output and takes this action as its transmit power.
In Fig. (a), we also depict the connections between these layers using the weights and biases of the DQN, which form the set of parameters. The total number of scalar parameters in the fully connected DQN is
In addition, Fig. (b) describes the functionality of a single neuron, which applies a non-linear activation function to its combined input.
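The parameter count of a fully connected network with biases is the sum of (n_in + 1) * n_out over consecutive layers. As a hedged illustration, the layer sizes (57, 200, 100, 40, 10) below are hypothetical, but they do reproduce the 36,150 total quoted in Section V:

```python
def num_params(layer_sizes):
    """Scalar parameter count of a fully connected network, cf. (18):
    each layer contributes (n_in + 1) * n_out scalars, where the +1
    accounts for the bias of every output neuron."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
```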
During the training stage, in each time slot, the trainer randomly selects a mini-batch of experiences from an experience-replay memory [mnih2015human] that stores the experiences of all agents. The experience-replay memory is a FIFO queue [yu2017deep] with a length of samples where is the total number of agents, i.e., a new experience replaces the oldest experience in the queue and the queue length is proportional to the number of agents. At time slot the most recent experience from agent is due to delay. Once the trainer picks , it updates the parameters to minimize the loss in (16) using an appropriate optimizer, e.g., the stochastic gradient descent method [lecun2015deep]. As also explained in Fig. 2, once per time slots, the trainer broadcasts the latest trained parameters. The new parameters are available at the agents after time slots due to the transmission delay through the backhaul network. Training may be terminated once the parameters converge.
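A minimal sketch of such a FIFO experience-replay memory with random mini-batch sampling (the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """FIFO experience-replay queue: once capacity is reached, a new
    experience evicts the oldest; training draws a uniform random
    mini-batch from whatever is currently stored."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # deque handles FIFO eviction

    def push(self, experience):
        self.buf.append(experience)

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```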
As described in Section III, agent builds its state using information from the interferer and interfered sets given by (9) and (10), respectively. To better control the complexity, we set , where is the restriction on the number of interferer and interfered neighbors the AP communicates with. At the beginning of time slot , agent sorts its interferers by the current received power from interferer at receiver , i.e., . This sorting process allows agent to prioritize its interferers. As , we want to keep strong interferers, which have a higher impact on agent 's next action. On the other hand, if , agent adds virtual noise agents to to fit the fixed DQN input. A virtual noise agent is assigned an arbitrary negative weight and spectral efficiency. Its downlink and interfering channel gains are taken as zero in order to avoid any impact on agent 's decision-making. The purpose of having these virtual agents as placeholders is to provide inconsequential inputs that fill the input elements of fixed length, like 'padding zeros'. After adding virtual noise agents (if needed), agent takes the first interferers to form . For the interfered neighbors, agent follows a similar procedure, but this time the sorting criterion is agent 's share of the interference at receiver , i.e., , in order to give priority to the interfered neighbors most significantly affected by agent 's interference.
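The sort-truncate-pad procedure for the interferer set can be sketched as below; the placeholder object standing in for a virtual noise agent is an illustrative simplification of the full design (which also assigns it a negative weight and zero gains).

```python
def select_interferers(received_power, c, noise_placeholder=None):
    """Keep the c strongest interferers, padding with 'virtual noise
    agents' when fewer than c exist, so the DQN input length is fixed.

    received_power: {interferer_id: power received at this agent's
    receiver}. The placeholder is a sentinel standing in for a virtual
    agent whose inputs are inconsequential ('padding zeros').
    """
    ranked = sorted(received_power, key=received_power.get, reverse=True)
    chosen = ranked[:c]                                  # strongest first
    chosen += [noise_placeholder] * (c - len(chosen))    # pad if needed
    return chosen
```

The interfered set is handled the same way, with the sorting key replaced by the agent's share of the interference at each neighboring receiver.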
The way we organize the local information to build follows some intuitive and systematic principles. Based on these principles, we refined our design by trial and error with some preliminary simulations. We now describe the state of agent at time slot , i.e., , by dividing it into three main feature groups as:
IV-C1 Local Information
The first element of this feature group is agent 's transmit power during the previous time slot, i.e., . This is followed by the second and third elements, which specify agent 's most recent potential contribution to the network objective (5): and . For the second element, we do not directly use , which tends to be quite large as gets close to zero from (7). We found that using is more desirable. Finally, the last four elements of this feature group are the last two measurements of its direct downlink channel and the total interference-plus-noise power at receiver : , , , and . Hence, a total of seven input ports of the input layer are reserved for this feature group. In our state set design, we take the last two measurements into account to give the agent a better chance to track changes in its environment. Intuitively, the lower the maximum Doppler frequency, the more slowly the environment changes, so having more past measurements helps the agent make better decisions [yu2017deep]. On the other hand, this results in more state information, which may increase the complexity and decrease the learning efficiency. Based on preliminary simulations, we include two past measurements.
IV-C2 Interfering Neighbors
This feature group lets agent observe the interference from its neighbors to receiver and the contribution of these interferers to the objective (5). For each interferer , three input ports are reserved for , , . The first term indicates the interference that agent faces from its interferer ; the other two terms imply the significance of agent in the objective (5). Similar to the local information feature group explained in the previous paragraph, agent also considers the history of its interferers in order to track changes in its own receiver's interference condition. For each interferer , three more input ports are reserved for , , . A total of state elements are reserved for this feature group.
IV-C3 Interfered Neighbors
Finally, agent uses the feedback from its interfered neighbors to gauge its interference to nearby receivers and their contribution to the objective (5). If agent 's link was inactive during the previous time slot, then . In this case, if we ignored the history and directly considered the current interfered neighbor set, the corresponding state elements would be useless. Note that agent 's link became inactive when its own estimated contribution to the objective (5) was not significant enough compared to its interference to its interfered neighbors. Thus, after agent 's link becomes inactive, in order to decide when to reactivate the link, it should keep track of the interfered neighbors that implicitly silenced it. We solve this issue by defining time slot as the last time slot agent was active. The agent carries the feedback from interfered . We also pay attention to the fact that if , interfered has no knowledge of , but it is still able to send its local information to agent . Therefore, agent reserves four elements of its state set for each interfered : , , , and . This makes a total of elements of the state set reserved for the interfered neighbors.
Unlike taking discrete steps on the previous transmit power level (see, e.g., [ghadimi2017dynamicpower]), we use discrete power levels taken between and . All agents have the same action space, i.e., , . Suppose we have discrete power levels. Then, the action set is given by
The total number of DQN output ports, denoted as in Fig. (a), is equal to . Agent is only allowed to pick an action to update its power strategy at time slot . This approach may increase the number of DQN output ports compared to [ghadimi2017dynamicpower], but it improves the robustness of the learning algorithm. For example, as the maximum Doppler frequency or time slot duration increases, the correlation term in (2) decreases and the channel state varies more. This situation may require the agents to react faster, e.g., with a possible transition from zero power to full power, which can be addressed efficiently with an action set composed of discrete power levels.
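One plausible construction of such an action set is sketched below: zero power (link off) plus levels spaced linearly in dB up to the 38 dBm maximum used later in Section V. The spacing rule is an assumption for illustration, not prescribed by the text.

```python
def power_action_set(p_max_dbm=38.0, n_levels=10):
    """Discrete transmit-power action set between 0 and P_max.

    Zero power plus n_levels-1 levels spaced linearly in dB up to
    p_max_dbm, returned in milliwatts in increasing order. The dB
    spacing is an illustrative choice.
    """
    step = p_max_dbm / (n_levels - 1)
    dbm = [p_max_dbm - step * k for k in range(n_levels - 1)]
    mw = [10 ** (d / 10) for d in sorted(dbm)]  # dBm -> mW, ascending
    return [0.0] + mw
```

Including the explicit zero level is what lets an agent silence its link entirely, which the reward design below relies on.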
IV-E Reward Function
The reward function is designed to optimize the network objective (5). We interpret the reward as how the action of agent through time slot , i.e., , affects the weighted sum-rate of its own link and of its future interfered neighbors . During time slot , for each agent , the network trainer calculates the spectral efficiency of each link without the interference from transmitter as
The network trainer computes the term in (20) by simply subtracting from the total interference-plus-noise power at receiver in time slot . As assumed in Section III, since transmitter , its interference to link in slot , i.e., , is accurately measurable by receiver and has been delivered to the network trainer.
In time slot , we account for the externality that link causes to link using a price charged to link for generating interference to link [huang2006distributedpower]:
Then, the reward function of agent at time slot is defined as
The reward of agent consists of two main components: its direct contribution to the network objective (5) and the penalty due to its interference to all interfered neighbors. Evidently, transmitting at peak power maximizes the direct contribution as well as the penalty, whereas being silent earns zero reward.
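Putting (20)-(22) together, the per-agent reward can be sketched as the agent's own weighted rate minus the prices charged by its interfered neighbors (argument names are illustrative):

```python
def agent_reward(w, C, C_without_i, i, interfered):
    """Reward of agent i, cf. (22).

    w[j]: link weights; C[j]: achieved spectral efficiencies;
    C_without_i[j]: link j's spectral efficiency with transmitter i's
    interference removed, cf. (20). The price charged by neighbor j,
    cf. (21), is w[j] * (C_without_i[j] - C[j]): the weighted rate that
    j loses to i's interference.
    """
    penalty = sum(w[j] * (C_without_i[j] - C[j]) for j in interfered)
    return w[i] * C[i] - penalty
```

As noted above, transmitting at peak power inflates both the first term and the penalty, while staying silent makes both terms zero.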
V Simulation Results
V-A Simulation Setup
To begin with, we consider links on homogeneously deployed cells, where we choose to be between 19 and 100. Transmitter is located at the center of cell and receiver is located randomly within the cell. We also discuss the extension of our algorithm to multi-link per cell scenarios in Section V-B. The half transmitter-to-transmitter distance is denoted as and is between 100 and 1000 meters. We also define an inner region of radius in which no receiver is allowed to be placed. We set to be between and meters. Receiver is placed randomly according to a uniform distribution in the area between the inner region of radius and the cell boundary. Fig. 4 shows two network configuration examples.
We set , i.e., the maximum transmit power level of transmitter , to 38 dBm over a 10 MHz frequency band which is fully reusable across all links. The distance-dependent path loss between all transmitters and receivers is simulated by (in dB), where is the transmitter-to-receiver distance in km. This path loss model is compliant with the LTE standard [LTE-A]. The log-normal shadowing standard deviation is taken as 8 dB. The AWGN power is -114 dBm. We set the threshold in (9) and (10) to 5. We assume a full-buffer traffic model. Similar to [zhuang2016energy], if the received SINR is greater than 30 dB, it is capped at 30 dB in the calculation of spectral efficiency by (4). This accounts for typical limitations of finite-precision digital processing. In addition to these parameters, we take the period of the time-slotted system to be 20 ms. Unless otherwise stated, the maximum Doppler frequency is 10 Hz and identical for all receivers.
We next describe the hyper-parameters used for the architecture of our algorithm. Since our goal is to ensure that the agents make their decisions as quickly as possible, we do not over-parameterize the network architecture and we use a relatively small network for training purposes. Our algorithm trains a DQN with one input layer, three hidden layers, and one output layer. The hidden layers have , , and neurons, respectively. We have DQN input ports reserved for the local information feature group explained in Section IV-C. The cardinality constraint on the neighbor sets is 5 agents. Hence, again from Section IV-C, the input ports reserved for the interferer and the interfered neighbors are and , respectively. This makes a total of input ports reserved for the state set. (We also normalize the inputs with some constants depending on , maximum intra-cell path loss, etc., to optimize the performance.) We use ten discrete power levels,abadi2015tensorflow]
For our application, we observed that the rectified linear unit (ReLU) function converges to a desirable power allocation slightly more slowly than the hyperbolic tangent (tanh) function, so we used tanh as the DQN's activation function. The memory parameters at the network trainer, i.e., the mini-batch size and the experience-replay memory size, are 256 and 1000 samples, respectively. We use the RMSProp algorithm [ruder2016overview] with an adaptive learning rate that is gradually reduced during training for a more stable deep Q-learning outcome [lavet2015discountfactor]. We also apply an adaptive ε-greedy algorithm: ε is initialized to 0.2 and decays multiplicatively toward a small positive floor over the course of training.
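The two decay schedules just described share the same multiplicative form; the sketch below uses assumed decay rates and floor, since the exact constants are not reproduced above.

```python
def decayed(value, rate, floor=0.0):
    """One multiplicative decay step, clipped from below.

    The same form serves both the RMSProp learning rate and the
    exploration epsilon; the rates and floor used below are
    illustrative assumptions, not the paper's exact constants.
    """
    return max(value * (1.0 - rate), floor)

# Example schedules over 100 training iterations (hypothetical rates).
lr, eps = 5e-3, 0.2
for _ in range(100):
    lr = decayed(lr, 1e-4)                # learning-rate annealing
    eps = decayed(eps, 1e-4, floor=1e-2)  # epsilon-greedy decay
```

The floor keeps a residual amount of exploration during training; during testing, exploration is switched off entirely, as described below.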
Although the discount factor γ is often chosen close to 1, since increasing it potentially improves the outcome of deep Q-learning in most applications [lavet2015discountfactor], we set γ to 0.5. The reason we use a moderate γ is that, due to fading, the correlation between an agent's actions and its future rewards tends to be small in our application: an agent's action affects its own future reward only through its impact on the interference conditions of its neighbors and the consequences of their unpredictable actions. We observed that a higher γ is not desirable either. It slows the DQN's reaction to channel changes, i.e., hurts the high-Doppler case, and the DQN then converges to a strategy that makes the links with better steady-state channel conditions greedy. Due to fading, the links with poor steady-state channel conditions may become more advantageous for some time slots; a moderate γ helps detect these cases and allows poor links to be activated during the time slots in which they can contribute to the network objective (5). Further, the training cycle duration is 100 time slots. After setting the parameters in (18), the total number of DQN parameters comes to 36,150. Periodically, the parameters trained at the central controller are delivered to all agents via the backhaul network, as explained in Section IV-B. We assume that the parameters are transferred without any compression and that the backhaul network uses a pure peer-to-peer architecture. Taking the delivery interval as 50 time slots, i.e., 1 second, the minimum required downlink/uplink capacity for each backhaul link is about 1 Mbps. Once the training stage is completed, the backhaul links are used only for limited information exchange between neighbors, which requires negligible backhaul link capacity.
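The quoted backhaul figure can be sanity-checked with a quick calculation; the 32-bit word length per parameter is our assumption.

```python
# Sanity check of the quoted backhaul requirement: 36,150 DQN
# parameters, sent uncompressed, delivered within 50 time slots of
# 20 ms each (1 second). A 32-bit word per parameter is assumed.
num_params = 36_150
bits_to_send = num_params * 32        # 1,156,800 bits
delivery_s = 50 * 0.020               # 1 second
rate_mbps = bits_to_send / delivery_s / 1e6
# rate_mbps comes out near 1.16, consistent with "about 1 Mbps" per link.
```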
We empirically validate the functionality of our algorithm. We implemented the proposed algorithm with TensorFlow [abadi2015tensorflow]. Each result is an average of at least 10 randomly initialized simulations. The simulations have two main phases: training and testing. Each training lasts 40,000 time slots (40,000 × 20 ms = 800 seconds), and each testing lasts 5,000 time slots (100 seconds). During the testing, the trainer leaves the network and the ε-greedy algorithm is terminated, i.e., agents stop exploring the environment.
We have five benchmarks to evaluate the performance of our algorithm. The first two are 'ideal WMMSE' and 'ideal FP', which use instantaneous full CSI and centralized execution. The third benchmark is the 'central power allocation' (central), where we introduce a one-time-slot delay on the full CSI and feed it to the FP algorithm. Even a single-time-slot delay in acquiring the full CSI is a generous assumption, but it is a useful way to reflect the potential performance of the negligible-computation-time supervised learning approach introduced in [sun2017learning]. The fourth benchmark is the 'random' allocation, where each agent chooses its transmit power for each slot uniformly at random between zero and the maximum transmit power. The last benchmark is the 'full-power' allocation, where each agent transmits at maximum power in all slots.
V-B Sum-Rate Maximization
In this subsection, we focus on the sum-rate by setting the weights of all network agents to 1 for all time slots.
We fix the number of links and use two approaches to evaluate performance. In the 'matched' DQN approach, we use the first 40,000 time slots to train a DQN from scratch for the given network initialization. In the 'unmatched' DQN approach, we skip this training stage: for the testing (the last 5,000 time slots), we load from memory a DQN that was trained for a different, randomly chosen network initialization with the same network parameters. Here, an unmatched DQN is always trained for a random initialization with 19 links and a maximum Doppler frequency of 10 Hz.
Table I shows that training a DQN from scratch for the specific initialization outperforms both state-of-the-art centralized algorithms, even though the latter operate under ideal conditions such as full CSI and no delay. Interestingly, the unmatched DQN approach converges to the performance of the central power allocation, in which the FP algorithm is fed delayed full CSI; the DQN approach achieves this with distributed execution and incomplete CSI. In addition, training a DQN from scratch enables our algorithm to learn to compensate for CSI delays and to specialize for its network initialization scenario. Training a DQN from scratch converges swiftly, in about 25,000 time slots (shown in Fig. (a)).
Additional simulations with other network parameters taken as variables are summarized in Table II and Table III. As the area of the receiver-free inner region increases, the receivers get closer to the interfering transmitters and interference mitigation becomes more necessary. Hence, the random and full-power allocations show much lower sum-rate performance compared to the central algorithms in that case, while our algorithm still performs well and its convergence rate remains about 25,000 time slots. We also stress the DQN under various scenarios. As we reduce the maximum Doppler frequency, the sum-rate performance remains unchanged, but the convergence time drops to 15,000 time slots. Conversely, when we remove the temporal correlation between current and past channel conditions, convergence takes more than 35,000 time slots. Intuitively, this effect arises because the variety of states visited during the training phase grows as the channel decorrelates faster. Further, the comparable performance of the unmatched DQN and the central power allocation shows the robustness of our algorithm to changes in the interference conditions and fading characteristics of the environment.
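The role of the maximum Doppler frequency in this temporal correlation can be made concrete with the standard Jakes fading model, in which successive channel samples are correlated by rho = J0(2*pi*f_d*T); whether the simulations generate fading in exactly this first-order form is our assumption. With f_d = 10 Hz and T = 20 ms, rho is about 0.64, and rho = 0 is the fully decorrelated case discussed above.

```python
import math, random

def jakes_rho(f_d, T):
    """Successive-sample fading correlation rho = J0(2*pi*f_d*T),
    via the Bessel-function power series (Jakes model; assumed here)."""
    x = 2.0 * math.pi * f_d * T
    term, total = 1.0, 1.0
    for k in range(1, 30):
        term *= -(x * x) / (4.0 * k * k)
        total += term
    return total

def next_fading(h, rho, rng=random):
    """First-order Gauss-Markov update of a Rayleigh fading coefficient."""
    innov = complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)
    return rho * h + math.sqrt(1.0 - rho * rho) * innov
```

Lowering f_d pushes rho toward 1, so consecutive states look alike and training sees less variety per unit time, matching the convergence behavior reported above.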
In this subsection, we increase the total number of links to investigate the scalability of our algorithm. As we increase the number of links to 50, the DQN still converges in 25,000 time slots with high sum-rate performance. As we keep increasing to 100 links, Table IV shows that the matched DQN's sum-rate advantage drops because of the fixed input architecture of the DQN.
Note that each agent only considers a fixed number of interferer and interfered neighbors (5 each, per the cardinality constraint on the neighbor sets). The performance of the DQN can be improved in that case by increasing this neighbor cardinality, at the cost of higher computational complexity. Additionally, the unmatched DQN trained for just 19 links still shows good performance as we increase the number of links.
It is worth pointing out that each agent is able to determine its own action in a matter of milliseconds on a personal computer. Therefore, our algorithm is suitable for dynamic power allocation. In addition, running a single batch takes less than the 20 ms slot duration. Most importantly, because of the fixed architecture of the DQN, increasing the total number of links from 19 to 100 has no impact on these values; it only increases the queue memory at the network trainer. The FP algorithm takes about 15 ms to converge for 19 links, but with 100 links this grows to 35 ms. The WMMSE algorithm converges slightly more slowly, and its convergence time is likewise proportional to the number of links, which limits its scalability.
V-B3 Extendability to Multi-Link per Cell Scenarios and Different Channel Models
In this subsection, we first consider a special homogeneous cell deployment case with co-located transmitters at the cell centers. We also assume that the co-located transmitters within a cell do not perform successive interference cancellation [sun2017learning]. The WMMSE and FP algorithms can be applied to this multi-link per cell scenario without any modifications.
We fix the cell radius and the inner region radius to 500 and 10 meters, respectively. We set the maximum Doppler frequency to 10 Hz and the total number of cells to 19. We first consider two scenarios where each cell has 2 and 4 links, respectively. The third scenario assigns each cell a random number of links, from 1 to 4 per cell, as shown in Fig. (b). The testing stage results for these multi-link per cell scenarios are given in Table V. As shown in Table VI, we further test these scenarios using a different channel model, the urban micro-cell (UMi) street canyon model of [tr38901]. For this model, we take the carrier frequency as 1 GHz. The transmitter and receiver antenna heights are assumed to be 10 and 1.5 meters, respectively.
Our simulations for these scenarios show that as we increase the number of links per cell, the training stage still converges in about 25,000 time slots. Fig. (a) shows the convergence rate of the training stage for the 4 links per cell scenario with 76 links; it also shows that using a different channel model, i.e., UMi street canyon, does not affect the convergence rate. Although the convergence rate is unaffected, the proposed algorithm's average sum-rate performance decreases as we increase the number of links per cell. Our algorithm still outperforms the centralized algorithms even in the 4 links per cell scenario, for both channel models. Another interesting observation is that although the unmatched DQN was trained for a single-link deployment scenario and cannot handle the delayed CSI constraint as well as the matched DQN, it gives performance comparable to the 'central' case. Thus, the unmatched DQN is capable of finding good estimates of the optimal actions for unseen local state inputs.
V-C Proportionally Fair Scheduling
In this subsection, we change the link weights according to (7) to ensure fairness as described in Section III. We choose the averaging term in (6) to be 0.01 and use convergence to the objective in (8) as the performance metric of the DQN. We also make some additions to the training and testing stages of the DQN. The link weights require initialization: we let all transmitters serve their receivers with full power at t = 0 and initialize the weights according to the initial spectral efficiencies computed from (4). For the testing stage, we reinitialize the weights after the first 40,000 slots to see whether the trained DQN can achieve fairness as fast as the centralized algorithms.
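A minimal sketch of this weight bookkeeping follows, with the exponentially weighted averaging form of (6) assumed (it is the standard proportional-fair update) and the averaging term set to 0.01 as stated.

```python
def pf_update(avg_rate, inst_rate, beta=0.01):
    """Exponentially weighted average spectral efficiency, cf. (6);
    beta = 0.01 per the text. The EMA form itself is the standard
    proportional-fair update and is assumed here."""
    return (1.0 - beta) * avg_rate + beta * inst_rate

def pf_weight(avg_rate):
    """Proportionally fair link weight: inverse average rate, cf. (7)."""
    return 1.0 / avg_rate

# Initialization as in the text: full-power transmission at t = 0 yields
# the initial spectral efficiencies, which seed the averages.
avg = 4.0                       # hypothetical initial spectral efficiency
for inst in (4.0, 4.0, 0.0):    # link served twice, then muted once
    avg = pf_update(avg, inst)
weight = pf_weight(avg)         # grows as the link is starved
```

A link that goes unserved sees its average rate decay and its weight rise, so the scheduler is progressively pushed to serve it; this is the mechanism behind the fairness convergence examined next.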
As shown in Fig. 7, the training stage converges to a desirable scheduling in about 30,000 time slots. Once the network is trained and we reinitialize the link weights, our algorithm converges to an optimal scheduling in a distributed fashion as fast as the centralized algorithms. Next, we vary the network parameters to obtain the results in Table VII and Table VIII. The DQN trained from scratch still outperforms the centralized algorithms in most initializations, and using the unmatched DQN also achieves high performance, as in the previous sections.
VI Conclusion and Future Work
In this paper, we have proposed a distributively executed, model-free power allocation algorithm that outperforms, or achieves comparable performance to, existing state-of-the-art centralized algorithms. We see potential in applying reinforcement learning techniques to various dynamic wireless network resource management tasks in place of optimization techniques. The proposed approach returns a new suboptimal power allocation much more quickly than the two popular centralized algorithms taken as benchmarks in this paper. In addition, using only limited local CSI under realistic practical constraints, our deep Q-learning approach usually outperforms the generic WMMSE and FP algorithms, which require full CSI, an often impractical requirement. Unlike most advanced optimization-based power control algorithms, e.g., WMMSE and FP, which require instant and accurate measurements of individual channel gains, our algorithm only requires accurate measurements of some delayed received power values that exceed a certain threshold above the noise level. An extension to the imperfect-CSI case with inaccurate measurements is left for future work.
Meng et al. [meng2018deepmulti] extended our preprint [nasir2018deep] to multiple users per cell, a setting that is also addressed in the current paper. Although the centralized training phase may seem to limit the scalability of the proposed algorithm, we have shown that a DQN trained for a smaller wireless network can be applied to a larger one, and that the training of a DQN can be jump-started with initial parameters taken from another DQN previously trained for a different setup.
Finally, we used global training in this paper; reinitializing a local training over regions where new links have joined or performance has dropped below a certain threshold is an interesting direction to consider. Beyond regional training, completely distributed training can also be considered. While a centralized training approach saves computational resources and converges faster, distributed training may pave the way for extending the proposed algorithm to other deployment scenarios that involve mobile users. The main hurdle in applying distributed training is avoiding the instability caused by environment non-stationarity.
We thank Dr. Mingyi Hong, Dr. Wei Yu, Dr. Georgios Giannakis, and Dr. Gang Qian for stimulating discussions.