Deep Reinforcement Learning Based Power control for Wireless Multicast Systems

09/27/2019 ∙ by Ramkumar Raghu, et al. ∙ 0

We consider a multicast scheme recently proposed for a wireless downlink in [1]. It was shown earlier that power control can significantly improve its performance. However, for this system, obtaining the optimal power control is intractable because of a very large state space. Therefore, in this paper we use deep reinforcement learning, approximating the Q-function with a deep neural network. We show that optimal power control can be learnt for reasonably large systems via this approach. The average power constraint is enforced via a Lagrange multiplier, which is also learnt. Finally, we demonstrate that a slight modification of the learning algorithm allows the optimal control to track time-varying system statistics.




I Introduction

Wireless networks are being constantly refined to cater for seamless delivery of huge amounts of data to end users. With increasing user-generated content and the proliferation of social networking sites, a large share of mobile data traffic is expected to be due to mobile video [7]. Also, the traffic for this content contains many redundant requests [5]. Thus, multicasting is a natural way to address these requests.

A multicast queue with network coding is studied in [18, 10] with an infinite library of files. The case of broadcast systems with one server transmitting to multiple users is studied in [8, 28]. Both of these works study a slotted system. Some recent works [16] use coded caching to achieve multicast. This approach uses local information in the user caches to decode the coded transmission and improves throughput by increasing the effective number of files transferred per transmission. This throughput may be reduced in a practical scenario due to queueing delays at the base station/server. The work in [23] addresses these issues, analyses the queueing delays and compares them with an alternate coded scheme with LRU caches (CDLS), which improves over the coded schemes in [16]. A more recent work in this direction provides alternate multicast schemes and analyses queueing delays for such multicast systems [21]. The authors show that a simple multicast scheme can have significant gains over the schemes in [16], [23] in the high traffic regime.

In this paper, we further study the multicast scheme proposed in [21]. This multicast queue merges the requests for a given file from different users that arrive during the waiting time of the initial requests. The merged requests are then served simultaneously. The gains achieved by this simple multicast scheme, however, are quickly lost in wireless channels due to fading. The work in [20] addresses this issue and provides several multicast queueing schemes to improve the average user delays. It also shows that these schemes, combined with an optimal power control policy under an average power constraint, can significantly reduce delays.

The power control policy proposed in [20], though it improves delays, has the following limitations:

  • The algorithm to get the policy is not scalable with the number of users and the number of states of the channel gains.

  • The policy does not adapt to the changing system statistics, which in turn depend on the policy.

These systems are often conveniently modelled as a Markov Decision Process (MDP), but with large state and action spaces. Obtaining transition probabilities and the optimal policy for such large MDPs is not feasible. Reinforcement learning, particularly deep reinforcement learning, is a natural tool to address such problems. Reinforcement learning has the added advantage that it can be used even when the transition probabilities are not available. However, a large state/action space can still be an issue. Function approximation via deep neural networks can provide significant gains, since the Q-values of different state-action pairs can be interpolated even if a state-action pair has never or rarely occurred in the past. Several deep reinforcement learning techniques, such as Deep Q-Network (DQN) [17], Trust Region Policy Optimization (TRPO) [25] and Proximal Policy Optimization (PPO) [26], have been successfully applied to large state-space dynamical systems such as Atari [31] and AlphaGo [27]. DQN is one of the first deep RL methods based on value iteration, usually employing $\epsilon$-greedy exploration to learn the optimal policy. TRPO and PPO are policy-gradient methods that employ stochastic gradient descent over the policy space to obtain the optimal value function. Policy-gradient methods often suffer from high variance in sample estimates and poor sample efficiency [13]. Value-iteration-based deep RL methods, like DQN, have been theoretically shown to have competitive performance [33], specifically due to the sample efficiency of experience replay [15].

In addition to the above-mentioned trade-offs, a constrained stochastic optimization problem, as considered in this paper, adds further complexity. A modification of TRPO for constrained optimization is Constrained Policy Optimization [1], but this too suffers from high estimator variance. The work in [29] considers a multi-timescale approach for constrained deep RL problems, as considered in this paper. However, [29] does not track the system statistics and hence cannot be applied in practical systems. We therefore propose a constrained optimization variant of DQN based on multi-timescale stochastic gradient descent [4]. We prefer DQN in this work because the target network and replay memory used in DQN reduce the estimator variance and achieve the global minimum of the empirical risk [33].

The major contributions of this paper are:

  • Proposing two modifications to DQN, to accommodate constraints and system adaptations. We call this Adaptive Constrained DQN (AC-DQN).

  • Unlike DQN, constrained DQN can be applied to multicast systems with constraints, as in [20], to learn the optimal power control policy online. The constraints are met by using a Lagrange multiplier, which is itself learnt via a two-timescale stochastic gradient descent. The proposed method meets the average power constraint while achieving the global optimum achieved by the static policy proposed in [20].

  • We demonstrate the scalability of our algorithms with system size (number of users, arrival rate, complex fading).

  • We show that AC-DQN can track the changes in the dynamics of the system, e.g., change of rate of arrival over the time of a day, and achieve optimal performance.

Our algorithms work equally well when we replace DQN with its improvements such as DDQN [30]. In fact, we have run our simulations with the DDQN variant of AC-DQN and achieved similar performance. Next, we describe some further work related to this paper.

Power control in Multicast Systems: Power control in multicast systems has been studied in [11, 32, 12]. In [11], optimal power allocation is made to achieve the ergodic capacity (defined as the boundary of the region of all achievable rates in a fading channel with arbitrarily small error probability) while maintaining minimum rate requirements at the users and average power constraints. The authors use water-filling to achieve the optimal policy. In [32], the authors minimize a utility function via linear programming, under SINR constraints at the users and transmit power constraints at the transmitter. Both [11, 32] derive an optimal power control policy for delivery to all the users, whereas this paper considers delivery to a random subset of users. In [12], each packet has a deadline and packets not received by the end of the slot are discarded. The authors use dynamic programming to obtain the optimal policies.

Deep Reinforcement Learning (DeepRL) in Wireless Multicast systems: The ability of DeepRL to handle large state-space dynamic systems is being exploited in various multicast wireless systems/networks. In [34], the authors look at the resource allocation problem in unicast and broadcast. The DeepRL agent learns and selects the power and frequency for each channel to improve the rate under some latency constraints. The authors, as in our work, introduce constraints via a Lagrange multiplier. However, the agent does not learn the Lagrange multiplier. Thus, the agent also does not adapt if the system dynamics change, since the Lagrange constant in the reward is fixed for given dynamics and the learning rate decays with time. Another work, [19], applies unconstrained deep reinforcement learning to multiple transmitters for a proportionally fair scheduling policy by adjusting individual transmit powers. Some studies [9] have applied DeepRL to control power for anti-jamming systems.

The rest of the paper is organised as follows. Section II explains the system model and motivates the problem. Section III presents the proposed DeepRL algorithm, AC-DQN. Section IV demonstrates our algorithms via simulations and Section V concludes the paper.

II System Model

We consider a system with one server transmitting files from a fixed finite library to a set of users (Figure 1). We denote the set of users by and the set of files by . We assume that . The request process for file from user is a Poisson process of rate which is independent of the request processes of other files from user and also from other users. The total arrival rate is . The requests of a file from each user are queued at the server till the user successfully receives the file. All the files are of length bits. The server transmits at a fixed rate, bits/sec. Thus, the transmission time for each file is .

The channels between the server and the users experience time varying fading. The channel gain of each user is assumed to be constant during transmission of a file. The channel gain for the user at the transmission, is represented by . Each takes values in a finite set and form an independent identically distributed (i.i.d) sequence in time, as in [24]. The channel gains of different users are independent of each other and may have different distributions. Let .


Figure 1: System model

More details of the system are described in the following subsections. Section II-A describes the basic multicast queue proposed in [21]. The scheduling scheme studied in [20] to mitigate the effects of fading is also presented. In Sections II-B and II-C we summarise the results from [20], which show that power control can further improve performance, and the algorithm used to obtain the optimal power policy. We will see that this algorithm is not scalable. Then, in Section II-D, we provide the MDP formulation of the power control problem. In Section III we present the scalable DeepRL solution for this formulation.

II-A Multicast Queue

For scheduling of transmissions at the server, we consider the multicast queue studied in [20]. In this system, the requests for different files from different users are queued in a single queue, called the multicast queue. In this queue, the requests for file $i$ from all users are merged and considered as a single request. The requested file and the set of users requesting it are denoted by $(i, \mathcal{L}_i)$. A new request for file $i$ from user $j$ is merged with the corresponding entry $(i, \mathcal{L}_i)$, if it exists. Otherwise, it is appended to the tail of the queue. Service/transmission of file $i$ serves all the users in $\mathcal{L}_i$, possibly with errors due to channel fading. The random subset of users served by the multicast queue at the $t$-th transmission is denoted by the random binary vector $V(t) = (V_1(t), \ldots, V_L(t))$ for $L$ users, where $V_j(t) = 1$ implies that user $j$ has requested the file being transmitted; otherwise, $V_j(t) = 0$. From [Theorem 1, [21]], $\{V(t)\}$ has a unique stationary distribution.

It was shown in [21] that the above multicast queue performs much better than the multicast queues proposed earlier in the literature. The main difference from previous multicast schemes is that in this scheme all requests from all users for a given file are merged together over time. One direct consequence is that the queue length at the base station does not exceed the number of files in the library. Thus the delay is bounded for all traffic rates. In fact, the mean delays are often better than those of the coded caching schemes proposed in the literature for most traffic conditions. However, in a fading scenario, where different users have independent fading, the performance of this scheme can deteriorate significantly because of the multiple retransmissions required to successfully deliver the file to all requesting users. Thus, in [20], multiple retransmission schemes were proposed and compared to recover the performance of the system. The following scheme was among the best. It not only (almost) minimizes the overall mean delay of the system, it is also fair to different users, in the sense that users with good channel gains do not suffer due to users with bad channel gains.

Single queue with loop-back (1-LB): The multicast queue is serviced from head to tail. When a file is transmitted, some of the users will receive the file successfully and some users may receive it with errors. In the case of unsuccessful reception by some users, the file is retransmitted, up to a fixed maximum number of transmission attempts. If some users have still not received the file after these attempts, the request (the tuple, now modified to contain only the set of users who have not received the file successfully) is fed back to the queue. If there is another pending request in the queue for the same file (a request that arrived during the current transmission), the fed-back request is merged with it. Otherwise, a new request for the same file with the unsuccessful users is inserted at the tail of the queue.
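The merging and loop-back rules described above can be illustrated with a minimal Python sketch. The class name `MulticastQueue`, the helper `serve_head`, and the `succeeds(user, attempt)` callback are hypothetical names introduced here for illustration only; they are not taken from [20] or [21], and the channel is abstracted into that callback.

```python
from collections import OrderedDict


class MulticastQueue:
    """Single multicast queue: one entry per file, holding the set of waiting users."""

    def __init__(self):
        # Maps file id -> set of waiting users, ordered by first request.
        self._entries = OrderedDict()

    def add_request(self, file_id, user):
        # Merge with the existing entry for this file, else append at the tail.
        self._entries.setdefault(file_id, set()).add(user)

    def pop_head(self):
        # Remove and return the head-of-line (file_id, users) entry.
        return self._entries.popitem(last=False)

    def loop_back(self, file_id, unserved_users):
        # Merge unserved users into a pending entry for the same file,
        # or re-queue them at the tail if no such entry exists.
        if unserved_users:
            self._entries.setdefault(file_id, set()).update(unserved_users)

    def __len__(self):
        return len(self._entries)


def serve_head(queue, succeeds, max_attempts):
    """Serve the head entry with up to max_attempts transmissions, then loop back.

    succeeds(user, attempt) is a hypothetical callback that returns True when
    the given user decodes the file on the given attempt.
    """
    file_id, pending = queue.pop_head()
    for attempt in range(max_attempts):
        pending = {user for user in pending if not succeeds(user, attempt)}
        if not pending:
            break
    queue.loop_back(file_id, pending)
    return file_id, pending
```

Because there is at most one entry per file, the queue length in this sketch can never exceed the number of distinct files, which mirrors the bounded-queue property noted earlier.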

It was further shown in [20] that choosing the transmit power based on the channel gains can further improve the system performance.

II-B Average Power Constraint

Depending on the value of the request vector $V(t)$ and the channel gains $H(t) = (H_1(t), \ldots, H_L(t))$ at time $t$, the server chooses the transmit power $P_t$ based on a power control policy $\pi$. Choosing a good power control policy is the topic of this paper.

The state $s_t$ of the system at time $t$ is $(V(t), H(t))$. Let $P_t$ be the power chosen by the policy $\pi$ for state $s_t$, and let $r_t$ be the number of successful transmissions for the selected power $P_t$ during the $t$-th service. For a fixed transmission rate $R$ and a given channel gain $H_j(t)$, the transmit power requirement (from Shannon's formula) for user $j$ is (assuming the file length is long enough)

$$P_j^{req}(t) = \frac{N_j}{H_j(t)} \left( 2^{R/B} - 1 \right), \qquad (1)$$

where $B$ is the bandwidth and $N_j$ is the Gaussian noise power at receiver $j$. Thus the reward for the chosen power control policy during the $t$-th transmission is given by

$$r_t = \sum_{j=1}^{L} V_j(t) \, \mathbb{1}\{ P_t \ge P_j^{req}(t) \}, \qquad (2)$$

where $V_j(t) = 1$ if user $j$ has requested the file in service and $V_j(t) = 0$ otherwise. We now describe the Mesh Adaptive Direct Search (MADS) power control policy.
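As a rough numerical illustration of the power requirement and the reward, the sketch below inverts Shannon's formula for the minimum transmit power and counts the requesting users whose requirement is met. The function names and the treatment of the channel gain as a power gain are assumptions made for this example, not details confirmed by the text.

```python
def required_power(gain, rate, bandwidth, noise_power):
    """Minimum power P with bandwidth * log2(1 + P * gain / noise_power) >= rate.

    Assumes `gain` is a power gain (an assumption of this sketch).
    """
    return noise_power / gain * (2.0 ** (rate / bandwidth) - 1.0)


def reward(tx_power, requested, gains, rate, bandwidth, noise_power):
    """Number of requesting users whose power requirement is satisfied."""
    return sum(
        1
        for wants, gain in zip(requested, gains)
        if wants and tx_power >= required_power(gain, rate, bandwidth, noise_power)
    )


# Illustrative values only: rate equal to bandwidth, unit noise power.
requested = [1, 1, 0]      # users 1 and 2 requested the file in service
gains = [0.1, 0.5, 0.9]    # user 1 has a poor channel
print(reward(5.0, requested, gains, 1.0, 1.0, 1.0))
print(reward(12.0, requested, gains, 1.0, 1.0, 1.0))
```

With these illustrative numbers, serving the user with gain 0.1 costs five times the power needed for the user with gain 0.5, which is the trade-off a power control policy has to weigh against the average power budget.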

II-C MADS Power control policy

The power control policy in [20] is derived from the following optimization problem,

$$\max_{P_1, \ldots, P_N} \; \sum_{s=1}^{N} \mu(s) \, r(s, P_s) \quad \text{subject to} \quad \sum_{s=1}^{N} \mu(s) \, P_s \le \bar{P}, \qquad (3)$$

where $\bar{P}$ is the average power constraint, $N$ is the total number of states, $P_s$ is the power chosen by the policy in state $s$, $\mu(s)$ is the stationary distribution of state $s$ and $r(s, P_s)$ is the reward for state $s$, as defined in (2) with $P_t = P_s$. This is a non-convex optimization problem. Mesh Adaptive Direct Search (MADS) [3] is used in [20] to solve this constrained optimization problem and obtain the power control policy. Though MADS achieves the global optimum, it is not scalable, as its computational complexity is very high.

The state space and action space of this problem can be very large even for a moderate number of users and channel gains, e.g., a system with $L$ users and $G$ channel gain states has on the order of $(2G)^L$ states. Therefore, in this paper we propose a deep reinforcement learning framework. This not only provides the optimal solution for a reasonably large system but does so without knowing the arrival rates and channel gain statistics. In addition, we will be able to provide an optimal solution even when the arrival and channel gain statistics change with time.

II-D MDP Formulation

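The above system can be formulated as a finite state, finite action Markov Decision Process denoted by the tuple $(\mathcal{S}, \mathcal{A}, r, \mathbb{P}, \gamma)$: (state space, action space, reward, transition probability, discount factor), where the transition probability is $\mathbb{P}(s_{t+1} \mid s_t, P_t)$, the policy $\pi$ chooses power $P_t$ in state $s_t$, and the instantaneous reward is $r_t$. The action-value function [22] for this discounted MDP for policy $\pi$ is

$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s, \, a_0 = a, \, \pi \right]. \qquad (4)$$

The optimal $Q$-function, $Q^{*}$, is given by $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ and satisfies the optimality relation,

$$Q^{*}(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right], \qquad (5)$$

where $s'$ is sampled with distribution $\mathbb{P}(\cdot \mid s, a)$. If we know the optimal Q-function $Q^{*}$, we can compute the optimal policy via $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$. We know the transition matrix of this system and hence can compute the $Q$-function. But the state space is very large even for a small number of users, rendering the computations infeasible. Thus, we use a parametric function approximation of the Q-function via deep neural networks and use DeepRL algorithms to get the optimal $Q^{*}$.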

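Further, to introduce the constraint in the MDP formulation, we look at the policies achieving

$$\max_{\pi} \; Q^{\pi}(s, a) \qquad (6)$$

subject to

$$\bar{P}_{\pi} \le \bar{P}, \qquad (7)$$

where

$$\bar{P}_{\pi} = \limsup_{T \to \infty} \frac{1}{T} \, \mathbb{E}\left[ \sum_{t=1}^{T} P_t \right] \qquad (8)$$

is the long-term average power and $\bar{P}$ is the average power constraint. We use the Lagrange method for constrained MDPs [2] to achieve the optimal policy. In this method, the instantaneous reward is modified as

$$r_t^{\beta} = r_t - \beta P_t, \qquad (9)$$

where $\beta$ is the Lagrange constant achieving the optimal $Q^{*}$ while maintaining $\bar{P}_{\pi} \le \bar{P}$. Choosing $\beta$ wrongly will yield a policy that is optimal for an average power constraint different from $\bar{P}$.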

III Deep Reinforcement Learning based Power Control Policy

In this section, we describe the Deep-Q-Network (DQN) [17] based power control. First we describe the DQN algorithm. We then propose a variant of DQN for constrained problems, wherein we use a Lagrange multiplier to take care of the average power constraint. We use a multi-timescale stochastic gradient descent approach to learn the Lagrange multiplier so that the average power constraint is met. Finally, we change the learning step size from decreasing to constant so that the optimal power control can track time-varying system statistics.

III-A Deep Q Networks

DQN is a popular deep reinforcement learning algorithm for handling large state-space MDPs with unknown/complex dynamics $\mathbb{P}$. DQN is a value-iteration-based method in which the action-value function is approximated by a neural network. Though there are several follow-up works providing improvements over this algorithm [14, 30], we use this algorithm owing to its simplicity. We will show that DQN itself is able to provide the optimal solution and tracking. These improvements may further improve the performance in terms of sample efficiency, estimator variance, etc. The DQN algorithm is given in Algorithm 1. Earlier attempts at combining nonlinear function approximators such as neural networks with RL were unsuccessful due to instabilities caused by 1) correlated training samples, 2) drastic changes in policy with small changes in the function approximation, and 3) correlation between the training function and the approximated function [13]. The success of DQN is attributed to addressing these issues with two key ingredients of the algorithm: the experience replay memory $\mathcal{D}$ and the target network $Q_{\theta^{-}}$. The replay memory stores the transitions of the MDP, specifically the tuple $(s_t, a_t, r_t, s_{t+1})$. The algorithm then samples, uniformly, a random minibatch of transitions from the memory. This removes correlation between the data and smooths the change in the data distribution across iterations. The target network $Q_{\theta^{-}}$ and the randomly sampled minibatch from the memory $\mathcal{D}$ form the training set for training the $Q$-network $Q_{\theta}$ at every epoch. This random sampling provides $n$ samples for performing stochastic gradient descent with loss

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - Q_{\theta}(s_i, a_i) \right)^2, \qquad (10)$$

where $y_i = r_i + \gamma \max_{a} Q_{\theta^{-}}(s_i', a)$. The iterates are given by:

$$\theta_{k+1} = \theta_k - \alpha_k \nabla_{\theta} J(\theta_k), \qquad (11)$$

where the step size $\alpha_k$ satisfies:

$$\sum_{k} \alpha_k = \infty, \qquad \sum_{k} \alpha_k^2 < \infty. \qquad (12)$$

The weights of the target network are held constant for $C$ epochs, thereby controlling any drastic change in policy and reducing the correlation between $Q_{\theta}$ and $Q_{\theta^{-}}$. This can be seen as a risk minimization problem in nonparametric regression with regression function $\mathbb{E}[y \mid s, a]$ and risk $J(\theta)$. Readers are referred to [33] for an elaborate analysis of DQN. Theorem 4.4 in [33] provides a proof of convergence and the rate of convergence using nonparametric regression bounds, when sparse ReLU networks are used, under certain smoothness assumptions on the reward function and the dynamics.

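Input: MDP $(\mathcal{S}, \mathcal{A}, r, \mathbb{P}, \gamma)$, Replay Memory $\mathcal{D}$, Minibatch size: $n$, Target update period: $C$, Initialize weights $\theta$ of $Q_{\theta}$ and $\theta^{-} = \theta$ of $Q_{\theta^{-}}$, $\epsilon$: Exploration Parameter, $\alpha_t$: Learning rate satisfying (12)
for $t = 1$ to $T$ do
       Observe state $s_t$, Apply action $a_t$, $\epsilon$-greedily
       Store: $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
       Sample: Minibatch $\{(s_i, a_i, r_i, s_i')\}_{i=1}^{n}$ from $\mathcal{D}$
       for $i = 1$ to $n$ do
              $y_i = r_i + \gamma \max_{a} Q_{\theta^{-}}(s_i', a)$
       end for
       $\theta \leftarrow \theta - \alpha_t \nabla_{\theta} J(\theta)$, with $J$ as in (10)
       at every $C$ steps: update $\theta^{-} \leftarrow \theta$
end for
Output: $Q_{\theta}$: Optimal $Q$-Function, $\pi(s) = \arg\max_{a} Q_{\theta}(s, a)$: Optimal Policy
Algorithm 1 Deep-Q-Network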
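The three mechanical ingredients of Algorithm 1 (replay memory, $\epsilon$-greedy action selection and target computation with a frozen target network) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the names `ReplayMemory`, `epsilon_greedy` and `td_targets` are hypothetical, and the target network is represented by any callable `target_q(state)` returning a list of Q-values, rather than by an actual neural network.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # The oldest transitions are dropped automatically once capacity is reached.
        self._buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self._buffer), min(batch_size, len(self._buffer)))

    def __len__(self):
        return len(self._buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_targets(batch, target_q, gamma):
    """Regression targets y = r + gamma * max_a Q_target(s', a) for a minibatch."""
    return [
        reward + gamma * max(target_q(next_state))
        for (_state, _action, reward, next_state) in batch
    ]
```

In a full implementation the targets would be fed to a gradient step on the online network, and the target network would be synchronised with it only every few steps; both of those steps depend on the chosen deep learning library and are omitted here.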

III-B Adaptive Constrained DQN (AC-DQN)

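The DQN algorithm is meant for unconstrained optimization. Since our problem has an average power constraint of $\bar{P}$, we consider the instantaneous reward in (9), with a Lagrange multiplier $\beta$. The long-term constraint depends on the Lagrange multiplier and can be quite sensitive to it. Thus, we design our algorithm, AC-DQN, to learn the appropriate $\beta$. We will see later that this enables us to further modify our algorithm to track the changing statistics of the channel gains and arrivals. The AC-DQN algorithm is given in Algorithm 2. Here, we use multi-timescale SGD as in [4]. In this approach, in addition to the SGD on the network weights $\theta$ with step size $\alpha_t$, using the minibatch, we use a stochastic gradient descent on the Lagrange constant, as

$$\beta_{t+1} = \beta_t + \eta_t \left( \hat{P}_t - \bar{P} \right), \qquad (13)$$

where $\hat{P}_t$ is an estimate of the long-term average power in (8). Since the expectation in (8) is not available to us, we take $\hat{P}_t = \frac{1}{W} \sum_{k=t-W+1}^{t} P_k$, where $W$ is the finite horizon window. Additionally, $\alpha_t$ and $\eta_t$ are required to follow [4]:

$$\sum_{t} \alpha_t = \sum_{t} \eta_t = \infty, \qquad \sum_{t} \left( \alpha_t^2 + \eta_t^2 \right) < \infty, \qquad \frac{\eta_t}{\alpha_t} \to 0. \qquad (14)$$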

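Input: MDP $(\mathcal{S}, \mathcal{A}, r^{\beta}, \mathbb{P}, \gamma)$, $r^{\beta}$ as in (9), Replay Memory $\mathcal{D}$, Minibatch size: $n$, Target update period: $C$, Initialize weights $\theta$ of $Q_{\theta}$ and $\theta^{-} = \theta$ of $Q_{\theta^{-}}$, $\epsilon$: Exploration Parameter, $\beta$: Lagrange Constant, $\alpha_t$: Value learning rate, $\eta_t$: Lagrange learning rate satisfying (14), Initialize $\beta_0 \ge 0$
for $t = 1$ to $T$ do
       Observe state $s_t$, Apply action $a_t$, $\epsilon$-greedily
       Store: $(s_t, a_t, r_t^{\beta}, s_{t+1})$ in $\mathcal{D}$
       Sample: Minibatch $\{(s_i, a_i, r_i^{\beta}, s_i')\}_{i=1}^{n}$ from $\mathcal{D}$
       for $i = 1$ to $n$ do
              $y_i = r_i^{\beta} + \gamma \max_{a} Q_{\theta^{-}}(s_i', a)$
       end for
      Perform two time-scale stochastic gradient descent as follows:
              $\theta \leftarrow \theta - \alpha_t \nabla_{\theta} J(\theta)$, with $J$ as in (10)
              $\beta \leftarrow \beta + \eta_t (\hat{P}_t - \bar{P})$, as in (13)
       at every $C$ steps: update $\theta^{-} \leftarrow \theta$
end for
Output: $Q_{\theta}$: Optimal $Q$-Function, $\pi(s) = \arg\max_{a} Q_{\theta}(s, a)$: Optimal Policy
Algorithm 2 Adaptive Constrained DQN (AC-DQN) Algorithm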

Tracking with AC-DQN: Tracking of system statistics is essential to achieve optimal power control in a non-stationary system. In multi-timescale stochastic gradient descent, such as AC-DQN, the step sizes $\alpha$ (for the $Q$-network weights) and $\eta$ (for the Lagrange multiplier) can be fixed to enable tracking. If $\eta \ll \alpha$, then the Lagrange multiplier changes much more slowly than the $Q$-function. The two-timescale theory (see, e.g., [4]) then allows the Lagrange multiplier to adapt slowly to the changing system statistics while still providing average power control. The solution will reach a neighbourhood of the optimal point.
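The slow-timescale part of this scheme, the Lagrange multiplier update with a constant step size, can be sketched as follows. Everything here is illustrative: the function names are hypothetical, the projection of the multiplier onto non-negative values is an assumption of this sketch, and the learned policy is replaced by a toy function `avg_power_of(beta)` that returns the average power the policy would use for a given multiplier.

```python
def update_multiplier(beta, avg_power, power_budget, step_size):
    """One multiplier step: raise beta when over budget, lower it when under.

    Clipping at zero is an assumption of this sketch.
    """
    return max(0.0, beta + step_size * (avg_power - power_budget))


def penalized_reward(successes, tx_power, beta):
    """Lagrangian reward: successful deliveries minus beta times the power used."""
    return successes - beta * tx_power


def track_multiplier(avg_power_of, power_budget, step_size, steps, beta=0.0):
    """Iterate the multiplier update against a toy model of the policy's power use."""
    for _ in range(steps):
        beta = update_multiplier(beta, avg_power_of(beta), power_budget, step_size)
    return beta


# Toy environment: a larger multiplier makes the policy spend less power.
beta = track_multiplier(lambda b: 10.0 / (1.0 + b), 5.0, 0.05, 2000)
# The statistics change abruptly; the constant step size keeps adapting.
beta_after_change = track_multiplier(lambda b: 20.0 / (1.0 + b), 5.0, 0.05, 2000, beta)
```

In this toy model the multiplier settles where the average power equals the budget, and after the abrupt change it moves to the new balance point because the step size never decays. With a decaying step size the later updates would shrink towards zero, which is the failure mode that constant step sizes are meant to avoid.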

Although the convergence of this modified algorithm is not proved yet (even for the unconstrained DQN, convergence has been proved only recently in [33]), our simulations will show that the resulting algorithm tracks the optimal solution in the time varying scenario.

The time-varying scenario in our setup arises from changes in the request arrival statistics of the users and from changes in the channel gain statistics due to the motion of the users.

IV Simulation Results and Discussion

In this section, we demonstrate the deep learning methods for power control proposed in this paper. We compare the performance of the AC-DQN and MADS power control policies. Though MADS provides optimal solutions for small system sizes, it is not scalable. We show that the deep learning algorithm, AC-DQN, indeed achieves the global optimum obtained by the MADS algorithm, while being scalable with the system size (number of users). We further demonstrate that the AC-DQN algorithm tracks the changing system dynamics and obtains the optimal policy adaptively. We use the Keras library [6] in Python to implement our algorithms, and our system is implemented in MATLAB.

We consider two systems, one with 4 users and compare AC-DQN and MADS; the other with 20 users, showing performance of AC-DQN. MADS is not able to provide a solution for the second system since the space and time complexity of MADS increase exponentially with the number of users. In all the examples, we split the users in two equal sized groups, one group has good channel statistics and the other bad channel statistics. In both the systems, we compare all the algorithms with a constant power control policy, where the transmit power is fixed to , to indicate the gain due to power control. The system and algorithm parameters, used for the simulations are as follows:

IV-1 Small User Case

Number of users, , Catalog Size , File Size , Transmission rate , Bandwidth , Channel Gains, Uniform([0.1 0.2 0.3]) for two users with bad channel statistics and Uniform([0.7 0.8 0.9]) for two users with good channel statistics, File Popularity : Uniform, (Zipf exponent = 0). Average Power Constraint , Simulation time= multicast transmissions.

IV-2 Large User Case

System Parameters: Power Transmit Levels = 20 (1 to 50), , , , , Channel Gains : Exponentially distributed. ( for bad channel, for good channel), , , , File Popularity : Zipf distribution with (Zipf exponent = 1). Simulation time: multicast transmissions. In both cases we set the noise power as .

IV-3 Hyperparameters

We consider fully connected neural networks with two hidden layers for all the function approximations considered in the algorithms. Input layer nodes are assumed to be and the number of output layer nodes is equal to 20, the number of transmit power levels. Each output represents the Q-value for a particular action. The action space is restricted to be finite, as DQN converges only with finite action spaces. The two hidden layers have 128 and 64 nodes, respectively, with ReLU activation functions. The other parameters are as follows: Replay memory size , , , , , , , , , Mini-batch Size , , and .
Achieving Global Optima (AC-DQN vs MADS): We use the system setting of the small user case specified above. We run the system for the average power constraint , with exponential arrivals of rate 0.4 to 4.0. Figure 2 shows a comparison of the sojourn times of the Constant Power Policy, , MADS and AC-DQN. Further, Figure 3 shows the convergence of the average power to for AC-DQN. We see from Figures 2 and 3 that AC-DQN achieves the global optimum achieved by MADS while maintaining the average power constraint.


Figure 2: Sojourn Times of MADS, PCD and AC-DQN vs Arrival Rate. , , Uniform Popularity, Uniform fading.


Figure 3: Convergence of Average power with Iteration for AC-DQN, .

AC-DQN performance in a Scaled Network: To show the scalability of AC-DQN, we simulate the relatively complex system described in the large user case above. We run the simulation for . We see in Figure 4 that AC-DQN gives a drastic improvement (around 50 percent) over the constant power case. AC-DQN achieves this while maintaining the average power, by learning the Lagrange constant, as seen in Figure 6. Figure 5 shows the convergence of the average power of AC-DQN to the average power constraint, , for an arrival rate of 1.0 requests per sec in the same simulation run.


Figure 4: Sojourn times for Constant Power and AC-DQN vs Arrival Rate. , , Zipf(1) Popularity, Rayleigh fading.


Figure 5: Convergence of Average Power, AC-DQN,


Figure 6: Convergence of Lagrange multiplier

AC-DQN Tracking Simulations: In this section we show via simulations the tracking capabilities of AC-DQN. We show this for the large user case with average power constraint, . We fix and . As explained previously, this is important for detecting changes in the environment dynamics faster. In this simulation, we vary the arrival rate every six hours over a period of 24 hours. This captures the real-world scenario where the request traffic to the base station varies with the time of day. To make the learning harder for our algorithm, we make these changes abruptly every six hours. Specifically, we use arrival rates for the 1st, 2nd, 3rd and 4th six-hour periods, respectively. We plot the AC-DQN performance for in Figure 7. We calculate the mean sojourn time and average power using a moving average window of size 1000 samples. We observe that for each arrival rate in this simulation, AC-DQN achieves the corresponding stationary mean sojourn time performance. For instance for and , the values in Figure 4 and Figure 7 are comparable. It is important to note that this performance is achieved while maintaining the average power constraint, as can be seen in Figure 8. The effect of fixing the learning rates is seen in the small oscillations of the average power around in Figure 8. This is an oscillation in a small neighbourhood around the optimal average power. The smaller the step size, the smaller the oscillations.
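The moving-average smoothing used for the reported sojourn times and average power can be computed incrementally, as in the following minimal sketch. The class name `MovingAverage` is hypothetical; only the window size of 1000 samples comes from the text.

```python
from collections import deque


class MovingAverage:
    """Running mean over the most recent `window` samples."""

    def __init__(self, window):
        self._values = deque(maxlen=window)
        self._total = 0.0

    def update(self, sample):
        # Subtract the sample that is about to fall out of a full window.
        if len(self._values) == self._values.maxlen:
            self._total -= self._values[0]
        self._values.append(sample)
        self._total += sample
        return self._total / len(self._values)


# A window of 1000 samples, as used for the tracking plots.
smoothed_power = MovingAverage(1000)
```

A short window reacts quickly to a change in arrival rate but shows more noise, while a long window gives smoother curves that lag behind abrupt changes.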

Next, we demonstrate the importance of constant step sizes for and , and the inability of decaying step sizes to track the changing system statistics. We consider a system where the arrival rates change over a period of 48 hours. We fix for the first 24 hours, then fix for four consecutive 6-hour intervals. This change in time period is just to illustrate the tracking ability more emphatically. It would also be visible in the previous time frame but would require more simulation time. We fix . We run the AC-DQN algorithm for this system with: 1) decaying and satisfying (14) and 2) constant step sizes, and . The rest of the parameters remain the same as in the large user case. We see in Figure 9 that AC-DQN with constant step sizes almost always outperforms AC-DQN with decaying step sizes. Specifically, after the first 24 hours the delay reduction is nearly 50 percent for the constant step size. The reason for this is evident from Figures 10 and 11. We see in Figure 11 that AC-DQN with constant step sizes learns the Lagrange constant throughout the simulation time, whereas AC-DQN with decaying step sizes is unable to learn the Lagrange constant after the first 24 hours. As can be seen in Figure 10, this affects the average power achieved by AC-DQN with decaying step sizes. While the constant step size maintains the average power constraint of , the average power achieved by the decaying step-size AC-DQN drops to . Hence, the decaying step-size AC-DQN makes suboptimal use of the available power. Thus, in practical systems, only constant step-size AC-DQN will be capable of adapting to the changing system statistics.


Figure 7: AC-DQN Tracking: Delay performance of AC-DQN vs Constant Power Policy, for ,


Figure 8: Tracking of Power Constraint by AC-DQN with tracking for ,


Figure 9: Sojourn times for AC-DQN with decaying vs constant step-sizes. , , Zipf(1) Popularity, Rayleigh fading.


Figure 10: Convergence of Average Power for AC-DQN with decaying vs constant step-sizes.


Figure 11: Convergence of Lagrange for AC-DQN with decaying vs constant step-sizes.

Discussion: We see from the simulations that DeepRL techniques can achieve globally optimal performance while providing scalability with system size. Our two-timescale approach, AC-DQN, extends this to systems with constrained control. Though we have demonstrated this on a system with a single constraint, AC-DQN can be extended to systems with multiple constraints. In such systems each constraint is associated with a Lagrange constant, and each Lagrange constant adds an SGD step to the AC-DQN algorithm. For a stationary system, it is enough that the step sizes satisfy a multi-timescale criterion similar to (14), see [4]. However, if AC-DQN is used in systems with changing system statistics, the step sizes should be kept constant. The step sizes should be fixed as per the tolerance requirement for a given constraint (e.g., in our system the tolerance could be . In other words, is the allowed deviation from the constraint ). The smaller the tolerance, the smaller the step size. However, fixing the step sizes too small may make the algorithm too slow to track the changes in system statistics. Hence, choosing the step sizes is a trade-off between the tolerance of the constraint and the algorithmic agility required to track the system changes.

V Conclusion

We have considered a multicast downlink in a wireless network. Fading of the different links to the users significantly reduces the performance of the system. Appropriate changes to the scheduling policy, together with power control, can mitigate most of these losses. However, obtaining the optimal power control for this system is computationally very hard. We show that, using deep reinforcement learning, we can obtain the optimal power control online, even when the system statistics are unknown. We use a recently developed version of Q-learning, the Deep Q-Network, to learn the Q-function of the system via function approximation. Furthermore, we modify the algorithm to satisfy our constraints and to make the optimal policy track time-varying system statistics. The DDQN variant of AC-DQN provides similar performance.

One interesting extension of this work would be to add caches at the user nodes and learn the optimal caching policy along with the power control using DeepRL. Future work may also consider applying AC-DQN to multiple-base-station scenarios for interference mitigation.


References

  • [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 22–31. Cited by: §I.
  • [2] E. Altman (1999) Constrained Markov decision processes. CRC Press. Cited by: §II-D.
  • [3] C. Audet and J. E. Dennis Jr (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on optimization 17 (1), pp. 188–217. Cited by: §II-C.
  • [4] V.S. Borkar (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press. External Links: ISBN 9780521515924, LCCN 2009285122 Cited by: §I, §III-B, §III-B, §IV-3.
  • [5] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon (2007) I tube, you tube, everybody tubes: analyzing the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, New York, USA, pp. 1–14. Cited by: §I.
  • [6] F. Chollet et al. (2015) Keras. GitHub. Cited by: §IV.
  • [7] Cisco (2016) Cisco visual networking index: global mobile data traffic forecast update 2016-2021 white paper. External Links: Document Cited by: §I.
  • [8] R. Cogill and B. Shrader (2009-09) Queue length analysis for multicast: limits of performance and achievable queue length with random linear coding. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 462–468. External Links: Document Cited by: §I.
  • [9] C. Dai, X. Xiao, L. Xiao, and P. Cheng (2018-12) Reinforcement learning based power control for VANET broadcast against jamming. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–6. External Links: Document, ISSN 2576-6813 Cited by: §I.
  • [10] I-H. Hou (2011) Broadcasting delay-constrained traffic over unreliable wireless links with network coding. IEEE/ACM Transactions on Networking 23, pp. 728–740. Cited by: §I.
  • [11] N. Jindal and A. Goldsmith (2003-11) Capacity and optimal power allocation for fading broadcast channels with minimum rates. IEEE Transactions on Information Theory 49 (11), pp. 2895–2909. External Links: Document, ISSN 0018-9448 Cited by: §I.
  • [12] K. S. Kim, C. Li, and E. Modiano (2014-04) Scheduling multicast traffic with deadlines in wireless networks. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, pp. 2193–2201. External Links: Document, ISSN 0743-166X Cited by: §I.
  • [13] Y. Li (2018) Deep reinforcement learning. CoRR. External Links: Link, 1810.06339 Cited by: §I, §III-A.
  • [14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. CoRR. External Links: Link Cited by: §III-A.
  • [15] L. Lin (1992-05-01) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3), pp. 293–321. External Links: ISSN 1573-0565, Document, Link Cited by: §I.
  • [16] M. A. Maddah-Ali and U. Niesen (2014) Fundamental limits of caching. IEEE Trans. Inf. Theory 60 (5), pp. 2856–2867. External Links: Document, 1209.5807, ISBN 9781479904464, ISSN 00189448 Cited by: §I.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015-02-25) Human-level control through deep reinforcement learning. Nature 518, pp. 529. External Links: Link Cited by: §I, §III.
  • [18] N. Moghadam and H. Li (2015) Improving queue stability in wireless multicast with network coding. IEEE Inter. Conf. on Commun. (ICC), pp. 3382–3387. Cited by: §I.
  • [19] Y. S. Nasir and D. Guo (2018) Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. arXiv. External Links: Link, 1808.00490v3 Cited by: §I.
  • [20] M. Panju, R. Raghu, V. Agarwal, V. Sharma, and R. Ramachandran (2019) Queueing theoretic models for multicasting under fading. IEEE Wireless Communications and Networking Conference (WCNC), Marrakech, Morocco. Cited by: Deep Reinforcement Learning Based Power control for Wireless Multicast Systems, 2nd item, §I, §I, §II-A, §II-A, §II-A, §II-C, §II.
  • [21] M. Panju, R. Raghu, V. Sharma, and R. Ramachandran (2018) Queuing theoretic models for multicast and coded-caching in downlink wireless systems. arXiv:1804.10590. External Links: arXiv:1804.10590 Cited by: §I, §I, §II-A, §II-A, §II.
  • [22] M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0471619779 Cited by: §II-D.
  • [23] F. Rezaei and B. H. Khalaj (2016) Stability, rate, and delay analysis of single bottleneck caching networks. IEEE Trans. Commun. 64 (1), pp. 300–313. External Links: Document, ISBN 0090-6778, ISSN 00906778 Cited by: §I.
  • [24] P. Sadeghi, R. A. Kennedy, P. B. Rapajic, and R. Shams (2008-09) Finite-state Markov modeling of fading channels - a survey of principles and applications. IEEE Signal Processing Magazine 25 (5), pp. 57–80. External Links: Document, ISSN 1053-5888 Cited by: §II.
  • [25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §I.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR. External Links: Link, 1707.06347 Cited by: §I.
  • [27] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. External Links: Document, ISSN 0036-8075, Link, Cited by: §I.
  • [28] C. Su and L. Tassiulas (2000) Joint broadcast scheduling and user’s cache management for efficient information delivery. Wireless Networks 6 (4), pp. 279–288. Cited by: §I.
  • [29] C. Tessler, D. J. Mankowitz, and S. Mannor (2018) Reward constrained policy optimization. CoRR. External Links: Link, 1805.11074 Cited by: §I.
  • [30] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §I, §III-A.
  • [31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop. Cited by: §I.
  • [32] K. Wang, C. F. Chiasserini, R. R. Rao, and J. G. Proakis (2003-05) A distributed joint scheduling and power control algorithm for multicasting in wireless ad hoc networks. In IEEE International Conference on Communications, 2003. ICC ’03., Vol. 1, pp. 725–731. External Links: Document Cited by: §I.
  • [33] Z. Yang, Y. Xie, and Z. Wang (2019) A theoretical analysis of deep Q-learning. CoRR. External Links: Link, 1901.00137 Cited by: §I, §I, §III-A, §III-B.
  • [34] H. Ye, G. Y. Li, and B. F. Juang (2019-04) Deep reinforcement learning based resource allocation for V2V communications. IEEE Transactions on Vehicular Technology 68 (4), pp. 3163–3173. External Links: Document, ISSN 0018-9545 Cited by: §I.