Wireless networks are being constantly refined to cater to the seamless delivery of huge amounts of data to end users. With increasing user-generated content and the proliferation of social networking sites, most mobile data traffic is expected to be due to mobile videos . Moreover, the request traffic for these contents is ridden with redundant requests . Thus, multicasting is a natural way to address these requests.
A multicast queue with network coding is studied in [18, 10] with an infinite library of files. The case of broadcast systems with one server transmitting to multiple users is studied in [8, 28]. Both of these works study a slotted system. Some recent works  use coded caching to achieve multicast. This approach uses local information in the user caches to decode the coded transmission and improves throughput by increasing the effective number of files transferred per transmission. This throughput may get reduced in a practical scenario due to queueing delays at the basestation/server. The work in  addresses these issues, analyses the queueing delays and compares them with an alternate coded scheme with LRU caches (CDLS), which provides an improvement over the coded schemes in . A more recent work in this direction provides alternate multicast schemes and analyses queueing delays for such multicast systems . The authors show that a simple multicast scheme can have significant gains over the schemes in ,  in the high traffic regime.
In this paper, we further study the multicast scheme proposed in . This multicast queue merges the requests for a given file from different users that arrive during the waiting time of the initial request. The merged requests are then served simultaneously. The gains achieved by this simple multicast scheme, however, are quickly lost in wireless channels due to fading. The work in  addresses this issue and also provides several multicast queueing schemes to improve the average user delays. It further shows that these schemes, combined with an optimal power control policy under an average power constraint, can provide a significant reduction in delays.
The power control policy proposed in , though it provides improved delays, has the following limitations:
The algorithm to compute the policy is not scalable with the number of users and the number of channel gain states.
The policy does not adapt to the changing system statistics, which in turn depend on the policy.
These systems are often conveniently modelled as Markov Decision Processes, but with large state and action spaces. Obtaining the transition probabilities and the optimal policy for such large MDPs is not feasible. Reinforcement learning, particularly deep reinforcement learning, is a natural tool for such problems. Reinforcement learning has the added advantage that it can be used even when the transition probabilities are not available. However, a large state/action space can still be an issue. Function approximation via deep neural networks can provide significant gains, since the Q-values of different state-action pairs can be interpolated even if a state-action pair has never or rarely occurred in the past. Several deep reinforcement learning techniques, such as Deep Q-Network (DQN), Trust Region Policy Optimization (TRPO)  and Proximal Policy Optimization (PPO) , have been successfully applied to large state-space dynamical systems such as Atari  and AlphaGo . DQN is one of the first Deep-RL methods based on value iteration, usually employing ε-greedy exploration to learn the optimal policy. TRPO and PPO are policy-gradient based methods that employ stochastic gradient descent over the policy space to obtain the optimal value function. Policy-gradient methods often suffer from high variance in sample estimates and poor sample efficiency. Value iteration based deep RL methods, like DQN, have been theoretically shown to have competitive performance , specifically due to the sample efficiency of experience replay .
In addition to the above mentioned trade-offs, a constrained stochastic optimization problem, as considered in this paper, further adds to the complexity. A modification of TRPO for constrained optimization is Constrained Policy Optimization . But this too suffers from the high estimator variance issue. The work in  considers a multi-timescale approach for constrained DeepRL problems, as considered in this paper. However,  does not track the system statistics and hence cannot be applied in practical systems. Thus, we propose a constrained optimization variant of DQN based on multi-timescale stochastic gradient descent . We have preferred DQN in this work, as the target network and replay memory used in DQN reduce the estimator variance and finally achieve the globally minimal empirical risk .
The major contributions of this paper are:
Proposing two modifications to DQN to accommodate constraints and system adaptation. We call the resulting algorithm Adaptive Constrained DQN (AC-DQN).
Unlike DQN, constrained DQN can be applied to multicast systems with constraints, as in , to learn the optimal power control policy online. The constraints are met by using a Lagrange multiplier, which is itself learnt via two-timescale stochastic gradient descent. The proposed method meets the average power constraint while achieving the global optimum attained by the static policy proposed in .
We demonstrate the scalability of our algorithms with system size (number of users, arrival rate, complex fading).
We show that AC-DQN can track the changes in the dynamics of the system, e.g., change of rate of arrival over the time of a day, and achieve optimal performance.
Our algorithms work equally well when we replace DQN with its improvements, such as DDQN . In fact, we have run our simulations with the DDQN variant of AC-DQN and achieved similar performance. Next, we describe some more works related to this paper:
Power control in Multicast Systems: Power control in multicast systems has been studied in [11, 32, 12]. In , optimal power allocation is made to achieve the ergodic capacity (defined as the boundary of the region of all achievable rates in a fading channel with arbitrarily small error probability) while maintaining minimum rate requirements at the users and average power constraints. The authors use water-filling to achieve the optimal policy. In , the authors minimize a utility function via linear programming, under SINR constraints at the users and transmit power constraints at the transmitter. Both [11, 32] derive an optimal power control policy for delivery to all the users, whereas this paper considers delivery to a random subset of users. In , each packet has a deadline and packets not received by the end of the slot are discarded. The authors use dynamic programming to obtain the optimal policies.
Deep Reinforcement Learning (DeepRL) in Wireless Multicast Systems: The ability of DeepRL to handle large state-space dynamic systems is being exploited in various multicast wireless systems/networks. In , the authors look at the resource allocation problem in unicast and broadcast. The DeepRL agent learns and selects power and frequency for each channel to improve the rate, under some latency constraints. The authors, as in our work, introduce constraints via a Lagrange multiplier. However, their agent does not learn the Lagrange multiplier. Thus, the agent also does not adapt if the system dynamics change, as the Lagrange constant in the reward is fixed for a given dynamics and the learning rate decays with time. Another work, , applies unconstrained deep reinforcement learning to multiple transmitters for a proportionally fair scheduling policy by adjusting individual transmit powers. Some studies  have applied DeepRL to power control for anti-jamming systems.
II System Model
We consider a system with one server transmitting files from a fixed finite library to a set of users (Figure 1). We denote the set of users by and the set of files by . We assume that . The request process for file from user is a Poisson process of rate which is independent of the request processes of other files from user and also from other users. The total arrival rate is . The requests of a file from each user are queued at the server till the user successfully receives the file. All the files are of length bits. The server transmits at a fixed rate, bits/sec. Thus, the transmission time for each file is .
The channels between the server and the users experience time-varying fading. The channel gain of each user is assumed to be constant during the transmission of a file. The channel gain for the user at the transmission is represented by . Each takes values in a finite set and forms an independent, identically distributed (i.i.d.) sequence in time, as in . The channel gains of different users are independent of each other and may have different distributions. Let .
More details of the system are described in the following subsections. Section II-A describes the basic multicast queue proposed in . The scheduling scheme studied in  to mitigate the effects of fading is also presented. In Sections II-B and II-C we summarise the results from , which show that using power control can further improve the performance, and the algorithm used to obtain the optimal power policy. We will see that this algorithm is not scalable. Then, in Section II-D, we provide the MDP formulation of the power control problem. In Section III we present a scalable DeepRL solution for this formulation.
II-A Multicast Queue
For scheduling of transmissions at the server, we consider the multicast queue studied in . In this system, the requests for different files from different users are queued in a single queue, called the multicast queue. In this queue, the requests for a file from all users are merged and considered as a single request. The requested file and the set of users requesting it are stored together as one entry. A new request for file , from user , is merged with the corresponding entry , if it exists. Otherwise, it is appended to the tail of the queue. Service/transmission of file serves all the users in , possibly with errors due to channel fading.
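The merge rule just described is simple enough to sketch in a few lines of Python. This is an illustrative data structure, not the paper's implementation; names such as `MulticastQueue` are ours:

```python
from collections import OrderedDict

class MulticastQueue:
    """Sketch of the merge rule: one entry per requested file,
    holding the set of users whose requests have been merged."""

    def __init__(self):
        # file_id -> set of requesting users, kept in arrival order
        self.entries = OrderedDict()

    def request(self, file_id, user):
        if file_id in self.entries:
            # merge with the pending entry for this file
            self.entries[file_id].add(user)
        else:
            # no pending entry: append a new one at the tail
            self.entries[file_id] = {user}

    def pop_head(self):
        # serve the head-of-line entry: all merged users at once
        file_id, users = next(iter(self.entries.items()))
        del self.entries[file_id]
        return file_id, users
```

A direct consequence of merging is that the queue never holds more entries than there are files in the library, which is why the queue length stays bounded.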
The random subset of users served by the multicast queue at the transmission is denoted by the random binary vector , where implies that the user has requested the file being transmitted, and otherwise. From [Theorem 1, ], has a unique stationary distribution.
It was shown in  that the above multicast queue performs much better than the multicast queues proposed earlier in the literature. The main difference from previous multicast schemes is that here, all requests of all the users for a given file are merged together over time. One direct consequence is that the queue length at the base station does not exceed . Thus, the delay is bounded for all traffic rates. In fact, the mean delays are often better than those of the coded caching schemes proposed in the literature as well, for most traffic conditions. However, in a fading scenario, where the different users experience independent fading, the performance of this scheme can deteriorate significantly because of the multiple retransmissions required to transmit successfully to all the users. Thus, in , multiple retransmission schemes were proposed and compared to recover the performance of the system. The following scheme was among the best. Not only does it (almost) minimize the overall mean delays of the system, it is also fair to different users, in the sense that users with good channel gains do not suffer due to users with bad channel gains.
Single queue with loop-back (1-LB): The multicast queue is serviced from head to tail. When a file is transmitted, some of the users will receive the file successfully and some may receive it with errors. In the case of unsuccessful reception by some users, the file is retransmitted. A maximum of transmission attempts are made. If some users have not received the file within the allowed transmission attempts, the request (the tuple with now modified to contain only the users who have not received the file successfully) is fed back to the queue. If there is another pending request in the queue for the same file (a request that arrived during the current transmission), it is merged with that request. Otherwise, a new request for the same file, with the unsuccessful users, is inserted at the tail of the queue.
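A minimal sketch of one 1-LB service round, under assumed per-user failure probabilities. The retransmission cap `MAX_ATTEMPTS` and all names here are stand-ins, since the paper's symbols were elided:

```python
import random

MAX_ATTEMPTS = 4  # stands in for the (elided) retransmission cap

def serve_head(queue, fail_prob):
    """One 1-LB service: pop the head entry (file, users), make up to
    MAX_ATTEMPTS transmissions, and loop unsuccessful users back.
    `queue` is a list of (file_id, user_set); `fail_prob` maps each
    user to its per-attempt failure probability."""
    file_id, pending = queue.pop(0)
    for _ in range(MAX_ATTEMPTS):
        # each pending user independently fails this attempt w.p. fail_prob
        pending = {u for u in pending if random.random() < fail_prob[u]}
        if not pending:
            return  # everyone served within the attempt budget
    # loop-back: merge leftovers with a pending entry for the same
    # file if one exists, otherwise append a fresh entry at the tail
    for i, (f, users) in enumerate(queue):
        if f == file_id:
            queue[i] = (f, users | pending)
            return
    queue.append((file_id, pending))
```

The loop-back step is what preserves fairness: users with bad channels rejoin the tail rather than blocking the head of the queue.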
It was further shown in  that choosing the transmit power based on the channel gains, can further improve the system performance.
II-B Average Power Constraint
Depending on the value of and at time , the server chooses transmit power , based on a power control policy . Choosing a good power control policy is the topic of this paper.
The state, of the system at time is . Let be the power chosen by a policy for state and be the number of successful transmissions for the selected power , during the service.
For a fixed transmission rate and given channel gains of the users, the transmit power requirement (from Shannon’s formula) for user is (assuming the file length is long enough)
where is the bandwidth and is the Gaussian noise power at receiver . Thus, the reward for the chosen power control policy during the transmission is given by,
where if the user has requested the file in service and otherwise. We now describe the Mesh Adaptive Direct Search (MADS) power control policy.
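For reference, the power requirement referred to above follows from Shannon's capacity formula. In our own stand-in notation (the paper's symbols were lost in extraction): with bandwidth $W$, rate $R$, channel gain $h_j$ and noise power $N_j$ at receiver $j$,

```latex
R = W \log_2\!\left(1 + \frac{h_j P_j}{N_j}\right)
\quad\Longrightarrow\quad
P_j = \frac{N_j}{h_j}\left(2^{R/W} - 1\right).
```

That is, the required power grows with the noise at the receiver and shrinks with the channel gain, which is why power control over the fading states matters.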
II-C MADS Power Control Policy
The power control policy in  is derived from the following optimization problem,
where is the average power constraint, is the total number of states, is the power chosen by the policy in state , is the stationary distribution of state , and is the reward for state , as defined in (2) with . This is a non-convex optimization problem. Mesh Adaptive Direct Search (MADS)  is used in  to solve this constrained optimization problem and obtain the power control policy. Though MADS achieves the global optimum, it is not scalable, as its computational complexity is very high.
The state space and action space of this problem can be very large even for a moderate number of users and channel gains; e.g., a system with L users and G channel gain states has states. Therefore, in this paper we propose a deep reinforcement learning framework. This not only provides an optimal solution for a reasonably large system, but does so without knowing the arrival rates and channel gain statistics. In addition, we will be able to provide an optimal solution even when the arrival and channel gain statistics change with time.
II-D MDP Formulation
The above system can be formulated as a finite state and action Markov Decision Process, denoted by the tuple (): (state space, action space, reward, transition probability, discount factor), where the transition probability is , the policy chooses power in state , and the instantaneous reward is .
The action-value function  for this discounted MDP for policy is
The optimal , is given by and satisfies the optimality relation,
where is sampled with distribution . If we know the optimal Q-function , we can compute the optimal policy via . We know the transition matrix of this system and hence can, in principle, compute the -function. But the state space is very large even for a small number of users, rendering the computations infeasible. Thus, we use a parametric function approximation of the Q-function via deep neural networks and use DeepRL algorithms to obtain the optimal .
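In standard notation (our stand-in symbols, since the paper's were elided), the optimality relation above is the Bellman equation

```latex
Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
\Big[\, r(s,a) + \gamma \max_{a'} Q^*(s',a') \,\Big],
\qquad
\pi^*(s) = \arg\max_{a} Q^*(s,a).
```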
Further, to introduce the constraint in the MDP formulation, we look at the policies achieving
is the long term average power. We use the Lagrange method for constrained MDPs  to achieve the optimal policy. In this method, the instantaneous reward is modified as
where is the Lagrange constant achieving the optimum while maintaining . Choosing incorrectly will yield an optimal policy whose average power differs from .
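In the standard Lagrangian relaxation of a constrained MDP, the modified reward takes the form (our notation; here $P(s,a)$ is the power spent in state $s$ under action $a$ and $\bar{P}$ is the constraint):

```latex
r_\lambda(s,a) = r(s,a) - \lambda\, P(s,a),
```

with $\lambda \ge 0$ tuned until the long-run average power under the resulting optimal policy equals $\bar{P}$.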
III Deep Reinforcement Learning Based Power Control Policy
In this section, we describe the Deep Q-Network (DQN)  based power control. First, we describe the DQN algorithm. We then propose a variant of DQN for constrained problems, wherein we use a Lagrange multiplier to take care of the average power constraint. We use a multi-timescale stochastic gradient descent approach to learn the Lagrange multiplier, so as to meet the average power constraint. Finally, we change the learning step size from decreasing to constant so that the optimal power control can track time-varying system statistics.
III-A Deep Q-Networks
DQN is a popular deep reinforcement learning algorithm for handling large state-space MDPs with unknown/complex dynamics . DQN is a value iteration based method, where the action-value function is approximated by a neural network. Though there are several follow-up works providing improvements over this algorithm [14, 30], we use it owing to its simplicity. We will show that DQN itself is able to provide us the optimal solution and tracking. These improvements may further improve the performance in terms of sample efficiency, estimator variance, etc. The DQN algorithm is given in Algorithm 1. Earlier attempts at combining nonlinear function approximators such as neural networks with RL were unsuccessful due to instabilities caused by 1) correlated training samples, 2) drastic changes in the policy with small changes in the function approximation, and 3) correlation between the training function and the approximated function . The success of DQN is attributed to addressing these issues with two key ingredients of the algorithm: the experience replay memory and the target network . The replay memory stores the transitions of the MDP, specifically the tuple . The algorithm then samples, uniformly, a random minibatch of transitions from the memory. This removes correlations in the data and smoothens the change in the data distribution across iterations. The target network and the randomly sampled minibatch from the memory form the training set for training the network at every epoch. This random sampling provides samples for performing stochastic gradient descent with loss
where, . The iterates are given by:
The weights of the target network are held constant for epochs, thereby controlling any drastic change in the policy and reducing the correlation between and . This can be seen as a risk minimization problem in nonparametric regression with regression function and risk . Readers are referred to  for an elaborate analysis of DQN. Theorem 4.4 in  provides a proof of convergence and the rate of convergence using non-parametric regression bounds, when sparse ReLU networks are used, under certain smoothness assumptions on the reward function and the dynamics.
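The interplay of the replay memory, the frozen target network, and SGD on the TD loss can be sketched as follows, with a linear Q-function standing in for the deep network. All names and constants here are illustrative, not the paper's:

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
GAMMA, LR, N_ACTIONS, DIM = 0.9, 0.1, 3, 4

# linear Q(s, a) = W[a] @ s  -- a stand-in for the deep network
W = rng.normal(size=(N_ACTIONS, DIM))
W_target = W.copy()          # target network, frozen between syncs
memory = deque(maxlen=1000)  # experience replay

def store(s, a, r, s_next):
    memory.append((s, a, r, s_next))

def train_step(batch_size=8):
    """One epoch: sample a minibatch uniformly from the replay memory,
    build TD targets with the frozen target network, and take an SGD
    step on the squared TD error."""
    batch = random.sample(list(memory), min(batch_size, len(memory)))
    for s, a, r, s_next in batch:
        y = r + GAMMA * (W_target @ s_next).max()  # TD target
        td_err = y - W[a] @ s
        W[a] += LR * td_err * s                    # SGD on (y - Q)^2

def sync_target():
    global W_target
    W_target = W.copy()  # periodic target-network update
```

In the full algorithm, `sync_target()` would be called only every few epochs, which is what keeps the regression targets stable between syncs.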
III-B Adaptive Constrained DQN (AC-DQN)
The DQN algorithm is meant for unconstrained optimization. Since our problem has an average power constraint of , we consider the instantaneous reward in (9), with a Lagrange multiplier . The long-term constraint depends on the Lagrange multiplier and can be quite sensitive to it. Thus, we design our algorithm, AC-DQN, to learn the appropriate . We will see later that this will enable us to further modify our algorithm to track the changing statistics of the channel gains and arrivals. The AC-DQN algorithm is given in Algorithm 2. Here, we use multi-timescale SGD as in . In this approach, in addition to the SGD on using the minibatch, we use a stochastic gradient descent on the Lagrange constant, as
Tracking with AC-DQN: Tracking the system statistics is essential to achieve optimal power control in a non-stationary system. In multi-timescale stochastic gradient descent, such as AC-DQN, the step sizes and can be fixed to enable tracking. If , then the Lagrange multiplier changes much more slowly than the -function. The two-timescale theory (see, e.g., ) then allows the Lagrange multiplier to adapt slowly to the changing system statistics while at the same time providing average power control. The solution will reach a neighbourhood of the optimal point.
Although the convergence of this modified algorithm is not proved yet (even for the unconstrained DQN, convergence has been proved only recently in ), our simulations will show that the resulting algorithm tracks the optimal solution in the time varying scenario.
The time-varying scenario in our setup arises due to changes in the request arrival statistics of the users and changing channel gain statistics due to the motion of the users.
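A toy illustration of why constant step sizes are needed for tracking: a constant-step estimator locks onto a statistic that jumps mid-stream, while a decaying (1/n) step keeps averaging over the entire history. All values here are illustrative:

```python
def track(targets, constant_alpha=0.1):
    """Run two online estimators over the same stream: one with a
    constant step size and one with a decaying (1/n) step size."""
    x_const = x_decay = 0.0
    for n, target in enumerate(targets, start=1):
        x_const += constant_alpha * (target - x_const)  # tracks
        x_decay += (1.0 / n) * (target - x_decay)       # sample mean
    return x_const, x_decay

# the "statistic" jumps from 1.0 to 5.0 halfway through the stream
stream = [1.0] * 500 + [5.0] * 500
c, d = track(stream)
```

The constant-step estimate ends near 5.0 (the current statistic), while the decaying-step estimate ends near 3.0 (the all-time average), mirroring the behaviour observed for AC-DQN with decaying step sizes.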
IV Simulation Results and Discussion
In this section, we evaluate the deep learning methods for power control proposed in this paper. We compare the performance of the AC-DQN and MADS power control policies. Though MADS provides optimal solutions for small system sizes, it is not scalable. We show that the deep learning algorithm, AC-DQN, indeed achieves the global optimum obtained by the MADS algorithm, while being scalable with the system size (number of users). We further demonstrate that the AC-DQN algorithm tracks the changing system dynamics and obtains the optimal policy adaptively. We use the Keras libraries in Python for the implementation of our algorithms, and our system is implemented in MATLAB.
We consider two systems: one with 4 users, where we compare AC-DQN and MADS, and another with 20 users, where we show the performance of AC-DQN. MADS is not able to provide a solution for the second system, since the space and time complexity of MADS increase exponentially with the number of users. In all the examples, we split the users into two equal-sized groups, one with good channel statistics and the other with bad channel statistics. In both systems, we compare all the algorithms with a constant power control policy, where the transmit power is fixed to , to indicate the gain due to power control. The system and algorithm parameters used for the simulations are as follows:
IV-1 Small User Case
Number of users , Catalog Size , File Size , Transmission rate , Bandwidth , Channel Gains: Uniform([0.1 0.2 0.3]) for the two users with bad channel statistics and Uniform([0.7 0.8 0.9]) for the two users with good channel statistics, File Popularity: Uniform (Zipf exponent = 0), Average Power Constraint , Simulation time: multicast transmissions.
IV-2 Large User Case
System Parameters: Power Transmit Levels = 20 (1 to 50), , , , , Channel Gains: Exponentially distributed (for bad channels, for good channels), , , , File Popularity: Zipf distribution with (Zipf exponent = 1), Simulation time: multicast transmissions. In both cases, we set the noise power to .
We consider fully connected neural networks with two hidden layers for all the function approximations in our algorithms. The number of input layer nodes is assumed to be , and the output layer has 20 nodes, the number of transmit power levels. Each output represents the Q-value of a particular action. The action space is restricted to be finite, as DQN converges only with finite action spaces. The two hidden layers have 128 and 64 nodes, respectively, with ReLU activation functions. The other parameters are as follows: Replay memory size , , , , , , , , Mini-batch Size , , and .

Achieving Global Optima (AC-DQN vs MADS): We use the small user case setting specified above. We run the system under the average power constraint , with exponential arrivals of rates 0.4 to 4.0. Figure 2 shows a comparison of the sojourn times of the Constant Power Policy, , MADS and AC-DQN. Further, Figure 3 shows the convergence of the average power to for AC-DQN. We see from these figures that AC-DQN achieves the global optimum attained by MADS, while maintaining the average power constraint.
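The architecture just described can be sketched as a plain numpy forward pass. `STATE_DIM` is a stand-in, since the input dimension was elided from the text, and the weights are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_POWER_LEVELS = 8, 20  # STATE_DIM is a stand-in value

# two hidden layers (128 and 64 nodes) with ReLU, 20 linear outputs
W1, b1 = rng.normal(size=(128, STATE_DIM)), np.zeros(128)
W2, b2 = rng.normal(size=(64, 128)), np.zeros(64)
W3, b3 = rng.normal(size=(N_POWER_LEVELS, 64)), np.zeros(N_POWER_LEVELS)

def q_values(state):
    """Forward pass: one Q-value per discrete transmit power level."""
    h1 = np.maximum(0.0, W1 @ state + b1)  # ReLU
    h2 = np.maximum(0.0, W2 @ h1 + b2)     # ReLU
    return W3 @ h2 + b3                    # linear output layer

# greedy action: index of the power level with the largest Q-value
greedy_power_index = int(np.argmax(q_values(rng.normal(size=STATE_DIM))))
```

In the actual implementation, the same shapes would be expressed with two Keras `Dense` layers (128 and 64 units, ReLU) followed by a linear `Dense(20)` output.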
AC-DQN performance in a Scaled Network: To show the scalability of AC-DQN, we simulate the relatively complex system of the large user case above. We run the simulation for . We see in Figure 4 that AC-DQN gives a drastic improvement (around 50 percent) over the constant power case. AC-DQN achieves this while maintaining the average power, by learning the Lagrange constant, as seen in Figure 6. Figure 5 shows the convergence of the average power of AC-DQN to the average power constraint, , for an arrival rate of 1.0 requests per second in the same simulation run.
AC-DQN Tracking Simulations: In this section we show, via simulations, the tracking capabilities of AC-DQN. We show this for the large user case with average power constraint . We fix and . As explained previously, this is important for detecting changes in the environment dynamics faster. In this simulation, we vary the arrival rate every six hours over a period of 24 hours. This captures the real-world scenario where the request traffic at the base station varies with the time of day. To make the learning harder for our algorithm, we make these changes abruptly every six hours. Specifically, we use arrival rates for the 1st, 2nd, 3rd and 4th six-hour periods, respectively. We plot the AC-DQN performance for in Figure 7. We calculate the mean sojourn time and average power using a moving average window of 1000 samples. We observe that, for each arrival rate in this simulation, AC-DQN achieves the corresponding stationary mean sojourn time performance. For instance, for and , the values in Figure 4 and Figure 7 are comparable. It is important to note that this performance is achieved while maintaining the average power constraint, as can be seen in Figure 8. The effect of fixing the learning rates is seen in the small oscillations of the average power around in Figure 8. These are oscillations in a small neighbourhood of the optimal average power. The smaller the step size, the smaller the oscillations.
Next, we demonstrate the importance of constant step sizes for and , and the inability of decaying step sizes to track the changing system statistics. We consider a system where the arrival rates change over a period of 48 hours. We fix for the first 24 hours and then fix for four consecutive six-hour intervals. This change in time period is just to illustrate the tracking ability more emphatically; it would also be visible on the previous time frame, but would require more simulation time. We fix . We run the AC-DQN algorithm for this system with: 1) decaying and satisfying (14), and 2) constant step sizes and . The rest of the parameters remain the same as in the large user case. We see in Figure 9 that AC-DQN with constant step sizes almost always outperforms the decaying step sizes. Specifically, after the first 24 hours, the delay reduction is nearly 50 percent for the constant step size. The reason for this is evident from Figures 10 and 11. We see in Figure 11 that AC-DQN with constant step size keeps learning the Lagrange constant throughout the simulation, whereas AC-DQN with decaying step size is unable to learn the Lagrange constant after the first 24 hours. As can be seen in Figure 10, this affects the average power achieved by AC-DQN with decaying step size. While the constant step size maintains the average power constraint of , the average power achieved by the decaying step-size AC-DQN drops to . Hence, the decaying step-size AC-DQN suffers from suboptimal utilization of the available power. Thus, in practical systems, only constant step-size AC-DQN is capable of adapting to the changing system statistics.
Discussion: We see from the simulations that DeepRL techniques can achieve globally optimal performance while providing scalability with system size. Our two-timescale approach, AC-DQN, extends this to systems with constrained control. Though we have demonstrated this on a system with a single constraint, AC-DQN can readily be extended to systems with multiple constraints. In such systems, each constraint is associated with a Lagrange constant, and each Lagrange constant adds an additional SGD step to the AC-DQN algorithm. For a stationary system, it suffices that the step sizes satisfy a multi-timescale criterion similar to (14); see . However, if AC-DQN is used in systems with changing statistics, the step sizes should be kept constant. The step sizes should be fixed as per the tolerance requirement for a given constraint (e.g., in our system the tolerance could be ; in other words, is the allowed deviation from the constraint ). The smaller the tolerance, the smaller the step size. However, fixing the step sizes too small may make the algorithm too slow to track changes in the system statistics. Hence, choosing the step sizes is a trade-off between the tolerance of the constraint and the required agility to track system changes.
We have considered a multicast downlink in a wireless network. Fading of the different links to the users causes a significant reduction in the performance of the system. Appropriate changes in the scheduling policies and power control can mitigate most of these losses. However, obtaining the optimal power control for this system is computationally very hard. We show that, using deep reinforcement learning, we can obtain optimal power control online, even when the system statistics are unknown. We use Deep Q-Network, a recently developed version of Q-learning, to learn the Q-function of the system via function approximation. Furthermore, we modify the algorithm to satisfy our constraints and to make the optimal policy track time-varying system statistics. The DDQN variant of AC-DQN provides similar performance.
One interesting extension of this work would be adding the caches at the user nodes and learning the optimal caching policy along with the power control using DeepRL. Future works may also consider applying AC-DQN to multiple-base-station scenarios for interference mitigation.
-  Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 22–31. Cited by: §I.
-  (1999) Constrained Markov decision processes. CRC Press. Cited by: §II-D.
-  (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on optimization 17 (1), pp. 188–217. Cited by: §II-C.
-  (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press. External Links: Cited by: §I, §III-B, §III-B, §IV-3.
-  (2007) I tube, you tube, everybody tubes: analyzing the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, New York, USA, pp. 1–14. Cited by: §I.
-  (2015) Keras. GitHub. Note: https://github.com/fchollet/keras Cited by: §IV.
-  (2016) Cisco visual networking index: global mobile data traffic forecast update 2016–2021 white paper. Cited by: §I.
-  (2009-Sept) Queue length analysis for multicast: limits of performance and achievable queue length with random linear coding. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Vol. , pp. 462–468. External Links: Cited by: §I.
-  (2018-12) Reinforcement learning based power control for vanet broadcast against jamming. In 2018 IEEE Global Communications Conference (GLOBECOM), Vol. , pp. 1–6. External Links: Cited by: §I.
-  (2011) Broadcasting delay-constrained traffic over unreliable wireless links with network coding. IEEE/ACM Transactions on Networking 23, pp. 728–740. Cited by: §I.
-  (2003-11) Capacity and optimal power allocation for fading broadcast channels with minimum rates. IEEE Transactions on Information Theory 49 (11), pp. 2895–2909. External Links: Cited by: §I.
-  (2014-04) Scheduling multicast traffic with deadlines in wireless networks. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, Vol. , pp. 2193–2201. External Links: Cited by: §I.
-  (2018) Deep reinforcement learning. CoRR. External Links: Cited by: §I, §III-A.
-  (2016) Continuous control with deep reinforcement learning. CoRR. External Links: Cited by: §III-A.
-  (1992-05-01) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3), pp. 293–321. External Links: Cited by: §I.
-  (2014) Fundamental limits of caching. IEEE Trans. Inf. Theory 60 (5), pp. 2856–2867. External Links: Cited by: §I.
-  (2015-02-25) Human-level control through deep reinforcement learning. Nature 518, pp. 529 EP –. External Links: Cited by: §I, §III.
-  (2015) Improving queue stability in wireless multicast with network coding. IEEE Inter. Conf. on Commun. (ICC), pp. 3382–3387. Cited by: §I.
-  (2018) Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. arXiv. External Links: Cited by: §I.
-  (2019) Queueing theoretic models for multicasting under fading. IEEE Wireless Communications and Networking Conference (WCNC), Marrakech, Morocco. Cited by: Deep Reinforcement Learning Based Power control for Wireless Multicast Systems, 2nd item, §I, §I, §II-A, §II-A, §II-A, §II-C, §II.
-  (2018) Queuing theoretic models for multicast and coded-caching in downlink wireless systems. arXiv:1804.10590. External Links: Cited by: §I, §I, §II-A, §II-A, §II.
-  (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: Cited by: §II-D.
-  (2016) Stability, rate, and delay analysis of single bottleneck caching networks. IEEE Trans. Commun. 64 (1), pp. 300–313. External Links: Cited by: §I.
Finite-state markov modeling of fading channels - a survey of principles and applications. IEEE Signal Processing Magazine 25 (5), pp. 57–80. External Links: Cited by: §II.
-  (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §I.
-  (2017) Proximal policy optimization algorithms. CoRR. External Links: Cited by: §I.
-  (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. External Links: Cited by: §I.
-  (2000) Joint broadcast scheduling and user’s cache management for efficient information delivery. Wireless Networks 6 (4), pp. 279–288. Cited by: §I.
-  (2018) Reward constrained policy optimization. CoRR. External Links: Cited by: §I.
Deep reinforcement learning with double q-learning.
Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §I, §III-A.
-  (2013) Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop. Cited by: §I.
-  (2003-05) A distributed joint scheduling and power control algorithm for multicasting in wireless ad hoc networks. In IEEE International Conference on Communications, 2003. ICC ’03., Vol. 1, pp. 725–731 vol.1. External Links: Cited by: §I.
-  (2019) A theoretical analysis of deep q-learning. CoRR. External Links: Cited by: §I, §I, §III-A, §III-B.
-  (2019-04) Deep reinforcement learning based resource allocation for v2v communications. IEEE Transactions on Vehicular Technology 68 (4), pp. 3163–3173. External Links: Cited by: §I.