I Introduction
Wireless networks are constantly being refined to cater to the seamless delivery of huge amounts of data to end users. With increasing user-generated content and the proliferation of social networking sites, the bulk of mobile data traffic is expected to be due to mobile videos [7]. Also, the requested traffic for these contents is ridden with redundant requests [5]. Thus, multicasting is a natural way to address these requests.
A multicast queue with network coding is studied in [18, 10] with an infinite library of files. The case of broadcast systems with one server transmitting to multiple users is studied in [8, 28]. Both of these works study a slotted system. Some recent works [16] use coded caching to achieve multicast. This approach uses local information in the user caches to decode the coded transmission and improves throughput by increasing the effective number of files transferred per transmission. This throughput may get reduced in a practical scenario due to queueing delays at the base station/server. [23] addresses these issues, analyses the queueing delays and compares them with an alternate coded scheme with LRU caches (CDLS) which improves over the coded schemes in [16]. A more recent work in this direction provides alternate multicast schemes and analyses queueing delays for such multicast systems [21]. The authors show that a simple multicast scheme can have significant gains over the schemes in [16], [23] in the high-traffic regime.
In this paper, we further study the multicast scheme proposed in [21]. This multicast queue merges the requests for a given file from different users that arrive during the waiting time of the initial request. The merged requests are then served simultaneously. The gains achieved by this simple multicast scheme, however, are quickly lost in wireless channels due to fading. [20] addresses this issue and provides several multicast queueing schemes to improve the average user delays. It also shows that these schemes, combined with an optimal power control policy under an average power constraint, can provide a significant reduction in delays.
The power control policy proposed in [20], though it provides improved delays, has the following limitations:

The algorithm used to obtain the policy does not scale with the number of users and the number of channel-gain states.

The policy does not adapt to changing system statistics, which in turn depend on the policy.
These systems are often conveniently modelled as Markov Decision Processes (MDPs), but with large state and action spaces. Obtaining transition probabilities and the optimal policy for such large MDPs is not feasible. Reinforcement learning, particularly deep reinforcement learning, is a natural tool for such problems [13]. Reinforcement learning has the added advantage that it can be used even when the transition probabilities are not available. However, a large state/action space can still be an issue. Function approximation via deep neural networks can provide significant gains, since the Q-values of different state-action pairs can be interpolated even if a state-action pair has never or rarely occurred in the past. Several deep reinforcement learning techniques, such as Deep Q-Network (DQN) [17], Trust Region Policy Optimization (TRPO) [25] and Proximal Policy Optimization (PPO) [26], have been successfully applied to large state-space dynamical systems such as Atari [31] and AlphaGo [27]. DQN is one of the first deep RL methods based on value iteration, usually employing ε-greedy exploration to learn the optimal policy. TRPO and PPO are policy-gradient methods that employ stochastic gradient descent over the policy space to obtain the optimal policy. Policy-gradient methods often suffer from high variance in sample estimates and poor sample efficiency [13]. Value-iteration based deep RL methods, like DQN, have been theoretically shown to have competitive performance [33], specifically due to the sample efficiency of experience replay [15]. In addition to the above-mentioned tradeoffs, a constrained stochastic optimization problem, as considered in this paper, further adds to the complexity. A modification of TRPO for constrained optimization is Constrained Policy Optimization [1], but this too suffers from the high estimator variance issue. The work in [29] considers a multi-timescale approach for constrained deep RL problems like the one considered in this paper. However, [29] does not track the system statistics and hence cannot be applied in practical systems. Thus, we propose a constrained optimization variant of DQN based on multi-timescale stochastic gradient descent [4]. We have preferred DQN in this work, as the target network and replay memory used in DQN reduce the estimator variance and finally achieve the global minimum of the empirical risk [33].
The major contributions of this paper are:

We propose two modifications to DQN to accommodate constraints and system adaptation. We call the resulting algorithm Adaptive Constrained DQN (ACDQN).

Unlike DQN, constrained DQN can be applied to multicast systems with constraints, as in [20], to learn the optimal power control policy online. The constraint is handled via a Lagrange multiplier. The appropriate Lagrange multiplier is also learnt, via two-timescale stochastic gradient descent. The proposed method meets the average power constraint while achieving the global optimum attained by the static policy proposed in [20].

We demonstrate the scalability of our algorithms with system size (number of users, arrival rate, complex fading).

We show that ACDQN can track changes in the dynamics of the system, e.g., the change of the arrival rate over the course of a day, and achieve optimal performance.
Our algorithms work equally well when we replace DQN with its improvements, such as DDQN [30]. In fact, we have run our simulations with the DDQN variant of ACDQN and achieved similar performance. Next, we describe some more works related to this paper:
Power control in multicast systems: Power control in multicast systems has been studied in [11, 32, 12]. In [11], optimal power allocation is made to achieve the ergodic capacity (defined as the boundary of the region of all achievable rates in a fading channel with arbitrarily small error probability) while maintaining minimum rate requirements at the users and average power constraints. The authors use waterfilling to achieve the optimal policy. In [32], the authors minimize a utility function via linear programming, under SINR constraints at the users and transmit power constraints at the transmitter. Both [11, 32] derive an optimal power control policy for delivery to all the users, whereas this paper considers delivery to a random subset of users. In [12], each packet has a deadline and packets not received by the end of the slot are discarded. The authors use dynamic programming to obtain the optimal policies.
Deep reinforcement learning (deep RL) in wireless multicast systems: The ability of deep RL to handle large state-space dynamic systems is being exploited in various multicast wireless systems/networks. In [34], the authors look at a resource allocation problem in unicast and broadcast. The deep RL agent learns and selects power and frequency for each channel to improve the rate, under some latency constraints. As in our work, the authors introduce constraints via a Lagrange multiplier. However, the agent does not learn the Lagrange multiplier. Thus, the agent also does not adapt if the system dynamics change, as the Lagrange constant in the reward is fixed for a given dynamics and the learning rate decays with time. Another work, [19], applies unconstrained deep reinforcement learning at multiple transmitters to obtain a proportionally fair scheduling policy by adjusting individual transmit powers. Some studies [9] have applied deep RL to control power in anti-jamming systems.
II System Model
We consider a system with one server transmitting files from a fixed, finite library to a set of users (Figure 1). The request process for a given file from a given user is a Poisson process whose rate is independent of the request processes of the other files from that user and of the other users; the total arrival rate is the sum of these rates. The requests for a file from each user are queued at the server till the user successfully receives the file. All files have the same length (in bits) and the server transmits at a fixed rate (in bits/sec); thus the transmission time is the same for every file, namely the file length divided by the transmission rate.
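As a small self-contained illustration of this arrival model (the rates, horizon and user/file counts below are made-up values, not the paper's parameters), independent per-(user, file) Poisson streams superpose into a single Poisson request stream with the summed rate:

```python
import random

def sample_requests(rates, horizon, seed=0):
    """Sample independent Poisson request streams, one per (user, file)
    pair; rates[(u, f)] is the arrival rate of requests for file f from
    user u.  Returns the merged, time-ordered list of (time, user, file)."""
    rng = random.Random(seed)
    events = []
    for (u, f), lam in rates.items():
        t = 0.0
        while True:
            t += rng.expovariate(lam)   # i.i.d. exponential inter-arrivals
            if t > horizon:
                break
            events.append((t, u, f))
    events.sort()                       # superposition of the streams
    return events

# Two users, two files, each stream of rate 0.5 -> total rate 2.0 req/sec.
rates = {(u, f): 0.5 for u in range(2) for f in range(2)}
events = sample_requests(rates, horizon=10_000)
total_rate = len(events) / 10_000       # empirically close to 2.0
```

The merged stream is what the multicast queue of the next subsection sees at the server.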
The channels between the server and the users experience time-varying fading. The channel gain of each user is assumed to be constant during the transmission of a file. The channel gain of each user at a given transmission takes values in a finite set and forms an independent, identically distributed (i.i.d.) sequence in time, as in [24]. The channel gains of different users are independent of each other and may have different distributions.
More details of the system are described in the following subsections. Section II-A describes the basic multicast queue proposed in [21]; the scheduling scheme to mitigate the effects of fading, studied in [20], is also presented. In Sections II-B and II-C we summarise the results from [20], which show that power control can further improve the performance, and the algorithm used to obtain the optimal power policy. We will see that this algorithm is not scalable. Then, in Section II-D, we formulate the power control problem as an MDP. In Section III we present a scalable deep RL solution for this formulation.
II-A Multicast Queue
For scheduling transmissions at the server, we consider the multicast queue studied in [20]. In this system, the requests for different files from different users are queued in a single queue, called the multicast queue. In this queue, the requests for a file from all users are merged and considered as a single request: each queue entry consists of the requested file and the set of users requesting it. A new request for a file from a user is merged with the corresponding entry, if it exists; else, it is appended to the tail of the queue. Service/transmission of a file serves all the users in its entry, possibly with errors due to channel fading.
The random subset of users served by the multicast queue at a given transmission is denoted by a random binary vector whose i-th component is 1 if user i has requested the file being transmitted and 0 otherwise. From [21, Theorem 1], this vector has a unique stationary distribution. It was shown in [21] that the above multicast queue performs much better than the multicast queues proposed in the literature before. The main difference from previous multicast schemes is that in this scheme, all requests of all users for a given file are merged together over time. One direct consequence is that the queue length at the base station does not exceed the number of files in the library; thus the delay is bounded for all traffic rates. In fact, the mean delays are often better than those of the coded caching schemes proposed in the literature as well, for most traffic conditions. However, in a fading scenario where the different users experience independent fading, the performance of this scheme can deteriorate significantly because of the multiple retransmissions required to transmit successfully to all the users in an entry. Thus, in [20], multiple retransmission schemes were proposed and compared to recover the performance of the system. The following scheme was among the best: it not only (almost) minimizes the overall mean delays of the system, it is also fair to different users, in the sense that users with good channel gains do not suffer due to users with bad channel gains.
Single queue with loopback (1LB): The multicast queue is serviced from head to tail. When a file is transmitted, some of the users may receive the file successfully and some may receive it with errors. In the case of unsuccessful reception by some users, the file is retransmitted, up to a maximum number of transmission attempts. If some users have not received the file within these attempts, the request (now modified to contain only the set of users who have not received the file successfully) is fed back to the queue. If there is another pending request in the queue for the same file (a request which arrived during the current service), it is merged with that request. Otherwise, a new request for the same file with the unsuccessful users is inserted at the tail of the queue.
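The merge-and-loopback behaviour described above can be sketched as follows. This is a simplified model: the class name, the cap of three transmission attempts and the `transmit` callback (which plays the role of the fading channel) are illustrative choices, not the paper's implementation:

```python
from collections import OrderedDict

MAX_ATTEMPTS = 3   # illustrative cap on transmission attempts per service

class MulticastQueue:
    """Single multicast queue with request merging and loopback (1LB).

    Each entry maps a file to the set of users waiting for it, so the
    queue never holds more than one entry per file.
    """
    def __init__(self):
        self.q = OrderedDict()                  # file -> waiting users

    def request(self, user, file):
        if file in self.q:
            self.q[file].add(user)              # merge with pending entry
        else:
            self.q[file] = {user}               # append at the tail

    def serve(self, transmit):
        """Serve the head entry; transmit(users) returns the subset of
        `users` that decoded successfully on this attempt."""
        if not self.q:
            return None
        file, users = self.q.popitem(last=False)
        for _ in range(MAX_ATTEMPTS):
            users -= transmit(users)
            if not users:
                return file                     # everyone served
        # Loopback: merge the leftover users with any request for the
        # same file that arrived during service, else re-insert at tail.
        if file in self.q:
            self.q[file] |= users
        else:
            self.q[file] = users
        return file

mq = MulticastQueue()
mq.request(1, 'A')
mq.request(2, 'A')                              # merged into entry for 'A'
served = mq.serve(lambda users: users & {1})    # user 2 never decodes
# 'A' is fed back to the tail with the unsuccessful user {2}
```

Because entries merge, the queue holds at most one entry per file, which is exactly why the queue length (and hence the delay) stays bounded for all traffic rates.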
It was further shown in [20] that choosing the transmit power based on the channel gains can further improve the system performance.
II-B Average Power Constraint
Depending on the set of requesting users and their channel gains at the time of the t-th transmission, the server chooses a transmit power P(t) based on a power control policy. Choosing a good power control policy is the topic of this paper.
The state of the system at the t-th transmission is s(t) = (U(t), h(t)), where U(t) is the binary vector of requesting users and h(t) the vector of channel gains. Let P(s) be the power chosen by a policy in state s, and let r(s, P(s)) be the number of successful transmissions for the selected power during the service.
For a fixed transmission rate R and a given channel gain h_i(t) of user i, the transmit power requirement (from Shannon's formula) for user i is (assuming the file length is long enough)

P_i(t) = (σ_i² / h_i(t)) (2^(R/W) − 1),   (1)

where W is the bandwidth and σ_i² is the Gaussian noise power at receiver i. Thus, the reward for the chosen power control policy during the t-th transmission is given by

r(t) = Σ_i U_i(t) 1{P(t) ≥ P_i(t)},   (2)

where U_i(t) = 1 if user i has requested the file in service and U_i(t) = 0 otherwise. We now describe the Mesh Adaptive Direct Search (MADS) power control policy.
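A sketch of the Shannon power requirement (1) and the reward (2) in code; the numerical values of the rate, bandwidth, noise power and channel gains below are invented for illustration:

```python
def required_power(h, R, W, noise):
    """Minimum transmit power for a user with channel gain h to decode
    rate R over bandwidth W with Gaussian noise power `noise`, as in (1)."""
    return noise / h * (2 ** (R / W) - 1)

def reward(P, gains, requested, R, W, noise):
    """Number of requesting users whose power requirement is met by the
    chosen transmit power P, as in (2)."""
    return sum(1 for h, u in zip(gains, requested)
               if u and P >= required_power(h, R, W, noise))

# Example: one good user (gain 0.9) and one bad user (gain 0.1).
R, W, noise = 1e6, 1e6, 1e-3        # 1 Mb/s over 1 MHz; illustrative values
P_good = required_power(0.9, R, W, noise)
P_bad = required_power(0.1, R, W, noise)
served = reward(P_good, [0.9, 0.1], [1, 1], R, W, noise)   # good user only
```

Transmitting at P_bad serves both users but costs nine times the power of P_good here, which is exactly the tradeoff the power control policy must optimize under the average power constraint.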
II-C MADS Power Control Policy
The power control policy in [20] is derived from the following optimization problem:

maximize over P(·):  Σ_s π(s) r(s, P(s))   (3)

subject to:  Σ_s π(s) P(s) ≤ P̄,   (4)

where P̄ is the average power constraint, the sums run over the M states s of the system, P(s) is the power chosen by the policy in state s, π(s) is the stationary probability of state s and r(s, P(s)) is the reward for state s, as defined in (2). This is a non-convex optimization problem. Mesh Adaptive Direct Search (MADS) [3] is used in [20] to solve this constrained optimization problem and obtain the power control policy. Though MADS achieves the global optimum, it is not scalable, as its computational complexity is very high.
The state space and action space of this problem can be very large even for a moderate number of users and channel gains; e.g., a system with L users and G channel-gain states per user has G^L channel-gain combinations alone. Therefore, in this paper we propose a deep reinforcement learning framework. This not only provides an optimal solution for a reasonably large system, but does so without knowing the arrival rates and the channel gain statistics. In addition, we will be able to provide an optimal solution even when the arrival and channel gain statistics change with time.
II-D MDP Formulation
The above system can be formulated as a finite state and action Markov Decision Process, denoted by the tuple (S, A, r, p, γ): (state space, action space, reward, transition probability, discount factor). Here p(s'|s, a) is the transition probability, a policy chooses power a = P(s) in state s, and the instantaneous reward is r(s, a), as in (2).
The action-value function [22] for this discounted MDP, for a policy μ, is

Q_μ(s, a) = E[ Σ_{t≥0} γ^t r(s(t), a(t)) | s(0) = s, a(0) = a, μ ].   (5)

The optimal Q-function is Q*(s, a) = max_μ Q_μ(s, a) and satisfies the optimality relation

Q*(s, a) = E_{s'}[ r(s, a) + γ max_{a'} Q*(s', a') ],   (6)

where s' is sampled with distribution p(·|s, a). If we know the optimal Q-function Q*, we can compute the optimal policy via μ*(s) = argmax_a Q*(s, a). We know the transition matrix of this system and hence can compute the Q-function. But the state space is very large even for a small number of users, rendering the computations infeasible. Thus, we use a parametric function approximation of the Q-function via deep neural networks and use deep RL algorithms to obtain the optimal policy.
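To make the optimality relation (6) concrete, the following tabular value iteration solves a toy two-state MDP. The chain, its rewards and the 0.1 "power cost" on the second action are invented purely for illustration; the paper's state space is far too large for such a table, which is exactly why function approximation is needed:

```python
def q_value_iteration(n_states, n_actions, reward, transition, gamma=0.9,
                      iters=500):
    """Iterate the Bellman optimality operator of (6):
    Q(s, a) <- r(s, a) + gamma * sum_s' p(s'|s, a) * max_a' Q(s', a').
    transition[s][a] is a dict {s': prob}; reward(s, a) is deterministic."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[reward(s, a) + gamma * sum(p * max(Q[s2])
                                         for s2, p in transition[s][a].items())
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q

# Toy chain: the chosen action is the next state; state 1 pays reward 1
# and action 1 costs 0.1 (a stand-in for a power cost).
trans = [[{0: 1.0}, {1: 1.0}],
         [{0: 1.0}, {1: 1.0}]]
rew = lambda s, a: (1.0 if s == 1 else 0.0) - 0.1 * a
Q = q_value_iteration(2, 2, rew, trans)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]   # greedy
```

The greedy policy pays the action cost in both states because the discounted value of reaching, and staying in, state 1 outweighs it.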
Further, to introduce the constraint into the MDP formulation, we look at the policies achieving

max_μ E_μ[ Σ_{t≥0} γ^t r(s(t), a(t)) ]  subject to  P̄(μ) ≤ P̄,   (7)

where

P̄(μ) = lim_{T→∞} (1/T) Σ_{t=1}^{T} E[P(t)]   (8)

is the long-term average power. We use the Lagrange method for constrained MDPs [2] to achieve the optimal policy. In this method, the instantaneous reward is modified to

r_λ(t) = r(t) − λ P(t),   (9)

where λ is the Lagrange constant achieving the optimal reward while maintaining P̄(μ) = P̄. Choosing λ wrongly will provide the optimal policy for an average power constraint different from P̄.
III Deep Reinforcement Learning Based Power Control Policy
In this section, we describe the Deep Q-Network (DQN) [17] based power control. First, we describe the DQN algorithm. We then propose a variant of DQN for constrained problems, wherein we use a Lagrange multiplier to take care of the average power constraint. We use a multi-timescale stochastic gradient descent approach to learn the Lagrange multiplier, so as to obtain the right average power. Finally, we change the learning step size from decreasing to constant, so that the optimal power control can track time-varying system statistics.
III-A Deep Q-Networks
DQN is a popular deep reinforcement learning algorithm for handling large state-space MDPs with unknown/complex dynamics. It is a value-iteration based method, in which the action-value function is approximated by a neural network. Though there are several follow-up works providing improvements over this algorithm [14, 30], we use it owing to its simplicity. We will show that DQN itself is able to provide us the optimal solution and tracking; these improvements may further improve the performance in terms of sample efficiency, estimator variance, etc. The DQN algorithm is given in Algorithm 1. Earlier attempts at combining non-linear function approximators such as neural networks with RL were unsuccessful due to instabilities caused by 1) correlated training samples, 2) drastic changes in the policy with small changes in the function approximation, and 3) correlation between the training target and the approximated function [13]. The success of DQN is attributed to addressing these issues with two key ingredients of the algorithm: the experience replay memory and the target network. The replay memory stores the transitions of the MDP, specifically the tuples (s, a, r, s'). The algorithm samples, uniformly, a random minibatch of transitions from the memory. This removes correlation between the data and smoothens the change of the data distribution across iterations. The target network, with weights θ⁻, and the randomly sampled minibatches from the memory form the training set for training the Q-network, with weights θ, at every epoch. This random sampling provides samples for performing stochastic gradient descent with loss

L(θ) = E[ (y − Q(s, a; θ))² ],   (10)

where y = r + γ max_{a'} Q(s', a'; θ⁻) is the target. The iterates are given by

θ_{k+1} = θ_k − α_k ∇_θ L(θ_k),   (11)

where the step sizes α_k satisfy

Σ_k α_k = ∞,  Σ_k α_k² < ∞.   (12)

The weights θ⁻ of the target network are held constant for several epochs, thereby controlling any drastic change in the policy and reducing the correlation between Q(·, ·; θ) and the target. This can be seen as a risk minimization problem in non-parametric regression, with regression function y and risk L(θ). Readers are referred to [33] for an elaborate analysis of DQN. Theorem 4.4 in [33] provides a proof of convergence and the rate of convergence using non-parametric regression bounds, when sparse ReLU networks are used, under certain smoothness assumptions on the reward function and the dynamics.
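The two stabilizing ingredients, replay memory and target network, can be sketched as below. For brevity the "network" here is a plain Q-table rather than a deep network, and the environment, constants and class name are all illustrative; only the replay/target mechanics mirror the algorithm described above:

```python
import random
from collections import deque

class TinyDQN:
    """Minimal sketch of DQN's replay memory and frozen target network.

    Updates are done on uniformly sampled minibatches from the replay
    memory (decorrelating the data), against targets computed from a
    copy of the weights refreshed only every `sync_every` updates.
    """
    def __init__(self, n_states, n_actions, gamma=0.9, lr=0.1,
                 memory=1000, batch=32, sync_every=50, seed=0):
        self.rng = random.Random(seed)
        self.Q = [[0.0] * n_actions for _ in range(n_states)]   # online
        self.Q_target = [row[:] for row in self.Q]              # target
        self.memory = deque(maxlen=memory)                      # replay
        self.gamma, self.lr, self.batch = gamma, lr, batch
        self.sync_every, self.updates = sync_every, 0

    def act(self, s, eps=0.1):
        if self.rng.random() < eps:                             # eps-greedy
            return self.rng.randrange(len(self.Q[s]))
        return max(range(len(self.Q[s])), key=lambda a: self.Q[s][a])

    def step(self, s, a, r, s2):
        self.memory.append((s, a, r, s2))
        if len(self.memory) < self.batch:
            return
        for ss, aa, rr, ss2 in self.rng.sample(list(self.memory), self.batch):
            y = rr + self.gamma * max(self.Q_target[ss2])       # TD target
            self.Q[ss][aa] += self.lr * (y - self.Q[ss][aa])    # SGD on (10)
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.Q_target = [row[:] for row in self.Q]          # refresh

# Toy chain: next state = chosen action; state 1 pays reward 1 and
# action 1 costs 0.1 (a stand-in for a power cost).
agent, s = TinyDQN(2, 2), 0
for _ in range(5000):
    a = agent.act(s, eps=0.2)
    r = (1.0 if s == 1 else 0.0) - 0.1 * a
    agent.step(s, a, r, a)
    s = a
greedy = [agent.act(s, eps=0.0) for s in (0, 1)]   # learned greedy policy
```

Replacing the table with a neural network over state features gives the DQN proper; the sampling and target-freezing logic is unchanged.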
III-B Adaptive Constrained DQN (ACDQN)
The DQN algorithm is meant for unconstrained optimization. Since our problem has an average power constraint P̄, we consider the instantaneous reward in (9), with a Lagrange multiplier λ. The long-term constraint depends on the Lagrange multiplier and can be quite sensitive to it. Thus, we design our algorithm, ACDQN, to learn the appropriate λ. We will see later that this also enables us to further modify our algorithm to track the changing statistics of the channel gains and arrivals. The ACDQN algorithm is given in Algorithm 2. Here, we use multi-timescale SGD as in [4]. In this approach, in addition to the SGD on θ using minibatches, we use a stochastic gradient descent on the Lagrange constant,

λ_{k+1} = max(0, λ_k + β_k (P̂_k − P̄)),   (13)

where P̂_k is an estimate of the average power. Since the expectation in (8) is not available to us, we take P̂_k to be the empirical average of the transmit power over a finite-horizon window of recent transmissions. Additionally, following [4], the step sizes α_k and β_k are required to satisfy

Σ_k α_k = Σ_k β_k = ∞,  Σ_k (α_k² + β_k²) < ∞,  β_k / α_k → 0.   (14)
Tracking with ACDQN: Tracking of the system statistics is essential to achieve optimal power control in a non-stationary system. In multi-timescale stochastic gradient descent algorithms such as ACDQN, the step sizes α and β can be held fixed to enable tracking. If β ≪ α, the Lagrange multiplier changes much more slowly than the Q-function. The two-timescale theory (see, e.g., [4]) then allows the Lagrange multiplier to adapt slowly to the changing system statistics while at the same time providing average power control. The solution will reach a neighbourhood of the optimal point.
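A stylized sketch of the constant-step-size multiplier update (13) and its tracking behaviour. Here the inner (fast) Q-learning loop is abstracted into an invented response curve `power_of_lambda`, which returns the average power drawn by the current λ-optimal policy; the curves, noise level and constants are all illustrative:

```python
import random

def track_lagrange(power_of_lambda, p_bar, beta=0.05, steps=4000, seed=0):
    """Constant-step-size SGD on the Lagrange multiplier, as in (13):
    lam <- max(0, lam + beta * (observed average power - p_bar))."""
    rng = random.Random(seed)
    lam = 0.0
    for _ in range(steps):
        p_hat = power_of_lambda(lam) + rng.gauss(0.0, 0.05)  # noisy estimate
        lam = max(0.0, lam + beta * (p_hat - p_bar))
    return lam

# Stylized response: a larger multiplier makes the policy draw less power.
power = lambda lam: 2.0 / (1.0 + lam)
lam1 = track_lagrange(power, p_bar=1.0)      # settles near lam = 1

# "Traffic change": the policy now draws more power at every lambda, so
# the constraint is met again only at a larger multiplier.
power2 = lambda lam: 3.0 / (1.0 + lam)
lam2 = track_lagrange(power2, p_bar=1.0)     # settles near lam = 2
```

With a decaying β_k, the second run would barely move once the steps have shrunk; a constant β keeps the multiplier oscillating in a small neighbourhood of the constraint-satisfying value instead.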
Although the convergence of this modified algorithm is not proved yet (even for the unconstrained DQN, convergence has been proved only recently, in [33]), our simulations will show that the resulting algorithm tracks the optimal solution in the time-varying scenario.
The time-varying scenario in our setup arises from changes in the request arrival statistics of the users and changes in the channel gain statistics due to the motion of the users.
IV Simulation Results and Discussion
In this section, we demonstrate the deep learning methods for power control proposed in this paper. We compare the performance of the ACDQN and MADS power control policies. Though MADS provides optimal solutions for small system sizes, it is not scalable. We show that the deep learning algorithm, ACDQN, indeed achieves the global optimum obtained by the MADS algorithm, while being scalable with the system size (number of users). We further demonstrate that the ACDQN algorithm tracks changing system dynamics and obtains the optimal policy adaptively. We use the Keras library [6] in Python to implement our algorithms, and our system is implemented in MATLAB.
We consider two systems: one with 4 users, on which we compare ACDQN and MADS, and another with 20 users, on which we show the performance of ACDQN. MADS is not able to provide a solution for the second system, since the space and time complexity of MADS increase exponentially with the number of users. In all the examples, we split the users into two equal-sized groups, one with good channel statistics and the other with bad channel statistics. In both systems, we compare all the algorithms with a constant power control policy, in which the transmit power is held fixed, to indicate the gain due to power control. The system and algorithm parameters used for the simulations are as follows:
IV-1 Small User Case
Number of users: 4, Catalog Size , File Size , Transmission rate , Bandwidth , Channel Gains: Uniform([0.1 0.2 0.3]) for the two users with bad channel statistics and Uniform([0.7 0.8 0.9]) for the two users with good channel statistics, File Popularity: Uniform (Zipf exponent = 0), Average Power Constraint , Simulation time: multicast transmissions.
IV-2 Large User Case
System Parameters: Number of users: 20, Power Transmit Levels = 20 (1 to 50), Channel Gains: exponentially distributed (with one mean for the bad channels and another for the good channels), File Popularity: Zipf distribution (Zipf exponent = 1). Simulation time: multicast transmissions. In both cases we fix the noise power at the receivers.
IV-3 Hyperparameters
We consider fully connected neural networks with two hidden layers for all the function approximations considered in the algorithms. The input layer encodes the system state, and the output layer has 20 nodes, the number of transmit power levels; each output represents the Q-value of a particular action. The action space is restricted to be finite, as DQN converges only with finite action spaces. The two hidden layers have 128 and 64 nodes, respectively, each with a ReLU activation function. The other parameters include the replay memory size, the minibatch size and the step sizes.
Achieving the Global Optimum (ACDQN vs MADS): We use the system setting of the small user case specified above. We run the system under the average power constraint, with exponentially distributed inter-arrival times, for arrival rates from 0.4 to 4.0. Figure 2 shows a comparison of the sojourn times of the constant power policy, MADS and ACDQN. Further, Figure 3 shows the convergence of the average power for ACDQN. We see from Figure 3 that ACDQN achieves the global optimum achieved by MADS, while maintaining the average power constraint.
ACDQN performance in a scaled network: To show the scalability of ACDQN, we simulate the relatively complex system of the large user case above. We see in Figure 4 that ACDQN gives a drastic improvement (around 50 percent) over the constant power case. ACDQN achieves this while maintaining the average power, by learning the Lagrange constant, as seen in Figure 6. Figure 5 shows the convergence of the average power of ACDQN to the average power constraint, for an arrival rate of 1.0 requests per second, in the same simulation run.
ACDQN tracking simulations: In this section we show, via simulations, the tracking capabilities of ACDQN. We show this for the large user case with the average power constraint. We fix the step sizes α and β to constants. As explained previously, this is important for detecting changes in the environment dynamics faster. In this simulation, we change the arrival rate every six hours over a period of 24 hours. This captures the real-world scenario where the request traffic at the base station varies with the time of day. To make the learning harder for our algorithm, we make these changes abruptly, at every six hours, using different arrival rates for the 1st, 2nd, 3rd and 4th six-hour periods. We plot the ACDQN performance in Figure 7. We calculate the mean sojourn time and the average power using a moving-average window of 1000 samples. We observe that, for each arrival rate in this simulation, ACDQN achieves the corresponding stationary mean sojourn time; for instance, the corresponding values in Figure 4 and Figure 7 are comparable. It is important to note that this performance is achieved while maintaining the average power constraint, as can be seen in Figure 8. The effect of fixing the learning rates is seen in the small oscillations of the average power in Figure 8; these are oscillations in a small neighbourhood around the optimal average power. The smaller the step size, the smaller the oscillations.
Next, we demonstrate the importance of constant step sizes for α and β, and the inability of decaying step sizes to track the changing system statistics. We consider a system where the arrival rate changes over a period of 48 hours: it is fixed for the first 24 hours and then changed at each of four consecutive six-hour intervals. This change of time frame is chosen just to illustrate the tracking ability more emphatically; the effect would also be visible over the previous time frame, but would require more simulation time. We run the ACDQN algorithm for this system with: 1) decaying step sizes α_k and β_k satisfying (14), and 2) constant step sizes α and β. The rest of the parameters remain the same as in the large user case. We see in Figure 9 that ACDQN with constant step sizes almost always outperforms the decaying-step-size version. Specifically, after the first 24 hours, the delay reduction is nearly 50 percent for the constant step sizes. The reason is evident from Figures 10 and 11. We see in Figure 11 that ACDQN with constant step sizes keeps learning the Lagrange constant throughout the simulation, whereas the decaying-step-size version is unable to learn the Lagrange constant after the first 24 hours. As can be seen in Figure 10, this affects the average power achieved by ACDQN with decaying step sizes: while the constant step sizes maintain the average power constraint, the average power achieved with decaying step sizes drops below it. Hence, the decaying-step-size ACDQN suffers from sub-optimal utilization of the available power. Thus, in practical systems, only the constant-step-size ACDQN will be capable of adapting to the changing system statistics.
Discussion: We see from the simulations that deep RL techniques can achieve globally optimal performance while providing scalability with the system size. Our two-timescale approach, ACDQN, extends this to systems with constrained control. Though we have demonstrated this on a system with a single constraint, ACDQN can readily be extended to systems with multiple constraints. In such systems, each constraint is associated with its own Lagrange constant, and each Lagrange constant adds an additional SGD step to the ACDQN algorithm. For a stationary system, it is enough that the step sizes satisfy a multi-timescale criterion similar to (14); see [4]. However, if ACDQN is used in systems with changing statistics, the step sizes should be kept constant. The step sizes should be fixed as per the tolerance requirement for a given constraint, i.e., the allowed deviation from the constraint: the smaller the tolerance, the smaller the step size. However, fixing the step sizes too small may make the algorithm too slow to track the changes in the system statistics. Hence, choosing the step sizes is a tradeoff between the tolerance of the constraint and the algorithmic agility required to track the system changes.
V Conclusion
We have considered a multicast downlink in a wireless network. Fading of the links to different users causes a significant reduction in the performance of the system; however, appropriate changes in the scheduling policies together with power control can mitigate most of the losses. Obtaining the optimal power control for this system is computationally very hard. We have shown that, using deep reinforcement learning, we can obtain the optimal power control online, even when the system statistics are unknown. We use a recently developed version of Q-learning, the Deep Q-Network, to learn the Q-function of the system via function approximation. Furthermore, we modify the algorithm to satisfy our constraints and to make the optimal policy track time-varying system statistics. The DDQN variant of ACDQN provides similar performance.
One interesting extension of this work would be to add caches at the user nodes and learn the optimal caching policy along with the power control using deep RL. Future work may also consider applying ACDQN to multiple-base-station scenarios for interference mitigation.
References

[1] (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 22–31.
[2] (1999) Constrained Markov decision processes. CRC Press.
[3] (2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization 17(1), pp. 188–217.
[4] (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press.
[5] (2007) I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, New York, USA, pp. 1–14.
[6] (2015) Keras. GitHub. https://github.com/fchollet/keras
[7] (2016) Cisco visual networking index: global mobile data traffic forecast update, 2016–2021. White paper.
[8] (2009) Queue length analysis for multicast: limits of performance and achievable queue length with random linear coding. In 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 462–468.
[9] (2018) Reinforcement learning based power control for VANET broadcast against jamming. In IEEE Global Communications Conference (GLOBECOM), pp. 1–6.
[10] (2011) Broadcasting delay-constrained traffic over unreliable wireless links with network coding. IEEE/ACM Transactions on Networking 23, pp. 728–740.
[11] (2003) Capacity and optimal power allocation for fading broadcast channels with minimum rates. IEEE Transactions on Information Theory 49(11), pp. 2895–2909.
[12] (2014) Scheduling multicast traffic with deadlines in wireless networks. In IEEE INFOCOM 2014, pp. 2193–2201.
[13] (2018) Deep reinforcement learning. CoRR, arXiv:1810.06339.
[14] (2016) Continuous control with deep reinforcement learning. CoRR.
[15] (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3), pp. 293–321.
[16] (2014) Fundamental limits of caching. IEEE Transactions on Information Theory 60(5), pp. 2856–2867.
[17] (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
[18] (2015) Improving queue stability in wireless multicast with network coding. In IEEE International Conference on Communications (ICC), pp. 3382–3387.
[19] (2018) Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. arXiv:1808.00490v3.
[20] (2019) Queueing theoretic models for multicasting under fading. In IEEE Wireless Communications and Networking Conference (WCNC), Marrakech, Morocco.
[21] (2018) Queuing theoretic models for multicast and coded-caching in downlink wireless systems. arXiv:1804.10590.
[22] (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, New York, NY, USA.
[23] (2016) Stability, rate, and delay analysis of single bottleneck caching networks. IEEE Transactions on Communications 64(1), pp. 300–313.
[24] (2008) Finite-state Markov modeling of fading channels: a survey of principles and applications. IEEE Signal Processing Magazine 25(5), pp. 57–80.
[25] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
[26] (2017) Proximal policy optimization algorithms. CoRR, arXiv:1707.06347.
[27] (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), pp. 1140–1144.
[28] (2000) Joint broadcast scheduling and user's cache management for efficient information delivery. Wireless Networks 6(4), pp. 279–288.
[29] (2018) Reward constrained policy optimization. CoRR, arXiv:1805.11074.
[30] (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
[31] (2013) Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
[32] (2003) A distributed joint scheduling and power control algorithm for multicasting in wireless ad hoc networks. In IEEE International Conference on Communications (ICC '03), Vol. 1, pp. 725–731.
[33] (2019) A theoretical analysis of deep Q-learning. CoRR, arXiv:1901.00137.
[34] (2019) Deep reinforcement learning based resource allocation for V2V communications. IEEE Transactions on Vehicular Technology 68(4), pp. 3163–3173.