I Introduction
Due to its flexible mobility and low cost, unmanned aerial vehicle (UAV) communication is viewed as an important solution for future communication systems [1]. In fact, UAVs have already been considered for deployment in many fields [2, 3], such as wireless power transfer, secure communications, relaying, wireless sensor networks, and caching. However, applying UAVs in communication systems still faces many challenges, such as deployment, effective resource allocation, energy efficiency, and trajectory design.
The existing literature has studied a number of problems related to the use of UAVs for wireless communication, such as [4, 5, 6, 7, 8, 9, 10, 11]. The work in [4] considered a multi-UAV enabled wireless communication system, where multiple UAV-mounted aerial base stations are employed to serve a group of users. The authors in [5] investigated the deployment of UAVs as flying base stations so as to provide wireless communications to a certain geographical area. In [6], the authors proposed a UAV-based framework to provide service for the mobile users in a cloud radio access network system. The work in [7] deployed UAVs in a cellular network and designed an optimal spectrum trading scheme, so as to provide temporary downlink data offloading. In [8], the authors proposed a UAV-enabled mobile edge computing system for maximization of the computation rate. The authors in [9] investigated an uplink power control problem for UAV-based wireless networks. The work in [10] analyzed the link capacity between autonomous UAVs with random trajectories. In [11], the authors studied the capacity region of a UAV-enabled two-user broadcast channel. However, most of the existing literature, such as [7, 5, 6, 4, 8, 9, 10, 11], uses the UAVs as high-altitude, static base stations or relays and does not consider exploiting the flexible mobility of UAVs to provide service for the ground users. Moreover, these existing works only focus on the ground users and do not consider providing service to users that are located in the air. Indeed, none of this existing body of literature analyzes the potential of using machine learning tools for leveraging the movable nature of UAVs to assist wireless communications. The complexity of UAV trajectory design makes it essential to introduce reinforcement learning algorithms to optimize the performance of UAV-assisted wireless networks.
The use of reinforcement learning for solving communication problems was studied in [12, 13, 14, 15, 16, 17, 18]. The work in [12] applied a deep Q-network in a mobile communication system to reduce the exploration time for achieving an optimal communication policy. The authors in [13] proposed a Q-learning based algorithm to coordinate power allocation and control interference levels, so as to maximize the sum data rate of device-to-device (D2D) users while guaranteeing QoS for cellular users. In [14], the authors proposed an expected Q-learning algorithm to solve the spectrum allocation problem in LTE networks that operate in unlicensed spectrum (LTE-U) with downlink-uplink decoupling and improve the total rate. In [15, 16], the authors proposed an echo state network (ESN) based learning algorithm to solve the spectrum allocation problem in wireless networks. The work in [17] proposed a reinforcement learning scheme to improve spectral efficiency in cloud radio access networks. The authors in [18] used an artificial neural network to provide reliable wireless connectivity for cellular-connected UAVs. However, most of the existing works [12, 13, 14, 15, 16, 17, 18] focused on the use of the traditional Q-learning algorithm but ignored its key disadvantage: the traditional Q-learning algorithm tends to substantially overestimate action values. In UAV-based wireless networks, this overestimation will result in a suboptimal design of the UAV trajectory or a suboptimal resource allocation policy, which will degrade the performance of wireless networks. The main contribution of this paper is to develop a novel framework that enables UAVs to find an optimal flying trajectory to maximize the number of satisfied users in a cellular network. To the best of our knowledge, this is the first work that considers the flying trajectory of UAVs with three-dimensional users that have their own data requests and delay requests. In this regard, our key contributions are summarized as follows:

We propose a novel model of a UAV-based cellular network where the UAV is deployed as a flying, movable base station for downlink transmission. In this model, users are divided into ground users and aerial users. All of the users send their data request and delay request to the UAV, and the UAV designs an optimal flying trajectory to maximize the number of satisfied users.

We develop a double Q-learning framework to optimize the flying trajectory of the UAV so as to maximize the number of satisfied users. Compared to the traditional Q-learning algorithm [19], the proposed algorithm uses two Q-tables to decouple action selection from action evaluation, thereby avoiding the overestimation caused by selecting and evaluating actions with only one Q-table. Hence, the proposed algorithm can converge to the optimal trajectory, which yields the maximum number of satisfied users.

Simulation results show that, in terms of the number of satisfied users, the proposed algorithm can yield up to a 19.4% gain compared to the random algorithm and a 14.1% gain compared to the Q-learning algorithm.
The rest of this paper is organized as follows. The system model and problem formulation are described in Section II. The double Q-learning based optimal trajectory design is proposed in Section III. In Section IV, numerical simulation results are presented and analyzed. Finally, conclusions are drawn in Section V.
II System Model and Problem Formulation
Consider the downlink transmission of a cellular network that consists of an unmanned aerial vehicle (UAV) serving a set 𝒰 of U users, as shown in Fig. 1. In our model, we consider two types of users: ground users and aerial users. The ground users are traditional cellular users standing on the ground, denoted by a set 𝒢 of ground users. The aerial users represent cellular users in the air, such as camera drones, sensor drones, and aerial vehicles, denoted by a set 𝒜 of aerial users. Service operators provide all information related to the users to the UAV, such as the users' data requests and locations. The location of each user i is given as a three-dimensional coordinate (x_i, y_i, z_i), where x_i and y_i are the horizontal coordinates and z_i is the altitude of user i. For a ground user, the altitude is z_i = 0. For an aerial user, the altitude is typically greater than 50 meters. The UAV will fly from a designated position to serve each user at a fixed altitude H. In the studied model, the UAV can only serve one user at each time slot, and the UAV will only fly to serve the next user after the previous service is completed.
Each user's request that is sent to the UAV consists of two elements: the size of the data that the user requests and the maximum time that the user can wait for service, which is referred to as the endurance time. Mathematically, let D_i be the size of the data that user i requests and T_i be its endurance time. The size of the requested data depends on the service type. Meanwhile, the endurance time covers both the time that the user waits for the UAV to arrive and the time that the UAV serves the user. If the request of a given user is completely served before its endurance time expires, the user will be satisfied with the service provided by the UAV; such a user is referred to as a satisfied user. Next, we first introduce the waiting delay and the time that the UAV spends serving a user. Then, we formulate the problem of maximizing the number of satisfied users.
II-A Waiting Delay
The waiting delay of each user i consists of the time that the UAV spends serving the users scheduled before user i and the time that the UAV flies to user i. We assume that the speed of the UAV is v and that the distance between the UAV and user i is d_i. The flying time from the UAV to user i is given by:

τ_i = d_i / v. (1)
We assume that the time at which the UAV starts to fly to user i is given by:

t_i = t_j + S_j, (2)

where S_j, which will be further defined in Subsection II-B, represents the total service time for user j, j represents the user that is served immediately before user i, and π represents the users' service order, with π(i) being the position of user i in that order. For example, π(i) = 1 denotes that the UAV will serve user i first. Note that, if user i is served first, t_i = 0. Hence, t_i depends on the users that have already been served by the UAV. The total waiting time of each user i is given by:

w_i = t_i + τ_i. (3)
From (3), we can see that the total waiting delay for user i depends on both the flying time τ_i and the total service time S_j of the previously served user j. We can also see that, as the service order changes, the time at which the UAV finishes the previous service and, thus, the total waiting delay also change.
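As an illustration, the delay recursion in (1)-(3) can be sketched in a few lines of Python. This is our own minimal sketch, not the paper's implementation: the function name is ours, positions are assumed to be 3-D coordinates, and a user's total service time is assumed to be its flying time plus its transmission delay (the latter defined in Section II-B and passed in here as a precomputed map).

```python
import math

def waiting_delays(order, positions, uav_start, speed, tx_time):
    """Compute each user's waiting delay w_i = t_i + tau_i along a service order.

    order: list of user indices in the sequence they are served
    positions: dict user -> (x, y, z); uav_start: (x, y, z) of the UAV
    speed: UAV speed v in m/s; tx_time: dict user -> transmission delay D_i / c_i
    """
    t = 0.0                                  # time the UAV starts to fly to the next user
    pos = uav_start
    w = {}
    for i in order:
        d = math.dist(pos, positions[i])     # distance d_i to user i
        fly = d / speed                      # flying time tau_i = d_i / v, eq. (1)
        w[i] = t + fly                       # waiting delay w_i, eq. (3)
        t = w[i] + tx_time[i]                # service of i ends; next start time, eq. (2)
        pos = positions[i]
    return w
```

For instance, with the UAV at the origin, a user 100 m away, and v = 50 m/s, the first user's waiting delay is simply its 2 s flying time, while each later user additionally inherits all earlier flying and transmission times.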
II-B Transmission Delay
Next, we introduce the models of the transmission links between the UAV and the users. Due to the altitude differences between ground users and aerial users, their channel conditions are also different. In consequence, the UAV-to-ground and UAV-to-aerial transmission links are defined separately as follows.
1) UAV-Ground User Links: A probabilistic channel model is used for the transmission link between the UAV and ground user i. Probabilistic line-of-sight (LoS) and non-line-of-sight (NLoS) links were considered in [5], where the attenuation of an NLoS link is much higher than that of a LoS link due to shadowing and diffraction loss. The LoS and NLoS channel gains of the UAV transmitting data to ground user i are given by [20]:

g_i^LoS = d_i^(−α),  g_i^NLoS = η d_i^(−α), (4)
respectively, where α is the path loss exponent for the UAV transmission link and η is an additional attenuation factor caused by the NLoS connection. According to [20], the probability of a LoS link is given by:

P_i^LoS = 1 / (1 + a exp(−b(θ_i − a))), (5)
where a and b are environmental parameters and θ_i = (180/π) sin^(−1)((H − z_i)/d_i) is the elevation angle, with H − z_i being the deviation between user i's altitude and the UAV's fixed altitude H. Then, the average channel gain from the UAV to ground user i is given by [20]:

ḡ_i = P_i^LoS g_i^LoS + P_i^NLoS g_i^NLoS, (6)
where P_i^NLoS = 1 − P_i^LoS. The downlink rate of ground user i is given by:

c_i = B log₂(1 + γ_i), (7)
where γ_i = P ḡ_i / σ² is the downlink signal-to-noise ratio (SNR) between the UAV and ground user i, B is the bandwidth of the downlink transmission links, P is the transmit power of the UAV, and σ² is the variance of the Gaussian noise.
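For concreteness, the chain from elevation angle to average rate in (5)-(7) can be sketched as follows. This is an illustrative sketch under our own assumptions: default parameter values follow Table I, the noise power of Table I is taken as −74 dBm and converted to watts, and the function name is ours.

```python
import math

def ground_rate(d, h_diff, alpha=2.0, eta=0.3, a=11.95, b=0.136,
                P=5.0, B=1e6, noise_dbm=-74.0):
    """Average downlink rate to a ground user under the probabilistic
    LoS/NLoS model of (4)-(7). d: UAV-user distance (m); h_diff: altitude
    difference H - z_i (m)."""
    theta = math.degrees(math.asin(h_diff / d))           # elevation angle in degrees
    p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))  # LoS probability, eq. (5)
    g = p_los * d**(-alpha) + (1 - p_los) * eta * d**(-alpha)  # average gain, eq. (6)
    sigma2 = 10 ** (noise_dbm / 10) / 1000                # noise power in watts
    return B * math.log2(1 + P * g / sigma2)              # downlink rate, eq. (7)
```

As expected from (5)-(7), moving a user farther from the UAV both lowers the LoS probability (smaller elevation angle) and increases the path loss, so the rate decreases monotonically with distance at a fixed altitude difference.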
2) UAV-Aerial User Links: A millimeter wave (mmWave) propagation channel is used to model the transmission link between the UAV and aerial user i. The mmWave channel can provide a high transmission rate so that the UAV can complete users' requests quickly and in time. Due to the high altitudes of the UAV and aerial user i, the transmission link can be considered a LoS link, whose path loss (in dB) is given by [6, 21]:

L_i = 20 log₁₀(4π f d_i / c) + η_LoS, (8)
where 20 log₁₀(4π f d_i / c) represents the free space path loss, with d_i being the distance between aerial user i and the UAV, f the carrier frequency of the mmWave link, and c the speed of light, and η_LoS represents the additional attenuation factor due to the LoS connection. The downlink rate of aerial user i is given by:

c_i = B log₂(1 + γ_i), (9)
where γ_i = P · 10^(−L_i/10) / σ² is the downlink signal-to-noise ratio (SNR) between the UAV and aerial user i.
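The mmWave link budget of (8)-(9) can be sketched analogously. Again this is our own illustrative sketch: defaults follow Table I, the LoS attenuation factor is assumed to be 2 dB, the noise power is assumed to be −74 dBm, and the function name is ours.

```python
import math

def aerial_rate(d, f=35e9, eta_los_db=2.0, P=5.0, B=1e6, noise_dbm=-74.0):
    """Downlink rate to an aerial user over the mmWave LoS link of (8)-(9).
    d: UAV-user distance (m); f: carrier frequency (Hz)."""
    c = 3e8                                               # speed of light, m/s
    fspl_db = 20 * math.log10(4 * math.pi * f * d / c)    # free space path loss
    loss_db = fspl_db + eta_los_db                        # total path loss L_i, eq. (8)
    gain = 10 ** (-loss_db / 10)                          # linear channel gain
    sigma2 = 10 ** (noise_dbm / 10) / 1000                # noise power in watts
    return B * math.log2(1 + P * gain / sigma2)           # downlink rate, eq. (9)
```

Since the free space path loss grows with 20 log₁₀(d), the achievable rate again decreases monotonically with the UAV-user distance.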
Hence, the transmission delay of serving user i is given by:

T_i^tr = D_i / c_i. (10)
The total time that the UAV spends serving user i can be calculated as:

S_i = τ_i + T_i^tr = τ_i + D_i / c_i. (11)
II-C Problem Formulation
Given the defined system model, our goal is to design a flying trajectory so as to maximize the number of satisfied users. Next, we first introduce the notion of a satisfied user. Then, we formulate the optimization problem. Given the size D_i of the data that user i requests and the endurance time T_i, the satisfaction indicator of user i is defined as follows:

u_i(π) = 𝟙{w_i + T_i^tr ≤ T_i}, (12)

where 𝟙{x} = 1 if the statement x is true and 𝟙{x} = 0 otherwise. u_i(π) = 1 indicates that, under the service order π, the UAV completes the request of user i within the endurance time and user i is satisfied.
Having introduced the notion of a satisfied user in (12), the next step is to introduce a flying trajectory management mechanism for the UAV to maximize the number of satisfied users. This problem can be formulated as:
max_π Σ_{i∈𝒰} u_i(π), (13a)

s.t. π(i) ∈ {1, …, U}, π(i) ≠ π(j), ∀ i ≠ j, (13b)
where u_i(π) is defined in (12) and π(i) represents the service order of user i. Problem (13) aims to find the optimal trajectory so that the UAV can complete most of the users' requests within their endurance times after receiving all the user requests. Since finding the optimal trajectory requires evaluating all U! possible permutations of the service order π, which takes a substantial amount of service time, it is essential to introduce a learning algorithm to shorten the time needed to compute the trajectory.
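To make the combinatorial cost concrete, the brute-force baseline that this argument rules out would enumerate every service order. The sketch below is ours, not the paper's: `count_satisfied` is a hypothetical callback that scores a whole order by evaluating the indicator (12) for each user.

```python
from itertools import permutations

def brute_force_trajectory(users, count_satisfied):
    """Exhaustive search over all U! service orders.

    count_satisfied: callable mapping an order (tuple of users) to the
    number of satisfied users under that order, per eq. (12)."""
    best_order, best_score = None, -1
    for order in permutations(users):        # U! candidate trajectories
        score = count_satisfied(order)
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score
```

Even at U = 15 users this loop visits 15! ≈ 1.3 × 10¹² orders, which is why the paper turns to reinforcement learning instead.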
III Double Q-Learning Framework for Maximizing the Number of Satisfied Users
To solve the maximization problem in (13), we introduce a reinforcement learning framework based on double Q-learning. Compared to existing reinforcement learning algorithms [12, 13, 14] such as Q-learning, which may result in a suboptimal trajectory and, thus, fail to maximize the number of satisfied users, the proposed double Q-learning algorithm enables the UAV to find the optimal flying trajectory to serve the users so as to maximize the number of satisfied users. Moreover, compared to the traditional Q-learning algorithm that typically uses one Q-table to record and update the values resulting from different states and actions [19], the proposed double Q-learning algorithm uses two Q-tables to separately select and evaluate the actions. In this regard, the proposed double Q-learning algorithm avoids the overestimation of Q values. This overestimation usually occurs in the traditional Q-learning algorithm due to the positive feedback caused by selecting and evaluating the action in the same Q-table.
Next, we first introduce the components of the double Q-learning algorithm. Then, we explain the procedure of using the double Q-learning algorithm to find the optimal flying trajectory for the UAV.
III-A Components of the Double Q-Learning Algorithm
A double Q-learning model consists of four basic components: a) agent, b) actions, c) states, and d) reward function, which are specified as follows.

Agent: In this problem, the agent is the UAV. The UAV can collect the users' information, such as the users' locations, the sizes of the data that the users request, and the endurance times.

Action: The actions of the double Q-learning algorithm determine the user that the UAV will serve at the next time slot. Let a_t be an action of the UAV, with a_t ∈ 𝒰. For example, a_t = i denotes that the UAV will provide service for user i.

State: Each state s_t of the UAV consists of: 1) the vector m = [m_1, …, m_U] that indicates whether each user has been served by the UAV, where m_i = 1 denotes that user i has already been served by the UAV and m_i = 0 otherwise; 2) the endurance time vector [T_1, …, T_U]; 3) the waiting time vector [w_1, …, w_U], where the entry of a user that has not yet been served by the UAV takes a default value; 4) the flying time vector [τ_1, …, τ_U].
Reward: The reward function R(s_t, a_t) is defined as the total number of satisfied users after the UAV takes action a_t under state s_t. R(s_t, a_t) can be specified as follows:

R(s_t, a_t) = Σ_{i∈𝒰} u_i. (14)
III-B Double Q-Learning for Trajectory Optimization
Given the components of the double Q-learning algorithm, we explain how to use it to solve the problem in (13). To find the optimal trajectory, the proposed learning algorithm uses Q-tables to store the reward values resulting from different states and actions.
In contrast to traditional Q-learning, which determines both the action selection policy and the Q-value updates using a single Q-table, the proposed algorithm uses two Q-tables to separately determine the action selection policy and update the Q values. Hence, the proposed algorithm can avoid the overestimation in Q-learning. The optimal Q values are given by Bellman's optimality equation [22]:

Q*(s, a) = Σ_{s'} P_{ss'}^a [R(s, a) + γ max_{a'} Q*(s', a')], (15)
where γ is the discount factor and P_{ss'}^a is the transition probability from state s to state s' under action a. To enable the UAV to record the values of the Q-tables in (15), the Q-tables need to be updated at each time slot t, as follows:

Q^A(s_t, a_t) ← Q^A(s_t, a_t) + λ[R(s_t, a_t) + γ Q^B(s_{t+1}, a*) − Q^A(s_t, a_t)], (16)

Q^B(s_t, a_t) ← Q^B(s_t, a_t) + λ[R(s_t, a_t) + γ Q^A(s_{t+1}, b*) − Q^B(s_t, a_t)], (17)

where λ is the learning rate, a* = argmax_a Q^A(s_{t+1}, a), b* = argmax_a Q^B(s_{t+1}, a), and s_{t+1} is the next state after taking action a_t at state s_t. Note that, at each iteration, only one Q-table is updated. To update the values of the Q-tables in (16) or (17), the UAV needs to select one action to implement at time t. The action selection policy of the UAV is given by:
P(a_t = a) = 1 − ε + ε/|𝒰|, if a = argmax_{a'} Q(s_t, a'); ε/|𝒰|, otherwise, (18)

where P(a_t = a) denotes the probability that the UAV takes action a, ε is the exploration probability of a random action, 𝒰 denotes the set of actions of the UAV, and |𝒰| = U is the total number of actions of the UAV. Note that the proposed algorithm has two Q-tables. Hence, one of the Q-tables is used to determine the action selection policy in (18), and the other Q-table is used to evaluate the selected action when updating via (16) or (17). For example, if Q^A is used to determine the action selection policy, then Q^B must be used to evaluate the selected action. Based on the above formulations, the double Q-learning algorithm performed by the UAV is summarized in Algorithm 1.
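One iteration of the update pair (16)-(17), including the random choice of which table to update, can be sketched as follows. This is our own illustrative sketch of the standard double Q-learning step, not the paper's Algorithm 1: the dictionary-based Q-table representation and the function name are ours.

```python
import random

def double_q_step(QA, QB, s, a, r, s_next, actions, lr=0.5, gamma=0.8):
    """One double Q-learning update, eqs. (16)-(17): with probability 1/2
    update Q^A, selecting the greedy next action in Q^A but evaluating it
    with Q^B; otherwise the roles are swapped. QA and QB are dicts mapping
    (state, action) pairs to Q values; missing entries default to 0."""
    if random.random() < 0.5:
        Q_sel, Q_eval = QA, QB               # update Q^A, evaluate with Q^B, eq. (16)
    else:
        Q_sel, Q_eval = QB, QA               # update Q^B, evaluate with Q^A, eq. (17)
    a_star = max(actions, key=lambda b: Q_sel.get((s_next, b), 0.0))  # greedy action in the updated table
    target = r + gamma * Q_eval.get((s_next, a_star), 0.0)            # cross-table evaluation
    old = Q_sel.get((s, a), 0.0)
    Q_sel[(s, a)] = old + lr * (target - old)
```

The key point is the cross-evaluation: the table that picks the greedy next action never supplies its own value estimate for it, which is what removes the positive-feedback overestimation of single-table Q-learning.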
III-C Convergence of the Proposed Algorithm
Next, we analyze the convergence of the proposed double Q-learning algorithm. We first prove that the proposed framework is a Markov decision process (MDP) [23]. Then, we prove that the proposed algorithm will converge and find the optimal trajectory for the UAV to maximize the number of satisfied users. The following theorem proves that the proposed framework is an MDP:
Theorem 1
The proposed double Q-learning framework is an MDP.
Proof:
An MDP consists of five basic components [23]: 1) a finite set of states, 2) a finite set of actions, 3) a transition probability function, 4) an immediate reward function, and 5) a set of decision epochs, which can be finite or infinite. Next, we prove that the components of the proposed double Q-learning framework satisfy the conditions of an MDP.
In the proposed framework: 1) The number of actions is equal to the number of users U. In consequence, the set of actions is finite. 2) The number of states is equal to 2^U, which accounts for all possible combinations of whether each user has been served. Thus, the set of states is finite. 3) The action selection policy in (18) specifies the probability of the UAV taking action a_t, which also determines the transition probability from the current state s_t to the next state s_{t+1}. 4) The reward function is immediately determined by the current state and the action to be taken. 5) The UAV can take decisions at any time, which leads to an infinite set of decision epochs.
In summary, the proposed double Q-learning framework satisfies all the conditions of an MDP. Therefore, the framework is an MDP.
From Theorem 1, we can see that the proposed double Q-learning framework is an MDP with the five basic components. Thus, the convergence of the proposed double Q-learning algorithm can be viewed as the convergence of an MDP, which is given by the following corollary.
Corollary 1
In the proposed double Q-learning algorithm, both Q^A and Q^B will eventually converge to the optimal value function Q* with probability one.
Proof:
The work in [22, Theorem 1] proved that, for an MDP corresponding to a double Q-learning algorithm, both Q-tables converge to the same optimal value function under the following conditions: 1) the MDP is finite; 2) the discount factor satisfies γ ∈ [0, 1); 3) the Q values are stored in a lookup table; 4) both Q^A and Q^B receive an infinite number of updates; 5) the learning rate satisfies λ ∈ [0, 1]; 6) the reward function is finite.
In the proposed framework: 1) Both the state and action sets are finite. In consequence, the MDP is finite. 2) The discount factor γ is set to a reasonable value in [0, 1). 3) Two Q-tables store all the Q values related to the states and actions, and each Q value can be looked up by its state and action. 4) Both Q^A and Q^B can be updated infinitely, without artificial limits. 5) The learning rate λ is set to a reasonable value in [0, 1]. 6) The reward function R(s_t, a_t) represents the number of satisfied users after taking action a_t at the current state s_t. Thus, the result is an integer no larger than the total number of users U. Obviously, the reward function is finite.
In consequence, the proposed algorithm satisfies all of the conditions in [22, Theorem 1]. Thus, both Q^A and Q^B will converge to the same optimal value function.
From Corollary 1, we can see that both Q^A and Q^B in the proposed algorithm converge to the same optimal value function. This optimal value function corresponds to the optimal trajectory, which leads to the maximum number of satisfied users. Therefore, as the proposed algorithm converges, it maximizes the number of satisfied users.
IV Simulation Results
In our simulations, we consider a circular UAV-based cellular network area with a given radius, uniformly distributed users, and one UAV. The number of ground users is equal to the number of aerial users. The proposed double Q-learning algorithm is implemented in Matlab. The other system parameters are listed in Table I. We compare the proposed algorithm with a random algorithm that selects the user to serve in a random order and with the traditional Q-learning algorithm in [19]. All statistical results are averaged over 5000 independent runs.
Parameter | Description | Value
H | UAV altitude | 100 m
α | Path loss exponent | 2
η | NLoS attenuation factor | 0.3
η_LoS | LoS attenuation factor | 2
a, b | Environment parameters | 11.95, 0.136
σ² | Noise power | −74 dBm
P | UAV transmit power | 5 W
B | Bandwidth | 1 MHz
f | mmWave frequency | 35 GHz
v | UAV speed | 50 m/s
λ | Learning rate | 0.5
γ | Discount factor | 0.8
ε | Exploration rate | 0.5
In Fig. 2, we show how the number of satisfied users changes as the endurance time changes. From Fig. 2, we can see that, as the endurance time increases, the number of satisfied users increases. This is due to the fact that the UAV can complete more user requests when the users' endurance times are longer. Fig. 2 also shows that the number of satisfied users in all three algorithms reaches 20 when the endurance time increases to 100. This is due to the fact that, when the endurance time is long enough, the UAV can complete all the user requests under an arbitrary trajectory.
Fig. 3 shows how the number of satisfied users changes as the total number of users varies, with the endurance time set to 50 seconds. The simulation results are averaged over 5000 independent runs and, hence, the number of satisfied users in Fig. 3 is not an integer. From Fig. 3, we can see that, when the number of users is small, the number of satisfied users is equal to the total number of users. This is due to the fact that the users' endurance time is long enough that the request of any user can be completed within the endurance time under an arbitrary trajectory. In Fig. 3, we can also see that the number of satisfied users initially increases with the total number of users. This is due to the fact that the UAV can serve more users within the endurance time. Meanwhile, Fig. 3 also shows that the number of satisfied users remains almost unchanged when the total number of users increases from 15 to 20. This is due to the fact that the UAV can only serve a limited number of users given the endurance time requested by each user. Fig. 3 also indicates that the proposed algorithm can achieve up to 19.4% and 14.1% gains in terms of the number of satisfied users compared to the random algorithm and the Q-learning algorithm, respectively. These gains stem from the fact that the proposed algorithm aims to find the optimal trajectory that maximizes the number of satisfied users, while the random algorithm only yields a random trajectory regardless of the number of satisfied users, and the Q-learning algorithm may result in suboptimal policies that lead to a worse result.
In Fig. 4, we show how the number of satisfied users changes as the number of aerial users changes. From Fig. 4, we can see that, as the number of aerial users increases, the number of ground and aerial users that are satisfied with the service provided by the UAV increases. This is due to the fact that the channel conditions of the aerial users are better than those of the ground users. In consequence, the UAV needs less time to serve an aerial user. Fig. 4 also shows that, in terms of the number of satisfied users, the proposed algorithm can achieve up to 20.1% and 6.7% gains compared to the random algorithm and the Q-learning algorithm, respectively.
Fig. 5 shows how the number of satisfied users changes as the UAV speed changes. From Fig. 5, we can see that, as the UAV speed increases, more users are satisfied with the service provided by the UAV. This is due to the fact that a faster speed leads to a shorter flying time. In consequence, the UAV spends more time serving users rather than flying. Fig. 5 also shows that the proposed algorithm can achieve up to 22.1% and 7.8% gains in terms of the number of satisfied users compared to the random algorithm and the Q-learning algorithm. These gains stem from the fact that the flying time, which is inversely proportional to the UAV speed, accounts for a large part of the total service time. The proposed algorithm can find the optimal trajectory, which saves the UAV's flying time, while the random trajectory does not account for the flying time and the Q-learning algorithm may produce a suboptimal trajectory that wastes service time on flying.
Fig. 6 shows the number of iterations needed until convergence for the proposed double Q-learning approach. In this figure, we can see that, as time elapses, the values of the Q-tables increase until they converge to their final values. Fig. 6 also shows that the proposed approach needs 1000 iterations to reach convergence. From Fig. 6, we can also see that tables Q^A and Q^B may have different values as time elapses. However, as time continues to elapse, tables Q^A and Q^B converge to the same final value. This is due to the fact that, at each iteration, the proposed double Q-learning algorithm selects an action based on the value of one Q-table and updates the action's Q value using the other Q-table. The result also confirms Corollary 1.
Fig. 7 shows how the convergence changes as the total number of users varies. From Fig. 7, we can see that, as the number of users increases from 5 to 20, the proposed approach needs 100, 500, 600, and 1000 iterations, respectively, to reach convergence. This is due to the fact that, as the total number of users increases, the number of states of the proposed double Q-learning framework increases exponentially. In consequence, the proposed algorithm needs more iterations to explore those states and, hence, it uses more iterations to reach convergence and find the optimal trajectory.
Fig. 8 shows an example of an optimal trajectory designed by the double Q-learning framework for a network with ground users and aerial users, with the blue arrow indicating the UAV trajectory. In this figure, the UAV starts from the origin of the coordinate system and then selects the users to serve. From Fig. 8, we can also see that some of the users are not served by the UAV, because their requests cannot be completely served before their endurance times expire. Fig. 8 also shows that more aerial users than ground users are served by the UAV. This is due to the fact that the aerial users have better channel conditions and faster transmission rates. In consequence, the UAV is more willing to serve an aerial user than a ground user.
V Conclusion
In this paper, we have developed a novel framework that enables flying, movable UAVs to provide service for three-dimensional users in a cellular network. We have formulated an optimization problem that seeks to maximize the number of satisfied users. To solve this problem, we have developed a novel algorithm based on the machine learning tool of double Q-learning. The proposed algorithm enables the UAV to find the optimal flying trajectory so as to maximize the number of satisfied users. Simulation results have shown that the proposed approach yields significant performance gains in terms of the number of satisfied users compared to the random algorithm and the Q-learning algorithm.
References
[1] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless communications with unmanned aerial vehicles: Opportunities and challenges," IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, May 2016.
[2] J. Xu, Y. Zeng, and R. Zhang, "UAV-enabled wireless power transfer: Trajectory design and energy optimization," IEEE Transactions on Wireless Communications, vol. 17, no. 8, pp. 5092–5106, Aug. 2018.
[3] M. Chen, W. Saad, and C. Yin, "Liquid state machine learning for resource and cache management in LTE-U unmanned aerial vehicle (UAV) networks," IEEE Transactions on Wireless Communications, vol. 1, no. 1, pp. 1–14, 2019.
[4] Q. Wu, Y. Zeng, and R. Zhang, "Joint trajectory and communication design for multi-UAV enabled wireless networks," IEEE Transactions on Wireless Communications, vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
[5] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs," IEEE Transactions on Wireless Communications, vol. 15, no. 6, pp. 3949–3963, June 2016.
[6] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, "Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience," IEEE Journal on Selected Areas in Communications, vol. 35, no. 5, pp. 1046–1061, May 2017.
[7] Z. Hu, Z. Zheng, L. Song, T. Wang, and X. Li, "UAV offloading: Spectrum trading contract design for UAV assisted cellular networks," IEEE Transactions on Wireless Communications, vol. 17, no. 9, pp. 6093–6107, Sep. 2018.
[8] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, "Computation rate maximization in UAV-enabled wireless powered mobile-edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
[9] Z. Yang, C. Pan, M. Shikh-Bahaei, W. Xu, M. Chen, M. Elkashlan, and A. Nallanathan, "Joint altitude, beamwidth, location and bandwidth optimization for UAV-enabled communications," IEEE Communications Letters, vol. 22, no. 8, pp. 1716–1719, Aug. 2018.
[10] X. Yuan, Z. Feng, W. Xu, W. Ni, J. A. Zhang, Z. Wei, and R. P. Liu, "Capacity analysis of UAV communications: Cases of random trajectories," IEEE Transactions on Vehicular Technology, vol. 67, no. 8, pp. 7564–7576, Aug. 2018.
[11] Q. Wu, J. Xu, and R. Zhang, "Capacity characterization of UAV-enabled two-user broadcast channel," IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 1955–1971, Sep. 2018.
[12] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and H. V. Poor, "Two-dimensional anti-jamming mobile communication based on reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 67, no. 10, pp. 9499–9512, Oct. 2018.
[13] S. Toumi, M. Hamdi, and M. Zaied, "An adaptive Q-learning approach to power control for D2D communications," in Proc. of International Conference on Advanced Systems and Electric Technologies, Hammamet, Tunisia, Mar. 2018.
[14] Y. Hu, R. MacKenzie, and M. Hao, "Expected Q-learning for self-organizing resource allocation in LTE-U with downlink-uplink decoupling," in Proc. of European Wireless Conference, Dresden, Germany, May 2017.
[15] M. Chen, W. Saad, and C. Yin, "Virtual reality over wireless networks: Quality-of-service model and learning-based resource management," IEEE Transactions on Communications, vol. 66, no. 11, pp. 5621–5635, Nov. 2018.
[16] ——, "Echo state networks for self-organizing resource allocation in LTE-U with uplink-downlink decoupling," IEEE Transactions on Wireless Communications, vol. 16, no. 1, pp. 3–16, Jan. 2017.
[17] Y. Sun, M. Peng, and H. V. Poor, "A distributed approach to improving spectral efficiency in uplink device-to-device-enabled cloud radio access networks," IEEE Transactions on Communications, vol. 66, no. 12, pp. 6511–6526, Dec. 2018.
[18] U. Challita, A. Ferdowsi, M. Chen, and W. Saad, "Machine learning for wireless connectivity and security of cellular-connected UAVs," IEEE Wireless Communications, vol. 26, no. 1, pp. 28–35, Feb. 2019.
[19] M. Bennis and D. Niyato, "A Q-learning based approach to interference avoidance in self-organized femtocell networks," in Proc. of IEEE Globecom Workshops, Miami, FL, USA, Dec. 2010.
[20] A. Al-Hourani, S. Kandeepan, and A. Jamalipour, "Modeling air-to-ground path loss for low altitude platforms in urban environments," in Proc. of IEEE Global Communications Conference, Austin, TX, USA, Dec. 2014.
[21] T. S. Rappaport, F. Gutierrez, E. Ben-Dor, J. N. Murdock, Y. Qiao, and J. I. Tamir, "Broadband millimeter-wave propagation measurements and models using adaptive-beam antennas for outdoor urban cellular communications," IEEE Transactions on Antennas and Propagation, vol. 61, no. 4, pp. 1850–1859, Apr. 2013.
[22] H. V. Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.
[23] M. A. Alsheikh, D. T. Hoang, D. Niyato, H. Tan, and S. Lin, "Markov decision processes with applications in wireless sensor networks: A survey," IEEE Communications Surveys & Tutorials, vol. 17, no. 3, pp. 1239–1267, third quarter 2015.