I Introduction
Commercial drone applications have attracted profound interest in recent years in a wide set of use cases, including area monitoring, surveillance, and delivery [6]. In many applications, drones, also known as unmanned aerial vehicles (UAVs), require connectivity to carry out their tasks. Due to their ubiquitous coverage, cellular networks serve as the major infrastructure for providing wide-area yet reliable and secure drone connectivity beyond the visual line-of-sight range [13, 11]. A comprehensive set of empirical analyses on providing connectivity for drones through LTE networks has been conducted recently in the context of 3GPP, and some of the test-field results can be found in [1]. Since the probability of experiencing line-of-sight (LoS) propagation to neighbor BSs increases with altitude [2], the wireless channels between flying users and neighboring base stations (BSs) experience almost free-space fading [1]. Hence, in the uplink direction, drone communications are expected to incur significant interference to the uplink communications of terrestrial UEs, while in the downlink direction, drones are vulnerable to strong interference from neighbor BSs, as shown in Fig. 1 [5]. Furthermore, because drones move in a three-dimensional (3D) space without predetermined roads, radio resource provisioning in the sky becomes a difficult task in this dynamic environment, in comparison to the legacy urban/rural service areas with predetermined buildings and roads. Moreover, terrestrial users usually receive strong signals from only a few neighbor BSs, which makes the user-BS association problem less complicated than for drone communications, in which drones observe LoS signals from several BSs. Considering the speed of drones and the large set of potential serving BSs, many handover events will be triggered for drones [4]. We thus observe that the introduction of aerial users to cellular networks requires a revision of communication protocols that have been developed with legacy terrestrial users in mind. Here, we focus on the handover and radio resource management (HRRM) problem in serving drone communications over cellular networks. In this problem, the key performance indicators (KPIs) of interest are: (i) the interference of drone communications on terrestrial communications, and (ii) the experienced delay in drone communications. Among the candidate enablers for solving such a complex and dynamic problem, we leverage machine learning (ML) tools, transform the problem into a machine learning problem, and provide a solution for it.
The proposed ML schemes enable cellular networks to capture the temporal and spatial correlations between decisions taken in the network in serving drones, in order to make foresightful and cognitive decisions in later decision epochs. To the best of the authors' knowledge, our work is the first in the literature that investigates the HRRM problem in a network consisting of drone and terrestrial users and leverages an ML-powered solution for solving it. The key contributions of this work include:

Analyze the received interference from drone communications at the ground BSs using cell-planning tools and real geographical and land-use data.

Formulate the HRRM problem in serving drone and terrestrial users as a machine learning problem by incorporating delay in drone communications and interference to terrestrial users in the design of the reward function.

Present a reinforcement learning solution to the problem.

Present the impacts of different system parameters on the HRRM decisions, and analyze the interplay between the level of interference to terrestrial users, handover overhead, resources allocated to drones, and experienced delay.

Present the boundaries of cells in the sky, i.e., the handover regions, as a function of the altitude and speed of drones and the level of tolerable interference to terrestrial users.
The remainder of this paper is organized as follows. Section II presents the challenges, existing solutions, and research gaps. The system model and problem formulation are presented in Section III. Section IV presents the proposed solution. Simulation results are presented in Section V, followed by the conclusion in Section VI.
II Challenges in Serving Aerial Users and State-of-the-Art Solutions
Interference on Terrestrial Communications. Connectivity for aerial users over cellular networks should be enabled in a way that minimizes the side effects on the quality-of-service (QoS) of terrestrial users. Unfortunately, the favorable propagation conditions that drones enjoy, due to their LoS links to the associated BSs, degrade the QoS of terrestrial users. Fig. 2 presents our simulation results using the Mentum Cell Planner tool (developed by Ericsson, available at https://www.infovista.com), leveraging real geographical and network data for Stockholm, Sweden. This figure shows that the received interference from a drone increases significantly, and extends to a much wider area, as the altitude of the flying drone increases. In [2], the impact of air-to-ground communications on the coverage performance of cellular networks has been investigated. In [3], deep reinforcement learning path planning for drones is investigated, in which each drone aims to strike a balance between its energy efficiency and its interference to the ground networks along its path. The authors in [9] maximize the weighted sum-rate of the drone-to-BS link and existing ground users by jointly optimizing the drone's uplink cell associations and power allocations over multiple shared radio resource blocks.
Interference on Drone Communications. In [2], it has been shown that the LoS propagation conditions a drone user experiences at high altitudes actually result in an overall negative effect, as the high vulnerability to interference from BSs in the downlink direction dominates over the increased received signal power from the associated BS. While Fig. 2 represents the received interference from a drone to the BSs in the uplink direction, the performance in the downlink direction will be much more severe for the drone, because it will receive many interfering signals from neighbor BSs, which have a much higher transmit power than the drones.
Handover for Drone Communications. Current handover mechanisms for terrestrial UEs mainly trigger a handover based on a policy applied to a set of metrics, such as reference signal received quality (RSRQ). Given the LoS links between a drone and neighbor BSs, the set of neighbor BSs that a drone can connect to has a much higher cardinality than for a terrestrial user. The movement of the drone along its trajectory triggers frequent measurement reports, which result in frequent handovers when the handover decision is made by the legacy scheme of comparing received signal strengths (RSSs). In [4], the handover performance of LTE-Advanced networks serving drones is investigated in a measurement campaign, where a flying drone experienced many times more handovers than a ground UE moving at the same speed. The legacy handover management mechanisms thus lose their merits in serving flying UEs, triggering unnecessary handovers. In the reverse setup, where drones serve as BSs, [7] solves the handover management problem by leveraging reinforcement learning. However, this is a very different problem, since the drones are static and used as coverage BSs.
Research Gap. Terrestrial cellular networks will face new challenges in serving drone users: 1) unnecessary handovers, due to the LoS probability increasing with altitude; and 2) interference created by flying UEs blasting power over a large area of the terrestrial network. This motivates us to study the uplink of a cellular network serving both flying and terrestrial users, aiming to make HRRM decisions that maximize the QoS of drone users with minimum impact, in terms of interference, on terrestrial users.
III System Model and Problem Formulation
We focus on the uplink (reverse link) of a cellular mobile network supporting drone and terrestrial UEs over a service area. A set of BSs is available in the service area. A co-channel deployment is considered, in which the BSs operate over a shared system bandwidth consisting of a number of radio resource blocks (RRBs). We focus on the handover and resource management problem: at each decision epoch $t$, based on the status of the terrestrial users and network resources, we aim at deciding which BS, with which set of resources, should serve the drone, and what its transmit power should be.
III-A Air-to-Ground Channel and Modeling of the KPIs
The Air-to-Ground (A2G) channel. The A2G channel depends strongly on the presence of LoS propagation between the drone and the BS. Without loss of generality, in the following we provide the A2G channel model for an urban environment; the extension to suburban and rural areas is straightforward by substituting the respective parameters with the channel conditions in [1]. The probability of experiencing LoS propagation between a drone at altitude $h$ (in meters) and the $j$th BS is given as [1]

$P_{\rm LoS} = \frac{d_1}{d_{2D}} + \exp\Big(\frac{-d_{2D}}{p_1}\Big)\Big(1 - \frac{d_1}{d_{2D}}\Big)$

for $d_{2D} > d_1$, and 1 otherwise. In this expression, $d_1 = \max\big(460\log_{10}(h) - 700,\, 18\big)$, $p_1 = 4300\log_{10}(h) - 3800$, and $d_{2D}$ is the horizontal distance between the drone and the BS (in meters), for $22.5\,{\rm m} < h \le 100\,{\rm m}$. For $h > 100\,{\rm m}$, $P_{\rm LoS} = 1$ is assumed. For LoS conditions, the pathloss ${\rm PL}_{\rm LoS}$ (in dB), when $22.5\,{\rm m} < h \le 300\,{\rm m}$, is given as [1]

${\rm PL}_{\rm LoS} = 28.0 + 22\log_{10}(d_{3D}) + 20\log_{10}(f_c),$  (1)

where $d_{3D}$ is the 3D distance between the drone and the $j$th BS in meters (determined by the horizontal distance, the drone altitude, and the height of the BS), and $f_c$ is the carrier frequency in GHz. The pathloss for the non-LoS condition (NL), ${\rm PL}_{\rm NL}$, is given as [1]

${\rm PL}_{\rm NL} = -17.5 + \big(46 - 7\log_{10}(h)\big)\log_{10}(d_{3D}) + 20\log_{10}(40\pi f_c/3).$  (2)
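As a concrete illustration, the LoS-probability and pathloss expressions above can be sketched in code; this is a minimal sketch assuming the UMa-AV-style parameterization of [1], and the function names are illustrative.

```python
import math

def p_los(h, d2d):
    """LoS probability for a drone at altitude h (m) at horizontal
    distance d2d (m) from the BS, in the UMa-AV style assumed here."""
    if h > 100.0:
        return 1.0  # above 100 m, LoS is assumed with probability 1
    d1 = max(460.0 * math.log10(h) - 700.0, 18.0)
    p1 = 4300.0 * math.log10(h) - 3800.0
    if d2d <= d1:
        return 1.0
    return d1 / d2d + math.exp(-d2d / p1) * (1.0 - d1 / d2d)

def pathloss_db(d3d, h, fc_ghz, los):
    """Pathloss in dB for LoS / non-LoS (NL) conditions,
    valid for 22.5 m < h <= 300 m."""
    if los:
        return 28.0 + 22.0 * math.log10(d3d) + 20.0 * math.log10(fc_ghz)
    return (-17.5 + (46.0 - 7.0 * math.log10(h)) * math.log10(d3d)
            + 20.0 * math.log10(40.0 * math.pi * fc_ghz / 3.0))
```

As expected, the NL pathloss exceeds the LoS pathloss at the same distance, and the LoS probability degrades with horizontal distance at low altitudes.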
Shadow fading also depends on the LoS/NL condition. The standard deviation of the shadow fading for the LoS condition is $\sigma_{\rm LoS} = 4.64\,e^{-0.0066 h}$ dB, and for the NL condition, $\sigma_{\rm NL} = 6$ dB [1]. Finally, the channel is assumed to exhibit Rayleigh block-fading characteristics.

Buffer Queue Size and Delay. First, we focus on the communication delay for the drone. Let $a(t)$ denote the number of data units (in bits) that arrive at the buffer of the drone at the end of subframe $t$. The arrival of data units follows a Poisson process with rate $\lambda$. One must note that the handover decision affects the arrival rate, as some control data needs to be transferred as well. Here, we model the arrival rate of data units to the buffer queue of the drone as

$\lambda(t) = \lambda + \lambda_h \sum_{\tau = t - T_h}^{t} \mathbb{1}_{\rm HO}(\tau),$

in which $\lambda_h$ models the arrival rate of control signals due to handover, $T_h$ is the length of time over which control messages are issued after a handover decision, and $\mathbb{1}_{\rm HO}(\tau)$ is a handover indicator function, equal to one if a handover has happened for the drone at $\tau$, and zero otherwise. Then, the overall arrival of control and application data to the drone buffer follows a switched Poisson process (SPP), with light and heavy traffic-arrival windows. Hence, the drone's buffer queue size, as one of our KPIs, evolves as

$q(t+1) = q(t) + a(t) - u(t),$

where $q(t)$ indicates the number of data units in the buffer of the drone at the beginning of subframe $t$, and $u(t)$ is the number of data units successfully transmitted from the buffer during the transmission interval. The expected queuing delay for a newly added packet to the buffer at time $t$ can be expressed as $d(t) = q(t)/R(t)$, where $R(t)$ is the expected data rate for the drone, and is affected by the HRRM decisions. To capture the delay threshold as a KPI, we consider a maximum buffer size $q_{\max}$, beyond which a packet is dropped.
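The buffer dynamics described above can be sketched as follows; this is a minimal illustration with illustrative function names, where the handover-induced control traffic temporarily raises the arrival rate (the switched-Poisson behavior).

```python
def arrival_rate(lam, lam_h, t, last_ho_time, t_h):
    """Switched-Poisson arrival rate: base rate lam, plus control-signal
    rate lam_h for t_h subframes after the last handover (if any)."""
    if last_ho_time is not None and 0 <= t - last_ho_time < t_h:
        return lam + lam_h
    return lam

def step_queue(q, arrived_bits, served_bits, q_max):
    """One-subframe buffer update: drain the served bits, add the new
    arrivals, and drop whatever exceeds the maximum buffer size q_max."""
    q_next = max(q - served_bits, 0) + arrived_bits
    dropped = max(q_next - q_max, 0)
    return q_next - dropped, dropped

def expected_delay(q, rate_bps):
    """Expected queuing delay of a newly arrived packet: backlog / rate."""
    return q / rate_bps if rate_bps > 0 else float("inf")
```

A scheduler that starves the drone of RRBs lowers `rate_bps`, which inflates the delay and eventually triggers drops, which is exactly the trade-off the reward function in Section IV penalizes.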
Allocated Spectrum and Data Rate. Assuming that the coherence time of the channel is greater than a transmission time interval (TTI), the achieved uplink data rate for the drone over the allocated subset of subcarriers, denoted by $\mathcal{S}(t)$, is derived as $R(t) = w f\big(\mathbf{p}(t), \boldsymbol{\gamma}(t)\big)$, in which $f$ is a function to be described in the following, $w$ is the subcarrier bandwidth, $\mathbf{p}(t)$ is the power allocation vector, and $\boldsymbol{\gamma}(t)$ is the vector of ratios of channel gains to the noise-plus-interference level over the allocated set of subcarriers. Without loss of generality, we exemplify our modeling for single-carrier frequency-division multiple access (SC-FDMA), and approximate the function $f$ as

$f\big(\mathbf{p}(t), \boldsymbol{\gamma}(t)\big) \approx |\mathcal{S}(t)| \log_2\Big(1 + \frac{P(t)}{|\mathcal{S}(t)|} \cdot \frac{1}{|\mathcal{S}(t)|} \sum_{c \in \mathcal{S}(t)} \gamma_c(t)\Big),$

where $\gamma_c(t) = g_c(t)/(\sigma^2 + I_c)$, $\sigma^2$ is the noise power over each subcarrier, $I_c$ is the power density of interference over the $c$th subcarrier, $g_c(t)$ is the channel gain, and $P(t)$ is the transmit power [8]. Furthermore, $\mathcal{S}(t)$ is characterized by an RRB-allocation indicator function, denoted by $\beta_{j,c}(t)$, which is 1 if the $c$th RRB of BS $j$ is allocated to the drone at time $t$, and 0 otherwise.

Interference to Neighbor BSs. The interference incurred by the uplink transmission of the drone at the $j$th neighbor BS, $I_j(t)$, is calculated in dBm as

$I_j(t) = P(t) + G_t + G_r - {\rm PL}_{\xi}(d_j),$

in which $P(t)$ is the drone's transmit power, $G_t$ is the transmit antenna gain (usually isotropic with zero gain), $G_r$ is the receive antenna gain (determined by the altitude of the drone and the radiation pattern of the receive antenna), $\xi$ represents the LoS/NL condition, and $d_j$ is the 3D distance between the drone and the $j$th neighbor BS. Then, the uplink interference from the drone to the BSs in the service area can be modeled as the vector $\mathbf{I}(t) = [I_1(t), I_2(t), \ldots]$.
III-B Formulation of the HRRM Optimization Problem
Given the status of the drone at time $t$, as well as the available RRBs at each BS, the problem is to find the serving BS, the allocated set of RRBs $\beta_{j,c}(t)$, and the transmit power $P(t)$, in order to satisfy the delay threshold of the drone with a minimum amount of allocated radio resources, a minimum number of unnecessary handovers, and minimum interference to the ground BSs. Then, at decision epoch $t$, we need to solve the following optimization problem for serving the drone:

$\min_{P(t),\, \beta_{j,c}(t)} \;\; \omega_1 d(t) + \omega_2 \sum_{j,c} \beta_{j,c}(t) + \omega_3 \sum_{j} I_j(t) + \omega_4 \mathbb{1}_{\rm HO}(t)$

subject to $P(t) \le P_{\max}$, $d(t) \le d_{\rm th}$, and $\sum_{j} \mathbb{1}\big(\sum_c \beta_{j,c}(t)\big) \le 1$, in which $P_{\max}$ stands for the maximum allowable transmit power, $d_{\rm th}$ stands for the delay threshold, and the last constraint assures that the drone receives service from one cell only. Furthermore, $\mathbb{1}(\cdot)$ is an indicator function with binary output, and $\omega_i$, for $i \in \{1, \ldots, 4\}$, represents the scaling coefficient of the experienced delay, number of allocated RRBs, incurred interference, and handover indicator, respectively.
Solving the HRRM Problem. One observes that the HRRM problem is not only a highly complex non-convex optimization problem, but also one in which the impact of decisions at time $t$, e.g., a handover, propagates in time and affects different KPIs in later epochs. Then, we need to transform the problem into one in which long-term benefits are taken into account along with the instantaneous QoS measures. Furthermore, due to the dynamic properties of the cellular network environment, e.g., the number of active BSs and users, a solution that adapts to changes in the environment is favorable. These requirements motivate us to transform this optimization problem into a reinforcement learning problem, in which the learning rate and discount factor tune the balance between instantaneous and long-term QoS measures. In such a problem, the network's constraints are transformed into the action and state spaces, and the objective function to be maximized is transformed into the reward function. In the following, we present the transformed problem and its solution.
IV The Transformed HRRM Problem and Solution
In this section, we present the transformed HRRM problem and propose an algorithm to learn a policy from previous experience, based on which HRRM decisions can be made. Note that we assume the handover management and resource allocation decisions in the service area are taken by a central entity, hereafter called the controller.
IV-A The Transformed HRRM Problem
Reinforcement learning is a branch of ML dealing with an agent that takes actions in an environment (described by states) so as to maximize its cumulative discounted reward. Such a problem is defined by specifying the states, actions, and reward function, as follows.
The State Space. The state space describes the environment in which the agent selects its actions. We present the state of the environment for the flying drone at time $t$ as a vector consisting of its altitude, velocity, current serving BS, buffer queue size, and the last pathloss measurements to the neighbor BSs. Hence, the state space includes all potential realizations of this vector. For example, one state represents a drone at an altitude of 100 m with a speed of 20 m/s, experiencing 80 and 90 dB pathlosses from BS1 and BS2, served by BS1, with an empty buffer queue.
The Action Space. The action space presents the set of decision parameters available to the agent at each decision epoch. We present the action at time $t$ as $a(t) = [P(t), \boldsymbol{\beta}(t)]$, where $P(t)$ and $\boldsymbol{\beta}(t)$ stand for the transmit power and the radio resource allocation, respectively, as described in Section III. Then, the action space consists of the different combinations of transmit power, associated BS, and allocated set of RRBs. For example, one action directs the drone to transmit its data over two chunks of radio resources of BS1 with 23 dBm transmit power.
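The discrete action space described above can be enumerated as Cartesian combinations of serving BS, RRB allocation, and transmit-power level; the granularity below is an illustrative assumption, not taken from the paper.

```python
from itertools import product

def build_action_space(n_bs, rrb_chunk_options, power_levels_dbm):
    """Enumerate HRRM actions as (serving BS index, number of RRB chunks,
    transmit power in dBm) tuples."""
    return list(product(range(n_bs), rrb_chunk_options, power_levels_dbm))
```

For instance, 3 BSs, RRB-chunk options {1, 2, 4}, and power levels {10, 23} dBm yield 18 discrete actions, which is small enough for a Q-table; finer granularity quickly motivates the function-approximation variant of Section IV-B.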
Reward Function. The reward function should mimic the objective function of the HRRM problem to be maximized. Following the notation of the HRRM problem in Section III, the immediate reward for serving the drone at time $t$, i.e., $r(t)$, is defined as a weighted sum of the rewards from resource efficiency in communications, low-delay performance, low interference, and a low number of unnecessary handovers. Then, we formulate $r(t)$ as:

$r(t) = \omega_r {\rm Rew}_{\rm res}(t) + \omega_d {\rm Rew}_{\rm del}(t) + \omega_i {\rm Rew}_{\rm int}(t) - \omega_h \mathbb{1}_{\rm HO}(t).$  (3)

In this expression, Rew is the abbreviation for reward, and $\omega_r$, $\omega_d$, $\omega_i$, and $\omega_h$ are the weights, which are determined by the relative importance of the KPIs in the target application. The first term in $r(t)$ is proportional to the inverse of the amount of consumed radio resources, and increases with a decrease in the number of radio resources allocated to the drone. The second term is proportional to the inverse of the buffer queue size, and the third term is proportional to the inverse of the interference to the BSs. Finally, the last term is an indicator of handover, and hence represents a regret, with a minus sign. Furthermore, all these measures are scaled into the [0, 1] interval, in order to prevent one metric from dominating the others.
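A sketch of the reward computation in (3): each KPI is first scaled into [0, 1] (here by simple max-normalization, one possible choice) and then combined with the weights; the function name and the normalization are illustrative assumptions.

```python
def hrrm_reward(n_rrb, queue_bits, interference, handover,
                n_rrb_max, q_max, i_max, w_r, w_d, w_i, w_h):
    """Weighted reward: rewards for using few RRBs, keeping a short queue
    (low delay), and causing low interference, minus a regret term for a
    handover. All KPI terms are normalized into [0, 1] before weighting."""
    rew_res = 1.0 - min(n_rrb / n_rrb_max, 1.0)
    rew_delay = 1.0 - min(queue_bits / q_max, 1.0)
    rew_intf = 1.0 - min(interference / i_max, 1.0)
    return (w_r * rew_res + w_d * rew_delay + w_i * rew_intf
            - w_h * (1.0 if handover else 0.0))
```

With equal weights of 0.25, an idle drone (no RRBs, empty queue, no interference, no handover) scores 0.75, while the worst case with a handover scores -0.25; shifting weight between the terms reproduces the coefficient sweeps studied in Section V.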
IV-B Learning from Past Actions
To leverage past experiences in action selection, we use Q-learning, a model-free reinforcement learning algorithm, since we need to learn a policy.
Q-learning and deep Q-learning. In Q-learning, action selection is done using an action-value function by following a policy. Each policy provides a mapping from the state space to the action space. The state-action value function under policy $\pi$, denoted by $Q^{\pi}(s, a)$, is defined as the long-term expected accumulated discounted reward of state $s$ when action $a$ is taken and future actions follow policy $\pi$. In basic Q-learning, each time an action is taken, its respective state-action value is updated by

$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') \big),$  (4)

where $\gamma$ is the discount factor and $\alpha \in (0, 1]$ is the learning rate, which determines to what extent the learned Q-value is updated. The convergence of $Q(s, a)$ to the optimal value by adapting the learning rate has been proven in [12]. In Q-learning, the Q-function is represented by a Q-table, with states as rows, actions as columns, and Q-values as entries. In a practical network, the number of states and actions can be so high that storing the Q-values in a table is infeasible. Then, either the states should be quantized to reduce the dimensions of the Q-table, or the Q-function should be approximated by a neural network, i.e., deep Q-learning [10]. We adopt deep Q-learning to solve the problem of handover and radio resource management (HRRM) for flying UEs.

IV-C The Solution: ML-powered HRRM
Given the above descriptions, here we present an algorithm for leveraging deep Q-learning in HRRM decision making for drone communications. The proposed algorithm works as follows. First, we zero-initialize the weights of a neural network used for approximating the Q-function. Then, this neural network is trained using a network simulator, as outlined in Algorithm 1. At each decision epoch, an action, including the serving BS, set of RRBs, and transmit power, is taken either randomly, with probability $\epsilon$, or greedily with respect to the output of the neural network, with probability $1 - \epsilon$. Then, the received reward is calculated from (3). One must note that the neural network is not updated from the observation at each decision epoch. Instead, we save the (state, action, reward, next state) tuples in memory, and leverage memory replay and gradient descent for updating the neural network, as outlined in [10]. Finally, whenever a policy update is needed, Algorithm 1 can be run to update the previously learned weights using recent observations. Algorithm 2 presents the details of action selection in the proposed solution.
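The action-selection and replay machinery of such an algorithm can be sketched as below. The Q-network is abstracted as a list of per-action values, a tabular update stands in for the gradient-descent fit to the same target, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import random
from collections import deque, defaultdict

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) transitions,
    sampled uniformly for the update steps (memory replay)."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise act greedily on the approximated Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_learning_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """Tabular form of the update in (4): fit Q(s, a) toward the target
    r + gamma * max_a' Q(s', a'); a network would regress to the same target."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]
```

Decoupling the update from the most recent transition via the replay buffer breaks the strong temporal correlation of consecutive drone positions along a trajectory, which is the standard motivation for memory replay in [10].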
TABLE I: Simulation parameters

Parameters  Values
Service area  
BSs' positions  
Available RRBs for drone (per TTI)  random, up to 4 × 180 kHz
BS antenna height, carrier frequency  m, 2 GHz
Packet arrival rate and size at drone  Hz; Kbits
Handover control packet size  4 × 1 Kbits
Circuit power, transmit power  , Watt
Learning rate, discount factor  , 
Minimum interval between handovers  TTIs, with TTI = 0.001 s
Drone speed and height  default: m/s, m
V Performance Evaluation
Simulation Setup. We consider a service area in which several macro BSs are observable to drones and can simultaneously serve terrestrial and drone UEs. The simulator has been developed in Matlab and implements Algorithm 2 in the Q-table form. In the benchmark scheme, the RSS is used for handover management, i.e., if the received power from the target BS exceeds that of the serving BS by a given margin (in dB), a handover is triggered. Furthermore, in the benchmark scheme, all the available RRBs in the serving BS are allocated to the connected drone, while in the learning scheme the allocation is determined by the policy derived from the Q-function. For the following analyses, we assume a drone crossing the service area at a constant altitude and speed, and our aim is (i) to associate it to the best serving BS(s) at each radio frame, and (ii) to allocate a set of RRBs at each TTI. The learning rate is defined as the inverse of the number of visits to a state, $\alpha = 1/N(s)$, to increase the convergence rate. The other system parameters used in the simulations are in accordance with [1] and can be found in Table I.
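The visit-count-based learning rate used in this setup can be sketched as a tiny helper (the name is illustrative):

```python
def visit_learning_rate(visit_counts, state):
    """Learning rate alpha = 1 / N(s): decays with the number of visits to
    a state, so early visits move the Q-value a lot and later visits less,
    speeding up convergence."""
    visit_counts[state] = visit_counts.get(state, 0) + 1
    return 1.0 / visit_counts[state]
```

The first visit to a state yields alpha = 1, the second 0.5, and so on, satisfying the usual decaying step-size conditions for Q-learning convergence [12].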
Simulation Results. Figs. 3(a)-3(c) evaluate the dependency of the KPIs of interest on their respective coefficients in the reward function, as described in (3). As our focus is on the relative behavior, we normalize each KPI to its maximum value. The effect of changing the delay coefficient in the reward function is studied in Fig. 3(a), where two of the remaining coefficients are set to 0.5 and the other to 0.01, i.e., we care more about delay and interference. One observes that increasing the delay coefficient, which corresponds to the case in which delay is less tolerable, triggers more handovers. This is due to the fact that the policy enforces handovers to the BSs over which the drone experiences less pathloss. Furthermore, it is clear that by connecting to the BS with the best channel, the level of interference is decreased. Moreover, one observes that the proposed scheme significantly outperforms the benchmark in all KPIs. For larger values of the delay coefficient, we observe that the number of potential handover-triggering epochs increases by 8%; however, there is a negligible change in the performance of the other KPIs, including the delay. Similar results can be observed in Fig. 3(b), where we investigate the impact of the interference coefficient in the reward function on the policy design. One observes in this figure that with an increase in the relative importance of interference in the reward function, the number of handovers increases in order to avoid interference to the neighboring BSs. On the other hand, such an increase also decreases the number of resources allocated to interfering drones.
Fig. 3(c) studies the effect of the handover coefficient in (3). As shown in this figure, an increase in this coefficient, i.e., an increase in the regret for handovers, decreases the number of potential decision epochs in which an unnecessary handover may occur. Interestingly, one observes that this higher regret for handovers also improves the performance of all other KPIs. For example, the experienced delay is decreased by 5-10% because of the decrease in the control signaling for carrying out handovers. On the other hand, beyond a certain point, we observe that almost all other KPIs, including the experienced delay, buffer queue size, and level of incurred interference, are traded for a further decrease in the number of handovers.
The effects of altitude and velocity on handover are studied in Fig. 4. This figure presents heatmaps of the positions at which handover decisions have been made while serving drones. The locations of the BSs are marked by red dots; drones cross the service area from left to right and are initially connected to the BS at the bottom-left. Let us first focus on the impact of speed on handover decisions, by comparing the left and middle heatmaps, which correspond to a lower and a higher speed, respectively. It is clear that with an increase in speed, the handover decisions are reduced significantly, and hence we observe an almost cell-less connectivity in the sky. This is mainly due to the fact that although not triggering a handover incurs extra interference, the duration of this degradation is so short (due to the high speed) that the controller prefers not to trigger a handover unless it is really needed to meet the delay requirement. Furthermore, the impact of altitude on the handovers can be seen by comparing the left and right heatmaps, where the drone in the latter flies at a higher altitude. The increase in the frequency of handovers in the right heatmap can be explained by recalling the LoS model of Section III-A, which shows how the probability of experiencing LoS propagation increases with altitude. Then, the cardinality of the set of BSs to which the drone can hand over is higher in the right scenario than in the other cases, which results in more handovers for the drone.
Vi Conclusion
This paper studied a learning-powered approach for handover and resource management in cellular networks serving drone users in the uplink direction. The major challenges consist of the interference from drones on the uplink communications of coexisting terrestrial users, and the frequent handovers experienced by drones. The handover management and resource allocation optimization problem has been transformed into a machine learning problem, for which a reinforcement learning algorithm is proposed as a solution. The design of this algorithm incorporates different sources of rewards and regrets with respect to the network resources, the KPIs of drone users, and the interference to ground BSs. A comprehensive set of simulations has been conducted, and the results confirm the significant impact of resource allocation and handover management for drone communications on terrestrial users. By setting appropriate coefficients for delay, interference, and handover in the reward function, one can significantly outperform the benchmark scheme in terms of the number of handovers, incurred interference, and experienced delay. Furthermore, an increase in speed causes fewer handover decisions, whereas altitude has the reverse effect on the number of handovers.
References
 [1] (2018) Enhanced LTE support for aerial vehicles. 3GPP Technical Report TR 36.777.
 [2] (2017) Coexistence of terrestrial and aerial users in cellular networks. In IEEE Globecom.
 [3] (2019) Interference management for cellular-connected UAVs: a deep reinforcement learning approach. IEEE Transactions on Wireless Communications 18 (4), pp. 2125–2140.
 [4] (2019) Handover challenges for cellular-connected drones. In 5th Workshop on Micro Aerial Vehicle Networks, Systems, and Applications.
 [5] (2018) How to ensure reliable connectivity for aerial vehicles over cellular networks. IEEE Access 6, pp. 12304–12317.
 [6] (2016) Survey on unmanned aerial vehicle networks for civil applications: a communications viewpoint. IEEE Commun. Surveys Tut. 18 (4), pp. 2624–2661.
 [7] (2018) A reinforcement learning based user association algorithm for UAV networks. In IEEE ITNAC, pp. 1–6.
 [8] (2015) Low-complexity power-efficient schedulers for LTE uplink with delay-sensitive traffic. IEEE Trans. Veh. Technol. 64 (10), pp. 4551–4564.
 [9] (2019) Cellular-connected UAV: uplink association, power control and interference coordination. IEEE Trans. Wireless Commun.
 [10] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
 [11] (2018) Multi-tier drone architecture for 5G/B5G cellular networks: challenges, trends, and prospects. IEEE Commun. Mag. 56 (3), pp. 96–103.
 [12] (1992) Q-learning. Machine Learning 8 (3), pp. 279–292.
 [13] (2019) Cellular-connected UAV: potential, challenges, and promising technologies. IEEE Wireless Commun. 26 (1), pp. 120–127.