I. Introduction
Owing to their mobility, agility, and flexibility, drones are widely used in various applications. In particular, drone user equipments (UEs) play a key role in a number of scenarios such as package delivery, remote sensing, and surveillance [1, 2]. For sustainable operation, flying drones need to be supported via cellular infrastructure (a.k.a. cellular-connected drones) to ensure seamless connectivity and low-latency communications. Cellular technologies such as Long-Term Evolution (LTE) and the fifth-generation New Radio (5G NR) offer wide-area, high-speed, and secure wireless connectivity [3, 4], which can provide robust control and safety for drone operations. In this regard, there are several challenges in supporting drone-UEs in a cellular network. First, drones can move in three dimensions (3D), and their arbitrary trajectories and high speeds result in rapid changes in received signal strength. Second, due to line-of-sight (LOS) propagation conditions, drones may suffer from strong uplink and downlink interference from neighbor cells [1]. Third, terrestrial base stations (BSs) are mainly designed to serve ground users, and hence their antennas are downtilted. The main lobe of a BS antenna thus covers a large part of the surface area of the cell to improve performance for terrestrial UEs. Accordingly, at ground level the strongest site is typically the closest one. A drone-UE, on the other hand, may frequently be served by the sidelobes of BS antennas [5], which have low antenna gain compared to the main lobe. The coverage areas of the sidelobes may be small, and the signal at their edges may drop sharply due to deep antenna nulls. At a given location, the strongest signal might therefore come from a far-away BS if the sidelobes of the BSs closer to the drone-UE are significantly weaker.
Additionally, the sidelobes of BSs may not fully cover the sky, and there can be coverage holes (space without coverage service) in the sky that cause connectivity failures. Meanwhile, the fragmented coverage areas provided by different BSs complicate mobility support in the sky and can result in frequent handovers (HOs). This, in turn, leads to significant signaling overhead and radio link failures (RLFs) due to undesired ping-pong HOs. Therefore, there is a need for efficient HO mechanisms for drone mobility management to provide reliable communications between drones and BSs.
I-A. Related Work
In Third Generation Partnership Project (3GPP) Release 15, the potential of LTE for providing drone connectivity was studied [6]. The results of this study showed that mobility support for drones is one of the challenging aspects of using existing LTE networks to serve drone-UEs. The work in [7] identified key challenges associated with supporting drone connectivity in LTE networks. In [8], the performance of a cellular-connected drone network was evaluated in terms of RLF and the number of HOs. In [9], a handover optimization scheme was proposed for ground UEs in a 5G cellular network using reinforcement learning (RL). In [10], the authors proposed a handover mechanism based on deep learning to improve the reliability and latency of terrestrial millimeter-wave mobile systems. While previous work has studied various other challenges related to drone communications, handover optimization for drone-UEs in the sky remains an open problem.
I-B. Contributions
In this paper, we propose a novel HO optimization mechanism for a cellular-connected drone system to ensure robust wireless connectivity for drone-UEs. Using tools from RL [11], HO decisions are dynamically optimized via Q-learning to provide efficient mobility support in the sky. The proposed framework leverages reference signal received power (RSRP) data and the drone's trajectory information to provide effective HO rules for seamless drone connectivity while accounting for HO signaling overhead. Furthermore, our results reveal an inherent tradeoff between the number of HOs and the serving cell RSRP in the considered cellular-connected drone system.
II. System Model
We consider the scenario illustrated in Fig. 1, where drone-UEs are served by a terrestrial cellular network consisting of ground BSs. We assume that a drone-UE moves along a two-dimensional (2D) trajectory at a fixed altitude which is known to the network. To maintain reliable connectivity, the drone may perform one or more HOs during flight, each of which changes the BS-drone association. Therefore, the drone may connect to different BSs along its route. We consider predefined locations along the drone trajectory where it can perform a handover. At each such location, the drone decides: 1) whether to perform a HO, and 2) the new serving BS in case a HO is needed. As illustrated in Fig. 2, the HO process typically involves several steps and signaling exchanges between the drone and the BSs, such as measurement reporting, HO commands, and admission control [12]. Several factors govern the outcome of a HO process, such as the BS distribution, received signal strength characteristics, drone speed, and flight trajectory.
In general, always connecting to the strongest BS (i.e., the one providing maximum RSRP) may be detrimental for drone connectivity and HO signaling overhead. On the one hand, a HO decision based solely on the current maximum RSRP can trigger many subsequent HOs during the drone flight, which is inefficient. On the other hand, it can cause ping-pong HOs and connectivity failures, as the signal strength fluctuates rapidly during a drone flight [12]. This motivates the need for an efficient HO mechanism that accounts for the mobility challenges facing a drone-UE in a terrestrial cellular network. Let us use RSRP as a proxy for reliable connectivity and the number of HOs as a measure of the HO signaling overhead. Intuitively, a desirable HO mechanism will maintain a sufficiently large RSRP while incurring a small number of HOs during a flight.
In this paper, we propose an RL-based framework to determine the optimal sequential HO decisions for a drone-UE, enabling reliable connectivity while accounting for the HO overhead. To this end, our proposed RL-based HO framework considers two key factors in the objective function: 1) the serving cell RSRP values, and 2) a cost (or penalty) for performing a HO. From a design perspective, it is desirable to strike a balance between maximizing the RSRP values and minimizing the number of HOs. Furthermore, to flexibly adjust the impact of the number of HOs and the serving cell RSRP on the HO decisions, we consider w1 and w2 as the weights of the HO cost and the serving cell RSRP, respectively.
III. Background of RL
RL is a learning algorithm in which an agent interacts with an environment by taking actions based on the current state and the anticipated future rewards [11]. As illustrated in Fig. 3, the agent observes state s_t and takes action a_t at time t. It receives feedback in the form of a reward r_t and chooses subsequent actions to maximize the expected reward accumulated over time. RL is often described using a Markov decision process characterized by a tuple (S, A, P, γ, r), where S is the set of states, A denotes the set of actions, P gives the state transition probabilities for a state s and action a, γ is the discount factor, and r denotes the reward function. With this information, the Markov decision process can be solved to obtain the optimal policy, i.e., the action to take at each state such that the expected sum of discounted rewards is maximized.

Q-learning [13] is a type of model-free RL where the goal is to learn the optimal policy for the given Markov process in the absence of P and r. Let us define the Q-value Q^π(s, a) for a policy π as the expected sum of discounted rewards when the agent takes action a in state s and chooses actions according to the policy π thereafter. Using an iterative process, the agent eventually learns the optimal Q-values over time. The actions with the highest Q-values at each state constitute the optimal policy [13, 11]. With a slight abuse of notation, we use Q_t(s, a) to denote the Q-value at time t during the iterative process. When the agent performs an action a_t in state s_t at time t, it receives an immediate reward r_t and transitions to state s_{t+1}. The new Q-value can be evaluated using

Q_{t+1}(s_t, a_t) = (1 − α) Q_t(s_t, a_t) + α [ r_t + γ max_{a'} Q_t(s_{t+1}, a') ]    (1)

where α is the learning rate. With this approach, Q-learning computes the optimal values for all states using successive approximations [13, 11].
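The update in (1) can be sketched in a few lines of Python. The dictionary-based Q-table and the default values of α and γ below are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the Q-learning update in (1).
# Q is a dict mapping (state, action) -> Q-value; missing entries default to 0.

def q_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """One step of (1): Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (reward + gamma * best_next)
    return Q[(s, a)]
```

Starting from an empty table, a single update with reward 1.0 yields a Q-value of α times the reward, as expected from (1).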
IV. RL-Based HO Optimization Framework
In this section, we formally define the state, action and reward for the considered scenario. The objective is to determine the HO decisions for each waypoint along the given route. We also propose an algorithm based on Qlearning to obtain optimal HO decisions for the given route. In Table I, we list the main parameters used in the proposed RLbased HO optimization framework.
Label     Definition
C_HO      HO cost
w1        Weight for HO cost
w2        Weight for serving cell RSRP
r         Reward defined as the weighted combination of HO cost and RSRP
s         State defined as s = (x, d, c)
x         Position coordinate at state s
d         Movement direction at state s
c         Serving cell at state s
s'        Next state of s
a         Action performed at state s
a'        Action performed at state s'
Q(s, a)   Q-value of taking action a at state s
α         Learning rate
γ         Discount factor
ε         Exploration coefficient
N_e       Number of training episodes
IV-A. Definitions
State: The state of a drone, represented by s = (x, d, c), consists of the drone's position x, its movement direction d, and the currently connected cell c ∈ C, where C is the set of all candidate cells. We now describe how a drone trajectory is generated in our model given an initial location and a final location of the drone. At the initial location, the movement direction resulting in the shortest path to the final location is selected. The drone moves in the selected direction for a fixed distance until it reaches the next waypoint. The same procedure is repeated at each waypoint until the drone reaches the waypoint closest to the final location. We note that the resulting drone trajectory is not necessarily a straight line due to the finite number of possible movement directions in our model. We recall that the RL-based HO algorithm merely requires the drone trajectory to be known beforehand. That is, how the fixed trajectories are generated is not critical, but we describe the methodology for the sake of completeness.
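The trajectory-generation procedure above can be sketched as follows. The step size, the number of candidate directions, and the stopping rule are illustrative assumptions; the paper only specifies that a direction from a finite set is chosen greedily toward the destination at each waypoint.

```python
import math

def generate_trajectory(start, goal, step=1.0, n_dirs=8, max_steps=1000):
    """Greedy waypoint generation: at each waypoint, move a fixed step in the
    direction (from a finite set) that brings the drone closest to the goal."""
    dirs = [(math.cos(2 * math.pi * k / n_dirs), math.sin(2 * math.pi * k / n_dirs))
            for k in range(n_dirs)]
    path = [start]
    pos = start
    for _ in range(max_steps):
        cands = [(pos[0] + step * dx, pos[1] + step * dy) for dx, dy in dirs]
        best = min(cands, key=lambda p: math.dist(p, goal))
        # Stop once no candidate direction moves the drone closer to the goal.
        if math.dist(best, goal) >= math.dist(pos, goal):
            break
        pos = best
        path.append(pos)
    return path
```

With only a finite set of directions, the returned path is generally a sequence of straight segments rather than a single straight line, consistent with the description above.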
Action: The drone's action a at the current state s corresponds to choosing a serving cell for the next state s'. For example, as shown in Fig. 4(b), if the action selects a new cell c', then at state s' the drone switches to cell c'.
Reward: We now define a reward function to encourage the desired HO behavior. As shown in Fig. 4(c), the serving cells need to be decided along the trajectory, and the goal is to reduce the number of HOs as well as maintain reliable connectivity. During a flight, the drone need not focus only on the signal strength at its current location. Rather, it may connect to a cell with a lower RSRP at one waypoint if doing so results in fewer HOs at the subsequent waypoints. To achieve a balance between the two conflicting goals, our model considers a weighted combination of the HO cost and the serving cell RSRP at the future state s' as the reward function
r = −w1 · C_HO · 1_HO + w2 · RSRP(s')    (2)
where w1 and w2 respectively denote the weights for the HO cost and the RSRP, while 1_HO is the indicator function for a HO, i.e., 1_HO = 1 when the serving cells at states s and s' are different and 1_HO = 0 otherwise.
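The reward in (2) can be sketched directly. The default weights, the HO cost, and the assumption of normalized RSRP values are illustrative.

```python
def reward(serving_cell, next_cell, rsrp_next, w1=0.5, w2=0.5, ho_cost=1.0):
    """Reward (2): HO penalty (if the serving cell changes) plus weighted
    next-state serving-cell RSRP, assumed normalized to [0, 1]."""
    ho_indicator = 1.0 if next_cell != serving_cell else 0.0
    return -w1 * ho_cost * ho_indicator + w2 * rsrp_next
```

Staying on the same cell incurs no penalty, so the reward is simply the weighted RSRP; switching cells subtracts the weighted HO cost.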
IV-B. Algorithm of HO Scheme Using Q-learning
For complexity reduction, the action space in our model is restricted to the k strongest candidate cells for every state. Let us define the action set A = {1, ..., k}, where action i corresponds to selecting the i-th strongest cell. For a trajectory with M waypoints, the resulting Q-table is stored in memory and updated according to (1). The stepwise iterative process is given in Algorithm 1. The algorithm complexity is O(N_e · M), where N_e is the total number of training episodes and the constant factor M is given by the route length. The initial Q-table for the given trajectory is generated in steps 2-9. In step 6, a binary square matrix of size k × k is generated such that its (i, j)-th entry is 0 if the i-th strongest cell at state s is the same as the j-th strongest cell at state s', and 1 otherwise. The Q-value iterations for each training episode are performed in steps 11-24, where steps 14-18 perform the ε-greedy exploration [11] and step 20 implements (1). Finally, the values for choosing different actions are stored in Q, where the highest value represents the optimal choice. Hence, a sequence of HO decisions for the waypoints of the given route can be obtained according to the maximal Q-value at each state.
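The overall procedure can be sketched as a compact Q-learning loop. This is not Algorithm 1 itself but an illustrative simplification: it assumes a fixed global set of k candidate cells (rather than per-waypoint strongest cells), a given table of normalized RSRP values per waypoint, and illustrative hyperparameter defaults.

```python
import random

def train_ho_policy(rsrp, k, episodes=200, alpha=0.5, gamma=0.9, eps=0.1,
                    w1=0.5, w2=0.5, ho_cost=1.0, seed=0):
    """rsrp[t][c]: normalized RSRP of candidate cell c at waypoint t.
    Returns the greedy serving-cell sequence along the route."""
    rng = random.Random(seed)
    T = len(rsrp)
    Q = {}  # Q[(waypoint, serving cell)] -> list of k action values
    for _ in range(episodes):
        cell = max(range(k), key=lambda c: rsrp[0][c])  # start on the strongest cell
        for t in range(T - 1):
            q = Q.setdefault((t, cell), [0.0] * k)
            # epsilon-greedy exploration over the k candidate cells
            a = rng.randrange(k) if rng.random() < eps else max(range(k), key=q.__getitem__)
            # reward (2): HO penalty plus weighted next-state RSRP
            r = -w1 * ho_cost * (a != cell) + w2 * rsrp[t + 1][a]
            nxt = Q.setdefault((t + 1, a), [0.0] * k)
            # Q-value update (1)
            q[a] = (1 - alpha) * q[a] + alpha * (r + gamma * max(nxt))
            cell = a
    # Read out the serving-cell sequence with maximal Q-value at each state
    cell = max(range(k), key=lambda c: rsrp[0][c])
    policy = [cell]
    for t in range(T - 1):
        q = Q.get((t, cell), [0.0] * k)
        cell = max(range(k), key=q.__getitem__)
        policy.append(cell)
    return policy
```

When one cell dominates the RSRP along the whole route and the HO cost is nonzero, the learned policy stays on that cell, avoiding unnecessary HOs.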
V. Simulation Results
In this section, we evaluate the performance of the proposed RL-based HO mechanism. For performance comparison, we consider a greedy HO scheme as the baseline, in which the drone always connects to the strongest cell. For each flight trajectory, we calculate a performance metric called the HO ratio, defined as the ratio of the number of HOs using the proposed scheme to that of the baseline scheme. Thus, the HO ratio is always 1 for the baseline case. To depict the tradeoff between the number of HOs and the observed RSRP values, we evaluate the performance for different combinations of the weights w1 and w2 in the reward function. For the special case when there is no HO cost (w1 = 0), the proposed RL-based HO scheme is equivalent to the baseline. As the ratio w1/w2 increases, the number of HOs decreases and the HO ratio approaches zero.
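The HO ratio metric above can be computed directly from the serving-cell sequences of the two schemes; the example sequences used below are hypothetical.

```python
def count_hos(cells):
    """Number of HOs in a flight: count of consecutive serving-cell changes."""
    return sum(1 for a, b in zip(cells, cells[1:]) if a != b)

def ho_ratio(proposed_cells, baseline_cells):
    """HOs under the proposed scheme divided by HOs under the greedy baseline."""
    return count_hos(proposed_cells) / count_hos(baseline_cells)
```

By construction, feeding the baseline sequence into both arguments gives a ratio of 1, matching the definition above.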
V-A. Data Preprocessing
In our simulations, we consider a deployment of BSs in a 2D geographical area, where each BS has multiple cells or sectors. We collect samples of RSRP values corresponding to each of these cells at different UE locations at a fixed altitude, as shown in Fig. 5. For normalization, the RSRP samples thus obtained are linearly transformed to the interval [0, 1]. To further quantize the considered space, as shown in Fig. 6, we partition the area into square bins. For each bin, we calculate the representative RSRP value for a cell as the average of the RSRP samples in that bin.

V-B. Results Using Q-learning
We simulate the performance over a large number of independent runs for the proposed and baseline schemes. For each run, the testing route is generated randomly as explained in Section IV. As an illustrative example, Fig. 7 shows a portion of a drone trajectory along with the strongest cell at each waypoint. In our simulations, we fix the learning rate α, the discount factor γ, the exploration coefficient ε, and the number of training episodes N_e.
In Fig. 8, we plot the average number of per-flight HOs for different weight combinations. The proposed scheme is equivalent to the baseline when there is no HO cost. As w1 increases, the cost of a HO grows and our approach avoids unnecessary HOs. For instance, compared to the baseline, the RL-based HO scheme can substantially reduce the number of HOs for sufficiently large w1.
In Fig. 9, we plot the cumulative distribution function (CDF) of the number of HOs in a flight. For the special case of no HO cost, we observe that the CDF of the proposed approach coincides with that of the baseline. Moreover, by properly adjusting the weights for the HO cost and the RSRP, the RL-based scheme can significantly reduce the number of HOs. Similar trends can be observed for the HO ratio in Fig. 10. For example, with a suitable weight choice, the number of HOs can be reduced by at least 50% with a probability of 0.8.

In Fig. 11, we plot the CDF of the RSRP seen by the drone-UE for various HO costs. As expected, the RL-based HO scheme is equivalent to the baseline in terms of the RSRP distribution when there is no cost associated with a HO. This is because, in the baseline case, the drone always connects to the cell offering the largest RSRP during its flight. As noted previously, the proposed RL-based scheme is flexible in that it allows reducing the ping-pong HOs (and the resulting signaling overhead) at the expense of a lower RSRP. For example, with a large HO-cost weight, a (worst-case) 5th-percentile UE incurs an RSRP loss of around 4.5 dB relative to the baseline, whereas a smaller weight suffers only a small loss in RSRP. As evident from Fig. 11 and Fig. 9, both choices significantly reduce the number of HOs compared to the baseline. Such a tradeoff may still be acceptable depending on the operating conditions. For instance, the minimum serving cell RSRP in our results is always greater than −85 dBm (corresponding to a 28 dB signal-to-noise ratio (SNR), assuming a bandwidth of 1 MHz and a noise power of −113 dBm), which is typically sufficient to provide reliable connectivity. Depending on the specific scenario, the network may configure the weights accordingly to operate at an acceptable RSRP with a reduced HO overhead.
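The quoted link budget follows from simple dB arithmetic: with a noise power of −113 dBm, an RSRP of −85 dBm corresponds to an SNR of −85 − (−113) = 28 dB. A one-line sketch, with the −113 dBm noise floor taken from the text:

```python
def snr_db(rsrp_dbm, noise_dbm=-113.0):
    """SNR in dB as the difference between received power and noise power (both in dBm)."""
    return rsrp_dbm - noise_dbm
```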
VI. Conclusions
In this work, we have proposed an RL-based HO mechanism to achieve robust drone connectivity in a cellular-connected drone network. Leveraging a Q-learning framework, we have provided a flexible way of making HO decisions for a given flight trajectory. We have shown how the network can trade off the number of HOs against the received signal strength by adjusting the respective weights of these quantities in the reward function. The simulation results have revealed that the proposed approach can significantly reduce the number of HOs while maintaining reliable connectivity, compared to the baseline HO scheme in which the drone always connects to the strongest cell.
There are several potential directions for future research. First, the existing framework considers drone mobility in 2D. A natural extension would be to allow for 3D drone mobility. Second, the testing area and flying routes considered in this work are rather limited. It would be worth investigating whether our findings hold for larger testing areas and/or longer flying routes with a larger pool of candidate cells. Third, the proposed model and resulting simulations are based on the RSRP metric. Another notable contribution would be to enrich the model with additional parameters.
References
 [1] M. Mozaffari, W. Saad, M. Bennis, Y. Nam, and M. Debbah, “A tutorial on UAVs for wireless networks: Applications, challenges, and open problems,” IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2334–2360, third quarter 2019.
 [2] A. Fotouhi, H. Qiang, M. Ding, M. Hassan, L. G. Giordano, A. Garcia-Rodriguez, and J. Yuan, “Survey on UAV cellular communications: Practical aspects, standardization advancements, regulation, and security challenges,” IEEE Communications Surveys & Tutorials, 2019.
 [3] G. Yang, X. Lin, Y. Li, H. Cui, M. Xu, D. Wu, H. Rydén, and S. B. Redhwan, “A telecom perspective on the internet of drones: From LTE-Advanced to 5G,” arXiv preprint arXiv:1803.11048, 2018.
 [4] X. Lin, V. Yajnanarayana, S. D. Muruganathan, S. Gao, H. Asplund, H. Maattanen, M. Bergstrom, S. Euler, and Y.-P. E. Wang, “The sky is not the limit: LTE for unmanned aerial vehicles,” IEEE Communications Magazine, vol. 56, no. 4, pp. 204–210, April 2018.
 [5] X. Lin, R. Wiren, S. Euler, A. Sadam, H. Maattanen, S. Muruganathan, S. Gao, Y.-P. E. Wang, J. Kauppi, Z. Zou, and V. Yajnanarayana, “Mobile network-connected drones: Field trials, simulations, and design insights,” IEEE Vehicular Technology Magazine, vol. 14, no. 3, pp. 115–125, Sep. 2019.
 [6] 3GPP TR 36.777, “Enhanced LTE support for aerial vehicles,” 2017.
 [7] J. Stanczak, I. Z. Kovacs, D. Koziol, J. Wigard, R. Amorim, and H. Nguyen, “Mobility challenges for unmanned aerial vehicles connected to cellular LTE networks,” in Proc. IEEE 87th Vehicular Technology Conference (VTC Spring), 2018, pp. 1–5.
 [8] S. Euler, H. Maattanen, X. Lin, Z. Zou, M. Bergstrom, and J. Sedin, “Mobility support for cellular connected unmanned aerial vehicles: Performance and analysis,” arXiv preprint arXiv:1804.04523, 2018.
 [9] V. Yajnanarayana, H. Rydén, L. Hévizi, A. Jauhari, and M. Cirkic, “5G handover using reinforcement learning,” arXiv preprint arXiv:1904.02572, 2019.
 [10] A. Alkhateeb, I. Beltagy, and S. Alex, “Machine learning for reliable mmWave systems: Blockage prediction and proactive handoff,” in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2018, pp. 1055–1059.
 [11] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
 [12] K. Sivanesan, J. Zou, S. Vasudevan, and S. Palat, “Mobility performance optimization for 3GPP LTE HetNets,” 2015.
 [13] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.