I Introduction
†This work was initiated when Yun Chen was with Ericsson Research.
Due to the unique advantages of drones such as swift mobility and low-cost operation, their applications are rapidly growing from item delivery and traffic management to asset inspection and aerial imaging [saad2020wireless, TutorialMO, fotouhi2019survey, B5GUAV]. Realizing the true potential of drone technology hinges on ensuring seamless wireless connectivity to drones. Cellular technology is well-suited for providing connectivity services to drones thanks to its reliability, flexibility and ubiquity. Several efforts are underway to develop cellular-assisted solutions, leveraging Long-Term Evolution (LTE) and the fifth-generation (5G) New Radio (NR), for supporting efficient drone operations in the sky [yang2018telecom, Sky]. To better understand the potential of cellular networks for low-altitude drones, the third-generation partnership project (3GPP) has been studying and developing new features for enhanced mobile services for drones acting as user equipments (UEs) [3GPPTR, 3GPP22825, muruganathan2018overview]. To further meet the needs of 5G connectivity of drones, a new 3GPP activity is planned to devise new key performance indicators and identify communication needs of a drone with a 3GPP subscription. In addition, 3GPP is evolving 5G NR to support non-terrestrial networks [3GPP38811, 3GPP38821, lin20195g]. It is expected that the more flexible and powerful NR air interface will deliver more efficient and effective connectivity solutions for wide-scale drone deployments [lin20195gnr].
While the low-altitude sky is within reach of existing cellular networks, enabling robust and uninterrupted services to aerial vehicles such as drones poses several challenges. We next review some of the key technical challenges in serving drone UEs using existing cellular networks. First, terrestrial cellular networks are primarily designed for serving ground UEs and usually use down-tilted base station (BS) antennas. This means that drone UEs are mainly served by the side lobes of the BS antennas and may face coverage holes in the sky due to nulls in the antenna pattern [MobilieDrones]. Second, the drone-BS communication channels have high line-of-sight probabilities. As a result, a drone UE may generate more uplink interference to the neighboring cells and experience more interference in the downlink, as signals from many neighboring cells may reach the drone with strong power levels. The strong interference, if not properly managed, may degrade the link quality of both ground UEs and drone UEs. Third, the high speed and three-dimensional motion of drones make handover (HO) management more cumbersome compared to ground UEs. In a network with multiple BSs (serving multiple cells), these challenges further complicate the drone-BS association rules. This is because the coverage space formed by the strongest BSs is no longer contiguous but rather fragmented [MobilieDrones]. This, in turn, can trigger frequent HOs leading to undesirable outcomes such as radio link failures, ping-pong HOs, and large signaling overheads. This motivates the need for an efficient HO mechanism that can provide robust drone mobility support in the sky [chen2019efficient].
I-A Related Work
The support of mobility is a fundamental aspect of wireless networks [camp2002survey, akyildiz1999mobility]. Mobility management is a particularly essential and complex task in emerging cellular networks with small and irregular cells [lin2013towards, andrews2013seven]. There has been a recent surge of interest in applying machine learning techniques to mobility management in cellular networks. In [wickramasuriya2017base], a recurrent neural network (RNN) was trained using sequences of received signal strength values to perform BS association. In [mismar2018partially], a supervised machine learning algorithm was proposed to improve the success rate in the handover between sub-6 GHz LTE and millimeter-wave bands. In [yajnanarayana20195g], a HO optimization scheme based on reinforcement learning (RL) was proposed for terrestrial UEs in a 5G cellular network. In [alkhateeb2018machine], a HO scheme based on deep learning was proposed to improve the reliability and latency in terrestrial millimeter-wave mobile networks.
In 3GPP Release 15, a study was conducted to analyze the potential of LTE for providing connectivity to drone UEs [3GPPTR]. This study identified mobility support for drones as one of the key areas that can be improved to enhance the capability of LTE networks for serving drone UEs. In [stanczak2018mobility], an overview of the major mobility challenges associated with supporting drone connectivity in LTE networks was presented. In [MobilityS], the performance of a cellular-connected drone network was analyzed in terms of radio link failures and the number of HOs. In [challita2019interference], an interference-aware drone path planning scheme was proposed and the formulated problem was solved using a deep RL algorithm based on an echo state network. In [fakhreddine2019handover], HO measurements were reported for an aerial drone connected to an LTE network in a suburban environment. The results showed that the HO frequency increases with increasing flight altitude, based on which the authors suggested that enhanced HO techniques would be required for better support of drone connectivity.
While prior work has studied various mobility challenges pertaining to drone communications, efficient HO optimization for drone UEs (as motivated in Section I) has received little attention. To this end, in our recent work [chen2019efficient], a HO mechanism based on Q-learning was proposed for a cellular-connected drone network. It was shown that a significant reduction in the number of HOs is attained while maintaining reliable connectivity. The promising results have inspired further work such as [chowdhury2020mobility] that adopted a similar approach for drone mobility management by tuning the down-tilt angles of BSs.
The aim of our work is to find an efficient HO mechanism which accounts for the mobility challenges faced by drone UEs in a terrestrial cellular network optimized for serving devices on the ground. In this paper, we present the second part of our work on using RL to improve drone mobility support, completing the first part of our work presented in our recent paper [chen2019efficient]. Despite the encouraging results in [chen2019efficient], the tabular Q-learning framework adopted in [chen2019efficient] may have some disadvantages. First, the algorithm may entail substantial storage requirements when the state space is large. For example, this is the case with long flying routes having numerous waypoints where the drone needs to make HO decisions. This problem will be further exacerbated when there is a large pool of candidate cells to choose from. Second, the Q-learning approach adopted in [chen2019efficient] can only be used for discrete states, which implies that the proposed scheme therein can help make HO decisions only at predefined waypoints rather than at arbitrary points along the route. These disadvantages are addressed in this paper by using tools from deep RL [sutton1998introduction, li2018deep].
In this paper, we propose a deep Q-network (DQN) based optimization mechanism for a cellular-connected drone system to ensure robust connectivity for drone UEs. With the use of deep RL tools, HO decisions are dynamically optimized using a deep neural network to provide an efficient mobility support for drones. In the proposed framework, we leverage reference signal received power (RSRP) data and a drone’s flight information to learn effective HO rules for seamless drone connectivity while accounting for HO signaling overhead. Furthermore, our results showcase the inherent interplay between the number of HOs and the serving cell RSRP in the considered cellular system. We also compare our results to those reported in [chen2019efficient] that adopted an approach based on Q-learning.
The rest of this paper is organized as follows. The system model is described in Section II. A brief background of deep RL in the context of our work is introduced in Section III. A DQN-based HO scheme is presented in Section IV. The simulation results are provided in Section V. Section VI concludes the paper.
II System Model
We consider a terrestrial cellular network with down-tilted BS antennas. Traditionally, such a network mainly targets serving users on the ground. In this work, we assume that it also serves drone UEs flying in the sky. Each drone UE moves along a two-dimensional (2D) trajectory at a fixed altitude. One of the main goals of the cellular network is to provide fast and seamless HO from one cell (a source cell) to another (a target cell). Due to its high mobility nature, a drone UE may experience multiple HO events, resulting in frequent switching of serving cells along its trajectory.
Fig. 1 illustrates a typical network-controlled HO procedure with drone UE assistance. In the source cell, the drone UE is configured with measurement reporting. That is, it performs measurements such as RSRP to assess its radio link quality and reports the results to the network. For mobility management, the drone UE measures the signal quality of neighbor cells in addition to its serving cell. The network may use the reported measurement results for making a HO decision. If a HO is deemed necessary, the source cell sends a HO request to the target cell. After receiving an acknowledgment of the HO request from the target cell, the source cell can send a HO command to the drone UE. Meanwhile, the source cell may carry out data forwarding to the target cell. Upon receiving the HO command, the drone UE can initiate the random access procedure towards the target cell, receive uplink grant, and send the HO confirmation message. Once the HO procedure is completed, the drone UE can resume data communication with the network through the target cell (which becomes the serving cell upon HO completion).
We assume that the drone trajectory is fixed and known to the network. We consider predefined locations (or waypoints) along the trajectory for making HO decisions. For each such location, it is first decided whether a HO is needed or not. In case a HO is needed, a target cell is further decided. The HO decisions may depend on various factors such as BS distribution, drone mobility profile including speed and flight trajectory, and propagation environment, among others.
We consider a baseline HO strategy purely based on RSRP measurements where the drone UE is assumed to always connect to the cell which provides the largest RSRP. While selecting the strongest cell is indeed appealing from a radio signal strength perspective, a HO decision solely based on the largest RSRP at the waypoint is often short-sighted as it may trigger many subsequent HO events during the flight. Further, the considered baseline HO strategy may cause frequent ping-pong HO events and radio link failures. This is because the signal strength can fluctuate rapidly along the drone trajectory in a cellular network with down-tilted BS antennas. Also, there can be a service interruption during the time interval when the drone UE receives a HO command from the source cell until the target cell receives the HO confirmation from the drone UE. In short, HO is a costly procedure, hence the number of HO events along the flight trajectory needs to be minimized while maintaining the desired radio link quality.
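The baseline rule can be illustrated with a short Python sketch. The RSRP values, cell identifiers, and function names below are hypothetical and chosen only to show how a strongest-cell policy can ping-pong between two cells of similar strength; they are not taken from the simulations in this paper.

```python
from typing import Dict, List, Sequence


def greedy_serving_cells(rsrp_per_waypoint: Sequence[Dict[int, float]]) -> List[int]:
    """At every waypoint, pick the cell with the largest RSRP (baseline rule)."""
    return [max(rsrp, key=rsrp.get) for rsrp in rsrp_per_waypoint]


def count_handovers(serving_cells: Sequence[int]) -> int:
    """A HO occurs whenever two consecutive waypoints have different serving cells."""
    return sum(1 for prev, cur in zip(serving_cells, serving_cells[1:]) if prev != cur)


# Hypothetical RSRP (dBm) of two cells at four consecutive waypoints.
route = [
    {1: -80.0, 2: -81.0},
    {1: -81.0, 2: -80.5},
    {1: -80.2, 2: -80.6},
    {1: -80.9, 2: -80.4},
]
cells = greedy_serving_cells(route)
print(cells, count_handovers(cells))  # [1, 2, 1, 2] 3
```

In this toy route, two cells with nearly equal RSRP cause a HO at every waypoint, which is the short-sighted behavior that motivates a HO-aware decision rule.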
In this work, we use RSRP as a proxy for radio link reliability and the number of HO events as a measure of HO cost, which may include signaling overhead, potential radio link failure, and service interruption time associated with the HO procedure. Intuitively, a desirable HO mechanism will lead to sufficiently large RSRP values while incurring only a modest number of HO events along a flight trajectory. To this end, we propose a deep RL-based framework to determine the optimal sequential HO decisions to achieve reliable connectivity while accounting for the HO costs. In this framework, we consider two key factors in the objective function: 1) the serving cell RSRP values, and 2) the cost (or penalty) for performing a HO. To reflect the impact of these factors in the HO decisions, we define $w_{\mathrm{RSRP}}$ and $w_{\mathrm{HO}}$ as the weights of the serving cell RSRP and the HO cost, respectively. From a design perspective, adjusting the weights $w_{\mathrm{HO}}$ and $w_{\mathrm{RSRP}}$ can help strike a balance between maximizing the serving cell RSRP values and minimizing the number of HO events.
III Background of Deep Reinforcement Learning
As a subfield of machine learning, RL addresses the problem of automatic learning of optimal decisions over time [sutton1998introduction]. An RL problem is often described by using a Markov decision process characterized by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, \mathcal{R})$, where $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, $\mathcal{P}$ denotes the state transition probabilities, $\gamma$ is a discounting factor that penalizes the future rewards, and $\mathcal{R}$ denotes the reward function. In RL, an agent interacts with an environment by taking actions based on observations and the anticipated future rewards. Specifically, the agent can stay in a state of an environment, take an action in the environment to switch from one state to another governed by the state transition probabilities, and in turn it receives a reward as feedback. The RL problem is solved by obtaining an optimal policy which provides the guideline on the optimal action to take in each state such that the expected sum of discounted rewards is maximized.
Q-learning, as adopted in our previous work [chen2019efficient], is one of the most promising algorithms for solving RL problems [Q-learning]. Let us denote by $Q^{\pi}(s, a)$ the Q-value (or action-value) of a state-action pair $(s, a)$ under a policy $\pi$. Formally, $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right]$, where $G_t$ is the return at time $t$. So, $Q^{\pi}(s, a)$ is the expected sum of discounted rewards when the agent takes an action $a$ in state $s$ and chooses actions according to the policy $\pi$ thereafter. The optimal policy $\pi^{*}$ achieves the optimal value function: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. Thus, by computing the optimal Q-values $Q^{*}(s, a)$, one can derive the optimal policy that chooses the action with the highest Q-value at each state. The optimal Q-values can be computed by iterative algorithms. With a slight abuse of notation, we use $Q_t(s, a)$ to denote the Q-value at time $t$ during the iterative process. When the agent performs an action $a_t$ in a state $s_t$ at time $t$, it receives a reward $r_{t+1}$ and switches to state $s_{t+1}$. The Q-value iteration process is given by
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right],$$
where $\alpha$ is the learning rate. It can be shown that $Q_t(s, a)$ approaches $Q^{*}(s, a)$ when $t \to \infty$. This method is known as tabular Q-learning.
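A minimal sketch of one tabular Q-learning update is given below, assuming the standard update rule with a learning rate `alpha` and a discount factor `gamma`. The waypoint labels, the two-action set, and all numeric values are illustrative assumptions rather than settings used in this paper.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    # Greedy estimate of the value of the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Temporal-difference error between the bootstrapped target and the current estimate.
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]


# Hypothetical example: unseen (state, action) pairs start at a Q-value of 0.
q_table = defaultdict(float)
actions = [0, 1]
first = q_update(q_table, "wp0", 1, reward=1.0, next_state="wp1",
                 actions=actions, alpha=0.5)
second = q_update(q_table, "wp0", 1, reward=1.0, next_state="wp1",
                  actions=actions, alpha=0.5)
print(first, second)  # 0.5 0.75
```

Each repeated update moves the stored Q-value a fraction `alpha` of the way towards the bootstrapped target, which is why the table needs one entry per state-action pair.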
The aforementioned Q-learning method may be difficult to use in problems with a large state space, as the number of Q-values grows exponentially with the number of state space variables. A nonlinear representation that maps both state and action onto a value can be used to address this issue. Neural networks are universal approximators and have drawn significant interest from the machine learning community. It has been shown that the depth of a deep neural network can lead to an exponential reduction in the number of neurons required [delalleau2011shallow]. Thus, using a deep neural network for value function approximation is a promising option. Let us denote by $Q(s, a; \theta)$ the approximated Q-value function with parameters $\theta$. A DQN [mnih2013playing] aims to train a neural network with parameters $\theta$ to minimize the loss function
$$L(\theta) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta) \right)^{2} \right],$$
where $y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-})$ is the target value computed with the parameters $\theta^{-}$ of a periodically updated target network.
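To make the DQN objective concrete, the following sketch evaluates a mean squared temporal-difference loss over a small batch. It assumes the commonly used form in which the target is the reward plus the discounted maximum Q-value of the next state, evaluated by a separate target network. The two Q-functions are simple stand-in callables rather than neural networks, and the batch layout and all numbers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action, next_state, reward); this row layout is an assumption for illustration.
Transition = Tuple[Sequence[float], int, Sequence[float], float]
QFunction = Callable[[Sequence[float]], List[float]]  # state -> one Q-value per action


def dqn_loss(batch: Sequence[Transition], q_online: QFunction,
             q_target: QFunction, gamma: float = 0.9) -> float:
    """Mean squared error between TD targets and the online Q-value estimates."""
    total = 0.0
    for state, action, next_state, reward in batch:
        # Target value: immediate reward plus discounted best next-state Q-value.
        y = reward + gamma * max(q_target(next_state))
        total += (y - q_online(state)[action]) ** 2
    return total / len(batch)


def q_online(state: Sequence[float]) -> List[float]:
    """Stand-in for the trainable network (two actions)."""
    return [state[0], 2.0 * state[0]]


def q_target(state: Sequence[float]) -> List[float]:
    """Stand-in for the slowly updated target network."""
    return [0.0, 1.0]


batch = [((1.0,), 0, (2.0,), 0.5)]
loss = dqn_loss(batch, q_online, q_target)
print(round(loss, 6))  # 0.16
```

In an actual DQN, the gradient of this loss with respect to the online parameters would drive the training step, while the target callable would be refreshed only occasionally.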
IV Deep RL-Based HO Optimization Framework
In this section, we formally define the state, action and reward for the considered scenario. The objective is to determine the HO decisions for any arbitrary waypoints along a given route using a DQN framework. In Table I, we list the main parameters used in the proposed HO optimization framework.
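Table I: Main parameters used in the proposed HO optimization framework.
|Symbol|Description|
|$w_{\mathrm{HO}}$|Weight for HO cost|
|$w_{\mathrm{RSRP}}$|Weight for serving cell RSRP|
|$s$|State defined as $s = (P_s, d_s, c_s)$|
|$P_s$|Position coordinate at state $s$|
|$d_s$|Movement direction at state $s$|
|$c_s$|Serving cell at state $s$|
|$s'$|Next state of $s$|
|$a_s$|Action performed at state $s$|
|$a_{s'}$|Action performed at state $s'$|
|$r(s, a_s)$|Reward for taking action $a_s$ in state $s$|
|$T_{\mathrm{th}}$|Threshold for beginning Q-value iteration|
|$Q(s, a_s; \theta)$|Q-value of taking action $a_s$ at state $s$ (updated at every step)|
|$T_u$|Update cycle for $\theta^{-}$|
|$Q(s, a_s; \theta^{-})$|Q-value of taking action $a_s$ at state $s$ (updated every $T_u$ steps)|
|$N_E$|Number of training episodes|
|$\mathcal{B}$|Replay batch for DQN training|
|$\mathcal{B}_m$|Minibatch from $\mathcal{B}$ of size $m$|
|$T$|Number of training steps per episode|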
State: The state of a drone, represented by $s = (P_s, d_s, c_s)$, consists of the drone's position $P_s$, its movement direction $d_s$, and the currently connected cell $c_s \in \mathcal{C}$, where $\mathcal{C}$ is the set of all candidate cells. We use the superscript $'$ to denote the next state $s'$ of a state $s$. We clarify that the direction of movement is restricted to a finite set only for the training phase. For the testing phase, the deep network may output results for other directions, which is beneficial for trajectory adjustment in practical applications. We now describe how a drone trajectory is generated in our model given an initial location and a final location of the drone. At the initial location, we select the movement direction which results in the shortest path to the final location. The drone moves in the selected direction for a fixed distance until it reaches the next waypoint. The same procedure is repeated for selecting the direction at each waypoint until the drone reaches the waypoint closest to the final location. We note that the resulting drone trajectory is not necessarily a straight line due to the finite number of possible movement directions in our model. We recall that the RL-based HO algorithm merely expects that the drone trajectory is known beforehand. Thus, we are able to get sufficient training data along the route for the DQN. While it is not critical how the fixed trajectories are generated, we have nevertheless described the methodology for the sake of completeness.
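The trajectory generation procedure described above can be sketched as follows. The sketch assumes eight equally spaced movement directions (the actual finite set of directions is not reproduced here) and uses the 50 m waypoint spacing mentioned in the experimental setup. The function name, the `max_steps` safety guard, and the example coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def generate_trajectory(start: Point, goal: Point, step: float = 50.0,
                        n_directions: int = 8, max_steps: int = 10_000) -> List[Point]:
    """Greedily build waypoints using a finite set of movement directions."""
    headings = [2.0 * math.pi * k / n_directions for k in range(n_directions)]
    waypoints = [start]
    current = start
    for _ in range(max_steps):  # guard against pathological inputs
        # Candidate next waypoints, one per allowed movement direction.
        candidates = [(current[0] + step * math.cos(h),
                       current[1] + step * math.sin(h)) for h in headings]
        best = min(candidates, key=lambda p: math.dist(p, goal))
        # Stop once no direction brings the drone closer to the final location.
        if math.dist(best, goal) >= math.dist(current, goal):
            break
        current = best
        waypoints.append(current)
    return waypoints


path = generate_trajectory((0.0, 0.0), (200.0, 0.0))
print(len(path))  # 5 waypoints: the start plus four 50 m steps

# A goal that is not aligned with any allowed direction yields a piecewise-linear path.
zigzag = generate_trajectory((0.0, 0.0), (300.0, 120.0))
```

Because only a finite number of headings is allowed, the second call returns a route that approaches the final location without necessarily ending exactly on it.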
Action: As the drone trajectory is fixed and known beforehand, the drone position at the future state $s'$ is known a priori at the current state $s$. Therefore, the RSRP values from various cells for the drone position at state $s'$ are also known a priori at state $s$. For the current state $s$, we let $\mathcal{C}_{s'}$ denote the set of candidate cells at the future state $s'$, where $\mathcal{C}_{s'}$ consists of the $K$ strongest cells at state $s'$. We assume that the cells in $\mathcal{C}_{s'}$ are sorted in descending order of RSRP magnitudes. The drone's action $a_s$ at the current state $s$ corresponds to choosing a serving cell from $\mathcal{C}_{s'}$ for the next state $s'$. This is illustrated in Fig. 2(b) for $K = 6$, where the current state is shown by a dashed-line drone and the future state by a solid drone. The 6 cells are sorted in descending order of RSRP magnitudes seen at the future state $s'$, i.e., cell 5 has the largest while cell 1 has the smallest RSRP. The drone takes an action $a_s$ at the current state, meaning that it will connect to cell 4 at the next state, i.e., $c_{s'} = 4$. Thus, the action $a_s$ consists of choosing an index (i.e., picking an element) from $\mathcal{C}_{s'}$ at state $s$. As a result, the drone connects to the cell corresponding to that index in state $s'$.
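The mapping from an action index to a serving cell can be sketched in a few lines of Python. The RSRP values, the zero-based indexing of actions, and the function names are assumptions made for illustration; only the rule of sorting the strongest candidate cells in descending order of RSRP follows the description above.

```python
from typing import Dict, List


def candidate_cells(rsrp_next: Dict[int, float], k: int) -> List[int]:
    """Return the k strongest cells at the next waypoint, sorted by descending RSRP."""
    return sorted(rsrp_next, key=rsrp_next.get, reverse=True)[:k]


def apply_action(rsrp_next: Dict[int, float], k: int, action_index: int) -> int:
    """Map an action (an index into the sorted candidate list) to a serving cell."""
    return candidate_cells(rsrp_next, k)[action_index]


# Hypothetical RSRP values (dBm) seen at the next waypoint for seven cells.
rsrp_next = {1: -95.0, 2: -90.0, 3: -88.0, 4: -85.0, 5: -80.0, 6: -92.0, 7: -99.0}
print(candidate_cells(rsrp_next, k=6))             # [5, 4, 3, 2, 6, 1]
print(apply_action(rsrp_next, k=6, action_index=1))  # 4
```

With this convention the action space has a fixed size regardless of how many cells are deployed, since the agent only ranks the strongest candidates at each waypoint.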
Reward: We now describe the reward function used in our model. The goal is to encourage the drone to reduce the number of HOs along the trajectory while also maintaining reliable connectivity. In the context of Fig. 2(b), this means that the action corresponding to the cell with the highest RSRP is not necessarily always selected. The drone might instead connect to a cell with a lower RSRP at one waypoint if that results in fewer HOs at subsequent waypoints. In view of these conflicting goals, we incorporate a weighted combination of the HO cost and the serving cell RSRP at the future state $s'$ in the reward function
$$r(s, a_s) = -w_{\mathrm{HO}} \, \mathbb{1}_{\mathrm{HO}}(s, s') + w_{\mathrm{RSRP}} \, \mathrm{RSRP}_{s'},$$
where $w_{\mathrm{HO}}$ and $w_{\mathrm{RSRP}}$ respectively denote the weights for the HO cost and the serving cell RSRP $\mathrm{RSRP}_{s'}$ at state $s'$, while $\mathbb{1}_{\mathrm{HO}}(s, s')$ is the indicator function for the HO cost such that $\mathbb{1}_{\mathrm{HO}}(s, s') = 1$ when the serving cells at states $s$ and $s'$ are different and $\mathbb{1}_{\mathrm{HO}}(s, s') = 0$ otherwise.
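A minimal sketch of such a reward is shown below. It assumes that the HO cost enters as a penalty (negative sign) and that the RSRP has been normalized to the interval [0, 1] as in the data pre-processing step. The function name, argument names, and weight values are hypothetical.

```python
def ho_reward(current_cell: int, next_cell: int, next_rsrp: float,
              w_ho: float, w_rsrp: float) -> float:
    """Weighted combination of a HO penalty and the next serving cell RSRP."""
    # Indicator of a HO: 1 if the serving cell changes between the two states, else 0.
    ho_indicator = 1 if next_cell != current_cell else 0
    return -w_ho * ho_indicator + w_rsrp * next_rsrp


# Hypothetical comparison at one waypoint, with equal weights.
stay = ho_reward(current_cell=3, next_cell=3, next_rsrp=0.6, w_ho=1.0, w_rsrp=1.0)
switch = ho_reward(current_cell=3, next_cell=5, next_rsrp=0.8, w_ho=1.0, w_rsrp=1.0)
print(stay > switch)  # True: the RSRP gain does not offset the HO penalty here
```

Setting `w_ho` to zero removes the penalty, in which case the immediate reward depends only on the RSRP of the selected cell.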
IV-B Algorithm of HO scheme using DQN
For complexity reduction, the action space in our model is restricted to the $K$ strongest candidate cells for every state. Let us define a set and assume that the trajectory has waypoints. Unlike using a Q-table [chen2019efficient] to store the Q-values, which may require substantial memory, the Q-value for each state-action pair can be directly obtained from the DQN [mnih2013playing]. We describe the training process in Algorithm 1. The algorithm complexity is $\mathcal{O}(T N_E)$, where $T$ is the number of training steps per episode and $N_E$ is the total number of training episodes. We use two networks for training: one for the initial update of the parameter $\theta$, and the other for storing the more stable $\theta^{-}$ after $\theta$ has been appropriately trained for a given period of time.
The Q-value iterations for each training episode are performed in lines 5-32. An $\epsilon$-greedy exploration is performed in lines 9-13 [sutton1998introduction]. The data for each training step is stored in a replay batch $\mathcal{B}$. Specifically, each row of $\mathcal{B}$ contains the tuple $(s, a_s, s', r)$, i.e., current state, action, future state and reward for a training step (lines 9-15). The training process is activated after $\mathcal{B}$ has accumulated a minimum number of row entries (line 18). Then, a minibatch $\mathcal{B}_m$ is obtained by (uniformly) randomly extracting $m$ rows from $\mathcal{B}$. We let $\mathbf{s} = [s_1, \dots, s_m]^{T}$ denote the input state vector consisting of only the current states for all entries in $\mathcal{B}_m$, where $s_i$ denotes the current state for row $i$ in $\mathcal{B}_m$. We feed the input state vector to the DQN to compute the Q-values for all possible actions. For each state $s_i$, we represent the Q-values for all possible actions by a vector $\mathbf{q}_i$ (line 20). We further define an $m \times K$ matrix $\mathbf{Q} = [\mathbf{q}_1, \dots, \mathbf{q}_m]^{T}$. As described in lines 21-25, we update the entries of $\mathbf{Q}$. During the preliminary training phase, we update the Q-values using the corresponding rewards such that the network has a rough approximation of the Q-values for various state-action pairs. This is because initially the network cannot accurately predict the Q-values used for value iteration. This helps avoid error accumulation in the initial training stage. Then, after running a sufficient number of training steps, we use the Q-values output from the network for value iteration, as shown in lines 21-22. Specifically, we set the threshold $T_{\mathrm{th}}$ in our model such that the reward function is used for value iteration for around 30% of the steps in each episode, whereas the Q-values are used for the remaining steps. In this way, the trained parameter $\theta$ requires only a few oscillations to converge. In addition, the parameter $\theta^{-}$ for the target network is updated every $T_u$ steps (lines 25-27), which ensures that $\theta^{-}$ is replaced by a relatively stable $\theta$ calculated during the preceding $T_u$ steps. Finally, the well-trained target network is used for action prediction for the states along the route. The output from the network is a vector of Q-values for all the possible actions. The action with the highest Q-value is chosen for each state.
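The two-phase target construction described above can be sketched as follows. This is not Algorithm 1 itself: the Q-functions are stand-in callables instead of neural networks, and the row layout, the `warmup_steps` threshold, the minibatch size, and all numeric values are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[str, int, str, float]          # (state, action, next_state, reward)
QFunction = Callable[[str], List[float]]   # state -> one Q-value per action


def sample_minibatch(replay: Sequence[Row], size: int, rng: random.Random) -> List[Row]:
    """Draw rows uniformly at random (without replacement) from the replay batch."""
    return rng.sample(list(replay), size)


def regression_targets(minibatch: Sequence[Row], q_online: QFunction,
                       q_target: QFunction, step: int, warmup_steps: int,
                       gamma: float = 0.9) -> List[List[float]]:
    """Build one row of target Q-values per minibatch entry."""
    targets = []
    for state, action, next_state, reward in minibatch:
        q_row = list(q_online(state))  # keep the current estimates for untaken actions
        if step < warmup_steps:
            # Preliminary phase: use the reward alone as a rough Q-value.
            q_row[action] = reward
        else:
            # Later phase: bootstrap from the target network for value iteration.
            q_row[action] = reward + gamma * max(q_target(next_state))
        targets.append(q_row)
    return targets


replay = [("wp0", 1, "wp1", 0.4), ("wp1", 0, "wp2", 0.7), ("wp2", 2, "wp3", -0.3)]
batch = sample_minibatch(replay, size=2, rng=random.Random(0))

online = lambda state: [0.0, 0.0, 0.0]
target = lambda state: [0.1, 0.5, 0.2]
early = regression_targets(replay[:1], online, target, step=10, warmup_steps=30)
late = regression_targets(replay[:1], online, target, step=50, warmup_steps=30)
print(early)  # [[0.0, 0.4, 0.0]]
```

In a full implementation, the online network would be fitted to these target rows at every step, and the target callable would be replaced by a copy of the online network once per update cycle.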
V Simulation Results
In this section, we present the simulation results for the proposed DQN-based HO mechanism. For performance comparison, we consider a greedy HO scheme as the baseline in which the drone always connects to the strongest cell. We also contrast the results with those reported in [chen2019efficient] using a tabular Q-learning framework. We now define a performance metric called the HO ratio: for a given flight trajectory, the HO ratio is the ratio of the number of HOs using the proposed scheme to that using the baseline scheme. By definition, the HO ratio is always 1 for the baseline case. To illustrate the interplay between the number of HOs and the observed RSRP values, we evaluate the performance for various weight combinations of $w_{\mathrm{HO}}$ and $w_{\mathrm{RSRP}}$ in the reward function. By increasing the ratio $w_{\mathrm{HO}} / w_{\mathrm{RSRP}}$, the number of HOs for the DQN-based scheme can be decreased, which yields a smaller HO ratio.
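The HO ratio can be computed per flight as in the following sketch. The HO counts are hypothetical, and the handling of a baseline flight without any HO (raising an error) is an assumption, since that corner case is not discussed here.

```python
def ho_ratio(n_ho_proposed: int, n_ho_baseline: int) -> float:
    """Ratio of the number of HOs of a scheme to that of the baseline on the same flight."""
    if n_ho_baseline <= 0:
        raise ValueError("HO ratio is undefined when the baseline performs no HO")
    return n_ho_proposed / n_ho_baseline


# Hypothetical per-flight counts: 35 HOs with the baseline, 7 with the proposed scheme.
ratio = ho_ratio(7, 35)
print(ratio)            # 0.2
print(ho_ratio(35, 35))  # 1.0, the baseline compared with itself
```

A ratio of 0.2 corresponds to a five-fold reduction in the number of HOs relative to the baseline for that flight.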
V-A Data Pre-processing
Similar to [chen2019efficient], we consider a deployment of BSs in a 2D geographical area of km where each BS has cells. We assume that the UEs are located in a 2D plane at an altitude of m. We generate samples of RSRP values for each of these cells at different UE locations. For normalization, the RSRP samples thus obtained are linearly transformed to the interval [0, 1]. To further quantize the considered space, we partition the area into bins of size m (as shown in Fig. 3 and Fig. 4). For each bin, we compute the representative RSRP value for a cell as the average of the RSRP samples in that bin.
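The two pre-processing steps, linear normalization and per-bin averaging, can be sketched as follows. The sample values, the 50 m bin size, and the function names are hypothetical and only illustrate the procedure for a single cell.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def normalize(values: Sequence[float]) -> List[float]:
    """Linearly map values to the interval [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_average(samples: Sequence[Tuple[float, float, float]],
                bin_size: float) -> Dict[Tuple[int, int], float]:
    """Average the (x, y, value) samples that fall into each square bin."""
    sums = defaultdict(lambda: [0.0, 0])  # bin index -> [running sum, sample count]
    for x, y, value in samples:
        key = (int(x // bin_size), int(y // bin_size))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical RSRP samples (dBm) of one cell at three UE locations (coordinates in m).
rsrp_dbm = [-100.0, -90.0, -80.0]
locations = [(10.0, 10.0), (20.0, 30.0), (60.0, 10.0)]
scaled = normalize(rsrp_dbm)
samples = [(x, y, v) for (x, y), v in zip(locations, scaled)]
print(bin_average(samples, bin_size=50.0))  # {(0, 0): 0.25, (1, 0): 1.0}
```

The resulting dictionary acts as a lookup table that returns the representative normalized RSRP of the cell for any waypoint, given the index of the bin that contains it.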
V-B Experimental Setup
We simulate the performance using runs for each of the DQN-based, the Q-learning-based [chen2019efficient] and the baseline schemes. For each run, the testing route is generated randomly as explained in Section IV. We show a snapshot of a flying trajectory in Fig. 5. The distance between subsequent waypoints along the trajectory is set to 50 m. We note that the drone's speed is not relevant since we aim to reduce the number of HOs for a given trajectory rather than the number of HOs per unit time. For the DQN-based scheme, we use a neural network with two fully-connected hidden layers and train it using RMSprop as the optimizer. We use the following parameter values for Q-value iteration:, , , , , and .
In Fig. 6, we plot the average number of HOs per flight for various weight combinations. We first consider the case of practical interest where the HO cost is non-zero. The proposed approach helps avoid unnecessary HOs compared to the baseline case even for a modest weight for the HO cost. For example, introducing a slight penalty for a HO event helps cut the average number of per-flight HOs roughly in half. By further increasing $w_{\mathrm{HO}}$, the HO cost increases, which reduces the number of HOs. For instance, the number of HOs can be reduced by around 11 times. We further note that there are diminishing returns if the HO cost is weighed higher than the RSRP, i.e., when $w_{\mathrm{HO}} > w_{\mathrm{RSRP}}$. We now consider the case where there is no HO cost, i.e., $w_{\mathrm{HO}} = 0$. The proposed scheme performs slightly worse than the baseline in terms of the average number of per-flight HOs. This apparent anomaly arises because the Q-value obtained from a DQN-based algorithm is in fact an approximation of that obtained via tabular Q-learning. As reported in [chen2019efficient], the case $w_{\mathrm{HO}} = 0$ is equivalent to the baseline when tabular Q-learning is used. Nonetheless, we note that this corner case ($w_{\mathrm{HO}} = 0$) is irrelevant as the network can revert to the baseline HO approach instead.
In Fig. 7, we plot the cumulative distribution function (CDF) of the number of per-flight HOs. For a non-zero HO cost, the proposed scheme significantly reduces the number of HOs. For example, with a probability of 0.95, the number of per-flight HOs is expected to be fewer than 98 for the baseline case. For the same probability, the proposed scheme reduces the number of HOs to fewer than 38 or to fewer than 7, depending on the weight combination. Similarly, with a probability of 0.1, fewer than 14 per-flight HOs are expected for the baseline case, whereas the proposed scheme requires fewer than 7 HOs or only 1 HO, depending on the weight combination. For the special case $w_{\mathrm{HO}} = 0$, we observe that the CDF for the proposed approach is slightly worse than that of the baseline. This trend is consistent with the explanation provided for Fig. 6.
We caution that merely inspecting the absolute number of HOs may be misleading as it does not reflect the reduction in HOs on a per-flight basis. In Fig. 8, we plot the CDF of the HO ratio for the proposed scheme. We note that this metric captures the reduction in the number of HOs relative to the baseline for each flight. For one of the considered weight combinations, a HO ratio of 0.1, 0.2 or 0.5 can be achieved with a probability of 0.70, 0.90 or 0.98, respectively. This means a reduction in the number of HOs by at least 2 times for 98% of the flights, 5 times for 90% of the flights, or 10 times for 70% of the flights. In short, by properly adjusting the weights for the HO cost and RSRP, the RL-based scheme can significantly reduce the number of HOs for various scenarios.
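The link between the CDF of the HO ratio and the reduction factor can be illustrated with an empirical CDF. The per-flight HO ratios below are made-up values and do not reproduce the results in Fig. 8; only the interpretation (a HO ratio of at most x corresponds to a reduction by at least 1/x times) follows the text.

```python
from typing import Sequence


def empirical_cdf(samples: Sequence[float], x: float) -> float:
    """Fraction of samples that are less than or equal to x."""
    return sum(1 for value in samples if value <= x) / len(samples)


# Hypothetical HO ratios observed over ten flights.
ho_ratios = [0.05, 0.08, 0.10, 0.10, 0.15, 0.20, 0.25, 0.40, 0.50, 1.00]
for threshold in (0.1, 0.2, 0.5):
    share = empirical_cdf(ho_ratios, threshold)
    print(f"ratio <= {threshold}: {share:.0%} of flights, "
          f"at least {1 / threshold:.0f}x fewer HOs")
```

Reading the CDF at a given ratio therefore directly gives the share of flights that achieve at least the corresponding reduction factor.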
In Fig. 9, we plot the CDF of the RSRP observed along the trajectory of the drone UE for various combinations of HO cost and RSRP weights. We note that the proposed scheme provides a flexible way to reduce ping-pong HOs (and the resulting signaling overheads) at the expense of RSRP. For example, for one of the considered weight combinations, a (worst-case) 5th-percentile UE suffers an RSRP loss of around 4.5 dB relative to the baseline. If such degradation is not acceptable, a different weight combination will incur only a meager loss in RSRP. It is evident from Fig. 7 and Fig. 9 that both choices substantially reduce the number of HOs compared to the baseline. We remark that the operating conditions will influence the network's decision to strike a favorable tradeoff between the HO overheads and reliable connectivity. In Fig. 9, the minimum RSRP exceeds -82 dBm, which translates to a signal-to-noise ratio (SNR) of 31 dB assuming a bandwidth of 1 MHz and a noise power of -113 dBm. This is usually sufficient to provide reliable connectivity.
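The SNR figure quoted above follows from a subtraction in the logarithmic domain, as the short check below shows. The function name is hypothetical; the two power levels are the ones stated in the text.

```python
def snr_db(signal_dbm: float, noise_dbm: float) -> float:
    """SNR in dB is the difference between signal and noise powers expressed in dBm."""
    return signal_dbm - noise_dbm


print(snr_db(signal_dbm=-82.0, noise_dbm=-113.0))  # 31.0
```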
V-D Comparison with Q-learning based approach
Let us compare the performance in terms of the HO ratio and RSRP for the schemes based on DQN and Q-learning [chen2019efficient]. As evident from Table II, the Q-learning-based approach [chen2019efficient] yields a smaller HO ratio than that based on DQN. As noted previously, this is because the DQN attempts to approximate the Q-values obtained via tabular Q-learning [chen2019efficient]. In Table III, we include some selected points from the RSRP CDFs for both cases. We observe only a negligible drop in RSRP for the DQN-based scheme compared to the Q-learning approach. Despite the performance differences, both RL-based methods can significantly reduce the number of HOs while maintaining reliable connectivity. Furthermore, as a first step, we considered 2D drone mobility in a rather limited geographical area. In practical scenarios with longer flying routes, the state space may grow prohibitively large with an approach based on tabular Q-learning. This renders the proposed DQN-based method more appealing thanks to reduced implementation complexity.
VI Conclusion
In this paper, we have developed a novel deep RL-based HO scheme to provide efficient mobility support for a drone served by a cellular network. By exploiting a DQN approach, we have proposed a flexible mechanism for dynamic HO decision making based on the drone’s flight path and the distribution of the BSs. We have shown that the proposed HO mechanism enables the network to manage the tradeoff between the number of HOs (i.e., overheads) and the received signal strength by appropriately adjusting the reward function in the deep RL framework. The results have demonstrated that, compared to the greedy HO approach where the drone always connects to the strongest cell, the deep RL-based HO scheme can significantly reduce the number of HOs while maintaining reliable connectivity.
There are several potential directions for future work. A natural extension will be to include 3D drone mobility in the current framework. It will also be worth validating the proposed scheme for larger testing areas and/or longer flying trajectories with a larger pool of candidate cells. Another notable contribution will be to enhance the model with additional parameters to account for inter-cell interference.