I Introduction
Owing to their deployment flexibility, unmanned aerial vehicles (UAVs) are expected to be a key component of future wireless networks. The use of UAVs as flying base stations that collect/transmit information from/to ground nodes (e.g., users, sensors or Internet of Things (IoT) devices) has recently attracted significant attention [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Meanwhile, introducing UAVs into wireless networks leads to many challenges, such as optimal deployment, flight trajectory design, and energy efficiency. So far, these challenges have been mainly addressed in the literature with the objective of either maximizing network coverage and rate or minimizing delay. In contrast, the quality-of-service (QoS) for many real-time applications, e.g., human safety applications, is restricted by the freshness of the information collected by the UAV from the ground nodes [11]. This necessitates designing the UAV's flight trajectory, as well as the scheduling of information transmissions from the ground nodes, to keep the information status at the UAV as fresh as possible.
Related works. We employ the concept of age-of-information (AoI) to quantify the freshness of information at the UAV. First introduced in [12], AoI is defined as the time elapsed since the latest received status update packet at a destination node was generated at the source node. For a simple queueing-theoretic model, the authors of [12] characterized the average AoI. Then, the average AoI and some other age-related metrics were investigated in the literature for variations of the queueing model considered in [12] (refer to [13] for a comprehensive survey). Another line of research [14, 15, 16, 17, 18, 19, 20, 21, 22] employed AoI as a performance metric for different communication systems that deal with time-critical information. The main focus of these works was on applying tools from optimization theory to characterize age-optimal transmission policies. Note that the destination node was commonly assumed to be a static node in [12, 13, 15, 14, 16, 17, 18, 19, 20, 21, 22]. More recently, in [21] and [22], the authors considered the optimization of AoI in UAV-assisted wireless networks. However, the analyses in these works were limited to scenarios where UAVs acted as relay nodes and hence are not broadly applicable. Furthermore, these works did not take into account the optimal scheduling of update packet transmissions from different nodes while optimizing the UAV's flight trajectory.
Contributions.
The main contribution of this paper is a novel deep reinforcement learning (RL) framework for optimizing the UAV's flight trajectory as well as the scheduling of status update packets from the ground nodes, with the objective of characterizing the minimum weighted sum-AoI. In particular, we study a UAV-assisted wireless network, in which a UAV moves towards the ground nodes to collect status update packets about their observed processes. For this system setup, we formulate a weighted sum-AoI minimization problem in which the UAV's flight trajectory as well as the scheduling of update packet transmissions are jointly optimized. To obtain the age-optimal policy, the problem is first modeled as a finite-horizon Markov decision process (MDP) with finite state and action spaces. Due to the extreme curse of dimensionality in the state space, the use of a finite-horizon dynamic programming (DP) algorithm is computationally impractical. To overcome this challenge, we propose a deep RL algorithm. After showing the convergence of our proposed algorithm, we numerically demonstrate its significant superiority over two baseline policies, namely, the distance-based and random walk policies, in terms of the achievable sum-AoI per process. Several key system design insights are also provided through extensive numerical results. To the best of our knowledge, this work is the first to apply tools from deep RL to characterize the age-optimal policy.
II System Model
II-A Network Model
Consider a wireless network in which a set of $K$ ground nodes are deployed to observe potentially different physical processes of a certain geographical region. Uplink transmissions are considered, where a UAV collects status update packets from the nodes while seeking to maintain freshness of its information status about their observed processes during the time of its operation. We assume a discrete-time system in which time is divided into slots of unit length (without loss of generality), such that slot $t$ corresponds to the time interval $[t, t+1)$. Each ground node has a battery with finite capacity $B$, which is divided into a finite number of energy quanta such that the amount of energy contained in each energy quantum is $E_u$. Let $B_k(t)$ denote the battery level at node $k$ at the beginning of slot $t$.
As shown in Fig. 1, the geographical region of interest is partitioned into cells of equal areas, where we denote by $c_i$ the location of the center of cell $i$, and $\mathcal{C}$ is the set containing the center locations of the different cells. Let $l_x$ and $l_y$ be the horizontal and vertical spacing distances between the centers of any two adjacent cells, respectively. The UAV is assumed to fly at a fixed height $H$ such that the projection of its flight trajectory on the ground at time slot $t$ is denoted by $u(t) \in \mathcal{C}$. In other words, we discretize the trajectory of the UAV such that its location is mapped to a discrete value during each time slot. In practice, the UAV can only operate for a finite time interval due to its battery limitations and the need for recharging. We model this fact by a time constraint of $T$ seconds (i.e., $T$ time slots, given the unit slot length) during which the UAV flies from an initial location $u_I$ to a final location $u_F$ where it can be recharged to continue its operation. Note that $u_I$ and $u_F$ are the center locations of the initial and final cells, respectively, along the UAV's flight trajectory. Therefore, the UAV's flight trajectory is approximated by the sequence $\{u(1), u(2), \dots, u(T)\}$. Similar to [7, 8, 9], the channels between the UAV and the ground nodes are assumed to be dominated by the line-of-sight (LoS) links. Therefore, at time slot $t$, the channel power gain between the UAV and ground node $k$ will be:
$$g_k(t) = \frac{\beta_0}{d_k^{2}(t)}, \qquad (1)$$
where $d_k(t) = \sqrt{H^2 + \|u(t) - w_k\|^2}$ is the distance between the UAV and node $k$, $w_k$ is the location of node $k$, and $\beta_0$ is the channel power gain at a reference distance of 1 meter.
We use the concept of AoI to quantify the freshness of the information status at the UAV. The AoI of an arbitrary physical process is defined as the time elapsed since the most recently received update packet at the UAV was generated at the ground node observing this process. Denote by $A_k(t)$ the AoI at the UAV for the process observed by node $k$ at the beginning of time slot $t$, where $A_{\max}$ is the maximum value that $A_k(t)$ can take and can be chosen to be arbitrarily large [11, 19].
II-B State and Action Spaces
The state of a ground node $k$ at time slot $t$ is characterized by its battery level and the AoI of its observed process at the UAV at the beginning of slot $t$, i.e., $s_k(t) = (B_k(t), A_k(t))$. On the other hand, the state of the UAV at slot $t$ is captured by its location $u(t)$, and the difference between the remaining time and the required time to reach its final location, denoted by $F(t)$, i.e., $s_{\rm U}(t) = (u(t), F(t))$. Hence, the system state at slot $t$ can be expressed as $s(t) = (s_{\rm U}(t), s_1(t), \dots, s_K(t)) \in \mathcal{S}$, where $\mathcal{S}$ is the state space of the system.
We assume that the UAV's maximum allowable speed limits its movement in each slot to one of the cells adjacent to its current cell. Hence, in each time slot, the UAV either decides to remain at its location over the duration of the next slot or to move to one of its adjacent cells. Let $a_{\rm M}(t) \in \{\mathrm{N}, \mathrm{S}, \mathrm{W}, \mathrm{E}, \mathrm{H}\}$ be the movement action of the UAV at slot $t$, where $\mathrm{N}$, $\mathrm{S}$, $\mathrm{W}$ and $\mathrm{E}$ denote the north, south, west and east directions, respectively, and $\mathrm{H}$ indicates that the UAV will remain at its location in the next slot. Hence, the dynamics of the UAV's location will be:
$$u(t+1) = \begin{cases} u(t) + (0, l_y), & a_{\rm M}(t) = \mathrm{N},\\ u(t) - (0, l_y), & a_{\rm M}(t) = \mathrm{S},\\ u(t) - (l_x, 0), & a_{\rm M}(t) = \mathrm{W},\\ u(t) + (l_x, 0), & a_{\rm M}(t) = \mathrm{E},\\ u(t), & a_{\rm M}(t) = \mathrm{H}. \end{cases} \qquad (2)$$
Note that if the UAV is located at one of the boundary cells in time slot $t$ and $a_{\rm M}(t)$ would cause its location to be outside the considered region in slot $t+1$, then the UAV will remain at its current location in slot $t+1$. Meanwhile, at each slot, the UAV may choose one of the ground nodes from which it receives an update packet about its observed process. Let $a_{\rm S}(t) \in \{0, 1, \dots, K\}$ denote the scheduling action for update packet transmission at time slot $t$, where $a_{\rm S}(t) = k$, $k \geq 1$, means that node $k$ is scheduled to transmit an update packet at slot $t$, and $a_{\rm S}(t) = 0$ indicates that no update packet transmission occurs at slot $t$. Hence, the system action at slot $t$ is given by $a(t) = (a_{\rm M}(t), a_{\rm S}(t)) \in \mathcal{A}$, where $\mathcal{A}$ is the action space of the system.
By letting $S$, $W$, and $\sigma^2$ be the size of an update packet, the channel bandwidth, and the noise power at the UAV, respectively, the integer number of energy quanta $e_k(t)$ required to transmit an update packet from node $k$ is given by:
$$e_k(t) = \left\lceil \frac{E_k(t)}{E_u} \right\rceil, \qquad (3)$$
where, according to Shannon's formula and recalling that the slot length is unity, the required transmission energy $E_k(t)$ can be expressed as:
$$E_k(t) = \frac{\sigma^2}{g_k(t)} \left( 2^{\frac{S}{W}} - 1 \right). \qquad (4)$$
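As a concrete illustration of (3) and (4), the following sketch computes the number of energy quanta needed for one transmission. All function and parameter names, and the numeric defaults (e.g., the value of $\beta_0$ and the quantum size), are hypothetical and only for illustration, not the paper's values.

```python
import math

def energy_quanta_needed(dist_m, packet_bits, bandwidth_hz, noise_w,
                         beta0=1e-3, quantum_j=1e-3):
    """Energy quanta needed to deliver one update packet over a LoS link.

    Implements (4): E = sigma^2 * (2^(S/W) - 1) / g, with g = beta0 / d^2
    from (1) (unit slot length, so energy equals transmit power),
    followed by (3): the ceiling of E over the energy per quantum.
    """
    g = beta0 / dist_m ** 2                                    # gain, eq. (1)
    energy_j = noise_w * (2 ** (packet_bits / bandwidth_hz) - 1) / g  # eq. (4)
    return math.ceil(energy_j / quantum_j)                     # eq. (3)
```

As expected from (1), the required number of quanta grows with the UAV-to-node distance, since the channel gain decays quadratically.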
Clearly, when node $k$ is scheduled to transmit an update packet at slot $t$, its current battery level should be at least equal to $e_k(t)$. Note that the ceiling is used in (3) to obtain a lower bound on the performance of the continuous system (in which the available energy in the battery is expressed by a continuous variable). On the other hand, if the floor operator replaces the ceiling one in the definition of $e_k(t)$, an upper bound on the performance of the continuous system is obtained. Therefore, the evolution of the battery level at node $k$ is given by
$$B_k(t+1) = B_k(t) - e_k(t)\,\mathbb{1}\{a_{\rm S}(t) = k\}, \qquad (5)$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function.
A generate-at-will policy is employed such that whenever node $k$ is chosen to transmit an update packet at a certain time slot, it generates that update packet at the beginning of that time slot [15, 16]. Therefore, when $a_{\rm S}(t) = k$, the AoI of its observed process reduces to one; otherwise, the AoI value increases by one. Hence, the AoI dynamics for the process observed by node $k$ can be expressed as
$$A_k(t+1) = \begin{cases} 1, & a_{\rm S}(t) = k,\\ \min\{A_k(t) + 1, A_{\max}\}, & \text{otherwise}. \end{cases} \qquad (6)$$
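The dynamics in (2), (5), and (6) can be combined into a single environment step. The following is a minimal sketch with hypothetical names and illustrative grid parameters; it is not the paper's implementation.

```python
A_MAX = 20          # illustrative maximum AoI
GRID = 11           # illustrative number of cells per axis
MOVES = {"N": (0, 1), "S": (0, -1), "W": (-1, 0), "E": (1, 0), "H": (0, 0)}

def step(uav_cell, batteries, aoi, move, schedule, quanta_cost):
    """One slot of the system dynamics.

    uav_cell:    (x, y) cell index of the UAV
    batteries:   list of battery levels B_k(t) in energy quanta
    aoi:         list of AoI values A_k(t)
    move:        one of "N", "S", "W", "E", "H"      -> eq. (2)
    schedule:    index k of the scheduled node, or None -> eqs. (5), (6)
    quanta_cost: list of e_k(t), quanta needed for node k's transmission
    """
    # (2): move, but stay put if the move would leave the region
    dx, dy = MOVES[move]
    nx, ny = uav_cell[0] + dx, uav_cell[1] + dy
    if 0 <= nx < GRID and 0 <= ny < GRID:
        uav_cell = (nx, ny)

    batteries, aoi = list(batteries), list(aoi)
    for k in range(len(aoi)):
        if schedule == k and batteries[k] >= quanta_cost[k]:
            batteries[k] -= quanta_cost[k]   # (5): pay e_k(t) quanta
            aoi[k] = 1                       # (6): fresh update received
        else:
            aoi[k] = min(aoi[k] + 1, A_MAX)  # (6): age, capped at A_max
    return uav_cell, batteries, aoi
```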
III Deep Reinforcement Learning for Weighted Sum-AoI Minimization
III-A Problem Formulation
Our goal is to characterize the age-optimal policy which determines the actions taken at the different states of the system over a finite horizon of length $T$. The objective of this age-optimal policy is to minimize the weighted sum-AoI. Formally, a policy $\pi = \{\mu_1, \mu_2, \dots, \mu_T\}$ is a sequence of probability measures over the action space $\mathcal{A}$. Let $h(t) = (s(1), a(1), \dots, a(t-1), s(t))$ denote the sequence of actions and states up to the state of the system at slot $t$. Conditioned on $h(t)$, the probability measure $\mu_t$ determines the probability of taking action $a(t)$, i.e., $\mu_t(a \mid h(t)) = \mathbb{P}[a(t) = a \mid h(t)]$. In addition, the policy is called stationary when $\mu_t = \mu$ for all $t$, and is said to be deterministic when $\mu_t(a \mid h(t)) = 1$ for some $a \in \mathcal{A}(s(t))$, where $\mathcal{A}(s(t))$ represents the set of possible actions at state $s(t)$. Given a policy $\pi$, the total expected cost of the system, over the finite horizon of interest starting from an initial state $s(1)$, can be expressed as
$$C(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} \sum_{k=1}^{K} w_k A_k(t) \,\middle|\, s(1)\right], \qquad (7)$$
where $w_k$ is the importance weight of the process observed by node $k$ and the expectation is taken with respect to the policy $\pi$. Our goal is to obtain the optimal policy $\pi^*$ that satisfies
$$\pi^* = \underset{\pi}{\arg\min}\; C(\pi). \qquad (8)$$
Next, we derive the maximum and minimum total expected costs for identical ground nodes which have equal numbers of energy quanta, importance weights, and maximum AoI values.
Theorem 1.
The minimum and maximum total expected costs of the system, for a case with $K$ identical ground nodes having equal importance weights $w_k = w$ and maximum AoI values $A_{\max} \geq K$, and with initial AoI values $\{A_k(1)\}_{k=1}^{K} = \{1, 2, \dots, K\}$, are given by:
$$C_{\min} = \frac{w\, T\, K (K+1)}{2}, \qquad (9)$$
$$C_{\max} = w \sum_{t=1}^{T} \sum_{k=1}^{K} \min\{k + t - 1, A_{\max}\}. \qquad (10)$$
Proof:
The minimum total expected cost is achieved when the UAV receives an update packet from the ground node with the maximum current AoI value at every time slot. In this case, the AoI values at each slot form a permutation of $\{1, 2, \dots, K\}$, and hence:
$$\sum_{k=1}^{K} A_k(t) = \frac{K(K+1)}{2}, \quad \forall t. \qquad (11)$$
By summing this value over all time slots, we obtain (9). The maximum total expected cost is reached when the UAV cannot receive update packets over all time slots. In this case, we have:
$$A_k(t) = \min\{A_k(1) + t - 1, A_{\max}\} = \min\{k + t - 1, A_{\max}\}. \qquad (12)$$
Summing this value over all nodes and time slots, we obtain (10). ∎
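The two bounds in Theorem 1 can be sanity-checked numerically by simulating the best case (the node with the maximum current AoI is updated at every slot) and the worst case (no updates at all), under the theorem's assumptions of equal weights and initial AoI values $\{1, \dots, K\}$. The sketch below uses hypothetical names.

```python
def simulate_bounds(K, T, w=1.0, a_max=100):
    """Simulate the best case (update the max-AoI node every slot) and the
    worst case (no updates at all), starting from AoI values {1, ..., K}."""
    # Best case: each slot, the max-AoI node's update resets its AoI to 1.
    aoi = list(range(1, K + 1))
    c_min = 0.0
    for _ in range(T):
        c_min += w * sum(aoi)
        worst = aoi.index(max(aoi))          # node with maximum current AoI
        aoi = [min(a + 1, a_max) for a in aoi]
        aoi[worst] = 1                       # that node's update is received
    # Worst case: every AoI grows each slot, capped at a_max.
    aoi = list(range(1, K + 1))
    c_max = 0.0
    for t in range(T):
        c_max += w * sum(min(a0 + t, a_max) for a0 in aoi)
    return c_min, c_max
```

For `K=3`, `T=10`, the best-case simulation matches $wTK(K+1)/2 = 60$, consistent with the per-slot sum staying at $K(K+1)/2$.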
Owing to the nature of the evolution of the system state parameters, represented by (2), (5), and (6), the problem can be modeled as a finite-horizon MDP with finite state and action spaces. However, due to the curse of dimensionality in the extremely large state space $\mathcal{S}$, it is computationally infeasible to obtain $\pi^*$ using the standard finite-horizon DP algorithm [23]. Motivated by this, we propose a deep RL algorithm for solving (8) in the next subsection. Deep RL is suitable for this problem since it can reduce the dimensionality of the large state space while learning the optimal policy at the same time [24].
III-B Deep Reinforcement Learning Algorithm
The proposed deep RL algorithm has two components: (i) an artificial neural network (ANN) that reduces the dimension of the state space by extracting its useful features, and (ii) an RL component that finds the best policy based on the ANN's extracted features, as shown in Fig. 2. To derive the policy that minimizes the total expected cost of the system, we use a $Q$-learning algorithm [23]. In this algorithm, we define a state-action value function $Q^{\pi}(s, a)$, which is the expected cost of the system starting at state $s$, performing action $a$, and following policy $\pi$ thereafter. In the $Q$-learning algorithm, we try to estimate the $Q$-function using a policy that minimizes the future cost. To this end, we use the so-called Bellman update rule as follows:
$$Q(s(t), a(t)) \leftarrow Q(s(t), a(t)) + \alpha \left[ c(t) + \gamma \min_{a'} Q(s(t+1), a') - Q(s(t), a(t)) \right], \qquad (13)$$
where $c(t) = \sum_{k=1}^{K} w_k A_k(t)$ represents the instantaneous cost at slot $t$, $\alpha$ is the learning rate, and $\gamma$ is a discount factor. The discount factor can be set to a value between 0 and 1 when the agent's task is continuing, i.e., never ends, in which case the current cost has a higher value than the unknown future cost. However, in our case, we have two terminal events: 1) the UAV reaches the final cell, and 2) the remaining time slots are just enough to reach the final cell, due to the limited available energy of the UAV. Therefore, our problem is episodic; thus, we set $\gamma = 1$ because all time slots have equal value for the UAV.
Since, using (13), the UAV always has an estimate of the $Q$-function, it can exploit the learning by taking the action that minimizes the cost. However, when learning starts, the UAV does not have confidence in the estimated value of the $Q$-function since it may not have visited some of the state-action pairs. Thus, the UAV has to explore the environment (all state-action pairs) to some degree. To this end, an $\epsilon$-greedy approach is used, where $\epsilon$ is the probability of exploring the environment at the current state [24], i.e., taking a random action with some probability. One can gradually reduce the value of $\epsilon$ as the learning goes on to ensure that the UAV chooses the optimal action rather than spending more time exploring the environment.
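A minimal tabular sketch of the update rule (13) together with $\epsilon$-greedy action selection for a cost-minimizing agent follows; all names and defaults are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, cost, s_next, actions, alpha=0.1, gamma=1.0):
    """Bellman update (13) for a cost-minimizing agent: bootstrap the
    target from the minimum estimated cost over next-state actions."""
    target = cost + gamma * min(Q[(s_next, an)] for an in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps; otherwise take the min-cost action."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(s, a)])
```

Note the `min` operators: since (7) is a cost rather than a reward, the greedy action minimizes the estimated $Q$-value instead of maximizing it.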
The iterative method in (13) can be applied efficiently when the number of states is small. However, in our problem, the state space is extremely large, which makes such an iterative approach impractical: it requires a large memory and has a slow convergence rate. Moreover, this approach cannot generalize to unobserved states, since the UAV must visit every state and take every action to update every state-action pair [23]. Thus, we employ ANNs, which are very effective at extracting features from data points and summarizing them in smaller dimensions. We use a deep $Q$-network approach [24] in which the learning steps are the same as in $Q$-learning; however, the $Q$-function is approximated using an ANN $Q(s, a; \theta)$, where $\theta$ is the vector containing the weights of the ANN. In particular, a fully connected (FC) layer, as in [24], is used to extract an abstraction of the state space. In the FC layer, every artificial node of a layer is connected to every artificial node of the next layer via the weight vector $\theta$. The goal is to find the optimal values of $\theta$ such that the ANN output will be as close as possible to the optimal $Q$-function. To this end, we define a loss function for any set of weights $\theta_i$, as follows:
$$L(\theta_i) = \mathbb{E}\!\left[\left(c(t) + \gamma \min_{a'} Q(s(t+1), a'; \theta_{i-1}) - Q(s(t), a(t); \theta_i)\right)^{2}\right], \qquad (14)$$
where the subscript $i$ denotes the episode at which the weights are updated. In addition, we use a replay memory that saves the state, action, and cost of past experiences, i.e., past state-action pairs and their resulting costs. Then, after every episode, we sample a batch of past experiences from the replay memory and find the gradient of the loss with respect to the weights using this batch as follows:
$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}\!\left[-2\left(c(t) + \gamma \min_{a'} Q(s(t+1), a'; \theta_{i-1}) - Q(s(t), a(t); \theta_i)\right) \nabla_{\theta_i} Q(s(t), a(t); \theta_i)\right]. \qquad (15)$$
Then, using this gradient, we train the weights of the ANN. Note that it has been shown in [24] that using the batch method and replay memory improves the convergence of deep RL. Algorithm 1 summarizes the steps of the proposed learning algorithm and Fig. 2 shows the architecture of the deep RL algorithm.
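A minimal sketch of the replay memory and the empirical minibatch version of the loss (14), with the target built from the previous episode's frozen weights, is shown below; the class and function names are illustrative assumptions, not the paper's implementation.

```python
import random

class ReplayMemory:
    """Fixed-capacity buffer of past experiences (s, a, cost, s_next, done)."""
    def __init__(self, capacity=10000):
        self.buf, self.cap = [], capacity

    def push(self, experience):
        self.buf.append(experience)
        if len(self.buf) > self.cap:
            self.buf.pop(0)                  # drop the oldest experience

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def batch_loss(q, q_prev, batch, actions, gamma=1.0):
    """Empirical version of (14) over a replay minibatch: mean squared error
    between q(s, a) and the bootstrapped target computed with the frozen
    previous-episode network q_prev (terminal transitions use the bare cost)."""
    total = 0.0
    for s, a, cost, s_next, done in batch:
        target = cost if done else cost + gamma * min(
            q_prev(s_next, an) for an in actions)
        total += (target - q(s, a)) ** 2
    return total / len(batch)
```

Freezing `q_prev` between episodes is what makes the target in (14) depend on $\theta_{i-1}$ rather than the weights currently being trained, which stabilizes learning.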
IV Simulation Results
For our simulations, we consider a square area between the coordinates $(0, 0)$ and $(1100, 1100)$ meters. We discretize this area into cells of dimensions 100 meters by 100 meters, where the index of every cell is the coordinate of the cell center divided by 100. For instance, the cell between $(0, 0)$ and $(100, 100)$ meters is called $(0.5, 0.5)$ since the center of this cell is $(50, 50)$ meters. Thus, we have 11 cells in both the x and y directions. In addition, we consider fixed values for the channel bandwidth $W$ (in MHz), the update packet size $S$ (in Mbits), the noise power $\sigma^2$ (in dBm), and the UAV height $H$ (in meters), and the amount of energy contained in each energy quantum, $E_u$, is 1 mJ. We also assume that the UAV's initial and final locations are at fixed cells $u_I$ and $u_F$, respectively. In addition, we consider the same maximum AoI value $A_{\max}$ for all observed processes, which have equal importance weights. We evaluate the impact of the battery size, the time constraint, and the location and spatial density of the ground nodes on the sum-AoI per process (we use "per process" since we consider equal weights).
In order to train the UAV, we use the ANN architecture in [24] with no convolutional layers and only one FC layer with 200 hidden nodes. We use the TensorFlow-Agents library [25] for designing the environment, policy, and costs. In addition, we use a single NVIDIA P100 GPU and 20 gigabytes of memory to train the UAV for every simulation scenario. For the following simulation scenarios, the reported numbers are derived by averaging the sum-AoI per process of the proposed deep RL policy over 1000 episodes.

IV-A Convergence Analysis
To analyze the convergence of the proposed deep RL algorithm, we illustrate our setup for simulation scenario 1 in Fig. 2(a). In this scenario, we have only one ground node, which is located at one of several candidate cells. Also, the time constraint $T$ is set to the number of time slots required to move directly from the initial cell to the final cell. In addition, the battery size is set to the number of energy quanta required to transmit a packet from a ground node at the furthest candidate cell to the UAV.
Fig. 4 shows the convergence of the average sum-AoI per process after 50,000 training episodes. We can see from Fig. 4 that the average sum-AoI per process is smaller for ground nodes which are closer to the straight line between the initial and the final cells. This is due to the fact that the UAV has to move in a straight line from the initial cell towards the final cell and thus cannot get close enough to update the status of faraway ground nodes, while the closer ground nodes can be updated several times.
IV-B Trajectory Optimization
To demonstrate the trajectory optimization and scheduling of the UAV, in Fig. 2(b), we consider simulation scenario 2, in which there are two ground nodes located off the direct path between the initial and final cells. We then choose the time constraint $T$ and the battery size such that the UAV can receive only one packet from each of the ground nodes when it gets as close as possible to them. For instance, when $T$ exceeds the direct flight time by 6 slots, the UAV can use the extra 6 slots to go three cells to the north, update the ground nodes, and then come back to the straight line. In this case, we choose the battery size equal to the energy required to transmit a packet to a UAV that is located two cells away. Fig. 2(b) shows that the proposed deep RL algorithm can find the optimal path and scheduling strategy. The cross marker is the UAV's location at which it receives a status update packet from the ground node whose index appears next to the cross marker.
IV-C Effects of System Parameters on the Minimum Sum-AoI
To compare the performance of our proposed deep RL algorithm with other policies, in Fig. 2(c), we set up three scenarios. In scenario 3, we consider three ground nodes at fixed locations, where the time constraint $T$ is fixed and the battery size $B$ varies over a range of values and is equal for all of the ground nodes. In scenario 4, the locations of the ground nodes are the same as in scenario 3, while $B$ is fixed for all of the ground nodes and $T$ varies. Scenario 5 studies the effect of the spatial density of the ground nodes on the outcome of the optimal policy. To this end, in scenario 5, $B$ and $T$ are fixed for all ground nodes, and the locations of the ground nodes vary from the most dense placement to the least dense placement. We compare the deep RL policy with two baseline policies: 1) a distance-based policy, which updates the status of the closest ground node if its distance is less than 2 cells and moves closer to the ground node with the maximum current AoI value, and 2) a random walk policy, which randomly chooses a ground node to update its status while moving randomly in all directions. The distance-based policy is heuristically a good policy since it requires less energy for status updates and tries to move closer to ground nodes with higher AoI to update their status. On the other hand, the random walk policy always explores all of the actions and thus may find some actions that are not obvious but result in a smaller average AoI.
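The distance-based baseline described above can be sketched as follows. The Chebyshev cell distance and the tie-breaking choices are assumptions for illustration, since the paper does not specify them.

```python
def distance_based_action(uav_cell, node_cells, aoi, update_range=2):
    """Distance-based baseline: schedule the closest node if it is within
    `update_range` cells (Chebyshev distance assumed), and move one cell
    toward the node with the maximum current AoI (x-axis first)."""
    def cheb(a, b):
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    # Schedule the closest node only if it is near enough to reach cheaply.
    closest = min(range(len(node_cells)),
                  key=lambda k: cheb(uav_cell, node_cells[k]))
    schedule = (closest
                if cheb(uav_cell, node_cells[closest]) <= update_range
                else None)

    # Move one cell toward the node whose process is currently stalest.
    target = node_cells[max(range(len(aoi)), key=lambda k: aoi[k])]
    if target[0] != uav_cell[0]:
        move = "E" if target[0] > uav_cell[0] else "W"
    elif target[1] != uav_cell[1]:
        move = "N" if target[1] > uav_cell[1] else "S"
    else:
        move = "H"
    return move, schedule
```

Unlike the learned policy, this heuristic ignores the battery levels and the remaining-time budget, which is consistent with the performance gap reported below.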
Fig. 4(a) shows the effect of the battery size $B$ on the sum-AoI per process in scenario 3. From Fig. 4(a), we can see that a higher $B$ results in a lower average sum-AoI per process since the ground nodes can be updated more frequently and from larger distances. In addition, we can see that our proposed deep RL policy outperforms the baseline policies since it takes into account the available energy quanta, the AoI, the time constraint, and the location of the UAV, while the other policies are only distance-based or completely random. Fig. 4(a) also demonstrates that the distance-based policy is more effective than the random walk policy for smaller $B$. However, for larger $B$, the random walk policy is more effective since it explores more state-action pairs and can update ground nodes from a farther distance. On the other hand, the performance of the distance-based policy stays constant beyond a certain battery size because the agent still has to satisfy the time constraint; thus, a further increase in $B$ is not effective.
Fig. 4(b) shows the results for scenario 4, in which the effect of the time constraint $T$ on the sum-AoI per process is studied. Two key points can be deduced from Fig. 4(b): 1) the proposed deep RL policy results in a smaller average sum-AoI compared to both the distance-based policy and the random walk policy, and 2) for time constraints smaller than 50, the random walk policy is more effective. However, for larger time constraints, the distance-based policy has enough time to get closer to the ground nodes to update their status and thus outperforms the random walk policy.
Fig. 4(c) shows the effect of the spatial density of the ground nodes on the sum-AoI per process in scenario 5. We can see from Fig. 4(c) that the proposed deep RL policy has a lower average sum-AoI per process compared to the baseline policies. Fig. 4(c) also shows that as the spatial density of the ground nodes reduces, i.e., the distance between the ground nodes increases, the average sum-AoI per process increases. This is because, for larger distances, the UAV does not have enough time to get closer to the ground nodes; thus, it has to receive update packets from farther distances and less frequently.
V Conclusion
In this paper, we have investigated the problem of minimizing the weighted sum-AoI for a UAV-assisted wireless network, in which a UAV collects status update packets from energy-constrained ground nodes. We have shown that the proposed age-optimal policy can jointly optimize the UAV's flight trajectory as well as the scheduling of status update packets from the ground nodes. We have then developed a deep RL algorithm to characterize the age-optimal policy while also overcoming the curse of dimensionality of the original MDP. We have shown that the deep RL algorithm significantly outperforms baseline policies, such as the distance-based and random walk policies, in terms of the achievable sum-AoI per process. Numerical results have demonstrated that the sum-AoI per process achieved by the proposed algorithm is monotonically increasing with the time constraint of the UAV, and monotonically decreasing with the spatial density of the ground nodes and their battery sizes.
References
[1] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah, "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Commun. Surveys & Tutorials, 2019.
[2] U. Challita, A. Ferdowsi, M. Chen, and W. Saad, "Machine learning for wireless connectivity and security of cellular-connected UAVs," IEEE Wireless Commun., vol. 26, no. 1, pp. 28–35, Feb. 2019.
[3] M. M. Azari, F. Rosas, K.-C. Chen, and S. Pollin, "Joint sum-rate and power gain analysis of an aerial base station," in Proc. of IEEE Global Commun. Workshops (GC Wkshps), Washington, DC, Dec. 2016.
[4] R. I. Bor-Yaliniz, A. El-Keyi, and H. Yanikomeroglu, "Efficient 3D placement of an aerial base station in next generation cellular networks," in Proc. of IEEE Intl. Conf. on Commun. (ICC), Kuala Lumpur, May 2016.
[5] V. V. Chetlur and H. S. Dhillon, "Downlink coverage analysis for a finite 3D wireless network of unmanned aerial vehicles," IEEE Trans. on Commun., vol. 65, no. 10, pp. 4543–4558, Oct. 2017.
[6] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Commun. Letters, vol. 6, no. 4, pp. 434–437, Aug. 2017.
[7] Y. Zeng, R. Zhang, and T. J. Lim, "Throughput maximization for UAV-enabled mobile relaying systems," IEEE Trans. on Commun., vol. 64, no. 12, pp. 4983–4996, Dec. 2016.
[8] P. Li and J. Xu, "Placement optimization for UAV-enabled wireless networks with multi-hop backhauls," Journal of Commun. and Information Networks, vol. 3, no. 4, pp. 64–73, Dec. 2018.
[9] L. Xie, J. Xu, and R. Zhang, "Throughput maximization for UAV-enabled wireless powered communication networks," IEEE Internet of Things Journal, 2018.
 [10] M. Monwar, O. Semiari, and W. Saad, “Optimized path planning for inspection by unmanned aerial vehicles swarm with energy constraints,” in Proc. of IEEE Global Commun. Conf. (GLOBECOM), Abu Dhabi, United Arab Emirates, Dec. 2018.
[11] M. A. Abd-Elmagid, N. Pappas, and H. S. Dhillon, "On the role of age-of-information in Internet of Things," 2018, available online: arxiv.org/abs/1812.08286.
[12] S. Kaul, R. Yates, and M. Gruteser, "Real-time status: How often should one update?" in Proc. of IEEE Conf. on Computer Commun., Orlando, FL, March 2012.
 [13] A. Kosta, N. Pappas, and V. Angelakis, “Age of information: A new concept, metric, and tool,” Foundations and Trends in Networking, vol. 12, no. 3, pp. 162–259, Nov. 2017.
[14] A. M. Bedewy, Y. Sun, and N. B. Shroff, "Optimizing data freshness, throughput, and delay in multi-server information-update systems," in Proc. of IEEE Intl. Symposium on Information Theory, Barcelona, July 2016.
[15] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, "Update or wait: How to keep your data fresh," IEEE Trans. on Info. Theory, vol. 63, no. 11, pp. 7492–7508, Nov. 2017.
 [16] E. T. Ceran, D. Gündüz, and A. György, “Average age of information with hybrid ARQ under a resource constraint,” IEEE Trans. on Wireless Commun., vol. 18, no. 3, pp. 1900–1913, March 2019.
 [17] I. Kadota, A. Sinha, and E. Modiano, “Optimizing age of information in wireless networks with throughput constraints,” in Proc. of IEEE Conf. on Computer Commun., Honolulu, HI, April 2018.
 [18] R. Talak, S. Karaman, and E. Modiano, “Optimizing age of information in wireless networks with perfect channel state information,” in Proc. of Intl. Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks, Shanghai, May 2018.
 [19] B. Zhou and W. Saad, “Joint status sampling and updating for minimizing age of information in the Internet of Things,” 2018, available online: arXiv.org/abs/1807.04356.
[20] M. K. Abdel-Aziz, C.-F. Liu, S. Samarakoon, M. Bennis, and W. Saad, "Ultra-reliable low-latency vehicular networks: Taming the age of information tail," in Proc. of IEEE Global Commun. Conf. (GLOBECOM), Abu Dhabi, United Arab Emirates, Dec. 2018.
[21] M. A. Abd-Elmagid and H. S. Dhillon, "Average peak age-of-information minimization in UAV-assisted IoT networks," IEEE Trans. on Veh. Technology, vol. 68, no. 2, pp. 2003–2008, Feb. 2019.
[22] J. Liu, X. Wang, B. Bai, and H. Dai, "Age-optimal trajectory planning for UAV-assisted data collection," in Proc. of IEEE Conf. on Computer Commun. Workshops (INFOCOM WKSHPS), Honolulu, HI, April 2018.
 [23] W. B. Powell, Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons, 2007, vol. 703.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[25] S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, C. Harris, V. Vanhoucke, and E. Brevdo, "TF-Agents: A library for reinforcement learning in TensorFlow," https://github.com/tensorflow/agents, 2018.