1 Introduction
Wireless sensing devices operating on limited batteries have been airlifted and deployed in remote, human-unfriendly environments. Conventional terrestrial communication networks and persistent power supplies are unavailable or unreliable in such harsh environments. Unmanned Aerial Vehicles (UAVs) have been proposed to harvest data from the sensing devices and to offload commands or software patches, thanks to UAVs' excellent mobility and maneuverability, flexible deployment, and low operational costs tomic2012toward ; luo2012communication ; waharte2010supporting . Figure 1 depicts a typical real-time application using the UAV in rescue operations, where a wireless sensing device on the ground, typically running on a limited battery, records critical information, such as locations, temperature, health conditions of rescuers, and oxygen supply. Each ground device generates data packets at an application-specific sampling rate, and puts them into a data queue for future transmission.
A UAV can be employed to hover over the area of interest, collecting and ferrying the sensory data of the ground devices. Since a ground device is constrained by its typically limited battery power, which limits the scalability and sustainability of the sensor network, Microwave Power Transfer (MPT) has been studied to enable energy harvesting in UAV-assisted data collection wang2018power ; zeng2016wireless . Particularly, we consider a power-splitting MPT technique, e.g., Simultaneous Wireless Information and Power Transfer (SWIPT), where the UAV sends response messages to the ground device while simultaneously charging its battery perera2017simultaneous ; yin2017uav ; yin2018uav . Moreover, the ground device, equipped with a data communication antenna and a wireless power receiver, collects the electrical energy conveyed by radio frequency signals while recovering the data from the signals. With SWIPT, every ground device only needs a single RF chain with a reduced hardware cost, since MPT and data transmission can work in the same radio frequency band.
In practical scenarios, energy harvesting and data transmission can be severely affected by the movements of the UAV and time-varying channels. Moreover, up-to-date knowledge of the battery levels and data queue lengths of the ground devices is not available at the UAV. Therefore, scheduling MPT and data collection online, jointly with the onboard control of the UAV (e.g., its patrolling velocity), to prevent battery drainage and data queue overflow is critical in UAV-assisted wireless powered sensor networks.
In this paper, the problem is formulated as a Markov Decision Process (MDP) whose states capture the battery level and data queue length of the ground devices, the channel conditions, and the waypoints along the trajectory of the UAV. Such an MDP can in principle be solved optimally by a reinforcement learning approach, e.g., Q-learning. However, Q-learning suffers from the well-known curse of dimensionality, which makes it impractical for resource allocation in UAV-assisted online MPT and data collection due to the large number of states and actions. Instead, we propose an onboard deep Q-network that can handle the enlarged state and action space of the MDP to minimize the data packet loss of the entire system. A new Deep Reinforcement Learning based Scheduling Algorithm (DRLSA) is developed, which derives the optimal solution online by taking the current network state and action as the input and delivering the corresponding action-value function. DRLSA learns an optimal resource allocation strategy asymptotically through online training of the onboard deep Q-network, where the selection of the ground device, the modulation scheme, and the instantaneous patrolling velocity of the UAV are jointly optimized based on the action-value function. DRLSA utilizes an ε-greedy policy to balance minimizing the network cost with respect to the knowledge already acquired against trying new actions to obtain knowledge not yet acquired. Moreover, DRLSA carries out experience replay lin1993reinforcement to significantly reduce the expansion of the state space, in which the algorithm's scheduling experiences at each time step are stored in a data set. DRLSA is implemented using the Keras deep learning library with Google TensorFlow as the backend engine. Numerical results demonstrate that the proposed DRLSA is able to reduce the packet loss by 69.2%, as compared to existing non-learning greedy algorithms.
In our earlier work li2018reinforcement ; li2018wireless , scheduling strategies were studied with a focus on reducing packet loss in small-scale static wireless powered sensor networks in response to the battery levels and data queue states of the ground devices. Due to the low-dimensional channel and device state spaces, the resource allocation problem could be solved by reinforcement learning or dynamic programming. However, in UAV-assisted MPT and data collection, the mobility of the UAV with its varying patrolling velocity causes rapidly changing wireless channels. As a result, both the state space and the action space are exceedingly large and grow dramatically with the size of the network. This prevents conventional resource allocation approaches, such as li2018reinforcement and li2018wireless , from scaling to the high-dimensional input spaces.
The rest of this paper is organized as follows. Section 2 presents related work on UAV-assisted data communication with power transfer. Section 3 describes the system model of UAV-assisted MPT and data collection, as well as the communication protocol design. In Section 4, a deep reinforcement learning algorithm is proposed to address the resource allocation problem for UAV-assisted online MPT and data collection. Numerical results and evaluation are presented in Section 5. Section 6 concludes the paper.
2 Related Work
In this section, we review the literature on data communication, trajectory planning, and power transfer in UAV networks.
2.1 UAV-assisted wireless communications
An energy-efficient UAV relaying scheme is studied in li2016energy to schedule the data transmission between the UAV and the sensor nodes while guaranteeing the packet success rate. A practical and computationally efficient algorithm is designed to extend the network lifetime, by decoupling the energy balancing and modulation adaptation of the UAVs and optimizing them in an alternating manner. The authors in koulali2016green implement passive scanning at the sensor nodes and periodic beaconing at the UAV to reduce network energy consumption. A non-cooperative game is constructed and equilibrium beaconing period durations are characterized for the UAVs. A learning algorithm is described to allow a UAV to discover the equilibrium beaconing strategy without observing the other UAVs' relaying schedules. Network outage probability is analyzed in li2010multi , where a number of UAVs are used to relay the source signals to a data sink in a decode-and-forward manner. The use of multiple antennas at the data sink is considered and their respective performance is also examined. The network outage problem can be decoupled into power allocation and trajectory planning subproblems zhang2018joint . An approximate solution that iteratively addresses the two subproblems is developed to approach the minimum outage probability. In each iteration, the trajectory is planned according to the power control results obtained in the last iteration, and then the power control subproblem is solved given the UAV trajectory. A resource allocation algorithm is presented in baek2018optimal for improving network throughput while guaranteeing seamless relaying to a data sink outside the network coverage via the UAV. By analyzing the outage probability, it is shown that non-orthogonal transmissions improve the performance of the UAV relaying network over orthogonal transmissions.
2.2 UAV trajectory planning
UAV trajectory planning is studied in fadlullah2016dynamic ; choi2014energy ; jiang2012optimization ; zhan2011wireless , where the UAV acts as a communication relay connecting the sensor nodes. Network throughput and communication delay can be improved by the motion control of the UAV, e.g., the heading, velocity, or radius of the flight trajectory. In wu2018joint ; zeng2016throughput , the transmit power of the UAV and the trajectory planning are jointly studied to increase network throughput over a finite time horizon. The power allocation exploits the predictable channel changes induced by the UAV's movement, while the trajectory planning balances the throughput between the source-UAV and UAV-sink links. The UAV can also be used as a buffer-aided relay in free-space optical systems, which stores data emanating from stationary sensor nodes for possible future delivery to other nodes fawaz2018uav . Since an optical link between two transceivers can easily be impaired by ambient atmospheric conditions, e.g., an intervening cloud, the UAV adjusts its altitude so as to eliminate the cloud attenuation effect.
The existing resource allocation approaches in UAV relaying networks improve performance based on power control and trajectory planning of the UAV. However, the joint scheduling of energy harvesting and data collection has yet to be considered.
2.3 UAV-assisted power transfer
Several studies have integrated MPT technologies into UAV networks. A UAV carrying an MPT transmitter is used to collect sensory data and extend the lifetime of sensor networks in harsh terrains pang2014efficient ; johnson2013charge . Taking the attenuation of the transferred power, the residual energy, and the buffered data at the sensor nodes into account, the UAV is scheduled to selectively charge the nodes and collect their data. In xu2018uav , UAV trajectory planning with velocity control is exploited to charge all the sensor nodes in a fair fashion. The problem formulation and solution imply that the hovering location and duration can be designed to enhance the MPT efficiency. Moreover, machine learning techniques can be utilized to predict the UAV's trajectory and improve the energy harvesting efficiency jeong2017design . In yin2018uav , SWIPT is introduced into the UAV relaying network, where the sensor node's energy limitation is alleviated by scavenging wireless energy from the radio signals transmitted by the UAV. The network throughput is improved by adjusting the UAV's transmit power and flight trajectory. Several MPT platforms have been developed for the UAV to charge the batteries of sensor nodes remotely from the electric grid he2017drone ; wang2016design ; chen2016mobile ; mittleider2016experimental ; griffin2012resonant . Lightweight hardware designs, control algorithms, and experiments are presented to verify the feasibility, reliability, and efficiency of UAV-assisted power transfer. However, the existing literature only focuses on improving the energy efficiency of MPT. The data loss caused by buffer overflows and poor channels is not considered.
3 System Model and Communication Protocol
In this section, we introduce the system model and the communication protocol of UAV-assisted online MPT and data collection. The notation used in the paper is summarized in Table 1.
3.1 System model
Notation  Definition 

number of wireless powered ground devices  
,  the maximum and minimum velocity of the UAV 
number of laps the UAV patrols  
transmit power of device  
power transferred to device  
transmit power of the UAV  
location of the UAV on its trajectory  
channel gain between device and the UAV  
queue length of device  
maximum queue length of the ground device  
modulation scheme of device  
the highest modulation order  
SNR between device and the UAV  
MPT efficiency factor  
battery level of device  
battery capacity of the ground device  
the highest battery level of the ground device  
number of bits of the data packet  
required BER of the channel between the ground device and the UAV  
contact time between device and the UAV  
action set of MDP  
discount factor for future states  
learning iteration in the deep Qnetwork  
learning weight in the deep Qnetwork 
The network that we consider consists of wireless powered ground devices in a remote area. The UAV, which acts as a data collection node, flies a predetermined circular trajectory for a number of laps, where the number of laps is limited by the lifetime of the UAV's battery. Let and denote the maximum and minimum patrolling velocity of the UAV, respectively. The patrolling velocity at time slot , i.e., , can be adjusted during the flight, where and . The location of the UAV on its trajectory at in lap is denoted by . The UAV is also responsible for remotely charging the ground devices using MPT. Specifically, receive beamforming is enabled at the UAV to enhance the received signal strength (RSS) and reduce the bit error rate (BER). Device ( ) harvests energy from the UAV to power its operations, e.g., sensing, computing, and communication. In addition, multi-user beamforming techniques, e.g., zero-forcing beamforming, conjugate beamforming, and singular value decomposition, can be applied to UAV-assisted MPT and data collection. However, they are not considered in this work, due to their requirement of real-time feedback on channel state information.
The complex coefficient of the reciprocal wireless channel between the UAV and device at in lap is , which can be known by channel reciprocity. The modulation scheme of device at in lap is denoted by . In particular, = 1, 2, and 3 indicates binary phase-shift keying (BPSK), quadrature phase-shift keying (QPSK), and 8-ary phase-shift keying (8PSK), respectively, while higher modulation indices provide quadrature amplitude modulation (QAM).
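The mapping from the modulation index to a scheme and its bits per symbol can be sketched as a small helper. The names for indices 1 to 3 follow the text above; the function name and the labelling of higher orders as 2^m-QAM are illustrative assumptions, not the authors' code.

```python
def modulation_scheme(m):
    """Map the modulation index m to (scheme name, bits per symbol).
    Indices 1-3 follow the text above; labelling higher orders as
    2^m-QAM is an assumption for illustration."""
    names = {1: "BPSK", 2: "QPSK", 3: "8PSK"}
    return names.get(m, f"{2 ** m}-QAM"), m
```

For instance, `modulation_scheme(2)` returns `("QPSK", 2)`, reflecting that QPSK carries two bits per symbol.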
Suppose that the BER requirement is . Under the generic Nakagami fading channel model, we have alouini2000adaptive
(1)  
(2) 
where is the Gamma function gradshteyn2014table , and is the average SNR. is the SNR between device and the UAV using , as given by
(3) 
where is the transmit power of device , and is the noise power at the UAV. In particular, for illustration convenience, we consider a special case of the Nakagami model in this paper where li2015epla ; wang2017pele ; li2016reliable . Note that the proposed deep reinforcement learning approach is generic, and can work with other Nakagami fading channel models with any parameter values. The required transmit power of the ground device depends on and , and is given by li2016energy
(4) 
where and are channel related constants.
According to li2015poster , the MPT efficiency is jointly determined by the distance between the MPT transmitter and the receiver, and by their antenna alignment. Therefore, the power transferred to device at in lap via MPT is given by
(5) 
where stands for the norm, and is the MPT efficiency factor given the distance and the MPT transceiver alignment between the ground device and the UAV. The transmit power of the UAV is fixed and set to , so as to keep the operations at the UAV simple.
Each of the wirelessly powered ground devices harvests energy from the UAV. The rechargeable battery is finite with a capacity of Joules, and the battery overflows if overcharged. Moreover, the battery readings are continuous variables whose variation is difficult to trace in real time. Therefore, to improve the mathematical tractability of the problem and for illustration convenience, the continuous battery is discretized into levels, as liu2014selection . In other words, the battery level of the ground device is rounded down to the closest discrete level. We also assume that all the ground devices have the same battery size, data queue size, packet length, and BER requirement. However, our proposed resource allocation approach can be extended to a heterogeneous network setting, where the complexity of the resource allocation problem may grow as the result of an increased number of MDP states.
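The lower-rounding discretization described above can be sketched as follows; the function and argument names are placeholders for the paper's (lost) notation.

```python
def battery_level(reading_joules, capacity_joules, num_levels):
    """Round a continuous battery reading (in Joules) down to one of
    num_levels discrete levels (0 .. num_levels - 1), as described
    in Section 3.1. Readings are clipped to [0, capacity] first,
    since the battery overflows if overcharged."""
    clipped = min(max(reading_joules, 0.0), capacity_joules)
    return min(int(clipped / capacity_joules * num_levels), num_levels - 1)
```

With a 10 J battery and 50 levels, a 5 J reading maps to level 25, and a full battery maps to the highest level, 49.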
3.2 Communication protocol
Figure 2 illustrates the communication protocol for UAV-assisted online MPT and data collection. Specifically, each communication frame, which contains a number of time slots, is allocated to a ground device for MPT and data transmission. The ground device selection is determined by the UAV using DRLSA, which is described in Section 4. The UAV then broadcasts a short beacon message to the selected ground device for data transmission. Upon receiving the ground device's data, the UAV is aware of the information needed for MPT, i.e., , , and . Moreover, a control segment of the device's data packet contains and . The overhead of this control segment is small. For example, given of 100 and of 100 packets, the overhead is only 12 bits, much smaller than the size of the data packet. Therefore, we assume that the transmission time and the energy consumption of the control segment are negligible.
The UAV processes the received data packets online, and responds to the ground device’s requests, e.g., providing network access services, or online information query. SWIPT is utilized to transmit the response to the ground device and charge its battery via MPT simultaneously. Meanwhile, DRLSA is conducted by the UAV to schedule the other ground device for MPT and data transmission in the next communication frame.
4 Deep Reinforcement Learning for UAV-assisted Online MPT and Data Collection
In this section, we first present the problem formulation based on an MDP, and provide a reinforcement learning solution to the problem. Due to the curse of dimensionality of reinforcement learning, we propose a new deep reinforcement learning based scheduling algorithm to minimize the overall data packet loss of the ground devices, by optimally deciding the device to be charged and interrogated for data collection, and the instantaneous patrolling velocity of the UAV. The modulation scheme of the ground device is also optimized to maximize the harvested energy. Finally, the optimality of the deep reinforcement learning approach is analyzed.
4.1 MDP formulation
The UAV takes the actions of the ground device selection, the modulation scheme, and the instantaneous patrolling velocity of the UAV. Each action depends on the current network state, i.e., the battery level and queue length of every device , the channel quality , and the location of the UAV along the flight trajectory. The actions also account for their potential influence on the future evolution of the network. Particularly, the current action that the UAV takes can affect the future battery level and queue length of every device and, in turn, influence the future actions to be taken. Such actions form a discrete-time stochastic control process which is partly random (due to the random and independent arrival/queueing process of sensory data at every device) and partly under the control of the decision-making UAV.
The actions can be optimized in the sense that optimality with regard to a specific metric, e.g., the packet loss from queue overflows and unsuccessful data transmissions, is achieved in the long term over the entire stochastic control process (rather than myopically at an individual time slot). Motivated by this, we consider an MDP formulation in which the actions are chosen in each state to minimize a long-term objective. An MDP is defined by the quadruplet , , , , where is the set of possible states; is the set of actions; is the immediate cost yielded when action is taken at state and the following state changes to ; and denotes the transition probability from state to state when action is taken.
The resource allocation problem of interest in UAV-assisted online MPT and data collection can be formulated as a discrete-time MDP, where each state collects the battery levels and queue lengths of the ground devices, the channel quality between the UAV and device , and the location of the UAV, i.e., . The size of the state space, i.e., the number of such states, is , where is the number of channel states, and is the number of waypoints on the UAV's trajectory. The action to be taken is to schedule one device to transmit data to the UAV at time slot , while specifying the modulation of the device and the instantaneous patrolling velocity of the UAV, i.e., . The size of the action set is , where stands for the cardinality of the set .
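Because the exact symbols in the size expressions were lost in extraction, the combinatorial growth can only be sketched under one assumed factorization: each device contributes its battery-level and queue-length combinations, with the channel state and UAV waypoint as global factors, and an action picks one device, one modulation order, and one velocity setting. All names and the factorization itself are illustrative assumptions.

```python
def state_space_size(num_devices, battery_levels, queue_states,
                     channel_states, waypoints):
    # Assumed structure: per-device (battery x queue) combinations,
    # multiplied by global channel-state and waypoint factors.
    return ((battery_levels * queue_states) ** num_devices
            * channel_states * waypoints)

def action_space_size(num_devices, modulation_orders, velocity_settings):
    # One action = (scheduled device, its modulation, UAV velocity).
    return num_devices * modulation_orders * velocity_settings
```

Even a toy configuration explodes: with 10 devices, 50 battery levels, 20 queue states, 2 channel states, and 100 waypoints, the state count is on the order of 10^32, which is why tabular methods cannot scale here.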
To illustrate the proposed MDP model, Figure 3 presents an example of the transition diagram with 24 MDP states in one lap of the UAV's flight, where , , , , (e.g., dB, dB), and . The vertices stand for all possible states in the MDP, i.e., . The edges show the transitions from each state to other states according to . The state transition depends on the change of of the ground device and along the trajectory of the UAV. In other words, the next state of can be one of the states at , , or . For example, for , the next state of can be , if device is selected but the data collection is not successful; or , if the data collection is successful. Note that Figure 3 gives a small-scale example of the transitions of one of the states, i.e., . The UAV's trajectory can have hundreds of waypoints and the model can contain over MDP states, as configured in Section 5, which leads to an extremely complex state transition diagram.
The optimal policy in an MDP can be determined by classical approaches, e.g., value iteration (which computes and improves the action-value function estimate iteratively) or policy iteration (which redefines the policy at each step and computes the value according to the new policy). However, these two methods require that the transition probabilities and the costs of all states are accurately known. In contrast, this paper is interested in a practical scenario where the UAV has no a priori knowledge of and .
4.2 Q-learning
Q-learning, one of the reinforcement learning techniques, can obtain the optimal resource allocation when the transition and/or cost functions are unknown, while minimizing the long-term expected accumulated discounted cost (i.e., the expected packet loss of the ground devices) li2018reinforcement . We define the objective function of the resource allocation problem as , which gives
(6) 
where is a discount factor for future states, is the cost from state to when action is carried out, and denotes the expectation with respect to policy and state . Furthermore, the action-value function defines the expected cost after observing state and taking action mnih2015human . By performing the optimal action , the optimal action-value function can be expressed as a combination of the expected cost and the minimum value of , where is the next state of , and is the next action of . Thus, we have
(7) 
where (a small positive fraction) indicates the learning rate. As observed in (7), the convergence rate of to depends on the discount factor ; namely, the convergence rate of the action-value function increases with the discount factor .
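The update in (7) is the standard tabular Q-learning rule specialized to cost minimization; a minimal sketch, with generic symbol names standing in for the paper's lost notation:

```python
def q_update(Q, s, a, cost, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step for a cost-minimisation MDP:
    Q(s,a) <- Q(s,a) + alpha * (cost + gamma * min_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action); unseen entries default to 0."""
    q_sa = Q.get((s, a), 0.0)
    best_next = min(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (cost + gamma * best_next - q_sa)
    return Q[(s, a)]
```

Because actions aim to minimize packet loss (a cost), the bootstrap term takes the minimum over next actions, where reward-maximizing formulations would take the maximum.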
However, Q-learning suffers from the well-known curse of dimensionality. Thus, Q-learning is impractical for the resource allocation problem in UAV-assisted online MPT and data collection, due to the large state and action space, in which the time needed for Q-learning to converge grows prohibitively. In addition, Q-learning is unstable when combined with non-linear approximation functions such as neural networks tsitsiklis1997analysis .
4.3 Deep reinforcement learning based approach
To circumvent the curse of dimensionality of reinforcement learning, we propose an onboard deep Q-network for the UAV to optimize the resource allocation online by approximating the optimal action-value function.
As shown in Figure 4, the action-value function in (7) is represented by the proposed onboard deep Q-network, which takes the current network state and action as the input and derives the corresponding action-value function. Given the total number of iterations , is approximated by adapting a set of weights , where . By optimizing , the approximation error of the deep Q-network's outputs, i.e., the approximated , can be minimized.
In our onboard deep Q-network, experience replay is conducted to randomize over the states and the actions of the MDP at each time step , i.e., , thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. Specifically, is stored in a data set , pooled over many episodes (where an episode ends when a terminal state is reached) into an experience replay memory. Moreover, samples (or minibatches) of the experience in the deep Q-network are accordingly updated during learning. The deep Q-network is trained by adjusting the weights at iteration so as to minimize a sequence of loss functions ; see (8),
(8) 
where and are the weights at iterations and , respectively. At each iteration of minimizing , the weights from iteration are fixed; thus, the subproblem of learning at iteration defines . For each sample (or minibatch) and the current weights , gradient descent is used to derive the weights , which iteratively computes the gradient and updates the neural network's weights toward the global minimum. We differentiate with respect to , and obtain (9).
(9) 
Algorithm 1 presents the proposed DRLSA scheme, which optimizes the actions based on the deep Q-network to solve the online resource allocation problem. Specifically, is a global parameter counting the total number of updates at the UAV, rather than the updates from the local learner. The ε-greedy policy is utilized to balance minimizing the action-value function based on the knowledge already acquired against trying new actions to obtain knowledge not yet acquired. In particular, the optimal action of determining for maximizing the harvested energy (see the Appendix) is carried out once the optimal ground device is selected from the deep Q-network.
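The exploration-exploitation trade-off described above can be sketched as a standard ε-greedy selector over estimated action costs; the function name and the dict-based interface are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore a uniformly random action;
    otherwise exploit the action with the lowest estimated cost.
    q_values maps action -> approximated action-value (a cost here)."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))   # explore
    return min(q_values, key=q_values.get)    # exploit
```

In practice, ε is often decayed over episodes so that the policy explores heavily early in training and exploits the learned action-value function later.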
Note that the optimal solution to a small-scale MDP problem of interest can be obtained by Q-learning, in which the series of optimal actions is obtained in a state machine, given the optimal decisions at each of the states. However, Q-learning is impractical for the complex resource allocation problem with its large state and action space, due to the well-known curse of dimensionality. As observed in Algorithm 1, our onboard deep Q-network maintains two separate Q-networks, and , with the current weights and the old weights , respectively van2012reinforcement . can be updated many times per time step, and is copied into after every iterations. At every update iteration, the deep Q-network is trained to minimize the mean-squared Bellman error, by minimizing the loss function . Therefore, the proposed DRLSA can achieve optimality asymptotically, with the growing size of the deep Q-network.
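The two-network bookkeeping described above can be sketched as follows: the online network's weights are copied into the frozen target network every fixed number of updates, and only the target network forms the regression targets for the loss in (8). The names and the dict-based weight representation are placeholders, not the authors' implementation.

```python
def sync_target(online_weights, target_weights, step, sync_every):
    """Copy the online network's weights into the target network every
    sync_every updates; otherwise leave the target network frozen."""
    if step % sync_every == 0:
        target_weights.update(online_weights)
    return target_weights

def td_targets(minibatch, target_min_q, gamma=0.99):
    """Regression targets cost + gamma * min_a' Q_target(s', a') for a
    minibatch of (state, action, cost, next_state) transitions.
    target_min_q(s) returns the target network's minimum Q over actions."""
    return [cost + gamma * target_min_q(s_next)
            for (_, _, cost, s_next) in minibatch]
```

Freezing the target between syncs is what stabilizes the otherwise divergence-prone combination of Q-learning and a non-linear function approximator.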
5 Numerical Results and Discussions
In this section, we first present the network configurations and performance metrics. Then, we evaluate the network cost of the proposed DRLSA scheme with respect to the network size, data queue length, and learning discount factors. Here, the network cost is defined as the amount of packet loss due to data queue overflow and failed transmissions from the ground devices to the UAV. We also show the UAV's velocity in terms of the network size, as well as the data queue length.
5.1 Implementation of DRLSA
The simulation platform is an INSYS server running 64-bit Ubuntu 16.04 LTS, with a 4-core Intel i7-6700K 4 GHz CPU and 16 GB of memory. DRLSA is implemented in Python 3.5 using Google TensorFlow abadi2016tensorflow (a symbolic math library for numerical computation) with Keras chollet2015keras (the Python deep learning library). We develop the onboard deep Q-network in DRLSA according to the following steps:

Initialize the configurations. Each device generates 100 data packets, where the packet length is 128 bytes. The transmit power of the UAV is 100 milliwatts. is set to 0.05%; however, the value of can be configured depending on the traffic type and quality-of-service (QoS) requirement of the user's data, as well as the transmission capability of the UAV. Other simulation parameters are listed in Table 2.
Table 2: TensorFlow configurations
Parameter                   Value
Number of ground devices    50–200
Queue length                20–60
Energy levels               50
Number of UAV's waypoints   100
Discount factor             0.99
Learning rate               0.0001
Replay memory size          5000
Batch size                  32
Number of steps             1000
Number of episodes          500
Set up the architecture of the deep Q-network. Three fully connected hidden layers are created using tensorflow.layers.dense(inputs, dimensionality of the output space, activation function). Then, tensorflow.train.AdamOptimizer().minimize(loss function) is called to minimize the loss function. In particular, the optimizer is imported from the Keras library.
Build the memory for the experience replay. For online training of the deep Q-network, the memory stores the learning outcomes, a.k.a. the experience at every step, as the quadruplet (state, action, cost, next_state). The deep Q-network updates the memory by calling memory.add_sample((state, action, cost, next_state)), and retrieves the experiences by calling memory.sample(batch size).

Create a deep Q-network agent, and kick off the learning. The agent is configured and compiled to take actions, and to observe the cost and the new state. A TensorFlow session is created via tensorflow.Session() to execute the learning and evaluate the learning progress.
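A minimal stand-in for the replay memory interface named in the steps above (add_sample / sample) might look like the following; it is a sketch using only the standard library, not the authors' implementation, with the bounded size chosen to match the 5000-entry replay memory in Table 2.

```python
import random
from collections import deque

class Memory:
    """Experience-replay buffer storing (state, action, cost, next_state)
    quadruplets, with the add_sample / sample interface described above."""
    def __init__(self, max_size=5000):
        self._samples = deque(maxlen=max_size)  # oldest entries evicted first

    def add_sample(self, sample):
        self._samples.append(sample)

    def sample(self, batch_size):
        # Uniformly sample up to batch_size stored transitions.
        k = min(batch_size, len(self._samples))
        return random.sample(list(self._samples), k)

    def __len__(self):
        return len(self._samples)
```

The bounded deque keeps the memory footprint fixed on the resource-constrained UAV, while uniform sampling is what breaks the temporal correlations in the observation sequence.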
For performance comparison, the proposed DRLSA scheme is compared with two other onboard online scheduling policies:

Random scheduling policy (RSA). The UAV randomly schedules one ground device to collect data and transfer power at each time slot. The resource allocation is independent of battery and data queue of the ground device, channel variation, or UAV’s trajectory.

Longest queue scheduling policy (LQSA). This is a greedy algorithm, where the scheduling is based on the data queue length of the ground devices. The device with the longest queue is given the highest priority to transmit data and harvest power. For LQSA, it is assumed that up-to-date information on the data queue length at each device is known by the UAV.
Both RSA and LQSA are implemented in Matlab; their scheduling policies are not obtained by the deep Q-network. Note that the existing learning-based approaches, e.g., QSA li2018reinforcement or SARSA hoang2017optimal , can only solve small-scale scheduling problems, and do not apply to the resource allocation in UAV-assisted online MPT and data collection. In addition, the resource allocation optimization schemes, e.g., EHMDP li2018wireless or EACH zhang2013distributed , require a priori knowledge about the network and have to be conducted offline.
5.2 Performance evaluation
5.2.1 Network size
We assess the performance of DRLSA when the number of ground devices grows from 50 to 200. Figure 5 shows the network cost at each episode, given and . The network cost of DRLSA is high at the beginning of the learning process. With an increasing number of episodes, the network cost drops significantly until it reaches a relatively stable value. This confirms that deep Q-learning gradually converges after a number of episodes. Moreover, the average network cost is smaller when the number of ground devices is 120, as compared to the cases of 150 and 180 ground devices, since an increasing number of ground devices causes the data queues to overflow more frequently.
Figure 6: Comparison of the network cost and packet loss rate of DRLSA and the typical scheduling strategies, where the error bars show the standard deviation over 20 runs. Panels (a)-(c) show the network cost, packet loss rate, and patrolling velocity with respect to the number of ground devices; panels (d)-(f) show the same metrics with respect to the data queue length.
Figure 6(a) studies the network cost with an increasing number of ground devices, where the data queue length of DRLSA is set to 50 or 10. In general, the proposed DRLSA is able to reduce the network cost to a greater extent than RSA and LQSA. Particularly, when = 200 and = 10, the maximum network cost of DRLSA is lower than RSA and LQSA by around 82.8% and 69.2%, respectively. The performance gains keep growing with . The reason is that DRLSA learns the ground devices’ energy consumption and data queue states, so that the scheduling of MPT and data communications can minimize the data packet loss of the entire network.
Figure 6(b) illustrates the packet loss rate, which is the ratio of the network cost to the total number of data packets of all the ground devices. Specifically, DRLSA has a packet loss rate similar to that of LQSA when there are 50 devices in the network. However, from = 80 to 200, DRLSA outperforms the two non-learning-based algorithms. In particular, when = 200, DRLSA achieves 53% and 25% lower packet loss rates than RSA and LQSA, respectively. In Figure 6(b), we also see that the packet loss rate of DRLSA grows only slightly, by about 2%, from = 50 to 200. In other words, adding more devices to the network does not result in a critical data packet loss. This is because DRLSA takes actions to adapt the instantaneous velocity of the UAV, ensuring the connection time and channel quality between the device and the UAV and hence minimizing the data packet loss. As shown in Figure 6(c), the patrolling velocity rises with an increasing number of ground devices. This is reasonable because the UAV needs to transfer power to and collect data from more devices in order to reduce their data queue overflow. Furthermore, it is also observed that increasing the data queue length of the devices lowers the patrolling velocity of the UAV. The reason is that a larger data queue can hold more packets, which allows an extended data transmission time between the ground device and the UAV and, in turn, reduces the required patrolling velocity.
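The metric above is a simple ratio; a one-line helper (the function name is illustrative, not from the paper's code) makes the definition concrete:

```python
def packet_loss_rate(network_cost, total_packets):
    """Packet loss rate = lost packets (the network cost) divided by the
    total number of data packets generated by all ground devices."""
    if total_packets == 0:
        return 0.0
    return network_cost / total_packets

# e.g., 106 lost packets out of 2000 generated gives a 5.3% loss rate
rate = packet_loss_rate(106, 2000)
```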
5.2.2 Data queue length of ground devices
We consider different data queue lengths of the ground devices, where the number of ground devices is set to 140 or 300. Figures 6(d) and (e) depict the network cost and packet loss rate with respect to the maximum data queue length of the ground devices, respectively. We observe that DRLSA achieves lower network costs and packet loss rates than RSA and LQSA; in particular, DRLSA outperforms RSA with substantial gains of 82.7% and 69% when = 20 and = 140. Moreover, from = 20 to = 60, the network cost and packet loss rate of DRLSA drop by 62.2% and 13.2%, respectively. This confirms that DRLSA significantly reduces data queue overflow for all the ground devices when their data queue length is enlarged. In terms of the patrolling velocity of the UAV, it is observed in Figure 6(f) that DRLSA reduces the velocity from 18.5 m/s to 14.1 m/s, and from 8.1 m/s to 6.2 m/s, given 300 and 140 ground devices in the network, respectively.
Figure 6 implies a tradeoff between the data packet loss and the battery lifetime of the UAV. Specifically, an increase of the network size, or a decrease of the data queue length, increases the packet loss due to data queue overflow. In this case, the patrolling velocity of the UAV can be increased so that more ground devices harvest energy and transmit data, which reduces the packet loss of all the devices. However, accelerating the patrolling velocity drains the UAV's battery faster, and reduces the lifetime of the network. Therefore, the network size and the data queue length need to be balanced, so as to maintain a sustainable UAV-assisted network while reducing the data packet loss.
5.2.3 The actions of the UAV
To further reveal the impact of the network size and data queue length on the actions of the UAV, Figure 7 shows the patrolling velocity allocated by the proposed DRLSA with respect to the episodes, under two configurations. We can see that DRLSA in the first configuration allocates, on average, an 8.7 m/s higher patrolling velocity to the UAV than in the second. This confirms that the patrolling velocity drops with a decrease of the network size or a growth of the data queue length, owing to the reduced data queue overflow, as explained for Figure 6. Moreover, Figure 7 also shows that DRLSA allocates the patrolling velocity adaptively to the time-varying network state. As observed, the patrolling velocity allocation converges with an increase in the learning time (i.e., episodes).
5.2.4 Discount factor of learning
Since DRLSA utilizes deep Q-learning to approximate the Q function with asymptotic convergence, the convergence time can be affected by the discount factor in the learning process. Figure 8 plots the network cost of DRLSA with respect to the episodes for different discount factors, where the network has 300 ground devices and a data queue length of 20. The other settings are provided in Section 5.1.
As observed, DRLSA has a high network cost at the beginning of the learning. The performance improves as the episodes increase, thanks to deep reinforcement learning. In particular, the network cost of DRLSA with = 0.99 quickly falls to 0 from episode 1 to episode 359, and the performance remains relatively stable afterward. However, the convergence of DRLSA with = 0.5 or 0.1 requires more than 475 or 500 episodes, respectively. Therefore, the result in Figure 8 indicates that the convergence rate of DRLSA grows with the discount factor. In other words, a high discount factor accelerates the learning process of the deep Q-network, as confirmed by (7).
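One way to build intuition for the discount factor's role (a sketch of our own, not an analysis from the paper) is the effective planning horizon it induces: rewards beyond roughly log(tol)/log(γ) steps are discounted below a tolerance tol, so a high discount factor lets overflow penalties far in the future shape the Q-value estimates:

```python
import math

def effective_horizon(gamma, tol=0.01):
    """Number of steps t after which gamma**t drops below tol, i.e. the
    horizon over which future costs still influence the Q-value estimate."""
    return math.ceil(math.log(tol) / math.log(gamma))

# A larger discount factor makes the agent account for costs much further
# ahead, e.g. queue overflows occurring many time slots in the future.
horizons = {g: effective_horizon(g) for g in (0.1, 0.5, 0.99)}
```

With tol = 0.01, the horizon grows from a couple of steps at γ = 0.1 to several hundred steps at γ = 0.99, consistent with the long-term scheduling consequences that DRLSA must capture.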
6 Conclusion
In this paper, we focus on online MPT and data collection with onboard control of the patrolling velocity of the UAV, for preventing battery drainage and data queue overflow of the sensing devices. The problem is formulated as an MDP, whose states comprise the battery levels and data queue lengths of the ground devices, the channel conditions, and the waypoints along the trajectory of the UAV. We propose an onboard deep Q-network that can accommodate the large state and action spaces of the MDP to minimize the data packet loss of the entire system. Based on deep reinforcement learning, DRLSA is developed to learn the optimal resource allocation strategy asymptotically through online training at the onboard deep Q-network, where the selection of the ground device, the modulation scheme, and the instantaneous patrolling velocity of the UAV are jointly optimized. Moreover, DRLSA carries out experience replay, in which the algorithm's scheduling experiences at each time step are stored in a data set, to alleviate the impact of the expanded state space.
The proposed DRLSA scheme is implemented using the Keras deep learning library with Google TensorFlow as the backend engine. Numerical results demonstrate that DRLSA reduces the packet loss by at least 69.2%, as compared to the existing non-learning greedy algorithms.
Acknowledgements
This work was partially supported by National Funds through FCT/MCTES (Portuguese Foundation for Science and Technology), within the CISTER Research Unit (CEC/04234); also by the Operational Competitiveness Programme and Internationalization (COMPETE 2020) through the European Regional Development Fund (ERDF) and by national funds through the FCT, within project POCI010145FEDER029074 (ARNET).
Appendix A [Optimizing ]
To minimize the packet loss stemming from insufficient energy, the modulation of device is to be chosen to maximize the energy harvested during a contact time with the UAV. The optimal modulation of the ground device, , is independent of the battery level and the queue length. This is because is selected to maximize the increase of the battery level at device , under the bit error rate requirement for the transmitted packet. As a result, can be decoupled from , and optimized a priori by
(10) 
the righthand side (RHS) of which, by substituting (5) and (4), can be rewritten as
(11) 
where is the bandwidth of the uplink data transmission, is the duration of an uplink symbol, and is the duration of the uplink data transmission. is the remainder of the time slot used for downlink WPT, and is the contact time between the ground device and the UAV in the time slot, which is affected by the patrolling velocity of the UAV. Thus, we have
(12) 
where is the altitude of the UAV at lap . We assume that the UAV maintains the same altitude and the same heading in each lap.
By using the firstorder necessary condition of the optimal solution, we have
(13) 
(14) 
The values are then given as follows:
(15) 
Since the left-hand side (LHS) of (15) monotonically increases with , the optimal value can be obtained by applying the bisection search method and evaluating the two closest integers about the fixed point of the bisection method estep2002bisection . Specifically, and are initialized. Each iteration of the bisection method consists of four steps applied over the range of , as follows.

1. Calculate the midpoint of the modulation interval, which gives .
2. Substitute into (15) to obtain the function value .
3. If convergence is attained (that is, the modulation interval or cannot be further reduced), return and stop the iteration.
4. Replace either or with .
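The four steps above can be sketched as follows, where f stands in for the monotonically increasing LHS-minus-RHS of (15) (a hypothetical stand-in, since the closed-form expression is given in the appendix) and the returned pair gives the two closest integers about the fixed point:

```python
import math

def bisect_modulation(f, lo, hi, tol=1e-6):
    """Bisection search on [lo, hi] for a monotonically increasing f,
    returning the two integers closest to the root as the candidate
    modulation orders to be evaluated against (15)."""
    assert f(lo) <= 0.0 <= f(hi), "the root must be bracketed by [lo, hi]"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)          # Step 1: midpoint of the interval
        if f(mid) <= 0.0:              # Step 2: evaluate the function value
            lo = mid                   # Step 4: replace the lower endpoint
        else:
            hi = mid                   # Step 4: replace the upper endpoint
    root = 0.5 * (lo + hi)             # Step 3: interval can shrink no further
    return math.floor(root), math.ceil(root)

# e.g., a function crossing zero at 3.4 yields the candidates 3 and 4
low, high = bisect_modulation(lambda x: x - 3.4, 1.0, 10.0)
```

Each iteration halves the interval, so the loop terminates after O(log((hi - lo)/tol)) evaluations, and the two returned integers are then checked directly against (15).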
References
 (1) T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. L. Grixa, F. Ruess, M. Suppa, D. Burschka, Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue, IEEE Robotics & Automation Magazine 19 (3) (2012) 46–56.
 (2) C. Luo, P. Ward, S. Cameron, G. Parr, S. McClean, Communication provision for a team of remotely searching UAVs: A mobile relay approach, in: Globecom Workshops (GC Wkshps), IEEE, 2012, pp. 1544–1549.
 (3) S. Waharte, N. Trigoni, Supporting search and rescue operations with UAVs, in: International Conference on Emerging Security Technologies (EST), IEEE, 2010, pp. 142–147.
 (4) H. Wang, G. Ding, F. Gao, J. Chen, J. Wang, L. Wang, Power control in UAV-supported ultra dense networks: Communications, caching, and energy transfer, IEEE Communications Magazine 56 (6) (2018) 28–34.
 (5) Y. Zeng, R. Zhang, T. J. Lim, Wireless communications with unmanned aerial vehicles: opportunities and challenges, IEEE Communications Magazine 54 (5) (2016) 36–42.
 (6) T. D. P. Perera, D. N. K. Jayakody, S. K. Sharma, S. Chatzinotas, J. Li, Simultaneous wireless information and power transfer (SWIPT): Recent advances and future challenges, IEEE Communications Surveys & Tutorials 20 (1) (2017) 264–302.
 (7) S. Yin, J. Tan, L. Li, UAV-assisted cooperative communications with wireless information and power transfer, arXiv preprint arXiv:1710.00174.
 (8) S. Yin, Y. Zhao, L. Li, UAV-assisted cooperative communications with time-sharing SWIPT, in: 2018 IEEE International Conference on Communications (ICC), IEEE, 2018, pp. 1–6.
 (9) L.-J. Lin, Reinforcement learning for robots using neural networks, Tech. rep., Carnegie Mellon University, Pittsburgh, PA, School of Computer Science (1993).
 (10) K. Li, W. Ni, M. Abolhasan, E. Tovar, Reinforcement learning for scheduling wireless powered sensor communications, IEEE Transactions on Green Communications and Networking.
 (11) K. Li, W. Ni, L. Duan, M. Abolhasan, J. Niu, Wireless power transfer and data collection in wireless sensor networks, IEEE Transactions on Vehicular Technology 67 (3) (2018) 2686–2697.
 (12) K. Li, W. Ni, X. Wang, R. P. Liu, S. S. Kanhere, S. Jha, Energy-efficient cooperative relaying for unmanned aerial vehicles, IEEE Transactions on Mobile Computing (6) (2016) 1377–1386.
 (13) S. Koulali, E. Sabir, T. Taleb, M. Azizi, A green strategic activity scheduling for UAV networks: A submodular game perspective, IEEE Communications Magazine 54 (5) (2016) 58–64.
 (14) X. Li, Y. D. Zhang, Multi-source cooperative communications using multiple small relay UAVs, in: IEEE GLOBECOM Workshops, 2010, pp. 1805–1810.
 (15) S. Zhang, H. Zhang, Q. He, K. Bian, L. Song, Joint trajectory and power optimization for UAV relay networks, IEEE Communications Letters 22 (1) (2018) 161–164.
 (16) J. Baek, S. I. Han, Y. Han, Optimal resource allocation for non-orthogonal transmission in UAV relay systems, IEEE Wireless Communications Letters 7 (3) (2018) 356–359.
 (17) Z. M. Fadlullah, D. Takaishi, H. Nishiyama, N. Kato, R. Miura, A dynamic trajectory control algorithm for improving the communication throughput and delay in UAV-aided networks, IEEE Network 30 (1) (2016) 100–105.
 (18) D. H. Choi, S. H. Kim, D. K. Sung, Energy-efficient maneuvering and communication of a single UAV-based relay, IEEE Transactions on Aerospace and Electronic Systems 50 (3) (2014) 2320–2327.
 (19) F. Jiang, A. L. Swindlehurst, Optimization of UAV heading for the ground-to-air uplink, IEEE Journal on Selected Areas in Communications 30 (5) (2012) 993–1005.
 (20) P. Zhan, K. Yu, A. L. Swindlehurst, Wireless relay communications with unmanned aerial vehicles: Performance and optimization, IEEE Transactions on Aerospace and Electronic Systems 47 (3) (2011) 2068–2085.
 (21) Q. Wu, Y. Zeng, R. Zhang, Joint trajectory and communication design for multi-UAV enabled wireless networks, IEEE Transactions on Wireless Communications 17 (3) (2018) 2109–2121.
 (22) Y. Zeng, R. Zhang, T. J. Lim, Throughput maximization for UAV-enabled mobile relaying systems, IEEE Transactions on Communications 64 (12) (2016) 4983–4996.
 (23) W. Fawaz, C. Abou-Rjeily, C. Assi, UAV-aided cooperation for FSO communication systems, IEEE Communications Magazine 56 (1) (2018) 70–75.
 (24) Y. Pang, Y. Zhang, Y. Gu, M. Pan, Z. Han, P. Li, Efficient data collection for wireless rechargeable sensor clusters in harsh terrains using UAVs, in: Global Communications Conference (GLOBECOM), IEEE, 2014, pp. 234–239.
 (25) J. Johnson, E. Basha, C. Detweiler, Charge selection algorithms for maximizing sensor network life with UAV-based limited wireless recharging, in: International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), IEEE, 2013, pp. 159–164.
 (26) J. Xu, Y. Zeng, R. Zhang, UAV-enabled wireless power transfer: Trajectory design and energy optimization, IEEE Transactions on Wireless Communications.
 (27) S. Jeong, J. Bito, M. M. Tentzeris, Design of a novel wireless power system using machine learning techniques for drone applications, in: Wireless Power Transfer Conference (WPTC), IEEE, 2017, pp. 1–4.
 (28) X. He, J. Bito, M. M. Tentzeris, A drone-based wireless power transfer and communications platform, in: Wireless Power Transfer Conference (WPTC), IEEE, 2017, pp. 1–4.
 (29) C. Wang, Z. Ma, Design of wireless power transfer device for UAV, in: International Conference on Mechatronics and Automation (ICMA), IEEE, 2016, pp. 2449–2454.
 (30) S. Chen, Y. Shu, B. Yu, C. Liang, Z. Shi, J. Chen, Mobile wireless charging and sensing by drones, in: International Conference on Mobile Systems, Applications, and Services Companion (MobiSys), ACM, 2016, pp. 99–99.
 (31) A. Mittleider, B. Griffin, C. Detweiler, Experimental analysis of a UAV-based wireless power transfer localization system, in: Experimental Robotics, Springer, 2016, pp. 357–371.
 (32) B. Griffin, C. Detweiler, Resonant wireless power transfer to ground sensors from a UAV, in: International Conference on Robotics and Automation (ICRA), IEEE, 2012, pp. 2660–2665.
 (33) M.-S. Alouini, A. J. Goldsmith, Adaptive modulation over Nakagami fading channels, Wireless Personal Communications 13 (1-2) (2000) 119–143.
 (34) I. S. Gradshteyn, I. M. Ryzhik, Table of Integrals, Series, and Products, Academic Press, 2014.
 (35) K. Li, W. Ni, X. Wang, R. P. Liu, S. S. Kanhere, S. Jha, EPLA: Energy-balancing packets scheduling for airborne relaying networks, in: International Conference on Communications (ICC), IEEE, 2015, pp. 6246–6251.
 (36) X. Wang, K. Li, S. S. Kanhere, D. Li, X. Zhang, E. Tovar, PELE: Power efficient legitimate eavesdropping via jamming in UAV communications, in: International Wireless Communications and Mobile Computing Conference (IWCMC), IEEE, 2017, pp. 402–408.
 (37) K. Li, N. Ahmed, S. S. Kanhere, S. Jha, Reliable transmissions in AWSNs by using OBESPAR hybrid antenna, Pervasive and Mobile Computing 30 (2016) 151–165.
 (38) K. Li, C. Yuen, S. Jha, Fair scheduling for energy harvesting WSN in smart city, in: SenSys, ACM, 2015, pp. 419–420.
 (39) K.-H. Liu, Selection cooperation using RF energy harvesting relays with finite energy buffer, in: Wireless Communications and Networking Conference (WCNC), IEEE, 2014, pp. 2156–2161.
 (40) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529.
 (41) J. N. Tsitsiklis, B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control 42 (5) (1997) 674–690.
 (42) M. van Otterlo, M. Wiering, Reinforcement learning and Markov decision processes, in: Reinforcement Learning, Springer, 2012, pp. 3–42.
 (43) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: OSDI, Vol. 16, 2016, pp. 265–283.
 (44) F. Chollet, Keras: Deep learning library for Theano and TensorFlow [online] (2019) [cited 2019].
 (45) D. T. Hoang, D. Niyato, P. Wang, D. I. Kim, L. B. Le, Optimal data scheduling and admission control for backscatter sensor networks, IEEE Transactions on Communications 65 (5) (2017) 2062–2077.
 (46) Y. Zhang, S. He, J. Chen, Y. Sun, X. S. Shen, Distributed sampling rate control for rechargeable sensor nodes with limited battery capacity, IEEE Transactions on Wireless Communications 12 (6) (2013) 3096–3106.
 (47) D. Estep, The bisection algorithm, Practical Analysis in One Variable (2002) 165–177.