In the upcoming years an increasing number of autonomous systems will pervade urban and domestic environments. The next generation of Unmanned Aerial Vehicles (UAVs) requires high-level controllers in order to move in unstructured environments and perform multiple tasks. Recently a new application has been proposed, namely the use of quadrotors for the delivery of packages and goods. In this scenario the most delicate part is the identification of a ground marker and the vertical descent maneuver. Previous works used hand-crafted feature analysis and external sensors in order to identify the landing pad. In this work we propose a completely different approach, based on recent breakthroughs achieved with Deep Reinforcement Learning (DRL). Our method only requires low-resolution images acquired from a down-looking camera, which are given as input to a hierarchy of Deep Q-Networks (DQNs). The output of the networks is a high-level command that directs the drone toward the marker. The most remarkable advantage of DRL is the absence of any human supervision, allowing the quadrotor to autonomously learn how to use high-level actions in order to land.
The use of DRL in the landing problem is not straightforward. Previous applications mainly focused on deterministic environments such as the Atari game suite. Using DRL in unstructured environments with robotic platforms has had limited success. In this work we tackled the landing problem by introducing different technical solutions. We used a divide-and-conquer strategy and split the problem into two sub-tasks: landmark detection and vertical descent. Two specialized DQNs take care of the two tasks and are connected through an internal trigger engaged by the networks themselves. Moreover, we used double DQN to reduce overestimation problems that commonly arise when the agent moves in complex environments. To solve the issue of sparse and delayed reward we implemented a new type of prioritized experience replay, called partitioned buffer replay, that splits the experiences into multiple containers and guarantees the presence of rare transitions in the training batch. As far as we know, the present work is the first to use an unsupervised learning approach to tackle the landing problem. We show an overview of the system in Figure 1 and a video in our repository (https://github.com/pulver22/QLAB/tree/master/share/video).
II Related work
In this section we present a brief literature review in order to offer an overview of the topic. This review is not meant to be complete; it only aims to show how our method differs from previous work.
We can broadly group the methods used for landing UAVs into three classes: sensor-fusion, device-assisted, and vision-based. The sensor-fusion methods rely on the use of multiple sensors in order to gather enough data for a robust pose estimation. In a recent work the data from a downward-looking camera and an inertial measurement unit were combined in order to build a three-dimensional reconstruction of the terrain. Given the two-dimensional elevation map, it was possible to find a secure surface area for landing. In another work the authors used a particular geometric shape for the landing pad in conjunction with the analysis of multiple sensors in order to accurately estimate the position of the drone with respect to the marker. A ground-based multisensor fusion system has also been proposed. The system included a pan-tilt unit, an infrared camera and an ultra-wideband radar used to center the UAV in a recovery area and guide it toward the ground. A similar approach has been used to land on an AR-tag marker placed on a moving vessel.
Device-assisted methods rely on the use of ground sensors in order to precisely estimate the position and trajectory of the drone. A system based on infrared lights has been used, in which the authors adopted a series of parallel infrared lamps arranged along a runway. The camera on the vehicle was equipped with optical filters for capturing the infrared lights and the images were forwarded to a control system for pose estimation. A Chan-Vese approach supplemented by an extended Kalman filter has been proposed for ground stereo-vision detection.
The vision-based approaches analyse geometric features in order to find ground pads and land. A method based only on a monocular camera has been proposed. The system used a well-defined target pattern that is easy to identify at different distances. Because the pattern consists of a series of concentric circles, it was possible to find the landmark even when it was partially occluded. A modified version of the international landing pattern has also been used. That solution relied on a seven-stage vision algorithm to identify and track the pattern in a cluttered environment and to reconstruct it when only partially observable. The use of AR-tag fiducial markers has been taken into account in two other works. In both cases a precise pose estimation was obtained using only an onboard camera. In a further work a vision-based visual servoing algorithm has been used to track a moving platform and to produce velocity commands for an adaptive sliding controller.
The previous works showed different limitations that we discuss here. Sensor-fusion methods often use information gathered from expensive sensors that cannot be integrated in low-cost drones. Most of the time these methods rely on GPS, which may be unavailable in real-world scenarios. The device-assisted approaches allow obtaining an accurate estimation of the drone pose. However, external devices are not always available, which limits the use of these approaches. Vision-based methods have the advantage of using only on-board sensors and mainly rely on cameras. The main limit of these methods is that low-level features are often viewpoint-dependent and subject to failure in ambiguous cases. The present work directly deals with all the aforementioned problems. Our solution is based only on a monocular onboard camera and does not use any other sensors or external devices. The use of DQNs significantly improves the marker detection and is robust to projective transformations and marker corruption.
III Proposed method
In this section we describe the landing problem in reinforcement learning terms and we present the technical solutions we adopted.
III-A Problem definition and notation
Here we consider the landing problem as divided into two sub-problems: landmark detection and vertical descent. The detection requires an exploration of the xy-plane, where the quadrotor has to shift horizontally in order to align its body frame with the marker. In the vertical descent phase the vehicle has to reduce the distance from the marker using vertical movements. Moreover, the drone has to shift on the xy-plane in order to keep the marker centered.
Formally, both problems can be reduced to Markov Decision Processes (MDPs). At each time step $t$ the agent receives the state $s_t$, performs an action $a_t$ sampled from the action space $\mathcal{A}$, and receives a reward $r_t$ given by a reward function $R(s_t, a_t)$. The action brings the agent to a new state $s_{t+1}$ in accordance with the environmental transition model $T(s_{t+1} \mid s_t, a_t)$. In the particular case faced here the transition model is not given (model free). The goal of the agent is to maximize the discounted cumulative reward, called return, $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $\gamma$ is the discount factor. Given the current state the agent can select an action from the internal policy $\pi(a \mid s)$. In off-policy learning the prediction of the cumulative reward can be obtained through an action-value function $Q(s, a)$ adjusted during the learning phase in order to approximate $Q^{*}(s, a)$, the optimal action-value function. In this work we use a Convolutional Neural Network (CNN) for approximating the Q-function, following the approach presented in Mnih et al. (2015). The CNN takes as input four grey scale images acquired by the downward-looking camera mounted on the drone. The images are processed by three convolutional layers and two fully connected layers. As activation function we used the rectified linear unit. The first convolution has 32 kernels of with stride of 2, the second layer has 64 kernels of with stride of 2, and the third layer convolves 64 kernels of with stride of 1. The fourth layer is a fully connected layer of 512 units followed by the output layer, which has a unit for each valid action (backward, right, forward, left, stop, descent, land). Depending on the simulation, we used a subset of the total actions available; we refer the reader to Section IV for additional details. A graphical representation of the network is presented in Figure 2.
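As a small illustration of the return that the agent maximizes, the following sketch accumulates a list of per-step rewards backwards with a discount factor. The function name `discounted_return` and the example reward sequences are hypothetical; only the discount factor of 0.99 and the reward magnitudes are taken from the experimental sections.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma, starting from the first step."""
    total = 0.0
    # Walking backwards lets each step reuse the discounted tail of the episode.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two steps paying the cost of living, then a success.
episode = [-0.01, -0.01, 1.0]
print(discounted_return(episode))
```

Because the positive reward arrives only at the end of an episode, every additional step both delays it (through the discount) and adds a small negative cost, which is what pushes the agent toward short trajectories.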
It is important to focus on the two phases that characterize the landing problem in order to isolate important issues. In the landmark detection phase we made the reasonable assumption of a flight at fixed altitude. The vertical alignment with the landmark is obtained through shifts in the xy-plane. This expedient does not have any impact at the operational level but dramatically simplifies the task. To adjust $\theta$, the parameters of the DQN, we used the following loss function:
$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left( Y_i - Q(s, a; \theta_i) \right)^2 \right] \tag{1}$$
with $D$ being a dataset of experiences used to uniformly sample a batch at each iteration $i$. The network $Q(s, a; \theta_i)$ is used to estimate actions at runtime, whereas $Y_i$ is the target, defined as follows:
$$Y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) \tag{2}$$
The network $Q(s', a'; \theta_i^{-})$ is used to generate the target and is updated at regular intervals. The use of the target network is a trick that improves the stability of the method. The parameters $\theta^{-}$ are updated every $C$ steps and synchronized with $\theta$. In the standard approach the experiences in the dataset are collected in a preliminary phase using a random policy. The dataset $D$ is also called buffer replay and it is a way to randomize the samples, breaking the correlation and reducing the variance.
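The standard DQN update described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' TensorFlow implementation: Q-values are passed in as ordinary lists, and the names `dqn_target` and `batch_loss` are hypothetical. The target is the reward plus the discounted maximum value predicted by the target network, and for terminal transitions it reduces to the reward alone.

```python
def dqn_target(reward, next_q_target, terminal, gamma=0.99):
    """Standard DQN target for one transition.

    next_q_target holds the target network's Q-values for the next state.
    """
    if terminal:
        # No future return after a terminal state, so the max term is dropped.
        return reward
    return reward + gamma * max(next_q_target)


def batch_loss(batch, gamma=0.99):
    """Mean squared error between targets and current estimates.

    Each batch item is (q_sa, reward, next_q_target, terminal), where q_sa is
    the online network's estimate for the action that was actually taken.
    """
    errors = [
        (dqn_target(reward, next_q, terminal, gamma) - q_sa) ** 2
        for q_sa, reward, next_q, terminal in batch
    ]
    return sum(errors) / len(errors)
```

In a full implementation the gradient of this loss would be taken only with respect to the online parameters, with the target network held fixed between synchronizations.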
The vertical descent phase is a form of Blind Cliffwalk, where the agent has to take the right action in order to progress through a sequence of states and finally get a positive or a negative reward. The intrinsic structure of the problem makes it extremely difficult to obtain a positive reward because the target-zone is only a small portion of the state space. The consequence is that the buffer replay does not contain enough positive experiences, making the policy unstable. To solve this issue we used a form of buffer replay, called partitioned buffer replay, that discriminates between rewards and guarantees a fair sampling between positive, negative and neutral experiences. Another issue connected with the reward sparsity is the utility overestimation. During preliminary research we observed this problem in the vertical descent phase. Monitoring the Q-max value (the highest utility returned by the Q-network) we noticed that it rapidly increased, overshooting the maximum possible utility of 1.0. The overestimation was associated with all the actions but the trigger. The trigger leads to a terminal state, therefore its utility is updated without the $\max$ operator. The $\max$ operator has been found to be responsible for the overestimation in deep Q-learning. In our case the overestimated utilities of the four horizontal movements (grown up to 2.0 after frames) were higher than the non-overestimated utility associated with the trigger (stably converged to 1.0). As a result the drone moved on top of the marker and then shifted on the xy-plane without engaging the trigger. A solution to overestimation has been recently proposed and has been called double DQN. The target estimated through double DQN is defined as follows:
$$Y_i^{d} = r + \gamma \, Q\left(s', \arg\max_{a'} Q(s', a'; \theta_i); \theta_i^{-}\right) \tag{3}$$
Using this target instead of the one in Equation 2, the divergence of the DQN action distribution is mitigated, resulting in faster convergence and increased stability.
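The difference between the two targets can be made concrete with a short sketch. In double DQN the online network selects the next action and the target network evaluates it, so a single over-optimistic estimate in the target network is not automatically propagated. The function name `double_dqn_target` and the numeric values in the usage example are hypothetical.

```python
def double_dqn_target(reward, next_q_online, next_q_target, terminal, gamma=0.99):
    """Double DQN target for one transition.

    next_q_online: online network Q-values for the next state (selects the action).
    next_q_target: target network Q-values for the next state (evaluates it).
    """
    if terminal:
        # Terminal transitions, such as the trigger, use the reward alone.
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# The target network over-values action 0, but the online network prefers action 1.
online = [0.2, 0.9]
target = [2.0, 0.5]
print(double_dqn_target(0.0, online, target, terminal=False, gamma=0.5))  # 0.25
print(0.0 + 0.5 * max(target))  # 1.0, the standard DQN target for comparison
```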
III-B Partitioned buffer replay
In preliminary research we found that the vertical descent was affected by the sparsity of positive and negative rewards. The shortage of positive and negative experiences caused an underestimation of the utilities associated with the triggers. To deal with sparse rewards it has been proposed to divide the experiences into two buckets, one with high priority and the other with low priority. Our approach is an extension of this method to $K$ buckets. Another form of prioritized buffer replay has also been proposed, in which the authors suggest sampling important transitions more frequently. The prioritized replay estimates a weight for each experience based on the temporal difference error. Experiences are sampled with a probability proportional to the weight. The limitation of this form of prioritization is the introduction of another layer of complexity that may not be justified for applications where there is a clear distinction between positive and negative rewards. Moreover this method requires to update the priorities. This issue does not significantly affect performance on the standard benchmarks but it has a relevant effect on robotics applications, where there is a high cost in obtaining experiences.
In Section III-A we defined $D$ as a dataset of experiences used to uniformly sample a batch at each iteration $i$. To create a partitioned buffer replay we have to divide the reward space $\mathcal{R}$ into $K$ partitions:
$$\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \dots \cup \mathcal{R}_K$$
For any experience $e = (s, a, r, s')$ we associate its reward $r$ and we define the $k$-th buffer replay:
$$D_k = \{ e \in D : r \in \mathcal{R}_k \}$$
The batch used for training the policy is assembled by picking experiences from each one of the $K$ datasets with a certain fraction $\rho_k$.
In our particular case we have $K = 3$, meaning that we have three datasets, with $D^{+}$ containing experiences having positive rewards, $D^{-}$ containing experiences having negative rewards, and $D^{\sim}$ containing experiences having neutral rewards. The fractions of experiences associated with the three datasets are denoted $\rho^{+}$, $\rho^{-}$, and $\rho^{\sim}$.
When using a partitioned buffer replay there is a substantial increase in the available number of positive and negative experiences. For instance using a single buffer of size and accumulating transitions, the total number of positive experiences is 343 and the number of negative experiences is 2191. Using a partitioned buffer with size for the neutral partition, and size for positive and negative partitions, the total number of positive experiences is 1352 and the number of negative experiences 9270.
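A partitioned buffer replay of this kind could be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the class name `PartitionedReplay`, the reward thresholds used to assign a partition, the capacities, and the sampling fractions in the usage example are all assumptions. In particular, the sketch treats the small cost of living as a neutral reward and samples with replacement so that tiny partitions can still fill their share of the batch.

```python
import random
from collections import deque


class PartitionedReplay:
    """Replay memory split into positive, negative and neutral partitions."""

    def __init__(self, capacities, fractions, seed=None):
        # capacities and fractions are dicts keyed by partition name.
        if abs(sum(fractions.values()) - 1.0) > 1e-9:
            raise ValueError("fractions must sum to 1")
        self.buffers = {name: deque(maxlen=size) for name, size in capacities.items()}
        self.fractions = dict(fractions)
        self.rng = random.Random(seed)

    @staticmethod
    def partition_of(reward):
        # Assumed thresholds: +1.0 and -1.0 mark terminal outcomes, anything
        # in between (e.g. the -0.01 cost of living) counts as neutral.
        if reward >= 1.0:
            return "positive"
        if reward <= -1.0:
            return "negative"
        return "neutral"

    def add(self, state, action, reward, next_state, terminal):
        experience = (state, action, reward, next_state, terminal)
        # A full partition only evicts its own oldest items, never rare ones
        # stored in another partition.
        self.buffers[self.partition_of(reward)].append(experience)

    def sample(self, batch_size):
        # Fixed share per partition; rounding leftovers go to the neutral one.
        counts = {name: int(frac * batch_size) for name, frac in self.fractions.items()}
        counts["neutral"] += batch_size - sum(counts.values())
        batch = []
        for name, count in counts.items():
            pool = list(self.buffers[name])
            if pool and count:
                batch.extend(self.rng.choices(pool, k=count))
        self.rng.shuffle(batch)
        return batch


memory = PartitionedReplay(
    capacities={"positive": 100, "negative": 100, "neutral": 1000},
    fractions={"positive": 0.25, "negative": 0.25, "neutral": 0.5},  # hypothetical values
    seed=0,
)
```

Keeping one bounded queue per partition is what prevents the frequent neutral transitions from pushing the rare positive and negative ones out of memory.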
III-C Hierarchy of DQNs
Our method is based on the use of a hierarchy of DQNs representing sub-policies used to deal with different phases of the navigation. Similarly to a finite-state machine, the global policy is divided into modules and each module is governed by a specific DQN or control loop. The DQNs are able to autonomously understand when it is time to call the next state. The advantages of such a method are twofold. On the one hand, it is possible to reduce the complexity of the task using a divide-and-conquer approach. On the other hand, the function approximators are confined to specific sandboxes, making their use in robotic applications safer.
A similar approach is described in hierarchical reinforcement learning  where a set of sub-policies, called options, are available to the agent in specific states. The options control the agent in sub-regions of a core MDP called semi-MDPs. In the present work we assume that the core MDP can be divided into multiple isolated instances and that each instance is a proper MDP. The advantage is that we can use standard Q-learning to train the agent.
The finite MDP describing the landing problem can be divided into three main stages: landmark detection, descent maneuver, and touchdown. We described the first two phases in Section III-A. The touchdown consists of decreasing the power of the motors in the last few centimeters of the descent and then safely deactivating the UAV components (e.g. motors, cameras, boards, control unit, etc.). In this article we mainly focused on the first two stages, because they represent the most challenging part of the landing procedure. A graphical representation of the hierarchical state machine is shown in Figure 3. We trained the first DQN (marker detection) to receive a positive reward when the trigger was enabled inside a target area. A negative reward was given if the trigger was enabled outside the target area. The second network (descent maneuver) was trained using the same idea. In a preliminary phase we also trained a single network to achieve both detection and descent. Given the size of the combined spaces the network was not able to converge to a stable policy. As a baseline we also report the accumulated reward curve of this network in Section IV.
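The way the stages hand over control can be illustrated with a small state machine. The sketch below is an assumption-laden simplification: the class name `LandingHierarchy`, the `"trigger"` action label, and the convention that a trigger produces no motion command are all hypothetical, and each policy is just a callable that maps an observation to an action name (in the paper's setting the first two would be DQNs and the third a control loop).

```python
class LandingHierarchy:
    """Runs one sub-policy at a time and advances when it engages the trigger."""

    STAGES = ("detection", "descent", "touchdown")

    def __init__(self, policies, trigger="trigger"):
        # policies: dict mapping each stage name to a callable(observation) -> action.
        self.policies = policies
        self.trigger = trigger
        self.index = 0

    @property
    def stage(self):
        return self.STAGES[self.index] if self.index < len(self.STAGES) else "done"

    def step(self, observation):
        """Return a motion command, or None when a trigger fires or landing is done."""
        if self.stage == "done":
            return None
        action = self.policies[self.stage](observation)
        if action == self.trigger:
            # The trigger is internal: it only switches the active sub-policy.
            self.index += 1
            return None
        return action
```

Each sub-policy only ever sees the states of its own stage, which is what allows the two networks to be trained as separate MDPs with standard Q-learning.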
III-D Training through domain randomization
The reality gap is the obstacle that makes it difficult to implement many robotic solutions in the real world. This is especially true for DRL, where a large number of episodes is necessary in order to obtain stable policies. Recent research has worked on bridging this gap using domain transfer techniques. An example is domain randomization, a method for training models on simulated images that transfer to real images by randomizing rendering in the simulator. Here we adopt domain randomization in order to train the UAV in simple simulated environments and test it in complex environments (both simulated and real). The remarkable property of this approach is that it does not require any pre-training on real images. If the variability is significant enough, models trained in simulation generalize to the real world with no additional training. In the next section we show how domain randomization has been included in the training phase and how the experiments have been organized.
In Section IV-A we present the methodology and the results obtained with the DQN specialized in the landmark detection phase, while in Section IV-B we present those concerning the vertical descent phase. In both training and testing we used the same environment (Gazebo 7.7.x, ROS Kinetic) and drone (Parrot BeBop 2). The simulator is a fork of an existing simulator and is freely available in our repository (https://github.com/pulver22/QLAB). The control command sent to the vehicle is represented by a continuous vector that allows moving the drone with a specific velocity on the three axes. We must point out that the physics of the engine has not been simplified in any way. There are important oscillatory effects during accelerations and decelerations that introduce a swinging behaviour, with consequent perspective distortion in the acquired images. Moreover, a summation-of-forces effect appears when the vehicle accumulates inertia and a new velocity command is given. The DRL algorithm has to deal with this source of noise.
IV-A First series of simulations
In the first series of simulations we trained and tested the DQNs for the marker detection phase. We considered two networks having the same structure (Figure 2) and we trained them in two different conditions. The first network was trained with a uniform asphalt texture (DQN-single), whereas the second network was trained with multiple textures (DQN-multi). The ability to generalize to new unseen situations is very important and it should be seriously taken into account in the landing problem. Training the first network on a single texture is a way to quantify the effect of a limited dataset on the performance of the agent. In the DQN-multi condition the networks were trained using seven different groups of textures: asphalt, brick, grass, pavement, sand, snow, soil (Figure 4-h). These networks should outperform the ones trained in the condition with single texture.
At each episode the drone started at a fixed altitude of 20 m that was maintained for the entire flight. This expedient was useful for two reasons: it significantly reduced the state space to explore and it allowed visualizing the marker in most of the cases giving a reference point for the navigation. In a practical scenario this solution does not have any impact on the flight, the drone is kept at a stable altitude and the frames are acquired regularly. To stabilize the flight we introduced discrete movements, meaning that each action was repeated for 2 seconds and then stopped leading to an approximate shift of 1 meter, similarly to the no-operation parameter used in . The frames from the camera were acquired between the actions ( Hz) when the vehicle was stationary. This expedient stabilized convergence reducing perspective errors.
The training environment was represented by a uniform texture of size m with the landmark positioned in the center. The environment contained two bounding boxes (Figure 4(a)). At the beginning of each episode the drone was spawned at 20 m of altitude inside the perimeter of the larger bounding box ( m) with a random position and orientation. A positive reward of 1.0 was given when the drone activated the trigger in the target-zone, and a negative reward of -1.0 was given if the drone activated the trigger outside the target-zone. A negative cost of living of -0.01 was applied in all the other conditions. A time limit of 40 seconds (20 steps) was used to stop the episode and start a new one. In the DQN-multi condition the ground texture was changed every 50 episodes and randomly sampled among the 71 available. The target and policy networks were synchronized every frames. The agent had five possible actions available: forward, backward, left, right, land-trigger. The action was repeated for 2 seconds, then the drone was stopped and a new action was sampled. The buffer replay was filled before the training with frames using a random policy. We trained the two DQNs for frames. We used an $\epsilon$-greedy policy with $\epsilon$ decayed linearly from 1.0 to 0.1 over the first frames and fixed at 0.1 thereafter. The discount factor $\gamma$ was set to 0.99. As optimizer we used the RMSProp algorithm with a batch size of 32. The weights were initialized using the Xavier initialization method. The DQN algorithm was implemented in Python using the TensorFlow library. Simulations were performed on a workstation with an Intel i7 (8 core) processor, 32 GB of RAM, and an NVIDIA Quadro K2200 as graphical processing unit. On this hardware the training took 5.2 days to complete.
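The reward scheme and exploration schedule of the detection phase can be summarized in a short sketch. The reward values and the linear decay from 1.0 to 0.1 come from the description above; the function names and the `decay_frames` parameter are hypothetical, and the actual number of frames over which the decay takes place is not reproduced here.

```python
def detection_reward(trigger_engaged, in_target_zone):
    """Reward for one step of the landmark detection phase."""
    if trigger_engaged:
        # The trigger ends the episode: success inside the zone, failure outside.
        return 1.0 if in_target_zone else -1.0
    # Every other step pays the cost of living.
    return -0.01


def epsilon_at(frame, decay_frames, start=1.0, end=0.1):
    """Exploration rate decayed linearly over decay_frames, then held at end."""
    if frame >= decay_frames:
        return end
    return start + (end - start) * frame / decay_frames


# Hypothetical schedule length, chosen only for illustration.
for frame in (0, 500, 1000, 2000):
    print(frame, epsilon_at(frame, decay_frames=1000))
```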
To test the performance of the policies we measured the detection success rate of both DQN-single and DQN-multi in six tests. (i) The first test was performed on 21 unknown uniform textures belonging to the same categories as the training set. (ii) The second test was done on the same environments but at different altitudes (20, 15, and 10 meters). (iii) The third test was performed on the same 21 unknown textures but using a marker corrupted through a semi-transparent dust-like layer. (iv) The fourth test was done by randomly sampling 25 textures from the test set and mixing them in a mosaic-like composition. (v) The fifth test was done on three photo-realistic environments, namely a warehouse, a disaster site, and a power plant (Figure 4-e/g). (vi) The sixth and last test consisted of a real-world implementation in the mezzanine environment (Figure 4-d). The mezzanine is the only environment that allowed flying at a high altitude. We also measured the performances of a random agent, an AR-tracker algorithm, and human pilots in all the simulated environments. The human data were collected using two methodologies. In the first, 7 volunteers used a space-navigator mouse that made it possible to move the drone in three dimensions at a maximum speed of 0.5 m/s. In the second, 5 volunteers used a keyboard to move the drone in four directions on the xy-plane through discrete steps of 1 meter. The first methodology was adopted in order to give the subjects a natural control interface, whereas the second gave them the same control conditions as the drone. In both conditions preliminary training allowed the subjects to familiarize themselves with the task. After the familiarization phase the real test started. In the landmark detection the subjects had to align the drone with the ground marker and trigger the landing procedure when inside the target-zone.
The subjects performed five trials for each one of the environments contained in the test set (randomly sampled). A time limit of 40 seconds (20 steps) was applied to each episode. A landing attempt was declared as failed when the time limit expired or when the subject engaged the trigger outside the target-zone.
The results for both DQN-single and DQN-multi show that the agents were able to learn an efficient policy for maximizing the reward. In both conditions the reward increased stably without any anomaly (Figure 6 bottom). In the same figure we also report the reward curve for a baseline condition, where a single network has been trained to perform both detection and descending. The reward of the baseline did not increase significantly and the resulting policy was unable to engage the trigger inside the target-zone. The results of the test phase are summarized in Figure 6 (top). The bar chart compares the performances of DQN-single, DQN-multi, human pilots, AR-tracker and random agent. For human pilots we only report the results for the discrete control condition, since the score was higher than the space-navigator condition (). The average score on the first test (uniform textures) for the DQN-multi is . The score obtained by the agent trained on a single texture (DQN-single) are significantly lower (). The human performance is , whereas the AR-tracker has an average score of . The random agent has an average reward of in this environment. Since both human pilots and DQNs used discrete steps to move in the environments, it is possible to estimate the average number of discrete steps required to accomplish detection. For human pilots the average number of steps is 12, whereas for the DQN-multi is 6, meaning that humans were significantly slower. Testing the DQN-multi at different altitudes we noticed that the accuracy increased at 15 () and 10 () meters, with respect to the accuracy at the training altitude of 20 meters (). This result is explained by the fact that at lower altitudes the marker is more visible. In the third test we compared the DQN-multi and AR-tracker on uniform textures using the corrupted marker. 
We observed a significant drop in the AR-tracker performances from to explained by the fact that the underlying template matching algorithm failed in identifying the corrupted marker. In the same condition the DQN-multi performed well, with a drop in performance from to . The results in the fourth test (mixed-textures) show a lower performance for all the agents. DQN-multi has a success rate of and the DQN-single of . The human pilots have a performance of and the AR-tracker of . The results of the fifth test (photo-realistic environments) show a generic drop (DQN-multi=, DQN-single=, Human=, Random=, AR-tracker=). The overall performance on uniform textures, mixed textures and realistic worlds is for DQN-multi, for DQN-single, for human pilots, and for the AR-tracker. Finally, the results on the sixth test (real-world environment, mezzanine) showed an overall accuracy of on a total of 10 flights. We must point out that this condition was very challenging because of high variability in lighting and attitude instability.
IV-B Second series of simulations
In the second series of simulations we trained and tested the DQNs specialized in the vertical descent. To encourage the descent during the $\epsilon$-greedy action selection we sampled the action from a non-uniform distribution where the descending action had a probability and the other actions a probability . We used exploring starts, generating the UAV at different altitudes and ensuring a wider exploration of the state space. Instead of the standard buffer replay we used the partitioned buffer replay described in Section III-B. We trained two networks, the former in a single-texture condition (DQN-single) and the latter in a multi-texture condition (DQN-multi).
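A biased exploration step of this kind might look as follows. The sketch assumes that the remaining probability mass is spread evenly over the non-descending actions, and the default `favored_prob=0.5` is purely hypothetical, since the values used in the experiments are not reproduced here. The function names are likewise illustrative.

```python
import random


def explore_action(actions, favored="down", favored_prob=0.5, rng=random):
    """Sample an exploratory action, giving the favored one a fixed probability."""
    others = [a for a in actions if a != favored]
    if not others:
        return favored
    # Assumption: the other actions share the remaining probability equally.
    weights = [
        favored_prob if a == favored else (1.0 - favored_prob) / len(others)
        for a in actions
    ]
    return rng.choices(actions, weights=weights, k=1)[0]


def epsilon_greedy(q_values, actions, epsilon, rng=random, **explore_kwargs):
    """Explore with the biased distribution, otherwise act greedily."""
    if rng.random() < epsilon:
        return explore_action(actions, rng=rng, **explore_kwargs)
    best = max(range(len(actions)), key=q_values.__getitem__)
    return actions[best]


actions = ["forward", "backward", "left", "right", "down"]
print(epsilon_greedy([0.1, 0.0, 0.3, 0.2, 0.9], actions, epsilon=0.0))  # down
```

Biasing only the exploratory branch leaves the greedy policy untouched, so the learned Q-values still decide the action once exploration has decayed.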
The training environment was represented by a flat floor of size m with the landmark positioned in the center. The state-space in the vertical descent phase is significantly larger than in the marker detection and exploration is expensive. For this reason we reduced the number of textures used for the training, randomly sampling 20 textures from the 71. We can hypothesize that using the entire training set can lead to a better performance. The action space available was represented by five actions: forward, backward, left, right, down. A single action was repeated for 2 seconds leading to an approximate shift of 1 meter due to a speed of 0.5 m/s. The descent action was performed at a lower speed of 0.25 m/s to reduce undesired vertical shifts. The target and policy networks were synchronized every frames. For the partitioned buffer replay we chose , , and . A time limit of 80 seconds (40 steps) was used to stop the episode and start a new one. The drone was spawned with a random orientation inside a bounding box of size m at the beginning of the episode. This bounding box corresponds to the target area of the landmark detection phase described in Section IV-A1. A positive reward of 1.0 was given only when the drone entered in a target-zone of size m, centered on the marker (Figure 4(b)). If the drone descended above 1.5 meter outside the target-zone a negative reward of -1.0 was given. A cost of living of -0.01 was applied at each time step. The same hyper-parameters described in Section IV-A1 were used to train the agent. In addition to the hardware mentioned in Section IV-A1, we also used a separate machine to collect preliminary experiences. This machine is a multi-core workstation with 32 GB of RAM and a GPU NVIDIA Tesla K-40. Before the training, the buffer replay was filled using a random policy with neutral experiences, negative experiences and positive experiences. 
We increased the number of positive experiences using horizontal/vertical mirroring and consecutive 90-degree rotations on all the images stored in the positive partition. This form of data augmentation increased the total number of positive experiences to . To test the performance of the agents we measured the landing success rate of DQN-single, DQN-multi, human pilots, AR-tracker, and random agent in five tests. (i) In the first test the agents performed landing on 21 unseen uniform textures. (ii) The second test consisted of landing on uniform textures with a corrupted marker (Figure 4-i). (iii) In the third test 25 textures were randomly sampled from the test set and mixed in a mosaic-like composition. (iv) In the fourth test landing was accomplished in three photo-realistic environments: warehouse, disaster site, power plant (Figure 4-e/g). (v) In the fifth and last test the UAV had to land in four real-world indoor environments: laboratory, small hall, large hall, mezzanine (Figure 4-a/d). The performance of human pilots was measured in all the simulated environments through discrete and continuous controllers using the same procedure described in Section IV-A1.
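The mirroring and rotation used to enlarge the positive partition can be illustrated on a single image stored as a list of rows. This is only a sketch of the image transformations: the helper names are hypothetical, real experiences contain stacks of frames rather than one image, and any direction-dependent action label attached to an experience would need to be remapped consistently, which is not shown here.

```python
def mirror_horizontal(image):
    """Flip left and right."""
    return [row[::-1] for row in image]


def mirror_vertical(image):
    """Flip top and bottom."""
    return [list(row) for row in image[::-1]]


def rotate90(image):
    """Rotate by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the two mirrored copies and the three consecutive rotations."""
    variants = [mirror_horizontal(image), mirror_vertical(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    return variants


tiny = [[1, 2],
        [3, 4]]
for variant in augment(tiny):
    print(variant)
```

With these five variants each stored positive image yields six samples in total, counting the original.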
The accumulated reward per episode, shown in Figure 8 (bottom), increased stably in both DQN-single and DQN-multi. We also report the baseline curve of a network trained on both detection and descent, which did not learn to accomplish the task. The results of the test phase are summarized in Figure 8 (top). The bar chart compares the performances of DQN-single, DQN-multi, human pilots, AR-tracker, and random agent. For human pilots we only report the score in the discrete control condition, which is higher than in the space-navigator condition (). The average score on the first test (uniform textures) is for DQN-multi, for DQN-single, for humans, and for the AR-tracker. Since both human pilots and DQNs used discrete steps to control the drone, it is possible to estimate the average number of steps required to accomplish landing. For human pilots the average number of steps is 23, whereas for the DQN-multi it is 19, meaning that human pilots were slower. In the second test we compared the performances of the DQN-multi and AR-tracker on uniform textures with a corrupted marker. The AR-tracker had a significant drop from to due to the failure of the underlying template-matching algorithm. The DQN-multi had a drop from to , showing that it is more robust to marker corruption. The third test (mixed textures) showed a general drop (DQN-multi= , DQN-single=, Human=, Random=, AR-tracker=). In the fourth test (realistic environments) a similar drop was observed (DQN-multi= , DQN-single=, Human=, Random=, AR-tracker=). The overall performances on uniform textures, mixed textures and realistic environments are for DQN-multi, for DQN-single, for human pilots, and for the AR-tracker. In the fifth and last test (real-world) the DQN-multi was used to control the descending phase in 40 flights equally distributed across four environments (laboratory, small hall, large hall, mezzanine). The system obtained an overall success rate of .
Most of the missed landing have been caused by extreme light conditions (e.g. mutable natural light), and by flight instability (e.g. strong drift). We can further analyze the DQN-multi policy observing the action-values distribution in different states (Figure 7). When the drone is far from the marker the DQN penalizes the descent. However, when the drone is over the marker this utility significantly increases overcoming the others.
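The relation between the action-value distribution and the chosen command can be illustrated with a greedy policy, which picks the action with the largest action-value. The sketch below is only an illustration: the action names in `ACTIONS` are hypothetical, and the numbers in `q_far` and `q_over` are made-up values chosen to mimic the qualitative pattern described above, not measurements from the experiments.

```python
import numpy as np

# Hypothetical action set, used only for illustration.
ACTIONS = ["forward", "backward", "left", "right", "descend"]


def greedy_action(q_values):
    """Return the name of the action with the highest action-value."""
    return ACTIONS[int(np.argmax(q_values))]


# Made-up action-values (NOT experimental data).
# Far from the marker: descending has the lowest utility.
q_far = np.array([0.6, 0.2, 0.3, 0.4, -0.5])
# Over the marker: descending has the highest utility.
q_over = np.array([0.1, 0.1, 0.1, 0.1, 0.9])

action_far = greedy_action(q_far)
action_over = greedy_action(q_over)
```

With these illustrative values the greedy policy keeps moving horizontally while the drone is far from the marker and selects the descent only once it is above it.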
V Conclusions and discussion
In this work we used DRL to realize a system for the autonomous landing of a quadrotor on a static pad. The main modules of the system are two DQNs that control the UAV in two delicate phases: landmark detection and vertical descent. Using domain randomization we trained the DQNs with simple uniform textures and tested them in complex environments (both simulated and real). The overall performance is comparable with that of an AR-tracker algorithm and of human pilots. In particular, the system is faster than humans in reaching the pad and is more robust to marker corruption than the AR-tracker. The most remarkable outcome is that the networks were able to generalize to real environments despite being trained on a limited subset of textures. In all the missed landings the flight was interrupted because the time limit expired. The drone never landed outside the pad. Most of the missed landings were caused by extreme conditions (mutable lighting and strong drift) that were not modeled in the simulator. We hypothesize that the results can be further improved by taking these factors into account during the training phase. In conclusion, the results obtained are promising; however, further research is necessary in order to train stable policies that can work effectively in a wide range of real-world conditions.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in AAAI, 2016, pp. 2094–2100.
-  C. Forster, M. Faessler, F. Fontana, M. Werlberger, and D. Scaramuzza, “Continuous on-board monocular-vision-based elevation mapping applied to autonomous landing of micro aerial vehicles,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 111–118.
-  S. Saripalli and G. Sukhatme, “Landing on a moving target using an autonomous helicopter,” in Field and service robotics. Springer, 2006, pp. 277–286.
-  D. Zhou, Z. Zhong, D. Zhang, L. Shen, and C. Yan, “Autonomous landing of a helicopter uav with a ground-based multisensory fusion system,” in Seventh International Conference on Machine Vision (ICMV 2014). International Society for Optics and Photonics, 2015, pp. 94451R–94451R.
-  R. Polvara, S. Sharma, J. Wan, A. Manning, and R. Sutton, “Towards autonomous landing on a moving vessel through fiducial markers,” in 2017 European Conference on Mobile Robots (ECMR), Sept 2017, pp. 1–6.
-  Y. Gui, P. Guo, H. Zhang, Z. Lei, X. Zhou, J. Du, and Q. Yu, “Airborne vision-based navigation method for uav accuracy landing using infrared lamps,” Journal of Intelligent & Robotic Systems, vol. 72, no. 2, p. 197, 2013.
-  D. Tang, T. Hu, L. Shen, D. Zhang, W. Kong, and K. H. Low, “Ground stereo vision-based navigation for autonomous take-off and landing of uavs: a chan-vese model approach,” International Journal of Advanced Robotic Systems, vol. 13, no. 2, p. 67, 2016.
-  S. Lange, N. Sunderhauf, and P. Protzel, “A vision based onboard approach for landing and position control of an autonomous multirotor uav in gps-denied environments,” in Advanced Robotics, 2009. ICAR 2009. International Conference on. IEEE, 2009, pp. 1–6.
-  S. Lin, M. A. Garratt, and A. J. Lambert, “Monocular vision-based real-time target recognition and tracking for autonomously landing an uav in a cluttered shipboard environment,” Autonomous Robots, vol. 41, no. 4, pp. 881–901, 2017.
-  D. Falanga, A. Zanchettin, A. Simovic, J. Delmerico, and D. Scaramuzza, “Vision-based autonomous quadrotor landing on a moving platform,” in IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2017.
-  A. R. Vetrella, I. Sa, M. Popović, R. Khanna, J. Nieto, G. Fasano, D. Accardo, and R. Siegwart, “Improved tau-guidance and vision-aided navigation for robust autonomous landing of uavs,” in Field and Service Robotics. Springer, 2018, pp. 115–128.
-  D. Lee, T. Ryan, and H. J. Kim, “Autonomous landing of a vtol uav on a moving platform using image-based visual servoing,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 971–976.
-  P. Wawrzyński and A. K. Tanwani, “Autonomous reinforcement learning with experience replay,” Neural Networks, vol. 41, pp. 156–167, 2013.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
-  S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.
-  K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning,” arXiv preprint arXiv:1506.08941, 2015.
-  A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
-  J. Engel, J. Sturm, and D. Cremers, “Camera-based navigation of a low-cost quadrocopter,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 2815–2821.