I. Introduction

I-A. Problem Background
Aggressive flight can enhance the maneuverability of quadrotors. For instance, in search and rescue applications, quadrotors are required to explore unstructured environments with narrow entries. The quadrotor's activity range can be enlarged if it is capable of flying through narrow gaps that are considered intractable from the perspective of a classical flight controller. Moreover, battery-powered quadrotors have a limited activity range, and taking shortcuts through narrow gaps rather than long detours around obstacles can reduce power consumption. Accordingly, it is necessary to develop motion planners that can perform aggressive motions.
However, constrained flight motions can be difficult to perform due to the quadrotor's underactuated dynamics. One example is flying through a tilted rectangular gap, during which the quadrotor must avoid collision while satisfying attitude and position constraints simultaneously. Keeping a tilted attitude induces a large horizontal acceleration, which generates a horizontal position shift and increases the chance of colliding with the bezel. Therefore, finding a feasible trajectory is not trivial.
Aggressive flight planning has been explored for decades [4, 17]. Conventional studies model the problem as a constrained motion planning problem that can be solved by optimizing manually defined loss functions. However, these approaches must simplify the problem with strong mathematical assumptions so that it can be formulated under the optimal control paradigm. During optimization, excessive prior knowledge is added (refer to Sec. II-B), such that motions inconsistent with the priors are penalized. Accordingly, only solutions of a few specific patterns can be obtained, which eliminates the possibility of finding solutions with better patterns.

Compared to previous works, model-free reinforcement learning has two main advantages. First, the control policy can be optimized directly in unstructured environments and under the quadrotor's strongly nonlinear dynamics; convexity of the loss function is still desirable but is no longer a prerequisite. Second, the solution pattern is not biased by the aforementioned handcrafted priors. Instead, the model-free learning paradigm relies only on a reward function that indicates whether the goal has been reached. Therefore, leveraging reinforcement learning makes it possible to obtain a solution better than the suboptimal trajectories drawn from a user-defined solution space.
I-B. Contributions
In this paper, we propose an end-to-end reinforcement learning solver for the quadrotor gap-traversing task. Our approach does not rely on excessive problem-oriented priors. Our methodological contributions are twofold. First, due to the limited exploration ability of current reinforcement learning algorithms, searching for a feasible trajectory is not trivial. To this end, we propose to guide exploration with curriculum learning, with which we acquire feasible trajectories without conventional model-based motion planners. Second, transferring the learned policy to a real quadrotor is challenging due to the low error tolerance of Sim2Real as well as the intractability of collecting real trajectories. To tackle this issue, we propose a novel Sim2Real approach that enables successful transfer without using real flight trajectories.
II. Related Works

II-A. Drone control and planning by reinforcement learning
Reinforcement learning is reportedly a powerful approach for various flight control and planning tasks. Zhang et al. [28] applied Guided Policy Search (GPS) to a quadrotor collision avoidance task; the resulting policy outperforms an offline iterative Linear-Quadratic-Gaussian (LQG) planner and a Model Predictive Control (MPC) planner, both on an ideal quadrotor model and on a model with 5% mass error. Hwangbo et al. [11] demonstrated a method to train a reinforcement learning policy that controls a real-world quadrotor at the motor-thrust level; the learned policy accomplishes hovering, waypoint tracking and posture stabilization from random initial states. Molchanov et al. [21] stabilized a quadrotor using the Proximal Policy Optimization (PPO) algorithm. Lambert et al. [13] stabilized a quadrotor with a model-based reinforcement learning approach. Li et al. [14] designed a reinforcement learning policy that tracks targets. Mannucci et al. [19] proposed two reinforcement-learning-based algorithms to control the attitude of an aircraft.
II-B. Quadrotor traversing through a narrow gap
In previous studies, the gap-traversing task has been solved by Falanga et al. [4] and Loianno et al. [17] within the conventional optimal control framework. Specifically, Falanga et al. [4] modeled the problem as an optimization over a predefined trajectory set: the trajectory is constrained to be a quadratic function that must intersect the gap center, the vehicle's velocity and acceleration during the traverse are manually defined, and the traversal duration is minimized. However, such excessive handcrafted prior knowledge may stifle better solutions, since there is no evidence that the involved constraints generate optimal solutions. The method is also problem-oriented and does not scale to unstructured environments with additional obstacles or irregular gap dimensions. Similarly, Loianno et al. [17] also defined the problem in an optimal control paradigm with excessive priors, i.e., a parabolic trajectory, constant motor thrust, zero angular velocity during the traverse, and a fixed trajectory starting point.
To reduce the dependence on the aforementioned priors, [16] is the first known study to implement gap traversing using reinforcement learning. A neural network is used to imitate trajectories from an optimal control solver, and the solution is then fine-tuned by training in AirSim [24]. The final trajectory pattern is reportedly more diverse than the parabolic trajectories of previous studies [4, 17]. However, the initial trajectory being cloned is still obtained from the optimal control framework with excessive priors, and imitation learning may still end up with locally optimal solutions similar to the demonstrations when exploration is insufficient [6]. Besides, the method is still not detached from optimal control, which requires excessive priors. Therefore, a pure reinforcement learning solver that addresses the problem in an end-to-end paradigm is desired. To the best of our knowledge, our work is the first to accomplish this gap-traversing task in the real world using only a model-free reinforcement learning solver.

III. Task and Method Overview
III-A. Task Statement
Our task is to plan aggressive trajectories for passing through a tilted narrow hole, as demonstrated in Fig. 1.
A direct traverse is not feasible, as shown in Fig. 2(a), where the black rectangle is the quadrotor's bounding box and the gray background rectangle represents a wall with a tilted gap. Fig. 2(b) shows an instance in which the geometric constraint is satisfied; however, the joint force induced by the motor thrusts and the quadrotor's gravity produces an additional horizontal acceleration that may cause a collision, as shown by the red arrows. In addition, the pitch angle used for dashing forward increases the lateral area of the quadrotor's bounding box, which reduces the safe distance margin.
We illustrate our training framework in Fig. 3. Its modules (simulation, Soft Actor-Critic, and Sim2Real) are discussed in Sections IV, V and VI, respectively.
IV. Simulation Environment for Reinforcement Learning
One shortcoming of model-free reinforcement learning is its low data efficiency. Training the policy directly in the real world is impractical because a real quadrotor is too fragile to endure a large number of failed rollouts. Instead, we create a simulation environment using the dynamics described in Sec. IV-A.
IV-A. Quadrotor Dynamics
We model the quadrotor as a rigid body with nonlinear dynamics [26]. The angular acceleration is modeled as Eq. (1):

$$\dot{p} = \frac{\tau_\phi + (I_y - I_z)\,qr}{I_x}, \qquad \dot{q} = \frac{\tau_\theta + (I_z - I_x)\,pr}{I_y}, \qquad \dot{r} = \frac{\tau_\psi + (I_x - I_y)\,pq}{I_z} \tag{1}$$

$\tau_\phi$, $\tau_\theta$, $\tau_\psi$ are the roll, pitch and yaw torques, respectively. $I_x$, $I_y$, $I_z$ are the rotational inertias about the x, y and z axes in the body frame. $p$, $q$, $r$ are the roll, pitch and yaw rates, respectively.
Similarly, we model the translational motion as Eq. (2), where $m$ is the quadrotor's mass, $\mathbf{a}$ is the rigid-body linear acceleration, $g$ is the gravitational constant, $T$ is the total thrust, $\mathbf{R}$ is the rotation matrix from the body to the earth frame, and $\mathbf{F}_d$ is the drag force induced by linear motion, whose components are proportional to the squared body-frame linear velocities $v_x^2$, $v_y^2$, $v_z^2$ along the body x, y and z axes, respectively [18]:

$$m\,\mathbf{a} = \mathbf{R}\begin{bmatrix}0\\0\\T\end{bmatrix} - mg\,\mathbf{e}_3 - \mathbf{R}\,\mathbf{F}_d \tag{2}$$
We use a control distribution matrix to model the mapping from the motor thrusts $f_1,\dots,f_4$ to $T$, $\tau_\phi$, $\tau_\theta$ and $\tau_\psi$, as shown in Eq. (3), where $c_T$ is the thrust coefficient and $c_Q$ is the torque coefficient. Note that the control distribution matrix corresponds to an X-type quadrotor, so the arm length is $d = l/\sqrt{2}$, where $l$ is the horizontal side length of the Oriented Bounding Box (OBB):

$$\begin{bmatrix} T \\ \tau_\phi \\ \tau_\theta \\ \tau_\psi \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ -\frac{d}{\sqrt{2}} & \frac{d}{\sqrt{2}} & \frac{d}{\sqrt{2}} & -\frac{d}{\sqrt{2}} \\ \frac{d}{\sqrt{2}} & -\frac{d}{\sqrt{2}} & \frac{d}{\sqrt{2}} & -\frac{d}{\sqrt{2}} \\ \frac{c_Q}{c_T} & \frac{c_Q}{c_T} & -\frac{c_Q}{c_T} & -\frac{c_Q}{c_T} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \end{bmatrix} \tag{3}$$
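For concreteness, a minimal numpy sketch of Eqs. (1)-(3) is given below. The equations follow the standard rigid-body model cited above; the symbol names, motor numbering, torque signs and the sign-preserving drag form are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def angular_accel(tau, omega, inertia):
    """Eq. (1): Euler's rotational equations.
    tau     -- (tau_phi, tau_theta, tau_psi) body torques [N m]
    omega   -- (p, q, r) roll/pitch/yaw rates [rad/s]
    inertia -- (Ix, Iy, Iz) principal rotational inertias [kg m^2]"""
    p, q, r = omega
    Ix, Iy, Iz = inertia
    return np.array([(tau[0] + (Iy - Iz) * q * r) / Ix,
                     (tau[1] + (Iz - Ix) * p * r) / Iy,
                     (tau[2] + (Ix - Iy) * p * q) / Iz])

def linear_accel(R, T, v_body, m, k_drag, g=9.81):
    """Eq. (2): thrust along body z, gravity, and a drag force whose
    body-frame components are proportional to the squared body velocity
    (sign-preserving, so drag always opposes the motion)."""
    thrust_world = R @ np.array([0.0, 0.0, T])
    drag_body = k_drag * v_body * np.abs(v_body)
    return (thrust_world - R @ drag_body) / m - np.array([0.0, 0.0, g])

def mix_thrusts(f, l, cq_over_ct):
    """Eq. (3): control distribution for an X-type frame. The arm length is
    d = l / sqrt(2) (l = horizontal OBB side), so the moment arm about each
    body axis is d / sqrt(2) = l / 2. Motor numbering and torque signs
    follow one common convention and may differ on a specific airframe."""
    arm = l / 2.0
    T = f.sum()
    tau = np.array([arm * (-f[0] + f[1] + f[2] - f[3]),          # roll
                    arm * ( f[0] - f[1] + f[2] - f[3]),          # pitch
                    cq_over_ct * (f[0] + f[1] - f[2] - f[3])])   # yaw
    return T, tau
```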
IV-B. Environmental State Variables
The state variable used in reinforcement learning consists of the following information: the linear position error toward the goal state ($e_x$, $e_y$ and $e_z$), linear velocities ($v_x$, $v_y$ and $v_z$), roll and pitch angles ($\phi$ and $\theta$), and roll and pitch rates ($p$ and $q$). Note that we do not implement control on the yaw channel and therefore do not feed yaw information to the network. Each entry of the linear position error vector is defined as:

$$e_i = \operatorname{sign}(x_i - x_{g,i})\,\sqrt{\lvert x_i - x_{g,i}\rvert} \tag{4}$$

The subscript $i$ corresponds to the $x$, $y$ and $z$ position channels. $x_i$ is the robot position and $x_{g,i}$ is the position of the goal point, defined in the world frame as a fixed point located 25 centimeters behind the gate's central point. Eq. (4) magnifies the positional error when the quadrotor is close to the gate's center, aiming to enhance the discriminability of the positional feedback in that case.
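A one-line implementation of this transformation, under the signed square-root reading of Eq. (4) adopted above:

```python
import numpy as np

def position_error(x, x_goal):
    """Eq. (4): entry-wise signed square root of the raw positional error.
    For |x - x_goal| < 1 m the mapping expands the error, keeping the
    positional feedback discriminable near the gate's center."""
    d = np.asarray(x, dtype=float) - np.asarray(x_goal, dtype=float)
    return np.sign(d) * np.sqrt(np.abs(d))
```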
IV-C. Reward Design
We use a simple reward function because we do not intend to restrict the solution space with excessive prior knowledge (discussed in Sec. I-A). We use a +1,000 goal reward, which can only be acquired if the quadrotor passes through the hole without any collision detected. The reward scaling comes from parameter tuning. However, because the states behind the gate are difficult to visit, this goal reward alone is too sparse to guide training. An auxiliary penalty reward, negatively proportional to the distance to the goal, is therefore also used. This penalty encourages the quadrotor to move toward the target and significantly improves training stability. Note that the auxiliary reward accumulated over a whole episode is much smaller than the goal reward, because the solution should not be dominated by the auxiliary term. Overall, the reward function is given in Eq. (5):

$$r_t = \begin{cases} +1000, & \text{gap traversed without collision} \\ -k\,\lVert \mathbf{e} \rVert, & \text{otherwise} \end{cases} \tag{5}$$
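A sketch of Eq. (5) in code; the penalty gain k_dist is a hypothetical placeholder, since the text only states that the accumulated penalty must remain far below the +1,000 goal reward:

```python
import numpy as np

def step_reward(reached_goal, collided, pos_error, k_dist=1.0):
    """Eq. (5): sparse goal reward plus a dense distance penalty."""
    r = -k_dist * float(np.linalg.norm(pos_error))  # auxiliary shaping term
    if reached_goal and not collided:
        r += 1000.0  # goal reward for a collision-free traverse
    return r
```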
IV-D. Simulated Gap
The environment includes a wall with a narrow gap. We terminate the simulation episode immediately when a collision between the quadrotor and the wall is detected. For this, we implemented a simple collision checker: the intersection points between the quadrotor's bounding box and the wall are computed in real time, and a collision is recognized if any intersection point lies outside the gap's boundary. A traversal attempt is successful if no collision is detected before the quadrotor reaches the goal position.
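The checker can be sketched as follows, assuming for brevity that the wall lies in the plane x = wall_x and that the gap rectangle is expressed in the wall's in-plane axes (a tilted gap only changes the frame in which the rectangle test is done); all names are illustrative:

```python
import numpy as np

# The 12 edges of a box whose 8 corners are indexed by the 3-bit pattern
# (x-sign, y-sign, z-sign): two corners form an edge iff they differ in one bit.
BOX_EDGES = [(i, i ^ (1 << b)) for i in range(8) for b in range(3)
             if i < (i ^ (1 << b))]

def gap_collision(corners, wall_x, gap_center, gap_half_extent):
    """For every bounding-box edge that crosses the wall plane, the crossing
    point must fall inside the gap rectangle; otherwise report a collision."""
    for i, j in BOX_EDGES:
        a, b = corners[i], corners[j]
        da, db = a[0] - wall_x, b[0] - wall_x
        if da == db or da * db > 0:   # edge parallel to, or on one side of, the wall
            continue
        t = da / (da - db)            # interpolation factor of the crossing
        p = a + t * (b - a)           # intersection point on the wall plane
        if (abs(p[1] - gap_center[0]) > gap_half_extent[0] or
                abs(p[2] - gap_center[1]) > gap_half_extent[1]):
            return True               # crossing point lies outside the gap
    return False
```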
V. Deep Reinforcement Learning

V-A. Soft Actor-Critic Algorithm
Reward sparsity is a challenge for our task because the goal reward behind the gap is difficult to reach. We therefore selected the Soft Actor-Critic (SAC) algorithm [7], which has strong exploration ability due to its entropy term (refer to Eq. (6)). Our preliminary experiments indicate that SAC converges faster than PPO [23] and Deep Deterministic Policy Gradient (DDPG) [15]. Hence, SAC is chosen as the learning algorithm in this paper.
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim\rho_\pi}\Big[ r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \tag{6}$$

where $r(s_t, a_t)$ is the step reward and $s_t$, $a_t$ are the state and action at time step $t$. $\alpha$ is a weight parameter that determines the importance of the entropy term ($\alpha$ is subsumed into the reward by scaling the reward by $\alpha^{-1}$ [7]). The optimal policy is given by Eq. (7):

$$\pi^{*} = \arg\max_{\pi} J(\pi) \tag{7}$$
The soft Q function $Q(s_t, a_t)$ and soft V function $V(s_t)$ are given by the soft learning framework and are defined as Eq. (8) and Eq. (9), where $\gamma$ is the reward discount factor:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}}\big[ V(s_{t+1}) \big] \tag{8}$$

$$V(s_t) = \mathbb{E}_{a_t\sim\pi}\big[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \big] \tag{9}$$
We approximate the policy with a neural network. The policy network has 2 linear hidden layers with 256 units each, and the Rectified Linear Unit (ReLU) activation is used in all hidden layers. We use the reparameterization trick [12] to sample actions, i.e., the action is a deterministic function of the state and a noise signal $\epsilon$ sampled from the Gaussian distribution defined by the network output. We limit the action magnitude of each channel to $(-1, 1)$ with a Tanh function. The overall network structure is given in Fig. 4. The $Q$ and $V$ functions are also approximated with neural networks; both contain 3 hidden layers with 300 units each. To prevent overestimation of the Q value, we follow double-Q learning [9, 8] and approximate $Q$ with the minimum output of two parallel networks.
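A PyTorch sketch of these networks with the stated layer sizes; the log-std clamp range and the exact distribution parameterization are conventional SAC choices assumed here, not taken from the text:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy network: 2 hidden layers of 256 ReLU units, outputting the mean
    and log-std of a Gaussian; actions are reparameterized samples squashed
    into (-1, 1) by tanh."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.mu = nn.Linear(256, act_dim)
        self.log_std = nn.Linear(256, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        u = torch.distributions.Normal(mu, log_std.exp()).rsample()
        return torch.tanh(u)  # reparameterized, squashed action

def q_network(obs_dim, act_dim):
    """One of the two parallel Q networks (3 hidden layers, 300 units each);
    double-Q learning takes the minimum of the two networks' outputs."""
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 300), nn.ReLU(),
                         nn.Linear(300, 300), nn.ReLU(),
                         nn.Linear(300, 300), nn.ReLU(),
                         nn.Linear(300, 1))
```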
We trained all these networks with the Adam optimizer with a batch size of 1024. We find that too small a learning rate may lead to collapsed solution trajectories, because the policy cannot keep up with the update speed of the curriculum (refer to Sec. V-B), while too large a learning rate reduces training stability. The reward discount factor is 0.99. We initialize the weights of the output layers of the $Q$ and $V$ networks with small uniform values, so that the initial estimates of $Q$ and $V$ are roughly zero compared to the relatively large episodic reward. We believe this alleviates the bias in selecting initial actions and may accelerate training.

V-B. Curriculum Learning
We incorporate our proposed curriculum learning framework to address the reward sparsity issue. Curriculum learning [2] is a training technique that divides the training process into a sequence of subtasks of increasing difficulty, which is known to improve convergence by letting the agent learn a simplified problem in the beginning stage [25].
We design a curriculum with two training phases. In phase 1, the gap's dimensions gradually shrink from 1.5 m × 1 m to 1 m × 0.5 m. This phase lasts for 100,000 episodes. We control the gap's dimensions by increasing the difficulty factor $k$ with the episode number $n$, as described in Eq. (10), where $w$ and $h$ are the width and height of the gap:

$$k = \frac{n}{100{,}000}, \qquad w = 1.5 - 0.5\,k, \qquad h = 1.0 - 0.5\,k \tag{10}$$
In phase 2, we adjust the difficulty factor according to Eq. (11). Phase 2 refines the policy under the most difficult configuration; it lasts for 500,000 steps in total and shrinks the gap dimensions from 1.0 m × 0.5 m to 0.6 m × 0.3 m:

$$k = \min\Big(\frac{t}{500{,}000},\, 1\Big), \qquad w = 1.0 - 0.4\,k, \qquad h = 0.5 - 0.2\,k \tag{11}$$
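A possible implementation of the two-phase schedule, assuming the linear interpolation between the stated endpoint dimensions that Eqs. (10)-(11) suggest:

```python
def gap_size(episode, step, phase1_episodes=100_000, phase2_steps=500_000):
    """Return (width, height) of the gap in meters for the current curriculum
    stage. Phase 1 (Eq. 10) shrinks 1.5x1.0 -> 1.0x0.5 over 100,000 episodes;
    phase 2 (Eq. 11) shrinks 1.0x0.5 -> 0.6x0.3 over 500,000 steps."""
    if episode < phase1_episodes:
        k = episode / phase1_episodes
        return 1.5 - 0.5 * k, 1.0 - 0.5 * k
    k = min(step / phase2_steps, 1.0)
    return 1.0 - 0.4 * k, 0.5 - 0.2 * k
```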
The best policy is chosen as the one with the maximum score $\bar{R}_n$, where $\bar{R}_n$ is the exponential moving average of the episode reward at episode $n$.
Curriculum learning changes the environmental configuration as training proceeds, which means the experience stored in the replay buffer may become obsolete. Therefore, we limit the size of our replay buffer to 100,000 transitions and discard the oldest data when the buffer is full. Empirically, the reward curve is stable when the replay buffer size varies from 10,000 to 500,000.
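A bounded FIFO buffer implements this discard rule directly, e.g. with a deque whose maxlen evicts the oldest transition:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer capped at 100,000 transitions: once full, the
    oldest (curriculum-obsolete) experience is discarded automatically."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)   # evicts the oldest item when full

    def sample(self, batch_size=1024):
        return random.sample(self.buffer, batch_size)
```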
VI. Sim2Real Transfer
Discrepancies between the simulated and real quadrotors are non-negligible, so it is difficult to transfer a policy trained in simulation directly to a real quadrotor. To solve this problem, a wide variety of Sim2Real approaches have been proposed [3, 27, 5, 1]. Nevertheless, most of these approaches need real-world data, either in a fine-tuning stage or during training. However, acquiring real-world data is challenging in our case (discussed in Sec. VII). We therefore developed a control framework that enhances generalization without using real-world data.
VI-A. Simulation-to-Real Transfer Framework
An overview of our framework is shown in Fig. 5. The proposed framework is used both in training and in testing. Here we define the network output as a linear and angular acceleration command $\mathbf{a}^{\mathrm{cmd}}_t$, which is then converted into an incremental positional displacement starting from the current position.

Let $\mathbf{p}_t$ and $\mathbf{v}_t$ denote the position and velocity of the quadrotor at time step $t$, respectively. We propose to design the position command as follows:

$$\mathbf{p}^{\mathrm{cmd}}_{t+1} = \mathbf{p}_t + \mathbf{v}_t\,\Delta t + \tfrac{1}{2}\,\mathbf{a}^{\mathrm{cmd}}_t\,\Delta t^2 \tag{12}$$

where $\mathbf{p}^{\mathrm{cmd}}_{t+1}$ denotes the position command for the next time step and $\Delta t$ denotes the interval between the two time steps. $\mathbf{p}^{\mathrm{cmd}}_{t+1}$ is sent to the position controller for execution, and the velocity and position are measured by sensors in real time. In our implementation, the policy network's output is limited to $(-1, 1)$ by a Tanh function; to convert this output back to a physical acceleration command, a scaling parameter $k$ is applied to the policy network's output, with separate values for the angular channels and for the altitude channel.
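The conversion of Eq. (12), together with the per-channel rescaling of the tanh-limited policy output, can be sketched as follows; the gains k_ang and k_alt stand in for the channel scalings, whose values are not reproduced here:

```python
import numpy as np

def acceleration_command(raw_action, k_ang, k_alt):
    """Rescale the policy output from (-1, 1) to physical commands: the
    leading entries are the angular channels, the last is altitude.
    k_ang and k_alt are hypothetical placeholders for the channel gains."""
    return np.concatenate([k_ang * raw_action[:-1],
                           [k_alt * raw_action[-1]]])

def position_command(p, v, a_cmd, dt):
    """Eq. (12): integrate the commanded acceleration one step forward to
    obtain the next position setpoint sent to the position controller."""
    return p + v * dt + 0.5 * a_cmd * dt ** 2
```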
In principle, our approach can work with continuous-output, model-free reinforcement learning algorithms other than SAC, since it does not require any modification of the existing reinforcement learning architecture.
VI-B. Randomization
Randomization is an effective way to enhance the success rate of Sim2Real transfer [1, 10]. In our training, we use two types of randomization: (1) Observation noise that represents the uncertainty of sensors. (2) Dynamics randomization that represents the model inaccuracy.
Noise (1) is modeled as additive noise sampled from zero-mean Gaussian distributions, with the standard deviations given in Table I. The initial state of the quadrotor is also randomized by sampling from zero-mean Gaussian distributions with the standard deviations given in Table II, which enables the planner to start trajectories from a wide region rather than only from the origin.

TABLE I: Standard deviations of the observation noise

position  angle  linear velocity  angular velocity
0.002 m  0.01 rad  0.05 m/s  0.05 rad/s
TABLE II: Standard deviations used to randomize the initial linear velocity, initial angular velocity and initial position.
The dynamics randomization aims to push the learning algorithm to generalize over a wide range of quadrotor parameters. For this, we perturb the parameters with additive zero-mean Gaussian noise, with the standard deviations given in Table III.

TABLE III: Standard deviations used to randomize the rotational inertia and the motors' maximum thrust.
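Both randomizations amount to adding zero-mean Gaussian noise. A sketch using the Table I deviations; the Table II and III deviations are left as inputs since their values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng()

# Table I standard deviations: position, angles, linear and angular velocity.
OBS_STD = np.concatenate([np.full(3, 0.002),   # x, y, z (m)
                          np.full(3, 0.01),    # roll, pitch, yaw (rad)
                          np.full(3, 0.05),    # vx, vy, vz (m/s)
                          np.full(3, 0.05)])   # p, q, r (rad/s)

def noisy_observation(state):
    """Additive zero-mean Gaussian sensor noise on the 12-dim full state."""
    return state + rng.normal(0.0, OBS_STD)

def randomize_dynamics(nominal, std):
    """Zero-mean Gaussian perturbation of dynamics parameters such as the
    rotational inertias and the motors' maximum thrust (Table III)."""
    return {name: value + rng.normal(0.0, std[name])
            for name, value in nominal.items()}
```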

VI-C. Traversing through gaps with various dimensions
To demonstrate the feasibility of our approach, we first evaluate the traversal success rate of our policy for various gap dimensions. The dimensions of our quadrotor are 0.47 m × 0.47 m × 0.23 m, and its dynamics parameters (mass, total thrust, rotational inertia, thrust coefficient and torque coefficient) are consistent with our real quadrotor. Both the training and testing stages are conducted in the simulation we built, which runs on a laptop with an Intel i7-8750H CPU and an Nvidia GTX 1060 GPU. As an example, the tilt angle is fixed at 20 degrees in both training and testing. We evaluate our approach on a wide variety of gap dimensions, with 1,000 episodes evaluated per experiment. The success rates are shown in Table IV.
TABLE IV: Traversal success rate for various gap widths (rows) and heights (columns), in meters

width \ height  0.38  0.36  0.34  0.32  0.30
1.0  95.1%  93.0%  86.4%  70.5%  49.2%
0.9  90.0%  88.5%  83.5%  70.8%  46.6%
0.8  78.4%  75.8%  72.0%  58.6%  40.9%
0.7  45.6%  44.6%  42.8%  36.3%  24.0%
0.6  14.7%  12.6%  13.8%  11.6%  7.9%
We demonstrate the learned policy with plots of the altitude and attitude data (Fig. 6). The pitch angle first increases gradually to obtain a fast dashing speed, and then gradually decreases because a large pitch angle would increase the chance of collision. The quadrotor finally exploits its inertial velocity to traverse the hole.
VI-D. Real-World Experimental Configuration
To show the feasibility of our proposed Sim2Real method, we then test our approach on a real F330 quadrotor. The parameters from model identification are the same as those in Sec. VI-C. The gap is 0.7 m wide and 0.36 m high, with a tilt angle of 20 degrees. We limit the quadrotor's absolute roll/pitch angle to 0.55 rad (about 31.5 degrees) to prevent it from losing altitude due to limited motor thrust.
We use a Vicon motion-capture system to provide position and velocity feedback. The whole reinforcement learning framework runs on an onboard Up Board computer with the Robot Operating System (ROS). The system structure is shown in Fig. 7. The positional channels (outer loops) are controlled at 50 Hz, while the attitude is controlled by the onboard Pixhawk controller at 250 Hz. Our code is released at: https://github.com/arclabhku/reinforcement_learning.
We conducted 37 real-world trials, of which 15 were successful (about 40.5%). This success rate is close to the 44.6% achieved in simulation. Traversal snapshots are shown in Fig. 8, and a video can be found at https://youtu.be/gfAfFnjN18A.
The key state variables in actual flights are shown in Fig. 9. The action pattern closely matches its simulated counterpart, demonstrating that our Sim2Real framework can effectively transfer the policy from simulation to a real quadrotor.
VI-E. Performance without curriculum learning
We show the episodic reward, smoothed by an exponential moving average, in Fig. 10, with 95% confidence intervals. The cyan curve corresponds to the results with curriculum learning enabled, while the pink curve corresponds to the results with curriculum learning removed. Benefiting from curriculum learning, the cyan curve maintains a high reward level throughout training. In comparison, the pink curve shows that the agent is unable to find the goal reward when curriculum learning is removed, indicating that curriculum learning improves both learning speed and stability.
VI-F. Performance without Sim2Real transfer framework
We find it intractable to transfer a policy that directly controls the attitude and altitude channels without our proposed Sim2Real framework. For safety reasons, we only tested this transfer in simulation: we trained the policy on the simulated dynamics model and then transferred it to a quadrotor model controlled by the PX4 firmware in Gazebo. No successful trajectory was achieved in 30 rollouts, whereas in the same scenario we achieve a 44.6% success rate in simulation using the proposed framework.
A planning result in Gazebo is shown in Fig. 11. The attitude and altitude responses are oscillatory, making the commands difficult to track.
VII. Discussions

VII-A. Other Sim2Real Approaches
Other recently proposed approaches mainly include: (1) learning an inverse dynamics model that predicts the required actions directly in the target domain [3]; and (2) learning an adaptive policy that can be fine-tuned with real-world data [5, 22]. Unfortunately, neither approach is effective in our system.
(1) We tried an inverse dynamics model as an attempt at Sim2Real transfer (refer to [3]). However, it is intractable to fit an accurate global model, or local models around aggressive trajectories, because a real quadrotor is fragile and intensive data sampling around aggressive trajectories is therefore infeasible. We also tried Ornstein-Uhlenbeck noise for model identification, but the noise magnitude must be limited for safety. Hence, it is hard to bridge the data distribution gap between the identification phase and the validation phase.
(2) We sought an antidote in fast-adaptation meta-learning by applying the Reptile algorithm [22]. By generating 1,000 quadrotors with dynamics randomization in our simulation, we intended to find a well-initialized model and then fine-tune it with data acquired from the target domain, using a Gazebo environment for the experiment. With 5 shots of training, we achieved at most 3 successful rollouts out of 30, a mundane performance compared to the 10 successful rollouts achieved by our Sim2Real transfer framework.
(3) Other approaches that require real-world data for domain transfer, such as [20], are also intractable to apply here due to the difficulty of sampling a large number of aggressive real-world trajectories: almost any failed trial would damage the quadrotor, e.g., break its propellers.
VII-B. Failure pattern analysis
We aim for the best performance on the real-world quadrotor rather than on its simulated counterpart. We could achieve more than a 90% success rate in simulation by decreasing the noise injected for Sim2Real, but doing so degrades performance on the real quadrotor.
Failures are caused by: 1) inappropriate timing when starting to tilt, which implies that the reinforcement learning agent can still make inaccurate decisions; and 2) inaccurate tracking of the altitude. An error in the altitude channel cannot be reduced swiftly once it emerges, because the time constant of the altitude control channel is larger than those of the attitude channels. Note that the controller has only fractions of a second for stabilization given the quadrotor's high peak dashing speed. An altitude control algorithm with a faster response (such as the incremental nonlinear dynamic inversion in [18]) may contribute to a higher success rate.
VII-C. Generalizability
The proposed Sim2Real transfer framework does not need accurate parameters of the real quadrotor, which makes our approach less dependent on the quadrotor model and easier to generalize. Because of this, the network trained in simulation can be applied to the real quadrotor without training on real data and achieves a success rate similar to that in simulation. The framework can also be generalized to other systems with dynamics similar to the quadrotor's.
VII-D. Limitation
When performing aggressive flights with reinforcement learning, angle and rate limits can be violated. One way to attenuate this issue is to design reward functions that penalize actions violating the rate limit; this attenuates the issue but cannot eradicate it. Our proposed Sim2Real framework takes a further step by always keeping the commanded rate within its maximum range. However, the true maximum rate limit is a function of the quadrotor's state, so simply using a constant limit would be harmful when generalizing to larger tilt angles.
VIII. Conclusion
We proposed a novel deep reinforcement learning framework that enables a quadrotor to pass through narrow gaps without training on real-world data. Two key challenges were addressed: 1) the sparse-reward issue was solved by designing a curriculum learning framework, and 2) the Sim2Real transfer issue was addressed by proposing a novel framework that does not depend on the model parameters. Experimental results showed that the trained policy achieves a similar success rate when applied to the real quadrotor without additional training. Future work will extend our approach to scenarios with larger tilt angles using a more dexterous quadrotor, and will feed the gap's tilt angle into the network input so that the method can address varying tilt angles without retraining.
References
[1] (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39(1), pp. 3–20.
[2] (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
[3] (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.
[4] (2017) Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5774–5781.
[5] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
[6] (2019) Self-imitation learning via trajectory-conditioned policy for hard-exploration tasks. arXiv preprint.
[7] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[8] (2016) Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100.
[9] (2010) Double Q-learning. Advances in Neural Information Processing Systems 23, pp. 2613–2621.
[10] (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4(26).
[11] (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2(4), pp. 2096–2103.
[12] (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.
[13] (2019) Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robotics and Automation Letters 4(4), pp. 4224–4230.
[14] (2017) Learning unmanned aerial vehicle control for autonomous target following. arXiv preprint arXiv:1709.08233.
[15] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[16] (2019) Flying through a narrow gap using neural network: an end-to-end planning and control approach. arXiv preprint arXiv:1903.09088.
[17] (2016) Estimation, control, and planning for aggressive flight with a small quadrotor with a single camera and IMU. IEEE Robotics and Automation Letters 2(2), pp. 404–411.
[18] (2015) Active fault-tolerant control for quadrotors subjected to a complete rotor failure. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4698–4703.
[19] (2017) Safe exploration algorithms for reinforcement learning controllers. IEEE Transactions on Neural Networks and Learning Systems 29(4), pp. 1069–1081.
[20] (2020) Active domain randomization. In Conference on Robot Learning, pp. 1162–1176.
[21] (2019) Sim-to-(multi-)real: transfer of low-level robust control policies to multiple quadrotors. arXiv preprint arXiv:1903.04628.
[22] (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
[23] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[24] (2018) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635.
[25] (2020) Deep Q-learning with Q-matrix transfer learning for novel fire evacuation environment. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
[26] (2017) A practical performance evaluation method for electric multicopters. IEEE/ASME Transactions on Mechatronics 22(3), pp. 1337–1348.
[27] (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
[28] (2016) Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 528–535.