Flying Through a Narrow Gap Using End-to-end Deep Reinforcement Learning Augmented with Curriculum Learning and Sim2Real

Traversing a tilted narrow gap has previously been an intractable task for reinforcement learning, mainly due to two challenges. First, searching for feasible trajectories is not trivial because the goal behind the gap is difficult to reach. Second, the error tolerance after Sim2Real transfer is low because the flight speed is high relative to the gap's narrow dimensions. This problem is aggravated by the intractability of collecting real-world data due to the risk of collision damage. In this paper, we propose an end-to-end reinforcement learning framework that solves this task by addressing both problems. To search for dynamically feasible flight trajectories, we use curriculum learning to guide the agent towards the sparse reward behind the obstacle. To tackle the Sim2Real problem, we propose a Sim2Real framework that can transfer control commands to a real quadrotor without using real flight data. To the best of our knowledge, this is the first work that accomplishes the gap-traversing task purely using deep reinforcement learning.




I Introduction

I-A Problem Background

Aggressive flight can enhance the maneuverability of quadrotors. For instance, in search and rescue applications, quadrotors are required to explore unstructured environments with narrow entries. A quadrotor's activity range can be enlarged if it is capable of flying through narrow gaps that are considered intractable for a classical flight controller. Moreover, battery-powered quadrotors have a limited activity range, and taking shortcuts through narrow gaps instead of long detours around obstacles may reduce power consumption. Accordingly, it is necessary to develop motion planners that can perform aggressive motions.

However, constrained flight motions can be difficult to perform due to the quadrotor's underactuated dynamics. One example is flying through a tilted rectangular gap, during which the quadrotor must avoid collision while satisfying attitude and position constraints simultaneously. Keeping a tilted attitude, however, may induce a large horizontal acceleration, which produces a horizontal position shift and increases the chance of colliding with the bezel. Therefore, finding a feasible trajectory is not trivial.

Aggressive flight planning has been explored for decades [4, 17]. Conventional studies model the problem as a constrained motion planning problem that can be solved by optimizing manually defined loss functions. However, all these approaches have to simplify the problem using strong mathematical assumptions so that it can be formulated under the optimal control paradigm. During optimization, excessive prior knowledge is added (refer to Sec. II-B), such that motions that are inconsistent with the priors are penalized. Accordingly, only solutions of a few specific patterns can be obtained, which eliminates the possibility of finding solutions with better patterns.

Compared to previous works, model-free reinforcement learning has two main advantages. First, the control policy can be optimized directly in unstructured environments and under the quadrotor's strongly non-linear dynamics. Convexity of the loss function is still desirable but is no longer a prerequisite. Second, the solution pattern is not biased by the aforementioned handcrafted priors. Instead, the model-free learning paradigm relies only on a reward function that indicates whether the goal has been reached. Therefore, leveraging reinforcement learning makes it possible to find a solution that is better than the sub-optimal trajectory from a user-defined solution space.

Fig. 1: The quadrotor controlled by reinforcement learning is passing through a tilted narrow gap.

I-B Contributions

In this paper, we propose an end-to-end reinforcement learning solver for the quadrotor's gap-traversing task. Our approach does not rely on excessive problem-oriented priors. The methodological contributions are twofold. First, due to the limited exploration ability of current reinforcement learning algorithms, searching for a feasible trajectory is not trivial. To this end, we propose to guide the exploration using curriculum learning, with which we acquire feasible trajectories without conventional model-based motion planners. Second, transferring our learned policy to a real quadrotor is challenging due to the low error tolerance of Sim2Real as well as the intractability of collecting real trajectories. To tackle this issue, we propose a novel Sim2Real approach that enables successful transfer without using real flight trajectories.

II Related Works

II-A Drone control and planning by reinforcement learning

Reinforcement learning is reportedly a powerful approach for various flight control and planning tasks. Zhang et al. [28] applied Guided Policy Search (GPS) to a quadrotor collision avoidance task. The policy from GPS can outperform an offline iterative Linear–Quadratic–Gaussian (LQG) planner and a Model Predictive Control (MPC) planner, both with an ideal quadrotor model and with a model that has a 5 percent mass error. Hwangbo et al. [11] demonstrated a method to train a reinforcement learning policy that can control a real-world quadrotor at the motor thrust level. The learned policy can accomplish hovering, waypoint tracking and posture stabilization from random initial states. Molchanov et al. [21] stabilized a quadrotor using the Proximal Policy Optimization (PPO) algorithm. Lambert et al. [13] stabilized a quadrotor with a model-based reinforcement learning approach. Li et al. [14] designed a reinforcement learning policy that can track targets. Mannucci et al. [19] proposed two reinforcement learning-based algorithms to control the attitude of the aircraft.

II-B Quadrotor traversing through a narrow gap

In previous studies, the gap-traversing task has been solved by Falanga et al. [4] and Loianno et al. [17] based on the conventional optimal control framework. To be specific, Falanga et al. [4] modeled the problem as an optimization problem on a pre-defined trajectory set. The trajectory is constrained to be a quadratic function that must intersect with the gap center. The vehicle's velocity and acceleration during traversing are manually defined, and the time duration of traversing is reduced by optimization. However, the excessive hand-crafted prior knowledge may prevent better solutions from being found, since there is no evidence that the involved constraints are able to generate optimal solutions. The method is also problem-oriented and does not scale to unstructured environments with additional obstacles or irregular gap dimensions. Similarly, Loianno et al. [17] defined the problem in an optimal control paradigm with excessive priors, i.e. a parabolic trajectory, constant motor thrust, zero angular velocity during traversing, and a fixed trajectory starting point.

To reduce the dependence on the aforementioned priors, [16] is the first known study that implemented gap traversing using reinforcement learning. A neural network is utilized to imitate trajectories from an optimal control solver. The solution is then fine-tuned by training in AirSim [24]. The final trajectory pattern is reportedly more diverse than the parabolic trajectories from previous studies [4, 17]. However, the initial trajectory being cloned is still obtained from the optimal control framework with excessive priors. It is known that, without sufficient exploration, imitation learning may still end up with locally optimal solutions that are similar to the demonstrations [6]. The method is therefore still not detached from optimal control and its excessive priors. A pure reinforcement learning solver that can solve the problem in an end-to-end paradigm is thus desired. To the best of our knowledge, our work is the first that uses only a model-free reinforcement learning solver to accomplish this gap-traversing task in the real world.

III Task and Method Overview

III-A Task Statement

Our task is to plan aggressive trajectories for passing through a tilted narrow hole, as demonstrated in Fig. 1.

A direct traverse is not feasible, as shown in Fig. 2 (a), where the black rectangle is the bounding box of the quadrotor and the gray background rectangle represents a wall with a tilted gap. Fig. 2 (b) shows an instance in which the geometric constraint is satisfied. However, the resultant of the motor thrusts and the quadrotor's weight produces an additional horizontal acceleration that may lead to a collision, as shown by the red arrows. In addition, the pitch angle used for dashing forward increases the lateral area of the quadrotor's bounding box, which reduces the safety margin.

Fig. 2: (a) A direct traverse that cannot be accomplished. (b) One possible traverse which avoids collision with the gap. However, the horizontal acceleration may lead to collision.

Fig. 3 shows our training framework. Its modules (Simulation, Soft Actor-Critic, Sim2Real) are discussed in Sections IV, V, and VI, respectively.

Fig. 3: An overview of our proposed training framework.

IV Simulation Environment for Reinforcement Learning

One shortcoming of model-free reinforcement learning is its low data efficiency. Training the policy directly in the real world is impractical because a real quadrotor is too fragile to endure a large number of failed rollouts. Instead, a simulation environment is created using the dynamics described in Sec. IV-A.

IV-A Quadrotor Dynamics

We model the quadrotor as a rigid body with non-linear dynamics [26]. The angular acceleration is modeled as Eq. (1).


Eq. (1) involves the roll, pitch and yaw torques; the rotational inertia about the x, y and z axes of the body frame; and the roll, pitch and yaw rates.
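As an illustration of the quantities listed above, the following minimal sketch implements Euler's rigid-body rotation equations. It assumes a diagonal inertia matrix and includes the gyroscopic coupling terms; the function and variable names are hypothetical, and the exact form of Eq. (1) in the paper may differ.

```python
def angular_acceleration(torque, inertia, rates):
    """Euler's rotation equations for a rigid body with diagonal inertia.

    torque  -- (tau_x, tau_y, tau_z): roll, pitch, yaw torques
    inertia -- (i_xx, i_yy, i_zz): rotational inertia about body x, y, z
    rates   -- (p, q, r): roll, pitch, yaw rates
    All names are illustrative assumptions, not the paper's notation.
    """
    tau_x, tau_y, tau_z = torque
    i_xx, i_yy, i_zz = inertia
    p, q, r = rates
    # I * omega_dot = tau - omega x (I * omega), written per axis.
    p_dot = (tau_x - (i_zz - i_yy) * q * r) / i_xx
    q_dot = (tau_y - (i_xx - i_zz) * p * r) / i_yy
    r_dot = (tau_z - (i_yy - i_xx) * p * q) / i_zz
    return p_dot, q_dot, r_dot
```

With zero body rates the coupling terms vanish and each angular acceleration reduces to torque divided by inertia.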

Similarly, we model the translational motion as Eq. (2), which involves the quadrotor's mass, the rigid-body linear acceleration, the gravitational constant, the total thrust, the rotation matrix from the body frame to the earth frame, and the drag force induced by linear motion, which is proportional to the squared body linear velocity along the x, y and z axes, respectively [18].


We use a control distribution matrix to model the mapping from the motor thrusts to the total thrust and the roll, pitch and yaw torques, as shown in Eq. (3), which involves the thrust coefficient and the torque coefficient. Note that the control distribution matrix corresponds to an X-type quadrotor, so the arm length is determined by the horizontal side length of the Oriented Bounding Box (OBB).
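The mapping described above can be sketched as follows for an X-type layout. The motor numbering (1 front-right, 2 rear-left, 3 front-left, 4 rear-right, with motors 1 and 2 spinning counter-clockwise), the forward-right-down body frame, and the parameters `lever` and `torque_ratio` are all assumptions made for this illustration; they are not taken from the paper's Eq. (3).

```python
def motor_thrusts_to_wrench(thrusts, lever, torque_ratio):
    """Map four motor thrusts to (total thrust, tau_x, tau_y, tau_z).

    thrusts      -- (f1, f2, f3, f4) in newtons, assumed X-type numbering
    lever        -- perpendicular distance from each motor to the body axes
    torque_ratio -- torque coefficient divided by thrust coefficient
    """
    f1, f2, f3, f4 = thrusts
    total = f1 + f2 + f3 + f4
    # Roll: left motors (2, 3) push the right side down (positive roll).
    tau_x = lever * (-f1 + f2 + f3 - f4)
    # Pitch: front motors (1, 3) raise the nose (positive pitch).
    tau_y = lever * (f1 - f2 + f3 - f4)
    # Yaw: reaction torque of counter-clockwise motors (1, 2) is positive.
    tau_z = torque_ratio * (f1 + f2 - f3 - f4)
    return total, tau_x, tau_y, tau_z
```

Equal thrust on all four motors produces pure lift with zero torque, which is a quick sanity check for any sign convention.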


IV-B Environmental State Variables

The state used in reinforcement learning consists of the following information: the linear position error towards the goal state, the linear velocities, the roll and pitch angles, and the roll and pitch rates. Note that we do not control the yaw channel and therefore do not feed yaw information to the network. Each entry of the linear position error vector is defined by Eq. (4) for the x, y and z position channels, using the robot position and the position of the goal point (defined in the world frame as a fixed point located 25 centimeters behind the gate's central point). Eq. (4) magnifies the positional error when the quadrotor is close to the gate's center, aiming to enhance the discriminability of the positional feedback in that case.

IV-C Reward Design

We use a simple reward function because we do not intend to restrict the solution space with excessive prior knowledge (discussed in Sec. I-A). We use a value of +1,000 as the goal reward. This reward can only be acquired if the quadrotor passes through the hole without any collision being detected. The reward scale was obtained by parameter tuning. However, due to the difficulty of visiting the states behind the gate, this goal reward alone is too sparse to guide the training. We therefore also use an auxiliary penalty that is negatively proportional to the distance to the goal. This penalty encourages the quadrotor to move towards the target and significantly improves training stability. Note that the auxiliary reward accumulated over a whole episode is much smaller than the goal reward, because the solution should not be dominated by the auxiliary reward. Overall, the reward function is given in Eq. (5).
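The reward structure described above can be sketched as follows. The goal reward of 1,000 comes from the text; the penalty coefficient `DISTANCE_WEIGHT` and the function name are hypothetical placeholders, since the exact terms of Eq. (5) are not reproduced here.

```python
import math

GOAL_REWARD = 1000.0    # goal reward stated in the text
DISTANCE_WEIGHT = 0.1   # hypothetical penalty coefficient, not from the paper


def step_reward(position, goal, reached_goal):
    """Sparse goal reward plus a small penalty proportional to distance."""
    distance = math.dist(position, goal)   # Euclidean distance to the goal
    reward = -DISTANCE_WEIGHT * distance   # auxiliary shaping term
    if reached_goal:                       # only set after a collision-free pass
        reward += GOAL_REWARD
    return reward
```

Keeping the coefficient small relative to the goal reward reflects the stated design goal that the accumulated penalty should not dominate the solution.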


IV-D Simulated Gap

The environment includes a wall with a narrow gap. We terminate the simulation episode immediately when a collision between the quadrotor and the wall is detected. For this, we implemented a simple collision checker. The intersection points between the bounding box of the quadrotor and the wall are calculated in real time. A collision is registered if any intersection point lies outside the gap's boundary. A traversing attempt is successful if no collision is detected until the quadrotor has reached the goal position.
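A collision check of this kind could look like the sketch below. It assumes the wall lies in the plane x = 0, the gap is a rectangle centred at the origin of that plane and rotated by the tilt angle, and the bounding box is given as corner points plus index pairs for its edges. These conventions and all names are illustrative, not the paper's implementation.

```python
import math


def plane_intersections(corners, edges):
    """Return (y, z) points where box edges cross the wall plane x = 0."""
    points = []
    for i, j in edges:
        (x0, y0, z0), (x1, y1, z1) = corners[i], corners[j]
        if x0 * x1 > 0 or x0 == x1:
            continue  # both ends on one side, or edge parallel to the wall
        s = x0 / (x0 - x1)  # interpolation factor along the edge
        points.append((y0 + s * (y1 - y0), z0 + s * (z1 - z0)))
    return points


def inside_gap(point, width, height, tilt):
    """True if a wall-plane point lies inside the tilted rectangular gap."""
    y, z = point
    # Rotate the point into the gap's own (untilted) frame.
    u = y * math.cos(tilt) + z * math.sin(tilt)
    v = -y * math.sin(tilt) + z * math.cos(tilt)
    return abs(u) <= width / 2 and abs(v) <= height / 2


def collides(corners, edges, width, height, tilt):
    """A collision occurs if any intersection point is outside the gap."""
    return any(not inside_gap(p, width, height, tilt)
               for p in plane_intersections(corners, edges))
```

A box that does not touch the wall plane yields no intersection points, so `collides` returns False for it.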

V Deep Reinforcement Learning

V-A Soft Actor-Critic Algorithm

Reward sparsity is a challenge for our task since the goal reward behind the gap is difficult to reach. We therefore selected the Soft Actor-Critic (SAC) algorithm [7], which explores strongly thanks to the entropy term in its objective (refer to Eq. (6)). Our preliminary experiments indicate that SAC converges faster than PPO [23] and Deep Deterministic Policy Gradient (DDPG) [15]. Hence, SAC is chosen as the learning algorithm in this paper.


where $r$ is the step reward, $s_t$ and $a_t$ are the state and action at time step $t$, and $\alpha$ is a weight parameter that determines the importance of the entropy term ($\alpha$ is subsumed into the reward by scaling the reward by $\alpha^{-1}$ [7]). The optimal policy is given by Eq. (7):


The soft Q function $Q(s_t, a_t)$ and the soft V function $V(s_t)$ are given by the soft learning framework and are defined in Eq. (8) and Eq. (9), where $\gamma$ is the reward discount factor.
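For reference, the standard maximum-entropy formulation of SAC [7], on which the objective, optimal policy, soft Q function and soft V function discussed here are based, can be written as below. This is a sketch in the notation of [7], assuming the temperature $\alpha$ has been absorbed into the reward for the value functions; the typeset equations of this paper may differ in detail.

```latex
\begin{align}
J(\pi) &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
          \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right] \\
\pi^{*} &= \arg\max_{\pi} J(\pi) \\
Q(s_t, a_t) &= r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V(s_{t+1}) \right] \\
V(s_t) &= \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
\end{align}
```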


We approximate the policy with a neural network. This policy network has 2 linear hidden layers with 256 units in each layer. The Rectified Linear Unit (ReLU) activation function is used in all hidden layers. We use the reparameterization trick [12] to sample actions from a noise signal drawn from a Gaussian distribution defined by the network output. We limit the action magnitude of each channel to (-1, 1) with a Tanh function. The overall network structure is given in Fig. 4.


Fig. 4: Architecture of the policy network that predicts the distribution of actions conditioned on the input state. Reparameterization trick is used for sampling actions.
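The sampling step of such a policy head can be sketched in plain Python as follows. It assumes the network outputs a mean and a log standard deviation per action channel, and it uses the common formulation in which standard-normal noise is scaled by the predicted standard deviation; the function name and this parameterization are assumptions for illustration.

```python
import math
import random


def sample_action(mean, log_std, rng=random):
    """Reparameterized Gaussian sample squashed into [-1, 1] by tanh.

    mean, log_std -- per-channel outputs of a (hypothetical) policy network
    rng           -- the random module or a random.Random instance
    """
    action = []
    for mu, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)             # noise independent of the network
        pre_tanh = mu + math.exp(ls) * eps    # differentiable in mu and log_std
        action.append(math.tanh(pre_tanh))    # bound each channel
    return action
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_std` when the same computation is written in an automatic differentiation framework.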

The $Q$ and $V$ functions are also approximated with neural networks. Both networks contain 3 hidden layers with 300 units in each layer. To prevent overestimation of the Q value, we follow double Q-learning [9, 8] and approximate $Q$ with the minimum output of two parallel networks.
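The way the minimum of two Q estimates enters the learning targets can be sketched as below. The target forms follow the original SAC formulation [7] with the temperature absorbed into the reward; the function names and the `done` flag handling are assumptions for illustration.

```python
def value_target(q1, q2, log_prob):
    """Soft V target: pessimistic Q estimate minus the action log-probability."""
    return min(q1, q2) - log_prob


def q_target(reward, next_value, done, gamma=0.99):
    """Soft Q target: reward plus discounted next-state value.

    gamma = 0.99 matches the discount factor stated in the text; the
    bootstrap term is dropped when the episode has terminated.
    """
    return reward + (0.0 if done else gamma * next_value)
```

Taking the smaller of the two estimates counteracts the upward bias that arises when a single noisy Q network is maximized against.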

We trained all these networks with the Adam optimizer and a batch size of 1024. We find that too small a learning rate may lead to collapsed solution trajectories, since the policy cannot keep up with the update speed of curriculum learning (refer to Sec. V-B), while too large a learning rate may reduce training stability. The reward discount factor is 0.99. We initialize the weights of the output layers of the Q and V networks with uniform values chosen so that the initial estimates of Q and V are roughly zero compared to the relatively large episodic reward. We believe this can alleviate the bias in selecting initial actions and may accelerate the training.

V-B Curriculum Learning

We incorporate curriculum learning to address the reward sparsity issue. Curriculum learning [2] is a training technique that divides the training process into a sequence of subtasks with increasing difficulty. It is known to improve convergence by letting the agent learn on a simplified problem at the beginning [25].

We design a curriculum with two training phases. In phase 1, the gap's dimensions gradually reduce from 1.5 m × 1 m to 1 m × 0.5 m. This phase lasts for 100,000 episodes. We control the gap's dimensions by increasing a difficulty factor with the episode number, as described in Eq. (10), which sets the width and height of the gap.


In phase 2, we adjust the difficulty factor according to Eq. (11). Phase 2 refines the policy under the most difficult configuration. It lasts for 500,000 steps in total and shrinks the gap dimensions from 1.0 m × 0.5 m to 0.6 m × 0.3 m.
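A two-phase schedule of this kind can be sketched as follows. The endpoint dimensions and the phase-1 length come from the text. The linear interpolation is an assumption, since the exact forms of Eq. (10) and Eq. (11) are not reproduced here, and phase 2 is indexed by episode for simplicity.

```python
PHASE1_EPISODES = 100_000   # phase 1 length stated in the text
PHASE2_EPISODES = 500_000   # phase 2 length, treated here as episodes (assumption)


def lerp(start, end, fraction):
    """Linear interpolation between start and end."""
    return start + (end - start) * fraction


def gap_dimensions(episode):
    """Return the gap (width, height) in meters for a given episode index."""
    if episode < PHASE1_EPISODES:
        f = episode / PHASE1_EPISODES
        return lerp(1.5, 1.0, f), lerp(1.0, 0.5, f)
    # Phase 2 continues from the end of phase 1 and saturates at the end.
    f = min((episode - PHASE1_EPISODES) / PHASE2_EPISODES, 1.0)
    return lerp(1.0, 0.6, f), lerp(0.5, 0.3, f)
```

The schedule is continuous at the phase boundary, so the agent never sees a sudden jump in difficulty.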


The best policy is chosen as the one with the maximum score, which is based on the exponential moving average of the episode reward.
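Selecting a checkpoint by a smoothed reward can be sketched as below. The smoothing factor and the use of the smoothed reward itself as the score are assumptions for illustration, since the exact score definition is not reproduced here.

```python
def exponential_moving_average(rewards, smoothing=0.99):
    """Return the running exponential moving average of episode rewards.

    smoothing is a hypothetical factor; larger values smooth more.
    """
    ema, history = None, []
    for r in rewards:
        ema = r if ema is None else smoothing * ema + (1.0 - smoothing) * r
        history.append(ema)
    return history


def best_episode(rewards, smoothing=0.99):
    """Index of the episode whose smoothed reward is highest."""
    history = exponential_moving_average(rewards, smoothing)
    return max(range(len(history)), key=history.__getitem__)
```

Smoothing prevents a single lucky episode from being mistaken for a reliably good policy.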

The curriculum learning changes the environmental configuration as the training proceeds. This means that the experience stored in the replay buffer may be obsolete. Therefore, we limit the size of our replay buffer to 100,000 and discard old data when the replay buffer is full. Empirically, the reward curve is stable when the replay buffer size varies from 10,000 to 500,000.
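A bounded replay buffer that discards the oldest transitions can be sketched with a fixed-length deque. The capacity of 100,000 comes from the text; the class and method names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """First-in, first-out replay buffer with a fixed capacity."""

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._data = deque(maxlen=capacity)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size, rng=random):
        """Sample without replacement, capped at the current buffer size."""
        pool = list(self._data)
        return rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self._data)
```

Copying the deque to a list on every call keeps the sketch short; a production buffer would typically use a preallocated array for constant-time indexing.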

VI Sim2Real Transfer

Discrepancies between the simulation and real quadrotors are non-negligible, which makes it difficult to transfer a policy trained in simulation directly to a real quadrotor. To solve this problem, a wide variety of Sim2Real approaches have been proposed [3, 27, 5, 1]. However, most of these approaches need real-world data in either the fine-tuning stage or the training stage, and acquiring real-world data is challenging in our case (discussed in Sec. VII). We therefore developed a control framework that can enhance generalization without utilizing real-world data.

VI-A Simulation to Real Transfer Framework

An overview of our framework is shown in Fig. 5. The proposed framework is used in both training and testing. The policy outputs a linear and angular acceleration command, which is then converted into an incremental positional displacement starting from the current position.

Fig. 5: Our proposed Sim2Real transfer framework. The position command for the next time step is calculated using the acceleration command and the position and velocity feedback at the current time step.

We design the position command for the next time step from the position and velocity of the quadrotor at the current time step, the acceleration command, and the time interval between the two time steps.

The position command is sent to the position controller for execution. The velocity and position are measured by sensors in real time. In our implementation, the policy network's output is limited to (-1, 1) by a Tanh function. To convert this output back to an acceleration command, a scaling parameter is applied to the output of the policy network, with separate values for the angular channels and the altitude channel.
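One plausible way to turn an acceleration command into a position command is constant-acceleration kinematics over one control interval, sketched below for a single channel. This form is an assumption for illustration and may differ from the paper's own equation; the function names and the example scale are hypothetical.

```python
def scale_action(action, max_accel):
    """Map a policy output in (-1, 1) to an acceleration command."""
    return action * max_accel


def position_command(position, velocity, accel_cmd, dt):
    """Position setpoint for the next step under constant acceleration.

    position, velocity -- measured feedback at the current time step
    accel_cmd          -- acceleration command from the scaled policy output
    dt                 -- time interval between two steps, in seconds
    """
    return position + velocity * dt + 0.5 * accel_cmd * dt * dt
```

For example, at the 50 Hz outer-loop rate mentioned in the real-world setup, `dt` would be 0.02 s, so each command is a small increment around the measured position.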

In principle, our approach can work with continuous-output, model-free reinforcement learning algorithms other than SAC, since it does not require any modification of the existing reinforcement learning architectures.

VI-B Randomization

Randomization is an effective way to enhance the success rate of Sim2Real transfer [1, 10]. In our training, we use two types of randomization: (1) Observation noise that represents the uncertainty of sensors. (2) Dynamics randomization that represents the model inaccuracy.

Noise (1) is modeled as additive noise sampled from zero-mean Gaussian distributions, with the standard deviations given in Table I. The initial state of the quadrotor is randomized by sampling from zero-mean Gaussian distributions with the standard deviations given in Table II, which enables the policy to plan trajectories starting from a wide region rather than only from the origin.

position    angle       linear velocity    angular velocity
0.002 m     0.01 rad    0.05 m/s           0.05 rad/s
TABLE I: Environmental randomization
Initial linear velocity    Initial angular velocity    Initial position
TABLE II: Initialization randomization
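The observation noise of Table I can be applied as in the following sketch. The standard deviations are the values listed in Table I; the dictionary layout of the state and the function name are assumptions for illustration.

```python
import random

# Standard deviations from Table I (m, rad, m/s, rad/s).
NOISE_STD = {
    "position": 0.002,
    "angle": 0.01,
    "linear_velocity": 0.05,
    "angular_velocity": 0.05,
}


def noisy_observation(state, rng=random):
    """Add zero-mean Gaussian noise to each entry of a grouped state.

    state -- dict mapping a quantity name in NOISE_STD to a list of floats
    rng   -- the random module or a random.Random instance
    """
    return {
        name: [x + rng.gauss(0.0, NOISE_STD[name]) for x in values]
        for name, values in state.items()
    }
```

Applying the noise only to what the policy observes, while the simulator keeps the true state, mimics sensor uncertainty without perturbing the dynamics.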

The dynamics randomization aims at pushing the learning algorithm to generalize over a wide range of quadrotor parameters. For this, we use additive zero-mean Gaussian distributions, with the standard deviations given in Table III.

rotational inertia motor’s max thrust
TABLE III: Dynamics randomization

VI-C Traversing through gaps with various dimensions

To demonstrate the feasibility of our approach, we first evaluate the traversing success rate of our policy with various gap dimensions. The dimensions of our quadrotor are 0.47 m × 0.47 m × 0.23 m. The dynamics parameters of the quadrotor (mass, total thrust, rotational inertia, thrust coefficient and torque coefficient) are consistent with our real quadrotor. Both the training and testing stages are conducted in the simulation we built, which runs on a laptop with an Intel i7-8750H CPU and an Nvidia GTX 1060 GPU. The tilt angle is fixed to 20 degrees in both training and testing as an example. We evaluate our approach on a wide variety of gap dimensions, with 1,000 episodes evaluated per experiment. The success rates are shown in Table IV.

width \ height    0.38     0.36     0.34     0.32     0.30
1.0               95.1%    93.0%    86.4%    70.5%    49.2%
0.9               90.0%    88.5%    83.5%    70.8%    46.6%
0.8               78.4%    75.8%    72.0%    58.6%    40.9%
0.7               45.6%    44.6%    42.8%    36.3%    24.0%
0.6               14.7%    12.6%    13.8%    11.6%    7.9%
TABLE IV: Evaluation of the policy in simulation. Success rate (in %) for various gap dimensions (rows: width, columns: height, in meters)

We illustrate the learned policy with plots of the altitude and attitude data (Fig. 6). The pitch angle gradually increases to obtain a fast dashing speed. It then gradually decreases because a large pitch angle may increase the chance of collision. The quadrotor finally relies on its accumulated velocity to traverse the hole.

(a) Roll command and response.
(b) Pitch command and response
(c) Altitude command and response
(d) Trajectory of traversing
Fig. 6: Experimental data for traversing through a 20 degree tilted gap. Quadrotor attitude and altitude data are shown in (a), (b), and (c). (d) shows a trajectory that passes through the narrow gap successfully in our simulation.

VI-D Real World Experimental Configuration

To show the feasibility of our proposed Sim2Real method, we then test our approach on a real F330 quadrotor. The parameters from model identification are the same as the counterparts in Sec. VI-C. The width of the gap is 0.7m and the height is 0.36m. The tilt angle is 20 degrees. We set the quadrotor’s absolute maximum roll/pitch angle as 0.55 rad (about 31.5 degrees) to prevent losing altitude due to limited motor thrust.

We use a Vicon motion-capture system to provide position and velocity feedback. The whole reinforcement learning framework runs on an onboard Upboard computer with the Robot Operating System (ROS). The system structure is shown in Fig. 7. The positional channels (outer loops) are controlled at 50 Hz, while the attitude is controlled by the onboard Pixhawk controller at 250 Hz. Our code is released at:

Fig. 7: The experimental configuration of our real-world experiment. The reinforcement learning controller is on an onboard Upboard computer. A Pixhawk module is used for flight control. Vicon tracker is used for position feedback.

We conducted 37 real-world trials, of which 15 were successful, a success rate of about 40.54%. This is close to the 44.6% we achieved in simulation. The traversing snapshots are shown in Fig. 8. The video can be found at

Fig. 8: Snapshots of the motions performed during the gap-traversing task.

The key state variables in actual flights are shown in Fig. 9. The action pattern closely matches its simulated counterpart, which indicates that our Sim2Real framework can effectively transfer the policy from simulation to a real quadrotor.

(a) Roll command and response.
(b) Pitch command and response.
(c) Altitude command and response.
(d) Trajectory of traversing
Fig. 9: Quadrotor states in a real-world experiment. (a), (b) and (c) show the command and response. (d) is the recorded trajectory that passed the narrow gap successfully.

VI-E Performance without curriculum learning

Fig. 10 shows the smoothed episodic reward with 95% confidence intervals. The cyan curve corresponds to the results with curriculum learning enabled, while the pink curve corresponds to the results with curriculum learning removed. Benefiting from curriculum learning, the cyan curve maintains a high reward level during the whole training process. In comparison, the pink curve shows that the agent is not able to find the goal reward when curriculum learning is removed, indicating that curriculum learning improves both learning speed and stability.

Fig. 10: Comparison of smoothed episodic reward of two training configurations: 1) With curriculum learning 2) Without curriculum learning. Both configurations have the same gap dimension (0.6x0.3) at the 600,000th episode. But only the former case can find the solution trajectory reliably.

VI-F Performance without Sim2Real transfer framework

We find it intractable to transfer a policy that directly exerts control on the attitude and altitude channels without using our proposed Sim2Real framework. For safety considerations, we only tested this transfer in simulation: we trained the policy using the simulated dynamics model and then transferred it to a quadrotor model controlled by PX4 firmware in Gazebo. No successful trajectory was achieved in 30 rollouts, while in the same scenario we achieve a success rate of 44.6% in simulation using the proposed framework.

A planning result in Gazebo is shown in Fig. 11. The attitude and altitude responses are oscillatory, making it difficult to track the commands.

(a) Roll command and response.
(b) Pitch command and response.
(c) Altitude command and response.
(d) Trajectory of traversing
Fig. 11: Quadrotor response from a failure trajectory without using our proposed sim2real transfer framework in Gazebo environment. (a), (b) and (c) show the command and response. The commands are oscillatory, which leads to task failure. (d) is the corresponding recorded trajectory. The quadrotor collided with the wall.

VII Discussions

VII-A Other Sim2Real Approaches

Other recently proposed approaches mainly include: (1) learning a model of inverse dynamics that can predict the required actions directly in the target domain [3]; (2) learning an adaptive policy that can be fine-tuned with real-world data [5, 22]. Unfortunately, none of these approaches is effective in our system.

(1) We tried an inverse dynamics model for the Sim2Real transfer (refer to [3]). However, it is intractable to fit an accurate global model or local models around aggressive trajectories, because a real quadrotor is fragile and intensive data sampling around aggressive trajectories is therefore not feasible. We tried to use Ornstein-Uhlenbeck noise for model identification, but the noise magnitude also has to be limited due to safety considerations. Hence, it is hard to bridge the data distribution gap between the identification phase and the validation phase.

(2) We also explored fast adaptive meta-learning by applying the Reptile algorithm [22]. By generating 1,000 quadrotors with dynamics randomization in our simulation, we intended to find a well-initialized model and then fine-tune it with data acquired from the target domain. We use a Gazebo environment for the experiment. Using 5 shots of training, we achieve at most 3 successful rollouts out of 30, which is a modest performance compared to the 10 successful rollouts achieved by our Sim2Real transfer framework.

(3) Other approaches that require real-world data for domain transfer, such as [20], are also intractable to apply due to the difficulty of sampling a large number of aggressive trajectories in the real world, because almost any failed trial would damage the quadrotor, e.g. break propellers.

VII-B Failure pattern analysis

We aim to get the best performance on a real-world quadrotor rather than on its simulated counterpart. We can achieve a success rate of more than 90% in our simulation if we decrease the noise injected for Sim2Real, but this would degrade the performance on a real quadrotor.

Failures are caused by: 1) inappropriate timing to start tilting, which implies that the reinforcement learning agent can still make inaccurate decisions; 2) inaccurate tracking of the altitude. Once an error emerges in the altitude channel, it cannot be reduced swiftly, because the time constant of the altitude control channel is larger than those of the attitude channels. Note that the controller only has fractions of a second for stabilization because of the high peak dashing speed of our quadrotor. A better altitude control algorithm with a faster response (such as the incremental nonlinear dynamic inversion in [18]) may contribute to a higher success rate.

VII-C Generalizability

The proposed Sim2Real transfer framework does not need accurate parameters of the real quadrotor, which makes the proposed approach less dependent on the quadrotor model and easy to generalize. Because of this, the network trained in simulation can be successfully applied to the real quadrotor without training on real data and achieves a success rate similar to that in simulation. This demonstrates the generalizability of the proposed approach. The proposed Sim2Real transfer framework can be generalized to systems with dynamics similar to those of the quadrotor.

VII-D Limitation

When performing aggressive flights with reinforcement learning, angle and rate limits can be violated. One way to attenuate this issue is to design reward functions that penalize actions violating the rate limit, but this cannot eradicate it. Our proposed Sim2Real framework goes a step further by always keeping the rate within its maximum limit. However, the maximum rate limit is a function of the quadrotor state, so simply using a constant rate limit would be harmful when generalizing to larger tilt angles.

VIII Conclusion

We proposed a novel deep learning framework that enables a quadrotor to pass through narrow gaps without training on real-world data. Two key challenges were addressed: 1) the sparse reward issue was solved by designing a curriculum learning framework, and 2) the Sim2Real transfer issue was addressed by proposing a novel framework that does not depend on the model parameters. Experimental results showed that the trained policy can achieve a similar success rate when applied to the real quadrotor without additional training. Future work will extend our approach to scenarios with larger tilt angles using a more dexterous quadrotor, and feed the gap's tilt angle to the network input, which would allow the proposed method to handle varying tilt angles without re-training the model.


  • [1] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §VI-B, §VI.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §V-B.
  • [3] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518. Cited by: §VI, §VII-A, §VII-A.
  • [4] D. Falanga, E. Mueggler, M. Faessler, and D. Scaramuzza (2017) Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 5774–5781. Cited by: §I-A, §II-B, §II-B.
  • [5] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §VI, §VII-A.
  • [6] Y. Guo, J. Choi, M. Moczulski, S. Bengio, M. Norouzi, and H. Lee (2019) Self-imitation learning via trajectory-conditioned policy for hard-exploration tasks. arXiv, pp. arXiv–1907. Cited by: §II-B.
  • [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §V-A, §V-A.
  • [8] H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. Cited by: §V-A.
  • [9] H. Hasselt (2010) Double q-learning. Advances in neural information processing systems 23, pp. 2613–2621. Cited by: §V-A.
  • [10] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26). Cited by: §VI-B.
  • [11] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103. Cited by: §II-A.
  • [12] D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in neural information processing systems, pp. 2575–2583. Cited by: §V-A.
  • [13] N. O. Lambert, D. S. Drew, J. Yaconelli, S. Levine, R. Calandra, and K. S. Pister (2019) Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robotics and Automation Letters 4 (4), pp. 4224–4230. Cited by: §II-A.
  • [14] S. Li, T. Liu, C. Zhang, D. Yeung, and S. Shen (2017) Learning unmanned aerial vehicle control for autonomous target following. arXiv preprint arXiv:1709.08233. Cited by: §II-A.
  • [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §V-A.
  • [16] J. Lin, L. Wang, F. Gao, S. Shen, and F. Zhang (2019) Flying through a narrow gap using neural network: an end-to-end planning and control approach. arXiv preprint arXiv:1903.09088. Cited by: §II-B.
  • [17] G. Loianno, C. Brunner, G. McGrath, and V. Kumar (2016) Estimation, control, and planning for aggressive flight with a small quadrotor with a single camera and imu. IEEE Robotics and Automation Letters 2 (2), pp. 404–411. Cited by: §I-A, §II-B, §II-B.
  • [18] P. Lu and E. van Kampen (2015) Active fault-tolerant control for quadrotors subjected to a complete rotor failure. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4698–4703. Cited by: §IV-A, §VII-B.
  • [19] T. Mannucci, E. van Kampen, C. de Visser, and Q. Chu (2017) Safe exploration algorithms for reinforcement learning controllers. IEEE transactions on neural networks and learning systems 29 (4), pp. 1069–1081. Cited by: §II-A.
  • [20] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2020) Active domain randomization. In Conference on Robot Learning, pp. 1162–1176. Cited by: §VII-A.
  • [21] A. Molchanov, T. Chen, W. Hönig, J. A. Preiss, N. Ayanian, and G. S. Sukhatme (2019) Sim-to-(multi)-real: transfer of low-level robust control policies to multiple quadrotors. arXiv preprint arXiv:1903.04628. Cited by: §II-A.
  • [22] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §VII-A, §VII-A.
  • [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §V-A.
  • [24] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2018) Airsim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pp. 621–635. Cited by: §II-B.
  • [25] J. Sharma, P. Andersen, O. Granmo, and M. Goodwin (2020) Deep q-learning with q-matrix transfer learning for novel fire evacuation environment. IEEE Transactions on Systems, Man, and Cybernetics: Systems. Cited by: §V-B.
  • [26] D. Shi, X. Dai, X. Zhang, and Q. Quan (2017) A practical performance evaluation method for electric multicopters. IEEE/ASME Transactions on Mechatronics 22 (3), pp. 1337–1348. Cited by: §IV-A.
  • [27] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332. Cited by: §VI.
  • [28] T. Zhang, G. Kahn, S. Levine, and P. Abbeel (2016) Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 528–535. Cited by: §II-A.