Learning a Controller Fusion Network by Online Trajectory Filtering for Vision-based UAV Racing

04/18/2019 ∙ by Matthias Müller, et al. ∙ 0

Autonomous UAV racing has recently emerged as an interesting research problem. The dream is to beat humans in this new fast-paced sport. A common approach is to learn an end-to-end policy that directly predicts controls from raw images by imitating an expert. However, such a policy is limited by the expert it imitates and scaling to other environments and vehicle dynamics is difficult. One approach to overcome the drawbacks of an end-to-end policy is to train a network only on the perception task and handle control with a PID or MPC controller. However, a single controller must be extensively tuned and cannot usually cover the whole state space. In this paper, we propose learning an optimized controller using a DNN that fuses multiple controllers. The network learns a robust controller with online trajectory filtering, which suppresses noisy trajectories and imperfections of individual controllers. The result is a network that is able to learn a good fusion of filtered trajectories from different controllers leading to significant improvements in overall performance. We compare our trained network to controllers it has learned from, end-to-end baselines and human pilots in a realistic simulation; our network beats all baselines in extensive experiments and approaches the performance of a professional human pilot. A video summarizing this work is available at https://youtu.be/hGKlE5X9Z5U




1 Introduction

Recent advances in UAV technology by industry leaders such as DJI, Amazon and Intel make UAV design and point-to-point stabilized flight navigation appear to be a well-solved problem. However, autonomous navigation of UAVs in more complex and real-world scenarios, such as in unknown congested environments, GPS-denied areas, and narrow spaces, is still far from being solved. This is a complex problem, since it requires both the sensing of the environment and the execution of appropriate control policies for interaction at low latencies, running on on-board hardware with typically very limited computational resources. The emerging sport of UAV racing displays a lot of these real-world navigation challenges, and is one of the areas where the performance gap between human pilots and machine-driven navigation approaches is most evident. UAV racing requires human pilots or agents to sequentially control the UAV to fly through a race track based on the feedback (visual information, physical measurements, or both) of previous actions. It requires control over six degrees of freedom (6-DoF) at high speeds while traversing tight spaces, and passing consistently through racing gates. These complex sense-and-understand tasks are conducted at extreme speeds reaching over 100 km/h, and thus can serve as a controlled and challenging benchmark for machine-driven agile navigation approaches.

Figure 1: UAV flying in Sim4CV while being controlled by our controller fusion network.

One of the more difficult tasks in UAV racing is the prediction of the proper trajectory in order to traverse a course while maintaining high speeds. In earlier work, either a PID controller or model predictive controller with Kalman filters has been used. However, in practice a single controller cannot usually cover the whole state space. In certain states the controller may not perform as expected or even fail. In this paper, we propose the fusion of multiple controllers using a DNN to cover a much larger state space and to outperform any single controller.

Figure 2: Our complete system consists of a perception network and a controller fusion network (CFN). The perception network predicts local trajectories from a monocular RGB image. The CFN takes the predicted trajectory and UAV state as input and outputs the low-level controls: throttle (T), roll/aileron (A), pitch/elevator (E), yaw/rudder (R). The CFN is trained by fusing filtered trajectories from multiple classical controllers.

However, naively fusing the trajectories of different controllers incorporates both their good and their bad trajectories, since the controllers vary widely in quality and each has limitations in specific scenarios. We overcome this problem by storing trajectories in an online buffer that can be accessed by our deep neural network (DNN) during learning. By filling this buffer with only good trajectories, we are able to trim out failing trajectories. The network learns an optimized controller by sampling from the buffer containing these pre-filtered trajectories. The result is a network that is able to learn a good fusion of filtered trajectories from different controllers, allowing a much larger coverage of the whole state space. We show that this leads to significant improvements in overall performance.

We use Sim4CV [16] to simulate UAV racing with accurate physics using the Unreal Engine 4 (UE4) [17]. We also develop a customizable racing area in the form of a stadium based on a 3D scan of a real-world location. Race tracks are generated within a GUI and can be loaded at game-time to allow training on multiple tracks automatically. Inspired by recent work on self-driving cars [4], we aim to imitate UAV racing at a professional level. The key difference is that we train a perception network that predicts desired trajectories rather than controls. We then train a separate neural network to produce the low-level controls. We call this network the Controller Fusion Network (CFN) and train it by fusing filtered trajectories from multiple controllers. Through extensive experiments in simulation we demonstrate that CFN outperforms the controllers it learned from, end-to-end baselines, and even human pilots flying via a remote control used for real UAVs.

Contributions. (1) We propose to learn the fusion of trajectories by a neural network. This allows for combining trajectories from different controllers in a principled way, and separates the control from the perception task. (2) By implementing online trajectory filtering we are able to learn from multiple noisy trajectories without incorporating their imperfections. While the control task is learned online, a buffer/memory also allows for semi-offline training. Our approach leads to a robust network outperforming several state-of-the-art approaches and human pilots.

2 Related Work

The use of deep neural networks (DNNs) to control UAVs dates back to work on learning acrobatic helicopter flight [1]. More recent work has studied training UAV controller networks with supervised learning (SL), reinforcement learning (RL) or combined methods, but with a focus on indoor flight, collision avoidance, and trajectory optimization [9, 13, 20, 22, 2, 11, 21, 10, 17, 3, 23]. An important insight from [13, 9, 12, 20] is that a trajectory optimizer such as a Model Predictive Controller (MPC) can function similarly to traditional SL to help regress an agent's sub-optimal policy towards the optimal one with far fewer iterations. By jointly learning perception and control with the self-supervision of MPC, full end-to-end navigation and collision avoidance can be learned. Recently, [10] implemented such an approach, where a DNN is used to predict a trajectory and an MPC is used to output the motor control of the UAV. Although not functioning at racing speeds, initial experiments demonstrate the advantages of such a setup. A major limitation of this approach is that the DNN is only used for perception, while the MPC requires extensive setup, tuning, and full knowledge of the UAV dynamics. We propose in this paper that a DNN can also be applied to control, allowing the UAV dynamics to be inherently learned.

Our network is able to learn from imperfect proportional-integral-derivative (PID) controllers, allowing both self-supervision and extensive exploration. The network training with extensive exploration is most similar to DAGGER [19] and its variants [3, 24, 18, 5]. Our approach differs in that our control network does not learn strictly from the controller. Our network learns the appropriate actions of the PID controllers at each time step and selects the best predictions based on the filtered trajectories. The buffering strategy is similar to DQN's [15] "experience replay mechanism" in that it stores a limited set of experiences in a buffer and then selects samples randomly. However, our buffer is dynamic: it is updated during online training and continually filtered so that it contains only good samples. By design, our RL-motivated training process enables extensive exploration by only observing the best behavior from various controllers. It differs from [12] in that the controllers never have to deviate from their optimal control to induce exploration. Also, unlike trajectory optimization [7], the trajectory in our setup is defined without a known global 3D position. This enables the prediction of local waypoints without needing precise knowledge of the UAV's current state and dynamics, which are rarely available in the real world. Similar to adaptive trajectory optimization, our predicted waypoints are updated every time step, allowing for adaptation to environment changes.

3 Methodology

In this section, we introduce the setup of the Controller Fusion Network (CFN), which enables automatic removal of bad trajectory segments from imperfect controllers by filtering bad samples in an online manner before the CFN is updated. We build a modular system for UAV racing (see Figure 2) that separates the task into a high-dimensional perception module and a low-dimensional control module (CFN). The perception network predicts trajectories, which are used by the CFN along with the UAV state to produce low-level controls.

3.1 Controller Fusion Network (CFN)

We learn a Controller Fusion Network agent with a learning strategy that integrates knowledge from multiple controllers and the dynamic environment into the learning process (refer to Algorithm 1 for a detailed description of training). At each time step, the agent receives a state (or partial state) from the environment and executes an action. The trajectory of the agent's behavior is thus the sequence of these state-action pairs. The CFN policy is a parameterized function that maps the state to a deterministic action, which can be continuous or discrete. In our case, the action is a 4D continuous control signal for a UAV. Each PID controller's policy also maps the state to an action. In practice, the controller can be either an automated controller or a human, and the demonstration can be performed online or offline from a recording. In this paper, we use two PID (proportional-integral-derivative) controllers, thus avoiding the need for hours of human-recorded control.

Initialize controllers and CFN with random weights;
Initialize the CFN training database and the CFN temporary buffers corresponding to the controllers; // one buffer for each controller
for each episode do
        Initial state provided by the environment;
        for each time step of the episode do
               Controller demonstrates an action for the current state;
               Execute controller action; observe new state and feedback;
               Update the controller's temporary buffer;
               Discard unwanted demonstrations from the temporary buffer (according to buffering strategy);
               Add desirable demonstrations from the temporary buffer to the training database;
               Sample mini-batch from the training database and perform SGD to minimize the loss;
               Break if the new state is a terminal state;
        end for
end for
Algorithm 1 Controller Fusion Network (CFN) during training.

To clarify which demonstrations we should learn from, we refer to the demonstrations that lead the CFN agent to good behavior as desirable demonstrations, the ones that lead to bad behavior as unwanted demonstrations, and the uncertain ones as unforeseeable demonstrations. We introduce a temporary buffer to store the demonstrations of the active controller. At each time step, by observing the feedback of the interactions between controller and environment, the CFN agent can decide whether to discard unwanted demonstrations from this buffer, retain unforeseeable demonstrations in it, or augment its training data by moving desirable demonstrations from it to the training database. We call this operation the buffer strategy. Note that the CFN agent maintains one temporary buffer for each controller. See Figure 4 for details.

Figure 3: Visualization of the CFN buffer strategy.

CFN buffer strategy.  We use two simple PID controllers. The CFN agent maintains two temporary buffers B, one for each controller. At each time step, the state-action pair of each controller is buffered into its B. The CFN agent then decides, based on the buffer strategy, which state-action pairs to add to the training database and which to discard. Finally, the CFN agent performs an SGD update on the training database. A schematic diagram of the buffer used during training is shown in Figure 3.
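The buffer strategy can be illustrated with a minimal Python sketch. It assumes the rule described in Section 4.2 (flush the buffer when the controller leaves the track, otherwise promote the oldest buffered pair) and further assumes that promotion only starts once the buffer is full. The class name `TemporaryBuffer`, the `on_track` flag, and the buffer size of 3 are hypothetical choices for illustration, not values from the paper.

```python
from collections import deque


class TemporaryBuffer:
    """Holds the most recent state-action pairs of one controller."""

    def __init__(self, size):
        self.size = size      # assumed: number of steps a pair must survive
        self.pairs = deque()

    def step(self, pair, on_track, database):
        """Buffer one pair and apply the filtering rule for this time step."""
        if not on_track:
            # Controller left the track: everything still buffered is unwanted.
            self.pairs.clear()
            return
        self.pairs.append(pair)
        if len(self.pairs) > self.size:
            # The oldest pair was followed by `size` on-track steps: desirable.
            database.append(self.pairs.popleft())


database = []
buffer = TemporaryBuffer(size=3)
# Six on-track steps, one step off the track, then two on-track steps.
on_track_flags = [True] * 6 + [False] + [True] * 2
for t, ok in enumerate(on_track_flags):
    buffer.step(("state", t), ok, database)

promoted = [t for _, t in database]
print(promoted)  # steps 3, 4 and 5 were flushed before they could be promoted
```

In this toy run only the pairs that were followed by enough on-track steps reach the database; the pairs recorded shortly before the failure are dropped together with the failure itself.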

The training database can be viewed as a distilled set of demonstrations collected by applying the buffering strategy to the temporary buffers at each time step. Our goal is to train a CFN policy that minimizes the loss function on this database.

The total loss is the sum of the perception loss and a scaled control loss. Its learnable parameters are those of the perception network and of the controller fusion network, respectively; a scalar weight scales the control loss; the groundtruth waypoints supervise the perception loss; and the inputs are the image state and the UAV state.
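The decomposition of the total loss can be written compactly as follows. The symbols are assumptions introduced here for illustration, since the paper's original notation is not available in this text: $\theta_p$ and $\theta_c$ for the parameters of the perception and controller fusion networks, $\lambda$ for the scale of the control loss, $w^{*}$ for the groundtruth waypoints, and $I$ and $x$ for the image state and UAV state.

```latex
\mathcal{L}(\theta_p, \theta_c)
  = \mathcal{L}_p\!\left(\theta_p;\, I,\, w^{*}\right)
  + \lambda \, \mathcal{L}_c\!\left(\theta_c;\, x\right)
```

Under this reading, $\lambda$ only balances the two terms; each network's parameters appear in a single term, which matches the modular separation of perception and control.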

4 Network Architecture and Training Details

Our overall network architecture is depicted in Figure 2. It consists of two networks, one for perception and one for control. The perception network outputs a local trajectory and the control network produces low-level controls given this trajectory and the UAV state as input.

Figure 4: Illustration of the buffering strategy. In Trajectory 1 the PID controller remains on track: the sample becomes a desirable sample and is added to the ground truth database. In Trajectory 2 the PID controller leaves the track: the samples are discarded.

4.1 Perception

Network Architecture.  For our perception network, we use a network architecture that is inspired by the one used by Bojarski et al. [4] for autonomous driving. However, we make changes to accommodate the complexity of the task at hand and to improve robustness in training. Our network architecture is shown in Figure 2 as the perception module. It consists of eight layers: five convolutional layers followed by three fully-connected layers. Instead of pooling operations, we use convolutional strides of two to downsample the input consecutively, inspired by more recent architecture designs such as MobileNet.

Training Details.  In contrast to the network by Bojarski et al. [4], our network regresses local trajectories rather than raw controls. Predicting raw controls has several shortcomings: raw controls are specific to a vehicle and controller, and different sequences of controls can lead to the same trajectory, which makes data augmentation very difficult. By regressing trajectories instead, we are able to train a very robust perception network and can validate it with ground-truth trajectories, which are well defined. Our perception network takes images from a monocular RGB camera as input and outputs a trajectory relative to the current position of the UAV. We represent the trajectory with five uniformly sampled points. We regress these points to the groundtruth (waypoints along the center line of the track) using an L2 loss, with dropout in the fully-connected layers. We implement our model in TensorFlow and train it using the Adam optimizer. Given the dynamic nature of the racing task, we use the maximum frame rate of 60 fps, unlike other works that downsample [6, 4, 22]. Note that when predicting controls directly from the input image, a high frame rate can be problematic, since fast transitions in control can lead to similar images with different labels, causing a regression towards averages. Since the image-to-trajectory correspondence is well defined and more stable, our approach is not affected by sampling rate variations.
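The waypoint regression objective can be sketched in a few lines of plain Python. This is only an illustration: it assumes 2D waypoints in the UAV's local frame and averages the squared distance over the five points, neither of which is specified in the text, and the function name `waypoint_l2_loss` is hypothetical.

```python
def waypoint_l2_loss(predicted, groundtruth):
    """Mean squared Euclidean distance between matching waypoints."""
    assert len(predicted) == len(groundtruth)
    total = 0.0
    for p, g in zip(predicted, groundtruth):
        # Squared distance between one predicted and one groundtruth point.
        total += sum((pi - gi) ** 2 for pi, gi in zip(p, g))
    return total / len(predicted)


# Five groundtruth points along a (made-up) track center line.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5), (4.0, 1.0), (5.0, 2.0)]
# A prediction that is shifted sideways by 0.5 at every waypoint.
pred = [(x, y + 0.5) for x, y in gt]

loss = waypoint_l2_loss(pred, gt)
print(loss)
```

Because the label is a geometric quantity, two visually similar frames map to similar waypoint targets, which is the stability property the paragraph above relies on.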

4.2 Control

Here, we present the details of our controller fusion network, including network architecture and training strategy. The network takes the predicted trajectory from the perception network and the UAV state (orientation and velocity) as input and predicts the four UAV controls: throttle (T), roll/aileron (A), pitch/elevator (E), yaw/rudder (R).

Learning the Controller Fusion Network.  In our experiments, we use two naive PID controllers and denote them as PID1 and PID2. We briefly tune the two PID controllers on the training tracks. It is not necessary to tune the controllers to be optimal on all training tracks, since the CFN is robust enough to learn from sub-optimal controllers. The first PID controller (PID1) is conservative: it accurately follows the center of the track and flies through gates precisely, but at relatively low speeds. Its output control values are a function of the first predicted waypoint and the UAV state. The second PID controller (PID2) is more dynamic: it flies at maximum speed but often overshoots gates on sharp turns due to inertia and limitations of the UAV. Its output control values are a function of the fourth predicted waypoint and the UAV state. We use a three-layer fully connected network to approximate the policy of the CFN agent. The state of the CFN agent is a vector concatenation of the predicted waypoints and the UAV state (physical measurements).

As such, the state of the CFN agent comprises all predicted waypoints together with the UAV state, the state of the conservative PID comprises the first predicted waypoint and the UAV state, and the state of the aggressive PID comprises the fourth predicted waypoint and the UAV state. At each time step, the state-action pair of each controller is stored in that controller's temporary buffer.
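The difference between the two controllers' inputs can be sketched with a textbook PID loop that steers toward a chosen waypoint. This is a simplified, yaw-only illustration rather than the controllers used in the paper: the gains, the UAV-frame convention (x forward, y left), the clipping range, and the helper names `PID`, `heading_error` and `yaw_command` are all assumptions.

```python
import math


class PID:
    """Textbook PID controller on a scalar error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def heading_error(waypoint):
    """Angle to a waypoint given in the UAV frame (x forward, y left)."""
    x, y = waypoint
    return math.atan2(y, x)


def yaw_command(pid, waypoints, lookahead_index, dt=1.0 / 60.0):
    """Steer toward the waypoint at `lookahead_index`, clipped to [-1, 1]."""
    u = pid.update(heading_error(waypoints[lookahead_index]), dt)
    return max(-1.0, min(1.0, u))


# Five predicted waypoints for a track that bends to the left.
waypoints = [(1.0, 0.0), (2.0, 0.1), (3.0, 0.4), (4.0, 1.0), (5.0, 2.0)]
conservative = yaw_command(PID(1.0, 0.0, 0.0), waypoints, lookahead_index=0)
aggressive = yaw_command(PID(1.0, 0.0, 0.0), waypoints, lookahead_index=3)
print(conservative, aggressive)
```

With identical gains, the controller that looks at the fourth waypoint already reacts to the upcoming bend while the one that looks at the first waypoint still flies straight, which gives a feel for why the two controllers behave so differently.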

An illustration of the buffering strategy is shown in Figure 4. At any time step, the temporary buffer holds the most recent samples, which are still unforeseeable, while the samples that precede them have already been stored in the ground truth training database. Figure 4 illustrates two cases of buffering operations.

Network Architecture and Training Details.  The goal of the control network is to find a control policy that minimizes the control loss over the learnable parameters of the controller fusion network. We use TensorFlow to implement our CFN. A three-layer fully-connected network is used to represent the CFN. To regularize the network, we apply dropout to the second layer with a ratio of 0.5. A weighted loss is used as our loss function. We use the Adam optimizer to train the CFN in an online fashion while the UAV flies through a training track.

A temporary buffer of fixed size stores the most recent state-action pairs from each controller, and each controller has its own buffer size. The buffer size mainly depends on the quality of the controller's trajectories. For example, suppose an aggressive PID controller overshoots or crashes with high probability within a certain number of steps on a straight and within a different number of steps at a bend. A buffer that is too small would lead the CFN agent to learn dangerous behaviours, while a buffer that is too large would make the CFN agent unable to benefit from the speed advantages of the aggressive PID controller on straightaways. In our case, we choose the buffer size of the conservative PID controller so as to benefit from its accuracy, and that of the aggressive PID controller so as to filter out its dangerous behaviors in curves.

To improve learning, we add Ornstein-Uhlenbeck process noise [14] to the output of the controller fusion network to allow for exploration at the beginning of training. During this phase, the UAV is moved by the agent's control predictions, but the action is labeled by the conservative PID. After the initial exploration, the controllers take over control. At each time step, if a controller leaves the track, its temporary buffer is flushed and no state-action pairs are added to the training database. If the controller keeps the UAV within the track boundaries, the oldest state-action pair in its temporary buffer is added to the training database, as shown in Figure 4. At each time step, the network parameters are updated by backpropagation to minimize the difference between the CFN's control policy and the demonstrations (state-action pairs) in the training database, which are considered trustworthy after unwanted behaviors have been discarded.
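The exploration noise mentioned above can be sketched with a standard Euler-Maruyama discretization of an Ornstein-Uhlenbeck process. The parameters `theta`, `sigma`, `mu`, the 60 Hz step, the seed, and the example control value of 0.3 are illustrative assumptions, not settings reported in the paper.

```python
import random


def ou_noise(steps, theta=0.15, sigma=0.2, mu=0.0, dt=1.0 / 60.0, seed=0):
    """Sample a discretized Ornstein-Uhlenbeck process starting at `mu`."""
    rng = random.Random(seed)
    x = mu
    samples = []
    for _ in range(steps):
        # Mean-reverting drift plus Gaussian diffusion scaled by sqrt(dt).
        x += theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


noise = ou_noise(steps=5)
# Perturb a hypothetical network output and keep it in the valid stick range.
noisy_control = [max(-1.0, min(1.0, 0.3 + n)) for n in noise]
print(noisy_control)
```

Unlike independent Gaussian noise, consecutive samples are correlated, so the perturbation pushes the UAV in a consistent direction for a while, which tends to produce more meaningful exploration for a vehicle with inertia.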

Figure 5: The point cloud from the LiDAR scan of the stadium collected from six different locations.
Figure 6: Left: Aerial image captured from a UAV hovering above the stadium racing track. Right: Rendering of the reconstructed stadium generated at a similar altitude and viewing angle within the simulator.
Figure 7: The seven training tracks (left) and the seven testing tracks (right). Gates are marked in red.

5 Experiments

Creation of the UAV Racing Simulation. Many professional pilots compete in time trials on well-known tracks such as those posted by the MultiGP Drone Racing League. Following this paradigm, our simulator race course is modeled after a football stadium, where local professional pilots regularly set up MultiGP tracks. Using a combination of LiDAR scanning and aerial photogrammetry, we captured the stadium with an accuracy of 0.5 cm (see Figure 5). A team of architects used the dense point cloud and textured mesh to create an accurate solid model with physically based rendering (PBR) textures in 3DS Max for export to Unreal. This resulted in a geometrically accurate and photo-realistic race course that remains low in poly count, so that it runs in real time within Sim4CV, in which all training and testing experiments are conducted. We refer to Figure 6 for a side-by-side comparison of the real and virtual stadiums.

Experimental Setup.  We use our UAV racing environment in Sim4CV [16] (see Figure 1); we design seven racing tracks for training and seven tracks for testing. To avoid user bias, we collect online images and trace their contours to create uniquely stylized tracks. We select the tracks with two aspects in mind: (1) the tracks should be similar to what racing professionals are accustomed to, and (2) they should offer enough diversity for network generalization on unseen tracks (see Figure 7). For all of the following evaluations, both the trained networks and human pilots are tasked with flying two laps on each of the test tracks. The score comprises three components: the percentage of successfully passed gates, the time to complete both laps, and the number of required resets. The UAV is reset at the next gate if it does not reach it within 10 seconds after passing through the previous gate. This occurs if the UAV crashes beyond recovery or drifts off the track. Visualizations of the UAV's trajectory for all models on each track are provided in Appendix A.
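The scoring protocol can be sketched as a small Python function. It assumes that every gate that is not reached within the 10-second window costs exactly one reset and the full window of time; the first assumption is consistent with the reported tables, where passed gates plus resets always equals the number of gates, while the time accounting and the name `score_run` are hypothetical.

```python
RESET_TIMEOUT_S = 10.0  # reset window stated in the experimental setup


def score_run(gate_durations):
    """Score one run.

    gate_durations[i] is the time in seconds needed to reach gate i after the
    previous gate, or None if the UAV never reaches it.
    Returns (gates passed, resets, elapsed seconds).
    """
    passed, resets, elapsed = 0, 0, 0.0
    for duration in gate_durations:
        if duration is not None and duration <= RESET_TIMEOUT_S:
            passed += 1
            elapsed += duration
        else:
            # Timeout: the UAV is placed at the next gate and the run goes on.
            resets += 1
            elapsed += RESET_TIMEOUT_S
    return passed, resets, elapsed


# Four gates; the UAV drifts off the track before the third one.
result = score_run([3.0, 4.5, None, 2.5])
print(result)
```

Under this accounting a missed gate penalizes all three metrics at once, which is why methods with many resets in the tables also show long completion times.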

Figure: Comparison between our learned CFN policy and the PID controllers (row 1: (a) PID1 (Conservative), (b) PID2 (Aggressive), (c) Ours) and human pilots (row 2: (d) Novice, (e) Intermediate, (f) Professional) on a test track. Color encodes speed as a heatmap, where blue is the minimum speed and red is the maximum speed.

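| Track | End2End (MAV) Score | End2End (MAV) Time (s) | End2End (MAV) Resets | End2End (Nvidia) Score | End2End (Nvidia) Time (s) | End2End (Nvidia) Resets | End2End (Ours) Score | End2End (Ours) Time (s) | End2End (Ours) Resets | Ours (WP + CFN) Score | Ours (WP + CFN) Time (s) | Ours (WP + CFN) Resets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Track1 | 6/12 | 98.60 | 6 | 7/12 | 101.15 | 5 | 12/12 | 80.11 | 0 | 12/12 | 52.20 | 0 |
| Track2 | 16/20 | 113.55 | 4 | 15/20 | 140.92 | 5 | 20/20 | 91.88 | 0 | 20/20 | 64.75 | 0 |
| Track3 | 11/22 | 161.85 | 11 | 19/22 | 110.94 | 3 | 22/22 | 81.26 | 0 | 22/22 | 62.00 | 0 |
| Track4 | 10/18 | 152.27 | 8 | 15/18 | 121.07 | 3 | 18/18 | 97.10 | 0 | 18/18 | 71.93 | 0 |
| Track5 | 18/30 | 207.07 | 12 | 21/30 | 197.11 | 9 | 30/30 | 100.47 | 0 | 30/30 | 71.16 | 0 |
| Track6 | 15/20 | 136.69 | 5 | 20/20 | 108.42 | 0 | 16/20 | 137.14 | 4 | 20/20 | 81.66 | 0 |
| Track7 | 8/12 | 115.05 | 4 | 10/12 | 105.37 | 2 | 10/12 | 104.97 | 2 | 12/12 | 64.86 | 0 |
| Avg. | 62.69% | 140.72 | 7.14 | 79.85% | 126.43 | 3.86 | 95.52% | 98.99 | 0.86 | 100% | 66.94 | 0 |

Table 1: Comparison of CFN to baselines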
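
| Track | PID1 (Conservative) Score | PID1 (Conservative) Time (s) | PID1 (Conservative) Resets | PID2 (Aggressive) Score | PID2 (Aggressive) Time (s) | PID2 (Aggressive) Resets | Ours (No Buffer) Score | Ours (No Buffer) Time (s) | Ours (No Buffer) Resets | Ours (WP + CFN) Score | Ours (WP + CFN) Time (s) | Ours (WP + CFN) Resets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Track1 | 12/12 | 130.76 | 0 | 12/12 | 40.04 | 0 | 10/12 | 70.35 | 2 | 12/12 | 52.20 | 0 |
| Track2 | 20/20 | 136.19 | 0 | 17/20 | 77.41 | 3 | 18/20 | 75.58 | 2 | 20/20 | 64.75 | 0 |
| Track3 | 22/22 | 121.54 | 0 | 11/22 | 149.45 | 11 | 17/22 | 102.72 | 5 | 22/22 | 62.00 | 0 |
| Track4 | 18/18 | 139.09 | 0 | 15/18 | 81.08 | 3 | 14/18 | 102.27 | 4 | 18/18 | 71.93 | 0 |
| Track5 | 30/30 | 144.49 | 0 | 12/30 | 212.79 | 18 | 28/30 | 89.93 | 2 | 30/30 | 71.16 | 0 |
| Track6 | 20/20 | 151.95 | 0 | 12/20 | 118.69 | 8 | 13/20 | 126.77 | 7 | 20/20 | 81.66 | 0 |
| Track7 | 10/12 | 139.28 | 2 | 9/12 | 72.90 | 3 | 7/12 | 86.53 | 5 | 12/12 | 64.86 | 0 |
| Avg. | 98.51% | 137.61 | 0.29 | 65.67% | 107.48 | 6.57 | 79.85% | 93.45 | 3.86 | 100% | 66.94 | 0 |

Table 2: Ablation study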
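
| Track | Human (Novice) Score | Human (Novice) Time (s) | Human (Novice) Resets | Human (Intermediate) Score | Human (Intermediate) Time (s) | Human (Intermediate) Resets | Human (Professional) Score | Human (Professional) Time (s) | Human (Professional) Resets | Ours (WP + CFN) Score | Ours (WP + CFN) Time (s) | Ours (WP + CFN) Resets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Track1 | 12/12 | 87.44 | 0 | 12/12 | 62.80 | 0 | 12/12 | 40.50 | 0 | 12/12 | 52.20 | 0 |
| Track2 | 20/20 | 166.11 | 0 | 20/20 | 88.21 | 0 | 20/20 | 49.23 | 0 | 20/20 | 64.75 | 0 |
| Track3 | 21/22 | 118.41 | 1 | 22/22 | 82.17 | 0 | 22/22 | 47.67 | 0 | 22/22 | 62.00 | 0 |
| Track4 | 17/18 | 126.47 | 1 | 18/18 | 91.53 | 0 | 18/18 | 50.10 | 0 | 18/18 | 71.93 | 0 |
| Track5 | 30/30 | 129.49 | 0 | 30/30 | 87.62 | 0 | 30/30 | 55.72 | 0 | 30/30 | 71.16 | 0 |
| Track6 | 20/20 | 196.16 | 0 | 20/20 | 95.99 | 0 | 20/20 | 57.90 | 0 | 20/20 | 81.66 | 0 |
| Track7 | 12/12 | 113.14 | 0 | 12/12 | 74.91 | 0 | 12/12 | 46.98 | 0 | 12/12 | 64.86 | 0 |
| Avg. | 98.51% | 133.89 | 0.29 | 100% | 83.32 | 0 | 100% | 49.73 | 0 | 100% | 66.94 | 0 |

Table 3: Comparison of CFN to humans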

Comparison to State-of-the-Art Baselines.  We compare our system for UAV racing to the two most related and recent network architectures, the first denoted as Nvidia (for self-driving cars [4]) and the second as MAV (for forest path navigating UAVs [22]). Both the Nvidia and MAV networks use data augmentation from an additional left and right camera. For the Nvidia network, the exact offset choices for training are not publicly known, so we choose the rotational offset ourselves. For the MAV network, we use the same augmentation parameters as in the paper, including its rotational offset. We modify the MAV network to allow for a regression output instead of its original classification (left, center and right controls). This is necessary, since our task requires fine-grained control, and discrete controls would be insufficient. We assign corrective controls to the augmentation views using a fairly simple but effective strategy. Depending on the camera view, we apply two offset parameters: one that acts as a horizontal offset (roll-offset) and one that acts as a rotational offset (yaw-offset). For rotational offsets, we couple the yaw correction with a proportional roll correction because the UAV is in motion while rotating, causing it to drift outwards due to its inertia.

While the domains of these methods are similar, it should be noted that flying a high-speed racing UAV is a particularly challenging task, since the effect of inertia is more significant and there are more degrees of freedom than for ground vehicles. To ensure a strong end-to-end baseline, we build an end-to-end network that takes the state of the UAV (exactly like our control module) as input along with the image. We also augment the data with 18 additional camera views (exactly like our perception module) and assign the best corrective controls after cross-validation search [17].

Table 1 compares the performance of these baselines against our method. The MAV reference network needs more than 7 resets and completes only about 63% of gates on average, while taking more than twice the time. The Nvidia-inspired architecture performs better, but still needs about 4 resets and completes only about 80% of gates on average. While the end-to-end trained version of our network achieves better performance than MAV and Nvidia, our modular network with CFN clearly outperforms it without the need for supervision or approximate corrective controls. In fact, CFN outperforms all baselines by a considerable margin in all three evaluation metrics: it completes all seven race tracks with 100% accuracy, compared to 62.69%, 79.85% and 95.52%, in roughly half the average time of the MAV and Nvidia networks and about two thirds the average time of our end-to-end network.
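The averages in Table 1 can be reproduced from the per-track rows. The sketch below uses the End2End (MAV) column and suggests that the average score is pooled over all gates rather than averaged per track; this reading is inferred from the numbers and is not stated in the text. The helper names are hypothetical.

```python
# Per-track values copied from the End2End (MAV) column of Table 1.
gates_total = [12, 20, 22, 18, 30, 20, 12]
mav_passed = [6, 16, 11, 10, 18, 15, 8]
mav_resets = [6, 4, 11, 8, 12, 5, 4]


def pooled_score(passed, total):
    """Percentage of gates passed, pooled over all tracks."""
    return 100.0 * sum(passed) / sum(total)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mav_score = round(pooled_score(mav_passed, gates_total), 2)
mav_avg_resets = round(mean(mav_resets), 2)
# For comparison: the unweighted mean of the per-track percentages.
per_track_mean = round(
    mean([100.0 * p / g for p, g in zip(mav_passed, gates_total)]), 2
)
print(mav_score, mav_avg_resets, per_track_mean)
```

The pooled ratio matches the 62.69% reported in the table, whereas the unweighted per-track mean comes out slightly lower, so tracks with more gates weigh more heavily in the reported score.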

Figure: Simulated UAV racing stadium with low and high quality grass textures ((a) Grass, (b) HD Grass), different ground materials ((c) Mud, (d) Snow), different weather conditions ((e) Fog, (f) Rain) and different lighting conditions ((g) Sunrise, (h) Night).

| Metric | Grass (Baseline) | GrassHD (Baseline) | Fog (Weather) | Rain (Weather) | Sunrise (Lighting) | Night (Lighting) | Snow (Texture) | Mud (Texture) |
|---|---|---|---|---|---|---|---|---|
| Score | 100% | 100% | 95.5% | 100% | 99.2% | 96.2% | 100% | 100% |
| Time (s) | 66.94 | 69.75 | 85.48 | 68.68 | 73.77 | 86.01 | 68.26 | 67.54 |
| Resets | 0 | 0 | 0.86 | 0 | 0.14 | 0.71 | 0 | 0 |

Table 4: Adaptation through modularity. The reported results reflect average performance across all seven testing tracks. Please refer to Appendix A for detailed results per track.

Comparison to PID Controllers and Human Performance.  We compare our CFN-trained control network to the PID controllers it learned from. The perception network stays the same. The summary of this experiment is given in Table 2, where we find that our learned policy is able to outperform both PID controllers (PID1 and PID2). Further, as the trajectory comparison on a test track shows, both PID controllers are imperfect. PID1 completes most gates while flying very slowly, and PID2 misses many gates while flying at very high speeds. However, since our control module is designed to learn only from the best behaviour of both PID controllers, it completes all gates at a high speed. We also want to highlight the importance of the temporary buffer. Removing it results in a significant drop in performance (score: 79.85%, time: 93.45, resets: 3.86; see "Ours (No Buffer)" in Table 2), since the CFN agent is forced to learn from all the controller demonstrations, including the undesirable ones.

We also compare our system to three pilots with different skill levels: a novice (has never flown before), an intermediate pilot (moderately experienced), and a professional (a competitive racing pilot with many years of experience). The pilots are given the opportunity to fly the seven training tracks as many times as needed until they successfully complete the tracks at their best time while passing through all gates. The pilots are then scored on the test tracks in the same fashion as the trained networks. The results are summarized in Table 3. CFN achieves at least the same accuracy as the human pilots and is about two times faster than the novice (66.94 versus 133.89 average time) and about 20% faster than the intermediate pilot (83.32), while the professional pilot's average time (49.73) is about 26% lower than that of CFN. Although slower, our network flies more consistently than even the professional racing pilot while remaining reliably on the track (see the trajectory comparison on a test track).

Adaptation through Modularity.  We replace the low-quality grass textures with high-quality grass and show that our perception network generalizes without any modification (see Table 4). When changing the weather and lighting conditions, the performance of the perception network starts degrading. The network is not affected by rain, degrades slightly with sunrise lighting, and degrades noticeably with fog and night-time lighting. If our perception network were trained on diverse environments and textures, it would learn even more invariance to the background, as demonstrated in [20]. However, generalization only works to some extent and usually requires heavy data augmentation. In some applications, it might not even be desirable to generalize too broadly, as the performance in the target domain often suffers as a result. For such cases, our modular approach allows simply swapping out the perception module to adapt to any environment. To demonstrate this, we train the perception network on differently textured environments while keeping the controller fusion network fixed and show successful transfer of the control policy.

6 Conclusions and Future Work

In this work, we present a controller fusion network (CFN) that allows fusing multiple classical controllers. Extensive experiments demonstrate that a CFN-based network outperforms state-of-the-art methods and flies more consistently than human pilots. This is a product of the network's ability to fuse the trajectories of multiple controllers while filtering out controller actions that lead to poor performance. We expect that the framework can be adapted to other robotic and controller-based dynamic tasks, such as visual grasping or visual placing, by making minor changes to the buffer strategy. Instead of relying on extensive fine-tuning of a controller or defining an explicit model of a system, a CFN is able to produce an optimized predictive control of dynamic systems.

Acknowledgments. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research.


Appendix A Supplementary Material

Here we provide additional results and the recorded paths from the network and human pilot evaluations in the paper. All results were recorded as logs during testing inside Sim4CV [16], which allows plotting them on the GUI track interface developed for the paper. The logs record stick input, position, orientation, and velocity, and thus allow visualizing the performance of each pilot or network on the different tracks. Tables 5 and 6 show the detailed results for adaptation to different textures, lighting, and environment conditions.

The figures at the end of this appendix show the measured performance for all trained models and human pilots on track1 through track7.

       | Ours (Fog)             | Ours (Rain)            | Ours (Sunrise)         | Ours (Night)
Track  | Score   Time    Resets | Score   Time    Resets | Score   Time    Resets | Score   Time    Resets
Track1 | 12/12   57.43   0      | 12/12   54.97   0      | 12/12   55.43   0      | 12/12   58.07   0
Track2 | 20/20   67.75   0      | 20/20   67.27   0      | 20/20   66.50   0      | 20/20   69.33   0
Track3 | 22/22   64.32   0      | 22/22   62.63   0      | 22/22   62.95   0      | 22/22   65.17   0
Track4 | 17/18   93.02   1      | 18/18   72.42   0      | 18/18   70.95   0      | 18/18   72.23   0
Track5 | 28/30   114.11  2      | 30/30   73.01   0      | 30/30   74.10   0      | 28/30   114.03  2
Track6 | 17/20   136.71  3      | 20/20   84.92   0      | 19/20   118.89  1      | 17/20   154.74  3
Track7 | 12/12   65.03   0      | 12/12   65.58   0      | 12/12   67.72   0      | 12/12   68.58   0
Avg.   | 95.52%  85.48   0.86   | 100%    68.68   0      | 99.25%  73.77   0.14   | 96.27%  86.01   0.71
Table 5: Adaptation with different weather and lighting conditions
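
As a worked example of how the Avg. row can be read, the sketch below recomputes the Fog summary from the per-track Fog values in Table 5. The aggregation convention is an assumption inferred from the Fog column and is not stated in the text: the score is pooled over all gates (gates passed divided by total gates), while time and resets are plain per-track means. The function name `summarize` is hypothetical.

```python
# Per-track Fog results copied from Table 5:
# (gates passed, gates total, time, resets)
fog = [
    (12, 12, 57.43, 0),
    (20, 20, 67.75, 0),
    (22, 22, 64.32, 0),
    (17, 18, 93.02, 1),
    (28, 30, 114.11, 2),
    (17, 20, 136.71, 3),
    (12, 12, 65.03, 0),
]


def summarize(rows):
    """Aggregate per-track rows into (score in %, mean time, mean resets).

    Assumed convention: the score is pooled over all gates, whereas
    time and resets are averaged per track.
    """
    passed = sum(r[0] for r in rows)
    total = sum(r[1] for r in rows)
    score = 100.0 * passed / total
    mean_time = sum(r[2] for r in rows) / len(rows)
    mean_resets = sum(r[3] for r in rows) / len(rows)
    return round(score, 2), round(mean_time, 2), round(mean_resets, 2)


print(summarize(fog))  # (95.52, 85.48, 0.86)
```

Pooling over gates weights long tracks more heavily than a per-track mean of pass rates would: 128 of 134 gates gives 95.52%, whereas averaging the seven per-track pass rates would give about 96.1%.
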
       | Ours (Grass)           | Ours (HD Grass)        | Ours (Mud)             | Ours (Snow)
Track  | Score   Time    Resets | Score   Time    Resets | Score   Time    Resets | Score   Time    Resets
Track1 | 12/12   52.20   0      | 12/12   56.07   0      | 12/12   53.45   0      | 12/12   55.13   0
Track2 | 20/20   64.75   0      | 20/20   66.98   0      | 20/20   65.28   0      | 20/20   64.84   0
Track3 | 22/22   62.00   0      | 22/22   64.05   0      | 22/22   62.03   0      | 22/22   64.06   0
Track4 | 18/18   71.93   0      | 18/18   75.57   0      | 18/18   73.57   0      | 18/18   72.18   0
Track5 | 30/30   71.16   0      | 30/30   73.78   0      | 30/30   71.20   0      | 30/30   73.28   0
Track6 | 20/20   81.66   0      | 20/20   85.77   0      | 20/20   82.21   0      | 20/20   82.31   0
Track7 | 12/12   64.86   0      | 12/12   66.01   0      | 12/12   65.03   0      | 12/12   65.99   0
Avg.   | 100%    66.94   0      | 100%    69.75   0      | 100%    67.54   0      | 100%    68.26   0
Table 6: Adaptation with different textures
Figure: Qualitative results on track1. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track2. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track3. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track4. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track5. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track6. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).

Figure: Qualitative results on track7. The color encodes speed as a heatmap, where blue corresponds to the minimum speed and red to the maximum speed. Panels: End2End (MAV), End2End (Nvidia), End2End (Ours), PID1 (Conservative), PID2 (Aggressive), Ours (No Buffer), Human (Novice), Human (Intermediate), Human (Professional), Ours (Reference), Ours (Night), Ours (Sunrise), Ours (Reference), Ours (Fog), Ours (Rain), Ours (Grass), Ours (Mud), Ours (Snow).