FlightGoggles: Photorealistic Sensor Simulation for Perception-driven Robotics using Photogrammetry and Virtual Reality

05/27/2019 ∙ by Winter Guerra, et al. ∙ MIT

FlightGoggles is a photorealistic sensor simulator for perception-driven robotic vehicles. The key contributions of FlightGoggles are twofold. First, FlightGoggles provides photorealistic exteroceptive sensor simulation using graphics assets generated with photogrammetry. Second, it also provides the ability to combine (i) synthetic exteroceptive measurements generated in silico in real time and (ii) vehicle dynamics and proprioceptive measurements generated in motio by vehicle(s) in flight in a motion-capture facility. FlightGoggles is capable of simulating a virtual-reality environment around autonomous vehicle(s) in flight. While a vehicle is in flight in the FlightGoggles virtual reality environment, exteroceptive sensors are rendered synthetically in real time while all complex extrinsic dynamics are generated organically through the natural interactions of the vehicle. The FlightGoggles framework allows for researchers to accelerate development by circumventing the need to estimate complex and hard-to-model interactions such as aerodynamics, motor mechanics, battery electrochemistry, and behavior of other agents. The ability to perform vehicle-in-the-loop experiments with photorealistic exteroceptive sensor simulation facilitates novel research directions involving, e.g., fast and agile autonomous flight in obstacle-rich environments, safe human interaction, and flexible sensor selection. FlightGoggles has been utilized as the main test for selecting nine teams that will advance in the AlphaPilot autonomous drone racing challenge. Subsequently, FlightGoggles has been actively used by the community. We survey approaches and results from the top twenty AlphaPilot teams, which may be of independent interest.




Supplement: Software, Assets, & Videos

FlightGoggles is distributed as open-source software along with the photorealistic graphics assets for several simulation environments, under the MIT license. Software, graphics assets, and videos showcasing FlightGoggles as well as videos showcasing the results of the AlphaPilot simulation challenge can be found at http://flightgoggles.mit.edu.

I Introduction

Simulation systems have long been an integral part of the development of robotic vehicles. They allow engineers to identify errors early on in the development process, and allow researchers to rapidly prototype and demonstrate their ideas. Despite their success in accelerating development, many researchers view results generated in simulation systems with skepticism, as any simulation system is some abstraction of reality and will disagree with reality at some scale. This skepticism towards results generated exclusively in simulation studies is exemplified by Rodney Brooks’ well-known quote from 1993: “[experiment] simulations are doomed to succeed … [because] simulations cannot be made sufficiently realistic”[1].

Figure 1: FlightGoggles renderings of the Abandoned Factory environment, designed for autonomous drone racing. Note the size of the environment and the high level of detail.

Despite the skepticism towards simulation results, several trends have emerged in recent years that have driven the research community to develop better simulation systems out of necessity. A major driving trend towards realistic simulators stems from the emergence of data-driven algorithmic methods in robotics, for instance, based on machine learning methods that require extensive data. Simulation systems provide not only massive amounts of data, but also the labels required for training algorithms. For example, simulation systems can provide a safe environment for learning from experience useful for reinforcement learning methods [2, 3]. This driving trend has posed a critical need to develop better, more realistic simulation systems.

Several enabling trends have also recently emerged that allow for better, more realistic simulation systems to be developed. The first enabling trend is the development of new computing resources that enable realistic rendering. The rapid evolution of game engine technology, particularly 3D graphics rendering engines, has made available advanced features such as improved material shaders, real-time reflections, volumetric lighting, and advanced illumination through deferred rendering pipelines. In particular, the maturation of off-the-shelf software packages such as Unreal Engine [4] and Unity [5] makes them suitable for high-fidelity rendering in applications beyond video games, such as robotics simulation. Simultaneously, next-generation graphics processors simply pack more transistors, and the transistors are better organized for rendering purposes, e.g., for real-time ray tracing. In addition, they incorporate computation cores that utilize machine learning, for instance, trained with pictures of real environments to generate realistic renderings [6]. This trend is an opportunity to utilize better software and hardware to realize realistic sensor simulations. The second enabling trend stems from the proliferation of motion capture facilities for robotics research, enabling precise tracking of robotic vehicles and humans through various technologies, such as infrared cameras, laser tracking, and ultra-wide band radio. These facilities provide the opportunity to incorporate real motion and behavior of vehicles and humans into the simulation in real time. This trend provides the potential to combine the efficiency, safety, and flexibility of simulation with real-world physics and agent behavior.

Traditionally, simulation systems embody “models” of the vehicles and the environment, which are used to emulate what the vehicles sense, how they move, and how their environment adapts. In this paper, we present two concepts that use “data” to drive realistic simulations. First, we heavily utilize photogrammetry to realistically simulate exteroceptive sensors. For this purpose, we photograph real-world objects, and reconstruct them in the simulation environment. Almost all objects in our simulation environment are, in fact, a rendering of a real-world object. This approach allows realistic renderings, as shown in Figure 1. Second, we utilize a novel virtual-reality system to realistically embed inertial sensors, vehicle dynamics, and human behavior into the simulation environment. Instead of modeling these effects, we place vehicles and human actors in motion-capture facilities. We acquire the pose of the vehicles and the configuration of the human actors in real time, and create their avatars in the simulation environment. For each autonomous vehicle, its proprioceptive measurements are acquired using on-board sensors, e.g., inertial measurement units and odometers, and exteroceptive sensors are rendered photorealistically in real time. In addition, the human behavior observed by the vehicles is generated by humans reacting to the simulation. In other words, vehicles embedded in the FlightGoggles simulation system experience real dynamics, real inertial sensing, real human behavior, and synthetic exteroceptive sensor measurements that are rendered photorealistically, effectively by transforming photographs of real-world objects.

(a) ETHz RotorS
(b) Microsoft AirSim
(c) FlightGoggles
Figure 2: Environment renders in various simulation software.

The combination of real physics and data-driven exteroceptive sensor simulation that FlightGoggles provides is not achieved in traditional simulation systems. Such systems are typically built around a physics engine that simulates vehicles and the environment based on a “model”, most commonly a system of ordinary or partial differential equations. While these models may accurately exemplify the behavior of a general vehicle or actor, this is not sufficient to ensure that simulation results transfer to the real world. Complicated aspects of vehicle dynamics, e.g., vibrations and unsteady aerodynamics, and of human behavior may significantly affect results, but can be very challenging to accurately capture in a physics model. For an overview of popular physics engines, the reader is referred to [7]. In order to generate exteroceptive sensor data, robotics simulators employ a graphics rendering engine in conjunction with the physics engine. A popular example is Gazebo [8], which lets users select various underlying engines. It is often used in combination with the Robot Operating System (ROS) to enable hardware-in-the-loop simulation. However, Gazebo is generally not capable of photorealistic rendering. Specifically, for unmanned aerial vehicle simulation, two popular simulators that are built on Gazebo are the Hector Quadrotor package [9] and RotorS [10]. Both simulators include vehicle dynamics and exteroceptive sensor models, but lack the capability to render photorealistic camera streams. AirSim, on the other hand, is purposely built on the Unreal rendering engine to enable rendering of photorealistic camera streams from autonomous vehicles, but still suffers from the shortcomings of typical physics engines when it comes to vehicle dynamics and inertial measurements [11]. Figure 2 shows a comparison of photorealism in some of the aforementioned simulators. It serves to highlight the shift towards using video game rendering engines to improve realism in robotics simulation.

The rise of data-driven algorithms for autonomous robotics has bolstered the need for extensive labeled data sets. Simulation offers an alternative to experimental data gathering. Clearly, there are many advantages to this approach, e.g., cost efficiency, safety, repeatability, and essentially unlimited quantity and diversity. In recent years, several synthetic, or virtual, datasets have appeared in the literature. For example, Synthia [12] and Virtual KITTI [13] use Unity to generate photorealistic renders of an urban environment, and ICL-NUIM [14] provides synthetic renderings of an indoor environment based on pre-recorded handheld trajectories. The Blackbird Dataset [15] includes real-world ground truth and inertial measurements of a quadcopter in motion capture, and photorealistic camera imagery rendered in FlightGoggles. The open-source availability of FlightGoggles and its photorealistic assets enables users to straightforwardly generate additional data, including real-time photorealistic renders based on real-world vehicles and actors.

This paper is organized as follows. Section II provides an overview of the FlightGoggles system architecture, including interfacing with real-world vehicles and actors in motion capture facilities. Section III outlines the photogrammetry process and the resulting virtual environment. This section also details the rendering engine and the exteroceptive sensor models available. Section IV describes a physics engine for simulation of multicopter dynamics and inertial measurements. Section V describes several applications of FlightGoggles, including results of the AlphaPilot qualifications. Finally, Section VI concludes with remarks.

II System Architecture

Figure 3: Overview of FlightGoggles system architecture. Pose data of real and simulated vehicles, and human interactors is used by the Unity rendering engine. Detected collision can be incorporated in simulated vehicle dynamics, and rendered images can be displayed in a VR headset. All dynamics states, control inputs, and sensor outputs of real and simulated vehicles, and human interactors are available through the FlightGoggles API.

FlightGoggles is based on a modular architecture, as shown in Figure 3. This architecture provides the flexibility to tailor functionality for a specific simulation scenario involving real and/or simulated vehicles, and possibly human interactors. As shown in the figure, FlightGoggles’ central component is the Unity game engine. It utilizes position and orientation information to simulate camera imagery and exteroceptive sensors, and to detect collisions. Collision checks are performed using polygon colliders, and results are output to be included in the dynamics of simulated vehicles.

A description of the dynamics and measurement model used for multicopter simulation is given in Section IV. Additionally, FlightGoggles includes a simulation base class that can be used to simulate user-defined vehicle equations of motion and measurement models. Simulation scenarios may also include real-world vehicles through the use of a motion capture system. In this case, Unity simulation of camera images and exteroceptive sensors, and collision detection are based on the real-world vehicle position and orientation. This type of vehicle-in-the-loop simulation can be seen as an extension of customary hardware-in-the-loop configurations. It not only includes the vehicle hardware, but also the actual physics of processes that are challenging to simulate accurately, such as aerodynamics (including effects of turbulent air flows), and inertial measurements subject to vehicle vibrations. FlightGoggles provides the novel combination of real-life vehicle dynamics and proprioceptive measurements with photorealistic exteroceptive sensor simulation. It allows for real-life physics, flexible exteroceptive sensor configurations, and obstacle-rich environments without the risk of actual collisions. FlightGoggles also allows scenarios involving both humans and vehicles, colocated in simulation but placed in different motion capture rooms, e.g., for safety.

Dynamics states, control inputs, and sensor outputs of real and simulated vehicles, and human interactors are available to the user through the FlightGoggles API. In order to enable message passing between FlightGoggles nodes and the API, the framework can be used with either ROS [16] or LCM [17]. The FlightGoggles simulator can be run headlessly on an Amazon Web Services (AWS) cloud instance to enable real-time simulation on systems with limited hardware.

Dynamic elements, such as moving obstacles, lights, vehicles, and human actors, can be added and animated in the environment in real-time. Using these added elements, users can change environment lighting or simulate complicated human-vehicle, vehicle-vehicle, and vehicle-object interactions in the virtual environment.

In Section V, we describe a use case involving a dynamic human actor. In this scenario, skeleton tracking motion capture data is used to render a 3D model of the human in the virtual FlightGoggles environment. The resulting render is observed in real-time by a virtual camera attached to a quadcopter in real-life flight in a different motion capture room (see Figure 8).

III Exteroceptive Sensor Simulation

This section describes the creation of the environment using photogrammetry, lists the features of the render pipeline, and describes each of the sensor models available.

(a) Virtual environment in FlightGoggles with barrel (red), rubble (blue), corrugated metal (magenta), and caged tank (green).
(b) Photograph of barrel.
(c) Photograph of rubble.
(d) Photograph of corrugated metal.
(e) Photograph of caged tank.
(f) Rendered image of barrel.
(g) Rendered image of rubble.
(h) Rendered image corrugated metal.
(i) Rendered image of caged tank.
Figure 4: Object photographs that were used for photogrammetry, and corresponding rendered assets in FlightGoggles.

FlightGoggles provides a simulation environment with exceptional visual fidelity. Its high level of photorealism is achieved using 84 unique 3D models captured from real-world objects using photogrammetry, as can be seen in Figure 4. The resulting environment is comprised of over 40 million triangles and 1,050 object instances.

III-A Photorealistic Sensor Simulation using Photogrammetry

Photogrammetry creates the 3D model of a real-world object from its photographs. Multiple photographs from different viewpoints of a real-world object are used to construct a high-resolution 3D model for use in virtual environments. For comparison, traditional 3D modeling techniques for creating photorealistic assets require hand-modeling and texturing, both of which require large amounts of artistic effort and time. Firstly, modern photogrammetry techniques enable the creation of high-resolution assets in a much shorter time when compared to conventional modeling methods. Secondly, the resulting renderings are visually far closer to the real-world object that is being modeled, which may be critical in robotics simulation. Due to its advantages over traditional modeling methods, photogrammetry is already widely used in the video game industry; however, photogrammetry has still not found traction within the robotics simulation community. Thus, its application towards photorealistic robotics simulation, as realized in FlightGoggles, is novel.

III-A1 Photogrammetry asset capture pipeline

Photogrammetry was used to create 84 unique open-source 3D assets for the FlightGoggles environment. These assets are based on thousands of high-resolution digital photographs of real-world objects and environmental elements, such as walls and floors. The digital images were first color-balanced, and then combined to reconstruct object meshes using the GPU-based reconstruction software Reality Capture [18]. After this step, the raw object meshes were manually cleaned to remove reconstruction artifacts. Mesh baking was performed to generate base color, normal, height, metallic, ambient occlusion, detail, and smoothness maps for each object, which were then combined into one high-definition object material in Unity3D. Figure 5 shows maps generated using photogrammetry for the scanned barrel object in Figure 4(b).

(a) Base color map.
(b) Normal map.
(c) Detail map.
(d) Ambient occlusion map.
(e) Metallic map.
(f) Smoothness map.
Figure 5: Texture maps generated using photogrammetry for the scanned barrel asset in Figure 4(b).

III-A2 HD render pipeline

Figure 4 shows several 3D assets that were generated using the process described above. The figure also shows examples of real-world reference imagery that was used in the photogrammetry process to construct these assets. To achieve photorealistic RGB camera rendering, FlightGoggles uses the Unity Game Engine High Definition Render Pipeline (HDRP). Using HDRP, cameras rendered in FlightGoggles have characteristics similar to those of real-world cameras including motion blur, lens dirt, bloom, real-time reflections, and precomputed ray-traced indirect lighting. Additional camera characteristics such as chromatic aberration, vignetting, lens distortion, and depth of field can be enabled in the simulation environment.

III-B Performance Optimizations and System Requirements

To ensure that FlightGoggles is able to run on a wide spectrum of GPU rendering hardware with at least 2 GB of video random access memory (VRAM), aggressive performance and memory optimizations were performed. As can be seen in Table I, FlightGoggles VRAM and GPU computation requirements can be reduced using user-selectable quality profiles based on three major settings: real-time reflections, maximum object texture resolution, and maximum level of detail (i.e., polygon count).

Setting                 VeryLow2GB  Low2GB   Medium   High (Default)  VeryHigh  Ultra
Mono VRAM Usage         1.45 GB     1.56 GB  1.73 GB  4.28 GB         4.28 GB   4.28 GB
Stereo VRAM Usage       2.00 GB     2.07 GB  2.30 GB  4.97 GB         4.97 GB   4.97 GB
Max Texture Resolution  1/8         1/4      1/2      1               1         1
Realtime Reflections    -           -        -
Max Level of Detail     Low         Medium   High     High            High      High
Table I: Quality settings for the Abandoned Factory environment.

III-B1 Level of detail (LOD) optimization

For each object mesh in the environment, three LOD meshes with different levels of detail (i.e., polygon count and texture resolution) were generated: low, medium, and high. For meshes with lower levels of detail, textures were downsampled using subsampling and subsequent smoothing. During simulation, the real-time render pipeline improves render performance by selecting the appropriate level of detail object mesh and texture based on the size of the object mesh in camera image space. Users can also elect to decrease GPU VRAM usage by capping the maximum level of detail to use across all meshes in the environment using the quality settings in Table I.
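As a rough illustration of selecting a level of detail from image-space size, the following Python sketch picks one of three mesh levels from the fraction of the image height that an object's bounding sphere covers. The function name, the thresholds, and the bounding-sphere approximation are assumptions made for illustration; they are not taken from FlightGoggles or Unity.

```python
import math

# Hypothetical LOD names, ordered from coarsest to finest.
LOD_LEVELS = ["low", "medium", "high"]


def select_lod(radius_m, distance_m, vertical_fov_deg, max_lod="high",
               thresholds=(0.05, 0.25)):
    """Pick an LOD from the fraction of image height the object covers.

    The thresholds are illustrative: below the first the low-detail mesh
    is used, below the second the medium-detail mesh, otherwise high.
    """
    if distance_m <= radius_m:
        fraction = 1.0  # camera is inside the bounding sphere
    else:
        # Angular diameter of the bounding sphere relative to the vertical FoV.
        angular_size = 2.0 * math.asin(radius_m / distance_m)
        fraction = min(1.0, angular_size / math.radians(vertical_fov_deg))

    if fraction < thresholds[0]:
        level = 0
    elif fraction < thresholds[1]:
        level = 1
    else:
        level = 2

    # A quality profile may cap the maximum level of detail.
    level = min(level, LOD_LEVELS.index(max_lod))
    return LOD_LEVELS[level]
```

Capping the level through `max_lod` mirrors the idea of a quality profile that limits the most detailed mesh ever loaded, which is what reduces VRAM usage.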

III-B2 Pre-baked ray traced lighting optimization

To save run-time computation, all direct and indirect lighting, ambient occlusions, and shadow details from static light sources are pre-baked via NVIDIA RTX ray tracing into static lightmaps and are layered onto object meshes in the environment. To precompute the ray traced lighting for each lighting condition in the Abandoned Factory environment, an NVIDIA Quadro RTX 8000 GPU was used with an average bake time of 45 minutes per lighting arrangement.

III-B3 Render batching optimizations

In order to increase render performance by reducing individual GPU draw calls, FlightGoggles leverages two different methods of render batching according to the capabilities available in the rendering machine. For Windows-based systems supporting DirectX11, FlightGoggles leverages Unity3D’s experimental Scriptable Render Pipeline dynamic batcher, which drastically reduces GPU draw calls for all static and dynamic objects in the environment. For Linux and macOS systems, FlightGoggles statically batches all static meshes in the environment. Static batching drastically increases render performance, but also increases VRAM usage in the GPU as all meshes must be combined and preloaded onto the GPU at runtime. To circumvent this issue, FlightGoggles exposes quality settings to the user (see Table I) that can be used to lower VRAM usage for systems with low available VRAM.

III-B4 Dynamic clock scaling for underpowered systems

For rendering hardware that is incapable of reliable real-time frame rates despite the numerous rendering optimizations that FlightGoggles employs, FlightGoggles can perform automatic clock scaling to guarantee a nominal camera frame rate in simulation time. When automatic clock scaling is enabled, FlightGoggles monitors the frame rate of the renderer output and dynamically adjusts ROS’ sim-time rate to achieve the desired nominal frame rate in sim-time. Since FlightGoggles uses the built-in ROS sim-time framework, changes in sim-time rate do not affect the relative timing of client nodes and help to reduce simulation stochasticity across simulation runs.
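The idea behind clock scaling can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the FlightGoggles implementation: the function name and the cap at real time are hypothetical, and the sketch only computes the rate at which simulation time would have to advance so that the renderer still delivers the nominal number of frames per simulated second.

```python
def sim_time_rate(measured_fps, nominal_fps, max_rate=1.0):
    """Return simulated seconds per wall-clock second.

    If the renderer delivers measured_fps frames per wall-clock second,
    slowing simulation time by measured_fps / nominal_fps yields
    nominal_fps frames per simulated second. The rate is capped at
    max_rate (assumed here to be real time).
    """
    if measured_fps <= 0.0 or nominal_fps <= 0.0:
        raise ValueError("frame rates must be positive")
    return min(max_rate, measured_fps / nominal_fps)
```

For example, a renderer that reaches only 30 frames per wall-clock second against a nominal 60 Hz camera would require simulation time to run at half speed.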

III-C Exteroceptive Sensor Models

FlightGoggles is capable of high-fidelity simulation of various types of exteroceptive sensors, such as RGB-D cameras, time-of-flight distance sensors, and infrared radiation (IR) beacon sensors. Default noise characteristics, and intrinsic and extrinsic parameters are based on real sensor specifications, and can easily be adjusted. Moreover, users can instantiate multiple instances of each sensor type. This capability allows quick prototyping and evaluation of distinct exteroceptive sensor arrangements.

III-C1 Camera

The default camera model provided by FlightGoggles is a perfect, i.e., distortion-free, camera projection model with optional motion blur, lens dirt, auto-exposure, and bloom. Table II lists the major camera parameters exposed by default in the FlightGoggles API along with their default values. These parameters can be changed via the FlightGoggles API using ROS param or LCM config. The camera extrinsics $T_{BC}$, where $B$ is the vehicle-fixed body frame and $C$ is the camera frame, can also be changed in real-time.

Camera Parameter Default Value
Vertical Field of View ()
Image Resolution ()  px  px
Stereo Baseline ()  cm
Table II: Camera sensor parameters enabled by default in FlightGoggles along with their default values.

III-C2 Infrared beacon sensor

To facilitate the quick development of guidance, navigation, and control algorithms, an IR beacon sensor model is included. This sensor provides image-space measurements of IR beacons in the camera’s field of view. The beacons can be placed at static locations in the environment or on moving objects. Using real-time ray-casting from each RGB camera, simulated IR beacon measurements are tested for occlusion before being included in the IR sensor output. Figure 6 shows a visual representation of the sensor output.

Figure 6: Rendered camera view (faded) with IR marker locations overlaid. The unprocessed measurements and marker IDs from the simulated IR beacon sensor are indicated in red. The measurements are verified by comparison to image-space reprojections of ground-truth IR marker locations, which are indicated in green. Note that IR markers can be arbitrarily placed by the user, including on dynamic objects.
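An image-space beacon measurement of this kind can be illustrated with a distortion-free pinhole projection. The Python sketch below is an assumption-laden illustration: the function name, the camera frame convention (x right, y down, z forward), and the parameter names are hypothetical, and the occlusion test by ray casting is deliberately left out.

```python
import math


def project_beacon(point_cam, width_px, height_px, vertical_fov_deg):
    """Project a 3D point given in the camera frame to pixel coordinates.

    Assumed convention: x right, y down, z forward (along the optical axis).
    Returns (u, v) in pixels, or None if the point is behind the camera
    or falls outside the image.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None  # behind the camera

    # Focal length in pixels, derived from the vertical field of view.
    focal_px = 0.5 * height_px / math.tan(math.radians(vertical_fov_deg) / 2.0)

    u = 0.5 * width_px + focal_px * x / z
    v = 0.5 * height_px + focal_px * y / z

    if 0.0 <= u < width_px and 0.0 <= v < height_px:
        return (u, v)
    return None
```

In a full sensor model, a point that passes this visibility check would still need to be tested for occlusion against the scene geometry before being reported.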

III-C3 Time-of-flight range sensor

FlightGoggles is able to simulate (multi-point) time-of-flight range sensors using ray casts in any specified directions. In the default vehicle configuration, a downward-facing single-point range finder for altitude estimation is provided. The noise characteristics of this sensor are similar to the commercially available LightWare SF11/B laser altimeter [19].

IV Simulated Vehicles

FlightGoggles is able to simulate scenarios including both real-life vehicles in flight, and vehicles with simulated dynamics and inertial measurements. This section details the models used for simulation of multicopter dynamics and inertial measurements.

IV-A Motor Dynamics

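We model the dynamics of the motors using a first-order lag with time constant $\tau_m$, as follows:

$$\dot{\omega}_i = \frac{1}{\tau_m}\left(\omega_{i,c} - \omega_i\right), \qquad (1)$$

where $\omega_i$ is the rotor speed of motor $i$, with $i = 1, \dots, N$ ($N = 4$ in case of a quadcopter) and the subscript $c$ indicates the commanded value. Rotor speed is defined such that $\omega_i > 0$ corresponds to positive thrust in the motor frame z-axis.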

IV-B Forces and Moments

We distinguish two deterministic external forces and moments acting on the vehicle body: firstly, thrust force and control moment due to the rotating propellers; secondly, aerodynamic drag and moment due to the vehicle linear and angular speed. Accurate physics-based modeling of these forces and moments is challenging, as it requires modeling of fluid dynamics surrounding the propellers and vehicle body. Instead, we aim to obtain representative vehicle dynamics using a simplified model based on vehicle and propulsion coefficients obtained from experimental data [15, 20].

IV-B1 Thrust force and control moment

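We employ a summation of forces and moments over the propeller index $i$ in order to account for various vehicle and propeller configurations in a straightforward manner. The total thrust and control moment are given by

$$f_T = \sum_i R^b_{m_i} f_{m_i}, \qquad (2)$$

$$\mu_T = \sum_i \left( R^b_{m_i} \mu_{m_i} + r_i \times R^b_{m_i} f_{m_i} \right), \qquad (3)$$

with $R^b_{m_i}$ the constant rotation matrix from the motor reference frame to the vehicle-fixed reference frame, and $r_i$ the position of the motor in the latter frame. The force and moment vector in the motor reference frame are given by

$$f_{m_i} = \begin{bmatrix} 0 & 0 & k_f \omega_i^2 \end{bmatrix}^T, \qquad (4)$$

$$\mu_{m_i} = \begin{bmatrix} 0 & 0 & (-1)^{1_{m^+}(i)} k_\mu \omega_i^2 \end{bmatrix}^T, \qquad (5)$$

where $1_{m^+}(i)$ is an indicator function for the set of propellers $m^+$ for which $\omega_i > 0$ corresponds to positive rotation rate around the motor frame z-axis; and $k_f$ and $k_\mu$ indicate the constant propeller thrust and torque coefficients, respectively.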
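The summation over propellers can be sketched in Python as follows. This is an illustrative sketch only: it assumes that every motor frame is aligned with the vehicle-fixed frame (so the rotation matrix is the identity), and the function names, the `spin_signs` argument (+1 or -1 for the sign of each propeller's reaction torque), and all numeric values are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def thrust_and_moment(rotor_speeds, positions, spin_signs, k_f, k_mu):
    """Sum per-motor thrust and moment in the vehicle-fixed frame.

    Assumes motor frames aligned with the body frame (identity rotation).
    positions: motor positions in the body frame.
    spin_signs: +1 or -1, sign of the reaction torque of each propeller.
    """
    force = [0.0, 0.0, 0.0]
    moment = [0.0, 0.0, 0.0]
    for omega, r, sign in zip(rotor_speeds, positions, spin_signs):
        f_motor = (0.0, 0.0, k_f * omega * omega)           # thrust along z
        mu_motor = (0.0, 0.0, sign * k_mu * omega * omega)  # reaction torque
        arm = cross(r, f_motor)                             # moment of thrust
        for j in range(3):
            force[j] += f_motor[j]
            moment[j] += mu_motor[j] + arm[j]
    return tuple(force), tuple(moment)
```

For a symmetric quadcopter with alternating spin directions and equal rotor speeds, the moments cancel and only the collective thrust remains, which is a quick sanity check for any chosen geometry.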

IV-B2 Aerodynamic drag and moment

Aerodynamic drag has magnitude proportional to the vehicle speed squared and acts in the direction opposite the vehicle motion according to

$$f_D = -k_D \|v\|_2 \, v, \qquad (6)$$

with $\|\cdot\|_2$ the 2-norm, $v$ the vehicle velocity relative to the world-fixed reference frame, and $k_D$ the vehicle drag coefficient. Similarly, the aerodynamic moment is given by

$$\mu_A = -\|\Omega\|_2 \, K_A \Omega, \qquad (7)$$

where $\Omega$ is the angular rate in the vehicle-fixed reference frame, and $K_A$ is a 3-by-3 matrix containing the aerodynamic moment coefficients.

IV-C Vehicle Dynamics

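The vehicle translational dynamics are given by

$$\dot{x} = v, \qquad (8)$$

$$\dot{v} = g + \frac{1}{m}\left( R^w_b f_T + f_D + w_f \right), \qquad (9)$$

where $m$ is the vehicle mass; $x$, $v$, and $g$ are the position, velocity, and gravitational acceleration in the world-fixed reference frame, respectively; and $f_T$ and $f_D$ are the thrust and drag forces of Section IV-B. The stochastic force vector $w_f$ captures unmodeled dynamics, such as propeller vibrations and atmospheric turbulence. It is modeled as a continuous white-noise process with auto-correlation function $W_f \delta(t)$ (with $\delta(t)$ the Dirac delta function), and can thus be sampled discretely according to

$$w_f \sim \mathcal{N}\left(0, \frac{W_f}{\Delta t}\right), \qquad (10)$$

where $\Delta t$ is the integration time step. The rotation matrix from body-fixed to world frame is given by

$$R^w_b = \begin{bmatrix} 1 - 2q_y^2 - 2q_z^2 & 2q_x q_y - 2q_z q_w & 2q_x q_z + 2q_y q_w \\ 2q_x q_y + 2q_z q_w & 1 - 2q_x^2 - 2q_z^2 & 2q_y q_z - 2q_x q_w \\ 2q_x q_z - 2q_y q_w & 2q_y q_z + 2q_x q_w & 1 - 2q_x^2 - 2q_y^2 \end{bmatrix}, \qquad (11)$$

where $q = [q_w \; q_x \; q_y \; q_z]^T$ is the vehicle attitude unit quaternion vector. The corresponding attitude dynamics are given by

$$\dot{q} = \frac{1}{2} \, q \circ \begin{bmatrix} 0 \\ \Omega \end{bmatrix}, \qquad (12)$$

$$\dot{\Omega} = J^{-1}\left( \mu_T + \mu_A + w_\mu - \Omega \times J\Omega \right), \qquad (13)$$

with $\circ$ the quaternion product, $\Omega$ the angular rate in the vehicle-fixed reference frame, $\mu_T$ and $\mu_A$ the control and aerodynamic moments of Section IV-B, and $J$ the vehicle moment of inertia tensor. The stochastic moment contribution $w_\mu$ is modeled as a continuous white-noise process with auto-correlation function $W_\mu \delta(t)$, and sampled similarly to (10).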
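Two ingredients of this section lend themselves to a short Python sketch: converting a unit quaternion to the body-to-world rotation matrix, and drawing a discrete sample of a continuous white-noise process. The sketch assumes a scalar-first quaternion ordering (w, x, y, z) and an isotropic scalar noise intensity; the function names and these conventions are assumptions for illustration.

```python
import math
import random


def quat_to_rotation(q):
    """Rotation matrix (body to world) from a unit quaternion (w, x, y, z)."""
    qw, qx, qy, qz = q
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def sample_white_noise(intensity, dt, rng):
    """Discrete sample of continuous white noise with the given intensity.

    Holding the sample constant over one step of length dt requires a
    variance of intensity / dt, so the standard deviation grows as the
    time step shrinks.
    """
    sigma = math.sqrt(intensity / dt)
    return [sigma * rng.gauss(0.0, 1.0) for _ in range(3)]
```

Passing a seeded `random.Random` instance as `rng` makes simulation runs repeatable, which is convenient when comparing integration schemes or controller settings.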

IV-D Physics Integration

The vehicle state is updated at 960 Hz so that even high-bandwidth motor dynamics can be represented accurately. Both explicit Euler and 4th-order Runge-Kutta algorithms are available for integration of (1), (8), (9), (12), and (13).
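To illustrate the difference between the two integration schemes, the Python sketch below applies explicit Euler and classical 4th-order Runge-Kutta steps to the first-order motor lag, the fastest dynamics in the model. The function names and the time constant used in the usage note are assumptions for illustration; only the 960 Hz update rate comes from the text.

```python
import math


def motor_rate(omega, omega_cmd, tau):
    """First-order lag: time derivative of the rotor speed."""
    return (omega_cmd - omega) / tau


def euler_step(omega, omega_cmd, tau, dt):
    """One explicit Euler step."""
    return omega + dt * motor_rate(omega, omega_cmd, tau)


def rk4_step(omega, omega_cmd, tau, dt):
    """One classical 4th-order Runge-Kutta step (command held constant)."""
    k1 = motor_rate(omega, omega_cmd, tau)
    k2 = motor_rate(omega + 0.5 * dt * k1, omega_cmd, tau)
    k3 = motor_rate(omega + 0.5 * dt * k2, omega_cmd, tau)
    k4 = motor_rate(omega + dt * k3, omega_cmd, tau)
    return omega + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def simulate(step, omega0, omega_cmd, tau, dt, n_steps):
    """Apply the given step function n_steps times."""
    omega = omega0
    for _ in range(n_steps):
        omega = step(omega, omega_cmd, tau, dt)
    return omega


def exact_response(omega0, omega_cmd, tau, t):
    """Closed-form step response of the first-order lag."""
    return omega_cmd + (omega0 - omega_cmd) * math.exp(-t / tau)
```

With an assumed time constant of 0.02 s and a step of 1/960 s, comparing `simulate(euler_step, ...)` and `simulate(rk4_step, ...)` against `exact_response` shows a visibly larger error for Euler, while both remain stable because the step is much shorter than the time constant.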

IV-E Inertial Measurement Model

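Acceleration and angular rate measurements are obtained from a simulated IMU according to the following measurement equations:

$$a_{IMU} = R^b_w \left( \dot{v} - g \right) + b_a + \eta_a, \qquad (14)$$

$$\Omega_{IMU} = \Omega + b_g + \eta_g, \qquad (15)$$

where $R^b_w$ is the rotation matrix from world to body-fixed frame; $b_a$ and $b_g$ are the accelerometer and gyroscope measurement biases, respectively; and $\eta_a$ and $\eta_g$ the corresponding thermo-mechanical measurement noises. Brownian motion is used to model the bias dynamics, as follows:

$$\dot{b}_a = \nu_a, \qquad (16)$$

$$\dot{b}_g = \nu_g, \qquad (17)$$

with $\nu_a$ and $\nu_g$ continuous white noise processes with auto-correlation $B_a \delta(t)$ and $B_g \delta(t)$, respectively, and sampled similar to (10). The noise and bias parameters were set using experimental data, which (unlike IMU data sheet specifications) include the effects of vehicle vibrations.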
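A single-axis gyroscope with additive white measurement noise and a Brownian-motion bias can be sketched in Python as follows. The class and parameter names are hypothetical, and the noise intensities are placeholders rather than the experimentally determined values mentioned above.

```python
import math
import random


class GyroModel:
    """Single-axis gyroscope: true rate + random-walk bias + white noise."""

    def __init__(self, noise_intensity, bias_intensity, initial_bias=0.0, seed=0):
        self.noise_intensity = noise_intensity  # intensity of measurement noise
        self.bias_intensity = bias_intensity    # intensity of bias-driving noise
        self.bias = initial_bias
        self.rng = random.Random(seed)

    def measure(self, true_rate, dt):
        # Integrating white noise of intensity B over dt gives a bias
        # increment with variance B * dt (Brownian motion).
        self.bias += math.sqrt(self.bias_intensity * dt) * self.rng.gauss(0.0, 1.0)
        # A discrete sample of the white measurement noise has variance N / dt.
        noise = math.sqrt(self.noise_intensity / dt) * self.rng.gauss(0.0, 1.0)
        return true_rate + self.bias + noise
```

Setting both intensities to zero reduces the model to the true rate plus a constant bias, which is a convenient baseline when debugging an estimator.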

IV-F Acro/Rate Mode Controller

In order to ease the implementation of high-level guidance and control algorithms, a rate mode controller can be enabled. This controller allows direct control of the vehicle thrust and angular rates, while maintaining accurate low-level dynamics in simulation.

IV-F1 Low-pass filter (LPF)

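The rate mode controller employs a low-pass Butterworth filter to reduce the influence of IMU noise. The filter dynamics are as follows:

$$\ddot{\Omega}_{LPF} = -g_d \, \dot{\Omega}_{LPF} + g_s \left( \Omega_{IMU} - \Omega_{LPF} \right), \qquad (18)$$

where $\Omega_{IMU}$ is the measured angular rate, $\Omega_{LPF}$ is the filtered angular rate, and the positive gains $g_d$ and $g_s$ represent the filter damping and stiffness, respectively.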
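A second-order low-pass filter with damping and stiffness gains can be sketched in Python as below. The sketch assumes the standard second-order Butterworth choice (stiffness equal to the squared cutoff frequency in rad/s, damping equal to the square root of two times the cutoff) and a semi-implicit Euler discretization; the class name, the cutoff parameter, and the discretization are assumptions and not taken from the FlightGoggles source.

```python
import math


class SecondOrderLowPass:
    """Second-order low-pass filter with Butterworth damping (one axis)."""

    def __init__(self, cutoff_hz):
        wc = 2.0 * math.pi * cutoff_hz
        self.stiffness = wc * wc             # pulls the state towards the input
        self.damping = math.sqrt(2.0) * wc   # Butterworth: damping ratio 1/sqrt(2)
        self.state = 0.0                     # filtered signal
        self.state_dot = 0.0                 # its time derivative

    def update(self, measurement, dt):
        acc = (self.stiffness * (measurement - self.state)
               - self.damping * self.state_dot)
        # Semi-implicit Euler: update the derivative first, then the state.
        self.state_dot += dt * acc
        self.state += dt * self.state_dot
        return self.state
```

A useful side effect of this form is that `state_dot` provides a smoothed derivative of the filtered signal, which a rate controller can use for its derivative term.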

IV-F2 Proportional-integral-derivative (PID) control

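A standard PID control design is used to compute the commanded angular acceleration as a function of the angular rate command, as follows:

$$\dot{\Omega}_c = K_P \left( \Omega_c - \Omega_{LPF} \right) + K_I \int \left( \Omega_c - \Omega_{LPF} \right) d\tau - K_D \, \dot{\Omega}_{LPF}, \qquad (19)$$

where $\Omega_c$ is the angular rate command, $\Omega_{LPF}$ is the low-pass filtered angular rate, and $K_P$, $K_I$, and $K_D$ are diagonal gain matrices.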
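For one body axis, a PID rate controller of this kind can be sketched in Python as follows. Because the gain matrices are diagonal, each axis can be handled independently. The class name, the rectangular integration of the error, and the choice to apply the derivative term to the filtered rate derivative (rather than to the error derivative) are assumptions for illustration.

```python
class RatePID:
    """Single-axis PID that maps an angular rate command to a commanded
    angular acceleration."""

    def __init__(self, kp, ki, kd):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.integral = 0.0  # accumulated rate error

    def update(self, rate_cmd, rate_filtered, rate_filtered_dot, dt):
        error = rate_cmd - rate_filtered
        self.integral += error * dt  # rectangular integration
        return (self.kp * error
                + self.ki * self.integral
                - self.kd * rate_filtered_dot)
```

Applying the derivative gain to the filtered measurement avoids a spike in the commanded acceleration when the rate command changes abruptly.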

IV-F3 Control allocation

If we neglect motor dynamics and consider a collective thrust command $T_c$, quadcopter angular dynamics are fully actuated. Hence, the motor speeds required to attain $T_c$ and the commanded angular acceleration $\dot{\Omega}_c$ can be computed by multiplication with the vehicle inertia tensor and inversion of the equations given in Section IV-B. In practice, this amounts to inversion of a constant full-rank 4-by-4 matrix. We note that the involved vehicle and propulsion properties are typically not known exactly, leading to imperfect tracking of the angular acceleration command.
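The allocation step can be sketched in Python for a quadcopter whose motor frames are aligned with the body frame. This is a minimal sketch under stated assumptions: a diagonal inertia tensor, identity motor rotations, a hypothetical `spin_signs` list giving the sign of each propeller's reaction torque, and a small Gauss-Jordan solver in place of a precomputed matrix inverse.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def allocation_matrix(positions, spin_signs, k_f, k_mu):
    """Map squared rotor speeds to [thrust, moment_x, moment_y, moment_z]."""
    return [
        [k_f for _ in positions],
        [k_f * r[1] for r in positions],    # x-component of r x f
        [-k_f * r[0] for r in positions],   # y-component of r x f
        [s * k_mu for s in spin_signs],     # propeller reaction torque
    ]


def allocate(thrust_cmd, ang_acc_cmd, inertia_diag, positions, spin_signs,
             k_f, k_mu):
    """Rotor speeds that attain the thrust and angular acceleration commands."""
    # Moment command: inertia (assumed diagonal) times angular acceleration.
    moment = [j * a for j, a in zip(inertia_diag, ang_acc_cmd)]
    matrix = allocation_matrix(positions, spin_signs, k_f, k_mu)
    omega_sq = solve_linear(matrix, [thrust_cmd] + moment)
    # Clip negative solutions, which a propeller cannot produce.
    return [math.sqrt(max(0.0, w)) for w in omega_sq]
```

Because the matrix is constant, an implementation would typically invert it once at start-up and reuse the inverse at every control step; the solver here is only meant to keep the sketch self-contained.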

V Applications

Figure 7: The figure on the left shows the visual features tracked using a typical visual inertial odometry pipeline on the simulated camera imagery. On the right, the plot shows three drones; the ground truth trajectory is in green, the high-rate estimate is in red, and the smoothed estimate is in blue. The red squares indicate triangulated features in the environment.

In this section, we discuss applications that FlightGoggles has been used for and potential future applications. Potential applications of FlightGoggles include: human-vehicle interaction, active sensor selection, multi-agent systems, and visual inertial navigation research for fast and agile vehicles [21, 15, 22]. The FlightGoggles simulator was used for the simulation part of the AlphaPilot challenge [23].

V-A Aircraft-in-the-Loop High-Speed Flight using Visual Inertial Odometry

Camera-IMU sensor packages are widely used in both commercial and research applications, because of their relatively low cost and low weight. Particularly in GPS-denied environments, cameras may be essential for effective state estimation. Visual inertial odometry (VIO) algorithms combine camera images with preintegrated IMU measurements to estimate the vehicle state [21, 24]. While these algorithms are often critical for safe navigation, it is challenging to verify their performance in varying conditions. Environment variables, e.g., lighting and object placement, and camera properties may significantly affect performance, but generally cannot easily be varied in reality. For example, showing robustness in visual simultaneous localization and mapping may require data collected at different times of day or even across seasons [25, 26]. Moreover, obstacle-rich environments may increase the risk of collisions, especially in high-speed flight, further increasing the cost of extensive experiments.

FlightGoggles allows us to change environment and camera parameters and thereby enables us to quickly verify VIO performance over a multitude of scenarios, without the risk of actual collisions. By connecting FlightGoggles to a motion capture room with a real quadcopter in flight, we are able to combine its photorealistic rendering with true flight dynamics and inertial measurements. This removes the need for complicated models of, e.g., unsteady aerodynamics and the effects of vehicle vibrations on IMU measurements.

Figure 7 gives an overview of a VIO flight in FlightGoggles. The quadcopter uses the trajectory tracking controller described in [27] to track a predefined trajectory that was generated using methods from [28]. State estimation is based entirely on the pose estimate from VIO, which uses the virtual imagery from FlightGoggles and real inertial measurements from the quadcopter. In what follows, we briefly describe two experiments where the FlightGoggles simulator was used to verify developed algorithms for quadcopter state estimation and planning.

V-A1 Visual inertial odometry development

Sayre-McCord et al. [21] performed experiments with the FlightGoggles simulator in two scenarios to verify that a simulator with real-time exteroceptive camera simulation and aircraft-in-the-loop can be used to develop VIO algorithms. For the baseline experiment, they flew the quadcopter through a window without the assistance of a motion capture system, first using an on-board camera and then using the simulated imagery from FlightGoggles. Their experiments show that the estimation error of the developed VIO algorithm with the live camera is comparable to that with the simulated camera.

V-A2 Vision-aware planning

During agile maneuvers by a quadcopter, visually salient features in the environment are often lost due to the limited field of view of the on-board camera. This can significantly degrade the estimation accuracy. To address this issue, [22] presented an approach that incorporates the perception objective of keeping salient features in view into the quadcopter trajectory planning problem. The FlightGoggles simulation environment was used to perform experiments. It allowed rapid experimentation with feature-rich objects in varying amounts and locations. This enabled straightforward verification of the performance of the algorithm when most of the salient features are clustered in small regions of the environment. The experiments show that significant gains in estimation performance can be achieved by using the proposed vision-aware planning algorithm as the speed of the quadcopter is increased. For details, we refer the reader to [22].
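To make the notion of features leaving the field of view concrete, the following sketch checks whether landmarks expressed in the camera frame fall inside a rectangular field of view. It is only an illustration and not the method of [22]: the pinhole-style geometry, the axis convention (x right, y down, z forward), the default field-of-view angles, and the function names are assumptions made for this example.

```python
import math


def in_field_of_view(point_cam, h_fov_deg=90.0, v_fov_deg=60.0):
    """Return True if a camera-frame point (x right, y down, z forward) is in view."""
    x, y, z = point_cam
    if z <= 0.0:
        return False  # the point lies behind the camera
    h_angle = math.degrees(math.atan2(abs(x), z))
    v_angle = math.degrees(math.atan2(abs(y), z))
    return h_angle <= h_fov_deg / 2.0 and v_angle <= v_fov_deg / 2.0


def visible_fraction(points_cam, **fov):
    """Fraction of landmarks in view, a simple proxy for a perception objective."""
    if not points_cam:
        return 0.0
    return sum(in_field_of_view(p, **fov) for p in points_cam) / len(points_cam)
```

A planner could evaluate such a visibility measure along candidate trajectories and prefer those that keep more landmarks in view, which is the kind of trade-off that clustered features make difficult.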

V-B Interactions with Dynamic Actors

Figure 8: A dynamic human actor in the FlightGoggles virtual environment is rendered in real-time, based on skeleton tracking data of a human in a motion capture suit with markers.
Figure 9: Racecourse layout for the AlphaPilot simulation challenge. Gates along the racecourse have unique IDs labeled in white. Gate IDs in blue are static and not part of the race. The racecourse has 11 gates, with a total length of 240m.

FlightGoggles is able to render dynamic actors, e.g., humans or vehicles, in real time from real-world models with ground-truth movement. Figure 8 gives an overview of a simulation scenario involving a human actor. In this scenario, the human is rendered in real time based on skeleton tracking motion capture data, while a quadcopter is simultaneously flying in a separate motion capture room. While both dynamic actors (i.e., human and quadcopter) are physically in separate spaces, they are both in the same virtual FlightGoggles environment. Consequently, both actors are visible to each other and can interact through simulated camera imagery. This imagery can, for example, be displayed on virtual reality goggles or used in vision-based autonomy algorithms. FlightGoggles provides the capability to simulate these realistic and versatile human-vehicle interactions in an inherently safe manner.

V-C AlphaPilot Challenge

The AlphaPilot challenge [23] is an autonomous drone racing challenge organized by Lockheed Martin, NVIDIA, and the Drone Racing League (DRL). The challenge is split into two stages, with a simulation phase open to the general public and a real-world phase where nine teams selected from the simulation phase will compete against each other by programming fully-autonomous racing drones built by the DRL. The FlightGoggles simulation framework was used as the main qualifying test in the simulation phase for selecting nine teams that would progress to the next stage of the AlphaPilot challenge. To complete the test, contestants had to submit code to Lockheed Martin that could autonomously pilot a simulated quadcopter with simulated sensors through the 11-gate race track shown in Figure 9 as fast as possible. Test details were revealed to all contestants on February 14th, 2019 and final submissions were due on March 20th, 2019. This section describes the AlphaPilot qualifying test and provides an analysis of anonymized submissions.

V-C1 Challenge outline

The purpose of the AlphaPilot simulation challenge was for teams to demonstrate their autonomous guidance, navigation, and control capability in a realistic simulation environment. The participants’ aim was to make a simulated quadcopter based on the multicopter dynamics model described in Section IV complete the track as fast as possible. To accomplish this, measurements from four simulated sensors were provided: (stereo) cameras, IMU, downward-facing time-of-flight range sensor, and infrared gate beacons. Through the FlightGoggles ROS API, autonomous systems could obtain sensor measurements and provide collective thrust and attitude rate inputs to the quadcopter’s low-level acro/rate mode controller.

The race track was located in the FlightGoggles Abandoned Factory environment and consisted of 11 gates. To successfully complete the entire track, the quadcopter had to pass all gates in order. Points were deducted for missed gates, leading to a performance metric based on the number of gates passed in order and the time taken in seconds to reach the final gate. If the final gate was not reached within the race time limit or the quadcopter collided with an environment object, a score of zero was recorded.

To discourage memorization of the course, there were 25 courses for each contestant to complete during evaluation. For each course, the exact gate locations were subject to random unknown perturbations. These perturbations were large enough to require adapting the vehicle trajectory, but did not change the track layout in a fundamental way. For development and verification of their algorithms, participants were also provided with another set of 25 courses with identically distributed gate locations. The final score for the challenge was the mean of the five highest scores among the 25 evaluation courses.
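The aggregation step described above, i.e., taking the mean of the five highest scores among the evaluation courses, can be sketched as follows. The function name `final_score` and the example scores are hypothetical; the per-course scores are taken as given inputs, with failed runs already recorded as zero.

```python
def final_score(course_scores, top_k=5):
    """Return the mean of the top_k highest per-course scores."""
    if len(course_scores) < top_k:
        raise ValueError("need at least top_k course scores")
    best = sorted(course_scores, reverse=True)[:top_k]
    return sum(best) / top_k


# Hypothetical team: five completed courses and 20 failed courses (score zero).
example_scores = [80.0, 0.0, 75.0, 0.0, 78.0, 60.0, 0.0, 70.0] + [0.0] * 17
example_final = final_score(example_scores)
```

In this example the 20 failed courses do not affect the result, since only the five best runs enter the mean.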

V-C2 FlightGoggles sensor usage

Table III shows the usage of provided sensors, the algorithm choices, and final and five highest scores for the 20 top teams (sorted by final score). All of these 20 teams used both the simulated IMU sensor and the infrared beacon sensors. Several teams chose to also incorporate the camera and the time-of-flight range sensor. A more detailed overview of the sensor combinations used by the teams is shown in Table IV. This table shows the number of teams that employed a particular combination of sensors, the percentage of runs completed, and the mean and standard deviation of the scores across all 25 attempts.

(a) Visualization of speed profiles across successful runs.
(b) Visualization of speed profiles across all runs.
(c) Crash locations for all teams across all runs. Note that most crashes occur near gates, obstacles, or immediately post-takeoff.
Figure 10: Overhead visualizations of speed profiles and crash locations for top 20 AlphaPilot teams across all 25 runs.
Column groups: Sensors, Estimation, Planning (e.g., Visual Servo), Control, and Scores.
Final Score | Score 1 | Score 2 | Score 3 | Score 4 | Score 5
91.391 | 91.523 | 91.495 | 91.377 | 91.315 | 91.244
84.517 | 85.354 | 84.624 | 84.326 | 84.186 | 84.096
81.044 | 81.468 | 81.048 | 80.936 | 80.922 | 80.848
80.560 | 80.993 | 80.859 | 80.395 | 80.294 | 80.257
78.613 | 78.777 | 78.645 | 78.562 | 78.549 | 78.531
78.552 | 78.693 | 78.592 | 78.500 | 78.497 | 78.478
76.078 | 76.595 | 76.173 | 75.947 | 75.902 | 75.774
74.225 | 74.509 | 74.173 | 74.168 | 74.142 | 74.131
71.443 | 71.503 | 71.471 | 71.446 | 71.437 | 71.359
71.096 | 71.278 | 71.101 | 71.074 | 71.024 | 71.002
70.873 | 73.580 | 72.797 | 72.784 | 72.704 | 62.497
70.456 | 71.025 | 70.791 | 70.222 | 70.211 | 70.032
69.908 | 71.419 | 70.699 | 69.294 | 69.167 | 68.962
57.264 | 76.958 | 76.345 | 66.540 | 66.477 | 0.000
56.275 | 56.484 | 56.359 | 56.206 | 56.169 | 56.156
55.875 | 57.496 | 56.152 | 55.708 | 55.289 | 54.731
29.843 | 74.758 | 74.457 | 0.000 | 0.000 | 0.000
12.986 | 25.192 | 19.887 | 19.853 | 0.000 | 0.000
12.576 | 41.047 | 21.835 | 0.000 | 0.000 | 0.000
11.814 | 59.068 | 0.000 | 0.000 | 0.000 | 0.000
Table III: Sensor usage, algorithm choices and five highest scores in AlphaPilot simulation challenge.
Sensor Package Selection | Number of Teams | Completed Runs (%) | Mean Score | Std. Deviation
IMU + IR | 12 | 48.67 | 35.32 | 37.72
IMU + IR + Camera | 4 | 36 | 26.72 | 35.87
IMU + IR + Ranger | 3 | 24 | 15.04 | 27.43
IMU + IR + Ranger + Camera | 1 | 60 | 41.39 | 34.55
Table IV: Sensor combinations used by the top AlphaPilot teams, percentage of completed runs, mean score and standard deviation across all 25 evaluation courses.
(a) The number of completed runs for the top 20 teams.
(b) The mean and standard deviation of scores.
Figure 11: The figures above show the number of completed runs by each of the top 20 AlphaPilot teams along with the mean and standard deviation of their scores for all runs.

V-C3 Algorithm choices

The contestants were tasked with developing guidance, navigation, and control algorithms. Table III tabulates the general estimation, planning, and control approaches used by each team alongside the sensor choices and their scores.

Of the top 20 teams, only one used an end-to-end learning-based method. The other 19 teams relied on more traditional pipelines (estimation, planning, and control) to complete the challenge. One of those teams used learning to determine the pose of the camera from the image.


For state estimation, all but one team used a filtering algorithm such as the extended Kalman filter [29], unscented Kalman filter [30], particle filter [31], or the Madgwick filter [32]; the remaining team used a smoothing-based technique [33]. The teams that chose to use a visual inertial odometry algorithm opted for off-the-shelf solutions such as ROVIO [34, 35] or VINS-Mono [36] for state estimation.

Planning: The most common planning methods were visual servoing using the infrared beacons and polynomial trajectory planning such as [37, 28]. Other teams used manually defined waypoints or sampling-based techniques for building trajectory libraries. Five of the 19 teams that used model-based techniques also incorporated some form of perception awareness into their planning algorithms.

Control: The predominant methods for control were linear control techniques and model predictive control [38]. The remaining teams used geometric and backstepping control methods [39].

V-C4 Analysis of trajectories

To visualize the speed along the trajectory, we discretize the trajectories on a grid and color each grid cell by the logarithm of the average speed in that cell. Figure 10(a) shows the trajectories of all successful course traversals colored by the speed. From the figure, we can observe that most teams chose to slow down for the two sharp turns that are required. We can also observe that the average speed around gates is lower than in other portions of the environment, which can be attributed to the need to search for the ‘next gate’. Figure 10(b) shows the trajectories of all course traversals, including trajectories that eventually crash, colored by the speed. Figure 10(c) shows the crash locations of all the failed attempts. Because many of the teams relied on visual-servo-based techniques and the gate beacons are harder to observe close to the gates, many of the crash locations are close to gates.
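The gridding step can be sketched as follows. This is a minimal example under assumptions that the text does not specify: square cells of a hypothetical size `cell_size`, the natural logarithm, and trajectory samples given as `(x, y, speed)` tuples. The function name `log_speed_grid` is likewise chosen for illustration.

```python
import math
from collections import defaultdict


def log_speed_grid(samples, cell_size=1.0):
    """Map each grid cell (ix, iy) to the log of the average speed of its samples.

    samples: iterable of (x, y, speed) tuples, e.g., positions [m] and speeds [m/s].
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of speeds, sample count]
    for x, y, speed in samples:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        totals[cell][0] += speed
        totals[cell][1] += 1
    grid = {}
    for cell, (speed_sum, count) in totals.items():
        mean_speed = speed_sum / count
        if mean_speed > 0.0:  # the logarithm is undefined at zero, so skip such cells
            grid[cell] = math.log(mean_speed)
    return grid
```

The resulting dictionary can be passed to any plotting tool as a sparse image; cells that no trajectory visits simply have no entry.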

V-C5 Individual performance of top teams

For individual performance, we analyze the number of completed runs for each of the top teams and the mean and standard deviations of the scores. This is shown in Figure 11. Given that the scoring function for the competition only used the top scores, the teams were encouraged to take significant risk, and only one team consistently completed all of the challenges provided successfully. As a result of the risk taking to achieve faster speeds, of the contestants failed to complete the challenge half the time. This risk-taking behavior is also observed in the large standard deviation for all the teams.
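The per-team statistics discussed here can be computed along the following lines. This is a sketch under assumptions: a run is counted as completed when its score is positive (failed runs are recorded as zero), the population standard deviation is used, and the function name `run_statistics` and the example team are hypothetical.

```python
import statistics


def run_statistics(course_scores):
    """Return (completion percentage, mean score, standard deviation) for one team."""
    completed = sum(1 for score in course_scores if score > 0.0)
    completion_pct = 100.0 * completed / len(course_scores)
    mean_score = statistics.mean(course_scores)
    std_score = statistics.pstdev(course_scores)
    return completion_pct, mean_score, std_score


# Hypothetical risk-taking team: 10 fast completed runs and 15 failed runs.
risky_team = [80.0] * 10 + [0.0] * 15
pct, mean_score, std_score = run_statistics(risky_team)
```

For this hypothetical team, the mixture of high scores and zeros yields a standard deviation larger than the mean, which mirrors the pattern of large standard deviations noted above.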

VI Conclusions

This paper introduced FlightGoggles, a new modular framework for realistic simulation to aid robotics testing and development. FlightGoggles is enabled by photogrammetry and virtual reality technologies. Heavy utilization of photogrammetry helps provide realistic simulation of camera sensors. Utilization of virtual reality allows direct integration of real vehicle motion and human behavior acquired in motion capture facilities into the simulation system. FlightGoggles is being actively utilized by a community of robotics researchers. In particular, FlightGoggles has served as the main test for selecting the contestants for the AlphaPilot autonomous drone racing challenge. This paper also presented a survey of approaches and results from the simulation challenge, which may be of independent interest.


  • [1] R. A. Brooks and M. J. Mataric, “Real robots, real learning problems,” in Robot learning.   Springer, 1993, pp. 193–213.
  • [2] H. Chiu, V. Murali, R. Villamil, G. D. Kessler, S. Samarasekera, and R. Kumar, “Augmented reality driving using semantic geo-registration,” in IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2018, pp. 423–430.
  • [3] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” arXiv preprint arXiv:1804.10332, 2018.
  • [4] “Unreal Engine,” https://www.unrealengine.com/, 2019, [Online; accessed 28-February-2019].
  • [5] “Unity3d Game Engine,” https://unity3d.com/, 2019, [Online; accessed 28-February-2019].
  • [6] “Nvidia Turing GPU architecture,” Nvidia Corporation, Tech. Rep. 09183, 2018.
  • [7] T. Erez, Y. Tassa, and E. Todorov, “Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and Physx,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 4397–4404.
  • [8] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004, pp. 2149–2154.
  • [9] J. Meyer, A. Sendobry, S. Kohlbrecher, U. Klingauf, and O. Von Stryk, “Comprehensive simulation of quadrotor UAVs using ROS and Gazebo,” in International Conference on Simulation, Modeling, and Programming for Autonomous Robots.   Springer, 2012, pp. 400–411.
  • [10] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, “RotorS: A modular Gazebo MAV simulator framework,” in Robot Operating System (ROS).   Springer, 2016, pp. 595–625.
  • [11] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics.   Springer, 2018, pp. 621–635.
  • [12] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The Synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3234–3243.
  • [13] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in CVPR, 2016.
  • [14] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014.
  • [15] A. Antonini, W. Guerra, V. Murali, T. Sayre-McCord, and S. Karaman, “The Blackbird dataset: A large-scale dataset for UAV perception in aggressive flight,” in International Symposium on Experimental Robotics (ISER), 2018.
  • [16] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: an open-source robot operating system,” in ICRA Workshop on Open Source Software, 2009.
  • [17] A. S. Huang, E. Olson, and D. C. Moore, “LCM: Lightweight communications and marshalling,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 4057–4062.
  • [18] “Reality Capture,” https://www.capturingreality.com/Product, 2019, [Online; accessed 28-February-2019].
  • [19] “LightWare SF11/B Laser Range Finder,” http://documents.lightware.co.za/SF11%20-%20Laser%20Altimeter%20Manual%20-%20Rev%208.pdf, 2019, [Online; accessed 28-February-2019].
  • [20] M. Bronz and S. Karaman, “Preliminary experimental investigation of small scale propellers at high incidence angle,” in AIAA Aerospace Sciences Meeting, 2018, pp. 1268–1277.
  • [21] T. Sayre-McCord, W. Guerra, A. Antonini, J. Arneberg, A. Brown, G. Cavalheiro, Y. Fang, A. Gorodetsky, D. McCoy, S. Quilter, F. Riether, E. Tal, Y. Terzioglu, L. Carlone, and S. Karaman, “Visual-inertial navigation algorithm development using photorealistic camera simulation in the loop,” in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2566–2573.
  • [22] V. Murali, I. Spasojevic, W. Guerra, and S. Karaman, “Perception-aware trajectory generation for aggressive quadrotor flight using differential flatness,” in American Control Conference (ACC), 2019.
  • [23] “AlphaPilot – Lockheed Martin AI Drone Racing Innovation Challenge,” https://www.herox.com/alphapilot, 2019, [Online; accessed 28-February-2019].
  • [24] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration theory for fast and accurate visual-inertial navigation,” IEEE Transactions on Robotics, pp. 1–18, 2015.
  • [25] C. Beall and F. Dellaert, “Appearance-based localization across seasons in a metric map,” 6th PPNIV, Chicago, USA, 2014.
  • [26] P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, “Street-view change detection with deconvolutional networks,” Autonomous Robots, vol. 42, no. 7, pp. 1301–1322, 2018.
  • [27] E. Tal and S. Karaman, “Accurate tracking of aggressive quadrotor trajectories using incremental nonlinear dynamic inversion and differential flatness,” in IEEE Conference on Decision and Control (CDC), 2018, pp. 4282–4288.
  • [28] C. Richter, A. Bry, and N. Roy, “Polynomial trajectory planning for aggressive quadrotor flight in dense indoor environments,” in Robotics Research.   Springer, 2016, pp. 649–666.
  • [29] G. L. Smith, S. F. Schmidt, and L. A. McGee, “Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle,” 1962.
  • [30] S. J. Julier and J. K. Uhlmann, “New extension of the Kalman filter to nonlinear systems,” in Signal processing, sensor fusion, and target recognition VI, vol. 3068.   International Society for Optics and Photonics, 1997, pp. 182–194.
  • [31] J. S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, no. 443, pp. 1032–1044, 1998.
  • [32] S. Madgwick, “An efficient orientation filter for inertial and inertial/magnetic sensor arrays,” Report x-io and University of Bristol (UK), vol. 25, pp. 113–118, 2010.
  • [33] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “iSAM2: Incremental smoothing and mapping using the Bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
  • [34] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct EKF-based approach,” in 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2015, pp. 298–304.
  • [35] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
  • [36] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
  • [37] D. Mellinger and V. Kumar, “Minimum snap trajectory generation and control for quadrotors,” in 2011 IEEE International Conference on Robotics and Automation.   IEEE, 2011, pp. 2520–2525.
  • [38] M. Kamel, T. Stastny, K. Alexis, and R. Siegwart, “Model predictive control for trajectory tracking of unmanned aerial vehicles using robot operating system,” in Robot Operating System (ROS).   Springer, 2017, pp. 3–39.
  • [39] S. Bouabdallah and R. Siegwart, “Backstepping and sliding-mode techniques applied to an indoor micro quadrotor,” in Proceedings of the 2005 IEEE international conference on robotics and automation.   IEEE, 2005, pp. 2247–2252.