1 Introduction
In recent years, there has been an explosion of research on micro-aerial vehicles (MAVs), ranging from low-level control [Lee et al., 2010] to high-level, specification-based planning [Wolff et al., 2014]. One class of MAVs, the quadrotor, has become popular in academia and industry alike due to its mechanical and control simplicity, high maneuverability, and low cost of entry compared to other aerial robots [Karydis and Kumar, 2016]. Indeed, quadrotors have been applied to fields such as Intelligence, Surveillance and Reconnaissance (ISR), aerial photography, structural inspection [Özaslan et al., 2016], robotic first response [Mohta et al., 2016], cooperative construction [Augugliaro et al., 2014], and aerial manipulation [Thomas et al., 2014]. Such advances have steadily pushed forward the capabilities of quadrotors.
Most of these works rely on the availability of ground truth measurements to enable smooth robot performance. Ground truth can be provided by motion capture systems in lab settings, by GPS outdoors, or by appropriately placed tags in known environments. However, real-world environments are dynamic, partially known, and often GPS-denied. Therefore, there is a need to push further on developing fully autonomous navigation systems that rely only on onboard sensing in order to fully realize the potential of quadrotors in real-world applications. Our work aims at narrowing this gap.
Specifically, the focus of this paper is to provide a detailed description of a quadrotor system that is able to navigate at high speeds between a start and a goal pose (position and orientation) in cluttered 3D indoor and outdoor environments while using only onboard sensing and computation for state estimation, control, mapping and planning. The motivation for this problem comes from the recently announced DARPA Fast Lightweight Autonomy program (http://www.darpa.mil/program/fast-lightweight-autonomy). The main challenge in creating such small, completely autonomous MAVs comes from the size and weight constraints imposed on the payload carried by these platforms. This restricts the kinds of sensors and computers that can be carried by the robot and requires careful consideration when choosing the components to be used for a particular application. Also, since the goal is fast flight, we want to keep the weight as low as possible in order to allow the robot to accelerate, decelerate and change directions quickly.

Any robot navigation system is composed of the standard building blocks of state estimation, control, mapping and planning. Each of these blocks builds on top of the previous ones in order to construct the full navigation system. For example, the controller requires a working state estimator, while the planner requires a working state estimator, controller and mapping system. Initial works on state estimation for aerial robots with purely onboard sensing used laser rangefinders as the sensing modality due to the limited computational capacity available on the platforms [Bachrach et al., 2009, Achtelik et al., 2009, Grzonka et al., 2009]. Due to the limitations of laser scan matching coupled with limited computational capability, these works were restricted to slow speeds. Since the small and lightweight laser rangefinders that could be carried by aerial robots only measured distances in a single plane, these methods also required certain simplifying assumptions about the environment, for example assuming a 2.5D structure. As computational power grew and more efficient algorithms were proposed, it became possible to use vision for state estimation on aerial robots [Achtelik et al., 2009, Blösch et al., 2010]. Using cameras for state estimation allowed flights in 3D unstructured environments and also at faster speeds [Shen et al., 2013]. As vision algorithms have improved [Klein and Murray, 2007, Mourikis and Roumeliotis, 2007, Jones and Soatto, 2011, Forster et al., 2014, Mur-Artal et al., 2015, Bloesch et al., 2015] and the computational power available on small computers has grown, cameras have become the sensor of choice for state estimation for aerial robots. We use the visual odometry algorithm described in [Forster et al., 2017] for our platform due to its fast run time, its ability to use wide-angle lenses without requiring undistortion of the full image, its support for multiple cameras to improve the robustness of the system, and its incorporation of edgelet features in addition to the usual point features.
The dynamics of the quadrotor are nonlinear due to the rotational degrees of freedom. In the control design for these robots, special care has to be taken to account for this nonlinearity in order to utilize the full dynamics of the robot. Most early works in control design for quadrotors
[Bouabdallah et al., 2004, Bouabdallah and Siegwart, 2005, Escareno et al., 2006, Hoffmann et al., 2007, Bouabdallah and Siegwart, 2007] used the small angle approximation for the orientation controller to convert the problem into a linear one and proposed PID and backstepping controllers to stabilize the simplified system. Due to the small angle assumption, these controllers are not able to handle large orientation errors and have large tracking errors for aggressive trajectories. A nonlinear controller using quaternions (instead of Euler angles) was developed in [Guenard et al., 2005], where the quadrotor was commanded to follow velocity commands. [Lee et al., 2010] defined an orientation error metric directly on SO(3) and proposed an almost globally asymptotically stable controller that can stabilize the quadrotor from large position and orientation errors. Our controller is based upon this work and has good tracking performance even when following aggressive trajectories.

It has been shown that trajectory generation for multi-rotor MAVs can be formulated as a Quadratic Program (QP) [Mellinger and Kumar, 2011]. Since the quadrotor is a differentially flat system, the trajectory can be optimized as a polynomial of fixed order parameterized in time [Mellinger and Kumar, 2011]. Generating a collision-free trajectory is more complicated, since additional constraints for collision checking are required. Using mixed-integer optimization methods to solve this problem has been discussed in [Mellinger et al., 2012], and recently other approaches have been proposed to remove the integer variables and solve the QP in a more efficient way [Richter et al., 2016, Deits and Tedrake, 2015b, Watterson and Kumar, 2015]. Our pipeline uses a piecewise-linear path from a search-based planning algorithm to guide the convex decomposition of the map and find a safe corridor in free space, as described in [Liu et al., 2016]. The safe corridor is expressed as linear inequality constraints in the QP for collision checking. We also include dynamic constraints on velocity, acceleration and jerk in the QP to ensure that the generated trajectory does not violate the system's dynamics. In order to increase safety, we propose a modified cost functional in the trajectory generation step such that the generated trajectory stays close to the center of the safe corridor.
We couple our trajectory generation method described above with a receding horizon method [Bellingham et al., 2002] for replanning. As the robot moves, we only keep a local, robot-centric map and use a local planner to generate the trajectory. The main reasons behind this approach are: first, updating and planning in a global map is expensive; second, the map far away from the robot is less accurate and less important. Since we are using a local planning algorithm, dead-ends are a well-known challenge. In order to handle them efficiently, we build a hybrid map consisting of a 3D local map and a 2D global map, and our planner searches this hybrid map to provide a globally consistent local action.
The purpose of a navigation system is to enable a robot to successfully traverse from a start pose to a goal pose in either a known or an unknown environment. The problem of navigating in an unknown environment is especially difficult because, in addition to having good state estimation and control, the robot needs to build an accurate map as it moves and also generate collision-free trajectories quickly in the known part of the map, so that replanning can be done at a high rate as the robot receives new sensor data. The initial works on navigation in unknown environments with quadrotors used offboard computation to run the planning due to the limited computation capability available on the platforms. [He et al., 2008] presented a navigation system that uses a known map and takes the localization sensor model into account when planning so as to avoid regions that would lead to poor localization quality. [Grzonka et al., 2009] demonstrated a quadrotor system that is able to localize and navigate in a known map using laser rangefinders as the main source of localization and mapping. Both of these transferred the sensor data to an offboard computer for processing. With a more powerful computer onboard the robot, [Bachrach et al., 2011] were able to run a scan matching based localization system and the position and orientation controllers on the robot, while the planner and a SLAM system producing a globally consistent map ran on an offboard computer. [Shen et al., 2011] was the first to demonstrate a full navigation system running onboard the robot without a known map of the environment. Since then, multiple groups have demonstrated similar capabilities [Valenti et al., 2014, Schmid et al., 2014].
In this paper, we describe our navigation system that allows a quadrotor to go from a starting position to a goal location while avoiding obstacles during the flight. We believe this is one of the first systems that is capable of fast aerial robot navigation through cluttered GPS-denied environments using only onboard sensing and computation. The navigation system has been tested thoroughly in the lab and in real world obstacle-rich environments that were set up as part of the DARPA FLA program.
This paper is organized as follows. In Section 2 we describe our platform and the design decisions made in order to choose the current configuration. In Section 3, we describe our estimation and control algorithms. In this section, we also elaborate upon our sensor fusion methodology that is crucial to get good state estimates in order to control the robot. Section 4 describes our mapping, planning and trajectory generation modules. In Section 5 we show results from various experiments performed in order to benchmark and test our full navigation system. Finally, we conclude in Section 6 with some discussion about the results and give some directions of future work that would help improve our system.
2 System Design
In this section we describe our overall system design. Specifically, we discuss platform design considerations, describe our computation, sensing and communication modules, and highlight critical software architecture components that enable the system to operate smoothly.

2.1 Platform Design
The guiding principle in the design of the platform was fast and agile flight. The desired capability of the platform was to be able to reach high speeds while avoiding obstacles. This leads to a secondary and stronger requirement that the platform has to be able to stop from those speeds within typical sensor detection distances, which in turn implies that the platform should be capable of large accelerations. Reaching such high accelerations while maintaining altitude requires a correspondingly high thrust-to-weight ratio. In order to have some margin for control during these high acceleration phases, we searched for an off-the-shelf platform that had sufficient thrust to provide an even higher thrust-to-weight ratio when fully loaded, including the expected sensing and computation payload and a battery sufficient for the desired flight time. A list of various commercially available options is shown in Table 1.
| Platform | Frame (kg) | Battery (kg) | All-up (kg) | Max Thrust (kgf) | Thrust/Weight Ratio |
|---|---|---|---|---|---|
| 3DR X8+ | 1.855 | 0.817 (4S) | 3.672 | 10.560 | 2.876 |
| DJI F550 + E310 | 1.278 | 0.600 (4S) | 2.878 | 5.316 | 1.847 |
| DJI F550 + E600 | 1.494 | 0.721 (6S) | 3.215 | 9.600 | 2.986 |
| DJI F450 + E310 | 0.826 | 0.400 (3S) | 2.226 | 3.200 | 1.438 |
| DJI F450 + E600 | 0.970 | 0.721 (6S) | 2.691 | 6.400 | 2.378 |
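To make the sizing argument above concrete, the short script below works through the stopping-distance and thrust-to-weight calculation. The top speed, detection range and mass used here are assumed example values chosen only for illustration, not the exact figures used in our design:

```python
import math

# Illustrative sizing numbers (assumptions for this sketch, not the paper's exact values).
v_max = 20.0          # desired top speed [m/s]
d_stop = 25.0         # assumed sensor detection / stopping distance [m]
m_total = 3.0         # assumed all-up mass [kg]
g = 9.81              # gravity [m/s^2]

# Constant-deceleration stop: v^2 = 2 * a * d  =>  a = v^2 / (2 * d)
a_req = v_max ** 2 / (2.0 * d_stop)

# While braking horizontally the rotors must still support the weight, so the
# required total thrust is the vector sum of m*g (vertical) and m*a (horizontal).
thrust_req = m_total * math.hypot(g, a_req)
twr_req = thrust_req / (m_total * g)

print(f"required deceleration: {a_req:.1f} m/s^2")
print(f"required thrust:       {thrust_req:.1f} N")
print(f"thrust-to-weight:      {twr_req:.2f}")
```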
Based on the survey of available platforms, we selected a configuration consisting of the DJI Flamewheel 450 base along with the DJI E600 motors, propellers and speed controllers, since it closely matches our performance requirements. Each E600 motor and propeller combination has a rated maximum thrust of approximately 1.6 kgf, giving a total thrust of around 6.4 kgf for our quadrotor configuration (Table 1). For the low-level controller, we selected the Pixhawk [Meier et al., 2015], which is a popular open-source autopilot. The main reason behind choosing the Pixhawk is that the firmware is open-source and customizable, giving us the capability of easily modifying or adding low-level functionality as desired. In comparison, most commercially available autopilot boards are black boxes with an interface to send commands and receive sensor data. The base platform consisting of the F450 frame, the E600 propulsion system and the Pixhawk, together with the sensing and computation payload, determines the platform mass without the battery. In order to achieve the flight time requirement, we need to select the correct battery while keeping the total mass of the platform below the limit imposed by the available thrust.

The batteries used in MAVs are based on lithium polymer chemistry due to their high energy and power densities. The DJI E600 propulsion system requires a 6S battery, i.e., a battery with six cells in series and a nominal voltage of approximately 22.2 V. Given that, the main design choice available is the battery capacity. Typical hobby-grade lithium polymer batteries have broadly similar specific energy values (Fig. 2). The power required to hover scales approximately with the mass of the quadrotor [Mulgaonkar et al., 2014], so the total mass of the platform determines the hover power consumption. Assuming an overall efficiency [Theys et al., 2016] going from the power supplied by the battery to the mechanical power output at the propellers, the required battery energy capacity follows from the desired flight time. In practice, we never use the full capacity of the battery, in order to preserve its life and to keep some reserve capacity for unforeseen circumstances, which further increases the required rated capacity. Using a typical specific energy value, the resulting battery mass fits well within our total mass budget. Based on the available battery capacities, we selected two battery sizes in order to provide some flexibility in terms of higher performance or higher endurance.
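The battery-sizing chain of reasoning can be sketched numerically as follows. All of the inputs (hover power per kilogram, efficiency, flight time, usable fraction and specific energy) are assumed example values, so the output is only indicative:

```python
# Rough battery-sizing sketch with assumed example values (the exact figures used
# on the platform are not reproduced here).
hover_power_per_kg = 200.0   # assumed hover power at the propellers [W/kg]
mass = 3.0                   # assumed all-up mass [kg]
efficiency = 0.70            # assumed battery-to-propeller efficiency
flight_time_min = 10.0       # assumed desired flight time [min]
usable_fraction = 0.80       # only discharge this fraction of rated capacity
specific_energy = 150.0      # assumed pack specific energy [Wh/kg]

mech_power = hover_power_per_kg * mass                 # power needed at the propellers [W]
elec_power = mech_power / efficiency                   # power drawn from the battery [W]
energy_needed = elec_power * flight_time_min / 60.0    # energy actually consumed [Wh]
rated_capacity = energy_needed / usable_fraction       # required rated pack energy [Wh]
battery_mass = rated_capacity / specific_energy        # resulting pack mass [kg]

print(f"rated capacity: {rated_capacity:.0f} Wh, battery mass: {battery_mass:.2f} kg")
```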

In order to power all the sensors and the computer onboard the robot, we required regulated power supplies at two different voltages. We designed a custom power distribution board, shown in Fig. 3, consisting of a power conditioning circuit, DC-DC converters, power connectors, and a battery monitor. The board provides both filtered supply rails to the sensors and the computer. In addition to power management, for weight saving reasons, the board replaces the top plate of the standard commercially available frame, forming an integral part of the robot structure.
2.2 Sensing, Computation and Communication
The robot needs to navigate through cluttered 3D environments with purely onboard sensing and computation. This requires the correct selection of sensors and the onboard computer in order to be able to perform the desired task while keeping the mass low. The two tasks performed by the robot that drive the sensor selection are state estimation and mapping. The two main options for MAV state estimation are vision based and lidar based systems. For unstructured 3D environments, vision based systems have been more successful than lidar based ones, so we decided on using cameras as our primary state estimation sensors. More details about why the stereo configuration was selected are provided in Section 3. In addition to the cameras, we added a downward pointing lidar (Garmin Lidar-Lite) and a VectorNav VN-100 IMU for state estimation. The VN-100 IMU is also used to trigger the image capture of the cameras in order to have time synchronization between the cameras and the IMU.
The situation for mapping is a bit different. Current vision based dense mapping algorithms are either not accurate enough or too computationally expensive to run in real time, so lidar based mapping is still the preferred choice for MAVs. In order to keep our weight low, we decided to use a Hokuyo 2D lidar instead of a heavy 3D lidar. We still required a 3D map for planning, so we decided to mount the 2D lidar on a one degree of freedom nodding gimbal as shown in Fig. 4.
In order to handle all the computations for estimation, control, mapping and planning onboard the robot, we selected the Intel NUC i7 computer. This single board computer is based on the Intel i7-5557U processor and supports up to 16GB of RAM and an M.2 SSD for storage. This provides sufficient computing power to run our full software stack on the robot without overloading the CPU and also gives us an ample amount of storage for recording sensor data during long flights. While the robot is flying, we need a communication link in order to monitor the status of the various modules running on the robot. We wanted a link with good bandwidth, so that during development we can stream the sensor data back to the base station, but also good range, so that we do not lose the link when running long range experiments. In addition, since we use ROS as our software framework, a wireless link that behaves like a wireless local area network was preferred in order to be able to use the standard ROS message transport mechanism. Based on these requirements, we selected the Ubiquity Networks Picostation M2 for the robot side and the Nanostation M2 for the base station. These are high power wireless radios that incorporate Ubiquity Networks' proprietary airMAX protocol, which improves latency and reliability for long range wireless links compared to the 802.11 protocol, which was designed mainly for indoor use. The Picostation is the smaller and lighter of the two (after removing the outer plastic case), which comes at the cost of lower transmit power and lower bandwidth, but the performance was sufficient for our purposes over the ranges at which we operated.
2.3 Software Architecture

Any large system requires all of the individual components to work together in order for the full system to function properly. Figure 5 shows a high level block diagram of our system illustrating the different components and how they are connected to each other. The software components in our system can be grouped into four categories: estimation, control, mapping and planning. Each of these is in turn separated into smaller parts, and we use ROS as the framework for all the high level software running on the robot. ROS was chosen because it provides a natural way to separate each component into its own package, allowing distributed development and ease of testing and debugging. Each executable unit in a ROS system is called a node, and different nodes communicate with each other using message passing. In this way, a ROS system can be thought of as a computational graph consisting of a peer-to-peer network of nodes processing and passing data among them. One convenient feature of this design is that the nodes can run on different computers, since the message passing uses the TCP transport. This allows us to run a subset of the nodes on the robot while the rest run on a workstation computer, making it easier to analyze and debug problems and leading to faster development. We also benefit from the whole ROS ecosystem of tools and utilities that have been developed for routine but useful tasks, such as logging and playing back the messages passed between nodes or visualizing the data being exchanged.
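As an illustration of this message-passing structure, the following minimal rospy node subscribes to one topic and republishes on another. The topic names and the choice of message type are placeholders for illustration, not the actual interfaces of our modules:

```python
#!/usr/bin/env python
# Minimal ROS node sketch (rospy) illustrating the publish/subscribe pattern used to
# connect modules. Topic names and message types here are hypothetical placeholders.
import rospy
from geometry_msgs.msg import PoseStamped

class StateRelay(object):
    def __init__(self):
        # Publisher for a downstream module (e.g., the planner).
        self.pub = rospy.Publisher('planner/robot_pose', PoseStamped, queue_size=10)
        # Subscriber to an upstream module (e.g., the state estimator).
        rospy.Subscriber('ukf/pose', PoseStamped, self.pose_callback)

    def pose_callback(self, msg):
        # In a real module this is where processing would happen; here we just relay.
        self.pub.publish(msg)

if __name__ == '__main__':
    rospy.init_node('state_relay')
    StateRelay()
    rospy.spin()   # hand control to ROS; callbacks run as messages arrive
```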
3 Estimation and Control
There has been a lot of research in recent times on visual and visual-inertial odometry for MAVs, with a variety of proposed algorithms [Klein and Murray, 2007, Mourikis and Roumeliotis, 2007, Jones and Soatto, 2011, Forster et al., 2014, Mur-Artal et al., 2015, Bloesch et al., 2015]. The algorithms can be classified based on the number and type of cameras required into three groups: monocular, stereo, or multi-camera. There are also algorithms using depth cameras, but these cameras do not work well outdoors in sunlight, so we do not consider them. An overview of the advantages and disadvantages of the different configurations is shown in Table 2. Looking at these, it is clear that the multi-camera setup would be the most preferred, but the software complexity is still a hurdle in terms of real-world usage. Monocular algorithms have received a lot of research attention in the last few years and have improved to a level where they can be reliably used as the only source of odometry for a MAV system. One problem of monocular algorithms is that they require an initialization process during which the estimates are either not available or not reliable. In comparison, stereo algorithms can be initialized using a single frame, making them much more robust in case the algorithm needs to be reinitialized while flying, for example after a sudden rotation. One more advantage of using stereo cameras is that, in the extreme case where stereo matching is not possible because the features are too far away, we can use the input from only one of the cameras of the stereo pair and treat it as a monocular setup.

| | Monocular | Stereo | Multi-camera |
|---|---|---|---|
| Mechanical complexity | Low | Medium | Low |
| Software complexity | Medium | Low-Medium | High |
| Robustness | Low-Medium | Medium | High |
| Feature distance | High | Medium-High | High |
3.1 Visual Odometry
3.1.1 Overview of SVO
To estimate the six degree-of-freedom motion of the platform, we use the Semi-direct Visual Odometry (SVO) framework proposed in [Forster et al., 2017]. SVO combines the advantages of feature-based methods, which minimize the reprojection error of a sparse set of features, and direct methods, which minimize the photometric error between image pixels. The approach estimates frame-to-frame motion of the camera by first aligning the images using a sparse set of salient features (e.g. corners) and their neighborhoods in the images, then estimating the 3D positions of the feature points with recursive Bayesian depth estimation, and finally refining the structure and camera poses with optimization (e.g. bundle adjustment). Our efficient implementation of this approach is capable of estimating the pose of a camera frame in as little as 2.5 milliseconds, while achieving accuracy comparable to more computationally intensive methods.

The system is decomposed into two parallel threads: one for estimating camera motion, and one for mapping the environment (see Fig. 6). The motion-estimation thread proceeds by first performing a sparse image alignment between the two most recent frames, then obtaining sub-pixel feature correspondence using direct methods on patches around each sparse feature, and finally pose refinement on the induced reprojection error.
A brief overview of the notation used in describing the method is presented here. The intensity image from camera $C$ at timestep $k$ is denoted by $I_{k}^{C} : \Omega^{C} \subset \mathbb{R}^{2} \to \mathbb{R}$, where $\Omega^{C}$ is the image domain. A point in 3D space $\boldsymbol{\rho} \in \mathbb{R}^{3}$ maps to image coordinates $\mathbf{u} \in \mathbb{R}^{2}$ through the camera projection model $\mathbf{u} = \pi(\boldsymbol{\rho})$, which we assume is known. Given the inverse scene depth $\rho_{\mathbf{u}} > 0$ for a pixel $\mathbf{u}$, the position of the corresponding 3D point is obtained using the back-projection model $\boldsymbol{\rho} = \pi^{-1}(\mathbf{u}, \rho_{\mathbf{u}})$. We denote by $\mathcal{R}_{k}^{C} \subseteq \Omega^{C}$ the set of pixels for which the depth is known at time $k$ in camera $C$. The position and orientation of the world frame $W$ with respect to the camera frame is described by the rigid body transformation $\mathbf{T}_{CW} \in SE(3)$. A 3D point ${}_{W}\boldsymbol{\rho}$ that is expressed in world coordinates can be transformed to the camera frame using ${}_{C}\boldsymbol{\rho} = \mathbf{T}_{CW}\,{}_{W}\boldsymbol{\rho}$.
3.1.2 Motion Estimation

Consider a body frame $B$ that is rigidly attached to the camera frame $C$ with a known extrinsic calibration $\mathbf{T}_{CB} \in SE(3)$ (see Fig. 7). Our goal is to estimate the incremental motion of the body frame $\mathbf{T}_{kk-1} \doteq \mathbf{T}_{B_{k}B_{k-1}}$ such that the photometric error is minimized:
$$\mathbf{T}_{kk-1}^{*} = \arg\min_{\mathbf{T}_{kk-1}} \sum_{\mathbf{u} \in \bar{\mathcal{R}}_{k-1}^{C}} \frac{1}{2} \left\| r_{I_{\mathbf{u}}^{C}}\!\left(\mathbf{T}_{kk-1}\right) \right\|^{2} \qquad (1)$$
where the photometric residual $r_{I_{\mathbf{u}}^{C}}$ is defined by the intensity difference of the pixels in subsequent images $I_{k}^{C}$ and $I_{k-1}^{C}$ that observe the same 3D point ${}_{B_{k-1}}\boldsymbol{\rho}_{\mathbf{u}}$:

$$r_{I_{\mathbf{u}}^{C}}\!\left(\mathbf{T}_{kk-1}\right) = I_{k}^{C}\!\left(\pi\!\left(\mathbf{T}_{CB}\,\mathbf{T}_{kk-1}\,{}_{B_{k-1}}\boldsymbol{\rho}_{\mathbf{u}}\right)\right) - I_{k-1}^{C}\!\left(\pi\!\left(\mathbf{T}_{CB}\,{}_{B_{k-1}}\boldsymbol{\rho}_{\mathbf{u}}\right)\right) \qquad (2)$$
The corresponding 3D point ${}_{B_{k-1}}\boldsymbol{\rho}_{\mathbf{u}}$ for a pixel $\mathbf{u}$ with known depth, expressed in the reference body frame, is computed by means of back-projection:

$$_{B_{k-1}}\boldsymbol{\rho}_{\mathbf{u}} = \mathbf{T}_{BC}\,\pi^{-1}\!\left(\mathbf{u}, \rho_{\mathbf{u}}\right) \qquad (3)$$
Sparse image alignment solves the non-linear least squares problem in Eq. (1), with $\bar{\mathcal{R}}_{k-1}^{C}$ corresponding to small patches centered at features with known depth, using standard iterative non-linear least squares algorithms.
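At its core, sparse image alignment is an iterative Gauss-Newton minimization of a photometric error with respect to a warp parameter. The 1D toy example below estimates a sub-pixel shift between two intensity signals in exactly this way; it is only meant to illustrate the mechanics of direct alignment, not the actual SE(3), multi-patch implementation used in SVO:

```python
import numpy as np

def align_translation_1d(I_ref, I_cur, x0=0.0, iters=10):
    """Estimate a sub-pixel shift t such that I_cur(x + t) ~= I_ref(x), by Gauss-Newton
    on the photometric error (a 1D toy version of direct image alignment)."""
    xs = np.arange(len(I_ref), dtype=float)
    t = x0
    for _ in range(iters):
        warped = np.interp(xs + t, xs, I_cur)   # sample I_cur at shifted positions
        grad = np.gradient(warped)              # Jacobian of the residual w.r.t. t
        r = warped - I_ref                      # photometric residual
        H = np.dot(grad, grad)                  # 1x1 Gauss-Newton Hessian approximation
        if H < 1e-12:
            break
        t -= np.dot(grad, r) / H                # Gauss-Newton update
    return t

# Toy example: I_cur is I_ref shifted by 2.3 samples, so the estimate should be ~2.3.
grid = np.linspace(0.0, 4.0 * np.pi, 200)
I_ref = np.sin(grid)
I_cur = np.sin(grid - 2.3 * (grid[1] - grid[0]))
print(align_translation_1d(I_ref, I_cur))
```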
In the next step, we relax the geometric constraints given by the reprojection of 3D points and perform an individual 2D alignment of corresponding feature patches. In order to minimize feature drift over the camera trajectory, the alignment of each patch in the new frame is performed with respect to a reference patch from the frame where the feature was first extracted. However, this step generates a reprojection error, representing the difference between the projected 3D point and the aligned feature position.
We detect both edge and corner features, representing points of strong gradient in the image. Feature alignment in the image minimizes the intensity difference of a small image patch that is centered at the projected feature position $\mathbf{u}'$ in the newest frame with respect to a reference patch from the frame $r$ where the feature was first observed. For corner features, the optimization computes a correction $\delta\mathbf{u}^{*} \in \mathbb{R}^{2}$ to the predicted feature position that minimizes the photometric cost:

$$\delta\mathbf{u}^{*} = \arg\min_{\delta\mathbf{u}} \sum_{\Delta\mathbf{u} \in \mathcal{P}} \frac{1}{2} \left\| I_{k}^{C}\!\left(\mathbf{u}' + \delta\mathbf{u} + \Delta\mathbf{u}\right) - I_{r}^{C}\!\left(\mathbf{u} + \Delta\mathbf{u}\right) \right\|^{2} \qquad (4)$$
where $\Delta\mathbf{u}$ is the iterator variable that is used to compute the sum over the patch $\mathcal{P}$. For features on edges, we instead optimize for a scalar correction $\delta u^{*}$ in the direction of the edge normal $\mathbf{n}$ to obtain the corresponding feature position in the newest frame:

$$\delta u^{*} = \arg\min_{\delta u} \sum_{\Delta\mathbf{u} \in \mathcal{P}} \frac{1}{2} \left\| I_{k}^{C}\!\left(\mathbf{u}' + \delta u\,\mathbf{n} + \Delta\mathbf{u}\right) - I_{r}^{C}\!\left(\mathbf{u} + \Delta\mathbf{u}\right) \right\|^{2} \qquad (5)$$
In the previous step, we established feature correspondence with subpixel accuracy, but this feature alignment violated the epipolar constraints and introduced a reprojection error $\delta\mathbf{u}$, which is typically well below 0.5 pixels. Therefore, in the last step of motion estimation, we refine the camera poses and landmark positions by minimizing the squared sum of reprojection errors:

$$\min_{\{\mathbf{T}_{kW}\},\,\{{}_{W}\boldsymbol{\rho}_{i}\}} \;\; \sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{L}_{k}} \frac{1}{2} \left\| \mathbf{u}_{i}' - \pi\!\left(\mathbf{T}_{CB}\,\mathbf{T}_{kW}\,{}_{W}\boldsymbol{\rho}_{i}\right) \right\|^{2} \; + \; \sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{E}_{k}} \frac{1}{2} \left\| \mathbf{n}_{j}^{T}\!\left(\mathbf{u}_{j}' - \pi\!\left(\mathbf{T}_{CB}\,\mathbf{T}_{kW}\,{}_{W}\boldsymbol{\rho}_{j}\right)\right) \right\|^{2} \qquad (6)$$

where $\mathcal{K}$ is the set of all keyframes in the map, $\mathcal{L}_{k}$ the set of all landmarks corresponding to corner features, and $\mathcal{E}_{k}$ the set of all edge features that were observed in the $k$-th camera frame.
While optimization over the whole trajectory in Eq. (6) with bundle adjustment results in higher accuracy, for MAV motion estimation, it is sufficient to only optimize the latest camera pose and the 3D points separately, which permits more efficient computation.
3.1.3 Mapping
In the derivation of motion estimation, we assumed that the depth at sparse feature locations in the image was known. Here, we describe how the mapping thread estimates this depth for newly detected features, assuming that the camera poses are known from the motion estimation thread.
The depth at a single pixel is estimated from multiple observations by means of a recursive Bayesian depth filter. When the number of tracked features falls below some threshold, a new keyframe is selected and new depth filters are initialized at corner and edge features in that frame. Every depth filter is associated with this reference keyframe, and the initial depth uncertainty is initialized with a large value. For a set of previous keyframes as well as every subsequent frame with known relative pose, we search for a patch along the epipolar line that has the highest correlation via zero mean sum of squared differences (see Fig. 8). From the pixel with maximum correlation, we triangulate a depth measurement, which is used to update the depth filter. If enough measurements have been obtained such that the uncertainty in the depth is below a certain threshold, we initialize a new 3D point at the estimated depth in our map, which subsequently can be used for motion estimation (see system overview in Fig. 6).
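The sketch below shows the recursive structure of such a depth filter in a heavily simplified form: it fuses inverse-depth measurements as Gaussians and checks a convergence threshold, but it omits the inlier probability and the Gaussian + Uniform outlier model described next, so it is not the filter actually used in SVO:

```python
import numpy as np

class SimpleDepthFilter(object):
    """Heavily simplified recursive depth filter: fuses inverse-depth measurements as
    Gaussians and declares convergence when the variance is small. (The filter used in
    SVO additionally models an inlier probability with a Gaussian + Uniform mixture.)"""

    def __init__(self, z_init, sigma_init):
        self.mu = 1.0 / z_init          # mean inverse depth [1/m]
        self.var = sigma_init ** 2      # inverse-depth variance

    def update(self, z_meas, tau):
        """Fuse one triangulated depth measurement z_meas [m]; tau is the measurement
        standard deviation expressed in inverse depth."""
        mu_meas, var_meas = 1.0 / z_meas, tau ** 2
        new_var = 1.0 / (1.0 / self.var + 1.0 / var_meas)       # product of Gaussians
        self.mu = new_var * (self.mu / self.var + mu_meas / var_meas)
        self.var = new_var

    def converged(self, thresh=1e-4):
        return self.var < thresh

    def depth(self):
        return 1.0 / self.mu

# Noisy measurements of a point at roughly 5 m:
f = SimpleDepthFilter(z_init=3.0, sigma_init=1.0)
rng = np.random.default_rng(0)
for _ in range(20):
    f.update(5.0 + rng.normal(0.0, 0.2), tau=0.01)
print(f.depth(), f.converged())
```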

We model the depth filter according to [Vogiatzis and Hernández, 2011] with a two-dimensional distribution: the first dimension is the inverse depth $\rho$ [Civera et al., 2008], while the second dimension $\gamma$ is the inlier probability. Hence, a measurement $\tilde{\rho}_{i}^{k}$ is modeled with a Gaussian + Uniform mixture model distribution: an inlier measurement is normally distributed around the true inverse depth $\rho_{i}$, while an outlier measurement arises from a uniform distribution in the interval $[\rho_{i}^{\min}, \rho_{i}^{\max}]$:

$$p\!\left(\tilde{\rho}_{i}^{k} \mid \rho_{i}, \gamma_{i}\right) = \gamma_{i}\,\mathcal{N}\!\left(\tilde{\rho}_{i}^{k} \mid \rho_{i}, \tau_{i}^{2}\right) + \left(1 - \gamma_{i}\right)\mathcal{U}\!\left(\tilde{\rho}_{i}^{k} \mid \rho_{i}^{\min}, \rho_{i}^{\max}\right) \qquad (7)$$

where $\tau_{i}^{2}$ is the variance of a good measurement, which can be computed geometrically by assuming a disparity variance of one pixel in the image plane [Pizzoli et al., 2014]. We refer to the original work [Vogiatzis and Hernández, 2011] and to [Forster et al., 2017] for more details.

3.1.4 Implementation Details

Our system utilizes multiple cameras, so consider a camera rig with $M$ cameras (see Fig. 9). We assume that the relative pose of each individual camera with respect to the body frame is known from extrinsic calibration. To generalize sparse image alignment to multiple cameras, we simply need to add an extra summation over the camera rig $\mathcal{C}$ in the cost function of Eq. (1):

$$\mathbf{T}_{kk-1}^{*} = \arg\min_{\mathbf{T}_{kk-1}} \sum_{C \in \mathcal{C}} \sum_{\mathbf{u} \in \bar{\mathcal{R}}_{k-1}^{C}} \frac{1}{2} \left\| r_{I_{\mathbf{u}}^{C}}\!\left(\mathbf{T}_{kk-1}\right) \right\|^{2} \qquad (8)$$
The same summation is necessary in the bundle adjustment step to sum the reprojection errors from all cameras. The remaining steps of feature alignment and mapping are independent of how many cameras are used, except that more images are available to update the depth filters. An initial map is computed during initialization using stereo matching.
We additionally apply motion priors within the SVO framework by assuming a constant velocity relative translation prior and a relative rotation prior from a gyroscope. We employ the motion prior by adding additional terms to the cost of the sparse image alignment step:
$$\mathbf{T}_{kk-1}^{*} = \arg\min_{\mathbf{T}_{kk-1}} \Bigg( \sum_{C \in \mathcal{C}} \sum_{\mathbf{u} \in \bar{\mathcal{R}}_{k-1}^{C}} \frac{1}{2} \left\| r_{I_{\mathbf{u}}^{C}}\!\left(\mathbf{T}_{kk-1}\right) \right\|^{2} + \frac{1}{2} \left\| \mathbf{p}_{kk-1} - \tilde{\mathbf{p}}_{kk-1} \right\|_{\Sigma_{p}}^{2} + \frac{1}{2} \left\| \log\!\left(\tilde{\mathbf{R}}_{kk-1}^{T}\,\mathbf{R}_{kk-1}\right)^{\vee} \right\|_{\Sigma_{R}}^{2} \Bigg) \qquad (9)$$

where the covariances $\Sigma_{p}$ and $\Sigma_{R}$ are set according to the uncertainty of the motion prior, the variables $(\mathbf{p}_{kk-1}, \mathbf{R}_{kk-1})$ are the current estimate of the relative position and orientation (expressed in body coordinates $B$), $(\tilde{\mathbf{p}}_{kk-1}, \tilde{\mathbf{R}}_{kk-1})$ are the priors, and the logarithm map $\log(\cdot)^{\vee}$ maps a rotation matrix to its rotation vector.
We apply the sparse image alignment algorithm in a coarse-to-fine scheme, half-sampling the image to create an image pyramid of five levels and using a small patch around each feature. The photometric cost is then optimized at the coarsest level until convergence before continuing at the finer levels to improve the precision.
In the mapping thread, we divide the image into cells of fixed size. For every keyframe, a new depth filter is initialized at the FAST corner [Rosten et al., 2010] with the highest score in each cell, unless there is already a 2D-to-3D correspondence present. In cells where no corner is found, we detect the pixel with the highest gradient magnitude and initialize an edge feature. This results in evenly distributed features in the image. To speed up the depth estimation, we only sample a short range along the epipolar line; in our case, the range corresponds to twice the standard deviation of the current depth estimate. We use a small, fixed-size pixel patch for the epipolar search. We refer the reader to the original paper [Forster et al., 2017] for further details about both the approach and its performance.
3.2 Sensor Fusion
We have multiple sensors on the platform, each providing partial information about the state of the robot. Moreover, the sensors provide output at different rates; for example, the stereo cameras and the downward pointing distance sensor run at different frequencies. We need to merge these pieces of partial information into a single consistent estimate of the full state of the robot. The typical method used for such sensor fusion tasks is some variant of the Kalman filter. The quadrotor is a nonlinear system due to its rotational degrees of freedom, which requires the use of either an Extended Kalman filter (EKF) or an Unscented Kalman filter (UKF). The UKF has the advantage of better handling the system nonlinearities with only a small increase in computation, so we chose the UKF for our system. Fig. 10 shows the inputs and outputs of the UKF module running on the robot. The state vector used in the UKF is

$$\mathbf{x} = \begin{bmatrix} \mathbf{p}^{T} & \dot{\mathbf{p}}^{T} & \phi & \theta & \psi & \mathbf{b}_{a}^{T} & \mathbf{b}_{g}^{T} \end{bmatrix}^{T}$$

where $\mathbf{p}$ is the world-frame position of the robot, $\dot{\mathbf{p}}$ is the world-frame velocity, $\phi$, $\theta$ and $\psi$ are the roll, pitch and yaw respectively, $\mathbf{b}_{a}$ is the accelerometer bias and $\mathbf{b}_{g}$ is the gyroscope bias. We use the ZYX convention for representing the rotations in terms of the Euler angles $\phi$, $\theta$ and $\psi$. The Euler angle representation was chosen for representing the orientation primarily because of its simplicity. The well-known problem of gimbal lock when using Euler angles is not an issue in this case since the desired and expected roll and pitch of the robot always remain well below the singularity.

The UKF consists of a prediction step which uses the IMU data as the input and multiple update steps, one for each of the other sensors. The update step is performed whenever the corresponding sensor measurement arrives. The prediction step is nonlinear since the accelerometer and gyroscope measurements are in the body frame while the position and velocity in the state are in the world frame, which requires the transformation of the measured quantities from body to world frame using the estimated orientation.
Given that the state at iteration $k$, $\mathbf{x}_{k}$ (dimension $n$), has mean $\hat{\mathbf{x}}_{k}$ and covariance $\mathbf{P}_{k}$, we augment it with the process noise $\mathbf{v}_{k}$ (dimension $n_{v}$) having mean $\bar{\mathbf{v}}$ and covariance $\mathbf{Q}$, creating the augmented state $\mathbf{x}_{k}^{a}$ and covariance matrix $\mathbf{P}_{k}^{a}$,

$$\mathbf{x}_{k}^{a} = \begin{bmatrix} \hat{\mathbf{x}}_{k} \\ \bar{\mathbf{v}} \end{bmatrix}, \qquad \mathbf{P}_{k}^{a} = \begin{bmatrix} \mathbf{P}_{k} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q} \end{bmatrix}$$
Then, we generate a set of sigma points by applying the Unscented transform [Julier et al., 1995] to the augmented state,

$$\mathcal{X}_{0,k}^{a} = \mathbf{x}_{k}^{a}, \qquad \mathcal{X}_{i,k}^{a} = \mathbf{x}_{k}^{a} + \left(\sqrt{(L+\lambda)\,\mathbf{P}_{k}^{a}}\right)_{i}, \qquad \mathcal{X}_{i+L,k}^{a} = \mathbf{x}_{k}^{a} - \left(\sqrt{(L+\lambda)\,\mathbf{P}_{k}^{a}}\right)_{i}, \qquad i = 1, \dots, L \qquad (10)$$

where $L = n + n_{v}$ is the dimension of the augmented state, $\left(\sqrt{\cdot}\right)_{i}$ denotes the $i$-th column of the matrix square root, and $\lambda$ is a scaling parameter [Wan and Merwe, 2000].
These sigma points are then propagated through the process model with the accelerometer and gyroscope measurements $\mathbf{u}_{k} = (\mathbf{a}_{m}, \boldsymbol{\omega}_{m})$ as input,

$$\mathcal{X}_{i,k+1|k}^{x} = f\!\left(\mathcal{X}_{i,k}^{x}, \mathbf{u}_{k}, \mathcal{X}_{i,k}^{v}\right)$$

where $\mathcal{X}^{x}$ is the state part of the augmented state while $\mathcal{X}^{v}$ is the process noise part. The process model, $f$, for our system integrates the IMU measurements over the time step $\Delta t$,

$$\begin{aligned} \mathbf{p}_{k+1} &= \mathbf{p}_{k} + \dot{\mathbf{p}}_{k}\,\Delta t \\ \dot{\mathbf{p}}_{k+1} &= \dot{\mathbf{p}}_{k} + \left( \mathbf{R}(\phi_{k}, \theta_{k}, \psi_{k})\left(\mathbf{a}_{m} - \mathbf{b}_{a,k} - \mathbf{n}_{a}\right) - g\,\mathbf{e}_{3} \right)\Delta t \\ \left(\phi_{k+1}, \theta_{k+1}, \psi_{k+1}\right) &= \left(\phi_{k}, \theta_{k}, \psi_{k}\right) + \mathbf{W}(\phi_{k}, \theta_{k})\left(\boldsymbol{\omega}_{m} - \mathbf{b}_{g,k} - \mathbf{n}_{g}\right)\Delta t \\ \mathbf{b}_{a,k+1} &= \mathbf{b}_{a,k} + \mathbf{n}_{b_{a}}\,\Delta t \\ \mathbf{b}_{g,k+1} &= \mathbf{b}_{g,k} + \mathbf{n}_{b_{g}}\,\Delta t \end{aligned}$$

where $\mathbf{R}(\phi, \theta, \psi)$ is the rotation matrix formed using the ZYX convention for the Euler angles, $\mathbf{W}(\phi, \theta)$ maps body angular velocity to Euler angle rates, and $\mathbf{n}_{a}$, $\mathbf{n}_{g}$, $\mathbf{n}_{b_{a}}$ and $\mathbf{n}_{b_{g}}$ are the individual process noise terms.
From the transformed set of sigma points, $\mathcal{X}_{i,k+1|k}^{x}$, we can calculate the predicted mean and covariance,

$$\hat{\mathbf{x}}_{k+1|k} = \sum_{i=0}^{2L} w_{i}^{m}\,\mathcal{X}_{i,k+1|k}^{x}, \qquad \mathbf{P}_{k+1|k} = \sum_{i=0}^{2L} w_{i}^{c}\left(\mathcal{X}_{i,k+1|k}^{x} - \hat{\mathbf{x}}_{k+1|k}\right)\left(\mathcal{X}_{i,k+1|k}^{x} - \hat{\mathbf{x}}_{k+1|k}\right)^{T}$$

where $w_{i}^{m}$ and $w_{i}^{c}$ are scalar weights [Wan and Merwe, 2000].
Whenever a new sensor measurement, $\mathbf{z}_{k+1}$, arrives, we run the update step of the filter. First we generate a new set of sigma points in the same way as done during the prediction step, (10), with the augmented state and covariance given by

$$\mathbf{x}_{k+1|k}^{a} = \begin{bmatrix} \hat{\mathbf{x}}_{k+1|k} \\ \bar{\mathbf{n}} \end{bmatrix}, \qquad \mathbf{P}_{k+1|k}^{a} = \begin{bmatrix} \mathbf{P}_{k+1|k} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{bmatrix}$$

where $\bar{\mathbf{n}}$ is the mean of the measurement noise and $\mathbf{R}$ is its covariance. The generated sigma points, $\mathcal{X}_{i}^{a} = \left(\mathcal{X}_{i}^{x}, \mathcal{X}_{i}^{n}\right)$, are then used to generate the predicted measurement using the measurement function $h$,

$$\mathcal{Z}_{i} = h\!\left(\mathcal{X}_{i}^{x}, \mathcal{X}_{i}^{n}\right), \qquad \hat{\mathbf{z}} = \sum_{i=0}^{2L} w_{i}^{m}\,\mathcal{Z}_{i}$$

And finally the state is updated as follows,

$$\begin{aligned} \mathbf{P}_{zz} &= \sum_{i=0}^{2L} w_{i}^{c}\left(\mathcal{Z}_{i} - \hat{\mathbf{z}}\right)\left(\mathcal{Z}_{i} - \hat{\mathbf{z}}\right)^{T}, \qquad \mathbf{P}_{xz} = \sum_{i=0}^{2L} w_{i}^{c}\left(\mathcal{X}_{i}^{x} - \hat{\mathbf{x}}_{k+1|k}\right)\left(\mathcal{Z}_{i} - \hat{\mathbf{z}}\right)^{T} \\ \mathbf{K} &= \mathbf{P}_{xz}\,\mathbf{P}_{zz}^{-1} \\ \hat{\mathbf{x}}_{k+1} &= \hat{\mathbf{x}}_{k+1|k} + \mathbf{K}\left(\mathbf{z}_{k+1} - \hat{\mathbf{z}}\right), \qquad \mathbf{P}_{k+1} = \mathbf{P}_{k+1|k} - \mathbf{K}\,\mathbf{P}_{zz}\,\mathbf{K}^{T} \end{aligned}$$
Note that for each sensor input to the UKF except the IMU, which is used for the prediction step, there is a separate measurement function, , and the full update step is performed, with the corresponding measurement function, when an input is received from any of those sensors.
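For reference, the snippet below implements one predict/update cycle of a basic UKF on a toy constant-velocity system. It uses the simpler additive-noise form (rather than the augmented state described above) and equal mean/covariance weights, so it is a simplified illustration of the machinery, not our flight code:

```python
import numpy as np

def sigma_points(x, P, lam):
    """Generate 2L+1 sigma points for mean x and covariance P."""
    L = x.size
    S = np.linalg.cholesky((L + lam) * P)
    return np.array([x] + [x + S[:, i] for i in range(L)] + [x - S[:, i] for i in range(L)])

def ukf_step(x, P, u, z, f, h, Q, R, lam=1.0):
    """One predict + update cycle of a UKF with additive noise (a simplification of the
    augmented-state form described in the text)."""
    L = x.size
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wm[0] = lam / (L + lam)
    wc = wm.copy()

    # Predict: propagate sigma points through the process model.
    X = np.array([f(s, u) for s in sigma_points(x, P, lam)])
    x_pred = wm @ X
    P_pred = Q + sum(w * np.outer(s - x_pred, s - x_pred) for w, s in zip(wc, X))

    # Update: transform re-drawn sigma points through the measurement model.
    X = sigma_points(x_pred, P_pred, lam)
    Z = np.array([h(s) for s in X])
    z_pred = wm @ Z
    Pzz = R + sum(w * np.outer(s - z_pred, s - z_pred) for w, s in zip(wc, Z))
    Pxz = sum(w * np.outer(sx - x_pred, sz - z_pred) for w, sx, sz in zip(wc, X, Z))
    K = Pxz @ np.linalg.inv(Pzz)
    return x_pred + K @ (z - z_pred), P_pred - K @ Pzz @ K.T

# Toy 1D constant-velocity example: state [position, velocity], position measurement.
dt = 0.1
f = lambda s, u: np.array([s[0] + s[1] * dt, s[1] + u * dt])   # u = acceleration input
h = lambda s: s[:1]
x, P = np.zeros(2), np.eye(2)
x, P = ukf_step(x, P, u=1.0, z=np.array([0.02]), f=f, h=h,
                Q=1e-3 * np.eye(2), R=1e-2 * np.eye(1))
print(x)
```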
The attitude filter running on the Pixhawk is a simple complementary filter which can take an external reference orientation as an input. This allows us to provide the estimate from the UKF to the Pixhawk in order to improve the orientation estimate on the Pixhawk. This is important for good control performance since the orientation controller running on the Pixhawk uses the Pixhawk’s estimate of the orientation while our control commands are calculated using the UKF estimates. Without an external reference being sent to the Pixhawk, the orientation estimates on the Pixhawk can be different from the UKF which would lead to an incorrect interpretation of the control commands.
3.3 Control
The controller used for the robot has the cascade structure shown in Figure 5, which has become standard for MAVs. In this structure, an inner loop controls the orientation and angular velocities of the robot while an outer loop controls the position and linear velocities. In our case, the inner loop runs at a high rate on the Pixhawk autopilot while the outer loop runs at a slightly slower rate on the Intel NUC computer.
At every time instance, the outer loop position controller receives a desired state, which consists of a desired position, velocity, acceleration and jerk, from the planner and using the estimated state from the UKF, computes a desired force, orientation and angular velocities which are sent to the orientation controller. The inner loop orientation controller receives these and computes the thrust and moments required to achieve the desired force and orientation. These are then converted into individual motor speeds that are sent to the respective motor controllers.
The controller formulation we use is based on the controller developed in [Lee et al., 2010] with some simplifications. The thrust command of the position controller is calculated as

$$f = \mathbf{f}_{des} \cdot \mathbf{R}\,\mathbf{e}_{3}, \qquad \mathbf{f}_{des} = -k_{p}\,\mathbf{e}_{p} - k_{v}\,\mathbf{e}_{v} + m g\,\mathbf{e}_{3} + m\,\ddot{\mathbf{r}}_{des} \qquad (11)$$

where $\mathbf{e}_{p}$ and $\mathbf{e}_{v}$ are the position and velocity errors, $\mathbf{e}_{3} = [0, 0, 1]^{T}$, and $\mathbf{R}$ is the rotation matrix which converts vectors from the body frame to the world frame, calculated using the estimated roll, pitch and yaw. The desired attitude is calculated as
$$\mathbf{b}_{3,des} = \frac{\mathbf{f}_{des}}{\left\| \mathbf{f}_{des} \right\|} \qquad (12)$$

$$\mathbf{b}_{2,des} = \frac{\mathbf{b}_{3,des} \times \mathbf{b}_{1,yaw}}{\left\| \mathbf{b}_{3,des} \times \mathbf{b}_{1,yaw} \right\|}, \qquad \mathbf{b}_{1,des} = \mathbf{b}_{2,des} \times \mathbf{b}_{3,des}, \qquad \mathbf{R}_{des} = \begin{bmatrix} \mathbf{b}_{1,des} & \mathbf{b}_{2,des} & \mathbf{b}_{3,des} \end{bmatrix} \qquad (13)$$

where $\mathbf{b}_{1,yaw} = [\cos\psi_{des},\, \sin\psi_{des},\, 0]^{T}$. Note that here we have to define $\mathbf{b}_{1,yaw}$ based on the yaw instead of defining $\mathbf{b}_{2,yaw}$ as done in [Mellinger and Kumar, 2011] due to the different Euler angle convention: we use the ZYX convention while they used ZXY.
The thrust and attitude commands, from (11), (12) and (13), are then sent to the Pixhawk autopilot through mavros (https://github.com/mavlink/mavros). The attitude controller running on the Pixhawk takes these commands and converts them to commanded motor speeds. First, using $\mathbf{R}_{des}$ and the estimate of the current orientation, $\mathbf{R}$, we calculate the desired moments as

$$\mathbf{e}_{R} = \frac{1}{2}\left(\mathbf{R}_{des}^{T}\,\mathbf{R} - \mathbf{R}^{T}\,\mathbf{R}_{des}\right)^{\vee}, \qquad \mathbf{M} = -k_{R}\,\mathbf{e}_{R} - k_{\Omega}\,\mathbf{e}_{\Omega}$$

where $(\cdot)^{\vee}$ denotes the vee map from a skew-symmetric matrix to a vector and $\mathbf{e}_{\Omega}$ is the angular velocity error. Then, from the desired thrust and moments, we can calculate the thrust required from each propeller, which allows us to compute the desired motor speeds as shown in [Michael et al., 2010].
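The outer-loop computation can be summarized by the short sketch below, which mirrors the structure of Eqs. (11)-(13); the gains, the mass and the hover example are illustrative placeholders, not the values used on the robot:

```python
import numpy as np

def position_controller(p, v, R, p_des, v_des, a_des, yaw_des,
                        kp=6.0, kv=4.0, m=3.0, g=9.81):
    """Outer-loop position controller sketch in the spirit of Eqs. (11)-(13).
    Gains and mass are hypothetical placeholders. Returns the scalar thrust along the
    body z axis and the desired attitude R_des."""
    e_p = p - p_des                      # position error
    e_v = v - v_des                      # velocity error
    e3 = np.array([0.0, 0.0, 1.0])
    f_des = -kp * e_p - kv * e_v + m * g * e3 + m * a_des
    thrust = f_des @ (R @ e3)            # project desired force onto current body z axis

    b3 = f_des / np.linalg.norm(f_des)   # desired body z axis
    b1_yaw = np.array([np.cos(yaw_des), np.sin(yaw_des), 0.0])
    b2 = np.cross(b3, b1_yaw)
    b2 /= np.linalg.norm(b2)
    b1 = np.cross(b2, b3)
    R_des = np.column_stack((b1, b2, b3))
    return thrust, R_des

# Hover 1 m above the current position with zero yaw:
thrust, R_des = position_controller(p=np.zeros(3), v=np.zeros(3), R=np.eye(3),
                                    p_des=np.array([0.0, 0.0, 1.0]), v_des=np.zeros(3),
                                    a_des=np.zeros(3), yaw_des=0.0)
print(thrust, R_des)
```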
4 Mapping and Planning
Our navigation system consists of five parts as shown in Fig. 11. In this section, we discuss the mapping, planner and trajectory generation threads.

4.1 Mapping
We have mounted a lidar on a servo such that we can generate a 3D voxel map by rotating the laser. Updating the map and planning using a 3D global map are both computationally expensive and, in addition, with noise and estimation drift, the global map can become erroneous. Hence we utilize a local mapping technique that generates a point cloud around the current robot location (Fig. 12). This local point cloud has a fixed size and fine resolution and is used to build a 3D occupancy voxel map centered at the current robot location. Since the local map only records the recent sensor measurements with respect to the current robot location, the accumulated error in mapping is small. We also generate a coarse 2D map in the global frame in order to solve the dead-end problem caused by local planning. We call this map the "global information map" since it contains two pieces of information: one is the known and unknown space, so that we know which parts have been explored; the other is the location of walls, detected from the local map, that the robot cannot fly over.
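A minimal sketch of such a fixed-size, robot-centric map is shown below. The grid dimensions, resolution and the cell-marking logic are arbitrary example choices and omit the actual sensor update used on the robot:

```python
import numpy as np

class LocalMap(object):
    """Sketch of a fixed-size, robot-centric occupancy grid. The grid is shifted as the
    robot moves so that it always covers a window around the current position; cells
    that scroll in are reset to unknown. Parameters are illustrative only."""
    UNKNOWN, FREE, OCCUPIED = -1, 0, 1

    def __init__(self, size=(128, 128, 32), resolution=0.25):
        self.res = resolution
        self.grid = np.full(size, self.UNKNOWN, dtype=np.int8)
        self.origin = np.zeros(3)        # world position of cell (0, 0, 0)

    def recenter(self, robot_pos):
        """Shift the grid so the robot stays near the center of the window."""
        desired_origin = robot_pos - 0.5 * self.res * np.array(self.grid.shape)
        shift = np.round((desired_origin - self.origin) / self.res).astype(int)
        for axis, s in enumerate(shift):
            if s != 0:
                self.grid = np.roll(self.grid, -s, axis=axis)
                # Cells that wrapped around correspond to newly revealed space: reset them.
                sl = [slice(None)] * 3
                sl[axis] = slice(-s, None) if s > 0 else slice(None, -s)
                self.grid[tuple(sl)] = self.UNKNOWN
        self.origin += shift * self.res

    def mark(self, point, value):
        """Mark the cell containing a world-frame point, if it lies inside the window."""
        idx = np.floor((point - self.origin) / self.res).astype(int)
        if np.all(idx >= 0) and np.all(idx < self.grid.shape):
            self.grid[tuple(idx)] = value

m = LocalMap()
m.recenter(np.array([1.0, 0.0, 0.5]))
m.mark(np.array([2.0, 0.0, 0.5]), LocalMap.OCCUPIED)
```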
4.2 Planner
We use A* to plan a path in a hybrid graph that links the voxels in both the local 3D map and the global information map (the result is shown in Fig. 12(c)). In this way we can efficiently derive a path in the local map that is globally consistent. Fig. 13 shows an example of using this method to solve the dead-end corridor problem.
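The graph search itself is standard; the sketch below shows a generic A* implementation over an abstract graph, whereas our planner applies the same search to the hybrid graph linking the local 3D map and the global information map:

```python
import heapq

def astar(start, goal, is_free, neighbors):
    """Generic A* search sketch over an abstract graph (nodes are hashable tuples)."""
    def h(a, b):                       # admissible heuristic: Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    open_set = [(h(start, goal), 0.0, start)]
    came_from, g_cost = {}, {start: 0.0}
    while open_set:
        _, g, cur = heapq.heappop(open_set)
        if cur == goal:
            path = [cur]
            while cur in came_from:    # walk the parent pointers back to the start
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        for nxt, step in neighbors(cur):
            if not is_free(nxt):
                continue
            new_g = g + step
            if new_g < g_cost.get(nxt, float('inf')):
                g_cost[nxt] = new_g
                came_from[nxt] = cur
                heapq.heappush(open_set, (new_g + h(nxt, goal), new_g, nxt))
    return None

# 2D 4-connected grid example with one blocked cell between start and goal.
blocked = {(1, 0)}
nbrs = lambda c: [((c[0] + dx, c[1] + dy), 1.0)
                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
print(astar((0, 0), (2, 0), lambda c: c not in blocked, nbrs))
```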
In the example of Fig. 13, at the first planning epoch the end of the corridor cannot be seen by the limited-range sensor and the planned path leads the quadrotor forward ((a)-(b)). At a later planning epoch, with a similar local map but a different global map that now contains the dead-end geometry, planning to the same goal results in a totally different path ((c)-(d)).

4.3 Trajectory Generation
In this subsection, we introduce the trajectory generation method given the map and a prior path from the planner. The trajectory generation process is shown in Fig. 14. Through regional inflation, a safe corridor is found in the local map that excludes all obstacle points. As the intermediate waypoints of the path can be close to the obstacles, we shift them towards the center of the safe corridor. The new path and the safe corridor are then used to generate the trajectory.

4.3.1 Regional Inflation by Line Segments
Inspired by IRIS in [Deits and Tedrake, 2015a], we developed an algorithm to dilate a path in free space using ellipsoids. For each line segment in the path, we generate a convex polyhedron that includes the whole segment but excludes any occupied voxel in the local map, through two steps.

The ellipsoid is described as $\mathcal{E} = \{ \mathbf{E}\,\tilde{\mathbf{x}} + \mathbf{d} \mid \|\tilde{\mathbf{x}}\| \le 1 \}$, which is the affine projection of a unit sphere. A polyhedron is the intersection of half-planes: $\mathcal{P} = \{ \mathbf{x} \mid \mathbf{a}_{j}^{T}\mathbf{x} \le b_{j},\; j = 1, \dots, m \}$. Each half-plane is found as the tangent plane of the inflated ellipsoid at its contact with the closest obstacle point $\mathbf{p}^{*}$, as shown in equation (14):

$$\mathbf{a}_{j} = \mathbf{E}^{-T}\mathbf{E}^{-1}\left(\mathbf{p}^{*} - \mathbf{d}\right), \qquad b_{j} = \mathbf{a}_{j}^{T}\,\mathbf{p}^{*} \qquad (14)$$
4.3.2 Path Modification
The original path from the planner can be close to the obstacles. Although the trajectory generation does not require the final trajectory to pass through the intermediate waypoints, the path does affect the route of the trajectory. The path modification step aims to move the original path away from the obstacles by keeping the intermediate waypoints in the middle of the safe corridor. We use a bisector plane that passes through each intermediate waypoint; this plane intersects both polyhedra that are connected through that waypoint. The waypoint is then moved to the centroid of the intersection polygon (Fig. 15(c)).
4.3.3 Trajectory Optimization
We formulate the trajectory generation as an optimal control problem, as shown in equation (15). Compared to the standard formulation of this optimal control problem, we add a second term to the cost function, which is the square of the distance between the trajectory and the line segment of the modified path. This distance cost is weighted by a factor $\epsilon$. Fig. 16 shows the effect of changing this weighting factor on the trajectory. Thus, we can control the shape of a trajectory to keep it close to the modified path and away from obstacles. This process increases the safety of the trajectory such that the robot will not get too close to obstacles.

$$\begin{aligned} \min_{\Phi(t)} \quad & \int_{0}^{T} \left\| \Phi^{(3)}(t) \right\|^{2} dt \; + \; \epsilon \int_{0}^{T} d^{2}\!\left(\Phi(t), L(t)\right) dt \\ \text{s.t.} \quad & \Phi(t) \in \text{safe corridor}, \quad \left\|\dot{\Phi}(t)\right\| \le v_{\max}, \quad \left\|\ddot{\Phi}(t)\right\| \le a_{\max}, \quad \left\|\Phi^{(3)}(t)\right\| \le j_{\max} \\ & \Phi(0) = \mathbf{p}_{start}, \quad \Phi(T) = \mathbf{p}_{goal} \end{aligned} \qquad (15)$$

where $\Phi(t)$ is the trajectory, $L(t)$ is the corresponding point on the modified path, and $d(\cdot,\cdot)$ denotes Euclidean distance.
4.3.4 Continuous Optimization
To be able to compute Equation (15) in real time, we have chosen to represent $\Phi$ as a fixed-order polynomial spline. Each spline segment $i$ takes time $\Delta_{i}$ such that $\sum_{i} \Delta_{i} = T$. For our experiments, we have found that a moderate polynomial order provides good performance for trajectory optimization and that increasing the order produces similar trajectories, but with longer computational time.
For this section, we will assume that our trajectory has $D$ dimensions, $M$ segments, and is represented by an $N$-th order polynomial. We also define the indices $d \in \{1, \dots, D\}$, $i \in \{1, \dots, M\}$, and $j \in \{0, \dots, N\}$. Therefore we can write the $d$-th dimension of segment $\Phi_{i}$ with respect to coefficients $c_{ijd}$ and basis functions $\phi_{j}$:

$$\Phi_{id}(t) = \sum_{j=0}^{N} c_{ijd}\,\phi_{j}\!\left(\frac{t - t_{i-1}}{\Delta_{i}}\right), \qquad t \in [t_{i-1}, t_{i}] \qquad (16)$$
We have chosen to be shifted Legendre polynomials which are represented with a unit time scaling:
(17) |
For whichever makes
(18) |
This leaves basis polynomials undefined, which we just use:
(19) |
This definition results in a derivative cost which is quadratic in the stacked coefficient vector $\mathbf{c}$:

$$J_{der} = \mathbf{c}^{T}\,\mathbf{Q}\,\mathbf{c} \qquad (20)$$
The other constraints are concerned with constraining $\Phi$ at specific times. We note that the evaluation of any derivative of $\Phi$ is linear with respect to the coefficients:

$$\frac{d^{k}\Phi_{id}}{dt^{k}}(t) = \sum_{j=0}^{N} c_{ijd}\,\frac{d^{k}}{dt^{k}}\,\phi_{j}\!\left(\frac{t - t_{i-1}}{\Delta_{i}}\right) \qquad (21)$$
To ensure continuity of our spline up to the $K$-th derivative, we add the following constraints for $k \in \{0, \dots, K\}$, $i \in \{1, \dots, M-1\}$, and $d \in \{1, \dots, D\}$:

$$\frac{d^{k}\Phi_{id}}{dt^{k}}(t_{i}) = \frac{d^{k}\Phi_{(i+1)d}}{dt^{k}}(t_{i}) \qquad (22)$$
The start and end constraints fix the derivatives of the trajectory at the initial and final states:

$$\frac{d^{k}\Phi_{1d}}{dt^{k}}(0) = s_{0d}^{(k)}, \qquad \frac{d^{k}\Phi_{Md}}{dt^{k}}(T) = s_{fd}^{(k)} \qquad (23)$$
For the inequality constraints and the centering part of the cost functional, we use the sub-sampling method proposed by [Mellinger and Kumar, 2011]. Along each segment, we select $n_{s}$ points at which to sample the trajectory. In practice we found that a modest number of points worked well, and they can be sampled uniformly within the segment or by using Chebyshev sampling. The $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$ come from the polyhedra found in Section 4.3.1 for the position ($k = 0$) and from the velocity, acceleration and jerk bounds for the higher derivatives [Boyd and Vandenberghe, 2004]:

$$\mathbf{A}_{i}\,\frac{d^{k}\Phi_{i}}{dt^{k}}(t_{s}) \le \mathbf{b}_{i}, \qquad s = 1, \dots, n_{s} \qquad (24)$$
To compute the centering part of the cost, we could use a Gaussian quadrature, but we found that rectangular integration worked well [Press, 1992]:

$$\int_{t_{i-1}}^{t_{i}} d^{2}\!\left(\Phi_{i}(t), L_{i}\right) dt \;\approx\; \frac{\Delta_{i}}{n_{s}} \sum_{s=1}^{n_{s}} d^{2}\!\left(\Phi_{i}(t_{s}), L_{i}\right) \qquad (25)$$
This results in the following QP in the coefficients $\mathbf{c}$:

$$\min_{\mathbf{c}} \;\; \mathbf{c}^{T}\,\mathbf{Q}\,\mathbf{c} + \mathbf{q}^{T}\mathbf{c} \qquad \text{s.t.} \quad \mathbf{A}_{eq}\,\mathbf{c} = \mathbf{b}_{eq}, \quad \mathbf{A}_{in}\,\mathbf{c} \le \mathbf{b}_{in} \qquad (26)$$

where the equality constraints collect (22) and (23), the inequality constraints collect (24), and the linear term $\mathbf{q}$ arises from the centering cost (25).
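To illustrate the structure of such a QP, the sketch below solves a tiny instance with only equality constraints by forming the KKT system directly. The corridor and dynamics inequality constraints of (24) would require a general-purpose QP solver, and the toy basis here is the plain monomial basis rather than the shifted Legendre basis used above:

```python
import numpy as np

def solve_eq_qp(Q, q, A, b):
    """Solve  min 0.5 c^T Q c + q^T c  s.t.  A c = b  by solving the KKT system.
    Equality constraints only; inequality constraints need a full QP solver."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n]                      # first n entries are the polynomial coefficients

# Toy instance: one 1D quartic segment p(t) = sum_j c_j t^j on t in [0, 1], minimizing the
# integral of squared acceleration with p(0)=0, v(0)=0, p(1)=1, v(1)=0.
Q = np.zeros((5, 5))
Q[2:, 2:] = [[4.0, 6.0, 8.0],           # Hessian of int_0^1 (2c2 + 6c3 t + 12c4 t^2)^2 dt
             [6.0, 12.0, 18.0],
             [8.0, 18.0, 28.8]]
q = np.zeros(5)
A = np.array([[1.0, 0, 0, 0, 0],        # p(0) = 0
              [0.0, 1, 0, 0, 0],        # v(0) = 0
              [1.0, 1, 1, 1, 1],        # p(1) = 1
              [0.0, 1, 2, 3, 4]])       # v(1) = 0
b = np.array([0.0, 0.0, 1.0, 0.0])
print(solve_eq_qp(Q, q, A, b))          # ~[0, 0, 3, -2, 0]: the minimum-acceleration cubic
```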
Choosing the segment times $\Delta_{i}$ is critical to the feasibility of Equation (26) and the quality of the resultant trajectory. To choose the $\Delta_{i}$ we use the times we get by fitting a trapezoidal velocity profile through the segments, so the time per segment is based on the length of each line segment in the path through the environment.
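A possible implementation of this time allocation is sketched below; the velocity and acceleration limits are illustrative placeholders rather than the values used on the robot:

```python
import numpy as np

def trapezoid_times(segment_lengths, v_max=5.0, a_max=3.0):
    """Allocate a duration to each path segment by fitting a trapezoidal velocity profile
    over the whole path (start and end at rest). v_max and a_max are example limits."""
    lengths = np.asarray(segment_lengths, dtype=float)
    total = lengths.sum()
    v_peak = min(v_max, np.sqrt(a_max * total))     # falls back to a triangular profile
    d_ramp = v_peak ** 2 / (2.0 * a_max)            # distance covered while accelerating

    def time_at(d):
        """Elapsed time when the profile has covered distance d from the start."""
        if d <= d_ramp:                             # accelerating
            return np.sqrt(2.0 * d / a_max)
        if d >= total - d_ramp:                     # decelerating towards the goal
            t_cruise_end = v_peak / a_max + max(total - 2.0 * d_ramp, 0.0) / v_peak
            return t_cruise_end + v_peak / a_max - np.sqrt(2.0 * max(total - d, 0.0) / a_max)
        return v_peak / a_max + (d - d_ramp) / v_peak   # cruising at v_peak

    cum = np.concatenate([[0.0], np.cumsum(lengths)])
    return np.diff([time_at(d) for d in cum])       # per-segment durations Delta_i

print(trapezoid_times([2.0, 6.0, 2.0]))
```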
5 Experimental Results
5.1 Estimation benchmarking
The main task for the robot is to fly long distances to a goal point, so the estimation accuracy is very important: the drift in the estimator should be low enough that the robot reaches the desired goal. In order to test the accuracy and drift of the estimator, we flew the robot in the motion capture space in our lab. The robot was flown manually along an aggressive trajectory with high speeds and accelerations. The plots of the estimated and ground-truth position and velocity are shown in Figure 17. As can be seen from the figure, the final position drift at the end of the flight is a small fraction of the distance traveled. Note that there is almost no drift in the Z-axis due to the use of the downward pointing distance sensor, which gives us an accurate height estimate.
The SVO framework was deployed on our MAV system for visual odometry using a forward-facing stereo camera configuration and onboard computation. As a demonstration of the accuracy of the motion estimation, some high-speed maneuvers were flown manually in a warehouse environment. The MAV accelerated aggressively along a straight aisle in the warehouse, braked aggressively to a stop, and then returned to the starting location at a moderate speed. Figure 18 shows several onboard camera images marked up with the features that SVO is tracking, as well as the sparse map of 3D points in the environment that were mapped during this trajectory. During this trial the MAV reached a high maximum speed, as shown in Fig. 19. Even with such an aggressive flight, SVO only incurs a small amount of position drift over the full trajectory.
5.2 Real World Tests
The quadrotor navigation system described in this paper has been tested extensively in the lab environment as well as in multiple real-world environments. The system has been used on our entry for the first test of the DARPA Fast Lightweight Autonomy (FLA) program and was able to successfully navigate multiple obstacle courses that were set up. The rules of the FLA program do not allow any human interaction after the robot is airborne, so the runs described in this section were fully autonomous.
The test environment was constructed so as to simulate the inside of a warehouse. There were two aisles separated by scaffolding of around height with tarps on the back of the scaffolding and boxes placed on the shelves. The total length of the test course was around while the width of each of the aisles, in between the scaffoldings, was . Different types of obstacles such as a scaffolding tower or scissor lifts were placed along the aisles in order to test the obstacle avoidance performance of the robot. The minimum clearance between the obstacle in the aisle and the scaffolding on the side was set to be . As a reference, the rotor tip-to-tip diameter of the platform is . An example of the obstacles along the aisle is shown in Figure 20.
Different types of obstacle courses were set up using the aisles and the obstacles in order to challenge the robot. The simplest task was to just go straight down an empty aisle, while the most complicated ones involved changing aisles due to the first aisle being blocked in the middle. The only prior information available for each task was the type of obstacles to expect along the course, but the actual layout of the test course was unknown. The goal position was specified as a bearing and a range from the starting position at the start of each run. A particular task was deemed complete only when we were able to complete three successful runs of the task with different obstacle course layouts, thus ensuring that our system can work robustly. In the following, we describe some of the specific tasks and also show results of our runs through them.
5.2.1 Slalom
In the slalom task, the obstacles in an aisle were arranged in a manner such that the robot was forced to move in a zigzag pattern along the aisle, going to the right of the first obstacle, then to the left of the second one, and so on. Figure 21 shows the result of one of our runs. Since there was no ground truth position data available, the only way to judge the performance of the system is to compare the map created by the robot with a map of the real obstacle course. From the figure, we can see that the projected map (in grey) matches the actual obstacle course layout (in black), showing the accuracy of our estimation and control algorithms.

5.2.2 Aisle change with diagonal transition
In this task, the robot was required to change from the first aisle to the second one since the first aisle was blocked in the middle. The opening between the first and second aisles was constructed such that the robot could move diagonally along a line from the first aisle into the second aisle. Figure 22 shows one of our runs for this task. The robot successfully completes the transition and starts moving along the second aisle. Note that the goal was still in line with the first aisle, so the robot is always looking to move towards the left in order to get closer to the goal. This causes it to remain in the left part of the second aisle as is observed in the figure. After crossing the second aisle, the robot moves back to the left to reach the goal. Again we can see that the projected map (in grey) matches the actual course layout (in black).

5.2.3 Aisle change with sideways transition
This task was a more challenging variation of the previous one. Here the aisle change required the robot to move sideways (see Figure 23). We were able to reach the goal, but our system did not perform very well for this task. As can be seen from the figure, the state estimate had small jumps and drifted during the transition between the aisles. There is some position drift, but the main issue is the drift in yaw. Since the distance to the goal is large, even small drifts in yaw correspond to large position errors when the robot reaches the goal. The main reason for this drift was that, when moving sideways in front of the obstacle during the transition, the vision system lost all the tracked features in the image and, as it entered the second aisle, picked up new features which were far from the camera since it was looking along the aisle. Since the new features were far from the camera, they could not be triangulated accurately and hence caused bad estimates from the vision system. During this phase, there were a number of jumps in the output of the vision system, which led to drifts in the state estimate. As the robot started moving forward after the transition, the vision system was able to triangulate more features along the corridor and get good estimates again. This issue does not occur for the diagonal transition case since the robot is able to see some part of the second aisle while moving diagonally and hence already has well triangulated features when it completes the transition into the second aisle. One way to help with this issue would be to make the robot orient itself such that it is always facing the direction along which it is moving.

5.3 High speed flight
In order to test the high speed capability of the system, we performed a test run in an aisle with no obstacles. The goal provided to the robot was to fly straight for a long distance. We were able to fly at high speeds and reach the desired goal position. A plot of the desired and estimated position and velocity is shown in Figure 24, which shows that the performance of our controller is good enough to track such aggressive trajectories. The initial section of the plots is the autonomous takeoff, after which the forward trajectory begins. There was no source of ground truth during the test, but based on the expected location of the goal, the net drift in the position estimates was small.
6 Discussion and Conclusion
In this work, we developed a system that allows a quadrotor to navigate autonomously in GPS-denied and cluttered environments. Our navigation system consists of a set of modules that work together in order to allow the robot to go from a starting position to a specified goal location while avoiding obstacles on the way. After developing our system, we found the following points especially important in order to successfully build such a system:
-
Modular architecture: During our development process, each of the modules was developed separately. This was made possible by defining proper interfaces between the modules and using message passing to communicate among them. We used ROS as the framework for all the software running on the robot since it was designed to solve this exact problem. This separation of the modules allowed most of the planner development to happen in a simulator while the estimation and control modules were being developed. This accelerated the development since different modules could be implemented and tested in parallel.
-
Sensor selection: The choice of sensors used for estimation and mapping plays an important role in determining the robustness of the system. As shown in Table 2, the different camera configurations for visual odometry have various advantages and disadvantages. We selected a stereo configuration for our system since it provides increased robustness over a monocular setup (which is gaining popularity among the research community due to its minimalistic nature) while still allowing simpler algorithms compared to multi-camera systems. The use of a dedicated height sensor makes it possible to maintain altitude even when there is drift in our visual odometry, allowing the robot to fly safely without hitting the ground or going too high. However, the downward pointing height sensor exhibits jumps in the measurement when the robot flies over obstacles, and this has to be properly handled in the sensor fusion module. For mapping, instead of using a fixed lidar, we mounted the lidar on a servo in order to sweep it up and down, allowing us to create 3D maps and navigate in 3D environments with obstacles above and below the robot. With a fixed lidar we would not have been able to safely avoid all the obstacles that we encountered during the tests.
-
Local map for planning: Using a local map for planning instead of a global map was a crucial decision in the design of our planner. The problem with creating a global map is that we need to explicitly maintain global consistency by making use of loop closures to eliminate drift. By comparison, the local map approach helps the planner tolerate drift in the state estimation, since the drift is small over the short period of time in which the local map is constructed. This can be seen clearly in Figure 23, where there is a large drift in the yaw but the robot is still able to reach the goal. This approach also reduces the computational complexity and allows us to run the planner at a higher rate. Faster replanning reduces the latency between an obstacle being seen by the mapping system and the robot reacting to it, thus improving the robustness of the system.
In addition to these positive points, we learned some lessons during the tests in the warehouse environment.
-
Drift in the visual odometry: The tests in the warehouse environment involved flying long trajectories while constantly moving forward. Since the floor of the building was smooth and did not have texture, very few image features could be detected on it. This led to most of our image features coming from the obstacles to the side and front of the robot. We even picked up edgelet features from structures on the ceiling of the building. Thus a large part of the image features were at a large distance from the robot. In order to get good depth estimates of these far-away features, either the stereo baseline needs to be large or there needs to be sufficient parallax between the feature observations due to the motion between frames. We were limited to a short stereo baseline due to the size of the robot. When moving along the long aisles, the image features were mainly in front of the robot, which sometimes led to insufficient parallax to get good depth estimates for the features. Due to the poor depth estimates for some of the features, the visual odometry was not able to detect the correct scale of the motion between frames, which led to drift in the estimates. This caused a failure to reach the goal in some cases. One solution to this, which we are already looking into, is a more tightly coupled visual odometry system where the accelerometer measurements are also used in order to provide another source of scale for the visual odometry.
-
Local map size: One factor that prevents us from reaching higher speeds is the size of the map used for planning. Since we want to generate dynamically feasible trajectories for the robot, we have to take into account the maximum acceleration that the robot can safely achieve. Also, in order to guarantee safety, we have to plan trajectories such that the robot comes to a halt at the end of the known map, since there can be undiscovered obstacles just outside the map. Thus, the combination of map size and maximum acceleration puts a limit on the maximum speed that the robot can reach. The main factor limiting our local map size is the time required to plan in that map. The majority of the time in each planning step is taken by the A* algorithm, which is used to find a path through the hybrid graph (as described in Section 4). In order to reduce this time, we are looking into using better heuristics for A* and other techniques such as Jump Point Search [Harabor and Grastien, 2011], which can significantly speed up the graph search.
In conclusion, we have presented a solution that consists of all the modules that are required for a robot to autonomously navigate in an unknown environment. The system has been designed such that all the sensing and computation occur onboard the robot. Once the robot has been launched, there is no human interaction necessary for the robot to navigate to the goal.
The system has been thoroughly tested in the lab as well as in the warehouse environment that was set up as part of the DARPA FLA program. Our robot was able to successfully navigate the various obstacle courses that were specifically designed to challenge the navigation system. The only input from the operator for each run was the goal position relative to the starting position. In fact, during some of the runs, we even lost the communication link between the base station (which was only used for monitoring purposes) and the robot, due to the long distance and the large scaffolding structures in between, but the robot kept on going and successfully completed the task.
The final goal is to be able to fly at even higher speeds through cluttered environments, and we believe that this will require more work in all the individual modules that make up the system. In estimation, we need to reduce the drift that the visual odometry system experiences when flying fast while following long trajectories. In control, we need to incorporate aerodynamic effects such as drag, which become increasingly important when flying fast. In the mapping part, the nodding lidar solution needs to be replaced by one which provides a denser representation of the environment in order to detect small obstacles reliably. And finally, the planning subsystem needs to be sped up in order to allow us to use a larger map for planning and also to allow faster replanning, making the system more robust. As these developments are made, we will be able to incorporate them into our system thanks to the modular architecture, thus providing a strong foundation for future research.
Acknowledgments
We gratefully acknowledge support from DARPA grants HR001151626/HR0011516850.
References
- [Achtelik et al., 2009] Achtelik, M., Bachrach, A., He, R., Prentice, S., and Roy, N. (2009). Stereo vision and laser odometry for autonomous helicopters in gps-denied indoor environments. volume 7332, pages 733219–733219–10.
- [Augugliaro et al., 2014] Augugliaro, F., Lupashin, S., Hamer, M., Male, C., Hehn, M., Mueller, M. W., Willmann, J., Gramazio, F., Kohler, M., and D’Andrea, R. (2014). The Flight Assembled Architecture installation: Cooperative construction with flying machines. IEEE Control Systems Magazine, 34(4):46–64.
- [Bachrach et al., 2009] Bachrach, A., He, R., and Roy, N. (2009). Autonomous flight in unknown indoor environments. International Journal of Micro Air Vehicles, 1(4):217–228.
- [Bachrach et al., 2011] Bachrach, A., Prentice, S., He, R., and Roy, N. (2011). RANGE – Robust Autonomous Navigation in GPS-Denied Environments. Journal of Field Robotics, 28(5):644–666.
- [Bellingham et al., 2002] Bellingham, J., Richards, A., and How, J. P. (2002). Receding horizon control of autonomous aerial vehicles. In Proceedings of the 2002 American Control Conference (IEEE Cat. No.CH37301), volume 5, pages 3741–3746 vol.5.
- [Bloesch et al., 2015] Bloesch, M., Omari, S., Hutter, M., and Siegwart, R. (2015). Robust visual inertial odometry using a direct ekf-based approach. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 298–304.
- [Blösch et al., 2010] Blösch, M., Weiss, S., Scaramuzza, D., and Siegwart, R. (2010). Vision based mav navigation in unknown and unstructured environments. In 2010 IEEE International Conference on Robotics and Automation, pages 21–28.
- [Bouabdallah et al., 2004] Bouabdallah, S., Murrieri, P., and Siegwart, R. (2004). Design and control of an indoor micro quadrotor. In Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004 IEEE International Conference on, volume 5, pages 4393–4398 Vol.5.
- [Bouabdallah and Siegwart, 2005] Bouabdallah, S. and Siegwart, R. (2005). Backstepping and sliding-mode techniques applied to an indoor micro quadrotor. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 2247–2252.
- [Bouabdallah and Siegwart, 2007] Bouabdallah, S. and Siegwart, R. (2007). Full control of a quadrotor. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 153–158.
- [Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA.
- [Civera et al., 2008] Civera, J., Davison, A., and Montiel, J. (2008). Inverse Depth Parametrization for Monocular SLAM. IEEE Trans. Robotics, 24(5).
- [Deits and Tedrake, 2015a] Deits, R. and Tedrake, R. (2015a).