
Large-scale Autonomous Flight with Real-time Semantic SLAM under Dense Forest Canopy

In this letter, we propose an integrated autonomous flight and semantic SLAM system that can perform long-range missions and real-time semantic mapping in highly cluttered, unstructured, and GPS-denied under-canopy environments. First, tree trunks and ground planes are detected from LIDAR scans. We use a neural network and an instance extraction algorithm to enable semantic segmentation in real time onboard the UAV. Second, detected tree trunk instances are modeled as cylinders and associated across the whole LIDAR sequence. This semantic data association constrains both the robot poses and the trunk landmark models. The output of semantic SLAM is used in state estimation, planning, and control algorithms in real time. The global planner relies on a sparse map to plan the shortest path to the global goal, and the local trajectory planner uses a small but finely discretized robot-centric map to plan a dynamically feasible and collision-free trajectory to the local goal. Both the global path and local trajectory lead to drift-corrected goals, thus helping the UAV execute its mission accurately and safely.





I Introduction

Figure 1: Falcon 4 UAV platform. Our Falcon 4 platform used for this work is the successor of our Falcon 450 platform [mohta2018experiments] and is equipped with a 3D LIDAR, an OVC 3 [quigley2019open] with a hardware-synchronized IMU and stereo cameras, an Intel NUC 10 onboard computer, and a Pixhawk 4 PX4 flight controller. The platform has a total weight of 4.2 kg. It achieves a 30-minute flight time thanks to a 17,000 mAh Li-ion battery with high specific energy and an energy-efficient quadcopter design.

Fast and accurate estimation of agricultural variables enabled by autonomous mobile robots can have a significant impact on precision agriculture. In the forestry context, accurate timber volume estimates are key to informed management decisions, yield estimation, and forecasting. When performed by humans, this is a labor-intensive and costly task.

Among autonomous robots, UAVs offer the best set of capabilities for this task: they can hover and fly quickly in 3D environments, and they are not as affected by undergrowth or terrain elevation changes as Unmanned Ground Vehicles.

Prior research as well as commercial solutions for precision agriculture in forestry exist today. Nonetheless, most applications focus on overhead flight through wide-open spaces. This simplifies operations, but over-canopy data limits what can be measured. For example, it is difficult to assess individual trees' diameters and health conditions from overhead data. UAVs capable of autonomous under-canopy flight can acquire such valuable data at scale.

Figure 2: Area where the experiments were performed (Wharton State Forest). The left panel is a view from an over-canopy UAV, showing the scale of the environment as well as the thick tree canopy that causes GPS to be unreliable. The middle panel is a view from a manually piloted under-canopy UAV flight, showing that the environment is highly unstructured and cluttered with hard-to-avoid obstacles. The patches of light on the ground are challenging for vision-based state estimation, since the camera needs to adapt to changes in illumination. The right panel shows the thick undergrowth covering the ground. In these environments, most features lie on the undergrowth, which is not static because of the downdraft from the propellers or the wind. These factors cause traditional VIO or lidar odometry systems to accumulate large drift and can make the UAV unstable.

Developing an autonomous UAV system that can operate at large scale and low altitude, between rows of trees, or even under the tree canopy is critical for precision agriculture in forestry. This task is, however, still very challenging: (1) GPS is not always reliable or is inaccurate due to canopy occlusion [forest-gps3-carreiras2013estimating]; (2) the environment is very unstructured and dynamic (e.g., leaves or grass blowing in the wind), which poses significant challenges for robot odometry systems that rely on static geometric features; and (3) the environment is very cluttered, which requires a very accurate and dense mapping system. On the other hand, it is impractical for human pilots to gather data at scale under the forest canopy, especially considering the limits of radio communication and the clutter.

UAVs operating in such environments need a reliable platform that is durable and has enough payload capacity for an onboard computer and sensors, as well as a robust autonomy stack capable of estimating UAV poses accurately without relying on GPS while navigating through obstacle-rich environments for long durations. In this letter, we propose a tightly integrated autonomous flight and semantic SLAM system that can perform long-range missions and real-time semantic mapping under a dense forest canopy.

The main contributions of our paper are: (1) to our knowledge, this is the first system that integrates semantic SLAM in real time into an autonomous UAV's feedback loop. Using this approach, odometry drift is minimized, and the planner and controller guide the UAV to perform its mission with high accuracy. In addition, a semantically meaningful map is produced; (2) a robust long-range autonomous flight system that is verified in various highly unstructured, cluttered, and GPS-denied environments; (3) a pipeline to replicate large areas of real-world forests in a photorealistic simulator with high-fidelity images and lidar data. This allows training object detection models, as well as prototyping and tuning of the planner and state estimator.

In addition, our autonomous flight system will be open-sourced. A video of the proposed system and experiment results can be found at:

II Related Work

There has been remarkable progress with UAVs for agricultural applications during the past decade. These technologies are now available as commercial products, but they focus on less complex missions such as overhead flights with GPS. Prior academic work addressed the under-canopy data collection problem with human pilots [chen2020sloam, apolo2020deep]. A few pieces of academic work address autonomy in under-canopy environments as well, but only in simpler conditions [mohta2018fast, zhou2020ego] compared to those shown in fig. 2 and on smaller scales.

It is known that GPS has unsatisfactory accuracy under dense forest canopy [forest-gps3-carreiras2013estimating], while differential or RTK GPS requires reliable communication with a base station, which is impractical for long-range operations in such environments. Cameras and IMUs are inexpensive, lightweight, and have high update rates; however, they have limitations, especially in under-canopy environments. For instance, as shown in fig. 2, variable illumination in sunlit forests is challenging for any VIO system. Exposure must be constantly adjusted, which is detrimental to many visual feature tracking algorithms since it violates the constant-brightness assumption.

3D lidars provide accurate range measurements up to several hundred meters, which makes them suitable for our purposes of autonomous flight in very cluttered environments and accurate semantic mapping. Our system combines stereo VIO and lidar to achieve both high-frequency updates in UAV odometry and high accuracy in large-scale UAV localization and semantic mapping.

Several autonomous flight systems that can handle relatively complicated environments have been proposed in the literature. Zhou et al. present a multi-robot autonomous navigation system in which mid-air collision avoidance between robots is accounted for by adding a penalty function [zhou2020ego]. Oleynikova et al. present an autonomous flight system that relies only on cameras and an IMU and is capable of building dense maps of the environment [oleynikova2020open]. Tian et al. present a multi-UAV system for collaborative mapping under forest canopy. They first detect trees using a clustering-based approach; this tree-only map is then shared between robots and the base station instead of raw point clouds to reduce the communication burden. However, these systems are all demonstrated in relatively sparse environments and at relatively small scale compared to the environment in our experiments, as shown in fig. 2.

We build upon the system proposed by Mohta et al. [mohta2018experiments], which was demonstrated to be capable of high-speed flights in modestly cluttered and unstructured environments and of completing long-range flight missions (up to 1 km). We make the following improvements: (1) We integrate semantic SLAM into the autonomy stack in real time, which produces more reliable, drift-reduced robot localization as well as a semantically meaningful map. A drift-correction mechanism is also introduced to handle corrections from semantic SLAM while maintaining local smoothness and consistency so that the control loop remains stable. (2) We introduce a hierarchical planning and mapping framework to improve computation and storage efficiency, so that the autonomous flight system can execute long-range missions in more complex under-canopy environments. (3) We performed extensive testing of our platform in complex forest environments, verifying its capabilities.

Semantic SLAM has advantages over traditional SLAM. Semantic landmarks are sparser, more distinctive, and less ambiguous [bowman2017probabilistic, gawel2018x]. Semantics also enable algorithms to reason about the world, for example, to discern whether features are expected to be reliable [chebrolu2019robot]. In precision agriculture, semantic SLAM or semantic Structure-from-Motion (SfM) is used to map fruits or trees [chen2020sloam, dong2020semantic, liu2019monocular].

A key step in semantic SLAM is semantic segmentation. Raw point clouds can be used directly [qi2017pointnet++], or converted into other representations such as volumetric grids [zhou2018voxelnet] or spherical projections. Representations such as volumetric CNNs are demanding in both computation and storage, and are thus not suitable for our purpose of real-time segmentation onboard the UAV. Spherical projections (range images) are advantageous for fast object detection, since existing mature image-based detectors can be applied directly and range images usually have a relatively small number of pixels.

For tree detection from point clouds, some recent work relies on clustering or geometry-based filtering [oveland2017automatic, chen2019applicability, tian2020search]. However, the performance of these methods depends on carefully tuned parameters and is therefore hard to generalize. Chen et al. [chen2020sloam] demonstrate fast tree trunk segmentation from range images in real-world complex forests. Their method combines data-driven segmentation and heuristics to perform tree detection more reliably. In our experiments, we employ a similar approach, but we use RangeNet++ [milioto2019rangenet++] because it gives better accuracy and generalization.

Figure 3: Software architecture for the autonomy stack. The system uses information from multiple sensors including a 3D lidar, stereo cameras, and an IMU. It first performs semantic segmentation using a deep neural network to extract object-level information from lidar scans, which is tracked across successive measurements. This semantic data association is converted into constraints on both the robot poses and the object landmark models by the semantic SLAM module, where the estimated relative motion provided by a stereo VIO algorithm [sun2018robust] is used as an initial guess, and SLOAM [chen2020sloam] is used to optimize the pose and landmarks. The output of SLOAM is used by the planning module. The global planner relies on a sparse map to plan the shortest path to the global goal, and the local trajectory planner uses a small but finely discretized robot-centric map to plan a dynamically feasible and collision-free trajectory to the local goal. Meanwhile, a UKF that relies on VIO measurements is used for the low-level control loop due to its high-frequency and smooth pose estimates. The VIO drift is compensated and fed into the planning and control loops in real time, as detailed in section III-D2. A nonlinear geometric controller sends commands to the robot at 200 Hz to track this trajectory.

III Proposed Approach

In this section, we will introduce the modules of our system. The system diagram is shown in fig. 3.

III-A UAV platform

The Falcon 4 UAV platform used for our experiments is custom-built with a carbon fiber frame. It is equipped with various onboard computers and sensors, as shown in fig. 1. The Ouster OS1-64 3D lidar provides accurate range measurements with a large field-of-view for reliable obstacle avoidance and accurate semantic mapping. Thanks to its large payload capacity, this platform is very flexible and can carry additional onboard computers and other mission-specific sensors.

III-B State Estimation

III-B1 Odometry

The odometry system used in this paper is the same as the one proposed in our previous work [mohta2018experiments]. The innermost loop is a UKF that relies on measurements from stereo VIO [sun2018robust] and runs at 200 Hz, outputting smooth pose estimates. However, as illustrated in fig. 2, robot localization under the forest canopy faces many challenges: the geometric features used by traditional VIO or lidar odometry systems become unreliable, causing drift. Therefore, we use semantic SLAM to correct this drift before feeding the estimate to the planning loop.

III-B2 Semantic Lidar Odometry

Our previous work SLOAM, a semantic lidar odometry and mapping framework, leverages semantic information in forests to improve robot state estimation and mapping [chen2020sloam].

Assuming that objects make contact with the ground and lidar observations are gravity-aligned, an object detected in a lidar beam will likely be present in other beams below it. Using this insight, SLOAM computes a trellis graph from the point cloud with semantic labels, where nodes are organized as slices. Each node represents a group of points of the same tree instance from the same horizontal lidar beam. Weighted edges are calculated based on the Euclidean distance between nodes. Finally, individual trees are given by the shortest paths starting from the first slice of the graph, computed with a greedy algorithm.
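The slice-wise greedy instance extraction described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the node representation (per-beam centroids), seeding from the first slice only, and the `max_edge_dist` gate are our assumptions.

```python
import numpy as np

def extract_tree_instances(slices, max_edge_dist=0.5):
    """Greedy trellis-graph tree instance extraction (sketch).

    `slices` is a list of per-beam node lists, ordered top to bottom;
    each node is the centroid of a group of trunk-labeled points from
    one beam. Starting from every node in the first slice, we greedily
    follow the nearest unclaimed node in each successive slice, linking
    nodes whose Euclidean distance is below `max_edge_dist` into one
    tree instance.
    """
    trees = []
    used = [set() for _ in slices]          # nodes already claimed, per slice
    for seed in slices[0]:
        tree = [seed]
        current = np.asarray(seed, dtype=float)
        for level in range(1, len(slices)):
            candidates = [(i, np.asarray(n, dtype=float))
                          for i, n in enumerate(slices[level])
                          if i not in used[level]]
            if not candidates:
                break
            i_best, n_best = min(candidates,
                                 key=lambda c: np.linalg.norm(c[1] - current))
            if np.linalg.norm(n_best - current) > max_edge_dist:
                break                        # gap too large: this tree ends here
            used[level].add(i_best)
            tree.append(tuple(n_best))
            current = n_best
        trees.append(tree)
    return trees
```

Two well-separated vertical trunks are then extracted as two instances, each spanning all slices.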

Using nodes that belong to the same tree, SLOAM computes a cylinder model that estimates the diameter and growth direction of the tree trunk. To represent the ground, which is usually flat in small local regions of a forest, we sample the lowest non-tree points around the lidar sweep and fit a plane model. With these models, the robot can estimate its pose at each time step and create a semantic map with important information for timber inventory, such as diameter at breast height and number of trees.

The main problem of relying purely on semantic information is that the state estimation will also fail if object detection fails. We present improvements to the SLOAM framework to increase stability and robustness for large-scale autonomy.

III-B3 VIO Integration

VIO will drift under the forest canopy because of light variation and the movement of underbrush and branches, where most of the features are detected. However, a robust VIO algorithm with high-quality sensors should still provide pose estimates that are consistent over short time intervals. On the other hand, SLOAM may fail if not enough trees are detected to constrain the pose estimation.

We use S-MSCKF [sun2018robust] as an initial guess to the pose estimation problem and run SLOAM to correct the drift over longer intervals. When the VIO odometry estimates a translation of at least a fixed threshold in meters, a new keyframe is created. For each keyframe, SLOAM detects trees, performs data association, estimates a pose, and updates the semantic map. This integration is especially valuable in the context of the entire autonomy stack presented in this work: it saves computational resources and enables the whole system to run in real time.

Since SLOAM's data association is based on nearest-neighbor matching, if the motion difference between two keyframe poses is large, the association may fail or give false-positive matches. We therefore use the relative motion estimated with VIO to initialize SLOAM and perform data association reliably.

Due to drift, the VIO pose and the SLOAM pose may differ. To provide SLOAM with an initial guess, at every keyframe $k$ we store a tuple of poses $(T^{vio}_k, T^{sloam}_k)$. The initial guess of the relative motion between keyframes estimated by the VIO can be computed by

$$\Delta T_k = \left(T^{vio}_{k-1}\right)^{-1} T^{vio}_k$$

$\Delta T_k$ can then be combined with the previous SLOAM pose to be used by SLOAM as the initial guess

$$\tilde{T}_k = T^{sloam}_{k-1}\, \Delta T_k$$


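Under the assumption that poses are expressed as 4x4 homogeneous transformation matrices, the keyframe composition described above can be sketched as follows (the function names are ours, for illustration only):

```python
import numpy as np

def vio_relative_motion(T_vio_prev, T_vio_curr):
    """Relative motion between consecutive keyframes from VIO poses
    (4x4 homogeneous matrices): Delta = inv(T_prev) @ T_curr."""
    return np.linalg.inv(T_vio_prev) @ T_vio_curr

def sloam_initial_guess(T_sloam_prev, T_vio_prev, T_vio_curr):
    """Compose the previous SLOAM pose with the VIO-estimated relative
    motion to obtain the initial guess for the current keyframe."""
    return T_sloam_prev @ vio_relative_motion(T_vio_prev, T_vio_curr)
```

For a pure translation this reduces to adding the VIO displacement onto the previous SLOAM position, which is the intended drift-tolerant behavior: only the short-interval relative motion of VIO is trusted.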
Let the semantic map $\mathcal{M} = \{l_1, \dots, l_n\}$ with $n$ landmarks be the set of all landmarks detected by SLOAM over the robot trajectory. For a new observation, we extract a submap $\mathcal{M}_s$ by selecting from the semantic map a set of trees that are close to the current estimated position $\hat{p}_k$ of the robot based on a distance threshold $d_{max}$. That is

$$\mathcal{M}_s = \{\, l_i \in \mathcal{M} : \lVert p(l_i) - \hat{p}_k \rVert < d_{max} \,\}$$

where $p(l_i)$ is the 2D position of landmark $l_i$ in the map coordinate frame.

We store references to each tree in a KD-Tree indexed by their 2D position in the map coordinate frame to query the semantic map efficiently. SLOAM receives the submap $\mathcal{M}_s$, the initial guess $\tilde{T}_k$, and a lidar reading in the robot frame. In this formulation, SLOAM estimates the pose by performing data association between the current lidar reading and the submap, which is analogous to the mapping step of the original formulation.
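A minimal version of the KD-tree-backed submap query could look like the following; using SciPy's `cKDTree` is our assumption, standing in for whatever KD-tree implementation the system actually uses:

```python
import numpy as np
from scipy.spatial import cKDTree

class SemanticSubmap:
    """Sketch of the submap query: tree landmarks indexed by their 2D
    position in the map frame, queried by radius around the robot."""

    def __init__(self, tree_positions_2d):
        self.positions = np.asarray(tree_positions_2d, dtype=float)
        self.kdtree = cKDTree(self.positions)

    def query(self, robot_xy, radius):
        """Return indices of landmarks within `radius` of the robot,
        i.e. the submap used for data association."""
        return self.kdtree.query_ball_point(np.asarray(robot_xy, dtype=float),
                                            r=radius)
```

The radius query runs in roughly logarithmic time per landmark returned, which is what makes submap extraction cheap enough to run at every keyframe.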

III-B4 Lidar State Estimation

Let $\mathcal{O}_k$ be the lidar observation at the current time $k$. To perform nearest-neighbor data association with the submap given an arbitrary robot motion, we transform $\mathcal{O}_k$ using the initial guess $\tilde{T}_k$. All features and models from this new observation will then already be in the initial-guess frame.

Given a set of ground features $\mathcal{G}$ from the transformed lidar observation and the previous plane model $\pi$, the optimization objective is to find the transformation that minimizes the point-to-plane distance $d(\cdot, \pi)$. Since this plane will always be below the robot, it can constrain the Z, pitch, and roll motion. The cost function is given by

$$T_g^* = \operatorname*{arg\,min}_{T_g} \sum_{p \in \mathcal{G}} d(T_g\, p,\; \pi)$$

where $T_g$ is a transformation in SE(3) with three degrees of freedom (translation X, Y, and yaw are fixed).
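The constrained ground residual can be sketched as below. The `(tz, roll, pitch)` parameterization and the plane representation $n \cdot p + d = 0$ are our assumptions for illustration; the paper only specifies which degrees of freedom are free.

```python
import numpy as np

def ground_residuals(params, points, plane_n, plane_d):
    """Point-to-plane residuals for the ground cost (sketch).

    `params = (tz, roll, pitch)` are the only free degrees of freedom:
    the ground plane below the robot constrains Z, roll, and pitch,
    while x, y, and yaw are fixed. The plane is n . p + d = 0 with unit
    normal `plane_n`; each residual is the signed distance of a
    transformed ground feature to the plane.
    """
    tz, roll, pitch = params
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch
    R = Ry @ Rx
    t = np.array([0.0, 0.0, tz])
    transformed = points @ R.T + t
    return transformed @ plane_n + plane_d
```

A residual vector like this would typically be handed to a nonlinear least-squares solver; for points one meter above the plane $z = 0$, a correction of $t_z = -1$ drives all residuals to zero.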

Given a set of tree models $\mathcal{C} = \{c_1, \dots, c_m\}$ of size $m$: for each cylinder $c_j$, let $\mathcal{F}_j$ be the set of tree feature points associated with that landmark. The features in $\mathcal{F}_j$ are associated to the nearest cylinder in the submap $\mathcal{M}_s$. The optimization objective is to minimize the point-to-cylinder distance of the associated features. Since the robot may not observe the entire tree trunk, the cylinder models are not constrained in the Z direction; for this reason, these models can constrain the X, Y, and yaw motion. The cost function is given by

$$T_c^* = \operatorname*{arg\,min}_{T_c} \sum_{j=1}^{m} \sum_{p \in \mathcal{F}_j} d(T_c\, p,\; c_j)$$

where $T_c$ is a transformation in SE(3) with three degrees of freedom (translation Z, pitch, and roll are fixed). Finally, the SLOAM pose is given by

$$T^{sloam}_k = T_c^*\, T_g^*\, \tilde{T}_k$$


In the original SLOAM formulation, both ground and tree costs were solved jointly, and all 6 degrees of freedom were estimated together with nonlinear least squares. By splitting the problem, we can correct only the parts of the initial guess where we have enough matches to provide a good refinement. That is, if the number of ground features is low, we can set $T_g^* = I$, where $I$ is the identity transform. If not enough tree features are matched, we can set $T_c^* = I$. Consequently, if neither ground nor tree features are reliable, SLOAM will just use the initial guess to update the map.
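The identity-fallback logic can be sketched as follows. The feature-count thresholds, the solver callbacks, and the left-multiplicative composition order are illustrative assumptions, not values or choices taken from the text.

```python
import numpy as np

def sloam_pose_update(T_init, n_ground, n_tree, solve_ground, solve_tree,
                      min_ground=50, min_tree=30):
    """Split ground/tree optimization with identity fallbacks (sketch).

    `solve_ground` and `solve_tree` are callbacks returning the 4x4
    corrections from the two partial optimizations. If too few ground
    (or tree) features matched, the corresponding correction is the
    identity, so only the reliable part refines the initial guess;
    with neither, the pose is simply the initial guess `T_init`.
    Thresholds are placeholders.
    """
    I = np.eye(4)
    T_ground = solve_ground() if n_ground >= min_ground else I
    T_tree = solve_tree() if n_tree >= min_tree else I
    return T_tree @ T_ground @ T_init
```

With zero matched features, the function degrades gracefully to the VIO-propagated initial guess, which is exactly the robustness property the split buys.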

III-C Perception

III-C1 Semantic Mapping

Given a new pose estimate, SLOAM projects the models derived from the current sweep and associates cylinder models to those in the submap using a cylinder-to-cylinder distance. These associations update the tree models to the latest estimate; cylinders that do not have a match are added to the map.

III-C2 Semantic segmentation of trees from lidar point clouds

Since SLOAM requires every lidar reading to be segmented to compute lidar odometry, it uses a simplified version of ERFNet [romera2017erfnet] to perform real-time semantic segmentation on the CPU. The input point cloud is transformed into a range image using a spherical projection and treated as a regular one-channel image for segmentation. However, the segmentation is not smooth in many cases and can consider leaves and small branches as part of the trunk.
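The spherical projection step can be sketched as below. The 64x1024 resolution and the +-16.6 degree vertical field of view are our assumptions matching the OS1-64's nominal specifications, not values stated in the text.

```python
import numpy as np

def spherical_projection(points, h=64, w=1024, fov_up=16.6, fov_down=-16.6):
    """Project a lidar point cloud onto a one-channel range image (sketch).

    `h` matches the number of beams (64 for an OS1-64); each point is
    mapped to a pixel by its elevation (row) and azimuth (column), and
    the pixel stores the range. Collisions keep the last-written point.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                                   # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    # row 0 at the top beam, last row at the bottom beam
    u = ((1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * (h - 1)).astype(int)
    v = ((0.5 * (1.0 - yaw / np.pi)) * (w - 1)).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    valid = (u >= 0) & (u < h) & (v >= 0) & (v < w)
    img[u[valid], v[valid]] = r[valid]
    return img
```

The resulting image can be fed to any 2D segmentation backbone, and the per-pixel labels are mapped back to the originating points by the same index arithmetic.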

SLOAM addresses this problem with the Trellis Graph instance detection. However, it is challenging to completely filter these unwanted points that add noise to the cylinder model estimation and consequently to the state estimation.

For this reason, in our experiments we use RangeNet++ [milioto2019iros] with a simplified DarkNet [redmon2018yolov3] as the network backbone and decoder. To improve segmentation accuracy, we first train the model using lidar scans and ground-truth segmentation provided by our simulator. The model is then fine-tuned on real-world data collected with the same UAV during manual flights in a real forest.

III-D Planning and Control

Figure 4: Visualization of real-time perception and decision making of the autonomy stack in a real-world forest. The green cylinders are SLOAM-estimated tree models, the small bounding box is the local voxel map (30 m × 30 m × 10 m), and the black line shown at the top of the middle panel is the edge of the global voxel map (1 km × 1 km × 10 m). The left panel shows a zoomed-in view of the SLAM and odometry reference frames. The VIO- and SLOAM-estimated poses are used to calculate a transform between the two reference frames, which compensates for the VIO drift. This drift correction is fed into the planning loop in real time. In the right panel, the local planner outputs a smooth trajectory to the local goal.

Our previous efforts [mohta2018fast] focused on fast autonomous flight in cluttered and GPS-denied environments. However, to autonomously acquire data in dense under-canopy environments that are more cluttered, unstructured, and significantly larger in scale, various challenges need to be overcome in the autonomy stack: (1) since the goal is to gather high-quality data on objects of interest, a mechanism for extracting and storing semantic information must be incorporated; (2) data storage and computation need to be reduced; (3) the planning and control algorithms need to adapt to the drift correction from the semantic SLAM module in real time.

We address challenges (1) and (2) by introducing a hierarchical planning and mapping framework, and we address challenge (3) by introducing a real-time drift compensation module.

III-D1 Hierarchical planning and mapping framework

A hierarchical mapping and planning framework is introduced, as illustrated in fig. 4. Our top-level map consists of a set of semantic models that enable us to accurately localize the robot over long distances and keep track of the information of interest in real time. This map is sparse and thus storage efficient, as explained in fig. 10.

Our mid-level map is a large but coarsely discretized voxel map, and our low-level map is a small but finely discretized, robot-centric voxel map. The top- and mid-level maps are updated at a lower frequency (2 Hz) to reduce computation, while the low-level map is updated at the same frequency as the lidar (10 Hz) to maximize safety.

The global planner uses the JPS (jump point search) algorithm to generate collision-free, long-horizon paths to the global goal in the mid-level voxel map. The local planner uses a motion-primitive-based algorithm [liu2018search] to generate a collision-free, dynamically feasible trajectory to the local goal in the low-level voxel map. The local goal is defined as the first intersection of the global path with the local voxel map's boundaries. The mid-level mapper has a large ray-cast range so that the global planner is non-myopic, while the low-level mapper has a ray-cast range equal to the local map's dimensions and is thus computationally efficient.
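The local-goal selection (first intersection of the global path with the local map boundary) can be sketched as follows; the axis-aligned box model of the local map and the bisection refinement are our simplifications.

```python
import numpy as np

def local_goal(global_path, center, half_extent):
    """First crossing of the global path out of the local map (sketch).

    Walks the piecewise-linear global path and returns (approximately)
    the first point where it leaves the axis-aligned, robot-centric
    local map box `center +- half_extent`. If the whole path stays
    inside the box, the final waypoint (the global goal) is returned.
    """
    lo = np.asarray(center, dtype=float) - half_extent
    hi = np.asarray(center, dtype=float) + half_extent
    inside = lambda p: np.all(p >= lo) & np.all(p <= hi)
    for a, b in zip(global_path[:-1], global_path[1:]):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        if inside(a) and not inside(b):
            # bisect the segment to approximate the boundary crossing
            for _ in range(32):
                m = 0.5 * (a + b)
                if inside(m):
                    a = m
                else:
                    b = m
            return a
    return np.asarray(global_path[-1], dtype=float)
```

Because the local goal always lies on (or just inside) the local map boundary, the motion-primitive planner never has to reason about space it cannot observe.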

This design significantly reduces the data storage and computation demand due to the sparsity of the top and mid-level maps. Meanwhile, it does not sacrifice optimality and safety due to the fine discretization of the local mapper. The top-level map can be further used for other purposes such as planning the high-level informative path that minimizes the uncertainty about object geometries and semantics.

III-D2 Real-time drift compensation module

A brute-force way of correcting the drift induced by VIO is to feed the semantic-SLAM-estimated UAV pose into the controller or UKF. In this setup, the controller's desired position remains on the existing trajectory calculated from the drifted odometry, while the estimated poses given to the controller are drift-corrected. However, jumps in pose estimates can happen when the UAV revisits a place, since semantic SLAM performs global data association and optimization. Even if a local replan is requested whenever such jumps are detected, there will be a delay before the re-planned local trajectory becomes available. Since the control loop runs at a higher rate (200 Hz), there will be a jump between the desired pose (from evaluating the trajectory) and the estimated pose (from semantic SLAM or the UKF). The controller will thus output aggressive and usually dynamically infeasible commands, making the UAV unstable or even causing a crash.

A more stable way of feeding the drift correction into the planning and control loop is to maintain two reference frames: the SLAM and odometry reference frames. Whenever a semantic SLAM pose is received, the drift is calculated. The inverse of the drift, which is the transformation from the semantic-SLAM-estimated pose to the UKF-estimated pose, is applied to the odometry reference frame.

The global goal lies in the SLAM reference frame and is transformed into the odometry reference frame. Upon replanning, the global planner outputs a global path that is drift-compensated. This in turn leads the local planner to generate a drift-compensated local trajectory, since the local goal is obtained from the drift-compensated global path. Thus, the planned trajectory and control commands will guide the UAV to execute its mission accurately.
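Assuming poses are expressed as 4x4 homogeneous matrices, the two-frame drift correction can be sketched as follows (function names are ours):

```python
import numpy as np

def drift_transform(T_slam, T_ukf):
    """Transform from the SLAM frame to the odometry frame (sketch).

    Given the pose of the same instant as estimated by semantic SLAM
    and by the UKF (both 4x4 homogeneous matrices), this returns the
    correction that maps SLAM-frame quantities into the odometry frame
    the planner and controller operate in.
    """
    return T_ukf @ np.linalg.inv(T_slam)

def correct_goal(goal_slam, T_slam, T_ukf):
    """Express a SLAM-frame global goal in the odometry frame."""
    g = np.append(np.asarray(goal_slam, dtype=float), 1.0)  # homogeneous
    return (drift_transform(T_slam, T_ukf) @ g)[:3]
```

Because only goals are re-expressed, the high-rate control loop keeps consuming smooth UKF estimates, and SLAM jumps never appear as pose discontinuities in the controller.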

IV Results and Analysis

Our system was first developed and tested in simulated environments and then demonstrated and improved in real-world forests. For the simulation experiments, we used Unity3D as the simulation engine due to its high-fidelity sensor data. This is especially important for testing our perception modules, including VIO and lidar-based semantic segmentation. We chose the Wharton State Forest for real-world experiments, a natural forest located in the U.S. state of New Jersey consisting of approximately 122,880 acres of the Pinelands, as shown in fig. 2. There is significant variance in tree densities, sizes, shapes, and undergrowth conditions, making this forest ideal for verifying our system's robustness and generalization ability.

IV-A Simulation experiments and analysis

We set up an automatic pipeline that allows us to generate many simulation environments based on real-world tree positions collected during field trips. We use various high-fidelity tree models available in the Unity Asset Store and customize their sizes, shapes, and orientations, as well as undergrowth conditions, to replicate real-world forests.

Figure 5: Examples of sparse and dense simulated forests. We generated 10 forest environments with the same map size (340 m × 340 m) but varying tree density (from 168 trees to 7255 trees). This image shows two of the 10 simulation environments: the left panel shows a sparse forest (168 trees), and the right panel shows a relatively dense forest (4954 trees). To create these variable-density environments, we used measurements from a real forest and generated new environments with the same distribution but different densities. Taking the UAV's and the tree trunks' diameters into account, the average gap between trees in the densest forest is 2.19 meters.
Figure 6: Velocity vs. density. This figure illustrates how the autonomous flight system performs in forests of varying densities. For each of the 10 environments, 4 flight missions were executed: flights from the center of the map to the 4 corners, with a total straight-line length of 1 km. Each blue dot represents the average velocity across the 4 missions for one environment. The blue curve shows a 3rd-order polynomial fit to these points. The average velocity clearly decreases with increasing tree density. The other curves show individual missions' velocity trends.
Figure 7: Execution time vs. density and distance traveled vs. density. The blue curve shows the actual length of the trajectory as the tree density increases. Even though the straight-line distance of the mission is the same, the actual distance traveled by the robot increases with tree density. This, combined with the decrease in average velocity, leads to an even sharper increase in the time to complete the mission, as shown in the red curve.

To verify the reliability and robustness of our system, we ran a series of experiments on ten simulated forests with different densities, as shown in fig. 5. All simulation experiments were performed under the same navigation stack setup, with the maximum velocity set to 11.0 m/s. Four missions with a combined length of 1 km were sent to the robot for each environment. The average velocity, distance traveled, and mission execution time for each mission are recorded and compared in fig. 6 and fig. 7.

From fig. 6, there is a clear trend showing that the average velocity decreases with increasing tree density. Fig. 7 shows that the distance traveled by the UAV increases with tree density, which is expected since the robot has to adapt its path frequently to avoid obstacles. Combined with the decrease in average velocity, the mission execution time rises dramatically with higher tree density.

These experiments indicate that the difficulty of accomplishing flights with a limited battery life increases rapidly with the density of the environment. Moreover, under a dense canopy, the UAV's average velocity is limited by the environment rather than by its maximum design velocity. Thus, flight time is the dominant factor for the mission range. This also supports our choice of a Li-ion battery over a Li-Po battery, since the former has a higher specific energy (Wh/kg).

IV-B Real-world experiments and analysis

We have performed various flight experiments in different parts of the Wharton State Forest. An important prerequisite is training the segmentation neural network. We trained first on simulator data and then fine-tuned on real-world data. It is worth noting that we did not use any labeled data from the Wharton State Forest. Instead, we used data collected in a different pine forest in Arkansas, US. We labeled 50 lidar scans to fine-tune the model, 37 for training and 13 for validation. An example of the segmentation result is shown in fig. 4. The model generalizes well and runs at 2 Hz on the UAV's onboard computer, the Intel NUC 10, using 3 of the 12 CPU threads; this is sufficient since semantic segmentation is performed only on keyframes.

Figure 8: Top-down view of the 1.1 km trajectory. The integration of VIO and SLOAM improves state estimation and yields an informative semantic map that can be used for global planning. The black dots represent the landmarks detected by our mapping algorithm.
Figure 9: Top-down view of the 800 m trajectory.


In the first field experiment, our goal was to demonstrate that relying only on VIO leads to failure to complete the mission. As illustrated in fig. 9, the autonomy stack relied only on VIO to control the UAV. The objective was to perform two square loops of 800 m and return home. According to VIO, the robot was successful, but according to semantic SLAM and GPS, the VIO drifted significantly (60 m). Semantic SLAM and GPS agreed with each other within a 10–20 m margin. It is worth noting that although we obtained a GPS lock, the sensor reported a 10 m standard deviation for the X and Y position estimates. For this reason, in the forest, GPS can be used as a high-level reference for the robot's position, but it is guaranteed neither to be available nor to produce correct measurements. One possible reason that VIO drifted significantly in this experiment is that the UAV's motion was very smooth and close to a constant-velocity state, which can form degenerate cases for VIO [yang2019degenerate-vio-degenerate].

In the second field experiment, our goal was to demonstrate that the VIO drift can be corrected with semantic SLAM. In fig. 8 we illustrate the scale of the long-range trajectories performed with the robot. For this flight, the UAV flew a loop of over 1 km, covering most of a 200 m × 200 m area while detecting and modeling trees. For this experiment, the GPS was not able to get a lock, so we did not include it in the results.

Figure 10: Semantic map vs. voxel map. This figure illustrates the difference between the semantic map and the voxel map on a real-world dataset. Top panel: side view; bottom left panel: top view; bottom right panel: oblique view. The green cylinders show the SLOAM-estimated semantic map, the black bounding box shows a local voxel map, and the blue curve shows the SLOAM-estimated robot trajectory. The local voxel map is 30 m × 30 m × 10 m, and the semantic map is approximately 200 m × 200 m. The semantic map representation is much sparser than the voxel map representation. For example, there are 1.125 million voxels in this local voxel map, which requires at least 1.125 MB of storage. If the same voxel map were used to represent a 1 km × 1 km region, the storage requirement would be 1250 MB. However, there are no more than 250,000 trees in the 1 km × 1 km region, which requires at most 2 MB of storage. The storage burden is thus reduced to 0.16% while keeping the information of interest. Moreover, these semantic object models are used for robot localization in our system, so raw point cloud data no longer needs to be stored, further reducing storage demand. Such reductions are significant when performing large-scale operations.
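The storage comparison in the caption can be reproduced with a few lines of arithmetic. The voxel resolution (0.2 m), one byte per voxel, and eight bytes per tree model are assumptions chosen to match the caption's figures, not parameters taken from our implementation.

```python
# Storage comparison: dense voxel map vs. sparse semantic (tree) map.
# Assumed: 0.2 m voxels at 1 byte each, 8 bytes per tree model.

def voxel_storage_mb(x_m, y_m, z_m, res_m=0.2, bytes_per_voxel=1):
    """Storage (MB) of a dense voxel grid covering an x * y * z volume."""
    n_voxels = (x_m / res_m) * (y_m / res_m) * (z_m / res_m)
    return n_voxels * bytes_per_voxel / 1e6

local_mb = voxel_storage_mb(30, 30, 10)        # local voxel map
global_mb = voxel_storage_mb(1000, 1000, 10)   # same grid over 1 km x 1 km
semantic_mb = 250_000 * 8 / 1e6                # 250k tree models, 8 bytes each
ratio = semantic_mb / global_mb                # fraction of voxel-map storage
```

The key point is scaling: dense voxel storage grows with covered volume, while semantic storage grows only with the number of modeled objects.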

In Table I we compare the accumulated drift based on the starting and ending positions of the poses estimated by VIO and by SLOAM + VIO. Although stereo VIO outputs consistent local estimates, it is unreliable over long-range flights, especially along the Z direction. Directly using such estimates would lead the robot to deviate from the desired trajectory. SLOAM benefits from the local consistency of VIO while reducing drift by using landmarks and the ground model. Most of the Z-axis drift is corrected by our semantic lidar framework, and the XY drift is reduced.

Method XY Drift (m) Z Drift (m) Total Drift (m)
VIO 7.92 -6.27 10.10
SLOAM + VIO 3.93 0.71 3.99
Table I: Distance of final estimated pose from beginning of trajectory on 1.1 km loop.
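The drift figures in Table I follow from the start and end positions of each estimated trajectory: XY drift is the horizontal displacement, Z drift the vertical one, and total drift their Euclidean combination. A minimal sketch:

```python
import math

def accumulated_drift(start, end):
    """Return (xy_drift, z_drift, total_drift) between (x, y, z) positions."""
    dx, dy, dz = (e - s for e, s in zip(end, start))
    xy = math.hypot(dx, dy)
    total = math.sqrt(dx**2 + dy**2 + dz**2)
    return xy, dz, total

# Check consistency with Table I's VIO row (XY 7.92 m, Z -6.27 m):
xy, z, total = accumulated_drift((0.0, 0.0, 0.0), (7.92, 0.0, -6.27))
# total is about 10.10 m, matching Table I up to rounding.
```

The end position here is placed along the X axis for illustration only; any end point with the same XY and Z offsets gives identical drift values.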

IV-C Discussion

Compared to the first field experiment, the UAV commanded larger angular and linear accelerations in the second field experiment, which improved VIO performance. However, from the perspective of gathering information, such aggressive motions usually produce distorted or blurred sensor data, which degrades the quality of the semantic map.

Navigating in forests is difficult, especially when the tree density is high. Even with the same planner configuration, the average speed, mission time, and path length vary significantly with density. The density of the environment therefore influences the UAV's behavior, and we need a state estimator that is robust to these behaviors.

Traditional odometry systems are good at producing smooth estimates over moderate distances, but they are not as robust as semantic SLAM over long durations or across different scenarios. Using semantics to compensate for VIO drift therefore provides more robustness and reliability. Moreover, different sensors have different degenerate cases and failure modes, and forest-like environments aggravate them through perceptual aliasing, excessive heat, humidity, sunlight, and dust. Sensor redundancy can therefore increase reliability.

The global drift from SLOAM is minimal, but if one wants to reduce drift further, a smoothing-and-mapping (SAM) framework with loop closure constraints, such as the one proposed in LIO-SAM [liosam2020shan], can be combined with semantic mapping. However, this formulation can significantly increase computation when loops are detected, which would impact the performance of the autonomy stack given the UAV's limited onboard computational power. For this reason, SAM is more suitable for offline processing.
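To make the loop-closure idea concrete, the toy example below shows how a single closure constraint redistributes accumulated odometry error. It is a 1D linear least-squares sketch, not the SE(3) factor-graph optimization that systems like LIO-SAM actually perform, and all numbers are illustrative.

```python
import numpy as np

# Toy 1D pose graph: 5 poses, odometry factors, one loop closure, one prior.
n = 5
rows, rhs = [], []

# Odometry factors: each step reports a +1.0 m displacement (biased),
# so odometry alone leaves a 4 m gap on a trajectory that is a loop.
for i in range(n - 1):
    r = np.zeros(n)
    r[i + 1], r[i] = 1.0, -1.0
    rows.append(r)
    rhs.append(1.0)

# Loop-closure factor: the last pose should coincide with the first.
r = np.zeros(n)
r[n - 1], r[0] = 1.0, -1.0
rows.append(r)
rhs.append(0.0)

# Prior factor anchoring the first pose at the origin.
r = np.zeros(n)
r[0] = 1.0
rows.append(r)
rhs.append(0.0)

x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
# The optimizer spreads the error along the trajectory, shrinking the
# start/end gap from 4 m (odometry only) to 0.8 m.
```

Even this tiny problem requires re-solving the whole trajectory whenever a closure is added, which is why loop-detection spikes in computation are a concern for an onboard autonomy stack.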

V Conclusion and Future Work

Generating semantic maps of forests and orchards is important for understanding the magnitude of the carbon challenge and for developing new strategies for precision agriculture. Semantic mapping requires under-canopy flight and measurements at the individual tree or plant level. In this paper, we described the key hardware and software components required for an autonomous Unmanned Aerial Vehicle to build 1 km × 1 km maps of unstructured forests in a GPS-challenged environment with varying illumination (sunlight, shadows) and wind, without any human intervention. These include the use of cameras, an IMU, and a lidar for state estimation; the segmentation of trees and ground planes from lidar scans; and the real-time integration of the semantic map into the control and planning loop to minimize drift and to facilitate localization, mapping, and planning. To our knowledge, this is the first time that real-time semantic SLAM has been integrated into the feedback loop for autonomous UAV navigation and that detailed, multi-resolution maps over such a large area have been built in an automated fashion. Future work involves incorporating the semantic map to actively choose actions that reduce the uncertainty of the information gathered by the system.