Fast and accurate estimation of agricultural variables enabled by autonomous mobile robots can have a significant impact in precision agriculture. In the forestry context, accurate timber volume is key to making informed management decisions, yield estimation and forecasting. When performed by humans, this is a labor-intensive and costly task.
Among the autonomous robots, uav offer the best set of capabilities for this task: they can hover and fly quickly in 3D environments, and they are not as affected by undergrowth or terrain elevation changes as Unmanned Ground Vehicles.
Prior research as well as commercial solutions for precision agriculture in forestry exist today. Nonetheless, most applications focus on overhead flight through wide-open spaces. This simplifies operations, but over-canopy data limits what is possible to measure. For example, it is difficult to assess individual trees’ diameters and health condition from overhead data. uav capable of autonomous under-canopy flights can acquire such valuable data at scale.
Developing an autonomous uav system that can operate at large scale and low altitude, between rows of trees, or even go under the tree canopy is critical for precision agriculture in forestry. This task is, however, still very challenging: (1) GPS is not always reliable or is inaccurate due to canopy occlusion [forest-gps3-carreiras2013estimating]; (2) the environment is very unstructured and dynamic (e.g. leaves or grass blowing in the wind), which poses significant challenges on robot odometry systems that rely on static geometric features; (3) and the environment is very cluttered which requires a very accurate and dense mapping system. On the other hand, it’s impractical for human pilots to gather data at scale under the forest canopy, especially considering the limit in radio communication and the clutter.
uav operating in such environments need a reliable platform that is durable and has enough payload capacity for an on-board computer and sensors, as well as a robust autonomy stack that is capable of estimating uav poses accurately without relying on GPS while navigating through obstacle-rich environments for long durations. In this letter, we propose a tightly integrated autonomous flight and semantic SLAM system that can perform long-range missions and real-time semantic mapping under a dense forest canopy.
The main contributions of our paper are: (1) to our knowledge, this is the first system that integrates semantic SLAM in real time into an autonomous uav’s feedback loop. Using this approach, odometry drift is minimized, and the planner and controller will guide the uav to perform its mission with high accuracy. In addition, a semantically meaningful map is produced; (2) a robust long-range autonomous flight system that is verified in various highly unstructured, cluttered, and GPS-denied environments; (3) a pipeline to replicate large areas of real-world forests in a photorealistic simulator with high-fidelity images and lidar data. This allows training object detection models, as well as prototyping and tuning of the planner and state estimator.
Ii Related Work
There has been remarkable progress with uav for agricultural applications during the past decade. These technologies are now available as commercial products, but they focus on less complex missions such as overhead flights with GPS. Prior academic work addressed the under-canopy data collection problem with human pilots [chen2020sloam, apolo2020deep]. A few pieces of academic work address autonomy in under-canopy environments as well, but only in simpler conditions [mohta2018fast, zhou2020ego] compared to those shown in fig. 2 and on smaller scales.
It is known that GPS has unsatisfactory accuracy under dense forest canopy [forest-gps3-carreiras2013estimating], while differential or RTK GPS requires a reliable communication with a base station, which is impractical for long-range operations in such environments. Cameras and IMUs are inexpensive, lightweight, and have high update rates; however, they have limitations, especially when used in under-canopy environments. For instance, as shown in fig. 2, variable illumination in sunlit forests makes it challenging for any vio system. Exposure must be constantly adjusted, which is detrimental for many visual feature tracking algorithms since it violates the constant brightness assumption.
3D lidar can have accurate range measurements up to several hundred meters, which makes them suitable for our purposes of autonomous flight in very cluttered environments and accurate semantic mapping. Our system combines stereo vio and lidar to achieve both high-frequency updates in uav odometry as well as high accuracy in large-scale uav localization and semantic mapping.
Several autonomous flight systems that can handle relatively complicated environments are proposed in the literature. Zhou et al. present a multi-robot autonomous navigation system. The mid-air collision avoidance between robots is accounted for by adding a penalty function [zhou2020ego]. Oleynikova et al. present an autonomous flight system that relies only on cameras and IMU, and is capable of building dense maps of the environment [oleynikova2020open]. Tian et al. proposed a multi-uav system for collaborative mapping under forest canopy. They first detect trees using a clustering based approach. Then, this tree-only map is shared between robot and base station instead of raw point clouds to reduce the communication burden. However, these systems are all demonstrated in relatively sparse environments and at relatively small scale, compared to the environment in our experiments, as shown in fig. 2.
We build upon the system proposed by Mohta et al. [mohta2018experiments], which was demonstrated to be capable of performing high speed flights in modestly cluttered and unstructured environment, and complete long range flight missions (up to 1 km). We make the following improvements: (1) We integrate semantic SLAM into the autonomy stack in real time, which produces a more reliable, drift-reduced robot localization, as well as a semantically meaningful map. A drift-correction mechanism is also introduced to handle corrections from semantic SLAM while maintaining local smoothness and consistency so that the control loop remains stable. (2) We introduce a hierarchical planning and mapping framework to improve computation and storage efficiency, so that the autonomous flight system can execute long range missions in more complex under-canopy environments. (3) We performed extensive testing of our platform in complex forest environments, verifying its capabilities.
Semantic SLAM has advantages compared to traditional SLAM. Semantic landmarks are sparser, more unique and unambiguous [bowman2017probabilistic, gawel2018x]. It also enables algorithms to reason about the world, for example, to discern whether features are expected to be reliable or not [chebrolu2019robot]. In the precision agriculture, semantic SLAM or semantic Structure-from-Motion(SfM) is used to map fruits or trees [chen2020sloam, dong2020semantic, liu2019monocular].
A key step to semantic SLAM is semantic segmentation. Raw point clouds can be directly used [qi2017pointnet++], or converted into other kind of representation such as volumetric [zhou2018voxelnet] or spherical projections. Representations such as Volumetric CNNs are demanding on both computation and storage, and thus not suitable for our purpose of real-time segmentation onboard the uav. Spherical projections (range images) are advantageous for fast object detection since existing mature image-based detectors can be directly applied, and the range images usually have a relatively small number of pixels.
For tree detection from point clouds, some recent work uses methods that rely on clustering or geometric-based filtering to detect trees [oveland2017automatic, chen2019applicability, tian2020search]. However, their performance depends on carefully tuned parameters, and are therefore hard to generalize. Chen et al. [chen2020sloam]
demonstrate fast tree trunk segmentation from range images in real-world complex forests. Their method combines data-driven segmentation and heuristics to perform tree detection more reliably. In our experiments, we employ a similar approach, but we use RangeNet++[milioto2019rangenet++] because it gives better accuracy and generalization.
Iii Proposed Approach
In this section, we will introduce the modules of our system. The system diagram is shown in fig. 3.
Iii-a UAV platform
The Falcon 4 UAV platform used for our experiments is custom-built with a carbon fiber frame. It is equipped with various onboard computers and sensors, as shown in fig. 1. The Ouster OS1-64 3D lidar provides accurate range measurements with a large field-of-view for reliable obstacle avoidance and accurate semantic mapping. Thanks to its large payload capacity, this platform is very flexible and can carry additional onboard computers and other mission-specific sensors.
Iii-B State Estimation
The odometry system used in this paper is the same as the one proposed in our previous work [mohta2018experiments]. The innermost loop of our odometry system is a ukf. The ukf relies on measurements from stereo vio [sun2018robust], and runs at 200hz and outputting smooth pose estimates. However, as mentioned in fig. 2, there are many challenges for robot localization under the forest canopy. The geometrical features used in traditional vio or lidar odometry systems will thus become unreliable. Therefore, we use semantic SLAM to correct this drift before feeding it to the planning loop.
Iii-B2 Semantic Lidar Odometry
Our previous work SLOAM, a semantic lidar odometry and mapping framework, leverages semantic information in forests to improve robot state estimation and mapping [chen2020sloam].
Assuming that objects make contact with the ground and lidar observations will be gravity aligned, an object detected in a lidar beam will likely be present in other beams below it. Using this insight, SLOAM computes a trellis graph from the point cloud with semantic labels, where nodes are organized as slices. Each node represents a group of points of the same tree instance from the same horizontal lidar beam. Weighted edges are calculated based on the Euclidean distance between nodes. Finally, individual trees are given by the shortest paths starting from the first slice of the graph, which is computed using a greedy algorithm.
Using nodes that belong to the same tree, SLOAM computes a cylinder model that estimates the diameter and growth direction of the tree trunk. To represent the ground, which is usually flat in small local regions in a forest, we sample the lowest non-tree points around the lidar sweep and fit a plane model . With these models, the robot can estimate its pose at each time step and create a semantic map with important information for timber inventory, such as diameter at breast height and number of trees.
The main problem of relying purely on semantic information is that the state estimation will also fail if object detection fails. We present improvements to the SLOAM framework to increase stability and robustness for large-scale autonomy.
Iii-B3 VIO Integration
vio will drift under the forest canopy because of light variation, and the movement of underbrush and branches, where most of the features are detected. However, a robust VIO algorithm with high quality sensors should still provide pose estimates that are consistent in short time intervals. On the other hand, SLOAM may fail if not enough trees are detected to constrain the pose estimation.
We use S-MSCKF [sun2018robust] as an initial guess to the pose estimation problem and run SLOAM to correct the drift over longer intervals. When the VIO odometry estimates at least meter translation movement, a new keyframe is created. For a keyframe , SLOAM will detect trees, perform data association, estimate a pose and update the semantic map. This integration is especially valuable in the context of the entire autonomy stack presented in this work. It is essential to save computational resources to enable the whole system to run in real time.
Since SLOAM’s data association is based on nearest neighbor matching, if the motion difference between two keyframe poses is large, the association may fail or give false positive matches. We can use the relative motion estimated with VIO to initialize SLOAM, and perform data association reliably.
Due to drift, the VIO pose and SLOAM pose may be different. To provide SLOAM with an initial guess, at every keyframe , we store a tuple of poses . The initial guess of relative motion between keyframes estimated by the VIO can be computed by
can then be combined with the previous SLOAM pose to be used by SLOAM
Let the semantic map with landmarks be the set of all landmarks detected by SLOAM over the robot trajectory. For a new observation, we extract a submap by selecting from the semantic map a set of trees that are close to the current estimated position of the robot based on a distance threshold . That is
We store references to each tree in a KD-Tree indexed by their 2D position in the map coordinate frame to query the semantic map efficiently. SLOAM receives the submap , the initial guess , and a lidar reading in the robot frame. In this formulation, SLOAM will estimate by performing data association between the current lidar reading and the submap, which is analogous to the mapping step of the original formulation.
Iii-B4 Lidar State Estimation
Let be the lidar observation at the current time . To perform nearest-neighbor data association between the submap given an arbitrary robot motion, we transform using the initial guess . All features and models from this new observation will already be in the initial guess frame.
Given a set of ground features from the transformed lidar observation and the previous plane model , the optimization objective is to find the transformation that minimizes the point-to-plane distance . Since this plane will always be below the robot, it can constrain the Z, pitch, and roll motion. The cost function is given by
where is a transformation in SE
(3) with three degrees of freedom (translation X, Y and Yaw are fixed).
Given a set of tree models of size . For each cylinder , let be the set of tree feature points of size associated with the landmark. The features in are associated to the nearest cylinder from the submap . The optimization objective is to minimize the point to cylinder distance of the associated features. Since the robot may not observe the entire tree trunk, the cylinder models are not constrained in the Z direction. For this reason, these models can constrain the X, Y, and Yaw motion. The cost function is given by
where is a transformation in SE(3) transform with three degrees of freedom (translation Z, Pitch, and Roll are fixed). Finally, the SLOAM pose is given by
In the original SLOAM formulation, both ground and tree costs were solved jointly and all 6 degrees of freedom were estimated together with non-linear least squares. By splitting the problem, we can correct only the parts of the initial guess where we have enough matches to provide a good refinement. That is, if the number of ground features is low, we can set , where is the identity transform. If not enough tree features are matched, we can set . Consequently, if both ground and tree features are not reliable, SLOAM will just use the initial guess to update the map.
Iii-C1 Semantic Mapping
Given a new pose estimate SLOAM will project the models derived from the current sweep, and associate cylinder models to the ones in the submap using a cylinder-to-cylinder distance. These associations update the tree models to the latest estimate, and cylinders that do not have a match are added to the map.
Iii-C2 Semantic segmentation of trees from lidar point clouds
Since SLOAM required every lidar reading to be segmented to compute lidar odometry, it uses a simplified version of ERFnet [romera2017erfnet] to perform real-time semantic segmentation on the CPU. The input point cloud is transformed into a range image using a spherical projection and treated as a regular one-channel image for segmentation. However, the segmentation would not be smooth in many cases and would consider leaves and small branches as part of the trunk.
SLOAM addresses this problem with the Trellis Graph instance detection. However, it is challenging to completely filter these unwanted points that add noise to the cylinder model estimation and consequently to the state estimation.
For this reason, in our experiments we use RangeNet++ [milioto2019iros] with a simplified DarkNet [redmon2018yolov3] as the network backbone and decoder. To improve segmentation accuracy, we first train the model using lidar scans and ground truth segmentation provided by our simulator. The model is then fine-tuned on real-world data collected using the same uav with manual flights in a real forest.
Iii-D Planning and Control
Our previous efforts [mohta2018fast] focused on fast autonomous flight in cluttered and GPS-denied environments. However, to autonomously acquire data in dense under-canopy environments that are more cluttered, unstructured, and significantly larger in scale, various challenges need to be overcome in the autonomy stack: (1) Since the goal is to gather high-quality data on objects of interest, a mechanism of extracting and storing semantic information should be incorporated. (2) Data storage and computation need to be reduced. (3) The planning and control algorithms need to adapt to the drift correction from the semantic SLAM module in real time.
We address challenges (1) (2) by introducing a hierarchical planning and mapping framework, and we address challenge (3) by introducing a real-time drift compensation module.
Iii-D1 Hierarchical planning and mapping framework
A hierarchical mapping and planning framework is introduced, as illustrated in fig. 4. Our top-level map consists of a set of semantic models that enable us to accurately localize the robot over long distances and keep track of the information of interest in real time. This map is sparse and thus storage efficient, as explained in fig. 10.
Our mid-level map is a large but coarsely discretized voxel map, and our low-level map is a small but finely discretized, robot-centric voxel map. The top and mid-level maps are updated at a lower frequency (2 Hz) to reduce computation, while the low-level map is updated at the same frequency as the lidar (10Hz) to maximize safety.
The global planner uses the jps algorithm to generate collision-free and long-horizon paths to the global goal in the mid-level voxel map. The local planner uses a motion-primitive-based algorithm [liu2018search] to generate a collision-free, dynamically-feasible trajectory to the local goal in the low-level voxel map. The local goal is defined as the first intersection of the global path and local voxel map’s boundaries. The mid-level mapper has large ray-cast range so that the global planner is non-myopic, while the low-level mapper has a ray-cast range equal to local map’s dimension and is thus computationally efficient.
This design significantly reduces the data storage and computation demand due to the sparsity of the top and mid-level maps. Meanwhile, it does not sacrifice optimality and safety due to the fine discretization of the local mapper. The top-level map can be further used for other purposes such as planning the high-level informative path that minimizes the uncertainty about object geometries and semantics.
Iii-D2 Real-time drift compensation module
A brute-force way of correcting the drift induced by vio is to feed the semantic SLAM estimated uav pose into the controller or ukf. In this setup, the controller’s desired position remains on the existing trajectory calculated based on the drifted odometry, while the estimated poses given to the controller are drift-corrected. However, jumps in pose estimates can happen when the uav revisits a place since semantic SLAM is performing global data association and optimization. Meanwhile, even if the local replan is requested whenever such jumps are detected, there will be a delay before the re-planned local trajectory becomes available. Since the control loop is running at a higher rate (200 Hz), there will be a jump between the desired pose (from evaluating the trajectory) and estimated pose (from semantic SLAM or ukf). The controller will thus output aggressive and usually dynamically infeasible commands, making the uav unstable or even crash.
A more stable solution to feeding the drift correction into the planning and control loop is to maintain two reference frames: SLAM and odometry reference frames. Whenever a semantic SLAM pose is received, the drift is calculated. The inverse of the drift, which is the transformation from the semantic SLAM estimated pose to the ukf estimated pose, is applied on the odometry reference frame.
The global goal lies in the SLAM reference frame and is transformed into the odometry reference frame. Upon replan, the global planner will output a global path that is drift compensated. This correction leads the local planner to generate a drift-compensated local trajectory since the local goal is obtained based on the drift-compensated global path. Thus, the planned trajectory and control commands will guide the uav to execute its mission accurately.
Iv Results and Analysis
Our system was first developed and tested in simulated environments and then demonstrated and improved in real-world forests. For simulation experiments, we used Unity3D as the simulation engine due to its high fidelity sensor data. This is especially important for testing our perception modules, including vio and lidar-based semantic segmentation. We chose the Wharton State Forest for real-world experiments, a natural forest located in the U.S. state of New Jersey. It consists of approximately 122,880 acres of the Pinelands as shown in fig. 2
. There is a significant variance in tree densities, sizes, shapes, and undergrowth conditions, thus making this forest ideal for verifying our system’s robustness and generalization ability.
Iv-a Simulation experiments and analysis
We set up an automatic pipeline that allows us to generate many simulation environments based on real-world tree positions that we collected during field trips. We use various high-fidelity tree models that are available in the Unity store111https://assetstore.unity.com and customize their sizes, shapes, orientation, as well as undergrowth conditions to replicate the real-world forests.
To verify the reliability and robustness of our system, we ran a series of experiments on ten simulated forests with different densities, as shown in fig. 5. All simulation experiments were performed under the same navigation stack setup, with the maximum velocity set to 11.0 m/s. Four missions with a combined length of 1 km were sent to the robot for each environment. The average velocity, distance traveled, and mission execution time for each mission are recorded and compared in fig. 6 and fig. 7.
From fig. 6, there’s a clear trend showing that the average velocity decreases with increasing tree density. Fig. 7 shows that the distance traveled by the uav increases with tree density, which is expected since the robot has to adapt its path frequently to avoid obstacles. In addition to the decrease of average velocity, the mission execution time rises dramatically with higher tree density.
These experiments indicate that the difficulty of accomplishing flights with a limited battery life increases rapidly with the density of environment. Moreover, under a dense canopy, the uav’s average velocity is limited by the environment instead of its maximum design velocity. Thus, flight time is the dominant factor for the mission range. That is also supporting evidence for our choice of Li-ion battery over Li-Po battery because the former has higher specific energy (Wh/kg).
Iv-B Real-world experiments and analysis
We have performed various flight experiments in different parts of the Wharton State Forest. An important prerequisite is to train the segmentation neural network. We trained first on the simulator data and then fine-tuned on real-world data. It is worth noting that we did not use any labeled data from the Wharton State Forest. Instead, we used data collected in a different pine forest in Arkansas, US. We labeled 50 lidar scans to fine-tune the model, 37 being for training and 13 for validation. An example of the segmentation result is shown in fig. 4. The model is able to generalize well and can run at 2 Hz on the uav’s onboard computer, the Intel NUC 10, which is sufficient since semantic segmentation is performed only on keyframes, while using 3 of 12 threads of the CPU.
In the first field experiment, our goal was to demonstrate that relying only on vio will lead to failure to complete the mission. As illustrated in fig. 8, the autonomy stack relied only on vio to control the uav. The objective was to perform two square loops of 800 m and return home. According to vio, the robot was successful, but according to semantic SLAM and GPS, the vio drifted significantly (60 m). Semantic SLAM and GPS agreed with each other within a 1020 m margin. It’s worth noting that while we obtained a GPS lock, the sensor reported m standard deviation for X and Y position estimates. For this reason, in the forest, GPS can be used as a high-level reference to the robot’s position. Still, it is not guaranteed to be available nor that the measurements will be correct. One of the possible reasons that vio drifted significantly in this experiment is that the uav’s motion is very smooth and close to a constant velocity state, which might have formed degenerate cases for vio [yang2019degenerate-vio-degenerate].
In the second field experiment, our goal was to demonstrate that the vio drift can be corrected with semantic SLAM. In fig. 9 we illustrate the scale of the long-range trajectories performed with the robot. For this flight, the uav trajectory length is an over 1 km loop, covering most of a 200 area while detecting and modeling trees. For this experiment, the GPS was not able to get a lock. Therefore, we did not include it in the results.
In Table I we compare the accumulated drift based on the starting and ending positions of the pose estimated by vio and SLOAM + vio. Although Stereo vio outputs consistent local estimates, it is unreliable over long-range flights, especially along the Z direction. Directly using such estimates will lead the robot to deviate from the desired trajectory. SLOAM benefits from the local consistency of vio but can reduce the drift by using landmarks and the ground model. Most of the Z-axis drift is corrected by using our semantic lidar framework, and the XY drift is reduced.
|Method||XY Drift (m)||Z Drift (m)||Total Drift (m)|
|SLOAM + VIO||3.93||0.71||3.99|
Compared to the first field experiment, the uav commanded more angular and linear accelerations in the second field experiment. This leads to the improved performance of vio. However, from the perspective of gathering information, these motions usually lead to distorted or blurred sensor data, consequently decreasing the semantic map’s quality.
Navigating in forests is difficult, especially when the density is high. Even with the same planner configuration, the average speed, mission time and length will vary significantly. Therefore, the density of the environment can influence the uav’s behaviors, and we need a state estimator that is robust to those behaviors.
Traditional odometry systems are good at outputting smooth estimation over a moderate distance, but they are not as robust as semantic SLAM over a long duration or across different scenarios. Therefore, using semantics to compensate for the drift in vio gives us more robustness and reliability. Moreover, different sensors have different degenerate cases and failure modes. Forest-like environments aggravate this due to perception aliasing, excessive heat, humidity, sunlight, and dust. Therefore, sensor redundancy can increase reliability.
The global drift from SLOAM is minimal, but if one wants to further reduce drift, a sam framework with loop closure constraints such as the one proposed by LIO-SAM [liosam2020shan], combined with semantic mapping, can be employed. However, this formulation can significantly increase computation when loops are detected, which will impact the performance of the autonomy stack given the limited computational power on the uav. For this reason, the use of sam is more suitable for offline processing.
V Conclusion and Future Work
Generating semantic maps of forests and orchards is important to understanding the magnitude of the carbon challenge and to develop new strategies for precision agriculture. Semantic mapping requires under-canopy flight and measurements at the individual tree or plant level. In this paper, we described the key hardware and software components required for an autonomous Unmanned Aerial Vehicle to build km 1 km maps of unstructured forests in a GPS-challenged environment with varying illumination (sunlight, shadows) and wind, without any human intervention. These include the use of cameras, an IMU, and a lidar for state estimation; the segmentation of trees and ground planes using a lidar; and the real-time integration of the semantic map in the control and planning loop to minimize drift and to facilitate localization, mapping, and planning. To our knowledge, this is the first time that real-time semantic SLAM has been integrated into the feedback loop for autonomous uav navigation, and detailed, multi-resolution maps over a large area have been built in an automated fashion. Future work involves incorporating the semantic map to actively choose actions that reduces uncertainty of the information gathered by the system.