
AgriColMap: Aerial-Ground Collaborative 3D Mapping for Precision Farming

The combination of the aerial survey capabilities of Unmanned Aerial Vehicles with the targeted intervention abilities of agricultural Unmanned Ground Vehicles can significantly improve the effectiveness of robotic systems applied to precision agriculture. In this context, building and updating a common map of the field is an essential but challenging task. The maps built using robots of different types show differences in size, resolution, and scale, the associated geolocation data may be inaccurate and biased, and the repetitiveness of both the visual appearance and the geometric structures found within agricultural contexts renders classical map merging techniques ineffective. In this paper, we propose AgriColMap, a novel map registration pipeline that leverages a grid-based multi-modal environment representation which includes a vegetation index map and a Digital Surface Model. We cast the data association problem between maps built from UAVs and UGVs as a multi-modal, large displacement dense optical flow estimation. The dominant, coherent flows, selected using a voting scheme, are used as point-to-point correspondences to infer a preliminary non-rigid alignment between the maps. A final refinement is then performed by exploiting only meaningful parts of the registered maps. We evaluate our system using real-world data for three fields with different crop species. The results show that our method outperforms several state-of-the-art map registration and matching techniques by a large margin, and has a higher tolerance to large initial misalignments. We release an implementation of the proposed approach along with the acquired datasets with this paper.





Supplementary Material

The datasets and our C++ implementation are available at:

I Introduction

Cooperation between aerial and ground robots undoubtedly offers benefits to many applications, thanks to the complementary characteristics of these robots [25]. This is especially useful in robotic systems applied to precision agriculture, where the areas of interest are usually vast. A UAV allows rapid inspection of large areas [26] and can then share information, such as crop health or weed distribution, that indicates areas of interest to an agricultural UGV. The ground robot can operate for long periods of time, carry high payloads, and perform targeted actions, such as fertilizer application or selective weed treatment, on the areas selected by the UAV. The robots can also cooperate to generate 3D maps of the environment, e.g., annotated with parameters, such as crop density and weed pressure, suitable for supporting the farmer's decision making. The UAV can quickly provide a coarse reconstruction of a large area, which can be updated with more detailed, higher-resolution map portions generated by the UGV visiting selected areas.

Fig. 1: An overview of AgriColMap. Both the UGV and the UAV generate, using data gathered from their onboard cameras, colored point clouds of the cultivated field. The proposed method aims to accurately merge these maps by means of an affine transformation that registers the UGV submap (red rectangular area) into the UAV aerial map (blue rectangular area), taking into account possible scale discrepancies.

All the above applications assume that both UAVs and UGVs can share information using a unified environment model with centimeter-level accuracy, i.e., an accurate shared map of the field. There are two classes of methods designed to generate multi-robot environment representations: (i) multi-robot SLAM algorithms (e.g., [21, 18]), which concurrently build a single map by fusing raw measurements or small local maps generated by multiple robots; (ii) map registration algorithms (e.g., [3, 5]), which align and merge maps independently generated by each robot into a unified map. On the one hand, the lack of distinctive visual and 3D landmarks in an agricultural field, along with the difference in the robots' points-of-view (e.g., Fig. 2), prevents the direct employment of standard multi-robot SLAM pipelines, whether based on visual or geometric features. On the other hand, merging maps independently generated by UAVs and UGVs in an agricultural environment is also a complex task, since the maps are usually composed of similar, repetitive patterns that easily confuse conventional data association methods [17]. Furthermore, due to inaccuracies in the map building process, the merged maps are usually affected by local inconsistencies, missing data, occlusions, and global deformations such as directional scale errors, which negatively affect the performance of standard alignment methods. Geolocation information associated with (i) sensor readings or (ii) maps often cannot overcome the limitations of conventional methods in agricultural environments, since the location and orientation accuracy provided by standard reference sensors (GPSs and AHRSs [23]) is not sufficient to prevent such systems from converging towards sub-optimal solutions (see Sec. V).

Fig. 2: Pictures of the same portion of the field seen from the UAV point-of-view (left) and from the UGV point-of-view (right). The local crop arrangement geometry, such as the missing crop plants, is generally not visible from the UGV point-of-view. The yellow solid lines represent an example of manually annotated correct point matches. It is important to underline how complex obtaining correct data associations is, even from a human point-of-view. The fiducial markers on the field have been used to compute the ground truth alignments between the maps.

In this paper, we introduce AgriColMap, an Aerial-Ground Collaborative 3D Mapping pipeline, which provides an effective and robust solution to the cooperative mapping problem with heterogeneous robots, specifically designed for farming scenarios. We address this problem by proposing a non-rigid map registration strategy able to deal with maps with different resolutions, local inconsistencies, global deformations, and relatively large initial misalignments. We assume that both a UAV and a UGV can generate a colored, geotagged point cloud of a target farm environment (Fig. 1).

To solve the data association problem between the input point clouds, we propose a global, dense matching approach. The key intuition behind this choice is that points belonging to a cloud locally share similar "displacement vectors" that associate such points with points in the other cloud. Thus, by introducing a smoothness term (smoothness here relates to the displacement vectors of neighboring elements) in the dense, regularized matching, we penalize displacement discontinuities in each point's neighborhood. With this formulation, good correspondences are iteratively improved and spread through cooperative search among neighboring points.

This approach has been inspired by the large displacement optical flow (LDOF) problem in computer vision and, indeed, we cast our data association problem as an LDOF problem. To this end, we convert the colored point clouds into a better suited, multimodal environment representation that allows one to exploit two-dimensional approaches and to highlight both the semantic and the geometric properties of the target map. The former is represented by a vegetation index map, the latter by a Digital Surface Model (DSM). More specifically, we transform each input point cloud into a grid representation, where each cell stores (i) the Excess Green (ExG) index and (ii) the local surface height information (e.g., the height of the plants, soil, etc.). Then, we use the data provided by the GPS and the AHRS to extract an initial guess of the relative displacement and rotation between the grid maps to match. Hence, we compute a dense set of point-to-point correspondences between the matched maps, exploiting a modified, state-of-the-art LDOF system [22] tailored to the precision agriculture context. To adapt this algorithm to our environment representation, we propose a different cost function that involves both the ExG information and the local structure geometry around each cell. We select, using a voting scheme, the largest subset of correspondences with coherent, similar flows, to be used to infer a preliminary alignment transformation between the maps. In order to deal with directional scale errors, we use a non-rigid point-set registration algorithm to estimate an affine transformation. The final registration is obtained by performing a robust point-to-point registration over the input point clouds, pruned of all points that do not belong to vegetation. A schematic overview of the proposed approach is depicted in Fig. 3.

We report results from an exhaustive set of experiments (Sec. V) on data acquired by a UAV and a handheld camera, simulating the UGV, on crop fields in Eschikon, Switzerland. We show that the proposed approach is able to guarantee, with high probability, a correct registration for an initial translational error up to 5 meters, an initial heading misalignment up to 11.5 degrees, and a directional scale error of up to 30%. We found similar registration performance across fields with three different crop species, showing that the method generalizes well across different kinds of farms. We also report a comparison with state-of-the-art point-to-point registration and matching algorithms, showing that our approach outperforms them in all the experiments.

I-A Related Work

The field of multi-robot cooperative mapping is a recurrent and relevant problem in the literature and, as previously introduced, several solutions have been presented by means of either multi-robot SLAM algorithms or map merging/map registration strategies, in both 2D ([3, 4, 30]) and 3D ([5, 14, 24]) settings. Registration of point cloud based maps can also be considered an instance of the more general point set registration problem [9, 12]. In this work, we mainly review methods based on map registration, since the heterogeneity of the involved robots and the lack of distinctive visual and geometrical features in an agricultural environment prevent the employment of standard multi-robot SLAM methods; a comprehensive literature review of this class of methods can be found in [31].
Map registration is a challenging problem, especially when dealing with heterogeneous robots, where data is gathered from different points-of-view and with different noise characteristics. It has been intensively investigated, especially in the context of urban reconstruction with aerial and ground data. In [32], the authors focus on the problem of geo-registering ground-based multi-view stereo models by proposing a novel viewpoint-dependent matching method. Wang et al. [35] deal with aligning 3D structure-from-motion point clouds obtained from internet imagery with existing geographic information sources, such as noisy geotags from input Flickr photos and geotagged city models and images collected from Google Street View and Google Earth. Bódis-Szomorú et al. [6] propose to merge low-detail airborne point clouds with incomplete street-side point clouds by applying volumetric fusion based on a 3D tetrahedralization (3DT). Früh et al. [15] propose to use DSMs obtained from an airborne laser reconstruction to localize a ground vehicle equipped with 2D laser scanners and a digital camera; detailed ground-based facade models are then merged with a complementary airborne model. Michael et al. [33] propose a collaborative UAV-UGV mapping approach in earthquake-damaged contexts. They merge the point clouds generated by the two robots using a 3D Iterative Closest Point (ICP) algorithm, with an initial guess provided by the (known) UAV takeoff location; the authors assume that the environment is generally described by flat planes and vertical walls, the so-called "Manhattan world" assumption. The ICP algorithm has also been exploited in [13] and [19]. Forster et al. [13] align dense 3D maps obtained by a UGV equipped with an RGB-D camera and by a UAV running dense monocular reconstruction: they obtain the initial guess alignment between the maps by localizing the UAV with respect to the UGV with a Monte Carlo Localization method applied to height maps computed by the two robots. Hinzmann et al. [19] deal with the registration of dense lidar-based point clouds with sparse image-based point clouds by proposing a probabilistic data association approach that specifically takes the individual cloud densities into consideration. In [16], Gawel et al. present a registration procedure for matching lidar point-cloud maps and sparse vision keypoint maps by using structural descriptors.
Despite the extensive literature addressing the problem of map registration for heterogeneous robots, most of the proposed methods make strong context-based assumptions, such as the presence of structural or visual landmarks, "Manhattan world" assumptions, etc. Registering 3D maps in an agricultural setting is, in some respects, even more challenging: the environment is homogeneous, poorly structured, and usually gives rise to strong sensor aliasing. For these reasons, most of the approaches mentioned above cannot be directly applied to an agricultural scenario. Localization and mapping in agricultural scenarios is a topic that has recently been attracting great attention in the robotics community [36, 11, 23]. Most of these systems, however, deal with a single robot, and the problem of fusing maps built by multiple robots is usually not adequately addressed; only little, very recent research exists on this topic. Dong et al. [10] propose a spatio-temporal reconstruction framework for precision agriculture that aims to merge multiple 3D reconstructions of the same field across time. They use single-row reconstruction as a starting point for the data association, which is actually performed by using standard visual features. This method uses images acquired by a single UGV that moves in the same field at different times and, being based on visual features, cannot manage drastic viewpoint changes or large misalignments when matching aerial and ground maps. A local feature descriptor designed to deal with large viewpoint changes has been proposed by Chebrolu et al. in [8], where the almost static geometry of the crop arrangement in the field is exploited to build a descriptor that encodes the local plant arrangement geometry. Despite the promising results, this method suffers from the presence of occluded areas when switching between the UAV and the UGV points-of-view.

I-B Contributions

Our contributions are the following: (i) a map registration framework specifically designed for heterogeneous robots in an agricultural environment; (ii) to the best of our knowledge, the first application of LDOF-based data association to 3D map alignment; (iii) extensive performance evaluations that show the effectiveness of our approach; (iv) an open-source implementation of our method and three challenging datasets, with different crop species, with ground truth.

II Problem Statement and Assumptions

Fig. 3: Overview of the proposed approach. For visualization purposes, in columns 2, 7, and 8 we colored the UGV and UAV point clouds in blue and red, respectively, pruned of all points that do not belong to vegetation according to a thresholding operator applied to the ExG index. Starting from the left side, we show: (i) the input colored point clouds gathered by the UAV and UGV; (ii) the initial noisy and biased rigid alignment provided by the GPS and the AHRS; (iii) the generated multimodal grid maps; (iv) the initial LDOF data associations, i.e., the point-to-point correspondences, in yellow; (v) the "winning" data associations (flows), in green, selected by a voting scheme; (vi) the aligned point clouds according to the initial affine transform; (vii) the final non-rigid registration after the refinement step.

Given two 3D colored point clouds of a farmland (Fig. 3, first column), built from data gathered by a UAV and a UGV, respectively, our goal is to find a transformation that accurately aligns them. The two clouds can be generated, for instance, by applying off-the-shelf photogrammetry-based 3D reconstruction software to sequences of geotagged images. Our method makes the following assumptions:

  1. The input maps built from UAV and UGV data can have different spatial resolutions, but they refer to the same field, with some overlap between them;

  2. The data used to build the maps were acquired at approximately the same time;

  3. The maps are roughly geotagged, possibly with noisy locations and orientations;

  4. They can be affected by local inconsistencies, missing data, and deformations, such as directional scale errors.

  5. The aerial map is not affected by any scale inconsistencies.

Hypothesis 4) implies the violation of the typical rigid-body transformation assumption between the two maps: for this reason, we represent the sought transformation as an affine transformation that allows anisotropic (i.e., non-uniform) scaling between the maps. Hypothesis 5) is an acceptable assumption, since the map created by the UAV is usually wider than the one created by the UGV and is generated using less noisy GPS readings, so the scale drift effect tends to be canceled out: hence, we look for a transformation that aligns the ground map with the aerial map by correcting the scale errors of the former with respect to the latter.

III Data Association

In order to estimate the transformation that aligns the two maps, we need to find a set of point correspondences between them that represent point pairs belonging to the same global 3D position. As introduced before and shown in the experiments (see Sec. V), conventional sparse matching approaches based on local descriptors are unlikely to provide effective results, due to the large number of repetitive and non-distinctive patterns spread over farmlands. Instead, inspired by the fact that, when the maps are misaligned, points in the ground map locally share a coherent "flow" towards corresponding points in the aerial map, our method casts the data association estimation problem as a dense, regularized matching problem. This problem resembles the dense optical flow estimation problem for RGB images: in this context, global methods (e.g., [20]) aim to build correspondences pixel by pixel between a pair of images by minimizing a cost function that, for each pixel, involves a data term that measures the point-wise similarity and a regularization term that fosters smoothness between nearby flows (i.e., nearby pixel-to-pixel associations).

III-A Multimodal Grid Map

Our goal is to estimate the alignment by computing a "dense flow" that, given an initial, noisy alignment between the maps provided by a GPS and an AHRS (Fig. 3, second column), associates points in one map with points in the other. Unfortunately, conventional methods designed for RGB images are not directly applicable to colored point clouds: we introduce here a multimodal environment representation that allows us to exploit such methods while enhancing both the semantic and the geometrical properties of the target map. A cultivated field is basically a globally flat surface populated by plants. A DSM (a raster representation of the height of the objects on a surface) can well approximate the field structure geometry, while a vegetation index can highlight the meaningful parts of the field and the relevant visual patterns: in our environment representation, we exploit both these intuitions. We generate a DSM from the point cloud; for each cell of the DSM grid, we also provide an ExG index that, starting from the RGB values, highlights the amount of vegetation. More specifically, we transform a colored point cloud into a two-dimensional grid map (Fig. 3, third column), where each cell stores the surface height and the ExG index, with the following procedure:

  1. We select a rectangle that bounds the target area by means of minimum-maximum latitude and longitude;

  2. The selected area is discretized into a grid map of cells with a fixed metric step, so that each cell represents a small square patch of the field. Each cell is initialized with empty (height, ExG) pairs.

  3. Since the cloud is geotagged (see Sec. II), we can associate each of its 3D points with one cell of the grid.

  4. For each cell with at least one associated 3D point: (a) we compute the height as the weighted average of the z coordinates of the 3D points that belong to that cell; (b) we compute the ExG index as the weighted average of the ExG indexes of the same points, where for each point:

     ExG = 2g - r - b,  with  r = R/(R+G+B),  g = G/(R+G+B),  b = B/(R+G+B)

     the normalized RGB components of the point. Both averages use as weighting factor a circular, bivariate Gaussian distribution centered at the cell: points with coordinates close to the center of the cell get a higher weight.
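The grid construction above can be sketched in a minimal NumPy form; the function name, cell size, and Gaussian width below are illustrative choices, not the paper's parameter values:

```python
import numpy as np

def rasterize_cloud(points, colors, cell_size=0.05, sigma=0.025):
    """Rasterize a colored point cloud (N x 3 XYZ, N x 3 RGB in [0, 1])
    into a 2D grid storing a Gaussian-weighted average of point height
    and Excess Green (ExG) index per cell."""
    # Normalized chromaticity, then ExG = 2g - r - b per point.
    s = colors.sum(axis=1, keepdims=True) + 1e-9
    r, g, b = (colors / s).T
    exg = 2.0 * g - r - b

    # Assign every point to a grid cell of size cell_size.
    mins = points[:, :2].min(axis=0)
    idx = np.floor((points[:, :2] - mins) / cell_size).astype(int)
    shape = idx.max(axis=0) + 1

    height = np.zeros(shape); exg_map = np.zeros(shape); wsum = np.zeros(shape)
    centers = (idx + 0.5) * cell_size + mins
    # Circular bivariate Gaussian weight: points near the cell center count more.
    d2 = ((points[:, :2] - centers) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / sigma ** 2)
    np.add.at(wsum, (idx[:, 0], idx[:, 1]), w)
    np.add.at(height, (idx[:, 0], idx[:, 1]), w * points[:, 2])
    np.add.at(exg_map, (idx[:, 0], idx[:, 1]), w * exg)
    valid = wsum > 0
    height[valid] /= wsum[valid]
    exg_map[valid] /= wsum[valid]
    return height, exg_map
```

With a single point per cell the weight cancels out, so the cell simply stores that point's height and ExG value.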

III-B Multimodal LDOF

We generate the corresponding multimodal representations from both the aerial and the ground point clouds. In the ideal case, with perfect geotags and no map deformations, a simple geotagged superimposition of the two maps should provide a perfect alignment: the "flow" that associates cells between the two maps should be zero. Unfortunately, in the real case, due to the inaccuracies of both the geotags and the 3D reconstruction, non-zero, potentially large displacements are introduced in the associations. These offsets are locally consistent but not constant across cells, due to the reconstruction errors. To estimate the offsets map, we employ a modified version of the CPM framework described in [22]. CPM is a recent LDOF system that provides cutting-edge estimation results even in the presence of very large displacements, and is more efficient than other state-of-the-art methods with similar accuracy.
For efficiency, CPM looks for the best correspondences of a set of seeds, which are then refined by means of a dense, iterative neighborhood propagation: the seeds are points regularly distributed within the image. Given two images and a collection of seeds, the goal of this framework is to determine, for each seed, the flow that maps the seed's position in the first image to its corresponding matching position in the second image. The flow of each seed is computed by a coarse-to-fine random search strategy that minimizes a patch-based cost function, where the cost measures the dissimilarity between the patch centered at the seed in the first image and the patch centered at the candidate matching position in the second image. For a comprehensive description of the flow estimation pipeline, we refer the reader to [22].

Our goal is to use the CPM algorithm to compute the flow between the two grid maps. To exploit the full information provided by our grid maps (see Sec. III-A), we modified the CPM matching cost to take into account both the height and ExG channels, splitting the cost function into two terms:


The first is the DAISY [34] based match cost, as in the original CPM algorithm: in our case, the DAISY descriptors have been computed from the ExG channels of the two grid maps. The second is a match cost computed using the height channel. We chose the Fast Point Feature Histogram (FPFH) [29] descriptor for this second term: FPFH descriptors are robust multi-dimensional features that describe the local geometry of a point cloud; in our case, they are computed from the organized point cloud (i.e., a cloud that resembles a matrix-like structure) generated from the height channels of the two grid maps. The two terms are combined through a pair of weighting factors. As in [22], the patch-based matching cost is chosen to be the sum of the absolute differences over all the 128 and 32 dimensions of the DAISY and FPFH descriptors, respectively, at the matching points. The proposed cost function thus takes into account both the visual appearance and the local 3D structure of the plants.
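A sketch of the combined matching cost, assuming the per-cell DAISY (128-D, ExG channel) and FPFH (32-D, height channel) descriptor grids have already been computed; the helper name and the weight values are hypothetical, since the paper's weighting factors are not given here:

```python
import numpy as np

def match_cost(daisy_a, daisy_b, fpfh_a, fpfh_b, pa, pb, w_exg=1.0, w_h=0.5):
    """Combined CPM-style match cost between cell pa in map A and cell pb
    in map B: weighted sum of absolute descriptor differences over the
    128-D DAISY (visual appearance) and 32-D FPFH (local 3D structure)
    descriptors at the two cells."""
    c_exg = np.abs(daisy_a[pa] - daisy_b[pb]).sum()  # visual appearance term
    c_h = np.abs(fpfh_a[pa] - fpfh_b[pb]).sum()      # local geometry term
    return w_exg * c_exg + w_h * c_h
```

The cost is zero only when both descriptors match exactly, so a cell pair that agrees visually but differs geometrically (or vice versa) is still penalized.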
Once we have computed the dense flow between the two grid maps (Fig. 3, fourth column), we extract the largest set of coherent flows by employing a voting scheme inspired by the classical Hough transform, with a fixed discretization step; these flows define a set of point-to-point matches that will be used to infer a preliminary alignment (Fig. 3, fifth column).
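The voting step can be sketched as follows (hypothetical helper, assuming 2D flow vectors and a fixed bin size):

```python
import numpy as np

def dominant_flows(flows, step=0.5):
    """Hough-like voting: discretize 2D flow vectors (N x 2) into bins of
    size `step` and return the indices of the flows falling in the most
    voted bin, i.e. the largest set of coherent point-to-point matches."""
    bins = np.floor(flows / step).astype(int)
    # One vote per flow for its discretized (dx, dy) cell.
    uniq, inv, counts = np.unique(bins, axis=0,
                                  return_inverse=True, return_counts=True)
    winner = counts.argmax()
    return np.flatnonzero(inv.ravel() == winner)
```

Incoherent outlier flows land in sparsely populated bins and are discarded, leaving only the dominant, locally consistent displacement.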

IV Non-Rigid Registration

Fig. 4: Average successful registration rate curves obtained by varying the initial guess and the initial scale error: (i) from left to right, the initial scale error is incrementally increased; (ii) in each plot within the upper row, the initial heading error is kept fixed, while the initial translational misalignment is incrementally increased; (iii) in the lower row, the initial heading error is incrementally increased, while the initial translational misalignment is kept constant. It is important to point out that the successful registration rate of the Go-ICP [37] method is only reported for the cases without an initial scale error, since this approach only deals with rigid transformations. For AgriColMap, we report the different results obtained on each dataset (sb: Soybean, sg10: Sugarbeet 10m, sg20: Sugarbeet 20m, ww: Winter Wheat).

The estimation of the non-rigid transformation between the maps is addressed in two steps. A preliminary affine transformation is computed by solving a non-rigid registration problem with known point-to-point correspondences: we minimize the sum of the squared distances between corresponding points (Fig. 3, sixth column),

argmin_{R, t, s} Σ_i || diag(s) R p_i + t - q_i ||²

where R is a rotation matrix, t a translation vector, s a scaling vector, and the sum runs over the selected point-to-point correspondences (p_i, q_i). To estimate the final registration, we first select from the two input colored point clouds the subsets that include only points belonging to vegetation. The selection is performed by using an ExG-based thresholding operator. This operation enhances the morphological information of the vegetation, while reducing the size of the point clouds to be registered. We finally estimate the target affine transformation by running the Coherent Point Drift (CPD) [27] point set registration algorithm over the pruned point clouds, using the preliminary affine transformation as the initial guess.
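A minimal sketch of fitting the preliminary affine transformation from the selected correspondences via least squares; this is a generic unconstrained-affine formulation (which subsumes rotation, translation, and anisotropic scaling), not the paper's exact parameterization or solver:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform (3x3 matrix A, translation t)
    minimizing sum ||A p + t - q||^2 over corresponding points src -> dst.
    A general affine matrix admits the anisotropic scaling required by
    the directional scale errors discussed in the paper."""
    n = src.shape[0]
    homog = np.hstack([src, np.ones((n, 1))])        # rows [p | 1]
    # Solve homog @ sol ~= dst; sol stacks A^T on top of t.
    sol, *_ = np.linalg.lstsq(homog, dst, rcond=None)
    A, t = sol[:3].T, sol[3]
    return A, t
```

With exact correspondences the true transform is recovered; with the voted (noisy) matches the same fit yields the initial guess later refined by the point set registration step.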

V Experiments

In order to analyze the performance of our system, we acquired datasets on fields of three different crop types in Eschikon (Switzerland): soybean, sugarbeet, and winter wheat. For each crop species we collected: (i) one sequence of GPS/IMU-tagged images over the entire field from a UAV flying at 10 meters altitude; (ii) 4-6 sequences of GPS/IMU-tagged images of small portions of the field from a UGV point-of-view. Additionally, for the sugarbeet field, we acquired a further aerial sequence of images from 20 meters altitude. More comprehensive details regarding the acquired datasets are reported in Table II.

The UAV datasets were acquired using a DJI Mavic Pro UAV equipped with a 12 MP color camera, while the UGV datasets were acquired by moving the same camera by hand with a forward-looking point-of-view, simulating data acquisition by a ground robot. The collected images are first converted into 3D colored point clouds using Pix4D Mapper [1], a professional photogrammetry software suite, and are then aligned using the proposed registration approach. To analyze the performance of the proposed approach, we make use of the following error metrics:


The translational error is the distance between the estimated and ground-truth translations, the rotational error is the heading difference, and the scale error is obtained from the element-wise division of the estimated scale vector by the ground-truth one. We report the AgriColMap parameters used in all the experiments in Tab. I.
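Plausible forms of these metrics could look like the following sketch; the exact formulas are assumptions where the paper's equations are elided:

```python
import numpy as np

def registration_errors(t_est, t_gt, yaw_est, yaw_gt, s_est, s_gt):
    """Illustrative error metrics in the spirit of the paper:
    translational distance, absolute (wrapped) heading difference, and a
    scale error built from the element-wise ratio of the scale vectors."""
    e_t = np.linalg.norm(t_est - t_gt)                            # meters
    e_r = abs((yaw_est - yaw_gt + np.pi) % (2 * np.pi) - np.pi)   # radians
    e_s = np.linalg.norm(s_est / s_gt - 1.0)   # element-wise division
    return e_t, e_r, e_s
```

The heading difference is wrapped into (-pi, pi] so that near-opposite angles such as 3.1 and -3.1 rad produce a small error rather than almost 2*pi.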

TABLE I: Parameter set
Crop Type Name # Images Crop Size (avg.) Global Scale Error Recording Height (approx.)
Soybean sugv A 16 6 cm 1 m
sugv B 19 6 cm 1 m
sugv C 22 6 cm 1 m
suav 89 6 cm 10 m
Sugar Beet sbugv A 25 5 cm 1 m
sbugv B 26 5 cm 1 m
sbugv C 27 5 cm 1 m
sbuav A 213 5 cm 10 m
sbuav B 96 5 cm 20 m
Winter Wheat wwugv A 59 25 cm 1 m
wwugv B 61 25 cm 1 m
wwuav 108 25 cm 10 m
TABLE II: Overview of the datasets: the global scale error is, in general, bigger in the UGV datasets, since the camera is carried by hand and therefore some GPS satellite signals might not be received.

V-A Performance Under Noisy Initial Guess

This experiment is designed to show the robustness of the proposed approach under different noise conditions affecting the initial guess, and different directional scale discrepancies. For each UGV point cloud, we estimate an accurate ground truth non-rigid transform by manually selecting the correct point-to-point correspondences with the related UAV cloud. We generate random initial alignments between the maps by manually adding noise, with different orders of magnitude, to the ground truth alignment heading, translation, and scale. Then, we align the clouds from the sampled initial alignments by using (i) the proposed approach, (ii) a standard non-rigid ICP, (iii) the Coherent Point Drift (CPD) method [27], (iv) the state-of-the-art Go-ICP [37], and standard sparse visual feature matching approaches [2, 28, 7], applied as a data association front-end to our method in place of the proposed LDOF-based data association (Sec. III-B): in the latter cases, we exploit only the ExG channel of the grid maps (Sec. III-A). An alignment is considered valid if the residual translational, rotational, and scale errors fall below fixed thresholds.
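The perturbation sampling of this experiment could be reproduced along these lines (hypothetical helper mirroring the setup described above, not code from the paper):

```python
import numpy as np

def sample_initial_guess(rng, t_mag, yaw_mag, scale_mag):
    """Sample a perturbed initial alignment for the robustness study:
    a random planar translation of norm t_mag (meters), a heading offset
    of magnitude yaw_mag (radians), and an anisotropic scale within
    +/- scale_mag per axis (e.g. 0.3 for the 30% case)."""
    theta = rng.uniform(0, 2 * np.pi)
    t = t_mag * np.array([np.cos(theta), np.sin(theta), 0.0])
    yaw = rng.choice([-1.0, 1.0]) * yaw_mag
    scale = 1.0 + rng.uniform(-scale_mag, scale_mag, size=3)
    return t, yaw, scale
```

Sweeping t_mag, yaw_mag, and scale_mag over increasing magnitudes yields the grid of noise conditions whose success rates Fig. 4 reports.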

crop type | approach | registration err. (trans/rot/scale) at initial scale errors of 0%, 5%, 10%, 15%, 20%, 25%, and 30%
Soybean AgriColMap
ICP fail fail fail fail
CPD [27] fail fail fail
Go-ICP [37] - - - - - -
SURF [2] fail fail fail fail
ORB [28] fail fail fail fail
FAST+BRIEF [7] fail fail fail fail
Sugarbeet 10m AgriColMap
ICP fail fail fail fail
CPD [27] fail fail fail
Go-ICP [37] - - - - - -
SURF [2] fail fail fail fail
ORB [28] fail fail fail fail
FAST+BRIEF [7] fail fail fail fail
Sugarbeet 20m AgriColMap
ICP fail fail fail fail fail
CPD [27] fail fail fail
Go-ICP [37] - - - - - -
SURF [2] fail fail fail fail
ORB [28] fail fail fail fail
FAST+BRIEF [7] fail fail fail fail
Winter Wheat AgriColMap
ICP fail fail fail fail
CPD [27] fail fail fail
Go-ICP [37] - - - - - -
SURF [2] fail fail fail fail
ORB [28] fail fail fail fail
FAST+BRIEF [7] fail fail fail fail
TABLE III: Registration accuracy comparison among the proposed approach, the non-rigid ICP, the CPD [27], and the Go-ICP [37] systems. The table reports, for each cell, the average accuracy among all the successful registrations with a specific initial anisotropic scaling error.

The results are illustrated in Fig. 4. The proposed approach significantly outperforms the other approaches, ensuring a nearly perfect registration success rate for small scale errors and a high probability of success even for the largest tested scale error of 30%. The ICP-based registration methods [27, 37], due to the absence of structural 3D features on the fields, fall into local minima with high probability. The closest methods, in terms of robustness, are those based on local feature matching [2, 28, 7], which succeed in the registration procedure up to moderate scale error magnitudes. While analyzing the results, however, we verified that, unlike our method, these methods produce a larger number of wrong, incoherent point associations; this problem becomes clearly visible for scale deformations above 20% and rotations above 0.1 radians. The superior robustness is also confirmed for noisy initial guesses: unlike the other methods, our approach guarantees a high successful registration rate for a translational error of up to 5 meters and an initial heading error of up to 11.5 degrees, enabling it to deal with most errors coming from a GPS or AHRS sensor. Our method generalizes well over the different datasets, showing the capability to deal with different crop species, crop growth stages (i.e., the winter wheat crop is in an advanced growth stage compared to the soybean and sugarbeet), soil conditions, and point cloud resolutions (from different UAV altitudes).

In Table IV, we report a comparison of the inlier percentages obtained when using both the visual (i.e., the ExG) and the geometric term, or just a single term, in the cost function of Eq. (3). It is clear that most of the information is carried by the visual term. It is noteworthy that, even though the geometric term alone is not able to provide valid results, when combined with the visual term it increases the inlier percentage quite significantly, especially for the sugarbeet dataset, compared with using the visual information alone. In such cases, the geometric term acts as an outlier rejection term, slightly improving the robustness of the registration procedure.
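To make the interplay of the two data terms concrete, the following sketch combines a visual ExG distance with a geometric height distance into a single per-cell matching cost. The cell layout, weights, and function names are illustrative assumptions, not the actual cost of Eq. (3):

```python
def exg(r, g, b):
    """Excess Green vegetation index for an RGB triple."""
    return 2.0 * g - r - b

def cell_cost(cell_a, cell_b, w_visual=1.0, w_geom=1.0):
    """Combined matching cost between two grid cells, each given as
    (r, g, b, height). The visual term compares ExG values, the
    geometric term compares surface heights; setting a weight to
    zero disables the corresponding term."""
    visual = abs(exg(*cell_a[:3]) - exg(*cell_b[:3]))
    geom = abs(cell_a[3] - cell_b[3])
    return w_visual * visual + w_geom * geom
```

With `w_geom=0.0` the cost degenerates to the visual-only variant evaluated in Table IV, and vice versa.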

Descriptor Type (% inliers)
Crop Type ExG Depth ExG + Depth
Winter Wheat
TABLE IV: Inlier percentage comparison when changing the data terms in the LDOF cost function.

V-B Accuracy Evaluation

The second experiment is designed to evaluate the accuracy of the proposed registration approach. To this end, we compare our results with the ground truth parameters and, by using all the successful registrations, we compute the average accuracy for each crop type and approach. The results are summarized in Tab. III, and are sorted in increasing order of initial scale error.
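One simple way to turn ground-truth parameters into a scalar accuracy figure, in the spirit of this evaluation, is to compare where the estimated and ground-truth transforms send the same points. The function name and the callable-transform interface are our own illustrative choices, not the paper's metric:

```python
import math

def registration_rmse(points, est_transform, gt_transform):
    """RMSE between each point mapped by the estimated transform
    and by the ground-truth transform; 0 means a perfect
    registration. Transforms are callables mapping a 3D point
    (tuple) to a 3D point."""
    total = 0.0
    for p in points:
        pe = est_transform(p)
        pg = gt_transform(p)
        total += sum((a - b) ** 2 for a, b in zip(pe, pg))
    return math.sqrt(total / len(points))

# A pure 1 m translation error along x yields an RMSE of exactly 1.0.
err = registration_rmse([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)],
                        lambda p: p,
                        lambda p: (p[0] + 1.0, p[1], p[2]))
```

Averaging such a residual over all successful registrations gives a per-method, per-crop accuracy figure comparable across initial scale errors.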

Fig. 5: Qualitative registration results seen from aerial (left) and ground points of view. In the former, the UGV clouds are indistinguishable from the UAV cloud, confirming the correctness of the registration. Conversely, in the latter, the UGV clouds are clearly visible due to their higher point density.

On average, our method yields a lower registration error than all the other evaluated methods for the same scale error. The difference in registration error is even more pronounced when comparing the Sugarbeet 10m and Sugarbeet 20m datasets. Indeed, due to the higher sparsity of the points in the latter, all the other methods tend to perform slightly worse than they do on Sugarbeet 10m. Conversely, our method yields almost the same registration error magnitudes, showing that it correctly deals with different densities of the initial colored point clouds. We also report some qualitative results in Fig. 5.

V-C Runtime Evaluation

We recorded the average, maximum, and minimum computational time for all tested methods over 100 successful registrations, reporting these values in Tab. V. The method requiring the biggest computational effort is GoICP. The proposed approach requires less than half the computational time of GoICP, but turns out to be quite slow compared to the custom-built ICP, to CPD, and to all the common sparse feature detection and matching approaches. Fig. 6 shows the runtime percentages for the proposed approach. The biggest share of the computational effort is spent extracting the geometric features (i.e., the FPFH features [29]), meaning that the total computational time might be reduced by switching to a less time consuming 3D feature or by solely using the visual term.
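The min/max/average statistics reported here can be gathered with a simple timing harness like the following. This is a generic sketch, not the actual evaluation code:

```python
import time
import statistics

def benchmark(fn, trials=100):
    """Run `fn` `trials` times and return (min, max, mean) of the
    per-trial wall-clock runtimes in seconds."""
    times = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times), max(times), statistics.mean(times)
```

For a fair comparison, only successful registrations would be timed, as done for the 100 runs behind Tab. V.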

Fig. 6: Average percentage of total runtime for different parts of the AgriColMap pipeline.

Runtime [sec]      Min    Max    Avg
AgriColMap        63.7  118.6   79.8
ICP                2.1   10.6    4.5
CPD                4.9   23.2    8.2
GoICP              5.3  689.2  193.1
SURF [2]           4.6    7.2    5.3
ORB [28]           3.9    6.7    4.8
FAST+BRIEF [7]     3.7    6.4    4.5
TABLE V: Runtime comparison.

VI Conclusions

In this paper we addressed the cooperative UAV-UGV environment reconstruction problem in agricultural scenarios by proposing an effective way to align 3D maps acquired from aerial and ground robots. Our approach is built upon a multimodal environment representation that exploits both the semantics and the geometry of the target field, and on a data association strategy cast as a large displacement optical flow (LDOF) problem adapted to the agricultural context. We reported a comprehensive set of experiments proving the superior robustness of our approach with respect to other standard methods. An open-source implementation of our system and the acquired datasets are made publicly available with this paper.

VII Acknowledgements

The authors would like to thank Hansueli Zellweger from the ETH Plant Research Station in Eschikon, Switzerland for preparing the fields, managing the plant life-cycle and treatments during the entire growing season. The authors would also like to thank Dr. Frank Liebisch from the Crop Science Group at ETH Zürich for helpful discussions.


  • [1] Point clouds generated using Pix4Dmapper by Pix4D. [Online]. Available:
  • [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, 2008.
  • [3] A. Birk and S. Carpin, “Merging occupancy grid maps from multiple robots,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1384–1397, 2006.
  • [4] J. L. Blanco, J. González, and J.-A. Fernández-Madrigal, “A robust, multi-hypothesis approach to matching occupancy grid maps,” Robotica, vol. 31, pp. 687–701, 2013.
  • [5] T. M. Bonanni, B. Della Corte, and G. Grisetti, “3-D map merging on pose graphs,” IEEE Robotics and Automation Letters (RA-L), vol. 2, no. 2, pp. 1031–1038, 2017.
  • [6] A. Bódis-Szomorú, H. Riemenschneider, and L. V. Gool, “Efficient volumetric fusion of airborne and street-side data for urban reconstruction,” in

    Proc. of the International Conference on Pattern Recognition (ICPR)

    , 2016, pp. 3204–3209.
  • [7] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust independent elementary features,” in Europ. Conf. on Computer Vision (ECCV), 2010, pp. 778–792.
  • [8] N. Chebrolu, T. Läbe, and C. Stachniss, “Robust long-term registration of UAV images of crop fields for precision agriculture,” IEEE Robotics and Automation Letters (RA-L), vol. 3, no. 4, pp. 3097–3104, 2018.
  • [9] H. Chui and A. Rangarajan, “A new point matching algorithm for non-rigid registration,” Comput. Vis. Image Underst., vol. 89, no. 2-3, 2003.
  • [10] J. Dong, J. G. Burnham, B. Boots, G. Rains, and F. Dellaert, “4D crop monitoring: Spatio-temporal reconstruction for agriculture,” in IEEE Intl. Conf. on Robotics & Automation (ICRA), 2017.
  • [11] A. English, P. Ross, D. Ball, and P. Corke, “Vision based guidance for robot navigation in agriculture,” in IEEE Intl. Conf. on Robotics & Automation (ICRA), 2014.
  • [12] A. W. Fitzgibbon, “Robust registration of 2D and 3D point sets,” in British Machine Vision Conference, 2001, pp. 662–670.
  • [13] C. Forster, M. Pizzoli, and D. Scaramuzza, “Air-ground localization and map augmentation using monocular dense reconstruction,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2013.
  • [14] C. Frueh and A. Zakhor, “Constructing 3D city models by merging ground-based and airborne views,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2003.
  • [15] C. Fruh and A. Zakhor, “Constructing 3D city models by merging aerial and ground views,” IEEE Computer Graphics and Applications, vol. 23, no. 6, pp. 52–61, 2003.
  • [16] A. Gawel, T. Cieslewski, R. Dubé, M. Bosse, R. Siegwart, and J. Nieto, “Structure-based vision-laser matching,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2016.
  • [17] A. Gawel, R. Dubé, H. Surmann, J. Nieto, R. Siegwart, and C. Cadena, “3d registration of aerial and ground robots for disaster response: An evaluation of features, descriptors, and transformation estimation,” in Proc. of the IEEE SSRR, 2017.
  • [18] A. Gil, Ó. Reinoso, M. Ballesta, and M. Juliá, “Multi-robot visual SLAM using a rao-blackwellized particle filter,” Robotics and Autonomous Systems, vol. 58, no. 1, pp. 68 – 80, 2010.
  • [19] T. Hinzmann, T. Stastny, G. Conte, P. Doherty, P. Rudol, M. Wzorek, E. Galceran, R. Siegwart, and I. Gilitschenski, “Collaborative 3D reconstruction using heterogeneous UAVs: System and experiments,” in Proc. of the Intl. Sym. on Experimental Robotics (ISER), pp. 43–56.
  • [20] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
  • [21] A. Howard, “Multi-robot simultaneous localization and mapping using particle filters,” Intl. Journal of Robotics Research (IJRR), vol. 25, no. 12, pp. 1243–1256, 2006.
  • [22] Y. Hu, R. Song, and Y. Li, “Efficient coarse-to-fine patch match for large displacement optical flow,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [23] M. Imperoli, C. Potena, D. Nardi, G. Grisetti, and A. Pretto, “An effective multi-cue positioning system for agricultural robotics,” IEEE Robotics and Automation Letters (RA-L), 2018.
  • [24] J. Jessup, S. N. Givigi, and A. Beaulieu, “Robust and efficient multi-robot 3D mapping with octree based occupancy grids,” in Proc. of the IEEE Intl. Conf. on Systems, Man, and Cybernetics (SMC), 2014.
  • [25] R. Käslin, P. Fankhauser, E. Stumm, Z. Taylor, E. Mueggler, J. Delmerico, D. Scaramuzza, R. Siegwart, and M. Hutter, “Collaborative localization of aerial and ground robots through elevation maps,” in Proc. of the IEEE SSRR, 2016.
  • [26] R. Khanna, M. Möller, J. Pfeifer, F. Liebisch, A. Walter, and R. Siegwart, “Beyond point clouds-3d mapping and field parameter measurements using uavs,” in IEEE ETFA, 2015.
  • [27] A. Myronenko and X. Song, “Point set registration: Coherent point drift,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2262–2275, 2010.
  • [28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in IEEE Intl. Conf. on Computer Vision (ICCV), 2011, pp. 2564–2571.
  • [29] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (FPFH) for 3D registration,” in IEEE Intl. Conf. on Robotics & Automation (ICRA), 2009.
  • [30] S. Saeedi, L. Paull, M. Trentini, and H. Li, “Multiple robot simultaneous localization and mapping,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2011.
  • [31] S. Saeedi, M. Trentini, M. Seto, and H. Li, “Multiple-robot simultaneous localization and mapping: A review,” Journal of Field Robotics (JFR), vol. 33, no. 1, pp. 3–46.
  • [32] Q. Shan, C. Wu, B. Curless, Y. Furukawa, C. Hernandez, and S. M. Seitz, “Accurate geo-registration by ground-to-aerial image matching,” in 2nd Int. Conf. on 3D Vision, 2014.
  • [33] N. Michael et al., “Collaborative mapping of an earthquake-damaged building via ground and aerial robots,” Journal of Field Robotics (JFR), vol. 29, no. 5, pp. 832–841, Sept. 2012.
  • [34] E. Tola, V. Lepetit, and P. Fua, “Daisy: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 815–830, 2010.
  • [35] C. Wang, K. Wilson, and N. Snavely, “Accurate georegistration of point clouds using geographic data,” in 2013 International Conference on 3D Vision - 3DV 2013, 2013, pp. 33–40.
  • [36] U. Weiss and P. Biber, “Plant detection and mapping for agricultural robots using a 3D lidar sensor,” Robotics and autonomous systems, vol. 59, no. 5, pp. 265–273, 2011.
  • [37] J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-ICP: A globally optimal solution to 3D ICP point-set registration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 11, pp. 2241–2254, 2016.