1 Introduction
Point clouds registration, i.e., the alignment of two point clouds, is a well-studied problem for which many solutions have been proposed. However, most solutions have been tested on very little data and, even worse, usually in specific scenarios, which does not allow for a fair generalization of the results. Moreover, the data is often collected specifically for a limited set of experiments and not shared with the rest of the community. As a consequence, a fair and objective comparison between different approaches is often impossible. As long as a common benchmark is not available, this problem tends to perpetuate itself: since no common benchmark exists, authors have to collect new data for their experiments. A few attempts have indeed been made to design a shared evaluation benchmark. Although worthy, these solutions often cover only a limited set of use cases of point clouds registration algorithms and were collected using a single sensor. Therefore, even though their use is a huge step forward compared to using ad hoc data, they are still not a definitive solution.
For these reasons, we propose a benchmarking protocol for point clouds registration algorithms applied to localization and mapping. We call it a benchmarking protocol because it comprises not only the data, but also a complete evaluation protocol that authors should use to obtain results that can be compared fairly and objectively. Since many datasets of point clouds are already available, but none covers a large range of use cases and types of sensor, we decided to avoid collecting new data. Instead, we used a set of publicly available sequences from various sources. However, providing the data is only a part of a benchmark for point clouds registration algorithms. Providing a well-designed set of initial misalignments to apply to the data is as important as the data themselves, because the misalignment, together with the point clouds, defines the actual problem to solve.
Despite this, most of the sequences we used did not come with a set of initial misalignments. Therefore, there was no shared way of using the data. Moreover, there was no shared metric to evaluate the results, which made comparisons impossible. Finally, some of the sequences we used came with a ground truth whose accuracy was not provided. The main characteristics of our benchmark are:

it covers a quite comprehensive set of use cases, including registrations between clouds coming from different sensors, a scenario very rarely tested in the literature but common in real life;

it allows testing of both global and local registration algorithms;

it includes an evaluation protocol that uniformly covers various degrees of overlap and various amounts of misalignment;

it includes a metric that reliably combines both the rotation and translation error together into a single measure;

it is composed of data whose ground truth has been inspected to ensure its quality and whose accuracy has been estimated;

it is freely available and uses only freely available data.
2 Related Work
There are two big categories of point clouds registration problems: global and local. We have a global registration problem when we need to align two point clouds without any prior information on their relative pose. On the other hand, the problem is local when we have a rough prior guess of the relative pose, which has to be improved. Algorithms aimed at local registration are usually not effective for global problems, because they often make use of local optimization techniques and heuristics that can get stuck in local minima. On the other hand, most global algorithms do not provide precise results: their solutions usually need a refinement performed with a local technique. In other words, they are used to estimate the rough initial guess that local registration techniques need. Recently, Zhou et al. presented a new work that aims at significantly improving the quality of the results and the speed of convergence of global registration techniques; however, it has not yet been proved to outperform the best local methods zhou2016fast.
Global point clouds registration is usually achieved through the use of geometric features, which are, basically, a representation of salient points of the underlying surface represented by the cloud. 3D features are used in a similar way to what is done with 2D features extracted from images.
Examples of features used for point clouds registration are PFH rusu2008aligning, its faster variant FPFH rusu2009fast, and angular-invariant features jiang2009registration. Moreover, Sehgal et al. developed an approach, derived from the computer vision world, that uses SIFT features extracted from a 2D image generated from the point cloud sehgal2010real.
There are two main drawbacks to feature-based point clouds registration. First of all, it is usually a slow process: keypoint and descriptor extraction is computationally expensive. Secondly, and most importantly, the resulting alignment is often not very accurate. For this reason, feature-based registration is mostly used not per se, but to estimate an initial guess that will be refined later on with other techniques.
Local registration techniques, on the other hand, do not usually employ any feature. ICP is the first and most famous member of this category. It was originally developed independently by Besl and McKay besl1992method, Chen and Medioni chen1991object, and Zhang zhang1994iterative. Although its first introduction dates back to 1991, it is still the de facto standard for point clouds registration. ICP assumes that the point clouds are already roughly aligned and aims at finding the rigid transformation, i.e., a roto-translation, that best refines the alignment. Instead of looking for correspondences using keypoints and descriptors, ICP greedily approximates these correspondences by iteratively looking for the closest point to each point, improving the alignment at each step.
Many different variants of ICP have been proposed; usually, they aim at speeding up the algorithm or at improving the quality of the result. For an extensive review and comparison of ICP variants, see the work of Pomerleau et al. pomerleau_comparing_2013. A variant of ICP that is worth mentioning is Generalized ICP (GICP) segal_generalizedicp._2009. GICP modifies the standard ICP algorithm by incorporating the covariances into the error function. In this way, it usually leads to better results, but at the expense of computation time. Indeed, the resulting optimization problem cannot be expressed as a linear system and therefore has to be solved using a generic nonlinear optimization algorithm, such as Levenberg-Marquardt. Another variant of ICP, specifically aimed at dealing with noise, has been presented by Agamennoni et al. agamennoni2016point. It was derived by applying statistical inference techniques to a fully probabilistic model. In that proposal, each point in the source point cloud is associated with a set of points in the target point cloud; each association is then weighted so that the weights form a probability distribution. The result is an algorithm similar to ICP but more robust w.r.t. noise and outliers.
Local registration algorithms that do not use the nearest-point approximation exist too. For example, NDT biber2003normal represents the point clouds using a set of Gaussians and tries to align them by looking for the most probable alignment.
Although many different and efficient solutions to the problem of point clouds registration have been proposed, many studies test the proposals only on very little data. Moreover, this data is often collected ad hoc. This approach leads to a severe problem: often there is no direct comparison with the other existing solutions. Even when a comparison is proposed, the scarcity of the data makes it less relevant and not objective. Furthermore, comparing against every single solution in a paper is impossible, given the huge number of algorithms in the literature. Hence the need for shared data and methods to test registration algorithms.
Few meaningful comparisons of registration techniques exist in the literature. Donoso et al. compared various ICP variants, including GICP, on different outdoor settings Donoso2017, demonstrating that no single algorithm is better than the others in every scenario. However, the datasets used are limited to a single outdoor environment and have been recorded using only a single sensor; therefore, their results cannot be generalized. Cheng et al. compiled an extensive review of point clouds registration algorithms Cheng2018. They concluded that a shared and complete dataset, as well as an evaluation system, is necessary, since no real evaluation can be done based on the existing literature. Maiseli et al. pointed out this issue too in their review Maiseli2017. An essential work in this field is that of Pomerleau et al. Pomerleau2012, who proposed a benchmark for comparing registration techniques. They collected a series of sequences of point clouds (from now on called the ETH datasets) in various environments: indoor structured, outdoor unstructured, and in multiple seasons. Along with the datasets, they also proposed an evaluation protocol. Although it is an important contribution to the literature about point clouds registration, it has some drawbacks, some of which are acknowledged by the authors themselves.
First of all, the datasets have all been collected using the same sensor. A fair evaluation of a registration algorithm should use datasets produced using many kinds of sensor, since different sensors have different noise patterns. Another drawback is the use of separate translation and rotation errors to measure the performance of the algorithms. Even though these errors are well defined, having two separate metrics to measure the performance is impractical. If an algorithm has a lower translation error than another one, but a larger rotation error, there is no way to decide which one is better. We think that, to obtain a useful comparison, this ambiguity has to be resolved. For this reason, we propose the use of a metric that combines both the translation and rotation errors. The ETH datasets have been used in several comparisons, such as that of Babin et al., who compared many outlier rejection methods for ICP Babin2018, or those of Magnusson et al. Magnusson2015 and Petricek et al. Petricek2017. Their use proves that a common and ready-to-use benchmark is a welcome addition to the literature on point clouds registration.
3 Materials and Methods
Several reasons led us to propose a new benchmark for point clouds registration. First of all, and most importantly, there are no widely accepted test cases for registration algorithms covering a large array of situations. Therefore, authors proposing novel solutions have to either collect new data by themselves or use already available datasets. However, no single existing dataset covers all the use cases and possible scenarios of a registration algorithm. For this reason, authors have to pick the right datasets among the many available. These datasets will probably be in different formats and will have a ground truth in various formats and of variable reliability and accuracy, which places additional work on the authors to prepare the testing environment. What happens, in practice, is that most works are tested on a single dataset or on very few datasets, and thus the results are inevitably less generalizable. For example, a registration algorithm that uses geometric features could get excellent results in a structured indoor scenario but could perform much worse in an unstructured outdoor one, where geometric features cannot be extracted reliably. On the other hand, an algorithm could exploit the density of a point cloud to reconstruct a surface and improve the registration, but this might not be possible on sparse point clouds, such as those produced from digital elevation maps. The noise pattern could influence the results of a registration too. Besides being useful for authors, who will have a ready-to-use testing protocol that also permits fair comparisons with other existing solutions, the proposed benchmark is an essential help in choosing the right algorithm for a specific application.
Since the benchmark covers many different situations, once authors start publishing results, users will be able to select the best solution for their specific applications without the need for additional experiments or comparisons among the many existing solutions, which would require a lot of time and an in-depth study of the literature. These are the requirements we considered while designing the benchmark:

it should be a complete testing protocol, not only a dataset or a group of datasets. It should describe how to perform the experiments and how to measure the results, therefore allowing fair and objective comparisons among different approaches. Describing how to perform the experiments is an essential step, since using the same data in different ways is only slightly better than using ad hoc data;

it should cover as many different settings as possible. Examples include indoor structured scenes, outdoor unstructured, outdoor structured, with and without moving objects, large and small scale problems;

the data used should come from many sensors. Different sensors have different noise properties and produce point clouds with different densities. For example, some RGB-D cameras, similarly to stereo rigs, are triangulation-based sensors; therefore, the error on a measured distance increases quadratically with the distance matthies1987error. This does not happen with time-of-flight sensors, such as LiDARs and other RGB-D cameras, whose error on the distance is constant. On the other hand, the density of a point cloud produced with a LiDAR can be much lower than that of point clouds produced with an RGB-D camera, given the much larger range. For these reasons, using data coming only from a single sensor leads to results that cannot be generalised;

it should cover registration problems with both large and small overlap between the two point clouds. An algorithm could perform very well on problems with large overlap, but fail when the overlap is smaller. Conversely, an algorithm performing very well with low overlaps could be outperformed on easier problems;

the benchmark should be useful for testing both local and global registration algorithms; therefore, it should be composed of problems with various degrees of misalignment (initial perturbation);

the benchmarking protocol should include a single metric to compare two algorithms objectively. This avoids the use of separate translation and rotation errors, as done so far in most of the literature in the field;

the benchmark should include a reliable ground truth, whose accuracy should be provided;

it should be freely available. Therefore, it should include only freely available data.
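The quadratic growth of the error mentioned above for triangulation-based sensors follows from the standard stereo depth model. As a brief sketch, assume a pinhole model with focal length $f$, baseline $b$, disparity $d$ and disparity noise $\sigma_d$ (these symbols are introduced here for illustration and do not appear in the original text):

```latex
z = \frac{f\,b}{d}
\quad\Longrightarrow\quad
\left|\frac{\partial z}{\partial d}\right| = \frac{f\,b}{d^{2}} = \frac{z^{2}}{f\,b},
\qquad
\sigma_z \approx \frac{z^{2}}{f\,b}\,\sigma_d .
```

Under this model, doubling the distance $z$ roughly quadruples the depth uncertainty $\sigma_z$ for the same disparity noise.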
3.1 Initial perturbation
Two factors greatly influence the difficulty of a registration problem: the amount of overlap between the point clouds, and how far the source point cloud is from its final pose, i.e., the initial perturbation or misalignment. By initial perturbation we mean the displacement of the source point cloud w.r.t. its ground truth pose. That is, it is the roto-translation that a registration algorithm should estimate. The initial perturbation is a fundamental characteristic of a registration problem: the larger it is, the harder the problem becomes.
To test different levels of perturbation, we used the following protocol. For local registration problems, we selected the boundaries of the set of possible transformations, that is, the maximum and minimum magnitudes of the rotation and of the translation. While for the rotations we could use the same boundaries for every sequence, those of the translations are dataset-specific. The effect of a translation on a point cloud, indeed, depends on the scale of the cloud: a translation of 1 meter applied to a cloud representing an object could yield a very hard problem, while for a point cloud representing a city it could be considered an easy one. The actual values have been chosen taking into consideration the accuracy of the ground truth and are available on our repository (https://github.com/iralabdisco/point_clouds_registration_benchmark).
For each pair of point clouds, we randomly sampled a set of initial perturbations, each composed of a rotation combined with a translation. The rotation and the translation have been sampled separately, although with the same technique. First, we uniformly sample an axis, which represents either the direction of the translation or the rotation axis. Then, we sample a magnitude from a uniform distribution with the appropriate boundaries. The cardinality of the sets of initial perturbations has been carefully tuned to ensure an adequate coverage of the space but, at the same time, not to require a huge number of tests, which would discourage authors from using the benchmark.
This protocol ensures that our benchmark is not biased towards easier or harder problems, but, instead, covers all the different levels of initial perturbation to highlight strong and weak points of registration algorithms. A global point clouds registration algorithm should be independent of the initial perturbation. For this reason, for this kind of problem, we sampled only very large transformations. The protocol is the same as that used for local registration problems, but the boundaries of the uniform distribution are different (45 and 180 degrees for the rotation and, again, larger and dataset-specific boundaries for the translation).
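The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the code used to generate the benchmark: the function names are hypothetical, the uniform axis is drawn by normalising a Gaussian vector (one common way to sample a direction uniformly on the sphere), and the translation bounds are placeholder values, since the real ones are dataset-specific and published in the repository.

```python
import math
import random


def sample_unit_vector(rng):
    """Draw a direction uniformly on the unit sphere (normalised Gaussian)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # reject the (practically impossible) zero vector
            return [c / norm for c in v]


def sample_perturbation(rng, rot_bounds_deg, trans_bounds):
    """Sample a rotation (axis, angle) and a translation independently."""
    axis = sample_unit_vector(rng)
    angle = math.radians(rng.uniform(*rot_bounds_deg))
    direction = sample_unit_vector(rng)
    magnitude = rng.uniform(*trans_bounds)
    translation = [magnitude * c for c in direction]
    return axis, angle, translation


rng = random.Random(0)
# Rotation bounds for the global case are taken from the text; the
# translation bounds (2 to 5 metres) are hypothetical placeholders.
perturbations = [
    sample_perturbation(rng, (45.0, 180.0), (2.0, 5.0)) for _ in range(5)
]
```

Each perturbation is returned as an axis-angle rotation plus a translation vector; converting it to a 4x4 homogeneous matrix is left out to keep the sketch short.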
3.2 Overlap
The point clouds to align do not necessarily represent exactly the same area. Instead, some part of the scene may be present in a point cloud and not in the other. The part of the scene observed in both point clouds is called the overlap. Usually, the more significant the overlap, the easier the registration problem is. However, an algorithm may behave differently with different levels of overlap. For this reason, we also tested various degrees of overlap.
We calculated the degree of overlap as the percentage of points in a point cloud having a correspondent in the other point cloud (aligned using the ground truth). Since most points will not have an exact correspondent in the other point cloud, two points form a correspondence if their distance is lower than a dataset-specific threshold. Since this threshold influences the value of the result, the overlap cannot be compared between different datasets. That is, we can say that two point clouds have a lower or higher overlap with respect to another pair only if they come from the same dataset, since their overlaps have been calculated using the same threshold. This is exactly how we use this measure; therefore, it is perfectly acceptable for our goals.
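As an illustration of this definition, the sketch below counts the points of one cloud that have a neighbour in the other cloud within the threshold. It is a brute-force version written for readability (a k-d tree would be the practical choice for real clouds), the function name is hypothetical, and it assumes the overlap is measured from the first cloud towards the second, a detail the text does not specify.

```python
import math


def overlap_ratio(source, target, threshold):
    """Fraction of source points with a target point closer than threshold.

    Both clouds are assumed to be already aligned with the ground truth.
    Complexity is O(len(source) * len(target)).
    """
    if not source:
        return 0.0
    matched = 0
    for p in source:
        if any(math.dist(p, q) < threshold for q in target):
            matched += 1
    return matched / len(source)


# Toy clouds: two of the four source points have a close target point.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
target = [(0.0, 0.05, 0.0), (1.0, 0.0, 0.05), (10.0, 0.0, 0.0)]
ratio = overlap_ratio(source, target, threshold=0.1)
```

Changing `threshold` changes the result for the same pair of clouds, which is why overlaps are only comparable within a dataset.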
We think that, to keep the comparison fair, the various degrees of overlap should be uniformly represented in each sequence, because the overlap is one of the characteristics that define the difficulty of a registration problem. For this reason, we used the following algorithm to ensure uniform coverage of the overlap levels:

we calculate the overlap of each possible pair of point clouds and discard the pairs whose overlap is below a minimum threshold;

we divide the range of overlaps into ten intervals of equal size; from each interval we randomly choose ten pairs of point clouds;

if an interval has fewer than ten members, the remaining pairs are chosen randomly from the whole set of pairs of point clouds.
While this algorithm does not guarantee a uniform coverage, in practice it works reasonably well, as can be seen in Fig. 1. As an alternative to the described protocol, we could have decided to align each point cloud in a given sequence with every other point cloud in the same sequence whose overlap is above a threshold. We discarded this option for two reasons. First of all, the number of registration problems would become too high, discouraging authors from using the benchmark. Secondly, this protocol would not ensure a uniform coverage of the various degrees of overlap, which is, in our opinion, a fundamental characteristic of a registration benchmark.
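The pair-selection steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and the data layout are hypothetical, the ten intervals are taken between the smallest and largest retained overlap, the fill-up step avoids picking the same pair twice, and the minimum overlap used in the example (0.1) is a placeholder rather than the value used by the benchmark.

```python
import random


def select_pairs(overlaps, min_overlap, rng, n_bins=10, per_bin=10):
    """Pick pairs so that overlap levels are covered as evenly as possible.

    overlaps maps a pair identifier to its overlap, a value in [0, 1].
    """
    kept = {pair: o for pair, o in overlaps.items() if o >= min_overlap}
    if not kept:
        return []
    low, high = min(kept.values()), max(kept.values())
    width = (high - low) / n_bins or 1.0  # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]
    for pair, o in kept.items():
        index = min(int((o - low) / width), n_bins - 1)
        bins[index].append(pair)

    # Take up to per_bin random pairs from every interval.
    chosen = []
    for members in bins:
        chosen.extend(rng.sample(members, min(per_bin, len(members))))

    # Compensate under-populated intervals with random pairs taken from
    # the whole set of retained pairs that were not selected yet.
    chosen_set = set(chosen)
    remaining = [pair for pair in kept if pair not in chosen_set]
    missing = min(n_bins * per_bin - len(chosen), len(remaining))
    chosen.extend(rng.sample(remaining, missing))
    return chosen


rng = random.Random(0)
# Hypothetical overlaps for 300 pairs, skewed towards low values.
overlaps = {(i, i + 1): rng.random() ** 2 for i in range(300)}
selected = select_pairs(overlaps, min_overlap=0.1, rng=rng)
```

Because the fill-up step draws from the whole set, sequences with few high-overlap pairs end up slightly biased towards the more populated intervals, which matches the remark that uniform coverage is not guaranteed.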
For the ETH datasets, the authors indeed provided a set of registration problems, i.e., pairs of point clouds with an associated initial perturbation. We decided to use registration problems generated with our algorithms, rather than the originals, for two reasons. First of all, we want to use the same procedure for all the datasets. Secondly, the set of problems provided with the ETH datasets is too large to be used in conjunction with other datasets. Requiring too many experiments to run the benchmark would, in our opinion, discourage researchers from using it. As a reference, our benchmark requires 3000 experiments per sequence for testing local registration algorithms, while the original ETH protocol requires 64000 experiments per sequence. Considering that our benchmark is composed of many more datasets, the number of registration problems would become too large.
Summarizing, for each sequence, we sampled a list of pairs of point clouds, each corresponding to a registration problem, ensuring uniform coverage of the different levels of overlap. For each pair, we randomly sampled a list of transformations, to ensure a uniform coverage of the different magnitudes of initial misplacement. Therefore, each chosen pair of point clouds is tested with different levels of initial misplacement; in this way, the benchmark is able to highlight how an algorithm behaves with different transformations on the same problem.
3.3 Error metric
Choosing the right metric is an essential step in designing an evaluation protocol. Most state-of-the-art works on point clouds registration measure their performance by calculating the distance between the estimated and the ground truth translations and rotations separately, therefore obtaining a rotation error and a translation error. Although formally correct, this approach does not allow for an objective comparison between different results, because it does not produce a single measure. If an algorithm gets a lower error on the translation, but a larger one on the rotation, with respect to another algorithm, there is no correct way of deciding which one is better. From a scientific perspective, this is unacceptable, since the main goal of a benchmark is to compare results; therefore, a metric that does not guarantee the possibility of comparisons is not appropriate.
The result of a point clouds registration algorithm is a pose, i.e., a rotation and a translation. Thus, comparing a result w.r.t. a ground truth means calculating the distance between two poses. This problem is very common in many mechanical applications, such as path planning and position precision evaluation. However, the research community has not found a widely accepted solution yet.
Among the many existing solutions, an interesting one is that of Mazzotti et al. Mazzotti2016, who use a so-called platonic solid attached to a rigid body to measure the distance between two poses. The actual distance is then calculated using the root mean squared distance between the homologous vertices of the solid. Suggestions on which solid to use are given in their work. This solution could be appropriate for our goals: it allows objective comparisons between results, it is very easy to calculate, and it has an intrinsic physical meaning (since it is just a mean of Euclidean distances). However, it has a very relevant drawback: two very important parameters affect the metric. Both the size of the solid (but not its type) and where it is placed on the rigid body affect how much the rotation contributes to the distance relative to the translation. Suppose that the solid is placed at the origin of the reference frame of the two point clouds; using a very large solid will give more importance to the rotation, since points far from the origin are displaced by a larger distance when the solid is rotated than points closer to the origin. Placing the solid in a position different from the origin of the reference frames has the same effect: placing it farther away gives more importance to the rotation component of the roto-translation. Similar problems, with parameters that would bias the metric, also arise with the work of Di Gregorio DiGregorio2008, who proposes a method to generate a family of metrics. The parameters are tuned so as to constrain the maximum displacement of the rigid body, a feature useful in path planning, but not appropriate for evaluating point clouds registration algorithms.
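The dependence on the size and placement of the solid can be made explicit with a short calculation. As an illustration (the symbols are introduced here and do not come from the cited works), consider a vertex $v$ at distance $r$ from the rotation axis, displaced either by a rotation $R_\theta$ of angle $\theta$ about that axis or by a translation $t$:

```latex
\| R_\theta v - v \| = 2\,r\,\sin\!\left(\frac{\theta}{2}\right),
\qquad
\| (v + t) - v \| = \| t \| .
```

The rotational displacement grows linearly with $r$, while the translational one does not depend on it, so enlarging the solid or moving it away from the axis increases the weight of the rotation in the resulting distance.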
Inspired by the work of Mazzotti et al. Mazzotti2016, we propose a new metric. Instead of using the vertices of a platonic solid, we use the points of the actual point cloud. Therefore, the proposed metric is calculated using the root mean squared distance between homologous points of the source point cloud, after the execution of an algorithm, and the same point cloud at the ground truth pose. Since the point cloud at the ground truth pose and the one aligned with an algorithm are actually the same point cloud, although displaced, associating a point in the former with its homologous point in the latter is trivial. To make the metric scale-invariant, each distance between homologous points is divided by the distance of that point from the centroid of the point cloud, i.e., the mean of the points. Without this last step, the metric would depend on the size of the point cloud: the same roto-translation applied to a larger point cloud would result in a higher error than if the point cloud were smaller. Consequently, the results of an algorithm on different pairs of point clouds would not be comparable, and statistics such as the mean or the median of the performance on the different registration problems would be meaningless. For these reasons, this last step is fundamental in making the metric useful for comparisons. This is an important difference w.r.t. other measures based on the distance between homologous points: meaningful statistics are essential when evaluating an algorithm on such a large set of problems.
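Given the same point cloud in two different poses $P$ and $G$, of cardinality $n$ and with point $p_i$ in $P$ corresponding to $g_i$ in $G$, and with $\bar{g}$ being the centroid of $G$, the distance between $P$ and $G$ is
$$\delta(P, G) = \frac{1}{n} \sum_{i=1}^{n} \frac{\| p_i - g_i \|_2}{\| g_i - \bar{g} \|_2} \qquad (1)$$
To be a well-formed distance, the proposed metric should satisfy three constraints:

symmetry, i.e., $\delta(P, G) = \delta(G, P)$;

positive definiteness, i.e., $\delta(P, G) > 0$ if $P \neq G$ and $\delta(P, G) = 0$ if $P = G$;

triangle inequality, i.e., $\delta(P, G) \leq \delta(P, Q) + \delta(Q, G)$ for any third pose $Q$.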
These three requirements are easily verified since the proposed metric is an average of Euclidean distances that satisfy the constraints. The sum is a symmetric and positive definite operation that complies with the triangle inequality as long as the operands are not negative (this is true in our case since a Euclidean distance cannot be negative).
Besides being a well-formed distance metric, our proposal also satisfies the requirements of our benchmark. It combines the rotation error and the translation error in a single value, allowing an immediate and objective comparison among results. Unlike the approach based on the vertices of a platonic solid, there is no parameter to tune that influences how the metric behaves. By increasing the size of the solid we would be able to give more importance to the rotational component; by using the points of the point cloud directly, instead, this parameter is implicit. If there are many points far from the origin, then an error on the rotational component will have a more significant impact on the metric. This is a desirable behaviour, since in such cases the rotation would have a greater effect on the registration result.
The only drawback of our proposal is that it requires iterating through the whole point cloud, instead of through a few vertices. This issue is, however, negligible, since the comparison with the ground truth is done only for benchmarking purposes and is not needed online in a real application.
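The metric described in this section can be sketched in a few lines. This is a hedged illustration rather than the reference implementation: the function names are hypothetical, and the sketch assumes the plain mean of the per-point distances, in line with the description of the metric as an average of Euclidean distances; a root-mean-square variant would differ only in the aggregation step. A point lying exactly on the centroid is counted as contributing zero, which is also an assumption.

```python
import math


def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def registration_error(estimated, ground_truth, normalise=True):
    """Mean distance between homologous points, optionally scale-normalised.

    estimated[i] and ground_truth[i] must be the same physical point of the
    source cloud, at the estimated and at the ground truth pose.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("clouds must be non-empty and of equal size")
    c = centroid(ground_truth)
    total = 0.0
    for p, g in zip(estimated, ground_truth):
        d = math.dist(p, g)
        if normalise:
            radius = math.dist(g, c)
            if radius == 0.0:
                continue  # assumption: such a point contributes zero
            d /= radius
        total += d
    return total / len(estimated)


def rotate_z(points, angle):
    """Rotate points about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


small = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
large = [(10.0 * x, 10.0 * y, 10.0 * z) for x, y, z in small]
angle = math.radians(5.0)
err_small = registration_error(rotate_z(small, angle), small)
err_large = registration_error(rotate_z(large, angle), large)
raw_small = registration_error(rotate_z(small, angle), small, normalise=False)
raw_large = registration_error(rotate_z(large, angle), large, normalise=False)
```

In this toy case, a pure rotation about the centroid gives the same normalised error for both clouds, while the unnormalised error of the larger cloud is ten times that of the smaller one, which is the size dependence that the normalisation is meant to remove.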
4 The Datasets
No single dataset complies with all the requirements we formulated. For this reason, taking advantage of the large number of publicly available point clouds, we decided to base our benchmark on multiple public datasets.
Among the many available, we concentrated on those more relevant to applications like localization and mapping. Therefore, we preferred datasets with sequences representing large and complex environments, and discarded those more aimed at object or scene reconstruction. The latter is a task often accomplished in a controlled environment, where the poses of the sensor can be measured very accurately. On the contrary, localization and mapping is usually performed on platforms moving in a very dynamic environment, where the poses of the sensors are measured with a large uncertainty; however, the accuracy and precision required are usually lower. For example, a 3D reconstruction of a statue with an error of 10 cm is usually unacceptable, while localizing an autonomous vehicle with the same precision would be a great result. Moreover, these problems are often solved with different techniques and using point clouds acquired with different sensors. Also, the overlap between two point clouds may differ according to the application. In object reconstruction, for example, it is usually possible to acquire several point clouds from different points of view, resulting in a large overlap between acquisitions. On the contrary, this is not usually feasible in robotics applications. For these reasons, we think that such different applications require different benchmarks.
4.1 Ground Truth Evaluation
Having a reliable and accurate ground truth is a strong requirement that, unfortunately, led us to discard many datasets that would have been suitable otherwise. We inspected several publicly available datasets and discovered that a surprisingly high number had a very inaccurate ground truth; therefore, they are unsuitable for a meaningful comparison among different approaches. Moreover, we discovered that the accuracy reported by the authors often does not correspond to the real accuracy of the ground truth. We believe that having a ready-to-use set of sequences whose ground truth has been inspected is one of the major advantages of the proposed benchmark.
A measure of the accuracy of the ground truth is therefore necessary, since it is also the lower limit beneath which the accuracy of an algorithm cannot be evaluated. For example, if the ground truth of a dataset has an accuracy of 1m on the translation, differences between two results of less than 1m cannot be considered relevant. While for some of the datasets the authors provide the accuracy of their ground truth measurement system, this information was not available for others. Since we think that it is an essential part of a benchmark, we decided to evaluate the accuracy of all the datasets with another technique, regardless of whether it had already been reported. We decided to reevaluate all the accuracies to ensure that every dataset was evaluated in the same way.
To measure the accuracy of the ground truth of a sequence, we tried to align each pair of point clouds, that is, each registration problem, with the Probabilistic Point Clouds Registration algorithm agamennoni2016point. As the maximum distance between associated points (the radius parameter), we used the value used to calculate the overlap in the corresponding sequence. We chose this algorithm because it provides better results than other state-of-the-art algorithms, such as NDT or ICP agamennoni2016point, although at the expense of computational time, which is not relevant when estimating the accuracy of the ground truth. Like any point clouds registration algorithm, it estimates the rigid transformation between two overlapping point clouds, thereby providing a measure of how misaligned the two clouds are. When applied to two point clouds already placed at their ground truth poses, this measure is, essentially, the accuracy of the ground truth: the resulting rototranslation represents how much the algorithm was able to improve the alignment.
One drawback of our technique is that the algorithm used, like many others in this field, has no guarantee of convergence to the globally optimal solution. However, this drawback is mitigated by several factors. First of all, if the two point clouds are already well aligned, as in this case, where the alignment starts from the ground truth pose, closest-point based registration algorithms are almost always able to converge to the right solution. Moreover, while the closest-point based data association could lead to wrong solutions, the Probabilistic Point Clouds Registration algorithm is guaranteed to converge to the right solution if the right data association is contained among the associations used (under the t-distribution assumption and a proper outlier rejection). Finally, since the algorithm could, nevertheless, sporadically give wrong results, we decided to consider as outliers (and therefore exclude from the evaluation) misalignments with a robust z-score greater than . Besides detecting these few outliers, we also inspected them manually, and we can confirm that they correspond to errors of the registration algorithm and not to a low accuracy of the ground truth.
As a measure of the accuracy of the ground truth of a sequence, we took the mean and the standard deviation of the misalignments estimated by our approach. They are calculated using the metric proposed for this benchmark, without the scale-invariant normalization (since we need absolute values for the ground truth evaluation), and excluding the outliers as mentioned before.
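The outlier rejection described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the common median/MAD form of the robust z-score (with the usual 0.6745 scaling constant), which may differ from the exact formulation we used, and the threshold is left as a parameter. The function names are hypothetical.

```python
import math
from statistics import median


def robust_z_scores(values):
    """Signed robust z-score of each value, based on the median and the MAD.

    Assumption: the median/MAD formulation; the paper does not spell out
    which robust z-score variant was used.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: at least half of the values equal the median.
        return [0.0 if v == med else math.copysign(math.inf, v - med)
                for v in values]
    # 0.6745 makes the MAD comparable to a standard deviation for Gaussian data.
    return [0.6745 * (v - med) / mad for v in values]


def split_outliers(misalignments, threshold):
    """Return (inliers, outliers); outliers have a robust z-score > threshold."""
    scores = robust_z_scores(misalignments)
    inliers = [v for v, z in zip(misalignments, scores) if z <= threshold]
    outliers = [v for v, z in zip(misalignments, scores) if z > threshold]
    return inliers, outliers


# Illustrative values only; 3.5 is a commonly used cut-off, not necessarily ours.
inliers, outliers = split_outliers([0.05, 0.04, 0.06, 0.05, 0.07, 0.03, 0.90], 3.5)
```

The mean and standard deviation of the ground truth accuracy would then be computed on the inliers only.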
Note that the method we used is not an exact measure of the accuracy of the ground truth, since it is based on a heuristic. Instead, the reported value should be considered an upper bound on the error of the ground truth. While this solution is not optimal, it is, nevertheless, the best that can be done using only the data. An exact measure of the ground truth's accuracy can be obtained only with a direct analysis of the ground truth system, as has been done by [pomerlau]. Unfortunately, this is not possible when using existing datasets. Our method reports a value for the accuracy of the ground truth that is usually greater than the one reported by the original authors (when available). The reason is that, while the method employed by the original authors measures only the errors introduced by the ground truth system, our method is also influenced by the noise in the point clouds, since our evaluation uses the actual data.
The accuracies evaluated with our method are reported in Table 1. As an example, Figure 2 depicts the ground truth evaluation for the TUM dataset. On the x axis we have the pairs of point clouds used by the benchmark, on the y axis the accuracy of the ground truth. The red line is the mean of the ground truth accuracy, while the green area represents the standard deviation. The full data used for the evaluation of every dataset, along with the corresponding plots, are available on our repository (https://github.com/iralabdisco/point_clouds_registration_benchmark).
Name  Mean Error [m]  Std. Deviation [m] 

ETH Dataset  0.05  0.02 
Canadian Planetary Emulation  0.13  0.05 
TUM Dataset  0.11  0.09 
KAIST Dataset  0.04  0.03 
The datasets we used are described in the following sections.
4.2 The ETH Dataset
We used the dataset presented by Pomerleau et al. Pomerleau2012, because it covers a broad set of use cases for registration algorithms. It contains two indoor scenes (apartment and stairs), five outdoor scenes (gazebo_summer, gazebo_winter, plain, wood_summer and wood_autumn) and a mixed one (hauptgebaude). It includes both structured and unstructured environments, and the indoor ones are not entirely static (there are walking people or furniture moved between scans). The datasets have been recorded with a Hokuyo UTM-30LX scanning rangefinder. The corresponding ground truth has been measured by the authors with a Leica TS15 base station, obtaining an accuracy of 1.8 mm for the translation and 0.006 rad for the rotation. For a complete description of the methodology used, please consult the corresponding paper. The evaluation with our own method is reported in Table 1.
4.3 Canadian Planetary Emulation Terrain 3D Mapping Datasets
The second dataset was collected for the emulation of planetary explorations planetary. We think that this dataset is particularly suited for testing the performance of registration algorithms in unstructured outdoor scenarios, a setting that poses hard challenges and is often neglected in the literature. The environment is mainly composed of sand, scattered rocks, and some trees at the borders. The whole area has a dimension of meters. The almost complete lack of structure is what led us to choose this dataset: we wanted to test the performance of registration algorithms in one of the hardest scenarios they can encounter (Fig. 3). Algorithms exploiting geometric features will probably struggle with this dataset. However, the goal of our benchmark is to highlight strong and weak points of the various registration techniques, hence the choice of such a challenging scenario. This dataset presents the typical operating environment of outdoor robots, such as search and rescue, space or agricultural robots, applications often neglected by other datasets. These applications are gaining increasing popularity, especially agricultural robots, and therefore, in our opinion, can no longer be neglected by benchmarks. The dataset is composed of many sequences, acquired in three different facilities, using a Sick LMS291-S05 or a Sick LMS111-10100 laser rangefinder, two popular sensors, mounted on three different robotic platforms. Unfortunately, we could use only two sequences, named p2at_met and box_met, because the others do not have a reliable ground truth: our manual inspection of the sequences revealed several errors in it. Since the two sequences depict the same environment, we could add a particular test case to our benchmark, i.e., measuring the performance of registration algorithms while aligning a point cloud with a map produced at a different time with a different sensor.
Therefore, we built a map with the point clouds acquired with the box platform (from the box_met sequence) and subsampled it using a voxel grid of fixed size, to simulate a Digital Elevation Map (Fig. 3). This map can then be used to localize the other platform by aligning, w.r.t. this map, the point clouds from the p2at_met sequence.
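As an illustration of the subsampling step, the following is a minimal sketch of a centroid-based voxel grid filter. It is an assumption that each voxel is replaced by the centroid of its points; the function name and the leaf size used in the example are hypothetical and do not correspond to the values used to build the map.

```python
from collections import defaultdict
from math import floor


def voxel_grid_subsample(points, leaf_size):
    """Replace all points falling in the same cubic voxel with their centroid.

    points: iterable of (x, y, z) tuples; leaf_size: edge length of a voxel.
    """
    voxels = defaultdict(list)
    for x, y, z in points:
        # Integer voxel index along each axis (floor also handles negatives).
        key = (floor(x / leaf_size), floor(y / leaf_size), floor(z / leaf_size))
        voxels[key].append((x, y, z))
    subsampled = []
    for pts in voxels.values():
        n = len(pts)
        subsampled.append(tuple(sum(coord) / n for coord in zip(*pts)))
    return subsampled


# Two points share the first voxel, the third one falls in a different voxel.
sparse = voxel_grid_subsample(
    [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.5, 0.2, 0.2)], leaf_size=1.0)
```

A larger leaf size yields a sparser map, which is what makes the localization problem an instance of dense-to-sparse registration.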
This is an example of registration between two different types of point clouds. Besides the sensor used, the point clouds also differ in their density: the map has a much lower density (Fig. 4). Thus, this is also an example of dense-to-sparse point cloud registration. This kind of problem has long been understudied, but it is gaining importance because it is very relevant for real-world robotics applications. As an example, localizing on a map produced with a different sensor, which yields sparser point clouds, is a very common use case in robotics. In a world where 3D maps of various environments will be readily available (this is already happening for cities and will likely happen for other public settings too), this use case will be very common. Moreover, for many outdoor areas, digital elevation maps (DEM) are already available. After converting them to point clouds, they can be used for localization tasks. If localization systems want to use 3D maps that are, or will soon be, publicly available, they will have to solve registration problems between point clouds with different characteristics. These maps, indeed, will likely have a lower density than the clouds produced with an onboard sensor, since they have to represent very large areas. Moreover, these maps will likely be produced with different sensors or completely different techniques. For example, a common technique for producing maps of large outdoor areas is photogrammetry. For these reasons, testing the performance of registration algorithms on point clouds with heterogeneous characteristics should be a fundamental part of a benchmarking protocol.
The ground truth of the chosen sequences has been produced using a differential GPS and an onboard IMU. The authors do not report the accuracy of their ground truth system. Our evaluation is reported in Table 1.
4.4 The TUM Vision Group RGB-D Datasets
The third dataset we selected is the RGB-D dataset acquired by the TUM Vision Group sturm12iros. From the large number of available sequences, we have chosen those that are more relevant for localization and mapping, as opposed to scene or object reconstruction and classification. These sequences are usually longer and involve navigating through a large and complex scene, instead of turning in place or moving around a single object. Note that we had to discard some of these sequences because the complete ground truth was not available. The chosen sequences are:

freiburg3_long_office_household

freiburg2_pioneer_slam

freiburg2_pioneer_slam2

freiburg2_pioneer_slam3
The sequences have been recorded using either a Kinect 1 or an Asus Xtion, two popular RGB-D sensors based on triangulation. Triangulation-based sensors are affected by noise with a different pattern than that of time-of-flight sensors matthies1987error; therefore, experiments performed with the latter cannot be generalised to the former. Given the increasing availability and decreasing cost of triangulation-based RGB-D sensors and their consequent widespread use, we think that including datasets produced with them is an essential part of any point clouds registration benchmark. The ground truth has been produced using a very accurate eight-camera motion capture system, resulting in an error on the poses reported to be lower than and . The evaluation of the accuracy of the ground truth with our own method is reported in Table 1.
4.5 The KAIST Urban Datasets
Autonomous driving is a field of robotics that is gaining increasing importance for both the research community and companies, such as Google, Tesla, Daimler, BMW, Toyota and many others. The typical urban environment in which an autonomous vehicle has to operate has many peculiarities. Examples are the presence of many moving objects (pedestrians or other vehicles) and scenes with very repetitive patterns or where geometric features cannot be reliably extracted or matched (e.g., tree-lined streets). Despite their importance, these elements are usually neglected by datasets not specifically aimed at autonomous driving. Since experimental autonomous vehicles are already a reality and are already being tested in the real world, we think that a complete benchmark cannot neglect the typical urban scenario in which autonomous cars have to operate. The KAIST Urban dataset is a collection of data acquired with sensors mounted on a vehicle driving in various types of scenarios in Korea jeong2019complex. The point clouds we use for our benchmark have been acquired with two Velodyne VLP-16 LiDARs mounted on the left and right of the top of a car. The ground truth has been produced with an RTK GPS and refined using a SLAM system. Since the point clouds from the two LiDARs and the ground truth poses were not synchronized, we associated with each point cloud from the left LiDAR the ground truth pose closest in time. We associated point clouds from the right LiDAR with those from the left one using the same strategy. Therefore, each point cloud of our benchmark is composed of a point cloud produced with the left LiDAR, merged with its corresponding cloud from the right LiDAR. This is how these point clouds will probably be used in a real application, since the transformation between the left and right LiDAR is known and fixed. Using them in conjunction, instead of aligning them separately, could provide more clues to a registration algorithm in some situations, e.g., during sharp turns, since the represented area and the overlap with the map will be larger. Even though we want our benchmark to be challenging, we do not want to make it artificially hard by unnecessarily removing information that would be available in a real-world application. Although the dataset is composed of many sequences, we decided to use only the one named "urban05". Other sequences, very similar to the chosen one, could have been used; however, we decided to use just one to keep the number of experiments required by our benchmark manageable. We avoided sequences representing only a highway and where the observer was only going straight, because they do not represent the complexity of an urban environment, i.e., the common operating environment of an autonomous vehicle. The authors do not report the accuracy of their ground truth system; therefore, we evaluated it with our method. The result is reported in Table 1.
With these four groups of sequences, we aim at covering the vast array of possible uses for point clouds registration algorithms: indoor, outdoor rural, urban, unstructured, structured, and also between clouds with different characteristics. Table 2 describes the datasets that compose our benchmark, while Table 3 summarises the characteristics of the various datasets we used and highlights how no single one meets all requirements.
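The closest-in-time association used for the KAIST point clouds can be sketched as follows. This is a minimal illustration, assuming that timestamps are plain numbers and that the reference timestamps are sorted; the function names are hypothetical and the actual scripts may differ.

```python
from bisect import bisect_left


def closest_in_time(query_stamp, sorted_stamps):
    """Index of the element of sorted_stamps closest in time to query_stamp."""
    if not sorted_stamps:
        raise ValueError("no timestamps to associate with")
    i = bisect_left(sorted_stamps, query_stamp)
    if i == 0:
        return 0
    if i == len(sorted_stamps):
        return len(sorted_stamps) - 1
    before, after = sorted_stamps[i - 1], sorted_stamps[i]
    # On a tie, the earlier timestamp is preferred.
    return i if after - query_stamp < query_stamp - before else i - 1


def associate(cloud_stamps, reference_stamps):
    """For each cloud timestamp, the index of the closest reference timestamp."""
    return [closest_in_time(t, reference_stamps) for t in cloud_stamps]


# Left-LiDAR cloud timestamps matched against ground truth pose timestamps.
pose_indices = associate([0.04, 0.16, 0.5], [0.0, 0.1, 0.2, 0.3])
```

The same function can be reused to pair each left-LiDAR cloud with the right-LiDAR cloud closest in time before merging them.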
Name  Sensor  Peculiarities 

ETH  LiDAR  Dynamic scenes, different kinds of environments 
Planetary Emulation  LiDAR  Completely unstructured, very few features, registration w.r.t. a map 
TUM  RGB-D Camera  
KAIST  LiDAR  Autonomous driving 
Name  Outdoor  Indoor  Urban  Heter. registration  Struct.  Unstruct.  RGB-D  LiDAR  Dynamic 

ETH  ✓  ✓  X  X  ✓  ✓  X  ✓  ✓ 
Canadian  ✓  X  X  ✓  X  ✓  X  ✓  X 
TUM  X  ✓  X  X  ✓  X  ✓  X  X 
KAIST  ✓  X  ✓  X  ✓  ✓  X  ✓  ✓ 
Proposal  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ 
5 Provided Software
On the project web page, https://github.com/iralabdisco/point_clouds_registration_benchmark, we provide a set of utilities developed to make the benchmark straightforward to use.
We have chosen pairs of point clouds randomly, to cover various degrees of overlap. These registration pairs are available in the form of a text file, containing the following fields:
id: A unique identifier of the registration problem;
source name: The file name of the source point cloud;
target name: The file name of the target point cloud;
overlap: The percentage of overlap between the point clouds;
t1..t12: The elements of the first three rows of the 4x4 transformation matrix representing the initial misplacement to apply. The last row is implicit, since for a rototranslation in homogeneous coordinates it is always (0, 0, 0, 1); therefore, the matrix is the following:
\[
\begin{bmatrix}
t_1 & t_2 & t_3 & t_4 \\
t_5 & t_6 & t_7 & t_8 \\
t_9 & t_{10} & t_{11} & t_{12} \\
0 & 0 & 0 & 1
\end{bmatrix}
\]
The transformation matrix represents the initial misplacement of the source point cloud and has to be applied before solving the problem. That is, it is the variable that a registration algorithm has to estimate.
There are two files for each sequence: one for local registration problems, the other for global ones. Transformations for global registration algorithms have been sampled from a uniform distribution with a much larger variability, hence the need for two different sets of problems.
Since a pair of point clouds corresponds to several registration problems, we decided to provide the clouds correctly aligned, that is, in their ground truth position. Before solving a problem, the transformation described in the corresponding line of the configuration file has to be applied to the source point cloud.
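To make the use of a configuration line concrete, the following minimal sketch parses one problem and applies the initial misplacement to a source cloud. It rests on assumptions that should be checked against the project web page: fields are taken to be whitespace-separated, file names to contain no spaces, and t1..t12 to list the first three rows of the matrix in row-major order. The function names and the sample line are hypothetical.

```python
def parse_problem(line):
    """Parse one registration problem from a configuration line.

    Assumed layout: id, source name, target name, overlap, t1..t12.
    """
    fields = line.split()
    if len(fields) != 16:
        raise ValueError("expected 16 fields, got %d" % len(fields))
    t = [float(v) for v in fields[4:16]]
    # The last row of a rototranslation in homogeneous coordinates is implicit.
    matrix = [t[0:4], t[4:8], t[8:12], [0.0, 0.0, 0.0, 1.0]]
    return {
        "id": fields[0],
        "source": fields[1],
        "target": fields[2],
        "overlap": float(fields[3]),
        "transform": matrix,
    }


def apply_transform(matrix, points):
    """Apply a 4x4 homogeneous transformation to a list of (x, y, z) points."""
    return [
        tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix[:3])
        for x, y, z in points
    ]


# Hypothetical line: identity rotation and a translation of (1, 2, 3).
problem = parse_problem("0 source.pcd target.pcd 0.5 1 0 0 1 0 1 0 2 0 0 1 3")
misplaced = apply_transform(problem["transform"], [(1.0, 1.0, 1.0)])
```

The misplaced source cloud, together with the untouched target cloud, is the input of the registration algorithm under test.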
We think that having all the data in the same format is an essential part of a benchmark that is usable. Initially, all the datasets came in their own format; therefore, they had to be converted to a common one. We decided to use the ASCII PCD format because it can be used with the most popular point clouds library, PCL rusu20113d. The binary version of the PCD would have been more efficient; however, we preferred the ASCII version because the data is expressed in plain text, in order to maintain compatibility with those not using PCL. Writing a text parser is, indeed, a very trivial task.
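As an illustration of how little is needed to read the ASCII PCD format without PCL, the following is a minimal parser sketch. It assumes numeric fields with one value each (COUNT 1) and ignores every header entry except FIELDS and DATA; the function name is hypothetical and the sketch is not part of the provided utilities.

```python
def parse_ascii_pcd(text):
    """Parse the text of an ASCII PCD file; return (field_names, rows)."""
    lines = text.splitlines()
    fields = []
    data_start = None
    for i, line in enumerate(lines):
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments
        if tokens[0] == "FIELDS":
            fields = tokens[1:]
        elif tokens[0] == "DATA":
            if tokens[1:] != ["ascii"]:
                raise ValueError("only ASCII PCD data is supported")
            data_start = i + 1
            break
    if data_start is None:
        raise ValueError("missing DATA line")
    # One point per line, one numeric value per field.
    rows = [[float(v) for v in line.split()]
            for line in lines[data_start:] if line.strip()]
    return fields, rows


sample = """# .PCD v0.7 - Point Cloud Data file format
VERSION 0.7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 2
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 2
DATA ascii
0.0 1.0 2.0
3.5 4.5 5.5
"""
names, cloud = parse_ascii_pcd(sample)
```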
Rather than converting all the datasets to the PCD format and storing them on our website, we decided to write a script that downloads the point clouds from the original sources and converts them locally. We think that this choice is more respectful of the original authors, since we did not produce the data. Exceptions to this rule are the digital elevation map we built for the Planetary Emulation Dataset, the KAIST datasets, which would require a manual registration on the authors' website, and the TUM datasets, which would require a complex setup. However, both the KAIST and the TUM datasets are released under the Creative Commons 3.0 License, so we are allowed to redistribute them under the same license.
Finally, the third utility we provide is a library for calculating our metric. Even though it is quite straightforward to compute, to reduce the effort needed for using our benchmark, we decided to develop C++ and Python libraries, compatible with PCL, for calculating the metric.
Summarizing, to use our benchmark the following steps are necessary:

use our script to download the data and prepare the environment;

pick the right set of configuration files, either for global or local point clouds registration. Those describing global registration problems have the _global suffix;

for each line in the chosen configuration files, transform the source point cloud with the corresponding initial transformation;

solve the registration problem with the algorithm to test;

report the results, using the proposed metric.
For the full user guide of the provided software and data, please consult the project web page at https://github.com/iralabdisco/point_clouds_registration_benchmark.
6 Example Usage
To show how our benchmark can be used to measure the performance and highlight the peculiarities of point clouds registration techniques, we used it to test two very popular algorithms: ICP besl1992method and one of its best variants, GICP segal2009generalized. The two algorithms have been tested using the same set of parameters for every sequence, that is:

a voxel subsampling with a leafsize of ;

an outlier rejection step based on the median distance rusu2008aligning. That is, at each step of the algorithm, associations whose distance is greater than three times the median distance of all the associations are discarded;

maximum iterations;
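The median-based outlier rejection listed above can be sketched as follows. This is a minimal illustration, assuming that associations are available as (source index, target index, distance) triples; the function name and the data layout are hypothetical.

```python
from statistics import median


def reject_by_median_distance(associations, factor=3.0):
    """Discard associations farther than factor times the median distance.

    associations: list of (source_index, target_index, distance) triples.
    """
    if not associations:
        return []
    limit = factor * median(d for _, _, d in associations)
    return [a for a in associations if a[2] <= limit]


# The last association is far beyond three times the median distance (0.175).
kept = reject_by_median_distance(
    [(0, 0, 0.1), (1, 1, 0.2), (2, 2, 0.15), (3, 3, 5.0)])
```

In an ICP-like loop, this filter would be applied after each data association step and before estimating the transformation.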
One important characteristic of our benchmark is that, given the large number of problems and their heterogeneity, it is practically impossible to fine-tune the parameters of an algorithm to each single problem. In this way, the reported results are much more realistic and better represent a real-world usage of the algorithms.
The aim of this section is not to provide a meaningful comparison between ICP and GICP. Indeed, the termination criteria, the outlier rejection and the subsampling methods used are too simple for a real evaluation. Moreover, no effort has been made to tune their parameters.
Instead, our goal is to show how the proposed benchmark should be used and how sequences coming from different datasets, together with a large set of initial misalignments and overlaps, allow us to draw conclusions that would be impossible otherwise.
Sequence  Median  0.75 Quantile  0.95 Quantile  Mean  Std Dev 

stairs  0.0963  0.2407  0.6098  0.1758  0.2224 
hauptgebaude  0.0706  0.2055  0.3621  0.12  0.1247 
gazebo_summer  0.212  0.4808  0.8619  0.3037  0.2897 
gazebo_winter  0.129  0.2774  0.5184  0.1761  0.1715 
plain  0.2694  0.4196  0.6567  0.2927  0.1938 
apartment  0.2065  0.5477  1.2065  0.3574  0.3804 
wood_autumn  0.1582  0.314  0.5228  0.1958  0.1751 
wood_summer  0.1889  0.3454  0.5266  0.2131  0.178 
planetary_map  0.4486  0.6474  0.9375  0.4734  0.2691 
box_met  0.4614  0.6381  0.8939  0.4757  0.2353 
p2at_met  0.4254  0.6227  0.9415  0.4524  0.2788 
long_office_household  0.4152  1.0447  2.146  0.6876  0.7177 
pioneer_slam  0.4428  1.2433  2.9762  0.8753  0.9503 
pioneer_slam2  0.3324  0.6028  1.3503  0.4823  0.5042 
pioneer_slam3  0.4526  0.7306  1.3036  0.5413  0.4293 
urban05  0.6096  1.0232  1.9108  0.7744  0.5458 
total  0.2854  0.5456  1.2392  0.4124  0.4707 
Sequence  Median  0.75 Quantile  0.95 Quantile  Mean  Std Dev 

stairs  0.0302  0.2896  1.2137  0.2575  0.4655 
hauptgebaude  0.0039  0.0067  0.7633  0.1024  0.3725 
gazebo_summer  0.4475  0.9862  1.9032  0.625  0.6572 
gazebo_winter  0.0201  0.0993  0.871  0.1576  0.3179 
plain  0.0478  0.2137  0.7094  0.1636  0.2463 
apartment  0.2023  1.0425  1.7716  0.5458  0.6443 
wood_autumn  0.024  0.2057  0.8209  0.1765  0.2913 
wood_summer  0.0172  0.0934  0.8396  0.1455  0.3053 
planetary_map  0.3784  0.7476  2.1199  0.6302  0.889 
box_met  1.071  1.7599  2.7741  1.2228  0.8858 
p2at_met  0.3369  0.999  2.7925  0.7352  1.0227 
long_office_household  0.4842  1.3282  2.4517  0.8371  0.9675 
pioneer_slam  0.4885  1.911  3.6624  1.1367  1.3115 
pioneer_slam2  0.2102  0.9658  2.2467  0.5927  0.7445 
pioneer_slam3  0.1833  0.5739  2.0155  0.4669  0.6609 
urban05  0.4582  0.6454  1.0891  0.511  0.272 
total  0.1656  0.7139  2.1162  0.5193  0.7787 
Similarly to Pomerleau et al. pomerleau2013comparing, we describe the results in terms of quantiles: 0.5, that is, the median, 0.75 and 0.95. A small median means that the results are accurate, while the precision is evaluated by comparing different quantiles: larger differences mean less precise results (that is, the algorithm finds a good solution less consistently).
We think that these statistics describe the results better than the mean and standard deviation, since the error distribution is far from Gaussian. However, we decided to report the mean and the standard deviation too, for those interested. As can be seen, for sequences with a large standard deviation, the mean is very different from the median. For example, in the pioneer_slam sequence, with the ICP algorithm, the mean is about twice the median. This is due to the median being less affected by extreme values.
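The statistics reported in the tables can be computed with a few lines of code. The sketch below is only illustrative: it assumes quantiles obtained by linear interpolation between order statistics and the sample (n-1) standard deviation, conventions that may differ from those of the provided library; the function names are hypothetical.

```python
from statistics import mean, stdev


def quantile(values, q):
    """Quantile q in [0, 1], by linear interpolation between order statistics."""
    if not values:
        raise ValueError("cannot compute a quantile of an empty list")
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


def summarize(errors):
    """Median, 0.75 and 0.95 quantiles, mean and standard deviation."""
    return {
        "median": quantile(errors, 0.5),
        "q75": quantile(errors, 0.75),
        "q95": quantile(errors, 0.95),
        "mean": mean(errors),
        "std": stdev(errors),
    }


# A single extreme value moves the mean far from the median.
stats = summarize([1.0, 2.0, 3.0, 4.0, 100.0])
```

In this toy example the median is 3 while the mean is 22, which mirrors the effect discussed above for sequences with a few very large errors.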
We calculated statistics both for single sequences and for the whole set of experiments.
GICP overall obtains a better median score than ICP, but with a much lower precision, that is, there is a much greater variability in the results. We can infer this from the values of the 0.75 and 0.95 quantiles, which are larger than those of ICP (especially the 0.95 quantile, which is almost twice as large). Figure 5 visually compares the results of the two algorithms, by showing the quantiles on different sequences. GICP, in terms of median errors, performs much better than ICP, with the exception of the gazebo_summer and box_met sequences, where GICP got a median error about twice that of ICP. While one could suppose that this is due to the kind of environment depicted or to the sensor used, we discarded this hypothesis. The reason is that, on the gazebo_winter sequence, which represents the same environment as gazebo_summer, GICP obtained much better results. The same effect can also be observed among the pioneer sequences, part of the TUM datasets: these are very similar to each other, yet GICP performed very differently on them. In contrast, the median error of ICP is much more uniform among similar datasets: the median results on sequences coming from the same group (ETH, planetary, TUM and KAIST) are very similar. This similarity would not emerge if we were not using many different sequences with different characteristics. Moreover, using only the ETH dataset, even though it is a large dataset with many different environments, the performance of ICP would not be correctly estimated, since on that dataset it got a median error that is about half of that on the other datasets.
We also wanted to analyse the effect of the overlap and of the initial misalignment on the result. For this reason, for each sequence we calculated Spearman's rank correlation coefficient between these three variables, i.e., overlap, initial misalignment, and performance. The values of the coefficient are shown in Table 6. For many sequences, the correlation coefficient is very small and not consistent among sequences from the same group. For this reason, we think that the data do not allow any meaningful conclusion about the correlation between the variables considered. It appears that the result is much more affected by the structure of the scene than by the overlap and the initial misalignment.
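For reference, Spearman's rank correlation coefficient is the Pearson correlation computed on the ranks of the two variables. The following is a minimal sketch, assuming that ties receive the average of their ranks; the function names and the example data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting from 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative data: the error decreases monotonically as the overlap grows.
rho = spearman([0.2, 0.4, 0.6, 0.8], [0.9, 0.5, 0.3, 0.1])
```

Being based on ranks, the coefficient captures any monotonic relation between, e.g., overlap and final error, not only a linear one.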
Sequence  Algorithm  Corr. Overlap  

stairs  ICP  0.470421  0.102165 
stairs  GICP  0.013834  0.669935 
hauptgebaude  ICP  0.801717  0.075400 
hauptgebaude  GICP  0.065390  0.267017 
gazebo_summer  ICP  0.531707  0.431938 
gazebo_summer  GICP  0.181935  0.547136 
gazebo_winter  ICP  0.780189  0.018214 
gazebo_winter  GICP  0.147126  0.713859 
plain  ICP  0.785645  0.011057 
plain  GICP  0.292328  0.307809 
apartment  ICP  0.277156  0.561736 
apartment  GICP  0.055729  0.686675 
wood_summer  ICP  0.900725  0.014948 
wood_summer  GICP  0.149154  0.724037 
wood_autumn  ICP  0.886017  0.039354 
wood_autumn  GICP  0.158103  0.757746 
planetary_map  ICP  0.723658  0.036039 
planetary_map  GICP  0.330032  0.221180 
box_met  ICP  0.619464  0.267591 
box_met  GICP  0.162111  0.561116 
p2at_met  ICP  0.653034  0.053123 
p2at_met  GICP  0.127505  0.357922 
long_office_household  ICP  0.492463  0.186615 
long_office_household  GICP  0.124597  0.447805 
pioneer_slam  ICP  0.277569  0.575593 
pioneer_slam  GICP  0.062829  0.647558 
pioneer_slam2  ICP  0.480434  0.203205 
pioneer_slam2  GICP  0.082953  0.464892 
pioneer_slam3  ICP  0.501844  0.121106 
pioneer_slam3  GICP  0.018522  0.689133 
urban05  ICP  0.534303  0.128514 
urban05  GICP  0.899925  0.079611 
7 Conclusions
In the field of point clouds registration, most approaches are tested on very few data, often collected ad hoc. For this reason, the results are hardly generalizable and do not allow a comparison with other works that have been tested with different data. We present a new benchmark for point clouds registration algorithms, whose main goal is to allow a rigorous comparison between different approaches. It is composed of several publicly available datasets, chosen to cover an extensive set of scenarios and use cases, including settings usually neglected, but still very relevant for some real applications. The benchmark can be used to test both global and local registration algorithms, with different initial misalignments and different degrees of overlap. For each sequence, in each dataset, we randomly chose a list of pairs of point clouds in a way that ensures uniform coverage of the various degrees of overlap. For each pair, we randomly generated a list of rototranslations to be applied as initial misplacements. These rototranslations are, indeed, the transformations that a registration algorithm should estimate.
We also propose a new metric to measure the residual error after the registration and to objectively compare different approaches. To encourage the use of our benchmark, we developed a set of utilities to set up the testing environment and to calculate the necessary metrics: our goal was to reduce as much as possible the effort required of authors.
Instructions on how to use the protocol and the related utilities are available on the project web page: https://github.com/iralabdisco/point_clouds_registration_benchmark. We would be glad to list works using our benchmark. To add a paper to the list, please contact the corresponding author.