Local shape descriptors have seen extensive use in a wide variety of applications where determining shape correspondences is beneficial or even required. Such applications include registration [novatnack2008scale] [malassiotis2007snapshots] [yamany2002surface], shape segmentation [ovsjanikov2011exploration] [hu2012co] [wu2013unsupervised], and retrieval [3dor.20161082] [3dor.20171051].
Many local 3D shape descriptor methods rely on the surfaces present in the volume around a point to compute the degree to which two points are similar. This also makes them susceptible to any unwanted geometry present in the neighbourhood, commonly referred to as clutter. For this reason, clutter has been named as a major factor degrading the performance of current descriptors [guo2016comprehensive].
The degree to which different descriptors are capable of resisting the negative effects of clutter varies. One classical method which has been shown to be significantly resistant to clutter is the Spin Image (SI) [johnson1999using]. This descriptor is invariant under rigid transformations, and has been applied successfully in applications such as shape registration [huber2003fully] [kakadiaris2007three].
In this paper, we present the Radial Intersection Count Image (RICI) combined with a novel distance function. The new descriptor shares the original concept of the Spin Image but is advantageous in terms of its generation speed and clutter resistance.
In order to show the effectiveness of the RICI, we propose a repeatable experiment aimed at quantifying the effects of clutter on the matching performance of 3D shape descriptors. The main advantage of this evaluation method is that it can be used with datasets of any size, and ensures scenes are cluttered with natural shapes.
In summary, the contributions of this paper are:
The novel RICI descriptor and an accompanying distance function, capable of resisting clutter.
Algorithms for efficient generation of RICI descriptors, also capable of accelerating SI construction.
The clutterbox experiment for quantifying the effects of clutter.
Evidence that the Support Angle filter proposed in the original SI paper does not necessarily improve matching performance.
Freely available GPU implementations for generating and comparing Spin Image, 3DSC, and RICI descriptors, as well as an implementation of the proposed clutterbox experiment.
2 Background and Related Work
Numerous local shape descriptors have been proposed to date [guo2016comprehensive]. The Spin Image has been the foundation for a number of methods that attempt to improve its matching performance or address other limitations. Clutter is a major challenge for object descriptors, yet few methods have addressed it.
2.1 Spin Images
The Spin Image [johnson1999using], originally presented by Johnson et al., is a classic descriptor generated from an oriented point cloud (vertices with position and normal).
An SI is constructed around an oriented point, the position of which is in this paper referred to as the Spin Vertex. The corresponding normal is referred to as the Spin Normal. Together, the oriented point describes a line, which is called the Central Axis.
Computing the descriptor involves placing a square plane whose left side is on the Central Axis, with the Spin Vertex at its vertical halfway point. This plane is subsequently subdivided into equivalently sized bins, and rotated for one revolution around the Central Axis. As the plane rotates, the number of point samples intersecting each bin is counted. The descriptor itself is a histogram of the resulting value of each bin, which can be visualised as an image.
In practice, the locations where point samples will intersect with the rotating square can be computed directly as two-dimensional cylindrical coordinates. Here one coordinate is the distance from the point sample to the closest point on the Central Axis, and the other is the distance from this closest point to the Spin Vertex. The projection of a given point is shown in Figure 1.
The physical width and height of the square plane is the Support Radius of the descriptor. By rotating the plane around the Central Axis, a cylindrical volume is created, which represents the Support Volume of the descriptor. Additionally, point sample contributions are divided over nearby bins using bilinear interpolation to reduce the effects of aliasing.
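The bilinear accumulation of a projected point can be sketched as follows. The bin layout (rows indexing the axial coordinate over [-R, R], columns the radial coordinate over [0, R]) and the function name are our own illustration, not the original SI paper's notation:

```python
def accumulate_bilinear(image, alpha, beta, support_radius):
    """Spread one projected point over the four surrounding spin-image bins.

    alpha: distance to the Central Axis, in [0, support_radius]
    beta:  signed distance along the axis, in [-support_radius, support_radius]
    """
    n = len(image)
    # Fractional bin coordinates of the projected point
    col = alpha / support_radius * (n - 1)
    row = (beta + support_radius) / (2 * support_radius) * (n - 1)
    c0, r0 = int(col), int(row)
    fc, fr = col - c0, row - r0
    # Distribute unit weight over the four neighbouring bins
    for dr, dc, w in ((0, 0, (1 - fr) * (1 - fc)), (0, 1, (1 - fr) * fc),
                      (1, 0, fr * (1 - fc)), (1, 1, fr * fc)):
        r, c = r0 + dr, c0 + dc
        if 0 <= r < n and 0 <= c < n and w > 0:
            image[r][c] += w
```

Summing the four weights always yields one, so each point sample contributes exactly one unit of mass to the histogram.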
Johnson et al. also describe a prefiltering step called the Support Angle, where a sample oriented point is not included in the computation of the descriptor if the angle between its normal vector and the Spin Normal exceeds a set threshold.
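A minimal sketch of this filter, assuming unit-length normals stored as NumPy arrays; the function name and the 60-degree default threshold are illustrative choices, not taken from the original paper:

```python
import numpy as np

def support_angle_filter(points, normals, spin_normal, max_angle_deg=60.0):
    """Keep only oriented points whose normal deviates from the spin
    normal by at most max_angle_deg (the Support Angle threshold)."""
    cos_threshold = np.cos(np.radians(max_angle_deg))
    # The dot product of unit normals equals the cosine of the angle
    # between them, so the angle test reduces to a cosine comparison.
    keep = normals @ spin_normal >= cos_threshold
    return points[keep], normals[keep]
```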
The descriptor’s core idea is that a pair of points surrounded by identical surfaces, assuming both have been uniformly sampled, will have proportional quantities of projected points in similar locations. Images can thus be compared using statistical correlation.
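As an illustration, the correlation-based comparison can be sketched as the Pearson coefficient of the two images treated as flat sequences of bin values; note that the division is undefined when either image is constant:

```python
import numpy as np

def spin_image_similarity(a, b):
    """Pearson correlation coefficient between two spin images,
    treated as flat sequences of bin values. Proportional images
    score 1.0; the result is undefined for constant images."""
    a = a.ravel().astype(float)
    b = b.ravel().astype(float)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom
```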
2.2 Methods related to the SI
One of the major issues with the Spin Image is its volatility. Uniform sampling of triangle meshes as well as scans from 3D capture devices are inherently noisy. Carmichael et al. proposed a method to address this by computing the exact area of the support region intersecting each pixel [carmichael1999large].
Other methods aim to address specific limitations of the spin image. Assfalg et al. proposed the spin image signature, aimed at simplifying image retrieval from a large database [assfalg2007content]. Dinh et al. addressed the issue of selecting bin sizes by creating a spin image variant with variable-sized histogram bins [dinh2006multi], although their solution involves the manual setting of parameters.
An alternate spin image variant, proposed by Guo et al., uses three spin images per vertex rather than a single one for better matching performance [guo2013trisi]. Accelerating spin image generation using a GPU was first proposed by Davis et al. [davis20083d] [gerlach2011accelerating]. Other derivative methods include Spin Contours, proposed by Liang et al. [liang2015geodesic], and colour spin images by Pasqualotto et al. [pasqualotto2013combining].
2.3 The 3D Shape Context
The 3D Shape Context, proposed by Frome et al. [frome2004recognizing], is a histogram descriptor constructed by accumulating points by their spherical coordinates and distance relative to an oriented reference point in a spherical support region. The support region is divided into equally spaced spherical wedges, centred around the central axis described by the reference oriented point (similar to the SI). Each wedge is subsequently divided into elevation divisions. The bins are finally formed by intersecting each radial and elevation division with the volume bounded by successive spheres of exponentially increasing radii.
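For example, the exponentially increasing sphere radii can be generated with logarithmic spacing between a minimum and maximum support radius. The function below is an illustrative sketch of this spacing, not the exact formulation of Frome et al.:

```python
import numpy as np

def radial_bin_edges(r_min, r_max, n_shells):
    """Logarithmically spaced sphere radii, so shell thickness grows
    with distance from the reference point."""
    return np.exp(np.linspace(np.log(r_min), np.log(r_max), n_shells + 1))
```

The resulting edges have a constant ratio between successive radii, which keeps bins near the reference point small (where the descriptor should be precise) and distant bins large.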
The descriptor has a degree of freedom around the Central Axis, which the authors resolve by generating different descriptors for each vertex, where each of the wedges has been offset by a multiple of the wedge angle. However, due to its self-symmetry, this step is unnecessary for descriptors used for querying.
2.4 Other Clutter-Resistant Shape Matching Methods
In addition to the Spin Image and 3DSC, some other proposed methods have been shown to perform better in cluttered scenes than their peers [guo2016comprehensive] [mian2006three].
Mian et al. presented a method which creates a three-dimensional grid of voxels based on two randomly selected vertices, referred to as a Tensor [mian2006three]. Their method outperforms the Spin Image, and shows resistance to clutter present in the scene.
The THRIFT descriptor, proposed by Flint et al. [flint2007thrift], uses an approach similar to the Scale-Invariant Feature Transform (SIFT) by Lowe [lowe2004distinctive]. The method aims to find distinctive points which can be detected reliably under a wide range of conditions. This is accomplished by computing a three-dimensional density map of the input point cloud, and selecting interest points by locating local maxima of the Hessian matrix.
Local surface patches, proposed by Chen et al. [chen20073d], is a two-dimensional histogram descriptor generated from points in an oriented point cloud. Each descriptor accumulates points in a spherical support volume, by their shape index and the cosine of the angles between their normal vectors. The authors only test their method on range images, and do not expose the descriptor to significant levels of clutter themselves. However, experiments performed in the review by Guo et al. [guo2016comprehensive] suggest that this method performs well in cluttered scenes.
Unfortunately, the above works on clutter resistant descriptors used very small datasets for testing their methods (1 to 56 objects). Therefore, the provided results may be statistically biased, since the proposed descriptors were not subjected to a sufficiently wide range of possible surface features. The datasets used were also not made public, making it difficult to compare their results. In addition, some used very similar objects (such as cars), presumably for ease of creation, which is not representative of all forms of clutter that can be encountered in a real scene.
2.5 Learning Approaches
More recent shape matching methods have attempted to utilise Neural Networks. One of the major hurdles these methods need to overcome is the inherent irregularity present in 3D shape data, as opposed to more regular data such as images on which learning methods have been applied successfully.
To this end, many methods, such as the PPFNet proposed by Deng et al. [deng2018ppfnet], make use of existing descriptors or features in a pre-processing step to regularise the input to the neural network. PPFNet specifically uses point pair features, and was shown to outperform many current state-of-the-art handcrafted methods.
Another regularisation approach is the voxelisation of the input point cloud or mesh, which has amongst others been exploited in the 3DMatch method proposed by Zeng et al. [zeng20173dmatch], who successfully apply their proposed method on point cloud alignment and keypoint matching, outperforming both handcrafted and earlier learning methods.
While these learning methods show great promise, their applicability depends highly on the dataset used for training, and they may require retraining for new environments. Moreover, current learning methods tend to be highly computationally expensive, which can limit their applicability to small datasets only [ioannidou2017deep].
3 Radial Intersection Count Images (RICI)
We now detail the novel RICI descriptor, which shares some conceptual similarities with the original Spin Image and was preliminarily proposed as a quasi Spin Image [vanquasi].
3.1 RICI Generation
A RICI descriptor is a 2D histogram of integers. It is constructed around an oriented point, and has a Central Axis around which a square plane is conceptually rotated, similar to the Spin Image. The square plane is divided into a square grid of equally sized bins, producing a histogram which can be visualised as a grayscale image.
The primary difference between the RICI and the SI is what is counted in each histogram bin. In Spin Images, projected point samples are accumulated to create an estimate of the surface area intersecting each bin or pixel as the square plane is rotated for a full revolution. In contrast, RICI bins count the number of intersections of circles with the surfaces of the scene and are thus integers.
The conceptual construction method, i.e. the relationship between the aforementioned intersection circles and the produced descriptor, is visualised in Figure 3. Consider a set of circles that are centred on the Central Axis at fixed distances from the Spin Vertex, with a fixed set of radii. Each bin in the RICI image stores the number of intersections of the corresponding circle with the surfaces of the scene. RICI rows thus represent circles on the same plane, and RICI columns circles with equivalent radii.
The remainder of this section presents a method for efficiently computing RICI descriptors. The general idea is to iterate over each triangle in the scene, and determine the set of circles in cylindrical coordinates (see Figure 1) which will intersect with it. This implies a complexity of O(T), where T is the number of triangles in the scene, as in the worst case the number of circles per triangle is fixed and equal to the resolution of a RICI image. The bins corresponding to these circles are incremented. Note that cylindrical projections do not preserve the linearity of a triangle’s edges (as shown in Figure 2), which precludes the use of common rasterisation methods. Instead, we exploit a circle-triangle intersection algorithm in order to determine the correct projections.
To summarise, a RICI image is generated by iterating over each triangle in the scene, and in turn each triangle is processed in 3 steps:
Project the triangle vertices into cylindrical coordinate space, as described in Section 3.1.1.
Using the circle-triangle intersection method outlined in Section 3.1.2, compute the range of circle radii which intersect with the triangle for each row within the triangle’s vertical extent.
Increment the histogram bins that correspond to these intersections.
3.1.1 Projecting Vertices into Cylindrical Coordinate Space
An efficient method for projecting points from Euclidean coordinates into cylindrical coordinates is presented. Apart from the RICI, this method can also be applied directly in the construction of SI descriptors.
The algorithm projects a point by computing two transformations. First, a translation that moves the Spin Vertex to the origin (Equation 2), and second, a rotation which aligns the Spin Normal with the z-axis. The projected point’s and coordinates can be computed trivially afterwards.
For the z-axis alignment transformation, a common technique for aligning two vectors consists of a vector product followed by a rotation (shown in Figure 4). While the vector product itself is inexpensive (due to one of the vectors being the z-axis), the subsequent alignment rotation requires a relatively expensive multiplication with a 3x3 matrix.
Our alignment method instead uses two rotations, exploiting the observation that only the distance to the Central Axis must be preserved. We first align the spin normal with the xz-plane using a rotation around the z-axis (see Figure 4(a) and Equation 3). We then align the transformed normal with the z-axis by a rotation around the y-axis (Figure 4(b) and Equation 4).
The coefficients of the two rotation transformations can be calculated inexpensively from components of the spin normal, as shown in Equation 1. When both coefficients of either rotation are zero, that rotation step is unnecessary and an identity rotation is used instead. The key observation is that, in a two-dimensional coordinate system, the coordinates of a normalised vector represent the sine and cosine of the rotation which aligns that vector with the x-axis. These normalised coordinates can therefore be used directly for this purpose.
It should be noted that since the rotation coefficients only depend on the spin normal, they are constant for the entire spin image. Therefore they only need to be computed once per image, essentially taking this computation out of the inner loop. This is the primary reason for the method’s efficiency compared to previous work.
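The two-rotation projection can be sketched as follows. The coordinate names alpha and beta and the exact factoring of the rotations are our own illustration of the approach, not the paper's exact equations; note that the rotation coefficients are computed once, outside the per-point loop:

```python
import numpy as np

def cylindrical_projection(points, spin_vertex, spin_normal):
    """Project points into (alpha, beta) cylindrical coordinates around
    the axis (spin_vertex, spin_normal) using two axis-aligned rotations
    instead of a full 3x3 rotation matrix."""
    nx, ny, nz = spin_normal
    # Coefficients of the rotation about z that brings the normal
    # into the xz-plane; identity if the normal already lies there.
    t = np.hypot(nx, ny)
    cos_a, sin_a = (nx / t, ny / t) if t > 0 else (1.0, 0.0)
    # Coefficients of the rotation about y that aligns the normal with z.
    s = np.hypot(t, nz)
    cos_b, sin_b = nz / s, t / s

    projected = []
    for p in np.asarray(points, dtype=float) - np.asarray(spin_vertex, dtype=float):
        # Rotate about the z-axis
        x = cos_a * p[0] + sin_a * p[1]
        y = -sin_a * p[0] + cos_a * p[1]
        # Rotate about the y-axis
        x2 = cos_b * x - sin_b * p[2]
        z2 = sin_b * x + cos_b * p[2]
        alpha = np.hypot(x2, y)   # distance to the Central Axis
        beta = z2                 # signed height relative to the Spin Vertex
        projected.append((alpha, beta))
    return projected
```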
3.1.2 Circle-Triangle Intersection
A circle-triangle intersection test can result in four outcomes: no intersection, one intersection, two intersections, or infinitely many intersections. However, due to floating-point rounding errors, handling the latter case is not practical, and it is thus not addressed by the proposed algorithm.
Our algorithm starts with the triangle vertices in cylindrical coordinate space. For a given height along the Central Axis, it determines the range of circle radii which result in a single or double intersection. This information is subsequently used to “rasterise” a row of pixels for the triangle in the RICI descriptor.
The method operates in three distinct stages. First, the triangle is intersected with the plane of the circle, which is parallel to the xy-plane, as shown in Figure 6. Next, the triangle vertices are rotated around the z-axis in order to simplify subsequent computations. Finally, the ranges of circle radii in which single and double intersections occur, respectively, are calculated.
Prior to detailing these stages individually, we will outline the geometric background used in the intersection test calculations.
Figure 6 illustrates the situation for a given height along the Central Axis. The triangle being tested is defined by its vertices, transformed using the previously described alignment transformation. Here all points with equal height coordinates lie on the same plane.
Where the triangle intersects the plane, it forms an intersection line segment, which defines a line. The range of radii either intersecting the triangle once or twice can be calculated by determining which radii intersect with this line segment. This reduces the determination of intersection distances to a two-dimensional problem.
For single intersections, the lower and upper bounds of the radii are the distances from the Central Axis to the two endpoints of the intersection segment. Note that the 2D coordinates of these endpoints are equivalent to their offset vectors from the Central Axis.
A double intersection occurs when the closest point on the intersection line to the Central Axis lies on the line segment itself. When double intersections exist, the range of radii in which they occur spans from this closest distance up to the smaller of the two endpoint distances.
Given the aforementioned background, the next step of our method is aligning the intersection segment with the x-axis, as illustrated in Figure 7. The objective of this step is to simplify the remaining calculations for the intersection test. Alignment is done by normalising the vector between the two segment endpoints, and subsequently rotating the triangle vertices around the z-axis; the coordinates of the normalised vector can be used directly as sine and cosine coefficients for the rotation.
At this stage, determining the existence of a double intersection is inexpensive, and can be achieved by comparing the signs of the x-coordinates of the aligned endpoints. Different signs indicate that a double intersection exists. If so, the rotated y-coordinate of either endpoint represents the lower bound of radii which correspond to double intersections.
The intersection test itself can be done by comparing a given radius against the computed ranges, which yields an intersection count corresponding to that radius.
Summarising, computing the range of radii that will result in a single or double intersection for a given height involves the following steps:
Determine the two intersection points at which the triangle crosses the plane of that height, as shown in Figure 6.
Rotate these points around the z-axis such that the intersection segment is aligned with the x-axis (as shown in Figure 7).
Determine the distance of each point from the z-axis.
The range of circle radii in which single intersections occur is bounded by these two distances.
Determine the existence of a double intersection by comparing the signs of the x-coordinates of the two points. If they differ, a double intersection exists.
If a double intersection exists, the range of circle radii corresponding to it is bounded by the absolute y-coordinate of either point (the shortest distance between the z-axis and the segment) and the shorter of the two point distances.
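A 2D sketch of the radius-range computation for one intersection segment, with endpoint coordinates given relative to the Central Axis (assumed non-degenerate; variable names are illustrative):

```python
import math

def intersection_radius_ranges(p1, p2):
    """Given the 2D endpoints of the intersection segment (relative to the
    Central Axis at the origin), return (single_range, double_range), where
    double_range is None when no radius crosses the segment twice."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    c, s = dx / length, dy / length
    # Endpoint coordinates in a frame where the segment lies along the x-axis
    x1 = c * p1[0] + s * p1[1]
    y1 = -s * p1[0] + c * p1[1]   # perpendicular offset of the supporting line
    x2 = c * p2[0] + s * p2[1]
    r1, r2 = math.hypot(*p1), math.hypot(*p2)
    single = (min(r1, r2), max(r1, r2))
    if x1 * x2 < 0:
        # The closest point of the line to the origin lies inside the segment:
        # radii between |y1| and min(r1, r2) cross the segment twice.
        double = (abs(y1), min(r1, r2))
    else:
        double = None
    return single, double
```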
3.2 A Clutter-Resistant RICI Distance Function
Spin Images, by their nature of being generated from oriented point clouds, are inherently noisy. They have as such relied on statistical correlation to compute similarity. The idea here is that two matching bins tend to have proportionally similar accumulated sample counts. Unfortunately, this method is susceptible to the effects of clutter. Additional geometry present in the support volume causes portions of the image to receive additional projected point samples, which consequently negatively affects the computed correlation value.
When it comes to comparing RICIs, one important downside of the Pearson correlation coefficient is that it is not defined for sequences of constant values. While this scenario is unlikely to occur for Spin Images, there exist situations in which RICIs consist solely of pixels with equivalent intersection counts. For these situations, the Pearson correlation coefficient is undefined, making it unsuitable for comparing RICIs. Handling these edge cases separately is possible, but requires carefully balancing the scores awarded in such cases against those of normal situations.
Meanwhile, the RICI does not have the aforementioned issue of noise, and is as such not bound solely to using statistical methods for measuring similarity. For these reasons we propose a new distance function, which is by design able to resist some of the negative effects of clutter, primarily by exploiting features of the RICI.
First, the distance function does not consider the absolute values of pixels in the RICI. Instead, changes in pixel values (i.e. changes in intersection counts, which show up as edges in the RICI) are compared. As RICIs are free of noise, it is possible to interpret pixel values directly. The main advantage of this approach is that changes in intersection counts are largely unaffected by clutter. The reason for this can be seen in Figure 8.
In Figure 7(a), a cross section is shown of an arbitrary 3D shape. On the same plane, circles are drawn with increasing radii, similar to how RICI images are computed. The numbers below each circle indicate the number of intersections they encounter, which corresponds to the value of their respective pixels in the RICI image.
Similarly, Figure 7(b) shows the same situation after a clutter object has been added. From the intersection counts it can be seen that, even though the absolute intersection counts have changed, the change in intersection counts from the third to the fourth circle, caused by the original object, is still present.
Second, when searching, our distance function treats the needle (query) and the haystack image asymmetrically, in contrast to the Pearson correlation coefficient. One can use the needle image to deduce what features to look for in a given haystack image.
This asymmetry consists of computing a sum of squared differences only over pixels where the needle RICI image contains changes.
```python
def clutterResistantDistance(needle, haystack):
    n_bins = len(needle)
    score = 0
    for r in range(n_bins):
        # Skip the first column: each delta compares a pixel to its left neighbour
        for c in range(1, n_bins):
            needleDelta = needle[r][c] - needle[r][c - 1]
            haystackDelta = haystack[r][c] - haystack[r][c - 1]
            # Only columns where the needle image changes contribute
            if needleDelta != 0:
                score += (needleDelta - haystackDelta) ** 2
    return score
```
Returning to Figure 8, we’ll assume that Figure 7(a) shows a cross section of the needle object that we are attempting to locate in the cluttered haystack scene shown in Figure 7(b). In our needle image, only the increased intersection counts from the third to the fourth circle are relevant. Including other pixels is not relevant, as there are no changes in the needle image’s intersection counts. We can therefore ignore these pixels in our distance computation. This also means any clutter present in the haystack image is ignored by this method.
The proposed Clutter Resistant Distance function is shown in Equation 7, and the corresponding pseudocode is given in Listing 3.2. Note that the distance function is non-negative, but not symmetric. Its complexity is constant with respect to scene size, because comparing a descriptor pair requires a fixed number of operations, determined by the image resolution.
4 Evaluation
The proposed method has been evaluated in terms of its clutter resistance, generation speed, and matching performance. Where applicable, we compare our method against the two methods most referenced among those listed in the survey [guo2016comprehensive] as being clutter resistant: the Spin Image (which [tombari2010unique] and [guo2013rotational] also support as a clutter resistant descriptor) and the 3D Shape Context. It is worth noting that the survey also observes that popular descriptors such as the Fast Point Feature Histogram [Rusu_ICRA2011_PCL], Unique Signatures of Histograms [tombari2010unique], and Rotational Projection Statistics [guo2013rotational] do not exhibit optimal performance under cluttered conditions. We have therefore implemented the two aforementioned clutter resistant methods on the GPU, to allow a direct comparison on the same dataset.
The novel Clutterbox Experiment is proposed in order to evaluate the effect of clutter on the descriptors’ matching performance.
4.1 The Clutterbox Experiment
In previous work, clutter has typically been defined as the proportion of surface area within the support volume that does not belong to the object being recognised. Greater proportions of clutter generally imply worse descriptor performance. The expression used in previous work, initially proposed by Johnson et al. [johnson1999using], is shown in Equation 8. It relates the surface area of all objects within the support volume to the surface area of the object of interest.
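Assuming Equation 8 takes the form of a simple area ratio, the clutter level at a point could be computed as follows (a sketch in our own notation, not the paper's exact formula):

```python
def clutter_fraction(total_area, object_area):
    """Fraction of surface area in the support volume that does not
    belong to the object of interest.

    total_area:  surface area of all objects within the support volume
    object_area: surface area of the object of interest within it
    """
    return (total_area - object_area) / total_area
```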
The objective of the proposed evaluation method, which we call the “clutterbox experiment”, is to measure the relationship between increasing levels of clutter and the resulting performance of the descriptor being tested.
In previous clutter experiments, clutter has generally been evaluated by measuring descriptor performance against levels of clutter present at points in a scene without controlling the points’ identities. However, this measures the effects of two parameters combined; the descriptor’s ability to recognise the desired shape, and the level of clutter present around it. Ideally an evaluation of the effects of clutter should control the former of these parameters, while varying the latter. This is the primary objective that the clutterbox experiment addresses.
Varying clutter levels in the neighbourhood of an object can be done trivially by adding triangles, points, spheres, or cubes in random locations and sizes around an object. However, this kind of clutter is not representative of the clutter that can be expected in a realistic 3D scene. The clutterbox experiment therefore inserts complete objects rather than random noise. This results in a more natural distribution of clutter in the scene, and therefore more directly measures the effect of clutter that can be expected of a given descriptor when applied in a practical context.
The clutterbox experiment is executed a large number of times by varying objects and their transformations, in order to provide robust results, independent of object type.
The steps of the experiment are outlined below:
Define the clutterbox as a cube of side .
Select objects at random from a large object collection.
Scale and translate each object such that it fits exactly inside a unit sphere.
Pick one of the objects at random. This is the reference object.
Compute the reference descriptor set , by computing one descriptor for each unique vertex of the reference object.
For each of the objects in random order, but starting with the reference object:
Place the object within the clutterbox, at a randomly chosen orientation and position, with the constraint that the bounding sphere fits entirely within the clutterbox.
Compute the set of cluttered descriptors , by computing one descriptor for each unique vertex of the combined mesh in the clutterbox.
For each descriptor in the reference set, create a list of ranked distances to all cluttered descriptors. Keep the rank at which the corresponding cluttered descriptor was found in the ranked list. Note that lower ranks are better.
Create a histogram where each bin holds the number of times the correct vertex is found in the search results at the corresponding rank.
Thus the output of the clutterbox experiment is a list of histograms, one for each level of clutter. A visualisation of a sequence of scenes with increasing clutter generated by the above experiment is shown in Figure 9.
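The ranking step of the experiment can be sketched as follows, with descriptors and the distance function left abstract. Counting only strictly smaller distances gives tied distances the best (lowest) rank; names and signatures are our own illustration:

```python
def rank_histogram(needle_descs, haystack_descs, distance):
    """For needle descriptor i, the correct haystack descriptor is assumed
    to be at index i. Its rank is the number of haystack descriptors with a
    strictly smaller distance, so tied distances receive the best rank."""
    hist = {}
    for i, nd in enumerate(needle_descs):
        d_correct = distance(nd, haystack_descs[i])
        rank = sum(1 for h in haystack_descs if distance(nd, h) < d_correct)
        hist[rank] = hist.get(rank, 0) + 1
    return hist
```

In practice this brute-force loop runs as a GPU kernel; the sketch only captures the rank bookkeeping.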
4.2 Clutter Resistance Evaluation
We used the clutterbox experiment to quantify the effects of clutter on the SI and 3DSC versus the proposed RICI descriptor. For our object collection, we selected the combined SHREC2017 dataset [savva2017shrec], which consists of 51,162 triangle meshes.
In the case of the SI and 3DSC, the combined triangle mesh of the reference and clutter objects was sampled into a point cloud before generating their descriptors; RICI descriptors are generated from the triangle mesh directly. For optimal performance, SI and 3DSC require a high number of samples to ensure a low level of noise in the produced descriptors. However, one cannot increase the sample count indefinitely as that results in a lower generation rate. Based on our experimental evidence on the given dataset, we feel that 10 samples per triangle is a reasonable point on this trade-off.
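A sketch of the fixed-rate sampling used to produce the SI and 3DSC input point cloud; the square-root barycentric trick yields uniformly distributed samples within each triangle (function names are ours):

```python
import random

def sample_triangle(v0, v1, v2):
    """Uniformly sample a point on a triangle via square-root
    reparametrisation of barycentric coordinates."""
    r1, r2 = random.random(), random.random()
    s = r1 ** 0.5
    a, b, c = 1 - s, s * (1 - r2), s * r2   # barycentric weights, sum to 1
    return tuple(a * v0[i] + b * v1[i] + c * v2[i] for i in range(3))

def sample_mesh(triangles, samples_per_triangle=10):
    """Fixed number of samples per triangle, matching the setting above."""
    return [sample_triangle(*t)
            for t in triangles
            for _ in range(samples_per_triangle)]
```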
While Johnson et al. define the bin size (and thus the support radius) of the SI to be equal to the mesh resolution, we do not believe their reasoning holds any longer for present day 3D objects. Similar objects can have significant variance in their resolution. As such, making the support radius dependent on the mesh resolution does not guarantee better matching performance. We therefore use a constant support radius for all tested methods, set to 0.3 units, relative to the bounding unit sphere, for all scenes in the experiment for ease of reproducibility. For the 3DSC, we set the minimum support radius to a value proportionally the same as the one originally used by Frome et al. [frome2004recognizing].
We executed the experiment 1,500 times, iteratively cluttering a scene with an increasing number of objects (starting with the reference object only) inside the clutterbox. The size of the RICI and SI descriptors was set to 64x64 bins, while the 3DSC descriptor’s dimensions were left the same as those used in previous work [guo2016comprehensive] [frome2004recognizing]. A more detailed discussion on size settings can be found in Section 5.2. In order to visualise the histograms generated by the clutterbox experiment, we opted to compute the fraction of the bin representing rank 0 in the histogram relative to the sum of all bins (all search results). For clarity, each sequence of such fractions has been sorted individually to produce monotonically increasing curves. The results are shown in Figure 10.
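The plotted quantity can be computed from each rank histogram as follows (a trivial sketch with our own naming):

```python
def rank0_fraction(histogram):
    """Fraction of search results where the correct vertex was the
    top-ranked (rank 0) result; histogram maps rank -> count."""
    total = sum(histogram.values())
    return histogram.get(0, 0) / total if total else 0.0
```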
The support angle parameter used to generate the SI results in Figure 10 requires further elaboration. In their original SI paper, Johnson et al. claim this filter reduces the effects of self-occlusion and clutter. However, our tests comparing a support angle filter against no filtering of input points (Figure 11) could not confirm this. All SI results in this paper therefore apply no support angle filter, as this favours the SI.
While Figure 10 shows that our RICI descriptor clearly outperforms both the SI and 3DSC in scenes that contain clutter (see Equation 8), it is also relevant to gain insight into the relationship between descriptor performance and the specific clutter level present in the support region. Figure 12 shows a heatmap plot of the fractional area of clutter present in the support volume around each Spin Vertex, versus the rank of the corresponding descriptor in the haystack. It can be observed that the RICI trends towards lower ranks than the SI and 3DSC, even at high levels of clutter. Furthermore, while the 3DSC generally does not outperform the SI, it appears more clutter resistant than the SI at extreme clutter levels (over 90%).
The heatmaps have been computed over 73.5 million search results extracted from scenes with 4 added clutter objects, based on the results of the Clutterbox experiment.
We do not expect a RICI image to be strongly dependent on mesh resolution (which may vary with the scanning process), as intersection counts should in most cases not be sensitive to it.
The experiment was implemented using C++, with the descriptor generation and search kernels written in CUDA 10.0. The code was written in such a way that given a dataset of objects, a single random seed determines all randomly chosen parameters, making all results reproducible. The experiment was executed on a combination of Nvidia Tesla cards (P100 16GB, V100 16GB, and V100 SXM3 32GB). All time-based results were exclusively gathered on the latter. One relevant implementation detail is that in cases where multiple search results have the same distance (which may occur due to reasons such as object self-similarity), we use the highest (best) rank of the matched haystack image for the sake of consistency.
4.3 Generation Performance
Figure 13 shows the difference in the rate at which the RICI, SI, and 3DSC descriptors are generated. As can be seen, the RICI is approximately one order of magnitude faster than the 3DSC, and two orders of magnitude faster than the SI, for the given settings.
4.3.1 Performance of Point Projection Algorithm
The largest portion of the computational effort in the RICI and SI generation algorithms is spent projecting points into cylindrical coordinate space. We have proposed an efficient algorithm for this, as outlined in Section 3.1.1.
A similar algorithm is included in the Point Cloud Library [Rusu_ICRA2011_PCL], as part of its Spin Image generation implementation. To the best of our knowledge, this has until now been the most efficient implementation available. We therefore compare our projection algorithm against this previous work.
We evaluate both algorithms using a microbenchmark which projects a sequence of randomly generated points. To ensure a fair comparison, all code unrelated to point projection has been removed from the Point Cloud Library SI generation implementation. The results are shown in Table 1.
It is worth noting that all points are projected into cylindrical coordinates relative to the same oriented point. Our method can therefore precompute a number of values that are shared between projections, as outlined in Section 3.1.1. Both methods were tested on an Intel Core i7-8750H CPU.
| PCL () | Proposed method () |
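As an illustration of the idea, the sketch below projects points into the cylindrical coordinate frame of an oriented point, hoisting terms that depend only on the oriented point out of the per-point loop. The exact set of precomputed terms in the paper's Section 3.1.1 may differ; this structure is our assumption:

```cpp
#include <algorithm>
#include <array>
#include <cmath>

struct CylindricalCoord { float alpha; float beta; };  // radial distance, signed height

// Projects points relative to one oriented point (position p, unit normal n),
// as used by both the SI and RICI. Terms depending only on (p, n) are
// precomputed once in the constructor and reused for every projected point.
struct CylindricalProjector {
    std::array<float, 3> p, n;
    float nDotP, pDotP;  // precomputed per oriented point

    CylindricalProjector(std::array<float, 3> p_, std::array<float, 3> n_)
        : p(p_), n(n_),
          nDotP(n_[0]*p_[0] + n_[1]*p_[1] + n_[2]*p_[2]),
          pDotP(p_[0]*p_[0] + p_[1]*p_[1] + p_[2]*p_[2]) {}

    CylindricalCoord project(const std::array<float, 3>& x) const {
        // beta = n . (x - p), using the precomputed n . p
        float beta = n[0]*x[0] + n[1]*x[1] + n[2]*x[2] - nDotP;
        // |x - p|^2 expanded so that |p|^2 is reused from the constructor
        float distSq = x[0]*x[0] + x[1]*x[1] + x[2]*x[2]
                     - 2.0f*(x[0]*p[0] + x[1]*p[1] + x[2]*p[2]) + pDotP;
        float alpha = std::sqrt(std::max(0.0f, distSq - beta*beta));
        return {alpha, beta};
    }
};
```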
4.4 Matching Rate
The rates of evaluating the distance functions for each method are shown in Figure 14. As can be seen, the execution times of the RICI distance function are similar to those of the SI's Pearson correlation coefficient, while the 3DSC is significantly slower.
For all methods, the bandwidth of the GPU memory bus is the main factor limiting the comparison rate. As our proposed distance function relies on computing the difference between neighbouring pixels, a naive implementation would have required double the bandwidth. Instead, we use specialised “shuffle instructions” to read the value of neighbouring pixels without resorting to another memory transaction, thereby halving the required memory bandwidth. The result is a kernel whose memory bandwidth requirements, and consequently execution time, are similar to those of the Pearson correlation coefficient used to compare Spin Images.
We further optimised our implementation by using an early exit condition. Since the distance score can only increase as subsequent pixels are processed, and since in many retrieval applications the only objective is to determine whether the distance between two images is below some given threshold, execution can cease as soon as that threshold is exceeded. In our clutterbox experiment, this threshold can be trivially precomputed. Utilising this early exit condition resulted on average in a 4.2 times speedup over the SI distance function.
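The two optimisations above can be sketched in scalar form. The distance shown here (squared differences between the horizontal deltas of the two images) is a simplified stand-in, not the paper's exact RICI distance; on the GPU the neighbouring pixel would be fetched with warp shuffle instructions rather than a second read:

```cpp
#include <cstdint>
#include <cstddef>
#include <limits>
#include <vector>

// Simplified neighbour-delta distance with early exit. The score is a sum of
// squared terms, so it is monotonically non-decreasing: once it exceeds the
// caller's threshold, the comparison can be abandoned.
int64_t neighbourDeltaDistance(const std::vector<int>& needle,
                               const std::vector<int>& haystack,
                               std::size_t width,
                               int64_t earlyExitThreshold) {
    int64_t score = 0;
    for (std::size_t i = 0; i < needle.size(); ++i) {
        // Delta relative to the previous pixel in the same row
        // (the first pixel of each row uses its own value).
        int nd = (i % width == 0) ? needle[i]   : needle[i]   - needle[i - 1];
        int hd = (i % width == 0) ? haystack[i] : haystack[i] - haystack[i - 1];
        int diff = nd - hd;
        score += static_cast<int64_t>(diff) * diff;
        if (score > earlyExitThreshold) {
            // Cannot possibly beat the threshold anymore; stop early.
            return std::numeric_limits<int64_t>::max();
        }
    }
    return score;
}
```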
5 Observations and Discussion
There are several topics and observations that may be relevant for the interpretation of the presented results.
5.1 Analysis of Experimental Results
Figure 15(a) shows the result set where RICI experienced the smallest decrease in matching performance between 0 and 9 added clutter objects in the scene. The clutter resistant properties of RICI can also be observed here. The seat of the desk chair is significantly cluttered, while the wheels experience relatively little clutter (and remain visible). All three methods are capable of recognising these exposed wheels reasonably well; however, the SI and 3DSC descriptors largely fail to recognise the cluttered seat.
Figure 15(b) shows the result set where RICI experienced the largest drop in performance between the scenes with 0 and 9 added clutter objects. The primary cause of this drop is the cuboid-like shape and low level of detail of the police van, which result in few changes in intersection counts. The produced RICI images in turn become relatively susceptible to clutter.
Figure 15(c) shows the experiment where RICI performed worst on the uncluttered reference object. The object in question, a bookshelf, exhibits a high degree of self-similarity; a property which is also present, to varying degrees, in other objects of the CAD-oriented SHREC2017 dataset. Any local descriptor will rank vertices belonging to self-similar regions equally, and whether they end up at rank 0 is a matter of luck. One would instead expect to find them within the top n ranks, where n is the number of self-similar vertices. On the other hand, this makes such descriptors a useful tool for detecting self-similar regions.
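The expectation above can be illustrated with a short sketch (a hypothetical helper of ours, not from the paper's code): if n vertices share an identical descriptor, ties may be ordered arbitrarily, so the correct match is only guaranteed to appear within the top n results.

```cpp
#include <cstddef>
#include <vector>

// Worst-case rank of a correct match among tied distances: the number of
// strictly smaller distances, plus the number of ties minus one (the case
// where an arbitrary tie-break orders the correct match last among its ties).
std::size_t worstCaseRank(const std::vector<float>& haystackDistances,
                          float matchedDistance) {
    std::size_t smaller = 0, tied = 0;
    for (float d : haystackDistances) {
        if (d < matchedDistance) ++smaller;
        else if (d == matchedDistance) ++tied;
    }
    return smaller + tied - 1;
}
```

With three vertices at distance 0 (n = 3), the correct match is thus found somewhere in the top 3 ranks, but not necessarily at rank 0.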
To investigate this further, we visualised the results of an experiment where the reference object had countable symmetric features, as shown in Figure 16. As opposed to Figure 15, we highlighted in red those vertices that were detected within the top n ranks instead of only at rank 0. For instance, the vertices in the table's legs are expected to constitute 12 self-similar partitions (6 legs, each with a symmetric front and back side), all of which are detected within the top 12 results, as shown in Figure 16(d). Similarly, all vertices in the base of the tabletop are correctly detected within the top 4 results (4-way symmetry).
In contrast to Figure 15(c), Figure 15(d) shows the experiment in which RICI had the highest recognition rate in the uncluttered scene. Little matching performance is lost after adding significant amounts of clutter.
Figure 15(e) shows the experiment whose drop in matching performance was closest to the average over all 1500 performed experiments. Worth noting here is the relatively small drop in recognition performance between the uncluttered scene and the scene with 9 added clutter objects.
Finally, Figure 15(f) shows a rare phenomenon where matching performance slightly improves between 4 and 9 added clutter objects.
5.2 Performance of 3DSC
As can be seen in Figure 10, and in contrast to the results obtained in previous work [frome2004recognizing] [guo2016comprehensive], the SI generally outperforms the 3DSC descriptor. The primary cause is that in previous work, the SI resolution was set to the 15x15 bins used originally by Johnson et al. [johnson1999using]. In contrast, we used a resolution of 64x64 bins for parity with the RICI descriptor, which we also consider more suitable to the capabilities of modern processors. This significant increase in resolution means that, with our chosen settings, the SI descriptor performs better than the 3DSC in our testing.
The decision to use the same bin dimensions for the 3DSC as in previous work was primarily motivated by a tradeoff between comparison performance and GPU hardware limitations. Our implementation uses shared memory when comparing 3DSC descriptors, because both the needle and haystack descriptors are accessed once for each radial division. Current GPU shared memory pools can hold approximately two image pairs at the default descriptor size simultaneously, which implies that the number of bins can either be left intact or doubled; beyond that, performance can be expected to be suboptimal. While doubling the number of bins in the 3DSC descriptor (which would make its memory requirements equal to those of the SI and RICI) would increase its matching performance, the matching rate would drop below acceptable levels due to the distance algorithm used. We therefore consider the chosen settings to be the best balance between quality and execution time for the 3DSC.
6 Conclusion
In this paper, a clutter resistant shape descriptor, RICI, was presented and evaluated using a novel evaluation framework for such descriptors, called the clutterbox experiment. Novel algorithms for cylindrical coordinate projection, circle-triangle intersection, and the rasterization of triangles in cylindrical coordinates were presented. We also performed the largest quantitative evaluation of the SI, 3DSC, and RICI methods to date, along with a useful observation regarding the SI support angle.
The main advantages of RICI are its noise-free nature and generation speed, while the related distance function makes it clutter resistant. We anticipate that the proposed clutterbox experiment, which is being made public, will aid future benchmarking of shape descriptors for cluttered scenes.
Acknowledgments
The authors would like to thank the HPC-Lab leader and PI behind the “Tensor-GPU” project, Prof. Anne C. Elster, for access to the Nvidia DGX-2 system used in the experiments performed as part of this paper. Additionally, the authors would like to thank the IDUN cluster at NTNU for the provision of additional computing resources.