1 Introduction

Application of computer vision techniques aimed at object recognition is attracting increasing attention in industrial applications. Prominent applications in this space include robot picking in assembly lines and surface inspection. To address these tasks, the vision system must estimate the 6 DoF (degrees-of-freedom) pose of the sought objects, which calls for a 3D object recognition approach. Moreover, in industrial settings robustness, accuracy, and run-time performance are particularly important.
Reliance on RGB-D sensors, which provide both depth and color information, is conducive to 3D object recognition. Yet, typical nuisances to be dealt with in 3D object recognition applications include clutter, occlusions, and the significant degree of noise affecting most RGB-D cameras. Many studies, such as [8, 6], have investigated these problems and highlighted how local 3D descriptors can effectively withstand clutter, occlusions, and noise in 3D object recognition.
The local descriptors pipeline for 3D object recognition is, however, quite slow. Indeed, RGB-D cameras generate a large amount of data (over 30 MB/s) and, as this may hinder performance in embedded and real-time applications, sampling strategies are needed. To reduce processing time, keypoint extraction techniques are widely used. In addition, some solutions propose to assign higher priority to specific image areas, as, for example, in the foveation technique . Another approach, inspired by human perception and widely explored for 2D image segmentation, consists of saliency detection, which identifies the most prominent points within an image . Unlike foveation, which processes arbitrary regions, the use of saliency highlights image regions that are known to be more important.
This work proposes a solution to improve the performance of the standard local descriptors pipeline for 3D object recognition from point clouds. The idea consists in adding a preliminary step, referred to as Saliency Boost, which filters the point clouds using a saliency mask in order to reduce the number of processed points and consequently the whole processing time. Besides, by selecting only salient regions, our approach may yield a reduction in the number of false positives, thereby often also enhancing object recognition accuracy.
2 Related Works
3D object recognition systems based on local descriptors typically deploy two stages, one carried out offline and the other online, referred to as training and testing, respectively. The training stage builds the database of objects, storing their features for later use. In the testing stage, features are then extracted from scene images. Given a scene, the typical pipeline, depicted in Figure 1 and described, e.g., in , consists of the following steps: 1) keypoint extraction; 2) local descriptor calculation; 3) matching; 4) grouping of correspondences; and 5) absolute orientation estimation. The first two steps, described in more detail below, are those which really differentiate the various approaches and impact performance most directly.
2.1 Keypoints Extraction
This step concerns selecting some surface points, either from images or point clouds. According to , keypoint extraction must reduce data dimensionality without losing discriminative capacity. In this work, we explore techniques which work in 3D, such as Uniform Sampling and Intrinsic Shape Signatures (ISS), as well as in 2D, namely SIFT and FAST.
Uniform Sampling downsamples the point cloud by segmenting it into voxels of a given leaf size and selecting, as a keypoint, the point nearest to the centroid of each voxel . ISS  selects keypoints based on a local surface saliency criterion, so as to extract 3D points that exhibit a large surface variation in their neighbourhood.
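As a concrete illustration, the voxel-grid selection performed by Uniform Sampling can be sketched in a few lines of NumPy. This is a simplified stand-in for the PCL implementation; the function name and array layout are our own:

```python
import numpy as np

def uniform_sample(points, leaf_size):
    """Voxel-grid downsampling: bin the (N, 3) points into cubic voxels of
    the given leaf size and keep, for each occupied voxel, the index of the
    point nearest to the centroid of the points falling in that voxel."""
    voxel_idx = np.floor(points / leaf_size).astype(np.int64)
    # Group point indices by the voxel they fall into.
    voxels = {}
    for i, v in enumerate(map(tuple, voxel_idx)):
        voxels.setdefault(v, []).append(i)
    keypoints = []
    for idxs in voxels.values():
        pts = points[idxs]
        centroid = pts.mean(axis=0)  # centroid of the points in this voxel
        nearest = idxs[np.argmin(np.linalg.norm(pts - centroid, axis=1))]
        keypoints.append(nearest)
    return np.array(sorted(keypoints))
```

With a 0.5 m leaf, points closer than the leaf size collapse onto a single keypoint, which is how the detector trades resolution for speed.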
The keypoint detector proposed in SIFT  is arguably the most prominent proposal for RGB images. It is based on the detection of blob-like, high-contrast local features, well suited to the computation of highly distinctive, similarity-invariant image descriptors. The FAST keypoint extractor is a 2D corner detector based on a machine learning approach, which is widely used in real-time computer vision applications due to its remarkable computational efficiency.
2.2 Local Descriptors
A local 3D descriptor processes the neighborhood of a keypoint to produce a feature vector discriminative with respect to clutter and robust to noise. Many descriptors have been proposed in recent years and several works, e.g. , have investigated their relative merits and limitations. In this work, we explore both descriptors which process only depth information, such as Signatures of Histograms of OrienTations (SHOT) and Fast Point Feature Histogram (FPFH), and descriptors which exploit depth and color, namely Point Feature Histogram for Color (PFHRGB)  and Color SHOT (CSHOT) .
Introduced by , SHOT describes a keypoint based on spatial and geometric information. To calculate the descriptor, first a 3D Local Reference Frame (LRF) is established around the keypoint. Then, a canonical spherical grid is divided into 32 segments. Each segment results in a histogram describing the angles between the normal at the keypoint and the normals at the neighboring points. The authors also proposed a variant which exploits the color of the points, called CSHOT. The color value is encoded according to the CIELab color space and added to the angular information deployed in SHOT. This descriptor is known to yield better results than SHOT when applied to colored point clouds.
PFHRGB  is based on the Point Feature Histogram (PFH) and stores geometrical information by analyzing the angular variation of the normals between each pair of points in the set composed of the keypoint and all its k-neighbors. PFHRGB works on RGB data and also stores the color ratio between the keypoint and its neighbors, increasing its effectiveness on RGB-D data . In order to speed up the descriptor calculation,  proposed a simplified solution, called FPFH, which considers only the differences between the keypoint and its k-neighbors. In addition, an influence weight is stored, resulting in a descriptor which can be calculated faster while maintaining its discriminative capacity.
2.3 Saliency Detection
Salient object detection is a topic inspired by human perception, which holds that humans tend to select visual information based on attention mechanisms in the brain . Its objective is to emphasize regions of interest in a scene . Many applications benefit from the use of saliency, such as object tracking and recognition, image retrieval, restoration and segmentation.
3 Proposed Approach
We present a way to significantly improve the run-time performance and also the memory efficiency of the standard pipeline described above by adding a preliminary step to the original pipeline. We refer to this step as saliency boost. It leverages the RGB scene image by detecting salient regions within it, which are then used to filter the point cloud so that the local descriptors pipeline is executed only on salient regions. In particular, we use the saliency mask to reduce the search space for 3D keypoint detectors by letting them run only on the part of the point cloud which corresponds to the salient regions of the image. To project saliency information from the 2D domain of the RGB image to the point cloud, we leverage the registration information provided by RGB-D cameras. Figure 1 presents a graphical overview of the approach. In the case of 2D keypoint detectors, instead, we run them on the full RGB image and then filter out keypoints not belonging to the salient regions: we do not filter the image before the keypoint extraction step because 2D detectors also exploit pixels from the background to detect the blobs, edges and corners that define keypoints. In the 3D case, instead, points from the background are usually far away and outside the sphere used to define the keypoint neighborhood, so it is possible to filter them out beforehand without affecting detector performance.
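For 3D detectors, the saliency boost reduces to a masked selection over the organized point cloud. A minimal NumPy sketch, assuming a registered (H, W, 3) cloud aligned pixel-by-pixel with the RGB image and an (H, W) saliency map in [0, 1]; the 0.5 threshold is an illustrative choice, not a value from the paper:

```python
import numpy as np

def filter_cloud_by_saliency(cloud, saliency_mask, threshold=0.5):
    """Keep only the 3D points whose corresponding pixel is salient.

    `cloud` has shape (H, W, 3), registered with the RGB image as provided
    by RGB-D cameras; `saliency_mask` has shape (H, W) with values in
    [0, 1], as output by any saliency detector (DSS in the paper).
    Returns an (N, 3) array of points in salient regions only.
    """
    keep = saliency_mask >= threshold   # boolean (H, W) selection mask
    return cloud[keep]                  # flattens to the N salient points
```

The 3D keypoint detector is then run on the returned subset, while for 2D detectors the same mask is instead applied to the keypoints after extraction, as discussed above.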
Our approach does not depend on a specific saliency detection technique. In this work, we choose the DSS algorithm , and we detect salient areas by running the trained model provided by the authors.
4 Experimental Results
4.1 Dataset

The experiments were performed on the Kinect dataset from the University of Bologna, presented in . This dataset comprises sixteen scenes and six models with pose annotations. Each model is represented as a set of 2.5D views from different angles and has from thirteen to twenty samples. Figure 2 depicts some examples of models and scenes in this dataset.
4.2 Local Descriptors Pipeline
In the local feature pipeline for object recognition, the choice of the keypoint extraction and description methods is key, and depends on the application, the kind of 3D representation available and its resolution, the sensor noise, etc. In order to evaluate the performance of the proposed approach in an application-agnostic scenario, we test combinations of several descriptors and detectors. The selected descriptors are SHOT and CSHOT (Color SHOT) , FPFH  and PFHRGB . The keypoint detectors working on 3D data are Uniform Sampling (with leaf sizes ranging from 2 to 5 cm with a step of 1 cm) and ISS , while on images we test SIFT  and FAST , run on the RGB image and projected onto the point cloud, as discussed.
The matching step is performed by a nearest neighbor (NN) search implemented with the FLANN library, integrated in the Point Cloud Library (PCL) . A KdTree is built for each view of each model in the database, and each keypoint in the scene is matched to only one point of one view of one model in the database by selecting the closest descriptor among all views and models. After this process, all matches pointing to a view of a model are processed by the Geometric Consistency Grouping algorithm , which selects all the subsets of geometrically consistent matches between the view and the scene, and estimates the aligning transformation. The transformation obtained from the largest correspondence group among all the views of an object is considered the best estimate of the aligning transformation for that object. If an object fails to have a geometrically consistent subset with at least 3 matches among all its views, it is deemed not present in the considered scene.
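The one-to-one matching across all views can be sketched with a brute-force nearest-neighbour search, a simplified stand-in for the FLANN KdTree search used in PCL (function name and array layout are our own):

```python
import numpy as np

def match_descriptors(scene_desc, model_descs):
    """Match each scene descriptor to its single nearest neighbour across
    all model views.

    `scene_desc` is an (n, d) array; `model_descs` is a list of (n_i, d)
    arrays, one per model view. Returns, for every scene descriptor, the
    (view index, point index) of the closest model descriptor overall.
    """
    stacked = np.vstack(model_descs)
    # Remember which view each stacked row came from, and where views start.
    view_of = np.concatenate([np.full(len(m), i) for i, m in enumerate(model_descs)])
    first_row = np.concatenate([[0], np.cumsum([len(m) for m in model_descs])])
    matches = []
    for d in scene_desc:
        j = np.argmin(np.linalg.norm(stacked - d, axis=1))  # global closest descriptor
        v = view_of[j]
        matches.append((int(v), int(j - first_row[v])))
    return matches
```

The resulting per-view correspondence lists are what the Geometric Consistency Grouping stage then clusters into geometrically consistent subsets.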
4.3 Evaluation Protocol
In order to evaluate the performance of the proposed object detection pipeline, the correctness of the predictions of both object presence and pose is evaluated. We adopt the Intersection over Union (IoU) metric (Equation 1), also known as the Jaccard index, defined as the ratio between the intersection and the union of the estimated bounding box and the ground-truth bounding box. A detection is evaluated as correct if its IoU with the ground truth is greater than 0.25, as in . Given the detections and the ground-truth boxes, we calculate precision and recall (Equations 2 and 3) by considering a correct estimation as a True Positive, an estimation of an absent object as a False Positive, and misdetections or detections whose IoU is below the threshold as False Negatives.
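For reference, the IoU test on axis-aligned 3D bounding boxes can be computed as below; the (min corner, max corner) box representation is an assumption of this sketch:

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Jaccard index of two axis-aligned 3D boxes, each given as a pair of
    (min_xyz, max_xyz) corner arrays."""
    lo = np.maximum(box_a[0], box_b[0])              # intersection lower corner
    hi = np.minimum(box_a[1], box_b[1])              # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))     # zero if boxes are disjoint
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)
```

A detection then counts as a True Positive whenever this value exceeds 0.25.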
To calculate precision-recall curves (PRC), we vary the threshold on the number of geometrically consistent correspondences required to declare a detection, increasing it from the minimum value of 3 up to the value at which no more detections are found in a scene. The area under the PRC (AUC) is then computed for each detector/descriptor combination and used to compare and rank the pipelines.
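A toy version of this threshold sweep is sketched below, assuming per-detection correspondence counts (`scores`), correctness flags from the IoU test, and the number of ground-truth objects (all names are our own):

```python
import numpy as np

def pr_auc(scores, is_correct, n_ground_truth):
    """Sweep the correspondence-count threshold from its minimum of 3
    upward, compute precision/recall at each value, and integrate the
    resulting precision-recall curve (trapezoidal rule)."""
    precisions, recalls = [], []
    for t in range(3, max(scores) + 2):
        kept = [ok for s, ok in zip(scores, is_correct) if s >= t]
        if not kept:                              # no detections left: stop
            break
        tp = sum(kept)
        precisions.append(tp / len(kept))         # precision at threshold t
        recalls.append(tp / n_ground_truth)       # recall at threshold t
    order = np.argsort(recalls)                   # sort by recall before integrating
    r = np.array(recalls)[order]
    p = np.array(precisions)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))
```
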
4.4 Implementation Details
Tests were performed on a Linux Ubuntu 16.04 LTS machine, using the Point Cloud Library (PCL), Version 1.8.1, OpenCV 3.4.1 and the VTK 6.2 library. For comparison purposes, all trials were performed on the same computer, equipped with an Intel Core i7-3632QM processor and 8GB of RAM. When available in PCL, the parallel version of each descriptor was used (e.g. for SHOT, CSHOT, and FPFH).
As for the parameters of the detectors, the ISS non-maxima suppression radius was set to cm and the neighborhood radius to cm, while for SIFT and FAST we used the default values provided by OpenCV. As for descriptors, normals were estimated using the first ten neighbors of each point, while the description radius was set to cm for all the considered descriptors.
In this section, we present the results obtained in the experiments. All trials were performed on the Kinect dataset, comparing the original pipeline (blue part in Figure 1) with the proposed pipeline with saliency boosting. For each descriptor and each pipeline we tested seven keypoint extractors, totaling 56 trials. The scene processing time, which comprises saliency detection (only for the boosted pipeline), keypoint extraction, description, correspondence matching, clustering and pose estimation, was measured to verify the impact of the proposed modification on processing time as well.
Results in terms of the number of keypoints extracted are presented in Table 1. The saliency filtering significantly reduces the average number of keypoints extracted by each detector: the reduction using saliency boost ranges from to almost , with an average of .
The number of extracted keypoints directly impacts the running time of the pipeline, mainly through two factors: the number of descriptors that have to be computed and the time it takes to match them. The SHOT and CSHOT descriptors are calculated relatively fast but, due to their length (352 and 1344 bins, respectively), the matching phase is slower, accounting for 97% and 99% of the processing time, respectively. PFHRGB and FPFH are shorter descriptors (250 and 33 bins, respectively), but their computation is slower and requires 94% and 89% of the overall time, respectively. As shown in Table 2, extracting keypoints only in salient regions drastically reduces the processing time for both kinds of descriptors.
In the best case, the reduction in processing time is as high as 80%, i.e. the boosted pipeline is 5 times faster thanks to the proposed saliency boosting. For all the considered detector/descriptor combinations, deployment of the saliency boosting step always reduces the processing time significantly, from the 22% obtained by FAST/SHOT to 83% for ISS and US with FPFH.
Reducing processing time is only beneficial if it does not harm recognition and localization performance. Interestingly, deployment of the saliency boosting step very often improves the AUC with respect to the traditional pipeline, as shown in Table 3. In particular, in 19 of the 28 trials which included the saliency boosting step, the pipeline boosted by saliency also performed better in terms of AUC, with massive improvements of more than for PFHRGB and FPFH. Conversely, when the AUC decreases due to the deployment of the saliency boost, it usually does so marginally, by or , with the worst decrease in AUC being greater than only once, when using the SIFT detector.
While the AUC generally increases with the boosted pipeline, it does not do so on average when deployed with the SHOT descriptor. However, it does increase by in the very relevant case of combining SHOT with the ISS detector, the combination that delivers the fastest running time among all the tested variants (as shown in Table 2).
Finally, in Figure 3, we report a Pareto analysis on the data for all descriptors. We can see how the points (i.e. detector/descriptor pairs) closest to the ideal point (i.e. AUC as high and processing time as low as possible) are obtained by the execution of the boosted pipeline. In this analysis, CSHOT, SHOT and FPFH obtained the best performance when paired with the boosted ISS (ISS), while PFHRGB did so when paired with the boosted Uniform Sampling at (US). Hence, the boosted pipeline outperforms the traditional one for all tested descriptors when taking into account the combined effect of processing time and recognition performance.
In this work, we presented an approach based on saliency detection to reduce the processing time of the traditional local descriptors pipeline. A significant reduction in processing time, from 22% to 83%, was verified in all the tested cases. Interestingly, the reduction in processing time did not generally decrease the object recognition performance, as measured by the AUC of the precision-recall curves. Indeed, an improvement in recognition performance was found for all descriptors in at least one pairing, up to 5% for SHOT and CSHOT, and more than 50% for FPFH and PFHRGB.
In spite of the improvements in processing time, the overall processing time is not yet suitable for real-time applications. However, the proposed approach offers a considerable speed-up without negatively impacting recognition performance, which brings us a step closer to creating an effective, real-time local feature pipeline for 3D object recognition.
-  (2012) 3D descriptors for object and category recognition: a comparative evaluation. In IEEE International Conference on Intelligent Robots and Systems (IROS).
-  (2018) Probabilistic saliency estimation. Pattern Recognition 74, pp. 359–372.
-  (2018) Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition.
-  (2007) 3D free-form object recognition in range images using local surface patches. Pattern Recognition Letters 28 (10), pp. 1252–1262.
-  (2013) Efficient 3D object recognition using foveated point clouds. Computers & Graphics 37 (5), pp. 496–508.
-  (2015) A comprehensive performance evaluation of 3D local feature descriptors. International Journal of Computer Vision 116 (1), pp. 66–89.
-  (2019) Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 815–828.
-  (1999) Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (5), pp. 433–449.
-  (2000) Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience 23 (1), pp. 315–341.
-  (2018) Saliency ranker: a new salient object detection method. Journal of Visual Communication and Image Representation 50, pp. 16–26.
-  (1999) Object recognition from local scale-invariant features. In 7th IEEE International Conference on Computer Vision, Vol. 2, pp. 1150–1157.
-  (2006) Machine learning for high-speed corner detection. In European Conference on Computer Vision (ECCV), pp. 430–443.
-  (2011) 3D is here: point cloud library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), pp. 1–4.
-  (2008) Aligning point cloud views using persistent feature histograms. In IEEE International Conference on Intelligent Robots and Systems (IROS).
-  (2009) Fast point feature histograms (FPFH) for 3D registration. In IEEE International Conference on Robotics and Automation (ICRA).
-  (2014) SHOT: unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding 125, pp. 251–264.
-  (2016) Deep sliding shapes for amodal 3D object detection in RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
-  (2012) Performance evaluation of 3D keypoint detectors. International Journal of Computer Vision 102 (1-3), pp. 198–220.
-  (2009) Intrinsic shape signatures: a shape descriptor for 3D object recognition. In 12th IEEE International Conference on Computer Vision (ICCV) Workshops.