Critical to the registration of point clouds is the establishment of a set of accurate correspondences between points in 3D space. The correspondence problem is generally addressed by the design of discriminative 3D local descriptors on the one hand, and the development of robust matching strategies on the other hand. In this work, we first propose a multi-view local descriptor, which is learned from the images of multiple views, for the description of 3D keypoints. Then, we develop a robust matching approach, aiming at rejecting outlier matches based on the efficient inference via belief propagation on the defined graphical model. We have demonstrated the boost of our approaches to registration on the public scanning and multi-view stereo datasets. The superior performance has been verified by the intensive comparisons against a variety of descriptors and matching methods.
Registration of point clouds integrates 3D data from different sources into a common coordinate system, serving as an essential component of many high-level applications like 3D modeling [1, 2], SLAM and robotic perception. Critical to a registration task is the determination of correspondences between spatially localized 3D points within each cloud. To tackle the correspondence problem, on the one hand, a range of 3D local descriptors [5, 6, 7, 8, 9, 10, 11, 12] have been developed to facilitate the description of 3D keypoints. On the other hand, matching strategies [13, 14, 15] have also been progressing towards higher accuracy and robustness.
The exploration of 3D geometric descriptors has long been the focus of interest in point cloud registration. It involves the hand-crafted geometric descriptors [5, 6, 7, 8, 9] as well as the learned ones [10, 11, 12, 16]. Both kinds of methods mainly rely on 3D local geometries. Meanwhile, with the significant progress of CNN-based 2D patch descriptors [17, 18, 19, 20, 21], more importance is attached to leveraging the 2D projections for the description of underlying 3D structures [22, 23, 24, 25]. Particularly for point cloud data that are co-registered with camera images [26, 27, 28, 29], the fusion of multiple image views, which has reported success on various tasks [30, 31, 32, 33], is expected to further improve the discriminative power of 3D local descriptors. With this motivation, we propose a multi-view descriptor, named MVDesc, for the description of 3D keypoints based on the synergy of the multi-view fusion techniques and patch descriptor learning. Rather than replacing existing geometric descriptors [5, 6, 7, 8, 9, 10, 11, 12], the MVDesc complements them.
Given the local descriptors, the matching problem is another vital issue in point cloud registration. A set of outlier-free point matches is desired by most registration algorithms [15, 34, 35, 36, 37, 38]. Currently, the matching strategies, e.g., the nearest neighbor search, mutual best and ratio test, basically estimate correspondences according to the similarities of local descriptors alone. Without considering the global geometric consistency, these methods are prone to spurious matches between locally-similar 3D structures. Efforts are also spent on jointly solving the outlier suppression via line process and the optimization of global registration. But the spatial organizations of 3D point matches are still overlooked when identifying outliers. To address this, we develop a robust matching approach by explicitly considering the spatial consistency of point matches in 3D space. We seek to filter outlier matches based on a graphical model describing their spatial properties and provide an efficient solution via belief propagation.
The main contributions of this work are twofold. 1) We are the first to leverage the fusion of multiple image views for the description of 3D keypoints when tackling point cloud registration. 2) The proposed outlier filter, which is based on a graphical model and solved efficiently by belief propagation, remarkably enhances the robustness of 3D point matching.
3D local descriptor. The representation of a 3D local structure used to rely on traditional geometric descriptors such as Spin Images, PFH, FPFH, SHOT, USC and others, which are mainly produced based on the histograms over local geometric attributes. Recent studies seek to learn descriptors from different representations of local geometries, like volumetric representations of 3D patches, point sets and depth maps. The CGF still leverages the traditional spherical histograms to capture the local geometry but learns to map the high-dimensional histograms to a low-dimensional space for compactness.
Rather than only using geometric properties, some existing works extract descriptors from RGB images that are commonly co-registered with point clouds, as in scanning datasets [26, 27, 28] and 3D reconstruction datasets [29, 39]. Registration frameworks like [22, 23, 24, 25] use SIFT descriptors as the representations of 3D keypoints based on their projections in single-view RGB images. Besides, other state-of-the-art 2D descriptors like DeepDesc, L2-Net and others [19, 20, 21] can easily be adopted here for the description of 3D local structures.
Multi-view fusion. The multi-view fusion technique is used to integrate information from multiple views into a single representation. The literature has widely shown that the technique boosts the performance of instance-level detection, recognition [31, 32] and classification compared with a single view. Su et al. first propose a probabilistic representation of a 3D-object class model for the scenario where an object is positioned at the center of a dense viewing sphere. A more general strategy of multi-view fusion is view pooling [31, 32, 33, 40], which aggregates the feature maps of multiple views via an element-wise maximum operation.
Matching. The goal of matching in point cloud registration is to find correspondences across 3D point sets given keypoint descriptors. Almost all the registration algorithms [15, 34, 35, 36, 37, 38] demand accurate point correspondences as input. Nearest-neighbor search, mutual best filtering and ratio test are effective ways of searching for potential matches based on local similarities for general matching tasks. However, as mentioned above, these strategies are prone to mismatches without considering the geometric consistency. To incorporate geometric information, some works discover matches in geometric agreement using a game-theoretic scheme. Ma et al. propose to reject outliers by enforcing consistency in local neighborhoods. Zhou et al. use a RANSAC-style tuple test to eliminate matches with inconsistent scales. Besides, the line process model is applied in the registration domain to account for the presence of outliers implicitly.
In this section, we propose to learn multi-view descriptors (MVDesc) for 3D keypoints which combine multi-view fusion techniques [30, 31, 32, 33] with patch descriptor learning [17, 18, 19, 20, 21]. Specifically, we first propose a new view-fusion architecture to integrate feature maps across views into a single representation. Second, we build the MVDesc network for learning by putting the fusion architecture above multiple feature networks. Each feature network is used to extract feature maps from the local patch of each view.
Currently, view pooling is the dominant fusion technique used to merge feature maps from different views [31, 32, 33, 40]. However, as reported by the literature [32, 46, 47], the pooling operation is somewhat risky in terms of feature aggregation because it smooths out subtle local patterns. Inspired by ResNet, we propose an architecture termed Fuseption-ResNet, which uses the view pooling as a shortcut connection and adds a sibling branch termed Fuseption in charge of learning the underlying residual mapping.
Fuseption. The Fuseption branch is an inception-style architecture capped above multi-view feature maps. First, following the structure of inception modules [49, 50], three lightweight cross-spatial filters with different kernel sizes are adopted to extract different types of features. Second, the convolution Conv6, employed above the concatenated feature maps, is responsible for the merging of correlation statistics across channels and the dimension reduction, as suggested by [49, 51].
Fuseption-ResNet (FRN). Inspired by the effectiveness of skip connections in ResNet, we take view pooling as a shortcut in addition to Fuseption, as shown in Figure 1(a). As opposed to the view pooling branch, which is in charge of extracting the strongest responses across views, the Fuseption branch is responsible for learning the underlying residual mapping. The two branches reinforce each other in terms of accuracy and convergence rate. On the one hand, the residual branch, Fuseption, guarantees no worse accuracy compared to just using view pooling. This is because if view pooling is optimal, the residual can easily be driven to zero during training. On the other hand, the shortcut branch, view pooling, greatly accelerates the convergence of learning MVDesc, as illustrated in Figure 2(a). Intuitively, since the view pooling branch has extracted the essential strongest responses across views, it is easier for the Fuseption branch to just learn the residual mapping.
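The interplay of the two branches can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' network: the residual branch is reduced to a single 1x1 convolution over the channel-concatenated views (the hypothetical `w_residual`), standing in for the Fuseption filters, whose exact configuration is not reproduced here. The function names are likewise illustrative.

```python
import numpy as np

def view_pooling(feats):
    """Element-wise maximum over the view axis. feats has shape (V, C, H, W)."""
    return feats.max(axis=0)

def fuseption_resnet(feats, w_residual):
    """View pooling as the shortcut plus a learned residual.

    feats:      (V, C, H, W) feature maps, one per view.
    w_residual: (C, V*C) weights of a 1x1 convolution; an illustrative
                stand-in for the Fuseption branch (assumption).
    """
    v, c, h, w = feats.shape
    stacked = feats.reshape(v * c, h, w)  # concatenate the views along channels
    residual = np.einsum("oc,chw->ohw", w_residual, stacked)
    return view_pooling(feats) + residual

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8, 4, 4))  # 3 views, 8 channels, 4x4 maps

# With a zero residual the block reduces exactly to view pooling.
fused_zero = fuseption_resnet(feats, np.zeros((8, 24)))
# A non-zero residual lets the block deviate from plain pooling.
fused_rand = fuseption_resnet(feats, 0.1 * rng.standard_normal((8, 24)))
```

The zero-weight case mirrors the argument in the text: if view pooling alone were optimal, training could drive the residual branch to zero and recover it.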
Network. The network for learning MVDesc is built by putting the proposed FRN above multiple parallel feature networks. We use the feature network from MatchNet as the basis, in which the bottleneck layer and the metric network are removed. The feature networks of multiple views share the same parameters of corresponding convolutional layers. The channel number of Conv6 is set to be the same as that of feature maps output by a feature network. The ReLU activation follows each convolutional layer except the last Conv7. A layer of L2 normalization is appended after Conv7 whose channel number can be set flexibly to adjust the dimension of descriptors. The parameters of the full network are detailed in the supplemental material.
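In the training loss, a pair label distinguishes positive pairs from negative ones, and the two compared descriptors are the L2-normalized MVDesc descriptors of the two sets of multi-view patches, output by the two towers of the network. The margins of the loss are kept fixed in the experiments.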
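A loss of this kind can be sketched as follows. The sketch assumes a double-margin contrastive loss, which is one common form consistent with the description (a pair label, two L2-normalized descriptors, more than one margin); the exact formulation and the margin values used by the authors are not given here, so `margin_pos` and `margin_neg` are hypothetical placeholders.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def double_margin_contrastive_loss(desc_a, desc_b, is_positive,
                                   margin_pos=0.3, margin_neg=0.6):
    """Mean double-margin contrastive loss over a batch (assumed form).

    desc_a, desc_b: (N, D) L2-normalized descriptors from the two towers.
    is_positive:    (N,) labels, 1 for positive pairs and 0 for negatives.
    margin_pos, margin_neg: placeholder margins, not the authors' values.
    """
    dist = np.linalg.norm(desc_a - desc_b, axis=1)
    y = np.asarray(is_positive, dtype=float)
    pos_term = y * np.maximum(dist - margin_pos, 0.0)          # pull positives together
    neg_term = (1.0 - y) * np.maximum(margin_neg - dist, 0.0)  # push negatives apart
    return float(np.mean(pos_term + neg_term))

a = l2_normalize(np.array([[1.0, 0.0], [1.0, 0.0]]))
b = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
loss_good = double_margin_contrastive_loss(a, b, [1, 0])  # labels agree with distances
loss_bad = double_margin_contrastive_loss(a, b, [0, 1])   # labels contradict distances
```

With labels that agree with the distances the loss vanishes, while swapped labels are penalized on both terms.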
View number. Unlike [31, 32, 33] using 12 or 20 views for objects, we adopt only 3 views in our MVDesc network for the description of local keypoints, which is a good tradeoff between accuracy and efficiency as shown in Figure 2(b).
Data preparation. Currently available patch datasets generally lack sufficient multi-view patches for training. For example, one of the largest training sets, Brown [55, 56], contains fewer than 25k 3D points with at least 6 views. Therefore, we prepare the training data based on a self-collected Structure-from-Motion (SfM) database. The database consists of 31 outdoor scenes of urban and rural areas captured by UAV and reconstructed by a standard 3D reconstruction pipeline [1, 2, 57, 58, 59, 60, 61]. Each scene contains on average about 900 images and 250k tracks with at least 6 projections. The multi-view patches of size 64×64 are cropped from the projections of each track according to SIFT scales and orientations, as displayed in Figure 2(c). A positive training pair is formed by two independent sets of triple-view patches from the same track, while a negative pair is formed from different tracks. A total of 10 million pairs with an equal ratio of positives and negatives are evenly sampled from all the 31 scenes. We turn the patches into grayscale, subtract 128 from the intensities and divide them by 160.
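The intensity normalization at the end of this step is simple enough to write down. In the sketch below, the grayscale conversion by plain channel averaging is an assumption (the text does not say which conversion is used), and `normalize_patch` is an illustrative name.

```python
import numpy as np

def normalize_patch(rgb_patch):
    """Convert an (H, W, 3) uint8 patch to normalized grayscale.

    Grayscale is taken as the plain channel mean (assumption); the offset
    128 and the scale 160 follow the text.
    """
    gray = rgb_patch.astype(np.float32).mean(axis=2)
    return (gray - 128.0) / 160.0

patch = np.full((64, 64, 3), 208, dtype=np.uint8)  # a uniform 64x64 test patch
normalized = normalize_patch(patch)                # every pixel: (208 - 128) / 160
```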
We train the network from scratch using SGD with a momentum of 0.9, a weight decay of 0.0005 and a batch size of 256. The learning rate drops by 30% after every epoch with a base of 0.0001 using exponential decay. The training is generally done within 10 epochs.
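Read literally, a 30% drop per epoch from a base of 0.0001 gives an exponential schedule with decay factor 0.7. A minimal sketch, assuming epochs are counted from zero and the rate is held constant within an epoch (`learning_rate` is an illustrative name):

```python
def learning_rate(epoch, base_lr=1e-4, drop=0.30):
    """Learning rate for a given epoch: the base rate shrinks by `drop` each epoch."""
    return base_lr * (1.0 - drop) ** epoch

# Rates over the roughly 10 epochs mentioned in the text.
schedule = [learning_rate(e) for e in range(10)]
```

Under these assumptions the rate has fallen to about 4% of its base value by the tenth epoch.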
In this section, we aim to enhance the accuracy and robustness of 3D point matching. Firstly, a graphical model is defined to describe the spatial organizations of point matches. Secondly, each match pair is verified by the inference from the graphical model via belief propagation. Notably, the proposed method is complementary to the existing matching algorithms [13, 14, 15, 62].
It can be readily observed that inlier point correspondences generally hold spatial proximity. We illustrate it in Figure 3(a), which shows three pairs of inlier correspondences. For any two pairs of inlier matches, their points in each point cloud are either spatially close to each other or far away from each other at the same time. On the contrary, outlier correspondences tend to show spatial disorder. This observation implies a probabilistic dependence between neighboring point correspondences, which can be modeled by a probabilistic graph.
Formally, we first define the neighborhood of point correspondences as follows. Two pairs of point correspondences $c_i = (p_i, q_i)$ and $c_j = (p_j, q_j)$ are considered as neighbors if either $p_i$ and $p_j$, or $q_i$ and $q_j$, are mutually k-nearest neighbors, i.e.,

$$\max\big(\mathrm{rank}(p_i, p_j),\ \mathrm{rank}(p_j, p_i)\big) \le k, \tag{2}$$

or

$$\max\big(\mathrm{rank}(q_i, q_j),\ \mathrm{rank}(q_j, q_i)\big) \le k, \tag{3}$$

where $\mathrm{rank}(x, y)$ denotes the rank of distance of point $x$ with respect to point $y$. Then, the neighboring relationship between $c_i$ and $c_j$ can be further divided into two categories: if Conditions 2 and 3 are satisfied simultaneously, $c_i$ and $c_j$ are called compatible neighbors. They are very likely to co-exist as inlier matches. But if only one of Conditions 2 and 3 is satisfied by one point pair, while the pair of points in the other point cloud lie far apart from each other, e.g.,

$$\min\big(\mathrm{rank}(p_i, p_j),\ \mathrm{rank}(p_j, p_i)\big) > l, \tag{4}$$

or

$$\min\big(\mathrm{rank}(q_i, q_j),\ \mathrm{rank}(q_j, q_i)\big) > l, \tag{5}$$

$c_i$ and $c_j$ are called incompatible neighbors, as it is impossible for two match pairs breaching spatial proximity to be inliers simultaneously. The threshold parameter $k$ in Conditions 2 and 3 is set to a relatively small value, while the parameter $l$ in Conditions 4 and 5 is set to be larger than $k$ by a considerable margin. These settings are intended to ensure sufficiently strict conditions on identifying compatible or incompatible neighbors for robustness.
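The neighbor test can be sketched with rank matrices. The sketch rests on assumptions that should be read as such: "mutually k-nearest" is taken to mean that both distance ranks are at most `k`, "far apart" that both ranks exceed a second threshold `l`, and ranks start at 0 for the point itself. The function names and the toy thresholds are illustrative only.

```python
import numpy as np

def rank_matrix(points):
    """ranks[i, j] = rank of point j when all points are ordered by
    distance to point i (0 is the point itself)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(len(points))[:, None]
    ranks[rows, order] = np.arange(len(points))[None, :]
    return ranks

def classify_neighbors(P, Q, matches, k=2, l=3):
    """Split pairs of matches into compatible and incompatible neighbors.

    matches: list of (index into P, index into Q).
    k, l:    rank thresholds for "close" and "far" (toy values, assumption).
    """
    rp, rq = rank_matrix(P), rank_matrix(Q)
    compatible, incompatible = [], []
    for a in range(len(matches)):
        for b in range(a + 1, len(matches)):
            (pa, qa), (pb, qb) = matches[a], matches[b]
            close_p = max(rp[pa, pb], rp[pb, pa]) <= k
            close_q = max(rq[qa, qb], rq[qb, qa]) <= k
            far_p = min(rp[pa, pb], rp[pb, pa]) > l
            far_q = min(rq[qa, qb], rq[qb, qa]) > l
            if close_p and close_q:
                compatible.append((a, b))
            elif (close_p and far_q) or (close_q and far_p):
                incompatible.append((a, b))
    return compatible, incompatible

# Points at 1, 2, 4, ... along the x-axis, so no two distances tie.
P = np.array([[2.0 ** i, 0.0, 0.0] for i in range(6)])
Q = P + np.array([10.0, 0.0, 0.0])   # the same cloud, shifted
matches = [(0, 0), (1, 1), (2, 5)]   # two correct matches and one outlier
compatible, incompatible = classify_neighbors(P, Q, matches)
```

In this toy case the two correct matches come out as compatible neighbors, and the outlier is flagged as incompatible with both of them.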
Based on the spatial property of point matches stated above, an underlying graphical model is built to model the pairwise interactions between neighboring match pairs, as shown in Figure 3(a) and (b). The nodes in the graphical model are first defined as a set of binary variables $X = \{x_i\}$, each associated with a pair of point correspondence. $x_i \in \{0, 1\}$ indicates the latent state of being an outlier or inlier, respectively. Then the undirected edges between nodes are formed based on the compatible and incompatible neighboring relationship defined above. With the purpose of rejecting outliers, the objective here is to compute the marginal of being an inlier for each point correspondence by performing inference on the defined model.
The task of computing marginals on nodes of a cyclic network is known to be NP-hard. As a disciplined inference algorithm, loopy belief propagation (LBP) provides approximate yet compelling inference on arbitrary networks.
In the case of our graphical network with binary variables and pairwise interactions, the probability distributions of all node variables are first initialized uniformly, as $[0.5, 0.5]^\top$, with no prior imposed. Then the iterative message update step of a standard LBP algorithm at iteration $t$ can be written as

$$m_{ij}^{t+1} = \frac{1}{Z}\, F_{ij} \Big( m_i \circ \prod_{k \in \partial i \setminus j} m_{ki}^{t} \Big).$$

Here, $\partial i$ denotes the set of neighbors of node $x_i$ and $Z$ is the L1 norm of the incoming message for normalization. The message $m_{ij}$ passed from node $x_i$ to $x_j$ is a two-dimensional vector, which represents the belief of $x_j$'s probability distribution inferred by $x_i$. So is the constant message $m_i$ passed from the observation of node $x_i$, which indicates the likelihood distribution of $x_i$ given its observation measurements like descriptor similarity. The first and second components of the messages are the probabilities of being an outlier and an inlier, respectively. The product of messages is component-wise. The matrix $F_{ij}$ is the compatibility matrix of nodes $x_i$ and $x_j$. Based on the neighboring relationship analyzed above, the compatibility matrix is supposed to favor the possibility that both nodes are inliers if they are compatible neighbors and the reverse if they are incompatible neighbors. To specify this bias explicitly, the compatibility matrix takes one of two forms, one for compatible and one for incompatible neighbors, both controlled by a single bias parameter $\lambda$.

To guarantee the convergence of LBP, Simon's condition is enforced, which places an upper bound on $\lambda$; in this way $\lambda$ is set adaptively according to the boundary condition. The proof of the convergence condition is detailed in the supplemental material. After convergence, the marginal distribution of node $x_i$ is derived by

$$b_i = \frac{1}{Z}\, m_i \circ \prod_{k \in \partial i} m_{ki},$$

which unifies the implication from individual observations and the beliefs from the structured local neighborhood. After the inference, point matches with low inlier marginals are discarded as outliers. This greatly contributes to the matching accuracy, as shown in Figure 3(d), where 109 true match pairs are refined from 4,721 noisy putative match pairs.
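The inference loop can be sketched for a small graph. Several parts of this sketch are assumptions rather than the authors' settings: the two compatibility matrices `f_pos` and `f_neg` are one plausible choice that rewards, respectively penalizes, the both-inlier state; the bias `lam` is a fixed illustrative value instead of being derived from Simon's condition; and a fixed number of synchronous iterations replaces a convergence test.

```python
import numpy as np

def loopy_bp(observations, edges, lam=1.5, n_iters=20):
    """Approximate inlier marginals by loopy belief propagation.

    observations: (N, 2) rows [P(outlier), P(inlier)] from per-match evidence.
    edges:        list of (i, j, compatible) with compatible a bool.
    lam:          bias of the compatibility matrices (illustrative value).
    """
    obs = np.asarray(observations, dtype=float)
    f_pos = np.array([[1.0, 1.0], [1.0, lam]])  # assumed form: favors both inliers
    f_neg = np.array([[lam, lam], [lam, 1.0]])  # assumed form: penalizes both inliers
    neighbors = {i: [] for i in range(len(obs))}
    compat = {}
    for i, j, ok in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        compat[(i, j)] = compat[(j, i)] = f_pos if ok else f_neg
    # messages[(i, j)] is the message from node i to node j, initially uniform.
    messages = {key: np.array([0.5, 0.5]) for key in compat}
    for _ in range(n_iters):
        updated = {}
        for (i, j) in messages:
            incoming = obs[i].copy()
            for k in neighbors[i]:
                if k != j:
                    incoming = incoming * messages[(k, i)]  # component-wise product
            msg = compat[(i, j)] @ incoming
            updated[(i, j)] = msg / msg.sum()               # L1 normalization
        messages = updated
    beliefs = obs.copy()
    for i in neighbors:
        for k in neighbors[i]:
            beliefs[i] = beliefs[i] * messages[(k, i)]
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    return beliefs[:, 1]

# Matches 0 and 1 support each other; match 2 conflicts with both.
uniform = np.full((3, 2), 0.5)
edges = [(0, 1, True), (0, 2, False), (1, 2, False)]
marginals = loopy_bp(uniform, edges)
```

Even with uninformative observations, the mutually supporting matches end up with inlier marginals above 0.5 and the conflicting match below it, which is the behavior the filter relies on.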
In this section, we first individually evaluate the proposed MVDesc and the robust matching method (RMBP) in Sections 5.1 and 5.2, respectively. Then the two approaches are validated on the tasks of geometric registration in Section 5.3.
All the experiments, including the training and testing of neural networks, are done on a machine with an 8-core Intel i7-4770K, 32 GB of memory and an NVIDIA GTX980 graphics card. In the experiments below, putative matching means finding the correspondences of points whose descriptors are mutually closest to each other in Euclidean space between two point sets. The matching is implemented based on OpenBLAS. The traditional geometric descriptors [6, 7, 8, 9] are produced based on PCL.
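Putative matching as defined here is a mutual nearest-neighbor test. The NumPy sketch below is an illustration, not the authors' OpenBLAS implementation; it expands the squared distance into a matrix product, which is the kind of operation a BLAS library accelerates. `putative_matching` is an illustrative name.

```python
import numpy as np

def putative_matching(desc_a, desc_b):
    """Return index pairs (i, j) whose descriptors are mutual nearest neighbors."""
    # Squared Euclidean distances: |a|^2 - 2 a.b + |b|^2
    d2 = (np.sum(desc_a ** 2, axis=1)[:, None]
          - 2.0 * desc_a @ desc_b.T
          + np.sum(desc_b ** 2, axis=1)[None, :])
    nn_ab = np.argmin(d2, axis=1)   # nearest descriptor in B for each one in A
    nn_ba = np.argmin(d2, axis=0)   # nearest descriptor in A for each one in B
    idx_a = np.arange(len(desc_a))
    mutual = nn_ba[nn_ab] == idx_a  # keep pairs that choose each other
    return list(zip(idx_a[mutual].tolist(), nn_ab[mutual].tolist()))

desc_a = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
desc_b = np.array([[1.1, 0.0], [0.1, 0.0]])
matches = putative_matching(desc_a, desc_b)
```

The third descriptor in `desc_a` has a nearest neighbor in `desc_b`, but that neighbor prefers another descriptor, so no match is produced for it.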
5.1.1 Comparisons with patch descriptors
Setup. We choose HPatches, one of the largest local patch benchmarks, for evaluation. It consists of 59 cases and 96,315 6-view patch sets. First, we partition each patch set into two subsets by splitting the 6 views into halves. Then, the 3-view patches are taken as input to generate descriptors. We set up the three tasks of the benchmark by reorganizing the 3-view patches and use the mean average precision (mAP) as measurement. For the patch verification task, we collect all the 96,315 positive pairs of 3-view patches and 100,000 random negatives. For the image matching task, we apply putative matching across the two half sets of each case after mixing 6,000 unrelated distractors into every half. For the patch retrieval task, we use the full 96,315 6-view patch sets, each of which contributes one 3-view patch set as a query and the other 3-view set to the database. Besides, we mix 100,000 3-view patch sets from an independent image set into the database for distraction.
We make comparisons with the baseline SIFT and the state-of-the-art DeepDesc and L2-Net, for which we randomly choose a single view from the 3-view patches. To verify the advantage of our FRN over the widely-used view pooling [31, 32, 33] in multi-view fusion, we remove the Fuseption branch from our MVDesc network and train with the same data and configuration. All the descriptors have the dimensionality of 128.
Table 1 (column headers): SIFT | DeepDesc | L2-Net | View pooling | MVDesc
Results. The statistics in Table 1 show that our MVDesc achieves the highest mAPs in all the three tasks. First, it demonstrates the advantage of our FRN over view pooling [31, 32, 33, 40, 70] in terms of multi-view fusion. Second, the improvement of MVDesc over DeepDesc and L2-Net suggests the benefits of leveraging more image views than a single one. Additionally, we illustrate in Figure 4(a) the trade-off between the mAP of the image matching task and the dimension of our MVDesc. The mAP rises but gradually saturates with the increase of dimension.
5.1.2 Comparisons with geometric descriptors
Setup. Here, we perform evaluations on matching tasks of the RGB-D dataset TUM. Following prior work, we collect up to 3,000 pairs of overlapping point cloud fragments from 10 scenes of TUM. Each fragment is recovered from independent RGB-D sequences of 50 frames. We detect keypoints from the fragments by SIFT3D and then generate geometric descriptors, including PFH, FPFH, SHOT, USC, 3DMatch and CGF. Our MVDesc is derived from the projected patches of keypoints in three randomly-selected camera views. For easier comparison, two dimensions, 32 and 128, of MVDesc (MVDesc-32 and MVDesc-128) are adopted. Putative matching is applied to all the fragment pairs to obtain correspondences. We measure the performance of descriptors by the fraction of correspondences whose ground-truth distances lie below the given threshold.
Results. The precision of point matches w.r.t. the threshold of point distances is depicted in Figure 4(b). The MVDesc-128 and MVDesc-32 rank first and second in precision at any threshold, outperforming the state-of-the-art works by a considerable margin. We report in Table 2 the precisions and recalls when setting the threshold to 0.1 meters and the average time of producing 1,000 descriptors. Producing geometric descriptors is in general slower than producing MVDesc due to the cost of computing local histograms, although the computation has been accelerated by multi-thread parallelism.
Table 2 (column headers): CGF | FPFH | PFH | SHOT | 3DMatch | USC | MVDesc
Setup. To evaluate the performance of outlier rejection, we compare RMBP with RANSAC and two state-of-the-art works: Sparse Matching Game (SMG) and Locality Preserving Matching (LPM). All the parameters of the methods have been tuned to give the best results. We match 100 pairs of same-scale point clouds from 20 diverse indoor and outdoor scenes of the TUM, ScanNet and EPFL datasets. We keep a constant number of inlier correspondences and continuously add outlier correspondences for distraction. The evaluation uses four metrics: the mean precisions and recalls of outlier rejection (OP, OR) and inlier selection (IP, IR). Formally, we write $OP = \#\text{true rejections} / \#\text{rejections}$, $OR = \#\text{true rejections} / \#\text{outliers}$, $IP = \#\text{kept inliers} / \#\text{kept matches}$, $IR = \#\text{kept inliers} / \#\text{inliers}$. We also estimate the transformations from the kept matches using RANSAC and collect the median point distance after registration.
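The four metrics can be computed from two boolean masks. The sketch below assumes the usual precision and recall definitions (OP and OR for rejected outliers, IP and IR for kept inliers) and returns 0.0 when a denominator is empty, which is a convention chosen here rather than one stated in the text. `rejection_metrics` is an illustrative name.

```python
import numpy as np

def rejection_metrics(is_inlier, is_kept):
    """Precision and recall of outlier rejection and inlier selection.

    is_inlier: ground-truth mask, True where a match is a true inlier.
    is_kept:   filter output, True where a match survives the filter.
    """
    inlier = np.asarray(is_inlier, dtype=bool)
    kept = np.asarray(is_kept, dtype=bool)
    outlier, rejected = ~inlier, ~kept

    def ratio(num, den):
        # Empty denominators map to 0.0 (convention chosen for this sketch).
        return float(num) / float(den) if den > 0 else 0.0

    true_rejections = np.sum(rejected & outlier)
    kept_inliers = np.sum(kept & inlier)
    return {
        "OP": ratio(true_rejections, np.sum(rejected)),
        "OR": ratio(true_rejections, np.sum(outlier)),
        "IP": ratio(kept_inliers, np.sum(kept)),
        "IR": ratio(kept_inliers, np.sum(inlier)),
    }

# Two inliers and four outliers; the filter keeps one inlier and one outlier.
metrics = rejection_metrics([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0])
```

A filter that rejects almost everything scores well on OP and OR but poorly on IR, which is why all four numbers are reported together.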
Results. The measurements with respect to the inlier ratio are shown in Fig. 5. First, RMBP is the only method achieving high performance in all metrics and at all inlier ratios. Second, RANSAC, SMG and LPM each fail to give valid registrations once the inlier ratio drops below a method-specific level. At small inlier ratios they obtain high OP and OR but low IP or IR, because they tend to reject almost all the matches.
In this section, we verify the practical usage of the proposed MVDesc and RMBP by the tasks of geometric registration. We operate on point cloud data obtained from two different sources: the point clouds scanned by RGB-D sensors and those reconstructed by multi-view stereo (MVS) algorithms.
5.3.1 Registration of scanning data
Setup. We use the task of loop closure as in [10, 11] based on the dataset ScanNet, where we check whether two overlapping sub-maps of an indoor scene can be effectively detected and registered. Similar to [10, 11], we build up independent fragments of 50 sequential RGB-D frames from 6 different indoor scenes of ScanNet. For each scene, we collect more than 500 fragment pairs with labeled overlap obtained from the ground truth for registration.
The commonly-used registration algorithm, putative matching plus RANSAC, is adopted in combination with various descriptors [6, 7, 8, 9, 10, 11]. The proposed RMBP serves as an optional step before RANSAC. We use the same metric as [10, 11], i.e., the precision and recall of registration of fragment pairs. A registration is viewed as a true positive if the estimated Euclidean transformation yields an overlap between the registered fragments above a given ratio and a transformation error below a given bound. We see a registration as positive if there exist more than 20 pairs of point correspondences after RANSAC.
Results. The precisions, recalls and the average time of registration per pair are reported in Table 3. Our MVDesc-32 and MVDesc-128 both surpass the counterparts by a significant margin in recall, with comparable precision and efficiency. Our RMBP improves the precisions for 6 out of 8 descriptors and lifts the recalls for 7 out of 8 descriptors. Sample registration results of overlap-deficient fragments are visualized in Figure 6.
Table 3 (column headers): CGF | FPFH | PFH | SHOT | 3DMatch | USC | MVDesc
Indoor reconstruction. The practical usage of MVDesc is additionally evaluated by indoor reconstruction of the ScanNet dataset. We first build up reliable local fragments through RGB-D odometry following [74, 75] and then globally register the fragments with an existing global registration method. The RMBP is applied for outlier filtering. The FPFH descriptor used in that method is replaced by SIFT, CGF-32 or MVDesc-32 to establish correspondences. We also test the collaboration of CGF-32 and MVDesc-32 by combining their correspondences. Our MVDesc-32 contributes to visually compelling reconstruction, as shown in Figure 7(a). We also find that MVDesc-32 functions as a solid complement to CGF-32, as shown in Figure 7(b), especially for scenarios with rich textures.
5.3.2 Registration of Multi-View Stereo (MVS) data
Setup. Aside from the scanning data, we run registration on the four scenes of the MVS benchmark EPFL. First, we split the cameras of each scene into two clusters in space, highlighting the difference between camera views, as shown in Figure 8. Then, the ground-truth camera poses of each cluster are utilized to independently reconstruct the dense point clouds by the MVS algorithm. Next, we detect keypoints by SIFT3D and generate descriptors [6, 7, 8, 9, 10, 11] for each point cloud. The triple-view patches required by MVDesc-32 are obtained by projecting keypoints into 3 randomly selected visible image views, with occlusion tested by ray tracing. Afterwards, the correspondences between the two point clouds of each scene are obtained by putative matching and then RMBP filtering. Finally, we register the two point clouds of each scene based on FGR using the estimated correspondences.
Results. Our MVDesc-32 and RMBP help to achieve valid registrations for all the four scenes, whilst none of the geometric descriptors, including CGF, 3DMatch, FPFH, PFH, SHOT and USC, do, as shown in Figure 8. It is found that the failure is mainly caused by the geometrically symmetric patterns of the four scenes. We show the correspondences between CGF-32 features in Figure 9 as an example. The putative matching has resulted in a large number of ambiguous correspondences between keypoints located at symmetric positions. In essence, our RMBP cannot resolve the ambiguity in such cases either, because the correspondences in a symmetric structure still adhere to the geometric consistency. Ultimately, the ambiguous matches lead to the horizontally-flipped registration results shown in Figure 8. At least in the EPFL benchmark, the proposed MVDesc descriptor shows a descriptive ability superior to that of the geometric ones.
In this paper, we address the correspondence problem for the registration of point clouds. First, a multi-view descriptor, named MVDesc, has been proposed for the encoding of 3D keypoints, which strengthens the representation by applying the fusion of image views [31, 32, 33] to patch descriptor learning [17, 18, 19, 20, 21]. Second, a robust matching method, abbreviated as RMBP, has been developed to reject outlier matches by means of efficient inference through belief propagation on the defined graphical matching model. Both approaches have been validated to be conducive to forming point correspondences of better quality for registration, as demonstrated by the intensive comparative evaluations and registration experiments [6, 7, 8, 9, 10, 11, 15, 17, 18, 62].
This work is supported by Hong Kong RGC 16208614, T22-603/15N, Hong Kong ITC PSKL12EG02, and China 973 program, 2012CB-
L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In: CVPR. (2017)
Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC. (2016)
Perceptual losses for real-time style transfer and super-resolution. In: ECCV. (2016)
Rethinking the inception architecture for computer vision. In: CVPR. (2016)