Repository for "End-to-End Learning Local Multi-view Descriptors for 3D Point Clouds"
In this work, we propose an end-to-end framework to learn local multi-view descriptors for 3D point clouds. To adopt a similar multi-view representation, existing studies use hand-crafted viewpoints for rendering in a preprocessing stage, which is detached from the subsequent descriptor learning stage. In our framework, we integrate the multi-view rendering into neural networks by using a differentiable renderer, which allows the viewpoints to be optimizable parameters for capturing more informative local context of interest points. To obtain discriminative descriptors, we also design a soft-view pooling module to attentively fuse convolutional features across views. Extensive experiments on existing 3D registration benchmarks show that our method outperforms existing local descriptors both quantitatively and qualitatively.
Local descriptors for 3D geometry are widely recognized as one of the cornerstones in many computer vision and graphics tasks, such as correspondence establishment, registration, segmentation, retrieval, etc. Particularly, with the prevalence of consumer-level RGB-D sensors, voluminous scanned data requires robust local descriptors for scene alignment and reconstruction [60, 4]. Such 3D data, however, is often noisy and incomplete, presenting challenges to the design of local descriptors.
Existing hand-engineered local descriptors [20, 11, 46, 45, 53, 52, 48], proposed in the past few decades, are mostly built upon histograms of low-level 3D geometric properties. Recent trends with deep neural networks have motivated researchers to develop learning-based local descriptors in a data-driven manner [66, 8, 24, 19, 6, 57, 12]. Several types of input representations for 3D local geometry have been explored, such as raw point cloud patches [24, 6], voxel grids [66, 12] and multi-view images [19, 67]. Currently, on the geometric registration benchmark of 3DMatch, most learning-based methods are built upon either PointNet with point cloud patches or 3D CNNs with voxel grids, and 3DSmoothNet achieves the state-of-the-art performance with smoothed density value voxelization. Despite the impressive progress made by the voxel representation, literature on 3D shape recognition and retrieval [50, 42, 56] indicates that multi-view images perform better than voxel grids, and some initial attempts [19, 67] have been made to extend a similar idea to 3D local descriptors. Meanwhile, a line of recent studies has advanced 2D CNNs in learning local descriptors from a single image patch [15, 51, 34, 64, 23, 35, 32]. These motivate us to perform further investigation into a multi-view representation for 3D points and their local geometry.
The main challenges of adopting the multi-view representation in learning descriptors are as follows. First, to obtain multi-view images, a set of viewpoints (virtual cameras) is needed for 3D graphics rendering pipelines in a preprocessing stage [50, 19]. In existing studies [50, 42, 19, 56, 9, 17], the viewpoints are either randomly sampled or heuristically hand-picked. However, how to determine the viewpoints in a data-driven manner to produce more informative renderings for neural networks still remains a question. Second, an effective fusion operation is required to integrate features from multiple views into a single compact descriptor. Max-view pooling is a dominant fusion approach [50, 42, 19, 56], but this operation might overlook subtle details [67, 56], leading to sub-optimal performance.
In this work, we propose a novel network architecture that learns local multi-view descriptors for 3D point clouds in an end-to-end manner, as illustrated in Fig. 1. Our network consists of three main stages: (1) multi-view rendering for a 3D point of interest of a point cloud; (2) feature extraction in each rendered view; and (3) feature fusion across the views. Specifically, we first use an in-network differentiable renderer to project the 3D local geometry of a specific point as multi-view patches. Viewpoints used by the renderer are optimizable parameters during training. The renderer can back-propagate supervision signals from rendered pixels to the viewpoints, enabling joint optimization of the rendering stage with the other two stages. Next, to extract features in each rendered view, we leverage existing CNNs that are well matured in the task of learning single patch descriptors [51, 34]. Lastly, to fuse the features across all the views, we examine the gradient flow problem of max-view pooling and then design a novel soft-view pooling module. The former only considers the strongest response across views for each position in feature maps, while in contrast, our design adaptively aggregates all the responses with attentive weights estimated by a sub-network. In the backward pass, our design allows supervision signals to better flow into each input view for optimization. The experiments conducted on the 3DMatch benchmark show that our method outperforms existing hand-crafted and learned descriptors, and is robust against rotation and point density as well.
Our contributions in this work are summarized as: (1) we propose a novel end-to-end framework for learning local multi-view descriptors of 3D point clouds, with the state-of-the-art performance; (2) the viewpoints are optimizable via in-network differentiable rendering; (3) a soft-view pooling module fuses features across views attentively with a better gradient flow. We will make our code publicly available.
Hand-crafted 3D Local Descriptors. Over the past few decades, a large body of literature has investigated descriptors for encoding geometric information of local neighborhoods of 3D points. A full review is beyond the scope of this paper. Classic descriptors include, to name a few, Spin Image, 3D Shape Contexts, PFH, FPFH, SHOT, and Unique Shape Context. These hand-crafted descriptors are mostly constructed from histograms of low-level geometric properties. Despite the progress made by these descriptors, they may fail to handle well the nuisances commonly observed in real scanned data, like noise, incompleteness, and low resolution.
Learned 3D Local Descriptors. With the recent success of deep neural networks, more attention has been shifted to developing learning-based 3D local descriptors [6, 24, 66, 12, 19]. In general, these methods fall into three categories according to input representations, including point cloud patches, voxel grids and multi-view images.
Point cloud patches are the most straightforward representation for local neighborhoods of points. PointNet, a seminal work done by Qi et al., is specifically designed to handle the unstructured nature of point clouds. Studies like [6, 5, 62] build upon PointNet to learn descriptors for point cloud patches. There also exist PointNet-based works that learn local descriptors jointly with other tasks, such as keypoint detection and pose prediction.
Voxel grids, used in works like 3DMatch and 3DSmoothNet, are a common structured representation for 3D point clouds [33, 59, 42]. To reduce noise and boundary effects, Gojcic et al. proposed to use smoothed density value voxelization in 3DSmoothNet. Their method achieves the state-of-the-art performance on the 3DMatch benchmark, substantially outperforming the aforementioned PointNet-based approaches [6, 5, 7].
Multi-view images have demonstrated better performance than voxel grids in the task of 3D shape recognition and retrieval [50, 42, 43], owing to their ability to deliver rich information about 3D geometry. Motivated by the success in global shape analysis, researchers have extended the multi-view representation to 3D local descriptor learning [19, 67]. Huang et al. re-purposed the CNN architecture from [50, 26] to extract local descriptors of 3D shapes (e.g., airplanes or chairs) from multi-view images, which are rendered offline with clustered viewpoints. There exist studies like [8, 43] that use 2D filtering for in-network image generation from point clouds. In contrast, our work considers the viewpoints as optimizable parameters and performs multi-view rendering with a differentiable renderer in neural networks.
To fuse view features into a single compact representation, max-view pooling is widely used owing to its computational efficiency and view-order invariance [50, 42, 56, 19, 43, 67], but it tends to overlook subtle details as discussed in [56, 67, 34, 65, 37]. Zhou et al. proposed Fuseption, a residual-learning module for feature fusion, but their module is not view-order invariant and its number of parameters grows with the number of input views. Alternative approaches, such as feature aggregation with NetVLAD and RNN, have also been explored, but excessive computation or view ordering is required. Differently, by analyzing the gradient flow of max-view pooling, we propose soft-view pooling that adaptively aggregates features with attentive weights in a view-order invariant manner.
Differentiable Rendering. The conventional 3D graphics rendering pipeline involves rasterization and visibility test, which are non-differentiable discretization operations with respect to the projected point coordinates and view-dependent depths. Thus supervision signals cannot flow from the 2D image space to the 3D shape space, preventing the integration of this pipeline into neural networks for end-to-end learning. Recently researchers have designed several differentiable rendering frameworks [31, 21, 29, 28, 39, 58, 3, 30] that incorporate approximated gradient formulations for the discretization operations. Among them, Soft Rasterizer (SoftRas), a state-of-the-art differentiable renderer developed by Liu et al., treats mesh rendering as a process of probabilistic aggregation of triangles. In this work, we modify SoftRas to extend its application to point cloud rendering and adopt a hard-forward soft-backward scheme.
Given a 3D point cloud, we aim at training a neural network that can extract a discriminative local descriptor for a point of interest in an end-to-end manner. To this end, we perform projective analysis on the local geometry of the point by using a multi-view representation. Compared to point cloud patches or voxel grids, the multi-view representation can capture different levels of local context more easily [19, 42].
Our network is comprised of three stages as shown in Fig. 1. First, the network directly takes the point cloud and the point of interest as inputs and employs SoftRas to render the local neighborhood of the point as multi-view patches (Sec. 3.1). Second, we extract convolutional feature maps from each rendered view patch through a lightweight 2D CNN (Sec. 3.2). Lastly, all the extracted view features are compactly fused together by a novel soft-view pooling module to obtain the local descriptor (Sec. 3.3). The three stages of the network are jointly trained in an end-to-end manner such that descriptors of corresponding points that are geometrically and semantically similar are close to each other, while descriptors of non-corresponding points are distant from each other (Sec. 3.4).
Optimizable Viewpoints. Existing multi-view approaches select a set of rendering viewpoints according to certain rules, e.g., by clustering  or circling around a viewing center at a fixed step [50, 56, 9]. However, this view selection process is detached from the subsequent multi-view fusion stage, and thus might produce less representative inputs for the latter. SoftRas allows the viewpoints to be optimizable parameters, which can be jointly trained with other network parameters in later stages. To set up virtual cameras in a look-at manner , we define the viewpoint parameters as using spherical coordinates, where is the number of viewpoints. Each viewpoint is represented by two angles and , the distance from the local origin and a consistent upright orientation . Given the point of interest as the origin, the local reference frame (LRF) for is defined as follows (Fig. 2): the -axis is collinear to the normal of ; the -axis is the cross product of and the -axis (a small perturbation to if the normal is parallel to ); and the -axis is the cross product of the -axis and -axis. We constrain to be within the hemisphere where the point normal resides (Sec. 3.4). To augment rotation invariance in the learned descriptors, we rotate each rendered view patch at 90-degree intervals  (i.e., 4 in-plane rotations) within the network. Thus, a set of view patches are obtained through rendering as detailed next.
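The look-at camera setup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, since the symbols in the paragraph were lost: we assume the z-axis of the local reference frame is the point normal, the y-axis is the cross product of an upright vector `up` and the z-axis, and the x-axis is the cross product of the y- and z-axes. The names `local_reference_frame`, `camera_position`, `theta`, `phi` and `rho` are hypothetical, as is the size of the perturbation.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    n = norm(v)
    return tuple(c / n for c in v)

def local_reference_frame(normal, up=(0.0, 0.0, 1.0), eps=1e-6):
    """Build an orthonormal frame (x, y, z) around a point of interest.

    Assumed convention: z = normal, y = up x z, x = y x z.
    """
    z = normalize(normal)
    y = cross(up, z)
    if norm(y) < eps:
        # The normal is parallel to `up`: perturb `up` along the coordinate
        # axis least aligned with the normal (perturbation size is arbitrary).
        k = min(range(3), key=lambda i: abs(z[i]))
        up = tuple(u + (0.1 if i == k else 0.0) for i, u in enumerate(up))
        y = cross(up, z)
    y = normalize(y)
    x = cross(y, z)
    return x, y, z

def camera_position(theta, phi, rho, frame, origin):
    """Convert spherical viewpoint parameters to a world-space camera position.

    theta: polar angle measured from the normal (theta < pi/2 keeps the
    camera in the hemisphere where the normal resides); phi: azimuth;
    rho: distance from the point of interest.
    """
    x, y, z = frame
    lx = rho * math.sin(theta) * math.cos(phi)
    ly = rho * math.sin(theta) * math.sin(phi)
    lz = rho * math.cos(theta)
    return tuple(origin[k] + lx * x[k] + ly * y[k] + lz * z[k] for k in range(3))
```

With `theta = 0` the camera sits on the normal at distance `rho` from the point of interest, which is a convenient sanity check for the frame construction.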
Differentiable Rendering. To address the non-differentiable issue of the conventional 3D graphics rendering pipeline (Fig. 3-a), SoftRas treats mesh rendering as a process of probabilistic aggregation of triangles in 2D. To render the point cloud as multi-view patches, one approach is to first transform it into a mesh via surface reconstruction, which, however, is challenging to integrate into our end-to-end framework and may not handle noise well (e.g., in laser scans of outdoor scenes). Instead, we modify SoftRas to make it amenable to point cloud rendering (Fig. 3-b). We consider each point as a sphere, whose radius can be a fixed value or derived from the average distance between the point and its local neighbors. After perspective projection with a specific viewpoint, each point produces a probability map that describes the probability of each output pixel being covered by that point. Each pixel in the rendering output (of size 64 × 64) is defined as
where is the rendered attribute (e.g., color or view-dependent depth) of , is a default background value, and is the depth of . The weighting function designed in  is biased to points that are closer to the camera and the -th pixel, and . Such a linear formulation in Eq. 1 approximates the rasterization and visibility test in the conventional rendering pipeline (Fig. 3), and it is naturally differentiable. Since input point clouds may lack color information, we use view-dependent depth as [8, 61], which is invariant to illumination changes. We refer the interested reader to  for detailed implementations and discussions of and .
Although the differentiability of Eq. 1 makes it possible for in-network rendering, we observed artifacts, such as blurry pixels at regions with large depth discontinuity, in the rendering outputs (see Fig. 4). To mitigate the influence of artifacts on the subsequent feature extraction, we instead adopt a hard-forward soft-backward scheme for rendering point clouds with SoftRas, sharing a similar idea to . Specifically, in the forward pass, we perform rasterization and visibility test to obtain rendering results in the same way as the conventional rendering pipeline (Fig. 3-a). In the backward pass, we compute approximated gradients for the rendering using Eq. 1 of SoftRas. We found that this approximation scheme works well in our experiments.
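To make the hard-forward soft-backward idea concrete, here is a minimal single-pixel sketch in plain Python. It is not the authors' implementation: the weighting below follows the general SoftRas-style aggregation (coverage probability times an exponential of closeness, normalized together with a background term), and the names `hard_pixel`, `soft_pixel`, `render_pixel`, `gamma`, `eps` and `cover_threshold` are assumptions introduced for illustration. Each point is given as a tuple `(coverage_prob, closeness, attribute)`, where `closeness` is assumed to be a normalized inverse depth (larger means nearer to the camera).

```python
import math

def hard_pixel(points, background=0.0, cover_threshold=0.5):
    """Conventional rasterization plus visibility test for one pixel:
    keep the nearest point that covers the pixel, else the background."""
    visible = [p for p in points if p[0] >= cover_threshold]
    if not visible:
        return background
    return max(visible, key=lambda p: p[1])[2]

def soft_pixel(points, background=0.0, gamma=0.1, eps=1e-3):
    """Soft aggregation: a weighted sum that is linear in the attributes.
    Returns the pixel value and the per-point weights."""
    m = max([p[1] for p in points] + [eps])  # subtract max for stability
    terms = [p[0] * math.exp((p[1] - m) / gamma) for p in points]
    bg = math.exp((eps - m) / gamma)
    total = sum(terms) + bg
    weights = [t / total for t in terms]
    value = sum(w * p[2] for w, p in zip(weights, points))
    value += (bg / total) * background
    return value, weights

def render_pixel(points, background=0.0):
    """Hard forward, soft backward: the returned value comes from the hard
    pipeline, while the soft weights act as surrogate gradients of the pixel
    with respect to each point's attribute. Gradients with respect to the
    viewpoint (through coverage and depth) are omitted in this sketch."""
    value = hard_pixel(points, background)
    _, surrogate_grad = soft_pixel(points, background)
    return value, surrogate_grad
```

In an autograd framework the same effect is often obtained with a custom backward function, so that the forward image stays sharp while gradients still reach the viewpoint parameters.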
The rendering stage produces a set of multi-view patches for the point of interest. This 2D representation naturally lends itself to existing patch analysis networks. We adopt a lightweight CNN backbone similar to L2-Net [51, 34], a state-of-the-art network for learning local image descriptors. Concretely, the network is composed of six stacked convolutional layers, each followed by normalization and ReLU layers. We feed each patch to the network and obtain a corresponding feature map, which is of size 8 × 8 with 128 channels.
Given the set of feature maps as input, we perform feature fusion across views to obtain a more compact multi-view representation. Let denote the feature value at location of the fused output (the same size as ), and iterates over all spatial and channel-wise positions (Fig. 5). Max-view pooling is a widely adopted fusion approach for its simple computation and invariance to view ordering. However, this operation suffers from the following gradient flow problem in back-propagation. Mathematically, max-view pooling can be expressed as
where and the weights are in a one-hot form for selecting the maximum value. In the backward pass, the gradient of Eq. 2 is
Based on the above analysis, we propose soft-view pooling that adaptively estimates attentive weights with a sub-network. Specifically, the sub-network takes each view's feature map as input and follows an encoder-decoder design to regress the corresponding weights. The sub-network performs downsampling and then upsampling by a factor of 2 for both spatial size and channel depth, using a 3 × 3 convolutional layer and a 3 × 3 up-convolutional layer respectively, with a ReLU layer in between. The output weight map has the same size as the input feature map. Afterward, for each location as defined above, the softmax function is applied to the weights across views for normalization so that they sum to one. Note that the above computation is invariant to view orders.
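The contrast between the two pooling schemes can be sketched as follows. This is a simplified illustration in plain Python, not the paper's module: each view's feature map is flattened into a list, and the attention scores (`weight_logits`, a hypothetical name) are passed in directly instead of being regressed by the encoder-decoder sub-network.

```python
import math

def max_view_pool(features):
    """features: one flat list per view, all of equal length.
    Keeps only the strongest response across views at each location,
    so only that view receives a gradient at that location."""
    return [max(vals) for vals in zip(*features)]

def soft_view_pool(features, weight_logits):
    """Softmax-normalize the per-view scores at each location, then take
    the weighted sum of the view responses. Every view contributes, so
    every view receives a gradient."""
    fused = []
    for vals, logits in zip(zip(*features), zip(*weight_logits)):
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        fused.append(sum(e / s * v for e, v in zip(exps, vals)))
    return fused
```

Because the softmax and the sum both run over the view axis, permuting the views (together with their scores) leaves the fused output unchanged. With equal scores the result is the average across views, and as one view's score dominates the result approaches the response of that view.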
At last, the network embeds the fused feature to a -dimensional descriptor space with a fully-connected layer and a subsequent normalization layer.
To train the network , we sample matching point pairs in the overlapped region of two point clouds (at least 30% overlap). Given a batch of matching point pairs , we follow [12, 18] to adopt a batch-hard (BH) triplet loss
where , and is a margin and set to 1. For a training triplet, is the positive sample of , and considers the hardest negative sample within the batch for . As mentioned in Sec. 3.1, we also impose range constraints for the optimizable viewpoints as follows:
where and for , and respectively. Thus, the total loss is , where is empirically set to 1.
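Since the loss equations did not survive in the text above, the following plain-Python sketch shows one common form of a batch-hard triplet loss together with a hinge-style range penalty. Both forms are assumptions rather than the paper's exact equations: the hardest negative for an anchor is taken to be the closest positive of any other pair in the batch, and the range penalty is taken to be zero inside `[low, high]` and linear outside. All function names are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(anchors, positives, margin=1.0):
    """anchors[i] and positives[i] are descriptors of a matching pair.
    Requires at least two pairs so that a negative exists."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        d_pos = dist(anchors[i], positives[i])
        # Hardest negative: the nearest non-matching positive in the batch.
        d_neg = min(dist(anchors[i], positives[j]) for j in range(n) if j != i)
        total += max(0.0, margin + d_pos - d_neg)
    return total / n

def range_penalty(values, low, high):
    """Penalize viewpoint parameters that leave the interval [low, high]."""
    return sum(max(0.0, low - v) + max(0.0, v - high) for v in values)

def total_loss(anchors, positives, params_with_ranges, lam=1.0):
    """params_with_ranges: list of (values, low, high) tuples.
    lam = 1 follows the weighting stated in the text."""
    penalty = sum(range_penalty(v, lo, hi) for v, lo, hi in params_with_ranges)
    return batch_hard_triplet_loss(anchors, positives) + lam * penalty
```

The margin of 1 and the weight of 1 on the range term are the values given in the text; the remaining details are illustrative only.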
We implemented the network with PyTorch. We set the viewpoint number to 8 and the descriptor dimension to 32 (Sec. 4.4). The viewpoint parameters , , and were initialized randomly within the range in Eq. 5, and was initialized to . We use Adam for stochastic gradient descent with an initial learning rate of 0.001. The network is trained for 16 epochs, and the learning rate is decayed by a factor of 0.1 every 4 epochs.
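The step schedule stated above (initial rate 0.001, multiplied by 0.1 every 4 epochs, 16 epochs in total) can be written out explicitly. The helper name `learning_rate` is hypothetical; in PyTorch the same schedule would typically be configured with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)`, although the text does not say which scheduler was used.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.1, step=4):
    """Learning rate used during the given (0-indexed) epoch."""
    return base_lr * decay ** (epoch // step)

# One entry per training epoch.
schedule = [learning_rate(e) for e in range(16)]
```

Under this schedule the last four epochs run at a rate three orders of magnitude below the initial one.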
Dataset. We evaluate the proposed method on the widely adopted geometric registration benchmark from 3DMatch . The benchmark consists of RGB-D scans of 62 indoor scenes, an ensemble of several existing RGB-D datasets [55, 49, 60, 27, 14]. The data is split into 54 scenes for training and validation, and 8 scenes for testing. In each scene, point cloud fragments are obtained by fusing 50 consecutive depth frames. For each fragment in the testing set, a set of 5,000 randomly sampled points is provided as keypoints for descriptor extraction.
Metric. The recall metric is used for comparisons on the testing set by averaging the number of matched point cloud fragments [6, 5, 12]. Consider a set of point cloud fragment pairs , where point clouds and have at least 30% overlap after alignment. For a specific descriptor extraction method , the set of putative matching points between and is computed in the descriptor space as follows:
where and are keypoints and is the nearest neighbor search. The recall metric is then defined as follows:
where is the Iverson bracket, and is the ground-truth transformation for aligning the -th fragment pair in . The distance threshold for matching points is set to 10 cm. The inlier ratio threshold ranges from 0.05 to 0.2. To reliably find correct alignment parameters between two overlapping point clouds, the number of RANSAC iterations is 55,000 for an inlier ratio threshold of 0.05 and 860 for a threshold of 0.2 [6, 12].
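The recall computation and the quoted RANSAC iteration counts can be illustrated with a short plain-Python sketch. The function names, the representation of a fragment pair as `(matches, transform)`, and the strict inequality against the inlier ratio threshold are assumptions. The iteration count uses the standard RANSAC formula with 3-point samples; the 99.9% confidence level is also an assumption, chosen because it reproduces numbers close to those quoted in the text.

```python
import math

def inlier_ratio(matches, transform, tau1=0.1):
    """matches: list of (p, q) putative 3D correspondences.
    transform: callable mapping q into the coordinate frame of p
    (the ground-truth alignment). tau1 is the distance threshold in meters."""
    if not matches:
        return 0.0
    inliers = sum(1 for p, q in matches if math.dist(p, transform(q)) < tau1)
    return inliers / len(matches)

def recall(fragment_pairs, tau1=0.1, tau2=0.05):
    """Fraction of fragment pairs whose inlier ratio exceeds tau2."""
    hits = sum(1 for matches, transform in fragment_pairs
               if inlier_ratio(matches, transform, tau1) > tau2)
    return hits / len(fragment_pairs)

def ransac_iterations(inlier_fraction, sample_size=3, confidence=0.999):
    """Iterations needed to draw at least one all-inlier sample
    with the given confidence."""
    fail = 1.0 - inlier_fraction ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(fail))
```

Evaluating `ransac_iterations(0.05)` and `ransac_iterations(0.2)` gives values on the order of 55,000 and 860 respectively, which suggests why a higher inlier ratio matters so much for registration cost.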
Following [6, 5, 12], we compare our method (32-d) with several existing 3D local descriptors on the benchmark. For hand-crafted descriptors, FPFH  (33-d) and SHOT  (352-d) are tested, and their implementations come from PCL . For learned descriptors, 3DMatch  (512-d), CGF  (32-d), PPFNet  (64-d), PPF-FoldNet  (512-d) and the current state-of-the-art 3DSmoothNet  (32-d) are tested. Additionally, we also compare with LMVCNN , a learned multi-view descriptor baseline using viewpoint clustering for offline rendering and max-view pooling for multi-view fusion. The original LMVCNN uses AlexNet  as its CNN backbone and outputs 128-d descriptors, but for fair comparisons, we reimplemented LMVCNN with the same CNN backbone and descriptor dimensionality (32-d) as our method. We use the implementations and trained weights from the authors for 3DMatch, CGF and 3DSmoothNet. Since the implementations of PPFNet and PPF-FoldNet are not publicly accessible, we include their reported performance for completeness.
Table 1 shows the comparison results on the benchmark. For an inlier ratio threshold of 0.05, our method achieves an average recall of 97.5%, outperforming all the competing descriptors. Nevertheless, 0.05 is a relatively loose threshold on 3DMatch, since 3DSmoothNet (95.0%), LMVCNN (96.5%) and our method have all achieved almost saturated performance with relatively small differences. Even so, our method obtains higher recalls in most testing scenes than 3DSmoothNet and LMVCNN. More notably, for a stricter threshold of 0.2, there is significant improvement of our method over the other competitors. Specifically, our method maintains a high average recall of 86.9%, while 3DSmoothNet and LMVCNN drop to 72.9% and 81.0%, respectively. The performance of FPFH, SHOT, 3DMatch, and CGF falls below 30%.
In Fig. 6, we plot the average recalls with respect to a range of inlier ratio thresholds, illustrating the consistent improvement brought by our method over the compared descriptors under different inlier ratio conditions. Additionally, Table 2 lists the average number of correct correspondences found by each descriptor, using the same notations as in Eq. 7. It is observed that our multi-view descriptor finds about 1.5× and 1.3× the average number of correspondences of 3DSmoothNet and LMVCNN, respectively. This helps account for the robustness of our descriptor. Additionally, Fig. 7 visualizes some point cloud registration results obtained by different descriptors with RANSAC. Particularly, it is observed that our descriptor is robust in the registration of fragments with large flat regions (the second row).
Rotated 3DMatch Benchmark. To evaluate the robustness of the descriptors against rotations, we construct a rotated 3DMatch benchmark [5, 12] by rotating the testing fragments with randomly sampled axes and angles in . The keypoint indices of each fragment are kept unchanged. Table 3 gives the average recalls for each descriptor in the Rotated column. Our method achieves average recalls of 96.9% and 82.1% for 0.05 and 0.2 respectively, both surpassing the performance of 3DSmoothNet (94.9% and 72.7%), LMVCNN (95.7% and 76.7%) as well as the other descriptors. The evaluation results indicate that our method can handle rotation well.
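Constructing such a rotated test set amounts to applying one random rigid rotation per fragment. The sketch below, in plain Python, samples a random axis and angle and builds the rotation matrix with Rodrigues' formula. The angle range of [0, 2π) and the uniform axis sampling are assumptions, since the range is missing from the text, and the function names are hypothetical.

```python
import math
import random

def random_rotation(rng):
    """Rotation matrix from a random unit axis and a random angle."""
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]  # isotropic direction
    n = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / n for a in axis)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    # Rodrigues' rotation formula written out as a 3x3 matrix.
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def rotate(points, R):
    """Apply R to every point; point order is preserved, so keypoint
    indices into the fragment remain valid."""
    return [tuple(sum(R[r][k] * p[k] for k in range(3)) for r in range(3))
            for p in points]
```

Because the point order is unchanged, the same keypoint indices can be reused for the rotated fragment, as stated in the text.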
Sparse 3DMatch Benchmark. To evaluate the robustness of the descriptors against point density, we follow [5, 12] to construct a sparse 3DMatch benchmark. Concretely, for each testing fragment, the keypoints are firstly retained and then 50% or 25% of the remaining points are randomly selected. The evaluation results are shown in Table 3 (the last two columns). It is found that owing to the sphere-based rendering, our method is able to handle different point densities, like LMVCNN and 3DSmoothNet, and maintains the superior performance.
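The subsampling procedure described above (keep all keypoints, then keep a random 50% or 25% of the remaining points) can be sketched as follows. The function name `sparsify` and the index-based interface are hypothetical, and the remapping of keypoint indices into the reduced fragment is left out.

```python
import random

def sparsify(num_points, keypoint_indices, keep_ratio, rng):
    """Return the sorted indices of points kept in the sparse fragment."""
    keypoints = set(keypoint_indices)
    others = [i for i in range(num_points) if i not in keypoints]
    # Randomly keep the requested share of the non-keypoint points.
    kept = rng.sample(others, int(round(keep_ratio * len(others))))
    return sorted(keypoints | set(kept))
```

For example, `sparsify(n, keypoints, 0.25, random.Random(0))` would produce the indices for the Sparse (0.25) setting of one fragment.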
Table 3. Average recalls on the Rotated, Sparse (0.5) and Sparse (0.25) 3DMatch benchmarks.
Running Time. Table 4 summarizes the running time for the learned descriptors on the standard 3DMatch benchmark. All the experiments were performed on a PC with an Intel Core i7 @ 3.6GHz, 32 GB of RAM and an NVIDIA GTX 1080Ti GPU. The input preparation in Table 4 refers to voxelization with TDF for 3DMatch, spherical histogram computation for CGF, LRF computation and SDV voxelization for 3DSmoothNet, and multi-view rendering (Sec. 3.1) for our method. The inference in Table 4 refers to descriptor extraction from the prepared inputs with neural networks. The results show that the input preparation stage dominates the running time of our method. Additionally, for sphere-based rendering (Sec. 3.1), it takes 0.16 ms to determine a point radius by neighborhood query with FLANN (used in our implementation), while alternatively the computation can be avoided by using a fixed radius. Nevertheless, our method still demonstrates competitive running time performance.
The ETH outdoor benchmark consists of four scenes, including Gazebo-Summer, Gazebo-Winter, Wood-Summer and Wood-Autumn. The point clouds were obtained by a laser scanner and mostly capture outdoor vegetation. Thus, the point clouds cover a large spatial range with a low resolution and contain complex and noisy local geometry. Identical to the 3DMatch benchmark, 5,000 keypoints are randomly sampled in each point cloud for descriptor extraction. The evaluation metric is the same as that in Sec. 4.1. No fine-tuning is performed for the descriptors trained on the 3DMatch benchmark. To accommodate the low resolution and large spatial range of the point clouds, the voxel grids for 3DMatch and 3DSmoothNet are enlarged with longer edges than those in Sec. 4.2. The radius of the spherical histogram in CGF is also enlarged. For LMVCNN and our method, the distance in each viewpoint is multiplied by a factor of 3.
The average recall results are shown in Table 5. Our method (79.9%) achieves comparable performance to 3DSmoothNet (79.0%). Meanwhile, our method significantly outperforms LMVCNN (39.7%) and SHOT (61.1%), and the other descriptors (including CGF, 3DMatch and FPFH) fall below 25%. To account for the deteriorated performance of LMVCNN, further experiments on its used view selection and multi-view fusion strategies are performed in Sec. 4.4. The above results show that our method trained on the 3DMatch benchmark can generalize well to outdoor scenes.
Descriptor Dimension & Viewpoint Number. In Fig. 8 we plot the average recalls of our method with different descriptor dimensions and viewpoint numbers (as defined in Sec. 3.3 and Sec. 3.1). It is found that increasing the descriptor dimension and the viewpoint number beyond certain values leads to saturated performance. Thus we adopt a descriptor dimension of 32 and 8 viewpoints for our method in the experiments.
Viewpoints. In Table 6 (top), we show the performance of our network trained with different viewpoint selection rules in multi-view rendering. Concretely, the straightforward random sampling rule places the viewpoints randomly within the range in Eq. 5. The viewpoint clustering rule used in LMVCNN selects three representative viewing directions via K-medoids clustering. The orbited placement rule sets the viewpoints at a fixed step around the viewing center (Sec. 3.1), similar to the strategy used in 3D shape recognition works [50, 56, 9]. The performance of our network without rotation augmentation to the rendered view patches is also provided. It is found that our optimizable viewpoints produce better performance than these alternative view selection rules, especially in terms of generalization to the ETH outdoor dataset.
Multi-view Fusion. We perform experiments to compare our soft-view pooling with several alternative multi-view fusion approaches, including max-view pooling , Fuseption , and NetVLAD . We list the performance of the network trained with the above fusion approaches in Table 6 (bottom). While on the 3DMatch dataset the improvement of soft-view pooling is small compared with max-view pooling, our method shows significantly better generalization on the ETH outdoor dataset. This is partially because the low-resolution scans of outdoor vegetation in ETH would produce relatively noisy renderings, presenting challenges to max-view pooling for selecting the strongest feature response. Differently, the response is adaptively gathered in our method with attention.
|Ours w/o rotation augment.||96.9||85.6||54.9|
We have presented a novel end-to-end framework for learning local multi-view descriptors of 3D point clouds. Our framework performs in-network multi-view rendering with optimizable viewpoints that can be jointly trained with later stages, and integrates convolutional features across views attentively via soft-view pooling. We demonstrate the superior performance of our method and its generalization to outdoor scenes through experiments. For future work, it is worth investigating the acceleration of differentiable multi-view rendering of point clouds and the extension of our framework to other tasks such as 3D object detection and recognition in point clouds.
This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11212119, HKUST 16206819, HKUST 16213520), and the Centre for Applied Computing and Interactive Media (ACIM) of School of Creative Media, CityU.
Learning to predict 3d objects with an interpolation-based differentiable renderer. CoRR abs/1908.01210.
PPF-FoldNet: unsupervised learning of rotation invariant 3d local descriptors. In Proc. ECCV.
GVCNN: group-view convolutional neural networks for 3d shape recognition. In Proc. IEEE CVPR.
View n-gram network for 3d object retrieval. In Proc. IEEE ICCV.
Sketch-R2CNN: an attentive network for vector sketch recognition. CoRR abs/1811.08170.
Generalized max pooling. In Proc. IEEE CVPR.
PyTorch: an imperative style, high-performance deep learning library. In NIPS, pp. 8026–8037.
In Sec. 3.2 of the main text, we adopt a CNN architecture similar to L2-Net to extract feature maps for each view patch. The detailed configuration of the network is listed in Table 7. Note that the network input is of size 64 × 64 with a single depth channel, and the final output is of size 8 × 8 with 128 feature channels.
|Layer|Type|Kernel|Stride|Padding|
|1|Conv - Norm - ReLU|3×3×32|2|1|
|2|Conv - Norm - ReLU|3×3×32|1|1|
|3|Conv - Norm - ReLU|3×3×64|2|1|
|4|Conv - Norm - ReLU|3×3×64|1|1|
|5|Conv - Norm - ReLU|3×3×128|2|1|
|6|Conv - Norm - ReLU|3×3×128|1|1|
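As a quick consistency check of Table 7, the spatial size after each layer can be traced with the standard convolution output-size formula. The sketch assumes that the last three columns of each row are the kernel (3 × 3 with the given number of output channels), the stride and the padding; the names `conv_output_size`, `LAYERS` and `trace_shapes` are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (output channels, stride) for the six Conv - Norm - ReLU blocks.
LAYERS = [(32, 2), (32, 1), (64, 2), (64, 1), (128, 2), (128, 1)]

def trace_shapes(input_size=64):
    """Return (channels, height, width) after each block."""
    shapes, size = [], input_size
    for channels, stride in LAYERS:
        size = conv_output_size(size, stride=stride)
        shapes.append((channels, size, size))
    return shapes
```

Under these assumptions, the three stride-2 layers halve a 64 × 64 input three times, giving the 8 × 8 output with 128 channels stated in the text.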
In Fig. 9, we visualize the optimizable viewpoints after training. We also show the viewpoints obtained by a clustering scheme similar to the one in . Specifically, 150 spherical coordinates are randomly sampled on the hemisphere where point normals reside, and then the k-medoids clustering algorithm is applied to select three viewing directions. For each viewing direction, a virtual camera is placed at distances of 0.3m, 0.6m, 0.9m to the points of interest, and each rendered view patch is augmented with four in-plane rotations.
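The clustering-based baseline described above can be sketched in plain Python. This is an illustrative k-medoids implementation under assumptions, not the code used in the experiments: directions are sampled uniformly on the upper hemisphere, the dissimilarity is the angle between directions, and the initialization is random. All function names are hypothetical.

```python
import math
import random

def sample_hemisphere(n, rng):
    """Uniform unit directions with z >= 0 (z uniform in [0, 1])."""
    dirs = []
    for _ in range(n):
        z = rng.uniform(0.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        dirs.append((r * math.cos(phi), r * math.sin(phi), z))
    return dirs

def angular_distance(a, b):
    d = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, d)))

def k_medoids(points, k, rng, iterations=20):
    """Alternate between assigning points to the nearest medoid and
    re-picking each cluster's medoid as its most central member."""
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iterations):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: angular_distance(p, points[m]))
            clusters[nearest].append(i)
        new_medoids = [
            min(members, key=lambda c: sum(angular_distance(points[c], points[j])
                                           for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

def clustered_cameras(rng, num_samples=150, k=3, distances=(0.3, 0.6, 0.9)):
    """Pair each clustered viewing direction with each camera distance."""
    directions = k_medoids(sample_hemisphere(num_samples, rng), k, rng)
    return [(d, rho) for d in directions for rho in distances]
```

With three directions and three distances this yields 9 virtual cameras, and the four in-plane rotations of each rendered patch bring the total to 36 view patches per point.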
As shown in Fig. 9, there are mainly two differences between the hand-crafted rule and our method. First, the hand-crafted rule places some viewpoints far from points of interest, while the learnt viewpoints have a more concentrated distance range, indicating the relatively low importance of broader global context. Second, the hand-crafted rule selects some dominant viewing directions through clustering, whereas the learnt viewpoints have more distributed viewing directions around the points of interest, which can help to capture more local geometry variance. In sum, the learnt viewpoints effectively balance the extent of context-awareness and local details in extracted descriptors, challenging the design wisdom of hand-crafted rules.
In Sec. 4.4 of the main text, we compared the proposed soft-view pooling with alternative fusion approaches including max-view pooling [19, 50, 42], Fuseption , and NetVLAD . Fuseption has two branches: in the first branch, the feature maps of all the views are first channelwise concatenated together in a specific order and then fed into a convolutional block; in the second branch, max-pooling is applied to the inputs and the results are added to the output of the first branch, serving as a shortcut connection. NetVLAD is a descriptor pooling method that summarizes the residuals of each input w.r.t. several learnable cluster centers. The number of cluster centers is a hyper parameter, which is set to eight in our experiments. The network is trained with the alternative fusion approaches, while the other stages are kept unchanged. The descriptor dimension is set to 32, and the optimizable viewpoint number is set to 8.
In Fig. 10, we visualize the rendered multi-view inputs to CNNs, extracted feature maps for each view, and fused feature maps across views. It is observed that the choice of multi-view fusion influences the features that the CNN learns to extract. Before fusion, for soft-view pooling and NetVLAD, the feature maps of each view extracted by the CNN tend to have more responses, compared to max-view pooling and Fuseption. After fusion, the feature maps produced by max-view pooling and NetVLAD tend to have more high responses than those of soft-view pooling and Fuseption. Note that for each location in the fused feature maps, max-view pooling only selects the strongest input response across views and discards the rest, while our soft-view pooling collectively considers all the inputs in an attentive manner for integration.
In Fig. 11, we visualize the color-coded local descriptors for all the points in the point clouds. Specifically, we project the high dimensional descriptors with PCA and keep the first three components, which are color-coded. It is observed that the descriptors of 3DSmoothNet and our method are both geometry-aware. Particularly, our method is able to capture more geometric changes in the point clouds (see the highlighted wall, pillow and floor regions of the point clouds in Fig. 11). In Fig. 12, we show additional geometric registration results of point cloud pairs, which further demonstrate the above advantage of our method.
For the running time of 3DSmoothNet in Sec. 4.2 of the main text, we observed some gap between our experiment results (input prep: 39.4 ms; inference: 0.2 ms) and the performance reported by the authors (input prep: 4.2 ms; inference: 0.3 ms). We used the source code of 3DSmoothNet released by the authors (https://github.com/zgojcic/3DSmoothNet), and the running time gap of input preparation is likely due to the difference in hardware configurations. The authors used a PC with an Intel Xeon E5-1650, 32 GB of RAM and an NVIDIA GeForce GTX 1080 GPU, while we used a PC with an Intel Core i7 @ 3.6GHz, 32 GB of RAM and an NVIDIA GTX 1080Ti GPU. Their input preparation stage involving LRF computation and SDV voxelization runs on CPU, which may be accelerated with GPU for further improvement.