(BMVC'20 Oral) Neighbourhood-Insensitive Point Cloud Normal Estimation Network
We introduce a novel self-attention-based normal estimation network that is able to focus softly on relevant points and adjust the softness by learning a temperature parameter, making it able to work naturally and effectively within a large neighbourhood range. As a result, our model outperforms all existing normal estimation algorithms by a large margin, achieving 94.1% accuracy compared with the previous state of the art of 91.2%, with a 25x smaller model and 12x faster inference time. We also use point-to-plane Iterative Closest Point (ICP) as an application case to show that our normal estimations lead to faster convergence than normal estimations from other methods, without manually fine-tuning neighbourhood range parameters. Code available at https://code.active.vision.
Normal estimation is an important low-level task that forms the foundation for many high-level applications, such as tracking, reconstruction, and rendering. Principal Component Analysis (PCA) [hoppe1992surface], the classic normal estimation method, estimates a normal vector at a point by collecting its $k$ nearest neighbours ($k$-NN) and fitting a plane to them. It is worth noting that the neighbourhood range $k$ is a global hyper-parameter that requires manual tuning for PCA to achieve its best accuracy. In practice, a single global $k$ is often not suitable for an entire scene, where different regions have different geometric structures. Therefore, an algorithm that determines the neighbourhood range adaptively based on local geometric structures, and that is insensitive to variation in the neighbourhood range, is desirable, as downstream tasks can then enjoy high-quality normal estimations without manually tuning neighbourhood range parameters.
One possible approach to achieve this is to identify appropriate neighbourhood scales to estimate normals under different geometric properties. One line of research [guerrero2018pcpnet, ben2019nesti] aims to solve this problem by integrating information from multi-scale patches. Here, the experts that specialise at small scales are responsible for sharper regions and the experts that specialise at large scales are responsible for smoother regions. A scale manager can choose how to weight estimations from different experts. This approach is effective and does not require manual fine-tuning of the neighbourhood range, but it comes at the expense of a large neural network: a model that processes $n$ scales is about $n$ times as large as a single-scale model. For example, the 7-expert Nesti-Net [ben2019nesti] model is about 1.1GB.
In this paper, we propose a normal estimation network that selects neighbour points softly according to local geometric properties, by applying a multi-head self-attention module to the neighbour points around the current point. Our model allows a soft selection of neighbour points by learning a temperature parameter that controls the softmax function within the attention module. As a result, our method does not need a carefully picked $k$. Given a fairly large neighbourhood range, for example, setting $k$ to 30, 40, or even 50 when the best $k$ for PCA is 8, our network is able to focus on relevant points adaptively and consistently outperforms all existing normal estimation algorithms, with a significantly smaller model and faster inference time. We also provide a point-to-plane ICP [chen1992object] application study to demonstrate how downstream algorithms like ICP can benefit from our neighbourhood-insensitive normal estimation.
The rest of the paper is organised as follows: Sec. 2 introduces related works on point cloud reasoning, normal estimation, and attention mechanisms. Sec. 3 describes our network in detail. We present our results in Sec. 4, together with a point-to-plane ICP application study, and conclude in Sec. 5.
In this section, we present approaches related to our work, starting with neural network point cloud reasoning, moving to normal estimation from point clouds and ending with selected attention-based methods.
Neural Network Point Cloud Reasoning. Learning-based 3D point cloud reasoning is an active research area. Su et al. [su2015multi] first proposed to process point clouds using CNNs on their 2D projections. This method can extract semantic information but loses geometric details significantly. Qi et al. [qi2016volumetric] proposed to analyse point clouds using volumetric representations so that 3D CNNs could be applied directly. This method preserves 3D geometric structures but is computationally expensive, and geometric details are also lost during the volumetric quantization. To overcome these issues, PointNet [PointNet_Charles2017] was proposed to consume point clouds directly. The follow-up work PointNet++ [PointNet2_Qi2017] improves the performance using a hierarchical strategy. More recent works, such as DGCNN [DGCNN_Wang2018], PointCNN [PointCNN_Li2018], and KPConv [thomas2019kpconv], combine graph neural networks and deformable convolutions with point cloud processing. These approaches focus on standard tasks like classification and segmentation. The next paragraph focuses on approaches for normal estimation.
Point Cloud Surface Normal Estimation. PCA [hoppe1992surface] is a classic method for normal estimation. Given a target point and its neighbours, PCA treats these points as samples from a multivariate distribution, of which the three spatial coordinates $x$, $y$, and $z$ are three random variables. By applying eigen-decomposition to the covariance matrix of this point cloud patch, the eigenvector that has the smallest eigenvalue points in the normal direction or its reverse. Jet [cazals2005estimating_jet] estimates a normal vector by fitting a truncated Taylor expansion to the local point cloud patch. Geometric properties such as the surface normal and the two principal curvatures can be computed from the Taylor expansion.
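The PCA procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `pca_normal`, the brute-force neighbour search, and the synthetic plane are assumptions made for the example.

```python
import numpy as np

def pca_normal(points, query, k=8):
    """Estimate an unoriented normal at `query` from its k nearest neighbours."""
    # Brute-force k-NN: squared distances from the query to every point.
    d2 = np.sum((points - query) ** 2, axis=1)
    patch = points[np.argsort(d2)[:k]]
    # 3x3 covariance matrix of the centred patch.
    centred = patch - patch.mean(axis=0)
    cov = centred.T @ centred / len(patch)
    # eigh returns eigenvalues in ascending order, so column 0 belongs to
    # the smallest eigenvalue: the normal direction, up to sign.
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]

# Usage on a synthetic plane z = 0 (hypothetical data).
rng = np.random.default_rng(0)
plane = np.column_stack([rng.uniform(-1, 1, (200, 2)), np.zeros(200)])
normal = pca_normal(plane, plane[0], k=8)
```

On the flat patch the recovered normal is aligned with the z axis, with an arbitrary sign. That sign ambiguity is why normals estimated this way are treated as unoriented.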
Recently, learning-based point cloud normal estimation approaches have been proposed. PCPNet [guerrero2018pcpnet] applies PointNet to a local patch and regresses a surface normal directly. To adapt better to local geometries, it also introduces a multi-scale model that runs PointNet [PointNet_Charles2017] at three different local neighbourhood scales. Similarly, Nesti-Net [ben2019nesti] is a Mixture of Experts (MoE) [hampshire1992meta, jacobs1991adaptive] system that processes 7 different scales of neighbours around a target point. Each expert learns to specialise in regions with certain geometric properties and outputs confidences at inference time. A manager network then integrates the normal estimations from all experts into a final result. Both methods tackle normal estimation by exploring more scale levels and balancing results across the multi-scale inferences. This line of approaches achieves better performance than classical PCA and Jet but also requires a large model and a long inference time. The multi-scale setting also prevents information flow among sub-networks. Our method improves both performance and efficiency by using a single network that learns to select point information softly and can adapt to different geometries without quantizing a local patch into a fixed number of scales.
Attention Mechanism. Attention was first introduced by Bahdanau et al. [bahdanau2015neural] to handle long-range dependencies in language processing tasks together with Recurrent Neural Networks (RNNs). Self-attention, or intra-attention [cheng2016long, lin2017structured], was proposed to understand relationships within a sequence itself. Vaswani et al. [vaswani2017attention] designed multi-head attention and replaced RNNs with self-attention to enable parallel processing, which is difficult to achieve in RNNs. Recently, attention mechanisms have been introduced to the point cloud domain to explore relationships within a point cloud. These methods use attention mechanisms to enhance semantic understanding over the entire point cloud. [yang2019modeling, yan2020pointASNL] use attention mechanisms to sample points. Wang et al. [wang2019graph] employ self-attention in semantic segmentation tasks. DCP [DCP] applies an attention mechanism to find correspondences between two point clouds in a registration task. Inspired by prior work, we propose applying attention to normal estimation, a task that requires an understanding of local geometric structures.
Given a point cloud, our model predicts a surface normal for every point within it. To achieve this, our model takes the $k$ nearest neighbours around a point, which we call a local patch, as input and outputs the predicted surface normal for this local patch as a three-dimensional unit vector. The rest of the section introduces the overall normal estimation pipeline, elaborates on our key attention module, and provides implementation details at the end.
Pipeline. Our prediction pipeline consists of five steps: 1) The network takes a local patch at a point as input and extracts a set of per-point features $F$ for the patch using a 3-layer Multi-Layer Perceptron (MLP), where each element of $F$ is a per-point feature vector of dimension $d$. 2) We define a Temperature-adjusted Multi-Head Self-Attention (TMHSA) module and apply it to the features $F$ extracted in the first step. The TMHSA softly mingles per-point information from different aspects and outputs an integrated feature set. This is the key step that enables soft neighbourhood selection, point information integration, and attention strength adjustment. 3) A Feed Forward Network (FFN), as introduced in [vaswani2017attention], is applied to the integrated feature set in order to introduce non-linearity. 4) We apply a point-wise max pooling, a symmetric function introduced in PointNet [PointNet_Charles2017], to the transformed features to extract a patch descriptor for the local patch. This patch descriptor is designed to capture the geometric information of the local patch. 5) A number of fully-connected (FC) layers are applied to the patch descriptor to finally regress a normal estimation. With this pipeline, our model learns to estimate normals for different surface properties and to adjust the softness in the self-attention. The full process is illustrated in Fig. 1.
Training Loss. Our network is trained by minimising the unoriented angle difference between estimated normals $\hat{n}$ and ground truth normals $n$. We use the sine distance to measure the unoriented angle difference:
$$\mathcal{L}(\hat{n}, n) = \lVert \hat{n} \times n \rVert$$
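As an illustration, the sine distance between two unit vectors is the norm of their cross product, which is zero both when the vectors agree and when they point in exactly opposite directions. The NumPy sketch below assumes this cross-product form and assumes the loss is averaged over a batch; the function name `sine_loss` is hypothetical.

```python
import numpy as np

def sine_loss(pred, gt):
    """Mean sine of the angle between predicted and ground-truth normals."""
    # Normalise both sets so that |a x b| equals sin(angle) exactly.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Batch averaging is an assumption of this sketch.
    return np.linalg.norm(np.cross(pred, gt), axis=-1).mean()

up = np.array([[0.0, 0.0, 1.0]])
flipped_loss = sine_loss(-up, up)                        # opposite orientation
orthogonal_loss = sine_loss(np.array([[1.0, 0.0, 0.0]]), up)
```

A flipped prediction is not penalised, while a prediction orthogonal to the ground truth receives the maximum loss of 1, which matches the unoriented setting.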
TMHSA Module. The key component of our network architecture is the TMHSA module, which enables our network to decide which neighbours are useful and how they should be combined. We first introduce the Temperature-adjusted Self-Attention (TSA) module, the building block of the TMHSA module. The TSA takes a set of per-point features $F$ around a point as input and produces a set of weighted per-point features. Features that are potentially useful to the task are assigned higher weights, based on the dot product between features [vaswani2017attention]. Specifically, we define TSA as follows:
$$\mathrm{TSA}(F) = \mathrm{softmax}\left(\frac{(F W^{Q})(F W^{K})^{\top}}{\tau}\right) F W^{V}$$
where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable linear projections that transform $F$ to other feature spaces, $\tau$ is the temperature, and $\mathrm{TSA}(F)$ is the temperature-adjusted feature set for a local point patch. In the TSA module, the softmax function produces weights for each feature vector and the temperature $\tau$ controls the smoothness of the softmax function: the softmax gets sharper when the temperature decreases. By making $\tau$ a learnable parameter, the network can control the sharpness of the softmax function, and we refer to this behaviour as attention strength adjustment.
The TMHSA module combines several TSA heads:
$$\mathrm{TMHSA}(F) = \mathrm{Concat}\big(\mathrm{TSA}_1(F), \dots, \mathrm{TSA}_h(F)\big)\, W^{O}$$
where $h$ is the number of heads, i.e. the number of linear transformations we apply to $F$, and $W^{O}$ is a learnable projection matrix that combines the outputs from all attention heads.
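The effect of the temperature can be illustrated with a forward-pass sketch in NumPy. This is not the authors' implementation: the weights are random rather than learned, a single temperature is shared by all heads (an assumption), and the names `tsa`, `tmhsa`, and the sizes `k`, `d`, `h` are chosen only for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tsa(F, Wq, Wk, Wv, tau):
    """One temperature-adjusted self-attention head over k per-point features."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    weights = softmax(Q @ K.T / tau)  # (k, k); each row sums to 1
    return weights @ V, weights

def tmhsa(F, heads, Wo, tau):
    """Concatenate the outputs of all heads and mix them with Wo."""
    outs = [tsa(F, Wq, Wk, Wv, tau)[0] for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=1) @ Wo

rng = np.random.default_rng(0)
k, d, h = 50, 64, 4  # neighbours, feature dimension, heads (illustrative)
F = rng.standard_normal((k, d))
heads = [tuple(rng.standard_normal((d, d // h)) * 0.1 for _ in range(3))
         for _ in range(h)]
Wo = rng.standard_normal((d, d)) * 0.1

out = tmhsa(F, heads, Wo, tau=1.0)
_, soft = tsa(F, *heads[0], tau=10.0)   # high temperature: flatter weights
_, sharp = tsa(F, *heads[0], tau=0.1)   # low temperature: peaked weights
```

With a low temperature each point concentrates its attention on a few neighbours, and with a high temperature the weights approach a uniform average over the patch. A learnable temperature lets training choose a point between these two regimes.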
Our network is implemented using PyTorch [paszke2019pytorch]. In the pre-processing stage, we scale all objects to a unit sphere. For each point, we find its $k$ nearest neighbours and shift this $k$-point patch to the origin by subtracting the mean of the patch. We use the Adam optimiser with an initial learning rate of 5e-4, a momentum of 0.9, and no weight decay. The learning rate is divided by 10 at epochs 400 and 800, and we train the model for 900 epochs in total. We use a batch size of 12000 and found no significant performance difference with other batch sizes. The convolutional layers are initialised using Kaiming Uniform [he2015delving] and the temperature parameter is initialised to 1.0.
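The pre-processing steps can be sketched as follows. Two details are assumptions of this example rather than statements from the text: the object is centred at its centroid before scaling, and the query point counts as one of its own neighbours. The function names are hypothetical.

```python
import numpy as np

def normalise_to_unit_sphere(points):
    """Centre an object and scale it so the farthest point has norm 1."""
    centred = points - points.mean(axis=0)  # centring is an assumption
    return centred / np.linalg.norm(centred, axis=1).max()

def local_patch(points, index, k):
    """Return the k-point patch around points[index], shifted to its mean."""
    d2 = np.sum((points - points[index]) ** 2, axis=1)
    patch = points[np.argsort(d2)[:k]]  # includes the query point itself
    return patch - patch.mean(axis=0)

rng = np.random.default_rng(0)
cloud = normalise_to_unit_sphere(rng.standard_normal((1000, 3)) * 5.0)
patch = local_patch(cloud, index=0, k=50)
```

The brute-force neighbour search is quadratic in the number of points; for clouds with 100,000 points a spatial index such as a k-d tree would be the usual choice.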
In this section, we demonstrate two benefits of our neural network: 1) our method is much less sensitive to the neighbourhood range than other approaches; 2) our method achieves state of the art performance with a much smaller model and a much faster inference time. We organise this section as follows: Sec. 4.1 introduces the dataset details and the evaluation metrics used to compare with other approaches. Sec. 4.2 presents our main results on stability, performance, model size, and inference time. To validate that learning a temperature parameter is helpful, we provide an ablation test in Sec. 4.3. Sec. 4.4 uses attention maps to discuss why our model is insensitive to variation in the neighbourhood range. Lastly, Sec. 4.5 provides an application study that uses the ICP algorithm as an example to show how downstream tasks can benefit from the neighbourhood-insensitive property.
We train our model using the PCPNet dataset [guerrero2018pcpnet], which consists of 30 objects, each containing 100,000 points. The training/validation/test splits contain 8/3/19 objects respectively. We do not use the noise-augmented data provided in the PCPNet dataset, and we follow the train/test splits of PCPNet [guerrero2018pcpnet] and Nesti-Net [ben2019nesti] to ensure a fair comparison.
We evaluate our normal estimation network using the root mean square error (RMSE) and the proportion of good points (PGP) metrics, the same as Nesti-Net [ben2019nesti] and PCPNet [guerrero2018pcpnet]. Both RMSE and PGP are computed using the unoriented angle error $e$. Under the definition of unoriented angle error, an angle difference of $180^{\circ} - \theta$ is considered as accurate as an angle difference of $\theta$. We compute the unoriented angle error using $e = \arccos(\lvert \hat{n} \cdot n \rvert)$, where $\hat{n}$ is a predicted normal vector and $n$ is a ground truth vector. PGP measures the proportion of points whose angle difference between the predicted result and the ground truth is less than a threshold $\alpha$. We compute it using $\mathrm{PGP}_{\alpha} = \frac{1}{\lvert P \rvert} \sum_{i=1}^{\lvert P \rvert} \mathbb{1}(e_i < \alpha)$, where $\mathbb{1}(\cdot)$ is an indicator function that yields 1 when the condition inside the operator holds and $\lvert P \rvert$ is the cardinality of the point cloud $P$.
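A small NumPy sketch of these metrics is given below. It assumes errors are reported in degrees and PGP as a percentage, as the table headers suggest; the function names and the three-point example are hypothetical.

```python
import numpy as np

def unoriented_angle_error(pred, gt):
    """Angle in degrees between normals, ignoring orientation (0 to 90)."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # The absolute value makes a flipped normal count as correct.
    cos = np.abs(np.sum(pred * gt, axis=-1))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

def pgp(errors, alpha):
    """Percentage of points whose error is below alpha degrees."""
    return 100.0 * np.mean(errors < alpha)

def rmse(errors):
    """Root mean square of the angle errors."""
    return np.sqrt(np.mean(errors ** 2))

pred = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 1.0]] * 3)
errors = unoriented_angle_error(pred, gt)
pgp10, error_rmse = pgp(errors, 10.0), rmse(errors)
```

In this example the flipped second prediction has zero error and only the orthogonal third prediction counts as a bad point, so two of the three points pass the 10 degree threshold.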
Stability. Fig. 2 shows that our network is insensitive to the neighbourhood range. Our model outperforms PCA by a large margin and, most importantly, the performance of our method does not degrade as the number of neighbours increases. Nesti-Net [ben2019nesti] and PCPNet [guerrero2018pcpnet] are shown separately in this figure because they use patches from multiple scales.
Our method consistently performs better than PCA and PCPNet as long as more than 3 neighbour points are used, and better than Nesti-Net when more than 15 neighbours are available. PCA achieves the best performance using 7 or 8 neighbours and our model outperforms the top PCA performance when more than 5 neighbours are given. This performance is desirable for downstream algorithms because it ensures that high-quality normal estimations could be found by simply setting a large enough neighbourhood range, and our network can choose neighbours it needs adaptively based on local geometric structures.
State of the Art Performance. In addition to being neighbourhood-insensitive, our method, equipped with the attentional neighbourhood integration model, also outperforms other methods by a large margin with a much smaller model and a faster inference time (Tab. 1).
|Models||PGP 5 (%)||PGP 10 (%)||RMSE (deg)|
Model Size and Inference Time. As mentioned above, our model is much smaller and faster than the previous state of the art model Nesti-Net [ben2019nesti], which has 8 sub-networks (7 experts and a manager). Our model takes 41.1MB while Nesti-Net takes 1051.6MB (25x larger). Regarding inference time, our model takes 6.5 seconds to evaluate 100,000 normals (50 neighbours for each normal), while the multi-scale Nesti-Net takes 81.1 seconds. We also report that PCA takes 0.27 seconds given 8 neighbours, the setting where PCA produces its best normal estimations according to Fig. 2. For learning-based models, we benchmark timings on an NVIDIA GTX 1080 Ti GPU. For PCA, we use NumPy's SVD function and an Intel i9-9900K CPU. Detailed model sizes and inference times are listed in Tab. 2.
|Models||Model size||Inference time (s)|
|PCA [hoppe1992surface]||N/A||0.27 (nb8)|
We verify that introducing a temperature parameter improves the network performance and makes the network more robust to large neighbourhood ranges in Tab. 3 and Fig. 3. By learning a temperature, our network learns to control the softmax in the TMHSA module and to control the attention strength.
|Model||PGP 5 (%)||PGP 10 (%)||RMSE (deg)|
In this section, we discuss why the performance of our network does not degrade as the number of neighbours increases, unlike PCA. In fact, our network performs better when more neighbours are available. To explain this, we show 4 attention maps from our models in Fig. 4. The four models used to generate these attention maps are trained and tested using local patches containing 5, 10, 25, and 50 neighbours and achieve a PGP 10 of 92.5%, 93.0%, 94.0%, and 93.3%, from left to right, while the best PCA performance is 91.4%. In an attention map, each local patch is represented by a row of pixels and attention weights are colour-coded by pixel intensity.
From the attention map over 5 and 10 neighbours, (a) and (b), we can see that the network pays more attention to neighbours that are close to the patch centre and ignores points farther than neighbours. This agrees with the PCA performance in Fig. 2. From the attention map over 25 and 50 neighbours, (c) and (d), we observe that the network pays extra attention to points at the right-end in a row. These are points far from the patch centre. To illustrate this better, we show the attention weights predicted by our network in 3D in Fig. 5. These attention weights visualisations suggest that the network learns to identify geometric properties around the current point using points from larger scales, by inspecting, for example, whether the patch contains a sharp edge or a corner. Therefore, our network maintains a relatively stable performance despite different numbers of neighbours by focusing on relevant points.
In this section, we demonstrate how downstream algorithms can benefit from our neighbourhood-insensitive normal estimation network. We take the point-to-plane ICP [ICP_Besl1992, chen1992object] algorithm as an example and show that, by simply setting the number of neighbours in our network to a large enough number, ICP with our normal estimation converges faster than with other normal estimation methods. Moreover, ICP with classical normal estimation methods like PCA requires manual fine-tuning to achieve its best performance, which is still outperformed by our network.
ICP is a widely used point cloud registration and pose estimation method, and many variants have been proposed over the past decades. The original point-to-point ICP [ICP_Besl1992] optimises the distance between closest point pairs directly. The point-to-plane ICP [chen1992object] was proposed to improve the convergence speed and the robustness against noise. Given two point clouds, a source point cloud $P$ and a rigidly transformed destination point cloud $Q$, ICP iteratively computes the rigid transformation $T$ that registers $P$ to $Q$. The point-to-point energy is defined as
$$E_{\text{point}} = \sum_{i} \lVert T p_i - q_i \rVert^{2}$$
and the point-to-plane energy is defined as
$$E_{\text{plane}} = \sum_{i} \big( (T p_i - q_i) \cdot n_i \big)^{2}$$
where $T$ denotes the estimated transformation, $p_i$ is a point in $P$, $q_i$ is the corresponding point in $Q$, and $n_i$ is the surface normal at $q_i$.
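To make the difference between the two energies concrete, the sketch below evaluates both for a given rigid transformation. It assumes correspondences are already known (point `i` in the source matches point `i` in the destination) and that normals are taken at the destination points; the function names and the toy data are hypothetical.

```python
import numpy as np

def point_to_point_energy(R, t, src, dst):
    """Sum of squared distances between transformed source and destination."""
    residual = src @ R.T + t - dst
    return np.sum(residual ** 2)

def point_to_plane_energy(R, t, src, dst, normals):
    """Sum of squared residuals projected onto the destination normals."""
    residual = src @ R.T + t - dst
    return np.sum(np.sum(residual * normals, axis=1) ** 2)

# Toy example: a flat destination patch on z = 0 with normals along z,
# and a source that has slid by 0.1 along x within the plane.
dst = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0], [0.5, 0.5, 0.0]])
normals = np.tile([0.0, 0.0, 1.0], (5, 1))
src = dst + np.array([0.1, 0.0, 0.0])
R, t = np.eye(3), np.zeros(3)

e_point = point_to_point_energy(R, t, src, dst)
e_plane = point_to_plane_energy(R, t, src, dst, normals)
```

Sliding within the plane leaves the point-to-plane energy at zero while the point-to-point energy is positive. The point-to-plane energy only penalises motion along the normals, so its value depends directly on the quality of the estimated normals.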
Results. We run the point-to-plane ICP on the test set of the PCPNet dataset [guerrero2018pcpnet], using normals from different estimation methods (Tab. 4). With normal estimations from our model, the point-to-plane ICP converges faster than with normals from PCA [hoppe1992surface] and Nesti-Net [ben2019nesti]. Comparing the nb25 and nb50 models in the last two rows of Tab. 4, we conclude that our model is stable across different numbers of neighbours. For PCA, we run an exhaustive search over different numbers of neighbours and report the smallest iteration number in Tab. 4 (the PCA-full row). The full ICP iteration table for PCA and the full shape names s1-s19 can be found in the supplementary material.
ICP Setup To ensure fair comparisons, we set an early stopping criteria using the point-to-point energy since this stopping criterion is immune to the normal quality when evaluating registration results. A perfect registration should produce but the can still be larger than zero because the normals estimations are not perfect. We set the stopping threshold to . For simplicity, we set the rotation angles between and to in all three axes (), and the translation between and to . We scale all point clouds to a unit sphere and shift them to the point cloud centre before running ICP. We use the Levenberg–Marquardt [Levenberg1944, Marquardt1963] solver to optimise . Our ICP code comes along with the training code at https://code.active.vision.
In this work, we propose an attention-based normal estimation network that is insensitive to the neighbourhood range and provides state of the art normal estimation performance. We achieve this by applying an attention module to the local neighbourhood so that our network can focus on relevant points and integrate information softly. With a learnable temperature parameter, our network can also control how attentive it should be under the current neighbourhood settings. As a result, our network is 12x faster and 25x smaller than the previous state of the art model [ben2019nesti] and provides 2.9% higher accuracy (PGP 10) on the PCPNet dataset [guerrero2018pcpnet]. We also use point-to-plane ICP as an application to show that our normal estimations lead to faster convergence than normal estimations from other methods, without manually fine-tuning neighbourhood range parameters.
The authors would like to thank Min Chen, Tengda Han, Shuda Li, Tim Yuqing Tang and Shangzhe Wu for insightful discussions and proofreading.