1 Introduction
Scene flow is the dense 3D motion field of points. It is the 3D counterpart of optical flow, and is a more fundamental and unambiguous representation – optical flow is simply the projection of scene flow onto the image plane of a camera [3DSceneFlow]. Scene flow is useful in various fields, including robotics, autonomous driving, and human-computer interaction, and can also complement and improve visual odometry and SLAM algorithms [RGBD_flow, KITTI2015].
Estimating scene flow in 3D space directly from point cloud inputs is appealing: approaches that use stereo inputs must reconstruct 3D motion from optical flow and disparities, so their optimization is indirect. In this work, we focus on efficient large-scale scene flow estimation directly on 3D point clouds.
The problem statement for scene flow estimation is as follows. The inputs are two point clouds at two consecutive frames: PC1 at time t and PC2 at time t+1. Generally, each point has an associated feature vector, which contains the 3D coordinates of the point. Other low-level features, such as color and normal vectors, can also be included.
(In our experiments, we only use point coordinates, to demonstrate the effectiveness of our approach with the bare minimum of geometric information.) The output is the predicted scene flow vector $d_i$ for each point in PC1. We use the world coordinate system as the reference system; the goal is to estimate the scene flow induced by both ego-motion and the motion of dynamic objects; see Fig. 1.

Many existing deep learning approaches for 3D point cloud processing
[pointnet, pointnet2, kdnet, spectral_5] focus on accuracy but put less emphasis on minimizing computational cost. Consequently, these networks can only handle a limited number of points at once due to limited GPU memory, which is unfavorable for large-scale scene analysis. The drawbacks are twofold: 1) these methods frequently resort to dividing the point cloud into chunks, which causes global information loss and inaccurate prediction for boundary points, whose local neighborhoods are cut off; and 2) these methods also sometimes resort to point subsampling, which significantly hurts performance in regions with sparse point density. (1) How can we process the entire point cloud of the scene at once while avoiding the above problems? Moreover, in [pointnet, pointnet2]
, information across multiple points can only be aggregated through max-pooling, either globally or hierarchically, and
[pointnet2] uses linear search to locate each point's neighborhood. (2) How can we better restore structural information from unstructured and unordered point clouds? Also, in most 3D sensors the point density is uneven; e.g., nearby objects have high density while far-away objects are much sparser. (3) How can we make the approach robust under different point densities? Finally, scene flow estimation requires combining information from both point clouds. (4) How can we best fuse such information?

We propose a novel deep network architecture for scene flow estimation that tackles the above four problems. Inspired by Bilateral Convolutional Layers (BCL) [BCN1, BCN2] and the permutohedral lattice [permutohedral_lattice]
, we propose three new layer designs: DownBCL, UpBCL, and CorrBCL, which process general unstructured data efficiently (even beyond scene flow estimation). Our network first interpolates signals from the input points onto a permutohedral lattice. It then performs sparse convolutions on the lattice and interpolates the filtered signals to coarser lattice points. This process is repeated across several DownBCL layers, forming a hierarchical downsampling network. Symmetrically, our network interpolates the filtered signals from the coarsest lattice points to finer lattice points and performs sparse convolutions on the finer lattice points; this process is repeated across several UpBCL layers (a hierarchical upsampling network). Finally, the filtered signals from the finest lattice points are interpolated back to each point in the first input point cloud. During the downsampling process, we also fuse signals from both point clouds on the same lattices via our correlation operation (CorrBCL). Overall, we form an hourglass-like model that operates in a structured lattice space (except for the first and last operations) on unstructured points.
We conduct experiments on two datasets: FlyingThings3D [FlyingThings3D], which contains synthetic data, and KITTI Scene Flow 2015 [KITTI_2, KITTI_3], which contains real-world data from LiDAR scans. Our method outperforms state-of-the-art approaches. Furthermore, although trained on synthetic data only, our model generalizes to real-world data with different patterns. With a novel normalization scheme for BCLs, our approach also generalizes well under different point densities. Finally, we show that our network is efficient in terms of computational cost: it can process a whole pair of KITTI frames at once, with a maximum of 86K points per frame. Code and model are available at https://github.com/laoreja/HPLFlowNet.
2 Related work
3D deep learning.
Multi-view CNNs [multiview_1, multiview_2, multiview_3, multiview_4, multiview_5] and volumetric networks [volumetric_1, volumetric_2, volumetric_3, volumetric_and_multiview] leverage standard CNNs with grid-structured inputs, but suffer from discretization error in viewpoint selection and in volumetric representations, respectively. PointNet [pointnet, pointnet2] is the first deep learning approach to work on point clouds directly. Qi et al. [pointnet] propose to use a symmetry function for unordered inputs and use max-pooling to aggregate information globally. PointNet++ [pointnet2] is a follow-up with a hierarchical architecture that aggregates information within local neighborhoods. Klokov and Lempitsky [kdnet] use kd-trees to divide the point clouds and build architectures based on the divisions. Another branch of work [spectral_1, spectral_2, spectral_3, spectral_4, spectral_5] represents the 3D surface as a graph and performs convolution on its spectral representation. Su et al. [splatnet] propose an architecture for point cloud segmentation based on BCL [BCN1, BCN2] and achieve joint 2D-3D reasoning.
Our work is inspired by [splatnet], but with a different focus: [splatnet] exploits BCL's property of allowing different input and output points in order to fuse 2D and 3D information in a new way, while we focus on processing large-scale point clouds efficiently without sacrificing accuracy, which differs from all the above approaches. In addition, scene flow estimation requires combining information from two point clouds, whereas [splatnet] operates on a single point cloud.
Scene flow estimation.
Scene flow estimation with point cloud inputs is underexplored. Dewan et al. [remove_ground_1] formulate an energy minimization problem with assumptions on local geometric constancy and regularization for smooth motion fields. Ushani et al. [remove_ground_2] present a real-time four-step algorithm, which constructs occupancy grids, filters the background, solves an energy minimization problem, and refines with a filtering framework. Unlike [remove_ground_1, remove_ground_2], our approach is end-to-end. We also learn directly from data using deep networks and make no explicit assumptions; e.g., we do not assume rigid motions.
Wang et al. [continuous] propose a parametric continuous convolution layer that operates on non-grid structured data and apply it to point cloud segmentation and LiDAR motion estimation. However, this operator is defined on each point, and pooling is the only proposed way to aggregate information. FlowNet3D [flownet3d] builds on PointNet++ [pointnet2] and uses a flow embedding layer to mix the two point clouds, so it shares the aforementioned drawbacks of [pointnet2]. Work on scene flow estimation with other input formats (stereo [flownet3], RGB-D [VOSF], light field [light_field]) is less related, and we refer to Yan and Xiang [sceneflow_survey] for a survey.
3 BCL on permutohedral lattice
Bilateral Convolutional Layer (BCL).
BCL [BCN1, BCN2] is the basic building block we use. Similar to how a standard CNN endows the traditional convolution operation with learning ability, BCL extends the fast high-dimensional Gaussian filtering algorithm [permutohedral_lattice] with learnable weights.
BCL takes general inputs. The convolution operates in a $d$-dimensional space, and each input point $i$ has a position vector $p_i \in \mathbb{R}^d$ and a signal value $x_i$. The position vectors locate the points in the space on which the convolution operates. In our case, $d = 3$ and $p_i$ consists of the point's 3D coordinates.
The convolution step of BCL operates on a discrete domain, but the input points live in a continuous domain (for now, without loss of generality, think of the convolution as operating on the most commonly used integer lattice $\mathbb{Z}^d$, i.e., the regular grid, whose lattice points are tuples of integers). So BCL: 1) gathers signals from each input point onto its enclosing lattice points via interpolation (splat); 2) performs sparse convolution on the lattice (since not every lattice point has gathered signals, a hash table is used so that convolution is only performed on non-empty lattice points, for efficiency); and 3) returns the filtered signals from each lattice point to the output points inside the lattice point's nearest grid cells, via interpolation (slice); the use of interpolation makes it possible for the output points to be located at different positions from the input points. This forms the three-step pipeline of BCL: Splat-Conv-Slice.
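As a toy illustration (ours, not the paper's implementation), the Splat-Conv-Slice pipeline can be sketched in one dimension on the integer lattice, with linear interpolation standing in for the barycentric weights:

```python
import numpy as np

def splat(positions, signals, size):
    """Gather each point's signal onto its two enclosing lattice points
    with linear-interpolation weights."""
    lattice = np.zeros(size)
    for p, x in zip(positions, signals):
        lo = int(np.floor(p))
        w_hi = p - lo                   # weight for the upper lattice point
        lattice[lo] += (1.0 - w_hi) * x
        lattice[lo + 1] += w_hi * x
    return lattice

def conv(lattice, kernel):
    """Stand-in for the sparse convolution: plain 1-D convolution."""
    return np.convolve(lattice, kernel, mode="same")

def slice_(lattice, positions):
    """Return filtered signals to arbitrary (possibly new) output positions."""
    out = []
    for p in positions:
        lo = int(np.floor(p))
        w_hi = p - lo
        out.append((1.0 - w_hi) * lattice[lo] + w_hi * lattice[lo + 1])
    return np.array(out)

pts = np.array([0.25, 1.75, 2.5])
sig = np.array([1.0, 2.0, 4.0])
# identity kernel: isolates the effect of splat + slice alone
filtered = slice_(conv(splat(pts, sig, 5), np.array([1.0])), pts)
```

Note that splatting conserves total signal mass, and that slicing may query positions different from the input points, which is what lets BCL map between point sets.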
Permutohedral lattice.
The integer lattice works fine in low-dimensional spaces. However, the number of lattice points each input point interpolates to (i.e., the vertices of the Delaunay cell containing each input point) is $2^d$, which makes the splatting and slicing steps exponential in $d$. Hence, we use the permutohedral lattice instead (a lattice is a discrete additive subgroup of a Euclidean space [permutohedral_properties]; both the regular grid and the permutohedral lattice are specific lattices) [permutohedral_lattice, permutohedral_lattice_thesis, permutohedral_properties]: the $d$-dimensional permutohedral lattice is the projection of the scaled regular grid $(d+1)\mathbb{Z}^{d+1}$ along the vector $\mathbf{1} = (1, \ldots, 1)$ onto the hyperplane $H_d$, the subspace of $\mathbb{R}^{d+1}$ in which the coordinates sum to zero. The Delaunay cells of the permutohedral lattice are simplices, and the uniform simplices of the lattice tessellate $H_d$. By replacing regular grid cells with uniform simplices and using barycentric interpolation, BCL can operate on the permutohedral lattice with the same scheme as on the integer lattice. Special properties of the permutohedral lattice make it possible to compute the vertices of the simplex enclosing any query position, together with the barycentric weights, in $O(d^2)$ time.

Multiplying the position vectors by a scaling factor $s$, we can adjust the lattice resolution: larger $s$ corresponds to a finer effective resolution, where each simplex contains fewer points. The effect is the same as scaling the lattice itself; for ease of explanation we use the two views interchangeably and speak of finer and coarser lattice points.
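The zero-sum hyperplane $H_d$ can be illustrated in a few lines of NumPy. This sketch (ours) only shows the projection along the all-ones vector, not the full lattice construction:

```python
import numpy as np

def project_to_hyperplane(p):
    """Lift p in R^d to R^{d+1}, then project along the all-ones vector
    onto H_d = {x : sum(x) = 0}, the subspace the permutohedral
    lattice lives in."""
    x = np.append(p, 0.0)                 # lift to R^{d+1}
    ones = np.ones_like(x)
    return x - (x.sum() / x.size) * ones  # remove the component along 1

v = project_to_hyperplane(np.array([1.0, 2.0, 3.0]))
# v lies on H_3: its coordinates sum to (numerically) zero
```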
4 Approach: HPLFlowNet
BCL restores structural information from unstructured point clouds, which makes it possible to perform convolutions with kernel size greater than 1. Previous work [splatnet, BCN2] uses the same set of input points in the continuous domain for all the BCLs in the network. However, both the time and space cost of splatting and slicing in BCL are linear in the number of input points. Is there a way to stack BCLs more efficiently to form a deep architecture? How can we combine information from both point clouds for scene flow estimation? In this section, we address these problems and introduce our HPLFlowNet architecture.
4.1 DownBCL and UpBCL
We first introduce the downsampling and upsampling operators, DownBCL and UpBCL. Compared with the three-step operation of the original BCL, DownBCL has only two steps: Splat-Conv. The non-empty lattice points of the previous DownBCL become the input points of the next layer, thus saving the slicing step. DownBCL is for downsampling: we stack DownBCLs with gradually decreasing scales, so signals from finer lattice points are splatted to coarser lattice points iteratively, with coarser and coarser resolution and fewer and fewer input points. Similarly, UpBCL, with the two-step pipeline Conv-Slice, is used for upsampling with gradually increasing scales: signals from coarser lattice points are sliced to finer lattice points directly, thus saving the splatting step. See Fig. 2.
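The stacking idea can be sketched as follows (a hypothetical 1-D toy, with integer "lattice points" and plain summation standing in for splatting and convolution): the non-empty points of one scale become the inputs of the next, so the data size shrinks layer by layer.

```python
import numpy as np

def down_step(points, signals):
    """Gather signals from the current lattice points onto a 2x-coarser
    lattice; the resulting non-empty coarse points feed the next layer."""
    coarse = {}
    for p, x in zip(points, signals):
        c = p // 2                      # enclosing coarse lattice point
        coarse[c] = coarse.get(c, 0.0) + x
    keys = sorted(coarse)
    return np.array(keys), np.array([coarse[c] for c in keys])

pts, sig = np.array([0, 1, 2, 3, 8, 9]), np.ones(6)
for _ in range(2):                      # two DownBCLs, no slicing in between
    pts, sig = down_step(pts, sig)
# only the occupied coarse cells survive; empty regions cost nothing
```

The per-layer cost here depends on the number of occupied cells, not on the original point count, which mirrors the efficiency argument below.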
There are several advantages of DownBCL and UpBCL over the original BCL:
(1) We reduce the three-step pipeline to a two-step pipeline without introducing any new computation, which saves computational cost.
(2) Usually there are far fewer non-empty lattice points than points in the input cloud, especially on coarser lattices. So we reduce the input size for each DownBCL except the first one. Similarly, in UpBCL, slicing to the next layer's lattice points instead of to the input point cloud saves the computational cost of slicing. In this way, after the first DownBCL and before the last UpBCL, the data size that DownBCLs and UpBCLs must handle is independent of the size of the input point cloud; it is instead linear in the number of non-empty lattice points at each scale, i.e., it is related only to the actual volume the point cloud occupies. This is the key advantage of DownBCL and UpBCL that makes computation efficient.
(3) The saved time and memory allow deeper architectures. We use multiple convolution layers with nonlinear activations in between for the convolution step in each DownBCL and UpBCL, instead of the single convolution in the original BCL.
(4) Barycentric interpolation is a heuristic for gathering and returning signals, and the splatting and slicing steps are not symmetric. For an input point $j$, let $V(j)$ denote the vertices of its enclosing simplex; for a lattice point $l$, let $N(l)$ denote the set of input points that lie in a simplex with vertex $l$; let $b_{jl}$ denote the barycentric weight used when splatting $j$ to $l$, which is the same weight used when slicing $l$ back to $j$; and let $*$ denote convolution with kernel $W$. Then, in the original BCL, the filtered signal $x'_j$ for point $j$ can be expressed as:

$$x'_j = \sum_{l \in V(j)} b_{jl} \left( W * \left( \sum_{i \in N(l)} b_{il}\, x_i \right) \right)_l \qquad (1)$$
Even when $W * (\cdot)$ is an identity map, the input signals are changed after this "identity" BCL. Also, because of barycentric interpolation, the output signals inside each simplex are always smooth. This is fine in image filtering [permutohedral_lattice], where blurring is the expected effect, but it is not ideal for per-point regression, where points within one simplex may have drastically different ground truth. Hence, by removing the slicing step from DownBCL and the splatting step from UpBCL, we reduce the errors caused by these heuristic and asymmetric operations.
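The asymmetry is easy to verify numerically. In this toy 1-D sketch (ours, with linear interpolation standing in for barycentric weights), an "identity" splat-then-slice still blurs two nearby points toward each other:

```python
import numpy as np

def splat_slice_identity(positions, signals, size):
    """Splat onto the lattice and immediately slice back, with no
    convolution in between (i.e., Eq. 1 with W = identity)."""
    lattice = np.zeros(size)
    for p, x in zip(positions, signals):
        lo, w = int(np.floor(p)), p - np.floor(p)
        lattice[lo] += (1 - w) * x
        lattice[lo + 1] += w * x
    out = []
    for p in positions:
        lo, w = int(np.floor(p)), p - np.floor(p)
        out.append((1 - w) * lattice[lo] + w * lattice[lo + 1])
    return np.array(out)

x_in = np.array([0.0, 10.0])
x_out = splat_slice_identity(np.array([0.4, 0.6]), x_in, 2)
# the two points share a cell, so their signals are smoothed together
```

The distinct inputs 0.0 and 10.0 come back as nearly equal values, which is exactly the smoothing that hurts per-point regression.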
4.2 CorrBCL
Because of the interpolation design of BCLs, information from two consecutive point clouds can be splatted onto the same permutohedral lattice. In order to fuse information from both point clouds, we propose a novel bilateral convolutional correlation layer (CorrBCL), inspired by the matching cost computation and cost aggregation for stereo algorithms [patch]. Our CorrBCL consists of two steps, patch correlation and displacement filtering.
Patch correlation.
Similar to matching cost computation, patch correlation mixes information from a patch (local neighborhood) in PC1 and another patch in PC2, but in a more general and learnable manner.
Let $H_1$ and $H_2$ denote hash tables storing signals for the two point clouds indexed by lattice positions, let $K$ be the correlation neighborhood size, and let $O \in \mathbb{Z}^{K \times d}$ be the offset matrix such that neighbor $k$ of a lattice point at coordinate $v$ is located at $v + O_k$. Then the patch correlation for a lattice point of PC1 located at $v$ and a lattice point of PC2 located at $u$ is

$$\mathrm{corr}(v, u) = G\big( F(H_1(v + O_1), H_2(u + O_1)), \ldots, F(H_1(v + O_K), H_2(u + O_K)) \big) \qquad (2)$$

where $F$ is a bivariate function that combines signals from the two point clouds, and $G$ is a $K$-variate function that aggregates the combined information within each patch neighborhood.
In traditional vision algorithms, $F$ is usually element-wise multiplication and $G$ is the average function. Our $F$ is instead a convnet, and $G$ is the concatenation function. In this way, we can combine signals with different channel numbers for the two point clouds, which element-wise multiplication cannot do: for the next CorrBCL, we concatenate the current CorrBCL's output signals with PC1's signals as its first input, and use PC2's signals alone as its second input; see Fig. 4.
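A quick illustration of why concatenation is used (toy NumPy snippet, ours): concatenation accepts feature vectors of different lengths from the two point clouds, while element-wise multiplication does not.

```python
import numpy as np

# PC1-side features are longer, e.g. raw features plus a previous
# CorrBCL's output; PC2-side carries raw features only.
a = np.ones(4)
b = np.ones(2)

combined = np.concatenate([a, b])       # fine: length 6

try:
    _ = a * b                           # shapes (4,) and (2,) don't broadcast
    mult_failed = False
except ValueError:
    mult_failed = True
```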
Displacement filtering.
Brute-force aggregation of all possible patch correlation results is computationally prohibitive. Since we are considering point clouds from two consecutive time instances and the norm of the motion is limited, given a lattice point in PC1, we can move it within a local neighborhood, match it with the lattice points of PC2 at the moved positions, and then aggregate all such pairwise matching information in a sliding-window manner. This is similar to warping and residual flow in optical flow [warp_residual_flow_1, warp_residual_flow_2], but we warp to every position within the neighborhood. Let $M$ denote the displacement filtering neighborhood size and $O' \in \mathbb{Z}^{M \times d}$ the corresponding offset matrix. For a lattice point of PC1 located at $v$, displacement filtering is defined as:

$$\mathrm{DF}(v) = G'\big( \mathrm{corr}(v, v + O'_1), \ldots, \mathrm{corr}(v, v + O'_M) \big) \qquad (3)$$

where $\mathrm{corr}$ is the patch correlation in Eq. 2, and $G'$ is an $M$-variate aggregating convnet.
Note that the whole CorrBCL can be represented as one general multivariate function of all the signals it touches:

$$\mathrm{DF}(v) = \Phi\big( \{ H_1(v + O_k) \}_{k=1}^{K},\; \{ H_2(v + O'_m + O_k) \}_{k=1, m=1}^{K, M} \big) \qquad (4)$$

Our two-step factorization reduces the number of parameters needed to cover this $K \times M$ neighborhood structure, similar in spirit to [R2+1D, mobilenets], and each of our steps has a physical meaning. Fig. 3 shows an example of CorrBCL, where the correlation and displacement filtering share the same neighborhood size.
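The two-step structure can be sketched schematically as follows. This is our toy stand-in on a 1-D lattice, with plain dicts as hash tables, concatenation as $F$ and $G$, and a mean as $G'$, not the learned convnets:

```python
import numpy as np

def patch_corr(H1, H2, v, u, offsets):
    """Eq. 2 stand-in: combine each neighbor pair (F = concatenation of
    the pair), aggregate over the patch (G = concatenation)."""
    pairs = [np.concatenate([H1.get(v + o, np.zeros(2)),
                             H2.get(u + o, np.zeros(2))])
             for o in offsets]
    return np.concatenate(pairs)

def displacement_filter(H1, H2, v, offsets, disp_offsets):
    """Eq. 3 stand-in: slide the matching window over nearby positions
    in the second cloud, then aggregate (G' = mean)."""
    corrs = [patch_corr(H1, H2, v, v + d, offsets) for d in disp_offsets]
    return np.mean(corrs, axis=0)

H1 = {0: np.ones(2), 1: np.ones(2)}          # PC1 signals on the lattice
H2 = {0: 2 * np.ones(2), 1: 2 * np.ones(2)}  # PC2 signals on the lattice
out = displacement_filter(H1, H2, 0, offsets=[0, 1], disp_offsets=[-1, 0, 1])
```

Empty lattice positions fall back to zero vectors, mirroring how only non-empty lattice points carry signals.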
4.3 Density normalization
Since point clouds are usually sampled with non-uniform densities and are sparse, the lattice points can gather uneven signals. Thus, a normalization scheme is needed to make BCLs more robust. All previous work on BCL [BCN1, BCN2, splatnet] uses the normalization scheme of the non-learnable filtering algorithm [permutohedral_lattice]: the input signals are filtered a second time with their values replaced by 1s, using a Gaussian kernel, and the filtered values serve as the normalization weights. However, this scheme does not work well for our task (see ablation studies): unlike image filtering, our filtering weights are learned, so it is unsuitable to keep using Gaussian filtering for normalization.
We instead propose to add a density normalization term to the splatted signals:

$$x_l = \frac{\sum_{i \in N(l)} b_{il}\, x_i}{\sum_{i \in N(l)} b_{il}} \qquad (5)$$

where $x_l$ denotes the splatted signal for lattice point $l$, and the other notation is the same as in Eq. 1.
The advantages of this design are: 1) normalization is performed during splatting; compared with the original scheme, where the normalization signal goes through the whole three-step pipeline, the new scheme saves computational cost (it is worth noting that [pointnet2] also proposes schemes for non-uniform sampling density, but they greatly increase computational cost); 2) it applies directly to CorrBCL; and 3) experiments show that this scheme makes our approach generalize well under different point densities without fine-tuning.
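The effect of the normalization can be checked with a small toy example (ours, 1-D with linear weights): with the weighted-average form of Eq. 5, a lattice point's signal no longer grows with the number of points splatted onto it.

```python
import numpy as np

def splat_normalized(positions, signals, size):
    """Splat with the density normalization of Eq. 5: divide the
    gathered signal by the summed interpolation weights."""
    num = np.zeros(size)
    den = np.zeros(size)
    for p, x in zip(positions, signals):
        lo, w = int(np.floor(p)), p - np.floor(p)
        num[lo] += (1 - w) * x
        den[lo] += (1 - w)
        num[lo + 1] += w * x
        den[lo + 1] += w
    # empty lattice points stay at zero
    return np.divide(num, den, out=np.zeros(size), where=den > 0)

# ten co-located points with signal 1.0 vs. a single such point
dense = splat_normalized(np.full(10, 0.5), np.ones(10), 2)
sparse = splat_normalized(np.array([0.5]), np.ones(1), 2)
# both densities yield the same normalized lattice signal
```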
4.4 Network architecture
The network architecture for HPLFlowNet is shown in Fig. 4. We use an hourglass-like model due to its good performance in 2D image applications [FCN, unet]. It has a Siamese-like downsampling stage with information fusion, followed by an upsampling stage. In the downsampling stage, DownBCLs with gradually decreasing scales are stacked, so that lattice points in higher layers have larger receptive fields and information within a larger volume is gathered at each lattice point. Since PC2 is important for making scene flow predictions, it goes through all the same layers as PC1 with shared weights. Unlike previous work [flownet3d, flownet] that fuses signals from the two inputs only once, we use multiple CorrBCLs at different scales for better signal fusion. In the upsampling stage, we gradually refine the predictions by stacking UpBCLs of gradually increasing scale and finally slicing back to the points of PC1. For each UpBCL, we use skip links from the outputs of its corresponding DownBCL and CorrBCL; information from different stages can be merged at refinement time because layers with the same scaling factor share the same set of non-empty lattice points.
At each BCL, we concatenate each point's input signal with its relative position w.r.t. its enclosing simplex (its position vector minus the lattice coordinates of the "first" vertex of its enclosing simplex); Fig. 4 marks where these relative positions are injected. By providing the network with relative positions directly, it can achieve better translational invariance. The CNN we use is translation invariant up to quantization error, but unlike standard CNNs, we interpolate signals from the continuous domain onto the discrete domain, which loses some positional information. Incorporating the relative positions into the input signals compensates for this loss.
Since most layers of our model operate only on sparse lattice points, their computational cost is unrelated to the size of the point clouds and relates only to the actual volume that the point cloud occupies. To train HPLFlowNet, we use the End Point Error (EPE3D) loss $\frac{1}{n} \sum_{i} \lVert d_i - d_i^{gt} \rVert_2$, averaged over all points, where $d_i$ denotes the predicted scene flow vector for point $i$ and $d_i^{gt}$ the ground truth. EPE3D is the counterpart of the EPE metric used for 2D optical flow estimation.
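The EPE3D computation itself is a one-liner; here is a straightforward NumPy version (ours, the function name is illustrative):

```python
import numpy as np

def epe3d(pred, gt):
    """Mean L2 distance between predicted and ground-truth flow vectors,
    with pred and gt of shape (n_points, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

pred = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
gt   = np.array([[0.0, 0.0, 0.0], [0.0, 3.0, 4.0]])
err = epe3d(pred, gt)   # (1 + 5) / 2 = 3.0
```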
5 Experiments
We show results for the following experiments: 1) we train and evaluate our model on the synthetic FlyingThings3D dataset, and 2) also test it directly on the real-world KITTI Scene Flow dataset without fine-tuning; 3) we test the model on inputs with different point densities, 4) compare computational cost at both the architecture and single-layer level, and 5) conduct ablation studies to analyze the contribution of each component.
Evaluation metrics.
EPE3D (m): our main metric, the end point error averaged over all points. Acc3D Strict: a strict version of accuracy, the percentage of points whose EPE3D < 0.05m or relative error < 5%. Acc3D Relax: a relaxed version of accuracy, the percentage of points whose EPE3D < 0.1m or relative error < 10%. Outliers3D: the percentage of outliers, i.e., points whose EPE3D > 0.3m or relative error > 10%. By projecting the point clouds back onto the image plane, we obtain 2D optical flow; in this way, we measure how well our approach works for optical flow estimation. EPE2D (px): the 2D End Point Error, a common metric for optical flow. Acc2D: the percentage of points whose EPE2D < 3px or relative error < 5%.

5.1 Results on FlyingThings3D
Table 1. Evaluation results on FlyingThings3D and KITTI Scene Flow.

Dataset | Method | EPE3D | Acc3D Strict | Acc3D Relax | Outliers3D | EPE2D | Acc2D
FlyingThings3D | FlowNet3 [flownet3] | 0.4570 | 0.4179 | 0.6168 | 0.6050 | 5.1348 | 0.8125
FlyingThings3D | ICP [ICP] | 0.4062 | 0.1614 | 0.3038 | 0.8796 | 23.2280 | 0.2913
FlyingThings3D | FlowNet3D [flownet3d] | 0.1136 | 0.4125 | 0.7706 | 0.6016 | 5.9740 | 0.5692
FlyingThings3D | SPLATFlowNet [splatnet] | 0.1205 | 0.4197 | 0.7180 | 0.6187 | 6.9759 | 0.5512
FlyingThings3D | original BCL | 0.1111 | 0.4279 | 0.7551 | 0.6054 | 6.3027 | 0.5669
FlyingThings3D | Ours | 0.0804 | 0.6144 | 0.8555 | 0.4287 | 4.6723 | 0.6764
KITTI | FlowNet3 [flownet3] | 0.9111 | 0.2039 | 0.3587 | 0.7463 | 5.1023 | 0.7803
KITTI | ICP [ICP] | 0.5181 | 0.0669 | 0.1667 | 0.8712 | 27.6752 | 0.1056
KITTI | FlowNet3D [flownet3d] | 0.1767 | 0.3738 | 0.6677 | 0.5271 | 7.2141 | 0.5093
KITTI | SPLATFlowNet [splatnet] | 0.1988 | 0.2174 | 0.5391 | 0.6575 | 8.2306 | 0.4189
KITTI | original BCL | 0.1729 | 0.2516 | 0.6011 | 0.6215 | 7.3476 | 0.4411
KITTI | Ours | 0.1169 | 0.4783 | 0.7776 | 0.4103 | 4.8055 | 0.5938
FlyingThings3D [FlyingThings3D] is the first large-scale synthetic dataset that enables training deep neural networks for scene flow estimation. To our knowledge, it is the only scene flow dataset with more than 10,000 training samples. We reconstruct the 3D point clouds and ground truth scene flow using the provided camera parameters.
Training and evaluation details.
Following [FlyingThings3D, flownet2, flownet3], we use the dataset version in which some extremely hard samples are removed (see https://lmb.informatik.uni-freiburg.de/data/FlyingThings3D_subset/FlyingThings3D_subset_all_download_paths.txt). To simulate real-world point clouds, we remove points whose disparity and optical flow are occluded. Following [flownet3d], we train on points with depth less than 35 meters; most foreground moving objects are within this depth range. We randomly sample points from each frame in a non-corresponding manner: corresponding points for the first frame may not necessarily be found among the sampled points of the second frame. We use 8,192 sampled points per frame for training. To reduce training time, we use one quarter of the training set (4,910 pairs), which already yields good generalization ability; the model fine-tuned on the whole training set achieves 0.0696/0.1113 EPE3D on FlyingThings3D/KITTI. We evaluate on the whole test set (3,824 pairs).
Baselines.
We compare to the following methods:
Iterative Closest Point [ICP]: a common baseline for scene flow estimation, the algorithm iteratively revises the rigid transformation needed to minimize the error metric.
FlowNet3D [flownet3d]: the state of the art for scene flow estimation with point cloud inputs. Since code is unavailable, we use our own implementation.
SPLATFlowNet: a strong baseline based on SPLATNet [splatnet]; its architecture is the Siamese network of SPLATNet with CorrBCLs, at about the same depth as our model. It does not use the hourglass architecture, but concatenates all outputs from the BCLs and CorrBCLs of different scales to make the prediction.
Original BCL: We replace DownBCL and UpBCL with the original BCL used in previous work [BCN1, BCN2, splatnet] while keeping everything else the same as our model.
We also list results of FlowNet3 [flownet3] for reference, since its inputs are in a different modality; it is the state of the art with stereo inputs. We remove points with extremely wrong predictions (e.g., disparity with opposite signs), as such extremes would induce too much error.
Results.
Quantitative results are shown in Table 1. Our method outperforms all baselines on all metrics by a large margin, and is the only method with EPE3D below 0.1. FlowNet3 has the best Acc2D because its optical flow network is optimized on 2D metrics; but it has worse EPE2D, since we mainly evaluate on foreground objects, whose 2D motions can be large due to projection and are thus hard to predict. The fact that it is easily affected by extremes (worse EPE3D and EPE2D) also shows that using stereo inputs is more sensitive to prediction errors, due to the indirect 3D representation. The reason our method outperforms FlowNet3D is likely that we better restore structural information and design a better architecture for combining information from both point clouds. Our method and SPLATFlowNet have similar depth and use the same building blocks, so our performance gain can be credited to our hourglass-like model and the skip links that combine filtered signals from the downsampling and upsampling stages. The comparison with the original BCL shows that removing the redundant interpolation steps improves performance, and verifies that, given the heuristic and asymmetric nature of barycentric interpolation, it is better to avoid unnecessary splat and slice operations. Fig. 5 shows qualitative results. Our model performs well for complicated shapes, large motions, and also the hard case where multiple neighboring objects have different motions.
5.2 Generalization results on realworld data
Next, to study our model's generalization ability to unseen real-world data, we take our model trained on FlyingThings3D and, without any fine-tuning, evaluate it on KITTI Scene Flow 2015 [KITTI_2, KITTI_3].
Evaluation details.
KITTI Scene Flow 2015 is obtained by annotating dynamic scenes from the KITTI raw data collection using detailed 3D CAD models for all vehicles in motion. Since disparity is not given for the test set, we evaluate on all 142 scenes in the training set with publicly available raw 3D data, following [flownet3d]. Since the motion of the ground is not useful in autonomous driving and removing the ground is a common step [remove_ground_1, remove_ground_2, flownet3d], we remove the ground by a height threshold. We use similar preprocessing as in Sec. 5.1, except that we do not remove occluded points.
Results.
Our method again outperforms all other methods on all metrics by a large margin; see Table 1. This demonstrates our method's generalization ability to new real-world data. Without ground removal, the Ours/FlowNet3D EPE3D is 0.2366/0.3331, so ours is still better. Qualitative results are shown in Fig. 5. Even though our approach is trained on a dataset with very different patterns and objects, it makes precise estimations in driving scenes where the ego-motion is large and multiple dynamic objects have different motions. It also correctly predicts the trees and bushes, which are never seen by the network during training.
Table 2. Runtime comparison for different numbers of input points.

Method | 8,192 | 16,384 | 32,768
FlowNet3D [flownet3d] | 130.8 | 279.2 | 770.0
Ours | 98.4 | 115.5 | 142.8
Ours-shallow | 50.5 | 55.1 | 63.7
5.3 Empirical efficiency
Our architecture is optimized for performance. To show how efficient our proposed BCL variants can be, we build a shallower version, Ours-shallow, by removing Down/UpBCL6/7 and CorrBCL4/5 and cutting down convolutions (see supp. for details). Table 2 compares the efficiency of the different models. Ours is faster than FlowNet3D, and Ours-shallow is very fast while still outperforming all other methods (Table 3). Moreover, our runtime does not scale linearly with the number of input points, which empirically validates our architectural design.

We also compare layer efficiency against the original BCL: we measure the runtime of each BCL variant in our architecture, averaged over FlyingThings3D, then replace the variants with original BCLs and repeat the measurement. Averaged over all layers, our variants run in a fraction of the original BCL's time; we include a more detailed analysis in supp.
Table 3. EPE3D under different point densities.

Dataset | # points | Ours | No Norm | Ours-shallow | FlowNet3D
FlyingThings3D | 8,192 | 0.0804 | 0.0790 | 0.0957 | 0.1136
FlyingThings3D | 16,384 | 0.0782 | 0.0779 | 0.0932 | 0.1085
FlyingThings3D | 32,768 | 0.0774 | 0.0874 | 0.0925 | 0.1327
FlyingThings3D | 65,536 | 0.0772 | 0.1267 | 0.0925 | –
KITTI | 8,192 | 0.1169 | 0.1187 | 0.1630 | 0.1767
KITTI | 16,384 | 0.1114 | 0.1305 | 0.1646 | 0.2095
KITTI | 32,768 | 0.1087 | 0.1663 | 0.1671 | 0.3110
KITTI | 65,536 | 0.1087 | 0.1842 | 0.1674 | –
KITTI | all | 0.1087 | 0.1853 | 0.1674 | –
5.4 Generalization results on point density
We next evaluate how our model generalizes to different point densities. During training, we sample 8,192 points for each frame. Without any fine-tuning, we evaluate on 16,384, 32,768, and 65,536 sampled points. For KITTI, we also evaluate on all points.
Because of our architectural design, we have the advantage of being able to process largescale point clouds at one time, and thus do not need to divide the scene and feed the parts one by one into the network like [pointnet, pointnet2]. For all our experiments, we feed the two whole point clouds into the network in one pass. The maximum number of points for one frame in KITTI is around 86K.
Table 3 shows the performance under various point densities on both datasets, where we also compare with an identical architecture without our normalization scheme (No Norm). The results show that the normalization scheme incurs slight information loss: No Norm has the best performance at the training density, but our architecture with normalization is the most robust across densities; its EPE3D does not increase even when we evaluate at point densities totally different from the one used during training.
5.5 Ablation studies
Table 4. Ablation studies (EPE3D on FlyingThings3D).

NoSkips | OneCorr | OriNorm | EM | No RelPos | Full
0.3149 | 0.3698 | 0.6583 | 0.0948 | 0.0989 | 0.0804
To study the contribution of each component, we conduct a series of ablation studies, where each time we only change one component:

NoSkips: We remove all skip links.

OneCorr: To validate that using multiple CorrBCLs of different scales improves performance, we only keep the last CorrBCL.

OriNorm: We replace the normalization scheme for each BCL with the original normalization scheme used in previous work [BCN1, BCN2, splatnet].

Element-wise Multiplication (EM): We use element-wise multiplication in patch correlation. Since element-wise multiplication does not support input features of different lengths for the two point clouds, we remove the links from previous CorrBCLs to subsequent CorrBCLs.

No RelPos: We remove all the relative positions that are concatenated with the input signals.
We see from Table 4 that the original normalization scheme does not work well for scene flow estimation. Both the skip links and the multiple CorrBCLs contribute significantly. By using concatenation instead of element-wise multiplication, we are able to link previous CorrBCLs to subsequent CorrBCLs, which boosts performance. By using both global and local positional information, our model obtains improved performance.
6 Conclusion
We presented HPLFlowNet, a novel deep network for scene flow estimation on large-scale point clouds. We proposed the novel DownBCL, UpBCL, and CorrBCL layers and a density normalization scheme, which allow the bulk of our network to operate robustly on permutohedral lattices of different scales. This greatly reduces computational cost without sacrificing performance. Through extensive experiments, we demonstrated its advantages over various comparison methods.
Acknowledgments.
This work was supported in part by NSF IIS1748387, TuSimple and GPUs donated by NVIDIA.