I. Introduction
Scene flow [ref36] estimates the motion of each 3D point in the scene, which differs from LiDAR odometry [ref5], which estimates a single consistent pose transformation for the entire scene. 3D scene flow is therefore more flexible, and this flexibility makes it capable of assisting many other tasks, such as object tracking [ref1] and LiDAR odometry [ref5].
Generally, depth and optical flow together represent the scene flow in estimation methods based on 2D images [ref11, ref10] or RGB-D images [teed2021raft]. MonoSF [ref11] infers the scene flow of visible points using multi-view geometric motion-invariance constraints and the depth distribution of a single view. To obtain more reliable scene flow, Hur et al. [ref10] propagate temporal constraints over consecutive multi-frame images. RAFT-3D [teed2021raft] estimates the scene flow by softly grouping pixels into rigid objects. Methods that estimate scene flow through optical flow and pixel depth in image space split up the inherently 3D nature of the motion vectors. Because image space lacks explicit 3D geometric information, these methods often incur large pixel matching errors during scene flow inference. For example, two unrelated points that are far apart in 3D space may be very close to each other on the image plane. Dewan et al.
[ref21] predict 3D scene flow from adjacent frames of LiDAR data with a non-learning method. This method requires several strong assumptions, such as that local structures are not deformed by motion in the 3D scene. Some recent works [ref4, ref16, ref19, ref27] learn 3D scene flow from point cloud pairs with deep neural networks. However, these 3D scene flow estimation methods are invariably self-supervised or supervised on the synthesized dataset FlyingThings3D [ref24], and the generalization of the models is evaluated on the real-world dataset KITTI Scene Flow [sfkitti]. Models trained on synthesized data suffer accuracy degradation in the real world [ref16]. 3D scene flow annotations are very scarce in real-world datasets. Some works [ref27, ref16] propose excellent self-supervised losses, but these are difficult to apply successfully to LiDAR signals. LiDAR signals have two recognized weaknesses: sparsity and point cloud distortion. Firstly, the existing self-supervised losses [ref16, ref27]
imply a strong assumption of point-by-point correspondence. However, point clouds captured at different moments are inherently discrete, so point-by-point correspondence contradicts the discrete nature of the point cloud, and the sparsity of the LiDAR signal further exacerbates this contradiction. Secondly, the data collection process of a mechanical LiDAR is accompanied by motion, so the points within the same frame are not collected at the same moment, i.e., the point cloud is distorted. However, 3D scene flow estimation is a process of local 3D matching, which requires accurate temporal information. Unlike the LiDAR signal, the pseudo-LiDAR signal comes from back-projecting a dense depth map into a 3D point cloud. Because the image is captured almost instantaneously, pseudo-LiDAR introduces almost no distortion. A novel structure is designed in this paper to enable self-supervised learning of scene flow to benefit from pseudo-LiDAR. Although the spatial positions of the points generated from the depth map are not always accurate, our method is still able to find the correspondence between points in adjacent frames well. As shown in Fig.
1, the point clouds of adjacent frames represent the same objects in the scene similarly, because the behavior of the depth estimation network is consistent across frames. Thus the spatial features of a point at frame $t$ can easily find similar spatial features at frame $t+1$, and this match is usually accurate. On the other hand, the camera is cheaper than LiDAR. Although the cost of LiDAR is shrinking year by year, there is still an undeniable cost difference between LiDAR and cameras. Recently, Ouster released a comparatively inexpensive OS1-128 LiDAR, but it still costs $18,000 [durlar]. A stereo camera system may therefore provide a low-cost backup system for LiDAR-based scene flow estimation.
In summary, our key contributions are as follows:

A self-supervised method for learning 3D scene flow from stereo images is proposed. Pseudo-LiDAR based on stereo images is introduced into 3D scene flow estimation. The method in this paper bridges the gap between 2D data and the task of estimating 3D point motion.

The sparsity and distortion of LiDAR point clouds introduce errors into the calculation of the existing self-supervised losses for 3D scene flow. The introduction of the pseudo-LiDAR point cloud in this paper improves the effectiveness of these self-supervised losses thanks to its dense and distortion-free nature, which is demonstrated by the experiments in this paper.

3D points with large errors caused by depth estimation are filtered out as much as possible to reduce the impact of noise points on the scene flow estimation. A novel disparity consistency loss is proposed by exploiting the coupling relationship between 3D scene flow and stereo depth estimation.
II. Related Work
II-A Pseudo-LiDAR
In recent years, many works [ref29, ref30, ref31] have built pseudo-LiDAR based 3D object detection pipelines. Pseudo-LiDAR has shown significant advantages in the field of object detection. Wang et al. [ref29] introduce pseudo-LiDAR into an existing LiDAR object detection model and demonstrate that the main reason for the performance gap between stereo and LiDAR is the representation of the data rather than the quality of the depth estimate. Pseudo-LiDAR++ [ref30] optimizes the structure and loss function of the depth estimation network of Wang et al. [ref29] so that the pseudo-LiDAR framework can accurately estimate distant objects. Qian et al. [ref31] build on Pseudo-LiDAR++ [ref30] to address the problem that the depth estimation network and the object detection network must be trained separately. Previous pseudo-LiDAR frameworks focus on single-frame scene perception, while this paper focuses on the motion relationship between two frames.
II-B Scene Flow Estimation
Some works study the estimation of dense scene flow from images of consecutive frames. MonoSF [ref11] proposes ProbDepthNet to estimate the pixel depth distribution of a single image; geometric information from multiple views and depth distribution information from a single view are used to jointly estimate the scene flow. Hur et al. [ref10] introduce a multi-frame temporal constraint into the scene flow estimation network. Chen et al. [ref36] develop a coarse-grained software framework for scene flow methods and realize real-time cross-platform embedded scene flow algorithms. In addition, Rishav et al. [ref3] fuse LiDAR and images to estimate dense scene flow, but they still perform feature fusion in image space. These methods rely on 2D representations and cannot learn geometric motion from explicit 3D coordinates. Pseudo-LiDAR is the bridge between the 2D signal and the 3D signal, which provides the basis for directly learning 3D scene flow from 2D data.
The original geometric information is preserved in the 3D point cloud, which is the preferred representation for many scene understanding applications in self-driving and robotics. Some researchers [ref21] estimate 3D scene flow from LiDAR point clouds with classical methods. Dewan et al. [ref21] introduce local geometric constancy during the motion and use a triangular mesh to determine the relationships between points. Benefiting from point cloud deep learning, some recent works [ref4, ref16, ref19, ref27] propose to learn 3D scene flow from raw point clouds. FlowNet3D [ref4] first proposes the flow embedding layer, which finds point correspondences and implicitly represents the 3D scene flow. FLOT [ref19] studies lightweight structures based on optimal transport modules for optimizing scene flow estimation. PointPWC-Net [ref16] proposes novel learnable cost volume layers to learn 3D scene flow in a coarse-to-fine manner, and introduces three self-supervised losses to learn the 3D scene flow without accessing the ground truth. Mittal et al. [ref27] propose two new self-supervised losses.
III. Self-Supervised Learning of the 3D Scene Flow from Pseudo-LiDAR
III-A Overview
The main purpose of this paper is to recover 3D scene flow from stereo images. The left and right stereo images are denoted $I^{l}$ and $I^{r}$, respectively. As shown in Fig. 2, given a pair of stereo frames containing the reference frame at time $t$ and the target frame at time $t+1$, each image is represented by a matrix of dimension $H \times W \times 3$. The depth map $D_t$ at time $t$ is predicted by feeding the stereo pair into the depth estimation network. Each pixel value of $D_t$ represents the distance between a point in the scene and the left camera. The pseudo-LiDAR point cloud is obtained by back-projecting the generated depth map into a 3D point cloud, as follows:
\[ z = D_t(u, v), \qquad x = \frac{(u - c_u)\, z}{f_u}, \qquad y = \frac{(v - c_v)\, z}{f_v} \tag{1} \]
where $f_u$ and $f_v$ are the horizontal and vertical focal lengths of the camera, and $(c_u, c_v)$ is the coordinate center (principal point) of the image. The 3D point coordinate $(x, y, z)$ in the pseudo-LiDAR point cloud is calculated from the pixel coordinates $(u, v)$, the depth $D_t(u, v)$, and the camera intrinsics. $PC_1$ with $N_1$ points and $PC_2$ with $N_2$ points are generated from the depth maps $D_t$ and $D_{t+1}$, where each point is a 3D coordinate $(x, y, z)$. $PC_1$ and $PC_2$ are each randomly sampled to $N$ points. The sampled pseudo-LiDAR point clouds are passed into the scene flow estimator to extract the scene flow vector $SF$ for each 3D point in frame $t$:
\[ SF = \Phi\big(g(D_t),\, g(D_{t+1});\, \theta\big) \tag{2} \]
where $\Phi$ is the scene flow estimator, $\theta$ are the parameters of the network, and $g(\cdot)$ represents the back-projection of Eq. (1).
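As a concrete illustration of Eq. (1), the sketch below back-projects a dense depth map into a pseudo-LiDAR point cloud; the function name and the KITTI-like intrinsic values are illustrative only, not the released implementation.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fu, fv, cu, cv):
    """Back-project a dense depth map (H x W, metres) into an N x 3 point cloud (Eq. 1)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates (u along width, v along height)
    z = depth
    x = (u - cu) * z / fu                            # horizontal offset from the optical axis
    y = (v - cv) * z / fv                            # vertical offset from the optical axis
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep only points with valid (positive) depth

# Example with KITTI-like intrinsics (values chosen for illustration only).
depth_map = np.random.uniform(1.0, 80.0, size=(376, 1241)).astype(np.float32)
pc = depth_to_pseudo_lidar(depth_map, fu=721.5, fv=721.5, cu=609.6, cv=172.9)
```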
It is difficult to obtain ground truth scene flow for pseudo-LiDAR point clouds, so mining prior knowledge from the scene itself for self-supervised learning of 3D scene flow is essential. Ideally, $PC_2$ and the estimated point cloud $\widehat{PC}_2$ have the same structure. With this prior knowledge, point cloud $PC_1$ is warped to point cloud $\widehat{PC}_2$ through the predicted scene flow $SF$,
\[ \widehat{PC}_2 = PC_1 + SF \tag{3} \]
Based on the consistency of $\widehat{PC}_2$ and $PC_2$, the loss functions in Eq. (7) and Eq. (10) are utilized to implement self-supervised learning. We provide the pseudocode for our method in Algorithm 1, where the loss terms are described in detail in Section III-D.
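A minimal sketch of the self-supervised step corresponding to Eq. (2), Eq. (3), and Algorithm 1 is given below; the network and loss callables are placeholders standing in for the components defined later, not the released implementation.

```python
import torch

def self_supervised_step(scene_flow_net, pc1, pc2, chamfer_fn, smooth_fn,
                         w_chamfer=1.0, w_smooth=1.0):
    """One self-supervised step: predict flow (Eq. 2), warp PC1 (Eq. 3), score consistency."""
    flow = scene_flow_net(pc1, pc2)            # per-point 3D scene flow for frame t
    pc2_warped = pc1 + flow                    # Eq. (3): warped estimate of PC2
    loss = w_chamfer * chamfer_fn(pc2_warped, pc2) + w_smooth * smooth_fn(pc1, flow)
    return loss, flow

# Illustrative call with a dummy "network" and dummy losses.
pc1, pc2 = torch.rand(4096, 3), torch.rand(4096, 3)
loss, _ = self_supervised_step(lambda a, b: torch.zeros_like(a), pc1, pc2,
                               chamfer_fn=lambda a, b: (a - b).abs().mean(),
                               smooth_fn=lambda p, f: f.abs().mean())
```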
III-B Depth Estimation
The disparity is the horizontal offset between corresponding pixels in the stereo image pair, which represents the difference caused by viewing the same object from the two cameras of a stereo rig. A pixel in the left image and its correspondence in the right image are observations of the same 3D point in space. The two cameras are connected by a line called the baseline. The distance $z$ between the object and the observation point can be calculated from the disparity $d$, the baseline length $b$, and the horizontal focal length $f_u$:
\[ z = \frac{f_u \, b}{d} \tag{4} \]
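A one-line sketch of Eq. (4); the small clamp guarding against zero disparity is an implementation detail assumed here, not stated in the paper.

```python
import torch

def disparity_to_depth(disparity, focal_u, baseline, eps=1e-6):
    """Eq. (4): depth z = f_u * b / d, guarded against zero disparity (guard is an assumption)."""
    return focal_u * baseline / torch.clamp(disparity, min=eps)
```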
Disparity estimation networks such as PSMNet [ref38] extract deep feature maps from the left and right images, respectively. As shown in Fig. 2, the two feature maps are concatenated to construct a 4D tensor, namely the disparity cost volume. A 3D tensor is then calculated by feeding the cost volume into a 3D convolutional neural network (CNN), and the pixel-wise disparity is predicted by softmax weighting [ref38]. Because disparity and depth are inversely proportional to each other, the convolution operation on the disparity cost volume has a disadvantage: the same convolution kernel covers only a few pixels at small disparities (i.e., large depths), so many distant 3D points are easily skipped and ignored. It is more reasonable to run the convolution kernel on a depth grid, where it produces the same effect on neighboring depths, rather than over-emphasizing objects with large disparity (i.e., small depth) in the disparity cost volume. Based on this insight, the disparity cost volume is reconstructed as a depth cost volume [ref30]. Finally, the pixel depth is calculated through a weighting operation similar to the one mentioned above. The sparse LiDAR points are projected onto the 2D image as the ground truth depth map $D^{gt}$. The depth loss is constructed by minimizing the depth error:
\[ L_{depth} = \sum_{(u, v)} \ell_{s1}\big( \hat{D}(u, v) - D^{gt}(u, v) \big) \tag{5} \]
where $\hat{D}$ represents the predicted depth map and $\ell_{s1}$ represents the smooth L1 loss.
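A hedged sketch of the depth loss in Eq. (5): the smooth L1 error is computed only over pixels with a valid projected LiDAR return; treating zero as "no return" is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred_depth, gt_depth):
    """Eq. (5): smooth L1 between predicted depth and sparse LiDAR depth, on valid pixels only."""
    valid = gt_depth > 0                               # sparse GT: unprojected pixels assumed to be zero
    return F.smooth_l1_loss(pred_depth[valid], gt_depth[valid])
```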
III-C Refinement of Pseudo-LiDAR Point Clouds
The scene flow estimator cannot directly benefit from the raw generated pseudo-LiDAR point clouds because they contain many points with estimation errors. Reducing the impact of these noisy points on 3D scene flow estimation is the problem to be solved.
III-C1 LiDAR and Pseudo-LiDAR for 3D Scene Flow Estimation
LiDAR point clouds in the KITTI raw dataset [ref40] are captured by low-frequency (10 Hz) mechanical LiDAR scans. Each frame of the point cloud is collected by the rotating optical component of the LiDAR at low frequency. This process is accompanied by the motion of the LiDAR itself and the motion of other dynamic objects, so the raw LiDAR point cloud contains many distorted points. The point cloud distortion generated by self-motion can be largely eliminated by collaborating with other sensors (e.g., an inertial measurement unit). However, point cloud distortion caused by other dynamic objects is difficult to eliminate, which is one of the challenges of learning 3D scene flow from LiDAR point clouds. In comparison, the camera captures images with almost no motion distortion, which is an important advantage for recovering 3D scene flow from pseudo-LiDAR signals.
As shown in Fig. 3, the point cloud from the 64-beam LiDAR in the KITTI dataset [ref40] is sparsely distributed over 64 horizontal beams. In contrast, pseudo-LiDAR point clouds come from dense pixel and depth values, which are inherently dense. The image size in KITTI is 1241 × 376, which means that an image contains 466,616 pixels. The self-supervised learning of 3D scene flow mentioned in Section III-A assumes that the warped point cloud $\widehat{PC}_2$ and the point cloud $PC_2$ correspond point by point. However, the disadvantage of this assumption is magnified by the sparse and discrete nature of the LiDAR point cloud. For example, the Chamfer loss [ref16] forces the point clouds of these two frames to correspond point by point, which makes the self-supervised loss over-penalize the network so that it does not converge to the global optimum.
III-C2 Constraints on Pseudo-LiDAR Point Cloud Edges
A large number of incorrectly estimated points are distributed at the scene boundaries. For example, the generated pseudo-LiDAR point cloud has long tails on the far left and far right, and weakly textured areas such as the white sky also cause many depth estimation errors. Appropriate boundaries are therefore specified for the pseudo-LiDAR point cloud to remove as many erroneous edge points as possible without losing important structural information.
III-C3 Remove Outliers from the Pseudo-LiDAR Point Cloud
It is also very important to remove noisy points inside the pseudo-LiDAR point cloud. As shown in Fig. 2, a straightforward and effective method is to find outliers and eliminate them. In the lower-left corner of Fig. 2, a long tail is formed on the edge of the car, and these estimated points have deviated from the car itself. A statistical analysis is performed on the neighborhood of each point: the average distance from each point to its $k$ nearest neighbors is calculated. The resulting distances are assumed to follow a Gaussian distribution, whose shape is determined by the mean $\mu$ and the standard deviation $\sigma$. This statistical method is useful for finding discordant points in the whole point cloud. A point in the point cloud is considered an outlier when its distance to its nearest neighbors is larger than a distance threshold $d_{\tau}$:
\[ d_{\tau} = \mu + \alpha \, \sigma \tag{6} \]
where $d_{\tau}$ is determined by the scaling factor $\alpha$ and the number of nearest neighbors $k$.
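The statistical outlier removal described above can be sketched as follows; it mirrors the classic statistical outlier removal filter, with the number of neighbors $k$ and the scaling factor $\alpha$ playing the roles of the hyper-parameters in Eq. (6). The function name and the use of SciPy are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_statistical_outliers(points, k=8, alpha=2.0):
    """Drop points whose mean distance to their k nearest neighbors exceeds mu + alpha * sigma (Eq. 6)."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)        # first neighbor is the point itself
    mean_dists = dists[:, 1:].mean(axis=1)        # average distance to the k neighbors
    mu, sigma = mean_dists.mean(), mean_dists.std()
    threshold = mu + alpha * sigma                # distance threshold from the fitted Gaussian
    return points[mean_dists <= threshold]
```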
III-D 3D Scene Flow Estimator
The image convolution process is the repeated multiplication and summation of a convolution kernel over the image space. This operation is flawed for matching points in real-world space: points that are far apart in 3D space may be close together on a depth map, such as at the edge of a building or the edge of a car. Convolution on the image cannot account for this, which leads to incorrect feature representation or feature matching. Convolution on 3D point clouds avoids this flaw.
The generated pseudo-LiDAR point clouds $PC_1$ and $PC_2$ are encoded and downsampled by PointConv [ref16] to obtain point cloud features $F_1$ and $F_2$. The matching cost between a point $p_i \in PC_1$ and a point $q_j \in PC_2$ is then calculated by concatenating the features of $p_i$, the features of $q_j$, and the direction vector $q_j - p_i$ [ref16]. The non-linear relationship between the two points is learned by a multi-layer perceptron. According to the obtained matching costs, the point cloud cost volume used to estimate the movement between points is aggregated. The scene flow estimator constructs a coarse-to-fine network for scene flow estimation. The coarse scene flow is estimated by feeding the point cloud features and the cost volume into the scene flow predictor. The inputs of the predictor are the point cloud features of the first frame at the current level, the cost volume, the upsampled scene flow from the previous level, and the upsampled point cloud features from the previous level. The outputs are the scene flow and the point cloud features of the current level. The local features of the four inputs are merged by a PointConv [ref16] layer, and the resulting features are fed into a multi-layer perceptron to predict the scene flow of the current level. The final output is the 3D scene flow of each point in $PC_1$.
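A minimal sketch of the matching-cost computation described above, under the assumption that per-point features have already been extracted (e.g., by PointConv); the MLP width, the neighborhood size $k$, and the mean aggregation are illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

def matching_cost(xyz1, feat1, xyz2, feat2, mlp, k=16):
    """For each point in PC1, build learned matching costs against its k nearest points in PC2.

    xyz* are (N, 3) coordinates, feat* are (N, C) features, mlp maps (2C + 3) -> cost dimension.
    """
    dists = torch.cdist(xyz1, xyz2)                       # (N1, N2) pairwise distances
    knn_idx = dists.topk(k, largest=False).indices        # indices of the k nearest PC2 points
    q_xyz = xyz2[knn_idx]                                 # (N1, k, 3) neighbor coordinates
    q_feat = feat2[knn_idx]                               # (N1, k, C) neighbor features
    direction = q_xyz - xyz1[:, None, :]                  # direction vectors q_j - p_i
    pair = torch.cat([feat1[:, None, :].expand(-1, k, -1), q_feat, direction], dim=-1)
    costs = mlp(pair)                                     # learned non-linear matching cost
    return costs.mean(dim=1)                              # aggregate over neighbors -> cost volume

# Illustrative usage with random features.
mlp = nn.Sequential(nn.Linear(2 * 64 + 3, 64), nn.ReLU(), nn.Linear(64, 64))
cv = matching_cost(torch.rand(4096, 3), torch.rand(4096, 64),
                   torch.rand(4096, 3), torch.rand(4096, 64), mlp)
```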
The self-supervised loss is applied at each level to optimize the prediction of the scene flow. The proposed network has four unsupervised losses: the Chamfer loss $L_C$, the smoothness constraint $L_S$, the Laplacian regularization $L_L$, and the disparity consistency loss $L_D$. $PC_1$ is warped using the predicted scene flow to obtain the estimated point cloud $\widehat{PC}_2$ at time $t+1$. The Chamfer loss calculates the Chamfer distance between $\widehat{PC}_2$ and $PC_2$:
\[ L_C = \sum_{\hat{p} \in \widehat{PC}_2} \min_{q \in PC_2} \|\hat{p} - q\|_2^2 \;+\; \sum_{q \in PC_2} \min_{\hat{p} \in \widehat{PC}_2} \|q - \hat{p}\|_2^2 \tag{7} \]
where $\|\cdot\|_2$ represents the Euclidean distance. The design of the smoothness constraint $L_S$ is inspired by the prior knowledge that the scene flow is smooth within local regions of the real world,
\[ L_S = \sum_{p_i \in PC_1} \frac{1}{|N(p_i)|} \sum_{sf_j \in N(p_i)} \| sf_j - sf_i \|_2^2 \tag{8} \]
where $sf_i$ is the scene flow of point $p_i$, $N(p_i)$ means the set of scene flow vectors in the local space around $p_i$, and $|N(p_i)|$ represents the number of points in $N(p_i)$. Similar to $L_S$, the goal of the Laplacian regularization $L_L$ is to make the Laplace coordinate vectors at the same positions in $\widehat{PC}_2$ and $PC_2$ consistent. The Laplace coordinate vector of a point $\hat{p}_i$ in $\widehat{PC}_2$ is calculated as follows:
\[ \delta(\hat{p}_i) = \frac{1}{|N(\hat{p}_i)|} \sum_{\hat{p}_j \in N(\hat{p}_i)} (\hat{p}_j - \hat{p}_i) \tag{9} \]
where $N(\hat{p}_i)$ represents the set of points in the local space around $\hat{p}_i$ and $|N(\hat{p}_i)|$ is the number of points in $N(\hat{p}_i)$. $\delta'(\hat{p}_i)$ is the Laplace coordinate vector interpolated from $PC_2$ at the same position as $\hat{p}_i$ by using inverse distance weighting. $L_L$ is described as:
\[ L_L = \sum_{\hat{p}_i \in \widehat{PC}_2} \big\| \delta(\hat{p}_i) - \delta'(\hat{p}_i) \big\|_2^2 \tag{10} \]
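Minimal sketches of the Chamfer term (Eq. (7)) and the smoothness term (Eq. (8)) are given below, assuming point clouds small enough that dense pairwise distance matrices fit in memory; efficient implementations typically rely on dedicated nearest-neighbor kernels instead.

```python
import torch

def chamfer_loss(pc_warped, pc2):
    """Eq. (7): symmetric nearest-neighbor squared distance between the warped cloud and PC2."""
    d = torch.cdist(pc_warped, pc2) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def smoothness_loss(pc1, flow, k=9):
    """Eq. (8): each point's flow should agree with the flows of its local neighborhood."""
    idx = torch.cdist(pc1, pc1).topk(k, largest=False).indices   # local neighborhood (includes self)
    neighbor_flow = flow[idx]                                    # (N, k, 3)
    return (neighbor_flow - flow[:, None, :]).pow(2).sum(dim=-1).mean()
```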
Inspired by the coupling relationship between depth and pose in unsupervised depth-pose estimation tasks [bian2019unsupervised], we propose a disparity consistency loss $L_D$. Specifically, each point of the first frame is warped into the second frame by the estimated 3D scene flow, and the disparity or depth value of the warped point should be the same as that of the real second frame at the corresponding image location. The disparity consistency loss is described as:
\[ L_D = \mathrm{mean}\Big( \big| D_{t+1}\big\langle \pi(\widehat{PC}_2) \big\rangle - z(\widehat{PC}_2) \big| \Big) \tag{11} \]
where $D_{t+1}$ represents the depth map at frame $t+1$, $\pi(\cdot)$ means the projection of the point cloud onto the image using the camera intrinsics, $\langle \cdot \rangle$ means indexing by bilinear interpolation, $z(\cdot)$ takes the depth component of the warped points, and $\mathrm{mean}(\cdot)$ means averaging over the tensor.
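A hedged sketch of the disparity consistency term in Eq. (11): the warped points are projected into the second image with the camera intrinsics, the second-frame depth map is sampled there by bilinear interpolation, and the sampled depth is compared with each warped point's own depth. Function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def disparity_consistency_loss(pc_warped, depth2, fu, fv, cu, cv):
    """Eq. (11): depth of warped points vs. frame t+1 depth map sampled at their projections."""
    x, y, z = pc_warped[:, 0], pc_warped[:, 1], pc_warped[:, 2].clamp(min=1e-3)
    u = fu * x / z + cu                                   # project back to pixel coordinates
    v = fv * y / z + cv
    h, w = depth2.shape
    grid = torch.stack([2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1], dim=-1)  # normalize to [-1, 1]
    sampled = F.grid_sample(depth2[None, None], grid[None, None], align_corners=True)  # bilinear lookup
    return (sampled.view(-1) - z).abs().mean()
```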
The overall loss of the scene flow estimator is as follows:
\[ L = \sum_{l=1}^{4} \alpha_l \big( \beta_1 L_C^l + \beta_2 L_S^l + \beta_3 L_L^l + \beta_4 L_D^l \big) \tag{12} \]
The loss of the $l$-th level is a weighted sum of the four losses, and the total loss is a weighted sum of the losses of all levels, where $\alpha_l$ represents the weight of the loss in the $l$-th level and $\beta_1, \ldots, \beta_4$ weight the individual loss terms.
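The total objective in Eq. (12) can be sketched as a weighted sum over pyramid levels of a weighted sum of the four terms; the per-level weights shown follow the values reported in Section IV-A1, while the per-term weights are placeholders, since their values are not given here.

```python
def total_loss(level_terms, level_weights=(0.02, 0.04, 0.08, 0.16),
               term_weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (12): weight the four terms (Chamfer, smoothness, Laplacian, disparity consistency)
    at each pyramid level, then weight each level and sum over levels."""
    total = 0.0
    for alpha, terms in zip(level_weights, level_terms):
        total = total + alpha * sum(beta * t for beta, t in zip(term_weights, terms))
    return total
```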
Methods  Training Set  Sup.  Input  EPE3D(m)  Acc3DS  Acc3DR  Outliers3D  EPE2D(px)  Acc2D 
FlowNet3 [ref43]  FlyingC, FT3D  Full  Stereo  0.9111  0.2039  0.3587  0.7463  5.1023  0.7803 
FlowNet3D [ref4]  FT3D  Full  Points  0.1767  0.3738  0.6677  0.5271  7.2141  0.5093 
Pontes et al. [ref41]  FT3D  Self  Points  0.1690  0.2171  0.4775  —  —  — 
PointPWCNet [ref16]  FT3D  Self  Points  0.2549  0.2379  0.4957  0.6863  8.9439  0.3299 
PointPWCNet [ref16]  FT3D, odKITTI  Self  Points  0.1699  0.2593  0.5718  0.5584  7.2800  0.3971 
SelfPointFlow [ref20]  FT3D  Self  Points  0.1120  0.5276  0.7936  0.4086  —  — 
Mittal et al. (ft) [ref27]  FT3D, sfKITTI  Self  Points  0.1260  0.3200  0.7364  —  —  — 
SFGAN (ft) [wang2022sfgan]  FT3D, sfKITTI  Self  Points  0.1020  0.3205  0.6854  0.5532  —  — 
Ours PL (with Pretrain)  FT3D, odKITTI  Self  Stereo  0.1103  0.4568  0.7412  0.4211  4.9141  0.5532 
Ours PL (with Pretrain)  FT3D, odKITTI  Self  Mono  0.0955  0.5118  0.7970  0.3812  4.2671  0.6046 
Ours L (w/o Pretrain)  odKITTI (64beam)  Self  Points  0.6067  0.0202  0.0900  0.9286  25.0073  0.0756 
Ours L (w/o Pretrain)  DurLAR (128beam)  Self  Points  0.5078  0.0185  0.1026  0.9591  21.1068  0.1034 
Ours PL (w/o Pretrain)  odKITTI  Self  Stereo  0.2179  0.2721  0.4616  0.6572  8.0812  0.3361 
Ours L (with Pretrain)  FT3D, odKITTI (64beam)  Self  Points  0.1699  0.2593  0.5718  0.5584  7.2800  0.3971 
Ours L (with Pretrain)  FT3D, DurLAR (128beam)  Self  Points  0.1595  0.2494  0.6318  0.5578  7.1517  0.3986 
Ours PL (with Pretrain)  FT3D, odKITTI  Self  Stereo  0.1103  0.4568  0.7412  0.4211  4.9141  0.5532 
Ours PL (with Pretrain)  FT3D, odKITTI  Self  Mono  0.0955  0.5118  0.7970  0.3812  4.2671  0.6046 
Dataset  lidarKITTI [sfkitti]  Argoverse [argoverse]  nuScenes [nuscenes]  
Metrics  EPE3D  Acc3DS  Acc3DR  Outliers3D  EPE3D  Acc3DS  Acc3DR  Outliers3D  EPE3D  Acc3DS  Acc3DR  Outliers3D 
PointPWCNet [ref16]  1.1944  0.0384  0.1410  0.9336  0.4288  0.0462  0.2164  0.9199  0.7883  0.0287  0.1333  0.9410 
Mittal et al. (ft) [ref27]  0.9773  0.0096  0.0524  0.9936  0.6520  0.0319  0.1159  0.9621  0.8422  0.0289  0.1041  0.9615 
FLOT (Full) [ref19]  0.6532  0.1554  0.3130  0.8371  0.2491  0.0946  0.3126  0.8657  0.4858  0.0821  0.2669  0.8547 
DCASRSFE [jin2022deformation]  0.5900  0.1505  0.3331  0.8485  0.7957  0.0712  0.1468  0.9799  0.7042  0.0538  0.1183  0.9766 
Ours (Stereo)  0.5265  0.1732  0.3858  0.7638  0.2690  0.0768  0.2760  0.8440  0.4893  0.0554  0.2171  0.8649 
Ours (Mono)  0.4908  0.2052  0.4238  0.7286  0.2517  0.1236  0.3666  0.8114  0.4709  0.1034  0.3175  0.8191 

IV. Experiments
IV-A Experimental Details
IV-A1 Training Settings
The proposed algorithm is written in Python and PyTorch and runs on Linux. On a single NVIDIA TITAN RTX GPU, we train for 40 epochs. The initial learning rate is set to 0.0001 and the learning rate decreases by 50% every five training epochs. The batch size is set to 4. The generated pseudo-LiDAR is randomly sampled to 4096 points as the input of the scene flow estimator. With the same parameter settings as PointPWC-Net, there are four levels in the feature pyramid of the scene flow estimator. In Eq. (12), the weights from the first level to the fourth level are 0.02, 0.04, 0.08, and 0.16. The self-supervised loss weights are , , , and , respectively. Depth annotations from the synthesized dataset [ref24] are used to supervise the depth estimator, similar to Pseudo-LiDAR++ [ref30]. The pretrained depth estimator is fine-tuned using LiDAR points from KITTI [ref40] as sparse ground truth, as indicated in Eq. (5). During the scene flow estimator training stage, the depth estimator weights are fixed. The scene flow estimator is first pretrained on FT3D [ref24] with the self-supervised method. Stereo images from sequences 00–09 of the KITTI odometry dataset (odKITTI) [ref40] are selected to train our scene flow estimation model. To further improve the applicability of the method, we also explore a framework for monocular estimation of 3D scene flow, where the depth estimator uses the advanced monocular depth model AdaBins [bhat2021adabins].
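The optimization schedule described above (initial learning rate 1e-4, halved every five epochs, 40 epochs) could be set up as in the sketch below; the paper does not name the optimizer, so Adam is an assumption, and the model is a stand-in.

```python
import torch
import torch.nn as nn

scene_flow_net = nn.Linear(3, 3)  # stand-in for the scene flow estimator (illustration only)
optimizer = torch.optim.Adam(scene_flow_net.parameters(), lr=1e-4)   # Adam is an assumed choice
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve LR every 5 epochs

for epoch in range(40):
    # ... one pass over the training set with batch size 4 and 4096 sampled points per cloud ...
    scheduler.step()
```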
To further demonstrate the density advantage of the pseudo-LiDAR point cloud described in Section III-C1, the scene flow estimator is also trained on denser LiDAR point clouds from the high-fidelity 128-channel LiDAR dataset (DurLAR) [durlar]. To be fair, we perform the same processing as PointPWC-Net [ref16] on the LiDAR point clouds in DurLAR. The results are presented at the bottom of Table I.
IV-A2 Evaluation Settings
Following PointPWC-Net [ref16], we evaluate the model performance on the KITTI Scene Flow dataset (sfKITTI) [sfkitti], where sfKITTI is obtained from 142 pairs of annotated disparity maps and optical flow. lidarKITTI [sfkitti], with the same 142 pairs as sfKITTI, is generated by projecting the 64-beam LiDAR point clouds onto the images. All 142 scenes are used as test samples. Following Pontes et al. [ref41], we also evaluate the generalizability of the proposed method on two real-world datasets, Argoverse [argoverse] (containing 212 test samples) and nuScenes [nuscenes] (containing 310 test samples). Different from Pontes et al. [ref41], our method has not accessed any data from Argoverse [argoverse] or nuScenes [nuscenes] during training. All methods in the table evaluate scene flow performance directly on Argoverse and nuScenes. To be fair, we use the same evaluation metrics as PointPWC-Net [ref16].
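For reference, these metrics can be computed as in the sketch below; the thresholds shown are the ones commonly used in the point cloud scene flow literature (following FlowNet3D and PointPWC-Net) and should be checked against the original evaluation code.

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """EPE3D / Acc3DS / Acc3DR / Outliers3D as commonly defined for point cloud scene flow."""
    err = np.linalg.norm(pred - gt, axis=-1)          # per-point end-point error (metres)
    rel = err / (np.linalg.norm(gt, axis=-1) + 1e-8)  # relative error
    return {
        "EPE3D": err.mean(),
        "Acc3DS": np.mean((err < 0.05) | (rel < 0.05)),
        "Acc3DR": np.mean((err < 0.1) | (rel < 0.1)),
        "Outliers3D": np.mean((err > 0.3) | (rel > 0.1)),
    }
```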
IV-B Results
Table I and Table II give the quantitative results of our method evaluated on sfKITTI [sfkitti]. The accuracy of our method is substantially ahead of the supervised learning methods FlowNet3 [ref43], FlowNet3D [ref4], and FLOT [ref19]. Compared with the point-cloud-based self-supervised learning methods [ref16, ref41, ref27, wang2022sfgan, ref20, jin2022deformation], learning 3D scene flow on pseudo-LiDAR from real-world images demonstrates an impressive effect. Compared to PointPWC-Net [ref16], our method improves by over 45% on EPE3D, Acc3DS, and EPE2D, and by over 30% on Acc3DR, Outliers3D, and Acc2D, which powerfully demonstrates that the Chamfer loss and the Laplacian regularization loss are more effective on pseudo-LiDAR. Our method without fine-tuning still outperforms Mittal et al. (ft) [ref27] fine-tuned on sfKITTI.
At the bottom of Table I, the 3D scene flow estimator is trained from scratch on the synthesized dataset [ref24], on LiDAR point clouds [ref40, durlar], and on real-world stereo/monocular images [ref40], respectively. It is obvious that self-supervised learning works best on the pseudo-LiDAR point clouds estimated from images. This confirms that our method avoids, to a certain extent, the limitation of estimating scene flow on a synthesized dataset, and that it avoids the weaknesses of learning point motion from LiDAR data. Furthermore, in the last rows of Table I, the model trained on FT3D [ref24] serves as a prior guide for our method. As we have been emphasizing, pseudo-LiDAR point clouds stimulate the potential of the self-supervised losses to a greater extent.
The accuracy of learning 3D scene flow on 128-beam LiDAR signals [durlar] is improved compared to 64-beam LiDAR signals. Nevertheless, according to the quantitative results in Table I and Table II, the proposed framework for learning 3D scene flow from pseudo-LiDAR signals still presents greater advantages. To qualitatively demonstrate the effectiveness of our method, some visualizations are shown in Fig. 4. Compared to PointPWC-Net [ref16], most of the points estimated by our method are correct under the Acc3DR metric. From the details in Fig. 4, the point clouds estimated by our method overlap well with the ground truth point clouds, confirming the reliability of our method. Finally, the method in this paper also shows excellent perceptual performance in the real world, as shown in Fig. 5.
edges  outliers  disparity consistency  EPE3D(m)  Acc3DR  EPE2D(px)  
0.2655  0.3319  11.4530  
✓  0.1191  0.7181  5.3741  
✓  ✓  0.1156  0.7298  5.0802  
✓  ✓  ✓  0.1103  0.7412  4.9141 
$k$  $\alpha$  EPE3D(m)  Acc3DS  EPE2D(px)  
4  2  0.1155  0.4224  5.3325 
8  1  0.1188  0.4044  5.3622 
8  2  0.1103  0.4568  4.9141 
8  4  0.1111  0.4422  4.8438 
16  2  0.1261  0.4237  5.5634 
We test the runtime on a single NVIDIA TITAN RTX GPU. PointPWC-Net takes about 0.1150 seconds on average for a training step, while our method takes 0.6017 seconds. We therefore save the pseudo-LiDAR point clouds produced by the depth estimator and let the scene flow estimator learn from the saved point clouds. After saving the pseudo-LiDAR signals, the training time is greatly reduced and reaches the same time consumption (about 0.1150 seconds) as PointPWC-Net.
IV-C Ablation Studies
The edge points of the whole scene and the outliers within the scene are the two factors that we focus on. In Table III, the experiments demonstrate that choosing a suitable scene range and eliminating outlier points both facilitate the learning of 3D scene flow. The ablation study also verifies that the proposed disparity consistency loss is very effective for learning 3D scene flow. A point in the point cloud whose distance to its nearest neighbors exceeds the distance threshold $d_{\tau}$ is considered an outlier, where $d_{\tau}$ is calculated by Eq. (6). Whether a point is judged to be an outlier is determined by the number of selected neighbors $k$ and the standard deviation multiplier $\alpha$. The experiments in Table IV show that the elimination of outlier points works best when $k$ is 8 and $\alpha$ is 2.
V. Conclusion
The method in this paper achieves accurate perception of 3D dynamic scenes from 2D images. The pseudo-LiDAR point cloud is used as a bridge to compensate for the disadvantages of estimating 3D scene flow from LiDAR point clouds. The points in the pseudo-LiDAR point cloud that degrade the scene flow estimation are filtered out. In addition, a disparity consistency loss is proposed and achieves better self-supervised learning results. The evaluation results demonstrate the advanced performance of our method on real-world datasets.