Scene flow [ref36] estimates the motion of each 3D point in a scene, whereas LiDAR odometry [ref5] estimates a single consistent pose transformation for the entire scene. This per-point formulation makes 3D scene flow more flexible, so it can assist many different tasks, such as object tracking [ref1] and LiDAR odometry [ref5].
In scene flow estimation methods based on 2D images [ref11, ref10] or RGB-D images [teed2021raft], scene flow is generally represented by depth and optical flow together. Mono-SF [ref11] infers the scene flow of visible points using constraints on the motion invariance of multi-view geometry and on the depth distribution of a single view. To obtain more reliable scene flow, Hur et al. [ref10] propagate temporal constraints across consecutive frames. RAFT-3D [teed2021raft] estimates the scene flow by softly grouping pixels into rigid objects. However, scene flow is by nature a field of 3D motion vectors, and methods that estimate it from optical flow and pixel depth in image space split this representation apart. Because image space lacks explicit 3D geometric information, these methods often produce large pixel matching errors during scene flow inference. For example, two unrelated points that are far apart in 3D space may be very close to each other on the image plane. Dewan et al. [ref21] predict 3D scene flow from adjacent frames of LiDAR data with a learning-free method. This method requires several strong assumptions, for example that local structure is not deformed by motion in the 3D scene. Some recent works [ref4, ref16, ref19, ref27] learn 3D scene flow from point cloud pairs with deep neural networks. However, these methods are invariably trained, with or without supervision, on the synthetic dataset FlyingThings3D [ref24], and their generalization is then evaluated on the real-world KITTI Scene Flow dataset [sfkitti]. Models trained on synthetic data suffer from accuracy degradation in the real world [ref16].
Real-world datasets contain very few 3D scene flow annotations. Some works [ref27, ref16] propose effective self-supervised losses, but these are difficult to apply successfully to LiDAR signals. LiDAR signals have two recognized weaknesses: sparsity and point cloud distortion. First, the existing self-supervised losses [ref16, ref27] imply a strong assumption of point-by-point correspondence, yet point clouds from different moments are inherently discrete, so this assumption contradicts the nature of the data. The sparsity of the LiDAR signal further exacerbates the contradiction. Second, a mechanical LiDAR collects data while it is moving, so the points in one frame are not captured at the same moment, i.e., the point cloud is distorted. 3D scene flow estimation, however, is a process of local 3D matching that requires accurate temporal information. Unlike the LiDAR signal, the pseudo-LiDAR signal comes from back-projecting a dense depth map into a 3D point cloud. Because an image is captured almost instantaneously, pseudo-LiDAR exhibits almost no distortion. A novel structure is designed in this paper to enable self-supervised learning of scene flow to benefit from pseudo-LiDAR. Although the spatial positions of the points generated from the depth map are not always accurate, our method is still able to find good correspondences between points in adjacent frames. As shown in Fig. 1, the point clouds of adjacent frames represent the same objects in the scene similarly because the behavior of the depth estimation network is unchanged between frames. Thus the spatial features of a point in one frame can easily find similar spatial features in the adjacent frame, and this match is usually accurate.
In addition, cameras are cheaper than LiDAR. Although the cost of LiDAR is falling year by year, there is still an undeniable cost gap between the two sensors. For example, Ouster recently released the comparatively inexpensive OS1-128 LiDAR, but it still costs $18,000 [durlar]. A stereo camera system may therefore provide a low-cost backup for LiDAR-based scene flow estimation.
In summary, our key contributions are as follows:
A self-supervised method for learning 3D scene flow from stereo images is proposed. Pseudo-LiDAR based on stereo images is introduced into 3D scene flow estimation. The method in this paper bridges the gap between 2D data and the task of estimating 3D point motion.
The sparsity and distortion of LiDAR point clouds introduce errors into the computation of existing self-supervised losses for 3D scene flow. Because pseudo-LiDAR point clouds are dense and largely free of such distortion, introducing them improves the effectiveness of these self-supervised losses, as demonstrated by the experiments in this paper.
3D points with large errors caused by depth estimation are filtered out as much as possible to reduce the impact of noise points on the scene flow estimation. A novel disparity consistency loss is proposed by exploiting the coupling relationship between 3D scene flow and stereo depth estimation.
II Related Work
In recent years, many works [ref29, ref30, ref31] have built 3D object detection pipelines based on pseudo-LiDAR, which has shown significant advantages in object detection. Wang et al. [ref29] introduce pseudo-LiDAR into an existing LiDAR object detection model and demonstrate that the main reason for the performance gap between stereo and LiDAR is the representation of the data rather than the quality of the depth estimate. Building on Wang et al. [ref29], Pseudo-LiDAR++ [ref30] optimizes the structure and loss function of the depth estimation network so that the pseudo-LiDAR framework can accurately estimate the depth of distant objects. Qian et al. [ref31] build on Pseudo-LiDAR++ [ref30] to address the problem that the depth estimation network and the object detection network must be trained separately. Previous pseudo-LiDAR frameworks focus on single-frame scene perception, whereas this paper focuses on the motion relationship between two frames.
II-B Scene Flow Estimation
Some works study the estimation of dense scene flow from images of consecutive frames. Mono-SF [ref11] proposes ProbDepthNet to estimate the pixel depth distribution of the single image. Geometric information from multiple views and depth distribution information from a single view are used to jointly estimate the scene flow. Hur et al. [ref10] introduce a multi-frame temporal constraint to the scene flow estimation network. Chen et al. [ref36] develop a coarse-grained software framework for scene-flow methods and realize real-time cross-platform embedded scene flow algorithms. In addition, Rishav et al. [ref3] fuse LiDAR and images to estimate dense scene flow. But they still perform feature fusion in image space. These methods rely on 2D representations and cannot learn geometric motion from explicit 3D coordinates. The pseudo-LiDAR is the bridge between the 2D signal and the 3D signal, which provides the basis for directly learning the 3D scene flow from the 2D data.
The original geometric information is preserved in the 3D point cloud, which is the preferred representation for many scene understanding applications in self-driving and robotics. Some researchers [ref21] estimate 3D scene flow from LiDAR point clouds with classical methods. Dewan et al. [ref21] assume local geometric constancy during motion and introduce a triangular grid to determine the relationships between points. Benefiting from advances in point cloud deep learning, some recent works [ref4, ref16, ref19, ref27] propose to learn 3D scene flow from raw point clouds. FlowNet3D [ref4] first proposes the flow embedding layer, which finds point correspondences and implicitly represents the 3D scene flow. FLOT [ref19] studies a lightweight structure that estimates scene flow with optimal transport modules. PointPWC-Net [ref16] proposes novel learnable cost volume layers to learn 3D scene flow in a coarse-to-fine manner and introduces three self-supervised losses to learn the 3D scene flow without access to the ground truth. Mittal et al. [ref27] propose two new self-supervised losses.
III Self-Supervised Learning of the 3D Scene Flow from Pseudo-LiDAR
The main purpose of this paper is to recover 3D scene flow from stereo images. The stereo images are represented by and respectively. As shown in Fig. 2, the input is a pair of stereo images, which contains reference frames and target frames . Each image is represented by a matrix of dimension . Depth map at time is predicted by feeding the stereo image into the depth estimation network . Each pixel value of represents the distance between a certain point in the scene and the left camera. The pseudo-LiDAR point cloud is obtained by back-projecting the generated depth map into a 3D point cloud, as follows:
where and are the horizontal and vertical focal lengths of the camera, and and are the coordinate center of the image, respectively. The 3D point coordinate in the pseudo-LiDAR point cloud is calculated by pixel coordinates , and camera intrinsics. with points and with points are generated from the depth maps and , where and are the 3D coordinates of the points. and are randomly sampled to points, respectively. The sampled pseudo-LiDAR point clouds are passed into the scene flow estimator to extract the scene flow vector for each 3D point in frame .
where and are the parameters of the network. represents the back-projection by Eq. (1).
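The back-projection in Eq. (1) can be illustrated with a short sketch. It assumes the standard pinhole camera model, which is consistent with the focal lengths and image center described above; the function name `back_project`, the nested-list depth map, and the toy intrinsics are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def back_project(depth: List[List[float]], f_u: float, f_v: float,
                 c_u: float, c_v: float) -> List[Point]:
    """Back-project a depth map into 3D points in the camera frame.

    depth[v][u] is the depth (z) of pixel (u, v). Non-positive values are
    treated as invalid and skipped (an assumed convention).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            x = (u - c_u) * z / f_u  # horizontal offset scaled by depth
            y = (v - c_v) * z / f_v  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel and made-up intrinsics.
depth_map = [[10.0, 10.0],
             [0.0, 20.0]]
cloud = back_project(depth_map, f_u=2.0, f_v=2.0, c_u=0.5, c_v=0.5)
print(cloud)
```

In a full pipeline the resulting list would then be randomly subsampled to a fixed number of points before being passed to the scene flow estimator.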
It is difficult to obtain the ground truth scene flow of pseudo-LiDAR point clouds. Mining a priori knowledge from the scene itself is therefore essential for self-supervised learning of 3D scene flow. Ideally, and estimated point cloud have the same structure. With this prior knowledge, point cloud is warped to point cloud through the predicted scene flow ,
Based on the consistency of and , the loss functions Eq. (7) and Eq. (10) are utilized to implement self-supervised learning. We provide the pseudocode in Algorithm 1 for our method, where , , and are described in detail in Section III-D.
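The warping step and the consistency idea behind the self-supervised loss can be sketched as follows. The sketch assumes a Chamfer-style distance built from squared nearest-neighbor distances summed in both directions; the exact form of Eq. (7) is not reproduced here, and `warp` and `chamfer_distance` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def warp(points: Sequence[Point], flow: Sequence[Point]) -> List[Point]:
    """Move every point of the first frame by its predicted flow vector."""
    return [(p[0] + f[0], p[1] + f[1], p[2] + f[2])
            for p, f in zip(points, flow)]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Sum of squared nearest-neighbor distances in both directions."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(s, d) ** 2 for d in dst) for s in src)
    return one_way(a, b) + one_way(b, a)


# Toy example: the whole scene moves by one unit along z.
frame1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
good_flow = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
zero_flow = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

loss_good = chamfer_distance(warp(frame1, good_flow), frame2)
loss_zero = chamfer_distance(warp(frame1, zero_flow), frame2)
print(loss_good, loss_zero)
```

A correct flow drives the distance to zero, while a zero flow leaves a residual, which is the signal that self-supervision exploits.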
III-B Depth Estimation
The disparity is the horizontal offset of the corresponding pixel in the stereo image, which represents the difference caused by viewing the same object from a stereo camera. and represent the observation of the same 3D point in space. Two cameras are connected by a line called the baseline. The distance between the object and the observation point can be calculated by knowing the disparity , the baseline length , and the horizontal focal length .
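A minimal sketch of this relationship, assuming the standard rectified-stereo relation in which depth equals the horizontal focal length times the baseline divided by the disparity. The focal length and baseline below are illustrative values, not calibration data from the paper.

```python
import math


def disparity_to_depth(disparity: float, baseline: float, f_u: float) -> float:
    """Depth of a point from its disparity in a rectified stereo pair.

    disparity and f_u are in pixels, baseline is in meters,
    and the returned depth is in meters.
    """
    if disparity <= 0.0:
        raise ValueError("disparity must be positive")
    return f_u * baseline / disparity


# Illustrative values (assumed): 720 px focal length, 0.54 m baseline.
near = disparity_to_depth(72.0, baseline=0.54, f_u=720.0)
far = disparity_to_depth(36.0, baseline=0.54, f_u=720.0)
print(near, far)
```

Halving the disparity doubles the depth, which is the inverse proportionality that motivates the depth cost volume discussed in this section.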
Disparity estimation networks such as PSMNet [ref38] extract deep feature maps and from and , respectively. As shown in Fig. 2, the features of and are concatenated to construct a 4D tensor, namely the disparity cost volume. Then a 3D tensor is calculated by feeding into the 3D convolutional neural network (CNN). The predicted pixel disparity is calculated by softmax weighting [ref38]. Because disparity and depth are inversely proportional to each other, convolution on the disparity cost volume has a disadvantage. The same convolution kernel is applied to the few pixels with small disparity (i.e., large depth), so many 3D points are easily skipped and ignored. It is more reasonable to run the convolution kernel on a depth grid, where it has the same effect on neighboring depths, rather than overemphasizing objects with large disparity (i.e., small depth) as the disparity cost volume does. Based on this insight, the disparity cost volume is reconstructed as a depth cost volume [ref30]. Finally, the depth of each pixel is calculated through a weighting operation similar to the one described above.
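The softmax weighting can be sketched for a single pixel as below. This assumes the soft-argmin form, in which the prediction is the expectation of the candidate values under a softmax over negated matching costs; the function name `soft_argmin` and the toy costs are illustrative. The same function applies unchanged when the candidates are depths instead of disparities.

```python
import math
from typing import Sequence


def soft_argmin(costs: Sequence[float], candidates: Sequence[float]) -> float:
    """Expected candidate value under softmax(-cost) for one pixel."""
    lowest = min(costs)
    # Subtracting the lowest cost keeps the exponentials numerically stable.
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, candidates)) / total


candidates = [0.0, 1.0, 2.0]  # candidate disparities (or depths)
uniform = soft_argmin([1.0, 1.0, 1.0], candidates)
peaked = soft_argmin([100.0, 0.0, 100.0], candidates)
print(uniform, peaked)
```

Because the output is a weighted average rather than a hard argmin, it is differentiable and can take sub-pixel values.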
The sparse LiDAR points are projected onto the 2D image as the ground truth depth map . The depth loss is constructed by minimizing the depth error:
represents the predicted depth map. represents smooth L1 loss.
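A sketch of such a sparse depth loss is given below. It assumes that pixels without a LiDAR return are marked by a ground truth of zero, that the loss is averaged over valid pixels only, and that the smooth L1 transition point is 1.0; these conventions and the function names are assumptions, not details taken from the paper.

```python
from typing import List


def smooth_l1(x: float, beta: float = 1.0) -> float:
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def depth_loss(pred: List[List[float]], gt: List[List[float]]) -> float:
    """Mean smooth L1 error over pixels that have a ground-truth depth."""
    errors = [smooth_l1(p - g)
              for pred_row, gt_row in zip(pred, gt)
              for p, g in zip(pred_row, gt_row)
              if g > 0.0]  # skip pixels with no projected LiDAR point
    return sum(errors) / len(errors) if errors else 0.0


predicted = [[10.5, 3.0],
             [7.0, 22.0]]
sparse_gt = [[10.0, 0.0],
             [0.0, 20.0]]
loss = depth_loss(predicted, sparse_gt)
print(loss)
```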
III-C Refinement of Pseudo-LiDAR Point Clouds
The scene flow estimator cannot directly benefit from the raw generated pseudo-LiDAR point clouds because they contain many points with estimation errors. The problem to be solved is therefore to reduce the impact of these noisy points on 3D scene flow estimation.
III-C1 LiDAR and Pseudo-LiDAR for 3D Scene Flow Estimation
LiDAR point clouds in the KITTI raw dataset [ref40] are captured by low-frequency (10 Hz) mechanical LiDAR scans. Each frame of the point cloud is collected by the rotating optical component of the LiDAR. This process is accompanied by the motion of the LiDAR itself and the motion of other dynamic objects, so the raw LiDAR point cloud contains many distorted points. The distortion caused by the sensor's own motion can be largely eliminated with the help of other sensors (e.g., an inertial measurement unit). However, the distortion caused by other dynamic objects is difficult to eliminate, which is one of the challenges of learning 3D scene flow from LiDAR point clouds. In comparison, a camera captures images with almost no motion distortion, which is an important advantage for recovering 3D scene flow from pseudo-LiDAR signals.
As shown in Fig. 3, the point cloud from the 64-beam LiDAR in the KITTI dataset [ref40] is sparsely distributed along 64 horizontal beams. In contrast, pseudo-LiDAR point clouds come from dense pixels and depth values and are inherently dense. The image size in KITTI is 1241 × 376, which means that each image contains 466,616 pixels. The self-supervised learning of 3D scene flow mentioned in subsection III-A assumes that the warped point cloud and the point cloud correspond point by point. However, the disadvantage of this assumption is magnified by the sparse and discrete nature of the LiDAR point cloud. For example, the Chamfer loss [ref16] forces the point clouds of the two frames to correspond point by point, which over-penalizes the network so that it does not converge to the global optimum.
III-C2 Constraints on Pseudo-LiDAR Point Cloud Edges
A large number of incorrectly estimated points are distributed at the scene boundaries. For example, the generated pseudo-LiDAR point cloud has long tails on the far left and far right. Weakly textured areas such as white sky also produce many depth estimation errors. Appropriate boundaries are therefore specified for the pseudo-LiDAR point clouds to remove as many erroneous edge points as possible without losing important structural information.
III-C3 Remove Outliers from the Pseudo-LiDAR Point Cloud
It is also very important to remove the noise points inside the pseudo-LiDAR point cloud. As shown in Fig. 2, a straightforward and effective method is to find the outliers and eliminate them. In the lower-left corner of Fig. 2, a long tail is formed at the edge of the car, and these estimated points deviate from the car itself. Statistical analysis is performed on the neighborhood of each point: the average distance from the point to its nearest neighboring points is calculated. The resulting distances are assumed to follow a Gaussian distribution, whose shape is determined by its mean and standard deviation. This statistical method is useful for finding discordant points in the whole point cloud. A point in the point cloud is considered an outlier when its distance from its nearest point is larger than a distance threshold:
is determined by the scaling factor and the number of nearest neighbors.
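The statistical filtering described above can be sketched as follows. The sketch assumes that the threshold equals the global mean of the per-point average neighbor distances plus a scaling factor times their standard deviation, which is the common form of statistical outlier removal; whether Eq. (6) has exactly this form is an assumption. The brute-force neighbor search and the name `remove_outliers` are for illustration only.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def remove_outliers(points: Sequence[Point], k: int = 8,
                    alpha: float = 2.0) -> List[Point]:
    """Drop points whose mean distance to their k nearest neighbors is large.

    Requires at least two points. k is the number of neighbors and alpha is
    the standard deviation multiplier (both names are assumptions).
    """
    mean_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dists.append(sum(nearest) / len(nearest))
    mu = statistics.fmean(mean_dists)
    sigma = statistics.pstdev(mean_dists)
    threshold = mu + alpha * sigma
    return [p for p, d in zip(points, mean_dists) if d <= threshold]


# A compact 3x3x3 grid of points plus one far-away noise point.
grid = [(float(x), float(y), float(z))
        for x in range(3) for y in range(3) for z in range(3)]
noise = (50.0, 50.0, 50.0)
filtered = remove_outliers(grid + [noise], k=8, alpha=2.0)
print(len(filtered))
```

A larger multiplier keeps more points, while a smaller one filters more aggressively at the risk of removing valid structure.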
III-D 3D Scene Flow Estimator
The image convolution process repeatedly multiplies and sums a convolution kernel over the image space. This operation is flawed for matching points in real-world space. Points that are far apart in 3D space may be close together on a depth map, for example at the edge of a building or the edge of a car. Convolution on the image cannot take this into account, which leads to incorrect feature representation or feature matching. Convolution on 3D point clouds better avoids this flaw.
The generated pseudo-LiDAR point clouds and are encoded and downsampled by PointConv [ref16] to obtain the point cloud features and . Then the matching cost between point and point is calculated by concatenating the features of , the features of , and the direction vector [ref16], where and . The nonlinear relationship between and is learned by using a multilayer perceptron. According to the obtained matching costs, the point cloud cost volume used to estimate the movement between points is aggregated. The scene flow estimator constructs a coarse-to-fine network for scene flow estimation. The coarse scene flow is estimated by feeding point cloud features and into the scene flow predictor . The input of is the point cloud features of the first frame at the current level, , the scene flow upsampled from the previous level, and the point cloud features upsampled from the previous level. The output is the scene flow and point cloud features of the current level. The local features of the four inputs are merged by using a PointConv [ref16] layer, and the new dimensional features are output. The new dimensional features are used as input to the multilayer perceptron to predict the scene flow of the current level. The final output is the 3D scene flow of each point in .
The self-supervised loss is used at each level to optimize the prediction of the scene flow. The proposed network has four unsupervised losses: Chamfer loss , smoothness constraint , Laplacian regularization , and disparity consistency . is warped using the predicted scene flow to obtain the estimated point cloud at time . The loss function is designed to calculate the Chamfer distance between and . The formula is described as:
where represents the operation of Euclidean distance. The design of smoothness constraint is inspired by the a priori knowledge of smooth scene flow in real-world local space,
where means the set of all scene flow in the local space around . represents the number of points in . Similar to , the goal of Laplacian regularization is to make the Laplace coordinate vectors of the same position in and consistent. The Laplace coordinate vector of the point in is calculated as follows:
where represents the set of points in the local space around and is the number of points in .
is the interpolated Laplace coordinate vector from at the same position as , obtained by inverse distance weighting. is described as:
Inspired by the coupling relationship between depth and pose in unsupervised depth and pose estimation tasks [bian2019unsupervised], we propose a disparity consistency loss . Specifically, each point in the first frame is warped into the second frame by the estimated 3D scene flow, and the disparity or depth values of the warped points should be the same as those at the corresponding positions in the real second frame. The disparity consistency loss is specifically described as:
where represents the depth map at frame . means the projection of the point cloud onto the image using the camera internal parameters. means the index of bilinear interpolation. means averaging over the tensor.
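The idea behind the disparity consistency loss can be sketched as follows. This is an illustrative reading of the description above, not the paper's exact formula: it assumes a pinhole projection, an absolute difference between the depth of each warped point and the bilinearly interpolated depth map of the second frame, and a mean over the points that project inside the image. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project(p: Point, f_u: float, f_v: float,
            c_u: float, c_v: float) -> Tuple[float, float]:
    """Pinhole projection of a 3D point to continuous pixel coordinates."""
    x, y, z = p
    return f_u * x / z + c_u, f_v * y / z + c_v


def bilinear(depth: List[List[float]], u: float, v: float) -> Optional[float]:
    """Bilinearly interpolate depth at (u, v); None if outside the map."""
    height, width = len(depth), len(depth[0])
    u0, v0 = math.floor(u), math.floor(v)
    if u0 < 0 or v0 < 0 or u0 + 1 >= width or v0 + 1 >= height:
        return None
    du, dv = u - u0, v - v0
    top = (1 - du) * depth[v0][u0] + du * depth[v0][u0 + 1]
    bottom = (1 - du) * depth[v0 + 1][u0] + du * depth[v0 + 1][u0 + 1]
    return (1 - dv) * top + dv * bottom


def disparity_consistency(points: Sequence[Point], flow: Sequence[Point],
                          depth_next: List[List[float]], f_u: float,
                          f_v: float, c_u: float, c_v: float) -> float:
    """Mean absolute gap between warped depths and the next depth map."""
    errors = []
    for p, f in zip(points, flow):
        warped = (p[0] + f[0], p[1] + f[1], p[2] + f[2])
        if warped[2] <= 0.0:
            continue  # behind the camera, cannot be projected
        u, v = project(warped, f_u, f_v, c_u, c_v)
        sampled = bilinear(depth_next, u, v)
        if sampled is not None:
            errors.append(abs(sampled - warped[2]))
    return sum(errors) / len(errors) if errors else 0.0


depth_next = [[10.0, 10.0], [10.0, 10.0]]
intrinsics = dict(f_u=2.0, f_v=2.0, c_u=0.5, c_v=0.5)
pts = [(0.0, 0.0, 9.0)]
loss_good = disparity_consistency(pts, [(0.0, 0.0, 1.0)], depth_next, **intrinsics)
loss_zero = disparity_consistency(pts, [(0.0, 0.0, 0.0)], depth_next, **intrinsics)
print(loss_good, loss_zero)
```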
The overall loss of the scene flow estimator is as follows:
The loss of the -th level is a weighted sum of four losses. The total loss is a weighted sum of the losses at each level. represents the weight of the loss in the -th level.
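The two-stage weighting can be sketched as below. The per-level weights follow the values listed in the training settings (0.02, 0.04, 0.08, and 0.16); the per-term weights of 1.0 and the loss values are placeholders chosen for illustration, since the actual term weights are not recoverable here, and the dictionary keys are hypothetical names.

```python
import math
from typing import Dict, Sequence


def total_loss(level_losses: Sequence[Dict[str, float]],
               level_weights: Sequence[float],
               term_weights: Dict[str, float]) -> float:
    """Weighted sum over pyramid levels of the weighted sum of loss terms."""
    total = 0.0
    for losses, alpha in zip(level_losses, level_weights):
        level_sum = sum(term_weights[name] * value
                        for name, value in losses.items())
        total += alpha * level_sum
    return total


# Placeholder term weights and loss values (assumptions for illustration).
term_weights = {"chamfer": 1.0, "smooth": 1.0, "laplacian": 1.0, "disparity": 1.0}
level_weights = [0.02, 0.04, 0.08, 0.16]
level_losses = [{"chamfer": 1.0, "smooth": 1.0, "laplacian": 1.0, "disparity": 1.0}
                for _ in level_weights]
loss = total_loss(level_losses, level_weights, term_weights)
print(loss)
```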
|Method||Training data||Supervision||Input||EPE3D||Acc3DS||Acc3DR||Outliers||EPE2D||Acc2D|
|FlowNet3 [ref43]||FlyingC, FT3D||Full||Stereo||0.9111||0.2039||0.3587||0.7463||5.1023||0.7803|
Pontes et al. [ref41]
|PointPWC-Net [ref16]||FT3D, odKITTI||Self||Points||0.1699||0.2593||0.5718||0.5584||7.2800||0.3971|
|Mittal et al. (ft) [ref27]||FT3D, sfKITTI||Self||Points||0.1260||0.3200||0.7364||—||—||—|
|SFGAN (ft) [wang2022sfgan]||FT3D, sfKITTI||Self||Points||0.1020||0.3205||0.6854||0.5532||—||—|
|Ours PL (with Pre-train)||FT3D, odKITTI||Self||Stereo||0.1103||0.4568||0.7412||0.4211||4.9141||0.5532|
|Ours PL (with Pre-train)||FT3D, odKITTI||Self||Mono||0.0955||0.5118||0.7970||0.3812||4.2671||0.6046|
Ours L (w/o Pre-train)
|Ours L (w/o Pre-train)||DurLAR (128-beam)||Self||Points||0.5078||0.0185||0.1026||0.9591||21.1068||0.1034|
|Ours PL (w/o Pre-train)||odKITTI||Self||Stereo||0.2179||0.2721||0.4616||0.6572||8.0812||0.3361|
|Ours L (with Pre-train)||FT3D, odKITTI (64-beam)||Self||Points||0.1699||0.2593||0.5718||0.5584||7.2800||0.3971|
|Ours L (with Pre-train)||FT3D, DurLAR (128-beam)||Self||Points||0.1595||0.2494||0.6318||0.5578||7.1517||0.3986|
|Ours PL (with Pre-train)||FT3D, odKITTI||Self||Stereo||0.1103||0.4568||0.7412||0.4211||4.9141||0.5532|
|Ours PL (with Pre-train)||FT3D, odKITTI||Self||Mono||0.0955||0.5118||0.7970||0.3812||4.2671||0.6046|
|Dataset||lidarKITTI [sfkitti]||Argoverse [argoverse]||nuScenes [nuscenes]|
|Mittal et al. (ft) [ref27]||0.9773||0.0096||0.0524||0.9936||0.6520||0.0319||0.1159||0.9621||0.8422||0.0289||0.1041||0.9615|
|FLOT (Full) [ref19]||0.6532||0.1554||0.3130||0.8371||0.2491||0.0946||0.3126||0.8657||0.4858||0.0821||0.2669||0.8547|
IV-A Experimental Details
IV-A1 Training Settings
The proposed algorithm is written in Python and PyTorch and runs on Linux. On a single NVIDIA TITAN RTX GPU, we train for 40 epochs. The initial learning rate is set to 0.0001 and the learning rate decreases by 50% every five training epochs. The batch size is set to 4. The generated pseudo-LiDAR is randomly sampled to 4096 points as input of the scene flow estimator. With the same parameter settings as PointPWC-Net, there are four levels of the feature pyramid in the scene flow estimator in this paper. In Eq. (12), the first level weight to the fourth level weight are 0.02, 0.04, 0.08, and 0.16. The self-supervised loss weights are , , , and , respectively.
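The learning rate schedule described above can be written out explicitly. The sketch assumes zero-based epoch indexing, so the rate is first halved at epoch 5; the function name is illustrative.

```python
import math


def learning_rate(epoch: int, initial: float = 1e-4,
                  step: int = 5, decay: float = 0.5) -> float:
    """Step decay: multiply the rate by `decay` every `step` epochs."""
    return initial * decay ** (epoch // step)


schedule = [learning_rate(e) for e in range(40)]
print(schedule[0], schedule[5], schedule[39])
```

In PyTorch, the same behavior is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)`.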
Depth annotations in the synthetic dataset [ref24] are used to supervise the depth estimator, similar to Pseudo-LiDAR++ [ref30]. The pre-trained depth estimator is then fine-tuned using LiDAR points from KITTI [ref40] as sparse ground truth, as indicated in Eq. (5). During the training of the scene flow estimator, the depth estimator weights are fixed. The scene flow estimator is first pre-trained on FT3D [ref24] in a self-supervised manner. Stereo images from sequences 00-09 of the KITTI odometry dataset (odKITTI) [ref40] are then used to train our scene flow estimation model. To further improve the applicability of the method, we also explore a framework for monocular estimation of 3D scene flow, in which the depth estimator is the advanced monocular depth model AdaBins [bhat2021adabins].
To further demonstrate the density advantage of the pseudo-LiDAR point cloud discussed in section III-C1, the scene flow estimator is also trained on denser LiDAR point clouds from the high-fidelity 128-channel LiDAR dataset (DurLAR) [durlar]. For a fair comparison, we process the LiDAR point clouds in DurLAR in the same way as PointPWC-Net [ref16]. The results are presented at the bottom of Table I.
IV-A2 Evaluation Settings
Following PointPWC-Net [ref16], we evaluate the model performance on the KITTI Scene Flow dataset (sfKITTI) [sfkitti], whose scene flow ground truth is obtained from annotated disparity maps and optical flow for 142 pairs. The lidarKITTI dataset [sfkitti], with the same 142 pairs as sfKITTI, is generated by projecting the 64-beam LiDAR point clouds onto the images. All 142 scenes are used as test samples. Following Pontes et al. [ref41], we also evaluate the generalizability of the proposed method on two real-world datasets, Argoverse [argoverse] (containing 212 test samples) and nuScenes [nuscenes] (containing 310 test samples). Unlike Pontes et al. [ref41], our method does not access any data from Argoverse [argoverse] or nuScenes [nuscenes] during training. All methods in the table are evaluated directly on Argoverse and nuScenes. For a fair comparison, we use the same evaluation metrics as PointPWC-Net [ref16].
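For reference, the 3D metrics can be sketched as below. The thresholds are the ones commonly used in PointPWC-Net-style evaluation (0.05 m or 5% for Acc3DS, 0.1 m or 10% for Acc3DR, and 0.3 m or 10% for Outliers); they are stated here as an assumption, since this section does not list them, and the 2D metrics are omitted.

```python
import math
from typing import Dict, Sequence, Tuple

Vec = Tuple[float, float, float]


def scene_flow_metrics(pred: Sequence[Vec], gt: Sequence[Vec]) -> Dict[str, float]:
    """EPE3D, Acc3DS, Acc3DR and Outliers for predicted 3D flow vectors."""
    n = len(gt)
    origin = (0.0, 0.0, 0.0)
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    # Relative error; the small constant avoids division by zero.
    relative = [e / (math.dist(g, origin) + 1e-4) for e, g in zip(errors, gt)]
    pairs = list(zip(errors, relative))
    return {
        "EPE3D": sum(errors) / n,
        "Acc3DS": sum(e < 0.05 or r < 0.05 for e, r in pairs) / n,
        "Acc3DR": sum(e < 0.1 or r < 0.1 for e, r in pairs) / n,
        "Outliers": sum(e > 0.3 or r > 0.1 for e, r in pairs) / n,
    }


gt_flow = [(1.0, 0.0, 0.0)] * 4
pred_flow = [(1.0, 0.0, 0.0), (1.04, 0.0, 0.0), (1.08, 0.0, 0.0), (1.5, 0.0, 0.0)]
metrics = scene_flow_metrics(pred_flow, gt_flow)
print(metrics)
```

Lower is better for EPE3D and Outliers, while higher is better for the two accuracy ratios.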
Tables I and II give the quantitative results of our method evaluated on sfKITTI [sfkitti]. The accuracy of our method is substantially ahead of the supervised learning methods FlowNet3 [ref43], FlowNet3D [ref4], and FLOT [ref19]. Compared with self-supervised learning methods [ref16, ref41, ref27, wang2022sfgan, ref20, jin2022deformation] based on point clouds, learning 3D scene flow on pseudo-LiDAR from real-world images demonstrates an impressive effect. Compared to PointPWC-Net [ref16], our method improves by over 45% on EPE3D, Acc3DS and EPE2D, and by over 30% on Acc3DR, Outliers and Acc2D, which demonstrates that the Chamfer loss and the Laplacian regularization loss can be more effective on pseudo-LiDAR. Without fine-tuning, our method still outperforms Mittal et al. (ft) [ref27], which is fine-tuned on sfKITTI.
From the bottom of Table I, the 3D scene flow estimator is trained from scratch on a synthesized dataset [ref24], LiDAR point clouds [ref40, durlar], and real-world stereo/monocular images [ref40], respectively. It is obvious that self-supervised learning works best on pseudo-LiDAR point clouds estimated from images. This confirms that our method avoids the limitation of estimating the scene flow on the synthesized data set to a certain extent. It also confirms that our method avoids the weakness of learning point motion from LiDAR data. Furthermore, in the last row of Table I, the model trained on FT3D [ref24] serves as a prior guide for our method. As we have been emphasizing, pseudo-LiDAR point clouds will stimulate the potential of self-supervised loss to a greater extent.
The accuracy of learning 3D scene flow on the 128-beam LiDAR signals [durlar] is improved compared to the 64-beam LiDAR signals. Nevertheless, according to the quantitative results in Table I and Table II, the proposed framework for learning 3D scene flow from pseudo-LiDAR signals still shows greater advantages. To qualitatively demonstrate the effectiveness of our method, some visualizations are shown in Fig. 4. Compared to PointPWC-Net [ref16], most of the points estimated by our method are correct under the Acc3DR metric. From the details in Fig. 4, the point clouds estimated by our method overlap well with the ground truth point clouds, confirming the reliability of our method. Finally, the method also shows excellent perceptual performance in the real world, as shown in Fig. 5.
We test the runtime on a single NVIDIA TITAN RTX GPU. PointPWC-Net takes about 0.1150 seconds on average to perform a training step while our method takes 0.6017 seconds. We consider saving the pseudo-LiDAR point cloud from the depth estimation and enabling the scene flow estimator to learn the scene flow from the saved point cloud. After saving the pseudo-LiDAR signals, the training time is greatly reduced and achieves the same time consumption (about 0.1150 seconds) as PointPWC-Net.
IV-C Ablation Studies
The edge points of the whole scene and the outliers within the scene are the two factors that we focus on. In Table III, the experiments demonstrate that choosing a suitable scene range and eliminating outlier points both facilitate the learning of 3D scene flow. In the ablation study, the proposed disparity consistency loss is also verified to be very effective for learning 3D scene flow. A point in the point cloud whose distance from its nearest point exceeds a distance threshold is considered an outlier, where is calculated by Eq. (6). The probability that a point in the point cloud is considered an outlier is determined by the number of selected points and the standard deviation multiplier threshold . The experiments in Table IV show the best results for the elimination of outlier points when is 8 and is 2.
The method in this paper achieves accurate perception of 3D dynamic scenes from 2D images. The pseudo-LiDAR point cloud is used as a bridge to compensate for the disadvantages of estimating 3D scene flow from LiDAR point clouds. The points in the pseudo-LiDAR point cloud that adversely affect scene flow estimation are filtered out. In addition, a disparity consistency loss is proposed and achieves better self-supervised learning results. The evaluation results demonstrate the advanced performance of our method on real-world datasets.