1 Introduction

In order to safely navigate through traffic, self-driving vehicles need to robustly detect surrounding objects (i.e. vehicles, pedestrians, cyclists). State-of-the-art approaches leverage deep neural networks operating on LiDAR point clouds [shi2020points, shi2018pointrcnn, shi2020pvrcnn]. However, training such 3D object detection models usually requires a huge amount of manually annotated high-quality data [sun2020waymo, caesar2020nuscenes, geiger2012ad]. Unfortunately, labelling 3D LiDAR point clouds is very time-consuming and consequently expensive – a major drawback for real-world applications. Most datasets are recorded in specific geographic locations (e.g. Germany [geiger2012ad, geiger2013vision]), with a fixed sensor configuration and under good weather conditions. Applied to new data collected in other locations (e.g. the USA [sun2020waymo]), with a different sensor (e.g. sparser resolution [caesar2020nuscenes]) or under adverse weather conditions (i.e. fog, rain, snow), 3D detectors suffer from distribution shifts (the domain gap). This typically causes serious performance drops [wang2020train], which in turn lead to unreliable recognition systems.
This can be mitigated by either manual or semi-supervised annotation [walsh2020temporallabeling, meng2020ws3d] of representative data each time the sensor setup or area of operation changes. However, this is infeasible for most real-world scenarios given the expensive labelling effort. A more general solution to avoid the annotation overhead is unsupervised domain adaptation (UDA), which adapts a model pre-trained on a label-rich source domain to a label-scarce target domain. Hence, no or only a small number of labelled frames from the target domain are required.
For 3D object detection, UDA via self-training has gained a lot of attention [yang2021st3d, saltori2020SFUDA3DSU, you2021exploiting]. Similar to 2D approaches [RoyChowdhury2019autoadaptod, cai2019meanteacher, khodabandeh2019robustdaod], the idea is to use a 3D detector pre-trained on a labelled source dataset and apply it on the target dataset to obtain pseudo-labels. These labels are leveraged to re-train the model. Both steps, label generation and re-training, are repeated until convergence. However, generating reliable pseudo-labels is a non-trivial task.
Although most real-world data is continuous in nature, this property is rarely exploited for UDA of 3D object detectors. As a notable exception, [you2021exploiting] leverages a probabilistic offline tracker. Though simple and effective, a major weakness of probabilistic trackers is that they heavily depend on the detection quality. Because of the domain gap, however, the detection quality usually degrades significantly. Additionally, these trackers require a hand-crafted motion model which must be adjusted manually. These limitations result in unreliable pseudo-labels due to missed and falsely classified objects.
To overcome these issues, we present our Flow-Aware Self-Training approach for 3D object detection (FAST3D), leveraging scene flow [liu2019flownet3d, liu2019meteornet, wu2020pointpwc] for robust pseudo-labels, as illustrated in Fig. 1. Trained on synthetic data [mayer2016flyingthings] only, scene flow estimators already achieve a favourable accuracy on real-world road data [wu2020pointpwc]. Thus, we investigate scene flow for UDA. In particular, we will show that scene flow allows us to propagate pseudo-labels reliably, recover missed objects and discard unreliable detections. This results in significantly improved pseudo-label quality, which drastically boosts the performance of self-trained 3D detectors.
We conduct experiments on the challenging Waymo Open Dataset (WOD) [sun2020waymo] considering two state-of-the-art 3D detectors, PointRCNN [shi2018pointrcnn] and PV-RCNN [shi2020pvrcnn], pre-trained on the much smaller KITTI dataset [geiger2012ad, geiger2013vision]. Without any prior target domain knowledge (as e.g. [wang2020train]), and without the need for source domain data (as e.g. [yang2021st3d]), we surpass the state-of-the-art by a significant margin.
2 Related Work
3D Object Detection
3D object detectors localize and classify an unknown number of objects within a 3D environment. Commonly, the identified objects are represented as tightly fitting oriented bounding boxes. Most recent 3D detectors, trained and evaluated on autonomous driving datasets, operate on LiDAR point clouds.
One way to categorize them is by their input representation. Voxel-based approaches [lang2019pointpillars, yan2018second, li20173dfully, wang2020pillarbased, bin2018pixor, zhou2018voxelnet, Ye2020HVnet, deng2020voxelrcnn] rasterize the input space and assign points of irregular and sparse nature to grid cells of a fixed size. Afterwards, they either project these voxels directly to the bird’s eye view (BEV) or first learn feature representations by leveraging 3D convolutions and project them to BEV afterwards. Finally, a conventional 2D detection head predicts bounding boxes and class labels. Another line of work comprises point-based detectors [ngiam2019starnet, shi2020pointgnn, yang20203dssd, shi2018pointrcnn]. In order to generate proposals directly from points, they leverage PointNet [qi2017pointnet, qi2017pointnetpp] to extract point-wise features. Hybrid approaches [chen2019fastpointrcnn, shi2020points, shi2020pvrcnn, he2020sassd, yang2019std, zhou2020mvf], on the other hand, seek to leverage the advantages of both of the aforementioned strategies. In contrast to point cloud-only approaches, multi-modal detectors [chen2017mv3d, ku2018avod, liang2019mtms, liang2018deepcontfuse, xu2018pointfusion, qi2017frustum] utilize 2D images complementary to LiDAR point clouds. The additional image information can be beneficial to recognise small objects.
Since most state-of-the-art approaches operate on LiDAR point clouds only, we also focus on this type of detectors. In particular, we demonstrate our self-training approach using PointRCNN [shi2018pointrcnn] and PV-RCNN [shi2020pvrcnn]. Both detectors have already been used in UDA settings for self-driving vehicles [wang2020train, yang2021st3d] and achieve state-of-the-art robustness and accuracy.
Scene Flow Estimation
Scene flow represents the 3D motion field of a scene [vedula1999sceneflow]. Recently, a relatively new area of research aims to estimate point-wise motion predictions in an end-to-end manner directly from raw point clouds [liu2019flownet3d, gu2019hplflownet, wu2020pointpwc]. With few exceptions [liu2019meteornet], most approaches process two consecutive point clouds as input. A huge benefit of data-driven scene flow models, especially regarding UDA, is their ability to learn in an unsupervised manner [mittal2020justgowith, wu2020pointpwc, li2021selfpointflow]. The biggest drawback of these networks is their huge memory consumption, which limits the number of input points. To address this, [jund20201scaleablesceneflow] proposes a light-weight model applicable to the dense Waymo Open Dataset [sun2020waymo] point clouds.
In our work, we use the 3D motion field to obtain robust and reliable pseudo-labels for UDA by leveraging the motion consistency of sequential detections. To this end, we utilize PointPWC [wu2020pointpwc] which achieves state-of-the-art scene flow estimation performance.
Unsupervised Domain Adaptation (UDA)
The common pipeline for UDA is to start with a model trained in a fully supervised manner on a label-rich source domain and adapt it on data from a label-scarce target domain in an unsupervised manner. Hence, the goal is to close the gap between both domains. There is already a large body of literature on UDA for 2D object detection in driving scenes [chen2018domain, he2019MultiAdversarialFF, hsu2020progressivedet, kim2019diversify, rodriguez2019domainAF, saito2019strongweak, zhuang2020iFANIF, wang2019universaldet, xu2020ExploringCR]. Due to the growing number of publicly available large-scale autonomous driving datasets, UDA on 3D point cloud data has gained more interest recently [qin2019PointDANAM, jaritz2019xmuda, Yi2020CompleteL, achituve2021self].
For LiDAR-based 3D object detection, [wang2020train] demonstrate serious domain gaps between various datasets, mostly caused by different sensor setups (e.g. resolution or mounting position) or geographic locations (e.g. Germany → USA). A popular solution to UDA is self-training, as in the 2D case [cai2019meanteacher, khodabandeh2019robustdaod, RoyChowdhury2019autoadaptod]. For example, [yang2021st3d] initially train the detector with random object scaling and an additional score prediction branch. Afterwards, pseudo-labels are updated in a cyclic manner considering previous examples. In order to overcome the need for source data, [saltori2020SFUDA3DSU] performs test-time augmentation with multiple scales. The best matching labels are selected by checking motion coherency.
In contrast to [you2021exploiting], where temporal information is exploited by probabilistic tracking, we propose to leverage scene flow. This enables us to reliably extract pseudo-labels despite initially low detection quality by exploiting motion consistency. As in [saltori2020SFUDA3DSU], we also utilize test-time augmentation to overcome scaling issues, but we only need two additional scales.
3 Flow-Aware Self-Training
We now introduce our Flow-Aware Self-Training approach FAST3D, which consists of four steps as illustrated in Fig. 1. First, starting with a model trained on the source domain, we obtain initial 3D object detections for sequences of the target domain (Sec. 3.1). Second, we leverage scene flow to propagate these detections throughout the sequences and obtain tracks, robust to the initial detection quality (Sec. 3.2). Third, we recover potentially missed tracks and correct false positives in a refinement step (Sec. 3.3). Finally, we extract pseudo-labels for self-training to improve the initial model (Sec. 3.4).
Given a 3D object detection model pre-trained on labelled source domain frames, each consisting of a point cloud and its corresponding labels, our task is to adapt it to unseen target data comprising unlabelled sequences of varying length. To obtain the target detection model, we apply self-training, which assumes that both domains contain the same set of classes (i.e. vehicle/car, pedestrian, cyclist) but that these are drawn from different distributions. In the following, we show how to self-train a detection model without access to source data or target label statistics (as e.g. [wang2020train]) and without modifying the source model in any way (as e.g. [yang2021st3d]) to achieve performance beyond the state-of-the-art. Because we only work with target domain data, we omit the domain superscripts to improve readability.
3.1 Pseudo-Label Generation
The first step, as in all self-training approaches (e.g. [yang2021st3d, you2021exploiting, saltori2020SFUDA3DSU]), is to run the initial source model on the target data and keep high-confident detections. More formally, each detection is represented by its bounding box and confidence score, where a box is defined by its centre position, dimensions and heading angle. Note that we can use any vanilla 3D object detector, since we do not change the initial model at all, not even by adapting the mean object size of anchors (as in [yang2021st3d]). In order to deal with different object sizes between various datasets without pre-training on scaled source data [yang2021st3d, wang2020train], we leverage test-time augmentation instead. In particular, we feed the detector with three scales (i.e. 0.8, 1.0, 1.2) of the same point cloud, similar to [saltori2020SFUDA3DSU]. However, [saltori2020SFUDA3DSU] requires running the detector on 125 different scale levels (due to different scaling factors along each axis). By leveraging scene flow, we only need three scales to robustly handle both larger and smaller objects. We combine the detection outputs at all three scales via non-maximum suppression (NMS) and threshold the result at a fixed confidence to obtain the initial high-confident detections.
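The multi-scale generation step can be sketched as follows. This is a minimal, hypothetical Python illustration: `detect`, the axis-aligned BEV box format `(x, y, w, l)` and the concrete confidence threshold are simplifying assumptions (a real pipeline uses rotated 3D boxes and a tuned threshold).

```python
def iou_bev(a, b):
    """Axis-aligned BEV IoU between boxes (x, y, w, l); a simplification
    of the rotated-box IoU an actual detector pipeline would use."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs, highest score first."""
    kept = []
    for box, score in sorted(dets, key=lambda d: -d[1]):
        if all(iou_bev(box, k) < iou_thr for k, _ in kept):
            kept.append((box, score))
    return kept

def generate_initial_labels(detect, points, scales=(0.8, 1.0, 1.2), conf_thr=0.7):
    """Run the (frozen) source detector on three globally scaled copies of
    the point cloud, map boxes back to the original scale, fuse via NMS and
    keep confident detections. conf_thr is a placeholder value."""
    merged = []
    for s in scales:
        scaled = [(x * s, y * s, z * s) for x, y, z in points]
        for (x, y, w, l), score in detect(scaled):
            merged.append(((x / s, y / s, w / s, l / s), score))
    return [(b, c) for b, c in nms(merged) if c >= conf_thr]
```

A scale-invariant detector would fire on all three copies, so NMS is needed to collapse the (near-)duplicate boxes before thresholding.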
3.2 Flow-Aware Pseudo-Label Propagation
Keeping only high-confident detections of the source model gets rid of false positives (FP), but also results in a lot of false negatives (FN). While this can partially be addressed via multi-target tracking (MTT), e.g. [you2021exploiting], standard MTT approaches (such as [weng2020AB3DMOT]) are not suitable for low-quality detections because of their hand-crafted motion prediction. Due to the domain gap, however, the detection quality of the source model will inevitably be low.
To overcome this issue, we introduce our scene flow-aware tracker. We define a set of tracks, where each track contains a list of bounding boxes and is represented by its current state, i.e. its most recent bounding box, and a track confidence score. Following the tracking-by-detection paradigm, we use the high-confident detections and match them in subsequent frames. Instead of hand-crafted motion models, however, we utilize scene flow to propagate detections. More formally, given two consecutive point clouds, PointPWC [wu2020pointpwc] estimates a motion vector for each point. We then average all motion vectors within a track’s bounding box to compute its mean flow. To obtain the predicted state for the current frame, we shift each track’s position by its mean flow. We then assign detections to tracks via the Hungarian algorithm [Kuhn55thehungarian], where we use the intersection over union (IoU) between the detections and the predicted tracks as matching criterion.
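A minimal sketch of this propagation-and-matching step, under simplifying assumptions: boxes are axis-aligned BEV footprints `(cx, cy, w, l)`, per-point flow vectors are given, and a greedy assignment stands in for the Hungarian algorithm used in the paper.

```python
def points_in_box(points, box):
    """Indices of points whose BEV position lies in the (axis-aligned) box."""
    cx, cy, w, l = box
    return [i for i, (x, y, z) in enumerate(points)
            if abs(x - cx) <= w / 2 and abs(y - cy) <= l / 2]

def propagate(track_box, points, flow):
    """Shift the box centre by the mean scene flow of its contained points."""
    idx = points_in_box(points, track_box)
    if not idx:
        return track_box
    fx = sum(flow[i][0] for i in idx) / len(idx)
    fy = sum(flow[i][1] for i in idx) / len(idx)
    cx, cy, w, l = track_box
    return (cx + fx, cy + fy, w, l)

def iou_bev(a, b):
    """Axis-aligned BEV IoU (simplification of rotated-box IoU)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match(pred_tracks, detections, iou_thr=0.3):
    """Greedy IoU matching (stand-in for the Hungarian algorithm);
    returns (track_index, detection_index) pairs."""
    cand = sorted(((iou_bev(t, d), ti, di)
                   for ti, t in enumerate(pred_tracks)
                   for di, d in enumerate(detections)), reverse=True)
    pairs, used_t, used_d = [], set(), set()
    for iou, ti, di in cand:
        if iou < iou_thr or ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        pairs.append((ti, di))
    return pairs
```

Averaging the flow over all points in the box is what makes the prediction robust: a single noisy flow vector barely shifts the propagated centre.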
We initialize a new track for each unassigned detection. This naive initialisation ensures that we include all object tracks, whereas potentially wrong tracks can easily be filtered in our refinement step (Sec. 3.3). Initially, we set a track’s confidence to the detection confidence. For each assigned pair, we update the track state with a weighted average based on the confidence scores and update the track confidence accordingly. Note that we update the track’s heading angle only if the orientation change stays below a fixed threshold, otherwise we keep its previous orientation. For boxes with only very few point correspondences, the flow estimates may be noisy. To suppress such flow outliers, we additionally allow only a maximum change in velocity, along with the maximum orientation change. If the estimated flow exceeds these limits, we keep the previous estimate.
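One plausible instantiation of the confidence-weighted update described above. The paper's exact formulas are not reproduced here, so the blending rule, the confidence update (`max`) and the angle threshold value are assumptions; the heading-angle guard mirrors the described behaviour.

```python
import math

def weighted_update(track_box, track_conf, det_box, det_conf,
                    max_dtheta=math.pi / 4):
    """Blend track and detection states, weighted by their confidences.
    Box = (cx, cy, cz, w, l, h, heading). The heading is only taken from
    the detection when the orientation change stays below max_dtheta
    (an assumed threshold), otherwise the previous heading is kept."""
    wt, wd = track_conf, det_conf
    s = wt + wd
    fused = [(wt * t + wd * d) / s for t, d in zip(track_box[:6], det_box[:6])]
    dtheta = abs(det_box[6] - track_box[6])
    dtheta = min(dtheta, 2 * math.pi - dtheta)  # wrap difference to [0, pi]
    heading = det_box[6] if dtheta < max_dtheta else track_box[6]
    new_conf = max(track_conf, det_conf)        # assumed confidence update
    return tuple(fused) + (heading,), new_conf
```

With equal confidences this reduces to a plain average of the two states, while a highly confident track is only nudged by a low-confidence detection.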
For track termination, we distinguish between moving and static objects. For moving objects, we terminate their tracks if there are no more points within their bounding box. Static object tracks are kept alive as long as they are within the field-of-view. This ensures that even long-term occlusions can be handled, as such objects will eventually contain points again or be removed during refinement.
3.3 Pseudo-Label Refinement
Due to our simplified track management (initialisation and termination), we now have tracks covering most of the visible objects, but also several false positive tracks. These, however, can easily be corrected in the following refinement step.
Track Filtering and Correction
We first compute the hit ratio for each track, i.e. the number of assigned detections divided by the track’s length. Tracks are removed if their hit ratio falls below a fixed threshold. Additionally, we discard tracks shorter than 0.5 s (i.e. 5 frames at the typical LiDAR frequency of 10 Hz) and tracks whose detections never exceed 15 points (to suppress spurious detections).
In our experiments, we observed that detections are most accurate on objects with more points (as opposed to the confidence score which is usually less reliable due to the domain gap). Thus, we sort a track’s assigned detections by the number of contained points and compute the average dimension over the top three. We then apply this average dimension to all boxes of the track. Additionally, for static cars we also apply the average position and heading angle to all boxes of the track since these properties cannot change for rigid static objects.
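The filtering and size-correction heuristics above can be sketched as follows. The minimum track length (5 frames) and point count (15) follow the text, while the hit-ratio threshold and the per-track data layout are placeholders.

```python
def filter_tracks(tracks, min_hit_ratio=0.5, min_len=5, min_points=15):
    """Keep only reliable tracks. Each track is a dict with key 'dets',
    a list of (box_dims, n_points, assigned) per frame; this layout and
    the hit-ratio threshold are illustrative assumptions."""
    kept = []
    for t in tracks:
        hits = sum(1 for _, _, assigned in t["dets"] if assigned)
        hit_ratio = hits / len(t["dets"])
        max_pts = max(n for _, n, _ in t["dets"])
        if (hit_ratio >= min_hit_ratio and len(t["dets"]) >= min_len
                and max_pts > min_points):
            kept.append(t)
    return kept

def corrected_dims(track):
    """Average (w, l, h) over the three detections with the most points,
    since detections on densely sampled objects tend to be most accurate."""
    top3 = sorted(track["dets"], key=lambda d: -d[1])[:3]
    return tuple(sum(d[0][i] for d in top3) / len(top3) for i in range(3))
```

The corrected dimensions would then be written back to every box of the track, as described above.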
This conservative filtering of unreliable tracks ensures a lower number of false positives. However, true positive tracks with only very few detections might be removed prematurely. In order to recover these, we leverage flow consistency. As illustrated in Fig. 2(a), we consider removed tracks with at least two non-overlapping detections (i.e. moving objects). For these tracks, we propagate the first detection by solely utilizing the scene flow estimates. If this flow-extrapolated box overlaps with the other detections (requiring a minimum IoU), we recover the track.
A common drawback of the tracking-by-detection paradigm is late initialization caused by missing detections at the beginning of the track. To overcome this issue, we look backwards for each track, as illustrated in Fig. 2(b), to extend missing tails. Hence, we leverage scene flow in the opposite direction and propagate the bounding box back in time. We propagate backwards as long as the flow is consistent, meaning no unreasonable changes in velocity or direction, and the predicted box location contains points. To avoid including the ground plane in the latter condition, we count only points within the upper part of the bounding box volume.
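The backward extension can be sketched as follows, under simplifying assumptions: boxes are axis-aligned BEV footprints `(cx, cy, w, l)`, frames are given backwards in time with per-point flow, and the consistency checks (velocity/direction limits, upper-volume point counting) are omitted for brevity.

```python
def extend_backwards(first_box, frames, max_steps=10):
    """Extend a track's missing tail by moving its first box against the
    estimated scene flow. `frames` is a list of (points, flow) pairs
    ordered back in time; extension stops once the box contains no points."""
    recovered = []
    box = first_box
    for points, flow in frames[:max_steps]:
        cx, cy, w, l = box
        idx = [i for i, (x, y, _) in enumerate(points)
               if abs(x - cx) <= w / 2 and abs(y - cy) <= l / 2]
        if not idx:                        # empty box: stop extending
            break
        fx = sum(flow[i][0] for i in idx) / len(idx)
        fy = sum(flow[i][1] for i in idx) / len(idx)
        box = (cx - fx, cy - fy, w, l)     # step against the flow direction
        recovered.append(box)
    return recovered
```

Subtracting (rather than adding) the mean flow is what moves the box back to where the object was in the previous frame.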
We use the individual bounding boxes of the refined tracks as high-quality pseudo-labels to re-train the detection model. To this end, we use standard augmentation strategies, i.e. randomized world scaling, flipping and rotation, as well as ground truth sampling [shi2020pvrcnn, yan2018second]. As we do not need to modify the initial model, we also use its detection loss without modifications for re-training.
4 Experiments

In the following, we present the results of our flow-aware self-training approach using two vanilla 3D object detectors. Additional evaluations demonstrating the improved pseudo-label quality are included in the supplemental material.
4.1 Datasets and Evaluation Details
We conduct experiments on the challenging Waymo Open Dataset (WOD) [sun2020waymo] with a source model pre-trained on the KITTI dataset [geiger2012ad, geiger2013vision]. We sample 200 sequences (25%) from the WOD training set for pseudo-label generation and 20 sequences (10%) from the WOD validation set for testing. With this evaluation, we cover various sources of domain shift: geographic location (Germany → USA), sensor setup (Velodyne → Waymo LiDAR) and weather (broad daylight → sunny, rain, night). In contrast to recent work, we do not only consider the car/vehicle class, but also pedestrians and cyclists. Because the initial model is trained on front-view scans only (KITTI annotations are only available for the front view), we stick to this setup for evaluation.
We follow the official WOD evaluation protocol [sun2020waymo] and report the average precision (AP) at the intersection over union (IoU) thresholds 0.7 and 0.5 for vehicles, as well as at correspondingly lower thresholds for pedestrians and cyclists. AP is reported for both bird’s eye view (BEV) and 3D – denoted as AP_BEV and AP_3D – for different ranges: 0–30 m, 30–50 m and 50–75 m. Additionally, we calculate the closed gap as in [yang2021st3d].
The lower performance bound is to directly apply the source model on the target data, denoted source only (SO). The fully supervised model (FS), trained on the whole WOD training set, on the other hand defines the upper performance bound. Statistical normalization (SN) [wang2020train] is the state-of-the-art in UDA for 3D object detection and leverages target data statistics. The recent DREAMING [you2021exploiting] approach also exploits temporal information for UDA. Finally, we also compare against the few-shot approach [wang2020train], which adapts the model using a few fully labelled sequences of the target domain.
We demonstrate FAST3D with two detectors, PointRCNN [shi2018pointrcnn] and PV-RCNN [shi2020pvrcnn] from OpenPCDet [openpcdet2020], configured as in their official implementations. To compensate for the ego-vehicle motion, we project all frames of a sequence into the world coordinate system using the provided ego-vehicle poses [sun2020waymo]. We slightly increase the field-of-view (FOV) beyond the front-view range of the source model, which simplifies handling pseudo-label tracks at the edge of the narrower FOV. In each self-training cycle, we re-train the detector for a fixed number of epochs with Adam [diderik2015adam] until we reach convergence (i.e. PointRCNN: 2 cycles, PV-RCNN: 3 cycles).
We estimate the scene flow using PointPWC [wu2020pointpwc]. Although PointPWC, similar to other flow estimators [li2021selfpointflow, mittal2020justgowith], could be fine-tuned in a self-supervised fashion, we use the off-the-shelf model pre-trained only on synthetic data [mayer2016flyingthings]. This allows us to demonstrate the benefits and simplicity of leveraging 3D motion without additional fine-tuning. To prepare the point clouds for scene flow estimation, we use RANSAC [fischler1981ransac] to remove ground points (as in [wu2020pointpwc]) and randomly subsample the larger input point cloud to match the smaller one.
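The RANSAC-based ground removal can be sketched in pure Python as below. This is a minimal, illustrative implementation; production code would use an optimized library routine, and the distance threshold and iteration count are placeholders.

```python
import random

def fit_plane(p1, p2, p3):
    """Plane (a, b, c, d) with ax + by + cz + d = 0 through three points;
    returns None for degenerate (collinear) samples."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    norm = sum(c * c for c in n) ** 0.5
    if norm == 0:
        return None
    a, b, c = (comp / norm for comp in n)
    d = -(a * p1[0] + b * p1[1] + c * p1[2])
    return a, b, c, d

def remove_ground(points, dist_thr=0.2, iters=100, seed=0):
    """RANSAC plane fit [fischler1981ransac]: repeatedly fit a plane to
    three random points, keep the plane with the most inliers (the ground),
    and return all points that are not on it."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [p for p in points
                   if abs(a * p[0] + b * p[1] + c * p[2] + d) < dist_thr]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    ground = set(best_inliers)
    return [p for p in points if p not in ground]
```

Removing the dominant plane before flow estimation prevents the (static, featureless) ground from dominating the point budget of the scene flow network.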
4.2 Empirical Results
In Table 1, we compare FAST3D for two detectors with the source only (SO) and fully supervised (FS) strategies, which define the lower and upper performance bounds, respectively. Across all classes and different IoU thresholds, we manage to close the domain gap significantly (36% to 87%), almost reaching the fully supervised oracle performance, although our approach is unsupervised and does not rely on any prior knowledge about the target domain. To the best of our knowledge, we are the first to additionally report adaptation performance for both pedestrians and cyclists, as these vulnerable road users should not be neglected in evaluations. We focus this evaluation on AP_3D because, in contrast to AP_BEV, this metric also penalizes estimation errors along the vertical axis (i.e. vertical centre and height). Consequently, AP_BEV scores are usually much higher than AP_3D scores.
Since reporting the closed domain gap has only been proposed very recently [yang2021st3d], no other approaches have reported this closure for KITTI→WOD yet. We can, however, relate the results for the most similar setting, i.e. adaptation of PV-RCNN for the car/vehicle class, where we achieve a closed gap of 76.5% (KITTI→WOD, vehicles within all sensing ranges) in contrast to the 70.7% of [yang2021st3d] (WOD→KITTI, cars within KITTI’s moderate difficulty). Despite the different settings, we achieve a favourable domain gap closure while evaluating on the more challenging KITTI→WOD. In particular, according to [wang2020train], starting from KITTI as the source domain is the most difficult setting for UDA, as it has orders of magnitude fewer samples than other datasets. Additionally, KITTI contains only sunny daylight data and its car class does not include trucks, vans or buses, which are, however, contained in the WOD vehicle class.
| Class | SO | FAST3D | FS | Closed Gap | SO | FAST3D | FS | Closed Gap |
|---|---|---|---|---|---|---|---|---|
| Vehicle | 8.0 | 58.1 | 65.6 | 87.0 % | 10.3 | 63.6 | 80.0 | 76.5 % |
| Pedestrian | 18.9 | 36.8 | 47.8 | 62.0 % | 13.7 | 44.7 | 74.4 | 51.1 % |
| Cyclist | 24.2 | 48.0 | 60.0 | 66.5 % | 14.6 | 35.6 | 72.9 | 36.0 % |
| Vehicle | 60.0 | 77.3 | 80.9 | 82.8 % | 50.3 | 85.1 | 95.2 | 77.5 % |
| Pedestrian | 30.8 | 41.7 | 52.2 | 50.9 % | 21.6 | 57.2 | 85.4 | 55.8 % |
| Cyclist | 31.8 | 66.2 | 69.0 | 92.5 % | 32.5 | 49.4 | 75.5 | 39.3 % |
Table 2 reports the AP_3D results split by detection range for different intersection over union (IoU) thresholds. We can observe that our self-training pipeline significantly improves the initial model on all ranges. The only notable outliers are cyclists within the far range of 50–75 meters, where even after re-training both detectors achieve rather low scores. This is due to the low number of far-range cyclists within the WOD validation set, i.e. only a few detection failures already drastically degrade the score for the cyclist class.
4.3 Comparison with the State-of-the-Art
To fairly compare with the state-of-the-art in Table 3, we follow the common protocol and report both AP_BEV and AP_3D on the vehicle class. We clearly outperform the best approach (statistical normalization), even though this method utilizes target data statistics and is thus considered weakly supervised, whereas our approach is unsupervised. We also outperform the only temporal 3D pseudo-label approach [you2021exploiting] by a huge margin for all ranges, especially for the high-quality AP_3D metric. Moreover, we perform on par with the few-shot approach from [wang2020train] on close-range data and even outperform it on medium and far ranges.
| Method (AP_BEV / AP_3D) | IoU 0.7: 0–30 m | 30–50 m | 50–75 m | IoU 0.5: 0–30 m | 30–50 m | 50–75 m |
|---|---|---|---|---|---|---|
| SO | 29.2 / 10.0 | 27.2 / 8.0 | 24.7 / 4.2 | 67.8 / 66.8 | 70.2 / 63.9 | 48.0 / 38.5 |
| SN [wang2020train] | 87.1 / 60.1 | 78.1 / 54.9 | 46.8 / 25.1 | - | - | - |
| DREAMING [you2021exploiting] | 51.4 / 13.8 | 44.5 / 16.7 | 25.6 / 7.8 | 81.1 / 78.5 | 69.9 / 61.8 | 50.2 / 41.0 |
| FAST3D (ours) | 81.7 / 70.3 | 75.4 / 58.7 | 52.4 / 37.0 | 87.2 / 74.9 | 81.2 / 64.6 | 68.6 / 44.0 |
| Few Shot [wang2020train] | 88.7 / 74.1 | 78.1 / 57.2 | 45.2 / 24.3 | - | - | - |
| FS | 86.2 / 78.5 | 77.8 / 63.7 | 60.8 / 48.1 | 91.7 / 92.1 | 83.7 / 78.3 | 73.3 / 69.7 |
5 Conclusion

We presented a flow-aware self-training pipeline for unsupervised domain adaptation of 3D object detectors on sequential LiDAR point clouds. Leveraging motion consistency via scene flow, we obtain reliable and precise pseudo-labels at high recall levels. We neither exploit any prior target domain knowledge, nor do we need to modify the 3D detector in any way. As demonstrated in our evaluations, we surpass the current state-of-the-art in self-training for UDA by a large margin.
Acknowledgments

The financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged.