1 Introduction

The localization and categorization of objects in a 3D scene is a crucial component of perception systems in fields like robotics and autonomous driving. In recent years, data-driven approaches using deep neural networks have achieved superior performance in various versions of this task
[shi2019pointrcnn, second, shi2020pv, Shi2019PartA2N3, REF:qi2017pointnet, ipod, zhou2018voxelnet], facilitated in part by the release of numerous datasets and benchmarks [waymo, KITTI, nuscenes2019, lyft, chang2019argoverse, cadc]. In practical scenarios, it is important for these object detection frameworks to perform consistently well across different domains. However, deep neural networks tend to learn not only the valuable features that aid the task at hand, but also the biases present in the data they are trained on. In the case of lidar datasets, the weather conditions and the location of capture introduce biases due to the specific dimensions of roads and vehicles and the driving conventions of the area. Additionally, different lidar sensors possess different rates of return and produce pointclouds with varying densities, adding another set of inherent biases. Together, these factors create a distribution gap among pointcloud datasets.
Thus, an object detector trained on a particular dataset will drop in performance when evaluated on samples from a dataset with a different distribution. We refer to the training and test datasets in this scenario as the source and target domain datasets, respectively. One may argue that a large, diverse source domain could solve this problem; however, there will always be samples from an unseen distribution, and collecting every possible type of lidar scene is impractical at best.
Unsupervised domain adaptation (UDA) refers to the process of bridging this gap to improve the performance of source-trained networks on unlabelled target samples. There have been several recent works addressing this problem, for both 2D [oza2021unsupervised, RoyChowdhury, Khodabandeh2019ARL] and 3D [xu2021spg, yang2021st3d, Saltori2020SFUDA3DSU, scalablePseudo, Luo2021multiLevel]
object detection. However, source samples are often unavailable during training due to limited memory capabilities or privacy reasons. This requires a source-free domain adaptation approach, where only the source-trained model and unlabelled target samples are used for adaptation. There exist several such approaches for various computer vision tasks on images
[Li2021AFL, Kundu2020UniversalSD, yang2021generalized]. Only SF-UDA3D [Saltori2020SFUDA3DSU] attempts this setting for 3D object detection, but it relies on a sequence of lidar frames as input to the network. Self training-based methods have been successful in unsupervised and semi-supervised domain adaptation works [scalablePseudo, xie2020self, zou2019confidence], but rely on confidence thresholding to filter noisy pseudo-labels. As illustrated in the example from Figure 2, the use of high thresholds (as is general practice) results in training the model on easy samples and on incorrect labels of high confidence, which reinforces errors during adaptation.
We propose an unsupervised, source-free domain adaptation framework for 3D object detection that addresses the issue of incorrect, over-confident pseudo labels during self-training through the use of class prototypes. In the presence of label noise, standard feature aggregation methods of prototype computation [ZhangProto, lvq, yang2018robust, jiang2018learning] are insufficient, since features corresponding to incorrectly labeled regions could contribute to the final prototype (see Figure 1, top row). Inspired by the high representative power of self-attention and recent works that make use of transformers to focus on salient inputs [vit, vaswani2017attention]
, we calculate an attentive class prototype by using a transformer to identify salient regions-of-interest and combine their associated feature vectors using prediction-entropy weights that represent the uncertainty of the classification branch for each sample. Class predictions corresponding to incorrect pseudo-labels, identified by their low similarity with the class prototype, are down-weighted to prevent reinforcing errors during self-training. We demonstrate our results on several domain shift scenarios (see Figure 1). Our contributions are as follows:
- We propose the attentive prototype for learning representative class features in the presence of label noise by leveraging self-attention through a transformer block, and we use it to perform source-free unsupervised domain adaptation of 3D object detection networks, mitigating the effect of label noise during self-training by filtering incorrect annotations.
- We demonstrate our method on two recent object detectors, SECOND-iou [second] and PointRCNN [shi2019pointrcnn], across six domain shift scenarios and outperform recent domain adaptation works.
2 Related Works
3D object detection. With the increase in availability of multiple large-scale lidar datasets, there have been many recent networks proposed for 3D object detection. Here, we focus on pure lidar-based detectors, although there have been several successful multi-modal approaches [chen2017multi, multivox, ku2018joint]. The seminal works PointNet [REF:qi2017pointnet] and PointNet++ [REF:qi2017pointnetplusplus]
for the hierarchical feature extraction of pointclouds have spurred numerous deep neural networks for this task that can be broadly categorised as voxel-based methods
[zhou2018voxelnet, lang2019pointpillars, second], which divide the pointcloud into volumetric grids before performing feature extraction, and point-based methods [shi2019pointrcnn, yang20203dssd], which operate directly on each point in the 3D scene. In [second], Yan et al. propose SECOND, a single-stage, voxel-based method that utilizes 3D sparse convolutions and a Region Proposal Network (RPN) head to predict the location and category of objects in a lidar scene. Pointpillars [lang2019pointpillars] by Lang et al. builds on this by changing the shape of the voxels to columns that span the height of the pointcloud. PointRCNN is a two-stage network that generates 3D bounding box proposals followed by a refinement stage similar to [fasterrcnn].
Self-training for domain adaptation. Pseudo label-based self-training is a popular approach for unsupervised domain adaptation. In [RoyChowdhury], RoyChowdhury et al. propose a method that trains the object detection network FasterRCNN [fasterrcnn]
with a combination of high confidence pseudo labels and labels obtained from a tracker and a knowledge distillation loss to adapt the network for face detection. This approach relies on video data to obtain its tracking results. Khodabandeh
et al. [Khodabandeh2019ARL] propose Robust FasterRCNN, a domain adaptive 2D object detector that consists of a source-model training stage, a pseudo-label generation stage using a pretrained image classifier for refinement, and a final stage where the detector network is trained on the source labels and the refined target pseudo-labels. Yang
et al. put forth ST3D [yang2021st3d], a self-training approach for 3D domain adaptive object detection in which the network is adapted by training with a proposed curriculum data augmentation algorithm using pseudo-labels generated with a quality-aware memory bank. While showing promising results in some cases, in others this method is outperformed by standard pseudo-label based self-training, and it depends on boosting with statistical normalization [Wang2020TrainIG], a weakly supervised method, for its best reported results. Caine et al. [Caine] propose a simple yet effective domain adaptation method for 3D object detection using the base network of Pointpillars [lang2019pointpillars] that trains a student network with a combination of source labels and target pseudo-labels obtained from a teacher network trained on the labeled source data. However, simple thresholding methods for pseudo-label collection from the source model may lead to training the model with incorrect labels that have high confidence and to discarding correct labels whose confidence falls below the threshold.
Source-free domain adaptation. Privacy and memory limitations during adaptation may prevent access to the source data for training. Source-free approaches use only unlabeled target data and network models pre-trained on the source data during adaptation. Kundu et al. [Kundu2020UniversalSD] propose a UDA method which does not require information about the category-level gap between domains, consisting of a procurement stage in which a generative classifier equips the model to reject out-of-distribution samples, and a deployment stage in which adversarial alignment is performed. Yang et al. [yang2021generalized] focus on maintaining model performance on the source domain with a two-stage method that consists of a local structure clustering stage and a domain attention stage that activates feature channels associated with each domain. Saltori et al. [Saltori2020SFUDA3DSU] propose a source-free UDA method for 3D object detection on PointRCNN [shi2019pointrcnn], utilizing a tracking-based scoring system to evaluate the quality of pseudo-labels at different scales. As mentioned above, this method depends on the use of multiple frames for a single forward pass through the network.
Prototype learning.
Learning representative features of a class or group of samples is a well explored problem in pattern recognition. While prototypes were originally calculated from hand-crafted features
[lvq], recent approaches use convolutional neural networks for feature extraction. This strategy has seen success in a variety of tasks, including classification
[yang2018robust], zero-shot recognition [jiang2018learning], and domain adaptive 2D segmentation [ZhangProto]. We argue that when training a network with noisy pseudo-labels, clustering and outlier removal are insufficient for obtaining a truly representative class prototype. We thus propose a transformer-based approach to generate attentive prototypes.
[Figure 3: Overview of the proposed framework — prototype computation by combining features using predictive entropy weights and the exponential moving average (EMA) algorithm, calculation of the output classification loss weights using cosine similarity, and training of the detection network under the weighted loss.]
3 Proposed Method
Consider an object detector network $\phi_S$ trained on a source dataset $\mathcal{D}_S = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ consisting of sample-label pairs, where $y_i^s$ contains the annotations of objects in a 3D scene, consisting of the dimensions $(l, w, h)$, position $(x, y, z)$, and category of each bounding box. We aim to adapt this network model in the absence of the source data to an unlabeled target dataset $\mathcal{D}_T = \{x_i^t\}_{i=1}^{N_t}$ of size $N_t$ with corresponding pseudo-labels $\{\hat{y}_i^t\}$ generated by $\phi_S$. The proposed domain adaptation method consists of prototype computation and similarity-based refinement, implemented with an iterative training strategy. A visual representation of this framework is shown in Figure 3, and a detailed algorithmic description is given in Algorithm 1.
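For concreteness, the following is a minimal sketch of how a pseudo-labelled target sample could be represented in Python; the field names (and the intensity channel of each point) are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Box3D:
    """One 3D bounding-box label: dimensions, position, and category."""
    length: float
    width: float
    height: float
    x: float
    y: float
    z: float
    category: str          # e.g. "Car"
    score: float = 1.0     # detector confidence when used as a pseudo-label

@dataclass
class TargetSample:
    """An unlabeled target pointcloud paired with source-model pseudo-labels."""
    points: List[Tuple[float, float, float, float]]        # (x, y, z, intensity)
    pseudo_labels: List[Box3D] = field(default_factory=list)
```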
3.1 Transformer for prototype computation
In the presence of noisy labels, representation learning through feature clustering methods like those in [ZhangProto, lvq, yang2018robust, jiang2018learning] may compute corrupted class prototypes. In object detection, region features of different classes may be similar (such as the “Car” and “Truck” categories, see Figure 2), rendering outlier removal methods ineffective in cases of mis-classification. In order to address this, we propose a transformer that utilizes self-attention to focus on salient regions-of-interest for prototype computation. A close-up view of the transformer model is shown in Figure 4.
Consider the set $F = \{f_1, f_2, \ldots, f_{N_p}\}$ of features of positive regions-of-interest (ROIs), consisting of the $N_p$ feature vectors generated by the object detector for ROIs categorized as belonging to the object class. In order to create a representative prototype, we take inspiration from [vit]
and send the ROI features as tokens to a transformer module consisting of a linear embedding layer and a set of transformer encoders. The encoder contains alternating multi-head attention blocks and feed-forward blocks, with interspersed normalization layers and residual connections, as depicted in Figure 3(b).
The input to the transformer module is the set $F$ of positive ROI features associated with the object category. Each feature in $F$ (comparable to the image patches in [vit]) is sent as a token to the linear projection layer, which gives a set of feature embeddings $Z = \{z_1, \ldots, z_{N_p}\}$ that are input to the encoder block. The multi-head self-attention layer (MSA) [vaswani2017attention] calls several self-attention operations in parallel, in which a linear layer is used to encode and represent each token $z_i$ in $Z$ as a value $v_i$ and a corresponding query $q_i$ and key $k_i$ pair. A weighted sum of all values in $Z$ is computed for each token, where the attention weight is
$$A_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) \qquad (1)$$
where $\sqrt{d_k}$ is a scaling factor, with $d_k$ the dimension of the keys. The output of the self-attention operation is thus
$$\mathrm{SA}(Z) = A\,V \qquad (2)$$
where $V$ stacks the values $v_i$ and $A$ is the matrix of attention weights $A_{ij}$.
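The sketch below illustrates the scaled dot-product self-attention of Eqs. (1)-(2) applied to a set of ROI feature tokens. It assumes PyTorch and uses a single attention head for brevity (the paper describes multi-head attention); the class and tensor names are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ROISelfAttention(nn.Module):
    """Single-head self-attention over positive ROI feature tokens (Eqs. 1-2)."""
    def __init__(self, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)   # linear projection of ROI features
        self.to_q = nn.Linear(embed_dim, embed_dim)
        self.to_k = nn.Linear(embed_dim, embed_dim)
        self.to_v = nn.Linear(embed_dim, embed_dim)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (N_p, feat_dim) -- features of positive ROIs of one class
        z = self.embed(roi_feats)                       # token embeddings Z
        q, k, v = self.to_q(z), self.to_k(z), self.to_v(z)
        d_k = q.shape[-1]
        attn = F.softmax(q @ k.T / d_k ** 0.5, dim=-1)  # Eq. (1): attention weights A
        return attn @ v                                 # Eq. (2): SA(Z) = A V

# usage (shapes only): 128 positive ROIs with 512-d features
sa = ROISelfAttention(feat_dim=512)
attentive = sa(torch.randn(128, 512))                   # (128, 256) attentive features
```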
The feed-forward block is a two-layer multi-layer perceptron (MLP) with the GELU
[hendrycks2016gaussian] non-linearity. Following the forward pass through the stacked encoders, the final output of the transformer module is a set of attentive region features $\hat{F} = \{\hat{f}_1, \ldots, \hat{f}_{N_p}\}$. By virtue of the self-attention mechanism, we learn the cross-correlation between positive region features, and in turn which ROIs contribute salient information for prototype computation.
Predictive entropy. As a way to obtain additional insight into how informative each region feature is for the bounding box categorization task, we form the representative attentive class prototype as the sum of the attentive region features weighted with the predictive entropy [wang2014new, prabhu2021active] of the classification branch, denoted by $H$. Instead of utilizing the uncertainty of the model as in [prabhu2021active], we weigh each attentive region feature with the confidence of the associated prediction. The predictive entropy $H_i$ of the class probabilities $p_i$ predicted for the $i$-th ROI and the resulting entropy weights $w_i$ are given by
$$H_i = -\sum_{c} p_{i,c} \log p_{i,c}, \qquad w_i = \frac{\exp(-H_i)}{\sum_{j=1}^{N_p} \exp(-H_j)} \qquad (3)$$
The attentive prototype $\rho$, the weighted average of the attentive ROI features, is obtained as
$$\rho = \sum_{i=1}^{N_p} w_i \,\hat{f}_i \qquad (4)$$
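As an illustration of Eqs. (3)-(4), the sketch below computes entropy weights from classifier probabilities and forms the prototype as the weighted sum of the attentive ROI features. It assumes PyTorch, and the exp(-H) normalization follows the reconstructed Eq. (3) above rather than a formula confirmed by the paper.

```python
import torch

def attentive_prototype(attentive_feats: torch.Tensor,
                        class_probs: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """attentive_feats: (N_p, D) transformer outputs for positive ROIs.
    class_probs: (N_p, C) softmax probabilities from the classification branch.
    Returns the (D,) attentive class prototype."""
    # Eq. (3): predictive entropy per ROI, then confidence-style weights
    entropy = -(class_probs * (class_probs + eps).log()).sum(dim=1)   # (N_p,)
    weights = torch.softmax(-entropy, dim=0)                          # low entropy -> high weight
    # Eq. (4): entropy-weighted average of the attentive features
    return (weights.unsqueeze(1) * attentive_feats).sum(dim=0)

# usage: 128 ROIs, 256-d attentive features, 3 classes
proto = attentive_prototype(torch.randn(128, 256),
                            torch.softmax(torch.randn(128, 3), dim=1))
```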
Computing the final prototype. During the process of training the object detector for adaptation, samples from the target dataset $\mathcal{D}_T$ are sent in mini-batches. The attentive prototype computed in an iteration is combined with the prototype computed in the previous iteration through an exponential moving average (EMA). The initial prototype at the first iteration of the first epoch is simply the average of the positive ROI features. The final attentive prototype $\rho^{(t)}$ at iteration $t$ is thus given by
$$\rho^{(t)} = \lambda\, \rho^{(t-1)} + (1 - \lambda)\, \rho_t \qquad (5)$$
where $\rho_t$ is the prototype computed from the current mini-batch via Eq. (4) and $\lambda$ is the keep ratio used in our experiments.
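A minimal sketch of the EMA update in Eq. (5), assuming PyTorch; the keep ratio value shown is a placeholder, as the paper's setting is not given above.

```python
import torch

def ema_update(proto_prev: torch.Tensor,
               proto_new: torch.Tensor,
               keep_ratio: float = 0.9) -> torch.Tensor:
    """Eq. (5): blend the previous prototype with the one from the current mini-batch."""
    return keep_ratio * proto_prev + (1.0 - keep_ratio) * proto_new

# usage: initialize from the first mini-batch, then update every iteration
proto = torch.randn(256)               # placeholder for the initial average of ROI features
proto = ema_update(proto, torch.randn(256))
```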
3.2 Similarity-based refinement
Once the representative class prototype is obtained, we use it as a soft filter to identify region features that are dissimilar to it and thus far from it in the feature space. To do this, the distance of each positive ROI feature $f_i$ in $F$ from the corresponding class prototype $\rho$ is calculated using the cosine similarity metric, such that
$$s_i = \frac{f_i \cdot \rho}{\lVert f_i \rVert \, \lVert \rho \rVert} = \frac{\sum_{d=1}^{D} f_{i,d}\, \rho_d}{\sqrt{\sum_{d=1}^{D} f_{i,d}^2}\, \sqrt{\sum_{d=1}^{D} \rho_d^2}} \qquad (6)$$
where $D$ is the feature dimension and $i$ ranges from $1$ to the number of positive ROIs, $N_p$. Cosine similarity is chosen due to the sparse nature of the features. The classification loss corresponding to each positive region of interest is multiplied by a similarity weight, computed by taking the softmax of the cosine similarities, such that
$$L_{cls} = \sum_{i=1}^{N_p} \mathrm{softmax}(s)_i \, L^{i}_{cls} + \sum_{j \in \mathrm{neg}} L^{j}_{cls} \qquad (7)$$
where $L^{i}_{cls}$ corresponds to the region-wise loss of the bounding box classifier, $i$ indexes the positive regions, and $j$ indexes the negative regions. With this similarity-based down-weighting, the losses corresponding to regions that have been identified as incorrect through prototype matching are down-weighted and do not contribute to training. As the representative prototype improves with each epoch, the network becomes better at soft-filtering incorrect regions and avoids reinforcing the errors in the pseudo-labels.
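The sketch below illustrates Eqs. (6)-(7): cosine similarity of each positive ROI feature to the prototype, softmax similarity weights, and a down-weighted per-ROI classification loss. It assumes PyTorch, that the detector exposes unreduced per-region losses, and that positive and negative losses are combined by summation; names are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_weighted_cls_loss(pos_feats: torch.Tensor,
                                 prototype: torch.Tensor,
                                 pos_cls_loss: torch.Tensor,
                                 neg_cls_loss: torch.Tensor) -> torch.Tensor:
    """pos_feats: (N_p, D) positive ROI features; prototype: (D,);
    pos_cls_loss / neg_cls_loss: per-region classification losses (unreduced)."""
    # Eq. (6): cosine similarity of each positive ROI feature to the class prototype
    sim = F.cosine_similarity(pos_feats, prototype.unsqueeze(0), dim=1)   # (N_p,)
    # Eq. (7): softmax similarity weights down-weight likely-incorrect pseudo-labels
    weights = torch.softmax(sim, dim=0)
    return (weights * pos_cls_loss).sum() + neg_cls_loss.sum()

# usage with dummy tensors: 128 positive and 512 negative regions, 256-d features
loss = similarity_weighted_cls_loss(torch.randn(128, 256), torch.randn(256),
                                    torch.rand(128), torch.rand(512))
```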
4 Experiments
We demonstrate our domain adaptation framework on two base object detection networks, SECOND-iou [yang2021st3d], which is a modified version of the voxel-based network SECOND [second], and PointRCNN [shi2019pointrcnn], a two stage point-based network. We explore three cross-dataset domain shift scenarios for each detector for the “Car” object category. In this section, we explain the details of the experiments and the datasets used.
Domain shift | Method | SECOND-iou mAP (easy / mod. / hard) | Method | Point-RCNN mAP (easy / mod. / hard)
---|---|---|---|---
Waymo → KITTI | DT | 46.66 / 39.86 / 35.03 | DT | 13.11 / 12.10 / 12.24
 | ST | 69.04 / 55.77 / 54.57 | ST | 21.13 / 19.29 / 18.28
 | SN [Wang2020TrainIG] | 73.23 / 56.36 / 54.25 | SN | 48.7 / 47.1 / 49.7
 | ST3D [yang2021st3d] | 67.47 / 59.17 / 56.73 | SF-UDA3D [Saltori2020SFUDA3DSU] | - / - / -
 | Proposed | 75.75 / 64.43 / 57.19 | Proposed | 62.11 / 53.08 / 46.64
 | Oracle | 84.86 / 68.93 / 67.38 | Oracle | 81.61 / 74.36 / 74.49
nuScenes → KITTI | DT | 18.37 / 17.31 / 16.09 | DT | 10.59 / 10.76 / 10.64
 | ST | 53.03 / 37.71 / 35.18 | ST | 22.21 / 11.56 / 11.90
 | SN | 22.03 / 18.51 / 18.04 | SN | 60.35 / 54.82 / 52.78
 | ST3D | 58.24 / 43.13 / 39.46 | SF-UDA3D | 68.8 / 49.80 / 45.0
 | Proposed | 71.56 / 52.12 / 45.86 | Proposed | 69.98 / 61.43 / 54.26
 | Oracle | 84.86 / 68.93 / 67.38 | Oracle | 81.61 / 74.36 / 74.49
Domain shift | Method | SECOND-iou mAP | Domain shift | Method | Point-RCNN mAP
---|---|---|---|---|---
Waymo → nuScenes | DT | 15.25 | KITTI → nuScenes | DT | 10.07
 | ST | 17.80 | | ST | 16.78
 | SN | 18.57 | | SN | 18.7
 | ST3D | 20.19 | | SF-UDA3D | 26.8
 | Proposed | 20.47 | | Proposed | 18.87
 | Oracle | 32.64 | | Oracle | 29.89
4.1 Experimental setup
Datasets. In order to simulate the various domain shifts, we consider three popular large-scale autonomous driving datasets with considerable domain gaps among them: the Waymo Open Dataset [waymo], the KITTI dataset [KITTI], and the nuScenes dataset [nuscenes2019]. The largest dataset among these is Waymo, with more than 230K annotated lidar frames collected across six US cities, of which we use approximately 50K due to memory constraints in a 40K/10K training/validation (test) split. The nuScenes dataset consists of 34,149 frames, which we utilize in a 28K/6K split. The smallest dataset is KITTI, consisting of 7,481 (3K/3K split) annotated lidar frames collected in Germany. All training and validation splits used are official and consistent with those used by other works in the 3D object detection literature.
Object detection networks. The network SECOND-iou [second, yang2021st3d] is a two-stage, voxel-based 3D object detector which uses a PointNet [REF:qi2017pointnet] backbone to extract voxel features from groupings of points and consists of a grouping layer, a region proposal network (RPN), and an ROI refinement head. It is a modified version of SECOND [second] proposed in [yang2021st3d], with the extra refinement head. The region features for prototype computation are obtained at the output of the RPN head, and the classification loss at this stage is down-weighted during adaptation. PointRCNN is a two-stage, point-based network with a similar PointNet backbone for feature extraction that generates 3D region proposals in a bottom-up manner through foreground segmentation, followed by a refinement network. We implement adaptation for PointRCNN in this second stage.
4.2 Implementation details
In addition to the refinement steps described above, we implement an iterative training strategy in which the domain adaptive network is trained and re-trained a series of times, with the source model re-initialized to the previously trained model and the pseudo-labels regenerated from its predictions at each round. For the implementation of SECOND-iou, we use the public PyTorch repository OpenPCDet
[openpcdet2020], and we use the official code release from [shi2019pointrcnn] for the implementation of PointRCNN. We perform experiments with a 48GB Quadro RTX 8000 GPU and a 16GB GeForce RTX 2080 GPU. We follow the training lengths recommended by the authors of each object detector network. In both cases, the cyclic Adam optimization algorithm is used.
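The meta-iteration strategy can be summarized by the following sketch; the helper callables are placeholders for the detector-specific pseudo-label generation and training routines, which are not specified above.

```python
from typing import Any, Callable

def iterative_self_training(model: Any,
                            target_data: Any,
                            generate_pseudo_labels: Callable[[Any, Any], Any],
                            train_with_weighted_loss: Callable[[Any, Any, Any], Any],
                            num_meta_iterations: int = 4) -> Any:
    """Re-train the detector several times, regenerating pseudo-labels each round."""
    for _ in range(num_meta_iterations):
        # pseudo-labels come from the most recently trained model
        pseudo_labels = generate_pseudo_labels(model, target_data)
        # one adaptation round with prototype-based similarity weighting of the loss
        model = train_with_weighted_loss(model, target_data, pseudo_labels)
    return model
```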
5 Results
In this section we demonstrate the results of our domain adaptation framework and compare it against four domain adaptation methods: Direct transfer (DT), the inference of the source-trained model on target data; the current SOTA, ST3D [yang2021st3d] by Yang et al. and SF-UDA3D [Saltori2020SFUDA3DSU] by Saltori et al., each with its corresponding base network; Statistical normalization (SN) [Wang2020TrainIG], a weakly supervised approach that uses the target domain bounding-box statistics; and pseudo-label self-training (ST), re-training the object detector on thresholded source-model generated pseudo-labels. For reference, we also compare with the "oracle" results, which are obtained by training the detector with the ground truth labels of the target domain, indicating the possible upper bound of performance after adaptation.
[Figure 5: Qualitative comparison of predicted bounding boxes for the Waymo → KITTI task. Top: SECOND-iou [second], columns: Direct transfer, ST, ST3D [yang2021st3d], Ours. Bottom: PointRCNN [shi2019pointrcnn], columns: Direct transfer, ST, SN [Wang2020TrainIG], Ours.]
Prototype computation method | Waymo → KITTI (easy / mod. / hard) | nuScenes → KITTI (easy / mod. / hard)
---|---|---
Average | 73.56 / 57.68 / 56.75 | 65.00 / 47.55 / 40.24
Self attention | 72.90 / 56.82 / 55.21 | 65.01 / 47.54 / 42.43
Transformer | 72.50 / 62.64 / 56.39 | 69.26 / 50.09 / 45.11
Transformer + entropy weight | 75.75 / 64.43 / 57.19 | 71.56 / 52.12 / 45.86
5.1 Comparison with state-of-the-art
Quantitative analysis. We compare the mean average precision (mAP) of 3D bounding boxes across various difficulty settings for the domain shift scenarios Waymo → KITTI, nuScenes → KITTI, KITTI → nuScenes, and Waymo → nuScenes. Where the target dataset is KITTI, we use the official evaluation metrics detailed in
[KITTI], with easy, moderate, and hard difficulty categories based on the distance of the object from the sensor and its level of occlusion, using the official IoU threshold. Where the target dataset is nuScenes, we use the metrics of [nuscenes2019] and average across difficulties to be consistent with both [yang2021st3d] and [Saltori2020SFUDA3DSU]. We reproduce the results of ST3D and SN using training lengths and batch sizes similar to our experiments. Note that due to computational limitations, the batch sizes are smaller than those used by [yang2021st3d] and [second]. Due to the lack of a code repository for SF-UDA3D at the time of writing, we compare with its reported numbers. We implement our method with SECOND-iou for comparison with ST3D and with PointRCNN for comparison with SF-UDA3D, to be consistent with their base object detector networks. The results are shown in Table 1. We achieve the best results with both object detection networks in most categories, beating the weakly supervised approach SN as well as [Saltori2020SFUDA3DSU] and [yang2021st3d]. We observe an overall improvement over the closest performing competitor, ST3D, in the case of SECOND-iou for the nuScenes → KITTI domain scenario, as well as an improvement for Waymo → KITTI.
Qualitative analysis. We further compare the effectiveness of our domain adaptation framework with recent methods through a visual comparison of the predicted bounding boxes for the Waymo → KITTI task for both object detection networks in Figure 5. The problems of direct transfer (DT) of the source model are localization errors and over-confident false positives (see column 1). These problems are only partially mitigated by pseudo-labelling methods and by ST3D, and are better addressed by our proposed approach.
Domain shift | Meta-iteration | mAP (easy / mod. / hard)
---|---|---
nuScenes → KITTI | DT | 18.37 / 17.31 / 16.09
 | 1 | 57.76 / 43.02 / 38.50
 | 2 | 63.06 / 46.71 / 41.12
 | 3 | 63.18 / 46.93 / 40.94
 | 4 | 71.56 / 52.12 / 45.86

[Figure: IoU with ground truth vs. confidence of 500 pseudo-labels generated by the source-only model and the meta-iteration-3 model.]
5.2 Ablation study
We analyze the contribution of the different components of our domain adaptation framework. In Table 2, we compare the 3D mAP performance of the network using four different prototype computation methods on two different domain shift scenarios for the SECOND-iou object detector: (i) Average: the class prototype is computed by taking the mean of the positive region features; (ii) Self-attention: the prototype is computed by taking the mean of attentive region features obtained by sending the region features through a single multi-head self-attention block; (iii) Transformer: the prototype is computed by taking the mean of attentive region features obtained from the transformer module detailed in Section 3; (iv) Transformer with entropy weights: the prototype is computed by taking the prediction-entropy-weighted mean of the transformer-generated attentive region features. This is the best performing approach.
Additionally, we compare the performance of the network at different meta-iterations of the training procedure in Table 3, observing improvements in performance with each successive meta-iteration. We further observe that this boost in precision saturates after a certain number of meta-iterations. In order to visualize the quality of the pseudo-labels at different stages of training, we plot the IoU with ground truth vs. the confidence of 500 pseudo-labels generated by the source-only model (blue x's) and the third meta-iteration model (red stars). We desire that the model generate confident labels with large IoU scores (labels should be pushed to the upper right corner) and be uncertain about generated labels with low IoU scores (labels should be pushed to the lower left). The labels generated after adaptation are less noisy than the source-generated labels, indicated by the fact that the incorrect samples (those with low IoU) of meta-iteration 3 tend to be distributed with lower confidence than the DT labels. This shows that the quality of the pseudo-labels improves after adaptation.
6 Limitations
We conduct experiments for a single class and compute a single attentive prototype. The largest domain gaps often occur with smaller objects that appear less frequently, such as pedestrians and cyclists. However, our method could easily be extended to learn multiple attentive prototypes for multi-object detection through the use of class-specific transformer modules. The method is less effective when addressing shifts from smaller to larger datasets (see the KITTI → nuScenes results). Additionally, in this work we perform closed-set domain adaptation, in which we assume that the classes in the source domain have equivalent classes in the target domain. This may not hold true for all datasets in the general domain adaptation setting.
7 Conclusions and future work
In this paper we proposed a source-free domain adaptation framework for unsupervised domain adaptive 3D object detectors that uses a transformer module to compute an attentive class prototype for pseudo-label refinement during self-training. Our method outperforms other recent domain adaptation networks across several domain shifts. We mainly address pseudo-label noise related to the false-positive mis-classification of regions in the 3D scene, and not the dimensions of the bounding box. A factor that contributes to the drop in performance upon domain shift is the difference in the average size of vehicles in different locations [Wang2020TrainIG]. While [Wang2020TrainIG] addresses this with the weakly supervised approach of statistical normalization, in the future we hope to provide a fully unsupervised solution.