Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection

11/30/2021
by   Deepti Hegde, et al.
Johns Hopkins University

3D object detection networks tend to be biased towards the data they are trained on. Evaluation on datasets captured in different locations, conditions, or with different sensors than the training (source) data results in a drop in model performance due to the gap in distribution with the test (or target) data. Current methods for domain adaptation either assume access to source data during training, which may not be available due to privacy or memory concerns, or require a sequence of lidar frames as input. We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors that uses class prototypes to mitigate the effect of pseudo-label noise. Addressing the limitations of traditional feature aggregation methods for prototype computation in the presence of noisy labels, we utilize a transformer module to identify outlier ROIs that correspond to incorrect, over-confident annotations and compute an attentive class prototype. Under an iterative training strategy, the losses associated with noisy pseudo-labels are down-weighed and the labels are thus refined in the process of self-training. To validate the effectiveness of our proposed approach, we examine the domain shift associated with networks trained on large, label-rich datasets (such as the Waymo Open Dataset and nuScenes) and evaluated on smaller, label-poor datasets (such as KITTI), and vice versa. We demonstrate our approach on two recent object detectors and achieve results that outperform recent domain adaptation works.


1 Introduction

Figure 1: Top row: Visual representations of prototype computation. On the left hand side is a depiction of a standard feature aggregation approach for prototype computation. In the case of noisy labels, features corresponding to mis-labeled regions that are not discarded by outlier removal contribute to the class prototype. On the right is the proposed method of entropy-weighted average of attentive region features which considers only salient regions for prototype computation. The opacity of the features represents the attention weights, and the width of the connecting lines represents the combination weights for computing the average. Bottom row: Comparison of our results on SECOND-iou against recent state-of-the-art methods for three domain shift scenarios.

The localization and categorization of objects in a 3D scene is a crucial component of perception systems in fields like robotics and autonomous driving. In recent years, data-driven approaches using deep neural networks have achieved superior performance in various versions of this task [shi2019pointrcnn, second, shi2020pv, Shi2019PartA2N3, REF:qi2017pointnet, ipod, zhou2018voxelnet], facilitated in part by the release of numerous datasets and benchmarks [waymo, KITTI, nuscenes2019, lyft, chang2019argoverse, cadc]. In practical scenarios, it is important for these object detection frameworks to perform consistently well across different domains. However, deep neural networks tend to learn not only the valuable features that aid in performing the task at hand, but also the biases present in the data they are trained on. In the case of lidar datasets, the weather conditions and the location of capture lead to biases due to the specific dimensions of roads and vehicles and the driving conventions of the area. Additionally, different lidar sensors possess different rates of return and produce pointclouds with varying densities, leading to another set of inherent biases. This results in a distribution gap among pointcloud datasets.

Figure 2: 3D bounding box predictions of the object detector [shi2019pointrcnn] trained on Waymo [waymo] data and tested on KITTI [KITTI]. Ground truth annotations are in green and predictions are in red. Thresholding (b) fails to remove all false positives present in (a). Self-training with these pseudo-labels leads to the enforcement of errors.

Thus, an object detector trained on a particular dataset will drop in performance when evaluated on samples from a dataset with a different distribution. We refer to the training and test datasets in this scenario as the source and target domain datasets, respectively. One may argue that making use of a large, diverse source domain could solve this problem; however, there will always be samples from an unseen distribution, and collecting every possible type of lidar scene is impractical at best.

Unsupervised domain adaptation (UDA) refers to the process of bridging this gap to improve the performance of source-trained networks on unlabelled target samples. There have been several recent works addressing this problem for both 2D [oza2021unsupervised, RoyChowdhury, Khodabandeh2019ARL] and 3D [xu2021spg, yang2021st3d, Saltori2020SFUDA3DSU, scalablePseudo, Luo2021multiLevel] object detection. However, source samples are often unavailable during training due to limited memory capabilities or privacy concerns. This requires a source-free domain adaptation approach, where only the source-trained model and unlabelled target samples are used for adaptation. Several such approaches exist for various computer vision tasks on images [Li2021AFL, Kundu2020UniversalSD, yang2021generalized]. Only SF-UDA3D [Saltori2020SFUDA3DSU] attempts this setting for 3D object detection, but it relies on a sequence of lidar frames as input to the network.

Self-training based methods have been successful in unsupervised and semi-supervised domain adaptation works [scalablePseudo, xie2020self, zou2019confidence], but they rely on confidence thresholding to filter noisy pseudo-labels. As illustrated in the example from Figure 2, the use of high thresholds (as is general practice) results in training the model on easy samples and on incorrect labels of high confidence, which reinforces errors during adaptation.

We propose an unsupervised, source-free domain adaptation framework for 3D object detection that addresses the issue of incorrect, over-confident pseudo-labels during self-training through the use of class prototypes. In the presence of label noise, standard feature aggregation methods of prototype computation [ZhangProto, lvq, yang2018robust, jiang2018learning] are insufficient, since features corresponding to incorrectly labeled regions may contribute to the final prototype (see Figure 1, top row). Inspired by the high representative power of self-attention and by recent works that use transformers to focus on salient inputs [vit, vaswani2017attention], we calculate an attentive class prototype by using a transformer to identify salient regions-of-interest and combine their associated feature vectors using prediction-entropy weights that represent the uncertainty of the classification branch for each sample. Class predictions corresponding to incorrect pseudo-labels, identified by calculating their similarity with the class prototype, are down-weighed to prevent reinforcing errors during self-training. We demonstrate our results on several domain shift scenarios (see Figure 1). Our contributions are as follows:


  • We propose the attentive prototype for learning representative class features in the presence of label noise by leveraging self-attention through a transformer block, and use it to perform source-free unsupervised domain adaptation of 3D object detection networks, mitigating the effect of label noise during self-training by filtering incorrect annotations.

  • We demonstrate our method on two recent object detectors, SECOND-iou [second] and PointRCNN [shi2019pointrcnn], for six domain shift scenarios and outperform recent domain adaptation works.

2 Related Works

3D object detection. With the increase in availability of multiple large-scale lidar datasets, there have been many recent networks proposed for 3D object detection. Here, we focus on pure lidar-based detectors, although there have been several successful multi-modal approaches [chen2017multi, multivox, ku2018joint]. The seminal works PointNet [REF:qi2017pointnet] and PointNet++ [REF:qi2017pointnetplusplus] for the hierarchical feature extraction of pointclouds have spurred numerous deep neural networks for this task. These can be broadly categorised as voxel-based methods [zhou2018voxelnet, lang2019pointpillars, second], which divide the pointcloud into volumetric grids before performing feature extraction, and point-based methods [shi2019pointrcnn, yang20203dssd], which operate directly on each point in the 3D scene. In [second], Yan et al. propose SECOND, a single-stage, voxel-based method that utilizes 3D sparse convolutions and a Region Proposal Network (RPN) head to predict the location and category of objects in a lidar scene. PointPillars [lang2019pointpillars] by Lang et al. builds on this by changing the shape of the voxels to columns that span the height of the pointcloud. PointRCNN [shi2019pointrcnn] is a two-stage network that generates 3D bounding box proposals followed by a refinement stage similar to [fasterrcnn].

Self-training for domain adaptation. Pseudo-label based self-training is a popular approach for unsupervised domain adaptation. In [RoyChowdhury], RoyChowdhury et al. adapt the object detection network FasterRCNN [fasterrcnn] for face detection by training it with a combination of high-confidence pseudo-labels, labels obtained from a tracker, and a knowledge distillation loss. This approach relies on video data to obtain its tracking results. Khodabandeh et al. [Khodabandeh2019ARL] propose Robust FasterRCNN, a domain adaptive 2D object detector that consists of a source-model training stage, a pseudo-label generation stage that uses a pretrained image classifier for refinement, and a final stage in which the detector is trained on the source labels and the refined target pseudo-labels. Yang et al. put forth ST3D [yang2021st3d], a self-training approach for 3D domain adaptive object detection in which the network is adapted by training with a proposed curriculum data augmentation algorithm using pseudo-labels generated with a quality-aware memory bank. While showing promising results in some cases, in others this method is outperformed by standard pseudo-label based self-training, and it depends on boosting with statistical normalization [Wang2020TrainIG], a weakly supervised method, for its best reported results. Caine et al. [Caine] propose a simple yet effective domain adaptation method for 3D object detection using the base network of PointPillars [lang2019pointpillars], training a student network with a combination of source labels and target pseudo-labels obtained from a teacher network trained on the labeled source data. However, simple thresholding methods for pseudo-label collection from the source model may lead to training the model with incorrect labels that have high confidence and discarding correct labels whose confidence falls below the threshold.

Source-free domain adaptation. Privacy and memory limitations during adaptation may prevent access to the source data for training. Source-free approaches use only unlabeled target data and network models pre-trained on the source data during adaptation. Kundu et al. [Kundu2020UniversalSD] propose a UDA method which does not require information about the category-level gap between domains, and consists of a procurement stage, in which a generative classifier equips the model to reject out-of-distribution samples, and a deployment stage, in which adversarial alignment is performed. Yang et al. [yang2021generalized] focus on maintaining model performance on the source domain by a two-stage method that consists of a local structure clustering stage and a domain attention stage that activates feature channels associated with each domain. Saltori et al. [Saltori2020SFUDA3DSU] propose a source-free UDA method for 3D object detection on PointRCNN [shi2019pointrcnn], utilizing a tracking-based scoring system to evaluate the quality of pseudo-labels at different scales. As mentioned above, this method depends on the use of multiple frames for a single forward pass through the network.

Prototype learning. Learning representative features of a class or group of samples is a well-explored problem in pattern recognition. Originally calculated with hand-crafted features [lvq], prototypes in recent approaches are computed from features extracted by convolutional neural networks. Prototype learning has seen success in a variety of tasks, including classification [yang2018robust], zero-shot recognition [jiang2018learning], and domain adaptive 2D segmentation [ZhangProto]. We argue that when training a network with noisy pseudo-labels, clustering and outlier removal are insufficient for obtaining a truly representative class prototype. We thus propose a transformer-based approach to generate attentive prototypes.

Figure 3: A visual overview of the proposed domain adaptation framework. The object detector is initialised with the source-trained model and used for region proposal and feature extraction by the domain adaptation block. This block consists of a transformer module to generate attentive features (see Figure 4 for details), prototype computation by combining features using predictive entropy weights and the exponential moving average (EMA) algorithm, and calculation of the output classification loss weights using cosine similarity. The detection network is trained under the weighted loss.

3 Proposed Method

Consider an object detector network $\Phi_S$ trained on a source dataset $\mathcal{D}_S = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ consisting of sample-label pairs, where $y^s_i$ contains the annotations of objects in a 3D scene, consisting of the dimensions, position, and category of each bounding box. We aim to adapt this network model, in the absence of the source data, to an unlabeled target dataset $\mathcal{D}_T = \{x^t_i\}_{i=1}^{n_t}$ of size $n_t$, using corresponding pseudo-labels $\hat{y}^t_i$ generated by $\Phi_S$. The proposed domain adaptation method consists of prototype computation and similarity-based refinement, implemented with an iterative training strategy. A visual representation of this framework is shown in Figure 3, and a detailed algorithmic description is given in Algorithm 1.
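To make the pseudo-label format concrete, the following is a purely illustrative sketch of a single pseudo-label entry; the field names and types are assumptions, not the exact annotation schema of the datasets or detectors used in this work.

```python
from dataclasses import dataclass

@dataclass
class PseudoLabelBox:
    """Illustrative pseudo-label for one detected object; the field names are
    assumptions and not the exact annotation format used by the detectors."""
    center: tuple[float, float, float]   # (x, y, z) position of the box center
    dims: tuple[float, float, float]     # (length, width, height) of the box
    heading: float                       # yaw angle of the bounding box
    category: str                        # predicted object class, e.g. "Car"
    score: float                         # detector confidence, used for thresholding
```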

3.1 Transformer for prototype computation

In the presence of noisy labels, representation learning through feature clustering methods like those in [ZhangProto, lvq, yang2018robust, jiang2018learning] may compute corrupted class prototypes. In object detection, region features of different classes may be similar (such as the “Car” and “Truck” categories, see Figure 2), rendering outlier removal methods ineffective in cases of mis-classification. In order to address this, we propose a transformer that utilizes self-attention to focus on salient regions-of-interest for prototype computation. A close-up view of the transformer model is shown in Figure 4.

Consider the set $F = \{f_i\}_{i=1}^{N}$ of features of positive regions-of-interest (ROIs), consisting of the $N$ feature vectors generated by the object detector for ROIs categorized as belonging to the object class. In order to create a representative prototype, we take inspiration from [vit] and send the ROI features as tokens to a transformer module consisting of a linear embedding layer and a set of transformer encoders. The encoder contains alternating multi-head attention blocks and feed-forward blocks, with interspersed normalization layers and residual connections, as depicted in Figure 4(b).

The input to the transformer module is the set $F$ of positive ROI features associated with the object category. Each feature in $F$ (comparable to the image patches in [vit]) is sent as a token to the linear projection layer, which gives a set of feature embeddings that are input to the encoder block. The multi-head self-attention layer (MSA) [vaswani2017attention] calls several self-attention operations in parallel, in which a linear layer is used to encode and represent each token in $F$ as a value $v_j$ and a corresponding query $q_i$ and key $k_j$ pair. A weighted sum of all values in $F$ is computed for each token, where the attention weight is

$$A_{ij} = \underset{j}{\mathrm{softmax}}\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) \qquad (1)$$

where $\sqrt{d_k}$ is a scaling factor. The output of the self-attention operation is thus

$$\mathrm{SA}(F)_i = \sum_{j} A_{ij}\, v_j \qquad (2)$$
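The following is a minimal PyTorch-style sketch of the self-attention operation in Eqs. (1)-(2) over the ROI feature tokens; it uses a single head for brevity, whereas the module described above uses multi-head attention inside a full transformer encoder, and the layer names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class ROISelfAttention(nn.Module):
    """Single-head self-attention over positive ROI feature tokens (Eqs. 1-2).
    Illustrative sketch; the paper's module uses multi-head attention inside a
    transformer encoder, and the layer sizes here are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5  # 1 / sqrt(d_k)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (N, dim) -- one token per positive ROI of the class
        q, k, v = self.to_q(roi_feats), self.to_k(roi_feats), self.to_v(roi_feats)
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)  # (N, N) attention weights, Eq. (1)
        return attn @ v                                        # weighted sum of values, Eq. (2)
```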

The feed-forward block is a two-layer multi-layer perceptron (MLP) with the GELU [hendrycks2016gaussian] non-linearity. Following the forward pass through the stack of such encoders, the final output of the transformer module is a set of attentive region features $\hat{F} = \{\hat{f}_i\}_{i=1}^{N}$. By virtue of the self-attention mechanism, we learn the cross-correlation between positive region features and, in turn, which ROIs contribute salient information for prototype computation.

Predictive entropy. As a way to obtain additional insight into how informative each region feature is for the bounding box categorization task, we form the representative attentive class prototype as the sum of the attentive region features weighted by the predictive entropy [wang2014new, prabhu2021active] of the classifier, denoted by $H_i$ for the $i$-th positive ROI. Instead of utilizing the uncertainty of the model as in [prabhu2021active], we weigh each attentive region feature with the confidence of the associated prediction. The predictive entropy and the resulting entropy weights are given by

$$H_i = -\sum_{c} p_{i,c}\, \log p_{i,c}\,, \qquad w_i = \frac{\exp(-H_i)}{\sum_{j=1}^{N} \exp(-H_j)} \qquad (3)$$

where $p_{i,c}$ is the predicted probability of class $c$ for the $i$-th positive ROI.

The attentive prototype as the weighted average of the attentive ROI features is obtained as follows

$$P = \sum_{i=1}^{N} w_i\, \hat{f}_i \qquad (4)$$
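A small sketch of the entropy weighting and prototype aggregation in Eqs. (3)-(4) is given below; the softmax-of-negative-entropy normalization shown here is one plausible reading of the description above, not necessarily the exact form used in the paper.

```python
import torch

def attentive_prototype(attn_feats: torch.Tensor, cls_logits: torch.Tensor) -> torch.Tensor:
    """Entropy-weighted average of attentive ROI features (Eqs. 3-4).
    attn_feats: (N, d) attentive region features from the transformer.
    cls_logits: (N, C) classifier logits for the same ROIs.
    The exact normalization of the weights is an assumption."""
    probs = torch.softmax(cls_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # H_i, Eq. (3)
    weights = torch.softmax(-entropy, dim=0)                       # confident ROIs get larger weight
    return (weights.unsqueeze(-1) * attn_feats).sum(dim=0)         # prototype P, Eq. (4)
```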

Computing the final prototype. During the process of training the object detector for adaptation, samples from the target dataset $\mathcal{D}_T$ are sent in mini-batches. The attentive prototype computed in an iteration is combined with the prototype computed in the previous iteration through an exponential moving average (EMA). The initial prototype, at the first iteration of the first epoch, is simply the average of the positive ROI features. The final attentive prototype $P^{(t)}$ at iteration $t$ is thus given by

$$P^{(t)} = \lambda\, P^{(t-1)} + (1 - \lambda)\, P \qquad (5)$$

where $P$ is the attentive prototype of Eq. (4) computed at iteration $t$, and $\lambda$ is the keep ratio used in our experiments.
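A one-line sketch of the EMA update in Eq. (5) follows; the keep-ratio value is a placeholder, since the exact value used in the experiments is not stated here.

```python
def ema_update(prev_proto, batch_proto, keep_ratio=0.9):
    """Exponential moving average of the class prototype (Eq. 5).
    keep_ratio is a placeholder value, not the paper's reported setting."""
    if prev_proto is None:          # first iteration of the first epoch
        return batch_proto
    return keep_ratio * prev_proto + (1.0 - keep_ratio) * batch_proto
```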

3.2 Similarity-based refinement

Once the representative class prototype is obtained, we use it as a soft filter to identify region features that are dissimilar to it and thus far away from it in the feature space. To do this, the similarity of each positive ROI feature $f_i$ in $F$ to the corresponding class prototype $P$ is calculated using the cosine similarity metric, such that

$$s_i = \frac{\sum_{k=1}^{d} P_k\, f_{i,k}}{\sqrt{\sum_{k=1}^{d} P_k^2}\;\sqrt{\sum_{k=1}^{d} f_{i,k}^2}} \qquad (6)$$

where $d$ is the feature dimension and $i$ ranges from $1$ to the number of positive ROIs, $N$. Cosine similarity is chosen due to the sparse nature of the features. The classification loss corresponding to each positive region of interest is multiplied by a similarity weight, computed by taking the softmax of the cosine similarities, such that

$$\mathcal{L}_{\mathrm{cls}} = \sum_{i \in \mathrm{pos}} \frac{\exp(s_i)}{\sum_{i' \in \mathrm{pos}} \exp(s_{i'})}\, \mathcal{L}_i \;+\; \sum_{j \in \mathrm{neg}} \mathcal{L}_j \qquad (7)$$

where $\mathcal{L}_i$ and $\mathcal{L}_j$ correspond to the region-wise losses of the bounding box classifier, with $i$ indexing the positive regions and $j$ indexing the negative regions. With this similarity-based down-weighing, the losses corresponding to regions that have been identified as incorrect through prototype matching are down-weighed and contribute little to training. As the representative prototype improves with each epoch, the network becomes better at soft-filtering incorrect regions and avoids reinforcing the errors in the pseudo-labels.
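The following sketch illustrates the similarity-based re-weighting of Eqs. (6)-(7), assuming the per-region classification losses are available as tensors; the softmax over positive-ROI similarities follows the description above, though the exact normalization may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def reweighted_cls_loss(pos_feats, prototype, pos_losses, neg_losses):
    """Down-weigh classification losses of ROIs dissimilar to the class prototype.
    pos_feats: (N, d) positive ROI features; prototype: (d,);
    pos_losses/neg_losses: per-region classification losses (Eqs. 6-7 sketch)."""
    sims = F.cosine_similarity(pos_feats, prototype.unsqueeze(0), dim=-1)  # s_i, Eq. (6)
    sim_weights = torch.softmax(sims, dim=0)                               # similarity weights
    return (sim_weights * pos_losses).sum() + neg_losses.sum()             # Eq. (7)
```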

(a) Transformer module
(b) Encoder
Figure 4: A closer look at the transformer module (a). The region-of-interest (ROI) features are sent as a sequence of tokens to the linear projection layer to form feature embeddings, which are in turn input to a set of transformer encoders. (b) The encoder encodes each token as a value and a corresponding key-query pair; the resulting sequence is formed by a summation using the learned attention weights, which represent the cross-correlation between the elements of the input sequence.
Input: source-trained model $\Phi_S$, unannotated target data $\mathcal{D}_T$, pseudo-labels $\hat{y}^t$
Output: bounding box predictions
- Initialize model $\Phi \leftarrow \Phi_S$
for each epoch do
     - Initialize the prototype as the average of the positive ROI features
     for each mini-batch of $\mathcal{D}_T$ do
         - Skip the batch if it contains no positive ROIs of the class
         - Get the positive ROI features $F$
         - Send the features to the transformer to obtain attentive features $\hat{F}$
         - Get the classifier output and compute the predictive entropy weights (Eq. 3)
         - Compute the batch attentive prototype (Eq. 4)
         - Update the prototype by EMA (Eq. 5)
         - Compute the cosine similarity of the ROI features with the prototype (Eq. 6)
         - Down-weigh the classification losses with the similarity weights (Eq. 7)
         - Update $\Phi$ under the weighted loss
     end for
     - Output predictions
     - Regenerate the pseudo-labels $\hat{y}^t$ with the updated model $\Phi$
end for
Algorithm 1 Attentive prototype computation and label refinement.
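As a rough illustration of how the pieces of Algorithm 1 fit together, the following PyTorch-style sketch strings together the helper sketches given earlier (the ROISelfAttention-style transformer, attentive_prototype, ema_update, and reweighted_cls_loss); the detector interface (a forward pass returning per-ROI features, classifier logits, and per-region losses) is hypothetical and is shown only to make the data flow concrete.

```python
def adapt_source_free(detector, target_loader, transformer, optimizer, epochs=1):
    """Sketch of Algorithm 1: prototype-guided self-training on unlabeled target data.
    `detector(batch)` is assumed to return the positive ROI features (N, d), their
    classifier logits (N, C), per-region classification losses, and the remaining
    detection losses; this interface is hypothetical."""
    prototype = None
    for _ in range(epochs):
        for batch in target_loader:                         # target frames with pseudo-labels
            out = detector(batch)                           # hypothetical forward pass
            attn_feats = transformer(out["pos_roi_feats"])  # attentive region features
            batch_proto = attentive_prototype(attn_feats, out["pos_cls_logits"])  # Eqs. (3)-(4)
            prototype = ema_update(prototype, batch_proto.detach())               # Eq. (5)
            cls_loss = reweighted_cls_loss(out["pos_roi_feats"], prototype,
                                           out["pos_cls_losses"], out["neg_cls_losses"])  # Eqs. (6)-(7)
            loss = cls_loss + out["reg_loss"]               # box regression loss left unweighted
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return detector
```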

4 Experiments

We demonstrate our domain adaptation framework on two base object detection networks, SECOND-iou [yang2021st3d], which is a modified version of the voxel-based network SECOND [second], and PointRCNN [shi2019pointrcnn], a two stage point-based network. We explore three cross-dataset domain shift scenarios for each detector for the “Car” object category. In this section, we explain the details of the experiments and the datasets used.

SECOND-iou [second]
Domain shift        Method                            easy    mod.    hard
Waymo → KITTI       DT                                46.66   39.86   35.03
                    ST                                69.04   55.77   54.57
                    SN [Wang2020TrainIG]              73.23   56.36   54.25
                    ST3D [yang2021st3d]               67.47   59.17   56.73
                    Proposed                          75.75   64.43   57.19
                    Oracle                            84.86   68.93   67.38
nuScenes → KITTI    DT                                18.37   17.31   16.09
                    ST                                53.03   37.71   35.18
                    SN                                22.03   18.51   18.04
                    ST3D                              58.24   43.13   39.46
                    Proposed                          71.56   52.12   45.86
                    Oracle                            84.86   68.93   67.38

PointRCNN [shi2019pointrcnn]
Domain shift        Method                            easy    mod.    hard
Waymo → KITTI       DT                                13.11   12.10   12.24
                    ST                                21.13   19.29   18.28
                    SN                                48.7    47.1    49.7
                    SF-UDA3D [Saltori2020SFUDA3DSU]   -       -       -
                    Proposed                          62.11   53.08   46.64
                    Oracle                            81.61   74.36   74.49
nuScenes → KITTI    DT                                10.59   10.76   10.64
                    ST                                22.21   11.56   11.90
                    SN                                60.35   54.82   52.78
                    SF-UDA3D                          68.8    49.80   45.0
                    Proposed                          69.98   61.43   54.26
                    Oracle                            81.61   74.36   74.49

SECOND-iou [second]
Domain shift        Method      mAP
Waymo → nuScenes    DT          15.25
                    ST          17.80
                    SN          18.57
                    ST3D        20.19
                    Proposed    20.47
                    Oracle      32.64

PointRCNN [shi2019pointrcnn]
Domain shift        Method      mAP
KITTI → nuScenes    DT          10.07
                    ST          16.78
                    SN          18.7
                    SF-UDA3D    26.8
                    Proposed    18.87
                    Oracle      29.89

Table 1: A comparison of 3D mean average precision (mAP) results for the "Car" object category for the adaptation of the two object detection networks, SECOND-iou [second] and PointRCNN [shi2019pointrcnn], against recent domain adaptation methods. Where the target dataset is KITTI, we evaluate with the official metric across the three difficulty categories at the official IoU threshold. Where the target dataset is nuScenes, we average across the difficulty categories of the official metric. The best results are in bold type.

4.1 Experimental setup

Datasets. In order to simulate the various domain shifts, we consider three popular large-scale autonomous driving datasets with considerable domain gaps among them: the Waymo Open Dataset [waymo], the KITTI dataset [KITTI], and the nuScenes dataset [nuscenes2019]. The largest of these is Waymo, with more than 230K annotated lidar frames collected across six US cities, of which we use approximately 50K due to memory constraints, in a 40K/10K training/validation (test) split. The nuScenes dataset consists of 34,149 frames, which we utilize in a 28K/6K split. The smallest dataset is KITTI, consisting of 7,481 annotated lidar frames (3K/3K split) collected in Germany. All training and validation splits used are official and consistent with those used by other works in the 3D object detection literature.

Object detection networks. The network SECOND-iou [second, yang2021st3d] is a two-stage, voxel-based 3D object detector that uses a PointNet [REF:qi2017pointnet] backbone to extract voxel features from groupings of points, and consists of a grouping layer, a region proposal network (RPN), and an ROI refinement head. It is a modified version of SECOND [second] proposed in [yang2021st3d], with the extra refinement head. The region features for prototype computation are obtained at the output of the RPN head, and the classification loss at this stage is down-weighed during adaptation. PointRCNN [shi2019pointrcnn] is a two-stage, point-based network with a similar PointNet backbone for feature extraction; it generates 3D region proposals in a bottom-up manner through foreground segmentation, followed by a refinement network. We implement adaptation for PointRCNN in this second stage.

4.2 Implementation details

In addition to the refinement steps described above, we implement an iterative training strategy in which the domain adaptive network is trained and re-trained a series of times, with the source model and pseudo-labels re-initialized to the previously trained model and its generated predictions, respectively. For the implementation of SECOND-iou, we use the public PyTorch repository OpenPCDet [openpcdet2020], and for PointRCNN we use the official code release from [shi2019pointrcnn]. We perform experiments with a 48GB Quadro RTX 8000 GPU and a 16GB GeForce RTX 2080 GPU. We follow the training lengths recommended by the authors of each object detector network. In both cases, the cyclic Adam optimization algorithm is used.
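A brief sketch of the iterative (meta-iteration) scheme described above follows; the prediction API, the helper that runs one round of prototype-guided self-training, and the number of meta-iterations are all illustrative assumptions.

```python
def iterative_adaptation(source_model, target_frames, meta_iterations=4):
    """Sketch of the meta-iteration scheme: each round regenerates pseudo-labels with
    the latest model and re-runs the prototype-guided self-training of Algorithm 1.
    `predict`, `run_attentive_prototype_round`, and the iteration count are
    illustrative assumptions, not the exact interfaces used in this work."""
    model = source_model
    for _ in range(meta_iterations):
        pseudo_labels = [model.predict(frame) for frame in target_frames]        # hypothetical API
        model = run_attentive_prototype_round(model, target_frames, pseudo_labels)
    return model
```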

5 Results

In this section, we demonstrate the results of our domain adaptation framework and compare it against four domain adaptation methods: Direct transfer (DT), inference of the source-trained model on target data; the current state of the art, ST3D [yang2021st3d] by Yang et al. and SF-UDA3D [Saltori2020SFUDA3DSU] by Saltori et al., each with its corresponding base network; Statistical normalization (SN) [Wang2020TrainIG], a weakly supervised approach that uses target-domain bounding-box statistics; and Pseudo-label self-training (ST), re-training the object detector on thresholded, source-model generated pseudo-labels. For reference, we also compare with the "oracle" results, obtained by training the detector with the ground truth labels of the target domain, which indicate the possible upper bound of performance after adaptation.

[Figure 5 column labels. Rows 1-4 (SECOND-iou [second]): Direct transfer, ST, ST3D [yang2021st3d], Ours. Rows 5-8 (PointRCNN [shi2019pointrcnn]): Direct transfer, ST, SN [Wang2020TrainIG], Ours.]
Figure 5: A qualitative comparison of our bounding box predictions of the “Car” class for the adaptation of SECOND-iou [second] (rows 1-4) and PointRCNN [shi2019pointrcnn] (rows 5-8), with direct transfer (DT), pseudo-label self training (ST), statistical normalization (SN) [Cai2019ExploringOR], and ST3D [yang2021st3d]. Ground truth annotations are in green and predictions are in red. Best viewed zoomed in and in color.
Prototype computation method       Waymo → KITTI             nuScenes → KITTI
                                   easy    mod.    hard      easy    mod.    hard
Average                            73.56   57.68   56.75     65.00   47.55   40.24
Self-attention                     72.90   56.82   55.21     65.01   47.54   42.43
Transformer                        72.50   62.64   56.39     69.26   50.09   45.11
Transformer + entropy weight       75.75   64.43   57.19     71.56   52.12   45.86
Table 2: Ablation results. A comparison of the 3D mAP performance of the adaptation of SECOND-iou using different prototype computation methods for two domain shift scenarios.

5.1 Comparison with state-of-the-art

Quantitative analysis. We compare the mean average precision (mAP) of 3D bounding boxes across various difficulty settings for the domain shift scenarios Waymo → KITTI, nuScenes → KITTI, KITTI → nuScenes, and Waymo → nuScenes. Where the target dataset is KITTI, we use the official evaluation metrics detailed in [KITTI], with easy, moderate, and hard difficulty categories based on the distance of the object from the sensor and its level of occlusion, at the official IoU threshold. Where the target dataset is nuScenes, we use the metrics of [nuscenes2019] and average across difficulties to be consistent with both [yang2021st3d] and [Saltori2020SFUDA3DSU]. We reproduce the results of ST3D and SN using training lengths and batch sizes similar to those in our experiments. Note that due to computational limitations, the batch sizes are smaller than those used by [yang2021st3d] and [second]. Due to the lack of a code repository for SF-UDA3D at the time of writing, we compare with the reported numbers. We implement our method with SECOND-iou for comparison with ST3D and with PointRCNN for comparison with SF-UDA3D, to be consistent with their base object detector networks. The results can be seen in Table 1. We achieve the best results with both object detection networks in most categories, beating the weakly supervised approach of SN as well as [Saltori2020SFUDA3DSU] and [yang2021st3d]. We observe an overall improvement over the closest-performing competitor, ST3D, in the case of SECOND-iou for both the nuScenes → KITTI and Waymo → KITTI domain scenarios.

Qualitative analysis. We further compare the effectiveness of our domain adaptation framework with that of recent methods through a visual comparison of the predicted bounding boxes for the Waymo → KITTI task for both object detection networks in Figure 5. The problems of direct transfer (DT) of the source model are poor localization and over-confident false positives (see column 1). This problem is only partially mitigated by pseudo-labelling methods and by ST3D, and is better addressed by our proposed approach.

Domain shift       Meta-iteration     easy    mod.    hard
Waymo → KITTI      DT                 18.37   17.31   16.09
                   1                  57.76   43.02   38.50
                   2                  63.06   46.71   41.12
                   3                  63.18   46.93   40.94
                   4                  71.56   52.12   45.86
Table 3: Analyzing the effect of the iterative training scheme. A comparison of the adaptation method's 3D mAP performance at different meta-iterations for the Waymo → KITTI task.
Figure 6: A plot of IoU vs. confidence score for 500 pseudo-labels from the object detector [second] at two stages of the iterative training process. At direct transfer (DT) from the source-trained model (blue x's), several labels that are incorrect with respect to the ground truth still have high confidence scores, while at meta-iteration 3 of the training scheme (red stars), correct samples tend to have higher confidence than at DT and incorrect samples tend to have lower confidence. That the distribution is most dense at the lower-left and upper-right corners of the plot implies that successive adaptation improves the quality of the pseudo-labels.

5.2 Ablation study

We analyze the contribution of the different components of our domain adaptation framework. In Table 2, we compare the 3D mAP performance of the network using four different prototype computation methods on two domain shift scenarios for the SECOND-iou object detector: (i) Average: the class prototype is computed by taking the mean of the positive region features; (ii) Self-attention: the prototype is computed by taking the mean of attentive region features obtained by sending the region features to a single multi-head self-attention block; (iii) Transformer: the prototype is computed by taking the mean of attentive region features obtained by sending the region features to the transformer module detailed in Section 3; (iv) Transformer with entropy weights: the prototype is computed by taking the prediction-entropy weighted mean of the transformer-generated attentive region features. This is the best performing approach.

Additionally, we compare the performance of the network at different meta-iterations of the training procedure in Table 3, observing improvements in performance with each successive meta-iteration. We further observe that this boost in precision saturates after a certain number of meta-iterations. In order to visualize the quality of the pseudo-labels at different stages of training, we plot the IoU with ground truth vs. confidence of 500 pseudo-labels generated by the source-only model (blue x's) and the third meta-iteration model (red stars) in Figure 6. We desire that the model generate confident labels with large IoU scores (labels should be pushed to the upper-right corner) and be uncertain about generated labels with low IoU scores (labels should be pushed to the lower left). The labels generated after adaptation are less noisy than the source-generated labels, indicated by the fact that incorrect samples from meta-iteration 3 tend to have lower confidence than the DT labels. This shows that the quality of the pseudo-labels improves with adaptation.

6 Limitations

We conduct experiments for a single class and compute a single attentive prototype. The largest domain gaps often occur for smaller objects that appear less frequently, such as pedestrians and cyclists. However, our method could be easily extended to learn multiple attentive prototypes for multi-class detection, with the use of class-specific transformer modules. The method is also less effective when addressing shifts from smaller to larger datasets (Table 1, KITTI to nuScenes). Additionally, in this work we perform closed-set domain adaptation, in which we assume that the classes in the source domain have equivalent classes in the target domain. This may not hold true for all datasets in the general domain adaptation setting.

7 Conclusions and future work

In this paper we proposed a source-free domain adaptation framework for unsupervised domain adaptive 3D object detectors that uses a transformer module to compute an attentive class prototype and perform pseudo-label refinement during self-training. Our method outperforms other recent domain adaptation networks for several different domain shifts. We mainly address pseudo-label noise related to the false-positive mis-classification of regions in the 3D scene, and not to the dimensions of the bounding box. A factor that contributes to the drop in performance upon domain shift is the difference in the average size of vehicles in different locations [Wang2020TrainIG]. While the authors of [Wang2020TrainIG] address this with the weakly supervised approach of statistical normalization, in the future we hope to provide a fully unsupervised solution.

References