Synergizing between Self-Training and Adversarial Learning for Domain Adaptive Object Detection

10/01/2021 ∙ by Muhammad Akhtar Munir, et al. ∙ Information Technology University

We study adapting trained object detectors to unseen domains manifesting significant variations of object appearance, viewpoints and backgrounds. Most current methods align domains by either using image or instance-level feature alignment in an adversarial fashion. This often suffers due to the presence of unwanted background and as such lacks class-specific alignment. A common remedy to promote class-level alignment is to use high confidence predictions on the unlabelled domain as pseudo labels. These high confidence predictions are often fallacious since the model is poorly calibrated under domain shift. In this paper, we propose to leverage model predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. Specifically, we measure predictive uncertainty on class assignments and the bounding box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-supervision, whereas the ones with higher uncertainty are used to generate tiles for an adversarial feature alignment stage. This synergy between tiling around the uncertain object regions and generating pseudo-labels from highly certain object regions allows us to capture both the image and instance level context during the model adaptation stage. We perform extensive experiments covering various domain shift scenarios. Our approach improves upon existing state-of-the-art methods with visible margins.




1 Introduction

Deep convolutional neural network based object detectors have shown promising results by learning representative features from large annotated datasets everingham2010pascal; lin2014microsoft; kitty2012we. However, like other supervised deep learning methods, object detectors trained on a source domain do not generalize adequately to a new target domain. This problem, known as domain shift torralba2011unbiased, can arise from changes in style, camera pose, object size and orientation, or the number or location of objects in the scene, among other things. Collecting a large annotated dataset for fine-tuning the model to the target domain is often expensive, error-prone, and in many cases not possible. Unsupervised Domain Adaptation (UDA) is a promising research direction for solving this problem by transferring knowledge from a labelled source domain to an unlabelled target domain.

Many unsupervised domain adaptive detectors rely on adversarial adaptation or self-training techniques. Methods based on adversarial adaptation chen2018domain; saito2019strong; he2019multi; hsu2020every; zheng2020cross; xu2020exploring; chen2020harmonizing; nguyen2020uada mostly rely on a domain discriminator to align features at the image or instance level. However, because labels are absent in the target domain, they face the challenge of how to pick samples for adaptation: selecting uniformly, one ends up missing infrequent classes or instances. Most importantly, adversarial alignment does not explicitly incorporate class-discriminative information, resulting in non-optimal alignment for classification and object detection tasks saito2019strong; chen2018domain; 2018dirt. A potential solution to this problem is self-training based adaptation; however, it faces the challenge of avoiding noisy pseudo-labels. Some methods choose high-confidence predictions as pseudo-labels lee2013pseudo; inoue2018cross; roychowdhury2019automatic, but the likely poor calibration of the model under domain shift renders this solution inefficient NEURIPS2019_canyou. Further, in the case of object detection, prediction probability cannot directly capture object localization inaccuracies.

We present a principled approach to balancing self-training and adversarial alignment for adaptive object detection by leveraging the model’s predictive uncertainty. To estimate the predictive uncertainty of a detection, we propose taking into account variations in both the localization prediction and the confidence prediction across Monte-Carlo dropout inferences gal2016dropout. Certain detections are taken as pseudo-labels for self-training, while uncertain ones are used to extract tiles (regions in the image) for adversarial feature alignment. This synergy between adversarial alignment via tiling around the uncertain object regions and self-training with pseudo-labels from certain object regions lets us include instance-level context for effective adversarial alignment and improve feature discriminability for class-specific alignment. Since we select pseudo-labels with low uncertainty and treat relatively uncertain detections as potential object-like regions with context (i.e. tiles) for adversarial alignment, we tend to reduce the effect of poor calibration under domain shift, thereby improving the model’s generalization across domains.

Our key contributions include the following: (1) We introduce a new uncertainty-guided framework that strikes the right balance between self-training and adversarial feature alignment for adapting object detection methods. Both pseudo-labelling for self-training and tiling for adversarial alignment are impactful due to their simplicity, generality and ease of implementation. (2) We propose a method for estimating object detection uncertainty by taking into account variations in both the localization prediction and the confidence prediction across Monte-Carlo dropout inferences. (3) We show that, by selecting pseudo-labels with low uncertainty and using relatively uncertain regions for adversarial alignment, it is possible to address the poor calibration caused by domain shift and hence improve the model’s generalization across domains. (4) Unlike most previous methods, we build on computationally efficient one-stage anchor-less object detectors and achieve state-of-the-art results with notable margins across various adaptation scenarios.

2 Related Work

Object detection.

Deep learning based object detection algorithms can be classified into either anchor-based DBLP:conf/nips/RenHGS15; DBLP:conf/cvpr/LinDGHHB17; DBLP:conf/cvpr/SinghD18; DBLP:conf/cvpr/CaiV18 or anchor-free methods law2018cornernet; duan2019centernet; tian2019fcos. Anchor-based methods, such as Faster RCNN DBLP:conf/nips/RenHGS15, use a region proposal network (RPN) to generate proposals. Anchor-free detectors, on the other hand, skip the proposal generation step and localize objects directly by leveraging a fully convolutional network (FCN) long2015fully. For instance, tian2019fcos proposed per-pixel prediction, directly predicting the class and offset of the corresponding object at each location on the feature map. In this work, we capitalize on the computationally inexpensive nature of anchor-free detectors to study adapting trained object detectors.
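To make the per-pixel formulation concrete, here is a small sketch of how FCOS-style regression targets and a centerness score are commonly computed for one location inside a ground-truth box. The function name and the `(x0, y0, x1, y1)` box layout are assumptions made for illustration, not code from the paper.

```python
import math

def fcos_targets(x, y, box):
    """Return ((l, t, r, b), centerness) for location (x, y), or None if outside box.

    box is assumed to be (x0, y0, x1, y1) in the same coordinates as (x, y).
    """
    x0, y0, x1, y1 = box
    # Distances from the location to the four sides of the box.
    l, t, r, b = x - x0, y - y0, x1 - x, y1 - y
    if min(l, t, r, b) <= 0:
        return None  # the location is not strictly inside the box
    # Centerness is 1 at the box centre and decays towards the borders.
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return (l, t, r, b), centerness

center_case = fcos_targets(50, 50, (0, 0, 100, 100))
off_center_case = fcos_targets(25, 50, (0, 0, 100, 100))
outside_case = fcos_targets(150, 50, (0, 0, 100, 100))
```

Locations near a box border receive a low centerness, which is how this family of detectors down-weights poorly centred predictions.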

Tiling for object detection. The process of cropping regions of an input image, a.k.a tiling, in a uniform ozge2019power, random, or informed yang2019clustered; hong2019patch; li2020density fashion before applying object detection is typically used to tackle scale variation problem and improve detection accuracy over small objects. Informed tiling can be achieved by first generating a set of regions of object clusters, and then cropping them for subsequent fine detection yang2019clustered.

Domain-adaptive object detection. The pioneering work of chen2018domain on domain-adaptive (DA) object detection proposed reducing domain shift at both image and instance levels via embedding adversarial feature adaptation into anchor-based detection pipeline. Global feature alignment could suffer as domains may manifest distinct scene layouts and complex object combinations. Several subsequent approaches attempted to achieve a right balance between the global and instance-level alignments zhu2019adapting; xu2020exploring. Other methods he2019multi; kim2019diversify; cai2019exploring; hsu2020progressive improved feature alignment in various ways e.g., through exploiting hierarchical feature learning in CNNs he2019multi. While above methods are built on two-stage pipeline, a few approaches have built domain adaptive detectors on one-stage pipeline kim2019self; hsu2020every. hsu2020every proposed to predict pixel-wise objectness and center-aware feature alignment, building on tian2019fcos, to focus on the discriminative parts of objects.

Uncertainty for DA object detection. Exploiting the model’s predictive uncertainty and entropy optimization has been a subject of interest in prior work on cross-domain recognition long2018conditional; han2019unsupervised; manders2018adversarial; ringwald2020unsupervised and detection guan2021uncertainty; nguyen2020domain. For cross-domain recognition, ringwald2020unsupervised employed uncertainty for filtering training data and aligning features in Euclidean space. For DA object detection, guan2021uncertainty proposed an uncertainty metric to adaptively regulate the strength of adversarial learning for well-aligned and poorly-aligned samples.

Pseudo-labelling for DA object detection. In DA object detection, pseudo-labelling aims at acquiring pseudo instance-level annotations for incorporating discriminative information. inoue2018cross generated pseudo instance-level annotations by choosing the top-1 confidence detections. Similarly, roychowdhury2019automatic obtained the same by using high-confidence detections and further refined them using tracker’s output. Towards refining (noisy) pseudo instance-level annotations, khodabandeh2019robust employed auxiliary component and kim2019self devised a criterion based on supporting RoIs.

Confidence-based pseudo-label selection is prone to generating noisy labels since the model is poorly calibrated under domain shift, eventually causing degenerate network re-training.

Unlike most prior methods, we build on a computationally inexpensive one-stage anchor-free detector. In contrast to existing methods, we leverage the model’s predictive uncertainty, considering variations in localization and confidence predictions across MC simulations, to achieve the best of both self-training and adversarial alignment by mining highly certain target detections as pseudo-labels and relatively uncertain ones as guides in the tiling process.

Figure 1: Overall architecture of our method. Fundamentally, it is a one-stage detector tian2019fcos with an adversarial feature alignment stage. We propose uncertainty-guided self training with pseudo-labels (UGPL) and uncertainty-guided adversarial alignment via tiling (UGT) (in dotted boxes). UGPL produces accurate pseudo-labels in target image which are used in tandem with ground-truth labels in source image for training. UGT extracts tiles around possibly object-like regions in target image which are used with randomly extracted tiles around ground-truth labels in source domain for adversarial feature alignment.

3 Proposed Method

In this section, we describe the technical details of our method. Fig. 1 displays the overall architecture of our method. We propose to leverage model’s predictive uncertainty to strike the right balance between adversarial feature alignment and self-training. To this end, we introduce uncertainty-guided pseudo-labels selection (UGPL) for self-training and uncertainty-guided tiling (UGT) for adversarial alignment. The former allows generating accurate pseudo-labels to improve feature discriminability for class-specific alignment, while the latter enables extracting tiles on uncertain, object-like regions for effective domain alignment.

3.1 Preliminaries

Problem Setting. Let $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ be the labeled source dataset and $\mathcal{D}_t = \{x_i^t\}_{i=1}^{N_t}$ be the unlabeled target dataset, where $y_i^s$ is the set of bounding boxes $b$ for the objects in image $x_i^s$ and their corresponding classes $c$. The source and target domains share an identical label space but violate the i.i.d. assumption, since they are sampled from different data distributions. Our goal is to learn a domain-adaptive object detector, given labeled $\mathcal{D}_s$ and unlabeled $\mathcal{D}_t$, that performs accurately in the target domain.

One-stage anchor-free object detection. Because one-stage anchor-free detection pipelines are computationally inexpensive, we build our uncertainty-guided domain-adaptive detector on the fully convolutional one-stage object detector (FCOS) tian2019fcos. Inspired by the fully convolutional architecture long2015fully, FCOS makes per-pixel predictions and directly regresses object locations. Specifically, at each location it outputs a $C$-dimensional classification vector, a 4D vector of bounding box coordinates, and a centerness score. The loss function for training FCOS is:

$$\mathcal{L}_{det} = \frac{1}{N_{pos}} \sum_{u,v} \mathcal{L}_{cls}\big(p_{u,v}, c^{*}_{u,v}\big) + \frac{1}{N_{pos}} \sum_{u,v} \mathbb{1}_{\{c^{*}_{u,v} > 0\}}\, \mathcal{L}_{reg}\big(t_{u,v}, t^{*}_{u,v}\big) \qquad (1)$$

where $\mathcal{L}_{cls}$ is the classification loss (i.e. focal loss lin2018focal) and $\mathcal{L}_{reg}$ is the regression loss (i.e. IoU loss yu2016unitbox). $p_{u,v}$ and $t_{u,v}$ denote the class and bounding box predictions at location $(u,v)$, $c^{*}_{u,v}$ and $t^{*}_{u,v}$ are the corresponding ground-truth targets, and $N_{pos}$ denotes the number of positive samples.
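As a rough illustration of the two loss terms named above, the sketch below implements a binary focal loss for a single class score and a UnitBox-style IoU loss on (left, top, right, bottom) distances. The defaults `alpha=0.25` and `gamma=2.0` are the values commonly used with focal loss; the function names are illustrative and this is not the authors’ training code.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)            # clamp for numerical stability
    p_t = p if target == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def iou_loss(pred_ltrb, target_ltrb, eps=1e-7):
    """-log(IoU) for two boxes given as distances from the same location."""
    pl, pt, pr, pb = pred_ltrb
    gl, gt, gr, gb = target_ltrb
    pred_area = (pl + pr) * (pt + pb)
    target_area = (gl + gr) * (gt + gb)
    # Both boxes contain the shared location, so the overlap is a simple product.
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gt) + min(pb, gb))
    union = pred_area + target_area - inter
    return -math.log((inter + eps) / (union + eps))

easy_positive = focal_loss(0.9, 1)
hard_positive = focal_loss(0.1, 1)
half_size_box = iou_loss((5, 5, 5, 5), (10, 10, 10, 10))
```

The focal term shrinks the contribution of well-classified locations, while the IoU term penalises the four box distances jointly rather than one coordinate at a time.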

Adversarial feature alignment. Several methods saito2019strong; chen2018domain align feature maps at the image level to reduce domain shift via adversarial learning. This involves a global discriminator $D$ that identifies whether the pixels on each feature map belong to the source or the target domain. Specifically, let $F \in \mathbb{R}^{H \times W \times K}$ be the $K$-dimensional feature map of spatial resolution $H \times W$ extracted from the feature backbone network. The output of $D$ is a domain classification map of the same spatial size as $F$. The discriminator can be optimized using a binary cross-entropy loss:

$$\mathcal{L}_{adv} = -\sum_{u,v} \Big[ z \log D(F_s)^{(u,v)} + (1 - z) \log\big(1 - D(F_t)^{(u,v)}\big) \Big] \qquad (2)$$

where $z \in \{0, 1\}$ is the domain label and $F_s$, $F_t$ are the source and target feature maps. We perform adversarial feature alignment by applying a gradient reversal layer (GRL) ganin2015unsupervised to the source and target feature maps, so that the sign of the gradient is flipped when optimizing the feature extractor through the GRL. Global alignment is prone to focusing on (unwanted) background pixels. We introduce uncertainty-guided tiling, which crops tiles (regions with context) around object-like regions for effective adversarial alignment (sec. 3.2.1).
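The pixel-wise discriminator objective can be sketched in a few lines of plain Python. The convention that source locations carry label 1 and target locations label 0 is an assumption for illustration, as are the function and variable names; gradient reversal itself requires an autograd framework and is not shown.

```python
import math

def domain_bce_loss(domain_map, is_source, eps=1e-7):
    """Sum of per-location binary cross-entropy over a domain classification map.

    domain_map: 2D list of predicted probabilities that a location is source.
    is_source:  True for a source feature map (label 1), False for target (label 0).
    """
    z = 1.0 if is_source else 0.0
    total = 0.0
    for row in domain_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))
    return total

source_map = [[0.9, 0.8], [0.7, 0.95]]   # discriminator output on a source image
target_map = [[0.2, 0.1], [0.3, 0.05]]   # discriminator output on a target image
adv_loss = domain_bce_loss(source_map, True) + domain_bce_loss(target_map, False)
```

With a gradient reversal layer in front of the discriminator, minimising this loss trains the discriminator while pushing the backbone towards features it cannot tell apart.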

Self-Training. Self-training is the process of training with pseudo-labels, which are generated for unlabelled samples in the target domain by a model trained on labelled data. Hard pseudo instance-level labels are obtained directly from the network’s class predictions. Let $p$ be the probability output vector of a trained network for a detection $d$, such that $p_c$ denotes the probability of class $c$ being present in the detection. With these probabilities, the pseudo-label for $d$ can be generated as $\tilde{y} = \arg\max_c p_c$, where $c \in \{1, \dots, C\}$. A significant fraction of the detections used during training could be incorrectly pseudo-labelled. A common strategy to reduce this noise is to select pseudo-labels corresponding to high-confidence detections inoue2018cross; roychowdhury2019automatic. Let $g \in \{0, 1\}$ be a boolean variable denoting the selection or rejection of $\tilde{y}$, where $g = 1$ when $\tilde{y}$ is selected and $g = 0$ otherwise. Formally, in confidence-based selection, a pseudo-label is selected as $g = \mathbb{1}[\max_c p_c \geq \tau]$, where $\tau$ is the confidence threshold. These high-confidence detections are often noisy because the model is poorly calibrated under domain shift. Instead, we propose to select pseudo-labels using uncertainty in both the class prediction and the localization prediction to mitigate the impact of poor network calibration (sec. 3.2.1).
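For reference, the confidence-based baseline described here amounts to an argmax followed by a threshold test. A minimal sketch, with the function name and the default threshold chosen purely for illustration:

```python
def confidence_pseudo_label(class_probs, threshold=0.5):
    """Return (label, selected) for one detection's class-probability vector."""
    # Hard pseudo-label: the class with the highest predicted probability.
    label = max(range(len(class_probs)), key=lambda c: class_probs[c])
    # Keep the pseudo-label only if the detector is confident enough.
    selected = class_probs[label] >= threshold
    return label, selected

confident = confidence_pseudo_label([0.1, 0.7, 0.2])
ambiguous = confidence_pseudo_label([0.4, 0.35, 0.25])
```

Nothing in this rule checks whether the confidence is trustworthy, which is exactly the weakness under domain shift that the uncertainty-based selection is meant to address.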

3.2 Uncertainty for Domain Adaptive Object Detection

The source model is poorly calibrated on a target domain with sufficiently different superficial statistics and different object combinations NEURIPS2019_canyou; 2018dirt. Although confidence-based selection (typically of the highest-confidence predictions) improves accuracy, the poor calibration of the model under domain shift makes this strategy inefficient. As a result, it could lead to both poor pseudo-labelling accuracy and incorrect identification of possibly object-like regions for adversarial alignment. Since calibration can be considered as the model’s overall prediction uncertainty lakshminarayanan2016simple, we believe that by leveraging the model’s predictive uncertainty we can negate the effects of poor calibration. To this end, we propose to leverage uncertainty in detections to select pseudo-labels for self-training and to choose regions for tiling in adversarial alignment.

Uncertainty in object detections. Assuming a one-stage detector, we estimate uncertainty by applying Monte-Carlo dropout gal2016dropout (in particular, spatial dropout tompson2015efficient) to the convolutional filters after the feature extraction layer. Given an image $x$, we perform $N$ stochastic forward passes (inferences) using MC dropout. Let $d_j^i$ be the $j$-th detection in the $i$-th inference, $c_j^i$ be the class label with the highest probability in its probability vector $p_j^i$, and $b_j^i$ be the predicted bounding box. We aim to capture the variations in both the localization prediction and the confidence prediction across inferences. To this end, we define the uncertainty of an object detection as the mean class probability of the overlapping bounding boxes across individual inferences.

Specifically, for each $d_j^i$, we create a set $S_j^i$ by including all $d_k^l$, where $l \in \{1, \dots, N\}$ and $d_k^l$ is an arbitrary detection in the $l$-th MC forward pass, such that $b_k^l$ has IoU with $b_j^i$ greater than a specific threshold and $c_k^l = c_j^i$:

$$S_j^i = \big\{ d_k^l \;:\; \mathrm{IoU}(b_j^i, b_k^l) > \tau_{IoU},\; c_k^l = c_j^i \big\} \qquad (3)$$

where $\tau_{IoU}$ is the IoU threshold used to identify bounding boxes occupying the same region (detected as the same object). We use $S_j^i$ to estimate the uncertainty of $d_j^i$, based on both the localization prediction and the confidence prediction, as:

$$\mathcal{U}_j^i = \frac{1}{|S_j^i|} \sum_{d_k^l \in S_j^i} \hat{p}_k^l \qquad (4)$$

where $\hat{p}_k^l$ is the class prediction confidence of detection $d_k^l$ in $S_j^i$.
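The procedure just described can be sketched as follows: collect, across MC dropout passes, the detections that share a class with a reference detection and overlap it above an IoU threshold, then average their confidences. The `Detection` container, the function names, and the choice to include the reference detection’s own pass in the matching are assumptions for illustration; the detections are hand-written stand-ins for real MC dropout outputs.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple    # (x0, y0, x1, y1)
    label: int    # class with the highest probability
    score: float  # confidence of that class

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def overlap_set(ref, passes, iou_thr=0.5):
    """Detections from all passes that match ref in class and location."""
    return [d for dets in passes for d in dets
            if d.label == ref.label and iou(ref.box, d.box) > iou_thr]

def mean_confidence(ref, passes, iou_thr=0.5):
    """Mean confidence over the overlap set (higher means more certain)."""
    matched = overlap_set(ref, passes, iou_thr)
    return sum(d.score for d in matched) / len(matched) if matched else 0.0

passes = [
    [Detection((10, 10, 50, 50), 1, 0.9)],      # pass 1
    [Detection((12, 11, 52, 49), 1, 0.8)],      # pass 2: same object, slightly shifted
    [Detection((200, 200, 240, 240), 1, 0.6)],  # pass 3: a different region
]
ref = passes[0][0]
matches = overlap_set(ref, passes)
certainty = mean_confidence(ref, passes)
```

The size of the overlap set relative to the number of passes gives a natural measure of how consistently the detection reappears, in addition to the averaged confidence.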

Figure 2: An illustration on which detections will be considered as pseudo-labels and which for extracting tiles. More certain detections, such as pedestrians are taken as pseudo-labels, whereas relatively uncertain ones, like cars under fog, are used for extracting tiles.

3.2.1 Uncertainty-Guided Pseudo-Labelling and Tiling

We interpret the averaged confidences as a proxy (or indirect) measure of how uncertain (or certain) the model is in its class assignment and object localization information ringwald2020unsupervised. Under this definition, the model will be completely uncertain if the averaged prediction $\bar{p}$ has a uniform distribution, whereas it will be completely certain if $\bar{p}$ can be represented by a Kronecker delta function.

Uncertainty-guided pseudo-labelling for self-training. As discussed above, calibration can be considered a measure of the network’s overall prediction uncertainty. We therefore examine the relationship between calibration and individual detection uncertainties by plotting the expected calibration error (ECE) score guo2017calibration against the output detection uncertainties (Fig. 3). We observe a clear relationship between the two: when we select pseudo-labels from more certain detections, the calibration error of the selected set goes down significantly. For this selected set of pseudo-labels, a high-confidence detection is therefore more likely to yield a correct pseudo-label.

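In light of this observation, we propose to select the pseudo-label corresponding to a detection $d_j^i$ by using its uncertainty $\mathcal{U}_j^i$ and its detection consistency across the $N$ inferences:

$$g_j^i = \mathbb{1}\big[\mathcal{U}_j^i \geq \kappa_1\big] \cdot \mathbb{1}\Big[\tfrac{|S_j^i|}{N} \geq \kappa_2\Big] \qquad (5)$$

where $\mathcal{U}_j^i$ is the mean class probability over the set $S_j^i$ of overlapping same-class detections, and $\kappa_1$ and $\kappa_2$ are the uncertainty and detection consistency thresholds. Fig. 2 illustrates some example detections that will be considered as pseudo-labels. Once the pseudo-labels are selected using Eq. (5), we use them to perform self-training as:

$$\mathcal{L}_{st} = \frac{1}{N_{pos}} \sum_{u,v} \mathbb{1}_{\{\tilde{c}_{u,v} > 0\}} \Big[ \mathcal{L}_{cls}\big(p_{u,v}, \tilde{c}_{u,v}\big) + \mathcal{L}_{reg}\big(t_{u,v}, \tilde{t}_{u,v}\big) \Big] \qquad (6)$$

where $\tilde{y} = (\tilde{c}, \tilde{t})$ represents the class label and bounding box coordinates of the (selected) pseudo-label, and $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ are the classification and regression losses of Eq. (1). Compared to Eq. (1), in Eq. (6) we back-propagate the classification loss only for (selected) pseudo-label locations.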
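A minimal sketch of the selection logic, assuming the rule thresholds both the mean confidence of the matched detections and the fraction of MC passes in which the detection reappears. The names `kappa_1` and `kappa_2`, the default values, and the string return values are illustrative assumptions rather than the authors’ implementation.

```python
def route_detection(mean_conf, num_matches, num_passes, kappa_1=0.5, kappa_2=0.5):
    """Decide whether a detection becomes a pseudo-label or a tiling anchor."""
    # Fraction of stochastic passes in which the same object was detected.
    consistency = num_matches / num_passes
    if mean_conf >= kappa_1 and consistency >= kappa_2:
        return "pseudo_label"  # certain: used for self-training
    return "tile"              # uncertain: used for adversarial alignment

stable = route_detection(mean_conf=0.85, num_matches=8, num_passes=10)
flickering = route_detection(mean_conf=0.85, num_matches=3, num_passes=10)
low_confidence = route_detection(mean_conf=0.30, num_matches=9, num_passes=10)
```

A detection has to pass both tests to become a pseudo-label, so a high score that appears in only a few passes is still treated as uncertain.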

Figure 3: Left. ECE score as a function of UGT, UGPL, and our method that achieves synergy between UGT and UGPL, over the adaptation iterations. Right. Selecting more certain object detection pseudo-labels results in significant improvement in ECE score for this selected set over the adaptation course.

Uncertainty-guided tiling for adversarial alignment. Existing image-level and instance-level adversarial feature alignment suffers from interfering background and noisy object localization. We propose uncertainty-guided tiling for adversarial alignment; it mines relatively uncertain detected regions, as possible object-like regions, for the tiling process. Tiling anchored by uncertain object regions allows adversarial alignment to focus on potential, albeit uncertain, object-like regions with context (see Fig. 2). Specifically, if $g = 0$ for a detection $d$ in Eq. (5), we consider it an uncertain detection and extract a tile around it. In particular, given $b$ as the bounding box for detection $d$, we crop a tile (region) $T_t$ whose scale is $s$ times that of the detected bounding box. For the source image, we randomly extract a tile $T_s$ around a ground-truth bounding box. After resizing both $T_s$ and $T_t$ to the input image size, we perform the adversarial alignment as:

$$\mathcal{L}_{adv}^{tile} = -\sum_{u,v} \Big[ z \log D(F_{T_s})^{(u,v)} + (1 - z) \log\big(1 - D(F_{T_t})^{(u,v)}\big) \Big] \qquad (7)$$

where $F_{T_s}$ and $F_{T_t}$ are the feature maps for $T_s$ and $T_t$, respectively, $D$ is the domain discriminator, and $z \in \{0, 1\}$ is the domain label.
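One way to extract such a tile is sketched below: scale the larger side of the detected box, build a square window around the box centre, and shift it so it stays inside the image. How the authors square and clip the tile is not spelled out in this section, so the use of the larger side, the shifting strategy, and the function name are assumptions.

```python
def crop_tile(box, image_w, image_h, scale=5.0):
    """Square tile around box, `scale` times its larger side, kept inside the image.

    box is (x0, y0, x1, y1); the returned tile uses the same layout.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    side = scale * max(x1 - x0, y1 - y0)
    side = min(side, image_w, image_h)  # the tile cannot exceed the image
    half = side / 2.0
    # Shift the centre so the square window does not cross the image border.
    cx = min(max(cx, half), image_w - half)
    cy = min(max(cy, half), image_h - half)
    return (cx - half, cy - half, cx + half, cy + half)

interior_tile = crop_tile((100, 100, 140, 120), image_w=2048, image_h=1024)
corner_tile = crop_tile((0, 0, 40, 20), image_w=2048, image_h=1024)
```

Keeping the tile square means the later resize to the network input size does not distort the aspect ratio of objects inside it.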

Discussion. We analyze the impact on the model’s calibration through the adaptation phase after (1) selecting pseudo-labels with more certain detections (UGPL), (2) performing tiling on relatively uncertain detections (UGT), and (3) achieving the synergy between UGPL and UGT (our method). The model’s calibration can be measured with the Expected Calibration Error (ECE) score. We compute the ECE score by considering both the confidence and the regression branch of the detector kueppers_2020_CVPR_Workshops (a description of how the ECE score is computed for the detector is included in the supplementary material). Fig. 3 reveals that UGPL decreases the ECE score, and UGT reduces it even further. Finally, the synergy between UGPL and UGT achieves the lowest ECE score, significantly alleviating the impact of the model’s poor calibration under domain shift.
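For intuition, the standard classification form of ECE bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below implements only that standard form; the detection variant used in the paper, which also accounts for the regression branch, is described in the supplementary material and is not reproduced here. The function name and the bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Standard ECE: size-weighted average of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1.0 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece

overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value means the reported confidences track the observed accuracy more closely, which is the property the uncertainty-guided selection aims to restore on the target domain.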

Training objective. We combine Eq. (1), Eq. (6), and Eq. (7) into a joint loss and optimize it to adapt the source model to the target domain.

4 Experiments

Datasets. The Cityscapes Cordts2016Cityscapes dataset features images of road and street scenes and offers 2975 and 500 examples for training and validation, respectively. It comprises the following categories: person, rider, car, truck, bus, train, motorbike, and bicycle.

Foggy Cityscapes sakaridis2018semantic dataset is constructed using Cityscapes dataset by simulating foggy weather utilizing depth maps provided in Cityscapes with three levels of foggy weather.

Sim10k johnson2017driving dataset is a collection of synthesized images, comprising 10K images and their corresponding bounding box annotations.

KITTI geiger2012we dataset bears resemblance to Cityscapes as it features images of road scenes with wide view of area, except that KITTI images were captured with a different camera setup. Following existing works, we consider car class for experiments when adapting from KITTI or Sim10k.

Implementation Details. FCOS tian2019fcos, a fully convolutional one-stage object detector, is trained on the source domain. During the adaptation process, using the source-trained model, we iterate over two steps: UGPL and UGT (Sec. 3.2.1). Following zou2019confidence; zou2018unsupervised, we define going over these two steps once as a Domain Adaptation Round, or just Round. For uniformity, we use three rounds in all experiments. Since pseudo-labelling accuracy is likely poor initially, following NEURIPS2019_categoryanch, we first perform only adversarial domain adaptation (using UGT), in a round called R0. In the next two rounds, R1 and R2, we apply both self-training and adversarial domain adaptation, using UGPL and UGT, respectively. To extract a tile around an uncertain detection, a five times larger region is cropped around the center location. Height and width are re-adjusted to make the extracted tile square, so that the aspect ratio of any object in the tile remains unaffected by resizing at any later stage.
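The round structure can be summarised as a small schedule: the first round uses only uncertainty-guided tiling, and later rounds add uncertainty-guided pseudo-labelling. The dictionary layout and the function name below are illustrative assumptions, not the authors’ configuration format.

```python
def adaptation_schedule(num_rounds=3):
    """List which components are active in each domain adaptation round."""
    schedule = []
    for r in range(num_rounds):
        schedule.append({
            "round": f"R{r}",
            "ugt": True,       # adversarial alignment on tiles in every round
            "ugpl": r > 0,     # self-training only after the warm-up round R0
        })
    return schedule

rounds = adaptation_schedule()
```

Delaying self-training by one round gives the adversarially aligned model a chance to produce more reliable detections before any of them are trusted as labels.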

We use a mini-batch size of 3. Learning rate is set to during the training of source model and R0 round training, and then reduced to during the R1 and R2. and consists of iterations, however is consists of . The IoU threshold is set to 0.5. We use MC dropout inferences, with the dropout rate set to 10%. All experiments are performed using a single GPU (Quadro RTX 6000). The uncertainty and detection consistency thresholds are both set to 0.5, indicating that the same class prediction and location should occur at least 50% of the time. All training and testing images are resized such that their shorter side has 800 pixels.

Method                        person  rider   car     truck   bus     train   mbike   bicycle mAP@0.5 SO / Gain
Two Stage Object Detector
DAF chen2018domain            25.0    31.0    40.5    22.1    35.3    20.2    20.0    27.1    27.6    18.8 / 8.8
SW-DA saito2019strong         29.9    42.3    43.5    24.5    36.2    32.6    30.0    35.3    34.3    20.3 / 14.0
DAM he2019multi               30.8    40.5    44.3    27.2    38.4    34.5    28.4    32.2    34.6    18.8 / 16.7
CR-DA xu2020exploring         32.9    43.8    49.2    27.2    45.1    36.4    30.3    34.6
CF-DA zheng2020cross          43.2    37.4    52.1    34.7    34.0    46.9    29.9    30.8    38.6    20.8 / 17.8
HTCN chen2020harmonizing      33.2    47.5    47.9    31.6    47.4    40.9    32.3    37.1    39.8    20.3 / 19.5
UADA nguyen2020uada           34.2    48.9    52.4    30.3    42.7    46.0    33.2    36.2    40.5    20.3 / 20.2
SAPNet li2020spatial          40.8    46.7    59.8    24.3    46.8    37.5    30.4    40.7    40.9    20.3 / 20.6
One Stage Object Detector
Source Only                   31.7    31.7    34.6    5.9     20.3    2.5     10.6    25.8    20.4    -
Baseline hsu2020every         38.7    36.1    53.1    21.9    35.4    25.7    20.6    33.9    33.2    18.4 / 14.8
EPM hsu2020every              41.9    38.7    56.7    22.6    41.5    26.8    24.6    35.5    36.0    18.4 / 17.6
Ours                          45.1    47.4    59.4    24.5    50.0    25.7    26.0    38.7    39.6    20.4 / 19.2
Oracle                        47.4    40.8    66.8    27.2    48.2    32.4    31.2    38.3    41.5    -

Table 1: Cityscapes → Foggy Cityscapes. Our method achieves an absolute gain of 19.2% over the source only model and outperforms the most recent one-stage domain adaptive detector (EPM). SO refers to source only. The best results are bold-faced.

4.1 Comparison with state-of-the-art

For all the domain adaptation experiments, we compare against existing state-of-the-art one-stage and two-stage object detectors using the same feature backbone. Results are compared in terms of mAP (%), class-wise APs (%), and the gain (%) achieved over a source only model. To better understand the effect of our algorithm, we also report results on Baseline, which is FCOS tian2019fcos with global-level feature alignment. We discuss each experiment below.

Weather Adaptation (Cityscapes → Foggy Cityscapes). Under the same backbone and detection pipeline, our method outperforms the most recent one-stage domain adaptive detector (EPM) by absolute margins of 3.6% and 1.6% in terms of mAP and gain, respectively. We report (Tab. 1) competitive performance against methods built on much stronger two-stage anchor-based detection pipelines. In Fig. 5, compared to EPM hsu2020every, our method shows the capability of detecting objects of various sizes under severe weather changes.

Synthetic-to-real (Sim10K → Cityscapes). Our method delivers a significant gain of 13.8% over the source only model (Tab. 2). It exceeds existing state-of-the-art methods, including ones built on stronger detection pipelines and feature backbones, by a notable margin: 2.8% mAP over the top-performing one-stage adaptive detector (EPM) and 6.9% over the two-stage adaptation algorithm SAPNet li2020spatial.

Cross-camera Adaptation (KITTI → Cityscapes). When adapting from this wide-view camera setup to the normal scenario, we achieve 45.6% mAP, compared to 43.2% and 42.5% reported by the best existing one-stage and two-stage detection pipelines, respectively (Tab. 2).

                              Sim10K → Cityscapes         KITTI → Cityscapes
Method                        AP@0.5  SO / Gain           AP@0.5  SO / Gain
Two Stage Object Detector
DAF chen2018domain            39.0    30.1 / 8.9          38.5    30.2 / 8.3
SC-DA zhu2019adapting         43.0    34.0 / 9.0          42.5    37.4 / 5.1
MAF he2019multi               41.1    30.1 / 11.0         41.0    30.2 / 10.8
CF-DA zheng2020cross          43.8    35.0 / 8.8          -       -
HTCN chen2020harmonizing      42.5    34.6 / 7.9          -       -
SAPNet li2020spatial          44.9    34.6 / 10.3         -       -
UADA nguyen2020uada           42.0    34.6 / 7.4          -       -
One Stage Object Detector
Source Only                   38.0    -                   34.9    -
Baseline hsu2020every         46.0    39.8 / 6.2          39.1    34.4 / 4.7
EPM hsu2020every              49.0    39.8 / 9.2          43.2    34.4 / 8.8
Ours                          51.8    38.0 / 13.8         45.6    34.9 / 10.7
Oracle                        69.7    -                   69.7    -

Table 2: Sim10K → Cityscapes: We outperform one-stage and two-stage object detectors both in terms of mAP (%) and gain obtained over source. For this case, the baseline value was recomputed. KITTI → Cityscapes: Our method outperforms both EPM and existing state-of-the-art methods by a considerable margin in terms of mAP. SO refers to source only. The best results are bold-faced.
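The gain column in Tables 1 and 2 is simply the adapted score minus the source only score of the same detector. A short check of the reported numbers for our method, using the values from the tables (the function and variable names are illustrative):

```python
def gain_over_source(adapted_ap, source_only_ap):
    """Absolute improvement of the adapted model over its source only model."""
    return round(adapted_ap - source_only_ap, 1)

# (adapted AP@0.5, source only AP@0.5) for our method in each scenario.
scenarios = {
    "Cityscapes -> Foggy Cityscapes": (39.6, 20.4),
    "Sim10K -> Cityscapes": (51.8, 38.0),
    "KITTI -> Cityscapes": (45.6, 34.9),
}
gains = {name: gain_over_source(ap, so) for name, (ap, so) in scenarios.items()}
```

Because source only baselines differ between papers, the gain is the fairer quantity to compare across rows that start from different source only scores.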

4.2 Ablation Studies

Contribution of Components: To analyze the effectiveness of each individual component in our proposed method, we perform Sim10K → Cityscapes adaptation in different settings. Results are detailed in Tab. 3. We compare the impact on performance of training our model (1) with confidence-based pseudo-labels only, obtained without our proposed uncertainty-based selection; (2) with only uncertainty-guided pseudo-labelling (UGPL), without the uncertainty-guided tiling procedure; and (3) with only uncertainty-guided tiling (UGT). UGPL and UGT show increases of 11.5% and 12.0% in AP@0.5 over the source only model, and of 3.5% and 4.0% over our Baseline, respectively. The non-trivial combination of UGPL and UGT, resulting in a synergy between them, produces a further increase of at least 1.8% in AP@0.5 over either component alone. In the case of AP@0.75 especially, our combined method reports a 4.9-point improvement over the Baseline and more than 3 points over UGPL and UGT individually, indicating that our method produces more accurate bounding boxes in the target domain.

Impact of object sizes: In Table 3, we also include the impact on performance of different components w.r.t. object sizes. We use the MS-COCO evaluation metric lin2014microsoft to understand the method’s behavior with respect to different object sizes, categorized as small (S): area below $32 \times 32$ pixels, medium (M): between $32 \times 32$ and $96 \times 96$ pixels, and large (L): above $96 \times 96$ pixels.
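The size buckets follow the usual MS-COCO area thresholds of 32² and 96² pixels. A minimal sketch of the bucketing, assuming box area is used as the object area (COCO itself measures segmentation area where available) and using an illustrative function name:

```python
def coco_size_bucket(box):
    """Assign a (x0, y0, x1, y1) box to the small, medium, or large bucket."""
    x0, y0, x1, y1 = box
    area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if area < 32 ** 2:
        return "S"
    if area < 96 ** 2:
        return "M"
    return "L"

buckets = [coco_size_bucket(b) for b in [(0, 0, 20, 20), (0, 0, 50, 50), (0, 0, 100, 100)]]
```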

Methods             AP (mean)  AP@0.5  AP@0.75  AP@S  AP@M  AP@L
Source Only         18.1       38.0    15.4     4.6   21.9  37.4
Baseline            25.9       46.0    25.5     5.7   28.8  52.2
Confident PL        21.8       43.2    19.8     4.7   27.5  42.9
Ours (UGPL)         27.6       49.5    26.9     6.7   31.2  55.0
Ours (UGT)          27.5       50.0    26.7     6.8   31.7  54.5
Ours (UGPL + UGT)   28.9       51.8    30.4     6.4   32.7  58.7

Table 3: Ablation results on Sim10K → Cityscapes. Combining UGPL and UGT in a principled way yields a larger improvement than using either individually. Here, Baseline was recomputed by us.

Combinations            AP@0.5
Full Image + UGPL       48.1
UGPL                    49.5
RandomTiles + UGPL      49.8
UGT                     50.0
Certain Tiles + UGPL    50.2

Table 4: Comparison of proposed UGT vs other tiling strategies, including full image, random and certain tiles. We observe that compared to other tile selection strategies with UGPL, our proposed UGT provides maximum gain with UGPL.

Datasets         Source Only  Source + R0  Source + R0 + R1 + R2
CS to Foggy CS   20.4         27.4         39.6
Sim10K to CS     38.0         46.3         51.8
KITTI to CS      34.9         38.5         45.6

Table 5: Impact of R0 round. Performing both R1 and R2 rounds (UGPL + UGT) results in significant improvement over performing only the R0 round (UGT).

Figure 4: Left. Comparison of uncertainty-guided vs confidence-guided selection of PL and tiles. Right. Low mean accuracy of confidence-based selected PL indicates that certainty-based PL selection is less noisy.

Figure 5: Detections missed by EPM and found by our method are shown in blue. Compared to EPM hsu2020every, our method achieves better adaptation.

Uncertainty vs Confidence. We contrast between the proposed uncertainty-guided balancing of pseudo-label (PL) selection and the tiling procedure and the confidence-guided balancing of these two procedures (Fig. 4(left)). Our approach resonates well with the fact that only when the model starts to become more certain of its detections, after round 1, the quantity of selected pseudo-labels should start to increase and so the number of regions being allocated to tiling should begin to decrease. This is not the case for the confidence based balancing. Through our adaptive allocation of detection regions, in Fig. 4(right) we demonstrate that our approach also delivers improved pseudo-labelling accuracy in both rounds compared to confidence-based selection.

UGT vs Other Tile Selection Strategies. We analyze the impact of extracting tiles centered around the uncertain detections (UGT) for adversarial learning against other tile selection strategies, each combined with Uncertainty Guided Pseudo Labels (UGPL), in Tab. 4. Specifically, we use the full image, random tiles, and certain tiles in adversarial learning with UGPL instead of the proposed tile selection process (UGT). Note that random tile selection involves several parameters (e.g., location, size, and aspect ratio), so we restrict the tile-selection space using domain knowledge: each selected tile must cover at least 60% of the image area. We observe that, compared to all three alternative strategies with UGPL, the proposed UGT provides the maximum gain with UGPL.
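The random-tile baseline with its area restriction can be sketched as below. This is an illustrative sampler under stated assumptions, not the procedure used in the experiments: the function name `sample_random_tile`, the uniform sampling of width and height fractions, and the example image size are all hypothetical; only the 60% minimum area comes from the text.

```python
import math
import random
from typing import Optional, Tuple


def sample_random_tile(img_w: int, img_h: int, min_area_frac: float = 0.6,
                       rng: Optional[random.Random] = None
                       ) -> Tuple[int, int, int, int]:
    """Sample a tile (x1, y1, x2, y2) covering at least min_area_frac of the image."""
    rng = rng or random.Random()
    # Pick the width fraction first, then a height fraction large enough
    # that width_fraction * height_fraction >= min_area_frac.
    fw = rng.uniform(min_area_frac, 1.0)
    fh = rng.uniform(min_area_frac / fw, 1.0)
    # Rounding up keeps the pixel area at or above the requested fraction.
    w = min(img_w, math.ceil(fw * img_w))
    h = min(img_h, math.ceil(fh * img_h))
    # Random location such that the tile stays inside the image.
    x1 = rng.randint(0, img_w - w)
    y1 = rng.randint(0, img_h - h)
    return x1, y1, x1 + w, y1 + h


tile = sample_random_tile(2048, 1024, rng=random.Random(0))
print(tile)
```

Sampling the width first and bounding the height from below is one simple way to let the aspect ratio vary while still guaranteeing the area constraint.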

Impact of R0. To show how much the R0 round contributes to the final performance, we report AP@0.5 of the base (source-only) model, after R0, and after R0+R1+R2 for all three dataset adaptation scenarios. As indicated in Tab. 5, performing the R1 and R2 rounds (which use both UGPL and UGT) yields a significant improvement over performing only the R0 round (UGT).
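To make the contribution of each stage explicit, the gains implied by Tab. 5 can be computed directly. The snippet below only restates the AP@0.5 values from the table; the dictionary layout and variable names are illustrative.

```python
# AP@0.5 from Tab. 5: (source only, source + R0, source + R0 + R1 + R2)
results = {
    "CS to Foggy CS": (20.4, 27.4, 39.6),
    "Sim10K to CS": (38.0, 46.3, 51.8),
    "KITTI to CS": (34.9, 38.5, 45.6),
}

gains = {}
for name, (source_only, after_r0, after_all) in results.items():
    gains[name] = {
        "R0": round(after_r0 - source_only, 1),      # gain from the UGT-only round
        "R1+R2": round(after_all - after_r0, 1),     # additional gain from UGPL + UGT
        "total": round(after_all - source_only, 1),  # overall gain over source only
    }
    print(name, gains[name])
```

In two of the three scenarios (CS to Foggy CS and KITTI to CS) the R1 and R2 rounds add more than the R0 round alone, while for Sim10K to CS the R0 round accounts for the larger share of the total gain.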

Limitation. Although we report improvements over existing SOTA algorithms based on both one-stage and two-stage object detection pipelines, our method still faces challenges when dealing with small objects, as shown in Tab. 3. We plan to overcome this limitation by studying the relationship between uncertainty, object sizes, and related contexts.

5 Conclusion

We propose to leverage the model’s predictive uncertainty to achieve the best of self-training and adversarial learning for domain-adaptive object detection. Specifically, we propose to measure uncertainty in object detections by considering the variations in both the localization prediction and the confidence prediction across Monte-Carlo dropout inferences. Certain detections are used as pseudo-labels for self-training, while uncertain ones are used to extract tiles (regions in the image) for adversarial feature alignment. This synergy between the two allows us to incorporate instance-level context for effective adversarial alignment and to improve feature discriminability for class-specific alignment. Further, it helps to reduce the effect of poor calibration under domain shift, thereby improving the model’s generalization across domains. Under various domain shift scenarios, our method obtains notable improvements over existing state-of-the-art methods.


Supplementary Material

In this supplementary material, we include the training algorithm (Sec. A), an analysis of the dropout rate and the threshold hyperparameters used in our experiments (Sec. B), the ECE score calculation (Sec. C), model calibration under domain shift (Sec. D), and more qualitative results (Sec. E).

Appendix A Algorithm

Algorithm 1 Training procedure with Uncertainty Guided Pseudo Labels and Uncertainty Guided Tiles
Input: Set of labeled data and set of unlabeled data; uncertainty and detection consistency thresholds
Output: Domain adapted trained model
1: Train the model using Eq. (1)
2: for each round (R0, R1, R2) do    (repeat until completion of rounds)
3:     pseudo-label set ← empty set
4:     tile set ← empty set
5:     if round is R0 then
6:         Select tiles (UGT), using Eq. (5) variant
7:         Train the model using Eq. (1) & (7)
8:     else if round is R1 or R2 then
9:         Select pseudo-labels (UGPL) with the thresholds, using Eq. (5)
10:         Select tiles (UGT), using Eq. (5) variant
11:         Train the model using Eq. (1), (6) & (7)
12:     end if
13: end for
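The round structure of Algorithm 1 can be sketched as a small driver loop. This is a schematic under stated assumptions, not the released training code: the function `adapt` and its three callbacks are hypothetical placeholders, and the loss terms are represented only by the equation labels that the algorithm cites.

```python
from typing import Callable, Dict, List


def adapt(num_rounds: int,
          select_pseudo_labels: Callable[[], list],
          select_tiles: Callable[[], list],
          train: Callable[[list, list, List[str]], None]) -> List[Dict]:
    """Run the adaptation rounds and record what each round used."""
    log = []
    for r in range(num_rounds):
        pseudo_labels, tiles = [], []  # both sets are reset at every round
        if r == 0:
            # R0: tiles only, trained with Eq. (1) and Eq. (7)
            tiles = select_tiles()
            losses = ["Eq. (1)", "Eq. (7)"]
        else:
            # R1, R2: pseudo-labels and tiles, trained with Eq. (1), (6), (7)
            pseudo_labels = select_pseudo_labels()
            tiles = select_tiles()
            losses = ["Eq. (1)", "Eq. (6)", "Eq. (7)"]
        train(pseudo_labels, tiles, losses)
        log.append({"round": r, "n_pl": len(pseudo_labels),
                    "n_tiles": len(tiles), "losses": losses})
    return log


# Dummy callbacks stand in for the detector-specific steps.
log = adapt(
    num_rounds=3,
    select_pseudo_labels=lambda: ["pl_a", "pl_b"],
    select_tiles=lambda: ["tile_a"],
    train=lambda pls, tiles, losses: None,
)
for entry in log:
    print(entry)
```

Passing the selection and training steps as callbacks keeps the schedule independent of the underlying detector, which matches the claim that the method applies to both one-stage and two-stage pipelines.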

Appendix B Analysis

On MC-dropout rate. We show the impact of different dropout (spatial dropout tompson2015efficient) rates on the performance of our method in Tab. 6. Our method mostly retains its performance when the dropout rate is varied from 10% to 30%: the largest decrease in mean AP over this range is 0.8 points (28.9 at 10% vs. 28.1 at 20%). This is expected, as increasing the dropout rate increases prediction uncertainty, which in turn affects pseudo-label selection.

Dropout Rate AP (mean) AP @0.5 AP @0.75 AP @S AP @M AP @L
30% 28.2 49.4 27.5 5.9 31.1 58.1
20% 28.1 50.3 28.0 6.1 32.5 56.0
10% 28.9 51.8 30.4 6.4 32.7 58.7
Table 6: Impact of increasing the dropout rate on the performance of our method. We observe that our method is largely robust to sizeable variations in the dropout rate.
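The link between dropout and prediction uncertainty can be illustrated with a toy computation over repeated stochastic forward passes. This sketch is an assumption-laden illustration and does not reproduce Eq. (5): the function `mc_dropout_stats`, the use of the standard deviation as the spread measure, and the sample values are all hypothetical.

```python
from statistics import mean, pstdev


def mc_dropout_stats(samples):
    """Summarize one detection across T stochastic forward passes.

    samples: list of (score, (x1, y1, x2, y2)), one entry per pass.
    Returns (mean score, score spread, mean per-coordinate box spread).
    """
    scores = [score for score, _ in samples]
    boxes = [box for _, box in samples]
    score_spread = pstdev(scores)
    # zip(*boxes) groups all x1 values, all y1 values, and so on.
    box_spread = mean([pstdev(coord) for coord in zip(*boxes)])
    return mean(scores), score_spread, box_spread


stable = [(0.90, (10, 10, 50, 60)), (0.91, (10, 11, 50, 60)), (0.89, (11, 10, 51, 60))]
jittery = [(0.90, (10, 10, 50, 60)), (0.40, (25, 5, 70, 80)), (0.65, (2, 20, 40, 55))]

print(mc_dropout_stats(stable))
print(mc_dropout_stats(jittery))
```

A detection whose score and box barely move across passes yields small spreads and is a candidate pseudo-label, whereas a detection that changes substantially between passes is treated as uncertain. A higher dropout rate perturbs the network more strongly on each pass, which tends to widen these spreads.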

On threshold hyperparameters. We study the robustness of our method against variation in the two threshold hyperparameters, the uncertainty threshold and the IoU threshold, in Tab. 7 and Tab. 8, respectively. Although we set both thresholds to 0.5, we find that our method is relatively robust to these hyperparameters. For instance, upon varying the uncertainty threshold by 0.1 in either direction, the maximum drop in mean AP is 0.6 points (Tab. 7). For the IoU threshold, a value of 0.5 gives the most stable results; increasing it by 0.1 mainly reduces performance at the tight IoU evaluation threshold (AP@0.75 drops from 30.4 to 28.5, Tab. 8).

Uncertainty threshold AP (mean) AP @0.5 AP @0.75 AP @S AP @M AP @L
0.4 28.6 51.8 28.5 5.9 32.7 54.7
0.5 28.9 51.8 30.4 6.4 32.7 58.7
0.6 28.3 50.2 27.5 6.2 32.6 56.6
Table 7: Robustness of our method against variation in the uncertainty threshold.
IoU threshold AP (mean) AP @0.5 AP @0.75 AP @S AP @M AP @L
0.5 28.9 51.8 30.4 6.4 32.7 58.7
0.6 28.3 49.8 28.5 5.9 32.4 58.2
0.7 27.5 50.4 27.9 5.4 31.5 55.8
Table 8: Robustness of our method against variation in the IoU threshold.

Appendix C ECE Score Computation

Our aim is to discover the relationship between (detection) model calibration and individual detection uncertainties. A standard measure for network calibration is the expected calibration error (ECE) score guo2017calibration; xing2019distance. To compute it, the confidence predictions on a dataset (mostly the testing set) are equally partitioned into bins, and the number of examples falling in each bin k is counted. The calibration gap of a bin is the difference between its average accuracy and its average confidence. Note that we also take the regression branch output into account while computing accuracy kueppers_2020_CVPR_Workshops. Averaging the calibration gaps over all bins gives the ECE score. In our case, the number of bins is fixed for the ECE score computation.
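The computation can be sketched as follows. This is a minimal version of the commonly used ECE from the calibration literature, in which each bin's gap is weighted by the fraction of examples it contains; the exact weighting, the equal-width binning, the default of 10 bins, and the precomputed `correct` flags (which for detection would already account for box matching) are assumptions of this sketch.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins.

    confidences: predicted confidences in [0, 1].
    correct: booleans, True where the prediction counts as correct.
    num_bins: assumed value for illustration only.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        k = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to the last bin
        bins[k].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, False, False, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
print(overconfident, calibrated)
```

A model that assigns 0.9 confidence to predictions that are right only a quarter of the time has a large gap in that bin, which is the kind of miscalibration expected under domain shift.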

Appendix D Model’s Calibration under Domain Shift

Tab. 9 reveals that a model trained on the source domain (Sim10K johnson2017driving) suffers from poor calibration when tested on a target domain (Cityscapes Cordts2016Cityscapes) that manifests distinct scene layouts and different object combinations. In contrast, an oracle trained and tested on the target domain (Cityscapes Cordts2016Cityscapes) shows significantly better calibration. Calibration is measured using the ECE score, where lower is better.

Models ECE Score
Source Only 0.25
Oracle 0.10
Table 9: Impact of domain shift on the (detection) model’s calibration. Calibration is measured using the ECE score.

Appendix E More Qualitative Results

Fig. 6 shows more qualitative results for source-only, EPM hsu2020every, and our method. We see that our method is capable of detecting objects at various scales under (severe) fog which are missed by EPM.

Figure 6: More qualitative results. Detections missed by EPM and found by our method are shown in blue. Compared to EPM hsu2020every, our method is capable of detecting objects of various sizes under severe weather conditions (fog). Zoom in for best viewing.