SimROD: A Simple Adaptation Method for Robust Object Detection

07/28/2021
by   Rindra Ramamonjison, et al.

This paper presents a Simple and effective unsupervised adaptation method for Robust Object Detection (SimROD). To overcome the challenging issues of domain shift and pseudo-label noise, our method integrates a novel domain-centric augmentation method, a gradual self-labeling adaptation procedure, and a teacher-guided fine-tuning mechanism. Using our method, target domain samples can be leveraged to adapt object detection models without changing the model architecture or generating synthetic data. When applied to image corruption and high-level cross-domain adaptation benchmarks, our method outperforms prior baselines. SimROD achieves new state-of-the-art results on standard synthetic-to-real and cross-camera setup benchmarks. On the image corruption benchmark, models adapted with our method achieved a relative robustness improvement of 15-25 AP on COCO-C and Cityscapes-C. On the cross-domain benchmark, our method outperformed the best baseline performance by up to 8 and up to 4 points.


1 Introduction

State-of-the-art object detection models have proved to be highly accurate when trained on images that have the same distribution as the test set [40]. However, they can fail when deployed to new environments due to domain shifts such as weather changes (e.g. rain or fog), lighting condition variations, or image corruptions (e.g. motion blur) [25]. Such failures are detrimental for mission-critical applications such as self-driving, security, or automated retail checkout, where domain shifts are common and inevitable. To succeed in applications where reliability is key, detection models must be made robust to domain shifts.

Many methods have been proposed to overcome domain shifts for object detection. They can be categorized as data augmentation [25, 14, 12], domain-alignment [6, 11, 39, 38, 27, 16, 23, 17], domain-mapping [3, 18, 23, 17], and self-labeling techniques [34, 31, 22, 18]. Data augmentation methods can improve the performance on some fixed set of domain shifts but fail to generalize to the ones that are not similar to the augmented samples [1, 26, 33]. Domain-aligning methods use samples from the target domain to align intermediate features of networks. These methods require non-trivial architecture changes such as gradient reversal layers, domain classifiers, or other specialized modules. On the other hand, domain-mapping methods translate labeled source images into new images that look like the unlabeled target domain images using image-to-image translation networks. Similar to the augmentation methods, they are suboptimal since the generated images do not necessarily have a high perceptual similarity to real target domain images. Finally, self-labeling is a promising approach since it leverages unlabeled training samples from the target domain. However, generating accurate pseudo-labels under domain shift is hard; and when pseudo-labels are noisy, using target domain samples for adaptation is ineffective.

In this paper, we propose a Simple adaptation method for Robust Object Detection (SimROD) to mitigate domain shift using domain-mixed data augmentation and teacher-guided gradual adaptation. Our simple approach has three design benefits. First, it does not require ground-truth labels for the target domain data and instead leverages unlabeled samples. Second, our approach requires neither complicated architecture changes nor generative models for creating synthetic data [18]. Third, our simple method is architecture-agnostic and is not limited to region-based detectors. The main contributions of this paper are summarized as follows:


  1. We propose a simple method to improve the robustness of object detection models against domain shifts. Our method first adapts a large teacher model using a gradual adaptation approach. The adapted teacher generates accurate pseudo-labels for adapting the student model.

  2. We introduce an augmentation procedure called DomainMix to help learn domain-invariant representations and reduce the pseudo-label noise that is exacerbated by the domain shift. DomainMix efficiently mixes the labeled images from the source domain with the unlabeled samples from the target domain along with their (pseudo-)labels and gives strong supervision for self-adaptation. The mixed training samples are used for adapting both the teacher and student models.

  3. We conduct a comprehensive and fair benchmark to demonstrate the effectiveness of SimROD in mitigating different kinds of shifts, including synthetic-to-real, cross-camera setup, real-to-artistic, and image corruptions. We show that our simple method can achieve new state-of-the-art results on some of these benchmarks. We also conduct ablation studies to provide insights on the efficiency and effectiveness of our method.

2 Motivation and related works

In this section, we review the mainstream approaches relevant to our work and explain the motivation of our work.

Data augmentations for robustness to image corruption

Data augmentation is an effective technique for improving the performance of deep learning models. Recent works have also explored the role of augmentation in enhancing robustness to domain shifts. In particular, specialized augmentation methods have been proposed to combat the effect of image corruptions for image classification [13, 14, 12] and object detection [25, 8]. For example, AugMix [14] samples a set of geometric and color transformations, applies them sequentially to each image, and mixes the original image with multiple augmented versions. Subsequently, DeepAugment [12] generates augmented samples using image-to-image translation networks whose weights are perturbed with random distortions. [25, 8] proposed style transfer [10] as data augmentation for increasing the shape bias and improving robustness to image corruptions.

While these augmentation methods offer some improvement over the source baseline, they can overfit to a few corruption types and fail to generalize to others. In fact, [1] provided empirical evidence that the perceptual similarity between the augmentation transformation and the corruption is a strong predictor of corruption error. [1] also observed that broader augmentation schemes perform better on dissimilar corruptions than more specialized ones. [33] showed that augmentation techniques tailored to synthetic corruptions have difficulty generalizing to natural distribution shifts. In their extensive study, training on more diverse data was the only intervention that effectively improved robustness to natural distribution shifts.

Unsupervised domain adaptation for object detection

Unsupervised domain adaptation (UDA) methods leverage unlabeled images from the target domain to explicitly mitigate the domain shift. In contrast to images obtained with augmentation, these unlabeled samples are more similar to the test samples as they are from the same domain. Moreover, leveraging unlabeled samples is practical since they are cheap to collect and do not require laborious annotation.

Several approaches have been proposed to solve the UDA problem for object detection. Adversarial training methods such as [6] learn domain-invariant representations of two-stage detector networks. Recent methods improved the performance by mining important regions and aligning at the region level [11], by using a hierarchical alignment module [39], by coarse-to-fine feature adaptation [38], or by enforcing strong local alignment and weak global alignment [27]. [16] proposed a center-aware alignment method for the anchor-free FCOS model. While alignment methods help reduce the domain shift, they require architecture changes since extra modules such as gradient reversal layers and domain classifiers must be added to the network.

Alternatively, domain-mapping methods tackle UDA by first translating source images to images that resemble the target domain samples using a conditional generative adversarial network (GAN) [3, 15]. The model is then fine-tuned with the domain-mapped images and the known source labels. For object detection, [23, 17] combined domain transfer with adversarial training. For instance, [23] generates a diverse set of intermediate domains between the source and target to discriminate and learn domain-invariant features.

Batch normalization (BN) [19] layers are prevalent in most neural networks because they can accelerate learning, prevent overfitting, and enable deeper networks to converge [28]. Recent works have shown that adapting BN layers can improve robustness to adversarial attacks [36] or image corruptions [29] and reduce domain shifts [24, 5].
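As a concrete illustration of BN-statistics adaptation, which our method builds on in Sec. 3.2.3, the following PyTorch sketch re-estimates the running statistics of all BatchNorm layers by forwarding unlabeled target batches, with no gradient update of any weight; the function name and the data loader are placeholders of ours, not part of any released code.

import torch

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, device="cuda"):
    # Re-estimate BatchNorm running mean/var on unlabeled target images.
    # Only the BN buffers change; no weight receives a gradient update.
    model.to(device).train()  # train mode: BN uses batch stats and updates its buffers
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()  # discard the source-domain statistics
            m.momentum = None        # use a cumulative moving average instead
    for images in target_loader:     # unlabeled target-domain batches
        model(images.to(device))     # forward pass only; labels are not needed
    model.eval()
    return model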

Self-training for object detection adaptation

Self-training enables a model to generate its own pseudo-labels on the unlabeled target samples. Recently, [31] proposed the STAC framework for semi-supervised object detection with pseudo-labels. However, pseudo-labeling can degrade performance in the presence of domain shift, as the pseudo-labels on target samples may become incorrect, leading to poor supervision. Instead, our work tackles the domain shift between the original source training data and the unlabeled target training data. To reduce domain shift, [4] enforced region-level and graph-structure consistencies between a mean teacher model and the student model using additional regularization loss functions. Next, [22] proposed a method to directly mitigate the noisy pseudo-labels of Faster-RCNN detectors by modeling their proposal distribution. Unlike [22], our method is agnostic to the model architecture and also works with single-stage object detectors. Finally, [18] combined domain transfer with pseudo-labeling and is also architecture-agnostic.

In contrast to these prior works, our proposed adaptation method is simpler because it does not generate synthetic data using GANs, does not add new loss functions and does not change the model architecture. As it will be shown in Section 4, our simple method is also more effective in reducing domain shifts and label noise.

3 Problem definition and proposed solution

In this section, we define the adaptation problem and describe our proposed solution.

Figure 1: Our proposed adaptation method for robust object detection mitigates the domain shift and label noise using three simple steps. (1) The proposed DomainMix augmentation module randomly samples and mixes images from both the source and target domains along with their ground-truth and pseudo-labels. (2) These domain-mixed images are used to gradually adapt the batch norm and convolutional layers of a large source teacher model. During this step, the pseudo-labels of the target domain images are also refined. (3) New domain-mixed images with the refined pseudo-labels are used to fine-tune the source student model.

3.1 Problem statement

We are given a source model for an object detection task with parameters θ_s, which is trained with a source training dataset D_s = {(x_i, y_i)}, where x_i is an image and each label y_i consists of object categories and bounding box coordinates. We consider scenarios in which there exists a covariate shift between the input distribution p_s(x) of the original source data and the target test distribution p_t(x). More formally, we assume that p_s(y|x) = p_t(y|x) but p_s(x) ≠ p_t(x) [32].

In the unsupervised domain adaptation setting, we are also given a set D_t = {x_j} of unlabeled images from the target domain, which we can use during training. Therefore, our objective is to update the model parameters from θ_s to θ_t so as to achieve good performance on both the source test set and a given target test set, i.e., to improve robustness to the domain shift. To effectively exploit the additional information in D_t, we need to tackle two inter-related issues. First, the target training set D_t does not come with ground-truth labels. Second, generating pseudo-labels for D_t with the source model leads to noisy supervision due to the domain shift and hinders the adaptation. In the following subsections, we present a simple approach for tackling these technical issues.

3.2 Simple adaptation for Robust Object Detection

We present our simple adaptation method SimROD for enabling robust object detection models. SimROD integrates teacher-guided fine-tuning, a new DomainMix augmentation method, and a gradual adaptation technique. Sec. 3.2.1 describes the overall method. Next, Sec. 3.2.2 presents the DomainMix augmentation, which is used for adapting both the teacher and student. Finally, Sec. 3.2.3 explains the gradual adaptation that overcomes the two interrelated issues of domain shift and pseudo-label noise.

3.2.1 Overall approach

Our simple approach is motivated by the fact that label noise is exacerbated by the domain shift. Therefore, our approach aims to generate accurate pseudo-labels on target domain images and use them together with mixed images from source and target domain so as to provide strong supervision for adapting the models.

Because the student target model may not have the capacity to generate accurate pseudo-labels and adapt itself, we propose to adapt an auxiliary teacher model first, which can later generate high-quality pseudo-labels for fine-tuning the student model. A flow diagram of SimROD is provided in Figure 1. Its steps are summarized as follows:

Step 1:

We train a large source teacher model, with a bigger capacity than the student model to be adapted, using the source data. The source teacher is used to generate initial pseudo-labels on the target data.

Step 2:

We adapt the large teacher model using the gradual adaptation of Algorithm 2 (see Sec. 3.2.3). During this step, we use mixed images generated by the DomainMix augmentation (see Sec. 3.2.2).

Step 3:

We refine the pseudo-labels on the target data using the adapted teacher model. Then, we fine-tune the student model using these pseudo-labels in lines 2 and 8 of Algorithm 2.

One benefit of this approach is that it can adapt both small and large object detection models to domain shifts since it produces high quality pseudo-labels even when the student network is small. Another advantage of our method is that the teacher and student do not need to share the same architecture. Thus, it is possible to use a slow but accurate teacher for the purpose of adaptation while choosing a fast architecture for deployment.
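To make the data flow of these three steps explicit, here is a minimal Python sketch in which train_source, adapt (Algorithm 2) and pseudo_label are hypothetical helpers passed in by the caller; the names are ours and only illustrate the order of operations, not the released code.

from typing import Any, Callable, Iterable

def simrod_pipeline(
    student: Any,
    teacher: Any,
    source_data: Iterable,    # labeled source dataset
    target_images: Iterable,  # unlabeled target-domain images
    train_source: Callable,   # trains a detector on labeled source data
    adapt: Callable,          # gradual self-labeling adaptation (Algorithm 2)
    pseudo_label: Callable,   # inference plus confidence filtering
) -> Any:
    # Step 1: train the large source teacher and generate initial pseudo-labels.
    teacher = train_source(teacher, source_data)
    initial_pl = pseudo_label(teacher, target_images)

    # Step 2: gradually adapt the teacher on DomainMix-ed batches (Algorithm 2),
    # then refine the pseudo-labels with the adapted teacher.
    teacher = adapt(teacher, source_data, target_images, initial_pl)
    refined_pl = pseudo_label(teacher, target_images)

    # Step 3: fine-tune the (possibly much smaller) student with the refined pseudo-labels.
    student = adapt(student, source_data, target_images, refined_pl)
    return student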

3.2.2 DomainMix augmentation

Here, we present a new augmentation method named DomainMix. As illustrated in Figure 1, it uniformly samples images from both the source and target domains and strongly mixes these images into a new image along with their (pseudo-)labels. Figure 2 shows an example of DomainMix images from natural and artistic domains.

DomainMix uses simple ideas with many benefits to mitigate domain shift and label noise:


  • It produces a diverse set of images by randomly sampling and mixing crops from the source and target sets with replacement. As a result, it uses a different sample of images at every epoch, thus increasing the effective number of training samples and preventing overfitting. In contrast, simple batching reuses the same images at every epoch.

  • It is data-efficient as it uses weighted balanced sampling from both domains. This helps learn representations that are robust to data shifts even if the target dataset has limited samples or the source and target datasets are highly imbalanced. In [2], we provide ablation studies that demonstrate the data efficiency of DomainMix.

  • It mixes ground-truth and pseudo-labels in the same image. This mitigates the effect of false labels during adaptation because the image always contains accurate labels from the source domain.

  • It forces the model to detect small objects, as the objects in the original samples are scaled down.

Require: a batch size n_b, images and labels from the source data D_s, unlabeled target data D_t, pseudo-labels Ŷ_t
Ensure: a batch B_mix of domain-mixed samples
1: procedure DomainMix(D_s, D_t, Ŷ_t, n_b)
2:     B_mix ← ∅
3:     for i = 1, ..., n_b do
4:         C ← ∅
5:         for j = 1, ..., 4 do
6:             if the weighted balanced sampler draws the source domain then
7:                 sample an image with its ground-truth labels from D_s and add a random crop to C
8:             else
9:                 sample an image from D_t with its pseudo-labels from Ŷ_t and add a random crop to C
10:        Collate crops from the 4 images in C into a mixed image x_mix
11:        Recompute all box coordinates in C into the mixed labels y_mix
12:        B_mix ← B_mix ∪ {(x_mix, y_mix)}
13:    return B_mix
Algorithm 1 DomainMix augmentation

The steps of the DomainMix augmentation are listed in Algorithm 1. For each image in a batch, we randomly sample three additional images from the source and target data and mix random crops of these images to create a new domain-mixed image in a collage. In addition, we collate the pseudo-labels of the unlabeled target examples with the ground-truth labels of the source images. The bounding box coordinates of the objects are recomputed based on the relative position of each crop in the new mixed image. Furthermore, we employ a weighted balanced sampler to sample uniformly from the two domains.
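For concreteness, below is a simplified Python sketch of one DomainMix sample under two assumptions of ours: each sampled image is resized into a quadrant instead of being randomly cropped, and the weighted balanced sampler is replaced by a fair coin; boxes are assumed to be [class, x1, y1, x2, y2] in pixels.

import random
import cv2
import numpy as np

def domain_mix_sample(source, target, pseudo_labels, out_size=640):
    # source: list of (image, boxes) with ground-truth boxes
    # target: list of images; pseudo_labels[i]: pseudo-boxes of target image i
    s = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    mixed_boxes = []
    for k in range(4):  # one image per quadrant of the collage
        if random.random() < 0.5:              # sample the domain
            img, boxes = random.choice(source)
        else:
            i = random.randrange(len(target))
            img, boxes = target[i], pseudo_labels[i]
        h, w = img.shape[:2]
        sx, sy = s / w, s / h                  # scale factors into the quadrant
        ox, oy = (k % 2) * s, (k // 2) * s     # quadrant offset
        canvas[oy:oy + s, ox:ox + s] = cv2.resize(img, (s, s))
        for c, x1, y1, x2, y2 in boxes:        # remap (pseudo-)labels into the collage
            mixed_boxes.append([c, x1 * sx + ox, y1 * sy + oy,
                                x2 * sx + ox, y2 * sy + oy])
    return canvas, np.array(mixed_boxes)

In the actual method, random crops and a weighted balanced sampler are used so that both domains stay represented regardless of the dataset sizes.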

Figure 2: An example image generated by DomainMix, mixing real images from Pascal VOC and artistic images from Watercolor2K.

3.2.3 Gradual self-labeling adaptation

Require: source model θ_s, labeled source data D_s, unlabeled target data D_t, warmup epochs w, total epochs T, steps per epoch n_steps, and batch size n_b
Ensure: adapted model θ_t
1: procedure Adapt(θ_s, D_s, D_t, w, T, n_steps, n_b)
2:     Generate pseudo-labels Ŷ_t on D_t with the source model θ_s
3:     Initialize θ ← θ_s
4:     for each layer of θ do
5:         if layer is not BatchNorm then freeze layer
6:     for epoch = 1, ..., T do
7:         if epoch == w then switch to Phase 2:
8:             Refine the pseudo-labels Ŷ_t on D_t with the partially adapted θ
9:             Unfreeze all layers
10:        for step = 1, ..., n_steps do
11:            Sample a batch from D_s and D_t
12:            B_mix ← DomainMix(D_s, D_t, Ŷ_t, n_b) as in Algo 1
13:            Update θ to minimize the detection loss on B_mix
14:    return θ_t ← θ
Algorithm 2 Gradual self-labeling adaptation

Next, we present a gradual adaptation procedure for optimizing the parameters of the detection model. This algorithm mitigates the effects of label noise, which is exacerbated by the domain shift. In fact, the pseudo-labels generated by the source model can be noisy on target domain images (e.g. it may miss objects or localize them inaccurately). If these initial pseudo-labels are used to adapt all the layers of the model at the same time, the result is poor supervision that hinders the model adaptation.

Instead, we propose a phased approach. First, we freeze all convolutional layers and adapt only the BN layers during the first w epochs, so that only the BN layers' trainable coefficients are updated in this phase. The partially adapted model is then used to generate more accurate pseudo-labels, which is done offline for simplicity. In the second phase, all layers are unfrozen and fine-tuned using the refined pseudo-labels. Note that during both phases, we use the mixed image samples generated by the DomainMix augmentation. The detailed steps of this gradual adaptation are listed in Algorithm 2.

In contrast to prior works [24, 29], which used BN adaptation on its own, we integrate it within a self-training framework to effectively overcome the inevitable label noise caused by the domain shift [18]. As will be shown in Section 4, when used with the DomainMix augmentation, the resulting method is effective in adapting object detection models to different kinds of domain shifts.

Note that [18] also used a two-phase progressive adaptation method but they used synthetic domain-mapped images, which are generated by a conditional GAN, to fine-tune the model in the first phase. In contrast, our method leverages actual target domain images, which are mixed with source domain images using DomainMix augmentation, during the entire adaptation process.
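The layer freezing that drives the two phases can be expressed compactly in PyTorch; the helper below is an illustrative sketch of ours, not the released training code.

import torch.nn as nn

def set_phase(model: nn.Module, phase: int) -> None:
    # Phase 1: only BatchNorm parameters (and running statistics) are trainable.
    # Phase 2: every layer is unfrozen for fine-tuning with refined pseudo-labels.
    for module in model.modules():
        is_bn = isinstance(module, nn.modules.batchnorm._BatchNorm)
        for p in module.parameters(recurse=False):
            p.requires_grad = (phase == 2) or is_bn

# Hypothetical usage inside the loop of Algorithm 2:
#   set_phase(model, phase=1)   # first w epochs: adapt only the BN layers
#   ...refine the pseudo-labels with the partially adapted model...
#   set_phase(model, phase=2)   # remaining epochs: fine-tune all layers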

4 Experimental results

In this section, we evaluate the effectiveness of SimROD to combat different kinds of domain shifts, compare the performance with prior works on standard benchmarks, and conduct ablation studies. For our experiments, we adopted the single-stage detection architecture Yolov5 [20] and used different model sizes by scaling the input size, width and depth. We study synthetic-to-real and camera-setup shifts [6] in Section 4.1, cross-domain artistic shifts [18] in Section 4.2, and robustness against image corruptions [25] in Section 4.3. Training details and additional results are provided in the supplementary materials [2].

4.1 Synthetic-to-real and cross-camera benchmark

Datasets. We used Sim10k [21] to Cityscapes [7] and KITTI [9] to Cityscapes benchmarks to study the ability to adapt in synthetic-to-real and cross-camera shifts, respectively. Following prior works, only the car class was used.

Method Arch. Backbone Source AP50 Oracle Δ γ (%) Reference
DAF [6] F-RCNN V 30.10 39.00 - 8.90 - CVPR 2018
MAF [11] F-RCNN V 30.10 41.10 - 11.00 - ICCV 2019
RLDA [22] F-RCNN I 31.08 42.56 68.10 11.48 31.01 ICCV 2019
SCDA [39] F-RCNN V 34.00 43.00 - 9.00 - CVPR 2019
MDA [37] F-RCNN V 34.30 42.80 - 8.50 - ICCV 2019
SWDA [27] F-RCNN V 34.60 42.30 - 7.70 - CVPR 2019
Coarse-to-Fine [38] F-RCNN V 35.00 43.80 59.90 8.80 35.34 CVPR 2020
SimROD (self-adapt) YOLOv5 S320 33.62 38.73 48.81 5.11 33.66 Ours
SimROD (w. teacher X640) YOLOv5 S320 33.62 44.70 48.81 11.08 72.93 Ours
MTOR [4] F-RCNN R 39.40 46.60 - 7.20 - CVPR 2019
EveryPixelMatters [16] FCOS V 39.80 49.00 69.70 9.20 30.77 ECCV 2020
SimROD (self adapt) YOLOv5 S416 39.57 44.21 56.49 4.63 27.37 Ours
SimROD (w. teacher X1280) YOLOv5 S416 39.57 52.05 56.49 12.47 73.73 Ours
Table 1: Results of different method/model pairs for the Sim10K-to-Cityscapes adaptation scenario. "V", "I" and "R" represent the VGG16, Inception-v2, and ResNet50 backbones respectively. "S320", "M416", "X640", "X1280" represent different scales of the Yolov5 model with increasing depth, width and input size. "Source" refers to the model trained only on source images without domain adaptation. For a fair comparison, we group together method/model pairs whose "Source" performance is similar. We report the AP50 (%) performance of the adapted model and of the "Oracle" model, which is trained with labeled target data, as well as each method's absolute and effective gains (%) when available. Δ and γ are the absolute gain and the effective gain, respectively, as defined in (1) and (2).

Metrics. For a fair comparison, we grouped different model/method pairs whose "Source" models (trained only on the labeled source data) have a similar average precision on the target test set (i.e. Cityscapes val). We compared each group based on three metrics: (1) the AP50 of their "Adapted" models, (2) the absolute adaptation gain Δ, and (3) the effective adaptation gain γ, defined as:

Δ = AP50_adapted − AP50_source    (1)
γ = (AP50_adapted − AP50_source) / (AP50_oracle − AP50_source) × 100%    (2)

where "Oracle" is the model that is trained with the labeled target domain data. The gain metric Δ was proposed by [38] to compare methods that may share the same base architecture but have different performance before adaptation. For a better comparison, we also analyze the effectiveness of the adaptation method using the metric γ. This metric helps understand whether an adaptation method offers higher performance on the target test set beyond what is expected from having high performance on the source test set. A method that fails to adapt a model will have an effective gain of γ = 0 for that model, whereas a method that brings the target performance close to the Oracle will have γ ≈ 100%.
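For reference, the two gain metrics can be computed as below; the Python helpers are ours, and the sanity check simply reproduces the RLDA row of Table 1.

def absolute_gain(ap_adapted: float, ap_source: float) -> float:
    # Absolute adaptation gain, Eq. (1).
    return ap_adapted - ap_source

def effective_gain(ap_adapted: float, ap_source: float, ap_oracle: float) -> float:
    # Effective adaptation gain (%), Eq. (2): fraction of the source-to-Oracle
    # headroom that the adaptation method recovers.
    return 100.0 * (ap_adapted - ap_source) / (ap_oracle - ap_source)

# RLDA row of Table 1: Source 31.08, Adapted 42.56, Oracle 68.10.
assert round(absolute_gain(42.56, 31.08), 2) == 11.48
assert round(effective_gain(42.56, 31.08, 68.10), 2) == 31.01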

Sim10K to Cityscapes. Table 1 shows that SimROD achieved new state-of-the-art results on both the target AP50 performance and the effective adaptation gain. We use two student models, S320 and S416, which have the same Yolov5s architecture but different input sizes of 320 and 416 pixels, to compare with prior methods that have comparable Source AP50 performance. For example, our S320 model achieves an absolute gain of 11.08 and an effective gain of 72.93%, compared to 8.80 and 35.34% for Coarse-to-Fine [38]. Similar results were observed when comparing the performance of our adapted S416 model with that of the FCOS model adapted with EPM [16]. Fig. 3 demonstrates the effectiveness of SimROD in adapting models from Sim10K to Cityscapes compared to prior baselines. Models adapted with SimROD reached up to 70-75% of the target AP performance that would be obtained if the model were trained with a fully labeled target dataset. In contrast, the baseline methods achieved only about 30% of their Oracle performance.

KITTI to Cityscapes benchmark. Table 2 shows the results of this experiment, where SimROD outperformed the baselines. With the S416 model, it achieves slightly higher AP50 performance than the best baseline PDA [17]. When using the medium size M416 model, SimROD also outperformed prior baselines with comparable Source AP50 performance namely SCDA [39] and EPM [16].

Figure 3: AP50 on test vs effective gain for Sim10K to Cityscapes. We use five different backbones S320, M320, S416, S640 and M640 for the student and the same backbone X1280 for teacher.
Method Arch. Backbone Source AP50 Oracle Δ γ (%) Reference
DAF [6] F-RCNN V 30.20 38.50 - 8.30 - CVPR 2018
MAF [11] F-RCNN V 30.20 41.00 - 10.80 - ICCV 2019
RLDA [22] F-RCNN I 31.10 42.98 68.10 11.88 32.11 ICCV 2019
PDA [17] F-RCNN V 30.20 43.90 55.80 13.70 53.52 WACV 2020
SimROD (self-adapt) YOLOv5 S416 31.61 35.94 56.15 4.33 17.65 Ours
SimROD (w. teacher X1280) YOLOv5 S416 31.61 45.66 56.15 14.05 57.27 Ours
SCDA [39] F-RCNN V 37.40 42.60 - 5.20 - CVPR 2019
EveryPixelMatters [16] FCOS R 35.30 45.00 70.40 9.70 27.64 ECCV 2020
SimROD (self adapt) YOLOv5 M416 36.09 42.94 59.29 6.85 29.51 Ours
SimROD (w. teacher X1280) YOLOv5 M416 36.09 47.52 59.29 11.43 49.26 Ours
Table 2: Results of different method/model pairs on the KITTI-to-Cityscapes adaptation scenario. Δ and γ are the absolute gain and the effective gain, respectively, as defined in (1) and (2).

4.2 Cross-domain artistic benchmark

Method Arch. Backbone Source AP50 Oracle Δ γ (%) Reference
DAF [6] F-RCNN V 39.80 34.30 NA -5.50 NA CVPR 2018
DAM [23] F-RCNN V 39.80 52.00 NA 12.20 NA CVPR 2019
DeepAugment [12] YOLOv5 S416 37.46 45.19 56.07 7.73 41.54 arXiv 2020
BN-Adapt [19] YOLOv5 S416 37.46 45.72 56.07 8.26 44.39 NeurIPS 2020
Stylize [10] YOLOv5 S416 37.46 46.26 56.07 8.80 47.29 arXiv 2019
STAC [31] YOLOv5 S416 37.46 49.83 56.07 12.37 66.47 arXiv 2020
DT+PL [18] YOLOv5 S416 37.46 44.86 56.07 7.40 39.77 CVPR 2018
SimROD (self-adapt) YOLOv5 S416 37.46 52.58 56.07 15.12 81.26 Ours
SimROD (teacher X416) YOLOv5 S416 37.46 55.55 56.07 18.09 97.21 Ours
ADDA [35] SSD V 49.60 49.80 58.40 0.20 2.27 CVPR 2017
DT+PL [18] SSD V 49.60 54.30 58.40 4.70 53.41 CVPR 2018
SWDA [27] F-RCNN V 44.60 56.70 58.60 12.10 86.43 CVPR 2019
DeepAugment [12] YOLOv5 M416 46.95 54.02 66.34 7.07 36.47 arXiv 2020
BN-Adapt [19] YOLOv5 M416 46.95 55.75 66.34 8.80 45.39 NeurIPS 2020
Stylize [10] YOLOv5 M416 46.95 55.24 66.34 8.29 42.76 arXiv 2019
STAC [31] YOLOv5 M416 46.95 57.82 66.34 10.87 56.07 arXiv 2020
DT+PL [18] YOLOv5 M416 46.95 49.14 66.34 2.19 11.30 CVPR 2018
SimROD (self-adapt) YOLOv5 M416 46.95 60.08 66.34 13.13 67.72 Ours
SimROD (teacher X416) YOLOv5 M416 46.95 63.47 66.34 16.52 85.22 Ours
Table 3: Benchmark results on Real (VOC) to Watercolor2K domain shift.

Datasets and metrics. The cross-domain artistic benchmark consists of three domain shifts where the source data is VOC07 trainval and the target domains are Clipart1k, Watercolor2k and Comic2k datasets [18]. We use the same benchmark metrics as in Sec. 4.1.

Results. Our method outperformed the baselines by significant margins. Compared to DT+PL [18], our method further improved the AP50 of the Yolov5s model by absolute gains of +8.35, +12 and +10.69 percentage points on Clipart, Comic, and Watercolor respectively. While DT+PL outperformed the augmentation-based baselines on Clipart, it did slightly worse than STAC on Comic and Watercolor. Finally, SimROD was effective in adapting models of different sizes. Without generating synthetic data or using domain adversarial training, SimROD's effective gain was consistently above 70% and could reach up to 97% when a large adapted teacher was used to refine the pseudo-labels.

In Table 3, we give detailed results for the VOC-to-Watercolor benchmark, for which we used 1000 unlabeled images as target data. In [2], we present detailed results on the Clipart and Comic datasets as well as more ablation results when using extra unlabeled data for adaptation.

Figure 4: Qualitative comparison: (a) pseudo-labels generated on unlabeled target examples and (b) test predictions with adapted Yolov5s.

4.3 Image corruptions benchmark

Datasets. We evaluate our method's robustness to image corruption using the standard benchmarks Pascal-C, COCO-C, and Cityscapes-C [25]. For Pascal-C, we used the VOC07 trainval split as the source training data. For COCO-C and Cityscapes-C, we divided the train split and used the first half as source training data. Several corruption types are defined for each dataset. We applied each corruption type to the VOC12 trainval split or to the second half of the COCO-C and Cityscapes-C train splits to build the unlabeled target data. Precisely, we applied each corruption type at medium severity to each image using the imagecorruptions library [25]. More details are given in [2].
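A small sketch of how such a target split can be generated, assuming the corrupt and get_corruption_names entry points of the imagecorruptions package and taking severity 3 as the middle of the 1-5 range; the helper name is ours.

import numpy as np
from imagecorruptions import corrupt, get_corruption_names

def corrupted_target_splits(images, severity=3):
    # Yield (corruption_name, corrupted_image) pairs: each corruption type
    # is applied at medium severity to every image of the unlabeled target split.
    for name in get_corruption_names():
        for img in images:  # img: HxWx3 uint8 numpy array
            yield name, corrupt(img, corruption_name=name, severity=severity)

# Example with a single dummy image:
dummy = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
corrupted = {name: out for name, out in corrupted_target_splits([dummy])}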

Metrics. For the image corruption benchmark, we followed the evaluation protocol from [13, 25, 33] and measured the mean performance under corruption (mPC), the relative performance under corruption (rPC), and the relative robustness ρ of the adapted model, averaged over corruption types:

mPC = (1 / (N_c · N_s)) Σ_{c=1..N_c} Σ_{s=1..N_s} AP_{c,s}    (3)
rPC = mPC / AP_clean    (4)
ρ = (mPC_adapted − mPC_source) / (mPC_oracle − mPC_source) × 100%    (5)

where AP_{c,s} denotes the average precision on the test data with corruption type c and severity level s, N_c and N_s are the numbers of corruption types and severity levels, and AP_clean is the average precision on the clean test set. The relative robustness ρ quantifies the effect of adaptation on the performance under distribution shift (mPC).
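The helpers below compute these metrics as written in Eqs. (3)-(5); they are our own convenience functions, and the asserts reproduce the Source, SimROD and Oracle rows of Table 4.

import numpy as np

def mpc(ap: np.ndarray) -> float:
    # Mean performance under corruption, Eq. (3); ap[c, s] is the AP measured
    # on the test set corrupted with type c at severity level s.
    return float(ap.mean())

def rpc(mpc_value: float, ap_clean: float) -> float:
    # Relative performance under corruption (%), Eq. (4).
    return 100.0 * mpc_value / ap_clean

def relative_robustness(mpc_adapted: float, mpc_source: float, mpc_oracle: float) -> float:
    # Relative robustness (%), Eq. (5): share of the source-to-Oracle mPC gap closed.
    return 100.0 * (mpc_adapted - mpc_source) / (mpc_oracle - mpc_source)

# Pascal-C (Table 4): Source mPC 53.78, clean AP50 83.13; SimROD mPC 75.40; Oracle mPC 78.74.
assert round(rpc(53.78, 83.13), 2) == 64.69
assert round(relative_robustness(75.40, 53.78, 78.74), 2) == 86.62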

Baselines. We use the following baselines which were proposed to improve the robustness to image corruptions: Stylize [10], BN-Adapt [19], DeepAugment [12], STAC [31], and DT+PL [18]. Unless specified, we employed weak data augmentations such as RandomHorizontalFlip and RandomCrop for all baselines.

Main results. Tables 4, 5 and 6 show the results of the Yolov5m model for Pascal-C, COCO-C, and Cityscapes-C, respectively. We report the results with different model sizes in [2]. We used the large Yolov5x model as the teacher. An ablation study on Pascal-C is provided in Table 7 and will be discussed later.

Method AP50 (clean) mPC rPC (%) ΔmPC ρ (%)
Source 83.13 53.78 64.69 0.00 0
Stylize 84.79 62.92 74.21 9.14 36.62
BN-Adapt 83.01 64.60 77.82 10.82 43.35
DeepAugment 85.05 64.88 76.28 11.10 44.47
STAC 87.00 66.88 76.87 13.10 52.48
SimROD (ours) 86.97 75.40 86.70 21.62 86.62
Oracle 86.75 78.74 90.77 24.96 100
Table 4: Performance comparison on Pascal-C benchmark.

Unlabeled target samples improved robustness to image corruption. The source models suffered a performance drop due to image corruptions. By adapting the models with SimROD, the mean performance under corruption was significantly improved, by +21.62, +6.43, and +6.48 absolute percentage points on Pascal-C, COCO-C, and Cityscapes-C, respectively. Our method outperformed the Stylize, DeepAugment, and BN-Adapt baselines on all metrics. In fact, STAC, which also used unlabeled target samples, achieved the second-best performance. This shows that augmentation or batch norm adaptation alone is not sufficient to fix the domain shift for all possible corruptions. Instead, using unlabeled samples from the target domain is more effective in combating image corruptions.

Pseudo-label refinement ensured performance close to Oracle. Moreover, Tables 4, 5 and 6 show that the performance of our unsupervised method was close to that of the Oracle, which uses ground-truth labels for target domain data. This was possible because the adapted teacher produces highly accurate pseudo-labels, which could be used along with DomainMix augmentation to effectively adapt the student model.

Method AP50 (clean) mPC rPC (%) ΔmPC ρ (%)
Source 36.85 22.03 59.78 0.00 0
Stylize 35.75 23.82 66.63 1.79 22.02
BN-Adapt 36.24 24.79 68.41 2.76 33.95
DeepAugment 35.51 24.33 68.52 2.30 28.29
STAC 36.76 24.80 67.46 2.77 34.07
SimROD (ours) 36.79 28.46 77.36 6.43 79.09
Oracle 36.23 30.16 83.25 8.13 100
Table 5: Performance benchmark on COCO-C dataset.
Method AP50 (clean) mPC rPC (%) ΔmPC ρ (%)
Source 19.48 11.53 59.19 0.00 0
Stylize 21.77 14.62 67.16 3.09 25.81
DeepAugment 20.28 14.79 72.93 3.26 27.23
STAC 24.54 15.39 62.71 3.86 32.25
SimROD (ours) 24.06 18.01 74.85 6.48 54.14
Oracle 26.58 23.50 88.41 11.97 100
Table 6: Performance benchmark on Cityscapes-C dataset.
Method mPC ΔmPC
Source 53.78 0.0
BN-Adapt 64.60 10.8
BN-A + DMX 66.78 13.0
SimROD w/o TG 71.81 18.0
SimROD w/o GA 73.45 19.7
SimROD 75.40 21.7
Table 7: Ablation study on Pascal-C with Yolov5m. See [2] for ablations with other models. TG, GA, DMX, and FT denote Teacher Guidance, Gradual Adaptation, DomainMix, and Fine-Tuning.

Ablation Study. Next, we present an ablation study using the Yolov5m model on Pascal-C in Table 7 to gain some insights into the contributions of the three parts of our method. First, BN-Adapt improved the mean performance under corruption by 10.82% AP50. Applying the DomainMix augmentation on top of BN-Adapt improved the performance by a further 2.18%. Next, the teacher-guided (TG) pseudo-label refinement was particularly useful in adapting small models. When using our full method, the performance increased by 10.8% compared to BN-Adapt. Compared to self-adaptation, TG improved the Yolov5m model's mPC by +3.6%. Finally, the gradual adaptation (GA) also played an important role in refining the pseudo-labels and improving the model's robustness. For example, when we did not use GA and skipped the BN adaptation in the first phase, the performance dropped by 1.95% compared to the full method. Our method organically integrates these parts to tackle UDA for object detection. While the parts may appear simple, their synergy helps mitigate the challenging issues of domain shift and pseudo-label noise.

Qualitative analysis. Finally, we illustrate the effectiveness of our method by showing the pseudo-labels it generates on the unlabeled target training images of the Comic dataset. As seen in Figure 4(a), our method generated highly accurate pseudo-labels despite the domain shift. In contrast, STAC and DT+PL generated sparse labels since they failed to detect many objects. This difference carried over to the quality of the predictions on the test set, as shown in Figure 4(b).

5 Conclusion

We proposed a simple and effective unsupervised method for adapting detection models under domain shift. Our self-labeling framework gradually adapts the model using a new domain-centric augmentation method and teacher-guided fine-tuning. Our method achieved significant gains in model robustness compared to existing baselines, for both small and large models. Not only did our method mitigate the effect of domain shifts due to low-level image corruptions, it could also adapt models presented with high-level stylistic differences between the source and target domains. Through ablation studies, we gained insights into why gradual adaptation works and how teacher-guided pseudo-label refinement can help adapt the models. We hope this simple method will guide future progress in robust object detection research.

References

  • [1] Anonymous (2021) Is robustness robust? on the interaction between augmentations and corruptions. In Submitted to International Conference on Learning Representations, Note: under review External Links: Link Cited by: §1, §2.
  • [2] Authors (2021) SimROD: A Simple Adaptation Method for Robust Object Detection. Note: Supplementary materials 8083_supplementary.zip Cited by: 2nd item, §4.2, §4.3, §4.3, Table 7, §4.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3722–3731. Cited by: §1, §2.
  • [4] Q. Cai, Y. Pan, C. Ngo, X. Tian, L. Duan, and T. Yao (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11457–11466. Cited by: §2, Table 1, Table 12.
  • [5] W. Chang, T. You, S. Seo, S. Kwak, and B. Han (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354–7362. Cited by: §2.
  • [6] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348. Cited by: §1, §2, Table 1, Table 2, Table 3, §4, Table 12, Table 13, Table 14.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: §4.1.
  • [8] S. Cygert and A. Czyżewski (2020) Toward robust pedestrian detection with data augmentation. IEEE Access 8, pp. 136674–136683. Cited by: §2.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §4.1.
  • [10] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §2, §4.3, Table 3, Table 14.
  • [11] Z. He and L. Zhang (2019) Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6668–6677. Cited by: §1, §2, Table 1, Table 2, Table 12, Table 13.
  • [12] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. (2020) The many faces of robustness: a critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241. Cited by: §1, §2, §4.3, Table 3, Table 14.
  • [13] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §2, §4.3.
  • [14] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2019) Augmix: a simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781. Cited by: §1, §2, §9.3.
  • [15] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 10-15, 2018, pp. 1994–2003. Cited by: §2.
  • [16] C. Hsu, Y. Tsai, Y. Lin, and M. Yang (2020) Every pixel matters: center-aware feature alignment for domain adaptive object detector. In European Conference on Computer Vision, pp. 733–748. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 12, Table 13.
  • [17] H. Hsu, C. Yao, Y. Tsai, W. Hung, H. Tseng, M. Singh, and M. Yang (2020) Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 749–757. Cited by: §1, §2, §4.1, Table 2, Table 13.
  • [18] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5001–5009. Cited by: §1, §1, §2, §3.2.3, §3.2.3, §4.2, §4.2, §4.3, Table 3, §4, §8.1, §8.3, Table 14.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456. Cited by: §2, §4.3, Table 3, Table 14.
  • [20] G. Jocher et al. (2020-06) Ultralytics/yolov5: v1.0 - initial release. Note: Zenodo Cited by: §4, §6.1, §6.1, §6.1.
  • [21] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan (2017) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 746–753. Cited by: §4.1.
  • [22] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 480–490. Cited by: §1, §2, Table 1, Table 2, Table 12, Table 13.
  • [23] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §1, §2, Table 3, Table 14.
  • [24] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2017) Revisiting batch normalization for practical domain adaptation. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, Cited by: §2, §3.2.3.
  • [25] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel (2019) Benchmarking robustness in object detection: autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484. Cited by: §1, §1, §2, §4.3, §4.3, §4.
  • [26] E. Mintun, A. Kirillov, and S. Xie (2021) On interaction between augmentations and corruptions in natural corruption robustness. arXiv preprint arXiv:2102.11273. Cited by: §1.
  • [27] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §1, §2, Table 1, Table 3, Table 12, Table 14.
  • [28] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2488–2498. Cited by: §2.
  • [29] S. Schneider, E. Rusak, L. Eck, O. Bringmann, W. Brendel, and M. Bethge (2020) Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems, Vol. 33, pp. 11539–11551. Cited by: §2, §3.2.3.
  • [30] G. Smith et al. (2017) Guildai. Github. Note: https://github.com/guildai/guildai. Cited by: §6.1.
  • [31] K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister (2020) A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757. Cited by: §1, §2, §4.3, Table 3, §8.3, Table 14.
  • [32] M. Sugiyama and M. Kawanabe (2012) Machine learning in non-stationary environments - introduction to covariate shift adaptation. MIT Press. Cited by: §3.1.
  • [33] R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt (2020) Measuring robustness to natural distribution shifts in image classification. External Links: 2007.00644 Cited by: §1, §2, §4.3.
  • [34] I. Triguero, S. García, and F. Herrera (2015) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems 42 (2), pp. 245–284. Cited by: §1.
  • [35] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2962–2971. External Links: Document Cited by: Table 3, Table 14.
  • [36] C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le (2020) Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828. Cited by: §2.
  • [37] R. Xie, F. Yu, J. Wang, Y. Wang, and L. Zhang (2019) Multi-level domain adaptive learning for cross-domain detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: Table 1, Table 12.
  • [38] Y. Zheng, D. Huang, S. Liu, and Y. Wang (2020) Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13766–13775. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 12.
  • [39] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin (2019) Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 687–696. Cited by: §1, §2, §4.1, Table 1, Table 2, Table 12, Table 13.
  • [40] Z. Zou, Z. Shi, Y. Guo, and J. Ye (2019) Object detection in 20 years: a survey. External Links: 1905.05055 Cited by: §1.

Supplementary materials

The following supplementary materials provide further details on training, on the results of the different benchmarks, and more qualitative analysis and visualizations.

6 Experiments setup

6.1 Training details and hyperparameters

We trained each model using a standard stochastic gradient descent (SGD) optimizer with a momentum parameter of 0.937 and weight decay. We used a warm-up followed by a cosine decay rule for the learning rate. For NMS, we used fixed IoU and object confidence thresholds; when generating pseudo-labels, we used a higher confidence threshold of 0.4. We used the model definitions from the initial release of YOLOv5 [20] (last commit id '364fcfd7d'). Finally, we used a generalized IoU (GIoU) loss for localization and a focal loss for the classification and objectness losses when training the models.
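A minimal sketch of this optimizer and schedule in PyTorch, under our own assumptions for the pieces not specified above (the warm-up length and the exact decay shape); it is illustrative, not the training code of the paper.

import math
import torch

def build_optimizer_and_scheduler(model, lr, weight_decay, total_epochs, warmup_epochs=3):
    # SGD with momentum 0.937 as described above; warm-up followed by cosine decay.
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.937, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                  # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)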

To manage our experiments and make our results reproducible, we used the open-source tool Guildai [30]. Most hyperparameters (momentum, NMS, etc.) were set to the defaults of the YOLOv5 repo [20]. We tuned only the learning rate for each dataset. The hyperparameter values are configured in the 'guild.yml' file. For the gradual adaptation procedure, we use a large enough number of epochs for Phase 1 to ensure the convergence of the BN adaptation. We use a separate validation set to keep the best checkpoint according to the validation AP, and we initialize the Phase 2 training with the best checkpoint of Phase 1. It is also worth noting that our framework does not add new hyperparameters.

When training the COCO source models or the Stylize and DeepAugment baselines, we followed the training procedure of YOLOv5 [20] and trained the models from scratch for 300 epochs with a learning rate of 0.01. For Pascal and Cityscapes, the source models were obtained through transfer learning from COCO-pretrained weights. When applying our adaptation method, we also fine-tuned the source model with the same learning rate and batch size, using 100 epochs for all models and target domains.

We did not use multi-scale training to simplify our analysis. The same image input size was used during training, pseudo-label generation and evaluation. For Sim10K/KITTI to Cityscapes, we specify the input size used to train each student and teacher model in our results. For the artistic benchmark, we use the same input size of 416 for both student and teacher models. For the image corruption benchmark, we used the same input size of 416 for Pascal-C and COCO-C whereas we used a larger size of 640 for Cityscapes-C.

For the Stylize baseline, we applied only one style to each image to keep the dataset size the same and ensure a fair comparison. We preserved the original image dimensions and disabled cropping. The alpha parameter was fixed to 1 to apply the strongest stylization.

6.2 More details on datasets

Tables 8, 9, 10, and 11 summarize the data splits that we used as the source (clean) split versus the target (stylized/augmented) split for each dataset. To ensure a fair comparison, we keep the total number of images in the entire training data the same for all methods.

For Pascal-C, COCO-C, and Cityscapes-C, we generated the corrupted test set by applying each corruption to the clean test with all five severity levels. For the cross-domain adaptation benchmark, we used the test split for Clipart, Watercolor, or Comic for measuring test AP on the target domain.

  • Sim10K: we use the SIM10k dataset as the labeled source training data and the training set of Cityscapes as unlabeled target data. The validation set of Cityscapes was used as target test set.

  • KITTI: we use the training set of KITTI as the labeled source data and the training set of Cityscapes as unlabeled target data. The validation set of Cityscapes was used as target test set.

  • Clipart/Watercolor/Comic: the datasets used for the Source, DeepAugment, and Stylize baselines on Clipart/Watercolor/Comic are exactly the same as those used for Pascal-C. Otherwise, the train set of Clipart/Watercolor/Comic was used as the target domain dataset. In the DT+PL experiments, we first apply the domain transfer to the union of VOC2007 trainval and VOC2012 trainval. Then, we apply the DT step to the source model using the domain-transferred dataset. Finally, we apply the PL step to the output model of the DT step using the train split of Clipart/Watercolor/Comic. Note that we do not use the ground-truth labels but use the pseudo-labels instead.

  • Pascal-C: we used VOC2007 trainval as the source and VOC2012 trainval as the target. For the DeepAugment baseline, we augmented VOC2012 train with the CAE method and VOC2012 val with the EDSR method. We used VOC2007 test as the clean test set.

  • COCO-C: we split COCO train2017 into two approximately equal halves and used the first half as source, the second half as target. For DeepAugment, we divided COCO train2017 in three random splits and used them for the clean split, CAE split and EDSR split respectively. COCO val2017 was used as clean test.

  • Cityscapes-C: we split the source domain and target domain by city names. We carefully chose the cities for each domain so that source and target are of approximately equal size. Of all 18 cities in cityscapes-train, 9 cities: ‘cologne’, ‘krefeld’, ‘bremen’, ‘darmstadt’, ‘hanover’, ‘aachen’, ‘stuttgart’, ‘jena’, and ‘tubingen’ were used as source data; the other 9 cities: ‘bochum’, ‘ulm’, ‘monchengladbach’, ‘weimar’, ‘strasbourg’, ‘zurich’, ‘hamburg’, ‘dusseldorf’, and ‘erfurt’ were used as target data. When training the DeepAugment baseline for Cityscapes, we further split the target domain into two splits. The first split that contains ‘zurich’, ‘weimar’, ‘erfurt’, and ‘strasbourg’ was augmented with the CAE method. The second split which contains ‘bochum’, ‘ulm’, ‘monchengladbach’, ‘hamburg’, ‘dusseldorf’ was augmented with the EDSR method. The validation set of Cityscapes was used as clean test.

Method Source / Clean split (size) Target / Augmented split (size)
Source VOC2007-trainval (5011) N/A
DeepAugment VOC2007-trainval (5011) CAE VOC2012-train (5717) + EDSR VOC2012-val (5823)
Stylize VOC2007-trainval (5011) stylized VOC2012-trainval (11540)
BN-Adapt VOC2007-trainval (5011) VOC2012-trainval (11540)
STAC VOC2007-trainval (5011) VOC2012-trainval (11540)
SimROD (Ours) VOC2007-trainval (5011) VOC2012-trainval (11540)
Table 8: Dataset splits used for Pascal-C
Method Source / Clean split (size) Target / Augmented split (size)
Source coco-train2017/first half (58458) N/A
DeepAugment coco-train2017/first 1/3 (39088) CAE second 1/3 (39088) + EDSR third 1/3 (39090)
Stylize coco-train2017/first half (58458) stylized coco-train2017/second half (58808)
BN-Adapt coco-train2017/first half (58458) coco-train2017/second half (58808)
STAC coco-train2017/first half (58458) coco-train2017/second half (58808)
SimROD (Ours) coco-train2017/first half (58458) coco-train2017/second half (58808)
Table 9: Dataset splits used for COCO-C
Method Source / Clean split (size) Target / Augmented split (size)
Source cityscapes-train/first half (1483) N/A
DeepAugment cityscapes-train/first half (1483) CAE train/second half-split 1 (732) + EDSR train/second half-split 2 (750)
Stylize cityscapes-train/first half (1483) stylized cityscapes-train/second half (1482)
BN-Adapt cityscapes-train/first half (1483) cityscapes-train/second half (1482)
STAC cityscapes-train/first half (1483) cityscapes-train/second half (1482)
SimROD w/o TG cityscapes-train/first half (1483) cityscapes-train/second half (1482)
SimROD (Ours) cityscapes-train/first half (1483) cityscapes-train/second half (1482)
Table 10: Dataset splits used for Cityscapes-C
Method Source / Clean split (size) Target / Augmented split (size)
Source VOC2007-trainval (5011) N/A
DeepAugment VOC2007-trainval (5011) CAE VOC2012-train (5717) + EDSR VOC2012-val (5823)
Stylize VOC2007-trainval (5011) stylized VOC2012-trainval (11540)
BN-Adapt VOC2007-trainval (5011) clipart/watercolor/comic-train (500/1000/1000)
STAC VOC2007-trainval (5011) clipart/watercolor/comic-train (500/1000/1000)
SimROD w/o TG VOC2007-trainval (5011) clipart/watercolor/comic-train (500/1000/1000)
SimROD (Ours) VOC2007-trainval (5011) clipart/watercolor/comic-train (500/1000/1000)
Table 11: Dataset splits used for Clipart/Watercolor/Comic

7 More results on synthetic-to-real and cross-camera benchmarks

7.1 Full results on Sim10K/KITTI to Cityscapes

Tables 12 and 13 expand on the results reported in Tables 1 and 2, respectively. In particular, they show the performance of the teacher models and that of models adapted with the smaller teacher model X640.

Method Arch. Backbone Source AP50 Oracle Δ γ (%) Reference
DAF [6] F-RCNN V 30.10 39.00 - 8.90 - CVPR 2018
MAF [11] F-RCNN V 30.10 41.10 - 11.00 - ICCV 2019
RLDA [22] F-RCNN I 31.08 42.56 68.10 11.48 31.01 ICCV 2019
SCDA [39] F-RCNN V 34.00 43.00 - 9.00 - CVPR 2019
MDA [37] F-RCNN V 34.30 42.80 - 8.50 - ICCV 2019
SWDA [27] F-RCNN V 34.60 42.30 - 7.70 - CVPR 2019
Coarse-to-Fine [38] F-RCNN V 35.00 43.80 59.90 8.80 35.34 CVPR 2020
SimROD (self-adapt) YOLOv5 S320 33.62 38.73 48.81 5.11 33.66 Ours
SimROD (w. teacher X640) YOLOv5 S320 33.62 44.70 48.81 11.08 72.93 Ours
MTOR [4] F-RCNN R 39.40 46.60 - 7.20 - CVPR 2019
EveryPixelMatters [16] FCOS V 39.80 49.00 69.70 9.20 30.77 ECCV 2020
SimROD (self adapt) YOLOv5 S416 39.57 44.21 56.49 4.63 27.37 Ours
SimROD (w. teacher X640) YOLOv5 S416 39.57 51.68 56.49 12.10 71.53 Ours
SimROD (w. teacher X1280) YOLOv5 S416 39.57 52.05 56.49 12.47 73.73 Ours
SimROD (self-adapt) YOLOv5 M640 55.86 60.29 71.05 4.43 29.16 Ours
SimROD (w. teacher X640) YOLOv5 M640 55.86 62.18 71.05 6.33 41.64 Ours
SimROD (w. teacher X1280) YOLOv5 M640 55.86 64.40 71.05 8.54 56.24 Ours
SimROD (self-adapt) YOLOv5 X640 60.34 63.27 72.51 2.93 24.09 Ours
SimROD (self-adapt) YOLOv5 X1280 71.66 75.94 82.90 4.28 38.08 Ours
Table 12: Results of different method/model pairs for the Sim10K-to-Cityscapes adaptation scenario. "V", "I" and "R" represent the VGG16, Inception-v2, and ResNet50 backbones respectively. "S320", "M416", "X640", "X1280" represent different scales of the Yolov5 model with increasing depth, width and input size. "Source" denotes that the model is trained only on source images without domain adaptation. For a fair comparison, we group together method/model pairs whose "Source" performance is similar. We report the AP50 (%) performance of the adapted model and of the "Oracle" model, which is trained with labeled target data, as well as each method's absolute and effective gains (%) when available. Δ and γ are the absolute gain and the effective gain, respectively, as defined in (1) and (2).
Method Arch. Backbone Source AP50 Oracle Abs. gain Eff. gain Reference
DAF [6] F-RCNN V 30.20 38.50 - 8.30 - CVPR 2018
MAF [11] F-RCNN V 30.20 41.00 - 10.80 - ICCV 2019
RLDA [22] F-RCNN I 31.10 42.98 68.10 11.88 32.11 ICCV 2019
PDA [17] F-RCNN V 30.20 43.90 55.80 13.70 53.52 WACV 2020
SimROD (self-adapt) YOLOv5 S416 31.61 35.94 56.15 4.33 17.65 Ours
SimROD (w. teacher X640) YOLOv5 S416 31.61 43.55 56.15 11.94 48.66 Ours
SimROD (w. teacher X1280) YOLOv5 S416 31.61 45.66 56.15 14.05 57.27 Ours
SCDA [39] F-RCNN V 37.40 42.60 - 5.20 - CVPR 2019
EveryPixelMatters [16] FCOS R 35.30 45.00 70.40 9.70 27.64 ECCV 2020
SimROD (self adapt) YOLOv5 M416 36.09 42.94 59.29 6.85 29.51 Ours
SimROD (w. teacher X640) YOLOv5 M416 36.09 45.29 59.29 9.19 39.64 Ours
SimROD (w. teacher X1280) YOLOv5 M416 36.09 47.52 59.29 11.43 49.26 Ours
SimROD (self-adapt) YOLOv5 X640 45.67 50.81 72.18 5.14 19.38 Ours
SimROD (self-adapt) YOLOv5 X1280 52.07 58.25 82.50 6.18 20.31 Ours
Table 13: Results of different method/model pairs on the KITTI-to-Cityscapes adaptation scenario. The absolute gain and the effective gain are defined in (1) and (2), respectively.
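For reference, the two gain metrics reported in Tables 12 and 13 can be reproduced directly from the source, adapted, and oracle AP50 values. The short Python sketch below is for illustration only (the function names are ours, not part of the released code) and is consistent with the reported numbers, e.g., the RLDA row of Table 12.

```python
def absolute_gain(ap_adapted: float, ap_source: float) -> float:
    """Absolute gain (cf. Eq. (1)): AP50 improvement of the adapted model over the source-only model."""
    return ap_adapted - ap_source


def effective_gain(ap_adapted: float, ap_source: float, ap_oracle: float) -> float:
    """Effective gain (cf. Eq. (2)): percentage of the source-to-oracle gap closed by adaptation."""
    return 100.0 * (ap_adapted - ap_source) / (ap_oracle - ap_source)


# Example: RLDA row of Table 12.
print(round(absolute_gain(42.56, 31.08), 2))          # 11.48
print(round(effective_gain(42.56, 31.08, 68.10), 2))  # 31.01
```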

7.2 Qualitative visualization

In Figure 5, we present qualitative detection results of the S416 model (i.e., YOLOv5s with input size 416) to demonstrate the improvement brought by SimROD over the source model. Comparing against the ground-truth labels, Figure 5 shows that the adapted model detects most objects with good accuracy, except for some highly occluded ones.

Figure 5: Examples of prediction results on Sim10K to Cityscapes. We show predictions on the target test set before and after applying SimROD as well as the ground-truth labels.

8 More results on artistic benchmark

8.1 Benchmark results on Clipart and Comic

We include the benchmark results for Clipart and Comic in Tables 14 and 15, respectively. We used only 500 unlabeled target-domain images for Clipart and 1000 for Comic. As with the Watercolor results in Table 3, SimROD outperformed the baselines when compared against models with the same source AP performance. Compared to DT+PL [18], our method further improved the AP50 of the S416 model by 8.35, 12, and 10.69 absolute percentage points on Clipart, Comic, and Watercolor, respectively. In addition, SimROD consistently achieves high effective adaptation gains of 70-97% across model sizes and benchmarks.

Method Arch. Backbone Source AP50 Oracle Abs. gain Eff. gain Reference
ADDA [35] SSD V 26.80 27.40 55.40 0.60 2.10 CVPR 2017
DT+PL [18] SSD V 26.80 46.00 55.40 19.20 67.13 CVPR 2018
DAF [6] F-RCNN V 26.20 22.40 50.00 -3.80 -15.97 CVPR 2018
DT+PL [18] F-RCNN V 26.20 34.90 50.00 8.70 36.55 CVPR 2018
SWDA [27] F-RCNN V 27.80 38.10 50.00 10.30 46.40 CVPR 2019
DAM [23] F-RCNN V 24.90 41.80 50.00 16.90 67.33 CVPR 2019
DeepAugment [12] YOLOv5 S416 29.32 31.65 56.07 2.33 8.71 arXiv 2020
BN-Adapt [19] YOLOv5 S416 29.32 37.43 56.07 8.11 30.32 NeurIPS 2020
Stylize [10] YOLOv5 S416 29.32 38.80 56.07 9.48 35.44 arXiv 2019
STAC [31] YOLOv5 S416 29.32 39.64 56.07 10.32 38.58 arXiv 2020
DT+PL [18] YOLOv5 S416 29.32 39.49 56.07 10.17 38.02 CVPR 2018
SimROD (self-adapt) YOLOv5 S416 29.32 41.28 56.07 11.96 44.72 Ours
SimROD (teacher X416) YOLOv5 S416 29.32 47.84 56.07 18.52 69.24 Ours
Table 14: Benchmark results on Real (VOC) to Clipart1k domain shift
Method Arch. Backbone Source AP50 Oracle Abs. gain Eff. gain Reference
ADDA SSD V 24.90 23.80 46.40 -1.10 -5.12 CVPR 2017
DT SSD V 24.90 29.80 46.40 4.90 22.79 CVPR 2018
DT+PL SSD V 24.90 37.20 46.40 12.30 57.21 CVPR 2018
DAF F-RCNN V 21.40 23.20 - 1.80 - CVPR 2018
DT F-RCNN V 21.40 29.80 - 8.40 - CVPR 2018
SWDA F-RCNN V 21.40 28.40 - 7.00 - CVPR 2019
DAM F-RCNN V 21.40 34.50 - 13.10 - CVPR 2019
DeepAugment YOLOv5 S416 18.19 21.39 39.81 3.20 14.80 arXiv 2020
BN-Adapt YOLOv5 S416 18.19 25.53 39.81 7.34 33.95 NeurIPS 2020
Stylize YOLOv5 S416 18.19 27.57 39.81 9.38 43.39 arXiv 2019
STAC YOLOv5 S416 18.19 26.40 39.81 8.21 37.97 arXiv 2020
DT+PL YOLOv5 S416 18.19 25.66 39.81 7.47 34.55 CVPR 2018
SimROD (self-adapt) YOLOv5 S416 18.19 29.54 39.81 11.35 52.50 Ours
SimROD (teacher X416) YOLOv5 S416 18.19 37.65 39.81 19.46 90.01 Ours
DeepAugment YOLOv5 M416 23.58 27.65 49.13 4.07 15.93 arXiv 2020
BN-Adapt YOLOv5 M416 23.58 32.04 49.13 8.46 33.11 NeurIPS 2020
Stylize YOLOv5 M416 23.58 34.56 49.13 10.98 42.97 arXiv 2019
STAC YOLOv5 M416 23.58 32.76 49.13 9.18 35.93 arXiv 2020
DT+PL YOLOv5 M416 23.58 33.53 49.13 9.95 38.94 CVPR 2018
SimROD (self-adapt) YOLOv5 M416 23.58 37.93 49.13 14.35 56.15 Ours
SimROD (teacher X416) YOLOv5 M416 23.58 42.08 49.13 18.50 72.41 Ours
Table 15: Benchmark results on Real (VOC) to Comic domain shift

8.2 Data efficiency analysis on Watercolor and Comic

Next, we analyze the data efficiency of SimROD as the amount of unlabeled target data increases. For Watercolor and Comic, we used the extra splits, which contain an additional 52.8K and 17.8K unlabeled images, respectively. All models use the same input size of 416. Figures 6 and 7 compare SimROD with the two pseudo-labeling baselines (STAC and DT+PL) on Watercolor and Comic, respectively. All methods improved when using more unlabeled data from the target domain. For example, SimROD improved the YOLOv5s performance by an absolute +3.23% and +4.69% on Watercolor and Comic, respectively.

Nonetheless, SimROD outperformed the baseline methods even without the extra data for the YOLOv5s and YOLOv5m models, which were adapted using the self-adapted YOLOv5x teacher. In other words, our method used only 1000 unlabeled images and still outperformed the baselines, which used roughly 50× or 18× more data. For example, our method achieved an AP50 of 42.34% with YOLOv5s, whereas the best baseline with YOLOv5m reached only 37.79%.

Figure 6: Comparison of performance with and without extra unlabeled data on Watercolor.
Figure 7: Performance comparison on Comic with and without extra unlabeled data.

8.3 Qualitative comparison on Clipart, Comic and Watercolor

In Figures 8 and 9, we provide qualitative comparisons with the pseudo-labeling baselines (STAC [31] and DT+PL [18]) and with DeepAugment, all using the same YOLOv5s model. These comparisons illustrate the simplicity and effectiveness of SimROD. Our DomainMix augmentation and teacher-guided gradual adaptation allow SimROD to leverage unlabeled target data while mitigating label noise and domain shift. In contrast to DT+PL, SimROD does not need to generate an intermediate synthetic dataset, and our augmentation is much simpler than DeepAugment.

Figure 8: Comparing various methods on examples from the Comic dataset.
Figure 9: Comparing various methods on examples from the Clipart dataset.

9 More results on image corruptions

9.1 Results for different model sizes

Tables 4, 5, and 6 show only the results of the YOLOv5m model on Pascal-C, COCO-C, and Cityscapes-C, respectively. In Tables 16, 17, and 18, we show that SimROD consistently achieves higher performance than the baselines across model sizes and benchmarks. As expected, larger models provide extra capacity and thus higher performance.
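As a reminder, the robustness metrics used in Tables 16-18 can be computed as in the minimal sketch below (the helper names are illustrative only): mPC averages the detection performance over the 15 corruption types, rPC normalizes mPC by the clean performance, and the relative robustness is the mPC gain over the source-only model.

```python
import numpy as np


def mpc(ap_per_corruption):
    """Mean performance under corruption: AP averaged over the 15 corruption types."""
    return float(np.mean(ap_per_corruption))


def rpc(mpc_value, ap_clean):
    """Relative performance under corruption: mPC as a percentage of the clean AP."""
    return 100.0 * mpc_value / ap_clean


def relative_robustness(mpc_adapted, mpc_source):
    """mPC gain of an adapted model over the source-only model."""
    return mpc_adapted - mpc_source


# Example with the YOLOv5s rows of Table 16:
print(round(rpc(42.38, 75.87), 2))                  # ~55.86 (source model)
print(round(relative_robustness(67.95, 42.38), 2))  # ~25.57 (SimROD)
```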

Method Clean mPC rPC Rel. robustness
yolov5s
   Source 75.87 42.38 55.86 0.00
   Stylize 77.26 52.12 67.46 9.74
   BN-Adapt 74.71 53.75 71.94 11.37
   DeepAugment 77.89 55.42 71.15 13.04
   STAC 80.11 56.12 70.05 13.74
   SimROD (Ours) 80.08 67.95 84.85 25.57
   Supervised training 80.44 71.18 88.49 28.80
yolov5m
   Source 83.13 53.78 64.69 0.00
   Stylize 84.79 62.92 74.21 9.14
   BN-Adapt 83.01 64.60 77.82 10.82
   DeepAugment 85.05 64.88 76.28 11.10
   STAC 87.00 66.88 76.88 13.11
   SimROD (Ours) 86.97 75.40 86.70 21.63
   Supervised training 86.75 78.74 90.76 24.96
yolov5x
   Source 87.42 62.84 71.88 0.00
   Stylize 87.29 69.60 79.73 6.76
   BN-Adapt 86.59 71.59 82.68 8.75
   DeepAugment 87.78 72.15 82.19 9.31
   STAC 89.57 73.68 82.25 10.84
   SimROD (Ours) 89.24 78.48 87.95 15.64
   Supervised training 88.88 82.56 92.89 19.72
Table 16: Performance comparison on Pascal-C benchmark
Method Clean mPC rPC Rel. robustness
yolov5s
   Source 31.35 17.68 56.40 0.00
   Stylize 30.07 18.99 63.15 1.31
   BN-Adapt 30.91 20.09 64.99 2.40
   DeepAugment 30.37 19.87 65.44 2.19
   STAC 31.25 20.00 64.02 2.32
   SimROD (Ours) 31.21 23.94 76.71 6.26
   Supervised training 30.90 25.33 81.99 7.65
yolov5m
   Source 36.85 22.03 59.79 0.00
   Stylize 35.75 23.82 66.63 1.79
   BN-Adapt 36.24 24.79 68.39 2.76
   DeepAugment 35.51 24.33 68.52 2.30
   STAC 36.76 24.80 67.46 2.77
   SimROD (Ours) 36.79 28.46 77.36 6.43
   Supervised training 36.23 30.16 83.26 8.13
yolov5x
   Source 41.61 26.60 63.93 0.00
   Stylize 40.38 28.16 69.73 1.56
   BN-Adapt 41.70 29.77 71.40 3.17
   DeepAugment 41.12 29.13 70.84 2.53
   STAC 41.85 29.69 70.93 3.09
   SimROD (Ours) 41.63 31.87 76.57 5.27
   Supervised training 41.06 34.84 84.86 8.24
Table 17: Performance benchmark on COCO-C dataset
Method Clean mPC rPC Rel. robustness
yolov5s
   Source 17.08 9.50 55.62 0.00
   Stylize 18.96 11.75 61.97 2.25
   DeepAugment 17.24 11.39 66.07 1.89
   STAC 20.34 12.82 63.02 3.32
   SimROD (Ours) 19.82 14.95 75.45 5.45
   Supervised training 22.30 19.35 86.77 9.85
yolov5m
   Source 19.48 11.53 59.19 0.00
   Stylize 21.77 14.62 67.16 3.09
   DeepAugment 20.28 14.79 72.93 3.26
   STAC 24.54 15.39 62.71 3.86
   SimROD (Ours) 24.06 18.01 74.86 6.48
   Supervised training 26.58 23.50 88.43 11.97
yolov5x
   Source 25.65 16.63 64.83 0.00
   Stylize 27.70 19.38 69.96 2.75
   DeepAugment 25.12 18.80 74.84 2.17
   STAC 29.62 20.98 70.85 4.35
   SimROD (Ours) 29.27 21.70 74.15 5.07
   Supervised training 31.48 27.66 87.87 11.03
Table 18: Performance benchmark on Cityscapes-C dataset

9.2 Per-corruption performance on Pascal-C

In the main paper, we reported the mPC, rPC, and relative robustness metrics, which are averaged over the 15 corruption types. Here, in Tables 19, 20, and 21, we provide a breakdown of the results for each corruption type on the Pascal-C dataset for the three YOLOv5 models.

Noise Blur Weather Digital
Method Clean mPC Gauss. Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel JPEG
Source 75.87 42.38 32.71 35.32 28.24 43.02 32.96 39.87 29.05 37.09 43.53 59.66 69.21 42.00 47.04 46.53 49.48
Stylize 77.26 52.12 41.51 44.61 37.82 49.80 48.02 47.37 35.79 49.53 57.37 67.55 74.07 51.69 59.10 56.77 60.84
DeepAugment 77.89 55.42 50.48 53.12 48.67 55.38 49.23 48.87 37.58 49.73 58.19 70.29 74.91 56.88 51.61 63.39 62.99
BN Adapt 74.71 53.75 48.07 51.22 46.00 53.23 44.34 48.60 38.63 50.56 55.80 68.50 73.34 57.18 59.32 52.86 58.55
STAC 80.11 56.12 46.85 49.78 44.08 58.41 45.38 51.99 41.68 53.39 59.80 74.01 78.91 59.85 61.76 56.14 59.78
SimROD (Ours) 80.08 67.95 64.91 66.11 65.28 65.12 63.03 65.54 53.99 69.19 69.27 76.85 79.14 71.38 73.52 65.54 70.34
Oracle 80.44 71.18 68.28 69.14 68.10 68.18 67.84 69.77 62.19 71.41 71.26 77.49 79.95 73.41 75.90 70.40 74.41
Table 19: Performance comparison per corruption type for YOLOv5s model on Pascal-C benchmark
Noise Blur Weather Digital
Method Clean mPC Gauss. Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel JPEG
Source 83.13 53.78 47.44 51.35 44.98 53.87 42.17 48.61 36.64 51.77 56.29 71.74 78.82 55.81 56.43 54.52 56.17
Stylize 84.79 62.92 53.44 57.56 52.62 60.18 57.42 57.53 45.32 63.02 67.50 78.02 81.91 65.64 69.69 66.10 67.86
DeepAugment 85.05 64.88 61.75 64.06 60.64 63.74 57.95 56.18 44.75 62.31 68.27 79.36 82.69 68.34 61.92 71.40 69.78
BN Adapt 83.01 64.60 61.06 63.83 60.54 62.33 55.29 58.77 46.71 65.44 67.88 78.34 81.62 69.48 68.81 62.15 66.75
STAC 87.00 66.88 61.46 64.77 60.73 67.17 55.54 61.35 49.57 68.41 71.20 82.52 85.90 71.83 69.92 65.61 67.25
SimROD (Ours) 86.97 75.40 72.00 74.11 73.01 72.65 70.25 72.85 60.65 77.81 77.47 84.03 86.17 79.66 80.49 72.54 77.36
Oracle 86.75 78.74 76.35 76.68 76.42 75.63 75.12 77.10 70.31 80.07 79.56 84.25 86.15 80.60 82.88 78.73 81.22
Table 20: Performance comparison per corruption type for YOLOv5m model on Pascal-C benchmark
Noise Blur Weather Digital
Method Clean mPC Gauss. Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel JPEG
Source 87.42 62.84 59.30 61.06 58.38 61.45 51.43 58.48 41.79 63.82 66.34 77.28 84.25 65.77 64.40 63.07 65.79
Stylize 87.29 69.60 62.44 65.03 62.20 67.57 63.64 65.13 50.84 70.44 74.10 82.44 85.30 74.16 74.73 72.76 73.25
DeepAugment 87.78 72.15 71.25 73.27 71.16 71.40 64.70 65.57 49.76 71.42 74.91 84.17 86.43 77.48 68.75 77.16 74.88
BN Adapt 86.59 71.59 71.05 72.63 70.94 67.90 63.70 66.55 52.41 72.79 73.91 82.38 84.62 76.03 74.96 70.51 73.43
STAC 89.57 73.68 71.77 73.40 71.71 72.57 64.51 69.37 52.81 76.21 77.68 85.04 88.40 78.87 75.92 72.69 74.23
SimROD (Ours) 89.24 78.48 76.09 78.31 77.23 75.85 73.11 75.29 62.75 81.10 80.96 86.62 88.16 82.94 82.45 76.64 79.69
Oracle 88.88 82.56 81.14 81.96 81.27 79.10 79.08 80.65 73.97 83.58 83.66 87.18 88.54 84.03 85.55 84.09 84.57
Table 21: Performance comparison per corruption type for YOLOv5x model on Pascal-C benchmark

9.3 Performance comparison with Augmix

Here, we compare our proposed method with the Augmix augmentation [14] and report the Pascal-C results in Tables 22 and 23 for the YOLOv5s and YOLOv5x models, respectively. Comparing the mean performance under corruption (mPC), Augmix performed the worst among the augmentation-based baselines. Interestingly, applying Augmix together with DeepAugment improved DeepAugment by +3.3% AP50 and +1.03% AP50 on the YOLOv5s and YOLOv5x models, respectively. Nonetheless, SimROD still outperformed DeepAugment+Augmix by more than +5% AP50 on both models. Although we have not tried it, applying Augmix on top of DomainMix may further improve the performance of our method.
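For completeness, such a combination could be prototyped as in the sketch below, which assumes torchvision >= 0.13 (where transforms.AugMix is available) and treats the DomainMix output simply as a uint8 image tensor; this is only an illustrative sketch, not part of our experiments.

```python
import torch
from torchvision import transforms

# AugMix from torchvision (>= 0.13); it operates on uint8 image tensors (C, H, W) or PIL images.
augmix = transforms.AugMix(severity=3)


def augment_on_top_of_domainmix(domain_mixed_image: torch.Tensor) -> torch.Tensor:
    # `domain_mixed_image` stands for the output of the DomainMix step (uint8, C x H x W).
    return augmix(domain_mixed_image)
```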

Method Clean mPC
Source 75.87 42.38
Augmix 79.42 46.94
Stylize 77.26 52.12
DeepAugment 77.89 55.42
DeepAugment+Augmix 80.85 60.15
SimROD (Ours) 80.08 67.95
Table 22: Augmix comparison for YOLOv5s model on Pascal-C.
Method Clean mPC
Augmix 87.46 62.31
Source 87.42 62.84
Stylize 87.29 69.60
DeepAugment 87.78 72.15
DeepAugment+Augmix 88.36 73.18
SimROD (Ours) 89.24 78.48
Table 23: Augmix comparison for YOLOv5x model on Pascal-C.

9.4 Data efficiency analysis on Pascal-C

In Figures 10 and 11, we analyze the data efficiency of our proposed method using a YOLOv5s model on the Pascal-C dataset. To this end, we used subsets of the training data and considered two scenarios. For both scenarios, we randomly generated three different subsets and measured the performance over three runs. The averages over the three runs are plotted with error bars in Figures 10 and 11.

In the first scenario, we used all the available labeled data from the source domain (5011 images) but only a fraction of the available unlabeled target images. As shown in Figure 10, our proposed method outperformed STAC by a margin of 10% AP50. Moreover, our method achieved a relative robustness of +21.75% AP50 and +16.61% AP50 using only 10% and 1% of the unlabeled target-domain images, respectively. Since the data was imbalanced in this scenario, we also considered applying weighted balanced sampling to STAC; Figure 10 shows that it slightly improved the performance of STAC when the datasets were very imbalanced (a sketch of this sampling scheme is given below).
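A generic version of this weighted balanced sampling can be written with torch's WeightedRandomSampler, as in the sketch below; the exact sampler used in our experiments may differ, so this is for illustration only.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler


def balanced_loader(source_ds, target_ds, batch_size=32):
    """Draw source and target samples with roughly equal probability per domain."""
    dataset = ConcatDataset([source_ds, target_ds])
    n_src, n_tgt = len(source_ds), len(target_ds)
    # Each sample is weighted inversely to the size of its domain.
    weights = torch.cat([torch.full((n_src,), 1.0 / n_src),
                         torch.full((n_tgt,), 1.0 / n_tgt)])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```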

In the second scenario, we used only a given percentage of the available training data for both the source and target domains. While this scenario assumes balanced datasets, the total number of training images is much smaller than in the first scenario. For example, using 1% of the training data corresponds to a total of 165 images. With 1% of the training data, STAC could not adapt the model. In contrast, our proposed method provided a relative robustness of +4.54% AP50 and +18.28% AP50 using only 1% and 10% of the training data, respectively.

These results confirm that our method is more data-efficient. In particular, our DomainMix augmentation can produce a diverse set of mixed samples even from very few training images from both domains. When more unlabeled data is available, our method further leverages it to provide strong supervision for adaptation while mitigating label noise.

Figure 10: mPC performance of YOLOv5s on Pascal-C for a given percentage of unlabeled target data and using 100% source data.
Figure 11: mPC performance of YOLOv5s on Pascal-C for a given percentage of training data (source and target).

9.5 Effects of corruption severity levels

To apply our method to the image corruption benchmarks, we used a corruption severity level of 3 when creating the unlabeled target-domain images. In this section, we present additional analysis of how the corruption severity of the training images affects the test performance. Figures 12 and 13 show the relative robustness and the mean performance under corruption (mPC) of a YOLOv5s model adapted with our method; Figures 14 and 15 show the same metrics for an adapted YOLOv5x model.
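For illustration, corrupted target images at a given severity can be generated with the third-party imagecorruptions package as in the sketch below; this is an assumption made for reproducibility purposes and not necessarily the exact tooling used in our pipeline.

```python
# Minimal sketch; assumes `pip install imagecorruptions`.
import numpy as np
from imagecorruptions import corrupt, get_corruption_names


def make_unlabeled_target(image: np.ndarray, corruption_name: str, severity: int = 3) -> np.ndarray:
    """Apply one corruption type at the given severity to an RGB uint8 image of shape (H, W, 3)."""
    assert corruption_name in get_corruption_names()
    return corrupt(image, corruption_name=corruption_name, severity=severity)
```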

The corruption types are sorted in ascending order of the source model's performance on each type. For instance, the source models achieved the highest mPC on fog and the lowest on impulse noise. This explains why the relative robustness gain on fog is lower than on the other corruption types: the source model already achieves a high mPC on fog. Notable improvements were observed on the remaining corruption types.

Figures 12 and 13 show that the adapted YOLOv5s model enjoyed larger improvements on test sets with higher severity levels. More importantly, large improvements were achieved when the training images had severity levels similar to those of the test images. This suggests that using unlabeled target-domain samples is effective as long as they are representative of the actual test set.

Figure 12: Relative robustness improvement on YOLOv5s using our method for specific corruption types and severity levels on Pascal-C.
Figure 13: Final mPC performance of YOLOv5s using our method for specific corruption types and severity levels on Pascal-C.
Figure 14: Relative robustness improvement on YOLOv5x using our method for specific corruption types and severity levels on Pascal-C.
Figure 15: Final mPC performance of YOLOv5x using our method for specific corruption types and severity levels on Pascal-C.

9.6 Qualitative comparison on image corruptions

Fig. 16 illustrates how various methods handle the glass blur corruption (severity 5) on a Pascal-C sample. In addition, Fig. 17 shows the results of various methods across a range of severity levels for the glass blur corruption. The proposed method was more effective at handling the corruption: in contrast to the baseline methods, our adapted model detected most objects in the images and made fewer classification errors, whereas the source model completely failed to detect objects in most cases.

Figure 16: Demonstration of how different methods handle glass_blur corruption (severity 5); images from Pascal-C.
Figure 17: Demonstration of how different methods handle glass_blur corruption at different severity levels; image from Pascal-C.

9.7 More detailed ablations on the components

Table 24 extends the ablation study from the main paper to different model sizes.

Method Corrupt AP50 Rel. robustness
yolov5s
   Source 42.38 0.00
   BN-Adapt 53.75 11.37
   BN-Adapt + DomainMix 56.13 13.75
   SimROD (Ours) w/o Teacher Guidance 60.35 17.97
   SimROD (Ours) w/o Gradual Adaptation 67.87 25.49
   Our full method (SimROD) 67.95 25.57
yolov5m
   Source 53.78 0.00
   BN-Adapt 64.60 10.82
   BN-Adapt + DomainMix 66.78 13.01
   SimROD (Ours) w/o Teacher Guidance 71.81 18.03
   SimROD (Ours) w/o Gradual Adaptation 73.45 19.67
   Our full method (SimROD) 75.40 21.62
yolov5x
   Source 62.84 0.00
   BN-Adapt 71.83 8.99
   BN-Adapt + DomainMix 73.64 10.80
   SimROD (Ours) w/o Gradual Adaptation 75.58 12.74
   SimROD (Ours) w/o Teacher Guidance 78.16 15.32
   Our full method (SimROD) 78.48 15.64
Table 24: Ablation study on Pascal-C dataset

10 Dataset and DomainMix visualizations

Figures 18 and 19 show examples of domain-mixed images produced by the DomainMix augmentation on different datasets. Note that the images used to form a domain-mixed example are randomly cropped and may occupy different portions (heights and widths) of the final image.
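The sketch below gives a simplified, mosaic-style approximation of how such a domain-mixed image can be assembled: four randomly cropped images, two from each domain, are tiled around a random split point of the output canvas. It omits label and box handling and other details of the actual DomainMix implementation, and is provided for illustration only.

```python
import random
import numpy as np


def domain_mix_sketch(source_imgs, target_imgs, out_size=640):
    """Tile two random source crops and two random target crops into one mixed image.

    Each input image is assumed to be an RGB uint8 array at least out_size x out_size.
    """
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)  # random vertical split
    cy = random.randint(out_size // 4, 3 * out_size // 4)  # random horizontal split
    quadrants = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                 (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    images = random.sample(list(source_imgs), 2) + random.sample(list(target_imgs), 2)
    random.shuffle(images)
    for (x1, y1, x2, y2), img in zip(quadrants, images):
        h, w = y2 - y1, x2 - x1
        top = random.randint(0, img.shape[0] - h)   # random crop location
        left = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[top:top + h, left:left + w]
    return canvas
```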

Figure 18: Examples of DomainMix image samples on Pascal-C dataset with various corruption types.
Figure 19: Examples of DomainMix image samples on Watercolor dataset.