1 Introduction
State-of-the-art object detection models have proved to be highly accurate when trained on images that follow the same distribution as the test set [40]. However, they can fail when deployed to new environments because of domain shifts such as weather changes (e.g., rain or fog), lighting variations, or image corruptions (e.g., motion blur) [25]. Such failures are detrimental in mission-critical applications such as self-driving, security, or automated retail checkout, where domain shifts are common and inevitable. Making detection models robust to domain shifts is therefore essential for applications where reliability is key.
Many methods have been proposed to overcome domain shifts for object detection. They can be categorized as data augmentation [25, 14, 12], domain-alignment [6, 11, 39, 38, 27, 16, 23, 17], domain-mapping [3, 18, 23, 17], and self-labeling techniques [34, 31, 22, 18]. Data augmentation methods can improve the performance on some fixed set of domain shifts but fail to generalize to shifts that are not similar to the augmented samples [1, 26, 33]. Domain-aligning methods use samples from the target domain to align intermediate features of the network. These methods require non-trivial architecture changes such as gradient reversal layers, domain classifiers, or other specialized modules. On the other hand, domain-mapping methods translate labeled source images into new images that look like the unlabeled target domain images using image-to-image translation networks. Similar to the augmentation methods, they are suboptimal since the generated images do not necessarily have a high perceptual similarity to real target domain images. Finally, self-labeling is a promising approach since it leverages unlabeled training samples from the target domain. However, generating accurate pseudo-labels under domain shift is hard, and when pseudo-labels are noisy, using target domain samples for adaptation is ineffective.
In this paper, we propose a Simple adaptation method for Robust Object Detection (SimROD) that mitigates domain shifts using domain-mixed data augmentation and teacher-guided gradual adaptation. Our simple approach has three design benefits. First, it does not require ground-truth labels for the target domain and instead leverages unlabeled samples. Second, it requires neither complicated architecture changes nor generative models for creating synthetic data [18]. Third, it is architecture-agnostic and is not limited to region-based detectors. The main contributions of this paper are summarized as follows:
- We propose a simple method to improve the robustness of object detection models against domain shifts. Our method first adapts a large teacher model using a gradual adaptation approach. The adapted teacher then generates accurate pseudo-labels for adapting the student model.
- We introduce an augmentation procedure called DomainMix to help learn domain-invariant representations and reduce the pseudo-label noise that is exacerbated by the domain shift. DomainMix efficiently mixes labeled images from the source domain with unlabeled samples from the target domain along with their (pseudo-)labels and gives strong supervision for self-adaptation. The mixed training samples are used for adapting both the teacher and student models.
- We conduct a comprehensive and fair benchmark to demonstrate the effectiveness of SimROD in mitigating different kinds of shifts, including synthetic-to-real, cross-camera setup, real-to-artistic, and image corruptions. We show that our simple method achieves new state-of-the-art results on some of these benchmarks. We also conduct ablation studies to provide insights on the efficiency and effectiveness of our method.
2 Motivation and related works
In this section, we review the mainstream approaches relevant to our work and explain the motivation of our work.
Data augmentations for robustness to image corruption
Data augmentation is an effective technique for improving the performance of deep learning models. Recent works have also explored the role of augmentation in enhancing robustness to domain shifts. In particular, specialized augmentation methods have been proposed to combat the effect of image corruptions for image classification [13, 14, 12] and object detection [25, 8]. For example, AugMix [14] samples a set of geometric and color transformations, applies them sequentially to each image, and mixes the original image with multiple augmented versions. Subsequently, DeepAugment [12] generates augmented samples using image-to-image translation networks whose weights are perturbed with random distortions. [25, 8] proposed style transfer [10] as data augmentation to increase the shape bias and improve robustness to image corruptions. While these augmentation methods offer some improvement over the source baseline, they can overfit to a few corruption types and fail to generalize to others. In fact, [1] provided empirical evidence that the perceptual similarity between the augmentation transformation and the corruption is a strong predictor of corruption error. [1] also observed that broader augmentation schemes perform better on dissimilar corruptions than more specialized ones. [33] showed that augmentation techniques tailored to synthetic corruptions have difficulty generalizing to natural distribution shifts. In their extensive study, training on more diverse data was the only intervention that effectively improved robustness to natural distribution shifts.
Unsupervised domain adaptation for object detection
Unsupervised domain adaptation (UDA) methods leverage unlabeled images from the target domain to explicitly mitigate the domain shift. In contrast to images obtained with augmentation, these unlabeled samples are more similar to the test samples as they are from the same domain. Moreover, leveraging unlabeled samples is practical since they are cheap to collect and do not require laborious annotation.
Several approaches have been proposed to solve the UDA problem for object detection. Adversarial training methods such as [6] learn domain-invariant representations of two-stage detector networks. Recent methods improved the performance by mining important regions and aligning at the region level [11], by using a hierarchical alignment module [39], by coarse-to-fine feature adaptation [38], or by enforcing strong local alignment and weak global alignment [27]. [16] proposed a center-aware alignment method for the anchor-free FCOS model. While alignment methods help reduce the domain shift, they require architecture changes, since extra modules such as gradient reversal layers and domain classifiers must be added to the network.
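As a point of reference, the sketch below shows what such a gradient reversal layer looks like in PyTorch. It is an illustrative snippet only (the names and the scaling factor `lambd` are our own), not the implementation of any particular baseline, and it is precisely the kind of architectural addition that SimROD avoids.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient trains the backbone to fool a domain classifier
        # attached after this layer, encouraging domain-invariant features.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```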
Alternatively, domain-mapping methods tackle UDA by first translating source images into images that resemble the target domain samples using a conditional generative adversarial network (GAN) [3, 15]. The model is then fine-tuned with the domain-mapped images and the known source labels. For object detection, [23, 17] combined domain transfer with adversarial training. For instance, [23] generates a diverse set of intermediate domains between the source and target to discriminate and learn domain-invariant features.

Batch normalization (BN) [19] layers are prevalent in most neural networks because they accelerate learning, prevent overfitting, and enable deeper networks to converge [28]. Recent works have shown that adapting BN layers can improve robustness to adversarial attacks [36] or image corruptions [29] and reduce domain shifts [24, 5].

Self-training for object detection adaptation
Self-training enables a model to generate its own pseudo-labels on the unlabeled target samples. Recently, [31] proposed the STAC framework for semi-supervised object detection with pseudo-labels. However, pseudo-labeling can degrade performance in the presence of domain shift, as the pseudo-labels on target samples may become incorrect and provide poor supervision. Our work instead tackles the domain shift between the original source training data and the unlabeled target training data. To reduce domain shift, [4] enforced region-level and graph-structure consistencies between a mean-teacher model and the student model using additional regularization loss functions. Next, [22] proposed a method to directly mitigate the noisy pseudo-labels of Faster-RCNN detectors by modeling their proposal distribution. Unlike [22], our method is agnostic to the model architecture and also works with single-stage detectors. Finally, [18] combined domain transfer with pseudo-labeling and is also architecture-agnostic.

In contrast to these prior works, our proposed adaptation method is simpler because it does not generate synthetic data using GANs, does not add new loss functions, and does not change the model architecture. As will be shown in Section 4, our simple method is also more effective in reducing domain shifts and label noise.
3 Problem definition and proposed solution
In this section, we define the adaptation problem and describe our proposed solution.

3.1 Problem statement
We are given a source model for an object detection task with parameters $\theta_s$, which is trained with a source training dataset $\mathcal{D}_s = \{(x_i, y_i)\}$, where $x_i$ is an image and each label $y_i$ consists of object categories and bounding box coordinates. We consider scenarios in which there exists a covariate shift between the input distribution $p_s(x)$ of the original source data and the target test distribution $p_t(x)$. More formally, we assume that $p_s(y|x) = p_t(y|x)$ but $p_s(x) \neq p_t(x)$ [32].
In the unsupervised domain adaptation setting, we are also given a set of unlabeled images $\mathcal{D}_t = \{x_j\}$ from the target domain, which we can use during training. Therefore, our objective is to update the model parameters $\theta_s$ into $\theta_t$ to achieve a good performance on both the source test set and a given target test set, i.e., improving its robustness to the domain shifts. To effectively exploit the additional information in $\mathcal{D}_t$, we need to tackle two inter-related issues. First, the target training set does not come with ground-truth labels. Second, generating pseudo-labels for $\mathcal{D}_t$ with the source model leads to noisy supervision due to the domain shift and hinders the adaptation. In the following subsections, we present a simple approach for tackling these technical issues.
3.2 Simple adaptation for Robust Object Detection
We present our simple adaptation method SimROD for enabling robust object detection models. SimROD integrates a teacher-guided fine-tuning, a new DomainMix augmentation method and a gradual adaptation technique. Sec. 3.2.1 describes the overall method. Next, Sec. 3.2.2 presents the DomainMix augmentation, which is used for adapting both the teacher and student. Finally, Sec. 3.2.3 explains the gradual adaptation that overcomes the two interrelated issues of domain shift and pseudo-label noise.
3.2.1 Overall approach
Our simple approach is motivated by the fact that label noise is exacerbated by the domain shift. Therefore, our approach aims to generate accurate pseudo-labels on target domain images and use them together with mixed images from source and target domain so as to provide strong supervision for adapting the models.
Because the student target model may not have the capacity to generate accurate pseudo-labels and adapt itself, we propose to adapt an auxiliary teacher model first, which can later generate high-quality pseudo-labels for fine-tuning the student model. A flow diagram of SimROD is provided in Figure 1. Its steps are summarized as follows:
Step 1:
We train a large source teacher model with bigger capacity than the student model to be adapted, using the source data $\mathcal{D}_s$, and obtain the source teacher parameters $\phi_s$. The source teacher is used to generate initial pseudo-labels on the target data.
Step 2:
We adapt the teacher model on the target domain using the DomainMix augmentation and the gradual adaptation procedure (Algorithm 2), which yields the adapted teacher parameters $\phi_t$.
Step 3:
We refine the pseudo-labels on the target data using the adapted teacher parameters $\phi_t$. Then, we fine-tune the student model using these refined pseudo-labels (lines 2 and 8 of Algorithm 2).
One benefit of this approach is that it can adapt both small and large object detection models to domain shifts since it produces high quality pseudo-labels even when the student network is small. Another advantage of our method is that the teacher and student do not need to share the same architecture. Thus, it is possible to use a slow but accurate teacher for the purpose of adaptation while choosing a fast architecture for deployment.
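For concreteness, the three steps can be written as a short driver function. This is a structural sketch only: the callables passed in (train_on_source, generate_pseudo_labels, gradual_adapt) are placeholders for standard supervised training, confidence-thresholded inference, and Algorithm 2 with DomainMix, not functions of an existing codebase. The 0.4 confidence threshold is the pseudo-label threshold reported in the supplementary materials [2].

```python
def simrod(student, teacher, source_data, target_images,
           train_on_source, generate_pseudo_labels, gradual_adapt,
           conf_thresh=0.4):
    """Structural sketch of the three SimROD steps (Figure 1)."""
    # Step 1: train a large teacher on the labeled source data only.
    teacher = train_on_source(teacher, source_data)
    # Initial (possibly noisy) pseudo-labels on the unlabeled target images.
    pseudo_labels = generate_pseudo_labels(teacher, target_images, conf_thresh)

    # Step 2: adapt the teacher itself with DomainMix + gradual adaptation.
    teacher = gradual_adapt(teacher, source_data, target_images, pseudo_labels)

    # Step 3: refine pseudo-labels with the adapted teacher, then adapt the student.
    refined = generate_pseudo_labels(teacher, target_images, conf_thresh)
    student = gradual_adapt(student, source_data, target_images, refined)
    return student
```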
3.2.2 DomainMix augmentation
Here, we present a new augmentation method named DomainMix. As illustrated in Figure 1, it uniformly samples images from both the source and target domains and strongly mixes these images into a new image along with their (pseudo-)labels. Figure 2 shows an example of DomainMix images from natural and artistic domains.
DomainMix uses simple ideas with many benefits to mitigate domain shift and label noise:
- It produces a diverse set of images by randomly sampling and mixing crops from the source and target sets with replacement. As a result, it uses a different sample of images at every epoch, thus increasing the effective number of training samples and preventing overfitting. In contrast, simple batching reuses the same images at every epoch.
- It is data-efficient because it uses weighted balanced sampling from both domains (a minimal sampler sketch is given after this list). This helps learn representations that are robust to data shifts even if the target dataset has limited samples or the source and target datasets are highly imbalanced. In [2], we provide ablation studies that demonstrate the data efficiency of DomainMix.
- It mixes ground-truth and pseudo-labels in the same image. This mitigates the effect of false labels during adaptation because the image always contains accurate labels from the source domain.
- It forces the model to detect small objects, since the objects in the original samples are scaled down.
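The weighted balanced sampling mentioned in the list can be realized with a standard utility. Below is a minimal sketch using PyTorch's WeightedRandomSampler; the function name and dataset objects are placeholders, and the actual implementation may differ.

```python
import torch
from torch.utils.data import ConcatDataset, WeightedRandomSampler


def domain_balanced_sampler(source_ds, target_ds):
    """Draw indices so that, in expectation, half come from each domain."""
    n_src, n_tgt = len(source_ds), len(target_ds)
    # Inverse-frequency weights compensate for dataset-size imbalance.
    weights = torch.cat([
        torch.full((n_src,), 0.5 / n_src),
        torch.full((n_tgt,), 0.5 / n_tgt),
    ])
    dataset = ConcatDataset([source_ds, target_ds])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return dataset, sampler
```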
The steps of the DomainMix augmentation are listed in Algorithm 1. For each image in a batch, we randomly sample three additional images from the source and target data and mix random crops of these images to create a new domain-mixed image in a collage. In addition, we collate the pseudo-labels of the unlabeled examples in $\mathcal{D}_t$ with the ground-truth labels of the source images. The bounding box coordinates of the objects are computed based on the relative position of each crop in the new mixed image. Furthermore, we employ a weighted balanced sampler to sample uniformly from the two domains.
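A simplified sketch of the mixing step is shown below. It tiles four resized images into a 2x2 collage and remaps their boxes; the actual Algorithm 1 mixes random crops and draws the four images with the balanced sampler above, so this is only an illustration of the label bookkeeping, with all names being our own.

```python
import numpy as np
import cv2  # opencv-python, assumed available


def domain_mix(images, labels, out_size=640):
    """Tile four (image, boxes) pairs into a 2x2 collage and remap the boxes.

    images: four HxWx3 uint8 arrays (in practice two source and two target images).
    labels: four arrays of shape (N_i, 5) with rows [class, x1, y1, x2, y2] in
            absolute pixel coordinates (pseudo-labels for the target images).
    """
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # (x_off, y_off) per quadrant
    mixed = []
    for img, lab, (x_off, y_off) in zip(images, labels, offsets):
        h, w = img.shape[:2]
        canvas[y_off:y_off + half, x_off:x_off + half] = cv2.resize(img, (half, half))
        if len(lab):
            lab = lab.astype(np.float32).copy()
            # Rescale box corners to the quadrant size, then shift by its offset.
            lab[:, [1, 3]] = lab[:, [1, 3]] * (half / w) + x_off
            lab[:, [2, 4]] = lab[:, [2, 4]] * (half / h) + y_off
            mixed.append(lab)
    boxes = np.concatenate(mixed) if mixed else np.zeros((0, 5), dtype=np.float32)
    return canvas, boxes
```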

Figure 2: An example image generated by DomainMix, mixing real images from Pascal VOC and artistic images from Watercolor2K.
3.2.3 Gradual self-labeling adaptation
Next, we present a gradual adaptation procedure for optimizing the parameters of the detection model. This procedure mitigates the effects of label noise, which is exacerbated by the domain shift. In fact, the pseudo-labels generated by the source model can be noisy on target domain images (e.g., the model may miss objects or localize them inaccurately). If these initial pseudo-labels are used to adapt all the layers of the model at once, they provide poor supervision and hinder the adaptation.
Instead, we propose a phased approach. In the first phase, we freeze all convolutional layers and adapt only the BN layers, so that only the BN layers' trainable coefficients are updated during these first epochs. The partially adapted model is then used to generate more accurate pseudo-labels, which is done offline for simplicity. In the second phase, all layers are unfrozen and fine-tuned using the refined pseudo-labels. During both phases, we train on the mixed image samples generated by the DomainMix augmentation. The detailed steps of this gradual adaptation are listed in Algorithm 2.
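In PyTorch terms, the two phases amount to toggling which parameters require gradients. A minimal sketch, independent of any particular detector implementation and with function names of our own choosing:

```python
import torch.nn as nn

_BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def set_phase1(model: nn.Module):
    """Phase 1: freeze everything except the BN layers' trainable coefficients."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, _BN_TYPES):
            m.train()  # running statistics keep updating on the domain-mixed batches
            if m.affine:
                m.weight.requires_grad = True
                m.bias.requires_grad = True


def set_phase2(model: nn.Module):
    """Phase 2: unfreeze all layers for full fine-tuning on refined pseudo-labels."""
    for p in model.parameters():
        p.requires_grad = True
```

The optimizer of each phase would then be built over `filter(lambda p: p.requires_grad, model.parameters())` so that frozen weights are never updated.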
In contrast to prior works [24, 29], which used BN adaptation on its own, we integrate it within a self-training framework to effectively overcome the inevitable label noise caused by the domain shift [18]. As will be shown in Section 4, when used with the DomainMix augmentation, the resulting method is effective in adapting object detection models to different kinds of domain shifts.
Note that [18] also used a two-phase progressive adaptation method but they used synthetic domain-mapped images, which are generated by a conditional GAN, to fine-tune the model in the first phase. In contrast, our method leverages actual target domain images, which are mixed with source domain images using DomainMix augmentation, during the entire adaptation process.
4 Experimental results
In this section, we evaluate the effectiveness of SimROD to combat different kinds of domain shifts, compare the performance with prior works on standard benchmarks, and conduct ablation studies. For our experiments, we adopted the single-stage detection architecture Yolov5 [20] and used different model sizes by scaling the input size, width and depth. We study synthetic-to-real and camera-setup shifts [6] in Section 4.1, cross-domain artistic shifts [18] in Section 4.2, and robustness against image corruptions [25] in Section 4.3. Training details and additional results are provided in the supplementary materials [2].
4.1 Synthetic-to-real and cross-camera benchmark
Datasets. We used Sim10k [21] to Cityscapes [7] and KITTI [9] to Cityscapes benchmarks to study the ability to adapt in synthetic-to-real and cross-camera shifts, respectively. Following prior works, only the car class was used.
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
DAF [6] | F-RCNN | V | 30.10 | 39.00 | - | 8.90 | - | CVPR 2018 |
MAF [11] | F-RCNN | V | 30.10 | 41.10 | - | 11.00 | - | ICCV 2019 |
RLDA [22] | F-RCNN | I | 31.08 | 42.56 | 68.10 | 11.48 | 31.01 | ICCV 2019 |
SCDA [39] | F-RCNN | V | 34.00 | 43.00 | - | 9.00 | - | CVPR 2019 |
MDA [37] | F-RCNN | V | 34.30 | 42.80 | - | 8.50 | - | ICCV 2019 |
SWDA [27] | F-RCNN | V | 34.60 | 42.30 | - | 7.70 | - | CVPR 2019 |
Coarse-to-Fine [38] | F-RCNN | V | 35.00 | 43.80 | 59.90 | 8.80 | 35.34 | CVPR 2020 |
SimROD (self-adapt) | YOLOv5 | S320 | 33.62 | 38.73 | 48.81 | 5.11 | 33.66 | Ours |
SimROD (w. teacher X640) | YOLOv5 | S320 | 33.62 | 44.70 | 48.81 | 11.08 | 72.93 | Ours |
MTOR [4] | F-RCNN | R | 39.40 | 46.60 | - | 7.20 | - | CVPR 2019 |
EveryPixelMatters [16] | FCOS | V | 39.80 | 49.00 | 69.70 | 9.20 | 30.77 | ECCV 2020 |
SimROD (self adapt) | YOLOv5 | S416 | 39.57 | 44.21 | 56.49 | 4.63 | 27.37 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | S416 | 39.57 | 52.05 | 56.49 | 12.47 | 73.73 | Ours |
Metrics. For a fair comparison, we grouped different model/method pairs whose “Source” models (trained only on the labeled source data) have a similar average precision on the target test set (i.e. Cityscapes val). We compared each group based on three metrics: (1) the AP50 of their “Adapted” models, (2) the absolute adaptation gain $\Delta_{\text{abs}}$, and (3) the effective adaptation gain $\Delta_{\text{eff}}$, defined as:
$$\Delta_{\text{abs}} = \text{AP50}_{\text{adapted}} - \text{AP50}_{\text{source}} \qquad (1)$$
$$\Delta_{\text{eff}} = \frac{\text{AP50}_{\text{adapted}} - \text{AP50}_{\text{source}}}{\text{AP50}_{\text{oracle}} - \text{AP50}_{\text{source}}} \times 100\% \qquad (2)$$
where “Oracle” denotes the model trained with the labeled target domain data. The absolute gain metric was proposed by [38] to compare methods that may share the same base architecture but have different performance before adaptation. For a better comparison, we also analyze the effectiveness of an adaptation method using the metric $\Delta_{\text{eff}}$. This metric helps understand whether an adaptation method offers higher performance on the target test set beyond what is expected from having high performance on the source test set. A method that fails to adapt a model will have an effective gain of 0% for that model, whereas a method that brings the target performance close to the Oracle will have $\Delta_{\text{eff}}$ close to 100%.
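The two gains can be computed directly from a method's source, adapted, and Oracle AP50 values; for example, the SimROD S320 row of Table 1 gives (44.70 − 33.62) / (48.81 − 33.62) ≈ 72.9%. A small helper (the function name is ours) illustrating Eqs. (1)-(2):

```python
def adaptation_gains(ap_source, ap_adapted, ap_oracle=None):
    """Absolute gain (Eq. 1) and, when the Oracle AP is known, effective gain (Eq. 2)."""
    absolute = ap_adapted - ap_source
    effective = 100.0 * absolute / (ap_oracle - ap_source) if ap_oracle is not None else None
    return absolute, effective


# Example, SimROD S320 row of Table 1:
# adaptation_gains(33.62, 44.70, 48.81) -> (11.08, ~72.9)
```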
Sim10K to Cityscapes. Table 1 shows that SimROD achieved new SOTA results on both the target AP50 performance and the effective adaptation gain. We use two student models, S320 and S416, which share the Yolov5s architecture but use different input sizes of 320 and 416 pixels, to compare with prior methods that have comparable Source AP50 performance. For example, our S320 model achieves an adapted AP50 of 44.70% and an effective gain of 72.93%, compared to 43.80% and 35.34% for Coarse-to-Fine [38]. Similar results were observed when comparing our adapted S416 model with the FCOS model adapted with EPM [16]. Fig. 3 demonstrates the effectiveness of SimROD in adapting models from Sim10K to Cityscapes compared to prior baselines. Models adapted with SimROD reached up to 70-75% of the target AP performance that would be obtained if the model were trained on a fully labeled target dataset. In contrast, the baseline methods achieved only about 30% of their Oracle performance.
KITTI to Cityscapes benchmark. Table 2 shows the results of this experiment, where SimROD outperformed the baselines. With the S416 model, it achieves slightly higher AP50 performance than the best baseline PDA [17]. When using the medium size M416 model, SimROD also outperformed prior baselines with comparable Source AP50 performance namely SCDA [39] and EPM [16].

Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
---|---|---|---|---|---|---|---|---|
DAF [6] | F-RCNN | V | 30.20 | 38.50 | - | 8.30 | - | CVPR 2018 |
MAF [11] | F-RCNN | V | 30.20 | 41.00 | - | 10.80 | - | ICCV 2019 |
RLDA [22] | F-RCNN | I | 31.10 | 42.98 | 68.10 | 11.88 | 32.11 | ICCV 2019 |
PDA [17] | F-RCNN | V | 30.20 | 43.90 | 55.80 | 13.70 | 53.52 | WACV 2020 |
SimROD (self-adapt) | YOLOv5 | S416 | 31.61 | 35.94 | 56.15 | 4.33 | 17.65 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | S416 | 31.61 | 45.66 | 56.15 | 14.05 | 57.27 | Ours |
SCDA [39] | F-RCNN | V | 37.40 | 42.60 | - | 5.20 | - | CVPR 2019 |
EveryPixelMatters [16] | FCOS | R | 35.30 | 45.00 | 70.40 | 9.70 | 27.64 | ECCV 2020 |
SimROD (self adapt) | YOLOv5 | M416 | 36.09 | 42.94 | 59.29 | 6.85 | 29.51 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | M416 | 36.09 | 47.52 | 59.29 | 11.43 | 49.26 | Ours |
4.2 Cross-domain artistic benchmark
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
DAF [6] | F-RCNN | V | 39.80 | 34.30 | NA | -5.50 | NA | CVPR 2018 |
DAM [23] | F-RCNN | V | 39.80 | 52.00 | NA | 12.20 | NA | CVPR 2019 |
DeepAugment [12] | YOLOv5 | S416 | 37.46 | 45.19 | 56.07 | 7.73 | 41.54 | arXiv 2020 |
BN-Adapt [19] | YOLOv5 | S416 | 37.46 | 45.72 | 56.07 | 8.26 | 44.39 | NeurIPS 2020 |
Stylize [10] | YOLOv5 | S416 | 37.46 | 46.26 | 56.07 | 8.80 | 47.29 | arXiv 2019 |
STAC [31] | YOLOv5 | S416 | 37.46 | 49.83 | 56.07 | 12.37 | 66.47 | arXiv 2020 |
DT+PL [18] | YOLOv5 | S416 | 37.46 | 44.86 | 56.07 | 7.40 | 39.77 | CVPR 2018 |
SimROD (self-adapt) | YOLOv5 | S416 | 37.46 | 52.58 | 56.07 | 15.12 | 81.26 | Ours |
SimROD (teacher X416) | YOLOv5 | S416 | 37.46 | 55.55 | 56.07 | 18.09 | 97.21 | Ours |
ADDA [35] | SSD | V | 49.60 | 49.80 | 58.40 | 0.20 | 2.27 | CVPR 2017 |
DT+PL [18] | SSD | V | 49.60 | 54.30 | 58.40 | 4.70 | 53.41 | CVPR 2018 |
SWDA [27] | F-RCNN | V | 44.60 | 56.70 | 58.60 | 12.10 | 86.43 | CVPR 2019 |
DeepAugment [12] | YOLOv5 | M416 | 46.95 | 54.02 | 66.34 | 7.07 | 36.47 | arXiv 2020 |
BN-Adapt [19] | YOLOv5 | M416 | 46.95 | 55.75 | 66.34 | 8.80 | 45.39 | NeurIPS 2020 |
Stylize [10] | YOLOv5 | M416 | 46.95 | 55.24 | 66.34 | 8.29 | 42.76 | arXiv 2019 |
STAC [31] | YOLOv5 | M416 | 46.95 | 57.82 | 66.34 | 10.87 | 56.07 | arXiv 2020 |
DT+PL [18] | YOLOv5 | M416 | 46.95 | 49.14 | 66.34 | 2.19 | 11.30 | CVPR 2018 |
SimROD (self-adapt) | YOLOv5 | M416 | 46.95 | 60.08 | 66.34 | 13.13 | 67.72 | Ours |
SimROD (teacher X416) | YOLOv5 | M416 | 46.95 | 63.47 | 66.34 | 16.52 | 85.22 | Ours |
Datasets and metrics. The cross-domain artistic benchmark consists of three domain shifts where the source data is VOC07 trainval and the target domains are Clipart1k, Watercolor2k and Comic2k datasets [18]. We use the same benchmark metrics as in Sec. 4.1.
Results. Our method outperformed the baselines by significant margins. Compared to DT+PL [18], our method further improved the AP50 of the yolov5s model by absolute gains of +8.45, +12 and +10.69 % points on Clipart, Comic, and Watercolor respectively. While DT+PL outperformed the augmentation-based baselines on Clipart, it did slightly worse than STAC on Comic and Watercolor. Finally, SimROD was effective in adapting models of different sizes. Without generating synthetic data or using domain adversarial training, SimROD’s effective gain was consistently above 70% and could reach up to 97% when a large adapted teacher was used to refine the pseudo-labels.
In Table 3, we give a detailed benchmark for the VOC to Watercolor benchmark, from which we used 1000 unlabeled images as target data. In [2], we present detailed results on Clipart and Comic dataset as well as more ablation results when using extra unlabeled data for adaptation.

4.3 Image corruptions benchmark
Datasets. We evaluate our method’s robustness to image corruption using the standard benchmarks Pascal-C, COCO-C, and Cityscapes-C [25]. For Pascal-C, we used the VOC07 trainval split as the source training data. For COCO-C and Cityscapes-C, we divided the train split and used the first half as source training data. There are 15 corruption types for each dataset. To build the unlabeled target data, we applied each corruption type to VOC12 trainval (for Pascal-C) or to the second half of the COCO and Cityscapes train splits. Precisely, we applied each corruption type with medium severity to each image using the imagecorruptions library [25]. More details are given in [2].
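For reference, the corrupted images can be produced with the library's corrupt function. The snippet below reflects the public API of the imagecorruptions package as we understand it, with severity 3 standing in for the "medium" severity of the five levels; the wrapper name is ours.

```python
import numpy as np
from imagecorruptions import corrupt, get_corruption_names


def corrupt_image(image: np.ndarray, corruption_name: str, severity: int = 3) -> np.ndarray:
    """Apply one corruption type at a fixed severity to an HxWx3 uint8 image."""
    return corrupt(image, corruption_name=corruption_name, severity=severity)


# Looping over all corruption types to build per-corruption target sets:
# for name in get_corruption_names():
#     corrupted = corrupt_image(clean_image, name)
```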
Metrics. For the image corruption benchmark, we followed the evaluation protocol from [13, 25, 33] and measured the mean performance under corruption (mPC), the relative performance under corruption (rPC), and the relative robustness of the adapted model, averaged over corruption types:
$$\text{mPC} = \frac{1}{N_c}\sum_{c=1}^{N_c}\frac{1}{N_s}\sum_{s=1}^{N_s}\text{AP}_{c,s} \qquad (3)$$
$$\text{rPC} = \frac{\text{mPC}}{\text{AP}_{\text{clean}}} \times 100\% \qquad (4)$$
$$\rho = \frac{\text{mPC} - \text{mPC}_{\text{source}}}{\text{mPC}_{\text{oracle}} - \text{mPC}_{\text{source}}} \times 100\% \qquad (5)$$
where $\text{AP}_{c,s}$ denotes the average precision on the test data with corruption type $c$ and severity level $s$, $N_c$ is the number of corruption types, and $N_s$ the number of severity levels. The relative robustness $\rho$ quantifies the effect of adaptation on the performance under distribution shift (mPC): it is the mPC gain over the source model normalized by the Oracle’s gain. The tables also report the absolute gain $\Delta\text{mPC} = \text{mPC} - \text{mPC}_{\text{source}}$.
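Given a grid of AP values over corruption types and severities, these metrics reduce to simple averages and ratios. A small helper, with naming of our own, following Eqs. (3)-(5):

```python
import numpy as np


def corruption_metrics(ap_clean, ap_corrupted, mpc_source=None, mpc_oracle=None):
    """ap_corrupted has shape (num_corruption_types, num_severity_levels)."""
    mpc = float(np.mean(ap_corrupted))      # Eq. (3)
    rpc = 100.0 * mpc / ap_clean            # Eq. (4)
    rho = None
    if mpc_source is not None and mpc_oracle is not None:
        # Eq. (5): mPC gain over the source model, normalized by the Oracle's gain.
        rho = 100.0 * (mpc - mpc_source) / (mpc_oracle - mpc_source)
    return mpc, rpc, rho
```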
Baselines. We use the following baselines which were proposed to improve the robustness to image corruptions: Stylize [10], BN-Adapt [19], DeepAugment [12], STAC [31], and DT+PL [18]. Unless specified, we employed weak data augmentations such as RandomHorizontalFlip and RandomCrop for all baselines.
Main results. Tables 4, 5, and 6 show the results of the Yolov5m model for Pascal-C, COCO-C, and Cityscapes-C, respectively. We report the results with different model sizes in [2]. We used the large Yolov5x model as the teacher. An ablation study on Pascal-C is provided in Table 7 and is discussed later.
Method | AP_clean | mPC | rPC | ΔmPC | Rel. robustness (%)
---|---|---|---|---|---|
Source | 83.13 | 53.78 | 64.69 | 0.00 | 0 |
Stylize | 84.79 | 62.92 | 74.21 | 9.14 | 36.62 |
BN-Adapt | 83.01 | 64.60 | 77.82 | 10.82 | 43.35 |
DeepAugment | 85.05 | 64.88 | 76.28 | 11.10 | 44.47 |
STAC | 87.00 | 66.88 | 76.87 | 13.10 | 52.48 |
SimROD (ours) | 86.97 | 75.40 | 86.70 | 21.62 | 86.62 |
Oracle | 86.75 | 78.74 | 90.77 | 24.96 | 100 |
Unlabeled target samples improved robustness to image corruption. The source models suffered a performance drop due to image corruptions. By adapting the models with SimROD, the mean performance under corruption was significantly improved, by +21.62, +6.43, and +6.48 absolute percentage points on Pascal-C, COCO-C, and Cityscapes-C, respectively. Our method outperformed the Stylize, DeepAugment, and BN-Adapt baselines on all metrics. STAC, which also used unlabeled target samples, achieved the second-best performance. This shows that augmentation or batch norm adaptation alone is not sufficient to fix the domain shift for all possible corruptions. Instead, using unlabeled samples from the target domain is more effective for combating image corruptions.
Pseudo-label refinement ensured performance close to Oracle. Moreover, Tables 4, 5 and 6 show that the performance of our unsupervised method was close to that of the Oracle, which uses ground-truth labels for target domain data. This was possible because the adapted teacher produces highly accurate pseudo-labels, which could be used along with DomainMix augmentation to effectively adapt the student model.
Method | AP_clean | mPC | rPC | ΔmPC | Rel. robustness (%)
---|---|---|---|---|---|
Source | 36.85 | 22.03 | 59.78 | 0.00 | 0 |
Stylize | 35.75 | 23.82 | 66.63 | 1.79 | 22.02 |
BN-Adapt | 36.24 | 24.79 | 68.41 | 2.76 | 33.95 |
DeepAugment | 35.51 | 24.33 | 68.52 | 2.30 | 28.29 |
STAC | 36.76 | 24.80 | 67.46 | 2.77 | 34.07 |
SimROD (ours) | 36.79 | 28.46 | 77.36 | 6.43 | 79.09 |
Oracle | 36.23 | 30.16 | 83.25 | 8.13 | 100 |
Method | AP_clean | mPC | rPC | ΔmPC | Rel. robustness (%)
---|---|---|---|---|---|
Source | 19.48 | 11.53 | 59.19 | 0.00 | 0 |
Stylize | 21.77 | 14.62 | 67.16 | 3.09 | 25.81 |
DeepAugment | 20.28 | 14.79 | 72.93 | 3.26 | 27.23 |
STAC | 24.54 | 15.39 | 62.71 | 3.86 | 32.25 |
SimROD (ours) | 24.06 | 18.01 | 74.85 | 6.48 | 54.14 |
Oracle | 26.58 | 23.50 | 88.41 | 11.97 | 100 |
Method | TG | DMX | GA | FT | mPC | ΔmPC
---|---|---|---|---|---|---|
Source | 53.78 | 0.0 | ||||
BN-Adapt | ✓ | 64.60 | 10.8 | |||
BN-A + DMX | ✓ | ✓ | 66.78 | 13.0 | ||
SimROD w/o TG | ✓ | ✓ | ✓ | 71.81 | 18.0 | |
SimROD w/o GA | ✓ | ✓ | ✓ | 73.45 | 19.7 | |
SimROD | ✓ | ✓ | ✓ | ✓ | 75.40 | 21.7 |
Ablation Study. Next, we present an ablation study using the Yolov5m model on Pascal-C in Table 7 to gain some insight into the contributions of the three parts of our method. First, BN-Adapt improved the mean performance under corruption by 10.82% AP50. Applying DomainMix augmentation on top of BN-Adapt improved the performance by a further 2.18%. Next, the teacher-guided (TG) pseudo-label refinement was particularly useful for adapting small models. When using our full method, the performance increased by 10.8% compared to BN-Adapt. Compared to self-adaptation, TG improved the Yolov5m model's mPC by +3.7%. Finally, the gradual adaptation (GA) also played an important role in refining pseudo-labels and improving the model's robustness. For example, if we did not use GA and skipped the BN adaptation in the first phase, the performance dropped by 1.95% compared to the full method. Our method organically integrates these parts to tackle UDA for object detection. While the parts may appear simple, their synergy helps mitigate the challenging issues of domain shift and pseudo-label noise.
Qualitative analysis. Finally, we illustrate the effectiveness of our method by showing the pseudo-labels generated with our method on the unlabeled target training images of the Comic dataset. As seen in Figure 4(a), our method generated highly accurate pseudo-labels despite the domain shift. In contrast, STAC and DT+PL generated sparse labels since they failed to detect many objects. This difference carried over to the quality of the predictions on the test set, as shown in Figure 4(b).
5 Conclusion
We proposed a simple and effective unsupervised method for adapting detection models under domain shift. Our self-labeling framework gradually adapts the model using a new domain-centric augmentation method and teacher-guided fine-tuning. Our method achieved significant gains in model robustness compared to existing baselines, for both small and large models. Not only did our method mitigate the effect of domain shifts due to low-level image corruptions, but it could also adapt models to high-level stylistic differences between the source and target domains. Through ablation studies, we gained insights into why gradual adaptation works and how teacher-guided pseudo-label refinement helps adapt the models. We hope this simple method will guide future progress in robust object detection research.
References
- [1] (2021) Is robustness robust? on the interaction between augmentations and corruptions. In Submitted to International Conference on Learning Representations, Note: under review External Links: Link Cited by: §1, §2.
- [2] (2021) SimROD: A Simple Adaptation Method for Robust Object Detection. Note: Supplementary materials 8083_supplementary.zip Cited by: 2nd item, §4.2, §4.3, §4.3, Table 7, §4.
- [3] (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3722–3731. Cited by: §1, §2.
- [4] (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11457–11466. Cited by: §2, Table 1, Table 12.
- [5] (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354–7362. Cited by: §2.
- [6] (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348. Cited by: §1, §2, Table 1, Table 2, Table 3, §4, Table 12, Table 13, Table 14.
- [7] (2016) The cityscapes dataset for semantic urban scene understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: §4.1.
- [8] (2020) Toward robust pedestrian detection with data augmentation. IEEE Access 8, pp. 136674–136683. Cited by: §2.
- [9] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §4.1.
- [10] (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §2, §4.3, Table 3, Table 14.
- [11] (2019) Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6668–6677. Cited by: §1, §2, Table 1, Table 2, Table 12, Table 13.
- [12] (2020) The many faces of robustness: a critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241. Cited by: §1, §2, §4.3, Table 3, Table 14.
- [13] (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §2, §4.3.
- [14] (2019) Augmix: a simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781. Cited by: §1, §2, §9.3.
- [15] (2018) CyCADA: cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 10-15, 2018, pp. 1994–2003. Cited by: §2.
- [16] (2020) Every pixel matters: center-aware feature alignment for domain adaptive object detector. In European Conference on Computer Vision, pp. 733–748. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 12, Table 13.
- [17] (2020) Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 749–757. Cited by: §1, §2, §4.1, Table 2, Table 13.
- [18] (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5001–5009. Cited by: §1, §1, §2, §3.2.3, §3.2.3, §4.2, §4.2, §4.3, Table 3, §4, §8.1, §8.3, Table 14.
- [19] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456. Cited by: §2, §4.3, Table 3, Table 14.
- [20] (2020-06) Ultralytics/yolov5: v1.0 - initial release. Note: Zenodo Cited by: §4, §6.1, §6.1, §6.1.
- [21] (2017) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 746–753. Cited by: §4.1.
- [22] (2019) A robust learning approach to domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 480–490. Cited by: §1, §2, Table 1, Table 2, Table 12, Table 13.
- [23] (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §1, §2, Table 3, Table 14.
- [24] (2017) Revisiting batch normalization for practical domain adaptation. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, Cited by: §2, §3.2.3.
- [25] (2019) Benchmarking robustness in object detection: autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484. Cited by: §1, §1, §2, §4.3, §4.3, §4.
- [26] (2021) On interaction between augmentations and corruptions in natural corruption robustness. arXiv preprint arXiv:2102.11273. Cited by: §1.
- [27] (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §1, §2, Table 1, Table 3, Table 12, Table 14.
- [28] (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2488–2498. Cited by: §2.
- [29] (2020) Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems, Vol. 33, pp. 11539–11551. Cited by: §2, §3.2.3.
- [30] (2017) Guildai. Github. Note: https://github.com/guildai/guildai. Cited by: §6.1.
- [31] (2020) A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757. Cited by: §1, §2, §4.3, Table 3, §8.3, Table 14.
- [32] (2012) Machine learning in non-stationary environments - introduction to covariate shift adaptation. MIT Press. Cited by: §3.1.
- [33] (2020) Measuring robustness to natural distribution shifts in image classification. External Links: 2007.00644 Cited by: §1, §2, §4.3.
- [34] (2015-02) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. 42 (2), pp. 245–284. Cited by: §1.
- [35] (2017) Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2962–2971. External Links: Document Cited by: Table 3, Table 14.
- [36] (2020) Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828. Cited by: §2.
- [37] (2019) Multi-level domain adaptive learning for cross-domain detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: Table 1, Table 12.
- [38] (2020) Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13766–13775. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 12.
- [39] (2019) Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 687–696. Cited by: §1, §2, §4.1, Table 1, Table 2, Table 12, Table 13.
- [40] (2019) Object detection in 20 years: a survey. External Links: 1905.05055 Cited by: §1.
Supplementary materials
The following supplementary materials provide further details on training, on the results of the different benchmarks, and more qualitative analysis and visualizations.
6 Experiments setup
6.1 Training details and hyperparameters
We trained each model using a standard stochastic gradient descent (SGD) optimizer with momentum parameter 0.937 and weight decay
. We used a warm-up and cosine decay schedule for training. For the NMS parameters, we used an IoU threshold of and an object confidence threshold of . When generating pseudo-labels, we used a higher confidence threshold of 0.4. We used the model definitions from the initial release of YOLOv5 [20] with last commit id ‘364fcfd7d’. Finally, we used a generalized IoU (GIoU) loss for localization and a focal loss for the classification and objectness losses when training the models. To manage our experiments and make our results reproducible, we used the open-source tool Guildai
[30]. Most hyper-parameters (momentum, NMS, etc.) were set to the defaults of the YOLOv5 repo [20]. We tuned only the learning rate for each dataset. The hyperparameter values are configured in the ‘guild.yml’ file. For the gradual adaptation procedure, we use a large enough number of epochs in Phase 1 to ensure the convergence of BN adaptation. We use a separate validation set to keep track of the best checkpoint using the validation AP, and we initialize the Phase 2 training with the best checkpoint of Phase 1. It is also worth noting that our framework does not add new hyperparameters.
When training the COCO source models or the Stylize and DeepAugment baselines, we followed the training procedure in YOLOv5 [20]
and trained the model from scratch using 300 epochs and a learning rate of 0.01. For Pascal and Cityscapes the source models were obtained through transfer learning from COCO pretrained weights using
and epochs respectively. For that, we used a learning rate of and a batch size of . When applying our adaptation method, we also fine-tuned the source model using the same learning rate of , a batch size of , and 100 epochs for all models and target domains. We did not use multi-scale training to simplify our analysis. The same image input size was used during training, pseudo-label generation, and evaluation. For Sim10K/KITTI to Cityscapes, we specify the input size used to train each student and teacher model in our results. For the artistic benchmark, we use the same input size of 416 for both student and teacher models. For the image corruption benchmark, we used the same input size of 416 for Pascal-C and COCO-C, whereas we used a larger size of 640 for Cityscapes-C.
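For convenience, the training values explicitly stated in this subsection can be gathered into a single configuration sketch. The field names below are ours, and hyperparameters not reported in the text (e.g. weight decay, NMS thresholds, batch sizes) are deliberately left out.

```python
# Partial training configuration, restating only values given in the text above.
TRAINING_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.937,
    "lr_schedule": "warm-up + cosine decay",
    "pseudo_label_conf_thresh": 0.4,
    "coco_source_training": {"epochs": 300, "lr": 0.01, "init": "from scratch"},
    "adaptation": {"epochs": 100},
    "input_size": {"artistic": 416, "pascal_c": 416, "coco_c": 416, "cityscapes_c": 640},
    "losses": {"localization": "GIoU", "classification": "focal", "objectness": "focal"},
    "multi_scale_training": False,
}
```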
For the Stylize baseline, we applied only one style for each image to keep the dataset size the same, to ensure a fair comparison. We preserved the original image dimensions and disabled the cropping. Alpha was fixed to 1 to apply the highest strength of stylization.
6.2 More details on datasets
Tables 8, 9, 10, and 11 show a summary of the data splits that we used as the source (clean) split versus the target (stylized/augmented) split for each dataset. To make a fair comparison, we keep the total number of images in the entire training data the same for all methods.
For Pascal-C, COCO-C, and Cityscapes-C, we generated the corrupted test set by applying each corruption to the clean test with all five severity levels. For the cross-domain adaptation benchmark, we used the test split for Clipart, Watercolor, or Comic for measuring test AP on the target domain.
- Sim10k: we use the Sim10k dataset as the labeled source training data and the training set of Cityscapes as unlabeled target data. The validation set of Cityscapes was used as the target test set.
- KITTI: we use the training set of KITTI as the labeled source data and the training set of Cityscapes as unlabeled target data. The validation set of Cityscapes was used as the target test set.
- Clipart/Watercolor/Comic: the datasets used in the Source, DeepAugment, and Stylize baselines for Clipart/Watercolor/Comic are exactly the same as those used for Pascal-C. Other than this, the train sets of Clipart/Watercolor/Comic were used as the target domain datasets. In the DT+PL experiments, we first apply the domain transfer on the union of VOC2007 trainval and VOC2012 trainval. Then, we apply the DT step on the source model using the domain-transferred dataset. Finally, we apply the PL step on the output model of the DT step using the train split of Clipart/Watercolor/Comic. Note that we do not use the ground-truth labels but use the pseudo-labels instead.
- Pascal-C: we used VOC2007 trainval as the source and VOC2012 trainval as the target. For the DeepAugment baseline, we augmented VOC2012 train with the CAE method and VOC2012 val with the EDSR method. We used VOC2007 test as the clean test set.
- COCO-C: we split COCO train2017 into two approximately equal halves and used the first half as source and the second half as target. For DeepAugment, we divided COCO train2017 into three random splits and used them for the clean split, CAE split, and EDSR split, respectively. COCO val2017 was used as the clean test set.
- Cityscapes-C: we split the source domain and target domain by city names, choosing the cities so that source and target are of approximately equal size (the split is restated as a small dictionary after this list). Of all 18 cities in cityscapes-train, 9 cities (‘cologne’, ‘krefeld’, ‘bremen’, ‘darmstadt’, ‘hanover’, ‘aachen’, ‘stuttgart’, ‘jena’, and ‘tubingen’) were used as source data; the other 9 cities (‘bochum’, ‘ulm’, ‘monchengladbach’, ‘weimar’, ‘strasbourg’, ‘zurich’, ‘hamburg’, ‘dusseldorf’, and ‘erfurt’) were used as target data. When training the DeepAugment baseline for Cityscapes, we further split the target domain into two splits: the first split, containing ‘zurich’, ‘weimar’, ‘erfurt’, and ‘strasbourg’, was augmented with the CAE method, while the second split, containing ‘bochum’, ‘ulm’, ‘monchengladbach’, ‘hamburg’, and ‘dusseldorf’, was augmented with the EDSR method. The validation set of Cityscapes was used as the clean test set.
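The city-level split for Cityscapes-C can be written down directly; the dictionary below merely restates the lists given in the last item, with a naming of our own.

```python
CITYSCAPES_C_SPLIT = {
    "source": ["cologne", "krefeld", "bremen", "darmstadt", "hanover",
               "aachen", "stuttgart", "jena", "tubingen"],
    "target": ["bochum", "ulm", "monchengladbach", "weimar", "strasbourg",
               "zurich", "hamburg", "dusseldorf", "erfurt"],
    # Target sub-splits used only when training the DeepAugment baseline:
    "target_cae": ["zurich", "weimar", "erfurt", "strasbourg"],
    "target_edsr": ["bochum", "ulm", "monchengladbach", "hamburg", "dusseldorf"],
}
```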
Method | Source / Clean split (size) | Target / Augmented split (size) |
---|---|---|
Source | VOC2007-trainval (5011) | N/A |
DeepAugment | VOC2007-trainval (5011) | CAE VOC2012-train (5717) + EDSR VOC2012-val (5823) |
Stylize | VOC2007-trainval (5011) | stylized VOC2012-trainval (11540) |
BN-Adapt | VOC2007-trainval (5011) | VOC2012-trainval (11540) |
STAC | VOC2007-trainval (5011) | VOC2012-trainval (11540) |
SimROD (Ours) | VOC2007-trainval (5011) | VOC2012-trainval (11540) |
Method | Source / Clean split (size) | Target / Augmented split (size) |
---|---|---|
Source | coco-train2017/first half (58458) | N/A |
DeepAugment | coco-train2017/first 1/3 (39088) | CAE second 1/3 (39088) + EDSR third 1/3 (39090) |
Stylize | coco-train2017/first half (58458) | stylized coco-train2017/second half (58808) |
BN-Adapt | coco-train2017/first half (58808) | coco-train2017/second half (58808) |
STAC | coco-train2017/first half (58458) | coco-train2017/second half (58808) |
SimROD (Ours) | coco-train2017/first half (58458) | coco-train2017/second half (58808) |
Method | Source / Clean split (size) | Target / Augmented split (size) |
---|---|---|
Source | cityscapes-train/first half (1483) | N/A |
DeepAugment | cityscapes-train/first half (1483) | CAE train/second half-split 1 (732) + EDSR train/second half-split 2 (750) |
Stylize | cityscapes-train/first half (1483) | stylized cityscapes-train/second half (1482) |
Bn_only | cityscapes-train/first half (1483) | cityscapes-train/second half (1482) |
Stac | cityscapes-train/first half (1483) | cityscapes-train/second half (1482) |
Ours w/o TG | cityscapes-train/first half (1483) | cityscapes-train/second half (1482) |
Ours | cityscapes-train/first half (1483) | cityscapes-train/second half (1482) |
Method | Source / Clean split (size) | Target / Augmented split (size) |
---|---|---|
Source | VOC2007-trainval (5011) | N/A |
DeepAugment | VOC2007-trainval (5011) | CAE VOC2012-train (5717) + EDSR VOC2012-val (5823) |
Stylize | VOC2007-trainval (5011) | stylized VOC2012-trainval (11540) |
Bn_only | VOC2007-trainval (5011) | clipart/watercolor/comic-train (500/1000/1000) |
Stac | VOC2007-trainval (5011) | clipart/watercolor/comic-train (500/1000/1000) |
Ours w/o TG | VOC2007-trainval (5011) | clipart/watercolor/comic-train (500/1000/1000) |
Ours | VOC2007-trainval (5011) | clipart/watercolor/comic-train (500/1000/1000) |
7 More results on synthetic-to-real and cross-camera benchmarks
7.1 Full results on Sim10K/KITTI to Cityscapes
Tables 12 and 13 expand on the results reported in Tables 1 and 2, respectively. In particular, they show the performance of the teacher models and that of models adapted with the smaller teacher model X640.
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
---|---|---|---|---|---|---|---|---|
DAF [6] | F-RCNN | V | 30.10 | 39.00 | - | 8.90 | - | CVPR 2018 |
MAF [11] | F-RCNN | V | 30.10 | 41.10 | - | 11.00 | - | ICCV 2019 |
RLDA [22] | F-RCNN | I | 31.08 | 42.56 | 68.10 | 11.48 | 31.01 | ICCV 2019 |
SCDA [39] | F-RCNN | V | 34.00 | 43.00 | - | 9.00 | - | CVPR 2019 |
MDA [37] | F-RCNN | V | 34.30 | 42.80 | - | 8.50 | - | ICCV 2019 |
SWDA [27] | F-RCNN | V | 34.60 | 42.30 | - | 7.70 | - | CVPR 2019 |
Coarse-to-Fine [38] | F-RCNN | V | 35.00 | 43.80 | 59.90 | 8.80 | 35.34 | CVPR 2020 |
SimROD (self-adapt) | YOLOv5 | S320 | 33.62 | 38.73 | 48.81 | 5.11 | 33.66 | Ours |
SimROD (w. teacher X640) | YOLOv5 | S320 | 33.62 | 44.70 | 48.81 | 11.08 | 72.93 | Ours |
MTOR [4] | F-RCNN | R | 39.40 | 46.60 | - | 7.20 | - | CVPR 2019 |
EveryPixelMatters [16] | FCOS | V | 39.80 | 49.00 | 69.70 | 9.20 | 30.77 | ECCV 2020 |
SimROD (self adapt) | YOLOv5 | S416 | 39.57 | 44.21 | 56.49 | 4.63 | 27.37 | Ours |
SimROD (w. teacher X640) | YOLOv5 | S416 | 39.57 | 51.68 | 56.49 | 12.10 | 71.53 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | S416 | 39.57 | 52.05 | 56.49 | 12.47 | 73.73 | Ours |
SimROD (self-adapt) | YOLOv5 | M640 | 55.86 | 60.29 | 71.05 | 4.43 | 29.16 | Ours |
SimROD (w. teacher X640) | YOLOv5 | M640 | 55.86 | 62.18 | 71.05 | 6.33 | 41.64 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | M640 | 55.86 | 64.40 | 71.05 | 8.54 | 56.24 | Ours |
SimROD (self-adapt) | YOLOv5 | X640 | 60.34 | 63.27 | 72.51 | 2.93 | 24.09 | Ours |
SimROD (self-adapt) | YOLOv5 | X1280 | 71.66 | 75.94 | 82.90 | 4.28 | 38.08 | Ours |
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
---|---|---|---|---|---|---|---|---|
DAF [6] | F-RCNN | V | 30.20 | 38.50 | - | 8.30 | - | CVPR 2018 |
MAF [11] | F-RCNN | V | 30.20 | 41.00 | - | 10.80 | - | ICCV 2019 |
RLDA [22] | F-RCNN | I | 31.10 | 42.98 | 68.10 | 11.88 | 32.11 | ICCV 2019 |
PDA [17] | F-RCNN | V | 30.20 | 43.90 | 55.80 | 13.70 | 53.52 | WACV 2020 |
SimROD (self-adapt) | YOLOv5 | S416 | 31.61 | 35.94 | 56.15 | 4.33 | 17.65 | Ours |
SimROD (w. teacher X640) | YOLOv5 | S416 | 31.61 | 43.55 | 56.15 | 11.94 | 48.66 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | S416 | 31.61 | 45.66 | 56.15 | 14.05 | 57.27 | Ours |
SCDA [39] | F-RCNN | V | 37.40 | 42.60 | - | 5.20 | - | CVPR 2019 |
EveryPixelMatters [16] | FCOS | R | 35.30 | 45.00 | 70.40 | 9.70 | 27.64 | ECCV 2020 |
SimROD (self adapt) | YOLOv5 | M416 | 36.09 | 42.94 | 59.29 | 6.85 | 29.51 | Ours |
SimROD (w. teacher X640) | YOLOv5 | M416 | 36.09 | 45.29 | 59.29 | 9.19 | 39.64 | Ours |
SimROD (w. teacher X1280) | YOLOv5 | M416 | 36.09 | 47.52 | 59.29 | 11.43 | 49.26 | Ours |
SimROD (self-adapt) | YOLOv5 | X640 | 45.67 | 50.81 | 72.18 | 5.14 | 19.38 | Ours |
SimROD (self-adapt) | YOLOv5 | X1280 | 52.07 | 58.25 | 82.50 | 6.18 | 20.31 | Ours |
7.2 Qualitative visualization
In Figure 5, we present qualitative detection results of the S416 model (i.e., Yolov5s with input size 416) to demonstrate the improvement brought by SimROD over the source model. Comparing with the ground-truth labels, Figure 5 shows that the adapted model detects most objects with good accuracy, except for some highly occluded ones.

8 More results on artistic benchmark
8.1 Benchmark results on Clipart and Comic
We include the benchmark results for Clipart and Comic in Tables 14 and 15, respectively. We used only 500 unlabeled images from the target domain for Clipart and 1000 images for Comic. Similar to the results for Watercolor in Table 3, our method SimROD outperformed the baselines when compared with models that achieve the same Source AP performance. Compared to DT+PL [18], our method further improved the AP50 of the S416 model by absolute 8.35, 12, and 10.69 percentage points on Clipart, Comic, and Watercolor, respectively. In addition, SimROD consistently achieves high effective adaptation gains between 70% and 97% across model sizes and benchmarks.
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
ADDA [35] | SSD | V | 26.80 | 27.40 | 55.40 | 0.60 | 2.10 | CVPR 2017 |
DT+PL [18] | SSD | V | 26.80 | 46.00 | 55.40 | 19.20 | 67.13 | CVPR 2018 |
DAF [6] | F-RCNN | V | 26.20 | 22.40 | 50.00 | -3.80 | -15.97 | CVPR 2018 |
DT+PL [18] | F-RCNN | V | 26.20 | 34.90 | 50.00 | 8.70 | 36.55 | CVPR 2018 |
SWDA [27] | F-RCNN | V | 27.80 | 38.10 | 50.00 | 10.30 | 46.40 | CVPR 2019 |
DAM [23] | F-RCNN | V | 24.90 | 41.80 | 50.00 | 16.90 | 67.33 | CVPR 2018 |
DeepAugment [12] | YOLOv5 | S416 | 29.32 | 31.65 | 56.07 | 2.33 | 8.71 | arXiv 2020 |
BN-Adapt [19] | YOLOv5 | S416 | 29.32 | 37.43 | 56.07 | 8.11 | 30.32 | NeurIPS 2020 |
Stylize [10] | YOLOv5 | S416 | 29.32 | 38.80 | 56.07 | 9.48 | 35.44 | arXiv 2019 |
STAC [31] | YOLOv5 | S416 | 29.32 | 39.64 | 56.07 | 10.32 | 38.58 | arXiv 2020 |
DT+PL [18] | YOLOv5 | S416 | 29.32 | 39.49 | 56.07 | 10.17 | 38.02 | CVPR 2018 |
SimROD (self-adapt) | YOLOv5 | S416 | 29.32 | 41.28 | 56.07 | 11.96 | 44.72 | Ours |
SimROD (teacher X416) | YOLOv5 | S416 | 29.32 | 47.84 | 56.07 | 18.52 | 69.24 | Ours |
Method | Arch. | Backbone | Source | AP50 | Oracle | Gain | Eff. gain (%) | Reference
ADDA | SSD | V | 24.90 | 23.80 | 46.40 | -1.10 | -5.12 | CVPR 2017 |
DT | SSD | V | 24.90 | 29.80 | 46.40 | 4.90 | 22.79 | CVPR 2018 |
DT+PL | SSD | V | 24.90 | 37.20 | 46.40 | 12.30 | 57.21 | CVPR 2018 |
DAF | F-RCNN | V | 21.40 | 23.20 | - | 1.80 | - | CVPR 2018 |
DT | F-RCNN | V | 21.40 | 29.80 | - | 8.40 | - | CVPR 2018 |
SWDA | F-RCNN | V | 21.40 | 28.40 | - | 7.00 | - | CVPR 2019 |
DAM | F-RCNN | V | 21.40 | 34.50 | - | 13.10 | - | CVPR 2019 |
DeepAugment | YOLOv5 | S416 | 18.19 | 21.39 | 39.81 | 3.20 | 14.80 | arXiv 2020 |
BN-Adapt | YOLOv5 | S416 | 18.19 | 25.53 | 39.81 | 7.34 | 33.95 | NeurIPS 2020 |
Stylize | YOLOv5 | S416 | 18.19 | 27.57 | 39.81 | 9.38 | 43.39 | arXiv 2019 |
STAC | YOLOv5 | S416 | 18.19 | 26.40 | 39.81 | 8.21 | 37.97 | arXiv 2020 |
DT+PL | YOLOv5 | S416 | 18.19 | 25.66 | 39.81 | 7.47 | 34.55 | CVPR 2018 |
SimROD (self-adapt) | YOLOv5 | S416 | 18.19 | 29.54 | 39.81 | 11.35 | 52.50 | Ours |
SimROD (teacher X416) | YOLOv5 | S416 | 18.19 | 37.65 | 39.81 | 19.46 | 90.01 | Ours |
DeepAugment | YOLOv5 | M416 | 23.58 | 27.65 | 49.13 | 4.07 | 15.93 | arXiv 2020 |
BN-Adapt | YOLOv5 | M416 | 23.58 | 32.04 | 49.13 | 8.46 | 33.11 | NeurIPS 2020 |
Stylize | YOLOv5 | M416 | 23.58 | 34.56 | 49.13 | 10.98 | 42.97 | arXiv 2019 |
STAC | YOLOv5 | M416 | 23.58 | 32.76 | 49.13 | 9.18 | 35.93 | arXiv 2020 |
DT+PL | YOLOv5 | M416 | 23.58 | 33.53 | 49.13 | 9.95 | 38.94 | CVPR 2018 |
SimROD (self-adapt) | YOLOv5 | M416 | 23.58 | 37.93 | 49.13 | 14.35 | 56.15 | Ours |
SimROD (teacher X416) | YOLOv5 | M416 | 23.58 | 42.08 | 49.13 | 18.50 | 72.41 | Ours |
8.2 Data efficiency analysis on Watercolor and Comic
Next, we analyze the data efficiency of SimROD by increasing the size of the unlabeled data used to adapt the models. For Watercolor and Comic, we used the extra splits, which contain an additional 52.8K and 17.8K unlabeled images, respectively. Moreover, all models use the same input size of 416. Figures 6 and 7 compare the performance of SimROD with the two pseudo-labeling baselines (STAC and DT+PL) on Watercolor and Comic, respectively. All methods improved when using more unlabeled data from the target domain. For example, SimROD improves the Yolov5s model performance by an absolute +3.23% and +4.69% on Watercolor and Comic, respectively.
Nonetheless, SimROD could outperform the baseline methods without using the extra data for the Yolov5s and Yolov5m models, which are adapted using the self-adapted Yolov5x teacher. In other words, our proposed method used only 1000 unlabeled images and still outperformed the baselines, which used 50x or 18x more data. For example, our method achieved an AP50 of 42.34% on Yolov5s, whereas the best baseline on Yolov5m reached an AP50 of only 37.79%.


8.3 Qualitative comparison on Clipart, Comic and Watercolor
In Figures 8 and 9, we provide qualitative comparisons with the pseudo-labeling baselines (STAC [31] and DT+PL [18]) and the DeepAugment method, all using the same Yolov5s model. These comparisons illustrate the simplicity and effectiveness of SimROD. Our proposed DomainMix augmentation and teacher-guided gradual adaptation make it possible to leverage unlabeled target data and to mitigate label noise and domain shift. In contrast to DT+PL, SimROD does not need to generate a synthetic intermediate dataset, and our proposed augmentation is much simpler than DeepAugment.


9 More results on image corruptions
9.1 Results for different model sizes
Tables 4, 5, and 6 show only the results for the Yolov5m model on Pascal-C, COCO-C, and Cityscapes-C, respectively. In Tables 16, 17, and 18, we show that SimROD consistently achieves higher performance than the baselines across different model sizes and benchmarks. As expected, larger models provide extra capacity and thus higher performance.
Method | Clean AP50 | mPC | rPC | Rel. robustness
yolov5s | ||||
Source | 75.87 | 42.38 | 55.86 | 0.00 |
Stylize | 77.26 | 52.12 | 67.46 | 9.74 |
BN-Adapt | 74.71 | 53.75 | 71.94 | 11.37 |
DeepAugment | 77.89 | 55.42 | 71.15 | 13.04 |
STAC | 80.11 | 56.12 | 70.05 | 13.74 |
SimROD (Ours) | 80.08 | 67.95 | 84.85 | 25.57 |
Supervised training | 80.44 | 71.18 | 88.49 | 28.80 |
yolov5m | ||||
Source | 83.13 | 53.78 | 64.69 | 0.00 |
Stylize | 84.79 | 62.92 | 74.21 | 9.14 |
BN-Adapt | 83.01 | 64.60 | 77.82 | 10.82 |
DeepAugment | 85.05 | 64.88 | 76.28 | 11.10 |
STAC | 87.00 | 66.88 | 76.88 | 13.11 |
SimROD (Ours) | 86.97 | 75.40 | 86.70 | 21.63 |
Supervised training | 86.75 | 78.74 | 90.76 | 24.96 |
yolov5x | ||||
Source | 87.42 | 62.84 | 71.88 | 0.00 |
Stylize | 87.29 | 69.60 | 79.73 | 6.76 |
BN-Adapt | 86.59 | 71.59 | 82.68 | 8.75 |
DeepAugment | 87.78 | 72.15 | 82.19 | 9.31 |
STAC | 89.57 | 73.68 | 82.25 | 10.84 |
SimROD (Ours) | 89.24 | 78.48 | 87.95 | 15.64 |
Supervised training | 88.88 | 82.56 | 92.89 | 19.72 |
Method | Clean AP50 | mPC | rPC | Rel. robustness
yolov5s | ||||
Source | 31.35 | 17.68 | 56.40 | 0.00 |
Stylize | 30.07 | 18.99 | 63.15 | 1.31 |
BN-Adapt | 30.91 | 20.09 | 64.99 | 2.40 |
DeepAugment | 30.37 | 19.87 | 65.44 | 2.19 |
STAC | 31.25 | 20.00 | 64.02 | 2.32 |
SimROD (Ours) | 31.21 | 23.94 | 76.71 | 6.26 |
Supervised training | 30.90 | 25.33 | 81.99 | 7.65 |
yolov5m | ||||
Source | 36.85 | 22.03 | 59.79 | 0.00 |
Stylize | 35.75 | 23.82 | 66.63 | 1.79 |
BN-Adapt | 36.24 | 24.79 | 68.39 | 2.76 |
DeepAugment | 35.51 | 24.33 | 68.52 | 2.30 |
STAC | 36.76 | 24.80 | 67.46 | 2.77 |
SimROD (Ours) | 36.79 | 28.46 | 77.36 | 6.43 |
Supervised training | 36.23 | 30.16 | 83.26 | 8.13 |
yolov5x | ||||
Source | 41.61 | 26.60 | 63.93 | 0.00 |
Stylize | 40.38 | 28.16 | 69.73 | 1.56 |
BN-Adapt | 41.70 | 29.77 | 71.40 | 3.17 |
DeepAugment | 41.12 | 29.13 | 70.84 | 2.53 |
STAC | 41.85 | 29.69 | 70.93 | 3.09 |
SimROD (Ours) | 41.63 | 31.87 | 76.57 | 5.27 |
Supervised training | 41.06 | 34.84 | 84.86 | 8.24 |
Method | Clean AP50 | mPC | rPC | Rel. robustness
---|---|---|---|---|
yolov5s | ||||
Source | 17.08 | 9.50 | 55.62 | 0.00 |
Stylize | 18.96 | 11.75 | 61.97 | 2.25 |
DeepAugment | 17.24 | 11.39 | 66.07 | 1.89 |
STAC | 20.34 | 12.82 | 63.02 | 3.32 |
SimROD (Ours) | 19.82 | 14.95 | 75.45 | 5.45 |
Supervised training | 22.30 | 19.35 | 86.77 | 9.85 |
yolov5m | ||||
Source | 19.48 | 11.53 | 59.19 | 0.00 |
Stylize | 21.77 | 14.62 | 67.16 | 3.09 |
DeepAugment | 20.28 | 14.79 | 72.93 | 3.26 |
STAC | 24.54 | 15.39 | 62.71 | 3.86 |
SimROD (Ours) | 24.06 | 18.01 | 74.86 | 6.48 |
Supervised training | 26.58 | 23.50 | 88.43 | 11.97 |
yolov5x | ||||
Source | 25.65 | 16.63 | 64.83 | 0.00 |
Stylize | 27.70 | 19.38 | 69.96 | 2.75 |
DeepAugment | 25.12 | 18.80 | 74.84 | 2.17 |
STAC | 29.62 | 20.98 | 70.85 | 4.35 |
SimROD (Ours) | 29.27 | 21.70 | 74.15 | 5.07 |
Supervised training | 31.48 | 27.66 | 87.87 | 11.03 |
9.2 Per-corruption performance on Pascal-C
In the main paper, we reported the mPC, rPC, and relative robustness metrics, which were averaged over the 15 corruption types. Here, in Tables 19, 20, and 21, we provide a breakdown of the results for each corruption type on the Pascal-C dataset for the three YOLOv5 models.
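For reference, the following is a small sketch (variable names are ours) of how the mPC, rPC, and relative robustness values reported in these tables can be derived from per-corruption AP50 scores.

```python
import numpy as np

def robustness_metrics(clean_ap50, ap50_per_corruption, source_mpc):
    """Derive the table metrics from per-corruption AP50 scores.

    ap50_per_corruption: AP50 for each of the 15 corruption types
    (already averaged over severity levels).
    """
    mpc = float(np.mean(ap50_per_corruption))  # mean performance under corruption
    rpc = 100.0 * mpc / clean_ap50             # relative performance under corruption (%)
    rel_robustness = mpc - source_mpc          # mPC gain over the unadapted source model
    return mpc, rpc, rel_robustness

# Sanity check with the YOLOv5s Source row of Table 16: a clean AP50 of 75.87
# and an mPC of 42.38 give rPC ~= 55.86 and a relative robustness of 0.00.
```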
Noise | Blur | Weather | Digital | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Clean AP50 | mPC | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG
Source | 75.87 | 42.38 | 32.71 | 35.32 | 28.24 | 43.02 | 32.96 | 39.87 | 29.05 | 37.09 | 43.53 | 59.66 | 69.21 | 42.00 | 47.04 | 46.53 | 49.48 |
Stylize | 77.26 | 52.12 | 41.51 | 44.61 | 37.82 | 49.80 | 48.02 | 47.37 | 35.79 | 49.53 | 57.37 | 67.55 | 74.07 | 51.69 | 59.10 | 56.77 | 60.84 |
DeepAugment | 77.89 | 55.42 | 50.48 | 53.12 | 48.67 | 55.38 | 49.23 | 48.87 | 37.58 | 49.73 | 58.19 | 70.29 | 74.91 | 56.88 | 51.61 | 63.39 | 62.99 |
BN Adapt | 74.71 | 53.75 | 48.07 | 51.22 | 46.00 | 53.23 | 44.34 | 48.60 | 38.63 | 50.56 | 55.80 | 68.50 | 73.34 | 57.18 | 59.32 | 52.86 | 58.55 |
STAC | 80.11 | 56.12 | 46.85 | 49.78 | 44.08 | 58.41 | 45.38 | 51.99 | 41.68 | 53.39 | 59.80 | 74.01 | 78.91 | 59.85 | 61.76 | 56.14 | 59.78 |
SimROD (Ours) | 80.08 | 67.95 | 64.91 | 66.11 | 65.28 | 65.12 | 63.03 | 65.54 | 53.99 | 69.19 | 69.27 | 76.85 | 79.14 | 71.38 | 73.52 | 65.54 | 70.34 |
Oracle | 80.44 | 71.18 | 68.28 | 69.14 | 68.10 | 68.18 | 67.84 | 69.77 | 62.19 | 71.41 | 71.26 | 77.49 | 79.95 | 73.41 | 75.90 | 70.40 | 74.41 |
Noise | Blur | Weather | Digital | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Clean AP50 | mPC | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG
Source | 83.13 | 53.78 | 47.44 | 51.35 | 44.98 | 53.87 | 42.17 | 48.61 | 36.64 | 51.77 | 56.29 | 71.74 | 78.82 | 55.81 | 56.43 | 54.52 | 56.17 |
Stylize | 84.79 | 62.92 | 53.44 | 57.56 | 52.62 | 60.18 | 57.42 | 57.53 | 45.32 | 63.02 | 67.50 | 78.02 | 81.91 | 65.64 | 69.69 | 66.10 | 67.86 |
DeepAugment | 85.05 | 64.88 | 61.75 | 64.06 | 60.64 | 63.74 | 57.95 | 56.18 | 44.75 | 62.31 | 68.27 | 79.36 | 82.69 | 68.34 | 61.92 | 71.40 | 69.78 |
BN Adapt | 83.01 | 64.60 | 61.06 | 63.83 | 60.54 | 62.33 | 55.29 | 58.77 | 46.71 | 65.44 | 67.88 | 78.34 | 81.62 | 69.48 | 68.81 | 62.15 | 66.75 |
STAC | 87.00 | 66.88 | 61.46 | 64.77 | 60.73 | 67.17 | 55.54 | 61.35 | 49.57 | 68.41 | 71.20 | 82.52 | 85.90 | 71.83 | 69.92 | 65.61 | 67.25 |
SimROD (Ours) | 86.97 | 75.40 | 72.00 | 74.11 | 73.01 | 72.65 | 70.25 | 72.85 | 60.65 | 77.81 | 77.47 | 84.03 | 86.17 | 79.66 | 80.49 | 72.54 | 77.36 |
Oracle | 86.75 | 78.74 | 76.35 | 76.68 | 76.42 | 75.63 | 75.12 | 77.10 | 70.31 | 80.07 | 79.56 | 84.25 | 86.15 | 80.60 | 82.88 | 78.73 | 81.22 |
Noise | Blur | Weather | Digital | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Clean AP50 | mPC | Gauss. | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG
Source | 87.42 | 62.84 | 59.30 | 61.06 | 58.38 | 61.45 | 51.43 | 58.48 | 41.79 | 63.82 | 66.34 | 77.28 | 84.25 | 65.77 | 64.40 | 63.07 | 65.79 |
Stylize | 87.29 | 69.60 | 62.44 | 65.03 | 62.20 | 67.57 | 63.64 | 65.13 | 50.84 | 70.44 | 74.10 | 82.44 | 85.30 | 74.16 | 74.73 | 72.76 | 73.25 |
DeepAugment | 87.78 | 72.15 | 71.25 | 73.27 | 71.16 | 71.40 | 64.70 | 65.57 | 49.76 | 71.42 | 74.91 | 84.17 | 86.43 | 77.48 | 68.75 | 77.16 | 74.88 |
BN Adapt | 86.59 | 71.59 | 71.05 | 72.63 | 70.94 | 67.90 | 63.70 | 66.55 | 52.41 | 72.79 | 73.91 | 82.38 | 84.62 | 76.03 | 74.96 | 70.51 | 73.43 |
STAC | 89.57 | 73.68 | 71.77 | 73.40 | 71.71 | 72.57 | 64.51 | 69.37 | 52.81 | 76.21 | 77.68 | 85.04 | 88.40 | 78.87 | 75.92 | 72.69 | 74.23 |
SimROD (Ours) | 89.24 | 78.48 | 76.09 | 78.31 | 77.23 | 75.85 | 73.11 | 75.29 | 62.75 | 81.10 | 80.96 | 86.62 | 88.16 | 82.94 | 82.45 | 76.64 | 79.69 |
Oracle | 88.88 | 82.56 | 81.14 | 81.96 | 81.27 | 79.10 | 79.08 | 80.65 | 73.97 | 83.58 | 83.66 | 87.18 | 88.54 | 84.03 | 85.55 | 84.09 | 84.57 |
9.3 Performance comparison with Augmix
Here, we compare our proposed method with the Augmix augmentation [14] and report the results on Pascal-C in Tables 22 and 23 for the YOLOv5s and YOLOv5x models, respectively. When comparing the mean performance under corruption (mPC), we can see that Augmix performed the worst among all augmentation-based baselines. Interestingly, applying Augmix on top of DeepAugment improved the performance of DeepAugment by +3.3% AP50 and +1.03% AP50 on the YOLOv5s and YOLOv5x models, respectively. Nonetheless, SimROD still outperformed DeepAugment+Augmix by more than +5% AP50 on both models. Although we have not tried it, applying Augmix on top of DomainMix might further improve the performance of our proposed method.
Method | Clean AP50 | mPC
---|---|---|
Source | 75.87 | 42.38 |
Augmix | 79.42 | 46.94 |
Stylize | 77.26 | 52.12 |
DeepAugment | 77.89 | 55.42 |
DeepAugment+Augmix | 80.85 | 60.15 |
SimROD (Ours) | 80.08 | 67.95 |
Method | Clean AP50 | mPC
---|---|---|
Augmix | 87.46 | 62.31 |
Source | 87.42 | 62.84 |
Stylize | 87.29 | 69.60 |
DeepAugment | 87.78 | 72.15 |
DeepAugment+Augmix | 88.36 | 73.18 |
SimROD (Ours) | 89.24 | 78.48 |
9.4 Data efficiency analysis on Pascal-C
In Figures 10 and 11, we analyze the data efficiency of our proposed method using a YOLOv5s model on the Pascal-C dataset. For this analysis, we used subsets of the training data and considered two scenarios. For both scenarios, we randomly generated three different subsets and measured the performance over three runs. The averages over the three runs are plotted with error bars in Figures 10 and 11.
In the first scenario, we used all the available labeled data from the source domain, consisting of 5,011 images, but only a portion of the available unlabeled target images. As shown in Figure 10, our proposed method outperformed STAC by a margin of 10% AP50. Moreover, our method achieved a relative robustness of +21.75% AP50 and +16.61% AP50 using only 10% and 1% of the unlabeled target-domain images, respectively. Since the data was imbalanced in this scenario, we also considered applying weighted balanced sampling to STAC. Figure 10 shows that it slightly improved the performance of STAC when the datasets were very imbalanced.
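As an illustration of such weighted balanced sampling when source and target data are imbalanced, the sketch below builds a PyTorch sampler that draws source and target images with equal probability; the dataset layout and batch size are assumptions on our part.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def balanced_loader(source_ds, target_ds, batch_size=32):
    # Weight each sample inversely to its domain size so that, in expectation,
    # half of every batch comes from the (large) source set and half from the
    # (small) target set.
    dataset = ConcatDataset([source_ds, target_ds])
    weights = torch.cat([
        torch.full((len(source_ds),), 1.0 / len(source_ds)),
        torch.full((len(target_ds),), 1.0 / len(target_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```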
In the second scenario, we used only a given percentage of the available training data for both the source and target domains. While this scenario keeps the two domains balanced, the total number of training images is much smaller than in the previous scenario. For example, using 1% of the training data corresponds to a total of 165 images. With 1% of the training data, STAC could not adapt the model. In contrast, our proposed method provided a relative robustness of +4.54% AP50 and +18.28% AP50 using only 1% and 10% of the training data, respectively.
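For completeness, here is a minimal sketch of how such matched-percentage subsets could be drawn from the two domains; the fraction argument and the seeding are illustrative, not our exact protocol.

```python
import torch
from torch.utils.data import Subset

def matched_subsets(source_ds, target_ds, fraction, seed=0):
    # Draw the same fraction of images from each domain so the two stay balanced.
    g = torch.Generator().manual_seed(seed)
    n_src = max(1, int(fraction * len(source_ds)))
    n_tgt = max(1, int(fraction * len(target_ds)))
    src_idx = torch.randperm(len(source_ds), generator=g)[:n_src].tolist()
    tgt_idx = torch.randperm(len(target_ds), generator=g)[:n_tgt].tolist()
    return Subset(source_ds, src_idx), Subset(target_ds, tgt_idx)
```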
These results confirm that our method was more data-efficient. In particular, our DomainMix augmentation could produce a diverse set of mixed samples even from very few training images from both domains. When more unlabeled data was available, our method could further leverage the unlabeled data and provide strong supervision for adaptation by mitigating the label noise.


9.5 Effects of corruption severity levels
To apply our method on the image corruption benchmarks, we used a corruption severity level of 3 when creating the unlabeled target-domain images. In this section, we present an additional analysis of how the corruption severity of the training images affects the test performance. Figures 12 and 13 show the relative robustness and the mean performance under corruption (mPC) of a YOLOv5s model adapted with our method. Similarly, Figures 14 and 15 show the same metrics for an adapted YOLOv5x model.
The corruption types are sorted in ascending order of the source model's performance on them. For instance, the source models achieved the highest mPC on fog and the lowest mPC on impulse noise. This explains why the relative robustness on fog was lower than on the other corruption types: the source model already achieved a high mPC on fog. Notable improvements were observed on the other corruption types.
Figures 12 and 13 show that the adapted YOLOv5s model enjoyed larger improvements on test sets with higher severity levels. More importantly, large improvements could be achieved when the training images had severity levels similar to those of the test images. This means that using unlabeled target-domain samples is effective as long as they are representative of the actual test set.
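As an illustration of how unlabeled training images can be produced at a fixed severity, the snippet below uses the `imagecorruptions` package, which implements the same corruption families as the -C benchmarks. The severity of 3 mirrors our setup, but the file paths and data pipeline here are assumptions.

```python
import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

def make_corrupted_copy(image_path, out_path, corruption_name, severity=3):
    # Severity 3 matches the level we used for the unlabeled target images;
    # the I/O layout is illustrative only.
    image = np.asarray(Image.open(image_path).convert("RGB"))
    corrupted = corrupt(image, corruption_name=corruption_name, severity=severity)
    Image.fromarray(corrupted).save(out_path)

# Example: generate one corrupted copy per corruption type.
# for name in get_corruption_names():  # 15 corruption types
#     make_corrupted_copy("input.jpg", f"input_{name}.jpg", name)
```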




9.6 Qualitative comparison on image corruptions
Figure 16 illustrates how the various methods handle the glass blur corruption (severity 5) on a Pascal-C sample. In addition, Figure 17 shows the results of the various methods across a range of severity levels for the glass blur corruption. The proposed method was more effective at handling the corruptions: in contrast to the baseline methods, our adaptation method detected most objects in the images and made fewer classification errors. We also observe that the source model completely failed to detect objects in most cases.


9.7 More detailed ablations on the components
Table 24 expands the ablation study provided in the main paper to various model sizes; a minimal sketch of the BN-Adapt component is given after the table.
Model | Method | TG | DomainMix | BN-Adapt | Finetune | Corrupt AP50 | Rel. robustness
---|---|---|---|---|---|---|---|
Source | 42.38 | 0.00 | |||||
BN-Adapt | ✓ | 53.75 | 11.37 | ||||
BN-Adapt + DomainMix | ✓ | ✓ | 56.13 | 13.75 | |||
yolov5s | SimROD (Ours) w/o Teacher Guidance | ✓ | ✓ | ✓ | 60.35 | 17.97 | |
SimROD (Ours) w/o Gradual Adaptation | ✓ | ✓ | ✓ | 67.87 | 25.49 | ||
Our full method (SimROD) | ✓ | ✓ | ✓ | ✓ | 67.95 | 25.57 | |
Source | 53.78 | 0.00 | |||||
BN-Adapt | ✓ | 64.60 | 10.82 | ||||
BN-Adapt + DomainMix | ✓ | ✓ | 66.78 | 13.01 | |||
yolov5m | SimROD (Ours) w/o Teacher Guidance | ✓ | ✓ | ✓ | 71.81 | 18.03 | |
SimROD (Ours) w/o Gradual Adaptation | ✓ | ✓ | ✓ | 73.45 | 19.67 | ||
Our full method (SimROD) | ✓ | ✓ | ✓ | ✓ | 75.40 | 21.62 | |
Source | 62.84 | 0.00 | |||||
BN-Adapt | ✓ | 71.83 | 8.99 | ||||
BN-Adapt + DomainMix | ✓ | ✓ | 73.64 | 10.80 | |||
yolov5x | SimROD (Ours) w/o Gradual Adaptation | ✓ | ✓ | ✓ | 75.58 | 12.74 | |
SimROD (Ours) w/o Teacher Guidance | ✓ | ✓ | ✓ | 78.16 | 15.32 | ||
Our full method (SimROD) | ✓ | ✓ | ✓ | ✓ | 78.48 | 15.64 |
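For reference, here is a minimal sketch of the BN-Adapt component listed in Table 24: the batch-norm running statistics are re-estimated on unlabeled target-domain images before any fine-tuning. The loader interface and the number of batches are illustrative assumptions.

```python
import torch

@torch.no_grad()
def bn_adapt(model, target_loader, num_batches=100):
    # Reset the batch-norm running statistics and re-estimate them with
    # forward passes over unlabeled target-domain images. No weights are
    # updated; only the BN means and variances change.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()
            module.momentum = None  # use a cumulative moving average
    model.train()  # BN layers update running stats only in train mode
    for i, images in enumerate(target_loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
    return model
```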
10 Dataset and DomainMix visualizations
Figures 18 and 19 show examples of the domain-mixed images produced by the DomainMix augmentation on different datasets. Note that the images used to form a domain-mixed example are randomly cropped and may occupy different heights and widths in the final image; a simplified sketch of this kind of composition is given below.
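The sketch below is a rough rendition of this mosaic-style composition: two labeled source images and two pseudo-labeled target images are cropped around a random pivot and pasted into the four quadrants of one canvas, with their boxes shifted accordingly. It is meant only to convey the idea; the actual DomainMix implementation details (sampling ratios, label handling, image sizes) may differ.

```python
import random
import numpy as np

def domain_mix(source_items, target_items, out_size=640):
    # source_items / target_items: lists of (image, boxes) pairs, where image is
    # an HxWx3 uint8 array and boxes is an Nx4 array of [x1, y1, x2, y2] in
    # pixel coordinates (target boxes would come from teacher pseudo-labels).
    items = random.sample(source_items, 2) + random.sample(target_items, 2)
    random.shuffle(items)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # A random pivot splits the canvas into four quadrants of different sizes.
    px = random.randint(out_size // 4, 3 * out_size // 4)
    py = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, px, py), (px, 0, out_size, py),
               (0, py, px, out_size), (px, py, out_size, out_size)]
    mixed_boxes = []
    for (x1, y1, x2, y2), (image, boxes) in zip(regions, items):
        h, w = image.shape[:2]
        ch, cw = min(y2 - y1, h), min(x2 - x1, w)  # crop fits both region and image
        cy, cx = random.randint(0, h - ch), random.randint(0, w - cw)
        canvas[y1:y1 + ch, x1:x1 + cw] = image[cy:cy + ch, cx:cx + cw]
        for bx1, by1, bx2, by2 in boxes:
            # Shift each box into canvas coordinates and clip it to the pasted crop.
            nx1 = np.clip(bx1 - cx, 0, cw) + x1
            ny1 = np.clip(by1 - cy, 0, ch) + y1
            nx2 = np.clip(bx2 - cx, 0, cw) + x1
            ny2 = np.clip(by2 - cy, 0, ch) + y1
            if nx2 - nx1 > 2 and ny2 - ny1 > 2:  # drop boxes cropped away
                mixed_boxes.append([nx1, ny1, nx2, ny2])
    return canvas, np.array(mixed_boxes)
```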

