Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection

10/04/2021 ∙ by Farzaneh Rezaeianaran, et al. ∙ 0

In order to robustly deploy object detectors across a wide range of scenarios, they should be adaptable to shifts in the input distribution without the need to constantly annotate new data. This has motivated research in Unsupervised Domain Adaptation (UDA) algorithms for detection. UDA methods learn to adapt from labeled source domains to unlabeled target domains by inducing alignment between detector features from source and target domains. Yet, there is no consensus on what features to align and how to do the alignment. In our work, we propose a framework that generalizes the different components commonly used by UDA methods, laying the ground for an in-depth analysis of the UDA design space. Specifically, we propose a novel UDA algorithm, ViSGA, a direct implementation of our framework, that leverages the best design choices and introduces a simple but effective method to aggregate features at instance-level based on visual similarity before inducing group alignment via adversarial training. We show that both similarity-based grouping and adversarial training allow our model to focus on coarsely aligning feature groups, without being forced to match all instances across loosely aligned domains. Finally, we examine the applicability of ViSGA to the setting where labeled data are gathered from different sources. Experiments show that our method not only outperforms previous single-source approaches on Sim2Real and Adverse Weather, but also generalizes well to the multi-source setting.






1 Introduction

Figure 1: Depiction of the visual similarity-based grouping proposed in our ViSGA method. Instance proposals from the detector are aggregated based on visual similarity to create an adaptive number of class-agnostic groups, which are then aligned across the domains.

Object detectors should be adaptable to “domain shift” that can occur due to many factors including changes in weather or camera, compared to the training data. Domain shifts can cause a significant drop in object detector performance [da_faster_rcnn, gopalan2011domain]. Domain adaptation methods [duan2011visual, duan2012domain, tzeng2015simultaneous, long2015learning, long2017deep, motiian2017unified] study this problem, casting it as a task of learning models from a source domain and adapting to a target domain. In object detection, where collecting bounding box annotations is expensive, it becomes critical that domain adaptation can be performed without the need to annotate every new domain. This motivates the challenging setting of unsupervised domain adaptation (UDA) [survey_2018, tzeng2017adversarial, lu2017unsupervised, cariucci2017autodial], where one has access to labeled source data and only unlabeled target data. Moreover, training data itself could be gathered under different conditions, a scenario typically referred to as a multi-source domain adaptation [multipeng2019moment, multixu2018deep, multizhao2018adversarial, multizhao2019].

A dominant line of work in UDA learns invariant representations by aligning source and target domains, with various proposed alignment strategies. Specifically in object detection, the questions of what features to align and how to induce the alignment have been the subject of recent research. Early works [da_faster_rcnn, he_iccv19_MAF] propose aligning both image-level features from the backbone network and all instance-level features extracted from object proposals using adversarial training [grl_ganin]. A recent state-of-the-art approach [GPA] argues that it is beneficial to aggregate object proposals before alignment and suggests condensing all proposals into a single category prototype vector before inducing alignment using a contrastive loss. This raises the questions of what the right aggregation level for feature alignment is and what the right mechanism to induce this alignment is.

In this work, we propose a novel UDA method for object detection, called visually similar group alignment (ViSGA). Our method harnesses the power of adversarial training, while leveraging the visual similarity of the different proposals as a basis for aggregating them. By relying on visual similarity, we aggregate proposals from potentially different spatial locations (Figure 1), increasing the effectiveness of adversarial training. Doing so, we drive a more powerful discriminator and hence better aligned features. To enhance the flexibility of proposal aggregation and to avoid introducing unwanted noise in the alignment process as a result of a preset fixed number of groups, we opt for dynamic clustering based on the distance at which proposals are aggregated. This improves the adaptability of our method to a variable number of objects present in the input.

Our method design choices are based on an in-depth analysis of common components of UDA methods for detection. In particular we study what is the right aggregation-level to perform instance-level alignment, ranging from considering all instances [da_faster_rcnn], multiple groups based on clustering to single prototypes [GPA]. When aggregating object proposals, we analyze whether including the predicted class label is beneficial and which distance metric performs better, including spatial overlap and visual similarity. We further compare the effectiveness of using contrastive losses versus adversarial training, as the alignment mechanism.

In summary, our key contributions are as follows: 1) We propose a novel, simple yet effective UDA method for object detection via adversarial training and dynamic visual similarity-based grouping of proposals from the source and target domains. 2) We perform an in-depth analysis answering questions on what the right level of alignment is and how to induce alignment. 3) We evaluate our proposed approach on three different domain shift scenarios (Adverse weather, Synthetic to Real data, and Cross camera) and show state-of-the-art results. 4) We are the first to consider the important setting of multi-source domain adaptation for object detection, where annotated data are gathered from different sources. We show that our method continues to improve in this highly relevant scenario, which is further evidence for the effectiveness of our approach.

2 Related Work

Object detection. Classical object detection methods were based on sliding window classification using hand-crafted features [dalal2005histograms, viola2001rapid, felzenszwalb2009object]. However, deep convolutional networks (CNNs) [krizhevsky2017imagenet, he2016deep, simonyan2014very] trained on large scale data [Chen2015, pascal] have become popular recently. These can be categorized into one-stage [liu2016ssd, redmon2016you, redmon2017yolo9000] and two-stage frameworks [girshick2014rich, girshick2015fast, he2015spatial, ren2015faster]. Among them, Faster R-CNN [ren2015faster] is widely adopted due to its good performance and good open implementations. Faster R-CNN extends prior works [girshick2014rich, girshick2015fast] with a Region Proposal Network (RPN). A second-stage detection head classifies regions of interest (RoI) and is trained end-to-end with the RPN. In our work, we use Faster R-CNN as our base detector.

Unsupervised domain adaptation for object detection. Chen et al. [da_faster_rcnn] propose an early UDA method for object detection. It learns domain-invariant features at both image and instance level using adversarial training (AT) [grl_ganin] on top of the Faster R-CNN detector. This idea motivates other works that focus on selecting the right features and the right level of aggregation for alignment [strong-weak, he_iccv19_MAF, zhu_cvpr19_selective_alignment, xu_cvpr20_icr_ccr, chen_cvpr20_htcn]. Both [strong-weak, he_iccv19_MAF] adopt an adversarial strategy to align image-level features, while He et al. [he_iccv19_MAF] employ multiple domain discriminators and also encode class information together with features for the instance-level alignment. Xu et al. [xu_cvpr20_icr_ccr] add a categorical classifier for image-level alignment to weakly learn class features with source domain supervision. On the other hand, some recent works have proposed applying different alignment mechanisms [zhuang_2020_ifan, zheng_cvpr20_prototype, GPA]. Xu et al. [GPA] employ a geometry-based prototype construction and use contrastive losses instead of AT for learning domain-invariant features. Similar contrastive losses were applied in training domain adaptive classifiers in [kang2019contrastive]. Zheng et al. [zheng_cvpr20_prototype] propose a hybrid framework that minimizes the distance between single class-specific prototypes across domains at instance level and uses adversarial training at image level.

In this paper, we propose a novel framework, ViSGA, by leveraging the best design practices from prior work. Unlike [GPA, zheng_cvpr20_prototype], our approach uses a similarity-based grouping scheme to aggregate information into multiple groups in a class-agnostic manner. In addition, we use a purely adversarial strategy, unlike the hybrid framework used by [zheng_cvpr20_prototype] or the contrastive losses used by [GPA].

Moreover, to the best of our knowledge, existing UDA methods for detection only consider the single-source setting. Recently, a line of work using deep models has been proposed for the multi-source setting, where the training data are collected from multiple sources [multipeng2019moment, multixu2018deep, multizhao2018adversarial, multizhao2019]. These works mainly consider image classification, except [multipeng2019moment] which is proposed for semantic segmentation. The general idea of these works is to add components or computations that align each source domain to the target [multixu2018deep, multizhao2018adversarial, multizhao2019], or to aggregate information from all of the sources into one before adapting to the target domain [multipeng2019moment]. In this work, besides single-source UDA, we consider the generalization of our method to the multi-source setting to further examine the effectiveness of our general framework.

3 Our General Framework for UDA

Figure 2: Components of our general unsupervised domain adaptation framework, for object detection. Here the boxes in blue are components of Faster R-CNN. They share parameters in both domains.

In this section we discuss our general framework for analyzing several aspects of unsupervised domain adaptation methods for object detection. Starting from the problem formulation, we present the main ingredients of our UDA framework (in 3.2 and 3.3) which represent a generalization of the different components presented in the state-of-the-art. For each part, we discuss the existing alternatives that we later compare in Section 4.2. We then introduce a novel algorithm (in 3.4), ViSGA, a direct implementation of our framework combining the best performing components with a novel strategy for a dynamic aggregation of proposals based on their visual similarity.

Problem formulation. In Unsupervised Domain Adaptation (UDA) for object detection, we are given labeled images for the source domain, annotated with class labels and bounding box coordinates. For the target domain, only unlabeled images are available. Both domains share an identical label space but their visual distributions do not match. The goal of UDA approaches is to learn object detectors which perform well on the target domain, despite the domain shift.

3.1 Overview

Our generalized UDA framework comprises three main components. The first is a standard object detection network, Faster R-CNN, which takes an input image and produces bounding boxes and labels for all object instances present in the image. The second component is an image-level domain adaptation loss, which encourages alignment of the global image representation in the backbone network. The third component is an instance-level domain adaptation loss, which induces alignment of the representations of each object instance. This is illustrated in Figure 2. Thus, the overall training objective of the method can be written as:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda_{img}\,\mathcal{L}_{img} + \lambda_{ins}\,\mathcal{L}_{ins}, \tag{1}$$

where $\mathcal{L}_{det}$ is the supervised training loss for the detector, $\mathcal{L}_{img}$ and $\mathcal{L}_{ins}$ are the image-level and instance-level domain adaptation (DA) losses respectively, and $\lambda_{img}$ and $\lambda_{ins}$ are trade-off parameters. For methods that do not apply instance-level alignment, $\lambda_{ins}$ is set to zero. Note that $\mathcal{L}_{det}$ is only applicable in the source domain, where ground-truth bounding box annotations are available.

Detection network. Following the convention set by early work on cross-domain object detection, we deploy Faster R-CNN [da_faster_rcnn] as the object detection network in both our method and the analysis. It consists of a Region Proposal Network (RPN) and a detection head. Both networks are trained with two loss terms each: a regression loss for bounding box estimation and a classification loss for label prediction. Thus the detection loss for Faster R-CNN is composed of the RPN loss and the detection head loss.

3.2 How to Induce Alignment?

The role of the domain adaptation losses (image-level and instance-level) is to induce alignment between the model's representation of source and target domain inputs. Downstream blocks that use such an invariant representation (here, for example, the RPN and the detection head) would then be domain-agnostic and perform equally well in both domains. While adversarial training has been the dominant paradigm for reducing the discrepancy between feature distributions [da_faster_rcnn, strong-weak, zhu_cvpr19_selective_alignment], contrastive losses have recently been proposed to match source and target features [GPA, kang2019contrastive]. We present these approaches in this subsection and compare them in our experimental analysis (Section 4.2).

Adversarial training. The key idea in Adversarial Training (AT) based UDA methods is to learn domain-invariant representations by fooling a discriminator that is trained to predict the domain of the input data from the detector features. This approach is usually class-agnostic: it ignores the class information of the features and focuses on domain-level alignment. Specifically, the features $F_d$ of domain $d$ ($d = 0$ for source and $d = 1$ for target) are fed to the discriminator $D$, which predicts the domain of the extracted features. The discriminator is trained by minimizing the cross-entropy loss below:

$$\mathcal{L}_{adv} = -\,d \log D(F_d) - (1 - d)\,\log\bigl(1 - D(F_d)\bigr), \tag{2}$$

where $D(F_d)$ is the predicted probability that the features come from the target domain.

Since we want to adapt the features of the two domains to be indistinguishable by the discriminator, we have to maximize the loss in Equation (2) w.r.t. the features $F_d$. This is achieved by inserting a gradient reversal layer (GRL) [grl_ganin] before the features are input to the discriminator.
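As a rough illustration of the sign flip performed by the gradient reversal layer, the following pure-Python sketch uses a hypothetical one-layer logistic discriminator. The weights `w`, the bias `b`, and all function names are assumptions made for illustration and are not the discriminator architecture used in the paper. It computes the binary cross-entropy domain loss, the gradient that would update the discriminator, and the negated gradient that would flow back into the feature extractor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_loss(p_target, d):
    """Binary cross-entropy on the domain label d (0 = source, 1 = target)."""
    return -(d * math.log(p_target) + (1 - d) * math.log(1.0 - p_target))

def discriminator_step(feature, d, w, b):
    """Hypothetical logistic discriminator: p = sigmoid(w . feature + b).

    Returns the loss, the gradient w.r.t. the discriminator weights, and the
    REVERSED gradient w.r.t. the feature (what a GRL passes upstream).
    """
    z = sum(wi * fi for wi, fi in zip(w, feature)) + b
    p = sigmoid(z)
    loss = domain_loss(p, d)
    dz = p - d                                   # dLoss/dz for sigmoid + BCE
    grad_w = [dz * fi for fi in feature]         # discriminator minimizes the loss
    grad_feature = [dz * wi for wi in w]         # ordinary gradient w.r.t. the feature
    reversed_grad = [-g for g in grad_feature]   # GRL: flip the sign for the detector
    return loss, grad_w, reversed_grad

loss, grad_w, reversed_grad = discriminator_step([1.0, 2.0], 0, [0.5, -0.25], 0.0)
```

Because the feature extractor receives the negated gradient, a plain gradient-descent step on it increases the discriminator loss while the discriminator itself still decreases it.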

Contrastive learning. As an alternative to AT, one can apply max-margin contrastive losses to align source and target features by leveraging the class information. The main idea here is to pull features from the same class closer together and push apart features belonging to different classes across domains. When matching a single feature vector per class in each domain, $f_i^s$ for source and $f_i^t$ for target, the max-margin contrastive loss takes the form:

$$\mathcal{L}_{cl} = \sum_{i=1}^{C} \bigl\| f_i^s - f_i^t \bigr\|_2^2 + \sum_{i=1}^{C} \sum_{j \neq i} \max\bigl(0,\; m - \| f_i^s - f_j^t \|_2 \bigr)^2, \tag{3}$$

where $C$ is the number of classes and $m$ is the margin. Since target data is unlabeled, the class prediction by the detector is used as a pseudo-label in [GPA] to apply Equation (3). In our analysis, we also study the effect of ignoring this class information. This can be achieved by considering only two sets of vectors of cardinality $N_s$ and $N_t$, possibly unequal ($N_s \neq N_t$), from the source and target domains to align. To apply contrastive losses here, we make a simple modification. Instead of matching class-specific features across domains, we match the proposals from one domain to the closest features (nn) of the other domain (4) and minimize the distance between their representations (5), as shown below for a source feature $f_i^s$:

$$\mathrm{nn}(f_i^s) = \arg\min_{f_j^t} \bigl\| f_i^s - f_j^t \bigr\|_2, \tag{4}$$

$$\mathcal{L}_{cl}^{nn} = \frac{1}{N_s} \sum_{i=1}^{N_s} \bigl\| f_i^s - \mathrm{nn}(f_i^s) \bigr\|_2^2. \tag{5}$$

In our method, we utilize AT, avoiding the potential noise that results from relying on unstable pseudo-labels during the alignment process. Our aggregation strategy can leverage proposal similarities and possibly embedded class information, as we explain in the following sections.
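The two contrastive variants described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance, squared penalties, and unnormalized sums for the class-based loss; the function names and these specific choices are assumptions, and the exact formulation used in [GPA] may differ.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_contrastive_loss(src, tgt, margin):
    """Max-margin loss with one prototype per class in each domain.

    src[i] and tgt[i] are the (assumed) prototypes of class i in the source
    and target domains. Same-class pairs are pulled together; different-class
    pairs are pushed at least `margin` apart.
    """
    num_classes = len(src)
    pull = sum(l2(src[i], tgt[i]) ** 2 for i in range(num_classes))
    push = sum(
        max(0.0, margin - l2(src[i], tgt[j])) ** 2
        for i in range(num_classes)
        for j in range(num_classes)
        if i != j
    )
    return pull + push

def nn_matching_loss(src, tgt):
    """Class-agnostic variant: match each source vector to its nearest
    target vector and average the squared distances."""
    total = 0.0
    for f in src:
        nearest = min(tgt, key=lambda g: l2(f, g))
        total += l2(f, nearest) ** 2
    return total / len(src)

prototypes = [[0.0, 0.0], [4.0, 0.0]]
aligned = class_contrastive_loss(prototypes, prototypes, margin=1.0)
crowded = class_contrastive_loss(prototypes, prototypes, margin=5.0)
matched = nn_matching_loss([[0.0, 0.0], [3.0, 0.0]],
                           [[0.0, 1.0], [3.0, 1.0], [10.0, 10.0]])
```

The class-agnostic function is written for one direction only (source to target); the same function can be called with the arguments swapped to match in the other direction.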

3.3 What Features to Align?

In detection, two main levels of feature alignment can be considered: 1) image-level features output by the backbone network and 2) instance-level or object-level features obtained after pooling each region of interest proposed by the RPN. The predominant approach aims for complete alignment at instance level, i.e. the representation of every proposed object, in the source or target domain, should be domain-agnostic. This might be difficult to achieve, especially when complete alignment is challenging for the model and when the source or target data contains domain-specific outliers, e.g. specific backgrounds only found in a simulation domain. To address this, recent works aggregate the proposals in each of the source and target domains before applying feature alignment [GPA, zheng_cvpr20_prototype, zhu_cvpr19_selective_alignment]. Both [GPA] and [zheng_cvpr20_prototype] go to the other extreme by collapsing the instances into a single prototype per category. While [GPA] merges prototypes based on spatial overlap using intersection-over-union (IoU) and class labels, [zheng_cvpr20_prototype] only uses class labels to mean-pool proposals into prototypes. In contrast, [zhu_cvpr19_selective_alignment] takes a middle ground by merging proposals into many discriminative regions, but still only uses spatial overlap as the merging criterion.

In our analysis in Section 4.2, we compare the effectiveness of different components of this aggregation including 1) spatial grouping vs similarity based grouping (discussed in Section 3.4) 2) using class information vs class agnostic and 3) single prototypes vs multiple groups.

3.4 Similarity-based Group Alignment

In this section, we propose a novel similarity-based grouping to aggregate object proposals before performing alignment. We first aggregate proposals based on visual similarity into a varying number of feature groups. AT is then applied to align the mean embeddings of the groups extracted from the source and target domains. This simple yet effective change brings three key benefits. First, adversarial training at group level enables our model to coarsely align the main feature clusters, instead of attempting complete alignment of all instances, which might be infeasible. Second, in contrast to the spatial overlap used in [GPA, zhu_cvpr19_selective_alignment], visual similarity-based clustering allows our model to group objects which are located far apart in the image but look similar. Note that this still groups heavily overlapping proposals, since they tend to also be visually similar. Hence, it avoids producing duplicate visually similar groups. By using visual similarity, we do not depend on pseudo-labels, unlike previous approaches [GPA, zheng_cvpr20_prototype]. The pseudo-labels tend to be noisy, so avoiding such a dependency can be beneficial, especially in early training. Moreover, when similar proposals are aggregated, we can implicitly leverage class information, since the aggregated proposals are likely to be of the same class. Finally, by adaptively varying the number of groups instead of using single prototypes, our model retains sufficient capacity to represent intra-domain variance.

Similarity-based clustering. To perform similarity-based clustering, we take as input the proposals generated by the RPN and their fixed-size feature vectors, denoted by $f_i$. In order to discover the main feature groups, we cluster these features using hierarchical agglomerative clustering. Starting bottom-up, each proposal is considered an individual cluster. Then, at each step, the two closest clusters according to a distance metric are merged. We use the cosine distance as our merging metric:

$$d(f_i, f_j) = 1 - \frac{f_i \cdot f_j}{\|f_i\|\,\|f_j\|}, \tag{6}$$

where $f_i$ and $f_j$ are the feature embeddings of the $i$-th and $j$-th proposals. In contrast to recent work [GPA], which uses spatial overlap (measured by IoU) to group instances, using cosine similarity enables us to pair instances which are located far from each other but are visually similar. Merging is stopped when the dissimilarity within a cluster, as defined by a linkage function, exceeds the cluster radius parameter $r$. We apply the complete-linkage heuristic [defays1977efficient], which ensures that the largest distance between two members of a cluster is smaller than $r$:

$$\ell(A, B) = \max_{a \in A,\; b \in B} d(a, b), \tag{7}$$

where $A$ and $B$ are the sets of proposal features in two clusters and $d$ is the cosine distance. This hierarchical clustering approach allows our model to adaptively change the number of feature groups during training, instead of having a fixed number of clusters as in k-means. Once the clustering has converged, the instances assigned to each cluster are pooled to construct a representative embedding

$$\bar{f}_k = \frac{1}{N_k} \sum_{f_i \in c_k} f_i, \tag{8}$$

where $N_k$ is the number of instances assigned to the cluster $c_k$. The group representative $\bar{f}_k$ is fed to a group-level discriminator, and adversarial training is applied to align groups from the two domains using Equation (2).
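The grouping procedure described above can be sketched in a few lines of pure Python. This is a deliberately naive illustration (it recomputes all pairwise linkages at every step), not the authors' implementation; the function names and the `radius` argument, which stands in for the cluster radius parameter, are assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise cosine distance between members of two clusters."""
    return max(cosine_distance(a, b) for a in cluster_a for b in cluster_b)

def group_proposals(features, radius):
    """Bottom-up agglomerative clustering with complete linkage.

    Repeatedly merges the two closest clusters and stops as soon as the
    closest pair is at least `radius` apart, so the number of groups is
    decided by the data rather than fixed in advance.
    """
    clusters = [[f] for f in features]          # every proposal starts alone
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = complete_linkage(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        if dist >= radius:                      # merging would exceed the radius
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def mean_embedding(cluster):
    """Mean-pool the members of one cluster into a group representative."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_proposals(features, radius=0.5)
representatives = [mean_embedding(g) for g in groups]
```

With these toy features the two near-horizontal vectors and the two near-vertical vectors end up in separate groups, and each group contributes one mean embedding that would be passed to the group-level discriminator. A very small radius leaves every proposal in its own group, which corresponds to no grouping.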

Finally, our method (ViSGA) combines image and instance-level alignment of aggregated proposals via adversarial training as illustrated in Fig.1.

4 Experiments

Based on Section 3, we conduct ablation studies in Section 4.2 to analyze two design choices: 1) adversarial training (AT) vs contrastive learning (CL) for inducing feature alignment and 2) different feature levels for alignment. We then compare our method, which combines the best performing components with a novel similarity-based grouping strategy, to SOTA results in Section 4.3. First, we present the datasets and the baselines used in the remainder of the paper.

4.1 Experimental Setup

We now present the datasets used for the experiments in the three domain shift scenarios.

Adverse weather. For this scenario, we use Cityscapes [city] as the source dataset. It contains 3,475 real urban images, with 2,975 images used for training and 500 for validation. The foggy version of Cityscapes [foggy] is used as the target dataset. The images with the highest fog intensity (least visibility) are used in our experiments, matching prior work [GPA]. Both datasets have 8 different categories. Following [da_faster_rcnn], we use the tightest bounding box of an instance segmentation mask as the ground-truth box. This scenario is referred to as Foggy.
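Deriving the tightest bounding box from an instance segmentation mask can be illustrated with a short sketch. The mask representation (a list of rows of 0/1 values), the inclusive pixel-index convention, and the function name are assumptions for illustration; actual dataset tooling may use a different format.

```python
def mask_to_box(mask):
    """Return the tightest box (x_min, y_min, x_max, y_max) around the
    non-zero pixels of a binary mask, using inclusive pixel indices.
    Returns None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])

instance_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = mask_to_box(instance_mask)
```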

Synthetic to real. SIM10k [sim10k] is a simulated dataset that contains 10,000 synthetic images. In this dataset, we use all 58,701 car bounding boxes available as the source data during training. For the target data and evaluation, we use Cityscapes [city] and only consider the car instances. This scenario is referred to as Sim2Real.

Cross camera. In this scenario, we use the real dataset KITTI [kitti] as our source data. KITTI contains 7,481 images and we use all of them for training. Similar to the previous scenarios, we use Cityscapes [city] as target data.

In all experiments, we use mean average precision (mAP) with IoU threshold of 0.5 for evaluation. We compare our approach with the following prior works: DA [da_faster_rcnn], DivMatch [diversify_and_match], SW-DA [strong-weak], SC-DA [zhu_cvpr19_selective_alignment] and MTOR [mean_teacher].
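The IoU criterion behind this metric can be sketched as follows. The snippet only shows the matching test between one predicted box and one ground-truth box, not the full average precision computation; the box format `(x1, y1, x2, y2)`, the function names, and the use of `>=` at the threshold are assumptions, since evaluation toolkits differ on the boundary case.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """A detection counts as a match if its IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold

half_shifted = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
slightly_short = box_iou((0, 0, 10, 10), (0, 0, 10, 8))
```

In this toy case a box shifted by half its width has an IoU of one third and would not count as a match, while a box that is slightly too short has an IoU of 0.8 and would.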

Implementation details. We set the shorter side of the image to 600 pixels, following the Faster R-CNN implementation [ren2015faster]. Our Faster R-CNN network, as well as all the prior works we compare to, uses ResNet-50 [he2016deep] as the backbone. Models using adversarial training are first trained with a learning rate of 0.001 for 50K iterations, then with a learning rate of 0.0001 for 20K more iterations, and we report the final performance. Each batch is composed of 2 images, one from each domain. A momentum of 0.9 and a weight decay of 0.0005 are used. With this setting, at most about 10k MB of memory is needed, and one NVIDIA Tesla V100-PCIE GPU is used. For training contrastive learning models, we employ the code provided by [GPA] and follow its exact settings for running experiments. Both methods are implemented with PyTorch.
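The step schedule for the adversarially trained models can be written as a small helper. This is only a sketch of the two-phase schedule stated in the text (0.001 for 50K iterations, then 0.0001 for 20K more); the function name and its argument names are hypothetical and not taken from the authors' code.

```python
def learning_rate(iteration, first_lr=0.001, second_lr=0.0001,
                  first_phase=50_000, second_phase=20_000):
    """Learning rate for a 0-based training iteration under a two-phase schedule."""
    total = first_phase + second_phase
    if not 0 <= iteration < total:
        raise ValueError(f"iteration must be in [0, {total})")
    return first_lr if iteration < first_phase else second_lr

schedule_samples = [learning_rate(i) for i in (0, 49_999, 50_000, 69_999)]
```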


4.2 Analysis of UDA Components

In this section we analyze the various design choices of alignment mechanisms (Table 1), image-level alignment (Table 2), and aggregation levels and aggregation mechanisms (Table 3) when building UDA models.

In Table 1, we compare the CL and AT domain alignment paradigms in the Sim2Real scenario. Faster R-CNN is the baseline model, which is trained only on the source and tested on the target. Single and Multiple Group(s) are denoted SG and MG. CA stands for Class Agnostic, which means that class information is not used when constructing the groups. CL using SG as the aggregation level improves the performance over the source-only model (33.2 vs 31.9). Similarly, applying CL with the MG (36.9) or MG+CA (42.6) setup further improves model performance. AT outperforms CL in each of these three settings (fifth to seventh rows of Table 1). Applying AT on the SG results in a large improvement over the baseline (40.8 vs 31.9). Similarly, AT heavily outperforms CL for the MG setting (43.1 vs 36.9). The same trend is observed in MG+CA as well, with AT outperforming CL (45.6 vs 42.6). This margin reveals that allowing the network to freely align the group representatives with AT leads to a larger performance gain than explicitly matching the groups to nearest neighbors across domains using CL. Based on these results, we use AT for the rest of the experiments.

| Method | Agg. Levels | car AP |
|---|---|---|
| Faster R-CNN | - | 31.9 |
| Contrastive losses | SG | 33.2 |
| Contrastive losses | MG | 36.9 |
| Contrastive losses | MG+CA | 42.6 |
| Adversarial training | SG | 40.8 |
| Adversarial training | MG | 43.1 |
| Adversarial training | MG+CA | 45.6 |

Table 1: Sim2Real: Analyzing the choice of alignment mechanism, comparing adversarial training against contrastive learning across different aggregation conditions (SG: Single Group, MG: Multiple Groups, CA: Class Agnostic). Note that all results here only use instance level alignment.

Figure 3: Qualitative results in the Sim2Real scenario. First row: Faster R-CNN, second row: ViSGA (IoU), last row: ViSGA (cosine). True positives and missed objects are shown as cyan and red boxes respectively. We can clearly see that the Faster R-CNN model misses many objects. This improves in the second row, with the model based on grouping proposals by spatial overlap. However, the ViSGA model powered by similarity-based aggregation does even better, recovering almost all missed objects.

| Image-level | SG | MG | MG+CA |
|---|---|---|---|
| without | 40.8 | 43.1 | 45.6 |
| with | 39.5 | 44.9 | 49.3 |

Table 2: Sim2Real: Analyzing the effect of image-level alignment.

Do we need image-level alignment? Table 2 compares model performance with image-level alignment added on top of the instance-level alignment presented before. This comparison is done using AT across different aggregation levels. We see that image-level alignment brings a clear improvement on both multi-group models, while degrading performance slightly on the single-group model. On the single-group model, the instance-level alignment happens at a global level, since all the instances are aggregated into a single group before inducing alignment. Adding an extra alignment will not help further and could possibly induce noise, as seen in the results (39.5 vs 40.8). However, on models with multiple groups, instance-level AT focuses on local feature alignment, and hence adding global alignment with image-level AT is beneficial. Thus, we use image-level alignment for the remainder of the experimental section.

| Aggr. levels | Aggr. mechanism | Foggy | Sim2Real |
|---|---|---|---|
| Proposals | No grouping | 38.5 | 39.0 |
| SG | Cosine | 33.7 | 39.5 |
| MG (adaptive) | Cosine | 41.8 | 44.9 |
| MG+CA (adaptive) | Cosine | 43.3 | 49.3 |
| MG+CA (fixed) | Cosine | 42.5 | 49.0 |
| MG+CA (adaptive) | IoU | 41.9 | 44.8 |

Table 3: Sim2Real & Foggy: Analyzing the choice of different aggregation levels and mechanisms.

Aggregation levels & mechanisms. Next, we study the process of aggregating instance proposals into groups before performing alignment. We compare the effect of both the number of groups and the mechanism used to aggregate proposals into groups. The first row of Table 3 shows results using the original proposals without any grouping for the instance-level alignment. Aggregating instances into a SG per category causes a significant drop in performance on Foggy (33.7 vs 38.5) and brings almost no gain on Sim2Real (39.5 vs 39.0), indicating that condensing features into one vector may not be a useful approach. However, the MG setup based on visual similarity (Cosine) is beneficial (41.8 vs 38.5 on Foggy and 44.9 vs 39.0 on Sim2Real). Performance is further improved by ignoring the predicted class label (43.3 on Foggy and 49.3 on Sim2Real). Compared to the previous setting with MG, this shows that noisy pseudo-labels (in MG) can be harmful to the clustering process and may have a negative impact on the alignment. Both of the above models use the cluster radius parameter to let the model vary the number of groups adaptively over the course of training. Here, we do not compare different clustering methods directly. However, we also experiment with fixing the number of clusters, as shown in MG+CA (fixed). We perform a sweep over the number-of-clusters hyper-parameter and report the best numbers here (full results can be found in the supplementary, figure 4). This model performs slightly worse than MG+CA (adaptive), indicating that the flexibility from an adaptive number of clusters is beneficial.

Finally, in the last row, by using spatial overlap (IoU) to cluster instances (as proposed in [GPA, zheng_cvpr20_prototype]), we see that the performance drops by 1.4 and 4.5 points on Foggy and Sim2Real respectively, compared to using visual similarity-based clustering (MG+CA (adaptive)). These drops show that our visual similarity-based grouping is a better way to accumulate proposals, since it allows grouping distant instances and avoids redundant group representatives.

4.3 Comparison with SOTA

In this section, we evaluate the best design choices embedded in our ViSGA and compare it to prior works in each of these domains in Section 2. ViSGA incorporates image-level alignment and adversarial training framework along with the novel group alignment of visual similarity based class-agnostic clusters.

| Methods | Cross Camera | Sim2Real |
|---|---|---|
| Faster R-CNN | 32.5 | 31.9 |
| DA-Faster [da_faster_rcnn] | 41.8 | 41.9 |
| DivMatch [diversify_and_match] | 42.7 | 43.9 |
| SW-DA [strong-weak] | 43.2 | 44.6 |
| SC-DA [zhu_cvpr19_selective_alignment] | 43.6 | 45.1 |
| MTOR [mean_teacher] | - | 46.6 |
| GPA (Only RCNN) [GPA] | 46.1 | 44.8 |
| GPA [GPA] | 47.9 | 47.6 |
| Ours | 47.6 | 49.3 |

Table 4: Experimental results (%) of Sim2Real & Cross Camera.

Table 4 shows the results for the Sim2Real and Cross Camera scenarios on the Car class. Adaptation is challenging on Sim2Real due to the relatively large domain shift between source and target. However, as shown in the table, our approach outperforms other methods by a fair margin (49.3 vs 47.6 for the closest model, GPA). For the Cross Camera scenario, our approach has competitive performance compared to GPA [GPA], while outperforming other approaches. In Table 5, ViSGA achieves SOTA results on Foggy, with large improvements over other recent work. It outperforms the GPA method [GPA] (43.3 vs 39.5), which is based on prototype matching, highlighting the importance of our design choices: multiple similarity-based class-agnostic groups and adversarial training. In summary, the good performance shown by our model across three datasets, with state-of-the-art results in two of them, provides evidence that our similarity-based method is successful in aligning instance-level representations.

3 Methods prsn rider car truck bus train mcycle bicycle mAP
Faster R-CNN 27.2 31.8 32.5 16.0 25.5 5.6 19.9 27 22.8
DA-Faster [da_faster_rcnn] 29.2 40.4 43.4 19.7 38.3 28.5 23.7 32.7 32.0
DivMatch [diversify_and_match] 31.8 40.5 51.0 20.9 41.8 34.3 26.6 32.4 34.9
SW-DA [strong-weak] 31.8 44.3 48.9 21.0 43.8 28.0 28.9 35.8 35.3
SC-DA [zhu_cvpr19_selective_alignment] 33.8 42.1 52.1 26.8 42.5 26.5 29.2 34.5 35.9
MTOR [mean_teacher] 30.6 41.4 44.0 21.9 38.6 40.6 28.3 35.6 35.1
GPA [GPA] 32.9 46.7 54.1 24.7 45.7 41.1 32.4 38.7 39.5
Ours 38.8 45.9 57.2 29.9 50.2 51.9 31.9 40.9 43.3
Table 5: Experimental results (%) on Foggy.
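
As a quick consistency check on Table 5, the mAP column can be recomputed as the unweighted mean of the eight per-class APs. The short sketch below does this for the GPA and Ours rows, using values copied from the table; the helper name `mean_ap` is ours and purely illustrative. Under this assumption the recomputed values agree with the reported 39.5 and 43.3 after rounding to one decimal.

```python
# Per-class AP (%) copied from Table 5, in column order:
# prsn, rider, car, truck, bus, train, mcycle, bicycle.
per_class_ap = {
    "GPA": [32.9, 46.7, 54.1, 24.7, 45.7, 41.1, 32.4, 38.7],
    "Ours": [38.8, 45.9, 57.2, 29.9, 50.2, 51.9, 31.9, 40.9],
}


def mean_ap(aps):
    """mAP taken as the unweighted mean of the per-class APs (hypothetical helper)."""
    return sum(aps) / len(aps)


# Round to one decimal, matching the precision used in the table.
map_scores = {name: round(mean_ap(aps), 1) for name, aps in per_class_ap.items()}

# Absolute mAP gain of our method over GPA, in percentage points.
gain = round(map_scores["Ours"] - map_scores["GPA"], 1)

print(map_scores, gain)
```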

Qualitative analysis of ViSGA. Figure 3 compares the detection outputs of Faster R-CNN and of ViSGA models with different aggregation mechanisms on the Sim2Real scenario. Figure 4 shows the evolution of the number of groups during ViSGA training on Foggy and Sim2Real. While the initial number of groups is similar in both cases, the number of clusters on Sim2Real drops off quickly and settles at around 50 when the best model performance is achieved. In contrast, on Foggy the number of clusters increases and plateaus at around 180, where the best performance is achieved. This difference can be understood by noting that the Foggy scenario has 8 categories, compared to only one category in Sim2Real; hence the model needs more clusters on Foggy.

Figure 4: Evolution of number of groups during training for our ViSGA model. Orange and blue dashed lines show the best training stops for Foggy and Sim2Real respectively.

Figure 5 shows experimental results measuring sensitivity to the cluster radius parameter. We observe that on Sim2Real the network performs well when the radius threshold is low, but it is relatively sensitive to high or very low radius values (no grouping). This might be due to the large shift between synthetic and real images. In addition, a low cluster radius creates many single-member clusters, reducing information aggregation. In contrast, performance is not very sensitive to the radius value on Foggy, where the domain gap is smaller. Additionally, Figure 6 in the supplement presents a t-SNE [van2008visualizing] visualization of the source and target feature distributions, illustrating how ViSGA prioritizes foreground alignment. This is also supported by Figure 4 in the supplement, which shows that foreground objects are allocated more clusters and hence are prioritized for alignment.
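
To make the role of the cluster radius concrete, the following minimal sketch groups feature vectors with a simple greedy, radius-thresholded rule. It is an illustration under our own assumptions, not the clustering procedure used in ViSGA: the function `group_by_radius`, the Euclidean distance, the running-mean centres, and the synthetic features are all hypothetical stand-ins. It only shows the qualitative effect described above: a radius of zero leaves every proposal in its own single-member group, while a very large radius collapses everything into one group.

```python
import numpy as np


def group_by_radius(features, radius):
    """Greedily assign each feature to its nearest group centre if that centre
    lies within `radius`; otherwise start a new group (hypothetical helper)."""
    centres, counts = [], []
    labels = np.empty(len(features), dtype=int)
    for i, f in enumerate(features):
        if centres:
            dists = np.linalg.norm(np.asarray(centres) - f, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= radius:
                counts[j] += 1
                centres[j] += (f - centres[j]) / counts[j]  # running mean
                labels[i] = j
                continue
        centres.append(np.array(f, dtype=float))
        counts.append(1)
        labels[i] = len(centres) - 1
    return labels, np.asarray(centres)


# Synthetic stand-in for instance-level proposal features (not detector output):
# two tight, well-separated blobs of 20 vectors each.
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 8)),
])

for radius in (0.0, 1.0, 100.0):
    labels, centres = group_by_radius(features, radius)
    print(f"radius={radius}: {len(centres)} groups")
```

Because the number of groups is a by-product of the radius rather than a fixed hyperparameter, it can change as the features evolve during training, which is the behaviour tracked in Figure 4.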

Computational Overhead. The extra training time of our method, which comes from computing the distances between the features of the proposals, is relatively small (e.g., the runtime of one batch is 0.79 with ViSGA compared to 0.62 without it). ViSGA adds no overhead during inference. Note that contrastive-learning-based methods, e.g., GPA, also compute the distances between proposals in each domain.

Figure 5: Sensitivity analysis of cluster radius parameter.

5 Generalization to Multi-Source

As mentioned in Section 2, existing UDA detection methods focus only on the single-source setting, in which training data are gathered from one input domain, guaranteeing homogeneity within the training data. However, in the real world, annotated data may be available or could be gathered under different conditions comprising different input domains, a scenario usually referred to as multi-source domain adaptation. In this section, we examine the applicability of our framework to a multi-source UDA scenario. For the presented experiments, SIM10k and KITTI are used as source datasets and Cityscapes as the target.

In the first set of experiments, we combine all sources into one training dataset and use discriminators shared across all sources at both the image and instance levels (Figure 6, part (a)). We repeat the analysis carried out in the single-source setting to examine the right level of aggregation. As shown in Table 6, learning on multi-source data without any UDA components achieves 42.5%. Combining the image- and instance-level alignments reaches 49.6% (Proposals). Using our full ViSGA method, we can further improve the performance on multi-source (51.3%). This confirms our method’s scalability to the multi-source setting. We also perform an ablation on the discriminators deployed in AT, modifying the network design to use a separate set of discriminators for each source-target pair (Fig. 6, (b) to (d), illustrates the different combinations). As shown in Fig. 6 (e), our simple yet effective method with shared discriminators brings the largest gain in final detection performance (51.3%), compared to 50.0% with separate discriminators at both the instance and image levels.
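
The difference between the shared and separated discriminator variants can be sketched with a toy domain-classification loss. This is a simplified illustration under our own assumptions, not the training code of ViSGA: linear logistic discriminators stand in for the discriminator networks, random vectors stand in for features, the label convention (source 0, target 1) is arbitrary, and the gradient reversal and the detector itself are omitted. The names `bce`, `shared_loss`, and `separate_loss` are hypothetical.

```python
import numpy as np


def bce(logits, label):
    """Mean binary cross-entropy of `logits` against a constant domain label (0 or 1)."""
    # BCE with logits equals softplus(z) - y * z; logaddexp keeps softplus stable.
    return float(np.mean(np.logaddexp(0.0, logits) - label * logits))


def shared_loss(w, sources, target):
    """One discriminator `w`: all sources are pooled and labelled 0, the target 1."""
    pooled = np.concatenate(sources)
    return bce(pooled @ w, 0) + bce(target @ w, 1)


def separate_loss(ws, sources, target):
    """One discriminator per source-target pair, each with its own weights."""
    return sum(bce(s @ w, 0) + bce(target @ w, 1) for w, s in zip(ws, sources))


# Random stand-ins for features from two source domains and one target domain.
rng = np.random.default_rng(0)
sources = [rng.normal(size=(16, 4)), rng.normal(size=(24, 4))]
target = rng.normal(size=(32, 4))

w_shared = rng.normal(size=4)                        # single shared discriminator
w_separate = [rng.normal(size=4) for _ in sources]   # one discriminator per source

print("shared:  ", shared_loss(w_shared, sources, target))
print("separate:", separate_loss(w_separate, sources, target))
```

In the shared variant the discriminator never needs to tell the sources apart, so the only distinction it can exploit is source versus target; the separated variant adds one set of discriminator parameters per source.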

In summary, the good performance shown here provides further evidence that our method generalizes to the multi-source setting without any modification to its design. This leaves the door open for exploring alternatives that could further leverage multi-source information in UDA object detection.

Source Faster Proposals SG MG MG+CA
Single-Source (KITTI) 32.5 41.5 35.8 45.5 47.6
Single-Source (SIM10k) 31.9 39.5 39.5 44.9 49.3
Multi-Source 42.5 49.6 48.9 51.3 51.3
Table 6: Multi-Source ViSGA vs Single-Source ViSGA. ‘Faster’: No UDA; ‘Proposals’: UDA with proposal-level alignment; ‘SG’,‘MG’,‘MG+CA’: UDA with group-level alignment.
Model Shared Img Ins Separated
Multi-Source 51.3 49.1 49.3 50.0


Figure 6: Multi-Source ViSGA Ablation: Shared/Separated discriminators between sources. (a) Shared. (b) Ins: separated image-level disc. (c) Img: separated instance-level disc. (d) Separated. (e) Results on (a to d).

6 Conclusions

We present an analysis of various design choices for building UDA models for detection. Our experiments comparing alignment mechanisms reveal that adversarial training works better than max-margin contrastive losses across different feature aggregation levels. Regarding instance-level alignment, our analysis shows that aggregating proposals into multiple visually similar groups before alignment is beneficial. It significantly outperforms both options investigated in prior work: no aggregation [da_faster_rcnn], or collapsing everything into a single category prototype vector [GPA, zheng_cvpr20_prototype]. We also show that constructing these groups without considering pseudo-labels improves performance in the single-source setting. Our best model, ViSGA, which incorporates adversarial training and class-agnostic groups based on visual similarity, not only achieves SOTA results on Sim2Real and Foggy but also generalizes to the multi-source setting.