Existing object detection models assume that both the training and test data are sampled from the same source domain. This assumption does not hold when these detectors are deployed in real-world applications, where they encounter new visual domains. Unsupervised Domain Adaptation (UDA) methods are generally employed to mitigate the adverse effects caused by domain shift. Existing UDA methods operate in an offline manner, where the model is first adapted towards the target domain and then deployed in real-world applications. However, this offline adaptation strategy is not suitable for real-world applications, as the model frequently encounters new domain shifts. Hence, it becomes critical to develop a feasible UDA method that generalizes to these domain shifts encountered during deployment in a continuous online manner. To this end, we propose a novel unified adaptation framework that adapts and improves generalization on the target domain in online settings. In particular, we introduce MemXformer, a cross-attention transformer-based memory module where items in the memory take advantage of domain shifts and record prototypical patterns of the target distribution. Further, MemXformer produces strong positive and negative pairs to guide a novel contrastive loss, which enhances target-specific representation learning. Experiments on diverse detection benchmarks show that the proposed strategy produces state-of-the-art performance in both online and offline settings. To the best of our knowledge, this is the first work to address online and offline adaptation settings for object detection. Code: https://github.com/Vibashan/online-od
The ability to train deep network models on large-scale annotated datasets [krizhevsky2012imagenet, everingham2010pascal, lin2014microsoft, johnson2016driving]
has accelerated the progress of multiple computer vision tasks such as classification [krizhevsky2012imagenet, he2016deep, dosovitskiy2020image], segmentation [long2015fully, zhao2017pyramid, sun2019high], and detection [ren2015faster, redmon2016you, liu2016ssd]. Despite this success, these models have limited generalization capabilities [szegedy2013intriguing, hendrycks2019benchmarking, geirhos2018generalisation]. Specifically, model performance drops when the test data (target domain) is sampled from a different distribution than that of the training data (source domain) [ben2010theory]. For example, when a model is deployed in real-world applications such as autonomous navigation, it could encounter images with weather-based degradations, camera artifacts, etc., that were unknown when training the model.
Unsupervised Domain Adaptation (UDA) methods [ganin2016domain, tzeng2017adversarial, saito2018maximum, chen2017no, hoffman2016fcns, hoffman2018cycada, chen2018domain, inoue2018cross, saito2019strong] are generally employed to improve model generalization under domain shift. Existing UDA methods assume that both labelled source data and unlabeled target data are available during adaptation. This scenario is often not feasible in real-world applications, as the labelled source data is often restricted due to privacy regulations, data transmission constraints, or proprietary data concerns. To overcome this drawback, some recent works have explored the Source-Free Domain Adaptation (SFDA) [kundu2020universal, kim2021domain, xia2021adaptive, liang2020we, li2020free] setting, where a source-trained model is adapted towards the target domain without requiring access to the source data. However, in both UDA and SFDA settings, adaptation is performed in an offline manner, where the model is first adapted towards the target domain and then deployed in real-world applications. In addition, it is often impossible to have prior knowledge about the target domain in most real-world applications. In other words, the deployed model could encounter a diverse set of target domains, and offline adaptation to every distribution shift would be infeasible. To circumvent these issues, in this work we explore an Online Source-Free Domain Adaptation (Online-SFDA) setting, where a model is adapted to any distribution shifts encountered during deployment in an online manner. Fig. 1 illustrates the online source-free domain adaptation setting for detection and its differences from the other adaptation settings.
In recent years, a few works have explored various test-time adaptation settings, where adaptation is performed during test time. Wang et al. [wang2021tent] proposed a fully test-time adaptation strategy that performs entropy minimization during test time and only updates the model's batch-norm parameters for the classification task. However, TENT [wang2021tent] has two critical drawbacks: 1) TENT uses a very large batch size during test-time adaptation, which is not feasible during real-time deployment as images arrive one by one sequentially. 2) Updating only the batch-norm parameters of a network with batch size 1 essentially degrades the model performance [yao2021cross]. Although existing test-time adaptation settings are close to the online-SFDA setting, they are not suitable for adapting a detection model during real-world deployment. To this end, we propose a unified adaptation framework which utilizes a source-trained detector and adapts it to the target domain in an online manner.
Source-free domain adaptive object detection is a relatively new and more challenging setting than UDA. Existing SFDA methods [li2020free, huang2021model] for detection adapt to the target domain by training on the pseudo-labels generated by the source-trained model. Due to domain shift, these generated pseudo-labels are noisy, and training a model on top of them leads to catastrophic forgetting [liu2021unbiased, deng2021unbiased]. To alleviate these issues, we employ a mean-teacher framework, where the student model is supervised using pseudo-labels generated by the teacher network and the teacher network is slowly updated via an exponential moving average (EMA) of the student weights. Therefore, the student network is trained on consistent pseudo-labels, leading to less overfitting, and the teacher network is a gradual ensemble of target-adapted student weights [liu2021unbiased]. However, this strategy is inefficient in two critical aspects required for optimal online adaptation: 1) it fails to learn robust target feature representations, and 2) it fails to fully exploit the online target samples. Hence, we propose a novel memory module and a contrastive loss to fully utilize online target samples and learn robust target feature representations.
Contrastive Learning (CL) [chen2020simple, chen2020big, he2020momentum, chopra2005learning, khosla2020supervised] aims to obtain high-quality features from unlabeled data by forcing similar object instances to stay close and pushing dissimilar ones apart in an unsupervised setting. This is especially useful for online-SFDA, as labelled source data is unavailable during adaptation. Existing CL methods are designed for classification tasks, where they operate on image-level features and require multiple image views (or augmentations) [chen2020simple] to learn robust feature representations. Consequently, obtaining these large sets of views through input augmentations is computationally expensive for adapting detector models. However, in detector models, it is possible to obtain different views of an object in an input image without heavy input augmentations. More precisely, the detector provides multiple object proposals generated by the Region Proposal Network (RPN), which in turn provide multiple cropped views around an object instance at different locations and various scales. Therefore, applying a CL loss on RPN-cropped views guides the model to learn object-level feature representations on the target domain. Note that this CL loss is used to supervise the student network, where the object-level features are obtained from the student RoI features. However, to perform contrastive learning, these student RoI features require positive and negative pairs. To this end, we propose MemXformer, a cross-attention transformer-based memory module where items in the memory record prototypical patterns of the continuous target distribution. The proposed MemXformer solves two important problems for online adaptation: 1) it stores the target distribution during online adaptation, which is utilized for future adaptation, and 2) the stored temporal ensemble of target representations provides positive and negative pairs to guide the contrastive learning process.
Further, we introduce a cross-attention based read and write technique which models better target distribution and provides strong positive and negative pairs for contrastive learning. Note that the proposed method is not only suitable for online adaptation but also for offline adaptation. In a nutshell, this paper makes the following contributions:
To the best of our knowledge, this is the first work to consider both online and offline adaptation settings for detector models.
We propose a novel unified adaptation framework which makes the detector models robust against online target distribution shifts.
We introduce MemXformer module, which stores prototypical patterns of target distribution and provides contrastive pairs to boost the contrastive learning on the target domain.
We consider multiple detection benchmarks for experimental analysis and show that the proposed method outperforms existing UDA, and SFDA methods for both online and offline settings.
Unsupervised domain adaptation. Existing unsupervised domain adaptation methods can be categorized into three groups: adversarial training [Chen2018DomainAF, saito2019strong, Sindagi_DA_Detection_ECCV2020], self-training [khodabandeh2019robust, wu2021instance], and image-to-image translation [kim2019diversify, roychowdhury2019automatic]. Domain adaptive object detection was first studied in [Chen2018DomainAF], which followed an adversarial training strategy to perform feature alignment at both image level and instance level to mitigate the domain shift. Later, Saito et al. [saito2019strong] proposed an adversarial strategy where strong alignment of local features and weak alignment of global features are performed for effective domain alignment. Kim et al. [kim2019diversify] introduced an image-to-image translation-based method where multiple target-domain images are created by stylizing the labelled source images; multiple discriminators then perform adversarial alignment to reduce the domain discrepancy by utilizing these target-styled source images. In [khodabandeh2019robust], a pseudo-label-based training strategy was formulated to counter noise in pseudo-labels and perform robust training of object detectors on the target domain. However, all these works assume access to labeled source data and unlabeled target data during adaptation, and they operate in an offline setting.
Source-free domain adaptation. In the source-free domain adaptation (SFDA) setting, a source-trained model adapts to the target domain without having access to the source data. Multiple works have addressed the SFDA setting for classification [liang2020we, li2020model], segmentation [liu2021source, kundu2021generalize] and object detection [li2020free, huang2021model] tasks. In detail, for the classification task, [liang2020we] proposed a self-supervised method to learn target-domain representations via information maximization. Further, for segmentation [liu2021source, kundu2021generalize] and object detection [li2020free, huang2021model], the proposed methods are based on pseudo-label self-training to learn target-specific representations. However, similar to existing UDA works, these SFDA methods operate in an offline setting. Thus, we explore online adaptation, which is a more practical way to tackle domain shift in real-world applications.
Online adaptation. Sun et al. [sun2020test] proposed a test-time training (TTT) strategy, where a model is trained on source data along with an auxiliary task (e.g., rotation prediction), which is then used at test time to fine-tune the model on the target test distribution. The major drawback of this strategy is that training an auxiliary task alongside source training just to enable test-time adaptation is neither a feasible nor an effective solution for real-world applications. Later, Wang et al. [wang2021tent] proposed a fully test-time adaptation setting, where a given source-trained model adapts to the target domain by entropy minimization during test time in an online manner. In this way, Tent [wang2021tent] adapts to the target domain with a test-time loss. The major limitation of this method is the requirement of a large batch size during test-time adaptation, which is not feasible during real-time deployment as images arrive one by one sequentially. Although existing test-time adaptation settings closely resemble the online-SFDA setting, they are not suitable for adapting a detection model during real-world deployment. Therefore, in this work, we explore both online and offline adaptation settings for the object detection task.
Contrastive representation learning. Contrastive representation learning has shown huge progress towards unsupervised feature learning. The standard way of formulating contrastive learning for an anchor is to pull the feature embeddings of the anchor's positive pairs together and push them apart from the anchor's negative pairs [oord2018representation, chen2020simple, he2020momentum]. These positive and negative pairs are formed by augmenting the anchor image and sampling from the input batch of images. Thus, given an anchor, the positive pairs are augmented versions of the anchor image and the negative pairs are the other images in the batch [oord2018representation, chen2020simple, he2020momentum]. On top of this, by exploiting task-specific label information, [khosla2020supervised] performed contrastive learning in a supervised manner. Nonetheless, all these methods require a large batch size to perform contrastive learning effectively, and it is not feasible to have more than one image during online adaptation. Thus, we propose a memory-based contrastive learning framework which is suitable for adapting object detectors during deployment in an online manner.
The online-SFDA setting considers a source-trained model with parameters $\theta_0$ and adapts it to any target distribution shifts during real-world deployment, as illustrated in Fig. 1.
Let us consider a stream of online target data denoted as $\mathcal{D}_T = \{x_1, x_2, \ldots\}$, where $x_t$ is the $t$-th online sample. Since these samples arrive sequentially, the model is adapted to each sample and the adapted weights are used for future online samples. Specifically, the model parameters during adaptation on sample $x_t$, i.e., $\theta_t$, are initialized with the model parameters $\theta_{t-1}$ obtained through online adaptation of the previous sample.
To summarize, online-SFDA performs continuous online adaptation, i.e., there is no termination and adaptation continues as long as there is a stream of data.
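The continuous online protocol above can be sketched as follows. This is a minimal NumPy stand-in, not the paper's implementation: `adapt_step` uses a placeholder gradient in place of the actual adaptation loss, and all names are illustrative.

```python
import numpy as np

def adapt_step(theta, x, lr=0.001):
    """One hypothetical online update; the gradient is a stand-in
    for the gradient of the actual adaptation loss."""
    grad = theta - x                 # placeholder gradient, for illustration only
    return theta - lr * grad

def online_sfda(theta_0, target_stream, lr=0.001):
    """theta_t is initialized from theta_{t-1}; there is no termination:
    adaptation continues as long as samples arrive."""
    theta = theta_0.copy()
    for x_t in target_stream:        # samples arrive one by one, sequentially
        theta = adapt_step(theta, x_t, lr)
    return theta

# Simulated stream of online target samples.
stream = [np.ones(4) * t for t in range(5)]
theta = online_sfda(np.zeros(4), stream)
```

The key point is the carry-over of parameters across samples: each sample's update starts from the previous sample's adapted weights rather than from the source model.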
Student-teacher training. In online-SFDA, the model parameters need to be continuously updated in an online manner. Consequently, the model risks forgetting the original hypothesis learned through supervised source training [liu2021unbiased, deng2021unbiased]. To overcome this, prior works [tarvainen2017mean, liu2021unbiased] have employed a student-teacher framework. Specifically, the student parameters ($\theta_{stu}$) are adapted to the target domain by minimizing the detection loss supervised through the teacher-generated pseudo-labels. The adapted student parameters are then transferred to the teacher parameters ($\theta_{tea}$) via an Exponential Moving Average (EMA). This can be formally written as:
$$\theta_{stu} \leftarrow \theta_{stu} - \gamma \nabla_{\theta_{stu}} \mathcal{L}_{det}(x_t, \hat{y}_t; \theta_{stu}), \qquad \theta_{tea} \leftarrow \alpha\,\theta_{tea} + (1-\alpha)\,\theta_{stu},$$
where $x_t$ and $\hat{y}_t$ are the test sample and the corresponding pseudo-label generated by the teacher network, $\mathcal{L}_{det}$ is the pseudo-label supervision loss, $\gamma$ is the student learning rate, and $\alpha$ is the teacher EMA rate. However, the student-teacher framework is still not sufficient to learn robust features that mitigate target distribution shifts. Hence, we further explore contrastive learning-based strategies to improve the robustness of feature representations in an online setting.
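A minimal NumPy sketch of the student update and teacher EMA described above (network parameters are stand-in vectors, and the `grad` argument plays the role of the pseudo-label detection-loss gradient):

```python
import numpy as np

def student_teacher_step(theta_stu, theta_tea, grad, gamma=0.001, alpha=0.99):
    """Student SGD step on the pseudo-label loss, then teacher EMA update."""
    theta_stu = theta_stu - gamma * grad                     # student update
    theta_tea = alpha * theta_tea + (1 - alpha) * theta_stu  # gradual ensemble
    return theta_stu, theta_tea

theta_stu, theta_tea = np.ones(3), np.ones(3)
theta_stu, theta_tea = student_teacher_step(theta_stu, theta_tea, grad=np.ones(3))
```

With a large EMA rate (e.g., 0.99) the teacher moves only slightly per step, which is what makes its pseudo-labels consistent across consecutive online samples.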
Contrastive Learning (CL). SimCLR [chen2020simple] is a commonly used CL framework, which learns representations for an image by maximizing agreement between differently augmented views of the same sample. For a given anchor image $x_i$, the SimCLR loss can be written as:
$$\mathcal{L}_{CL} = -\log \frac{\exp(\mathrm{sim}(z_i, \tilde{z}_i)/\tau)}{\exp(\mathrm{sim}(z_i, \tilde{z}_i)/\tau) + \sum_{k=1, k \neq i}^{B} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$
where $B$ is the batch size, $z_i$ and $\tilde{z}_i$ are the features of two different augmentations of the same sample $x_i$, and $z_k$ represents the feature of the batch sample $x_k$, where $k \neq i$. Also, $\mathrm{sim}(\cdot, \cdot)$ indicates a similarity function, e.g., cosine similarity, and $\tau$ is a temperature parameter. Note that, in general, the CL framework assumes that each image contains one category/object [chen2020simple]. Moreover, it requires large batch sizes that provide multiple positive/negative pairs for training [chen2020big]. In contrast, for object detection, each image contains multiple objects, and a large batch size or multiple views are computationally infeasible. Hence, existing CL methods are more suited to classification tasks.
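For a single anchor, the loss above can be sketched in NumPy as follows. This is illustrative only: it computes the one-anchor term with an explicit list of negatives rather than the full in-batch formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the sim(.,.) function used in the loss."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def simclr_anchor_loss(z_i, z_j, negatives, tau=0.5):
    """-log( exp(sim(z_i, z_j)/tau) / (positive term + sum over negatives) )."""
    pos = np.exp(cosine(z_i, z_j) / tau)
    neg = sum(np.exp(cosine(z_i, z_k) / tau) for z_k in negatives)
    return -np.log(pos / (pos + neg))

# Identical views give a small loss against one orthogonal negative.
loss = simclr_anchor_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                          negatives=[np.array([0.0, 1.0])])
```

Note that the quality of this loss depends directly on how many negatives are available, which is why SimCLR needs large batches and why a batch size of 1 breaks it in the online setting.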
Though existing contrastive learning methods like SimCLR are exceptional at learning high-quality representations, they are more suitable for the classification task. For detection, these CL methods require a large batch size and heavy input augmentations, which are computationally expensive for online parameter updates (discussed in Sec. 1). Therefore, we utilize a memory-based approach to make contrastive learning efficient for online model updates. The proposed online-SFDA strategy is illustrated in Fig. 2.
MemXformer. A cross-attention transformer-based memory module which stores the target distribution shift and guides contrastive learning of the target-domain representation during online adaptation. Specifically, we employ a Global Memory Bank $\mathcal{M} \in \mathbb{R}^{N \times C}$, where $N$ is the number of memory items and $C$ is the memory item feature dimension. These memory items store target representations and record prototypical patterns of the target distribution during adaptation. In addition, they are used to retrieve strong positive and negative pairs for guiding contrastive learning. The MemXformer module has two cross-attention-based operations: write and read. In the MemXformer write operation, the teacher RoI features are used to update the memory items appropriately. In the MemXformer read operation, the student RoI features are used to query the memory and retrieve a weighted sum of similar memory items, which essentially provides strong positive pairs. The read and write operations of MemXformer are illustrated in Fig. 3.
Write. To update the memory items, we consider only the teacher network RoI features $F^{tea} \in \mathbb{R}^{N_r \times C}$, where $N_r$ is the number of RoI features and $C$ is the RoI feature dimension. The teacher RoI features are used because, in the student-teacher framework, the teacher pipeline receives weakly augmented inputs, resulting in more accurate RPN proposals compared to the student pipeline. As shown in Fig. 3 (a), the teacher RoI features are first projected into keys and values using two FC layers with weights $W_k$ and $W_v$, respectively, i.e., $k_j = W_k f_j^{tea}$ and $v_j = W_v f_j^{tea}$. Each memory item $m_i$ is considered as a query, and we compute a cross-attention map between the teacher RoI features and the memory items as follows:
$$A_{i,j} = \frac{\exp(m_i k_j^{\top})}{\sum_{j'=1}^{N_r} \exp(m_i k_{j'}^{\top})},$$
where the cross-attention map $A$ is a 2D matrix of size $N \times N_r$ and represents how memory item $m_i$ is related to teacher RoI feature $f_j^{tea}$. We utilize this cross-attention map and the values $v_j$ to update memory item $m_i$ using the following equation:
$$m_i \leftarrow \left\lVert m_i + \sum_{j=1}^{N_r} A_{i,j}\, v_j \right\rVert_2,$$
where $\lVert \cdot \rVert_2$ denotes $L_2$ normalization. Therefore, the attention-weighted average and the global memory bank update for each online sample allow MemXformer to effectively store and model the target distribution.
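The write operation can be sketched as follows (a NumPy stand-in, assuming the structure described above: `W_k` and `W_v` are random matrices in place of the learned FC projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_write(M, roi_tea, W_k, W_v):
    """Memory items attend over teacher RoI features and are updated
    by an attention-weighted sum of values, followed by L2 normalization."""
    K, V = roi_tea @ W_k, roi_tea @ W_v       # keys/values from teacher RoIs
    A = softmax(M @ K.T, axis=1)              # (N, N_r) cross-attention map
    M_new = M + A @ V                         # weighted update per memory item
    return M_new / np.linalg.norm(M_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))                   # N = 8 memory items, C = 4
M_new = memory_write(M, rng.normal(size=(3, 4)),
                     rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

Because every memory item is updated as a soft mixture of the current sample's RoI values, repeated writes over the stream accumulate a running summary of the target distribution.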
Read. To read the memory items, we consider only the student network RoI features $F^{stu} \in \mathbb{R}^{N_r \times C}$, where $N_r$ is the number of RoI features and $C$ is the RoI feature dimension. The MemXformer read operation is performed to obtain strong positive pairs given the student RoI features as queries. As shown in Fig. 3 (b), the student RoI features are first projected into queries by one FC layer with weight $W_q$, i.e., $q_i = W_q f_i^{stu}$. Each memory item $m_j$ is considered as a key, and we compute a cross-attention map between the student RoI features and the memory items as follows:
$$B_{i,j} = \frac{\exp(q_i m_j^{\top})}{\sum_{j'=1}^{N} \exp(q_i m_{j'}^{\top})},$$
where the cross-attention map $B$ is a 2D matrix of size $N_r \times N$, and the $i$-th row contains the attention scores over the memory items for the query $q_i$. Therefore, given a student RoI feature as query, we generate its corresponding positive pair by an attention-guided weighted sum of the most similar memory items. Utilizing the cross-attention map and considering the memory items as values, we compute the strong positive pair as:
$$p_i = \sum_{j=1}^{N} B_{i,j}\, m_j,$$
where $p_i$ corresponds to the strong positive pair for student RoI feature $f_i^{stu}$. In detail, the retrieved positive pairs are temporal ensembles of the prototypical target distribution, which provide more information regarding the online target distribution shifts. This essentially guides the contrastive learning to model the target distribution effectively.
Negative pair mining. As explained earlier, from the MemXformer read operation we obtain a strong positive pair for a given student RoI feature. This strong positive pair is essentially an ensemble of the most similar memory items. However, this ensemble also contains dissimilar memory items, scaled by small attention weights, which restricts the ability of contrastive learning to effectively model the target-domain representation. Hence, to mitigate the effect of dissimilar items, we propose negative pair mining. Specifically, given a student RoI feature as query and the cross-attention map $B$, we mine the 10% least similar memory items and label them as negative pairs $\{n_k\}_{k=1}^{K}$. As a result, by performing negative pair mining, for one positive pair we obtain $K$ negative pairs, where $K$ corresponds to the 10% least similar memory items.
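The read operation and negative pair mining can be sketched together (a NumPy stand-in: `W_q` replaces the learned projection, and the 10% fraction follows the mining rule above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(M, roi_stu, W_q, neg_frac=0.10):
    """Each student RoI query retrieves an attention-weighted positive;
    the least-attended 10% of memory items are mined as negatives."""
    Q = roi_stu @ W_q
    B = softmax(Q @ M.T, axis=1)              # (N_r, N) attention over memory
    positives = B @ M                         # weighted sum of similar items
    k = max(1, int(neg_frac * M.shape[0]))    # 10% of the memory items
    neg_idx = np.argsort(B, axis=1)[:, :k]    # least similar items per query
    return positives, M[neg_idx]              # (N_r, C), (N_r, k, C)

rng = np.random.default_rng(1)
M = rng.normal(size=(20, 4))                  # N = 20 memory items, C = 4
positives, negatives = memory_read(M, rng.normal(size=(2, 4)), np.eye(4))
```

Sorting the attention row and taking its bottom fraction guarantees that the mined negatives are exactly the items that contribute least to the retrieved positive.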
Memory contrastive loss. Given a student RoI feature as anchor, we obtain strong positive and negative pairs from MemXformer using the read operation and negative pair mining. Therefore, given an image with student RoI features $f_i^{stu}$, the MemCLR loss is calculated as:
$$\mathcal{L}_{MemCLR} = -\frac{1}{N_r} \sum_{i=1}^{N_r} \log \frac{\exp(\mathrm{sim}(f_i^{stu}, p_i)/\tau)}{\exp(\mathrm{sim}(f_i^{stu}, p_i)/\tau) + \sum_{k=1}^{K} \exp(\mathrm{sim}(f_i^{stu}, n_k)/\tau)},$$
where $\tau$ is a temperature parameter.
Therefore, minimizing the MemCLR loss guided by strong positive and negative pairs enables the student model to learn better target representations in the online-SFDA setting.
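An assumed InfoNCE-style instantiation of the MemCLR loss for a single anchor (the temperature value is illustrative, not taken from the paper):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def memclr_anchor_loss(f, pos, negs, tau=0.07):
    """Anchor f against its retrieved positive and mined negatives."""
    logits = np.array([cosine(f, pos)] + [cosine(f, n) for n in negs]) / tau
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # index 0 holds the positive

# A well-aligned positive and dissimilar negatives give a near-zero loss.
loss = memclr_anchor_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                          negs=[np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
```

Averaging this term over all $N_r$ student RoI features gives the per-image loss; unlike SimCLR, the positives and negatives come from memory rather than from a batch, so batch size 1 suffices.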
Overall loss. We illustrate our overall architecture for online source-free domain adaptation in Fig. 2. The proposed method utilizes a global memory bank to perform memory-based contrastive learning, robustifying the representations under varying target distribution shifts. Therefore, the overall online-SFDA loss for any online sample $x_t$ can be calculated as:
$$\mathcal{L}_{total} = \mathcal{L}_{det} + \lambda\, \mathcal{L}_{MemCLR},$$
where $\lambda$ is a weighting factor balancing the pseudo-label supervision loss and the memory contrastive loss.
To validate the proposed method, we consider four domain-shift scenarios, typically used for comparison in the UDA and SFDA literature, where the source-trained model is adapted to the unlabelled target domain. Specifically, we evaluate the proposed method against existing UDA, SFDA and test-time adaptation works under four domain shifts: 1) clear weather to foggy weather, 2) real to artistic, 3) synthetic to real, and 4) cross-camera adaptation. Note that, to show the effectiveness of our approach, we evaluate both online and offline settings. The offline setting follows the standard SFDA setting: the source-trained model is adapted towards the target domain using an unlabelled target train set for multiple iterations and evaluated on the target test set. In the online setting, the model is adapted towards the target domain in an online manner, where the target test samples are seen only once, and is finally evaluated on the target test set. This essentially simulates the real-world scenario where target samples are seen only once and adaptation needs to be continuous.
For the online adaptation setting, we adopt Faster-RCNN [ren2015faster] with a ResNet50 [he2016deep] backbone pre-trained on ImageNet [krizhevsky2012imagenet]. In all of our experiments, the input images are resized such that the shorter side is 600 pixels while maintaining the aspect ratio. We set the batch size to 1 for all experiments. For the student-teacher framework, the EMA weight momentum for the teacher model is set to 0.99. Pseudo-labels generated by the teacher network with confidence greater than the threshold of 0.9 are selected for student training. We utilize an SGD optimizer to train the student network with a learning rate of 0.001 and momentum of 0.9 for both online and offline training. The Global Memory Bank contains 1024 memory items. Further, the source model is trained using an SGD optimizer with a learning rate of 0.001 and momentum of 0.9 for 10 epochs. We report the mean Average Precision (mAP) with an IoU threshold of 0.5 for the teacher network on the distribution-shifted target-domain test data during evaluation.
| Type | Method | Offline | Online | person | rider | car | truck | bus | train | mcycle | bicycle | mAP |
|------|--------|---------|--------|--------|-------|-----|-------|-----|-------|--------|---------|-----|
| UDA | DA Faster [Chen2018DomainAF] (CVPR 2018) | ✓ | ✕ | 25.0 | 31.0 | 40.5 | 22.1 | 35.3 | 20.2 | 20.0 | 27.1 | 27.6 |
| UDA | Selective DA [zhu2019adapting] (CVPR 2019) | ✓ | ✕ | 33.5 | 38.0 | 48.5 | 26.5 | 39.0 | 23.3 | 28.0 | 33.6 | 33.8 |
| UDA | D&Match [kim2019diversify] (CVPR 2019) | ✓ | ✕ | 30.8 | 40.5 | 44.3 | 27.2 | 38.4 | 34.5 | 28.4 | 32.2 | 34.6 |
| UDA | MAF [he2019multi] (ICCV 2019) | ✓ | ✕ | 28.2 | 39.5 | 43.9 | 23.8 | 39.9 | 33.3 | 29.2 | 33.9 | 34.0 |
| UDA | Robust DA [khodabandeh2019robust] (ICCV 2019) | ✓ | ✕ | 35.1 | 42.1 | 49.1 | 30.0 | 45.2 | 26.9 | 26.8 | 36.0 | 36.4 |
| UDA | MTOR [cai2019exploring] (CVPR 2019) | ✓ | ✕ | 30.6 | 41.4 | 44.0 | 21.9 | 38.6 | 40.6 | 28.3 | 35.6 | 35.1 |
| UDA | Strong-Weak [saito2019strong] (CVPR 2019) | ✓ | ✕ | 29.9 | 42.3 | 43.5 | 24.5 | 36.2 | 32.6 | 30.0 | 35.3 | 34.3 |
| UDA | Categorical DA [xu2020exploring] (CVPR 2020) | ✓ | ✕ | 32.9 | 43.8 | 49.2 | 27.2 | 45.1 | 36.4 | 30.3 | 34.6 | 37.4 |
| UDA | MeGA CDA [hsu2020progressive] (CVPR 2021) | ✓ | ✕ | 37.7 | 49.0 | 52.4 | 25.4 | 49.2 | 46.9 | 34.5 | 39.0 | 41.8 |
| UDA | Unbiased DA [deng2021unbiased] (CVPR 2021) | ✓ | ✕ | 33.8 | 47.3 | 49.8 | 30.0 | 48.2 | 42.1 | 33.0 | 37.3 | 40.4 |
| SFDA | SFOD [li2020free] (AAAI 2021) | ✓ | ✕ | 25.5 | 44.5 | 40.7 | 33.2 | 22.2 | 28.4 | 34.1 | 39.0 | 33.5 |
| SFDA | HCL [huang2021model] (NeurIPS 2021) | ✓ | ✕ | 26.9 | 46.0 | 41.3 | 33.0 | 25.0 | 28.1 | 35.9 | 40.7 | 34.6 |
| O-SFDA | Tent [wang2020tent] (ICLR 2021) | ✕ | ✓ | 31.2 | 38.6 | 37.1 | 20.2 | 23.4 | 10.1 | 21.7 | 33.4 | 26.8 |
Quantitative results (mAP) for Cityscapes → FoggyCityscapes. S: Source-only, O: Oracle, UDA: Unsupervised Domain Adaptation, SFDA: Source-Free Domain Adaptation, O-SFDA: Online Source-Free Domain Adaptation.
| Type | Method | Offline | Online | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mAP |
|------|--------|---------|--------|------|------|------|------|--------|-----|-----|-----|-------|-----|-------|-----|-------|-------|--------|-------|-------|------|-------|----|-----|
| UDA | ADDA [inoue2018cross] (CVPR 2018) | ✓ | ✕ | 20.1 | 50.2 | 20.5 | 23.6 | 11.4 | 40.5 | 34.9 | 2.3 | 39.7 | 22.3 | 27.1 | 10.4 | 31.7 | 53.6 | 46.6 | 32.1 | 18.0 | 21.1 | 23.6 | 18.3 | 27.4 |
| UDA | DA Faster [Chen2018DomainAF] (CVPR 2018) | ✓ | ✕ | 15.0 | 34.6 | 12.4 | 11.9 | 19.8 | 21.1 | 23.3 | 3.10 | 22.1 | 26.3 | 10.6 | 10.0 | 19.6 | 39.4 | 34.6 | 29.3 | 1.00 | 17.1 | 19.7 | 24.8 | 19.8 |
| UDA | BDC Faster [saito2019strong] (CVPR 2019) | ✓ | ✕ | 20.2 | 46.4 | 20.4 | 19.3 | 18.7 | 41.3 | 26.5 | 6.40 | 33.2 | 11.7 | 26.0 | 1.7 | 36.6 | 41.5 | 37.7 | 44.5 | 10.6 | 20.4 | 33.3 | 15.5 | 25.6 |
| UDA | CRDA [xu2020exploring] (CVPR 2020) | ✓ | ✕ | 28.7 | 55.3 | 31.8 | 26.0 | 40.1 | 63.6 | 36.6 | 9.4 | 38.7 | 49.3 | 17.6 | 14.1 | 33.3 | 74.3 | 61.3 | 46.3 | 22.3 | 24.3 | 49.1 | 44.3 | 38.3 |
| UDA | Unbiased DA [deng2021unbiased] (CVPR 2021) | ✓ | ✕ | 39.5 | 60.0 | 30.5 | 39.7 | 37.5 | 56.0 | 42.7 | 11.1 | 49.6 | 59.5 | 21.0 | 29.2 | 49.5 | 71.9 | 66.4 | 48.0 | 21.2 | 13.5 | 38.8 | 50.4 | 41.8 |
| SFDA | SFOD [li2020free] (AAAI 2021) | ✓ | ✕ | 20.1 | 51.5 | 26.8 | 23.0 | 24.8 | 64.1 | 37.6 | 10.3 | 36.3 | 20.0 | 18.7 | 13.5 | 26.5 | 49.1 | 37.1 | 32.1 | 10.1 | 17.6 | 42.6 | 30.0 | 29.5 |
| O-SFDA | Tent [wang2020tent] (ICLR 2021) | ✕ | ✓ | 15.4 | 44.9 | 20.8 | 15.3 | 18.9 | 43.8 | 29.6 | 11.9 | 30.2 | 14.1 | 17.8 | 11.2 | 27.3 | 47.6 | 38.7 | 35.4 | 9.80 | 17.0 | 42.6 | 43.8 | 26.8 |
Quantitative results (mAP) for PASCAL-VOC → Clipart. S: Source-only, UDA: Unsupervised Domain Adaptation, SFDA: Source-Free Domain Adaptation, O-SFDA: Online Source-Free Domain Adaptation.
When source-trained models are deployed in real-world applications such as autonomous navigation, they are likely to encounter data from multiple weather conditions such as fog, haze, etc. In most cases, the deployed detector models would be trained for clear-weather conditions. We propose to formulate this as an online adaptation problem, as it is difficult to pre-determine what kind of weather conditions will occur. Subsequently, we update the detector model in an online manner to adapt to any weather shifts the model might observe after deployment. To evaluate the proposed method under such conditions, we experiment on the Cityscapes [cordts2016cityscapes] → FoggyCityscapes [Sakaridis2018SemanticFS] benchmark. Here, we have a detection model trained on the Cityscapes dataset, consisting of 2,975 normal-weather training images and 500 test images with 8 object categories: person, rider, car, truck, bus, train, motorcycle and bicycle. During inference, images from FoggyCityscapes are sent sequentially, and the object detection model is adapted in an online manner to improve generalization to foggy/hazy weather.
Table 1 compares the proposed method with state-of-the-art UDA, SFDA, and O-SFDA methods for the Cityscapes → FoggyCityscapes adaptation scenario. From Table 1, note that UDA and SFDA methods operate in an offline manner, whereas O-SFDA operates in an online manner. Firstly, in the online setting, our proposed method outperforms existing UDA methods such as SWDA [saito2019strong], MTOR [cai2019exploring] and InstanceDA [wu2021instance] by a considerable margin. However, compared to MeGA-CDA [vs2021mega] and Unbiased DA [deng2021unbiased], our proposed method produces competitive performance with a drop of 3-4 mAP. Note that these UDA methods have access to labelled source data, whereas under the SFDA setting the proposed model only has access to the source-trained model. Furthermore, the proposed method outperforms SFDA methods such as SFOD [li2020free] and HCL [huang2021model] by 1.7 and 0.6 mAP, respectively. Secondly, when compared to test-time adaptation methods such as Tent [wang2021tent], our best performing model surpasses it by 3.0 mAP. Therefore, for the Cityscapes → FoggyCityscapes adaptation scenario, our proposed method produces state-of-the-art results in both the online and offline SFDA settings.
| Type | Method | Offline | Online | Sim10k → City (AP of Car) | KITTI → City (AP of Car) |
|------|--------|---------|--------|---------------------------|--------------------------|
| UDA | DA Faster [Chen2018DomainAF] (CVPR 2018) | ✓ | ✕ | 38.9 | 38.5 |
| UDA | MAF [he2019multi] (ICCV 2019) | ✓ | ✕ | 41.1 | 41.0 |
| UDA | Robust DA [khodabandeh2019robust] (ICCV 2019) | ✓ | ✕ | 42.5 | 42.9 |
| UDA | Strong-Weak [saito2019strong] (CVPR 2019) | ✓ | ✕ | 40.1 | 37.9 |
| UDA | Harmonizing [chen2020harmonizing] (CVPR 2020) | ✓ | ✕ | 42.5 | - |
| UDA | Cycle DA [zhao2020collaborative] (ECCV 2020) | ✓ | ✕ | 41.5 | 41.7 |
| UDA | MeGA CDA [vs2021mega] (CVPR 2021) | ✓ | ✕ | 44.8 | 43.0 |
| UDA | Unbiased DA [deng2021unbiased] (CVPR 2021) | ✓ | ✕ | 43.1 | - |
| SFDA | SFOD [li2020free] (AAAI 2021) | ✓ | ✕ | 42.3 | 43.6 |
| O-SFDA | Tent [wang2020tent] (ICLR 2021) | ✕ | ✓ | 32.8 | 34.5 |
| Type | Method | Offline | Online | bike | bird | car | cat | dog | person | mAP |
|------|--------|---------|--------|------|------|-----|-----|-----|--------|-----|
| UDA | DA Faster [Chen2018DomainAF] (CVPR 2018) | ✓ | ✕ | 75.2 | 40.6 | 48.0 | 31.5 | 20.6 | 60.0 | 46.0 |
| UDA | BDC Faster [saito2019strong] (CVPR 2019) | ✓ | ✕ | 68.6 | 48.3 | 47.2 | 26.5 | 21.7 | 60.5 | 45.5 |
| UDA | BSR [kim2019self] (ICCV 2019) | ✓ | ✕ | 82.8 | 43.2 | 49.8 | 29.6 | 27.6 | 58.4 | 48.6 |
| UDA | WST [kim2019self] (ICCV 2019) | ✓ | ✕ | 77.8 | 48.0 | 45.2 | 30.4 | 29.5 | 64.2 | 49.2 |
| UDA | SWDA [saito2019strong] (CVPR 2019) | ✓ | ✕ | 71.3 | 52.0 | 46.6 | 36.2 | 29.2 | 67.3 | 50.4 |
| UDA | HTCN [chen2020harmonizing] (CVPR 2020) | ✓ | ✕ | 78.6 | 47.5 | 45.6 | 35.4 | 31.0 | 62.2 | 50.1 |
| UDA | I3Net [chen2021i3net] (CVPR 2021) | ✓ | ✕ | 81.1 | 49.3 | 46.2 | 35.0 | 31.9 | 65.7 | 51.5 |
| UDA | Unbiased DA [deng2021unbiased] (CVPR 2021) | ✓ | ✕ | 88.2 | 55.3 | 51.7 | 39.8 | 43.6 | 69.9 | 55.6 |
| SFDA | SFOD [li2020free] (AAAI 2021) | ✓ | ✕ | 76.2 | 44.9 | 49.3 | 31.6 | 30.6 | 55.2 | 47.9 |
| O-SFDA | Tent [wang2020tent] (ICLR 2021) | ✕ | ✓ | 62.3 | 53.4 | 43.7 | 29.5 | 36.4 | 48.3 | 45.4 |
Collecting and annotating data is labor-intensive and incurs heavy costs, especially for detection datasets, where on top of assigning a category, one needs to add bounding boxes for every object location in the image. On the other hand, creating a synthetic dataset through simulation is much less labor-intensive and generates annotations for free. Hence, it makes sense to train a detector model on a synthetically generated dataset and then deploy it in real-world conditions. However, stylistic/appearance differences between real and synthetic data limit such deployment due to performance issues. Here, we formulate this as an online adaptation problem, updating a synthetic-data-trained model on real-world test data. In particular, we consider a source model trained on Sim10k [johnson2016driving], consisting of 10,000 training images with 58,701 bounding boxes of the car category, rendered by the gaming engine Grand Theft Auto. For real-world test data, we use the Cityscapes [cordts2016cityscapes] validation set for online model adaptation.
In Table 4, we report Sim10K → Cityscapes adaptation results for existing UDA, SFDA, and O-SFDA methods. In the offline setting, the proposed method outperforms existing UDA works such as DAFaster [chen2018domain], SWDA [saito2019strong], and RobustDA [khodabandeh2019robust] by a considerable margin. Furthermore, it outperforms SFOD [li2020free] by 0.7 mAP. In the online setting, the proposed method outperforms Tent [wang2021tent] by 4.0 mAP. Therefore, the proposed method performs well under synthetic-to-real domain shifts.
In most real-world applications, it is assumed that the training and test data are collected using cameras with the same parameters. In practice, however, camera parameters often differ, causing the collected images to have different appearances, such as radial distortions, tangential distortions, etc. Such changes in camera parameters can cause the model to perform poorly.
Hence, to tackle such camera distortions, we formulate the problem as an online adaptation problem and show that the proposed approach succeeds in generalizing to such cases. Here, we have access only to the source model, trained on the KITTI [geiger2013vision] dataset with 7,481 training images annotated with bounding boxes for the car category. To emulate the cross-camera scenario, we perform online adaptation on the Cityscapes [cordts2016cityscapes] validation set containing 500 images.
We report the results of the cross-camera adaptation experiment in Table 4. As in Sim10K → Cityscapes adaptation, for KITTI → Cityscapes we observe similar performance improvements over the UDA, SFDA, and O-SFDA methods. Specifically, in the O-SFDA setting, the proposed method outperforms Tent [wang2021tent] by 5.6 mAP. Thus, the proposed method is able to model cross-camera domain shifts effectively.
Here, we evaluate the proposed method in the case where a concept shift occurs during inference. By concept shift, we refer to a complete change in the depiction of an object, e.g., going from real-world to artistic images. Unlike the previous scenarios, where objects undergo stylistic/appearance changes, here the entire concept of an object is different, e.g., a real-world car vs. a cartoon car [inoue2018cross]. We show that even in this challenging scenario, the proposed approach improves model generalization through online updates. We consider a model trained on Pascal-VOC [everingham2010pascal], adapted to the test sets of Clipart [inoue2018cross] and Watercolor [inoue2018cross]. Specifically, Clipart contains 1,000 unlabeled images with the same 20 categories as Pascal-VOC, while Watercolor consists of 1K training and 1K testing images with six categories. We compare PASCAL-VOC → Clipart results with existing methods in Table 2 and PASCAL-VOC → Watercolor results in Table 4. From Table 2, we infer that the proposed method outperforms existing UDA and SFDA methods such as BDCFaster [saito2019strong], DAFaster [chen2018domain], and ADDA [inoue2018cross] by 6.7, 12.5, and 4.9 mAP, respectively. Similarly, from Table 4, we infer that the proposed method outperforms most of the existing UDA and SFDA methods in the offline setting. Further, in the online setting, the proposed method outperforms Tent [wang2020tent] by a significant margin. This demonstrates the capability of the proposed method to generalize even under concept shift, in both online and offline settings.
A quantitative comparison is performed to analyze the effect of the order of the input sequence during online adaptation. From the observed variance, the order of the input sequence has little effect on model performance. Note that, in online adaptation, the test samples are seen only once and adaptation happens in an unsupervised manner.
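The order-sensitivity study above can be sketched as repeated online-adaptation runs over random permutations of the test stream, reporting the mean and spread of the final score. A minimal sketch, where `adapt_and_eval` is a hypothetical placeholder for the actual adapt-then-evaluate pipeline returning a score such as mAP:

```python
import random
import statistics

def order_sensitivity(samples, adapt_and_eval, n_runs=5, seed=0):
    """Run online adaptation over several random orderings of the test
    stream and report the mean/stdev of the resulting score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        order = list(samples)
        rng.shuffle(order)  # same samples each run, different order
        scores.append(adapt_and_eval(order))
    return statistics.mean(scores), statistics.stdev(scores)
```

A small standard deviation across runs indicates that the adaptation result is largely insensitive to the ordering of the stream.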
Quantitative analysis. The Cityscapes → FoggyCityscapes ablation results are reported in Table 5 for the offline-SFDA setting. We first consider a student-teacher offline update baseline, which provides significant improvements over the source-only baseline. For a fair comparison, we also consider utilizing a supervised contrastive loss [khosla2020supervised] for offline updates. In particular, we use the predictions provided by student-teacher training as the label information needed for applying the supervised contrastive loss over object proposals. Denoted as SupCon in Table 5, the addition of supervised contrastive learning further improves the performance by 1.3 mAP. However, the proposed memory-based contrastive learning outperforms supervised contrastive learning by 1.4 mAP, indicating the utility of the proposed method for learning better target representations. Finally, we analyze the performance of the proposed method by varying the global memory bank capacity from 256 to 1024 memory items. As shown in Table 5, the memory-based contrastive loss with 1024 memory items performs best compared to 256 and 512 memory items. Further, note that our model takes around 1 second to perform online adaptation for one sample.
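The memory-guided contrastive objective can be illustrated with a standard InfoNCE-style loss, where the positive and negative features would be supplied by MemXformer memory items. This is a simplified pure-Python sketch under that assumption; the paper's exact loss formulation may differ:

```python
import math

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss for one query feature.

    `query`, `positive`, and each entry of `negatives` are assumed to be
    L2-normalized feature vectors (plain lists of floats here); in the
    proposed method, positives/negatives would be drawn from the memory.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    pos = math.exp(dot(query, positive) / temperature)
    neg = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

A well-matched positive drives the loss toward zero, while a mismatched positive (or hard negatives close to the query) increases it, pulling same-category target features together and pushing others apart.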
Qualitative analysis. Fig. 4 shows t-SNE visualizations for the source-only baseline, student-teacher training, and the proposed method in the Cityscapes → FoggyCityscapes online-SFDA setting. The t-SNE [van2008visualizing] visualizations are created from the RoI features extracted from the predictions on 500 test images. Due to the distribution shift, the features of the source-only baseline are dispersed and the classification boundaries are weak. With student-teacher training, the model learns better classification boundaries, resulting in better quantitative performance; however, its features have large variance and are not compact. In contrast, the proposed method yields even better classification boundaries and learns compact features for each category, resulting in a more robust model. A further comparison analyzing the effect of the order of the input sequence during online adaptation is shown in Fig. 5. Multiple experiments with different input-sequence orders are conducted, and the corresponding mean and variance of performance are plotted in Fig. 5. From the variance, we observe that the order of the input sequence does not substantially affect the model's performance. Further, model performance increases as more test samples are encountered during online adaptation, showing the effectiveness of MemXformer in exploiting the online target distribution. Note that, in online adaptation, the test samples are seen only once and adaptation happens in an unsupervised manner.
In this work, we introduced a practical domain adaptation setting for the object detection task that is feasible for real-world deployment. In particular, we proposed a novel unified adaptation framework that makes detector models robust against online target distribution shifts. Further, we introduced the MemXformer module, which stores prototypical patterns of the target distribution and provides contrastive pairs to boost contrastive learning on the target domain. We conducted extensive experiments on multiple detection benchmark datasets and compared against existing unsupervised domain adaptation, source-free domain adaptation, and test-time adaptation methods to show the effectiveness of the proposed approach for both online and offline adaptation of object detection models. We also analyzed multiple aspects of the proposed method through ablation experiments and identified potential directions for future research.