
Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment

Detection transformers like DETR have recently shown promising performance on many object detection tasks, but the generalization ability of these methods remains a challenge in cross-domain adaptation scenarios. To address the cross-domain issue, a straightforward way is to perform token alignment with adversarial training in transformers. However, its performance is often unsatisfactory as the tokens in detection transformers are quite diverse and represent different spatial and semantic information. In this paper, we propose a new method called Spatial-aware and Semantic-aware Token Alignment (SSTA) for cross-domain detection transformers. In particular, we take advantage of the characteristics of cross-attention as used in detection transformers and propose the spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA) strategies to guide the token alignment across domains. For spatial-aware token alignment, we extract the information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries. For semantic-aware token alignment, we inject the category information into the cross-attention map and construct domain embeddings to guide the learning of a multi-class discriminator so as to model the category relationship and achieve category-level token alignment during the entire adaptation process. We conduct extensive experiments on several widely-used benchmarks, and the results clearly show the effectiveness of our proposed method over existing state-of-the-art baselines.


1 Introduction

Object detection, as a fundamental task for visual understanding, has been one of the most attractive research problems in the computer vision community [Faster-RCNN, FCOS, Fast-RCNN, Cascade-RCNN, SSD, YOLO, DETR]. With the thriving of deep convolutional neural networks (CNNs) [AlexNet, Resnet], many CNN-based object detection approaches (e.g., Faster RCNN [Faster-RCNN] and FCOS [FCOS]) have been proposed in the last decade. Recently, detection transformers (e.g., DETR [DETR]) have gained increasing attention from researchers. Built on the design of the vision transformer, detection transformers remove the need for hand-designed components such as non-maximum suppression (NMS) and anchor generation used in traditional CNN-based object detectors, while achieving new state-of-the-art performance on many object detection tasks [DETR, Deformable-DETR, Sparse-DETR, Conditional-DETR, Anchor-DETR, DINO]. Despite the success of detection transformers, their cross-domain generalization ability remains a challenge when adapting a learned model to a novel domain (i.e., the target domain). Existing detection transformers often suffer from severe performance degradation due to the domain discrepancy between the source and target domains [SFA].

However, addressing the domain shift issue for detection transformers is non-trivial. Researchers have proposed many ways to improve the cross-domain generalization ability of CNN-based object detectors. For example, a variety of cross-domain object detection (CDOD) studies [DA-Faster-RCNN, SWDA, UMT, SSAL, SCDA] eliminate the domain discrepancy by aligning the feature distributions of the source and target domains via adversarial training. Similarly, a potential and straightforward solution for the cross-domain detection transformer is to perform token alignment with adversarial training, since the visual features are converted into tokens as the input to the transformer blocks. However, aligning the token distributions is difficult, especially when there exists a significant gap between domains.

Recent work [SFA] attempts to apply adversarial training strategies on tokens in transformers, but the improvements are still unsatisfactory. One of the major reasons is that tokens in detection transformers are quite diverse. In detection transformers (e.g., DETR), the tokens are passed through several multi-head self-attention layers to obtain new token embeddings representing different spatial and semantic information. Then, object queries are introduced to probe useful tokens and leverage them to predict the positions and categories of different objects. On the one hand, since some tokens are more useful than others, it is desirable to take the importance of tokens into consideration in the cross-domain detection transformer. On the other hand, the semantic information embedded in tokens is also helpful for aligning the token distributions w.r.t. the corresponding category, which can ease the adversarial training process.

In this work, we propose a new cross-domain detection method named Spatial-aware and Semantic-aware Token Alignment (SSTA) under the transformer framework. In particular, we take advantage of the characteristics of cross-attention as used in detection transformers and develop two new strategies, i.e., spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA), to guide the token alignment across domains. The cross-attention in the decoder utilizes the object queries to aggregate information from the encoder outputs (tokens); during this process, only a small part of the tokens are attended to for detecting objects accurately. For spatial-aware token alignment, we extract the information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries. For semantic-aware token alignment, we inject the category information into the cross-attention map and construct domain embeddings to guide the learning of a multi-class domain discriminator so as to model the category relationship and achieve category-level alignment during the entire adaptation process.

We have conducted extensive experiments on three domain adaptive benchmarks, including adverse weather, synthetic-to-real, and scene adaptation, where we achieve new state-of-the-art performance for cross-domain object detection. The experimental results show the effectiveness of our proposed method. We also show the usefulness of each component in our approach by conducting careful ablation studies. The contributions of our work are three-fold:

  • We propose a novel approach named Spatial-aware and Semantic-aware Token Alignment (SSTA) for cross-domain object detection, under the transformer framework. To the best of our knowledge, we make the first attempt to explore the intrinsic cross-attention property for improving the cross-domain generalization ability of detection transformers.

  • Two new modules, i.e., spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA), are developed, respectively, to align the token distributions according to their attention to object queries and to achieve category-level alignment.

  • We conduct extensive experiments on several widely-used benchmarks (e.g., FoggyCityscapes, Sim10K and BDD100K), and promising results demonstrate the effectiveness of our proposed method over existing state-of-the-art baselines.

2 Related Work

2.1 Object Detection

Object detection aims to recognize and localize one or multiple objects in a given image. Traditional object detection methods [Faster-RCNN, FCOS, Fast-RCNN, Cascade-RCNN, SSD, YOLO] are based on convolutional neural networks (CNNs) [Resnet, VGG, AlexNet] and can be divided into two directions: one-stage and two-stage methods. Two-stage methods [Faster-RCNN, Fast-RCNN, Cascade-RCNN] typically first generate region proposals and then refine their classification and bounding boxes. In contrast, one-stage methods [FCOS, SSD, YOLO] skip the proposal generation stage and directly predict the categories and coordinates of objects. Although these CNN-based detectors have achieved remarkable breakthroughs, they need many hand-designed components, such as anchor generation and the removal of duplicated detections by non-maximum suppression, which explicitly encode prior knowledge about the task. Recently, Carion et al. proposed DETR [DETR], which achieves end-to-end object detection without anchor generation or any sophisticated post-processing procedure. Many DETR-like models [Deformable-DETR, Sparse-DETR, Conditional-DETR, Anchor-DETR] have been proposed to further improve the DETR model in both convergence speed and accuracy. Among these works, one of the most representative is Deformable DETR [Deformable-DETR], which introduces the deformable attention mechanism [deformable-conv] into DETR and designs a multi-scale attention module, significantly reducing training time and improving detection performance. Nevertheless, these methods suffer from severe performance degradation due to the domain discrepancy between the training and test domains. To address this problem, we present Spatial-aware and Semantic-aware Token Alignment (SSTA) to learn domain-invariant token representations. Following [SFA], we choose Deformable DETR [Deformable-DETR] as the base detector for a fair comparison.

2.2 Cross-domain Object Detection

Cross-domain object detection (CDOD) aims to transfer knowledge from a label-rich source domain to a label-scarce target domain by bridging the domain discrepancy between them. Previous works [PDA, DM, DA-Faster-RCNN, SWDA, SCDA, Mega, UMT, MOTR, SSAL] can be roughly categorized into image translation, self-supervision, and adversarial training. Image translation methods [PDA, DM] adopt style transfer algorithms to enhance image diversity so as to reduce the domain gap at the pixel level. Self-supervision approaches [UMT, MOTR, SSAL, auto-adapt] deploy pseudo-labeling techniques to provide additional supervision signals for the target domain. Adversarial training methods [DA-Faster-RCNN, SWDA] align the feature distributions to eliminate the domain discrepancy and bridge the domain gap. Early works align features at diverse levels, e.g., strong-weak alignment [SWDA] and global-instance-level alignment [DA-Faster-RCNN].

However, these methods are based on Faster RCNN or FCOS, and the transferability of detection transformers remains a challenge. SFA [SFA] develops a domain-adaptive detection transformer that aligns domain query features and token-wise features, and designs an additional bipartite matching consistency loss to enhance feature discriminability. Different from SFA [SFA], our SSTA takes advantage of the cross-attention map and leverages the spatial and semantic information to help align the token distributions. Our model follows the principle of minimal modification to the DETR model, so inference incurs no extra overhead. To the best of our knowledge, our method is the first domain adaptation work that takes advantage of the characteristics of cross-attention to improve the generalization ability of the DETR model.

Figure 1: The overview of our method. We design a new Spatial-aware and Semantic-aware Token Alignment (SSTA) module to align the CNN token and encoder token distributions across the two domains. We take advantage of the characteristics of the cross-attention in the decoder and feed the cross-attention map (CAM) and the predictions of the detection head (FFN) into the SSTA module to improve the token alignment. The details of the SSTA module are shown in Fig. 2.

3 Methodology

In the task of CDOD, we are given a source domain consisting of labeled images with object bounding boxes and their class labels, and a target domain consisting of unlabeled images. Let us denote \mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s} drawn from distribution P_s as the labeled source domain and \mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t} drawn from distribution P_t as the unlabeled target domain, where P_s \neq P_t. Each label y_i^s = \{(b_k, c_k)\}_{k=1}^{K_i}, where b_k and c_k are the bounding box and corresponding category of each object, and K_i is the total number of objects in image x_i^s. Our goal is to learn an object detection model that performs well on the target domain.

In the following, we introduce the motivation of our proposed method in Sec. 3.1. Then, we present the vanilla token alignment in Sec. 3.2 and describe the detailed designs of spatial-aware token alignment (Sec. 3.3) and semantic-aware token alignment (Sec. 3.4). Lastly, we give the overall objective of the proposed method (Sec. 3.5).

3.1 Motivation

In this section, we give a brief preliminary on the DETR model, and then demonstrate the cross-domain challenges in DETR as well as our new solution.

DEtection TRansformer (DETR): DETR consists of a CNN backbone, a transformer encoder, and a transformer decoder. The image x \in \mathbb{R}^{3 \times H_0 \times W_0} is first fed into the CNN backbone (e.g., ResNet-50 [Resnet]) to generate a lower-resolution feature map f \in \mathbb{R}^{C \times H \times W}, where H = H_0/32 and W = W_0/32. The encoder uses a 1x1 convolution to reduce the channel dimension C to a smaller dimension d and then collapses the spatial dimensions into one dimension, resulting in token inputs z \in \mathbb{R}^{L \times d}, where L = H \times W is the length of the sequence. Each encoder layer takes the tokens along with position embeddings to model interactions among tokens and outputs new tokens through a standard architecture consisting of multi-head self-attention and a feed-forward network (FFN). The decoder comprises multi-head self-attention and multi-head cross-attention mechanisms. Different from the encoder, the decoder first applies self-attention over the object queries and then uses cross-attention (i.e., encoder-decoder attention) to aggregate features from the encoder outputs, resulting in a sequence of query embeddings from which the decoder produces the final predictions. DETR utilizes the Hungarian algorithm to find a bipartite matching between the sets of predictions and ground truths. The loss of DETR can be summarized as follows:

\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{box},   (1)

where \mathcal{L}_{cls} is for classification and \mathcal{L}_{box} is for bounding box regression.

DETR requires much longer training (i.e., 500 epochs) to converge than traditional detectors and has relatively low detection accuracy on small objects. Thus, Deformable DETR [Deformable-DETR] adopts an efficient deformable attention module to replace the dense attention in DETR. The deformable attention mechanism can be naturally extended to aggregate multi-scale features, leading to fast convergence and high performance. Following [SFA], we choose Deformable DETR [Deformable-DETR] as the base detector for a fair comparison. For more details, please refer to [DETR, Deformable-DETR].
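To make the tokenization described above concrete, here is a minimal PyTorch sketch (an illustration, not the authors' code) of how a DETR-style model turns a backbone feature map into the token sequence z; the 512x512 input resolution and the d = 256 projection are assumed values.

import torch
import torch.nn as nn
import torchvision

# CNN backbone: ResNet-50 without its average-pooling and classification head.
backbone = nn.Sequential(*list(torchvision.models.resnet50().children())[:-2])
input_proj = nn.Conv2d(2048, 256, kernel_size=1)  # 1x1 conv: C = 2048 -> d = 256

x = torch.randn(1, 3, 512, 512)               # a batch with one H0 x W0 image
f = backbone(x)                               # (1, 2048, H, W), H = W = 512 / 32 = 16
z = input_proj(f).flatten(2).transpose(1, 2)  # (1, L, d) with L = H * W = 256 tokens
print(z.shape)                                # torch.Size([1, 256, 256])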

Cross-domain Challenges in DETR: To improve the generalization ability of detection transformers, a potential solution is to perform token alignment with adversarial learning. Recent work [SFA] attempts to apply adversarial training strategies on the tokens in transformers, but the improvements are still unsatisfactory. One of the main reasons is that the tokens in detection transformers are quite diverse. In detection transformers (e.g., DETR), the tokens are passed through several multi-head self-attention layers to obtain new token embeddings representing different spatial and semantic information. Then, object queries are introduced to probe useful tokens and leverage them to predict the positions and categories of different objects. On the one hand, since some tokens are more useful than others, it is desirable to take the importance of tokens into consideration in the cross-domain detection transformer. On the other hand, the semantic information embedded in tokens is also helpful for aligning the token distributions of the corresponding category, which eases the adversarial training when aligning the token distributions between domains.

To this end, we propose the spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA) strategies to guide the token alignment across domains by leveraging the characteristics of cross-attention in detection transformers. As shown in Fig. 1, the proposed spatial-aware and semantic-aware token alignment (SSTA) module adopts the cross-attention map (CAM) and the predictions of the decoder to align the distributions of tokens from the CNN and the encoder. The details are presented below.

3.2 Vanilla Token Alignment

Before we dive into the design of our SSTA module, we first introduce the vanilla token alignment. Existing adversarial methods [DA-Faster-RCNN, SWDA, DA-DETR] usually adopt a discriminator to reduce the domain discrepancy by aligning the feature distributions between domains. The discriminator tries to distinguish which domain the features come from, while the feature extractor aims to deceive the discriminator, in a minimax manner. The discriminator can be placed at a certain layer or at multiple layers of the feature extractor. In practice, a gradient reversal layer (GRL) [DANN] is used to connect the discriminator and the feature extractor; it flips the gradients flowing into the feature extractor, leading to end-to-end learning instead of the sophisticated multi-stage iterative optimization of [GAN]. To bridge the domain gap, a naive solution is to simply align the distribution of tokens, where the domain discriminator tries to recognize the domain of each token. Formally, the adversarial objective of vanilla token alignment can be defined as follows:

\mathcal{L}_{ta} = -\frac{1}{L}\sum_{i=1}^{L}\Big[d \log D(f_i) + (1-d)\log\big(1 - D(f_i)\big)\Big],   (2)

where L is the length of the sequence, f_i is the i-th token representation (taken from the CNN backbone or the transformer encoder), D is the domain discriminator, and d is the domain label, with d = 1 for the source and d = 0 for the target. When the above adversarial loss is optimized, the sign of the gradient back-propagated from the discriminator to the feature extractor is inverted by the GRL, making the feature extractor learn domain-invariant representations.
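The following is a minimal sketch of this vanilla token alignment under our reading of Eq. (2), assuming 256-dimensional tokens and a two-layer MLP discriminator; all names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity forward, flipped gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # flip gradients toward the extractor

# A small MLP domain discriminator applied to every token independently.
token_discriminator = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

def vanilla_token_alignment_loss(tokens, domain_label, lambd=1.0):
    """tokens: (B, L, d) from the CNN backbone or encoder;
    domain_label: 1.0 for source tokens, 0.0 for target tokens."""
    logits = token_discriminator(GradReverse.apply(tokens, lambd)).squeeze(-1)  # (B, L)
    targets = torch.full_like(logits, domain_label)
    # Binary cross-entropy averaged over all L tokens, as in Eq. (2).
    return F.binary_cross_entropy_with_logits(logits, targets)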

The overall objective of vanilla token alignment can be formulated as:

\mathcal{L} = \mathcal{L}_{det} + \lambda\big(\mathcal{L}_{ta}^{cnn} + \mathcal{L}_{ta}^{enc}\big),   (3)

where λ is the trade-off parameter, and \mathcal{L}_{ta}^{cnn} and \mathcal{L}_{ta}^{enc} are the vanilla token alignment losses for the CNN and encoder tokens, respectively.

3.3 Spatial-aware Token Alignment

As analyzed in Sec. 3.1, object queries are introduced to probe useful tokens and leverage those tokens to predict the positions and categories of different objects. In other words, tokens contribute differently to the detection results. Simply aligning the token distributions between domains yields unsatisfactory improvements, as tokens in detection transformers are of different importance to the object detection task. If we treat all tokens as contributing equally to the adversarial training, we overlook matching the distributions of critical tokens that may contain essential instance and global context information for accurately predicting the positions and categories of objects. Consequently, the efforts to reduce the domain gap will eventually meet difficulties, making the alignment less effective.

Figure 2: The overview of our Spatial-aware and Semantic-aware Token Alignment (SSTA) module. Take the SSTA module for encoder tokens as an example. The proposed SSTA module takes the tokens as input and jointly utilizes Semantic-aware Token Alignment (SemTA) and Spatial-aware Token Alignment (SpaTA) to align the token distributions. SemTA incorporates the predictions of the detection head into the cross-attention map (CAM) to obtain a category cross-attention map (CCAM), which is used to construct domain embeddings that guide the learning of a multi-class discriminator (MCD) and achieve category-level token alignment. SpaTA utilizes the CAM to assign different weights to the adversarial learning of tokens according to their attention to object queries.

Motivated by this, we propose a spatial-aware token alignment (SpaTA) module to discover instance-related tokens and emphasize their alignment by assigning higher weights to these tokens during adversarial training according to their attention to the object queries. Formally, we obtain the objective as follows:

\mathcal{L}_{spa} = -\frac{1}{L}\sum_{i=1}^{L} w_i\Big[d \log D(f_i) + (1-d)\log\big(1 - D(f_i)\big)\Big],   (4)

where w_i is the weight for the i-th token; intuitively, the more important the token, the higher the weight it should be assigned. As shown in the right part of Fig. 2, we utilize the cross-attention map (CAM) to provide these weights, as object queries probe features by assigning different weights to tokens via the cross-attention mechanism.

However, the CAM cannot be directly obtained with deformable attention because of its special design, so the key factor is determining how to obtain the CAM. We scatter and accumulate the cross-attention in the decoder from each object query to the discrete token positions in the sequence. Since the sampling offsets in deformable attention are fractional, deformable attention applies bilinear interpolation to obtain values from the surrounding positions; therefore, we also apply bilinear interpolation to obtain the CAM. Specifically, let \hat{p}_k, \Delta p_{lkj}, A_{lkj}, and v be the reference point of the k-th object query in the decoder, the corresponding sampling offsets, the attention weights, and the values, respectively.

For the attention to each token, we obtain the CAM of the k-th query as follows:

\mathrm{CAM}_k(p) = \frac{1}{N}\sum_{l=1}^{N}\sum_{j} A_{lkj}\, G\big(p,\ \hat{p}_k + \Delta p_{lkj}\big),   (5)

where N is the number of decoder layers, G(·,·) is the bilinear interpolation operation, and p enumerates all integral spatial locations of tokens. We provide more details in our Supplementary materials. After obtaining the CAM, we filter out the attention values that are less than a given threshold.
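The sketch below illustrates one plausible way to accumulate the CAM under our reading of Eq. (5): each deformable-attention weight is scattered to the four integer neighbours of its fractional sampling location with bilinear weights. The tensor shapes, the single-layer and single-head simplification, and the name accumulate_cam are assumptions.

import torch

def accumulate_cam(sampling_locs, attn_weights, H, W):
    """Scatter deformable-attention weights into a per-query CAM (cf. Eq. (5)).
    sampling_locs: (Q, P, 2) normalized (x, y) sampling locations in [0, 1],
    already combining the reference points with their offsets;
    attn_weights: (Q, P); returns a CAM of shape (Q, H, W)."""
    cam = torch.zeros(sampling_locs.size(0), H * W)
    x = sampling_locs[..., 0] * (W - 1)  # fractional column index
    y = sampling_locs[..., 1] * (H - 1)  # fractional row index
    x0, y0 = x.floor().long(), y.floor().long()
    for dx in (0, 1):                    # the four integer neighbours
        for dy in (0, 1):
            xi = (x0 + dx).clamp(0, W - 1)
            yi = (y0 + dy).clamp(0, H - 1)
            # bilinear weight of this neighbour for each fractional location
            wgt = (1 - (x - xi.float()).abs()).clamp(min=0) * \
                  (1 - (y - yi.float()).abs()).clamp(min=0)
            cam.scatter_add_(1, yi * W + xi, attn_weights * wgt)
    return cam.view(-1, H, W)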

In summary, the importance weight for each token can be obtained via:

w_i = \overline{\mathrm{CAM}}(i)\cdot \mathbb{1}\big[\overline{\mathrm{CAM}}(i) \geq \tau\big],   (6)

where \overline{\mathrm{CAM}} is the average of the CAM over all the queries and \tau is an adaptive threshold for each sample x.
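The following sketch shows how the CAM could be converted into spatial-aware token weights (Eq. (6)) and plugged into the weighted adversarial loss of Eq. (4). Using the per-sample mean attention as the adaptive threshold and normalizing the weights are our assumptions.

import torch
import torch.nn.functional as F

def spatial_token_weights(cam):
    """cam: (Q, H, W) cross-attention maps; returns per-token weights (H*W,)."""
    avg_cam = cam.mean(dim=0).flatten()  # average CAM over all queries
    tau = avg_cam.mean()                 # adaptive per-sample threshold (assumed: the mean)
    w = torch.where(avg_cam >= tau, avg_cam, torch.zeros_like(avg_cam))
    return w / (w.sum() + 1e-6)          # normalization is our assumption

def spata_loss(logits, domain_label, weights):
    """logits: (L,) token discriminator outputs (after the GRL); weights: (L,)."""
    targets = torch.full_like(logits, domain_label)
    per_token = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * per_token).sum()   # weighted adversarial loss, Eq. (4)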

Method Detector person rider car truck bus train mcycle bicycle mAP
Faster RCNN [Faster-RCNN] (Source) Faster RCNN 26.9 38.2 35.6 18.3 32.4 9.6 25.8 28.6 26.9
DA-Faster [DA-Faster-RCNN] 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
SWDA [SWDA] 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
CFDA [CFFA] 43.2 37.4 52.1 34.7 34.0 46.9 29.9 30.8 38.6
UMT [UMT] 33.0 46.7 48.6 34.1 56.5 46.8 30.4 37.4 41.7
MeGA [Mega] 37.7 49.0 52.4 25.4 49.2 46.9 34.5 39.0 41.8
ICCR-VDD [ICCR-VDD] 33.4 44.0 51.7 33.9 52.0 34.7 34.2 36.8 40.0
ViSGA [ViSGA] 38.8 45.9 57.2 29.9 50.2 51.9 31.9 40.9 43.3
DIDN [DIDN] 38.3 44.4 51.8 28.7 53.3 34.7 32.4 40.4 40.5
FCOS [FCOS] (Source) FCOS 36.9 36.3 44.1 18.6 29.3 8.4 20.3 31.9 28.2
EPM [EPM] 41.9 38.7 56.7 22.6 41.5 26.8 24.6 35.5 36.0
SCAN [SCAN] 41.7 43.9 57.3 28.7 48.6 48.7 31.0 37.3 42.1
KTNet [KTNet] 46.4 43.2 60.6 25.8 41.2 40.4 30.7 38.8 40.9
SSAL [SSAL] 45.1 47.4 59.4 24.5 50.0 25.7 26.0 38.7 39.6
Deformable DETR [Deformable-DETR] (Source) Deformable DETR 38.6 40.6 45.8 11.6 28.9 1.7 18.9 39.1 28.1
SFA [SFA] 46.5 48.6 62.6 25.1 46.2 29.4 28.3 44.0 41.3
SSTA (Ours) 50.5 53.0 67.2 24.7 47.7 33.0 36.7 46.6 44.9
Table 1: Average precisions (%) of different methods on Cityscapes → FoggyCityscapes.

3.4 Semantic-aware Token Alignment

Although we have discovered the critical tokens to emphasize their alignment and avoid the influence of noisy tokens, the model still faces the risk of misalignment during the adaptation process [Mega, FADA]. The semantic information of tokens is helpful for aligning the token distributions of the corresponding category, so that the model can avoid class misalignment, where, for example, "car" and "truck" instances are forced to be very close in the feature space, deteriorating the discriminability of the model. Therefore, we propose to utilize a multi-class discriminator (MCD) [FADA] to capture the category information during adversarial training and thereby realize category-level token alignment. The multi-class discriminator captures not only domain information but also the category relationship. Concretely, we remold the single-class discriminator into a multi-class discriminator that outputs 2(C+1) logits, where C is the number of categories: the first C+1 logits are for the source domain, and the others are for the target domain. The domain embeddings of the source and target are e_s = [k, 0] and e_t = [0, k], respectively, where k \in \mathbb{R}^{C+1} is the domain knowledge and 0 is an all-zero vector. The objective of semantic-aware token alignment can be written as follows:

\mathcal{L}_{sem} = -\frac{1}{L}\sum_{i=1}^{L} e_i^{\top} \log D_{mc}(f_i),   (7)

where D_{mc} is the multi-class domain discriminator and e_i is the domain embedding of the i-th token (e_s for source tokens and e_t for target tokens). The key factor is determining how to obtain the domain knowledge k to build the domain embeddings for these tokens. As illustrated in the left part of Fig. 2, we utilize the CAM to extract the domain knowledge by injecting the category information into it. In particular, we incorporate the predictions of the detection head into the CAM and obtain a category cross-attention map (CCAM), which can be formally defined as follows:

\mathrm{CCAM}_c = \frac{1}{N_c}\sum_{k} \mathbb{1}[\hat{c}_k = c]\, \mathrm{CAM}_k,   (8)

where \mathrm{CCAM}_c refers to the CCAM for category c, N_c is the number of queries that belong to category c, \hat{c}_k is the category prediction from the detection head for the k-th query, and \mathbb{1}[\cdot] is the indicator function, which equals 1 if its argument is true and 0 otherwise. The domain knowledge k can then be obtained by applying the softmax function over categories to the CCAM. Finally, we obtain our domain adaptation loss by replacing the binary adversarial term in Eq. (4) with the semantic-aware token alignment:

\mathcal{L}_{da} = -\frac{1}{L}\sum_{i=1}^{L} w_i\, e_i^{\top} \log D_{mc}(f_i).   (9)
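The following sketch illustrates the semantic-aware pieces under our reading of Eqs. (7)-(9): building the CCAM from query predictions, deriving per-token domain knowledge with a softmax over categories, and scoring tokens with a multi-class discriminator. The exact construction of the domain embedding and all names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def build_ccam(cam, query_classes, num_classes):
    """cam: (Q, L) per-query attention over L tokens; query_classes: (Q,) predicted
    labels in {0, ..., C} (including background); returns CCAM of shape (C + 1, L)."""
    ccam = torch.zeros(num_classes + 1, cam.size(1))
    for c in range(num_classes + 1):
        mask = query_classes == c
        if mask.any():
            ccam[c] = cam[mask].mean(dim=0)  # average the CAMs of queries of class c
    return ccam

def semta_loss(tokens, ccam, discriminator, is_source):
    """tokens: (L, d) after the GRL; discriminator maps d -> 2 * (C + 1) logits."""
    k = F.softmax(ccam, dim=0).t()           # (L, C + 1) per-token domain knowledge
    zeros = torch.zeros_like(k)
    # e_s = [k, 0] for source tokens, e_t = [0, k] for target tokens
    e = torch.cat([k, zeros], 1) if is_source else torch.cat([zeros, k], 1)
    log_probs = F.log_softmax(discriminator(tokens), dim=1)  # (L, 2 * (C + 1))
    return -(e * log_probs).sum(dim=1).mean()

# Example discriminator for d = 256 tokens and C = 8 foreground categories:
mcd = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2 * 9))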

3.5 Overall Objective

In summary, the overall objective includes the detection loss of Deformable DETR [Deformable-DETR] on the source domain and the domain adaptation losses for the CNN and encoder tokens, and can be defined as:

\mathcal{L}_{overall} = \mathcal{L}_{det} + \lambda\big(\mathcal{L}_{da}^{cnn} + \mathcal{L}_{da}^{enc}\big),   (10)

where λ is the trade-off parameter, and \mathcal{L}_{da}^{cnn} and \mathcal{L}_{da}^{enc} are the domain adaptation losses for the CNN and encoder tokens, respectively.
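As a toy illustration of how the loss terms compose under Eq. (10), with dummy scalars standing in for the real detection and adaptation losses:

import torch

# Stand-ins for the detection loss and the two SSTA adaptation losses.
det_loss = torch.rand((), requires_grad=True)
da_loss_cnn = torch.rand((), requires_grad=True)
da_loss_enc = torch.rand((), requires_grad=True)

lam = 1.0  # trade-off parameter; Table 5 reports the best mAP around 1.0-1.5
total = det_loss + lam * (da_loss_cnn + da_loss_enc)  # Eq. (10)
total.backward()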

4 Experiments

Following [SFA], we train the model with labeled source data and unlabeled target data and test on the target data. We conduct extensive experiments on three CDOD scenarios. The detection results are evaluated with mean Average Precision (mAP) under an IoU threshold of 0.5.

4.1 Experimental Setup

Datasets: The Cityscapes dataset was collected for the understanding of road and street scenes. It comprises 2,975 and 500 images for training and validation, respectively, and contains eight categories: person, rider, car, truck, bus, train, motorbike, and bicycle. The FoggyCityscapes [FoggyCityscapes] dataset is the foggy version of Cityscapes, generated using the depth information provided by Cityscapes; thus, it shares the annotations of Cityscapes. It contains three levels of foggy weather, and in our experiments we choose the worst level (i.e., 0.02). The Sim10K [SIM10K] dataset is a synthetic dataset rendered by the gaming engine Grand Theft Auto V (GTAV). It contains 10,000 images with bounding boxes of the "car" category. The BDD100K [bdd100k] dataset is a large-scale autonomous driving dataset containing 100k images with six types of weather, six different scenes, and three categories for the time of day. We extract the daytime subset, resulting in 36,728 training and 5,258 validation images.

Following existing works [DA-Faster-RCNN, DA-DETR], we evaluate our method on three benchmark settings:

  • Weather Adaptation: We take Cityscapes as the source domain and FoggyCityscapes as the target domain. The model is trained on the train sets of Cityscapes and FoggyCityscapes and evaluated on the validation split of FoggyCityscapes.

  • Syn2Real: We explore the adaptation from Sim10K to Cityscapes. We train the model using all the images of Sim10K and the train split of Cityscapes, and report the mAP on the validation split of Cityscapes for the "car" category.

  • Scene Adaptation: We use Cityscapes as the source domain dataset and BDD100K, which contains distinct scenes, as a large unlabeled target domain dataset. We evaluate the model on the validation set of BDD100K.

Implementation Details: Following the default setting in SFA [SFA], we adopt Deformable DETR [Deformable-DETR] as the base detector, which contains a ResNet-50 [Resnet] backbone pre-trained on ImageNet [imagenet], six transformer encoder layers, six transformer decoder layers, and multiple prediction heads. We adopt the Adam [adam] optimizer to update the parameters. For Cityscapes to FoggyCityscapes, we first train the model at the initial learning rate and then decay the learning rate for the remaining epochs, with the trade-off parameter λ set to 1.0 (see Table 5). For Sim10K to Cityscapes and Cityscapes to BDD100K, the initial learning rate and the trade-off parameter are set separately for each task. We pre-train the models on source data to obtain a reliable CAM. All experiments are conducted on four V100 GPUs, with each GPU holding the same number of source and target images per batch. We implement our method with the PyTorch deep learning framework. The source code of our method will be released soon.

4.2 Results

Method Detector AP on Car
DA-Faster [DA-Faster-RCNN] Faster RCNN 39.0
SCDA [SCDA] 43.0
SWDA [SWDA] 40.1
MAF [MAF] 41.1
HTCN [HTCN] 42.5
SAP [SAP] 44.9
UMT [UMT] 43.1
ViSGA [ViSGA] 49.3
EPM [EPM] FCOS 49.0
KTNet [KTNet] 50.7
SCAN [SCAN] 52.6
SSAL [SSAL] 51.8
Deformable DETR [Deformable-DETR](Source) Deformable DETR 47.4
SFA [SFA] 52.6
SSTA (Ours) 57.7
Table 2: Average precisions (%) of different methods on Sim10K → Cityscapes.
Methods Detector person rider car truck bus mcycle bicycle mAP
Faster R-CNN (Source) Faster RCNN 28.8 25.4 44.1 17.9 16.1 13.9 22.4 24.1
DA-Faster [DA-Faster-RCNN] 28.9 27.4 44.2 19.1 18.0 14.2 22.4 24.9
SWDA [SWDA] 29.5 29.9 44.8 20.2 20.7 15.2 23.1 26.2
SCDA [SCDA] 29.3 29.2 44.4 20.3 19.6 14.8 23.2 25.8
ECR [ECR] 32.8 29.3 45.8 22.7 20.6 14.9 25.5 27.4
FCOS [FCOS] (Source) FCOS 38.6 24.8 54.5 17.2 16.3 15.0 18.3 26.4
EPM [EPM] 39.6 26.8 55.8 18.8 19.1 14.5 20.1 27.8
Deformable DETR [Deformable-DETR] (Source) Deformable DETR 38.4 27.1 56.1 14.6 12.3 16.3 20.7 26.5
SFA [SFA] 40.2 27.6 57.5 19.1 23.4 15.4 19.2 28.9
SSTA (Ours) 39.4 31.9 59.4 16.3 17.7 15.3 26.2 29.5
Table 3: Average precisions (%) of different methods on Cityscapes → BDD100K.
Method TA SpaTA SemTA mAP (%) Gain
Deformable DETR [Deformable-DETR] (Source) - - - 28.1 -
Proposed ✓ - - 41.3 +13.2
Proposed - ✓ - 42.5 +14.4
Proposed - - ✓ 43.9 +15.8
SSTA - ✓ ✓ 44.9 +16.8
Table 4: Ablation studies of SSTA on Cityscapes → FoggyCityscapes. TA indicates the vanilla token alignment.
λ 0.0 0.1 0.5 1.0 1.5 2.0
SSTA 28.1 42.1 44.3 44.9 44.9 44.6
Table 5: Average precisions (%) w.r.t. different values of λ on Cityscapes → FoggyCityscapes.

We conduct extensive experiments and validate the effectiveness of our method by comparing it with various state-of-the-art CDOD methods, mainly of three kinds: 1) two-stage methods based on the Faster RCNN detector, 2) one-stage methods based on the FCOS detector, and 3) methods based on Deformable DETR. For all the methods, we report the results from the original papers. To validate the effectiveness of our proposed method, we also report the results of the Source model, which is trained only on the source domain and directly evaluated on the target domain.

Weather Adaptation (Cityscapes → FoggyCityscapes): We show the adaptation results in Table 1. We can observe that our proposed method outperforms the previous state-of-the-art approaches by a large margin, reaching 44.9% mAP. Specifically, Deformable DETR (Source) achieves 28.1% mAP, which shows that Deformable DETR has decent generalization but still suffers from the distribution discrepancy across domains. Both SFA [SFA] and our SSTA improve over the Source baseline; moreover, our SSTA improves by 3.6% mAP over the counterpart SFA [SFA]. This demonstrates that leveraging the intrinsic cross-attention to conduct spatial-aware and semantic-aware token alignment can effectively improve the generalization ability of the detection transformer on the target domain.

Syn2Real (Sim10K → Cityscapes): The results of synthetic-to-real adaptation are presented in Table 2. Our proposed SSTA reaches the highest mAP (57.7%), exceeding all compared state-of-the-art methods, including the two-stage, one-stage, and DETR-based works, by a large margin, i.e., 5.1% mAP over both the best-performing one-stage detector SCAN [SCAN] and the DETR counterpart SFA [SFA]. These results verify the effectiveness of our SSTA.

Scene Adaptation (Cityscapes → BDD100K): The quantitative results are shown in Table 3. Our method SSTA achieves a new state-of-the-art result of 29.5% mAP, surpassing the previous works. This again demonstrates the generalization ability of our method.

Ablation Studies: To further verify the effectiveness of our proposed method, we conduct detailed ablation studies by isolating each component of our SSTA. The experimental results are shown in Table 4. In particular, our SpaTA significantly boosts the baseline, leading to a 14.4% mAP improvement over the Source model (28.1%). This implies that the CAM provides sufficient information to discover critical tokens, and emphasizing their contributions to distribution alignment significantly improves the generalization ability of Deformable DETR. Moreover, SemTA also improves the accuracy of Deformable DETR, achieving 43.9% mAP. These improvements mainly come from SemTA considering category information during token alignment and thus avoiding class misalignment. By synergizing SpaTA and SemTA, we obtain 44.9% mAP, which shows that the two modules are complementary to each other.

Parameter Analysis: We also investigate the influence of the trade-off parameter λ, which balances the source detection loss and the domain adaptation loss. Table 5 summarizes the experimental results on Cityscapes → FoggyCityscapes. Note that when λ = 0, the method degenerates to the Source model. According to Table 5, our proposed SSTA consistently improves the generalization ability of Deformable DETR over a wide range of λ, with λ = 1.0 and λ = 1.5 performing best.

5 Conclusion

Detection transformers (e.g., DETR) have shown promising results for object detection when training and test images come from the same domain. However, they usually do not work well on cross-domain problems. In this work, we tackle cross-domain object detection by proposing a novel approach named Spatial-aware and Semantic-aware Token Alignment (SSTA) under the transformer framework. In SSTA, two new modules, i.e., spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA), are developed to guide the token alignment across domains. Promising results on benchmark datasets demonstrate the effectiveness of our method.

Limitation: Although our method outperforms existing cross-domain object detection works, it still faces challenges in detecting objects of rare classes. For example, the “truck” and “train” classes in Table 1 have relatively low AP compared with other classes (e.g., “car”). We conjecture that this is caused by the label shift between the source and target domains. In the future, we will study how to improve the detection performance of our SSTA for these classes.

References