AFAN: Augmented Feature Alignment Network for Cross-Domain Object Detection

by Hongsong Wang, et al.

Unsupervised domain adaptation for object detection is a challenging problem with many real-world applications. Unfortunately, it has received much less attention than supervised object detection. Models that try to address this task tend to suffer from a shortage of annotated training samples. Moreover, existing methods of feature alignments are not sufficient to learn domain-invariant representations. To address these limitations, we propose a novel augmented feature alignment network (AFAN) which integrates intermediate domain image generation and domain-adversarial training into a unified framework. An intermediate domain image generator is proposed to enhance feature alignments by domain-adversarial training with automatically generated soft domain labels. The synthetic intermediate domain images progressively bridge the domain divergence and augment the annotated source domain training data. A feature pyramid alignment is designed and the corresponding feature discriminator is used to align multi-scale convolutional features of different semantic levels. Last but not least, we introduce a region feature alignment and an instance discriminator to learn domain-invariant features for object proposals. Our approach significantly outperforms the state-of-the-art methods on standard benchmarks for both similar and dissimilar domain adaptations. Further extensive experiments verify the effectiveness of each component and demonstrate that the proposed network can learn domain-invariant representations.


I Introduction

Object detection is a fundamental problem in computer vision, and has been extensively studied for decades. Over the past several years, deep learning has achieved remarkable success in this area [girshick2014rich, girshick2015fast, ren2015faster]. To advance the state of the art on large-scale benchmarks [deng2009imagenet, lin2014microsoft], most deep learning based approaches require a tremendous amount of annotated training data. In real-world applications, manually annotating such large-scale data is time-consuming and labor-intensive. Moreover, it is challenging to deploy a trained deep learning model to new environments, even if the task is the same, because performance is negatively impacted by changes in conditions such as image sensors, viewpoints, weather and time of day.

There is a non-negligible discrepancy between the distribution of data from the target domain and that from the source domain. Unsupervised domain adaptation [pan2009survey, pan2010domain, huang2007correcting] addresses this problem and helps adequately improve the learning in the target domain. Recently, adversarial domain adaptation [ganin2014unsupervised, ganin2016domain, tzeng2017adversarial], which aligns feature distributions between the two domains through adversarial training, has become very popular. Although domain adaptation has achieved great progress in computer vision, most of the studies are restricted to image classification [gong2012geodesic, ghifary2016deep] and semantic segmentation [tsai2018learning, zhang2017curriculum, sankaranarayanan2018learning]. How to effectively deploy an object detector in a new environment still remains an open problem.

Fig. 1: Outline of the proposed framework. During training, the network behaves like a Siamese neural network which takes two sets of images from the source and target domains, respectively. The test process is a single path which is the same as that of object detection.

Deep learning based domain adaptive object detection has recently begun to receive attention. Works on this topic can be roughly divided into two categories: feature distribution alignment based methods [chen2018domain, saito2019strong, kim2019diversify, zhu2019adapting] and self-training based methods that use pseudo labels [Kim2019SelfTrainingAA, khodabandeh2019robust]. The former approach learns domain-invariant features through adversarial training which uses discriminator networks to predict domain labels for images from the two domains. However, the alignments of convolutional features as well as region features are not sufficient for object detection due to the limited number of annotated images from the source domain. As for the latter approach, the pseudo labels are obtained from the detection model trained only in the source domain. This method requires sophisticated and robust training strategies to overcome the adverse effects of severely noisy labels.

In order to alleviate the above shortcomings, we address cross-domain object detection from a new perspective by introducing an intermediate domain. The intermediate domain is considered as the interpolating path between the source and the target domains. We aim to propose an effective framework which takes advantage of both feature distribution alignment and pseudo image generation by combining intermediate domain image generation and domain-adversarial training.

To this end, we propose a novel augmented feature alignment network (AFAN). An outline of AFAN is illustrated in Figure 1. We introduce an intermediate domain image generator which produces pseudo intermediate domain images. This generator augments the annotated data in the source domain with unlabeled images in the target domain. We theoretically prove that intermediate domain images decrease the divergence in distributions between the two domains. Moreover, we propose a feature pyramid alignment in order to transfer the semantics and reduce the divergence of both high-level and low-level features between different domains. A feature discriminator is designed to align the distributions of convolutional feature maps of multiple scales. Finally, we present an instance discriminator which tackles domain shifts for object region proposals. Both the feature discriminator and the instance discriminator are incorporated in the object detection framework through domain-adversarial training. The soft ground-truth domain labels of intermediate domain images enhance the distribution alignment, and the whole network learns domain-invariant representations that cannot be distinguished by either discriminator.

In summary, the main contributions of this paper are as follows:

  • We propose an augmented feature alignment network for cross-domain object detection, which integrates intermediate domain image generation and domain-adversarial training into a single framework.

  • We present an intermediate domain image generator which creates intermediate domain images between the source and target domains, and theoretically demonstrate that these images reduce the divergence between the two domains.

  • We propose the feature pyramid alignment which performs a unified domain adaptation for both high-level and low-level convolutional feature maps.

  • Our approach achieves new state-of-the-art performance on various benchmarks of both similar and dissimilar domain adaptations.

The remainder of the paper is organized as follows. Section II reviews related work. Section III details the structure of the proposed method as well as the training method. Comprehensive experimental results, visualizations and ablation studies are presented in Section IV. The conclusions are finally drawn in Section V.

II Related Work

We briefly review approaches mostly related to ours from three aspects: general object detection, unsupervised domain adaptation and domain adaptation for object detection.

Object Detection. Object detection aims to locate and classify the object instances in an input image. Current state-of-the-art approaches can be roughly divided into two categories: one-stage detectors and two-stage detectors. Two-stage detectors first predict objectness proposals and then refine the locations and classify the object categories in the second stage. R-CNN [girshick2014rich], Fast R-CNN [girshick2015fast] and Faster R-CNN [ren2015faster] are milestone works for this pipeline. Faster R-CNN presents a region proposal network (RPN) to generate region proposals, and jointly trains the RPN with the detection network. Influential extensions include Faster R-CNN with Feature Pyramid Network (FPN) [lin2017feature] and Mask R-CNN [he2017mask]. Other lines of work on object detection include learning rotation-invariant features [cheng2018learning] and self-supervised feature augmentation [pan2020self].

In contrast, one-stage object detectors directly regress the candidate object boxes and classify the object categories in one step. Many approaches, such as YOLOv2 [redmon2017yolo9000], SSD [liu2016ssd] and RetinaNet [lin2017focal], use anchor boxes to enumerate possible locations, scales and aspect ratios of objects. Other methods follow an anchor-free pipeline which directly learns object possibilities and bounding box coordinates. Representative works in this category include YOLO [redmon2016you], CSP [liu2019high], FoveaBox [kong2020foveabox] and FCOS [tian2019fcos]. Different from other object detectors, which are fine-tuned from off-the-shelf networks, ScratchDet [zhu2019scratchdet] robustly trains one-stage object detectors from scratch.

We follow the two-stage pipeline and choose Faster R-CNN as the base detector. Discriminator networks are integrated into the detector, and domain-adversarial training is utilized to learn domain-invariant representations.

Unsupervised Domain Adaptation. Domain adaptation aims at learning a model that reduces the distribution shift between an unlabeled target domain and a labeled source domain [pan2009survey]. Traditional methods bridge this gap by learning a common feature representation across domains [pan2010domain, li2018domain] or estimating instance weights to reduce sample selection bias [huang2007correcting]. Deep domain adaptation methods embed adaptation modules in deep architectures to learn transferable representations. Yosinski et al. [yosinski2014transferable] discuss the transferability of different layers and demonstrate that the transferability of features decreases as the distance between domains increases. Long et al. [long2015learning, long2018transferable] present the deep adaptation network to learn transferable features by enhancing feature transferability in task-specific layers and matching embedded features in reproducing kernel Hilbert spaces. Lu et al. [lu2018embarrassingly] use the class mean to learn class-specific linear projections for domain adaptation without explicitly modeling the discrepancy between domains.

Inspired by generative adversarial networks (GAN) [goodfellow2014generative], recent adversarial domain adaptation methods [ganin2016domain, tzeng2017adversarial] utilize a domain discriminator to distinguish images of the source domain from those of the target domain. This domain discriminator is jointly trained with deep networks which learn representations that are indistinguishable by the discriminator. The adversarial adaptation methods are divided into three types: gradient reversal-based [ganin2014unsupervised], minimax optimization-based [tzeng2015simultaneous], and generative adversarial net-based [hoffman2018cycada]. Ganin et al. [ganin2014unsupervised] demonstrate that the domain adaptation behavior can be achieved by the gradient reversal layer (GRL). Tzeng et al. [tzeng2015simultaneous] maximize domain confusion based on marginal distributions and transfer correlations between classes from the source domain to the target domain. Hoffman et al. [hoffman2018cycada] propose cycle-consistent adversarial domain adaptation (CyCADA) to adapt representations at both the pixel level and the feature level without the use of aligned image pairs.

The data augmentation method mixup [zhang2017mixup] trains a neural network on convex combinations of pairs of examples and their labels for image classification. MixMatch [berthelot2019mixmatch] mixes both labeled and unlabeled data with label guesses for semi-supervised classification. Domain mixup [xu2020adversarial] mixes source and target domain images in the adversarial domain adaptation framework for unsupervised domain adaptive classification. However, these methods merely focus on image classification, and mixup-based approaches for unsupervised object detection have not yet been explored.

The domain adaptation approaches in computer vision mainly focus on image classification [gong2012geodesic, ghifary2016deep] and semantic segmentation [tsai2018learning, zhang2017curriculum, sankaranarayanan2018learning]. Our work addresses adversarial domain adaptation for object detection which requires aligning representations of the same object between diverse domains and locating all the related objects in a new domain image.

Domain Adaptation for Object Detection. Unsupervised domain adaptation for object detection has recently gained interest [chen2018domain, mirrashed2013domain, zhu2019adapting, inoue2018cross, saito2019strong, cai2019exploring, wang2019towards, roychowdhury2019automatic, khodabandeh2019robust, Kim2019SelfTrainingAA]. Faster R-CNN has been adapted for domain adaptation by aligning the distributions of the last convolutional feature map and the region features [chen2018domain]. Strong alignment of local features from lower layers and weak alignment of global features from higher layers are explored for convolutional features [saito2019strong]. Discriminative regions are also mined with a grouping strategy for better alignment of region features [zhu2019adapting]. An image-to-image translation via GAN is used to generate images shifted from the source domain to the target domain for pixel-level adaptation [kim2019diversify]. The mean teacher paradigm is applied and the object relation between image regions is used to bridge the domain gap [cai2019exploring]. The attention-based region transfer and prototype-based semantic alignment are proposed to achieve coarse-to-fine feature adaptation [zheng2020cross]. The category-level domain alignment is derived based on graph-based information propagation among features of region proposals [xu2020cross]. The hierarchical transferability calibration network is introduced to harmonize transferability and discriminability in the context of adversarial adaptation [chen2020harmonizing]. A plug-and-play categorical regularization component is presented to match crucial image regions and important instances across domains [xu2020exploring]. Multiple adversarial domain classifiers are introduced to align the distributions of both local-level features and global-level features [xie2019multi]. However, current feature alignment methods are still insufficient as different discriminators are applied to different convolutional features. To the best of our knowledge, adversarial domain adaptation with a feature pyramid has not been investigated for cross-domain object detection. The limited amount of annotated training data from the source domain also impedes the learning of domain-invariant representations.

There are also self-training approaches which use trained models in the labeled source domain to generate pseudo labels for unlabeled images from the target domain. However, it is tricky to design a robust learning method to reduce false negatives and false positives. In contrast, we leverage pseudo images for feature alignment which allows us to bypass the problems of previous approaches.

III Method

Cross-domain object detection aims to guide an object detection model trained on labeled data from a specific source domain to achieve good performance on data from another target domain. The training data consists of labeled data from the source domain and unlabeled data from the target domain. The detected object classes are assumed to be contained in both domains. Since distributions of image data and object regions from separate domains are different, domain adaptation techniques are required to reduce the discrepancies between domains from the perspectives of images as well as object regions.

Fig. 2: Pipeline of the proposed AFAN. It is an end-to-end trainable network which consists of four modules: pseudo image generation, feature pyramid alignment, region feature alignment and object detection head. Without loss of generality, the pseudo images are produced by a relatively simple method.

III-A Framework Overview

We propose a novel augmented feature alignment network (AFAN) which is able to dramatically reduce the divergence across different domains. The pipeline of AFAN is shown in Figure 2. We choose the Faster R-CNN [ren2015faster] pipeline and adopt deep residual networks [he2016deep] as the backbone. We introduce the intermediate domain to bridge the source and the target domains. An intermediate domain image generator is designed to augment samples in both the source and target domains and enhance feature distribution alignments. A feature pyramid structure is built to transfer information between high- and low-level features. Two diverse discriminator networks, i.e., feature discriminator and instance discriminator, are designed to align distributions of image and object features across the two domains. Both discriminator networks and domain classifiers are incorporated into existing object detection networks. The domain adaptation is accomplished by the proposed discriminator networks through adversarial training.

In the training phase, the proposed network consists of two identical branches which process images from the source and target domains, respectively. As the two branches share parameters, there exists only one branch during testing. The Siamese network structure during training avoids imbalance in the training data between the two domains and makes the sampling strategy for each dataset more flexible.

III-B Intermediate Domain Image Generator

Instead of directly aligning the source and target domains, we introduce the intermediate domain which gradually connects the two domains. The intermediate domain is considered as the interpolation points between the source and target domains. The intermediate domain image generator (IDIG) is an image-to-image module, and both the inputs and outputs are one set of annotated images and another set of unlabeled images. Inspired by mixup in image recognition and semi-supervised classification [zhang2017mixup, wang2019semi], we propose a simple but effective method that generates pseudo training images by interpolating between labeled and unlabeled images. It should be noted that although recent works [xu2020adversarial, mao2019virtual, yan2020improve] also use the mixup strategy to generate pseudo images for unsupervised domain adaptation or adversarial domain adaptation, they merely focus on the task of image classification, for which the input image is small and contains a single object. In contrast, we address cross-domain object detection and aim to better align features between diverse domains at different levels, ranging from low-level and high-level convolutional features to regional features.

During training, the IDIG receives two mini-batches of training images from the source and target domains, respectively. The intermediate domain images are generated as

$$\tilde{x}^s = (1-\lambda)\,x^s + \lambda\,x^t, \qquad \tilde{x}^t = (1-\lambda)\,x^t + \lambda\,x^s, \tag{1}$$

where $x^s$ is a labeled image from the source domain, $x^t$ is an unlabeled image from the target domain, $\tilde{x}^s$ and $\tilde{x}^t$ are the corresponding pseudo labeled and unlabeled intermediate domain images from the two domains, respectively, and $\lambda$ is a random variable. As the input images possess various sizes and aspect ratios, the second image in each sum is resized to the size of the first one before the addition. For each mini-batch, $\lambda$ is sampled from the uniform distribution $U(0, \lambda_{\max})$, where $\lambda_{\max}$ is the upper limit of $\lambda$, and $\lambda_{\max} \le 0.5$.
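The interpolation step can be illustrated with a minimal, dependency-free sketch. It assumes a mixup-style convex combination with a single coefficient drawn uniformly from `[0, lam_max]`, grayscale images stored as nested lists, and nearest-neighbour resizing; the function names and the choice of resizing method are hypothetical and are not taken from the authors' implementation.

```python
import random


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image stored as a list of rows."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def mix_images(first, second, lam):
    """Return (1 - lam) * first + lam * second, resizing `second` to `first`."""
    h, w = len(first), len(first[0])
    second = resize_nearest(second, h, w)
    return [
        [(1.0 - lam) * first[r][c] + lam * second[r][c] for c in range(w)]
        for r in range(h)
    ]


def generate_intermediate(x_s, x_t, lam_max=0.5, rng=random):
    """Draw one coefficient (assumed uniform) and build both pseudo images."""
    lam = rng.uniform(0.0, lam_max)
    x_s_tilde = mix_images(x_s, x_t, lam)  # keeps the source image size
    x_t_tilde = mix_images(x_t, x_s, lam)  # keeps the target image size
    return x_s_tilde, x_t_tilde, lam
```

Because the pseudo source image keeps the size of the original source image, the source bounding-box annotations can be reused unchanged, which is why the first operand determines the output size in this sketch.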

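The IDIG reduces the divergence of distributions between the two domains at the image level. We prove this hypothesis using the generalized energy distance [szekely2013energy] between the distributions of random vectors. Assume that $X$ and $Y$ are random variables of images from the source and target domains with cumulative distribution functions $F$ and $G$, respectively. The generalized energy distance between $F$ and $G$ is

$$D^2(F, G) = 2\,\mathbb{E}\|X - Y\|^{\alpha} - \mathbb{E}\|X - X'\|^{\alpha} - \mathbb{E}\|Y - Y'\|^{\alpha}, \tag{2}$$

where $X$ and $X'$ are two independent and identically distributed (iid) random variables from $F$, $Y$ and $Y'$ are iid random variables from $G$, and $0 < \alpha \le 2$. From Proposition 2 in [szekely2013energy], when $\alpha = 2$, the distance is reduced as

$$D^2(F, G) = 2\,\|\mathbb{E}[X] - \mathbb{E}[Y]\|^2. \tag{3}$$

Let $\tilde{X}$ and $\tilde{Y}$ be the random variables of pseudo intermediate domain images from the source and the target domains, respectively, and $\tilde{F}$ and $\tilde{G}$ be the corresponding cumulative distribution functions. The generalized energy distance between $\tilde{F}$ and $\tilde{G}$ is

$$D^2(\tilde{F}, \tilde{G}) = 2\,\|\mathbb{E}[\tilde{X}] - \mathbb{E}[\tilde{Y}]\|^2. \tag{4}$$

In our task, $\tilde{X} = (1-\lambda)X + \lambda Y$ and $\tilde{Y} = (1-\lambda)Y + \lambda X$, where the mixing coefficient $\lambda$ is sampled independently of the images. Thus, the expectation is computed as

$$\mathbb{E}[\tilde{X}] = (1-\mu)\,\mathbb{E}[X] + \mu\,\mathbb{E}[Y], \tag{5}$$

where $\mu$ is the expectation of $\lambda$. A similar formula can be obtained for $\mathbb{E}[\tilde{Y}]$.

Therefore, $D^2(\tilde{F}, \tilde{G})$ can be written as

$$D^2(\tilde{F}, \tilde{G}) = 2\,(1-2\mu)^2\,\|\mathbb{E}[X] - \mathbb{E}[Y]\|^2 = (1-2\mu)^2\,D^2(F, G). \tag{6}$$

Since $\lambda$ is sampled from $U(0, \lambda_{\max})$, $\mu = \lambda_{\max}/2$. When $\lambda_{\max} = 0.5$, $D^2(\tilde{F}, \tilde{G}) = \tfrac{1}{4} D^2(F, G)$, which means that the divergence of the pseudo intermediate domain images between the two domains is much smaller than that of the original images. It should be noted that it is necessary to set an upper bound (e.g., 0.5) on $\lambda$ during sampling. It is inappropriate to set $\lambda > 0.5$, as the labels of the pseudo labeled image would be unreliable with noise from unlabeled images dominating the image content. In addition, the pseudo labeled image $\tilde{x}^s$ would become more similar to the target image $x^t$ than to the source image $x^s$, which contradicts the evidence that $\tilde{x}^s$ comes from the source domain. The same analysis applies to $\tilde{x}^t$. As a result, the IDIG separates the large domain gap into small ones, and the augmented images overcome the lack of annotated samples in the source domain. Together with adversarial domain adaptation, the IDIG also enhances the feature distribution alignment, which is discussed below.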
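The shrinking of the divergence can be checked numerically in the special case of exponent 2, where the generalized energy distance reduces to twice the squared distance between the means (a standard property of the energy distance). The following is a minimal sketch, assuming mixup-style interpolation with a fixed coefficient `lam` and toy two-dimensional samples standing in for images; all names and numbers are illustrative.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def energy_distance_alpha2(xs, ys):
    """Plug-in estimate 2 * ||mean(xs) - mean(ys)||^2 (exponent-2 case)."""
    mx, my = mean_vector(xs), mean_vector(ys)
    return 2.0 * sum((a - b) ** 2 for a, b in zip(mx, my))


def interpolate(firsts, seconds, lam):
    """Pairwise (1 - lam) * first + lam * second."""
    return [
        [(1.0 - lam) * a + lam * b for a, b in zip(f, s)]
        for f, s in zip(firsts, seconds)
    ]


source = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target = [[10.0, 0.0], [12.0, -2.0], [14.0, -4.0]]
lam = 0.25

d_orig = energy_distance_alpha2(source, target)
d_mix = energy_distance_alpha2(
    interpolate(source, target, lam),  # pseudo source samples
    interpolate(target, source, lam),  # pseudo target samples
)
ratio = d_mix / d_orig  # equals (1 - 2 * lam) ** 2 = 0.25 for these inputs
```

With a fixed coefficient the mean difference is scaled by `1 - 2 * lam`, so the distance is scaled by its square; a coefficient of 0.5 would collapse the two pseudo domains onto the same mean.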

It should be noted that the mixup inspired approach is only an example of our IDIG module, and we have provided a theoretical explanation about the benefit of domain adaptation from the energy function perspective. We believe that other image-to-image approaches (e.g., [wang2018perceptual]) are also feasible, and investigations about the implementations of the IDIG are beyond the scope of this paper.

III-C Feature Pyramid Alignment

Fig. 3: Structure of feature pyramid alignment. The symbol D denotes feature discriminator, which is shared across different convolutional feature maps, and GAP denotes global average pooling.

Aligning the features of deep convolutional neural networks (CNN) between the source and target domains is challenging as there are many different layers in a deep CNN. Some works [chen2018domain] only align the features of the last convolutional layer, and do not fully bridge the gap across the two domains. Other works [saito2019strong] use various strategies to align the higher and lower layers. However, such a model is cumbersome as it involves multiple discriminator networks, and it is often difficult to determine whether a CNN layer is high or low.

Inspired by feature pyramid networks (FPN) [lin2017feature], we propose feature pyramid alignment (FPA) which can align multi-scale features of different layers with only one discriminator network. The structure of FPA is illustrated in Figure 3. The bottom-up pathway generates semantically rich features in the higher layers, and the top-down pathway transfers the semantics from the high layers to the low layers. Due to the lateral connections, the multi-scale convolutional features are transformed to have the same feature dimension (number of channels). Then, a single convolutional discriminator network is used to classify the domain categories, where 0 denotes the source domain and 1 denotes the target domain. The discriminator network is jointly optimized with the object detection networks. While the loss for object detection on labeled data from the source domain is minimized, the loss of the domain classifier for data from both domains should be maximized in order to learn domain-invariant features. During implementation, we adopt the gradient reversal layer (GRL) [ganin2014unsupervised], which leaves the input unchanged during forward propagation and reverses the gradient during back-propagation.
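The behavior of the gradient reversal layer can be illustrated with a framework-free sketch. In practice the layer would be written as a custom operation in an automatic differentiation library; the class below, including its name and the optional `scale` factor, is a hypothetical stand-in that only shows the forward and backward rules.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # `scale` is an assumed knob for weighting the reversed gradient.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain discriminator.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the backbone is negated, so a step that
        # lowers the discriminator loss raises it with respect to the features.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal()
forward_out = grl.forward([1.0, -2.0])
backward_out = grl.backward([0.5, -1.0])
```

Placing this layer between the backbone features and the discriminator lets a single minimization step train the discriminator to separate domains while pushing the backbone toward features it cannot separate.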

The original ground-truths of domain categories are hard binary labels. However, for the pseudo intermediate domain images generated by the IDIG (see Section III-B), the ground-truths of domain categories are soft labels. The soft labels denote probability distributions between the two domains, which can also be regarded as the weights of images from different domains in the process of intermediate domain image generation. For intermediate domain images generated from the source domain, the domain category label is a two-dimensional vector $(1-\lambda, \lambda)$, and for those from the target domain, this label is $(\lambda, 1-\lambda)$, where the mixing coefficient $\lambda$ is different for each mini-batch during training.

Let $\Phi$ denote the backbone of FPN, and $D_f$ denote the feature discriminator which predicts the probability of the target domain category with respect to convolutional features. The feature discriminator comprises several convolutional layers intertwined with batch normalization layers. A binary domain classifier is placed on top of the discriminator. The unsupervised loss for feature level alignment is

$$\mathcal{L}_{feat} = -\sum_{\tilde{x}} \Big[ d \log D_f\big(\Phi(\tilde{x})\big) + (1-d) \log\big(1 - D_f(\Phi(\tilde{x}))\big) \Big], \tag{7}$$

where $\tilde{x}$ is the pseudo intermediate domain image from a particular domain, and $d$ is the second component of the soft label for domain categories.
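A minimal sketch of a discriminator loss with soft domain labels is given below. It assumes that the soft label of a pseudo image is its pair of mixing weights (source weight, target weight) and that the loss is a binary cross-entropy between the predicted target-domain probability and the second label component; both the function names and this exact loss form are assumptions made for illustration.

```python
import math


def soft_domain_label(lam, from_source):
    """Assumed soft label (source weight, target weight) of a pseudo image."""
    return (1.0 - lam, lam) if from_source else (lam, 1.0 - lam)


def soft_bce(p_target, d, eps=1e-12):
    """Binary cross-entropy between predicted probability and soft target d."""
    p = min(max(p_target, eps), 1.0 - eps)  # clamp to keep the logs finite
    return -(d * math.log(p) + (1.0 - d) * math.log(1.0 - p))


label = soft_domain_label(0.2, from_source=True)
d = label[1]                  # second component: weight of the target domain
loss_good = soft_bce(0.2, d)  # prediction matches the soft target
loss_bad = soft_bce(0.8, d)   # prediction far from the soft target
```

With hard labels the target `d` is 0 or 1; with soft labels the minimum of the loss is reached when the discriminator outputs exactly the mixing weight, so the discriminator is asked to estimate how far an image lies between the two domains.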

III-D Region Feature Alignment

Generating region proposals is an important step for current state-of-the-art object detection approaches. As the input image contains multiple objects, each region proposal accounts for one possible object instance. For cross-domain object detection, adaptation at the proposal level can be attained by aligning the features of region proposals between the two domains. We use RoIAlign [he2017mask] to extract features from each proposal, and transform these features with fully connected layers to obtain a 1024-dimensional vector.

As illustrated in Figure 2, an instance discriminator and an instance domain classifier are utilized for adversarial domain adaptation. The instance discriminator network comprises two fully connected layers, and predicts domain labels for individual object instances. The GRL is also used during back-propagation to maximize the loss of the instance domain classifier. Let $r$ denote the features of a region proposal, and $D_{ins}$ denote the instance discriminator which predicts the probability of the target domain category. We assume that the domain label of a region proposal is the same as that of the image $\tilde{x}$. The loss for proposal level alignment is

$$\mathcal{L}_{ins} = -\sum_{r} \Big[ d \log D_{ins}(r) + (1-d) \log\big(1 - D_{ins}(r)\big) \Big], \tag{8}$$

where, with $\lambda$ the mixing coefficient of the IDIG, $d$ is $\lambda$ if the pseudo image is from the source domain and $1-\lambda$ otherwise.

The proposal level adaptation module is applied to intermediate domain images generated by the intermediate domain image generator. Unlike previous approaches [chen2018domain, zhu2019adapting] which use 0 or 1 as the ground-truth domain labels of real images, our instance level domain classifier exploits soft domain labels as the ground-truths of pseudo images. One of the advantages of intermediate domain images is that the corresponding soft domain labels augment the source domain and bridge the domain divergence in a progressive manner.

III-E Training

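The object detection loss $\mathcal{L}_{det}$ consists of the localization loss and the classification loss. Combined with the two types of discriminative losses, the final training loss of the AFAN is

$$\mathcal{L} = \mathcal{L}_{det} + \gamma_1 \mathcal{L}_{feat} + \gamma_2 \mathcal{L}_{ins}, \tag{9}$$

where $\gamma_1$ and $\gamma_2$ are weight parameters to balance the detection loss and the domain adaptation losses.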

During training, the inputs contain two sets of images: labeled images from the source domain and unlabeled images from the target domain. After intermediate domain image generation, the network behaves like a Siamese neural network and computes responses for the two sets of images. During testing, the domain adaptation modules can be discarded and the network takes one image as input.

Details of the training process are summarized in Algorithm 1. In our implementation, an additional parameter $p_0$ is introduced to combine both the original images and the pseudo images for training. The explanation of $\lambda_{\max}$ has been discussed in Section III-B. When $\lambda = 0$, the generated pseudo images are the same as the original images.

Algorithm 1 Training process of the AFAN.
1: Input: A batch of labeled source images $\{x^s\}$ from the source domain, a batch of unlabeled target images $\{x^t\}$.
2: Draw $\lambda$ and $p$ from $U(0, \lambda_{\max})$ and $U(0, 1)$, respectively.
3: if $p > p_0$ then
4:     $\lambda \leftarrow 0$.
5: end if
6: Generate intermediate source domain images $\{\tilde{x}^s\}$ and intermediate target domain images $\{\tilde{x}^t\}$ using Equation (1).
7: Perform the forward pass of deep networks by feeding $\{\tilde{x}^s\}$ and $\{\tilde{x}^t\}$.
8: Calculate the feature level and proposal level discriminator losses with regard to $\lambda$.
9: Perform the backward pass by minimizing the loss defined in Equation (9).
10: Output: The updated network.
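The sampling logic at the start of each training iteration, together with the weighted combination of losses, can be sketched as follows. The symbols lost from the algorithm listing are filled in by assumption: the coefficient is drawn uniformly from `[0, lam_max]`, a second uniform draw `p` is compared with a hypothetical threshold `p0`, and the direction of that comparison as well as the weight names `w_feat` and `w_ins` are guesses made for illustration only.

```python
import random


def draw_mixing_coefficient(lam_max, p0, rng=random):
    """Return the coefficient for one mini-batch; 0.0 means original images."""
    lam = rng.uniform(0.0, lam_max)
    p = rng.random()
    if p > p0:  # assumed rule: with probability 1 - p0, skip the mixing
        lam = 0.0
    return lam


def total_loss(det_loss, feat_loss, ins_loss, w_feat=1.0, w_ins=1.0):
    """Detection loss plus weighted feature- and instance-level losses."""
    return det_loss + w_feat * feat_loss + w_ins * ins_loss


rng = random.Random(0)
always_mixed = [draw_mixing_coefficient(0.5, 1.0, rng) for _ in range(5)]
never_mixed = [draw_mixing_coefficient(0.5, 0.0, rng) for _ in range(5)]
```

Under this reading, the threshold controls the fraction of iterations that use pseudo images, and a coefficient of zero falls back to training on the original source and target images with hard domain labels.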

IV Experiments

The proposed approach is evaluated under different experimental settings, and compared with previous state-of-the-art methods. Ablation studies and qualitative experiments are also provided.

IV-A Experimental Setup

The experimental settings of cross-domain object detection can be divided into two types: adaptation between similar domains and adaptation between dissimilar domains.

Adaptation Between Similar Domains. This setting includes adapting normal images to images under different weather and daytime conditions. We use the Cityscapes [cordts2016cityscapes] dataset and Foggy Cityscapes [sakaridis2018semantic] dataset as the source and target domains, respectively. Both datasets have 2,975 images in the training set, and 500 images in the validation set. Foggy Cityscapes is a synthetic foggy dataset and the images simulate fog in real scenes. Following the split of [chen2018domain], we use annotated images from the training set of Cityscapes and only images from the training set of Foggy Cityscapes for training. The results are evaluated on the validation set of Foggy Cityscapes.

As object detection at night is challenging and night images are difficult to annotate, we adapt object detection from day images to night images. We study cross-domain pedestrian detection and evaluate the proposed method on the EuroCity Persons [braun2018eurocity] dataset. EuroCity Persons is a large-scale dataset with over 238,000 person instances manually labeled in over 47,000 images of urban traffic scenes. This dataset has subsets of both day and night, and each subset is split into training and validation sets. For the day subset, the numbers of images in the training set and the validation set are 23,892 and 4,266, respectively. For the night subset, these numbers are 4,222 and 770, respectively. We consider the day subset as the source domain and the night subset as the target domain. The labeled day training set and unlabeled night training set are used for training, and the night validation set is used for validation. To the best of our knowledge, this is one of the first studies on cross-domain pedestrian detection.

Adaptation Between Dissimilar Domains. For this setting, images from the two domains are collected under very different scenarios. For example, the source data include synthetic images captured from video games, while the target data are realistic images. The SIM 10k [johnson2016driving] dataset is treated as the source domain, and the Cityscapes dataset is the target domain. SIM 10k contains 10,000 synthetic images with bounding boxes of cars. For the target domain, only the images in the training set are used for training and the results are evaluated on the Cityscapes validation set.

Two datasets collected by different devices in different environments also constitute very dissimilar domains. We regard the Cityscapes dataset as the source domain and the KITTI dataset [geiger2012we, geiger2013vision] as the target domain. As the categories of the two datasets differ slightly, we classify person sitting and pedestrian as person, van and car as car, tram as train, and cyclist as rider in the KITTI dataset. Our setting differs from that of [chen2018domain], which is limited as it only considers the car category. Since this dataset does not have standard splits, all the training images are used for both training and validation.
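The category merging described above can be sketched as a simple lookup table. This is only an illustration: the label strings are assumed to follow the usual KITTI annotation names, the identity mapping for truck is an assumption based on the truck column in Table IV, and `remap_kitti_labels` is a hypothetical helper rather than part of the released implementation.

```python
# Assumed KITTI label strings mapped to the shared category names.
KITTI_TO_SHARED = {
    "Person_sitting": "person",
    "Pedestrian": "person",
    "Van": "car",
    "Car": "car",
    "Tram": "train",
    "Cyclist": "rider",
    "Truck": "truck",  # assumption: kept as-is, since Table IV reports truck
}


def remap_kitti_labels(labels):
    """Map KITTI class names to shared categories, dropping unmapped classes."""
    return [KITTI_TO_SHARED[name] for name in labels if name in KITTI_TO_SHARED]


print(remap_kitti_labels(["Van", "Tram", "Misc", "Cyclist"]))
```

Classes without a counterpart in the shared label set (for example, a miscellaneous class) are simply dropped in this sketch.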

Compared with realistic photographs, it is more challenging to detect objects in artistic images, which embody a breadth of styles, media, and emotions. We therefore adapt object detection from the real-world Pascal VOC [everingham2010pascal] to the artistic Watercolor [inoue2018cross] dataset. The Watercolor dataset contains images posted by professional and commercial artists, and has six classes in common with Pascal VOC. It includes 2,000 images, with 1,000 images for the training set and 1,000 images for the test set. In accordance with the setup of [inoue2018cross], the training and validation sets of both VOC2007 and VOC2012 are considered as the source domain, and Watercolor is used as the target domain.

Implementation Details. To build a feature pyramid, ResNet-50 [he2016deep] is adopted as the backbone network due to its simplicity and efficiency. FPN [lin2017feature] constructs four levels of features with different spatial scales. The feature discriminator consists of three convolutional layers with a 3×3 convolutional kernel. Unless otherwise specified, the channel numbers of both the input and the output are 256. A global pooling layer is placed before the fully connected layer with two neurons that classifies the domain categories. The instance discriminator consists of two fully connected layers which first reduce the dimension of the region features to 512 and then conduct domain classification. For each image, we select the top 1000 proposals with confidence scores above 0.05 for region feature alignment. The domain classifiers are trained with the binary cross-entropy loss.
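The proposal selection rule for region feature alignment can be sketched as follows. This is a minimal illustration under assumptions: proposals are represented as hypothetical `(box, score)` tuples, "above 0.05" is read as a strict inequality, and `select_proposals` is an illustrative name, not a function from the implementation.

```python
def select_proposals(proposals, max_count=1000, min_score=0.05):
    """Keep proposals scoring above min_score, then the top max_count by score.

    proposals: list of (box, score) tuples (assumed representation).
    """
    kept = [p for p in proposals if p[1] > min_score]
    kept.sort(key=lambda p: p[1], reverse=True)
    return kept[:max_count]


example = [((0, 0, 1, 1), 0.9), ((0, 0, 2, 2), 0.01), ((0, 0, 3, 3), 0.5)]
print(select_proposals(example))
```

Filtering before truncation means an image with few confident proposals contributes fewer than 1000 region features to the instance discriminator.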

The whole network is trained using stochastic gradient descent (SGD) with a momentum of 0.9. The base learning rate is 0.01, the batch size is 16, and the number of training epochs is 120. We use four GPUs for training. During testing, the minimum confidence score threshold is 0.05, and NMS is used as post-processing. For all experiments, we evaluate the detection results using mean average precision (mAP) with an IoU threshold of 0.5.
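The test-time filtering and the IoU criterion used for evaluation can be sketched in plain Python. This is a simplified illustration, not the benchmark code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the NMS overlap threshold `nms_iou=0.5` is an assumed value (the text does not state it), and the greedy NMS shown here is the standard variant.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, min_score=0.05, nms_iou=0.5):
    """Drop low-score detections, then apply greedy NMS.

    detections: list of (box, score); nms_iou is an assumed threshold.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= min_score),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) <= nms_iou for k in kept):
            kept.append((box, score))
    return kept


def is_true_positive(pred_box, gt_box, iou_thresh=0.5):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= iou_thresh
```

In a full evaluation, the matched detections of each class would then be ranked by score to build a precision-recall curve, and the per-class average precisions averaged to obtain mAP.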

Many previous approaches such as [chen2018domain, zheng2020cross, xu2020exploring] adopt VGG16 [simonyan2014very] as the backbone which does not involve the FPN. As the recent CNN architectures (e.g., ResNet) have shown substantial performance improvement over the VGG, we use ResNet-50 with FPN as the backbone, which is more appropriate for practical applications. The implementation is based on Mask-RCNN benchmark [massa2018mrcnn].

IV-B Similar Domains Adaptation


Method AP of pedestrian
FPN [lin2017feature] 73.4
Baseline 73.4
Oracle 84.1
Ours 78.5


TABLE I: Results of cross-domain pedestrian detection on the EuroCity Persons dataset. The domain is transferred from day condition to night condition.


Method person rider car truck bus train motor bicycle mAP
DA-Faster [chen2018domain] 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
DT [inoue2018cross] 25.4 39.3 42.4 24.9 40.4 23.1 25.9 30.4 31.5
S-Align [zhu2019adapting] 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
SW-Align [saito2019strong] 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
DD-MRL [kim2019diversify] 30.8 40.5 44.3 27.2 38.4 34.5 28.4 32.2 34.6
MTOR [cai2019exploring] 30.6 41.4 44.0 21.9 38.6 40.6 28.3 35.6 35.1
RLDA [khodabandeh2019robust] 35.1 42.1 49.2 30.1 45.2 27.0 26.8 36.0 36.4
SW-Faster-ICR-CCR [xu2020exploring] 32.9 43.8 49.2 27.2 45.1 36.4 30.3 34.6 37.4
ART-PSA [zheng2020cross] 34.0 46.9 52.1 30.8 43.2 29.9 34.7 37.4 38.6
PSA [zheng2020cross] 33.5 45.2 51.5 28.2 41.6 26.6 36.9 35.4 37.4
Baseline 32.6 35.3 37.1 17.6 28.5 7.3 21.9 28.2 26.1
Oracle 46.6 47.8 64.9 31.2 47.7 48.4 32.4 40.7 45.0
Ours 42.5 44.6 57.0 26.4 48.0 28.3 33.2 37.1 39.6


TABLE II: Results of cross-domain object detection adapted from the Cityscapes to the Foggy-Cityscapes.

The results of cross-domain pedestrian detection are shown in Table I. As we are the first to perform unsupervised pedestrian detection at night by transferring knowledge from the day domain, there are no reported results of previous approaches on this benchmark. For EuroCity Persons, baseline denotes Faster R-CNN trained on the training set of the day subset. Since the training set of the night subset has far fewer images than that of the day subset, we take Faster R-CNN trained on all annotated training images from the two subsets as the upper limit, which is denoted as oracle. Both baseline and oracle adopt the same backbone and have the same settings as our approach. We observe that the proposed AFAN outperforms the baseline by 5.1%, which clearly demonstrates the effectiveness of the proposed domain adaptation in pedestrian detection from day to night images.

The results of object detection adapted from Cityscapes to Foggy-Cityscapes are summarized in Table II. We compare the average precision for each category as well as the mAP. As before, baseline denotes Faster R-CNN trained on the annotated Cityscapes training set, and oracle is Faster R-CNN trained on the annotated Foggy-Cityscapes training set. In other words, baseline is the method without domain adaptation, and oracle is the upper limit. The proposed AFAN achieves the highest mAP among the compared methods, with an absolute improvement of 1.0% over the best previously reported detector in Table II (ART-PSA, 38.6). It also obtains the best AP for the person, car, and bus classes. For the averaged performance, our AFAN outperforms baseline by 13.5%, which is significant as the margin between AFAN and oracle is only 5.4%.

IV-C Dissimilar Domains Adaptation


Method AP of car
DA-Faster [chen2018domain] 39.0
SW-Align [saito2019strong] 42.3
S-Align [zhu2019adapting] 43.0
ART-PSA [zheng2020cross] 43.8
Baseline 32.9
Oracle 68.6
Ours 45.5


TABLE III: Car detection adapted from the SIM 10k to the Cityscapes.


Method person rider car truck train mAP
DA-Faster [chen2018domain] 40.9 16.1 70.3 23.6 21.2 34.4
PSA [zheng2020cross] 50.2 27.3 73.2 29.5 17.1 39.5
ART-PSA [zheng2020cross] 50.4 29.7 73.6 29.7 21.6 41.0
Baseline 54.9 15.7 71.9 31.8 20.6 38.9
Ours 57.7 18.5 74.7 28.4 27.6 41.4


TABLE IV: Results of cross-domain object detection adapted from the Cityscapes to the KITTI.


Method bike bird car cat dog person mAP
DA-Faster [chen2018domain] 75.2 40.6 48.0 31.5 20.6 60.0 46.0
DT [inoue2018cross] 82.8 47.0 40.2 34.6 35.3 62.5 50.4
WST-BSR [Kim2019SelfTrainingAA] 75.6 45.8 49.3 34.1 30.3 64.1 49.9
Baseline 80.2 39.8 45.5 28.3 18.3 46.8 43.1
Oracle 86.5 56.4 51.4 39.9 42.3 74.7 58.5
Ours 87.0 46.4 47.3 33.1 30.0 60.1 50.6


TABLE V: Results of cross-domain object detection adapted from the Pascal VOC to the Watercolor.

In Table III, we compare the results of object detection adapted from synthetic images to real images. Baseline and oracle denote Faster R-CNN trained on the training sets of SIM 10k and Cityscapes, respectively. The proposed AFAN outperforms the recent state-of-the-art approaches. In particular, the average precision of AFAN is 2.5% higher than that of the recent method [zhu2019adapting], and 12.6% higher than that of baseline.

The results of adaptation from the Cityscapes dataset to the KITTI dataset are shown in Table IV. As both the training and validation processes share the same unlabeled images from the target domain, we do not present the results of oracle. Our approach improves over baseline in mAP (41.4 vs. 38.9) and on four of the five categories, and yields the state-of-the-art mAP. For example, for the train category, the proposed AFAN outperforms baseline by 7.0%. These experiments demonstrate the effectiveness of our approach for cross-domain object detection even if the two domains are very different.

The results on the artistic Watercolor dataset are provided in Table V. As in the other experiments, our approach significantly improves on the baseline without domain adaptation, and achieves a new state-of-the-art mean average precision on this artistic media dataset.

IV-D Experimental Analysis

We conduct extensive experiments to investigate the effect of the proposed discriminators and intermediate domain image generator for cross-domain object detection.


Method Cityscapes→Foggy SIM→Cityscapes Day→Night
AFAN 39.6 45.5 78.5
AFAN w/o RFA 38.4 44.1 77.8
AFAN w/o FPA 31.3 37.2 75.7
AFAN w/o IDIG 34.8 38.1 73.4


TABLE VI: Ablation study results for the proposed method. For simplicity, the intermediate domain image generator, feature pyramid alignment, and region feature alignment are abbreviated as IDIG, FPA and RFA, respectively. The symbol Cityscapes→Foggy denotes mAP adapted from the Cityscapes to the Foggy-Cityscapes, SIM→Cityscapes denotes car detection adapted from the SIM 10k to the Cityscapes, and Day→Night denotes pedestrian detection adapted from the EuroCity Persons day subset to the EuroCity Persons night subset.
Fig. 4: Evidence of domain classifiers in images from Foggy Cityscapes. The gradient of the domain classification loss is propagated backwards and important regions are highlighted in the image using Grad-CAM [selvaraju2017grad]. The first and second rows show examples from the source and target domains, respectively.
Fig. 5: Visualization of features learned from images. Cityscapes and Foggy Cityscapes are regarded as the source and the target domains, respectively. The first two columns show features of images learned by baseline and the proposed AFAN w/o IDIG, respectively. The third column illustrates features of images extracted by our AFAN. Images from the source and the target domains are indicated by blue dots and red squares, respectively. The last column shows features of intermediate domain images generated by the IDIG module. The first and second rows show convolutional features of the backbone and regional features of object proposals, respectively. For visualization, the feature dimensionality is reduced to two by t-SNE. The more similar the feature distributions of the two domains, the better the object detection results.
Fig. 6: Examples of detection results on the target domain. From left to right, the four columns correspond to the ground truth, results of Faster R-CNN [ren2015faster], Domain Adaptive Faster R-CNN [chen2018domain], and the proposed AFAN, respectively. The first and second rows show images from the validation set of Foggy Cityscapes. The third and fourth rows show images from the validation set of the night subset of EuroCity Persons. Boxes of different classes are marked with different colors, and the predicted confidence scores are described in the text above the corresponding boxes.

Ablation Studies. We run a number of ablations to analyze the proposed model. The results are summarized in Table VI. We find that, without the intermediate domain image generator or feature pyramid alignment, the results decrease dramatically. When either of the two modules is removed, the mAP on Foggy Cityscapes decreases by 4.8% and 8.3%, respectively, and the AP of car on Cityscapes decreases by 7.4% and 8.3%, respectively. Without region feature alignment, the results also decrease, though by a smaller margin. For example, the mAP on Foggy Cityscapes drops by 1.2% without region feature alignment. These experiments verify the effectiveness of the three modules, which are complementary to each other. The three components are integrated into a single network during training.
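The drops quoted above follow directly from Table VI. As a quick check, the short sketch below recomputes them; the numbers are copied from the table, while the dictionary layout and the helper name `ablation_drops` are illustrative choices.

```python
# Rows of Table VI: (Cityscapes->Foggy mAP, SIM->Cityscapes car AP, Day->Night AP).
TABLE_VI = {
    "AFAN": (39.6, 45.5, 78.5),
    "AFAN w/o RFA": (38.4, 44.1, 77.8),
    "AFAN w/o FPA": (31.3, 37.2, 75.7),
    "AFAN w/o IDIG": (34.8, 38.1, 73.4),
}


def ablation_drops(table, full="AFAN"):
    """Absolute drop of each ablated variant relative to the full model."""
    ref = table[full]
    return {
        name: tuple(round(r - v, 1) for r, v in zip(ref, vals))
        for name, vals in table.items()
        if name != full
    }


for name, drop in ablation_drops(TABLE_VI).items():
    print(name, drop)
```

For the day-to-night setting, the same computation gives drops of 0.7, 2.8, and 5.1 points for removing RFA, FPA, and IDIG, respectively.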

Visualization of Domain Evidence. To investigate the roles of discriminator networks and domain classifiers in network training, we visualize the evidence of the feature discriminator in Figure 4. Similar results can also be obtained for the instance discriminator. The heatmaps highlight the regions that the domain classifier uses to decide whether the image comes from the source or the target. For images from both domains, the domain classifier does not look at objects such as cars and persons. Instead, the background regions are highlighted and considered as evidence by the feature discriminator. This indicates that the network seems to focus on objects to deceive the domain classifier. In other words, the network demonstrates the ability to learn domain-invariant representations of objects.

Visualization of Features. To demonstrate that the proposed AFAN learns domain-invariant features and that the intermediate domain image generator (IDIG) enhances feature distribution alignments, we visualize learned features in Figure 5. We take the domain adaptation experiment from Cityscapes to Foggy Cityscapes as an example. To obtain one representation for each image, we average across the feature maps for the convolutional features and also average all the region features. The features are represented by two-dimensional points after dimensionality reduction to guarantee that similar features are represented by nearby points and dissimilar features are represented by distant points with high probability.
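The per-image pooling step can be sketched as below. This is one plausible reading of the description, stated as an assumption: convolutional features are averaged over spatial positions per channel, and region features are averaged over proposals per dimension. Nested lists stand in for tensors, and both function names are hypothetical.

```python
def pool_feature_map(fmap):
    """Average a C x H x W feature map over space, giving a length-C vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]


def pool_region_features(regions):
    """Average N region feature vectors (N x D), giving a length-D vector."""
    n = len(regions)
    return [sum(column) / n for column in zip(*regions)]


print(pool_feature_map([[[1, 2], [3, 4]], [[0, 0], [0, 8]]]))
print(pool_region_features([[1.0, 2.0], [3.0, 6.0]]))
```

The resulting vectors, one per image, would then be passed to a t-SNE implementation (for example, scikit-learn's `TSNE` with two output components) to produce the two-dimensional points.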

From the figures, we can see that, for the baseline, features of the source domain are distant from those of the target domain without domain adaptation. Without the IDIG module, features from the target are only partially aligned with those from the source. However, the learned features from the two domains are distributed closely for the proposed AFAN. The results indicate that AFAN w/o IDIG reduces the feature distribution divergence between the two domains to some extent. In contrast, this feature distribution divergence can be almost removed by AFAN. Similar results are obtained for both convolutional and regional features.

We also find that features of pseudo images are no different from those of the original images. It can be interpreted that the pseudo images bridge samples from different domains and facilitate the learning of features that are visually indistinguishable. These experiments highlight the important role of the IDIG module for feature distribution alignment.

Examples of Detection Results. The cross-domain object detection results are visualized in Figure 6. As the environmental conditions change, the baseline method usually misses some true positive objects which can be reliably detected by our approach. For example, in the second row, cars in the fog cannot be detected by other methods, while our AFAN can detect them with high confidence scores. Our approach gains similar advantages for pedestrian detection at night.

V Conclusion

In this work, we attempt to address unsupervised domain adaptation for object detection by progressively bridging the domain divergence with adversarial domain adaptation in an intermediate domain. We propose a novel augmented feature alignment network (AFAN) which integrates an intermediate domain image generator and adversarial feature alignments into a single object detection framework. Our method significantly outperforms state-of-the-art approaches on five datasets for both similar and dissimilar domain adaptations. Ablation studies verify the effectiveness and complementarities of the intermediate domain image generation and adversarial feature alignments. Further experiments indicate the evidence of domain discriminators and reveal the role of enhancing feature alignment of the intermediate domain image generator for cross-domain object detection.