Instance-Invariant Adaptive Object Detection via Progressive Disentanglement

11/20/2019 · by Aming Wu, et al. · University of Technology Sydney, Tianjin University

Most state-of-the-art object detection methods suffer from poor generalization when the training and test data come from different domains, e.g., with different styles. To address this problem, previous methods mainly use holistic representations to align the feature-level and pixel-level distributions of different domains, which may neglect the instance-level characteristics of objects in images. Besides, when transferring detection ability across domains, it is the instance-level features that are domain-invariant which matter, not the styles that are domain-specific. Therefore, in order to extract instance-invariant features, we should disentangle the domain-invariant features from the domain-specific features. To this end, we first propose a progressive disentangled framework for domain adaptive object detection. Particularly, based on disentangled learning for feature decomposition, we devise two disentangled layers to decompose domain-invariant and domain-specific features, and the instance-invariant features are extracted based on the domain-invariant features. Finally, to enhance the disentanglement, a three-stage training mechanism with multiple loss functions is devised to optimize our model. In experiments, we verify the effectiveness of our method on three domain-shift scenes, where it is 2.3%, 3.6%, and 4.0% higher than the baseline method [36], respectively.


1 Introduction

Recently, great efforts have been made on object detection [11, 32, 15, 24, 31]. Though most state-of-the-art methods achieve outstanding detection performance on many benchmarks [9, 25], they suffer from poor generalization when the training and test images come from different domains, a setting known as domain adaptive object detection (DAOD). In DAOD, a domain gap always exists between the source/training and target/test images, e.g., different illumination or different styles. Although performance could be improved by collecting additional well-labeled images from the target domain, doing so is time-consuming and labor-intensive.

Figure 1: The process of our disentangled method for domain adaptive object detection. We decompose source and target image representations into domain-invariant representations (DIR) and domain-specific representations (DSR). Then, we extract from the DIR the instance-invariant representations that lie in an instance-invariant space, in which the instance-invariant features describe the characteristics of objects. In the instance-invariant space, we conduct instance classification for adaptive object detection, while the different domains can be easily distinguished in the domain-specific space.

In order to alleviate the impact of domain-shift [10], representative DAOD methods [5, 36, 14] employ unsupervised domain adaptation [34, 29, 44] to align the distributions of different domains, e.g., via adversarial training [10] or style translation [21]. Distribution alignment is usually conducted on a holistic representation (e.g., at the feature level [6, 22] or pixel level [12, 3, 35]) of source and target images, which may neglect the instance-level characteristics of objects, such as object locations and basic shapes. When transferring detection ability from source images to target images, it is the instance-level features, which are domain-invariant, that really count, not the illumination and painting styles that are domain-specific. Therefore, in order to obtain the instance-invariant features and bridge the domain gap in DAOD, we should disentangle the domain-invariant representations (DIR) from the domain-specific representations (DSR).

As a method of feature decomposition, disentangled learning [8, 28] has been demonstrated to be effective in few-shot learning [33, 38] and image translation [23, 16]. The purpose of disentangled learning is to uncover a set of independent factors that give rise to the current observation [8]. Its major advantage is that disentangled representations can contain all the information present in the current observation in a compact and interpretable structure while being independent of the current task [28, 2]. In this paper, we employ disentangled learning to decompose an image representation into a domain-invariant representation (DIR) and a domain-specific representation (DSR) (see Fig. 1), so as to obtain the instance-invariant representation (IIR). Taking the IIR as a bridge, we can strengthen the transfer ability of a detection model trained on source images.

Particularly, in the proposed detection network, we devise a progressive process to decompose the DIR and DSR with two disentangled layers. The goal of the first layer is to enhance the domain-invariant information in a middle-layer feature map. We utilize a domain classifier to ensure that the DSR contains much more domain-specific information, and a mutual information (MI) loss is employed to enlarge the gap between DIR and DSR. Taking the sum of the feature map and the DIR as the input, the second layer aims at obtaining the instance-invariant representations (IIR) with a region proposal network (RPN) [32, 41]. Moreover, to enhance the disentanglement, we devise a three-stage training mechanism to optimize our model: (i) the stage of feature decomposition, aiming at learning the disentanglement, (ii) the stage of feature separation, aiming at enlarging the gap between DIR and DSR, and (iii) the stage of feature reconstruction, aiming at ensuring that the DIR and DSR together preserve all the content of the input. For each stage, we use different loss functions to optimize different components of our network. Experiments on three domain-shift scenes of DAOD demonstrate that our method is effective and achieves a new state-of-the-art performance.

The contributions of this paper are summarized as follows:

(1) Different from reducing the domain gap with distribution alignment, we propose to enhance the transfer of detection ability via a bridge of disentangled instance-invariant representations.

(2) A progressive disentangled network is first proposed to extract instance-invariant features. Meanwhile, a three-stage training mechanism is proposed to further enhance the disentanglement.

(3) On three domain-shift scenes, i.e., Cityscapes [7] → FoggyCityscapes [37], Pascal [9] → Watercolor [17], and Pascal → Clipart [17], our method is 2.3%, 3.6%, and 4.0% higher than the baseline method [36], respectively.

2 Related Work

Domain Adaptive Object Detection. Though most object detection methods [11, 31, 15, 26] have achieved outstanding performance, their transfer ability is limited in the task of DAOD. Recently, many methods [21, 36, 20] have been proposed to solve the domain-shift problem in object detection. These methods mainly focus on feature-level or pixel-level alignment. For example, the method in [5] utilizes adversarial training [10] to align the global feature distributions of the source and target domains, whereas the method in [36] aligns the distributions of both global and local features. For pixel-level adaptation, the work [21] devises a generative network to increase the diversity of the source domain, which is similar to data augmentation. However, as the alignment is conducted on holistic representations of images, it is not dedicated to adaptive object detection, which should bridge the domains via instance-level characteristics. Therefore, in this paper, we focus on extracting instance-level features that are domain-invariant, which help improve the transfer ability of a detection method.

Figure 2: Illustration of the proposed network of progressive disentanglement. ‘Recon’ indicates the reconstruction loss. ‘GRL’ is the gradient reverse layer. ‘RA’ indicates the operation of RoI-Alignment. ‘RC loss’ and ‘MI loss’ denote the proposed relation-consistency loss and the mutual information loss, respectively. Element-wise sum is used to combine the feature map with the DIR, and the dotted lines indicate the relations that exist between the extracted proposals. There are two disentangled layers in the network. The purpose of the first layer is to enhance the domain-invariant information in a middle-layer feature map, and the goal of the second layer is to obtain the instance-invariant features. During training, in order to enhance the disentanglement, we devise a three-stage optimization mechanism with multiple loss functions. For each stage, we use different loss functions to optimize different components of the network.

Disentangled Learning. The purpose of disentangled learning [18, 28, 2, 30] is to correctly uncover a set of independent factors that give rise to the current observation. Recently, disentangled learning has been well explored in few-shot learning [33, 38] and image translation [23, 16]. Particularly, by decomposing the style of an image, the work [23] proposed a disentangled method for diverse image-to-image translation. Liu et al. [27] proposed a model of cross-domain representation disentanglement; based on generative adversarial networks, this method alleviates the impact of domain-shift and improves the classification performance on multiple datasets. As for adaptive object detection, on one hand, we should remove the domain-shift; on the other hand, it is important to transfer the detection ability via the bridge of instance characteristics. Thus, it is not straightforward to apply disentangled learning to the task of DAOD.

In this paper, we devise a new network of progressive disentanglement to decompose image representations into domain-specific and domain-invariant representations, from which we extract the instance-invariant representations that bridge the detection ability between the source and target domains. Experiments on three domain-shift scenes of DAOD demonstrate the effectiveness of our method.

3 Instance-Invariant Adaptive Object Detection

Suppose we have access to a source image $x_s$ with labels $y_s$ and bounding boxes $b_s$, drawn from a set of annotated source images $\mathcal{D}_s = \{X_s, Y_s, B_s\}$. Here, $X_s$, $Y_s$, and $B_s$ indicate the set of images, labels, and bounding-box annotations of the source domain, respectively. Meanwhile, we can also access a target image $x_t$ drawn from a set of unlabeled target images $\mathcal{D}_t = \{X_t\}$.

3.1 The Network of Progressive Disentanglement

As shown in Fig. 2, we devise two disentangled layers to extract domain-invariant information progressively.

The First Disentangled Layer. The goal of this layer is to enhance the domain-invariant information in a middle-layer feature map. Concretely, given a source image $x_s$ or a target image $x_t$, we first obtain a feature map $F$ that is the output of a middle-layer feature extractor $E_1$. Then, two different extractors are devised to disentangle the DIR and DSR from $F$:

$F_{dir} = E^{1}_{dir}(F), \quad F_{dsr} = E^{1}_{dsr}(F)$   (1)

Here, $E^{1}_{dir}$ and $E^{1}_{dsr}$ indicate the DIR and DSR extractor, respectively. The sizes of $F_{dir}$ and $F_{dsr}$ are set to the same value as that of $F$. Then, we take the sum $F + F_{dir}$ as the input of the second feature extractor $E_2$. Since $F_{dir}$ contains more domain-invariant information, the sum operation alleviates the impact of domain-shift on $E_2$.
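As an illustration of how such a layer could be organized, the following sketch (written in PyTorch, which the paper does not specify; the module names, channel count, and three-convolution design are assumptions loosely based on the implementation details in Section 4.1) shows the first disentangled layer and the element-wise sum:

```python
import torch
import torch.nn as nn

class DisentangledExtractor(nn.Module):
    """Three-convolution extractor that preserves the spatial size of its input."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# First disentangled layer: decompose the middle-layer feature map F into a
# domain-invariant part F_dir and a domain-specific part F_dsr, then feed
# F + F_dir to the second feature extractor E2.
e1_dir = DisentangledExtractor(256)     # E^1_dir (channel count is illustrative)
e1_dsr = DisentangledExtractor(256)     # E^1_dsr

F = torch.randn(2, 256, 38, 75)         # toy middle-layer feature map from E1
F_dir, F_dsr = e1_dir(F), e1_dsr(F)
enhanced = F + F_dir                    # element-wise sum: E2's input carries
                                        # enhanced domain-invariant information
```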

The Second Disentangled Layer. The purpose of this layer is to obtain the instance-invariant features. Particularly, based on the output $G = E_2(F + F_{dir})$ of the extractor $E_2$, we devise two extractors, i.e., $E^{2}_{dir}$ and $E^{2}_{dsr}$, to disentangle the DIR and DSR from $G$. The processes are as follows:

$G_{dir} = E^{2}_{dir}(G), \quad G_{dsr} = E^{2}_{dsr}(G)$   (2)

Here, the sizes of $G_{dir}$ and $G_{dsr}$ are set to the same value as that of $G$. Next, the RPN is performed on $G_{dir}$ to extract a set of instance-invariant proposals. Finally, for an image from the source domain, the detection loss is as follows:

$\mathcal{L}_{det} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{cls,reg}\big(\Phi(r_i),\, y_s,\, b_s\big)$   (3)

where $N$ denotes the number of proposals, $r_i$ indicates the RoI-Alignment [32, 13] result of the $i$-th proposal, and $\Phi$ includes the classification and regression network. $\mathcal{L}_{cls,reg}$ is assumed to contain all the losses for detection, e.g., the classification and bounding-box regression loss.
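A minimal sketch of the second disentangled layer and an instance-level detection loss in the spirit of Eq. (3) is shown below. It assumes PyTorch/torchvision; the proposals are dummies standing in for real RPN outputs, and the head sizes, class count, and regression targets are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf
from torchvision.ops import roi_align

def extractor(c):  # stand-in for a three-convolution disentangled extractor
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(c, c, 3, padding=1))

e2_dir, e2_dsr = extractor(256), extractor(256)        # E^2_dir, E^2_dsr
cls_head = nn.Linear(256 * 7 * 7, 9)                   # 8 classes + background
reg_head = nn.Linear(256 * 7 * 7, 9 * 4)

G = torch.randn(1, 256, 38, 75)                        # output of E2(F + F_dir)
G_dir, G_dsr = e2_dir(G), e2_dsr(G)

# Proposals would normally come from an RPN run on G_dir; dummies are used here.
proposals = [torch.tensor([[10., 10., 120., 90.], [200., 40., 380., 160.]])]
r_dir = roi_align(G_dir, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)

feat = r_dir.flatten(1)                                # instance-invariant features
labels = torch.tensor([1, 3])                          # toy ground-truth classes
box_targets = torch.randn(2, 9 * 4)                    # toy regression targets

# Classification + bounding-box regression over the N proposals
# (a real Faster R-CNN head regresses only the ground-truth class's box).
det_loss = nnf.cross_entropy(cls_head(feat), labels) + \
           nnf.smooth_l1_loss(reg_head(feat), box_targets)
```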

3.2 Training with the Three-stage Optimization

As discussed in the Introduction, the goal of disentangled learning is to uncover a set of independent factors that give rise to the current observation [8], and these factors could contain all the information present in the observation [28]. Therefore, we devise a three-stage training mechanism (see Fig. 3) to enhance the disentanglement.

Figure 3: Illustration of the three-stage training. Here, the red arrow denotes the operation of reconstruction. ‘Stage-1’ is the first stage, aiming at learning the disentanglement. ‘Stage-2’ is the second stage, aiming at keeping the disentangled DIR and DSR independent. And ‘Stage-3’ is the third stage, aiming at ensuring that the DIR and DSR together contain all the content of the input.

3.2.1 The Stage of Feature Decomposition

The goal of the first stage is to ensure that our model not only learns the location and classification of the objects but also disentangles the image features. Based on $G_{dir}$, we first utilize the RPN to obtain a set of object proposals $P$. To ensure that $G_{dir}$ and $G_{dsr}$ have the same object contents in the same locations, RoI-Alignment is performed on $G_{dir}$ and $G_{dsr}$ based on the proposals $P$ to obtain $r^{dir}_i$ and $r^{dsr}_i$, respectively. Next, we devise two networks $\Phi_{dir}$ and $\Phi_{dsr}$ to perform the classification and bounding-box regression. Finally, for a source image, the detection loss is defined as:

$\mathcal{L}^{1}_{det} = \frac{1}{N}\sum_{i=1}^{N}\Big[\mathcal{L}_{dir}\big(\Phi_{dir}(r^{dir}_i),\, y_s,\, b_s\big) + \mathcal{L}_{dsr}\big(\Phi_{dsr}(r^{dsr}_i),\, y_s,\, b_s\big)\Big]$   (4)

where $\mathcal{L}_{dir}$ and $\mathcal{L}_{dsr}$ indicate the detection loss on the DIR and DSR branch, respectively.

By using the detection loss, $r^{dir}_i$ and $r^{dsr}_i$ are ensured to contain the instance information. Besides, for our method, it is also important to keep the learned $F_{dsr}$ and $G_{dsr}$ containing more domain-specific information, which ensures that our model owns the ability of feature disentanglement. In this paper, we exploit adversarial domain classification [10] to distinguish the source and target domains. Specifically, we employ four domain classifiers $D_1$, $D_2$, $D_3$, and $D_4$ in our model, which separately take $F_{dir}$, $F_{dsr}$, $G_{dir}$, and $G_{dsr}$ as the input and output a domain label $d$ that indicates the source or target domain: $d$ is 0 for the source domain and 1 for the target domain.
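The gradient reverse layer (GRL) and a three-FC domain classifier could be sketched as follows (a PyTorch sketch, under the assumption that DIR features pass through the GRL while DSR features are classified directly; all names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """GRL: identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Three fully-connected layers predicting source (0) vs. target (1)."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, adversarial=True):
        # For DIR features the GRL makes the extractor fool the classifier, which
        # pushes the DIR toward domain invariance; DSR features skip the reversal
        # so that they keep domain-specific information.
        if adversarial:
            feat_map = GradReverse.apply(feat_map, 1.0)
        pooled = feat_map.mean(dim=[2, 3])      # global average pooling: [B, C]
        return torch.sigmoid(self.net(pooled))  # estimated P(domain = target)

d1 = DomainClassifier(in_dim=256)               # e.g., D1 applied to F_dir
p_target = d1(torch.randn(2, 256, 38, 75))      # shape [2, 1]
```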

Besides, during training, we employ the Focal Loss [24, 36] for the domain classifiers to impose larger weights on the hard-to-classify examples (i.e., the examples near the classification boundary) than on the easy ones (i.e., the examples far from the classification boundary):

$FL(p_t) = -\alpha\,(1 - p_t)^{\gamma}\,\log(p_t)$   (5)

where $\gamma$ controls the weight on the hard-to-classify examples and $p_t$ is the model’s estimated probability for the output domain label $d$. Finally, the loss of the first training stage is denoted as follows:

$\mathcal{L}_{s} = \mathcal{L}^{1}_{det} + \mathcal{L}^{s}_{dom}, \qquad \mathcal{L}_{t} = \mathcal{L}^{t}_{dom}$   (6)

where $\mathcal{L}_{s}$ and $\mathcal{L}_{t}$ are the objective functions of the source and target domains, and $\mathcal{L}^{s}_{dom}$ and $\mathcal{L}^{t}_{dom}$ indicate the domain losses computed with the focal loss over the four domain classifiers. The overall loss is the sum of $\mathcal{L}_{s}$ and $\mathcal{L}_{t}$.
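For reference, the focal loss of Eq. (5) applied to the binary domain predictions could be written as below (a small sketch; the function name and tensor shapes are illustrative, with α = 1.0 and γ = 2.0 as stated in Section 4.1):

```python
import torch

def domain_focal_loss(p, d, alpha=1.0, gamma=2.0, eps=1e-6):
    """Focal loss on binary domain predictions.
    p: predicted probability of the target domain, in (0, 1).
    d: domain label, 0 for source and 1 for target.
    """
    p_t = torch.where(d == 1, p, 1.0 - p)      # probability of the true domain
    return (-alpha * (1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()

# Hard-to-classify examples (p_t near or below 0.5) get larger weights than
# easy ones (p_t close to 1), as described above.
p = torch.tensor([0.95, 0.55, 0.20])
d = torch.tensor([1, 1, 1])
loss = domain_focal_loss(p, d)
```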

With the help of the detection loss and the domain loss, the disentangled DIR and DSR contain instance and domain-specific information, respectively. Next, we perform the second training stage to keep the disentangled DIR and DSR independent.

3.2.2 The Stage of Feature Separation

In this stage, we first fix the extractors $E_1$ and $E_2$ of the model trained in the first stage. Then, we employ the model to extract $F$, $F_{dir}$, $F_{dsr}$ (Eq. (1)) and $G$, $G_{dir}$, $G_{dsr}$ (Eq. (2)). The RPN is performed on $G_{dir}$ to obtain the proposals $P$.

Mutual Information Minimization. In order to enlarge the gap between the DIR and DSR, we minimize the MI loss between $F_{dir}$ and $F_{dsr}$, as well as between $r^{dir}$ and $r^{dsr}$, where $r^{dir}$ and $r^{dsr}$ indicate the RoI-Alignment results of $G_{dir}$ and $G_{dsr}$ based on $P$. The MI is defined as:

$\mathrm{MI}(X; Y) = \int_{\mathcal{X}}\int_{\mathcal{Y}} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,\mathrm{d}x\,\mathrm{d}y$   (7)

where $p(x, y)$ indicates the joint probability distribution of ($F_{dir}$, $F_{dsr}$) or ($r^{dir}$, $r^{dsr}$), and $p(x)$ and $p(y)$ are the marginal distributions. Obviously, by minimizing the MI loss, we impose independence constraints on the tuples ($F_{dir}$, $F_{dsr}$) and ($r^{dir}$, $r^{dsr}$). Besides, since $F_{dsr}$ and $r^{dsr}$ contain more domain-specific information, the MI loss promotes $F_{dir}$ and $r^{dir}$ to contain more domain-invariant information, which helps strengthen the ability of disentanglement. In this paper, we adopt the Mutual Information Neural Estimator (MINE) [1] to compute the MI loss. Concretely, based on Monte-Carlo integration [30], MINE can be computed as follows:

$\widehat{\mathrm{MI}}(X; Y) = \frac{1}{n}\sum_{i=1}^{n} T_{\theta}(x_i, y_i) - \log\Big(\frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta}(x_i, y'_i)}\Big)$   (8)

where $(x_i, y_i)$ is sampled from the joint distribution and $(x_i, y'_i)$ is sampled from the product of the marginal distributions. Here, we devise a neural network $T_{\theta}$ to perform the Monte-Carlo integration.

It is worth noting that, for the second disentangled layer, we use the RoI-Alignment results $r^{dir}$ and $r^{dsr}$, instead of the feature maps $G_{dir}$ and $G_{dsr}$, to compute the MI loss, which not only reduces the computational costs but also ensures that our model pays more attention to the regions of objects.
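A compact sketch of a MINE-style estimator for Eq. (8) is given below (PyTorch, with illustrative dimensions; note that in practice the statistics network T_θ is trained to tighten the bound while the disentangled extractors are updated to minimize the estimated MI):

```python
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Statistics network T_theta (three FC layers) for the MINE bound of Eq. (8)."""
    def __init__(self, dim_x, dim_y, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        # Joint samples: aligned pairs (x_i, y_i); marginal samples: x_i paired with
        # a shuffled y, a Monte-Carlo approximation of p(x)p(y).
        y_shuffled = y[torch.randperm(y.size(0))]
        t_joint = self.net(torch.cat([x, y], dim=1))
        t_marginal = self.net(torch.cat([x, y_shuffled], dim=1))
        return t_joint.mean() - torch.log(torch.exp(t_marginal).mean() + 1e-6)

mi_estimator = MINE(dim_x=256, dim_y=256)      # e.g., M2 on pooled RoI features
x = torch.randn(32, 256)                       # pooled DIR RoI features (toy)
y = torch.randn(32, 256)                       # pooled DSR RoI features (toy)
mi_loss = mi_estimator(x, y)                   # minimized w.r.t. the extractors
```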

Relation-consistency Loss. To further improve the disentanglement, we devise a relation-consistency loss (Fig. 4). Specifically, since $G_{dir}$ and $G_{dsr}$ have the same object contents in the same locations, based on the proposals $P$, $r^{dir}$ and $r^{dsr}$ should keep similar semantic relations.

Concretely, we first obtain the average-pooling results $Z_{dir} \in \mathbb{R}^{N \times C}$ and $Z_{dsr} \in \mathbb{R}^{N \times C}$ of $r^{dir}$ and $r^{dsr}$, where $N$ and $C$ indicate the numbers of proposals and channels. Then we separately construct a graph $\mathcal{G}_{dir}$ and $\mathcal{G}_{dsr}$, taking $Z_{dir}$ and $Z_{dsr}$ as the nodes, respectively, with the edges indicating the relations between proposals. Next, we define the adjacency matrices of the two undirected graphs as $A_{dir} = \mathrm{softmax}(Z_{dir} Z_{dir}^{\top})$ and $A_{dsr} = \mathrm{softmax}(Z_{dsr} Z_{dsr}^{\top})$, where the softmax operation is performed along the row direction. The relation-consistency loss is computed as:

$\mathcal{L}_{rc} = \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\big|A_{dir}^{(i,j)} - A_{dsr}^{(i,j)}\big|$   (9)
Figure 4: Illustration of the relation-consistency loss. ‘P’ indicates the ‘Person’ class. The goal of the loss is to ensure that the relations (the red solid lines) between object proposals in the DIR and the relations between object proposals in the DSR are consistent. The purple dotted lines denote the consistency between the two red lines.
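A parameter-free sketch of this loss, using dot-product similarities with a row-wise softmax for the adjacency matrices and a mean absolute difference as the penalty (assumptions made for illustration, matching the reconstruction of Eq. (9) above), could be:

```python
import torch
import torch.nn.functional as F

def relation_consistency_loss(z_dir, z_dsr):
    """Relation consistency between DIR and DSR proposal features.
    z_dir, z_dsr: [N, C] average-pooled RoI features of the N shared proposals.
    """
    a_dir = F.softmax(z_dir @ z_dir.t(), dim=1)   # [N, N] relations among DIR proposals
    a_dsr = F.softmax(z_dsr @ z_dsr.t(), dim=1)   # [N, N] relations among DSR proposals
    return F.l1_loss(a_dir, a_dsr)                # parameter-free consistency term

z_dir = torch.randn(8, 256)
z_dsr = torch.randn(8, 256)
rc_loss = relation_consistency_loss(z_dir, z_dsr)
```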

Note that the computation of the relation-consistency loss does not need any parameters. Finally, the loss of the second training stage is denoted as follows:

$\mathcal{L}_{s} = \mathcal{L}^{2}_{det} + \mathcal{L}^{s}_{dom} + \mathcal{L}^{1}_{mi} + \mathcal{L}^{2}_{mi} + \mathcal{L}_{rc}, \qquad \mathcal{L}_{t} = \mathcal{L}^{t}_{dom} + \mathcal{L}^{1}_{mi} + \mathcal{L}^{2}_{mi}$   (10)

where $\mathcal{L}^{2}_{det}$ is the detection loss based on $G_{dir}$, $\mathcal{L}_{s}$ and $\mathcal{L}_{t}$ are the training objectives of the source and target domain, respectively, and $\mathcal{L}^{1}_{mi}$ and $\mathcal{L}^{2}_{mi}$ indicate the MI loss computed on the first and second disentangled layer, respectively. The overall loss is the sum of $\mathcal{L}_{s}$ and $\mathcal{L}_{t}$. After this stage, the gap between DIR and DSR is enlarged. Next, we perform the third training stage, which aims at ensuring that the disentangled DIR and DSR contain all the content of the input used for disentanglement.

Input: source images $\mathcal{D}_s$; target images $\mathcal{D}_t$; feature extractors $E_1$ and $E_2$; disentangled extractors $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, and $E^{2}_{dsr}$; detection networks $\Phi_{dir}$ and $\Phi_{dsr}$; domain classifiers $D_1$, $D_2$, $D_3$, and $D_4$; MI estimators $M_1$ and $M_2$; reconstruction network $Re$.
Output: trained $E_1$, $E_2$, $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, $E^{2}_{dsr}$, and detection networks $\Phi_{dir}$, $\Phi_{dsr}$.

1:  while not converged do
2:     Sample a mini-batch from $\mathcal{D}_s$ and $\mathcal{D}_t$;
3:     Feature Decomposition:
4:     Update the feature extractors $E_1$, $E_2$, the disentangled extractors, and the detection networks $\Phi_{dir}$, $\Phi_{dsr}$ by Eq. (4);
5:     Update $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, $E^{2}_{dsr}$, $D_1$, $D_2$, $D_3$, and $D_4$ by Eq. (5);
6:     Feature Separation:
7:     Update the DIR extractors and the detection networks by the detection loss in Eq. (10);
8:     Update the disentangled extractors and the domain classifiers by the domain losses in Eq. (10);
9:     Calculate the MI loss between $F_{dir}$ and $F_{dsr}$ with $M_1$, and between $r^{dir}$ and $r^{dsr}$ with $M_2$;
10:     Update $M_1$, $M_2$, $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, and $E^{2}_{dsr}$ by Eq. (8);
11:     Update the second-layer disentangled extractors by Eq. (9);
12:     Feature Reconstruction:
13:     Reconstruct the RoI-Alignment result $r$ from ($r^{dir}$, $r^{dsr}$);
14:     Update $Re$, $E^{2}_{dir}$, and $E^{2}_{dsr}$ by Eq. (11);
15:  end while
16:  return the trained networks $E_1$, $E_2$, $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, $E^{2}_{dsr}$, $\Phi_{dir}$, and $\Phi_{dsr}$.
Algorithm 1 Instance-Invariant Adaptive Object Detection
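For orientation, the three training stages of Algorithm 1 can be arranged as a schematic loop like the one below. The `model.*_loss` methods and the per-stage optimizers (each covering only the parameters updated in that stage) are hypothetical placeholders, not the authors' implementation:

```python
def optimize(optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def train(model, source_loader, target_loader, optimizers, num_iters):
    """Schematic three-stage loop; each stage has its own optimizer so that the
    parameters not involved in that stage stay fixed."""
    for it, (src, tgt) in enumerate(zip(source_loader, target_loader)):
        if it >= num_iters:
            break

        # Stage 1 (feature decomposition): detection loss on the DIR/DSR RoI
        # features plus adversarial domain losses on the disentangled maps.
        loss1 = model.detection_loss(src) \
              + model.domain_loss(src, domain=0) + model.domain_loss(tgt, domain=1)
        optimize(optimizers["decomposition"], loss1)

        # Stage 2 (feature separation): with E1 and E2 fixed, minimize the MI
        # between DIR and DSR and apply the relation-consistency loss.
        loss2 = model.detection_loss(src) \
              + model.mi_loss(src) + model.mi_loss(tgt) \
              + model.relation_consistency_loss(src)
        optimize(optimizers["separation"], loss2)

        # Stage 3 (feature reconstruction): the concatenated (DIR, DSR) RoI
        # features must reconstruct the undisentangled RoI features.
        loss3 = model.reconstruction_loss(src) + model.reconstruction_loss(tgt)
        optimize(optimizers["reconstruction"], loss3)
```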

3.2.3 The Stage of Feature Reconstruction

We employ a reconstruction loss to attain the purpose of this training stage. Concretely, we first use the model trained in the second stage to extract $G$, $G_{dir}$, and $G_{dsr}$ (Eq. (2)). Then, the RPN is performed on $G_{dir}$ to extract the proposals $P$. The reconstruction loss is computed as follows:

$\mathcal{L}_{rec} = \frac{1}{N}\sum_{i=1}^{N}\Big\| Re\big([\,r_i^{dir};\, r_i^{dsr}\,]\big) - r_i \Big\|^{2}_{2}$   (11)

where $r_i$, $r_i^{dir}$, and $r_i^{dsr}$ are the RoI-Alignment results of $G$, $G_{dir}$, and $G_{dsr}$ based on the proposals $P$, $Re$ is the reconstruction network, and $[\,\cdot\,;\,\cdot\,]$ indicates the concatenation of $r_i^{dir}$ and $r_i^{dsr}$. Here, in order to make the model pay more attention to instance content, the reconstruction loss is only computed on the regions of the proposals. Besides, since the output of the first disentangled layer already includes the entire feature map $F$, we do not calculate the reconstruction loss on the first layer, which reduces the computational costs.
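A sketch of this reconstruction term, using the single-convolution reconstruction network described in Section 4.1 (the 1×1 kernel and the L2 penalty are assumptions), is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

# Single-convolution reconstruction network Re mapping the concatenated
# (DIR, DSR) RoI features back to the undisentangled RoI features.
recon_net = nn.Conv2d(2 * 256, 256, kernel_size=1)

r     = torch.randn(8, 256, 7, 7)    # RoI-aligned features of G (toy)
r_dir = torch.randn(8, 256, 7, 7)    # RoI-aligned DIR features (toy)
r_dsr = torch.randn(8, 256, 7, 7)    # RoI-aligned DSR features (toy)

recon = recon_net(torch.cat([r_dir, r_dsr], dim=1))
recon_loss = nnf.mse_loss(recon, r)  # Eq. (11) with an L2 reconstruction penalty
```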

Method backbone person rider car truck bus train motorcycle bicycle mAP
Source Only VGG16 24.7 31.9 33.1 11.0 26.4 9.2 18.0 27.9 22.8
DAF [5] VGG16 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
DT [17] VGG16 25.4 39.3 42.4 24.9 40.4 23.1 25.9 30.4 31.5
SC-DA(Type3) [45] VGG16 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
DMRL [21] VGG16 30.8 40.5 44.3 27.2 38.4 34.5 28.4 32.2 34.6
MTOR [4] ResNet50 30.6 41.4 44.0 21.9 38.6 40.6 28.3 35.6 35.1
MLDA [43] VGG16 33.2 44.2 44.8 28.2 41.8 28.7 30.5 36.5 36.0
FSDA [42] VGG16 29.1 39.7 42.9 20.8 37.4 24.1 26.5 29.9 31.3
MAF [14] VGG16 28.2 39.5 43.9 23.8 39.9 33.3 29.2 33.9 34.0
RLDA [19] IncepV2 [40] 35.10 42.15 49.17 30.07 45.25 26.97 26.85 36.03 36.45
SW (B) [36] VGG16 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
Ours VGG16 33.12 43.41 49.63 21.98 45.75 32.04 29.59 37.08 36.57
Ours ResNet101 32.82 44.37 49.57 33.02 46.10 37.97 29.90 35.26 38.63
Table 1: Results (%) on adaptation from Cityscapes to FoggyCityscapes. ‘B’ indicates the baseline method. ‘Source Only’ indicates the model is only trained based on the data from the source domain and does not use the target data.

In this paper, our model is trained in an end-to-end way. The detailed training procedure is presented in Algorithm 1. During each training stage, the parameters that do not appear in the current stage are kept fixed.

4 Experiments

We evaluate our approach on three domain-shift scenes, i.e., Cityscapes [7] → FoggyCityscapes [37], Pascal VOC [9] → Watercolor [17], and Pascal VOC → Clipart [17].

4.1 Dataset and Implementation Details

Dataset. For Cityscapes → FoggyCityscapes, we use Cityscapes as the source domain. FoggyCityscapes, which is rendered from Cityscapes and simulates a change of weather condition, is used as the target domain. Both of them contain 2,975 images in the training set and 500 images in the validation set, and this adaptation scene involves 8 categories. We utilize the training set during training and evaluate on the validation set.

For Pascal → Watercolor and Pascal → Clipart, the Pascal VOC dataset is used as the real source domain. The images of this dataset include rich bounding-box annotations, and the number of object classes is 20. Following a prevalent setting [21, 36], we use the Pascal VOC 2007 and 2012 training and validation sets for training, which results in about 15K images. The Watercolor and Clipart datasets are taken as the target domains. Watercolor contains 6 categories in common with VOC and 2K images in total. Clipart contains 1K images in total and has the same 20 categories as VOC. For these two target datasets, the splits of the training and test sets are the same as in the work [36].

Implementation Details. Our method is based on Faster-RCNN [32] with RoI-Alignment [13]. For the Focal Loss (Eq. (5)), $\alpha$ and $\gamma$ are set to 1.0 and 2.0. Besides, we separately employ a network including three convolutional layers as the disentangled extractors $E^{1}_{dir}$, $E^{1}_{dsr}$, $E^{2}_{dir}$, and $E^{2}_{dsr}$. For the domain classifiers $D_1$, $D_2$, $D_3$, and $D_4$, we respectively employ a network which includes three fully-connected layers. Meanwhile, for the MI estimators $M_1$ and $M_2$, we separately utilize a network consisting of three fully-connected layers. Finally, one convolutional layer is used as the reconstruction network $Re$. During training, we employ the SGD optimizer with momentum [39]. We first train the model with a learning rate of 0.001 for 50K iterations, and then with a learning rate of 0.0001 for 30K more iterations. At test time, we use mean average precision (mAP) as the evaluation metric.
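Put together, the auxiliary modules listed above could be instantiated roughly as follows (a PyTorch sketch; channel counts, hidden sizes, and the 1×1 reconstruction kernel are assumptions, since only the layer counts are given):

```python
import torch.nn as nn

def conv_extractor(channels=512):
    """Disentangled extractor: three convolutional layers."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1),
    )

def fc_net(in_dim, out_dim=1, hidden=256):
    """Three fully-connected layers, used for the domain classifiers and MI estimators."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim),
    )

modules = {
    "E1_dir": conv_extractor(), "E1_dsr": conv_extractor(),   # first disentangled layer
    "E2_dir": conv_extractor(), "E2_dsr": conv_extractor(),   # second disentangled layer
    "D1": fc_net(512), "D2": fc_net(512),                     # domain classifiers
    "D3": fc_net(512), "D4": fc_net(512),
    "M1": fc_net(2 * 512), "M2": fc_net(2 * 512),             # MI estimators (MINE)
    "Re": nn.Conv2d(2 * 512, 512, kernel_size=1),             # one-conv reconstruction net
}
```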

(a) Raw image in Cityscapes
(b) GT
(c) SW baseline
(d) One Disentangled layer
(e) Two Disentangled layers
Figure 5: Detection results on the “Cityscapes → FoggyCityscapes” scene. ‘GT’ indicates the ground-truth result. ‘One Disentangled layer’ denotes that we only use the second disentangled layer in the model. We can see that our method, i.e., using two disentangled layers, can locate and recognize the objects existing in the two foggy images accurately, e.g., the truck, car, and bicycle.
(a) Raw image
(b) GT
(c) SW baseline
(d) One Disentangled layer
(e) Two Disentangled layers
Figure 6: Detection results on the “Pascal VOC → Watercolor” scene. We can see that our method, i.e., using two disentangled layers, can locate and recognize the objects existing in the two watercolor images accurately, e.g., the person, bird, and cat.
Method bike bird car cat dog person mAP
Source Only 68.8 46.8 37.2 32.7 21.3 60.7 44.6
BDC-Faster [36] 68.6 48.3 47.2 26.5 21.7 60.5 45.5
DAF [5] 75.2 40.6 48.0 31.5 20.6 60.0 46.0
SW (B) [36] 82.3 55.9 46.5 32.7 35.5 66.7 53.3
Ours 95.8 54.3 48.3 42.4 35.1 65.8 56.9
Table 2: Results (%) on adaptation from Pascal to Watercolor.

4.2 Experimental Results

Results on FoggyCityscapes. Table 1 shows the performance of our method on the FoggyCityscapes dataset. Here, we use VGG16 and ResNet101 as the backbone of Faster-RCNN, respectively. We can see that our method outperforms all the methods in Table 1. Particularly, with the VGG16 backbone and the mAP metric, our method is around 2.3% higher than the SW baseline method [36]. Our method also outperforms RLDA [19], which uses the stronger InceptionV2 [40] backbone. These results show that our method is effective. Moreover, employing the ResNet101 backbone improves the performance of our method significantly, which shows that our method is more effective with a better backbone. Fig. 5 shows two detection examples. Compared with the raw images, the foggy scene is much more challenging for object detection. Meanwhile, compared with the SW method, our method locates and recognizes the objects existing in the two images accurately. Particularly, regardless of distance, our method locates and discriminates the truck accurately. These results further demonstrate the effectiveness of our method.

Results on Watercolor and Clipart. Tables 2 and 3 show the performance of our method on the Watercolor and Clipart datasets, respectively. Here, we use ResNet101 as the backbone of Faster-RCNN in both cases. For the Watercolor scene, our method is 3.6% higher than the SW method. Particularly, for the bike class, our method outperforms SW by around 13%. This shows that our method is effective for the task of DAOD. Fig. 6 shows two examples from Watercolor. We can see that our method locates and recognizes the person and bird classes accurately. This further shows that our disentangled method indeed alleviates the problem of domain-shift and improves the detection performance.

(a) GT
(b) O-Base
(c) P-Base
(d) O-DIR
(e) P-DIR
(f) O-DSR
(g) P-DSR
Figure 7: Visualization of feature maps of the second disentangled layer. Here, ‘O-DIR’ and ‘O-DSR’ indicate that we only use the second disentangled layer to extract the DIR and DSR based on ‘O-Base’ and do not use the first disentangled layer. ‘P-DIR’ and ‘P-DSR’ indicate that we use the progressive method to extract the DIR and DSR based on ‘P-Base’. For each feature map, the channels corresponding to the maximum value are selected for visualization. For ‘O-DIR’ and ‘P-DIR’, the bright regions indicate the presence of object-relevant content. For ‘O-DSR’ and ‘P-DSR’, the bright regions indicate the presence of domain-specific information.
 Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
 Source Only 35.6 52.5 24.3 23.0 20.0 43.9 32.8 10.7 30.6 11.7 13.8 6.0 36.8 45.9 48.7 41.9 16.5 7.3 22.9 32.0 27.8
 BDC-Faster [36] 20.2 46.4 20.4 19.3 18.7 41.3 26.5 6.4 33.2 11.7 26.0 1.7 36.6 41.5 37.7 44.5 10.6 20.4 33.3 15.5 25.6
 DAF [5] 15.0 34.6 12.4 11.9 19.8 21.1 23.2 3.1 22.1 26.3 10.0 10.0 19.6 39.4 34.6 29.3 1.0 17.1 19.7 24.8 19.8
 SW (B) [36] 26.2 48.5 32.6 33.7 38.5 54.3 37.1 18.6 34.8 58.3 17.0 12.5 33.8 65.5 61.6 52.0 9.3 24.9 54.1 49.1 38.1
 Ours 41.5 52.7 34.5 28.1 43.7 58.5 41.8 15.3 40.1 54.4 26.7 28.5 37.7 75.4 63.7 48.7 16.5 30.8 54.5 48.7 42.1
Table 3: Results (%) on adaptation from Pascal VOC to Clipart. Here, we use ResNet101 as the backbone of Faster-RCNN.

As for the Clipart scene, which involves more classes than the other two datasets, our method outperforms SW by 4.0% in terms of the mAP metric. Meanwhile, in Table 3, we can see that our method outperforms the baseline method significantly in multiple categories. For example, for the aeroplane and dog classes, our method is around 15% and 16% higher than the SW method, respectively. These results demonstrate the good performance of our method.

4.3 Ablation Analysis

In this section, we conduct an ablation analysis of our method. Table 4 shows the ablation results. Here, ‘C → F’ and ‘V → W’ indicate the adaptation from Cityscapes to FoggyCityscapes and from Pascal VOC to Watercolor, respectively. For the ‘C → F’ case, we use VGG16 as the backbone, and for the ‘V → W’ case, we use ResNet101 as the backbone. ‘OW’ indicates that we integrate all loss functions existing in our method and use one training stage. ‘1st’, ‘2nd’, and ‘3rd’ indicate that we use the first training stage of Algorithm 1, the first two training stages of Algorithm 1, and all three training stages to optimize our model, respectively. For our progressive method (Two layers), we can see that the three-stage training mechanism is effective. For example, for the ‘C → F’ case, the performance is improved from 33.6% to 36.6%. Meanwhile, from the first training stage to the third stage, the performance is improved continuously. This shows that, for the disentangled learning, the stages of feature separation and feature reconstruction are necessary; using these two stages does enhance the disentanglement and improve the detection performance. Besides, we can also see that the relation-consistency loss (RC) improves the performance of our method significantly. For example, for the ‘V → W’ scene, the performance is improved from 55.2% to 56.9%. This demonstrates that the relation-consistency loss helps strengthen the ability of disentanglement.

To further verify the effectiveness of the progressive method, we make a comparison with the method that only uses the second disentangled layer (One layer). We can see from Table 4 that our progressive method improves the detection performance significantly, e.g., for the ‘C → F’ case, the performance is improved from 34.1% to 36.6%. This shows that using the progressive mechanism is indeed helpful for obtaining a better disentangled representation. Besides, in Fig. 5 and 6, we can see that, compared with the One layer method, employing two disentangled layers does improve the accuracy of location and recognition. Particularly, taking the first image in Fig. 6 as an example, our method accurately locates and classifies the three persons existing in the watercolor image. This further demonstrates the good performance of our method.

Method Training C → F V → W
Two layers OW 34.1% 52.9%
Two layers 1st 33.6% 53.5%
Two layers 2nd 35.3% 55.3%
Two layers 3rd 35.5% 55.2%
Two layers 3rd + RC 36.6% 56.9%
One layer 3rd + RC 34.1% 54.6%
Two layers 3rd + RC 36.6% 56.9%
Table 4: Ablation analysis of the proposed progressive disentanglement. Here, we use mAP as the metric.

4.4 Visualization Analysis

In Fig. 7, taking two watercolor images as examples, we make a visualization analysis of the learned disentangled representations. We can see that both the method of only using the second disentangled layer and the progressive method learn good disentangled representations. Particularly, compared with ‘O-Base’ and ‘P-Base’ used for disentanglement, the learned DIR and DSR contain much stronger object-relevant information and domain-specific information, respectively. These results demonstrate that our method can successfully learn disentangled representations. Besides, compared with ‘O-Base’, ‘P-Base’ contains much less domain-specific information, e.g., the background information in the first image and the colored wall in the second image. This shows that the first disentangled layer indeed enhances the domain-invariant information. Meanwhile, compared with ‘O-DIR’, our progressive method extracts a better DIR. Particularly, for these two images, ‘P-DIR’ is much smoother and contains much less domain-specific information. For example, the leaves and background in the first image and the flowers in the second image are much weaker in ‘P-DIR’, which is helpful for the location and recognition of objects. These results show that our progressive method possesses the disentanglement ability and learns better instance-invariant features, leading to better detection performance. More visualization examples can be seen in Fig. 8.

5 Conclusion

In this paper, we focus on obtaining instance-invariant features for domain adaptive object detection. A progressive disentangled framework is first proposed to decompose domain-invariant and domain-specific features. Then, the instance-invariant features are extracted based on the domain-invariant features, which alleviates the problem of domain-shift. Finally, we propose a three-stage training mechanism to enhance the disentanglement. In experiments, our method achieves a new state-of-the-art performance on three domain-shift scenes.

(a) GT
(b) Detection Results
(c) P-Base
(d) P-DIR
(e) P-DSR
Figure 8: Visualization of feature maps of the second disentangled layer. Here, we use the progressive disentangled method to extract the DIR and DSR. ‘Base’ indicates the feature map used for disentanglement. The examples in the first five rows are from the ‘Pascal VOC → Watercolor’ scene, and the examples in the last two rows are from the ‘Pascal VOC → Clipart’ scene.

References

  • [1] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) MINE: mutual information neural estimation. In ICML.
  • [2] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731.
  • [4] Q. Cai, Y. Pan, C. Ngo, X. Tian, L. Duan, and T. Yao (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11457–11466.
  • [5] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive Faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3339–3348.
  • [6] S. Cicek and S. Soatto (2019) Unsupervised domain adaptation via regularized conditional alignment. arXiv preprint arXiv:1905.10885.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  • [8] K. Do and T. Tran (2019) Theory and evaluation metrics for learning disentangled representations. arXiv preprint arXiv:1908.09961.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • [10] Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
  • [11] R. Girshick (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
  • [12] R. Gong, W. Li, Y. Chen, and L. V. Gool (2019) DLOW: domain flow for adaptation and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2477–2486.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • [14] Z. He and L. Zhang (2019) Multi-adversarial Faster R-CNN for unrestricted object detection. In ICCV.
  • [15] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597.
  • [16] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
  • [17] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5001–5009.
  • [18] B. Jiang, Z. Zhang, D. Lin, and J. Tang (2019) Graph learning-convolutional networks. In ICML.
  • [19] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. In ICCV.
  • [20] S. Kim, J. Choi, T. Kim, and C. Kim (2019) Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. arXiv preprint arXiv:1909.00597.
  • [21] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12456–12465.
  • [22] C. Lee, T. Batra, M. H. Baig, and D. Ulbricht (2019) Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10285–10295.
  • [23] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
  • [24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • [27] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. Frank Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876.
  • [28] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML.
  • [29] Y. Pan, T. Yao, Y. Li, Y. Wang, C. Ngo, and T. Mei (2019) Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2239–2247.
  • [30] X. Peng, Z. Huang, X. Sun, and K. Saenko (2019) Domain agnostic learning with disentangled representations. In ICML.
  • [31] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [33] K. Ridgeway and M. C. Mozer (2018) Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems, pp. 185–194.
  • [34] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci (2019) Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9471–9480.
  • [35] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo (2018) From source to target and back: symmetric bi-directional adaptive GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8099–8108.
  • [36] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6956–6965.
  • [37] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992.
  • [38] T. Scott, K. Ridgeway, and M. C. Mozer (2018) Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, pp. 76–85.
  • [39] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning.
  • [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  • [41] T. Vu, H. Jang, T. X. Pham, and C. D. Yoo (2019) Cascade RPN: delving into high-quality region proposal network with adaptive convolution. arXiv preprint arXiv:1909.06720.
  • [42] T. Wang, X. Zhang, L. Yuan, and J. Feng (2019) Few-shot adaptive Faster R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7173–7182.
  • [43] R. Xie, F. Yu, J. Wang, Y. Wang, and L. Zhang (2019) Multi-level domain adaptive learning for cross-domain detection. arXiv preprint arXiv:1907.11484.
  • [44] Y. Zhang, H. Tang, K. Jia, and M. Tan (2019) Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5031–5040.
  • [45] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin (2019) Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 687–696.