Recently, great efforts have been made on object detection [11, 32, 15, 24, 31]. Though most state-of-the-art methods achieve outstanding detection performance on many benchmarks [9, 25], they generalize poorly when the training and test images come from different domains, a setting known as domain adaptive object detection (DAOD). In DAOD, a domain gap always exists between the source/training and target/test images, e.g., different illuminations or different styles. Although performance could be improved by collecting additional well-labeled images from the target domain, doing so is time-consuming and labor-intensive.
To alleviate the impact of domain-shift, representative DAOD methods [5, 36, 14] employ unsupervised domain adaptation [34, 29, 44] to align the distributions of different domains, e.g., via adversarial training or style translation. Distribution alignment is always conducted on a holistic representation (e.g., feature-level [6, 22] or pixel-level [12, 3, 35]) of source and target images, which may neglect instance-level characteristics of objects, such as object locations or basic shapes. When transferring detection ability from source images to target images, it is the instance-level features, which are largely domain-invariant, that really count, not the illuminations and painting styles, which are domain-specific. Therefore, in order to obtain instance-invariant features and bridge the domain gap in DAOD, we should disentangle the domain-invariant representations (DIR) from the domain-specific representations (DSR).
As a method of feature decomposition, disentangled learning [8, 28] has been shown to be effective in few-shot learning [33, 38] and image translation [23, 16]. The purpose of disentangled learning is to uncover a set of independent factors that give rise to the current observation. Its major advantage is that disentangled representations can contain all the information present in the current observation in a compact and interpretable structure while being independent of the current task [28, 2]. In this paper, we propose to employ disentangled learning to decompose an image representation into a domain-invariant representation (DIR) and a domain-specific representation (DSR) (see Fig. 1), from which we obtain the instance-invariant representation (IIR). Taking the IIR as a bridge, we can strengthen the transferring ability of a detection model trained on source images.
Particularly, in the proposed detection network, we devise a progressive process to decompose the DIR and DSR with two disentangled layers. The goal of the first layer is to enhance the domain-invariant information in a middle-layer feature map. We utilize a domain classifier to ensure that the DSR contains much more domain-specific information, and a mutual information (MI) loss is employed to enlarge the gap between the DIR and DSR. Taking the sum of the feature map and the DIR as the input, the second layer aims at obtaining the instance-invariant representations (IIR) with a region proposal network (RPN) [32, 41]. Moreover, to enhance the disentanglement, we devise a three-stage training mechanism to optimize our model: (i) the stage of feature decomposition, which learns the disentanglement, (ii) the stage of feature separation, which enlarges the gap between the DIR and DSR, and (iii) the stage of feature reconstruction, which ensures that the DIR and DSR together retain all the content of the input. In each stage, different loss functions are used to optimize different components of our network. Experiments on three domain-shift scenes of DAOD demonstrate that our method is effective and achieves a new state-of-the-art performance.
The contributions of this paper are summarized as:
(1) Different from reducing the domain gap with distribution alignment, we propose to enhance the transferring detection ability via a bridge of disentangled instance-invariant representations.
(2) We are the first to propose a progressive disentangled network to extract instance-invariant features. Meanwhile, a three-stage training mechanism is proposed to further enhance the disentangling ability.
2 Related Work
Domain Adaptive Object Detection. Though most methods [11, 31, 15, 26] of object detection have achieved outstanding performance, their transferring abilities are limited in the task of DAOD. Recently, many methods [21, 36, 20] have been proposed to solve the domain-shift problem in object detection. These methods mainly focus on feature-level or pixel-level alignment. For example, one method utilizes adversarial training to align the global feature distributions of the source and target domains, whereas another aligns the distributions of both global and local features. For pixel-level adaptation, a generative network has been devised to increase the diversity of the source domain, which is similar to data augmentation. However, as the alignment is conducted on holistic representations of images, it is not dedicated to the task of adaptive object detection, which should bridge the domains with instance-level characteristics. Therefore, in this paper, we focus on extracting instance-level features that are domain-invariant, which are helpful for improving the transferring ability of a detection method.
Disentangled Learning. The purpose of disentangled learning [18, 28, 2, 30] is to correctly uncover a set of independent factors that give rise to the current observation. Recently, disentangled learning has been well explored in few-shot learning [33, 38] and image translation [23, 16]. Particularly, by decomposing the style of an image, one work proposed a disentangled method for diverse image-to-image translation. Liu et al. proposed a model of cross-domain representation disentanglement. Based on generative adversarial networks, this method alleviated the impact of domain-shift and improved the classification performance on multiple datasets. As for adaptive object detection, on one hand, we should remove the domain-shift; on the other hand, it is important to transfer the detection ability via the bridge of instance characteristics. Thus, it is not straightforward to apply disentangled learning to the task of DAOD.
In this paper, we devise a new network of progressive disentanglement to decompose image representations into domain-specific and domain-invariant representations, and from which we extract the instance-invariant representations to bridge the detection ability between source and target domains. Experiments on three domain-shift scenes of DAOD demonstrate the effectiveness of our method.
3 Instance-Invariant Adaptive Object Detection
Suppose we have access to a source image $x_s$ with labels $y_s$ and bounding boxes $b_s$, drawn from a set of annotated source images $\mathcal{D}_s = \{X_s, Y_s, B_s\}$. Here, $X_s$, $Y_s$, and $B_s$ separately indicate the sets of images, labels, and bounding-box annotations from the source domain. Meanwhile, we can also access a target image $x_t$ drawn from a set of unlabeled target images $\mathcal{D}_t = \{X_t\}$.
3.1 The Network of Progressive Disentanglement
As is shown in Fig. 2, we devise two disentangled layers to extract domain-invariant information progressively.
The First Disentangled Layer. The goal of this layer is to enhance the domain-invariant information in a middle-layer feature map. Concretely, given a source image $x_s$ and a target image $x_t$, we first obtain a feature map $F$ that is the output of a middle-layer feature extractor $G_1$. Then, two different extractors are devised to disentangle the DIR and DSR from $F$. The processes are shown as follows:

$$R_1 = E_R^1(F), \quad S_1 = E_S^1(F). \quad (1)$$

Here, $E_R^1$ and $E_S^1$ separately indicate the DIR extractor and the DSR extractor, and $R_1$ and $S_1$ denote the corresponding DIR and DSR. The size of $R_1$ and $S_1$ is set to the same value as that of $F$. Then, we take the sum of $F$ and $R_1$ as the input of the second feature extractor $G_2$. Since $R_1$ contains more domain-invariant information, the sum operation could alleviate the impact of domain-shift on $G_2$.
The Second Disentangled Layer. The purpose of this layer is to obtain the instance-invariant features. Particularly, based on the output $F_2$ of the extractor $G_2$, we devise two extractors, i.e., $E_R^2$ and $E_S^2$, to disentangle the DIR and DSR from $F_2$. The processes are as follows:

$$R_2 = E_R^2(F_2), \quad S_2 = E_S^2(F_2). \quad (2)$$

Here, the size of $R_2$ and $S_2$ is set to the same value as that of $F_2$. Next, the RPN is performed on $R_2$ to extract a set of instance-invariant proposals. Finally, for an image from the source domain, the detection loss is as follows:

$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg}, \quad (3)$$

where $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ denote the classification and bounding-box regression losses, respectively.
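To make the two-layer decomposition concrete, the forward pass can be sketched in a few lines of NumPy. Random 1×1 projections stand in for the learned extractors; the names ($E_R^1$, $E_S^1$, $G_2$, $E_R^2$, $E_S^2$) follow the text, but all shapes and the projection form are illustrative assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def extractor(c_in, c_out):
    """Stand-in for a learned extractor: a random 1x1 channel projection."""
    w = rng.standard_normal((c_out, c_in)) * 0.1
    return lambda feat: np.einsum("oc,chw->ohw", w, feat)

C, H, W = 8, 4, 4
F = rng.standard_normal((C, H, W))          # middle-layer feature map G1(x)

# First disentangled layer: DIR/DSR keep the shape of F.
E_R1, E_S1 = extractor(C, C), extractor(C, C)
R1, S1 = E_R1(F), E_S1(F)

# The sum F + R1 feeds the second feature extractor G2.
G2 = extractor(C, C)
F2 = G2(F + R1)

# Second disentangled layer; the RPN would then run on R2.
E_R2, E_S2 = extractor(C, C), extractor(C, C)
R2, S2 = E_R2(F2), E_S2(F2)

assert R1.shape == S1.shape == F.shape
assert R2.shape == S2.shape == F2.shape
```

The key design point visible here is that each disentangled pair has the same spatial size as its input, so the sum $F + R_1$ and the downstream RPN need no resizing.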
3.2 Training with the Three-stage Optimization
As discussed in the Introduction, the goal of disentangled learning is to uncover a set of independent factors that give rise to the current observation, and these factors should contain all the information present in the observation. Therefore, we devise a three-stage training mechanism (see Fig. 3) to enhance the disentanglement.
3.2.1 The Stage of Feature Decomposition
The goal of the first stage is to ensure that our model not only learns the locations and classes of objects but also disentangles the image features. Based on $R_2$, we first utilize the RPN to obtain a set of object proposals $P$. To ensure that $R_2$ and $S_2$ have the same object contents in the same locations, based on the proposals $P$, RoI-Alignment is performed on $R_2$ and $S_2$ to obtain $r$ and $s$, respectively. Next, we devise two networks $H_r$ and $H_s$ to perform the classification and bounding-box regression. Finally, for a source image, the detection loss is defined as:

$$\mathcal{L}_{det} = \mathcal{L}_{det}^{r} + \mathcal{L}_{det}^{s}, \quad (4)$$

where $\mathcal{L}_{det}^{r}$ and $\mathcal{L}_{det}^{s}$ indicate the detection losses computed on $r$ and $s$, respectively.
By using the detection loss, $R_2$ and $S_2$ are ensured to contain the instance information. Besides, for our method, it is also important to keep the learned $S_1$ and $S_2$ containing more domain-specific information, which ensures our model owns the ability of feature disentanglement. In this paper, we exploit adversarial domain classification to distinguish the source and target domains. Specifically, we employ four domain classifiers $D_1$, $D_2$, $D_3$, and $D_4$ in our model, which separately take $R_1$, $S_1$, $R_2$, and $S_2$ as the input and output a domain label $d$ that indicates the source or target domain: $d$ is 0 for the source domain and 1 for the target domain.
Besides, during training, we employ Focal Loss ($FL$) [24, 36] for the domain classifiers to impose bigger weights on the hard-to-classify examples (i.e., the examples near the classification boundary) than on the easy ones (i.e., the examples far from the boundary):

$$FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \quad (5)$$

where $p_t$ is the predicted probability of the ground-truth domain and $\gamma$ controls the weight on the hard-to-classify examples. Finally, the loss of the first training stage is denoted as follows:

$$\mathcal{L}_{src}^{(1)} = \mathcal{L}_{det} + \mathcal{L}_{dom}^{s}, \quad \mathcal{L}_{tgt}^{(1)} = \mathcal{L}_{dom}^{t}, \quad (6)$$

where $\mathcal{L}_{src}^{(1)}$ and $\mathcal{L}_{tgt}^{(1)}$ are the objective functions of the source and target domains, and $\mathcal{L}_{dom}^{s}$ and $\mathcal{L}_{dom}^{t}$ indicate the domain losses (i.e., the focal losses of the four domain classifiers). The overall loss is the sum of $\mathcal{L}_{src}^{(1)}$ and $\mathcal{L}_{tgt}^{(1)}$.
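As a sanity check on the weighting behavior of the focal loss, the sketch below implements it in NumPy for binary domain labels; the defaults $\alpha = 1.0$ and $\gamma = 2.0$ follow the values given later in the implementation details:

```python
import numpy as np

def focal_loss(p, y, alpha=1.0, gamma=2.0):
    """Focal loss for binary domain labels y in {0, 1}.
    p is the predicted probability of label 1 (target domain);
    (1 - p_t)^gamma shrinks the loss of confidently classified examples."""
    p_t = np.where(y == 1, p, 1.0 - p)       # probability of the true label
    p_t = np.clip(p_t, 1e-12, 1.0)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# Two target-domain examples: one easy (p=0.95), one hard (p=0.55).
losses = focal_loss(np.array([0.95, 0.55]), np.array([1, 1]))
assert losses[1] > losses[0]                  # hard example dominates

# With gamma = 0, the focal loss reduces to plain cross-entropy.
ce = focal_loss(np.array([0.95, 0.55]), np.array([1, 1]), gamma=0.0)
assert np.allclose(ce, -np.log([0.95, 0.55]))
```

The easy example's loss is suppressed by the factor $(1 - 0.95)^2 = 0.0025$, which is exactly the effect exploited here: the domain classifiers concentrate on examples near the decision boundary.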
With the help of the detection loss $\mathcal{L}_{det}$ and the domain losses, the disentangled DIR and DSR contain instance information and domain-specific information, respectively. Next, we perform the second training stage to keep the disentangled DIR and DSR independent.
3.2.2 The Stage of Feature Separation
In this stage, we first fix the feature extractors $G_1$ and $G_2$ of the model trained in the first stage. Then, we employ the model to extract $F$, $R_1$, and $S_1$ (Eq. (1)), as well as $F_2$, $R_2$, and $S_2$ (Eq. (2)). The RPN is performed on $R_2$ to obtain the proposals $P$.
Mutual Information Minimization. In order to enlarge the gap between the DIR and DSR, we minimize the MI loss between $R_1$ and $S_1$, as well as between $r$ and $s$, where $r$ and $s$ indicate the RoI-Alignment results of $R_2$ and $S_2$ based on $P$. The MI is defined as:

$$I(X; Z) = \int_{\mathcal{X}} \int_{\mathcal{Z}} p(x, z) \log \frac{p(x, z)}{p(x)\, p(z)} \, dz \, dx, \quad (7)$$

where $p(x, z)$ indicates the joint probability distribution of $(R_1, S_1)$ or $(r, s)$, and $p(x)$ and $p(z)$ are the marginal distributions. Obviously, by minimizing the MI loss, we impose independence constraints on the tuples $(R_1, S_1)$ and $(r, s)$. Besides, since $S_1$ and $S_2$ contain more domain-specific information, the MI loss could promote $R_1$ and $R_2$ to contain more domain-invariant information, which helps strengthen the ability of disentanglement. In this paper, we adopt the Mutual Information Neural Estimator (MINE) to compute the MI loss. Concretely, based on Monte-Carlo integration, MINE can be computed as follows:

$$I(X; Z) \approx \frac{1}{n} \sum_{i=1}^{n} T_{\theta}(x_i, z_i) - \log \Big( \frac{1}{n} \sum_{i=1}^{n} e^{T_{\theta}(x_i, \bar{z}_i)} \Big), \quad (8)$$

where $(x_i, z_i)$ is sampled from the joint distribution, $(x_i, \bar{z}_i)$ is sampled from the product of the marginal distributions, and $T_{\theta}$ is a neural network devised to perform the Monte-Carlo integration.
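The MINE lower bound can be illustrated numerically. The sketch below uses a fixed, hand-picked statistics network $T$ (in MINE proper, $T_\theta$ is trained to maximize the bound) and checks that a correlated pair yields a positive estimate; the choice $T = \tanh(xz)$ and all sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# A correlated pair (x, z): z = x + noise, so the true MI is positive.
x = rng.standard_normal(n)
z = x + 0.5 * rng.standard_normal(n)
z_marginal = rng.permutation(z)   # shuffling z breaks the pairing with x

def T(a, b):
    """Hand-picked, bounded statistics network; in MINE it is learned."""
    return np.tanh(a * b)

def mine_lower_bound(x, z, z_marg):
    # E_joint[T] - log E_marginal[exp(T)], estimated by Monte Carlo.
    return T(x, z).mean() - np.log(np.exp(T(x, z_marg)).mean())

mi_est = mine_lower_bound(x, z, z_marginal)
assert mi_est > 0.05              # positive lower bound for a correlated pair
```

Shuffling one variable to fake the marginal samples is the same trick used in practice; minimizing this estimate with respect to the feature extractors (while $T_\theta$ maximizes it) pushes the DIR/DSR pair toward independence.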
It is worth noting that, for the second disentangled layer, we use the RoI-Alignment results $r$ and $s$, instead of the feature maps $R_2$ and $S_2$, to compute the MI loss, which not only reduces the computational cost but also ensures our model pays more attention to the regions of objects.
Relation-consistency Loss. To further improve the disentanglement, we devise a relation-consistency loss (Fig. 4). Specifically, since $R_2$ and $S_2$ have the same object contents in the same locations, based on the proposals $P$, $r$ and $s$ should keep similar semantic relations.
Concretely, we first obtain the average-pooling results $\bar{r} \in \mathbb{R}^{N \times C}$ and $\bar{s} \in \mathbb{R}^{N \times C}$ of $r$ and $s$, where $N$ and $C$ indicate the numbers of proposals and channels, respectively. Then we separately construct two graphs $\mathcal{G}_r = (\mathcal{V}_r, \mathcal{E}_r)$ and $\mathcal{G}_s = (\mathcal{V}_s, \mathcal{E}_s)$. Here, we take $\bar{r}$ and $\bar{s}$ as the nodes $\mathcal{V}_r$ and $\mathcal{V}_s$, respectively, and $\mathcal{E}_r$ and $\mathcal{E}_s$ indicate the edges (relations) between proposals. Next, we define the adjacency matrices of the two undirected graphs as $A_r = \mathrm{softmax}(\bar{r} \bar{r}^{\top})$ and $A_s = \mathrm{softmax}(\bar{s} \bar{s}^{\top})$, where the softmax operation is made across the row direction. The relation-consistency loss is computed as:

$$\mathcal{L}_{rc} = \| A_r - A_s \|_F^2. \quad (9)$$
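Assuming the adjacency is a row-softmax over pairwise inner products of the pooled proposal features (the particular similarity function is an assumption here, not critical to the idea), the parameter-free loss can be sketched as:

```python
import numpy as np

def row_softmax(m):
    """Softmax across the row direction, numerically stabilized."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C = 5, 16                                       # proposals x channels
r_bar = rng.standard_normal((N, C))                # pooled DIR proposals
s_bar = r_bar + 0.1 * rng.standard_normal((N, C))  # DSR with similar relations

# Adjacency of each graph: row-softmax over pairwise inner products.
A_r = row_softmax(r_bar @ r_bar.T)
A_s = row_softmax(s_bar @ s_bar.T)

# Parameter-free relation-consistency loss (squared Frobenius norm).
loss_rc = np.sum((A_r - A_s) ** 2)

assert A_r.shape == (N, N)
assert np.allclose(A_r.sum(axis=1), 1.0)   # each row is a distribution
assert loss_rc >= 0.0
```

Since the loss only compares two $N \times N$ similarity matrices, it introduces no parameters of its own, matching the note below.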
Note that the computation of the relation-consistency loss does not need any parameters. Finally, the loss of the second training stage is denoted as follows:

$$\mathcal{L}_{src}^{(2)} = \mathcal{L}_{det} + \mathcal{L}_{mi}^{1} + \mathcal{L}_{mi}^{2} + \mathcal{L}_{rc}, \quad \mathcal{L}_{tgt}^{(2)} = \mathcal{L}_{mi}^{1} + \mathcal{L}_{mi}^{2} + \mathcal{L}_{rc}, \quad (10)$$

where $\mathcal{L}_{det}$ is the detection loss based on $r$, $\mathcal{L}_{src}^{(2)}$ and $\mathcal{L}_{tgt}^{(2)}$ are the training objectives of the source and target domains, respectively, and $\mathcal{L}_{mi}^{1}$ and $\mathcal{L}_{mi}^{2}$ indicate the MI losses computed on the first and second disentangled layers, respectively. The overall loss is the sum of $\mathcal{L}_{src}^{(2)}$ and $\mathcal{L}_{tgt}^{(2)}$. After this stage, the gap between DIR and DSR could be enlarged. Next, we perform the third training stage, which aims at ensuring that the disentangled DIR and DSR retain all the content of the input used for disentanglement.
3.2.3 The Stage of Feature Reconstruction
We employ a reconstruction loss to attain the purpose of this training stage. Concretely, we first use the model trained in the second stage to extract $F_2$, $R_2$, and $S_2$ (Eq. (2)). Then, the RPN is performed on $R_2$ to extract the proposals $P$. The reconstruction loss is computed as follows:

$$\mathcal{L}_{rec} = \| Dec([r; s]) - f \|_{2}^{2}, \quad (11)$$

where $r$, $s$, and $f$ are the RoI-Alignment results of $R_2$, $S_2$, and $F_2$ based on the proposals $P$, $Dec$ is the reconstruction network, and $[r; s]$ indicates the concatenation of $r$ and $s$. Here, in order to make the model pay more attention to instance content, the reconstruction loss is only computed on the regions of the proposals. Besides, since the output of the first disentangled layer already includes the entire feature map $F$ (via the sum $F + R_1$), to reduce the computational costs, we do not calculate the reconstruction loss on the first layer.
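A shape-level sketch of the reconstruction term, with a random 1×1 projection standing in for the one-layer decoder $Dec$ (all names and shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_props, C, k = 4, 8, 7           # proposals, channels, RoI-Align output size

# RoI-Align results on R2, S2, and F2 (shape: proposals x channels x k x k).
r = rng.standard_normal((n_props, C, k, k))
s = rng.standard_normal((n_props, C, k, k))
f = rng.standard_normal((n_props, C, k, k))

# Stand-in reconstruction network: a 1x1 projection from 2C back to C.
W = rng.standard_normal((C, 2 * C)) * 0.1
def dec(x):
    return np.einsum("oc,nchw->nohw", W, x)

# Concatenate DIR and DSR along channels, decode, compare with f.
recon = dec(np.concatenate([r, s], axis=1))
loss_rec = np.mean((recon - f) ** 2)

assert recon.shape == f.shape
assert loss_rec > 0.0
```

Concatenating $r$ and $s$ doubles the channel count, so the decoder must map $2C$ channels back to $C$ before the comparison with $f$; restricting the loss to RoI-aligned crops is what keeps the computation proposal-focused.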
[Table 1, Cityscapes→FoggyCityscapes (per-class AP and mAP, %): RLDA (IncepV2) — 35.10, 42.15, 49.17, 30.07, 45.25, 26.97, 26.85, 36.03, mAP 36.45; SW (B) (VGG16) — 29.9, 42.3, 43.5, 24.5, 36.2, 32.6, 30.0, 35.3, mAP 34.3.]
In this paper, our model is trained in an end-to-end way. The detailed training procedure is presented in Algorithm 1. During each training stage, the parameters that do not appear in the current stage are kept fixed.
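The staged optimization with per-stage parameter freezing can be mimicked with a toy update loop; the component names and per-stage update sets below are illustrative stand-ins for Algorithm 1, not its actual contents:

```python
# Toy parameters standing in for network components (names are illustrative).
params = {"G": 1.0, "E_dir": 1.0, "E_dsr": 1.0, "Dec": 1.0}

# Which parameters each training stage is allowed to update; everything
# else is held fixed, mimicking the staged optimization described above.
stage_updates = [
    {"G", "E_dir", "E_dsr"},   # stage 1: feature decomposition
    {"E_dir", "E_dsr"},        # stage 2: feature separation (G fixed)
    {"Dec"},                   # stage 3: feature reconstruction
]

def grad(name):
    """Dummy gradient: pull every parameter toward zero."""
    return params[name]

lr = 0.5
for updatable in stage_updates:
    for name in params:
        if name in updatable:            # frozen parameters are skipped
            params[name] -= lr * grad(name)

# G was updated once, E_dir/E_dsr twice, Dec once.
assert params["G"] == 0.5
assert params["E_dir"] == params["E_dsr"] == 0.25
assert params["Dec"] == 0.5
```

The point of the sketch is the masking: a parameter moves only in the stages that name it, which is exactly how the three stages optimize different components with different losses.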
4 Experiments
4.1 Dataset and Implementation Details
Dataset. For Cityscapes→FoggyCityscapes, we use Cityscapes as the source domain. FoggyCityscapes, which is rendered from Cityscapes and simulates a change of weather conditions, is used as the target domain. Both contain 2,975 images in the training set and 500 images in the validation set, and this adaptation scene involves 8 categories. We utilize the training sets during training and evaluate on the validation set.
For Pascal→Watercolor and Pascal→Clipart, the Pascal VOC dataset is used as the real-image source domain. The images of this dataset include rich bounding-box annotations, and the number of object classes is 20. Following a prevalent setting [21, 36], we use the Pascal VOC 2007 and 2012 training and validation sets for training, which results in about 15K images. The Watercolor and Clipart datasets are taken as the target domains. Watercolor contains 6 categories in common with VOC and 2k images in total. Clipart contains 1k images in total and has the same 20 categories as VOC. For these two target datasets, the splits of the training and test sets follow the prior work.
Implementation Details. Our method is based on Faster-RCNN with RoI-Alignment. For the Focal Loss (Eq. (5)), $\alpha$ and $\gamma$ are set to 1.0 and 2.0. Besides, we employ a network of three convolutional layers for each of the disentangled extractors $E_R^1$, $E_S^1$, $E_R^2$, and $E_S^2$. For each of the domain classifiers $D_1$, $D_2$, $D_3$, and $D_4$, we employ a network of three fully-connected layers. Meanwhile, for each of the MI estimators of the two disentangled layers, we utilize a network of three fully-connected layers. Finally, one convolutional layer is used as the reconstruction network $Dec$. During training, we employ the SGD optimizer with momentum. We first train the model with a learning rate of 0.001 for 50K iterations, and then with a learning rate of 0.0001 for 30K more iterations. At test time, we use mean average precision (mAP) as the evaluation metric.
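Since mAP scores detections by matching predicted boxes to ground truth at an IoU threshold, a minimal IoU helper (corner-format boxes assumed) illustrates the matching criterion underlying the metric:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # empty overlap -> 0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0       # identical boxes
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0       # disjoint boxes
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-12
```

In the Pascal VOC protocol used by these benchmarks, a detection counts as a true positive when its IoU with an unmatched ground-truth box of the same class is at least 0.5; AP is then the area under the per-class precision-recall curve, and mAP averages it over classes.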
[Table 2, Pascal VOC→Watercolor (per-class AP and mAP, %): SW (B) — 82.3, 55.9, 46.5, 32.7, 35.5, 66.7, mAP 53.3.]
4.2 Experimental Results
Results on FoggyCityscapes. Table 1 shows the performance of our method on the FoggyCityscapes dataset. Here, we use VGG16 and ResNet101 as the backbones of Faster-RCNN, respectively. Our method outperforms all the methods in Table 1. Particularly, with the VGG16 backbone, our mAP is around 2.3% higher than that of the SW baseline. Our method also outperforms RLDA, which uses the stronger InceptionV2 backbone. All of these results show that our method is effective. Moreover, employing the ResNet101 backbone improves the performance of our method significantly, which shows our method is more effective with a better backbone. Fig. 5 shows two detection examples. Compared with the raw images, the foggy scenes are much more challenging for object detection. Meanwhile, compared with the SW method, our method locates and recognizes the objects in the two images accurately. Particularly, regardless of distance, our method locates and discriminates the truck accurately. These results further demonstrate the effectiveness of our method.
Results on Watercolor and Clipart. Tables 2 and 3 separately show the performance of our method on the Watercolor and Clipart datasets. Here, we use ResNet101 as the backbone of Faster-RCNN in both cases. For the Watercolor scene, our method is 3.6% higher than the SW method. Particularly, for the bike class, our method outperforms SW by around 13%. This shows our method is effective for the task of DAOD. Fig. 6 shows two examples from Watercolor. We can see that our method locates and recognizes the person and bird classes accurately. This further shows that our disentangled method indeed alleviates the problem of domain-shift and improves the detection performance.
[Table 3, Pascal VOC→Clipart (per-class AP and mAP, %): SW (B) — 26.2, 48.5, 32.6, 33.7, 38.5, 54.3, 37.1, 18.6, 34.8, 58.3, 17.0, 12.5, 33.8, 65.5, 61.6, 52.0, 9.3, 24.9, 54.1, 49.1, mAP 38.1.]
As for the Clipart scene, which involves more classes than the other two datasets, our method outperforms SW by 4.0% in terms of the mAP metric. Meanwhile, Table 3 shows that our method outperforms the baseline significantly in multiple categories. For example, for the aeroplane and dog classes, our method is around 15% and 16% higher than the SW method, respectively. These results all demonstrate the good performance of our method.
4.3 Ablation Analysis
In this section, we provide an ablation analysis of our method. Table 4 shows the ablation results. Here, ‘C→F’ and ‘V→W’ separately indicate the adaptation from Cityscapes to FoggyCityscapes and from Pascal VOC to Watercolor. For the ‘C→F’ case, we use VGG16 as the backbone; for the ‘V→W’ case, we use ResNet101. ‘OW’ indicates that we integrate all loss functions of our method and use one training stage. ‘1st’, ‘2nd’, and ‘3rd’ indicate that we use the first training stage of Algorithm 1, the first two training stages of Algorithm 1, and all three training stages to optimize our model, respectively. For our progressive method (Two layers), we can see that the three-stage training mechanism is effective. For example, for the ‘C→F’ case, the performance is improved from 33.6% to 36.6%. Meanwhile, from the first training stage to the third, the performance improves continuously. This shows that the stages of feature separation and feature reconstruction are necessary for disentangled learning: using these two stages does enhance the disentanglement and improve the detection performance. Besides, the relation-consistency loss (RC) improves the performance of our method significantly. For example, for the ‘V→W’ scene, the performance is improved from 55.2% to 56.9%. This demonstrates that the relation-consistency loss helps strengthen the ability of disentanglement.
To further verify the effectiveness of the progressive method, we compare it with a variant that only uses the second disentangled layer (One layer). Table 4 shows that our progressive method improves the detection performance significantly, e.g., for the ‘C→F’ case, from 34.1% to 36.6%. This shows that the progressive mechanism is indeed helpful for obtaining a better disentangled representation. Besides, in Figs. 5 and 6, compared with the One layer method, employing two disentangled layers does improve the accuracy of localization and recognition. Particularly, taking the first image in Fig. 6 as an example, our method accurately locates and classifies the three persons in the watercolor image. This further demonstrates the good performance of our method.
[Table 4 (ablation) columns: Method, OW, 1st, 2nd, 3rd, RC, C→F, V→W.]
4.4 Visualization Analysis
In Fig. 7, taking two watercolor images as examples, we visualize the learned disentangled representations. Both the method that only uses the second disentangled layer and the progressive method learn good disentangled representations. Particularly, compared with the ‘O-Base’ and ‘P-Base’ features used for disentanglement, the learned DIR and DSR separately contain much stronger object-relevant information and domain-specific information. These results demonstrate that our method can successfully learn disentangled representations. Besides, compared with ‘O-Base’, ‘P-Base’ contains much less domain-specific information, e.g., the background information in the first image and the colored wall in the second image. This shows the first disentangled layer indeed enhances the domain-invariant information. Meanwhile, compared with ‘O-DIR’, our progressive method extracts a better DIR. Particularly, for these two images, ‘P-DIR’ is much smoother and contains much less domain-specific information: the leaf and background information in the first image and the flowers in the second image are much weaker in ‘P-DIR’, which is helpful for the localization and recognition of objects. These results all show that our progressive method owns the disentanglement ability and learns better instance-invariant features, which lead to better detection performance. More visualization examples are shown in Fig. 8.
5 Conclusion
In this paper, we focus on obtaining instance-invariant features for solving domain adaptive object detection. A progressive disentangled framework is first proposed to decompose domain-invariant and domain-specific features. Then, the instance-invariant features are extracted based on the domain-invariant features, which alleviates the problem of domain-shift. Finally, we propose a three-stage training mechanism to enhance the disentanglement. In the experiments, our method achieves a new state-of-the-art performance on three domain-shift scenes.
References
- (2018) MINE: mutual information neural estimation. ICML. Cited by: §3.2.2.
-  (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1, §2.
-  (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In , pp. 3722–3731. Cited by: §1.
-  (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11457–11466. Cited by: Table 1.
-  (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348. Cited by: §1, §2, Table 1, Table 2, Table 3.
-  (2019) Unsupervised domain adaptation via regularized conditional alignment. arXiv preprint arXiv:1905.10885. Cited by: §1.
- (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1, §4.
-  (2019) Theory and evaluation metrics for learning disentangled representations. arXiv preprint arXiv:1908.09961. Cited by: §1, §3.2.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1, §1, §4.
- (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495. Cited by: §1, §2, §3.2.1.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1, §2.
-  (2019) DLOW: domain flow for adaptation and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2477–2486. Cited by: §1.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.1, §4.1.
-  (2019) Multi-adversarial faster-rcnn for unrestricted object detection. ICCV. Cited by: §1, Table 1.
-  (2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597. Cited by: §1, §2.
-  (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §1, §2.
-  (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5001–5009. Cited by: §1, Table 1, §4.
-  (2019) Graph learning-convolutional networks. ICML. Cited by: §2.
-  (2019) A robust learning approach to domain adaptive object detection. ICCV. Cited by: Table 1, §4.2.
-  (2019) Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. ArXiv abs/1909.00597. Cited by: §2.
-  (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §1, §2, Table 1, §4.1.
-  (2019) Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10285–10295. Cited by: §1.
-  (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §1, §2.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §3.2.1.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
-  (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.
-  (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876. Cited by: §2.
- (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. ICML. Cited by: §1, §2, §3.2.
-  (2019) Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2239–2247. Cited by: §1.
-  (2019) Domain agnostic learning with disentangled representations. ICML. Cited by: §2, §3.2.2.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1, §2.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1, §3.1, §4.1.
-  (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pp. 185–194. Cited by: §1, §2.
-  (2019) Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9471–9480. Cited by: §1.
-  (2018) From source to target and back: symmetric bi-directional adaptive gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8099–8108. Cited by: §1.
-  (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: Instance-Invariant Adaptive Object Detection via Progressive Disentanglement, §1, §1, §2, §3.2.1, Table 1, §4.1, §4.2, Table 2, Table 3.
-  (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: §1, §4.
- (2018) Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, pp. 76–85. Cited by: §1, §2.
- (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. Cited by: §4.1.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: Table 1, §4.2.
-  (2019) Cascade rpn: delving into high-quality region proposal network with adaptive convolution. arXiv preprint arXiv:1909.06720. Cited by: §1.
-  (2019) Few-shot adaptive faster r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7173–7182. Cited by: Table 1.
-  (2019) Multi-level domain adaptive learning for cross-domain detection. arXiv preprint arXiv:1907.11484. Cited by: Table 1.
-  (2019) Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5031–5040. Cited by: §1.
-  (2019) Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 687–696. Cited by: Table 1.