
One-Shot Unsupervised Cross-Domain Detection

by   Antonio D'Innocente, et al.

Despite impressive progress in object detection in recent years, it is still an open challenge to reliably detect objects across visual domains. Although the topic has attracted attention recently, current approaches all rely on the ability to access a sizable amount of target data for use at training time. This is a heavy assumption, as often it is not possible to anticipate the domain where a detector will be used, nor to access it in advance for data acquisition. Consider for instance the task of monitoring image feeds from social media: as every image is created and uploaded by a different user, it belongs to a different target domain that is impossible to foresee during training. This paper addresses this setting, presenting an object detection algorithm able to perform unsupervised adaptation across domains by using only one target sample, seen at test time. We achieve this by introducing a multi-task architecture that one-shot adapts to any incoming sample by iteratively solving a self-supervised task on it. We further enhance this auxiliary adaptation with cross-task pseudo-labeling. A thorough benchmark analysis against the most recent cross-domain detection methods and a detailed ablation study show the advantage of our method, which sets the state-of-the-art in the defined one-shot scenario.





1 Introduction

Social media feed us every day with an unprecedented amount of visual data. Conservative estimates indicate that roughly

unique images are shared every day on Twitter, Facebook and Instagram. Images are uploaded by various actors, from corporations to political parties, institutions, entrepreneurs and private citizens. For the sake of freedom of expression, control over their content is limited, and the vast majority are uploaded without any textual description of their content. Their sheer magnitude makes it imperative to use algorithms to monitor, catalog and in general make sense of them, finding the right balance between protecting the privacy of citizens and their right of expression, and monitoring the spreading of fake news (often associated with malicious intentions) while fighting illegal and hate content. This in most cases boils down to the ability to automatically associate as many tags as possible to images, which in turn means determining which objects are present in a scene.

Object detection has been largely investigated since the infancy of computer vision [viola2001rapid, dalal2005histograms]. With the shift from shallow to deep learning, several successful algorithms have been proposed [girshick2014rich, dai2016r, zhang2018single, liu2018receptive]. They mostly assume that training and test data come from the same visual domain [girshick2014rich, girshick2015fast, ren2015faster]. While this is a reasonable assumption in several applications [liu2016ssd, ren2015faster], some authors have started to investigate the more challenging yet realistic scenario where the detector is trained on data from a visual source domain, and deployed at test time in a different target domain [Long:2015, LongZ0J17, dcoral, Hoffman:Adda:CVPR17]. This setting, usually referred to as cross-domain detection, heavily relies on concepts and results from the domain adaptation literature [Long:2015, ganin2014unsupervised, Goodfellow:GAN:NIPS2014]. In particular, most works in cross-domain detection cast the problem in the unsupervised domain adaptation framework [inoue2018cross, Chen_2018_CVPR]: the detector has access at training time to annotated source data and unsupervised target data, from which it learns how to adapt across the two domains.

This approach is neither suitable nor effective for monitoring social media feeds. Consider for instance the scenario depicted in Fig 1, where there is an incoming stream of images from various social media and the detector is asked to look for instances of the class bicycle. The images come continuously, but they are produced by different users that share them on different social platforms. Hence, even though they might contain the same object, each of them has been acquired by a different person, in a different context, under different viewpoints and illuminations. In other words, each image comes from a different visual domain, different from the visual domain where the detector has been trained. This poses two key challenges to current cross-domain detectors: (1) to adapt to the target data, these algorithms first need to collect feeds, and only after enough target data has been collected can they learn to adapt and start performing on the incoming images; (2) even if the algorithms have learned to adapt on target images collected from the feed up to time $t$, there is no guarantee that the images that will arrive from time $t+1$ will come from the same target domain.

This is the scenario we address. We focus on cross-domain detection when only one target sample is available for adaptation, without any form of supervision. We propose an object detection framework able to adapt from one target image, hence suitable for the social media scenario described above. Specifically, we build a multi-task deep architecture that adapts across domains by leveraging a pretext task. This auxiliary knowledge is further guided by a cross-task pseudo-labeling that injects the locality specific to object detection into self-supervised learning. The result is an architecture able to perform unsupervised adaptive object detection from a single image. We call our method OSHOT - one shot adaptive cross-domain detection. Experiments on three different publicly available benchmarks plus a new concept database of images collected from social media clearly show the power of our method compared to previous state-of-the-art approaches.

1.0.1 Contributions

To summarize, the contributions of our paper are as follows:

  1. We introduce the One-Shot Unsupervised Cross-Domain Detection setting, a cross-domain detection scenario where the target domain changes from sample to sample, hence adaptation can be learned only from one image. This scenario is especially relevant for monitoring social media image feeds. We are not aware of previous works addressing it.

  2. We propose OSHOT, the first cross-domain object detector able to perform one-shot unsupervised adaptation. Our approach leverages over self-supervised one-shot learning guided by a cross-task pseudo-labeling procedure, embedded into a multi-task architecture. A thorough ablation study showcases the importance of each component.

  3. We present a new experimental setup for studying one-shot unsupervised cross-domain adaptation, designed on three existing databases plus a new test set collected from social media feeds. We compare against recent algorithms in cross-domain adaptive detection [Saito_2019_CVPR, diversify&match_Kim_2019_CVPR] and one-shot unsupervised learning [Cohen_2019_ICCV], achieving the state-of-the-art.

Figure 1: Each social media image comes from a different domain. Existing Cross-Domain Detection algorithms (e.g. [diversify&match_Kim_2019_CVPR] in the left gray box) struggle to adapt in this setting. OSHOT (right) is able to adapt across domains from one single target image, thanks to the combined use of self-supervision and pseudo-labeling

2 Related Work

2.0.1 Object Detection

Many successful object detection approaches have been developed during the past several years, starting from the original sliding window methods based on handcrafted features, up to the most recent deep-learning empowered solutions. Modern detectors can be divided into one-stage and two-stage techniques. In the former, classification and bounding box prediction are performed on the convolutional feature map, either solving a regression problem on grid cells [redmon2016you] or exploiting anchor boxes at different scales and aspect ratios [liu2016ssd]. In the latter, an initial stage deals with the region proposal process and is followed by a refinement stage that adjusts the coarse region localization and classifies the box content. Existing variants of this strategy differ mainly in the region proposal algorithm [girshick2014rich, girshick2015fast, ren2015faster]. Regardless of the specific implementation, the detector robustness across visual domains remains a major issue.

2.0.2 Cross-Domain Detection

When training and test data are drawn from two different distributions, a model learned on the first is doomed to fail on the second. Unsupervised domain adaptation methods attempt to close the domain gap between the annotated source on which learning is performed, and the target samples on which the model is deployed. Most of the literature has focused on object classification with solutions based on feature alignment [Long:2015, LongZ0J17, dcoral, hdivergence] or adversarial approaches [Ganin:DANN:JMLR16, Hoffman:Adda:CVPR17]. GAN-based methods make it possible to directly update the visual style of the annotated source data and reduce the domain shift at the pixel level [russo17sbadagan, cycada]. The task of cross-domain object detection has received relatively less attention. Only in the last two years have adaptive detection methods been developed, considering three main components: (i) including multiple and increasingly more accurate feature alignment modules at different internal stages, (ii) adding a preliminary pixel-level adaptation and (iii) pseudo-labeling. The last one is also known as self-training and consists of using the output of the source model detector as coarse annotation on the target. The importance of considering both global and local domain adaptation, together with a consistency regularizer to bridge the two, was first highlighted in [Chen_2018_CVPR]. The Strong-Weak (SW) method of [Saito_2019_CVPR] improves over the previous one by pointing out the need for a better balanced alignment with strong global and weak local adaptation, and is further extended by [Xie_2019_ICCV_Workshops], where the adaptive steps are multiplied at different depths in the network. By generating new source images that look like those of the target, the Domain-Transfer (DT, [inoue2018cross]) method was the first to adopt pixel adaptation for object detection and combine it with pseudo-labeling.
More recently the Div-Match approach [diversify&match_Kim_2019_CVPR] re-elaborated the idea of domain randomization [Tobin2017DomainRF]: multiple CycleGAN [CycleGAN2017] applications with different constraints produce three extra source variants with which the target can be aligned to different extents through an adversarial multi-domain discriminator. A weak self-training procedure (WST) to reduce false negatives is combined with adversarial background score regularization (BSR) in [kim2019selftraining]. Finally, [robust_Khodabandeh_2019_ICCV] followed the pseudo-labeling strategy, including an approach to deal with noisy annotations.

2.0.3 Adaptive Learning on a Budget

There is a wide literature on learning from a limited amount of data, both for classification and detection. However, in case of domain shift, learning on a target budget becomes extremely challenging. Indeed, the standard assumption for adaptive learning is that a large amount of unsupervised target samples is available at training time, so that a model can capture the domain style from them and close the gap with respect to the source. Only a few attempts have been made to reduce the target cardinality. In [fewshotNIPS17] the considered setting is that of few-shot supervised domain adaptation: only a few target samples are available but they are fully labeled. In [oneshotNIPS2018, Cohen_2019_ICCV] the focus is on one-shot unsupervised style transfer with a large source dataset and a single unsupervised target image. These works propose time-costly autoencoder-based methods to generate a version of the target image that maintains its content but visually resembles the source in its global appearance. Thus the goal is image generation with no discriminative purpose. A related setting is that of online domain adaptation, where unsupervised target samples are initially scarce but accumulate in time [Hoffman_CVPR2014, Wulfmeier2017IncrementalAD, mancini2018kitting]. In this case target samples belong to a continuous data stream with smooth domain changes, so the coherence among subsequent samples can be exploited for adaptation.

2.0.4 Self-Supervised Learning

Despite not being manually annotated, unsupervised data is rich in structural information that can be learned by self-supervision: hiding a subpart of the data information and then trying to recover it. This procedure is generally referred to as a pretext task, and possible examples are image completion [pathakCVPR16context], colorization [zhang2016colorful, larsson2017colorization], relative position of patches [doersch2015unsupervised, noroozi2016unsupervised], rotation recognition [gidaris2018unsupervised] and many more. Self-supervised learning has been extensively used as an initialization step for scarcely annotated supervised learning settings, and very recently [asano20a-critical] has shown with a thorough analysis the potential of self-supervised learning from a single image. Several methods have also shown how self-supervision supports adaptation and generalization when combined with supervised learning in a multi-task framework [jigen, Bucci2019TacklingPD, Xu2019SelfsupervisedDA].

2.0.5 Our approach

Our approach to cross-domain detection relates to the described scenario of learning on a budget and exploits self-supervised learning to perform one-shot unsupervised adaptation. Specifically, with OSHOT we show how to recognize objects and their location on a single target image starting from a pre-trained source model, thus without the need to access the source data during testing.

3 Method

3.0.1 Problem Setting

We introduce the one-shot unsupervised cross-domain detection scenario where our goal is to predict on a single target image $x^t$, with $t$ being any target domain not available at training time, starting from $N$ annotated samples of the source domain $S = \{x_i^s, y_i^s\}_{i=1}^{N}$. Here the structured labels $y^s = (c, b)$ describe class identity $c$ and bounding box location $b$ in each image $x^s$, and we aim to obtain $y^t = (c^t, b^t)$ that precisely detects objects in $x^t$ despite the domain shift.

3.0.2 OSHOT strategy

To pursue the described goal, our strategy is to train the parameters of a detection learning model such that it can be ready to get the maximal performance on a single unsupervised sample from a new domain after few gradient update steps on it. Since we have no ground truth on the target sample, we implement this strategy by learning a representation that exploits inherent data information as that captured by a self-supervised task, and then finetune it on the target sample. Thus, we design our OSHOT to include (1) an initial pretraining phase where we extend a standard deep detection model adding an image rotation classifier, and (2) a following adaptation stage where the network features are updated on the single target sample by further optimization of the rotation objective. Moreover, we exploit pseudo-labeling to focus the auxiliary task on the local object context. A clear advantage of this solution is that we decouple source training from target testing, with no need to access the source data while adapting on the target sample.

Figure 2: Visualization of the adaptive phase of OSHOT with cross-task pseudo-labeling. The target image passes through the network and produces detections. While the class information is not used, the identified boxes are exploited to select object regions from the feature maps of the rotated image. The obtained region-specific feature vectors are finally sent to the rotation classifier. A number of subsequent finetuning iterations adapt the convolutional backbone to the domain represented by the test image

3.0.3 Preliminaries

We use Faster R-CNN [ren2015faster] as our base detection model. It is a two-stage detector with three main components: an initial block of convolutional layers, a region proposal network (RPN) and a region-of-interest (ROI) based classifier. The bottom layers transform any input image $x$ into its convolutional feature map $G_f(x|\theta_f)$, where $\theta_f$ is used to parametrize the feature extraction model. The feature map is then used by RPN to generate candidate object proposals. Finally the ROI-wise classifier predicts the category label from the feature vector obtained using ROI-pooling. The training objective combines the loss of both RPN and ROI, each of them composed of two terms:

$$\mathcal{L}_d\big(G_d(G_f(x|\theta_f)|\theta_d), y\big) = \big(\mathcal{L}_{class}(c^*) + \mathcal{L}_{regr}(b)\big)_{RPN} + \big(\mathcal{L}_{class}(c) + \mathcal{L}_{regr}(b)\big)_{ROI}$$

Here $\mathcal{L}_{class}$ is a classification loss to evaluate the object recognition accuracy, while $\mathcal{L}_{regr}$ is a regression loss on the box coordinates for better localization. To maintain a simple notation we summarize the role of ROI and RPN with the function $G_d(G_f(x|\theta_f)|\theta_d)$ parametrized by $\theta_d$. Moreover, we use $c^*$ to highlight that RPN deals with a binary classification task to separate foreground and background objects, while ROI deals with the multi-class objective needed to discriminate among foreground object categories. As mentioned above, ROI and RPN are applied in sequence: they both elaborate on the feature maps produced by the convolutional block, and then influence each other in the final optimization of the multi-task (classification, regression) objective function.

3.0.4 OSHOT pretraining

As a first step, we extend Faster R-CNN to include image rotation recognition. Formally, to each source training image $x^s$ we apply four geometric transformations $R(x, \alpha)$ where $\alpha = q \times 90^\circ$ indicates rotations with $q \in \{1, \ldots, 4\}$. In this way we obtain a new set of samples $\{R(x)_j, q_j\}_{j=1}^{M}$ where we dropped the $\alpha$ without loss of generality. We indicate the auxiliary rotation classifier and its parameters respectively with $G_r$ and $\theta_r$ and we train our network to optimize the following multi-task objective

$$\operatorname*{argmin}_{\theta_f, \theta_d, \theta_r} \sum_{i=1}^{N} \mathcal{L}_d\big(G_d(G_f(x_i^s|\theta_f)|\theta_d), y_i^s\big) + \lambda \sum_{j=1}^{M} \mathcal{L}_r\big(G_r(G_f(R(x^s)_j|\theta_f)|\theta_r), q_j^s\big)$$

where $\mathcal{L}_r$ is the cross-entropy loss. When solving this problem, we can design $G_r$ in two different ways. Indeed it can either be a Fully Connected layer that naïvely takes as input the feature map produced by the whole (rotated) image, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\cdot)$, or it can exploit the ground truth location of each object with a subselection of the features only from its bounding box in the original map, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\mathrm{boxcrop}(\cdot))$. The boxcrop operation includes pooling to rescale the feature dimension before entering the final FC layer. In this last case the network is encouraged to focus only on the object orientation without introducing noisy information from the background, and provides better results with respect to the whole image option, as we discuss in section 4.4. In practical terms, both in the case of image and box rotations, we randomly pick one rotation angle per instance, rather than considering all four of them: this avoids any troublesome imbalance between rotated and non-rotated data when solving the multi-task optimization problem.
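The per-instance rotation sampling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: images are 2-D lists of rows, the four rotations are taken to be q × 90° counter-clockwise with q in {1, 2, 3, 4} (so q = 4 leaves the image unchanged), and the helper names `rotate90` and `sample_rotation` are hypothetical rather than taken from the authors' code.

```python
import random

def rotate90(image, q):
    """Rotate a 2-D image (list of rows) counter-clockwise by q * 90 degrees."""
    for _ in range(q % 4):
        # One counter-clockwise quarter turn: transpose, then reverse the row order,
        # so the last column of the input becomes the first row of the output.
        image = [list(row) for row in zip(*image)][::-1]
    return [list(row) for row in image]

def sample_rotation(image, rng=random):
    """Pick ONE rotation label q in {1, 2, 3, 4} for this instance (assumed convention)."""
    q = rng.randint(1, 4)
    return rotate90(image, q), q

if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6]]
    rotated, label = sample_rotation(img, random.Random(0))
    print(label, rotated)
```

Sampling a single label per instance, instead of emitting all four rotated copies, keeps the number of rotation samples equal to the number of detection samples in each batch.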

3.0.5 OSHOT adaptation

Given the single target image $x^t$, we fine-tune the backbone’s parameters $\theta_f$ by iteratively solving a self-supervised task on it. This adapts the original feature representation both to the content and to the style of the new sample. Specifically, we start from the rotated versions $R(x^t)$ of the provided sample and optimize the rotation classifier through

$$\operatorname*{argmin}_{\theta_f, \theta_r} \mathcal{L}_r\big(G_r(G_f(R(x^t)|\theta_f)|\theta_r), q^t\big)$$

This process involves only $G_f$ and $G_r$, while the RPN and ROI detection components described by $G_d$ remain unchanged. In the following we use $\gamma$ to indicate the number of gradient steps (i.e. iterations), with $\gamma = 0$ corresponding to the OSHOT pretraining phase. At the end of the finetuning process, the inner feature model is described by $\theta_f^*$ and the detection prediction on $x^t$ is obtained by $y^{t*} = G_d(G_f(x^t|\theta_f^*)|\theta_d)$.
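The control flow of the adaptation stage can be sketched as follows. This is a toy illustration, not the authors' implementation: the parameter group names (`theta_f` for the backbone, `theta_r` for the rotation classifier, `theta_d` for the detection heads), the learning rate, and the quadratic stand-in gradient `toy_rotation_grad` are all assumptions made so the example runs without a deep learning framework. What it is meant to show is that each test sample starts from a copy of the pretrained model, that only the backbone and rotation classifier are updated, and that the detection heads stay fixed.

```python
import copy

def adapt_one_shot(pretrained, rotation_grad, target, steps=30, lr=0.1):
    """Return per-sample parameters after `steps` gradient updates on the rotation loss.

    pretrained: dict with "theta_f", "theta_r", "theta_d" (lists of floats).
    rotation_grad: hypothetical callable (theta_f, theta_r, target) -> (grad_f, grad_r).
    """
    params = copy.deepcopy(pretrained)  # every test sample starts from the pretrained model
    for _ in range(steps):
        grad_f, grad_r = rotation_grad(params["theta_f"], params["theta_r"], target)
        params["theta_f"] = [p - lr * g for p, g in zip(params["theta_f"], grad_f)]
        params["theta_r"] = [p - lr * g for p, g in zip(params["theta_r"], grad_r)]
        # theta_d (RPN and ROI heads) is deliberately left untouched
    return params

def toy_rotation_grad(theta_f, theta_r, target):
    """Gradient of the stand-in loss sum((theta - target)^2); not a real rotation loss."""
    return ([2 * (p - target) for p in theta_f],
            [2 * (p - target) for p in theta_r])

if __name__ == "__main__":
    pretrained = {"theta_f": [0.0, 1.0], "theta_r": [2.0], "theta_d": [5.0]}
    adapted = adapt_one_shot(pretrained, toy_rotation_grad, target=3.0)
    print(adapted["theta_f"], adapted["theta_d"])
```

Because the function works on a deep copy, the adaptation performed for one test image never leaks into the next one, which matches the setting where every incoming sample may belong to a different domain.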

3.0.6 Cross-task pseudo-labeling

As in the pretraining phase, also at this stage we have two possible choices to design $G_r$: either considering the whole feature map, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\cdot)$, or focusing on the object locations, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\mathrm{pseudoboxcrop}(\cdot))$. For both variants we include dropout to prevent overfitting on the single target sample. With pseudoboxcrop we mean a localized feature extraction operation analogous to that discussed in the previous paragraph, but obtained through a particular form of cross-task self-training. Specifically we follow the self-training strategy used in [kim2019selftraining, inoue2018cross] with a cross-task variant: instead of reusing the pseudo-labels produced by the source model on the target to update the detector, we exploit them for the self-supervised rotation classifier. In this way we keep the advantage of the self-training initialization, largely reducing the risks of error propagation due to wrong pseudo-labels.

More practically, we start from the $(\theta_f, \theta_d)$ model parameters of the pretraining stage and we get the feature maps from all the rotated versions of the target sample, $G_f(\{R(x^t), q\}|\theta_f)$ with $q = 1, \ldots, 4$. Only the feature map produced by the original image (i.e. $q = 4$) is provided as input to the RPN and ROI network components to get the predicted detection $\tilde{y}^t = (\tilde{c}^t, \tilde{b}^t) = G_d(G_f(x^t|\theta_f)|\theta_d)$. This pseudo-label is composed of the class label $\tilde{c}^t$ and the bounding box location $\tilde{b}^t$. We discard the first and consider only the second to localize the region containing an object in all four feature maps, also recalibrating the position to compensate for the orientation of each map. Once passed through this pseudoboxcrop operation, the obtained features are used to finetune the rotation classifier, updating the bottom convolutional network block.
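The position recalibration mentioned above amounts to rotating the pseudo-box together with the image. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) in image coordinates with the origin at the top-left, rotations are counter-clockwise by q × 90° with q in {1, 2, 3, 4}, and using hypothetical helper names; in a real detector the same mapping would be applied at feature-map resolution after dividing by the backbone stride.

```python
def rotate_box(box, width, height, q):
    """Map box (x1, y1, x2, y2) from the original image to the image rotated
    counter-clockwise by q * 90 degrees. Returns (new_box, new_width, new_height)."""
    x1, y1, x2, y2 = box
    for _ in range(q % 4):
        # A counter-clockwise quarter turn sends point (x, y) to (y, width - x),
        # so the box corners are remapped and re-sorted accordingly.
        x1, y1, x2, y2 = y1, width - x2, y2, width - x1
        width, height = height, width
    return (x1, y1, x2, y2), width, height

def pseudo_box_regions(pseudo_boxes, width, height):
    """For each rotation label q, list the recalibrated pseudo-boxes to crop from."""
    return {q: [rotate_box(b, width, height, q)[0] for b in pseudo_boxes]
            for q in (1, 2, 3, 4)}

if __name__ == "__main__":
    regions = pseudo_box_regions([(1, 2, 3, 5)], width=10, height=8)
    for q, boxes in regions.items():
        print(q, boxes)
```

Only the box coordinates of the pseudo-label are used here; the predicted class is ignored, in line with the description above.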

4 Experiments

4.1 Datasets

Figure 3: The Social Bikes concept-dataset. We can see how a random data acquisition from multiple users/feeds leads to a target distribution with several, uneven domain shifts

4.1.1 Real-World (VOC)

Pascal-VOC [everingham2010pascal] is the standard real-world image dataset for object detection benchmarks. VOC2007 and VOC2012 both contain bounding box annotations of 20 common categories. VOC2007 has 5011 images in the train-val split and 4952 images in the test split, while VOC2012 contains 11540 images in the train-val split.

4.1.2 Artistic Media Datasets (AMD)

Clipart1k, Comic2k and Watercolor2k [inoue2018cross] are three object detection datasets designed for benchmarking Domain Adaptation methods when the source domain is Pascal-VOC. Clipart1k shares its 20 categories with Pascal-VOC, and has 500 images in the training set and 500 images in the test set. Comic2k and Watercolor2k both have the same 6 classes (a subset of the 20 classes of Pascal-VOC), and 1000-1000 images in the training-test splits each.

4.1.3 Cityscapes

Cityscapes [cordts2016cityscapes] is an urban street scene dataset with pixel-level annotations of 8 categories. It has 2975 images in the training split and 500 images in the validation split. Since this dataset doesn’t have bounding box annotations, we use the instance-level pixel annotations to generate bounding boxes of objects, as in [Chen_2018_CVPR].
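Deriving boxes from instance-level pixel annotations reduces to taking the tightest axis-aligned rectangle around each instance. The sketch below assumes a simplified mask format (a 2-D grid of integer instance ids with 0 as background) and inclusive pixel coordinates; it does not reproduce the actual Cityscapes label encoding, and `boxes_from_instance_mask` is a hypothetical helper.

```python
def boxes_from_instance_mask(mask, background=0):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} from a 2-D grid of instance ids."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            if inst not in boxes:
                boxes[inst] = (x, y, x, y)  # first pixel seen for this instance
            else:
                x0, y0, x1, y1 = boxes[inst]
                boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

if __name__ == "__main__":
    mask = [[0, 1, 1, 0],
            [0, 1, 0, 2],
            [0, 0, 0, 2]]
    print(boxes_from_instance_mask(mask))
```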

4.1.4 Foggy Cityscapes

Foggy Cityscapes [sakaridis2018semantic] is an urban street dataset obtained by adding different levels of synthetic fog to Cityscapes images. We only consider images with the highest amount of artificial fog, thus the training-validation splits have 2975-500 images respectively.

4.1.5 Kitti

KITTI [kitti] is a dataset designed for use in mobile robotics and autonomous driving research. Following [Chen_2018_CVPR], we use the full 7481 images for both training (when used as source) and evaluation (when used as target).

4.1.6 Social Bikes

Social Bikes is our new concept-dataset containing 30 images of scenes with persons/bicycles collected from Twitter, Instagram and Facebook by searching for #bike tags. Square crops of the full dataset are shown in figure 3: it is clear how images acquired randomly from social feeds possess diverse style properties and cannot be grouped under a single shared domain.

4.2 Performance analysis

4.2.1 Experimental Setup

We evaluate OSHOT on several testbeds. We start from the VOC → Social Bikes transfer as a proof-of-concept experiment. Moreover, we consider the standard cross-domain benchmarks VOC → Clipart1k, VOC → Comic2k, VOC → Watercolor2k, Cityscapes → FoggyCityscapes, KITTI → Cityscapes and Cityscapes → KITTI with the added constraint of never using the target data during training. Our base detector is Faster-RCNN (PyTorch-based implementation [massa2018mrcnn]) with a ResNet-50 [he2016deep] backbone pre-trained on ImageNet, a region proposal network with 300 top proposals after non-maximum suppression, and anchors at three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1).
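To make the anchor configuration concrete, the nine anchor shapes implied by three scales and three aspect ratios can be enumerated as below. The area-preserving parametrization (each anchor keeps an area of scale², with the ratio interpreted as height over width) is a common convention in Faster R-CNN implementations, but it is an assumption here and is not stated in the text; `anchor_shapes` is a hypothetical helper.

```python
import math

def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) for each scale/ratio pair; ratio = height / width (assumed)."""
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # keeps the anchor area equal to s * s
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes

if __name__ == "__main__":
    for w, h in anchor_shapes():
        print(f"{w:7.1f} x {h:7.1f}")
```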

OSHOT pretraining

We always resize the image’s shorter size to 600 pixels and apply random horizontal flipping. Unless differently specified, we train the base network for 70k iterations using SGD with momentum set at , the initial learning rate is set at

and decayed after 50k iterations. We use a batch size of 1, keep batch normalization layers fixed for both pretraining and adaptation phases and freeze the first 2 blocks of ResNet50. The weight of the auxiliary task is set to


OSHOT adaptation

We increase the weight of the auxiliary task to

to speed up adaptation and keep all other training hyperparameters fixed. For each test instance, we finetune the initial model on the auxiliary task for 30 iterations before testing.

Benchmark methods

We compare OSHOT with the following algorithms. FRCNN: baseline Faster-RCNN with ResNet50 backbone, trained on the source domain and deployed on the target without further adaptation. DivMatch [diversify&match_Kim_2019_CVPR]: cross-domain detection algorithm that, by exploiting target data, creates multiple randomized domains via CycleGAN and aligns their representations using an adversarial loss. SW [Saito_2019_CVPR]: adaptive detection algorithm that aligns source and target features based on global context similarity. For both DivMatch and SW, we use a ResNet-50 backbone pretrained on ImageNet for fair comparison. Since all cross-domain algorithms need target data in advance and are not designed to work in our one-shot unsupervised setting, we provide them with the advantage of 10 target images accessible during training and collect average precision statistics during inference under the favorable assumption that the target domain will not shift after deployment.

One-Shot Target
Method person bicycle mAP
FRCNN 67.7 56.6 62.1
OSHOT ($\gamma = 0$) 72.1 52.8 62.4
OSHOT ($\gamma = 30$) 69.4 59.4 64.4
Full Target
DivMatch [diversify&match_Kim_2019_CVPR] 63.7 51.7 57.7
SW [Saito_2019_CVPR] 63.2 44.3 53.7
Table 1: VOC → Social Bikes results and visualization of DivMatch (left) and OSHOT (right) detections. Numbers associated with bounding boxes indicate the model’s confidence in localization. Examples show how OSHOT detection is accurate, while most DivMatch boxes are false positives
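As a quick sanity check on how the mAP column relates to the per-class columns, the snippet below recomputes the mean of the person and bicycle AP values copied from Table 1. It assumes mAP is the unweighted mean over classes; small differences from the reported values are expected because the per-class numbers are themselves rounded. The row label "OSHOT (best row)" is only a shorthand for the OSHOT row with the highest mAP.

```python
def mean_ap(per_class_ap):
    """Mean average precision: the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Per-class AP (person, bicycle) and reported mAP, copied from Table 1.
table1 = {
    "FRCNN": ((67.7, 56.6), 62.1),
    "OSHOT (best row)": ((69.4, 59.4), 64.4),
    "DivMatch": ((63.7, 51.7), 57.7),
    "SW": ((63.2, 44.3), 53.7),
}

for name, (aps, reported) in table1.items():
    # Rounded inputs can shift the recomputed mean slightly from the reported one.
    print(name, round(mean_ap(aps), 2), reported)
```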

4.2.2 Adapting to social feeds

When data is collected from multiple sources, the assumption that all target images originate from the same underlying distribution does not hold and standard cross-domain detection methods are penalized regardless of the number of seen target samples. We pretrain the source detector on Pascal VOC, and deploy it on Social Bikes. We consider only the bicycle and person annotations for this target, since all other instances of VOC classes are scarce.


We report results in table 1. OSHOT outperforms all proposed counterparts, with a mAP score of 64.4. Despite being granted the full target, adaptive algorithms incur negative transfer due to data scarcity and the large variety of target styles.

4.2.3 Large distribution shifts

Artistic images are difficult benchmarks for cross-domain methods. Unpredictable perturbations in shapes and colors are challenging to detectors trained only on realistic images, for which labeled data is more readily available. Here, we investigate the effectiveness of our one-shot transfer to artistic domains by training the source detector on Pascal VOC and deploying it on Clipart, Comic and Watercolor datasets.


Table 5 summarizes results on the three adaptation splits. We can see how OSHOT with 30 finetuning iterations outperforms all competitors, gaining mAP increases over the FRCNN baseline ranging from 7.5 points on Clipart to 9.2 points on Watercolor. Cross-domain detection methods perform poorly in this setting, despite using 9 more samples in the adaptation phase compared to OSHOT, which only uses the test sample. These results confirm that they are not designed to tackle data scarcity conditions and exhibit negligible improvements compared to the baseline.

One-Shot Target
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
FRCNN 18.5 43.3 20.4 13.3 21.0 47.8 29.0 16.9 28.8 12.5 19.5 17.1 23.8 40.6 34.9 34.7 9.1 18.3 40.2 38.0 26.4
OSHOT ($\gamma = 0$) 23.1 55.3 22.7 21.4 26.8 53.3 28.9 4.6 31.4 9.2 27.8 9.6 30.9 47.0 38.2 35.2 11.1 20.4 36.0 33.6 28.3
OSHOT ($\gamma = 10$) 25.4 61.6 23.8 21.1 31.3 55.1 31.6 5.3 34.0 10.1 28.8 7.3 33.1 59.9 44.2 38.8 15.9 19.1 39.5 33.9 31.0
OSHOT ($\gamma = 30$) 25.4 56.0 24.7 25.3 36.7 58.0 34.4 5.9 34.9 10.3 29.2 11.8 46.9 70.9 52.9 41.5 21.1 21.0 38.5 31.8 33.9
Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR] 19.5 57.2 17.0 23.8 14.4 25.4 29.4 2.7 35.0 8.4 22.9 14.2 30.0 55.6 50.8 30.2 1.9 12.3 37.8 37.2 26.3
SW [Saito_2019_CVPR] 21.5 39.9 21.7 20.5 32.7 34.1 25.1 8.5 33.2 10.9 15.2 3.4 32.2 56.9 46.5 35.4 14.7 15.2 29.2 32.0 26.4
(a) VOC → Clipart
One-Shot Target
Method bike bird car cat dog person mAP
FRCNN 25.2 10.0 21.1 14.1 11.0 27.1 18.1
OSHOT ($\gamma = 0$) 26.9 11.6 22.7 9.1 14.2 28.3 18.8
OSHOT ($\gamma = 10$) 35.5 11.7 25.1 9.1 15.8 34.5 22.0
OSHOT ($\gamma = 30$) 35.2 14.4 30.0 14.8 20.0 46.7 26.9
Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR] 27.1 12.3 26.2 11.5 13.8 34.0 20.8
SW [Saito_2019_CVPR] 21.2 14.8 18.7 12.4 14.9 43.9 21.0
(b) VOC → Comic
One-Shot Target
Method bike bird car cat dog person mAP
FRCNN 62.5 39.7 43.4 31.9 26.7 52.4 42.8
OSHOT ($\gamma = 0$) 70.2 46.7 45.5 31.2 27.2 55.7 46.1
OSHOT ($\gamma = 10$) 70.2 46.7 48.1 30.9 32.3 59.9 48.0
OSHOT ($\gamma = 30$) 77.1 44.7 52.4 37.3 37.0 63.3 52.0
Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR] 64.6 44.1 44.6 34.1 24.9 60.0 45.4
SW [Saito_2019_CVPR] 66.3 41.1 41.1 30.5 20.5 52.3 42.0
(c) VOC → Watercolor
Table 5: VOC → AMD

4.2.4 Adverse weather

Low-level domain shifts occur when weather changes from training to testing. Some peculiar environmental conditions, such as fog, may be disregarded in source data acquisition, yet adaptation to these circumstances is crucial for real-world applications. We assess the performance of OSHOT on Cityscapes → FoggyCityscapes. We train our base detector on Cityscapes for 30k iterations without stepdown, as in [cai2019exploring]. We select the best performing model on the Cityscapes validation split and deploy it to FoggyCityscapes.


Experimental evaluation in table 6 shows that OSHOT outperforms all compared approaches. Without finetuning iterations, performance using the auxiliary rotation task increases compared to the baseline. Subsequent finetuning iterations on the target sample improve these results, and 30 iterations yield models able to outperform the second-best method by 5.0 mAP. Cross-domain algorithms used in this setting struggle to surpass the baseline (DivMatch) or suffer negative transfer (SW).

One-Shot Target
Method person rider car truck bus train mcycle bicycle mAP
FRCNN 30.4 36.3 41.4 18.5 32.8 9.1 20.3 25.9 26.8
OSHOT () 31.8 42.0 42.6 20.1 31.6 10.6 24.8 30.7 29.3
OSHOT () 31.9 41.9 43.0 19.7 38.0 10.4 25.5 30.2 30.1
OSHOT () 32.1 46.1 43.1 20.4 39.8 15.9 27.1 32.4 31.9
Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR] 27.6 38.1 42.9 17.1 27.6 14.3 14.6 32.8 26.9
SW [Saito_2019_CVPR] 25.5 30.8 40.4 21.1 26.1 34.5 6.1 13.4 24.7

Table 6: Cityscapes → FoggyCityscapes

4.2.5 Cross-camera transfer

Dataset bias between training and testing is unavoidable in practical applications. Subtle changes in illumination conditions and camera resolution might prevent a model trained on one realistic domain from performing optimally in another realistic but different domain. We test adaptation between KITTI and Cityscapes in both directions. For cross-domain evaluation we consider only the label car, as is standard practice.


In Table 7, OSHOT improves by up to 7.0 mAP points on KITTI → Cityscapes compared to the FRCNN baseline (from 26.5 to 33.5). DivMatch and SW both show a gain in this split, with SW obtaining the highest mAP of 39.2 in the ten-shot setting. This is not surprising, however: Cityscapes has low inter-domain variance, as shown in the visualization of table 7, so cross-domain methods perform well even with few target samples if the distribution doesn’t change after adaptation. In Cityscapes → KITTI, adaptation performance is similar for all methods, with OSHOT obtaining the highest mAP of 75.4. The Faster-RCNN baseline on KITTI already scores a high starting mAP of 75.1 and, in this favorable condition, detection doesn’t benefit from adaptation.

One-Shot Target
Method KITTI → Cityscapes Cityscapes → KITTI
FRCNN 26.5 75.1
OSHOT 26.2 75.4
OSHOT 33.2 75.3
OSHOT 33.5 75.0
Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR] 37.9 74.1
SW [Saito_2019_CVPR] 39.2 74.6
Table 7: mAP of ’car’ class in KITTI/Cityscapes detection transfers

4.3 Comparison with One-Shot Style Transfer

Although not specifically designed for cross-domain detection, one-shot style transfer methods can in principle be applied as an alternative solution for our setting. We use BiOST [Cohen_2019_ICCV], the current state-of-the-art method for one-shot transfer, to modify the style of the target sample towards that of the source domain before performing inference. Because running BiOST on each test sample is very time-consuming (see footnote 2), we test it on Social Bikes and on a random subset of 100 Clipart images that we name Clipart100. We compare the performance and time requirements of OSHOT and BiOST on these two targets. Speed has been computed on an RTX2080Ti with full precision settings.

Footnote 2: The one-shot translation of [Cohen_2019_ICCV] requires the training of a double-variational autoencoder using the entire source training set plus the target sample. Through personal communication with the author, we fix the length of this training to 5 epochs.

FRCNN BiOST [Cohen_2019_ICCV] OSHOT ()
mAP on Clipart100 27.9 29.8 30.7
mAP on Social Bikes 62.1 51.1 64.4
Adaptation time (seconds per sample) - 7.8

Table 8: Comparison between baseline, one-shot style transfer and OSHOT in the one-shot unsupervised cross-domain detection setting

Table 8 summarizes the mAP results of BiOST and OSHOT. On Clipart100, the baseline Faster-RCNN detector obtains 27.9 mAP. BiOST is effective in adapting from one sample, gaining 1.9 points over the baseline, but it is outperformed by OSHOT, which obtains 30.7 mAP. On Social Bikes, while OSHOT still outperforms the baseline, BiOST incurs negative transfer, indicating that it was not able to effectively modify the source’s style on the images we collected. Furthermore, BiOST is affected by two serious issues: (1) it has an extremely high one-shot translation time, requiring more than 6 hours to modify the style of a single source instance, and (2) it works under the strict assumption that the entire source training set is available at any time to perform the OST step. Due to these weaknesses, and the fact that OSHOT still outperforms BiOST, we argue that existing one-shot translation methods are not suitable for one-shot unsupervised cross-domain adaptation.

4.4 Ablation Study

4.4.1 Detection error analysis

Following [hoiem2012diagnosing], we provide a detection error analysis for the VOC → Clipart setting in Figure 4. We select the 1000 most confident detections and assign each one to a category based on its IoU with the ground truth (IoUgt). Detections fall into three categories: correct (IoUgt ≥ 0.5), mislocalized (0.3 ≤ IoUgt < 0.5) and background (IoUgt < 0.3). Results show that, compared to the baseline FRCNN model, the regularization effect of adding a self-supervised task at training time (with no adaptation iterations) marginally increases the quality of detections, while subsequent finetuning iterations on the test sample substantially increase the number of correct detections and decrease both false positives and mislocalization errors.

Figure 4: Detection error analysis on the most confident detections on Clipart
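The categorization used in this analysis can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (`iou`, `error_type`, `analyze`) are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, class matching is ignored for brevity, and treating the 0.5 and 0.3 boundaries as inclusive on the lower side is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def error_type(det_box, gt_boxes):
    """Assign a detection to one of the three categories by its best IoU."""
    best = max((iou(det_box, g) for g in gt_boxes), default=0.0)
    if best >= 0.5:
        return "correct"
    if best >= 0.3:
        return "mislocalized"
    return "background"


def analyze(detections, gt_boxes, top_k=1000):
    """Count categories over the top_k most confident (score, box) pairs."""
    top = sorted(detections, key=lambda d: d[0], reverse=True)[:top_k]
    counts = {"correct": 0, "mislocalized": 0, "background": 0}
    for _, box in top:
        counts[error_type(box, gt_boxes)] += 1
    return counts
```

A full analysis would additionally match detections to ground-truth boxes of the same class and image; the sketch only shows the thresholding step.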

4.4.2 Cross-task pseudo-labeling ablation

As explained in Section 3, we have two options in the OSHOT adaptation phase: either considering the whole image or focusing on pseudo-labeled bounding boxes obtained from the detector after the first OSHOT pretraining stage. For all our experiments we focused on the second case. By solving the auxiliary task only on objects, we limit the use of background features, which may mislead the network towards solutions of the rotation task that are not based on relevant semantic information (e.g., finding fixed patterns in images or exploiting watermarks). We validate our choice by comparing it against using the rotation task on the entire image in both training and adaptation phases. Table 9 shows results for VOC → AMD and Cityscapes → Foggy Cityscapes using OSHOT. We observe that the choice of rotated regions is critical for the effectiveness of the algorithm. Solving the rotation task on objects using pseudo-annotations results in mAP improvements that range from 2.9 to 5.9 points, indicating that we learn better features for the main task.

Setting Rotating image Rotating objects
VOC → Clipart 31.0 33.9
VOC → Comic 21.0 26.9
VOC → Watercolor 48.2 52.0
Cityscapes → Foggy Cityscapes 27.7 31.9
Table 9: Rotating image vs rotating objects via pseudo-labeling on OSHOT
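The per-setting gain of object rotation over image rotation can be recomputed from the values in Table 9 with a few lines of Python. This is only an arithmetic check; it assumes that the first numeric column corresponds to rotating the whole image and the second to rotating pseudo-labeled objects, following the order in the caption, and the dictionary name `table9` is hypothetical.

```python
# mAP values copied from Table 9 as (rotating image, rotating objects).
table9 = {
    "VOC -> Clipart": (31.0, 33.9),
    "VOC -> Comic": (21.0, 26.9),
    "VOC -> Watercolor": (48.2, 52.0),
    "Cityscapes -> Foggy Cityscapes": (27.7, 31.9),
}

# Gain of object rotation over image rotation, rounded to one decimal.
gains = {name: round(obj - img, 1) for name, (img, obj) in table9.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} mAP")
```

Under that column assumption, the smallest gain is on VOC → Clipart and the largest on VOC → Comic.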

4.4.3 Self-supervised iterations

We study the effect of the number of adaptation iterations on VOC → Clipart, Cityscapes → FoggyCityscapes and KITTI → Cityscapes. Results are shown in Figure 5. We observe a positive correlation between the number of finetuning iterations and the final mAP of the model in the earliest steps. This correlation is strong for the first 10 iterations, during which mAP rises sharply on all observed targets. After about 30 iterations, performance tends to stabilize, indicating that increasing the number of iterations beyond this point doesn’t significantly alter final results.

Figure 5: Performance of OSHOT at different self-supervised iterations

5 Conclusions

This paper introduced for the first time one-shot unsupervised cross-domain detection, a scenario highly relevant to the monitoring of image feeds on social media, where algorithms are called to adapt to a new visual domain from a single image. We showed that existing cross-domain detection methods suffer in this setting, as they are all explicitly designed to adapt from far larger quantities of target data. We presented the first deep architecture able to reduce the domain gap between source and target distributions by leveraging a single target image. Our approach is based on a multi-task structure that exploits self-supervision thanks to cross-task self-labeling. Extensive quantitative experiments and a qualitative analysis clearly demonstrate its effectiveness.



Appendix 0.A Full Ablation Results

0.a.0.1 Detection error analysis

We complete here the detection error analysis that was only partially included in the main paper for space reasons. Specifically, we consider all three domain shift cases of VOC → AMD together with Cityscapes → Foggy Cityscapes, KITTI → Cityscapes and Cityscapes → KITTI. As reported in the main paper, for the first benchmark VOC → Clipart we follow [hoiem2012diagnosing, diversify&match_Kim_2019_CVPR], considering the top 1k most confident detections and identifying three categories: correct (IoUgt ≥ 0.5), mislocalized (0.3 ≤ IoUgt < 0.5) and background (IoUgt < 0.3). For VOC → Comic and VOC → Watercolor we consider the 2k most confident predictions, maintaining the same ratio as in the first case given that the number of target samples is twice that of Clipart. A similar reasoning, which also takes the class cardinality into account, was applied to choose the 6k most confident predictions for Cityscapes → Foggy Cityscapes, 1.5k for KITTI → Cityscapes and 20k for Cityscapes → KITTI.

Figure 6: Detection error analysis on the three cases of VOC → AMD (panels: VOC → Clipart, VOC → Comic, VOC → Watercolor)
Figure 7: Detection error analysis on the three cases of urban scenes (panels: Citys. → Foggy Citys., KITTI → Citys., Citys. → KITTI)

From Figure 6 we can state that for both Clipart and Watercolor the advantage of adding the self-supervised task at training time alone is limited, while the gain becomes evident as the number of adaptive iterations grows. For Comic the improvement in performance appears already in the pretraining phase and further increases with adaptation. Overall, the false positive errors decrease, while the ratio between mislocalization errors and correct localizations either decreases (Clipart, Comic) or remains stable (Watercolor). A similar behaviour can be observed on the urban scenes, both when testing on Foggy Cityscapes and on Cityscapes, as shown in the first two rows of Figure 7. For the last case of testing on KITTI, the results remain almost stable, confirming the trend observed on the overall mAP performance discussed in the main paper. A negligible drop in correct predictions appears when applying the adaptation phase.

Figure 8: OSHOT at different number of iterations for all testbeds

0.a.0.2 Self-supervised iterations

We report results of OSHOT at different numbers of self-supervised iterations in Figure 8. We observe a positive correlation between the number of self-supervised iterations and final mAP on all targets except KITTI, for which final results are minimally affected by our adaptation procedure (as well as by any other adaptive method used as reference - see Table 5 of the main paper). The first 10 iterations show the most significant mAP change, after which performance reaches a stable plateau.

Figure 9: Visualization of the most relevant image regions produced by Grad-CAM when classifying the correct rotation, for a rotation classifier trained on whole images and for one trained on box crops
Figure 10: Qualitative visualization of detections with DivMatch, SW and OSHOT on Comic (first row), Foggy Cityscapes (second row), Watercolor (third row), Social Bikes (fourth and fifth rows) and KITTI (sixth row). Numbers associated with bounding boxes indicate the detector’s confidence

Appendix 0.B Qualitative Analysis

0.b.0.1 Image vs Box rotation

To validate our choice of box rotation over image rotation, we set up a dedicated experiment. We ran the pretraining stage of OSHOT on VOC using either whole-image rotation or box-crop rotation. Then we tested the rotation classifier on whole images from the Clipart domain. In Figure 9 we show the results obtained with Grad-CAM [gradcam_2017_ICCV] for the two cases, with heatmaps indicating the regions most responsible for recognizing the correct orientation. The Grad-CAM maps refer to the last output of the backbone feature extractor. We can see that, when the rotation classifier is trained on whole images, it learns to focus on the background (e.g. the sky and the ground) in order to solve the task. In contrast, when the boxcrop operation is used to train the rotation classifier only on the relevant objects, it learns to look at objects’ features even when it faces an entire image.
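The difference between the two training inputs can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the OSHOT implementation: it operates on a raw 2D grid of pixel values rather than on backbone feature maps with ROI pooling, the helper names (`rot90`, `box_crop`, `rotation_sample`) are hypothetical, and boxes are assumed to be end-exclusive `(x1, y1, x2, y2)` index tuples.

```python
import random


def rot90(grid, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return [list(row) for row in grid]


def box_crop(grid, box):
    """Crop the region (x1, y1, x2, y2), end-exclusive, from a 2D list."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in grid[y1:y2]]


def rotation_sample(grid, box=None, rng=random):
    """Build one auxiliary-task sample: (rotated input, rotation label).

    With box=None the whole image is rotated; otherwise only the crop
    inside the (pseudo-labeled) box is rotated.
    """
    region = box_crop(grid, box) if box is not None else grid
    k = rng.randrange(4)  # one of four rotations: 0, 90, 180, 270 degrees
    return rot90(region, k), k


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
whole_input, whole_label = rotation_sample(image)
object_input, object_label = rotation_sample(image, box=(1, 1, 3, 3))
```

In the whole-image case the classifier can use any region, including background, to guess the label; in the box case only the content of the crop is available.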

0.b.0.2 Detection results of OSHOT: baselines and self-supervised iterations

Figure 10 shows some examples of OSHOT detections on images drawn from all the datasets considered in our work. For reference, we also present the ground truth as well as the predictions produced by DivMatch [diversify&match_Kim_2019_CVPR] and SW [Saito_2019_CVPR], which appear less precise than those of OSHOT.

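Data: backbone feature extractor, detector and rotation classifier, with their respective parameters; rotator; target image
while still iterations do
       minimize self-supervised loss on the rotated target image
end while
predict label of test sample using the adapted backbone and the detector
Algorithm 1 Adaptive phase of OSHOT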

Appendix 0.C OSHOT pseudocode

The pseudocode for the adaptive phase of OSHOT is presented in Algorithm 1. The backbone feature extractor and the detector are each parametrized by their own set of weights. The fully connected layer of the rotation classifier has its own parameters, and the rotator operator applies one random rotation to the target image; the index of the sampled rotation is dropped from the notation for simplicity. The boxcrop operator applies cropping and ROI pooling to the input feature map based on the corresponding relative location of the pseudo-boxes.
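The control flow of the adaptive phase can be sketched in Python as below. This is a toy illustration under several assumptions rather than the authors' implementation: parameters are plain lists of floats, `rot_loss_grad` and `detect` are hypothetical callables standing in for the self-supervised gradient and the detector, plain gradient descent is used, and it is assumed that only the backbone and rotation-head parameters are updated while the pretrained weights are kept untouched for the next test sample.

```python
import copy
import random


def adapt_and_predict(pretrained, image, n_iters, rot_loss_grad, detect,
                      lr=0.1, seed=0):
    """Finetune a copy of the parameters on one target image, then detect."""
    rng = random.Random(seed)
    adapted = copy.deepcopy(pretrained)  # pretrained weights stay untouched
    for _ in range(n_iters):
        k = rng.randrange(4)  # random rotation index (k * 90 degrees)
        grads = rot_loss_grad(adapted, image, k)
        # Gradient step only on the parameter groups returned by the loss.
        for name, g in grads.items():
            adapted[name] = [w - lr * gi for w, gi in zip(adapted[name], g)]
    return detect(adapted, image), adapted


def toy_rot_loss_grad(p, x, k):
    """Toy stand-in: squared error between a linear score and the label k."""
    w, v = p["backbone"][0], p["rot_head"][0]
    err = v * w * x - k
    return {"backbone": [2 * err * v * x], "rot_head": [2 * err * w * x]}


def toy_detect(p, x):
    """Toy stand-in for the detector applied to backbone features."""
    return p["detector"][0] * p["backbone"][0] * x


pretrained = {"backbone": [0.5], "rot_head": [0.5], "detector": [2.0]}
prediction, adapted = adapt_and_predict(
    pretrained, 1.0, 1, toy_rot_loss_grad, toy_detect)
```

Working on a copy of the parameters means that each incoming sample can be adapted independently from the same pretrained starting point, which is one plausible way to realize per-sample adaptation.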