Despite impressive progress in object detection over the last years, reliably localizing and recognizing objects across visual domains is still an open challenge. Indeed, most existing detection models rely on deep representative features learned from large amounts of labeled training data which, besides being costly to annotate, are usually drawn from a specific distribution (source). Thus, when the learned models are applied to images sampled from a different (target) domain, they suffer from severe performance degradation. This hinders the deployment of detection models in real-world conditions, where it is often not possible to anticipate the application domain or access it in advance for data acquisition. Consider for example the social media feed scenario depicted in Figure 3, where there is an incoming stream of images from various social media and the detector is asked to look for instances of the class bicycle. The images arrive continuously, but they are produced by different users who share them on different social platforms. Hence, even though they might contain the same object, each of them has been acquired by a different person, in a different context, under different viewpoints and illuminations. In other words, each image comes from a different visual domain, distinct from the visual domain where the detector has been trained. This poses two key challenges to current cross-domain detectors: (1) to adapt to the target data, these algorithms first need to gather feeds, and only after enough target data has been collected can they learn to adapt and start performing on the incoming images; (2) even if the algorithms have learned to adapt on target images from the feed up to time $t$, there is no guarantee that the images that will arrive from time $t+1$ will come from the same target domain.
This is the scenario we address. We focus on cross-domain detection when only one target sample is available for adaptation, without any form of supervision. We propose an object detection method able to adapt from one target image, hence suitable for the social media scenario described above. Specifically, we build a multi-task deep architecture that adapts across domains by leveraging a self-supervised pretext task. After an initial pretraining phase in which it is trained together with the main supervised objective on the source data, the self-supervised module proceeds on the single target sample, finetuning the network and customizing the features for the final detection prediction. The auxiliary knowledge is further guided by a cross-task pseudo-labeling that injects the locality specific to object detection into self-supervised learning. Moreover, we show how self-supervision can be even more effective when used as the inner base objective of a meta-learning algorithm whose outer goal is training a domain-robust detection model. By reformulating the pretraining process as a bilevel optimization, we simulate several single-sample cross-domain learning episodes and better align with the final deployment condition, with a further advantage in learning speed and accuracy.
To summarize, this paper extends our previous work [oshot_eccv20] and presents the following contributions. (1) We introduce the One-Shot Unsupervised Cross-Domain Detection setting, a cross-domain detection scenario where the target domain changes from sample to sample, hence adaptation can be learned only from one image. This scenario is especially relevant for monitoring social media image feeds. (2) We propose OSHOT, the first cross-domain object detector able to perform One-SHOT unsupervised adaptation. Our approach leverages self-supervised one-shot learning guided by a cross-task pseudo-labeling procedure, embedded into a multi-task architecture. (3) We present a novel meta-learning formulation to combine the main supervised detection task with the self-supervised auxiliary objective and effectively push the model to produce good results after a few adaptation iterations. We refer to it as FULL-OSHOT: we discuss its effectiveness with thorough ablation experiments that assess the role of all its inner components and provide an extensive error analysis. (4) We present a tailored experimental setup for studying one-shot unsupervised cross-domain detection, designed on three existing databases plus a new test set collected from social media feeds. We compare against recent adaptive detection algorithms [Saito_2019_CVPR, diversify&match_Kim_2019_CVPR, xuCVPR2020] and one-shot style-transfer based unsupervised learning [Cohen_2019_ICCV], achieving state-of-the-art results. (5) We further evaluate our multi-task and meta-learning based methods on cross-domain multi-target object classification, confirming the broad applicability and effectiveness of the proposed approach.
2 Related Work
Many successful object detection approaches have been developed during the past years, starting from the original sliding window methods based on handcrafted features, up to the most recent deep-learning empowered solutions. Modern detectors can be divided into one-stage and two-stage techniques. In the former, classification and bounding box prediction are performed on the convolutional feature map, either solving a regression problem on grid cells [redmon2016you] or exploiting anchor boxes at different scales and aspect ratios [liu2016ssd]. In the latter, an initial stage deals with the region proposal process and is followed by a refinement stage that adjusts the coarse region localization and classifies the box content. Existing variants of this strategy differ mainly in the region proposal algorithm [girshick2014rich, girshick2015fast, ren2015faster]. Regardless of the specific implementation, the detector robustness across visual domains remains a major issue.
Cross-Domain Detection When training and test data are drawn from two different distributions, a model learned on the first is doomed to fail on the second. Unsupervised domain adaptation methods attempt to close the domain gap between the annotated source on which learning is performed and the target samples on which the model is deployed. Most of the literature has focused on object classification with solutions based on feature alignment [Long:2015, LongZ0J17, dcoral, hdivergence] or adversarial approaches [Ganin:DANN:JMLR16, Hoffman:Adda:CVPR17]. GAN-based methods make it possible to update the visual style of the annotated source data and reduce the domain shift directly at the pixel level [russo17sbadagan, cycada]. Only in the last three years have adaptive detection methods been developed, considering three main components: (i) including multiple and increasingly more accurate feature alignment modules at different internal stages, (ii) adding a preliminary pixel-level adaptation and (iii) pseudo-labeling. The last one is also known as self-training and consists in using the output of the source model detector as coarse annotation on the target.
The importance of considering both global and local domain adaptation, together with a consistency regularizer to bridge the two, was first highlighted in [Chen_2018_CVPR]. The Strong-Weak (SW) method of [Saito_2019_CVPR] improves over the previous one by pointing out the need for a better balanced alignment, with strong local and weak global adaptation. It was further extended by [Xie_2019_ICCV_Workshops], where the adaptive steps are multiplied at different depths in the network. The most recent SW-ICR-CCR method [xuCVPR2020] further includes an image-level multi-label classifier and a module imposing consistency between the image-level and instance-level predictions. The first makes it possible to obtain crucial regions corresponding to categorical information; the second evaluates the consistency between the image-level and instance-level predictions and works as a regularization factor that helps to point out hard-to-align instances in the target domain.
By generating new source images that look like those of the target, the Domain-Transfer (DT, [inoue2018cross]) method was the first to adopt pixel-level adaptation for object detection and combine it with pseudo-labeling. More recently the Div-Match approach [diversify&match_Kim_2019_CVPR] re-elaborated the idea of domain randomization [Tobin2017DomainRF]: multiple CycleGAN [CycleGAN2017] applications with different constraints produce three extra source variants with which the target can be aligned at different extent through an adversarial multi-domain discriminator. A weak self-training procedure (WST) to reduce false negatives is combined with adversarial background score regularization (BSR) in [kim2019selftraining]. Finally, [robust_Khodabandeh_2019_ICCV] adopted pseudo-labeling and an approach to deal with noisy annotations.
Adaptive Learning on a Budget There is a wide literature on learning from a limited amount of data, both for classification and detection. However, in case of domain shift, learning on a target budget becomes extremely challenging. Indeed, the standard assumption for adaptive learning is that a large amount of unsupervised target samples are available at training time, so that a source model can capture the target domain style from them and adapt to it.
Only a few attempts have been made to reduce the target cardinality. In [fewshotNIPS17] the considered setting is that of few-shot supervised domain adaptation: only a few target samples are available but they are fully labeled. In [oneshotNIPS2018, Cohen_2019_ICCV] the focus is on one-shot unsupervised style transfer with a large source dataset and a single unsupervised target image. These works propose time-costly autoencoder-based methods to generate a version of the target image that maintains its content, but visually resembles the source in its global appearance. Thus the goal is image generation with no discriminative purpose. A related setting is that of online domain adaptation, where unsupervised target samples are initially scarce but accumulate in time [Hoffman_CVPR2014, Wulfmeier2017IncrementalAD, mancini2018kitting]. In this case target samples belong to a continuous data stream with smooth domain changes, so the coherence among subsequent samples can be exploited for adaptation.
Self-Supervised Learning Despite not being manually annotated, unsupervised data is rich in structural information that can be learned by self-supervision, i.e. hiding a subpart of the data information and then trying to recover it. This procedure is generally indicated as a pretext task, and possible examples are image completion [pathakCVPR16context], colorization [zhang2016colorful, larsson2017colorization], relative position of patches [doersch2015unsupervised, noroozi2016unsupervised], rotation recognition [gidaris2018unsupervised] and many more. Self-supervised learning has been extensively used as an initialization step for scarcely annotated supervised learning settings, and very recently [asano20a-critical] has shown with a thorough analysis the potential of self-supervised learning from a single image. Several works have also indicated that self-supervision supports adaptation and generalization when combined with supervised learning in a multi-task framework [jigen, Bucci2019TacklingPD, Xu2019SelfsupervisedDA].
Meta-Learning Standard learning is based on algorithms able to improve their performance over multiple data instances. Meta-learning extends it and refers to the process of improving an algorithm over multiple learning episodes. In practical terms, a base learning model is trained to solve a task such as classification or detection on a dataset, while the meta-learning loop updates the base algorithm considering multiple tasks of the same family to accomplish a higher-level objective such as generalization or increased learning speed. In the last years, meta-learning has been widely used for few-shot learning, with scarce-data tasks simulated by randomly drawing samples from the full training set [maml-finn17a, NIPS2017_fewshotproto, NIPS2016_matching, rusu2018metalearning]. A similar strategy has been adopted to create single-source tasks involving samples from a multi-source training set and prepare for generalization. Indeed, by using a validation domain that is shifted from the training domain, different kinds of meta-knowledge such as losses [featurecritic], regularization functions [metareg2018] and data augmentation [Tseng2020Cross-Domain] can be (meta) learned to maximize the robustness of the learned model.
Our approach for cross-domain detection relates to the scenario of learning on a budget and connects to the few-shot meta-learning literature. Specifically, we propose to combine the main supervised detection model with a self-supervised auxiliary objective to perform one-shot unsupervised adaptation. To better align the self-supervised training phase with the single-sample test condition, we leverage meta-learning by simulating multiple unsupervised single-sample cross-domain learning episodes. We are not aware of previous attempts to apply meta-learning to self-supervision while pushing it to the extreme one-shot unsupervised condition. The designed method comes with an extra advantage: it is source-free, meaning that the test-time adaptation happens without accessing the source data.
We introduce the one-shot unsupervised cross-domain detection scenario where our goal is to predict on a single image $x^t$, with $t$ being any target domain not available at training time, starting from $N$ annotated samples of the source domain $S = \{x^s_i, y^s_i\}_{i=1}^{N}$. Here the structured labels $y^s = (c, b)$ describe class identity $c$ and bounding box location $b$ in each image $x^s$, and we aim to obtain $y^t = (c^t, b^t)$ that precisely detects objects in $x^t$ despite the domain shift.
To pursue the described goal, our strategy is to train the parameters of a detection learning model such that it can be ready to get the maximal performance on a single unsupervised sample from a new domain after few gradient update steps on it. Since we have no ground truth on the target sample, we implement this strategy by learning a representation that exploits inherent data information as that captured by a self-supervised task, and then finetune it on the target sample. Thus, we design our method to include (1) an initial pretraining phase where we extend a standard deep detection model adding an image rotation classifier, and (2) a following adaptation stage where the network features are updated on the single target sample by further optimization of the rotation objective. Source data is not needed during the adaptation phase. Moreover, we exploit pseudo-labeling in a novel cross-task fashion so that the auxiliary task is guided to focus on the object area. A schematic overview of our approach is presented in Figure 2.
We use Faster R-CNN [ren2015faster] as our base detection model. It is a two-stage detector with three main components: an initial block of convolutional layers, a region proposal network (RPN) and a region-of-interest (ROI) based classifier. The bottom layers transform any input image $x$ into its convolutional feature map $G_f(x|\theta_f)$, where $\theta_f$ is used to parametrize the feature extraction model. The feature map is then used by RPN to generate candidate object proposals. Finally the ROI-wise classifier predicts the category label from the feature vector obtained using ROI-pooling. The training objective combines the loss of both RPN and ROI, each of them composed of two terms:
$$\mathcal{L}_d(G_d(G_f(x|\theta_f)|\theta_d), y) = \big(\mathcal{L}_{class}(c^*) + \mathcal{L}_{regr}(b)\big)_{RPN} + \big(\mathcal{L}_{class}(c) + \mathcal{L}_{regr}(b)\big)_{ROI}.$$
Here $\mathcal{L}_{class}$ is a classification loss to evaluate the object recognition accuracy, while $\mathcal{L}_{regr}$ is a regression loss on the box coordinates for better localization. To maintain a simple notation we summarize the role of ROI and RPN with the function $G_d(G_f(x|\theta_f)|\theta_d)$ parametrized by $\theta_d$. Moreover, we use $c^*$ to highlight that RPN deals with a binary classification task to separate foreground and background objects, while ROI deals with the multi-class objective needed to discriminate among foreground object categories. As mentioned above, ROI and RPN are applied in sequence: they both elaborate on the feature maps produced by the convolutional block, and then influence each other in the final optimization of the multi-task (classification, regression) objective function.
3.3 Pretraining via Multi-task and Meta-Learning
As a first step, we extend Faster R-CNN to include image rotation recognition. Formally, to each training image $x$ we apply four geometric transformations $R(x, \alpha)$, where $\alpha = q \times 90^{\circ}$ indicates the orientation with $q \in \{1, \ldots, 4\}$. In this way we obtain a new set of samples $\{R(x)_j, q_j\}_{j=1}^{M}$, where we dropped the $\alpha$ without loss of generality. We indicate the auxiliary rotation classifier and its parameters respectively with $G_r$ and $\theta_r$. Depending on the training procedure we can get different variants of the model.
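As a minimal illustration of how the rotation pretext samples could be built, the sketch below rotates a toy single-channel image (a nested list) by multiples of 90 degrees and pairs each copy with its orientation label. The counter-clockwise direction, the label convention (labels 1 to 4, with 4 mapping back to the unrotated image) and the helper names `rotate90_ccw` and `rotation_pretext_samples` are assumptions made for illustration, not details taken from the text.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy single-channel image stored as rows of pixel values


def rotate90_ccw(img: Image) -> Image:
    """Rotate a 2-D grid by 90 degrees counter-clockwise."""
    # Transposing and then reversing the row order turns the last column into the first row.
    return [list(row) for row in zip(*img)][::-1]


def rotation_pretext_samples(img: Image) -> List[Tuple[Image, int]]:
    """Return four rotated copies with labels q = 1..4 (assumed: angle = q * 90 degrees)."""
    samples = []
    current = img
    for q in range(1, 5):
        current = rotate90_ccw(current)
        samples.append((current, q))
    return samples


toy = [[1, 2],
       [3, 4]]
samples = rotation_pretext_samples(toy)
```

With this convention the fourth sample is the original image again, so the unrotated view and its label come for free from the same loop.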
Multi-task The supervised and the self-supervised tasks can be jointly trained on the whole source data in a standard multi-task fashion. The overall objective of the designed model is:
$$\arg\min_{\theta_f, \theta_d, \theta_r} \sum_{i=1}^{N} \mathcal{L}_d(G_d(G_f(x^s_i|\theta_f)|\theta_d), y^s_i) + \lambda \sum_{j=1}^{M} \mathcal{L}_r(G_r(G_f(R(x^s)_j|\theta_f)|\theta_r), q^s_j),$$
where $\mathcal{L}_d$ is the detection loss, $\mathcal{L}_r$ is the cross-entropy loss on the rotation labels and $\lambda$ weights the auxiliary task. In this way, the shared feature map is learned under the synchronous guidance of both the detection and rotation objectives. More specifically, the obtained representation will be covariant with the object location and appearance as well as with the image or object orientation. Indeed we can design $G_r$ in two different ways: it can either be a Fully Connected (FC) layer that naïvely takes as input the feature map produced by the whole (rotated) image, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\cdot)$, or it can exploit the ground truth location of each object with a subselection of the features only from its bounding box in the original map, $G_r(\cdot|\theta_r) = \mathrm{FC}_{\theta_r}(\mathrm{boxcrop}(\cdot))$. The boxcrop operation includes pooling to rescale the feature dimension before entering the final FC layer. In this last case the network is encouraged to focus only on the object orientation without introducing noisy information from the background, and it provides better results than the whole-image option, as we will discuss in Section 4.6. In practical terms, both in the case of image and box rotations, we randomly pick one rotation angle per instance, rather than considering all four of them.
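The way the two objectives are combined can be sketched numerically. The snippet below is only a toy, assuming the detection loss has already been computed as a scalar and the rotation head outputs four logits; the names `cross_entropy`, `multitask_loss` and `rot_weight` are hypothetical, and the weight value used in the example is arbitrary rather than the one adopted in the experiments.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one sample, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(detection_loss: float,
                   rotation_logits: Sequence[float],
                   rotation_label: int,
                   rot_weight: float) -> float:
    """Detection loss plus a weighted auxiliary rotation-classification loss."""
    return detection_loss + rot_weight * cross_entropy(rotation_logits, rotation_label)


# One randomly picked rotation per instance: here the label is a 0-based class index.
total = multitask_loss(detection_loss=1.0,
                       rotation_logits=[0.0, 0.0, 0.0, 0.0],
                       rotation_label=2,
                       rot_weight=0.5)
```

Because both terms backpropagate into the shared feature extractor, the weight controls how strongly the rotation task shapes the representation relative to detection.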
Meta-Learning Multi-task learning is appealing for deep learning regularization and including a self-supervised task has the advantage of waiving any extra data annotation cost. Still, our main interest remains on detection, while rotation recognition should be considered as a secondary and auxiliary task. To manage this role for rotation, and to better fit to the unlabeled one-shot scenario on a new domain that we will face at test time, we re-formulate the problem inspired by meta-learning and building over the bi-level optimization process of MAML [maml-finn17a]. Specifically we propose to meta-train the detection model with the rotation task as its inner base learner. The optimization objective can be written as
In words, we start by focusing on the rotation recognition task for each source sample that has been augmented in different ways. We consider semantic-preserving augmentations (e.g. gray-scale, color jittering) and perform multiple learning iterations ( gradient-based update steps). The function collects the optimal parameters and related module obtained by this procedure on . The outer meta-learning loop leverages on it to optimize the detection model over all the data variants and prepares for generalization and fine-tuning on a single sample.
Also in this case we have two possible choices to design the rotation classifier: either considering the whole feature map, or focusing on the object location. To simulate the deployment setting we neglect the ground truth object location for the inner rotation objective and substitute the ground-truth box crop with the pseudoboxcrop obtained through the cross-task self-training procedure detailed in the following section. We report a pseudo-code implementation of our meta-learning strategy applied on a single sample in Algorithm 1.
3.4 Cross-task self-training
Self-training is a well-known paradigm in semi-supervised learning that exploits weak prediction models to annotate unlabeled data, which are then integrated in the learning procedure with the obtained pseudo-labels. This approach has often been used also for domain adaptation, both in classification and detection models [kim2019selftraining, inoue2018cross]. We propose here a cross-task variant: instead of reusing the pseudo-labels produced by the source model on the target to update the detector, we exploit them for the self-supervised rotation classifier. In this way we keep the advantage of the self-training initialization, largely reducing the risks of error propagation due to wrong class pseudo-labels.
We start from the model parameters of the pretraining stage and we get the feature maps from all the rotated versions of the sample , , . Only the feature map produced by the original image (i.e. ) is provided as input to the RPN and ROI network components to get the predicted detection . This pseudo-label is composed by the class label and the bounding box location . We discard the first and consider only the second to localize the region containing an object in all the four feature maps, also recalibrating the position to compensate for the orientation of each map. Once passed through this pseudoboxcrop operation, the obtained features are used both for the meta-learning phase on each source sample and for adaptation fine-tuning on every target sample.
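The recalibration of the pseudo-box position for each rotated map can be sketched as a simple coordinate transform. The function below is an illustrative assumption rather than the actual pseudoboxcrop code: it works in image coordinates with the origin at the top-left corner, rotates counter-clockwise, and the name `rotate_box_ccw` is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), origin at the top-left corner


def rotate_box_ccw(box: Box, width: float, height: float,
                   quarter_turns: int) -> Tuple[Box, float, float]:
    """Map a box to the frame of the image rotated counter-clockwise by quarter_turns * 90 degrees.

    Returns the transformed box together with the width and height of the rotated image.
    """
    x1, y1, x2, y2 = box
    for _ in range(quarter_turns % 4):
        # A point (x, y) moves to (y, width - x); the box corners swap roles accordingly.
        x1, y1, x2, y2 = y1, width - x2, y2, width - x1
        width, height = height, width
    return (x1, y1, x2, y2), width, height


# Pseudo-box predicted on the unrotated 100x80 image, mapped to the 90-degree view.
rotated_box, new_w, new_h = rotate_box_ccw((10, 20, 30, 60), 100, 80, 1)
```

The same box can then be used to crop the object region from each rotated feature map after rescaling the coordinates by the backbone stride.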
Given the single target image $x^t$, we finetune the backbone’s parameters $\theta_f$ by iteratively solving a self-supervised task on it. This allows us to adapt the original feature representation both to the content and to the style of the new sample. Specifically, we start from the rotated versions $R(x^t)$ of the provided sample and optimize the rotation classifier through
$$\arg\min_{\theta_f, \theta_r} \mathcal{L}_r(G_r(G_f(R(x^t)|\theta_f)|\theta_r), q^t).$$
This process involves only the feature extractor $G_f$ and the rotation classifier $G_r$, while the RPN and ROI detection components described by $G_d$ remain unchanged. In the following we use $\gamma$ to indicate the number of gradient steps (i.e. iterations), with $\gamma = 0$ corresponding to the pretraining phase. At the end of the finetuning process, the inner feature model is described by $\theta_f^*$ and the detection prediction on $x^t$ is obtained by $y^{t*} = G_d(G_f(x^t|\theta_f^*)|\theta_d)$. Algorithm 2 outlines the adaptation process on a single target sample.
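The structure of this per-sample adaptation can be illustrated with a deliberately tiny stand-in model. Everything in the sketch below is an assumption made for illustration: the "backbone" is a single scalar `w_f` that rescales hand-made feature vectors, the rotation head `W_r` is a linear softmax classifier with manually derived gradients, and `theta_d` is a placeholder for the frozen detection head. It is not the procedure of Algorithm 2, only the same pattern: copy the pretrained parameters, take a few gradient steps on the self-supervised loss, and leave the detection parameters and the pretrained model untouched.

```python
import copy
import math
from typing import Any, Dict, List, Tuple

View = Tuple[List[float], int]  # (feature vector of one rotated view, 0-based rotation label)


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rotation_loss_and_grads(params: Dict[str, Any], views: List[View]):
    """Mean cross-entropy of the toy model logits = W_r @ (w_f * x), with analytic gradients."""
    w_f, W_r = params["w_f"], params["W_r"]
    n, k, d = len(views), len(W_r), len(views[0][0])
    loss, g_wf = 0.0, 0.0
    g_Wr = [[0.0] * d for _ in range(k)]
    for x, q in views:
        h = [w_f * v for v in x]  # "backbone" features
        probs = softmax([sum(W_r[c][j] * h[j] for j in range(d)) for c in range(k)])
        loss -= math.log(probs[q]) / n
        for c in range(k):
            err = (probs[c] - (1.0 if c == q else 0.0)) / n  # d loss / d logit_c
            for j in range(d):
                g_Wr[c][j] += err * h[j]
                g_wf += err * W_r[c][j] * x[j]
    return loss, {"w_f": g_wf, "W_r": g_Wr}


def adapt_on_single_sample(pretrained: Dict[str, Any], views: List[View],
                           steps: int = 5, lr: float = 0.5) -> Dict[str, Any]:
    """Finetune only w_f and W_r on one sample; theta_d is never updated."""
    params = copy.deepcopy(pretrained)  # the pretrained model is reused for the next sample
    for _ in range(steps):
        _, g = rotation_loss_and_grads(params, views)
        params["w_f"] -= lr * g["w_f"]
        params["W_r"] = [[w - lr * gw for w, gw in zip(row, grow)]
                         for row, grow in zip(params["W_r"], g["W_r"])]
    return params


pretrained = {"w_f": 1.0, "W_r": [[0.0] * 4 for _ in range(4)], "theta_d": [0.3]}
views = [([1.0 if j == q else 0.0 for j in range(4)], q) for q in range(4)]
before, _ = rotation_loss_and_grads(pretrained, views)
adapted = adapt_on_single_sample(pretrained, views)
after, _ = rotation_loss_and_grads(adapted, views)
```

Working on a deep copy reflects the one-shot setting: each incoming target image starts again from the pretrained parameters, so no information leaks from one sample to the next and no source data is needed at this stage.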
3.6 Model Variants and Implementation Details
We built our model over a public implementation of Faster R-CNN [massa2018mrcnn]. Specifically, we chose a ResNet-50 [he2016deep] backbone pre-trained on ImageNet, RPN with 300 top proposals after non-maximum suppression, and anchors at three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1).
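For readers unfamiliar with anchor configurations, the nine anchor shapes implied by three scales and three aspect ratios can be enumerated as below. This is a sketch under assumptions: the ratio is interpreted as height over width, each anchor keeps an area of scale squared, and `anchor_shapes` is a hypothetical helper rather than a function of the adopted detection library.

```python
import math
from typing import List, Sequence, Tuple


def anchor_shapes(scales: Sequence[float] = (128, 256, 512),
                  ratios: Sequence[float] = (1.0, 0.5, 2.0)) -> List[Tuple[float, float]]:
    """Return (width, height) pairs, one per scale/ratio combination.

    Assumed convention: ratio = height / width, area = scale ** 2.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


shapes = anchor_shapes()
```

Each feature-map location then hosts one anchor per shape, which the RPN scores as foreground or background and refines by regression.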
In the following, we will differentiate the name of the proposed model, depending on the specific adopted training procedure. We will indicate with OSHOT our basic One-SHOT multi-task pretrained approach, while we will use FULL-OSHOT to refer to the variant based on meta-learning pretraining. We also consider two intermediate cases: Tran-OSHOT extending OSHOT with the data semantic-preserving transformations used in FULL-OSHOT, and Meta-OSHOT that corresponds to FULL-OSHOT without transformations (i.e. ).
For OSHOT we train the base network for 70k iterations using SGD with momentum set at , the initial learning rate is  and decays after 50k iterations. We use a batch size of 1, keep batch normalization layers fixed for both pretraining and adaptation phases and freeze the first 2 blocks of ResNet50. The weight of the rotation task is set to . FULL-OSHOT is actually trained in two steps. For the first 60k iterations the training is identical to that of OSHOT, while in the last 10k iterations the meta-learning procedure is activated. The inner loop optimization on the self-supervised task runs with  iterations and the batch size is 2 to accommodate two transformations of the original image. Specifically we used gray-scale and color-jitter with brightness, contrast, saturation and hue all set to . All the other hyperparameters remain unchanged as in OSHOT. Tran-OSHOT differs from OSHOT only in the last 10k learning iterations, where the batch size is 2 and the network sees multiple images with different visual appearance in one iteration. Meta-OSHOT is instead identical to FULL-OSHOT, except for the transformations, which are dropped; thus the batch size is 1 also in the last 10k pretraining iterations.
The adaptation phase is the same for all the variants: the model obtained from the pretraining phase is updated via fine-tuning of the self-supervised task. The batch size is equal to 1 and dropout with probability 0.5 is added before the rotation classifier to prevent overfitting. The weight of the auxiliary task is increased to 0.2 to speed up the adaptation process. All the other hyperparameters and settings are the same as those used during pretraining. The number of fine-tuning steps is set to match the number of inner-loop iterations used in meta-training, but we also investigated the effect of increasing it for OSHOT (see Section 4.5).
In this section we present an extensive experimental analysis on the proposed one-shot unsupervised cross-domain detection scenario. In particular, we show the limits of the existing adaptive detection methods and discuss how our proposed approach overcomes them.
We consider a variety of existing datasets, besides including a new one that we created to assess the performance of our method on the challenging social media feed setting.
Visual Object Classes (VOC) Pascal-VOC [everingham2010pascal] is a standard real-world image dataset for object detection benchmarks. Both VOC2007 and VOC2012 contain bounding box annotations of 20 common categories. VOC2007 has 5011 images in the train-val split and 4952 images in the test split, while VOC2012 contains 11540 images in the train-val split.
Artistic Media Datasets (AMD) is composed of Clipart1k, Comic2k and Watercolor2k [inoue2018cross], three object detection datasets designed for benchmarking domain adaptation methods when the source domain is VOC. Clipart1k shares its 20 categories with VOC: it has 500 images in the training set and 500 images in the test set. Comic2k and Watercolor2k both have the same 6 classes (a subset of the 20 classes of VOC), and 1000-1000 images in the training-test splits each.
Cityscapes [cordts2016cityscapes] is an urban street scene dataset with pixel level annotations of 8 categories. It has 2975 and 500 images respectively in the training and validation splits. We use the instance level pixel annotations to generate bounding boxes of objects, as in [Chen_2018_CVPR].
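Deriving a bounding box from an instance-level pixel annotation amounts to taking the extremes of the pixels that carry a given instance id. The sketch below shows the idea on a toy mask stored as a nested list; the function name `mask_to_box` and the inclusive pixel-coordinate convention are assumptions, and the actual conversion used for Cityscapes may differ in such details.

```python
from typing import List, Optional, Tuple


def mask_to_box(mask: List[List[int]],
                instance_id: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) in inclusive pixel coordinates, or None if absent."""
    rows = [r for r, row in enumerate(mask) if instance_id in row]
    if not rows:
        return None
    cols = [c for row in mask for c, v in enumerate(row) if v == instance_id]
    return min(cols), min(rows), max(cols), max(rows)


toy_mask = [
    [0, 0, 0, 0],
    [0, 7, 7, 0],
    [0, 0, 7, 0],
]
box = mask_to_box(toy_mask, 7)
```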
Foggy Cityscapes [sakaridis2018semantic] is obtained by adding different levels of synthetic fog to Cityscapes images. We only consider images with the highest amount of artificial fog, thus training-validation splits have 2975-500 images respectively.
KITTI [kitti] is a dataset of images depicting several driving urban scenarios. By following [Chen_2018_CVPR], we use the full 7481 images for both training (when used as source) and evaluation (when used as target).
Social Bikes is our new dataset containing 530 images of scenes with persons and bicycles collected from Twitter, Instagram and Facebook by searching for #bike tags. It was designed as a possible target when the source domain is VOC, since the two classes person and bicycle are shared between them. Square crops of a subset of the dataset are presented in Figure 3: the images acquired randomly from social feeds show diverse style properties and cannot be grouped under a single shared domain.
4.2 Experimental Setup and Competitors
To run all the experiments we resized the image’s shorter side to 600 pixels and applied random horizontal flipping during pretraining. The detection performance is assessed considering the IoU threshold at 0.5 for the mAP results. In the following we will use an arrow to indicate the experimental setting (source → target) and we report the average of three independent runs. We also perform detection error analysis with TIDE [tide-eccv2020], a toolbox that estimates how much each type of detection error contributes to the missing mAP. In particular, TIDE not only computes false positives and false negatives, but also classifies all the errors into six categories by computing the maximum IoU between each detection and a ground truth bounding box. Cls means object localized correctly ($\mathrm{IoU}_{max} \geq t_f$) but classified incorrectly, Loc means object classified correctly but localized incorrectly ($t_b \leq \mathrm{IoU}_{max} \leq t_f$), Both is used when the two situations occur simultaneously, Dupe means that the detection is correct but the same ground truth bounding box has already been associated with another higher scoring detection, Bkg means detected background as foreground ($\mathrm{IoU}_{max} \leq t_b$) and Miss is for all the undetected ground truth boxes not already covered by other types of errors. Here $t_f$ and $t_b$ are the foreground and background IoU thresholds of TIDE, set by default to 0.5 and 0.1.
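The way a single detection can be assigned to one of these error types is sketched below. This is a simplified illustration and not the TIDE implementation: it looks only at the best-overlapping ground truth box of the same class and of other classes, it uses foreground and background thresholds of 0.5 and 0.1 as assumed defaults, and Miss errors (which are counted over ground truth boxes, not detections) are left out. The names `iou` and `classify_detection` are hypothetical.

```python
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify_detection(det_box: Box, det_cls: str,
                       gts: List[Tuple[Box, str]], matched: Set[int],
                       t_f: float = 0.5, t_b: float = 0.1) -> str:
    """Return 'Correct', 'Dupe', 'Cls', 'Loc', 'Both' or 'Bkg' for one detection."""
    best_same, best_same_idx, best_other = 0.0, None, 0.0
    for i, (g_box, g_cls) in enumerate(gts):
        v = iou(det_box, g_box)
        if g_cls == det_cls and v > best_same:
            best_same, best_same_idx = v, i
        elif g_cls != det_cls and v > best_other:
            best_other = v
    if best_same >= t_f:
        # Good overlap with the right class: an error only if that box is already taken.
        return "Dupe" if best_same_idx in matched else "Correct"
    if best_other >= t_f:
        return "Cls"
    if best_same >= t_b:
        return "Loc"
    if best_other >= t_b:
        return "Both"
    return "Bkg"


gts = [((0, 0, 10, 10), "bicycle"), ((20, 20, 30, 30), "person")]
```

For instance, a well-placed box labeled person on top of the bicycle ground truth is a Cls error, while a bicycle box covering only a third of it is a Loc error.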
We consider a plain detection model and several adaptive approaches as benchmarks. Baseline is our Faster R-CNN baseline with ResNet-50 backbone, trained on the source domain only and deployed on the target without further adaptation. Tran-Baseline is a variant of the baseline obtained by applying, in the last 10k training iterations, the same semantic-preserving data transformations introduced for FULL-OSHOT. This allows us to assess how much of the improvement is due to the higher data variability rather than to the training strategy. DivMatch [diversify&match_Kim_2019_CVPR] is a cross-domain detection algorithm that, by exploiting target data, creates multiple randomized domains via CycleGAN and aligns their representations using an adversarial loss. SW [Saito_2019_CVPR] aligns source and target features based on global context similarity. SW-ICR-CCR [xuCVPR2020] adds two regularization modules on top of SW: they push the model to focus less on the non-transferable source background and give more weight to hard-to-align instances. In all cases we use a ResNet-50 backbone pretrained on ImageNet for fair comparison. We remark that the cross-domain algorithms need target data in advance and are not designed to work in our one-shot unsupervised setting, thus we provide them with the advantage of 10 target images accessible during training and randomly selected at each run. We collect average precision statistics during inference under the favorable assumption that the target domain will not shift after deployment.
Method                                      KITTI → Cityscapes    Cityscapes → KITTI

One-Shot Target
Baseline                                    26.5                  75.1
Tran-Baseline                               42.8                  75.3

OSHOT                                       31.0                  75.0
Tran-OSHOT                                  43.1                  75.6
Meta-OSHOT                                  33.6                  75.2
FULL-OSHOT                                  42.9                  75.4

OSHOT                                       31.1                  75.0
Tran-OSHOT                                  43.1                  75.6
Meta-OSHOT                                  34.1                  75.2
FULL-OSHOT                                  43.0                  75.4

Ten-Shot Target
DivMatch [diversify&match_Kim_2019_CVPR]    37.9                  74.1
SW [Saito_2019_CVPR]                        39.2                  74.6
SW-ICR-CCR [xuCVPR2020]                     39.8                  74.9
4.3 Performance and Detection Error Analysis
Adapting to social feeds When the data comes from multiple providers, the assumption that all target images originate from the same underlying distribution does not hold and standard cross-domain detection methods are penalized regardless of the number of seen target samples. We pretrain the source detector on VOC, and deploy it on Social Bikes.
We report results in Table I. The mAP performance before adaptation (i.e. with zero adaptation iterations) allows us to compare the pretraining models and already shows the advantage of FULL-OSHOT over OSHOT, as well as over the Tran and Meta variants. Specifically, the data transformations support both the Baseline and OSHOT with a gain of about 1 point between the Tran and the respective plain versions, but their use in the meta-learning process of FULL-OSHOT provides the greatest advantage. After adaptation, all variants of OSHOT obtain an improvement that goes from  (OSHOT) to  (FULL-OSHOT) points over the Baseline just by adapting on a single test sample. Despite being granted access to a larger set of adaptation samples, domain adaptive algorithms reach at best an advantage of  points over FULL-OSHOT, even when exploiting the whole target for the adaptation. When using only ten target samples, two methods out of three show a negative transfer w.r.t. the Baseline.
By looking at the detection error analysis we can see that the adaptation iterations allow OSHOT to reduce the number of False Negatives. Moreover, both Tran-OSHOT and FULL-OSHOT obtain a higher mAP than OSHOT thanks to fewer Miss errors. The performance of FULL-OSHOT confirms that the meta-learning strategy with semantic-preserving data augmentations successfully prepares the model to solve the adaptation task at inference time.
Large distribution shifts Artistic images are difficult benchmarks for cross-domain methods. Unpredictable perturbations in shape and color are challenging to detectors trained only on realistic images. We investigate this setting by training the source detector on VOC and deploying it on Clipart, Comic and Watercolor datasets.
Table V summarizes results on the three adaptation splits. In all the settings, by exploiting one sample at a time with few adaptive iterations () OSHOT and its variants outperform the adaptive detectors, even though the latter can leverage 10 target samples. More precisely, the adaptive detectors are unable to work in data scarcity conditions and obtain results comparable to or lower than those of the Tran-Baseline and of the pretraining phase of our approach (). We also highlight that when Meta-OSHOT obtains results higher than Tran-OSHOT and only slightly lower on average than FULL-OSHOT; thus the meta-learning strategy alone (without additional data augmentation) prepares the detector for the inference-time adaptation task.
By looking at the detection error analysis we can see that the data augmentation of Tran-OSHOT pushes for a lower number of Miss errors, while the meta-learning strategy of Meta-OSHOT yields a lower number of Classification errors. FULL-OSHOT takes advantage of both, obtaining the best overall performance.
Adverse weather Some peculiar environmental conditions, such as fog, may be disregarded in source data acquisition, yet adaptation to these circumstances is crucial for real world applications. We consider the Cityscapes → FoggyCityscapes setting by training our base detector on the first domain for 30k iterations without stepdown, as in [cai2019exploring]. We select the best performing model on the Cityscapes validation split and deploy it to FoggyCityscapes.
The experimental evaluation in Table VI shows that domain adaptive detectors struggle when dealing with this kind of shift by using a small adaptation set. SW-ICR-CCR is the only method able to obtain a meaningful improvement over the Baseline, although its mAP remains lower than that of the Tran-Baseline. As for OSHOT and its variants, the pretraining alone () helps in gaining a better generalization ability, with all variants but Meta-OSHOT showing higher performance than the Baseline. The advantage is also visible from the error analysis by looking at the Miss type, which decreases when passing from the Baseline to OSHOT , reaching its lowest value for FULL-OSHOT with . Still, the top mAP result in this setting is obtained by OSHOT , indicating that neither the transformations nor the meta-learning strategy is able to prepare the detector for the experienced domain shift.
Cross-camera transfer A train-test dataset bias is unavoidable in practical applications, as for urban scenes collected in different cities and with different cameras. We test adaptation between KITTI and Cityscapes in both directions, considering only the car label, as is standard practice.
The obtained results are reported in Table VII. Considering the KITTI → Cityscapes shift, we can see that also in this case the domain adaptive detectors obtain results lower than the Tran-Baseline. In fact, here the data transformations seem to play a fundamental role in improving the generalization ability: the pretraining strategy of OSHOT () shows an improvement over the Baseline, but the best results are obtained by Tran-OSHOT and FULL-OSHOT. The following adaptation steps () provide negligible improvements. By looking at the detection error analysis we can see that the semantic-preserving transformations implemented in Tran-OSHOT and FULL-OSHOT greatly reduce the Miss errors (see also the False Negatives).
The shift in the opposite direction, Cityscapes → KITTI, appears less severe, with the Baseline already obtaining good results. The domain adaptive detectors suffer from a small negative transfer, and once again the domain augmentation transformations allow Tran-OSHOT and FULL-OSHOT to obtain the highest performance, with no adaptation improvements occurring with .
4.4 Comparison with One-Shot Style Transfer
Although not specifically designed for cross-domain detection, in principle it is possible to apply one-shot style transfer methods as an alternative solution for our setting. We use BiOST [Cohen_2019_ICCV], the current state-of-the-art method for one-shot transfer, to modify the style of the target sample towards that of the source domain before performing inference. Due to the time-heavy requirements to perform BiOST on each test sample, we test it on Social Bikes and on a random subset of 100 Clipart images that we name Clipart100. (To get the style update, BiOST trains a double-variational autoencoder using the entire source besides the single target sample. As advised by the authors through personal communications, we trained the model for 5 epochs.) We compare performance and time requirements of our approach and BiOST on these two targets. Speed has been computed on an RTX2080Ti with full precision settings.
Table VIII shows the obtained mAP results. On Clipart100, the Baseline obtains mAP. We can see how BiOST is effective in the adaptation from one sample, gaining points over the baseline. On Social Bikes, instead, BiOST incurs a slight negative transfer, indicating its inability to effectively modify the source’s style on the images we collected. OSHOT improves over the baseline on Clipart100 but its mAP remains lower than that of BiOST, while it outperforms both the baseline and BiOST on the more challenging Social Bikes. Finally, FULL-OSHOT shows the best results on both datasets. The last row of the table presents the time complexity of all the considered methods, which is identical for OSHOT and FULL-OSHOT since the number of adaptive iterations is the same. BiOST, instead, needs more than 6 hours to modify the style of a single source instance. Moreover, we highlight that BiOST works under the strict assumption of accessing the entire source training set and the target sample at the same time. Considering these weaknesses and the obtained results, we argue that existing one-shot translation methods are not suitable for one-shot unsupervised cross-domain adaptation.
4.5 Increasing the number of Adaptive Iterations
The results so far have shown how OSHOT improves over the existing cross-domain detection methods in the considered one-shot scenario. Moreover, the meta-learned pretrained model FULL-OSHOT provides a consistent advantage over the multi-task version OSHOT. Still, the bi-level optimization at the basis of meta-learning requires backpropagation through the inner process, which means dealing with higher-order derivatives and the related non-trivial computational and memory burden. Keeping the same conditions in pre-training and deployment means limiting the adaptation steps of FULL-OSHOT to the set with. In this sense, the multi-task based OSHOT model appears more suitable for all those cases in which the adaptation time is not a strict constraint and it is possible to get up to iterations. We studied the performance of OSHOT in this case and collected the results in the plots of Figure 4. We can observe a positive correlation between the number of finetuning iterations and the final mAP of the model in the earliest steps, while the performance generally reaches a plateau after about 30 iterations: increasing the number of iterations beyond this value does not significantly affect the final results. In the plots we represent with orange stars the performance of FULL-OSHOT at 0 and 5 adaptation iterations. We can see that in five out of six cases FULL-OSHOT performs noticeably better than OSHOT when they are both tested with 5 adaptation iterations. A higher number of adaptation steps often allows OSHOT to reach and outperform FULL-OSHOT at the cost of a longer learning period.
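The role of the number of adaptive iterations can be illustrated with a schematic per-sample adaptation loop. The Python sketch below is only a toy illustration: the names `one_shot_adapt`, `adapt_step`, `predict` and `num_iters` are hypothetical, the model is a single scalar, and restarting from the pretrained state for every target sample is an assumption of this sketch rather than a detail confirmed in this section.

```python
import copy


def one_shot_adapt(pretrained, target_samples, adapt_step, predict, num_iters):
    """Adapt on each target sample separately, then predict on that sample.

    adapt_step(model, sample) performs one self-supervised update and returns
    the updated model; predict(model, sample) returns the final output.
    """
    results = []
    for sample in target_samples:
        # Assumption: every sample starts from the same pretrained state.
        model = copy.deepcopy(pretrained)
        for _ in range(num_iters):
            model = adapt_step(model, sample)
        results.append(predict(model, sample))
    return results


def toy_step(model, sample):
    """Toy update: move the single parameter halfway towards the sample."""
    model["w"] += 0.5 * (sample - model["w"])
    return model


pretrained = {"w": 0.0}
no_adaptation = one_shot_adapt(pretrained, [8.0, -8.0], toy_step,
                               lambda m, s: m["w"], num_iters=0)
two_steps = one_shot_adapt(pretrained, [8.0, -8.0], toy_step,
                           lambda m, s: m["w"], num_iters=2)
```

In this toy setting more iterations move the parameter closer to each sample, which mirrors the trade-off discussed above between adaptation time and final accuracy.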
4.6 Image vs Box rotation
As explained in Section 3, both for the pretraining and for the adaptation phase we can choose to apply rotation either on the whole image or on the object bounding boxes. For all the experiments presented above we focused on the second case. More precisely, for the multi-task based OSHOT we used the ground truth bounding boxes during pretraining and the pseudo-labeled boxes in the adaptation phase. By solving the auxiliary task only on objects we limit the use of background features, which may mislead the network towards solutions of the rotation task not based on relevant semantic information (e.g., finding fixed patterns in images such as sky-always-on-top, or exploiting watermarks). To validate our choice we set up two dedicated experiments.
In the first we focus on the pretraining phase and run a qualitative analysis on the effect of learning on VOC by using either or . Then we test the rotation classifier on whole images from the Clipart domain. In Figure 5 we show the results obtained with Grad-CAM [gradcam_2017_ICCV] for the two cases, with the heatmap indicating the most relevant regions responsible for recognizing the correct orientation. The Grad-CAM maps refer to the last output of the backbone feature extractor. We can see that when the rotation classifier is trained on whole images, it learns to focus on the background (e.g. the sky and the ground) in order to solve the task. On the contrary, when the boxcrop operation is implemented to train the rotation classifier only on the relevant objects, it learns to look at objects’ features even when it faces an entire image.
In the second experiment we consider the whole process and compare the final performance of our approach when using the object location or the entire image in both pretraining and adaptation. Table IX shows results for VOC → AMD and Cityscapes → Foggy Cityscapes using OSHOT. We observe that the choice of rotated regions is critical for the effectiveness of the algorithm: the mAP improvements range from to points, indicating that allows learning features more suitable for the main detection task across domains.
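The difference between the two options can be sketched in a few lines of Python. The snippet below builds one sample for the rotation pretext task either from the whole image or from a box crop; images are plain nested lists to keep it self-contained, and the names `rotate90`, `crop_box` and `rotation_sample` are hypothetical and not taken from the authors' implementation.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D list counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        # Transposing and then reversing the row order gives one quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def crop_box(image, box):
    """Crop the region (x1, y1, x2, y2) from an image stored as rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def rotation_sample(image, box=None, rng=random):
    """Return (rotated region, rotation label in {0, 1, 2, 3}).

    With box=None the whole image is rotated; otherwise only the box crop,
    so that background context cannot be used to solve the pretext task.
    """
    region = crop_box(image, box) if box is not None else image
    k = rng.randrange(4)
    return rotate90(region, k), k


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
crop = crop_box(image, (1, 1, 3, 3))
rotated, label = rotation_sample(image, box=(1, 1, 3, 3), rng=random.Random(0))
```

During pretraining the box would come from the ground truth annotation, while in the adaptation phase it would come from the detector's own pseudo-labeled predictions, as described above.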
[Figure panel titles: Clipart, Comic, Watercolor, Social Bikes, Cityscapes, Foggy Cityscapes]
4.7 Qualitative Results
Figure 6 shows some examples of detections on images extracted from all the datasets considered in our work. We present as reference the ground truth bounding boxes, with different colors being used for different classes, as well as the predictions produced by DivMatch [diversify&match_Kim_2019_CVPR], SW [Saito_2019_CVPR] and SW-ICR-CCR [xuCVPR2020]. All of them show less precise results than our approach. Specifically, by looking at the artistic images in the first three columns from the left, we can see that the domain adaptive detectors often produce several false positives. Moreover, they show detection failures in the datasets of the fourth to seventh columns. It is also possible to notice how a higher number of adaptation iterations allows OSHOT to fix its errors (compare OSHOT and ) by detecting objects that it previously missed (see the first column), by correcting a wrong classification (see the dogs in the second column), or by improving object localization (see the fifth column). In many of these cases FULL-OSHOT has results more similar to OSHOT than to OSHOT , confirming its faster adaptation ability. FULL-OSHOT is also the only detector able to correctly identify the bicycle on the t-shirt in the third column.
5 Extension to Object Classification
The idea of exploiting self-supervised learning as an auxiliary objective to adapt on a single test sample can be easily extended to other scenarios where the main supervised task is different from detection. Indeed, its effectiveness has been recently showcased in [efrostesttimeICML2020] for the classification task, but the experimental analysis in that work involved only distribution shifts due to synthetic data corruptions. Here we maintain the focus on shifts due to significant changes in domain style and analyze the performance of our approach for classification across photos, art paintings, cartoons and sketches. Specifically, we rely on the PACS dataset [hospedalesPACS], which covers exactly those domains and object categories. We adopt a multi-target setting with each domain used in turn as source and all the remaining three domains as target. We re-designed our model by building over a ResNet-18 architecture pretrained on ImageNet and modified to include the same rotation branch used for detection. Here the rotation task is performed on the whole images, discarding the cross-task self-training due to the lack of localization information in source supervision. Both the OSHOT pretraining and the competitors' trainings are performed for 10k iterations using a batch size of and a learning rate of . The optimizer is standard SGD with momentum 0.9 and weight decay equal to . The rotation self-supervised task for OSHOT has weight both during pretraining and adaptation. OSHOT variants (Tran, Meta and FULL OSHOT) take advantage, in the pretraining phase, of 5k additional training iterations, performed however with a batch size reduced to 1 and a learning rate reduced by a factor. The differences between the OSHOT variants are the same as in the detection case.
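The multi-target protocol described above can be written down explicitly. The short Python sketch below enumerates the source/target splits; the domain identifiers and the helper name `multi_target_splits` are hypothetical labels chosen for illustration and do not necessarily match the naming used in the dataset release.

```python
# Hypothetical identifiers for the four PACS domains.
PACS_DOMAINS = ["photo", "art_painting", "cartoon", "sketch"]


def multi_target_splits(domains):
    """Use each domain in turn as source and all the others as targets."""
    return [(source, [d for d in domains if d != source]) for source in domains]


splits = multi_target_splits(PACS_DOMAINS)
for source, targets in splits:
    print(f"train on {source}, test on {', '.join(targets)}")
```

Each of the four resulting splits trains on a single domain and evaluates on the union of the remaining three.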
The obtained results are presented in Table X, where we compare all the variants of our approach with the non-adaptive ResNet-18 Baseline and the Tran-Baseline. Moreover, we include as reference the Minimum Class Confusion (MCC, [MCCeccv20]) method, a state-of-the-art multi-target adaptation approach based on a simple loss that evaluates the class correlation on target predictions. MCC works in the standard unsupervised domain adaptation setting and therefore needs an unlabeled adaptation set coming from the target domain at training time. We thus provide it with access to the whole target training split or to 10 samples extracted from it in two different experiments and evaluate the performance on the test split. OSHOT outperforms the considered competitors: including the self-supervised pretraining () already provides better results than the Baselines, and performing a few adaptive iterations further improves the final classification accuracy. Once again the best results are obtained by FULL-OSHOT with a higher accuracy () than MCC on the whole target (). We highlight that in this case the tailored use of the data transformations as part of the meta-learning process plays an important role. In fact, the transformations alone allow the Baseline to gain less than one point in average accuracy, while their integration inside OSHOT pushes Tran-OSHOT and FULL-OSHOT to obtain the best performance.
This paper focuses on one-shot unsupervised cross-domain detection, a scenario involving deployment conditions significantly different from those experienced at training time, with target samples drawn from a variety of visual domains not known in advance and not accessible during source training. These conditions mimic those encountered when monitoring image feeds on social media, where algorithms are called to adapt to a new visual domain and can only rely on one single image at inference time. We showed that existing cross-domain detection methods struggle in this setting, as they are all explicitly designed to adapt from far larger quantities of target data. We presented OSHOT, the first deep architecture able to reduce the domain gap between source and target distribution by leveraging a single target image. Our approach is based on a multi-task structure that exploits self-supervision and cross-task self-labeling. Moreover, we introduced a meta-learning formulation for the pretraining phase that simulates single-sample cross-domain learning episodes and further improves the generalization abilities of the detector. Extensive quantitative experiments and qualitative analyses demonstrated the effectiveness of the proposed adaptive detection method and showed how the same strategy can be easily exploited for cross-domain object classification.
This work was partially funded by the ERC grant 637076 RoboExNovo (FCB, AD, SB, BC) and took advantage of the GPU donated by NVIDIA (Academic Hardware Grant, TT).