Recent years have witnessed breakthroughs in object detection, driven by the development of deep learning frameworks [12, 28, 26, 13]. However, the remarkable performance of these detectors depends heavily on large-scale benchmarks with object annotations. For a specific detection task in practice, collecting such fully-annotated data is often labor-intensive. To alleviate this annotation burden, a number of weakly/semi-supervised detectors have been proposed that use weakly-annotated images (i.e., with only image labels) [7, 21, 2, 20, 30, 10]. Even though such images can be obtained easily from the internet, the performance of these weakly/semi-supervised detectors is far from competitive with that of fully-supervised detectors.
Alternatively, humans can localize and recognize new objects successfully after checking a few examples with prior object knowledge. Moreover, this capacity can be generalized by exploiting objects from weakly-annotated images. To mimic this learning process, several approaches have been proposed recently, based on low-shot and/or transfer learning in a semi-supervised detection setting [15, 31, 4]. However, these detectors have difficulties in handling wild images with complex objects [15, 31], or in learning to detect with weakly-annotated images. For this reason, a baby learning framework has been introduced, based on prior knowledge modelling, exemplar learning, and learning with video contexts. However, it often requires a large amount of weakly-annotated videos (e.g., 20,000 per category), and its iterative learning manner may reduce the discriminative power of deep neural networks.
To address these challenges, we propose a novel Progressive Object Transfer Detection (POTD) framework, shown in Fig. 1. Our key insight is that human-like learning can effectively integrate object knowledge from different domains into a progressive detection process, which can boost a target detection task with very few instance-level annotations. More specifically, this paper makes three main contributions. First, POTD efficiently mimics the way humans learn, by progressively transferring object knowledge from source to target, from large-scale to low-shot, and from fully-annotated to weakly-annotated images. With this multi-level learning process, one can effectively improve the detection performance with little annotation burden. Second, POTD consists of two novel detection stages, i.e., Low-Shot Transfer Detection (LSTD) and Weakly-Supervised Transfer Detection (WSTD). Each stage is elaborately designed with transfer insights. In LSTD, we distill implicit object knowledge in the source domain to guide low-shot detection in the target domain. This effectively warms up WSTD to handle wild images later on. In WSTD, we design a recurrent object labelling mechanism for learning to detect with weakly-labeled images. Furthermore, we exploit reliable object knowledge from LSTD, which enhances the robustness of the target detector for weakly-supervised detection. Finally, we conduct extensive experiments on a number of challenging data sets, where our POTD outperforms recent state-of-the-art approaches.
Note that Low-Shot Transfer Detection (LSTD) was published at AAAI 2018. We significantly extend it in the following ways. In terms of model design, we extend LSTD into a Progressive Object Transfer Detection (POTD) procedure. Specifically, we use LSTD as a warm-up, and design a new Weakly-Supervised Transfer Detection (WSTD) stage to handle weakly-annotated images in the target domain. Consequently, our POTD can further boost the detection performance of LSTD with little annotation burden. In terms of experiments, we investigate the proposed POTD in depth, and show the effectiveness of both LSTD and WSTD on object detection.
II Related Work
Weakly-Supervised Object Detection. Since only image labels are available in weakly-supervised object detection, most approaches are naturally based on the Multiple Instance Learning (MIL) framework, where each image is regarded as a bag of object instances. Positive categories are assumed to contain at least one object instance in the image, while negative categories contain no corresponding instances at all. However, MIL alternates between estimating the object representation and selecting positive object instances, which often converges to an unsatisfactory local optimum. To alleviate this problem, a number of efforts have been made through good initializations [5, 21], suitable optimization strategies [1, 17], and so on. Recently, deep learning has been used for weakly-supervised detection [2, 7, 20, 30]. A fundamental framework is the weakly supervised deep detection network, which performs localization and classification within a two-stream architecture. To further improve performance, several extensions have been introduced, including context design, saliency guidance, online classifier refinement, etc. However, the unsupervised selective search or EdgeBoxes proposal step may limit the efficiency of these deep detection networks. More importantly, the performance of these detectors is considerably lower than that of fully-supervised detectors, due to the lack of elaborate object annotations.
Semi-Supervised Object Detection. Most semi-supervised detectors assume that several categories are fully-annotated [15, 31]. By transferring the detection knowledge of these categories to the weakly-annotated ones, these detectors can improve the final performance on the whole set. However, [15, 31] attempt to adapt an image classifier into an object classifier, which is mostly effective on single-instance images. This may limit the detection capacity for wild images. In addition, these detectors often require a moderate amount of object annotations, which can still be difficult or expensive to obtain.
Low-Shot Transfer Object Detection. Alternatively, a more reasonable data assumption has been proposed to alleviate the labelling burden, i.e., only a few images are fully-annotated for each object category, and all other images are weakly-annotated with only image labels. But the performance under this assumption is limited without the guidance of prior object knowledge. For this reason, several low-shot and/or transfer detection approaches [23, 4] have been proposed recently, in order to mimic the learning procedure of humans. By taking advantage of large-scale benchmarks in the source domain, these detectors can work with few object annotations in the target domain. However, some of this work ignores the weakly-annotated target images, which may restrict the generalization capacity of the target detector, while other work attempts to borrow numerous weakly-annotated videos, but its iterative learning scheme lacks the efficiency of end-to-end learning. Different from these approaches, our POTD is a progressive learning procedure, which is built upon a fully end-to-end training framework with very few object annotations.
III Overall Learning Procedure of Progressive Object Transfer Detection (POTD)
Problem Definition. To effectively address a target detection task with little annotation burden, we define our detection problem according to the following practical settings. First, we can access a well-trained detector in the source domain. Second, we partially annotate the training set in the target domain, i.e., only a few images (e.g., one shot per category) are fully-annotated with object bounding boxes, and all others are weakly-annotated with only image labels. Finally, we consider one of the most challenging transfer cases, where the object categories in source and target do not overlap.
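To make the target-domain data assumption concrete, the following is a minimal sketch of how a partially annotated training set could be formed. The function name `split_low_shot`, its arguments, and the dictionary layout are hypothetical and are not taken from the released data splits; the sketch only assumes that each training image comes with its set of image-level class labels.

```python
import random


def split_low_shot(image_labels, shots, seed=0):
    """Pick `shots` images per class for full annotation; the rest stay weak.

    image_labels: dict mapping image id -> set of class names in the image.
    Returns (full, weak): `full` is the set of image ids that would receive
    bounding-box annotations, `weak` maps the remaining ids to image labels.
    """
    rng = random.Random(seed)

    # Group image ids by class, in a deterministic order.
    by_class = {}
    for image_id in sorted(image_labels):
        for cls in image_labels[image_id]:
            by_class.setdefault(cls, []).append(image_id)

    # Randomly draw the low-shot, fully-annotated images for every class.
    full = set()
    for cls in sorted(by_class):
        candidates = by_class[cls]
        full.update(rng.sample(candidates, min(shots, len(candidates))))

    # Every other image keeps only its image-level labels.
    weak = {i: image_labels[i] for i in image_labels if i not in full}
    return full, weak
```

With `shots=1`, this corresponds to the one-shot setting: one image per target category carries bounding boxes, and all remaining images contribute image labels only.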
Overall Learning Procedure. In this work, we design a novel Progressive Object Transfer Detection (POTD) framework, for learning to detect like humans. The whole procedure of POTD is shown in Fig. 1, which consists of two elegant transfer stages, i.e., Low-Shot Transfer Detection (LSTD), and Weakly-Supervised Transfer Detection (WSTD). In LSTD, we adapt the source-domain detector to be a warm-up detector in the target domain. This stage can stabilize WSTD afterwards, by distilling rich object knowledge in the source domain. In WSTD, we adapt the warm-up detector to be a target detector, which can learn to detect with weakly-annotated images in a fully end-to-end learning manner. Via this human-like learning procedure, our POTD can effectively address the target detection task with little annotation burden.
Notations. To be concise, we list all the relevant notations in the procedure of POTD, w.r.t., two transfer detection stages including LSTD and WSTD in Table I. In the following, we describe these two stages of POTD in detail.
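|Symbol||Description|
|$C_s$||number of object classes in the source domain|
|$C_t$||number of object classes in the target domain|
|$N$||number of object proposals|
|$\mathcal{L}_{\mathrm{LSTD}}$||total training loss of LSTD|
|$\mathcal{L}_{\mathrm{main}}$||main loss (i.e., standard detection loss)|
|$\lambda_{\mathrm{main}}$||coefficient of main loss|
|$\mathcal{L}_{\mathrm{BD}}$||loss of background depression (BD)|
|$\lambda_{\mathrm{BD}}$||coefficient of BD loss|
|$\mathbf{F}_{\mathrm{BD}}$||feature regions that refer to image background|
|$\mathcal{L}_{\mathrm{SDK}}$||loss of source detection knowledge (SDK)|
|$\lambda_{\mathrm{SDK}}$||coefficient of SDK loss in LSTD|
|$\mathbf{P}_s$||SDK in source detector|
|$\mathbf{P}_w$||SDK in warm-up detector|
|$\mathcal{L}_{\mathrm{WSTD}}$||total training loss of WSTD|
|$\mu_{\mathrm{SDK}}$||coefficient of SDK loss in WSTD|
|$\mathbf{P}_t$||SDK in target detector|
|$\mathcal{L}_{\mathrm{ROL}}$||loss of recurrent object labelling (ROL)|
|$\mu_{\mathrm{ROL}}$||coefficient of ROL loss|
|$\mathcal{L}^{k}_{\mathrm{ROL}}$||ROL loss for Classifier $k$|
|$\boldsymbol{\phi}$||prediction score vector of one image|
|$\mathbf{y}$||ground truth image label|
|$\mathbf{S}^{k}$||prediction score matrix of proposals for Classifier $k$|
|$\mathbf{Y}^{k}$||pseudo label matrix of proposals for Classifier $k$|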
IV Low-Shot Transfer Detection (LSTD)
Since detection with weak annotations often gets stuck in local optima, we first design a warm-up stage, i.e., fine-tuning source-domain detector with fully-annotated images in the target domain. However, such annotated images are quite few (e.g., one-shot per category) for a target detection task, in order to reduce the annotation burden. This fact can significantly increase the fine-tuning difficulties. For this reason, we propose a novel low-shot transfer detection framework to alleviate overfitting.
IV-A Deep Detection Architecture for LSTD
We first design a flexible deep learning architecture, which can alleviate transfer difficulties from large-scale source domain to low-shot target domain. As shown in Fig. 2, it is a Faster-RCNN-style detection framework but with SSD-style region proposal network (RPN).
Why Choose a Faster-RCNN-Style Framework. This is mainly credited to one key design in Faster RCNN, i.e., the region proposal network (RPN). More specifically, the object classifier in RPN is used to identify whether a proposal is an object or not. Hence, it can learn the common traits of different objects, e.g., clear edges, uniform texture, and so on. By pretraining RPN on a large-scale detection benchmark in the source domain, we can obtain high-quality proposals for low-shot detection in the target domain. In contrast, object classification in SSD is one-stage, i.e., it has to deal with thousands of randomly-initialized proposals directly. This can deteriorate the target detection task, especially with only a few annotated images.
Why Design an SSD-Style RPN. In the standard RPN, the bounding box regressor is designed separately for each category. This means that the parameters of the regressor have to be re-initialized for each new domain, since object categories in different domains often do not overlap in practice. This increases the overfitting risk during fine-tuning, when the target domain only contains a few annotated images. Alternatively, we adapt RPN into an SSD style. Since the regressor in SSD is shared among all object categories, the pretrained parameters of the SSD-style RPN can be reused as initialization in the low-shot target domain. This avoids the random re-initialization in the standard RPN, and thus reduces the fine-tuning burden in the low-shot domain. In addition, we directly use the SSD-style RPN for object localization, without the regression refinement after ROI pooling used in the standard Faster RCNN. The main reason is that the multiple-convolutional-feature design of SSD is sufficient to localize objects of various sizes.
Discussion. Our deep architecture aims at reducing transfer learning difficulties for low-shot detection in the target domain. To achieve it, we flexibly leverage the core designs of both Faster RCNN and SSD. Additionally, this detector performs bounding box regression and object classification on two relatively separate places, which can further decompose the learning difficulties in low-shot detection.
IV-B Regularized Transfer Learning for LSTD
After designing a flexible deep architecture, we introduce an end-to-end regularized transfer learning framework for LSTD in Fig. 3. Specifically, this transfer procedure involves two detectors. Source Detector. We first use the large-scale detection benchmark in the source domain to train the basic architecture in Fig. 2. As a result, we obtain a source detector which contains rich source-domain object knowledge for generalization. Warm-Up Detector. After pretraining, we use source detector as initialization, and perform fine-tuning with a few fully-annotated training images in the target domain. The resulting detector is a target-domain detector. Since it can stabilize weakly-supervised transfer detection later on, we call it as the warm-up detector. We next describe how to perform fine-tuning to obtain the warm-up detector via LSTD.
Total Loss for LSTD. Due to the low-shot property, direct fine-tuning is prone to overfitting. To alleviate this challenge, we propose two novel regularization terms, i.e., background depression (BD) and source detection knowledge (SDK). Specifically, the total loss of fine-tuning can be written as
$$\mathcal{L}_{\mathrm{LSTD}} = \lambda_{\mathrm{main}}\mathcal{L}_{\mathrm{main}} + \lambda_{\mathrm{BD}}\mathcal{L}_{\mathrm{BD}} + \lambda_{\mathrm{SDK}}\mathcal{L}_{\mathrm{SDK}}, \tag{1}$$
where $\mathcal{L}_{\mathrm{main}}$ refers to the standard loss summation of bounding box regression and object classification in the warm-up detector, and $\mathcal{L}_{\mathrm{BD}}$ and $\mathcal{L}_{\mathrm{SDK}}$ refer to the BD and SDK regularization. $\lambda_{\mathrm{main}}$, $\lambda_{\mathrm{BD}}$ and $\lambda_{\mathrm{SDK}}$ are respectively the coefficients of the main loss, BD and SDK regularization.
Background Depression (BD) Regularization. In the low-shot scenario, the complex background may disturb the localization performance. For this reason, we propose a novel background-depression (BD) regularization, by using the ground-truth bounding boxes of fully-annotated training images in the target domain. First, we feed a target image into warm-up detector, and generate the convolutional feature cube from a middle-level convolutional layer. Second, we mask this convolutional cube with the ground-truth bounding boxes of all the objects in the image. Consequently, we can identify the feature regions that correspond to the image background, namely $\mathbf{F}_{\mathrm{BD}}$. To depress the background disturbances, we use L2 regularization to penalize the activation of $\mathbf{F}_{\mathrm{BD}}$,
$$\mathcal{L}_{\mathrm{BD}} = \|\mathbf{F}_{\mathrm{BD}}\|_{2}. \tag{2}$$
With this $\mathcal{L}_{\mathrm{BD}}$, warm-up detector can suppress background regions while paying more attention to target objects, which is especially important for training with a few images. As shown in Fig. 4, our BD regularization can successfully reduce the background disturbances.
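As an illustration of the BD regularization, the following minimal sketch masks a feature cube with ground-truth boxes and takes the L2 norm of the remaining background activations. It is written in plain Python for readability rather than in the Caffe implementation used in the experiments; the function name `background_l2`, the nested-list feature layout, and the convention that boxes are given in feature-map cells as half-open `(x0, y0, x1, y1)` ranges are assumptions made for this example.

```python
import math


def background_l2(feature, boxes):
    """L2 norm of the activations that lie outside all ground-truth boxes.

    feature: list of channels, each an H x W nested list of floats.
    boxes:   list of (x0, y0, x1, y1) in feature-map cells (assumed
             half-open, i.e., x0 <= x < x1 and y0 <= y < y1).
    """
    total = 0.0
    for channel in feature:
        for y, row in enumerate(channel):
            for x, value in enumerate(row):
                inside = any(
                    x0 <= x < x1 and y0 <= y < y1
                    for (x0, y0, x1, y1) in boxes
                )
                # Only background cells contribute to the penalty.
                if not inside:
                    total += value * value
    return math.sqrt(total)
```

Minimizing this quantity pushes background activations toward zero while leaving the activations inside the annotated boxes untouched.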
Source Detection Knowledge (SDK) Regularization. Our key insight is that, rich object knowledge in the large-scale source domain provides extra information about target domains. As shown in Fig. 5, Cow or Aeroplane may have a high response on Bear or Kite, due to the color or shape similarity. Apparently, this knowledge is important for fine-tuning warm-up detector, especially with few fully-annotated images in the target domain. Hence, we propose to distill it for each object proposal in the target domain.
(1) Extracting SDK from Source Detector. First, we feed a target image into warm-up detector, which produces $N$ object proposals for this image. Then, we put these target proposals into the ROI pooling layer of source detector, which generates an SDK matrix $\mathbf{P}_s \in \mathbb{R}^{(C_s+1)\times N}$ from the final FC layer. Each column of $\mathbf{P}_s$ is the probability vector of a target proposal w.r.t. the $C_s+1$ categories (objects + background) in the source domain.
(2) Leveraging SDK into Warm-Up Detector. To incorporate SDK into fine-tuning, we add an SDK prediction branch at the end of warm-up detector. This branch produces a prediction matrix $\mathbf{P}_w \in \mathbb{R}^{(C_s+1)\times N}$ for target proposals, where each column is the prediction vector of a target proposal w.r.t. the $C_s+1$ categories (objects + background) in the source domain. Consequently, we apply cross entropy between $\mathbf{P}_s$ and $\mathbf{P}_w$ as a regularization,
$$\mathcal{L}_{\mathrm{SDK}} = \mathrm{CrossEntropy}(\mathbf{P}_s, \mathbf{P}_w). \tag{3}$$
In this case, SDK can be effectively integrated into fine-tuning, which generalizes low-shot detection in the target domain.
(3) Discussion. LSTD is a learning step, instead of a detector. In this step, we transfer source detector (large-scale, source domain) into warm-up detector (low-shot, target domain). Specifically, we train warm-up detector with the regularization of source detector, i.e., we use source detector to extract SDK, and leverage it as extra supervision for training warm-up detector. As a result, LSTD can generalize warm-up detector with low-shot training images in the target domain. Moreover, it is worth mentioning that, our SDK transfer is different from knowledge distillation . First, knowledge distillation is originally designed for model compression, while our SDK is proposed for transfer learning. Second, SDK is performed on each proposal for object detection. Hence, it is not the standard way of knowledge transfer, which often works on the whole image for object classification.
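The SDK regularization described in steps (1) and (2) can be sketched as a per-proposal cross entropy between the soft source-domain probabilities and the output of the SDK prediction branch. This is only an illustrative sketch: the names `softmax` and `sdk_loss` are hypothetical, the branch is assumed to output raw logits, and averaging over proposals is an assumption that the text does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one proposal's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def sdk_loss(source_probs, branch_logits):
    """Cross entropy between source SDK and the SDK branch prediction.

    source_probs:  one probability vector per proposal (source classes
                   plus background), produced by the source detector.
    branch_logits: one logit vector per proposal from the SDK branch.
    The result is averaged over proposals (an assumption of this sketch).
    """
    loss = 0.0
    for target, logits in zip(source_probs, branch_logits):
        predicted = softmax(logits)
        loss -= sum(t * math.log(p) for t, p in zip(target, predicted))
    return loss / len(source_probs)
```

Because the targets are soft probability vectors rather than one-hot labels, a proposal that looks partly like several source categories supervises the branch with all of them at once.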
V Weakly-Supervised Transfer Detection (WSTD)
Via LSTD, we transfer source detector into warm-up detector in the target domain. Next, we apply weakly-annotated images to further boost target detection task. To achieve it, we design another critical stage of POTD, i.e., Weakly-Supervised Transfer Detection (WSTD), which can effectively handle weakly-annotated images in a fully end-to-end transfer framework (Fig. 6). Specifically, this transfer procedure involves two detectors. Warm-Up Detector. LSTD produces a warm-up detector, which is transferred from source detector, and trained on fully-annotated target images. Hence, this detector leverages the object knowledge from both source and target domains. Target Detector. We use warm-up detector as initialization, and perform fine-tuning with weakly-annotated images in the target domain. The resulting detector is a target-domain detector. Since it is used to produce the final detection result in the target domain, we call it as the target detector. We next describe how to perform fine-tuning to obtain target detector via WSTD.
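Total Loss for WSTD. Specifically, we introduce the total loss of WSTD with two distinct parts,
$$\mathcal{L}_{\mathrm{WSTD}} = \mu_{\mathrm{SDK}}\mathcal{L}_{\mathrm{SDK}} + \mu_{\mathrm{ROL}}\mathcal{L}_{\mathrm{ROL}}, \tag{4}$$
where $\mu_{\mathrm{SDK}}$ and $\mu_{\mathrm{ROL}}$ are the corresponding loss coefficients. On one hand, we transfer the object supervision of warm-up detector via $\mathcal{L}_{\mathrm{SDK}}$, which can further regularize target detector to enhance the generalization capacity. On the other hand, we design a novel recurrent object labelling (ROL) mechanism via $\mathcal{L}_{\mathrm{ROL}}$, which can exploit confident proposals for learning to detect with weakly-annotated images in the target domain.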
V-A Object Supervision From Warm-Up Detector
As mentioned before, weakly-annotated images lack object-level supervision, which makes the learning procedure trap into an unsatisfactory solution. Alternatively, warm-up detector can inherit rich object knowledge of source detector by LSTD. Hence, we propose to further transfer warm-up detector as target detector (Fig. 6), which can leverage the reliable object supervision to enhance weakly-supervised detection in the target domain.
Object Supervision from Warm-Up Detector. We fix warm-up detector as an extractor of object knowledge. First, we use the pretrained RPN of warm-up detector as an online proposal generator, which can produce high-quality proposals for weakly-annotated images in the target domain. Second, we feed these proposals into the ROI pooling layer of warm-up detector, and subsequently generate source detection knowledge from the SDK prediction branch.
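Integrating Warm-Up Supervision into Target Detector. We add an extra FC layer as the SDK prediction branch of target detector. This branch produces a prediction matrix $\mathbf{P}_t$ for object proposals of weakly-annotated images. By using cross entropy between the SDK $\mathbf{P}_w$ extracted from warm-up detector and the prediction $\mathbf{P}_t$, we integrate warm-up supervision into target detector,
$$\mathcal{L}_{\mathrm{SDK}} = \mathrm{CrossEntropy}(\mathbf{P}_w, \mathbf{P}_t). \tag{5}$$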
Discussion. To further stabilize target detector with weakly-annotated images, we propose to leverage object supervision from warm-up detector. Note that, this supervision refers to source detection knowledge (SDK). It is transferred from source detector to warm-up detector (by Eq.(3)), and then from warm-up detector to target detector (by Eq. (5)). In other words, we use warm-up detector as a middle-stage, instead of directly transferring from source detector to target detector. Our key insight is that, warm-up detector is fine-tuned with fully-annotated target images during LSTD, i.e., it has been adaptively adjusted to produce effective target-domain proposals, and can reliably represent source-domain knowledge of these target proposals. Via such progressive learning, one can gradually promote the generalization capacity of object detector in the target domain.
V-B Recurrent Object Labelling (ROL)
For learning to detect with weakly-annotated images, we leverage source domain knowledge (i.e., object supervision of warm-up detector) in the previous section. Next, we integrate target domain knowledge for weakly-supervised detection. Specifically, we propose a recurrent object labelling (ROL) mechanism, which can exploit support proposals of weakly-annotated images, and refine object classifier online.
Classifier 1: Image-Level Supervision. After generating proposals of a weakly-annotated image, we feed them into ROI pooling of target detector (Fig. 6). For each proposal, conv13 in the target detector generates a convolutional feature cube. We subsequently feed these cubes into recurrent object labelling (Fig. 7), which can identify confident proposals for classifier refinement. Specifically, we pass all the convolutional cubes through Classifier 1, and obtain a prediction matrix $\mathbf{S}^{1} \in \mathbb{R}^{C_t \times N}$. Each column of $\mathbf{S}^{1}$ refers to the probability vector of a proposal, $N$ is the number of proposals, and $C_t$ is the number of object categories in the target domain. Note that we only have the ground truth image label $\mathbf{y}$, since each image is weakly annotated in the stage of WSTD. To integrate this supervision into training, we sum over proposals and obtain an image score vector $\boldsymbol{\phi}$,
$$\phi_c = \sum_{n=1}^{N} \mathbf{S}^{1}_{c,n}. \tag{6}$$
Since there may be multiple objects in one image (i.e., multiple entries of $\mathbf{y}$ can be 1), we apply the multi-label loss for Classifier 1,
$$\mathcal{L}^{1}_{\mathrm{ROL}} = \mathrm{MultiLabelLoss}(\boldsymbol{\phi}, \mathbf{y}). \tag{7}$$
Via this image-level supervision, we can enhance the reliability of the proposal score $\mathbf{S}^{1}$, which will be used for object labelling afterwards.
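The image-level supervision above can be sketched as follows. The aggregation by summing proposal scores follows the text; writing the multi-label loss as a per-class binary cross entropy with clamping is an assumption of this sketch, since the exact form of the loss is not spelled out here. The function names are hypothetical.

```python
import math


def image_scores(proposal_scores):
    """Sum proposal scores per class: one row per class, one column per proposal."""
    return [sum(row) for row in proposal_scores]


def multi_label_loss(image_score, image_label, eps=1e-7):
    """Binary cross entropy summed over classes (assumed form of the loss).

    image_score: aggregated score per class.
    image_label: 0/1 image-level label per class; several entries may be 1.
    Scores are clamped to (eps, 1 - eps) to keep the logarithms finite.
    """
    loss = 0.0
    for score, label in zip(image_score, image_label):
        score = min(max(score, eps), 1.0 - eps)
        loss -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return loss
```

Only image labels enter the loss, yet its gradient flows back to individual proposal scores, which is what makes those scores usable for the labelling step that follows.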
Classifier $k$ ($k \geq 2$): Support Proposal Mining. After obtaining the proposal score $\mathbf{S}^{k-1}$ from the previous classifier, we label object proposals in a recurrent manner. It is worth mentioning that not all the proposals are confident enough to describe objects or background in an image. Hence, we design a support proposal mining procedure, where we label the highly-confident object and background proposals according to the proposal score matrix $\mathbf{S}^{k-1}$. In the following, we illustrate the entire labelling procedure at Classifier $k$, where we assume that the object category $c$ exists in an image.
(1) Labelling Support Object Proposals. First, we find the $c$-th row of the prediction score matrix $\mathbf{S}^{k-1}$, and then pick out the most confident proposal $n_c^{*}$, which has the highest score,
$$n_c^{*} = \arg\max_{n} \mathbf{S}^{k-1}_{c,n}. \tag{8}$$
Subsequently, we label this proposal as the ‘ground truth box’ for category $c$, and use its score $\mathbf{S}^{k-1}_{c,n_c^{*}}$ as the pseudo label. Moreover, the spatial context is often important for weakly-supervised detection. Hence, we further assign category $c$ to those proposals that are spatially adjacent to the highest-score proposal, as measured by IoU.
(2) Labelling Support Background Proposals. Traditionally, all the unlabelled proposals are assigned to the background category. This design apparently introduces a large number of wrong annotations, since the unlabelled proposals that are far from the highest-score proposal $n_c^{*}$ may also contain objects. To alleviate this, we propose to exploit support background proposals among the remaining unlabelled ones. Specifically, we assign the background category to the ones that have a moderate overlap (measured by IoU) with the highest-score proposal $n_c^{*}$. As a result, we can reduce wrong or redundant annotations, which often lead to unstable learning for weakly-supervised detection.
(3) Classification Loss. After labelling all the object categories in a weakly-annotated image, we obtain the score matrix $\mathbf{Y}^{k}$ as the pseudo label. For the support proposals of category $c$, we assign the highest score $\mathbf{S}^{k-1}_{c,n_c^{*}}$ to the corresponding entries of $\mathbf{Y}^{k}$. This design can effectively associate the object category with its confident proposals, which stabilizes the training procedure of weakly-supervised detection. For the unlabelled proposals, the corresponding columns are zero vectors in $\mathbf{Y}^{k}$. Finally, we obtain the prediction score matrix $\mathbf{S}^{k}$ from Classifier $k$, and compute the cross entropy loss between $\mathbf{S}^{k}$ and $\mathbf{Y}^{k}$ for training,
$$\mathcal{L}^{k}_{\mathrm{ROL}} = \mathrm{CrossEntropy}(\mathbf{Y}^{k}, \mathbf{S}^{k}). \tag{9}$$
This procedure is done in a recurrent manner, and the total loss of our ROL module is
$$\mathcal{L}_{\mathrm{ROL}} = \sum_{k=1}^{K} \mathcal{L}^{k}_{\mathrm{ROL}}, \tag{10}$$
where $K$ is the number of classifiers.
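The support proposal mining steps for one object category can be sketched as below. The two IoU thresholds (`obj_thr=0.5`, `bg_thr=0.1`) are placeholder values chosen purely for illustration and should not be read as the settings used in the experiments; the function names and the `(x0, y0, x1, y1)` box format are likewise assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mine_support_proposals(boxes, scores, obj_thr=0.5, bg_thr=0.1):
    """Label proposals for one category that is present in the image.

    boxes:  proposal boxes; scores: the category's score for each proposal
    from the previous classifier. Thresholds are illustrative assumptions.
    Returns (index of the top proposal, its score used as pseudo label,
    per-proposal labels: "object", "background", or None if unlabelled).
    """
    top = max(range(len(boxes)), key=lambda n: scores[n])
    labels = []
    for box in boxes:
        overlap = iou(box, boxes[top])
        if overlap >= obj_thr:
            labels.append("object")       # spatially adjacent to the top proposal
        elif overlap >= bg_thr:
            labels.append("background")   # moderate overlap: contextual background
        else:
            labels.append(None)           # far away: left unlabelled
    return top, scores[top], labels
```

Leaving distant proposals unlabelled, rather than forcing them into the background class, is the point of the design: a second object elsewhere in the image is then not trained as a negative.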
|LSTD||Source (large-scale, fully-annotated)||Target (low-shot, fully-annotated)|
|Task 1||COCO (standard 80 classes, 118,287 training images)||ImageNet2015 (chosen 50 classes)|
|Task 2||COCO (chosen 60 classes, 98,459 training images)||VOC2007 (standard 20 classes)|
|Task 3||ImageNet2015 (chosen 181 classes, 151,603 training images)||VOC2010 (standard 20 classes)|
|Deep Models (mAP)||large-scale source||low-shot target|
|Faster RCNN ||21.9||12.2|
|Shots for Task 1 (mAP)||1||2||5||10||30|
|Shots for Task 2 (mAP)||1||2||5||10||30|
|Shots for Task 3 (mAP)||1||2||5||10||30|
Discussion. First, we clarify the main difference between Online Instance Classifier Refinement (OICR)  and our ROL. Specifically, OICR is a label propagation and refinement technique for weakly-supervised detection. Similar to our ROL, it labels support object proposals via IOU. However, OICR assigns the background category without selection, while our ROL takes support proposal mining into account. As shown in Fig. 8, OICR labels many redundant proposals as background. What is worse, it tends to label true objects (e.g., left horse in the 1st image) mistakenly as background, and these noisy annotations can deteriorate the refining procedure of object classifier. On the contrary, our ROL carefully labels the contextual background regions around the confident proposals, which can reduce the labelling redundancy and improve the quality of training samples. Finally, we would like to emphasize that, our WSTD is used to imitate the generalization process of human learning, i.e., humans generalize the warm-up stage by exploiting objects from wild images without full annotations. To achieve it, WSTD investigates reliable object supervision from warm-up detector of LSTD. More importantly, it uses ROL for learning to detect objects from weakly-annotated images. In this case, WSTD can further generalize the detection performance in the target domain. Next, we evaluate progressive object transfer detection (POTD), by extensive experiments on a number of challenging datasets.
In this section, we evaluate the performance of the proposed POTD on a number of challenging data settings. First, since POTD consists of two transfer detection stages (i.e., LSTD and WSTD), we deeply investigate them from different experimental aspects. Then, we compare our POTD with the state-of-the-art approaches, and show our contributions in practice.
Data Settings. Since LSTD is a regularized transfer learning framework for low-shot detection, we adopt a number of detection benchmarks, i.e., COCO, ImageNet2015, VOC2007 and VOC2010, respectively as source and target of three transfer tasks (Table II). The training set is large-scale in the source domain of each task, while it is low-shot in the target domain (1/2/5/10/30 training images for each target-object class). The fully-annotated training shots are randomly selected in our experiments. To make the results reproducible, we provide the data splits for all the tasks (https://github.com/Cassie94/LSTD/tree/master/data_split). Furthermore, the object categories for source and target are carefully selected to be non-overlapping, in order to evaluate whether LSTD can detect unseen object categories from few shots in the target domain. Finally, we use the standard PASCAL VOC detection rule on the test sets to report mean average precision (mAP) at an intersection-over-union (IoU) of 0.5.
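For readers unfamiliar with the evaluation rule, the following minimal sketch shows how detections of one class in one image can be matched to ground truth at IoU 0.5 and turned into an average precision. It is not the official PASCAL VOC evaluation code: the function names are hypothetical, and the area-under-curve AP with a monotone precision envelope is one common variant, assumed here for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_thr=0.5):
    """Return true-positive flags for (score, box) detections, best score first."""
    used = [False] * len(gt_boxes)
    flags = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda j: overlaps[j], default=-1)
        hit = best >= 0 and overlaps[best] >= iou_thr and not used[best]
        if hit:
            used[best] = True  # each ground-truth box can be matched once
        flags.append(hit)
    return flags


def average_precision(flags, num_gt):
    """Area under the precision-recall curve with a monotone precision envelope."""
    tp, precisions, recalls = 0, [], []
    for rank, flag in enumerate(flags, start=1):
        tp += 1 if flag else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous) * p
        previous = r
    return ap
```

mAP is then the mean of the per-class AP values, with detections pooled over all test images of a class before ranking.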
Implementation Details. Unless stated otherwise, we perform LSTD as follows. First, the basic deep architecture of LSTD is built upon VGG16, similar to SSD and Faster RCNN. For bounding box regression, we use the same structure as the standard SSD. For object classification, we apply the ROI pooling layer on conv7, and add two convolutional layers (conv12 and conv13) before the object classifier. Second, we select 100/100/64 proposals (task 1/2/3) for training source detector, while we select 64/64/64 proposals for warm-up detector. The loss coefficients of both BD and SDK are 0.5. Finally, the optimization strategy for both source and target is Adam, where the initial learning rate is 0.0002 (with 0.1 decay), the momentum/momentum2 is 0.9/0.99, and the weight decay is 0.0001. All our experiments are implemented in Caffe. In the following, we evaluate the key designs of LSTD. To be fair, when we explore different settings of one design, the others use the basic setting above.
Basic Deep Structure of LSTD. We first evaluate the basic deep structure of LSTD in the source and target domains respectively, where we compare it with the closely-related SSD and Faster RCNN. For fairness, we choose task 1 to show the effectiveness. The main reason is that the source data in task 1 is the standard COCO detection set, on which SSD and Faster RCNN are well-trained with state-of-the-art performance. Hence, we use the published SSD and Faster RCNN models in this experiment, where SSD and our LSTD use the same input image size, and Faster RCNN follows the settings in the original paper. In Table III, we report mAP on the test sets of both source and target domains in task 1. One can see that our LSTD achieves a competitive mAP in the source domain. This illustrates that LSTD can be a state-of-the-art deep detector for large-scale training sets. More importantly, our LSTD outperforms both SSD and Faster RCNN significantly for low-shot detection in the target domain (one training image per target category), where all approaches are simply fine-tuned from their pre-trained models in the source domain. This shows that LSTD yields a more effective deep architecture for low-shot detection, by leveraging the core designs of SSD and Faster RCNN. Finally, we investigate the structure robustness of LSTD itself. As the bounding box regression follows the standard SSD, we explore the object classifier, for which we choose different convolutional layers for ROI pooling. The results are comparable in Table III, showing the architecture robustness of LSTD. For consistency, we use conv7 for ROI pooling in all our experiments.
Regularized Transfer Learning of LSTD. We mainly evaluate whether the proposed regularization can enhance transfer learning for LSTD, in order to boost low-shot detection. As shown in Table IV, both SDK and BD can significantly improve the baseline (i.e., direct fine-tuning), especially when the training set is scarce in the target domain (such as one-shot). Additionally, we show the architecture robustness of BD regularization in Table V. Specifically, we perform BD regularization on different convolutional layers. One can see that BD is generally robust to the choice of convolutional layer. Hence, we apply BD on one fixed convolutional layer in our experiments for consistency.
|LSTD (Full Annotation)||WSTD (Weak Annotation)||mAP|
|mAP of WSTD||Classifier 1||Classifier 2||Classifier 3|
Data Settings. We mainly evaluate WSTD on task 2, since it is the most representative setting among the three tasks in the previous section. First, one image contains multiple objects in COCO and VOC2007, while one image contains a single object in ImageNet2015. Hence, task 2 is the more realistic data setting. Second, the target domain is VOC2007 with the standard 20 categories, which makes comparison with other approaches more convenient. Furthermore, we later extend the target domain of task 2 to VOC2010 and VOC2012, in order to further show the effectiveness of WSTD.
Implementation Details. First, we start WSTD from the warm-up detector, which is trained in the LSTD stage with 1 fully-annotated image per category of VOC2007. We then use WSTD to train the target detector on weakly-annotated training images. Second, we choose 128 proposals for each image in WSTD, after applying non-maximum suppression (NMS) to 1500 proposals with a threshold of 0.75. Moreover, we enlarge both loss coefficients in WSTD (i.e., 50 for ROL and 150 for SDK) to speed up convergence. Finally, we optimize with Adam, where the initial learning rate is 0.00001 (with 0.1 decay) and the weight decay is 0.0001. All other settings are the same as in LSTD.
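The proposal selection step described above can be illustrated with a minimal sketch. It assumes a standard greedy NMS over boxes in `[x1, y1, x2, y2]` format with positive area; the function names `iou_one_to_many` and `select_proposals` and their parameter names are hypothetical and are not taken from our implementation. Only the values 1500, 0.75 and 128 come from the settings above.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_proposals(boxes, scores, pre_nms_top_n=1500,
                     nms_thresh=0.75, post_nms_top_n=128):
    """Greedy NMS over the top-scoring proposals, keeping at most post_nms_top_n.

    Returns the indices of the kept proposals, in decreasing score order.
    """
    # Start from the highest-scoring candidates only.
    order = np.argsort(-scores)[:pre_nms_top_n]
    keep = []
    while order.size > 0 and len(keep) < post_nms_top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop candidates that overlap the kept box by more than the threshold.
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= nms_thresh]
    return keep
```

With a threshold as high as 0.75, only near-duplicate boxes are suppressed, so the retained proposals may still overlap considerably.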
WSTD vs. LSTD. After applying LSTD with few fully-annotated images, we continue with WSTD on weakly-annotated images. The result is shown in Table VI. First, when the number of fully-annotated training images in LSTD is small, WSTD improves the detection mAP with weakly-annotated images. For example, the mAP is 34.0 when we fully annotate 1 shot per class in LSTD. It rises to 62.6 after we use 250 weakly-annotated images per class in WSTD, and stabilizes around 62.8 after we use 800 weakly-annotated images per class. This illustrates that WSTD can significantly boost low-shot detection via weakly-annotated images, and that the gain saturates as the number of weakly-annotated images increases. Second, as the number of fully-annotated training images in LSTD grows, the effect of weakly-annotated images in WSTD shrinks. For example, the detection mAP is 69.7 when we fully annotate 30 shots per class in LSTD. After we use around 250 weakly-annotated images per class in WSTD, the mAP increases only slightly, to 70.5. This illustrates that fully-annotated images contribute more to detection in the target domain. Third, there is a tradeoff between LSTD and WSTD. For example, the mAP is 70.1 when we use 10 fully-annotated images (per class) in LSTD and around 800 weakly-annotated images (per class) in WSTD. Alternatively, we obtain a similar mAP (70.5) when we use 30 fully-annotated images (per class) in LSTD and around 250 weakly-annotated images (per class) in WSTD. Weakly-annotated images can be obtained straightforwardly from the internet, while fully-annotated images require an exhaustive labelling procedure. Hence, it is preferable to use few fully-annotated images in LSTD and mostly weakly-annotated images in WSTD.
Object Supervision from Warm-Up Detector. We evaluate whether object supervision from the warm-up detector is effective for WSTD. Specifically, we consider the following three cases. SDK (without) denotes that we ignore the SDK of the warm-up detector when performing WSTD. SDK (unweighted) denotes that we use the SDK loss in Eq. (5) when performing WSTD. SDK (weighted) denotes that we add an extra weight to the SDK loss that is used as object supervision in Eq. (5). In our experiment, this weight is chosen to further enhance the importance of SDK. The result is shown in Fig. 9. One can see that SDK (unweighted) is comparable to SDK (without), which illustrates that SDK needs to be further exploited at the object level. Furthermore, SDK (weighted) achieves the best result among these settings, showing that the extra weight can further take the importance of SDK into account.
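The difference between the unweighted and weighted settings can be sketched as follows. This is only an illustration under stated assumptions: Eq. (5) is not reproduced here, so the sketch assumes a distillation-style form in which the target detector (student) is matched to the softened predictions of the warm-up detector (teacher) by cross-entropy. The names `log_softmax`, `weighted_sdk_loss`, `weights` and `temperature` are hypothetical.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the class axis."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def weighted_sdk_loss(student_logits, teacher_logits, weights=1.0, temperature=1.0):
    """Weighted cross-entropy between teacher soft labels and student predictions.

    ASSUMPTION: a distillation-style stand-in for the SDK loss, not Eq. (5) itself.
    `weights` is a scalar or one value per proposal; weights=1.0 is the unweighted case.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, temperature))
    log_p_student = log_softmax(student_logits, temperature)
    # Cross-entropy per proposal, shape (num_proposals,).
    per_proposal = -(p_teacher * log_p_student).sum(axis=1)
    return float((np.asarray(weights) * per_proposal).mean())
```

Under this assumed form, a scalar weight simply rescales the SDK term relative to the other losses, while a per-proposal weight emphasizes some proposals over others.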
Recurrent Object Labelling (ROL). We mainly evaluate ROL with respect to the number of proposals, the IoU thresholds, and a comparison with OICR. (1) Number of Proposals. As shown in Fig. 10 (a), the mAP of WSTD first increases and then decreases as we increase the number of proposals for ROL. This illustrates that a sufficient number of proposals is required for good detection performance, but too many proposals may introduce noisy annotations that deteriorate ROL. Hence, we choose 128 proposals in the remaining experiments, as this setting performs best. (2) IoU Threshold. In ROL, we use two IoU thresholds to selectively mine support proposals, one for proposals labelled as objects and one for proposals labelled as background. Hence, we evaluate the influence of these thresholds in Fig. 10 (b)-(c), where we vary the threshold for proposals labelled as objects and the threshold for proposals labelled as background in turn, keeping the other fixed. As either threshold increases, the mAP of WSTD first increases and then decreases. This shows that the thresholds should be neither too loose nor too tight for support proposal mining. Hence, we fix both thresholds at their best-performing values in the remaining experiments.
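The role of the two IoU thresholds can be sketched as follows. This is a minimal sketch, not our implementation: it assumes that proposals are compared against a single seed box, that proposals with IoU at or above an object threshold are labelled as objects, that those with IoU below a background threshold are labelled as background, and that the rest are ignored. The function names, the label encoding and the direction of the inequalities are assumptions, and the thresholds are left as arguments because the values we use are not restated here.

```python
import numpy as np

def iou_with_seed(seed, boxes):
    """IoU between a seed box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(seed[0], boxes[:, 0])
    y1 = np.maximum(seed[1], boxes[:, 1])
    x2 = np.minimum(seed[2], boxes[:, 2])
    y2 = np.minimum(seed[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    seed_area = (seed[2] - seed[0]) * (seed[3] - seed[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (seed_area + areas - inter)

def label_support_proposals(boxes, seed_index, obj_thresh, bg_thresh):
    """Label proposals relative to a seed: 1 = object, 0 = background, -1 = ignored.

    ASSUMPTION: requires bg_thresh <= obj_thresh; proposals whose IoU falls
    between the two thresholds are treated as ambiguous and left unlabelled.
    """
    ious = iou_with_seed(boxes[seed_index], boxes)
    labels = np.full(len(boxes), -1, dtype=int)
    labels[ious < bg_thresh] = 0
    labels[ious >= obj_thresh] = 1
    return labels
```

In this sketch, raising the object threshold leaves fewer but cleaner object labels, while raising the background threshold admits more proposals as background at the risk of including partial objects, which is consistent with the rise-then-fall trend in Fig. 10 (b)-(c).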
(3) ROL vs. OICR. OICR is Online Instance Classifier Refinement, which addresses weakly-supervised detection via label propagation and refinement. Like our ROL, it can be used with any detection backbone. To further show the effectiveness of ROL, we perform WSTD with OICR and with our proposed ROL respectively (both use 128 proposals). In Table VII, our ROL significantly outperforms OICR for all the classifiers in the recurrent steps. This is mainly because ROL selects support proposals attentively, which largely reduces redundant proposals and noisy annotations. Furthermore, we evaluate ROL vs. OICR on each category of VOC2007/2010/2012. For most object categories in Tables VIII, IX and X, our ROL outperforms OICR by a large margin, showing the effectiveness of ROL.
|Faster RCNN ||69.9||-||70.4 (07+12)|
|SSD ||68.0||-||72.4 (07+12)|
|Low-Shot Transfer (mAP)||VOC07||VOC10||VOC12|
VI-C Comparison with the State-of-the-Art
In this section, we compare our POTD with a number of state-of-the-art approaches in Table XI. First, the performance of weakly-supervised / semi-supervised detectors is far from competitive with fully-supervised detectors, due to the lack of object-level supervision. In contrast, low-shot transfer detectors achieve competitive results, even though only a few images are fully annotated in the target domain. The main reason is that these transfer detectors mimic human learning by leveraging source-domain knowledge as an object prior. Second, we compare POTD with the recent computational baby learning (CBL). One can see that POTD is competitive with CBL while using much less weakly-annotated data, i.e., POTD uses 250 weakly-annotated images per category while CBL uses 20,000 weakly-annotated videos per category. Finally, the subscripted entry denotes the setting in which all training images are fully annotated. In this case, POTD simply reduces to LSTD without using weakly-labeled images. As expected, it significantly outperforms CBL, and achieves results comparable to the fully-supervised detectors. The main reason is that POTD leverages the key insight of transfer detection, inheriting rich source knowledge to boost detection accuracy in the target domain.
Detection Visualization. We visualize the two transfer detection stages of POTD in Fig. 11, where one training image per category is fully annotated in the target domain (VOC2007). First, LSTD achieves reasonably good performance by transferring object knowledge from the source domain. Second, WSTD further generalizes detection with weakly-annotated images. Hence, our approach boosts target-domain detection progressively and effectively, with little annotation burden.
Error Mode Analysis. We compare LSTD with WSTD through the error mode analysis on VOC2007 in Fig. 12. First, WSTD largely reduces the various error types of LSTD, according to the 1st and 2nd rows of Fig. 12. This shows that WSTD can further generalize the target detector with weakly-annotated images. Second, the 3rd and 4th rows of Fig. 12 show that the distribution of error types for LSTD differs from that for WSTD. The main reason is that the LSTD stage is built on low-shot but fully-annotated images, while the WSTD stage is built on large-scale but weakly-annotated images. As a result, LSTD is largely confused by similar objects and/or other categories, due to the lack of training images, whereas WSTD is largely confused by poor localization and/or background, due to the lack of object-level supervision.
In this paper, we propose a novel progressive object transfer detection (POTD) framework. First, POTD effectively integrates various object supervision of different domains into a progressive detection procedure, i.e., from source to target domains, from large to few data, from full to weak annotations. Via this human-like learning, POTD can boost a target detection task with little annotation burden. Second, each detection stage in POTD is efficiently designed with delicate transfer insights, where LSTD is used as warm-up for WSTD generalization. Finally, we conduct extensive experiments to show that POTD outperforms other state-of-the-art methods.
- (2015) Weakly supervised object detection with convex clustering. In CVPR, pp. 1081–1089.
- (2016) Weakly Supervised Deep Detection Networks. In CVPR, pp. 2846–2854.
- (2016) Weakly supervised deep detection networks. In CVPR.
- (2018) LSTD: A Low-Shot Transfer Detector for Object Detection. In AAAI.
- (2017-01) Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning. PAMI 39 (1), pp. 189–203.
- (2009) Imagenet: A large-scale hierarchical image database. In CVPR.
- (2017) Weakly Supervised Cascaded Convolutional Networks. In CVPR, pp. 5131–5139.
- (2016) Weakly Supervised Cascaded Convolutional Networks. arXiv:1611.08258.
- (1997) Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artif. Intell. 89, pp. 31–71.
- (2017) Few-shot Object Detection. arXiv preprint arXiv:1706.08249.
- (2010) The Pascal Visual Object Classes (VOC) Challenge. IJCV.
- (2015) Fast R-CNN. In ICCV.
- (2017) Mask R-CNN. In ICCV.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2014) LSDA: Large Scale Detection Through Adaptation. In NIPS.
- (2014) Caffe: Convolutional architecture for fast feature embedding. In Multimedia, pp. 675–678.
- (2017) Deep Self-Taught Learning for Weakly Supervised Object Localization. In CVPR, pp. 4294–4302.
- (2016) ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization. In ECCV.
- (2015) Adam: A Method for Stochastic Optimization. ICLR.
- (2017) Saliency Guided End-to-End Learning for Weakly Supervised Object Detection. In IJCAI.
- (2016) Weakly Supervised Object Localization with Progressive Domain Adaptation. In CVPR.
- (2016) Weakly supervised object localization with progressive domain adaptation. In CVPR.
- (2015) Towards Computational Baby Learning: A Weakly-Supervised Approach for Object Detection. ICCV, pp. 999–1007.
- (2014) Microsoft COCO: Common objects in context. In ECCV.
- (2016) SSD: Single Shot MultiBox Detector. In ECCV.
- (2016) SSD: Single Shot MultiBox Detector. In ECCV.
- (2016) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE TPAMI.
- (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2017) Multiple Instance Detection Network with Online Instance Classifier Refinement. In CVPR.
- (2016) Large Scale Semi-Supervised Object Detection Using Visual and Semantic Knowledge Transfer. In CVPR.
- (2013) Selective Search for Object Recognition. IJCV 104, pp. 154–171.
- (2014) Edge Boxes: Locating Object Proposals from Edges. In ECCV.