Object detection is an essential building block of many computer vision systems. State-of-the-art (SOTA) methods mainly rely on large-scale datasets with manually annotated bounding boxes to train fully supervised CNN-based models [8, 18, 17, 13, 3]. However, the prohibitive cost and time requirements associated with data annotation reduce the applicability of SOTA detection models in real-life scenarios. This has motivated research on object detection strategies with reduced data annotation requirements. Amongst the most popular low-data regimes, we distinguish Weakly Supervised Object Detection (WSOD), which aims to train object detectors using only image-level annotations [2, 19, 22, 1, 26], and Few-Shot or Low-Shot Object Detection (FSOD/LSOD), which trains supervised models with only a handful of training examples for all classes (LSOD) or only for a subset of novel test classes (FSOD) [11, 25, 5]. FSOD and in particular WSOD have been the focus of a large body of work, with innovative strategies obtaining promising performance. Nonetheless, these models typically fall far short of their strongly supervised counterparts. Numerical performance gaps are attributed to the low quality of the bounding-box annotations produced, e.g. by WSOD methods, which often manifest as partial or oversized boxes. Such results are not reliable enough for use in real-world scenarios and can be observed to cause deterioration of detection performance when used to train fully supervised models. This can be attributed to weak training signals requiring very large and curated datasets (WSOD) or very representative and carefully selected annotated examples (FSOD).
To address the aforementioned challenges, we focus on a recent training paradigm relying on Mixed Supervision for Object Detection (MSOD) [15, 7]. The distinction between this protocol and the previously introduced weak and low data settings is illustrated in Fig. 1. The objective of MSOD is to exploit and combine the complementary advantages provided by WSOD and LSOD: weak (image-level) supervision affords the construction of large databases with minimal effort, while low-shot supervision provides information-rich, fully annotated ground truth examples. The MSOD paradigm has only very recently been investigated, in two related works. Fang et al. propose a cascaded architecture yielding performance competitive with fully supervised counterparts, yet using a significant fraction of the full training data to do so. Pan et al. use low-shot examples to refine bounding box annotations obtained from a pre-trained WSOD model, resulting in a method intrinsically linked to the performance and drawbacks of WSOD techniques.
In this work, we approach the MSOD scenario from a different angle. Due to the sparsity of the rich training information provided, we expect an MSOD model to output annotations of variable quality, especially for images containing crowded scenes or objects with appearance substantially dissimilar to the training data. In contrast to existing MSOD models, we introduce an Online Annotation Module (OAM), trained with mixed supervision, that can be used in conjunction with any two-stage fully supervised object detection method (e.g. the Fast(er) R-CNN family [8, 18]) to improve its performance. Our OAM generates, on the fly, additional reliable automated annotations obtained from a larger set of weakly annotated images (containing only image-level class labels). Furthermore, we exploit prediction stability to reason about annotation reliability, resulting in associated confidence scores. Generated annotations are used to train, concurrently with the OAM, a fully supervised detector that shares the same encoding features. This produces an intrinsic training curriculum for the standard detector model: only simple images, labelled with high confidence, will be presented to the model at the outset. Compared to previous MSOD work, our OAM strategy provides increased robustness against mislabelled crowded and ambiguous training images, as only confident MSOD annotations are exploited for fully supervised training. Furthermore, our joint MSOD and fully supervised training provides intrinsic regularisation for both tasks, allowing the learning of higher quality and more discriminative feature extractors.
Experiments show that our strategy allows effective training of standard detection algorithms with only minimal annotation requirements and significantly outperforms WSOD and competitive MSOD approaches on PASCAL VOC 2007 and MS-COCO benchmarks. Additionally, we report competitive performance in comparison to fully supervised alternatives, illustrating the ability of our OAM to annotate a many-shot set of (weakly labelled) images that can be leveraged to improve the fully supervised model performance.
In summary, we propose a new direction using Mixed Supervision for Object Detection (MSOD). Our main contributions are the following:
We introduce a novel Online Annotation Module (OAM), trained using mixed supervision. This module allows expansion of the low-shot training set of fully annotated images by generating reliable annotations from a larger volume of weakly labelled images.
Training our OAM concurrently with any two-stage object detection model introduces a strategy for object detection performance improvement due to the generated annotation. We report on the benefits of intrinsic regularisation afforded to both tasks when common encoding features are shared.
The integration of the OAM with Fast(er) R-CNN improves their performance in terms of mAP and AP50 on the PASCAL VOC 2007 and MS-COCO benchmarks, and significantly outperforms MSOD approaches.
2 Related work
Weakly Supervised Object Detection. A large body of recent work on WSOD couples CNN feature extractors with Multiple Instance Learning (MIL) frameworks, thus casting weakly supervised object detection as a multi-label classification problem. Each image is typically represented as a bag of pre-computed proposals (e.g. Selective Search, Edge Boxes, etc.) and the objective is to identify the proposals that are most relevant for bag classification [2, 19, 22, 26]. Being framed as a classification task, MIL WSOD models typically focus on proposals that comprise either the most discriminative object parts or image regions that define the presence of an object category. They therefore struggle to detect full object extent (e.g. detecting human faces rather than an entire human body) or group multiple instances of the same object class within a single bounding box [26, 15]. In order to address this issue, recent work has focused on bounding-box refinement strategies using cascaded refinements of MIL classifications [20, 19], using saliency maps [24, 26], adopting continuation strategies [22, 23] and modelling uncertainty. However, the ill-posed nature of the WSOD problem and the insufficient statistics provided by the PASCAL VOC dataset (on which these approaches are usually evaluated) have led to the development of ad-hoc training strategies and parameter-sensitive methods to cope with the weak training signal, which substantially reduce generalisability across datasets. In this paper, we argue that including a handful of labelled samples improves model accuracy and stability at only minimal annotation cost. Usually, all the images annotated by MIL WSOD methods are used, in a second step, to train fully supervised models [19, 22, 26]. Previous work has also focused on alternating between the pseudo-labelling of images and the training of a fully supervised model [10, 5]. In this work, we generate bounding box annotations on the fly from mixed supervision and we concurrently train a fully supervised detector only on the images annotated with high confidence.
Few-Shot and Mixed Supervision Object Detection. Few-Shot Object Detection (FSOD) considers a fully supervised training set, and aims to achieve strong performance on a set of novel classes comprising only a few annotated training images per class. To date only a handful of works have focused on FSOD [11, 25, 12, 4]. Such approaches typically adapt few-shot classification techniques to the object detection setting, exploring metric learning or meta-learning strategies. Mixed Supervision for Object Detection (MSOD) enhances a WSOD training set containing only image-level labels with a small set of fully annotated (strong) images (e.g. a few images per class, analogous to an FSOD scenario) and aims to achieve strong performance on all the training classes. Pan et al. recently proposed BCNet, which learns to refine the output of a pre-trained WSOD model using a small set of strong images. The definition of small set explored in their work ranges from a few shots per class to a fixed percentage of the PASCAL VOC 2007 training set. This approach provides a strong performance increase with respect to WSOD methods, but remains highly dependent on the detections of the original WSOD model as input: if detections are originally missed by the pre-trained model, the approach cannot recover. Moreover, BCNet requires the training of two independent models, which makes the adaptation of WSOD parameters, i.e. training for new datasets, challenging. In this work, we instead propose a one-stage approach relying on an adaptive pool of annotations, updated dynamically as training progresses. EHSOD and BAOD focus on larger data regimes and aim to reduce the data required to reach fully supervised performance using a cascaded MIL model and a student-teacher setup trained on weak and strong annotations, respectively. In contrast to all outlined methods, we propose instead to learn, and annotate on the fly, only a subset of weak images that can be labelled with high confidence. These additional samples are then used together with strong images to train an object detector and thus improve performance.
Let $\mathcal{D}$ be a set of training images annotated with image-level supervision. Under our mixed supervision paradigm, a subset of these images, $\mathcal{S} \subset \mathcal{D}$ with $|\mathcal{S}| \ll |\mathcal{D}|$, is further annotated with bounding box annotations. We refer to the images contained in $\mathcal{S}$ as strong training images, while the images in $\mathcal{W} = \mathcal{D} \setminus \mathcal{S}$, which have only image-level annotations, are referred to as weak training images. An overview of our proposed method is reported in Fig. 2. Our model comprises two branches with a shared encoder backbone, and employs an ROI pooling layer to compute a fixed-length feature representation for each image bounding box proposal. The first branch of our model employs both weak and strong training images to learn an Online Annotation Module (OAM) for weak training images. The OAM generates bounding box annotations, with associated confidence scores, on the fly, for every weak training image. Annotated weak images are added to a third set of images, $\hat{\mathcal{S}}$, if they have been annotated with high confidence, and can be subsequently removed if their annotation confidence drops during training. Images contained in $\hat{\mathcal{S}}$ are referred to as semi-strong training images throughout the paper. The second branch of our model is designed as a standard fully supervised component and trained, in parallel, in an end-to-end manner using strong and semi-strong images. At testing, only the fully supervised model is used for object detection.
Each image is represented by a set of bounding box proposals and their associated feature vectors. These feature vectors are obtained using a standard CNN backbone and ROI Pooling layer and provide a common input to both of our model branches: the OAM and the fully supervised branch.
3.1 Online Annotation Module
Our OAM is designed to jointly exploit weak and strong supervision in an efficient manner. It comprises three main components: 1) a joint detection module exploiting weak and strong labels in a single, common architecture to predict bounding boxes and their classes, 2) an online bounding box augmentation step that generates refined bounding box proposals, 3) a supervision generator, identifying confident annotations to be used as supervision. We next describe all three components in detail.
Joint detection module. Similarly to prior work, we combine a multiple instance learning (MIL) type image-level classification task with a fully supervised joint classification and regression task. Our joint detection module hence comprises three parallel, fully connected layers focusing on three different subtasks: proposal scoring, classification and regression (Fig. 2, online annotation module block). For an image with $R$ proposals and $C$ classes, the scoring and classification layers output matrices $x^{s}, x^{c} \in \mathbb{R}^{R \times C}$. Proposal scoring and classification are obtained by applying the softmax function to the output of their layers along the two dimensions, independently (over classes for $x^{c}$, over proposals for $x^{s}$). After this operation, $\sigma^{c}_{rk}$ represents the probability that the $r$-th proposal belongs to class $k$, while $\sigma^{s}_{rk}$ represents the proportional contribution that proposal $r$ provides to the image being classified as class $k$. Following prior work, these layers are trained by exploiting the image-level supervision of both strong and weak images. In particular, a proposal score, per class, is obtained by combining them as $\phi = \sigma^{c} \odot \sigma^{s}$, where $\odot$ is a Hadamard product. Then, summing these scores over proposals, $\hat{y}_k = \sum_{r=1}^{R} \phi_{rk}$, enables the use of a binary cross-entropy loss as image-level loss function:
$$L_{img} = -\sum_{k=1}^{C} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right],$$
where $y_k \in \{0, 1\}$ is the label indicating the presence or absence of class $k$ in an image.
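The image-level loss described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind the reported results: the names `softmax` and `image_level_loss`, the nested-list layout of the two layer outputs and the clamping constant `eps` are hypothetical choices made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_loss(cls_out, score_out, labels, eps=1e-7):
    """Two-stream MIL loss for a single image (illustrative sketch).

    cls_out, score_out: R x C nested lists (R proposals, C classes),
        the raw outputs of the classification and scoring layers.
    labels: list of C binary image-level labels.
    Returns (per-class image scores, binary cross-entropy loss).
    """
    num_props, num_classes = len(cls_out), len(labels)
    # Classification stream: softmax over classes, one row per proposal.
    p_cls = [softmax(row) for row in cls_out]
    # Scoring stream: softmax over proposals, one column per class.
    p_score = [
        softmax([score_out[r][k] for r in range(num_props)])
        for k in range(num_classes)
    ]
    # Hadamard product of the two streams, summed over proposals.
    image_scores = [
        sum(p_cls[r][k] * p_score[k][r] for r in range(num_props))
        for k in range(num_classes)
    ]
    loss = 0.0
    for y, s in zip(labels, image_scores):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return image_scores, loss
```

With all-zero layer outputs for two proposals and two classes, each class receives an image-level score of 0.5, so the loss reduces to 2 log 2 for any binary label vector.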
Similar to traditional object detectors, we use strong images to compute bounding box regression and classification via the corresponding fully connected layers. We therefore combine weak and strong supervision by providing direct supervision to the proposal-level class prediction. For regression, each bounding box is parametrised as a four-tuple $(x, y, h, w)$ that specifies its center coordinate $(x, y)$ and its height and width $(h, w)$. For each proposal classified as foreground in a strong image, this regression branch predicts the offset of these coordinates, $t = (t_x, t_y, t_h, t_w)$. Hence, for every strong image, an additional loss $L_{sup}$ is computed on a batch of proposals. It combines a classification term between $p$ and $u$, the predicted and target proposal classes respectively, with a regression term between $t$ and $v$, the predicted and target bounding box offsets respectively, where the regression term is a smooth $L_1$ loss function.
The loss function of the joint detection module is hence the sum of the image-level loss and the additional supervised loss, $L_{img} + L_{sup}$, on strong images, while the loss function on weak images is the image-level loss $L_{img}$ alone. Enforcing synergy between the two types of supervision regularises the low-shot task thanks to the statistical information provided by weak images. Moreover, due to the instance-level annotations provided by strong images, this also constrains the MIL task and encoder to learn features that discriminate more strongly between full and partial-extent object proposals.
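For illustration, the regression term used on strong images can be sketched as below. The smooth L1 penalty follows its standard definition. The offset encoding (centre shifts normalised by proposal size, log-ratios for height and width) is the common R-CNN-style parametrisation and is an assumption here, as are the names `smooth_l1`, `encode_offsets` and `regression_loss` and the `beta` parameter.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def encode_offsets(proposal, target):
    """Offsets taking a proposal to a target box; both are (x, y, h, w).

    ASSUMPTION: R-CNN-style encoding, not confirmed by the text.
    """
    px, py, ph, pw = proposal
    gx, gy, gh, gw = target
    return (
        (gx - px) / pw,       # horizontal centre shift, in proposal widths
        (gy - py) / ph,       # vertical centre shift, in proposal heights
        math.log(gh / ph),    # log height ratio
        math.log(gw / pw),    # log width ratio
    )


def regression_loss(pred_offsets, proposal, target):
    """Smooth L1 loss between predicted and target offsets of one proposal."""
    target_offsets = encode_offsets(proposal, target)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
```

A proposal that already coincides with its target has zero target offsets, so predicting zeros incurs no loss. Only proposals labelled as foreground would be passed to `regression_loss`.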
Online Bounding Box Augmentation Strategy. Learning to update and improve bounding box spatial regions via low-shot regression is highly challenging. When the overlap between an initial inference and the ground-truth box is small, large corrections (spatial offsets) are required. Previous work (BCNet) actively elects to exclude such challenging samples, further reducing already highly limited data. We instead fully exploit the available annotations and push our regression branch output through a second forward pass of our OAM (red arrow in Fig. 2).
More specifically, after the first forward pass, we select the $K$ top scoring proposals, per class, corresponding to the image-level ground-truth. $K$ is defined as half the size of the proposal batch used to train the strongly supervised component. This accounts for the presence of irrelevant background proposals and allows us to fix this hyperparameter. Once regression branch offsets have been applied, our ROI pooling layer ingests the proposals and yields a new set of bounding box features. Loss functions are evaluated using the updated box features and combined with the first pass loss, so that the overall loss function of our OAM branch combines the losses of the first and second pass. At every iteration, a batch with the same number of weak and strong images is used.
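The proposal selection that precedes the second pass can be sketched as follows. This is an illustrative sketch only: `select_top_proposals`, the row-per-proposal score layout and passing `k` explicitly are assumptions, and the subsequent application of regression offsets and ROI pooling is omitted.

```python
def select_top_proposals(scores, image_labels, k):
    """Pick the k highest-scoring proposals for every class present in the image.

    scores: R x C nested list of per-class proposal scores.
    image_labels: list of C binary image-level labels.
    k: number of proposals kept per present class (hypothetical argument;
       the text ties it to half the supervised proposal batch size).
    Returns a dict mapping class index -> list of proposal indices.
    """
    selected = {}
    for cls, present in enumerate(image_labels):
        if not present:
            continue  # classes absent from the image-level label are skipped
        ranked = sorted(
            range(len(scores)), key=lambda r: scores[r][cls], reverse=True
        )
        selected[cls] = ranked[:k]
    return selected
```

For example, with three proposals scored `[[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]`, image labels `[1, 0]` and `k = 2`, only class 0 is considered and proposals 1 and 2 are retained.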
Motivation for our second pass is two-fold. Firstly, augmentation is intrinsically provided, as new sets of proposal candidates are generated for training the regression and classification tasks. In contrast, pre-computed proposals (predominant in WSOD), when they lack additional external augmentation strategies, provide only static input, reducing sample variability during training. Secondly, our regression task is regularised, as any weak proposals receiving modifications that hinder correct image-level classification are penalised.
Online Pseudo-Supervision Generation. The key objective of our OAM is to generate reliable annotations on a large set of weakly labelled images in order to guide the training of a fully supervised second branch. As the OAM is trained concurrently with the second branch, it is critical to identify and add only reliable annotations to the pool of training images. Our rationale is that only these images should be used to train the final supervised detection network, while images that the joint detection module struggles to annotate with high confidence should not be used for model training, as they may hurt the training process and deteriorate detector performance.
During early stages of the training process, uncertainty regarding both the class of bounding box proposals and the related regression refinement of box coordinates will be high. As training progresses and model predictive quality improves, confidence, accuracy and stability will increase. This results in an increasingly difficult set of images being accurately annotated. We propose to exploit this behaviour by introducing a supervision generator that is able to identify reliably annotated images, creating a set we refer to as semi-strong images, $\hat{\mathcal{S}}$, that are used to train the fully supervised branch. Intuitively, $\hat{\mathcal{S}}$ will comprise “easy” images in early stages of training (e.g. single instances, uniform colour backgrounds) and sample diversity will progressively increase as the model becomes more accurate (examples of images annotated by our OAM at different training epochs are reported in Fig. 5).
In order to build a set of semi-strong images $\hat{\mathcal{S}}$, with bounding boxes and associated annotation confidence scores, we propose the following mechanism. Given a weak image, we obtain a set of bounding boxes $B_0 = \{(c_j, b_j)\}$ after Non-Maximum Suppression (NMS) is performed on the output of the joint detection module, where $c_j$ and $b_j$ correspond to the class label and coordinates of box $j$ respectively. A new set $B_t$ is then computed at every iteration $t$, using $B_{t-1}$ as input candidate proposals. More specifically, the bounding boxes obtained at the previous iteration are fed again to the RoI Pooling Layer, providing a new set of image features from which new proposal coordinates are computed. The process iterates until bounding box prediction stabilises and is stopped when $B_t$ is equivalent to $B_{t-1}$ for three consecutive iterations, i.e. for each bounding box in $B_t$ there exists a corresponding box in $B_{t-1}$ such that the two have intersection-over-union (IoU) above a fixed threshold and possess matching class predictions (i.e. a standard criterion for characterising object equivalence in detection methods). We assign a global confidence weight, per image, determined by $t^{*}$, where $t^{*}$ is defined as the first of the three iterations in which this equivalence holds. Pseudo-code for the OAM algorithm is found in Supplementary Materials A.
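The stability criterion above can be sketched as follows. This is a hedged illustration rather than the algorithm of Supplementary Materials A: `refine` is a hypothetical callable standing in for one forward pass of the joint detection module followed by NMS, `iou_thr` and `max_iters` are left as arguments because their values are not stated here, and checking the box match in both directions is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def equivalent(prev, curr, iou_thr):
    """True if every (class, box) in one set has a same-class match in the other.

    ASSUMPTION: the match is required in both directions.
    """
    def covered(src, dst):
        return all(
            any(c1 == c2 and iou(b1, b2) >= iou_thr for c2, b2 in dst)
            for c1, b1 in src
        )
    return covered(curr, prev) and covered(prev, curr)


def annotate(initial, refine, iou_thr, max_iters, patience=3):
    """Iterate `refine` until box sets are stable for `patience` iterations.

    Returns (boxes at the first stable iteration, index of that iteration),
    or None if no foreground box survives or convergence is not reached.
    """
    history = [list(initial)]  # history[t] holds the box set of iteration t
    stable = 0
    for t in range(1, max_iters + 1):
        curr = refine(history[-1])
        if not curr:
            return None  # no foreground proposals found
        stable = stable + 1 if equivalent(history[-1], curr, iou_thr) else 0
        history.append(curr)
        if stable == patience:
            first = t - patience + 1
            return history[first], first
    return None  # did not converge within max_iters
```

An image whose boxes are unchanged by `refine` converges at the earliest possible point (first stable iteration 1), while an image whose boxes keep drifting is rejected once `max_iters` is exhausted.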
The set of proposals obtained at iteration $t^{*}$ constitutes the final bounding box annotations. Each box is weighted (box-level confidence) by its average IoU with the best matching box at all subsequent iterations. Boxes absent at a given iteration (i.e. with no match above the IoU threshold) are, by definition, down-weighted through the overlap value assigned to them at that iteration (Fig. 3 shows an example). Images that do not reach convergence within the maximum number of iterations, or that fail to find any foreground proposals at a given iteration, are considered to be annotated with low confidence and are not added to the semi-strong pool. We bound the maximum number of updates to prevent large sets of iterations, and observe that large iteration counts would, in practice, only occur during early stage training. Finally, the image is only added to the semi-strong pool if the set of obtained annotations contains all classes pertaining to the image-level label. We highlight that images requiring a large iteration count for convergence are assigned low confidence scores by design and therefore have limited influence on the training procedure of the second branch. As weak images are annotated by the proposed OAM during training, the semi-strong set expands, while annotations and confidence scores are refined as the model improves. At a given training step, a weak image that is not successfully annotated, and yet was present in the pool of semi-strong images, will be removed. In this way, the set of semi-strong images has the ability to both expand and contract during training.
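The box-level confidence and the expanding and contracting pool can be sketched as below. The helper names `box_confidence` and `update_pool`, the dictionary layout of an annotation, and the default overlap of zero assigned to absent boxes are assumptions made for illustration, not details confirmed by the text.

```python
def box_confidence(best_overlaps, iou_thr, absent_value=0.0):
    """Average best-match IoU of a box over the subsequent iterations.

    best_overlaps: best IoU found for this box at each later iteration.
    Overlaps below iou_thr are treated as "box absent" and replaced by
    absent_value (ASSUMPTION: zero by default).
    """
    if not best_overlaps:
        return 0.0
    kept = [o if o >= iou_thr else absent_value for o in best_overlaps]
    return sum(kept) / len(kept)


def update_pool(pool, image_id, image_labels, annotation):
    """Add, refresh or remove one weak image in the semi-strong pool.

    annotation: None if the image failed to converge, otherwise a dict
        with a 'boxes' entry holding (class, box) pairs.
    image_labels: class indices present in the image-level label.
    Returns True if the image is in the pool after the update.
    """
    if annotation is not None:
        found = {cls for cls, _ in annotation["boxes"]}
        # Keep the image only if every labelled class has at least one box.
        if found and set(image_labels) <= found:
            pool[image_id] = annotation
            return True
    pool.pop(image_id, None)  # contract the pool if the image was present
    return False
```

Calling `update_pool` once per weak image at every training step lets an image enter the pool, have its annotation refreshed, or be dropped again if a later pass fails to annotate it.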
3.2 Fully Supervised Branch
Concurrently with OAM training, the obtained strong and semi-strong sets of images are used to train a fully supervised second branch, which comprises both bounding box classification and regression modules on the proposal features, in a similar fashion to Fast(er) R-CNN style methods. In particular, at every training iteration a batch with the same number of strong and semi-strong images is used. The loss function for this branch is the sum of a classification loss $L_{cls}(p, u)$ and a regression loss $L_{reg}(t, v)$, where $p$ denotes the ROI class predictions, $t$ the predicted offset between ROIs and targets, $u$ the class label and $v$ the target offset. Only ROIs with foreground labels contribute to the regression loss, $L_{reg}$. The loss $L_{cls}$ constitutes a weighted cross-entropy for each image, in which the cross-entropy term of each proposal is scaled by two confidence weights: the proposals in each batch contributing to the loss are indexed by $i$, the confidence for the GT box of proposal $i$ is denoted $w_i$ and the image-level annotation confidence score is denoted $w_{img}$. Strong images are assigned maximal image and proposal-level weights. In the early stages of the training process, the semi-strong annotations present some localisation inaccuracies, but are nonetheless highly informative for learning to separate foreground from background proposals. As training progresses, our OAM improves annotation quality with tighter object coverage, and these additional high-accuracy annotations will more often contain proposals of exactly full object extent. Such annotations reinforce and strengthen the base signal, provided by strong images alone, towards better bounding-box classification. We also explored utilising semi-strong images to improve bounding-box regression, analogously. In practice, however, this produced slightly worse results. We hypothesise that the discrete problem, associated with the bounded classification loss, affords more robustness to (early-stage) imperfect semi-strong annotations, and therefore compute bounding box regression on only strong images in our final model. To conclude, collecting the introduced components results in the complete loss function for our model, which combines the OAM loss with the loss of the fully supervised branch. At testing, only this fully supervised model is deployed.
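The confidence-weighted classification loss can be illustrated with the short sketch below. The averaging over proposals, the probability clamp and the names `weighted_cross_entropy`, `box_weights` and `image_weight` are assumptions. The text only specifies that each cross-entropy term is weighted by a box-level and an image-level confidence.

```python
import math


def weighted_cross_entropy(probs, targets, box_weights, image_weight):
    """Confidence-weighted cross-entropy for the proposals of one image.

    probs: list of per-proposal class probability lists.
    targets: list of target class indices, one per proposal.
    box_weights: confidence of the GT box matched to each proposal.
    image_weight: image-level annotation confidence.
    ASSUMPTION: the weighted terms are averaged over the proposals.
    """
    total = 0.0
    for p, u, w in zip(probs, targets, box_weights):
        # Clamp to avoid log(0) for a confidently wrong prediction.
        total += -image_weight * w * math.log(max(p[u], 1e-12))
    return total / len(targets)
```

Halving `image_weight` halves the loss contribution of every proposal in that image, so annotations that needed many iterations to stabilise influence training less.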
4.1 Datasets and Implementation Details
We evaluate the performance of our proposed method on two common detection benchmarks: the PASCAL VOC 2007 and the MS-COCO datasets, referred to as VOC07 and COCO14. VOC07 has training and testing images across categories. COCO14 has training and testing images across 80 categories. Following evaluation strategies used in the literature, we evaluate detection accuracy on VOC07 using mean Average Precision (mAP), while we employ the COCO metrics, AP and AP50, on the COCO dataset. In the reported experiments, reference to 10% of labelled images dictates that 10% of all images have bounding box annotations while the remaining images have only image-level labels. This corresponds to images in VOC07, images in COCO14. With reference to our “k-shot” experimental setup, we define each class to have access to k images possessing bounding box annotations. All the experiments on VOC07 use the same data splits provided by BCNet, while experiments on COCO14 use random selection.
We employ the popular network backbones VGG16 and ResNet101 in our experiments to retain consistency with recent approaches. We combine our OAM with Fast R-CNN (using Edge Boxes) and Faster R-CNN using a trainable RPN. Optimisation of all models is performed using SGD with weight decay and momentum. For experiments concerning the VOC07 dataset, the initial learning rate is reduced for the final epochs of training. Analogously, for MS-COCO experiments, models are trained for 12 epochs, with the learning rate reduced after the first 9 epochs for the final 3 epochs. Remaining model hyper-parameters follow previously reported values. For data augmentation, we apply the same augmentation strategy as BCNet for fair comparison, i.e. we bilinearly resize images to induce a fixed minimum side length and, for fully supervised training, uniformly crop image regions with a fixed window. All experiments are implemented in PyTorch using a single GeForce GTX 1080 GPU.
4.2 Comparisons with State-of-the-art
Baselines: We evaluate our model with respect to two SOTA WSOD methods, PCL and WSOD, that were evaluated on both VOC07 and COCO14. We further compare to three MSOD approaches: the two-level approach of BCNet, and the end-to-end methods BAOD and EHSOD. To the best of our knowledge, these are the only three methods adopting mixed supervision. All three methods were evaluated on VOC07. Results for BCNet, the best performing baseline on VOC07, were not available for the COCO dataset. The approach requires training two models (OICR and BCNet) with two separate sets of parameters that need to be adapted to the new dataset, making it highly challenging and time consuming to provide a fair comparison; hence we were not able to provide it. Similarly, on COCO, EHSOD was evaluated only on the COCO 2017 database with a much larger set of annotated training images, making results not directly comparable to our experiments and different from the low-shot setting studied in this work. Finally, we compare our results with respect to Fast R-CNN and Faster R-CNN trained with full supervision (our upper bounds) and low-shot supervision (i.e. 10% and 10-shot training data), using the same augmentation strategy as all previous models.
|Method type||Method||10-shots/WSOD: person AP||10-shots/WSOD: mAP||10% images: person AP||10% images: mAP|
|fully supervised||Fast R-CNN||58.0||42.1||64.2||50.3|
|fully supervised||Faster R-CNN||54.3||37.7||55.7||46.7|
|WSOD||PCL + Fast R-CNN||15.8||44.2||-||-|
|MSOD||EHSOD (ResNet + FPN)||-||-||63.0||55.3|
|MSOD||Ours + RPN||64.3||54.6||68.9||60.5|
|fully supervised||Fast-RCNN 100 % images (Ours upper bound)||76.8 (person), 71.6|
|fully supervised||Faster-RCNN 100 % images (Ours + RPN upper bound)||75.6 (person), 67.0|
PASCAL VOC 2007: We report detailed per-class results, compared to competing MSOD approaches, in Tab. 8 using 10% annotated training images, and 10 shots. We consistently outperform all competing methods in terms of mAP, improving upon BCNet in the 10 shot scenario (ResNet) and upon EHSOD in the 10% images scenario. We further highlight that BCNet constitutes a two-level WSOD dependent method. The influence of the chosen WSOD component is clearly visible: the object classes where their method excels, and surpasses our per-class performance, are the same classes for which their adopted WSOD component (OICR) provides the best initial bounding box estimations. In Tab. 2, we provide more comparisons in the 10 shots and 10% images scenarios using precomputed proposals (white rows) and an RPN (grey rows). We highlight that we use an off-the-shelf RPN without parameter optimisation, and expect performance to be worse, and not directly comparable to strategies relying on pre-computed proposals. We further compare with top performing WSOD methods and Fast(er) R-CNN approaches and highlight our performance on the “person” class, often reported as one of the most challenging classes for WSOD methods due to the large intra-class variability in terms of appearance [15, 26]. We significantly outperform all SOTA methods, and substantially improve with respect to WSOD methods, in particular for the person class, with only minor additional labelling cost. Comparing to Fast(er) R-CNN methods, we highlight that our OAM improves upon models trained on 10% data and 10 shots by a large margin, reaching performance close to the fully supervised upper bound.
|fully supervised||Fast R-CNN - 10 shots||22.1||10.0|
|fully supervised||Faster R-CNN - 10 shots||16.1||6.7|
|WSOD||PCL+ Fast R-CNN||19.6||9.2|
|MSOD||Ours - 10 shots||31.2||14.9|
|MSOD||Ours + RPN - 10 shots||24.9||10.2|
|fully supervised||Fast R-CNN - 100% data||49.9||29.0|
|fully supervised||Faster R-CNN - 100% data||42.1||20.5|
MS-COCO14: We provide further comparison on an additional benchmark dataset in order to highlight model generalisability. We note that contemporary WSOD methods mainly focus on detection datasets of modest size such as VOC07. COCO14 is significantly larger, and constitutes a more challenging dataset due to both its increased size and the variability expressed in image content. Tab. 7 reports comparisons between our method (precomputed and RPN proposals) and WSOD approaches PCL and WSOD on COCO14 using 10 shots of labelled images. As we compare solely to WSOD methods, we limit our experiments to the 10 shots setting, as 10% annotated examples provide a very significant advantage compared to WSOD methods. We additionally provide comparison to Fast(er) R-CNN methods trained on 10 shots as well as their fully supervised equivalent on 100% images. We highlight that our method maintains robust performance and significantly outperforms the WSOD methods and 10 shots Fast(er) R-CNN models. This provides evidence in support of our claim that the strategy of providing mixed supervision significantly improves generalisation ability in settings that entail more difficult tasks with higher variability.
|10 shots||AP (%)|
4.3 Ablation Studies
We conduct experiments to understand the different contributions and assignment of credit for our OAM components using the VOC07 dataset and a VGG backbone. Tab. 5 shows ablative results for the 10 shots scenario while additional results for the 10% images scenario are reported in supplementary materials. Studied components are: SE: shared encoder (i.e. no SE entails independent branch training); OAM: fully supervised branch is also trained on semi-strong images generated by the OAM; BBA: online bounding box augmentation strategy. For each configuration, we report mAP with respect to the output of the OAM (first branch; 1B) as well as the output of the fully supervised branch (second branch; 2B). We experimentally verify the importance of each component; performance consistently improves as new components are integrated. We note that the shared encoder strongly improves the fully supervised branch, while the OAM, and communication between branches, affords mutual branch improvement. Both performance gains can be attributed to the more discriminative full vs. partial object proposal features learned by the shared encoder.
We have introduced a novel online annotation module (OAM), trained using mixed supervision, that learns to generate annotations on the fly and thus affords concurrent training for fully supervised object detection. The OAM can be combined with any two-stage object detector and provides an intrinsic curriculum to improve the training procedure. Extensive experiments on two popular benchmarks show SOTA performance in the mixed supervision scenario, and significant improvement of two-stage detection methods in low-shot settings. Moreover, our method has the potential to increase performance on rare, long tail classes that typically only possess a handful of annotated examples.
Appendix A Online Pseudo-Supervision Generation algorithm
Appendix B Ablation study: 10% data scenario
In Tab. 5, we report ablation study results for the proposed model (VGG16 backbone) where 10% of images from VOC07 provide strong supervision. Results for the analogous 10 shot scenario were reported in the main paper, Sec. 4.3. Considered ablation components are SE: presence of shared encoder (i.e. no SE entails independent branch training); OAM: the fully supervised branch is additionally trained on semi-strong images (generated by the OAM); BBA: online bounding box augmentation strategy. For each configuration, we report mAP with respect to the output of the OAM (first branch; 1B) as well as the output of the fully supervised branch (second branch; 2B).
As was also observed for the 10 shot scenario (reported in Sec. 4.3 of the main paper), performance increases as additional components are added, providing further evidence for the validity and contribution of each component. The performance gaps between ablations are smaller than in the analogous main paper experiment, due to the increased strong supervision available in the current case. Consistent with the results reported in Sec. 4.3, this ablation highlights that the shared encoder strongly improves the fully supervised branch, while the OAM, and the communication between branches, afford mutual branch improvement.
[Table 5: 10% images scenario, AP (%)]
Appendix C Sensitivity to the selected annotation
To test the sensitivity of our method to the choice of annotated image subset, we perform a five-fold experiment under the 10 shot scenario, using VOC07 and a standard VGG16 backbone architecture. This scenario is the most sensitive to image subset selection, as its pool of strong images is the smallest among all considered scenarios (including the MS-COCO experiments). It can be observed in Table 6 that the variance due to image selection is small: varying the selected image subset has only a minor effect on final mAP, providing evidence of the robustness of our proposed approach. Intuitively, this variance should reduce further when the model is trained using a larger number of fully annotated images.
Five-fold experiment for the 10 shot scenario using VOC07 and a standard VGG16 backbone. Fold mean and standard deviation statistics are reported in the final rows. The second split is the split used in [15], and is the split used for all our remaining experiments.
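The fold statistics mentioned in the caption can be computed as in the minimal sketch below. The function name `fold_summary` and the example values are illustrative assumptions (the values are placeholders, not results from Table 6), and the use of the sample (n-1) standard deviation is also an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev


def fold_summary(fold_maps):
    """Return (mean, sample standard deviation) of per-fold mAP values."""
    if len(fold_maps) < 2:
        raise ValueError("need at least two folds to estimate a standard deviation")
    # statistics.stdev uses the n-1 (sample) estimator; this choice is an assumption.
    return mean(fold_maps), stdev(fold_maps)


# Placeholder mAP values for illustration only; NOT the numbers reported in Table 6.
example_maps = [50.0, 51.0, 49.0, 50.5, 49.5]
m, s = fold_summary(example_maps)
print(f"mAP over {len(example_maps)} folds: {m:.2f} +/- {s:.2f}")
```

A small standard deviation relative to the mean is what the robustness claim above relies on.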
Appendix D MS-COCO 2017 comparisons
The EHSOD [7] method reported results using the COCO17 dataset, corresponding to a 10% training data scenario. We thus report here a comparison between our method (considering both pre-computed and RPN [18] proposal setups) and the EHSOD mixed supervision approach. We also provide additional comparison to both Fast and Faster-RCNN methods, trained using the same 10% of COCO17 images, as well as to their fully supervised equivalents, trained using 100% of the training images. Results are found in Tab. 7. We note that this setting corresponds to a much larger set of fully annotated images than those used in all other experiments.
It can be observed that, in this setting, our model performs on par with EHSOD when using RPN proposals, while significantly outperforming their approach when pre-computed (Edge Boxes) proposals are employed. Moreover, we observe that our method also performs on par with the Fast(er)-RCNN baselines in the 10% images scenario. Interestingly, we note only a reasonably modest gap in Fast(er)-RCNN performance between the considered 10% and 100% baselines. This suggests that the gap between the 10% and 100% settings can be closed by providing the network with images containing object class appearance outliers or with images containing difficult, crowded scenes. As a consequence, the problem in this setting can be considered to have a greater affinity with a fully supervised task than with a low-shot setting. This observation offers some explanation as to why our method provides limited improvement in this setting: images required to improve detector performance (those with high information content) may not be annotated with high confidence and are therefore not considered for object detector training. As highlighted in our future work discussion (main paper; Sec. 5), we believe active learning strategies may prove fruitful in such cases.
| Supervision | Method | | |
|---|---|---|---|
| fully supervised | Fast RCNN - 10% data | 53.7 | 31.6 |
| fully supervised | Faster RCNN - 10% data | 46.3 | 25.6 |
| MSOD | EHSOD - 10% data | 46.8 | - |
| MSOD | Ours - 10% data | 54.2 | 31.6 |
| MSOD | Ours + RPN - 10% data | 46.0 | 25.4 |
| fully supervised | Fast RCNN - 100% data | 61.6 | 48.0 |
| fully supervised | Faster RCNN - 100% data | 51.1 | 28.8 |
Appendix E Additional PASCAL VOC 07 results
We report here detailed per-class detection results and compare competing MSOD approaches using both annotated training images and shot scenarios. Results are found in Tab. 8. We consistently outperform all competing methods in terms of mAP, with an improvement of up to with respect to BCNet in the 20 shot scenario (ResNet101 [9] backbone). We highlight that in the training image scenario, we report both EHSOD and BAOD results, trained using of training images, as only these results were available. This highlights the ability of our method to outperform these competing models even in the case where we have access to fewer training examples.
Appendix F Additional visual results
F.1 Annotated Semi-Strong Images
In Fig. 5 we provide additional examples of images annotated by our OAM, named semi-strong images, during progressive training epochs . These online annotations are obtained by our model using VOC07 data with shot strong supervision (other examples of semi-strong images are reported in the main manuscript, Fig. 4). We observe that typically uncomplicated and simple images are labelled with high confidence when training begins (for example at epoch rows ). During later training stages (here ), more complex images with increased appearance diversity and also with multiple, overlapping object instances are added to the pool by our OAM. In general, ranged from 1-10 (first 5 epochs) to 1-3 (end of training); and the semi-strong set contained approx. 10% (first epochs) to 45-60% (end of training) of annotated weak images
Furthermore, we compare the annotations obtained by our method (magenta) with annotations generated by a popular Weakly Supervised Object Detection (WSOD) approach, OICR [20] (yellow detections). We highlight that, from early epochs, our method provides better, more reliable annotations that are then employed for concurrent object detector training. Moreover, our annotations cover the full extent of the object of interest. This can be explained by the high quality information distilled from the low-shot fully annotated images (strong images), whereas the WSOD annotations exhibit the well understood problem of tending to focus on object parts and on (only) the most discriminative object in the image.
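The confidence-based selection of semi-strong images described in this section can be illustrated with the minimal sketch below. It is not the authors' algorithm: the function name `select_semi_strong`, the fixed threshold `tau`, the data layout, and the rule that every image-level class must be covered by at least one confident box are all assumptions made for illustration.

```python
def select_semi_strong(predictions, image_labels, tau=0.8):
    """Pick weak images whose image-level classes are all confidently localised.

    predictions:  {image_id: [(class_name, score, box), ...]}  (hypothetical layout)
    image_labels: {image_id: set of class names}               (image-level annotation)
    tau:          hypothetical confidence threshold
    Returns {image_id: [(class_name, box), ...]} for the selected images.
    """
    selected = {}
    for image_id, labels in image_labels.items():
        # Keep only boxes whose class agrees with the image-level labels
        # and whose score reaches the threshold.
        confident = [
            (cls, box)
            for cls, score, box in predictions.get(image_id, [])
            if cls in labels and score >= tau
        ]
        covered = {cls for cls, _ in confident}
        # Assumed rule: the image is used only if every labelled class is covered.
        if labels and covered == set(labels):
            selected[image_id] = confident
    return selected


# Toy inputs for illustration only.
preds = {
    "a": [("dog", 0.95, (0, 0, 5, 5))],
    "b": [("cat", 0.40, (1, 1, 2, 2)), ("dog", 0.90, (0, 0, 1, 1))],
}
labels = {"a": {"dog"}, "b": {"cat", "dog"}}
print(select_semi_strong(preds, labels))
```

Under such a rule, simple single-object images would tend to pass the threshold first, with more complex images joining the pool only once the model's scores improve, which is in line with the progression observed in Fig. 5.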
F.2 Examples of Detections
Further exemplar test-time detections, obtained by our method with shot strong supervision, are shown in Fig. 6 and Fig. 7 for VOC07 and COCO14 test images respectively. Owing to the low-shot set of fully annotated images leveraged by our model, we observe that the obtained detections cover the full object extent, even for classes that are typically difficult for WSOD (e.g. person). In contrast to WSOD approaches, our method avoids enclosing only the most discriminative object parts. Moreover, multiple instances of the same class within a single image can now be captured. This is usually problematic when training a model by relying only on image-level supervision, as in WSOD.
Appendix G Common Modes of Failure
We conducted an additional investigation to identify detection failures for our model trained with shot supervision. For both datasets (VOC07, COCO14) considered in our work, the most common mode of failure is multiple detections of a single object of interest. Given that the model is only trained with shot, we partially attribute such failures to the (weakly-learned) bounding box regressor. In agreement with competing work [15, 7], we note that bounding box regression is an intrinsically difficult task, especially when limited training data is available or where substantial background pixels need to be included to provide an optimal object bounding box, such as for objects with elongated or articulated shapes. As discussed in the main paper (Sec. 5), future work may explore strengthening regression task performance.
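The "multiple detections of a single object" failure mode can be quantified with a simple overlap test, sketched below. This is an illustrative helper and not part of the proposed method: the function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are assumptions, chosen to mirror the common PASCAL VOC convention in which only the highest-scoring detection matched to a ground-truth box counts as correct.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_duplicates(detections, gt_boxes, iou_thr=0.5):
    """Count detections matched to a ground-truth box that is already claimed.

    detections: list of (score, box) for one class in one image.
    gt_boxes:   list of ground-truth boxes for the same class and image.
    """
    claimed = set()
    duplicates = 0
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thr:
            if best_j in claimed:
                duplicates += 1  # a second box on an already detected object
            else:
                claimed.add(best_j)
    return duplicates


# Toy example: two boxes on the same object, one unrelated box.
gt = [(0, 0, 10, 10)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)), (0.7, (20, 20, 30, 30))]
print(count_duplicates(dets, gt))
```

Counting duplicates per class in this way separates regression-related failures from missed objects or misclassifications.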
[1] (2019) Dissimilarity coefficient based weakly supervised object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9432–9441.
[2] (2016) Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854.
[3] (2019) Cascade R-CNN: high quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] (2018) LSTD: a low-shot transfer detector for object detection. In Thirty-Second AAAI Conference on Artificial Intelligence.
[5] (2018) Few-example object detection with model communication. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7), pp. 1641–1654.
[6] (2015) The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136.
[7] (2020) EHSOD: CAM-guided end-to-end hybrid-supervised object detection with cascade refinement. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, pp. xxx–yyy.
[8] (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] (2017) Deep self-taught learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1377–1385.
[11] (2019) Few-shot object detection via feature reweighting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8420–8429.
[12] (2019) RepMet: representative-based metric learning for classification and few-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206.
[13] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
[14] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[15] (2019) Low shot box correction for weakly supervised object detection. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 890–896.
[16] (2019) BAOD: budget-aware object detection. arXiv preprint arXiv:1904.05443.
[17] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[18] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[19] (2018) PCL: proposal cluster learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] (2017) Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2843–2851.
[21] (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
[22] (2019) C-MIL: continuation multiple instance learning for weakly supervised object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2199–2208.
[23] (2019) Min-entropy latent model for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[24] (2018) TS2C: tight box mining with surrounding segmentation context for weakly supervised object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 434–450.
[25] (2019) Meta R-CNN: towards general solver for instance-level low-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9577–9586.
[26] (2019) WSOD2: learning bottom-up and top-down objectness distillation for weakly-supervised object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8292–8300.
[27] (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 (11), pp. 3212–3232.
[28] (2014) Edge boxes: locating object proposals from edges. In European Conference on Computer Vision, pp. 391–405.