Based on the framework of multiple instance learning (MIL), numerous works have advanced weakly supervised object detection (WSOD). However, most MIL-based methods tend to localize instances to their discriminative parts instead of their whole extent. In this paper, we propose a spatial likelihood voting (SLV) module to converge the proposal localizing process without any bounding box annotations. Specifically, all region proposals in a given image act as voters at every training iteration, voting for the likelihood of each category in the spatial dimensions. After dilating alignment on the areas with large likelihood values, the voting results are regularized as bounding boxes, which are then used for the final classification and localization. Based on SLV, we further propose an end-to-end training framework for multi-task learning. The classification and localization tasks promote each other, which further improves the detection performance. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets demonstrate the superior performance of SLV.
Object detection is an important problem in computer vision, which aims at localizing tight bounding boxes around all instances in a given image and classifying them. With the development of convolutional neural networks (CNN) [10, 13, 14] and large-scale annotated datasets [6, 18, 23], there have been great improvements in object detection [8, 9, 17, 19, 21] in recent years. However, it is time-consuming and labor-intensive to annotate accurate object bounding boxes for a large-scale dataset. Therefore, weakly supervised object detection (WSOD), which only uses image-level annotations for training, is considered a promising solution in practice and has attracted the attention of the academic community in recent years.
Most WSOD methods [3, 4, 22, 25, 26, 32] follow the multiple instance learning (MIL) paradigm. Regarding WSOD as an instance classification problem, they train an instance classifier under MIL constraints to approximate the goal of object detection. However, existing MIL-based methods only focus on feature representations for instance classification without considering the localization accuracy of the proposal regions. As a consequence, they tend to localize instances to their discriminative parts instead of their whole extent, as illustrated in Fig 1(a).
Due to the lack of bounding box annotations, the absence of a localization task has always been a serious problem in WSOD. As a remedy, subsequent works [15, 25, 26, 30] re-train a Fast-RCNN detector in a fully supervised manner with pseudo ground-truths generated by MIL-based weakly supervised object detectors. The fully supervised Fast-RCNN alleviates the above problem by means of multi-task training, but it is still far from the optimal solution.
In this paper, we propose a spatial likelihood voting (SLV) module to converge the proposal localizing process without any bounding box annotations. The spatial likelihood voting operation consists of instance selection, spatial probability accumulation, and high-likelihood region voting. Unlike previous methods, which always keep the positions of their region proposals unchanged, all region proposals in SLV act as voters at every training iteration, voting for the likelihood of each category in the spatial dimensions. The voting results, which will be used for the re-classification and re-localization shown in Fig 1(b), are then regularized as bounding boxes by dilating alignment on the areas with large likelihood values. Through generating the voted results, the proposed SLV turns the instance classification problem into a multi-task one. SLV opens the door for WSOD methods to learn classification and localization simultaneously. Furthermore, we propose an end-to-end training framework based on the SLV module. The classification and localization tasks promote each other, which finally yields better localization and classification results and narrows the gap between weakly supervised and fully supervised object detection.
In addition, we conduct extensive experiments on challenging PASCAL VOC datasets  to confirm the effectiveness of our method. The proposed framework achieves 53.5% and 49.2% mAP on VOC 2007 and VOC 2012 respectively, which, to the best of our knowledge, is the best single model performance to date.
The contributions of this paper are summarized as follows:
We propose a spatial likelihood voting (SLV) module to converge the proposal localizing process with only image-level annotations. The proposed SLV turns the instance classification problem into a multi-task one.
We introduce an end-to-end training strategy for the proposed framework, which boosts the detection performance by feature representation sharing.
Extensive experiments are conducted on different datasets. The superior performance suggests that a sophisticated localization fine-tuning should be a promising exploration in addition to the independent Fast-RCNN re-training.
MIL is a classical weakly supervised learning problem and is now a major approach to tackling WSOD. MIL treats each training image as a “bag” and candidate proposals as “instances”. The objective of MIL is to train an instance classifier to select positive instances from the bag. With the development of convolutional neural networks, many works [3, 5, 11, 27] combine CNN and MIL to deal with the WSOD problem. For example, Bilen and Vedaldi  propose a representative two-stream weakly supervised deep detection network (WSDDN), which can be trained with image-level annotations in an end-to-end manner. Building on this architecture, the work in  proposes to exploit contextual information from regions around the object as supervisory guidance for WSOD.
In practice, MIL solutions are found to converge easily to discriminative parts of objects. This is because the MIL loss function is non-convex, so MIL solutions usually get stuck in local minima. To address this problem, Tang et al.  combine WSDDN with multi-stage classifier refinement and propose the OICR algorithm to help their network see larger parts of objects during training. Moreover, building on , Tang et al.  subsequently introduce proposal cluster learning and use the proposal clusters as supervision, which indicates the rough locations where objects most likely appear. In , Wan et al. try to reduce the randomness of localization during learning. In , Zhang et al. add curriculum learning to the MIL framework. From the perspective of optimization, Wan et al.  introduce the continuation method and attempt to smooth the MIL loss function with the purpose of alleviating the non-convexity problem. In , Gao et al. make use of the instability of MIL-based detectors and design a multi-branch network with orthogonal initialization.
Besides, there are many attempts [1, 12, 16, 33, 35] to improve the localization accuracy of the weakly supervised detectors from other perspectives. Arun et al.  obtain much better performance by employing a probabilistic objective to model the uncertainty in the location of objects. In , Li et al. propose a segmentation-detection collaborative network which utilizes the segmentation maps as prior information to supervise the learning of object detection. In , Kosugi et al. focus on instance labeling problem and design two different labeling methods to find tight boxes rather than discriminative ones. In , Zhang et al. propose to mine accurate pseudo ground-truths from a well-trained MIL-based network to train a fully supervised object detector. In contrast, the work of Yang et al.  integrates WSOD and Fast-RCNN re-training into a single network that can jointly optimize the regression and classification.
The overall architecture of the proposed framework is shown in Fig 2. We adopt a MIL-based network as a basic part and integrate the proposed SLV module into the final architecture. During the forward process of training, the proposal features are fed into the basic MIL module to produce proposal score matrices. Subsequently, these proposal score matrices are used to generate supervisions for the training of the proposed SLV module.
With image-level annotations, many existing works [2, 3, 4, 11] detect objects based on a MIL network. In this work, we follow the method in , which proposes a two-stream weakly supervised deep detection network (WSDDN) to train the instance classifier. For a training image and its region proposals, the proposal features are extracted by a CNN backbone and then branched into two streams, corresponding to a classification branch and a detection branch. For the classification branch, a score matrix x_cls ∈ R^{C×N} is produced by passing the proposal features through a fully connected (fc) layer, where C denotes the number of image classes and N denotes the number of proposals. A softmax over classes is then performed to produce s_cls. Similarly, a score matrix x_det is produced by another fc layer for the detection branch, but s_det is generated by a softmax over proposals rather than classes. The score of each proposal is generated by element-wise product: s = s_cls ⊙ s_det. At last, the image classification score for class c is computed by summation over all proposals: φ_c = Σ_r s_{c,r}. We denote the label of a training image as y = [y_1, …, y_C], where y_c = 1 or 0 indicates the image with or without class c. To train the instance classifier, the loss function is shown in Eq. (1).
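The two-stream computation described above can be sketched in NumPy (a minimal illustration, not the paper's code; the weight matrices stand in for the two fc layers and all names are illustrative):

```python
import numpy as np

def wsddn_scores(features, W_cls, W_det):
    """Two-stream WSDDN-style scoring (sketch).

    features: (N, D) proposal features; W_cls, W_det: (D, C) fc weights."""
    x_cls = features @ W_cls                                      # (N, C)
    x_det = features @ W_det                                      # (N, C)
    s_cls = np.exp(x_cls) / np.exp(x_cls).sum(1, keepdims=True)   # softmax over classes
    s_det = np.exp(x_det) / np.exp(x_det).sum(0, keepdims=True)   # softmax over proposals
    scores = s_cls * s_det        # element-wise product: per-proposal, per-class scores
    phi = scores.sum(0)           # image-level score per class, each in (0, 1)
    return scores, phi

def image_level_loss(phi, y):
    """Multi-label binary cross-entropy against the image label y in {0,1}^C."""
    phi = np.clip(phi, 1e-8, 1 - 1e-8)
    return -(y * np.log(phi) + (1 - y) * np.log(1 - phi)).sum()
```

Because each column of the detection stream sums to one over proposals, every image-level score φ_c lies strictly in (0, 1), so the binary cross-entropy is well defined.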
Moreover, proposal cluster learning (PCL)  is adopted, which additionally embeds three instance classifier refinement branches to obtain better instance classifiers. The output of the k-th refinement branch is s^k ∈ R^{(C+1)×N}, where C+1 denotes the number of classes plus background.
Specifically, based on the output scores and the spatial information of the proposals, proposal cluster centers are built. All proposals are then divided into clusters according to the spatial overlap between them, one cluster for the background and the others for different instances. Proposals in the same cluster (except the background cluster) are spatially adjacent and associated with the same object. With the cluster labels as supervision (the n-th cluster carries label y_n), the refinement branch treats each cluster as a small bag. Each bag in the k-th refinement branch is optimized by a weighted cross-entropy loss, as in Eq. (2).
where s_n and N_n are the confidence score of the n-th cluster and the number of proposals in it, p_r is the predicted score of the r-th proposal, r ∈ C_n indicates that the r-th proposal belongs to the n-th proposal cluster, C_bg is the cluster for the background, and λ_r is a loss weight equal to the confidence of the r-th proposal.
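A simplified sketch of such a weighted cross-entropy (a per-proposal form for illustration only; the actual PCL loss treats each cluster as a bag, and all names here are illustrative):

```python
import numpy as np

def cluster_refinement_loss(scores, clusters, cluster_labels, cluster_weights):
    """Weighted cross-entropy over proposal clusters (simplified sketch).

    scores: (N, C+1) softmax outputs of one refinement branch
            (last column = background).
    clusters: list of index arrays, one per cluster.
    cluster_labels: class index assigned to each cluster.
    cluster_weights: confidence-derived loss weight per cluster."""
    loss = 0.0
    for idx, label, w in zip(clusters, cluster_labels, cluster_weights):
        p = np.clip(scores[idx, label], 1e-8, 1.0)  # this cluster's proposal scores
        loss -= w * np.log(p).sum()
    n = sum(len(idx) for idx in clusters)
    return loss / n  # average over all clustered proposals
```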
It is hard for weakly supervised object detectors to pick out the most appropriate bounding boxes from all proposals for an object. The proposal that obtains the highest classification score often covers only a discriminative part of an object, while many other proposals covering larger parts tend to have lower scores. Therefore, it is unstable to choose the proposal with the highest score as the detection result under MIL constraints. But from the overall distribution, the high-scoring proposals always cover at least parts of objects. To this end, we propose to make use of the spatial likelihood of all proposals, which implies the boundaries and categories of the objects in an image. In this subsection, we introduce a spatial likelihood voting (SLV) module that performs classification and localization refinement simultaneously, rather than instance classification only.
The SLV module can be conveniently plugged into any proposal-based detector and optimized jointly with the underlying detector. The spirit of SLV is to establish a bridge between the classification task and the localization task by coupling the spatial and category information of all proposals. During training, the SLV module takes in the classification scores of all proposals and calculates their spatial likelihood to generate a supervision G_c (a set of pseudo boxes with label c) for each positive class c.
Formally, for an image with label y, there are three steps to generate G_c when y_c = 1. To save training time, low-scoring proposals are filtered out first, as they have little significance for spatial likelihood voting. The retained proposals are considered to surround the instances of category c and are collected into a set P_c.
For the second step, we perform a spatial probability accumulation according to the predicted classification scores and the locations of the proposals in P_c. In detail, we construct a score matrix M_c ∈ R^{H×W}, where H and W are the height and width of the training image. All elements of M_c are initialized to zero. Then, for each proposal p ∈ P_c, we accumulate the predicted score of p onto M_c spatially, as in Eq. (3).
where (i, j) denotes a pixel inside the proposal p. For the proposals in P_c, we calculate their likelihood in the spatial dimensions, and the final value of each element of M_c indicates the possibility that an instance of category c appears at that position.
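The accumulation step can be sketched as follows (assuming proposals are axis-aligned [x1, y1, x2, y2] boxes in pixel coordinates; a sketch, not the paper's implementation):

```python
import numpy as np

def accumulate_votes(boxes, scores, height, width):
    """Spatial probability accumulation: every retained proposal adds its
    predicted class score to all pixels it covers.

    boxes: (N, 4) integer [x1, y1, x2, y2]; scores: (N,) class-c scores."""
    M = np.zeros((height, width), dtype=np.float64)
    for (x1, y1, x2, y2), s in zip(boxes.astype(int), scores):
        M[y1:y2 + 1, x1:x2 + 1] += s  # vote over the whole proposal region
    return M
```

Pixels covered by many high-scoring proposals end up with large values, so the map peaks where the evidence for category c is strongest.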
Finally, the range of the elements in M_c is scaled to [0, 1] and a threshold T is used to transform M_c into a binary version M'_c. M'_c is regarded as a binary image, and the minimum bounding rectangles of the connected regions in M'_c (r_i is the i-th rectangle and n is the number of connected regions) are used to generate G_c, as shown in Eq. (4).
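The binarization and voting step can be sketched as follows (pure-Python connected components via BFS; 4-connectivity and all names are assumptions):

```python
import numpy as np
from collections import deque

def vote_to_boxes(M, thresh):
    """Scale the vote map to [0, 1], binarize at `thresh`, and return the
    minimum bounding rectangle of each 4-connected region as [x1, y1, x2, y2]."""
    M = M / M.max() if M.max() > 0 else M
    B = M >= thresh
    seen = np.zeros_like(B, dtype=bool)
    boxes = []
    h, w = B.shape
    for i in range(h):
        for j in range(w):
            if B[i, j] and not seen[i, j]:
                # BFS over one connected region, tracking its extent
                q = deque([(i, j)])
                seen[i, j] = True
                y1 = y2 = i
                x1 = x2 = j
                while q:
                    y, x = q.popleft()
                    y1, y2 = min(y1, y), max(y2, y)
                    x1, x2 = min(x1, x), max(x2, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and B[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append([x1, y1, x2, y2])
    return boxes
```

On a map with two separate high-likelihood blobs, this returns two rectangles, one per blob.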
The overall procedure for generating the supervision is summarized in Algorithm 1, and a visualized example of SLV is shown in Fig 3. The supervision is an instance-level annotation, and we use a multi-task loss on each labeled proposal to perform classification and localization refinement simultaneously. The re-classification branch outputs class scores and the re-localization branch outputs bounding box regression offsets. The loss of the SLV module is L_SLV = L_cls + L_loc, where L_cls is the cross-entropy loss and L_loc is the smooth L1 loss.
To refine the weakly supervised object detector, the basic MIL module and SLV module are integrated into one. Combining the loss function of both, the final loss of the whole network is in Eq. (5).
However, the classification scores output by the basic MIL module are noisy in the early stage of training, so the voted supervisions are not precise enough to train the object detector. An alternative training strategy could avoid this problem: 1) fix the SLV module and train the basic MIL module completely; 2) fix the basic MIL module and use its output classification scores to train the SLV module. This strategy makes sense, but training different parts of the network separately may harm performance. We therefore propose a training framework that integrates the two steps into one, changing the loss in Eq. (5) to a weighted version, as in Eq. (6).
The loss weight α is initialized to zero and increases iteratively. At the beginning of training, the basic MIL module is unstable and we cannot obtain good supervisions, but α is small, so the weighted SLV loss is also small and the performance of the basic MIL module is hardly affected. As training proceeds, the basic MIL module classifies the proposals well, and we obtain stable classification scores that generate more precise supervisions. The proposed training framework is easy to implement, and the network benefits from the shared proposal features. The overall training procedure of our network is shown in Algorithm 2.
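One way to realize such a ramp-up is a simple linear schedule (the text only states that the weight starts at zero and increases during training, so the shape and the names below are assumptions):

```python
def slv_loss_weight(iteration, ramp_iters=10000, max_weight=1.0):
    """Linear ramp for the SLV loss weight: 0 at the start, max_weight
    after ramp_iters iterations (illustrative schedule)."""
    return min(max_weight, max_weight * iteration / ramp_iters)

def total_loss(loss_mil, loss_slv, iteration):
    """Weighted sum in the spirit of Eq. (6): the MIL loss always counts,
    while the SLV loss is phased in gradually."""
    return loss_mil + slv_loss_weight(iteration) * loss_slv
```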
During testing, the proposal scores of three refined instance classifiers and SLV re-classification branch are used as the final detection scores. And the bounding box regression offsets computed by the SLV re-localization branch are used to shift all proposals.
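Shifting proposals by predicted offsets typically follows the standard R-CNN (dx, dy, dw, dh) parameterization; the sketch below shows that convention (an assumption, since the text does not spell out its parameterization):

```python
import numpy as np

def apply_deltas(boxes, deltas):
    """Shift proposals by regression offsets in the R-CNN style.

    boxes: (N, 4) [x1, y1, x2, y2]; deltas: (N, 4) [dx, dy, dw, dh]."""
    w = boxes[:, 2] - boxes[:, 0] + 1.0
    h = boxes[:, 3] - boxes[:, 1] + 1.0
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    dx, dy, dw, dh = deltas.T
    ncx, ncy = cx + dx * w, cy + dy * h       # shift the center
    nw, nh = w * np.exp(dw), h * np.exp(dh)   # rescale width and height
    return np.stack([ncx - 0.5 * nw, ncy - 0.5 * nh,
                     ncx + 0.5 * nw - 1.0, ncy + 0.5 * nh - 1.0], axis=1)
```

Zero offsets leave the proposals unchanged, so an untrained re-localization branch degrades gracefully to the original proposals.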
SLV is evaluated on two challenging datasets, PASCAL VOC 2007 and 2012, which contain 9,962 and 22,531 images respectively for 20 object classes. For each dataset, we use the trainval set for training and the test set for testing. Only image-level annotations are used to train our network.
For evaluation, two metrics are used. First, we evaluate detection performance using mean Average Precision (mAP) on the PASCAL VOC 2007 and 2012 test sets. Second, we evaluate localization accuracy using Correct Localization (CorLoc) on the PASCAL VOC 2007 and 2012 trainval sets. Following the PASCAL criterion, a predicted box is considered positive if it has an IoU greater than 0.5 with a ground-truth bounding box.
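The IoU underlying both metrics can be computed directly (a continuous-coordinate sketch; the VOC devkit uses a +1 pixel convention, omitted here for simplicity):

```python
def iou(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2].

    Under the PASCAL criterion, a detection counts as correct
    when its IoU with a ground-truth box exceeds 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```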
The proposed framework is implemented based on a VGG16 CNN model pre-trained on the ImageNet dataset. We use Selective Search  to generate about 2,000 proposals per image. In the basic MIL module, we follow the implementation in  to refine the instance classifier three times. For the SLV module, we use the average proposal scores of the three refined instance classifiers to generate supervisions, and the hyper-parameter settings are intuitive: the score-filtering threshold is set to 0.001 to save time, and the binarization threshold T is set to 0.2 for the person category and 0.5 for the other categories.
During training, the mini-batch size is set to 2 and the momentum to 0.9, with weight decay applied. The learning rate is decayed at the 9th, 12th, and 15th epochs. For data augmentation, we use five image scales with horizontal flips for both training and testing: a scale is randomly chosen to resize the image, and the image is then horizontally flipped. During testing, the average score over the 10 augmented images is used as the final classification score; similarly, the output regression offsets of the 10 augmented images are averaged.
We perform ablations on PASCAL VOC 2007 to analyze the proposed SLV module. The baseline model (mAP 50.1% on the PASCAL VOC 2007 test set) is the basic PCL detector described in Section 3.1, trained on the PASCAL VOC 2007 trainval set. Details of the ablation studies are discussed in the following.
SLV vs. No SLV. To confirm the effectiveness of the proposed SLV module, we conduct different ablation experiments for re-classification and re-localization branch in SLV. As shown in Table 1 (row 2 and row 3), the simplified versions of SLV module which only contain a re-classification or re-localization branch both outperform the baseline model. It indicates the supervision generated by spatial likelihood voting method, which is formulated in Section 3.2, is precise enough not only for classification but also for localization.
Moreover, the full version of the SLV module improves detection performance further thanks to multi-task learning. As shown in Fig 4, the SLV module trained on top of a well-trained baseline model boosts performance significantly (mAP from 50.1% to 52.5%), indicating the necessity of converging the proposal localizing process in WSOD solutions, as discussed above.
Per-class results for the Pred Net competitor (20 scores in the standard VOC class order, followed by the mean):
|Pred Net (VGG), VOC 2007 AP ||66.7||69.5||52.8||31.4||24.7||74.5||74.1||67.3||14.6||53.0||46.1||52.9||69.9||70.8||18.5||28.4||54.6||60.7||67.1||60.4||52.9|
|Pred Net (Ens), VOC 2007 AP ||67.7||70.4||52.9||31.3||26.1||75.5||73.7||68.6||14.9||54.0||47.3||53.7||70.8||70.2||19.7||29.2||54.9||61.3||67.6||61.2||53.6|
|Pred Net (VGG), VOC 2007 CorLoc ||88.6||86.3||71.8||53.4||51.2||87.6||89.0||65.3||33.2||86.6||58.8||65.9||87.7||93.3||30.9||58.9||83.4||67.8||78.7||80.2||70.9|
|Pred Net (Ens), VOC 2007 CorLoc ||89.2||86.7||72.2||50.9||51.8||88.3||89.5||65.6||33.6||87.4||59.7||66.4||88.5||94.6||30.4||60.2||83.8||68.9||78.9||81.3||71.4|
|Pred Net (VGG), VOC 2012 ||48.4||69.5|
End-to-end vs. Alternative. In the previous subsection, the ablation experiments were conducted by fixing the baseline model and training only the SLV module. The two parts of the proposed network were thus trained separately, which is similar to re-training an independent Fast-RCNN model.
In rows 4 and 5 of Table 1, we present the performance of models trained with the different strategies. Compared with the alternative training strategy (row 4), the model trained with the proposed end-to-end framework (row 5) performs substantially better. As discussed in Section 3.3, the end-to-end training framework shortens the gap between weakly supervised and fully supervised object detection.
SLV vs. Other labeling schemes. Regarding SLV as a pseudo labeling strategy, we compare three different labeling schemes and analyze their respective strengths and weaknesses. The first is a conventional scheme that selects the highest-scoring proposal for each positive class. The second is a clustering scheme that selects the highest-scoring proposal from every proposal cluster for each positive class. The last is the proposed SLV. Fig 5 shows labeling examples of the three schemes in different scenarios. The first row shows that the SLV module finds as many labels as possible, rather than only one per positive class. The second row shows the behavior of the three schemes when labeling larger objects: the bounding boxes labeled by SLV have higher IoU with the ground-truth boxes. However, as shown in the third row of Fig 5, when objects are gathered closely, SLV is prone to labeling them as one instance. Meanwhile, all three schemes fail when labeling the “table” due to its weak feature representation (the plate on the table is labeled instead). This is an issue worth exploring in future work. Despite these bad cases, the network with SLV (53.5% mAP) still surpasses its counterparts using the two other labeling schemes (52.1% mAP for the first scheme and 52.4% mAP for the second).
In this subsection, we compare our method with other works. We report our experimental results on the PASCAL VOC 2007 and 2012 datasets in Table 2, Table 3 and Table 4. Our method obtains 53.5% mAP and 71.0% CorLoc with a single VGG16 model on the VOC 2007 dataset, outperforming all other single-model methods. We further re-train a Fast-RCNN detector based on pseudo ground-truths produced by SLV (VGG); the re-trained model obtains 53.9% mAP and 72.0% CorLoc on VOC 2007, which are new state-of-the-art results. On the VOC 2012 dataset, our method obtains 49.2% mAP, the best among all single-model methods, and 69.2% CorLoc.
Different from recent works, e.g. , that select high-scoring proposals as pseudo ground-truths to enhance localization ability, the proposed SLV searches for the boundaries of different objects from a more macro perspective and thus achieves better detection. We illustrate some typical detection results of our method and a competitor model in Fig 6. The bounding boxes output by our method clearly localize better, because our multi-task network can classify and localize proposals at the same time, while the single-task competitor only highlights the most discriminative object parts. Although our method significantly outperforms the competitor, it is worth noting that the detection results on some classes, such as “chair”, “table”, “plant” and “person”, are sometimes undesirable (last row of Fig 6). We suggest that the supervisions generated by the SLV module are not precise enough in object-gathering scenarios: many chairs gathered together, or an indoor table surrounded by many other objects.
In this paper, we propose a novel and effective module, spatial likelihood voting (SLV), for weakly supervised object detection. We propose to turn the instance classification problem in most MIL-based models into a multi-task one to shorten the gap between weakly supervised and fully supervised object detection. The proposed SLV module converges the proposal localizing process without any bounding box annotations, and an end-to-end training framework is proposed for our model. The proposed framework obtains better classification and localization performance through end-to-end multi-task learning. Extensive experiments on the VOC 2007 and 2012 datasets show substantial improvements of our method over previous WSOD methods.
This work was supported by the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Grant 31627802.