Benefiting from the application of deep convolutional networks, object detection has developed rapidly in recent years. The ultimate goal of object detection is to develop systems that can accurately and efficiently recognize and localize instances of all object categories in open-world scenes, rivaling human visual capacity. A breakthrough in large-scale object detection would be of great significance to the development of computer vision. However, unlike the classification task, which already handles tens of thousands of categories, object detection over a large number of categories still faces several difficulties. First, labeling bounding boxes for an enormous number of categories is so costly that no sufficient dataset exists for supervised detection training. It is therefore necessary to combine datasets with bounding box labels and datasets with only image-level labels to expand the set of categories. Second, as the number of target categories increases, detection becomes rapidly harder because of the proliferation of proposals and the confusion among categories. Since images for object detection often contain multiple instances that belong to different categories, the relationships among the target categories become more complex and harder to capture. Finally, as the number of categories grows, the time and space complexity of training and inference rise rapidly, which is not conducive to practical applications.
Object detection networks fall mainly into two types: two-stage frameworks and one-stage frameworks. In a two-stage framework, such as R-CNN, Fast R-CNN, Faster R-CNN and R-FCN, category-independent region proposals are generated from the image, features are extracted from these proposals, and category-specific classifiers then determine the category label of each proposal. In a one-stage framework, such as YOLO and SSD, class probabilities and bounding box offsets are predicted directly from the full image with a single forward pass of a CNN, without region proposal generation. In comparison, one-stage frameworks are more efficient, but two-stage frameworks, by removing background interference and training in depth on the RoIs, generally obtain more accurate recognition results.
Because of its practical importance, large-scale object detection has also received attention in recent years. To address the lack of categories with bounding-box-level labels, LSDA [8, 9] adopts domain adaptation and knowledge transfer: it trains a classifier for all categories and transfers the classifier to a detector based on the similarity among categories. Built on the YOLO v2 detection framework, YOLO-9000 proposes a method to combine different datasets based on WordNet and jointly trains the parameters on image-level labeled and bounding-box-level labeled images simultaneously. This is the first attempt at a large-scale semi-supervised detector, but the accuracy of the network leaves room for improvement. Motivated by decoupling detection and classification, R-FCN-3000 proposes a framework based on R-FCN that provides a more accurate and efficient solution for large-scale object detection. However, it is trained on strongly supervised datasets with bounding box annotations and is hard to generalize directly to image-level labeled categories.
In this paper, we propose a hierarchical structure and a joint training network for large-scale semi-supervised object detection. Our main contributions are as follows. First, we use semantic relationships to design a hierarchical structure that further improves recognition performance. Second, we put forward a joint training method to build a large-scale semi-supervised object detection network. We evaluate the proposed framework and obtain an mAP of 38.1% on the ImageNet detection validation dataset and an mAP of 33.1% on all the target categories, which is the state of the art among large-scale semi-supervised networks.
The remainder of this paper is organized as follows. Section 2 introduces the proposed method for large-scale semi-supervised object detection. Section 3 describes the details of network training and evaluation, and presents the experimental results and a comparison with previous work. Section 4 concludes our research.
2 The Proposed Method
For large-scale object detection, the growing number of target categories adds further confusion to the classification of candidate proposals. At the same time, there are more relationships among the target categories, and this auxiliary information can be used to further improve recognition performance. Therefore, we establish a hierarchical structure over the target categories based on the inclusion relationships provided by WordNet and optimize the detection network accordingly.
On the other hand, most object detection networks are designed for fully supervised settings, so the central issue for semi-supervised object detection is how to train detectors for image-level labeled categories. The Region Proposal Network (RPN) can be trained end-to-end and is, to some extent, class-agnostic. Consequently, an RPN trained on bounding-box-level labeled categories can be transferred to extract proposals for image-level labeled categories. With this approach, jointly training on the two kinds of images becomes feasible in almost all current two-stage object detection frameworks.
Fig. 1 shows the overall architecture of our large-scale semi-supervised network. The details of each part of our model are introduced in the following sections.
2.2 Hierarchical Structure
For a large-scale object detection task with tens of thousands of target categories, the dataset is so large that training for even one epoch is very time-consuming. As counted by Jia Deng et al., among the 10,184 categories from the Fall 2009 release of ImageNet, only 7,404 are leaf categories, about 72% of the total. In practice, if A is a leaf category and B is an ancestor category of A according to WordNet, it is inappropriate to run classification and softmax on A and B simultaneously, since this creates an unreasonable competition between them. Consequently, the first step before training should be to trim the dataset and select only the leaf categories for training. In inference, the score of an internal category should be the sum of the scores of all its descendant leaf categories, which can be summarized as follows:

$$S(c) = \begin{cases} s_c, & c \in L \\ \sum_{l \in D(c)} s_l, & c \notin L \end{cases}$$

In the formula, $L$ represents the set of leaf categories, $s_c$ represents the score output for each category $c$, and $D(c)$ represents the set of descendant leaf categories of $c$.
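The inference rule above, in which an internal category takes the sum of the scores of its descendant leaf categories, can be illustrated with a minimal sketch. The toy hierarchy, the category names and the function names below are hypothetical and are not taken from WordNet or from the implementation used in this paper. A set is used to collect leaves so that a leaf reachable through several parents is counted only once.

```python
from typing import Dict, List, Set


def descendant_leaves(node: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the leaf categories below `node` (or {node} if it is a leaf)."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    leaves: Set[str] = set()
    for kid in kids:
        leaves |= descendant_leaves(kid, children)
    return leaves


def category_score(node: str,
                   children: Dict[str, List[str]],
                   leaf_scores: Dict[str, float]) -> float:
    """Score of a category: its own score for a leaf, otherwise the sum
    of the scores of all its descendant leaf categories."""
    return sum(leaf_scores[leaf] for leaf in descendant_leaves(node, children))


# Hypothetical toy hierarchy; only leaf categories receive softmax scores.
toy_children = {"animal": ["dog", "cat"], "dog": ["husky", "beagle"]}
toy_leaf_scores = {"husky": 0.5, "beagle": 0.2, "cat": 0.3}

dog_score = category_score("dog", toy_children, toy_leaf_scores)
animal_score = category_score("animal", toy_children, toy_leaf_scores)
```

Because the softmax is computed over leaf categories only, the score of the root of such a tree sums to one, and no ancestor competes with its own descendants.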
Furthermore, based on WordNet, we can not only judge whether a specific category is a leaf category, but also build a hierarchical tree over all target categories. As Fig. 2 shows, as depth increases, the objects described become more concrete and the number of categories grows. As stated above, distinguishing among thousands of classes simultaneously is error-prone, so we add several branches to the detection network, each representing recognition at a different granularity. Each additional branch outputs a classification over the categories at a different depth. Labels for each branch are obtained from a transfer matrix computed from the inclusion relationships among the target categories, and classification supervision is attached to each branch. As Fig. 1 shows, adding several classification branches yields a hierarchical detection structure, and the detection results at the various granularities can be aggregated to obtain the final results as follows:
2.3 Joint Training
Based on the decoupled R-FCN detection network, the network can be divided into a detection branch and a classification branch. The detection branch produces objectness scores and performs bounding box regression for each RoI, and its parameters are independent of the categories. The classification branch trains a classifier to determine the specific category of each instance. In summary, the loss function has three parts: a smooth L1 loss for bounding box localization, an objectness softmax loss, and a fine-grained classification softmax loss.
For semi-supervised object detection, bounding-box-level labeled images can be trained as before. However, because they lack bounding box labels, image-level labeled images cannot participate in training the detection branch and are instead used to fine-tune the classification branch, as shown in Fig. 1 by the differently colored data flows. To realize joint training, the key issue is to extract regions from these images. Fig. 3 shows the procedure for proposal generation on the two kinds of images and the training of the RPN. Images with bounding-box-level labels are used to train the network, and the trained parameters are then used to generate high-scoring proposals for images with only image-level labels. The proposals are then filtered to remove unreasonable ones whose boundaries exceed the original image, and the remaining proposals are assigned as positive samples of the corresponding image-level label categories. Once the proposals and the labels of these candidate boxes are acquired, the two kinds of data can be combined in a batch to jointly train the classification branch. In training, the number of proposals for each image-level labeled image is kept small to guarantee the precision of each proposal. In inference, however, the number of proposals is set larger to guarantee proposal recall.
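The proposal handling for image-level labeled images described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name, the box format (x1, y1, x2, y2) and the example values are hypothetical, and the RPN itself is replaced by a precomputed list of boxes and scores.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def select_pseudo_proposals(proposals: List[Box],
                            scores: List[float],
                            image_size: Tuple[float, float],
                            image_label: str,
                            top_k: int) -> List[Tuple[Box, str]]:
    """Drop proposals that exceed the image boundary, keep the `top_k`
    highest-scoring ones, and assign each the image-level label."""
    width, height = image_size
    inside = [
        (score, box)
        for box, score in zip(proposals, scores)
        if 0 <= box[0] < box[2] <= width and 0 <= box[1] < box[3] <= height
    ]
    # Sort by score only, so that boxes are never compared with each other.
    inside.sort(key=lambda item: item[0], reverse=True)
    return [(box, image_label) for _, box in inside[:top_k]]


# Hypothetical RPN output for one 100x80 image labeled "husky".
example_boxes = [(10, 10, 50, 50), (-5, 0, 40, 40), (0, 0, 100, 80), (20, 20, 120, 60)]
example_scores = [0.9, 0.95, 0.8, 0.7]
training_samples = select_pseudo_proposals(
    example_boxes, example_scores, (100, 80), "husky", top_k=1)
```

A small `top_k` corresponds to the training setting, where precision of each pseudo-labeled box matters most. A larger value would correspond to the inference setting, where recall is the priority.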
As a result, the loss function of the semi-supervised detection network is divided into two cases.
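One way to read the two cases is sketched below. It assumes an unweighted sum of the three loss terms for bounding-box-level labeled samples and the classification term alone for image-level labeled samples. The weighting actually used in this paper is not stated here, so the equal weights, the function names and the numeric values are assumptions made for illustration only.

```python
from typing import List, Tuple


def sample_loss(loc_loss: float, obj_loss: float, cls_loss: float,
                has_boxes: bool) -> float:
    """Loss of one sample under the two training cases.

    Assumption: the terms are summed with equal weights.
    """
    if has_boxes:
        # Bounding-box-level labeled: detection and classification branches.
        return loc_loss + obj_loss + cls_loss
    # Image-level labeled: only the classification branch is supervised.
    return cls_loss


def batch_loss(samples: List[Tuple[float, float, float, bool]]) -> float:
    """Mean loss over a mixed batch of (loc, obj, cls, has_boxes) tuples."""
    if not samples:
        return 0.0
    return sum(sample_loss(*s) for s in samples) / len(samples)


# Hypothetical mixed batch: one box-labeled and one image-level labeled sample.
mixed_batch = [(0.5, 0.25, 1.0, True), (0.0, 0.0, 1.0, False)]
mixed_loss = batch_loss(mixed_batch)
```

With this split, image-level labeled samples send gradients only through the classification branch, which matches the data flows described for joint training.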
3 Experiment and Analysis
3.1 Training Data
We combine the bounding-box-level labeled ILSVRC DET training dataset with the image-level labeled ILSVRC CLS-LOC training dataset to form a joint training dataset. The detailed statistics are presented in Table 1.
|bbox-level labeled||image-level labeled||all|
3.2 Implementation Details
To compare with previous work, we follow the same implementation as R-FCN-3000. In the training and testing process, the images are resized to the resolution of . For joint training, each batch contains the same number of bounding-box-level labeled images and image-level labeled images, and horizontal flipping is used for data augmentation. In addition, , , , branches are added to model a hierarchical structure. During training, a warm-up learning rate is used for the first 1000 iterations, after which the learning rate is increased to 0.0015. The learning rate is dropped by a factor of 10 after 3 epochs. In total, we train the network for 4 epochs on 2 GeForce GTX TITAN GPUs.
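The learning rate schedule described above can be written as a small step function. The base rate of 0.0015, the 1000 warm-up iterations and the tenfold drop after 3 epochs come from the text. The warm-up value itself and the number of iterations per epoch are not given, so `warmup_lr` and `iters_per_epoch` below are placeholder assumptions.

```python
def learning_rate(iteration: int,
                  iters_per_epoch: int,
                  base_lr: float = 0.0015,
                  warmup_iters: int = 1000,
                  warmup_lr: float = 0.00015,  # assumed value, not from the text
                  drop_epoch: int = 3,
                  drop_factor: float = 10.0) -> float:
    """Piecewise-constant schedule: warm-up, base rate, then one drop."""
    if iteration < warmup_iters:
        return warmup_lr
    if iteration >= drop_epoch * iters_per_epoch:
        return base_lr / drop_factor
    return base_lr


# Hypothetical epoch length of 5000 iterations, 4 epochs in total.
schedule = [learning_rate(it, iters_per_epoch=5000) for it in (0, 1000, 14999, 15000)]
```

A constant warm-up rate is used here for simplicity; a linear ramp toward the base rate would be an equally plausible reading of the description.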
3.3 Analysis of Proposal Extraction
As stated above, the key issue for semi-supervised object detection is proposal extraction for image-level labeled images, so we investigate the performance of the RPN in the large-scale semi-supervised setting. In the experiments, 506 categories that are bounding-box-level labeled participate in training the RPN, while the other 505 categories, which are image-level labeled, do not contribute to its training. On the left of Fig. 4, we show the Recall-to-IoU results of the generated proposals on all categories and on the two kinds of labeled categories separately, with 300 proposals per image to simulate the inference process. On the right of Fig. 4, we show the Precision-to-IoU results of the proposals generated by the method shown in Fig. 3, with 10 proposals per image to simulate the training process. At an IoU threshold of 0.5, the recall for untrained categories is 0.75 and the precision is 0.48, which is acceptable under weakly supervised conditions. We conclude that the RPN transfers well to unseen categories and can be used to generate proposals for weakly supervised categories.
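For reference, recall and precision of proposals at a given IoU threshold can be computed as in the following sketch. It uses the standard definitions: recall is the fraction of ground-truth boxes matched by at least one proposal, and precision is the fraction of proposals that match some ground-truth box. The function names, the box format (x1, y1, x2, y2) and the example boxes are hypothetical and do not reproduce the evaluation code behind Fig. 4.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def recall_at_iou(gt_boxes: List[Box], proposals: List[Box],
                  threshold: float = 0.5) -> float:
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(any(iou(gt, p) >= threshold for p in proposals) for gt in gt_boxes)
    return hits / len(gt_boxes)


def precision_at_iou(gt_boxes: List[Box], proposals: List[Box],
                     threshold: float = 0.5) -> float:
    """Fraction of proposals that overlap some ground-truth box enough."""
    if not proposals:
        return 0.0
    hits = sum(any(iou(gt, p) >= threshold for gt in gt_boxes) for p in proposals)
    return hits / len(proposals)


# Hypothetical example: one ground-truth box, two proposals.
gt = [(0, 0, 10, 10)]
props = [(0, 0, 10, 10), (20, 20, 30, 30)]
example_recall = recall_at_iou(gt, props)
example_precision = precision_at_iou(gt, props)
```

Sweeping `threshold` over a range of values and averaging over images would give curves of the kind shown in Fig. 4.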
3.4 Performance Comparison
Table 2 compares the performance of our method with previous work. First, we evaluate our large-scale semi-supervised network on the ILSVRC DET validation set, which contains 20,121 images covering 200 categories used in our network training. Compared with the mAP of 36.0% obtained by the 1,000-class detector of R-FCN-3000, our proposed network obtains an mAP of 38.1%, a relative improvement of 5.83%. Notably, R-FCN-3000 uses bounding box annotations for all 1,000 categories, whereas only half of the target categories are bounding-box-level labeled in our model. Second, we evaluate our network on the ILSVRC CLS-LOC validation set, which contains 50,000 images covering 1,000 categories used in our network training. Because this validation set is designed for evaluating localization performance, its images are relatively simple compared with those of a general detection benchmark. Therefore, we select from the original dataset the images that contain multiple instances, and form a hard subset of the ILSVRC CLS-LOC validation set with 11,717 images. We obtain mAPs of 48.4% and 33.1% on the whole dataset and the hard subset, respectively, demonstrating the strong performance of our semi-supervised network.
4 Conclusion
In this paper, we propose a hierarchical structure and a joint training network for large-scale semi-supervised object detection. Based on the relationships among target categories, we design a hierarchical structure for recognition over a large number of categories. Based on the transferability of the RPN, we put forward a joint training method for two-stage detection frameworks that combines bounding-box-level labeled and image-level labeled images. Experiments show that our semi-supervised network performs well on all categories and outperforms previous work. Furthermore, the method can be extended to more categories, contributing to the goal of generic object detection.
-  Jia Deng, Alexander C Berg, Kai Li, and Li Fei-Fei, “What does classifying more than 10,000 image categories tell us?,” in European conference on computer vision. Springer, 2010, pp. 71–84.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  Ross Girshick, “Fast R-CNN,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “SSD: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko, “LSDA: Large scale detection through adaptation,” in Advances in neural information processing systems, 2014, pp. 3536–3544.
-  Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, and Liming Chen, “Visual and semantic knowledge transfer for large scale semi-supervised object detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 3045–3058, 2018.
-  Joseph Redmon and Ali Farhadi, “YOLO9000: Better, faster, stronger,” arXiv preprint, 2017.
-  George A Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
-  Bharat Singh, Hengduo Li, Abhishek Sharma, and Larry S Davis, “R-FCN-3000 at 30fps: Decoupling detection and classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1081–1090.
-  J Deng, A Berg, S Satheesh, H Su, A Khosla, and L Fei-Fei, “ILSVRC-2012, 2012,” URL http://www.image-net.org/challenges/ILSVRC, 2012.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.