The last few years have seen the success of deep neural networks in object detection tasks erhan2014scalable ; szegedy2014scalable ; girshick2014rich ; he2014spatial ; girshick2015fast ; ren2015faster ; huang2015densebox ; dai2016r . In practice, object detection often requires generating a set of bounding boxes, together with their classification labels, for the objects in a given image. However, it is nontrivial for convolutional neural networks (CNNs) to directly predict an orderless set of arbitrary cardinality (there are a few recent studies on the topic, such as rezatofighi2018deep ; stewart2016end ). One widely used workaround is to introduce anchors, which follow a divide-and-conquer idea and have been successfully demonstrated in state-of-the-art detection frameworks szegedy2014scalable ; ren2015faster ; liu2016ssd ; redmon2016you ; redmon2017yolo9000 ; he2017mask ; lin2017feature ; lin2017focal ; dai2016r . In short, the anchor method divides the box space (including position, size, class, etc.) into discrete bins (not necessarily disjoint) and generates each object box via the anchor function defined in the corresponding bin. Denote $\mathbf{x}$ as the feature extracted from the input image; then the anchor function for the $i$-th bin can be formulated as follows:
$$\mathcal{F}_{b_i}(\mathbf{x}; \theta_i) = \left( \mathcal{F}^{cls}_{b_i}(\mathbf{x}; \theta^{cls}_i),\ \mathcal{F}^{reg}_{b_i}(\mathbf{x}; \theta^{reg}_i) \right) \tag{1}$$
where $b_i \in \mathcal{B}$ is the prior (also named anchor box in ren2015faster ), which describes the common properties of object boxes associated with the $i$-th bin (e.g. averaged position/size and classification label); $\mathcal{F}^{cls}_{b_i}(\cdot)$ discriminates whether there exists an object box associated with the $i$-th bin, and $\mathcal{F}^{reg}_{b_i}(\cdot)$ regresses the relative location of the object box (if any) to the prior $b_i$; $\theta_i$ represents the parameters of the anchor function.
To model anchors with deep neural networks, one straightforward strategy is enumeration, which is adopted by most of the previous work ren2015faster ; szegedy2014scalable ; liu2016ssd ; redmon2016you ; redmon2017yolo9000 ; lin2017focal ; he2017mask ; lin2017feature . First, a number of predefined priors (or anchor boxes) $\mathcal{B}$ is chosen by hand ren2015faster or by statistical methods like clustering szegedy2014scalable ; redmon2017yolo9000 . Then, for each $b_i \in \mathcal{B}$, the anchor function $\mathcal{F}_{b_i}$ is usually implemented by one or a few neural network layers. Weights for different anchor functions are independent or partially shared. In this framework the anchor strategies (i.e. anchor box choices and the definition of the corresponding anchor functions) are therefore fixed in both training and inference. In addition, the number of available anchors is limited by the predefined $\mathcal{B}$.
In this paper, we propose a flexible alternative to model anchors: instead of enumerating every possible bounding box prior and modeling the corresponding anchor functions respectively, in our framework anchor functions are dynamically generated from arbitrary customized prior boxes. This is done by introducing a novel MetaAnchor module, which is defined as follows:
$$\mathcal{F}_{b_i} = \mathcal{G}(b_i; w) \tag{2}$$
where $\mathcal{G}(\cdot)$ is called the anchor function generator, which maps any bounding box prior $b_i$ to the corresponding anchor function $\mathcal{F}_{b_i}$; and $w$ represents the parameters. Note that in MetaAnchor the prior set $\mathcal{B}$ is not necessarily predefined; instead, it works in a customized manner: during inference, users can specify any anchor boxes, generate the corresponding anchor functions and use the latter to predict object boxes. In Sec. 3, we show that with the weight prediction mechanism ha2016hypernetworks the anchor function generator can be elegantly implemented and embedded into existing object detection frameworks for joint optimization.
In conclusion, compared with traditional predefined anchor strategies, we find that our proposed MetaAnchor has the following potential benefits (detailed experiments are presented in Sec. 4):
- MetaAnchor is more robust to anchor settings and bounding box distributions. In traditional approaches, the predefined anchor box set often needs careful design: too few anchors may be insufficient to cover rare boxes or may result in coarse predictions, whereas more anchors usually imply more parameters, which may lead to overfitting. In addition, many traditional strategies use independent weights to model different anchor functions, so anchors associated with few ground truth object boxes in training are very likely to produce poor results. In contrast, for MetaAnchor, anchor boxes of any shape can be randomly sampled during training so as to cover different kinds of object boxes, while the number of parameters stays constant. Furthermore, according to Equ. 2 different anchor functions are generated from the same weights $w$, so all the training data contribute to all the model parameters, which implies more robustness to the distribution of the training boxes.
- MetaAnchor helps to bridge the bounding box distribution gap between datasets. In the traditional framework, anchor boxes are predefined and kept unchanged for both training and test, which could be suboptimal for either dataset if their bounding box distributions differ. In MetaAnchor, by contrast, anchors can be flexibly customized to adapt to the target dataset (for example, via grid search) without retraining the whole detector.
2 Related Work
Anchor methodology in object detection.
Anchors (possibly under other names, e.g. “default boxes” in liu2016ssd , “priors” in szegedy2014scalable or “grid cells” in redmon2016you ) are employed in most of the state-of-the-art detection systems szegedy2014scalable ; ren2015faster ; lin2017feature ; lin2017focal ; liu2016ssd ; fu2017dssd ; he2017mask ; dai2016r ; redmon2017yolo9000 ; li2017fssd ; shen2017dsod ; hu2017learning . The essential properties of an anchor include position, size, class label and others. Currently most detectors model anchors via enumeration, i.e. predefining a number of anchor boxes with all kinds of positions, sizes and class labels, which leads to the following issues. First, anchor boxes need careful design, e.g. via clustering redmon2017yolo9000 , which is especially critical for specific detection tasks such as anchor-based face detection wang2018sface ; zhang2017s ; najibi2017ssh ; song2018beyond ; zhang2016joint and pedestrian detection wang2017repulsion ; dollar2012pedestrian ; zhang2016faster ; mao2017can . In particular, some papers suggest multi-scale anchors liu2016ssd ; lin2017feature ; lin2017focal to handle objects of different sizes. Second, predefined anchor functions may require too many parameters. Much work addresses this issue by weight sharing. For example, in contrast to earlier work like erhan2014scalable ; redmon2016you , detectors like ren2015faster ; liu2016ssd ; redmon2017yolo9000 and their follow-ups fu2017dssd ; lin2017feature ; dai2016r ; he2017mask ; lin2017focal employ translation-invariant anchors produced by fully convolutional networks, which share parameters across different positions. Two-stage frameworks such as ren2015faster ; dai2016r share weights across various classes, and lin2017focal shares weights across multiple detection heads. In comparison, our approach is free of these issues, as anchor functions are customized and generated dynamically.
Weight prediction refers to a mechanism in neural networks where weights are predicted by another structure rather than directly learned. It is mainly used in the fields of learning to learn ha2016hypernetworks ; andrychowicz2016learning ; wang2016learning and few/zero-shot learning elhoseiny2013write ; wang2016learning ; misra2017red . For object detection there are a few related works; for example, hu2017learning proposes to predict mask weights from box weights. There are mainly two differences from ours: first, in our MetaAnchor the purpose of weight prediction is to generate anchor functions, while in hu2017learning it is used for domain adaptation (from object box to segmentation mask); second, in our work weights are generated almost “from scratch”, while in hu2017learning the source is the learned box weights.
3.1 Anchor Function Generator
In the MetaAnchor framework, the anchor function is dynamically generated from the customized box prior (or anchor box) $b_i$ rather than being a fixed function associated with a predefined anchor box. So, the anchor function generator $\mathcal{G}(\cdot)$ (see Equ. 2), which maps $b_i$ to the corresponding anchor function $\mathcal{F}_{b_i}$, plays a key role in the framework. In order to model $\mathcal{G}(\cdot)$ with a neural network, inspired by hu2017learning ; ha2016hypernetworks , we first assume that for different $b_i$ the anchor functions $\mathcal{F}_{b_i}$ share the same formulation $\mathcal{F}(\cdot)$ but have different parameters, which means:
$$\mathcal{F}_{b_i}(\mathbf{x}; \theta_i) = \mathcal{F}(\mathbf{x}; \theta_{b_i}) \tag{3}$$
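Then, since each anchor function is distinguished only by its parameters $\theta_{b_i}$, the anchor function generator can be formulated to predict $\theta_{b_i}$ as follows:
$$\theta_{b_i} = \mathcal{G}(b_i; w) = \theta^{*} + \mathcal{R}(b_i; w) \tag{4}$$
where $\theta^{*}$ stands for the shared parameters (independent of $b_i$ and also learnable), and the residual term $\mathcal{R}(b_i; w)$ depends on the anchor box $b_i$.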
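In the paper we implement $\mathcal{R}(\cdot)$ with a simple two-layer network:
$$\mathcal{R}(b_i; w) = W_2\, \sigma(W_1 b_i) \tag{5}$$
Here, $W_1$ and $W_2$ are the learnable parameters and $\sigma(\cdot)$ is the activation function. In practice the number of hidden neurons $m$ is usually much smaller than the dimension of $\theta_{b_i}$, which causes the weights predicted by $\mathcal{R}(\cdot)$ to lie in a significantly low-rank subspace. That is why we formulate $\mathcal{G}(\cdot)$ in a residual form in Equ. 4 rather than directly using $\mathcal{R}(\cdot)$. We also surveyed more complex designs for $\mathcal{G}(\cdot)$; however, they yield comparable benchmarking results.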
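The residual formulation of Equ. 4, with a two-layer network as the residual term, can be illustrated by the following minimal pure-Python sketch. It is not the authors' implementation: the ReLU activation, the layer sizes, the random initialization and every name in it (`AnchorFunctionGenerator`, `theta_star`, `w1`, `w2`) are assumptions made for the example.

```python
import random


def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AnchorFunctionGenerator:
    """Predicts anchor-function parameters as shared weights plus a residual."""

    def __init__(self, prior_dim, hidden_dim, param_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.01) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1 = init(hidden_dim, prior_dim)   # first layer of the residual net
        self.w2 = init(param_dim, hidden_dim)   # second layer of the residual net
        # Shared parameters, independent of the prior and also learnable.
        self.theta_star = [rng.gauss(0.0, 0.01) for _ in range(param_dim)]

    def residual(self, prior):
        """Two-layer network: w2 * relu(w1 * prior)."""
        return matvec(self.w2, relu(matvec(self.w1, prior)))

    def __call__(self, prior):
        """Shared parameters plus the prior-dependent residual."""
        return [s + r for s, r in zip(self.theta_star, self.residual(prior))]


# Hypothetical sizes: a 2-d prior, 128 hidden units, 1024 predicted weights.
gen = AnchorFunctionGenerator(prior_dim=2, hidden_dim=128, param_dim=1024)
theta_a = gen([0.0, 0.0])    # an all-zero prior leaves only the shared weights
theta_b = gen([0.7, -0.7])   # a different prior yields different weights
```

Because the hidden layer is much narrower than the output, the residual alone spans at most a 128-dimensional subspace of the 1024-dimensional parameter space in this example; the shared term is what lets the predicted weights leave that subspace.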
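In addition, we introduce a data-dependent variant of the anchor function generator, which takes the input feature $\mathbf{x}$ into the formulation:
$$\theta_{b_i} = \mathcal{G}(b_i; \mathbf{x}, w) = \theta^{*} + W_2\, \sigma\left(W_{11} b_i + W_{12}\, r(\mathbf{x})\right) \tag{6}$$
where $r(\cdot)$ is used to reduce the dimension of the feature $\mathbf{x}$; we empirically find that for a convolutional feature $\mathbf{x}$, using a global average pooling he2016deep ; szegedy2015going operation for $r(\cdot)$ usually produces good results.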
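To make the data-dependent variant concrete, the sketch below computes a hidden activation from both the prior and a globally average-pooled feature map. It assumes that the two inputs are projected linearly and summed before a ReLU; the function names, weight matrices and toy feature values are hypothetical and only serve the illustration.

```python
def global_average_pool(feature):
    """Reduce a C x H x W feature (nested lists) to a length-C vector."""
    pooled = []
    for channel in feature:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def data_dependent_hidden(prior, feature, w_prior, w_feat):
    """Hidden activation mixing the prior with the pooled feature."""
    pooled = global_average_pool(feature)
    pre_activation = [
        sum(a * p for a, p in zip(row_p, prior))
        + sum(c * f for c, f in zip(row_f, pooled))
        for row_p, row_f in zip(w_prior, w_feat)
    ]
    return [max(0.0, v) for v in pre_activation]  # ReLU (assumed)


feature = [
    [[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
    [[0.0, 0.0], [2.0, 2.0]],   # channel 1, mean 1.0
]
w_prior = [[1.0, 0.0], [0.0, 1.0]]   # toy projection of the prior
w_feat = [[0.5, 0.0], [0.0, -2.0]]   # toy projection of the pooled feature
hidden = data_dependent_hidden([0.5, 0.5], feature, w_prior, w_feat)
print(hidden)  # [2.5, 0.0]
```

The hidden vector would then be passed through a second linear layer and added to the shared parameters, exactly as in the data-independent case.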
3.2 Architecture Details
In theory MetaAnchor can work with most of the existing anchor-based object detection frameworks ren2015faster ; liu2016ssd ; redmon2016you ; redmon2017yolo9000 ; lin2017focal ; he2017mask ; lin2017feature ; li2017light ; li2018detnet ; dai2016r . Among them, for the two-stage detectors ren2015faster ; dai2016r ; lin2017feature ; he2017mask ; li2017light anchors are usually used to model “objectness” and generate box proposals, while fine results are predicted by RCNN-like modules girshick2014rich ; girshick2015fast in the second stage. We tried MetaAnchor in these frameworks and observed some improvements on the box proposals (e.g. improved recall); however, it seems to bring no benefit to the final predictions, whose quality we believe is mainly determined by the second stage. Therefore, in the paper we mainly study the case of single-stage detectors redmon2016you ; liu2016ssd ; redmon2017yolo9000 ; lin2017focal .
We choose the state-of-the-art single-stage detector RetinaNet lin2017focal as an instance to apply MetaAnchor. Note that our methodology is also applicable to other single-stage frameworks such as redmon2017yolo9000 ; liu2016ssd ; fu2017dssd ; shen2017dsod . Fig 1(a) gives an overview of RetinaNet. In short, 5 levels of features are extracted from a “U-shaped” backbone network, where $P_3$ stands for the finest feature map (i.e. with the largest resolution) and $P_7$ is the coarsest. For each level of feature, a subnet named “detection head” in Fig 1 is attached to generate detection results. Anchor functions are defined at the tail of each detection head. Following the settings in lin2017focal , anchor functions are implemented by a convolutional layer, and for each detection head $3 \times 3 \times 80 = 720$ types of anchor boxes (3 scales, 3 aspect ratios and 80 classes) are predefined. Thus for each anchor function, there should be 720 filters for the classification term and 36 filters for the regression term ($3 \times 3 \times 4$, as the regression term is class-agnostic).
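The filter counts of the enumerated anchor head follow directly from the numbers of scales, aspect ratios and classes quoted above. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the factor 4 is the number of box regression targets per anchor.

```python
def enumerated_filter_counts(num_scales, num_ratios, num_classes):
    """Output filters of an enumerated anchor head at one spatial location."""
    anchors_per_location = num_scales * num_ratios
    cls_filters = anchors_per_location * num_classes  # one score per anchor and class
    reg_filters = anchors_per_location * 4            # class-agnostic box offsets
    return cls_filters, reg_filters


cls_filters, reg_filters = enumerated_filter_counts(3, 3, 80)
print(cls_filters, reg_filters)  # 720 36
```

Doubling the number of scales or aspect ratios doubles both counts, which is the parameter growth that a generated anchor function avoids.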
In order to apply MetaAnchor, we need to redesign the original anchor functions so that their parameters are generated from the customized anchor box $b_i$. First of all, we consider how to encode $b_i$. According to the definition in Sec. 1, $b_i$ should be a vector which includes information such as position, size and class label. In RetinaNet, thanks to the fully convolutional structure, position is naturally encoded by the coordinates of the feature maps and thus does not need to be included in $b_i$. As for the class label, there are two alternatives: A) directly encode it in $b_i$, or B) let $\mathcal{G}(\cdot)$ predict weights for each class respectively. We empirically find that Option B is easier to optimize and usually results in better performance than Option A. So, in our experiments $b_i$ is mainly related to the anchor size. Motivated by the bounding box encoding method introduced in girshick2014rich ; ren2015faster , $b_i$ is represented as follows:
$$b_i = \left( \log\frac{ah_i}{AH},\ \log\frac{aw_i}{AW} \right) \tag{7}$$
where $ah_i$ and $aw_i$ are the height and width of the corresponding anchor box, and $(AH, AW)$ is the size of the “standard anchor box”, which is used as a normalization term. We also surveyed a few other alternatives, for example using the scale and aspect ratio to represent the size of anchor boxes, which gives results comparable to those of Equ. 7.
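A small sketch of such a size encoding is given below. It assumes a log-ratio of the anchor height and width to the standard anchor box, in the spirit of the box encoding of girshick2014rich ; ren2015faster ; the function names and the example sizes are hypothetical.

```python
import math


def encode_prior(anchor_h, anchor_w, standard_h, standard_w):
    """Encode an anchor size relative to the standard anchor box (log-ratio)."""
    if min(anchor_h, anchor_w, standard_h, standard_w) <= 0:
        raise ValueError("box sizes must be positive")
    return (math.log(anchor_h / standard_h), math.log(anchor_w / standard_w))


def decode_prior(prior, standard_h, standard_w):
    """Invert encode_prior and recover the anchor height and width."""
    return (standard_h * math.exp(prior[0]), standard_w * math.exp(prior[1]))


# The standard box itself maps to the origin of the encoding space.
origin = encode_prior(32.0, 32.0, 32.0, 32.0)
# A box twice as tall and half as wide maps to (log 2, -log 2).
tall = encode_prior(64.0, 16.0, 32.0, 32.0)
recovered = decode_prior(tall, 32.0, 32.0)
```

With this encoding, scaling a box by a constant factor shifts its code by a constant, so a uniform perturbation of the code corresponds to a multiplicative change of the box size.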
Fig 1(b) illustrates the usage of MetaAnchor in each detection head of RetinaNet. In the original design lin2017focal , the classification and box regression parts of anchor functions are attached to separate feature maps ($\mathbf{x}_{cls}$ and $\mathbf{x}_{reg}$) respectively; so in MetaAnchor, we also use two independent anchor function generators $\mathcal{G}_{cls}(\cdot)$ and $\mathcal{G}_{reg}(\cdot)$ to predict their weights respectively. The design of $\mathcal{G}(\cdot)$ follows Equ. 4 (data-independent variant) or Equ. 6 (data-dependent variant), in which the number of hidden neurons $m$ is set to 128. In addition, recall that in MetaAnchor anchor functions are dynamically derived from $b_i$ rather than predefined by enumeration; so, the number of filters reduces to 80 for $\mathcal{F}^{cls}(\cdot)$ (80 classes, for example) and to 4 for $\mathcal{F}^{reg}(\cdot)$.
It is also worth noting that in RetinaNet lin2017focal corresponding layers in all levels of detection heads share the same weights, even including the last layers, which stand for anchor functions. However, the definitions of anchors differ from level to level: for example, suppose that in the $l$-th level an anchor function is associated with the anchor box of size $(ah, aw)$; then in the $(l+1)$-th level (with 50% smaller resolution), the same anchor function should detect with a 2x larger anchor box, i.e. $(2ah, 2aw)$. So, in order to stay consistent with the original design, in MetaAnchor we use the same anchor function generators $\mathcal{G}_{cls}(\cdot)$ and $\mathcal{G}_{reg}(\cdot)$ for each level of detection head, while the “standard boxes” in Equ. 7 differ between levels: suppose the standard box size in the $l$-th level is $(AH_l, AW_l)$, then for the $(l+1)$-th level we set $(AH_{l+1}, AW_{l+1}) = (2AH_l, 2AW_l)$. In our experiments, the size of the standard box in the lowest level (i.e. $P_3$, which has the largest resolution) is set to the average of all the anchor box sizes (shown in the last column in Table 1).
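The effect of doubling the standard box from one level to the next can be checked with a few lines of Python. The sketch assumes a log-ratio size encoding and RetinaNet-style level indices starting at 3; the base standard box of 40 x 40 and the function names are hypothetical.

```python
import math


def standard_box(base_size, base_level, level):
    """Standard box size at `level`, doubling with every coarser level."""
    scale = 2 ** (level - base_level)
    return base_size[0] * scale, base_size[1] * scale


def encode_prior(anchor_size, standard_size):
    """Log-ratio of anchor height/width to the standard box (assumed encoding)."""
    return (math.log(anchor_size[0] / standard_size[0]),
            math.log(anchor_size[1] / standard_size[1]))


base = (40.0, 40.0)  # hypothetical standard box at the finest level (level 3)
fine = encode_prior((64.0, 32.0), standard_box(base, 3, 3))
coarse = encode_prior((128.0, 64.0), standard_box(base, 3, 4))
```

Here `fine` and `coarse` coincide: an anchor that is twice as large on the next coarser level receives the same code, so a generator shared across levels predicts the same weights for both, matching the weight sharing of the original detection heads.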
In this section we evaluate our proposed MetaAnchor on the COCO object detection task lin2014microsoft . The basic detection framework is RetinaNet lin2017focal as introduced in Sec. 3.2, with a ResNet-50 he2016deep backbone feature extractor pretrained on the ImageNet classification dataset russakovsky2015imagenet . For MetaAnchor, we use the data-independent variant of the anchor function generator (Equ. 4) unless otherwise mentioned. MetaAnchor subnets are jointly optimized with the backbone detector during training. We do not use Batch Normalization ioffe2015batch in MetaAnchor.
Following the common practice lin2017focal in the COCO detection task, for training we use two different dataset splits: COCO-full and COCO-mini; for test, all results are evaluated on the minival set, which contains 5000 images. COCO-full includes all the images in the original training and validation sets excluding minival images, while COCO-mini is a subset of around 20000 images. Results are mainly evaluated with COCO standard metrics such as mmAP.
Training and evaluation configurations.
For fair comparison, we follow most of the settings in lin2017focal (image size, learning rate, etc.) for all the experiments, except for a few differences as follows. In lin2017focal , $3 \times 3$ anchor boxes (i.e. 3 scales and 3 aspect ratios) are predefined for each level of detection head. In the paper, more anchor boxes are employed in some experiments. Table 1 lists the anchor box configurations for feature level $P_3$, where the $3 \times 3$ case is identical to that in lin2017focal . Settings for other feature levels can be derived accordingly (see Sec. 3.2). As for MetaAnchor, since predefined anchors are not needed, we suggest the following strategy. In training, we first select an anchor box configuration from Table 1 (e.g. $5 \times 5$), then generate the 25 $b_i$s according to Equ. 7; in each iteration, we randomly augment each $b_i$ within a fixed range, calculate the corresponding ground truth and use them for optimization. We call this methodology “training with $5 \times 5$ anchors”. At test time, the $b_i$s are also set by a certain anchor box configuration, without augmentation (not necessarily the same configuration as used in training). We argue that with this training/inference scheme, it is possible to make direct comparisons between MetaAnchor and the counterpart baselines.
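The per-iteration augmentation of the priors can be sketched as follows. This is only an illustration under stated assumptions: the uniform distribution, the `max_shift` value of 0.25, the three example priors and the function name are hypothetical, and the ground truth matching step is left as a comment.

```python
import random


def augment_priors(priors, max_shift, rng):
    """Shift every coordinate of every prior by a uniform offset in [-max_shift, max_shift]."""
    return [tuple(c + rng.uniform(-max_shift, max_shift) for c in prior)
            for prior in priors]


rng = random.Random(0)
base_priors = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5)]  # hypothetical encoded priors
for iteration in range(3):
    priors = augment_priors(base_priors, max_shift=0.25, rng=rng)
    # A real training step would match ground truth boxes to these priors,
    # generate the anchor functions from them and update the model here.
```

At test time the same `base_priors` (or those of another configuration) would be used without calling `augment_priors`.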
In the following subsections, first we study the performances of MetaAnchor by a series of controlled experiments on COCO-mini. Then we report the fully-equipped results on COCO-full dataset.
|# of Anchors||Scales (see footnote 2)||Aspect Ratios|
Footnote 2: Here we follow the same definition of scale and aspect ratio as in lin2017focal .
4.1 Ablation Study
4.1.1 Comparison with RetinaNet baselines
Table 2 compares the performance of MetaAnchor and the RetinaNet baseline on the COCO-mini dataset. Here we use the same anchor box settings for training and test. The column “Threshold” gives the intersection-over-union (IoU) thresholds for positive/negative anchor boxes respectively in training (detailed definitions are given in ren2015faster ; lin2017focal ).
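For readers unfamiliar with the positive/negative thresholds, the sketch below shows the usual assignment rule from ren2015faster ; lin2017focal : an anchor is positive if its best IoU with any ground truth box reaches the first threshold, negative if it stays below the second, and ignored in between. The function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_anchor(anchor, gt_boxes, pos_thr, neg_thr):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one anchor box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1


anchor = (0.0, 0.0, 10.0, 10.0)
gt_boxes = [(0.0, 0.0, 10.0, 5.0)]  # overlaps the anchor with IoU 0.5
loose = assign_anchor(anchor, gt_boxes, pos_thr=0.5, neg_thr=0.4)
strict = assign_anchor(anchor, gt_boxes, pos_thr=0.6, neg_thr=0.5)
print(loose, strict)  # 1 -1
```

The same anchor is a positive sample under 0.5/0.4 but is ignored under 0.6/0.5, which illustrates why stricter thresholds leave each anchor with fewer positive ground truth boxes.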
For the analysis, we first compare the rows with the threshold of 0.5/0.4. It is clear that MetaAnchor outperforms the counterpart baselines on each of the anchor configurations and on both evaluation metrics (mmAP and AP50). We suppose the improvements may come from two aspects: first, in MetaAnchor the sizes of anchor boxes can be augmented, which makes the anchor functions generate a wider range of predictions and may enhance the model capability (especially important for the cases with a smaller number of anchors); second, rather than predefined anchor functions with independent parameters, MetaAnchor allows all the training boxes to contribute to the shared generators, which seems beneficial to the robustness over different configurations or object box distributions.
For further investigation, we try a stricter IoU threshold (0.6/0.5) for training to encourage more precise anchor box association; statistically, however, there are then fewer chances for each anchor to be assigned a positive ground truth. Results are also presented in Table 2. We find that the results of all the baseline models suffer from significant drops, especially on AP50, which implies the degradation of anchor functions; furthermore, simply increasing the number of anchors has little effect on the performance. For MetaAnchor, in contrast, 3 out of 4 configurations are less affected (for one configuration, mmAP even improves by 0.3%). The only exception is a single configuration; however, according to Table 3 we believe its degradation is mainly due to too few anchor boxes at inference rather than poor training. So, the comparison supports our hypothesis: MetaAnchor helps to use training samples in a more efficient and robust way.
|Threshold||# of Anchors||Baseline (%)||MetaAnchor (%)|
|# of Anchors||search|
4.1.2 Comparison of various anchor configurations in inference
Unlike the traditional fixed or predefined anchor strategy, one of the major benefits of MetaAnchor is the ability to use a flexible anchor scheme at inference time. Table 3 compares a variety of anchor box configurations (refer to Table 1; note that the normalization coefficient should be consistent with that used in training) for inference along with their scores on COCO-mini. For each experiment the IoU threshold in training is set to 0.6/0.5. From the results we find that, for a variety of training configurations, more anchor boxes in inference usually produce higher performance.
Table 3 also implies that the improvements quickly saturate as the number of anchor boxes increases, i.e. adding even more anchors brings only minor improvements, which is also observed in Table 2. We revisit the anchor configurations in Table 1 and find that the largest configurations tend to involve too “dense” anchor boxes, thus predicting highly overlapped results which might contribute little to the final performance. Inspired by this phenomenon, we come up with an inference approach via greedy search: at each step we randomly select one anchor box $b_i$, generate the predictions and evaluate the results combined with those of the previous step (performed on a subset of training data); if the score improves, we update the current predictions with the combined results, otherwise we discard the predictions of the current step. The final anchor configuration is obtained after a few steps. Improved results are shown in the last column (named “search”) of Table 3.
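The greedy search described above can be sketched in a few lines. The sketch is hedged: `score_fn` stands in for running the detector with a given anchor set and evaluating it on a subset of training data, and the toy scoring function, the candidate sizes and all names are hypothetical.

```python
import random


def greedy_anchor_search(candidates, score_fn, num_steps, seed=0):
    """Greedily grow an anchor set, keeping a candidate only if the score improves."""
    rng = random.Random(seed)
    selected = []
    best_score = score_fn(selected)          # score of the empty configuration
    for _ in range(num_steps):
        candidate = rng.choice(candidates)   # randomly pick one anchor box
        trial = selected + [candidate]       # combine with the previous step
        score = score_fn(trial)
        if score > best_score:               # keep only if the score improves
            selected, best_score = trial, score
    return selected, best_score


# Toy stand-in for detector evaluation (hypothetical): reward covering the
# target sizes, with a small penalty per anchor for redundant predictions.
TARGET_SIZES = {16, 32, 64}


def toy_score(anchor_sizes):
    covered = len(TARGET_SIZES.intersection(anchor_sizes))
    return covered - 0.1 * len(anchor_sizes)


selected, best = greedy_anchor_search([8, 16, 24, 32, 48, 64], toy_score, num_steps=50)
print(sorted(selected), round(best, 2))
```

Since only the set of priors changes between steps, each candidate evaluation needs a forward pass with newly generated anchor functions but no retraining.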
4.1.3 Cross evaluation between datasets of different distributions
Though domain adaptation or transfer learning pan2010survey is outside the design purpose of MetaAnchor, the technique of weight prediction ha2016hypernetworks , which is also employed in the paper, has recently been applied successfully in those tasks hu2017learning ; hoffman2014lsda . So, it is interesting to evaluate whether MetaAnchor is able to bridge the distribution gap between two datasets. More specifically, how does the detection model perform if it is trained on another dataset which has the same class labels but a different distribution of object box sizes?
We perform the experiment on COCO-mini, in which we “drop” some boxes in the training set. However, it seems nontrivial to directly erase the objects in an image; instead, during training, whenever we use a ground truth box which falls in a certain size range, we manually set the corresponding loss to 0. As for test, we use all the data in the validation set. Therefore, the distributions of the boxes used in training and test are very different. Table 4 shows the evaluation results. After some ground truth boxes are erased, all the scores drop significantly; however, compared with the RetinaNet baseline, MetaAnchor suffers from smaller degradation and generates much better predictions, which shows its potential on transfer tasks.
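The way ground truth boxes are “dropped” can be sketched as a simple loss mask. The sketch assumes that box size is measured as the square root of the box area and that the range is half-open; the range bounds, example boxes and function name are hypothetical and are not the values used in the experiment.

```python
import math


def mask_dropped_losses(gt_boxes, losses, low, high):
    """Zero the loss of every ground truth box whose size falls in [low, high)."""
    masked = []
    for (x1, y1, x2, y2), loss in zip(gt_boxes, losses):
        size = math.sqrt((x2 - x1) * (y2 - y1))  # box size as sqrt(area), assumed
        masked.append(0.0 if low <= size < high else loss)
    return masked


gt_boxes = [(0, 0, 20, 20), (0, 0, 100, 100), (0, 0, 300, 300)]
losses = [0.8, 1.2, 0.5]
masked = mask_dropped_losses(gt_boxes, losses, low=50.0, high=150.0)
print(masked)  # [0.8, 0.0, 0.5]
```

The image itself is left untouched; only the training signal of the medium-sized box disappears, so the detector never learns from boxes in that range.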
|# of Anchors||Baseline (all)||MetaAnchor (all)||Baseline (drop)||MetaAnchor (drop)|
4.1.4 Data-independent vs. data-dependent anchor function generators
In Sec. 3.1 we introduce two variants of anchor function generators: data-independent (Equ. 4) and data-dependent (Equ. 6). In the above subsections we mainly evaluate the data-independent one. Table 5 compares the performance of the two alternatives. For simplicity, we use the same training and test anchor configurations; the IoU threshold is 0.6/0.5. Results show that in most cases the data-dependent variant is slightly better; however, the difference is small. We also report the scores after anchor configuration search (described in Sec. 4.1.2).
|# of Anchors||Data-independent||Data-dependent|
|search (see footnote 3)||27.6||28.0|
Footnote 3: Based on the models with anchor configuration in training.
4.2 Results on COCO Object Detection
Finally, we compare our fully-equipped MetaAnchor models with RetinaNet lin2017focal baselines on the COCO-full dataset (also called trainval35k in lin2017focal ). As mentioned at the beginning of Sec. 4, we follow the same evaluation protocol as lin2017focal . The same input resolution is used in both training and test. The backbone feature extractor is ResNet-50 he2016deep . Performance is benchmarked with the COCO standard mmAP on the minival dataset.
Table 6 lists the results. Interestingly, our reimplemented RetinaNet model is 1.8% better than the counterpart reported in lin2017focal . For better understanding, we further investigate a number of anchor box configurations (including those in Table 1) and retrain the baseline model, the best of which is named “RetinaNet” and marked with “search” in Table 6. In comparison, our MetaAnchor model achieves 37.5% mmAP on COCO minival, which is 1.7% better than the original RetinaNet (our implementation) and 0.6% better than the best searched entry of RetinaNet. Our data-dependent variant (Equ. 6) further boosts the performance by 0.4%. In addition, we argue that for MetaAnchor the configuration for inference can be easily obtained by the greedy search introduced in Sec. 4.1.2 without retraining.
Fig 2 visualizes some detection results predicted by MetaAnchor. It is clear that the shapes of the detected boxes vary according to the customized anchor box $b_i$.
|# of Anchors||# of Anchors||mmAP (%)|
|RetinaNet (our impl.)||35.8|
|RetinaNet (our impl.)||search||search||36.9|
|MetaAnchor (ours, data-dependent)||search||37.9|
We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks, in which anchor functions can be dynamically generated from arbitrary customized prior boxes. Thanks to weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. Our experiments on the COCO detection task show that MetaAnchor consistently outperforms its counterparts in various scenarios.
-  M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
-  P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence, 34(4):743–761, 2012.
-  M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2584–2591. IEEE, 2013.
-  D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014.
-  C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
-  R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In european conference on computer vision, pages 346–361. Springer, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. Lsda: Large scale detection through adaptation. In Advances in Neural Information Processing Systems, pages 3536–3544, 2014.
-  R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. arXiv preprint arXiv:1711.10370, 2017.
-  L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
-  Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Detnet: A backbone network for object detection. arXiv preprint arXiv:1804.06215, 2018.
-  Z. Li and F. Zhou. Fssd: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
-  I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
-  M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. Ssh: Single stage headless face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4875–4884, 2017.
-  S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  S. H. Rezatofighi, R. Kaskman, F. T. Motlagh, Q. Shi, D. Cremers, L. Leal-Taixé, and I. Reid. Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv preprint arXiv:1805.00613, 2018.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), volume 3, page 7, 2017.
-  G. Song, Y. Liu, M. Jiang, Y. Wang, J. Yan, and B. Leng. Beyond trade-off: Accelerate fcn-based face detector with higher accuracy. arXiv preprint arXiv:1804.05197, 2018.
-  R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2325–2333, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. Cvpr, 2015.
-  C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
-  J. Wang, Y. Yuan, G. Yu, and S. Jian. Sface: An efficient network for face detection in large scale variations. arXiv preprint arXiv:1804.06559, 2018.
-  X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. arXiv preprint arXiv:1711.07752, 2017.
-  Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision, pages 616–634. Springer, 2016.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
-  L. Zhang, L. Lin, X. Liang, and K. He. Is faster r-cnn doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457. Springer, 2016.
-  S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. arXiv preprint arXiv:1708.05237, 2017.