based on deep convolutional neural networks have been proposed. A general object detector consists of three modules: a backbone, a neck, and a detection head. To achieve higher performance, all the modules should be carefully designed. Recently, Neural Architecture Search (NAS) strategies have demonstrated promising results on the object detection task [7, 2, 37, 32, 6, 12, 4, 9]. Most of these works [7, 2, 37, 32, 6, 12, 4] focus on searching for a specific module, while [9] directly searches for the whole detector architecture. The automatically designed architectures usually yield higher accuracy or efficiency.
Previous methods aim to train a generic yet fixed detector, i.e., they treat all input images equally regardless of the scenario. While achieving high accuracy under certain computational costs, these models lack the ability to change their capacity according to the content of different inputs. We claim that it is inefficient to use a fixed model to detect all the samples. As shown in Fig. 1, the scale of objects, the illumination, and the background vary a lot among different samples in a real-world dataset. Ideally, it is more efficient to use a light-weight model for simple cases and a complex model for hard examples.
Motivated by the above observations, we propose to learn to generate content-aware dynamic detectors (CADDet). We aim to train dynamic detectors with the following properties. i). The detector is able to automatically generate architectures according to different input images. ii). For the easy-to-detect samples, the generated architectures are light-weight, and vice versa. iii). For the images with similar visual properties, the corresponding architectures are also similar.
We propose to apply the dynamic routing strategy to generate such dynamic detectors. Dynamic routing is an adaptive inference mechanism for deep neural networks. During the inference phase, for each sample, only part of the whole network is activated, resulting in dynamic network architectures. Recently, dynamic routing has been drawing increasing attention in many tasks. Some works apply block dropping [38, 39, 33, 11, 35, 14, 36] or channel pruning [29, 13, 41] strategies for efficiency in image classification. Most recently, Li et al. proposed a multi-path dynamic network to alleviate scale variance in semantic segmentation. Different from previous methods, this work focuses on utilizing the detection-related properties and measuring the scene complexity to guide the learning of dynamic object detectors.
We start by designing the routing space. Considering the multi-scale property of detectors, we design a multi-scale densely connected supernet with an architecture similar to that of segmentation models such as Auto-DeepLab and Li et al. With this design, CADDet is able to combine the backbone and neck parts. Then, we propose a scale encoding method and introduce a coarse-to-fine strategy to guide the learning of dynamic routing tailored for object detection. On the one hand, globally, we assign different samples various computational budgets (e.g., FLOPs or MAdds) according to their complexities. On the other hand, we introduce a local path similarity regularization method. We evaluate CADDet on the well-known MS-COCO dataset. Results demonstrate that CADDet achieves 1.8 higher mAP with 10% fewer MAdds compared with the vanilla routing strategy. Compared with architectures with similar building blocks, e.g., MobileNet+FPN and MobileNetV2+FPN, CADDet is able to achieve competitive mAP with 42% fewer MAdds.
2 Related Work
2.1 Object Detection
Existing object detection models can be roughly divided into two-stage detectors [8, 27, 1, 16] and one-stage detectors [25, 21, 17, 42, 34]. The two-stage models tend to be more accurate, while the one-stage detectors are usually simpler and thus more efficient. In this paper, we focus on model efficiency and thus choose to learn dynamic architectures for one-stage object detectors. Next, we introduce both the hand-crafted and the NAS-based detection architectures.
2.1.1 Hand-Crafted Architectures
Due to the huge scale variance of input samples, the representation of multi-scale features is the main problem in object detection. Earlier works like SSD and Overfeat directly utilize the feature maps from intermediate layers of the backbone to perform the subsequent prediction tasks. FPN first introduces the lateral connection operation and proposes a top-down feature aggregation strategy. After that, many works [24, 20] focus on designing efficient necks. Besides, backbones with recurrent down-sample and up-sample modules [23, 31] are proposed. HRNet tackles this problem by introducing a multi-branch backbone, where the nearby feature maps communicate with each other through feature fusion.
2.1.2 Neural Architecture Search
Recently, Neural Architecture Search (NAS) has shown promising results on object detection. Among them, NAS-FPN and Auto-FPN, as the pioneering works, search for architectures of the neck part. EfficientDet, DetNAS, and SP-NAS learn the backbones and combine them with fixed necks. SpineNet learns to permute and connect the intermediate layers in the backbone. Most recently, Hit-Det proposes to search for the whole detector. The main difference between NAS-based detectors and hand-crafted ones is that NAS implicitly or explicitly adds an evolution stage to adjust the architecture of the model. However, the evolution stage is performed to achieve higher accuracy on the training or validation set. During the inference phase, the model has no ability to adjust itself adaptively.
To summarize, both hand-crafted and automatically designed models aim at finding an optimal yet fixed model. Different from these models, our CADDet can generate sample-adaptive dynamic architectures.
2.2 Dynamic Inference
In order to achieve higher accuracy or speed up the model, various types of dynamic models have been proposed. During the inference phase, they either generate dynamic model parameters (dynamic convolution), or generate dynamic architectures (dynamic routing), or perform early prediction depending on the input image. Among these strategies, our CADDet is most related to dynamic routing. Dynamic routing models adopt the idea of layer-skipping to generate a suitable sub-network for the current input sample. Most of these models are designed for the image classification task. SkipNet introduces recurrent gates to judge whether a certain block should be dropped. BlockDrop utilizes an additional policy network to choose the layers during inference. HydraNets increase the width of blocks by adding more specialized components in each layer. MSDNet proposes a multi-scale dense network and learns the connection pattern during training. MutualNet trains the supernet with multiple widths using images with multiple input resolutions and introduces knowledge distillation during the training process. CoDiNet proposes to regularize the routes according to sample similarity. Li et al. propose to learn dynamic routing in semantic segmentation to alleviate the problem of scale variance. Compared with these methods, our CADDet is the first work to introduce the dynamic routing mechanism in object detection. Moreover, we take into consideration the properties of object detection and introduce the global and local regularization metrics to guide the learning of dynamic routing.
3 Content-Aware Dynamic Detectors
In this section, we introduce the Content-Aware Dynamic Detectors (CADDet) in detail. We first outline the overall structure of CADDet as well as the basic dynamic routing mechanism in Sec. 3.1. Next, in Sec. 3.2 and Sec. 3.3, we introduce a coarse-to-fine regularization strategy tailored for object detection, including the dynamic global budget constraint and the local path similarity regularization, aiming to achieve higher computational efficiency and obtain more diverse architectures. Finally, we present the architecture details in Sec. 3.4.
3.1 Dynamic Routing for Detection
In order to increase the capacity of the model and enlarge the routing space, we apply a multi-scale densely connected network following MSDNet, Auto-DeepLab, and Li et al. as the supernet. As shown in Fig. 2, the supernet starts with a fixed 3-layer “stem” block, which down-samples the resolution of input images to 1/8. Then, we add a network which maintains the feature representation at 4 different scales in the remaining layers. The 4 scales correspond to C3, C4, C5, and C6, respectively, as in other detectors. For every 2 adjacent scales, the feature map with the lower resolution has twice as many channels. Similar to the design in Li et al., there are 3 candidate paths for each computation node. The paths include resolution-up, resolution-keep, and resolution-down, except at the boundary scales. Right after the supernet, we connect the feature map of each scale to a detection head to generate predictions. For simplicity, we use the commonly used one-stage dense heads as in FCOS. It is worth noting that the supernet itself contains feature maps of multiple scales and there are cross-scale feature fusion operations in each node; thus, there is no need to use an additional neck (e.g., FPN).
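The routing space described above can be made concrete with a small sketch. The helper names (`scale_config`, `candidate_paths`) are hypothetical and only for illustration; the stride follows the 1/8 stem, and the base channel number of 64 is taken from the per-scale channel numbers given in Sec. 3.4.

```python
from typing import Dict, List

NUM_SCALES = 4      # feature scales maintained by the supernet
BASE_STRIDE = 8     # the stem down-samples the input to 1/8
BASE_CHANNELS = 64  # channel number at the highest-resolution scale (Sec. 3.4)


def _check_scale(scale_idx: int) -> None:
    if not 0 <= scale_idx < NUM_SCALES:
        raise ValueError("scale index out of range")


def scale_config(scale_idx: int) -> Dict[str, int]:
    """Stride and channel number of one scale; both double at each coarser scale."""
    _check_scale(scale_idx)
    factor = 2 ** scale_idx
    return {"stride": BASE_STRIDE * factor, "channels": BASE_CHANNELS * factor}


def candidate_paths(scale_idx: int) -> List[str]:
    """Output paths available to a node; boundary scales lose one direction."""
    _check_scale(scale_idx)
    paths = []
    if scale_idx > 0:
        paths.append("up")    # towards the next higher resolution
    paths.append("keep")
    if scale_idx < NUM_SCALES - 1:
        paths.append("down")  # towards the next lower resolution
    return paths
```

For instance, `candidate_paths(0)` omits the resolution-up path because the 1/8 scale is the highest resolution kept by the supernet.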
Next, we introduce the router in each computation node. As shown in Fig. 4, each computation node contains two branches, i.e., a router and a convolutional block. They take the same feature map with the shape of $(N, C, H, W)$ as the input, where $N$, $C$, $H$, and $W$ represent the batch size, the channel number, the height, and the width, respectively. A router contains a series of Pooling-Conv operations as illustrated in Fig. 4, and outputs a batch of 3-dimensional gate values $g$, where each element in $g$ represents the gate of the corresponding candidate output path, i.e., resolution-up, resolution-keep, or resolution-down. Following Li et al., we allow multi-path propagation, i.e., there can be multiple open gates simultaneously. During training, the values in $g$ are continuous for easier back-propagation. In the inference phase, we binarize $g$ by comparing the values with a constant threshold $\tau$. If a gate value does not pass the threshold, the corresponding path is dropped. If all the paths in the current computation node are dropped, the current convolutional block is also dropped. The routers generate different gates for different input samples, which brings about various model architectures. More details of the network and routers are described in Sec. 3.4.
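The inference-time gating rule can be sketched as follows. This is a minimal illustration under our own assumptions: the threshold value `TAU` is a placeholder (the text only states that it is a constant), a gate is treated as open only when it is strictly above the threshold, and the function names are hypothetical.

```python
from typing import Dict, Sequence

PATHS = ("up", "keep", "down")
TAU = 1e-4  # placeholder threshold; the actual constant is not specified here


def binarize_gates(gates: Sequence[float], tau: float = TAU) -> Dict[str, bool]:
    """Turn the continuous gates of one node (one sample) into open/closed paths."""
    if len(gates) != len(PATHS):
        raise ValueError("expected one gate per candidate path")
    # Assumption: a path stays open only if its gate is strictly above tau.
    return {name: g > tau for name, g in zip(PATHS, gates)}


def node_is_active(open_paths: Dict[str, bool]) -> bool:
    """The convolutional block is executed only if at least one path is open."""
    return any(open_paths.values())
```

Since several gates may be open at once, the same node can feed up to three nodes of the next layer; a node whose gates are all closed is skipped entirely.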
3.2 Dynamic Global Budget Constraint
Many NAS and dynamic routing methods add a budget constraint term in the loss function during the search/routing procedure. We also introduce the budget constraint in CADDet. First, we measure the cost (MAdds) of the $i$-th computation node, $C^i$, as Eq. 1,
where $g^i$ is the gate of the $i$-th node, and the constant terms represent the MAdds of the current convolution operation and the resolution change operations, respectively. Let $n$ denote the total number of the nodes. Then, the computational cost of the current architecture, $C$, can be represented by Eq. 2.
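The sketch below illustrates one plausible gate-weighted form of the node cost and its aggregation over all nodes. It is an assumption for illustration, not the definition in Eq. 1: we assume the convolutional block is charged in proportion to the largest gate of the node (so a node with all gates closed costs nothing) and each resolution change is charged in proportion to its own gate. All function and argument names are hypothetical.

```python
from typing import Sequence


def node_cost(gates: Sequence[float], conv_madds: float,
              path_madds: Sequence[float]) -> float:
    """Assumed gate-weighted MAdds of one computation node."""
    if not gates or len(gates) != len(path_madds):
        raise ValueError("one cost constant is needed per gate")
    # The block is needed as soon as any path is open: weight it by the largest gate.
    cost = max(gates) * conv_madds
    # Each output path (up / keep / down) contributes according to its own gate.
    cost += sum(g * c for g, c in zip(gates, path_madds))
    return cost


def total_cost(node_costs: Sequence[float]) -> float:
    """Cost of the current architecture: the sum over all n nodes."""
    return float(sum(node_costs))
```

With continuous gates during training, this quantity is differentiable with respect to the gates almost everywhere, which is what allows a budget term to be added to the loss.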
With the total cost $C$, previous dynamic routing methods either directly minimize $C$ or add an L2 loss to minimize the gap between $C$ and an expected MAdds number. We claim that it is sub-optimal to use a fixed budget constraint for all the inputs, since the sample complexity varies a lot and different inputs may require different computational resources. Therefore, in CADDet, we take a further step by measuring the complexity of different samples and assigning dynamic budget constraints accordingly. We introduce a metric to measure the image complexity tailored for object detection, as described in the following part.
The scale variance is one of the main challenges in object detection. In CADDet, we use the quantized scale distribution to formulate the complexity of each input image. As shown in Fig. 3, we divide the scale space into $m$ intervals. Then, we collect the shapes of the ground-truth bounding boxes in a training image and generate an $m$-dimensional scale-encoding vector $v$. Each element in $v$ takes a binary value, and the $k$-th element indicates whether there exist object(s) whose shape falls in the $k$-th scale interval. Finally, we map the scale vector to a dynamic computation budget. For simplicity, we use the mapping function as Eq. 3,
where a constant indicates the base MAdds. After that, we apply an L2 loss shown in Eq. 4 as the dynamic global budget constraint.
Eq. 3 and Eq. 4 imply that an image with single-scale object(s) is expected to have a light-weight architecture while the budget for a complex sample is relaxed. Our dynamic global budget fully utilizes the scale property as prior knowledge, and the experiments in Sec. 4 demonstrate that the design brings about not only more diverse architectures but also lower computational cost and higher accuracy.
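A minimal sketch of the dynamic budget is given below, under stated assumptions rather than as the exact formulation: we assume the budget grows linearly with the number of occupied scale intervals, and we normalize the squared gap by the supernet MAdds so that the loss is scale-free. The names `dynamic_budget` and `budget_loss` and the normalization are hypothetical.

```python
from typing import Sequence


def dynamic_budget(scale_vec: Sequence[int], base_madds: float) -> float:
    """Assumed mapping: the budget is proportional to the number of occupied scale bins."""
    if any(v not in (0, 1) for v in scale_vec):
        raise ValueError("the scale encoding must be binary")
    return base_madds * sum(scale_vec)


def budget_loss(cost: float, budget: float, total_madds: float) -> float:
    """Squared gap between the actual and expected cost, normalized by the supernet cost."""
    if total_madds <= 0:
        raise ValueError("total_madds must be positive")
    return ((cost - budget) / total_madds) ** 2
```

Under this assumed mapping, a single-scale image receives the base budget, while an image covering three scale intervals receives three times as much, which matches the qualitative behavior described above.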
3.3 Local Path Similarity Regularization
Although the dynamic budget constraint can improve the diversity of routes, it lacks the supervision of the local structure. For example, the budget constraints for the first two samples in Fig. 3 are the same since they both contain only one scale of object(s). However, the scales of these two samples are different. Thus, the corresponding paths should differ from each other to some degree. Motivated by this, we introduce the local path similarity regularization. For a batch of training samples, we encourage the router to generate similar architectures for images with similar scale distributions. Conversely, we push apart the routes of samples with different scale encoding vectors.
The process of local path similarity regularization is shown in Fig. 5. Suppose the batch size is $N$. For each sample in the batch, we first compute the scale encoding vector as introduced in Sec. 3.2. Then, for each pair of samples in the batch, we use the element-wise XNOR operation to compute the scale similarity as Eq. 5.
Next, for each input sample, we flatten the gates of all the computation nodes to form a corresponding route vector. For each pair, we represent the path similarity as Eq. 6.
Finally, we compute the path similarity loss by Eq. 7.
The target similarity in Eq. 7 is linearly transformed from the scale similarity as Eq. 8, which changes the ground-truth similarity from the range of [0, 1] to [Min, Max] ($0 \le \mathrm{Min} \le \mathrm{Max} \le 1$).
Such a transformation allows a part of the architecture to be generic for all the samples.
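The pairwise regularization can be sketched as below. Only the XNOR-based scale similarity and the linear map to [Min, Max] follow the text; the averaging of the XNOR result, the path similarity measure (one minus the mean absolute gate difference), and the squared-error loss are our assumptions, and all function names are hypothetical. The default bounds 0.6 and 0.95 are the values given in Sec. 3.4.

```python
from itertools import combinations
from typing import Sequence


def scale_similarity(u: Sequence[int], v: Sequence[int]) -> float:
    """Mean of the element-wise XNOR of two binary scale vectors, in [0, 1]."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("scale vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)


def rescale(sim: float, lo: float = 0.6, hi: float = 0.95) -> float:
    """Linear map of a similarity from [0, 1] to [lo, hi]."""
    return lo + (hi - lo) * sim


def path_similarity(r1: Sequence[float], r2: Sequence[float]) -> float:
    """Assumed measure: one minus the mean absolute difference of the route vectors."""
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("route vectors must be non-empty and of equal length")
    return 1.0 - sum(abs(a - b) for a, b in zip(r1, r2)) / len(r1)


def path_similarity_loss(scale_vecs, routes, lo: float = 0.6, hi: float = 0.95) -> float:
    """Assumed loss: mean squared gap between path and target similarity over all pairs."""
    pairs = list(combinations(range(len(routes)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        target = rescale(scale_similarity(scale_vecs[i], scale_vecs[j]), lo, hi)
        total += (path_similarity(routes[i], routes[j]) - target) ** 2
    return total / len(pairs)
```

Because the target never reaches 1 when the upper bound is below 1, even two images with identical scale encodings are not forced onto exactly the same route; likewise, a lower bound above 0 keeps part of the architecture shared across dissimilar images.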
|Budget Constraint||mAP||Avg MAdds (M)||Max MAdds (M)||Min MAdds (M)||MAdds std|
Comparison of different budget constraints. Results are evaluated on the COCO validation set. MAdds are measured with the input resolution of (640, 800). “MAdds std” is the standard deviation term indicating the diversity of the generated routes. Note that the “Fixed” strategy uses a fixed budget while the “Loss-Aware” and the proposed strategies use a dynamic budget. Thus, we take different values of the ratio term for a fair comparison.
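The per-image statistics reported in this comparison can be computed as in the following sketch. The function name is hypothetical, and the use of the population standard deviation is our assumption, since the estimator is not specified.

```python
import statistics
from typing import Dict, Sequence


def madds_summary(per_image_madds: Sequence[float]) -> Dict[str, float]:
    """Average, extreme values, and spread of the MAdds over the evaluated images."""
    if not per_image_madds:
        raise ValueError("at least one image is required")
    return {
        "avg": statistics.mean(per_image_madds),
        "max": max(per_image_madds),
        "min": min(per_image_madds),
        # Assumption: population standard deviation over all evaluated images.
        "std": statistics.pstdev(per_image_madds),
    }
```

A static detector would yield a standard deviation of zero at a fixed input resolution, so a larger value indicates that the routers produce more diverse architectures.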
3.4 Architecture Details
In this section, we introduce the implementation details of CADDet. The “stem” block contains three stride-2 3x3 depth-wise separable convolutions, which downsample the resolution to 1/8 (the scale of C3 in object detection). Following Li et al., we set the layer number of the supernet to 16 by default. The four scales of feature maps have channel numbers of 64, 128, 256, and 512, respectively.
In each computation node, the input feature map is the summation of the outputs from the nodes of the previous layer at the same scale and at the adjacent scales. As shown in Fig. 4, the router first utilizes an average pooling operation, which downsamples the feature map to 2x2. Then, a stride-1 1x1 convolution, a global average pooling (GAP) layer and a fully-convolutional layer are cascaded. The router finally outputs a tensor with the shape of $(N, 3)$, where $N$ is the batch size, representing the gates for the current samples. The threshold used for gate discretization is a constant hyper-parameter.
As for the convolutional block, we apply the commonly used 3x3 SepConv for higher efficiency. The resolution-up operation is a stride-1 1x1 convolution with bilinear interpolation. The resolution-down operation is a stride-2 1x1 convolution. The 1x1 convolutions are applied to align the channel numbers between adjacent scales.
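To illustrate why SepConv is preferred for efficiency, the sketch below counts multiply-adds for a standard convolution and for a depth-wise separable one using the usual formulas (bias and normalization layers are ignored). The example shape in the usage note, an 80x100 feature map with 64 channels, is our own choice: it corresponds to a (640, 800) input at the 1/8 scale.

```python
def conv_madds(h_out: int, w_out: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Multiply-adds of a standard k x k convolution."""
    return h_out * w_out * c_in * c_out * k * k


def sepconv_madds(h_out: int, w_out: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Multiply-adds of a depth-wise k x k convolution followed by a 1x1 point-wise one."""
    depthwise = h_out * w_out * c_in * k * k
    pointwise = h_out * w_out * c_in * c_out
    return depthwise + pointwise
```

For the example shape, `conv_madds(80, 100, 64, 64)` gives 294,912,000 multiply-adds, whereas `sepconv_madds(80, 100, 64, 64)` gives 37,376,000, i.e., roughly 7.9 times fewer.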
The supernet is connected with the detection head (i.e., the FCOS head) without an additional neck. Specifically, the feature maps from C3 to C6 are directly obtained by projecting the 4 output feature maps to 256-d. C7 is obtained by adding a convolution layer on C6.
Scale encoding is the key procedure for both the global budget constraint and the local path similarity regularization. We analyze the scale distributions of samples on MS-COCO and quantize the scales into 4 intervals: [0, 64], (64, 150], (150, 360], and (360, $\infty$). During the scale encoding process, the scale of each object is determined by its longer side, i.e., $\max(w, h)$, where $w$ and $h$ are the width and height of its bounding box. The base MAdds is set to a fixed ratio of the total MAdds of the supernet. For the local path similarity regularization, we set Min and Max to 0.6 and 0.95, respectively.
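The scale encoding with the four intervals above can be sketched as follows. The function name and the `(width, height)` box format are assumptions made for illustration; the interval edges and the longer-side rule are taken from the text.

```python
import bisect
from typing import Iterable, List, Tuple

# Upper edges of the first three intervals: [0, 64], (64, 150], (150, 360], (360, inf)
SCALE_EDGES = (64, 150, 360)


def encode_scales(boxes: Iterable[Tuple[float, float]]) -> List[int]:
    """Binary scale-encoding vector of an image from its (width, height) boxes."""
    vec = [0] * (len(SCALE_EDGES) + 1)
    for w, h in boxes:
        longer = max(w, h)  # the scale of an object is its longer side
        # bisect_left sends a value equal to an edge to the lower interval,
        # which matches the right-closed intervals.
        vec[bisect.bisect_left(SCALE_EDGES, longer)] = 1
    return vec
```

For example, an image with one 65x10 box and one 400x20 box is encoded as `[0, 1, 0, 1]`, while an image without annotated objects yields the all-zero vector.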
Putting all the loss terms together, we can obtain the overall objective function as Eq. 9,
where the detection loss includes the losses for classification and localization, and two hyper-parameters balance the respective regularization losses, each taking a default value.
4 Experiments
4.1 Experimental Setup
In this section, we introduce the implementation details of the training process. We first pre-train our supernet on the ImageNet dataset (ILSVRC2012). We use the SGD optimizer with 0.2 as the initial learning rate. The batch size is set to 256 and the model is trained for 120 epochs. During pre-training, we remove the routers and all the gate values are set to 1.
Then, we conduct the main experiments on the MS-COCO dataset. Following the common practice, we use the COCO trainval35k split (115K images) for training and the minival split (5K images) for validation. We also report the main results on the test-dev split (20K images). We train CADDet with a batch size of 16. Unless specified, the model is trained for 12 epochs and the learning rate is divided by 10 at two later epochs. Besides, neither the global budget constraint nor the local path similarity regularization is applied during the first epoch to avoid model degeneration. In our ablation study, we resize the images such that the shorter side is 600 and the longer side is less than or equal to 1000. In the main results, the images are resized so that the shorter side is 800 and the longer side is less than or equal to 1333. By default, we use the FCOS head as the detection head. All of our experiments are conducted using mmdetection 2.1.
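The resizing rule above is the usual aspect-ratio-preserving resize. A minimal sketch follows; the function name and the round-to-nearest rule are our assumptions and are not taken from the toolbox's code.

```python
from typing import Tuple


def resized_shape(height: int, width: int,
                  short_side: int, max_long_side: int) -> Tuple[int, int]:
    """Target (height, width) keeping the aspect ratio under both side limits."""
    if height <= 0 or width <= 0:
        raise ValueError("image sides must be positive")
    # Scale the shorter side to `short_side` unless the longer side would
    # then exceed `max_long_side`; in that case the longer side is binding.
    scale = min(short_side / min(height, width),
                max_long_side / max(height, width))
    return int(height * scale + 0.5), int(width * scale + 0.5)
```

For a 480x640 image under the main setting, `resized_shape(480, 640, 800, 1333)` gives (800, 1067); for an elongated 400x1000 image the longer side is the binding limit and the result is (533, 1333).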
4.2 Ablation Study
4.2.1 Different Global Budget Constraints
In this section, we study the effectiveness of different budget constraints. We apply the following 3 different kinds of budget constraints.
i). Fixed budget constraint. All the samples are assigned with the same budget constraint, which is commonly used in other NAS and other dynamic routing methods. This method serves as the baseline global budget constraint.
ii). Loss-aware budget constraint. During training, we create a buffer to hold the detection losses of the 100 most recent input samples. Next, for the current training sample, we first compute the detection loss and find its rank in the buffer. Then, we linearly transform the rank to an expected budget within a predefined range.
iii). The proposed dynamic global budget constraint, which utilizes the scale property as the prior information to measure the sample complexity.
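The loss-aware baseline in item ii) can be sketched as below. The class name, the budget range arguments, the behavior on an empty buffer, and the direction of the mapping (a larger loss receives a larger budget) are assumptions made for illustration.

```python
from collections import deque


class LossAwareBudget:
    """Maps the rank of a sample's detection loss among recent losses to a budget."""

    def __init__(self, min_budget: float, max_budget: float, buffer_size: int = 100):
        if min_budget > max_budget:
            raise ValueError("min_budget must not exceed max_budget")
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.losses = deque(maxlen=buffer_size)  # the oldest loss is evicted first

    def __call__(self, loss: float) -> float:
        if self.losses:
            # Normalized rank: fraction of buffered losses smaller than the current one.
            rank = sum(1 for past in self.losses if past < loss) / len(self.losses)
        else:
            rank = 0.5  # assumption: mid-range budget before any history exists
        self.losses.append(loss)
        # Assumption: harder samples (higher loss) are granted a larger budget.
        return self.min_budget + (self.max_budget - self.min_budget) * rank
```

Unlike the scale-based budget, this signal depends on the current state of the model, so the budget assigned to the same image may drift during training.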
The quantitative results are shown in Table 1. According to the results, all these budget constraints help to generate different models for different inputs. Compared with the fixed budget constraint, the loss-aware and the proposed constraints can generate more efficient architectures, i.e., these approaches can achieve similar accuracy with relatively lower cost. Compared with the loss-aware approach, the proposed scale-aware strategy can generate more diverse routes. The high diversity also brings about 1.5 to 2.0 p.p. higher mAP with lower average MAdds. These results demonstrate that scale encoding is a proper measure of sample complexity for object detection.
4.2.2 Using the Local Path Similarity Regularization
In this section, we study the effectiveness of the proposed local path similarity regularization. Quantitative results are listed in Table 2. We find that adding the local constraint can further improve the diversity of routes. We also study the effect of the hyper-parameters Min and Max. As introduced in Sec. 3.3, these two parameters are designed to control the lower and upper bounds of the ground truth for the path similarity. According to the results in Table 2, a small lower bound results in worse accuracy, while a larger value leads to higher computational cost. This implies that it is better to let the routers generate a generic architecture while leaving them a certain degree of freedom. The comparison of routes generated by different strategies can be found in Fig. 7. The analysis can be found in Sec. 4.3.
|Model structure||mAP||MAdds (M)|
4.2.3 Using FPNLite
In Sec. 3.1, we claim that the supernet itself can extract multi-scale features; thus, CADDet no longer requires an additional neck to generate multi-scale feature maps. In this section, we study the ability of multi-scale feature extraction of the supernet. We train the following two types of detectors: i) supernet+FPNLite+head, and ii) supernet+head. From Table 3, we find that the capacity of our supernet is high enough to represent multi-scale features. With lower computational cost, our supernet alone can achieve comparable accuracy. The routes are illustrated in Fig. 6. It is interesting that, without the explicit neck, the routers are able to learn the “U-shape” cross-scale feature aggregation operations.
|Backbone and Neck||Head||Eval Set||mAP||Avg MAdds (G)||Max MAdds (G)||Min MAdds (G)|
4.3 Qualitative Results
In this section, we visualize the model architectures generated by different routing strategies. Results are illustrated in Fig. 7. In this part, we compute MAdds under the resolution of (800, 1200). We have the following observations.
i). Both the global budget constraint and the local path similarity regularization can improve the diversity of routes. The main difference is that the local path similarity regularization focuses more on the detailed parts rather than the body architecture.
ii). Universal body part. Our CADDet tends to generate a generic architecture as the body part and adjusts some local parts according to different inputs. The body part of most routes is a “U-shape” structure that first down-samples the feature maps in the “head” part and then up-samples in the “tail” part. This pattern is similar to the hand-crafted encoder-decoder structure.
iii). Diversity of CADDet. The computation nodes in the mid-layers are more diverse compared with the head-layers and tail-layers. For inputs with small objects, the number of valid computation nodes in the high-resolution feature maps is larger.
4.4 Main Results
Finally, we report the results on COCO test-dev. For a fair comparison, we compare CADDet with other light-weight backbones, e.g., MobileNet and MobileNetV2, which utilize SepConv as the building block. Results demonstrate that our CADDet can achieve competitive accuracy with much lower computational cost.
5 Conclusion
In this paper, we propose to improve the efficiency of detectors by learning to generate content-aware dynamic detectors (CADDet). Specifically, we propose a scale encoding method to i) measure the global sample complexity and assign the corresponding budget constraint, and ii) regularize the local similarity of the paths between different samples. Experimental results demonstrate CADDet can significantly improve the model diversity and save the computational cost. In the future, we aim to learn high-performance dynamic detectors by utilizing other building blocks. It is also interesting to generate dynamic heads, where the detection head is also involved in the dynamic framework.
References
-  Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
-  Bo Chen, Golnaz Ghiasi, Hanxiao Liu, Tsung-Yi Lin, Dmitry Kalenichenko, Hartwig Adam, and Quoc V Le. Mnasfpn: Learning latency-aware pyramid architecture for object detection on mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13607–13616, 2020.
-  Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
-  Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. Detnas: Backbone search for object detection. In Advances in Neural Information Processing Systems, pages 6642–6652, 2019.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
-  Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11592–11601, 2020.
-  Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7036–7045, 2019.
-  Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. Hit-detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11405–11414, 2020.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  Gao Huang and Danlu Chen. Multi-scale dense networks for resource efficient image classification. ICLR 2018, 2018.
-  Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Sp-nas: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11863–11872, 2020.
-  Artur Jordao, Maiko Lie, and William Robson Schwartz. Discriminative layer pruning for convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing, 2020.
-  Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and Gao Huang. Improved techniques for training adaptive deep networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1891–1900, 2019.
-  Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8553–8562, 2020.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 82–92, 2019.
-  Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8759–8768, 2018.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  Michael Mathieu, Yann LeCun, Rob Fergus, David Eigen, Pierre Sermanet, and Xiang Zhang. Overfeat: Integrated recognition, localization and detection using convolutional networks. 2013.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
-  Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 821–830, 2019.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
-  Zhuo Su, Linpu Fang, Wenxiong Kang, Dewen Hu, Matti Pietikäinen, and Li Liu. Dynamic group convolution for accelerating convolutional neural networks. arXiv preprint arXiv:2007.04242, 2020.
-  Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5693–5703, 2019.
-  Shuyang Sun, Jiangmiao Pang, Jianping Shi, Shuai Yi, and Wanli Ouyang. Fishnet: A versatile backbone for image, region, and pixel level prediction. In Advances in neural information processing systems, pages 754–764, 2018.
-  Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
-  Ravi Teja Mullapudi, William R Mark, Noam Shazeer, and Kayvon Fatahalian. Hydranets: Specialized dynamic architectures for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8080–8089, 2018.
-  Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE international conference on computer vision, pages 9627–9636, 2019.
-  Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.
-  Huanyu Wang, Zequn Qin, and Xi Li. Dynamic routing with path diversity and consistency for compact network learning. arXiv preprint arXiv:2005.14439, 2020.
-  Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, and Yanning Zhang. Nas-fcos: Fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11943–11951, 2020.
-  Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409–424, 2018.
-  Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817–8826, 2018.
-  Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 6649–6658, 2019.
-  Taojiannan Yang, Sijie Zhu, Chen Chen, Shen Yan, Mi Zhang, and Andrew Willis. Mutualnet: Adaptive convnet via mutual learning from network width and resolution. In European Conference on Computer Vision (ECCV), 2020.
-  Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.