Object Detection from Scratch with Deep Supervision

09/25/2018 ∙ by Zhiqiang Shen, et al. ∙ 4

We propose Deeply Supervised Object Detectors (DSOD), an object detection framework that can be trained from scratch. Recent advances in object detection depend heavily on off-the-shelf models pre-trained on large-scale classification datasets such as ImageNet and OpenImage. However, adopting pre-trained models from classification for the detection task may incur learning bias, because the objective functions differ and the distributions of object categories are diverse. Techniques such as fine-tuning on the detection task can alleviate this issue to some extent but do not address it fundamentally. Furthermore, transferring these pre-trained models across discrepant domains (e.g., from RGB to depth images) is even more difficult. A better solution to these problems is therefore to train object detectors from scratch, which motivates our proposed method. Previous efforts in this direction mainly failed because of limited training data and naive backbone network structures for object detection. In DSOD, we contribute a set of design principles for learning object detectors from scratch. One of the key principles is deep supervision, enabled by layer-wise dense connections in both the backbone network and the prediction layers, which plays a critical role in learning good detectors from scratch. After incorporating several other principles, we build DSOD on the single-shot detection framework (SSD). We evaluate our method on the PASCAL VOC 2007, 2012 and COCO datasets. DSOD achieves consistently better results than state-of-the-art methods with much more compact models. Specifically, DSOD outperforms the baseline method SSD on all three benchmarks while requiring only 1/2 of the parameters. We also observe that DSOD achieves comparable or slightly better results than Mask RCNN + FPN (under similar input size) with only 1/3 of the parameters, using no extra data or pre-trained models.




1 Introduction

Generic object detection is the task of automatically localizing various objects in a natural image. This task has been heavily studied due to its wide applications in surveillance, autonomous driving, intelligent security, etc. In recent years, as more and more innovative and powerful object detection systems based on Convolutional Neural Networks (CNNs) have been proposed, object detection has become one of the fastest moving areas in computer vision.

To achieve the desired performance, the common practice in advanced object detection systems is to fine-tune models pre-trained on ImageNet [3]. This fine-tuning process can be viewed as transfer learning [4, 5]. Specifically, as shown in Fig. 1, researchers usually first train CNN models on large-scale classification datasets such as ImageNet [3], then fine-tune the models on target tasks, such as object detection [6, 7, 8, 9, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], image segmentation [23, 24, 25, 26], fine-grained recognition [27, 28, 29, 30], captioning [31, 32, 33, 34, 35, 36], etc. Learning from scratch means that we train models directly on these target tasks without involving any additional data or extra fine-tuning processes. Empirically, fine-tuning from pre-trained models has at least two advantages. First, numerous state-of-the-art pre-trained CNN models are publicly available, and it is convenient for researchers to reuse the learned parameters in their own domain-specific tasks. Second, fine-tuning from pre-trained models converges quickly to a final state and requires less instance-level annotated training data than the basic classification task.

Fig. 1: Illustration of Training Models from Scratch. The black dashed box (left) denotes that we pre-train models on a large-scale classification dataset such as ImageNet [3]. The red dashed box (right) denotes that we train models on the target dataset directly. In this paper, we focus on the object detection task without using pre-trained models.

However, adopting pre-trained models for object detection also has obvious and critical limitations: (I) Limited design space on network structures. Existing object detectors directly adopt the pre-trained networks, and as a consequence there is little flexibility to control or adjust the detailed network structures, even for small changes of network design. Furthermore, the pre-trained models mostly come from large-scale classification tasks; they are usually very heavy (containing a huge number of parameters) and are not suitable for some specific scenarios. Such heavy network structures also place heavy demands on computing resources. (II) Learning/optimization bias. Since both the objective functions and the category distributions differ between classification and detection tasks, these differences may lead to different searching/optimization spaces. Therefore, when all parameters are initialized from classification pre-trained models, learning may be biased towards a local minimum that is not the best for the target detection task. (III) Domain mismatch. As is well known, fine-tuning can mitigate the gap between different target category distributions. However, it remains a severe problem when the source domain (e.g., ImageNet) has a huge mismatch with the target domain, such as depth images, medical images, etc. [37].

Therefore, our work is motivated by the following two questions. First, is it possible to train object detection networks from scratch directly, without pre-trained models? Second, if the answer to the first question is positive, are there any principles for designing a resource-efficient network structure for object detection that still keeps high detection accuracy? To meet this goal, we propose deeply supervised object detectors (DSOD), a simple yet efficient framework that can learn object detectors from scratch. DSOD is fairly flexible: we can tailor various network structures for different computing platforms such as servers, desktops, mobile devices and even embedded devices.

We contribute a set of principles for designing DSOD. One key point is the deeply supervised structure, which is motivated by the recent work of [38, 39]. In [39], Xie et al. proposed a holistically-nested structure for edge detection, which included the side-output layers in each conv-stage of base network for explicit deep supervision. Instead of using the multiple cut-in loss signals with side-output layers, our method adopts deep supervision implicitly through the layer-wise dense connections proposed in DenseNet [40]. Dense structures are not only adopted in the backbone sub-network, but also used in the front-end multi-scale prediction layers. Fig. 2 illustrates the structure comparison in front-end prediction layers between baseline SSD and our DSOD. The fusion and reuse of multi-resolution prediction-maps help keep or even improve the final accuracy, while reducing model parameters to some extent. As shown in Fig. 3, we further adopted dense connections between different blocks to enhance the deeply supervised signals during network training.

Furthermore, we revisited the pre-activation (BN-ReLU-Conv) order of backbone networks for our DSOD framework. We observe that the post-activation (Conv-BN-ReLU) order obtains about 0.6% mAP improvement on VOC 07 while requiring slightly fewer parameters than the original order in DSOD. To further strengthen deep supervision when training from scratch, especially for plain backbones such as VGGNet, we also propose a complementary structure named the deep-scale supervision (DSS) module, which forms DSOD v2. More details are given in the following sections. We summarize the main contributions of this paper as follows:

  • To the best of our knowledge, DSOD is the first framework that can train object detectors from scratch with promising performance.

  • We introduce and validate a set of principles to design efficient object detection networks from scratch through step-by-step ablation studies.

  • We show that DSOD achieves performance comparable with the state of the art on three standard benchmarks (PASCAL VOC 2007, 2012 and MS COCO datasets), while offering real-time processing speed and more compact models.

A preliminary version of this manuscript [41] was published at a previous conference. In this version, we made some design changes to the backbone network (e.g., replacing the pre-activation BN-ReLU-Conv order with the post-activation Conv-BN-ReLU order) and included a new module (named deep-scale supervision) to improve DSOD (Section 3.2). We also included more details, analysis and extra comparison experiments with state-of-the-art two-stage detectors such as FPN and Mask RCNN, together with the factors involved in training them from scratch (Sections 4.8, 4.9 and 5). The proposed DSOD framework has also been adopted and generalized to further improve performance under the setting of learning object detectors from scratch, e.g., GRP-DSOD [42], Tiny-DSOD [43], etc.

Layers | Output Size (Input 3×300×300) | DSOD
Stem: Convolution | 64×150×150 | 3×3 conv, stride 2
Stem: Convolution | 64×150×150 | 3×3 conv, stride 1
Stem: Convolution | 128×150×150 | 3×3 conv, stride 1
Stem: Pooling | 128×75×75 | 2×2 max pool, stride 2
Dense Block | |
Transition Layer | 416×75×75 | 1×1 conv
Transition Layer | 416×38×38 | 2×2 max pool, stride 2
Dense Block | |
Transition Layer | 800×38×38 | 1×1 conv
Transition Layer | 800×19×19 | 2×2 max pool, stride 2
Dense Block | |
Transition w/o Pooling Layer (1) | 1184×19×19 | 1×1 conv
Dense Block | |
Transition w/o Pooling Layer (2) | 1568×19×19 | 1×1 conv
DSOD Prediction Layers | | Plain/Dense
TABLE I: DSOD architecture (growth rate = 48 in each dense block).
Fig. 2: DSOD prediction layers with plain and dense structures (for 300×300 input). The plain structure is introduced by SSD [10] and the dense structure is ours. See Section 3 for more details.

2 Related Work

Object Detection. Modern CNN-based object detectors can mainly be divided into two groups: (i) proposal-based/two-stage methods; and (ii) proposal-free/one-stage methods.

The proposal-based family includes R-CNN [6], Fast R-CNN [7], Faster R-CNN [8], R-FCN [9] and Mask RCNN [1]. R-CNN uses selective search [44] to first generate potential object regions in an image and then performs classification on the proposed regions. R-CNN incurs high computational costs since each region is processed by the CNN separately. Fast R-CNN improves efficiency by sharing the computation of the backbone network, and Faster R-CNN uses a neural network (i.e., RPN) to generate the region proposals. R-FCN further improves speed and accuracy by removing fully-connected layers and adopting position-sensitive score maps for final detection.

Recently, in order to realize real-time object detection, proposal-free methods such as YOLO [11] and SSD [10] have been proposed. YOLO uses a single feed-forward convolutional network to predict object classes and locations directly; it no longer requires a second per-region classification operation and is therefore extremely fast. SSD further improves on YOLO in several aspects, including (1) using small convolutional filters to predict categories and anchor offsets for bounding box locations; (2) using pyramid features for prediction at different feature scales; and (3) using default boxes and aspect ratios to adjust to varying object shapes. Some other proposal-free detectors have also been proposed recently, e.g., RetinaNet [12], Scale-Transferrable [45], Single-shot Refinement [46], RFB Net [47], CornerNet [48], ExtremeNet [49], etc. Our proposed DSOD is built upon the SSD framework and thus inherits the speed and accuracy advantages of SSD, while producing more compact and flexible models.

Network Architectures for Detection. Since significant efforts have been devoted to designing network architectures for image classification, many diverse and powerful networks have emerged, such as AlexNet [50], VGGNet [51], GoogLeNet [52], ResNet [53], DenseNet [40], etc. Meanwhile, several advanced regularization techniques [54, 55] have also been proposed to further enhance model capabilities. In practice, most detection methods [6, 7, 8, 10] directly use these structures, pre-trained on ImageNet, as the backbone network for the detection task.

Some other works try to design specific backbone network structures for object detection, but still require pre-training on the ImageNet classification dataset in advance. Specifically, YOLO [11] defines a network with 24 convolutional layers followed by 2 fully-connected layers. YOLO9000 [56] improves YOLO by proposing a new network named Darknet-19, which is a simplified version of VGGNet [51]. YOLOv3 [57] further improves the performance by adding residual connections to Darknet-19, along with other techniques. Kim et al. [58] propose PVANet for fast object detection, which consists of the simplified “Inception” block from GoogLeNet. Huang et al. [59] investigated various combinations of network structures and detection frameworks, and found that Faster R-CNN [8] with Inception-ResNet-v2 [60] achieved very promising performance. In this paper, we also consider designing a suitable backbone structure for generic object detection. However, the pre-training operation on ImageNet is no longer required by the proposed DSOD.

Learning Deep Models from Scratch. To the best of our knowledge, there are no previous works that train deep CNN-based object detectors from scratch. Thus, our proposed approach has very appealing advantages over existing solutions. We will elaborate and validate the method in the following sections. In semantic segmentation, Jégou et al. [61] demonstrated that a well-designed network structure can outperform state-of-the-art solutions without using the pre-trained models. It extends DenseNets to fully-convolutional networks by adding an upsampling path to recover the full input resolution.

3 DSOD

In this section, we first introduce the whole framework of our DSOD architecture, followed by several important design principles. Then we describe the objective function and training settings in detail.

3.1 Network Architecture

Similar to SSD [10], our proposed DSOD method is a multi-scale and proposal-free detection framework. The network structure of DSOD can be divided into two parts: the backbone sub-network for feature extraction and the front-end sub-network for prediction over multi-resolution feature maps. The backbone sub-network is a variant of the deeply supervised DenseNets [40] structure, which is composed of a stem block, four dense blocks, two transition layers and two transition w/o pooling layers. The front-end sub-network (also called the DSOD prediction layers) fuses multi-scale prediction responses with an elaborated dense structure. Fig. 2 illustrates the proposed DSOD prediction layers along with the plain structure used in SSD [10]. The full DSOD network architecture is detailed in Tab. I (a visualization of the complete network structure is available at: http://ethereon.github.io/netscope/#/gist/b17d01f3131e2a60f9057b5d3eb9e04d). We now elaborate on each component and the corresponding design principle.
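The channel counts and resolutions listed in Tab. I can be checked with a few lines of arithmetic. The sketch below assumes that every dense layer adds exactly the growth rate (48) in channels, that the transition layers keep the channel count unchanged, and that pooling rounds the output size up. The function names are hypothetical, and the layers-per-block values it prints are inferred from the table, not stated in the text.

```python
import math

def layers_in_block(in_channels, out_channels, growth_rate=48):
    """Number of dense layers implied by the channel growth of one block."""
    added = out_channels - in_channels
    if added % growth_rate != 0:
        raise ValueError("channel growth is not a multiple of the growth rate")
    return added // growth_rate

def pooled_size(size, stride=2):
    """Spatial size after strided pooling, assuming ceiling rounding."""
    return math.ceil(size / stride)

# Channel counts after the stem and after each dense block, read from Tab. I.
channels = [128, 416, 800, 1184, 1568]
block_depths = [layers_in_block(a, b) for a, b in zip(channels, channels[1:])]
print(block_depths)

# Resolutions after the two pooling transition layers (75 -> 38 -> 19 in Tab. I).
print(pooled_size(75), pooled_size(38))
```

Under these assumptions, the 75 to 38 step in Tab. I is only reproduced with ceiling rounding; floor rounding would give 37.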

3.2 Design Principles

Principle 1: Proposal-free. In order to reveal the potential influences in learning object detection from scratch, we investigated all the state-of-the-art CNN-based object detectors under their default settings. As mentioned above, R-CNN and Fast R-CNN require external object proposal generators such as selective search. Faster R-CNN and R-FCN require an integrated region proposal network (RPN) to generate relatively fewer region proposals. YOLO and SSD are single-shot, proposal-free (one-stage) methods, which handle object location and bounding box coordinates as a regression problem. We observe that only proposal-free methods (one-stage detectors) can converge successfully without pre-trained models if we follow the original settings without significant modifications (e.g., replacing RoI pooling with RoI align [1], or adopting Sync BN [62] or Group Norm [63] to mitigate the small batch-size issue). We conjecture this is due to the RoI (Region of Interest) pooling in the other two categories of methods: RoI pooling uses quantization to generate features for each region proposal, which causes misalignments that hinder the gradients from being smoothly back-propagated from the region level to the convolutional feature maps. The proposal-based methods work well with pre-trained network models because the parameter initialization is good for the layers before RoI pooling, while this is not true for training from scratch.

Hence, we arrive at the first principle: training a detection network from scratch requires a proposal-free framework, even if there is no BN layer [55] in the network structure (in contrast, a normalization layer is critical for the Sync BN [62] and Group Norm [63] approaches to training region-based/two-stage detectors from scratch). In practice, we derive a multi-scale proposal-free framework from the SSD framework [10], as it can reach state-of-the-art accuracy while offering fast processing speed.
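To make the quantization argument concrete, the sketch below compares the exact position of a proposal edge on a feature map with the rounded position that a quantizing RoI pooling would use. The feature stride of 16, the use of rounding to the nearest cell, and the pixel coordinates are illustrative assumptions, not values from this paper.

```python
def roi_edge_misalignment(pixel_coord, feature_stride=16):
    """Return (exact, quantized, error_in_pixels) for one RoI edge."""
    exact = pixel_coord / feature_stride       # continuous feature-map coordinate
    quantized = round(exact)                   # coordinate after quantization
    error_px = abs(exact - quantized) * feature_stride
    return exact, quantized, error_px

for x in (100, 210, 300):
    exact, quantized, error_px = roi_edge_misalignment(x)
    print(f"x={x}: exact={exact:.3f}, quantized={quantized}, shift={error_px:.1f}px")
```

Each edge is shifted by a few input pixels, so the pooled region no longer corresponds exactly to the proposal that produced the gradient.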

Principle 2: Deep Supervision. Using deeply supervised structures to improve network performance has been demonstrated to be an effective practice in GoogLeNet [52], DSN [38], DeepID3 [64], etc. In these network structures, the central idea is to provide an integrated objective function as direct supervision to the earlier hidden layers, rather than only to the output layer. These “companion” or “auxiliary” objective functions at multiple hidden layers can mitigate the “vanishing” gradients problem. The proposal-free detection framework contains both a classification and a localization loss. The explicit solution requires adding complex side-output layers to introduce a “companion” objective at each hidden layer for the detection task, similar to [39]. In this work, we realize deep supervision with an elegant and implicit solution called layer-wise dense connections, as introduced in DenseNets [40]. A block is called a dense block when all preceding layers in the block are connected to the current layer. Hence, earlier layers in DenseNet can receive additional supervision from the objective function through the skip connections. Although only a single loss function is required on top of the network, all layers, including the earlier ones, can still share the supervised signals unencumbered.
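The implicit supervision path can be illustrated by listing which earlier outputs feed each layer of a dense block. This is a minimal sketch of the connectivity pattern only (no convolutions are computed); the function name and the example sizes are assumptions for illustration.

```python
def dense_block_inputs(num_layers, in_channels, growth_rate):
    """For each layer, list its input sources and total input channels.

    Source 0 is the block input; source i (i >= 1) is the output of layer i.
    """
    plan = []
    for layer in range(1, num_layers + 1):
        sources = list(range(layer))           # block input + all earlier layers
        in_ch = in_channels + (layer - 1) * growth_rate
        plan.append((layer, sources, in_ch))
    return plan

for layer, sources, in_ch in dense_block_inputs(4, 128, 48):
    print(f"layer {layer}: inputs from {sources}, {in_ch} input channels")
```

Because every layer's output is also concatenated into the block output, each layer has a direct skip path to whatever loss sits on top of the block.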

In order to further verify the effectiveness of the deep supervision mechanism, we propose a deep-scale supervision (DSS) module, which is similar to HyperNet [13], Inside-outside net [14], etc. As illustrated in Fig. 3, DSS concatenates three different scales of feature maps (low, middle and high levels) from different blocks into a single prediction module. For low-level (coarse resolution) features, we use max pooling with stride 2 to reduce the resolution, followed by a conv-layer to reduce the number of feature maps. We use max pooling for middle-level feature maps and do not include max pooling for high-level layers. Then, we concatenate these diverse feature maps together for final prediction. Each prediction layer can be formulated as:

$$P_n = \mathrm{Concat}\big(\mathrm{MP}(F_{low}),\ \mathrm{MP}(F_{mid}),\ F_{high}\big),$$

where $P_n$ denotes the $n$-th prediction layer outputs, $\mathrm{MP}$ denotes max pooling, and $F_{low}$, $F_{mid}$ and $F_{high}$ denote feature maps from different layers. We will verify the benefit of deep supervision in Section 4.1.2.
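A shape-level sketch of the concatenation performed by the DSS module: each feature map is pooled by its own factor so that all three reach the same resolution, then the channels are stacked. The channel counts, the pooling factors of 4, 2 and 1 (taken from the Fig. 3 caption), the example resolutions and the ceiling rounding are assumptions for illustration.

```python
import math

def dss_output_shape(features):
    """features: list of (channels, size, pool_factor) tuples.

    Returns (total_channels, size) after pooling each map by its factor and
    concatenating along the channel axis.
    """
    sizes = {math.ceil(size / factor) for _, size, factor in features}
    if len(sizes) != 1:
        raise ValueError(f"resolutions do not match after pooling: {sorted(sizes)}")
    total_channels = sum(channels for channels, _, _ in features)
    return total_channels, sizes.pop()

# Hypothetical low-, middle- and high-level maps feeding a 38x38 prediction layer.
print(dss_output_shape([(128, 150, 4), (128, 75, 2), (256, 38, 1)]))
```

The check on matching resolutions mirrors the requirement that feature maps can only be concatenated once they share the same spatial size.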

Transition w/o Pooling Layer. In order to increase the number of dense blocks without reducing the final feature map resolution, we introduce a new layer called the transition w/o pooling layer. In the original design of DenseNet, each transition layer contains a pooling operation to down-sample the feature resolution. The number of dense blocks is fixed (4 dense blocks in all DenseNet architectures) if one wants to maintain the same scale/size of outputs, so the only way to increase network depth in the original DenseNet is to add layers inside each block. The transition w/o pooling layer removes this restriction on the number of dense blocks in the DSOD architecture. Any number of blocks can be included in a network, and the same layer can also be adopted by the standard DenseNet.

Fig. 3: Illustration of the deep-scale supervision (DSS) module. “4, 2 and 1” denote that we reduce the resolution of feature maps to 1/4, 1/2 and the original size, respectively. “c” denotes the concatenation operation. “P1 and P2” are the first (38×38) and second (19×19) scales of prediction modules in Fig. 2. “P3-P6” also use three-scale feature maps for prediction, which are not presented in this figure.

Principle 3: Stem Block. Motivated by Inception-v3 [65] and v4 [60], we define the stem block as a stack of three 3×3 convolution layers followed by a 2×2 max pooling layer. The first conv-layer works with stride = 2 and the other two with stride = 1. We find that adding this simple stem structure clearly improves the detection performance in our experiments. We conjecture that, compared with the original design in DenseNet (a 7×7 conv-layer with stride = 2 followed by a 3×3 max pooling with stride = 2), the stem block reduces the information loss from raw input images by using a small kernel size at the beginning of the network. We will show in Section 4.1.2 that the benefit of this stem block is significant for object detection performance.

Principle 4: Dense Prediction Structure. Fig. 2 illustrates the comparison of the plain structure (as in SSD) and our proposed dense structure in the front-end sub-network. SSD designs the prediction layers as an asymmetric hourglass structure. For 300×300 input size, SSD applies six scales of feature maps for predicting objects. The Scale-1 feature maps are from the middle layer of the backbone sub-network, which has the largest resolution (38×38) in order to handle the small objects in an image. The remaining five scales are on top of the backbone sub-network. Then, a plain transition layer with the bottleneck structure (a 1×1 conv-layer for reducing the number of feature maps plus a 3×3 conv-layer) [65, 53] is adopted between two contiguous scales of feature maps.

Learning Half and Reusing Half. In the plain structure used in the SSD framework, each later scale of prediction layer is directly derived from the adjacent previous scale, as shown in Fig. 2. In this work, we propose to use a dense structure for prediction. Each prediction layer combines multi-scale information from two stages of layers. For simplicity, we require each scale to output the same number of channels for the prediction feature maps as in the plain structure. At each scale of DSOD (except scale-1), half of the feature maps are learned from the previous scale with a series of conv-layers, while the remaining half are directly down-sampled from the contiguous high-resolution feature maps. The down-sampling block consists of a 2×2, stride = 2 max pooling layer followed by a 1×1, stride = 1 conv-layer. The pooling layer matches the resolution to the current size for concatenation. The 1×1 conv-layer reduces the number of channels to 50%. The pooling layer is placed before the 1×1 conv-layer to reduce computing cost. This down-sampling block in effect provides each scale with the multi-resolution feature maps from all of its preceding scales, which is essentially identical to the dense layer-wise connection introduced in DenseNets. For each scale, we learn only half of the feature maps anew and reuse the previous ones for the remaining half. This dense prediction structure can yield more accurate results with fewer parameters than the plain structure, as will be studied in Section 4.1.
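The channel bookkeeping of "learning half and reusing half" can be sketched as follows. The per-scale channel counts are hypothetical placeholders; only the 50/50 split and the pooling-then-1×1-conv order come from the text.

```python
def dense_prediction_split(out_channels_per_scale):
    """Split each scale (except scale-1) into learned and reused channels."""
    plan = []
    for scale, out_ch in enumerate(out_channels_per_scale, start=1):
        if scale == 1:
            # Scale-1 comes straight from the backbone; nothing is reused.
            plan.append({"scale": scale, "learned": out_ch, "reused": 0})
            continue
        reused = out_ch // 2          # max pool + 1x1 conv on the previous scale
        learned = out_ch - reused     # series of conv-layers on the previous scale
        plan.append({"scale": scale, "learned": learned, "reused": reused})
    return plan

for row in dense_prediction_split([512, 512, 256, 256]):
    print(row)
```

Since only the learned half requires the full conv-layer stack, fewer channels are produced by heavy convolutions at each scale than in the plain structure.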

transition w/o pooling?
hi-comp factor ?
wide bottleneck?
wide 1st conv-layer?
big growth rate?
stem block?
dense pred-layers?
DSS module?
VOC 2007 mAP 59.9 61.6 64.5 68.6 69.7 74.5 77.3 77.7 79.1
TABLE II: Effectiveness of various designs on VOC 2007 test set. Please refer to Tab. III and Section 4.1 for more details.
Method data pre-train transition w/o pool stem backbone prediction layer # parameters mAP (%)
DSOD300 07+12 DS/32-12-16-0.5 Plain 4.1M 59.9
DSOD300 07+12 DS/32-12-16-0.5 Plain 4.2M 61.6
DSOD300 07+12 DS/32-12-16-1 Plain 5.5M 64.5
DSOD300 07+12 DS/32-64-16-1 Plain 6.1M 68.6
DSOD300 07+12 DS/64-64-16-1 Plain 6.3M 69.7
DSOD300 07+12 DS/64-192-48-1 Plain 18.0M 74.5
DSOD300 07+12 DS/64-12-16-1 Plain 5.2M 70.7
DSOD300 07+12 DS/64-36-48-1 Plain 12.5M 76.0
DSOD300 07+12 DS/64-192-48-1 Plain 18.2M 77.3
DSOD300 07+12 DS/64-64-16-1 Dense 5.9M 73.6
DSOD300 07+12 DS/64-192-48-1 Dense 14.8M 77.7
DSOD300 07+12+COCO DS/64-192-48-1 Dense 14.8M 81.7
TABLE III: Ablation study on PASCAL VOC 2007 test set. DS/A-B-k-θ describes our backbone network structure. A denotes the number of channels in the 1st conv-layer. B denotes the number of channels in each bottleneck layer (1×1 convolution). k is the growth rate in dense blocks. θ denotes the compression factor in transition layers. See Section 4.1 for more explanations.
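The backbone strings in Tab. III, such as DS/64-192-48-1, pack four settings into one label: first conv-layer channels, bottleneck channels, growth rate and compression factor. A small parser makes the rows easier to compare programmatically. This is an illustrative helper only; the class and function names are hypothetical.

```python
from typing import NamedTuple

class BackboneConfig(NamedTuple):
    first_conv_channels: int
    bottleneck_channels: int
    growth_rate: int
    compression: float

def parse_backbone(spec):
    """Parse a string such as 'DS/64-192-48-1' into its four fields."""
    prefix, _, fields = spec.partition("/")
    if prefix != "DS":
        raise ValueError(f"unexpected prefix in {spec!r}")
    a, b, k, theta = fields.split("-")
    return BackboneConfig(int(a), int(b), int(k), float(theta))

cfg = parse_backbone("DS/64-192-48-1")
print(cfg)
# In this configuration the bottleneck width equals four times the growth rate.
print(cfg.bottleneck_channels == 4 * cfg.growth_rate)
```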

3.3 Training Objective

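Our overall training objective is derived from SSD [10] and Fast RCNN [7]. It is a weighted sum of the classification loss (cls) and the localization loss (reg):

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \geq 1]\,L_{reg}(t^u, v),$$

where $p = (p_0, \ldots, p_K)$ denotes a discrete probability distribution that is computed by a softmax over the K+1 outputs, $u$ is the ground-truth class, $t^u$ is the bounding-box regression offsets, $v$ is the ground-truth bounding-box regression target, and $\lambda$ is the coefficient to balance the two losses. The indicator $[u \geq 1]$ is 1 for object classes and 0 for the background class ($u = 0$). Following Fast RCNN [7], we also adopt the smooth $L_1$ loss for bounding-box regression:

$$L_{reg}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \qquad \mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5\,z^2 & \text{if } |z| < 1, \\ |z| - 0.5 & \text{otherwise.} \end{cases}$$

Specifically, we calculate the four coordinates following [10, 7, 8]:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),$$

$$v_x = (x^* - x_a)/w_a, \quad v_y = (y^* - y_a)/h_a, \quad v_w = \log(w^*/w_a), \quad v_h = \log(h^*/h_a),$$

where $x$, $y$, $w$, and $h$ denote the box’s center coordinates and its width and height. $x$, $x_a$ and $x^*$ denote the predicted box, default box and ground-truth box, respectively (likewise for $y$, $w$ and $h$).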
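The regression part of the objective can be sketched in a few lines. The code below assumes the smooth L1 penalty as defined in Fast R-CNN and the usual center-size encoding of a box relative to a default box; it omits the variance scaling that some SSD implementations apply, and the function names and example boxes are hypothetical.

```python
import math

def smooth_l1(z):
    """Smooth L1 penalty as defined in Fast R-CNN."""
    return 0.5 * z * z if abs(z) < 1.0 else abs(z) - 0.5

def encode_box(box, default_box):
    """Encode a (cx, cy, w, h) box relative to a default box."""
    cx, cy, w, h = box
    dcx, dcy, dw, dh = default_box
    return ((cx - dcx) / dw, (cy - dcy) / dh, math.log(w / dw), math.log(h / dh))

def regression_loss(predicted_offsets, ground_truth_box, default_box):
    """Sum of smooth L1 terms over the four encoded coordinates."""
    target = encode_box(ground_truth_box, default_box)
    return sum(smooth_l1(t - v) for t, v in zip(predicted_offsets, target))

default = (0.5, 0.5, 0.2, 0.2)     # hypothetical default box (cx, cy, w, h)
truth = (0.55, 0.5, 0.2, 0.4)      # hypothetical ground-truth box
print(encode_box(truth, default))
print(regression_loss((0.0, 0.0, 0.0, 0.0), truth, default))
```

The quadratic region near zero keeps gradients small for well-localized boxes, while the linear region limits the influence of large offsets.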

3.4 Other Settings

We implement our detectors based on the Caffe platform [66]. All our models are trained from scratch with the SGD solver on NVidia TitanX GPU. Since each scale of DSOD feature maps is concatenated from multi-resolution features, we adopt the L2 normalization technique [67] to scale the feature norm to 20 on all outputs. Note that SSD only applies this normalization to scale-1. Most of our training strategies follow SSD, including data augmentation and the scales and aspect ratios of default boxes, while we use our own learning rate scheduling and mini-batch size settings. Details are given in the experimental section.
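As a minimal sketch of the L2 normalization step, the function below rescales the channel vector at one spatial location to a norm of 20. It treats the scale as a fixed constant and uses a small epsilon for numerical safety; both choices, and the function name, are assumptions for illustration.

```python
import math

def l2_normalize(channel_vector, scale=20.0, eps=1e-10):
    """Rescale one spatial location's channel vector to L2 norm `scale`."""
    norm = math.sqrt(sum(v * v for v in channel_vector)) + eps
    return [scale * v / norm for v in channel_vector]

vec = l2_normalize([3.0, 4.0])
print(vec)
print(math.sqrt(sum(v * v for v in vec)))
```

Applying the same target norm to every scale keeps concatenated features of different origin on a comparable magnitude.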

Method data pre-train backbone prediction layer speed (fps) # parameters input size mAP (%)
Faster RCNN [8] 07+12 VGGNet - 7 134.7M 73.2
Faster RCNN [8] 07+12 ResNet-101 - 2.4 - 76.4
R-FCN [9] 07+12 ResNet-50 - 11 31.9M 77.4
R-FCN [9] 07+12 ResNet-101 - 9 50.9M 79.5
R-FCNmulti-sc [9] 07+12 ResNet-101 - 9 50.9M 80.5
YOLOv2 [56] 07+12 Darknet-19 - 81 - 73.7
SSD300 [10] 07+12 VGGNet Plain 46 26.3M 75.8
SSD300* [10] 07+12 VGGNet Plain 46 26.3M 77.2
Faster RCNN 07+12 VGGNet/ResNet-101/DenseNet Failed
R-FCN 07+12 VGGNet/ResNet-101/DenseNet Failed
SSD300S 07+12 ResNet-101 Plain 12.1 52.8M 63.8
SSD300S 07+12 VGGNet Plain 46 26.3M 69.6
SSD300S 07+12 VGGNet Dense 37 26.0M 70.4
DSOD300 07+12 DS/64-192-48-1 Plain 20.6 18.2M 77.3
DSOD300 07+12 DS/64-192-48-1 Dense 17.4 14.8M 77.7
DSOD300 07+12+COCO DS/64-192-48-1 Dense 17.4 14.8M 81.7
TABLE IV: PASCAL VOC 2007 test detection results. SSD300* is updated version by the authors after the paper publication. SSD300S indicates training SSD300* from scratch with ResNet-101 or VGGNet, which serves as our baseline. Note that the speed of Faster R-CNN with ResNet-101 (2.4 fps) is tested on K40, while others are tested on Titan X. The result of SSD300S with ResNet-101 (63.8% mAP, without the pre-trained model) is produced with the default setting of SSD, which may not be optimal.

4 Experiments

Our experiments are conducted on the widely used PASCAL VOC 2007, 2012 and MS COCO datasets that have 20, 20, 80 object categories respectively. We adopt the standard mean Average Precision (mAP) to measure the object detection performance.

4.1 Ablation Study on PASCAL VOC2007

We first investigate each component and design principle of our DSOD framework. The results are mainly summarized in Tab. VI and Tab. III. We design several controlled experiments on PASCAL VOC 2007 with our DSOD300 (with 300×300 inputs) for this ablation study. A consistent setting is imposed on all the experiments, except when specific components or structures are examined. In this study, we train the models on the combined training set from VOC 2007 trainval and 2012 trainval (“07+12”), and test on the VOC 2007 test set.

4.1.1 Configurations in Dense Blocks

In this section, we first investigate the impact of different configurations in dense blocks of the backbone sub-network.

Compression Factor in Transition Layers. We compare two compression factor values (θ = 0.5, 1) in the transition layers of DenseNets. Results are shown in Tab. III (rows 2 and 3). Compression factor θ = 1 means that there is no feature map reduction in the transition layer, while θ = 0.5 means that half of the feature maps are removed. We observe that θ = 1 obtains 2.9% higher mAP than θ = 0.5.

# Channels in bottleneck layers. As shown in Tab. III (rows 3 and 4), we observe that wider bottleneck layers (with more channels of response maps) improve the performance greatly (4.1% mAP).

# Channels in the 1st conv-layer. We observe that a large number of channels in the first conv-layer is beneficial, bringing 1.1% mAP improvement (Tab. III, rows 4 and 5).

Growth rate. A large growth rate is found to be much better. We observe 4.8% mAP improvement in Tab. III (rows 5 and 6) when k increases from 16 to 48 with 4k bottleneck channels.
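The gains quoted above follow directly from the mAP column of Tab. III. The short check below recomputes them from the first six rows; the dictionary layout and labels are only an illustrative way to organize the values.

```python
# mAP values (%) of the first six rows of Tab. III, in row order.
map_by_row = {1: 59.9, 2: 61.6, 3: 64.5, 4: 68.6, 5: 69.7, 6: 74.5}

# Each comparison is (baseline row, improved row).
comparisons = {
    "compression factor 0.5 -> 1": (2, 3),
    "bottleneck channels 12 -> 64": (3, 4),
    "first conv channels 32 -> 64": (4, 5),
    "growth rate 16 -> 48 (bottleneck 64 -> 192)": (5, 6),
}

gains = {name: round(map_by_row[b] - map_by_row[a], 1)
         for name, (a, b) in comparisons.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} mAP")
```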

Method data backbone pre-train mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
HyperNet [13] 07++12 VGGNet 71.4 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86.0 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7
ION [14] 07+12+S VGGNet 76.4 87.5 84.7 76.8 63.8 58.3 82.6 79.0 90.9 57.8 82.0 64.7 88.9 86.5 84.7 82.3 51.4 78.2 69.2 85.2 73.5
Faster RCNN [8] 07++12 ResNet-101 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6
R-FCNmulti-sc [9] 07++12 ResNet-101 77.6 86.9 83.4 81.5 63.8 62.4 81.6 81.1 93.1 58.0 83.8 60.8 92.7 86.0 84.6 84.4 59.0 80.8 68.6 86.1 72.9
YOLOv2 [56] 07++12 Darknet-19 73.4 86.3 82.0 74.8 59.2 51.8 79.8 76.5 90.6 52.1 78.2 58.5 89.3 82.5 83.4 81.3 49.1 77.2 62.4 83.8 68.7
SSD300* [10] 07++12 VGGNet 75.8 88.1 82.9 74.4 61.9 47.6 82.7 78.8 91.5 58.1 80.0 64.1 89.4 85.7 85.5 82.6 50.2 79.8 73.6 86.6 72.1
DSOD300 07++12 DS/64-192-48-1 76.3 89.4 85.3 72.9 62.7 49.5 83.6 80.6 92.1 60.8 77.9 65.6 88.9 85.5 86.8 84.6 51.1 77.7 72.3 86.0 72.2
DSOD300 07++12+COCO DS/64-192-48-1 79.3 90.5 87.4 77.5 67.4 57.7 84.7 83.6 92.6 64.8 81.3 66.4 90.1 87.8 88.1 87.3 57.9 80.3 75.6 88.1 76.7
TABLE V: PASCAL VOC 2012 test detection results. 07+12: 07 trainval + 12 trainval, 07+12+S: 07+12 plus segmentation labels, 07++12: 07 trainval + 07 test + 12 trainval. Anonymous result links are DSOD300 (07+12) : http://host.robots.ox.ac.uk:8080/anonymous/PIOBKI.html; DSOD300 (07+12+COCO): http://host.robots.ox.ac.uk:8080/anonymous/I0UUHO.html.

4.1.2 Effectiveness of Design Principles

In this section, we justify the effectiveness of each design principle elaborated earlier.

Proposal-free Framework. We tried to learn object detectors from scratch using proposal-based frameworks, including Faster R-CNN and R-FCN, with the default settings. However, the training process failed to converge for all the network structures we attempted (VGGNet, ResNet, DenseNet). We then tried to train with the proposal-free framework SSD. The training converged successfully but still gave worse results (69.6% for the VGGNet backbone) than fine-tuning from the pre-trained model (75.8%), as shown in Tab. IV. These experiments validate our principle of choosing a proposal-free framework.

Method pre-train mAP (%)
SSD [10] ✓ 77.2
SSD [10] ✗ 69.6
SSD [10] (+DP) ✗ 70.4
SSD [10] (+DP+DSS w/o BN) ✗ 74.2
SSD [10] (+DP+DSS w/ BN) ✗ 77.4
DSOD ✗ 77.7
DSOD (v2) (+DSS w/ BN) ✗ 79.1
TABLE VI: Effectiveness of various designs on VOC 2007 test set. DP denotes dense prediction. DSS w/o BN denotes deep-scale supervision module without BN [55]. Please refer to Section XI for more details.
Fig. 4: Examples of object detection results on the MS COCO test-dev set using DSOD300. The training data is COCO trainval without the ImageNet pre-trained models (29.3% mAP@[0.5:0.95] on the test-dev set). Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used for displaying. For each image, one color corresponds to one object category in that image. The running time per image is 57.5ms on one Titan X GPU or 590ms on Intel (R) Core (TM) i7-5960X CPU @ 3.00GHz.
Method mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
DSOD (v2) 72.9 86.8 82.5 69.0 57.4 47.1 81.2 77.8 88.7 54.8 75.5 60.4 85.2 82.0 85.4 82.4 45.0 75.3 68.2 84.3 69.2
DSOD 70.8 86.4 80.2 65.5 55.7 42.4 80.3 75.3 86.6 51.1 72.3 60.5 83.9 80.5 83.6 80.4 42.7 72.4 67.3 83.1 66.2
SSD [10] 64.0 78.9 72.3 61.8 42.8 27.9 73.1 69.4 84.9 42.5 68.4 52.2 80.9 76.5 77.2 68.2 31.6 67.0 66.6 77.3 60.9
THU_ML_class 62.4 78.0 71.0 64.5 47.4 45.3 70.1 70.6 82.0 37.9 65.4 44.2 77.4 69.6 74.4 75.5 37.9 62.0 45.5 73.8 56.3
YOLOv2 [56] 48.8 69.5 61.6 37.6 28.2 18.8 63.2 53.2 65.6 27.5 44.4 35.9 61.4 57.9 66.9 63.8 16.8 52.8 39.5 65.4 46.2
DENSE_BOX 45.9 64.7 64.1 28.8 26.7 30.7 60.6 54.9 47.4 29.3 41.8 34.6 42.6 59.3 64.2 62.5 24.3 53.7 27.1 50.9 50.7
NoC 42.2 62.8 60.4 26.7 22.3 25.7 56.9 55.2 52.1 21.5 38.3 34.2 43.9 51.2 58.8 40.7 20.4 42.0 37.4 52.6 41.6
TABLE VII: PASCAL VOC 2012 Competition Comp3 results. The training data is PASCAL VOC 2012 trainval set without pre-trained models. Anonymous result link of DSOD v2 is http://host.robots.ox.ac.uk:8080/anonymous/TOAZCG.html.
Method data network pre-train Avg. Precision, IoU: Avg. Precision, Area: Avg. Recall, #Dets: Avg. Recall, Area:
0.5:0.95 0.5 0.75 S M L 1 10 100 S M L
Faster RCNN [8] trainval VGGNet 21.9 42.7 - - - - - - - - - -
ION [14] train VGGNet 23.6 43.2 23.6 6.4 24.1 38.3 23.2 32.7 33.5 10.1 37.7 53.6
R-FCN [9] trainval ResNet-101 29.2 51.5 - 10.3 32.4 43.3 - - - - - -
R-FCNmulti-sc [9] trainval ResNet-101 29.9 51.9 - 10.8 32.8 45.0 - - - - - -
YOLOv2 [56] trainval35k Darknet-19 21.6 44.0 19.2 5.0 22.4 35.5 20.7 31.6 33.3 9.8 36.5 54.4
SSD300* [10] trainval35k VGGNet 25.1 43.1 25.8 6.6 25.9 41.4 23.7 35.1 37.2 11.2 40.4 58.4
DSOD300 trainval DS/64-192-48-1 29.3 47.3 30.6 9.4 31.5 47.0 27.3 40.7 43.0 16.7 47.1 65.0
TABLE VIII: MS COCO test-dev 2015 detection results.

Deep Supervision. We then tried to learn object detectors from scratch with the principle of deep supervision. Our DSOD300 achieves 77.7% mAP, which is much better than the SSD300S trained from scratch using VGG16 without deep supervision (69.6%). Since VGGNet is a plain network, we design a deep-scale supervision (DSS) module to further validate the effectiveness of deep supervision. The structure of our DSS is shown in Fig. 3: it concatenates three different scales of feature maps (low, middle and high levels) into a single prediction module. The performance comparisons are shown in Tab. VI. Our proposed module significantly improves the accuracy of SSD from 70.4% to 77.4%, which is even better than the ImageNet pre-trained case (77.2%). Adopting the DSS module in DSOD also yields a consistent improvement (from 77.7% to 79.1%).

Transition w/o Pooling Layer. We compare the case without this designed layer (only 3 dense blocks) and the case with it (4 dense blocks in our design). The backbone network is DS/32-12-16-0.5. Results are shown in Tab. III. The Transition w/o Pooling layer allows a deeper network structure and brings a 1.7% gain in detection performance, which validates the effectiveness of this layer.

Stem Block. As shown in Tab. III (rows 6 and 9), the stem block notably improves the performance from 74.5% to 77.3%. This validates our conjecture that the stem block reduces the loss of information from the raw input images.

Dense Prediction Structure. We analyze the dense prediction structure from three aspects: speed, accuracy and parameters. As shown in Tab. IV, DSOD with the dense front-end structure runs slightly slower than the plain structure (17.4 fps vs. 20.6 fps) on a Titan X GPU, due to the overhead from additional down-sampling blocks. However, the dense structure improves mAP from 77.3% to 77.7% while reducing the parameters from 18.2M to 14.8M. Tab. III gives more details (rows 9 and 10). We also tried replacing the prediction layers in SSD with the proposed dense prediction layers. With the VGG-16 model as backbone, the accuracy on the VOC 2007 test set improves from 75.8% (original SSD) to 76.1% with pre-trained models, and from 69.6% to 70.4% without pre-trained models. This verifies the effectiveness of the dense prediction layer.

What happens when pre-training on ImageNet? It is interesting to see the performance of DSOD with a backbone network pre-trained on ImageNet. We trained one lite backbone network, DS/64-12-16-1, on ImageNet, which obtains 66.8% top-1 accuracy and 87.8% top-5 accuracy on the validation set (slightly worse than VGG-16). After fine-tuning the whole detection framework on the “07+12” trainval set, we achieve 70.3% mAP on the VOC 2007 test set. The corresponding training-from-scratch solution achieves 70.7% mAP, which is even slightly better. We will investigate this point more thoroughly in future work.

4.1.3 Runtime Analysis

The inference speed comparisons are shown in the 6th column of Tab. IV. With 300×300 input, our DSOD can process an image in 48.6ms (20.6 fps) on a single Titan X GPU with the plain prediction structure, and 57.5ms (17.4 fps) with the dense prediction structure. As a comparison, R-FCN runs at 90ms (11 fps) for ResNet-50 and 110ms (9 fps) for ResNet-101. The SSD300 runs at 82.6ms (12.1 fps) for ResNet-101 and 21.7ms (46 fps) for VGGNet. In addition, our model uses only about 1/2 of the parameters of SSD300 with VGGNet, 1/4 of SSD300 with ResNet-101, 1/4 of R-FCN with ResNet-101 and 1/10 of Faster R-CNN with VGGNet. A lite version of DSOD (10.4M parameters, w/o any speed optimization) can run at 25.8 fps with only a 1% mAP drop.
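
The latency and throughput figures quoted above are two views of the same measurement. The short sketch below converts each per-image latency into frames per second, assuming images are processed one at a time (batch size 1); the dictionary keys are illustrative labels, not identifiers from any released code.

```python
# Per-image latencies (ms) quoted in the runtime analysis.
latency_ms = {
    "DSOD300 (plain prediction)": 48.6,
    "DSOD300 (dense prediction)": 57.5,
    "R-FCN (ResNet-50)": 90.0,
    "R-FCN (ResNet-101)": 110.0,
    "SSD300 (ResNet-101)": 82.6,
    "SSD300 (VGGNet)": 21.7,
}


def frames_per_second(ms_per_image):
    """Convert a per-image latency in milliseconds to throughput in fps.

    Assumes one image is processed at a time (batch size 1).
    """
    return 1000.0 / ms_per_image


# Throughput rounded to one decimal place, as in the text.
fps = {name: round(frames_per_second(ms), 1) for name, ms in latency_ms.items()}

for name, value in fps.items():
    print(f"{name}: {latency_ms[name]} ms -> {value} fps")
```

The converted values agree with the quoted ones up to rounding (for example, 48.6ms corresponds to about 20.6 fps).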

4.2 Results on PASCAL VOC2007

Our models are trained on the union of VOC 2007 trainval and VOC 2012 trainval (“07+12”) following [10]. We use a batch size of 128 across 8 GPUs during training. Note that this batch size is beyond the capacity of GPU memory (even for an 8-GPU server, each GPU with 12GB memory). We overcome the GPU memory constraint by accumulating gradients over two training iterations, which we implemented on the Caffe platform [66]. The initial learning rate is set to 0.1 and then divided by 10 after every 20k iterations. Training finishes at 100k iterations. Following [10], we use a weight decay of 0.0005 and a momentum of 0.9. All conv-layers are initialized with the “xavier” method [68].
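
The gradient accumulation trick can be illustrated with a toy example. The sketch below is not the Caffe implementation mentioned above; it is a minimal pure-Python illustration, with a hypothetical scalar linear model and hypothetical function names, showing that averaging the gradients of two half-batches reproduces the gradient of the full batch when the loss is a per-sample mean.

```python
def mse_gradient(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, num_chunks):
    """Accumulate gradients over equally sized chunks of the batch.

    Each chunk stands for one forward/backward pass that fits in memory.
    Dividing each chunk gradient by num_chunks makes the accumulated sum
    equal to the mean gradient over the whole batch.
    """
    if len(batch) % num_chunks != 0:
        raise ValueError("batch size must be divisible by num_chunks")
    chunk_size = len(batch) // num_chunks
    total = 0.0
    for i in range(num_chunks):
        chunk = batch[i * chunk_size:(i + 1) * chunk_size]
        total += mse_gradient(w, chunk) / num_chunks
    return total


# Toy data: 128 samples from y = 2x + 1 (hypothetical, for illustration only).
batch = [(i / 128.0, 2.0 * (i / 128.0) + 1.0) for i in range(128)]
w = 0.5

full_batch_grad = mse_gradient(w, batch)
two_step_grad = accumulated_gradient(w, batch, num_chunks=2)
print(full_batch_grad, two_step_grad)
```

Note that this equivalence is exact only for the gradient of a mean loss. Layers that compute statistics per forward pass, such as batch normalization, still see the smaller chunk rather than the full batch.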

Tab. IV shows our results on the VOC2007 test set. SSD300 denotes the updated SSD results, which use a new data augmentation technique. Our DSOD300 with plain structure achieves 77.3%, which is slightly better than SSD300 (77.2%). DSOD300 with dense prediction structure further improves the result to 77.7%.

4.3 Results on PASCAL VOC2012

For VOC 2012 dataset, we use VOC 2012 trainval and VOC 2007 trainval + test for training, and test on VOC 2012 test set. The initial learning rate is set to 0.1 for the first 30k iterations, then divided by 10 after every 20k iterations. The total training iterations are 110k. Other settings are the same as those used in our VOC 2007 experiments. Our results of DSOD300 are shown in Tab. 9. DSOD300 achieves 76.3% mAP, which is consistently better than baseline SSD300 (75.8%).

4.4 Results on PASCAL VOC2012 Comp3

VOC2012 Comp3 is the sub-challenge of PASCAL VOC 2012 which compares object detectors that are trained only with PASCAL VOC 2012 data (11,540 images in trainval set for training and 10,991 in test set for testing).

Our results are shown in Tab. VII. DSOD achieves 70.8% mAP on the PASCAL VOC 2012 test set, which outperforms the baseline method SSD by a large margin (6.8% mAP). DSOD v2 further improves the performance from 70.8% to 72.9% mAP.
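
The mAP values above are the unweighted mean of the 20 per-class APs. As a quick consistency check, the sketch below recomputes them from the per-class numbers listed in Tab. VII for DSOD (v2), DSOD and SSD; the variable names are illustrative.

```python
# Per-class AP (%) copied from Tab. VII, in the column order
# aero, bike, bird, boat, bottle, bus, car, cat, chair, cow,
# table, dog, horse, mbike, person, plant, sheep, sofa, train, tv.
per_class_ap = {
    "DSOD (v2)": [86.8, 82.5, 69.0, 57.4, 47.1, 81.2, 77.8, 88.7, 54.8, 75.5,
                  60.4, 85.2, 82.0, 85.4, 82.4, 45.0, 75.3, 68.2, 84.3, 69.2],
    "DSOD": [86.4, 80.2, 65.5, 55.7, 42.4, 80.3, 75.3, 86.6, 51.1, 72.3,
             60.5, 83.9, 80.5, 83.6, 80.4, 42.7, 72.4, 67.3, 83.1, 66.2],
    "SSD": [78.9, 72.3, 61.8, 42.8, 27.9, 73.1, 69.4, 84.9, 42.5, 68.4,
            52.2, 80.9, 76.5, 77.2, 68.2, 31.6, 67.0, 66.6, 77.3, 60.9],
}

# mAP is the unweighted mean over the 20 VOC classes, rounded to one decimal.
mean_ap = {name: round(sum(aps) / len(aps), 1) for name, aps in per_class_ap.items()}

# Margin of DSOD over the SSD baseline, in mAP points.
margin = round(mean_ap["DSOD"] - mean_ap["SSD"], 1)

print(mean_ap)
print("DSOD vs. SSD margin:", margin)
```

The recomputed means match the mAP column of Tab. VII, and the difference between DSOD and SSD is the 6.8 points quoted in the text.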

Fig. 5: Left is the pre-activation of BN-ReLU-Conv in DSOD, right is the post-activation of Conv-BN-ReLU in DSOD v2.
Fig. 6: Sensitivity of our detection results. Each plot shows the mean (over classes) normalized AP for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). We show plots for our baseline method (SSD) and our method (DSOD) with and without DSS module. We can observe that DSOD and DSOD v2 consistently improve the performance compared with baseline SSD.
Method VOC 07 (% mAP) VOC 12 (% mAP) VOC 12 Comp3 (% mAP) COCO (Avg. Precision, IoU:)
0.5:0.95 0.5 0.75
DSOD300 77.7 76.3 70.8 29.3 47.3 30.6
DSOD300* (v2) 79.1 77.2 72.9 30.4 49.0 31.8
TABLE IX: Comparisons of DSOD and DSOD (v2) on PASCAL VOC and MS COCO 2015 test-dev set.
Method network pre-train # param COCO (Avg. Precision, IoU:)
0.5:0.95 0.5 0.75
One-Stage Detectors:
SSD300 [10] VGGNet 34.3M 23.2 41.2 23.4
SSD300* [10] VGGNet 34.3M 25.1 43.1 25.8
DSOD300 DSOD 21.8M 29.3 47.3 30.6
DSOD300 (v2) DSOD + DSS 37.3M 30.4 49.0 31.8
Two-Stage Detectors:
FPN300/500 [2] ResNet-50 83.3M 29.0 48.0 30.3
FPN300/500 [2] ResNet-101 121.2M 29.4 48.8 30.6
Mask RCNN+FPN300/500 [1] ResNet-50 84.4M 29.9 49.0 31.3
Mask RCNN+FPN300/500 [1] ResNet-101 122.4M 30.2 49.3 31.7
TABLE X: Comparisons of state-of-the-art two-stage detectors on MS COCO 2015 test-dev set. For fair comparisons, we resize the short side of inputs to 300 for all two-stage detectors. “500” indicates the max size of the inputs.

4.5 Results on MS COCO

Finally we evaluate our DSOD on the MS COCO dataset [69]. MS COCO contains 80k images for training, 40k for validation and 20k for testing (test-dev set). Following [8, 9], we use the trainval set (train set + validation set) for training. The batch size is also set as 128. The initial learning rate is set to 0.1 for the first 80k iterations, then divided by 10 after every 60k iterations. The total number of training iterations is 320k.
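
The three training setups described so far (VOC 2007, VOC 2012 and MS COCO) share the same step-wise learning rate rule with different milestones. The sketch below is one possible reading of that rule, not the authors' solver configuration: it assumes the first decay happens exactly at the end of the initial constant phase and every subsequent decay follows at a fixed interval. The function and parameter names are hypothetical.

```python
def step_lr(iteration, base_lr=0.1, first_drop=20_000, drop_every=20_000, factor=10.0):
    """Step-wise learning rate schedule.

    The rate stays at base_lr until `first_drop` iterations, is divided by
    `factor` at that point, and is divided again after every further
    `drop_every` iterations.
    """
    if iteration < first_drop:
        return base_lr
    num_drops = 1 + (iteration - first_drop) // drop_every
    return base_lr / factor ** num_drops


# Settings as described in the text (total iterations: 100k, 110k, 320k).
schedules = {
    "VOC 2007": dict(first_drop=20_000, drop_every=20_000, total=100_000),
    "VOC 2012": dict(first_drop=30_000, drop_every=20_000, total=110_000),
    "MS COCO": dict(first_drop=80_000, drop_every=60_000, total=320_000),
}

for name, cfg in schedules.items():
    last_iter = cfg["total"] - 1
    final_lr = step_lr(last_iter, first_drop=cfg["first_drop"], drop_every=cfg["drop_every"])
    print(f"{name}: learning rate at the last iteration = {final_lr:g}")
```

Under this reading, all three schedules apply four decays over the course of training.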

Results are summarized in Tab. VIII. Our DSOD300 achieves 29.3%/47.3% (AP@[0.5:0.95]/AP@0.5) on the test-dev set, which outperforms the baseline SSD300 by a large margin. Our result is comparable to the single-scale R-FCN, and is close to R-FCNmulti-sc, which uses ResNet-101 as the pre-trained model. Interestingly, our result at 0.5 IoU is lower than that of R-FCN, but our [0.5:0.95] result is better or comparable. This indicates that our predicted locations are more accurate than those of R-FCN under larger overlap settings. It is reasonable that our small object detection precision is slightly lower than that of R-FCN, since our input image size (300×300) is much smaller than R-FCN’s (~600×1000). Even with this disadvantage, our large object detection precision is still much better than that of R-FCN. This further demonstrates the effectiveness of our approach. Fig. 4 shows some qualitative detection examples on COCO with our DSOD300 model.
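
The argument about larger overlap settings can be made concrete with a small calculation. The COCO metric AP@[0.5:0.95] is the mean of AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95, so the average AP over the nine thresholds above 0.5 can be recovered from the two reported numbers. The sketch below does this for DSOD300 and single-scale R-FCN using the values in Tab. VIII; since those values are rounded, the result is only approximate, and the function name is illustrative.

```python
NUM_IOU_THRESHOLDS = 10  # 0.50, 0.55, ..., 0.95


def mean_ap_above_050(ap_avg, ap_050):
    """Average AP over the nine IoU thresholds 0.55..0.95.

    ap_avg is AP@[0.5:0.95] (mean over ten thresholds) and ap_050 is AP@0.5.
    """
    return (NUM_IOU_THRESHOLDS * ap_avg - ap_050) / (NUM_IOU_THRESHOLDS - 1)


dsod = mean_ap_above_050(ap_avg=29.3, ap_050=47.3)  # DSOD300, Tab. VIII
rfcn = mean_ap_above_050(ap_avg=29.2, ap_050=51.5)  # R-FCN (single scale), Tab. VIII

print(f"DSOD300 mean AP over IoU 0.55..0.95: {dsod:.1f}")
print(f"R-FCN   mean AP over IoU 0.55..0.95: {rfcn:.1f}")
```

With these inputs, DSOD300 averages about 27.3 over the stricter thresholds against about 26.7 for R-FCN, even though its AP at 0.5 IoU is lower.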


4.6 From MS COCO to PASCAL VOC

Next, we investigate how the MS COCO dataset can further help with the detection performance on PASCAL VOC. We use the DSOD model trained on COCO (without the ImageNet pre-trained model) to initialize the network weights. Then another DSOD is fine-tuned on the PASCAL VOC datasets with a small initial learning rate (0.001). This leads to 81.7% mAP on PASCAL VOC 2007 and 79.3% mAP on PASCAL VOC 2012, respectively. The extra data from the COCO set increases the mAP by 4.0% on PASCAL VOC 2007 and 3.0% on VOC 2012. The results verify that although our DSOD models are trained with fewer images, they have not yet overfitted to the PASCAL VOC datasets and still have room for improvement.

4.7 From DSOD to DSOD (v2)

Compared with DSOD, DSOD v2 includes the extra DSS module to further enhance the supervision signal in the training-from-scratch scenario. The comparison results of DSOD and DSOD v2 are shown in Tab. IX. We can see that DSOD v2 improves the performance consistently on both PASCAL VOC and COCO datasets under different training sets. In DSOD v2, we also replace the pre-activation BN [70] in DSOD with post-activation (i.e., replacing the BN-ReLU-Conv order with Conv-BN-ReLU), as shown in Fig. 5. We observe that this change improves the detection performance by about 0.6% mAP.
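
The difference between the two orderings can be sketched with a toy example. The code below is not the DSOD implementation: it uses hypothetical one-dimensional stand-ins (a scalar affine map in place of a convolution, and normalization without learned scale and shift) purely to show the order in which the three operations are composed.

```python
import math


def batch_norm(xs, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def relu(xs):
    return [max(0.0, x) for x in xs]


def conv(xs, weight=-0.5, bias=0.1):
    """Toy stand-in for a convolution: a scalar affine map (assumption)."""
    return [weight * x + bias for x in xs]


def pre_activation_unit(xs):
    """BN-ReLU-Conv ordering, as used in DSOD."""
    return conv(relu(batch_norm(xs)))


def post_activation_unit(xs):
    """Conv-BN-ReLU ordering, as used in DSOD v2."""
    return relu(batch_norm(conv(xs)))


inputs = [1.0, 2.0, 3.0, 4.0]
print("pre-activation :", pre_activation_unit(inputs))
print("post-activation:", post_activation_unit(inputs))
```

One visible consequence of the ordering is that the post-activation unit ends with a ReLU, so its outputs are non-negative, whereas the pre-activation unit ends with the convolution and can emit negative values.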

Fig. 7: Distributions and trendlines of top-ranked false positive (FP) types. Each plot shows the evolving distribution (trendline) of FP types as more FP examples are considered. Each FP is categorized into 1 of 4 types: Loc: poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5); Sim: confusion with a similar category; Oth: confusion with a dissimilar object category; BG: a FP that fired on background. See [71] for more details.
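
The four FP types in the caption above can be sketched as a simple decision rule. The code below is not the analysis tool of [71]; it is a minimal illustration under stated assumptions: the detection is already known to be a false positive, boxes are (x1, y1, x2, y2) tuples, the `similar` mapping of confusable categories is supplied by the caller, and the 0.1 IoU threshold used for the Sim and Oth types is an assumption carried over from the Loc definition. All names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def categorize_false_positive(det_box, det_label, ground_truths, similar, min_iou=0.1):
    """Assign a false positive to Loc, Sim, Oth or BG.

    ground_truths: list of (box, label) pairs for the image.
    similar: dict mapping a label to the set of labels considered similar.
    """
    best = {"same": 0.0, "sim": 0.0, "oth": 0.0}
    for box, label in ground_truths:
        if label == det_label:
            key = "same"
        elif label in similar.get(det_label, set()):
            key = "sim"
        else:
            key = "oth"
        best[key] = max(best[key], iou(det_box, box))

    # Overlap with the correct class that is too low to count as a match.
    # An FP with IoU >= 0.5 here would be a duplicate detection; this sketch
    # also files it under Loc (assumption).
    if best["same"] >= min_iou:
        return "Loc"
    if best["sim"] >= min_iou:
        return "Sim"
    if best["oth"] >= min_iou:
        return "Oth"
    return "BG"


similar = {"cat": {"dog"}, "dog": {"cat"}}
detection = (0.0, 0.0, 10.0, 10.0)
print(categorize_false_positive(detection, "cat", [((0.0, 0.0, 10.0, 30.0), "cat")], similar))
print(categorize_false_positive(detection, "cat", [((0.0, 0.0, 10.0, 12.0), "dog")], similar))
print(categorize_false_positive(detection, "cat", [((0.0, 0.0, 10.0, 12.0), "car")], similar))
print(categorize_false_positive(detection, "cat", [], similar))
```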

4.8 Comparisons of State-of-the-art Two-Stage Detectors

In this section, we compare our results with state-of-the-art two-stage detectors, including Faster RCNN + FPN and Mask RCNN + FPN. For fair comparisons, we resize the short side of inputs to 300 for these two-stage detectors. The full comparisons are shown in Tab. X. We can observe that DSOD300 (29.3% mAP) achieves results comparable to FPN300/500 (ResNet-101 backbone, 29.4% mAP), while the number of parameters of DSOD (21.8M) is only about 1/6 of that of FPN300/500 with ResNet-101 (121.2M). The performance of our DSOD300* v2 (30.4% mAP) is even slightly better than Mask RCNN + FPN300/500 with ResNet-101 (30.2% mAP) while requiring only about 1/3 of the parameters (37.3M vs. 122.4M). The results show the advantages and potential of our proposed methods.

4.9 Comparisons of Different Input Sizes

Intuitively, larger input images will bring better performance for object detection. We verify this by using four input resolutions (300, 360, 440 and 512) while keeping 4 images on each GPU during training (the total batch size is still 128). The results on PASCAL VOC are illustrated in Fig. 8. We can observe that larger inputs obtain higher accuracy, which is consistent with our conjecture.

Fig. 8: Accuracy under different input sizes.

4.10 Models and Results Analysis

To understand the failure modes of our methods and the error differences between the baseline SSD and our methods, we analyze two aspects: (1) the sensitivity to object characteristics, shown in Fig. 6; and (2) the distribution and trendline of top-ranked false positive (FP) types, shown in Fig. 7. We adopted the publicly available detection analysis tool from Hoiem et al. [71] for these illustrations. Further explanation is given in the captions of these two figures.

Fig. 9: More examples of object detection results on the PASCAL VOC 2012 test set using DSOD300. The training data is VOC 2007 trainval, VOC 2007 test, VOC 2012 trainval and MS COCO trainval (79.3% mAP@0.5 on the test set). Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used for displaying. For each image, one color corresponds to one object category in that image.

5 Discussion

Better Model Structure vs. More Training Data. An emerging idea in the computer vision community is that object detection or other vision tasks might be solved with deeper and larger neural networks backed by massive training data like ImageNet [3]. Thus more and more large-scale datasets have been collected and released recently, such as the Open Images dataset [72], which has 7.5x more images and 6x more categories than ImageNet. We agree that, given boundless training data and unlimited computational power, deep neural networks should perform extremely well. However, our proposed approach and experimental results suggest an alternative view: a better model structure might enable similar or better performance compared with complex models trained on large data. In particular, our DSOD is trained with only 16,551 images on VOC 2007, but achieves competitive or even better performance than models trained with 1.2 million + 16,551 images.

On this premise, it is worth restating the intuition that as datasets grow larger, training deep neural networks becomes more and more expensive. Thus a simple yet efficient approach becomes increasingly important. Despite its conceptual simplicity, our approach shows great potential under this setting.

Why Training from Scratch? There are many successful cases in which fine-tuning works well and achieves consistent improvement, especially in object detection. So why do we still need to train object detectors from scratch? As mentioned briefly above, training from scratch matters for at least two reasons. First, there may be large domain differences between the pre-training domain and the target one. For instance, most pre-trained models are learned on large-scale RGB datasets like ImageNet. It is fairly difficult to transfer RGB models to depth images, multi-spectrum images, medical images, etc. Some advanced domain adaptation techniques have been proposed and could mitigate this problem, but a technique that can train object detectors from scratch would avoid it altogether. Second, fine-tuning restricts the design space of network structures for object detection. This is critical when deploying deep neural networks in resource-limited Internet-of-Things (IoT) scenarios.

Model Compactness vs. Performance. The trade-off between model compactness (in terms of the number of parameters) and performance is important for applying deep neural networks in real detection scenarios. Most CNN-based detection solutions require a huge memory space to store their massive parameters. Therefore these models are usually unsuitable for low-end devices like mobile phones and embedded electronics. Thanks to the parameter-efficient dense connections, our model is much smaller than most competitive methods. For instance, our smallest dense model (DS/64-64-16-1, with dense prediction layers) achieves 73.6% mAP with only 5.9M parameters, which shows great potential for applications on low-end devices. Adopting network pruning methods [73, 74] to further reduce the parameters and speed up inference is a promising direction for CNN-based object detection, and will be investigated in the future.

How to Train Two-Stage Detectors from Scratch. Some recent works [75, 63] have observed that new techniques (e.g., Sync BN [16], Group Norm [63], Switchable Norm [76], etc.) and more training epochs make it possible to train two-stage detectors from scratch. We also ran some preliminary experiments on the PASCAL VOC 2007 dataset (limited training data) with two-stage detectors trained from scratch (using VGG16 as the backbone network and a standard training budget). As shown in Tab. XI, our results indicate that replacing RoI Pool with RoI Align and adopting advanced normalization methods make it possible to train two-stage detectors from scratch.

BN Sync_BN RoI Pool RoI Align mAP (%)
TABLE XI: Comparison of performance (mAP) using different building designs in two-stage object detectors when training from scratch. The backbone network is VGG16 [51]. All models are trained on VOC 07 [77] trainval set and tested on test set.

6 Conclusion

We have presented Deeply Supervised Object Detector (DSOD), a simple yet efficient framework for learning object detectors from scratch. Without using pre-trained models from ImageNet, DSOD demonstrates competitive performance compared with state-of-the-art detectors such as SSD, Faster R-CNN, R-FCN, FPN and Mask RCNN on the popular PASCAL VOC 2007, 2012 and MS COCO datasets, with only 1/2, 1/4 and 1/10 of the parameters of SSD, R-FCN and Faster R-CNN, respectively. Owing to its learning-from-scratch property, DSOD has great potential in domain-different scenarios, such as depth, medical and multi-spectral images. Our future work will consider learning object detectors directly in these diverse domains, as well as learning ultra-efficient DSOD models to support resource-bounded devices.


Yu-Gang Jiang and Xiangyang Xue were supported in part by National Key R&D Program of China (No.2017YFC0803700), NSFC under Grant (No.61572138 & No.U1611461) and STCSM Project under Grant No.16JC1420400.


  • [1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [2] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li et al., “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [4] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
  • [5] W. Cui, G. Zheng, Z. Shen, S. Jiang, and W. Wang, “Transfer learning for sequences via learning to collocate,” in International Conference on Learning Representations, 2019.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [7] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
  • [8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
  • [9] Y. Li, K. He, J. Sun et al., “R-fcn: Object detection via region-based fully convolutional networks,” in NIPS, 2016.
  • [10] W. Liu, D. Anguelov, D. Erhan et al., “Ssd: Single shot multibox detector,” in ECCV, 2016.
  • [11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, 2016.
  • [12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [13] T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards accurate region proposal generation and joint object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 845–853.
  • [14] S. Bell, C. Lawrence Zitnick et al., “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in CVPR, 2016.
  • [15] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, “Ron: Reverse connection with objectness prior networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5936–5944.
  • [16] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, “Megdet: A large mini-batch object detector,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6181–6189.
  • [17] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection snip,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
  • [18] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
  • [19] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
  • [20] B. Bosquet, M. Mucientes, and V. M. Brea, “Stdnet: A convnet for small target detection,” in BMVC, 2018.
  • [21] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa, “Deep regionlets for object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 798–814.
  • [22] R. J. Wang, X. Li, and C. X. Ling, “Pelee: A real-time object detection system on mobile devices,” in Advances in Neural Information Processing Systems, 2018, pp. 1963–1972.
  • [23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [24] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, 2015.
  • [25] L.-C. Chen, G. Papandreou, I. Kokkinos et al., “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in ICLR, 2015.
  • [26] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
  • [27] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in ECCV, 2014.
  • [28] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, 2015.
  • [29] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple granularity descriptors for fine-grained categorization,” in ICCV, 2015.
  • [30] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in CVPR, 2015.
  • [31] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
  • [32] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • [33] H. Fang, S. Gupta et al., “From captions to visual concepts and back,” in CVPR, 2015.
  • [34] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y.-G. Jiang, and X. Xue, “Weakly supervised dense video captioning,” in CVPR, 2017.
  • [35] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-captioning events in videos,” in ICCV, 2017.
  • [36] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” in CVPR, 2016.
  • [37] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for supervision transfer,” in CVPR, 2016.
  • [38] C.-Y. Lee, S. Xie, P. W. Gallagher et al., “Deeply-supervised nets.” in AISTATS, 2015.
  • [39] S. Xie and Z. Tu, “Holistically-nested edge detection,” in ICCV, 2015.
  • [40] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in CVPR, 2017.
  • [41] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in ICCV, 2017.
  • [42] Z. Shen, H. Shi, R. Feris, L. Cao, S. Yan, D. Liu, X. Wang, X. Xue, and T. S. Huang, “Learning object detectors from scratch with gated recurrent feature pyramids,” arXiv preprint arXiv:1712.00886, 2017.
  • [43] Y. Li, J. Li, W. Lin, and J. Li, “Tiny-dsod: Lightweight object detection for resource-restricted usages,” arXiv preprint arXiv:1807.11013, 2018.
  • [44] J. R. Uijlings, K. E. Van De Sande, T. Gevers et al., “Selective search for object recognition,” IJCV, 2013.
  • [45] P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu, “Scale-transferrable object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 528–537.
  • [46] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
  • [47] S. Liu, D. Huang et al., “Receptive field block net for accurate and fast object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 385–400.
  • [48] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
  • [49] X. Zhou, J. Zhuo, and P. Krähenbühl, “Bottom-up object detection by grouping extreme and center points,” arXiv preprint arXiv:1901.08043, 2019.
  • [50] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [51] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet et al., “Going deeper with convolutions,” in CVPR, 2015.
  • [53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [54] N. Srivastava, G. E. Hinton, A. Krizhevsky et al., “Dropout: a simple way to prevent neural networks from overfitting.” JMLR, 2014.
  • [55] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [56] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in CVPR, 2017.
  • [57] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [58] K.-H. Kim, S. Hong, B. Roh et al., “Pvanet: Deep but lightweight neural networks for real-time object detection,” arXiv preprint arXiv:1608.08021, 2016.
  • [59] J. Huang, V. Rathod, C. Sun et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in CVPR, 2017.
  • [60] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in ICLR workshop, 2016.
  • [61] S. Jégou, M. Drozdzal, D. Vazquez et al., “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” arXiv preprint arXiv:1611.09326, 2016.
  • [62] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, “Megdet: A large mini-batch object detector,” arXiv preprint arXiv:1711.07240, 2017.
  • [63] Y. Wu and K. He, “Group normalization,” arXiv preprint arXiv:1803.08494, 2018.
  • [64] Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition with very deep neural networks,” arXiv preprint arXiv:1502.00873, 2015.
  • [65] C. Szegedy, V. Vanhoucke, S. Ioffe et al., “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
  • [66] Y. Jia, E. Shelhamer, J. Donahue et al., “Caffe: Convolutional architecture for fast feature embedding,” in ACM MM, 2014.
  • [67] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
  • [68] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in AISTATS, 2010.
  • [69] T.-Y. Lin, M. Maire, S. Belongie et al., “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • [70] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016.
  • [71] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in ECCV, 2012.
  • [72] I. Krasin, T. Duerig, N. Alldrin, A. Veit et al., “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” https://github.com/openimages, 2016.
  • [73] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in ICLR, 2016.
  • [74] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in ICCV, 2017.
  • [75] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” arXiv preprint arXiv:1811.08883, 2018.
  • [76] P. Luo, J. Ren, and Z. Peng, “Differentiable learning-to-normalize via switchable normalization,” arXiv preprint arXiv:1806.10779, 2018.
  • [77] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.