ScratchDet: Exploring to Train Single-Shot Object Detectors from Scratch

10/19/2018 ∙ by Rui Zhu, et al.

Current state-of-the-art object detectors are fine-tuned from off-the-shelf networks pretrained on large-scale classification datasets like ImageNet, which incurs several additional problems: 1) the domain gap between source and target datasets; 2) the learning objective bias between classification and detection; 3) the architecture limitations of the classification network for detection. In this paper, we design a new single-shot train-from-scratch object detector, called ScratchDet, based on the architectures of the ResNet and VGGNet based SSD models, to alleviate the aforementioned problems. Specifically, we study the impact of BatchNorm on training detectors from scratch, and find that using BatchNorm on the backbone and detection head subnetworks makes the detector converge well from scratch. After that, we explore the network architecture by analyzing the detection performance of ResNet and VGGNet, and introduce a new Root-ResNet backbone network to further improve the accuracy. Extensive experiments on the PASCAL VOC 2007, 2012 and MS COCO datasets demonstrate that ScratchDet achieves state-of-the-art performance among all the train-from-scratch detectors and even outperforms existing one-stage pretrained methods without bells and whistles. Code will be made publicly available at https://github.com/KimSoybean/ScratchDet.


1 Introduction

Object detection has made great progress in the framework of convolutional neural networks (CNNs). The current state-of-the-art detectors are generally fine-tuned from high-accuracy classification networks, e.g., VGGNet [35], ResNet [11] and GoogLeNet [36] pretrained on the ImageNet [28] dataset. The fine-tuning transfers the classification knowledge learned from the source domain to handle the object detection task. In general, fine-tuning from pretrained networks can achieve better performance than training from scratch.

However, there is no such thing as a free lunch. Fine-tuning pretrained networks for object detection has some critical limitations. On the one hand, the classification and detection tasks have different degrees of sensitivity to translation. The classification task prefers translation invariance, and thus needs downsampling operations (e.g., max-pooling and strided convolution) for better performance. In contrast, local texture information is more critical for object detection, so translation-invariant operations (e.g., downsampling operations) must be used with caution. On the other hand, it is inconvenient to change the architecture of networks (even slightly) in the fine-tuning process. If we employ a new architecture, the pretraining has to be repeated on the large-scale dataset (e.g., ImageNet), at a high computational cost.

Fortunately, training detectors from scratch is able to eliminate the aforementioned limitations. DSOD [31] is the first to train CNN detectors from scratch, in which deep supervision plays a critical role. Deep supervision is introduced in DenseNet [12] as the dense layer-wise connection. However, DSOD is also limited by the predefined architecture of DenseNet. If DSOD employs other types of network (e.g., VGGNet and ResNet), the performance decreases dramatically (and training sometimes even crashes). Besides, the best performance of trained-from-scratch detectors still lags behind that of the pretrained ones. Therefore, if we hope to take advantage of training detectors from scratch, two improvements are needed: (1) removing the architecture limitations for any type of network while guaranteeing training convergence, and (2) achieving performance as good as pretrained networks (or even better).

To this end, we study the elements that have a major impact on the optimization of a detector given a randomly initialized network. As pointed out in [29], BatchNorm reparameterizes the optimization problem to make its landscape significantly smoother instead of reducing the internal covariate shift. Based on this theory, we assume that the lack of BatchNorm in training detectors from scratch is the main reason for poor convergence. Thus, we integrate BatchNorm into both the backbone and detection head subnetworks (Figure 2), and find that BatchNorm helps the detector converge well with any form of network (including VGGNet and ResNet) without pretraining, and noticeably surpass the accuracy of the pretrained baselines. We are therefore free to modify the architecture without restrictions from pretrained models. Taking advantage of this, we analyze the performance of the ResNet and VGGNet based SSD [23] detectors with various configurations, and discover that the sampling stride in the first convolution layer has a great impact on detection performance. Based on this point, we redesign the architecture of the detector by introducing a new root block, which keeps abundant information for the detection feature maps and substantially improves the detection accuracy, especially for small objects. We report extensive experiments on the PASCAL VOC 2007 [5], PASCAL VOC 2012 [6] and MS COCO [22] datasets to demonstrate that our ScratchDet performs better than some pretrained detectors and all the state-of-the-art train-from-scratch detectors, e.g., improving on the best train-from-scratch results by 1.7% mAP on VOC 2007 and 1.5% mAP on VOC 2012 (Table 3), and by 2.7% AP on COCO (Table 4).

The main contributions of this paper are summarized as follows. (1) We present a single-shot object detector trained from scratch, named ScratchDet, which integrates BatchNorm to help the detector converge well from scratch, independent of the type of network. (2) We introduce a new Root-ResNet backbone network based on the newly designed root block, which noticeably improves the detection accuracy, especially for small objects. (3) ScratchDet performs favourably against the state-of-the-art train-from-scratch detectors and some pretrained detectors.

Figure 1: Optimization landscape analysis. (a) The training loss value. (b) L2 norm of the gradient. (c) Fluctuation of the L2 norm of the gradient (smoothed). The blue curve is the original SSD; the red and green curves represent SSD trained with BatchNorm in the head subnetwork with the default and a larger base learning rate, respectively. BatchNorm makes the optimization landscape smoother and the gradients more stable (red vs. blue). With this advantage, we are able to set a larger learning rate (green) to search a larger space and converge faster, and thus reach a better solution.

2 Related Work

Object detectors with pretrained network. Most CNN-based object detectors are fine-tuned from networks pretrained on ImageNet. Generally, they can be divided into two categories: the two-stage and the one-stage approach. The two-stage approach first generates a set of candidate object proposals, and then predicts the accurate object regions and the corresponding class labels. With the gradual improvements from Faster R-CNN [27], R-FCN [3], FPN [20] to Mask R-CNN [10], the two-stage methods achieve top performance on several challenging datasets, e.g., PASCAL VOC and MS COCO. Recent developments of the two-stage approach focus on redesigning the architecture diagram [19] and the convolution form [4], re-ranking detection scores [2], using contextual reasoning [1] and exploiting multiple layers for prediction [18].

Pursuing high efficiency, the one-stage approach, which simultaneously regresses the object locations and sizes and the corresponding class labels, has attracted much attention in recent years. OverFeat [30] is one of the first one-stage detectors and since then, several other methods have been proposed, such as YOLO [25, 26] and SSD [23]. Recent research on the one-stage approach focuses on enriching features for detection [7], designing different architectures [38] and addressing the class imbalance issue [39, 21].

Train-from-scratch object detectors. DSOD [31] first trains the one-stage object detector from scratch and presents a series of principles to produce good performance. GRP-DSOD [32] improves the DSOD algorithm by applying the Gated Recurrent Feature Pyramid. These two methods focus on deep supervision of DenseNet but lose sight of the effect of BatchNorm on optimization and the flexibility of network architecture for training detectors from scratch.

Batch normalization. BatchNorm [13] addresses the internal covariate shift problem by normalizing layer inputs, which makes it feasible to use a large learning rate to accelerate network training. More recently, Santurkar et al. [29] provide both empirical demonstration and theoretical justification for the explanation that BatchNorm makes the optimization landscape significantly smoother instead of reducing internal covariate shift.
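
For reference, the normalization performed by BatchNorm [13] can be written as below. This is the standard formulation, not something specific to ScratchDet; the notation (a mini-batch of $m$ activations $x_i$ of one channel, learnable scale $\gamma$ and shift $\beta$, and a small constant $\epsilon$ for numerical stability) is introduced here only for illustration.

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
```

In a convolution layer these statistics are computed per channel over the batch and spatial dimensions, so their quality depends on how many samples each mini-batch contains.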

3 ScratchDet

In this section, we first study the effectiveness of BatchNorm for training SSD from scratch. Then, we redesign the backbone network by analyzing the detection performance of the ResNet and VGGNet based SSD.

3.1 BatchNorm for Train-from-Scratch

Without loss of generality, we apply BatchNorm to SSD, which is the most common one-stage framework. SSD is formed by the backbone subnetwork (e.g., truncated VGGNet-16 with several additional convolution blocks) and the detection head subnetwork (i.e., the prediction blocks after each detection layer, each of which consists of one bounding box regression convolution layer and one class label prediction convolution layer). Notice that there is no BatchNorm in the original SSD framework. Motivated by recent work [29], we believe that using BatchNorm is helpful for training SSD from scratch. BatchNorm makes the optimization landscape significantly smoother, inducing a more predictable and stable behaviour of the gradients, which allows for a larger search space and faster convergence. DSOD successfully trains detectors from scratch; however, it attributes the results to the deep supervision of DenseNet without emphasizing the effect of BatchNorm. We believe that it is necessary to study the impact of BatchNorm on training detectors from scratch. To verify our argument, we train SSD from scratch without BatchNorm as our baseline. As listed in the first column of Table 1, our baseline produces 67.6% mAP on the VOC 2007 test set.

BatchNorm in the backbone subnetwork. We add BatchNorm to each convolution layer in the backbone subnetwork and then train it from scratch. As shown in Table 1, using BatchNorm in the backbone network improves mAP by 5.2% (from 67.6% to 72.8%). More importantly, adding BatchNorm in the backbone network makes the optimization landscape significantly smoother. Thus, we can use larger learning rates (0.01 and 0.05) to further improve the performance (i.e., mAP is improved from 72.8% to 77.8% and 78.0%). Both of them outperform SSD fine-tuned from the pretrained VGG-16 model [23]. These results indicate that adding BatchNorm in the backbone subnetwork is one of the critical factors for training SSD from scratch. In the appendix, we draw the optimization landscape analysis curves for adding BatchNorm in the backbone subnetwork, similar to Figure 1.

BatchNorm in the detection head subnetwork. To analyze the effect of BatchNorm in the detection head subnetwork, we plot the training loss value, the L2 norm of the gradient, and the fluctuation of the L2 norm of the gradient vs. training steps. As shown by the blue curve in Figure 1(b) and 1(c), training SSD from scratch with the default learning rate leads to a large fluctuation of the L2 norm of the gradient, especially in the initial phase of training, which makes the loss value change suddenly and converge to a bad local minimum (i.e., a relatively high loss at the end of the training process in Figure 1(a) and bad detection results, 67.6% mAP). These results help to explain the phenomenon that using a large learning rate to train SSD with the original architecture, either from scratch or from pretrained networks, usually leads to gradient explosion, poor stability and poorly predictable gradients (see Table 1).

In contrast, integrating BatchNorm in the detection head subnetwork makes the loss landscape smoother (see red curves in Figure 1), which improves mAP from 67.6% to 71.0% (listed in Table 1). The smooth landscape allows us to set a larger learning rate, which brings about a larger search space and faster convergence (see Figure 1(a) and 1(c)). As a result, the mAP improves from 71.0% to 75.6%. Besides, with BatchNorm, a larger learning rate also helps the optimization escape bad local minima and produces stable gradients (green curve in Figure 1(b) and 1(c)).

BatchNorm in the whole network. We also study the performance of the detector using BatchNorm in both the backbone and detection head subnetworks. After using BatchNorm in the whole network of the detector, we are able to use a larger base learning rate (0.05) to train the detector from scratch, which produces a higher mAP (78.7%) than the detector initialized with the pretrained VGG-16 backbone subnetwork. Please see Table 1 for more details.

3.2 Backbone Network

As described above, we train SSD with BatchNorm from scratch and achieve better accuracy than the pretrained SSD. This encourages us to train detectors from scratch while keeping the performance independent of the network architecture. Taking advantage of this, we are able to explore various types of network for the object detection task.

Performance analysis of ResNet and VGGNet. The truncated VGG-16 and ResNet-101 are two popular backbone networks used in SSD (a brief structure overview is given in Figure 2). In general, ResNet-101 produces better classification results than VGG-16 (i.e., a lower top-5 classification error on ImageNet). However, as indicated in DSSD [7], the VGG-16 based SSD performs better than the ResNet-101 based SSD with a relatively small input size (e.g., 300×300) on PASCAL VOC. We argue that this phenomenon is caused by the downsampling operation in the first convolution layer (i.e., conv1_x with stride 2) of ResNet-101. This operation significantly affects the detection accuracy, especially for small objects (see Table 2). After we remove the downsampling operation in conv1_x of ResNet-18 to form ResNet-18-B in Figure 3(c), the detection performance improves by a big margin, from 73.1% to 77.6% mAP. We also remove the second downsampling operation to form ResNet-18-A in Figure 3(b), whose improvement is relatively small. In summary, the downsampling operation in the first convolution layer has a bad impact on the detection accuracy, especially for small objects.
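
The effect of early downsampling on spatial resolution can be checked with simple arithmetic. The sketch below assumes the standard ResNet stem (a 7×7 convolution with padding 3, followed by a 3×3 max-pooling with stride 2 and padding 1), floor rounding, and a 300×300 input. The helper names are hypothetical and are not taken from the ScratchDet code.

```python
def out_size(size, kernel, stride, pad):
    """Output width/height of a conv or pooling layer, using floor rounding."""
    return (size + 2 * pad - kernel) // stride + 1


def stem_output_size(input_size, conv1_stride, keep_pool):
    """Resolution after the ResNet stem (assumed: 7x7 conv, then 3x3 max-pool)."""
    size = out_size(input_size, kernel=7, stride=conv1_stride, pad=3)
    if keep_pool:
        size = out_size(size, kernel=3, stride=2, pad=1)
    return size


# Variant names follow Figure 3; 300 is the assumed input width/height.
variants = {
    "ResNet-18": stem_output_size(300, conv1_stride=2, keep_pool=True),
    "ResNet-18-A": stem_output_size(300, conv1_stride=2, keep_pool=False),
    "ResNet-18-B": stem_output_size(300, conv1_stride=1, keep_pool=True),
}
for name, size in variants.items():
    print(f"{name}: {size}x{size} feature map after the stem")
```

Under these assumptions both modified variants leave the stem at the same resolution; they differ in whether the first convolution itself sees the image at full resolution.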

Figure 2: Brief overview of SSD based on VGG-16 and ResNet-101. BatchNorm layers are omitted for clarity. As shown in Figure 3 and Table 2, the stride of 2 in the first layer of ResNet degrades performance on PASCAL VOC with a small input size.

Backbone network redesign for object detection. To overcome the disadvantages of the ResNet based backbone network for object detection while retaining its powerful classification ability, we design a new architecture, named Root-ResNet, which is an improvement of the truncated ResNet in the original SSD detector, shown in Figure 3(d). We remove the downsampling operation in the first conv layer and replace the 7×7 convolution kernel with a stack of 3×3 convolution filters (denoted as the root block). With this richer input, Root-ResNet is able to exploit more local information from the image, so as to extract powerful features for small object detection. Furthermore, we replace the four convolution blocks (added by SSD to extract feature maps at different scales) with four residual blocks appended to the end of the Root-ResNet, shown in the appendix. Each residual block is formed by two branches: one branch is a single convolution layer, and the other consists of two stacked convolution layers. Every convolution layer in these blocks has the same number of output channels. These residual blocks bring efficiency in parameters and computation without a drop in performance.
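
One way to see why a stack of small filters can stand in for one large kernel is to compare receptive fields. The sketch below uses the standard recurrence for stacked convolutions. The 3×3 and 7×7 kernel sizes follow Table 2, while the channel widths (3 input channels, 64 channels per layer) are assumptions made here only to count weights; the function names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers given (kernel, stride) pairs."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride                # strides spread later layers over more input pixels
    return field


def weight_count(kernels, channels):
    """Number of conv weights (no biases) for a chain of square kernels.

    `channels` has one more entry than `kernels`: the input width first,
    then the output width of every layer.
    """
    return sum(k * k * c_in * c_out
               for k, c_in, c_out in zip(kernels, channels[:-1], channels[1:]))


single_7x7 = receptive_field([(7, 1)])
root_block = receptive_field([(3, 1), (3, 1), (3, 1)])
print(single_7x7, root_block)  # both cover a 7x7 input window

# Assumed widths: RGB input and 64 channels per layer (illustrative only).
print(weight_count([7], [3, 64]))
print(weight_count([3, 3, 3], [3, 64, 64, 64]))
```

Under these assumed widths the stack carries more weights than the single kernel, which is in line with the lower FPS of deeper root blocks reported in Table 2.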

Figure 3: Illustration of networks in Section 4.2.2. (a) ResNet-18: original structure. (b) ResNet-18-A: removing the first max-pooling layer. (c) ResNet-18-B: changing the stride size in the first conv layer from 2 to 1. (d) Root-ResNet-18: replacing the 7×7 conv layer with three stacked 3×3 conv layers in ResNet-18-B. The corresponding mAPs on PASCAL 2007 test (training on “07+12” from scratch) are , , and , respectively. Notably, for a fair comparison, no matter how we modify the structure, the spatial sizes of our selected detection layers are the same as SSD300 and DSOD300 (i.e., 38×38, 19×19, 10×10, 5×5, 3×3 and 1×1).
Component       | lr 0.001                      | lr 0.01                       | lr 0.05
pretraining     |  -    -    -    -    ✓    ✓   |  -    -    -    -    ✓    ✓   |  -    -    -    -    ✓    ✓
BN in backbone  |  -    -    ✓    ✓    -    ✓   |  -    -    ✓    ✓    -    ✓   |  -    -    ✓    ✓    -    ✓
BN in head      |  -    ✓    -    ✓    -    ✓   |  -    ✓    -    ✓    -    ✓   |  -    ✓    -    ✓    -    ✓
mAP (%)         | 67.6 71.0 72.8 71.8 77.1 77.6 | NAN  75.6 77.8 77.3 76.9 78.2 | NAN  NAN  78.0 78.7 NAN  75.5
Table 1: Analysis of BatchNorm and learning rate for SSD trained from scratch on the VOC 2007 test set. All the networks are based on the truncated VGG-16 backbone network. The best performance (78.7% mAP) is achieved when three conditions are satisfied: (1) BatchNorm in backbone and head, (2) no pretraining, (3) larger learning rate. “NAN” indicates that the training does not converge.

4 Experiment

We conduct several experiments on the PASCAL VOC and MS COCO datasets, which include 20 and 80 object classes, respectively. The proposed ScratchDet is implemented in the Caffe library [14], and all the code and trained models will be made publicly available.

4.1 Training Details

All models are trained from scratch using SGD with weight decay and momentum on four NVIDIA Tesla P40 GPUs. For a fair comparison, we use the same training settings as the original SSD, including data augmentation, anchor scales and aspect ratios, and loss function. We remove the L2 normalization [24]. Notably, all experiments select detection layers with the same fixed spatial sizes as SSD300 and DSOD300, i.e., we do not use larger feature maps for detection. Following DSOD, we use a relatively large batch size to train our ScratchDet from scratch, in order to ensure stable BatchNorm statistics in the training phase. Meanwhile, we use the default batch size 32 for the pretrained model based SSD (we also tried the batch size 128 for the pretrained model, but the performance did not improve).

Notably, we use “Root-ResNet-18”, redesigned from ResNet-18, as the backbone network in the model analysis to limit the computational cost of the experiments. In the comparison with the state-of-the-art detectors, however, we use a deeper backbone network, “Root-ResNet-34”, for better performance. All the parameters in our ScratchDet are initialized by the “xavier” method [9]. Besides, all the models are trained with the 300×300 input size, and we believe that the accuracy of ScratchDet can be further improved using a larger input size.

4.2 PASCAL VOC 2007

For PASCAL VOC 2007, all models are trained on the VOC 2007 and VOC 2012 trainval sets (16,551 images), and tested on the VOC 2007 test set (4,952 images). We use the same settings and configurations except for some specified changes of model components.

4.2.1 Analysis of BatchNorm

We construct several variants of the original SSD and evaluate them on VOC 2007 to demonstrate the effectiveness of BatchNorm in training SSD from scratch, shown in Table 1.

Without BatchNorm. We train the original SSD from scratch with the large batch size described in Section 4.1. All the other settings are the same as in [23]. As shown in the first column of Table 1, we get 67.6% mAP, which is worse than the detector initialized by the pretrained classification network. In addition, due to the unstable gradient and unsmooth optimization landscape, the training converges only with the learning rate 0.001, and ends in a bad local minimum (see blue curves in Figure 1). As shown in Table 1, if we use larger learning rates (0.01 and 0.05), the training process does not converge.

BatchNorm in the backbone subnetwork. BatchNorm is widely used to enable fast and stable training of deep neural networks. To validate the effectiveness of BatchNorm in the backbone subnetwork, we add the BatchNorm operation to each convolution layer in the truncated VGG-16 network, denoted as VGG-16-BN, and train the VGG-16-BN model based SSD from scratch. As shown in Table 1, using BatchNorm in the backbone network with a relatively large learning rate improves mAP well beyond the 67.6% baseline, reaching 77.8% at learning rate 0.01 and 78.0% at learning rate 0.05.

BatchNorm in the detection head subnetwork. We also study the effectiveness of BatchNorm in the detection head subnetwork. As described before, the detection head subnetwork in SSD is used to predict the locations, sizes and class labels of objects. The original SSD method [23] does not use BatchNorm in the detection head subnetwork. As presented in Table 1, we find that using BatchNorm only on the detection head subnetwork improves mAP from 67.6% to 71.0%. After using the 10 times larger base learning rate 0.01, the performance can be further improved from 71.0% to 75.6%. This noticeable improvement demonstrates the importance of using BatchNorm in the detection head subnetwork.

BatchNorm in the whole network. We use BatchNorm on every convolution layer in SSD and train it from scratch with three different base learning rates (0.001, 0.01 and 0.05). For the 0.001 and 0.01 base learning rates, we achieve 71.8% and 77.3% mAP, respectively. When we use the largest learning rate 0.05, the performance is further improved by 1.4% mAP to 78.7%, which outperforms the pretrained network based SSD detector. These results indicate that using BatchNorm on every convolution layer in SSD is critical for training it from scratch.

BatchNorm for the pretrained network. To validate the effect of BatchNorm for SSD fine-tuned from pretrained networks, we construct a variant of the original SSD, i.e., adding the BatchNorm operation to every convolution layer. The layers in the backbone network are initialized by the pretrained VGG-16-BN model from ImageNet, which is converted from the official PyTorch model. As shown in Table 1, the best result is achieved with learning rate 0.01. Compared to the original SSD fine-tuned from the pretrained network, BatchNorm brings only a small mAP improvement to the detector, far smaller than the improvement for the trained-from-scratch detector (i.e., from 67.6% to 78.7% mAP). (We also tried the batch size 128 with the default settings of SSD for both VGG-16-BN and VGG-16, without improvement.) We would also like to emphasize that ScratchDet produces better performance than the BatchNorm based SSD trained from the pretrained network (i.e., 78.7% vs. 78.2%). The results demonstrate that BatchNorm is more critical for SSD trained from scratch than for SSD fine-tuned from pretrained networks.

BatchNorm in DSOD. DSOD attributes its success to the deep supervision of DenseNet and ignores the effect of BatchNorm. After removing all BatchNorm layers in DSOD, the mAP drops on VOC 2007. Thus, we argue that BatchNorm, rather than deep supervision, is the key to training detectors from scratch, and the experiments in Table 1 validate this point. Besides, training VGG16-based Faster R-CNN from scratch without BatchNorm cannot converge in the DSOD paper, but with BatchNorm it converges successfully, although its mAP is still lower than that of the pretrained one.

4.2.2 Analysis of the backbone subnetwork

We analyze the pros and cons of the ResNet and VGGNet based SSD detectors and redesign the backbone network, called Root-ResNet. Specifically, all the models in these experiments are based on the ResNet-18 backbone network. We also use BatchNorm in the detection head subnetwork. In the training phase, the learning rate is kept at its base value for the first stage of training, and is then divided by a fixed factor at the start of each of three further stages. As shown in Table 2, training SSD from scratch based on ResNet-18 only produces 73.1% mAP. We analyze the reasons as follows.
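
The step schedule described above can be sketched as follows. The base learning rate, decay factor and iteration counts in the example call are placeholders chosen for illustration, not the values used in the paper, and `step_lr` is a hypothetical helper rather than part of the released code.

```python
def step_lr(iteration, base_lr, stage_lengths, factor):
    """Learning rate at a given iteration for a multi-step schedule.

    The rate stays at base_lr for stage_lengths[0] iterations and is divided
    by `factor` at the start of each later stage.
    """
    lr = base_lr
    boundary = 0
    for length in stage_lengths[:-1]:
        boundary += length
        if iteration >= boundary:
            lr /= factor
    return lr


# Placeholder values (assumptions): four stages, divide by 10 at each boundary.
stages = [1000, 500, 300, 200]
for it in (0, 999, 1000, 1500, 1800):
    print(it, step_lr(it, base_lr=0.05, stage_lengths=stages, factor=10))
```

Keeping the stage lengths in a list makes it easy to reuse the same function for schedules with a different number of decay steps.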

Kernel size in the first layer. In contrast to VGG16, the first convolution layer in ResNet-18 uses a relatively large 7×7 kernel with stride 2. We aim to explore the effect of the kernel size of the first convolution layer on the detector trained from scratch. As shown in the first two rows of Table 2, the kernel size of the convolution layer has almost no impact on the performance (i.e., 73.1% for 7×7 vs. 73.2% for 3×3). Using a smaller kernel size produces slightly better results with faster speed. The same conclusion can be obtained when we set the stride of the first convolution layer to 1, i.e., without downsampling; see the fifth and sixth rows of Table 2 for more details.

Downsampling in the first layer. Compared to VGGNet, ResNet-18 uses downsampling in the first convolution layer, leading to considerable loss of local information, which greatly impacts the detection performance, especially for small objects. As shown in Table 2, after removing the downsampling operation in the first layer (i.e., ResNet-18-B in Figure 3), mAP improves by 4.5% and 4.6% for the 7×7 and 3×3 kernel sizes, respectively. When we only remove the second downsampling operation and keep the first stride at 2 (i.e., ResNet-18-A in Figure 3), the resulting mAP is lower than that obtained by modifying the first layer (77.6% mAP). These results demonstrate that the downsampling operation in the first convolution layer is the obstacle to good results. We need to remove this operation when training ResNet based SSD from scratch.

Number of layers in the root block. Inspired by DSOD and GoogLeNet-V3 [37], we use several convolution layers with 3×3 kernels to replace the 7×7 convolution layer (i.e., Root-ResNet-18 in Figure 3). Here, we study the impact of the number of stacked convolution layers in the root block on the detection performance in Table 2. As the number of convolution layers increases from 1 to 3, the mAP improves from 77.8% to 78.5%. However, the accuracy no longer improves once the number of stacked layers exceeds 3. We believe that three convolution layers in the root block are enough to learn the information from raw images, and adding more layers cannot boost the accuracy any further. Empirically, we use three convolution layers for the detection task on the PASCAL VOC 2007, 2012 and MS COCO datasets with the 300×300 input size.

The aforementioned conclusions also extend to deeper ResNet backbone networks, e.g., ResNet-34. As shown in Table 3, using Root-ResNet-34, the mAP of our ScratchDet is improved from 78.5% to 80.4%, which is the best result with the 300×300 input size. In the comparison experiments on the benchmarks, we use Root-ResNet-34 as the backbone network.

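First conv layer         Root block   FPS    mAP (%)
with downsampling        1: 7×7       59.5   73.1
with downsampling        1: 3×3       62.9   73.2
with downsampling        2: 3×3       58.1   74.9
with downsampling        3: 3×3       54.5   75.4
without downsampling     1: 7×7       37.0   77.6
without downsampling     1: 3×3       37.2   77.8
without downsampling     2: 3×3       31.5   78.1
without downsampling     3: 3×3       26.9   78.5
without downsampling     4: 3×3       24.3   78.4
without downsampling     5: 3×3       21.8   78.5
Table 2: Analysis of backbone network for SSD trained from scratch on the VOC 2007 test set. All models are based on the ResNet-18 backbone network. The root block column gives the number of stacked convolution layers and their kernel size.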
Method Backbone Input size FPS mAP (%) on VOC 2007 mAP (%) on VOC 2012
pretrained two-stage:
HyperNet [18] VGG-16 0.88 76.3 71.4
Faster R-CNN[27] ResNet-101 2.4 76.4 73.8
ION[1] VGG-16 1.25 76.5 76.4
MR-CNN[8] VGG-16 0.03 78.2 73.9
R-FCN[3] ResNet-101 9 80.5 77.6
CoupleNet[41] ResNet-101 8.2 82.7 80.4
pretrained one-stage:
RON384[17] VGG-16 15 74.2 71.7
SSD321[7] ResNet-101 11.2 77.1 75.4
SSD300[23] VGG16 46 77.2 75.8
YOLOv2[26] Darknet-19 40 78.6 73.4
DSSD321[7] ResNet-101 9.5 78.6 76.3
DES300[40] VGG-16 29.9 79.7 77.1
RefineDet320[38] VGG-16 40.3 80.0 78.1
trained from scratch:
DSOD300[31] DS/64-192-48-1 17.4 77.7 76.3
GRP-DSOD320[32] DS/64-192-48-1 16.7 78.7 77.0
ScratchDet300 Root-ResNet-34 25.0 80.4 78.5
ScratchDet300+ Root-ResNet-34 - - 84.1 83.6
Table 3: Detection results on the PASCAL VOC datasets. For VOC 2007, all methods are trained on the VOC 2007 and 2012 trainval sets and tested on the VOC 2007 test set. For VOC 2012, all methods are trained on the VOC 2007 and 2012 trainval sets plus the VOC 2007 test set, and tested on the VOC 2012 test set. Evaluation server results: http://host.robots.ox.ac.uk:8080/anonymous/0HPCHC.html and http://host.robots.ox.ac.uk:8080/anonymous/JSL6ZY.html

4.2.3 Results

We compare ScratchDet to the state-of-the-art detectors in Table 3. With the small 300×300 input, ScratchDet produces 80.4% mAP without bells and whistles, better than several state-of-the-art one-stage pretrained object detectors (e.g., 80.0% mAP of RefineDet320 and 79.7% mAP of DES300). Note that we keep most of the original SSD configurations and the same number of epochs as DSOD. The result is much better than SSD300-VGG16 (80.4% vs. 77.2%, 3.2% mAP higher) and SSD321-ResNet101 (80.4% vs. 77.1%, 3.3% mAP higher). ScratchDet outperforms the state-of-the-art train-from-scratch detector by 1.7% mAP (i.e., 80.4% vs. 78.7% of GRP-DSOD). In the multi-scale testing, our ScratchDet achieves 84.1% mAP (ScratchDet300+), which is the state of the art.

4.3 PASCAL VOC 2012

Following the evaluation protocol of VOC 2012, we use the VOC 2012 trainval set and the VOC 2007 trainval and test sets (21,503 images) to train our ScratchDet from scratch, and test on the VOC 2012 test set (10,991 images). The detection results of ScratchDet are submitted to the public testing server for evaluation. The learning rate and batch size are the same as those used for VOC 2007.

Table 3 reports the accuracy of ScratchDet as well as the state-of-the-art methods. Using the small 300×300 input size, ScratchDet produces 78.5% mAP, surpassing some one-stage methods with similar input size, e.g., SSD321-ResNet101 (75.4%, 3.1% higher mAP), DES300-VGG16 (77.1%, 1.4% higher mAP), and RefineDet320-VGG16 (78.1%, 0.4% higher mAP). Meanwhile, compared to the two-stage methods based on pretrained networks, ScratchDet also produces better results than R-FCN (77.6%, 0.9% higher mAP). In addition, our ScratchDet outperforms all the train-from-scratch detectors. It outperforms DSOD by 2.2% mAP with fewer training epochs and surpasses GRP-DSOD by 1.5% mAP. Notably, in the multi-scale testing, ScratchDet obtains 83.6% mAP, much better than the state of the art of both one-stage and two-stage methods.

Method Data Backbone AP AP50 AP75 APS APM APL
pretrained two-stage:
ION[1] train VGG-16 23.6 43.2 23.6 6.4 24.1 38.3
OHEM++ [33] trainval VGG-16 25.5 45.9 26.1 7.4 27.7 40.3
R-FCN[3] trainval ResNet-101 29.9 51.9 - 10.8 32.8 45.0
CoupleNet[41] trainval ResNet-101 34.4 54.8 37.2 13.4 38.1 50.8
Faster R-CNN+++ [11] trainval ResNet-101-C4 34.9 55.7 37.4 15.6 38.7 50.9
Faster R-CNN w FPN [20] trainval35k ResNet-101-FPN 36.2 59.1 39.0 18.2 39.0 48.2
Faster R-CNN w TDM[34] trainval Inception-ResNet-v2-TDM 36.8 57.7 39.2 16.2 39.8 52.1
Deformable R-FCN[4] trainval Aligned-Inception-ResNet 37.5 58.0 40.8 19.4 40.1 52.5
pretrained one-stage:
YOLOv2[26] trainval35k DarkNet-19 21.6 44.0 19.2 5.0 22.4 35.5
SSD300[23] trainval35k VGG16 25.1 43.1 25.8 6.6 25.9 41.4
RON384++[17] trainval VGG-16 27.4 49.5 27.1 - - -
SSD321[7] trainval35k ResNet-101 28.0 45.4 29.3 6.2 28.3 49.3
DSSD321[7] trainval35k ResNet-101 28.0 46.1 29.2 7.4 28.1 47.6
DES300[40] trainval35k VGG-16 28.3 47.3 29.4 8.5 29.9 45.2
DFPR300 [16] trainval VGG-16 28.4 48.2 29.1 8.2 30.1 44.2
RefineDet320[38] trainval35k VGG-16 29.4 49.2 31.3 10.0 32.0 44.4
DFPR300 [16] trainval ResNet-101 31.3 50.5 32.0 10.5 33.8 49.9
PFPNet-R320 [15] trainval35k VGG-16 31.8 52.9 33.6 12.0 35.5 46.1
RetinaNet400[21] trainval35k ResNet-101 31.9 49.5 34.1 11.6 35.8 48.5
RefineDet320[38] trainval35k ResNet-101 32.0 51.4 34.2 10.5 34.7 50.4
trained from scratch:
DSOD300[31] trainval DS/64-192-48-1 29.3 47.3 30.6 9.4 31.5 47.0
GRP-DSOD320[32] trainval DS/64-192-48-1 30.0 47.9 31.8 10.9 33.6 46.3
ScratchDet300 trainval35k Root-ResNet-34 32.7 52.0 34.9 13.0 35.6 49.0
ScratchDet300+ trainval35k Root-ResNet-34 39.1 59.2 42.6 23.1 43.5 51.0
Table 4: Detection results on the MS COCO test-dev set.

4.4 MS COCO

We also evaluate ScratchDet on the MS COCO dataset. The model is trained from scratch on the MS COCO trainval35k set and tested on the test-dev set. We keep the base learning rate fixed for the first stage of training, and then divide it by a fixed factor at the start of each of three further stages.

Table 4 shows the results on the MS COCO test-dev set. ScratchDet300 achieves 32.7% AP, outperforming all the other methods with similar input size: SSD300 (25.1%, 7.6% higher AP), SSD321 (28.0%, 4.7% higher AP), GRP-DSOD320 (30.0%, 2.7% higher AP), DSSD321 (28.0%, 4.7% higher AP), DES300 (28.3%, 4.4% higher AP), RefineDet320-VGG16 (29.4%, 3.3% higher AP), RetinaNet400 (31.9%, 0.8% higher AP) and RefineDet320-ResNet101 (32.0%, 0.7% higher AP). Notably, DSOD300 uses the same input size but trains on the trainval set, which contains more images than trainval35k (i.e., 123,287 vs. 118,287), yet our ScratchDet produces a much better result (32.7% vs. 29.3%, 3.4% higher AP). Some methods, e.g., CoupleNet, Faster R-CNN and Deformable R-FCN, use much bigger input sizes for both training and testing than our ScratchDet300. For a fair comparison, we also report the multi-scale testing result of ScratchDet300 in Table 4 (ScratchDet300+), i.e., 39.1% AP, which is currently the best result, surpassing those prominent two-stage and one-stage approaches with large input image sizes.
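
The comparisons in this paragraph can be checked directly against the AP column of Table 4. The short sketch below recomputes the gap between ScratchDet300 and each single-scale baseline named in the text. The AP values are copied from Table 4; the helper name `ap_margin` is ours and purely illustrative.

```python
# AP values (%) copied from Table 4 (MS COCO test-dev).
SCRATCHDET300_AP = 32.7

baseline_ap = {
    "SSD300": 25.1,
    "SSD321": 28.0,
    "GRP-DSOD320": 30.0,
    "DSSD321": 28.0,
    "DES300": 28.3,
    "RefineDet320-VGG16": 29.4,
    "RetinaNet400": 31.9,
    "RefineDet320-ResNet101": 32.0,
    "DSOD300": 29.3,
}


def ap_margin(ours, theirs):
    """Absolute AP gain in percentage points, rounded to one decimal."""
    return round(ours - theirs, 1)


margins = {name: ap_margin(SCRATCHDET300_AP, ap) for name, ap in baseline_ap.items()}

for name, gain in margins.items():
    print(f"{name}: {baseline_ap[name]} AP, ScratchDet300 is {gain} points higher")
```

The margins are absolute differences in AP percentage points, not relative improvements.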

Compared with the state-of-the-art methods with similar input image size, ScratchDet300 produces the best AP (13.0%) for small objects, outperforming SSD321 by 6.8%. This significant improvement demonstrates the superiority of our ScratchDet architecture for small object detection.

Method Backbone mAP (%) on VOC 2007 mAP (%) on VOC 2012
pretrained two-stage:
Faster R-CNN[27] VGG-16 78.8 75.9
OHEM++[33] VGG-16 - 80.1
R-FCN[3] ResNet-101 83.6 82.0
pretrained one-stage:
SSD300[23] VGG-16 81.2 79.3
RON384++[17] VGG-16 81.3 80.7
RefineDet320[38] VGG-16 84.0 82.7
trained without ImageNet:
DSOD300[31] DS/64-192-48-1 81.7 79.3
ScratchDet300 Root-ResNet-34 84.0 82.1
ScratchDet300+ Root-ResNet-34 86.3 86.3
Table 5: Detection results on the PASCAL VOC dataset. All models are pretrained on MS COCO and fine-tuned on PASCAL VOC. Evaluation server results: http://host.robots.ox.ac.uk:8080/anonymous/ZVCMYN.html and http://host.robots.ox.ac.uk:8080/anonymous/OFHUPV.html

4.5 From MS COCO to PASCAL VOC

We also study how the MS COCO dataset helps detection on PASCAL VOC. Since the object classes in PASCAL VOC are a subset of those in MS COCO, we directly fine-tune the detection models pretrained on MS COCO by subsampling parameters. As shown in Table 5, ScratchDet300 achieves 84.0% and 82.1% mAP on the VOC 2007 test set and VOC 2012 test set, respectively, outperforming other train-from-scratch methods. With multi-scale testing, the detection accuracies rise to 86.3% and 86.3%, respectively. By using the training data in MS COCO and PASCAL VOC, our ScratchDet obtains the top mAP scores on both VOC 2007 and 2012 datasets.
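
One plausible reading of "subsampling parameters" is that, for every prior box, only the classification-layer rows belonging to classes shared with PASCAL VOC are kept. The sketch below illustrates that idea on toy data. The function name `subsample_class_rows`, the prior-major row ordering, and the tiny label lists are all assumptions made for illustration, not details taken from the paper or its code.

```python
def subsample_class_rows(rows, src_classes, dst_classes, num_priors):
    """Select per-class rows of a classification layer for a smaller label set.

    `rows` holds num_priors * len(src_classes) entries, ordered prior by prior
    (all classes of prior 0, then all classes of prior 1, ...). This ordering
    is an assumption of the sketch.
    """
    if len(rows) != num_priors * len(src_classes):
        raise ValueError("rows does not match num_priors * len(src_classes)")
    src_index = {name: i for i, name in enumerate(src_classes)}
    picked = []
    for prior in range(num_priors):
        offset = prior * len(src_classes)
        for name in dst_classes:
            picked.append(rows[offset + src_index[name]])
    return picked


# Toy label sets (hypothetical, far smaller than the real COCO/VOC lists).
SRC = ["background", "person", "bicycle", "car", "elephant"]
DST = ["background", "bicycle", "car", "person"]

# One fake "row" per (prior, class) pair so the selection is easy to read.
rows = [f"p{p}:{c}" for p in range(2) for c in SRC]
print(subsample_class_rows(rows, SRC, DST, num_priors=2))
```

In practice the two datasets name some shared classes differently (for example, COCO uses "airplane" where VOC uses "aeroplane"), so a real conversion would also need an explicit name mapping. Box-regression parameters are class-agnostic in SSD-style heads and would be copied unchanged.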

5 Conclusion

In this work, we focus on training object detectors from scratch in order to tackle the problems caused by fine-tuning from pretrained networks. We study the effects of BatchNorm in the backbone and detection head subnetworks, and successfully train detectors from scratch. By taking advantage of the freedom from pretraining, we are able to explore various architectures for detector design. After analyzing the performance of the ResNet and VGGNet based SSD, we propose a new Root-ResNet backbone network to further improve the detection accuracy, especially for small objects. As a consequence, the proposed detector sets a new state of the art on the PASCAL VOC 2007, 2012 and MS COCO datasets among train-from-scratch detectors, even outperforming some one-stage pretrained methods.

References

  • [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [2] B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, and T. Huang. Revisiting rcnn: On awakening the classification power of faster rcnn. In ECCV, 2018.
  • [3] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In CVPR, 2017.
  • [5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
  • [6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • [7] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. CoRR, 2017.
  • [8] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In ICCV, 2015.
  • [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
  • [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACMMM, 2014.
  • [15] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, and S.-J. Ko. Parallel feature pyramid network for object detection. In ECCV, 2018.
  • [16] T. Kong, F. Sun, W. Huang, and H. Liu. Deep feature pyramid reconfiguration for object detection. In ECCV, 2018.
  • [17] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. Ron: Reverse connection with objectness prior networks for object detection. In CVPR, 2017.
  • [18] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
  • [19] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Detnet: A backbone network for object detection. In ECCV, 2018.
  • [20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [21] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
  • [24] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. In ICLR workshop, 2016.
  • [25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [26] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [29] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? (no, it is not about internal covariate shift). In NIPS, 2018.
  • [30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2013.
  • [31] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
  • [32] Z. Shen, H. Shi, R. Feris, L. Cao, S. Yan, D. Liu, X. Wang, X. Xue, and T. S. Huang. Learning object detectors from scratch with gated recurrent feature pyramids. CoRR, 2017.
  • [33] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [34] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. CoRR, 2016.
  • [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [38] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
  • [39] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. SFD: Single shot scale-invariant face detector. In ICCV, 2017.
  • [40] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille. Single-shot object detection with enriched semantics. In CVPR, 2018.
  • [41] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, et al. Couplenet: Coupling global structure with local parts for object detection. In ICCV, 2017.

6 Complete Object Detection Results

We show the complete object detection results of the proposed ScratchDet method on the PASCAL VOC 2007 test set, PASCAL VOC 2012 test set and MS COCO test-dev set in Table 6, Table 7 and Table 8, respectively. Among the results of all published methods, our ScratchDet achieves the best performance on these three detection datasets, i.e., 86.3% mAP on the PASCAL VOC 2007 test set, 86.3% mAP on the PASCAL VOC 2012 test set and 39.1% AP on the MS COCO test-dev set. We also show selected detection examples on the PASCAL VOC 2007 test set, the PASCAL VOC 2012 test set and the MS COCO test-dev set in Figure 4, Figure 5 and Figure 6, respectively. Different colors of the bounding boxes indicate different object categories. Our method works well under occlusion, truncation, inter-class interference and cluttered background.

Method Data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
ScratchDet300 07+12 80.4 86.0 87.7 77.8 73.9 58.8 87.4 88.4 88.2 66.4 84.3 78.4 84.0 87.5 88.3 83.6 57.3 80.3 79.9 87.9 81.2
ScratchDet300+ 07+12 84.1 90.0 89.2 83.6 80.0 70.1 89.3 89.5 89.0 73.0 86.9 79.8 87.4 90.1 89.3 87.1 63.3 86.9 83.5 88.9 83.4
ScratchDet300 COCO+07+12 84.0 87.9 89.3 85.6 79.8 69.4 89.1 89.2 88.5 73.2 87.5 81.7 88.4 89.5 88.7 86.3 63.1 84.5 84.3 88.1 85.6
ScratchDet300+ COCO+07+12 86.3 90.4 89.6 88.4 85.4 78.9 90.1 89.3 89.5 77.4 89.7 83.9 89.1 90.3 89.5 88.3 68.1 87.6 85.9 87.4 87.7
Table 6: Object detection results on the PASCAL VOC 2007 test set. All models use Root-ResNet-34 as the backbone network.
Method Data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
ScratchDet300 07++12 78.5 90.1 86.8 74.5 66.3 54.0 83.7 82.6 91.6 64.1 83.1 67.7 90.1 87.6 87.8 85.7 56.9 81.7 74.6 87.2 75.3
ScratchDet300+ 07++12 83.6 92.2 90.3 82.6 73.9 68.1 86.8 90.5 93.9 70.3 88.0 72.3 92.3 91.5 91.0 90.3 63.6 87.6 77.4 89.9 80.2
ScratchDet300 COCO+07++12 82.1 91.7 89.3 79.1 71.9 62.7 85.7 85.3 93.9 68.8 87.2 68.7 91.9 90.6 90.9 88.2 61.2 84.7 79.2 89.7 81.0
ScratchDet300+ COCO+07++12 86.3 94.0 91.8 86.0 78.9 75.6 88.6 91.3 95.1 74.0 90.0 73.0 93.6 93.0 92.6 91.9 69.7 90.2 80.9 91.8 83.7
Table 7: Object detection results on the PASCAL VOC 2012 test set. All models use Root-ResNet-34 as the backbone network.
Method AP AP_50 AP_75 AP_S AP_M AP_L AR_1 AR_10 AR_100 AR_S AR_M AR_L
ScratchDet300 32.7 52.2 34.9 13.0 35.6 49.0 29.3 43.9 45.7 20.6 50.8 65.3
ScratchDet300+ 39.1 59.2 42.6 23.1 43.5 51.0 33.1 53.3 58.3 36.6 63.4 74.5
Table 8: Object detection results on the MS COCO test-dev set. All models use Root-ResNet-34 as the backbone network.
Figure 4: Qualitative results of ScratchDet300 on the PASCAL VOC 2007 test set (corresponding to 84.0% mAP). The training data is 07+12+COCO.
Figure 5: Qualitative results of ScratchDet300 on the PASCAL VOC 2012 test set (corresponding to 82.1% mAP). The training data is 07++12+COCO.
Figure 6: Qualitative results of ScratchDet300 on the MS COCO test-dev set (corresponding to 32.7% mAP). The training data is COCO trainval35k.
Figure 7: Analysis of the optimization landscape of SSD after adding BatchNorm on the backbone subnetwork. For three detectors, we plot (a) the training loss value, (b) the L2 norm of the gradient and (c) the fluctuation of the L2 norm of the gradient. The blue curve represents the original SSD; the red and green curves represent the SSD trained with BatchNorm on the backbone network under two different base learning rate settings. The curves for adding BatchNorm on the detection head subnetwork show a similar trend.
Figure 8: Comparison of the extra added layers between SSD and ScratchDet. This change results in fewer parameters and less computation.