1 Introduction
Object detection and instance segmentation are important tasks with many real-world applications in robotics, healthcare, etc. Up until now, most detection and segmentation methods have relied on ImageNet [63] fine-tuning [13, 87]. With fine-tuning, learned parameters or features of source tasks may be forgotten after learning target tasks [29], and domain similarity between tasks is important for transfer learning [89]. Furthermore, transferring knowledge between dissimilar tasks may cause negative transfer [62, 79]. Thus, many works have discussed the task difference between image classification and object detection [60, 69, 72, 6] and the effects of pretraining for object detectors [27, 73, 51, 33]. However, the influence of this task difference is still an open problem, and what and how to transfer knowledge from image classification to object detection remain unclear. Avoiding these problems, it was recently shown that models trained on COCO [38] from random initialization can be on par with models fine-tuned from ImageNet pretrained models [19], but it is not clear whether models with similar performance have different properties. To further understand the effects of fine-tuning object detectors, we analyze the eigenspectrum dynamics of the covariance matrix of each feature map in object detectors, and propose a method to automatically determine the numbers of channels necessary for performance. (In this paper, each feature map includes the channel dimension.) More specifically, motivated by the accurate object detectors trained from scratch [19, 91], we focus on the following research question: Do object detectors fine-tuned from ImageNet pretrained models and those trained from scratch converge to similar models?
If the answer is “Yes,” we will have a better understanding of the task difference and the behavior of deep neural networks, and if the answer is “No, these object detectors do not converge to similar models, but show similar accuracy by chance,” we should incorporate the benefits of both object detectors.
To answer this question, we train object detectors as shown in Figure 1 and analyze the redundancy of feature maps in the detectors. More precisely, we analyze the intrinsic dimensionalities of the feature maps, which represent how much information the feature maps memorize. Intrinsic dimensionalities can be quantified by calculating the eigenspectra of the covariance matrices of the feature maps, and are related to generalization error [74]. In this paper, we use the number of eigenvalues greater than a threshold as a simple metric of intrinsic dimensionality, and we call the set of intrinsic dimensionalities of a network its intrinsic architecture.

Our contributions are as follows.

We analyze the eigenspectrum dynamics of the covariance matrix of each feature map in object detectors, and show that object detectors trained from ImageNet pretrained models and those trained from scratch behave differently from each other even if both object detectors have similar accuracy.

We propose a method for automatically determining the widths (the numbers of channels) of object detectors. We report the results of Faster R-CNN with FPN trained from scratch, and show that our method can reduce the parameters of ResNet-50 by 27% without increasing Multiply-Accumulate operations (MACs) or losing accuracy, and can improve COCO AP by 0.3% without increasing parameters (see Sec. 4.5).
2 Related Work
2.1 Neural Network Generalization
One of the most important mysteries of neural networks is their generalization ability. To understand it, some work has discussed the relation between generalization and compressibility [70, 54, 1, 74]. Information Bottleneck [70, 66] and Canonical Correlation Analysis (CCA) [57, 54] are used for analyzing the dynamics of neural networks. [70] showed that training with Stochastic Gradient Descent (SGD) has two phases: a label fitting phase and a representation compression phase. [66] shows that networks with ReLU do not necessarily exhibit the compression phase, and that fitting to task-relevant information and the compression of task-irrelevant information occur simultaneously. [54] shows that generalizing/larger networks converge to more similar solutions than memorizing/smaller networks. Using CCA, Transfusion [58] analyzes the effects of pretraining for classifying medical images, which are clearly different from the natural images in ImageNet and COCO.
The theoretical analysis most related to this paper is the degree of freedom of reproducing kernel Hilbert spaces (RKHSs), which is defined in [74]. Suzuki [74] shows the following two important properties of neural networks, which motivate our work: (i) "if the eigenvalues of the kernels decreases rapidly, then the degree of freedom gets smaller, and we achieve a better generalization by using a simpler model," and (ii) "the effective dimension of the network is less than the actual number of parameters." SpectralPruning [75], which uses the degree of freedom [74] as the intrinsic dimensionality of models, is applicable to compressing complicated networks. Our method and analysis are based on the eigenspectrum [74, 75] and the dynamics of neural networks [70, 57, 54]. However, these prior works do not analyze the behavior of neural networks fine-tuned for object detection from ImageNet pretrained models.

2.2 Neural Architecture Search (NAS)
NAS has been a hot research topic in deep learning since the success of NAS with reinforcement learning [92], and efficient methods have broadened its applicability [93, 40, 2]. Genetic CNN [83] and NASNet [93] transfer architectures learned on a proxy dataset (e.g., CIFAR-10) to large-scale datasets (e.g., ImageNet). On the other hand, ProxylessNAS [2] reduces memory consumption by path-level binarization, and directly learns architectures for a large-scale dataset. In addition to NAS for image classification, a few works focus on NAS for semantic segmentation [67, 4, 39] and object detection [11, 5]. NAS-FPN [11] and DetNAS [5] search the architectures of Feature Pyramid Networks [37] and backbones for object detectors, respectively. However, these prior works [11, 5] do not determine the widths of feature maps automatically, and the computational cost of training would become higher if their search space included the widths.

Determining the widths of feature maps in CNNs can be considered a subset of NAS. Although various approaches have been proposed [9, 23, 8, 45], shrink-and-expand [16, 56] is a more suitable approach for object detectors because of its simplicity and scalability. MorphNet [16] shrinks and linearly expands networks. The shrinking imposes L1 regularization on the scaling factors of Batch Normalization to identify and prune unimportant channels, like Network Slimming [44], and takes into account specific resource constraints (e.g., the number of floating-point operations). Neural Rejuvenation [56] revives dead neurons (reallocates and reinitializes useless channels) during training. Although the effectiveness of these methods [16, 56] is verified on ImageNet, it is unknown whether they can be applied to object detectors.

2.3 Object Detection and Instance Segmentation
Object detection is one of the core technologies in computer vision, and has advanced rapidly with deep neural networks [68, 10, 13, 12, 61, 59, 43, 37, 72] (refer to the survey [41] for details). In addition, instance segmentation [7, 35, 20, 42], the task of segmenting and classifying individual objects, is important for further detailed object recognition. Most methods for these tasks train models from ImageNet pretrained models for better accuracy. However, pretraining the backbones of object detectors on image classification datasets causes learning bias and limits architecture design [69, 91].

To avoid the problems of pretraining, training object detectors from scratch (i.e., from randomly initialized weights) has been discussed in the literature [69, 34, 36, 30, 91, 19]. DSOD [69] shows that deep supervision [31] is critical for training single-shot object detectors from scratch, and adopts implicit deep supervision via dense connections [26]. ScratchDet [91] shows that Batch Normalization [28, 65] helps training from scratch to converge, and redesigns the backbone of single-shot object detectors. [19] shows that Mask R-CNN trained from scratch with appropriate normalization and longer training (instead of pretraining) can be on par with models fine-tuned from ImageNet pretrained models.
The most similar work to ours is DetNet [36], which is a specialized backbone for object detection. DetNet mainly focuses on scales (the receptive fields and the spatial resolutions of feature maps) to overcome drawbacks of ImageNet pretrained models designed for image classification. However, its widths are manually determined. On the other hand, our method does not aim to determine the spatial resolutions. Using our method and DetNet complementarily would be beneficial.
3 Intrinsic Architecture Search
In this section, we propose a method for automatically determining the widths (the numbers of channels) of feature maps. Our method reflects intrinsic architectures by calculating the redundancy of feature maps, and is applicable to complicated networks such as Faster R-CNN with FPN and Mask R-CNN. Figure 2 shows an overview of our method. We call our algorithm Intrinsic Architecture Search, and we call the architectures discovered by our algorithm ResiaNet when the base backbone is ResNet.
3.1 Determining Widths
Optimizing the widths of feature maps is formulated as

$$\min_{\theta,\, O_{1:M}} \; \mathcal{L}(\theta) \quad \text{s.t.} \quad \mathcal{F}(O_{1:M}) \le \zeta, \tag{1}$$

where $M$ is the total number of layers, $O_{1:M}$ are the widths of the output feature maps, $\theta$ denotes the parameters (weights) of the neural network, $\mathcal{L}$ is a loss function for training neural networks, $\mathcal{F}$ is a function for calculating resource consumption (e.g., Multiply-Accumulate operations (MACs)), and $\zeta$ is a specified maximum allowable resource consumption. This formulation is exactly the same as [16], and most notations in this section and some descriptions in Algorithm 1 follow MorphNet [16] for ease of comparing methods.

Although MorphNet [16] and Neural Rejuvenation [56] also tackle the determination of widths, these methods need to change training and intrinsic dimensionalities. In addition, applying them to object detection and instance segmentation poses the following difficulties. (i) These methods depend on Batch Normalization [28]. Therefore, applying them to networks with other normalization layers [82, 48] is not trivial. Furthermore, when we apply them to networks without normalization layers [90], we need to add Batch Normalization layers [56]. (ii) These methods use additional regularizers. Since object detection and instance segmentation are multi-task learning including classification and localization, we might need to balance regularization. (iii) These methods need to train multiple models [16] or tune additional hyperparameters [56]. This is a serious problem especially for object detection and instance segmentation because training for these tasks takes a long time (see the model zoos of [14, 3, 52]).

3.2 Overview
We propose a method for determining the widths of object detectors using the eigenspectrum [74]. Algorithm 1 shows the whole process, where $\Lambda_{1:M}$ are the eigenspectra of the feature maps, $d_{1:M}$ are the intrinsic dimensionalities of the feature maps, $\epsilon$ is a threshold for calculating intrinsic dimensionalities, $\mathrm{count}(\cdot)$ is a function for counting the numbers which meet the condition, and $\omega$ is a width multiplier.

The details of Algorithm 1 are described below. In Step 1, we set initial weights. Weights in a base backbone (e.g., ResNet-50) are initialized from one of the ImageNet pretrained models, or randomly initialized. Weights outside the base backbone are randomly initialized. In Step 2, we train the whole network (e.g., Faster R-CNN with FPN or Mask R-CNN) with the base backbone. In Step 3, we calculate the eigenspectrum of each feature map in the whole network (see Sec. 3.3 for details). Eigenvalues are normalized by the largest eigenvalue of each feature map. In Step 4, we shrink the widths of each feature map by extracting an intrinsic architecture (see Sec. 3.4 for details). In Step 5, we adjust the widths, mainly for networks with multiple branches (see Sec. 3.5 for details). In Step 6, we expand the widths by linear expansion (see Sec. 3.6 for details).
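The steps above can be sketched in a minimal, single-path form as follows. The helper name, toy activations, threshold value, and width budget are our illustrative assumptions, not the paper's implementation; Step 5 (branch adjustment) is omitted because this sketch has no branches.

```python
import numpy as np

def intrinsic_width(features, threshold=1e-3):
    """Steps 3-4: eigenspectrum of the non-centered channel covariance,
    normalized by its largest eigenvalue; count eigenvalues above a
    threshold to get the intrinsic dimensionality of the feature map."""
    sigma = features @ features.T / features.shape[1]
    eig = np.linalg.eigvalsh(sigma)
    return int(np.sum(eig / eig.max() > threshold))

# Steps 1-2 (initialization and training) are assumed done; we fake
# per-layer activations of shape (channels, locations): layer 0 uses
# all 32 channels, layer 1 is rank-4 and thus highly redundant.
rng = np.random.default_rng(0)
acts = [
    rng.standard_normal((32, 4096)),
    rng.standard_normal((32, 4)) @ rng.standard_normal((4, 4096)),
]
shrunk = [intrinsic_width(a) for a in acts]   # Step 4: shrunk widths

# Step 6: uniform linear expansion back to a total width budget.
budget = 64
omega = budget / sum(shrunk)
widths = [round(w * omega) for w in shrunk]
```

The redundant layer keeps only its effective rank, so the freed budget is redistributed to the layer that actually uses its channels.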
3.3 Calculating Eigenspectra
When we calculate the eigenspectra of feature maps which have spatial resolutions (almost all feature maps in CNNs), we normalize the covariance matrices by the resolutions. Specifically, the (non-centered) covariance matrix $\hat{\Sigma}$ of a feature map is calculated as

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{W_i H_i} \sum_{x=1}^{W_i} \sum_{y=1}^{H_i} \phi_i(x, y)\, \phi_i(x, y)^{\top}, \tag{2}$$

where $N$ is the number of images (we randomly sample 5,000 images from the training set in our experiments), $W_i$ and $H_i$ are the spatial width and height of the feature map, and $\phi_i(x, y)$ is a feature vector whose coordinates are $(x, y)$ in the feature map for the $i$-th image. Not only $\phi_i$ but also $W_i$ and $H_i$ depend on the images fed forward, because input image resolutions may change in the case of COCO.

We calculate the eigenspectra of feature maps before or after convolutional layers, fully connected layers, and transposed convolutional layers. Note that feature maps after a convolutional layer and feature maps before the next convolutional layer generally do not match due to normalization layers and activation layers.
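As a concrete reference, Eq. (2) can be computed as follows for feature maps whose spatial sizes differ per image. This is a minimal NumPy sketch; the function name and the toy feature maps are ours, not the paper's.

```python
import numpy as np

def covariance(feature_maps):
    """Non-centered covariance of Eq. (2).

    feature_maps: list of arrays of shape (C, H_i, W_i); spatial sizes
    may differ per image, as with COCO inputs.
    """
    C = feature_maps[0].shape[0]
    sigma = np.zeros((C, C))
    for f in feature_maps:
        c, h, w = f.shape
        v = f.reshape(c, h * w)          # one feature vector per location
        sigma += (v @ v.T) / (h * w)     # normalize by spatial resolution
    return sigma / len(feature_maps)     # average over the N images

# toy example: 3 "images" with different resolutions, 8 channels
rng = np.random.default_rng(0)
maps = [rng.standard_normal((8, h, w)) for h, w in [(5, 7), (6, 6), (4, 9)]]
sigma = covariance(maps)
eigvals = np.linalg.eigvalsh(sigma)[::-1]  # descending eigenspectrum
eigvals = eigvals / eigvals[0]             # normalize by largest eigenvalue
```

The per-image normalization by $H_i W_i$ keeps large input images from dominating the covariance estimate.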
3.4 Shrinking Widths
We calculate intrinsic dimensionalities from the eigenspectra. We use the number of eigenvalues greater than a predefined threshold as the intrinsic dimensionality. (Using the degree of freedom [75] may be better, though we do not use pruning and we set random values as the initial weights.) Although we fix the threshold in our experiments, we might obtain better accuracy if we tuned it as a hyperparameter.
3.5 Adjusting Widths
If the network has multiple branches, adjusting intrinsic dimensionalities is necessary to determine new widths, because either the input feature maps or the output feature maps of branches may have to have the same widths. This holds especially for ResNet with bottlenecks, where we set the widths of feature maps which pass through shortcuts to the maximum intrinsic dimensionality in the same stage, to preserve most of the information which flows through the shortcuts. Furthermore, we set the same output widths for the first and the second convolutional layers of all residual blocks in the same stage by calculating the geometric mean of their intrinsic dimensionalities. This setting has some advantages: (i) the second convolutional layers of residual blocks can be replaced with depthwise convolutional layers as in [64]; (ii) using the same widths is efficient considering memory access cost [49]; (iii) implementation is easy, and thus modifications to the code of ResNet are minimized.

3.6 Expanding Widths
Our expanding method is basically the same as that of [16]. Specifically, the output width of each layer is multiplied by a uniform width multiplier $\omega$ to fit a target resource consumption. The optimal $\omega$ can be found by a binary search because $\mathcal{F}$ monotonically increases with $\omega$ in our experiments. $\mathcal{F}$ is calculated as

$$\mathcal{F}(O_{1:M}) = \sum_{l=1}^{M} I_l\, O_l\, K_l^2\, W_l\, H_l \tag{3}$$

when targeting MACs, and

$$\mathcal{F}(O_{1:M}) = \sum_{l=1}^{M} I_l\, O_l\, K_l^2 \tag{4}$$

when targeting the number of parameters, where $I_l$ and $O_l$ are the widths of the input and output feature maps, $K_l$ is the kernel size, and $W_l$ and $H_l$ are the spatial width and height of the output feature map, for each layer $l$. We consider $O_M$ to be a fixed number (e.g., 1,000 for ImageNet classification). For simplicity, we consider the spatial width and height of the kernel to be the same in each layer, and omit the resource consumption of biases.
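The binary search over $\omega$ can be sketched as follows, with $\mathcal{F}$ given by the MACs formula of Eq. (3). The layer description, helper names, and toy budget are illustrative assumptions, not the paper's implementation; for simplicity we scale both input and output widths by $\omega$, which matches a chain where each layer's input width equals the previous layer's output width.

```python
def macs(widths, layers):
    """F for MACs, Eq. (3): sum over layers of I * O * K^2 * W * H."""
    total = 0
    for (i_w, o_w), (k, h, w) in zip(widths, layers):
        total += i_w * o_w * k * k * h * w
    return total

def expand(base_widths, layers, budget, iters=50):
    """Binary-search the uniform width multiplier omega (Sec. 3.6).
    Feasible because F is monotone non-decreasing in omega."""
    lo, hi = 0.0, 100.0
    for _ in range(iters):
        omega = (lo + hi) / 2
        widths = [(round(i * omega), round(o * omega)) for i, o in base_widths]
        if macs(widths, layers) <= budget:
            lo = omega          # still within budget: try wider
        else:
            hi = omega          # over budget: try narrower
    return lo

# two toy conv layers, described as (kernel, out_height, out_width)
layers = [(3, 56, 56), (3, 28, 28)]
base = [(16, 32), (32, 64)]
budget = 2 * macs(base, layers)   # allow 2x the MACs of the base widths
omega = expand(base, layers, budget)
```

With a 2x MACs budget and MACs quadratic in the widths, the search settles near $\omega \approx \sqrt{2}$, up to rounding of the integer widths.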
4 Experiments
To analyze the effects of pretraining for object detectors and to verify the effectiveness of our method, we conduct experiments on COCO.
4.1 Experimental Settings
The experimental settings mainly follow Mask R-CNN [20] in Detectron [14] (which includes an implementation by the authors of Mask R-CNN), like [19]. Our implementation is based on Detectron.pytorch [78], a PyTorch implementation of Detectron.

We use ResNet-50 [22] as a base backbone. We train Faster R-CNN [61] with Feature Pyramid Network (FPN) [37] and Mask R-CNN [20] in an end-to-end manner [61]. We use Group Normalization (GN) [82], because appropriate normalization is a key factor for training from scratch [91, 19], and GN has several advantages [82] compared to Synchronized Batch Normalization [55]. The learning rate settings follow [19]. Specifically, the initial learning rate is 0.02 with warmup [17], and the learning rate is reduced by a factor of 10 at each decay. Iterations for the first decay, the second decay, and the end of training are 60k, 80k, and 90k for the 1x schedule; 120k, 160k, and 180k for the 2x schedule; and 210k, 250k, and 270k for the 3x schedule. We use synchronous SGD with an effective batch size of 16 (= 2 images/GPU x 8 GPUs), a momentum of 0.9, and a weight decay of 0.0001.
All models are trained on COCO train2017 set (118,287 images) and evaluated on COCO val2017 set (5,000 images) with COCO metrics unless otherwise stated.
4.2 Eigenspectrum Dynamics
To analyze the effects of pretraining for object detectors, we observed the dynamics of the eigenspectrum of Mask R-CNN. Figure 3 shows the eigenspectrum of a feature map after conv5_1_3 (the third convolutional layer in the conv5_1 bottleneck building block; we refer to the convolutional layers of ResNet in this manner) of ResNet-50. In the case of this layer, the eigenspectrum drops rapidly in the first 10k iterations. Similar behavior can be seen in feature maps with a stride of 32 after conv5_2_3, conv5_3_3, and the projection shortcut in conv5_1.
This result demonstrates that some information obtained in ImageNet pretraining is forgotten. There are three possible reasons. (i) Features for 1000-class image classification are too rich for most object detection tasks (e.g., the classification ability needed for COCO detection is 81-class classification including a background class). (ii) In pretraining on ImageNet, stage 5 of ResNet is very close to the output layer. Layers which are close to the output layer may compress information to the minimum needed for the pretraining task. (iii) The strides of conv5_x are too coarse to localize objects. DetNet [36] and ScratchDet [91] also discuss this problem and change the strides for object detection. Unlike these works, our finding is that SGD (with other regularization methods) automatically limits the intrinsic dimensionalities of standard ResNet without changing the strides.
Eigenspectrum dynamics can capture not only the forgetting of pretrained features but also the acquisition of features for COCO. Figure 4 shows the numbers of eigenvalues greater than the threshold. These numbers first decrease, then increase, in feature maps after some layers. This rebound occurs when the learning rate decays, and may be related to the learning rate schedules and a finding in [19] (see the discussion in Sec. 5.2).
4.3 Intrinsic Architecture
Here, we investigate whether models fine-tuned from an ImageNet pretrained model and a model trained from scratch converge to similar intrinsic architectures.
We compare the three models below: (i) a model trained from scratch (AP: 39.0%, 34.8%), (ii) a model trained from an ImageNet pretrained model (AP: 38.9%, 34.6%), and (iii) a model trained from an ImageNet pretrained model with a longer schedule (AP: 40.3%, 35.7%).
Figure 2(b) and Figure 2(f) show the intrinsic architectures of two of these models, and Figure 5 shows some characteristic intrinsic dimensionalities. The intrinsic architecture of the model trained from scratch is different from those of the models trained from the ImageNet pretrained model, even though the models show similar AP. The accuracy of object detectors could be improved if we properly incorporated the benefits of ImageNet pretraining and random initialization.
Backbone | Normalization | MACs | #params | AP | AP50 | AP75 | APS | APM | APL | AP (1x)
ResNet-50 [36] | SyncBN | 3.8 G | — | 34.5 | 55.2 | 37.7 | 20.4 | 36.7 | 44.5 | —
ResNet-50 | GN | 4.09 G | 25.5 M | 35.5 | 55.6 | 38.5 | 21.3 | 37.5 | 45.3 | 29.4
ResiaNet-50 (MACs) | GN | 4.06 G | 18.6 M | 35.4 | 55.4 | 38.6 | 21.5 | 37.3 | 45.2 | 28.9
ResiaNet-50 (MACs) | GN | 4.05 G | 21.7 M | 35.5 | 55.5 | 38.6 | 21.4 | 37.3 | 46.0 | 29.2
ResiaNet-50 (MACs) | GN | 4.07 G | 22.0 M | 35.4 | 55.6 | 38.4 | 21.3 | 37.8 | 45.5 | 29.3
ResiaNet-50 (params) | GN | 4.92 G | 24.7 M | 35.8 | 55.9 | 38.9 | 21.8 | 38.0 | 45.6 | —
DetNet-59 [36] | SyncBN | 4.8+ G | — | 36.3 | 56.5 | 39.3 | 22.0 | 38.4 | 46.9 | —
DetNet-59 | GN | 5.00+ G | 18.3+ M | 36.2 | 56.0 | 39.3 | 22.1 | 38.3 | 46.0 | —
DetiaNet-59 (MACs) | GN | 4.94+ G | 17.4+ M | 36.2 | 56.0 | 39.3 | 22.5 | 38.1 | 46.0 | —

(MACs and #params are for ImageNet classification; the AP columns are COCO results with the 2x schedule, except the last column, which uses the 1x schedule.)
4.4 Discovered Backbones
Next, we apply Intrinsic Architecture Search to the three models above to obtain new backbones. Figure 2(d) and Figure 2(h) show the architectures of two ResiaNet-50 (MACs) variants, whose target resource consumption is the MACs of ResNet-50; the two architectures are similar to each other. Specifically, the width settings (the numbers in Figure 2(d) from below) are (64, 64, 224, 128, 576, 256, 1152, 544, 896) for ResiaNet-50 (MACs), and (64, 64, 256, 160, 608, 288, 1216, 544, 960) for ResiaNet-50 (params), whose target is the number of parameters of ResNet-50.
Two of the ResiaNets have narrower widths in stage 5 and wider widths in stages 3 and 4 than ResNet. Reducing widths in stage 5 stems from a characteristic of the models trained from ImageNet pretrained models (Figure 5 Right). Increasing widths in stages 3 and 4 may be caused by the object scales in COCO and the number of residual blocks (the information which flows through shortcuts is stacked gradually (Figure 2(b)), and the total amount of information may depend on the number of residual blocks). By contrast, the remaining ResiaNet-50 does not widen the widths of feature maps which pass through shortcuts in stages 3 and 4 as much. We conjecture that this flat architecture is effective for maintaining edge information and localizing objects, but not suitable for classification.
4.5 Efficiency on COCO Object Detection
To quantify the impact on accuracy caused by the difference of intrinsic architectures and to identify backbones better than ResNet, we trained Faster R-CNN with FPN from scratch. Table 1 shows the results.
The three ResiaNet (MACs) variants, trained with the 2x schedule, achieve AP similar to ResNet with fewer parameters. The variant derived from the scratch-trained model has 27% fewer parameters than ResNet and is the most efficient, though it may slightly degrade classification accuracy considering its AP breakdown. ResiaNet (params) achieves better AP than ResNet with a similar number of parameters. (Note that simple width multipliers [25, 88] cannot improve AP without increasing parameters; in addition, they degrade AP by 0.6% when reducing parameters by 27%.)
The ResiaNet (MACs) variants derived from the fine-tuned models achieve higher AP than the variant derived from the scratch-trained model when trained with the 1x schedule. Thus, the intrinsic architectures of the fine-tuned models have the effect of speeding up convergence. These results differ from [19] because we reinitialize weights. Besides, the differences in AP across the schedules indicate that using shorter training as a proxy task [11] is insufficient in this case.
In addition, we verify the effectiveness of DetiaNet-59 (MACs), whose base backbone is DetNet-59 with GN (AP: 39.9%) trained from an ImageNet pretrained model. We use a different threshold for the eigenvalues because the number of parameters would increase with our default one. Table 1 shows the results. DetiaNet (MACs) achieves AP similar to DetNet with 5% fewer parameters. Although the parameter reduction of DetNet is more difficult than that of ResNet, our method is also effective for DetNet.
4.6 Efficiency on COCO Instance Segmentation
To verify the effectiveness on instance segmentation, we also trained Mask R-CNN from scratch. One ResiaNet-50 (MACs) variant achieves AP (36.6%, 33.1%) similar to ResNet-50 (36.6%, 33.0%), while another has slightly lower AP (36.5%, 32.8%). This result means that the parameter reduction of Mask R-CNN is more difficult than that of Faster R-CNN, and reflects the fact that the intrinsic dimensionalities of networks trained on difficult tasks are large [75].
4.7 Transferring Architecture to ImageNet
We investigate whether ResiaNet also improves parameter efficiency when we transfer the intrinsic architectures of the models trained on COCO to ImageNet classification. Table 2 shows the results. The ResiaNet (MACs) variant derived from the scratch-trained model has higher error rates than ResNet; its widths are effective for COCO but not suitable for ImageNet classification. The variants derived from the fine-tuned models achieve error rates similar to ResNet with fewer parameters. This result indicates that the widths of these ResiaNets mainly depend on the redundancy inherited from an ImageNet pretrained model (Figure 5 Right).
Backbone | #params | Top-1 err | Top-5 err
ResNet-50 | 25.5 M | 23.78 | 6.97
ResiaNet-50 (MACs) | 18.6 M | 24.71 | 7.40
ResiaNet-50 (MACs) | 21.7 M | 24.04 | 7.18
ResiaNet-50 (MACs) | 22.0 M | 23.83 | 6.98
ResiaNet-50 (params) | 24.7 M | 23.45 | 6.85
5 Discussion and Conclusions
In this section, we first summarize our results and discuss the need to develop appropriate knowledge-transfer methods for object detectors. After that, we discuss why the architectures and learning schedules of prior work that trains object detectors from scratch work well. Finally, we describe the limitations and weaknesses of our method.
5.1 Appropriate Knowledge Transfer
Although ImageNet pretraining increases intrinsic dimensionalities in higher layers (Figure 2(b)), the resulting increase of parameters does not improve COCO AP (Table 1). These results do not necessarily mean that ImageNet pretraining is inefficient and meaningless for object detection. This is because the increase of parameters in higher layers brings better classification ability (Table 2). The problem is not ImageNet pretraining itself but rather the forgetting of ImageNet pretrained features (Figure 3). We need to take care of the compression of task-irrelevant information [66]: information for classification may be regarded as task-irrelevant for localization, and vice versa.
Considering the above-mentioned results, the current standard architectures and fine-tuning methods of object detectors are insufficient for utilizing pretraining. For training better object detectors, methods for appropriately transferring the knowledge of ImageNet will be needed. The ideas of Decoupled Classification Refinement (DCR) [6] will be helpful: [6] decouples features for classification and localization, and the added classifier is trained not to forget translation-invariant ImageNet pretrained features. To improve the efficiency of DCR, multi-task learning with automatic branching [46] may also be needed.
5.2 Understanding Prior Work with Our Results
DetNet [36] and ScratchDet [91] eliminate feature maps with a stride of 32 from backbones, and weight those with finer strides relatively heavily. These manual designs can imitate the architecture in Figure 2(h). Considering the feature forgetting (Sec. 4.2), such designs can avoid wasting parameters even if detectors are pretrained. Choosing strides automatically with [67, 39, 80, 2] would be even more effective.
DetNet [36] uses 1x1 convolution projections instead of identity mappings although stages 4, 5, and 6 have the same spatial resolution. Our results (Figure 5 Right) imply that this design keeps stages 4 and 5 away from the output layer, and avoids overly sparse representations.
Our results (Figure 5 Right) also imply that current pretraining for object detectors can be considered as deep supervision [31]. This is because ImageNet pretraining determines the weights of backbones only, and the regularization effect of deep supervision remains even if the weights are finetuned. Although recent work [91, 19] emphasizes the effectiveness of normalization layers for training object detectors from scratch, it is worth exploring other forms of regularization including deep supervision [31, 26, 69].
He et al. [18, 19] found that "training longer for the first (large) learning rate is useful, but training for longer on small learning rates often leads to overfitting" when training Mask R-CNN. The increase of the eigenspectrum in our results (Figure 4), together with [74], can explain the overfitting as follows: (i) the learning rates for training object detectors decay; (ii) the detectors capture more detailed information about the training data through finer optimization with the small learning rates; (iii) the eigenvalues and the intrinsic dimensionalities of the detectors increase; (iv) the bias decreases and the variance increases; (v) the detectors overfit if trained for longer with the small learning rates.
As described above, eigenspectrum dynamics are useful for analyzing which feature map is responsible for what information at which time. We believe that eigenspectrum dynamics can be a tool for analyzing neural architectures and learning rate schedules, or for early stopping by predicting generalization error from the eigenspectrum of training data.
5.3 Limitations and Weakness
We use ResNet and its variants, FPN, and Faster/Mask R-CNN in our experiments. It would also be interesting to conduct experiments with single-shot object detectors like SSD [43] and with VGG-16 [71] without FPN. However, we believe that our analysis is meaningful for the computer vision community since Faster/Mask R-CNN are standard methods for object detection and instance segmentation.
Our method can only determine the widths of feature maps. Combining our method with compound scaling [77] and gradient-based NAS [40, 2, 84] to determine network depth, image resolution, and operations would give us further advantages.
We only consider MACs and the number of parameters as metrics of model efficiency. We should consider other metrics like memory footprint [64], memory access cost [49], and real latency on target platforms [86, 76, 81, 8].
Our method resets weights by random initialization. This choice is practical for complicated object detectors because it keeps code and experiments simple. However, applying pruning methods [32, 24, 47, 75] to object detectors may be a better way to train more efficient and accurate models.
We trained parameters after the determination of architectures in this paper. Considering the results of recent work [56], the simultaneous optimization of architectures and parameters is a highly important future direction, though the idea is classical (e.g., TWEANNs: Topology and Weight Evolving Artificial Neural Networks). We believe that our analysis, method, and results are beneficial for this optimization since the eigenspectrum is related to both architectures and parameters.
References
 [1] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In ICML, 2018.
 [2] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 [3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155, 2019.
 [4] Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jonathon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, 2018.
 [5] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Backbone search for object detection. arXiv:1903.10979, 2019.
 [6] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, and Thomas Huang. Revisiting RCNN: On awakening the classification power of Faster RCNN. In ECCV, 2018.
 [7] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
 [8] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. ChamNet: Towards efficient network design through platform-aware model adaptation. In CVPR, 2019.
 [9] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.
 [10] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
 [11] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
 [12] Ross Girshick. Fast R-CNN. In ICCV, 2015.
 [13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 [14] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
 [15] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [16] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In CVPR, 2018.
 [17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
 [18] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pretraining. arXiv:1811.08883, 2018.
 [19] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pretraining. In ICCV, 2019.
 [20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
 [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [23] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
 [24] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
 [25] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
 [26] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [27] Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. What makes ImageNet good for transfer learning? arXiv:1608.08614, 2016.
 [28] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [29] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
 [30] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
 [31] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015.
 [32] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
 [33] Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S. Davis. An analysis of pretraining on object detection. arXiv:1904.05871, 2019.
 [34] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. Tiny-DSOD: Lightweight object detection for resource-restricted usages. In BMVC, 2018.
 [35] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
 [36] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In ECCV, 2018.
 [37] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
 [38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [39] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
 [40] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
 [41] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. arXiv:1809.02165, 2018.
 [42] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
 [43] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
 [44] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
 [45] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. In ICCV, 2019.
 [46] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogerio Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In CVPR, 2017.
 [47] JianHao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
 [48] Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Differentiable learning-to-normalize via switchable normalization. In ICLR, 2019.
 [49] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
 [50] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [51] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
 [52] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

 [53] Dushyant Mehta, Kwang In Kim, and Christian Theobalt. On implicit filter level sparsity in convolutional neural networks. In CVPR, 2019.
 [54] Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, 2018.
 [55] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large minibatch object detector. In CVPR, 2018.
 [56] Siyuan Qiao, Zhe Lin, Jianming Zhang, and Alan Yuille. Neural Rejuvenation: Improving deep network training by enhancing computational resource utilization. In CVPR, 2019.
 [57] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017.
 [58] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. arXiv:1902.07208, 2019.
 [59] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, real-time object detection. In CVPR, 2016.
 [60] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. In CVPR, 2017.
 [61] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [62] Michael T. Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G. Dietterich. To transfer or not to transfer. In NIPS Workshop on Inductive Transfer: 10 Years Later, 2005.
 [63] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 [64] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [65] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In NeurIPS, 2018.
 [66] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In ICLR, 2018.
 [67] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In NIPS, 2016.
 [68] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
 [69] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. DSOD: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
 [70] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
 [71] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [72] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection – SNIP. In CVPR, 2018.
 [73] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
 [74] Taiji Suzuki. Fast generalization error bound of deep learning from a kernel perspective. In AISTATS, 2018.
 [75] Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. Spectral-Pruning: Compressing deep neural network via spectral analysis. arXiv:1808.08558, 2018.
 [76] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
 [77] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
 [78] Shou-Yao Roy Tseng. Detectron.pytorch. https://github.com/roytseng-tw/Detectron.pytorch, 2018.
 [79] Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In CVPR, 2019.
 [80] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 [81] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
 [82] Yuxin Wu and Kaiming He. Group Normalization. In ECCV, 2018.
 [83] Lingxi Xie and Alan Yuille. Genetic CNN. In ICCV, 2017.
 [84] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In ICLR, 2019.

 [85] Atsushi Yaguchi, Taiji Suzuki, Wataru Asano, Shuhei Nitta, Yukinobu Sakata, and Akiyuki Tanizawa. Adam induces implicit weight sparsity in rectifier neural networks. In IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.
 [86] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018.
 [87] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
 [88] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
 [89] Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
 [90] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In ICLR, 2019.
 [91] Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, and Tao Mei. ScratchDet: Exploring to train single-shot object detectors from scratch. In CVPR, 2019.
 [92] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
 [93] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Appendix
Appendix A Details of Experimental Settings
A.1 Experiments on COCO
Since we use Group Normalization (GN) [82], we replace the box head with a 4conv1fc head as in [82]. We set the number of groups for GN to 32 (the default value in [82]), and we round the widths to multiples of 32. (These settings are only for fair comparison; with different GN hyperparameters and finer rounding multiples, we might obtain better accuracy.) For fair comparison with DetNet, we apply our method to backbones only.
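For illustration, the width rounding can be sketched as follows (the helper name `round_width` is ours, not part of our implementation):

```python
def round_width(width, multiple=32):
    """Round a channel width to the nearest positive multiple (here 32,
    matching the number of GN groups), so every layer remains divisible
    into groups."""
    return max(multiple, int(round(width / multiple)) * multiple)

# e.g. a width of 187 channels determined from the eigenspectrum becomes
# 192 = 6 * 32, which GroupNorm with 32 groups can split evenly.
```

We round to the GN group count because GroupNorm requires the channel dimension to be divisible by the number of groups.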
We use multiply-accumulate operations (MACs) and the number of parameters as metrics of model efficiency. For the backbones in our paper, the target layers for calculating these metrics are the convolutional and fully connected layers.
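Counting conventions differ slightly across tools; as an illustration of the metrics above, the following sketch counts one multiply-add pair as one MAC (the helper names are ours):

```python
def conv_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """MACs of a k x k convolution producing a (c_out, h_out, w_out) output."""
    return (c_in // groups) * c_out * k * k * h_out * w_out

def fc_macs(n_in, n_out):
    """MACs of a fully connected layer."""
    return n_in * n_out

def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Parameter count of a convolutional layer (weights, optional bias)."""
    return (c_in // groups) * c_out * k * k + (c_out if bias else 0)

# e.g. a 3x3 conv, 64 -> 64 channels, on a 56x56 output map:
macs = conv_macs(64, 64, 3, 56, 56)
```

Summing these quantities over all convolutional and fully connected layers of a backbone gives the model-efficiency numbers we report.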
We use convolution with stride 2 in the 3×3 convolutional layers of the bottleneck building blocks (conv3_1, conv4_1, and conv5_1) of ResNet. Although this follows the setting for models with GN in Detectron [14], it may be the reason that the MACs and the numbers of parameters of ResNet and DetNet in our implementations differ slightly from [22, 36].
When we train models from an ImageNet pretrained model, we use R50GN.pkl (https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/47261647/R50GN.pkl) provided in Detectron (https://github.com/facebookresearch/Detectron/tree/master/projects/GN). R50GN.pkl is a ResNet-50 model trained with GN layers. Note that the sparsity of R50GN.pkl differs from that of R50.pkl (https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R50.pkl), a ResNet-50 model trained with batch normalization layers; the difference relates to the dying ReLU phenomenon and the implicit sparsity of neural networks [85, 53, 50]. When we train models from scratch, weights in backbones are initialized by He normal initialization [21] unless otherwise stated, and weights in FPN are initialized by Glorot uniform initialization [15]. These settings follow Detectron [14].
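As an illustration of the two initializers, consider the following simplified NumPy sketch (not Detectron's implementation; the function names are ours):

```python
import numpy as np

def he_normal(shape, fan_in, rng):
    """He normal initialization [21]: zero-mean Gaussian with
    std = sqrt(2 / fan_in), suited to ReLU networks."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

def glorot_uniform(shape, fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initialization [15]: U(-a, a) with
    a = sqrt(6 / (fan_in + fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=shape)

rng = np.random.default_rng(0)
# 3x3 conv, 64 -> 128 channels: fan_in = 64 * 3 * 3 = 576
w = he_normal((128, 64, 3, 3), fan_in=576, rng=rng)
```

The He variant compensates for the variance halving caused by ReLU, while the Glorot variant balances forward and backward signal variance.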
The weights of DetNet and DetiaNet are initialized by the default initialization method of PyTorch 0.4.0 (https://github.com/pytorch/pytorch/blob/v0.4.0/torch/nn/modules/conv.py#L40-L47), because DetNet-59 initialized by He normal initialization achieves lower COCO AP in our experiment. For simplicity and fair comparison with ResNet, the P6 level of FPN is not used by the Fast R-CNN heads, as in [37].
We verified that Detectron.pytorch [78] can reproduce the results of Detectron before conducting the experiments in our paper. Strictly speaking, we verified the reproducibility with Faster R-CNN, Mask R-CNN, and Keypoint R-CNN (Mask R-CNN for human pose estimation [20]) with FPN and an ImageNet pretrained ResNet-50 whose batch normalization layers are frozen. "AP", the primary metric of COCO, means Average Precision averaged over multiple Intersection-over-Union thresholds, and AP without a superscript means the AP for object detection in our paper.
When we calculate eigenspectra, we first randomly sample 5,000 images from the COCO train2017 set for fast calculation, then feed forward the sampled images. The images are resized such that their shorter side is 800 pixels [37]; the resizing is the same for training and testing.
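The shorter-side resizing can be sketched as follows (Detectron additionally caps the longer side; that cap is omitted here, and the helper name is ours):

```python
def resize_shorter_side(h, w, target=800):
    """Scale (h, w) so the shorter side becomes `target`,
    preserving the aspect ratio."""
    scale = target / min(h, w)
    return int(round(h * scale)), int(round(w * scale))

# e.g. a 480 x 640 COCO image is scaled by 800/480 to 800 x 1067.
new_h, new_w = resize_shorter_side(480, 640)
```

Because the same resizing is applied at training and test time, the eigenspectra are computed on feature maps with the same spatial statistics as those seen during training.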
See the code of Detectron.pytorch (https://github.com/roytseng-tw/Detectron.pytorch/commit/8315af319cd29b8884a7c0382c4700a96bf35bbc) for other implementation details.
A.2 Experiments on ImageNet
When we transfer the intrinsic architectures of the models trained on COCO to ImageNet classification, we train models with batch normalization layers for 100 epochs. The initial learning rate is 0.1, and the learning rate is reduced by 1/10 at 30, 60, and 90 epochs. We use synchronous SGD with an effective batch size of 256 (= 64 images/GPU × 4 GPUs), a momentum of 0.9, and a weight decay of 0.0001. We crop input images to 224×224 pixels. See the code of Neural Network Distiller (https://github.com/NervanaSystems/distiller/commit/a89b3ad19da164f517e5b9e9e568c94069cc0c83) for other implementation details. (Although Distiller is a library for neural network compression, we use it only for calculating MACs and the number of parameters.)
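Assuming the conventional step decay of this recipe [17] (a factor of 1/10 at each milestone; the exact factor is an assumption here, as it is not stated explicitly above), the schedule can be sketched as:

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Step-decay schedule: multiply the learning rate by `gamma`
    at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89, 0.0001 for 90-99.
```

This is the standard 100-epoch ImageNet recipe; with the large effective batch size of 256, the momentum and weight decay values above follow common practice [17].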