Tremendous progress has been made in CNN-based object detection since the seminal work of R-CNN, the Fast/Faster R-CNN series [10, 31], and recent state-of-the-art detectors such as Mask R-CNN and RetinaNet. Taking the COCO dataset as an example, detection accuracy has been substantially boosted from Fast R-CNN to RetinaNet in just two years. The improvements are mainly due to better backbone networks, new detection frameworks, novel loss designs, improved pooling methods [5, 14], and so on.
A recent trend in CNN-based image classification uses very large mini-batch sizes to significantly speed up training. For example, the training of ResNet-50 can be accomplished in an hour or even in 31 minutes, using a mini-batch size of 8,192 or 16,000, with little or no sacrifice in accuracy. In contrast, the mini-batch size remains very small (e.g., 2-16) in the object detection literature. Therefore, in this paper, we study the problem of mini-batch size in object detection and present a technical solution to successfully train an object detector with a large mini-batch size.
What is wrong with a small mini-batch size? Originating from the R-CNN series of object detectors, a mini-batch containing only a few images is widely adopted in popular detectors like Faster R-CNN and R-FCN. Even in state-of-the-art detectors like RetinaNet and Mask R-CNN, the mini-batch size, though increased, is still quite small compared with the mini-batch size (e.g., 256) used in current image classification. There are several potential drawbacks associated with a small mini-batch size. First, the training time is notoriously lengthy. For example, the training of ResNet-152 on COCO takes 3 days, using a mini-batch size of 16 on a machine with 8 Titan XP GPUs. Second, training with a small mini-batch size fails to provide accurate statistics for batch normalization (BN). In order to obtain good batch normalization statistics, the mini-batch size for ImageNet classification networks is usually set to 256, which is significantly larger than the mini-batch size used in the current object detection setting.
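The sensitivity of BN statistics to batch size can be illustrated with a quick simulation (not from the paper; the distribution parameters are arbitrary): the fluctuation of the per-batch mean estimate shrinks as the mini-batch grows.

```python
import numpy as np

# Simulate one channel's activations; loc/scale are arbitrary illustrative values.
rng = np.random.default_rng(0)
activations = rng.normal(loc=1.0, scale=2.0, size=98_304)

for batch_size in (2, 16, 256):
    # Split into mini-batches and measure how much the batch mean fluctuates.
    batch_means = activations.reshape(-1, batch_size).mean(axis=1)
    print(f"batch={batch_size:4d}  std of batch mean = {batch_means.std():.3f}")
```

The spread of the batch mean scales roughly as sigma/sqrt(n), so a detector trained with 2 images per mini-batch sees far noisier BN statistics than a 256-image classification batch.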
Last but not least, the numbers of positive and negative training examples within a small mini-batch are more likely to be imbalanced, which might hurt the final accuracy. Figure LABEL:fig:rois_example gives some examples with imbalanced positive and negative proposals, and Table 1 compares the statistics of two detectors with different mini-batch sizes at different training epochs on the COCO dataset.
What is the challenge in simply increasing the mini-batch size? As in the image classification problem, the main dilemma we face is: a large mini-batch size usually requires a large learning rate to maintain the accuracy, according to the "equivalent learning rate rule" [13, 21]. But a large learning rate in object detection is very likely to lead to a failure of convergence; if we use a smaller learning rate to ensure convergence, inferior results are often obtained.
To tackle the above dilemma, we propose the following solution. First, we present a new explanation of the linear scaling rule and borrow the "warmup" learning rate policy to gradually increase the learning rate at the very early stage of training. This ensures the convergence of training. Second, to address the accuracy and convergence issues, we introduce Cross-GPU Batch Normalization (CGBN) for better BN statistics. CGBN not only improves the accuracy but also makes the training much more stable. This is significant because it allows us to safely exploit the rapidly increasing computational power from industry.
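A minimal sketch of the combined linear scaling and warmup policy described above (the base learning rate, batch ratio, and warmup length here are hypothetical placeholders, not the paper's settings):

```python
def warmup_lr(step, base_lr=0.02, batch_ratio=8, warmup_steps=500):
    """Learning rate with linear scaling plus gradual warmup.

    batch_ratio is the large mini-batch size divided by the baseline size;
    the linear scaling rule multiplies the baseline LR by this ratio, and
    warmup ramps linearly from base_lr to the scaled LR over warmup_steps.
    """
    target_lr = base_lr * batch_ratio
    if step < warmup_steps:
        alpha = step / warmup_steps
        return base_lr + (target_lr - base_lr) * alpha
    return target_lr
```

Starting from the small-batch learning rate and ramping up avoids the divergence that an immediately applied large learning rate tends to cause at the start of training.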
Our MegDet (with ResNet-50 as the backbone) can finish COCO training in 4 hours on 128 GPUs, reaching even higher accuracy. In contrast, the small mini-batch counterpart takes 33 hours with lower accuracy. This means that we can speed up the innovation cycle by nearly an order of magnitude with even better performance, as shown in Figure 1. Based on MegDet, we secured 1st place in the COCO 2017 Detection Challenge.
Our technical contributions can be summarized as follows:
We give a new interpretation of the linear scaling rule, in the context of object detection, based on the assumption of maintaining equivalent loss variance.
We are the first to train BN within the object detection framework. We demonstrate that our Cross-GPU Batch Normalization not only benefits accuracy, but also makes training easier to converge, especially for large mini-batch sizes.
We are the first to finish COCO training (based on ResNet-50) in 4 hours, using 128 GPUs, while achieving higher accuracy.
Our MegDet leads to our winning of the COCO 2017 Detection Challenge.
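The Cross-GPU Batch Normalization idea behind the second contribution can be sketched as follows. This is a single-process NumPy mock-up, not the paper's multi-GPU implementation: each list entry stands in for one GPU's activations, and the plain Python `sum` over workers plays the role of the all-reduce of per-channel statistics.

```python
import numpy as np

def cross_gpu_batch_norm(shards, gamma, beta, eps=1e-5):
    """Normalize per-GPU activation shards (each of shape (N, C, H, W))
    with mean/variance aggregated over ALL shards, as CGBN does."""
    # Step 1: each "GPU" computes local per-channel sums and squared sums.
    counts = [x.shape[0] * x.shape[2] * x.shape[3] for x in shards]
    sums = [x.sum(axis=(0, 2, 3)) for x in shards]
    sq_sums = [(x * x).sum(axis=(0, 2, 3)) for x in shards]

    # Step 2: combine the statistics (the all-reduce step in real CGBN).
    n = sum(counts)
    mean = sum(sums) / n
    var = sum(sq_sums) / n - mean ** 2          # E[x^2] - E[x]^2

    # Step 3: every worker normalizes with the shared global statistics.
    m, v = mean.reshape(1, -1, 1, 1), var.reshape(1, -1, 1, 1)
    g, b = gamma.reshape(1, -1, 1, 1), beta.reshape(1, -1, 1, 1)
    return [g * (x - m) / np.sqrt(v + eps) + b for x in shards]
```

With statistics pooled across devices, the effective normalization batch equals the whole mini-batch, which is what restores accurate BN statistics when each GPU holds only one or two images.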
2 Concluding Remarks
We have presented a large mini-batch size detector, which achieves better accuracy in much shorter time. This is remarkable because our research cycle has been greatly accelerated. As a result, we obtained 1st place in the COCO 2017 Detection Challenge. The details are in the Appendix.
Based on our MegDet, we integrate techniques including OHEM, atrous convolution [40, 2], stronger base models [38, 18], large kernels, segmentation supervision [27, 34], diverse network structures [12, 32, 36], contextual modules [22, 9], ROIAlign, and multi-scale training and testing for the COCO 2017 Object Detection Challenge. We obtained 50.5 mmAP on the validation set and 50.6 mmAP on test-dev. An ensemble of four detectors finally achieved 52.5. Table LABEL:tab:test_dev_on_coco summarizes the entries from the leaderboard of the COCO 2017 Challenge. Figure LABEL:fig:det_fig_examples gives some exemplar results.
-  S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
-  T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
-  J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, pages 1319–1327, 2013.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. CVPR, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
-  A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
-  J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. IEEE Transactions on Multimedia, 19(5):944–954, 2017.
-  T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. IEEE transactions on pattern analysis and machine intelligence, 39(7):1476–1481, 2017.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. ICLR, 2014.
-  A. Shrivastava and A. Gupta. Contextual priming and feedback for faster r-cnn. In European Conference on Computer Vision, pages 330–348. Springer, 2016.
-  A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
-  J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
-  S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer. Imagenet training in minutes. arXiv preprint arXiv:1709.05011, 2017.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.