Object detection and instance recognition are at the core of many real-world AI applications such as autonomous driving, video surveillance, medical image analysis and cashierless retailing. In recent years, more sophisticated detection systems and more challenging datasets have emerged, with an ever-growing demand for computation power. From PASCAL VOC (Everingham et al., 2010) (10K images) to MS COCO (Lin et al., 2014) (118K images) and Google Open Images (Kuznetsova et al., 2018) (1.7M images), the amount of annotated data has increased rapidly. From AlexNet (Krizhevsky et al., 2012) (700M FLOPs) to SENet (Hu et al., 2018) (21G FLOPs), the computational complexity of CNNs has grown by more than an order of magnitude. Together, these two factors raise the training time of a detection system from several GPU hours to tens of thousands of GPU hours, which calls for a distributed detection framework that scales. Built on top of MXNet, SimpleDet is the first open source detection framework that provides an efficient, batteries-included distributed training system. As shown in Figure 0(a), our system scales near linearly on a 4-node GPU cluster over consumer-grade 25Gb Ethernet. Besides high efficiency, our system also prioritizes user experience. We provide a configuration system in pure Python, which is easy to use and highly flexible because the framework and the configuration system are written in the same programming language. To further ease adoption, we provide pre-built Singularity and Docker images. The full code, examples and documentation of SimpleDet can be found at https://github.com/tusimple/simpledet.
2 Features of SimpleDet
Like other existing frameworks, SimpleDet covers most state-of-the-art detection models including:
Fast RCNN (Girshick, 2015)
Faster RCNN (Ren et al., 2015)
Mask RCNN (He et al., 2017)
Cascade RCNN (Cai and Vasconcelos, 2018)
RetinaNet (Lin et al., 2017)
Deformable Convolutional Network (DCN) (Dai et al., 2017)
TridentNet (Li et al., 2019)
Besides covering the latest models, SimpleDet provides extensive pre-processing and post-processing routines for detection that follow best practices, including various data augmentation techniques, multi-scale training and testing, and Soft-NMS (Bodla et al., 2017) and weighted NMS. All these features are built on the unified and versatile interfaces of SimpleDet, which allow users to easily customize and extend them in training.
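To make the Soft-NMS post-processing step concrete, the following is a minimal pure-Python sketch of the Gaussian variant described by Bodla et al. (2017). It is not SimpleDet's implementation; the function names `iou` and `soft_nms` and the defaults for `sigma` and `score_thresh` are assumptions chosen for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    discarding them. Returns (box_index, final_score) pairs in selection order.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Select the highest-scoring box that has not been selected yet.
        best = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(best)
        kept.append((best_idx, best_score))
        # Decay every other score according to its overlap with the selected box.
        decayed = []
        for i, s in remaining:
            overlap = iou(boxes[best_idx], boxes[i])
            s *= math.exp(-(overlap ** 2) / sigma)
            if s >= score_thresh:  # drop boxes whose score has become negligible
                decayed.append((i, s))
        remaining = decayed
    return kept
```

In contrast to hard NMS, a box that heavily overlaps a higher-scoring one is kept with a reduced score, so it is ranked lower rather than removed.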
Apart from these common features, we would like to highlight several key features of SimpleDet.
2.1 Distributed Training
Instead of only scaling up within a single machine, we can use the data-parallel paradigm to scale out across more machines. The core of scalable distributed training lies in the efficiency of parameter communication. Thanks to the underlying MXNet, SimpleDet supports both parameter server and all-reduce algorithms for model parameter updates. Together with the mixed precision training technique introduced in Section 2.2, SimpleDet achieves near-linear scaling efficiency on a 4-node cluster, as shown in Figure 0(a). Note that this performance is obtained over consumer-grade 25Gb Ethernet, in contrast to most previous works, which are built on more expensive cross-machine communication hardware such as InfiniBand. We believe this feature could significantly promote the adoption of distributed training.
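The data-parallel update can be illustrated with a toy sketch: each worker computes a gradient on its own shard, the gradients are averaged across workers, and every worker applies the same update. The code below is only an illustration under simplifying assumptions (a 1-D linear model, equally sized shards, and a hypothetical `allreduce_mean` helper standing in for the real communication step); it does not use MXNet or reflect SimpleDet's internals.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def allreduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    """One synchronous SGD step over several workers.

    Each worker computes a local gradient on its shard; after averaging,
    all workers hold the same gradient and apply the same update.
    """
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    synced = allreduce_mean(local_grads)
    return w - lr * synced[0]
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so the distributed step matches a single-machine step on the same data. The amount of data exchanged per step is proportional to the number of parameters, which is why communication efficiency dominates scaling behaviour.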
2.2 Mixed Precision Training
Modern specialized hardware like NVIDIA Tensor Cores provides 10 times the throughput for computation in half precision (FP16) compared with single precision (FP32). Besides speeding up training, low precision training also reduces the memory footprint. The main obstacle for mixed precision training is the convergence and accuracy degradation caused by the limited representable range of FP16. To mitigate this problem, SimpleDet adopts the loss scaling proposed by Micikevicius et al. (2018). In practice, we find that mixed precision training yields identical training curves and detection mAP to full precision training.
As shown in Figures 0(b) and 0(c), SimpleDet achieves a 2.0X speedup and a 30% reduction in memory usage when moving from FP32 training to FP16 training. In addition, mixed precision training effectively reduces the distributed communication cost, so it also plays an important role in efficient distributed training.
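The effect of loss scaling can be seen in a small NumPy experiment. The sketch below backpropagates through a single multiplication in FP16; the function name `fp16_grad`, the example values and the scale of 1024 are assumptions for illustration only and are not taken from SimpleDet.

```python
import numpy as np


def fp16_grad(loss_grad, activation, loss_scale=1.0):
    """Backpropagate through one multiplication in FP16, then unscale in FP32.

    loss_grad is the upstream gradient and activation is the value it is
    multiplied with; both are rounded to FP16 before the product is taken.
    """
    scaled = np.float16(loss_grad * loss_scale)   # scaled upstream gradient in FP16
    grad_fp16 = scaled * np.float16(activation)   # the product stays in FP16
    # Convert to FP32 before dividing so that unscaling does not underflow again.
    return np.float32(grad_fp16) / np.float32(loss_scale)


# A true gradient of about 1e-8 is below the smallest positive FP16 value.
unscaled = fp16_grad(1e-5, 1e-3)
rescued = fp16_grad(1e-5, 1e-3, loss_scale=1024.0)
```

Without scaling, the product underflows to zero in FP16 and the gradient is lost. Multiplying the loss by a constant shifts the gradient into the representable range, and dividing by the same constant in FP32 recovers a value close to the true gradient.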
2.3 Cross-GPU Batch Normalization
Due to the limit of GPU memory, modern detectors are trained with only a few images per GPU. However, the widely used batch normalization (BN) is implemented in a per-GPU manner, which forces researchers to freeze BN parameters during detector training as a workaround. As indicated by Peng et al. (2018), freeze-BN detectors trained with the linear learning rate scaling scheme fail to converge when the batch size increases beyond a threshold. This failure to converge harms the scalability of a detection framework. To remove this limitation, SimpleDet integrates Cross-GPU Batch Normalization (CGBN) and provides a one-line configuration option for users. In practice, we find that scaling a detector to a mini-batch size of 256 with CGBN leads to stable convergence.
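The idea behind cross-GPU statistics can be sketched in NumPy: each device contributes its element count, sum and sum of squares, and the aggregated values give the mean and variance of the whole mini-batch. This is a simplified illustration, not SimpleDet's CGBN operator; the function names are hypothetical, and plain Python sums stand in for the cross-device reduction.

```python
import numpy as np


def cross_gpu_stats(shards):
    """Mean and variance over all shards, computed from per-shard partial sums.

    Each shard is an array of shape (images_on_this_gpu, channels). The three
    sums below are what a cross-device reduction would need to exchange.
    """
    n = sum(x.shape[0] for x in shards)
    total = sum(x.sum(axis=0) for x in shards)
    total_sq = sum((x ** 2).sum(axis=0) for x in shards)
    mean = total / n
    var = total_sq / n - mean ** 2  # biased variance, E[x^2] - E[x]^2
    return mean, var


def batch_norm(x, mean, var, eps=1e-5):
    """Normalize x with the given statistics (no learned scale or shift)."""
    return (x - mean) / np.sqrt(var + eps)


# Eight images split over four simulated GPUs, two images each.
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 3))
shards = np.split(batch, 4)
mean, var = cross_gpu_stats(shards)
normalized = [batch_norm(x, mean, var) for x in shards]
```

The aggregated statistics match those of the full mini-batch, whereas statistics computed from two images on a single GPU are far noisier, which is the reason per-GPU BN is usually frozen in detectors.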
2.4 Memory Saving Technologies
A limiting factor in the design of new detectors is GPU memory: since the main training paradigm for CNN detectors is data parallelism, each design is bound by the amount of memory that a single GPU provides. To mitigate this problem, SimpleDet combines mixed precision training, in-place activated batch normalization (Rota Bulò et al., 2018) and layer-wise memory checkpointing (Chen et al., 2016) to minimize GPU memory demand. With all these techniques combined, SimpleDet saves up to 50% memory, as shown in Figure 0(c), with a marginal increase in computation cost compared with the vanilla setting.
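The trade-off behind memory checkpointing (Chen et al., 2016) can be sketched on a toy chain of scalar layers: only the input of each segment is stored in the forward pass, and the activations inside a segment are recomputed during the backward pass. The layer definition, function names and segment length below are assumptions for illustration; SimpleDet's layer-wise checkpointing in MXNet is not shown here.

```python
import math


def layer(x, w):
    """A toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative of the toy layer with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def backward_full(x, weights):
    """Store every layer input, then backpropagate.

    Returns (d output / d input, number of stored activations).
    """
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad *= layer_grad(x_in, w)
    return grad, len(inputs)


def backward_checkpointed(x, weights, segment):
    """Store only segment inputs; recompute inside each segment on the way back.

    Returns (d output / d input, peak number of stored activations).
    """
    starts = list(range(0, len(weights), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)  # keep only the input of this segment
        for w in weights[start:start + segment]:
            x = layer(x, w)
    grad, peak = 1.0, 0
    for start, x_seg in zip(reversed(starts), reversed(checkpoints)):
        seg_weights = weights[start:start + segment]
        inputs, x = [], x_seg
        for w in seg_weights:  # recompute this segment's forward pass
            inputs.append(x)
            x = layer(x, w)
        # Checkpoints kept plus activations recomputed for the current segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, w in zip(reversed(inputs), reversed(seg_weights)):
            grad *= layer_grad(x_in, w)
    return grad, peak
```

Both functions return the same gradient, but the checkpointed version holds fewer activations at once in exchange for one extra forward pass per segment, which corresponds to the marginal increase in computation cost mentioned above.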
3 Comparison with Other Detection Frameworks
We compare SimpleDet with four other detection frameworks in terms of training speed, supported models and advanced training features in Table 1.
Detectron (https://github.com/facebookresearch/Detectron) is the first general framework for object detection. However, its training speed is a major problem, as it makes extensive use of Python operators in the core part of the framework.
mmdetection is a well-designed framework written in PyTorch that supports a wide range of detection models. Training speed is again a problem for mmdetection, since it contains many small operations inside the computation graph, which incur a large operator invocation overhead.
tensorpack (https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) supports some advanced training features such as Cross-GPU Batch Normalization and distributed training, but it lacks support for some newer models.
maskrcnn-benchmark (https://github.com/facebookresearch/maskrcnn-benchmark) is a well-optimized framework with very high training speed, but it supports the fewest models of all the frameworks.
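| | Detectron | mmdetection | tensorpack | maskrcnn-benchmark | SimpleDet |
|---|---|---|---|---|---|
| R50-FPN Faster RCNN speed | 29 images/s | 29 images/s | 28 images/s | 40 images/s | 37 images/s |
| Mixed Precision Training | ✗ | ✗ | ✗ | ✗ | ✓ |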
- Bodla et al. (2017) Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS – improving object detection with one line of code. In International Conference on Computer Vision, pages 5561–5569, 2017.
- Cai and Vasconcelos (2018) Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
- Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv:1604.06174, 2016.
- Dai et al. (2017) Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In International Conference on Computer Vision, pages 764–773, 2017.
- Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- Girshick (2015) Ross Girshick. Fast R-CNN. In International Conference on Computer Vision, pages 1440–1448, 2015.
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In International Conference on Computer Vision, pages 2961–2969, 2017.
- Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- Kuznetsova et al. (2018) Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.
- Li et al. (2019) Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. arXiv:1901.01892, 2019.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
- Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In International Conference on Computer Vision, pages 2980–2988, 2017.
- Micikevicius et al. (2018) Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
- Peng et al. (2018) Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In Conference on Computer Vision and Pattern Recognition, pages 6181–6189, 2018.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- Rota Bulò et al. (2018) Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In Conference on Computer Vision and Pattern Recognition, pages 5639–5647, 2018.