Open MMLab Detection Toolbox and Benchmark
We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from the codebase of the MMDet team, which won the detection track of the COCO Challenge 2018. It has gradually evolved into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference code, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features of this toolbox. In addition, we conduct a benchmarking study on different methods, components, and their hyper-parameters. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop new detectors. Code and models are available at https://github.com/open-mmlab/mmdetection. The project is under active development and we will keep this document updated.
Object detection and instance segmentation are both fundamental computer vision tasks. The pipeline of detection frameworks is usually more complicated than classification-like tasks, and different implementation settings can lead to very different results. Towards the goal of providing a high-quality codebase and unified benchmark, we build MMDetection, an object detection and instance segmentation codebase with PyTorch.
Major features of MMDetection are: (1) Modular design. We decompose the detection framework into different components, and one can easily construct a customized object detection framework by combining different modules. (2) Support of multiple frameworks out of the box. The toolbox supports popular and contemporary detection frameworks; see Section 2 for the full list. (3) High efficiency. All basic bbox and mask operations run on GPUs. The training speed is faster than or comparable to other codebases, including Detectron, maskrcnn-benchmark and SimpleDet. (4) State of the art. The toolbox stems from the codebase developed by the MMDet team, who won the COCO Detection Challenge in 2018, and we keep pushing it forward.
Apart from introducing the codebase and benchmarking results, we also report our experience and best practices for training object detectors. Ablation experiments on hyper-parameters, architectures and training strategies are performed and discussed. We hope that the study can benefit future research and facilitate comparisons between different methods.
The remaining sections are organized as follows. We first introduce various supported methods and highlight important features of MMDetection, and then present the benchmark results. Lastly, we show some ablation studies on some chosen baselines.
MMDetection contains high-quality implementations of popular object detection and instance segmentation methods. A summary of supported frameworks and features compared with other codebases is provided in Table 1. MMDetection supports more methods and features than other codebases, especially for recent ones. A list is given as follows.
|Mixed Precision Training||✓||✓||✓|
|Mask Scoring R-CNN||✓||*|
|Hybrid Task Cascade||✓|
SSD: a classic and widely used single-stage detector with simple model architecture, proposed in 2015.
RetinaNet: a high-performance single-stage detector with Focal Loss, proposed in 2017.
GHM: a gradient harmonizing mechanism to improve single-stage detectors, proposed in 2019.
FCOS: a fully convolutional anchor-free single-stage detector, proposed in 2019.
Fast R-CNN: a classic object detector which requires pre-computed proposals, proposed in 2015.
Faster R-CNN: a classic and widely used two-stage object detector which can be trained end-to-end, proposed in 2015.
R-FCN: a fully convolutional object detector with faster speed than Faster R-CNN, proposed in 2016.
Mask R-CNN: a classic and widely used object detection and instance segmentation method, proposed in 2017.
Grid R-CNN: a grid guided localization mechanism as an alternative to bounding box regression, proposed in 2018.
Mask Scoring R-CNN: an improvement over Mask R-CNN by predicting the mask IoU, proposed in 2019.
Double-Head R-CNN: different heads for classification and localization, proposed in 2019.
Soft NMS: an alternative to NMS, proposed in 2017.
OHEM: an online sampling method that mines hard samples for training, proposed in 2016.
DCN: deformable convolution and deformable RoI pooling, proposed in 2017.
DCNv2: modulated deformable operators, proposed in 2018.
ScratchDet: another exploration on training from scratch, proposed in 2018.
M2Det: a new feature pyramid network to construct more effective feature pyramids, proposed in 2018.
GCNet: global context block that can efficiently model the global context, proposed in 2019.
Generalized Attention: a generalized attention formulation, proposed in 2019.
Group Normalization: a simple alternative to BN, proposed in 2018.
Weight Standardization: standardizing the weights in the convolutional layers for micro-batch training, proposed in 2019.
Guided Anchoring: a new anchoring scheme that predicts sparse and arbitrary-shaped anchors, proposed in 2019.
Libra R-CNN: a new framework towards balanced learning for object detection, proposed in 2019.
Although the model architectures of different detectors are different, they have common components, which can be roughly summarized into the following classes.
Backbone. Backbone is the part that transforms an image to feature maps, such as a ResNet-50 without the last fully connected layer.
Neck. Neck is the part that connects the backbone and heads. It performs some refinements or reconfigurations on the raw feature maps produced by the backbone. An example is Feature Pyramid Network (FPN).
DenseHead (AnchorHead/AnchorFreeHead). DenseHead is the part that operates on dense locations of feature maps, including AnchorHead and AnchorFreeHead, e.g., RPNHead, RetinaHead, FCOSHead.
RoIExtractor. RoIExtractor is the part that extracts RoI-wise features from a single or multiple feature maps with RoIPooling-like operators. An example that extracts RoI features from the corresponding level of feature pyramids is SingleRoIExtractor.
RoIHead (BBoxHead/MaskHead). RoIHead is the part that takes RoI features as input and makes RoI-wise task-specific predictions, such as bounding box classification/regression and mask prediction.
With the above abstractions, the framework of single-stage and two-stage detectors is illustrated in Figure 1. We can develop our own methods by simply creating some new components and assembling existing ones.
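The assembly of components described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: `ToyDetector` and `make_component` are hypothetical names, and each component is a plain function on strings rather than a real network module, so the snippet shows the data flow of single-stage and two-stage detectors and nothing about MMDetection's actual classes.

```python
from typing import Callable, Optional

# A "component" is a toy function from a description string to a new
# description string, so the data flow is visible without a DL framework.
Component = Callable[[str], str]


def make_component(name: str) -> Component:
    """Build a hypothetical component that wraps its input with its own name."""
    return lambda x: f"{name}({x})"


class ToyDetector:
    """Backbone -> Neck -> DenseHead, optionally followed by RoIExtractor -> RoIHead."""

    def __init__(
        self,
        backbone: Component,
        neck: Component,
        dense_head: Component,
        roi_extractor: Optional[Component] = None,
        roi_head: Optional[Component] = None,
    ) -> None:
        if (roi_extractor is None) != (roi_head is None):
            raise ValueError("roi_extractor and roi_head must be given together")
        self.backbone = backbone
        self.neck = neck
        self.dense_head = dense_head
        self.roi_extractor = roi_extractor
        self.roi_head = roi_head

    def forward(self, image: str) -> str:
        feats = self.neck(self.backbone(image))
        dense_out = self.dense_head(feats)
        if self.roi_head is None:
            # Single-stage: the dense head produces the final predictions.
            return dense_out
        # Two-stage: the dense head produces proposals, RoI features are
        # extracted from the feature maps, and the RoI head predicts on them.
        roi_feats = self.roi_extractor(f"{feats}, {dense_out}")
        return self.roi_head(roi_feats)


single_stage = ToyDetector(
    make_component("ResNet50"), make_component("FPN"), make_component("RetinaHead")
)
two_stage = ToyDetector(
    make_component("ResNet50"),
    make_component("FPN"),
    make_component("RPNHead"),
    make_component("SingleRoIExtractor"),
    make_component("BBoxHead"),
)
print(single_stage.forward("img"))
print(two_stage.forward("img"))
```

Swapping one argument, for example a different dense head, yields a different detector without touching the other components, which is the point of the abstraction.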
We design a unified training pipeline with a hooking mechanism. This training pipeline can be used not only for object detection, but also for other computer vision tasks such as image classification and semantic segmentation.
The training processes of many tasks share a similar workflow, where training epochs and validation epochs run iteratively and validation epochs are optional. In each epoch, we run forward and backward passes of the model for many iterations. To make the pipeline more flexible and easy to customize, we define a minimal pipeline which just forwards the model repeatedly. Other behaviors are defined by a hooking mechanism. In order to run a custom training process, we may want to perform some self-defined operations before or after some specific steps. We define some timepoints where users may register any executable methods (hooks), including before_run, before_train_epoch, after_train_epoch, before_train_iter, after_train_iter, before_val_epoch, after_val_epoch, before_val_iter, after_val_iter and after_run. Registered hooks are triggered at the specified timepoints in order of their priority level. A typical training pipeline in MMDetection is shown in Figure 2. The validation epoch is not shown in the figure since we use evaluation hooks to test the performance after each epoch. If specified, it has the same pipeline as the training epoch.
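The hooking mechanism can be illustrated with a minimal sketch. The `ToyRunner` class, the three example hooks and the convention that a smaller number means a higher priority are all assumptions made for this illustration; only the timepoint names are taken from the text, and validation timepoints are left out for brevity.

```python
class ToyRunner:
    """Minimal pipeline: it only 'forwards' the model; hooks add all other behavior."""

    def __init__(self, max_epochs, iters_per_epoch):
        self.max_epochs = max_epochs
        self.iters_per_epoch = iters_per_epoch
        self.hooks = []  # entries are (priority, registration order, hook)
        self.log = []
        self.epoch = 0
        self.iter = 0

    def register_hook(self, hook, priority=50):
        # Assumption: a smaller number means the hook is triggered earlier.
        self.hooks.append((priority, len(self.hooks), hook))
        self.hooks.sort(key=lambda entry: entry[:2])

    def call_hook(self, timepoint):
        for _, _, hook in self.hooks:
            method = getattr(hook, timepoint, None)
            if method is not None:
                method(self)

    def run(self):
        self.call_hook("before_run")
        for epoch in range(self.max_epochs):
            self.epoch = epoch
            self.call_hook("before_train_epoch")
            for it in range(self.iters_per_epoch):
                self.iter = it
                self.call_hook("before_train_iter")
                self.log.append(f"forward e{epoch} i{it}")
                self.call_hook("after_train_iter")
            self.call_hook("after_train_epoch")
        self.call_hook("after_run")


class OptimizerHook:
    def after_train_iter(self, runner):
        runner.log.append("optimizer step")


class LoggerHook:
    def after_train_iter(self, runner):
        runner.log.append("log")


class CheckpointHook:
    def after_train_epoch(self, runner):
        runner.log.append(f"checkpoint e{runner.epoch}")


runner = ToyRunner(max_epochs=1, iters_per_epoch=2)
runner.register_hook(LoggerHook(), priority=50)
runner.register_hook(CheckpointHook(), priority=70)
runner.register_hook(OptimizerHook(), priority=10)
runner.run()
print(runner.log)
```

Although the logger is registered first, the optimizer hook runs before it at every after_train_iter because of its priority, which is the ordering behavior the paragraph describes.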
Dataset. MMDetection supports both VOC-style and COCO-style datasets. We adopt MS COCO 2017 as the primary benchmark for all experiments since it is more challenging and widely used. We use the train split for training and report the performance on the val split.
Implementation details. If not otherwise specified, we adopt the following settings. (1) Images are resized to a fixed maximum scale without changing the aspect ratio. (2) We use 8 V100 GPUs for training with a total batch size of 16 (2 images per GPU) and a single V100 GPU for inference. (3) The training schedule is the same as Detectron. “1x” and “2x” mean 12 epochs and 24 epochs, respectively. “20e” is adopted in cascade models and denotes 20 epochs.
We adopt standard evaluation metrics for COCO dataset, where multiple IoU thresholds from 0.5 to 0.95 are applied. The results of region proposal network (RPN) are measured with Average Recall (AR) and detection results are evaluated with mAP.
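As a small illustration of the multi-threshold metric, the sketch below lists the ten COCO IoU thresholds (0.5 to 0.95 in steps of 0.05) and averages per-threshold AP values. The function names and the AP numbers are hypothetical and serve only to show the arithmetic; a real evaluation would rely on the official COCO evaluation tools.

```python
from statistics import mean

# The ten IoU thresholds used by the COCO metric: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(iou):
    """Thresholds at which a detection with this IoU counts as a match."""
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]


def coco_style_ap(ap_per_threshold):
    """Average AP values that were computed separately at each IoU threshold."""
    if len(ap_per_threshold) != len(COCO_IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return mean(ap_per_threshold)


# Hypothetical per-threshold AP values, for illustration only.
example_aps = [0.60, 0.57, 0.54, 0.50, 0.45, 0.39, 0.32, 0.23, 0.12, 0.02]
print(matched_thresholds(0.72))
print(coco_style_ap(example_aps))
```

A detection with IoU 0.72 is counted at the five thresholds up to 0.7 and missed at the stricter ones, so the averaged metric rewards accurate localization.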
Main results. We benchmark different methods on COCO 2017 val, including SSD, RetinaNet, Faster R-CNN, Mask R-CNN, Cascade R-CNN, Hybrid Task Cascade and FCOS. We evaluate all results with four widely used backbones, i.e., ResNet-50, ResNet-101, ResNeXt-101-32x4d and ResNeXt-101-64x4d. We report the inference speed of these methods and bbox/mask AP in Figure 3. The inference time is tested on a single Tesla V100 GPU.
Detectron, maskrcnn-benchmark and SimpleDet are built on the deep learning frameworks caffe2 (https://github.com/facebookarchive/caffe2), PyTorch and MXNet, respectively. We compare MMDetection with Detectron (@a6a835f), maskrcnn-benchmark (@c8eff2c) and SimpleDet (@cf4fce4) from three aspects: performance, speed and memory. Mask R-CNN and RetinaNet are taken as representatives of two-stage and single-stage detectors. Since these codebases are also under development, the results reported in their model zoos may be outdated, and those results were tested on different hardware. For a fair comparison, we pull the latest code and test all codebases in the same environment. Results are shown in Table 2.
The memory reported by different frameworks is measured in different ways. MMDetection reports the maximum memory of all GPUs, maskrcnn-benchmark reports the memory of GPU 0, and these two adopt the PyTorch API “torch.cuda.max_memory_allocated()”. Detectron reports the GPU memory with the caffe2 API “caffe2.python.utils.GetGPUMemoryUsageStats()”, and SimpleDet reports the memory shown by “nvidia-smi”, a command line utility provided by NVIDIA. Generally, the actual memory usage of MMDetection and maskrcnn-benchmark is similar and lower than that of the others.
Table 2 columns: Codebase | Model | Train (iter/s) | Inf (fps) | Mem (GB) | bbox AP | mask AP
Inference speed on different GPUs. Since researchers use various GPUs, we show the speed benchmark on common GPUs, e.g., TITAN X, TITAN Xp, TITAN V, GTX 1080 Ti, RTX 2080 Ti and V100. We evaluate three models on each type of GPU and report the inference speed in Figure 4. Note that the other hardware of these servers, such as CPUs and hard disks, is not exactly the same, but the results still give a basic impression of the speed benchmark.
Mixed precision training. MMDetection supports mixed precision training to reduce GPU memory and to speed up training, while the performance remains almost the same. maskrcnn-benchmark supports mixed precision training with apex (https://github.com/NVIDIA/apex) and SimpleDet also has its own implementation. Detectron does not support it yet. We report the results and compare with the other two codebases in Table 3. We test all codebases on the same V100 node. Additionally, we investigate more models to assess the effectiveness of mixed precision training. As shown in Table 4, mixed precision training saves more memory with a larger batch size. When the batch size is increased to 12, the memory of FP16 training is reduced to nearly half of that of FP32 training. Moreover, mixed precision training is more memory efficient when applied to simpler frameworks like RetinaNet.
Table 3 columns: Codebase | Type | Mem (GB) | Train (iter/s) | Inf (fps) | bbox AP | mask AP
Multi-node scalability. Since MMDetection supports distributed training on multiple nodes, we test its scalability on 8, 16, 32 and 64 GPUs, respectively. We adopt Mask R-CNN as the benchmarking method and conduct experiments on another V100 cluster. The base learning rate is adjusted linearly with the batch size. Experimental results in Figure 5 show that MMDetection achieves nearly linear acceleration for multiple nodes.
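The linear adjustment of the learning rate can be written as a one-line rule. In the sketch below, the reference learning rate of 0.02 at batch size 16 is an assumed example value and is not taken from the text; the 2 images per GPU follow the implementation details given earlier.

```python
def scale_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size


IMAGES_PER_GPU = 2      # from the implementation details
ASSUMED_BASE_LR = 0.02  # assumption: example value for a total batch size of 16

scaled_lrs = {
    num_gpus: scale_learning_rate(ASSUMED_BASE_LR, 16, num_gpus * IMAGES_PER_GPU)
    for num_gpus in (8, 16, 32, 64)
}
for num_gpus, lr in scaled_lrs.items():
    print(f"{num_gpus} GPUs, batch size {num_gpus * IMAGES_PER_GPU}: lr = {lr:.2f}")
```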
With MMDetection, we conducted an extensive study on some important components and hyper-parameters. We hope that the study can shed light on better practices for making fair comparisons across different methods and settings.
A multi-task loss is usually adopted for training an object detector, which consists of classification and regression branches. The most widely adopted regression loss is Smooth L1 Loss. Recently, more regression losses have been proposed, e.g., Bounded IoU Loss, IoU Loss, GIoU Loss and Balanced L1 Loss. L1 Loss is also a straightforward variant. However, these losses are usually implemented in different methods and settings. Here we evaluate all the losses under the same environment. Note that the final performance varies with the loss weight assigned to the regression loss; hence, we perform a coarse grid search to find the best loss weight for each loss.
Results in Table 5 show that simply increasing the loss weight of Smooth L1 Loss improves the final performance. Without tuning the loss weight, L1 Loss performs better than Smooth L1 Loss, while increasing its loss weight brings no further gain. L1 Loss has larger loss values than Smooth L1 Loss, especially for bounding boxes that are relatively accurate. According to prior analysis, boosting the gradients of better located bounding boxes benefits localization. The loss values of L1 Loss are already quite large; therefore, increasing its loss weight does not work better. Balanced L1 Loss achieves 0.3% higher mAP than L1 Loss for end-to-end Faster R-CNN, which differs slightly from earlier experiments that adopt pre-computed proposals. However, we find that Balanced L1 Loss can lead to a higher gain on the baseline of the proposed IoU-balanced sampling or balanced FPN. IoU-based losses perform slightly better than L1-based losses with optimal loss weights, except for Bounded IoU Loss. GIoU Loss is higher than IoU Loss, and Bounded IoU Loss has similar performance to Smooth L1 Loss, but requires a larger loss weight.
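To make the IoU-based losses concrete, here is a minimal sketch of an IoU loss and a GIoU loss for a single pair of boxes, with an explicit loss weight. It assumes boxes in (x1, y1, x2, y2) format with positive area and uses the linear form 1 - IoU for the IoU loss, which is one common variant and not necessarily the one benchmarked here. The function names are hypothetical.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # Smallest axis-aligned box that encloses both boxes.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (enclose - union) / enclose
    return iou, giou


def iou_loss(pred, target, loss_weight=1.0):
    """Assumed linear form: weight * (1 - IoU)."""
    return loss_weight * (1.0 - iou_and_giou(pred, target)[0])


def giou_loss(pred, target, loss_weight=1.0):
    """weight * (1 - GIoU); unlike the IoU loss it varies for disjoint boxes."""
    return loss_weight * (1.0 - iou_and_giou(pred, target)[1])


target = (0.0, 0.0, 1.0, 1.0)
disjoint_pred = (2.0, 0.0, 3.0, 1.0)
print(iou_loss(disjoint_pred, target), giou_loss(disjoint_pred, target))
```

For the disjoint pair the IoU loss saturates at 1, while the GIoU loss still depends on how far apart the boxes are; `loss_weight` is the quantity tuned by the grid search described above.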
The batch size used when training detectors is usually small (1 or 2) due to limited GPU memory, and thus BN layers are usually frozen by convention. There are two options for configuring BN layers: (1) whether to update the statistics (mean and variance), and (2) whether to optimize the affine weights γ and β. Following the argument names of PyTorch, we denote (1) and (2) as eval and requires_grad. eval = True means the statistics are not updated, and requires_grad = True means γ and β are also optimized during training. Apart from freezing BN layers, there are also other normalization layers that tackle the problem of small batch sizes, such as Synchronized BN (SyncBN) and Group Normalization (GN). We first evaluate different settings for BN layers in backbones, and then compare BN with SyncBN and GN.
BN settings. We evaluate different combinations of eval and requires_grad on Mask R-CNN, under 1x and 2x training schedules. Results in Table 6 show that updating the statistics with a small batch size severely harms the performance: when we recompute the statistics (eval is false) and fix the affine weights (requires_grad is false), both bbox AP and mask AP are lower than when the statistics are kept frozen. Under the 1x learning rate (lr) schedule, whether or not the affine weights are fixed makes only a slight difference. When a longer lr schedule is adopted, making the affine weights trainable outperforms fixing them. In MMDetection, eval = True, requires_grad = True is adopted as the default setting.
Different normalization layers. Batch Normalization (BN) is widely adopted in modern CNNs. However, it depends heavily on a large batch size to precisely estimate the statistics (mean and variance). In object detection, the batch size is usually much smaller than in classification, and the typical solution is to use the statistics of pretrained backbones and not to update them during training, denoted as FrozenBN. More recently, SyncBN and GN have been proposed and have proved their effectiveness [36, 25]. SyncBN computes the mean and variance across multiple GPUs, and GN divides the channels of features into groups and computes the mean and variance within each group; both help to combat the issue of small batch sizes. FrozenBN, SyncBN and GN can be specified in MMDetection with only simple modifications in config files.
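The kind of config change meant here might look like the fragment below. The key names (`norm_cfg`, `norm_eval`, `requires_grad`, `num_groups`) reflect MMDetection config files as far as we know, but they have changed between releases, so treat them as assumptions and check the version in use. The value `num_groups=32` is likewise only an example value.

```python
# Assumed key names; verify against the MMDetection version in use.

# BN with frozen statistics (eval = True) and trainable affine weights.
bn_frozen_stats = dict(norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True)

# SyncBN: statistics are computed across GPUs, so they are updated in training.
sync_bn = dict(norm_cfg=dict(type='SyncBN', requires_grad=True), norm_eval=False)

# GN: statistics are computed per group of channels (example group number).
group_norm = dict(norm_cfg=dict(type='GN', num_groups=32, requires_grad=True))


def make_backbone_cfg(norm_choice):
    """Merge a normalization choice into an otherwise unchanged backbone config."""
    return dict(type='ResNet', depth=50, **norm_choice)


backbone_cfg = make_backbone_cfg(group_norm)
print(backbone_cfg)
```

Only the normalization entry differs between the three variants, which is why switching between them requires no code change outside the config.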
Here we study two questions. (1) How do different normalization layers compare with each other? (2) Where should normalization layers be added to detectors? To answer these two questions, we run three experiments of Mask R-CNN with ResNet-50-FPN and replace the BN layers in backbones with FrozenBN, SyncBN and GN, respectively. The group number for GN follows prior work. Other settings and model architectures are kept the same. In prior work, the 2fc bbox head is replaced with 4conv1fc and GN layers are also added to FPN and bbox/mask heads. We perform another two sets of experiments to study these two changes. Furthermore, we explore different numbers of convolution layers for the bbox head.
Results in Table 7 show that (1) FrozenBN, SyncBN and GN achieve similar performance if we just replace BN layers in backbones with the corresponding ones. (2) Adding SyncBN or GN to FPN and bbox/mask heads does not bring further gain. (3) Replacing the 2fc bbox head with 4conv1fc as well as adding normalization layers to FPN and bbox/mask heads improves the performance. (4) More convolution layers in the bbox head lead to higher performance.
As a typical convention, training images are resized to a predefined scale without changing the aspect ratio. The scale preferred in previous studies differs from the one typically adopted now, and MMDetection uses a single default training scale. As a simple data augmentation method, multi-scale training is also commonly used. No systematic study exists that examines how to select appropriate training scales, although knowing this is crucial for more effective and efficient training. When multi-scale training is adopted, a scale is randomly selected in each iteration, and the image is resized to the selected scale. There are mainly two random selection methods: one is to predefine a set of scales and randomly pick a scale from them, the other is to define a scale range and randomly generate a scale between the minimum and maximum scale. We denote the first method as “value” mode and the second one as “range” mode. Specifically, “range” mode can be seen as a special case of “value” mode where the interval of predefined scales is 1.
We train Mask R-CNN with different scales and random modes, and adopt the 2x lr schedule because more training augmentation usually requires longer lr schedules. The results are shown in Table 8. In the “value” mode, the longer edge is fixed to 1333 and the shorter edge is randomly selected from a predefined pool of scales. In the “range” mode, the shorter edge is randomly selected between a minimum and a maximum scale. From the results we learn that the “range” mode performs similarly to or slightly better than the “value” mode with the same minimum and maximum scales. Usually a wider range brings more improvement in both bbox and mask AP, especially for larger maximum scales. However, a smaller minimum scale does not achieve better performance.
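The two selection modes and the aspect-ratio-preserving resize can be sketched as follows. The function names are hypothetical, and the shorter-edge values 640, 720 and 800 are assumed example scales; only the longer edge of 1333 comes from the text.

```python
import random


def sample_short_edge(scales, mode, rng):
    """Pick the shorter-edge scale for one training iteration."""
    if mode == "value":
        # Choose one of the predefined scales.
        return rng.choice(scales)
    if mode == "range":
        # Choose any integer between the minimum and maximum scale (inclusive).
        return rng.randint(min(scales), max(scales))
    raise ValueError(f"unknown mode: {mode}")


def rescale_size(width, height, long_edge, short_edge):
    """Largest size that fits within (long_edge, short_edge), keeping the aspect ratio."""
    scale = min(long_edge / max(width, height), short_edge / min(width, height))
    return int(width * scale + 0.5), int(height * scale + 0.5)


rng = random.Random(0)
LONG_EDGE = 1333                 # from the text
EXAMPLE_SCALES = [640, 720, 800]  # assumption: example shorter-edge scales

value_scale = sample_short_edge(EXAMPLE_SCALES, "value", rng)
range_scale = sample_short_edge([640, 800], "range", rng)
print(rescale_size(640, 480, LONG_EDGE, value_scale))
print(rescale_size(640, 480, LONG_EDGE, range_scale))
```

With integer scales, calling the "range" mode on [640, 800] draws from the same set as the "value" mode on every integer from 640 to 800, which is the special case mentioned above.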
MMDetection mainly follows the hyper-parameter settings in Detectron and also explores our own implementations. Empirically, we found that some of the hyper-parameters of Detectron are not optimal, especially for RPN. In Table 9, we list those that can further improve the performance of RPN. Although the tuning may benefit the performance, in MMDetection we adopt the same setting as Detectron by default and just leave this study for reference.
smoothl1_beta. Most detection methods adopt Smooth L1 Loss as the regression loss, implemented as torch.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta), where x is the absolute regression error. The parameter beta is the threshold between the L1 term and the MSELoss term. Its default value in RPN is set empirically according to the standard deviation of regression errors. Experimental results show that a smaller beta may improve the average recall (AR) of RPN slightly. In the study of Section 5.1, we found that L1 Loss performs better than Smooth L1 Loss when the loss weight is 1. When we set beta to a smaller value, Smooth L1 Loss gets closer to L1 Loss and the equivalent loss weight is larger, resulting in better performance.
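The effect of the threshold can be checked numerically with a plain-Python version of the standard Smooth L1 form. The beta values and the error of 0.05 below are arbitrary example numbers, not the defaults discussed above.

```python
def smooth_l1(error, beta):
    """Smooth L1 on a scalar error: quadratic below beta, linear above it."""
    x = abs(error)
    if x < beta:
        return 0.5 * x * x / beta
    return x - 0.5 * beta


def l1(error):
    return abs(error)


small_error = 0.05  # a relatively accurate box (example value)
for beta in (1.0, 0.1, 0.01):
    print(f"beta={beta}: smooth_l1={smooth_l1(small_error, beta):.5f}, l1={l1(small_error):.5f}")
```

For the same small error, the loss grows from 0.00125 at beta = 1.0 to 0.045 at beta = 0.01, approaching the L1 value of 0.05. This is the sense in which a smaller beta acts like a larger loss weight on well-localized boxes.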
allowed_border. In RPN, pre-defined anchors are generated at each location of a feature map. Anchors exceeding the boundaries of the image by more than allowed_border are ignored during training. It is set to 0 by default, which means any anchor exceeding the image boundary is ignored. However, we find that relaxing this rule is beneficial. If we set it to infinity, which means none of the anchors are ignored, AR improves. In this way, ground truth objects near boundaries have more matching positive samples during training.
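A minimal sketch of the border rule is given below, assuming anchors in (x1, y1, x2, y2) pixel coordinates and using `float("inf")` to express the "infinity" setting. The function name and the example anchors are hypothetical.

```python
def anchor_inside_flags(anchors, img_w, img_h, allowed_border=0):
    """Flag anchors that exceed the image by at most allowed_border pixels."""
    return [
        x1 >= -allowed_border
        and y1 >= -allowed_border
        and x2 < img_w + allowed_border
        and y2 < img_h + allowed_border
        for x1, y1, x2, y2 in anchors
    ]


example_anchors = [
    (10, 10, 50, 50),    # fully inside a 100x100 image
    (-8, 20, 24, 52),    # exceeds the left border by 8 pixels
    (90, 90, 130, 130),  # exceeds the right and bottom borders by more than 30 pixels
]
for border in (0, 16, float("inf")):
    print(border, anchor_inside_flags(example_anchors, 100, 100, border))
```

Anchors flagged False are ignored during training; with an infinite border every anchor is kept, so objects near the image boundary can be matched to more positive anchors.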
neg_pos_ub. We add this new hyper-parameter for sampling positive and negative anchors. When training the RPN, if insufficient positive anchors are present, one typically samples more negative samples to guarantee a fixed number of training samples. Here we explore neg_pos_ub to control the upper bound of the ratio of negative samples to positive samples. Setting neg_pos_ub to infinity leads to the aforementioned sampling behavior. This default practice sometimes causes an imbalanced distribution of negative and positive samples. By setting it to a reasonable value, e.g., 3 or 5, which means we sample at most 3 or 5 times as many negative samples as positive ones, a gain is observed.
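The sampling rule can be sketched as a function that returns how many positive and negative anchors are drawn. The total of 256 samples and the positive fraction of 0.5 are assumed example values, and the function is an illustration of the idea rather than the sampler used in the codebase.

```python
def sample_counts(num_pos_available, num_neg_available,
                  num=256, pos_fraction=0.5, neg_pos_ub=float("inf")):
    """Return (num_pos, num_neg) to sample under an upper bound on the neg/pos ratio."""
    num_pos = min(num_pos_available, int(num * pos_fraction))
    # Default practice: fill the rest of the budget with negatives.
    num_neg = num - num_pos
    if neg_pos_ub != float("inf"):
        # Cap negatives at neg_pos_ub times the positives (at least one positive assumed).
        num_neg = min(num_neg, int(neg_pos_ub * max(1, num_pos)))
    return num_pos, min(num_neg, num_neg_available)


# An image with only 10 positive anchors and plenty of negatives.
for ub in (float("inf"), 5, 3):
    print(ub, sample_counts(10, 10000, neg_pos_ub=ub))
```

With only 10 positives, the unbounded setting draws 246 negatives, while a bound of 5 or 3 limits them to 50 or 30, which keeps the ratio of negative to positive samples under control.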
We present detailed benchmarking results for some methods in Table 10. R-50 and R-50 (c) denote pytorch-style and caffe-style ResNet-50 backbones, respectively. In the bottleneck residual block, pytorch-style ResNet uses a 1x1 stride-1 convolutional layer followed by a 3x3 stride-2 convolutional layer, while caffe-style ResNet uses a 1x1 stride-2 convolutional layer followed by a 3x3 stride-1 convolutional layer. Refer to https://github.com/open-mmlab/mmdetection/blob/master/MODEL_ZOO.md for more settings and components.
| Method | Backbone | Lr schd | box AP | box AP50 | box AP75 | box APS | box APM | box APL | mask AP | mask AP50 | mask AP75 | mask APS | mask APM | mask APL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | R-50 (c) | 1x | 36.6 | 58.5 | 39.2 | 20.7 | 40.5 | 47.9 | - | - | - | - | - | - |
| FCOS (mstrain) | R-50 (c) | 2x | 38.7 | 58.0 | 41.4 | 23.4 | 42.8 | 49.0 | - | - | - | - | - | - |
| Libra Faster R-CNN | R-50 | 1x | 38.5 | 59.5 | 42.5 | 22.9 | 41.8 | 48.9 | - | - | - | - | - | - |
| GA-Faster R-CNN | R-50 (c) | 1x | 39.9 | 59.1 | 43.6 | 22.8 | 43.5 | 52.8 | - | - | - | - | - | - |
| Mask R-CNN | R-50 (c) | 1x | 37.4 | 58.9 | 40.4 | 21.7 | 41.0 | 49.1 | 34.3 | 55.8 | 36.4 | 18.0 | 37.6 | 47.3 |
| Mask Scoring R-CNN | R-50 (c) | 1x | 37.5 | 59.2 | 40.5 | 21.4 | 41.3 | 48.9 | 35.6 | 55.6 | 38.5 | 18.2 | 39.1 | 49.2 |
| Cascade Mask R-CNN | R-50 | 1x | 41.2 | 59.1 | 45.1 | 23.3 | 44.5 | 54.5 | 35.7 | 56.3 | 38.6 | 18.5 | 38.6 | 49.2 |
| Hybrid Task Cascade | R-50 | 1x | 42.1 | 60.8 | 45.9 | 23.9 | 45.5 | 56.2 | 37.3 | 58.2 | 40.2 | 19.5 | 40.6 | 51.7 |