Current deep network architectures for object detection can be categorized into one-stage detectors and two-stage detectors. One-stage detectors perform object detection end-to-end in a single pass, e.g., YOLO (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018) and SSD (Liu et al., 2016; Fu et al., 2017). Two-stage detectors, on the other hand, divide the task into two sub-problems that respectively focus on extracting object region proposals and classifying each of the candidate regions. Detectors such as Faster R-CNN (Ren et al., 2015) and Light-Head R-CNN (Li et al., 2017) are both of this kind.
State-of-the-art object detection methods (He et al., 2014; Girshick, 2015; Ren et al., 2015; Dai et al., 2016, 2017; Lin et al., 2017; He et al., 2017) in terms of precision mainly follow the region based paradigm, which is popularized by the seminal work R-CNN (Girshick et al., 2014). Given a sparse set of region proposals, object classification and bounding box regression are performed on each proposal individually. Mask R-CNN (He et al., 2017) extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI) in parallel with the existing branch for classification and bounding box regression. This showcases the flexibility of two-stage detectors for multitasking over the one-stage counterparts. Different branches in Mask R-CNN share the same set of high-level features from a CNN backbone network, such as ResNet (He et al., 2016). Each branch attends to a specific RoI via RoIAlign, a quantization-free layer that faithfully preserves spatial preciseness. Further, our non-local RoI (NL-RoI) mechanism can be incorporated into Mask R-CNN to achieve better performance.
The ability to capture long-range and non-local information is a key success factor of deeper CNNs. For vanilla Mask R-CNN, the only way to acquire non-local information for each RoI is to explore the high-level features extracted by the deep backbone network. However, the high-level features are shared among all RoIs of different spatial locations, semantic categories, and branches for different tasks. Such high-level features are assumed to be general rather than specific to individual RoIs so that they are applicable to all of these variations. Therefore, it is difficult for the same set of features to also contain RoI-specific information. Besides, RoI features are extracted as rectangles based on the corresponding bounding boxes proposed by the Region Proposal Network (RPN). A single bounding box is very likely to contain multiple instances when the scene is crowded. Moreover, if the instances are of the same category, it is harder for the branch network to tell their boundaries apart by only referring to the local feature within an RoI. Non-rigid objects such as persons are especially problematic: the target object deforms in shape, so the bounding box has a higher chance of including other objects or background interlaced in a more complicated way.
We introduce the idea of NL-RoI to address the aforementioned issues and argue that RoI-specific non-local information can be helpful in discriminating the target instance from the others. For example, due to object co-occurrence prior in the real world, it is more probable to see cars along with pedestrians instead of refrigerators in a street scene. Besides, mutual information between instances may also be useful. Consider a scene of group dancing: People are usually posing in similar ways, and hence we can more confidently predict the pose for a dancer under partial occlusion, by referring to other dancers’ poses.
The non-local RoI operation, which takes the form of the generic non-local operation of Wang et al. (2018), is defined as

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (1)$$

where $i$ is the index of a target RoI whose non-local information is to be computed and $j$ enumerates all the RoIs, including the target one. The input feature blob is denoted as $x$ and the output feature containing non-local information is denoted by $y$. A pairwise function $f$ computes a scalar that reflects the correlation between the $i$-th target RoI and each of the RoIs ($j$). The unary function $g$ maps the input feature from the $j$-th RoI to another representation, which gives the operation the capacity to convert the input feature to be more specialized for non-local information. Finally, the response is normalized by a factor $\mathcal{C}(x)$.
The non-local RoI property in Eq. (1) originates from the fact that all RoIs are associated with each other in the operation. For each RoI, the non-local RoI operation computes responses based on correlations between different RoIs. Theoretically, each RoI should gradually learn to characterize a meaningful instance during training. That is, Eq. (1) enables the attention mechanism between instances. Moreover, this kind of non-local operation supports a variable input number of RoIs. Fig. 1a shows an overview of NL-RoI.
For simplicity, we adopt the Embedded Gaussian version of $f$: $f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$. Assume that we have $N$ RoIs and $C$ channels of input features, and the aligned RoI spatial size is $H \times W$. Hence, the input feature blob has the shape of $N \times C \times H \times W$. The two embedding functions $\theta$ and $\phi$ are both chosen to be a 1-by-1 2D convolution that reduces the channel dimension of the input blob. The purpose of $f$ is to calculate the correlations between RoIs, so the output of $f$ being applied to the whole input blob should be an $N$-by-$N$ matrix. The output blobs from $\theta$ and $\phi$ are reshaped to two-dimensional matrices with $N$ rows, one flattened feature vector per RoI. Afterward, a matrix multiplication on the reshaped outputs is performed to obtain the correlation matrix. The exponential and normalization terms are implemented by applying softmax to the rows of the correlation matrix.
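The correlation step described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names `conv1x1` and `roi_attention`, the weight names `w_theta` and `w_phi`, and the toy tensor sizes are all assumptions made for the example, and the 1-by-1 convolutions are written without bias terms.

```python
import numpy as np

def conv1x1(x, weight):
    """1-by-1 convolution: mixes channels independently at each position.
    x: (N, C, H, W), weight: (C_out, C) -> (N, C_out, H, W)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def roi_attention(x, w_theta, w_phi):
    """Row-normalised N-by-N correlation matrix between RoIs."""
    n = x.shape[0]
    theta = conv1x1(x, w_theta).reshape(n, -1)   # one flattened vector per RoI
    phi = conv1x1(x, w_phi).reshape(n, -1)
    scores = theta @ phi.T                       # (N, N) pairwise correlations
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # softmax over rows

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16, 7, 7))           # 5 RoIs, 16 channels, 7x7
w_theta = 0.01 * rng.standard_normal((4, 16))    # reduce 16 -> 4 channels
w_phi = 0.01 * rng.standard_normal((4, 16))
attn = roi_attention(x, w_theta, w_phi)
print(attn.shape)  # (5, 5)
```

Because the matrix is built from whatever number of RoIs is passed in, the same function works for a variable number of RoIs per image.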
It is worth noting that this form of $f$ is essentially the same as the self-attention module of Vaswani et al. (2017) for machine translation. For a given $i$, $\frac{1}{\mathcal{C}(x)} f(x_i, x_j)$ becomes a softmax computation along the dimension $j$. Eq. (1) thus results in the self-attention form of Vaswani et al. (2017).
The remaining part $g$ in the non-local RoI operation is responsible for extracting useful non-local information from the input feature. Following the bottleneck design of He et al. (2016), we first use a 1-by-1 convolution to reduce the channel dimension and then a 3-by-3 convolution to take in the spatial information. To further cut down memory cost, a global 2D average pooling is applied. Finally, the pooled feature blob, which holds one feature vector per RoI, is tiled along the spatial dimensions and appended to the end of the input blob, as shown in Fig. 1. ReLU (Nair and Hinton, 2010) is used between the two convolution layers.
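A rough NumPy sketch of this feature-extraction branch is given below. It rests on several assumptions that the text does not settle: the 3-by-3 convolution uses zero padding of 1 and stride 1, the convolutions have no bias, the attention-weighted sum over RoIs is taken on the pooled vectors before tiling, and "appended" is read as concatenation along the channel axis. All function and parameter names (`nl_roi_features`, `w_reduce`, `w_spatial`) are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """x: (N, C, H, W), weight: (C_out, C)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def conv3x3(x, weight):
    """3-by-3 convolution, zero padding 1, stride 1 (assumed).
    x: (N, C, H, W), weight: (C_out, C, 3, 3)."""
    n, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((n, weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, :, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,nchw->nohw", weight[:, :, dy, dx], window)
    return out

def nl_roi_features(x, attn, w_reduce, w_spatial):
    """Append tiled non-local features to the input blob.
    x: (N, C, H, W), attn: (N, N) row-normalised RoI correlations."""
    n, _, h, w = x.shape
    hidden = np.maximum(conv1x1(x, w_reduce), 0.0)    # 1x1 conv, then ReLU
    g = conv3x3(hidden, w_spatial).mean(axis=(2, 3))  # global average pooling
    non_local = attn @ g                              # weighted sum over RoIs
    tiled = np.broadcast_to(non_local[:, :, None, None],
                            (n, g.shape[1], h, w))    # tile over H and W
    return np.concatenate([x, tiled], axis=1)         # append along channels

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16, 7, 7))
attn = np.full((5, 5), 1.0 / 5)                       # uniform stand-in weights
w_reduce = rng.standard_normal((8, 16))
w_spatial = rng.standard_normal((8, 8, 3, 3))
out = nl_roi_features(x, attn, w_reduce, w_spatial)
print(out.shape)  # (5, 24, 7, 7)
```

Pooling before the weighted sum keeps the per-RoI representation to a single vector, which is where the memory saving mentioned in the text comes from.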
We use the COCO 2017 dataset (Lin et al., 2014) to evaluate NL-RoI. The comparison baseline is Mask R-CNN (He et al., 2017), one of the state-of-the-art frameworks for detection and segmentation. The official train/val splits in COCO 2017 are essentially equal to the unofficial minival train/val splits of COCO 2014, which are used by Mask R-CNN. We refer to the latest results reported in Facebook Research's GitHub repository, Detectron (Girshick et al., 2018). These results are generally equal to or better than the ones given in the published papers. The experiments are based on a reimplementation of Detectron in PyTorch (the official Detectron is written in Caffe2).
Faster R-CNN on COCO.
As shown in Table 1, NL-RoI improves AP with either the R-50 or the R-101 backbone network. Similar improvements are achieved with both the short (1x) and the longer (2x) training schedules. NL-RoI also makes the training of Faster R-CNN more effective: the model trained with the 1x schedule achieves performance competitive with the baseline trained with the 2x schedule while using only half the training time.
Mask R-CNN on COCO.
The improvements of NL-RoI on Mask R-CNN models are similar to those on Faster R-CNN, with gains on both bounding box AP and mask AP. However, with the combination of the deeper backbone (R-101) and the longer training schedule (2x), NL-RoI brings only smaller improvements in box AP and mask AP. This phenomenon suggests that deeper neural networks may be better able to encode cross-object relations in high-level features if denser information about individual objects, such as instance masks, is available during training. Supporting evidence for this hypothesis can be found in the experimental results of Faster R-CNN in Table 1. When training Faster R-CNN, only sparse annotations about the objects, i.e., the bounding boxes, are provided, and the improvements achieved by NL-RoI on a deeper backbone are more significant.
Despite the less significant improvement on deeper backbone models, NL-RoI still has better average precision than the baselines on almost every metric. On deeper backbone models, we again observe the same behavior of alternating first place between the two training schedules in each metric, as previously shown in the results of Faster R-CNN. This behavior only exists in the scores for box APs, not in mask APs. The box head network is composed of two FC layers, i.e., a two-layer MLP. In contrast, the mask head network consists of four convolution layers and one transposed-convolution layer. This discrepancy between the box APs and mask APs of the R-101 NL-RoI models further supports the previous discussion about the cause of the behavior: the overpowered high-level features extracted by a deeper backbone saturate the head network and limit its capacity.
Non-local RoI is a generic module that improves the performance of R-CNN based methods by explicitly modeling the relations and attention mechanisms between different object regions. Through experiments on the COCO dataset, we show that NL-RoI achieves consistent improvements on Faster R-CNN and Mask R-CNN with different backbone networks and training schedules. The experimental results also indicate that, when denser or more detailed annotations about objects such as segmentations are given during training, deep neural networks may be able to learn object relations implicitly to some extent. Even so, we show that using NL-RoI to model the relations between objects in perceptual tasks is still more effective and advantageous.
- Dai et al.  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. In Neural Information Processing Systems (NIPS), pages 379–387, 2016.
- Dai et al.  Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In International Conference on Computer Vision (ICCV), pages 764–773, 2017.
- Fu et al.  Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
- Girshick et al.  Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
- Girshick  Ross B. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
- Girshick et al.  Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV), pages 346–361, 2014.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
- Li et al.  Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head R-CNN: in defense of two-stage object detector. CoRR, abs/1711.07264, 2017.
- Lin et al.  Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. doi: 10.1007/978-3-319-10602-1_48. URL https://doi.org/10.1007/978-3-319-10602-1_48.
- Lin et al.  Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
- Liu et al.  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37, 2016.
- Nair and Hinton  Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
- Redmon and Farhadi  Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
- Redmon and Farhadi  Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- Redmon et al.  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), pages 91–99, 2015.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), pages 6000–6010, 2017.
- Wang et al.  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Computer Vision and Pattern Recognition (CVPR), 2018.
Appendix A: Training and Inference
All the models presented in the paper are trained end-to-end, and the residual backbone network is initialized with weights pretrained for ImageNet classification. Batch normalization is not used during training, and the parameters of batch normalization layers, i.e., moving means and moving variances, are merged into only two factors: scaling and shift. All batch normalization layers in the original backbone network are replaced with simple affine transformation layers, as is also done by Detectron.
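Merging batch normalization into a per-channel affine layer can be sketched as below. This is the standard inference-time folding, shown under the assumption that each layer also carries the usual learned scale and offset (called `gamma` and `beta` here) and a small `eps`; these names and the default `eps` value are illustrative choices, not taken from the paper.

```python
import numpy as np

def fold_batch_norm(gamma, beta, moving_mean, moving_var, eps=1e-5):
    """Collapse frozen batch-norm statistics into (scale, shift) per channel."""
    scale = gamma / np.sqrt(moving_var + eps)
    shift = beta - moving_mean * scale
    return scale, shift

def affine(x, scale, shift):
    """Per-channel affine layer that replaces batch norm. x: (N, C, H, W)."""
    return x * scale[None, :, None, None] + shift[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)
mean, var = rng.standard_normal(3), rng.random(3) + 0.5
scale, shift = fold_batch_norm(gamma, beta, mean, var)
y = affine(x, scale, shift)
```

Since `scale` and `shift` are fixed, the layer behaves identically regardless of the batch size placed on each GPU.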
The aspect ratios of input images are not changed, but each image is rescaled so that its shorter side is 800 pixels. If the length of the longer side after rescaling exceeds 1,333 pixels, the image is further resized so that the longer side is 1,333 pixels. When preparing training batches, the images placed on the same GPU are padded to the maximum height and width among them. Therefore, image batches on different GPUs may have different padded sizes. We group the images by their aspect ratios to obtain more compact padded image batches and better occupancy of GPU memory.
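The rescaling and padding rules above can be written as a short sketch. The function names are hypothetical, and rounding the rescaled size to the nearest integer is an assumption; the text does not say how fractional sizes are handled.

```python
def rescaled_size(height, width, short_side=800, max_long_side=1333):
    """Return (new_height, new_width, scale) with the aspect ratio preserved."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        # The longer side would overflow, so cap it instead.
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale), scale

def padded_batch_shape(sizes):
    """Common (height, width) that images sharing a GPU are padded to."""
    return max(h for h, _ in sizes), max(w for _, w in sizes)

print(rescaled_size(480, 640)[:2])    # (800, 1067)
print(rescaled_size(400, 1000)[:2])   # (533, 1333)
print(padded_batch_shape([(800, 1067), (800, 1200)]))  # (800, 1200)
```

Grouping images with similar aspect ratios keeps the padded shape close to each image's own shape, which is the memory benefit the paragraph refers to.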
NL-RoI is applied to two residual backbones, R-50 and R-101, which have 50 and 101 convolution layers, respectively. Feature Pyramid Networks (Lin et al., 2017) are used in both cases. For optimization, stochastic gradient descent with momentum 0.9 and weight decay 0.0001 is used. Two training schedules are available for our experiments. The 1x schedule starts with a learning rate of 0.02 and reduces it by a factor of 10 at two later iterations. The 2x schedule starts with the same learning rate of 0.02 and applies the same two reductions, but runs for twice as long in total.
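A step schedule of this kind can be sketched as follows. The base learning rate of 0.02 and the factor of 10 come from the text; the function name and the milestone values used in the example are toy placeholders chosen for illustration, not the iteration counts used in the experiments.

```python
def step_learning_rate(iteration, milestones, base_lr=0.02, gamma=0.1):
    """Learning rate after dropping by `gamma` at each milestone reached."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops

# Toy milestones only; a 2x-style schedule doubles every milestone.
short_schedule = (600, 800)
long_schedule = tuple(2 * m for m in short_schedule)

for it in (0, 600, 800, 1600):
    print(it, step_learning_rate(it, short_schedule),
          step_learning_rate(it, long_schedule))
```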
A score threshold of 0.05 and greedy non-maximum suppression (NMS) are used to produce the final detections. NMS is only applied among predictions of the same category, and the suppression threshold is 0.5. A maximum of 1000 object proposals is used for the RPN.
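The post-processing described above can be sketched as a per-category greedy NMS. The thresholds 0.05 and 0.5 are those given in the text; the function names, the `(x1, y1, x2, y2)` box format with continuous coordinates, and the use of strict comparisons at the thresholds are assumptions made for this example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def final_detections(boxes, scores, labels, score_thresh=0.05, nms_thresh=0.5):
    """Indices kept after score filtering and greedy NMS within each category."""
    candidates = np.flatnonzero(scores > score_thresh)
    kept = []
    for cls in np.unique(labels[candidates]):
        idx = candidates[labels[candidates] == cls]
        idx = idx[np.argsort(-scores[idx])]          # highest score first
        while idx.size > 0:
            best, rest = idx[0], idx[1:]
            kept.append(int(best))
            overlaps = iou_one_to_many(boxes[best], boxes[rest])
            idx = rest[overlaps <= nms_thresh]       # drop heavy overlaps
    return sorted(kept)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11],
                  [0, 0, 10, 10], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.01])
labels = np.array([1, 1, 2, 1])
print(final_detections(boxes, scores, labels))  # [0, 2]
```

In the example, box 1 is suppressed by box 0 (same category, IoU of about 0.68), box 2 survives because it belongs to a different category, and box 3 falls below the score threshold.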
Appendix B: Ablation Study
In the implementation of Vaswani et al. (2017), the relation scores computed from two feature vectors are divided by the square root of the feature length. In our implementation, the relation scores are computed from flattened features derived from 3-dimensional tensors. Therefore, either the length of the flattened feature or the channel dimension of the original tensor could be used for the scaling factor. We also study the effect of applying or not applying the attention mechanism to the same RoI: by setting the diagonal values of the N-by-N relation score matrix to zero, the embedded features of an RoI do not contribute to the non-local features that NL-RoI extracts for that same RoI. As shown in Table B1, allowing “attend to self” and using only the channel dimension in the scaling factor achieves the best performance on Faster R-CNN. We choose the last configuration in Table B1 as the standard setting for our experiments.
|Attend to self||Scaling Factor|
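The two ablation choices can be illustrated with a small sketch. It assumes that the diagonal is zeroed after the softmax, so that an RoI's own contribution is exactly zero, and that rows are not renormalized afterwards; the text does not specify either point. The function and argument names are hypothetical.

```python
import numpy as np

def relation_weights(theta, phi, scale_dim, attend_to_self=True):
    """theta, phi: (N, D) flattened embedded RoI features.
    scale_dim: value whose square root divides the scores, e.g. the flattened
    length D or only the channel dimension of the original tensor."""
    scores = theta @ phi.T / np.sqrt(scale_dim)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    if not attend_to_self:
        # Assumed variant: remove each RoI's contribution to itself.
        weights = weights * (1.0 - np.eye(len(weights)))
    return weights

rng = np.random.default_rng(0)
channels, height, width = 4, 7, 7
theta = rng.standard_normal((5, channels * height * width))
phi = rng.standard_normal((5, channels * height * width))

by_channels = relation_weights(theta, phi, scale_dim=channels)
by_length = relation_weights(theta, phi, scale_dim=theta.shape[1])
no_self = relation_weights(theta, phi, scale_dim=channels, attend_to_self=False)
```

Scaling by the channel dimension alone divides the scores by a smaller number than scaling by the flattened length, so the resulting softmax is sharper.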