Non-local RoI for Cross-Object Perception

11/25/2018 ∙ by Shou-Yao Roy Tseng, et al.

We present a generic and flexible module that encodes region proposals by both their intrinsic features and the extrinsic correlations to the others. The proposed non-local region of interest (NL-RoI) can be seamlessly adapted into different generalized R-CNN architectures to better address various perception tasks. Observe that existing techniques from R-CNN treat RoIs independently and perform the prediction solely based on image features within each region proposal. However, the pairwise relationships between proposals could further provide useful information for detection and segmentation. NL-RoI is thus formulated to enrich each RoI representation with the information from all other RoIs, and yield a simple, low-cost, yet effective module for region-based convolutional networks. Our experimental results show that NL-RoI can improve the performance of Faster/Mask R-CNN for object detection and instance segmentation.




1 Introduction

Current deep network architectures for object detection can be categorized into one-stage detectors and two-stage detectors. One-stage detectors perform object detection in an end-to-end, single-pass manner, e.g., YOLO (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018) and SSD (Liu et al., 2016; Fu et al., 2017). On the other hand, two-stage detectors divide the task into two sub-problems that respectively focus on extracting object region proposals and classifying each of the candidate regions. Detectors such as Faster R-CNN (Ren et al., 2015) and Light-Head R-CNN (Li et al., 2017) are both of this kind.

State-of-the-art object detection methods (He et al., 2014; Girshick, 2015; Ren et al., 2015; Dai et al., 2016, 2017; Lin et al., 2017; He et al., 2017) in terms of precision mainly follow the region based paradigm, which is popularized by the seminal work R-CNN (Girshick et al., 2014). Given a sparse set of region proposals, object classification and bounding box regression are performed on each proposal individually. Mask R-CNN (He et al., 2017) extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI) in parallel with the existing branch for classification and bounding box regression. This showcases the flexibility of two-stage detectors for multitasking over the one-stage counterparts. Different branches in Mask R-CNN share the same set of high-level features from a CNN backbone network, such as ResNet (He et al., 2016). Each branch attends to a specific RoI via RoIAlign, a quantization-free layer that faithfully preserves spatial preciseness. Further, our non-local RoI (NL-RoI) mechanism can be incorporated into Mask R-CNN to achieve better performance.

The ability to capture long-range and non-local information is a key success factor of deeper CNNs. For vanilla Mask R-CNN, the only way to acquire non-local information for each RoI is to explore the high-level features extracted by the deep backbone network. However, the high-level features are shared among all RoIs of different spatial locations, semantic categories, and branches for different tasks. Such high-level features are assumed to be general rather than specific to individual RoIs so that they are applicable to all the above varieties. Therefore, it is difficult for the same set of features to also contain RoI-specific information. Besides, RoI features are extracted as rectangles based on the corresponding bounding boxes proposed by the Region Proposal Network (RPN). A single bounding box is very likely to contain multiple instances when the scene is crowded. Moreover, if the instances are of the same category, it is harder for the branch network to tell the boundaries apart by referring only to the local features within an RoI. Especially for non-rigid objects, such as persons, the target object deforms in shape, and the bounding box has a higher chance of including other objects or background interlaced in a more complicated way.

We introduce the idea of NL-RoI to address the aforementioned issues and argue that RoI-specific non-local information can be helpful in discriminating the target instance from the others. For example, due to object co-occurrence prior in the real world, it is more probable to see cars along with pedestrians instead of refrigerators in a street scene. Besides, mutual information between instances may also be useful. Consider a scene of group dancing: People are usually posing in similar ways, and hence we can more confidently predict the pose for a dancer under partial occlusion, by referring to other dancers’ poses.

2 Formulation

Inspired by the non-local operation in (Wang et al., 2018), we define a generic non-local RoI operation for use in conjunction with R-CNN based models (Girshick et al., 2014):

$$ y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (1) $$

where $i$ is the index of a target RoI whose non-local information is to be computed and $j$ enumerates all the RoIs, including the target one. The input feature blob is denoted as $x$ and the output feature containing non-local information is denoted by $y$. A pairwise function $f$ computes a scalar that reflects the correlation between the $i$th target RoI and each of the RoIs ($\forall j$). The unary function $g$ maps the input feature from the $j$th RoI to another representation, which gives the operation the capacity to convert the input feature to be more specialized for non-local information. Finally, the response is normalized by a factor $\mathcal{C}(x)$.
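The generic operation described above can be sketched directly in NumPy. This is a minimal illustration, not the authors' implementation: the function name `non_local_roi`, the choice of the row sum as the normalization factor, and the toy `f` and `g` passed in at the end are all assumptions made for the example.

```python
import numpy as np

def non_local_roi(x, f, g):
    """Sketch of y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j).

    x: array of shape (N, D), one flattened feature vector per RoI.
    f: pairwise function returning a scalar for two RoI features.
    g: unary function mapping one RoI feature to a new representation.
    """
    n = x.shape[0]
    # Pairwise scores between every target RoI i and every RoI j (including i itself).
    scores = np.array([[f(x[i], x[j]) for j in range(n)] for i in range(n)])
    # Unary embedding of every RoI, shape (N, D_out).
    embedded = np.stack([g(x[j]) for j in range(n)])
    # Assumption: the normalization factor is the sum of scores over j.
    norm = scores.sum(axis=1, keepdims=True)
    return (scores / norm) @ embedded

# Toy usage with hypothetical f and g; the number of RoIs can vary freely.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
y = non_local_roi(x, f=lambda a, b: np.exp(a @ b / 8.0), g=lambda v: v[:4])
```

Because every output row is a weighted average over all RoIs, the same function works for any number of proposals without changes.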

The non-local RoI property in Eq. (1) originates from the fact that all RoIs are associated with each other in the operation. For each RoI, the non-local RoI operation computes responses based on correlations between different RoIs. Theoretically, each RoI should gradually learn to characterize a meaningful instance during training. That is, Eq. (1) enables the attention mechanism between instances. Moreover, this kind of non-local operation supports a variable input number of RoIs. Fig. 1a shows an overview of NL-RoI.

Figure 1: (a) The yellow block represents the original high-level feature tensor extracted by the backbone network. The blue block represents the non-local feature calculated by the NL-RoI module. The original RoI feature tensor is concatenated with the non-local feature tensor. (b) The detailed operations of an NL-RoI module. Assume $N$ RoIs are proposed. The relation function $f$ computes scores on two flattened features obtained by two functions $\theta$ and $\phi$, which are 1-by-1 convolution layers. The embedding function $g$ consists of two convolution layers and a pooling layer. The first convolution layer is designed to reduce the feature channel dimension, and the second one uses 3-by-3 kernels to extract non-local specific features. The final non-local features, obtained from the relation scores and the embedded features produced by $g$, are one-dimensional vectors for each RoI. These one-dimensional features are tiled across the spatial dimensions to form a three-dimensional tensor, and are then concatenated with the original RoI feature tensors.

3 Implementation

For simplicity, we adopt the Embedded Gaussian version of $f$: $f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)}$. Assume that we have $N$ RoIs and $C$ channels of input features, and the aligned RoI spatial size is $H \times W$. Hence, the input feature blob has the shape of $(N, C, H, W)$. The two embedding functions $\theta$ and $\phi$ are both chosen to be a 1-by-1 2D convolution that reduces the channel dimension of the input blob. The purpose of $f$ is to calculate the correlations between RoIs, so the output of $f$ applied to the whole input blob should be an $N$-by-$N$ matrix. The output blobs from $\theta$ and $\phi$ are reshaped to $(N, C'HW)$, where $C'$ is the reduced channel dimension. Afterward, a matrix multiplication on the reshaped outputs is performed to obtain the correlation matrix. The exponential and normalization terms are implemented by applying softmax to the rows of the correlation matrix.
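The steps in this paragraph (two 1-by-1 convolutions, flattening, matrix multiplication, row-wise softmax) can be sketched as follows. It is a NumPy illustration under stated assumptions: the helper names `conv1x1` and `relation_matrix`, the weight shapes, and the absence of bias terms are choices made for the example, and the paper's own PyTorch code may differ.

```python
import numpy as np

def conv1x1(x, w):
    """1-by-1 convolution without bias: x is (N, C, H, W), w is (C_out, C)."""
    return np.einsum("oc,nchw->nohw", w, x)

def relation_matrix(x, w_theta, w_phi):
    """Row-normalized N-by-N relation scores between RoIs (embedded Gaussian form)."""
    n = x.shape[0]
    theta = conv1x1(x, w_theta).reshape(n, -1)   # one flattened vector per RoI
    phi = conv1x1(x, w_phi).reshape(n, -1)
    logits = theta @ phi.T                       # (N, N) correlation matrix
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy usage: 6 RoIs, 16 input channels reduced to 4, 7-by-7 aligned features.
rng = np.random.default_rng(0)
roi_features = rng.normal(size=(6, 16, 7, 7))
w_theta = rng.normal(size=(4, 16)) * 0.01
w_phi = rng.normal(size=(4, 16)) * 0.01
relation = relation_matrix(roi_features, w_theta, w_phi)
```

Each row of `relation` sums to one, so row `i` can be read as how strongly RoI `i` attends to every RoI in the set.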

It is worth noting that this form of $f$ is essentially the same as the Self-Attention Module in (Vaswani et al., 2017) for machine translation. For a given $i$, $\frac{1}{\mathcal{C}(x)} f(x_i, x_j)$ becomes a softmax computation along the dimension $j$, so Eq. (1) results in the self-attention form of Vaswani et al. (2017).

The remaining part of the non-local RoI operation, $g$, is responsible for extracting useful non-local information from the input feature. Following the bottleneck design of He et al. (2016), we first use a 1-by-1 convolution to reduce the channel dimension and then a 3-by-3 convolution to take in the spatial information. A ReLU activation function (Nair and Hinton, 2010) is used between the two convolution layers. To further cut down memory cost, a global 2D average pooling is applied. Finally, the pooled feature blob, which holds one feature vector per RoI, is tiled across the spatial dimensions and appended to the end of the input blob, as shown in Fig. 1b.
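A rough NumPy sketch of this embedding branch and the final tile-and-concatenate step follows. Several details are assumptions rather than facts from the text: zero padding of 1 for the 3-by-3 convolution, no bias terms, the helper names, and applying the relation weights to the pooled vectors before tiling.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1x1(x, w):
    """1-by-1 convolution without bias: x is (N, C, H, W), w is (C_out, C)."""
    return np.einsum("oc,nchw->nohw", w, x)

def conv3x3_same(x, w):
    """3-by-3 convolution with zero padding 1 (assumed): w is (C_out, C, 3, 3)."""
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3), axis=(2, 3))  # (N, C, H, W, 3, 3)
    return np.einsum("ocij,nchwij->nohw", w, windows)

def nl_roi_features(x, attention, w_reduce, w_spatial):
    """Append tiled non-local features to the RoI features x of shape (N, C, H, W)."""
    hidden = np.maximum(conv1x1(x, w_reduce), 0.0)   # channel reduction, then ReLU
    hidden = conv3x3_same(hidden, w_spatial)         # spatial context
    pooled = hidden.mean(axis=(2, 3))                # global average pooling, (N, C_out)
    non_local = attention @ pooled                   # mix information across RoIs
    h, w = x.shape[2:]
    tiled = np.broadcast_to(non_local[:, :, None, None], non_local.shape + (h, w))
    return np.concatenate([x, tiled], axis=1)        # (N, C + C_out, H, W)

# Toy usage with uniform attention over 4 RoIs.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 7, 7))
attention = np.full((4, 4), 0.25)
out = nl_roi_features(x, attention, rng.normal(size=(8, 16)), rng.normal(size=(8, 8, 3, 3)))
```

Pooling before mixing keeps the cross-RoI step to a small matrix product, which is in line with the memory saving the paragraph mentions.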

4 Experiments

We use the COCO 2017 dataset (Lin et al., 2014) to evaluate NL-RoI. The comparison baseline is Mask R-CNN (He et al., 2017), one of the state-of-the-art frameworks for detection and segmentation. The official train/val splits in COCO 2017 are essentially equal to the unofficial minival COCO 2014 train/val splits, which are used by Mask R-CNN. We refer to the latest results reported in Facebook Research's GitHub repository, called Detectron (Girshick et al., 2018). These results are generally equal to or better than the ones given in the published papers. The experiments are based on a reimplementation of Detectron in PyTorch (the official Detectron is written in Caffe2).

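Backbone | Method   | Training schedule | AP    | AP50  | AP75  | APs   | APm   | APl
R-50     | Baseline | 1x                | 36.71 | 58.45 | 39.61 | 21.12 | 39.85 | 48.13
R-50     | Baseline | 2x                | 37.90 | 59.25 | 41.10 | 21.50 | 41.10 | 49.91
R-50     | NL-RoI   | 1x                | 37.59 | 60.22 | 40.61 | 22.10 | 40.81 | 48.59
R-50     | NL-RoI   | 2x                | 38.40 | 60.48 | 41.45 | 22.91 | 40.98 | 50.40
R-101    | Baseline | 1x                | 39.40 | 61.19 | 43.41 | 22.57 | 42.91 | 51.37
R-101    | Baseline | 2x                | 39.78 | 61.29 | 43.28 | 22.88 | 43.33 | 52.65
R-101    | NL-RoI   | 1x                | 39.72 | 62.33 | 43.02 | 23.67 | 43.40 | 51.54
R-101    | NL-RoI   | 2x                | 40.15 | 62.13 | 43.47 | 23.20 | 43.66 | 52.54
Table 1: Evaluation on Faster R-CNN on COCO2017 validation set.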

Faster R-CNN on COCO.

As shown in Table 1, NL-RoI improves AP over the baseline by roughly 0.3 to 0.9 points with either the R-50 or the R-101 backbone network. Similar improvements are achieved using both the short (1x) and the longer (2x) training schedules. NL-RoI also makes the training of Faster R-CNN more efficient, since the model trained with the 1x schedule comes close to the baseline trained with the 2x schedule while using only half the training time.
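The size of the gain can be checked directly from the numbers in Table 1. The short script below does that arithmetic; it assumes that the first metric column of the table is the overall COCO AP, and the dictionary layout is only an illustration.

```python
# First metric column of Table 1 (assumed to be overall AP): (baseline, NL-RoI).
table1_ap = {
    ("R-50", "1x"): (36.71, 37.59),
    ("R-50", "2x"): (37.90, 38.40),
    ("R-101", "1x"): (39.40, 39.72),
    ("R-101", "2x"): (39.78, 40.15),
}

# Absolute gain of NL-RoI over the baseline for each backbone and schedule.
gains = {key: round(nl_roi - base, 2) for key, (base, nl_roi) in table1_ap.items()}

for (backbone, schedule), gain in gains.items():
    print(f"{backbone} {schedule}: +{gain:.2f}")
```

The gains come out as 0.88, 0.50, 0.32, and 0.37 points, so the improvement is present in every setting but shrinks with the deeper backbone.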

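(a) Bounding box AP

Backbone | Method   | Training schedule | AP    | AP50  | AP75  | APs   | APm   | APl
R-50     | Baseline | 1x                | 37.69 | 59.16 | 40.86 | 21.36 | 40.76 | 49.75
R-50     | Baseline | 2x                | 38.61 | 59.84 | 42.10 | 22.20 | 41.50 | 50.77
R-50     | NL-RoI   | 1x                | 38.26 | 60.55 | 41.39 | 22.79 | 41.28 | 49.84
R-50     | NL-RoI   | 2x                | 39.18 | 61.22 | 42.94 | 23.78 | 42.29 | 51.26
R-101    | Baseline | 1x                | 40.01 | 61.80 | 43.67 | 22.55 | 43.40 | 52.69
R-101    | Baseline | 2x                | 40.89 | 61.94 | 44.78 | 23.50 | 44.21 | 53.89
R-101    | NL-RoI   | 1x                | 40.53 | 62.68 | 44.24 | 23.77 | 44.31 | 52.40
R-101    | NL-RoI   | 2x                | 40.92 | 62.76 | 44.37 | 23.40 | 44.23 | 54.07

(b) Mask AP

Backbone | Method   | Training schedule | AP    | AP50  | AP75  | APs   | APm   | APl
R-50     | Baseline | 1x                | 33.86 | 55.81 | 35.82 | 14.86 | 36.34 | 50.85
R-50     | Baseline | 2x                | 34.48 | 56.45 | 36.29 | 15.60 | 37.09 | 52.06
R-50     | NL-RoI   | 1x                | 34.35 | 57.07 | 36.14 | 16.09 | 36.90 | 51.26
R-50     | NL-RoI   | 2x                | 35.26 | 57.84 | 37.30 | 16.50 | 38.08 | 52.21
R-101    | Baseline | 1x                | 35.92 | 58.30 | 37.95 | 15.95 | 38.92 | 53.23
R-101    | Baseline | 2x                | 36.39 | 58.47 | 38.68 | 16.64 | 39.15 | 54.00
R-101    | NL-RoI   | 1x                | 36.32 | 59.09 | 38.19 | 16.40 | 39.53 | 53.59
R-101    | NL-RoI   | 2x                | 36.64 | 59.39 | 38.68 | 16.75 | 39.62 | 55.09
Table 2: Evaluation on Mask R-CNN. Tested on COCO2017 validation set.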

Mask R-CNN on COCO.

The improvements of NL-RoI on Mask R-CNN models are similar to those on Faster R-CNN. In most settings, an increment of roughly 0.4 to 0.8 points is obtained on both box AP and mask AP. However, on the combination of the deeper backbone (R-101) and the longer training schedule (2x), NL-RoI brings only about 0.03 and 0.25 points of improvement in box AP and mask AP, respectively. This phenomenon suggests that deeper neural networks may be better able to encode cross-object relations in high-level features if denser information about individual objects, such as instance masks, is available during training. Supporting evidence for this hypothesis can be found in the experimental results of Faster R-CNN in Table 1. When training Faster R-CNN, only sparse annotations of the objects, i.e., the bounding boxes, are provided, and the improvements achieved by NL-RoI on a deeper backbone are more significant.

Despite the less significant improvement on deeper backbone models, NL-RoI still has better average precision than the baselines on almost every metric. On deeper backbone models, again, we can observe the same behavior of alternating first place between the two training schedules in each metric, as previously shown in the results of Faster R-CNN. This behavior exists only in the scores for box APs, not in those for mask APs. The box head network is composed of two FC layers, i.e., a two-layer MLP. In contrast, the mask head network consists of four convolution layers and one transposed-convolution layer. This discrepancy, as shown in the box APs and mask APs of the R-101 NL-RoI models, provides further support for the previous discussion about the cause of the behavior: the overpowered high-level features extracted by a deeper backbone saturate the head network and limit its capacity.

5 Conclusion

Non-local RoI is a generic module that improves the performance of R-CNN based methods by explicitly modeling the relations and attention mechanisms between different object regions. Through experiments on the COCO dataset, we show that NL-RoI achieves consistent improvements on Faster R-CNN and Mask R-CNN with different backbone networks and training schedules. The experimental results also indicate that, when denser or more detailed annotations about objects, such as segmentations, are given during training, deep neural networks may be able to learn object relations implicitly to some extent. Even so, we show that using NL-RoI to model the relations between objects in perceptual tasks is still more effective and advantageous.


References

  • Dai et al. [2016] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. In Neural Information Processing Systems (NIPS), pages 379–387, 2016.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In International Conference on Computer Vision (ICCV), pages 764–773, 2017.
  • Fu et al. [2017] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD : Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
  • Girshick et al. [2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron, 2018.
  • Girshick [2015] Ross B. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
  • Girshick et al. [2014] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
  • He et al. [2014] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV), pages 346–361, 2014.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  • Li et al. [2017] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head R-CNN: in defense of two-stage object detector. CoRR, abs/1711.07264, 2017.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. doi: 10.1007/978-3-319-10602-1_48.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
  • Liu et al. [2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37, 2016.
  • Nair and Hinton [2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
  • Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Redmon et al. [2016] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), pages 91–99, 2015.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), pages 6000–6010, 2017.
  • Wang et al. [2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Computer Vision and Pattern Recognition (CVPR), 2018.

Appendix A: Training and Inference

All the models presented in the paper are trained end-to-end, and the residual backbone network is initialized with weights pretrained for ImageNet classification. Batch normalization is not used during training, and the parameters of the batch normalization layers, including the moving means and moving variances, are merged into only two factors: scaling and shift. All batch normalization layers in the original backbone network are thus replaced with simple affine transformation layers.


The aspect ratios of input images are not changed, but each image is rescaled so that its shorter side is 800 pixels. If the length of the longer side after rescaling exceeds 1,333 pixels, the image is further resized so that the longer side is 1,333 pixels. When preparing training batches, the images placed on the same GPU are padded to the maximum height and width among them. Therefore, image batches on different GPUs may have different padded sizes. We group the images by their aspect ratios so that the padded image batches are more compact and make better use of GPU memory.
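The resizing and padding rules above amount to a few lines of arithmetic. The sketch below is one way to write them; the function names and the rounding to the nearest pixel are assumptions, since the text does not say how fractional sizes are handled.

```python
def rescaled_size(height, width, short_side=800, max_long_side=1333):
    """Scale so the shorter side is short_side, capping the longer side at max_long_side."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        # The longer side would overflow, so it decides the scale instead.
        scale = max_long_side / max(height, width)
    # Assumption: sizes are rounded to the nearest pixel.
    return round(height * scale), round(width * scale)

def padded_batch_shape(sizes):
    """Common (height, width) that every image on one GPU is padded to."""
    return max(h for h, _ in sizes), max(w for _, w in sizes)

# A 480x640 image keeps its aspect ratio; a 400x1000 image hits the 1,333 cap.
sizes = [rescaled_size(480, 640), rescaled_size(400, 1000)]
batch_shape = padded_batch_shape(sizes)
```

In this example the two images become 800x1067 and 533x1333, and the shared padded shape is 800x1333. The wasted padding in such a mixed batch is what grouping by aspect ratio is meant to reduce.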

NL-RoI is applied to two residual backbones, R-50 and R-101, which have 50 and 101 layers, respectively. Feature Pyramid Networks (Lin et al., 2017) are used in both cases. For optimization, stochastic gradient descent with momentum 0.9 and weight decay 0.0001 is used. There are two training schedules in our experiments. The 1x schedule starts with a learning rate of 0.02 and reduces it by a factor of 10 at two preset iterations before training ends. The 2x schedule starts with the same learning rate of 0.02 and also reduces it by a factor of 10 twice, but runs for a longer total number of iterations.
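A step schedule of this kind can be written as a small function. The iteration counts are not given in the text above, so the defaults below (drops at 60k and 80k iterations for 1x, 120k and 160k for 2x) are borrowed from the public Detectron configurations as an assumption and may not match the exact values used in these experiments.

```python
def step_lr(iteration, base_lr=0.02, steps=(60_000, 80_000), gamma=0.1):
    """Learning rate after dividing by 10 at each step that has been reached.

    The default step positions are assumed Detectron 1x values, not taken from the text.
    """
    drops = sum(iteration >= step for step in steps)
    return base_lr * gamma ** drops

# Assumed 2x variant: same base rate, step positions doubled.
def step_lr_2x(iteration):
    return step_lr(iteration, steps=(120_000, 160_000))
```

With these assumed values, `step_lr` returns 0.02 at the start, 0.002 after the first drop, and 0.0002 after the second.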

A score threshold of 0.05 and greedy non-maximum suppression (NMS) are used to produce the final detections. NMS is applied only among predictions of the same category, and the suppression threshold is 0.5. A maximum of 1000 object proposals is used for the RPN.
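The post-processing described here can be sketched with NumPy as below. The box format `(x1, y1, x2, y2)`, the function names, and the choice to suppress boxes whose IoU is strictly greater than the threshold are assumptions made for the example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou_one_to_many(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

def final_detections(boxes, scores, labels, score_threshold=0.05, iou_threshold=0.5):
    """Score filtering followed by NMS run separately for each category."""
    confident = np.flatnonzero(scores > score_threshold)
    selected = []
    for category in np.unique(labels[confident]):
        idx = confident[labels[confident] == category]
        kept = greedy_nms(boxes[idx], scores[idx], iou_threshold)
        selected.extend(idx[kept].tolist())
    return sorted(selected)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.01])
labels = np.array([1, 1, 2, 1])
kept = final_detections(boxes, scores, labels)
```

In this toy input the second box is suppressed by the first (same category, IoU about 0.68), the third survives because it belongs to another category, and the fourth is removed by the score threshold.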

Appendix B: Ablation Study

In the implementation of Vaswani et al. (2017), the relation scores computed from two feature vectors are normalized by the square root of the feature length. In our implementation, the relation scores are computed from flattened features derived from 3-dimensional tensors. Therefore, either the length of the flattened feature or the channel dimension of the original tensor could be used for the scaling factor. We also study the effect of applying or not applying the attention mechanism to the same RoI. That is, by setting the diagonal values of the N-by-N relation score matrix to zero, the embedded features of an RoI do not contribute to the non-local features that NL-RoI extracts for that same RoI. As shown in Table B1, allowing "attend to self" and using only the channel dimension in the scaling factor achieves the best performance on Faster R-CNN. We choose the last configuration in Table B1 as the standard setting for our experiments.

Attend to self | Scaling factor   | AP    | AP50  | AP75  | APs   | APm   | APl
No             |                  | 36.96 | 59.21 | 39.84 | 20.43 | 40.29 | 49.39
No             |                  | 37.32 | 59.82 | 39.97 | 22.24 | 40.31 | 48.72
Yes            | flattened length | 37.52 | 60.12 | 40.50 | 22.13 | 40.45 | 49.19
Yes            | channel dim.     | 37.59 | 60.22 | 40.61 | 22.10 | 40.81 | 48.59
Table B1: Ablation study on the implementation of NL-RoI based on Faster R-CNN. The configuration in the fourth row is adopted as the standard setting for our experiments.
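The two ablation switches can be illustrated with a small NumPy function. This is a sketch under assumptions: the function and argument names are invented for the example, the scaling is taken to be a division by the square root of the chosen dimension, and the diagonal is zeroed after the softmax without renormalizing, which is one reading of "setting the diagonal values to zero".

```python
import numpy as np

def relation_weights(theta, phi, channels, scale_by="channel", attend_to_self=True):
    """Relation weights between RoIs under the ablation settings.

    theta, phi: (N, D) flattened embeddings, where D = channels * height * width.
    scale_by: "channel" divides by sqrt(channels); "flattened" divides by sqrt(D).
    """
    dim = channels if scale_by == "channel" else theta.shape[1]
    logits = theta @ phi.T / np.sqrt(dim)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    if not attend_to_self:
        # Assumption: mask the diagonal after normalization, with no renormalization.
        weights = weights * (1.0 - np.eye(weights.shape[0]))
    return weights

rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 4 * 7 * 7))
phi = rng.normal(size=(5, 4 * 7 * 7))
standard = relation_weights(theta, phi, channels=4)                      # adopted setting
no_self = relation_weights(theta, phi, channels=4, attend_to_self=False)
flat = relation_weights(theta, phi, channels=4, scale_by="flattened")
```

Dividing by the larger flattened length flattens the softmax, so the `flat` weights are closer to uniform than the `standard` ones.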