The current trend of deep network architectures for object detection can be categorized into two main streams: one-stage detectors and two-stage detectors. One-stage detectors perform object detection in a single end-to-end pass, e.g., YOLO [17, 18, 19] and SSD [14, 5]. On the other hand, two-stage detectors divide the task into two sub-problems: extracting object region proposals and classifying each of the candidate regions. Detectors such as Faster R-CNN and Light-Head R-CNN [20, 12] are both of this kind.
Mask R-CNN [9] extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI) in parallel with the existing branch for classification and bounding box regression. This showcases the architectural flexibility of two-stage detectors for multitasking compared with their one-stage counterparts. The different branches in Mask R-CNN share the same set of high-level features extracted by a deep CNN backbone network, such as ResNet. Each branch then attends to a specific RoI via RoIAlign, a simple, quantization-free layer that faithfully preserves spatial precision. Further, the proposed Non-Local RoI (NL-RoI) Block can be incorporated into Mask R-CNN to achieve better performance.
The ability to capture long-range and non-local information is a key success factor of deeper CNNs. For vanilla Mask R-CNN, the only means of acquiring non-local information for each RoI is to exploit the high-level features extracted by the deep backbone network. However, these high-level features are shared among all RoIs, regardless of spatial location, semantic category, or task branch. They are therefore expected to be general rather than specific to individual RoIs, which makes it difficult for the same set of features to also carry RoI-specific information. Besides, RoI features are extracted from the rectangular bounding boxes proposed by the Region Proposal Network (RPN), and a single bounding box is very likely to contain multiple instances when the scene is crowded. Moreover, if the instances are of the same category, it is harder for the branch network to separate them along their boundaries by referring only to the local features within an RoI. This is especially true for non-rigid objects such as persons: the target object deforms in shape, and its bounding box has a higher chance of including other objects that interlace with it in complicated ways.
To address this concern, we introduce the NL-RoI Block and argue that RoI-specific non-local information helps discriminate the target instance from the others. For example, owing to the object co-occurrence prior in the real world, a street scene is more likely to contain cars alongside pedestrians than refrigerators. Mutual information between instances may also be useful. Consider a group dance scene: people usually pose in similar ways, so we can predict the pose of a partially occluded dancer more confidently by referring to the other dancers’ poses.
Our NL-RoI Block is inspired by the non-local operations proposed by Wang et al. [23]. They present non-local operations as a family of generic building blocks for capturing long-range dependencies between different locations in the data domain. A location can be a pixel for visual data or an audio sample for acoustic data. In the visual domain, the dependencies may span space for tasks using a single static image, or space-time for tasks involving an extra time dimension, such as video classification. In contrast, the NL-RoI Block focuses on long-range dependencies at a higher level, between instances rather than between pixels. Specifically, our method explicitly enables the network to model correlations and attention between RoIs. By efficiently taking into account all pairs of RoIs in a scene, the NL-RoI Block benefits not only from neighboring RoIs but also from spatially separated ones.
2 Non-local RoI
We first introduce the general definition of the non-local RoI operation, following the notation in [23]. We then provide a detailed implementation of the NL-RoI Block used in the Robust Vision Challenge 2018. Fig. 1 shows the basic idea of how we apply the NL-RoI Block to augment the original RoI feature blobs.
The non-local RoI operation is defined as

$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j), \tag{1}$$

where $i$ is the index of a target RoI whose non-local information is to be computed and $j$ enumerates all the RoIs, including the target one. The input feature blob is denoted as $\mathbf{x}$ and the output feature containing non-local information is denoted by $\mathbf{y}$. A pairwise function $f$ computes a scalar that reflects the correlation between the $i$th target RoI and each of the RoIs ($\forall j$). The unary function $g$ maps the input feature from the $j$th RoI to another representation, which gives the operation the capacity to convert the input feature into one more specialized for non-local information. Finally, the response is normalized by a factor $\mathcal{C}(\mathbf{x})$.
The non-local RoI property in Eq. (1) originates from the fact that all RoIs are associated with each other in the operation. For each RoI, the non-local RoI operation computes responses based on correlations between different RoIs. Theoretically, each RoI should gradually learn to characterize a meaningful instance during training. That is, Eq. (1) enables an attention mechanism between instances. Moreover, this kind of non-local operation supports a variable number of input RoIs.
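As a rough illustration of the operation described above, the following NumPy sketch computes, for every target RoI, a normalized weighted sum of transformed features over all RoIs. The function names, the flattened `(N, D)` feature layout, and the choice of normalizer (the row sum of the pairwise responses) are assumptions made for this example rather than details of the challenge implementation.

```python
import numpy as np


def non_local_roi(x, pairwise_f, unary_g):
    """Generic non-local RoI operation: y_i = (1 / C_i) * sum_j f(x_i, x_j) * g(x_j).

    x          : array of shape (N, D), one flattened feature vector per RoI.
    pairwise_f : callable (x_i, x_j) -> non-negative scalar correlation.
    unary_g    : callable x_j -> array of shape (D_out,).
    The normalizer C_i = sum_j f(x_i, x_j) is an assumed choice for this sketch.
    """
    n = x.shape[0]
    # Pairwise response between every target RoI i and every RoI j (including i itself).
    f = np.array([[pairwise_f(x[i], x[j]) for j in range(n)] for i in range(n)])
    g = np.stack([unary_g(x[j]) for j in range(n)])  # (N, D_out)
    weights = f / f.sum(axis=1, keepdims=True)       # normalize each row by C_i
    return weights @ g                               # (N, D_out)


def gaussian_f(a, b):
    """Hypothetical pairwise function: exponential of a dot product."""
    return float(np.exp(a @ b))


rng = np.random.default_rng(0)
# The same function handles any number of RoIs without modification.
for num_rois in (3, 7):
    feats = rng.normal(size=(num_rois, 4))
    out = non_local_roi(feats, gaussian_f, lambda v: v)
    print(num_rois, out.shape)
```

With a constant pairwise function and an identity `unary_g`, every output row reduces to the mean feature over all RoIs, which is a convenient sanity check.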
2.2 Implementation of NL-RoI Block
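While different instantiations of $f$ are possible, Wang et al. [23] show experimentally that non-local operations are not sensitive to the specific choice. For simplicity, we adopt the Embedded Gaussian version of $f$:

$$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)}, \tag{2}$$

where $\theta$ and $\phi$ are embedding functions and the normalization factor is $\mathcal{C}(\mathbf{x}) = \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)$.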
Assume that we have $N$ RoIs and $C$ channels of input features, and that the aligned RoI spatial size is $H \times W$. Hence, the input feature blob $\mathbf{x}$ has the shape $N \times C \times H \times W$. The two embedding functions $\theta$ and $\phi$ are both chosen to be a 1-by-1 2D convolution that reduces the channel dimension of the input blob to $C'$. The purpose of $f$ is to calculate the correlations between RoIs, so the output of $f$ applied to the whole input blob should be an $N$-by-$N$ matrix. The output blobs from $\theta$ and $\phi$ are therefore reshaped to $N \times C'HW$. Afterward, a matrix multiplication of the reshaped outputs yields the correlation matrix. The exponential and normalization terms are implemented by applying softmax to the rows of the correlation matrix.
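The steps in this paragraph (1-by-1 convolution, reshape, matrix multiplication, row-wise softmax) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the helper names, the tensor sizes, and the random weights are illustrative only, and the max-subtraction before the exponential is a standard numerical-stability step that is not mentioned in the text.

```python
import numpy as np


def conv1x1(x, weight):
    """1-by-1 convolution without bias: x is (N, C, H, W), weight is (C_out, C)."""
    return np.einsum("oc,nchw->nohw", weight, x)


def roi_attention(x, w_theta, w_phi):
    """Row-normalized N-by-N correlation matrix between RoIs."""
    n = x.shape[0]
    theta = conv1x1(x, w_theta).reshape(n, -1)  # (N, C_r * H * W)
    phi = conv1x1(x, w_phi).reshape(n, -1)      # (N, C_r * H * W)
    logits = theta @ phi.T                      # (N, N) pairwise dot products
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    expo = np.exp(logits)
    return expo / expo.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
N, C, C_r, H, W = 5, 16, 4, 7, 7  # illustrative sizes, not taken from the paper
x = rng.normal(size=(N, C, H, W))
w_theta = rng.normal(scale=0.05, size=(C_r, C))
w_phi = rng.normal(scale=0.05, size=(C_r, C))

attn = roi_attention(x, w_theta, w_phi)
print(attn.shape)  # (5, 5)
```

Each row of `attn` holds the attention weights that one target RoI places on all RoIs, so the rows sum to one.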
It is worth noting that this form of $f$ is essentially the same as the Self-Attention Module in [22] for machine translation. For a given $i$, $\frac{1}{\mathcal{C}(\mathbf{x})} f(\mathbf{x}_i, \mathbf{x}_j)$ becomes a softmax computation along the dimension $j$. Eq. (1) thus results in the self-attention form in [22].
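To make the connection explicit, the substitution can be written out step by step. The notation here is an assumption that follows Wang et al.: $\theta$ and $\phi$ are the two embedding functions, $g$ is the unary function, $\mathbf{x}_i$ is the feature of the $i$th RoI, and the normalization factor is taken to be the sum of the pairwise responses over all RoIs.

```latex
\begin{align*}
\mathbf{y}_i
  &= \frac{1}{\sum_{k} e^{\theta(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_k)}}
     \sum_{j} e^{\theta(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)}\, g(\mathbf{x}_j) \\
  &= \sum_{j} \operatorname{softmax}_{j}\!\left(\theta(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)\right) g(\mathbf{x}_j)
\end{align*}
```

The second line is a softmax-weighted sum over RoIs, with $\theta(\mathbf{x}_i)$ playing the role of a query, $\phi(\mathbf{x}_j)$ a key, and $g(\mathbf{x}_j)$ a value.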
The remaining part of the non-local RoI operation, $g$, is responsible for extracting useful non-local information from the input feature. Following the bottleneck design of [10], we first use a 1-by-1 convolution to reduce the channel dimension and then a 3-by-3 convolution to take in the spatial information. To further cut down memory cost, a global 2D average pooling is applied. Finally, the pooled feature blob of shape $N \times C_g \times 1 \times 1$, where $C_g$ is the channel dimension after the 3-by-3 convolution, is tiled along the spatial dimensions and appended to the end of the input blob, as shown in Fig. 2. ReLU [15] is used between the two convolution layers.
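The feature branch and the final augmentation step might look as follows in NumPy. This is a sketch under several assumptions: all names and sizes are hypothetical, the convolutions have no bias, the 3-by-3 convolution uses zero padding of 1, and the attention matrix is applied to the pooled features before tiling, as Eq. (1) suggests. A uniform matrix stands in for the softmax correlation matrix.

```python
import numpy as np


def conv1x1(x, weight):
    """1-by-1 convolution without bias: x is (N, C, H, W), weight is (C_out, C)."""
    return np.einsum("oc,nchw->nohw", weight, x)


def conv3x3(x, weight):
    """3-by-3 convolution, stride 1, zero padding 1; weight is (C_out, C_in, 3, 3)."""
    n, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((n, weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, :, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,nchw->nohw", weight[:, :, dy, dx], patch)
    return out


def nl_roi_augment(x, attn, w_reduce, w_spatial):
    """Append attention-weighted, pooled non-local features to each RoI blob."""
    n, _, h, w = x.shape
    hidden = np.maximum(conv1x1(x, w_reduce), 0.0)     # 1x1 conv followed by ReLU
    g = conv3x3(hidden, w_spatial).mean(axis=(2, 3))   # 3x3 conv, global average pool: (N, C_g)
    y = attn @ g                                       # mix pooled features across RoIs
    tiled = np.broadcast_to(y[:, :, None, None], (n, y.shape[1], h, w))
    return np.concatenate([x, tiled], axis=1)          # (N, C + C_g, H, W)


rng = np.random.default_rng(0)
N, C, C_mid, C_g, H, W = 5, 16, 4, 8, 7, 7  # illustrative sizes only
x = rng.normal(size=(N, C, H, W))
attn = np.full((N, N), 1.0 / N)             # stand-in for the softmax correlation matrix
w_reduce = rng.normal(size=(C_mid, C))
w_spatial = rng.normal(size=(C_g, C_mid, 3, 3))

out = nl_roi_augment(x, attn, w_reduce, w_spatial)
print(out.shape)  # (5, 24, 7, 7)
```

Pooling before the attention-weighted sum keeps the matrix product at size N-by-C_g instead of N-by-(C_g·H·W), which is where the memory saving mentioned in the text comes from.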
3 Instance Segmentation Model
Our NL-RoI Block is plugged into Mask R-CNN to perform instance segmentation. The backbone network for image feature extraction is ResNet-50 with FPN [13]. We replace batch normalization with group normalization [24] for better training stability and convergence with a smaller batch size.
The core training datasets for our method include Cityscapes [3], KITTI Instance Segmentation [1], WildDash [25], and ScanNet [4]. In addition, we use ADE20K [26] to provide more furniture samples for training. There are 76,528 valid training images in total. We train for 136K iterations, starting from a learning rate of and reducing it to , , on 56Kth, 76Kth, 116Kth iteration respectively. We use a weight decay of 0.0001 and a momentum of 0.9. Pre-trained weights for the corresponding Mask R-CNN architecture from Detectron [7] are loaded during initialization.
At inference time, the input image is resized so that its shorter side is 800 pixels. If the longer side of the resized image then exceeds 1,333 pixels, we resize the image further so that the longer side is exactly 1,333 pixels. Soft-NMS [2] and box-voting [6] are also used during inference.
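The resizing rule above amounts to choosing a single scale factor per image. A minimal sketch, assuming the aspect ratio is preserved and output sizes are rounded to whole pixels (the function names and the example image sizes are illustrative):

```python
def inference_scale(height, width, target_short=800, max_long=1333):
    """Scale factor that brings the shorter side to target_short, capped by max_long."""
    short_side = min(height, width)
    long_side = max(height, width)
    scale = target_short / short_side
    # If the longer side would overshoot the cap, scale to the cap instead.
    if scale * long_side > max_long:
        scale = max_long / long_side
    return scale


def resized_shape(height, width, **kwargs):
    """Resized (height, width), rounded to the nearest integer pixel."""
    scale = inference_scale(height, width, **kwargs)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))    # shorter side reaches 800, cap not triggered
print(resized_shape(1024, 2048))  # wide image: the longer side is capped at 1333
```

For wide images the cap takes over, so the shorter side ends up below 800 pixels.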
4 Benchmark Results
References
- [1] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented reality meets deep learning for car instance segmentation in urban scenes. In British Machine Vision Conference (BMVC), 2017.
- [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS - improving object detection with one line of code. In International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
- [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR), 2016.
- [4] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Computer Vision and Pattern Recognition (CVPR), 2017.
- [5] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
- [6] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In International Conference on Computer Vision (ICCV), pages 1134–1142, 2015.
- [7] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
- [8] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
- [9] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
- [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
- [12] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head R-CNN: in defense of two-stage object detector. CoRR, abs/1711.07264, 2017.
- [13] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
- [14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37, 2016.
- [15] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
- [16] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In Neural Information Processing Systems Workshop (NIPS-W), 2017.
- [17] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
- [18] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
- [19] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- [20] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), pages 91–99, 2015.
- [21] S.-Y. R. Tseng. Detectron.pytorch. https://github.com/roytseng-tw/Detectron.pytorch, 2018.
- [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), pages 6000–6010, 2017.
- [23] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Computer Vision and Pattern Recognition (CVPR), 2018.
- [24] Y. Wu and K. He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
- [25] O. Zendel, M. Murschitz, M. Humenberger, and W. Herzner. How good is my test data? introducing safety analysis for computer vision. International Journal of Computer Vision, 125(1-3):95–109, 2017.
- [26] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Computer Vision and Pattern Recognition (CVPR), 2017.