Nowadays, instance segmentation is one of the most studied topics in the computer vision community. It differs from both object detection, where the final output is the set of rectangular bounding boxes which localize and classify any object instance, and semantic segmentation, where the goal is to classify any image pixel without considering if it is part of a specific instance. In instance segmentation, the final goal is to be able to cut the single instances of objects from the original image. Its characteristics make this task very useful for several advanced applications, such as object relationship detection, automatic image captioning, content-based image retrieval, and many others.
In the recent literature, many studies have addressed the instance segmentation problem. The proposed architectures can be grouped into two main categories: one-step and two-step architectures. One-step architectures obtain the results in a single pass, making a direct prediction from the input image. In contrast, a two-step architecture is usually composed of a Region Proposal Network (RPN), which returns a list of Regions of Interest (RoIs) that are likely to contain the searched object, followed by a more specialized network that detects or segments the object/instance within each of the bounding boxes found. These networks descend from a common ancestor, R-CNN.
The typical components of a two-step architecture are shown in Fig. 1. As can be seen in the diagram, the layer (highlighted in red) connecting the two steps is usually the RoI extractor, which is the main focus of this paper. Since this layer plays a crucial role in the final results, it should be carefully designed to minimize the loss of information.
The main objective of this layer is to perform pooling in order to transform the input region, which can be of any size, into a fixed-size feature map. Several previous papers have tackled this problem using different RoI pooling algorithms, such as RoI Align, RoI Warp and Precise RoI Pooling. Since instances of objects can appear in the image at different scales, the existing architectures (as shown in Fig. 1) exploit a Feature Pyramid Network (FPN) combined with an RPN (e.g. Fast R-CNN, Faster R-CNN and Mask R-CNN) to generate multi-scale feature maps. An FPN is composed of a bottom-up pathway, for which the final convolutional layers of the backbone are often chosen, followed by a top-down pathway that reconstructs spatial resolution from the upper layers of the pyramid, which have a higher semantic value. With the introduction of an FPN, the fundamental issue is the selection of the FPN layer to which the RoI pooler will be applied.
Traditional methods make the selection based on the RoI obtained by the RPN. They use a formula based on the width $w$ and height $h$ of the RoI to determine the best $k$-th layer to sample from:

$$k = \left\lfloor k_0 + \log_2\left(\frac{\sqrt{wh}}{224}\right) \right\rfloor$$

where $k_0$ represents the highest level feature map and 224 is the typical image size used to pre-train the backbone with the ImageNet dataset. This hard selection of a single layer of the FPN might limit the power of the network’s description, and our intuition (supported by previous works) is that if all scale-specific features are retained, better object detection and segmentation results can be achieved.
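The single-layer selection heuristic described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the default `k0=4`, and the clamping to levels 2 to 5 are assumptions borrowed from common FPN configurations, not values stated in this paper.

```python
import math


def select_fpn_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Map an RoI of size width x height (input-image pixels) to one FPN level.

    k0, k_min and k_max are assumed defaults for this sketch.
    """
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical_size))
    # Clamping to the available pyramid levels is an assumption of this sketch.
    return max(k_min, min(k_max, k))


for w, h in [(32, 32), (112, 112), (224, 224), (448, 448)]:
    print(f"{w}x{h} RoI -> level {select_fpn_level(w, h)}")
```

Note how every RoI is routed to exactly one level, regardless of how close it is to the boundary between two levels; this is the hard selection that the proposed layer is meant to relax.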
The main contributions of this paper are the following:
A novel RoI extraction layer called GRoIE is proposed, with the aim of a more generic, configurable and interchangeable framework for RoI extraction in two-step architectures for instance segmentation.
An exhaustive ablation study on the different components of the proposed layer is conducted in order to evaluate how the performance changes depending on the various choices.
GRoIE is introduced to the major state-of-the-art architectures to demonstrate its superior performance with respect to traditional RoI extraction layers.
The paper is organized as follows. Section II describes the state of the art. In Section III, the proposed architecture is described in detail. Section IV describes the experimental methodology as well as our in-depth ablation study on component selection. Additionally, in this section, we show how the inclusion of GRoIE layer in state-of-the-art architectures can lead to significant improvements in the overall performance.
II Related Work
As mentioned in the introduction, modern detectors employ a RoI extraction layer to select the features produced by the backbone network according to the candidate bounding boxes coming from an RPN. This layer was first introduced in the R-CNN network. Since then, many architectures derived from R-CNN (e.g., Mask R-CNN, Grid R-CNN, Cascade R-CNN, HTC and GC-net) have used this layer as well. Usually, to be more invariant to object scale, the layer is not applied directly to the backbone features, but instead to an FPN attached on top of the backbone.
In the standard FPN-based pipeline, a RoI pooling action is applied to a single heuristically-selected FPN output layer. This approach suffers from a problem related to untapped information. An earlier line of work extracts mask proposals from each scale separately, rescales them and includes the resulting scales in a unique multi-scale ranked list, from which only the best proposals are eventually selected. Another approach fuses features belonging to different scales by a max function, using an independent backbone for each image scale. In our work, on the contrary, we utilize a feature pyramid to simplify the network and avoid doubling the number of parameters for each scale. In SharpMask, the authors make a coarse mask prediction, after which they fuse feature layers back in a top-down fashion until reaching the size of the input image. In PANet, the authors highlight that the information is not strictly connected with a single layer of the FPN. They propagate low-level features by building another FPN-like structure coupled with the original FPN, where the RoI-pooled images are combined. Our proposed GRoIE layer is inspired by this approach, with the difference that it is more lightweight, since it does not use any extra FPN-coupled stack, and that it proposes a novel way to aggregate data from the RoI-pooled features. Auto-FPN extends the PANet model by applying the Neural Architecture Search (NAS) concept. AugFPN can also be considered an extension of the PANet model.

The module we directly compare ours with is the Soft RoI Selector, which performs RoI pooling on each FPN layer and concatenates the results. Subsequently, through Adaptive Spatial Fusion, they are combined to create a weight map which passes through 1x1 and 3x3 convolutions sequentially. In our case, we first apply a specialized convolutional operation to each single layer of the FPN, which very effectively helps the network to automatically focus on the best scales. Next, we apply a sum instead of concatenation, because our experiments indicate that it has a greater learning potential for the network. Finally, an attention layer is applied that combines fully-connected layers and convolutions to further filter the multi-scale context.
In Multi-Scale Subnet, the authors propose an alternative method to RoI Align which uses crop-resized branches to extract the RoI at different scales. They use convolution with a 1x1 kernel simply to maintain the same number of outputs for each branch, without the purpose of helping the network to process data. Then, before summing all branches, they apply average pooling to reduce each branch to the same size. Finally, a convolutional layer with a 3x3 kernel is used as a post-processing stage. In our ablation study, we demonstrate that these convolutional configurations for pre- and post-processing are not the best-performing ones.
IONet does not use an FPN; instead, it uses concatenated, re-scaled and dimension-reduced features taken directly from the backbone before performing classification and bounding box regression. Finally, Hypercolumn employs a hypercolumn representation to classify a pixel, using convolutions with a 1x1 kernel and up-sampling the results to a common size to be able to sum them all. In this case, the absence of an optimized RoI pooling solution and of an FPN can negatively affect the final performance. Moreover, simply processing columns of pixels taken from different stages of the backbone can be a limitation. In fact, in our ablation study we will demonstrate that adjacent pixels are important for optimally extracting information within the various features.
III Generic RoI Extraction Layer
The FPN is an architecture commonly used to extract features from different image resolutions. It has been shown to maintain spatial information effectively while avoiding the expensive computation caused by processing each scale separately. Inside a two-stage detection framework, one FPN output layer is heuristically selected as the unique source for the RoI pooling action. Although the formula is well thought out, it is clear that the layer selection is the result of an arbitrary choice.
In order to support this statement, we compared this heuristic, used as the baseline, with a random selection of the FPN layer to sample from. Table I shows the average precision (AP) with different metrics (detailed in Section IV-A). Comparing the first two rows of the table, it is evident that the difference between the randomly-selected and the heuristic choice is not enormous. As further evidence, Fig. 2 shows the progress over training epochs and demonstrates that the two trends are similar. This is understandable considering that each FPN layer is derived from the previous one. It means that the information is present in all the FPN layers, but in a more or less tangled form for the following modules of the network to classify.
These results highlight that the network is capable of extracting information with good enough quality to discriminate classes from any available scale. To corroborate this finding, we have also tried to sum the FPN layers, obtaining an improvement of 0.3% in average precision (see Table I and Fig. 2). This enhancement suggests that if all the layers are aggregated appropriately, it is more likely to produce higher quality features.
Based on these preliminary ideas, we propose a novel RoI extraction layer called Generic RoI Extractor (GRoIE) whose architecture can be seen in Fig. 3.
GRoIE is composed of the following modules:
RoI pooler module: it is a module that performs a max pooling on a non-uniform region of interest to obtain a fixed-size representation. Currently, many pooling techniques, such as RoI Pooling and RoI Align, are available. Among the existing RoI pooling techniques, we found RoI Align to be the most appropriate, since it reduces a rectangular feature map region by dividing the original RoI into equal boxes and applying bilinear interpolation inside each of them. This helps to avoid pixel quantization.
Pre-processing module: its objective is to apply a preliminary elaboration to the pooled regions. This gives the network an additional degree of freedom which is specific to each image scale. This module is usually obtained by means of a convolutional layer associated with each image scale. As will be shown in the ablation analysis reported in Section IV-C, the best configuration found consists of a single 5x5 convolutional layer per scale. Our experiments suggest that it is not convenient to process the features individually, which can be explained by the fact that each feature is semantically connected with adjacent features. This is particularly true considering that the final objective is object detection/segmentation and that objects are usually spread over a consistent region of the image.
Aggregation module: it defines how to aggregate the single RoIs coming from each branch. The most frequent operations are concatenation and summation. There are multiple ways of merging different branches. After our ablation analysis, we found that the sum is able to minimize the number of features to be computed for the next layer, and this requires less effort from the network to converge to a stable training.
Post-processing module: it is an extra elaboration step applied to the merged features before eventually returning them. It permits the network to learn global features, jointly considering all the scales. To strengthen the informative power of the final RoI, three module types have been considered for post-processing: a convolutional layer, a non-local layer and an attention layer. Although the attention module is more complex because it also requires a fully-connected layer, our ablation analysis shows that it is the best performing choice. The reason is that, unlike the pre-processing module, the main objective of this layer is to eliminate useless information. In particular, the attention factor corresponding to the “query content and relative position” configuration is used. This factor is more sensitive to the query content and has the greater impact on image contents.
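To illustrate the RoI pooler module from the list above, the following is a minimal single-channel sketch of RoI Align-style pooling in plain Python. It assumes one bilinear sample at the centre of each bin and RoI coordinates already expressed in feature-map units; practical implementations usually average several samples per bin and differ in their pixel-offset conventions. The function names are hypothetical.

```python
def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a real-valued location (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid area of the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_single_channel(feature, roi, out_size=7):
    """Pool roi = (x1, y1, x2, y2) into an out_size x out_size grid.

    The RoI is split into equal bins and each bin is represented by one
    bilinear sample at its centre, so no coordinate is rounded to a pixel.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [
        [
            bilinear_sample(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Toy feature map whose value is 10 * row + column.
feature_map = [[10.0 * r + c for c in range(10)] for r in range(10)]
pooled = roi_align_single_channel(feature_map, (1.5, 2.25, 6.0, 7.5))
print(len(pooled), len(pooled[0]))
```

Because the toy map is linear in its coordinates, each pooled value equals the map evaluated at the exact (non-integer) bin centre, which shows that no quantization takes place.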
Summarizing, starting from a region produced by the RPN, for each scale, a fixed-size RoI is pooled from the region. The resulting feature maps are, first, separately pre-processed and, then, merged into a single feature map. Finally, post-processing is applied to extract global information. This architecture grants an equal contribution of each scale and benefits from the information embodied in all FPN layers by overcoming the limitations inherent in the arbitrary choice of a single FPN layer. It is worth noting that this procedure is valid for both object detection and instance segmentation.
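The data flow summarized above (per-scale pre-processing, aggregation by sum, post-processing) can be sketched as follows. This is a structural sketch only: feature maps are single-channel nested lists, and the pre- and post-processing callables are hypothetical placeholders standing in for the learned convolutional and attention layers.

```python
def groie_forward(pooled_per_scale, pre_process, post_process):
    """Merge equally shaped RoI-pooled maps, one per FPN level.

    pooled_per_scale: list of 2D maps (lists of rows), one per scale.
    pre_process: list of callables, one per scale (placeholders here).
    post_process: callable applied to the merged map (placeholder here).
    """
    if len(pooled_per_scale) != len(pre_process):
        raise ValueError("one pre-processing callable is needed per scale")
    processed = [f(m) for f, m in zip(pre_process, pooled_per_scale)]
    # Element-wise sum across scales: every scale contributes to every cell.
    merged = [[sum(vals) for vals in zip(*rows)] for rows in zip(*processed)]
    return post_process(merged)


def identity(feature_map):
    return feature_map


def scale_by(factor):
    """Toy stand-in for a learned per-scale layer."""
    def apply(feature_map):
        return [[factor * v for v in row] for row in feature_map]
    return apply


# Four 7x7 pooled maps filled with 1.0, 2.0, 3.0 and 4.0 respectively.
pooled = [[[float(level)] * 7 for _ in range(7)] for level in (1, 2, 3, 4)]
output = groie_forward(pooled, [scale_by(0.5)] * 4, identity)
print(len(output), len(output[0]), output[0][0])
```

The output keeps the shape of a single pooled map, so a layer of this form can replace a single-level RoI extractor without changing the heads that follow it.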
In this section two sets of experiments are reported. The first set is a module-wise ablation analysis of the proposed GRoIE layer with the aim of finding the best combination of choices for each of the modules described in the previous section. As was mentioned above, GRoIE can be plugged into architectures for both object detection (bounding box) and instance segmentation.
In the first set of experiments, we focus on object detection task only and employ the well-known Faster R-CNN as baseline. In the second set, we apply GRoIE, with the best configuration found, to different architectures with the aim of showing the improvement in average precision for both object detection and instance segmentation. This will allow us to show that the improvement produced by GRoIE is independent from both tasks as well as the utilized architecture.
IV-A Dataset and Evaluation Metrics
Datasets. In order to evaluate our proposal, we performed experiments on the MS COCO 2017 dataset, which is the de facto standard dataset for large-scale object detection and instance segmentation tasks. It is composed of 80 object categories and contains more than 116 thousand images in its training set.
Evaluation Metrics. To extract the metrics, we used the official COCO python package. The validation dataset, referred to as minival, includes 5000 images.
The package calculates the Average Precision (AP) with different IoU (Intersection over Union) thresholds for both bounding box and segmentation tasks. The primary metric, indicated simply as $AP$, is calculated with IoU thresholds from 0.5 to 0.95. Other metrics include $AP_{50}$ with the IoU threshold of 0.5 and $AP_{75}$ with 0.75. In addition, separate metrics are calculated for small ($AP_s$), medium ($AP_m$) and large ($AP_l$) objects.
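As a small illustration of the quantities behind these metrics, the sketch below computes the IoU of two axis-aligned boxes and checks it against the COCO-style thresholds 0.50, 0.55, ..., 0.95. It is not the official COCO package and does not compute AP; the function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds from 0.50 to 0.95 in steps of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def is_match_per_threshold(prediction, ground_truth):
    """For each threshold, tell whether the prediction counts as a match."""
    iou = box_iou(prediction, ground_truth)
    return {t: iou >= t for t in IOU_THRESHOLDS}


matches = is_match_per_threshold((0, 0, 10, 10), (0, 1, 10, 10))
print(matches)
```

A prediction that overlaps the ground truth loosely is counted at the 0.5 threshold but not at the stricter ones, which is why the primary metric rewards accurate localization more than the 0.5-threshold metric does.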
IV-B Implementation details
All the results with which we compare ours are not taken from the original papers; they were obtained by training on the same hardware, with the same configuration (apart from the RoI extractor) and by using the original authors’ code when available. These precautions were taken so that the comparison is not affected by small changes in either the configuration or the code. We used MMDetection as the base framework to develop our code.
The following base configuration was used for every experiment. Experiments were conducted on 6 GPUs (Nvidia Tesla P100 with 12 GB of memory) for 12 epochs with an initial learning rate of 0.015, a weight decay of 0.0001 after 9 and 11 epochs, a batch size of 2 images per GPU, and a random seed always set to zero. Since in most of the experiments reported in the literature the reference hardware is composed of 8 GPUs with batch size 2 and learning rate equal to 0.02, we followed the Linear Scaling Rule to have a fair comparison. The long edge and short edge of the images were resized to 1333 and 800, respectively, while maintaining the aspect ratio. ResNet50 was used as backbone and RoI Align was selected for the RoI Pooling module (no ablation analysis was conducted on this module).
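The learning rate of 0.015 follows from the reference setting by simple proportionality, as the following short sketch shows. The function name is hypothetical; the numbers are those given in the text.

```python
def scaled_learning_rate(reference_lr, reference_batch, batch):
    """Linear Scaling Rule: the learning rate scales with the total batch size."""
    return reference_lr * batch / reference_batch


reference_batch = 8 * 2  # 8 GPUs x 2 images per GPU in the reference setting
our_batch = 6 * 2        # 6 GPUs x 2 images per GPU in this setup
lr = scaled_learning_rate(0.02, reference_batch, our_batch)
print(round(lr, 6))
```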
IV-C Module-wise ablation analysis
In this section, we investigate how the choices of the GRoIE modules influence its final performance.
Aggregation module analysis. We start from the aggregation module because the choice of how to merge the data has a significant impact on the architecture of the module itself. In order to evaluate the effects of different choices separately, neither pre-processing nor post-processing is applied in this experiment. Each FPN output layer is RoI pooled to create a 256-dimensional feature map, and the results are subsequently merged to form a single RoI.
There are mainly two choices for aggregating different branches: concatenation and summation. In the first case, we need to reduce the feature maps from 1024 to 256 dimensions, because we have 4 FPN layers, each composed of 256-dimensional feature maps. This can easily be done using a convolutional layer with a 1x1 kernel. A sum-based aggregation is simpler, but a fair comparison with concatenation is needed. Therefore, in addition to the naive sum operator, we included a variant in which the sum is followed by a convolutional layer with a 1x1 kernel as post-processing.
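The channel arithmetic behind the two aggregation choices can be made explicit with a short calculation. The parameter counts below assume a standard convolution with a bias term, which is an assumption of this sketch rather than a detail given in the text.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters of a standard 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


NUM_LEVELS, CHANNELS = 4, 256

concat_channels = NUM_LEVELS * CHANNELS  # channels after concatenation
sum_channels = CHANNELS                  # channels after summation

# 1x1 convolution that brings the concatenated features back to 256 channels.
concat_reduction_params = conv_params(concat_channels, CHANNELS, 1)
# 1x1 convolution used after the sum in the comparison variant.
sum_variant_params = conv_params(sum_channels, CHANNELS, 1)

print(concat_channels, sum_channels)
print(concat_reduction_params, sum_variant_params)
```

Under these assumptions the reduction layer needed after concatenation has roughly four times as many parameters as the 1x1 layer of the sum variant, while the naive sum needs none.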
Table II shows the comparison between the proposed choices and a single-layer RoI extractor module (indicated as “baseline”). To better justify our final choice, we show in Fig. 4 the trend in average precision as the training epochs progress. Looking at the results of the sum variant with the 1x1 convolution and of concatenation, one might argue that the integration of different FPN layers, the basis of our work, is not always beneficial. This can be attributed to the added complexity which can, in some cases, be counterproductive and generate side effects. In the case of the naive sum, while at the beginning the trend is very similar, later in the training this operator achieves better accuracy with a stable trend, suggesting that this gap could potentially increase with more training epochs. Therefore, we selected the naive sum operator for the aggregation module of GRoIE.
Pre-processing module analysis. For this ablation analysis, based on the findings of the previous analysis, we chose the sum operator for the aggregation module and did not apply any post-processing. With regard to pre-processing, we consider three possible choices: a convolutional layer with different kernel sizes, a non-local module, or an attention module, as described in the previous section.
Table III shows the comparison of these choices with the baseline, as in the case of the aggregation module. Regarding the convolutional layer, it can be noticed that increasing the kernel size consistently improves the results. This confirms the close correlation between neighboring features. We should mention that the processed feature maps are only 7x7 in size, which prevented us from increasing the kernel size any further.
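The relation between kernel size and the 7x7 pooled map can be checked numerically. The sketch below assumes stride 1, zero padding that preserves the spatial size, and 256 input and output channels; only the 7x7 map size and the 256 channels come from the text, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Zero padding that preserves the size for an odd kernel and stride 1."""
    return (kernel - 1) // 2


def coverage_at_centre(size, kernel):
    """Fraction of a size x size map seen by a kernel centred on the middle cell."""
    span = min(kernel, size)
    return (span * span) / (size * size)


MAP_SIZE, CHANNELS = 7, 256

for k in (1, 3, 5, 7):
    out = conv_output_size(MAP_SIZE, k, same_padding(k))
    weights = CHANNELS * CHANNELS * k * k  # weights of one layer, bias excluded
    print(k, out, weights, round(coverage_at_centre(MAP_SIZE, k), 3))
```

A 7x7 kernel already spans the whole pooled map from its centre, so larger kernels would only add parameters that see padding.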
Post-processing module analysis. Finally, we analyze the post-processing module, keeping the sum operator as the aggregation strategy and not applying pre-processing.
Comparing Tables III and IV which contain results for pre- and post-processing modules reveals a major difference. While in the former, convolutional layers with different kernel sizes improve the results but non-local/attention modules do not, in the latter table the outcomes are opposite; that is, improvement of convolutional layers is negligible, while non-local and attention methods bring about noticeable enhancement. This can be explained by the fact that while in pre-processing there is the need to extract spatial contributions of the different layers where convolution acts correctly, in the post-processing phase the layers have already been merged by the aggregation module. Therefore, convolution does not add significant information. On the contrary, in post-processing, non-local and attention methods are able to remove useless information by focusing only on the significant parts of the image with attention mechanism.
IV-D Application of GRoIE to different architectures
As stated at the beginning of this section, the second set of experiments starts from the choices made on GRoIE modules based on the ablation analysis and integrates our proposed layer within several state-of-the-art architectures, with the aim of evaluating its benefits for both object detection and instance segmentation. We have considered, first of all, the networks that best represent the two-stage networks: Faster R-CNN and Mask R-CNN. Furthermore, we have taken into consideration networks that have shown the best results in recent years: Grid R-CNN for object detection and GC-net for instance segmentation as well. For the latter network, there are two RoI extractors. The first one is used for the detection part to extract the RoIs provided by the RPN; the second one is used by the segmentation part to extract the RoIs provided by the detection.
For this experiment, we have thus replaced only the standard RoI extraction modules with GRoIE in its best-performing configuration: sum as aggregation function, 5x5 convolution for pre-processing and attention module for post-processing.
Table V shows the achieved results for both object detection (bounding boxes) and instance segmentation. It is evident that the introduction of GRoIE as RoI extraction layer contributes to an improvement in precision in all the tested architectures. As expected, the amount of this improvement is not always the same and varies from a minimum of 0.7% AP to a maximum of 1.1% AP for bounding boxes, and from a minimum of 1.3% AP to a maximum of 1.7% AP for instance segmentation. Looking at the other evaluation metrics, the gain is even more noticeable, with a maximum of 2.2% in GC-net.
This improvement is even more evident from Figs. 5 and 6, where the average precision is illustrated with the progress of training epochs. In these graphs, it can be seen that in later epochs the positive effect of GRoIE increases, suggesting that it can arguably be even higher with more training epochs.
In this paper, we proposed a novel RoI extraction layer for two-step architectures designed for object detection and instance segmentation. The intuition underlying our proposal is that all the feature scales obtained by an FPN are potentially equally-useful for obtaining good final results. The proposed layer, called GRoIE (Generic RoI Extractor), builds upon this intuition by first pre-processing each single layer, then aggregating them together, and finally applying attentive mechanisms as post-processing in order to remove useless (global) information.
Experiments were conducted on the COCO dataset, and a comprehensive ablation study was carried out in order to select the best configuration of modules. Furthermore, the addition of GRoIE to state-of-the-art two-step architectures for both object detection and instance segmentation has shown a consistent improvement in average precision in all the experiments.
While preliminary, the results reported in this paper are quite promising and seem to indicate the potential of GRoIE as a novel extraction layer. As a consequence, our future work will concentrate on exploiting the modularity of GRoIE to further enhance the quality of the output features and improve the overall accuracy of different computer vision applications. In addition, neural networks are becoming increasingly expensive to run. For this reason, an important direction of exploration for GRoIE is to adopt every possible stratagem to lighten the workload while keeping performance unchanged.
This research benefits from the HPC (High Performance Computing) facility of the University of Parma, Italy.
We would like to thank Adidas AG for funding this work.
- Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2874–2883.
- (2019) Cascade R-CNN: high quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
- (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4974–4983.
- (2019) MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
- (2016) Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587.
- (2015) Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448.
- (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- (2019) AugFPN: improving multi-scale feature learning for object detection. arXiv preprint arXiv:1912.05384.
- (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456.
- (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- (2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799.
- (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.
- (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755.
- (2018) Multi-scale subnetwork for RoI pooling for instance segmentation. International Journal of Computer Theory and Engineering 10 (6).
- (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
- (2019) Grid R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7363–7372.
- (2016) Learning to refine object segments. In European Conference on Computer Vision, pp. 75–91.
- (2016) Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1), pp. 128–140.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
- (2016) Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1476–1481.
- (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803.
- (2019) Auto-FPN: automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6649–6658.
- (2019) An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6688–6697.