Regularized Densely-connected Pyramid Network for Salient Instance Segmentation

08/28/2020 ∙ by Yu-Huan Wu, et al. ∙ Nankai University

Much of the recent effort on salient object detection (SOD) has been devoted to producing accurate saliency maps without being aware of their instance labels. To this end, we propose a new pipeline for end-to-end salient instance segmentation (SIS) that predicts a class-agnostic mask for each detected salient instance. To make better use of the rich feature hierarchies in deep networks, we propose regularized dense connections, which attentively promote informative features and suppress non-informative ones from all feature pyramids, to enhance the side predictions. A novel multi-level RoIAlign based decoder is introduced as well to adaptively aggregate multi-level features for better mask predictions. These strategies can be well encapsulated into the Mask R-CNN pipeline. Extensive experiments on popular benchmarks demonstrate that our design significantly outperforms existing state-of-the-art competitors by 6.3




I Introduction

As a fundamental image understanding technique, salient object detection (SOD) aims at segmenting the most eye-attracting objects in a natural image. Although recent SOD approaches [50, 41, 42, 23, 29] have achieved considerable success, their generated saliency maps cannot discriminate different salient instances, which has prevented many applications from applying SOD for instance-level image understanding [9]. Motivated by [18], in this paper, we tackle the more challenging case of SOD, called salient instance segmentation (SIS). SIS not only segments salient objects from an image but also discriminates salient instances by associating each instance with a different label. SIS can facilitate more advanced tasks than SOD, such as image captioning [10], weakly-supervised instance learning [9], and visual tracking [15].

The MSRNet [18] made the first attempt to detect salient instances by adopting several isolated processing steps. However, its performance was usually limited in challenging scenarios because it was not end-to-end trainable. The S4Net [8] replaced RoIAlign in Mask R-CNN [13] with the proposed RoIMasking to keep the scale of the feature maps and leverage the nearby background of objects. Although much better performance was reported, it was still far from satisfactory because only a single feature level was utilized to decode salient instances. One may argue that a natural solution is to employ the Feature Pyramid Network (FPN) [20] and solve this task using the feature pyramid as well. FPN builds the feature pyramid via the top-down pathway and lateral connections from the backbone. With this network, small and large objects are more likely to be detected in the low and high levels of the pyramid, respectively. Therefore, with the top-down pathway, apart from detecting the salient objects, much of the information flow is devoted to detecting small and unnoticeable objects as well. Naïvely applying the FPN architecture to SIS is suboptimal, because salient objects are often much larger and more distinctive than the noisy background and uninteresting objects.

Fig. 1: Visualizations of the feature maps after passing FPN and our proposed regularized densely-connected pyramid (RDP). (a) Source images; (b) Corresponding ground truth; (c) Visualized feature maps after FPN; (d) Visualized feature maps after the proposed RDP. The feature maps obtained directly by FPN look coarser, and objects in them are hard to recognize, whereas the feature maps from our proposed RDP make the location and shape of each salient instance much easier to recognize.

Motivated by this, we focus on enhancing the side predictions by providing each side branch with richer feature hierarchies from deep networks to locate the object and recover its details. We achieve this by employing dense connections for each branch. In this way, each level is able to leverage both high-level semantic and low-level fine-grained features. However, as features from different feature levels of the feature pyramid usually have different receptive fields, directly applying such a dense connection may yield noisy predictions. To this end, we propose to regularize such a dense connection by employing the attention mechanism to promote informative features and suppress non-informative ones from all feature levels of the feature pyramid.

Our effort starts with Mask R-CNN [13], which first detects bounding boxes and then adopts RoIAlign to predict the binary mask for each region of interest (RoI). Specifically, we propose the regularized densely-connected pyramid (RDP) network mentioned above to better enhance the feature pyramid with different scales while keeping semantic features for detecting salient instances. More specifically, each level of features is fused not only with its successive bottom features, as done in other works [20, 21, 30, 44], but also with features from all the lower levels. The RDP network only costs 0.7ms, which has a negligible effect on the speed of the whole network. Fig. 1 shows the superiority of RDP in feature learning compared with FPN. Besides, instead of only using features from a specific feature level, we propose to leverage the feature maps from all feature levels with a novel multi-level RoIAlign operation for extracting hierarchical RoIs and then use a mask decoder to predict instance masks from them. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance and far surpasses previous competitors in terms of all metrics. With an NVIDIA TITAN Xp GPU, the proposed method runs at 45.0fps and is thus suitable for real-time applications.

Overall, our main contributions are summarized as below:

  • We propose to use regularized dense connections in the Mask R-CNN framework to provide richer bottom-up information flows by attentively promoting informative features and suppressing non-informative ones, at each stage of the feature pyramid.

  • We further propose a novel multi-level RoIAlign based decoder to adaptively pool multi-level features for better mask predictions.

  • We empirically evaluate the proposed method on two popular SIS datasets and demonstrate its superior accuracy and better efficiency.

II Related Work

II-A Salient Object Detection

SOD aims to detect salient objects or regions in natural images. Conventional methods [1, 4, 3, 41] mainly focus on designing hand-crafted features and better prior strategies for SOD. Later, some learning based features [41] were studied as well. In recent years, those methods have been surpassed by deep learning based methods due to their limited representational ability. More specifically, motivated by the vast success of convolutional neural networks (CNNs) and fully convolutional networks (FCNs) [31] for segmentation-related tasks, many FCN-based SOD networks were proposed as well [25, 50, 42, 16, 29, 35, 23, 32, 49]. For example, Wang et al. [42] developed a recurrent FCN architecture for saliency prediction. Liu et al. [25] presented a deep hierarchical saliency network to learn a coarse global prediction and refine it hierarchically and progressively by integrating local information. Inspired by [47, 28], Hou et al. [16] introduced short connections for side-outputs to enrich multi-scale features. Zhang et al. [50] introduced a bi-directional structure to adaptively aggregate multi-level features. Wang et al. [43] proposed to globally detect salient objects and recurrently refine the saliency maps. Liu et al. [24] proposed a pixel-wise contextual attention network to selectively attend to informative context locations for each pixel. Liu et al. [23] proposed various pooling-based modules to strengthen the feature representations with real-time speed. Although these methods can produce accurate saliency maps, they cannot discriminate different salient object instances.

II-B Instance Segmentation

Similar to object detection, early instance segmentation works [11, 12, 5] focus on classifying segmented proposals generated by object proposal methods [40, 34, 2]. Li et al. [19] first proposed an end-to-end fully convolutional instance segmentation (FCIS) framework. He et al. [13] extended Faster R-CNN [36] to Mask R-CNN by replacing RoIPool with RoIAlign for more accurate RoI generation. They added a parallel mask head with the box head in Faster R-CNN, for mask prediction using the RoI features from the feature pyramid. PANet [26] proposes a bottom-up path augmentation, which has been demonstrated to be effective in shortening the information path and enhancing the feature pyramid for instance detection. Mask Scoring R-CNN [17] combines the mask confidence score and the localization score and is thus more precise for scoring the detected instances.

II-C Salient Instance Segmentation

SIS is a relatively new problem that shares a similar spirit with both SOD and instance segmentation. It is more challenging than SOD because it not only segments salient objects but also differentiates different salient instances. One possible solution is to derive the salient instances directly from the saliency map using some post-processing techniques. For example, Li et al. [18] proposed a two-stage solution, called MSRNet, which first produces saliency maps and salient object contours that are then integrated with MCG [34] for salient instance segmentation. Although MSRNet can learn from the saliency maps, the two stages are optimized in isolation, so the results of MSRNet are far from satisfactory. To overcome the difficulties of the isolated optimization, Fan et al. [8] recently introduced an end-to-end single-stage framework based on Mask R-CNN [13]. They learned to mimic the strategy of GrabCut [37] and used the so-called RoIMasking to explicitly incorporate foreground/background separation. They also designed a customized segmentation head with dilated convolutions to retrieve instance masks from the coarsest feature level. Instead of using a single specific feature level with limited semantic features as done in existing methods, we propose to use the regularized densely-connected pyramid to extract richer feature hierarchies with higher contrasts (as in Fig. 1) from all feature levels, which significantly relieves the burden of accurately detecting salient instances and retrieving the binary mask for each salient instance.

III Our Approach

III-A Feature Pyramid Enhancement

The feature pyramid, which is usually understood as a group of feature maps with different resolutions, has demonstrated its superiority in various computer vision tasks. Among them, one notable application is object detection, which aims to accurately detect the locations of semantic objects. As there exist large scale variations for natural objects, directly detecting the accurate locations of targets by simply using features from one scale is extremely challenging. Therefore, many researchers attempt to detect semantic objects with the feature pyramid. Our method naturally belongs to this family. We propose a densely-connected pyramid (DP) network and the advanced regularized densely-connected pyramid (RDP) network for the feature pyramid enhancement. We elaborate the main idea below.

III-A1 Problem Formulation

Given an image as the input and a base network (e.g., ResNet [14]) for feature extraction, we can first derive a set of side-outputs from multiple stages in this network. Assume that we have access to multiple scales of features $\{C_l, C_{l+1}, \dots, C_L\}$ from the $l$-th to $L$-th stage, corresponding to the finest and coarsest feature maps. Typically, $l$ will be 2 as defined in two-stage detectors like Faster R-CNN [36, 20] or 3 as defined in one-stage detectors like RetinaNet [21]. $L$ is typically 5 as defined in both kinds of detectors [36, 20, 21].
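To make the pyramid levels concrete, the following minimal sketch lists the spatial size of each level for one input size. It assumes the common convention that the feature map of stage i has a stride of 2^i relative to the input; the function name `pyramid_sizes` and the 320×480 example input are illustrative choices, not part of the described method.

```python
import math


def pyramid_sizes(height, width, first_stage, last_stage):
    """Return {stage: (h, w)}, assuming stage i has stride 2**i (an assumption)."""
    sizes = {}
    for stage in range(first_stage, last_stage + 1):
        stride = 2 ** stage
        sizes[stage] = (math.ceil(height / stride), math.ceil(width / stride))
    return sizes


# Hypothetical example: a 320x480 input with stages 2 to 5.
sizes = pyramid_sizes(320, 480, first_stage=2, last_stage=5)
for stage, (h, w) in sizes.items():
    print(f"stage {stage}: stride {2 ** stage}, size {h}x{w}")
```

Under this assumption, the finest level is 16 times wider and taller than the stride-32 level, which is why features from different levels have to be resampled before they can be fused.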

III-A2 The Top-down Style

In order to leverage both high-level semantics and low-level fine details as mentioned above, the well-known FPN [20] proposes a top-down architecture with lateral connections to strengthen the capacity and representability of each side-output. Such a strategy has been demonstrated to be very powerful, especially for detecting small and tiny objects, and has been extensively used in many other approaches. Suppose that the side-output of the $i$-th stage is $C_i$ and that the feature pyramid enhanced by FPN is called $\{P_i\}$. This enhancement operation can be formulated as:

$$P_i = \mathcal{F}\big(\mathcal{C}(C_i) + \mathrm{Up}_{\times 2}(P_{i+1})\big),$$

where $\mathcal{C}(\cdot)$ represents a convolution layer to reduce the channels of $C_i$. $\mathcal{F}(\cdot)$ represents the feature fusion module which consists of a single convolution layer. The upsampling factor for $\mathrm{Up}_{\times 2}(\cdot)$ is 2 and we use the bilinear interpolation for upsampling. For the coarsest feature map, this enhancement operation is simply done by passing a single convolution.

Such a strategy, however, is suboptimal for SIS. Recall that the objective of this task is to detect salient instances and ignore other non-salient ones that usually have a relatively smaller size. In the top-down formulation above, each side branch only has limited bottom-up information, because it only leverages the features of two successive layers. In this way, higher levels in the pyramid have limited access to the low-level fine-grained details and thus may fail to recover the instance boundaries. In the same way, the lower levels in the pyramid lack the high-level semantic information and thus may not be good at accurately locating the salient objects and identifying their instance labels. To address this problem, we provide our solution below.
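The top-down pathway discussed above can be sketched on toy data. This is only an illustration under simplifying assumptions: feature maps are single-channel 2-D lists, the channel-reduction and fusion convolutions are replaced by the identity, and nearest-neighbour upsampling stands in for the bilinear interpolation mentioned in the text. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list (stand-in for bilinear)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(side_outputs):
    """side_outputs: 2-D maps ordered finest -> coarsest.

    Each level is summed with the upsampled, already enhanced coarser level.
    The learned convolutions are omitted (treated as identity) in this sketch.
    """
    enhanced = [None] * len(side_outputs)
    enhanced[-1] = side_outputs[-1]  # coarsest level has no coarser neighbour
    for i in range(len(side_outputs) - 2, -1, -1):
        enhanced[i] = add(side_outputs[i], upsample2x(enhanced[i + 1]))
    return enhanced


c_fine = [[1, 1, 1, 1] for _ in range(4)]  # toy 4x4 side-output
c_coarse = [[10, 20], [30, 40]]            # toy 2x2 side-output
p_fine, p_coarse = top_down([c_fine, c_coarse])
```

Note that information only flows from coarse to fine here: the coarse map is unchanged, which mirrors the limitation raised in the paragraph above.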

Fig. 2: Illustration of the proposed regularized densely-connected pyramid (RDP) network for feature pyramid enhancement. (a) The densely-connected pyramid (DP) network; (b) Dense connections with regularization. For simplicity, we illustrate the regularization with only one feature level. RDP is DP with the regularization at each feature level.

III-A3 The Bottom-up Densely-connected Pyramid Network

A straightforward solution to overcome the above-mentioned disadvantages of FPN, as proposed in [26], is to build a progressive bottom-up lateral connection and recreate a new feature pyramid, in which each re-generated feature map is built from the feature map of the same level and the re-generated feature map of the level just below it. This solution naturally follows the progressive manner of FPN and is applied in instance segmentation [26]. We take inspiration from this architecture and make necessary amendments. For each feature level in the network, instead of only merging two successive levels, we merge features from many other levels as well. This is advantageous because each stage is given a much richer information flow from all its bottom layers. More specifically, we achieve this by adding dense connections: for each level, starting from the first stage of the feature pyramid, the feature maps of all lower levels are downsampled to the size of the current feature map and concatenated with it, and a convolution operation then reduces the channels of the concatenated features to that of the current feature map.
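A minimal sketch of such dense connections is given below. Several details are assumptions made only for illustration: the densely connected inputs are taken to be the FPN feature maps of the lower levels, downsampling is done by repeated 2×2 average pooling, each map has a single channel, and the channel-reducing convolution is replaced by a uniform per-pixel mix with placeholder weights. None of the function names come from the paper.

```python
def downsample2x(fmap):
    """2x2 average pooling on a 2-D list with even height and width."""
    h, w = len(fmap), len(fmap[0])
    return [
        [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


def downsample_to(fmap, height):
    """Halve the map repeatedly until it reaches the target height."""
    while len(fmap) > height:
        fmap = downsample2x(fmap)
    return fmap


def conv1x1(channels, weights):
    """Mix same-sized 2-D channels into one channel (a 1x1 convolution)."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [sum(wt * ch[r][c] for wt, ch in zip(weights, channels)) for c in range(w)]
        for r in range(h)
    ]


def dense_pyramid(pyramid):
    """pyramid: single-channel maps ordered finest -> coarsest."""
    rebuilt = []
    for i, current in enumerate(pyramid):
        # Concatenate the current level with all downsampled lower levels.
        stacked = [current] + [downsample_to(low, len(current)) for low in pyramid[:i]]
        weights = [1.0 / len(stacked)] * len(stacked)  # placeholder, not learned
        rebuilt.append(conv1x1(stacked, weights))
    return rebuilt


p3 = [[4.0] * 4 for _ in range(4)]
p4 = [[2.0] * 2 for _ in range(2)]
p5 = [[0.0]]
rebuilt = dense_pyramid([p3, p4, p5])
```

In this toy run the coarsest level receives contributions from both finer levels, whereas a progressive bottom-up design would only pass information through the immediately lower level.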

Fig. 3: Overall pipeline of the proposed method. (a) In the feature extraction part, RDP is the regularized densely-connected pyramid network, as illustrated in Fig. 2. (b) We use the base detector [39] for box regression at each feature level. (c) The traditional design for mask prediction only uses a single layer to decode the binary masks. (d) Our design for mask prediction uses all feature levels to decode binary masks by a simple decoder.

III-A4 Regularized Densely-Connected Pyramid Network

The bottom-up dense connections essentially expand the input space for each side branch. However, as features from different layers usually have different receptive fields, they are usually not very compatible in discovering the fine details of the object due to the scale conflict. To this end, we further regularize the dense connections with the well-established self-attention mechanism. To compute the new feature map at a given level, we first create a spatial attention map for the regularization from the feature map of the current scale, using the sigmoid function at each pixel. To reduce the effect of the scale conflict during the feature concatenation, we then apply this identical attention map to every feature map in the feature fusion except that of the current scale: the feature maps from other scales are downsampled to the size of the current feature map and multiplied element-wise by the attention map, which yields the regularized feature maps from other scales. Overall, the regularized dense connections enhance the feature pyramid by fusing the current feature map with these regularized feature maps in the same way as the unregularized dense connections.

We provide an illustration of the proposed RDP in Fig. 2 for better understanding.
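The regularization step can be illustrated with a small sketch. It rests on assumptions that the text does not settle: the attention map is taken as the per-pixel sigmoid of the current single-channel map (without any extra learned layer), lower-level maps are downsampled by 2×2 average pooling, and the final fusion is a plain average instead of a learned convolution. The helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def avg_pool_to(fmap, height):
    """Repeated 2x2 average pooling until the map reaches the target height."""
    while len(fmap) > height:
        h, w = len(fmap), len(fmap[0])
        fmap = [
            [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)
        ]
    return fmap


def regularized_level(current, lower_maps):
    """Fuse `current` with attention-weighted, downsampled lower-level maps."""
    # Attention map computed from the current scale only.
    attention = [[sigmoid(v) for v in row] for row in current]
    fused = [row[:] for row in current]  # the current map is not regularized
    for lower in lower_maps:
        pooled = avg_pool_to(lower, len(current))
        for r, row in enumerate(pooled):
            for c, v in enumerate(row):
                fused[r][c] += attention[r][c] * v  # element-wise multiplication
    count = 1 + len(lower_maps)
    return [[v / count for v in row] for row in fused]  # placeholder fusion


current = [[0.0, 50.0]]                    # toy 1x2 map at the current level
lower = [[8.0] * 4, [8.0] * 4]             # toy 2x4 map from a lower level
out = regularized_level(current, [lower])
```

Where the current level responds weakly (attention 0.5), the lower-level feature is halved; where it responds strongly (attention close to 1), the lower-level feature passes almost unchanged.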

III-B Multi-level RoIAlign for Mask Prediction

Mask prediction is essential for SIS as it directly determines the accuracy of the mask for each salient instance. As shown in Fig. 3 (c), Mask R-CNN [13] uses a specific feature level, which depends on the size of the object of interest, for mask prediction using RoIAlign. Although selecting the feature level in this way can adaptively extract masks for objects of different sizes, it is suboptimal for SIS, and a better strategy is to leverage all the feature levels. More specifically, we propose an efficient yet well-performing multi-level RoIAlign with a decoder to leverage all feature levels; Fig. 3 (d) illustrates our idea. After the multi-level RoIAlign layer, we derive a tiny feature pyramid specifically for the mask prediction. The decoder then progressively decodes the binary masks from the tiny feature pyramid. The decoder consists of lateral connections and some feature fusion operations. Since the strides of the top two feature maps are very large, they are RoIAligned to the same RoI size, and we perform an element-wise sum of these two RoIs. Other feature maps are RoIAligned to different RoI sizes.

With this decoder, we first use RoIAlign to adaptively align features from all levels, and then retrieve binary masks based on the aligned features. For the feature fusion between two adjacent feature maps of different sizes, we first perform bilinear interpolation to upsample the coarser one by a factor of 2 to the size of the finer feature map. Then, we use element-wise sum to fuse these two feature maps and add a convolution layer to generate the new feature maps for the next feature fusion. Finally, we get the finest feature maps, on which we perform a convolution to predict the binary masks.
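The progressive decoding described above can be sketched as follows. The sketch assumes single-channel RoI features stored as 2-D lists, replaces bilinear interpolation with nearest-neighbour upsampling, and omits the convolution after each fusion; the RoI sizes in the example are made up for illustration.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear interpolation)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def decode(roi_feats):
    """roi_feats: RoI features ordered finest -> coarsest.

    The two coarsest entries share one size and are summed directly; every
    finer level is then fused after upsampling the running result by 2.
    The per-step convolution is omitted in this sketch.
    """
    fused = add(roi_feats[-1], roi_feats[-2])
    for finer in reversed(roi_feats[:-2]):
        fused = add(finer, upsample2x(fused))
    return fused


roi_fine = [[1] * 4 for _ in range(4)]   # toy 4x4 RoI feature
roi_mid = [[1, 2], [3, 4]]               # toy 2x2 RoI feature
roi_top = [[10, 10], [10, 10]]           # toy 2x2 RoI feature (same size)
decoded = decode([roi_fine, roi_mid, roi_top])
```

The output has the resolution of the finest RoI feature, on which a final convolution would predict the binary mask.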

III-C Overall Pipeline

The regularized densely-connected pyramid and the multi-level RoIAlign layer are encapsulated into a Mask R-CNN based pipeline, as displayed in Fig. 3. The functionality of each component is presented in the following.

III-C1 Feature Extraction

We adopt the widely used ResNet [14] as our backbone network, which has been pretrained on the ImageNet dataset [38]. The base feature pyramid follows the architecture of FPN [20]. Since we use the one-stage detector [39] for box regression, we follow [39] to generate two extra feature maps, $P_6$ and $P_7$, by connecting two convolutions with a stride of 2 after $P_5$. $P_6$ and $P_7$ are added to the feature pyramid, so the feature pyramid after passing FPN is $\{P_3, P_4, P_5, P_6, P_7\}$. All feature maps in this feature pyramid have 256 channels. Then, we build the regularized densely-connected pyramid (RDP) from $P_3$ to $P_7$, as introduced in Section III-A4 and Fig. 2. The number of output channels is still 256 for all feature maps in the reconstructed feature pyramid. Fig. 1 displays the visualization of feature maps after passing FPN and our proposed RDP. We find that although the feature maps derived by FPN have captured the locations of salient instances, the high responses are very coarse and sometimes do not even reveal the number of salient instances in each image. In contrast, the feature maps from our proposed RDP network have more precise activation and can help the base detector to better detect the bounding box of each salient instance. This further helps the mask head obtain better masks for the detected salient instances.

III-C2 Box Regression

To quickly detect the salient instances, we do not apply a heavy two-stage detector that contains an RPN [36] head to generate object proposals and classifies these object proposals with the box head, because it is too slow for SIS. Instead, we use the one-stage detector [39] as our base detector. This detector consists of four convolution-ReLU layers with 256 channels, and the box regression is performed at each feature level with this shared-parameter head. The details of calculating the box proposals from the final feature map can be found in [39]. In this part, we derive many box proposals with their confidence scores at each feature level. We concatenate them and keep the top 1000 boxes with confidence scores larger than 0.05. After that, a non-maximum suppression (NMS) operation is conducted on the boxes, and we keep at most the top 100 boxes for predicting their corresponding binary masks.
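The proposal filtering described above (score threshold, top-1000 pre-selection, NMS, top-100 cap) can be sketched in plain Python. The greedy NMS below is the standard textbook procedure and not necessarily the exact implementation used by the authors; the default NMS threshold of 0.6 is the value adopted later in the ablation study, and the function names and toy boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_boxes(boxes, scores, score_thr=0.05, pre_nms_top=1000,
                 nms_thr=0.6, post_nms_top=100):
    """Return indices of kept boxes, ordered by descending score."""
    # 1) Keep boxes above the score threshold, best first, at most pre_nms_top.
    candidates = sorted(
        (i for i, s in enumerate(scores) if s > score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )[:pre_nms_top]
    # 2) Greedy NMS: drop a box if it overlaps a kept box too much.
    keep = []
    for i in candidates:
        if all(box_iou(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
        if len(keep) == post_nms_top:
            break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = select_boxes(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of 0.81 and is suppressed, and the last box is discarded by the score threshold before NMS.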

III-C3 Mask Prediction

In the box regression, we detect the salient instances at the box level. Since our final goal is to predict the instance-level segmentation, mask prediction is necessary to retrieve the corresponding binary mask for each salient instance. We make a further improvement to Mask R-CNN by leveraging the feature maps of all feature levels for retrieving binary masks for salient instances. After the multi-level RoIAlign layer, the sizes of the feature maps are displayed in Table I. Please refer to Section III-B for the implementation of the decoder. After passing this decoder, we use a simple convolution layer to predict the final masks for the detected salient instances.

TABLE I: Feature map size for each channel after the multi-level RoIAlign layers. Since the top two feature maps are very small, they are sampled with the same size. The size of the final mask for each salient instance is .

III-C4 The Loss Function

Our pipeline has two key parts that need supervision: box regression and mask prediction. A foreground box classification loss $L_{cls}$ and a coordinate regression loss $L_{reg}$ are applied in the box regression branch. Note that $L_{cls}$ is the focal loss [21] and $L_{reg}$ is the IoU loss proposed in [48]. To further suppress the negative effect of too many low-quality boxes, we apply the centerness loss $L_{ctr}$ proposed in [39] to ignore the boxes whose centers are far away from the centers of salient instances. For mask prediction, we use the standard cross-entropy loss as the mask loss $L_{mask}$. Hence we obtain the final loss, which combines these terms, to supervise the whole network.
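For readers unfamiliar with the individual terms, the sketch below evaluates each loss for a single sample. The formulas are the commonly used forms from the cited works (focal loss with its usual defaults alpha = 0.25 and gamma = 2, the negative-log IoU loss, and the centerness target defined from the distances to the four box sides). Whether this paper uses exactly these hyper-parameters, and how the terms are weighted in the final loss, is not stated in the text, so the unweighted sum in `total_loss` is purely an assumption.

```python
import math


def focal_loss(p, is_foreground, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction; p is the predicted foreground probability."""
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def iou_loss(iou):
    """Negative-log IoU loss between a predicted and a ground-truth box."""
    return -math.log(iou)


def centerness(left, top, right, bottom):
    """Centerness target from the distances of a location to the box sides."""
    return math.sqrt((min(left, right) / max(left, right))
                     * (min(top, bottom) / max(top, bottom)))


def binary_cross_entropy(p, target):
    """Cross-entropy for one pixel (mask loss) or one centerness prediction."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def total_loss(l_cls, l_reg, l_ctr, l_mask):
    """Unweighted sum of the four terms (the weighting is an assumption)."""
    return l_cls + l_reg + l_ctr + l_mask


# A location at the box center has centerness 1; an off-center one scores lower.
center_score = centerness(5, 5, 5, 5)
off_center_score = centerness(1, 4, 4, 4)
```

Locations far from the instance center receive a low centerness target, which is how low-quality boxes predicted at such locations are down-weighted.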

IV Experiments

In this section, we will first introduce the datasets and evaluation metrics used in our experiments, as in Section IV-A. Implementation details will be described in Section IV-B. We will carefully examine our proposed designs and demonstrate their effectiveness in Section IV-C. The results of our method and the comparison with previous state-of-the-art methods will be provided in Section IV-D.

IV-A Dataset and Evaluation Metric

IV-A1 Datasets

We adopt two popular datasets in our experiments, i.e., ISOD and SOC datasets. The ISOD dataset is proposed by Li et al. [18]. It contains 1000 images with salient instance annotations. Here we follow the previous work [8] to use 500 images for training, 200 images for validation, and another 300 images for testing. The SOC dataset is proposed by Fan et al. [7]. This dataset consists of 3000 images in cluttered scenes with salient instance annotations. Among them, 2400 images are used for training and 600 images are used for testing.

IV-A2 Evaluation Metrics

Previous works use the mAP metric with a specific IoU threshold such as 0.5 (standard) or 0.7 (strict) to determine whether a detected instance is a true positive (TP), similar to the evaluation in the PASCAL VOC challenge. However, as this metric is not enough to fully reflect the quality of detectors, the MS-COCO evaluation metric [22] has been widely used in mainstream object detection and instance segmentation. We follow the MS-COCO evaluation metric [22] to use mAP@{0.5:0.05:0.95} as the primary metric, since it can better reflect the detection quality. We also report mAP@0.5 and mAP@0.7 for reference, as done in related works [27, 20, 21, 13, 17]. For simplicity, we use “AP”, “AP$_{50}$”, and “AP$_{70}$” to stand for mAP@{0.5:0.05:0.95}, mAP@0.5, and mAP@0.7, respectively.
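The relation between the three reported metrics can be illustrated with a simplified, single-image computation. The sketch below uses a non-interpolated average precision with greedy matching by descending score; the official MS-COCO evaluation interpolates the precision-recall curve and aggregates over the whole dataset, so the numbers here are only meant to show how the IoU threshold changes which detections count as true positives. Masks are represented as sets of pixel coordinates, and all names and data are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_precision(detections, gt_masks, thr):
    """Non-interpolated AP at one IoU threshold.

    detections: list of (score, mask); gt_masks: list of masks.
    """
    matched, tp_flags = set(), []
    for _, mask in sorted(detections, key=lambda d: d[0], reverse=True):
        best, best_iou = None, thr
        for g, gt in enumerate(gt_masks):
            if g in matched:
                continue
            value = mask_iou(mask, gt)
            if value >= best_iou:
                best, best_iou = g, value
        tp_flags.append(best is not None)
        if best is not None:
            matched.add(best)
    ap, tp = 0.0, 0
    for rank, flag in enumerate(tp_flags, start=1):
        if flag:
            tp += 1
            ap += tp / rank  # precision at each newly recalled instance
    return ap / len(gt_masks) if gt_masks else 0.0


def coco_style_ap(detections, gt_masks):
    """Mean AP over IoU thresholds 0.5, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]
    return sum(average_precision(detections, gt_masks, t) for t in thresholds) / 10


gt_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt_b = {(5, 5), (5, 6), (6, 5), (6, 6)}
dets = [(0.9, set(gt_a)), (0.8, {(5, 5), (5, 6), (6, 5)})]  # second mask: IoU 0.75
ap50 = average_precision(dets, [gt_a, gt_b], 0.5)
ap80 = average_precision(dets, [gt_a, gt_b], 0.8)
ap_all = coco_style_ap(dets, [gt_a, gt_b])
```

The second detection is accepted at loose thresholds but rejected at strict ones, so the averaged metric rewards masks that stay accurate as the IoU requirement tightens.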

IV-B Implementation Details

In this paper, we use the popular PyTorch framework [33] to implement our method. Unless otherwise mentioned, we apply the widely used ResNet-50 [14] as the backbone network. During training, especially in the early stage, there may be no box that satisfies the confidence score threshold for NMS, so we add the ground-truth boxes to the detected salient instances during training to prevent such a situation. We only use horizontal flipping as the data augmentation, and each input image is resized so that the shorter side is 320 pixels and the longer side follows the original aspect ratio but is limited to a maximum of 480 pixels. We use a single NVIDIA TITAN Xp GPU for all experiments. We use the SGD optimizer with weight decay and a momentum of 0.9. Each mini-batch contains four images. The initial learning rate is 0.0025. For the ISOD dataset [18], the learning rate is divided by 10 after 6K iterations, and we train our network for 9K iterations in total. For the SOC dataset [7], which is approximately 4× larger than ISOD, the learning rate is divided by 10 after 24K iterations, and we train our network for 36K iterations in total. Due to the small batch size, the BatchNorm layers of the backbone network are all frozen during training. The convolution layers of the box regression head and mask prediction head use group normalization [45]. The number of output channels of each convolution layer is 128 in the mask prediction head.
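The resizing rule and the step learning-rate schedule can be written down compactly. The sketch below is one plausible reading of the description: scale the shorter side to 320 pixels, and rescale if the longer side would exceed 480 pixels. Rounding to the nearest integer and the absence of any warm-up phase are assumptions; the function names are hypothetical.

```python
def resized_shape(height, width, short_side=320, max_long_side=480):
    """Target (height, width) after resizing, keeping the aspect ratio."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        # The longer side would be too large, so it limits the scale instead.
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)


def learning_rate(iteration, base_lr=0.0025, decay_at=6000):
    """Step schedule: divide the learning rate by 10 after `decay_at` iterations."""
    return base_lr / 10 if iteration >= decay_at else base_lr


# ISOD setting from the text: decay after 6K of 9K iterations.
isod_lrs = [learning_rate(it) for it in (0, 5999, 6000, 8999)]
# SOC setting from the text: decay after 24K of 36K iterations.
soc_lrs = [learning_rate(it, decay_at=24000) for it in (0, 23999, 24000, 35999)]

typical = resized_shape(480, 640)      # shorter side becomes 320
elongated = resized_shape(400, 1000)   # longer side is capped at 480
```

For a very elongated image the cap on the longer side takes over, so the shorter side ends up below 320 pixels.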

NP   DP   RDP  MRA  AP      AP$_{50}$  AP$_{70}$
-    -    -    -    54.2%   83.3%      69.7%
✓    -    -    -    54.6%   84.9%      70.0%
-    ✓    -    -    55.1%   84.4%      71.0%
-    -    ✓    -    56.4%   85.4%      72.0%
-    -    ✓    ✓    57.4%   86.1%      73.8%
TABLE II: Evaluation on the ISOD validation set for various design choices. The first line refers to the baseline of FPN. NP is the naive progressive bottom-up style for building the new feature pyramid. DP denotes the proposed method that rebuilds the feature pyramid with dense connections. RDP means to add the proposed regularization to DP. MRA represents the proposed multi-level RoIAlign.

IV-C Ablation Study

In this part, we evaluate the effect of various designs on the ISOD dataset. We use its training set for training and report results on its validation set. If not mentioned, we use the ResNet-50 as the backbone for our network.

IV-C1 Effect of DP and RDP

As mentioned in Section III-A4, we propose to create RDP to address the shortcomings of FPN. Here, we view FPN as our baseline and evaluate four design choices: i) NP, i.e., the naive progressive bottom-up style for building the new feature pyramid; ii) DP, i.e., the proposed method that rebuilds the feature pyramid with dense connections; iii) RDP, i.e., adding the proposed regularization to the dense connections in DP; iv) MRA, i.e., the proposed multi-level RoIAlign. Table II shows the evaluation results on the ISOD validation set. We can see that NP only brings a minor improvement over the vanilla FPN, i.e., an improvement of 0.4% in terms of AP. If we replace this naive solution with DP without regularization, AP is improved by 0.9% compared with FPN. When we add the regularization to DP, a further 1.3% improvement over DP is observed, indicating that the regularization is of vital importance for the proposed densely-connected pyramid. Note that the proposed RDP is very efficient and only costs 0.7ms for an input image, so it has little effect on the speed of the whole network.

IV-C2 Effect of Multi-level RoIAlign

Existing research usually predicts object masks using the mask head proposed by Mask R-CNN [13], which predicts masks from a specific feature level. Instead, we propose a top-down progressive mask decoder to utilize all feature levels for object mask prediction, namely multi-level RoIAlign (MRA). The comparison between MRA and the traditional RoIAlign can be found in Table II. We can see that the introduction of MRA further leads to an improvement of 1.0%, 0.7%, and 1.8% in terms of AP, AP$_{50}$, and AP$_{70}$, respectively. This demonstrates the significance of the proposed MRA in accurate mask prediction by leveraging all feature levels. Overall, the proposed method achieves 3.2% higher AP, 2.8% higher AP$_{50}$, and 4.1% higher AP$_{70}$ than the baseline of FPN.
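The improvements quoted in this ablation can be re-derived from the rows of Table II. The short script below does this arithmetic; the row labels are inferred from the table caption and the surrounding discussion, and all gains are absolute differences in percentage points.

```python
# Rows of Table II as (AP, AP50, AP70); labels follow the ablation discussion.
table_ii = {
    "FPN": (54.2, 83.3, 69.7),
    "NP": (54.6, 84.9, 70.0),
    "DP": (55.1, 84.4, 71.0),
    "RDP": (56.4, 85.4, 72.0),
    "RDP+MRA": (57.4, 86.1, 73.8),
}


def gain(new, old):
    """Absolute gain (percentage points) of `new` over `old` for each metric."""
    return tuple(round(a - b, 1) for a, b in zip(table_ii[new], table_ii[old]))


print("NP over FPN:      ", gain("NP", "FPN"))
print("DP over FPN:      ", gain("DP", "FPN"))
print("RDP over DP:      ", gain("RDP", "DP"))
print("MRA over RDP:     ", gain("RDP+MRA", "RDP"))
print("Full over FPN:    ", gain("RDP+MRA", "FPN"))
```

The differences match the numbers in the text: the regularization adds 1.3 points of AP on top of DP, MRA adds a further 1.0, and the full system is 3.2 points above the FPN baseline.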

IV-C3 Partially Applying DP and RDP

Our initial design considers all feature levels ($P_3$ - $P_7$) for the reconstruction of the feature pyramid. Among them, the top 2 feature levels ($P_6$ and $P_7$) are generated from $P_5$ using only two convolutions. In this section, we further evaluate the effectiveness of DP and RDP by applying them to only a part of the side-outputs. Specifically, we only apply DP/RDP to three side-outputs, i.e., $P_3$, $P_4$, and $P_5$, excluding $P_6$ and $P_7$. The experimental results are shown in Table III. We can see that applying DP/RDP to only three side-outputs performs better than the baseline, but worse than applying DP/RDP to all five side-outputs, indicating that DP/RDP is effective in feature enhancement for all feature levels. The fact that RDP with only three feature levels significantly outperforms the baseline further suggests that RDP is very useful to FPN.

Side-outputs   DP   RDP  AP      AP$_{50}$  AP$_{70}$
-              -    -    54.2%   83.3%      69.7%
$P_3$-$P_5$    ✓    -    54.4%   83.7%      70.3%
All            ✓    -    55.1%   84.4%      71.0%
$P_3$-$P_5$    -    ✓    55.8%   84.9%      71.3%
All            -    ✓    56.4%   85.4%      72.0%
TABLE III: Evaluation on the ISOD validation set for applying DP/RDP to a part of side-outputs. “$P_3$-$P_5$” means from $P_3$ to $P_5$. “All” means all side-outputs in the feature pyramid.

Method      AP      AP$_{50}$  AP$_{70}$
Baseline    54.2%   83.3%      69.7%
Top-down    45.6%   76.9%      57.0%
Bottom-up   56.4%   85.4%      72.0%
TABLE IV: Evaluation on the ISOD validation set for the top-down and bottom-up designs of RDP. The top-down design directly replaces the FPN of the baseline method with the top-down style of RDP. The bottom-up design is the default version of RDP as shown in Fig. 2.

Fig. 4: Error analyses for the baseline and the proposed designs on the ISOD validation set. The first row shows PR curves for all salient instances, while the second row is only for large salient instances whose areas are larger than $96^2$ pixels. The PR curves are drawn in different settings following [22]. C10C90: PR curve at IoU={0.1:0.1:0.9}. BG: PR curve after all background false positives (FP) are removed. FN: PR curve after all remaining errors are removed (trivially, AP = 1). Each number in the legend corresponds to the average precision for each setting. The area under each curve is drawn in different colors, corresponding to the color in the legend. Best viewed in color.

IV-C4 Error Analyses of the Baseline and the Proposed Designs

Salient instances are usually large because large objects are more eye-attracting and are thus visually distinctive. We follow the MS-COCO benchmark to consider the instances whose areas are larger than $96^2$ pixels as large instances. In this way, we find that over 70% of the salient instances in the ISOD dataset [18] are large. Here, we perform error analyses using all salient instances or only large instances. We view FPN [20] as the baseline and gradually add each of our designs to this baseline to analyze the changes of detection errors. Fig. 4 illustrates the results. First, let us discuss the changes of the PR curve by adding DP to the baseline. We observe that although AP is improved for almost all IoU thresholds when using all salient instances, the performance becomes worse when only large salient instances are considered, especially for large IoU thresholds. Then, we further replace DP with its regularized version, RDP. There is a significant improvement in terms of all IoU thresholds for both all and only large salient instances, demonstrating the importance of the proposed regularization for DP. At last, we analyze the effect of the multi-level RoIAlign (MRA) by further adding it to our system. A substantial improvement is observed, especially for large salient instances. For example, MRA brings AP improvements of 7.2%, 3.6%, and 2.0% for IoU thresholds 0.9, 0.7, and 0.5, respectively. Comparing our final system (the rightmost column in Fig. 4) with the baseline (the leftmost column), the improvement is clearly visible in the PR curves for all IoU thresholds.

| NMS Threshold | AP | AP50 | AP70 |
|---|---|---|---|
| 0.2 | 55.8% | 83.8% | 71.9% |
| 0.4 | 56.8% | 85.8% | 71.7% |
| 0.6 | 57.4% | 86.1% | 73.8% |
| 0.8 | 57.0% | 84.7% | 71.6% |

TABLE V: Evaluation on the ISOD validation set using different NMS thresholds.

IV-C5 Bottom-up versus Top-down

In our method, we rebuild the feature pyramid based on the outputs of FPN. Another potential solution is to directly replace FPN with a top-down style of RDP, which would have a lower computational cost than our proposed design. However, the experimental results show that this alternative fails. As shown in Table IV, it leads to substantial performance degradation, i.e., over 10% lower than the default bottom-up design in terms of various metrics. We therefore conclude that the proposed RDP is not suitable for top-down information flow and works well only in the bottom-up way.

IV-C6 Different NMS Thresholds

The NMS post-processing step is important for eliminating overlapping detections. NMS with a higher IoU threshold tolerates more heavily overlapping instances, and vice versa. Here, we explore how different NMS thresholds affect the performance of the proposed method, and the results are summarized in Table V. We observe that our method is robust to different NMS thresholds, and thresholds larger than 0.4 yield similar results. Since the threshold of 0.6 results in slightly better performance in terms of various metrics, we adopt 0.6 as the default threshold in our experiments.
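The role of the threshold can be seen in a minimal sketch of greedy NMS on bounding boxes. It assumes boxes in (x1, y1, x2, y2) format with positive area. The function names are hypothetical, and the paper's actual implementation is not shown here.

```python
import numpy as np


def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.6):
    """Greedy NMS: keep the best-scoring box, then drop every remaining box
    whose IoU with it exceeds iou_threshold, and repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        ious = box_iou(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep
```

With two boxes that overlap at an IoU of 0.8, a threshold of 0.6 suppresses the lower-scoring one, whereas a threshold of 0.9 keeps both. This is the tolerance trade-off described above.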

| Num. | AP | AP50 | AP70 | Speed | ΔAP |
|---|---|---|---|---|---|
| 100 | 57.4% | 86.1% | 73.8% | 45.0fps | - |
| 10 | 57.1% | 85.8% | 73.5% | 45.0fps | |
| 5 | 56.6% | 85.2% | 72.7% | 45.2fps | |

TABLE VI: Evaluation on the ISOD validation set using different numbers of box proposals for mask prediction at inference. Going from 100 box proposals to 5, the speed gain is only 0.2fps, while AP drops by 0.8%.

IV-C7 The Number of Proposals for Mask Prediction

As in typical instance segmentation [13, 20], our method first learns to localize salient instances by predicting bounding box proposals and then predicts the mask for each box proposal. Hence the number of box proposals for mask prediction may affect the detection accuracy and inference speed of the whole network. As mentioned in the box regression part of Section III-C2, we select the top 100 box proposals for mask prediction, which may seem large and expensive for SIS because there are rarely more than 10 salient instances in each image. To justify this setting, we track the number of proposals during the training stage. At the beginning, this number reaches the upper limit (100), but it gradually declines as the number of training iterations increases. Finally, it converges to fewer than 10 on average. In addition, we run experiments using different numbers of box proposals at inference, and the results are shown in Table VI. As we decrease the number of box proposals, the speed gain is at most 0.2fps, while AP drops by up to 0.8%. Therefore, we keep the default setting of 100 box proposals because it does almost no harm to the speed of our method.
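The proposal cap can be sketched as a simple top-k selection by confidence score. This is an illustrative assumption about how such a cap is typically applied. The function name and the NumPy array layout are hypothetical and not taken from the paper's code.

```python
import numpy as np


def select_top_proposals(boxes, scores, max_proposals=100):
    """Keep at most max_proposals boxes, ordered by descending score."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    k = min(max_proposals, scores.shape[0])
    top = np.argsort(-scores)[:k]
    return boxes[top], scores[top]
```

When fewer candidates than the cap survive the earlier stages, the cap has no effect, so the mask head processes only the surviving boxes. This is consistent with the observation that the number of proposals converges to fewer than 10 on average.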

| Method | Backbone | AP | AP50 | AP70 | Speed |
|---|---|---|---|---|---|
| MSRNet [18] | VGG-16 | - | 65.3% | 52.3% | 1fps |
| S4Net [8] | ResNet-50 | 52.3% | 86.7% | 63.6% | 40.0fps |
| Ours | ResNet-50 | 58.6% | 88.6% | 73.6% | 45.0fps |

TABLE VII: Evaluation results on the ISOD test set [18].

| Backbone | AP | AP50 | AP70 | Speed |
|---|---|---|---|---|
| ResNet-50 [14] | 58.6% | 88.6% | 73.6% | 45.0fps |
| ResNet-101 [14] | 60.9% | 89.7% | 76.6% | 34.8fps |
| ResNeXt-101 [46] | 63.2% | 90.1% | 78.1% | 16.7fps |

TABLE VIII: Evaluation of our method with different backbone networks on the ISOD test set [18]. Our method with the most powerful backbone (i.e., ResNeXt-101 [46]) achieves a 4.6% improvement in terms of AP at 2.7× the inference time compared with the simplest backbone (i.e., ResNet-50 [14]). The speed is tested using a single NVIDIA TITAN Xp GPU.

| Method | Backbone | AP | AP50 | AP70 |
|---|---|---|---|---|
| S4Net [8] | ResNet-50 | 24.0% | 51.8% | 27.5% |
| Ours | ResNet-50 | 37.7% | 59.4% | 48.4% |

TABLE IX: Evaluation results on the SOC test set [7].

IV-D Comparisons with State-of-the-Art Methods

IV-D1 ISOD Dataset

Since SIS is a relatively new problem, previous works on this topic are very limited. Here, we compare our method with two well-known methods: MSRNet [18], which represents the post-processing-based methods, and S4Net [8], a representative end-to-end network. Following [18, 8], all methods are tested on the ISOD test set [18]. We use AP as the main metric, with AP50 and AP70 for reference. Higher scores indicate better performance for all metrics. The quantitative results can be seen in Table VII. The proposed method achieves the best results among the three methods. Specifically, it has 6.3% higher AP than S4Net [8]. In terms of AP70, it is 10.0% better than S4Net [8]. This demonstrates the superiority of the proposed method in accurate salient instance segmentation. In Table VIII, we try different backbone networks for our method. More powerful backbones further boost the performance significantly, indicating the good potential and extensibility of our method.

IV-D2 SOC Dataset

The scenarios of the SOC dataset [7] are much more complex than those of the ISOD dataset [18], so SIS on the SOC dataset is more challenging. The quantitative comparison between our method and S4Net [8] on the SOC dataset is summarized in Table IX. Since other methods do not report evaluation results on this dataset, we train S4Net [8] using its official code with default settings, and we report its best performance over three independent trials for a fair comparison. The results show that our method is 13.7%, 7.6%, and 20.9% better than S4Net in terms of AP, AP50, and AP70, respectively. This demonstrates that our method handles cluttered backgrounds much better and that our improvement for SIS is nontrivial.
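The gains quoted for both datasets follow directly from the table entries. The short check below copies the S4Net and "Ours" rows of Tables VII and IX and subtracts them. The second and third columns are taken here to be AP at IoU thresholds of 0.5 and 0.7, which is an assumption about the column labels.

```python
# (AP, AP at IoU 0.5, AP at IoU 0.7) in percent, copied from Tables VII and IX.
RESULTS = {
    "ISOD": {"S4Net": (52.3, 86.7, 63.6), "Ours": (58.6, 88.6, 73.6)},
    "SOC": {"S4Net": (24.0, 51.8, 27.5), "Ours": (37.7, 59.4, 48.4)},
}


def gains(dataset):
    """Absolute improvement of our method over S4Net, in percentage points."""
    base = RESULTS[dataset]["S4Net"]
    ours = RESULTS[dataset]["Ours"]
    return tuple(round(o - b, 1) for o, b in zip(ours, base))


print(gains("ISOD"))  # (6.3, 1.9, 10.0)
print(gains("SOC"))   # (13.7, 7.6, 20.9)
```

All improvements are absolute differences in percentage points rather than relative gains.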

Fig. 5: Statistical analyses for our method on the ISOD [18] and SOC [7] test sets. (a) Error analyses. (b) Probability distribution of AP.

IV-E Qualitative Comparisons

To visually compare our method with the previous state-of-the-art method, S4Net [8], we show qualitative comparisons on the ISOD [18] and SOC [7] datasets in Fig. 6. S4Net often produces superfluous detections (false positives) or detects only parts of salient instances. In contrast, our method produces consistently high-quality salient instance masks. Moreover, the boundaries of salient instances detected by S4Net are usually rough, while our method produces salient instances with smooth boundaries. These qualitative comparisons further validate the effectiveness of the proposed method.


Fig. 6: Qualitative comparisons between our method and S4Net [8]. The samples are from the ISOD and SOC datasets. S4Net [8] tends to detect superfluous objects (false positives) or only parts of instances. In contrast, our proposed method detects complete instances and has far fewer false positives.

IV-F Statistical Analyses

The statistical characteristics of the ISOD [18] and SOC [7] datasets are highly different, so it is interesting to explore how the performance of our method differs between them. Here, we conduct statistical analyses of our method on the test sets of these two datasets. We first compare the PR curves of our method on the two datasets, as shown in Fig. 5(a). As the background of images in the SOC dataset is more cluttered than that in the ISOD dataset, more salient instances are missed in the SOC dataset, while in the ISOD dataset, most salient instances are correctly localized. Then, we explore the probability distribution of AP for different numbers of salient instances in each image. More specifically, we calculate the AP score and the number of ground-truth salient instances for each image, and illustrate the overall probability distribution in Fig. 5(b), where the area of each closed pattern is 1 (i.e., the sum of all probabilities). An AP of 1 for an image means that our method almost perfectly detects and segments the ground truths in this image and also has no false positives. An AP of 0 indicates that none of the ground truths in this image are detected. In the ISOD dataset, the AP score of an image is likely to be better than the medium AP score if the image contains no more than 3 instances, while in the SOC dataset, the same holds only when the image contains a single instance. Besides, in the ISOD dataset, our method fails for only a few images with 1 or 2 salient instances each, but in the SOC dataset, our method fails for many more images. The above analyses suggest that the SOC dataset is much more difficult than the ISOD dataset owing to its cluttered backgrounds and complex scenarios, so there is still much room to strengthen the representation in future SIS research.
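A per-count distribution of this kind can be approximated with a normalized histogram, as in the sketch below. It assumes per-image AP scores in [0, 1] paired with ground-truth instance counts. The histogram is a stand-in for whatever density plot the figure uses, and the function name is hypothetical.

```python
from collections import defaultdict

import numpy as np


def ap_distribution_by_count(records, bins=10):
    """Group per-image AP scores by instance count and normalize each group
    into a probability distribution over AP bins (each sums to 1)."""
    grouped = defaultdict(list)
    for count, ap in records:
        grouped[count].append(ap)
    edges = np.linspace(0.0, 1.0, bins + 1)
    dist = {}
    for count, aps in grouped.items():
        hist, _ = np.histogram(aps, bins=edges)
        dist[count] = hist / hist.sum()
    return dist
```

Normalizing each group separately makes images with different instance counts comparable even when some counts are much rarer than others.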

V Conclusion and Future Work

In this paper, we propose a new network for salient instance segmentation (SIS). The core of our method is the regularized densely-connected pyramid (RDP), which provides each side-output with richer yet more compatible bottom-up information flows to enhance the side-output prediction. We further design a novel multi-level RoIAlign based decoder for better mask prediction. Through extensive experiments, we analyze the effect of our proposed designs and demonstrate the effectiveness of our method. With these simple designs, the proposed method achieves state-of-the-art results on popular benchmarks in terms of all evaluation metrics while running in real time. Its effectiveness and efficiency make it suitable for many real-world applications. Moreover, this research is expected to push forward the development of feature learning and mask prediction for SIS. In the future, we plan to apply the RDP module to other vision tasks that need powerful feature pyramids. The code and pretrained models of this paper will be released to promote future research.


  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk (2009) Frequency-tuned salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1597–1604. Cited by: §II-A.
  • [2] M. Cheng, Y. Liu, W. Lin, Z. Zhang, P. L. Rosin, and P. H. Torr (2019) BING: binarized normed gradients for objectness estimation at 300fps. Computational Visual Media 5 (1), pp. 3–20. Cited by: §II-B.
  • [3] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2014) Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37 (3), pp. 569–582. Cited by: §II-A.
  • [4] M. Cheng, J. Warrell, W. Lin, S. Zheng, V. Vineet, and N. Crook (2013) Efficient salient region detection with soft image abstraction. In IEEE International Conference on Computer Vision (ICCV), pp. 1529–1536. Cited by: §II-A.
  • [5] J. Dai, K. He, and J. Sun (2015) Convolutional feature masking for joint object and stuff segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3992–4000. Cited by: §II-B.
  • [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (voc) challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338. Cited by: §IV-A2.
  • [7] D. Fan, M. Cheng, J. Liu, S. Gao, Q. Hou, and A. Borji (2018) Salient objects in clutter: bringing salient object detection to the foreground. In European Conference on Computer Vision (ECCV), pp. 186–202. Cited by: Fig. 5, §IV-A1, §IV-B, §IV-D2, §IV-E, §IV-F, TABLE IX.
  • [8] R. Fan, M. Cheng, Q. Hou, T. Mu, J. Wang, and S. Hu (2020-06) S4Net: single stage salient-instance segmentation. Computational Visual Media 6 (2), pp. 191–204. External Links: Document Cited by: §I, §II-C, Fig. 6, §IV-A1, §IV-D1, §IV-D2, §IV-E, TABLE VII, TABLE IX.
  • [9] R. Fan, Q. Hou, M. Cheng, G. Yu, R. R. Martin, and S. Hu (2018) Associating inter-image salient instances for weakly supervised semantic segmentation. In European Conference on Computer Vision (ECCV), pp. 367–383. Cited by: §I.
  • [10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. (2015) From captions to visual concepts and back. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473–1482. Cited by: §I.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587. Cited by: §II-B.
  • [12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2015) Hypercolumns for object segmentation and fine-grained localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 447–456. Cited by: §II-B.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969. Cited by: §I, §I, §II-B, §II-C, §III-B, §IV-A2, §IV-C2, §IV-C7.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §III-A1, §III-C1, §IV-B, TABLE VIII.
  • [15] S. Hong, T. You, S. Kwak, and B. Han (2015) Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning (ICML), pp. 597–606. Cited by: §I.
  • [16] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr (2019) Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (4), pp. 815. Cited by: §II-A.
  • [17] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask Scoring R-CNN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6409–6418. Cited by: §II-B, §IV-A2.
  • [18] G. Li, Y. Xie, L. Lin, and Y. Yu (2017) Instance-level salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2386–2395. Cited by: §I, §I, §II-C, Fig. 5, §IV-A1, §IV-B, §IV-C4, §IV-D1, §IV-D2, §IV-E, §IV-F, TABLE VII, TABLE VIII.
  • [19] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2359–2367. Cited by: §II-B.
  • [20] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125. Cited by: §I, §I, §III-A1, §III-A2, §III-C1, §IV-A2, §IV-C4, §IV-C7.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: §I, §III-A1, §III-C4, §IV-A2.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: Fig. 4, §IV-A2.
  • [23] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019-06) A simple pooling-based design for real-time salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [24] N. Liu, J. Han, and M. Yang (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3089–3098. Cited by: §II-A.
  • [25] N. Liu and J. Han (2016) Dhsnet: deep hierarchical saliency network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 678–686. Cited by: §II-A.
  • [26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768. Cited by: §II-B, §III-A3.
  • [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §IV-A2.
  • [28] Y. Liu, M. Cheng, X. Hu, J. Bian, L. Zhang, X. Bai, and J. Tang (2019) Richer convolutional features for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (8), pp. 1939–1946. Cited by: §II-A.
  • [29] Y. Liu, M. Cheng, X. Zhang, G. Nie, and M. Wang (2019) DNA: deeply-supervised nonlinear aggregation for salient object detection. arXiv preprint arXiv:1903.12476. Cited by: §I, §II-A.
  • [30] Y. Liu, Y. Wu, Y. Ban, H. Wang, and M. Cheng (2020) Rethinking computer-aided tuberculosis diagnosis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [31] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §II-A.
  • [32] Y. Pang, X. Zhao, L. Zhang, and H. Lu (2020) Multi-scale interactive network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §IV-B.
  • [34] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik (2017) Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39 (1), pp. 128–140. Cited by: §II-B, §II-C.
  • [35] Y. Qiu, Y. Liu, H. Yang, and J. Xu (2020) A simple saliency detection approach via automatic top-down feature fusion. Neurocomputing 388, pp. 124–134. Cited by: §II-A.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pp. 91–99. Cited by: §II-B, §III-A1, §III-C2.
  • [37] C. Rother, V. Kolmogorov, and A. Blake (2004) GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23 (3), pp. 309–314. Cited by: §II-C.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §III-C1.
  • [39] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. IEEE International Conference on Computer Vision (ICCV). Cited by: Fig. 3, §III-C1, §III-C2, §III-C4.
  • [40] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders (2013) Selective search for object recognition. International Journal of Computer Vision (IJCV) 104 (2), pp. 154–171. Cited by: §II-B.
  • [41] J. Wang, H. Jiang, Z. Yuan, M. Cheng, X. Hu, and N. Zheng (2017) Salient object detection: a discriminative regional feature integration approach. International Journal of Computer Vision (IJCV) 123 (2), pp. 251–268. Cited by: §I, §II-A.
  • [42] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan (2018) Salient object detection with recurrent fully convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (7), pp. 1734–1746. Cited by: §I, §II-A.
  • [43] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3127–3135. Cited by: §II-A.
  • [44] Y. Wu, S. Gao, J. Mei, J. Xu, D. Fan, C. Zhao, and M. Cheng (2020) JCS: an explainable covid-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054. Cited by: §I.
  • [45] Y. Wu and K. He (2018) Group normalization. In European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §IV-B.
  • [46] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: TABLE VIII.
  • [47] S. Xie and Z. Tu (2017) Holistically-nested edge detection. International Journal of Computer Vision (IJCV) 125 (1-3), pp. 3–18. Cited by: §II-A.
  • [48] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang (2016) Unitbox: an advanced object detection network. In ACM International Conference on Multimedia (ACM MM), pp. 516–520. Cited by: §III-C4.
  • [49] P. Zhang, W. Liu, H. Lu, and C. Shen (2019) Salient object detection with lossless feature reflection and weighted structural loss. IEEE Transactions on Image Processing (TIP) 28 (6), pp. 3048–3060. Cited by: §II-A.
  • [50] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan (2017) Amulet: aggregating multi-level convolutional features for salient object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 202–211. Cited by: §I, §II-A.