Salient Object Detection (SOD) is a long-standing vision task that aims to segment visually salient objects in a scene. It often serves as a core step for downstream vision tasks such as video object segmentation [wang2015saliency], object proposal generation [alexe2012measuring], and image cropping [wang2017deep]. Recent deep learning-based SOD methods have achieved significant progress [wang2017learning, zhuge2018boundary, xu2019structured, su2019selectivity, zhao2019pyramid, hou2017deeply, wang2018detect], benefiting from the powerful representation learning capability of neural networks and large-scale pixel-level annotated training data. Since annotating pixel-level labels is extremely tedious, some works [wang2017learning, zeng2019multi] explore cheaper image-level labels (e.g., class labels) to train SOD models in a weakly-supervised manner.
Salient Instance Detection (SID) goes a step further than SOD, as it aims to identify each individual salient instance. This instance-level saliency information can further benefit vision tasks that require fine-grained scene understanding, e.g., image captioning [karpathy2015deep], image editing [chen2009sketch2photo], and semantic segmentation [fan2018associating]. However, existing SID methods [fan2019s4net, li2017instance, zhang2016unconstrained] still rely on large-scale annotated ground truth masks in order to learn how to segment salient instances with their boundaries delineated. Hence, it is worthwhile to study the SID problem from the weakly-supervised perspective of using cheaper image-level labels.
A straightforward solution may be to use class labels to train a weakly-supervised SID model. However, learning a SID model from class labels alone is non-trivial for two reasons. First, class labels can help detect semantically predominant regions [zhou2016learning], but these regions are not guaranteed to be visually salient. Second, objects of the same class may not be easily distinguished due to their high semantic affinities. We observe that subitizing is naturally related to salient instance detection. By predicting the number of salient objects, it can serve as global supervision that helps separate instances of the same class and cluster parts of an instance with diverse appearances into one.
Inspired by the above insight, we propose to learn a Weakly-supervised SID network (denoted WSID-Net) using class and subitizing labels. WSID-Net consists of three synergistic branches: a salient object detection branch and a boundary detection branch locate candidate salient objects and delineate their boundaries by exploiting semantics from the class labels, while a centroid detection branch detects the centroid of each salient instance by leveraging saliency cues from the subitizing labels. The information from the three branches is fused to obtain the salient instance maps. To demonstrate the effectiveness of the proposed model, we compare it with a variety of baselines adapted from related tasks on the standard benchmark [li2017instance].
To summarize, this paper has three main contributions: 1) To the best of our knowledge, we propose the first weakly-supervised method for salient instance detection, which only requires image-level class and subitizing labels to obtain salient instance maps; 2) We propose a novel network (WSID-Net), with a novel centroid-based subitizing loss to exploit salient instance number information, and a novel Boundary Enhancement module to learn instance boundaries; 3) We conduct extensive experiments to analyze the proposed method, and verify its superiority against baselines adapted from related state-of-the-art approaches.
2 Related Work
Salient Instance Detection (SID). Existing SID methods are fully-supervised. Zhang et al. [zhang2016unconstrained] propose to detect salient instances with bounding boxes, using a MAP-based optimization framework to regress a large number of pre-defined bounding boxes into a compact set of instance-level bounding boxes with high confidence. However, this bounding-box-based method cannot detect salient instances with accurately delineated boundaries. Other works predict pixel-wise masks for the detected salient instances, and typically rely on a large amount of manually annotated ground truth labels. Specifically, Li et al. [li2017instance] propose to first predict the saliency mask and instance-aware saliency contour, and then use the existing Multi-scale Combinatorial Grouping (MCG) algorithm [APBMM2014] to extract instance-level masks. Fan et al. [fan2019s4net] propose an end-to-end SID network based on the object detection model FPN [lin2017feature], with a segmentation branch to segment the salient instances.
Unlike these existing SID methods, we propose in this paper to train a weakly-supervised network, which only requires two image-level labels, i.e., the class and subitizing labels.
Salient Object Detection (SOD). SOD methods aim at detecting salient objects in a scene without differentiating the detected instances. Liu et al. [liu2010learning] formulate the SOD task as a binary segmentation problem, segmenting out the visually conspicuous objects of an image via color and contrast histogram based priors. Traditional methods leverage different hand-crafted priors to detect salient objects, e.g., image colors and luminance [achanta2009frequency], global and local contrast priors [perazzi2012saliency, cheng2014global], and the background geometric distance prior [yang2013saliency]. Recently, deep learning based SOD methods have achieved superior performance on standard SOD benchmarks [yang2013saliency, li2014secrets, shi2015hierarchical, jiang2013salient, wang2017learning, cheng2014global], by incorporating salient boundary knowledge [zhuge2018boundary, xu2019structured, su2019selectivity], fusing deep features [zhao2019pyramid, hou2017deeply], and designing attention mechanisms [wang2018detect, zhang2018bi]. Particularly, He et al. [he2017delving] propose to leverage a numerical representation of subitizing to enrich spatial representations of salient objects. These methods typically benefit from the powerful learning ability of deep neural networks as well as large-scale annotated ground truth data. To alleviate the data annotation effort, some methods [wang2017learning, zeng2019multi] train weakly-supervised deep models using object class labels and class activation maps (CAMs) [zhou2016learning]. On the other hand, Li et al. [li2018contour] leverage a pre-trained contour network to generate pseudo labels for training the saliency detection network.
However, existing weakly-supervised SOD methods cannot be directly applied to our problem, as class labels cannot provide instance-level information. In this paper, we propose to use class and subitizing labels to train our SID model.
Weakly-supervised Semantic Instance Segmentation (SIS). SIS methods aim to detect all instances in a class-specific manner. Although they do not consider the saliency attribute of instances, they are related to our task, as they also segment the objects in an image into instances. Here, we briefly summarize the latest weakly-supervised SIS methods, which we adopt as baselines for our task. Based on pixel affinities extracted from the class activation map, IRN [jiwoon2019weakly] learns to predict object seeds and boundaries that can be used to infer the entire region of the target instance. PRM [zhou2018weakly] first learns to predict peak response maps within class responses, where each peak is generally related to an instance; it then adopts off-the-shelf segment proposals [APBMM2014] to obtain each instance based on the peaks. In comparison to PRM, PRM+D [cholakkal2019object] further incorporates per-class object counts to learn a better spatial distribution of peak-represented instances. Other methods [zhu2019learning, laradji2019masks] refine the results of PRM [zhou2018weakly] online, by jointly learning from class labels and off-the-shelf segment proposals.
Class labels are widely explored in weakly-supervised SOD methods for learning to localize candidate objects, based on the pixel-level semantic affinities derived from the network responses to the class labels. However, class labels lack instance-level information, causing over- and under-detection when salient instances are from the same category. We note that subitizing, a cheap image-level label that denotes the number of salient instances of a scene, can serve as a complementary supervision to the class labels. Hence, we propose to use both class and subitizing labels to address our weakly-supervised SID problem.
To this end, we propose a Weakly-supervised SID network (WSID-Net), as shown in Figure 1. WSID-Net has three branches: a saliency detection branch for locating candidate salient objects; a centroid detection branch for detecting the centroids of salient instances, where subitizing knowledge is utilized in a novel loss function to regularize the global number of instance centroids; and a boundary detection branch for delineating salient instance boundaries, where a novel Boundary Enhancement (BE) module is introduced to resolve the discontinuity problem of detected boundaries. A novel Double Attention (DA) module is further incorporated to learn context information for detecting centroids and boundaries.
3.1 Centroid Detection Branch
Detecting object centroids is crucial to separating the objects in a weakly-supervised scheme. Unlike existing semantic (instance) segmentation methods [jiwoon2019weakly, Neven2019InstanceSB, zhou2018weakly, cholakkal2019object, zhu2019learning, laradji2019masks] that detect the centroids based on network responses to the class labels, we propose to introduce subitizing information to explicitly supervise the salient centroid detection process.
We adopt an image-to-image translation scheme, in which the network outputs a 2D centroid map whose value at each pixel is the offset vector from that pixel to its instance centroid. The bottom part of Figure 1 shows the network structure of our centroid detection branch. Given an input image, we first extract multi-scale backbone features and feed them to the DA modules for refinement (to be discussed in Section 3.3). We then fuse the refined high-level features into an intermediate representation, which is further fused with the refined low-level features to produce the centroid map.
Centroid-based Subitizing loss. It has been shown that penalizing the centroid loss [jiwoon2019weakly, Neven2019InstanceSB] helps cluster local pixels with high semantic affinities. However, it typically fails when salient instances from the same object category have varying shapes and appearances, because the clustering of local pixels lacks global saliency supervision. Hence, we introduce the centroid-based subitizing loss to resolve this problem: subitizing explicitly supervises the number of predicted centroids, which constrains the pixel clustering process. We measure the loss with the Mean Square Error (MSE) between the subitizing label and the number of predicted centroids, where the predicted centroids are extracted from the offset vectors of the pixels in the predicted saliency region. The loss only backpropagates to the offset vectors in the saliency region, which prevents the learning of instance centroid detection from being distracted by the non-salient background.
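The centroid-counting step behind this loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the greedy vote-merging with a distance threshold stands in for the actual clustering step, and the discrete count shown here ignores how gradients reach the offset vectors.

```python
import numpy as np

def count_centroids(offsets, saliency_mask, merge_dist=3.0):
    """Count distinct centroids voted for by the offset vectors of salient pixels.

    offsets: (H, W, 2) array; offsets[y, x] is the (dy, dx) vector from
             pixel (y, x) to its predicted instance centroid.
    saliency_mask: (H, W) boolean array of the predicted salient region.
    Votes closer than `merge_dist` are merged into one centroid
    (a simple greedy stand-in for the actual clustering).
    """
    ys, xs = np.nonzero(saliency_mask)
    votes = np.stack([ys, xs], axis=1) + offsets[ys, xs]
    centroids = []
    for v in votes:
        if all(np.linalg.norm(v - c) > merge_dist for c in centroids):
            centroids.append(v)
    return len(centroids)

def subitizing_loss(offsets, saliency_mask, num_instances):
    """MSE between the predicted centroid count and the subitizing label."""
    n_pred = count_centroids(offsets, saliency_mask)
    return float((n_pred - num_instances) ** 2)
```

For example, if every salient pixel of two separate blobs points exactly at its blob center, `count_centroids` returns 2, and the loss is zero when the subitizing label is also 2.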
Figure 2 visualizes the results of centroid detection and the corresponding instance segmentation, with and without the centroid-based subitizing loss. Without this loss, the network groups the two dogs into one, as they have similar appearances and lie next to each other (columns 3 and 4). With it, the network predicts the correct number of centroids and generates reasonable salient instance masks compared with the ground truth (columns 5 and 6).
3.2 Boundary Detection Branch
Boundaries provide strong cues for separating salient instances. Unlike fully-supervised SID methods that learn boundary-aware information from pixel-level ground truth masks, we propose the Boundary Enhancement (BE) module, which leverages the Canny edge prior [john1986] to delineate continuous instance boundaries.
Network structure. The top part of Figure 1 shows the architecture of the boundary detection branch. Given an input image, we obtain refined backbone features using the DA modules (to be discussed in Section 3.3), which are then concatenated to predict the boundary map. We also feed the input image into the BE module to obtain enhanced edge features. The output boundary map is computed by fusing these two sets of features, followed by a sigmoid activation function.
BE module. We apply a random walk algorithm to grow a salient instance from its centroid out to its boundary. However, this may fail when part of the boundary is discontinuous, as the random walk will leak into the region outside the boundary. Hence, we propose the BE module to incorporate the edge prior for learning continuous instance boundaries, as shown in Figure 3. Specifically, we first extract low-level features along the horizontal and vertical directions from the input image using two directional convolution layers. These low-level features are fed into three Residual Blocks [he2016deep] for refinement, and then concatenated with the enriched edges computed by the Canny operator [john1986]. A final 1×1 convolution layer produces the enriched boundary features.
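A toy example of why boundary continuity matters: a flood fill (a crude stand-in for the random walk) started at a centroid stays confined by a closed boundary, but leaks through even a single-cell gap. The grid size and shapes here are made up purely for illustration.

```python
from collections import deque

def flood_fill(boundary, seed, size=10):
    """Grow a region from `seed` on a size x size grid, never entering
    cells listed in `boundary`. Returns the set of reached cells."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < size and 0 <= nx < size
                    and (ny, nx) not in boundary and (ny, nx) not in seen):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return seen

# A closed 5x5 box around the seed confines the fill to the 3x3 interior...
box = {(y, x) for y in range(2, 7) for x in range(2, 7)
       if y in (2, 6) or x in (2, 6)}
inside = flood_fill(box, (4, 4))

# ...but removing one boundary cell lets the fill leak over the whole grid.
leaky = flood_fill(box - {(2, 4)}, (4, 4))
```

With the closed box the fill covers only the 9 interior cells; with the gap it escapes and covers every non-boundary cell, which is exactly the failure the BE module is designed to prevent.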
Figure 4 visualizes two examples of boundary detection and the corresponding salient instance detection with and without the BE module. We can see that our BE module helps detect the boundaries between objects, which is crucial to salient instance segmentation.
3.3 Double Attention (DA) Module
Detecting instance centroids and boundaries are two highly coupled sub-tasks, i.e., they influence each other and further affect the SID performance. To learn these two sub-tasks efficiently, we propose the Double Attention (DA) module. Its design is based on two observations. First, since salient instances may have various shapes, we need to capture long-range spatial contextual information. Second, cross-class ambiguities of pseudo affinity labels influence both sub-tasks, while the class information carried in channel-wise contexts can help address this problem. Hence, we combine channel-wise and spatial-wise attention mechanisms in parallel to form our DA module. We apply the DA module to both the centroid detection and the boundary detection branches, and share its weights between them. Unlike existing dual attention mechanisms [woo2018cbam, fu2019dual] that only enhance feature discriminability, our DA module also allows information exchange across the two branches, improving both sub-tasks.
Figure 5 shows the structure of our DA module. The top and bottom branches are channel-wise and spatial-wise attention blocks, respectively. Given the input features, the channel-wise attention block applies average and max channel-wise pooling, and feeds both pooled descriptors to a shared multi-layer perceptron (MLP) with one hidden layer to generate the channel attention features. The spatial-wise attention block computes spatial attention features with a convolutional layer of kernel size 7. The final attention features are obtained by combining the two attention maps with the input features via element-wise multiplication and summation.
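The parallel attention computation described above can be sketched in numpy. This is a CBAM-style reconstruction under our reading of the text; the weights are random placeholders, the naive 7×7 convolution loop replaces a real conv layer, and the final combination (input scaled by each attention map, then summed) is an assumption, since the exact fusion operators are not fully specified here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Channel attention from average- and max-pooled channel descriptors
    passed through a shared one-hidden-layer MLP. feat: (C, H, W)."""
    avg = feat.mean(axis=(1, 2))                  # (C,)
    mx = feat.max(axis=(1, 2))                    # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP, ReLU hidden
    return sigmoid(mlp(avg) + mlp(mx))            # (C,)

def spatial_attention(feat, kernel):
    """Spatial attention: a 7x7 conv over the stacked channel-wise
    average and max maps. feat: (C, H, W); kernel: (2, 7, 7)."""
    maps = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)))
    H, W = feat.shape[1:]
    out = np.zeros((H, W))
    for y in range(H):          # naive convolution, fine for a sketch
        for x in range(W):
            out[y, x] = np.sum(kernel * padded[:, y:y + 7, x:x + 7])
    return sigmoid(out)                           # (H, W)

def double_attention(feat, w1, w2, kernel):
    """Parallel channel- and spatial-wise attention; each map rescales
    the input element-wise, and the two results are summed."""
    ca = channel_attention(feat, w1, w2)[:, None, None]
    sa = spatial_attention(feat, kernel)[None, :, :]
    return feat * ca + feat * sa

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 12, 12))
w1 = rng.normal(size=(4, 16))       # hidden layer weights (toy sizes)
w2 = rng.normal(size=(16, 4))
kernel = rng.normal(size=(2, 7, 7)) * 0.1
out = double_attention(feat, w1, w2, kernel)      # same shape as input
```

Note that the output keeps the input's shape, so the module can be dropped between any two backbone stages.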
Figure 6 shows the effectiveness of the proposed Double Attention module in enhancing the boundary and centroid detection performances.
4.1 Training and evaluation details
Datasets and metric.
Our full network is trained on two image-level labels: class and subitizing. We use PASCAL VOC 2012 [Everingham2010-pascal-voc-IJCV], a semantic and instance segmentation dataset, but only use its class labels to train our network. The training set contains a total of 10,582 images of 20 classes. ILSO [li2017instance] is a SID dataset containing 500 training images with instance-aware pixel labels. For our weakly-supervised training, we extract the numbers of salient instances of these images as our subitizing labels. We perform evaluations on the test set of ILSO [li2017instance], which has 300 images with ground truth instance masks. We use the mean Average Precision (mAP) metric [hariharan2014simultaneous] to evaluate the SID performance.
Training and Inference. We train the three branches of the proposed network separately. We train the centroid detection branch using the proposed centroid-based subitizing loss together with the centroid loss introduced in [jiwoon2019weakly, Neven2019InstanceSB]. We train the boundary detection branch using the boundary loss introduced in [Ahn_2018_CVPR, jiwoon2019weakly]. To train the saliency detection branch, we follow existing weakly-supervised SOD methods and use pseudo masks derived from class labels. Specifically, we first compute class activation maps via [zhou2016learning]. We then feed these maps together with the input image to a Conditional Random Field [krahenbuhl2011efficient] to generate pseudo object maps, and use these pixel-level pseudo labels to train the saliency detection branch. During inference, given an input image, WSID-Net first computes the centroids, boundaries, and saliency maps. To segment each salient instance, a random walk algorithm grows salient regions from the detected centroids until reaching the boundaries.
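The final assembly step of inference can be approximated as below. For brevity, this sketch replaces the boundary-aware random walk with a nearest-centroid assignment over salient pixels; that simplification, along with the function name and array layout, is ours and not the paper's exact procedure.

```python
import numpy as np

def segment_instances(saliency, centroids):
    """Assign each salient pixel to its nearest detected centroid.

    saliency: (H, W) boolean saliency mask.
    centroids: list of (y, x) centroid coordinates.
    Returns an (H, W) int map: 0 = background, i = i-th instance.
    """
    labels = np.zeros(saliency.shape, dtype=int)
    if not centroids:
        return labels
    ys, xs = np.nonzero(saliency)
    pts = np.stack([ys, xs], axis=1)                      # (N, 2) pixels
    cts = np.asarray(centroids, dtype=float)              # (K, 2) centroids
    d = np.linalg.norm(pts[:, None, :] - cts[None, :, :], axis=2)
    labels[ys, xs] = d.argmin(axis=1) + 1                 # 1-based instance ids
    return labels
```

On a saliency mask containing two blobs with one detected centroid in each, this splits the mask into two instance regions while leaving the background at 0.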
4.2 Implementation details
We implement WSID-Net in the PyTorch framework [paszke2019pytorch]. Both training and testing are performed on a PC with an i7 4GHz CPU and a GTX 1080Ti GPU. A CRF is used to generate or refine pseudo labels, with its hyperparameters fixed across all experiments. We choose ResNet50 as the backbone for all three branches of WSID-Net, initialized as in [simonyan2014very]. Input images are resized to 512×512. To minimize the loss function, we use the SGD optimizer with batch size 6 and an initial learning rate of 0.01, decayed following the poly policy. We train WSID-Net for 5 epochs.
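The poly learning-rate policy mentioned above typically has the form lr = base_lr × (1 − step/max_steps)^power. The power value of 0.9 below is a common default, assumed here because the text elides the exact setting:

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly learning-rate decay: starts at base_lr and reaches 0 at
    max_steps. power=0.9 is a common default, not the paper's value."""
    return base_lr * (1.0 - step / max_steps) ** power
```

For instance, with base_lr = 0.01 the rate stays at 0.01 at step 0 and decays smoothly to 0 by the final step.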
4.3 Comparison with the State-of-the-art Methods
As we are the first to propose a weakly-supervised SID method, we compare our method with two fully-supervised state-of-the-art SID methods: S4Net [fan2019s4net] and MSRNet [li2017instance].
We also prepare six baselines adapted from related tasks: two from the SOD task, C2SNet [li2018contour] and NLDF [luo2017non]; one from the SID task, MAP [zhang2016unconstrained]; one from the object detection (OD) task, DeepMask [pinheiro2015learning]; and two from the Semantic Instance Segmentation (SIS) task, PRM+D [cholakkal2019object] and IRN [jiwoon2019weakly]. We adapt them either by adding post-processing strategies to derive instance-level saliency maps from their original outputs, or by modifying their networks and retraining them on our training data. Details are summarized as follows:
C2SNet [li2018contour] and NLDF [luo2017non] are proposed for salient object detection with contour prediction. We apply the MCG method [APBMM2014], which takes a contour map as input and outputs segment proposals, to obtain multiple salient instance proposals, and then use MAP [zhang2016unconstrained] to filter out proposals with low confidence.
MAP [zhang2016unconstrained] is a fully-supervised SID method, which learns to predict the bounding boxes of salient instances. Since it cannot output salient instance masks, we feed both the image and the bounding box to a CRF [krahenbuhl2011efficient] to obtain the segmented masks.
DeepMask [pinheiro2015learning] learns to predict class-agnostic segment proposals with object scores. We utilize a weakly-supervised SOD method, WSS [wang2017learning], to filter out non-salient segment proposals by calculating the IoU between the object mask and the salient mask, with an IoU threshold of 0.75.
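This filtering step can be sketched as follows. The function names are ours, and the overlap measure (the IoU between a proposal and its salient part, i.e. the fraction of the proposal covered by the saliency mask) is our reading of the adaptation, not necessarily the exact original criterion:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def filter_salient_proposals(proposals, saliency, thresh=0.75):
    """Keep proposals sufficiently covered by the salient mask.

    iou(p, p & saliency) reduces to |p & saliency| / |p|, i.e. the
    salient fraction of each proposal.
    """
    return [p for p in proposals
            if iou(p, np.logical_and(p, saliency)) >= thresh]
```

A proposal lying entirely inside the salient region passes the 0.75 threshold, while one that is mostly background is discarded.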
IRN [jiwoon2019weakly] learns to predict class-specific segment proposals. We utilize the same filtering method as in DeepMask to select salient instances.
PRM+D [cholakkal2019object] is trained with class and per-class subitizing labels. We add one additional convolutional layer at the end of the network to merge their per-class outputs (originally 20 output maps for 20 classes) into one class-agnostic map, and then retrain it using our training data.
4.4 Performance Evaluation
Quantitative evaluation. We quantitatively evaluate our method in Table 1. mAP@0.7 is the most demanding metric, as it requires the IoU to exceed 70%. Our method outperforms the second-best weakly-supervised baseline by about 10%. These results show that our method achieves the best performance using just two types of image-level labels.
†† As of today, the code for MSRNet [li2017instance] is still not available. Following [fan2019s4net], we directly copy the numbers reported in [li2017instance] for quantitative comparison.
Qualitative evaluation. We further compare our method qualitatively in Figure 7. Our method delineates the instance boundaries clearly and directly outputs an accurate number of segmented salient instances, as shown in column 9. In contrast, (1) PRM+D and IRN fail to detect whole instances and produce inferior boundaries (e.g., rows 1 and 9); (2) C2SNet and NLDF tend to respond to texture boundaries, resulting in fragmented instances (e.g., rows 2, 3 and 4); (3) DeepMask and S4Net suffer from over-detection, as they fail to suppress multiple proposals belonging to the same instance (e.g., row 2); and (4) MAP is a bounding-box based method that fails to produce clear instance boundaries even when post-processed by a widely adopted segmentation method, CRF (e.g., rows 1 and 2). Overall, our method outperforms the baselines, thanks to the centroid-based subitizing loss and the carefully designed BE and DA modules.
| Method | Task | Supervision | Training annotation | Post-processing | mAP@0.5 | mAP@0.7 |
| --- | --- | --- | --- | --- | --- | --- |
| MSRNet [li2017instance] | SID | FS | instance-level pixel mask | MAP [zhang2016unconstrained], MCG [APBMM2014] | 65.3% | 52.3% |
| MAP [zhang2016unconstrained] | SID | FS | instance-level bounding box | N/A | 56.6% | 24.8% |
| S4Net [fan2019s4net] | SID | FS | instance-level pixel mask | N/A | 82.2% | 59.6% |
| C2SNet [li2018contour] | SOD | WS | unlabeled images | MAP [zhang2016unconstrained], MCG [APBMM2014] | 45.5% | 24.5% |
| DeepMask [pinheiro2015learning] | OD | WS | instance-level bounding box | N/A | 37.1% | 20.5% |
| PRM+D [cholakkal2019object] | SIS | WS | class, subitizing labels | MCG [APBMM2014] | 49.6% | 31.2% |
| IRN [jiwoon2019weakly] | SIS | WS | class label | N/A | 57.1% | 37.4% |
| Ours | SID | WS | class, subitizing labels | N/A | 61.9% | 47.2% |
4.5 Internal Analysis
We first investigate how the BE and DA modules and the centroid-based subitizing loss affect the SID performance, providing an ablation study of our model design based on the mAP@IoU metric. From the results of the ablation table, we can see that the SID performance increases consistently as we incorporate these components. This shows that they boost the centroid and boundary detection sub-tasks, which play a vital role in detecting salient instances. Figures 2, 4, and 6 provide additional visual comparisons demonstrating the effectiveness of the BE and DA modules and the centroid-based subitizing loss.
We also evaluate the design choices of the DA module and the influence of using different backbones. Due to page limitation, we show these analytical results in the Supplemental.
In this paper, we propose the first weakly-supervised SID method, trained on class and subitizing labels. Our WSID-Net learns to predict object boundaries, instance centroids, and salient regions. With the proposed Boundary Enhancement module, Double Attention module, and centroid-based subitizing loss, our method can identify and segment each salient instance effectively. Both quantitative and qualitative experiments demonstrate the effectiveness of the proposed method compared with baseline methods.
Our method does have limitations. It may fail when our saliency detection branch (like existing weakly-supervised SOD methods) cannot detect the majority of the salient regions, due to complex background textures and colors, as shown in Figure 5. As future work, we will explore incorporating a discriminative network from generative adversarial learning to improve the SOD performance of our saliency detection branch, and extend our method to handle videos.