Object detection and semantic segmentation algorithms have achieved great success in recent years thanks to the introduction of large-scale datasets [everingham15pascal, lin2014microsoft, girshick2015fast, he2017maskrcnn, redmon2018yolo, ren2015faster]. However, most existing image datasets provide relatively simple forms of annotation such as image-level class labels, while many practical problems require more sophisticated information such as bounding boxes and regions corresponding to object instances. Unfortunately, acquiring such complex labels requires significant human effort, and it is challenging to construct large-scale datasets containing comprehensive annotations.
Researchers have therefore been interested in leveraging large numbers of weakly labeled images to solve these complex problems, and much progress has been made in weakly supervised object detection and semantic segmentation [bilen2016weakly, kwak2017weakly, tang2017multiple].
The most critical issue in weakly supervised object detection and semantic segmentation is that trained models typically focus on discriminative parts of objects in a scene. As a result, models fail to identify the regions corresponding to whole objects and to extract accurate object boundaries. To alleviate this limitation, existing techniques often rely on heuristics to make a good guess about object areas [diba2017weakly, kantorov2016contextlocnet]. For example, most weakly supervised object detection and semantic segmentation techniques adopt unsupervised object proposal generation methods [uijlings2013selective, zitnick2014edgebox] and class activation maps [selvaraju2017grad, zhou2016learning].
We tackle the weakly supervised instance segmentation problem, which is conceptually similar to a combination of object detection and semantic segmentation, yet even more challenging than either task because it inherits the critical challenges of both. Although object proposals provide rough information about the location of each object instance, a naïve application of an instance segmentation module to weakly supervised object detection results may not be successful in practice since the proposal information is very noisy.
The proposed algorithm learns an end-to-end deep neural network for instance segmentation through active interactions between multiple tasks. Figure 1 illustrates the proposed framework for weakly supervised instance segmentation. The contributions of this paper are summarized below:
We introduce an end-to-end trainable deep neural network model with active interactions among multiple tasks: object detection, instance mask generation, and object segmentation.
Our algorithm successfully integrates a bounding box regression module for better object localization even in the weakly supervised setting.
The proposed algorithm achieves substantially higher performance than the existing weakly supervised approaches on the standard benchmark dataset.
2 Related Work
This section reviews the existing weakly supervised algorithms for object detection, semantic segmentation, and instance segmentation.
2.1 Weakly Supervised Object Detection
Weakly Supervised Object Detection (WSOD) aims to localize objects in a scene only with image-level class labels. Most existing methods formulate WSOD as a Multiple Instance Learning (MIL) problem [dietterich1997solving] and attempt to learn detection models by extracting pseudo-ground-truth labels [bilen2016weakly, tang2018pcl, tang2017multiple, zhang2018w2f]. Bilen and Vedaldi [bilen2016weakly] propose a deep neural network architecture referred to as the Weakly Supervised Deep Detection Network (WSDDN), which combines classification and localization tasks to identify object classes and their locations in an input image. However, this technique tends to find only a single object class and instance, and often fails to solve problems involving multiple labels and objects.
WSDDN has been extended by adding pseudo-label refinement [tang2017multiple], context reasoning for each bounding box [kantorov2016contextlocnet], and a min-entropy model [wan2018min]. However, these models are still prone to focusing on the discriminative parts of objects instead of whole object regions. Tang et al. [tang2018pcl] and Son et al. [son2018forget] adopt graph mining techniques to capture the multi-modal distribution of object classes and localization layouts. Diba et al. [diba2017weakly] tackle weakly supervised object detection through joint learning of object detection and semantic segmentation, where the output of the segmentation network is used for filtering proposals. However, since the proposals for training the detection network are given by semantic segmentation results, the model has trouble identifying spatially overlapping objects of the same class. Shen et al. [shen2019cyclic] attempt to overcome this limitation by leveraging mutual feedback between detection and segmentation modules, but their method remains vulnerable to the same issue.
2.2 Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation relies heavily on high-quality activation maps [kolesnikov2016seed, mnih2014recurrent, zhou2016learning]. Most recent approaches for segmentation [ahn2018learning, hong2017weakly, huang2018weakly, kwak2017weakly, wang2018weakly] employ class activation maps (CAMs) to estimate pseudo-ground-truth masks. Kwak et al. [kwak2017weakly] introduce a superpixel pooling layer, which determines pooling layouts based on the superpixel boundaries of an input image. Some approaches [ahn2018learning, kolesnikov2016seed, huang2018weakly, wang2018weakly] propose techniques to propagate segmentation masks from seeds. In particular, Huang et al. [huang2018weakly] generate pseudo-ground-truths for the segmentation network using a region growing method [adams1994seeded], while AffinityNet [ahn2018learning] propagates labels based on semantic affinities between adjacent pixels. Ge et al. [ge2018multi]
introduce a three-stage technique for semantic segmentation: computing pixel-wise probability maps, detecting objects, and producing a final segmentation map. Note that this approach requires the estimation of several critical hyperparameters to learn the networks separately.
2.3 Instance Segmentation
Instance segmentation can be regarded as a combination of object localization and semantic segmentation that needs to identify individual object instances. There exist several fully supervised approaches [chen2018masklab, dai2016instance, hayder2017boundary, he2017maskrcnn]. Hayder et al. [hayder2017boundary] utilize a Region Proposal Network (RPN) [ren2015faster] to detect individual instances and leverage an Object Mask Network (OMN) for segmentation. Mask R-CNN [he2017maskrcnn], MaskLab [chen2018masklab], and MNC [dai2016instance] follow similar procedures to predict pixel-level segmentation labels.
There have been several recent works on weakly supervised instance segmentation based on image-level class labels only [ahn2019weakly, ge2019label, issam2019where, zhou2018weakly]. Peak Response Map (PRM) [zhou2018weakly] takes the peaks of an activation map as the pivots for individual instances and estimates the segmentation mask of each object using the pivots. Instance Activation Map (IAM) [zhu2019learning] selects pseudo-ground-truths out of precomputed segment proposals based on PRM to learn segmentation networks. There are a few more attempts to generate pseudo-ground-truth segmentation maps based on weak supervision and feed them to well-established instance segmentation networks [ahn2019weakly, issam2019where].
3 Proposed Algorithm
This section describes our deep multi-task community learning framework based on an end-to-end trainable architecture for weakly supervised instance segmentation.
One of the most critical limitations in weakly supervised instance segmentation is that the learned models often attend to small discriminative regions of objects and fail to recover the missing parts of target objects. This is partly because segmentation networks rely on noisy detection results without proper interactions, and because the benefit of the iterative label refinement procedure often saturates at an early stage due to the strong correlation between the two modules.
To alleviate this drawback, we propose a deep neural network architecture that constructs a positive feedback loop across its components and generates desirable instance detection and segmentation results. Specifically, the object detector generates proposal-level pseudo-ground-truth bounding box labels, which are used to create pseudo-ground-truth masks for the instance segmentation module; the module then produces the final segmentation labels of individual proposals using the masks. These three network components form a community and collaborate to update the parameters of the feature extractor using a multi-task loss, which leads to regularized representations robust to overfitting to local optima.
3.2 Network Architecture
Figure 2 presents the network architecture of our weakly supervised object detection and segmentation algorithm. As mentioned earlier, the proposed network consists of four parts: a feature extractor, an object detector with a bounding box regressor, an instance mask generation (IMG) module, and an instance segmentation module. Our feature extraction network consists of shared fully convolutional layers; the feature of each proposal is extracted by Spatial Pyramid Pooling (SPP) layers on the shared feature map and fed into the other modules.
3.2.1 Object Detection Module
For object detection, an SPP layer produces a feature map for each object proposal, which is forwarded to the last residual block (res5). These features are then passed to the detector and the regressor. The detection score of each proposal is used to determine its pseudo-label for training the bounding box regression, IMG, and instance segmentation networks. Since this idea can be plugged into any end-to-end trainable weakly supervised object detection network, we employ one of the most popular ones, referred to as OICR [tang2017multiple].
Bounding box regression is typically conducted to refine the proposals corresponding to objects under full supervision. However, learning a regressor in our setting is particularly challenging since it is prone to be biased toward discriminative parts of objects; this characteristic is aggravated in class-specific learning. Unlike [girshick2015fast, girshick2014rich, ren2015faster], we train a class-agnostic bounding box regressor based on pseudo-ground-truths to avoid overly discriminative representation learning and obtain better regularization. For bounding box regression, we first identify a set of pseudo-ground-truth bounding boxes, which is a collection of the top-scoring proposals of each class. If a proposal has a higher IoU with its nearest pseudo-ground-truth proposal than a threshold, the proposal and the pseudo-ground-truth proposal are paired to learn the regressor.
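The IoU-based pairing of proposals with pseudo-ground-truth boxes can be sketched as follows. This is a minimal illustration: the threshold value of 0.5 and the (x1, y1, x2, y2) box convention are assumptions, not specified in the text.

```python
def iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) format (assumed convention)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def pair_proposals(proposals, pseudo_gts, thr=0.5):
    # match each proposal to its nearest pseudo-ground-truth box and
    # keep the pair only if their IoU exceeds the threshold (thr assumed)
    pairs = []
    for p in proposals:
        best = max(pseudo_gts, key=lambda g: iou(p, g))
        if iou(p, best) > thr:
            pairs.append((p, best))
    return pairs
```

Only the surviving pairs contribute to the regression loss, so poorly localized proposals do not pollute the regressor's training signal.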
3.2.2 Instance Mask Generation Module
The Instance Mask Generation (IMG) module generates pseudo-ground-truth masks for instance segmentation using the proposal-level class labels given by the object detection module. It takes the features of each proposal from the SPP layers attached to multiple convolutional layers, as shown in Figure 2, and leverages these hierarchical representations to deal with multi-scale objects effectively.
We integrate the following additional features into the class activation map (CAM) [zhou2016learning]
to construct pseudo-ground-truth masks for individual proposals. First, we compute an activation map of the background class, referred to as a background activation map, by augmenting an additional channel corresponding to the background class. This map is useful to distinguish objects from the background. Second, instead of the Global Average Pooling (GAP) used in the standard CAM, we employ a weighted GAP that gives more weight to the center pixels within proposals based on an isotropic Gaussian distribution. Third, we perform feature smoothing of the input feature to the CAM module using a nonlinear activation function to penalize excessively high peaks in the CAM and generate spatially regularized feature maps appropriate for robust segmentation.
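The weighted GAP above can be sketched as below: pixels near the proposal center receive larger Gaussian weights before pooling. The Gaussian bandwidth `sigma` and the normalized-coordinate parameterization are assumptions for illustration.

```python
import math

def gaussian_weights(h, w, sigma=0.25):
    # isotropic Gaussian centered on the proposal; coordinates are
    # normalized by the proposal size (sigma is an assumed value)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    wts = [[math.exp(-(((i - cy) / h) ** 2 + ((j - cx) / w) ** 2) / (2 * sigma ** 2))
            for j in range(w)] for i in range(h)]
    s = sum(map(sum, wts))
    return [[v / s for v in row] for row in wts]  # normalize to sum to 1

def weighted_gap(feature_map, weights):
    # weighted spatial pooling of one channel of an h x w activation map
    return sum(feature_map[i][j] * weights[i][j]
               for i in range(len(feature_map))
               for j in range(len(feature_map[0])))
```

With uniform weights this reduces to the standard GAP; the Gaussian weighting biases the pooled score toward evidence at the proposal center.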
For each proposal, the pseudo-ground-truth mask for instance segmentation is generated from the three CAMs as
$M^{c} = \mathbb{1}\Big[\tfrac{1}{3}\sum_{s=1}^{3} A_{s}^{c} \ge \theta\Big], \quad (1)$
where $A_{s}$ is the CAM of the $s$-th level, whose size is $(C+1) \times h \times w$ to cover all classes including background, $\mathbb{1}[\cdot]$ is an element-wise indicator function, and $\theta$ is a predefined threshold.
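A minimal sketch of the thresholding step that turns a class activation map into a binary pseudo-ground-truth mask; the threshold value and the assumption that the CAM is already normalized to [0, 1] are illustrative only.

```python
def pseudo_mask(cam, threshold=0.5):
    # cam: h x w activation map of the detected class, assumed normalized
    # to [0, 1]; pixels at or above the threshold become foreground
    return [[1 if v >= threshold else 0 for v in row] for row in cam]
```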
3.2.3 Instance Segmentation Module
For instance segmentation, the output of the res5 block is upsampled and fed to five convolution layers with ReLU activations, followed by the final segmentation output layer, as illustrated in Figure 2. The module learns to perform pixel-wise classification for each proposal based on the binary mask supervision provided by the instance mask generation module, where the class label is obtained from the detector. The predicted mask of each proposal is also a class-specific binary mask, whose class is likewise determined by the detector. Note that our model can adopt any semantic segmentation network.
The overall loss function is given by the sum of the losses from the three modules as
$L = L_{det} + L_{img} + L_{seg},$
where $L_{det}$, $L_{img}$, and $L_{seg}$ denote the detection loss, the instance mask generation loss, and the segmentation loss, respectively. The three terms interact to train the backbone network for feature extraction. Using the multi-task loss regularizes the objective function and leads to better image representations, preventing early saturation in learning the backbone network.
3.3.1 Object Detection Loss
The object detection module is trained using a classification loss $L_{cls}$, refinement losses $L_{ref}^{k}$, and a bounding box regression loss $L_{reg}$. The features extracted from the individual object proposals are fed into the detection module based on OICR [tang2017multiple]. The image classification loss is the cross-entropy between the image-level ground-truth class label $y_c$ and its corresponding prediction $\phi_c$, which is given by
$L_{cls} = -\sum_{c=1}^{C}\big[\,y_c \log \phi_c + (1 - y_c)\log(1 - \phi_c)\,\big],$
where $C$ is the number of classes in the dataset. The pseudo-ground-truth of each object proposal in the refinement layers is obtained from the outputs of the preceding layers, where the supervision of the first refinement layer is provided by WSDDN [bilen2016weakly]. The loss of the $k$-th refinement layer is computed by a weighted sum of losses over all proposals as
$L_{ref}^{k} = -\frac{1}{|R|}\sum_{r=1}^{|R|}\sum_{c=1}^{C+1} w_{r}^{k}\, y_{cr}^{k} \log x_{cr}^{k},$
where $x_{cr}^{k}$ denotes the score of the $r$-th proposal with respect to class $c$ in the $k$-th refinement layer, $w_{r}^{k}$ is a proposal weight obtained from the prediction score in the preceding refinement layer, and $|R|$ is the number of proposals. In the refinement loss function, there are $C+1$ classes because we also consider the background class.
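The two detection losses above can be sketched numerically. This is a simplified illustration of the cross-entropy forms; the exact OICR weighting and pseudo-label assignment are more involved, and the `eps` clamp is an implementation convenience.

```python
import math

def image_classification_loss(y, phi, eps=1e-12):
    # binary cross-entropy between image-level labels y and class
    # predictions phi, summed over the C classes
    return -sum(yc * math.log(pc + eps) + (1 - yc) * math.log(1 - pc + eps)
                for yc, pc in zip(y, phi))

def refinement_loss(weights, pseudo_labels, scores, eps=1e-12):
    # weighted cross-entropy over proposals; pseudo_labels[r] is one-hot
    # over C+1 classes (background included) and weights[r] comes from
    # the preceding refinement branch
    n = len(scores)
    total = 0.0
    for r in range(n):
        total += weights[r] * sum(-y * math.log(s + eps)
                                  for y, s in zip(pseudo_labels[r], scores[r]))
    return total / n
```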
For the regression loss $L_{reg}$, we use a smooth $L_1$ loss between a proposal and its corresponding pseudo-ground-truth, following the bounding box regression literature [girshick2015fast, ren2015faster]. The regression loss is defined as follows:
$L_{reg} = \frac{1}{N}\sum_{i} \mathbb{1}_{i}\, \mathrm{smooth}_{L_1}\big(t_i, t_i^{*}\big),$
where $N$ is the number of pseudo-ground-truths, $\mathbb{1}_{i}$ is an indicator variable denoting whether the $i$-th proposal is matched with a pseudo-ground-truth, $t_i$ is the predicted bounding box regression offset, and $t_i^{*}$ is the desirable offset between the proposal and its pseudo-ground-truth as in R-CNN [girshick2014rich].
The detection loss is the sum of the image classification loss, the bounding box regression loss, and the refinement losses, which is given by
$L_{det} = L_{cls} + \lambda L_{reg} + \sum_{k} L_{ref}^{k},$
where the weight $\lambda$ is a fixed constant in our implementation.
3.3.2 Instance Mask Generation Loss
For training the CAMs in the IMG module, we adopt the average classification scores from the three refinement branches of our detection network. The loss function of the CAM network at the $s$-th level, denoted by $L_{cam}^{s}$, is given by a multi-class cross-entropy loss as
$L_{cam}^{s} = -\frac{1}{|R|}\sum_{r=1}^{|R|}\sum_{c=1}^{C+1} y_{cr} \log p_{cr}^{s},$
where $y_{cr}$ is a one-hot encoded pseudo-label from the detection branch of the $r$-th proposal for class $c$, and $p_{cr}^{s}$ is a softmax score of the same proposal for the same class obtained by the weighted GAP from the last convolutional layer. The instance mask generation loss is the sum of all the CAM losses:
$L_{img} = \sum_{s} L_{cam}^{s}.$
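The per-proposal CAM classification loss described above is an ordinary softmax cross-entropy against the one-hot pseudo-label. A minimal sketch, with the logits standing in for the weighted-GAP outputs:

```python
import math

def softmax(z):
    # numerically stable softmax over class logits
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cam_loss(pseudo_label, logits, eps=1e-12):
    # multi-class cross-entropy between a one-hot pseudo-label and the
    # softmax scores produced from the weighted GAP logits
    p = softmax(logits)
    return -sum(y * math.log(pc + eps) for y, pc in zip(pseudo_label, p))
```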
3.3.3 Instance Segmentation Loss
We attach our instance segmentation module after the res5 block, as illustrated in Figure 2 and discussed in Section 3.2.3. The loss of the segmentation network is obtained by comparing the network outputs with the pseudo-ground-truth using a pixel-wise cross-entropy, which is given by
$L_{seg} = -\frac{1}{|R|}\sum_{r=1}^{|R|}\sum_{(u,v)} \Big[ M_{r}(u,v) \log S_{r}(u,v) + \big(1 - M_{r}(u,v)\big)\log\big(1 - S_{r}(u,v)\big) \Big],$
where $M_{r}(u,v)$ is a binary element at location $(u,v)$ of the pseudo-ground-truth mask $M_{r}$ for proposal $r$, and $S_{r}(u,v)$ is the value of the output of the segmentation network, $S_{r}$, at $(u,v)$ for proposal $r$.
Our model sequentially predicts object detection and instance segmentation results for each proposal in a given image. For object detection, we use the average scores of the three refinement branches in the object detection module. Each regressed proposal is labeled as the class with the maximum score. We apply non-maximum suppression with an IoU threshold of 0.3 to the proposals. The surviving proposals are regarded as detected objects and used to estimate pseudo-labels for instance segmentation.
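The inference-time NMS step above can be sketched as follows. The 0.3 IoU threshold comes from the text; the greedy score-ordered formulation and the (x1, y1, x2, y2) box convention are standard assumptions.

```python
def box_iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.3):
    # greedy NMS: keep the highest-scoring box, drop boxes that overlap
    # it by more than iou_thr, and repeat on the remainder
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if box_iou(boxes[i], boxes[j]) <= iou_thr]
    return keep
```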
For instance segmentation, we select a foreground activation map $A^{c}$ from the IMG module and the corresponding segmentation score map $S^{c}$ from the instance segmentation module for each proposal. The final instance segmentation of each proposal is given by an ensemble of the two results:
$\hat{M}^{c} = \mathbb{1}\Big[\tfrac{1}{2}\big(A^{c} + S^{c}\big) \ge \theta\Big],$
where $\hat{M}^{c}$ is a binary segmentation mask for the detected class $c$, $\mathbb{1}[\cdot]$ is an element-wise indicator function, and $\theta$ is a threshold identical to the one used in Eq. (1).
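As a rough illustration, one way to realize such an ensemble is to average the two score maps and binarize with the shared threshold. The averaging scheme here is an assumption for the sketch, not necessarily the exact combination rule of the paper.

```python
def ensemble_mask(act_map, seg_map, threshold=0.5):
    # average the foreground activation map and the segmentation score
    # map per pixel, then binarize with the shared threshold (assumed)
    h, w = len(act_map), len(act_map[0])
    return [[1 if (act_map[i][j] + seg_map[i][j]) / 2.0 >= threshold else 0
             for j in range(w)] for i in range(h)]
```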
4 Implementation Details
4.1 Backbone Networks
We use ResNet50 [he2016deep] and VGG16 [simonyan14very]
as backbone networks, which are pretrained on ImageNet. The implementation details are as follows:
ResNet50: For object detection, one SPP layer is attached after res4, followed by res5. The output of the last residual block is shared with the IMG and segmentation modules through upsampling. The IMG module employs multiple levels of features from the outputs of SPP layers attached to res3 and res4, and the upsampled res5 output. These features are fed to the weighted GAP and the classification layers following one convolution layer for each level of the CAM subnetwork. For instance segmentation, the upsampled output of res5 is used. In our ResNet50 implementation, batch normalization is replaced with group normalization [wu2018group] due to the small batch size.
VGG16: For object detection, one SPP layer is attached after conv5_3, followed by fc6 and fc7. The IMG module employs multiple levels of features from the outputs of SPP layers attached to conv3_3, conv4_3, and conv5_3. For instance segmentation, the output of the SPP layer for conv5_3 is used. In our VGG16 implementation, pool4 is removed and dilated convolutions [chen2017rethinking] with rate 2 are used in the conv5 block.
We use Selective Search algorithm [uijlings2013selective]
for generating bounding box proposals. All fully connected layers in the detection and IMG modules are initialized from a Gaussian distribution with zero mean and a standard deviation of 0.01. The learning rate is decayed after 90k iterations for ResNet50 and after 70k iterations for VGG16, and a weight decay term is included in the loss function. We use five image scales, with the shorter side of the images in (480, 576, 688, 864, 1000), for data augmentation and ensemble in training and testing, respectively, while the size of the longer side is constrained below 2000. Our model is implemented in PyTorch and the experiments are conducted on an NVIDIA GTX Titan XP GPU.
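The multi-scale resizing rule described above (shorter side set to one of the listed scales, longer side capped at 2000) can be sketched as below; the function name and rounding behavior are illustrative assumptions.

```python
def rescaled_size(h, w, shorter=688, longest=2000):
    # scale so the shorter side matches the target, then shrink further
    # if the longer side would exceed the cap
    scale = shorter / min(h, w)
    if max(h, w) * scale > longest:
        scale = longest / max(h, w)
    return round(h * scale), round(w * scale)
```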
5 Experiments
This section describes our evaluation setting and presents the experimental results of our algorithm in comparison with existing methods. We also analyze various aspects of the proposed network.
5.1 Datasets and Evaluation Metrics
We use the PASCAL VOC 2012 segmentation dataset [everingham15pascal], which contains images of 20 object classes, to evaluate our algorithm. The dataset is composed of 1,464, 1,449, and 1,456 images for training, validation, and testing, respectively. We use the augmented training set (trainaug) with 10,582 images to learn our network, following prior segmentation research [ahn2019weakly, chen2018deeplab, bharath2011semantic]. In our weakly supervised learning scenario, we only use image-level labels to train the whole model. Detection and instance segmentation performance are measured on the PASCAL VOC 2012 segmentation validation (val) set to gauge how accurately our model identifies objects of the target classes and delineates individual instances.
We use the standard mean average precision (mAP) to evaluate object detection performance, where a bounding box is regarded as a correct detection if its overlap with the ground-truth, measured in IoU, is larger than a threshold. CorLoc [deselaers2012weakly] is also used to evaluate localization accuracy on the trainaug dataset. For the instance segmentation task, we evaluate performance using mAP at IoU thresholds of 0.25, 0.5, and 0.75.
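The mask-level metrics above rest on the IoU between a predicted and a ground-truth binary mask; a minimal sketch of that underlying computation (masks as nested lists of 0/1):

```python
def mask_iou(a, b):
    # intersection-over-union of two binary masks of equal shape
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0
```

A predicted instance then counts as correct at a given operating point when its `mask_iou` with a ground-truth instance exceeds the chosen threshold (0.25, 0.5, or 0.75).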
5.2 Comparison with Other Algorithms
We compare our algorithm with existing weakly supervised instance segmentation approaches [cholakkal2019object, ge2019label, zhou2018weakly, zhu2019learning]. Table 1 shows that our algorithm clearly outperforms the prior state of the art with both ResNet50 and VGG16 backbones. Specifically, we achieve performance gains of 7.8% and 5.5% points with ResNet50 at the lower IoU thresholds, respectively. We believe that such large margins come from the collaborative learning for effective regularization within a community of multiple modules. However, our models show relatively low accuracy at the strictest IoU threshold. This is partly because our coarse instance segmentation outputs contain limited details of object shapes, while other approaches directly generate outputs at the original input image resolution.
| Method | mAP at 0.25 | mAP at 0.5 | mAP at 0.75 |
|---|---|---|---|
| Cholakkal et al. [cholakkal2019object] | 48.5 | 30.2 | 14.4 |
| Ours w/o REG | 54.9 | 33.7 | 5.6 |
| Ours w/o REG | 52.4 | 28.9 | 5.2 |
5.3 Ablation Study
We present the results from the ablation studies to analyze the contribution of each component in the network and our training strategy.
5.3.1 Network Components
We analyze the effectiveness of each module for the instance segmentation and object detection tasks. All models are trained on the PASCAL VOC 2012 segmentation trainaug set. We compute mAP for instance segmentation and object detection on the PASCAL VOC 2012 segmentation val set, and measure CorLoc on the trainaug set.
Table 2 presents the benefit of joint learning with the Instance Mask Generation (IMG) and Instance Segmentation (IS) modules. By adding the two components, we achieve accuracy gains in detection of 4.4% and 3.2% points in terms of mAP and CorLoc, respectively, compared to the baseline detector. In particular, IMG turns out to be very helpful for detection by itself. The IS module improves segmentation accuracy over the detector with the IMG module, where the pseudo-ground-truths estimated by the IMG module are regarded as the segmentation results. Bounding box regression (REG) enhances the performance of both instance segmentation and object detection by generating better pseudo-ground-truths in the middle of our learning process.
5.3.2 Analysis of IMG module
We perform the analysis of the components in the IMG module and present the results in Table 3. All results are from the experiments without the bounding box regressor to demonstrate the impact of individual components clearly. The results in Table 3 imply that 1) leveraging background class activation map gives a substantial gain on performance, 2) feature smoothing alleviates the chronic limitation of weakly supervised learning, focusing on small discriminative parts, 3) the weighted GAP designed to concentrate more on the center of a bounding box is effective, and 4) the individual components induce synergy effect to achieve better accuracy.
| Method | mAP at 0.5 (seg) | mAP (det) | CorLoc |
|---|---|---|---|
| Detector + IMG | 32.8 | 48.6 | 66.3 |
| Detector + IMG + IS | 33.7 | 49.7 | 66.8 |
| Detector + REG + IMG + IS | 35.7 | 53.6 | 70.8 |
5.3.3 Comparison to a Simple Algorithm Combination
Table 4 presents the result of combining a weakly supervised object detection algorithm, OICR [tang2017multiple], with a weakly supervised semantic segmentation algorithm, AffinityNet [ahn2018learning]. This experiment is useful for understanding the performance of a straightforward combination of two techniques that leads to instance-level segmentation. Note that both OICR and AffinityNet are competitive approaches in their respective target tasks. We train the two models independently and combine their results by identifying segmentation labels from AffinityNet within each bounding box given by OICR. The proposed algorithm has an advantage in segmentation accuracy over this naïve combination. Moreover, our result is achieved by a unified end-to-end training procedure, while the counterpart requires the separate training of two complex algorithms.
5.4 Qualitative Results
Figure 3 shows instance segmentation results from our model with a Conditional Random Field [krahenbuhl2011efficient] and the identified bounding boxes on the PASCAL VOC 2012 segmentation val set. Our model successfully discriminates between objects of the same class within an input image using the predicted object proposals.
Figure 4 compares the detection results of our model with those of OICR using a ResNet-50 backbone network on the PASCAL VOC 2012 segmentation val set. Our model is more robust in localizing the whole body of an object, partly because representation learning for object detection is performed jointly with the IMG and segmentation networks, so the resulting features are better regularized.
6 Conclusion
We presented a novel end-to-end framework for weakly supervised object detection and instance segmentation. Our framework jointly trains three subnetworks with a shared feature extractor, which perform object detection with bounding box regression, instance mask generation, and instance segmentation. These modules and the feature extractor form a positive feedback loop with cross-regularization, which makes our model more robust and improves the quality of each task by leveraging the complementary characteristics of the components. In particular, the bounding box regressor successfully regularizes the object detector, so the subsequent modules learn more effectively. Consequently, our model outperforms not only the previous state-of-the-art weakly supervised instance segmentation methods but also the weakly supervised object detection baseline on PASCAL VOC 2012, using a simple segmentation module. Finally, since our framework does not rely on a particular network architecture for the object detection and semantic segmentation modules, better detectors or segmentation networks can further improve its performance.