Image segmentation is ubiquitous in medical image analysis applications. Unlike natural image analysis, obtaining high-quality per-pixel annotations for medical images is relatively cumbersome due to the high costs, time, and logistics involved. Previous works attempt to alleviate this annotation burden either by judiciously sampling the datapoints to be labelled in an active learning setting [gu2014active], or by leveraging weak, inexpensive forms of annotation, such as bounding boxes and anatomical landmarks, to assist segmentation in a mixed-supervision setting [shah2018ms].
In active learning settings, a fully convolutional network (FCN) starts with an initial set of images with corresponding annotations and iteratively suggests which images in the remaining unlabelled dataset should be annotated next. Annotations are requested on images for which the network is highly uncertain about its predictions and which are dissimilar to the already annotated images, until a predefined annotation budget is reached.
In mixed-supervision settings, an FCN performing dense segmentation is augmented with auxiliary tasks, such as object detection and anatomical landmark localization, that require only inexpensive annotations to assist the base task of segmentation. Multi-task networks [he2017mask, shah2018ms] have shown that adding complementary tasks enables the network to learn better features and hence perform better on the base task.
The costs involved in annotating each form of supervision differ significantly, so knowing the optimal balance between the number of annotations of each supervision type is essential for cost-effective segmentation. For practical applications, it would be desirable to have a budget-constrained framework that enables suggestive annotation in mixed-supervision settings. We propose a linear programming (LP) inspired cost-minimization framework to enable budget-constrained suggestive segmentation in mixed-supervision settings.
We have 3 modes of supervision in this particular application, namely dense segmentation (s), landmarks (l), and bounding-box detections (d). We use the MS-Net architecture [shah2018ms], which is constructed from two types of components: (i) a base network for full-resolution feature extraction and (ii) 3 sub-network extensions, one for each supervision mode. We modify each sub-network to include a concrete dropout layer [gal2017concrete] to enable uncertainty estimation.
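Concrete dropout replaces the fixed Bernoulli mask of standard dropout with a differentiable relaxation, so the drop probability can be learned jointly with the network weights. A minimal numpy sketch of the relaxed mask follows; the `temperature` value and the wiring into the MS-Net sub-networks are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def concrete_dropout_mask(shape, p, temperature=0.1, rng=None):
    """Differentiable relaxation of a Bernoulli dropout mask (Gal et al., 2017).

    Because the mask is a smooth function of the drop probability p,
    p can be optimized by gradient descent instead of being hand-tuned.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-7, 1.0 - 1e-7, size=shape)  # uniform noise
    # relaxed Bernoulli sample: ~1 where a unit is dropped, ~0 where kept
    logits = (np.log(p) - np.log1p(-p) + np.log(u) - np.log1p(-u)) / temperature
    drop = 1.0 / (1.0 + np.exp(-logits))
    # keep-mask with inverted-dropout scaling so activations keep their scale
    return (1.0 - drop) / (1.0 - p)
```

Keeping this layer stochastic at inference time is what allows the Monte-Carlo uncertainty estimates used below.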
In the proposed method, every mode of supervision has a cost of annotation and a value, which is an estimate of the marginal gain in IoU from adding one sample of that annotation type. Training starts with a subset of the training data for which all three annotations are available. We train the network on this subset using all 3 of its sub-network extensions and progressively suggest images from the remaining unlabelled set to be annotated next. We call this approach the Suggestive Mixed Supervision Network (SMS-Net).
Uncertainty and similarity estimation. After initial training, we obtain predictions on the remaining images in the training set. Keeping the concrete dropout layer active during inference, we compute, for each image in the remaining set, an uncertainty estimate for the predictions of each supervision mode. We also compute the similarity of each remaining image to the current training set, using a Gaussian-kernel similarity measure on the features from the last layer of the full-resolution stream of the MS-Net.
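These two per-image quantities can be sketched as follows, assuming flat feature vectors and a `predict_fn` that performs one stochastic forward pass with dropout active; the number of passes `T`, the kernel bandwidth `sigma`, and taking the maximum over the labelled set are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def mc_uncertainty(predict_fn, x, T=20):
    """Predictive uncertainty from T stochastic forward passes
    (dropout kept active at inference): mean per-output variance."""
    preds = np.stack([predict_fn(x) for _ in range(T)])
    return preds.var(axis=0).mean()

def gaussian_similarity(f_x, F_labelled, sigma=1.0):
    """RBF-kernel similarity between a candidate's feature vector f_x and
    the feature vectors of the already-annotated set (rows of F_labelled);
    we take the max, i.e. similarity to the nearest labelled image."""
    d2 = ((F_labelled - f_x) ** 2).sum(axis=1)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)).max())
```

A candidate identical to a labelled image scores similarity 1, so the selection step below can down-weight it in favour of more diverse samples.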
The objective is to maximize the value of the requested annotations given a cost budget, while preferring samples with high prediction uncertainty and samples that are diverse from the current training set. We define a binary selection vector whose entries indicate, for each image in the remaining set and each annotation type, whether that annotation is requested. To select the images and the annotation types to be requested for each image, we solve an integer LP.
The objective trades off annotation value, prediction uncertainty, and diversity, subject to the budget allocated for this update. We used the cvxpy package for this optimization. We obtain the selection vector for all images in the remaining set, request the indicated annotations for each selected image, add the newly annotated images to the training set, and retrain the network with this updated training set. As shown in the results section, this cost-minimization formulation enables higher performance at significantly reduced annotation budgets. In each LP update, we also re-estimate the value of each annotation type from the IoU obtained: if a given annotation type was assigned to some number of samples in the update, its value is updated with the observed gain in IoU per such sample.
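A selection step of this kind can be sketched with an off-the-shelf mixed-integer solver. In the sketch below, the score combining value, uncertainty, and diversity is one plausible weighting rather than the paper's exact objective, and `scipy.optimize.milp` stands in for the cvxpy formulation:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def select_annotations(value, cost, uncertainty, similarity, budget):
    """Pick (image, annotation-type) pairs maximizing a value/uncertainty/
    diversity score subject to a total cost budget.

    value, cost:  length-3 arrays (one entry per supervision mode)
    uncertainty:  (n, 3) per-image, per-mode uncertainty estimates
    similarity:   (n,)   similarity of each image to the labelled set
    """
    n = uncertainty.shape[0]
    # reward valuable, uncertain, diverse picks (one plausible weighting)
    score = (value[None, :] * uncertainty * (1.0 - similarity)[:, None]).ravel()
    costs = np.tile(cost, n)
    res = milp(
        c=-score,                                    # milp minimizes
        constraints=[LinearConstraint(costs.reshape(1, -1), 0.0, budget)],
        integrality=np.ones(3 * n),                  # binary decision variables
        bounds=Bounds(0.0, 1.0),
    )
    return np.round(res.x).reshape(n, 3).astype(int)
```

Under a tight budget the solver naturally prefers cheap annotation types unless an expensive type offers a proportionally larger value, which mirrors the behaviour reported in the results.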
In our experiments, we use the JSRT database [shiraishi2000development], which contains 247 high-resolution chest radiographs, each with expert segmentations, 166 landmark annotations, and bounding boxes [van2006segmentation] covering 5 anatomical structures. We compare our proposed SMS-Net with MS-Net [shah2018ms] and the suggestive FCN [yang2017suggestive]. We define the max budget as the cost required to provide all images with all 3 types of annotation. We start by randomly selecting 20% of the training samples as the initial fully annotated set. We apply the LP update described in Section 2 at regular intervals throughout training, assigning a fixed fraction of the total available budget to each update.
In Figure 1(a), we present the mean IoU obtained for varying levels of available annotation budget. Since the annotation types differ in cost and value, Figure 1(b) shows the distribution of samples drawn from each annotation type for varying levels of available budget. For all experiments, the per-type annotation costs are fixed, chosen to be proportionally in line with practical medical image annotation scenarios. In Figure 1(b), we see that as the budget constraint is relaxed, more samples are provided with dense segmentations whereas fewer samples are given detections. At moderate and tighter budgets, detections and landmarks are favoured, as they provide reasonable supervision at reduced cost. This selection strategy also leads to an increase in performance, as can be seen in Figure 1(a).
4 Conclusion and Future work
We propose a method to enable suggestive annotation in mixed-supervision settings for annotation cost minimization in medical image segmentation. We show that our method achieves better performance than the state of the art at significantly reduced annotation budgets. Future research directions include evaluating the proposed method on larger datasets with a non-fixed set of annotations, and building an end-to-end framework for joint optimization of the base segmentation architecture and the cost-minimization LP.
-  Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
-  Yingjie Gu, Zhong Jin, and Steve C Chiu. Active learning combining uncertainty and diversity for multi-class image classification. IET Computer Vision, 9(3):400–407, 2014.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  Meet P Shah, SN Merchant, and Suyash P Awate. MS-Net: Mixed-supervision fully-convolutional networks for full-resolution segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 379–387. Springer, 2018.
-  Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, and Tsuneo Matsumoto. Development of a digital image database for chest radiographs with and without a lung nodule. American Journal of Roentgenology, 174(1):71–74, 2000.
-  Bram Van Ginneken, Mikkel B Stegmann, and Marco Loog. Segmentation of anatomical structures in chest radiographs using supervised methods. Medical Image Analysis, 10(1):19–40, 2006.
-  Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 399–407. Springer, 2017.