In recent years, deep learning (DL) methods [3, 4, 14] have become powerful tools for biomedical image segmentation. However, due to the large variety of biomedical applications (e.g., different targets, different imaging modalities, different experimental settings, etc.), high annotation effort and cost are commonly needed to acquire sufficient training data for DL models in new applications. In biomedical image segmentation, studies have been done on reducing annotation effort by utilizing unannotated data [1, 17] and on annotation data selection. In this paper, we present a different approach to alleviate the burden of manual annotation. Instead of using fine object masks to train a DL model, we propose a new weakly supervised DL approach that can achieve accurate segmentation by using only bounding boxes of the target objects as input. See Fig. 1 for some example results. As annotating bounding boxes is considerably faster than annotating fine masks, our approach can significantly reduce annotation effort.
A set of weakly supervised annotation methods has been proposed for semantic segmentation in natural scene images. In these methods, various weak annotation forms were explored (e.g., points, scribbles, and bounding boxes [5, 7, 13]). Compared with other weak annotation forms, bounding boxes are better defined for annotators and provide much more information (e.g., object sizes and more exhaustive background annotation). Thus, among these weak annotation forms, bounding box approaches show the most promising results, which could potentially match the segmentation results from full annotation.
However, the bounding box based methods for natural scene images cannot be directly extended to biomedical image segmentation, for the following reasons. (1) All these methods [5, 7, 13] use orthogonal bounding boxes to annotate objects. This works well in natural scene settings, in which many objects are orthogonal to the image boundaries. But in biomedical images, objects can appear in any orientation, and orthogonal bounding boxes are less useful (e.g., see Fig. 2). (2) Objects in biomedical images usually have more complicated inner structures and/or vague boundaries. Thus, the boundary recovery step (e.g., DenseCRF) in these methods may not work well (Fig. 4(c)).
Hence, in this paper, we develop a new bounding box based weakly supervised DL approach to deal with the two aforementioned challenges in biomedical image segmentation. (1) To address the orthogonal bounding box issue, we present a method to efficiently annotate tilted bounding boxes based on the extreme points of target objects. Instead of a series of interactions (drawing a bounding box, adjusting the tilt angle, and adjusting the box boundary), our method needs only six clicks for each bounding box (two clicks around the object center to indicate the box’s orientation and four clicks for the extreme points on the object boundary), as shown in Fig. 3. This greatly enhances the annotation efficiency, and all these clicks can be reused in a later stage to better indicate object extents. (2) To recover the object boundaries more accurately, instead of using the methods designed for natural scene images, we apply graph search (GS), a long-tested method for optimizing boundaries in biomedical images. Fig. 2 outlines the main ideas of our approach. First, we develop a method to combine GS and DL to generate fine object masks from box annotation, in which DL uses box annotation to compute a rough segmentation for GS and GS is applied to locate the optimal object boundaries. Note that a key requirement of GS is to have a rough segmentation with correct topology. Our approach satisfies this requirement easily by using the topology information provided by the box annotation. During the mask generation process, we carefully utilize information from box annotation to filter out potential errors, and then use the generated masks to train an accurate DL network for image segmentation.
Experiments on gland segmentation in H&E stained histology images, lymph node segmentation in ultrasound images, and fungus segmentation in electron microscopy (EM) images show that our approach attains superior performance over the best-known state-of-the-art weakly supervised DL method, and is able to achieve (1) nearly the same accuracy as fully supervised DL methods with far less annotation effort, (2) significantly better results with similar annotation time, and (3) robust performance in various applications.
Fig. 2 gives an overview of our approach. In this section, we focus on three major components of our approach: (1) a procedure for annotating tilted bounding boxes; (2) a DL network that computes a rough segmentation for graph search (GS) based on box annotation; (3) fine mask generation using GS.
2.1 A new procedure for annotating tilted bounding boxes
We first briefly review the known methods for annotating bounding boxes, and then present our new method for efficiently annotating tilted bounding boxes.
A standard protocol for annotating orthogonal bounding boxes in natural scene image datasets (e.g., ImageNet) usually has two steps: (1) draw an orthogonal box by diagonally dragging the mouse from one corner of an imaginary rectangle that tightly bounds the object to the opposite corner; (2) adjust the boundary of the box until it actually bounds the object. Step (2) is often necessary, since the two corners annotated in step (1) may not be on the object, and it is quite challenging to place them accurately so that they align well with the object boundary. This standard protocol is difficult to extend to tilted bounding boxes because drawing a tilted rectangle in step (1) is much more time-consuming and difficult than drawing an orthogonal one. A different way of annotating orthogonal bounding boxes has also been proposed, which takes only four clicks on the extreme points (top, bottom, leftmost, and rightmost) of the object to annotate the box. Since the extreme points are well-defined physical points on the object boundary, it is much easier to locate them accurately than to draw an imaginary rectangle as in the standard protocol. The extreme point approach achieves a multi-fold speedup over the standard protocol.
We adopt the extreme point approach and extend it to annotating tilted bounding boxes, for two advantages. (I) The extreme point approach not only is more efficient than the standard protocol, but also provides more information (e.g., where the object touches the bounding box). In Section 2.2, we show how such extra information can help DL networks to generate a more accurate rough segmentation. (II) The extreme point approach can be easily extended to annotating tilted bounding boxes since the only required change is to click the extreme points with respect to the respective orientation of the object.
Fig. 3 shows the procedure for annotating a tilted bounding box in our approach. First, two clicks are used to annotate the orientation of the tilted box. To make every click count, these two clicks should be around the center of the object. We show how to utilize these two clicks in Section 2.2. After the orientation of the box is acquired, we draw an assistive grid (Fig. 3) to help the user to annotate the four extreme points (top, bottom, leftmost, and rightmost) with respect to the object’s orientation, in four clicks. Finally, the corresponding tilted bounding box of the extreme points is recorded (together with all the six clicks) and drawn on the original image to avoid duplicated annotation.
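The six-click procedure above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not our annotation tool: the function name `tilted_box_from_clicks`, the returned dictionary keys, and the choice of taking the midpoint of the two orientation clicks as the object center are all hypothetical.

```python
import math

def tilted_box_from_clicks(orient_a, orient_b, extreme_points):
    """Build a tilted bounding box from two orientation clicks and four extreme points.

    All points are (x, y) tuples. The box orientation is the direction from
    orient_a to orient_b; the box is the tightest rectangle with that
    orientation that contains the four extreme points.
    """
    ax, ay = orient_a
    bx, by = orient_b
    angle = math.atan2(by - ay, bx - ax)
    ux, uy = math.cos(angle), math.sin(angle)   # unit vector along the box
    vx, vy = -uy, ux                            # unit vector across the box
    # Project the extreme points onto the two box axes.
    us = [px * ux + py * uy for px, py in extreme_points]
    vs = [px * vx + py * vy for px, py in extreme_points]
    u_min, u_max, v_min, v_max = min(us), max(us), min(vs), max(vs)
    # Map the four corners back to image coordinates.
    corners = [(u * ux + v * vx, u * uy + v * vy)
               for u, v in ((u_min, v_min), (u_max, v_min),
                            (u_max, v_max), (u_min, v_max))]
    # Assumption: the object center is the midpoint of the two orientation clicks.
    object_center = ((ax + bx) / 2, (ay + by) / 2)
    return {"angle": angle, "corners": corners, "object_center": object_center}
```

Keeping all six clicks alongside the box, as the procedure requires, lets the later stages reuse the object center and the extreme points.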
2.2 Computing rough segmentation based on box annotation
To generate accurate fine object masks from box annotation, graph search (GS) needs the DL model to provide a rough segmentation that has the correct object topology and reasonable shape accuracy. In this section, we discuss how to make full use of all the information we acquire in Section 2.1.
From the box annotation, we can gather the following cues. (1) Since every object should be covered by at least one bounding box, the regions that are not covered by any box are expected to be the background. (2) Each box is expected to contain one major object, and the center of that object is specified by the first two clicks of the box annotation (Section 2.1). (3) The object is expected to touch the box at the four extreme points (Section 2.1).
Based on these cues, we generate the ground truth as shown in Fig. 4(a). To distinguish this ground truth from the ground truth in Section 2.3, we call it the box ground truth. The discussion in this paragraph refers to Fig. 4(a). We take the following steps to label pixels in the images. (I) Based on cue (1), we label all the pixels not covered by any box as the background (blue color). (II) To encourage the DL network to learn the correct topology, based on cue (2), we label the pixels “around” the object’s center as the object’s class (green color). Because the shape of a box is usually a good indicator of the shape of its object, we formally define the pixels “around” the object center as those pixels that are inside the rectangle which (a) is a scaled version of its bounding box, (b) has the same orientation as its bounding box, and (c) is centered at the object’s center. The value of the scaling parameter is related to the overall shapes of the target objects. For example, a more convex shape would allow a larger value of this parameter. In all our experiments, we use the same fixed value. (III) To ensure reasonable shape accuracy, based on cue (3), four line segments are used to connect the extreme points and the center of the object (see Fig. 4(a)). Pixels on these four line segments are also labeled as the object’s class (green color). This better informs the DL model of the objects’ extents, and it is especially important for objects that have more than one layer of boundaries (e.g., glands), since this is the only information indicating which boundary layer should be detected. (IV) The remaining unlabeled pixels (black color) are ignored during the training process by assigning them a weight of 0.
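Steps (I) to (IV) can be sketched as a small labeling routine. This is a minimal sketch under stated assumptions, not our implementation: the box dictionary keys, the label codes, and the `scale` parameter (a stand-in for the size ratio of the central rectangle, whose actual value is not reproduced here) are hypothetical.

```python
import math

BACKGROUND, OBJECT, IGNORE = 0, 1, 255   # hypothetical label codes

def in_rect(px, py, center, half_w, half_h, angle):
    """True if (px, py) lies inside a rectangle rotated by `angle` around `center`."""
    dx, dy = px - center[0], py - center[1]
    u = dx * math.cos(angle) + dy * math.sin(angle)
    v = -dx * math.sin(angle) + dy * math.cos(angle)
    return abs(u) <= half_w and abs(v) <= half_h

def draw_segment(label, p, q):
    """Label the pixels on the segment from p to q as OBJECT."""
    steps = int(max(abs(q[0] - p[0]), abs(q[1] - p[1]))) + 1
    for i in range(steps + 1):
        t = i / steps
        x = int(round(p[0] + t * (q[0] - p[0])))
        y = int(round(p[1] + t * (q[1] - p[1])))
        if 0 <= y < len(label) and 0 <= x < len(label[0]):
            label[y][x] = OBJECT

def box_ground_truth(height, width, boxes, scale=0.25):
    # (I) Everything starts as background; pixels inside any box become
    # "ignore" (IV) unless a later step labels them as object.
    label = [[BACKGROUND] * width for _ in range(height)]
    for b in boxes:
        for y in range(height):
            for x in range(width):
                if in_rect(x, y, b["box_center"], b["width"] / 2,
                           b["height"] / 2, b["angle"]):
                    label[y][x] = IGNORE
    for b in boxes:
        # (II) Scaled copy of the box, centered at the object center.
        for y in range(height):
            for x in range(width):
                if in_rect(x, y, b["object_center"], scale * b["width"] / 2,
                           scale * b["height"] / 2, b["angle"]):
                    label[y][x] = OBJECT
        # (III) Segments from the object center to the four extreme points.
        for point in b["extreme_points"]:
            draw_segment(label, b["object_center"], point)
    return label

def loss_weights(label):
    """Per-pixel training weights: ignored pixels get weight 0."""
    return [[0.0 if v == IGNORE else 1.0 for v in row] for row in label]
```

Marking all boxes as ignored before drawing any object labels keeps one box from erasing the object pixels of an overlapping neighbor.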
Finally, a fully convolutional network (FCN) is used to compute a rough segmentation for GS based on the box ground truth. Fig. 4(b) gives an example of the computed rough segmentation, and Table 2 shows that a better rough segmentation can be obtained when all these cues are utilized. A possible issue with the above scheme is that it tends to work well only with objects of relatively “simple” shapes (e.g., star-shaped objects). However, our method is still broadly applicable in practice. On one hand, a significant portion of segmentation targets in biomedical images (e.g., the three applications in this paper) are star-shaped. On the other hand, to handle more complex shapes, one can simply divide the object into multiple star-shaped regions and still annotate them using the current scheme (which would still be more efficient than tracing the objects).
2.3 Generating accurate masks from rough segmentation
Although the rough segmentation computed in Section 2.2 is quite close to the results from fully supervised DL methods (see Tables 1 and 2), to bridge the final gap, we need to carefully utilize the local boundary information. A straightforward choice is to use DenseCRF to promote better boundary delineation, as done in prior work. However, as shown in Fig. 4(c), DenseCRF does not work well in some biomedical images (especially the ones that have objects with complicated inner structures and/or vague boundaries). To better utilize the local boundary information, we show below how to address this issue using GS.
| Similar annotation time | 2 | 0 | 0.9450 | 37 | 0 | 0.8740 | 4 | 0 | 0.9423 |
Compared with DenseCRF, GS is more suitable for the task of generating accurate masks from rough segmentation, for the following reasons. (1) GS does not change object topology (even though topology improvement may be desired in some applications). In our method, since we already obtain the topology from the box annotation, not changing the topology (by GS) is what we need for this problem (Fig. 4(d)). (2) Since GS ensures globally optimal solutions, it can handle more complicated situations (e.g., when part of the boundary is vague or missing) in biomedical images. (3) The parameters in GS have physical meanings, which makes them more intuitive to set across different applications.
Our method for producing accurate masks has two main steps.
Step 1. We pair each annotated box with a rough segmentation mask based on the Intersection over Union (IoU) score between the annotated box and the tilted bounding box of the rough segmentation mask (computed by using the same tilt angle as the annotated box). Each annotated box is matched with the rough segmentation mask that gives the maximum IoU. To filter out potential errors, we only use GS to compute the fine masks for those annotated boxes whose matching rough segmentation masks have an IoU above a preset threshold.
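Because the tilted bounding box of a rough mask is computed with the same angle as the annotated box, both rectangles are axis-aligned in the rotated frame, and the IoU reduces to a simple interval computation. The sketch below illustrates this matching under stated assumptions: the `extent` representation of a box, the function names, and the `min_iou` threshold are hypothetical, and the threshold value is left to the caller.

```python
import math

def tilted_bbox_of_mask(mask, angle):
    """Extent (u_min, u_max, v_min, v_max) of the mask pixels in the frame rotated by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    us, vs = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                us.append(x * c + y * s)
                vs.append(-x * s + y * c)
    if not us:
        return None   # empty mask
    return (min(us), max(us), min(vs), max(vs))

def iou_in_frame(a, b):
    """IoU of two rectangles given as extents in the same rotated frame."""
    inter_w = max(min(a[1], b[1]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[2], b[2]), 0.0)
    inter = inter_w * inter_h
    area_a = (a[1] - a[0]) * (a[3] - a[2])
    area_b = (b[1] - b[0]) * (b[3] - b[2])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_boxes(boxes, masks, min_iou):
    """Map each box index to (mask index, IoU) for its best mask, if good enough."""
    matches = {}
    for i, box in enumerate(boxes):
        best_j, best_score = None, 0.0
        for j, mask in enumerate(masks):
            extent = tilted_bbox_of_mask(mask, box["angle"])
            if extent is None:
                continue
            score = iou_in_frame(box["extent"], extent)
            if score > best_score:
                best_j, best_score = j, score
        # Boxes without a sufficiently good match are left out (error filtering).
        if best_j is not None and best_score >= min_iou:
            matches[i] = (best_j, best_score)
    return matches
```

Boxes missing from the returned dictionary simply keep their box ground truth in the later steps.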
Step 2. The fine masks are then computed from the rough segmentation masks using GS. Fig. 5 illustrates the process of GS. First, to prevent overlapping masks, the medial axis of the rough segmentation is computed and used to ensure that GS works on separated objects (see (a)-(c) in Fig. 5). Then, the graph construction process of GS follows a known method in which the boundaries of the rough segmentation masks are used to determine the positions and directions of the graph columns (the reader is referred to the GS literature for more technical details). The cost function is simply the magnitude of the gradients of the image intensities along the column directions (see (d)-(e) in Fig. 5). Additionally, since the extreme points determined in Section 2.1 are on the boundaries of the objects and the bounding boxes should contain the segmentation, the boundaries generated by GS are forced to pass through the extreme points and forbidden to go outside the extents of the bounding boxes. These constraints are implemented by assigning very low (high) weights to pixels that should be excluded (included).
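To illustrate how column-wise boundary optimization and the hard constraints interact, the sketch below swaps in a much simpler technique than GS itself: a dynamic program over an open chain of columns with a smoothness limit between neighbors. It is not the graph construction used in our method, and it ignores the closure of the boundary. The function names, the `max_shift` parameter, and the `HIGH`/`LOW` constants are hypothetical.

```python
HIGH, LOW = 1e6, -1e6   # hypothetical "very high" / "very low" weights

def apply_constraints(weights, forced, forbidden):
    """Copy `weights` and encode hard constraints.

    weights[c][k] is the weight (e.g., gradient magnitude) of placing the
    boundary at position k on column c. `forced` maps a column to the position
    the boundary must pass through (an extreme point); `forbidden` lists
    (column, position) pairs outside the bounding box.
    """
    constrained = [list(column) for column in weights]
    for c, k in forced.items():
        constrained[c][k] = HIGH
    for c, k in forbidden:
        constrained[c][k] = LOW
    return constrained

def best_boundary(weights, max_shift=1):
    """Positions (one per column) maximizing total weight with |p[c] - p[c-1]| <= max_shift."""
    n_pos = len(weights[0])
    score = list(weights[0])
    back = []
    for column in weights[1:]:
        new_score, pointers = [], []
        for k in range(n_pos):
            lo, hi = max(0, k - max_shift), min(n_pos - 1, k + max_shift)
            # Best predecessor position within the smoothness window.
            j = max(range(lo, hi + 1), key=lambda t: score[t])
            new_score.append(score[j] + column[k])
            pointers.append(j)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final position.
    k = max(range(n_pos), key=lambda t: score[t])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]
```

Because the forced and forbidden positions dominate every ordinary gradient weight, the optimal path passes through the former and avoids the latter whenever the smoothness limit allows it.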
Finally, we generate a new ground truth based on the fine masks computed by GS. We call this ground truth the fine ground truth. For all annotated boxes that have corresponding GS-generated masks, the box ground truth is replaced by the generated masks. In all other locations, we keep using the box ground truth. Fig. 4(e) gives an example of the fine ground truth. An FCN with the same structure as that in Section 2.2 is then trained using the fine ground truth to produce accurate segmentation. See Fig. 1 for some example results of segmentation.
Table 2
| Tilted bounding box + extreme points | 0.9513 | 0.914 | 0.9549 |
| Orthogonal bounding box | 0.9101 | 0.8779 | 0.8992 |
| Orthogonal bounding box + extreme points | 0.9369 | 0.9087 | 0.929 |

Table 3
| Rough segmentation + GS | 0.9571 | 0.9153 | 0.9543 |
3 Evaluation datasets and implementation details
To thoroughly validate our method, in our experiments, we use three different datasets from various biomedical applications: (1) gland segmentation in H&E stained histology images, (2) lymph node segmentation in ultrasound images, and (3) fungal cell segmentation in electron microscopy (EM) images.
Gland segmentation dataset. This dataset consists of 14 whole-slide clinical H&E stained histology images of human intestinal tissues in various disease conditions (e.g., normal, chronic inflammation, acute inflammation, and chronic acute inflammation). We use 7 of them for training and the rest for testing. One might wonder whether 7 training images are too few to train our deep learning model. However, note that whole-slide images usually have a very large field of view, and each image can contain hundreds of glands. Our training set contains 1058 glands in total, which is comparable to the 2015 MICCAI Gland Challenge dataset (766 glands).
Lymph node segmentation dataset. This dataset contains 207 clinical ultrasound images of human neck lymph nodes. There are five types of lymph nodes (i.e., healthy, lymphoma, metastasis, reactive, and tuberculosis). We use 170 images for training and the remaining 37 for testing.
Fungus segmentation dataset. This dataset contains 84 images captured by serial block-face scanning electron microscopy (EM). We use 44 images for training and the other 40 images for testing.
Implementation details. For all three datasets, we rescale the image intensity to a fixed range. Since the training images can be much larger than the input size of the network, during each iteration, we form the training batch by randomly cropping the training images. After that, standard data augmentation (rotation and flipping) is applied to the cropped patches. We use the Adam optimizer to train our network. The initial learning rate is set to 5e-4 and reduced to 5e-5 after 10k iterations. Our FCN components are trained for 20k iterations with a batch size of 8. Finally, the trained FCNs are applied to the test images in the same way as U-Net.
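The sampling and schedule described above can be sketched as follows. This is a minimal, framework-free sketch under stated assumptions rather than our training code: the function names are hypothetical, rotations are assumed to be multiples of 90 degrees, flipping is assumed to be horizontal, and the step from 5e-4 to 5e-5 is assumed to happen exactly at iteration 10,000 (counting from 0).

```python
import random

def learning_rate(iteration):
    """Step schedule: 5e-4 for the first 10k iterations, then 5e-5."""
    return 5e-4 if iteration < 10_000 else 5e-5

def random_crop(image, size, rng):
    """Cut a random size-by-size patch out of a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)     # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def rot90(patch):
    """Rotate a patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

def flip(patch):
    """Mirror a patch left to right."""
    return [row[::-1] for row in patch]

def augment(patch, rng):
    """Apply a random number of 90-degree rotations and a random flip."""
    for _ in range(rng.randrange(4)):
        patch = rot90(patch)
    if rng.random() < 0.5:
        patch = flip(patch)
    return patch

def sample_batch(images, size, batch_size=8, seed=None):
    """Form one training batch of augmented random crops."""
    rng = random.Random(seed)
    return [augment(random_crop(rng.choice(images), size, rng), rng)
            for _ in range(batch_size)]
```

In practice the same crop and augmentation must also be applied to the label map and weight map of each image, which is omitted here for brevity.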
4 Experiments and results
4.1 Evaluation of our final segmentation results
Fig. 6 shows some visual examples of our final segmentation results. We evaluate the final segmentation of our method in three different aspects. (1) We compare our method with the best-known state-of-the-art weakly supervised DL method using boxes-only annotation. Among the variants of that method, we choose one for comparison, since (a) two other variants require supervised boundary detection, which is not available for the three datasets we use, and (b) the chosen variant shows better results than [5, 13]. (2) We compare our method with the same DL network trained on full annotation. (3) We compare our method with the same DL network trained on a subset of full annotation that takes similar annotation time as our box annotation. Table 1 shows that our approach attains superior performance over the best-known weakly supervised DL method, and is able to achieve (I) nearly the same accuracy as fully supervised DL methods with far less annotation effort, and (II) much better results with similar annotation time.
We further provide some qualitative examples to demonstrate the effectiveness of our approach. In the first row of Fig. 7, one can see that, by preserving the topology of the box annotation using graph search (GS), our method can achieve much better object-level accuracy on the test data. Furthermore, as shown in the second row of Fig. 7, when the objects have more than one layer of boundaries, the compared method may fit any one of them, while our method can detect the correct boundary layer by utilizing the cues from the extreme points.
4.2 Ablation study
Different bounding box annotations. We evaluate the accuracy of the rough segmentation results produced by boxes only annotation. To show that tilted bounding boxes and the cues from extreme points are essential for our biomedical objects, we compare our approach with orthogonal bounding boxes only and orthogonal bounding boxes together with extreme points. The rough segmentation results of these methods are compared with the ground truth masks and evaluated using pixel-level F1 score. To evaluate their potential of being refined to be accurate masks, all the rough segmentation masks are dilated/eroded until they reach the maximum F1 score. As Table 2 shows, by utilizing tilted bounding boxes and the cues from extreme points, our rough segmentation is much better than those of the other two methods and is only worse than the full annotation (see Table 1).
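The evaluation protocol above (pixel-level F1, with each rough mask dilated or eroded until its F1 peaks) can be sketched as follows. This is an illustrative sketch under stated assumptions, not our evaluation script: the 4-neighbor structuring element, the `max_steps` search bound, and the function names are hypothetical.

```python
def f1_score(pred, truth):
    """Pixel-level F1 between two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def dilate(mask):
    """One step of binary dilation with a 4-neighbor structuring element."""
    height, width = len(mask), len(mask[0])
    out = [[1 if v else 0 for v in row] for row in mask]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx]:
                        out[y][x] = 1
                        break
    return out

def erode(mask):
    """One step of erosion, computed as the complement of the dilated complement.

    Pixels outside the image are treated as foreground, so the image border
    does not erode the mask by itself.
    """
    inverted = [[0 if v else 1 for v in row] for row in mask]
    return [[0 if v else 1 for v in row] for row in dilate(inverted)]

def best_morph_f1(pred, truth, max_steps=5):
    """Best F1 over the original mask and up to max_steps dilations or erosions."""
    best = f1_score(pred, truth)
    for operation in (dilate, erode):
        current = pred
        for _ in range(max_steps):
            current = operation(current)
            best = max(best, f1_score(current, truth))
    return best
```

Reporting the best score over this search measures how close a rough mask could get to the true mask by a uniform boundary shift, which is the sense of "potential" used above.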
Rough segmentation + GS vs. our approach. As discussed in Section 2.3, in our framework, we first use GS to refine the rough segmentation on the training images and then train a second FCN based on the refined results to generate the final segmentation. Yet, a more common and straightforward approach is to use GS as a post-processing step to refine the rough segmentation on the test images (we refer to this approach as “rough segmentation + GS” in Table 3 and Fig. 8).
Compared with “rough segmentation + GS”, our framework has the following advantages. (1) Applying GS to the test images can only achieve cosmetic improvements. On the other hand, by providing more accurate boundary annotation (produced by GS) on the training images, our second FCN can detect object instances more accurately (see Fig. 8 and Table 3). (2) GS could deteriorate the rough segmentation when the object topology is incorrect (see the fungus experiments in Table 3). By applying GS to the training images, we can filter out such potential errors using the box annotation. Furthermore, the extreme points can provide strong regulation on GS as well. Thus, our framework achieves consistently better results than “rough segmentation + GS” on all our datasets (Table 3). (3) By shifting GS from the test images to the training images, our second FCN is able to mimic the behaviours of GS. Hence, there is no need for an additional GS-based post-processing step on test images in our framework, which can improve the inference speed.
In this paper, we presented a new weakly supervised DL approach for biomedical image segmentation using boxes-only annotation that can achieve nearly the same performance as fully supervised DL methods. Our new method provides a more efficient way to annotate training data for biomedical image segmentation applications, and can potentially save considerable manual effort.
This research was supported in part by NSF Grants CCF-1617735 and CNS-1629914, and the Global Collaboration Initiative (GCI) Program of the Notre Dame International Office, University of Notre Dame.
[1] Baur, C., Albarqouni, S., Navab, N.: Semi-supervised deep learning for fully convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 311–319 (2017)
[2] Bearman, A., Russakovsky, O., Ferrari, V., Li, F.F.: What’s the point: Semantic segmentation with point supervision. In: European Conference on Computer Vision. pp. 549–565 (2016)
[3] Chen, H., Qi, X., Yu, L., Heng, P.A.: DCAN: Deep contour-aware networks for accurate gland segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2487–2496 (2016)
[4] Chen, J., Yang, L., Zhang, Y., Alber, M., Chen, D.Z.: Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation. In: Advances in Neural Information Processing Systems. pp. 3036–3044 (2016)
[5] Dai, J., He, K., Sun, J.: BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1635–1643 (2015)
[6] Guo, Z., Zhang, L., Lu, L., Bagheri, M., Summers, R.M., Sonka, M., Yao, J.: Deep LOGISMOS: Deep learning graph-based 3D segmentation of pancreatic tumors on CT scans. arXiv preprint arXiv:1801.08599 (2018)
[7] Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: Weakly supervised instance and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
[8] Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems. pp. 109–117 (2011)
[9] Li, K., Wu, X., Chen, D.Z., Sonka, M.: Optimal surface segmentation in volumetric images - a graph-theoretic approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(1), 119–134 (2006)
[10] Lin, D., Dai, J., Jia, J., He, K., Sun, J.: ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3159–3167 (2016)
[11] Liu, X., Chen, D.Z., Tawhai, M.H., Wu, X., Hoffman, E.A., Sonka, M.: Optimal graph search based segmentation of airway tree double surfaces across bifurcations. IEEE Transactions on Medical Imaging 32(3), 493–510 (2013)
[12] Papadopoulos, D.P., Uijlings, J.R.R., Keller, F., Ferrari, V.: Extreme clicking for efficient object annotation. In: IEEE International Conference on Computer Vision. pp. 4940–4949 (2017)
[13] Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: IEEE International Conference on Computer Vision. pp. 1742–1750 (2015)
[14] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241 (2015)
[15] Sirinukunwattana, K., Pluim, J.P., Chen, H., Qi, X., Heng, P.A., Guo, Y.B., Wang, L.Y., Matuszewski, B.J., Bruni, E., Sanchez, U., Böhm, A., Ronneberger, O., Cheikh, B.B., Racoceanu, D., Kainz, P., Pfeiffer, M., Urschler, M., Snead, D.R.J., Rajpoot, N.M.: Gland segmentation in colon histology images: The GlaS challenge contest. Medical Image Analysis 35, 489–502 (2017)
[16] Yang, L., Zhang, Y., Chen, J., Zhang, S., Chen, D.Z.: Suggestive annotation: A deep active learning framework for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 399–407 (2017)
[17] Zhang, Y., Yang, L., Chen, J., Fredericksen, M., Hughes, D.P., Chen, D.Z.: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 408–416 (2017)
[18] Zhang, Y., Yang, L., MacKenzie, J.D., Ramachandran, R., Chen, D.Z.: A seeding-searching-ensemble method for gland segmentation in H&E-stained images. BMC Medical Informatics and Decision Making 16(2), 80 (2016)
[19] Zhang, Y., Ying, M.T.C., Yang, L., Ahuja, A.T., Chen, D.Z.: Coarse-to-fine stacked fully convolutional nets for lymph node segmentation in ultrasound images. In: IEEE International Conference on Bioinformatics and Biomedicine. pp. 443–448 (2016)