Segmentation annotation maps are crucial for the supervised training of deep learning based segmentation models. Besides the class label dimension, annotation maps have spatial dimensions that contain rich information about object size, shape, and between-object/between-class relations.
Previous work has proposed methods for modifying annotation maps to train better deep learning based segmentation models. The directional map [uhrig2016pixel] was proposed to generate an additional training loss based on the relative positions of pixels with respect to the centers of their corresponding objects. The deep watershed transform [bai2017deep] provided a similar approach that converts an annotation map into a watershed energy map to guide the training of a segmentation model. These efforts demonstrated that changing segmentation annotation maps to include additional information (e.g., relative position, stronger instance-level information) can help train better deep learning models for segmentation tasks.
In medical image segmentation, different classes of objects often have strong spatial correlations and constrained mutual locations. Due to these correlations, the representations and feature transforms learned for one object class can often indicate the existence of other (possibly nearby) object classes. A conventional way of using multi-class annotation maps is to treat a full annotation map as a whole and compare it with the model's output using a spatial cross-entropy loss during back-propagation [ronneberger2015u, chen2016deep, zheng2018new]. Due to the spatial correlations among different object classes, directly using the annotations of all object classes to train a deep network may prevent the network from fully exploiting its representation learning ability for every object class, especially for classes with small sizes and unclear or confusing appearance. Furthermore, there may be multiple distinct structures/clusters within one class of objects, and each sub-class structure may be better served by its own set of feature representations. In principle, we believe that modeling individual classes and sub-classes of structures or objects encourages deep learning models to learn richer and more comprehensive feature transforms and data representations for segmentation problems.
In this paper, we propose to systematically decompose the original annotation maps to encourage deep networks to learn richer and possibly more disentangled feature transforms and representations. Our new scheme consists of two main stages: decompose and integrate. Decompose: by annotation map decomposition, the original segmentation problem is decomposed into multiple segmentation sub-problems (e.g., see Fig. 1); these sub-problems are modeled by training multiple deep learning modules, each with its own set of feature transforms. Integrate: a procedure summarizes the solutions of the modules from the previous stage, and a final solution is then formed for the original segmentation problem. This decompose-and-integrate scheme allows us to explicitly enforce a deep learning model to learn representations for every object class. Moreover, it can also be applied to learn feature transforms and representations for meaningful (human-expert-defined) sub-class data clusters and structures (see Fig. 2).
In Section 2, we present different ways to decompose annotation maps for different scenarios, and develop a new $K'$-to-1 deep network model for implementing our new learning scheme. In Section 3, we evaluate our decompose-and-integrate learning scheme using multiple state-of-the-art fully convolutional networks (FCNs) on three medical image segmentation datasets, and examine the proposed annotation decomposition (AD) methods.
2 Decompose-and-Integrate Learning
Consider a $K$-class segmentation training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a raw image and $y_i$ is a segmentation annotation map containing all the annotations of the $K$ classes of interest. Supervised learning for $K$-class segmentation tasks aims to learn a function $F$ that transforms $x_i$ to $y_i$. Note that each $y_i$ can be denoted as $(y_i^{(1)}, \ldots, y_i^{(K)})$, where $y_i^{(j)}$ is an annotation map for object class $j$, for $j = 1, \ldots, K$.
For a segmentation problem with two foreground object classes $A$ and $B$, suppose modeling for class $A$ is more difficult than modeling for class $B$. This means that learning a robust latent representation $R_A$ for class $A$ takes more computational effort (e.g., more training iterations/gradient descent effort) than learning a robust latent representation $R_B$ for class $B$. Note that $R_A$ and $R_B$ are not necessarily disjoint. When $A$ and $B$ have moderate or high spatial correlations, using joint annotations of these two classes for training a deep learning model can lead to: (1) $R_B$ is quite likely to be modeled using the annotations of class $B$; (2) class $A$ would be modeled with help from $R_B$, and not mainly by using $R_A$; (3) $R_A$ is not fully explored during model training, due to the "help" of the annotations from class $B$. For a better representation and feature learning performance, such "help" is undesired. Besides multi-class segmentation scenarios, when an object class has distinct meaningful underlying sub-class structures/clusters, separately modeling each individual structure/cluster enforces a deep network to learn more meaningful and useful data representations and feature transforms for such a class.
2.1 Segmentation annotation map decomposition
2.1.1 Based on object classes.
For a $K$-class segmentation problem, we can decompose $y_i$ into $K$ binary annotation maps $y_i^{(1)}, \ldots, y_i^{(K)}$, where $y_i^{(j)}$ marks only the pixels of class $j$. Algorithm 1 gives the exact procedure, and Fig. 1 illustrates the effect of this annotation decomposition (AD). In medical image segmentation problems, the number of object classes $K$ is usually small, much smaller than in natural scene images. A general guideline is that the decomposed segmentation maps and their associated extra computational costs should be kept at a manageable level. Table 1 shows that object-class based AD can effectively improve performance for segmentation problems with multiple foreground classes.
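To make the object-class based AD concrete, the decomposition can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's exact Algorithm 1; the function name `decompose_by_class` and the label convention (0 for background, 1..K for foreground classes) are our own assumptions.

```python
import numpy as np

def decompose_by_class(annotation, num_classes):
    # Split a K-class annotation map into K binary maps,
    # one per foreground class (labels 1..num_classes; 0 = background).
    return [(annotation == c).astype(np.uint8) for c in range(1, num_classes + 1)]

# A tiny 2-class annotation map.
y = np.array([[0, 1, 1],
              [0, 2, 2],
              [0, 0, 2]])
maps = decompose_by_class(y, 2)
# maps[0] marks class-1 pixels, maps[1] marks class-2 pixels;
# their pixel-wise union recovers the full foreground of y.
```

Each binary map then serves as the training target of one segmentation module.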
2.1.2 Based on object shapes.
Annotation maps can also be decomposed based on the different shape structures they contain. This type of decomposition can be applied to 2-class segmentation or even $K$-class segmentation for $K > 2$.
Shape information contains valuable cues for segmentation tasks. Decomposing annotation maps based on different object shapes can encourage a deep learning model to learn feature transforms that encode the raw images into different shape-guided representations. In histology image analysis, morphological features such as shape convexity play an important role in object detection, segmentation, and diagnosis. Thus, we propose to decompose segmentation annotation maps based on the shape convexity of the objects they contain. Specifically, two sub-segmentation maps are generated from an original annotation map: one containing convex-like objects and the other containing concave-like objects. This decomposition provides additional information that directly helps a learning model perceive object information at a higher (object shape) level. The detailed procedure and a visual illustration are given in Algorithm 2 and Fig. 2. In practice, we set the convexity threshold to 0.9. Table 2 demonstrates the usefulness of shape based AD when segmentation problems contain objects with several shape types.
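As a rough sketch of the idea (not the paper's exact Algorithm 2), convexity can be approximated per connected object as the ratio of the object's pixel count to the area of the convex hull of its pixel centers, and objects can then be routed into convex-like and concave-like maps by a threshold. The function name and the hull-based approximation below are our own assumptions.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import ConvexHull

def decompose_by_convexity(binary_map, tau=0.9):
    # Split the objects of a binary annotation map into a convex-like
    # map and a concave-like map, using area / convex-hull-area as a
    # convexity proxy. Note: the hull is computed on pixel centers, so
    # the ratio can exceed 1 for solid objects; that is fine for
    # thresholding purposes.
    labeled, n = ndimage.label(binary_map)
    convex_map = np.zeros_like(binary_map)
    concave_map = np.zeros_like(binary_map)
    for obj_id in range(1, n + 1):
        mask = labeled == obj_id
        pts = np.argwhere(mask)
        if len(pts) < 4:  # too small for a meaningful hull; call it convex
            convex_map[mask] = 1
            continue
        hull_area = ConvexHull(pts).volume  # in 2-D, .volume is the area
        convexity = mask.sum() / max(hull_area, 1e-8)
        if convexity >= tau:
            convex_map[mask] = 1
        else:
            concave_map[mask] = 1
    return convex_map, concave_map
```

A filled square lands in the convex-like map, while a thin U-shaped object (whose hull covers much more area than the object itself) lands in the concave-like map.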
2.1.3 Based on image-level information.
Image-level information, statistics, and cues can also be utilized for annotation map decomposition. For example, if images contain one or multiple foreground objects, we can decompose the segmentation maps based on the number of objects appearing in an image. Since the number of objects can only be determined at a global level or in the deeper layers of a deep learning model, this decomposition pushes a learning model to be more aware of global and higher-level information when generating segmentation results. An exact annotation decomposition procedure based on the image-level number of objects is given in Algorithm 3. In Table 3, we show the effectiveness of image-level information based AD for lymph node segmentation in ultrasound images.
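The image-level AD can be illustrated by routing each annotation map according to its connected-component count. This is a sketch only; Algorithm 3's exact grouping may differ, and the name `decompose_by_object_count` is our own.

```python
import numpy as np
from scipy import ndimage

def decompose_by_object_count(annotation, threshold=1):
    # Route an annotation map to one of two sub-problems by the number
    # of connected foreground objects it contains: maps with at most
    # `threshold` objects go to the first branch, the rest to the second.
    _, n = ndimage.label(annotation > 0)
    zero = np.zeros_like(annotation)
    if n <= threshold:
        return annotation, zero
    return zero, annotation
```

Each branch then trains its own segmentation module, so the model must implicitly judge a global property (object count) of the input.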
2.2 The $K'$-to-1 deep network for decompose-and-integrate learning
Suppose every original annotation map $y_i$, $i = 1, \ldots, N$, is decomposed into $K'$ annotation maps $y_i^{(1)}, \ldots, y_i^{(K')}$, with $K' \geq 2$. We aim to model each sub-segmentation problem using a deep learning segmentation module with its own set of parameters. Another modeling procedure is then applied on top of these modules to form the final solution of the original segmentation problem.
Thus, we propose a new $K'$-to-1 deep network framework for implementing our decompose-and-integrate learning scheme. Fig. 3 shows an overview of the $K'$-to-1 deep network. The modules (e.g., Seg-Module 1.1, Seg-Module 2) used in this network can be changed according to the type of images (e.g., 2D or 3D) of the specific segmentation problem. The full model can be trained in an end-to-end manner. Let the function of the overall $K'$-to-1 network be denoted as $F$, and the function of the $k$-th Seg-Module be denoted as $f_k$. The overall loss for the decompose-and-integrate learning scheme is defined as:
$$\mathcal{L} = \lambda \sum_{k=1}^{K'} \ell\big(f_k(x_i), y_i^{(k)}\big) + \ell\big(F(x_i), y_i\big),$$
where $\ell$ is the spatial cross-entropy loss, and $\lambda$ is set simply as a normalization term $\lambda = 1/K'$. We aim to minimize the above function with respect to the parameters of $F$ and $f_k$, for $k = 1, \ldots, K'$.
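A minimal NumPy sketch of this combined loss, assuming softmax probability maps of shape (C, H, W), integer label maps of shape (H, W), and the $1/K'$ normalization; the helper names are ours, and a real implementation would use an autodiff framework.

```python
import numpy as np

def spatial_ce(prob, target, eps=1e-8):
    # Mean pixel-wise cross-entropy.
    # prob: (C, H, W) softmax probabilities; target: (H, W) integer labels.
    h, w = target.shape
    picked = prob[target, np.arange(h)[:, None], np.arange(w)]
    return -np.mean(np.log(picked + eps))

def decompose_integrate_loss(module_probs, sub_targets, final_prob, target):
    # L = (1/K') * sum_k l(f_k(x), y^(k)) + l(F(x), y)
    k = len(module_probs)
    sub = sum(spatial_ce(p, t) for p, t in zip(module_probs, sub_targets))
    return sub / k + spatial_ce(final_prob, target)
```

The module losses supervise each sub-problem, while the final term supervises the integrated output against the original annotation map.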
3 Experiments and Results
We conduct experiments on three datasets. The 3D cardiovascular segmentation dataset [pace2015interactive] contains two classes of foreground objects (myocardium and great vessels), which have close spatial relations. Thus, we apply object-class based annotation decomposition (AD) to this dataset. The gland segmentation dataset [sirinukunwattana2017gland] contains glands that have quite different shapes (from concave shape to convex shape); hence shape convexity based annotation decomposition (AD) is applied to this dataset. Our in-house lymph node dataset contains the lymph node areas of 237 patients in ultrasound images (one image may contain one or more lymph nodes). Thus, image-level information based AD is applied to this dataset.
Table 1: Segmentation results on the HVSMR 3D cardiovascular dataset (Dice, ADB, and Hausdorff distance per anatomy).

|Method|Myocardium Dice|Myocardium ADB|Myocardium Hausdorff|Blood pool Dice|Blood pool ADB|Blood pool Hausdorff|Overall score|
|3D U-Net [cciccek20163d]| | | | | | | |
|Ensemble Meta-learner [zheng2018new]|0.823|0.685|3.224|0.935|0.763|5.804|0.215|
|Class-AD + $K'$-to-1 DenseVoxNet (ours)| | | | | | | |
|$K'$-to-1 DenseVoxNet w/o AD| | | | | | | |
Implementation details. The input window size of the deep learning segmentation models is set separately for the 3D and 2D experiments. During training, random cropping, rotation, and flipping are applied. Since the images in each dataset are larger than the model window size, there are effectively many more training samples than the number of images in each dataset. The Adam optimizer is used for model training, with a mini-batch size of 8. The maximum number of training iterations is set to 60000; we find that 60000 iterations with Adam are usually sufficient for an FCN-type model to converge on a moderately sized training set. The learning rate is set to 0.0005 initially and decreased to 0.00005 after 30000 iterations.
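The learning-rate schedule described above amounts to a single step decay, which can be written as (the function name is our own illustration):

```python
def learning_rate(iteration, base_lr=5e-4, drop_at=30000, factor=0.1):
    # Step schedule from the text: 0.0005 for the first 30000
    # iterations, then 0.00005 for the remainder of training.
    return base_lr if iteration < drop_at else base_lr * factor
```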
3D cardiovascular segmentation in MR images. The HVSMR dataset [pace2015interactive] targets the segmentation of myocardium and great vessels (blood pool) in 3D cardiovascular MR images. The ground truth of the test data is not publicly available; evaluations are done by submitting segmentation results to the organizers' server. We experiment with object-class based AD on this dataset. Table 1 shows that our AD combined with the $K'$-to-1 network (utilizing DenseVoxNets) achieves state-of-the-art performance. In the ablation study part of Table 1, we compare our full model with the $K'$-to-1 network without AD (i.e., every module is trained with the original, undecomposed annotation maps), a 2-stacked DenseVoxNet, and a large-size DenseVoxNet that uses a similar amount of parameters as the $K'$-to-1 DenseVoxNets. The ablation results confirm the effectiveness of our decompose-and-integrate learning scheme.
Gland segmentation in H&E stained images. This dataset [sirinukunwattana2017gland] contains 85 training images (37 benign (BN), 48 malignant (MT)), 60 test images (33 BN, 27 MT) in part A, and 20 test images (4 BN, 16 MT) in part B. We modify the original CUMedNet [chen2016deep] to make it deeper, with two more encoding and decoding blocks (denoted as CUMedNet). We run all experiments for the $K'$-to-1 network and the ablation study 5 times; Table 2 reports the mean performance and standard deviations. Compared with state-of-the-art models, our AD + $K'$-to-1 network (utilizing CUMedNets) yields considerably better segmentation results. In the ablation study (the bottom part of Table 2), we compare the AD + $K'$-to-1 network with the $K'$-to-1 network without AD, a 2-stacked CUMedNet, and a large-size CUMedNet.
Table 2: Gland segmentation results (mean ± standard deviation over 5 runs).

|Method|F1 (part A)|F1 (part B)|ObjectDice (part A)|ObjectDice (part B)|ObjectHausdorff (part A)|ObjectHausdorff (part B)|
|Shape-AD + $K'$-to-1 (ours)|0.923±0.002|0.861±0.004|0.910±0.004|0.846±0.001|40.79±1.72|101.42±1.49|
|$K'$-to-1 w/o AD|0.915±0.007|0.829±0.008|0.898±0.007|0.831±0.004|45.23±3.71|108.92±4.74|
Lymph node segmentation in ultrasound images. We collected lymph node ultrasound images from 237 patients, using 137 images for model training and 100 images for model testing. There is no patient identity overlap between the training data and the testing data. The AD procedure follows Algorithm 3. Table 3 demonstrates that the AD + $K'$-to-1 network can effectively improve lymph node segmentation performance in ultrasound images.
Table 3: Lymph node segmentation results in ultrasound images.

|Method|Metric 1|Metric 2|Metric 3|Metric 4|
|Image-level-AD + $K'$-to-1 (ours)|0.8102|0.9012|0.8893|0.8952|
|$K'$-to-1 w/o AD|0.7842|0.8798|0.8783|0.8790|
4 Conclusions
In this paper, we developed a new decompose-and-integrate learning scheme for medical image segmentation. Our new learning scheme is well motivated, conceptually sound, and quite flexible. Comprehensive experiments on multiple datasets show that it is effective in improving segmentation performance.