Recent years have seen substantial progress in automatic medical image segmentation. These advances can be primarily attributed to the emergence of convolutional neural networks (CNNs) [1, 2]. CNN-based segmentation is typically performed in a fully supervised setting, in which a network is trained on the segmentation maps available in a training set. In such settings, performance varies with the network architecture and hyperparameters, the optimization procedure during training, and the size and quality of the training set. A number of recent benchmarks have shown that most state-of-the-art CNN-based methods achieve comparable results when applied to the same medical image data set. For example, a recent challenge on cardiac cine MR images found that for left ventricle segmentation, there were no significant differences between the top 8 methods, even though all used different architectures, hyperparameters, and optimization schemes. All methods used the same training data, suggesting that the properties of the training data used to train a convolutional neural network may be among the most important factors in network performance.
For medical image segmentation, creating a labeled training set typically entails time-consuming and costly annotation by medical experts. As a consequence, the number of available labeled examples in medical image training sets is generally much lower than the number of labeled examples in training sets for natural images. This issue is exacerbated by the large variety of medical imaging modalities and sequences, which generally means a completely new data set is required for every medical segmentation problem. A possible solution is to use data that is produced and manually labeled as part of a clinical workflow. For example, in radiotherapy, manual segmentations of organs-at-risk (OARs) are routinely made for treatment planning. In this work, we investigate whether – instead of obtaining a data set of dedicated segmentations by a clinical expert – these readily available clinical segmentations could be used to train a CNN for automatic segmentation of OARs. One challenge to overcome is that this data often lacks delineations of structures deemed irrelevant for the clinical task. For example, organs that are far away from a tumor, and hence not at risk of irradiation, are often not segmented.
The use of partially segmented training volumes raises interesting methodological questions, as there is no unambiguous definition of “background” in such volumes. This problem has previously been addressed using conventional machine learning techniques. Recently, CNN training strategies have been adapted for training with missing annotations by considering segmentation to be a multi-label instead of a multi-class problem. While multi-class segmentation requires all voxels to exclusively belong to the background class or to one of the foreground classes, a multi-label segmentation model produces a result for each of the foreground classes independently of the others. This property can be exploited to train the network with images for which not all classes have a ground truth label.
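The distinction can be illustrated with a small numeric example (a hypothetical per-voxel sketch in NumPy; the scores are made up): a softmax couples all classes into one distribution that sums to 1, while independent sigmoids give each class its own probability.

```python
import numpy as np

# hypothetical raw network scores for three classes at a single voxel
scores = np.array([2.0, 1.0, 0.5])

# multi-class: softmax makes the classes mutually exclusive
softmax = np.exp(scores) / np.exp(scores).sum()   # sums to 1

# multi-label: one sigmoid per class, evaluated independently
sigmoid = 1.0 / (1.0 + np.exp(-scores))           # need not sum to 1
```

Because each sigmoid output is evaluated independently, a missing reference label for one class leaves the outputs for the other classes untouched.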
Here, we perform an empirical study in which we systematically investigate the number of reference delineations that is necessary to achieve adequate model performance for OAR segmentation, and whether similarly adequate results can be obtained using reference delineations with missing structures. By training and evaluating the performance of 96 CNNs trained on different subsets of our data set, we assess the feasibility of developing a successful OAR segmentation model using varying amounts of clinically available segmentations with varying levels of completeness.
With permission of the local medical ethics board, we included brain MRI studies of 52 patients undergoing radiotherapy treatment planning. All patients received a T1-weighted MR scan at the University Medical Center Utrecht (Utrecht, the Netherlands). Volumes were acquired using a Philips Ingenia 1.5T MR system with a voxel size of mm, 8° flip angle, 7 ms repetition time, and 3.1 ms echo time. Scans were reconstructed to a voxel size of mm. Patients were scanned in an immobilising mask, ensuring a similar orientation of the head for all patients.
This data set includes annotations of the brain stem, pituitary gland and optic chiasm, and the left and right optic nerves, eyes, cochleas, and lenses, acquired as part of RT treatment planning. OARs were typically segmented only if they were in clinically relevant proximity to the clinical target volume for the RT treatment. On average, 9.4 ± 1.8 out of 11 possible OARs were annotated in each patient. For 15 out of 52 patients, all OARs were available. Delineations were made on CT images and propagated to corresponding MR images. This regularly led to over- and under-segmentation when visualised in the MR image. Given that these are representative of clinical segmentations, we used these potentially suboptimal segmentations for training. However, as the results must be evaluated on a ground truth, such errors would interfere with the evaluation. Hence, a clinical expert corrected all manual segmentations in a subset of 20 volumes, which was used as a test set.
We perform a series of experiments to address two separate but related research questions. We investigate whether incomplete segmentations obtained from a clinical workflow are an appropriate substitute for a dedicated training set for training a segmentation network for various brain structures. Additionally, we investigate the impact of the number of segmented training volumes on the performance of such a network, and whether the required number of segmented volumes increases when only part of the target structures are segmented in each training image.
To address these questions, we perform 96 experiments in which we train CNNs with sampled subsets of the available training data. A subset of size n is defined as a data set in which each structure has been segmented n times. Subsets are randomly sampled in two distinct ways, as illustrated in Fig. 2. In the concentrated labels setting, we sample n volumes with full reference segmentations, which is equivalent to selecting the MR volumes of n patients. In the distributed labels setting, a subset includes all training volumes, but for each class, labels are only included in n randomly selected volumes. In this setting, a training subset contains more volumes of different patients, but labels are available for only part of the structures in each volume. Subsets in the distributed labels setting were pseudo-randomly sampled to spread the labels evenly over as many volumes as possible. In both settings, the trained networks see the same number of labeled structures and – assuming similarly sized structures among patients – a similar number of training voxels. This equates to an approximately equal amount of work required of a clinical expert to create the training sets, which allows a fair comparison of the results. As smaller subsets can be sampled in many ways, we repeat experiments multiple times with different subsets of the same size.
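The two sampling schemes could be sketched as follows (a hypothetical Python sketch: the function names and the greedy even-spreading heuristic are our illustration, not the paper's exact implementation). Both return a mapping from volume id to the set of retained class labels, and both retain the same total number of labeled structures.

```python
import random

def sample_concentrated(volume_ids, n, n_classes=11):
    """Pick n patients; all structures remain labeled in each picked volume."""
    chosen = random.sample(volume_ids, n)
    return {v: set(range(n_classes)) for v in chosen}

def sample_distributed(volume_ids, n, n_classes=11):
    """For each class, keep its label in n volumes, spreading labels
    as evenly as possible over all volumes (pseudo-random)."""
    subset = {v: set() for v in volume_ids}
    for c in range(n_classes):
        # prefer the volumes that currently carry the fewest labels
        ranked = sorted(volume_ids,
                        key=lambda v: (len(subset[v]), random.random()))
        for v in ranked[:n]:
            subset[v].add(c)
    return subset
```

Under this sketch, both settings yield n × 11 labeled structures, so the annotation effort is comparable; only their distribution over patients differs.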
In all experiments, we use the same 3D fully convolutional network with residual connections, adapted from an existing 2D network. This network was recently shown to exhibit competitive performance in a challenge on OAR segmentation in thoracic CT images. Its architecture is shown in Fig. 3. The network contains two strided convolutional downsampling layers, followed by 16 residual blocks and two transposed convolutional upsampling layers. The residual blocks are implemented using the updated residual configuration proposed in He et al.
Instead of a softmax activation function, which is typically used in multi-class segmentation problems, the output layer contains one sigmoid activation function per class (as shown in Fig. 1). By using a sigmoid instead of a softmax output activation function, any K-class segmentation problem can be modelled as a combination of K binary label segmentation problems. A class loss can be calculated for each binary label separately; the total loss is calculated as the sum over all class losses, weighted by an optional weighting factor w_c. If during training class c is not present in the reference segmentation of the current training volume, w_c is set to 0. This amounts to ignoring the loss component corresponding to this class for the current training volume. In this work, w_c is otherwise set to 1 for all classes.
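The masking of missing labels in the loss can be sketched as follows (a NumPy sketch with hypothetical names; the network itself is trained with a per-class binary loss, but the weighting idea is the same):

```python
import numpy as np

def masked_multilabel_bce(logits, targets, label_present):
    """Sum of per-class binary cross-entropies, with the weight w_c
    set to 0 for classes whose reference label is missing.

    logits:        (n_classes, *spatial) raw network outputs
    targets:       (n_classes, *spatial) binary reference labels
    label_present: (n_classes,) bool, True if class c is annotated
    """
    probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid per class
    eps = 1e-7
    bce = -(targets * np.log(probs + eps)
            + (1 - targets) * np.log(1 - probs + eps))
    per_class = bce.reshape(bce.shape[0], -1).mean(axis=1)
    w = label_present.astype(float)                  # w_c = 0 drops missing classes
    return float((w * per_class).sum())
```

Setting w_c to zero means no gradient flows through the sigmoid output of an unannotated class, so the network is simply not supervised on that class for that volume.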
4 Experiments and Results
From the full data set, 30 volumes were used for training, 2 volumes were used for validation, and the remaining 20 volumes were used for testing. We trained 96 networks on the OAR set: 48 on concentrated subsets and 48 on distributed subsets, as described in Sec. 3. In the training set, 15 volumes included reference delineations for all 11 target OARs and could be used to sample concentrated subsets. No data augmentation was used in any of the experiments. All networks were trained with the Adam optimizer (learning rate: 0.001) for 15,000 iterations with batches of four cubic patches of 64³ voxels per iteration. Training was done on a shared computing cluster containing various consumer-grade NVIDIA GPUs; training times ranged between 4 and 12 hours per network, depending on the load on the cluster.
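The patch-based batch construction described above might look like the following (a hypothetical sketch; any class-balancing or foreground-biased sampling the actual implementation may use is omitted here):

```python
import numpy as np

def sample_patch(volume, labels, patch=64, rng=np.random):
    """Randomly crop a cubic patch of `patch`^3 voxels from a 3D volume
    and the matching region from its (n_classes, z, y, x) label map."""
    z, y, x = [rng.randint(0, s - patch + 1) for s in volume.shape]
    sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
    return volume[sl], labels[(slice(None),) + sl]
```

A training batch would then consist of four such patch/label pairs drawn from the sampled training subset.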
Fig. 4 shows the average Dice similarity coefficients attained by networks trained on concentrated (i.e. fully segmented) training sets. The results show that, as may be expected, the performance increases when more training volumes are used. The improvement of the average performance is most pronounced up until five reference segmentations per structure. The results for most classes still slightly improve when increasing the training set size further, albeit with sharply diminishing returns.
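For reference, the Dice similarity coefficient between a predicted mask P and a reference mask R is DSC = 2|P ∩ R| / (|P| + |R|), which can be computed as:

```python
import numpy as np

def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, ref).sum() / denom
```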
We assessed whether the same number of reference segmentations can also be provided to the CNN in a distributed fashion, i.e. using volumes in which only a subset of structures has been segmented. We compare the results for the concentrated and distributed settings for the right eye and the brain stem in Fig. 5. These figures show that for the larger training sets, the performance of networks trained in both settings is similar, although networks trained in the distributed setting produce a smaller number of worst-case outliers.
Interestingly, the results show that the networks trained on small sets of distributed data perform substantially worse on the eye, whereas they are mostly comparable with the networks trained on concentrated data for the brain stem. This discrepancy could be explained by the presence of visual ambiguity in the classes that have contralateral equivalents. Because of our pseudo-random sampling in the distributed experiment setting, it was highly unlikely that reference segmentations for both versions of a symmetric OAR would be included for the same patient. As a result, these networks perform worse at distinguishing symmetric OARs from their contralateral equivalents. Intuitively, a right eye is difficult to distinguish from a left eye based on local geometry: the networks may fail to learn the distinction unless labels for a matching set of eyes are available.
In this study, we aimed to answer two research questions. First, we evaluated the number of segmented volumes required to train an adequate OAR segmentation network using clinically obtained data. We observed a saturation point between five and seven labeled examples in the training sets (depending on the class) after which an increased data set size showed sharply diminishing returns on the network performance. This challenges the common assumptions concerning the large amounts of data required for training a deep neural network. Although these results are promising, they were acquired using a relatively small test set in a single segmentation task; further research is needed to investigate the extent to which our findings apply to different segmentation tasks.
Second, we evaluated whether a similarly performing network could be trained using a data set in which only some classes were segmented in each volume, as is generally the case with clinically performed segmentations. In this setting, we observed a large discrepancy in performance on the symmetric OARs with contralateral equivalents in networks trained on small training sets. Our results imply that the networks have trouble learning to discriminate between visually similar structures unless both are segmented in the same training volume.
It should be noted that the problem illustrated above is only present when training on subsets where the majority of segmented volumes only contain one version of the symmetric OARs. In the full data set, segmentations of contralaterally equivalent organs are usually both present if a segmentation for at least one of them is available. This discrepancy could be considered an artifact of our sampling method. However, the presence of unsegmented visually ambiguous structures is not unthinkable in other clinically obtained data sets. For example, similar problems could emerge with partially segmented vertebral columns, or abdominal images where only part of the large intestine is delineated. Future work could investigate the merit of cropping such unsegmented visual ambiguities out of the training images before training the network.
In this work, we have addressed the common assumption that segmentation CNNs require large amounts of training data and we have investigated whether routinely acquired clinical segmentations can be used to train an OAR segmentation model instead of a dedicated training set. We found that training such networks on a small number of incomplete clinical segmentations is feasible, as long as there are no clear ambiguities between classes. We have shown that this limitation can be overcome by increasing the size of the training set.
7 New or breakthrough work to be presented
We have shown that it is possible to train an accurate OAR segmentation network with a small training set of clinically acquired delineations, without any data augmentation. Our results show that as long as there is little ambiguity in the class definitions, it is possible to train such a network even if part of the target class delineations is missing in each of the training volumes.
-  Ciresan, D., Giusti, A., Gambardella, L. M., and Schmidhuber, J., “Deep neural networks segment neuronal membranes in electron microscopy images,” in [Advances in neural information processing systems ], 2843–2851 (2012).
-  Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., Van Der Laak, J. A., Van Ginneken, B., and Sánchez, C. I., “A survey on deep learning in medical image analysis,” Med Image Anal 42, 60–88 (2017).
-  Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.-A., Cetin, I., Lekadir, K., Camara, O., Ballester, M. A. G., et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?,” IEEE Trans Med Imaging 37(11), 2514–2525 (2018).
-  Moeskops, P., Viergever, M. A., Benders, M. J., and Išgum, I., “Evaluation of an automatic brain segmentation method developed for neonates on adult MR brain images,” in [SPIE, proceedings ], 9413 (2015).
-  Petit, O., Thome, N., Charnoz, A., Hostettler, A., and Soler, L., “Handling missing annotations for semantic segmentation with deep ConvNets,” in [DLMIA ], LNCS, 20–28, Springer International Publishing (2018).
-  He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [CVPR 2016, proceedings ], 770–778 (2016).
-  van Harten, L., Noothout, J. M., Verhoeff, J. J., Wolterink, J. M., and Išgum, I., “Automatic segmentation of organs at risk in thoracic CT scans by combining 2D and 3D convolutional neural networks,” in [SegTHOR challenge, ISBI 2019, proceedings ], (2019).
-  He, K., Zhang, X., Ren, S., and Sun, J., “Identity mappings in deep residual networks,” in [European conference on computer vision ], 630–645, Springer (2016).