
Few-shot brain segmentation from weakly labeled data with deep heteroscedastic multi-task networks

by Richard McKinley, et al.

In applications of supervised learning to medical image segmentation, the need for large amounts of labeled data typically goes unquestioned. In particular, in the case of brain anatomy segmentation, hundreds or thousands of weakly-labeled volumes are often used as training data. In this paper, we first observe that for many brain structures, a small number of training examples (n=9), weakly labeled using Freesurfer 6.0, plus simple data augmentation, suffice as training data to achieve high performance, achieving an overall mean Dice coefficient of 0.84 ± 0.12 compared to Freesurfer over 28 brain structures in T1-weighted images of ≈ 4000 9-10 year-olds from the Adolescent Brain Cognitive Development study. We then examine two varieties of heteroscedastic network as a method for improving classification results. An existing proposal by Kendall and Gal, which uses Monte-Carlo inference to learn to predict the variance of each prediction, yields an overall mean Dice of 0.85 ± 0.14 and showed statistically significant improvements over 25 brain structures. Meanwhile, a novel heteroscedastic network which directly learns the probability that an example has been mislabeled yielded an overall mean Dice of 0.87 ± 0.11 and showed statistically significant improvements over all but one of the brain structures considered. The loss function associated with this network can be interpreted as performing a form of learned label smoothing, where labels are only smoothed where they are judged to be uncertain.



1 Introduction

Manual segmentation of volumetric medical data, such as magnetic resonance imaging, is a laborious, time-consuming task, with very high inter-rater variability. This limitation means that high quality labeled data for the training and validation of machine-learning methods can only be found in small quantities, and that larger labeled datasets typically contain a large amount of label noise. As a result, it is of utmost importance for the field that robust methods be found to learn from small amounts of data, from noisy labeling of data, or in the worst case from small amounts of noisy data. For the majority of medical imaging tasks, there is no reasonable alternative to training from at least some manually labeled data. However, for the segmentation of normal-appearing brain structures there are a number of freely available non-learning-based tools which can approximate a manual segmentation. This, coupled with the availability of increasingly large research datasets of disease populations and healthy controls, has led to many researchers training models solely or partially on hundreds or thousands of scans together with “auxiliary labels”: automated segmentations derived from existing tools [12, 14, 9].

In each case the model performance of the trained method was attributed to the size of the training set. However, in other fields of computer vision, segmentation problems are tackled with dramatically smaller amounts of data. The CamVid street scene segmentation problem, for example, provides 367 (2D) segmented images for training, and tackles a much more heterogeneous problem than brain segmentation. Given the strong spatial priors inherent in the task, it would be surprising if brain image segmentation required more data than natural image segmentation. In this paper, we examine the feasibility of learning brain segmentation from a very small number of cases, and only from auxiliary labels. Our goals are i) to assess the feasibility of learning in such a small data domain, and ii) to investigate the benefit of estimating aleatoric uncertainty via heteroscedastic classification.

Heteroscedastic classification networks which predict the variance of their outputs were introduced in [6], where they were shown to improve street-scene segmentation: this increase in performance can be attributed to learned loss attenuation, in which gradients from examples with possibly erroneous labels are attenuated. The use of heteroscedastic classification networks in medical image segmentation has so far been limited, with authors focusing on uncertainty derived from dropout [9, 5, 4, 13] or test-time augmentation [17]. Predictive variance was explored, together with other measures of uncertainty, as a method of filtering MS lesion segmentations by Nair et al. [10]. A multi-task network using a homoscedastic (per task rather than per example) measure of task uncertainty was presented by Bentaieb et al. [1]. Predictive variance as formulated for regression was applied directly to classification in Bragman et al. [2], but the effect on segmentation performance was not assessed. In this paper, we focus on the benefit of two variants of heteroscedastic networks: predictive variance, and a new "label-flip" uncertainty measure. In this second method the network predicts, for each voxel and task, the probability that the model output will disagree with the ground truth: these probabilities are then used to perform learned label smoothing [15]. We train and test our method on data from the ABCD cohort of 9-10 year olds [16, 3]. Since we utilize only a small amount of data for training and validating our model, we can compare methods on an extremely large number of test cases (n ≈ 4000).

2 Label uncertainty and heteroscedastic classification

By an error in a segmentation, we mean a disagreement between two label sets, whether they are manually generated, auxiliary labels, or the output of a deep-learning model. We distinguish in this paper between two categories of ground-truth error. Systematic errors are those made consistently by a rater (either a human rater or an automated method) across most cases. Learning a correct segmentation from a ground truth containing systematic errors is therefore essentially impossible, as the classifier will learn to make the same errors as in the ground truth. Random errors essentially refer to label noise: with some probability, the label assigned will be incorrect. We refer to two different kinds of random errors: "predictable" random errors are those where a learner (human or machine) can learn to predict where an error might occur, and with what frequency, while "sporadic" random errors either occur completely at random, or are so rare that their occurrence and probability cannot be estimated. For examples of each of these error types see Figure 1.


Figure 1: Three examples of ground truth error: a) random, predictable errors in the Putamen (blue), Amygdala (orange) and Hippocampus (red) in a manually labeled case from the Multi-Atlas-labeling challenge, caused by labeling only in the coronal plane; b) random, sporadic error by Freesurfer: tissue is erroneously labeled as white-matter hypointensity (blue) rather than grey matter (red) in an unpredictable location; c) systematic under-segmentation of grey matter in the Putamen and Thalamus by FSL-FAST.

To make the distinction between predictable and sporadic errors concrete, we need a classifier which can learn where labeling errors are likely. The term “heteroscedastic regression” refers to regression models which do not assume constant variance of residuals, but rather predict both the mean and the variance of the predicted quantity [11]. This notion of uncertainty is distinct from Bayesian uncertainty, as approximated using Bayesian dropout techniques; the two were contrasted and presented in a combined form by Kendall and Gal [6]. For classification problems, the correct notion of heteroscedasticity is not immediately clear. We consider two different formulations of heteroscedastic classifier.

Predictive variance. In the predictive variance method of Kendall and Gal [6], as applied to binary classification, the logit outputs of the network (i.e. the outputs of the network before application of a sigmoid nonlinearity) are assumed to follow a Gaussian distribution. For each example the network outputs a probability, $p$, and a log variance, $\log \sigma^2$. Unlike for heteroscedastic regression, the loss function cannot be computed as an analytic function of $p$, $\sigma$ and $x$, the true label. Instead the loss is approximated by averaging a loss not involving $\sigma$ over $T$ Monte-Carlo samples, in each of which the logit is perturbed by a normally distributed noise term with mean zero and s.d. $\sigma$.
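The Monte-Carlo approximation described above can be sketched for a single voxel and a single task. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default of 10 samples, and the use of plain binary cross-entropy as the per-sample loss are all hypothetical choices made for the example.

```python
import math
import random


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - label * logit + math.log1p(math.exp(-abs(logit)))


def predictive_variance_loss(logit, log_var, label, n_samples=10, rng=None):
    """Average the base loss over Monte-Carlo samples of a noise-perturbed logit.

    n_samples and the choice of base loss are assumptions made for this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    sigma = math.exp(0.5 * log_var)  # s.d. recovered from the predicted log variance
    total = 0.0
    for _ in range(n_samples):
        noisy_logit = logit + sigma * rng.gauss(0.0, 1.0)  # zero-mean Gaussian noise
        total += bce_from_logit(noisy_logit, label)
    return total / n_samples
```

When the predicted variance tends to zero, the sampled logits coincide with the unperturbed logit and the value reduces to the ordinary base loss. In a real training setup the same computation would be written with differentiable tensor operations so that gradients reach both outputs.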


Heteroscedastic label-flip uncertainty. In this paper we make use of a new kind of heteroscedastic network, in which the uncertainty in the logit is directly modelled by the probability of disagreement with the ground truth, or "label-flip". For each example of a binary classification problem, the network outputs a probability $p$ denoting class membership, and an output $q$ predicting the probability that the ground truth and classifier disagree. If $x$ is the label of the voxel, according to the (weak) ground truth, and $L$ is a loss function (for example, binary cross-entropy or focal loss), the label-flip loss at that voxel is

$$L\bigl(p,\; x(1-q) + (1-x)\,q\bigr) + L(q, z),$$

where $z$ is the indicator function for disagreement between the classifier (thresholded at the $p = 0.5$ level) and the (weak) ground truth. Unlike for predictive variance, this loss can be formulated in closed form and is differentiable, and so can be used directly in backpropagation. Label-flip loss can be seen as a form of loss attenuation: the loss at voxels with low label noise is dominated by the first loss term, and the loss at voxels with substantial label noise is dominated by the second term. It can also be seen as learned label smoothing: hard labels are replaced by soft labels according to the uncertainty in the data [15].
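As a rough per-voxel illustration of the label-flip idea, the sketch below smooths the weak label by the predicted flip probability and adds a second term that trains the flip output against the disagreement indicator z. The symbols p (class probability), q (flip probability) and x (weak label), the 0.5 threshold, the particular smoothing x(1-q) + (1-x)q, and the use of binary cross-entropy as the base loss are assumptions of this example rather than a statement of the authors' code.

```python
import math

EPS = 1e-7  # clipping constant to keep the logarithms finite


def bce(prob, target):
    """Binary cross-entropy between a probability and a (possibly soft) target."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def label_flip_loss(p, q, x, base_loss=bce, threshold=0.5):
    """Label-flip loss for one voxel (illustrative sketch).

    p: predicted class-membership probability
    q: predicted probability that classifier and weak label disagree
    x: weak ground-truth label, 0 or 1
    """
    soft_label = x * (1.0 - q) + (1.0 - x) * q  # label smoothed by the flip probability
    predicted_positive = p >= threshold
    z = 1.0 if predicted_positive != (x >= 0.5) else 0.0  # disagreement indicator
    return base_loss(p, soft_label) + base_loss(q, z)
```

With q close to zero the soft label equals the hard label and the first term is the usual base loss; as q grows, the target moves away from the weak label, which is the sense in which the smoothing is learned. Any base loss taking a probability and a target, such as a focal loss, could be passed as `base_loss`.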

Sporadic errors in the training set may now be defined as examples where a) the trained model disagrees with the ground truth, and b) the uncertainty at this example is very low. We therefore propose a general scheme for filtering training examples: gradients from examples which disagree with the ground truth, but are classified with high certainty, are masked during training.
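The filtering rule can be sketched as a function that builds a binary mask over voxels. This is an illustrative simplification: `flip_threshold` is a hypothetical parameter standing in for whatever cut-off defines "very low" uncertainty, and the inputs are flat Python sequences rather than image volumes.

```python
def sporadic_error_mask(probs, flip_probs, labels, flip_threshold, threshold=0.5):
    """Return 1 for voxels kept in the loss and 0 for suspected sporadic errors.

    A voxel is masked when the thresholded prediction disagrees with the weak
    label AND the predicted flip probability is below flip_threshold.
    flip_threshold is a hypothetical parameter, not a value from the paper.
    """
    mask = []
    for p, q, x in zip(probs, flip_probs, labels):
        disagrees = (p >= threshold) != (x >= 0.5)
        confident = q < flip_threshold
        mask.append(0 if (disagrees and confident) else 1)
    return mask
```

Multiplying the per-voxel loss by this mask before averaging removes the gradient contribution of the masked voxels.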

3 Experimental Setup

3.0.1 Data and Labels

We tested the hypotheses that a) brain segmentation is feasible from small amounts of weakly labeled data, and b) heteroscedastic networks and sporadic error rejection can improve segmentation results in few-shot learning from weak labels, by training both plain and heteroscedastic networks on segmentations of T1-weighted imaging data from the early release of the ABCD dataset [16, 3]. These data were collected from 9 and 10 year olds scanned at 21 different sites, using scanners from three different vendors. Labels for the training, validation and test cases were generated using Freesurfer 6.0. The T1w volumes were skull-stripped using a freely-available tool, and the non-zero voxels in each volume were rescaled to have mean zero and unit standard deviation. We segment a subset of the labels segmented by Freesurfer: Cortical White Matter (L/R), Cortical Grey Matter (L/R), Lateral Ventricles (L/R), Cerebellum (L/R), Thalamus (L/R), Caudate (L/R), Putamen (L/R), Pallidum (L/R), Accumbens Area (L/R), Hippocampus (L/R), Amygdala (L/R), Ventral DC (L/R), 3rd ventricle, 4th ventricle, Brainstem and Corpus Callosum. By contrast with QuickNAT [14], we segment the Accumbens area: in addition, we do not distinguish between cerebellar grey and white matter, and we include the inferior lateral ventricles in the lateral ventricle labels. As an auxiliary task, we predict labels for the total left hemisphere, the total right hemisphere, and brain parenchyma (voxels belonging to any brain structure, plus WM hypointensities). In our experiments we selected three cases from each vendor for training (leading to a total of nine training examples) and two from each vendor for validation during training. This means that a substantial amount of data remains for testing our methods: we applied our trained classifiers to 4069 additional cases.
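The intensity rescaling step (zero mean and unit standard deviation over the non-zero voxels of a skull-stripped volume) can be sketched as follows. The function name and the use of a flat list of intensities are assumptions made for the example; a real pipeline would operate on a 3D array.

```python
import math


def normalize_nonzero(voxels):
    """Rescale non-zero voxels to zero mean and unit s.d.; background stays at zero."""
    brain = [v for v in voxels if v != 0]
    if not brain:
        return [0.0 for _ in voxels]
    mean = sum(brain) / len(brain)
    variance = sum((v - mean) ** 2 for v in brain) / len(brain)
    std = math.sqrt(variance) or 1.0  # guard against a constant-intensity volume
    return [(v - mean) / std if v != 0 else 0.0 for v in voxels]
```

Restricting the statistics to non-zero voxels means the large zero background left by skull-stripping does not distort the mean and standard deviation.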

3.0.2 Model and training

Figure 2: The model architecture used for all three networks considered. The parameter is the number of logit outputs (31 for the plain model, 62 for the heteroscedastic models).

Our model architecture (shown in Figure 2) was implemented in PyTorch: it consists of an initial phase of 3D convolutions to reduce a non-isotropic 3D patch to 2D, followed by a shallow encoder/decoder network using densely connected dilated convolutions in the bottleneck. We use multi-task rather than multi-class classification: each brain region is treated as a separate binary classification problem. This enables us to use the simplified formulation of heteroscedastic networks as presented above, but is also appropriate in medical imaging, where partial volume effects mean that it is not inconsistent to assign more than one tissue label to a voxel. For some tissue classes considered, the class imbalance between tissue and background is as high as 2500 to one. To combat this problem, we use focal loss [7] with focusing parameter $\gamma$. The loss functions for our heteroscedastic networks use focal loss as a base loss function.

Inputs to the network (7×196×196 patches) were sampled randomly from either the axial, sagittal or coronal direction. We perform simple data augmentation: reflection about the (approximate) midline, rotation around a random principal axis through a random angle, and global shifting/rescaling of voxel intensities. The network was trained with RMSprop, using a batch size of 2 and a cosine annealing learning rate schedule with restarts [8], where the learning rate was varied between its upper and lower bounds every 2000 steps. We trained three networks: one for 520 restarts with no uncertainty loss; one for 20 restarts with no uncertainty loss and then for 500 restarts with predictive variance loss; and one for 20 restarts with no uncertainty loss, then for 100 restarts with label-flip loss, followed by 400 restarts with label-flip loss and sporadic error rejection. For sporadic error rejection, every 20 restarts the classifier was run over the training set, and voxels which disagreed with the ground truth in all three of the sagittal, coronal and axial views with low flip probability were masked from training for the next 20 epochs. Final segmentations were derived by ensembling axial, sagittal and coronal views by averaging logits, and were compared using the Wilcoxon signed rank test.
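The training schedule can be illustrated with a small sketch of cosine annealing with restarts, together with the arithmetic linking restarts to the number of samples seen. The learning-rate bounds `lr_max` and `lr_min` are hypothetical arguments of the example and are not taken from the paper; a fixed cycle length of 2000 steps and a batch size of 2 follow the description above.

```python
import math


def cosine_restart_lr(step, lr_max, lr_min, cycle_length=2000):
    """Learning rate at a given step under cosine annealing with restarts.

    lr_max and lr_min are hypothetical bounds supplied by the caller.
    """
    position = (step % cycle_length) / cycle_length  # fraction of the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * position))


def samples_seen(restarts, cycle_length=2000, batch_size=2):
    """Number of training patches seen after a number of completed cycles."""
    return restarts * cycle_length * batch_size
```

Under these assumptions, 520 restarts correspond to 520 × 2000 × 2 = 2,080,000 patches, in line with the figure of approximately two million samples, and 40 restarts correspond to 160,000 patches.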

4 Results

The plain network already reached its overall maximum mean Dice coefficient (over all 28 tissue classes) after 160,000 samples were seen, after which overall performance declined and did not recover (see Figure 3 and Table 1): this model achieved an overall mean Dice of 0.84 ± 0.12 on the test set.



Table 1: Mean Dice for selected compartments, for each of the three trained models, versus Freesurfer (n = 4096). For lateral structures, results shown for left hemisphere only. Results for the plain model shown at an early and a late epoch.

By contrast, after training on approximately two million samples, neither heteroscedastic model showed signs of performance decline due to overfitting. The model using predictive variance yielded a mean Dice coefficient of 0.85 ± 0.14, and the model using label-flip estimation yielded a mean Dice coefficient of 0.87 ± 0.11. Differences between overall Dice coefficients were significant. On the level of individual structures, the plain network outperformed the predictive variance network on the 3rd ventricle and Accumbens areas, and outperformed the flip-probability network on the 3rd ventricle only. On all other areas, the heteroscedastic models had better performance than the plain model. All differences between models, on all compartments, were statistically significant.
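For reference, the Dice coefficient used to compare each model's output with the Freesurfer labels can be computed per structure as twice the overlap divided by the sum of the two mask sizes. The sketch below works on flat 0/1 sequences; returning 1.0 when both masks are empty is a convention chosen for this example, not something stated in the paper.

```python
def dice_coefficient(pred, ref):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for a, b in zip(pred, ref) if a and b)
    total = sum(1 for a in pred if a) + sum(1 for b in ref if b)
    # Both masks empty: treated as perfect agreement (a convention of this sketch).
    return 2.0 * intersection / total if total else 1.0
```

Because the denominator is the combined size of the two masks, small structures such as the 3rd ventricle or the accumbens area are penalised more heavily per misclassified voxel than large ones such as cortical white matter.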

5 Conclusion

Figure 3: Learning curves (Mean Dice coefficient) over all compartments, and for selected individual tissue classes, on the six validation cases.

Both heteroscedastic models outperformed the plain model over a range of brain segmentation tasks, with the model predicting label-flip probability achieving the best overall Dice coefficient. Moreover, training of the heteroscedastic models was substantially more stable: as can be seen in Figure 3, the performance of the plain model on difficult-to-segment structures such as the accumbens area and 3rd ventricle peaks early and then declines, before other regions such as the amygdala have reached optimal performance, while the label-flip model in particular is able to achieve good performance across all brain regions. We expect that the label-flip maps, as seen in Figure 4, will be helpful for judging image/segmentation quality: however, the focus here is on performance in the small data, weak label domain. This is, we believe, the first work in medical imaging to confirm the benefits of aleatoric uncertainty and its associated learned label smoothing for model performance. Our proposed new notion of heteroscedastic network outperforms both the plain and the predictive variance network in this setting. Further work is needed to verify whether this benefit is still significant when training on hundreds or thousands of weakly labeled cases: in the other direction, preliminary results suggest that, for carefully selected training cases, it may even be possible to one-shot train a brain-segmentation network from a single training example.

Figure 4: Example output of the label-flip heteroscedastic network, on the white-matter label in the left hemisphere: case taken from the ABCD cohort.


  • [1] Bentaieb, A., Hamarneh, G.: Uncertainty driven multi-loss fully convolutional networks for histopathology. In: Proc. LABEL 2017. pp. 155–163 (2017)
  • [2] Bragman, F.J.S., et al.: Quality control in radiotherapy-treatment planning using multitask learning and uncertainty estimation. In: Proc. MIDL 2018 (2018)
  • [3] Casey, B.J., et al.: The Adolescent Brain Cognitive Development (ABCD) study: Imaging acquisition across 21 sites. Developmental Cognitive Neuroscience 32 (2018)
  • [4] Jungo, A., et al.: On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation. In: Proc. MICCAI (2018)
  • [5] Jungo, A., Meier, R., Ermis, E., Herrmann, E., Reyes, M.: Uncertainty-driven sanity check: Application to postoperative brain tumor cavity segmentation. In: Proc. MIDL 2018 (2018)
  • [6] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Proc. NIPS (2017)
  • [7] Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision (ICCV) pp. 2999–3007 (2017)
  • [8] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: ICLR (2017)

  • [9] McClure, P., et al.: Knowing what you know in brain segmentation using deep neural networks (2018)
  • [10] Nair, T., et al.: Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. In: Proc. MICCAI (2018)
  • [11] Nix, D.A., Weigend, A.S.: Estimating the mean and variance of the target probability distribution. In: IEEE ICNN’94. vol. 1 (June 1994)

  • [12] Rajchl, M., et al.: Neuronet: Fast and robust reproduction of multiple brain image segmentation pipelines (2018)
  • [13] Roy, A.G., et al.: Inherent brain segmentation quality control from fully convnet Monte Carlo sampling. In: Proc. MICCAI. pp. 664–672 (2018)
  • [14] Roy, A.G., et al.: Quicknat: A fully convolutional network for quick and accurate segmentation of neuroanatomy. NeuroImage 186 (2019)
  • [15] Szegedy, C., et al.: Rethinking the Inception architecture for computer vision. CVPR (2016)
  • [16] Volkow, N.D., et al.: The conception of the ABCD study: From substance use to a broad NIH collaboration. Developmental Cognitive Neuroscience 32 (2018)
  • [17] Wang, G., et al.: Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338 (2019)