1 Introduction
Manual segmentation of volumetric medical data, such as magnetic resonance imaging, is a laborious, time-consuming task with very high interrater variability. As a consequence, high-quality labeled data for the training and validation of machine-learning methods is available only in small quantities, and larger labeled datasets typically contain a large amount of label noise. It is therefore of utmost importance for the field that robust methods be found to learn from small amounts of data, from noisy labeling of data, or, in the worst case, from small amounts of noisy data. For the majority of medical imaging tasks, there is no reasonable alternative to training from at least some manually labeled data. For the segmentation of normal-appearing brain structures, however, a number of freely available non-learning-based tools can approximate a manual segmentation. This, coupled with the availability of increasingly large research datasets of disease populations and healthy controls, has led many researchers to train models solely or partially on hundreds or thousands of scans together with “auxiliary labels”: automated segmentations derived from existing tools.
In each case [12, 14, 9], the performance of the trained method was attributed to the size of the training set. However, in other fields of computer vision, segmentation problems are tackled with dramatically smaller amounts of data. The CamVid street-scene segmentation benchmark, for example, provides only 367 segmented (2D) images for training, and poses a much more heterogeneous problem than brain segmentation. Given the strong spatial priors inherent in the task, it would be surprising if brain image segmentation required more data than natural image segmentation. In this paper, we examine the feasibility of learning brain segmentation from a very small number of cases, and only from auxiliary labels. Our goals are i) to assess the feasibility of learning in such a small-data domain, and ii) to investigate the benefit of estimating aleatoric uncertainty via heteroscedastic classification. Heteroscedastic classification networks, which predict the variance of their outputs, were introduced in
[6], where they were shown to improve street-scene segmentation: this increase in performance can be attributed to learned loss attenuation, in which gradients from examples with possibly erroneous labels are attenuated. The use of heteroscedastic classification networks in medical image segmentation has so far been limited, with authors focusing on uncertainty derived from dropout [9, 5, 4, 13] or test-time augmentation [17]. Predictive variance was explored, together with other measures of uncertainty, as a method of filtering MS lesion segmentations by Nair et al. [10]. A multi-task network using a homoscedastic (per-task rather than per-example) measure of task uncertainty was presented by Bentaieb et al. [1]. A direct application of predictive variance as formulated for regression was applied to classification by Bragman et al. [2], but the effect on segmentation performance was not assessed. In this paper, we focus on the benefit of two variants of heteroscedastic networks: predictive variance, and a new “label-flip” uncertainty measure. In the second method, the network predicts, for each voxel and task, the probability that the model output will disagree with the ground truth; these probabilities are then used to perform learned label smoothing [15]. We train and test our method on data from the ABCD cohort of 9–10 year olds [16, 3]. Since we use only a small amount of data for training and validating our model, an extremely large number of test cases (4069) remains for comparing methods.

2 Label uncertainty and heteroscedastic classification
By an error in a segmentation, we mean a disagreement between two label sets, whether they are manually generated, auxiliary labels, or the output of a deep-learning model. We distinguish in this paper between two categories of ground-truth error. Systematic errors are those made consistently by a rater (either a human rater or an automated method) across most cases. Learning a correct segmentation from a ground truth containing systematic errors is therefore essentially impossible, as the classifier will learn to make the same errors as the ground truth. Random errors essentially refer to label noise: with some probability, the label assigned will be incorrect. We distinguish two kinds of random error: “predictable” random errors are those where a learner (human or machine) can learn to predict where an error might occur, and with what frequency, while “sporadic” random errors either occur completely at random, or are so rare that their occurrence and probability cannot be estimated. For examples of each of these error types, see Figure 1.

To make the distinction between predictable and sporadic errors concrete, we need a classifier which can learn where labeling errors are likely. The term “heteroscedastic regression” refers to regression models which do not assume constant variance of the residuals, but rather predict both the mean and the variance of the predicted quantity [11]. This notion of uncertainty is distinct from Bayesian uncertainty, as approximated using Bayesian dropout techniques; the two were contrasted, and presented in a combined form, by Kendall and Gal [6]. For classification problems, the correct notion of heteroscedasticity is not immediately clear. We consider two different formulations of heteroscedastic classifier.
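As a concrete illustration of the regression case, a heteroscedastic model can be trained by minimising the Gaussian negative log-likelihood, in which the predicted variance both rescales and attenuates the squared residual. The sketch below is illustrative only; the function name and formulation are ours, not taken from [11]:

```python
import math

def heteroscedastic_nll(mu, log_var, y):
    """Gaussian negative log-likelihood for one example: the network
    predicts both the mean mu and the log variance log_var of the
    target y.  Predicting a large variance attenuates the squared
    residual, at the cost of the log-variance penalty term."""
    return 0.5 * (math.exp(-log_var) * (y - mu) ** 2 + log_var)
```

A confident (low-variance) but wrong prediction is penalised heavily, while raising the predicted variance damps the residual term; this is the loss-attenuation effect referred to above.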
Predictive variance In the predictive variance method of Kendall and Gal [6], as applied to binary classification, the logit output of the network (i.e. the output of the network before application of the sigmoid nonlinearity) is assumed to follow a Gaussian distribution. For each example the network outputs a logit $f$ and a log variance $\log\sigma^2$. Unlike for heteroscedastic regression, the loss function cannot be computed as an analytic function of $f$, $\sigma$ and $y$, the true label. Instead the loss is approximated by averaging a loss not involving $\sigma$ over $T$ Monte Carlo samples, in each of which the logit is perturbed by a normally distributed noise term with mean zero and s.d. $\sigma$.
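A minimal sketch of this Monte Carlo approximation (in NumPy rather than the paper's PyTorch; the function and variable names are our own): following Kendall and Gal, the sampled class probabilities are averaged before taking the log.

```python
import numpy as np

def predictive_variance_loss(logit, log_var, y, n_samples=100, seed=0):
    """Monte Carlo approximation of the heteroscedastic binary
    classification loss: perturb the logit with Gaussian noise whose
    s.d. is predicted by the network, average the resulting class
    probabilities over the samples, and take the negative log."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)                 # s.d. from the log variance
    z = logit + sigma * rng.standard_normal(n_samples)
    p = 1.0 / (1.0 + np.exp(-z))                  # sigmoid of each sample
    p_true = p if y == 1 else 1.0 - p             # probability of the label
    return float(-np.log(max(p_true.mean(), 1e-12)))
```

When the network is confidently wrong, increasing the predicted variance lowers the loss; this is the loss-attenuation mechanism described above.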
Heteroscedastic label-flip uncertainty In this paper we make use of a new kind of heteroscedastic network, in which the uncertainty in the logit is directly modelled by the probability of disagreement with the ground truth, or “label-flip”. For each example of a binary classification problem, the network outputs a probability $p$ denoting class membership, and an output $q$ predicting the probability that the ground truth and the classifier disagree. If $y$ is the label of the voxel according to the (weak) ground truth, and $L$ is a loss function (for example, binary cross-entropy or focal loss), the label-flip loss at that voxel is
$\ell(p, q, y) = (1-q)\,L(p, y) + q\,L(p, 1-y) + L(q, z)$ (1)
where $z$ is the indicator function for disagreement between the classifier (thresholded at the 0.5 level) and the (weak) ground truth. Unlike for predictive variance, this loss can be formulated in closed form and is differentiable, and so can be used directly in backpropagation. Label-flip loss can be seen as a form of loss attenuation: the loss at voxels with low label noise is dominated by the first loss term, and the loss at voxels with substantial label noise is dominated by the second term. It can also be seen as learned label smoothing: hard labels are replaced by soft labels according to the uncertainty in the data [15].
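A sketch of how such a label-flip loss might be implemented for a single voxel, taking binary cross-entropy as the base loss (the names and the exact weighting are our own illustration, not the paper's reference code): the first two terms amount to cross-entropy against the smoothed label $(1-q)y + q(1-y)$, and the final term trains the flip output to predict disagreement.

```python
import math

def bce(p, y):
    """Binary cross-entropy of predicted probability p against label y."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def label_flip_loss(p, q, y):
    """Illustrative label-flip loss for one voxel: p is the predicted
    class probability, q the predicted probability that the ground
    truth and the classifier disagree, y the (weak) ground-truth label."""
    z = 1.0 if (p >= 0.5) != (y >= 0.5) else 0.0  # disagreement indicator
    # Soft-label classification term plus supervision of the flip output.
    return (1 - q) * bce(p, y) + q * bce(p, 1 - y) + bce(q, z)
```

Note the attenuation behaviour: a confidently wrong prediction incurs a far smaller loss when the predicted flip probability is high than when it is low.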
Sporadic errors in the training set may now be defined as examples where a) the trained model disagrees with the ground truth, and b) the predicted uncertainty at this example is very low. We therefore propose a general scheme for filtering training examples: gradients from examples which disagree with the ground truth, but are classified with high certainty, are masked during training.
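This filtering scheme can be sketched as a boolean mask over voxels (NumPy; the threshold on the flip probability is our assumption, since the text above does not fix a value at this point):

```python
import numpy as np

def sporadic_error_mask(p, q, y, q_threshold=0.1):
    """Return True for voxels that should contribute gradients.
    Voxels where the classifier disagrees with the (weak) ground truth
    *and* the predicted flip probability q is low are treated as
    sporadic label errors and masked out of training."""
    disagree = (p >= 0.5) != (y >= 0.5)
    confident = q < q_threshold          # q_threshold is an assumed value
    return ~(disagree & confident)
```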
3 Experimental Setup
3.0.1 Data and Labels
We tested the hypotheses that a) brain segmentation is feasible from small amounts of weakly labeled data, and b) heteroscedastic networks and sporadic error rejection can improve segmentation results in few-shot learning from weak labels, by training both plain and heteroscedastic networks on segmentations of T1-weighted imaging data from the early release of the ABCD dataset [16, 3]. These data were collected from 9 and 10 year olds scanned at 21 different sites, using scanners from three different vendors. Labels for the training, validation and test cases were generated using FreeSurfer 6.0. The T1w volumes were skull-stripped using a freely available tool (https://github.com/placeholder/github/repo/for/anonymity), and the nonzero voxels in each volume were rescaled to have mean zero and unit standard deviation. We segment a subset of the labels produced by FreeSurfer: Cortical White Matter (L/R), Cortical Grey Matter (L/R), Lateral Ventricles (L/R), Cerebellum (L/R), Thalamus (L/R), Caudate (L/R), Putamen (L/R), Pallidum (L/R), Accumbens Area (L/R), Hippocampus (L/R), Amygdala (L/R), Ventral DC (L/R), 3rd Ventricle, 4th Ventricle, Brainstem and Corpus Callosum. By contrast with QuickNAT [14], we segment the Accumbens area; in addition, we do not distinguish between cerebellar grey and white matter, and we include the inferior lateral ventricles in the lateral ventricle labels. As an auxiliary task, we predict labels for the total left hemisphere, the total right hemisphere, and the brain parenchyma (voxels belonging to any brain structure, plus WM hypointensities). In our experiments we selected three cases from each vendor for training (a total of nine training examples) and two from each vendor for validation during training. This leaves a substantial amount of data for testing: we applied our trained classifiers to 4069 additional cases.

3.0.2 Model and training
Our model architecture (shown in Figure 2) was implemented in PyTorch: it consists of an initial phase of 3D convolutions to reduce a non-isotropic 3D patch to 2D, followed by a shallow encoder/decoder network using densely connected dilated convolutions in the bottleneck. We use multi-task rather than multi-class classification: each brain region is treated as a separate binary classification problem. This enables us to use the simplified formulation of heteroscedastic networks presented above, but is also appropriate in medical imaging, where partial volume effects mean that it is not inconsistent to assign more than one tissue label to a voxel. For some tissue classes considered, the class imbalance between tissue and background is as high as 2500 to one. To combat this problem, we use focal loss [7] with parameter $\gamma$. The loss functions for our heteroscedastic networks use focal loss as the base loss function. Inputs to the network (7×196×196 patches) were sampled randomly from the axial, sagittal or coronal direction. We perform simple data augmentation: reflection about the (approximate) midline, rotation around a random principal axis through a random angle, and global shifting/rescaling of voxel intensities. The network was trained with RMSprop, using a batch size of 2 and a cosine annealing learning rate schedule with restarts [8], with the learning rate annealed from its maximum to its minimum value every 2000 steps. We trained one network for 520 restarts with no uncertainty loss; one model for 20 restarts with no uncertainty loss and then for 500 restarts with predictive variance loss; and one network for 20 restarts with no uncertainty loss, then for 100 restarts with label-flip loss, followed by 400 restarts with label-flip loss and sporadic error rejection: every 20 restarts, the classifier was run over the training set, and voxels which disagreed with the ground truth in all three of the sagittal, coronal and axial views with low predicted flip probability were masked from training for the next 20 restarts. Final segmentations were derived by ensembling the axial, sagittal and coronal views by averaging logits, and were compared using the Wilcoxon signed-rank test.
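Focal loss, used above as the base loss, down-weights well-classified voxels so that the rare foreground class is not swamped by the background. A one-voxel sketch (the focusing parameter gamma = 2 is the common default from [7], not a value stated above):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for one voxel: cross-entropy scaled by
    (1 - p_t)^gamma, where p_t is the probability assigned to the
    true class.  Easy examples (p_t near 1) are strongly down-weighted;
    gamma = 2 is the default from Lin et al., assumed here."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With gamma = 0 this reduces to ordinary cross-entropy; increasing gamma shifts the loss mass toward hard, misclassified voxels.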
4 Results
The plain network reached its overall maximum mean Dice coefficient (over all 28 tissue classes) after 160,000 samples had been seen, after which overall performance declined and did not recover (see Figure 3 and Table 1): of the three models, it achieved the lowest overall mean Dice on the test set.
By contrast, after training on approximately two million samples, neither heteroscedastic model showed signs of performance decline due to overfitting, and both achieved a higher overall mean Dice coefficient than the plain model, with the label-flip model performing best. Differences between overall Dice coefficients were statistically significant. On the level of individual structures, the plain network outperformed the predictive variance network on the 3rd ventricle and Accumbens areas, and outperformed the flip-probability network on the 3rd ventricle only. On all other areas, the heteroscedastic models performed better than the plain model. All differences between models, on all compartments, were statistically significant.
5 Conclusion
Both heteroscedastic models outperformed the plain model over a range of brain segmentation tasks, with the model predicting label-flip probability achieving the best overall Dice coefficient. Moreover, training of the heteroscedastic models was substantially more stable: as can be seen in Figure 2, the performance of the plain model on difficult-to-segment structures such as the accumbens area and 3rd ventricle peaks early and then declines, before other regions such as the amygdala have reached optimal performance, while the label-flip model in particular is able to achieve good performance across all brain regions. We expect that the label-flip maps, as seen in Figure 4, will be helpful for judging image/segmentation quality; however, the focus here is on performance in the small-data, weak-label domain. This is, we believe, the first work in medical imaging to confirm the benefits of aleatoric uncertainty and its associated learned label smoothing for model performance. Our proposed new notion of heteroscedastic network outperforms both the plain and the predictive variance network in this setting. Further work is needed to verify whether this benefit remains significant when training on hundreds or thousands of weakly labeled cases; in the other direction, preliminary results suggest that, for carefully selected training cases, it may even be possible to one-shot train a brain-segmentation network from a single training example.
References
 [1] Bentaieb, A., Hamarneh, G.: Uncertainty driven multi-loss fully convolutional networks for histopathology. In: Proc. LABEL 2017. pp. 155–163 (2017)
 [2] Bragman, F.J.S., et al.: Quality control in radiotherapy treatment planning using multi-task learning and uncertainty estimation. In: Proc. MIDL 2018 (2018)
 [3] Casey, B.J., et al.: The Adolescent Brain Cognitive Development (ABCD) study: Imaging acquisition across 21 sites. Developmental Cognitive Neuroscience 32 (2018)
 [4] Jungo, A., et al.: On the effect of interobserver variability for a reliable estimation of uncertainty of medical image segmentation. In: Proc. MICCAI (2018)
 [5] Jungo, A., Meier, R., Ermis, E., Herrmann, E., Reyes, M.: Uncertainty-driven sanity check: Application to postoperative brain tumor cavity segmentation. In: Proc. MIDL (2018)
 [6] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Proc. NIPS (2017)
 [7] Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: Proc. ICCV. pp. 2999–3007 (2017)

 [8] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: Proc. ICLR (2017)
 [9] McClure, P., et al.: Knowing what you know in brain segmentation using deep neural networks (2018), http://arxiv.org/abs/1812.01719
 [10] Nair, T., et al.: Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. In: Proc. MICCAI (2018)

 [11] Nix, D.A., Weigend, A.S.: Estimating the mean and variance of the target probability distribution. In: IEEE ICNN’94. vol. 1 (June 1994)
 [12] Rajchl, M., et al.: NeuroNet: Fast and robust reproduction of multiple brain image segmentation pipelines (2018)
 [13] Roy, A.G., et al.: Inherent brain segmentation quality control from fully convnet Monte Carlo sampling. In: Proc. MICCAI. pp. 664–672 (2018)
 [14] Roy, A.G., et al.: QuickNAT: A fully convolutional network for quick and accurate segmentation of neuroanatomy. NeuroImage 186 (2019)
 [15] Szegedy, C., et al.: Rethinking the Inception architecture for computer vision. In: Proc. CVPR (2016)
 [16] Volkow, N.D., et al.: The conception of the ABCD study: From substance use to a broad NIH collaboration. Developmental Cognitive Neuroscience 32 (2018)

 [17] Wang, G., et al.: Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338 (2019)