The segmentation of fetal brain tissues in MRI is essential for the study of abnormal fetal brain development. Segmentation of fetal brain structures could also support the evaluation and prediction of surgery outcome for open spina bifida [1, 4, 16, 21, 22]. Accurate and automatic methods for fetal brain segmentation are necessary because manual segmentation is very time-consuming and suffers from high inter- and intra-rater variability. Recently, deep neural network-based methods for fetal brain T2w MRI segmentation have been proposed [7, 8, 15, 18, 19]. On average, deep learning currently achieves state-of-the-art segmentation performance. However, those studies do not specifically evaluate generalization and robustness when the methods are applied to fetuses with a pathological central nervous system.
Datasets used to train deep neural networks typically contain some underrepresented subsets of cases. These cases are not specifically dealt with by the training algorithms currently used for deep neural networks. This problem has been referred to as hidden stratification [17]. Hidden stratification has been shown to lead to deep learning models with good average performance but poor performance on some clinically relevant subsets of the population [17]. While uncovering the issue, the study of [17], which is limited to classification, does not study the cause or propose a method to mitigate this problem. Cases with abnormal fetal brain development are likely to suffer from hidden stratification effects for two reasons: 1) the presence of abnormalities exacerbates the anatomical variability of the fetal brain between 18 weeks and 38 weeks of gestation, as illustrated in Fig. 1; and 2) the prevalence of those diseases is typically below 1/1000 births.
In this work, we study the problem of hidden stratification in fetal brain MRI segmentation using deep learning. We claim that the methodology currently used to train deep neural networks, that is maximizing the average performance across the training volumes, is at the root of the hidden stratification problem. Instead of the average empirical risk, training safe and robust deep learning models requires an asymmetric measure of risk that gives higher weights to the cases for which the algorithm fails (hard examples). The percentile, also known as value-at-risk, is such a measure of risk and has even been adopted in industry regulations [13]. Given a per-volume fetal brain MRI segmentation metric such as the Dice score [5] and an algorithm, the percentile at $p\%$ is the value of the score below which $p\%$ of the cases fall, i.e. the cases that perform worse than the percentile. The percentile relates to hidden stratification effects as it informs us of how badly worst-case examples are performing.
Our contributions are four-fold. 1) We empirically show that the state-of-the-art deep learning pipeline nnU-Net [14] trained by maximizing the average segmentation performance leads to clinically significant failures for fetal brain MRI segmentation. 2) We propose to use percentiles of the Dice score on clinically relevant subpopulations as a measure of hidden stratification effects. 3) We propose to train a deep learning network to minimize a percentile of the per-volume loss function. 4) We propose a relaxation of this optimization problem based on distributionally robust optimization that can be solved efficiently in practice. We evaluate the proposed methodology for the automatic segmentation of white matter, ventricles, and cerebellum based on fetal brain 3D T2w MRI. We used a total of 368 fetal brain 3D MRIs including anatomically normal fetuses, fetuses with open spina bifida, and fetuses with other central nervous system pathologies for gestational ages ranging from 19 weeks to 40 weeks. Our empirical results suggest that the proposed training method based on distributionally robust optimization leads to better percentile values for abnormal fetuses. In addition, qualitative results show that distributionally robust optimization reduces the number of clinically relevant failures of nnU-Net.
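As an illustration of the percentile as a measure of risk, the following minimal sketch computes the mean and a low percentile of per-volume Dice scores. The scores are made-up values chosen for illustration, not results from this work; `dice_percentile` is a hypothetical helper name, and the interpolation rule is NumPy's default, which is an assumption rather than something specified here.

```python
import numpy as np

def dice_percentile(scores, p):
    """Return the percentile at p% (p in [0, 100]) of per-volume Dice scores."""
    return float(np.percentile(np.asarray(scores, dtype=float), p))

# Hypothetical per-volume Dice scores (in %) for one subpopulation.
# One volume is a complete failure (Dice = 0).
scores = [92.0, 91.5, 90.0, 88.0, 0.0, 93.0, 89.5, 91.0, 90.5, 87.0]

mean_score = float(np.mean(scores))   # average performance
p5 = dice_percentile(scores, 5)       # tail (worst-case) performance

print(f"mean = {mean_score:.2f}, percentile at 5% = {p5:.2f}")
```

In this toy example the single failure lowers the mean only moderately, while the percentile at 5% drops sharply, which is the asymmetry the percentile is meant to capture.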
2 Minimization of a Percentile Loss using Distributionally Robust Optimization
In this section, we study how a deep neural network can be trained to minimize percentiles of the loss function using a distributionally robust optimization (DRO) approach.
Standard deep learning training consists in optimizing the parameters $\theta$ of a deep neural network $h$ by minimizing the average per-example loss
$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(h(x_i; \theta), y_i\right) \tag{1}$$
Within this empirical risk minimization framework, $h$ is typically a Convolutional Neural Network (CNN), $\mathcal{L}$ is a smooth per-volume loss function, and $\{(x_i, y_i)\}_{i=1}^{n}$ is the training dataset.
In our case, the $x_i$ are the input 3D fetal brain T2w MRI volumes and the $y_i$ are the ground-truth manual segmentations. This approach is the one used to train state-of-the-art deep learning methods for segmentation using stochastic gradient descent. Due to the scarcity and the higher anatomical variability of abnormal cases illustrated in Fig. 1, we cannot assume that the set of all possible fetal brain anatomies is sampled uniformly in the training dataset. However, in (1), all brain volumes are given the same weight equal to $\frac{1}{n}$.
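Instead of the average per-volume loss, for robust and safe segmentation, we argue that it might be more interesting to minimize the percentile $l_\alpha$ at $\alpha$ (e.g. 5%) of the per-volume loss function. Formally, this corresponds to the minimization problem
$$\min_{\theta,\, l_\alpha} \; l_\alpha \quad \text{such that} \quad P_{(x, y) \sim p_{\mathrm{train}}}\left(\mathcal{L}\left(h(x; \theta), y\right) \geq l_\alpha\right) \leq \alpha \tag{2}$$
where $p_{\mathrm{train}}$ is the empirical distribution defined by the training dataset. In other words, if $\alpha = 0.05$, the optimal $l^*_\alpha(\theta)$ of (2) for a given set of parameter values $\theta$ is the value of the loss such that the per-volume loss function is worse than $l^*_\alpha(\theta)$ 5% of the time. As a result, training the deep neural network using (2) corresponds to minimizing the percentile $l^*_\alpha(\theta)$ of the per-volume loss function.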
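To make the constraint of problem (2) concrete, the sketch below checks, for a fixed network, whether a candidate threshold is exceeded by at most a fraction $\alpha$ of the training volumes. The loss values, `tail_fraction`, and `is_feasible` are hypothetical and only serve the illustration.

```python
import numpy as np

def tail_fraction(losses, threshold):
    """Empirical probability that the per-volume loss is >= threshold."""
    losses = np.asarray(losses, dtype=float)
    return float(np.mean(losses >= threshold))

def is_feasible(losses, threshold, alpha):
    """Check the constraint of problem (2) for a candidate threshold."""
    return tail_fraction(losses, threshold) <= alpha

# Hypothetical per-volume losses: 96 easy volumes and 4 hard ones.
losses = np.concatenate([np.full(96, 0.1), np.full(4, 0.9)])
alpha = 0.05

print("average loss:", float(np.mean(losses)))
print("threshold 0.5 feasible:", is_feasible(losses, 0.5, alpha))
print("threshold 0.1 feasible:", is_feasible(losses, 0.1, alpha))
```

The average loss stays close to the loss of the easy volumes, whereas any feasible threshold is driven by how many hard volumes exceed it.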
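Unfortunately, the minimization problem (2) cannot be solved directly using stochastic gradient descent to train a deep neural network. We now propose a tractable upper bound for $l^*_\alpha(\theta)$ and show that it can be solved in practice using distributionally robust optimization.
The Chernoff bound [3] applied to the per-volume loss function and the empirical training data distribution states that for all $l_\alpha$ and $\beta > 0$
$$P_{(x, y) \sim p_{\mathrm{train}}}\left(\mathcal{L}\left(h(x; \theta), y\right) \geq l_\alpha\right) \leq \frac{\exp(-\beta l_\alpha)}{n} \sum_{i=1}^{n} \exp\left(\beta \mathcal{L}\left(h(x_i; \theta), y_i\right)\right) \tag{3}$$
To link this inequality to the minimization problem (2), we set $\hat{l}_\alpha(\theta)$ such that
$$\alpha = \frac{\exp\left(-\beta \hat{l}_\alpha(\theta)\right)}{n} \sum_{i=1}^{n} \exp\left(\beta \mathcal{L}\left(h(x_i; \theta), y_i\right)\right) \tag{4}$$
that is,
$$\hat{l}_\alpha(\theta) = \frac{1}{\beta} \log\left(\frac{1}{\alpha n} \sum_{i=1}^{n} \exp\left(\beta \mathcal{L}\left(h(x_i; \theta), y_i\right)\right)\right) \tag{5}$$
$\hat{l}_\alpha(\theta)$ is therefore an upper bound for $l^*_\alpha(\theta)$, independently of the value of $\theta$. We propose to relax the minimization problem (2) by
$$\min_{\theta} \; \frac{1}{\beta} \log\left(\sum_{i=1}^{n} \exp\left(\beta \mathcal{L}\left(h(x_i; \theta), y_i\right)\right)\right) \tag{6}$$
where $\beta > 0$ is a hyperparameter, and where the term $\frac{1}{\beta} \log\left(\frac{1}{\alpha n}\right)$ was dropped as being independent of $\theta$. While in (6), $\alpha$ does not appear in the optimization problem directly anymore, $\beta$ essentially acts as a substitute for $\alpha$. The higher the value of $\beta$, the higher the weights that the per-volume losses with a high value will have in (6).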
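The bound can be checked numerically. The sketch below assumes the threshold takes the form $\frac{1}{\beta}\log\left(\frac{1}{\alpha n}\sum_i \exp(\beta L_i)\right)$, where $L_i$ is the loss of volume $i$, and that the relaxed objective is the log-sum-exp of the scaled losses divided by $\beta$. The loss values and the choice $\beta = 50$ are hypothetical, and the function names are introduced only for this example.

```python
import numpy as np

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    values = np.asarray(values, dtype=float)
    m = np.max(values)
    return float(m + np.log(np.sum(np.exp(values - m))))

def chernoff_threshold(losses, alpha, beta):
    """Threshold obtained by setting the Chernoff bound equal to alpha."""
    losses = np.asarray(losses, dtype=float)
    return (logsumexp(beta * losses) - np.log(alpha * losses.size)) / beta

def relaxed_objective(losses, beta):
    """Relaxed objective: (1 / beta) * log(sum_i exp(beta * L_i))."""
    return logsumexp(beta * np.asarray(losses, dtype=float)) / beta

# Hypothetical per-volume losses, evenly spread in [0, 1].
losses = np.linspace(0.0, 1.0, 200)
alpha, beta = 0.05, 50.0

l_hat = chernoff_threshold(losses, alpha, beta)
tail = float(np.mean(losses >= l_hat))  # fraction of volumes at or above l_hat

print(f"l_hat = {l_hat:.4f}, tail fraction = {tail:.3f} (alpha = {alpha})")
```

By construction, the fraction of volumes whose loss reaches the threshold cannot exceed $\alpha$, and the relaxed objective differs from the threshold only by a constant that does not depend on the network parameters.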
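We give a proof in the supplementary material that (6) is equivalent to solving the distributionally robust optimization problem
$$\min_{\theta} \; \max_{q \in \Delta_n} \left( \sum_{i=1}^{n} q_i \, \mathcal{L}\left(h(x_i; \theta), y_i\right) - \frac{1}{\beta} D_{KL}\left(q \,\Big\|\, \frac{1}{n}\mathbf{1}\right) \right) \tag{7}$$
where $D_{KL}$ is the Kullback-Leibler divergence, $\Delta_n$ is the unit $n$-simplex, and $\beta > 0$ is a hyperparameter. $D_{KL}\left(q \,\|\, \frac{1}{n}\mathbf{1}\right)$ measures the dissimilarity between $q$ and the uniform probability vector $\frac{1}{n}\mathbf{1}$, which corresponds to assigning the same weight $\frac{1}{n}$ to each sample. Therefore, $\beta$ controls how much the samples with a relatively high loss value (hard examples) are weighted.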
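The equivalence can be sanity-checked numerically for a fixed network. The sketch below assumes that the inner maximization is attained at the softmax of the scaled losses, as stated in the supplementary proof, and compares the resulting value with the log-sum-exp expression. The four loss values and $\beta = 5$ are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def kl_to_uniform(q):
    """KL(q || uniform) = sum_i q_i * log(n * q_i), with 0 * log(0) taken as 0."""
    q = np.asarray(q, dtype=float)
    nonzero = q > 0
    return float(np.sum(q[nonzero] * np.log(q.size * q[nonzero])))

def dro_objective(q, losses, beta):
    """Inner objective of the DRO problem for fixed network parameters."""
    return float(np.dot(q, losses) - kl_to_uniform(q) / beta)

# Hypothetical per-volume losses: three easy volumes and one hard volume.
losses = np.array([0.10, 0.12, 0.15, 0.80])
beta = 5.0

q_star = softmax(beta * losses)  # assumed optimal weighting
closed_form = (np.log(np.sum(np.exp(beta * losses))) - np.log(losses.size)) / beta

print("weights:", np.round(q_star, 3))
print("DRO value:", dro_objective(q_star, losses, beta), "closed form:", closed_form)
```

The hard volume receives most of the weight, and uniform weights, which recover the average loss, give a smaller value of the inner objective.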
3 Anatomically Abnormal Fetal Brain T2w MRI Dataset
Table 1: Number of fetal brain 3D MRIs used for training and testing, by origin and central nervous system condition.

| Train/Test | Origin | Condition | Volumes | Gestational age (in weeks) |
|---|---|---|---|---|
| Training | Atlas [12] | Control | 18 | [21, 38] |
| Training | FeTA [18] | Control | 5 | [22, 28] |
| Training | UHL and MUV | Control | 116 | [20, 35] |
| Training | UHL and MUV | Spina Bifida | 28 | [22, 34] |
| Training | UHL and MUV | Other Abn | 10 | [23, 35] |
| Testing | FeTA [18] | Control | 28 | [20, 34] |
| Testing | FeTA [18] | Spina Bifida | 31 | [22, 31] |
| Testing | FeTA [18] | Other Abn | 16 | [20, 34] |
| Testing | UHL and MUV | Control | 26 | [26, 37] |
| Testing | UHL and MUV | Spina Bifida | 65 | [19, 33] |
| Testing | UHL and MUV | Other Abn | 25 | [21, 40] |

In this section, we give details about the fetal brain 3D MRI data, the labelling protocol, and the pre-processing used in our experiments.
3.0.1 Public Fetal Brain Datasets
We used the 18 control fetal brain 3D MRI volumes of the spatio-temporal fetal brain atlas (http://crl.med.harvard.edu/research/fetal_brain_atlas/) [12] for gestational ages ranging from 21 weeks to 38 weeks. We also used 80 volumes from the publicly available FeTA MICCAI challenge dataset (DOI: 10.7303/syn25649159) [18]. For the MIAL 3D MRIs, corrections of the segmentations were performed by authors MA, LF, and PD to reduce the variability against the published segmentation guidelines that were released with the FeTA dataset [18]. Those corrections were performed as part of our previous work and are publicly available (DOI: 10.5281/zenodo.5148611). Brain masks for the FeTA data were obtained via affine registration using two fetal brain atlases (DOI: 10.7303/syn25887675) [11, 12].
Image Acquisition and Preprocessing for the Private Dataset
All images in the private dataset were part of routine clinical care and were acquired at UHL and MUV due to congenital malformations seen on ultrasound.
In total, 93 cases with open spina bifida, 35 cases with other central nervous system pathologies, and 142 cases with other malformations but a normal brain, referred to as controls, were included. The gestational age at MRI ranged from 19 weeks to 40 weeks. We have started to make fetal brain T2w 3D MRIs publicly available (https://www.cir.meduniwien.ac.at/research/fetal/). For each study, at least three orthogonal T2-weighted HASTE series of the fetal brain were collected on a T scanner using an echo time of ms, a repetition time of ms, with no slice overlap nor gap, pixel size mm to mm, and slice thickness mm to mm. A radiologist attended all the acquisitions for quality control.
The reconstructed fetal brain 3D MRIs were obtained using NiftyMIC [6], a state-of-the-art super-resolution and reconstruction algorithm. The volumes were all reconstructed to a resolution of mm isotropic and registered to a fetal brain atlas. Our pre-processing improves the resolution, and removes motion between neighboring slices and motion artefacts present in the original 2D slices. We used volumetric brain masks to mask the tissues outside the fetal brain. Those brain masks were obtained using the automatic segmentation method described in [6, 20].
3.0.2 Labelling Protocol.
The labelling protocol used for white matter, ventricles and cerebellum is the same as in . The three tissue types were segmented for our private dataset by a trained obstetrician and medical students under the supervision of a paediatric radiologist specialized in fetal brain anatomy, who quality controlled and corrected all manual segmentations.
3.0.3 Separation of the Data into Training and Testing
A summary of the number of fetal brain 3D MRIs used at training and testing for each central nervous system condition can be found in Table 1. The training dataset contains a total of 177 cases with a majority of controls and only 38 abnormal cases, which is typical in clinical datasets. Five controls from the FeTA dataset were added to the training dataset because we found in preliminary experiments that nnU-Net [14] fails on most of the FeTA data at testing when it is trained using only data from UHL and MUV and the fetal brain atlas [12]. The testing dataset contains 191 volumes with a majority of abnormal cases, which is necessary to cover the anatomical variability of abnormal cases in our evaluation.
[Table 2: Dice score]
Common Deep Learning Pipeline.
We used nnU-Net [14], a generic deep learning pipeline for medical image segmentation that has been shown to outperform other deep learning pipelines on 23 public datasets without the need to tune the loss function or the deep neural network architecture. Specifically, we used nnU-Net version 2 in 3D-full-resolution mode, which is the recommended mode for isotropic 3D MRI data. nnU-Net automatically splits the training data into 5 training/validation folds, which are used to train 5 networks for each method. The predicted class probability maps of the 5 models are averaged at inference to improve robustness. We used NVIDIA Tesla V100 GPUs with 16GB of memory. Training each network took from 4 to 6 days.
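The averaging of predicted class probability maps at inference can be sketched as follows. This is a minimal NumPy illustration with made-up probabilities for two hypothetical models, two classes and three voxels; it is not the nnU-Net inference code, and `ensemble_segmentation` is a name introduced for this example.

```python
import numpy as np

def ensemble_segmentation(prob_maps):
    """Average class probability maps of shape (models, classes, *spatial)
    over the models, then take the most probable class for each voxel."""
    mean_probs = np.mean(np.asarray(prob_maps, dtype=float), axis=0)
    return np.argmax(mean_probs, axis=0)

# Hypothetical probability maps of two models: rows are classes, columns are voxels.
model_a = np.array([[0.9, 0.4, 0.2],
                    [0.1, 0.6, 0.8]])
model_b = np.array([[0.8, 0.7, 0.4],
                    [0.2, 0.3, 0.6]])

labels = ensemble_segmentation([model_a, model_b])
print(labels)
```

For the middle voxel, the two models disagree and the averaged probabilities decide the label, which is how ensembling can smooth out the errors of individual networks.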
4.0.1 Specificities of Each Method.
The baseline consists in using nnU-Net [14] without any modification. Our method, nnU-Net-DRO, also uses nnU-Net. The only difference is that we changed the sampling strategy to use the hardness weighted sampler for DRO [10]. We used the default hyper-parameter values for the hardness weighted sampler, i.e. with importance sampling and clipping values and as described in [10]. No other values were tested. Our implementation of the nnU-Net-DRO training procedure is publicly available at https://github.com/LucasFidon/HardnessWeightedSampler. It provides an implementation of the hardness weighted sampler described in [10].
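The idea behind a hardness weighted sampler can be sketched as follows. This is a simplified toy version that assumes training volumes are drawn with probability proportional to the exponential of $\beta$ times their most recently observed loss; it omits importance sampling and clipping, and it is not the API of the repository above. The class name, its methods, and all values are hypothetical.

```python
import numpy as np

class HardnessWeightedSamplerSketch:
    """Toy sampler: draws indices with probability softmax(beta * stale_loss)."""

    def __init__(self, num_samples, beta, init_loss=1.0, seed=0):
        self.beta = float(beta)
        # Last known loss of each training volume (all equal before training).
        self.stale_loss = np.full(num_samples, float(init_loss))
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        z = self.beta * self.stale_loss
        e = np.exp(z - np.max(z))  # subtract the max for numerical stability
        return e / np.sum(e)

    def draw(self, batch_size):
        p = self.probabilities()
        return self.rng.choice(p.size, size=batch_size, replace=True, p=p)

    def update(self, indices, new_losses):
        # Store the losses observed for the volumes that were just processed.
        self.stale_loss[np.asarray(indices)] = np.asarray(new_losses, dtype=float)

sampler = HardnessWeightedSamplerSketch(num_samples=4, beta=5.0)
sampler.update([0, 1, 2, 3], [0.1, 0.1, 0.1, 0.9])  # volume 3 is a hard example
p = sampler.probabilities()
batch = sampler.draw(batch_size=8)
print(np.round(p, 3), batch)
```

In a training loop, `draw` would select the next batch and `update` would be called with the per-volume losses computed for that batch, so that hard examples are revisited more often.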
We are particularly interested in measuring the statistical risk of the results as a way to evaluate the robustness of the different methods. To this end, in addition to the mean and standard deviation, we also report the percentiles of the Dice score at, , , and . In Table 2, we report those quantities for the Dice scores of the three tissue types white matter, ventricular system, and cerebellum.
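For reference, the per-volume Dice score and the summary statistics mentioned above can be computed as in the following sketch. The percentile levels passed to `summarize` are placeholders, the returned value for two empty masks is an assumed convention, and the masks and scores are toy data.

```python
import numpy as np

def dice_score(pred, target):
    """Dice score between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = int(pred.sum()) + int(target.sum())
    if denom == 0:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * int(np.logical_and(pred, target).sum()) / denom

def summarize(scores, percentile_levels):
    """Mean, standard deviation, and percentiles of per-volume scores."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": float(np.mean(scores)),
        "std": float(np.std(scores)),
        "percentiles": {p: float(np.percentile(scores, p)) for p in percentile_levels},
    }

partial = dice_score([1, 1, 0, 0], [1, 0, 0, 0])   # partial overlap
empty = dice_score([0, 0, 0, 0], [1, 1, 0, 0])     # empty prediction
summary = summarize([0.9, 0.8, 0.0, 0.95], percentile_levels=(5, 25, 50))
print(partial, empty, summary)
```

An empty prediction for a structure that is present yields a Dice score of zero, which is why such failures show up in the low percentiles more clearly than in the mean.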
For each method, nnU-Net is trained 5 times using different train/validation splits and different random initializations. The same 5 splits, computed randomly, are used for the two methods. The results in Table 2 are for the ensemble of the 5 3D U-Nets. Ensembling is known to increase the robustness of deep learning methods for segmentation. It also makes the evaluation less sensitive to the random initialization and to the stochastic optimization.
Evaluation of nnU-Net and nnU-Net-DRO.
Quantitative evaluation of nnU-Net and nnU-Net-DRO for the three different central nervous system conditions control, spina bifida, and other abnormalities can be found in Table 2.
For spina bifida and other brain abnormalities, the proposed nnU-Net-DRO achieves the same or higher mean Dice scores and lower standard deviations than nnU-Net [14] for the three tissue types. For controls, the mean Dice scores and standard deviation of nnU-Net-DRO and nnU-Net differ by less than percentage points (pp) for the three tissue types.
The comparison of the percentiles of the Dice score allows us to compare methods at the tail of the Dice score distribution, where segmentation methods reach their worst-case performance. For spina bifida, nnU-Net-DRO achieves higher values of percentiles than nnU-Net for the white matter (pp for ), for the ventricular system (pp for ), and for the cerebellum (pp for ). For other brain abnormalities, nnU-Net-DRO achieves higher values of percentiles than nnU-Net for the white matter (pp for ), for the ventricular system (pp for and pp for ), and for the cerebellum (pp for ). All the other percentile values differ by less than pp of Dice score between the two methods. This suggests that nnU-Net-DRO achieves better worst-case performance than nnU-Net for abnormal cases.
It is worth noting that the Dice scores decrease for the white matter and the cerebellum between controls and spina bifida and abnormal cases. This was expected due to the higher anatomical variability in pathological cases. However, the Dice scores for the ventricular system tend to be higher for abnormal cases than for controls. This can be attributed to the large proportion of pathological cases with enlarged ventricles, because Dice score values tend to be higher for larger regions of interest.
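The dependence of the Dice score on the size of the region of interest can be illustrated with a toy 1D example, in which the same one-voxel boundary error is applied to a small and to a large structure. The segment lengths, the shift, and the function name are hypothetical choices made for this illustration.

```python
import numpy as np

def shifted_segment_dice(length, shift=1, size=256):
    """Dice score between a 1D segment of the given length and the same
    segment shifted by `shift` voxels."""
    target = np.zeros(size, dtype=bool)
    target[50:50 + length] = True
    pred = np.zeros(size, dtype=bool)
    pred[50 + shift:50 + shift + length] = True
    overlap = int(np.logical_and(pred, target).sum())
    return 2.0 * overlap / (int(pred.sum()) + int(target.sum()))

small = shifted_segment_dice(length=10)    # small structure
large = shifted_segment_dice(length=100)   # large structure, same boundary error
print(small, large)
```

With an identical one-voxel error, the larger segment obtains the higher Dice score, since the score for a segment of length $m$ shifted by one voxel is $1 - 1/m$.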
As can be seen in the qualitative results of Table 2, there are cases for which nnU-Net predicts an empty cerebellum segmentation while nnU-Net-DRO achieves satisfactory cerebellum segmentation. There were no cases for which the converse was true. Robust segmentation of the cerebellum for spina bifida is particularly relevant for the evaluation of fetal brain surgery for open spina bifida [1, 4, 21]. Additional qualitative results in the supplementary material illustrate 5 other cases for which nnU-Net-DRO outperforms nnU-Net.
The high anatomical variability of the developing fetal brain across gestational ages and pathologies hampers the robustness of deep neural networks trained by maximizing the average per-volume performance. Specifically, it limits the generalization of deep neural networks to abnormal cases for which few cases are available during training. In this paper, we propose to mitigate this problem by training deep neural networks to minimize a percentile of the per-volume loss rather than the average. To make this feasible in practice, we propose to train deep neural networks with Distributionally Robust Optimization (DRO) and we show that the DRO objective is a relaxation of the per-volume loss percentile. We have validated the proposed training method on a multi-centric dataset of fetal brain T2w 3D MRIs with various diagnoses. nnU-Net trained with DRO achieved improved segmentation results for pathological cases as compared to the unmodified nnU-Net, while achieving similar segmentation performance for the neurotypical cases. Our results suggest that nnU-Net trained with DRO is more robust to anatomical variability than the original nnU-Net.
This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement TRABIT No 765148. This work was supported by core and project funding from the Wellcome [203148/Z/16/Z; 203145Z/16/Z; WT101957], and EPSRC [NS/A000049/1; NS/A000050/1; NS/A000027/1]. TV is supported by a Medtronic / RAEng Research Chair [RCSRF1819\7\34].
[1] Aertsen, M., Verduyckt, J., De Keyzer, F., Vercauteren, T., Van Calenbergh, F., De Catte, L., Dymarkowski, S., Demaerel, P., Deprest, J.: Reliability of MR imaging–based posterior fossa and brain stem measurements in open spinal dysraphism in the era of fetal surgery. American Journal of Neuroradiology 40(1), 191–198 (2019)
[2] Benkarim, O.M., Sanroma, G., Zimmer, V.A., Muñoz-Moreno, E., Hahner, N., Eixarch, E., Camara, O., Gonzalez Ballester, M.A., Piella, G.: Toward the automatic quantification of in utero brain development in 3D structural MRI: A review. Human Brain Mapping 38(5), 2772–2787 (2017)
[3] Chernoff, H., et al.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics 23(4), 493–507 (1952)
[4] Danzer, E., Joyeux, L., Flake, A.W., Deprest, J.: Fetal surgical intervention for myelomeningocele: lessons learned, outcomes, and future implications. Developmental Medicine & Child Neurology 62(4), 417–425 (2020)
[5] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
[6] Ebner, M., Wang, G., Li, W., Aertsen, M., Patel, P.A., Aughwane, R., Melbourne, A., Doel, T., Dymarkowski, S., De Coppi, P., et al.: An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain MRI. NeuroImage 206, 116324 (2020)
[7] Fetit, A.E., Alansary, A., Cordero-Grande, L., Cupitt, J., Davidson, A.B., Edwards, A.D., Hajnal, J.V., Hughes, E., Kamnitsas, K., Kyriakopoulou, V., et al.: A deep learning approach to segmentation of the developing cortex in fetal brain MRI with minimal manual labeling. In: Medical Imaging with Deep Learning. pp. 241–261. PMLR (2020)
[8] Fidon, L., Aertsen, M., Emam, D., Mufti, N., Guffens, F., Deprest, T., Demaerel, P., David, A.L., Melbourne, A., Ourselin, S., et al.: Label-set loss functions for partial supervision: Application to fetal brain 3D MRI parcellation. arXiv preprint arXiv:2107.03846 (2021)
[9] Fidon, L., Li, W., Garcia-Peraza-Herrera, L.C., Ekanayake, J., Kitchen, N., Ourselin, S., Vercauteren, T.: Generalised Wasserstein Dice score for imbalanced multi-class segmentation using holistic convolutional networks. In: International MICCAI Brainlesion Workshop. pp. 64–76. Springer (2017)
[10] Fidon, L., Ourselin, S., Vercauteren, T.: Distributionally robust deep learning using hardness weighted sampling. arXiv preprint arXiv:2001.02658 (2020)
[11] Fidon, L., Viola, E., Mufti, N., David, A., Melbourne, A., Demaerel, P., Ourselin, S., Vercauteren, T., Deprest, J., Aertsen, M.: A spatio-temporal atlas of the developing fetal brain with spina bifida aperta. Open Research Europe (2021)
[12] Gholipour, A., Rollins, C.K., Velasco-Annis, C., Ouaalam, A., Akhondi-Asl, A., Afacan, O., Ortinau, C.M., Clancy, S., Limperopoulos, C., Yang, E., et al.: A normative spatiotemporal MRI atlas of the fetal brain for automatic segmentation and analysis of early brain growth. Scientific Reports 7(1), 1–13 (2017)
[13] Holton, G.: Value at Risk: Theory and Practice. Academic Press (2003)
[14] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
[15] Khalili, N., Lessmann, N., Turk, E., Claessens, N., de Heus, R., Kolk, T., Viergever, M., Benders, M., Išgum, I.: Automatic brain tissue segmentation in fetal MRI using convolutional neural networks. Magnetic Resonance Imaging 64, 77–89 (2019)
[16] Mufti, N., Aertsen, M., Ebner, M., Fidon, L., Patel, P., Rahman, M.B.A., Brackenier, Y., Ekart, G., Fernandez, V., Vercauteren, T., et al.: Cortical spectral matching and shape and volume analysis of the fetal brain pre-and post-fetal surgery for spina bifida: a retrospective study. Neuroradiology pp. 1–14 (2021)
[17] Oakden-Rayner, L., Dunnmon, J., Carneiro, G., Ré, C.: Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In: Proceedings of the ACM Conference on Health, Inference, and Learning. pp. 151–159 (2020)
[18] Payette, K., de Dumast, P., Kebiri, H., Ezhov, I., Paetzold, J.C., Shit, S., Iqbal, A., Khan, R., Kottke, R., Grehten, P., et al.: An automatic multi-tissue human fetal brain segmentation benchmark using the fetal tissue annotation dataset. Scientific Data 8(1), 1–14 (2021)
[19] Payette, K., Moehrlen, U., Mazzone, L., Ochsenbein-Kölble, N., Tuura, R., Kottke, R., Meuli, M., Jakab, A.: Longitudinal analysis of fetal MRI in patients with prenatal spina bifida repair. In: Smart Ultrasound Imaging and Perinatal, Preterm and Paediatric Image Analysis, pp. 161–170. Springer (2019)
[20] Ranzini, M., Fidon, L., Ourselin, S., Modat, M., Vercauteren, T.: MONAIfbs: MONAI-based fetal brain MRI deep learning segmentation. arXiv preprint arXiv:2103.13314 (2021)
[21] Sacco, A., Ushakov, F., Thompson, D., Peebles, D., Pandya, P., De Coppi, P., Wimalasundera, R., Attilakos, G., David, A.L., Deprest, J.: Fetal surgery for open spina bifida. The Obstetrician & Gynaecologist 21(4), 271 (2019)
[22] Zarutskie, A., Guimaraes, C., Yepez, M., Torres, P., Shetty, A., Sangi-Haghpeykar, H., Lee, W., Espinoza, J., Shamshirsaz, A., Nassr, A., et al.: Prenatal brain imaging for predicting need for postnatal hydrocephalus treatment in fetuses that had neural tube defect repair in utero. Ultrasound in Obstetrics & Gynecology 53(3), 324–334 (2019)
1 Proof of the equivalence of equations (6) and (7)
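In the DRO optimization problem of equation (7), the optimal $q$ for any $\theta$ has the closed-form formula [10, see p.4 or Appendix 11.1]
$$q^*_i(\theta) = \frac{\exp\left(\beta \mathcal{L}_i(\theta)\right)}{\sum_{j=1}^{n} \exp\left(\beta \mathcal{L}_j(\theta)\right)}, \quad i = 1, \ldots, n,$$
where $\mathcal{L}_i(\theta)$ is shorthand for the per-volume loss $\mathcal{L}\left(h(x_i; \theta), y_i\right)$.
By injecting this in equation (7), we obtain
$$\min_{\theta} \; \sum_{i=1}^{n} q^*_i(\theta) \mathcal{L}_i(\theta) - \frac{1}{\beta} \sum_{i=1}^{n} q^*_i(\theta) \left( \beta \mathcal{L}_i(\theta) - \log\left(\sum_{j=1}^{n} \exp\left(\beta \mathcal{L}_j(\theta)\right)\right) + \log(n) \right)$$
Since the first two terms cancel each other and $\sum_{i=1}^{n} q^*_i(\theta) = 1$, we obtain
$$\min_{\theta} \; \frac{1}{\beta} \log\left(\sum_{j=1}^{n} \exp\left(\beta \mathcal{L}_j(\theta)\right)\right) - \frac{1}{\beta} \log(n)$$
which is equivalent to the optimization problem (6) because the term $-\frac{1}{\beta}\log(n)$ above, like the term dropped in (6), is independent of $\theta$.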