The role of MRI physics in brain segmentation CNNs: achieving acquisition invariance and instructive uncertainties

by Pedro Borges, et al.

Being able to adequately process and combine data arising from different sites is crucial in neuroimaging, but is difficult owing to site-, sequence- and acquisition-parameter-dependent biases. It is therefore important to design algorithms that are not only robust to images of differing contrasts, but also able to generalise well to unseen ones, with a quantifiable measure of uncertainty. In this paper we demonstrate the efficacy of a physics-informed, uncertainty-aware segmentation network that employs augmentation-time MR simulations and homogeneous batch feature stratification to achieve acquisition invariance. We show that the proposed approach also accurately extrapolates to out-of-distribution sequence samples, providing well-calibrated volumetric bounds on these, and demonstrate a significant improvement in terms of coefficients of variation, backed by uncertainty-based volumetric validation.




1 Introduction

Magnetic Resonance Imaging (MRI) is one of the most widespread neuroimaging techniques owing to its excellent soft tissue contrast, boasting great versatility in highlighting different regions and pathologies by means of sequence selection. As a consequence, a significant body of work has emerged developing accurate processing algorithms for MR images that may arise from different sites and acquisition sequence parameters. Some of these works focus on algorithms that can generalise well to all contrasts. Traditional and widespread techniques include probabilistic generative models [1] and multi-atlas fusion methods [17]. However, the former makes strong assumptions about label intensity distributions, and the latter incurs lengthy processing times due to its dependence on image registration. Recent works using convolutional neural networks (CNNs), such as Billot et al. [2], tackle contrast agnosticism by employing a Bayesian generative segmentation model that synthesises images containing multiple different contrasts. Jog et al. [11] devise an approach by which networks can be made to generalise to unseen contrasts by predicting pulse sequence parameters from such images and simulating images of that contrast using labelled multiparametric map datasets. Pham et al. [15] employ an iterative approach involving a dual segmentation-synthesis model, whereby images of unseen contrasts are segmented and used to train a synthesis network that in turn generates new images of the unseen contrast from the labels in the original training set. It is important to note that, while these methods are able to segment data from unseen sites with some degree of accuracy, they do not explicitly model the interaction between acquisition parameters and the underlying anatomy: they segment what they see and not the true anatomy.

This leads to those methods that seek to harmonise measurements across sites by directly accounting for covariates such as scanner and site bias, and sequence contrast variability, e.g. ComBat [12], a Bayesian framework designed to account for experimental variability that has been applied to cortical thickness harmonisation [6]. These techniques, however, operate directly on extracted volumetric measurements and not on the images. Harmonisation has also been tackled with CycleGANs [19, 20] and domain adaptation approaches [4].

Recent work [3] proposed a means to introduce the physics of the MR acquisition process directly into deep learning networks, in combination with pre-generated synthetic MR images based on multi-parametric MR maps (MPMs). This work achieves some agnosticism to the underlying physics by demonstrating that the generated segmentations are more consistent volumetrically. The method, however, does not enforce volumetric consistency across contrasts, and has not been shown to extrapolate to out-of-distribution sequence parameters.

Changes in MRI acquisition parameters alter the tissue contrast, impacting the algorithmic ability to accurately segment images; this can be modelled via uncertainty estimation. Here, we propose to model both epistemic (uncertainty in the model) and aleatoric (uncertainty inherent to the data) uncertainties. Building on existing work [3], we also introduce a new training approach and a consistency loss across realisations of MRI contrasts, allowing the model to appropriately disentangle the anatomical phenotype from the MRI physics, and to extrapolate to unseen contrasts without sacrificing segmentation quality.

2 Methods

Borges et al. [3] proposed that a network could be made resilient to changes in the physics parameters, and therefore able to appropriately segment data produced by different sequences. This was achieved by generating simulated data and passing this imaging data, with the associated MRI parameters, to a CNN. In order to train against a "Physics Gold Standard" (PGS), i.e. a true model of the anatomy that is not influenced by the choice of acquisition parameters, the authors fitted a Gaussian Mixture Model of literature-sourced tissue parameters for grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF) on their quantitative MPMs. We build on this work to improve algorithmic robustness, ameliorating image quality and segmentation volume consistency, and validating within- and out-of-distribution samples paired with uncertainty-derived errors.

Figure 1: The training pipeline with proposed new additions of single subject batch stratification and accompanying feature maps loss, and training time image simulation.

2.1 Network architecture

In Borges et al. [3], the injection of the physics parameters into the network is done via the inclusion of two fully connected layers whose output is tiled and concatenated to the ante-penultimate convolutional layer output. We adopt a similar strategy, but take the added step of also tiling this output to an earlier region of the network, immediately preceding the first down-sampling layer. We argue that knowledge of the physics is potentially valuable information in the encoding portion of the network, and that this allows it to better disentangle the physics parameters and the subject’s phenotype.
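As a minimal sketch of this physics-injection mechanism (layer sizes, embedding dimension, and class names here are illustrative, not the paper's exact configuration), the acquisition parameters can be embedded with two fully connected layers, tiled spatially, and concatenated to a convolutional feature map:

```python
import torch
import torch.nn as nn

class PhysicsTile(nn.Module):
    """Embed acquisition parameters with two FC layers, then tile the
    embedding spatially and concatenate it to a convolutional feature map."""
    def __init__(self, n_params: int, embed_dim: int = 40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )

    def forward(self, features: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        # features: (B, C, D, H, W); params: (B, n_params), e.g. [TR, TE, FA]
        emb = self.mlp(params)                         # (B, embed_dim)
        emb = emb.view(*emb.shape, 1, 1, 1)            # (B, embed_dim, 1, 1, 1)
        emb = emb.expand(-1, -1, *features.shape[2:])  # tile over D, H, W
        return torch.cat([features, emb], dim=1)       # (B, C+embed_dim, D, H, W)

tile = PhysicsTile(n_params=3)
feats = torch.randn(2, 16, 8, 8, 8)
phys = torch.randn(2, 3)
out = tile(feats, phys)
```

The same module can be applied at both injection points (before the first down-sampling layer and at the ante-penultimate convolutional layer), since tiling adapts to whatever spatial size the feature map has.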

We adopted the nn-UNet architecture [9]. All networks were trained with batch size 4, on 3D patches sampled from the simulated volumes. Networks were trained with a fixed learning rate until convergence, where convergence is defined as 7 epochs elapsing without an improvement in the validation metrics, Dice score combined with coefficient of variation (CoV). We made use of two main frameworks for this work: TorchIO [14] and MONAI [16].

As the proposed method requires multi-parametric data to train the model, a resource that is scarce in large numbers, a dataset comprising 18 subjects was used for training, four for validation, and five for inference/testing.

2.2 Stratification and batch homogeneity

We seek to further enforce volumetric consistency vis-à-vis same-subject realisations generated using different sequence parameters. We therefore propose a batch stratification approach where each batch contains multiple realisations of images from a single subject. This allows for the addition of a stratification loss over the batch features of the penultimate layer of our network, which acts in addition to the standard cross-entropy segmentation loss. As the segmentation ground truths remain consistent across same-subject simulations (because the underlying anatomy is unchanging), if a given batch contains multiple simulations from a single subject (and the same patch location for patch-based training), then the feature maps at the end of the network should also be consistent across simulations. This is enforced by introducing a loss over all the final feature maps in each batch. The inclusion of the physics parameters should make this tenable, as it allows the network to learn to disentangle the anatomical phenotype from the MRI-physics-related appearance.
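The feature-consistency term can be sketched as follows; a mean-squared penalty on the deviation of each batch item's final feature maps from the batch mean is one simple realisation (the exact norm used and the function name are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def stratification_loss(feature_maps: torch.Tensor) -> torch.Tensor:
    """Consistency loss over the final feature maps of a homogeneous batch.

    feature_maps: (B, C, D, H, W), where all B items are simulations of the
    SAME subject (and the same patch location), so their features should agree.
    Penalises the deviation of each item from the batch mean.
    """
    mean = feature_maps.mean(dim=0, keepdim=True)
    return F.mse_loss(feature_maps, mean.expand_as(feature_maps))

# Identical features across the batch incur zero loss.
f = torch.randn(1, 4, 2, 2, 2).repeat(4, 1, 1, 1, 1)
loss = stratification_loss(f)
```

This term would simply be added to the cross-entropy segmentation loss during training, and is only meaningful because each batch is stratified to contain a single subject.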

2.3 Casting simulation as an augmentation layer

We adopt the same static-equation, multi-parametric-map-based simulation approach as Jog et al. [10], focusing on MPRAGE and SPGR sequences. The SPGR equation describing the signal per voxel is:

$$S_{\text{SPGR}} = G \cdot PD \cdot \sin(\alpha)\,\frac{1 - e^{-TR/T_1}}{1 - \cos(\alpha)\,e^{-TR/T_1}}\,e^{-TE/T_2^*}$$

where $G$ is the scanner gain, $\alpha$ the flip angle, $TR$ the repetition time, $T_1$ the longitudinal relaxation time, $TE$ the echo time, and $T_2^*$ the effective transverse relaxation time.

Similarly, for MPRAGE:

$$S_{\text{MPRAGE}} = G \cdot PD \cdot \left(1 - \frac{2\,e^{-TI/T_1}}{1 + e^{-(TI + TD + \tau)/T_1}}\right)$$

where $TI$ is the inversion time, $TD$ the delay time, and $\tau$ the slice imaging time.

Unlike in [3], where the simulated volumes are all generated prior to network training, we implemented the static-equation simulation as an augmentation layer. Such a layer takes as input a 4D multi-parametric map, a protocol type, and a range of relevant parameters to sample from randomly, producing N (batch size) simulated volumes. This layer-based batch approach is compatible with our proposed stratification model, as all generated volumes per batch belong to the same subject, permitting the utilisation of the within-batch feature consistency loss. The full training pipeline is depicted in Fig. 1.
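The augmentation layer can be sketched as below, using the standard static-equation forms for SPGR and MPRAGE; the helper names, the fixed TD and tau values, and the quantitative-map values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def simulate_spgr(pd, t1, t2s, tr, te, fa_deg, gain=1.0):
    """Per-voxel static-equation SPGR signal from quantitative maps (times in ms)."""
    fa = np.deg2rad(fa_deg)
    e1 = np.exp(-tr / t1)
    return gain * pd * np.sin(fa) * (1 - e1) / (1 - np.cos(fa) * e1) * np.exp(-te / t2s)

def simulate_mprage(pd, t1, ti, td, tau, gain=1.0):
    """Per-voxel static-equation MPRAGE signal (TI inversion, TD delay, tau slice time)."""
    return gain * pd * (1 - 2 * np.exp(-ti / t1) / (1 + np.exp(-(ti + td + tau) / t1)))

def simulate_batch(mpm, protocol, param_ranges, rng, batch_size=4):
    """Augmentation layer: sample sequence parameters uniformly from the given
    ranges and simulate batch_size volumes of a single subject's MPM."""
    pd, t1, t2s = mpm  # quantitative maps unpacked from the 4D multi-parametric map
    batch = []
    for _ in range(batch_size):
        p = {k: rng.uniform(*v) for k, v in param_ranges.items()}
        if protocol == "SPGR":
            batch.append(simulate_spgr(pd, t1, t2s, p["TR"], p["TE"], p["FA"]))
        else:  # MPRAGE; TD and tau here are illustrative fixed values
            batch.append(simulate_mprage(pd, t1, p["TI"], td=500.0, tau=10.0))
    return np.stack(batch)

rng = np.random.default_rng(0)
pd = np.ones((4, 4, 4)); t1 = np.full((4, 4, 4), 1000.0); t2s = np.full((4, 4, 4), 50.0)
vols = simulate_batch((pd, t1, t2s), "SPGR",
                      {"TR": (15, 100), "TE": (4, 10), "FA": (15, 75)}, rng)
```

Because every call produces a whole batch from one subject's maps, the output slots directly into the stratified-batch training described above, and the sampled parameters are what gets fed into the physics branch.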

2.4 Uncertainty modelling

We opt to incorporate uncertainty modelling in our framework to obtain volumetric bounds on our segmentations. We model the aleatoric uncertainty via explicit loss attenuation [13]. We modify our network architecture to include an additional convolutional block that branches off the final upsampling layer. This branch models the aleatoric uncertainty, $\sigma$, and the cross-entropy loss function is modified accordingly:

$$\hat{x}_{i,t} = f_i + \sigma_i\,\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$$

$$\mathcal{L} = \sum_i \log \frac{1}{T} \sum_t \exp\left(\hat{x}_{i,t,c} - \log \sum_{c'} \exp \hat{x}_{i,t,c'}\right)$$

Here $\hat{x}_{i,t}$ are the task logits ($f_i$) summed with a noise sample of standard deviation equal to the predicted $\sigma_i$ per voxel $i$; $T$ denotes the number of stochastic passes per input, and $\sigma$ is defined for every voxel, per class $c$. This allows for the easy extraction of volumetric bounds by repeatedly sampling from the additive logit noise distributions to produce new segmentations.
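A minimal PyTorch sketch of this loss attenuation follows (the function name is ours; negating the averaged log-likelihood to obtain a minimisable loss is the usual convention):

```python
import torch
import torch.nn.functional as F

def attenuated_cross_entropy(logits, sigma, target, T=10):
    """Loss attenuation (Kendall & Gal): corrupt the logits with Gaussian noise
    of predicted per-voxel, per-class std sigma, and average the resulting
    softmax likelihoods over T stochastic passes.

    logits, sigma: (B, C, *spatial); target: (B, *spatial) integer labels.
    """
    samples = [F.log_softmax(logits + sigma * torch.randn_like(logits), dim=1)
               for _ in range(T)]
    # log (1/T) sum_t softmax(x_hat_t) = logsumexp_t(log_softmax_t) - log T
    log_probs = (torch.logsumexp(torch.stack(samples), dim=0)
                 - torch.log(torch.tensor(float(T))))
    return F.nll_loss(log_probs, target)

# Sanity check: with sigma = 0 the loss reduces to plain cross-entropy.
logits = torch.randn(2, 3, 4, 4)
target = torch.randint(0, 3, (2, 4, 4))
zero_sigma = torch.zeros_like(logits)
```

Training the sigma branch this way lets the network down-weight voxels it predicts to be noisy, and the same additive logit-noise distributions are what we later sample from to produce the aleatoric volume samples.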

The epistemic uncertainty is modelled using test-time Monte Carlo sampling via dropout. Dropout is commonly used as a regularisation technique [18], but maintaining the random neuron switching at test-time also allows for approximate Bayesian posterior sampling of segmentations [7]. We set a dropout level of 0.5 in all layers except for the input layer, where it is set to 0.05.
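The test-time sampling procedure can be sketched as follows; the helper name, the toy model, the voxel volume, and the class count are illustrative assumptions, while the idea of re-enabling only the dropout layers during inference and summarising the per-class volumes follows the text:

```python
import torch
import torch.nn as nn

def mc_dropout_volumes(model, image, n_classes, n_samples=50, voxel_vol_ml=0.001):
    """Epistemic sampling: keep dropout stochastic at test time and collect
    per-class segmentation volumes over repeated passes, summarised as a
    median volume and interquartile range (IQR) per class."""
    model.eval()
    for m in model.modules():            # re-enable only the dropout layers
        if isinstance(m, (nn.Dropout, nn.Dropout3d)):
            m.train()
    vols = []
    with torch.no_grad():
        for _ in range(n_samples):
            seg = model(image).argmax(dim=1)                       # (B, D, H, W)
            counts = torch.bincount(seg.flatten(), minlength=n_classes)
            vols.append(counts.float() * voxel_vol_ml)
    vols = torch.stack(vols)                                       # (n_samples, n_classes)
    median = vols.median(dim=0).values
    iqr = torch.quantile(vols, 0.75, dim=0) - torch.quantile(vols, 0.25, dim=0)
    return median, iqr

# Toy 3-class segmenter with dropout; illustrates shapes only, not a trained model.
net = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.Dropout3d(0.5), nn.Conv3d(8, 3, 1))
med, iqr = mc_dropout_volumes(net, torch.randn(1, 1, 8, 8, 8), n_classes=3, n_samples=10)
```

The median and IQR per class are exactly the quantities plotted as the volumetric bounds in the later experiments.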

3 Experiments

3.1 Data

We make use of a 27-subject multi-parametric early-onset Alzheimer's dataset, the same as in [3], for the purpose of simulating the images used for training, validating, and testing our models, all rigidly registered to MNI space. The data contain maps of the longitudinal and effective transverse relaxation times, T1 and T2*, proton density, PD, and magnetisation transfer, MT. The details concerning quantitative map creation can be found in [8]. The static equation models we employ feature R1 (the inverse of T1), R2* (the inverse of T2*), and PD.

3.2 Simulation sequence details

To allow for direct comparability, we limited the ranges of the relevant parameters for images simulated at training time to those stipulated in the original work, i.e. inversion time (TI) = [600-1200] ms for MPRAGE, and repetition time (TR) = [15-100] ms, echo time (TE) = [4-10] ms, and flip angle (FA) = [15-75] degrees for SPGR. For each subject, a single "Physics Gold Standard" (PGS) segmentation was used across the associated synthesised images, generated using the same process and literature values as in the original work [3].

4 Annealing study: Robustness and quality analysis

To ascertain the contributions of the two main additions to the underlying method, we carry out an annealing study, whereby we analyse the incremental performance increase, in terms of volume consistency and Dice score, with the addition of each change. We begin with a completely physics-agnostic baseline, i.e. a standard 3D nn-UNet trained with pre-generated data (Baseline); followed by the original physics method (Phys-Base); then Phys-Base with the addition of batch stratification (Phys-Strat); and lastly Phys-Strat with the addition of the simulation augmentation scheme (Phys-Strat-Aug).

Experiments      Sequence Dice scores, mean (SD)
Baseline         0.966 (0.005)  0.956 (0.006)  0.953 (0.002)  0.934 (0.002)  0.878 (0.021)  0.872 (0.008)  0.893 (0.023)  0.873 (0.011)
Phys-Base        0.971 (0.007)  0.964 (0.009)  0.964 (0.008)  0.959 (0.011)  0.911 (0.020)  0.872 (0.050)  0.912 (0.021)  0.880 (0.092)
Phys-Strat       0.970 (0.005)  0.969 (0.005)  0.958 (0.004)  0.957 (0.005)  0.929 (0.015)  0.911 (0.011)  0.922 (0.021)  0.894 (0.040)
Phys-Strat-Aug   0.971 (0.004)  0.971 (0.005)  0.962 (0.003)  0.960 (0.004)  0.930 (0.016)  0.913 (0.019)  0.921 (0.015)  0.899 (0.019)
Table 1: Mean Dice scores for Baseline, Phys-Base, Phys-Strat, and Phys-Strat-Aug on the segmentation task, across inference subjects. All Dice scores are estimated against the Physics Gold Standard. Standard deviations are quoted in brackets. Bold values represent statistically best performances.

We extend our volumetric consistency analysis by analysing out-of-distribution (OoD) samples, defined here as simulated images whose sequence parameters lie outside the training range. This not only results in images of unfamiliar contrasts, but also in unseen parameters being fed into the physics branch of the network. If our method has truly attained a measure of sequence invariance, both segmentation quality and volume consistency should be maintained, as the network should be able to extrapolate from the provided values. For MPRAGE, the TI range is extended to [100-2000] ms, while for SPGR, TR is extended to [10-200] ms, TE to [2-20] ms, and FA to [5-90] degrees.
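For reference, the consistency metric itself is simple; a minimal sketch (the reported table values are presumably scaled by a constant factor, and the use of the population standard deviation here is a convention choice):

```python
import numpy as np

def coefficient_of_variation(volumes):
    """CoV = std / mean of one tissue's segmented volumes across same-subject
    realisations; lower means more acquisition-invariant segmentations."""
    v = np.asarray(volumes, dtype=float)
    return v.std() / v.mean()

# Perfectly consistent volumes across five simulated contrasts give CoV = 0.
consistent = coefficient_of_variation([612.0] * 5)
```

A CoV near zero thus indicates that the segmented volume barely changes as the simulated acquisition parameters vary, which is the invariance property the tables quantify.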

Experiments      Sequence CoVs (×), mean (SD)
Baseline         6.39 (0.87)   22.50 (4.08)   14.94 (1.71)  51.12 (7.11)  61.91 (7.61)    170.10 (31.32)  32.57 (11.98)  158.93 (16.83)
Phys-Base        2.72 (2.12)   14.67 (7.30)   3.28 (2.01)   28.10 (3.98)  77.22 (34.44)   127.22 (18.61)  20.77 (9.35)   264.80 (8.52)
Phys-Strat       0.71 (0.23)   6.15 (1.51)    0.53 (0.25)   3.67 (1.34)   21.83 (0.83)    59.78 (13.31)   8.60 (0.64)    59.19 (11.25)
Phys-Strat-Aug   0.42 (0.22)   4.74 (1.30)    0.51 (0.23)   3.65 (0.62)   15.76 (1.18)    28.88 (9.74)    7.12 (0.45)    44.78 (4.22)
Table 2: Coefficients of variation (CoV) for Baseline, Phys-Base, Phys-Strat, and Phys-Strat-Aug on the segmentation task, averaged across test subjects. Standard deviations are quoted in brackets. Bold values represent statistically best performances.

Table 1 and Table 2 show Dice and CoV performances, respectively. We carry out signed-rank Wilcoxon tests for statistically significant improvements, and bold the best model (p-value < 0.01). Tests are carried out on CoV and Dice scores independently of each other. In instances where models outperform the baselines but are not statistically significantly different from each other, we bold both. We verify an incremental gain in CoV and Dice with each added feature, the most pronounced of which results from the addition of the stratification loss, in terms of both in- and out-of-distribution CoVs. This is expected, as directly optimising for consistency across realisations of the same subject should more strongly enforce volume consistency, enhancing the physics invariance.

Phys-Strat-Aug boasts the best performance overall, significantly outperforming both Baseline and Phys-Base with regards to CoV. Compared to Phys-Strat, the differences are not always statistically better for MPRAGE, but are so for SPGR. With more parameters at play, an augmentation scheme should become more relevant, as sampling from the parameter space should lead to a greater extrapolating ability, as the network is no longer constrained to learn from a more discrete training set, and will experience more varied realisations.

Fig. 2 shows qualitative in- and out-of-distribution segmentation comparisons between Baseline and Phys-Strat-Aug, conveying the consistency the latter achieves without compromising segmentation quality.

Figure 2: Comparison of out-of-distribution MPRAGE (top two rows) and SPGR (bottom two rows) GM segmentations from the proposed Phys-Strat-Aug and Baseline methods. Blue circles highlight examples of significant gyrus variability. Orange circles denote regions of segmentation differences between protocols.

4.1 Uncertainty measures and volumetric bounds

Given Phys-Strat-Aug's superior performance, we train two epistemic and two aleatoric uncertainty models: one of each for this pipeline, and one of each for the complete baseline.

At test-time we extract 50 aleatoric volume samples and 50 epistemic volume samples for each of the networks, for both in- and out-of-distribution simulated images. We verify that the aleatoric samples do not contribute significantly to the volume variance compared to their epistemic counterparts (an observation also verified in [5]), and therefore omit them from our volumetric analysis.

Fig. 3 showcases white matter volume variations for MPRAGE and SPGR sequences, over the extended out-of-distribution parameter ranges, for the Baseline and Phys-Strat-Aug experiments, for a single subject. For the SPGR plot, we order the points based on volumetric consistency for each experiment, thus highlighting outliers. In both instances we observe a much greater consistency in volume for Phys-Strat-Aug, itself a reflection of the aforementioned CoV results. Using the calibrated volumetric method described in [5] allows us to calculate volume percentiles for each set of dropout samples; the errors represent the volumetric interquartile range (IQR).

The errors for the baseline do not vary in any statistically significant manner, for either sequence or tissue, independent of any volume deviation. It is a different matter for Phys-Strat-Aug, however. Specifically, for MPRAGE, we note that uncertainties are consistently larger for Phys-Strat-Aug compared to the baseline, and furthermore that Phys-Strat-Aug segmentations boast larger uncertainties for out-of-distribution samples. This can perhaps be explained by the additional level of uncertainty introduced by the physics, and how the presence of a physics parameter outside of the "known" further exacerbates this effect.

For SPGR, all the apparent outliers for Phys-Strat-Aug have significantly larger associated errors, while this is not the case for the Baseline. We observe that most outliers correspond to out-of-distribution samples with very low flip angles (highlighted in black in the figure). Such images will be significantly less T1-weighted, and therefore less familiar to the models, resulting in poorer segmentation quality, so it is reassuring that the physics-informed network's uncertainty around these samples is larger.

Figure 3: Volume consistency for WM for the complete baseline and Phys-Strat-Aug, for an example subject. Filled plots/error bars correspond to IQR volumes. Left: MPRAGE. The dashed grey region denotes the TI training-time parameter range (600-1200 ms). Right: SPGR. Black points denote Phys-Strat-Aug samples with low FA.

5 Discussion and Conclusions

In this work we demonstrated that, with some well-justified modifications to the training pipeline, a physics-informed network can achieve highly consistent tissue segmentations across a wide range of contrasts, for all tissue types and investigated sequences, thus strengthening its harmonisation capabilities.

Furthermore, we showed that it can suitably generalise to unseen domains, maintaining volume consistency without compromising segmentation quality, as validated by accurately quantified volumetric uncertainty. The uncertainty estimates further suggest that the physics knowledge grants the model an additional level of safety, as volumetric uncertainties proved larger for images generated with out-of-distribution parameters.

The method is admittedly limited to sequences that can be aptly represented by a static equation, but we argue that, at the very least for the purposes of contrast agnosticism, a wide enough range of realistic contrasts can be generated with the currently implemented sequences, which should allow our method to generalise further. Future work will therefore involve testing our method on multiple external datasets to ascertain generalisability, and exploring techniques for modelling MR artefacts such as movement and inhomogeneities, to enhance our model's utility.

5.0.1 Acknowledgements

This project was funded by the Wellcome Flagship Programme (WT213038/Z/18/Z) and Wellcome EPSRC CME (WT203148/Z/16/Z).


  • [1] J. Ashburner and K. J. Friston (2005) Unified segmentation.. NeuroImage 26 (3), pp. 839–851. External Links: Document, ISSN 1053-8119 Cited by: §1.
  • [2] B. Billot, D. Greve, K. Van Leemput, B. Fischl, J. E. Iglesias, and A. V. Dalca (2020) A Learning Strategy for Contrast-agnostic MRI Segmentation. arXiv. External Links: 2003.01995 Cited by: §1.
  • [3] P. Borges, C. Sudre, T. Varsavsky, D. Thomas, I. Drobnjak, S. Ourselin, and M. J. Cardoso (2020) Physics-informed brain MRI segmentation. Lecture Notes in Computer Science 11827 LNCS, pp. 100–109. External Links: Document, 2001.10767 Cited by: §1, §1, §2.1, §2.3, §2, §3.1, §3.2.
  • [4] N. K. Dinsdale, M. Jenkinson, and A. I.L. Namburete (2020) Unlearning Scanner Bias for MRI Harmonisation in Medical Image Segmentation. In Communications in Computer and Information Science, Vol. 1248 CCIS, pp. 15–25. External Links: Document, ISBN 9783030527907, ISSN 18650937 Cited by: §1.
  • [5] Z. Eaton-Rosen, F. Bragman, S. Bisdas, S. Ourselin, and M. J. Cardoso (2018) Towards safe deep learning: accurately quantifying biomarker uncertainty in neural network predictions. Lecture Notes in Computer Science 11070 LNCS, pp. 691–699. External Links: 1806.08640 Cited by: §4.1, §4.1.
  • [6] J. P. Fortin, N. Cullen, Y. I. Sheline, W. D. Taylor, I. Aselcioglu, P. A. Cook, P. Adams, C. Cooper, M. Fava, P. J. McGrath, M. McInnis, M. L. Phillips, M. H. Trivedi, M. M. Weissman, and R. T. Shinohara (2018) Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167, pp. 104–120. External Links: Document, ISSN 10959572 Cited by: §1.
  • [7] Y. Gal and Z. Ghahramani (2015) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 33rd International Conference on Machine Learning, ICML 2016, 3, pp. 1651–1660. External Links: 1506.02142 Cited by: §2.4.
  • [8] G. Helms et al. (2009) Increased snr and reduced distortions by averaging multiple gradient echo signals in 3d flash imaging of the human brain at 3t. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine 29 (1), pp. 198–204. Cited by: §3.1.
  • [9] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein (2018) No New-Net. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11384 LNCS, pp. 234–244. External Links: 1809.10483 Cited by: §2.1.
  • [10] A. Jog, A. Carass, S. Roy, D. L. Pham, and J. L. Prince. MR Image Synthesis by Contrast Learning On Neighborhood Ensembles. External Links: Document Cited by: §2.3.
  • [11] A. Jog, A. Hoopes, D. N. Greve, K. Van Leemput, and B. Fischl (2019) PSACNN: Pulse Sequence Adaptive Fast Whole Brain Segmentation. External Links: 1901.05992v3 Cited by: §1.
  • [12] W. E. Johnson, C. Li, and A. Rabinovic (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 (1), pp. 118–127. External Links: Document, ISSN 1468-4357 Cited by: §1.
  • [13] A. Kendall and Y. Gal (2017) What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Advances in Neural Information Processing Systems, Vol. 30. Cited by: §2.4.
  • [14] F. Pérez-García, R. Sparks, and S. Ourselin (2020) TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. arXiv. External Links: 2003.04696, ISSN 23318422 Cited by: §2.1.
  • [15] D. L. Pham, Y. Chou, B. E. Dewey, D. S. Reich, J. A. Butman, and S. Roy (2020) Contrast Adaptive Tissue Classification by Alternating Segmentation and Synthesis. In Simulation and Synthesis in Medical Imaging, N. Burgos, D. Svoboda, J. M. Wolterink, and C. Zhao (Eds.), Cham, pp. 1–10. External Links: ISBN 978-3-030-59520-3 Cited by: §1.
  • [16] Project MONAI. External Links: Document Cited by: §2.1.
  • [17] M. R. Sabuncu, B. T. Yeo, K. V. Leemput, B. Fischl, and P. Golland (2010-10) A generative model for image segmentation based on label fusion. IEEE Transactions on Medical Imaging 29, pp. 1714–1729. External Links: Document, ISSN 02780062, Link Cited by: §1.
  • [18] N. Srivastava, G. Hinton, A. Krizhevsky, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Technical report Technical Report 56, Vol. 15. External Links: ISSN 1533-7928 Cited by: §2.4.
  • [19] F. Zhao, Z. Wu, L. Wang, W. Lin, S. Xia, D. Shen, and G. Li (2019) Harmonization of infant cortical thickness using surface-to-surface cycle-consistent adversarial networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11767 LNCS, pp. 475–483. External Links: ISBN 9783030322502, ISSN 16113349 Cited by: §1.
  • [20] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 2242–2251. External Links: 1703.10593 Cited by: §1.