PSACNN: Pulse Sequence Resilient Fast Whole Brain Segmentation

by   Amod Jog, et al.
Harvard University

With the advent of convolutional neural networks (CNN), supervised learning methods are increasingly being used for whole brain segmentation. However, a large, manually annotated training dataset of labeled brain images required to train such supervised methods is frequently difficult to obtain or create. In addition, existing training datasets are generally acquired with a homogeneous magnetic resonance imaging (MRI) acquisition protocol. CNNs trained on such datasets are unable to generalize on test data with different acquisition protocols. Modern neuroimaging studies and clinical trials are necessarily multi-center initiatives with a wide variety of acquisition protocols. Despite stringent protocol harmonization practices, it is very difficult to standardize the gamut of MRI imaging parameters across scanners, field strengths, receive coils etc., that affect image contrast. In this paper we propose a CNN-based segmentation algorithm that, in addition to being highly accurate and fast, is also resilient to variation in the input acquisition. Our approach relies on building approximate forward models of pulse sequences that produce a typical test image. For a given pulse sequence, we use its forward model to generate plausible, synthetic training examples that appear as if they were acquired in a scanner with that pulse sequence. Sampling over a wide variety of pulse sequences results in a wide variety of augmented training examples that help build an image contrast invariant model. Our method trains a single CNN that can segment input MRI images with acquisition parameters as disparate as T_1-weighted and T_2-weighted contrasts with only T_1-weighted training data. The segmentations generated are highly accurate with state-of-the-art results (overall Dice overlap=0.94), with a fast run time (≈ 45 seconds), and consistent across a wide range of acquisition protocols.



There are no comments yet.



PSACNN: Pulse Sequence Adaptive Fast Whole Brain Segmentation

With the advent of convolutional neural networks (CNN), supervised learn...

Pulse Sequence Resilient Fast Brain Segmentation

Accurate automatic segmentation of brain anatomy from T_1-weighted (T_1-...

SynthStrip: Skull-Stripping for Any Brain Image

The removal of non-brain signal from magnetic resonance imaging (MRI) da...

Fast Infant MRI Skullstripping with Multiview 2D Convolutional Neural Networks

Skullstripping is defined as the task of segmenting brain tissue from a ...

Contrast Adaptive Tissue Classification by Alternating Segmentation and Synthesis

Deep learning approaches to the segmentation of magnetic resonance image...

Test-Time Adaptable Neural Networks for Robust Medical Image Segmentation

Convolutional Neural Networks (CNNs) work very well for supervised learn...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Whole brain segmentation is one of the most important tasks in a neuroimage processing pipeline, and is a well-studied problem (Fischl et al., 2002; Shiee and et al., 2010; Wang et al., 2013; Asman and Landman, 2012; Wachinger et al., 2017). The segmentation task involves taking as input an MRI image, usually a -weighted (-w) image, of the head, and generating a labeled image with each voxel getting a label of the structure it lies in. Most segmentation algorithms output labels for right

left white matter, gray matter cortex, and subcortical structures such as the thalamus, hippocampus, amygdala, and others. Structure volumes and shapes estimated from a volumetric segmentation are routinely used as biomarkers to quantify differences between healthy and diseased populations 

(Fischl et al., 2002), track disease progression (Frisoni et al., 2009; Jack et al., 2011; Davatzikos et al., 2005), and study aging (Resnick et al., 2000). Segmentation of a -w image is also used to localize and analyze the functional and diffusion signals from labeled anatomy after modalities such as functional MRI (fMRI), PET (positron emission tomography), and diffusion MRI are co-registered to the -w image. This enables the study of effects of external intervention or treatment on various properties of the segmented neuroanatomy (Desbordes et al., 2012; Mattsson et al., 2014).

An ideal segmentation algorithm needs to be highly accurate. Accuracy is frequently measured with metrics such as Dice coefficient and Jaccard index, that quantify overlap of predicted labels with a manually labeled ground truth. An accurate segmentation algorithm is more sensitive to and can detect the subtle changes in volume and shape of brain structures in patients with neurodegenerative diseases 

(Jack et al., 2011). Fast processing time is also a desirable property for segmentation algorithms. Segmentation is one of the time-consuming bottlenecks in neuroimaging pipelines (Fischl et al., 2002). A fast, memory-light segmentation algorithm can process a larger dataset quickly and can potentially be adopted in real-time image analysis pipelines.

For MRI, however, it is also critical that the segmentation algorithm be robust to variations in the contrast properties of the input images. MRI is an extremely versatile modality, and a variety of image contrasts can be achieved by changing the multitude of imaging parameters. Large, modern MRI studies acquire imaging data simultaneously at multiple centers across the globe to gain access to a larger, more diverse pool of subjects (Thompson et al., 2014; Mueller et al., 2005). It is very difficult to perfectly harmonize scanner manufacturers, field strengths, receive coils, pulse sequences, resolutions, scanner software versions, and many other imaging parameters across different sites. Variation in imaging parameters leads to variation in intensities and image contrast across sites, which in turn leads to variation in segmentation results. Most segmentation algorithms are not designed to be robust to these variations (Han et al., 2006; Jovicich and et al., 2013; Nyúl and Udupa, 1999; Focke et al., 2011). Segmentation algorithms can introduce a site-specific bias in any analysis of multi-site data that is dependent on structure volumes or shapes obtained from that algorithm. In most analyses, the inter-site and inter-scanner variations are regressed out as confounding variables (Kostro et al., 2014) or the MRI images are pre-processed in order to provide consistent input to the segmentation algorithm (Nyúl and Udupa, 1999; Madabhushi and Udupa, 2006; Roy et al., 2013; Jog et al., 2015). However, a segmentation algorithm that is robust to scanner variation precludes the need to pre-process the inputs.

1.1 Prior Work

Existing whole brain segmentation algorithms can be broadly classified into three types: (a) model-based, (b) multi-atlas registration-based, and (c) supervised learning-based. Model-based algorithms 

(Fischl et al., 2002; Pohl et al., 2006; Pizer et al., 2003; Patenaude et al., 2011; Puonti et al., 2016) fit an underlying statistical atlas-based model to the observed intensities of the input image and perform maximum a posteriori (MAP) labeling. They assume a functional form (e.g. Gaussian) of intensity distributions and results can degrade if the input distribution differs from this assumption. Recent work by Puonti et al. (2016) builds a parametric generative model for brain segmentation that leverages manually labeled training data and also adapts to changes in input image contrast. In a similar vein, work by Bazin and Pham (2008) also uses a statistical atlas to minimize an energy functional based on fuzzy c-means clustering of intensities. Model-based methods are usually computationally intensive, taking 0.5–4 hours to run. With the exception of the work by Puonti et al. (2016) and  Bazin and Pham (2008), most model-based methods are not explicitly designed to be robust to variations in input -w contrast and fail to work on input images acquired with a completely different contrast such a proton density weighted (-w) or -weighted (-w).

Multi-atlas registration and label fusion (MALF) methods (Rohlfing et al., 2004; Heckemann et al., 2006; Sabuncu et al., 2010; Zikic et al., 2014; Asman and Landman, 2012; Wang et al., 2013; Wu et al., 2014) use an atlas set of images that comprises pairs of intensity images and their corresponding manually labeled images. The atlas intensity images are deformably registered to the input intensity image, and the corresponding atlas label images are warped using the estimated transforms into the input image space. For each voxel in the input image space, a label is estimated using a label fusion approach. MALF algorithms achieve state-of-the-art segmentation accuracy for whole brain segmentation (Asman and Landman, 2012; Wang et al., 2013) and have previously won multi-label segmentation challenges (Landman and Warfield, 2012). However, they require multiple computationally expensive registrations, followed by label fusion. Efforts have been made to reduce the computational cost of deformable registrations with linear registrations (Coupé et al., 2011), or reducing the task to a single deformable registration (Zikic et al., 2014). However, if the input image contrast is significantly different from the atlas intensity images, registration quality becomes inconsistent, especially across different multi-scanner datasets. If the input image contrast is completely different (-w or -w) from the atlas images (usually -w), the registration algorithm needs to use a modality independent optimization metric such as cross-correlation or mutual information, which also affects registration consistency across datasets.

Accurate whole brain segmentation using supervised machine learning approaches has been made possible in the last few years by the success of deep learning methods in medical imaging. Supervised segmentation approaches built on CNNs have produced accurate segmentations with a runtime of a few seconds or minutes 

(Li et al., 2017; Wachinger et al., 2017; Roy et al., 2018). These methods have used 2D slices (Roy et al., 2018) or 3D patches (Li et al., 2017; Wachinger et al., 2017) as inputs to custom-built fully convolutional network architectures, the U-Net (Ronneberger et al., 2015) or residual architectures (He et al., 2016) to produce voxel-wise semantic segmentation outputs. The training of these deep networks can take hours to days, but the prediction time on an input image is in the range of seconds to minutes, thus making them an attractive proposition for whole brain segmentation. CNN architectures for whole brain segmentation tend to have millions of trainable parameters but the training data is generally limited. A manually labeled training dataset can consist of only labeled images (Buckner et al., 2004; Landman and Warfield, 2012) due to the labor-intensive nature of manual annotation. Furthermore, the training dataset is usually acquired with a fixed acquisition protocol, on the same scanner, within a short period of time, and generally without any substantial artifacts, resulting in a pristine set of training images. Unfortunately, a CNN trained on such limited data is unable to generalize to an input test image acquired with a different acquisition protocol and can easily overfit to the training acquisitions. Despite the powerful local and global context that CNNs generally provide, they are more vulnerable to subtle contrast differences between training and test MRI images than model-based and MALF-based methods. A recent work by Karani et al. (2018)

describes a brain segmentation CNN trained to consistently segment multi-scanner and multi-protocol data. This framework adds a new set of batch normalization parameters in the network as it encounters training data from a new acquisition protocol. To learn these parameters, the framework requires that a small number of images from the new protocol be manually labeled, which may not always be possible.

1.2 Our Contribution

We describe Pulse Sequence Adaptive Convolutional Neural Network (PSACNN); a CNN-based whole brain MRI segmentation approach with an augmentation scheme that models the forward models of pulse sequences to generate a wide range of synthetic training examples during the training process. Data augmentation is regularly used in training of CNNs when training data is scarce. Typical augmentation in computer vision applications involves rotating, deforming, and adding noise to an existing training example, pairing it with a suitably transformed training label, and adding it to the training batch data. Once trained this way, the CNN is expected to be robust to these variations, which were not originally present in the given training dataset.

In PSACNN, we aim to augment CNN training by using existing training images and creating versions of them with a different contrast, for a wide range of contrasts. The CNN thus trained will be robust to contrast variations. We create synthetic versions of training images such that they appear to have been scanned with different pulse sequences for a range of pulse sequence parameters. These synthetic images are created by applying pulse sequence imaging equations to the training tissue parameter maps. We formulate simple approximations of the highly nonlinear pulse sequence imaging equations for a set of commonly used pulse sequences. Our approximate imaging equations typically have a much smaller number (3 to be precise) of imaging parameters. Estimating these parameters would typically need multi-contrast images that are not usually available, so we have developed a procedure to estimate these approximate imaging parameters using intensity information present in that image alone.

When applying an approximate imaging equation to tissue parameter maps, changing the imaging parameters of the approximate imaging equation changes the contrast of the generated synthetic image. In PSACNN, we uniformly sample over the approximate pulse sequence parameter space to produce candidate pulse sequence imaging equations and apply them to the training tissue parameters to synthesize images of training brains with a wide range of image contrasts. Training using these synthetic images results in a segmentation CNN that is robust to the variations in image contrast.

A previous version of PSACNN was published as a conference paper (Jog and Fischl, 2018). In this paper, we have extended it by removing the dependence on the FreeSurfer atlas registration for features. We have also added pulse sequence augmentation for -weighted sequences, thus enabling the same trained CNN to segment -w and -w images without an explicitly -w training dataset. We have also included additional segmentation consistency experiments conducted on more diverse multi-scanner datasets.

Our paper is structured as follows: In Section 2, we provide a background on MRI image formation, followed by a description of our pulse sequence forward models and the PSACNN training workflow. In Section 3, we show an evaluation of PSACNN performance on a wide variety of multi-scanner datasets and compare it with state-of-the-art segmentation methods. Finally, in Section 4 we briefly summarize our observations and outline directions for future work.

2 Method

2.1 Background: MRI Image Formation

In MRI the changing electromagnetic field stimulates tissue inside a voxel to produce the voxel intensity signal. Let be the acquired MRI magnitude image. , the intensity at voxel in the brain is a function of (a) intrinsic tissue parameters and (b) extrinsic imaging parameters. We will refer to the tissue parameters as nuclear magnetic resonance or NMR parameters in order to distinguish them from the imaging parameters. The principal NMR parameters include proton density (), longitudinal () and transverse (), and relaxation times. In this work we only focus on , , and denote them together as . Let the image be acquired using a pulse sequence . Commonly used pulse sequences include Magnetization Prepared Gradient Echo (MPRAGE) (Mugler and Brookeman, 1990), Multi-Echo MPRAGE (van der Kouwe et al., 2008) Fast Low Angle Shot (FLASH), Spoiled Gradient Echo (SPGR), T2-Sampling Perfection with Application optimized Contrasts using different flip angle Evolution (T2-SPACE) (Mugler, 2014) and others. Depending on the type of the pulse sequence , its extrinsic imaging parameters can include repetition time (), echo time (), flip angle (), gain (), and many others. The set of imaging parameters is denoted by . The relationship between voxel intensity and the NMR parameters () and imaging parameters () is encapsulated by the imaging equation as shown in Eqn. 1:


For example, for the FLASH sequence, the imaging parameters are , and the imaging equation is shown in Eqn. 2 (Glover, 2004):


Most pulse sequences do not have a closed form imaging equation and a Bloch simulation is needed to estimate the voxel intensity (Glover, 2004). There has been some work in deriving the imaging equation for MPRAGE (Deichmann et al., 2000; Wang et al., 2014). Unfortunately, these closed form equations do not always match the scanner implementation of the pulse sequence. Different scanner manufacturers and scanner software versions have different implementations of the same pulse sequence. Sometimes additional pulses like fat suppression are added to improve image quality, further deviating from the closed form theoretical equation. Therefore, if such a closed form theoretical imaging equation with known imaging parameters (from the image header) is applied to NMR maps () of a subject, the resulting synthetic image will systematically differ from the observed, scanner-acquired image. To mitigate this difference, we deem it essential to estimate the imaging parameters purely from the observed image intensities themselves. This way, when the imaging equation with the estimated parameters is applied to the NMR maps, we are likely to get a synthetic image that is close to the observed, scanner-acquired image. However, to enable estimation of imaging parameters purely from the image intensities requires us to approximate the closed form theoretical imaging equation to a much simpler form and with fewer parameters. Therefore, we do not expect the estimated parameters of an approximate imaging equation to match with the scanner parameters (such as or ). The approximation of the imaging equation and its estimated parameters are deemed satisfactory if they produce intensities similar to an observed, scanner-acquired image, when applied to NMR maps. Our approximations of the theoretical closed form imaging equations for a number of commonly used pulse sequences are described next in Section 2.2.

2.2 Pulse Sequence Approximation and Parameter Estimation

(a) (b) (c)
Figure 1: (a) Fit of component in the FLASH equation (blue) and our approximation (red), (b) fit of component (blue) with our approximation (red) for MPRAGE, (c) fit of component (blue) with our approximation (red) for T2-SPACE

In this work we primarily work with FLASH, MPRAGE, SPGR, and T2-SPACE images. Our goal is to be able to estimate the pulse sequence parameters from the intensity information of

. Intensity histograms of most pulse sequence images present peaks for the three main tissue classes in the brain: cerebro-spinal fluid (CSF), gray matter (GM), and white matter (WM). The mean class intensities can be robustly estimated by fitting a three-class Gaussian mixture model to the intensity histogram. Let

, , be the mean intensities of CSF, GM, and WM, respectively. Let , , be the mean NMR values for CSF, GM, and WM classes, respectively. In this work, we extracted these values from a previously acquired dataset with relaxometry acquisitions (Fischl et al., 2004), but they could also have been obtained from previous studies (Wansapura et al., 1999). We assume that applying the pulse sequence equation to the mean class NMR values results in mean class intensities in . This assumption leads to a three equation system as shown in Eqns. (3)–(5), where the unknown is :


To solve a three equation system, it is necessary that have three or fewer unknown parameters. From the FLASH imaging equation in Eqn. 2, it is clear that there are more than three unknowns (four, to be precise) in the closed form imaging equation, which is also true for most pulse sequences. Therefore, to enable us to solve the system in Eqns. (3)–(5), we formulate a three-parameter approximation for pulse sequences like FLASH, SPGR, MPRAGE, and T2-SPACE. Our approximation for the FLASH sequence is shown in Eqns. (6)–(8):


where forms our parameter set. We replace by as they are highly correlated quantities due to the relation , where is dependent on field inhomegeneities (Chavhan et al., 2009) that are assumed to have been eliminated by an inhomogeneity correction algorithm in pre-processing (Sled et al., 1998). and in Eqn. 7 model the contribution of and components respectively to the voxel intensity. Figure 1(a) shows the theoretical and our approximation for FLASH. For the range of values of in the human brain  ms at 1.5 T, our approximation fits well. The error increases for long regions, which are generally CSF voxels in the brain, and relatively easy to segment. The SPGR pulse sequence is very similar to FLASH and so we use the same approximation for it (Glover, 2004).

For the MPRAGE sequence, we chose to approximate a theoretical derivation provided by Wang et al. (2014). Assuming the echo time is small, which it typically is to avoid -induced loss, the effects in our approximation are not significant. Therefore, we use two parameters and to model a quadratic dependence on , which is an approximation of the derivation by Wang et al. (2014), resulting in Eqn. (9) as our three-parameter approximation for MPRAGE. Comparison with the theoretical dependence on is shown in Fig. 1(b), and our approximation shows a very good fit.


T2-SPACE is a turbo spin echo sequence. The intensity produced at the echo of a turbo spin echo sequence is given by Eqn. 10:


where for a first order approximation (Glover, 2004), is the time interval between last echo and , and is the echo time. Our three-parameter approximation for Eqn. 10 is given in Eqn. 11:


The comparison with the theoretical equation as shown in Fig. 1(c) shows good fit for dependence for values in the GM-WM range. The error increases for very short  (bone) and very long  (CSF) voxels. This has minimal impact on the segmentation accuracy as they have sufficient contrast to segment regardless of small deviations from the model.

Our three-parameter approximations are such that the equation system in Eqns. (3)–(5) becomes linear in terms of and can be easily and robustly solved for each of the pulse sequences. The exact theoretical equations have more than three parameters, and even if they had fewer parameters, the exponential interaction of parameters with and renders solving the highly nonlinear system very unstable. Using our formulated approximate versions of pulse sequence equations, given any image and the pulse sequence used to acquire it, we can estimate the three parameter set , denoted by hat to distinguish it from the unknown parameter set .

(a) distribution (b) distribution (c) distribution

Figure 2: Estimated MPRAGE parameters , , and distributions for a variety of 1.5 T and 3 T datasets from Siemens Trio/SONATA/Prisma/Avanto, GE Signa scanners.

More importantly, the three-parameter approximations establish a compact, 3-dimensional parameter space for a pulse sequence. Given any test image, its approximate pulse sequence parameter set is a point in this 3-dimesional parameter space, found by calculating the mean class intensities and solving the equation system in Eqns. (3)–(5). We estimated the parameter sets for a variety of test MPRAGE images acquired on 1.5 T and 3 T scanners from Siemens Trio/Vision/SONATA/Avanto/Prisma and GE Signa scanners using our method. The estimated parameter distributions are shown in Figs. 2(a)– 2(c). These give us an idea of the range of parameter values and their distribution for the MPRAGE sequence. We use this information to grid the parameter space in 50 bins between [0.8, 1.2] for such that any extant test dataset parameters are included in this interval and are close to a grid point. We create such a 3D grid for FLASH/SPGR and T2-SPACE as well. In Section 2.3 we describe PSACNN training that depends on sampling this grid to come up with candidate pulse sequence parameter sets. We will use these candidate parameter sets to create augmented, synthetic training images.

2.3 PSACNN Training

(a) (b) (c) (d) (e)

Figure 3: PSACNN training images for a training subject : (a) Training MPRAGE (), (b) Training Label Image (), (c)  (PD) map, (d) , (e) .

Let be a collection of -w images with a corresponding expert manually labeled image set . The paired collection is referred to as the training image set. Let be acquired with a pulse sequence , where is typically MPRAGE that presents a good gray-white matter contrast. For PSACNN, we require that in addition to , we have ; the corresponding NMR parameter maps for each of the training subjects. For each we have , where is a map of proton densities, and and store the longitudinal (), and transverse () relaxation time maps respectively. Most training sets do not acquire or generate the NMR maps. In case they are not available, we outline an image synthesis procedure in Section 2.5 to generate these from available -w MPRAGE images . Example images from are shown in Fig. 3.

Figure 4: Workflow of PSACNN training.

We extract patches of size from voxels of . is paired with a corresponding patch extracted from the label image . Training pairs of constitute the unaugmented data used in training. A CNN trained on only available unaugmented patches will learn to segment images with the exact same contrast as the training MPRAGE images and will not generalize to test images of a different contrast. We use the approximate forward models of pulse sequences as described in Section 2.2 to generate synthetic images that appear to have been imaged by that pulse sequence.

Consider a pulse sequence , which could be MPRAGE, FLASH, SPGR, or T2-SPACE, for each of which we have established a three-parameter space. We uniformly sample the grid that is created for as described in Section 2.2 to get a candidate set . Next, we extract -sized NMR parameter patches from where the NMR patch at voxel is denoted by . We create an augmented patch by applying the test pulse sequence equation , with the sampled pulse sequence parameter set , to training NMR patches as shown in Eqn. 12:


The synthetic patch therefore has the anatomy from the training subject, but the image contrast of a potential test pulse sequence with as its parameters, as shown in Fig. 4. The synthetic patch is paired with the corresponding label patch extracted from to create the augmented pair that is introduced in the training. The CNN, therefore, learns weights such that it maps both and to the same label patch , in effect learning to be invariant to differences in image contrast. CNN weights are typically optimized by stochastic optimization algorithms that use mini-batches composed of training samples to update the gradients of the weights in a single iteration. In our training, we construct a mini-batch comprising four training samples: one sample of the original, unaugmented MPRAGE patch, and one each of synthetic FLASH/SPGR, synthetic T2-SPACE, and a synthetic MPRAGE patch. Each synthetic patch is generated with a parameter set () chosen from the 3D parameter space grid of its pulse sequence (

). Over a single epoch, we cover all parameter sets from the 3D parameter space grid of

to train the CNN with synthetic patches that are mapped to the same label patch forcing it to learn invariance across different . Invariance across pulse sequences is achieved by including patches from all four pulse sequences in each mini-batch. The training workflow is presented in Fig. 4. On the right side of Fig. 4 are possible test images that were acquired by different pulse sequences. On the left side of Fig. 4 are the synthetic training images generated by applying the approximate pulse sequence equations to training NMR maps shown on the top center. All synthetic images are matched with the same label map at the output of the network. The patches are shown as 2D slices for the purpose of illustration.

2.4 PSACNN Network Architecture

We use the U-Net (Ronneberger et al., 2015) as our preferred voxel-wise segmentation CNN architecture, but any sufficiently powerful fully convolutional architecture can be used. Graphical processing unit (GPU) memory constraints prevent us from using entire -sized images as input to the U-Net and so we use -sized patches as inputs. Our U-Net instance has five pooling layers and corresponding five upsampling layers. These parameters were constrained by the largest model size we could fit in the GPU memory. Prior to each pooling layer are two convolutional layers with filter size of

, ReLu activation, and batch normalization after each of them. The number of filters in the first convolutional layers is 32, and it doubles after each pooling layer. The last layer has a softmax activation with

outputs. The U-Net is trained to minimize a soft Dice-based loss averaged over the whole batch. The loss is shown in Eqn. 13, where denotes the voxels present in each label patch and of the training sample in a batch of samples. During training is a

-sized tensor with each voxel recording the softmax probability of it belonging to a particular label.

is a similar-sized one-hot encoding of the label present at each of the voxels. The 4D tensors are flattened prior to taking element-wise dot products and norms.


We use the Adam optimization algorithm to minimize the loss. Each epoch uses K augmented and unaugmented training pairs extracted from a training set of 16 subjects out of a dataset with 39 subjects that were acquired with 1.5 T Siemens Vision MPRAGE acquisitions (TR=9.7 ms, TE=4 ms TI=20 ms, voxel-size= mm) with expert manual labels done as per the protocol described by Buckner et al. (2004). We refer to this dataset as the Buckner dataset in the rest of the manuscript. After each training epoch, the trained model is a applied to a validation dataset that is generated from 3 of the remaining 23 (= 39 - 16) subjects. Training is stopped when the validation loss does not decrease any more, which is usually around 10 epochs. For prediction, we sample overlapping

-sized patches from the test image with a stride of

. We apply the trained network to these patches. The resultant soft predictions (probabilities) are averaged in the overlapping regions and the class with the maximum probability is chosen as the hard label at each voxel.

2.5 Synthesis of Training NMR Maps

As described in Section 2.3, PSACNN requires the training dataset to be the collection . But most training datasets only contain and do not acquire or estimate the NMR parameter maps . In this section, we will describe a procedure to synthesize maps from the available image set .

We use a separately acquired dataset of eight subjects, denoted as dataset , with multi-echo FLASH (MEF) images with four flip angles , , , and . Proton density () and maps were calculated by fitting the known imaging equation for FLASH to the acquired four images (Fischl et al., 2004). The values are estimated up to a constant scale. Note that this dataset is completely distinct from the PSACNN training dataset with no overlapping subjects. For each subject in , we first estimate the three mean tissue class intensities , , and . We then estimate the MEF parameters () using our approximation of the FLASH equation in Eqn. 8 and solving the equation system in Eqns. 35. Next, the voxel intensity for each voxel , using the approximate FLASH imaging equation in Eqn. 8 and the estimated gives us Eqn. 14:


The only unknown in Eqn. 14 is , which we can solve for. This forms for each subject in .

Next, we consider each of the PSACNN training images in . We use the approximate forward model of the sequence (typically MPRAGE) and the pulse sequence parameter estimation procedure in Section 2.2 to estimate imaging parameters for all PSACNN training images . We use the estimated as parameters in the equation and apply it to the NMR maps of dataset as shown in Eqn. 15, to generate synthetic images of subjects in dataset .


For each PSACNN training image , we have eight synthetic , of subjects in dataset with the same imaging parameter set and therefore, the same image contrast as . We pair up the synthetic images and proton density maps to create a training dataset to estimate proton density map for subject in the PSACNN training set. We extract -sized patches from the synthetic images and map them to patches from the proton density maps. A U-Net with specifications similar to that described in Section 2.3 is used to predict proton density patches from synthetic image patches, the only difference being the final layer that outputs -sized continuous valued output while minimizing the mean squared error loss. The trained U-Net is applied to patches extracted from to synthesize . The image contrast of the synthetic images from dataset is the same as by design, which makes the training and test intensity properties similar and the trained U-Net can be applied correctly to patches from . Two other similar image synthesis U-Nets are trained for synthesizing and from . Thus, to estimate for subject of the PSACNN training dataset, we need to train three separate U-Nets. In total, for PSACNN training subjects, we need to train U-Nets, which is computationally expensive, but a one-time operation to permanently generate the NMR maps that are necessary for PSACNN, from available images. It is possible to train only 3 U-Nets for all images from

as they have the exact same acquisition, however their estimated parameters are not exactly equal and have a small variance. Therefore, we chose to train 3 U-Nets per image to ensure subject-specific training. Our expectation is that generated NMR maps will be more accurate than those generated using a single U-Net per NMR parameter map for all subjects.

The training dataset (Buckner) cannot be shared due to Institutional Review Board requirements. The trained PSACNN network and the prediction script are freely available on the FreeSurfer development version repository on Github (

2.6 Application of PSACNN for 3 T MRI

Our training dataset of was obtained on a 1.5 T Siemens Vision scanner. The tissue NMR parameters, , , are different at 3 T field strength than 1.5 T. This might lead to the conclusion that it is not possible to create synthetic 3 T images during PSACNN augmentation as, in principle, we cannot apply 3 T pulse sequence imaging equations to 1.5 T NMR maps. Ideally, we would prefer to have a training dataset with maps at both 1.5 T and 3 T, which unfortunately is not available to us. However, our imaging equation approximations allow us to frame a 3 T imaging equation as a 1.5 T imaging equation with certain assumptions about the relationship between 1.5 T and 3 T NMR parameters. For instance, consider the FLASH approximate imaging equation in Eqn. 8. When estimating the imaging parameters of this equation, we construct an equation system (for 1.5 T in Eqn. 16, reframed as Eqn. 18), as described in Section 2.2. We assume that the 3 T FLASH sequence will have the same equation as the 1.5 T FLASH. This is a reasonable assumption because the sequence itself need not change with the field strength.


We can express the equation system for a 3 T pulse sequence in Eqn. 19, where is -sized matrix similar to in Eqn. 18, but with mean tissue NMR values at 3 T, which can be obtained from previous studies (Wansapura et al., 1999). It is possible to estimate a -sized matrix , where . is given by, . Now, given a 3 T image, we first calculate the mean tissue intensities . The values are mean tissue proton density values and do not depend on field strength. Therefore, we can write Eqn. 19, that is similar to Eqn. 18. Substituting in Eqn. 19, we get Eqn. 20 and reframe it as Eqn. 21.


Therefore, we can treat the parameters for 3 T (

) as linearly transformed (by

) parameters . In essence, we can thus treat 3 T images as if they have been acquired with a 1.5 T scanner (with 1.5 T NMR tissue values) but with a parameter set .

However, while applying a 3 T equation to 1.5 T NMR maps to generate a synthetic 3 T image, the above relationship of implies a stronger assumption that , where the

is a row vector with NMR values at any voxel

at 3 T and is a row vector of NMR values at 1.5 T for the same anatomy. will have to be estimated using a least squares approach using all the 1.5 T and 3 T values at all voxels, which are not available to us. This linear relationship between 1.5 T and 3 T NMR parameters is likely not theoretically valid, but nevertheless works well in practice as can be seen from empirical results presented in Section 3 and Figs. 2(a)–(c), where we can see the estimated parameters from a variety of 1.5 T and 3 T MPRAGE scans. These are restricted in a small enough range because the contrast changes for MPRAGE between 1.5 T and 3 T are not drastically different and a linear relationship between the two NMR vectors is likely a good enough approximation.

3 Experiments and Results

We evaluate the performance of PSACNN and compare it with unaugmented CNN (CNN), FreeSurfer segmentation (ASEG) (Fischl et al., 2002), SAMSEG (Puonti et al., 2016), and MALF (Wang et al., 2013). We divide the experiments into two halves. In the first half, described in Section 3.2, we evaluate segmentation accuracy as measured by Dice overlap on datasets with manual labels. In the second half, described in Section 3.3, we evaluate the segmentation consistency of all five algorithms. Segmentation consistency is evaluated by calculating the coefficient of variation of structure volumes for the same set of subjects that have been scanned at multiple sites/scanners or sequences.

3.1 Training Set and Pre-processing

As described in Section 2.4, we randomly chose a subset comprising 16 out of 39 subjects representing the whole age range from the Buckner dataset These subjects have 1.5 T Siemens Vision MPRAGE acquisitions and corresponding expert manual label images and were used as our training dataset for CNN, PSACNN, and as the atlas set for MALF. ASEG segmentation in FreeSurfer and SAMSEG are model-based methods that use a pre-built probabilistic atlas prior. The atlas priors for ASEG and SAMSEG are built from 39 and 20 subjects respectively, from the Buckner dataset.

For MALF, we use cross-correlation as the registration metric and the parameter choices specific to brain MRI segmentation and joint label fusion as suggested by the authors in (Wang et al., 2013).

For PSACNN, we use the same single trained network that is trained from real and augmented MPRAGE, FLASH, SPGR, and T2-SPACE patches of the 16-subject training data for all the experiments. We do not re-train it for a new test dataset or a new experiment.

All input images from the training and test datasets are brought into the FreeSurfer conformed space (resampled to isotropic mm) skullstripped and intensity inhomogeneity-corrected (Sled et al., 1998) by the pre-processing stream of FreeSurfer (Fischl et al., 2004). The intensities are scaled such that they lie in the interval.

For the unaugmented CNN specifically, we further standardize the input image intensities such that their white matter mean (as identified by fitting a Gaussian mixture model to the intensity histogram) is fixed to 0.8. The CNN results were significantly worse on multi-scanner data without this additional scaling. We also tested using piece-wise linear intensity standardization (Nyúl and Udupa, 1999) on input images to the unaugmented CNN but the results were better for white matter mean standardization. There is no such scaling required for the other methods including PSACNN.

3.2 Evaluation of Segmentation Accuracy

We evaluate all algorithms in three settings, where the test datasets have varying degrees of acquisition differences with the training Buckner dataset. Specifically, where the test dataset is acquired by:

  • Same scanner (Siemens Vision), same sequence (MPRAGE) as the training dataset

  • Different scanner (Siemens SONATA), same sequence (MPRAGE with different parameters) as the training set.

  • Different scanner (GE Signa), different sequence (SPGR) as the training dataset.

3.2.1 Same Scanner, Same Sequence Input

(a) MPRAGE (b) ASEG (c) MALF (d) PSACNN (e) Manual (f) SAMSEG (g) CNN

Figure 5: Input Buckner MPRAGE with segmentation results from ASEG, SAMSEG, MALF, CNN, and PSACNN, along with manual segmentations.

In this experiment we compare the performance of segmentation algorithms on test data with the exact same acquisition parameters as the training data. The test data consists of 20 subject MPRAGEs from the Buckner data that are independent of the 16 subjects used for training and 3 for validation of the CNN methods.

Figure 5 shows an example input test subject image, its manual segmentation, and segmentations of the various algorithms. All algorithms perform reasonably well on this dataset. We calculated the Dice coefficients of the labeled structures for all the algorithms (see Fig. 6). From the Dice coefficient boxplots (generated from 20 subjects) shown in Fig. 6 we observe that CNN (red) and PSACNN (blue) Dice overlap (ALL Dice

) is comparable to each other and is significantly better (paired t-test

) than ASEG (green), SAMSEG (yellow), and MALF (purple) for many of the subcortical structures, and especially for whole white matter (WM) and gray matter cortex (CT) labels. This suggests that large patch-based CNN methods do provide a more accurate segmentation than state-of-the-art approaches like MALF and SAMSEG for a majority of the labeled structures. But, since the training and test acquisition parameters are exactly same, there is no significant difference between CNN and PSACNN.

Figure 6: Dice coefficient boxplots of selected structures for all five methods on 20 subjects of the Buckner dataset. Acronyms: white matter (WM), cortex (CT), lateral ventricle (LV), cerebellar white matter (CWM), cerebellar cortex (CCT), thalamus (TH), caudate (CA), putamen (PU), pallidum (PA), hippocampus (HI), amygdala (AM), brain-stem (BS), overlap for all structures (ALL).

3.2.2 Different Scanner, Same Sequence Input

(a) MPRAGE (b) ASEG (c) MALF (d) PSACNN (e) Manual (f) SAMSEG (g) CNN

Figure 7: Input Siemens13 MPRAGE with segmentation results from ASEG, SAMSEG, MALF, CNN, and PSACNN, along with manual segmentations.

In this experiment we compare the accuracy of PSACNN against other methods on the Siemens13 dataset that comprises 13 subject MPRAGE scans acquired on a 1.5 T Siemens SONATA scanner with a similar pulse sequence as the training data. The Siemens13 dataset has expert manual segmentations generated with the same protocol as the training data (Buckner et al., 2004). Figure 7 shows the segmentation results for all the algorithms. The image contrast (see Fig. 7(a)) for the Siemens13 dataset is only slightly different from the training Buckner dataset. Therefore, all algorithms perform fairly well on this dataset. Figure 8 shows the Dice overlap boxplots calculated over 13 subjects. Similar to the previous experiment, the unaugmented CNN and PSACNN have a superior Dice overlap compared to other methods (statistically significantly better for five of the nine labels shown), but are comparable to each other as the test acquisition is not very different from the training acquisition.

Figure 8: Dice evaluations on Siemens13 dataset. For acronyms refer to Fig. 6 caption.

3.2.3 Different Scanner, Different Sequence Input

(a) SPGR (b) ASEG (c) MALF (d) PSACNN (e) Manual (f) SAMSEG (g) CNN

Figure 9: Input GE14 SPGR with segmentation results from ASEG, SAMSEG, MALF, CNN, and PSACNN, along with manual segmentations.

In this experiment we compare the accuracy of PSACNN against other methods on the GE14 dataset comprising 14 subjects scanned on a 1.5 T GE Signa scanner with the SPGR (spoiled gradient recalled) sequence (,  ms, , voxel size= mm). All the subjects have expert manual segmentations generated with the same protocol as the training data, but the manual labels are visibly different than training data, with a much thicker cortex (Fig. 9(e)). This is likely because the GE14 SPGR scans (Fig. 9(a)) present a noticeably different GM-WM image contrast than the MPRAGE training data. Consequently, as seen in Fig. 10, all methods show a reduced overlap, but the unaugmented CNN (red boxplot) has the worst performance of all as it is unable to generalize to an SPGR sequence (despite white matter peak intensity standardization). Some obvious CNN errors are for small subcortical structures such as hippocampus (yellow in Fig. 9(g)) labeled as thalamus (green label). PSACNN (ALL Dice=) on the other hand, is robust to the contrast change due to pulse sequence-based augmentation, and produces segmentations that are comparable to the state-of-the-art algorithms such as SAMSEG (ALL Dice=) and MALF (ALL Dice=) in accuracy, with 1–2 orders of magnitude lower processing time.

Figure 10: Dice evaluations on GE14 dataset. For acronyms refer to Fig. 6 caption.

3.3 Evaluation of Segmentation Consistency

The experiments described in Section 3.2 demonstrate that PSACNN is a highly accurate segmentation algorithm and better or comparable to the state-of-the-art segmentation algorithms. In this section, we evaluate the segmentation consistency of all algorithms on three distinct multi-scanner/multi-sequence datasets.

3.3.1 Four Scanner Data

ASEG SAMSEG MALF CNN PSACNN Mean  (Std) Mean  (Std.) Mean  (Std) Mean  (Std.) Mean  (Std.) WM 11.02 (7.19)   2.46 (2.14)   3.21 (1.23) 17.27 (3.83) 1.99 (1.19) CT 5.10 (3.75)   3.16 (1.30)   2.13 (0.73) 7.21 (2.95) 1.81 (1.30) TH 3.72 (1.49)   2.13 (0.62)   1.12 (0.55) 6.08 (7.13) 1.05 (0.62) CA 3.76 (2.10)   1.04 (1.47)   2.42 (1.50) 18.62 (11.05) 2.15 (1.04) PU 7.42 (2.03)   2.99 (0.86)   2.03 (0.95) 10.70 (4.16) 1.05 (0.86) PA 5.22 (3.32)   8.52 (4.45)   3.83 (2.44) 5.04 (3.37) 3.11 (4.45) HI 3.18 (1.23)   2.45 (0.95)   1.46 (0.88) 16.28 (10.21) 1.47 (0.54) AM 5.27 (2.20)   2.09 (1.09)   1.99 (1.19) 14.37 (6.75) 2.16 (0.98) LV 5.61 (6.53)   3.12 (5.90)   4.72 (6.26) 10.40 (6.98) 4.58 (6.72)

Bold results show the minimum coefficient of variation.

Table 1: Mean (and std. dev.) of the coefficient of variation (in ) of structure volumes over all four datasets with all algorithms.

In this experiment we tested the consistency of the segmentation produced by the various methods on four datasets acquired from 13 subjects;

  • MEF: 3D Multi-echo FLASH, scanned on Siemens Trio 3 T scanner, voxel-size mm, sagittal acquisition, =20 ms, =1.8+1.82n ms, flip angle 30.

  • TRIO: 3D MPRAGE, scanned on Siemens Trio 3 T scanner, voxel-size mm, sagittal acquisition, =2730 ms, =3.44 ms, =1000 ms, flip angle 7.

  • GE: 3D MPRAGE, scanned on GE Signa 1.5 T scanner, voxel-size mm, sagittal acquisition, =2730 ms, =3.44 ms, =1000 ms, flip angle 7.

  • Siemens: 3D MPRAGE, scanned on Siemens Sonata 1.5 T scanner, voxel-size mm, sagittal acquisition, =2730 ms, =3.44 ms, =1000 ms, flip angle 7.

For each structure

in each subject, we calculated the standard deviation (

) of the segmented label volume over all datasets. We divided this by the mean of the structure volume () for that subject, over all the datasets. This gives us , the coefficient of variation for structure for each subject. In Table 1, we calculated the mean coefficient of variation (in ) over the 13 subjects in the Four Scanner Dataset. The lower the coefficient of variation, the more consistent is the segmentation algorithm in predicting the structure volume across multi-scanner datasets. From Table 1, we can observe that PSACNN has the lowest coefficient of variation in five of the nine structures and is the second lowest in three others. In the structures where PSACNN has the lowest coefficient of variation, it is statistically significantly lower (, Wilcoxon signed rank test) than the third placed method, but not the second-placed method. SAMSEG has the lowest coefficient of variation for CA (caudate) and LV (lateral ventricle), whereas MALF is the lowest for HI (hippocampus) and AM (amygdala).

Next, we calculated the signed relative difference of structure volume estimated for a particular dataset with respect to the mean structure volume across all four datasets and show them as boxplots in Fig. 11 (MEF in red, TRIO in blue, GE in green, Siemens in purple.). Ideally, a segmentation algorithm would have zero bias and would produce the exact same segmentation for different acquisitions of the same subject. But the lower the bias, the more consistent is the algorithm across different acquisitions. Figures 11(a)–(i) show the relative difference for each of the nine structures for SAMSEG, MALF, PSACNN. There is a small, statistically significant bias (difference from zero, thick black line) for most structures and algorithms, which suggests that there is a scope for further improvement for all algorithms. The relative differences also give us an insight into which datasets show the most difference from the mean, a fact that is lost when the coefficient of variation is calculated.

PSACNN results for a majority of the structures (Fig. 11(a), Fig. 11(c), for example) show lower median bias (

), lower standard deviation, and fewer outliers. MALF has lower bias when segmenting HI (Fig.

11(g)), AM (Fig.11(h)), whereas SAMSEG has better consistency when segmenting the LV structure (Fig.11(i)). MEF uses a different pulse sequence than all the other datasets that have MPRAGE sequences. For most structures and algorithms shown in Fig. 11, we observe that MEF shows the maximum relative volume difference from the mean (red boxplot in Figs. 11(a),(b) for example). The Siemens acquisitions(TRIO in blue and Siemens in purple) show similar consistency to each other despite the differences in field strength (3 T vs 1.5 T).

(a) (b) (c) (d) (e) (f) (g) (h) (i)

Figure 11: Signed relative difference of volumes of (a) WM, (b) CT, (c) TH, (d) CA, (e) PU, (f) PA, (g) HI, (h) AM, (i) LV from mean structure volume over all four datasets. Left group of boxplots shows SAMSEG, middle shows MALF, right-most shows PSACNN. The colors denote different datasets, MEF (red), TRIO (blue), GE (green), Siemens (purple)

3.3.2 Multi-TI Three Scanner Data

In this experiment we tested the segmentation consistency of PSACNN on a more extensive dataset collected at the University of Oslo. The dataset consists of 24 subjects, each scanned on three Siemens scanners, Avanto, Skyra, and Prisma, with the MPRAGE sequence and two inversion times ()–850 ms and 1000 ms. The dataset descriptions in detail are as follows:

  • A850: 3D MPRAGE, scanned on Siemens Avanto 1.5 T scanner, voxel-size mm, sagittal acquisition, =2400 ms, =3.61 ms, =850 ms, flip angle 8.

  • A1000: 3D MPRAGE, scanned on Siemens Avanto 1.5 T scanner, voxel-size mm, sagittal acquisition, =2400 ms, =3.61 ms, =1000 ms, flip angle 8.

  • S850: 3D MPRAGE, scanned on Siemens Skyra 3 T scanner, voxel-size mm, sagittal acquisition, =2300 ms, =2.98 ms, =850 ms, flip angle 8.

  • S1000: 3D MPRAGE, scanned on Siemens Skyra 3 T scanner, voxel-size mm, sagittal acquisition, =2400 ms, =2.98 ms, =1000 ms, flip angle 8.

  • P850: 3D MPRAGE, scanned on Siemens Prisma 3 T scanner, voxel-size  mm, sagittal acquisition, =2400 ms, =2.22 ms, =850 ms, flip angle 8

  • P1000: 3D MPRAGE, scanned on Siemens Prisma 3 T scanner, voxel-size mm, sagittal acquisition, =2400 ms, =2.22 ms, =1000 ms, flip angle 8

We calculate the coefficient of variation of structure volumes across these six datasets similar to our experiment in Section 3.3.1, and are shown in Table 2.

ASEG SAMSEG MALF CNN PSACNN Mean  (Std) Mean  (Std.) Mean  (Std) Mean  (Std.) Mean  (Std.) WM   2.82 (0.71)   3.06 (0.92)   2.30 (0.80)   6.01 (0.98) 1.91 (0.70) CT   2.68 (0.67) 1.42 (0.77)   1.80 (0.82)   4.57 (1.37) 1.85 (0.70) TH   3.70 (1.32)   1.93 (0.72)   1.37 (0.59)   3.38 (1.35) 0.85 (0.60) CA   2.02 (0.60)   2.38 (1.31)   1.80 (1.05)   1.68 (1.33) 1.68 (0.85) PU   2.62 (1.22)   2.88 (1.78)   1.24 (0.74)   3.08 (1.13) 1.27 (0.86) PA   4.67 (2.50)   2.10 (2.29)   2.65 (1.41)   5.69 (1.41) 1.17 (0.56) HI   2.48 (0.78)   2.23 (3.54)   1.99 (1.60)   6.31 (2.08) 1.00 (0.41) AM   6.50 (1.79)   2.44 (2.42)   1.93 (1.28)   3.87 (1.28) 1.54 (0.56) LV   2.00 (0.82)   2.33 (1.36)   2.66 (2.38)   5.94 (1.87) 2.21 (0.86)

Bold results show the minimum coefficient of variation. * indicates significantly lower (, using the paired Wilcoxon signed rank test) coefficient of variation than the next best method.

Table 2: Mean (and std. dev.) of the coefficient of variation (in ) of structure volumes over all six datasets with all algorithms.

From Table 2 we can observe that PSACNN structure volumes have the lowest coefficient of variation for six of the nine structures. For WM, TH, PA, and HI structures PSACNN coefficient of variation is significantly lower than the next best method ( using the paired Wilcoxon signed rank test). SAMSEG is significantly lower than MALF for the CT structure. As compared to Table 1 the coefficients of variation are lower as the input MPRAGE sequences are the preferred input for all of these methods (except SAMSEG).

In Figs. 12(a)–(i) we show the mean (over 24 subjects) signed relative volume difference (in ) with the mean structure volume averaged over all six acquisition for nine structures. We focus on SAMSEG, MALF, and PSACNN for this analysis. The boxplot colors denote different datasets, A850 (red), P850 (blue), S850 (green), A1000 (purple), P1000 (orange), S1000 (yellow). There is a small, statistically significant bias (difference from 0, thick black line) for most structures and algorithms. The bias is much smaller than that observed in Fig. 11 for the Four Scanner Dataset. PSACNN results for a majority of the structures (for example, Figs. 12(a), (c), (d), (f)) show lower median (), lower standard deviation, and fewer outliers. MALF has lower bias when segmenting, CT (12(b)), whereas SAMSEG has better consistency when segmenting the LV structure (12(i)).

From Fig. 12(a), which shows the relative volume difference for WM–which can act as an indicator of the gray-white contrast present in the image–we can see that for all methods, images from same scanner, regardless of TI, show similar relative volume differences. The WM relative volume difference for A850 is most similar to A1000, and similarly for the Prisma and Skyra datasets. For PSACNN, this pattern exists for other structures such as CA, PA, PU, HI, AM, LV as well. This suggests that change in the image contrast due to change in the inversion time from 850 ms to 1000 ms, has a smaller effect on PSACNN segmentation consistency than a change of scanners (and field strengths).

(a) (b) (c) (d) (e) (f) (g) (h) (i)

Figure 12: Signed relative difference of volumes of (a) WM, (b) CT, (c) TH, (d) CA, (e) PU, (f) PA, (g) HI, (h) AM, (i) LV from mean structure volume over all four datasets. In each figure, left group of boxplots shows SAMSEG, middle shows MALF, right-most shows PSACNN. The colors denote different datasets, A850 (red), P850 (blue), S850 (green), A1000 (purple), P1000 (orange), S1000 (yellow).

3.3.3 MPRAGE and T2-SPACE Segmentation


Figure 13: Input (a) MPRAGE and (e) T2-SPACE images with segmentations by SAMSEG (b) and (f), MALF (c) and (g), and PSACNN (d) and (h)

In this experiment we use the same trained PSACNN network to segment co-registered MPRAGE and T2-SPACE acquisitions for 10 subjects that were acquired in the same imaging session. The acquisition details for both the sequences are as follows:

  • MPRAGE: 3D MPRAGE, scanned on Siemens Prisma 3 T scanner, voxel-size mm, sagittal acquisition, =2530 ms, =1.69 ms, =1250 ms, flip angle 7.

  • T2-SPACE: 3D T2-SPACE, scanned on Siemens Prisma 3 T scanner, voxel-size mm, sagittal acquisition, =3200 ms, =564 ms, flip angle 120.

Figures 13(a)–(i) show the input MPRAGE and T2-SPACE images with their corresponding segmentations by SAMSEG, MALF, and PSACNN. ASEG and CNN are not built or trained for non--w images so we did not use them for comparison. We calculated the absolute relative volume difference of structures segmented for the T2-SPACE dataset with the same on the MPRAGE dataset. Table 3 shows the absolute relative volume difference for a subset of the structures. PSACNN shows the minimum absolute relative volume difference for six of the nine structures, with significantly better performance for subcortical structures including TH, CA, PU, and HI. SAMSEG has significantly lower volume difference for PA and LV. All algorithms shows a large volume difference for lateral ventricles (LV). The LV has a complex geometry with a long boundary and a segmentation one-two voxels inside or outside the LV of the MPRAGE image can cause a large relative volume difference. Synthetic T2-SPACE images generated in the PSACNN augmentation can match the overall image contrast with the real T2-SPACE images, however we believe the intensity gradient from white matter (dark) to CSF (very bright) is not accurately matched to the intensity gradient in the real, test T2-SPACE images, possibly due to resolution differences in the training and test data. This points to avenues of potential improvement in PSACNN augmentation. MALF (Fig. 13(c) and Fig. 13(g)) shows a diminished performance despite using a cross-modal registration metric like cross-correlation in the ANTS registration between the -w atlas images and the -w target images.

Since these two acquisitions were co-registered, we can calculate the Dice overlap between MPRAGE and T2-SPACE labeled structures, as shown in Table 4. As we can observe, PSACNN shows a significantly higher overlap than the next best method on six of the nine structures, and tying with SAMSEG for thalamus (TH). SAMSEG shows a significantly higher overlap for PA and LV.

SAMSEG MALF PSACNN Mean  (Std) Mean  (Std.) Mean  (Std) WM   5.20 (2.38)   5.01 (2.74) 3.74 (1.30) CT   8.22 (1.04)   2.98 (2.16) 2.95 (1.42) TH   7.71 (2.72) 13.60 (6.13) 4.22 (2.17) CA   9.15 (2.66) 34.51 (13.23) 3.98 (2.72) PU   5.31 (1.99) 10.60 (7.92) 2.48 (1.57) PA 2.85 (1.70) 42.36 (15.01) 13.35 (10.12) HI 14.17 (5.82) 30.12 (12.14) 3.18 (2.05) AM 12.46 (4.50) 10.49 (4.48) 10.69 (4.42) LV 16.44 (5.93) 41.92 (47.89) 25.83 (7.13)

Bold results show the minimum absolute relative difference. * indicates significantly lower (, using the paired Wilcoxon signed rank test) absolute relative volume difference than the next best method.

Table 3: Mean (and std. dev.) of the absolute relative volume difference (in ) of structure volumes for T2-SPACE dataset with respect to MPRAGE dataset.

SAMSEG MALF PSACNN Mean  (Std) Mean  (Std.) Mean  (Std) WM   0.91 (0.01)   0.57 (0.01) 0.94 (0.00) CT   0.85 (0.01)   0.59 (0.00) 0.89 (0.01) TH   0.92 (0.01)   0.80 (0.03) 0.92 (0.01) CA   0.88 (0.01)   0.65 (0.04) 0.90 (0.00) PU   0.88 (0.02)   0.71 (0.05) 0.91 (0.02) PA 0.89 (0.02)   0.71 (0.03) 0.81 (0.05) HI   0.84 (0.04)   0.60 (0.03) 0.88 (0.02) AM   0.83 (0.03)   0.65 (0.06) 0.88 (0.01) LV 0.90 (0.03)   0.63 (0.10) 0.87 (0.03)

Bold results show the maximum Dice coefficient. * indicates significantly higher (, using the paired Wilcoxon signed rank test) Dice than the next best method.

Table 4: Mean (and std. dev.) of the Dice coefficient for structures of T2-SPACE dataset with respect to MPRAGE dataset.

4 Conclusion and Discussion

We have described PSACNN, which uses a new strategy for training CNNs for the purpose of consistent whole brain segmentation across multi-scanner and multi-sequence MRI data. PSACNN shows significant improvement over state-of-the-art segmentation algorithms in terms accuracy based on our experiments on multiple manually labeled datasets acquired with different acquisition settings as shown in Section 3.2.

For modern large, multi-scanner datasets, it is imperative for segmentation algorithms to provide consistent outputs on a variety of differing acquisitions. This is especially important when imaging is used to quantify the efficacy of a drug in Phase III of its clinical trials. Phase III trials are required to be carried out in hundreds of centers with access to a large subject population across tens of countries (Murphy and Koh, 2010), where accurate and consistent segmentation across sites is critical. The pulse sequence parameter estimation-based augmentation strategy in training PSACNN allows us to train the CNN for a wide range of input pulse sequence parameters, leading to a CNN that is robust to input variations. In Section 3.3, we compared the variation in structure volumes obtained via five segmentation algorithms for a variety of -weighted (and -weighted) inputs and PSACNN showed the highest consistency among all for many of the structures, with many structures showing less than

coefficient of variation. Such consistency enables detection of subtle changes in structure volumes in cross-sectional (healthy vs. diseased) and longitudinal studies, where variation can occur due to scanner and protocol upgrades.

PSACNN training includes augmented, synthetic MRI images generated by approximate forward models of the pulse sequences. Thus, as long as we can model the pulse sequence imaging equation, we can include multiple pulse sequence versions of the same anatomy in the augmentation and rely on the representative capacity of the underlying CNN architecture (U-Net in our case) to learn to map all of them to the same label map in the training data. We formulated approximate forward models of pulse sequences to train a network that can simultaneously segment MPRAGE, FLASH, SPGR, and T2-SPACE images consistently. Moreover, the NMR maps used in PSACNN need not perfectly match maps acquired using relaxometry sequences. They merely act as parameters that when used in our approximate imaging equation accurately generate the observed image intensities.

Being a CNN-based method, PSACNN is 2–3 orders of magnitude faster than current state-of-the-art methods such as MALF and SAMSEG. On a system with an NVidia GPU (we tested Tesla K40, Quadro P6000, P100, V100) PSACNN completes a segmentation for a single volume within 45 seconds. On a single thread CPU, it takes about 5 minutes. SAMSEG and MALF take 15–30 minutes if they are heavily multi-threaded. On a single thread SAMSEG takes about 60 minutes, whereas MALF can take up to 20 hours. Pre-processing for PSACNN (inhomogeneity correction, skullstripping) can take up to 10—15 minutes, whereas SAMSEG does not need any pre-processing. With availability of training NMR maps of whole head images, PSACNN will be able to drop the time consuming skullstripping step. A fast, consistent segmentation method like PSACNN considerably speeds up neuroimaging pipelines that traditionally take hours to completely process a single subject.

PSACNN in its present form has a number of limitations we intend to work on in the future. The current augmentation does not include matching resolutions of the training and test data. This is the likely reason for overestimation of lateral ventricles in the T2-SPACE segmentation, as the training data and the synthetic T2-SPACE images generated from them have a slightly lower resolution than the test T2-SPACE data. Acquiring higher resolution training dataset with relaxometry sequences for ground truth , , and values will also help. Our pulse sequence models are approximate and can be robustly estimated. However, we do not account for any errors in the pulse sequence parameter estimation and the average NMR parameters. A probabilistic model that models these uncertainties to further make the estimation robust is a future goal. The pulse sequence parameter estimation also assumes the same set of parameters across the whole brain. It does not account for local differences, for example variation in the flip angle with B1 inhomogeneities.

PSACNN currently uses -sized patches to learn and predict the segmentations. This is a forced choice due to limitations in the GPU memory. However, using entire images would require significantly more training data than labeled subjects. It is unknown how such a network would handle unseen anatomy. Further, it will be interesting to extend PSACNN to handle pathological cases involving tumors and lesions.

In conclusion, we have described PSACNN, a fast, pulse sequence resilient whole brain segmentation approach. The code will be made available as a part of the FreeSurfer development version repository ( Further validation and testing will need to be carried out before its release.


This research was carried out in whole or in part at the Athinoula A. Martinos Center for Biomedical Imaging at the Massachusetts General Hospital, using resources provided by NIH grant 5U24NS100591-02 and the Center for Functional Neuroimaging Technologies, P41EB015896, a P41 Biotechnology Resource Grant supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health, and 1R01EB023281-01 FreeSurfer Development, Maintenance, and Hardening, and in part through the BRAIN Initiative Cell Census Network grant U01MH117023. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765148, Danish Council for Independent Research under grant number DFF-6111-00291, and NIH: R21AG050122 National Institute on Aging.


  • Asman and Landman (2012) Asman, A.J., Landman, B.A., 2012. Formulating spatially varying performance in the statistical fusion framework. IEEE TMI 31, 1326–1336.
  • Bazin and Pham (2008) Bazin, P.L., Pham, D.L., 2008. Homeomorphic brain image segmentation with topological and statistical atlases. Medical Image Analysis 12, 616 – 625. URL:, doi: special issue on the 10th international conference on medical imaging and computer assisted intervention - MICCAI 2007.
  • Buckner et al. (2004) Buckner, R., Head, D., et al., 2004. A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. NeuroImage 23, 724 – 738. doi:
  • Chavhan et al. (2009) Chavhan, G.B., Babyn, P.S., Thomas, B., Shroff, M.M., Haacke, E.M., 2009. Principles, techniques, and applications of T2*-based MR imaging and its special applications. RadioGraphics 29, 1433–1449. doi:10.1148/rg.295095034. pMID: 19755604.
  • Coupé et al. (2011) Coupé, P., Manjón, J.V., Fonov, V., Pruessner, J., Robles, M., Collins, D.L., 2011. Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation. NeuroImage 54, 940 – 954. doi:
  • Davatzikos et al. (2005) Davatzikos, C., Shen, D., Gur, R., et al, 2005. Whole-brain morphometric study of schizophrenia revealing a spatially complex set of focal abnormalities. Archives of General Psychiatry 62, 1218–1227. doi:10.1001/archpsyc.62.11.1218.
  • Deichmann et al. (2000) Deichmann, R., Good, C.D., Josephs, O., Ashburner, J., Turner, R., 2000. Optimization of 3-D MP-RAGE sequences for structural brain imaging. NeuroImage 12, 112–127.
  • Desbordes et al. (2012) Desbordes, G., Negi, L., Pace, T., Wallace, B., Raison, C., Schwartz, E., 2012. Effects of mindful-attention and compassion meditation training on amygdala response to emotional stimuli in an ordinary, non-meditative state. Frontiers in Human Neuroscience 6, 292. doi:10.3389/fnhum.2012.00292.
  • Fischl et al. (2002) Fischl, B., Salat, D.H., et al., 2002. Whole brain segmentation: Automated labeling of neuroanatomical structures in the human brain. Neuron 33, 341 – 355. doi:
  • Fischl et al. (2004) Fischl, B., Salat, D.H., van der Kouwe, A.J., et al., 2004. Sequence-independent segmentation of magnetic resonance images. NeuroImage 23, S69–S84.
  • Focke et al. (2011) Focke, N., Helms, G., Kaspar, S., Diederich, C., Tóth, V., Dechent, P., Mohr, A., Paulus, W., 2011. Multi-site voxel-based morphometry — not quite there yet. NeuroImage 56, 1164 – 1170.
  • Frisoni et al. (2009) Frisoni, G.B., Prestia, A., Zanetti, O., Galluzzi, S., Romano, M., Cotelli, M., Gennarelli, M., Binetti, G., Bocchio, L., Paghera, B., Amicucci, G., Bonetti, M., Benussi, L., Ghidoni, R., Geroldi, C., 2009. Markers of alzheimer’s disease in a population attending a memory clinic. Alzheimer’s and Dementia 5, 307 – 317. doi:
  • Glover (2004) Glover, G.H., 2004. Handbook of MRI pulse sequences. volume 18. Elsevier Academic Press.
  • Han et al. (2006) Han, X., Jovicich, J., Salat, D., et al., 2006. Reliability of MRI-derived measurements of human cerebral cortical thickness: The effects of field strength, scanner upgrade and manufacturer. NeuroImage 32, 180 – 194. doi:
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016.

    Deep residual learning for image recognition, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Heckemann et al. (2006) Heckemann, R.A., Hajnal, J.V., Aljabar, P., Rueckert, D., Hammers, A., 2006. Automatic anatomical brain mri segmentation combining label propagation and decision fusion. NeuroImage 33, 115 – 126. doi:
  • Jack et al. (2011) Jack, C.R., Barkhof, F., Bernstein, M.A., Cantillon, M., Cole, P.E., DeCarli, C., Dubois, B., Duchesne, S., Fox, N.C., Frisoni, G.B., Hampel, H., Hill, D.L., Johnson, K., Mangin, J.F., Scheltens, P., Schwarz, A.J., Sperling, R., Suhy, J., Thompson, P.M., Weiner, M., Foster, N.L., 2011. Steps to standardization and validation of hippocampal volumetry as a biomarker in clinical trials and diagnostic criterion for alzheimer’s disease. Alzheimer’s and Dementia 7, 474 – 485.e4. doi:
  • Jog et al. (2015) Jog, A., Carass, A., Roy, S., Pham, D.L., Prince, J.L., 2015. MR image synthesis by contrast learning on neighborhood ensembles. MedIA 24, 63–76.
  • Jog and Fischl (2018) Jog, A., Fischl, B., 2018. Pulse sequence resilient fast brain segmentation, in: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2018 - 21st International Conference, Proceedings, Springer Verlag.
  • Jovicich and et al. (2013) Jovicich, J., et al., 2013. Brain morphometry reproducibility in multi-center 3T MRI studies: A comparison of cross-sectional and longitudinal segmentations. NeuroImage 83, 472 – 484. doi:
  • Karani et al. (2018) Karani, N., Chaitanya, K., Baumgartner, C., Konukoglu, E., 2018. A lifelong learning approach to brain MR segmentation across scanners and protocols. arXiv preprint arXiv:1805.10170 .
  • Kostro et al. (2014) Kostro, D., Abdulkadir, A., Durr, A., Roos, R., Leavitt, B.R., Johnson, H., Cash, D., Tabrizi, S.J., Scahill, R.I., Ronneberger, O., Klöppel, S., 2014. Correction of inter-scanner and within-subject variance in structural mri based automated diagnosing. NeuroImage 98, 405 – 415.
  • van der Kouwe et al. (2008) van der Kouwe, A.J., Benner, T., Salat, D.H., Fischl, B., 2008. Brain morphometry with multiecho mprage. NeuroImage 40, 559 – 569. URL:, doi:
  • Landman and Warfield (2012) Landman, B.A., Warfield, S.K., 2012. MICCAI 2012 workshop on multi-atlas labeling, in: Medical Image Computing and Computer Assisted Intervention Conference 2012: MICCAI 2012 Grand Challenge and Workshop on Multi-Atlas Labeling Challenge Results.
  • Li et al. (2017) Li, W., Wang, G., Fidon, L., Ourselin, S., Cardoso, M.J., Vercauteren, T., 2017. On the compactness, efficiency, and representation of 3D convolutional networks: Brain parcellation as a pretext task, in: IPMI.
  • Madabhushi and Udupa (2006) Madabhushi, A., Udupa, J.K., 2006. New methods of MR image intensity standardization via generalized scale. Med. Phys. 33, 3426–3434.
  • Mattsson et al. (2014) Mattsson, N., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Tosun, D., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Insel, P.S., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Simonson, A., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Jack, Jr, C.R., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Beckett, L.A., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Donohue, M., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Jagust, W., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Schuff, N., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, Weiner, M.W., on behalf of the Alzheimer’s Disease Neuroimaging Initiative, 2014. Association of brain amyloid-β with cerebral perfusion and structure in alzheimer’s disease and mild cognitive impairment. Brain 137, 1550–1561. doi:10.1093/brain/awu043.
  • Mueller et al. (2005) Mueller, S.G., Weiner, M.W., Thal, L.J., Petersen, R.C., Jack, C., Jagust, W., Trojanowski, J.Q., Toga, A.W., Beckett, L., 2005. The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin. N. Am. 15, 869–877.
  • Mugler (2014) Mugler, J.P., 2014. Optimized three-dimensional fast-spin-echo MRI. Journal of Magnetic Resonance Imaging 39, 745–767.
  • Mugler and Brookeman (1990) Mugler, J.P., Brookeman, J.R., 1990. Three-dimensional magnetization-prepared rapid gradient-echo imaging (3D MPRAGE). Mag. Reson. Med. 15, 152–157.
  • Murphy and Koh (2010) Murphy, P., Koh, D.M., 2010. Imaging in clinical trials. Cancer Imaging 10, S74.
  • Nyúl and Udupa (1999) Nyúl, L.G., Udupa, J.K., 1999. On standardizing the MR image intensity scale. Mag. Reson. Med. 42, 1072–1081.
  • Patenaude et al. (2011) Patenaude, B., Smith, S.M., Kennedy, D.N., Jenkinson, M., 2011. A Bayesian model of shape and appearance for subcortical brain segmentation. NeuroImage 56, 907 – 922. doi:
  • Pizer et al. (2003) Pizer, S.M., Fletcher, P.T., Joshi, S., Thall, A., Chen, J.Z., Fridman, Y., Fritsch, D.S., Gash, A.G., Glotzer, J.M., Jiroutek, M.R., Lu, C., Muller, K.E., Tracton, G., Yushkevich, P., Chaney, E.L., 2003. Deformable m-reps for 3d medical image segmentation. Int. J. Comput. Vision 55, 85–106. doi:10.1023/A:1026313132218.
  • Pohl et al. (2006) Pohl, K.M., Fisher, J., Grimson, W.E.L., Kikinis, R., Wells, W.M., 2006. A Bayesian model for joint segmentation and registration. NeuroImage 31, 228 – 239. doi:
  • Puonti et al. (2016) Puonti, O., Iglesias, J.E., Leemput, K.V., 2016. Fast and sequence-adaptive whole-brain segmentation using parametric Bayesian modeling. NeuroImage 143, 235 – 249. doi:
  • Resnick et al. (2000) Resnick, S.M., Goldszal, A.F., Davatzikos, C., Golski, S., Kraut, M.A., Metter, E.J., Bryan, R., Zonderman, A., 2000. One-year Age Changes in MRI Brain Volumes in Older Adults. Cerebral Cortex 10, 464–472.
  • Rohlfing et al. (2004) Rohlfing, T., Brandt, R., Menzel, R., Maurer, C.R., 2004. Evaluation of atlas selection strategies for atlas-based image segmentation with application to confocal microscopy images of bee brains. NeuroImage 21, 1428 – 1442. doi:
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: MICCAI 2015, pp. 234–241.
  • Roy et al. (2018) Roy, A.G., Conjeti, S., Navab, N., Wachinger, C., 2018. Quicknat: Segmenting MRI neuroanatomy in 20 seconds. CoRR abs/1801.04161. arXiv:1801.04161.
  • Roy et al. (2013) Roy, S., Carass, A., Prince, J.L., 2013. Magnetic resonance image example-based contrast synthesis. Medical Imaging, IEEE Transactions on 32, 2348--2363.
  • Sabuncu et al. (2010) Sabuncu, M.R., Yeo, B.T.T., Leemput, K.V., Fischl, B., Golland, P., 2010. A generative model for image segmentation based on label fusion. IEEE Transactions on Medical Imaging 29, 1714--1729. doi:10.1109/TMI.2010.2050897.
  • Shiee and et al. (2010) Shiee, N., et al., 2010. A topology-preserving approach to the segmentation of brain images with Multiple Sclerosis lesions. NeuroImage 49, 1524--1535.
  • Sled et al. (1998) Sled, J.G., Zijdenbos, A.P., Evans, A.C., 1998. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imag. 17, 87--97.
  • Thompson et al. (2014) Thompson, P.M., Stein, J.L., et al., the Alzheimer’s Disease Neuroimaging Initiative, EPIGEN Consortium, IMAGEN Consortium, Saguenay Youth Study (SYS) Group, 2014. The ENIGMA consortium: large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior 8, 153--182. doi:10.1007/s11682-013-9269-5.
  • Wachinger et al. (2017) Wachinger, C., Reuter, M., Klein, T., 2017. DeepNAT: Deep convolutional neural network for segmenting neuroanatomy. NeuroImage doi:
  • Wang et al. (2013) Wang, H., Suh, J.W., Das, S.R., Pluta, J.B., Craige, C., Yushkevich, P.A., 2013. Multi-atlas segmentation with joint label fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 611--623. doi:10.1109/TPAMI.2012.143.
  • Wang et al. (2014) Wang, J., He, L., Zheng, H., Lu, Z.L., 2014. Optimizing the magnetization-prepared rapid gradient-echo (MP-RAGE) sequence. PloS one 9, e96899.
  • Wansapura et al. (1999) Wansapura, J.P., Holland, S.K., Dunn, R.S., Ball, W.S., 1999. NMR relaxation times in the human brain at 3.0 Tesla. Jrnl of Mag. Reson. Imaging 9, 531--538.
  • Wu et al. (2014) Wu, G., Wang, Q., Zhang, D., Nie, F., Huang, H., Shen, D., 2014. A generative probability model of joint label fusion for multi-atlas based brain segmentation. Medical Image Analysis 18, 881 -- 890. doi: sparse Methods for Signal Reconstruction and Medical Image Analysis.
  • Zikic et al. (2014) Zikic, D., Glocker, B., Criminisi, A., 2014. Encoding atlases by randomized classification forests for efficient multi-atlas label propagation. Medical Image Analysis 18, 1262 -- 1273. doi: special Issue on the 2013 Conference on Medical Image Computing and Computer Assisted Intervention.